INITIAL TEST-RETEST RELIABILITY AND LEARNING EFFECTS
Miro is designed to allow longitudinal assessment of changes in participants' cognitive and neurological function. The test-retest reliability of Miro measurements and the longitudinal performance of Miro V3 are being assessed in Miro-sponsored studies conducted at contract research organizations. To date, 29 participants have completed three time points. These participants are healthy, with no reported cognitive deficits, and have scored in the healthy normal range on the TICS (Telephone Interview for Cognitive Status).
Miro Versions
v3.1.2.6 MiroMind
v2.1.3.0 On-device speech recognition algorithms
v2.1.2.0 Natural language processing algorithms
v2.0 Machine learning algorithms
v3.0 Reference data set
Demographics
DIAGNOSIS    % F/M      MEAN AGE (RANGE)    TOTAL (N=29)
Normal       59 / 41    69.0 (19-82)        29
(% F/M = percent female / percent male)
Methods
Assessment. Participants are enrolled at three contract research organizations and evaluated using the Miro platform at three sequential assessments at one-week intervals. The primary study objective is to determine the stability and reliability of Miro testing over time. Prospective participants are screened with the TICS (Telephone Interview for Cognitive Status) to verify their status as cognitively normal. Participants are excluded from the primary study if they have diagnoses or evidence of neurological conditions that could affect neuropsychological test performance, or if they are using medications that can affect test performance.

Missing data. Only participants who have completed a course of three assessments on the Miro platform
are included in these test-retest reliability and trend analyses. When a particular Miro score is missing for
one or more time points, that participant is excluded from the analyses of the reliability and trends for
that score.
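As a concrete illustration, this per-score complete-case filter might look like the following sketch, assuming a long-format table with hypothetical columns participant_id, visit, and one column per Miro score (not Miro's production code):

```python
import pandas as pd

# Hypothetical long-format data: one row per participant per visit, with
# columns "participant_id", "visit" (1, 2, 3), and one column per Miro score.
df = pd.read_csv("miro_scores.csv")  # file name is illustrative

def complete_cases(scores: pd.DataFrame, score: str, n_visits: int = 3) -> pd.DataFrame:
    """Keep only participants with a non-missing `score` at all visits."""
    observed = scores.dropna(subset=[score])
    counts = observed.groupby("participant_id")["visit"].nunique()
    keep = counts[counts == n_visits].index
    return scores[scores["participant_id"].isin(keep)]

# Each score's reliability and trend analysis uses only its own complete cases.
trails_b = complete_cases(df, "trails_b_time_to_completion")
```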

Standardization of scores. Is is helpful for the interpretation of learning effects or other trends to have all
scores on one, interpretable scale. Estimates of intra-class correlations and trends could be strongly
influenced by a small number of outlying values. To improve robustness to outliers and interpretability of
Miro scores and of possible trends in longitudinal analyses, Miro scores undergo a two-step
standardization process.

(1) First, raw scores are quantile normalized: quantiles of the empirical distribution of scores are mapped to the corresponding quantiles of a standard normal distribution. (2) Then the mean normalized score for normal participants is subtracted from each score, and the resulting centered scores are rescaled by the standard deviation of the scores among the normal participants. After this process, standardized scores have a mean of zero and a standard deviation of one in the normal participants, and an overall normal distribution. For scores derived from the Category Fluency and Letter Fluency modules, scores based on test sessions with different stimuli are standardized separately. For example, the number of words a participant produces beginning with a particular letter in the Letter Fluency module differs depending on whether the stimulus letter is "F" or "S"; after separate standardization of Letter Fluency scores by the stimulus used in each test session, the standardized scores are directly comparable and may be used in ICC and trend analyses.
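A minimal sketch of this two-step standardization, using numpy/scipy as an illustration rather than Miro's production implementation (function and variable names are hypothetical):

```python
import numpy as np
from scipy import stats

def quantile_normalize(scores: np.ndarray) -> np.ndarray:
    """Step 1: map empirical quantiles of `scores` onto a standard normal."""
    ranks = stats.rankdata(scores)           # ranks 1..n; ties get average ranks
    quantiles = ranks / (len(scores) + 1)    # strictly inside (0, 1)
    return stats.norm.ppf(quantiles)         # inverse standard-normal CDF

def standardize(scores: np.ndarray, is_normal: np.ndarray) -> np.ndarray:
    """Step 2: center and rescale using the cognitively normal reference group."""
    z = quantile_normalize(scores)
    ref = z[is_normal]                       # boolean mask of normal participants
    return (z - ref.mean()) / ref.std(ddof=1)

# Fluency scores are standardized separately per stimulus, e.g. one call of
# standardize() for Letter Fluency sessions with "F" and another for "S".
```

After this transformation the reference group has mean zero and unit standard deviation by construction, which is what makes scores from different stimuli directly comparable.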

Miro scores. Miro calculates many variables from each test module. Some of these are directly analogous to standard neuropsychological variables, and from them Miro calculates scores analogous to those generated by a traditional battery of neuropsychological exams. Miro also collects many scores from each test module that are informative but have no counterpart in standard practice. This broad collection of scores is being used in two data-driven refinements of the Miro platform:

1. In the development of functional domain scores, using factor analyses to generate estimates of individuals' placements on the latent dimensions of functional ability (see the sketch after this list).

2. In machine learning approaches to develop risk scores or classifiers that are specifically designed to identify participants with a particular diagnosis or functional profile, or to allow precision differential diagnosis of functionally overlapping disorders.
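As a rough sketch of the first refinement, and not Miro's actual model, latent domain scores could be estimated from a matrix of standardized scores with an off-the-shelf factor analysis; the number of latent dimensions here is purely illustrative:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Hypothetical input: rows are participants, columns are standardized Miro scores.
X = np.load("standardized_scores.npy")   # shape: (n_participants, n_scores)

# Estimate each participant's placement on a few latent functional dimensions.
fa = FactorAnalysis(n_components=4, random_state=0)
domain_scores = fa.fit_transform(X)      # shape: (n_participants, 4)

# fa.components_ holds the loadings, showing which Miro scores
# drive each latent domain.
```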
Results
The test-retest reliability of scores generated over three time points is quantified by the intra-class correlation coefficient (ICC), which generalizes the familiar notion of correlation to allow more than two assessments per participant. The ICC for the MCI Risk Score was 0.79, with a 95% confidence interval of (0.65, 0.89), demonstrating the stability and reliability of measurements of individuals' functional abilities. In addition to the test-retest reliability of measurements, this longitudinal analysis allows detection and characterization of learning effects or trends, as shown below.

Table 2 shows the ICCs and trends for basic scores calculated using many of the Miro test modules. Of the hundreds of basic scores Miro calculates as features to describe a participant's performance across the Miro modules, these scores have simple, intuitive definitions, and all have statistically significant ICCs.

The "Slope" column in Table 2 shows the estimated trend for each variable: the average number of standard deviations by which a participant's score changes from assessment to assessment. From the corresponding p-values we see that, of all the variables shown, these trends are significant only for the timing variables from the Choice Reaction Time module. The negative values of these significant trends show that participants responded faster on subsequent assessments of this module.
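One simple way to read these trend estimates (a sketch of the idea, not necessarily the exact model used in the analysis) is as the slope from regressing pooled standardized scores on assessment number, so the estimate is in standard deviations per assessment:

```python
import numpy as np
from scipy import stats

def trend_per_assessment(z_scores: np.ndarray, visits: np.ndarray):
    """Slope (in SD units per assessment) and p-value from a simple linear fit.

    z_scores: standardized scores pooled over participants and visits
    visits:   matching assessment numbers (1, 2, or 3)
    """
    fit = stats.linregress(visits, z_scores)
    return fit.slope, fit.pvalue

# A negative slope, as for the Choice Reaction Time timing variables,
# indicates faster (improving) responses across the three assessments.
```

A mixed-effects model with per-participant intercepts would match the repeated-measures design more faithfully; the sketch above conveys only the units and the sign convention.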
Table 2. Trends
VARIABLE                                                         ICC       SLOPE      P-VALUE
Voice Pitch                                                      0.84470    0.12540   0.04622
Voice Absolute jitter                                            0.81368   -0.05323   0.35855
Voice Fundamental frequency                                      0.77106   -0.00046   0.91865
Trails B Time to Completion                                      0.75484   -0.04676   0.46198
Trails A Time to Completion                                      0.75104    0.02119   0.86017
Voice Harmonicity                                                0.72981    0.11478   0.19152
Delayed Verbal Learning and Memory: Total Correct (Free Recall)  0.70693   -0.08525   0.27436
Picture Description: Number of Content Units                     0.70593    0.07335   0.45395
Coding: Total Correct                                            0.67579   -0.07532   0.33126
Choice Reaction Time: Avg Complex                                0.66766   -0.01479   0.80900
Digit Span Backward: Number Correct                              0.65432   -0.00827   0.87631
Digit Span Forward: Number Correct                               0.65312   -0.09183   0.28884
Test-retest reliability and trends. Participants were assessed on the Miro platform three times at one-week intervals. The consistency of the participants' scores was assessed using the intra-class correlation (ICC). Statistically significant ICC values have lower confidence-interval boundaries greater than zero and indicate scores that are consistently measured for individuals and that reliably indicate that individual's performance. Some scores have a significant trend over an individual's three assessments; this is indicated with an asterisk beside the p-value for the score's estimated trend.
Discussion
Initial findings demonstrate that the tool's stability for longitudinal testing varies by variable type. High-fidelity data, such as vocal acoustics, are significantly more stable and reliable than touchscreen responses.

Test-retest reliability results demonstrate the potential to track the signature performance of each individual over time. The ICC quantifies the test-retest reliability of scores from sequential assessments in a type of analysis of variance: from the pool of all measurements collected during a study, the ICC is the fraction of variance explained when the clusters of grouped measurements from each individual are known. This measure is statistically significant for many of the scores generated when a participant is evaluated. These significant scores include: analogs of traditional neuropsychological test scores; intuitive scores that would be infeasible to calculate during a traditional test session, such as the standard deviation of response times; features derived from advanced signal processing, such as acoustic analyses of voice recordings; and advanced linguistic features based on machine learning and NLP processing of free speech.
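Under this one-way random-effects formulation, the ICC can be computed from the between-participant and within-participant mean squares. A minimal sketch, assuming a complete n-participants-by-k-assessments score matrix:

```python
import numpy as np

def icc_oneway(scores: np.ndarray) -> float:
    """One-way random-effects ICC for an (n_participants, k_visits) score matrix."""
    n, k = scores.shape
    grand_mean = scores.mean()
    participant_means = scores.mean(axis=1)

    # Between- and within-participant mean squares from a one-way ANOVA.
    ms_between = k * np.sum((participant_means - grand_mean) ** 2) / (n - 1)
    ms_within = np.sum((scores - participant_means[:, None]) ** 2) / (n * (k - 1))

    # Fraction of total variance explained by participant identity.
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
```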

Some scores show significant trends across repeated testing sessions. In the results shown, there is a decrease in response times for the Choice Reaction Time module. The particular stimuli used in each assessment with this module are randomized, but a learning effect from a participant's increased familiarity with the instructions and required tasks in the game may explain the improvement in performance on repeated assessments. Miro was designed to minimize learning effects by generating directly comparable but distinct exams on each repeated assessment; in the case of the Choice Reaction Time module, for example, stimulus presentation order is randomized.

Having a learning effect for a score or module is not a problem: discrepancies between performance on subsequent assessments and the performance expected given prior performance and expected learning effects are an informative and powerful set of features that Miro makes available. It is expected that, with larger data sets collected over time, Miro's ability to characterize differences between observed and expected performance may be used to predict disease course, monitor therapeutic effects, support differential diagnosis, describe disease sub-types, and find phenotypic markers.