The need for objective outcome measures to advance intervention research in autism

The literature on autism intervention is replete with statistically nonsignificant trials, for both pharmacological and behavioral-educational approaches. While subjects in open-label drug trials show improvement more often than not, as do subjects in uncontrolled and ‘waitlist control’ behavioral intervention trials, it has been disappointingly rare to find treatment-associated benefits in trials that include well-controlled comparison groups.

But when a treatment trial yields nonsignificant results, we must ask whether the trial was truly ‘negative,’ or whether the trial itself ‘failed.’ In the parlance of drug development (and applicable to other intervention approaches as well), negative trials are those that yield valid evidence on the treatment’s lack of efficacy, at least in the population and under the conditions in which it was tested. By contrast, failed trials are those that, because of deficiencies in design or execution, are inadequate to provide valid inferences about the treatment’s efficacy.

Placebo effects

Large placebo responses are commonly found in drug trials for autism, and their presence suggests that these trials are very likely failed trials, rather than robustly negative trials. Many had hoped that parent-rated symptom scales in autism would be less susceptible to placebo effects than self-reported symptomatology in depression, anxiety and other psychiatric conditions, but this has not turned out to be true.

Trials of a diverse array of drugs in children and adults with ASD have shown placebo-associated improvements as large as those for the treatment groups. A recently published trial looking at the efficacy of memantine (an NMDA receptor antagonist) showed such effects on the Social Responsiveness Scale (SRS) in the context of a randomized withdrawal design1 — a trial design that is regarded as less susceptible to placebo effects than the common prospective, parallel-group design that randomizes at baseline. The failed trials of group I metabotropic glutamate receptor 5 (mGluR5) antagonists2,3 and arbaclofen (a GABAB receptor agonist)4 in fragile X syndrome also used parent-rated symptom scales and also showed large placebo effects (Figure 1).

Figure 1. Mavoglurant effects in fragile X syndrome. A double-blind, randomized clinical trial of mavoglurant, an mGluR5 inhibitor, in adults with fragile X syndrome found no benefit associated with drug treatment, as rated on the Aberrant Behavior Checklist – Community Edition (ABC-C). After 12 weeks of treatment, subjects receiving placebo showed improvement similar to or greater than that of subjects receiving mavoglurant at any dose (25, 50 or 100 mg). Image adapted from Berry-Kravis E. et al.2

Perhaps the drugs that have been tested truly don’t work better than placebo, and the statistically nonsignificant results reflect the absence of any real benefit. But even for a (future) drug that is genuinely beneficial, placebo-related changes would add to the challenge of demonstrating treatment-associated benefits.
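One simplified way to see why nonspecific responses make a real benefit harder to demonstrate is to treat them as an extra source of between-subject variability in the outcome. The sketch below is illustrative only: the function name, the effect size and the standard deviations are hypothetical, and the additive-variance model is an assumption rather than an analysis taken from any of the trials cited here. It uses the standard normal-approximation formula for the sample size of a two-arm parallel-group trial.

```python
from math import ceil, sqrt
from statistics import NormalDist


def per_arm_sample_size(true_effect, measurement_sd, nonspecific_sd,
                        alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-arm parallel-group trial.

    Uses the normal-approximation formula
    n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2, where d is the
    standardized effect size. All inputs are hypothetical.
    """
    # Assumption: nonspecific response varies independently across subjects,
    # so its variance adds to the ordinary measurement variance.
    total_sd = sqrt(measurement_sd ** 2 + nonspecific_sd ** 2)
    d = true_effect / total_sd
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


# Hypothetical 5-point true drug effect on a rating scale with SD 10.
n_without = per_arm_sample_size(true_effect=5, measurement_sd=10, nonspecific_sd=0)
n_with = per_arm_sample_size(true_effect=5, measurement_sd=10, nonspecific_sd=10)
print(n_without, n_with)
```

Under these assumed numbers, adding nonspecific variability as large as the measurement variability doubles the total variance, and therefore roughly doubles the number of subjects needed per arm to detect the same true effect.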

The roots of these placebo effects are believed to be manifold, and various alternative study designs (from placebo run-in periods to the randomized withdrawal design) have been implemented in drug trials for other psychiatric conditions to minimize placebo effects, but without complete success (e.g., 5).

Other nonspecific effects influence subjective rating scales

Rebecca Jones and colleagues recently reported that the problems with current autism rating scales extend beyond placebo. They found that serial ratings (baseline and eight weeks later) on the SRS and on the Aberrant Behavior Checklist (ABC) showed improvement in the absence of any treatment at all6 (Figure 2). These ‘improvements’ cannot be termed placebo effects, since no placebo was administered. Rather, they represent a broader class of nonspecific effects that likely pertain to many parent-reported rating scales for autism.

Figure 2. Placebo-like effects observed in the absence of treatment. Caregivers of children with ASD rated their children’s symptoms as less severe at the end of an eight-week study in which no treatment was provided. This figure illustrates significant improvements in the Aberrant Behavior Checklist (ABC) score (A) and Social Responsiveness Scale (SRS) score (B) after the study (T2) compared to baseline (T1). Image adapted from Jones R.M. et al.6
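A baseline-versus-follow-up comparison of this kind can be summarized as a paired analysis of within-subject change. The following is a minimal sketch with made-up scores; it is not the analysis or the data reported by Jones and colleagues, and the function name `paired_change` is hypothetical. Lower scores are taken to indicate less severe symptoms, so a negative mean change reads as apparent improvement.

```python
from math import sqrt
from statistics import mean, stdev


def paired_change(baseline, followup):
    """Return the mean within-subject change and its paired t statistic."""
    diffs = [f - b for b, f in zip(baseline, followup)]
    mean_change = mean(diffs)
    # Standard error of the mean change, from the sample SD of the differences.
    standard_error = stdev(diffs) / sqrt(len(diffs))
    return mean_change, mean_change / standard_error


# Hypothetical parent-rated scores for six children; no treatment in between.
t1_scores = [60, 72, 55, 80, 66, 70]  # baseline (T1)
t2_scores = [55, 70, 52, 74, 63, 65]  # eight weeks later (T2)

mean_change, t_stat = paired_change(t1_scores, t2_scores)
print(mean_change, round(t_stat, 2))
```

With these invented numbers, every child's score drops a little, so the mean change is negative and the t statistic is large in magnitude even though nothing was administered. That is the pattern a trial without an adequate comparison group would risk misreading as a treatment effect.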

Like the SRS and ABC, most assessments of autism symptom severity are parent-reported rating scales. As such, they are inherently subjective and prone to nonspecific effects. Clinician-rated scales, like the Clinical Global Impression scales, also are subjective and can show substantial placebo effects (e.g., 7). (A stark contrast is found in the objective measures used in clinical trials in other medical conditions, such as glycosylated hemoglobin for diabetes treatments or viral load for human immunodeficiency virus [HIV] treatments.)

The Vineland Adaptive Behavior Scales (Interview version) also are dependent on parent reports, but those reports are filtered through a clinician and are anchored in specific behaviors rather than being vague ratings of sociability. Many in the research community have had concerns that the Vineland might not be sensitive to small but meaningful changes in symptom severity, but it has shown relative resistance to placebo effects and also has shown the potential for change in at least a handful of trials (e.g., 8, 9). As a result, it has been designated as the primary endpoint for several high-profile, ongoing trials in autism (e.g., balovaptan [an antagonist of the vasopressin V1A receptor; NCT03504917 and NCT02901431] and arbaclofen [NCT03887676 and NCT03682978]).

The Autism Diagnostic Observation Schedule (ADOS), obviously designed as a diagnostic scale, also has been used as an outcome measure in intervention trials (e.g., 10). Its scoring is strongly anchored in specific actions and behaviors, but it retains a degree of subjectivity, since it is scored by (trained and certified) clinicians. As might be expected from its diagnostic roots, however, the ADOS is relatively insensitive to change. It also is resource intensive, requiring substantial time and effort to administer and score. Neither the Vineland nor the ADOS is suitable for repeat administration within short time intervals, as would be desirable in a study that seeks to elucidate the trajectory of treatment-associated change.

The recently developed Autism Impact Measure (AIM)11 is still a parent report, but it can be administered repeatedly at relatively short intervals, and like the Vineland, it is anchored in specific behaviors. Its performance as an outcome measure is being evaluated in several ongoing trials (e.g., arbaclofen [NCT03887676 and NCT03682978]).

Physiological biomarkers relevant to autism

Objective assessments relevant to autism are available in the form of physiological biomarkers, using approaches such as positron emission tomography (PET), electroencephalography (EEG)/event-related potentials (ERP), functional magnetic resonance imaging (fMRI) and eye tracking.

These methods have been used in autism-related research to provide evidence of target engagement, to demonstrate proof of mechanism, and potentially to demonstrate proof of concept. However, these measures obviously reflect internal biological processes (in the case of EEG or fMRI) or rather narrow behaviors in an artificial, lab-based paradigm (eye tracking), rather than real-life behavior and function. Thus, while they may yield evidence of potentially relevant drug effects (e.g., in association with mGluR5 antagonism in fragile X syndrome12), they cannot serve as endpoints for regulatory approval of drugs, and also are not fully satisfactory for demonstrating the clinical efficacy of behavioral and educational treatments.

SFARI awards to support the development of performance-based outcome measures

In light of these challenges, the Simons Foundation Autism Research Initiative (SFARI) announced the Novel Outcome Measures in ASD request for applications (RFA). The first RFA on this topic was issued in 2015 and the most recent in 2019.

Awards made through this RFA program are intended to support the development of new assessment tools that have the potential to serve as ‘approvable endpoints’ in regulatory trials (or to serve an analogous purpose in behavioral/educational trials) while being less prone to nonspecific effects and more sensitive to change than currently available measures.

The aim of all of the awarded projects is to develop performance-based measures (i.e., direct assessments of the subject’s behavior, rather than parent or clinician ratings based on recollection and interpretation of the subject’s behavior). Thus, the measures proposed in these projects avoid the subjectivity associated with recollection biases, nonstandardized conditions for behavioral observation and interpretation/scoring by nonexpert raters.

Four grants were awarded in 2015. A summary of the measures that were developed as part of these projects can be found here.

The new measures (funded through the 2019 RFA) include:

  1. an eye-gaze measure that promises to be applicable across a wide range of ages and has already shown evidence of utility as a measure of autism severity13 (Figure 3)

  2. tablet-based tests of social-perceptive and social-cognitive processes

  3. a computer-vision assessment of behaviors elicited during presentation of socially relevant stimuli on a tablet device, and

  4. the BOSCC (Brief Observation of Social Communication Change)14 and ELSA (Elicitation of Language Samples for Analysis), measures that elicit social-communicative interactions in a standardized social framework (Figure 4). Both tools have been previously developed but require further refinement and validation.

Figure 3. Relationship between eye-tracking measure and the Autism Diagnostic Observation Schedule, 2nd edition (ADOS-2). The Autism Severity Index (ASI), which was calculated using an eye-tracking measure, correlates with autism severity scores obtained from the ADOS-2. Image from Frazier T.W. et al.13

Each of these projects will examine the psychometric properties of its measures, including test-retest and inter-rater reliability, as well as concurrent validity in comparison to existing measures of symptom severity.

The BOSCC and ELSA stand out from the other new measures because they assess social and communicative behavior in the context of ‘live’ interaction with other people, rather than using computer- or tablet-based stimuli. The other measures may well have utility as proof-of-concept endpoints or as early markers of intervention effects, but they face a greater challenge than the BOSCC or ELSA in demonstrating whether they are indicative of function in true social contexts and thereby appropriate as ‘approvable endpoints.’

Figure 4. Brief Observation of Social Communication Change (BOSCC) captures improvement in sociability. BOSCC showed decreased social communication scores (i.e., indicating improvement) following treatment (T2) compared to baseline (T1) when applied to both videos of parent/examiner-child interactions (Standard BOSCC) and videos showing administration of the Autism Diagnostic Observation Schedule, 2nd edition (ADOS-BOSCC). By contrast, no changes in sociability were detected using the Autism Diagnostic Observation Schedule Calibrated Severity Scores and Social Affect Scores (ADOS SSC SA). Image from Kim S.H. et al.15

By advancing ‘measurement science’ for autism symptomatology, the new awards have the potential to advance treatment development broadly. The adoption of new outcome measures that minimize placebo and other nonspecific effects should decrease the risk of false-negative trial results and the consequent premature dismissal of potentially valid therapeutic hypotheses.

The measures that are developed and validated under this new round of support will certainly not constitute the apogee of outcome assessment. Emerging technologies for data collection and analysis should ultimately allow true ecological assessment of symptomatology — that is, assessment of behavior not during a discrete period of time in an experimental context, but of behavior that occurs throughout the course of daily activity and evaluated across multiple hours of multiple days. Such methods raise privacy and potentially other concerns that must be grappled with, but they assay exactly what treatments for autism should support: positive changes in social-communicative (and repetitive-restrictive and sensory) symptoms that manifest throughout the day in real-life circumstances.


  1. Hardan A.Y. et al. Autism 23, 2096-2111 (2019)
  2. Berry-Kravis E. et al. Sci. Transl. Med. 8, 321ra5 (2016)
  3. Youssef E.A. et al. Neuropsychopharmacology 43, 503-512 (2018)
  4. Berry-Kravis E. et al. J. Neurodev. Disord. 9, 3 (2017)
  5. Fava M. et al. Psychother. Psychosom. 72, 115-127 (2003)
  6. Jones R.M. et al. Autism Res. 10, 1567-1572 (2017)
  7. Masi A. et al. Transl. Psychiatry 5, e640 (2015)
  8. Veenstra-VanderWeele J. et al. Neuropsychopharmacology 42, 1390-1398 (2017)
  9. Bolognani F. et al. Sci. Transl. Med. 11, eaat7838 (2019)
  10. Dawson G. et al. Pediatrics 125, e17-23 (2010)
  11. Kanne S.M. et al. J. Autism Dev. Disord. 44, 168-179 (2014)
  12. Hessl D. et al. PLoS One 14, e0209984 (2019)
  13. Frazier T.W. et al. J. Am. Acad. Child Adolesc. Psychiatry 57, 858-866 (2018)
  14. Grzadzinski R. et al. J. Autism Dev. Disord. 46, 2464-2479 (2016)
  15. Kim S.H. et al. Autism 23, 1176-1185 (2019)