AutismAtlas: Engineering serendipity in autism research by transforming data discovery and creating opportunities for data integration

Awarded: 2025
Award Type: Director
Award #: SFI-AN-Director-00025400

Arjun Krishnan, Ph.D.
University of Colorado Anschutz Medical Campus

The autism research landscape faces a critical data discovery bottleneck that limits scientific progress. Although unprecedented volumes of valuable data exist across SFARI resources (SPARK, Simons Simplex Collection, Simons Searchlight, Autism Inpatient Collection, Research Match studies) and public repositories, these data remain fragmented and underutilized because their metadata are scattered across disconnected repositories and follow inconsistent or poorly defined standards. This fragmentation is also a major missed opportunity: metadata can serve as the grand unifier that connects diverse data types through shared attributes. While metadata remain disconnected and unstandardized, researchers cannot identify the non-obvious dataset combinations that could reveal breakthrough insights into mechanisms, subtypes, and interventions for autism (ASD) and related neurodevelopmental disorders (ND).

This project proposes to develop AutismAtlas, an AI-powered platform that transforms ASD/ND researchers’ ability to discover relevant datasets and combine them in novel ways, engineering serendipitous discoveries through intelligent metadata integration. The investigators will build a scalable pipeline that harvests and semantically enriches unstructured metadata from all SFARI datasets and 802,241 datasets from major public repositories (GEO, SRA, BioSamples, EBI BioStudies) spanning genetics, genomics, transcriptomics, proteomics, metabolomics, imaging, and clinical trials. Fine-tuned large language models will extract ASD/ND-relevant biomedical entities (age, sex, diagnosis, interventions, comorbidities, phenotypes) and map them to controlled vocabularies (MONDO, EFO, UBERON).

Building on their prior work in both ASD¹ and ML/AI-based metadata analysis², the team will develop InfoTokens, a novel method for identifying semantically informative terms about ASD/ND, and implement scalable semantic similarity algorithms that use maximum-minimum waypoint sampling and FAISS-based vector databases to rapidly discover connections between datasets. A conversational AI research assistant will enable natural language queries, returning relevant datasets with explanations and suggesting conceptual strategies for integrative analyses and cross-domain collaborations.
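
As a rough illustration of the similarity step, the sketch below shows one common reading of max-min sampling: greedy farthest-point selection, where each new waypoint is the item farthest from all waypoints chosen so far. This is an assumption about what "maximum-minimum waypoint sampling" means, not the project's actual method. The function names and the toy two-dimensional embeddings are hypothetical, and plain Python lists stand in for a FAISS index.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def maxmin_waypoints(vectors, k, start=0):
    """Greedily pick k indices so that each new pick maximizes its
    minimum cosine distance to the waypoints already chosen."""
    chosen = [start]
    # Distance from every vector to its nearest chosen waypoint.
    min_dist = [1.0 - cosine(v, vectors[start]) for v in vectors]
    while len(chosen) < min(k, len(vectors)):
        nxt = max(range(len(vectors)), key=min_dist.__getitem__)
        chosen.append(nxt)
        min_dist = [
            min(d, 1.0 - cosine(v, vectors[nxt]))
            for d, v in zip(min_dist, vectors)
        ]
    return chosen


def top_matches(vectors, query, n=3):
    """Indices of the n vectors most similar to the query (brute force)."""
    ranked = sorted(range(len(vectors)), key=lambda i: -cosine(vectors[i], query))
    return ranked[:n]


# Toy dataset embeddings (hypothetical values, one row per dataset).
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(maxmin_waypoints(embeddings, 3))       # [0, 3, 2]
print(top_matches(embeddings, [1.0, 0.0], 2))  # [0, 1]
```

In this toy run the near-duplicate of the first embedding is skipped as a waypoint, which is the point of max-min selection: a small set of waypoints that spreads across the embedding space. At the scale described above, the brute-force search would be replaced by an approximate nearest-neighbor index.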

AutismAtlas will be systematically evaluated using 30,000 labeled human omics samples from 650 datasets, measuring precision and recall against those labels and assessing user satisfaction through usability testing with ASD/ND researchers. This platform will transform isolated ASD/ND research into an interconnected discovery ecosystem, enabling researchers to identify novel subtypes, shared biological pathways, and precision medicine approaches through previously unforeseen dataset integration opportunities.
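
For readers less familiar with the metrics, the following minimal sketch shows how precision and recall could be computed for extracted metadata against a labeled set. The representation of annotations as (sample ID, label) pairs, the function name, and the example values are assumptions made for illustration only; the abstract does not specify the evaluation protocol.

```python
def precision_recall(predicted, gold):
    """Precision and recall for sets of (sample_id, label) pairs.

    precision = correct predictions / all predictions
    recall    = correct predictions / all gold annotations
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical annotations: the extractor labels two of three samples
# and gets one of them right.
gold = {("s1", "female"), ("s2", "male"), ("s3", "female")}
predicted = {("s1", "female"), ("s2", "female")}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.50 recall=0.33
```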

References

1. Krishnan A. et al. Nat. Neurosci. 19, 1454–1462 (2016)
2. Yuan H. et al. Brief. Bioinform. 26, bbae652 (2024)