Workshop Report: Sequencing

**DNA deterrents:** The workshop focused on the technical, analytical and economic challenges that face sequencing projects in autism research.

Organizers: SFARI
New York, 17 December, 2009

When he saw John Sulston’s and Bob Waterston’s poster on the physical map of the C. elegans genome at a 1989 Cold Spring Harbor Laboratory symposium, James Watson famously asked, “You can’t see it without wanting to sequence it, can you?”

At SFARI, there are similar feelings about the Simons Simplex Collection (SSC). As the number of available DNA samples continues to increase and the cost of sequencing continues to drop, one can’t help but want to capture all of the genetic variation that might be contributing to autism susceptibility in these families. Toward this end, SFARI organized a one-day workshop on the prospects for sequencing samples from the collection, and invited a stellar group of participants: Peter Barrett, Aravinda Chakravarti, Mark Daly, Evan Eichler, Richard Gibbs, Rick Lifton, Elaine Mardis, Len Pennacchio, Christian Schaaf, Jay Shendure, Chris Walsh, Mike Wigler and Michael Zwick. This is a brief summary of that workshop.

The SSC, one of SFARI’s core projects, is more than halfway to its goal of collecting approximately 3,000 simplex autism families — which have one child affected with autism spectrum disorder, and unaffected parents and siblings — totaling roughly 12,000 individuals. Establishing a permanent repository of both genetic samples and extensive phenotypic data for all family members provides a valuable resource for a variety of research projects, including those aiming to identify de novo and inherited genetic variants associated with autism.

The workshop focused on three main questions: 1) What sequencing projects are currently under way in autism, and what are the preliminary results from these studies? 2) What can be learned from candidate gene sequencing projects? 3) If the ultimate goal is to sequence whole exomes or whole genomes of all individuals in the SSC, what are the technical, analytical and economic challenges that need to be overcome?

Ongoing projects:

There is good reason to think that known autism-related molecular pathways, such as those involved in the functioning of neuronal networks and the formation of synapses, are a good starting point in the search for potential autism susceptibility genes.

Researchers from Huda Zoghbi’s laboratory at Baylor College of Medicine have assembled a collection of proteins known to be implicated in these pathways. Using a high-stringency yeast two-hybrid analysis, they have screened both fetal and adult human brain cDNA libraries to generate a large protein interaction network called the ‘autism interactome’.

Zoghbi and her Baylor collaborator Richard Gibbs aim to sequence the 500 genes involved in this autism interactome. Because sequence analysis of SSC samples has detected only a very small number of de novo mutations in known autism susceptibility genes — which are mostly associated with syndromic autism — Christian Schaaf, a postdoctoral fellow in Zoghbi’s laboratory, suggested that idiopathic, simplex autism is likely to be caused by genetic alterations in genes that are different from those implicated in syndromic autism.

Various modes of inheritance may contribute to the genetics of autism, including autosomal recessive inheritance. Chris Walsh, a neurologist at Harvard University, studies families with multiple affected family members, and uses families with known consanguinity to identify autism susceptibility genes by homozygosity mapping.

Walsh’s group has identified several new autism-linked mutations using these techniques. The combination of homozygosity mapping with next-generation sequencing of the homozygous regions might provide further insight into the genetics of autism spectrum disorders. What’s more, SNP arrays of the DNA from SSC samples are likely to provide information regarding occult shared ancestry among individuals within the collection.
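The core idea behind homozygosity mapping can be illustrated with a minimal sketch that scans ordered SNP genotypes from one individual for long stretches of homozygous calls. The genotype encoding, the function name `runs_of_homozygosity` and the `min_run` threshold are assumptions made for illustration; this is not the method used by Walsh’s group.

```python
from typing import List, Tuple


def runs_of_homozygosity(genotypes: List[str], min_run: int = 3) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index pairs of homozygous runs.

    genotypes: ordered two-letter genotype calls along one chromosome,
               e.g. "AA" (homozygous) or "AG" (heterozygous).
    min_run:   hypothetical minimum number of consecutive homozygous SNPs
               for a run to be reported.
    """
    runs = []
    start = None  # index where the current homozygous run began, if any
    for i, call in enumerate(genotypes):
        is_homozygous = len(call) == 2 and call[0] == call[1]
        if is_homozygous:
            if start is None:
                start = i
        else:
            # A heterozygous (or malformed) call ends the current run.
            if start is not None and i - start >= min_run:
                runs.append((start, i - 1))
            start = None
    # Close a run that extends to the end of the chromosome.
    if start is not None and len(genotypes) - start >= min_run:
        runs.append((start, len(genotypes) - 1))
    return runs
```

In a real study, the regions of interest would be long runs shared by affected relatives in a consanguineous family, and thresholds would be set in physical or genetic distance rather than in SNP counts; the toy function above only shows the scanning step.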

But why does each locus identified to date account for such a small number of cases? How many autism susceptibility genes are really out there? Michael Wigler, a researcher at Cold Spring Harbor Laboratory, presented his estimate of approximately 200 autism susceptibility genes. A mathematical model developed in his laboratory predicts that about 50 percent of all cases of autism are sporadic, about 25 percent are dominant with relatively high penetrance, and the remaining 25 percent are complex (including autosomal recessive inheritance).

Michael Zwick, who studies molecular evolution and evolutionary genetics at Emory University, commented on the problem of paralogous sequences, which are similar to one another because they arose by duplication. These sequences can confound exon-capture analysis because paralogous sequences may be co-amplified. For Zwick, this proved to be a significant challenge in his studies on the X chromosome, which shares sizable paralogy with the Y chromosome.

Paralogous sequences are not confined to the sex chromosomes, however. They are shared by other regions of the genome, some of which are transcribed and translated. Interestingly, some of these sequences may be among the most highly mutable regions of the human genome.

Overall, the participants agreed that projects to sequence candidate genes previously implicated in autism are valuable insofar as they produce a more manageable amount of data that may be enriched for biologically relevant genes. Larger sequencing projects (whole exome or whole genome) are attractive because they are more comprehensive. From a cost perspective, these projects are becoming more and more affordable. There remains, however, the challenge of data analysis. Only as the field moves forward will it become clear how well clinically or biologically relevant conclusions can be drawn from those large-scale studies.

Large-scale sequencing:

Large-scale sequencing results can be compromised by false positives arising from somatic mutations or cell-line artifacts, and by the presence of paralogous sequences in the genome. Walsh and Evan Eichler, a professor of genome sciences at the University of Washington School of Medicine and a Howard Hughes investigator, cited unpublished studies showing that 50 to 90 percent of detected de novo point mutations represent either somatic events or artifacts present in the DNA of virally transformed lymphoblastoid cell lines.

To obtain the best-quality sequencing data and to reduce the number of false-positive results, sequencing should be performed on whole-blood DNA rather than on DNA from cell lines. Whole-genome amplification should be considered as a way to increase the amount of DNA available to researchers for sequencing projects.

Although technical advances have made it both easier and more affordable to generate large amounts of sequencing data, the challenges of data quality, analysis and management have become more demanding.

Gibbs emphasized the need for thorough data validation and argued that all data found to be relevant should be confirmed by a second, independent technique or platform. As different next-generation sequencing platforms have very different error profiles, independent validation on two of these platforms will significantly increase the validity of large-scale sequencing data. The generation of computational algorithms and the building of error models would represent another phase of quality control.
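The reasoning behind two-platform validation can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that the two platforms make errors independently, so that the chance of both producing the same false call is the product of their individual rates. The function names and the example rates are hypothetical and are not figures presented at the workshop.

```python
def joint_false_positive_rate(rate_a: float, rate_b: float) -> float:
    """Probability that the same false call appears on both platforms,
    under the simplifying assumption that their errors are independent."""
    for rate in (rate_a, rate_b):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("error rates must lie between 0 and 1")
    return rate_a * rate_b


def expected_false_positives(n_sites: int, rate: float) -> float:
    """Expected number of false calls across n_sites examined positions."""
    return n_sites * rate


# Hypothetical per-site false-positive rates for two platforms.
RATE_A = 0.001
RATE_B = 0.002
SITES = 1_000_000

single_platform = expected_false_positives(SITES, RATE_A)
both_platforms = expected_false_positives(
    SITES, joint_false_positive_rate(RATE_A, RATE_B)
)
```

With these made-up rates, requiring agreement between platforms reduces the expected number of false calls by several orders of magnitude. Real platforms share some systematic errors, for example in paralogous regions, so the true benefit is likely to be smaller than this idealized product suggests.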

Mark Daly, associate professor of medicine at Harvard University, proposed a centralized repository for raw sequence data from the SSC, and potentially from other autism sequencing projects, so that large-scale sequencing data can be analyzed uniformly and consistently, much as the data for the 1000 Genomes dataset were analyzed.

The data analysis pipeline would comprise a series of SNP-calling and quality-score algorithms. From this repository, data could then be made available to a broader community. Walsh suggested starting with the generation of a database tracking ongoing sequencing analyses of SSC samples to reduce the redundancy of such experiments. His suggestion was widely accepted by the other participants.

Lessons learned:

There is no doubt that large-scale sequencing analyses will detect a large number of polymorphisms. One of the major challenges will be to determine which of those alterations are causative for the phenotype under investigation. Jay Shendure, assistant professor of genome sciences at the University of Washington, noted that most of the novel variants detected are private mutations or SNPs. There are several ways to filter such data, but Shendure suggested that comparison to the ‘normal genomes’ represented in the dbSNP or 1000 Genomes datasets provides an extremely powerful filter.
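The filtering step described here amounts to removing any candidate variant that has already been catalogued in a reference collection. The following is a minimal sketch of that idea, in which a small in-memory set stands in for a catalogue such as dbSNP or 1000 Genomes. The `Variant` tuple layout and the `filter_novel` function are illustrative assumptions, not the interface of any real database or tool.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical variant key: (chromosome, position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]


def filter_novel(candidates: Iterable[Variant], known: Set[Variant]) -> List[Variant]:
    """Keep only candidates absent from the set of known variants,
    preserving the order in which the candidates were supplied."""
    return [variant for variant in candidates if variant not in known]


# Toy data: the "known" set stands in for a public variant catalogue.
known_variants: Set[Variant] = {
    ("chr1", 1000, "A", "G"),
    ("chr2", 2500, "C", "T"),
}
candidate_variants: List[Variant] = [
    ("chr1", 1000, "A", "G"),  # already catalogued, so filtered out
    ("chr1", 1000, "A", "T"),  # same site, different allele, so retained
    ("chrX", 300, "G", "A"),   # not catalogued, so retained
]

novel_variants = filter_novel(candidate_variants, known_variants)
```

Matching on the full tuple rather than on position alone is a deliberate choice in this sketch: a different alternate allele at a known polymorphic site is still treated as novel.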

Filtering by sequence similarity in comparison to other species and using tools like PolyPhen (predicting the possible impact of an amino acid substitution on the structure and function of a human protein) might provide additional information but can be difficult to interpret, especially when studying a human behavioral phenotype like autism. Shendure and Rick Lifton, chair of genetics at Yale University, are pioneers in the field of whole-exome sequencing. They have shown how genetic diagnoses of Mendelian disorders can be made by whole-exome capture and massively parallel sequencing.

Researchers in the field of autism genetics can learn from the experience gained by large-scale sequencing projects in other fields, such as cancer genetics. Elaine Mardis, associate professor of genetics at Washington University in St. Louis, drew several conclusions. First, the power of genetics depends on the quality of the underlying phenotypic data. Large repositories of high-quality, standardized clinical data such as the SSC provide the clinical information needed to identify clinical sub-phenotypes and to draw conclusions about genotype-phenotype correlations.

Second, Mardis said, whole-genome sequencing is feasible, but algorithms for data processing, data analysis and data storage need to be in place. Third, the discovery power of next-generation sequencing will rapidly outstrip our ability to conduct functional evaluations of our discoveries. Finally, Mardis emphasized the importance of detailed consent forms so that research subjects (or their legally authorized representatives) understand what it means to have their DNA sequence placed in a public database.

Beyond coding regions:

As geneticists have learned over the past few decades, pathology can be caused by mechanisms other than changes in the coding region of the genome. These include both changes in non-coding sequences and epigenetic mechanisms. Walsh expressed concern that sequencing studies focused only on the so-called exome might miss important regulatory sequences affecting gene expression, as well as non-coding RNAs.

Len Pennacchio, chief of the Genetic Analysis Program at the US Department of Energy’s Joint Genome Institute, presented his data on tissue-specific enhancers of gene expression. He and his team have used chromatin immunoprecipitation with the enhancer-associated protein p300, followed by massively parallel sequencing, to identify and accurately predict the tissue-specific activity of enhancers. Pennacchio suggested adding bona fide enhancers specific to the human forebrain to ‘regions of interest’ when studying the etiology of autism. The human enhancer browser is accessible at http://enhancer.lbl.gov.

Studying the genetics of autism has been sobering for many years, because researchers have not been able to identify genetic changes that contribute to a large percentage of cases of autism spectrum disorders. This might be the result of significant etiological heterogeneity, with each genetic susceptibility locus accounting for only a small fraction of cases or having a small effect. Several dozen, if not hundreds, of genes might be involved in the etiology of autism. Multi-gene interactions might play a role, as well as epigenetic mechanisms, genomic imprinting and gene-environment interactions.

Recent advances in next-generation sequencing techniques have made the large-scale analysis of human genomes possible. There is good reason to hope that large-scale sequencing of DNA from individuals affected with autism spectrum disorders and their unaffected family members will identify recurrently mutated genes causally related to their disorder.

The ultimate goals of these studies include determining the prevalence of those mutations in the population, correlating specific mutations to clinical phenotypes, and identifying biological pathways involved in the pathophysiology of autism spectrum disorders. Whole-exome sequencing of about 100 individuals from the SSC will be an important step toward achieving these goals.