Family Trio Sings for Genomic Supper

  • Genome sequence variations used in conjunction with selective breeding programs for agricultural animals, like dairy cows, can increase industry productivity.
  • High-quality reference genome sequences are the key to exploiting the information present in genomes.
  • A new method for assembling a genome sequence using the trio of mom, dad, and an offspring greatly reduces errors.
  • The new method identified errors in a “reference” cow genome sequence and will lead to its improvement.

 

Solving a giant crossword puzzle and completely sequencing a genome have a lot in common, including despair and satisfaction. The puzzle just requires the assembly of all components into the one correct pattern. The first 90% is fast and furious. One’s confidence grows as the unique solution becomes tantalizingly close. Satisfaction seems guaranteed. But then the last 10% rears its ugly head and frustratingly devours time and confidence. “I can’t get no satisfaction”: the plaintive words of Mick Jagger resonate mercilessly. The stark realization is depressing. Most of the puzzle is correct, but there must be an error somewhere, and it’s hard to go back. The inevitable outcome is to accept something that is mainly correct and move on: “you can’t always get what you want.” However, all is not lost. Koren and nine colleagues [1] recently developed a very smart solution for completing the genomic puzzle with much lower error rates. They used a genomic trio of mom, dad, and one offspring for maximal effect and then tested their method in three species. The results were impressive, particularly for the cow.

What is a Genome Sequence?

A genome is the complete set of genetic material (DNA) present in a cell. All life has DNA or, in the case of some viruses, its first cousin RNA. The genome contains the genetic blueprint for the form and function of life. That’s a big call! DNA is a long double-helical molecule present in the nucleus of a cell. Amazingly, there are about two meters of highly compacted DNA present in the nucleus of each mammalian cell, which is only six one-thousandths of a millimeter in diameter [2]. There is one long molecule of DNA associated with each chromosome. The sequence of a genome is the ordered sequence of the building blocks of DNA, called nucleotides, in all chromosomes. A mammalian genome contains about three billion nucleotides. Long stretches of these nucleotides code for the tens of thousands of genes and their regulatory regions. The nucleotide code for genes is read and ultimately converted into biological action by molecular machinery present within the cell. (Surprisingly, only about 8% of the genome seems functional [3]. The big mystery is the role of the remaining 92%.)
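To get a feel for the scale of these numbers, here is a back-of-the-envelope calculation in Python. It uses only the round figures quoted above (three billion nucleotides, 8% functional, two meters of DNA, a nucleus six one-thousandths of a millimeter across), so the results are rough approximations and not measurements. The variable names are purely illustrative.

```python
# Round figures quoted in the text; all values are approximate.
GENOME_SIZE_NT = 3_000_000_000      # nucleotides in a mammalian genome
FUNCTIONAL_PERCENT = 8              # share of the genome that seems functional
DNA_LENGTH_UM = 2_000_000           # two meters, expressed in micrometers
NUCLEUS_DIAMETER_UM = 6             # six one-thousandths of a millimeter

# Nucleotides that appear to be functional, and the mysterious remainder.
functional_nt = GENOME_SIZE_NT * FUNCTIONAL_PERCENT // 100
remaining_nt = GENOME_SIZE_NT - functional_nt

# How many nucleus diameters long the DNA would be if stretched out.
length_to_diameter = DNA_LENGTH_UM / NUCLEUS_DIAMETER_UM

print(f"Seemingly functional: {functional_nt:,} nucleotides")
print(f"Role unknown:         {remaining_nt:,} nucleotides")
print(f"DNA length is roughly {length_to_diameter:,.0f} times the nucleus diameter")
```

In other words, the stretched-out DNA would be a few hundred thousand times longer than the nucleus that holds it.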

What’s All the Fuss About a Genome Sequence?

A genome sequence contains four massive books of information. First, it contains the detailed blueprint for how a fertilized egg develops into a complex multicellular organism, like a cow. The patterns of gene activities in cells underpin the amazing process of development, eventually leading to an array of very different and precisely positioned cell types in an individual. Second, inscribed in DNA is a detailed history book recording past events. It describes the evolution of a species, population migrations over large periods of time, and monumental battles with diseases and changing environments. Like all history books, it was written by the victors, i.e., the survivors who successfully reproduced and passed on their DNA to future generations. About 99% of all species that ever lived are extinct [4], but their ghostly genetic footprints lie everywhere in the genomes of modern-day species [5]. Third, the individual variations in the genome sequence provide a catalog of Who’s Who at the Zoo for today’s population. Nowhere is this better highlighted than in the solving of crimes using forensic DNA or the ability to trace food from the plate back to the pasture to guarantee food safety. Perhaps one of the most important applications is the use of genetic variations in agricultural production animals and plants for the accelerated selective breeding of individuals with the most desirable production traits. For example, genetic variants could be used to select bulls, even when immature, that are likely to sire cows producing more milk for less feed. Fourth, the genome contains a medical textbook that has recorded the results of gigantic natural experiments occurring throughout the ages. Scientists repeatedly note that the most variable regions of the genome often involve immune defense genes [5, 6]. The variations in these genes were enriched in the population by past battles with disease, and today they can provide clues for solutions to present-day diseases [5, 7, 8]. Genetic variations can also point to previously unsuspected genes affecting health [7, 8]. The key to deciphering and exploiting the information present in these books is a high-quality genome sequence, often called a “reference” genome sequence.

A Genome Sequence is the Ultimate Extreme Puzzle

Sequencing a “complete” genome for a complex life form, like a mammal, is not for the faint-hearted, even 14 years after the human genome was “completed” [9]. The word “completed” is a technical term that refers to a very low DNA sequence error rate and no omitted genomic regions. “Completion” is a tall order. Scientific consortia have “completed” the genome sequences for only about five complex life forms (human, mouse, fruit fly, a worm, and a yeast) [10]. As of 2017, Lewin and colleagues [10] noted that there were about 2,500 genome sequences for complex life forms, but only about 25 were higher-quality “reference” sequences. A good “reference” genome sequence is fundamentally important for understanding the unique biology of a species and for identifying genetic variations in its population that can be exploited for a wide range of purposes, including gene discovery and DNA variation-assisted selective breeding of the animal and plant species used in agriculture.

Koren and colleagues explain that scientists have struggled with the technical challenges required to improve the quality of “reference” genome sequences [1]. They note that the general strategy used for sequencing a genome is to produce millions of very small overlapping DNA sequences from one individual and then virtually assemble these sequences into fewer but longer sequences. The process, in theory, can lead to a complete genome sequence. However, the assembly process is confounded by three difficulties. First, some parts of the genome are recalcitrant to DNA sequencing, like a spoilt child determined not to cooperate. These regions can be conquered, but they require personalized attention, additional time, and more funding. Second, other sequences have near-identical copies present at thousands of different places in the genome. Hence, it is hard to know exactly where a specific repeat sequence goes in the genomic puzzle. It’s a little like trying to complete the picture puzzle entitled “polar bear in a snowy winter’s day.” Mastering repeats also requires additional tailored and expensive approaches. Third, genetic variation confounds the virtual assembly of short DNA sequences into a genomic sequence. This third difficulty is where Koren and nine colleagues made huge progress [1].
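The repeat problem described above can be illustrated with a toy example in Python. This is only a sketch under simplifying assumptions: the “genome” is an invented 20-letter string, matches must be exact, and the function name placements is hypothetical. Real genomes and real assembly software are vastly more complicated.

```python
def placements(genome, read):
    """Return every start position where the read matches the genome exactly."""
    span = len(read)
    return [i for i in range(len(genome) - span + 1) if genome[i:i + span] == read]


# Invented toy genome in which the short sequence GACA occurs three times.
toy_genome = "AAGACATTGACACCGACAGG"

# A short read that lies entirely inside the repeat fits in several places.
repeat_read = "GACA"
repeat_positions = placements(toy_genome, repeat_read)

# A longer read that also covers the unique sequence around one repeat copy
# fits in only one place.
spanning_read = "TTGACACC"
spanning_positions = placements(toy_genome, spanning_read)

print(repeat_read, "fits at positions", repeat_positions)
print(spanning_read, "fits at positions", spanning_positions)
```

The short read has three equally good homes, like a plain white puzzle piece, whereas the longer read that reaches into unique neighboring sequence has only one.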

Mom and Dad Improve Their Offspring’s Genome Sequence

Variety is the spice of life and so it is in genetics. Each human individual is about 99.9% identical to others in the population, but the 0.1% difference, often numbering millions of variations in the genome, is what makes a person genetically unique. Koren and colleagues explain that genetic variation is also rampant even within the individual, who is an amalgamation of the variations inherited from their parents [1]. These genetic differences in an individual make genome sequence assembly difficult. A human has 46 chromosomes consisting of 22 chromosomal pairs and a pair of sex chromosomes, XY or XX. One chromosome of a matching chromosomal pair in the offspring comes from mom and one from dad. Hence, the offspring contains genetic variations that may be different at the same relative position in a matched pair of chromosomes, a little like the British and American spellings of the same word. In production animals, one way to decrease the problem of genetic variation is to sequence the genome of a partially inbred animal (produced from related parents), which has markedly fewer genetic variations than an outbred animal. Inbreeding cannot be used for sequencing many species for a variety of practical reasons, and for humans, it is a major ethical issue. Moreover, the inbred individual is not representative of the wider population both in terms of its genetic variation and biology. Hence, its relevance to the form and function of other individuals is questionable. This is particularly true for some production animals that are the result of crosses between parents from very different breeds or breeding lines.
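The idea that the two chromosomes of a matched pair can differ at the same relative position can be made concrete with a small Python sketch. The two ten-letter sequences are invented for illustration, and the function name differing_sites is hypothetical. The sketch also assumes the two copies are the same length and already lined up, which real chromosomes are not.

```python
def differing_sites(from_mom, from_dad):
    """List (position, mom_letter, dad_letter) wherever two aligned copies differ."""
    if len(from_mom) != len(from_dad):
        raise ValueError("this toy comparison needs sequences of equal length")
    return [
        (position, m, d)
        for position, (m, d) in enumerate(zip(from_mom, from_dad))
        if m != d
    ]


# Invented copies of the same short chromosomal region, one from each parent.
copy_from_mom = "ACGTTAGCCA"
copy_from_dad = "ACGTCAGCTA"

variant_sites = differing_sites(copy_from_mom, copy_from_dad)
for position, mom_letter, dad_letter in variant_sites:
    print(f"position {position}: mom gave {mom_letter}, dad gave {dad_letter}")
```

An inbred animal would show fewer such sites, whereas the offspring of a cross between very different breeds would show more.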

Koren and colleagues inspirationally embraced genetic variation as a tool to improve “reference” genome sequences [1]. Unlike other strategies designed to address the same issue, their method becomes more efficient for assembling a genome sequence when there is greater genetic variation within an individual. Their approach is very smart. They first obtained a large number of short DNA sequences from each parent. They then took longer DNA sequences obtained from the offspring and matched these to the short DNA sequences from each parent. This strategy generated two “bins” of the long offspring DNA sequences, one containing offspring DNA sequences that came from mom and the other from dad. Using these “binned” DNA sequences, Koren and colleagues independently assembled each of the parent-specific offspring genomes (called haplotype-resolved genomes). For a human, these two genomes represented the DNA contributions from the 23 chromosomes present in the parental egg or sperm. The investigators then fully reconstructed the offspring genome by combining these two parent-specific genomes, which together represented the DNA in all 23 pairs of the offspring chromosomes.
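The binning step described above can be sketched in a few lines of Python. This is a minimal illustration and not the investigators’ software. It assumes that “matching” means looking for short subsequences of a fixed length k (often called k-mers) that appear in one parent’s short sequences but not in the other’s. All function names and the tiny DNA strings are hypothetical, and real data would also need handling of sequencing errors.

```python
def kmers(sequence, k):
    """Return the set of all length-k subsequences of a DNA sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def parent_specific_kmers(mom_reads, dad_reads, k):
    """Return (mom_only, dad_only): k-mers seen in one parent but not the other."""
    mom = set().union(*(kmers(read, k) for read in mom_reads))
    dad = set().union(*(kmers(read, k) for read in dad_reads))
    return mom - dad, dad - mom


def bin_read(read, mom_only, dad_only, k):
    """Assign one long offspring read to 'mom', 'dad', or 'unassigned'."""
    read_kmers = kmers(read, k)
    mom_hits = len(read_kmers & mom_only)
    dad_hits = len(read_kmers & dad_only)
    if mom_hits > dad_hits:
        return "mom"
    if dad_hits > mom_hits:
        return "dad"
    return "unassigned"  # no parent-specific evidence either way


# Invented toy data: the parents differ by a single letter in this region.
K = 4
mom_short_reads = ["ACGTACGA", "GTACGATT"]
dad_short_reads = ["ACGTTCGA", "GTTCGATT"]
offspring_long_reads = ["ACGTACGATT", "ACGTTCGATT", "CGATT"]

mom_only, dad_only = parent_specific_kmers(mom_short_reads, dad_short_reads, K)

bins = {"mom": [], "dad": [], "unassigned": []}
for long_read in offspring_long_reads:
    bins[bin_read(long_read, mom_only, dad_only, K)].append(long_read)

for label, reads in bins.items():
    print(label, reads)
```

Each bin could then be assembled separately. Notice that the more the parents differ, the more parent-specific subsequences there are to work with, which fits the observation above that the method becomes more efficient with greater genetic variation.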

New Method Shines

A new method has to jump the daunting bar of independent validation and versatility. Many fall heavily at this final hurdle. Koren and colleagues tested their method using three very different species: a small plant called Arabidopsis, a human, and the offspring from a parental cross between two cattle subspecies represented by the Brahman and Angus cattle breeds [1]. The method worked very well in all three circumstances. Perhaps the most striking results were obtained with the cattle cross, where 99% of the offspring-assembled genome could be directly attributed to a specific parent. Measures of the quality and accuracy of the parental contribution to the cow offspring genome sequence were very high. The investigators then compared each parental contribution to the offspring genome with a decade-old cow “reference” genome sequence. Surprisingly, the investigators’ comparisons revealed 3,178 instances where small parts of the old genome assembly were probably back to front. The crossword puzzle had errors. The investigators confirmed this problem using additional and independent analyses. The old cow “reference” genome sequence was not perfect, but it was very useful. Like an aging car, it is about to be replaced by a shiny new model.

Implications

Koren and colleagues’ new method can very efficiently dissect out the parental contributions to an offspring’s genome [1]. Importantly, the new genome sequence assembly method will markedly improve “reference” genome sequences for a range of species and will pay handsome dividends for all agricultural production animals. The investigators noted that the method will better sample the genetic variation derived from each parent. Their method will potentially greatly improve the accuracy of selective breeding in agricultural animal populations, guided by DNA sequence variations, to enhance desirable production traits and improve industry productivity. This aspect is particularly relevant where the commercial livestock population going to market is the result of a cross between parents from very different genetic backgrounds, e.g. crosses between different breeds of cattle. In this case, a commercially relevant cattle trait in the offspring may result from contributions from different genetic variations in the parents. The investigators also implied that their method is useful for characterizing specific immune-related regions of the human genome that normally are difficult to correctly sequence due to large-scale variations in the human population and within an individual. Some of these genomic regions are particularly important as they define the compatibility of individuals for organ transplantation. The accurate identification of parental contribution to the offspring genome will also aid the discovery of specific gene variants that affect the health of humans and animals. These genes may provide clues aiding the discovery of new ways to improve health. The trio of mom, dad, and an offspring, together, more easily solve the genomic puzzle. Sometimes, for puzzles, the process is just as important as the outcomes.

 

1. Koren S, Rhie A, Walenz BP, Dilthey AT, Bickhart DM, Kingan SB, et al. De novo assembly of haplotype-resolved genomes with trio binning. Nat Biotechnol. 2018.

2. Integrated DNA Technologies. Molecular facts and figures 2011 [Available from: http://sfvideo.blob.core.windows.net/sitefinity/docs/default-source/biotech-basics/molecular-facts-and-figures.pdf?sfvrsn=4563407_4].

3. Rands CM, Meader S, Ponting CP, Lunter G. 8.2% of the Human genome is constrained: variation in rates of turnover across functional element classes in the human lineage. PLoS Genet. 2014;10(7):e1004525.

4. Newman ME. A model of mass extinction. J Theor Biol. 1997;189(3):235-252.

5. Elsik CG, Tellam RL, Worley KC, Gibbs RA, Muzny DM, Weinstock GM, et al. The genome sequence of taurine cattle: a window to ruminant biology and evolution. Science. 2009;324(5926):522-528.

6. Tellam RL, Lemay DG, Van Tassell CP, Lewin HA, Worley KC, Elsik CG. Unlocking the bovine genome. BMC Genomics. 2009;10:193.

7. Timpson NJ, Greenwood CMT, Soranzo N, Lawson DJ, Richards JB. Genetic architecture: the shape of the genetic contribution to human traits and disease. Nat Rev Genet. 2018;19(2):110-124.

8. Kanai M, Akiyama M, Takahashi A, Matoba N, Momozawa Y, Ikeda M, et al. Genetic analysis of quantitative traits in the Japanese population links cell types to complex human diseases. Nat Genet. 2018;50(3):390-400.

9. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature. 2004;431(7011):931-945.

10. Lewin HA, Robinson GE, Kress WJ, Baker WJ, Coddington J, Crandall KA, et al. Earth BioGenome Project: Sequencing life for the future of life. Proc Natl Acad Sci U S A. 2018;115(17):4325-4333.