Continuous advances in sequencing technology have markedly refined genome sequencing in recent years, yielding significant achievements in both animal and plant genomics. Numerous draft and fine-scale plant genome maps have been published, supplying invaluable resources for the scientific community. This article analyzes the characteristics of third-generation sequencing technologies and systematically reviews progress in pre-sequencing preparation, genome assembly, annotation, and comparative genomics. It also describes the features and challenges specific to plant genome research. Whole-genome sequencing provides the genome sequence and key functional genes of a plant, which supports molecular studies of plant evolution, gene composition, and regulatory mechanisms and offers a reference for future plant genomic studies.
Whole-genome sequencing of plants is a large and influential undertaking made possible by advanced genomic technologies. It aims to elucidate the genetic blueprints of many important plant species and enables precise analysis of genetic variation and mutation at the population level, establishing a robust foundation for genome-level plant research. It thereby offers guidance and support for traditional research approaches.
Over the past two decades, significant advances have been made in whole-genome sequencing of both animals and plants. The launch of the Human Genome Project (HGP) in 1990 marked the advent of large-scale genomic DNA sequencing. By 2000, the preliminary completion of the human genome draft indicated that extensive DNA sequencing had become a routine method. Compared with animal genomes, however, plant genomes present distinct challenges. They are often characterized by polyploidy, considerable genome size, high heterozygosity, extensive repetitive sequences, and entirely or partially duplicated genome segments. Consequently, it was virtually impossible to sequence certain complex plant genomes using traditional Sanger sequencing or early second-generation sequencing technologies.
With the continuous advancements in sequencing technologies and the gradual reduction in associated costs, an increasing number of plant genome sequencing projects have been initiated and have yielded substantial results. The publication of the complete genome sequence of the model organism Arabidopsis thaliana in 2000 marked the commencement of comprehensive plant genome research. Subsequently, the completion of the rice (Oryza sativa) genome sequence in 2002, the first among cereal crops, established a crucial foundation for the exploration of gene annotation and the study of orthologous genes in other plant species. In-depth analyses of these genomic datasets have enhanced the understanding of critical issues pertaining to species growth, development, evolution, and origin. Moreover, these studies have expedited the discovery of novel genes and the process of species improvement, thereby paving the way for genome sequencing efforts in other plant taxa.
Over the past decade, genomic research has been reported for numerous plant species, including Populus (poplar), Vitis vinifera (grape), Sorghum bicolor (sorghum), Zea mays (maize), Cucumis sativus (cucumber), Glycine max (soybean), Ricinus communis (castor bean), Malus domestica (apple), Fragaria vesca (strawberry), Theobroma cacao (cocoa tree), Brassica rapa (Chinese cabbage), and Solanum tuberosum (potato). These advances have been facilitated by the rapid evolution and widespread application of sequencing technologies, which have substantially shortened the time required for whole-genome sequencing and reduced its cost. These studies have also refined research objectives and accelerated experimental design. Consequently, the physiological and biochemical mechanisms of plant growth and development can now be studied at the molecular level, providing new perspectives on gene structure, composition, function, and regulation and on species evolution.
Figure 1 illustrates the current progress in genome sequencing of various plant species. The x-axis represents the contig N50 of the genome assembly, while the y-axis displays the estimated genome size for each plant. Sequencing platforms are denoted by color: red for Roche 454, brown for Illumina, green for Oxford Nanopore, blue for PacBio SMRT, and pink for Sanger. Tea plants are highlighted with a rectangular box (Xia et al., 2020).
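Contig N50, the x-axis of Figure 1, is the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. The sketch below shows that calculation under this standard definition; the function name `contig_n50` and the input format (a plain list of contig lengths in bp) are illustrative assumptions, not something the source specifies.

```python
def contig_n50(lengths):
    """Return the N50 of a collection of contig lengths (in bp).

    Hypothetical helper for illustration: sorts contigs from longest to
    shortest and returns the length of the contig at which the running
    total first reaches half of the assembly length.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy assembly with eight contigs (lengths in arbitrary units).
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
toy_n50 = contig_n50(toy_contigs)
```

A larger N50 means that more of the assembly sits in long contigs, which is why it serves as a contiguity measure when comparing assemblies of similar total size.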
Table 1: Selected published plant whole-genome sequences
Species (Scientific Name) | Genome Size (bp; M = 10^6, G = 10^9) | Family, Genus | Sequencing Platform |
---|---|---|---|
Arabidopsis thaliana | 125M | Brassicaceae, Arabidopsis | Sanger construct BAC/TAC library |
Oryza sativa | 466M | Poaceae, Oryza | Sanger whole-genome shotgun |
Populus trichocarpa | 480M | Salicaceae, Populus | Sanger whole-genome shotgun |
Chlamydomonas reinhardtii | 130M | Chlamydomonadaceae, Chlamydomonas | Sanger whole-genome shotgun |
Vitis vinifera | 490M | Vitaceae, Vitis | Sanger whole-genome shotgun |
Carica papaya | 370M | Caricaceae, Carica | Sanger whole-genome shotgun |
Sorghum bicolor | 730M | Poaceae, Sorghum | Sanger whole-genome shotgun |
Zea mays | 2300M | Poaceae, Zea | Sanger clone-by-clone |
Cucumis sativus | 350M | Cucurbitaceae, Cucumis | Sanger + Illumina GA |
Glycine max | 1100M | Fabaceae, Glycine | Sanger whole-genome shotgun |
Brachypodium distachyon | 260M | Poaceae, Brachypodium | Sanger whole-genome shotgun |
Ricinus communis | 350M | Euphorbiaceae, Ricinus | Sanger whole-genome shotgun |
Malus domestica | 742M | Rosaceae, Malus | Sanger + 454 sequencer |
Fragaria vesca | 240M | Rosaceae, Fragaria | Roche/454, Illumina/Solexa |
Theobroma cacao | 430M | Malvaceae, Theobroma | Illumina whole-genome shotgun |
Solanum tuberosum | 844M | Solanaceae, Solanum | Illumina, 454 whole-genome shotgun |
Brassica rapa | 485M | Brassicaceae, Brassica | Illumina GA |
Cannabis sativa | 534M | Cannabaceae, Cannabis | Illumina HiSeq, 454 |
Juglans regia | 667M | Juglandaceae, Juglans | Illumina GA, HiSeq 2000 |
Setaria italica | 423M | Poaceae, Setaria | Illumina HiSeq 2000 |
Prunus armeniaca | 280M | Rosaceae, Prunus | Illumina GA |
Citrus sinensis | 367M | Rutaceae, Citrus | Illumina GAⅡ, WGS |
Citrullus lanatus | 425M | Cucurbitaceae, Citrullus | Illumina |
Hordeum vulgare | 5.1G | Poaceae, Hordeum | Illumina + Roche 454 |
Phyllostachys edulis | 2.05G | Poaceae, Phyllostachys | Illumina |
Triticum aestivum | 4.94G | Poaceae, Triticum | Illumina HiSeq |
Picea abies | 19.6G | Pinaceae, Picea | Whole-genome shotgun |
Nelumbo nucifera | 879M | Nelumbonaceae, Nelumbo | Illumina, 454 |
Populus euphratica | 497M | Salicaceae, Populus | Whole-genome shotgun |
Amborella trichopoda | 748M | Amborellaceae, Amborella | Roche 454, Illumina |
Plant Genome Sequencing and Assembly
To date, several hundred plant genomes have been sequenced and assembled, spanning model plants, cereal crops, horticultural species, oil crops, and bioenergy plants. In contrast to animal genomes, plant genomes are markedly complex, characterized by highly repetitive sequences, transcription factors, retrotransposons, and polyploidy. These features complicate both sequencing and assembly and introduce substantial uncertainty into the resulting genome.
Advances in sequencing technologies have substantially mitigated these challenges. The transition from Sanger sequencing to second-generation technologies, exemplified by the Illumina and Roche 454 platforms, made de novo sequencing of plant genomes faster and cheaper. Currently, third-generation single-molecule technologies, such as PacBio’s Single Molecule Real-Time (SMRT) sequencing, continue to drive down costs while improving efficiency and accuracy.
Pre-Sequencing Preparation and Strategic Selection
Before plant genome sequencing begins, it is essential to gather the available information on the species and to conduct a preliminary survey of the genome’s complexity. This preliminary (survey) sequencing aims to determine genome size and heterozygosity, the two factors that largely decide whether the project can proceed to full sequencing. Very large genomes (exceeding 10 Gb) place stringent demands on sequencing technology, assembly software, and computational memory, and are therefore hard to assemble successfully. High heterozygosity can also inflate the assembly beyond the actual genome size, because divergent copies of the same region from the two haplotypes may be assembled as separate sequences.
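The text does not say how survey reads are turned into a size estimate. One widely used approach, assumed here rather than taken from the source, counts k-mers in the survey reads and divides the total number of k-mers by the depth at the main peak of the k-mer depth histogram. The following toy, in-memory sketch illustrates the idea; the function names and the `min_depth` cut-off are hypothetical, it ignores reverse complements, and real projects rely on dedicated k-mer counting tools.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def estimate_genome_size(kmer_counts, min_depth=2):
    """Estimate genome size as total k-mers / peak k-mer depth.

    min_depth is an assumed cut-off that discards very rare k-mers,
    which in real data mostly come from sequencing errors.
    """
    # Histogram: depth -> number of distinct k-mers seen at that depth.
    histogram = Counter(kmer_counts.values())
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(usable, key=usable.get)
    total_kmers = sum(depth * n for depth, n in usable.items())
    return total_kmers / peak_depth


# Toy data: a 20 bp "genome" sequenced end to end ten times.
toy_genome = "ACGTTGCAAGGCTTACCGAT"
toy_reads = [toy_genome] * 10
toy_estimate = estimate_genome_size(count_kmers(toy_reads, k=5))
```

In a heterozygous genome the same histogram typically shows a second peak at about half the main depth, which is how survey data can also indicate heterozygosity; modelling that peak is beyond this sketch.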
If the heterozygosity of a species exceeds 0.5%, assembly becomes significantly more challenging. Above 1%, assembly is exceedingly difficult, which in turn complicates subsequent biological analyses.
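The thresholds above (heterozygosity of 0.5% and 1%, and a genome size of 10 Gb) can be written as a simple pre-screening rule. The sketch below is only an illustration of that rule of thumb; the function name, the label strings, and the treatment of values exactly at a threshold as not exceeding it are assumptions.

```python
def assess_assembly_difficulty(genome_size_gb, heterozygosity_pct):
    """Flag a survey result using the rule-of-thumb thresholds in the text.

    Returns (heterozygosity_label, is_large_genome). The labels are
    illustrative wording, not standard terminology.
    """
    if genome_size_gb <= 0 or heterozygosity_pct < 0:
        raise ValueError("genome size must be positive and heterozygosity non-negative")

    if heterozygosity_pct > 1.0:
        label = "very difficult"
    elif heterozygosity_pct > 0.5:
        label = "challenging"
    else:
        label = "no heterozygosity flag"

    # Genomes above 10 Gb strain sequencing, assembly software, and memory.
    is_large_genome = genome_size_gb > 10.0
    return label, is_large_genome


# Hypothetical survey result: a 19.6 Gb genome with 0.7% heterozygosity.
example = assess_assembly_difficulty(19.6, 0.7)
```

Such a rule only triages projects; the actual decision also depends on repeat content, ploidy, and the platforms available.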
Because plant genomes vary widely in size and complexity, several factors must be weighed when planning a plant genome sequencing project. First, the sequencing technology and the optimal read length must be chosen. Second, sufficient sequencing depth (genome coverage) must be ensured, and library insert sizes must be selected with care. Third, suitable assembly software must be chosen. The strategy set at the outset strongly affects how quickly and how completely the genome is finished, so selecting the appropriate sequencing method or platform is paramount.
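The relationship between read length, read count, and genome coverage can be made concrete with the usual mean-depth formula, depth = (number of reads × read length) / genome size. The sketch below applies it to the 125 Mb Arabidopsis thaliana genome from Table 1; the 150 bp read length and 50× target depth are illustrative assumptions, not recommendations from the source.

```python
import math


def mean_depth(n_reads, read_length_bp, genome_size_bp):
    """Mean sequencing depth: total sequenced bases divided by genome size."""
    return n_reads * read_length_bp / genome_size_bp


def reads_needed(genome_size_bp, read_length_bp, target_depth):
    """Smallest number of reads whose total bases reach the target depth."""
    return math.ceil(target_depth * genome_size_bp / read_length_bp)


# Assumed scenario: 125 Mb genome, 150 bp reads, 50x target depth.
genome_size = 125_000_000
n_reads = reads_needed(genome_size, read_length_bp=150, target_depth=50)
achieved = mean_depth(n_reads, 150, genome_size)
```

Mean depth is only an average; repeats and sequencing bias leave some regions well below it, which is one reason the text also stresses library insert size and platform choice.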
At present, because third-generation sequencing technologies are still at an early stage, mainstream research relies primarily on first-generation and second-generation sequencing. These approaches also require the construction of libraries, such as BAC (Bacterial Artificial Chromosome), Fosmid, and Cosmid libraries, and the sequencing of libraries with a range of insert sizes. For species with smaller genomes, platforms such as Roche 454 or Illumina (formerly known as “Solexa”) may be considered. For large, complex plant genomes, a combination of two or more sequencing platforms is recommended, because it yields a more accurate assembly and allows either a scaffold-level draft or a high-resolution genome map to be constructed.