Research Development in Plant Genome Sequencing

In recent years, continuous advances in sequencing technology have markedly improved genome sequencing, yielding significant achievements in both animal and plant genomic research. Numerous draft and finished plant genome maps have been published, supplying valuable resources for the scientific community. This article analyzes the characteristics of third-generation sequencing technologies and systematically reviews progress in pre-sequencing preparation, genome assembly, annotation, and comparative genomics. It also describes the distinctive features and challenges of plant genome research. Whole-genome sequencing gives researchers the genome sequence and key functional genes of a plant, which supports molecular investigations into plant evolution, gene composition, and regulatory mechanisms and provides a reference for future plant genomic studies.

Whole-genome sequencing of plants is a large and influential undertaking made possible by advanced genomic technologies. It aims to determine the genetic blueprints of many important plant species, and it enables precise analysis of genetic variation and mutation at the population level. This establishes a foundation for genome-level plant research that also guides and supports traditional research approaches.

Over the past two decades, significant advances have been made in whole-genome sequencing of both animals and plants. The launch of the Human Genome Project (HGP) in 1990 marked the advent of large-scale genomic DNA sequencing. By 2000, the preliminary completion of the human genome draft indicated that large-scale DNA sequencing had become a routine method. Compared with animal genomes, however, plant genomes present distinct challenges. They are often polyploid, large, and highly heterozygous, and they contain extensive repetitive sequences and wholly or partially duplicated genome segments. Consequently, it was virtually impossible to sequence certain complex plant genomes using traditional Sanger sequencing or early second-generation sequencing technologies.

With the continuous advancements in sequencing technologies and the gradual reduction in associated costs, an increasing number of plant genome sequencing projects have been initiated and have yielded substantial results. The publication of the complete genome sequence of the model organism Arabidopsis thaliana in 2000 marked the commencement of comprehensive plant genome research. Subsequently, the completion of the rice (Oryza sativa) genome sequence in 2002, the first among cereal crops, established a crucial foundation for the exploration of gene annotation and the study of orthologous genes in other plant species. In-depth analyses of these genomic datasets have enhanced the understanding of critical issues pertaining to species growth, development, evolution, and origin. Moreover, these studies have expedited the discovery of novel genes and the process of species improvement, thereby paving the way for genome sequencing efforts in other plant taxa.

Over the past decade, genome sequences have been published for numerous plant species, including Populus (poplar), Vitis vinifera (grape), Sorghum bicolor (sorghum), Zea mays (maize), Cucumis sativus (cucumber), Glycine max (soybean), Ricinus communis (castor bean), Malus domestica (apple), Fragaria vesca (strawberry), Theobroma cacao (cocoa tree), Brassica rapa (Chinese cabbage), and Solanum tuberosum (potato). This progress has been driven by the rapid evolution and widespread application of sequencing technologies, which have substantially shortened the time required for whole-genome sequencing and reduced its cost. These studies have also sharpened research objectives and accelerated experimental design. As a result, the physiological and biochemical mechanisms of plant growth and development can now be studied at the molecular level, providing new perspectives on gene structure, composition, function, regulation, and species evolution.

Figure 1. Current progress in plant genome sequencing.

Figure 1 illustrates the current progress in genome sequencing of various plant species. The x-axis represents the contig N50 of the genome assembly, while the y-axis displays the estimated genome size for each plant. Sequencing platforms are denoted by color: red for Roche 454, brown for Illumina, green for Oxford Nanopore, blue for PacBio SMRT, and pink for Sanger. Tea plants are highlighted with a rectangular box (Xia et al., 2020).
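Contig N50, the x-axis of Figure 1, is the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the total assembly length. The following is a minimal sketch of that calculation. The function name and the example contig lengths are hypothetical and are not taken from any assembly discussed here.

```python
def contig_n50(lengths):
    """Return the N50 of a collection of contig lengths (in bp).

    N50 is the length of the contig at which the cumulative length,
    counted from the longest contig downward, first reaches at least
    half of the total assembly length.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths, for illustration only (total = 300).
example_contigs = [80, 70, 50, 40, 30, 20, 10]
print(contig_n50(example_contigs))  # 80 + 70 = 150 reaches half of 300, so N50 = 70
```

A larger contig N50 generally indicates a more contiguous assembly, which is why it is a convenient single number for comparing the projects plotted in the figure.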

Table 1: Selected published whole-genome sequences of plants

Plant (scientific name) | Genome size | Family, Genus | Sequencing platform
--- | --- | --- | ---
Arabidopsis thaliana | 125 Mb | Brassicaceae, Arabidopsis | Sanger, BAC/TAC library
Oryza sativa | 466 Mb | Poaceae, Oryza | Sanger, whole-genome shotgun
Populus trichocarpa | 480 Mb | Salicaceae, Populus | Sanger, whole-genome shotgun
Chlamydomonas reinhardtii | 130 Mb | Chlamydomonadaceae, Chlamydomonas | Sanger, whole-genome shotgun
Vitis vinifera | 490 Mb | Vitaceae, Vitis | Sanger, whole-genome shotgun
Carica papaya | 370 Mb | Caricaceae, Carica | Sanger, whole-genome shotgun
Sorghum bicolor | 730 Mb | Poaceae, Sorghum | Sanger, whole-genome shotgun
Zea mays | 2300 Mb | Poaceae, Zea | Sanger, clone-by-clone
Cucumis sativus | 350 Mb | Cucurbitaceae, Cucumis | Sanger + Illumina GA
Glycine max | 1100 Mb | Fabaceae, Glycine | Sanger, whole-genome shotgun
Brachypodium distachyon | 260 Mb | Poaceae, Brachypodium | Sanger, whole-genome shotgun
Ricinus communis | 350 Mb | Euphorbiaceae, Ricinus | Sanger, whole-genome shotgun
Malus domestica | 742 Mb | Rosaceae, Malus | Sanger + Roche 454
Fragaria vesca | 240 Mb | Rosaceae, Fragaria | Roche 454, Illumina/Solexa
Theobroma cacao | 430 Mb | Malvaceae, Theobroma | Illumina, whole-genome shotgun
Solanum tuberosum | 844 Mb | Solanaceae, Solanum | Illumina, Roche 454, whole-genome shotgun
Brassica rapa | 485 Mb | Brassicaceae, Brassica | Illumina GA
Cannabis sativa | 534 Mb | Cannabaceae, Cannabis | Illumina HiSeq, Roche 454
Juglans regia | 667 Mb | Juglandaceae, Juglans | Illumina GA, HiSeq 2000
Setaria italica | 423 Mb | Poaceae, Setaria | Illumina HiSeq 2000
Prunus armeniaca | 280 Mb | Rosaceae, Prunus | Illumina GA
Citrus sinensis | 367 Mb | Rutaceae, Citrus | Illumina GA II, whole-genome shotgun
Citrullus lanatus | 425 Mb | Cucurbitaceae, Citrullus | Illumina
Hordeum vulgare | 5.1 Gb | Poaceae, Hordeum | Illumina + Roche 454
Phyllostachys edulis | 2.05 Gb | Poaceae, Phyllostachys | Illumina
Triticum aestivum | 4.94 Gb | Poaceae, Triticum | Illumina HiSeq
Picea abies | 19.6 Gb | Pinaceae, Picea | Whole-genome shotgun
Nelumbo nucifera | 879 Mb | Nelumbonaceae, Nelumbo | Illumina, Roche 454
Populus euphratica | 497 Mb | Salicaceae, Populus | Whole-genome shotgun
Amborella trichopoda | 748 Mb | Amborellaceae, Amborella | Roche 454, Illumina

Plant Genome Sequencing and Assembly

To date, comprehensive sequencing and assembly of several hundred plant genomes have been accomplished. These endeavors encompass a variety of model plants, cereal crops, horticultural species, oil crops, and bioenergy plants. In contrast to animal genomes, plant genomes exhibit significant complexities, characterized by highly repetitive sequences, transcription factors, retrotransposons, and polyploidy. These factors complicate the assembly and sequencing of plant genomes, introducing substantial uncertainty.

Advances in sequencing technologies have substantially mitigated these challenges. The transition from Sanger sequencing to second-generation platforms, exemplified by Illumina and Roche 454, enabled de novo sequencing at far higher throughput and lower cost per base. Currently, third-generation single-molecule sequencing technologies, such as PacBio's Single Molecule Real-Time (SMRT) sequencing, continue to drive down costs while improving efficiency and accuracy.

Pre-Sequencing Preparation and Strategic Selection

Before sequencing a plant genome, it is essential to gather existing information about the species and to conduct a preliminary survey of the genome's complexity. This survey sequencing aims to determine the genome's size and heterozygosity, both of which strongly influence whether it is feasible to proceed to the later sequencing phases. Very large genomes (exceeding 10 Gb) place heavy demands on sequencing technology, assembly software, and computer memory, which makes successful assembly harder. High heterozygosity can also produce an assembly that is larger than the true genome, because divergent copies of the same region may be assembled as separate sequences.
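The article does not say how genome size is derived from survey data. One widely used approach, assumed here for illustration, is k-mer frequency analysis: the total number of k-mers in the reads divided by the depth of the main peak in the k-mer depth histogram approximates the genome size. The sketch below uses a hypothetical function name, a hypothetical error cutoff (min_depth), and a toy histogram.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from a k-mer depth histogram.

    histogram maps a k-mer depth to the number of distinct k-mers
    observed at that depth. Depths below min_depth are treated as
    sequencing errors and ignored (the cutoff is an assumption).

    Returns (estimated_size, peak_depth).
    """
    usable = {depth: count for depth, count in histogram.items()
              if depth >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # Depth with the most distinct k-mers: taken as the main coverage peak.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences contributed by the usable part of the histogram.
    total_kmers = sum(depth * count for depth, count in usable.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram, for illustration only: an error tail at depths 1-2
# and a main peak centered on depth 20.
toy_histogram = {1: 5000, 2: 800, 18: 100, 19: 300, 20: 600, 21: 300, 22: 100}
size, peak = estimate_genome_size(toy_histogram)
print(size, peak)
```

In a highly heterozygous genome the histogram usually shows a second peak near half the main depth, and the tallest peak is not guaranteed to be the homozygous one, so this simple peak-picking rule should be checked against the histogram by eye.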

If the heterozygosity of a species exceeds 0.5%, assembly becomes significantly more challenging. Above 1%, assembly becomes exceedingly difficult, which complicates subsequent biological analyses.
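The thresholds quoted above (genome size over 10 Gb, heterozygosity over 0.5% and over 1%) can be written as a simple screening rule for survey results. This is only a sketch of the rule of thumb as stated in the text; the function name and the wording of the warnings are hypothetical.

```python
def assess_assembly_difficulty(genome_size_gb, heterozygosity_pct):
    """Return a list of warnings based on the thresholds given in the text.

    genome_size_gb: estimated genome size in gigabases.
    heterozygosity_pct: estimated heterozygosity in percent.
    An empty list means neither threshold was exceeded.
    """
    warnings = []
    if genome_size_gb > 10:
        warnings.append("very large genome (> 10 Gb)")
    if heterozygosity_pct > 1.0:
        warnings.append("heterozygosity > 1%: assembly exceedingly difficult")
    elif heterozygosity_pct > 0.5:
        warnings.append("heterozygosity > 0.5%: assembly challenging")
    return warnings


# Genome size for Picea abies is taken from Table 1; the heterozygosity
# values below are made up for illustration.
print(assess_assembly_difficulty(19.6, 0.3))
print(assess_assembly_difficulty(0.5, 1.2))
```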

Because plant genomes vary so much in size and complexity, several factors must be weighed when planning a sequencing project. First, the sequencing technology and the optimal read length must be chosen. Second, sufficient genome coverage must be ensured and suitable library insert sizes selected. Third, appropriate assembly software must be chosen. The strategy set at the start of a study strongly affects how far the genome can be completed, so choosing the right sequencing method or platform is paramount.
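Read length and genome coverage are linked by a simple relation: average depth equals the number of reads multiplied by the read length, divided by the genome size. The sketch below applies that relation to plan a run. The function names, the 150 bp read length, and the 50x target are assumptions for illustration; only the 466 Mb rice genome size comes from Table 1.

```python
import math


def reads_needed(genome_size_bp, read_length_bp, target_depth):
    """Number of reads required to reach target_depth average coverage."""
    if genome_size_bp <= 0 or read_length_bp <= 0 or target_depth <= 0:
        raise ValueError("all arguments must be positive")
    return math.ceil(target_depth * genome_size_bp / read_length_bp)


def expected_depth(n_reads, read_length_bp, genome_size_bp):
    """Average depth of coverage delivered by n_reads of a given length."""
    return n_reads * read_length_bp / genome_size_bp


# Rice genome size from Table 1 (466 Mb); read length and target depth
# are hypothetical planning values.
n = reads_needed(466_000_000, 150, 50)
print(n, expected_depth(n, 150, 466_000_000))
```

This gives only the average depth. Repetitive and duplicated regions, which are common in plant genomes, can remain poorly resolved even when the average looks sufficient.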

At present, because third-generation sequencing technologies are still at an early stage, mainstream research relies primarily on first-generation and second-generation sequencing. In this setting it is also necessary to construct large-insert libraries, such as BAC (Bacterial Artificial Chromosome), Fosmid, and Cosmid libraries, and to sequence libraries covering a range of insert sizes. For species with smaller genomes, platforms such as Roche 454 or Illumina (formerly known as "Solexa") may be considered. For large, complex plant genomes, it is advisable to combine two or more sequencing platforms to obtain a more accurate assembly, from which either a scaffold-level map or a high-resolution genome map can be constructed.
