Genome Assembly of a Chinese Soybean
De novo assembly and sequence analysis of a Chinese soybean genome
Soybean is one of the most important crops worldwide. A high-quality reference genome facilitates its functional analysis and molecular breeding. We previously assembled de novo a high-quality Chinese soybean genome, Gmax_ZH13. However, because of technical limitations at the time Gmax_ZH13 was generated, a large number of small contigs could not be anchored onto chromosomes. Here we therefore build a new golden reference genome for Zhonghuang 13, consisting of 20 nearly complete chromosomes, by adding more single-molecule real-time (SMRT) sequencing reads. We also add large RNA-seq and smRNA-seq datasets to improve the annotation of its protein-coding genes.
In this project, I was mainly responsible for tissue collection, library preparation, and downstream analysis of the RNA-seq and smRNA-seq data.
Main workflow
- Tissue collection
- Library preparation
- Protein-coding gene annotation
- MIRNA gene annotation
To improve the accuracy of gene annotation, 27 tissues were collected from Zhonghuang 13 plants during the 2018 growing season at the experimental station of the Institute of Genetics and Developmental Biology, Chinese Academy of Sciences. Each tissue was sampled in two replicates.
Total RNA was isolated from each sample using an RNA isolation kit (Tiangen, Beijing, China) according to the manufacturer’s protocol. RNA-seq libraries were constructed for each sample and sequenced on the Illumina HiSeq X Ten platform. smRNA-seq libraries were constructed and sequenced on the Illumina NextSeq 550AR platform. In total, 353 Gb of reads were produced.
The RNA-seq data, together with the Iso-seq reads used for annotation in the previous version, served as expressed sequence tag (EST) evidence for predicting protein-coding genes with MAKER, a widely used genome annotation pipeline. We annotated a total of 55,443 protein-coding genes containing 96,366 mRNAs in the nuclear genome. Of the 1,440 single-copy Embryophyta genes in BUSCO_v3, 97% were recovered as complete, confirming the high quality of the annotation. Of the newly annotated genes, 42,259 matched genes in the previous version and therefore inherited their IDs from Gmax_ZH13. In addition, we annotated 81 and 49 protein-coding genes in the chloroplast and mitochondrial genomes, respectively.
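The annotation figures above imply a few derived ratios that are easy to check by hand. The following is a minimal sketch that recomputes them from the quoted numbers only. The function and variable names are hypothetical, and the complete BUSCO count is an approximation because the text reports it as a rounded percentage.

```python
# Figures quoted in the annotation paragraph; nothing here is read from data files.
NUCLEAR_GENES = 55_443
NUCLEAR_MRNAS = 96_366
GENES_WITH_INHERITED_IDS = 42_259
BUSCO_TOTAL = 1_440
BUSCO_COMPLETE_FRACTION = 0.97  # rounded percentage, so derived counts are approximate
CHLOROPLAST_GENES = 81
MITOCHONDRIAL_GENES = 49


def annotation_summary():
    """Return ratios derived from the quoted annotation counts (hypothetical helper)."""
    return {
        # Average number of annotated transcripts per nuclear gene.
        "mrnas_per_gene": NUCLEAR_MRNAS / NUCLEAR_GENES,
        # Share of nuclear genes whose IDs carry over from Gmax_ZH13.
        "inherited_id_fraction": GENES_WITH_INHERITED_IDS / NUCLEAR_GENES,
        # Nuclear genes that did not match a gene in the previous version.
        "genes_without_inherited_id": NUCLEAR_GENES - GENES_WITH_INHERITED_IDS,
        # Approximate number of complete BUSCOs implied by the 97% figure.
        "approx_complete_buscos": round(BUSCO_COMPLETE_FRACTION * BUSCO_TOTAL),
        # Protein-coding genes annotated in the two organelle genomes combined.
        "organelle_genes": CHLOROPLAST_GENES + MITOCHONDRIAL_GENES,
    }


if __name__ == "__main__":
    for name, value in annotation_summary().items():
        print(f"{name}: {value:.3f}" if isinstance(value, float) else f"{name}: {value}")
```

Under these numbers, each nuclear gene carries about 1.74 annotated mRNAs on average, and roughly 76% of the nuclear genes keep an ID from Gmax_ZH13.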
To annotate MIRNA genes, small RNAs were sequenced from the same 27 samples used for protein-coding gene annotation. Using these reads, we annotated 331 MIRNA genes. Transcripts of these loci generated 349 pairs of miRNAs (miRNA-3p and miRNA-5p). Using the RNA-seq and smRNA-seq data from the 27 samples, we produced a detailed expression profile for all protein-coding genes and miRNAs.
Funding
Funding for the de novo assembly and sequence analysis of the genome has been generously provided by the following sponsors:
National Key Research & Development Program Grant #: 2017YFD0101305
National Natural Science Foundation Grant #: 31525018
State Key Laboratory of Plant Cell and Chromosome Engineering Grant #: PCCE-KF-2019-05