Gossypium arboreum (A2) 'SXY1' genome HAU_v1

Overview
Analysis Name: Gossypium arboreum (A2) 'SXY1' genome HAU_v1
Method: Illumina and PacBio
Source: (v1)
Date performed: 2021-08-08

About the assembly

In this study, we applied Oxford Nanopore sequencing technology to assemble the G. rotundifolium (K2*) 'K201', G. arboreum (A2) 'SXY1', and G. raimondii (D5) 'D502' genomes. The G. arboreum and G. raimondii genomes had previously been assembled de novo from Illumina and PacBio reads, but both assemblies contain a number of sequence gaps and needed improved contiguity. We generated a total of 304 Gb, 212 Gb, and 125 Gb of Nanopore sequencing data, corresponding to genome coverage of 124×, 131×, and 167× for K2*, A2, and D5, respectively. We assembled 3,593, 1,173, and 366 contigs for G. rotundifolium, G. arboreum, and G. raimondii, with total contig lengths of 2.44 Gb, 1.62 Gb, and 0.75 Gb, respectively (Table 1). These initial contigs were polished using Illumina paired-end reads with genome coverage of 108×, 118×, and 132× for K2*, A2, and D5, respectively. The contig N50 is 5.33 Mb, 11.69 Mb, and 17.04 Mb for K2*, A2, and D5, respectively, and the longest contigs are 32.72 Mb, 58.57 Mb, and 43.74 Mb. After polishing, we used high-throughput chromosome conformation capture (Hi-C) data to order and orient the contigs and construct pseudochromosomes for each species. In the Hi-C-assisted assembly, 2,559, 485, and 201 contigs were placed on the 13 chromosomes of the K2*, A2, and D5 genomes, accounting for over 99% of the genome length.

*Should be K12
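The coverage figures quoted above can be reproduced with simple arithmetic. The sketch below assumes that coverage was computed as Nanopore data volume divided by the assembled contig length from Table 1; the source does not state which genome size was used as the denominator, so treat this as an illustration, not the authors' method.

```python
# Nanopore data volume (bp) and total contig length (bp) per species.
# Data volumes come from the text; contig lengths come from Table 1.
# ASSUMPTION: coverage = sequenced bases / assembled contig length.
datasets = {
    "G. rotundifolium": (304e9, 2_444_364_209),
    "G. arboreum": (212e9, 1_621_008_062),
    "G. raimondii": (125e9, 750_197_587),
}


def coverage(sequenced_bp: float, genome_bp: float) -> float:
    """Return sequencing depth as sequenced bases per genome base."""
    return sequenced_bp / genome_bp


depths = {name: round(coverage(seq, size)) for name, (seq, size) in datasets.items()}

for name, depth in depths.items():
    print(f"{name}: ~{depth}x")
```

Under that assumption the rounded depths agree with the 124×, 131×, and 167× values reported in the text.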

 

Table 1. Summary of genome assemblies and annotations of G. rotundifolium, G. arboreum and G. raimondii.

| Genomic feature | G. rotundifolium 'Grot K201' | G. arboreum 'Shixiya1' | G. raimondii 'Grai D502' |
| --- | --- | --- | --- |
| Total length of contigs, bp | 2,444,364,209 | 1,621,008,062 | 750,197,587 |
| Total length of scaffolds, bp | 2,444,484,509 | 1,621,030,562 | 750,205,487 |
| Total length of gaps, bp | 120,300 | 22,500 | 7,900 |
| Percentage of anchoring | 99.28% | 99.47% | 99.57% |
| Percentage of anchoring and ordering | 93.16% | 98.84% | 99.01% |
| Number of contigs | 3,593 | 1,173 | 366 |
| Number of scaffolds | 2,390 | 948 | 287 |
| Contig N50, bp | 5,326,689 | 11,691,474 | 17,043,680 |
| Contig N90, bp | 621,066 | 2,910,421 | 3,537,560 |
| Scaffold N50, bp | 177,839,665 | 129,592,444 | 57,716,579 |
| Scaffold N90, bp | 115,394,628 | 93,157,762 | 49,929,625 |
| Maximum contig length, bp | 32,728,186 | 58,575,076 | 43,739,617 |
| Maximum scaffold length, bp | 205,722,655 | 143,367,608 | 63,188,200 |
| GC content | 36.38% | 35.16% | 33.23% |
| Percentage of repeat sequences | 80.92% | 68.05% | 57.04% |
| Number of genes | 41,590 | 41,778 | 40,820 |
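The length and count rows of Table 1 are internally consistent, which the following sketch checks. It assumes that the gap total equals scaffold length minus contig length and that every contig join inside a scaffold contributes one gap. The resulting 100 bp per gap is inferred from the numbers and is not stated in the source.

```python
# Values copied from Table 1.
assemblies = {
    "G. rotundifolium": dict(contig_bp=2_444_364_209, scaffold_bp=2_444_484_509,
                             gap_bp=120_300, contigs=3_593, scaffolds=2_390),
    "G. arboreum": dict(contig_bp=1_621_008_062, scaffold_bp=1_621_030_562,
                        gap_bp=22_500, contigs=1_173, scaffolds=948),
    "G. raimondii": dict(contig_bp=750_197_587, scaffold_bp=750_205_487,
                         gap_bp=7_900, contigs=366, scaffolds=287),
}

checks = {}
for name, a in assemblies.items():
    # Gap bases implied by the two length rows.
    gap_from_lengths = a["scaffold_bp"] - a["contig_bp"]
    # ASSUMPTION: each join between contigs in a scaffold is one gap.
    joins = a["contigs"] - a["scaffolds"]
    checks[name] = (gap_from_lengths == a["gap_bp"], a["gap_bp"] / joins)
    print(f"{name}: lengths agree={checks[name][0]}, bp per join={checks[name][1]:.0f}")
```

For all three assemblies the reported gap total matches the difference between scaffold and contig length.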

 

Supplementary Table 4. Comparison of the A2 genome with previously published genome versions.

| Genomic feature | HAU_A2 | WHU_A2 | CRI_A2 |
| --- | --- | --- | --- |
| Total assembled size, bp | 1,621,008,062 | 1,636,985,834 | 1,710,104,083 |
| Number of total scaffolds | 948 | 1,269 | 4,516 |
| Total length of gaps, bp | 22,500 | 116,300 | 3,730,000 |
| Contig N50, bp | 11,691,474 | 1,832,000 | 1,100,000 |
| Scaffold N50, bp | 129,592,444 | 118,841,821 | 113,035,596 |
| Scaffold N90, bp | 93,157,762 | 87,096,684 | 162,124 |
| Percentage of anchoring and ordering | 98.84% | 92.19% | 85.94% |
| Number of genes | 41,778 | 43,278 | 40,960 |
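The N50 and N90 rows in the tables above follow the usual contiguity definition: sort the sequences from longest to shortest and report the length at which the running total first reaches 50% (or 90%) of the assembly. A minimal sketch of that definition, using made-up toy lengths instead of the real contigs:

```python
def nxx(lengths, fraction=0.5):
    """Return the Nxx statistic for a collection of sequence lengths.

    Sequences are visited from longest to shortest; the result is the
    length of the sequence at which the cumulative sum first reaches
    `fraction` of the total length (0.5 gives N50, 0.9 gives N90).
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total * fraction:
            return length
    raise ValueError("lengths must not be empty")


# Hypothetical toy assembly, not taken from the source.
toy_contigs = [10, 8, 6, 4, 2]
print("N50:", nxx(toy_contigs, 0.5))
print("N90:", nxx(toy_contigs, 0.9))
```

Applied to the contig lengths in the FASTA download, the same function would be expected to give values comparable to the Contig N50 and N90 rows, provided scaffolds are first split at gaps.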

 

Publication

Wang, M. et al. Comparative genome analyses highlight transposon-mediated genome expansion and the evolutionary architecture of 3D genomic folding in cotton. Mol Biol Evol. 2021 May 11:msab128. doi: 10.1093/molbev/msab128.

Assembly

Chromosomes (pseudomolecules) and scaffolds for the Gossypium arboreum (A2) genome. This file belongs to the HAU G. arboreum Assembly v1.0.

Chromosomes & scaffolds (FASTA format) G.arboreum_HAU.fa.gz G.arboreum_HAU.fa.gz.md5
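Because the assembly is distributed with a companion .md5 file, a download can be verified before use. The sketch below assumes the .md5 file begins with the hexadecimal digest, which is the layout the `md5sum` utility writes; the actual layout of the CottonGen file is not shown in the source.

```python
import hashlib
from pathlib import Path


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_file(data_path, md5_path):
    """Compare a file's digest with the first token of its .md5 file.

    ASSUMPTION: the .md5 file starts with the hex digest, optionally
    followed by the file name.
    """
    expected = Path(md5_path).read_text().split()[0].lower()
    return md5sum(data_path) == expected


# Example call (file names as listed above, paths are hypothetical):
# matches_md5_file("G.arboreum_HAU.fa.gz", "G.arboreum_HAU.fa.gz.md5")
```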
Functional Analysis

Functional annotation files for the Gossypium arboreum HAU Genome v1.0 are available for download below. The Gossypium arboreum HAU Genome v1.0 proteins were analyzed using InterProScan in order to assign InterPro domains and Gene Ontology (GO) terms. Pathway analysis was performed using the KEGG Automatic Annotation Server (KAAS).

Downloads

GO assignments from InterProScan A2_HAU_v1_genes2GO.xlsx.gz
IPR assignments from InterProScan A2_HAU_v1_genes2IPR.xlsx.gz
Proteins mapped to KEGG Orthologs A2_HAU_v1_KEGG-orthologis.xlsx.gz
Proteins mapped to KEGG Pathways A2_HAU_v1_KEGG-pathways.xlsx.gz

 

Genes

Predicted gene models, their alignments, and proteins for the Gossypium arboreum (A2) genome. These files belong to the HAU G. arboreum Assembly v1.0.

Predicted gene models with exons (GFF3 format) G.arboreum_HAU.gff3.gz
Coding sequences, CDS (FASTA format) G.arboreum_HAU.cds.fa.gz
Protein sequences (FASTA format) G.arboreum_HAU.pep.fa.gz
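A quick way to sanity-check the gene model download is to tally feature types in the GFF3 file. This is a minimal sketch that relies only on the standard nine-column, tab-separated GFF3 layout; which feature type names the HAU file actually uses (for example `gene` or `mRNA`) is an assumption to confirm against the file itself.

```python
import gzip
from collections import Counter


def count_features(gff3_path):
    """Count GFF3 records by feature type (column 3).

    Works on plain or gzip-compressed files. Comment lines are skipped
    and parsing stops at an embedded ##FASTA section, if present.
    """
    opener = gzip.open if str(gff3_path).endswith(".gz") else open
    counts = Counter()
    with opener(gff3_path, "rt") as handle:
        for line in handle:
            if line.startswith("##FASTA"):
                break
            if line.startswith("#") or not line.strip():
                continue
            columns = line.rstrip("\n").split("\t")
            if len(columns) == 9:
                counts[columns[2]] += 1
    return counts


# Example call (path is hypothetical):
# count_features("G.arboreum_HAU.gff3.gz")["gene"]
```

If the file labels gene records as `gene`, that count can be compared with the 41,778 genes reported for G. arboreum in Table 1.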
Markers

Marker alignments were performed by the CottonGen Team of Main Bioinformatics Lab at WSU. The alignment tool 'BLAT' was used to map marker sequences from CottonGen to the Gossypium arboreum HAU genome assembly. Markers were required to align with 90% identity over 97% of their length. For SSRs and RFLPs, gap size was restricted to 1,000 bp or less, with fewer than 2 gaps. For dbSNPs and InDels, gap size was restricted to 2 bp, with fewer than 2 gaps. The available files are in GFF3 format. Markers available in CottonGen are linked to JBrowse.
 
CottonGen SNP markers mapped to genome G.arboreum_HAU-A2_SNP
CottonGen RFLP markers mapped to genome G.arboreum_HAU-A2_RFLP
CottonGen SSR markers mapped to genome G.arboreum_HAU-A2_SSR
CottonGen InDel markers mapped to genome G.arboreum_HAU-A2_InDel
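The marker filtering rules described above can be written down as a small predicate. This sketch is one reading of those rules, not the CottonGen pipeline: the `MarkerAlignment` class and its field names are hypothetical, the identity and length thresholds are treated as inclusive minimums, and the gap size limits are treated as inclusive maximums.

```python
from dataclasses import dataclass


@dataclass
class MarkerAlignment:
    """Hypothetical summary of one marker-to-genome alignment."""
    marker_type: str   # "SSR", "RFLP", "SNP" or "InDel"
    identity: float    # percent identity of the aligned region
    coverage: float    # percent of the marker length that aligned
    gap_count: int     # number of gaps in the alignment
    max_gap_bp: int    # size of the largest gap, in bp


# Gap size limits per marker type, as stated in the text.
MAX_GAP_BP = {"SSR": 1000, "RFLP": 1000, "SNP": 2, "InDel": 2}


def passes_filter(aln: MarkerAlignment) -> bool:
    """Apply the 90% identity, 97% length, and gap rules."""
    return (
        aln.identity >= 90.0
        and aln.coverage >= 97.0
        and aln.gap_count < 2
        and aln.max_gap_bp <= MAX_GAP_BP[aln.marker_type]
    )


# Toy alignments for illustration only.
print(passes_filter(MarkerAlignment("SSR", 95.0, 99.0, 1, 800)))   # within limits
print(passes_filter(MarkerAlignment("SNP", 99.0, 100.0, 1, 5)))    # gap too large for a SNP
```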

 

Protein Homology

Homology of the Gossypium arboreum HAU Genome v1.0 proteins was determined by pairwise sequence comparison using the blastp algorithm against various protein databases. An expectation value (E-value) cutoff of less than 1e-9 was used for the NCBI nr (Release 2018-05) database and 1e-6 for the Arabidopsis proteins (Araport11), UniProtKB/SwissProt (Release 2019-01), and UniProtKB/TrEMBL (Release 2019-01) databases. The best hit reports are available for download in Excel format.
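The cutoff-and-best-hit step described above could look like the following sketch. It assumes BLAST tabular output with the standard 12 columns (`-outfmt 6`), where column 11 is the E-value and column 12 the bit score, and it defines "best hit" as the passing hit with the highest bit score. Neither the output format nor the tie-breaking rule is stated in the source, and the identifiers in the toy data are invented.

```python
import csv

# E-value cutoffs per database, as stated in the text (hits must be below).
E_VALUE_CUTOFF = {"nr": 1e-9, "araport11": 1e-6, "swissprot": 1e-6, "trembl": 1e-6}


def best_hits(tabular_lines, database):
    """Return {query: (subject, evalue, bitscore)} for the best passing hit.

    ASSUMPTION: input rows follow the standard 12-column tabular layout.
    """
    cutoff = E_VALUE_CUTOFF[database]
    best = {}
    for row in csv.reader(tabular_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue >= cutoff:
            continue  # fails the database-specific cutoff
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Toy rows with hypothetical identifiers.
toy_rows = [
    "geneA\tsubj1\t80.0\t200\t40\t0\t1\t200\t1\t200\t1e-50\t180.0",
    "geneA\tsubj2\t90.0\t200\t20\t0\t1\t200\t1\t200\t1e-80\t250.0",
    "geneB\tsubj3\t40.0\t60\t36\t0\t1\t60\t1\t60\t1e-07\t45.0",
]
print(best_hits(toy_rows, "nr"))
print(best_hits(toy_rows, "swissprot"))
```

In the toy data, the third row passes the 1e-6 cutoff but not the stricter 1e-9 cutoff used for NCBI nr.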

 

Protein Homologs

G.arboreum HAU Genome v1.0 proteins with NCBI nr homologs (EXCEL file) A2_HAU_v1_vs_nr.xlsx.gz
G.arboreum HAU Genome v1.0 proteins with NCBI nr (FASTA file) A2_HAU_v1_vs_nr_hit.fasta.gz
G.arboreum HAU Genome v1.0 proteins without NCBI nr (FASTA file) A2_HAU_v1_vs_nr_noHit.fasta.gz
G.arboreum HAU Genome v1.0 proteins with Arabidopsis (Araport11) homologs (EXCEL file) A2_HAU_v1_vs_tair.xlsx.gz
G.arboreum HAU Genome v1.0 proteins with Arabidopsis (Araport11) (FASTA file) A2_HAU_v1_vs_tair_hit.fasta.gz
G.arboreum HAU Genome v1.0 proteins without Arabidopsis (Araport11) (FASTA file) A2_HAU_v1_vs_tair_noHit.fasta.gz
G.arboreum HAU Genome v1.0 proteins with SwissProt homologs (EXCEL file) A2_HAU_v1_vs_swissprot.xlsx.gz
G.arboreum HAU Genome v1.0 proteins with SwissProt (FASTA file) A2_HAU_v1_vs_swissprot_hit.fasta.gz
G.arboreum HAU Genome v1.0 proteins without SwissProt (FASTA file) A2_HAU_v1_vs_swissprot_noHit.fasta.gz
G.arboreum HAU Genome v1.0 proteins with TrEMBL homologs (EXCEL file) A2_HAU_v1_vs_trembl.xlsx.gz
G.arboreum HAU Genome v1.0 proteins with TrEMBL (FASTA file) A2_HAU_v1_vs_trembl_hit.fasta.gz
G.arboreum HAU Genome v1.0 proteins without TrEMBL (FASTA file) A2_HAU_v1_vs_trembl_noHit.fasta.gz

 

Transcript Alignments
Transcript alignments were performed by the CottonGen Team of Main Bioinformatics Lab at WSU. The alignment tool 'BLAT' was used to map transcripts to the G. arboreum genome assembly. Alignments with an alignment length of 97% and 90% identity were retained. The available files are in GFF3 format.

 

G. arboreum CottonGen RefTrans v1 G.arboreum_HAU-A2_g.arboreum_cottongen_reftransV1
G. barbadense CottonGen RefTrans v1 G.arboreum_HAU-A2_G.barbadense_cottongen_reftransV1
G. hirsutum CottonGen RefTrans v1 G.arboreum_HAU-A2_g.hirsutum_cottongen_reftransV1
G. raimondii CottonGen RefTrans v1 G.arboreum_HAU-A2_g.raimondii_cottongen_reftransV1
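The transcript alignment thresholds described above (97% alignment length, 90% identity) can be applied to raw BLAT output as in the sketch below. It assumes BLAT's default PSL format with 21 tab-separated columns and uses simple definitions: coverage is the aligned query span divided by query size, and identity is matching bases divided by aligned bases. The exact formulas used by the CottonGen team are not given in the source, so these are assumptions.

```python
def psl_metrics(psl_line):
    """Return (coverage_percent, identity_percent) for one PSL record.

    PSL columns used (0-based): 0 matches, 1 misMatches, 2 repMatches,
    10 qSize, 11 qStart, 12 qEnd.
    """
    fields = psl_line.rstrip("\n").split("\t")
    matches, mismatches, rep_matches = (int(fields[i]) for i in (0, 1, 2))
    q_size, q_start, q_end = (int(fields[i]) for i in (10, 11, 12))
    aligned = matches + mismatches + rep_matches
    coverage = 100.0 * (q_end - q_start) / q_size
    identity = 100.0 * (matches + rep_matches) / aligned
    return coverage, identity


def keep_alignment(psl_line, min_coverage=97.0, min_identity=90.0):
    """Apply the 97% length and 90% identity thresholds (inclusive)."""
    coverage, identity = psl_metrics(psl_line)
    return coverage >= min_coverage and identity >= min_identity


# Toy PSL records with hypothetical names and coordinates.
good = "\t".join(["970", "10", "0", "0", "0", "0", "1", "200", "+", "t1", "1000",
                  "10", "990", "Chr01", "5000000", "1000", "2180", "2",
                  "500,480,", "10,510,", "1000,1700,"])
short = "\t".join(["890", "10", "0", "0", "0", "0", "0", "0", "+", "t2", "1000",
                   "0", "900", "Chr01", "5000000", "3000", "3900", "1",
                   "900,", "0,", "3000,"])
print(keep_alignment(good), keep_alignment(short))
```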