Gossypium hirsutum (AD1) 'TM-1' genome UTX_v3.1

Overview
Analysis Name: Gossypium hirsutum (AD1) 'TM-1' genome UTX_v3.1
Method: PacBio, MECAT, ARROW
Source: (v3.1)
Date performed: 2025-03-19

Please Note: This genome assembly is made available through a "Reserved Analyses" restriction. Please see the Restrictions on Dataset Usage for further details OR contact Principal Collaborators:

Z. Jeffrey Chen (University of Texas at Austin) (email: zjchen AT austin DOT utexas DOT edu)
Jane Grimwood (HudsonAlpha Institute for Biotechnology) (email: jgrimwood AT hudsonalpha DOT org)

Data Overview
The v3.1 annotation release is based on genome assembly v3.0, a high-quality version of the Gossypium hirsutum genome sequenced from high-quality, large-molecule genomic DNA of G. hirsutum L. acc. TM-1, the same genotype that was used to construct the physical map and sequence BAC ends (Saski et al. 2017).

This work is supported by grants from the National Science Foundation (IOS1444552 and IOS1739092) to Z. Jeffrey Chen (PI), Jane Grimwood (Co-PI), Chris Saski (Co-PI), Brian Scheffler and Keith McGee (Co-PIs), and David Stelly (Co-PI); from USDA-ARS (6402-21310-004-11S and 6402-21310-004) to Daniel Peterson and Brian Scheffler; and from Cotton Incorporated (13-694, 13-965 and 14-371) to David Stelly, Jeremy Schmutz, and Z. Jeffrey Chen.

Genome Information
Assembly Source: NA
Assembly Version: v3.0
Annotation Source: NA
Annotation Version: v3.1
Total Scaffold Length (bp): 2,278,157,202
Number of Scaffolds: 249
Min. Number of Scaffolds containing half of assembly (L50): 10
Shortest Scaffold from L50 set (N50, bp): 106,525,396
Total Contig Length (bp): 2,277,507,202
Number of Contigs: 314
Min. Number of Contigs containing half of assembly (L50): 20
Shortest Contig from L50 set (N50, bp): 39,954,488
Number of Protein-coding Transcripts: 109,792
Number of Protein-coding Genes: 75,854
Percentage of Eukaryote BUSCO Genes: 99
Percentage of Embryophyte BUSCO Genes: 98.6
Full BUSCO results - Embryophyta (OrthoDB v9): 98.6
Full BUSCO results - Eukaryota (OrthoDB v9): 99
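
The L50 and N50 values in the table follow the usual definition: sequences are sorted from longest to shortest and their lengths are accumulated until half of the total length is reached. A minimal Python sketch of that calculation is shown here. The function name n50_l50 and the toy lengths are illustrative assumptions, not the real scaffold lengths or any released tool.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of sequence lengths.

    L50 is the smallest number of sequences that together cover at least
    half of the total length; N50 is the length of the shortest sequence
    in that set.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("lengths must be non-empty")


# Toy example (made-up lengths, not the TM-1 scaffolds).
toy_lengths = [50, 40, 30, 20, 10]
print(n50_l50(toy_lengths))  # (40, 2)
```

Applied to the real scaffold lengths, the same function should reproduce the scaffold L50 and N50 reported above, assuming the table uses this standard definition.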

Assembly
The main assembly consisted of 116.73x PacBio coverage (11,243 bp average read size); it was assembled using MECAT, and the resulting sequence was polished using ARROW. A total of 108,262 unique, non-repetitive, non-overlapping 1 kb syntenic markers were generated from the G. hirsutum v2.0 genome release (Chen et al., 2020) and aligned to the polished TM-1 assembly. Contig breaks were identified as abrupt changes in linkage group. A total of 3 breaks were made. The broken contigs were then ordered, oriented, and assembled into 26 chromosomes using the G. hirsutum v1 syntenic markers. A total of 212 joins were made during this process. The HiC library was aligned to the integrated chromosomes, and several minor rearrangements were made. Adjacent alternative haplotypes (altHaps) were identified on the joined contig set. AltHap regions were collapsed using the longest common substring between the two haplotypes. A total of 116 adjacent altHaps were collapsed. Contigs from G. hirsutum v2 were used to patch remaining gaps in the G. hirsutum v3 assembly. A total of 31 gaps were patched. Care was taken to ensure that telomeres were properly oriented in the chromosomes, and the resulting sequence was screened for retained vector and/or contaminants. Finally, homozygous SNPs and INDELs were corrected in the release sequence using ~55x of Illumina reads (2x150, 400 bp insert).
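
The notes above do not say how an "abrupt change in linkage group" was detected, so the following Python sketch is only one plausible reading. It assumes each contig has a position-sorted list of (position, linkage_group) marker hits and reports the interval between two consecutive runs of markers that map to different linkage groups. The function name find_breaks, the tuple layout, and the min_run noise threshold are all hypothetical.

```python
from itertools import groupby


def find_breaks(markers, min_run=3):
    """Locate candidate contig breaks from syntenic marker hits.

    markers: list of (position, linkage_group) tuples, sorted by position
             along a single contig.
    min_run: assumed minimum number of consecutive markers for a run to
             count as real signal rather than a stray alignment.
    Returns a list of (left_position, right_position) intervals inside
    which the linkage group switches.
    """
    runs = []
    for group, items in groupby(markers, key=lambda m: m[1]):
        items = list(items)
        runs.append((group, items[0][0], items[-1][0], len(items)))

    # Ignore short runs, which are more likely noise than a misjoin.
    solid = [run for run in runs if run[3] >= min_run]

    breaks = []
    for left, right in zip(solid, solid[1:]):
        if left[0] != right[0]:
            breaks.append((left[2], right[1]))
    return breaks


# Toy contig: three markers on A01 followed by three on D05.
toy = [(1000, "A01"), (2000, "A01"), (3000, "A01"),
       (4000, "D05"), (5000, "D05"), (6000, "D05")]
print(find_breaks(toy))  # [(3000, 4000)]
```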

Gene Prediction
Transcript assemblies were made from 2 x 150 bp stranded paired-end Illumina RNA-seq reads using PERTRAN, which conducts genome-guided transcriptome short-read assembly via GSNAP (Wu and Nacu, 2010) and builds splice alignment graphs after alignment validation, realignment and correction. A total of 315,806 transcript assemblies were constructed using PASA (Haas, 2003) from the RNA-seq transcript assemblies above. Loci were determined by transcript assembly alignments and/or EXONERATE alignments of proteins from Arabidopsis thaliana, grape, soybean, rice, sorghum, foxtail millet and Brachypodium distachyon genomes and Swiss-Prot proteomes to the G. hirsutum genome, which was repeat-soft-masked using RepeatMasker (Smit, 2013-2015), with up to 2 kb extension on both ends unless extending into another locus on the same strand. The repeat library consists of de novo repeats identified by RepeatModeler (Smit, 2008-2015) on the G. hirsutum genome and repeats in RepBase. Gene models were predicted by the homology-based predictors FGENESH+ (Salamov, 2000), FGENESH_EST (similar to FGENESH+, but using ESTs to compute splice site and intron input instead of protein/translated ORFs) and EXONERATE (Slater and Birney, 2005), and by PASA assembly ORFs (in-house homology-constrained ORF finder). The best-scored predictions for each locus were selected using multiple positive factors, including EST and protein support, and one negative factor: overlap with repeats. The selected gene predictions were improved by PASA; improvements include adding UTRs, correcting splicing and adding alternative transcripts. PASA-improved gene model proteins were subjected to protein homology analysis against the above-mentioned proteomes to obtain a Cscore and protein coverage. Cscore is the ratio of a protein's BLASTP score to the MBH (mutual best hit) BLASTP score, and protein coverage is the highest percentage of the protein aligned to the best of its homologs. PASA-improved transcripts were selected based on Cscore, protein coverage, EST coverage and the overlap of their CDS with repeats.
A transcript was selected if its Cscore was at least 0.5 and its protein coverage at least 0.5, or if it had EST coverage, provided that less than 20% of its CDS overlapped with repeats. For gene models whose CDS overlapped with repeats by more than 20%, the Cscore had to be at least 0.9 and the homology coverage at least 70% for selection. The selected gene models were subjected to Pfam analysis, and gene models whose protein was more than 30% in Pfam TE domains were removed, as were weak gene models. Incomplete gene models, gene models with low homology support and without full transcriptome support, and short single-exon gene models (< 300 bp CDS) with neither a protein domain nor good expression were manually filtered out.
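
The Cscore, coverage and repeat-overlap thresholds above can be written as a small decision rule. The Python sketch below is an illustration of the stated thresholds only, not the annotation pipeline's actual code. The Transcript class and its field names are hypothetical, and the text does not say how a CDS with exactly 20% repeat overlap is handled, so the sketch assumes it falls under the stricter rule.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    name: str
    cscore: float            # BLASTP score / mutual-best-hit BLASTP score
    protein_coverage: float  # fraction (0-1) of protein aligned to best homolog
    has_est_support: bool    # any EST/transcript coverage
    repeat_overlap: float    # fraction (0-1) of CDS overlapping repeats


def is_selected(t):
    """Apply the selection thresholds described in the text."""
    if t.repeat_overlap < 0.20:
        # Low repeat overlap: homology support or EST support is enough.
        return (t.cscore >= 0.5 and t.protein_coverage >= 0.5) or t.has_est_support
    # Repeat-rich CDS (assumption: exactly 20% is treated as repeat-rich):
    # require strong homology support.
    return t.cscore >= 0.9 and t.protein_coverage >= 0.70


examples = [
    Transcript("t1", 0.60, 0.55, False, 0.05),  # homology support, few repeats
    Transcript("t2", 0.10, 0.10, True, 0.10),   # EST support only
    Transcript("t3", 0.60, 0.60, True, 0.35),   # repeat-rich, weak homology
    Transcript("t4", 0.95, 0.80, False, 0.35),  # repeat-rich, strong homology
]
for t in examples:
    print(t.name, is_selected(t))
```

The later Pfam TE-domain filter and the manual filtering steps are not modeled here.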

 

Reference: Chen et al. Genomic diversifications of five Gossypium allopolyploid species and their impact on cotton improvement. Nat Genet. 20 April 2020.

Restrictions on Dataset Usage

Gossypium hirsutum genome v2.1 data are made available before scientific publication according to the Ft. Lauderdale Accord. By accessing these data, you agree not to publish any articles containing analyses of genes or genomic data on a whole-genome or chromosome scale prior to publication by the principal investigators of a comprehensive genome analysis without the consent of the project's investigators listed in Contacts below ("Reserved Analyses"). Reserved Analyses include the identification of complete (whole-genome) sets of genomic features such as genes, gene families, regulatory elements, repeat structures, GC content, or any other genome feature, and whole-genome- or chromosome-scale comparisons with other species. The embargo on publication of Reserved Analyses by researchers outside of the Gossypium hirsutum Genome Sequencing Project is expected to extend until the publication of the results of the sequencing project is accepted. Studies of any type on the reserved data sets that are not in direct competition with those planned by the principal investigators may also be undertaken after an agreement with the project's principal investigators. The assembly and sequence data should not be redistributed or repackaged without permission from the project's principal investigators.

We request that potential users of this sequence assembly contact the individuals listed under Contacts with their plans to ensure that the proposed usage of sequence data is not considered a Reserved Analysis.

Contacts

Principal Investigators:
Z. Jeffrey Chen (University of Texas at Austin) (email: zjchen@austin.utexas.edu)
Jane Grimwood (HudsonAlpha Institute for Biotechnology) (email: jgrimwood@hudsonalpha.org)

 

 

Assembly

The chromosomes (pseudomolecules) for the Gossypium hirsutum TM-1 genome. These files belong to the Phytozome Gossypium hirsutum v3.1.

Chromosomes (FASTA format) G.hirsutum_UTX-TM1_v3.0.fa.gz
Genes

The predicted gene models, their alignments and proteins for the Gossypium hirsutum TM-1 genome. These files belong to the Phytozome Gossypium hirsutum v3.1.

Predicted gene models with exons (GFF3 format) G.hirsutum_UTX-TM1_v3.1_genes_with_exon_gff3.gz
Predicted gene models (GFF3 format) G.hirsutum_UTX-TM1_v3.1_genes_gff3.gz
CDS sequences (FASTA format) G.hirsutum_UTX-TM1_v3.1_CDS.fa.gz
Transcript sequences (FASTA format) G.hirsutum_UTX-TM1_v3.1_transcript.fa.gz
Protein sequences (FASTA format) G.hirsutum_UTX-TM1_v3.1_protein.fa.gz
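
As a quick sanity check after download, the feature types in one of the gzipped GFF3 files can be tallied and compared with the gene and transcript counts in the Genome Information table. The Python sketch below is a minimal example that relies only on the standard nine-column, tab-separated GFF3 layout. The function name count_feature_types is illustrative, and the feature type labels actually used in these files (for example "gene" or "mRNA") are an assumption to verify against the data.

```python
import gzip
from collections import Counter


def count_feature_types(gff3_path):
    """Count GFF3 records by feature type (column 3) in a gzipped file."""
    counts = Counter()
    with gzip.open(gff3_path, "rt") as handle:
        for line in handle:
            if line.startswith("##FASTA"):
                break  # optional embedded sequence section; no more features
            if line.startswith("#") or not line.strip():
                continue  # skip directives, comments and blank lines
            columns = line.rstrip("\n").split("\t")
            if len(columns) >= 3:
                counts[columns[2]] += 1
    return counts


# Example usage (file name taken from the list above):
# counts = count_feature_types("G.hirsutum_UTX-TM1_v3.1_genes_gff3.gz")
# print(counts.most_common())
```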
Markers
Marker alignments were performed by the CottonGen Team of Main Bioinformatics Lab at WSU. The alignment tool 'BLAT' was used to map marker sequences from CottonGen to the Gossypium hirsutum TM-1 UTX v3.1 assembly. Markers required 90% identity over 97% of their length. For SSRs and RFLPs, gap size was restricted to 1000 bp or less, with fewer than 2 gaps. For dbSNPs and InDels, gap size was restricted to 2 bp, with fewer than 2 gaps. The available files are in GFF3 format. Markers available in CottonGen are linked to JBrowse.
 
CottonGen SNP markers mapped to genome AD1_UTX_v3.1_SNP
CottonGen RFLP markers mapped to genome AD1_UTX_v3.1_RFLP
CottonGen SSR markers mapped to genome AD1_UTX_v3.1_SSR
CottonGen InDel markers mapped to genome AD1_UTX_v3.1_InDel
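
The marker filtering criteria described above can be expressed as a simple predicate. The Python sketch below is an illustration rather than the CottonGen team's actual script. It assumes identity and length coverage have already been computed from the BLAT output as fractions between 0 and 1 (the exact formulas are not given in the text), and the function name, argument layout and marker type labels are hypothetical.

```python
# Maximum allowed size (bp) of a single alignment gap, per marker type.
GAP_LIMITS = {"SSR": 1000, "RFLP": 1000, "SNP": 2, "InDel": 2}


def passes_marker_filter(marker_type, identity, coverage, gap_sizes):
    """Return True if an alignment meets the stated marker criteria.

    identity:  fraction of aligned bases that match (assumed precomputed)
    coverage:  fraction of the marker length that is aligned
    gap_sizes: list with the size in bp of each gap in the alignment
    """
    max_gap = GAP_LIMITS[marker_type]
    return (
        identity >= 0.90
        and coverage >= 0.97
        and len(gap_sizes) < 2                     # fewer than 2 gaps
        and all(size <= max_gap for size in gap_sizes)
    )


print(passes_marker_filter("SSR", 0.95, 0.98, [500]))  # True
print(passes_marker_filter("SNP", 0.95, 0.98, [5]))    # False: gap too large
```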
Transcript Alignments
Transcript alignments were performed by the CottonGen Team of Main Bioinformatics Lab at WSU. The alignment tool 'BLAT' was used to map transcripts to the G. hirsutum TM-1 UTX v3.0 genome assembly. Alignments covering at least 97% of the transcript length with at least 90% identity were preserved. The available files are in GFF3 format.
G. arboreum CottonGen RefTrans v1 AD1_UTX_v3.1_g.arboreum_cottongen_reftransV1
G. hirsutum CottonGen RefTrans v1 AD1_UTX_v3.1_g.hirsutum_cottongen_reftransV1
G. barbadense CottonGen RefTrans v1 AD1_UTX_v3.1_g.barbadense_cottongen_reftransV1
G. raimondii CottonGen RefTrans v1 AD1_UTX_v3.1_g.raimondii_cottongen_reftransV1