Background & Summary

Tibetan sheep play a unique and essential role in China’s rural revitalization strategy. High-quality development of the Tibetan sheep industry is expected to improve social and economic development while maintaining the ecosystem stability and national security of the Qinghai–Tibetan Plateau. Germplasms of these sheep are considered valuable because of their unique biological characteristics, high genetic stability, stress resistance, meat quality, and carpet wool quality1,2,3. Additionally, having undergone long-term natural and artificial selection under specific ecological conditions, they are a model species for studying adaptations to extreme environments. Therefore, assembly of the Tibetan sheep reference genome at the chromosome level will substantially facilitate comparative and functional genomics research. Although draft genome assemblies of certain sheep species have been released, their quality metrics were low4,5,6,7,8. Furthermore, no comprehensive genomes of sheep from high-altitude environments (>4,000 m) are available.

Guide Black-Fur sheep (GD) are known for their black skins and spend most of the year grazing in their core habitat of Guide County, Hainan Tibetan Autonomous Prefecture of Qinghai, China (altitude ~4,100 m). GD sheep exhibit a variety of fleece colors, ranging from blackish-red to grey. Black skin contains more melanocytes than other skin colors and thus synthesizes more melanin, absorbing ultraviolet radiation and thereby reducing its damaging effects on organs9,10. This adaptation strategy has allowed GD to thrive in a robust ultraviolet environment.

In this study, we aimed to produced high-quality, chromosome-scale, de novo genome assemblies of the high-altitude environments GD using Illumina, PacBio, and chromosome conformation capture (Hi-C) sequencing. Comparisons with other mammalian genomes helped to identify rapidly evolving genes and elucidate the evolutionary history of Tibetan sheep.

Methods

Sample collection and sequencing

All animal experiments were performed under the guidance of ethical regulations from the Institutional Animal Care and Use Committee of Lanzhou Institute of Husbandry and Pharmaceutical Science of Chinese Academy of Agricultural Sciences (Approval No. NKMYD201805; Approval Date: 18 October 2018). We randomly selected one adult male GD (age: 3 years) from flocks raised under standard feeding regimes and with free access to water in Hainan Tibetan Autonomous Prefecture, Qinghai, China (altitude: ~4,100 m). Blood samples were collected from the jugular vein and then preserved in EDTA anti-freezing tubes at −20 °C. The animals was slaughtered by exsanguination after being deprived of food for 24 h. The brain, heart, liver, kidney, spleen, lung, and muscle tissues were excised within 30 min and stored in liquid nitrogen.

For PacBio long-read and short-read sequencing, genomic DNA was extracted from the blood and liver of the GD. The DNA was sequenced at Frasergen Bioinformatics (Wuhan, China) using the PacBio Sequel platform (Pacific Biosciences), yielding a total of 267.82 Gb of PacBio continuous long reads for corresponding to 98 × genomic coverage (Table 1). The DNA was re-sequenced using HiSeq X-Ten (Illumina, CA, USA) to correct long reads. Paired-end sequencing produced 263.48 Gb of short-read data corresponding to 96 × genomic coverage (Table 1).

Table 1 Sequencing data statistics.

The PacBio Iso-Seq method sequenced full-length transcripts via Single Molecule Real-Time (SMRT) sequencing technology. Total RNA was extracted from all tissues (brain, femur, spleen, back muscle, kidney, heart, lung, and liver) using the RN33 kit (Aidlabs Biotechnologies, Beijing, China). All required profiles were obtained using the same sequencing platform described for PacBio, producing 36.90 Gb of Iso-Seq reads (Table 1).

For Hi-C sequencing, fresh liver tissues were collected for library construction, which was performed as previously described11. Paired-end (2 × 150 bp) sequencing was conducted on the Illumina HiSeq X-Ten platform to abtain 494.40 Gb of data corresponding to 181 × genomic coverage (Table 1).

Genome assembly and polishing

All subreads from SMRT sequencing were used for de novo assembly of the GD genome. The draft genome assembly was obtained using MECAT2 with default parameters12. Variant calling with gcpp in the SMRT link 4 toolkit was performed to correct errors after the initial genome assembly. Next, initial assembly contigs from the previous step were mapped with corresponding HiSeq reads and polished once using Pilon (v1.22) to correct any remaining errors13. This gave a total of 1,972 primary contigs for GD, with an N50 size of 20.30 Mb (Table 2).

Table 2 Assembly statistics of Guide black fur sheep.

Hi-C scaffolding

Next-generation sequencing short reads were aligned to their assembly using the Burrows-Wheeler Aligner (BWA-MEM algorithm, v0.7.17) to further increase single base accuracy14. Purge Haplotigs was used to filter redundant sequences resulting from genome heterozygosity15. Pseudochromosomes were dcreated during Hi-C analysis, as described previously16. Briefly, all clean read pairs produced from the Hi-C library were mapped to the polished sheep contig assembly using the BWA-MEM algorithm with default parameters. LACHESIS was then used to cluster contigs into chromosome-level scaffolds based on the genomic proximity signal of Hi-C data17. After scaffolding, we obtained 27 chromosomes containing > 98% of the contigs (Table 3, Fig. 1). The final assembled genome (2.73 Gb) had a scaffold N50 of 107 Mb and 1,246 gaps in total (Table 2).

Table 3 Summary of the Hi-C grouping results in each pseudo chromosome.
Fig. 1
figure 1

Hi-C interaction heatmap for the GD genome. Color shading from light to dark indicates the increase in interaction intensit.

Annotation of repetitive sequences and genes

De novo and homology-based prediction methods annotated repeat sequences in the GD genome. Known transposable elements were identified by combining RepeatMasker, RepeatProteinMask, and RepeatModeler18. Repetitive transposable element (TE) sequences comprised ~52% of the total assembly of the genome, with long terminal repeat retrotransposons being the most abundant (~20.99%) (Table 4).

Table 4 Summary of genome repeats in the GD genome.

For gene annotation, we adopted a strategy combining ab initio gene finding, homology-based gene prediction, and Iso-Seq reads. First, the assembled GD genome was hard- and soft- masked by RepeatMasker. Second, ab initio gene prediction was performed using the Augustus (v3.3.1) and SNAP gene models, which were pre-trained using homologous proteins19. Third, Exonerate (v2.2.0) set to default parameters was used to predict genes from protein sequences20. Fourth, clean RNA-sequencing reads were assembled into transcripts via Trinity for RNA-based gene prediction, followed by further prediction of gene structure using PASA21. Finally, Maker (v3.00) was employed to integrate the three sets of prediction results22. In total, we identified 20,759 predicted protein-coding genes within the genome, representing 98.42% of all genes (Table 5). A comparison of gene features among GD, Texel sheep (Ovis aries)23, Rambouillet sheep (Ovis aries)24, San Clemente goats (Capra hircus)25, and humans (Homo sapiens)26 revealed similar length distributions for coding sequences, genes, exons, and introns (Fig. 2).

Table 5 Functional annotation statistics of the GD genome.
Fig. 2
figure 2

Comparison of gene features among the genomes of GD and four other animal species. Gene features include the lengths of genes, CDS, exons, and introns.

The tRNA-related genes were mainly identified by tRNAscan-SE (v1.3.1) and Infernal (v1.1.2) using default parameters18,27. A total of 256,815 non-coding RNA genes were predicted, comprising 528 miRNA, 231 rRNA, 264,021 tRNA, and 2,017 snRNA genes (Table 6).

Table 6 Information on non-protein-coding genes identified in the GD genome.

Data Records

Raw data from the long-read and short-read sequencing have been deposited in the NCBI Sequence Read Archive database with accession numbers SRR22290763 (Iso-seq) and SRR22585187 (whole-genome sequencing) under BioProject PRJNA89885228,29. The final chromosome assembly was deposited in the GenBank at NCBI with accession number: JBEJUG01000000030. The draft genome assembly and genome annotation were deposited in the Figshare database (https://doi.org/10.6084/m9.figshare.26013145)31.

Technical Validation

Evaluation of the genome assembly

Quality metrics included assessment of the completeness of each genome based on the proportion of single-copy conserved orthologs obtained and specific signs of mis-assembly. Benchmarking Universal Single-Copy Orthologs (BUSCO) software32 revealed that ~93% of the conserved genes were identified in the GD genome, confirming the completeness of the obtained assemblies. The 93.1% of single-copy conserved orthologs in this assembly was higher than those found in Hu sheep and Texel sheep23,33. The percentage of duplicated complete BUSCOs detected in the GD genome (1.9%) was comparable to the ranges observed across the genomes of Texel sheep and Rambouillet sheep (0.9%–1.6%)23,24. The GD genome also exhibited a lower percentage of missing BUSCOs (4.2%) compared with the Texel sheep genome (11.1%)23, suggestive of genomic integrity and the precise assembly of highly complex regions (Table 7).

Table 7 BUSCO completeness analysis of genome assembly.

Genome collinearity analysis

Genome synteny analysis of the GD, Texel, and Rambouillet sheep breeds was performed using the MUMmer tool with default parameters (filtering of delta sequences with the -1 parameter and removal of collinear fragments <10 kb). Single-nucleotide polymorphisms and variations between the three genomes were identified using the show-snps (-rT parameter) and show-diff (-rH parameter) utilities, respectively. Overall, the GD genome demonstrates strong collinearity with Texel and Rambouillet sheep. Some obvious insertions and inversions on chromosomes 1, 3, and 7 may be species-specific (Fig. 3).

Fig. 3
figure 3

Genome collinearity among the GD, Texel, and Rambouillet sheep breeds.