An ultra-high density SNP-based linkage map for enhancing the pikeperch (Sander lucioperca) genome assembly to chromosome-scale

de los Ríos-Pérez, Lidia; Nguinkal, Julien A.; Verleih, Marieke; Rebl, Alexander; Brunner, Ronald M.; Klosa, Jan; Schäfer, Nadine; Stüeken, Marcus; Goldammer, Tom; Wittenburg, Dörte

doi:10.1038/s41598-020-79358-z

Article
Open access
Published: 18 December 2020

Lidia de los Ríos-Pérez¹,
Julien A. Nguinkal²,
Marieke Verleih²,
Alexander Rebl²,
Ronald M. Brunner²,
Jan Klosa¹,
Nadine Schäfer²,
Marcus Stüeken³,
Tom Goldammer^2,4 &
Dörte Wittenburg¹

Scientific Reports volume 10, Article number: 22335 (2020)

Abstract

Pikeperch (Sander lucioperca) is a fish species with growing economic significance in the aquaculture industry. However, successful positioning of pikeperch in large-scale aquaculture requires advances in our understanding of its genome organization. In this study, an ultra-high density linkage map for pikeperch comprising 24 linkage groups and 1,023,625 single nucleotide polymorphism (SNP) markers was constructed after genotyping whole-genome sequencing data from 11 broodstock and 363 progeny belonging to 6 full-sib families. The sex-specific linkage maps spanned a total of 2985.16 cM in females and 2540.47 cM in males, with an average inter-marker distance of 0.0030 and 0.0026 cM, respectively. The sex-averaged map spanned a total of 2725.53 cM with an average inter-marker distance of 0.0028 cM. Furthermore, the sex-averaged map was used to improve the contiguity and accuracy of the current pikeperch genome assembly. Based on 723,360 markers, 706 contigs were anchored and oriented into 24 pseudomolecules, covering a total of 896.48 Mb and accounting for 99.47% of the assembled genome size. The overall contiguity of the assembly improved, with a scaffold N50 length of 41.06 Mb. Finally, an updated annotation of protein-coding genes and repetitive elements of the enhanced genome assembly is provided at NCBI.

Introduction

Pikeperch (Sander lucioperca) is a freshwater fish species from the Percidae family native to Europe and Asia^1,2. Its meat quality, with low fat content and high protein³, has placed it as a fish of high commercial value and a candidate for intensive inland aquaculture. In a period of 10 years, from 2007 to 2017, the global capture production of pikeperch increased from 17,891 to 20,481 tonnes, while the global inland aquaculture production increased from 627 to 1418 tonnes⁴, making evident the growing demand for this species.

Several studies have been performed in pikeperch concerning productive (e.g., growth and survival)^5,6,7 and reproductive (e.g., fecundity and spawning)^8,9 traits. However, despite the growing commercial importance of this species, little information is available regarding its genetic and genomic makeup. In 2018, the first high-density linkage map of pikeperch was built using specific locus amplified fragment sequencing (SLAF-seq). The map consisted of 8159 SLAFs including 8767 single nucleotide polymorphism (SNP) markers in 24 linkage groups (LGs) and spanned 3421.81 cM, with an average inter-marker distance of 0.46 cM¹⁰.

Linkage analysis of high-density genomic markers has facilitated the assembly of reference genomes by anchoring scaffolds, produced during de novo genome assembly, into linkage groups and providing a chromosome frame¹¹. The resulting linkage maps provide useful information or even the essential basis for the analysis of sex-related structural differences and inheritance patterns¹². Furthermore, linkage maps are often used for the detection of chromosomal locations of functional or disease genes and the identification of quantitative trait loci (QTLs) associated with economically important traits^13,14. Linkage maps have been produced for a number of fish species and used for different purposes. In common carp (Cyprinus carpio) and yellow drum (Nibea albiflora), high-density linkage maps were built for comparative genomic analysis and identification of QTLs for growth and sex-related traits^15,16. A linkage map produced in European whitefish (Coregonus sp. “Albock”) helped to investigate its genomic basis of adaptation and speciation¹⁷. Recently, in channel catfish (Ictalurus punctatus), a high-density linkage map was used for the construction of chromosome maps¹⁸.

With the rapid advances in next-generation sequencing technologies, an increasing number of sequencing and genotyping methodologies for SNPs have been developed, making it possible to rapidly discover a huge number of markers at relatively low cost^19,20. The challenge remains to arrange this large amount of genetic information into physical coordinates. The first highly contiguous draft genome assembly of pikeperch was published recently²¹. It contained ~900 Mb of total sequence, comprising 1966 contigs ordered into 1313 scaffolds. However, this first draft assembly is fragmented and requires improvement to chromosome scale. Genomes with accurate and complete architecture provide additional genomic context by orienting genes relative to each other and helping to determine other genomic features such as centromeres, telomeres, complex repeat elements and regulatory regions²². Assemblies with low integrity and completeness have been one of the major limitations to improving research in aquaculture species^23,24. Therefore, a linkage analysis is urgently required to build a basis for upgrading the current pikeperch genome and for developing breeding strategies in pikeperch aquaculture.

In this study, we report the construction of an ultra-high density linkage map for pikeperch based on the most common form of genetic variation, i.e., SNPs, and the improvement of the pikeperch genome assembly to chromosome-scale. The workflow described covers tissue sampling to raw sequence data, to finally yield a large panel of hard-filtered SNPs. A linkage map was constructed using the software Lep-Map3²⁵ suited to sequence data and capable of handling millions of markers, and the characteristics of the 24 resulting linkage groups are reported. The generated linkage map was then used to enhance the pikeperch genome assembly by anchoring and ordering its scaffolds into chromosome-scale pseudomolecules. The key genomic features were annotated for the enhanced pikeperch genome, including coding genes, non-coding RNA, and various repeat elements.

Results

Sequence processing and genotyping

A total of 90,416,509,334 paired-end reads (151 bp) from the 394 pikeperch samples were generated with an average number of 229,483,526 reads per sample and an average of 31.08-fold coverage. After trimming and quality filtering, a total of 87,771,258,936 paired-end reads were retained, with an average number of 222,769,693 reads per sample. The average percentage of properly paired reads was 96.43%. Although the Genome Analysis Toolkit v4.0 (GATK) variant calling pipeline²⁶ simultaneously discovers SNPs and Indels, we focused only on the SNPs and obtained a total of 1,619,874 SNPs after hard-filtering. For completeness, results for both types of variants are shown in Fig. 1.
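The per-sample averages quoted above follow directly from the read totals. The short sketch below re-derives them; the variable names are illustrative, and the sample count of 394 is taken from the text (19 broodstock samples plus 375 progeny, as described in the methods). The retained fraction is not stated in the text and is computed here only as a derived quantity.

```python
# Read totals as reported in the text.
raw_reads = 90_416_509_334
retained_reads = 87_771_258_936
n_samples = 394  # 19 broodstock samples + 375 progeny

# Mean number of reads per sample before and after trimming/filtering.
mean_raw = raw_reads / n_samples
mean_retained = retained_reads / n_samples

# Fraction of reads surviving trimming and quality filtering (derived, not reported).
retained_fraction = retained_reads / raw_reads

print(f"mean raw reads per sample:      {round(mean_raw):,}")
print(f"mean retained reads per sample: {round(mean_retained):,}")
print(f"retained fraction:              {retained_fraction:.2%}")
```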

Pedigree construction

Results from the pedigree showed that the 375 pikeperch sampled from the pool of progeny belonged to six of the seven matings performed at the fish facility. Four of the families were full-sib families, and the two other full-sib families together formed one paternal half-sib family. The number of progeny corresponding to each mating is shown in Table 1. The mating from which no progeny were found was reported to have had a very low number of eggs. Additionally, two more matings had extremely few progeny, which could be related to multiple factors, such as fertilization and hatching rate^27,28, stocking density²⁹, size sorting³⁰, and cannibalistic behaviour in early stages³¹, among others.

Table 1 Matings and number of individuals sampled from each family.

Linkage map construction

From the 1,563,541 initial biallelic variants, 1,478,421 were identified as informative out of which 91,252 were discarded after filtering by segregation distortion and after allowing at most 10% of missing genotypes. Hence, a total of 1,387,169 variants were kept for further analysis. A range of logarithm of odds (LOD) scores from 5 to 70 incrementing by 5 was tested for linkage grouping. A LOD score of 50 resulted in 24 LGs that were expected to match to the 24 chromosomes observed in karyotype studies in pikeperch^32,33. In total, 1,023,625 SNPs were uniquely assigned to the 24 LGs and ordered to generate the female, male and sex-averaged linkage maps (Table 2, Fig. 2). The number of SNPs per LG ranged from 28,022 to 59,051 with an average of 42,651 markers per LG. In total, 863 out of 1313 scaffolds were involved, covering 894.02 Mb of the total genome length of 900.48 Mb. The number of SNPs per scaffold ranged from 1 to 25,495 with mean 1186. Out of the 863 scaffolds, 65 had only one SNP and 15 had more than 10,000 SNPs, while all magnitudes in between were almost equally represented: 136 scaffolds contained two to 10 SNPs, 165 scaffolds included 11 to 100 SNPs and 1001 to 10,000 SNPs were found in 209 scaffolds.
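The marker counts in this paragraph can be cross-checked with a short tally. This is only a bookkeeping sketch with illustrative variable names, not the authors' pipeline. Note that the text gives no count for scaffolds carrying 101 to 1000 SNPs; the remainder computed below is what would be left for that class under the assumption that the listed size classes are exhaustive.

```python
# Counts as reported in the text.
informative = 1_478_421
discarded = 91_252            # segregation distortion / >10% missing genotypes
kept = informative - discarded

assigned_to_lgs = 1_023_625
n_lgs = 24
scaffolds_with_snps = 863

# LOD thresholds tested for linkage grouping: 5 to 70 in steps of 5.
lod_grid = list(range(5, 75, 5))

mean_per_lg = assigned_to_lgs / n_lgs
mean_per_scaffold = assigned_to_lgs / scaffolds_with_snps

# Scaffold size classes for which the text gives a count.
listed_classes = {"1": 65, "2-10": 136, "11-100": 165, "1001-10000": 209, ">10000": 15}
# Remainder, ASSUMING the classes are exhaustive (not stated in the text).
unlisted_remainder = scaffolds_with_snps - sum(listed_classes.values())

print(kept, lod_grid, round(mean_per_lg), round(mean_per_scaffold), unlisted_remainder)
```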

Table 2 Description of the female, male and sex-averaged linkage maps. LG: linkage group, cM: centiMorgan, F:M: female:male.

The SNPs on the female map were arranged on 7805 distinct positions with observed recombination events, constituting a total genetic length of 2985.16 cM. The genetic length of LGs ranged from 85.79 cM (LG22) to 176.19 cM (LG12) with an average length of 124.38 cM. The average inter-marker distance was 0.0030 cM, with the smallest and largest distances being 0.0022 cM (LG1) and 0.0039 cM (LG18 and LG19). The largest gap between adjacent markers was 22.77 cM (LG15).

The SNPs on the male map were arranged on 3917 distinct positions with observed recombination events, constituting a total genetic length of 2540.47 cM. The genetic length of LGs ranged from 80.88 cM (LG7) to 145.25 cM (LG6) with an average length of 105.85 cM. The average inter-marker distance was 0.0026 cM, with the smallest and largest distances being 0.0015 cM (LG2) and 0.0049 cM (LG22). The largest gap between adjacent markers was 53.73 cM (LG22). The female:male (F:M) length ratio for the LGs varied from 0.60 (LG22) to 1.74 (LG2), with an average of 1.21. The LGs showed different recombination activities between the female and male maps; Figure S1 shows their non-linear relationship. Furthermore, 18 LGs showed larger genetic distances in females than in males. In contrast, three LGs (LG5, LG8 and LG22) showed larger genetic distances in males than in females. Three LGs (LG6, LG13 and LG21) had approximately the same length in both sexes.

The SNPs on the sex-averaged map were arranged on 11,459 distinct positions with a total genetic length of 2725.53 cM. The genetic length of the LGs ranged from 86.59 cM (LG20) to 144.61 cM (LG6) with an average length of 113.56 cM. The average inter-marker distance was 0.0028 cM, with the smallest and largest distances being 0.0019 cM (LG1) and 0.0037 cM (LG19, LG21 and LG22). The largest gap between adjacent markers was 19.97 cM (LG22).
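As a quick consistency check, the average LG lengths reported for the three maps can be recovered from the total map lengths and the 24 linkage groups. The sketch below uses illustrative names. It also computes the ratio of the total female and male lengths, which is not the same quantity as the reported mean F:M ratio of 1.21; the latter is an average of per-LG ratios, whose individual values (Table 2) are not reproduced here.

```python
# Total genetic lengths in cM, as reported in the text.
total_cm = {"female": 2985.16, "male": 2540.47, "sex_averaged": 2725.53}
n_lgs = 24

# Average linkage-group length per map.
mean_lg_cm = {name: total / n_lgs for name, total in total_cm.items()}

# Ratio of total lengths (differs from the mean of per-LG ratios).
total_fm_ratio = total_cm["female"] / total_cm["male"]

for name, value in mean_lg_cm.items():
    print(f"{name:>13}: mean LG length = {value:.2f} cM")
print(f"ratio of total female to total male length = {total_fm_ratio:.3f}")
```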

Genome assembly and annotations

The generated de novo assembly consisted of 1602 contigs with an N50 size of 6.3 Mb, which is more than a twofold improvement over the previously published draft assembly (GenBank accession: PRJNA561467). The integrated chromosome-scale assembly yielded 336 scaffolds with an N50 size of 41.06 Mb, of which the 24 largest scaffolds represented the putative 24 pikeperch chromosomes and covered 896.48 Mb (99.47%) of the assembly size. Only 4.74 Mb (0.53%) could not be anchored into pseudomolecules. The average accuracy at base level was 99.9996% (i.e., fewer than 1 error per 100 kb). Over 99.80% of the genomic paired-end reads mapped to the improved assembly, with 97.50% of them mapping concordantly. Moreover, from a total of 4584 actinopterygian core genes, BUSCO assessment recovered 94.50% as full-length single-copy, 2.23% as duplicated, 1.59% as fragmented and 1.68% as missing, indicating that most genes were accurately assembled (Table 3).
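For readers unfamiliar with the N50 statistic used above, it is the length of the sequence at which the cumulative length of sequences, sorted from longest to shortest, first reaches half of the total assembly length. The function below is a minimal sketch of that standard definition, shown on a made-up toy list of lengths rather than on the pikeperch scaffolds. The anchored fraction is re-derived from the two figures given in the text.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # First sequence at which the cumulative length reaches half the total.
        if running * 2 >= total:
            return length
    raise ValueError("n50() requires at least one sequence length")

# Toy example (hypothetical lengths, arbitrary units): total = 30, half = 15.
toy_lengths = [2, 10, 6, 8, 4]
print(n50(toy_lengths))

# Anchored fraction from the reported figures (Mb).
anchored_mb = 896.48
unanchored_mb = 4.74
anchored_fraction = anchored_mb / (anchored_mb + unanchored_mb)
print(f"{anchored_fraction:.2%}")
```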

Table 3 Comparison of statistics between our chromosome-scale assembly and the first published pikeperch draft assembly (GenBank accession PRJNA561467). Genome annotation metrics were taken from Nguinkal et al. (2019)²¹. Differences between the statistic results shown in this table and NCBI are due to the use of different genome annotation services.

Homology and structure-based approaches were used for functional annotation of protein-coding genes. We found 31,234 genes (93.36% of protein-coding genes) with at least one significant hit in one of the functional databases queried. The predicted non-coding genes included 2345 transfer RNA (tRNA), 160 ribosomal RNA (rRNA) and 145 microRNA (miRNA) (Table 3).
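The annotated fraction quoted above can be reproduced from the gene counts, using the total of 33,456 protein-coding genes reported for the consensus gene models. The sketch also checks that the four BUSCO categories given earlier account for all core genes. Variable names are illustrative.

```python
# Gene counts as reported in the text.
annotated_genes = 31_234
protein_coding_genes = 33_456
annotated_pct = 100 * annotated_genes / protein_coding_genes

# BUSCO categories (percent of 4584 actinopterygian core genes).
busco_pct = {"single_copy": 94.50, "duplicated": 2.23, "fragmented": 1.59, "missing": 1.68}
busco_total = sum(busco_pct.values())

print(f"functionally annotated: {annotated_pct:.2f}%")
print(f"BUSCO categories sum to {busco_total:.2f}%")
```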

Repetitive sequences accounted for ~37% of the assembled genome and spanned 334 Mb in total, which is in line with the repeat content reported in other Percidae fish³⁴. With more than 250 Mb (27.76% of assembly size), DNA transposons and retroelements were the most abundant types of repeats found in the pikeperch genome. In particular, long interspersed nuclear elements (LINEs), long terminal repeat (LTR) elements and hobo-Activator elements occupied 10.16%, 3.22% and 4.94%, respectively, of the assembled genome (Fig. 3a).

The obtained consensus gene models included a total of 33,456 high-quality protein-coding genes, which was substantially higher than the number found in the previously published draft assembly (GenBank accession: PRJNA561467). The average length of coding sequences (CDS) was 1451 bp. On average, each S. lucioperca gene had 7.8 exons, each with an average length of 156 bp. About 82% of the 278,346 exonic sequences were < 200 bp. Introns showed an average length of 2276 bp, with 2% of them having a length of > 10 kb. Moreover, the total length of intronic and exonic DNA on each chromosome was significantly correlated with the chromosome size, with correlation coefficients of R = 0.78 and R = 0.81, respectively (Fig. 3b,c). Consequently, the gene content per chromosome was also significantly correlated with the chromosome size, with a correlation coefficient of R = 0.96 (Fig. 3d). Overall, the distribution of CDS length, intron length and exon number is comparable with other percid genomes³⁴. The 24 chromosomes were sorted by physical size, from largest to smallest, and named accordingly (Table 4, Fig. 4). Given a genome-wide average of 40 genes per Mb, chromosomes 21 and 23 displayed the highest and lowest gene density with 52 and 34 genes per Mb, respectively. Additionally, we observed a putative nucleolus organizer region (NOR) on chromosome 7, which had already been observed in a previous cytogenetic analysis of pikeperch³⁵.

Table 4 Description of chromosomes ordered by size with corresponding LG. LG: linkage group, Mb: Megabase.

A liftover of the SNPs assigned to LGs to the chromosome-scale build yielded a panel of 992,340 genome-wide reference SNPs for pikeperch (Table S2). In total, 31,278 SNPs failed to map to the chromosome-scale assembly and 7 duplicated SNPs were removed.

Discussion

We reported the construction of an ultra-high density SNP-based linkage map for pikeperch, and the further anchoring of the genome assembly into the first chromosome-scale assembly.

Our map comprised 24 linkage groups, with a total of one million SNP markers, and spanned between 2500 and 3000 cM for the female, male and sex-averaged maps. In order to obtain a high-quality linkage map, we strictly filtered the data and finally retained 1.6 million SNP markers from sequence data. Roughly 600,000 SNPs could not be assigned to any LG. The female map was slightly longer than the male map, with an overall F:M length ratio of 1.21, though some LGs harboured extreme differences between sexes (F:M length ratio up to 1.74). This result was consistent with the only other linkage map reported in pikeperch, where the female map was also found to be longer than the male map, with an overall F:M length ratio of 1.62 (4179.41 cM vs. 2582.83 cM)¹⁰. Our results are also consistent with the pattern between sexes in several teleost fish species like red-spotted grouper (1.47 F:M length ratio³⁶), Pacific bluefin tuna (1.34 F:M length ratio³⁷) and barramundi (2.1 F:M length ratio³⁸).

Average inter-marker distances were between 0.0026 cM and 0.0030 cM for the female, male and sex-averaged maps, giving a resolution more than 100 times higher than that of the linkage map published by Guo et al.¹⁰. Additionally, our linkage map was based on about one million SNPs from whole-genome sequencing of 6 full-sib families comprising 363 progeny, while the map derived by Guo et al. was built using 8767 SNPs from a single family with 150 progeny. Though the total length of the male map was almost equal in both studies, the female map was 1.4 times longer in the earlier study. However, a larger mapping population, an enormously increased number of markers, and thus a substantially smaller average inter-marker distance support a more precise estimation of the genetic distances. Because of the close proximity of SNPs, recombination events rarely happened within scaffolds, and this was manifested by genetic positions hardly differing within long stretches (see Supplementary Table S1). Though such marker density is beneficial at the large scale, ordering of markers at the fine scale might be insufficient based on linkage analysis only¹¹. However, high-quality linkage maps are a valuable source for the correct placement of scaffolds into chromosomes³⁹. Our ultra-high density linkage map was used to anchor the genome scaffolds into a chromosome-scale assembly. Compared to the previous genome assembly²¹, the scaffold N50 length increased from 4.9 Mb to 41.06 Mb, with the pseudomolecules covering 896.48 Mb (99.47%) of the assembly size. This new chromosome-scale genome assembly represents an important resource to fill the gap in the Percidae family tree, where Luciopercinae (Sander spp.) was the only sub-family missing a chromosome-level assembly (according to NCBI query: June 2020).
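The comparisons with the earlier map can be made explicit with a few ratios computed from figures quoted in this article (0.46 cM average spacing and 4179.41/2582.83 cM female/male lengths for the earlier map). The names below are illustrative, and the fold changes are derived values rather than numbers reported by the authors, apart from the 1.4-fold female map difference.

```python
# Average inter-marker distances in cM.
earlier_spacing = 0.46
this_study_spacing = {"female": 0.0030, "male": 0.0026, "sex_averaged": 0.0028}

# Fold gain in marker resolution relative to the earlier map.
resolution_gain = {k: earlier_spacing / v for k, v in this_study_spacing.items()}

# Map length ratios, earlier study relative to this study.
female_length_ratio = 4179.41 / 2985.16
male_length_ratio = 2582.83 / 2540.47

# Fold increase in scaffold N50 (Mb).
n50_gain = 41.06 / 4.9

for name, gain in resolution_gain.items():
    print(f"{name:>13}: about {gain:.0f}x denser than the earlier map")
print(f"female map length ratio (earlier/this study): {female_length_ratio:.2f}")
print(f"male map length ratio (earlier/this study):   {male_length_ratio:.2f}")
print(f"scaffold N50 increase: {n50_gain:.1f}x")
```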

Anchoring of scaffolds into chromosomes was performed with the Chromonomer software (http://catchenlab.life.illinois.edu/chromonomer/). Alternative software well suited to this task, such as Lep-Anchor⁴⁰ or ALLMAPS⁴¹, offers potential for further improvement.

The conversion of the genomic positions of SNPs into the chromosome-scale assembly successfully lifted over 96.94% of the markers. The remaining 3.06% (31,278) of the SNPs could not be lifted over because they resided in contigs that only existed in the older assembly build or because of sequence incompatibilities between the assemblies, such as mismatching reference alleles, e.g., a variant that was considered an alternate in the source assembly was now considered the reference in the target assembly. Additionally, 7 SNPs were found to be duplicated; they mapped to the same physical position because of collapsing or overlapping contigs in the target assembly.
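The liftover percentages follow from the SNP counts given in the results. A minimal tally, with illustrative variable names:

```python
# SNP counts as reported in the text.
assigned_to_lgs = 1_023_625
failed_liftover = 31_278
duplicates_removed = 7

# Final reference panel after liftover and duplicate removal.
reference_panel = assigned_to_lgs - failed_liftover - duplicates_removed

lifted_pct = 100 * reference_panel / assigned_to_lgs
failed_pct = 100 * failed_liftover / assigned_to_lgs

print(f"reference panel: {reference_panel:,} SNPs")
print(f"lifted over: {lifted_pct:.2f}%  failed: {failed_pct:.2f}%")
```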

The karyotype of the pikeperch consists of one pair of metacentric, 15 pairs of submetacentric and 8 pairs of subtelo-acrocentric chromosomes^32,33. The number of linkage groups for the female, male and sex-averaged maps built in this study was chosen to correspond to the number of chromosome pairs from microscopic observations^32,33. With the aim of identifying the chromosome type and specifying the location of centromeric regions, we applied the centromere mapping method developed by Limborg et al. (2015)⁴² to all linkage groups of our female map. This required recombination frequencies (RF) between each of the two terminal markers and any other marker (m) on each linkage group. The resulting RF_m curves indicate a metacentric chromosome if the two curves cross at almost 0.5 and an acrocentric chromosome if the curves smoothly approach 0.5 at the ends. In our study, this method did not allow for a clear differentiation between metacentric and acrocentric LGs and therefore remained inconclusive (Supplementary Figure S2). In order to account for possible genotype errors at the terminal markers, markers close to them were also checked, and they confirmed the inconclusive outcome. As mentioned by Limborg et al. (2015)⁴², this method has reduced precision if recombination interference is incomplete and chromosome arms are long (> 50 cM). This leads to an increasing frequency of double crossover events, causing RF_m to level off after ~50 cM. This was observed in our study, possibly indicating incomplete interference in pikeperch. Thus, further research is needed to elucidate the extent of interference and to narrow down the location of the centromere by studying regions with repressed recombination activity⁴³. Once centromeres have been identified, the order of chromosomes will change accordingly.
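The levelling-off of recombination frequency with increasing map distance can be illustrated with two textbook map functions. This is not the method of Limborg et al. and not an analysis of the pikeperch data; it only contrasts Haldane's function, which assumes no crossover interference, with the complete-interference limit in which RF grows linearly with distance until it reaches 0.5. The function names are illustrative.

```python
import math

def haldane_rf(distance_cm: float) -> float:
    """Expected recombination fraction under Haldane's map function (no interference)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))

def complete_interference_rf(distance_cm: float) -> float:
    """Complete-interference limit: RF equals the map distance in Morgans, capped at 0.5."""
    return min(distance_cm / 100.0, 0.5)

for d in (10, 25, 50, 100, 150):
    print(f"{d:>4} cM  no interference: {haldane_rf(d):.3f}  "
          f"complete interference: {complete_interference_rf(d):.3f}")
```

Without interference, double crossovers keep the expected RF well below 0.5 at 50 cM and the curve flattens gradually, which makes the point at which a curve reaches 0.5 harder to locate.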

The development of genomic resources for pikeperch will allow a better understanding of the species and a faster positioning in the aquaculture industry. Pushed by the advancements in high-throughput methods for SNP genotyping, genomic selection has been introduced in some aquaculture breeding programs, but further research is needed to effectively combine existing breeding designs with available genomic information⁴⁴. Mapping of genomic regions associated with diseases will provide further possibilities to accelerate the breeding success in aquaculture species⁴⁵. Moreover, the genomic resources generated in this project will serve for various future studies, including the improvement in the contiguity and accuracy of the chromosome-scale assembly for pikeperch and the development of a SNP array.

Material and methods

All procedures involving the handling and treatment of fish used in this study were approved by the Committee on the Ethics of Animal Experiments of Mecklenburg-Western Pomerania (Landesamt für Gesundheit und Soziales LAGuS). Approval ID: 7221.3–1-009/19. The methods were performed in accordance with relevant guidelines and regulations.

Broodstock management and family production

Seven matings of pikeperch were generated in a state aquaculture facility in Hohen Wangelin (State Research Institute for Agriculture and Fisheries in Hohen Wangelin, Mecklenburg-Western Pomerania, Germany) within their normal production cycle. For the production of the families, mature broodstock were placed in spawning tanks using a sex ratio of 2:1 and 1:1. Spawning tank dimensions were 1.17 × 0.88 × 1.10 m (l × w × h) with a water column of 1.0 m kept at 12 °C and a daily water exchange rate of 5%. Broodstock were fed a diet for trout broodstock containing 44% protein. After spawning, eggs were collected per family and treated to prevent bacterial and fungal growth. The treatment consisted of a 10 min bath in a solution made of 50 ml of 37% formalin in 10 L of water. Eggs were then placed in incubation tanks in a small-scale recirculating aquaculture system (RAS) with continuous aeration, a cooling system and UV disinfection. After a 24 h hatching period, all progeny obtained from each family were mixed and transferred into round tanks, each with a water column of 0.5 m and kept at a water temperature of 15 °C and a daily water exchange rate of 5%. As first exogenous prey, larvae were fed marine copepods for the first two days, followed by Artemia spp. for the next 10 days, and were then adapted to dry food. After 45 days, at a mean weight of 0.5 g, the mixed-family larvae were stocked in round tanks with a water volume of 3 m³ at a water temperature of 21 °C and a daily water exchange rate of 5%. Larvae were fed dry food containing between 50 and 64% protein according to the growth stage; the daily diet consisted of 5% to 10% of the biomass of the tank.

DNA extraction and sequencing

Genomic DNA from the 18 broodstock (11 males and 7 females) used for the family production was isolated from flash-frozen caudal fin tissue sampled after mating. One male fish was used twice, giving a total of 19 samples from 18 different individuals. A total of 375 progeny were collected for sampling at the age of 16 and 28 weeks. Genomic DNA from the progeny was isolated from blood obtained from the caudal vein or from flash-frozen caudal fin tissue. For both broodstock and progeny, genomic DNA isolation was performed using the DNeasy Blood and Tissue Kit (Qiagen) following the manufacturer’s protocol. DNA quantity and quality were determined with the NanoDrop ND-1000 spectrophotometer (NanoDrop Technologies, Wilmington, Delaware, USA). Whole-genome paired-end sequencing was performed on each of the individuals (Macrogen, Korea) with Illumina NovaSeq 6000 technology.