Introduction

Genomic selection is commonly used in animal breeding programs, especially for dairy cattle (Van Eenennaam et al. 2014). It has been applied effectively in livestock breeding for three main reasons: its prediction accuracy is high, phenotyping (mainly progeny testing) costs more than genotyping, and it significantly shortens the selection cycle (Meuwissen et al. 2013). An important application of genomic selection in plant breeding is the prediction of untested single crosses (genotypic value prediction) and testcrosses (prediction of general combining ability effect) in hybrid breeding (Zhao et al. 2015). Genomic prediction of two-way and three-way crosses has been investigated (Philipp et al. 2016). Bernardo (1994) pioneered the prediction of untested single crosses based on best linear unbiased prediction (BLUP). Many significant studies on the prediction of untested single cross and testcross performance have been published in the last 23 years with a focus on the assessment of prediction accuracy. Most investigations were based on empirical data and estimated the prediction accuracy using a cross-validation procedure. Very few were based on simulated data (Li et al. 2017; Technow et al. 2012). Without exception, the inference was that prediction of untested single crosses and testcrosses can be efficient, depending on the heritability, training set size, and number of tested parental inbreds in the hybrid combination (both, one, or neither parent tested). Remarkably, this conclusion was drawn from studies differing in molecular marker type, marker density, number of inbreds, level of relatedness, diversity, and linkage disequilibrium (LD) between inbreds, heterotic pattern, training set size, genetic model, and statistical approach (Zhao et al. 2015).
Efficient prediction of two-way and three-way crosses of barley has been achieved using training and validation sets that include the same class of hybrids (Philipp et al. 2016).

Most studies on genomic prediction of maize single cross performance published since 2011 have used single nucleotide polymorphisms (SNPs), with the number of filtered SNPs ranging from 425 (Zhao et al. 2013a) to 39,627 (Technow et al. 2012). For grain yield, the relative prediction accuracies (computed as the accuracy divided by the square root of the heritability) in these studies ranged from 0.27 to 0.62 and from 0.65 to 0.95, respectively. The number of inbreds in each heterotic group was highly variable as well, ranging from 6 and 9 (Bernardo 1994) to 75 and 75 (Technow et al. 2012), respectively. The relative accuracy for grain yield observed by Bernardo (1994) ranged between 0.72 and 0.89. The level of relatedness between inbreds ranged from non-related inbreds in each group (Technow et al. 2012) to a maximum average value of 0.58 (RFLP-based coancestry coefficient) (Bernardo 1995). The relative accuracy obtained by Bernardo (1995) ranged from 0.41 to 0.80 for grain yield. The common heterotic groups were Stiff Stalk and non-Stiff Stalk (Kadam et al. 2016) or Dent and Flint (Technow et al. 2014). The relative accuracies for grain yield ranged from 0.28 to 0.77 and from 0.75 to 0.92, respectively. The study of Bernardo (1996a) involved nine heterotic groups and the relative accuracies for grain yield ranged from 0.43 to 0.88. These results indicate that prediction accuracy is proportional to the molecular marker density and that high accuracy can be achieved regardless of the number of inbreds, level of relatedness, and number of heterotic groups. No study found distinctly higher prediction accuracy for the additive-dominance model than for the additive model. Finally, for testcrosses only, the genomic BLUP (GBLUP) approach outperformed pedigree-based BLUP (Albrecht et al. 2014; Albrecht et al. 2011) with regard to prediction accuracy.

Genomic prediction of single crosses has been carried out based on tested single crosses using cross-validation. Thus, the estimated prediction accuracies are not for untested single crosses. Consequently, none of the previous studies on the efficiency of genomic prediction of single cross performance measured the efficacy of identification of the best untested single crosses. Our main objective was to assess the prediction efficiency of untested single crosses by correlating the predicted and true genotypic values of untested single crosses (prediction accuracy) and measuring the efficacy of identification of the best 300 untested single crosses (coincidence index) using a large simulated data set. The secondary objectives were to highlight that the prediction accuracy primarily depends on the overall LD in the groups of selected doubled haploid (DH) lines, that the prediction efficiency with no heterotic pattern can be as high as the prediction efficiency involving heterotic groups, and that the choice of single crosses for testing should be random instead of selecting DH lines for a diallel to maximize the prediction efficiency. Further, we derived a model for genomic prediction of untested single crosses based on the SNP average effects of substitution and dominance deviations.

Materials and methods

Theory

Most of the important papers on genomic selection treat the statistical aspects of whole-genome regression models in depth, extending a previously derived gene model to SNP effects. Some important papers include only basic quantitative genetics theory based on linkage equilibrium. The quantitative genetics theory developed in this paper provides a genetic model for genomic prediction of untested single crosses that accounts for the LD between QTLs and SNPs. The model offers the genetic background to the models fitted in previous papers on the prediction of untested single crosses and testcrosses (Albrecht et al. 2011; Massman et al. 2013; Technow et al. 2012). The theory is comprehensive, i.e., it is adequate for DH and inbred lines, for predicting untested single crosses and testcrosses, and for crops with and without defined heterotic groups, and it is easily extended to genomic prediction of two-way and three-way crosses (relevant for rice, wheat, and barley breeders), based on Jenkins (1934). The theoretical accuracy can be used in future investigations on the efficiency of genomic prediction of untested single crosses based on a deterministic approach, as in the study of Grattapaglia and Resende (2010).

LD in a group of selected DH or inbred lines

Consider a group of DH or inbred lines selected from a population or heterotic group. Assume also a QTL (alleles B/b) and a SNP (alleles C/c) where B and b are the alleles that increase and decrease the trait expression, respectively. Define the joint genotype probabilities as \({\mathrm{P}}({\mathrm{BBCC}}) = {\mathrm{f}}_{22}\), \({\mathrm{P}}({\mathrm{BBcc}}) = {\mathrm{f}}_{20}\), \({\mathrm{P}}({\mathrm{bbCC}}) = {\mathrm{f}}_{02}\), and \({\mathrm{P}}({\mathrm{bbcc}}) = {\mathrm{f}}_{00}\), where the subscripts indicate the numbers of copies of the major allele (B and C). The measure of LD between the QTL and the SNP is \({\mathrm{\Delta }}_{{\mathrm{bc}}} = {\mathrm{f}}_{22}{\mathrm{f}}_{00} - {\mathrm{f}}_{20}{\mathrm{f}}_{02}\) (Kempthorne 1957) and the haplotype frequencies are \({\mathrm{P}}({\mathrm{BC}}) = {\mathrm{f}}_{22} = {p}_{\mathrm{b}}{p}_{\mathrm{c}} + \Delta _{{\mathrm{bc}}}\), \({\mathrm{P}}({\mathrm{Bc}}) = {\mathrm{f}}_{20} = {{p}}_{\mathrm{b}}{{q}}_{\mathrm{c}} - {\mathrm{\Delta }}_{{\mathrm{bc}}}\), \({\mathrm{P}}({\mathrm{bC}}) = {\mathrm{f}}_{02} = {{q}}_{\mathrm{b}}{{p}}_{\mathrm{c}} - {\mathrm{\Delta }}_{{\mathrm{bc}}}\), and \({\mathrm{P}}({\mathrm{bc}}) = {\mathrm{f}}_{00} = {{q}}_{\mathrm{b}}{{q}}_{\mathrm{c}} + {\mathrm{\Delta }}_{{\mathrm{bc}}}\), where p is the frequency of the major allele (B or C) and q = 1 − p is the frequency of the minor allele (b or c). Note that \({p}_{\mathrm{b}} = {\mathrm{f}}_{22} + {\mathrm{f}}_{20}\) and \({p}_{\mathrm{c}} = {\mathrm{f}}_{22} + {\mathrm{f}}_{02}\). It is important to highlight that we are not assuming that the QTL and the SNP are linked and in LD in the population or heterotic group, because this is not a necessary condition for genomic prediction. But we are assuming that they are in LD in the group of DH or inbred lines. 
Furthermore, because of selection, genetic drift, and inbreeding (only for inbreds and linked QTLs and SNPs), the gene and genotypic frequencies and the LD values concerning the selected DH or inbred lines cannot be traced to the values in the population or heterotic group.
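The definitions above can be checked numerically. The following is a minimal sketch, assuming the four joint genotype frequencies of a group of fully homozygous lines are known; the function names and the example frequencies are illustrative only and are not taken from the study.

```python
def ld_summary(f22, f20, f02, f00):
    """Allele frequencies and LD for a QTL (B/b) and a SNP (C/c), from the
    joint genotype frequencies P(BBCC), P(BBcc), P(bbCC), P(bbcc)."""
    if abs(f22 + f20 + f02 + f00 - 1.0) > 1e-9:
        raise ValueError("joint genotype frequencies must sum to 1")
    p_b = f22 + f20                  # frequency of allele B
    p_c = f22 + f02                  # frequency of allele C
    delta = f22 * f00 - f20 * f02    # LD measure as defined in the text
    return p_b, p_c, delta


def haplotype_freqs(p_b, p_c, delta):
    """Rebuild P(BC), P(Bc), P(bC), P(bc) from allele frequencies and LD."""
    q_b, q_c = 1.0 - p_b, 1.0 - p_c
    return (p_b * p_c + delta, p_b * q_c - delta,
            q_b * p_c - delta, q_b * q_c + delta)


# Hypothetical frequencies for illustration
freqs = (0.40, 0.10, 0.15, 0.35)
p_b, p_c, delta = ld_summary(*freqs)
rebuilt = haplotype_freqs(p_b, p_c, delta)
```

Rebuilding the four frequencies from \(p_{\mathrm{b}}\), \(p_{\mathrm{c}}\), and \(\Delta_{\mathrm{bc}}\) returns the input values, which confirms that the two parameterizations are equivalent.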

SNP genotypic values of DH or inbred lines

The average genotypic value for a group of selected DH or inbred lines is \({\mathrm{M}}_{{\mathrm{IL}}} = {{m}}_{\mathrm{b}} + \left( {{{p}}_{\mathrm{b}} - {{q}}_{\mathrm{b}}} \right)\,{{a}}_{\mathrm{b}}\), where \({m}_{\mathrm{b}}\) is the mean of the genotypic values of the homozygotes and \({a}_{\mathrm{b}}\) is the deviation between the genotypic value of the homozygote of higher expression and \({m}_{\mathrm{b}}\). Thus, the average SNP genotypic values for the DH or inbred lines CC and cc are

$${\mathrm{G}}_{\rm CC} = \frac{1}{{{\mathrm{f}}_{.2}}}\left[ {{\mathrm{f}}_{22}\left( {{{m}}_{\mathrm{b}} + {{a}}_{\mathrm{b}}} \right) + {\mathrm{f}}_{02}\left( {{{m}}_{\mathrm{b}} - {{a}}_{\mathrm{b}}} \right)} \right] \\ = {\mathrm{M}}_{{\mathrm{IL}}} + {\mathrm{2}q}_{\mathrm{c}}{\mathrm{\alpha }}_{{\mathrm{SNP}}} = {\mathrm{M}}_{{\mathrm{IL}}} + {\mathrm{A}}_{\rm {CC}}$$
$${\mathrm{G}}_{\rm {cc}} = \frac{1}{{{\mathrm{f}}_{.0}}}\left[ {{\mathrm{f}}_{20}\left( {{{m}}_{\mathrm{b}} + {{a}}_{\mathrm{b}}} \right) + {\mathrm{f}}_{00}\left( {{{m}}_{\mathrm{b}} - {{a}}_{\mathrm{b}}} \right)} \right] \\ = {\mathrm{M}}_{{\mathrm{IL}}} - 2{{p}}_{\mathrm{c}}{\mathrm{\alpha }}_{{\mathrm{SNP}}} = {\mathrm{M}}_{{\mathrm{IL}}} + {\mathrm{A}}_{\rm {cc}}$$

where \(\alpha _{{\mathrm{SNP}}} = \left[ {\frac{{\Delta _{{\mathrm{bc}}}}}{{{p}_{\mathrm{c}}{q}_{\mathrm{c}}}}} \right]{a}_{\mathrm{b}} = {\mathrm{\kappa }}_{\rm {bc}}{{a}}_{\mathrm{b}}\) is the average effect of a SNP substitution in the group of DH or inbred lines, and A is the SNP additive value for a DH or inbred line. Note that E(A) = 0.

Assuming two QTLs (alleles B and b and E and e) in LD with the SNP, the average effect of a SNP substitution in the selected DH or inbred lines is \({\mathrm{\alpha }}_{{\mathrm{SNP}}} = {\mathrm{\kappa }}_{{\mathrm{bc}}}{{a}}_{\mathrm{b}} + {\mathrm{\kappa }}_{{\mathrm{ce}}}{{a}}_{\mathrm{e}}\), where \({\mathrm{\kappa }}_{{\mathrm{ce}}} = \left[ {\frac{{\Delta _{{\mathrm{ce}}}}}{{{{p}}_{\mathrm{c}}{{q}}_{\mathrm{c}}}}} \right]\). Thus, the average effect of a SNP substitution (and the SNP additive value) is proportional to the LD measure and to the deviation a for each QTL that is in LD with the marker.
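As a numerical check of the one-QTL expressions above, the sketch below computes the average SNP genotypic values of CC and cc lines in two ways: directly as conditional means over the joint genotype frequencies, and through the closed form \({\mathrm{M}}_{{\mathrm{IL}}} + {\mathrm{A}}\). The function name and all input values are hypothetical.

```python
def snp_values_dh(f22, f20, f02, f00, m_b, a_b):
    """Average SNP genotypic values of CC and cc lines for one QTL in LD."""
    p_b, q_b = f22 + f20, f02 + f00
    p_c, q_c = f22 + f02, f20 + f00
    delta = f22 * f00 - f20 * f02

    # Direct conditional means over the QTL genotypes (BB = m + a, bb = m - a)
    g_CC_direct = (f22 * (m_b + a_b) + f02 * (m_b - a_b)) / p_c
    g_cc_direct = (f20 * (m_b + a_b) + f00 * (m_b - a_b)) / q_c

    # Closed form: group mean plus SNP additive value
    m_il = m_b + (p_b - q_b) * a_b
    alpha_snp = delta / (p_c * q_c) * a_b      # kappa_bc * a_b
    add_CC = 2.0 * q_c * alpha_snp
    add_cc = -2.0 * p_c * alpha_snp
    return {
        "direct": (g_CC_direct, g_cc_direct),
        "closed": (m_il + add_CC, m_il + add_cc),
        "expected_A": p_c * add_CC + q_c * add_cc,
    }


# Hypothetical inputs for illustration
res = snp_values_dh(0.40, 0.10, 0.15, 0.35, m_b=10.0, a_b=2.0)
```

Both routes give the same two values, and the frequency-weighted mean of the additive values is zero, as stated in the text.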

SNP genotypic values of single crosses

To maximize the heterosis, maize breeders commonly assess single crosses originating from selected DH or inbred lines from distinct heterotic groups. Consider n1 DH or inbred lines from a population or heterotic group and n2 DH or inbred lines from a distinct population or heterotic group. The average genotypic value for the single crosses derived by crossing the DH or inbred lines from group 1 with the DH or inbred lines from group 2 is

$${\mathrm{M}}_{\mathrm{H}} = {{m}}_{\mathrm{b}} + \left( {{{p}}_{{\mathrm{b}}1}{{p}}_{{\mathrm{b2}}} - {{q}}_{{\mathrm{b1}}}{{q}}_{{\mathrm{b2}}}} \right)\,{{a}}_{\mathrm{b}} + \left( {{{p}}_{{\mathrm{b1}}}{{q}}_{{\mathrm{b2}}} + {{q}}_{{\mathrm{b1}}}{{p}}_{{\mathrm{b2}}}} \right)\,{{d}}_{\mathrm{b}}$$

where \({d}_{\mathrm{b}}\) is the dominance deviation (the deviation between the genotypic value of the heterozygote and \({m}_{\mathrm{b}}\)).

The average genotypic values for the single crosses derived from DH or inbred lines CC and cc of group 1 are

$$\begin{array}{ccccc}{\mathrm{G}}_{{\mathrm{CC1}}} = {\mathrm{M}}_{\mathrm{H}} + {{q}}_{{\mathrm{c1}}}{\mathrm{\kappa }}_{{\mathrm{bc1}}}\left[ {{{a}}_{\mathrm{b}} + \left( {{{q}}_{{\mathrm{b2}}} - {{p}}_{{\mathrm{b2}}}} \right){{d}}_{\mathrm{b}}} \right] \\ = {\mathrm{M}}_{\mathrm{H}} + {{q}}_{{\mathrm{c1}}}{\mathrm{\kappa }}_{{\mathrm{bc1}}}{\mathrm{\alpha }}_{{\mathrm{b2}}} = {\mathrm{M}}_{\mathrm{H}} + {{q}}_{{\mathrm{c1}}}{\mathrm{\alpha }}_{{\mathrm{SNP1}}} = {\mathrm{M}}_{\mathrm{H}} + {\mathrm{GCA}}_{{\mathrm{CC1}}}\\ \end{array}$$
$${\mathrm{G}}_{{\mathrm{cc1}}} = {\mathrm{M}}_{\mathrm{H}} - {{p}}_{{\mathrm{c1}}}{\mathrm{\kappa }}_{{\mathrm{bc1}}}{\mathrm{\alpha }}_{{\mathrm{b2}}} = {\mathrm{M}}_{\mathrm{H}} - {{p}}_{{\mathrm{c1}}}{\mathrm{\alpha }}_{{\mathrm{SNP1}}} \\ = {\mathrm{M}}_{\mathrm{H}} + {\mathrm{GCA}}_{{\mathrm{cc1}}}$$

where \({\mathrm{\alpha }}_{{\mathrm{b2}}}\) is the average effect of allelic substitution in the population derived by random crosses between the DH or inbred lines from group 2, \({\mathrm{\alpha }}_{{\mathrm{SNP1}}}\) is the SNP effect of allelic substitution in the hybrid population relative to a SNP derived from group 1, and GCA is the general combining ability effect for a SNP locus. Note that \({\mathrm{\alpha }}_{{\mathrm{SNP1}}}\) depends on the LD in group 1 (\({\mathrm{\kappa }}_{{\mathrm{bc1}}} = {\mathrm{\Delta }}_{{\mathrm{bc1}}}/\left( {{{p}}_{{\mathrm{c1}}}{{q}}_{{\mathrm{c1}}}} \right)\)) and the average effect of allelic substitution in the population derived by random crosses between the DH or inbred lines from group 2. Furthermore, \({\mathrm{E}}({\mathrm{GCA}}) = {{p}}_{{\mathrm{c1}}}{\mathrm{GCA}}_{{\mathrm{CC1}}} + {{q}}_{{\mathrm{c1}}}{\mathrm{GCA}}_{{\mathrm{cc1}}} = {\mathrm{0}}\). Concerning the single crosses derived from DH or inbred lines CC and cc of group 2, we have

$$\begin{array}{ccccc}{\mathrm{G}}_{{\mathrm{CC2}}} = {\mathrm{M}}_{\mathrm{H}} + {{q}}_{{\mathrm{c2}}}{\mathrm{\kappa }}_{{\mathrm{bc2}}}\left[ {{{a}}_{\mathrm{b}} + \left( {{{q}}_{{\mathrm{b1}}} - {{p}}_{{\mathrm{b1}}}} \right){{d}}_{\mathrm{b}}} \right] \\ = {\mathrm{M}}_{\mathrm{H}} + {{q}}_{{\mathrm{c2}}}{\mathrm{\kappa }}_{{\mathrm{bc2}}}{\mathrm{\alpha }}_{{\mathrm{b1}}} = {\mathrm{M}}_{\mathrm{H}} + {{q}}_{{\mathrm{c2}}}{\mathrm{\alpha }}_{{\mathrm{SNP2}}} = {\mathrm{M}}_{\mathrm{H}} + {\mathrm{GCA}}_{{\mathrm{CC2}}}\end{array}$$
$${\mathrm{G}}_{{\mathrm{cc2}}} = {\mathrm{M}}_{\mathrm{H}} - {{p}}_{{\mathrm{c2}}}{\mathrm{\kappa }}_{{\mathrm{bc2}}}{\mathrm{\alpha }}_{{\mathrm{b1}}} = {\mathrm{M}}_{\mathrm{H}} \!-\! {{p}}_{{\mathrm{c2}}}{\mathrm{\alpha }}_{{\mathrm{SNP2}}} = {\mathrm{M}}_{\mathrm{H}}\! +\! {\mathrm{GCA}}_{{\mathrm{cc2}}}$$

Note that E(GCA) = 0. The average genotypic values for the single crosses concerning the SNP locus are

$$\begin{array}{ccccc}{\mathrm{G}}_{{\mathrm{CC1xCC2}}} = {\mathrm{M}}_{\mathrm{H}} \!+\! {{q}}_{{\mathrm{c1}}}{\mathrm{\alpha }}_{{\mathrm{SNP1}}} + {{q}}_{{\mathrm{c2}}}{\mathrm{\alpha }}_{{\mathrm{SNP2}}} - {{\mathrm{2}q}}_{{\mathrm{c1}}}{{q}}_{{\mathrm{c2}}}{\mathrm{\kappa }}_{{\mathrm{bc1}}}{\mathrm{\kappa }}_{{\mathrm{bc2}}}{{d}}_{\mathrm{b}}\\ = {\mathrm{M}}_{\mathrm{H}} + {\mathrm{GCA}}_{{\mathrm{CC1}}} + {\mathrm{GCA}}_{{\mathrm{CC2}}} + {\mathrm{SCA}}_{{\mathrm{CC1xCC2}}}\end{array}$$
$$\begin{array}{ccccc}{\mathrm{G}}_{{\mathrm{cc1xcc2}}} = {\mathrm{M}}_{\mathrm{H}} - {{p}}_{{\mathrm{c1}}}{\mathrm{\alpha }}_{{\mathrm{SNP1}}} - {{p}}_{{\mathrm{c2}}}{\mathrm{\alpha }}_{{\mathrm{SNP2}}} - {{\mathrm{2}p}}_{{\mathrm{c1}}}{{p}}_{{\mathrm{c2}}}{\mathrm{\kappa }}_{{\mathrm{bc1}}}{\mathrm{\kappa }}_{{\mathrm{bc2}}}{{d}}_{\mathrm{b}}\\ = {\mathrm{M}}_{\mathrm{H}} + {\mathrm{GCA}}_{{\mathrm{cc1}}} + {\mathrm{GCA}}_{{\mathrm{cc2}}} + {\mathrm{SCA}}_{{\mathrm{cc1xcc2}}}\end{array}$$
$$\begin{array}{ccccc}{\mathrm{G}}_{{\mathrm{CC1xcc2}}} = {\mathrm{M}}_{\mathrm{H}} + {{q}}_{{\mathrm{c1}}}{\mathrm{\alpha }}_{{\mathrm{SNP1}}} - {{p}}_{{\mathrm{c2}}}{\mathrm{\alpha }}_{{\mathrm{SNP2}}} + {{\mathrm{2}q}}_{{\mathrm{c1}}}{{p}}_{{\mathrm{c2}}}{\mathrm{\kappa }}_{{\mathrm{bc1}}}{\mathrm{\kappa }}_{{\mathrm{bc2}}}{{d}}_{\mathrm{b}}\\ = {\mathrm{M}}_{\mathrm{H}} + {\mathrm{GCA}}_{{\mathrm{CC1}}} + {\mathrm{GCA}}_{{\mathrm{cc2}}} + {\mathrm{SCA}}_{{\mathrm{CC1xcc2}}}\end{array}$$
$$\begin{array}{ccccc}{\mathrm{G}}_{{\mathrm{cc1xCC2}}} = {\mathrm{M}}_{\mathrm{H}} - {{p}}_{{\mathrm{c1}}}{\mathrm{\alpha }}_{{\mathrm{SNP1}}} + {{q}}_{{\mathrm{c2}}}{\mathrm{\alpha }}_{{\mathrm{SNP2}}} + {{\mathrm{2}p}}_{{\mathrm{c1}}}{{q}}_{{\mathrm{c2}}}{\mathrm{\kappa }}_{{\mathrm{bc1}}}{\mathrm{\kappa }}_{{\mathrm{bc2}}}{{d}}_{\mathrm{b}}\\ = {\mathrm{M}}_{\mathrm{H}} + {\mathrm{GCA}}_{{\mathrm{cc1}}} + {\mathrm{GCA}}_{{\mathrm{CC2}}} + {\mathrm{SCA}}_{{\mathrm{cc1xCC2}}}\\ \end{array}$$

where \({\mathrm{\kappa }}_{{\mathrm{bc1}}}{\mathrm{\kappa }}_{{\mathrm{bc2}}}{d}_{\mathrm{b}} = {d}_{{\mathrm{SNP}}}\) is the SNP dominance deviation in the hybrid population, and SCA stands for the specific combining ability effect for a SNP locus. Note that \({\mathrm{E}}({\mathrm{SCA}}) = {p}_{{\mathrm{c1}}}{p}_{{\mathrm{c2}}}{\mathrm{SCA}}_{{\mathrm{CC1xCC2}}} + {p}_{{\mathrm{c1}}}{q}_{{\mathrm{c2}}}{\mathrm{SCA}}_{{\mathrm{CC1xcc2}}} + {q}_{{\mathrm{c1}}}{p}_{{\mathrm{c2}}}{\mathrm{SCA}}_{{\mathrm{cc1xCC2}}} + {q}_{{\mathrm{c1}}}{q}_{{\mathrm{c2}}}{\mathrm{SCA}}_{{\mathrm{cc1xcc2}}} = {\mathrm{0}}\) and E(SCA|CC) = E(SCA|cc) = 0 for each group. That is, the expectation of the SNP SCA effects given a SNP genotype for the common DH or inbred line is also zero. Also note that the four genotypic values depend on four unknown parameters (\({\mathrm{M}}_{\mathrm{H}}\), \({\mathrm{\alpha }}_{{\mathrm{SNP1}}}\), \({\mathrm{\alpha }}_{{\mathrm{SNP2}}}\), and \({d}_{{\mathrm{SNP}}}\)).

Assuming two QTLs (alleles B and b and E and e) in LD with the SNP, the SNP dominance deviation is \({d}_{{\mathrm{SNP}}} = {\mathrm{\kappa }}_{{\mathrm{bc1}}}{\mathrm{\kappa }}_{{\mathrm{bc2}}}{d}_{\mathrm{b}} + {\mathrm{\kappa }}_{{\mathrm{ce1}}}{\mathrm{\kappa }}_{{\mathrm{ce2}}}{d}_{\mathrm{e}}\). Thus, the SNP dominance deviation (and the SNP SCA effect) is proportional to the product of the LD values in both groups of DH or inbred lines and to the dominance deviation for each QTL that is in LD with the marker.
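The one-QTL decomposition into \({\mathrm{M}}_{\mathrm{H}}\), GCA, and SCA effects can be verified against a direct calculation. The sketch below is illustrative only: the joint genotype frequencies of the two groups and the values of m, a, and d are hypothetical. The compact form `-2 * z1 * z2 * d_snp` used for the SCA effect is simply an algebraic restatement of the four SCA terms in the genotypic value expressions above.

```python
from itertools import product


def group_params(f22, f20, f02, f00):
    """Return p_b, p_c, and kappa_bc for one group of homozygous lines."""
    p_b = f22 + f20
    p_c = f22 + f02
    delta = f22 * f00 - f20 * f02
    return p_b, p_c, delta / (p_c * (1.0 - p_c))


def p_bb_given(freqs, geno):
    """Frequency of BB among the lines of a group with SNP genotype geno."""
    f22, f20, f02, f00 = freqs
    return f22 / (f22 + f02) if geno == "CC" else f20 / (f20 + f00)


def direct_value(P1, P2, m, a, d):
    """Mean QTL genotypic value of crosses between lines with P(BB) = P1
    (group 1 parent) and P(BB) = P2 (group 2 parent)."""
    Q1, Q2 = 1.0 - P1, 1.0 - P2
    return m + (P1 * P2 - Q1 * Q2) * a + (P1 * Q2 + Q1 * P2) * d


def model_value(g1, g2, geno1, geno2, m, a, d):
    """M_H + GCA (group 1) + GCA (group 2) + SCA for one SNP genotype pair."""
    p_b1, p_c1, k1 = group_params(*g1)
    p_b2, p_c2, k2 = group_params(*g2)
    q_b1, q_b2 = 1.0 - p_b1, 1.0 - p_b2
    m_h = m + (p_b1 * p_b2 - q_b1 * q_b2) * a + (p_b1 * q_b2 + q_b1 * p_b2) * d
    alpha_snp1 = k1 * (a + (q_b2 - p_b2) * d)   # kappa_bc1 * alpha_b2
    alpha_snp2 = k2 * (a + (q_b1 - p_b1) * d)   # kappa_bc2 * alpha_b1
    d_snp = k1 * k2 * d
    z1 = (1.0 - p_c1) if geno1 == "CC" else -p_c1
    z2 = (1.0 - p_c2) if geno2 == "CC" else -p_c2
    return m_h + z1 * alpha_snp1 + z2 * alpha_snp2 - 2.0 * z1 * z2 * d_snp


# Hypothetical inputs for illustration
group1 = (0.40, 0.10, 0.15, 0.35)
group2 = (0.20, 0.10, 0.20, 0.50)
m, a, d = 10.0, 2.0, 1.0
checks = {}
for geno1, geno2 in product(("CC", "cc"), repeat=2):
    direct = direct_value(p_bb_given(group1, geno1),
                          p_bb_given(group2, geno2), m, a, d)
    checks[(geno1, geno2)] = (direct,
                              model_value(group1, group2, geno1, geno2, m, a, d))
```

For each of the four SNP genotype combinations, the direct mean and the model value agree, which is the content of the four equations.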

The previous model expressed as a function of the SNP GCA and SCA effects was proposed by Massman et al. (2013), but the authors assumed \({\mathrm{GCA}}_{{\mathrm{CC}}} + {\mathrm{GCA}}_{{\mathrm{cc}}}{\mathrm{ = 0}}\) (for each heterotic group and for each SNP) and \({\mathrm{SCA}}_{{\mathrm{CC1xCC2}}} = {\mathrm{SCA}}_{{\mathrm{cc1xcc2}}} = {-}\! {\mathrm{SCA}}_{{\mathrm{CC1xcc2}}} = {-}\! {\mathrm{SCA}}_{{\mathrm{cc1xCC2}}}\). Technow et al. (2012) used a standard extension from QTL to SNP and defined the single cross genotypic value for a SNP as a function of the SNP deviations a and d. That is, \({\mathrm{G}} = {\mathrm{M}}_{\mathrm{H}} + {{u}}_{\mathrm{1}}{{a}} + {{u}}_{\mathrm{2}}{{a}} + {{u}}_{\mathrm{3}}{{d}}\), where \({u}_{\mathrm{1}}\) and \({u}_{\mathrm{2}}\) equal 1/2 or −1/2 depending on which SNP allele the corresponding DH or inbred line is homozygous for (CC or cc), and \({u}_{\mathrm{3}}\) equals 0 if the single cross is homozygous or 1 if it is heterozygous.

SNP genotypic values of single crosses from DH or inbred lines derived from the same population or heterotic group

Well-defined heterotic groups are known for maize but not for specialty maize such as popcorn and sweet corn, or for other crops such as wheat (Zhao et al. 2013b), rice (Xu et al. 2014), and barley (Philipp et al. 2016). Thus, many breeders are interested in the efficiency of genomic prediction of single crosses when there are no heterotic groups. Assuming n DH or inbred lines derived from the same population or heterotic group, the average genotypic values for the single crosses concerning the SNP locus are

$${\mathrm{G}}_{{\mathrm{CCxCC}}} = {\mathrm{M}} + {{\mathrm{2}q}}_{\mathrm{c}}{\mathrm{\alpha }}_{{\mathrm{SNP}}} - {{\mathrm{2}q}}_{\mathrm{c}}^{\mathrm{2}}{\mathrm{\kappa }}_{{\mathrm{bc}}}^{\mathrm{2}}{d}_{\mathrm{b}} \\ = {\mathrm{M}} + {\mathrm{2GCA}}_{{\mathrm{CC}}} + {\mathrm{SCA}}_{{\mathrm{CCxCC}}}$$
$${\mathrm{G}}_{{\mathrm{ccxcc}}} = {\mathrm{M}} - {{\mathrm{2}p}}_{\mathrm{c}}{\mathrm{\alpha }}_{{\mathrm{SNP}}} - {{\mathrm{2}p}}_{\mathrm{c}}^{\mathrm{2}}{\mathrm{\kappa }}_{{\mathrm{bc}}}^{\mathrm{2}}{{d}}_{\mathrm{b}} \\ = {\mathrm{M}} + {\mathrm{2GCA}}_{{\mathrm{cc}}} + {\mathrm{SCA}}_{{\mathrm{ccxcc}}}$$
$${\mathrm{G}}_{{\mathrm{CCxcc}}} = {\mathrm{M}} + \left( {{q}_{\mathrm{c}} - {p}_{\mathrm{c}}} \right)\,{\mathrm{\alpha }}_{{\mathrm{SNP}}} + {{\mathrm{2}p}}_{\mathrm{c}}{q}_{\mathrm{c}}{\mathrm{\kappa }}_{{\mathrm{bc}}}^{\mathrm{2}}{d}_{\mathrm{b}}\\ = {\mathrm{M}} + {\mathrm{GCA}}_{{\mathrm{CC}}} + {\mathrm{GCA}}_{{\mathrm{cc}}} + {\mathrm{SCA}}_{{\mathrm{CCxcc}}},$$

where \({\mathrm{M}} = {{m}}_{\mathrm{b}} + \left( {{{p}}_{\mathrm{b}} - {{q}}_{\mathrm{b}}} \right)\,{{a}}_{\mathrm{b}} + {{\mathrm{2}p}}_{\mathrm{b}}{{q}}_{\mathrm{b}}{{d}}_{\mathrm{b}}\) is the hybrid population mean, \({\mathrm{\alpha }}_{{\mathrm{SNP}}} = {\mathrm{\kappa }}_{{\mathrm{bc}}}\left[ {{{a}}_{\mathrm{b}} + \left( {{{q}}_{\mathrm{b}} - {{p}}_{\mathrm{b}}} \right)\,{{d}}_{\mathrm{b}}} \right] = {\mathrm{\kappa }}_{{\mathrm{bc}}}{\mathrm{\alpha }}_{\mathrm{b}}\) is the average effect of a SNP substitution in the hybrid population, and \({d}_{{\mathrm{SNP}}} = {\mathrm{\kappa }}_{{\mathrm{bc}}}^{\mathrm{2}}{d}_{\mathrm{b}}\) is the SNP dominance deviation. Note that the SNP GCA effects are equal to half the SNP additive value for the single crosses (A), the SNP SCA effects are the SNP dominance deviations for the single crosses (D), and the three genotypic values depend on three unknown parameters (\({\mathrm{M}}\), \({\mathrm{\alpha }}_{{\mathrm{SNP}}}\), and \({d}_{{\mathrm{SNP}}}\)). Also note that E(GCA) = E(A) = E(SCA) = E(SCA|CC) = E(SCA|cc) = E(D) = 0.

Accuracy of single cross genomic prediction

Assuming a QTL and a SNP in LD in the two groups of DH or inbred lines, the predictor of the single cross QTL genotypic value is the single cross SNP genotypic value (because they are proportional). Thus, the covariance between the predictor and the genotypic value is

$$\begin{aligned}
{\mathrm{Cov}}\left( {\tilde{\mathrm{G}},{\mathrm{G}}} \right) ={}& {\mathrm{f}}_{22}^{1}{\mathrm{f}}_{22}^{2}\left[ {{\mathrm{M}}_{\mathrm{H}} + {\mathrm{GCA}}_{{\mathrm{CC1}}} + {\mathrm{GCA}}_{{\mathrm{CC2}}} + {\mathrm{SCA}}_{{\mathrm{CC1xCC2}}}} \right]\left[ {{\mathrm{M}}_{\mathrm{H}} + {\mathrm{GCA}}_{{\mathrm{BB1}}} + {\mathrm{GCA}}_{{\mathrm{BB2}}} + {\mathrm{SCA}}_{{\mathrm{BB1xBB2}}}} \right] \\
&+ {\mathrm{f}}_{22}^{1}{\mathrm{f}}_{20}^{2}\left[ {{\mathrm{M}}_{\mathrm{H}} + {\mathrm{GCA}}_{{\mathrm{CC1}}} + {\mathrm{GCA}}_{{\mathrm{cc2}}} + {\mathrm{SCA}}_{{\mathrm{CC1xcc2}}}} \right]\left[ {{\mathrm{M}}_{\mathrm{H}} + {\mathrm{GCA}}_{{\mathrm{BB1}}} + {\mathrm{GCA}}_{{\mathrm{BB2}}} + {\mathrm{SCA}}_{{\mathrm{BB1xBB2}}}} \right] \\
&+ \ldots \\
&+ {\mathrm{f}}_{00}^{1}{\mathrm{f}}_{00}^{2}\left[ {{\mathrm{M}}_{\mathrm{H}} + {\mathrm{GCA}}_{{\mathrm{cc1}}} + {\mathrm{GCA}}_{{\mathrm{cc2}}} + {\mathrm{SCA}}_{{\mathrm{cc1xcc2}}}} \right]\left[ {{\mathrm{M}}_{\mathrm{H}} + {\mathrm{GCA}}_{{\mathrm{bb1}}} + {\mathrm{GCA}}_{{\mathrm{bb2}}} + {\mathrm{SCA}}_{{\mathrm{bb1xbb2}}}} \right] - \left( {{\mathrm{M}}_{\mathrm{H}}} \right)^{2} \\
={}& {p}_{{\mathrm{c1}}}{q}_{{\mathrm{c1}}}\left( {{\mathrm{\kappa }}_{{\mathrm{bc1}}}{\mathrm{\alpha }}_{{\mathrm{b2}}}} \right)^{2} + {p}_{{\mathrm{c2}}}{q}_{{\mathrm{c2}}}\left( {{\mathrm{\kappa }}_{{\mathrm{bc2}}}{\mathrm{\alpha }}_{{\mathrm{b1}}}} \right)^{2} + 4{p}_{{\mathrm{c1}}}{q}_{{\mathrm{c1}}}{p}_{{\mathrm{c2}}}{q}_{{\mathrm{c2}}}\left( {{\mathrm{\kappa }}_{{\mathrm{bc1}}}{\mathrm{\kappa }}_{{\mathrm{bc2}}}{d}_{\mathrm{b}}} \right)^{2} \\
={}& {p}_{{\mathrm{c1}}}{q}_{{\mathrm{c1}}}\left( {{\mathrm{\alpha }}_{{\mathrm{SNP1}}}} \right)^{2} + {p}_{{\mathrm{c2}}}{q}_{{\mathrm{c2}}}\left( {{\mathrm{\alpha }}_{{\mathrm{SNP2}}}} \right)^{2} + 4{p}_{{\mathrm{c1}}}{q}_{{\mathrm{c1}}}{p}_{{\mathrm{c2}}}{q}_{{\mathrm{c2}}}\left( {{d}_{{\mathrm{SNP}}}} \right)^{2} \\
={}& {\mathrm{\sigma }}_{{\mathrm{GCA}}_{{\mathrm{SNP}}}}^{2(1)} + {\mathrm{\sigma }}_{{\mathrm{GCA}}_{{\mathrm{SNP}}}}^{2(2)} + {\mathrm{\sigma }}_{{\mathrm{SCA}}_{{\mathrm{SNP}}}}^{2} = {\mathrm{\sigma }}_{{\mathrm{G(SNP)}}}^{2}
\end{aligned}$$

where the GCA and SCA effects for the QTL are \({\mathrm{GCA}}_{{\mathrm{BB1}}} = {{q}}_{{\mathrm{b1}}}{\mathrm{\alpha }}_{{\mathrm{b2}}}\), \({\mathrm{GCA}}_{{\mathrm{bb1}}} = - {{p}}_{{\mathrm{b1}}}{\mathrm{\alpha }}_{{\mathrm{b2}}}\), \({\mathrm{GCA}}_{{\mathrm{BB2}}} = {{q}}_{{\mathrm{b2}}}{\mathrm{\alpha }}_{{\mathrm{b1}}}\), \({\mathrm{GCA}}_{{\mathrm{bb2}}} = - {{p}}_{{\mathrm{b2}}}{\mathrm{\alpha }}_{{\mathrm{b1}}}\), \({\mathrm{SCA}}_{{\mathrm{BB1xBB2}}} = - {{\mathrm{2}q}}_{{\mathrm{b1}}}{{q}}_{{\mathrm{b2}}}{{d}}_{\mathrm{b}}\), \({\mathrm{SCA}}_{{\mathrm{BB1xbb2}}} = {{\mathrm{2}q}}_{{\mathrm{b1}}}{{p}}_{{\mathrm{b2}}}{{d}}_{\mathrm{b}}\), \({\mathrm{SCA}}_{{\mathrm{bb1xBB2}}} = {{\mathrm{2}p}}_{{\mathrm{b1}}}{q}_{{\mathrm{b2}}}{d}_{\mathrm{b}}\), and \({\mathrm{SCA}}_{{\mathrm{bb1xbb2}}} = - {{\mathrm{2}p}}_{{\mathrm{b1}}}{p}_{{\mathrm{b2}}}{d}_{\mathrm{b}}\); \({\mathrm{\sigma }}_{{\mathrm{GCA}}_{{\mathrm{SNP}}}}^{2(1)}\), \({\mathrm{\sigma }}_{{\mathrm{GCA}}_{{\mathrm{SNP}}}}^{2(2)}\), and \({\mathrm{\sigma }}_{{\mathrm{SCA}}_{{\mathrm{SNP}}}}^{2}\) are the GCA and SCA variances for the SNP locus; and \({\mathrm{\sigma }}_{{\mathrm{G(SNP)}}}^{2}\) is the SNP genotypic variance. The GCA and SCA variances for the QTL are \(\sigma _{{\mathrm{GCA}}}^{2(1)} = {{p}}_{{\mathrm{b1}}}{{q}}_{{\mathrm{b1}}}\left( {{\mathrm{\alpha }}_{{\mathrm{b2}}}} \right)^2\), \(\sigma _{{\mathrm{GCA}}}^{2(2)} = {{p}}_{{\mathrm{b2}}}{{q}}_{{\mathrm{b2}}}\left( {{\mathrm{\alpha }}_{{\mathrm{b1}}}} \right)^2\), and \(\sigma _{{\mathrm{SCA}}}^2 = 4{{p}}_{{\mathrm{b1}}}{{q}}_{{\mathrm{b1}}}{{p}}_{{\mathrm{b2}}}{{q}}_{{\mathrm{b2}}}\left( {{{d}}_{\mathrm{b}}} \right)^{\,2}\). The QTL genotypic variance is \(\sigma _{\mathrm{G}}^2 = \sigma _{{\mathrm{GCA}}}^{2(1)} + \sigma _{{\mathrm{GCA}}}^{2(2)} + \sigma _{{\mathrm{SCA}}}^2\). Because the predictor is the SNP genotypic value, its variance is also \({\mathrm{\sigma }}_{{\mathrm{G(SNP)}}}^{2}\), so the correlation between the predictor and the genotypic value is \({\mathrm{\sigma }}_{{\mathrm{G(SNP)}}}^{2}/\sqrt {{\mathrm{\sigma }}_{{\mathrm{G(SNP)}}}^{2}\sigma _{\mathrm{G}}^{2}}\). Thus, the single cross prediction accuracy is

$$\rho _{\tilde{\mathrm{G}},{\mathrm{G}}} = \sqrt {\frac{{\sigma _{{\mathrm{G(SNP)}}}^2}}{{\sigma _{\mathrm{G}}^2}}}$$

Assuming s SNPs,

$$\rho _{\tilde{\mathrm{G}},{\mathrm{G}}} = \mathop {\sum}\limits_{{{r}} = 1}^s {\sigma _{{{\rm G}({\rm SNP}({r}))}}^2} /\sqrt {{\mathrm{\sigma }}_{{\tilde{\mathrm G}}}^2\sigma _{\mathrm{G}}^2}$$

where \(\sigma _{{\tilde{\mathrm G}}}^2\) is the variance of the predicted single cross genotypic values, and \(\sigma _{\mathrm{G}}^2\) is the single cross genotypic variance. Furthermore,

$${\mathrm{\alpha }}_{{\mathrm{SNP}(r)1}} = \mathop {\sum}\limits_{i = 1}^{{{k{\prime}}}} {\left[ {\frac{{\Delta _{r{\kern 1pt} i1}}}{{{\mathrm{p}}_{{\mathrm{r1}}}{\mathrm{q}}_{{\mathrm{r1}}}}}} \right]{\mathrm{\alpha }}_{{\mathrm{i2}}}} = \mathop {\sum}\limits_{{{i}} = 1}^{{{k{\prime}}}} {{\mathrm{\kappa }}_{{\mathrm{r}}{\kern 1pt} {\mathrm{i1}}}{\mathrm{\alpha }}_{{\mathrm{i2}}}} ,$$

where k′ is the number of QTLs in LD with the SNP r in group 1, and

$${{d}}_{{{\mathrm{SNP}(r)}}} = \mathop {\sum}\limits_{{i} = {\mathrm{1}}}^{{{k{\prime\prime}}}} {\left[ {\frac{{\Delta _{{ri}{\kern 1pt} {\mathrm{1}}}}}{{{{p}}_{{r\mathrm{1}}}{{q}}_{{r\mathrm{1}}}}}} \right]\left[ {\frac{{\Delta _{{ri}{\kern 1pt} {\mathrm{2}}}}}{{{{p}}_{{r\mathrm{2}}}{{q}}_{{r\mathrm{2}}}}}} \right]{{d}}_{{i}}} = \mathop {\sum}\limits_{{i} = {\mathrm{1}}}^{{{k{\prime\prime}}}} {{\mathrm{\kappa }}_{{ri\mathrm{1}}}{\mathrm{\kappa }}_{{ri\mathrm{2}}}{{d}}_{i}}$$

where k″ is the number of QTLs in LD with the SNP r in both groups.

Because the accuracy of genomic prediction of single crosses depends on the squares of the average effects of SNP substitution and the SNP dominance deviations, it is not affected by the linkage phase (coupling or repulsion), as it does not depend on linkage. But it does depend on the magnitude of the LD in each group of DH or inbred lines.
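The one-QTL, one-SNP accuracy can be evaluated directly from the variance expressions above. The following sketch is illustrative only; the function name and the parameter values are hypothetical. It shows that reversing the sign of the LD in a group leaves the accuracy unchanged, and that complete LD in both groups gives an accuracy of 1.

```python
import math


def single_cross_accuracy(p_b1, p_c1, delta1, p_b2, p_c2, delta2, a, d):
    """Accuracy for one QTL and one SNP in LD in two groups of lines."""
    q_b1, q_c1 = 1.0 - p_b1, 1.0 - p_c1
    q_b2, q_c2 = 1.0 - p_b2, 1.0 - p_c2
    k1 = delta1 / (p_c1 * q_c1)
    k2 = delta2 / (p_c2 * q_c2)
    alpha_b1 = a + (q_b1 - p_b1) * d
    alpha_b2 = a + (q_b2 - p_b2) * d

    # SNP genotypic variance: GCA (group 1) + GCA (group 2) + SCA
    var_snp = (p_c1 * q_c1 * (k1 * alpha_b2) ** 2
               + p_c2 * q_c2 * (k2 * alpha_b1) ** 2
               + 4.0 * p_c1 * q_c1 * p_c2 * q_c2 * (k1 * k2 * d) ** 2)
    # QTL genotypic variance
    var_qtl = (p_b1 * q_b1 * alpha_b2 ** 2
               + p_b2 * q_b2 * alpha_b1 ** 2
               + 4.0 * p_b1 * q_b1 * p_b2 * q_b2 * d ** 2)
    return math.sqrt(var_snp / var_qtl)


# Hypothetical parameter values for illustration
acc_pos = single_cross_accuracy(0.5, 0.55, 0.125, 0.3, 0.4, 0.08, a=2.0, d=1.0)
acc_neg = single_cross_accuracy(0.5, 0.55, -0.125, 0.3, 0.4, -0.08, a=2.0, d=1.0)
# Complete LD: p_b = p_c and delta = p * q in both groups
acc_full = single_cross_accuracy(0.5, 0.5, 0.25, 0.3, 0.3, 0.21, a=2.0, d=1.0)
```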

Assuming single crosses derived from DH or inbred lines of a single population or heterotic group we have \({\mathrm{\sigma }}_{{\mathrm{G(SNP)}}}^{\mathrm{2}} = {{\mathrm{2}p}}_{\mathrm{c}}{{q}}_{\mathrm{c}}\left( {{\mathrm{\alpha }}_{{\mathrm{SNP}}}} \right)^{\,{\mathrm{2}}} + \left( {{{\mathrm{2}p}}_{\mathrm{c}}{q}_{\mathrm{c}}{d}_{{\mathrm{SNP}}}} \right)^{\,{\mathrm{2}}}\) and \({\mathrm{\sigma }}_{\mathrm{G}}^{\mathrm{2}} = {{\mathrm{2}p}}_{\mathrm{b}}{q}_{\mathrm{b}}\left( {{\mathrm{\alpha }}_{\mathrm{b}}} \right)^{\,{\mathrm{2}}} + \left( {{{\mathrm{2}p}}_{\mathrm{b}}{q}_{\mathrm{b}}{d}_{\mathrm{b}}} \right)^{\,{\mathrm{2}}}\).

The statistical model for single cross genomic prediction

Consider \({n}_{1}\) and \({n}_{2}\) (several tens) DH or inbred lines from two populations or heterotic groups genotyped for s (thousands) SNPs and the experimental assessment of h (a few hundred) single crosses (h much lower than \({n}_{1}{n}_{2}\)) in e (several) environments (a combination of growing seasons, years, and locations). Defining y as the adjusted single cross phenotypic mean, the statistical model for prediction of the average effects of SNP substitution and the SNP dominance deviations is

$${{y}} = {\mathrm{M}}_{\mathrm{H}} + \mathop {\sum}\limits_{{{r}} = {\mathrm{1}}}^{{s}} {\left( {{\mathrm{z}}_{{\mathrm{1}}_{{r}}}{\mathrm{\alpha }}_{{\mathrm{SNP1}}_{{r}}} + {\mathrm{z}}_{{\mathrm{2}}_{{r}}}{\mathrm{\alpha }}_{{\mathrm{SNP2}}_{{r}}} + {\mathrm{z}}_{{\mathrm{3}}_{{r}}}{\mathrm{d}}_{{\mathrm{SNP}}_{{r}}}} \right)} + {\mathrm{error}}$$

where \({\mathrm{z}}_{{\mathrm{1}}_{{r}}} = {{q}}_{{r\mathrm{1}}}\), \({\mathrm{z}}_{{\mathrm{2}}_{{r}}} = {{q}}_{{r\mathrm{2}}}\), and \({\mathrm{z}}_{{\mathrm{3}}_{{r}}} = - {\mathrm{2}}{{q}}_{{r\mathrm{1}}}{{q}}_{{r\mathrm{2}}}\) if the SNP genotypes for the DH or inbred lines are CC (group 1) and CC (group 2), \({\mathrm{z}}_{{\mathrm{1}}_{{r}}} = - {{p}}_{{r\mathrm{1}}}\), \({\mathrm{z}}_{{\mathrm{2}}_{{r}}} = - {{p}}_{{r\mathrm{2}}}\), and \({\mathrm{z}}_{{\mathrm{3}}_{{r}}} = - {\mathrm{2}}{{p}}_{{r\mathrm{1}}}{{p}}_{{r\mathrm{2}}}\) if the SNP genotypes are cc (group 1) and cc (group 2), \({\mathrm{z}}_{{\mathrm{1}}_{{r}}} = {{q}}_{{r\mathrm{1}}}\), \({\mathrm{z}}_{{\mathrm{2}}_{{r}}} = - {{p}}_{{r\mathrm{2}}}\), and \({\mathrm{z}}_{{\mathrm{3}}_{{r}}} = {\mathrm{2}}{{q}}_{{r\mathrm{1}}}{{p}}_{{r\mathrm{2}}}\) if the SNP genotypes are CC (group 1) and cc (group 2), and \({\mathrm{z}}_{{\mathrm{1}}_{{r}}} = - {{p}}_{{r\mathrm{1}}}\), \({\mathrm{z}}_{{\mathrm{2}}_{{r}}} = {{q}}_{{r\mathrm{2}}}\), and \({\mathrm{z}}_{{\mathrm{3}}_{{r}}} = {\mathrm{2}}{{p}}_{{r\mathrm{1}}}{{q}}_{{r\mathrm{2}}}\) if the SNP genotypes are cc (group 1) and CC (group 2).
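The coding of the incidence values can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the study: genotypes are stored as 1 for CC and 0 for cc, the allele frequencies are taken as the proportion of CC lines in each group, and the function name `design_matrix` and the toy genotypes are hypothetical. The dominance column uses the identity \({\mathrm{z}}_{3} = -2{\mathrm{z}}_{1}{\mathrm{z}}_{2}\), which reproduces the SCA coefficients of the four single cross genotypic values given earlier.

```python
import numpy as np


def design_matrix(geno1, geno2, crosses):
    """Incidence matrix [z1 | z2 | z3] for single crosses between two groups.

    geno1, geno2: (lines x SNPs) arrays coded 1 = CC, 0 = cc.
    crosses: list of (i, j) pairs, line i of group 1 by line j of group 2.
    """
    p1 = geno1.mean(axis=0)          # frequency of C in group 1, per SNP
    p2 = geno2.mean(axis=0)          # frequency of C in group 2, per SNP
    c1 = geno1 - p1                  # q if the line is CC, -p if it is cc
    c2 = geno2 - p2
    i, j = np.asarray(crosses).T
    z1, z2 = c1[i], c2[j]
    z3 = -2.0 * z1 * z2              # covers the four genotype combinations
    return np.hstack([z1, z2, z3])


# Toy genotypes for illustration: 3 and 2 lines, 2 SNPs
geno1 = np.array([[1, 0], [0, 0], [1, 1]], dtype=float)
geno2 = np.array([[1, 1], [0, 1]], dtype=float)
Z = design_matrix(geno1, geno2, [(0, 0), (1, 1)])
```

Each row of `Z` holds the s values of \({\mathrm{z}}_{1}\), then the s values of \({\mathrm{z}}_{2}\), then the s values of \({\mathrm{z}}_{3}\), so the matrix has 3s columns. A SNP that is monomorphic within a group contributes a column of zeros for that group.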

Regarding the single crosses obtained from DH or inbred lines of the same population or heterotic group, we have

$${{y}} = {\mathrm{M}} + \mathop {\sum}\limits_{{{r}} = {\mathrm{1}}}^{{s}} {\left( {{\mathrm{z}}_{{\mathrm{1}}_{{r}}}{\mathrm{\alpha }}_{{\mathrm{SNP}}_{{r}}} + {\mathrm{z}}_{{\mathrm{2}}_{{r}}}{{d}}_{{\mathrm{SNP}}_{{r}}}} \right)} + {\mathrm{error}}$$

where \({\mathrm{z}}_{{\mathrm{1}}_{{r}}} = {{\mathrm{2}q}}_{{r}}\) and \({\mathrm{z}}_{{\mathrm{2}}_{{r}}} = - {{\mathrm{2}q}}_{{r}}^{\mathrm{2}}\) if the SNP genotypes for the two crossed DH or inbred lines are CC and CC, \({\mathrm{z}}_{{\mathrm{1}}_{{r}}} = - {{\mathrm{2}p}}_{{r}}\) and \({\mathrm{z}}_{{\mathrm{2}}_{{r}}} = - {{\mathrm{2}p}}_{{r}}^{\mathrm{2}}\) if the SNP genotypes are cc and cc, and \({\mathrm{z}}_{{\mathrm{1}}_{{r}}} = {{q}}_{{r}} - {{p}}_{{r}}\) and \({\mathrm{z}}_{{\mathrm{2}}_{{r}}} = {{\mathrm{2}p}}_{{r}}{{q}}_{{r}}\) if the SNP genotypes are CC and cc.

The statistical problem of genomic prediction when there is a very large number of molecular markers and relatively few observations has been addressed through several regularized whole-genome regression and prediction methods (Daetwyler et al. 2013; de Los Campos et al. 2013). Based on one of these approaches, the SNP average effects of substitution and SNP dominance deviations are predicted and used to provide genomic prediction of non-assessed single crosses. The predicted genotypic value for a non-assessed single cross of DH or inbred lines from two groups is

$${\tilde{\mathrm G}} = {\hat{\mathrm M}}_{\mathrm{H}} + \mathop {\sum}\limits_{{{r}} = {\mathrm{1}}}^{{s}} {\left( {{\mathrm{z}}_{{\mathrm{1}}_{{r}}}{\tilde{\mathrm \alpha }}_{{\mathrm{SNP1}}_{{r}}} + {\mathrm{z}}_{{\mathrm{2}}_{{r}}}{\tilde{\mathrm \alpha }}_{{\mathrm{SNP2}}_{{r}}} + {\mathrm{z}}_{{\mathrm{3}}_{{r}}}{\tilde{ d}}_{{\mathrm{SNP}}_{{r}}}} \right)}$$

For a non-assessed single cross of DH or inbred lines from the same group, the predicted genotypic value is

$${\tilde{\mathrm G}} = {\hat{\mathrm M}} + \mathop {\sum}\limits_{{{r}} = {\mathrm{1}}}^{{s}} {\left( {{\mathrm{z}}_{{\mathrm{1}}_{{r}}}{\tilde{\mathrm \alpha }}_{{\mathrm{SNP}}_{{r}}} + {\mathrm{z}}_{{\mathrm{2}}_{{r}}}{\tilde{d}}_{{\mathrm{SNP}}_{{r}}}} \right)}$$
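The two prediction steps described above (shrinkage estimation of SNP effects from the assessed single crosses, then summing coefficient-weighted effects for the non-assessed ones) can be illustrated with a plain ridge regression. This is a hedged sketch only: the function names are hypothetical, the shrinkage parameter `lam` is fixed by the caller, and no claim is made that this reproduces the variance-component estimation performed by the software used in the study. For the two-group model, the columns of `Z` would hold the z1, z2, and z3 coefficients side by side; for the same-group model, z1 and z2.

```python
import numpy as np

def ridge_marker_effects(Z, y, lam):
    """Estimate an intercept and shrunken marker effects by ridge regression.

    Z   : (n_crosses, n_columns) matrix of incidence coefficients
    y   : (n_crosses,) phenotypes of the assessed single crosses
    lam : ridge penalty (an assumption here; chosen by the caller)
    """
    Z = np.asarray(Z, dtype=float)
    y = np.asarray(y, dtype=float)
    z_mean = Z.mean(axis=0)
    y_mean = y.mean()
    Zc = Z - z_mean  # centering keeps the intercept unpenalized
    lhs = Zc.T @ Zc + lam * np.eye(Z.shape[1])
    effects = np.linalg.solve(lhs, Zc.T @ (y - y_mean))
    intercept = y_mean - z_mean @ effects
    return intercept, effects

def predict_genotypic_values(Z_new, intercept, effects):
    """Predicted genotypic value: estimated mean plus sum of coefficient x effect."""
    return intercept + np.asarray(Z_new, dtype=float) @ effects
```

A larger `lam` shrinks all effects toward zero, which is what makes the regression solvable when the number of SNP columns exceeds the number of assessed single crosses.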

Simulation

The SNP and QTL genotypes for DH lines, the QTL genotypes for single crosses, and the phenotypes for DH lines and single crosses were simulated using the software REALbreeding (available on request). The software does not assume a distribution for the LD values and gene effects, but computes the true LD values and gene effects based on quantitative genetics theory (Viana 2004). SNP and QTL allele frequencies follow a beta distribution. The parameters m, a, and d for each QTL are computed from the maximum and minimum genotypic values for homozygotes supplied by the user. Based on our input, the software distributed 10,000 SNPs and 400 QTLs on ten chromosomes (1000 SNPs and 40 QTLs per chromosome). The average SNP density was one SNP every 0.1 cM. The QTLs were distributed in the regions covered by the SNPs (~100 cM/chromosome).

The genotypic values of the DH lines and single crosses were generated assuming a single set of 400 QTLs and two degrees of dominance. To simulate grain yield and expansion volume (a measure of popcorn quality), we defined positive dominance (0 < degree of dominance ≤ 1.2) and bidirectional dominance (−1.2 ≤ degree of dominance ≤ 1.2), respectively. For grain yield and expansion volume, the maximum and minimum genotypic values for homozygotes were 140 and 30 g/plant and 55 and 15 mL/g, respectively. The phenotypic values were obtained from the sum of the population mean, genotypic value, and experimental error. The error variance was computed from the broad sense heritability.
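The last two sentences can be made concrete with a short sketch. It assumes that the error variance is obtained from the genotypic variance as σ²e = σ²G (1 − h²)/h², and that errors are independent and normally distributed; both the normality and the function name are assumptions of this illustration, not details taken from REALbreeding. The genotypic values are treated here as deviations that are added to the population mean.

```python
import numpy as np

def simulate_phenotypes(genotypic_values, heritability, population_mean=0.0, rng=None):
    """Phenotype = population mean + genotypic value + error.

    The error variance is set so that var(G) / (var(G) + var(E)) equals the
    requested broad sense heritability.
    """
    g = np.asarray(genotypic_values, dtype=float)
    if not 0.0 < heritability <= 1.0:
        raise ValueError("heritability must be in (0, 1]")
    rng = np.random.default_rng() if rng is None else rng
    var_g = g.var(ddof=1)
    var_e = var_g * (1.0 - heritability) / heritability
    # Normal errors are an assumption of this sketch.
    errors = rng.normal(0.0, np.sqrt(var_e), size=g.shape)
    return population_mean + g + errors
```

Under this construction the expected correlation between phenotype and genotypic value is the square root of the heritability, and a heritability of 100% returns the genotypic values without error.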

Initially, the software simulated 350 S0 plants of the first heterotic group. The population was a second generation composite. In a composite, there is LD only for linked SNPs and QTLs (Viana et al. 2016). Then the software sampled one (scenario 1) or one to five (scenario 2) gametes from the S0 plants, generating 350 DH lines. To generate 350 DH lines from S3 plants, the software selfed S0 plants for three generations using the single seed descent process. The number of DH lines per S3 plant ranged from one to five (scenario 3). For each DH line sampling process, the software selected 70 DH lines, assuming a trait heritability of 30%. The same computational procedures provided the three groups of 70 selected DH lines from the second heterotic group (a second composite). For each DH line sampling process, the software crossed 70 × 70 DH lines to generate 4900 single crosses.

To investigate the efficiency of genomic prediction of untested single crosses when there is no heterotic group (relevant for rice, wheat, and barley breeders), the software also crossed 70 selected DH lines from the same heterotic group for generating 2415 single crosses (scenario 4). To highlight that the efficiency of genomic prediction of untested single crosses does not depend on the LD in the reference population, but on the LD in the groups of selected DH lines, the same computational procedures were used to derive 70 selected DH lines from the first and second heterotic groups after 10 generations of random crosses (to decrease the LD) (scenario 5).

Data files

The data for processing was obtained from 50 random samplings of 1470 (30%) and 490 (10%) of the single crosses to be assessed, assuming a trait heritability of 30, 60, and 100%. Thus, the genotypic value prediction accuracies of the assessed single crosses were 0.55, 0.77, and 1.00, respectively. With no exception, all DH lines from both heterotic groups were represented in the tested single crosses. Additionally, to assess the relevance of the number of DH lines sampled, we fixed the number of DH lines in each heterotic group to achieve approximately the same number of assessed single crosses using a diallel. That is, we sampled 38 and 22 DH lines in each heterotic group 50 times for a diallel (scenario 6), generating 1444 (30%) and 484 (10%) single crosses for assessment, respectively. In this case, only 54 and 31% of the DH lines are represented in the tested single crosses. We denoted these processes as sampling of single crosses and sampling of DH lines.
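The contrast between the two sampling processes can be sketched as follows. The function names are hypothetical and the code only reproduces the counting logic described above (70 × 70 possible single crosses, a random 30% or 10% of them versus a diallel among 38 or 22 sampled DH lines per group); it is not the procedure implemented in REALbreeding.

```python
import numpy as np

def sample_single_crosses(n_lines, fraction, rng):
    """Randomly sample a fraction of all n_lines x n_lines single crosses."""
    crosses = np.array([(i, j) for i in range(n_lines) for j in range(n_lines)])
    n_train = int(round(fraction * len(crosses)))
    chosen = rng.choice(len(crosses), size=n_train, replace=False)
    return crosses[chosen]

def sample_lines_for_diallel(n_lines, n_sampled, rng):
    """Sample DH lines in each group and cross every sampled pair (diallel)."""
    group1 = rng.choice(n_lines, size=n_sampled, replace=False)
    group2 = rng.choice(n_lines, size=n_sampled, replace=False)
    return np.array([(i, j) for i in group1 for j in group2])

def lines_represented(crosses):
    """Number of distinct DH lines from group 1 and group 2 among the crosses."""
    return len(np.unique(crosses[:, 0])), len(np.unique(crosses[:, 1]))
```

For example, `lines_represented(sample_lines_for_diallel(70, 38, rng))` is `(38, 38)` by construction, whereas a random sample of 1470 single crosses will almost surely involve all 70 lines of each group.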

Assuming no heterotic groups, we proceeded to 50 random samplings of 724 (30%) and 241 (10%) of the single crosses from the same heterotic group for testing, also assuming a trait heritability of 30, 60, and 100%. With few exceptions when sampling 10% of the single crosses for testing, all DH lines from the heterotic group were represented in the assessed single crosses. The last scenario was genomic prediction of untested single crosses under an average density of one SNP for each cM. This lower density was obtained by random sampling of 100 SNPs per chromosome using a REALbreeding tool (sampler).
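The reduction in marker density amounts to keeping a random subset of SNPs on each chromosome. The sketch below illustrates that step only; the function name and the input layout (one chromosome label per SNP) are assumptions, and the REALbreeding sampler tool may work differently.

```python
import numpy as np

def thin_snps(chromosome_of_snp, n_keep, rng):
    """Return sorted indices of n_keep randomly chosen SNPs per chromosome."""
    chromosome_of_snp = np.asarray(chromosome_of_snp)
    kept = []
    for chrom in np.unique(chromosome_of_snp):
        idx = np.flatnonzero(chromosome_of_snp == chrom)
        kept.append(np.sort(rng.choice(idx, size=n_keep, replace=False)))
    return np.concatenate(kept)
```

With 1000 SNPs on each of ten chromosomes and `n_keep=100`, the result indexes 1000 SNPs in total, that is, about one SNP per cM on a ~100 cM chromosome.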

Statistical analysis

The methods used for prediction of the non-assessed single crosses (70 and 90% of the single crosses) were ridge regression BLUP (RR-BLUP), GBLUP, and pedigree-based BLUP. We used the rrBLUP package (Endelman 2011) for the analyses. To investigate the single cross prediction efficiency based on our model and on the models proposed by Massman et al. (2013) and Technow et al. (2012), we used another REALbreeding tool (Incidence matrix) to generate the incidence matrices for the three models and for the two DH line sampling processes. We also fitted the additive model (including only the GCA effects) to assess the relevance of the SCA effects on genomic prediction of single cross performance. The accuracies of single cross genotypic value prediction were obtained from the correlation between the true genotypic values of the non-assessed single crosses computed by REALbreeding and the values predicted by RR-BLUP, GBLUP, and BLUP. We also computed the efficiency of identification of the 300 non-assessed single crosses of higher genotypic value (coincidence index). The coincidence index was computed as the number of the best 300 predicted untested single crosses among the 300 untested single crosses of greater true genotypic value divided by 300. For each DH lines derivation process and heritability, the parametric average coincidence index was computed from the average phenotypic values of the 4,900 single crosses as the number of the 300 single crosses of greater average phenotypic value among the 300 single crosses of greater true genotypic value divided by 300. Regarding grain yield, for heritability of 30% the coincidence index was 0.2533, 0.2833, and 0.2433 assuming one DH line per S0 plant, one to five DH lines per S0 plant, and one to five DH lines per S3 plant, respectively. The corresponding values for heritability of 60% were, respectively, 0.4800, 0.4900, and 0.4567. 
Concerning expansion volume, the corresponding values for heritabilities of 30 and 60% were, respectively, 0.2600, 0.2833, and 0.2700, and 0.4733, 0.5100, and 0.4533. The assumed average parametric coincidence index was 0.26 and 0.48 for heritabilities of 30 and 60%, respectively, for both traits.
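The two efficiency measures defined in this section can be written compactly. The following is a minimal sketch with hypothetical function names: accuracy is the Pearson correlation between true and predicted genotypic values, and the coincidence index is the share of the best `n_best` predicted single crosses that are also among the best `n_best` by true genotypic value.

```python
import numpy as np

def prediction_accuracy(true_values, predicted_values):
    """Pearson correlation between true and predicted genotypic values."""
    return float(np.corrcoef(true_values, predicted_values)[0, 1])

def coincidence_index(true_values, predicted_values, n_best=300):
    """Fraction of the top n_best predicted crosses that are truly in the top n_best."""
    true_values = np.asarray(true_values, dtype=float)
    predicted_values = np.asarray(predicted_values, dtype=float)
    top_true = set(np.argsort(true_values)[-n_best:].tolist())
    top_pred = set(np.argsort(predicted_values)[-n_best:].tolist())
    return len(top_true & top_pred) / n_best
```

The parametric coincidence index described above follows from the same function by passing the average phenotypic values of all single crosses in place of the predicted values.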

Results

Using our model, an average SNP density of 0.1 cM, a training set size of 30%, positive dominance (grain yield), the additive-dominance model, and sampling of single crosses, the prediction accuracies of the non-assessed single crosses were greater than the accuracies of the assessed single crosses for low (up to 46% higher) and intermediate (up to 16% higher) heritabilities (Table 1; Fig. 1a). As the prediction accuracy of assessed single crosses approaches 1.0, the accuracy of the non-assessed single crosses approaches ~0.9 (up to 11% lower). Sampling one to five DH lines per S3 plant was only slightly superior to the other DH line derivation processes, regardless of the prediction accuracy of the assessed single crosses (up to 5% higher). Fitting the additive model provided essentially the same prediction accuracies, since the maximum decrease was ~1%. No significant differences between the prediction accuracies of non-assessed single crosses were observed when assuming bidirectional dominance (expansion volume); the differences compared to positive dominance ranged from approximately −5 to 2%. However, a striking difference was observed between the sampling processes of single crosses for testing. Random sampling of single crosses provided higher prediction accuracies of non-assessed single crosses than sampling DH lines for a diallel. The increases in accuracy from sampling single crosses ranged from ~38 to 77%, proportional to the heritability. Decreasing the average SNP density to 1 cM led to a slight decrease of approximately 4% in the prediction accuracy of non-assessed single crosses. Decreasing the training set size to 10% decreased the prediction accuracy of non-assessed single crosses by approximately 5 to 15%, inversely proportional to the heritability. 
To establish that the prediction accuracy of non-assessed single crosses depends on the level of (overall) LD in the groups of selected DH or inbred lines, we derived DH lines from the same base populations after 10 generations of random crosses (to decrease the LD). The accuracies were also high, ranging from 0.83 to 0.95, proportional to the heritability. The prediction accuracies of non-assessed single crosses from DH lines of the same population were equivalent to the accuracies for single crosses derived from DH lines belonging to distinct heterotic groups, ranging from 0.83 to 0.91, also proportional to the heritability. When comparing our statistical model with those proposed by Massman et al. (2013) and Technow et al. (2012), we observed no differences in the prediction accuracies of non-assessed single crosses (maximum difference of 1%). Interestingly, the Massman et al. (2013) and Technow et al. (2012) models provide identical accuracies. Finally, no significant differences between the prediction accuracies for RR-BLUP, GBLUP, and BLUP occurred (maximum of 2%), except for one to five DH lines per S3 plant, where BLUP was 9 to 10% lower, regardless of the heritability.

Table 1 Average prediction accuracies of non-assessed single crosses and their standard deviations, assuming single crosses from selected DH lines, 30 and 10% of assessed single crosses, two traits, two sampling processes of single crosses, four statistical models, three DH line sampling processes, two genetic models, and three accuracies of assessed single crosses
Fig. 1 Predicted accuracies (a) and coincidence indexes (b) for untested single crosses (square: 1 DH line/S0; triangle: 1–5 DH lines/S0; circle: 1–5 DH lines/S3), and parametric accuracies (a) and coincidence indexes (b) for tested single crosses (continuous line), assuming our model, average SNP density of 0.1 cM, training set size of 30%, positive dominance (grain yield), additive-dominance model, and sampling of single crosses

Concerning the coincidence index, in general the inferences are the same as those established from the prediction accuracy analysis (Table 2; Fig. 1b). There were no differences between the coincidence indexes regarding our model and the models proposed by Massman et al. (2013) and Technow et al. (2012) (maximum difference of 3%) and between the RR-BLUP, GBLUP, and BLUP approaches, except for one to five DH lines per S3 plant, where the coincidence for BLUP was 19 to 27% lower, proportional to the heritability. The coincidence indexes were also high for single crosses derived from selected DH lines obtained from the base populations with lower LD (ranging from 0.55 to 0.76, proportional to the heritability) and from selected DH lines of the same population (ranging from 0.61 to 0.76, also proportional to the heritability). Sampling single crosses for assessment also provided a higher coincidence index compared to sampling DH lines for a diallel (39 to 98% higher, proportional to the heritability). Decreasing the SNP density and the training set size decreased the coincidence index by 5 to 10% (proportional to the heritability) and by 17 to 26% (inversely proportional to the heritability), respectively. The maximum difference in the coincidence index between the additive-dominance and the additive models was −3%. Only for one DH line per S0 plant and assuming bidirectional dominance were the coincidence indexes slightly greater than the values obtained assuming positive dominance (9–14% greater). This DH line sampling process also provided higher coincidence index values than the other sampling processes (7–26% higher, inversely proportional to the heritability). Finally, the coincidence index values of the non-assessed single crosses were greater than the parametric values for all assessed single crosses when assuming low (up to 117% higher) and intermediate (up to 39% higher) heritabilities (Table 2). However, as the parametric coincidence of assessed single crosses approached 1.0, the coincidence values of the non-assessed single crosses approached 0.60–0.74 (up to 26–40% lower), depending on the DH line sampling process.

Table 2 Average coincidence of the best 300 predicted single crosses and its standard deviation, assuming single crosses from selected DH lines, 30 and 10% of assessed single crosses, two traits, two sampling processes of single crosses, four statistical models, three DH line sampling processes, two genetic models, and three parametric coincidences of assessed single crosses

Discussion

Bernardo (1994) first suggested using BLUP for predicting untested maize single cross performance. Based on the prediction accuracies obtained by Bernardo (1994, 1995, 1996a, 1996b, 1996c) for grain yield and other traits (distinct genetic controls), a breeder should realize that the performance of untested single crosses can be effectively predicted using relationship information from molecular or pedigree data, unbalanced and large data set, and diverse heterotic patterns. The significance of genomic prediction has been confirmed with maize (Zhao et al. 2015) and other important crops, such as rice (Xu et al. 2014), wheat (Zhao et al. 2013b), and barley (Philipp et al. 2016). However, there has been no published evidence that the prediction of untested single crosses is of general use by breeders of worldwide seed companies. Additional proof may be needed to make the prediction of untested single crosses as successful as Jenkins’ (1934) method for predicting double-cross performance. This paper offers a significant contribution in this direction.

Our assessment on efficiency of prediction of untested single cross performance maintains some similarities with a few earlier studies, but there are sharp differences compared to most investigations. This study is based on a simulated data set, an approach also used by Technow et al. (2012), assuming 400 QTLs distributed along ten chromosomes. Thus, the prediction accuracies and coincidence indexes (a measure of untested single crosses selection efficiency) are available for non-assessed single crosses since the values were computed based on the true genotypic values of the non-assessed single crosses and not on a cross-validation procedure involving assessed single crosses. This does not mean that we consider simulated data to be better than field data or have any criticism of the cross-validation procedure. Because of the assumptions, we know that simulated data cannot integrally describe the complexity of populations and genetic determination of traits (Daetwyler et al. 2013). To highlight the relevance of (overall) LD, our study is based on conditions that are not favorable to the prediction of untested single cross performance: a very low level of relatedness between the DH lines, low and intermediate heritabilities for the assessed single crosses, and not a higher heterotic pattern. In studies by Massman et al. (2013) and Bernardo (1994, 1995, 1996a), the coancestry coefficient between inbreds from the same heterotic group ranged from 0.11 to 0.58. Riedelsheimer et al. (2012) observed high relatedness only between the non-Stiff Stalk inbreds. Technow et al. (2012) assumed non-related inbreds. For most of the investigations on prediction of untested single crosses and testcrosses, the grain yield heritability ranged from 0.72 to 0.88. The common heterotic patterns in these studies are Stiff Stalk and non-Stiff Stalk and Dent and Flint. 
The minor allele frequencies in the groups of Dent and Flint inbreds were ~0.10 and 0.20, respectively, and ~20% of the SNPs showed a difference in allele frequency of at least 0.60.

Concerning the prediction accuracy and the efficiency of identification of the best 300 non-assessed single crosses, our results show that the prediction of untested single crosses is a very efficient procedure (note that we are not saying genomic prediction), especially for low and intermediate heritabilities of the assessed single crosses. The prediction accuracies of the non-assessed single crosses under low (0.55–0.71) and intermediate (0.74–0.87) accuracies of assessed single crosses reached 0.85 and 0.89, respectively. It is important to highlight that these are not relative accuracies. Most importantly, the coincidence of the non-assessed single crosses under low (0.26–0.39) and intermediate (0.44–0.66) parametric coincidences of assessed single crosses reached 0.59 and 0.64, respectively. For high heritability (80–95%; accuracies from 0.89 to 0.97), as observed in most studies on prediction of untested single cross performance, we can state (based on values predicted by fitting a quadratic regression model) that the prediction accuracy of non-assessed single crosses is at most 10% lower (0.87–0.92). Most impressively, the coincidence index can range from 0.61 to 0.71 (parametric coincidences between 0.72 and 0.93). Under maximum accuracy of assessed single crosses (1.00), the prediction accuracy and coincidence of non-assessed single crosses reached 0.93 and 0.76, respectively. Thus, assuming high heritability, high SNP density, and a training set size of 30%, the accuracy can reach 0.92 and the efficiency of identification of the best 9% of the non-assessed single crosses can reach 0.71. It is important to highlight that this efficiency can be increased by using more related DH or inbred lines, under high LD. Thus, we strongly recommend that maize breeders, as well as rice, wheat, and barley breeders, make widespread use of prediction of non-assessed single crosses, at least for preliminary screening or prior to field testing.

To take advantage of genomic prediction, Kadam et al. (2016) recommend redesigning hybrid breeding programs. However, because breeders are unlikely to rely solely on genomic predictions when selecting superior untested hybrids, Technow et al. (2014) believe that genomic prediction will be combined with field testing of the most promising experimental hybrids. For grain yield, the prediction accuracies observed by Bernardo (1994, 1995, 1996a) ranged from 0.14 to 0.80, proportional to the heritability (in the range 35–74%) and training set size. The non-relative accuracies (relative accuracy × square root of heritability) observed in the studies of Kadam et al. (2016), Technow et al. (2014), Massman et al. (2013), Technow et al. (2012), and Riedelsheimer et al. (2012) ranged between 0.20 and 0.86, also proportional to the heritability (in the range 53–98%) and training set size.
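As a purely illustrative worked case of the conversion given in parentheses above (the numbers are hypothetical and not taken from any of the cited studies), a relative accuracy of 0.80 reported for a trait with heritability 0.64 corresponds to a non-relative accuracy of 0.64:

$$r_{\mathrm{non\text{-}relative}} = r_{\mathrm{relative}} \times \sqrt{h^{2}}, \qquad 0.80 \times \sqrt{0.64} = 0.80 \times 0.80 = 0.64$$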

We hope that readers have realized the importance of (overall) LD, as well as of genetic variability, for effective prediction of non-assessed single crosses. Breeders have no control over LD and relatedness between the DH or inbred lines. However, selection should always provide a high level of overall LD in the groups of selected DH or inbred lines. Comparison of our LD assessment with the LD analyses from other studies is not appropriate because our distances are in cM and not in base pairs. In general, however, the level of LD in those studies was high (\(r^{2}\) of ~0.3) only for SNPs separated by up to 0.5 Mb (Massman et al. 2013; Riedelsheimer et al. 2012; Technow et al. 2012, 2014). To maximize the prediction accuracy and the efficiency of identification of the best non-assessed single crosses, it is necessary to adopt random sampling of single crosses for testing instead of random sampling of DH or inbred lines for a diallel. This is because sampling 30 or even 10% of the single crosses leads to single crosses for testing derived from all DH or inbred lines from each group. In our case, in every resampling assuming training set sizes of 30 and 10%, we always obtained groups of assessed single crosses (1470 and 490 single crosses, respectively) derived from the 70 DH lines of each group. However, sampling DH lines for a diallel provided 1444 and 484 single crosses for testing derived from 38 and 22 DH lines, respectively. Thus, the sampling of single crosses provides the best prediction of the SNP average effects of substitution and dominance deviations. Riedelsheimer et al. (2012) emphasized the need for large genetic variability to obtain high prediction accuracies. Furthermore, their results indicated that pairs of closely related lines and population structuring only weakly contributed to the high prediction accuracies. 
Because dominance can be a relevant genetic effect, breeders should always fit the additive-dominance model to maximize the prediction accuracy and the efficiency of identification of the best non-assessed single crosses. Interestingly, in most of the studies on prediction of non-assessed single crosses the prediction accuracy did not increase significantly when modeling SCA in addition to GCA effects (Zhao et al. 2015).

Concerning SNP density and training set size, factors related to the costs of genotyping and phenotyping, breeders should find a balance between efficiency and expenses, since maximizing SNP density and training set size maximizes the efficiency of untested single cross prediction. Based on our results, because the decreases in the prediction accuracy (~4%) and coincidence index (5–10%) caused by decreasing the average SNP density from 0.1 to 1 cM are small, we consider it sufficient to employ custom genotyping that provides an average SNP density of 1 cM. Decreasing the training set size from 30 to 10% of the single crosses does not significantly affect the prediction accuracy under intermediate to high heritability (decrease of up to 9%), but the coincidence index can be reduced by up to 21%. However, considering that the coincidence index will be kept in the range 0.48–0.61, proportional to the heritability, and that the maximum values are in the range 0.48 to 0.61, we also consider it sufficient to assess at least 10% of the possible single crosses. As highlighted by Zhao et al. (2015), marker density only marginally affects the prediction accuracy of untested single crosses, and for biparental populations a plateau for the accuracy is reached with a few hundred markers. Technow et al. (2014) did not find an improvement in prediction accuracies when using higher SNP density. Additionally, increasing the training set size led to a relatively small increase in the prediction accuracy. However, the prediction accuracies obtained by Riedelsheimer et al. (2012) under high density (38,019 SNPs) were substantially higher than those reached with a low-density marker panel (1152 SNPs). In the study of Technow et al. (2012), the prediction accuracies increased with SNP density and number of parents tested in hybrid combination.

The DH line sampling process, heterotic pattern, and statistical approach need not concern breeders. However, under high heritability, sampling more than one DH line per S0 or S3 plant provided higher coincidence values and high prediction accuracy in our study. For rice, wheat, and barley breeders, our message is that high prediction accuracy and high efficiency of identification of the best non-assessed single crosses do not depend on heterotic groups but on the (overall) LD in the group or in each group of DH or inbred lines. In other words, the efficiency of prediction of non-assessed single crosses derived from DH or inbred lines from the same population can be as high as the prediction efficiency of untested single crosses derived from DH or inbred lines from distinct heterotic groups. This was not confirmed when comparing the relative prediction accuracies for the grain yield of maize untested single crosses (from ~0.50 to 0.95, for most studies) with those obtained with rice, wheat, and barley untested hybrids (0.50–0.60, approximately) (Philipp et al. 2016; Xu et al. 2014; Zhao et al. 2013b). However, the lower relative prediction accuracies for untested rice, wheat, and barley hybrids are probably due to the prediction of two-way and three-way crosses. Regarding the statistical approach, our model did not provide an increase in the efficiency of non-assessed single cross prediction compared to the models proposed by Massman et al. (2013) and Technow et al. (2012). Importantly, our results showed that these two models are in fact identical (data not shown). Thus, because of the simplified definition of the incidence matrices for these two previous models, it is quite safe to use either of them. Finally, the choice between the statistical approaches RR-BLUP (based on prediction of SNP average effects of substitution and dominance deviations), GBLUP (based on additive and dominance genomic matrices), and pedigree-based BLUP (prediction of genotypic values of non-assessed single crosses based on additive and dominance matrices from pedigree records) should not be a serious concern for breeders either. Our evidence is that there is no significant difference between RR-BLUP and GBLUP regarding the prediction accuracy and efficiency of identification of the best untested single crosses. Furthermore, even when the level of relatedness between the DH or inbred lines in each group is low, pedigree-based BLUP is generally as efficient as genomic prediction, except when the DH lines are derived from an inbred population. Thus, DNA polymorphism is not essential for efficient prediction of non-assessed single cross performance. In a review on genomic selection in hybrid breeding, Zhao et al. (2015) state that the choice of the biometrical model has no substantial impact on the prediction accuracy of untested single crosses. Technow et al. (2014) observed that the GBLUP and BayesB prediction methods resulted in very similar prediction accuracies. According to Massman et al. (2013), the pedigree-based BLUP and RR-BLUP models did not lead to significantly different prediction accuracies. Technow et al. (2012) concluded that BayesB produced significantly higher accuracies for the additive-dominance model than GBLUP.

Our main contributions to the assessment of prediction efficiency of untested single cross performance are the following: (1) the prediction accuracy of untested single crosses ranged from ~0.80 to 0.90 as the heritability of tested single crosses ranged from low (30%) to high (100%); however, the efficiency of identification of the best 9% of the untested single crosses ranged from ~0.50 to 0.70, depending on the DH line sampling process; (2) the prediction accuracy for crops showing no defined heterotic pattern can be as high as for maize, for which there are well-defined heterotic groups; this is because the most important factor affecting the prediction efficiency is the overall LD; (3) to maximize prediction accuracy and coincidence, the choice of single crosses for testing should be based on a random process; this procedure maximizes the number of DH lines in hybrid combinations and provides better predictions of the SNP average effects of substitution and dominance deviations compared to sampling DH lines for a diallel; (4) because of the non-significant decreases in the prediction accuracy and coincidence, the prediction of untested single crosses can be efficient when assuming a reduced training set size (10%) and an SNP density of 1 cM; (5) RR-BLUP and GBLUP provide equivalent prediction efficiencies of untested single crosses; (6) except for DH lines derived from inbred populations, pedigree-based BLUP is as efficient as genomic prediction of untested single crosses; and (7) the theoretical accuracy shows that the prediction accuracy is not affected by the linkage phase.

Data archiving

The data set is available at https://doi.org/10.6084/m9.figshare.5035130.v3.