# Joint Analysis of Strain and Parent-of-Origin Effects for Recombinant Inbred Intercrosses Generated from Multiparent Populations with the Collaborative Cross as an Example

^{*}School of Mathematics and Statistics, Wuhan University, Hubei 430072, China^{†}Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington 98109^{‡}Department of Biostatistics, University of North Carolina at Chapel Hill, North Carolina 27599

- 1Corresponding author: Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599. E-mail: feizou{at}email.unc.edu

## Abstract

Multiparent populations (MPP) have become popular resources for complex trait mapping because of their wider allelic diversity and larger population size compared with traditional two-way recombinant inbred (RI) strains. In mice, the collaborative cross (CC) is one of the most popular MPP and is derived from eight genetically diverse inbred founder strains. The strategy of generating RI intercrosses (RIX) from MPP in general and from the CC in particular can produce a large number of completely reproducible heterozygote genomes that better represent the (outbred) human population. Since both maternal and paternal haplotypes of each RIX are readily available, RIX is a powerful resource for studying both standing genetic and epigenetic variations of complex traits, in particular, the parent-of-origin (PoO) effects, which are important contributors to many complex traits. Furthermore, most complex traits are affected by >1 genes, where multiple quantitative trait locus mapping could be more advantageous. In this paper, for MPP-RIX data but taking CC-RIX as a working example, we propose a general Bayesian variable selection procedure to simultaneously search for multiple genes with founder allelic effects and PoO effects. The proposed model respects the complex relationship among RIX samples, and the performance of the proposed method is examined by extensive simulations.

- imprinting
- Bayesian variable selection
- parameter-expanded Gaussian priors
- Multiparental Populations
- MPP

Most human genes have functional mouse counterparts, and genomes of both organisms are organized similarly. Thus, mouse serves as a good model organism for complex human diseases. Recombinant inbred (RI) mice are among the major mouse resources in biomedical and genetic research. However, traditional mouse RI lines are derived from only two inbred parental strains, with limited a number of lines available, resulting in a low percentage (15%) of genetic variation across all mouse inbred strains (Yuan *et al.* 2011) and extensive blind spots where a sizable proportion of the genome is identical by descent. These limitations make traditional RI insufficient for studying complex traits. With the need for more powerful resources, multiparent populations (MPP) (de Koning and McIntyre 2017), a set of inbred lines using multiple lines as founders, can overcome the limitations of traditional RI lines and have become an innovative tool for fine quantitative trait locus (QTL) mapping. In the past 15 yr, different kinds of MPP have been established in plants and animals, such as nested association mapping design (Yu *et al.* 2008) and multiparent advanced generation intercross (MAGIC) populations (Cavanagh *et al.* 2008; Kover *et al.* 2009; Huang *et al.* 2015; Ladejobi *et al.* 2016) in plants, and the *Drosophila* Synthetic Population Resource (King *et al.* 2012) in animals. The mouse MPP include the collaborative cross (CC) (Complex Trait Consortium 2004), in which a genetically diverse set of eight inbred strains were selected as breeding founders (Iraqi *et al.* 2008) to generate a large number of RI lines. The eight founder strains were predicted to represent ∼90% of the genetic variation presented in laboratory mice (Threadgill and Churchill 2012). The CC thus greatly overcomes the limitations of the traditional mouse RI lines, and is the only mammalian resource with high genome-wide genetic variation that is uniformly distributed across a large, heterogeneous, and infinitely reproducible population (Threadgill *et al.* 2011).

Through the generation of RI intercrosses (RIX) of RI lines, a large number of potential “outbred” RIX samples can be generated. That is, given *L* RI lines, or genetically distinct reciprocal or nonreciprocal individuals, or RIX, can be produced. An example of using CC-RIX to illustrate a distinct response to Ebola virus infection, by Rasmussen *et al.* (2014), was summarized by the editor: “the CC-RIX mice could prove valuable for preliminary screens of candidate therapeutics and vaccines.” Since all parental RI lines are isogenic at each locus, genotypes of RIX can be imputed in advance from those of their parental RI lines. Additional advantages of RIX can be found in Threadgill and Churchill (2012), in particular, its power in support of analysis of parent-of-origin (PoO) effects, where effects of certain alleles are different depending on whether those alleles are inherited maternally or paternally. PoO makes a significant contribution to the heritability of most complex traits (Mott *et al.* 2014). In addition, genomic PoO effects provide a great model to study epigenetic regulation of gene expression (Barlow 2011).

During the past 20 yr, QTL mapping methods, including analysis of variance (ANOVA), interval mapping (Lander and Botstein 1989), composite interval mapping (Jansen and Stam 1994; Zeng 1994), and multiple interval mapping (Kao and Zeng 1997; Kao *et al.* 1999), have been well developed using experimental crosses, such as backcross, , and RI (see Broman (2001) for reviews), and many excellent open software packages, such as QTLCart (Basten *et al.* 1999), MapManager (Manly and Olson 1999), and R/qtl (Broman *et al.* 2003), are freely available online. For diallel data obtained from the CC founder strains, Lenarcic *et al.* (2012) employed a general Bayesian model for decomposing phenotypic variance into biologically intuitive components. Crowley *et al.* (2014) also applied a Bayesian method to a diallel cross of the eight founder strains to estimate genome-wide genetic and PoO effects (not QTL mapping) on responses to haloperidol, an antipsychotic drug. Zhang *et al.* (2014) proposed a general single QTL Bayesian framework for MAGIC data to coherently estimate haplotype and diplotype effects of founder alleles. Similarly, for MAGIC data, Wei and Xu (2016) developed a single QTL mapping method with a random-effects model by treating the founder allelic effects of each locus as random, and scanning the entire genome one locus at a time using a likelihood ratio test. For RIX mice, a mixed-effects model was developed for modeling the unbalanced relatedness among them (Tsaih *et al.* 2005; Zou *et al.* 2005). Gong and Zou (2012) proposed a more flexible, nonparametric single QTL mapping method for detecting QTL with time-varying coefficients. Hallin *et al.* (2016) applied a random-effects model to map single QTL by partitioning the variance of growth traits across different environments of yeast strains into additive, dominance, and pairwise epistatic components. However, many complex traits are affected by >1 gene, and multiple QTL mapping may be more powerful than locus-by-locus analysis. For CC-RIX data, Yuan *et al.* (2011) constructed a mixed-effects model to simultaneously map multiple QTL, but treated the eight CC founder alleles as standard biallelic single-nucleotide polymorphisms (SNPs). For many complex traits, it is arguable that modeling the effects of the eight founder alleles could lead to improved mapping power (Collaborative Cross Consortium 2012; Vered *et al.* 2014). In addition, for complex traits where PoO effects are suspected, modeling such effects may further improve QTL mapping power (Lawson *et al.* 2013) and understanding of etiologies of complex traits (Threadgill and Churchill 2012).

In this paper, for MPP-RIX data, we develop a Bayesian variable selection procedure to simultaneously map multiple genes with founder allelic effects and PoO effects. We demonstrate our method with CC-RIX data, but the method is general enough for other MPP-RIX populations. We place parameter-expanded Gaussian (PeG) priors on both the random founder allelic effects and PoO effects for variable selection.

The paper is organized as follows. In the *Statistical Method* section, we first introduce the CC-RIX experiment, then propose a random-effects model and describe a Bayesian variable selection procedure. In the *Simulation Study* section, we perform extensive simulation to examine the proposed method. Summary comments are given in the *Discussion* section.

## Statistical Method

The CC-RIX panel is the RI intercross of CC lines. For *L* CC RI lines, a total of reciprocal CC-RIX and nonreciprocal CC-RIX can be generated [see Figure 1 of Zou *et al.* (2005), Yuan *et al.* (2011)]. Let *n* be the total number of CC-RIX samples and *p* be the total number of genetic markers. Further, let the phenotype of individual *i* (the dependent variable) be

### Model

In order to account for the unbalanced relatedness between the CC-RIX mice and model the eight founder alleles and PoO effects, we extend the mixed additive random-effects model of Gong and Zou (2012) and Yuan *et al.* (2011) as follows:(1)where and are binary {0, 1} variables used to decide whether the *j*th QTL and PoO should be included in or excluded from the model and µ is the overall mean; and the *k*th element of equals 2, 1, or 0, depending on whether inherits 2, 1, or 0 copies of the *k*th founder allele (*k* = 1, ⋯, 8) at the *j*th candidate locus; of parents of that are equal to Let be an *n* × L matrix whose th element equals Clearly, for all *i* = 1, 2, ⋯, *n*, since each CC-RIX has two and only two parents. Therefore, represents the *k*th founder allelic effect of locus *j*, and represents the random polygenic effect of founder strain *l*. We let follow and the random error follow

At a given locus, PoO effects can only be estimated from individuals with heterogeneous genotypes, since for those with homozygote genotypes, the parental contributions cannot be distinguished. Label the eight-founder alleles A, B, C, D, E, F, G, and H by numbers 1–8, and let the element of be equal to 1 or −1 if the maternal or paternal allele of the *i*th subject at the *j*th locus equals the *k*th founder allele, and 0 otherwise. For those CC-RIX samples with a homozygous genotype at a given locus, such as AA or BB, we let Therefore, represents the PoO effects of the *j*th locus.

Model (1) can be expressed by the following matrix form:(2)where and

To enable selection of QTL and PoO effects, we reexpress the eight-dimensional vectors and as the expanded parameters and for respectively, such that and This parameter-expansion idea was first proposed by Kuo and Mallick (1998) and has an advantage in selecting important variables and shrinking coefficients.

In the above model (2), the observed data are marker genotypes PoO information matrix , and parental CC-RI information matrix The unobserved variables include *μ*, , and the indicators and . We assign the following conjugate Gaussian priors to the QTL and PoO effects as:For convenience, we name the above hierarchical priors for QTL and PoO coefficients as PeG priors in the sequel. We assign conjugate noninformation hyper-priors to the variance parameterswhere is used to ensure that the posterior distribution is proper (ter Braak *et al.* 2005). The above PeG priors are equivalent to “spike and slab” point mass mixture Gaussian priors (Mitchell and Beauchamp 1988; George and McCulloch 1993) for the expanded parametersandIt is well known that these priors can achieve variable selection. However, compared with mixture priors, the PeG priors can employ a block Gibbs sampler to update the two blocks of parameters, one for and (corresponding to the selected predictors) and one for and (corresponding to the unselected predictors) in turn, and therefore can dramatically reduce computational time, especially for high-dimensional data with large *p*.

For indicator variables and we specify Bernoulli priors with inclusion probability and for respectively. To be more flexible, we further apply hierarchical uniform priors to and Lastly, we specify a flat prior on *μ* as

### Block Gibbs sampling algorithm for posterior computation

The specific priors above result in known marginal conditional distributions for all variables. The blockwise Gibbs sampling algorithm that we employ can be summarized as follows:

First, we initiate , and from uniform distribution then sample other parameters , and indicators and from their priors. We then perform the following block Gibbs sampling procedures. Superscripts and signify the Markov chain Monte Carlo (MCMC) iterations, and refers to the initial iteration.

*Step 1*. Updating*μ*: is drawn from the normal distribution.*Step 2*. Updating we divide the long vector into two blocks and corresponding to the selected () and unselected () predictors, respectively. We then sample from their priors,*i.e.*, the eight-dimensional multivariate normal distribution with zero mean and covariance matrix and sample from the eight-dimensional multivariate normal distribution with mean and covariance matrix*Step 3*. Updating similar to Step 2, we divide into two parts, and corresponding to the selected () and unselected () predictors, respectively. Then is sampled from their priors,*i.e.*, the eight-dimensional multivariate normal distribution with zero mean and covariance matrix and is sampled from the eight-dimensional multivariate normal distribution with mean and covariance matrix*Step 4*. Updating is drawn from the multivariate normal distribution with mean and covariance matrix*Step 5*. Updating is sampled from the scale-inverted distribution,*Step 6*. Updating is sampled from the scale-inverted distribution,*Step 7*. Updating the random-effects variance is sampled from the scale-inverted distribution,*Step 8*. Updating the residual variance is sampled from the scale-inverted distribution,*Step 9*. Updating is sampled from the Bernoulli distribution with success probability where and*Step 10*. Updating is sampled from the Bernoulli distribution with success probability where and*Step 11*. Updating is sampled from the beta distribution*Step 12*. Updating is sampled from the beta distribution

## Simulation Study

In this section, we run extensive simulations to evaluate the performance of the proposed Bayesian method. We apply the loop design in Zou *et al.* (2005) and Yuan *et al.* (2011) by ordering the *L* CC-RI lines randomly and forming them into a circle, and then mating each CC-RI line (clockwise) with the next three CC-RI lines after it (*i.e.*, CC-RI_{1} mated with CC-RI_{2}, CC-RI_{3}, and CC-RI_{4}; CC-RI_{2}mated with CC-RI_{3}, CC-RI, and CC-RI; and so on). For in this way, we can generate a total of CC-RIX samples. Parental CC-RI genotypes are simulated in R/qtl (Broman *et al.* 2003). All the simulation results are based on total 100 replications for each simulation setup.

In model (2), we set the overall mean *μ* = 1, the variance of random errors the polygenic effect variance , and Nineteen chromosomes each with a length of 70 cM are simulated, on which total *p* = 133, 266, and 1330 evenly spaced markers (resulting in 10-, 5-, and 1-cM intervals between nearby markers on each chromosome, named cases 1–3) are generated, corresponding to three marker density cases. Among the total *p* simulated markers, we randomly select five markers, and let the first three markers have QTL effects the last three markers have PoO effects. That is, the first two genes only have QTL effects, and the last two only have PoO effects. However, the middle-selected marker has both QTL and PoO effects. The corresponding variances of QTL or PoO effects of the selected markers are all set to 1.

To better assess the performance of our method in situations where multiple nearby SNPs jointly affect the outcome, for cases 1–3, we now let the number of causal SNPs in each of the first two QTL be 2 instead of 1. Specifically, we randomly select two nearby SNPs for each of the first two simulated QTL and denote the alleles of the two SNPs *A* and *a*. Based on the haplotype frequencies , and (without loss of generality, we assume that ), we create three haplotype allelic groups, where haplotype is group 1, is group 2, and the other two are group 3, and set the genetic effects of the three groups to 1, 2, and 3, respectively. The three new simulation setups are labeled as cases 1^{∗} to 3^{∗}.

For each simulated data point, we generate a single long chain with 20,000 cycles, of which the first 10,000 cycles are discarded as burn-in, resulting in a total of 10,000 samples for post-MCMC analysis. All the analysis is done in MATLAB, and the MATLAB source code is submitted as supplemental material (File S2).

For comparison, we fit each simulated data point with the following three models.

### The same model as (2), but without the PoO terms

(3)All the priors are set to be the same as their counterparts in model (2). For convenience, we subsequently call model (3) the “mixed model”.

### Linear mixed-effects model (LMM) for single gene mapping

(4)Here, random effects and random errors and the random effects are also assumed to be independent of the random errors as before. Moreover, and are the fixed QTL and PoO effects, respectively, of the *j*th tested locus. Given the constraints that and for we force the intercept in model (4) to be **0** to overcome the identifiability problem, and jointly test the effects of the *j*th locus as:In contrast to *H*_{1}* _{j}*, some of the equations in are not satisfied. After obtaining the maximum likelihood estimates of the parameters in model (4), we perform a test with the following log-odds ratio (LOD):which is equivalent to the log-likelihood ratio test, where and are the likelihood of model (4) under null hypothesis and alternative hypothesis respectively. Since the above hypothesis test is performed

*p*times, it is necessary to find an appropriate significance threshold to control the multiple testing, which can be obtained, for example, by modified permutation procedures (Zou

*et al.*2005). However, in our comparison, we evaluate the receiver operating characteristic (ROC) curve, which only requires the use of LOD scores.

### Yuan’s Bayesian method

(5)Here, is the effect of the *j*th putative QTL with , where equals −1, 0, or 1 if the putative QTL genotype is , or respectively. The other parameters are set the same as those in the mixed model. The prior of is set to 1 ≤ *j* ≤ *p* in Yuan *et al.* (2011).

To compare the methods, ROC curves (Fawcett 2006), where true positive rates (also known as sensitivity) are plotted against false positive rates (also known as 1-specificity) evaluated at various threshold cut-offs. To estimate the sensitivity (the proportion of positives that are correctly identified as such) and specificity (the proportion of negatives that are correctly identified as such), we define false and true positive findings as follows. If a detected locus falls no more than 10 cM apart from any simulated genes, we call it a true positive finding, otherwise a false positive finding. Then, combining the outputs of each model for the 100 data sets, we can calculate the corresponding sensitivity and specificity. Specifically, for the *j*th marker, we record the LOD scores, , for the LMM model; the maximum posterior frequency between and for our proposed method; the posterior frequency for the mixed model; and the posterior mean of for Yuan’s model. The corresponding area under the ROC curve (AUC) is also calculated for each of the four models, and the results are presented in Table 1. Generally speaking, a model with a higher AUC value indicates on average a better performance compared with those with lower AUC scores (Fawcett 2006).

From Table 1, it is clear that our proposed method outperforms the other three methods for all cases, regardless of whether there is a single causal SNP or multiple causal SNPs in each QTL. Yuan’s method fails in all the simulated cases, which is expected as the method only models biallelic SNP effects instead of the founder allelic effects that we simulate. The LMM model outperforms the mixed model, in particular when the marker density is sparse. This phenomenon is further confirmed by Figure 1, where the ROC curves of the proposed method are always higher than the ROC curves of the other three methods; the ROC curves of Yuan’s model fall along the 45-degree line, indicating its low power for mapping genes with founder allelic effects. The ROC curves of the LMM model fall between those of our proposed model and the mixed model in cases 1 and 2, and cases 1* and 2*, but cross with the ROC curves of the mixed model in cases 3 and 3*. Figure 1 and Table 1 show that the mapping precisions of all the methods increase as the number of markers increases, as expected.

Figure 2 presents the Manhattan plots of the four models based on the average estimate of a total of 100 replications across the whole genome for case 1. Clearly, our proposed method is the winner in mapping genes with founder allelic effects (marked by an asterisk) and PoO effects (marked by a circle). The Manhattan plots for cases 2 and 3, and cases 1*, 2*, and 3*, are included in Figures S1–S5 in File S1, respectively, and show similar patterns.

## Discussion

The CC (Threadgill and Churchill 2012), a panel of newly emerged multiparent RI mouse strains, was developed, similar to other MPPs, to provide greater genetic diversity than the traditional RI populations and thereby to improve our power of understanding of complex traits. CC-RIX, F1 crosses that are generated from parental CC RI lines (Zou *et al.* 2005) can serve as excellent mouse models for mapping genes with both traditional genetic effects and epigenetic effects, such as imprinting. This study extends the model of Gong and Zou (2012) and Yuan *et al.* (2011) and, to the best of our knowledge, it is the first use of Bayesian variable selection methods to jointly map multiple QTL with both founder allelic effects and PoO effects for MPP-RIX data, in particular, CC-RIX data.

In this article, it is assumed that the RI lines are equally distanced from each other in terms of genetic distance which is a sensible assumption given the funnel design used for generating CC RI lines. However, we do observe that some RI lines share more or fewer founder alleles than expected. Our limited simulation shows that such genetic unbalance has a negligible effect on mapping genes, but this issue deserves further investigation. Alternatively, as genotypes of the parental RI lines are available, we could modify the design matrices , and in model (1) and replace them with local and genome-wide similarity matrices as done in sequence kernel association tests (SKAT) (Ionita-Laza *et al.* 2013) and genome-wide complex trait analysis (GCTA) (Yang *et al.* 2011). In addition, in our model we assume additive founder allelic effects. This assumption can be easily relaxed to allow genes with both additive and dominance effects (Woods *et al.* 2012; Zhang *et al.* 2014). Because dominant genetic effects are orthogonal to PoO effects, missing the dominant genetic effects does not cause any bias in the PoO effects estimate.

One of the drawbacks of Bayesian methods is their computational cost, especially for data with large numbers of samples and markers. Our model employs a block Gibbs sampler, which dramatically reduces computational time. However, further improvements may be possible. For example, instead of modeling each marker separately, we could jointly model multiple nearby markers and reduce the magnitude of *p*. Similar ideas have been proposed for genome-wide association study data (Wu *et al.* 2010; Lu *et al.* 2015). This approach is also biologically meaningful, since for some complex traits multiple causal SNPs may be located in a single region, and our simulations (cases 1*–3*) have shown that the proposed method is powerful in mapping regions with multiple causal SNPs. However, grouping nearby markers or SNPs may offer further help in improving mapping power, which deserves more thorough investigation.

## Acknowledgments

We thank two anonymous referees for their valuable comments on our manuscript. This research was partially supported by NIH grant R01 GM105785, and by the scholarship from China Scholarship Council (CSC) under the Grant CSC No.201506270042.

## Footnotes

Supplemental material is available online at www.g3journal.org/lookup/suppl/doi:10.1534/g3.117.300483/-/DC1.

*Communicating editor: D. J. de Koning*

- Received November 7, 2017.
- Accepted December 3, 2017.

- Copyright © 2018 Liu
*et al.*

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.