1 INTRODUCTION

There are several reasons to perform QTL analysis jointly in multiple related families derived from inbred lines. Multiple experimental QTL mapping families exist for numerous crops (e.g., barley, oat, soybean, corn), and specific inbred parents are often shared across families (e.g., Brummer et al., 1997; Kianian et al., 1999; Orf et al., 1999). In breeding situations, parents are often mated in diallel designs to generate families from which to select, and the number of families produced is often greater than the number of inbred parents. From the viewpoint of QTL mapping, jointly analyzing these populations may lead to gains in detection power, mapping resolution and QTL estimation. Though a great variety of software packages is available for QTL mapping analysis (Knapp, 1997), adequate programs are not readily available to analyze data from several inbred line cross populations that share some parents in common.

Programs that have been developed for single crosses, or even for multiple independent inbred crosses, cannot be readily applied to multiple interconnected populations, for the following reasons. First, while it is reasonable to assume that each parent carries a distinct allele at a QTL in single crosses, this assumption is violated in multiple interconnected populations because some families share a parent. Sharing a common parent causes covariance among these families, and ignoring that covariance in QTL analysis is conceptually problematic and leads to large bias in QTL estimation. Second, because multiple founder parents are involved, some parents may carry alleles that are identical in state (IIS) at a QTL. When this occurs, information from the progeny of parents sharing a QTL allele can be pooled to improve estimation of the QTL effect. Though the identity of alleles is not observable, the phenotypic effects of alleles provide information that can guide the specification of statistical models that estimate the correct number of alleles and pool information appropriately. Selection among such models allows statistical assessment of identity in state at a QTL among parents. Knowing allele identity in state among parents would allow models to fit the data well, whereas estimating separate alleles for all parents might decrease model likelihood.

INTERQTL is a software package developed for Bayesian QTL mapping in multiple inbred populations where some families share common founder parents. The analysis can map multiple QTL on multiple chromosomes, and it handles a variety of inbred crosses, such as backcrosses (BC1 and BC2), F2, doubled haploid lines and recombinant inbred lines. The package consists of three parts: a module for Bayesian analysis, a module for simulation, and an interface module. The first two modules are programmed in Borland C++ as WIN32 console applications so that they can be used independently in either DOS or WINDOWS. The third module, written in Visual C++ 6, organizes and runs the first two modules on WINDOWS desktops such as WINDOWS 3.X, 98 and NT. The interface module provides a single-document editor window in which users prepare their input file, as well as dialogue windows in which users can tailor a specific analysis to their needs and visualize the posteriors and Markov chains of some model parameters. Currently, the additive effect of a QTL is the only genetic effect in the model; future model extensions will consider dominance and other interaction effects.

In INTERQTL, we use Bayes' theorem to give the probability of each allele number and configuration model, conditional on phenotype and marker data over all populations. The Bayesian output presents the probabilities of all models together, integrated over the full posterior distribution of the unknown parameters. Modeling allele number in INTERQTL is made possible by introducing an allele configuration matrix that links each founder parent to a specific allele at a QTL, a feature absent from previous models, which assume a fixed number of alleles. A change in allele number at a QTL is thus reflected in a change in the number of columns of the configuration matrix that are not all 0s. Note that changing the allele number in the model changes the dimension of the parameter space, leading to a Markov chain with reversible jumps. In the MCMC, a scalar Metropolis-Hastings procedure is used to update family means, allele values, allele number, allele configuration, QTL variance and residual variance, where each parameter is sampled in turn with all other parameters held fixed (Gilks et al., 1996). QTL genotypes and positions, however, are updated jointly in INTERQTL. In addition to the Metropolis-Hastings updates, Gibbs samplers are available for updating family means, allele values, QTL variance and residual variance; their theoretical framework is described in Sorensen (1999).
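The role of the allele configuration matrix can be illustrated with a minimal Python sketch. It assumes a parents-by-alleles 0/1 matrix with exactly one 1 per row; this layout, the function names `allele_count` and `parent_effects`, and all numerical values are illustrative assumptions and are not taken from the INTERQTL source code.

```python
import numpy as np

def allele_count(config):
    """Number of distinct alleles = number of columns that are not all zeros."""
    config = np.asarray(config)
    return int(np.count_nonzero(config.any(axis=0)))

def parent_effects(config, allele_values):
    """Allele value carried by each parent (one row of config per parent)."""
    return np.asarray(config) @ np.asarray(allele_values, dtype=float)

# Four hypothetical founder parents P1..P4 and at most four alleles.
# Rows are parents, columns are alleles; each row contains exactly one 1.
config = np.array([
    [1, 0, 0, 0],  # P1 carries allele 1
    [1, 0, 0, 0],  # P2 shares allele 1 with P1 (identical in state)
    [0, 1, 0, 0],  # P3 carries allele 2
    [0, 0, 1, 0],  # P4 carries allele 3
])
allele_values = [0.8, -0.5, 0.1, 0.0]  # illustrative values; allele 4 is unused

n_alleles = allele_count(config)                 # unused column 4 is not counted
effects = parent_effects(config, allele_values)  # P1 and P2 get the same value
```

Moving a parent's 1 into a previously empty column adds an allele, and emptying a column removes one, which is the dimension change that the reversible-jump step has to handle.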

By making it possible to infer allele number and configuration from DNA marker data and phenotypic values, INTERQTL extends QTL mapping experiments in that it allows random selection of inbred parental lines, which previous programs do not handle successfully. In traditional QTL mapping experiments that make use of single crosses, founder parents have to be selected at opposite extremes for the trait of interest. Doing so gives a high chance that the QTL is heterozygous in the F1 parents, and it minimizes type II error in QTL mapping (i.e., the QTL exists but is not detectable because both lines are fixed for the same allele). However, because many other genetic and environmental factors contribute to the total phenotypic variance, there is no guarantee that two founders with distinct phenotypic values carry distinct alleles at the focal QTL. Non-random selection of parental lines has other disadvantages as well. For example, choosing parents that differ dramatically in one trait increases the statistical power to detect QTL responsible for that trait, but with previous methods such populations are not useful for detecting QTL for other traits. As QTL mapping experiments are costly in time and money, it is unwise to carry out an experiment only to localize QTL for one trait. In addition, because of the non-random selection of parents, estimates of QTL effect are biased, and inferences apply only to the two parental lines, not to the pool of available strains from which the two lines were selected. This limitation no longer exists in INTERQTL, since it deals naturally with founder parents that are randomly chosen.

Missing marker data arise in two situations when analyzing multiple interconnected populations. First, not all markers have segregating alleles in all populations, so a given tested map position lies at a different distance from the nearest informative marker depending on the population. Second, some individuals within a population have missing data. In INTERQTL, both situations are handled in conceptually similar ways, using multi-marker calculation of conditional QTL genotype probability (Ott, 1999). In the MCMC, missing marker genotypes are updated in each iteration, analogous to the updates for QTL genotypes.

INTERQTL also provides the flexibility to model different types of QTL effects. In practice, QTL effects are treated as either fixed or random (Xu, 1998), and both kinds of model have advantages and disadvantages. In fixed-QTL model approaches, allelic substitution effects are usually estimated and tested, and QTL variance is calculated from the estimated allelic effects. Though these methods are simple and efficient for QTL mapping analysis with a single population or a small number of populations, they are challenged by many families because the number of model parameters increases dramatically. Given a fixed population size, fitting a great number of model effects not only decreases the likelihood of the model but also gives rise to computational difficulties. Random-QTL models suit the multiple-allelic genetics of QTL in many mapping populations and give a direct estimate of QTL variance. However, the specific parametric distribution of the random effects, usually normal with mean zero and estimated variance, is hard to accept when only a very small number of alleles, or of parents carrying these alleles, is involved. In a recent simulation study, we found that obtaining reliable QTL estimates was only possible when a considerable number of parents was used (Wu and Jannink, 2003). In other words, a fixed-QTL model approach is preferred for a single population or a small number of mapping populations, whereas a random-QTL model is recommended for analyzing many populations. To account for these situations, INTERQTL allows users to model either random or fixed QTL effects by turning on or off a subroutine that updates QTL variance. Changing the model type (fixed, random, or mixed) takes a single mouse click. INTERQTL also allows users to pre-set allele number and configuration in accordance with prior information, as an alternative to letting them be inferred by the analysis. Furthermore, the subroutine for updating allele number can be turned off so that the number of alleles in the model is fixed to the number of parents. This flexibility allows users to examine for themselves which assumptions about allele number and configuration are most appropriate to their QTL mapping problems.
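To illustrate the conditional QTL genotype probability used for missing or uninformative markers, the following Python sketch covers the simplest case: a backcross with two flanking markers. It assumes the Haldane map function (no interference) and genotypes coded 0/1. The function names and the handling of a missing marker by dropping its factor are illustrative assumptions; a full multi-marker calculation would instead move out to the next informative marker, and this is not INTERQTL's implementation.

```python
import math

def haldane(d_morgans):
    """Recombination fraction for a map distance in Morgans (Haldane)."""
    return 0.5 * (1.0 - math.exp(-2.0 * abs(d_morgans)))

def qtl_genotype_prob(left, right, d_left, d_right):
    """P(QTL genotype = 1 | flanking marker genotypes) in a backcross.

    left, right: marker genotypes coded 0 or 1, or None when the marker
    is missing or not segregating in this population.
    d_left, d_right: distances (Morgans) from the QTL to each marker.
    """
    r_left, r_right = haldane(d_left), haldane(d_right)
    weights = []
    for q in (0, 1):
        w = 0.5  # prior probability of each backcross genotype
        if left is not None:
            w *= (1.0 - r_left) if q == left else r_left
        if right is not None:
            w *= (1.0 - r_right) if q == right else r_right
        weights.append(w)
    return weights[1] / (weights[0] + weights[1])

# Both markers observed, one marker missing, and both markers missing.
p_both = qtl_genotype_prob(1, 1, 0.10, 0.10)
p_left_only = qtl_genotype_prob(1, None, 0.10, 0.10)
p_none = qtl_genotype_prob(None, None, 0.10, 0.10)
```

With both flanking markers observed the probability is closer to 0 or 1 than with a single marker, and with no marker information it falls back to the prior of 0.5. In an MCMC setting, probabilities of this kind serve as the sampling distribution from which QTL genotypes are drawn in each iteration.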

We chose to work within the framework of Bayesian analysis because of its great flexibility and efficiency in modeling complicated situations, as compared to maximum likelihood approaches. Currently, a few programs are available to implement Bayesian QTL analysis (Sillanpaa and Arjas, 1998; Gaffney et al., 2001). As mentioned above, our package differs from these in that INTERQTL can generate a Bayesian posterior of the number of QTL alleles segregating among parents and estimate the posterior probability of the different possible identity-in-state configurations. With this capability, INTERQTL is better suited to QTL mapping in multiple interconnected populations. Note that the number of QTL is fixed in the model while the subroutine that gives the posterior number of QTL alleles is active. The analysis gives the posterior number of QTL only when the subroutine for changing allele number is turned off.

The package has modest hardware requirements: it runs on most older PCs (e.g., a machine with a 50 MB hard drive, 16 MB RAM and a 120 MHz Pentium processor). However, because the computational task in a Bayesian analysis with multiple populations and multiple QTL is tremendous, we recommend running INTERQTL on a high-performance PC. In our analysis, where 6 QTL were simulated on a genome of 7 chromosomes in 30 populations, each with 20 to 40 progeny, a single run of 100,000 MCMC iterations on a PC with a 1.8 GHz CPU, 100 MB RAM and a 20 GB hard disk took 20 minutes to 2 hours, depending on the complexity of the model. In contrast, the same run took one to two days on the older PC described above.