Quantitative trait locusยถ

Introductionยถ

Every genetic model considered here is an instance of generalized linear mixed model (GLMM). It consists in four main components [St16]:

• A linear predictor, ๐ณ = M๐ + ๐๐ฎ.

• The distribution of the random effects, ๐ฎ โผ ๐(๐, ฮฃ).

• The residual distribution, yแตข | ๐ฎ.

• The link function, ๐แตข = g(zแตข).

The term ๐แตข represents the mean of yแตข conditioned on ๐ฎ:

$๐แตข = ๐ด[yแตข|๐ฎ].$

The role of the link function is to scale the domain of zแตข, which ranges from -โ to +โ, to the residual distribution parameter ๐แตข. For example, the mean of a Bernoulli distribution is bounded within [0, 1], and therefore requires a link function to translate values of zแตข into values of ๐แตข.

The distribution of the outcome, conditioned on the random effects, has to be one from the exponential family [Ef18] having mean ๐แตข:

$yแตข|๐ฎ โผ ๐ด๐ก๐๐ต๐๐(๐แตข).$

A notable instance of the above model is the linear mixed model (LMM). It consists of the identity link function, ๐แตข = g(๐แตข), and of normally distributed residuals, yแตข | ๐ฎ โผ ๐(๐แตข, ๐แตขยฒ) [Mc11]. It is more commonly described by the equation

(1)ยถ$๐ฒ = ๐ผ๐ + ๐๐ฎ + ๐,$

for which ๐แตขโผ๐(0, ๐แตขยฒ). The random variables ๐ฎ and ๐ are independent from each other as well as ๐แตข and ๐โฑผ for iโ j. Defining ๐ฏ = ๐๐ฎ leads to:

$๐ฏ โผ ๐(๐, ๐ฮฃ๐แต).$

There is another even simpler instance of GLMM that is also used in genetic analysis: a linear model (LM) is merely a LMM without the random effects:

$๐ฒ = ๐ผ๐ + ๐.$

The above models are used to establish a statistical tests to find significant association between genetic loci and phenotype. For that, their parameters have to be estimated.

As an example, let us define two parameters that will describe the overall variances of the random effects and of the residual effects:

$ฮฃ = ๐โ๐ธโ ~~\text{and}~~ ๐แตขยฒ = ๐โ.$

If we assume a LMM, this example of model can be described by Eq. (1) for which

$๐ฎ โผ ๐(๐, ๐โ๐ธโ) ~~\text{and}~~ ๐ โผ ๐(๐, ๐โ๐ธโ).$

Equivalently, we have

$๐ฒ = ๐ผ๐ + ๐ฏ + ๐,$

for which

$๐ฏ โผ ๐(๐, ๐โ๐๐แต) ~~\text{and}~~ ๐ โผ ๐(๐, ๐โ๐ธโ).$

Therefore we have a model with three parameters: an array of effect sizes ๐ and variances ๐โ and ๐โ. If ๐ contains the normalized SNP genotypes of the samples, ๐๐แต is an estimation of the genetic relationship between the samples [Wa17].

Statistical testยถ

We use the likelihood ratio test (LRT) approach [LR18] to assess the significance of the association between genetic variants and the phenotype. It is based on the ratio between the marginal likelihood of the null ๐โ and alternative ๐โ models, for which the simpler model ๐โ is defined by placing a constraint on one or more parameters of the alternative model ๐โ.

The parameter inference is done via the maximum likelihood estimation (MLE) approach [ML18], for which the marginal likelihood p(๐ฒ | ๐ผ, ๐; ๐) is maximized over the parameters set ๐. Let ๐โ and ๐โ be the optimal parameters set under the null and alternative models. The likelihood ratio statistics is given by

$-2 \log(p(๐ฒ| ๐ผ, ๐; ๐โ) / p(๐ฒ| ๐ผ, ๐; ๐โ)),$

which asymptotically follows a ฯยฒ distribution [Wh14]. We will make use of the LRT approach in the next sections to flag significant genetic associations.

Single-trait associationยถ

We first consider that the observed phenotype is described by additive effects from covariates and genetic components. Any deviation from that is assumed to be captured by the residual distribution. Let ๐ผ be a matrix of covariates and let ๐ถ be a matrix of genetic variants that we suspect might have some effect on the phenotype. Therefore, we have the linear model:

$\begin{split}๐ฒ = \underbrace{๐ผ๐}_{\text{covariates}}+ \underbrace{๐ถ๐}_{\text{genetics}}+ \underbrace{๐}_{\text{noise}},\\ \text{where}~~๐โผ๐(๐, ๐โ๐ธ),~~~~~~\end{split}$

and we wish to compare the following hypotheses:

$\begin{split}๐โ: ๐ = ๐\\ ๐โ: ๐ โ ๐\end{split}$

Note that the parameters of the above model are the covariate effect sizes, ๐, the effect sizes of a set of genetic variants, ๐, and the variance ๐โ of the noise variable. Under the null hypothesis, we set ๐=๐ and fit the rest of the parameters. Under the alternative hypothesis, we learn all the parameters. At the end, we compare the marginal likelihoods via the likelihood ratio test.

Let us first generate a random data set having a phenotype, covariates, and a set of genetic candidates.

>>> from numpy import ones, stack
>>> from numpy.random import RandomState
>>> from pandas import DataFrame
>>>
>>> random = RandomState(2)
>>>
>>> # sample size
>>> n = 100
>>>
>>> # covariates
>>> offset = ones(n) * random.randn()
>>> age = random.randint(16, 75, n)
>>> M = stack((offset, age), axis=1)
>>> M = DataFrame(stack([offset, age], axis=1), columns=["offset", "age"])
>>> M["sample"] = [f"sample{i}" for i in range(n)]
>>> M = M.set_index("sample")
offset      age
sample
sample0 -0.41676 38.00000
sample1 -0.41676 59.00000
sample2 -0.41676 34.00000
sample3 -0.41676 27.00000
sample4 -0.41676 56.00000
>>> # genetic variants
>>> G = random.randn(n, 4)
>>>
>>> # sampling the phenotype
>>> alpha = random.randn(2)
>>> beta = random.randn(4)
>>> eps = random.randn(n)
>>> y = M @ alpha + G @ beta + eps


We now apply the function limix.qtl.scan() to our data set

>>> from limix.qtl import scan
>>>
>>> r = scan(G, y, "normal", M=M, verbose=False)
>>> print(r)
Hypothesis 0
------------

๐ฒ ~ ๐(๐ผ๐ถ, 3.462โ๐ธ)

M     = ['offset' 'age']
๐ถ     = [2.10096551 0.19582931]
se(๐ถ) = [1.25826998 0.01068367]
lml   = -203.98750767964498

Hypothesis 2
------------

๐ฒ ~ ๐(๐ผ๐ถ + G๐, s(3.462โ๐ธ))

lml       cov. effsizes   cand. effsizes
--------------------------------------------------
mean   -1.951e+02       9.915e-01       -6.198e-01
std     9.227e+00       9.342e-01        3.974e-01
min    -2.031e+02       1.844e-01       -1.025e+00
25%    -2.026e+02       1.959e-01       -9.275e-01
50%    -1.967e+02       5.965e-01       -6.047e-01
75%    -1.893e+02       1.831e+00       -2.970e-01
max    -1.841e+02       2.312e+00       -2.448e-01

Likelihood-ratio test p-values
------------------------------

๐โ vs ๐โ
----------------
mean   6.514e-02
std    8.856e-02
min    2.804e-10
25%    2.606e-07
50%    3.651e-02
75%    1.016e-01
max    1.875e-01


Suppose we also have access to the whole genotype of our samples, ๐, and we want to use them to account for population structure and cryptic relatedness in our data [Ho13]. Since the number of genetic variants in ๐ is commonly larger than the number of samples, and because we are not actually interested in their effect sizes, we will include it in our model as a random component. We now have a linear mixed model:

$\begin{split}๐ฒ = \underbrace{๐ผ๐}_{\text{covariates}}+ \underbrace{๐ถ๐}_{\text{genetics}}+ \underbrace{๐๐ฎ}_{\text{pop. struct.}}+ \underbrace{๐}_{\text{noise}},\\ \text{where}~~ ๐ฎโผ๐(๐, ๐โ๐ธโ) ~~\text{and} ~~๐โผ๐(๐, ๐โ๐ธโ).\end{split}$

It is important to note that ๐ฏ=๐๐ฎ can be equivalently described by a multivariate Normal distribution with a covariance proportional to ๐บ = ๐๐แต:

$๐ฏ โผ ๐(๐, ๐โ๐บ).$

We make use of the function limix.stats.linear_kinship() to define the covariance matrix ๐บ, and call limix.qtl.scan() to perform the analysis.

>>> from limix.stats import linear_kinship, multivariate_normal
>>> from numpy import zeros, eye
>>>
>>> # Whole genotype of each sample.
>>> X = random.randn(n, 50)
>>> # Estimate a kinship relationship between samples.
>>> K = linear_kinship(X, verbose=False) + 1e-9 * eye(n)
>>> # Update the phenotype
>>> y += multivariate_normal(random, zeros(n), K)
>>>
>>> r = scan(X, y, "normal", K, ๐ผ=M, verbose=False)
>>> print(r)
Hypothesis 0
------------

๐ฒ ~ ๐(๐ผ๐ถ, 1.436โ๐บ + 2.934โ๐ธ)

M     = ['offset' 'age']
๐ถ     = [1.95338293 0.19448903]
se(๐ถ) = [1.25455536 0.01076470]
lml   = -211.3819625136375

Hypothesis 2
------------

๐ฒ ~ ๐(๐ผ๐ถ + G๐, s(1.436โ๐บ + 2.934โ๐ธ))

lml       cov. effsizes   cand. effsizes
--------------------------------------------------
mean   -2.109e+02       1.069e+00        5.922e-02
std     7.210e-01       8.819e-01        2.474e-01
min    -2.114e+02       1.919e-01       -5.204e-01
25%    -2.113e+02       1.944e-01       -1.102e-01
50%    -2.111e+02       8.904e-01        4.920e-02
75%    -2.109e+02       1.956e+00        2.366e-01
max    -2.076e+02       2.215e+00        6.433e-01

Likelihood-ratio test p-values
------------------------------

๐โ vs ๐โ
----------------
mean   4.843e-01
std    2.705e-01
min    6.294e-03
25%    3.137e-01
50%    4.752e-01
75%    6.953e-01
max    9.929e-01


Non-normal trait associationยถ

If the residuals of the phenotype does not follow a Normal distribution, then we might consider performing the analysis using a generalized linear mixed model. Let us consider Poisson distributed residuals:

$yแตข | ๐ณ โผ ๐ฟ๐๐๐๐๐๐(๐แตข=g(zแตข)),$

where the latent phenotype is described by

$๐ณ = ๐ผ๐ + ๐๐ฎ + ๐,$

for

$๐ฎ โผ ๐(๐, ๐โ๐ธโ) ~~\text{and}~~ ๐ โผ ๐(๐, ๐โ๐ธโ).$

Note that the term ๐ in the above model is not the residual variable, as it were in the Eq. (1). The term ๐ is used to account for the so-called over-dispersion, i.e., when the residual distribution is not sufficient to explain the variability of yแตข.

>>> from numpy import exp
>>>
>>> z = (y - y.mean()) / y.std()
>>> y = random.poisson(exp(z))
>>>
>>> r = scan(G, y, "poisson", K, M=M, verbose=False)
>>> print(r)
Hypothesis 0
------------

๐ณ ~ ๐(๐ผ๐ถ, 0.154โ๐บ + 0.000โ๐ธ) for yแตข ~ Poisson(ฮปแตข=g(zแตข)) and g(x)=eหฃ

M     = ['offset' 'age']
๐ถ     = [5.17511934 0.04665214]
se(๐ถ) = [0.85159296 0.00604330]
lml   = -145.33385788740767

Hypothesis 2
------------

๐ณ ~ ๐(๐ผ๐ถ + G๐, s(0.154โ๐บ + 0.000โ๐ธ)) for yแตข ~ Poisson(ฮปแตข=g(zแตข)) and g(x)=eหฃ

lml       cov. effsizes   cand. effsizes
--------------------------------------------------
mean   -1.440e+02       2.553e+00       -1.306e-01
std     1.343e+00       2.682e+00        9.268e-02
min    -1.453e+02       4.345e-02       -2.227e-01
25%    -1.450e+02       4.635e-02       -2.018e-01
50%    -1.439e+02       2.456e+00       -1.344e-01
75%    -1.428e+02       5.054e+00       -6.321e-02
max    -1.427e+02       5.202e+00       -3.085e-02

Likelihood-ratio test p-values
------------------------------

๐โ vs ๐โ
----------------
mean   2.830e-01
std    3.213e-01
min    2.274e-02
25%    2.519e-02
50%    2.113e-01
75%    4.692e-01
max    6.867e-01


Single-trait with interactionยถ

The following linear mixed model is considered:

$\begin{split}๐ฒ = ๐ผ๐ + (๐ถโ๐ดโ)๐โ + (๐ถโ๐ดโ)๐โ + ๐๐ฎ + ๐,\\ \text{where}~~ ๐ฎโผ๐(๐, ๐โ๐ธโ) ~~\text{and}~~ ๐โผ๐(๐, ๐โ๐ธโ).\end{split}$

The operator โ works as follows:

$๐ฐโ๐ฑ = [๐ฐโ๐ฑโ ~~...~~ ๐ฐโ๐ฑโ ~~ ๐ฐโ๐ฑโ ~~...~~ ๐ฐโ๐ฑโ ~~...~~ ๐ฐโ๐ฑโ]$

Therefore, the terms ๐ถโ๐ดโ and ๐ถโ๐ดโ can be understood as interaction terms between genetics, ๐ถ, and environments, ๐ดโ and ๐ดโ.

We define three hypotheses from the above linear mixed model:

$\begin{split}๐โ: ๐โ=๐ ~~\text{and}~~ ๐โ=๐\\ ๐โ: ๐โโ ๐ ~~\text{and}~~ ๐โ=๐\\ ๐โ: ๐โโ ๐ ~~\text{and}~~ ๐โโ ๐\end{split}$

The hypothesis ๐โ is for no-interaction, ๐โ is for interaction with environments encoded in ๐ดโ, and ๐โ is for interaction with environments encoded in ๐ดโ and ๐ดโ. We perform three statistical tests:

• ๐โ (null) vs ๐โ (alternative)

• ๐โ (null) vs ๐โ (alternative)

• ๐โ (null) vs ๐โ (alternative)

Here is an example.

>>> from numpy import concatenate, newaxis
>>> from limix.qtl import iscan
>>>
>>> # Generate interacting variables (environment)
>>> E0 = random.randn(y.shape[0], 1)
>>> E1 = random.randn(y.shape[0], 1)
>>>
>>> r = iscan(G, y, "normal", K, M, E0=E0, E1=E1, verbose=False)
>>> print(r)
Hypothesis 0
------------

๐ฒ ~ ๐(๐ผ๐ถ, 0.376โ๐บ + 2.077โ๐ธ)

M     = ['offset' 'age']
๐ถ     = [3.12608063 0.06042316]
se(๐ถ) = [1.01867609 0.00870181]
lml   = -185.77488727691096

Hypothesis 1
------------

๐ฒ ~ ๐(๐ผ๐ถ + (๐ถโ๐ดโ)๐โ, s(0.376โ๐บ + 2.077โ๐ธ))

lml       cov. effsizes   cand. effsizes
--------------------------------------------------
mean   -1.856e+02       1.611e+00       -2.976e-03
std     1.949e-01       1.658e+00        1.208e-01
min    -1.858e+02       6.034e-02       -1.461e-01
25%    -1.858e+02       6.058e-02       -4.769e-02
50%    -1.856e+02       1.590e+00       -7.487e-03
75%    -1.854e+02       3.137e+00        3.722e-02
max    -1.854e+02       3.235e+00        1.492e-01

Hypothesis 2
------------

๐ฒ ~ ๐(๐ผ๐ถ + (๐ถโ๐ดโ)๐โ + (๐ถโ๐ดโ)๐โ, s(0.376โ๐บ + 2.077โ๐ธ))

lml       cov. effsizes   cand. effsizes
--------------------------------------------------
mean   -1.852e+02       1.612e+00        7.001e-03
std     7.598e-01       1.659e+00        1.475e-01
min    -1.857e+02       5.991e-02       -2.573e-01
25%    -1.856e+02       6.096e-02       -4.135e-02
50%    -1.855e+02       1.571e+00        3.611e-02
75%    -1.851e+02       3.135e+00        7.660e-02
max    -1.841e+02       3.241e+00        1.971e-01

Likelihood-ratio test p-values
------------------------------

๐โ vs ๐โ    ๐โ vs ๐โ    ๐โ vs ๐โ
----------------------------------------
mean   6.867e-01   6.501e-01   5.244e-01
std    3.199e-01   3.350e-01   3.168e-01
min    3.963e-01   1.795e-01   9.940e-02
25%    4.185e-01   5.578e-01   3.784e-01
50%    6.755e-01   7.277e-01   5.971e-01
75%    9.436e-01   8.200e-01   7.431e-01
max    9.995e-01   9.654e-01   8.042e-01


Multi-trait associationยถ

LMM can also be used to jointly model multiple traits. Let n, c, and p be the number of samples, covariates, and traits, respectively. The outcome variable ๐ is a nรp matrix distributed according to

(2)ยถ$๐๐๐(๐) โผ ๐((๐ฐ โ ๐ผ) ๐๐๐(๐), ๐ฒโ โ ๐๐แต + ๐ฒโ โ ๐ธ).$

๐ฐ and ๐ผ are design matrices of dimensions pรp and nรc provided by the user, where ๐ผ is the usual matrix of covariates commonly used in single-trait models. ๐ is a cรp matrix of fixed-effect sizes per trait. ๐ is a nรr matrix provided by the user and I is a nรn identity matrices. ๐ฒโ and ๐ฒโ are both symmetric matrices of dimensions pรp, for which ๐ฒโ is guaranteed by our implementation to be of full rank. The parameters of this model are the matrices ๐, ๐ฒโ, and ๐ฒโ. ๐๐๐(โ) is a function that stacks the columns of the provided matrix into a vector [Ve19].

Let ๐ฒ=๐๐๐(๐) and ๐=๐๐๐(๐). We can extend the model in Eq. (2) to represent three different hypotheses:

$๐ฒ โผ ๐((๐ฐ โ ๐ผ)๐ + (๐ฐโ โ ๐ถ)๐โ + (๐ฐโ โ ๐ถ)๐โ, ๐ฒโ โ ๐๐แต + ๐ฒโ โ ๐ธ);$

the hypotheses being

$\begin{split}๐โ: ๐โ=๐ ~~\text{and}~~ ๐โ=๐\\ ๐โ: ๐โโ ๐ ~~\text{and}~~ ๐โ=๐\\ ๐โ: ๐โโ ๐ ~~\text{and}~~ ๐โโ ๐\end{split}$

as before. Here is an example.

>>> from numpy import eye
>>>
>>> p = 2
>>> Y = random.randn(n, p)
>>> A = random.randn(p, p)
>>> A = A @ A.T
>>> A0 = ones((p, 1))
>>> A1 = eye(p)
>>>
>>> r = scan(G, Y, K=K, M=M, A=A, A0=A0, A1=A1, verbose=False)
>>> print(r)
Hypothesis 0
------------

๐ฒ ~ ๐((Aโ๐ผ)๐, Cโโ๐บ + Cโโ๐ธ)

traits   = ['0' '1']
M        = ['offset' 'age']
๐ถ        = [-0.16350676 -0.00299814 -0.34521236 -0.00080406]
se(๐ถ)    = [11.30571652  0.09640163  5.36090270  0.04573611]
diag(Cโ) = [0.01404947 0.29153072]
diag(Cโ) = [0.81175806 0.85780008]
lml      = -277.3341913587698

Hypothesis 1
------------

๐ฒ ~ ๐((Aโ๐ผ)๐ + (AโโG)๐โ, s(Cโโ๐บ + Cโโ๐ธ))

lml       cov. effsizes   cand. effsizes
--------------------------------------------------
mean   -2.763e+02      -1.243e-01       -2.842e-02
std     1.329e+00       1.435e-01        1.120e-01
min    -2.773e+02      -3.712e-01       -1.666e-01
25%    -2.772e+02      -2.032e-01       -7.187e-02
50%    -2.767e+02      -6.896e-02       -2.672e-02
75%    -2.758e+02      -1.371e-03        1.673e-02
max    -2.744e+02       1.386e-04        1.063e-01

Hypothesis 2
------------

๐ฒ ~ ๐((Aโ๐ผ)๐ + (AโโG)๐โ + (AโโG)๐โ, s(Cโโ๐บ + Cโโ๐ธ))

lml       cov. effsizes   cand. effsizes
--------------------------------------------------
mean   -2.761e+02      -1.245e-01       -1.209e-02
std     1.404e+00       1.439e-01        5.823e-02
min    -2.772e+02      -3.745e-01       -1.151e-01
25%    -2.770e+02      -2.025e-01       -3.202e-02
50%    -2.766e+02      -6.702e-02       -7.898e-03
75%    -2.757e+02      -1.441e-03        2.127e-02
max    -2.741e+02      -7.171e-04        7.372e-02

Likelihood-ratio test p-values
------------------------------

๐โ vs ๐โ    ๐โ vs ๐โ    ๐โ vs ๐โ
----------------------------------------
mean   3.973e-01   6.122e-01   8.438e-01
std    3.851e-01   3.942e-01   1.327e-01
min    1.597e-02   9.170e-02   7.251e-01
25%    1.133e-01   4.159e-01   7.319e-01
50%    3.626e-01   7.039e-01   8.370e-01
75%    6.466e-01   9.002e-01   9.488e-01
max    8.478e-01   9.493e-01   9.760e-01


References

LR18

Wikipedia contributors. (2018, October 21). Likelihood-ratio test. In Wikipedia, The Free Encyclopedia. Retrieved 16:13, November 27, 2018, from https://en.wikipedia.org/w/index.php?title=Likelihood-ratio_test&oldid=865020904

ML18

Wikipedia contributors. (2018, November 8). Maximum likelihood estimation. In Wikipedia, The Free Encyclopedia. Retrieved 16:08, November 27, 2018, from https://en.wikipedia.org/w/index.php?title=Maximum_likelihood_estimation&oldid=867823508

St16

Stroup, W. W. (2016). Generalized linear mixed models: modern concepts, methods and applications. CRC press.

Ef18

Wikipedia contributors. (2018, October 18). Exponential family. In Wikipedia, The Free Encyclopedia. Retrieved 18:45, November 25, 2018, from https://en.wikipedia.org/w/index.php?title=Exponential_family&oldid=864576150

Mc11

McCulloch, Charles E., and Shayle R. Searle. Generalized, linear, and mixed models. John Wiley & Sons, 2004.

Ve19

Wikipedia contributors. (2018, September 11). Vectorization (mathematics). In Wikipedia, The Free Encyclopedia. Retrieved 16:18, November 28, 2018, from https://en.wikipedia.org/w/index.php?title=Vectorization_(mathematics)&oldid=859035294

Wa17

Wang, B., Sverdlov, S., & Thompson, E. (2017). Efficient estimation of realized kinship from single nucleotide polymorphism genotypes. Genetics, 205(3), 1063-1078.

Wh14

White, H. (2014). Asymptotic theory for econometricians. Academic press.

Ho13

Hoffman, G. E. (2013). Correcting for population structure and kinship using the linear mixed model: theory and extensions. PloS one, 8(10), e75707.