18.443 File Rproject3_rmd_multinomial_theory.html

Maximum Likelihood Estimates of Multinomial Cell Probabilities

Definition: Multinomial Distribution (generalization of Binomial)

Section \(8.5.1\) of Rice discusses multinomial cell probabilities.

Data consisting of: \[ X_1, X_2, \ldots, X_m\]

are counts in cells \(1, \ldots, m\) and follow a multinomial distribution

\[f(x_1, \ldots, x_n \mid p_1, \ldots, p_m ) = { n! \over \prod_{j=1}^m x_i ! } \prod_{j=1}^m p_j ^{x_j}\]

where

  • \((p_1, \ldots, p_m)\) is the vector of cell probabilities with \(\sum_{i=1}^m p_i=1.\)

  • \(n=\sum_{j=1}^m x_i\) is the total count.

These data arise from a random sample of single-count Multinomial random variables, which are a generalization of Bernoulli random variables (\(m\) distinct outcomes versus 2 distinct outcomes).

Suppose

  • \(W_1, W_2, \ldots, W_n\) are iid \(W \sim Multinomial(n, probs=(p_1, \ldots, p_m))\) random variables:

  • The sample space of each \(W_i\) is \({\cal W} = \{1, 2, \ldots, m\}\), a set of \(m\) distinct outcomes.

  • \(P( W_i=k) = p_k,\) \(k=1,2, \ldots, m.\)

Define

  • \(X_k = \sum_{i=1}^n 1( W_i =k)\), (sum of indicators of outcome \(k\)), \(k=1, \ldots, m\)

  • X\(=(X_1, \ldots, X_m)\)

Maximum Likelihood Estimation

Likelihood function of Multinomial

\[\begin{array}{lcl} lik(p_1, \ldots, p_m) &=& log [ f(x_1, \ldots, x_m \mid p_1, \ldots, p_m )] \\ &=& log (n!) - \sum_{j=1}^m log( x_j !) + \sum_{j=1}^m x_j log(p_j) \\\end{array}\]

Maximum Likelihood Estimate (MLE)

The MLE of \((p_1, \ldots, p_m)\) maximizes \(lik(p_1, \ldots, p_m)\) (with \(x_1, \ldots, x_m\) fixed!)

  • Maximum achieved when differential is zero

  • Constraint: \(\sum_{j=1}^m p_j =1\)

  • Apply method of Lagrange multipliers

  • Solution: \(\hat p_j = x_j /n\), \(j=1, \ldots, m.\)

Note: if any \(x_j =0,\) then \(\hat p_j=0\) solved as limit

Example 8.5.1.A

Hardy-Weinberg Equilibrium

  • Equilibrium frequency of genotypes: \(AA\), \(Aa\), and \(aa\)

  • \(P(a)=\theta\) and \(P(A)=1-\theta\)

  • Equilibrium probabilities of genotypes: \((1-\theta)^2\), \(2(\theta)(1-\theta)\), and \(\theta^2.\)

  • Multinomial Data: \((X_1, X_2, X_3)\) corresponding to counts of \(AA\), \(Aa\), and \(aa\) in a sample of size \(n.\)

See, e.g.

http://www.nature.com/scitable/definition/hardy-weinberg-equation-299

http://www.nature.com/scitable/definition/hardy-weinberg-equilibrium-122

Sample Data

\[\begin{array}{|l|lll|}\hline Genotype& AA & Aa & aa & Total \\ Count &X_1 & X_2 & X_3 & n\\ Frequency & 342 & 500 & 187 & 1029 \\ \hline \end{array}\]

Estimation of \(\theta\)

  • \(X_3 \sim Binomial( n, p=\theta^2)\)

\(\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\) \(\tilde \theta = (\tilde p)^{1/2} = (X_3/n)^{1/2}\) \(=\sqrt{187/1029}=.4263.\)

  • \((X_1, X_2, X_3) \sim Multinomial(n, p=( (1-\theta)^2, 2\theta(1-\theta), \theta^2))\)

\(\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\) Solve for MLE

\[\begin{array}{rcl} l(\theta ) &=& log( f(x_1, x_2, x_3 \mid p_1(\theta), p_2(\theta), p_3(\theta)))\\ &=& log( { n! \over x_1! x_2! x_3!} p_1(\theta)^{x_1} p_2(\theta)^{x_2} p_3(\theta)^{x_3} )\\ &=&x_1 log((1-\theta)^2) + x_2 log( 2 \theta ( 1-\theta)) + x_3 log ( \theta^2) + (\text{non-}\theta \; terms)\\ &=& (2 x_1 + x_2) log(1-\theta) + (2x_3 + x_2) log(\theta) + (\text{non-}\theta \; terms)\\ &&\\ \Longrightarrow \;\; \hat \theta & =& {2 x_3 + x_2 \over 2 x_1 + 2x_2 + 2x_3} = {2 x_3 + x_2 \over 2n} = 0.4247\\ \end{array} \]

  • Which estimate is better?

  • Conduct Parametric Bootstrap Simulation!