Last updated: 2020-06-23
Markov chains are stochastic processes that describe a sequence of possible events in a countable state space, where the probability of transitioning to the next event depends only on the event immediately preceding it. Markov chains are a staple of computational statistics. Our objective today is to learn the basics of discrete-time Markov chains and their long-run behavior.
The Markov assumption states that in order to predict the future behavior of a system, all that is required is knowledge of its present state, not of its past states. For example, given a set of times \(t_1, t_2, t_3, t_4\) and states \(X_1, X_2, X_3, X_4\), under the Markov assumption (or Markov property):
\[P(X_4=1|X_3=0, X_2=1, X_1=1) = P(X_4=1|X_3=0)\]
In other words, “the past and the future are conditionally independent given the present”. If we have knowledge about the present, then knowing the past does not give us any more information to predict what will happen in the future. Another term that is commonly used to describe Markov chains is “memorylessness.”
Question: What distribution that we have discussed in probability is also described by the property of “memorylessness”?
The central dogma of biology describes how information moves from DNA to RNA to Protein.
\[DNA \rightarrow RNA \rightarrow Protein\]
The assumption under the central dogma is that information flows only in one direction, and never backwards. Under a Markov chain model of the central dogma, the amount of RNA you observe in a cell is some function of the genetic variations seen at the DNA sequence level (in coding and noncoding regulatory regions), and the amount of protein you see in the cell is some function of the abundance of RNA transcripts in the cell coding for that protein. If you know the amount of RNA in the cell, then knowing the underlying DNA sequence of the cell at the gene encoding the protein does not give you more information to better predict the amount of protein in the cell. Obviously, there are exceptions to such a simple model of biology, but in the vast majority of cases this model does a very good job of describing biological networks.
A Markov chain describing the states that a time-dependent random variable \(X_t\) takes on at each time step \(t\) is fully determined by two elements:
A transition probability matrix (P) that defines the transition probabilities between each pair of states i and j (\(P_{ij} = P(X_t = j | X_{t-1}=i)\)).
An initial probability distribution vector (\(x_0\)) containing values for \(P(X_0 = i)\) for each state i.
With these two quantities, we can compute the probability that a Markov chain takes on any given state at any given time.
Let’s explore an example of a Markov chain. Say we live in a world where the weather is very predictable. There are only 3 states of weather (X=s when it’s sunny, X=c when it’s cloudy, and X=r when it’s rainy), and the weather stays the same for the whole day. On a day when it is sunny, the probability that it will be sunny the next day is 0.6, the probability that it will be cloudy the next day is 0.3, and the probability that it will be rainy the next day is 0.1. On a day when it is cloudy, the probability that the next day is sunny is 0.2, the probability that the next day is cloudy is 0.3, and the probability that the next day is rainy is 0.5. On a day when it is rainy, the probability that the next day will be rainy is 0.5, the probability that the next day will be sunny is 0.4, and the probability that the next day will be cloudy is 0.1. A visual depiction of these transition probabilities can be found here: https://stackoverflow.com/questions/36574814/creating-three-state-markov-chain-plot. We can also create a transition probability matrix in R, assuming the states are ordered s, c, r:
P <- matrix(c(0.6, 0.3, 0.1,
              0.2, 0.3, 0.5,
              0.4, 0.1, 0.5),
            nrow = 3, byrow = TRUE)
P
[,1] [,2] [,3]
[1,] 0.6 0.3 0.1
[2,] 0.2 0.3 0.5
[3,] 0.4 0.1 0.5
The matrix is constructed with i in the rows and j in the columns. Each cell is filled in with the value \(P_{ij} = P(X_t = j | X_{t-1}=i)\). Each row must sum to 1, since it describes all possible transitions out of state i.
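As a quick sanity check, we can confirm this in R:
rowSums(P) # each row of the transition probability matrix sums to 1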
Let’s also assume that we have an initial probability vector \(x_0\):
x_0 <- c(0.8, 0.1, 0.1)
This means that on day 0 there is a 0.8 probability that it is sunny, a 0.1 probability that it is cloudy, and a 0.1 probability that it is rainy. The vector must also sum to 1, since it fully describes all possible states on day 0.
If we want to compute the probability that on day 1, the weather will be sunny, we must solve the following equation:
\[P(X_1 = sunny) = \sum\limits_i P(X_0 = i)P(X_1 = sunny|X_0 = i)\]
We could do this manually using our vector \(x_0\) and P:
\[P(X_1 = sunny) = 0.8 \times 0.6 + 0.1 \times 0.2 + 0.1 \times 0.4 = 0.54\]
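We can verify this number in R by taking the elementwise product of \(x_0\) with the first column of P (the column corresponding to sunny) and summing:
sum(x_0 * P[, 1]) # 0.8*0.6 + 0.1*0.2 + 0.1*0.4 = 0.54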
The probability that day 1 is sunny is 0.54. But what if we want to calculate the probability that day 2 is sunny? We would need to first find the state probabilities for all three states for day 1 (repeating what we did above two more times), then plug them in to the following equation:
\[P(X_2 = sunny) = \sum\limits_i P(X_1 = i)P(X_2 = sunny|X_1 = i)\]
You can appreciate that manually summing over all possible states would quickly become unwieldy: each additional day multiplies the number of calculations you need to perform by 3.
Fortunately, we can use matrix algebra to do all of this multiplying and summing for us. In R, \(x_1\), our vector of state probabilities on day 1, is:
x_0 %*% P
[,1] [,2] [,3]
[1,] 0.54 0.28 0.18
We can do this for any number of days. For example, for the vector of state probabilities after 10 days:
library(expm) #the expm package allows us to raise a matrix to a power
Loading required package: Matrix
Attaching package: 'expm'
The following object is masked from 'package:Matrix':
expm
x_0 %*% (P %^% 10)
[,1] [,2] [,3]
[1,] 0.4411769 0.2352953 0.3235278
In our simple example, you may ask yourself, “what if the probability that it rains tomorrow depends not only on the weather today, but also on the weather yesterday?” You can actually slightly change our Markov chain to account for this! Instead of representing the states as S, C, and R, imagine expanding them to SS, SC, SR, CS, CC, CR, RS, RC, and RR, creating a Markov chain in which the weather tomorrow depends only on the weather over the past two days; a sketch of this construction is shown below. Now you can see why Markov chains are so powerful!
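As a rough sketch of that construction (the pair states and the matrix Q below are illustrative; for simplicity it reuses the one-day transition probabilities in P as placeholders, since true two-day probabilities were not specified above):
states <- c("s", "c", "r")
pairs <- as.vector(outer(states, states, paste0)) # "ss", "cs", "rs", "sc", ...
Q <- matrix(0, nrow = 9, ncol = 9, dimnames = list(pairs, pairs))
for (from in pairs) {
  today <- substr(from, 2, 2)              # second letter of the pair = today's weather
  for (to in pairs) {
    if (substr(to, 1, 1) == today) {       # tomorrow's pair must begin with today's weather
      Q[from, to] <- P[match(today, states), match(substr(to, 2, 2), states)]
    }
  }
}
rowSums(Q) # each of the 9 rows still sums to 1
In a genuine second-order model, each allowed entry of Q would instead be its own probability of tomorrow's weather given the previous two days.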
For a certain class of Markov chains, called ergodic Markov chains (https://brilliant.org/wiki/stationary-distributions/), there exists a stationary probability distribution: an equilibrium distribution over states that the chain converges upon after many iterations. To see what this means, let’s look at the probability distribution for the weather on the 100th day given our values for P and \(x_0\).
x_0 %*% (P %^% 100)
[,1] [,2] [,3]
[1,] 0.4411765 0.2352941 0.3235294
What about on the 101st day? 102nd?
x_0 %*% (P %^% 101)
[,1] [,2] [,3]
[1,] 0.4411765 0.2352941 0.3235294
x_0 %*% (P %^% 102)
[,1] [,2] [,3]
[1,] 0.4411765 0.2352941 0.3235294
You can see that the probabilities have converged upon an equilibrium vector that does not change as we step further into the future. In fact, this equilibrium or stationary distribution is determined entirely by the transition probability matrix and does not depend on \(x_0\):
c(1, 0, 0) %*% (P %^% 101)
[,1] [,2] [,3]
[1,] 0.4411765 0.2352941 0.3235294
c(0, 1, 0) %*% (P %^% 102)
[,1] [,2] [,3]
[1,] 0.4411765 0.2352941 0.3235294
This makes sense: the further into the future you go, the less important the current state becomes and the less information it gives you about future states. If I know it is cloudy today, I can probably do a good job of predicting whether it will rain tomorrow. But knowing that today is cloudy rather than sunny will not make me any better at predicting whether it will rain 10 years from now.
If we did not want to simply raise a matrix to a large power in order to find a stationary distribution, eigenvalue decomposition (beyond the scope of this class) could be used to determine the stationary distribution directly.
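As a minimal sketch of that idea (the variable names are our own): the stationary distribution is a left eigenvector of P with eigenvalue 1, i.e. an eigenvector of t(P), rescaled to sum to 1.
eig <- eigen(t(P))                                                  # eigendecomposition of the transpose of P
pi_stationary <- Re(eig$vectors[, which.min(abs(eig$values - 1))])  # eigenvector for eigenvalue 1
pi_stationary <- pi_stationary / sum(pi_stationary)                 # rescale so the entries sum to 1
pi_stationary                                                       # approximately 0.441, 0.235, 0.324, matching the result above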
Note that not all Markov chains will have a unique stationary distribution. For example, consider the following transition matrix describing the transitions between two states:
matrix(c(0,1,1,0), ncol =2, byrow=F)
[,1] [,2]
[1,] 0 1
[2,] 1 0
This is called a periodic Markov chain: it alternates deterministically between the two states, so the state probabilities oscillate instead of converging.
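To see the periodicity concretely (P_periodic is just a name we give this matrix), its powers oscillate rather than converge:
library(expm) # already loaded above; provides %^%
P_periodic <- matrix(c(0, 1, 1, 0), ncol = 2, byrow = FALSE)
c(1, 0) %*% (P_periodic %^% 100) # after an even number of steps the chain is back in state 1
c(1, 0) %*% (P_periodic %^% 101) # after an odd number of steps it is in state 2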
The Markov chain with this transition probability matrix will also not reach a unique stationary distribution. Why?
matrix(c(0.5,0.5, 0, 0, 0, 0.5, 0.5, 0, 0, 0, 0, 0, 0.3, 0.4, 0.3, 0, 0, 0.3, 0.3, 0.4, 0, 0, 0.3, 0.3, 0.4), ncol =5, byrow=T)
[,1] [,2] [,3] [,4] [,5]
[1,] 0.5 0.5 0.0 0.0 0.0
[2,] 0.5 0.5 0.0 0.0 0.0
[3,] 0.0 0.0 0.3 0.4 0.3
[4,] 0.0 0.0 0.3 0.3 0.4
[5,] 0.0 0.0 0.3 0.3 0.4
Markov chains like this are known as reducible Markov chains; the chain can never move between the two blocks of states, as the sketch below demonstrates.
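A quick illustration (P_reducible is our name for the matrix above): the long-run distribution depends on which block of states the chain starts in, so no single stationary distribution applies to every starting point.
library(expm) # already loaded above; provides %^%
P_reducible <- matrix(c(0.5, 0.5, 0, 0, 0,
                        0.5, 0.5, 0, 0, 0,
                        0, 0, 0.3, 0.4, 0.3,
                        0, 0, 0.3, 0.3, 0.4,
                        0, 0, 0.3, 0.3, 0.4), ncol = 5, byrow = TRUE)
c(1, 0, 0, 0, 0) %*% (P_reducible %^% 100) # starting in state 1, all probability stays in states 1-2
c(0, 0, 1, 0, 0) %*% (P_reducible %^% 100) # starting in state 3, all probability stays in states 3-5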
Markov chains that converge to a unique stationary distribution are ergodic, that is, both aperiodic and irreducible.
Why do we care about Markov chains? Besides being able to predict the behavior and states of systems, there are many practical applications. For example, if you’ve ever used a smartphone keyboard and seen the suggestions for the next word, you’ve made use of a Markov chain! The subreddit /r/SubredditSimulator generates complete posts using Markov chains whose transition matrices are built from the words in the last 500 submissions to a variety of subreddit communities on reddit.
Perhaps more relevant to us in biology and Bayesian statistics, Markov chains are an important element of Markov chain Monte Carlo (MCMC) methods.
As the name suggests, MCMC methods bring together two concepts: Markov chains and Monte Carlo methods.
Markov chains are stochastic processes in which future states depend only on the current state (the Markov property).
Monte Carlo methods rely on simulating large numbers of random draws to estimate properties of a distribution. An example is here, where Monte Carlo methods are used to estimate pi: https://academo.org/demos/estimating-pi-monte-carlo/. We also use Monte Carlo methods when we estimate a population mean with a sample mean. If we were to take a larger and larger sample from the population (similar to simulating more and more points in the square), our estimate of the population mean (the sample mean) would become more and more precise. (Question: what mathematical law predicts this behavior?)
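A minimal sketch of the pi example (the sample size and seed are arbitrary choices): scatter random points in the unit square and use the fraction that land inside the quarter circle of radius 1.
set.seed(1)
n <- 1e6                   # number of simulated points
x <- runif(n)
y <- runif(n)
4 * mean(x^2 + y^2 <= 1)   # estimate of pi; the estimate improves as n grows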
Markov chain Monte Carlo techniques use Markov chains to sample repeatedly from a complicated or unnamed distribution, then use Monte Carlo methods to estimate properties of that distribution. Why would we ever need to do this? Let’s return to our previous exploration of Bayes’ theorem.
Recall that under Bayes’ theorem,
\[Posterior\ probability \propto Prior\ probability \times Likelihood\] Although there are cases where multiplying the two distributions (our prior probability distribution and the likelihood) results in a posterior that is a named, well-defined distribution (e.g. a normal or binomial distribution), many times it results in an unnamed distribution that we cannot easily work with. In these cases, MCMC methods are used to sample from the unnamed posterior distribution, and statistics are calculated on those samples to determine the mean, median, 95% credible interval, etc. of that distribution.
MCMC methods construct a Markov chain that tends to stay in states of higher posterior probability and to avoid states of lower posterior probability. The equilibrium distribution of this Markov chain is therefore the posterior distribution, and in the long run the distribution of states sampled by the chain will correspond to the posterior distribution.
The algorithm commonly used to construct such chains is the Metropolis-Hastings algorithm: https://stephens999.github.io/fiveMinuteStats/MH_intro.html. A small sketch is given below.
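A minimal random-walk Metropolis sketch (a simplified special case of Metropolis-Hastings; the target density, proposal standard deviation, and iteration count are illustrative choices): the chain samples from a distribution known only up to a normalizing constant, and Monte Carlo summaries are then computed from the draws.
target <- function(x) exp(-x^2 / 2)       # unnormalized density (here, a standard normal)
n_iter <- 10000
samples <- numeric(n_iter)
current <- 0                              # starting state of the chain
for (i in seq_len(n_iter)) {
  proposal <- current + rnorm(1, sd = 1)  # symmetric random-walk proposal
  if (runif(1) < target(proposal) / target(current)) {
    current <- proposal                   # accept; moves toward higher density are always accepted
  }                                       # otherwise the chain stays where it is
  samples[i] <- current
}
mean(samples)                             # Monte Carlo estimate of the target's mean
quantile(samples, c(0.025, 0.975))        # approximate 95% interval of the target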
Use R and matrix multiplication to solve these problems, taken from Sheldon Ross’s Introduction to Probability Models, 11th ed.:
Coin 1 comes up heads with probability 0.6 and coin 2 with probability 0.5. A coin is continually flipped until it comes up tails, at which time that coin is put aside and we start flipping the other one. If we start the process with coin 1, what is the probability that coin 2 is used on the fifth flip?
Suppose that coin 1 has probability 0.7 of coming up heads, and coin 2 has probability 0.6 of coming up heads. If the coin flipped today comes up heads, then we select coin 1 to flip tomorrow, and if it comes up tails, then we select coin 2 to flip tomorrow. If the coin initially flipped is equally likely to be coin 1 or coin 2, then what is the probability that the coin flipped on the third day after the initial flip is coin 1? Suppose that the coin flipped on Monday comes up heads. What is the probability that the coin flipped on Friday of the same week also comes up heads?
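As a hedged starting point for the first problem (the object names and the setup below are ours, not Ross’s): take the state to be which coin is used for a given flip, so the coin used on the fifth flip is obtained by applying the transition matrix four times to the initial state.
library(expm) # already loaded above; provides %^%
P_coins <- matrix(c(0.6, 0.4,   # from coin 1: stay on coin 1 with prob 0.6 (heads), switch to coin 2 with prob 0.4 (tails)
                    0.5, 0.5),  # from coin 2: switch to coin 1 with prob 0.5 (tails), stay on coin 2 with prob 0.5 (heads)
                  nrow = 2, byrow = TRUE)
x_start <- c(1, 0)              # the process begins with coin 1
x_start %*% (P_coins %^% 4)     # probabilities of each coin being the one used on the fifth flip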
sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] expm_0.999-4 Matrix_1.2-18
loaded via a namespace (and not attached):
[1] workflowr_1.5.0 Rcpp_1.0.4.6 lattice_0.20-38 rprojroot_1.3-2
[5] digest_0.6.25 later_1.0.0 grid_3.6.3 R6_2.4.1
[9] backports_1.1.6 git2r_0.26.1 magrittr_1.5 evaluate_0.14
[13] stringi_1.4.5 rlang_0.4.5 fs_1.3.1 promises_1.1.0
[17] whisker_0.4 rmarkdown_1.18 tools_3.6.3 stringr_1.4.0
[21] glue_1.4.0 httpuv_1.5.2 xfun_0.12 yaml_2.2.1
[25] compiler_3.6.3 htmltools_0.4.0 knitr_1.26