Introduction to Markov Chain Monte Carlo Methods

Monte Carlo is a sample from a distribution which is used to estimate the distribution and to compute max, mean

Markov Chain Monte Carlo is a sampling using “local” information which is used for Generic “problem solving technique”, decision/optimization/value problems and generic, but not necessarily very efficient.


Markov chain Monte Carlo (MCMC) was invented soon after ordinary Monte Carlo at Los Alamos, one of the few places where computers were available at the time. Metropolis et al. (1953) simulated a liquid in equilibrium with its gas phase.

The obvious way to find out about the thermodynamic equilibrium is to simulate the dynamics of the system, and let it run until it reaches equilibrium. The tour de force was their realization that they did not need to simulate the exact dynamics; they only needed to simulate some Markov chain having the same equilibrium distribution. Simulations following the scheme of Metropolis et al. (1953) are said to use the Metropolis algorithm. As computers became more widely available, the Metropolis algorithm was widely used by chemists and physicists, but it did not become widely known among statisticians until after 1990.

Hastings (1970) generalized the Metropolis algorithm, and simulations following his scheme are said to use the Metropolis–Hastings algorithm. A special case of the Metropolis Hastings algorithm was introduced by Geman and Geman (1984), apparently without knowledge of earlier work. Simulations following their scheme are said to use the Gibbs sampler. Much of Geman and Geman (1984) discusses optimization to find the posterior mode rather than simulation, and it took some time for it to be understood in the spatial statistics community that the Gibbs sampler simulated the posterior distribution, thus enabling full Bayesian inference of all kinds. A methodology that was later seen to be very similar to the Gibbs sampler was introduced by Tanner and Wong (1987), again apparently without knowledge of earlier work. To this day, some refer to the Gibbs sampler as “data augmentation” following these authors. Gelfand and Smith (1990) made the wider Bayesian community aware of the Gibbs sampler, which up to that time had been known only in the spatial statistics community. Then it took off; as of this writing, a search for Gelfand and Smith (1990) on Google Scholar yields 4003 links to other works. It was rapidly realized that most Bayesian inference could.

The fifth author was Edward Teller, the “father of the hydrogen bomb.” Handbook of Markov Chain Monte Carlo be done by MCMC, whereas very little could be done without MCMC. It took a while for researchers to properly understand the theory of MCMC (Geyer, 1992; Tierney, 1994) and that all of the aforementioned work was a special case of the notion of MCMC. Green (1995) generalized the Metropolis Hastings algorithm, as much as it can be generalized. Although this terminology is not widely used, we say that simulations following his scheme use the Metropolis, Hastings, Green algorithm. MCMC is not used only for Bayesian inference. Likelihood inference in cases where the likelihood cannot be calculated explicitly due to missing data or complex dependence can also use MCMC (Geyer, 1994, 1999; Geyer and Thompson, 1992, 1995, and references cited therein).

Markov Chain Monte Carlo

A sequence X1, X2, ... of random elements of some set is a Markov chain if the conditional distribution of Xn+1 given X1, ... , Xn depends on Xn only. The set in which the Xi take values is called the state space of the Markov chain.

A Markov chain has stationary transition probabilities if the conditional distribution of Xn+1 given Xn does not depend on n. This is the main kind of Markov chain of interest in MCMC. Some kinds of adaptive MCMC have nonstationary transition probabilities. 

  • The joint distribution of a Markov chain is determined by 
  • The marginal distribution of X1, called the initial distribution
  • The conditional distribution of Xn+1 given Xn, called the transition probability distribution (because of the assumption of stationary transition probabilities, this does not depend on n)

Random Walk on {0,1}m


Generate chain : pick  j \epsilon \left \{ 1,..,m \right \}  uniformly at random and set Xt =(z1 ,...,1-zj ,...,zm) where (z1 ,...,zm) = Xt-1

Markov Chain Monte Carlo basic idea :

Given a probability distribution \pi on a set \Omega, the problem is to generate random elements of \Omega with distribution \pi

Markov Chain Monte Carlo does that by constructing a Markov Chain with stationary distribution \pi and simulating the chain.

Computer Programs and Markov Chains

Suppose you have a computer program

Initialize x 
 repeat { 
      Generate pseudorandom change to x 
       Output x 

If x is the entire state of the computer program exclusive of random number generator seeds (which we ignore, pretending pseudorandom is random), this is MCMC. It is important that x must be the entire state of the program. Otherwise the resulting stochastic process need not be Markov.

There is not much structure here. Most simulations can be fit into this format. Thus most simulations can be thought of as MCMC if the entire state of the computer program is considered the state of the Markov chain. Hence, MCMC is a very general simulation methodology

Markov Chain Monte Carlo : Uniform Sampler

Problem: sample elements uniformly at random from set (large but finite) Ω.

Idea: construct an irreducible symmetric Markov Chain with states Ω and run it for sufficient time.

Example: generate uniformly at random a feasible solution to the Knapsack Problem.

m items and their weight wi and value vi , knapsack with weight limit b


z=(z_{1} ,...,z_{m})\epsilon \left \{ 0,1 \right \}^{m} ,  zi means whether we take item vi.

feasible solutions \Omega = \left \{z \epsilon \left \{ 0,1 \right \}^{m} ; \sum _{i}w_{i} z_{i}\leq b \right \}

problem : maximize \sum _{i} v_{i} z_{i} subject to z \epsilon Ω

Markov Chain Monte Carlo : Optimization

Metropolis Algorithm theoretically works, but it needs large \beta to make “good” states more likely and its convergence time may be exponential in \beta and also it tries changing b over time.

Simulated Annealing is for Knapsack Problem:

\alpha = min\left \{ 1,exp\beta \left ( t \right ) \sum_{i}^{}v_{i}\left ( y_{i}-z_{i} \right )\right \}

 \beta \left ( t \right ) increases slowly with time (e.g. =log(t), =(1.001)t)