Introduction: Discrete Probability Distributions

What is a Discrete Distribution?

A discrete distribution is a statistical distribution whose possible values are discrete. Discrete values are countable values, typically non-negative integers such as 0, 1, 10, and 15, rather than values on a continuum.

Types of Discrete Distributions

Types of discrete probability distributions include:

Poisson:

The Poisson distribution describes the probability of observing exactly x events in a given length of time if the events occur independently at a constant rate. In addition, the Poisson distribution can be obtained as an approximation of a binomial distribution when the number of trials n of the latter distribution is large, the success probability p is small, and np approaches a constant λ.

The PMF of the Poisson distribution is given by

P\left( X = x \right) = \frac{e^{-\lambda}\lambda^{x}}{x!}, \quad x = 0, 1, 2, \ldots

where λ is a positive number. Both the mean and variance of the Poisson distribution are equal to λ.
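As a minimal sketch, the PMF above can be evaluated with SciPy's poisson object; the rate λ = 4 used here is an arbitrary illustrative choice.

    from scipy.stats import poisson

    lam = 4  # the rate parameter lambda; an arbitrary illustrative value

    # P(X = x) for the first few values of x
    for x in range(8):
        print(x, poisson.pmf(x, lam))

    # The mean and variance are both equal to lambda
    print(poisson.mean(lam), poisson.var(lam))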

Bernoulli: 

The Bernoulli distribution is the most basic discrete distribution. A variable that follows the distribution can take one of two possible values, 1 (usually called a success) or 0 (failure), where the probability of success is p, 0 < p < 1. An example of a Bernoulli random variable (that is, a variable that follows the Bernoulli distribution) is the outcome of a coin toss, where the outcome is either a head (success) or a tail (failure) and the probability of a head is a number between 0 and 1.

The PMF of a Bernoulli distribution is given by P(X = x) = p^x (1 − p)^(1−x), where x can be either 0 or 1. The CDF F(x) of the distribution is 0 if x < 0, 1 − p if 0 ≤ x < 1, and 1 if x ≥ 1. The mean and the variance of the distribution are p and p(1 − p), respectively.

If X1, X2, …, Xn are independent Bernoulli random variables, all with success probability p, then their sum

\sum_{i=1}^{n} X_{i},

which counts the number of successes among the Xi's, is a binomial random variable with number of trials n and success probability p. The maximum likelihood estimate of p from a sample x1, x2, …, xn from the Bernoulli random variable is the sample mean

\frac{1}{n}\sum_{i=1}^{n} x_{i},

which is the proportion of successes in the sample.

To generate a random number X from the Bernoulli distribution, one has to first generate a random number R from the U(0, 1) distribution (the uniform distribution between 0 and 1). If R is less than p, then X is set to 1; otherwise, X is set to 0.
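A minimal sketch of this procedure in Python; p = 0.3 and the 10,000 repetitions are arbitrary choices.

    import random

    def bernoulli_draw(p):
        """Draw X ~ Bernoulli(p) by inverting a uniform random number."""
        r = random.random()       # R ~ U(0, 1)
        return 1 if r < p else 0  # success (1) with probability p

    # Check: the proportion of 1s should be close to p
    draws = [bernoulli_draw(0.3) for _ in range(10000)]
    print(sum(draws) / len(draws))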

Binomial:

A binomial distribution is the distribution of the sum of n independent Bernoulli random variables, all of which have the same success probability p. The quantity n is called the number of trials and p the success probability.

The PMF of a binomial distribution is given by

P\left( X = x \right) = \binom{n}{x} p^{x}\left( 1 - p \right)^{n-x}, \quad x = 0, 1, 2, \ldots, n.

The mean and variance of the distribution are np and np(1−p). For large values of n, the probability that the binomial random variable X takes the value x can be computed by approximating X by a normal variable Y with mean np and variance np(1 − p) and computing the probability that Y lies between x − 0.5 and x + 0.5.
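As a sketch of this normal approximation with continuity correction, the following compares the exact PMF value against the approximation using SciPy; n = 100, p = 0.4, and x = 40 are arbitrary illustrative choices.

    from math import sqrt
    from scipy.stats import binom, norm

    n, p, x = 100, 0.4, 40
    mu, sigma = n * p, sqrt(n * p * (1 - p))

    exact = binom.pmf(x, n, p)  # P(X = x)
    # P(x - 0.5 < Y < x + 0.5) for Y ~ Normal(np, np(1 - p))
    approx = norm.cdf(x + 0.5, mu, sigma) - norm.cdf(x - 0.5, mu, sigma)
    print(exact, approx)  # the two values agree closely for large n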

The maximum likelihood estimate of p from a sample x1, x2, …, xs from the binomial distribution is the ratio of the sample mean

\frac{1}{s}\sum_{i=1}^{s} x_{i}

and n. To generate a random number from a binomial distribution, one can generate n Bernoulli random variables, all with probability p, and add them up. Alternatively, one can compute the CDF of the binomial distribution, generate R ∼ U(0, 1), and then take x as the random number from the binomial distribution if the CDF for x − 1 is less than R and the CDF for x is greater than or equal to R (this is often called the CDF-inversion method).
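A minimal sketch of this CDF-inversion method, accumulating the binomial CDF term by term; n = 10 and p = 0.5 are arbitrary choices.

    import random
    from math import comb

    def binomial_inverse_cdf(n, p):
        """Draw X ~ Binomial(n, p) by inverting the CDF at R ~ U(0, 1)."""
        r = random.random()
        cdf = 0.0
        for x in range(n + 1):
            cdf += comb(n, x) * p**x * (1 - p)**(n - x)  # add P(X = x)
            if cdf >= r:  # CDF(x - 1) < R and CDF(x) >= R
                return x
        return n  # guard against floating-point round-off

    print(binomial_inverse_cdf(10, 0.5))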

Several applications of the binomial distribution to educational data can be found in Novick and Jackson (1974). In one such application, the number of words spelled correctly (out of a total of n words) by a student, where each word is equally difficult to spell, is assumed to follow the binomial distribution.

Multinomial:

When the total sample size is fixed, the multinomial distribution applies as the probability distribution for the counts in a contingency table. In many cases, certain marginal totals are also fixed, such as in experimental designs in which the number of subjects at each experimental condition is fixed.

In addition, in regression-type models such as logistic regression, it is common to treat the counts at the combinations of levels of the predictor variables as fixed even if they are not fixed in the sampling design. In these cases, at each such combination the response is usually assumed to have a binomial distribution when the response is binary and a multinomial distribution when the response is multicategory; this replaces the usual normal response assumption in regression models for continuous responses.

Other sampling possibilities, less common in practice, are that the sample size is itself random (i.e., no counts are fixed, such as when observing categorical outcome measures for all subjects who visit some clinic over a future time period) or, at the other extreme, that all marginal totals are fixed; these lead to the Poisson and hypergeometric distributions, respectively.
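As a minimal sketch of multinomial sampling with a fixed total count, the cell counts of a 2 × 2 contingency table can be simulated in one call with NumPy; the cell probabilities below are hypothetical illustrative values.

    import numpy as np

    rng = np.random.default_rng(0)
    cell_probs = [0.2, 0.3, 0.1, 0.4]  # hypothetical probabilities for the 4 cells
    table = rng.multinomial(100, cell_probs).reshape(2, 2)
    print(table)  # one simulated table; the four counts sum to 100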

Discrete probability distributions in machine learning

Discrete probability distributions are used in machine learning, most notably in modeling binary and multi-class classification problems, but also in evaluating the performance of binary classification models (for example, when calculating confidence intervals) and in modeling the distribution of words in text for natural language processing.

Knowledge of discrete probability distributions is also required when choosing an activation function for the output layer of a deep learning neural network for a classification task and when selecting an appropriate loss function.

Discrete probability distributions play an important role in applied machine learning and there are a few distributions that a practitioner must know about.

Consider, for example, the distribution of the number of people who walk into a store at any given hour. A plot of this distribution would give a visual representation of the probability of each count, making it possible to see at a glance, without any quantitative analysis, which counts are most likely.

A discrete random variable is a random variable that can have one of a finite set of specific outcomes. The two types of discrete random variables most commonly used in machine learning are binary and categorical.

  • Binary Random Variable: x in {0, 1}
  • Categorical Random Variable: x in {1, 2, …, K}.

A binary random variable is a discrete random variable where the finite set of outcomes is in {0, 1}. A categorical random variable is a discrete random variable where the finite set of outcomes is in {1, 2, …, K}, where K is the total number of unique outcomes.

Each outcome or event for a discrete random variable has a probability.

The relationship between the events for a discrete random variable and their probabilities is called the discrete probability distribution and is summarized by a probability mass function, or PMF for short.

For outcomes that can be ordered, the probability of an event equal to or less than a given value is defined by the cumulative distribution function, or CDF for short. The inverse of the CDF is called the percent-point function and gives the smallest discrete outcome whose cumulative probability is greater than or equal to a given probability.

  • PMF: Probability Mass Function, returns the probability of a given outcome.
  • CDF: Cumulative Distribution Function, returns the probability of a value less than or equal to a given outcome.
  • PPF: Percent-Point Function, returns the smallest outcome whose cumulative probability is greater than or equal to a given probability.
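As a brief illustration of all three functions, SciPy exposes pmf(), cdf(), and ppf() for its discrete distributions; the binomial parameters n = 10 and p = 0.5 below are arbitrary choices.

    from scipy.stats import binom

    n, p = 10, 0.5  # arbitrary illustrative parameters
    print(binom.pmf(5, n, p))    # PMF: probability of exactly 5 successes
    print(binom.cdf(5, n, p))    # CDF: probability of 5 or fewer successes
    print(binom.ppf(0.5, n, p))  # PPF: smallest outcome with CDF >= 0.5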

There are many common discrete probability distributions.

The most common are the Bernoulli and Multinoulli distributions for binary and categorical discrete random variables respectively, and the Binomial and Multinomial distributions that generalize each to multiple independent trials.

  • Binary Random Variable: Bernoulli Distribution
  • Sequence of a Binary Random Variable: Binomial Distribution
  • Categorical Random Variable: Multinoulli Distribution
  • Sequence of a Categorical Random Variable: Multinomial Distribution

Bernoulli Distribution

The Bernoulli distribution is a discrete probability distribution that covers the case where an event has a binary outcome, either 0 or 1.

  • x in {0, 1}

A “Bernoulli trial” is an experiment or case where the outcome follows a Bernoulli distribution. The distribution and the trial are named after the Swiss mathematician Jacob Bernoulli.

Some common examples of Bernoulli trials include:

  • The single flip of a coin that may have a heads (0) or a tails (1) outcome.
  • A single birth of either a boy (0) or a girl (1).

A common example of a Bernoulli trial in machine learning might be a binary classification of a single example as the first class (0) or the second class (1).

The distribution can be summarized by a single variable p that defines the probability of an outcome 1. Given this parameter, the probability for each event can be calculated as follows:

  • P(x=1) = p
  • P(x=0) = 1 – p

In the case of flipping a fair coin, the value of p would be 0.5, giving a 50% probability of each outcome.
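A minimal simulation sketch of this using NumPy: setting n=1 in binomial() reduces each draw to a single Bernoulli trial, and the sample size of 1,000 is an arbitrary choice.

    import numpy as np

    rng = np.random.default_rng(1)
    flips = rng.binomial(n=1, p=0.5, size=1000)  # n=1 makes each draw a Bernoulli trial
    print(flips[:10])    # e.g. [0 1 1 0 ...]
    print(flips.mean())  # proportion of 1s; should be close to 0.5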

Binomial Distribution

The repetition of multiple independent Bernoulli trials is called a Bernoulli process.

The outcomes of a Bernoulli process will follow a Binomial distribution. As such, the Bernoulli distribution would be a Binomial distribution with a single trial.

Some common examples of Bernoulli processes include:

  • A sequence of independent coin flips.
  • A sequence of independent births.

The performance of a machine learning algorithm on a binary classification problem can be analyzed as a Bernoulli process, where the prediction by the model on an example from a test set is a Bernoulli trial (correct or incorrect).

The Binomial distribution summarizes the number of successes in a given number of Bernoulli trials k, with a given probability of success for each trial p.

We can demonstrate this with a Bernoulli process where the probability of success is 30% or P(x=1) = 0.3 and the total number of trials is 100 (k=100).

We can simulate the Bernoulli process with randomly generated cases and count the number of successes over the given number of trials. This can be achieved via the binomial() NumPy function. This function takes the total number of trials and probability of success as arguments and returns the number of successful outcomes across the trials for one simulation.
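For example (the counts will vary from run to run):

    from numpy.random import binomial

    k, p = 100, 0.3
    successes = binomial(k, p)  # number of successes in one simulation of k trials
    print(successes)            # typically close to k * p = 30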

Multinoulli Distribution

The Multinoulli distribution, also called the categorical distribution, covers the case where an event will have one of K possible outcomes.

  • x in {1, 2, 3, …, K}

It is a generalization of the Bernoulli distribution from a binary variable to a categorical variable; the Bernoulli distribution is the special case of the Multinoulli where the number of outcomes K is set to 2, K=2.

A common example that follows a Multinoulli distribution is:

  • A single roll of a die that will have an outcome in {1, 2, 3, 4, 5, 6}, e.g. K=6.

A common example of a Multinoulli distribution in machine learning might be a multi-class classification of a single example into one of K classes, e.g. one of three different species of the iris flower.

The distribution can be summarized with K variables from p1 to pK, each defining the probability of a given categorical outcome from 1 to K, and where all probabilities sum to 1.0.

  • P(x=1) = p1
  • P(x=2) = p2
  • P(x=3) = p3
  • …
  • P(x=K) = pK

In the case of a single roll of a die, the probability of each value would be 1/6, or about 0.167 (16.7%).
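A minimal sketch of drawing one die roll from this Multinoulli distribution with NumPy; the choice() call performs a single categorical draw with the given probabilities.

    import numpy as np

    rng = np.random.default_rng(2)
    faces = [1, 2, 3, 4, 5, 6]
    roll = rng.choice(faces, p=[1 / 6] * 6)  # one Multinoulli (categorical) trial
    print(roll)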

Multinomial Distribution

The repetition of multiple independent Multinoulli trials will follow a multinomial distribution.

The multinomial distribution is a generalization of the binomial distribution for a discrete variable with K outcomes.

An example of a multinomial process includes a sequence of independent dice rolls.

A common example of the multinomial distribution is the occurrence counts of words in a text document, from the field of natural language processing.

A multinomial distribution is summarized by a discrete random variable with K outcomes, a probability for each outcome from p1 to pK, and k successive trials.

We can demonstrate this with a small example with 3 categories (K=3) with equal probability (p=33.33%) and 100 trials.

Firstly, we can use the multinomial() NumPy function to simulate 100 independent trials and summarize the number of times the event resulted in each of the given categories. The function takes the number of trials and a list of probabilities for each category as arguments.
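For example:

    from numpy.random import multinomial

    # 100 trials across K=3 categories with equal probability
    counts = multinomial(100, [1 / 3, 1 / 3, 1 / 3])
    print(counts)  # e.g. [33 34 33]; the three counts always sum to 100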

Summary

The probability of outcomes for discrete random variables can be summarized using discrete probability distributions. A single binary outcome has a Bernoulli distribution, and a sequence of binary outcomes has a Binomial distribution. A single categorical outcome has a Multinoulli distribution, and a sequence of categorical outcomes has a Multinomial distribution.