Maximum Likelihood

How do we get the loss function that we use for logistic regression? It relies on a statistical argument called maximum likelihood (ML). Sadly, ML is used to represent both maximum likelihood and machine learning, both important topics in data science. So you'll just have to get used to figuring out which one is meant from the context.

To figure out maximum likelihood, let’s consider a bunch of coin flips, each with their own probability of a head. Say

$$
p_i = P(Y_i = 1 | x_i) ~~~ 1 - p_i = P(Y_i = 0 | x_i)
$$

Here, we write $| x_i$ in the probability statement to denote that the probability may depend on the realized value of some variable that also depends on $i$, $X_i$, which is denoted as $x_i$. So, for context, think of $Y_i$ as the event that person $i$ has hypertension and $x_i$ as their smoking consumption in pack years. We'd like to estimate the probability that someone has hypertension given their pack years.

We could write this more compactly as:

$$
P(Y_i = j | x_i) = p_i^j (1 - p_i)^{1-j} ~~~ j \in \{0, 1\}.
$$

Consider a dataset, $Y_1, \ldots, Y_n$ and $x_1, \ldots, x_n$. Consider a sequence of potential observed values of the $Y_i$, say $y_1, \ldots, y_n$, where each $y_i$ is 0 or 1. Then, using our formula:

$$
P(Y_i = y_i | x_i) = p_i^{y_i} (1 - p_i)^{1-y_i}
$$

This is the (perhaps?) unfortunate notation that statisticians use: $Y_i$ for the conceptual value of a variable and $y_i$ for a realized value, a number that we could plug in. This equation is just the probability of one $y_i$. This is why I'm using a lowercase $x_i$ for the variables we're conditioning on. Perhaps if I were being more correct, I would write something like $P(Y_i = y_i ~|~ X_i = x_i)$, but I find that adds too much notation.

What about all of them jointly? If the coin flips are independent, a statistical way of saying unrelated, then the probabilities multiply. So the joint probability of our data in this case is

$$
P(Y_1 = y_1, \ldots, Y_n = y_n ~|~ x_1, \ldots, x_n) = \prod_{i=1}^n p_i^{y_i} (1 - p_i)^{1-y_i}
$$
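
Here is a quick numerical sketch of that product. The probabilities and the 0/1 sequence below are made up for illustration, not taken from any dataset.

```python
# A minimal sketch: the joint probability of independent 0/1 outcomes,
# each with its own success probability. All numbers here are made up.
import numpy as np

p = np.array([0.1, 0.5, 0.8, 0.3])   # hypothetical per-person probabilities p_i
y = np.array([0, 1, 1, 0])           # a hypothetical observed sequence y_i

# P(Y_1 = y_1, ..., Y_n = y_n | x's) = prod_i p_i^{y_i} * (1 - p_i)^{1 - y_i}
joint = np.prod(p**y * (1 - p)**(1 - y))
print(joint)  # 0.9 * 0.5 * 0.8 * 0.7 = 0.252
```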

This model doesn't say much; there's nothing to tie these probabilities together. In our example, all we could do is estimate the probability of hypertension separately for each group of people with exactly the same pack years. There's no parsimony, so to speak. It seems logical that groups with nearly the same pack years should have similar probabilities, or even better, that the probabilities vary smoothly with pack years. Our logistic regression model does this.

Consider again our logistic regression model:

$$
\mathrm{logit}(p_i) = \beta_0 + \beta_1 x_i
$$
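
As a quick visual sketch of this model (the intercept and slope values below are made up for illustration), here it is plotted on the two scales discussed in the next paragraph:

```python
# Sketch: the model is a line on the logit scale and a sigmoid on the
# probability scale. The coefficients below are made-up values.
import numpy as np
import matplotlib.pyplot as plt

beta0, beta1 = -2.0, 0.1              # hypothetical intercept and slope
x = np.linspace(0, 60, 200)           # e.g., pack years
eta = beta0 + beta1 * x               # logit(p_i) = beta0 + beta1 * x_i
p = 1 / (1 + np.exp(-eta))            # invert the logit to get p_i

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].plot(x, eta)
axes[0].set_title("x vs logit(p): a line")
axes[1].plot(x, p)
axes[1].set_title("x vs p: a sigmoid")
plt.show()
```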

Now we have a model that relates the probabilities to the $x_i$ in a smooth way. This implies that if we plot $x_i$ versus $\mathrm{logit}(p_i)$ we have a line, and if we plot $x_i$ versus $p_i$ it looks like a sigmoid. So, under this model, what is our joint probability?

$$
P(Y_1 = y_1, \ldots, Y_n = y_n ~|~ x_1, \ldots, x_n) = \prod_{i=1}^n p_i^{y_i} (1 - p_i)^{1-y_i} = \prod_{i=1}^n \left(\frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}}\right)^{y_i} \left(\frac{1}{1 + e^{\beta_0 + \beta_1 x_i}}\right)^{1-y_i}
$$

We can work with this a bit to get

$$
\exp\left\{\beta_0 \sum_{i=1}^n y_i + \beta_1 \sum_{i=1}^n y_i x_i\right\} \times \prod_{i=1}^n \left(\frac{1}{1 + e^{\beta_0 + \beta_1 x_i}}\right)
$$

Notice, interestingly, that as a function of the observed $y$'s this only depends on $n$, $\sum_{i=1}^n y_i$ and $\sum_{i=1}^n y_i x_i$. These are called the sufficient statistics, since we don't actually need to know the individual data points, just these quantities. (Effectively these quantities can be turned into the proportion of Ys that are one and the correlation between the Ys and Xs.) Plug in these quantities and this equation spits out the joint probability of that particular sequence of 0s and 1s for the Ys given this particular collection of Xs. Of course, we can't actually do this, because we don't know $\beta_0$ or $\beta_1$.
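
Here is a small simulated check that the rearranged form above, written in terms of $\sum y_i$ and $\sum y_i x_i$, matches the original product. The data and the coefficient values are made up for illustration.

```python
# Sketch: the joint probability written with the sufficient statistics
# sum(y) and sum(y * x) matches the original product form. Data are simulated.
import numpy as np

rng = np.random.default_rng(0)
n = 8
x = rng.uniform(0, 40, size=n)                 # hypothetical pack years
beta0, beta1 = -2.0, 0.1                       # hypothetical coefficients
eta = beta0 + beta1 * x
p = np.exp(eta) / (1 + np.exp(eta))
y = rng.binomial(1, p)                         # hypothetical 0/1 outcomes

# Original product form: prod_i p_i^{y_i} * (1 - p_i)^{1 - y_i}
direct = np.prod(p**y * (1 - p)**(1 - y))

# Rearranged form: exp{b0 * sum(y) + b1 * sum(y*x)} * prod_i 1 / (1 + e^eta_i)
via_suff = (np.exp(beta0 * y.sum() + beta1 * (y * x).sum())
            * np.prod(1 / (1 + np.exp(eta))))

print(np.isclose(direct, via_suff))            # True
```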

The statistician Fisher got around this problem using maximum likelihood. The principle is something like this: pick the values of $\beta_0$ and $\beta_1$ that make the data that we actually observed most probable. This seems like a good idea, since the data that we observed must be at least somewhat probable (since we observed it). When you take the joint probability, plug in the actual Ys and Xs that we observed, and view it as a function of $\beta_0$ and $\beta_1$, it's called a likelihood. So a likelihood is the joint probability with the observed data plugged in, and maximum likelihood finds the values of the parameters that make the data that we observed most likely.

Let's consider that for our problem. Generally, since sums are more convenient than products, we take the natural logarithm. Then, this works out to be:

$$
\beta_0 \sum_{i=1}^n y_i + \beta_1 \sum_{i=1}^n y_i x_i - \sum_{i=1}^n \log\left(1 + e^{\beta_0 + \beta_1 x_i}\right)
$$

This is the function that sklearn maximizes over $\beta_1$ and $\beta_0$ to obtain the estimates (strictly speaking, sklearn's LogisticRegression adds a regularization penalty by default, so you have to turn that off to get the pure maximum likelihood estimates). There are quite a few really good properties of maximum likelihood, which is why we use it.
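
As a sketch of this, we can maximize the log likelihood above directly with a general-purpose optimizer and compare the answer to sklearn's fit; the data below are simulated and the "true" coefficient values are made up.

```python
# Sketch: maximize the log likelihood directly and compare with sklearn.
# The simulated data and the "true" coefficients below are made up.
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(0, 40, size=n)                     # hypothetical pack years
p_true = 1 / (1 + np.exp(-(-2.0 + 0.1 * x)))
y = rng.binomial(1, p_true)

def negative_log_likelihood(beta):
    b0, b1 = beta
    eta = b0 + b1 * x
    # negative of: b0 * sum(y) + b1 * sum(y * x) - sum(log(1 + e^eta))
    return -(b0 * y.sum() + b1 * (y * x).sum() - np.sum(np.log1p(np.exp(eta))))

direct_fit = minimize(negative_log_likelihood, x0=np.zeros(2))
print("direct maximum likelihood:", direct_fit.x)

# penalty=None turns off sklearn's default regularization (older versions
# use penalty='none'), so this performs the same maximization.
sk_fit = LogisticRegression(penalty=None).fit(x.reshape(-1, 1), y)
print("sklearn:", sk_fit.intercept_[0], sk_fit.coef_[0, 0])
```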

Linear regression

You might be surprised to find out that linear regression can also be cast as a likelihood problem. Consider an instance where we assume that the $Y_i$ are Gaussian with a mean equal to $\beta_0 + \beta_1 x_i$ and variance $\sigma^2$. What that means is that the probability that $Y_i$ lies between the points $A$ and $B$ is governed by the equation

$$
P(Y_i \in [A, B) ~|~ x_i) = \int_A^B \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -(y_i - \beta_0 - \beta_1 x_i)^2 / 2\sigma^2 \right\} dy_i
$$

Letting $A = -\infty$ and taking the derivative with respect to $B$, we obtain the density function, sort of the probability on an infinitesimally small interval:

$$
\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -(y_i - \beta_0 - \beta_1 x_i)^2 / 2\sigma^2 \right\}
$$

Using the density evaluated at the observed data, and dropping the constant factor $(2\pi\sigma^2)^{-n/2}$ (which doesn't involve $\beta_0$ or $\beta_1$), the joint likelihood assuming independence is:

$$
\prod_{i=1}^n \exp\left\{ -(y_i - \beta_0 - \beta_1 x_i)^2 / 2\sigma^2 \right\} = \exp\left\{ -\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 / 2\sigma^2 \right\}
$$

Since it's more convenient to deal with logs, we get that the joint log likelihood is

$$
- \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 / 2\sigma^2
$$

Since maximizing this is the same as minimizing its negative, and the constant of proportionality $1/2\sigma^2$ is irrelevant when maximizing over $\beta_1$ and $\beta_0$, we get that maximum likelihood for these parameters minimizes

$$
\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2
$$

which is the same thing we minimized to obtain our least squares regression estimates.
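
Here is a quick simulated check that the two routes land on the same intercept and slope. The data are made up, and $\sigma^2$ is fixed at an arbitrary value since it doesn't change which $\beta_0$, $\beta_1$ do the minimizing.

```python
# Sketch: minimizing the negative Gaussian log likelihood gives the same
# intercept and slope as ordinary least squares. Data below are simulated.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n = 200
x = rng.uniform(0, 10, size=n)
y = 1.5 + 2.0 * x + rng.normal(scale=3.0, size=n)    # made-up linear data

def negative_log_likelihood(beta, sigma2=1.0):
    b0, b1 = beta
    # up to additive constants: sum_i (y_i - b0 - b1 * x_i)^2 / (2 * sigma2)
    return np.sum((y - b0 - b1 * x) ** 2) / (2 * sigma2)

ml_coefs = minimize(negative_log_likelihood, x0=np.zeros(2)).x

# Ordinary least squares; np.polyfit returns [slope, intercept] for degree 1
ls_slope, ls_intercept = np.polyfit(x, y, deg=1)

print("maximum likelihood:", ml_coefs)                # [intercept, slope]
print("least squares:     ", ls_intercept, ls_slope)
```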