Maximum Likelihood#

How do we get the loss function that we use for logistic regression? It relies on a statistical argument called maximum likelihood (ML). Sadly, ML is used to represent both maximum likelihood and machine learning, two important topics in data science. So you’ll just kind of have to get used to figuring out which one is meant from the context.

To figure out maximum likelihood, let’s consider a bunch of coin flips, each with their own probability of a head. Say

$$
p_i = P(Y_i = 1 \mid x_i) ~~~~~~ 1 - p_i = P(Y_i = 0 \mid x_i)
$$

Here, we write $\mid x_i$ in the probability statement to denote that the probability may depend on the realized value of some variable, $X_i$, that also depends on $i$, which is denoted $x_i$. So, for context, think of $Y_i$ as the event that person $i$ has hypertension and $x_i$ as their smoking consumption in pack years. We’d like to estimate the probability that someone has hypertension given their pack years.

We could write this more compactly as:

$$
P(Y_i = j \mid x_i) = p_i^j (1 - p_i)^{1 - j} ~~~~ j \in \{0, 1\}.
$$
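As a quick numeric check of this compact form, here’s a tiny sketch (the value 0.7 is just a made-up probability):

```python
# Check that p**j * (1 - p)**(1 - j) returns p when j = 1 and 1 - p when j = 0.
p = 0.7  # a made-up value for P(Y_i = 1 | x_i)
for j in (0, 1):
    print(j, p ** j * (1 - p) ** (1 - j))
# prints roughly: 0 0.3 and 1 0.7
```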

Consider a dataset, $Y_1, \ldots, Y_n$ and $x_1, \ldots, x_n$. Consider a sequence of potential observed values of the $Y_i$, say $y_1, \ldots, y_n$, where each $y_i$ is 0 or 1. Then, using our formula:

$$
P(Y_i = y_i \mid x_i) = p_i^{y_i} (1 - p_i)^{1 - y_i}
$$

This is the (perhaps?) unfortunate notation that statisticians use: $Y_i$ for the conceptual value of a variable and $y_i$ for a realized value, a number that we could plug in. This equation is just the probability of one $y_i$. This is why I’m using a lowercase $x_i$ for the variables we’re conditioning on. Perhaps if I were being more correct, I would write something like $P(Y_i = y_i \mid X_i = x_i)$, but I find that adds too much notation.

What about all of them jointly? If the coin flips are independent, a statistical way of saying unrelated, then the probabilities multiply. So the joint probability of our data in this case is

$$
P(Y_1 = y_1, \ldots, Y_n = y_n \mid x_1, \ldots, x_n) = \prod_{i=1}^n p_i^{y_i} (1 - p_i)^{1 - y_i}
$$
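To make the multiplication concrete, here’s a minimal numpy sketch; the probabilities and outcomes are made up purely for illustration:

```python
import numpy as np

# Made-up per-person probabilities of a 1 and the observed 0/1 outcomes.
p = np.array([0.10, 0.40, 0.80, 0.65])
y = np.array([0, 1, 1, 0])

# Joint probability under independence: the product of p_i^{y_i} (1 - p_i)^{1 - y_i}.
joint = np.prod(p ** y * (1 - p) ** (1 - y))
print(joint)  # same as (1 - 0.10) * 0.40 * 0.80 * (1 - 0.65)
```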

This model doesn’t say much; there’s nothing to tie these probabilities together. In our example, all we could do is estimate the probability of hypertension separately for each group of people with exactly the same pack years. There’s no parsimony, so to speak. It seems logical that groups with nearly the same pack years should have similar probabilities, or, even better, that the probabilities vary smoothly with pack years. Our logistic regression model does this.

Consider again our logistic regression model:

$$
\mathrm{logit}(p_i) = \beta_0 + \beta_1 x_i
$$

Now we have a model that relates the probabilities to the $x_i$ in a smooth way. This implies that if we plot $x_i$ versus $\mathrm{logit}(p_i)$ we have a line, and if we plot $x_i$ versus $p_i$ it looks like a sigmoid, as in the sketch below.
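Here’s a small plotting sketch of those two displays; the coefficient values are arbitrary choices just to draw the curves:

```python
import numpy as np
import matplotlib.pyplot as plt

beta0, beta1 = -2.0, 0.5  # arbitrary illustrative coefficients
x = np.linspace(-10, 20, 200)

eta = beta0 + beta1 * x              # logit(p_i): a line in x
p = np.exp(eta) / (1 + np.exp(eta))  # p_i: a sigmoid in x

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].plot(x, eta)
axes[0].set_title("x versus logit(p)")
axes[1].plot(x, p)
axes[1].set_title("x versus p")
plt.show()
```

So, under this model, what is our joint probability?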

$$
P(Y_1 = y_1, \ldots, Y_n = y_n \mid x_1, \ldots, x_n) = \prod_{i=1}^n p_i^{y_i} (1 - p_i)^{1 - y_i}
= \prod_{i=1}^n \left(\frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}}\right)^{y_i} \left(\frac{1}{1 + e^{\beta_0 + \beta_1 x_i}}\right)^{1 - y_i}
$$

We can rearrange this a bit to get

$$
\exp\left\{\beta_0 \sum_{i=1}^n y_i + \beta_1 \sum_{i=1}^n y_i x_i\right\} \times \prod_{i=1}^n \left(\frac{1}{1 + e^{\beta_0 + \beta_1 x_i}}\right)
$$

Notice, interestingly, this only depends on $n$, $\sum_{i=1}^n y_i$ and $\sum_{i=1}^n y_i x_i$. These are called the sufficient statistics, since we don’t actually need to know the individual data points, just these quantities. (Effectively, these quantities can be turned into the proportion of $Y$s that are one and the correlation between the $Y$s and $X$s.) Plug in these quantities and this equation spits out the joint probability of that particular sequence of 0s and 1s for the $Y$s given this particular collection of $X$s. Of course, we can’t actually do this, because we don’t know $\beta_0$ or $\beta_1$.
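As a sanity check on that rearrangement, here’s a small numerical sketch; the data and the coefficient values are made up (pretending, for the moment, that we knew $\beta_0$ and $\beta_1$):

```python
import numpy as np

# Made-up data and coefficients, purely to check the algebra.
rng = np.random.default_rng(0)
x = rng.normal(size=5)
y = rng.integers(0, 2, size=5)
b0, b1 = -0.5, 1.2

eta = b0 + b1 * x
p = np.exp(eta) / (1 + np.exp(eta))

# Original product form of the joint probability.
product_form = np.prod(p ** y * (1 - p) ** (1 - y))

# Factored form: uses y only through the sufficient statistics sum(y) and sum(y * x).
factored_form = np.exp(b0 * y.sum() + b1 * (y * x).sum()) * np.prod(1 / (1 + np.exp(eta)))

print(np.isclose(product_form, factored_form))  # True
```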

The statistician Fisher got around this problem using maximum likelihood. The principle is something like this: pick the values of $\beta_0$ and $\beta_1$ that make the data that we actually observed most probable. This seems like a good idea, since the data that we observed must be at least somewhat probable (since we observed it). When you take the joint probability, plug in the actual $Y$s and $X$s that we observed, and view it as a function of $\beta_0$ and $\beta_1$, it’s called a likelihood. So a likelihood is the joint probability with the observed data plugged in, and maximum likelihood finds the values of the parameters that make the data we observed most likely.

Let’s consider that for our problem. Generally, since sums are more convenient than products, we take the natural logarithm. Then, this works out to be:

$$
\beta_0 \sum_{i=1}^n y_i + \beta_1 \sum_{i=1}^n y_i x_i - \sum_{i=1}^n \log\left(1 + e^{\beta_0 + \beta_1 x_i}\right)
$$

This is the function that sklearn maximizes over $\beta_1$ and $\beta_0$ to obtain the estimates. There are quite a few really good properties of maximum likelihood, which is why we use it.
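To see what this looks like in code, here’s a sketch that maximizes this log likelihood numerically on simulated data and compares the result to sklearn. The simulation settings are arbitrary, and I turn off sklearn’s default penalty so both are maximizing the same function (older sklearn versions use penalty='none' rather than penalty=None):

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import LogisticRegression

# Simulate data from a logistic model with arbitrary true coefficients.
rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
eta = -1.0 + 2.0 * x
y = rng.binomial(1, np.exp(eta) / (1 + np.exp(eta)))

def negative_log_likelihood(beta):
    b0, b1 = beta
    eta = b0 + b1 * x
    # beta0 * sum(y) + beta1 * sum(y * x) - sum(log(1 + exp(eta))), negated for minimize.
    return -(b0 * y.sum() + b1 * (y * x).sum() - np.sum(np.logaddexp(0, eta)))

fit = minimize(negative_log_likelihood, x0=np.zeros(2))
print("scipy:  ", fit.x)

# Unpenalized logistic regression should land at (nearly) the same maximum.
skl = LogisticRegression(penalty=None).fit(x.reshape(-1, 1), y)
print("sklearn:", skl.intercept_, skl.coef_)
```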

Linear regression#

You might be surprised to find out that linear regression can also be cast as a likelihood problem. Consider an instance where we assume that the $Y_i$ are Gaussian with a mean equal to $\beta_0 + \beta_1 x_i$ and variance $\sigma^2$. What that means is that the probability that $Y_i$ lies between the points $A$ and $B$ is governed by the equation

$$
P(Y_i \in [A, B] \mid x_i) \propto \int_A^B \exp\left\{-(y_i - \beta_0 - \beta_1 x_i)^2 / 2\sigma^2\right\} dy_i
$$

Letting $A = -\infty$ and taking the derivative with respect to $B$, we obtain (up to a constant) the density function, sort of the probability on an infinitesimally small interval:

$$
\exp\left\{-(y_i - \beta_0 - \beta_1 x_i)^2 / 2\sigma^2\right\}
$$

Using the density evaluated at the observed data, the joint likelihood, assuming independence, is:

$$
\prod_{i=1}^n \exp\left\{-(y_i - \beta_0 - \beta_1 x_i)^2 / 2\sigma^2\right\} = \exp\left\{-\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 / 2\sigma^2\right\}
$$

Since it’s more convenient to deal with logs, we get that the joint log likelihood is

$$
-\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 / 2\sigma^2
$$

Since minimizing the negative is the same as maximizing this, and the constants of proportionality (including $\sigma^2$) are irrelevant when maximizing over $\beta_1$ and $\beta_0$, we get that maximum likelihood for these parameters minimizes

$$
\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2
$$

which is the same thing we minimized to obtain our least squares regression estimates.
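As a numerical illustration of the equivalence, here’s a small sketch on simulated data (arbitrary true values) that minimizes the sum of squares directly and compares the answer to numpy’s least squares fit:

```python
import numpy as np
from scipy.optimize import minimize

# Simulate Gaussian data around a line (true intercept, slope, and sigma are arbitrary).
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=200)
y = 1.5 + 0.7 * x + rng.normal(scale=2.0, size=200)

# Minimizing this in beta is the same as maximizing the Gaussian log likelihood.
def sum_of_squares(beta):
    b0, b1 = beta
    return np.sum((y - b0 - b1 * x) ** 2)

fit = minimize(sum_of_squares, x0=np.zeros(2))
print("ML / minimized sum of squares:", fit.x)
print("np.polyfit least squares:     ", np.polyfit(x, y, 1)[::-1])  # [intercept, slope]
```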