If you mess around with machine learning for long enough, you’ll eventually run into the **logit** and **sigmoid** functions. These are useful functions when you are working with probabilities or trying to classify data.

Given a probability *p*, the corresponding **odds** are calculated as *p / (1 – p)*. For example, if *p* = 0.75, the odds are 3 to 1: 0.75/0.25 = 3.
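As a quick check of the arithmetic, here is a minimal Python sketch of the odds calculation. The function name `odds` and the input validation are my own choices for illustration, not part of any library.

```python
def odds(p: float) -> float:
    """Return the odds p / (1 - p) for a probability with 0 <= p < 1."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1.0 - p)


print(odds(0.75))  # 3.0, i.e. odds of 3 to 1
print(odds(0.25))  # roughly 0.333, i.e. odds of 1 to 3
```

Note that the odds are undefined at *p* = 1, which is why the sketch rejects that value.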

The **logit function** is simply the logarithm of the odds: *logit(p) = log(p / (1 – p))*. Here is a plot of the logit function:

The value of the logit function heads towards infinity as *p* approaches 1 and towards negative infinity as it approaches 0.
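To see this behavior numerically, the following sketch evaluates the logit at a few probabilities. It assumes the natural logarithm (the usual convention in machine learning), and the helper name `logit` is just for illustration.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1 (natural log)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))


# Values near 0 give large negative results, values near 1 large positive ones.
for p in (0.001, 0.25, 0.5, 0.75, 0.999):
    print(p, logit(p))
```

The logit of 0.5 is exactly 0, because the odds at that point are 1 to 1.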

The logit function is useful in analytics because it maps probabilities (which are values between 0 and 1) to the full range of real numbers. In particular, if you are working with “yes-no” (binary) outcomes, it can be useful to model the log-odds of a “yes” as a real-valued quantity rather than modeling the probability directly. This is essentially what happens in logistic regression.

The inverse of the logit function is the **sigmoid** function. That is, if you have a probability *p*, *sigmoid(logit(p)) = p*. The sigmoid function maps arbitrary real values __back__ to the range [0, 1]. The larger the value, the closer to 1 you’ll get.

The formula for the sigmoid function is *σ(x) = 1/(1 + exp(-x))*. Here is a plot of the function:
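Here is a small sketch of the sigmoid in Python, together with a check that it undoes the logit. Splitting the computation by the sign of *x* is a common trick for avoiding overflow in `exp`; it is my addition, not something the formula itself requires.

```python
import math


def sigmoid(x: float) -> float:
    """Compute 1 / (1 + exp(-x)) without overflowing for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # safe here: x < 0, so exp(x) is below 1
    return z / (1.0 + z)


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


# Round trip: sigmoid(logit(p)) should give back p (up to rounding).
for p in (0.1, 0.5, 0.9):
    print(p, sigmoid(logit(p)))
```

In floating point the result saturates: for inputs far from zero the sketch returns exactly 0.0 or 1.0, even though the mathematical function never quite reaches either value.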

The sigmoid might be useful if you want to transform a real-valued variable into something that represents a probability. This sometimes happens at the end of a classification process. (As Wikipedia and other sources note, the term “sigmoid function” is used to refer to a class of functions with S-shaped curves. In most machine learning contexts, “sigmoid” usually refers specifically to the function described above.)

There are other functions that map probabilities to reals (and vice-versa), so what’s so special about the logit and sigmoid? One reason is that the logit function has a nice connection to odds, as described at the beginning of the article. A second is that the gradients of the logit and sigmoid are simple to calculate (try it and see). This matters because many optimization and machine learning techniques make use of gradients, for example when estimating parameters for a neural network.
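If you do try it, the standard calculus results are *σ′(x) = σ(x)(1 – σ(x))* and *logit′(p) = 1 / (p(1 – p))*. The sketch below compares both closed forms against a central finite difference. The helper names and the step size `h` are assumptions chosen for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def sigmoid_grad(x: float) -> float:
    """Closed form: sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def logit_grad(p: float) -> float:
    """Closed form: logit'(p) = 1 / (p * (1 - p))."""
    return 1.0 / (p * (1.0 - p))


def central_diff(f, x: float, h: float = 1e-6) -> float:
    """Numerical derivative estimate, used only as a sanity check."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


for x in (-2.0, 0.0, 3.0):
    print(x, sigmoid_grad(x), central_diff(sigmoid, x))
for p in (0.1, 0.5, 0.9):
    print(p, logit_grad(p), central_diff(logit, p))
```

The sigmoid gradient needs nothing beyond the sigmoid value itself, which is part of why it is cheap to compute during training.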

The biggest drawback of the sigmoid function for many analytics practitioners is the so-called “vanishing gradient” problem. You can read more about this problem here (and here), but the point is that this problem pertains not only to the sigmoid function, but to __any__ function that squeezes real values to the [0, 1] range. In neural networks, where the vanishing gradient problem is particularly annoying, it is often a good idea to seek alternatives as suggested here.
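To get a feel for why the gradient vanishes, the following sketch prints the sigmoid’s derivative at a few inputs, along with an upper bound on the product of derivatives through a chain of sigmoid layers. It deliberately ignores the weights of a real network, so treat it as a rough illustration rather than a model of backpropagation. The helper `max_chain_gradient` is hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    s = sigmoid(x)
    return s * (1.0 - s)


def max_chain_gradient(n_layers: int) -> float:
    """Upper bound on a product of n sigmoid derivatives (weights ignored).

    Each derivative is at most 0.25, the value attained at x = 0.
    """
    return 0.25 ** n_layers


# The derivative peaks at x = 0 and shrinks quickly as |x| grows.
for x in (0.0, 5.0, 10.0):
    print(x, sigmoid_grad(x))

# Even in the best case, the bound shrinks geometrically with depth.
for n in (1, 5, 10, 20):
    print(n, max_chain_gradient(n))
```

Under this simplification, ten stacked sigmoids already scale the gradient by less than one millionth.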

Thank you. Makes many things clear.

Yeah, I don’t think 0.25/0.75 = 3 …. last time I checked my calculator

Thanks for pointing that out. Your calculator is correct.

“Given a probability p, the corresponding odds are calculated as p / (1 – p). For example if p=0.25, the odds are 3 to 1: 0.25/0.75 = 3”

*odds are 3 to 1 for p = .75

*for .25, odds are 1 to 3

Thanks for the article. It was helpful to see this all concisely and well described. I especially like the point about vanishing gradient, which makes sense.

Thanks for the constructive feedback! Much appreciated.