If you’ve clicked on this, you probably know nothing about sigmoids but have strong feelings about hotdogs. That’s okay, don’t worry, you’re in a safe space.
And if you’ve never even seen a sigmoid \(\sigma\) function, know that it’s a marvel. A real beauty. It’s elegant in a quiet, understated way. It stretches calmly from one infinity to the other, gently guiding you from 0 to 1 like a patient parent. It’s smooth, dependable, it’s exactly what one hopes for in a world full of uncertainty. You’ll instantly know one when you see one¹.
This \(\sigma\) function is widely used in machine learning, especially when solving binary classification problems, like the legendary Jian-Yang’s SeeFood app in the TV series Silicon Valley:
If an input image is either “hotdog” or “not hotdog”, that makes it a binary classification problem. And if Jian-Yang used a neural network for his app (he definitely did), there’s a good chance the last layer had a single neuron using the sigmoid function.
But… why?
The reason the literature recommends the sigmoid function for such problems is something that I used to accept as “intelligent people say so and I just need it to work”, until I actually looked at the maths behind it. It turned out to be a fun and interesting journey, and clearly not as complicated as I thought it would be. So, let’s dive in.
The hotdog hypothesis
Let’s imagine we have a bunch of food images (our dataset), and we want to train a model to tell whether any given one shows a hotdog.
One common way to tackle this problem is to use a linear classifier to model the relationship between the features of the food images and the hotdog class. What we’re trying to find is a line (more precisely, a hyperplane), a decision boundary, that separates our two classes, “hotdog” and “not hotdog”, as cleanly as possible. That is, if such a separation is possible at all.
If we have good data and have identified distinctive features (for instance, let’s say color and shape), we should see two well-separated groups of points. Now, just a side note here: in modern machine learning, features are usually discovered by the model during training rather than explicitly chosen. But that changes nothing about the issue at hand.
Because here’s the catch: such a linear separation isn’t always possible. For instance, some hotdogs might be oddly shaped, or maybe the photo of the hotdog was taken under some colored lighting. The result is that instead of having a clear decision boundary between the two classes, we get a mess of a decision space.
Here we can see clean input data with distinctive features, which is easily linearly separable:

Fig 1 - Clean data, easy linear separation
And now, here’s messy data. No matter how we try to draw a line to separate the two classes, we’ll always end up misclassifying some points:

Fig 2 - Messy data, no linear separation
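If you want to see this for yourself, here’s a minimal sketch of the same situation, assuming scikit-learn and two made-up 2D features standing in for color and shape (these are not the post’s actual figures, just a toy reconstruction): a plain linear classifier nails the clean blobs and inevitably misclassifies some points in the messy ones.

```python
# A rough sketch of Fig 1 vs Fig 2, assuming scikit-learn is available.
# Two made-up 2D features (say, "color" and "shape"), two classes, and a
# plain linear classifier trained on each dataset.
from sklearn.datasets import make_blobs
from sklearn.linear_model import Perceptron

def linear_accuracy(cluster_std):
    # Two blobs: class 1 = "hotdog", class 0 = "not hotdog".
    # The bigger cluster_std is, the more the two classes overlap.
    X, y = make_blobs(n_samples=400, centers=2,
                      cluster_std=cluster_std, random_state=0)
    clf = Perceptron(random_state=0).fit(X, y)
    return clf.score(X, y)

print("clean data:", linear_accuracy(cluster_std=0.8))  # close to 1.0: separable
print("messy data:", linear_accuracy(cluster_std=4.0))  # below 1.0: some points always misclassified
```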
Mathematicians are nature’s way of solving a problem by creating a new one
To live a good life, we usually need a good doctor, a good lawyer and a good pastor. Sometimes, it also happens that we need mathematicians. They can be very practical people – granted, only when it comes to math.
The way mathematicians approached this problem is by reframing it into something they could solve. They thought: if we cannot separate the data cleanly, then let’s embrace the uncertainty. What if, instead of forcing a hard yes or no answer, we built a model that could quantify a maybe? Something like:
Given these color and shape features extracted from the image, and the dataset I’ve been trained on, there’s a 73% chance that what you’re showing me is a hotdog. It might still not be a hotdog, but that’s unlikely.
We’re not committing to a binary decision anymore: we’re assigning a probability. That’s way more flexible, especially when the data is messy, and we haven’t identified all the features that would separate everything cleanly. We accept that there are things we don’t formally know, and the world would be a better place if more people embraced that in online discourse. If mathematicians can do it with hotdogs, you can do it with public policy.
Now, uncertainty’s a great twist, but it also means we have a new problem to solve. We’d love to find a function that takes some food image features as input and outputs a probability. Ideally, we’d also love this function to be linear, because linear models (like logistic regression) are so simple and so efficient to train that it would make everything easier.
However, in the case of image classification, such a linear function is a wild dream. Realistically, I’m sorry to tell you, the best we can hope to build, if we want something that actually works, is a model that takes those image features, processes them in some deep and nonlinear way (like a ConvNet), and spits out a single real number.
But in both cases we need to find a clever way to transform that real number into a probability. So for the sake of simplicity, let’s go with the linear way.
Bending the probability of a hotdog space to our will
The problem we face is that in mathematical terms, a probability is a number between 0 and 1, that is, a number in the interval \([0,1]\). However, a linear function can output any real number ranging from negative infinity to positive infinity. In other words, such a function would span the interval \((-\infty,+\infty)\).
To connect the output of a linear model to a probability, we first need a way to map a probability onto the entire real number line, so that every probability corresponds to exactly one real number, and vice versa.
So, let’s get going. We start with a probability \(p\) of an image being a hotdog, keeping \(p\) strictly between 0 and 1 (a model that is exactly 0% or 100% sure wouldn’t be much of a probabilistic model anyway). One straightforward way to stretch this \((0,1)\) interval to the \((0, +\infty)\) range is by transforming this probability into odds. Odds basically tell us how much more likely an event is to happen than not to happen.
$$ \text{odds} = \frac{p}{1 - p} $$
For instance, if an image has \(p=.75\) (i.e. a 75% chance) of being a hotdog, it has a \(1-p=.25\) chance of not being a hotdog, so the odds are \(.75/.25 = 3\). This image is 3 times more likely to be a hotdog than not. As the probability approaches 100%, the odds approach infinity.
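In code, the stretch is almost embarrassingly small (a trivial sketch, nothing library-specific):

```python
# Probability -> odds: (0, 1) gets stretched onto (0, +inf).
def odds(p):
    return p / (1 - p)

print(odds(0.75))  # 3.0  -> three times more likely to be a hotdog than not
print(odds(0.5))   # 1.0  -> even odds, a coin flip
print(odds(0.99))  # ~99  -> odds blow up as p creeps toward 1
```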
We’re nearly there: we’ve just stretched our probabilities from \((0,1)\) to \((0, +\infty)\). Now, we just need to stretch that again to \((-\infty, +\infty)\), the whole real line. And look at this nifty function: the logarithm.

Fig 3 - What a good, useful function
It is defined on \((0, +\infty)\), outputs values spanning \((-\infty,+\infty)\), and is bijective. It’s exactly what we needed.
If we use it on our odds, we obtain what’s called the logit function:
$$ \text{logit}(p) = \log\left(\text{odds}\right) = \log\left(\frac{p}{1 - p}\right) $$
This function is perfect for linear models. It’s continuous, differentiable and, most importantly, bijective. So every value maps cleanly from the hotdog probability space to real space and back again. And because we’re using the \(\log\) function, we’re also inheriting some nice properties, like \(\log(ab) = \log(a) + \log(b)\), which makes working with multiplicative relationships simpler. That turns out to be massively useful for logistic regression, Bayesian inference…
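Here’s a tiny sketch of that mapping (plain NumPy, nothing fancy), just to watch probabilities land on the whole real line:

```python
# logit: probabilities in (0, 1) mapped onto the whole real line.
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

for p in [0.01, 0.25, 0.5, 0.75, 0.99]:
    print(f"p = {p:.2f}  ->  logit(p) = {logit(p):+.2f}")
# p = 0.5 lands exactly on 0, and symmetric probabilities (0.25 / 0.75,
# 0.01 / 0.99) land on symmetric real numbers.
```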
And, for the more nerdy, here’s where the Maximum Likelihood Estimation (MLE) you might have heard of sneaks into the story. Since we’re now predicting probabilities instead of hard labels, we can model the data using a Bernoulli distribution – which, for the less nerdy who are lost in this paragraph, is just fancy-speak for: “either it’s a hotdog or it’s not”. MLE gives us a way to train the model by finding the parameters that maximize the likelihood of seeing our actual “hotdog” / “not hotdog” labels, given the model’s predicted probabilities. Spoiler: this leads directly to the classic binary cross-entropy loss used in training neural networks.
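For the record, here’s the one-line version of that spoiler. Write the Bernoulli likelihood of the whole dataset, take the negative log, and the product turns into a sum – and that sum is exactly binary cross-entropy:

$$ \mathcal{L} = \prod_{i} p_i^{\,y_i}\,(1 - p_i)^{\,1 - y_i} \quad\Longrightarrow\quad -\log \mathcal{L} = -\sum_{i} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right] $$

where \(y_i \in \{0,1\}\) is the true “hotdog” / “not hotdog” label of image \(i\) and \(p_i\) is the model’s predicted probability. Maximizing the likelihood is the same as minimizing that right-hand side.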
So all of this machinery, the logits, the sigmoid, MLE… is not just a beautiful accident. As Dirk Gently would say, this is all part of the “fundamental interconnectedness of all things” – from logits and probabilities to hotdogs and neural nets.
Now, if logit maps a probability to a real number, its inverse maps a real number back to a probability. And that inverse, ladies and gentlemen, is our hero: the sigmoid \(\sigma\) function.
$$ \sigma(x) = \frac{1}{1 + e^{-x}} \quad \text{and} \quad \sigma(\text{logit}(p)) = p $$
It takes any real number (from our linear model) and squashes it into the \((0,1)\) range. So now, our model can output numbers like -4.3 or 1.9, and sigmoid will turn them into nice, clean probabilities.
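Here’s what that squashing looks like, with those same made-up numbers (a plain NumPy sketch again):

```python
# Sigmoid squashes any real number into (0, 1), and it undoes logit.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1 - p))

print(sigmoid(-4.3))         # ~0.013 -> almost certainly not a hotdog
print(sigmoid(1.9))          # ~0.87  -> very probably a hotdog
print(sigmoid(logit(0.73)))  # ~0.73  -> the round trip sigmoid(logit(p)) = p
```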

Fig 4 - A beautiful logistic regression on our dataset, with a dashed line marking the p=0.5 decision threshold
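This figure isn’t reproduced by the snippet below, but here’s roughly what fitting and thresholding looks like with scikit-learn, reusing the made-up blobs from earlier: the model predicts probabilities first, and the dashed p=0.5 line is just a decision rule applied on top.

```python
# Logistic regression on the messy toy data: probabilities first, decision second.
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=400, centers=2, cluster_std=4.0, random_state=0)
clf = LogisticRegression().fit(X, y)

probs = clf.predict_proba(X)[:, 1]   # P(hotdog) for every image
labels = (probs >= 0.5).astype(int)  # the dashed line: threshold at p = 0.5
print(probs[:5], labels[:5])
```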
What just happened?
It’s easy to get lost, so let’s take a step back and look at what we’ve accomplished. We started with a classification problem that wasn’t linearly separable, “Is it a hotdog?”, where we expected a hard yes or no. But instead of forcing a clean split, we reframed the problem as a probabilistic one: “How confident are we that it’s a hotdog?”. That shift changed everything.
To make this work with linear models, we needed a trick. We stretched probabilities (which live between 0 and 1) into real-number space using the logit function. Then we brought those real numbers back to probabilities using sigmoid. That’s how we turned a messy, uncertain world into something a bit more manageable – mathematically speaking. And as we train our model to output logits, raw scores from a linear function, we’ll use sigmoid to translate those into probabilities. And just like that, we’ve built a system that can say, with mathematical confidence, how likely it is that something is a hotdog.
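And if, like Jian-Yang, you’d rather end the pipeline with a neural network than with plain logistic regression, the ending looks the same. Here’s a minimal sketch, assuming PyTorch, with a placeholder feature extractor standing in for a real ConvNet and arbitrary layer sizes: one output neuron producing a raw logit, binary cross-entropy as the loss, and sigmoid to read the result off as a probability.

```python
# A minimal sketch of the "single neuron + sigmoid" ending of a binary
# classifier. The feature extractor below is a stand-in for a real ConvNet;
# the point is the single raw logit per image and the binary cross-entropy loss.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),                  # placeholder for a real ConvNet backbone
    nn.Linear(3 * 32 * 32, 64),    # arbitrary sizes, for illustration only
    nn.ReLU(),
    nn.Linear(64, 1),              # one neuron -> one raw logit per image
)

loss_fn = nn.BCEWithLogitsLoss()   # applies sigmoid internally, then cross-entropy

images = torch.randn(8, 3, 32, 32)            # a fake batch of "food photos"
labels = torch.randint(0, 2, (8, 1)).float()  # 1 = hotdog, 0 = not hotdog

logits = model(images)             # raw real numbers, anywhere on the real line
loss = loss_fn(logits, labels)     # what training would minimize
probs = torch.sigmoid(logits)      # squashed into (0, 1): P(hotdog) per image
```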
But here’s what I found kind of poetic in this journey – maybe it’s just me. Sigmoid wasn’t the answer from the start. Instead, we engineered the question until sigmoid became the only answer that made sense. Isn’t that neat?

Fig 5 - Our model predicts this is a hotdog with p=0.87
---

¹ You can click now and see a sigmoid function on Desmos. Or wait 10 minutes and get two. Delayed gratification, anyone?