An Intuitive (and Short) Explanation of Bayes’ Theorem

Posted by kalid

Bayes’ theorem was the subject of a detailed article. The essay is good, but over 15,000 words long — here’s the condensed version for Bayesian newcomers like myself:

  • Tests are not the event. We have a cancer test, separate from the event of actually having cancer. We have a test for spam, separate from the event of actually having a spam message.

  • Tests are flawed. Tests detect things that don’t exist (false positive), and miss things that do exist (false negative).

  • Tests give us test probabilities, not the real probabilities. People often consider the test results directly, without considering the errors in the tests.

  • False positives skew results. Suppose you are searching for something really rare (1 in a million). Even with a good test, a positive result is most likely a false positive from one of the other 999,999 people.

  • People prefer natural numbers. Saying “100 in 10,000” rather than “1%” helps people work through the numbers with fewer errors, especially with multiple percentages (“Of those 100, 80 will test positive” rather than “80% of the 1% will test positive”).

  • Even science is a test. At a philosophical level, scientific experiments can be considered “potentially flawed tests” and need to be treated accordingly. There is a test for a chemical, or a phenomenon, and there is the event of the phenomenon itself. Our tests and measuring equipment have some inherent rate of error.

Bayes’ theorem finds the actual probability of an event from the results of your tests. For example, you can:

  • Correct for measurement errors. If you know the real probabilities and the chance of a false positive and false negative, you can correct for measurement errors.

  • Relate the actual probability to the measured test probability. Bayes’ theorem lets you relate Pr(A|X), the chance that an event A happened given the indicator X, and Pr(X|A), the chance the indicator X happened given that event A occurred. Given mammogram test results and known error rates, you can predict the actual chance of having cancer.

Anatomy of a Test

The article describes a cancer testing scenario:

  • 1% of women have breast cancer (and therefore 99% do not).
  • 80% of mammograms detect breast cancer when it is there (and therefore 20% miss it).
  • 9.6% of mammograms detect breast cancer when it’s not there (and therefore 90.4% correctly return a negative result).

Put in a table, the probabilities look like this:

                    Cancer (1%)    No Cancer (99%)
    Test positive       80%             9.6%
    Test negative       20%            90.4%

How do we read it?

  • 1% of people have cancer
  • If you already have cancer, you are in the first column. There’s an 80% chance you will test positive. There’s a 20% chance you will test negative.
  • If you don’t have cancer, you are in the second column. There’s a 9.6% chance you will test positive, and a 90.4% chance you will test negative.

How Accurate Is The Test?

Now suppose you get a positive test result. What are the chances you have cancer? 80%? 99%? 1%?

Here’s how I think about it:

  • Ok, we got a positive result. It means we’re somewhere in the top row of our table. Let’s not assume anything — it could be a true positive or a false positive.
  • The chances of a true positive = chance you have cancer * chance the test caught it = 1% * 80% = 0.008
  • The chances of a false positive = chance you don’t have cancer * chance the test caught it anyway = 99% * 9.6% = 0.09504

The table looks like this:

                    Cancer (1%)          No Cancer (99%)
    Test positive    0.008 (true +)       0.09504 (false +)
    Test negative    0.002 (false -)      0.89496 (true -)

And what was the question again? Oh yes: what’s the chance we really have cancer if we get a positive result? The chance of an event is the number of ways it could happen out of all possible outcomes:

Probability = desired event / all possibilities

The chance of getting a real, positive result is 0.008. The chance of getting any type of positive result is the chance of a true positive plus the chance of a false positive (0.008 + 0.09504 = 0.10304).

So, our chance of cancer is 0.008/0.10304 = 0.0776, or about 7.8%.
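The arithmetic above can be checked with a few lines of Python (my own sketch, not part of the original article):

```python
# Scenario rates from the article
p_cancer = 0.01               # 1% of women have breast cancer
p_pos_given_cancer = 0.80     # true positive rate (test catches cancer)
p_pos_given_healthy = 0.096   # false positive rate

true_pos = p_cancer * p_pos_given_cancer           # 0.008
false_pos = (1 - p_cancer) * p_pos_given_healthy   # 0.09504

p_cancer_given_pos = true_pos / (true_pos + false_pos)
print(round(p_cancer_given_pos, 4))  # 0.0776, about 7.8%
```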

Interesting — a positive mammogram only means you have a 7.8% chance of cancer, rather than 80% (the supposed accuracy of the test). It might seem strange at first but it makes sense: the test gives a false positive 10% of the time, so there will be a ton of false positives in any given population. There will be so many false positives, in fact, that most of the positive test results will be wrong.

Let’s test our intuition by drawing a conclusion from simply eyeballing the table. If you take 100 people, only 1 person will have cancer (1%), and they’re nearly guaranteed to test positive (80% chance). Of the 99 remaining people, about 10% will test positive, so we’ll get roughly 10 false positives. Considering all the positive tests, just 1 in 11 is correct, so there’s a 1/11 chance of having cancer given a positive test. The real number is 7.8% (closer to 1/13, computed above), but we found a reasonable estimate without a calculator.
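The eyeball estimate can also be sanity-checked by simulation. This sketch is mine (not from the article): draw a million random people with the stated rates and count what fraction of positive tests actually come from people with cancer.

```python
import random

random.seed(0)  # reproducible run

positives = 0       # anyone who tests positive
true_positives = 0  # positives who actually have cancer

for _ in range(1_000_000):
    has_cancer = random.random() < 0.01
    detect_rate = 0.80 if has_cancer else 0.096
    if random.random() < detect_rate:
        positives += 1
        true_positives += has_cancer

print(true_positives / positives)  # close to the 7.8% computed above
```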

Bayes’ Theorem

We can turn the process above into an equation, which is Bayes’ Theorem. It lets you take the test results and correct for the “skew” introduced by false positives. You get the real chance of having the event. Here’s the equation:

    Pr(A|X) = Pr(X|A) * Pr(A) / [Pr(X|A) * Pr(A) + Pr(X|~A) * Pr(~A)]

And here’s the decoder key to read it:

  • Pr(A|X) = Chance of having cancer (A) given a positive test (X). This is what we want to know: How likely is it to have cancer with a positive result? In our case it was 7.8%.
  • Pr(X|A) = Chance of a positive test (X) given that you had cancer (A). This is the chance of a true positive, 80% in our case.
  • Pr(A) = Chance of having cancer (1%).
  • Pr(~A) = Chance of not having cancer (99%).
  • Pr(X|~A) = Chance of a positive test (X) given that you didn’t have cancer (~A). This is a false positive, 9.6% in our case.

Try it with any numbers you like.
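The original page had an interactive calculator at this point. A small function can stand in for it; the name `bayes` and the example numbers below are my own:

```python
def bayes(p_a, p_x_given_a, p_x_given_not_a):
    """Pr(A|X): chance of event A given a positive indicator X."""
    true_pos = p_a * p_x_given_a
    false_pos = (1 - p_a) * p_x_given_not_a
    return true_pos / (true_pos + false_pos)

# The article's mammogram numbers:
print(bayes(0.01, 0.80, 0.096))       # ~0.0776

# Something truly rare (1 in a million) swamps even a 99%-accurate test:
print(bayes(0.000001, 0.99, 0.01))    # ~0.0001: almost every positive is false
```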

It all comes down to the chance of a true positive result divided by the chance of any positive result. We can simplify the equation to:

    Pr(A|X) = Pr(X|A) * Pr(A) / Pr(X)

Pr(X) is a normalizing constant and helps scale our equation. Without it, we might think that a positive test result gives us an 80% chance of having cancer.

Pr(X) tells us the chance of getting any positive result, whether it’s a real positive in the cancer population (1%) or a false positive in the non-cancer population (99%). It’s a bit like a weighted average, and helps us compare against the overall chance of a positive result.

In our case, Pr(X) gets really large because of the potential for false positives. Thank you, normalizing constant, for setting us straight! This is the part many of us may neglect, which makes the result of 7.8% counter-intuitive.

Intuitive Understanding: Shine The Light

The article mentions an intuitive understanding about shining a light through your real population and getting a test population. The analogy makes sense, but it takes a few thousand words to get there :).

Consider a real population. You run a test, which “shines light” through that real population and creates some test results. If the light is completely accurate, the test probabilities and real probabilities match up. Everyone who tests positive is actually “positive”. Everyone who tests negative is actually “negative”.

But this is the real world. Tests go wrong. Sometimes people who have cancer don’t show up in the tests, and vice versa.

Bayes’ Theorem lets us look at the skewed test results and correct for errors, recreating the original population and finding the real chance of a true positive result.

Bayesian Spam Filtering

One clever application of Bayes’ Theorem is in spam filtering. We have

  • Event A: The message is spam.
  • Test X: The message contains certain words (X).

Plugged into a more readable formula (from Wikipedia):

    Pr(spam|words) = Pr(words|spam) * Pr(spam) / [Pr(words|spam) * Pr(spam) + Pr(words|not spam) * Pr(not spam)]

Bayesian filtering allows us to predict the chance a message is really spam given the “test results” (the presence of certain words). Clearly, words like “viagra” have a higher chance of appearing in spam messages than in normal ones.

Spam filtering based on a blacklist is flawed: it’s too restrictive, and the false-positive rate is too high. Bayesian filtering gives us a middle ground because we use probabilities. As we analyze the words in a message, we can compute the chance it is spam (rather than making a yes/no decision). If a message has a 99.9% chance of being spam, it probably is. As the filter is trained on more and more messages, it updates the probabilities that certain words appear in spam. Advanced Bayesian filters can examine multiple words in a row as another data point.
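As an illustration only, here is a minimal sketch of per-word spam scoring. The function name, counts, and crude smoothing are all my own inventions; real filters are considerably more careful:

```python
def spam_probability(word, spam_counts, ham_counts, n_spam, n_ham):
    """Pr(spam | word) via Bayes, using training message counts."""
    p_spam = n_spam / (n_spam + n_ham)               # prior Pr(spam)
    p_ham = 1 - p_spam
    # Crude smoothing: pretend unseen words appeared once
    p_word_spam = spam_counts.get(word, 1) / n_spam  # Pr(word | spam)
    p_word_ham = ham_counts.get(word, 1) / n_ham     # Pr(word | not spam)
    numerator = p_word_spam * p_spam
    return numerator / (numerator + p_word_ham * p_ham)

# Toy training data: how often each word appeared in 100 spam
# and 200 legitimate ("ham") messages. Numbers are invented.
spam_counts = {"viagra": 90, "hello": 10}
ham_counts = {"viagra": 1, "hello": 120}

print(spam_probability("viagra", spam_counts, ham_counts, 100, 200))  # ~0.99
print(spam_probability("hello", spam_counts, ham_counts, 100, 200))   # well under 0.5
```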

Further Reading

There’s a lot being said about Bayes:

  • Bayes’ Theorem on Wikipedia
  • Discussion on Coding Horror
  • The big essay on Bayes’ Theorem

Have fun!

Other Posts In This Series

  1. A Brief Introduction to Probability & Statistics
  2. How To Analyze Data Using the Average
  3. An Intuitive (and Short) Explanation of Bayes' Theorem
  4. Understanding Bayes Theorem With Ratios
  5. Understanding the Birthday Paradox
  6. Understanding the Monty Hall Problem



144 comments

  1. Gavrilo Princep says:

    you have a typo, in …

    9.6% of mammograms miss breast cancer when it is there (and therefore 90.4% say it is there when it isn’t).

    … you meant to say something like:

    9.6% of mammograms incorrectly indicate breast cancer when it isn’t there, and the other 90.4% correctly say it is not there when, well, it is not there.

  2. Kalid says:

    Thanks Gavrilo — I just fixed it.

  3. Amal says:

    Hey, here’s an interesting bayes problem i came across first in a book (The curious incident of the dog in the night time).

    Suppose you are in a game show. You are given the choice of three doors – one of which conceals a valuable prize and the
    others conceal a goat.

    After you make a choice, the host opens one of the other doors (–one without a prize).

    He then gives you the option of staying with the initial choice of door or switching to the other door. The door finally chosen is then opened.

    Should you switch, not switch, or does it make no difference what the contestant does?
    _____________________________________________

    ANSWER
    By Bayes’ theorem you can see that if you switch you’d have a 2:1 advantage.

  4. Kalid says:

    Hi Amal, thanks for dropping by. Yes, I like that question too, it was presented to us as “The Monty Hall” problem when studying computer science.

    It’s pretty amazing how counter-intuitive the results can be — switching your choice after you’ve picked “shouldn’t” change your chances, right? I plan on writing about this paradox, too.

  5. Ed says:

    Oddly useful! I’ve been reading Bayes explanations for a while, and this one really hit home for me for some reason.

    One thing that you might consider adding (something I’ve never seen) is a pie-chart visualization of what’s going on. Basically, you have a pie of 100% of people. 1% of that pie has cancer, so that’s a tiny slice. The test will produce a positive for 80% of that 1% slice + 9.6% of the remaining 99% slice– you can imagine that as a little blue translucent piece of appropriate size that covers most of the 1% slice and a chunk of the 99% slice. From that mental image, it’s obvious what’s going on– there’s a lot more blue on the 99% than on the 1%. Might be too complicated, but hey. Anyways, thanks.

  6. Kalid says:

    Hey Ed, thanks for the comment. I agree — some type of chart may make the relationship that much clearer. Appreciate the suggestion, I’ll put one together.

  7. Lee says:

    Bayes theorem can also be thought of as

    True Positives
    ---------------------------------
    True Positives + False Positives

    So a large number of false positives reduces the accuracy of the test because the denominator increases.

  8. Kalid says:

    Thanks Lee! That’s a great way to put it.

  9. Randy says:

    About Monty Hall- the Bayes application to this seems very forced. The Monty Hall problem is a simple probability problem, or it can be viewed as a partitioning problem. See:
    randy.strausses.net/tech/montyhall.htm

    Using Bayes for this makes it needlessly complex, not “betterExplained”.

    Similarly, the article above is needlessly complex- nuke the first equation and leave the simpler one. You just pulled it out of thin air anyway- it doesn’t help anyone.

    The usual diagram, given in HS stats classes, is a rectangle, with A, ~A on the top, B, ~B on the side. Say A is .9 and B is .2. The area of the small quadrant (.02), is the probability of A and B both happening. This area can be also viewed as P(A|B)*P(B) or P(B|A)*P(A). You have to explain why, but it’s pretty evident from the diagram. Then just equate these two and divide by P(B) and you have the simpler equation.

  10. Kalid says:

    Hi Randy, Bayes may be overkill for the Monty Hall problem, but it’s interesting to see that it can apply there as well.

    Yes, the diagram you mention may be a helpful addition to the discussion above, appreciate the feedback.

  11. numerodix says:

    Just wanna say thank you for writing this. I know about the original article and I tried reading it but somewhere along the way I got lost and couldn’t follow it.

  12. Kalid says:

    Hi numerodix, you’re welcome — I found the original article interesting but a bit long as well, so I decided to summarize it here.

  13. Matteo says:

    Hello, I just came upon this site and I’m finding it beautiful. I think I spotted an error in this article, though.
    When you say:

    “Of those 100, 80 will test positive” rather than “80% of the 1% will test positive”,

    you probably wanted to say: “rather than ‘80% of the 100% will test positive'”.

  14. Kalid says:

    Hi Matteo, thanks for the comment. The statement actually refers to the original 1%, so it’s a way of expressing compound percentages (80% of 1% vs. 80 out of 10,000).

  15. Anonymous says:

    wow thank you so much for this, you really did a good job explaining it, i have my AP statistics exam today at noon so this might save me

  16. John D Stackpole says:

    Randy, back on Nov 7 2007, suggested using overlapping rectangles – Venn diagrams – to help clarify the Rev. Bayes. In their book “Chances Are…” (Viking Penguin, 2006), Kaplan & Kaplan did so on pp. 184 ff. Indeed it does help.

  17. Anonymous says:

    you. are. the. best.

  18. patty says:

    oh my god this is the dogs bollocks for my molecular phylogenetics revision!

  19. Kalid says:

    @Anonymous: Thanks!

    @John: Appreciate the reference. Another explanation with a venn diagram: blog.oscarbonilla.com/2009/05/visualizing-bayes-theorem/

    @Anonymous: Thank you!

    @Patty: Glad it helps

  20. Dan Weisberg says:

    This is one of the best explanations I’ve found. Perhaps we can see if I really understand it by trying a real world problem I’m wrestling with.

    Here’s the data:
    – The odds of a chest pain (CP) being caused by a heart attack is 40%.
    – The odds of a CP being caused by other factors (anxiety, depression, etc.) is 60%.
    – The odds of a heart attack occurring to a female above age 50 is 80%.
    – The odds of a heart attack occurring to a female under age 50 is 20%.

    I am presented with a 24 year old female who says she is having chest pain. What is the probability that her chest pain is caused by a heart attack? Is it 0.4 x 0.2 = 0.08?

    Also, 78% of patients having heart attacks present with diaphoresis (sweating), so 22% of patients having heart attacks don’t sweat. This female is not sweating, so are the odds of her having a heart attack 0.22 x 0.08 = 0.0176?

    Thank you!

  21. Emily Riley says:

    Thanks for writing this!! Even my stats prof was making this too difficult for everyone, but you have simplified it for me. I now have an understanding of the Bayes formula (enough to write my midterm this morning).

  22. AYUSH says:

    thanks! finally got the concept behind bayes rule

  23. Kalid says:

    @Ayush: Glad it helped!

  24. Andrew says:

    Could you work out an example of an email with *two* words, say ‘Viagra’ and ‘hello’?

  25. Aditya says:

    I didn’t get that; Bayes’ theorem is still a tilted pot for me, but thanks for helping!

  26. Asim says:

    What would happen if we had to consider another prior probability? Let’s say the doctor looked at the symptoms of the patient and guessed that he has a 60 percent chance of having cancer. The doctor sends him for the test and the test shows a positive result. How would we incorporate that 60 percent chance of having cancer, based on the patient’s symptoms, into the Bayesian equation?

  27. Francis says:

    hi
    thank you for this article. The first time I came across Bayes’ Theorem in a business statistics book it was not clear at all. Now it makes more sense to me and it’s pretty clear.

  28. Kalid says:

    @Francis: You’re welcome, glad it helped. I understand it better now too, but there’s still more to go before it’s completely intuitive for me :). I’d like to do a follow-up to this focusing on using the probabilities it predicts.

  29. kr says:

    On the cancer example, it’s interesting to see that a negative test is really significant. That is, if the test says you don’t have cancer, then probability of not having cancer is 99.78% ! So, the value of mammogram is that the healthcare