Introduction to Bayesian Statistics
- 1 The difference between a Bayesian and a frequentist
- 2 Terminology
- 3 Probability
- 4 What is probability?
The difference between a Bayesian and a frequentist
Bayesian statistics assumes a fundamentally different model of the universe from that of frequentist statistics. This difference can sometimes divide the two “camps” of statistics along philosophical lines. Some problems are more naturally suited to one approach or the other, and some can be viewed from either camp.
Bayesian statistics takes its name from Thomas Bayes, an English Presbyterian minister and amateur mathematician. This is why Bayesian is capitalized.
Philosophy of Bayesian statistics
The basic Bayesian philosophy can be summarized as follows: Everyone has information, or opinions, about the world. When we encounter new information, we take that information and combine it with what we already know to create new opinions. The goal of Bayesian statistics is to do this in a coherent way. Whereas a frequentist assumes that there is an "exact truth" out there, which can only be measured with measurement error, the Bayesian regards measurements as exact and the "underlying universe" as the thing subject to uncertainty.
What do we mean by coherency? The basic idea is that nobody should be able to make a Dutch book against you: it should not be possible to get you to agree to a set of bets where you lose no matter what happens. This seems like a very simple principle, but it lies at the foundations of modern statistics.
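As a rough sketch of what a Dutch book looks like, suppose someone quotes a price for a ticket on "rain" and a price for a ticket on "no rain", where each ticket pays $1 if its outcome occurs. The prices (0.6 and 0.6, then 0.6 and 0.4) and the function name `bettor_net` are illustrative assumptions, not part of the text above.

```python
from fractions import Fraction

def bettor_net(price_rain, price_dry, it_rains, payout=1):
    """Net result for someone who buys one ticket on each outcome.

    A ticket costs price * payout and returns `payout` if its outcome occurs.
    """
    cost = (price_rain + price_dry) * payout
    rain_ticket = payout if it_rains else 0
    dry_ticket = 0 if it_rains else payout
    return rain_ticket + dry_ticket - cost

# Incoherent prices: they add up to more than 1, so the buyer loses either way.
incoherent = [bettor_net(Fraction(6, 10), Fraction(6, 10), rains)
              for rains in (True, False)]

# Coherent prices: they add up to exactly 1, so there is no guaranteed loss.
coherent = [bettor_net(Fraction(6, 10), Fraction(4, 10), rains)
            for rains in (True, False)]

print(incoherent)
print(coherent)
```

Exact fractions are used here only to avoid floating-point rounding in the comparison; the point is that prices which do not behave like probabilities can be combined into a sure loss.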
This philosophy has some interesting implications: most fundamental of all, my opinions may not be the same as your opinions. If we are sitting in a room together, you might think there is a 40% chance that it will rain tomorrow, and I might think there is a 20% chance. There are many reasons we might disagree: for example, you might have a bad knee that hurts before it rains, and your knee is hurting. Our different opinions may also affect how we react to new information. If we turn on the weather report, and the weatherman says there will be brief showers, you may say there is a 90% chance it will rain, but I may only say there is a 30% chance, because I distrust weathermen.
This might sound like there is no way Bayesians can ever agree on anything, and we might as well all go home now. However, the principle of coherency has an interesting implication: if we can manage to agree on our current opinions and on how a new piece of information bears on them, and we then observe that same piece of information, our updated opinions should agree.
How does this compare to the frequentist philosophy?
The frequentist philosophy says that the probability of something should be equal to the long-run frequency with which it happens. For example, if I have a fair coin, and I could flip my fair coin exactly the same way a very large number of times, then I should get heads approximately half the time, and the larger the number of flips, the closer the proportion should be to a half. The study of probability and statistics started in gambling, in which games are (theoretically) played over and over under very similar conditions, and the idea of fairness is paramount: this is one of the motivations for the frequentist philosophy. Under the frequentist philosophy, all events have a true and fixed probability of occurring.
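The long-run idea can be illustrated with a small simulation. This is only a sketch: the function name `heads_proportion`, the seed, and the numbers of flips are arbitrary choices made for the illustration.

```python
import random

def heads_proportion(n_flips, rng):
    """Proportion of heads in n_flips simulated tosses of a fair coin."""
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

rng = random.Random(0)  # fixed seed, chosen arbitrarily, so runs are repeatable
for n in (10, 1_000, 100_000):
    # The proportion typically settles closer to 0.5 as n grows.
    print(n, heads_proportion(n, rng))
```

Any single run can still stray from one half, especially for small numbers of flips; the frequentist claim is about what happens as the number of flips grows without bound.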
This is fundamentally different from the Bayesian viewpoint, in which a probability is viewed as a subjective phenomenon, according to the information available at the time.
Terminology
Bayesian statistics is full of strange terms that may not make sense at first. Bayesians talk a lot about priors, posteriors, and likelihoods. These are all very simple terms that we can define in terms of the above example.
- A prior is the opinion you hold before you observe a new piece of information. In the above example, this would be our opinions of how likely it is to rain. In addition, we should figure out how strongly we hold that opinion: this is often called the prior information.
- A likelihood is a mathematical statement of how probable the information we observe is, given particular values of the unknown quantities in our model. This is a complex and delicate thing, and is one of the reasons Bayesian statistics is usually studied after frequentist statistics, where likelihoods are more natural creatures to study.
- A posterior is the updated opinion you hold after you get a new piece of information. Because there is additional information coming in (in our example, the weatherman's report) we should have more information than we started with. This is a natural concept.
The complex part of the above has to do with the likelihood. What, you might ask, is a likelihood? We can think of a likelihood as being like a statement about long-run odds, but it is really a statement about the mathematical properties of any information we get about what we're interested in.
In our example, the thing in question (the piece of information we've gotten) is the weatherman's report. In reality, a weatherman might say any number of things, but suppose we simplify it down to predicting rain, or predicting sunshine. This makes it a yes-no question. Yes-no questions are often dealt with by using the Bernoulli likelihood, in which something either ‘succeeds’ or ‘fails’. We could agree, for example, that we will call it a ‘success’ if the weatherman predicts rain, and a ‘failure’ if he doesn't (it doesn't matter which one we pick to be a success: the maths comes out the same).
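A Bernoulli likelihood can be written in a few lines. In this sketch the name `bernoulli_likelihood` and the symbol `theta` for the probability of a ‘success’ are assumptions introduced for illustration, and outcomes are coded as 1 for success and 0 for failure.

```python
def bernoulli_likelihood(theta, outcome):
    """Probability of one yes/no outcome when 'success' has probability theta."""
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must be between 0 and 1")
    if outcome not in (0, 1):
        raise ValueError("outcome must be 1 (success) or 0 (failure)")
    return theta if outcome == 1 else 1.0 - theta

# Relabelling which outcome counts as a 'success' gives the same number:
# a success under theta = 0.25 is a failure under theta = 0.75.
print(bernoulli_likelihood(0.25, 1), bernoulli_likelihood(0.75, 0))
```

The last line is a small check of the remark that it does not matter which outcome is called the success.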
Most likelihoods that people actually work with are extremely simple versions of the world. In reality, weathermen usually tell us a lot more information, like showing us forecast maps, giving us an idea of the probability that it will rain, and telling us something about the current conditions that lead them to believe what they're saying. But here, we've simplified it down to a yes or no question: Does the weatherman predict rain, or sun?
One term that comes up a lot in dealing with both priors and likelihoods is parameters. A parameter is any unknown quantity in a mathematical model of something. Here we have one obvious unknown quantity: the probability that it will rain.
Priors are also mathematical models. Rather than being mathematical models of something we observe, however, they are mathematical models about something we think.
While priors are more intuitive as an idea, they are usually less intuitive to write down. The difference is that while a likelihood asks us to come up with a mathematical model for whatever it is we're talking about (in this case, whether or not it will be raining), priors ask us to come up with a mathematical model for the probability of whatever parameters we used in the likelihood.
When stated that way, this seems simple: you have a model for what is going to happen, and now you have to talk about the probability of things that drive that model. In practice, however, this gets complicated fast. What is the probability that the probability of rain tomorrow is 30%? The process of figuring out what your prior actually is is called eliciting a prior, and is an area of research in and of itself.
Eliciting a prior can be simplified sometimes because many likelihoods have a mathematical form that suggests a mathematical form for the prior. For example, if we use a Bernoulli likelihood to talk about whether or not it will be raining, there is a natural choice of prior: the Beta distribution. This is because when we go to update our prior to produce the posterior, we get another Beta distribution back out.
When this happens, we call the prior a conjugate prior for that likelihood. Conjugate priors are often used because they are easy to deal with, and make it easy to add more information of the same type later on. In our example, if we saw a second weather report, we would have yet another updated Beta distribution.
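The conjugate update for a Beta prior with a Bernoulli likelihood amounts to adding one to a count. The sketch below assumes the standard Beta(alpha, beta) parameterization; the starting values (2, 8) and the function names are illustrative choices, not something specified in the text.

```python
from fractions import Fraction

def update_beta(alpha, beta, outcome):
    """Posterior Beta parameters after one Bernoulli observation.

    outcome is 1 for a 'success' and 0 for a 'failure'.
    """
    if outcome == 1:
        return alpha + 1, beta
    return alpha, beta + 1

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return Fraction(alpha, alpha + beta)

prior = (2, 8)                                        # illustrative prior, mean 1/5
after_first = update_beta(*prior, outcome=1)          # first report: a success
after_second = update_beta(*after_first, outcome=1)   # second report: a success

for name, params in [("prior", prior),
                     ("after first", after_first),
                     ("after second", after_second)]:
    print(name, params, beta_mean(*params))
```

Each update returns another pair of Beta parameters, which is why adding a second observation of the same type needs no new machinery.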
What is probability?
To a frequentist, probability is the long-run chances of something happening. For example, if I flip a fair coin twice, I might get two heads, or two tails, or one head and one tail. If I flip the coin twice another time, I might get something different. If I could repeat this experiment an infinite number of times, I should get two heads 25% of the time. This is the "true chance" of getting two heads in two flips of a fair coin.
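The 25% figure can be checked by listing the equally likely outcomes of two flips of a fair coin. This is a minimal sketch; the variable names are made up for the illustration.

```python
from fractions import Fraction
from itertools import product

# All ordered outcomes of two flips: HH, HT, TH, TT (equally likely for a fair coin).
outcomes = list(product("HT", repeat=2))

p_two_heads = Fraction(sum(o == ("H", "H") for o in outcomes), len(outcomes))
p_one_of_each = Fraction(sum(set(o) == {"H", "T"} for o in outcomes), len(outcomes))

print(p_two_heads)    # 1/4
print(p_one_of_each)  # 1/2
```

Note that "one head and one tail" covers two of the four ordered outcomes, which is why it is twice as probable as two heads.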
To a Bayesian, however, probability is subjective. We say that your personal probability for something happening is equal to the point at which you'd take a bet either way. To understand this a little better, let's take a moment to talk about odds.
Odds versus probability versus chance
Lots of books and lots of people use "chance" to mean something like "things that happen at random" or "the chances of" to mean something like "probability". I did it myself two paragraphs above. People do this because "chance" is not a very scary word, unlike "probability" or even worse "stochastic". However, it is not very specific. From here on out, we will try to avoid this word when we have a better one to use. This means we need to define a couple of words that are more specific.
In a betting context, odds are probably the easiest thing to understand, but the hardest to understand outside that context! Imagine that you have a very large urn filled with some colored marbles (and pretend all the marbles are clearly colored: there's no multicolored or in-between shades). Maybe they are red or blue marbles. The odds of drawing a red marble are simple: the number of red marbles to the number of not-red (blue) marbles. If we call the number of red marbles R and the number of blue marbles B, then the odds in favor of red are R to B, or R:B. The odds against red are B:R, read as B to R. We can see that odds are symmetric in this way: the odds against red are the same as the odds for blue, if one of the two has to happen.
Probability is related to the odds. The probability of drawing a red marble is just the number of red marbles over the total number of marbles. If we think about it as a frequentist, this makes sense: if I draw a marble, then put it back and mix up the marbles really well, then repeat this infinitely often, in the long run, I should end up with a red marble R times out of R + B. So the probability of red is R / (R+B).
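The conversion between odds and probability can be sketched as follows. The function names and the marble counts (30 red, 10 blue) are assumptions chosen for illustration.

```python
from fractions import Fraction

def odds_to_probability(r, b):
    """Probability of red when the odds in favor of red are r:b."""
    return Fraction(r, r + b)

def probability_to_odds(p):
    """Odds in favor as a (for, against) pair in lowest terms; requires p < 1."""
    p = Fraction(p)
    ratio = p / (1 - p)
    return ratio.numerator, ratio.denominator

p_red = odds_to_probability(30, 10)   # 30 red marbles, 10 blue marbles
print(p_red)                          # 3/4
print(probability_to_odds(p_red))     # (3, 1), i.e. odds of 3:1 in favor of red
```

A probability of exactly 1 has no finite odds in favor, so `probability_to_odds` is only meant for probabilities strictly below 1.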
Fair bets and personal probability
Now imagine that you are going to make a bet based on this urn full of marbles. You know how many red marbles and how many blue marbles are in the urn. You have to pay $1 to play the game. How much money has to be offered before this becomes a fair bet?
Imagine there are no red marbles at all. If you bet on red, you will lose. If you bet on blue, you will win. There is no question about it at all. How much money should you be offered in order to take the bet? Logically, no payoff in the world would be big enough to get you to take a bet on red. But conversely, they should only have to offer you your $1 back in order to get you to bet on blue. It is a sure thing: you don't need any extra money for it to be fair, because you know you'll get your $1 back.
Now suppose there are two marbles, one red and one blue. If you bet on red, one-half of the time you will get the red payoff. If you bet on blue, one-half of the time you will get the blue payoff. How much money should they offer you for this to be a fair game? The logical answer is $2 in either case: you pay $1 on every play and collect $2 on half of them, so if you made this bet a very large number of times, you would, in the long run, break even.
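Both cases above follow one rule: a bet is fair when the expected net gain is zero, so the payoff on a win is the stake divided by the probability of winning. The sketch below assumes that "payoff" means the total amount returned on a win, stake included; the function names are hypothetical.

```python
from fractions import Fraction

def fair_payoff(stake, p_win):
    """Total amount returned on a win that makes the expected net gain zero."""
    if p_win <= 0:
        raise ValueError("no finite payoff makes a bet with zero win probability fair")
    return Fraction(stake) / Fraction(p_win)

def expected_net(stake, payoff, p_win):
    """Long-run average gain per play: win `payoff` with probability p_win, pay `stake` always."""
    return Fraction(p_win) * Fraction(payoff) - Fraction(stake)

print(fair_payoff(1, 1))                    # 1: the sure bet on blue
print(fair_payoff(1, Fraction(1, 2)))       # 2: one red and one blue marble
print(expected_net(1, 2, Fraction(1, 2)))   # 0: the $2 payoff breaks even
```

The error raised for a win probability of zero corresponds to the case with no red marbles, where no payoff is large enough to make a bet on red fair.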