
An introduction to Bayesian statistics

By Paddy Alton, published in ITNEXT

At the heart of Bayesian statistics lies a simple insight: there’s no such thing as a free lunch.

Put more directly: data cannot be used to create new beliefs about the world; it can only be used to update your pre-existing beliefs. To see why this is so, let’s go back to Bayes’ Theorem, a simple-yet-powerful result that underpins Bayesian statistics.

Note: I published a previous version of this article on Quora last year. It is reproduced here in an expanded form. In expanding it, I drew on another of my previous works, which you can read here.

The theorem states that

P(A|B) = P(B|A) × P(A) / P(B)

In words: ‘the probability of A, given B, is the probability of B given A, multiplied by the ratio of the probabilities of A and B’.

Why is this true? If we multiply both sides by P(B) we get

P(A|B) × P(B) = P(B|A) × P(A)

which is nicely symmetrical, at least. And more intuitive to visualise! Let’s imagine you are parachuting into a square that contains two overlapping circles, a red one (call it A) and a blue one (call it B), buffeted by winds as you fall.

Let’s say there is an equal probability of landing anywhere inside the square. Then P(A) is the fraction of the square covered by the red circle, and P(B) is the fraction covered by the blue circle.

If I tell you that you have landed in the red circle, what is the probability that you have also landed in the blue circle? Or, in notation: what is P(B|A)?

One thing we can say for sure is that if we multiply P(B|A) by the chance of landing in the red circle in the first place (P(A), that is), we will get P(AB), the probability of landing in the purple overlapping area. So P(AB) = P(B|A) × P(A).

But the same argument works the other way round! ‘If I tell you that you have landed in the blue circle’, etc., and we get P(AB) = P(A|B) × P(B).

Since both of these are equal to P(AB), they are equal to each other:

P(B|A) × P(A) = P(A|B) × P(B)

… and Bayes’ Theorem follows immediately by dividing both sides by P(B).

So far, so abstract. But now let me ask you something. Have you ever done a scientific experiment where you had to measure something? If yes, did you measure the same thing repeatedly?
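The symmetry at the heart of the derivation can be checked numerically. The sketch below scatters random ‘landing spots’ in the unit square (the circle positions and sizes are made up for illustration; the article’s diagram is not reproduced here) and confirms that counting either way round recovers the same overlap probability P(AB):

```python
import random

random.seed(0)

# Two overlapping circles inside the unit square (positions/radii are
# illustrative assumptions, not taken from the article's diagram).
def in_red(x, y):   # event A: landing in the red circle
    return (x - 0.4) ** 2 + (y - 0.5) ** 2 < 0.3 ** 2

def in_blue(x, y):  # event B: landing in the blue circle
    return (x - 0.6) ** 2 + (y - 0.5) ** 2 < 0.3 ** 2

n = 100_000
a = b = ab = 0
for _ in range(n):
    x, y = random.random(), random.random()  # uniform landing spot
    ra, rb = in_red(x, y), in_blue(x, y)
    a += ra
    b += rb
    ab += ra and rb

p_a, p_b, p_ab = a / n, b / n, ab / n
p_b_given_a = ab / a  # conditional probability, estimated from counts
p_a_given_b = ab / b

# Both routes recover the same overlap probability, as the derivation says:
print(abs(p_b_given_a * p_a - p_ab) < 1e-9)  # True
print(abs(p_a_given_b * p_b - p_ab) < 1e-9)  # True
```

With empirical frequencies the two products agree exactly, because Bayes’ Theorem is, at heart, this counting identity in disguise.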
Was the answer slightly different each time? If so, you’re in a good place to see how we can apply Bayes’ Theorem in the real world.

Your measurements are data. We know they aren’t perfect: they have some uncertainty. But you’re going to use them to estimate the true value of the thing you’re trying to measure. Let us call that true value X.

Every time you make measurements, they come out slightly different. Given the true value, X, there will be some probability of making a particular measurement. Succinctly, we might talk about P(measurement|X), a probability distribution. It defines the probability of measuring a particular value given that the true value is actually X. (We call this distribution ‘the likelihood’.)

So how are we to estimate X? This word ‘estimate’ is important: it implies that, whatever procedure you use, you will tell me what you think is the most likely true value. In other words, you aren’t 100% confident what the true value is, but you have an idea. There is a decent chance that the true value is a similar, but not equal, value, and a smaller chance it’s a not-so-similar value.

Put mathematically, there is a probability P(X=estimate) that the true value exactly equals your estimated value. But this probability is not equal to one: there is a non-zero probability that X equals some other value x, P(X=x).

Thinking about all the possible values we could choose for this alternative value x, P(X=x) will be higher for some values of x and lower for others. What we have here is another probability distribution over all possible values of X (from now on I will just call this P(X) for simplicity).

But now we have a problem. What if someone else does the experiment and observes different data? We know the measurements will be slightly different each time: they are randomly drawn from P(measurement|X). Well then, they would come up with a different estimate for the true value!
Put another way, they would find a different probability distribution P(X).

Succinctly, the probability distribution the two experimenters are inferring is not really P(X). It would be better to call it P(X|measurement): the probability that the true value of whatever it is you’re measuring is x, given the measurement you made. (There is a ‘real’ P(X), which we will come to in a moment.)

You probably made a series of measurements (i.e. you collected data) that should be considered as a whole when making estimates. From now on I will talk about P(data|X) and P(X|data).

I think you can see where this is going. By Bayes’ Theorem, we can relate these two probability distributions in the following way:

P(X|data) = P(data|X) × P(X) / P(data)

We’ve just introduced two new distributions, P(X) and P(data). Thinking back to our diagram from the start of this article, these can only have one meaning. Like the red and blue circles, they are our probabilities before we’ve finished parachuting down into the square.

After ‘landing’, we will know what data we’ve observed, and given that knowledge we can think about P(X|data). So P(X) is the probability that the true value of whatever we’re measuring is x, not given the data, i.e. before we make any measurements!

Likewise, P(data) is the probability of making that particular set of measurements when we don’t know for certain what X is. We sometimes call this ‘the marginal likelihood’. We often ignore it; in practice it’s a constant that guarantees our probabilities will add up to one, as they should.

What Bayes’ Theorem is telling us here (and the mathematics is unequivocal) is this: whatever procedure you use to estimate X from your data cannot just give you the one true answer. The only thing you can do with data is update P(X) to P(X|data), that is, update your beliefs about the probability of X having a certain value x.
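The update P(X|data) ∝ P(data|X) × P(X) can be carried out directly on a discrete grid of candidate values for X. The sketch below assumes Gaussian measurement noise; the data values and the noise level sigma are made up for illustration:

```python
import math

# Candidate values of X between 0 and 2, plus a uniform prior P(X).
grid = [i / 100 for i in range(0, 201)]
prior = [1 / len(grid)] * len(grid)

data = [1.02, 0.97, 1.05]  # repeated noisy measurements (illustrative)
sigma = 0.1                # assumed measurement uncertainty

def likelihood(x):
    """P(data|X=x): independent Gaussian noise around the true value."""
    out = 1.0
    for m in data:
        out *= math.exp(-0.5 * ((m - x) / sigma) ** 2)
    return out

# Bayes' Theorem, pointwise: posterior ∝ likelihood × prior ...
unnorm = [likelihood(x) * p for x, p in zip(grid, prior)]
# ... with P(data) playing the role of the normalising constant.
total = sum(unnorm)
posterior = [u / total for u in unnorm]

best = grid[max(range(len(grid)), key=posterior.__getitem__)]
print(best)  # the grid point nearest the sample mean of the data
```

Because the prior is uniform, the posterior simply traces the likelihood; a non-uniform prior would reweight it, exactly as the theorem prescribes.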
You update your beliefs from whatever they were before (your prior beliefs) to whatever they are now (your posterior beliefs). In concise language, P(X) is the prior probability distribution and P(X|data) is the posterior probability distribution.

“But I haven’t any idea what the value of the thing I’m measuring is before I try to measure it!”

This is the disconcerting thing about Bayesian statistics. The mathematics can’t be argued with: there is no free lunch here. You need to make a first stab at it before data can help you.

Or don’t. If you assume that P(X) is the same (uniform) across all values of X, then P(X|data) is proportional to P(data|X). In that case, the value of X that gives you the biggest P(data|X) is straightforwardly the best estimate of X, given the data.

Since P(data|X) is called the likelihood, we call this ‘maximum likelihood estimation’. Because the uniform prior we chose didn’t tell us that any value of X was more likely than any other, we call it an ‘uninformative prior’.

This might sound good to you! Indeed, in practice maximum likelihood estimation is a pretty routine approach. But a uniform prior is still a choice, and it is not always a sensible one.

For example, let us say a big oak tree fell over during a storm yesterday. It looks very old, and you decide to go and count the growth rings in the trunk to estimate its age in years. The age can, mathematically speaking, be any number greater than zero. Are you sure you want to assume that the tree is equally likely to be one year old as one hundred years old as one billion years old?

In practice, if it’s really easy to count those growth rings precisely, then the uncertainty on that measurement is going to be really low. In that case the choice of prior won’t matter: it will not shift your best estimate of X much from the one given by maximising the likelihood.
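Maximum likelihood estimation can be sketched in a few lines. For measurements with Gaussian noise (an assumed example, not from the article), maximising P(data|X) is the same as minimising the sum of squared residuals, and the answer lands on the sample mean:

```python
import statistics

data = [1.02, 0.97, 1.05]  # illustrative repeated measurements

def neg_log_likelihood(x, sigma=0.1):
    # -log P(data|X=x) for Gaussian noise, dropping constant terms:
    # proportional to the sum of squared residuals.
    return sum((m - x) ** 2 for m in data) / (2 * sigma ** 2)

# Scan candidate values; the minimiser of -log L is the MLE.
candidates = [i / 1000 for i in range(0, 2001)]
mle = min(candidates, key=neg_log_likelihood)

print(mle, statistics.mean(data))  # agree to within the grid spacing
```

This is the ‘uniform prior’ shortcut in action: no prior appears anywhere in the calculation, which is precisely why it only tells part of the Bayesian story.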
But if it’s not so easy to get an accurate count, using a more reasonable prior could adjust your final estimate a fair bit.

You’re free to describe the prior and the posterior probability distributions in many different ways. Often, there is no analytical procedure for using data to go from the former to the latter; we resort to numerical methods. In recent times these methods have been democratised via a set of techniques called probabilistic programming.

A popular approach is Markov chain Monte Carlo (MCMC), the details of which you can read about elsewhere. To make this work, you need a mathematical function to describe your prior, and a likelihood function P(data|X). However, what you get out is a set of random samples from the posterior probability distribution P(X|data), not a mathematical function that describes it.

This is absolutely fine for computing a credible interval. Let’s say 95% of (a large number of) samples drawn from P(X|data) lie between 0.47 and 0.54. You may infer that, given the data, there is a 95% probability that X lies between 0.47 and 0.54. Likewise, you can estimate the single most probable value of X given the data.

Often that is what you want to do! Job done.

The problem is this: what do you do if you then want to reuse your posterior as the prior for a new experiment? As I said, in MCMC you need to specify a mathematical function to describe the prior. Unless you have an idea of what the functional form ought to be, and can successfully fit a function of that form to the previous set of MCMC results, this would appear to block you from learning over multiple experiments.

(MCMC can also be computationally expensive, and there are certain situations where it can fail to work properly.)

However, in certain special circumstances there is an analytical procedure for updating prior to posterior, given data. In even more special circumstances, the resulting function has the same form as the one you used for the prior (with different parameters).
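Turning posterior samples into a credible interval is just a matter of reading off empirical percentiles. In this sketch the ‘MCMC output’ is faked with draws from a Gaussian (an assumption purely for illustration), tuned so the interval comes out near the 0.47–0.54 example above:

```python
import random

random.seed(42)

# Stand-in for MCMC output: 100,000 posterior samples for X.
# (Real MCMC samples would arrive correlated; that detail is ignored here.)
samples = sorted(random.gauss(0.505, 0.018) for _ in range(100_000))

lo = samples[int(0.025 * len(samples))]  # 2.5th percentile
hi = samples[int(0.975 * len(samples))]  # 97.5th percentile

print(f"95% credible interval: [{lo:.2f}, {hi:.2f}]")
```

No functional form for the posterior is needed at any point, which is exactly why sample-based output is ‘absolutely fine’ for this job.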
This is a conjugate prior.

This is beautiful. It means that with every new datapoint you can simply update your probability distribution for X … and then repeat as new data come in.

Here is an example. The Bernoulli distribution is the probability distribution describing a weighted coin flip. You only need one number to specify it: the probability of getting ‘heads’. Let us say you have a weighted coin and want to figure out this number, which I shall call X.

You could flip the coin a large number of times (we call each flip a ‘trial’) and use the number of heads divided by the total number of trials as your estimate for X. But in Bayesian statistics we really want to get a posterior for X (a distribution) rather than a point estimate. As discussed, there’s no free lunch here: we need to define a prior on X first.

We could make many choices. However, if we choose something called the Beta distribution, something special happens.

The Beta distribution is the ‘conjugate prior’ for the Bernoulli distribution. If you use a Beta distribution as your prior for X (remember, X is the parameter of the Bernoulli distribution: the true probability of flipping heads), there is a procedure that turns that prior, plus the outcome of a trial, into a posterior that is itself a Beta distribution (with updated parameters).

You can then use this new distribution as the prior for the next trial. Rinse and repeat, refining your posterior for X over many trials.

Even better, this procedure is simple: the Beta distribution has two parameters, and you simply add 1 to the first if you get heads, and add 1 to the second if you get tails. No messing around with MCMC (or other numerical methods) here, just straightforward maths.

In fact, a variety of likelihood functions (that is, P(data|X)) have conjugate priors.
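The whole Beta-Bernoulli update fits in a few lines. Here the true bias of the coin (0.7) is an assumed value used only to simulate flips; the inference itself sees nothing but the outcomes:

```python
import random

random.seed(1)

true_x = 0.7                # hidden coin bias (assumed, for simulation only)
alpha, beta = 1.0, 1.0      # Beta(1, 1): a uniform prior on X

for _ in range(1000):       # one conjugate update per coin flip
    heads = random.random() < true_x
    if heads:
        alpha += 1          # heads: add 1 to the first parameter
    else:
        beta += 1           # tails: add 1 to the second parameter

# The mean of a Beta(alpha, beta) distribution is alpha / (alpha + beta).
posterior_mean = alpha / (alpha + beta)
print(round(posterior_mean, 2))  # close to the true bias
```

Note that after 1000 flips the posterior is still a Beta distribution, just with sharper parameters; that closure under updating is what ‘conjugate’ means.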
If you pick a likelihood function in the exponential family, it will have a conjugate prior (many important statistical distributions are in this family, including the Normal distribution).

Some kind soul has compiled a very comprehensive table of likelihood functions and their corresponding conjugate priors on Wikipedia, drawing on multiple sources. Of course, you still need to decide whether any of these is a reasonable likelihood function for your particular problem! There are many cases where you won’t be able to do this; there, numerical methods such as MCMC are appropriate.

How do you make that decision? We should briefly return to the coin-flip example. I brushed something under the carpet there: the decision to model the coin flips with a Bernoulli distribution in the first place. This is a very simple example of what we mean by a generative model.

To build a generative model, we select a mathematical function that can be used to generate ‘fake’ data. There could be a random component to the output of this function. In the case of our weighted coin flip we used a Bernoulli distribution; it outputs either 1 or 0 (heads or tails) and has one parameter (which I called X). This parameter determines how often you get each result.

Clearly this is not actually the same thing as a physical biased coin. It is a mathematical model that can generate appropriately similar results. We are modelling the biased coin with a Bernoulli distribution.

Crucially, this allows us to write down P(data|X). And, using what we’ve learnt above, we can calculate P(X|data). To be crystal-clear: it allows us to learn something about the ‘secret’ parameter X by specifying a model and a prior. This is the soul of Bayesian inference.

The tricky thing is this: an appropriate generative model (and, therefore, likelihood function) could in practice be rather complex! In these cases you won’t be able to select a likelihood function that comes with a conjugate prior.
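A generative model really is just a function that produces fake data given the hidden parameter. For the weighted coin it is a one-liner (the bias value 0.7 below is an assumption for illustration):

```python
import random

random.seed(7)

def generate_flips(x, n):
    """Generative model of the weighted coin: simulate n Bernoulli
    trials whose heads-probability is the hidden parameter x."""
    return [1 if random.random() < x else 0 for _ in range(n)]

fake_data = generate_flips(0.7, 10)
print(fake_data)  # ten 0s and 1s, with 1s appearing about 70% of the time
```

Being able to run this function forwards is what licenses writing down P(data|X), and hence, via Bayes’ Theorem, running the inference backwards to P(X|data).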
You will use numerical methods instead.

As a final aside, you may be wondering how to choose between generative models that yield similar results. This is a huge and interesting topic in its own right, and Bayesian statistics has a lot to say about it. I will confine myself here to noting that the marginal likelihood P(data), which we cheerfully ignored earlier, plays a central role in Bayesian model comparison.

Phew. Here ends the most comprehensive introduction to Bayesian statistics I’ve published to date.

It’s always a challenge to know what (not) to include, and whether to split things up into multiple articles! In the end I’ve erred on the side of creating a single chunky article that can serve as a jumping-off point.

Your feedback is welcome. If you found a section confusing, or wanted to know more about something that wasn’t covered, let me know in the comments. In future I intend to write more about using Bayesian statistics in practice.


