Bayesian Averaging


This article is part of the series on distributed Bayesian reasoning. It assumes you have read the previous article on the Basic Math.

The Problem: Small Samples

In Basic Math, we used the informed probability Pi as the prior beliefs of the meta-reasoner. We calculated this as the ratio:

$$P_i(A=1 \mid B=1) = \frac{P(A=1, B=1 \mid B \ge 0)}{P(B=1 \mid B \ge 0)} = \frac{c(A=1, B=1)}{c(B=1)}$$

But this is actually a little naive. Suppose only one person voted on both A and B, and accepts both. Then c(A=1, B=1) = c(B=1) = 1, and this ratio is 100%. Is this really a good estimate of the probability that the meta-reasoner would accept A given they accepted B?

Certainly not. A single vote from a single user is not a great deal of information. We need a more sophisticated way of estimating the priors of the meta-reasoner based on the evidence we have in the form of arguments and votes.

The Bayesian approach to answering this question requires us to have priors: we actually need to start with an estimate of this probability – or rather, a distribution of possible probabilities – even before we have any data! Then we can use the rules of Bayesian belief updating to combine our priors with our data to come up with a posterior belief.

The Beta-Bernoulli Model

It turns out we are dealing with a textbook example of a problem that can be solved with a simple Bayesian hierarchical model. The solution, using a beta-Bernoulli model, is amply described elsewhere (I learned about it from this book). Here is the solution:

Let:

  • ω = our prior estimate of the probability that the average juror accepts A, before seeing any vote data
  • κ = our prior estimate of the concentration of likely values around ω (high κ means low variance)
  • N = c(A≥0) = the number of users who have voted on A
  • z = c(A=1) = the number of those users who accept A

Then our posterior estimate of the probability that the average user accepts A, given that they have voted on it, is:

$$\frac{\omega(\kappa - 2) + 1 + z}{\kappa + N} \tag{0}$$

What should we use as our prior ω? That depends on the context. If this method is being implemented in a social platform, then ω can be based on historical data. For example, if in the past the average acceptance rate for arguments submitted to the platform was 80%, then, having nothing else to go on, 80% is a good estimate of ω. Our estimate of κ can also be made using historical data.

What we have done here is sometimes called Bayesian Averaging. The above formula essentially gives us a weighted average of our prior ω and the observed ratio z/N, with the data z/N getting more weight the larger N is relative to κ.
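As a concrete sketch, formula (0) can be written as a small Python function (the name `bayesian_average` is mine; ω and κ are the priors defined above):

```python
def bayesian_average(omega, kappa, z, n):
    """Posterior estimate of an acceptance probability, per formula (0).

    omega -- prior estimate of the probability (the mode of the beta prior)
    kappa -- prior concentration (higher = more weight on omega)
    z     -- number of users who accept
    n     -- number of users who voted
    """
    return (omega * (kappa - 2) + 1 + z) / (kappa + n)

# With no votes at all, the estimate is just the prior mean:
print(bayesian_average(0.8, 10, 0, 0))      # ≈ 0.74
# A single accepting vote barely moves it -- nothing like 100%:
print(bayesian_average(0.8, 10, 1, 1))      # ≈ 0.76
# Lots of data pulls the estimate toward the observed ratio z/n:
print(bayesian_average(0.8, 10, 500, 1000)) # ≈ 0.50
```

Note that with z = n = 0 the formula returns (ω(κ−2)+1)/κ, the mean of the beta prior, which is close to but not exactly ω (ω is the prior's mode under this parameterization).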

The Bayesian-Average Probability Function

When calculating values of P up to this point, we have just been taking ratios of counts from our votes table (the c function). For example, the formula for P(A=a) is just:

$$P(A=a) = \frac{c(A=a)}{c()}$$

Where c() is the total number of voters. To use a Bayesian approach to estimating probabilities, instead of taking a ratio, we plug these same two counts into (0).

Let's define a new function Pᵥ that does this for us.

So where, by definition

$$P(\alpha) = \frac{c(\alpha)}{c()}$$

We have instead:

$$P_v(\alpha) = \frac{\omega(\kappa - 2) + 1 + c(\alpha)}{\kappa + c()} \tag{1}$$

And where by definition of conditional probability:

$$P(\alpha \mid \beta) = \frac{c(\alpha, \beta)}{c(\beta)}$$

We have instead

$$P_v(\alpha \mid \beta) = \frac{\omega(\kappa - 2) + 1 + c(\alpha, \beta)}{\kappa + c(\beta)} \tag{2}$$

Now let's compute an actual value of Pᵥ(A=1). First, we need to choose priors. Let's suppose that historically, on average 80% of voters accept root claims initially, so ω = 80%. And let's suppose the variation in this distribution can be represented by κ = 10. So:

$$P_v(A=1) = \frac{\omega(\kappa - 2) + 1 + c(A=1)}{\kappa + c()} = \frac{(80\%)(10 - 2) + 1 + 500}{10 + 1000} \approx 50.23\%$$

In this case, the large number of votes overwhelms our relatively weak prior, and so our result is very close to Pᵢ(A=1) = 50%.
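This sensitivity to sample size is easy to check numerically. A quick sketch (the 5-of-10 counts are hypothetical, chosen to contrast with the 500-of-1000 case above):

```python
def bayesian_average(omega, kappa, z, n):
    # Formula (0): prior mode omega, concentration kappa, z accepts out of n votes
    return (omega * (kappa - 2) + 1 + z) / (kappa + n)

# 500 accepts out of 1000 votes: the data dominates the prior
print(round(bayesian_average(0.8, 10, 500, 1000), 4))  # 0.5024
# 5 accepts out of 10 votes (hypothetical): same ratio z/n, but the
# prior omega = 80% still pulls the estimate well above 50%
print(round(bayesian_average(0.8, 10, 5, 10), 2))      # 0.62
```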

Two-Level Bayesian Averaging

Reviewing where we are going with this, recall from the Basic Math article that the justified opinion formula in the case of an argument tree with a single premise argument is:

$$P_h(A=1) = \sum_{b=0}^{1} P_i(A=1 \mid B=b)\, P_h(B=b) \tag{3}$$

Now we are saying that Pᵢ(A=1|B=b) may not be a good estimate of the probability that the average person would accept/reject A given they accepted/rejected B. So instead, we want to use Bayesian averaging, using the formula for Pᵥ(A=1|B=b) in place of Pᵢ(A=1|B=b). Substituting (2) into (3):

$$P_h(A=1) = \sum_{b=0}^{1} \frac{\omega(\kappa - 2) + 1 + c(A=1, B=b)}{\kappa + c(B=b)}\, P_h(B=b) \tag{4}$$

But what are our priors ω and κ?

Recall that we have just used Bayesian averaging to estimate the probability that the average person accepts A (Pᵥ(A=1) ≈ 50.23%). This seems like a reasonable prior for our estimate of Pᵥ(A=1|B=b). Before considering the 150 users who voted on B, we have a large amount of data telling us the average user has a roughly even chance of accepting A, and we have no prior reason to believe that accepting/rejecting B either increases or decreases this probability. Unless we have strong evidence showing that accepting/rejecting B changes the probability that people accept/reject A, we should assume it doesn't.

However, if we use Pᵥ(A=1) as a prior for Pᵥ(A=1|B=b), there is a subtle problem: we will be "double counting". We are counting the votes of users for whom A=1 and B=b as evidence for estimating Pᵥ(A=1), and then counting the same votes as evidence for estimating Pᵥ(A=1|B=b). So to avoid double counting, our prior should actually be Pᵥ(A=1|B≠b).

The priors for Pᵥ(A=1|B≠b), on the other hand, can be the same priors we used to calculate Pᵥ(A=1), because we don't have anything to go on besides historical data. So let's set ω = 80% and κ = 10. Then let's start with B=1, and calculate:

$$P_v(A=1 \mid B \ne 1) = \frac{\omega(\kappa - 2) + 1 + c(A=1, B \ne 1)}{\kappa + c(B \ne 1)} = \frac{(80\%)(10 - 2) + 1 + 420}{10 + 900} \approx 46.96\%$$

Now we can set ω = Pᵥ(A=1|B≠1) as the prior for calculating Pᵥ(A=1|B=1).

What is our prior estimate of κ? We might think that it should be proportional to the number of people who voted on A, but this is mistaken. A large number of votes on A provides strong evidence for estimating ω = Pᵥ(A=1). But our estimate of κ is based on our prior expectations about the degree to which people are influenced by arguments. This information can come from observation of actual variance in past arguments: if this variance was historically very high, then κ should be low, and vice versa.

For simplicity, let's use the same prior κ = 10 that we used before.

We can now finally calculate:

$$P_v(A=1 \mid B=1) = \frac{P_v(A=1 \mid B \ne 1)(\kappa - 2) + 1 + c(A=1, B=1)}{\kappa + c(B=1)} \approx \frac{(46.96\%)(10 - 2) + 1 + 80}{10 + 100} \approx 77.05\%$$

This is slightly lower than Pᵢ(A=1|B=1) = 80%. This is because we still have a reasonably large number of votes on B, and these votes provide strong evidence for a posterior value close to 80% that overpowers the prior estimate.
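The whole two-level calculation can be sketched end to end in Python. The counts follow the running example (1000 voters, 500 accepting A; 100 voting B=1, of whom 80 accept A); the function name `bayesian_average` is mine:

```python
def bayesian_average(omega, kappa, z, n):
    # Formula (0)/(2): posterior estimate from prior (omega, kappa) and counts
    return (omega * (kappa - 2) + 1 + z) / (kappa + n)

# Counts from the running example
c_total     = 1000               # c(): all voters
c_a1        = 500                # c(A=1)
c_b1        = 100                # c(B=1)
c_a1_b1     = 80                 # c(A=1, B=1)
c_not_b1    = c_total - c_b1     # c(B≠1) = 900
c_a1_not_b1 = c_a1 - c_a1_b1     # c(A=1, B≠1) = 420

# Level 1: historical priors (omega = 80%, kappa = 10) applied to the
# users with B≠1, which avoids double counting the B=1 votes.
p_a1_not_b1 = bayesian_average(0.8, 10, c_a1_not_b1, c_not_b1)
print(round(p_a1_not_b1, 4))  # 0.4697

# Level 2: use that estimate as the prior omega for the conditional.
p_a1_b1 = bayesian_average(p_a1_not_b1, 10, c_a1_b1, c_b1)
print(round(p_a1_b1, 4))      # 0.7705
```

The same chaining would extend to longer argument threads: each level's posterior becomes the prior ω for the next, more specific, conditional estimate.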

Clearly, we can extend this reasoning to long argument threads, though we will not do this here.

Further Development

This document is a work in progress – these models have not been fully developed. In fact, we are looking for collaborators. If you are an expert in Bayesian hierarchical models and causal inference, please contact collaborations@deliberati.io.
