Estimating true ratings


If you took a rating website, say IMDB or Goodreads, and sorted items purely by review scores, the stuff that floated to the top would be pretty obscure. That’s because the easiest way to maintain a perfect score is to have a very small sample size.

So, a math question: what is the statistically “correct” way to handle this?

In this analysis, I will assume there exists a “true” average review score, and we are trying to estimate it. The “true” average is the average that would be attained if there were a sufficiently large sample of reviewers. We’re not imagining that everyone in the world is reviewing the same book (for example, we don’t expect book reviews to reflect the opinions of people who don’t like reading books period). But we could imagine, what if there were a billion identical yet statistically independent Earths, and we averaged all their review scores.  Obviously it’s very hard to come across a billion identical yet statistically independent Earths, and that’s why we use math instead.

This premise may be fairly questioned. I once discussed the philosophical problems with review scores, including questioning the very idea of taking averages. But here, I’m just focusing on the math for math’s sake.  And, I really mean it, it’s hardcore math.  If you don’t want math, just skip to the last section I guess.

1. Bayesian approach

To follow the Bayesian approach, we start by assuming a prior probability distribution, and then we calculate a posterior probability distribution, and then calculate the expected average review score. Easy, right?

Let P(x) be the prior probability distribution of “true” review scores.  I’m going to normalize these scores between 0 and 1, so that 1 is a perfect review, and 0 is a perfectly bad review.  If P(x) is a uniform distribution, that means all true review scores between 0 and 1 are equally likely.  Now imagine we look at a single review R.  Seeing a single review causes us to update our probability estimates.  P(x|R) denotes the posterior probability distribution.  We can calculate P(x|R) with the Bayes equation.

P(x|R) = P(R|x)P(x) / P(R)

We don’t really need P(x|R), we just need the expectation value of x given R.  It turns out there’s a very nice form of the Bayes equation that we can use:

E[x|R] = E[x*P(R|x)]/E[P(R|x)]

But, maybe it’s not so nice when I have to explain that “E” represents the expectation value, which has to be expanded into a nasty integral.

There’s still one quantity we’re missing, P(R|x).  To explain P(R|x), suppose that the “true” average review score is 0.82; then P(0.7|0.82) is the probability that any given reviewer will assign a score of 0.7. The distribution is not obvious, and cannot be calculated from first principles.  For example, on Goodreads you have to give an integer rating between 1 and 5, so it’s actually impossible to give an individual review score of exactly 0.7.  P(R|x) ought to be measured empirically by collecting data from the website.

I don’t really want to collect data, so I’m going to make a simplification. Suppose that every review is either 0 or 1. Basically, your review options are restricted to thumbs up and thumbs down. This gives us an exact formula for P(R|x):

P(1|x) = x
P(0|x) = 1-x

So far I’ve been discussing a single review, but we want to be able to handle many reviews.  Suppose that out of N reviews, R of them are thumbs up, for an average review score of R/N. I’m introducing the combination function C(N,R), to count the number of distinct ways that the reviews could be ordered, although this term just cancels out later.

P(R|N,x) = C(N,R) * x^{R} * (1-x)^{N-R}
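
This is just the binomial distribution. As a quick sketch in Python (the function name is mine):

```python
from math import comb

def review_likelihood(R, N, x):
    """P(R|N,x): the probability of getting R thumbs up out of N
    reviews, when the "true" review score is x (binomial)."""
    return comb(N, R) * x**R * (1 - x)**(N - R)

# Two reviews, both thumbs up, when the true score is 0.82:
print(review_likelihood(2, 2, 0.82))  # 0.82^2 = 0.6724
```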

Now, in order to calculate E[x|N,R], I’m going to have to make assumptions about P(x).  In practice, P(x) would also need to be measured empirically by collecting data.  But to keep things simple, I’ll assume that P(x) is a uniform distribution. This still results in a pretty nasty integral, but at least I can just look up the answer. We’ll be using the Beta function:

B(m,n) = \int_{0}^{1} x^{m-1} (1-x)^{n-1} dx = \frac{(m-1)!(n-1)!}{(m+n-1)!}

So here’s the result:

E[x|N,R] = \int_{0}^{1} x*P(R|N,x)*P(x) dx / \int_{0}^{1} P(R|N,x)*P(x) dx = \frac{B(R+2,N-R+1)}{B(R+1,N-R+1)} = \frac{R+1}{N+2}

So the expected “true” average review score is (R+1)/(N+2). This is equivalent to adding two additional review scores, one thumbs up and one thumbs down. For example, if a movie has a 100% rating, but only based on two reviews, then our estimate of the “true” score is 0.75, or three thumbs ups out of four.
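
As a sanity check on the algebra, the two integrals can also be evaluated numerically. Here’s a sketch in plain Python (midpoint rule; the function name is mine), with the C(N,R) factor omitted since it cancels between numerator and denominator:

```python
def posterior_mean(N, R, steps=100_000):
    """Numerically integrate E[x|N,R] under a uniform prior,
    using the likelihood x^R * (1-x)^(N-R)."""
    num = 0.0
    den = 0.0
    for i in range(steps):
        x = (i + 0.5) / steps  # midpoint rule on [0, 1]
        like = x**R * (1 - x)**(N - R)
        num += x * like
        den += like
    return num / den

# Matches the closed form (R+1)/(N+2):
print(posterior_mean(2, 2))  # ≈ 0.75, i.e. (2+1)/(2+2)
```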

Of course, I made a lot of simplifying assumptions to get there. In principle you should use numerical calculations based on real data. If someone wants to try that, be my guest.

2. Frequentist approach

Using a frequentist approach, I’d like to create an “unbiased estimator” of the true review score. It’s difficult to explain what an unbiased estimator is, so I’ll refer to the last time I tried.

Now the whole point of a frequentist approach is to do statistical analysis without any knowledge of prior probability distributions.  In this context, that seems a bit silly.  We may not know the prior probability distribution P(x), but we could in principle make a good guess by analyzing real data.  So why assume that we can’t get P(x), when in fact we can get P(x)?  Now, I am not a frequentist hater, but in my opinion the frequentist approach just isn’t appropriate for this problem.  I’m trying it anyway, because it’s a mathematical exercise.

So for P(R|N,x), let’s make the same assumption as before. All reviews are either thumbs up or thumbs down. So we still have

P(R|N,x) = C(N,R) * x^{R} * (1-x)^{N-R}

We want to find an estimator function \hat{\Theta}(N,R), which is intended to estimate the “true” review score x. There are many possible estimators, so we’re imposing the condition that our estimator must be “unbiased”. In other words, the bias is zero. Here’s the equation for the bias:

Bias(\hat{\Theta}(N,R), x) = \sum_{R=0}^N P(R|N,x) \hat{\Theta}(N,R) - x = 0

This equation has to be true for all x and N. I found that you can’t fulfill this condition when N=0 (with no reviews, the estimator is a constant, and no constant equals x for all x), so I guess the estimator just isn’t defined for N=0.

To be honest, it’s been a minute since I learned statistics, and I don’t remember how you’re supposed to solve an equation like this one.  But we can plug in a few values of N, and solve those equations individually.

N = 1:
\sum_{R=0}^1 P(R|N=1,x) \hat{\Theta}(1,R) - x = 0
x*\hat{\Theta}(1,1) + (1-x)*\hat{\Theta}(1,0) - x = 0
\hat{\Theta}(1,1) = 1
\hat{\Theta}(1,0) = 0

N = 2:
\sum_{R=0}^2 P(R|N=2,x) \hat{\Theta}(2,R) - x = 0
x^2*\hat{\Theta}(2,2) + 2*x*(1-x)*\hat{\Theta}(2,1) + (1-x)^2*\hat{\Theta}(2,0) - x = 0
\hat{\Theta}(2,2) = 1
\hat{\Theta}(2,1) = 0.5
\hat{\Theta}(2,0) = 0
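
These solved values can be checked by plugging them back into the bias equation. A quick sketch in Python (the names bias and theta2 are mine):

```python
from math import comb

def bias(theta, N, x):
    """Bias of an estimator theta (a dict R -> estimate) when the
    true score is x, under the binomial review model."""
    return sum(comb(N, R) * x**R * (1 - x)**(N - R) * theta[R]
               for R in range(N + 1)) - x

theta2 = {0: 0.0, 1: 0.5, 2: 1.0}  # the N = 2 solution above
for x in (0.1, 0.5, 0.82):
    print(bias(theta2, 2, x))  # 0 for every x, up to floating point
```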

I went a bit further on my own, and I believe the unbiased estimator is actually just R/N. In other words, the frequentist estimate of the “true” average rating is just equal to the current average rating.
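
A quick Monte Carlo sketch makes the same point for R/N in general: averaged over many simulated batches of reviews, the estimator lands on the true score with no systematic offset.

```python
import random

def mean_estimate(x, N, trials=200_000):
    """Average the R/N estimator over many simulated batches of N
    thumbs-up/thumbs-down reviews with true score x."""
    total = 0.0
    for _ in range(trials):
        R = sum(random.random() < x for _ in range(N))
        total += R / N
    return total / trials

random.seed(0)
print(mean_estimate(0.82, 5))  # ≈ 0.82: no correction for small N
```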

That’s really not what we wanted.  The whole point was to determine the appropriate correction to review scores, but instead we found a correction equal to zero!  My conclusion is that the unbiased estimator approach is just not suited to the problem.

General thoughts

TL;DR: Using a Bayesian approach, you can estimate the “true” average rating by pretending that there were two additional ratings, one thumbs up, and one thumbs down.

I also tried using a frequentist approach, but it didn’t really work, because it didn’t make any correction based on the number of reviews. I think the Bayesian approach makes more sense here anyway, because you can empirically measure the prior probability distribution based on analysis of real data.

I made a bunch of simplifying assumptions, because I didn’t want to actually pull data from IMDB or Goodreads. I just wanted an analytical math problem that I could use to illustrate the problem and approach. If you want to do some serious analysis, you would base the inputs on empirical analysis, and then do a numerical calculation.

After I tried this analysis, someone pointed me to another article by Evan Miller with a different proposal:

CORRECT SOLUTION: Score = Lower bound of Wilson score confidence interval for a Bernoulli parameter

Evan doesn’t really offer a mathematical justification for this; it’s just vibes, really. But it seems like a reasonable practical solution.

Evan is interested in a slightly different question from me. I’m thinking about the best way to estimate the “true” rating of an object, while Evan is thinking about the best way to show a user the top-rated objects. Evan’s estimator is explicitly biased, since the “true” rating will be higher than this estimate 95% of the time. But if all you’re doing with the estimate is sorting objects with it, then bias is fine.
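
For reference, Evan’s formula is easy to implement. Here’s a sketch (the function name is mine; z = 1.96 gives the usual 95% interval):

```python
from math import sqrt

def wilson_lower_bound(R, N, z=1.96):
    """Lower bound of the Wilson score interval for the thumbs-up
    fraction, given R thumbs up out of N reviews."""
    if N == 0:
        return 0.0
    p = R / N
    denom = 1 + z * z / N
    center = p + z * z / (2 * N)
    margin = z * sqrt(p * (1 - p) / N + z * z / (4 * N * N))
    return (center - margin) / denom

# Two perfect reviews rank below nine-out-of-ten:
print(wilson_lower_bound(2, 2))   # ≈ 0.34
print(wilson_lower_bound(9, 10))  # ≈ 0.60
```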

Of course, no matter what approach you use, statistical flukes will always float to the top. There’s no getting around that. At most, we’re proposing a quantitative adjustment that penalizes the most obvious statistical flukes.
