Estimating true ratings


If you took a rating website, say IMDB or Goodreads, and sorted items purely by review scores, the stuff that floated to the top would be pretty obscure. That’s because the easiest way to maintain a perfect score is to have a very small sample size.

So, a math question: what is the statistically “correct” way to handle this?

In this analysis, I will assume there exists a “true” average review score, and we are trying to estimate it. The “true” average is the average that would be attained if there were a sufficiently large sample of reviewers. We’re not imagining that everyone in the world is reviewing the same book (for example, we don’t expect book reviews to reflect the opinions of people who don’t like reading books period). But we could imagine, what if there were a billion identical yet statistically independent Earths, and we averaged all their review scores.  Obviously it’s very hard to come across a billion identical yet statistically independent Earths, and that’s why we use math instead.

This premise may be fairly questioned. I once discussed the philosophical problems with review scores, including questioning the very idea of taking averages. But here, I’m just focusing on the math for math’s sake.  And, I really mean it, it’s hardcore math.  If you don’t want math, just skip to the last section I guess.

1. Bayesian approach

To follow the Bayesian approach, we start by assuming a prior probability distribution, and then we calculate a posterior probability distribution, and then calculate the expected average review score. Easy, right?

Let P(x) be the prior probability distribution of “true” review scores.  I’m going to normalize these scores between 0 and 1, so that 1 is a perfect review, and 0 is a perfectly bad review.  If P(x) is a uniform distribution, that means all true review scores between 0 and 1 are equally likely.  Now imagine we look at a single review R.  Seeing a single review causes us to update our probability estimates.  P(x|R) denotes the posterior probability distribution.  We can calculate P(x|R) with the Bayes equation.

P(x|R) = P(R|x)P(x) / P(R)

We don’t really need P(x|R), we just need the expectation value of x given R.  It turns out there’s a very nice form of the Bayes equation that we can use:

E[x|R] = E[x*P(R|x)]/E[P(R|x)]

But, maybe it’s not so nice when I have to explain that “E” represents the expectation value, which has to be expanded into a nasty integral.

There’s still one quantity we’re missing, P(R|x).  To explain P(R|x), suppose that the “true” average review score is 0.82; then P(0.7|0.82) is the probability that any given reviewer will assign a score of 0.7. The distribution is not obvious, and cannot be calculated from first principles.  For example, on Goodreads you have to give an integer rating between 1 and 5, so it’s actually impossible to give an individual review score of exactly 0.7.  P(R|x) ought to be measured empirically by collecting data from the website.

I don’t really want to collect data, so I’m going to make a simplification. Suppose that every review is either 0 or 1. Basically, your review options are restricted to thumbs up and thumbs down. This gives us an exact formula for P(R|x):

P(1|x) = x
P(0|x) = 1-x

So far I’ve been discussing a single review, but we want to be able to handle many reviews.  Suppose that out of N reviews, R of them are thumbs up, for an average review score of R/N. I’m introducing the combination function C(N,R), to count the number of distinct ways that the reviews could be ordered, although this term just cancels out later.

P(R|N,x) = C(N,R) * x^{R} * (1-x)^{N-R}
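
This is just the binomial distribution. As a quick sketch in Python (the function name is mine):

```python
from math import comb

def review_likelihood(R, N, x):
    """P(R|N,x): the probability of getting R thumbs up out of N
    reviews, when the "true" review score is x (binomial)."""
    return comb(N, R) * x**R * (1 - x)**(N - R)

# Two reviews, both thumbs up, when the true score is 0.82:
print(review_likelihood(2, 2, 0.82))  # 0.82^2 = 0.6724
```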

Now, in order to calculate E[x|N,R], I’m going to have to make assumptions about P(x).  In practice, P(x) would also need to be measured empirically by collecting data.  But to keep things simple, I’ll assume that P(x) is a uniform distribution. This still results in a pretty nasty integral, but at least I can just look up the answer. We’ll be using the Beta function:

B(m,n) = \int_{0}^{1} x^{m-1} (1-x)^{n-1} dx = \frac{(m-1)!(n-1)!}{(m+n-1)!}

So here’s the result:

E[x|N,R] = \int_{0}^{1} x*P(R|N,x)*P(x) dx / \int_{0}^{1} P(R|N,x)*P(x) dx = \frac{B(R+2,N-R+1)}{B(R+1,N-R+1)} = \frac{R+1}{N+2}

So the expected “true” average review score is (R+1)/(N+2). This is equivalent to adding two additional review scores, one thumbs up and one thumbs down. For example, if a movie has a 100% rating, but only based on two reviews, then our estimate of the “true” score is 0.75, or three thumbs ups out of four.
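
As a sanity check on the algebra, the two integrals can also be evaluated numerically. Here’s a sketch in plain Python (midpoint rule; the function name is mine), with the C(N,R) factor omitted since it cancels between numerator and denominator:

```python
def posterior_mean(N, R, steps=100_000):
    """Numerically integrate E[x|N,R] under a uniform prior,
    using the likelihood x^R * (1-x)^(N-R)."""
    num = 0.0
    den = 0.0
    for i in range(steps):
        x = (i + 0.5) / steps  # midpoint rule on [0, 1]
        like = x**R * (1 - x)**(N - R)
        num += x * like
        den += like
    return num / den

# Matches the closed form (R+1)/(N+2):
print(posterior_mean(2, 2))  # ≈ 0.75, i.e. (2+1)/(2+2)
```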

Of course, I made a lot of simplifying assumptions to get there. In principle you should use numerical calculations based on real data. If someone wants to try that, be my guest.

2. Frequentist approach

Using a frequentist approach, I’d like to create an “unbiased estimator” of the true review score. It’s difficult to explain what an unbiased estimator is, so I’ll refer to the last time I tried.

Now the whole point of a frequentist approach is to do statistical analysis without any knowledge of prior probability distributions.  In this context, that seems a bit silly.  We may not know the prior probability distribution P(x), but we could in principle make a good guess by analyzing real data.  So why assume that we can’t get P(x), when in fact we can get P(x)?  Now, I am not a frequentist hater, but in my opinion the frequentist approach just isn’t appropriate for this problem.  I’m trying it anyway, because it’s a mathematical exercise.

So for P(R|N,x), let’s make the same assumption as before. All reviews are either thumbs up or thumbs down. So we still have

P(R|N,x) = C(N,R) * x^{R} * (1-x)^{N-R}

We want to find an estimator function \hat{\Theta}(N,R), which is intended to estimate the “true” review score x. There are many possible estimators, so we’re imposing the condition that our estimator must be “unbiased”. In other words, the bias is zero. Here’s the equation for the bias:

Bias(\hat{\Theta}(N,R), x) = \sum_{R=0}^N P(R|N,x) \hat{\Theta}(N,R) - x = 0

This equation has to be true for all x and N. I found that you can’t fulfill this condition when N=0 (with no reviews, the estimator is a constant, and no constant equals x for all x), so I guess the estimator just isn’t defined for N=0.

To be honest, it’s been a minute since I learned statistics, and I don’t remember how you’re supposed to solve an equation like this one.  But we can plug in a few values of N, and solve those equations individually.

N = 1:
\sum_{R=0}^1 P(R|N=1,x) \hat{\Theta}(1,R) - x = 0
x*\hat{\Theta}(1,1) + (1-x)*\hat{\Theta}(1,0) - x = 0
\hat{\Theta}(1,1) = 1
\hat{\Theta}(1,0) = 0

N = 2:
\sum_{R=0}^2 P(R|N=2,x) \hat{\Theta}(2,R) - x = 0
x^2*\hat{\Theta}(2,2) + 2*x*(1-x)*\hat{\Theta}(2,1) + (1-x)^2*\hat{\Theta}(2,0) - x = 0
\hat{\Theta}(2,2) = 1
\hat{\Theta}(2,1) = 0.5
\hat{\Theta}(2,0) = 0
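
These solved values can be checked by plugging them back into the bias equation. A quick sketch in Python (the names bias and theta2 are mine):

```python
from math import comb

def bias(theta, N, x):
    """Bias of an estimator theta (a dict R -> estimate) when the
    true score is x, under the binomial review model."""
    return sum(comb(N, R) * x**R * (1 - x)**(N - R) * theta[R]
               for R in range(N + 1)) - x

theta2 = {0: 0.0, 1: 0.5, 2: 1.0}  # the N = 2 solution above
for x in (0.1, 0.5, 0.82):
    print(bias(theta2, 2, x))  # 0 for every x, up to floating point
```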

I went a bit further on my own, and I believe the unbiased estimator is actually just R/N. In other words, the frequentist estimate of the “true” average rating is just equal to the current average rating.
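
A quick Monte Carlo sketch makes the same point for R/N in general: averaged over many simulated batches of reviews, the estimator lands on the true score with no systematic offset.

```python
import random

def mean_estimate(x, N, trials=200_000):
    """Average the R/N estimator over many simulated batches of N
    thumbs-up/thumbs-down reviews with true score x."""
    total = 0.0
    for _ in range(trials):
        R = sum(random.random() < x for _ in range(N))
        total += R / N
    return total / trials

random.seed(0)
print(mean_estimate(0.82, 5))  # ≈ 0.82: no correction for small N
```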

That’s really not what we wanted.  The whole point was to determine the appropriate correction to review scores, but instead we found a correction equal to zero!  My conclusion is that the unbiased estimator approach is just not suited to the problem.

General thoughts

TL;DR: Using a Bayesian approach, you can estimate the “true” average rating by pretending that there were two additional ratings, one thumbs up, and one thumbs down.

I also tried using a frequentist approach, but it didn’t really work, because it didn’t make any correction based on the number of reviews. I think the Bayesian approach makes more sense here anyway, because you can empirically measure the prior probability distribution based on analysis of real data.

I made a bunch of simplifying assumptions, because I didn’t want to actually pull data from IMDB or Goodreads. I just wanted an analytical math problem that I could use to illustrate the problem and approach. If you want to do some serious analysis, you would base the inputs on empirical analysis, and then do a numerical calculation.

After I tried this analysis, someone pointed me to another article by Evan Miller with a different proposal:

CORRECT SOLUTION: Score = Lower bound of Wilson score confidence interval for a Bernoulli parameter

Evan doesn’t really offer a mathematical justification for this; it’s just vibes, really. But it seems like a reasonable practical solution.

Evan is interested in a slightly different question from me. I’m thinking about the best way to estimate the “true” rating of an object, while Evan is thinking about the best way to show a user the top-rated objects. Evan’s estimator is explicitly biased, since the “true” rating will be higher than this estimate 95% of the time. But if all you’re doing with the estimate is sorting objects with it, then bias is fine.
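
For reference, Evan’s formula is easy to implement. Here’s a sketch (the function name is mine; z = 1.96 gives the usual 95% interval):

```python
from math import sqrt

def wilson_lower_bound(R, N, z=1.96):
    """Lower bound of the Wilson score interval for the thumbs-up
    fraction, given R thumbs up out of N reviews."""
    if N == 0:
        return 0.0
    p = R / N
    denom = 1 + z * z / N
    center = p + z * z / (2 * N)
    margin = z * sqrt(p * (1 - p) / N + z * z / (4 * N * N))
    return (center - margin) / denom

# Two perfect reviews rank below nine-out-of-ten:
print(wilson_lower_bound(2, 2))   # ≈ 0.34
print(wilson_lower_bound(9, 10))  # ≈ 0.60
```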

Of course, no matter what approach you use, statistical flukes will always float to the top. There’s no getting around that. At most, we’re proposing a quantitative adjustment that penalizes the most obvious statistical flukes.
