Review scores: a philosophical investigation


Normally, in the introduction to an article, I would provide a “hook”, explaining my interest in the topic and why you should be interested too. But my usual approach felt wrong here, since I cannot justify my own interest, and arguably, if you’re reading this rather than scrolling past the title, you should be less interested than you currently are.

So, review scores. WTF are they? I don’t have the answers, but I sure have some questions. Why is 0/10 bad, 10/10 good, and 5/10… also bad? What goals do people have in assigning a score, and do they align with the goals of people reading the same score? What does it mean to take the average of many review scores? And why do we expect review scores to be normally distributed?

Mathematical structure

Review scores are intuitively understood as a measure of the quality of a work (such as a video game, movie, book, or LP). Or perhaps a measure of our enjoyment of the work? Already we have this question: is it quality, or is it enjoyment, or are those two concepts the same? But we must leave that question hanging, because there are more existentially pressing questions to come. Review scores do more than just express quality/enjoyment: they assign a number. And numbers are quite the loaded concept.

First, numbers are fully ordered. Is quality/enjoyment ordered? Could we really take any two works, or any two experiences, and judge whether one is better than the other? Can two works be equal to one another?

Second, perhaps more troublingly, numbers can be added and subtracted. What would it even mean to add or subtract two review scores? Could we say that the “difference” between a 1/10 and a 4/10 is equal to the “difference” between a 4/10 and a 7/10? I dunno about that. My intuition is that a 7/10 might be something I would enjoy, but both a 1/10 and a 4/10 likely represent something I wouldn’t enjoy, so the distance between 1/10 and 4/10 is relatively small.

Now you could say that just because review scores are represented with numbers, and just because it makes sense to add and subtract numbers, it does not follow that it makes sense to add and subtract review scores. And I agree! However, in that case, numbers seem like the wrong mathematical metaphor. If review scores are ordered, but cannot be added or subtracted, then review scores are more properly described as ordinals, rather than numbers.

The thing is, we do not treat review scores like ordinals. We like taking averages of our review scores. An average, importantly, involves adding multiple review scores, and then dividing by the number of scores. For instance, if one person gives a 1/10 rating, and another gives a 7/10 rating, the average score is a 4/10. Our method of taking averages implies that the difference between a 1/10 and 4/10 is equal to the difference between a 4/10 and 7/10, intuition be damned.
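
To make the contrast concrete, here is a minimal Python sketch (the scores are invented for illustration). The mean treats the scale as interval data, so it changes under an order-preserving relabeling; the median only uses the ordering, so it picks out the same underlying score either way:

```python
import statistics

scores = [1, 4, 7, 7, 9]

# The mean assumes equal "distances" between adjacent scores.
print(statistics.mean(scores))    # 5.6
# The median only uses the ordering of the scores.
print(statistics.median(scores))  # 7

# An order-preserving relabeling of the scale: same ordinals,
# different numerals. (The mapping is chosen arbitrarily.)
relabel = {1: 1, 4: 2, 7: 8, 9: 10}
relabeled = [relabel[s] for s in scores]

print(statistics.mean(relabeled))    # 5.8 -- the mean shifted
print(statistics.median(relabeled))  # 8 -- same underlying score as before
```

If we really treated scores as mere ordinals, we would report medians; reaching for the mean is what quietly commits us to treating them as numbers.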

My understanding is that websites that aggregate review scores typically do not take simple averages, perhaps because they are aware of these very issues. I do not know how they compute averages, and probably they don’t want anyone to know, lest someone game the system. So, in practice, are we treating review scores as numbers or as ordinals? We don’t even know! How do we sleep at night, not knowing if the sheep are counted or merely arranged in order?

Practical function

To be reductive, the purpose of a review score is to inform a buy/no-buy decision. To be slightly less reductive, a review score also might inform you whether to buy now or wait, how excited to be, or how much trust to put into the work.

Informing a buy/no-buy decision is not quite the same as just telling you whether or not to buy. A review score is just one of many inputs into our brains’ algorithms. For instance, I might buy a game if it’s in the “puzzle adventure” genre, is praised for its story, and has review scores of at least 6/10. But I may not buy a first-person shooter regardless of review scores.
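
In pseudocode terms, the score is just one gate among several. Here is a tongue-in-cheek sketch of that decision rule (all field names and thresholds are invented for illustration):

```python
# A toy model of one brain's buy/no-buy algorithm.
def should_buy(game: dict) -> bool:
    if game["genre"] == "first-person shooter":
        return False  # no score can save it
    return (
        game["genre"] == "puzzle adventure"
        and game["story_praised"]
        and game["score"] >= 6
    )

print(should_buy({"genre": "puzzle adventure",
                  "story_praised": True, "score": 7}))  # True
```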

If we’re being less reductive, we might say that the purpose of the review score is to express a certain level of enjoyment on the part of the reviewer. So, if I were to give these games ratings based purely on my own enjoyment, I would have to give categorically higher scores to puzzle adventure games, and categorically lower scores to first-person shooters. But would this actually be useful to someone trying to make the buy/no-buy decision? It seems that rather than hearing about my genre preferences, you’re better off just having an understanding of your own genre preferences, and finding review scores that reflect the quality of a game within its own genre.

So on the one hand, we have this idea that review scores should reflect the internal enjoyment of the reviewer. On the other hand, we have this idea that review scores should be useful for making decisions. These two goals are at odds!

We almost always just ignore this problem by finding the right reviewers—reviewers whose personal tastes just so happen to align with what is useful to the consumer. You want a review of a romance novel, you get it from someone who likes romance novels, that’s just common sense, right? You certainly wouldn’t get it from me, because I’d just write some meta about my issues with the romance genre as a whole (yes I have done this).

Something I’m circling around is the so-called “objective review” commonly demanded by gamers. The philosophical problems with the “objective review” are too glaring for me to waste breath on—and I say this after having wasted a bunch of breath on numbers vs ordinals.

But even though reviews are not in any sense “objective”, neither do they represent unfiltered subjective opinion. We filter review scores by selecting “critics” whose opinions are somehow more valuable than those of the rest of us genre-picky riff-raff. Or, in the case of websites with user review scores, we select reviewers who are sufficiently engaged to leave ratings, and put their scores through an averaging algorithm nobody quite understands. So what even is a review score? Stare at this mystery long enough, and you might just go mad, maybe become one of those gamers who demands “objectivity”.

Statistical distribution

In the process of writing this essay, I did a bit of “research”, looking to see what the score distributions are on sites like Goodreads, IMDB, or Metacritic. A few interesting things came up. First, I found an academic article comparing review score distributions on Goodreads vs Amazon. The difference between the two was explained as the result of Amazon being more directly tied to sales. So, for instance, people give more 1- and 5-star reviews on Amazon, because that’s the best strategy if you’re trying to maximize your influence on buy decisions.

The second article was by a data scientist trying to select the “best” movie review website, purely on the basis of how closely each site’s score distribution resembles a centered normal distribution. Along similar lines, there’s a FiveThirtyEight article criticizing Fandango on the basis of its skewed score distribution. Both of these articles are deeply misguided.

FiveThirtyEight complained that 98% of all movies on Fandango are rated between 3 and 5 stars. The article tells a story of someone being persuaded to watch a terrible movie on the basis of a 3-star score on Fandango. Well, nobody ever told you that 3 stars means “good”; that was just your assumption.

And I have to point out… if your complaint is that the distribution is asymmetrical, there is an easy way to make it symmetrical again. Change 3 stars to 1 star; 3.5 stars to 2 stars; 4 stars to 3 stars; 4.5 stars to 4 stars. Now you’ve got yourself a nice centered normal distribution, which by your own standards means it’s the perfect movie review website. If a review score distribution is mathematically isomorphic to the perfect review score distribution, I’m forced to conclude that the score distribution was perfect to begin with. At worst, we could say that the scores were communicated poorly.
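
Here is that relabeling as a toy Python sketch (the sample ratings are invented; real Fandango data would differ). Nothing about the ordering changes, only the labels, yet the distribution suddenly looks centered:

```python
from collections import Counter

# Invented ratings bunched between 3 and 5 stars, in the spirit of
# the Fandango complaint (not actual Fandango data).
ratings = [3.0, 3.5, 3.5, 4.0, 4.0, 4.0, 4.5, 4.5, 5.0]

# The order-preserving relabeling proposed above (5 stars stays 5).
remap = {3.0: 1, 3.5: 2, 4.0: 3, 4.5: 4, 5.0: 5}
relabeled = [remap[r] for r in ratings]

print(sorted(Counter(ratings).items()))
# [(3.0, 1), (3.5, 2), (4.0, 3), (4.5, 2), (5.0, 1)]
print(sorted(Counter(relabeled).items()))
# [(1, 1), (2, 2), (3, 3), (4, 2), (5, 1)] -- now centered on 3
```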

Both of these articles express the assumption that the “ideal” center of the distribution is 3/5, or alternatively 5/10. But where does this assumption even come from, and is it consistent with our other ideas about review scores? For example, many people think 5/10 describes something you neither liked nor disliked. But what if most people like most movies? Wouldn’t you expect the review score distribution to be centered above 5/10?

Or maybe 5/10 means good, but only good enough to match the experience of doing something else besides watching a movie. Maybe 5/10 means that even people who love movies enough to rate them on Fandango could take or leave this particular movie, so don’t bother unless you love movies even more than that. I don’t know!

Personally, I just guess at the meaning of review scores based on past experience. In the realm of music, a 10/10 means “worth trying the first 20 seconds”. In the realm of video games, a 7/10 means “it is a game that someone thought was worth reviewing”. On Goodreads, a 4.01 means “amazing”, and a 3.98 means “totally unreadable”. So on and so forth.

Another assumption the data science article made was that review score distributions ought to be normal, peaked distributions. The author went so far as to dismiss Rotten Tomatoes, which had a uniform, unpeaked distribution. Why though?

Normal distributions make some sense, insofar as a movie is the sum of many small parts, each of which randomly improves your experience or detracts from it (that’s the central limit theorem at work). On the other hand, uniform distributions maximize entropy. In plainer terms: uniform distributions give you greater power to differentiate between movies, because every score value gets used equally often. Uniform distributions also lend themselves to a natural interpretation, that of percentiles.
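
Here is a quick simulation of that “sum of many small parts” intuition (a toy model with made-up numbers, not a claim about real movies):

```python
import random
from collections import Counter

# Toy model: each of 100 small "parts" of a movie independently adds
# or subtracts a point from your experience. By the central limit
# theorem, the totals pile up in a rough bell shape.
def movie_score(parts=100):
    return sum(random.choice((-1, +1)) for _ in range(parts))

totals = Counter(movie_score() for _ in range(2000))
for value in sorted(totals):
    print(f"{value:+4d} {'#' * (totals[value] // 5)}")  # crude text histogram
```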

So it’s funny. On the one hand, the data science article assumes that the normal distribution is best, and dismisses websites with uniform distributions. On the other hand, I found another analysis that translates IMDB scores into percentiles, essentially forcing a uniform distribution. Clearly we have some fundamental disagreements over what we want in review scores.
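
I don’t know exactly how that percentile analysis computed its ranks, but here is a minimal version of the idea in Python (the scores are invented):

```python
from bisect import bisect_right

def to_percentile(score, all_scores):
    """Percentile rank: the percentage of scores at or below this one."""
    ranked = sorted(all_scores)
    return 100 * bisect_right(ranked, score) / len(ranked)

# Invented IMDB-like scores, bunched around 6-7 the way such scores
# tend to be. Percentile-ranking every score flattens any
# distribution toward uniform.
scores = [5.8, 6.2, 6.4, 6.6, 6.7, 6.9, 7.1, 7.4, 8.1, 8.8]
print(to_percentile(6.9, scores))  # 60.0 -- a 6.9 is only 60th percentile
```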

So, to summarize my conclusions. WTF are review scores? Are they numbers or ordinals? What purposes do they serve? What is their appropriate distribution? How does anyone else get by without wondering these things?

Comments

  1. John Morales says

    First, numbers are fully ordered. Is quality/enjoyment ordered? Could we really take any two works, or any two experiences, and judge whether one is better than the other? Can two works be equal to one another?

    Well, obviously some ratings are subjective, and some are objective.
    So, sometimes yes, albeit not in toto, but rather regarding some particular aspect of what is being rated.

    Second, perhaps more troublingly, numbers can be added and subtracted. What would it even mean to add or subtract two review scores?

    Depends on what that number (being a position on a bounded scale) represents.

    I really don’t see why anyone would want to add/subtract them, anyway. At best, they function as mathematical inequalities.

    If review scores are ordered, but cannot be added or subtracted, then review scores are more properly described as ordinals, rather than numbers.

    Again, it depends on what those numbers represent, but as I noted, inequalities with ordered degrees of magnitude are a better metaphor.

    In the process of writing this essay, I did a bit of “research”, looking to see what the score distributions are on sites like Goodreads, IMDB, or Metacritic.

    Those are subjective ratings, and by now it’s pretty clear that if you had put that in your post title, I would not have caviled as I have. 🙂

    (objective ratings (such as exam scores), on the other hand, are subject to mathematical analysis)

    So, to summarize my conclusions. WTF are review scores?

    The reviewers’ preferences, basically.

    (For example, I like plot-driven stories, so when a reviewer I know has a predilection for character over plot, I generally invert their rating)

  2. Dauphni says

    Being from the Netherlands I never questioned the 10 point scale of review scores, since they always appeared to be perfectly analogous to the 10 point grade system we use from primary school all the way to university. A 0 means you didn’t get anything right, a 6 means you passed, and a 10 means no mistakes at all. Often these grades map directly to the percentage of correct answers, but for things like essays it can get a little more subjective.

    Similar ten point or percentage grading systems are very common worldwide, which would give many people an intuitive understanding of what each number means, so using it for review scores makes a lot of sense on that basis.

  3. sonofrojblake says

    “WTF are review scores?”
    They are a label.
    “Are they numbers or ordinals?”
    Neither. See above.
    “What purposes do they serve?”
    I can only say what purpose I think they serve, based on how I use them and how I observe other people using them. To wit: they are a shorthand to tell me whether or not to bother reading the review. They actually tell me very little about the film itself per se. The use I make of it will be heavily context- and even media-dependent.

    Example: I see a review of the latest Wonder Woman movie on the Guardian website. I come to this review with the following context:
    1. I’ve seen the first one, and it was great.
    2. I like comic book movies, Gal Gadot is so hot my heterosexual wife drools at the sight of her, and it doesn’t hurt that Chris Pine is good too.
    3. Bloody Covid, so no films for ages.
    4. Very likely to watch the film regardless of the review OR score.
    5. It’s lunchtime, and I have time to read maybe six articles on the website while I eat, and it’s a slow news day.

    Therefore, the star rating here is of at best academic interest – I’m going to read the review, and pretty much regardless of what it says, I’m going to see the film. (It’s pants, btw. That’s all the review you’re getting from me.) Tenet was similarly review- and rating-proof. (Note, I watched both of these at home).

    Another: I see a review of a film I’m not aware of, but that stars someone I like who I know makes interesting choices (e.g. Eva Green, Daniel Radcliffe). It’s not a franchise or even a sequel. I’m accepting of the genre. It’s 2015, so I’ve seen some films recently. The review is in a general interest magazine, and I’ve some time. Here, if the score is between, say, 3/10 and 6/10, I’m probably not even going to bother reading the review. I’ll read something that might be good, and I’ll read something that might be a rant about something shit, but something in between isn’t going to be worth my time when there’s something over the page about snow-kiting in Antarctica.

    Another: I see a review of a film I’ve never heard of, starring nobody I’ve heard of, in a genre I could take or leave. If it’s not a five star rating, I’m not reading one word of that review. Life’s too short.

    Another: I’ve bought an honest-to-goodness paper copy of Empire magazine, and I’ve a Sunday afternoon to myself or a train journey or somethin’. I’m going to read *all* the reviews. The star rating at best tickles up what I’m about to read anyway, like a subheading on a news story.

    “What is their appropriate distribution?”
    What is the appropriate distribution of movie quality/my time to read reviews? How long is a piece of string?

    “How does anyone else get by without wondering these things?”
    I think we’ve established that I don’t know that any better than you do.

  4. billseymour says

    … is it quality, or is it enjoyment …

    … numbers are fully ordered.

    quality + i * enjoyment?

  5. Andreas Avester says

    Something I’m circling around is the so-called “objective review” commonly demanded by gamers.

    People don’t enjoy books or games or movies objectively. For example, I can enjoy a novel with a mediocre plot but an excellently written trans protagonist. I cannot enjoy an otherwise well-written novel with an excellent plot if the protagonist is an insufferable preachy religious person or an arrogant jerk or a number of other character archetypes that I cannot stand.

    There is a novel series from which I read the first two books and then I quit, because the author introduced a new character who was a devoutly religious person who just couldn’t stop preaching his beliefs. When I discussed these books with a friend (an atheist), he said that this character I loathed was his favorite character in these books. He enjoyed how caring and selfless this character was.

  6. says

    @Dauphni #2,

    Those grade systems are arbitrary too. Exam-writers can modify scores by making the questions easier or more difficult. It’s only by convention that they set the difficulty to such a level that 60% is a pass. But I can see how this might influence how one interprets ratings.

  7. billseymour says

    Andreas Avester @5:

    I can enjoy a novel with a mediocre plot but an excellently written trans protagonist. I cannot enjoy an otherwise well-written novel with an excellent plot if the protagonist is an insufferable preachy religious person or an arrogant jerk …

    My experience is almost the opposite of that when it comes to C. J. Cherryh’s Foreigner novels.  The plots are all basically the same:  dishonorable rentiers behave selfishly and honorable rentiers defeat them; but I find the books well-written with enough suspense to make them page-turners.  My guess is that ratings of works of art are mostly useless given the different tastes we all have.

    Also, I’ve decided that my comment @4 is bass-ackwards, at least when it comes to consumer products.  Buyers can’t really judge a product’s quality and tend to conflate that with flashiness, so enjoyment should be the real term and quality imaginary. 😎

  8. milu says

    i had a good laugh 😀
    good questions though! people! let’s wake up! how is this conspicuous absence of a theoretical framework for review scores not keeping us awake at night?

    Here’s my 2 cents. Bit of a tangent, as it seemed you were only talking about products, not services. I’ve worked for a while in a hotel, then for another while in a call center as a technical advisor, so i’m wary of how scoring a “service experience” can affect the workers, all the more so as service employees are often some of the most precarious workers. I understand the impulse to give a bad rating when you’ve had a bad interaction. But once cast, the score goes into a black box where it will be wielded by employers in ways that the consumer has no control over.

    And look, I don’t have a cogent economic point to make here. I’m sure in many situations this is a good enough proxy that managers can adjust their offer on the basis of it and actually respond to the demand. I can see why this would seem like a reasonable system for achieving that.

    But my experience as a worker is that it so often feels a bit degrading. Good interactions with customers are pleasant just because, you don’t need them celebrated. Getting a good mark for those is like, “ok”. But some of the bad scores really sting. Sometimes it’s just bad faith, bad luck, whatever.

    But it’s not even that. Because while sometimes “streaks” and absolute numbers might get you bonuses, the metrics used by management are usually, like, averages over a week or a month. It’s the sort of atmosphere it creates. It’s just kind of evil, you know? In a cartoonish way. The two workplaces where I worked that were like that, i had really decent managers, pretty friendly. But when it came to “the numbers” they would be like, “hey, you know how it works, my hands are tied, you just really need to fix your average. But heyyy! I know you can do it!” *pats me on the back*

  9. Mabel Lebeau says

    My encounter with these 0-10 polls is as a system of ranking, as related to a scale of relativity such as “On a scale of 0-10 where ‘0’ indicates no pain at all, and ’10’ is the worst pain ever experienced … .” No one knows if I have an intact system of pain sensation, or am over-sensitive compared to other people, for reasons such as underlying anxiety, actual pain, distorted perception of pain, or even if I slept wrong the night before, and have a crick in my neck or a sore thumb that just makes everything else seem a little over the edge when it comes to pain. The only real use of this sort of ranking system is for my own evaluation of pain before a procedure, during it, after anesthetic, after anesthetic wears off, after it’s all over, and then two weeks out.

    When it comes to Press-Ganey service evaluations used in assessing employee pay, I think it’s flatly morally wrong for others to contribute to an evaluation without consideration of any underlying circumstances, when it impacts whether a person is hired or fired. Employees are often told scores contribute to employee performance evaluation, but as co-workers we don’t know why a person was hired to begin with, so how do we know whether a person is doing their job or not? Why can we assign numbers to someone? How do we know whether a number is weighted, or not?

    Heavens! I’ve had enough of questionnaires to know by the end of the first page, I have forgotten what I think a ‘5’ means as compared to a ‘6’. I begin to wonder where it’s all going, the intent. I want to keep working with this particular co-worker so they get all ’10’s and the next worker has been a little too irritating lately, so they get ‘5s’. There is no science in it at all, because science is based on reproducibility. Whenever I begin one of these Press-Ganey evaluation polls, I’m inclined to say, “I pass. Fire me. Whatever. This questionnaire bears no relevance to mine or anyone else’s job performance, and I’m not going to be any part of a lynching.” And, yet, there’s the subsequent thought that if I don’t contribute something to this propaganda, perhaps the basis of the ‘grading system’ will depend on some other ignoramus who hates everyone.

    To assign numbers to personal whims and opinions is much like phrenology, looking for a lucky number ‘7’ as a sign from the heavens, chancing on finding a four-leaf clover, or even as a non-musician ascribing the beauty of a piece of music to a certain sequence of alternating notes.

  10. milu says

    @Mabel Lebeau

    Whenever I begin one of these Press-Ganey evaluation polls, I’m inclined to say, “I pass. […]” And, yet, there’s the subsequent thought that if I don’t contribute something to this propaganda, perhaps the basis of the ‘grading system’ will depend on some other ignoramus who hates everyone.

    In the sort of situation you describe i always give all 10’s, then maybe write actual feedback in the comment field. I also make a point to state my position as conscientious objector to numerical ratings. I realize doing this on an individual basis has virtually no effect, but i’m sentimental about that shred of dignity i guess.
