A Tale of Books and Bias

Books were one of our first methods of data storytelling, beyond speaking or painting. Humans turned data into words, and now we have begun to turn those words back into data. We typically analyze books and their texts for their content or for the effect they have on us. But what about the effect our ratings have on what other readers think of those same books? Might we, as readers, be agents of bias?

I tried to answer that question by analyzing reader behavior as reflected in Goodreads, a social network for readers with more than 125 million users. In this social network, readers can report the books they have read and rate them on a scale from one to five.

Methodology: Web Scraping Goodreads

I built on the Goodreads web scraper by Maria Antoniak to extract data from 42 fantasy and sci-fi book series. The scraped data included the total number and distribution of ratings for each book, as well as the order of each book within its series.

For the purpose of this article, the number of ratings will be treated as the number of readers. As with any self-report measurement, though, there are probably Goodreads users who have read a book without registering or rating it, and many of that book’s readers don’t use any social network to report their ratings at all.

However, today I am going to focus on another type of self-selection bias: the (possible) self-selection bias among Goodreads readers of a book series, that is: which readers continue reading a book series and, consequently, rate it on Goodreads.

To gauge how likely this bias is, I examined the evolution of rating distributions and reader numbers throughout each book series. I define ‘return rate’ as the number of readers of a specific book divided by the number of readers of the first book in that series, expressed as a percentage. Thus, the return rate is always 100 percent for the first book and should never increase from one book to the next within the same series, because it doesn’t make sense to read the third book of a series if you have not read the second one.
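The return rate defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `return_rates` and the rating counts are hypothetical and are not taken from the analysis code or the scraped data.

```python
def return_rates(ratings_counts):
    """Return rate of each book relative to the first book, in percent.

    ratings_counts: number of ratings per book, in series order
    (number of ratings is used as a stand-in for number of readers).
    """
    first = ratings_counts[0]
    if first <= 0:
        raise ValueError("the first book needs at least one rating")
    return [100.0 * n / first for n in ratings_counts]


# Hypothetical rating counts for a three-book series (not real data)
example_counts = [2000, 1000, 500]
example_rates = return_rates(example_counts)
print(example_rates)  # [100.0, 50.0, 25.0]
```

In this made-up series, half of the first book's raters also rated the second book, and a quarter rated the third.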

Results: Second parts are not only good, but better than the first book in the series.

The books included in the analysis, their rating distributions, weighted mean rating, and return rate are shown in Figure 1.

There’s a saying that sequels are never as good as the first book in the series, but in 67 percent of the 42 book series analyzed, that statement seems to be false: the second book has a higher average rating than the first. However, not everyone who rated the first book rated the second one. The mean return rate is 47 percent.

Does this effect propagate to the following books of each series?

I tested that by calculating the weighted mean of the sequels and the return rate between the first and last books of each series. By ‘weighted mean of the sequels’ I mean the average rating of the books from the second to the last one, each weighted by the number of readers of that book. I chose this method because, if on average a book was not liked at all, it’s likely that only those who did like it continued with the series, and the last parts would not reflect the opinion of those who chose not to continue, which is relevant when rating the whole series. By weighting the average by the number of readers, I can avoid that bias.

The result is similar to the first analysis, with sequels rated higher than the first book in 67 percent of the series, although the mean return rate is much lower (28 percent). So, on average, sequels are rated slightly higher than the first parts (by less than 0.10 points on a scale of one to five).
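As a rough sketch of that weighting, the snippet below compares a reader-weighted mean of the sequels with a simple mean. The function name `weighted_sequel_mean` and all numbers are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def weighted_sequel_mean(avg_ratings, ratings_counts):
    """Mean rating of books two onward, weighted by each book's number of ratings.

    avg_ratings: average rating per book, in series order
    ratings_counts: number of ratings per book, in the same order
    """
    sequel_ratings = avg_ratings[1:]   # skip the first book
    sequel_counts = ratings_counts[1:]
    total = sum(sequel_counts)
    if total == 0:
        raise ValueError("the sequels need at least one rating")
    weighted_sum = sum(r * n for r, n in zip(sequel_ratings, sequel_counts))
    return weighted_sum / total


# Hypothetical three-book series (not real data)
example_avgs = [4.0, 4.5, 3.0]
example_counts = [2000, 1000, 500]

weighted = weighted_sequel_mean(example_avgs, example_counts)
simple = sum(example_avgs[1:]) / len(example_avgs[1:])
print(weighted, simple)  # 4.0 3.75
```

Here the second book, with twice as many raters as the third, pulls the weighted mean toward its own rating, so books that fewer people reached count for less.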

However, that average combines positive and negative differences, which cancel each other out. Looking at the absolute differences instead, the mean difference goes up to 0.17 points, with an average increase of 0.17 points for those series whose sequels were rated higher, and an average decrease of 0.16 points for those whose sequels were rated lower.
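The cancellation effect is easy to see with a small example. The sketch below uses invented differences (sequel rating minus first-book rating, one value per series) and a hypothetical helper named `summarize_differences`; none of it comes from the scraped data.

```python
from statistics import mean


def summarize_differences(diffs):
    """Summarize rating differences (sequels minus first book), one per series."""
    increases = [d for d in diffs if d > 0]
    decreases = [d for d in diffs if d < 0]
    return {
        "mean_signed": mean(diffs),
        "mean_absolute": mean([abs(d) for d in diffs]),
        # None when no series falls in that group
        "mean_increase": mean(increases) if increases else None,
        "mean_decrease": mean(decreases) if decreases else None,
    }


# Hypothetical differences for four series (not real data)
example_diffs = [0.5, 0.25, -0.25, -0.5]
summary = summarize_differences(example_diffs)
print(summary)
```

With these values the signed mean is exactly zero even though every series shifted by at least a quarter of a point, which is why the absolute differences are reported separately.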

Were those who kept reading (and rating) the ones who rated the first book higher? Without digging into each user’s profile we cannot know for sure, so that question remains a hypothesis for future analysis. However, the results of this preliminary analysis suggest that this could be the case.

Conclusion: The rating of the first book in the series is the most accurate.

In two-thirds (67 percent) of the fantasy and sci-fi book series I analyzed, the mean rating of the sequels was 0.15 points higher than the mean rating of the first book. I did not conduct any statistical hypothesis tests, so I can only assess the size of this effect in descriptive terms. On a rating scale from one to five, this effect is small, although consistent.

On the other hand, the number of readers decreased as the number of books in the series increased. This effect was stronger for book series with a decrease in ratings (21 percent mean return rate, versus 32 percent in series with an increase in ratings).

In summary, sequels tend to have higher ratings than the first book in the series, but fewer readers. This means that when we look at the rating of a sequel, we might be looking at an overestimation, because mostly the readers who liked the previous part (and who are therefore likely to enjoy the following ones) rated that sequel. If readers who didn’t like the first part had rated the sequels too, would the rating be as high?

Taking these factors into account, if this text were a detective story, readers would be our prime suspect in biasing book data. However, to discover the truth, you will have to wait for a sequel to this analysis.

Celia López

Customer Success Engineer at Plotly

I read code, not minds. I am a psychologist who had an early interest in math. That interest led me to pursue a Master’s in Data Analysis and Research Design, which allowed me to discover the Wonderful World of Programming. I became fluent first in R and then in Python. I spent a couple of years putting that knowledge into practice as a Data Analyst, and now I help users code and understand their own data apps.