What does IMDb stand for?

IMDb (the Internet Movie Database) is an online database of information related to audiovisual content.

What is the IMDb rating system?

The IMDb rating system is a way of ordering audiovisual content by a score generated through the votes of its web users.

What type of database is IMDb?

IMDb’s principal data is about movies: They store the title, year, gross, duration, genre, and other common characteristics.

What is the purpose of IMDb?

IMDb’s purpose is to be the biggest, principal encyclopedia of audiovisual content.

Technology10 minute read

Stars Realigned: Improving the IMDb Rating System

IMDb ratings have genre bias: For example, dramas tend to score higher. Removing common feature bias and keeping unique characteristics, it’s possible to create a new, refined score based on IMDb information.

authors are vetted experts in their fields and write on topics in which they have demonstrated experience. All of our content is peer reviewed and validated by Toptal experts in the same field.

Juan Manuel Ortiz de Zarate

Verified Expert in Engineering

Juan (MSc, computer science) is a data science/AI PhD student. As a senior web developer, his main expertise includes R, Python, and PHP.

Expertise

Data Science Artificial Intelligence Machine Learning

PREVIOUSLY AT

Movie watchers sometimes use rankings to select what to watch. Once doing this myself, I noticed that many of the best-ranked movies belonged to the same genre: drama. This made me think that the ranking could have some kind of genre bias.

I was on one of the most popular sites for movie lovers, IMDb, which covers movies from all over the world and from any year. Its famous ranking is based on a huge collection of reviews. For this IMDb data analysis, I decided to download all the information available there to analyze it and try to create a new, refined ranking that would consider a wider range of criteria.

The IMDb Rating System: Filtering IMDb’s Data

I was able to download information on 242,528 movies released between 1970 and 2019 inclusive. The information that IMDb gave me for each one was: Rank, Title, ID, Year, Certificate, Rating, Votes, Metascore, Synopsis, Runtime, Genre, Gross, and SearchYear.

To have enough information to analyze, I needed a minimum number of reviews per movie, so the first thing that I did was to filter movies with less than 500 reviews. This resulted in a set of 33,296 movies, and in the next table, we could see a summary analysis of its fields:

Field	Type	Null Count	Mean	Median
Rank	Factor	0
Title	Factor	0
ID	Factor	0
Year	Int	0	2003	2006
Certificate	Factor	17587
Rating	Int	0	6.1	6.3
Votes	Int	0	21040	2017
Metascore	Int	22350	55.3	56
Synopsis	Factor	0
Runtime	Int	132	104.9	100
Genre	Factor	0
Gross	Factor	21415
SearchYear	Int	0	2003	2006

Note: In R, Factor refers to strings. Rank and Gross are that way in the original IMDb dataset due to having, for example, thousands of separators.

Before starting to refine the score, I had to further analyze this dataset. For starters, the fields Certificate, Metascore, and Gross had more than 50% of null values so they aren’t useful. Rank depends intrinsically on Rating (the variable to refine), therefore, it doesn’t bear any useful information. The same is true with ID in that it’s a unique identifier for each movie.

Finally, Title and Synopsis are short text fields. It could be possible to use them through some NLP technique, but because it’s a limited amount of text, I decided not to take them into account for this task.

After this first filter, I was left with Genre, Rating, Year, Votes, SearchYear, and Runtime. In the Genre field, there was more than one genre per movie, separated by commas. So to capture the additive effect of having many genres, I transformed it using one-hot encoding. This resulted in 22 new boolean fields—one for each genre—with a value of 1 if the movie had this genre or 0 otherwise.

IMDb Data Analysis

To see the correlations between variables, I calculated the correlation matrix.

A correlation matrix among all the remaining original columns and the new genre columns. Numbers close to zero result in blank spaces in the grid. Negative correlations result in red dots and positive correlations in blue dots. The dots are larger and darker the stronger the correlation is. (Visual highlights are described in the main article text.)

Here, a value close to 1 represents a strong positive correlation, and values close to -1 a strong negative correlation. By this graph, I made many observations:

Year and SearchYear are absolutely correlated. This means that they probably have the same values and that having both is the same as having only one, so I kept only Year.
Some fields had expected positive correlations, such as:
- Music with Musical
- Action with Adventure
- Animation with Adventure
Same for negative correlations:
- Drama vs. Horror
- Comedy vs. Horror
- Horror vs. Romance
Related to the key variable (Rating) I noticed:
- It has a positive and important correlation with Runtime and Drama.
- It has a lower correlation with Votes, Biography, and History.
- It has a considerably negative correlation with Horror and a lower negative one with Thriller, Action, Sci-Fi, and Year.
- It doesn’t have any other significant correlations.

It seemed to be that long dramas were well-rated, while short horror movies weren’t. In my opinion—I didn’t have the data to check it—it didn’t correlate with the kind of movies that generate more profits, like Marvel or Pixar movies.

It could be that the people who vote on this site are not the best representative of the general people criterion. It makes sense because those who take the time to submit reviews on the site are probably some sort of movie critics with a more specific criterion. Anyway, my objective was to remove the effect of common movie features, so I tried to remove this bias in the process.

Genre Distribution in the IMDb Rating System

The next step was to analyze the distribution of each genre over the rating. To do that, I created a new field called Principal_Genre based on the first genre that appeared in the original Genre field. To visualize this, I made a violin graph.

A violin plot showing the rating distribution for each genre.

One more time, I could see that Drama correlates with high ratings and Horror with lower. However, this graph also revealed other genres as having good scores: Biography and Animation. That their correlations didn’t appear in the previous matrix was probably because there were too few movies with these genres. So next I created a frequency bar plot by genre.

A bar graph showing how many movies of each genre were in the database. Comedy, Drama, and Action had frequencies around 6,000 or above; Crime and Horror were above 2,000; the rest were under 1,000.

Effectively, Biography and Animation had very few movies, as did Sport and Adult. For this reason, they are not very well correlated with Rating.

Other Variables in the IMDb Rating System

After that, I started to analyze the continuous covariables: Year, Votes, and Runtime. In the scatter plot, you can see the relation between Rating and Year.

As we saw previously, Year seemed to have a negative correlation with Rating: As the year increases, the rating variance also increases, reaching more negative values on newer movies.

Next, I made the same plot for Votes.

Here, the correlation was clearer: the higher the number of votes, the higher the ranking. However, most of the movies had not so many votes, and in this case, Rating had a bigger variance.

Lastly, I looked at the relationship with Runtime.

A scatter plot between rating and runtime.

Again, we have a similar pattern but even stronger: Higher runtimes mean higher ratings, but there were very few cases for high runtimes.

After all this analysis, I had a better idea of the data I was dealing with, so I decided to test some models to predict the ratings based on these fields. My idea was that the difference between my best model predictions and the real Rating would remove the common features’ influence and reflect the particular characteristics that make a movie better than others.

I started with the simplest model, the linear one. To evaluate which model performed better, I observed the root-mean-square (RMSE) and mean absolute (MAE) errors. They are standard measures for this kind of task. Also, they are on the same scale as the predicted variable, so they are easy to interpret.

In this first model, RMSE was 1.03, and MAE 0.78. But linear models suppose independence over the errors, a median of zero, and constant variance. If this is correct, the “residual vs. predicted values” graph should look like a cloud without structure. So I decided to graph it to corroborate that.

Residual vs. predicted values scatterplot.

I could see that up to 7 in the predicted values, it had a non-structured shape, but after this value, it has a clear linear descent shape. Consequently, the model suppositions were bad, and also, I had an “overflow” on the predicted values because in reality, Rating can’t be more than 10.

In the previous IMDb data analysis, with a higher amount of Votes, the Rating improved; however, this happened in a few cases and for a huge amount of votes. This could cause distortions in the model and produce this Rating overflow. To check this, I evaluated what would happen with this same model, removing the Votes field.

Residual vs. predicted values scatterplot when the Votes field is removed.

This was much better! It had a clearer, non-structured shape without overflow-predicted values. The Votes field also depends on reviewer activity and is not a feature of films, so I decided to drop this field as well. The errors after removing it were 1.06 on RMSE and 0.81 on MAE—a little worse, but not so much, and I preferred to have better suppositions and feature selection than a little better performance on my training set.

IMDb Data Analysis: How Well Do Other Models Work?

The next thing I did was to try different models to analyze which performed better. For each model, I used the random search technique to optimize hyperparameter values and 5-fold cross-validation to prevent model bias. In the following table are the estimated errors obtained:

Model	RMSE	MAE
Neural Network	1.044596	0.795699
Boosting	1.046639	0.7971921
Inference Tree	1.05704	0.8054783
GAM	1.0615108	0.8119555
Linear Model	1.066539	0.8152524
Penalized Linear Reg	1.066607	0.8153331
KNN	1.066714	0.8123369
Bayesian Ridge	1.068995	0.8148692
SVM	1.073491	0.8092725

As you can see, all models perform similarly, so I used some of them to analyze a little more data. I wanted to know the influence of each field over the rating. The simplest way to do that is by observing the parameters of the linear model. But to avoid distortions on them previously, I had scaled the data and then retrained the linear model. The weights were as pictured here.

A bar graph of linear model weights ranging from nearly -0.25 for Horror to nearly 0.25 for Drama.

In this graph, it’s clear that two of the most important variables are Horror and Drama, where the first has a negative impact on the rating and the second a positive. There are also other fields that impact positively—like Animation and Biography—while Action, Sci-Fi, and Year impact negatively. Moreover, Principal_Genre does not have a considerable impact, so it’s more important which genres a movie has than which one is the principal.

With the generalized additive model (GAM), I could also see a more detailed impact for the continuous variables, which in this case was the Year.

A graph of Year vs. s(Year) using the generalized additive model. The s(Year) value follows a curve starting up near 0.6 for 1970, bottoming out below 0 at 2010, and increasing to near 0 again by 2019.

Here, we have something more interesting. While it was true that for recent movies, the rating tended to be lower, the effect was not constant. It has the lowest value in 2010 and then it appears to “recover.” It would be intriguing to find out what happened after that year in movie production that could have produced this change.

The best model was neural networks, which had the lowest RMSE and MAE, but as you can see, no model reached perfect performance. But this was not bad news in terms of my objective. The information available let me estimate the performance somewhat well, but it is not enough. There is some other information that I couldn’t get from IMDb that is making Rating differ from the expected score based on Genre, Runtime, and Year. It may be actor performance, movie scripts, photography, or many other things.

From my perspective, these other characteristics are what really matters in selecting what to watch. I don’t care if a given movie is a drama, action, or science fiction. I want it to have something special, something that makes me have a good time, makes me learn something, makes me reflect on reality, or just entertains me.

So I created a new, refined rating by taking the IMDb rating and subtracting the predicted rating of the best model. By doing this, I was removing the effect of the Genre, Runtime, and Year and keeping this other unknown information that is much more important to me.

IMDb Rating System Alternative: The Final Results

Let’s see now which are the 10 best movies by my new rating vs. by the real IMDb rating:

IMDb

Title	Genre	IMDb Rating	Refined Rating
Ko to tamo peva	Adventure,Comedy,Drama	8.9	1.90
Dipu Number 2	Adventure,Family	8.9	3.14
El señor de los anillos: El retorno del rey	Adventure,Drama,Fantasy	8.9	2.67
El señor de los anillos: La comunidad del anillo	Adventure,Drama,Fantasy	8.8	2.55
Anbe Sivam	Adventure,Comedy,Drama	8.8	2.38
Hababam Sinifi Tatilde	Adventure,Comedy,Drama	8.7	1.66
El señor de los anillos: Las dos torres	Adventure,Drama,Fantasy	8.7	2.46
Mudras Calling	Adventure,Drama,Romance	8.7	2.34
Interestelar	Adventure,Drama,Sci-Fi	8.6	2.83
Volver al futuro	Adventure,Comedy,Sci-Fi	8.5	2.32

Mine

Title	Genre	IMDb Rating	Refined Rating
Dipu Number 2	Adventure,Family	8.9	3.14
Interestelar	Adventure,Drama,Sci-Fi	8.6	2.83
El señor de los anillos: El retorno del rey	Adventure,Drama,Fantasy	8.9	2.67
El señor de los anillos: La comunidad del anillo	Adventure,Drama,Fantasy	8.8	2.55
Kolah ghermezi va pesar khale	Adventure,Comedy,Family	8.1	2.49
El señor de los anillos: Las dos torres	Adventure,Drama,Fantasy	8.7	2.46
Anbe Sivam	Adventure,Comedy,Drama	8.8	2.38
Los caballeros de la mesa cuadrada	Adventure,Comedy,Fantasy	8.2	2.35
Mudras Calling	Adventure,Drama,Romance	8.7	2.34
Volver al futuro	Adventure,Comedy,Sci-Fi	8.5	2.32

As you can see, the podium didn’t change radically. This was expected because the RMSE was not so high, and here we are watching the top. Let’s see what happened with the bottom 10:

IMDb

Title	Genre	IMDb Rating	Refined Rating
Holnap történt - A nagy bulvárfilm	Comedy,Mystery	1	-4.86
Cumali Ceber: Allah Seni Alsin	Comedy	1	-4.57
Badang	Comedy,Fantasy	1	-4.74
Yyyreek!!! Kosmiczna nominacja	Comedy	1.1	-4.52
Proud American	Drama	1.1	-5.49
Browncoats: Independence War	Action,Sci-Fi,War	1.1	-3.71
The Weekend It Lives	Comedy,Horror,Mystery	1.2	-4.53
Bolívar: el héroe	Animation,Biography	1.2	-5.34
Rise of the Black Bat	Action,Sci-Fi	1.2	-3.65
Hatsukoi	Drama	1.2	-5.38

Mine

Title	Genre	IMDb Rating	Refined Rating
Proud American	Drama	1.1	-5.49
Santa and the Ice Cream Bunny	Family,Fantasy	1.3	-5.42
Hatsukoi	Drama	1.2	-5.38
Reis	Biography,Drama	1.5	-5.35
Bolívar: el héroe	Animation,Biography	1.2	-5.34
Hanum & Rangga: Faith & The City	Drama,Romance	1.2	-5.28
After Last Season	Animation,Drama,Sci-Fi	1.7	-5.27
Barschel - Mord in Genf	Drama	1.6	-5.23
Rasshu raifu	Drama	1.5	-5.08
Kamifûsen	Drama	1.5	-5.08

The same thing happened here, but now we can see that more dramas appear in the refined case than in IMDb’s, which shows that some dramas could be over-ranked just for being dramas.

Maybe the most interesting podium to see is the 10 movies with the greatest difference between the IMDb rating system’s score and my refined one. These movies are the ones that have more weight on their unknown characteristics and make the movie much better (or worse) than expected for its known features.

Title	IMDb Rating	Refined Rating	Difference
Kanashimi no beradonna	7.4	-0.71	8.11
Jesucristo Superstar	7.4	-0.69	8.09
Pink Floyd The Wall	8.1	0.03	8.06
Tenshi no tamago	7.6	-0.42	8.02
Jibon Theke Neya	9.4	1.52	7.87
El baile	7.8	0.00	7.80
Santa and the Three Bears	7.1	-0.70	7.80
La alegre historia de Scrooge	7.5	-0.24	7.74
Piel de asno	7	-0.74	7.74
1776	7.6	-0.11	7.71

If I were a movie director and had to produce a new movie, after doing all this IMDb data analysis, I could have a better idea of what kind of movie to make to have a better IMDb ranking. It would be a long animated biography drama that would be a remake of an old movie—for example, Amadeus. Probably this would assure a good IMDb ranking, but I’m not sure about profits…

What do you think about the movies that rank in this new measure? Do you like them? Or do you prefer the original ones? Let me know in the comments below!

Understanding the basics

What does IMDb stand for?
IMDb (the Internet Movie Database) is an online database of information related to audiovisual content.
What is the IMDb rating system?
The IMDb rating system is a way of ordering audiovisual content by a score generated through the votes of its web users.
What type of database is IMDb?
IMDb’s principal data is about movies: They store the title, year, gross, duration, genre, and other common characteristics.
What is the purpose of IMDb?
IMDb’s purpose is to be the biggest, principal encyclopedia of audiovisual content.

Juan Manuel Ortiz de Zarate

Verified Expert in Engineering

Ciudad de Buenos Aires, Buenos Aires, Argentina

Member since November 6, 2019

About the author

Juan (MSc, computer science) is a data science/AI PhD student. As a senior web developer, his main expertise includes R, Python, and PHP.

authors are vetted experts in their fields and write on topics in which they have demonstrated experience. All of our content is peer reviewed and validated by Toptal experts in the same field.

Expertise

Data Science Artificial Intelligence Machine Learning

PREVIOUSLY AT

Hire Juan

Stars Realigned: Improving the IMDb Rating System

Juan Manuel Ortiz de Zarate

Expertise

The IMDb Rating System: Filtering IMDb’s Data

IMDb Data Analysis

Genre Distribution in the IMDb Rating System

Other Variables in the IMDb Rating System

IMDb Data Analysis: How Well Do Other Models Work?

IMDb Rating System Alternative: The Final Results

Understanding the basics

What does IMDb stand for?

What is the IMDb rating system?

What type of database is IMDb?

What is the purpose of IMDb?

Tags

Juan Manuel Ortiz de Zarate

About the author

Expertise

Juan Manuel Ortiz de Zarate

Using an LLM API As an Intelligent Virtual Assistant for Python Development

Toptal Developers

Stars Realigned: Improving the IMDb Rating System

ByJuan Manuel Ortiz de Zarate

Expertise

The IMDb Rating System: Filtering IMDb’s Data

IMDb Data Analysis

Genre Distribution in the IMDb Rating System

Other Variables in the IMDb Rating System

IMDb Rating System Refinements

IMDb Data Analysis: How Well Do Other Models Work?

IMDb Rating System Alternative: The Final Results

Understanding the basics

What does IMDb stand for?

What is the IMDb rating system?

What type of database is IMDb?

What is the purpose of IMDb?

Tags

About the author

Expertise

Toptal Developers

Juan Manuel Ortiz de Zarate