Movie Analytics

How to predict the success of a movie using data analytics?

The film industry has evolved immensely in the last few decades. It has become a multi-billion dollar entertainment industry that produces hundreds, if not thousands, of films every year. Movies are an easily accessible form of entertainment that people of all ages can enjoy. However, only a few films achieve success and come to be highly regarded. Factors that affect the success of a movie include its cast, release date, and genre, among others.

Understanding these factors and how they affect a movie’s success can help stakeholders predict whether it will succeed or fail. While no single formula can definitively decide the outcome, analyzing data from previous movies can inform decision-making in beneficial ways.

Movies have been an enormous source of entertainment for moviegoers and profits for producers and other financial stakeholders. It is, therefore, essential to make use of technology that can predict the success or failure of a movie. Movie analytics can enable producers and directors to influence the success rate of a movie by making informed choices for its marketing and other such variables. There are many movie analytics products available for studios and producers that can efficiently predict the success of movies by analyzing many data points.

Movie Prediction using Data Analytics

Making a movie takes a lot of time and money; it is a massive undertaking. Thus, predicting whether or not it is worth the investment is crucial. Data mining and movie analytics are potent tools for making these predictions. Movie datasets are first preprocessed to get them ready for interpretation and analysis, and data mining techniques are then applied to them.

Data analytics can be used to create a model that can predict the success of a movie. Predicting the success or failure of a movie involves steps for movie analytics that include data collection, data analysis, and model building. This data can then be collated and interpreted, often with visual representations, to get a clear picture of the prediction.

Data Collection

Model building rests on historical data that is used to train a machine learning model. Movie datasets are hosted online by multiple organizations. The dataset used in this blog contains 375,377 movies released between 1970 and 2020. This particular dataset has the following features:

Feature Name: Description
ID: A unique movie ID for every movie
Title: Title or name of the movie
Original Title: Alternate movie title
Release Date: Date of release of the movie
Budget: Budget (in dollars)
Revenue: Revenue earned (in dollars)
Popularity: The community popularity of a movie based on page views, downloads, votes, and other activities
Runtime: The runtime of a movie in minutes
Vote Average: Average user rating ranging from 1 to 10
Vote Count: Number of voters for a movie
Adult: Boolean variable that is True for films with adult content and False for others
Status: Status of the movie (in production, released, or cancelled)
Genres: Action, drama, comedy, suspense, thriller, etc.
Production Companies: Name of production house or company
Production Countries: Countries of release
Certification US: Rating in US markets

To categorize a given film as a hit or flop, the system uses a decision tree classifier. It builds a decision tree for the input tuple, which comprises attribute values such as the director’s name, primary cast (up to three actors), movie genre, budget, and so on.

To train and test the classifier, we use data from over 5,000 movies collected from IMDb. A CSV file of this data is the input to the classifier algorithm, which produces the class label as its result. We expect the classifier to reach at least 85% accuracy.
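As a rough sketch of how training, testing, and an accuracy figure fit together: the labeled rows are split into a training part and a held-out test part, and accuracy is the share of test predictions that match the true labels. The function names, the 80/20 split, and the toy rows below are assumptions for illustration, not the system's actual code.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predicted, actual):
    """Fraction of predictions that match the true class labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical labeled rows: (attribute values, class label).
rows = [({"genre": "Action"}, "hit")] * 6 + [({"genre": "Drama"}, "flop")] * 4
train, test = train_test_split(rows, test_fraction=0.2, seed=42)

# Three of these four hypothetical predictions match the true labels.
score = accuracy(["hit", "flop", "hit", "hit"], ["hit", "flop", "flop", "hit"])
```

An accuracy target such as 85% would be checked against the held-out test rows only, never against the rows used for training.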

Decision Tree

The ID3 algorithm is applied to the dataset to construct a decision tree, which is then used to predict the class label for the entered movie. Moreover, the administrator can make changes to the system, including changing the selected attributes, adding, editing, or removing details from datasets, editing user profiles, and so on.
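To make the ID3 step more concrete, here is a minimal sketch using only the Python standard library. ID3 picks, at each node, the categorical attribute with the highest information gain (the largest drop in entropy). The attribute names, the toy rows, and the nested-dict tree layout are assumptions for illustration; continuous attributes such as budget would first need to be binned into categories.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy reduction from splitting the rows on one categorical attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def id3(rows, labels, attributes):
    """Build a tree as nested dicts; a leaf is a plain class label."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:
        return majority
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in {row[best] for row in rows}:
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        branches[value] = id3([r for r, _ in subset], [l for _, l in subset], remaining)
    return {"attribute": best, "branches": branches, "default": majority}

def predict(tree, row):
    """Walk the tree; unseen attribute values fall back to the node's majority label."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["attribute"]], tree["default"])
    return tree

# Hypothetical training tuples with already-binned attributes.
rows = [
    {"genre": "Action", "budget": "high"},
    {"genre": "Action", "budget": "low"},
    {"genre": "Drama", "budget": "high"},
    {"genre": "Drama", "budget": "low"},
]
labels = ["hit", "hit", "flop", "flop"]

tree = id3(rows, labels, ["genre", "budget"])
label = predict(tree, {"genre": "Action", "budget": "low"})
```

In this toy data, genre separates the two classes perfectly, so it becomes the root split and budget is never used.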

Data Preprocessing

Once data collection and collation are done, the next step is data preprocessing. This involves modifying the raw dataset into a suitable or meaningful form that can be used for classification. This step also helps determine the various features useful for the classification of data. Data preprocessing can be further divided into several steps that are described below:

a) Data Cleaning

Data cleaning for this prediction focuses on the budget, revenue, and runtime features. Around 75% of the movies in the dataset have missing values, and almost 100,000 movies have a null runtime. Rows with missing or invalid values are deleted.
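A minimal sketch of this cleaning step, using plain Python dictionaries so that it stays self-contained. The lowercase column names and the rule that zero or negative values count as invalid are assumptions; the original post does not define what "invalid" means.

```python
def is_valid_number(value):
    """True for a positive number; None, empty strings, NaN, and text are invalid."""
    try:
        return float(value) > 0
    except (TypeError, ValueError):
        return False

def clean(rows, required=("budget", "revenue", "runtime")):
    """Keep only the rows where every required column holds a valid number."""
    return [row for row in rows if all(is_valid_number(row.get(col)) for col in required)]

# Hypothetical raw rows with typical defects.
movies = [
    {"title": "A", "budget": 1000000, "revenue": 2500000, "runtime": 110},
    {"title": "B", "budget": None, "revenue": 500000, "runtime": 95},    # missing budget
    {"title": "C", "budget": 2000000, "revenue": 0, "runtime": 120},     # zero revenue
    {"title": "D", "budget": 300000, "revenue": 900000, "runtime": ""},  # empty runtime
]

cleaned = clean(movies)
```

Only movie "A" survives here. On a real dataset, the same filter is usually expressed with a dataframe library rather than list comprehensions.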

b) Feature Extraction

Feature extraction involves creating new features using existing ones. For example, the feature named Success can be created using movie revenue and budget. The value is True if the movie’s revenue is more than double its budget; otherwise, it is False. Similarly, the feature Release_Date can be used to extract a new feature called Month. Feature extraction is performed to keep relevant features and remove unnecessary information. It also improves processing times, because categorical features need more complex analysis.
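The two derived features described here can be sketched as follows. The lowercase field names and the ISO date format (YYYY-MM-DD) are assumptions about how the raw data is stored.

```python
from datetime import date

def add_features(movie):
    """Return a copy of the movie with the derived Success and Month features."""
    enriched = dict(movie)
    # Success: revenue is more than double the budget.
    enriched["Success"] = movie["revenue"] > 2 * movie["budget"]
    # Month: numeric month (1 to 12) taken from the release date.
    enriched["Month"] = date.fromisoformat(movie["release_date"]).month
    return enriched

# Hypothetical movie record.
movie = {
    "title": "Example",
    "budget": 1_000_000,
    "revenue": 2_500_000,
    "release_date": "2015-07-17",
}

features = add_features(movie)
```

A movie that earns exactly double its budget is labeled False under this rule, because the comparison is strict.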

c) Feature Selection

Feature selection is based on the correlation coefficient values between continuous variables. A high correlation coefficient means the corresponding variables are redundant, so both are not required. To handle this, either one of the features can be removed, or both can be combined into one. The correlation coefficient is low for all pairs of variables except Vote_Count and Popularity, for which it is 0.67. Graphs plotting Popularity vs. Year and Revenue vs. Year are generated to address this. The graphs below show the Average Popularity by Year and Average Revenue by Year.

Movie Analytics Graphs

On examining these graphs, it can be concluded that the graph plotting popularity seems highly skewed towards the latest movies. On the other hand, the graph plotting revenue does not show any specific pattern with Popularity. Hence, popularity is not a required feature for classification and can be removed.
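The redundancy check used in this section can be sketched with a hand-written Pearson correlation. The toy columns and the 0.6 cut-off are assumptions chosen for illustration; the post itself only reports that 0.67 between Vote_Count and Popularity was the one notable value.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)

def redundant_pairs(columns, threshold):
    """Return (name_a, name_b, r) for every pair whose |r| reaches the threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs

# Hypothetical continuous features for four movies.
columns = {
    "Vote_Count": [10, 20, 30, 40],
    "Popularity": [1.0, 2.1, 2.9, 4.2],
    "Runtime": [100, 120, 95, 115],
}

pairs = redundant_pairs(columns, threshold=0.6)
```

Each flagged pair is a candidate for dropping one feature or merging the two, which is the decision the graphs above help to make.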

All in all, the features below were removed from the dataset as these were not required to predict the movie’s success.

Feature Name: Reason to Remove
Original Title: An alternate title is not required
Release Date: Changed to year
Popularity: It showed a high skewness towards the latest movies
Status: The value was "released" for all the movies
Genres: Replaced with a single genre
Production Companies: Showed very high variance
Production Countries: Replaced with a single country
Adult: Films with Adult status were removed

Final Dataset:

After all the preprocessing, the final dataset consisted of the below features:


Feature Name: Description
ID: Unique ID for each movie
Title: Movie name
Budget: Money used for the production of the movie
Runtime: Complete movie time
Year: Year in which the movie was released
Vote Average: Average community score
Vote Count: Average number of votes
Genre: Significant genre
Production Country: Significant country of production
Certification US: Rating to determine the age suitability of the viewer

Target Variable

Success: True if the revenue is more than double the budget

Data Analysis

Before generating the final model, the final dataset needs to be analyzed. The analysis can be done using a scatter plot, as shown below. The images show the relation between continuous features:

Scatter Plot

On examining the bar plots in the figure above, it can be seen that the budget and revenue values are concentrated at the low end of their ranges (a right-skewed distribution), which implies that most of the movies in the dataset were made with a low budget and generated a low revenue in return. The values for the year feature are concentrated at the high end (a left-skewed distribution), indicating that most movies in the dataset were recently released.

The plot for Runtime shows a loose relationship that is neither increasing nor decreasing. Furthermore, the scatter plots for Revenue vs. Year and Budget vs. Year show an increasing relationship. This indicates that more recent movies tend to have more money spent on them and, in turn, tend to generate more revenue.
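A quick numeric check can back up what the plots suggest about where values are concentrated. The sketch below computes the skewness coefficient (the third standardized moment) on small made-up samples. A positive value means most values sit at the low end with a long tail toward high values; a negative value means the opposite. The sample numbers are assumptions, not values from the dataset.

```python
def skewness(values):
    """Population skewness: the third central moment divided by the variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical budgets (in millions): mostly small, with one very large value.
budgets = [1, 1, 2, 2, 3, 50]
# Hypothetical release years: mostly recent, with one old film.
years = [1975, 2015, 2016, 2018, 2019, 2020]

budget_skew = skewness(budgets)  # positive: values cluster at the low end
year_skew = skewness(years)      # negative: values cluster at the high end
```

Strongly skewed features such as budget and revenue are often log-transformed before modeling, although the post does not say whether that was done here.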

In this way, the success of an upcoming movie can be predicted from previous datasets and patterns, using movie analytics on previous years’ data. In so doing, a relationship between the different numeric attributes and predictor variables emerges.


This methodology can be used by anyone to predict whether a movie will succeed or fail. Using movie analytics, users can identify and analyze the characteristics of films at different levels of detail. These prediction techniques help stakeholders make informed decisions before releasing a movie and can significantly impact a movie’s success or failure.
