The film industry has evolved immensely in the last few decades. It has become a multi-billion dollar entertainment industry that creates hundreds, if not thousands, of films every year. Movies are an easily accessible form of entertainment that people of all ages can enjoy. However, only a few films achieve success and come to be highly regarded. Factors that affect a movie’s success include its cast, release date, and genre, among others.
Understanding these factors and how they affect a movie’s success can help stakeholders predict whether it will succeed or fail. While no single formula can definitively decide the outcome, analyzing data from previous movies can drive decision-making in beneficial ways.
Movies have been an enormous source of entertainment for moviegoers and profits for producers and other financial stakeholders. It is, therefore, essential to make use of technology that can predict the success or failure of a movie. Movie analytics can enable producers and directors to influence the success rate of a movie by making informed choices for its marketing and other such variables. There are many movie analytics products available for studios and producers that can efficiently predict the success of movies by analyzing many data points.
Movie Prediction using Data Analytics
Making a movie takes a lot of time and money; it is a massive undertaking. Thus, predicting whether or not it is worth the investment is crucial. Data mining and movie analytics are potent tools for making these predictions. The movie datasets are first preprocessed to get them ready for interpretation and analysis, and data mining techniques are then applied to them.
Data analytics can be used to create a model that can predict the success of a movie. Predicting the success or failure of a movie involves steps for movie analytics that include data collection, data analysis, and model building. This data can then be collated and interpreted, often with visual representations, to get a clear picture of the prediction.
Data Collection
Model building rests on historical data that is used to train the model. Datasets for movies are hosted online by multiple organizations such as imdb.com, boxoffice.com, etc. The dataset used in this blog contains 375,377 movies ranging from 1970 to 2020. This particular dataset has the following features:
Feature Name | Description |
ID | A unique movie ID for every movie |
Title | Title or name of the movie |
Original Title | Alternate movie title |
Release Date | Date of release of the movie |
Budget | A budget (in dollars) |
Revenue | Revenue earned (in dollars) |
Popularity | The community popularity of a movie based on page views, downloads, votes, and other activities |
Runtime | The runtime of a movie in minutes |
Vote Average | Average user rating ranging from 1 to 10 |
Vote Count | Number of voters for a movie |
Adult | Boolean variable: True for films with adult content, False for others |
Status | Status of the movie: in production, released or cancelled |
Genre | Action, drama, comedy, suspense, thriller, etc. |
Production Companies | Name of production house or company |
Production Countries | Countries of release |
Certification US | Rating in US markets |
To categorize a given film as a hit or flop, the system uses decision tree classifiers to create a decision tree for the provided input tuple that comprises various attribute values such as the director’s name, primary cast (up to three actors), movie genre, budget, and so on.
To train and test the classifier, we use data from over 5000 movies, which we collected from IMDb. The CSV file is the input to the classifier algorithm, which produces the class label as a result. The classifier is expected to reach at least 85% accuracy.
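Training and testing on separate portions of the data can be sketched as follows. This is a minimal illustration in plain Python, not the system's actual code: the field names, the 80/20 split, the made-up records, and the one-rule `toy_classifier` (a stand-in for a trained model) are all assumptions.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and split them into train and test lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(classify, rows, label_key="success"):
    """Fraction of rows whose predicted label matches the true label."""
    correct = sum(1 for row in rows if classify(row) == row[label_key])
    return correct / len(rows)

# Hypothetical stand-in for a trained classifier: a single budget rule.
def toy_classifier(row):
    return row["budget"] >= 50

# Made-up records (budget in millions), for illustration only.
movies = [{"budget": b, "success": b >= 40} for b in range(10, 110, 10)]

train, test = train_test_split(movies, test_fraction=0.2)
print(f"train size: {len(train)}, test size: {len(test)}")
print(f"held-out accuracy: {accuracy(toy_classifier, test):.2f}")
```

Measuring accuracy only on rows the model never saw during training is what makes a figure such as 85% meaningful.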
Decision Tree
The ID3 algorithm is applied to the dataset to construct a decision tree, which is then used to classify the movie entered by the user. Once the tree is built, the class label for the entered movie is predicted. Moreover, the administrator will be able to make changes to the system, including changing the selected attributes, adding, editing, or removing details from datasets, editing user profiles, and so on.
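The core of ID3 is to split, at each node, on the attribute with the highest information gain. The sketch below shows that idea in plain Python under some assumptions: the attributes are categorical (a numeric attribute such as budget is bucketed into "high" and "low" here), the six records are made up, and the dictionary layout of the tree is an illustrative choice.

```python
import math
from collections import Counter

def entropy(rows, target):
    """Shannon entropy (in bits) of the target labels in rows."""
    counts = Counter(row[target] for row in rows)
    total = len(rows)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting rows on attribute."""
    total = len(rows)
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row for row in rows if row[attribute] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(rows, target) - remainder

def id3(rows, attributes, target):
    """Build a tree recursively; leaves are class labels, nodes are dicts."""
    labels = [row[target] for row in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:
        return majority
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    node = {"attribute": best, "branches": {}, "default": majority}
    for value in {row[best] for row in rows}:
        subset = [row for row in rows if row[best] == value]
        node["branches"][value] = id3(subset, remaining, target)
    return node

def predict(tree, row):
    """Walk the tree; fall back to the node's majority label for unseen values."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["attribute"]], tree["default"])
    return tree

# Made-up training records, for illustration only.
movies = [
    {"genre": "action", "budget": "high", "label": "hit"},
    {"genre": "action", "budget": "low", "label": "hit"},
    {"genre": "drama", "budget": "high", "label": "hit"},
    {"genre": "drama", "budget": "low", "label": "flop"},
    {"genre": "comedy", "budget": "high", "label": "flop"},
    {"genre": "comedy", "budget": "low", "label": "flop"},
]

tree = id3(movies, ["genre", "budget"], "label")
print(predict(tree, {"genre": "drama", "budget": "high"}))
```

In this toy data, genre separates the classes better than budget, so it becomes the root; budget is only consulted inside the drama branch.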
Data Preprocessing
Once data collection and collation are done, the next step is data preprocessing. This involves modifying the raw dataset into a suitable or meaningful form that can be used for classification. This step also helps determine the various features useful for the classification of data. Data preprocessing can be further divided into several steps that are described below:
a) Data Cleaning
Data cleaning focuses on three features: budget, revenue, and runtime. Values are missing for around 75% of the movies in the dataset, and almost 100,000 movies have a null runtime. Rows with missing or invalid values are deleted.
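Dropping incomplete rows can be sketched as below. The source does not define "invalid", so this sketch assumes that a missing, non-numeric, NaN, zero, or negative value in budget, revenue, or runtime disqualifies a row; the lowercase field names and sample records are also assumptions.

```python
import math

# Fields that must be present and positive for a row to be kept (assumed rule).
REQUIRED = ("budget", "revenue", "runtime")

def is_valid(row):
    """Return True only if every required field holds a positive number."""
    for key in REQUIRED:
        value = row.get(key)
        if value is None or isinstance(value, str):
            return False
        if math.isnan(value) or value <= 0:
            return False
    return True

def clean(rows):
    """Keep only the rows that pass the validity check."""
    return [row for row in rows if is_valid(row)]

# Made-up records, for illustration only.
raw = [
    {"id": 1, "budget": 1_000_000, "revenue": 2_500_000, "runtime": 95},
    {"id": 2, "budget": None, "revenue": 500_000, "runtime": 110},
    {"id": 3, "budget": 2_000_000, "revenue": 0, "runtime": 120},
    {"id": 4, "budget": 3_000_000, "revenue": 9_000_000, "runtime": None},
    {"id": 5, "budget": 500_000, "revenue": 400_000, "runtime": 88},
]

cleaned = clean(raw)
print(f"kept {len(cleaned)} of {len(raw)} rows")
```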
b) Feature Extraction
Feature extraction involves creating new features using existing ones. For example, the feature named Success can be created using movie revenue and budget. The value is True if the movie’s revenue is more than double its budget; otherwise, it is False. Similarly, the feature Release_Date can be used to extract a new feature called Month. Feature extraction is performed to keep relevant features and remove unnecessary information. It also improves processing times, because a smaller set of simpler features requires less complex analysis.
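The two derived features can be computed as in the sketch below, which follows the rule stated above (Success is True when revenue is more than double the budget). The dictionary keys and the ISO `YYYY-MM-DD` date format are assumptions about how the records are stored.

```python
from datetime import date

def add_features(movie):
    """Return a copy of the record with Success and Month added."""
    enriched = dict(movie)
    # Success: revenue strictly greater than twice the budget.
    enriched["Success"] = movie["Revenue"] > 2 * movie["Budget"]
    # Month: extracted from the release date (assumed ISO format).
    enriched["Month"] = date.fromisoformat(movie["Release_Date"]).month
    return enriched

# Made-up record, for illustration only.
example = {
    "Title": "Example Film",
    "Budget": 10_000_000,
    "Revenue": 25_000_000,
    "Release_Date": "2015-07-17",
}

print(add_features(example))
```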
c) Feature Selection
Feature selection is based on the correlation coefficients between continuous variables. A high correlation coefficient means the corresponding variables are redundant, so both are not required. To handle this, either one of the features can be removed, or both can be combined into one. The correlation coefficient is low for all pairs of variables except Vote_Count and Popularity, for which it is 0.67. To decide how to handle this pair, graphs plotting Popularity vs. Year and Revenue vs. Year are generated. The graphs below show the Average Popularity by Year and Average Revenue by Year.

On examining these graphs, it can be concluded that the graph plotting popularity seems highly skewed towards the latest movies. On the other hand, the graph plotting revenue does not show any specific pattern with Popularity. Hence, popularity is not a required feature for classification and can be removed.
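The correlation check behind this selection step can be sketched as follows. The Pearson coefficient is computed for every pair of continuous columns, and pairs above a cutoff are flagged as candidates for removal or merging. The 0.6 cutoff and the sample values are assumptions for illustration; the source only reports that Vote_Count and Popularity stood out at 0.67.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlated_pairs(columns, threshold=0.6):
    """List column pairs whose absolute correlation meets the threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, round(r, 2)))
    return pairs

# Made-up column values, for illustration only.
columns = {
    "Vote_Count": [10, 20, 30, 40, 50],
    "Popularity": [1.0, 2.5, 2.0, 4.5, 4.0],
    "Runtime": [100, 90, 120, 95, 105],
}

for a, b, r in correlated_pairs(columns):
    print(f"{a} and {b} are strongly correlated (r = {r})")
```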
All in all, the features below were removed from the dataset as these were not required to predict the movie’s success.
Feature Name | Reason to Remove |
Original Title | An alternate title is not required |
Release Date | Changed to year |
Popularity | It showed a high skewness towards latest movies |
Status | Value is "released" for all the movies |
Genres | Replaced with a single genre |
Production Companies | Showed very high variance |
Production Countries | Replaced with a single country |
Adult | Films with Adult status were removed |
Final Dataset:
After all the preprocessing, the final dataset consisted of the below features:
Role | Feature Name | Description |
Identifiers | ID | Unique ID for each movie |
Predictors | Title | Movie name |
Predictors | Budget | Money used for the production of the movie |
Predictors | Runtime | Complete movie time |
Predictors | Year | Year in which movie was released |
Predictors | Vote_Average | Average community score |
Predictors | Vote_Count | Number of votes |
Predictors | Genre | Significant genre |
Predictors | Country | Significant country of production |
Predictors | Certification_US | Rating to determine the age suitability of the viewer |
Target Variable | Success | True if the revenue is more than double the budget |
Data Analysis
Before generating the final model, the final dataset needs to be analyzed. The analysis can be done using a scatter plot, as shown below. The images show the relation between continuous features:

On examining the plots above, it can be seen that the budget and revenue values are concentrated at the low end of their ranges (a right-skewed distribution, with a long tail toward high values), which implies that most of the movies in the dataset were made with a low budget and generated a low revenue in return. The values for the feature Year are concentrated at the high end (a left-skewed distribution), indicating that most movies in the dataset were recently released.
The plot for Runtime shows a loose relationship that is neither increasing nor decreasing. Furthermore, the scatter plots for Revenue vs. Year and Budget vs. Year show an increasing relationship. This indicates that more recent movies tend to have had more money spent on them and, in turn, to have generated more revenue.
In this way, the success of an upcoming movie can be predicted from the patterns found in data from previous years. The analysis above also brings out the relationships between the different numeric attributes used as predictor variables.
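The direction of skew can also be checked numerically rather than read off a plot. The sketch below computes the moment-based skewness coefficient, which is positive when the long tail points toward high values and negative when it points toward low values. The sample budgets and years are made up for illustration.

```python
def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Made-up values: mostly small budgets with one very large one.
budgets = [1, 1, 2, 2, 3, 3, 4, 50]
# Made-up values: mostly recent release years with one old one.
years = [1975, 2010, 2015, 2016, 2017, 2018, 2019, 2020]

print(f"budget skewness: {skewness(budgets):+.2f}")
print(f"year skewness:   {skewness(years):+.2f}")
```

A positive result for budgets and a negative one for years matches a dataset dominated by low-budget and recently released movies.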
Conclusion
This methodology can be used by anyone to predict whether a movie will succeed or fail. Using movie analytics, users can identify and analyze the characteristics of films at different levels of detail. These prediction techniques help stakeholders make informed decisions before releasing a movie and can significantly impact a movie’s success or failure.