- Download the principals dataset, the name dataset, and the general movies dataset.
- Extract the files using 7-Zip, WinRAR, or gzip. Rename the IMDb dataset to `Imdb_Movie_Dataset.csv`.
- Move the files into a folder called `movie-data` in the root directory of the project.
- Install the Python packages with `pip install -r requirements.txt` (note: there are many of them).
- Run the following Python notebooks in order:
  - `combine-data.ipynb` - Extracts the data, combines it with the cast and crew data, and writes the result to a CSV file (produces `combined_df.csv`).
  - `pivot-data.ipynb` - Calculates a score for each actor and crew member based on how much money their movies made, then averages those scores per movie and writes them to a CSV file (produces `movies_with_scores.csv`). It also writes a much simpler pivot of the data, meant only to make the data easier to inspect (produces `basic_pivot_df.csv`).
  - `encode-data.ipynb` - Drops imdb_id, overview, tagline, production_companies, keywords, and status. Splits the release date into day, month, and year. One-hot encodes adult, original language, genres, production countries, and spoken languages. Saves everything in one dataframe and writes it to a CSV file (produces `movies_encoded.csv`).
  - `analysis.ipynb` - Displays statistics on the data such as counts, averages, standard deviations, and correlations. Calculates profit and unprofitability scores, and correlations with non-profitability. Also calculates baseline predictions, which reach 64% accuracy. Saves the final dataset containing the highest correlations as a CSV file (produces `cleaned_analysis_data.csv`).
- To run one of the machine learning models, open the corresponding notebook and run every cell.

| File | Model | Accuracy |
|---|---|---|
| gaussian-naive-bayes.ipynb | GaussianNB | 45% |
| svm.ipynb | Support Vector Machine | 53% |
| baseline.ipynb | Baseline confusion-matrix predictions on 50/50 and 80/20 splits | 64% |
| regression.ipynb | Logistic regression | 65% |
| KNN.ipynb | K-nearest neighbor | 70% |
| decision_tree.ipynb | Decision Tree model | 75% |
| neural.ipynb | Basic neural network | 78% |
| forest.ipynb | Random forest model | 84% |
| gradient_boost.ipynb | Two gradient boosting models | 84% |
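
The 64% baseline row is the number the other accuracies should be read against. This README does not say how `baseline.ipynb` derives it. One common choice, assumed here purely for illustration, is to always predict the most frequent class. The function name and the toy labels below are hypothetical and are not taken from the project or its data.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    if not labels:
        raise ValueError("labels must not be empty")
    # most_common(1) returns [(label, count)] for the most frequent label.
    _, count = Counter(labels).most_common(1)[0]
    return count / len(labels)


# Made-up toy labels (1 = profitable, 0 = not profitable); not real data.
toy_labels = [1, 1, 1, 0, 0]
baseline = majority_baseline_accuracy(toy_labels)
print(f"Majority-class baseline accuracy: {baseline:.0%}")
```

Under this assumption, a model that scores below the baseline (for example the 45% and 53% rows) does worse than always guessing the most common outcome.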

To run the Streamlit application, run `streamlit run demo.py` in a terminal. This should start the application on your local machine and open it in your default web browser.
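
The data-preparation notebooks listed in the steps above have to run in a fixed order, so it may be convenient to script them instead of opening each one. The following is a minimal sketch, assuming `jupyter` and `nbconvert` are installed (this README does not confirm that `requirements.txt` includes them) and that the script is run from the folder containing the notebooks. The helper names `build_command` and `run_all` are hypothetical.

```python
import subprocess

# Notebook order as given in the setup steps of this README.
NOTEBOOKS = [
    "combine-data.ipynb",
    "pivot-data.ipynb",
    "encode-data.ipynb",
    "analysis.ipynb",
]


def build_command(notebook):
    """Build the nbconvert command that executes a notebook in place."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--inplace",
        notebook,
    ]


def run_all(notebooks=NOTEBOOKS):
    """Run each notebook in order, stopping at the first failure."""
    for notebook in notebooks:
        print(f"Running {notebook} ...")
        # check=True raises CalledProcessError if a notebook fails,
        # so later notebooks never run on incomplete outputs.
        subprocess.run(build_command(notebook), check=True)
```

Calling `run_all()` executes the four notebooks one after another. Stopping on the first error matters here because each notebook reads the CSV file produced by an earlier one.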