The IMDB-250 Data Project is a comprehensive toolkit for scraping, preprocessing, and storing data from IMDB's Top 250 movies list. It aims to facilitate data analysis and insights into trends within top-rated movies. This project includes scripts for data collection (crawl.py), preprocessing (preprocessing.py), and database interaction (db.py), alongside a structured database schema for efficient data storage.
- Data Scraping: Automated collection of movie data from IMDB's Top 250 list.
- Data Preprocessing: Cleaning and formatting the scraped data for analysis.
- Database Integration: Storing and managing the processed data in a structured database.
- Python 3.x
- Required Python libraries:
  beautifulsoup4, pandas, sqlalchemy, requests
- MySQL or SQLite (depending on your setup)
- Clone the repository:
  git clone https://github.com/sanaazz/IMDB-250.git
- Install the required dependencies:
  pip install -r requirements.txt
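After installing, a quick check like the one below can confirm that the listed libraries are importable. This is a minimal sketch and not part of the repository: the `REQUIRED` mapping and the `missing_packages` helper are hypothetical names introduced here. Note that the `beautifulsoup4` package is imported as `bs4`.

```python
from importlib.util import find_spec

# pip package name -> import name (beautifulsoup4 is imported as bs4).
REQUIRED = {
    "beautifulsoup4": "bs4",
    "pandas": "pandas",
    "sqlalchemy": "sqlalchemy",
    "requests": "requests",
}


def missing_packages(required):
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

If anything is reported missing, re-running the `pip install -r requirements.txt` step in the active environment is usually the fix.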
- Data Scraping: Run crawl.py to fetch data from IMDB:
  python crawl.py
- Data Preprocessing: Execute preprocessing.py to clean and prepare the data:
  python preprocessing.py
- Database Setup and Data Storage: Use db.py to create the database schema and insert the preprocessed data:
  python db.py
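To give a feel for how the three steps fit together, here is a small self-contained sketch of a scrape, clean, and store pipeline. It is an illustration under stated assumptions, not the code in crawl.py, preprocessing.py, or db.py: the sample HTML, the field names, the `movies` table, and every function and class name are hypothetical. To run without any installs it uses the standard library's `html.parser` and `sqlite3` in place of beautifulsoup4 and sqlalchemy, and it parses an inline string instead of fetching the live page.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical markup standing in for a fetched page; the real IMDB HTML differs.
SAMPLE_HTML = """
<ul>
  <li class="movie"><span class="rank">1.</span><span class="title"> The Shawshank Redemption </span>
      <span class="year">(1994)</span><span class="rating">9.3</span></li>
  <li class="movie"><span class="rank">2.</span><span class="title">The Godfather</span>
      <span class="year">(1972)</span><span class="rating">9.2</span></li>
</ul>
"""

FIELDS = ("rank", "title", "year", "rating")


class MovieListParser(HTMLParser):
    """Step 1 (scraping): collect the text of each labelled <span> per movie."""

    def __init__(self):
        super().__init__()
        self.movies = []
        self._field = None  # name of the span currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "movie":
            self.movies.append({})
        elif tag == "span" and css_class in FIELDS and self.movies:
            self._field = css_class

    def handle_data(self, data):
        if self._field:
            self.movies[-1][self._field] = data

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def clean_movie(raw):
    """Step 2 (preprocessing): strip decoration and convert strings to typed values."""
    return {
        "ranking": int(raw["rank"].rstrip(".")),
        "title": raw["title"].strip(),
        "year": int(raw["year"].strip("()")),
        "rating": float(raw["rating"]),
    }


def store(conn, movies):
    """Step 3 (storage): create a hypothetical `movies` table and insert the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS movies ("
        "ranking INTEGER PRIMARY KEY, title TEXT NOT NULL, "
        "year INTEGER, rating REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO movies (ranking, title, year, rating) "
        "VALUES (:ranking, :title, :year, :rating)",
        movies,
    )
    conn.commit()


parser = MovieListParser()
parser.feed(SAMPLE_HTML)
movies = [clean_movie(raw) for raw in parser.movies]

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
store(conn, movies)
rows = conn.execute(
    "SELECT ranking, title, year, rating FROM movies ORDER BY ranking"
).fetchall()
print(rows)
```

Using `ranking` as the primary key with `INSERT OR REPLACE` means re-running the storage step overwrites existing rows instead of duplicating them, which suits a list that is re-scraped periodically.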
Contributions, issues, and feature requests are welcome! Feel free to check the issues page.
Distributed under the AGPL-3.0 License. See LICENSE for more information.
- IMDB for providing an extensive dataset of top-rated movies.