This is a simple IMDB rating classifier application that penalizes reviews in accordance with a pre-defined ruleset.
The application scrapes data from IMDB and adjusts the rating system according to some specific validation rules (review penalization).
The data is scraped from the IMDB charts API using the BeautifulSoup library.
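As a rough illustration of the parsing step, the sketch below runs BeautifulSoup over an inline HTML snippet. The markup, the CSS class names, and the `parse_chart` helper are assumptions made for the example; they are not IMDB's real page structure or this project's actual code.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a stand-in for one chart row, not IMDB's real HTML.
SAMPLE_HTML = """
<ul>
  <li class="movie">
    <span class="rank">1</span>
    <a class="title" href="/title/tt0111161/">The Shawshank Redemption</a>
    <span class="year">1994</span>
    <span class="rating">9.2</span>
    <span class="votes">2,223,000</span>
  </li>
</ul>
"""


def parse_chart(html: str) -> list[dict[str, str]]:
    """Extract one dict of raw string fields per movie row."""
    soup = BeautifulSoup(html, "html.parser")
    movies = []
    for row in soup.select("li.movie"):
        link = row.select_one("a.title")
        movies.append(
            {
                "rank": row.select_one(".rank").get_text(strip=True),
                "title": link.get_text(strip=True),
                "year": row.select_one(".year").get_text(strip=True),
                "rating": row.select_one(".rating").get_text(strip=True),
                "votes": row.select_one(".votes").get_text(strip=True),
                "url": link["href"],
            }
        )
    return movies


movies = parse_chart(SAMPLE_HTML)
print(movies[0]["title"])
```

Every value comes back as a string at this stage, which matches the quoted values in the example payload; type conversion is a separate step.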
The data structure of the parsed and normalized payload is as follows (example):
{
    "rank": "1",
    "title": "The Shawshank Redemption",
    "year": "1994",
    "rating": "9.2",
    "votes": "2,223,000",
    "url": "/title/tt0111161/",
    "oscars_won": 0,
    "penalized": false
}
We then extract the following fields into a dataframe:
- rank (int)
- title (str)
- year (int)
- rating (float)
- votes (int)
- url (str)
- oscars_won (int)
- penalized (bool)
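The conversion from the string-valued payload to the typed fields listed above could look roughly like the following. The `Movie` class and its `from_raw` method are hypothetical names chosen for this sketch, not the project's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Movie:
    """Typed record mirroring the field list (class name is hypothetical)."""

    rank: int
    title: str
    year: int
    rating: float
    votes: int
    url: str
    oscars_won: int
    penalized: bool

    @classmethod
    def from_raw(cls, raw: dict) -> "Movie":
        """Convert the string-valued payload into typed fields."""
        return cls(
            rank=int(raw["rank"]),
            title=raw["title"],
            year=int(raw["year"]),
            rating=float(raw["rating"]),
            # Vote counts arrive with thousands separators, e.g. "2,223,000".
            votes=int(str(raw["votes"]).replace(",", "")),
            url=raw["url"],
            oscars_won=int(raw["oscars_won"]),
            penalized=bool(raw["penalized"]),
        )


raw = {
    "rank": "1",
    "title": "The Shawshank Redemption",
    "year": "1994",
    "rating": "9.2",
    "votes": "2,223,000",
    "url": "/title/tt0111161/",
    "oscars_won": 0,
    "penalized": False,
}
movie = Movie.from_raw(raw)
print(movie)
```

A list of such records can be handed to a dataframe constructor once each one has been converted.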
Using dataclasses, we can then preprocess the data against a schema definition.
The rules are as follows:
schema = {
    "rank": {
        "type": "int",
        "min": 1,
        "max": 250,
        "required": True,
    },
    "title": {
        "type": "str",
        "required": True,
    },
    "year": {
        "type": "int",
        "min": 1900,
        "max": 2023,
        "required": True,
    },
    "rating": {
        "type": "float",
        "min": 0.0,
        "max": 10.0,
        "required": True,
    },
    "votes": {
        "type": "int",
        "min": 0,
        "required": True,
    },
    "url": {
        "type": "str",
        "required": True,
    },
    "oscars_won": {
        "type": "int",
        "min": 0,
        "required": True,
    },
    "penalized": {
        "type": "bool",
        "required": True,
    },
}
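One way to apply such a schema to a record is sketched below. The `validate` function, the `TYPES` mapping, and the error message wording are assumptions made for illustration; the project's own validation code may differ. The schema is abridged to four fields to keep the example short.

```python
from typing import Any

# Abridged copy of the schema; the remaining fields follow the same shape.
SCHEMA = {
    "rank": {"type": "int", "min": 1, "max": 250, "required": True},
    "title": {"type": "str", "required": True},
    "rating": {"type": "float", "min": 0.0, "max": 10.0, "required": True},
    "penalized": {"type": "bool", "required": True},
}

# Maps the schema's type names to Python types.
TYPES = {"int": int, "float": float, "str": str, "bool": bool}


def validate(record: dict[str, Any], schema: dict[str, dict]) -> list[str]:
    """Return a list of violations; an empty list means the record is valid."""
    errors = []
    for field, rule in schema.items():
        if field not in record:
            if rule.get("required", False):
                errors.append(f"{field}: missing required field")
            continue
        value = record[field]
        expected = TYPES[rule["type"]]
        # bool is a subclass of int, so only accept it where the schema says "bool".
        is_stray_bool = isinstance(value, bool) and expected is not bool
        if is_stray_bool or not isinstance(value, expected):
            errors.append(
                f"{field}: expected {rule['type']}, got {type(value).__name__}"
            )
            continue
        if "min" in rule and value < rule["min"]:
            errors.append(f"{field}: {value} is below the minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            errors.append(f"{field}: {value} is above the maximum {rule['max']}")
    return errors


good = {"rank": 1, "title": "The Shawshank Redemption", "rating": 9.2, "penalized": False}
bad = {"rank": 251, "title": "The Shawshank Redemption", "penalized": False}
print(validate(good, SCHEMA))
print(validate(bad, SCHEMA))
```

Collecting all violations instead of raising on the first one makes it easier to report every problem in a scraped row at once.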
- Python>=3.10
- BeautifulSoup4
- requests
- pytest
- tox
- click
- pre-commit
- flake8
- black
- isort
and more...
For development purposes:
Clone the repository
foo@bar:~$ git clone git@github.com:marouenes/imdb-rating-classifier.git
Create a virtual environment
foo@bar:~/imdb-rating-classifier$ virtualenv .venv
Activate the virtual environment
foo@bar:~/imdb-rating-classifier$ source .venv/bin/activate
Install the dev dependencies
foo@bar:~/imdb-rating-classifier$ pip install -r requirements-dev.txt
Install the pre-commit hooks
foo@bar:~/imdb-rating-classifier$ pre-commit install
For usage:
Install the dependencies and build the wheel
foo@bar:~/imdb-rating-classifier$ pip install -e .
The application is published on PyPI and can be installed using pip:
foo@bar:~$ pip install imdb-rating-classifier
- Display the help message and the available commands
foo@bar:~$ imdb-rating-classifier generate --help
Usage: imdb-rating-classifier generate [OPTIONS]
Generate the output dataset containing both the original and adjusted
An extra JSON file will be generated alongside the csv file
--output FILE The path to the output file.
--number-of-movies INTEGER The number of movies to scrape.
-h, --help Show this message and exit.
- Run the application with the default number of movies (20) and the default output file (data.csv)
imdb-rating-classifier generate
- Run the application with a specific number of movies
imdb-rating-classifier generate --number-of-movies 100
- Run the application with a specific number of movies and a specific output file
imdb-rating-classifier generate --number-of-movies 100 --output some_name.csv
- Run tests and pre-commit hooks
foo@bar:~/imdb-rating-classifier$ tox
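The help text for `generate` states that a JSON file is written alongside the CSV output. A minimal sketch of that dual output using only the standard library follows. The `write_outputs` helper is hypothetical, and naming the JSON file after the CSV file's stem is an assumption, since the help text does not say how the JSON file is named.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_outputs(rows: list[dict], csv_path: Path) -> Path:
    """Write rows to csv_path and to a JSON file sharing its stem."""
    # Assumption: the JSON file reuses the CSV file name with a .json suffix.
    json_path = csv_path.with_suffix(".json")
    fieldnames = list(rows[0]) if rows else []
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    json_path.write_text(json.dumps(rows, indent=2), encoding="utf-8")
    return json_path


rows = [{"rank": 1, "title": "The Shawshank Redemption", "rating": 9.2}]

# Write into a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "data.csv"
    json_path = write_outputs(rows, csv_path)
    csv_text = csv_path.read_text(encoding="utf-8")
    json_rows = json.loads(json_path.read_text(encoding="utf-8"))

print(csv_text)
```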
The application is automatically packaged and distributed to PyPI. It is also tested automatically with GitHub Actions, using tox as the environment orchestrator.
- Add more tests
- Add more validation rules
- Add more documentation
- Add more features!
- Add a readthedocs page
- Describe code in readthedocs
- Publish the package on PyPI
- Add Oscar awards or nominations for the movies
- Add a version switch for the CLI
MIT License