Using natural language processing and deep feature embeddings to recommend subreddits.
```
├── README.md
├── data
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── models             <- Trained and serialized models.
│
├── src
│   ├── __init__.py
│   │
│   ├── data           <- Scripts to download and generate data.
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling.
│   │
│   └── models         <- Scripts to train models.
│
├── tests
│
├── requirements.txt
│
└── tox.ini
```
Create a virtual environment and install the dependencies.

```shell
virtualenv env
source env/bin/activate
pip install -r requirements.txt
```

Create a file called `.env` in the root of the project directory with your Reddit API keys in the format below. Downloading the data is quite slow, so the download script will multithread across as many keys as you provide.
```
CLIENT_0=api_key:api_id
CLIENT_1=api_key:api_id
```

Run the data extraction scripts.

```shell
python src/data/make_subreddit_list.py
python src/data/download_reddit_data.py
```
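The one-key-per-thread setup described above can be sketched roughly as follows. This is an illustration only, not the project's actual code: the function names and the assumption that each worker is handed one `(api_key, api_id)` pair are mine.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def load_reddit_clients():
    """Collect (api_key, api_id) pairs from CLIENT_0, CLIENT_1, ... variables."""
    clients = []
    i = 0
    while (value := os.environ.get(f"CLIENT_{i}")) is not None:
        api_key, api_id = value.split(":", 1)
        clients.append((api_key, api_id))
        i += 1
    return clients


def download_all(subreddits, fetch):
    """Spread subreddit downloads over one worker per available API client.

    `fetch(client, subreddit)` is a hypothetical stand-in for whatever call
    actually pulls one subreddit's data with a given key pair.
    """
    clients = load_reddit_clients()
    # One thread per client; subreddits are dealt out round-robin.
    with ThreadPoolExecutor(max_workers=len(clients)) as pool:
        futures = [
            pool.submit(fetch, clients[i % len(clients)], sub)
            for i, sub in enumerate(subreddits)
        ]
        return [f.result() for f in futures]
```

More keys in `.env` therefore mean more concurrent workers, which is why the download speeds up with each extra key pair.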