A Wagtail (Django) project that scrapes BBC Innovation/Technology headlines and publishes articles as Wagtail pages. Includes a styled list page with pagination.
- Python 3.13 (project uses a local venv)
- Git (optional)
- Node is NOT required
Create and activate a virtual environment
- Windows PowerShell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
Install dependencies
pip install -r requirements.txt
Create the database and a superuser
python manage.py migrate
python manage.py createsuperuser
Run the development server
python manage.py runserver
Visit http://127.0.0.1:8000/ for the site and http://127.0.0.1:8000/admin/ for the Wagtail admin (log in with the superuser you created).
- In the admin, create a new page of type “News List Page” at the root.
- Publish it. Its URL will display the styled list with pagination.
The project includes a management command to fetch and publish articles under your News List Page.
- Fetch and publish:
python manage.py scrape_news
What it does:
- Scrapes https://www.bbc.com/innovation/technology for headlines.
- Follows each headline to extract title, date, and summary (the parser uses fallbacks so it tolerates moderate HTML changes).
- Creates child pages of type NewsArticle under your first NewsListPage.
- Skips duplicates by title.
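The "skips duplicates by title" step can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the ScrapedArticle container, the function names, and the whitespace/case normalization are all assumptions made for the example (the real command may compare titles exactly).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScrapedArticle:
    """Hypothetical container for one scraped headline."""
    title: str
    date: str
    summary: str


def normalize_title(title: str) -> str:
    """Collapse whitespace and ignore case so near-identical titles match.

    This normalization is an assumption of the sketch.
    """
    return " ".join(title.split()).casefold()


def select_new_articles(scraped, existing_titles):
    """Return the scraped articles whose titles are not already published.

    `existing_titles` stands in for the titles of the NewsArticle pages
    already under the list page. Duplicates inside `scraped` are dropped too.
    """
    seen = {normalize_title(t) for t in existing_titles}
    fresh = []
    for article in scraped:
        key = normalize_title(article.title)
        if not key or key in seen:
            continue  # skip empty titles and duplicates
        seen.add(key)
        fresh.append(article)
    return fresh
```

In a command like scrape_news, only the articles returned by a filter of this kind would go on to be created as child pages.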
Use the provided bash scripts on Linux/macOS or WSL to create an hourly cron job. These scripts are idempotent and add/remove a marked block in your user crontab.
- Make scripts executable (first time only):
chmod +x scripts/register-hourly-scraper.sh scripts/remove-hourly-scraper.sh
- Register hourly job (defaults to top of the hour; logs to logs/cron-fetch.log):
# If using a virtualenv inside the project:
./scripts/register-hourly-scraper.sh --python .venv/bin/python
# Or rely on python3 on PATH:
# ./scripts/register-hourly-scraper.sh
- Remove the cron job:
./scripts/remove-hourly-scraper.sh
Notes:
- The job runs python manage.py scrape_news from the project root.
- Override schedule with --schedule "5 * * * *" (run at minute 5 every hour).
- Logs are written to logs/cron-fetch.log.
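To show what "idempotent" and "marked block" mean here, the following Python sketch edits crontab text as a string. It is an illustration only: the marker comments, function names, and the exact job line are assumptions, and the real bash scripts may format things differently.

```python
import shlex

# Hypothetical marker lines; the real scripts may use different text.
BEGIN = "# >>> hourly-scraper (hypothetical marker) >>>"
END = "# <<< hourly-scraper (hypothetical marker) <<<"


def remove_block(crontab: str) -> str:
    """Drop the marker lines and everything between them."""
    kept, inside = [], False
    for line in crontab.splitlines():
        if line.strip() == BEGIN:
            inside = True
        elif line.strip() == END:
            inside = False
        elif not inside:
            kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")


def register_block(crontab: str, project_dir: str,
                   python: str = ".venv/bin/python",
                   schedule: str = "0 * * * *") -> str:
    """Return crontab text containing exactly one scraper block.

    Any existing block is removed first, so running this twice gives the
    same result as running it once.
    """
    job = (
        f"{schedule} cd {shlex.quote(project_dir)} && "
        f"{shlex.quote(python)} manage.py scrape_news "
        f">> logs/cron-fetch.log 2>&1"
    )
    return remove_block(crontab) + "\n".join([BEGIN, job, END]) + "\n"
```

Because registration always strips the old block before appending a new one, changing the schedule is just a matter of registering again with a different value.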
The News List Page template displays articles with accessible pagination controls. Page size is defined in the backend and can be adjusted in NewsListPage.get_context if needed.
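The page-size logic can be pictured with a small plain-Python sketch. It is not the code in NewsListPage.get_context: the function name, the default page size of 10, and the policy of clamping out-of-range page numbers are assumptions for illustration. In a Django project this job is normally done by django.core.paginator.Paginator.

```python
import math


def paginate(items, page_param, page_size=10):
    """Return (page_items, page_number, num_pages) for a 1-based page request.

    `page_param` is the raw value of a ?page= query parameter (may be None).
    `page_size=10` is a hypothetical default, not the project's setting.
    """
    num_pages = max(1, math.ceil(len(items) / page_size))
    try:
        number = int(page_param)
    except (TypeError, ValueError):
        number = 1  # missing or non-numeric value falls back to the first page
    number = min(max(number, 1), num_pages)  # clamp out-of-range requests
    start = (number - 1) * page_size
    return items[start:start + page_size], number, num_pages
```

Changing the page size then means changing a single number, which matches the idea of adjusting it in one place in the backend.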
- news/models.py: Wagtail page models (NewsListPage, NewsArticle).
- news/templates/news/news_list_page.html: List page template.
- wagtailTask/static/css/wagtailTask.css: Global styles (news list styles are scoped under .news-list).
- news/scraper/bbc_scraper.py: Scraper with resilient parsing.
- news/management/commands/scrape_news.py: Management command to run the scraper and publish pages.
- Static files not styling the page? Ensure DEBUG=True (default in dev) and the base template includes {% load static %} and the stylesheet link. This project already does this via wagtailTask/templates/base.html.
- No NewsListPage found when running the command: Create and publish one in the Wagtail admin first.
- SSL or connection errors when scraping: Re-run the command later; the scraper has small retries and fallbacks.
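For context on the "small retries" mentioned above, here is a minimal standard-library sketch of retrying a fetch with a growing delay. It does not reproduce bbc_scraper.py: the function name, attempt count, delays, and the injectable opener and sleep parameters are assumptions chosen to keep the example self-contained.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retries(url, attempts=3, delay=1.0,
                       opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch `url`, retrying on connection-level errors.

    Waits delay, then 2 * delay, and so on between attempts, and re-raises
    the last error if every attempt fails. All defaults are hypothetical.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            with opener(url, timeout=10) as response:
                return response.read()
        except (urllib.error.URLError, TimeoutError, ConnectionError) as exc:
            last_error = exc
            if attempt < attempts:
                sleep(delay * attempt)  # back off a little more each time
    raise last_error
```

If errors persist after the retries are exhausted, the failure is probably on the network or the remote site, which is why re-running the command later is the suggested fix.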
This repository is for learning/demo purposes.