Wagtail News Scraper Demo

A Wagtail (Django) project that scrapes BBC Innovation/Technology headlines and publishes articles as Wagtail pages. Includes a styled list page with pagination.

Prerequisites

Python 3.13 (project uses a local venv)
Git (optional)
Node is NOT required

Quick start

Create and activate a virtual environment
- Windows PowerShell
```
python -m venv .venv
.\.venv\Scripts\Activate.ps1
```
Install dependencies
```
pip install -r requirements.txt
```

Create the database and a superuser
```
python manage.py migrate
python manage.py createsuperuser
```

Run the development server
```
python manage.py runserver
```
Visit http://127.0.0.1:8000/ and http://127.0.0.1:8000/admin/ (log in with the superuser you created).

Wagtail setup

In the admin, create a new page of type “News List Page” at the root.
Publish it. Its URL will display the styled list with pagination.

Scraping BBC articles

The project includes a management command to fetch and publish articles under your News List Page.

Fetch and publish:
```
python manage.py scrape_news
```

What it does:

Scrapes https://www.bbc.com/innovation/technology for headlines.
Follows each headline to extract title, date, and summary, using fallbacks intended to tolerate some HTML changes.
Creates child pages of type NewsArticle under your first NewsListPage.
Skips duplicates by title.
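
The duplicate check in the last step can be sketched as below. This is a minimal illustration, not the project's actual code: `ScrapedArticle`, `normalize_title`, and `select_new_articles` are hypothetical names, and the whitespace/case normalization is an assumption (the command may compare titles exactly).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScrapedArticle:
    """Hypothetical container for one scraped article (not the project's model)."""
    title: str
    date: str
    summary: str


def normalize_title(title: str) -> str:
    """Collapse whitespace and ignore case so near-identical titles compare equal."""
    return " ".join(title.split()).casefold()


def select_new_articles(scraped, existing_titles):
    """Return the scraped articles whose titles are not already published."""
    seen = {normalize_title(title) for title in existing_titles}
    fresh = []
    for article in scraped:
        key = normalize_title(article.title)
        if not key or key in seen:
            continue  # empty or duplicate title: skip it
        seen.add(key)  # also guards against duplicates within a single run
        fresh.append(article)
    return fresh
```

In a Wagtail command, `existing_titles` would presumably come from the titles of the pages already under the News List Page.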

Automate scraping hourly (cron)

Use the provided bash scripts on Linux/macOS or WSL to create an hourly cron job. These scripts are idempotent and add/remove a marked block in your user crontab.

Make scripts executable (first time only):

```
chmod +x scripts/register-hourly-scraper.sh scripts/remove-hourly-scraper.sh
```

Register hourly job (defaults to top of the hour; logs to logs/cron-fetch.log):

```
# If using a virtualenv inside the project:
./scripts/register-hourly-scraper.sh --python .venv/bin/python

# Or rely on python3 on PATH:
# ./scripts/register-hourly-scraper.sh
```

Remove the cron job:

```
./scripts/remove-hourly-scraper.sh
```

Notes:

The job runs: python manage.py scrape_news from the project root.
Override schedule with: --schedule "5 * * * *" (run at minute 5 every hour).
Logs are written to logs/cron-fetch.log.
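
To make the schedule syntax concrete, here is a small sketch of how a five-field cron expression such as "5 * * * *" selects run times. It is a simplified illustration, not how cron or the provided scripts are implemented: `field_matches` and `cron_matches` are hypothetical helpers that only understand `*` and comma-separated integers (no ranges, steps, or names), and "0 * * * *" is assumed to be what "top of the hour" means.

```python
from datetime import datetime


def field_matches(field: str, value: int) -> bool:
    """Match one cron field; supports only '*' and comma-separated integers."""
    if field == "*":
        return True
    return value in {int(part) for part in field.split(",")}


def cron_matches(schedule: str, moment: datetime) -> bool:
    """Return True if `moment` falls on the given five-field schedule."""
    minute, hour, day, month, weekday = schedule.split()
    # cron counts Sunday as 0; Python's isoweekday() counts Sunday as 7.
    cron_weekday = moment.isoweekday() % 7
    return (
        field_matches(minute, moment.minute)
        and field_matches(hour, moment.hour)
        and field_matches(day, moment.day)
        and field_matches(month, moment.month)
        and field_matches(weekday, cron_weekday)
    )


# Minute 5 of every hour matches 13:05 but not 13:06.
print(cron_matches("5 * * * *", datetime(2024, 1, 1, 13, 5)))
print(cron_matches("5 * * * *", datetime(2024, 1, 1, 13, 6)))
```

Real cron treats day-of-month and day-of-week specially when both are restricted; this sketch does not model that, which does not matter for hourly schedules where both are `*`.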

Pagination

The News List Page template displays articles with accessible pagination controls. Page size is defined in the backend and can be adjusted in NewsListPage.get_context if needed.
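
The page-slicing logic behind such controls can be sketched in plain Python as below. This is an illustration under assumptions, not the code in NewsListPage.get_context: `paginate` is a hypothetical helper, the page size of 10 is made up, and a Django project would more typically rely on django.core.paginator.Paginator for the same job.

```python
import math


def paginate(items, page_number, per_page=10):
    """Return (page_items, current_page, total_pages), clamping out-of-range pages."""
    if per_page < 1:
        raise ValueError("per_page must be at least 1")
    total_pages = max(1, math.ceil(len(items) / per_page))
    try:
        page = int(page_number)
    except (TypeError, ValueError):
        page = 1  # missing or non-numeric ?page= value falls back to page 1
    page = min(max(page, 1), total_pages)  # clamp into the valid range
    start = (page - 1) * per_page
    return items[start:start + per_page], page, total_pages


# 25 articles at 10 per page give 3 pages; the last page holds 5 items.
articles = [f"Article {n}" for n in range(1, 26)]
page_items, current, total = paginate(articles, "3")
print(len(page_items), current, total)
```

Changing the page size then amounts to changing the `per_page` value passed in.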

Project layout

news/models.py: Wagtail page models (NewsListPage, NewsArticle).
news/templates/news/news_list_page.html: List page template.
wagtailTask/static/css/wagtailTask.css: Global styles (news list styles are scoped under .news-list).
news/scraper/bbc_scraper.py: Scraper with resilient parsing.
news/management/commands/scrape_news.py: Management command to run the scraper and publish pages.

Troubleshooting

Static files not styling the page? Ensure DEBUG=True (default in dev) and that the base template includes {% load static %} and the stylesheet link. This project already does so in wagtailTask/templates/base.html.
No NewsListPage found when running the command: Create and publish one in the Wagtail admin first.
SSL or connection errors when scraping: Re-run the command later; the scraper retries a few times and has fallbacks.
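
For readers curious what a small retry loop around a network call might look like, here is a hedged sketch. It is not the scraper's actual implementation: `call_with_retries`, the attempt count, and the linear backoff are all assumptions chosen for illustration.

```python
import time


def call_with_retries(fetch, attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch() up to `attempts` times, pausing between failed tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except OSError as error:  # covers ConnectionError, TimeoutError, ssl.SSLError
            last_error = error
            if attempt < attempts:
                # Linear backoff: wait longer after each failed attempt.
                sleep(delay_seconds * attempt)
    raise last_error
```

The `sleep` parameter exists only so the pause can be swapped out in tests; in normal use the default `time.sleep` applies. If the final attempt also fails, the last error is re-raised so the command can report it.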

License

This repository is for learning/demo purposes.
