A Retrieval-Augmented Generation (RAG) system that ingests news articles, extracts their content, and allows users to query information about recent news events.
This application combines several components to create a powerful news query system:
- Kafka Consumer: Ingests links to news articles in real-time.
- Content Extraction: Scrapes and cleans HTML content from news sites.
- Vector Database: Stores article content with embeddings for semantic search.
- LLM Integration: Uses OpenAI models to provide accurate answers based on retrieved content.
- REST API: Provides a query endpoint for users to ask questions about news.
The system uses a multi-layered approach to extract content from news websites:
- Multiple CSS selectors to target different article layouts.
- CAPTCHA and error detection to gracefully handle blocked sites.
- HTTP error code handling (401, 403, 429) to manage access restrictions.
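The layered strategy above can be sketched as follows. This is a minimal, dependency-free illustration: the container patterns, block markers, and length heuristic are assumptions, and the real system would apply its CSS selectors through an HTML parser such as cheerio rather than regexes.

```typescript
// Sketch of the layered extraction strategy. Marker lists and container
// patterns are illustrative assumptions, not the project's actual lists.
const BLOCK_MARKERS = [/captcha/i, /access denied/i, /are you a robot/i];
const BLOCKED_STATUS = new Set([401, 403, 429]);

// Detect CAPTCHA or error pages so they can be skipped gracefully.
export function looksBlocked(html: string): boolean {
  return BLOCK_MARKERS.some((re) => re.test(html));
}

// Try successively broader containers until one yields article text.
export function extractArticleText(html: string): string | null {
  const containers = [
    /<article[^>]*>([\s\S]*?)<\/article>/i,
    /<main[^>]*>([\s\S]*?)<\/main>/i,
  ];
  for (const re of containers) {
    const m = html.match(re);
    if (m) {
      const text = m[1].replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();
      if (text.length > 0) return text;
    }
  }
  return null;
}

export async function fetchArticle(url: string): Promise<string | null> {
  const res = await fetch(url);
  if (BLOCKED_STATUS.has(res.status)) return null; // 401/403/429: access restricted
  const html = await res.text();
  if (looksBlocked(html)) return null; // CAPTCHA or error page detected
  return extractArticleText(html);
}
```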
To reduce token usage and improve quality:
- Articles are cleaned and summarized to ~1000 words maximum.
- Only the most important information is preserved.
- Content is processed to remove HTML artifacts, excessive whitespace, and irrelevant sections.
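A minimal sketch of the cleanup pass is shown below. The ~1000-word cap comes from the text above; the specific cleanup rules are assumptions, and selecting "the most important information" would be delegated to the LLM summarization step rather than handled here.

```typescript
// Sketch of the mechanical cleanup before summarization. The regexes are
// illustrative assumptions; the 1000-word cap matches the stated maximum.
const MAX_WORDS = 1000;

export function cleanContent(raw: string): string {
  const text = raw
    .replace(/<[^>]+>/g, " ")       // drop leftover HTML tags
    .replace(/&[a-z#0-9]+;/gi, " ") // drop HTML entities such as &nbsp;
    .replace(/\s+/g, " ")           // collapse excessive whitespace
    .trim();
  const words = text.split(" ");
  return words.length <= MAX_WORDS ? text : words.slice(0, MAX_WORDS).join(" ");
}
```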
The system implements several strategies to ensure reliable operation:
- JSON parsing recovery mechanisms for LLM responses.
- Fallback content generation for failed extractions.
- Duplicate article detection to prevent redundant processing.
- Try/catch blocks around critical operations to prevent cascade failures.
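The JSON parsing recovery for LLM responses might look like the sketch below; the fallback strategy of grabbing the outermost `{...}` block is an assumption about how recovery is done.

```typescript
// Sketch of JSON recovery for LLM output: try a direct parse, then fall
// back to the first {...} span, since models often wrap JSON in prose
// or markdown fences.
export function parseLlmJson<T>(raw: string): T | null {
  try {
    return JSON.parse(raw) as T;
  } catch {
    const match = raw.match(/\{[\s\S]*\}/); // outermost object in the text
    if (match) {
      try {
        return JSON.parse(match[0]) as T;
      } catch {
        return null; // still malformed: caller falls back to default content
      }
    }
    return null;
  }
}
```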
- MongoDB vector search is used for semantic similarity matching.
- Embeddings are generated using OpenAI's text-embedding-3-small model.
- Results are limited to the top 10 matches to maintain relevance and performance.
- Context is carefully managed to minimize token consumption.
- System prompts are concise yet effective.
- Article content is summarized while maintaining information density.
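The retrieval step described above could be wired up roughly like this. The index name ("vector_index"), embedding field ("embedding"), projection, and the numCandidates oversampling factor are all assumptions about the schema, not confirmed details.

```typescript
// Sketch of the Atlas $vectorSearch stage used for semantic matching.
// Index and field names are assumptions about the collection schema.
export function buildVectorSearchStage(queryVector: number[], limit = 10) {
  return {
    $vectorSearch: {
      index: "vector_index",     // assumed Atlas Vector Search index name
      path: "embedding",         // assumed field holding the article embedding
      queryVector,               // produced by text-embedding-3-small
      numCandidates: limit * 10, // oversample candidates before ranking
      limit,                     // top 10 matches, per the text above
    },
  };
}

export function buildVectorSearchPipeline(queryVector: number[], limit = 10) {
  return [
    buildVectorSearchStage(queryVector, limit),
    {
      $project: {
        title: 1,
        url: 1,
        date: 1,
        content: 1,
        score: { $meta: "vectorSearchScore" },
      },
    },
  ];
}
```

At query time the pipeline would be passed to `collection.aggregate(...)` after embedding the user's question with the same model used at ingest time.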
- Node.js (v18 or higher)
- MongoDB (with Atlas Vector Search capability)
- Kafka
- OpenAI API key
Create a .env file with the following variables:
# Example environment variables
OPENAI_API_KEY=your_openai_api_key_here
MONGODB_URI=your_mongodb_connection_string
KAFKA_SERVER=your_kafka_server_address

- Clone the repository.
- Install dependencies.
- Build the TypeScript code.
- Start the application.
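Assuming standard npm scripts (the `build` and `start` script names are guesses; check package.json), the steps above map to:

```shell
# Clone the repository (placeholder URL; use the actual repo address)
git clone <repository-url>
cd <repository-directory>

# Install dependencies
npm install

# Build the TypeScript code (assumes a "build" script in package.json)
npm run build

# Start the application (assumes a "start" script)
npm start
```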
Request body:
{
"query": "your question here"
}

Response:
{
"answer": "An answer from LLM example",
"sources": [
{
"title": "What's the latest on Los Angeles wildfires and how did they start?",
"url": "https://www.bbc.com/news/articles/clyxypryrnko",
"date": "2025-01-21T13:17:36Z"
}
]
}

Potential enhancements include:
- Response streaming for better user experience.
- GraphQL API implementation with Yoga.
- Advanced caching strategies.
- Support for more data sources beyond Kafka.
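For reference, a hypothetical client for the query endpoint is sketched below. The `/query` path and port 3000 are assumptions; check the route and port configuration in the source before use.

```typescript
// Hypothetical client for the query endpoint. The URL is an assumption;
// adjust the path and port to match the actual route configuration.
const API_URL = "http://localhost:3000/query";

export function buildQueryRequest(query: string) {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query }),
  };
}

export async function ask(query: string): Promise<void> {
  const res = await fetch(API_URL, buildQueryRequest(query));
  if (!res.ok) throw new Error(`Query failed with status ${res.status}`);
  const { answer, sources } = await res.json();
  console.log(answer);
  for (const s of sources) console.log(`${s.title} (${s.url})`);
}
```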