El Matador: AI-Powered News Credibility Analyzer

El Matador is an ML-powered news credibility analysis tool that helps users evaluate the trustworthiness of news articles. It combines a TF-IDF + scikit-learn classifier trained on 63,000+ labeled articles with a suite of rule-based linguistic pattern detectors to produce an interpretable credibility score.


✨ Features

| Feature | Description |
| --- | --- |
| Credibility Score (0–100) | Blends ML model confidence (70%) with pattern analysis (30%) |
| 4-class classification | REAL, FAKE, MISLEADING, UNVERIFIED |
| 9 pattern detectors | Sensational language, excessive caps, vague sources, conspiracy framing, emotional manipulation, one-sidedness, lack of evidence, extreme adjectives, clickbait |
| Emotional tone | 5-level tone classification from Neutral to Highly manipulative |
| Suspicious claims | Up to 5 flagged sentences per article with fact-check guidance |
| Streamlit UI | Interactive two-column layout with full score breakdown |
| CLI-compatible | analyze() returns a JSON-serialisable dict for downstream integration |
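
The 70/30 blend listed above can be sketched as a weighted average. This is a minimal illustration, not the project's actual code: the name `blend_credibility`, its parameters, and the assumption that the ML side contributes the predicted probability of the REAL class are all hypothetical.

```python
ML_WEIGHT = 0.7       # share given to the ML model (from the Features list)
PATTERN_WEIGHT = 0.3  # share given to pattern analysis (from the Features list)


def blend_credibility(p_real: float, pattern_suspicion: float) -> int:
    """Combine model output and pattern analysis into a 0-100 score.

    p_real: assumed probability (0.0-1.0) that the article is REAL.
    pattern_suspicion: pattern score (0.0-1.0), higher = more suspicious.
    """
    for name, value in (("p_real", p_real), ("pattern_suspicion", pattern_suspicion)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")
    # Suspicion is inverted so that both terms mean "higher = more credible".
    blended = ML_WEIGHT * p_real + PATTERN_WEIGHT * (1.0 - pattern_suspicion)
    return round(blended * 100)


print(blend_credibility(0.9, 0.2))  # 87
```

Under these assumptions, a confident REAL prediction can still lose up to 30 points when the pattern detectors fire heavily, which is what keeps the score interpretable.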

🏗️ Project Structure

El_Matador/
├── src/                        # Refactored source packages
│   ├── analyzer/
│   │   └── credibility_analyzer.py  # Core orchestrator
│   ├── models/
│   │   └── model_loader.py          # Lazy singleton model loader
│   ├── patterns/
│   │   ├── pattern_detector.py      # 9-pattern linguistic detector
│   │   ├── emotional_analyzer.py    # Tone classifier
│   │   └── claim_highlighter.py     # Suspicious-claim extractor
│   └── utils/
│       └── text_utils.py            # Canonical text helpers
│
├── tests/                      # pytest test suite
│   ├── test_utils.py
│   ├── test_patterns.py
│   ├── test_claim_highlighter.py
│   └── test_analyzer.py
│
├── models/                     # Trained model artefacts (git-ignored)
│   ├── best_model.joblib
│   ├── tfidf_vectorizer.joblib
│   ├── metadata.txt
│   └── training_report.json
│
├── streamlit_app.py            # Streamlit UI entry point
├── train_model.py              # Model training script (with cross-validation)
├── pyproject.toml              # pytest config
├── requirements.txt
└── .gitignore

Note: The legacy flat-file modules (credibility_analyzer.py, pattern_detector.py, etc.) remain in the root for backward compatibility. New development should target src/.
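
The "lazy singleton" loader noted for model_loader.py could look roughly like the sketch below. It is an illustration under assumptions: the function name `load_artifacts` is hypothetical and the real ModelLoader class may be structured differently. Only the two artefact file names come from the tree above.

```python
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=1)
def load_artifacts(model_dir: str = "models"):
    """Load the classifier and vectorizer once, then reuse the cached pair."""
    # joblib is imported here so that importing this module stays cheap and
    # nothing touches the disk until the first call.
    import joblib

    directory = Path(model_dir)
    model = joblib.load(directory / "best_model.joblib")
    vectorizer = joblib.load(directory / "tfidf_vectorizer.joblib")
    return model, vectorizer
```

Repeated calls with the same argument return the same objects. Calling `load_artifacts.cache_clear()` forces a reload, for example after retraining.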


🚀 Quick Start

Prerequisites

  • Python 3.9+
  • pip

1. Clone the repository

git clone https://github.com/Mario5T/El_Matador.git
cd El_Matador

2. Create and activate a virtual environment

python -m venv venv

# macOS / Linux
source venv/bin/activate

# Windows
venv\Scripts\activate

3. Install dependencies

pip install -r requirements.txt

4. Download the dataset (for training only)

Download the WELFake dataset from Kaggle:

https://www.kaggle.com/datasets/saurabhshahane/fake-news-classification

Place the file at:

dataset/WELFake_Dataset.csv

⚠️ The CSV (~150 MB) is excluded from version control. You only need it to retrain the model.

5. Train the model

python train_model.py

This will:

  • Train Logistic Regression and Passive Aggressive classifiers
  • Print full metrics (accuracy, precision, recall, F1, confusion matrix) for each
  • Run 5-fold cross-validation on the best model
  • Save artefacts to models/

Expected output (example):

[4/6] Training & evaluating models …

  ▸ Passive Aggressive  (trained in 1.8s)
    Accuracy  : 0.9667
    Precision : 0.9668
    Recall    : 0.9667
    F1 Score  : 0.9667

[5/6] Cross-validating best model (Passive Aggressive, k=5) …
    accuracy  : 0.9654 ± 0.0021
    precision : 0.9655 ± 0.0022
    recall    : 0.9654 ± 0.0021
    f1        : 0.9654 ± 0.0021

6. Run the Streamlit app

streamlit run streamlit_app.py

The app opens at http://localhost:8501.


📥 Input / Output

Input format

Paste the plain text of a news article (minimum 50 characters, maximum 50,000 characters).
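
The length limits above could be enforced with a small guard like the one below. It is a sketch only: the function name `validate_article` is hypothetical, and whether the project trims surrounding whitespace before counting characters is an assumption.

```python
MIN_CHARS = 50      # minimum length stated above
MAX_CHARS = 50_000  # maximum length stated above


def validate_article(text: str) -> str:
    """Return the trimmed article text, or raise ValueError if out of bounds."""
    cleaned = text.strip()  # assumption: surrounding whitespace is not counted
    length = len(cleaned)
    if length < MIN_CHARS:
        raise ValueError(f"Article too short: {length} < {MIN_CHARS} characters")
    if length > MAX_CHARS:
        raise ValueError(f"Article too long: {length} > {MAX_CHARS} characters")
    return cleaned
```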

Example (credible article snippet):

Scientists at Stanford University published a peer-reviewed study showing 
a new vaccine candidate was 89% effective in phase 3 trials involving 30,000 
participants. Dr. Jane Smith confirmed the results would be submitted to the FDA.

Example (suspicious article snippet):

SHOCKING: Government scientists EXPOSED! Sources say the deep state is 
covering up a massive false flag. Many believe this conspiracy, but the 
mainstream media doesn't want you to know the truth. Wake up, people!
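
To make the contrast between the two snippets concrete, here is a rough sketch of two of the nine pattern types (excessive caps and vague sources). The function names, the phrase list, and the four-letter threshold are all hypothetical. The project's real detectors are not shown in this README and may work differently.

```python
import re

# Hypothetical phrase list; the project's real vocabulary is not shown here.
VAGUE_SOURCE_PHRASES = ("sources say", "many believe", "some say", "experts claim")


def caps_ratio(text: str) -> float:
    """Fraction of words (4+ letters) written entirely in capitals."""
    # The 4-letter minimum keeps short acronyms such as FDA from counting.
    words = re.findall(r"[A-Za-z]{4,}", text)
    if not words:
        return 0.0
    return sum(word.isupper() for word in words) / len(words)


def vague_source_hits(text: str) -> int:
    """Count occurrences of unattributed-source phrases, ignoring case."""
    lowered = text.lower()
    return sum(lowered.count(phrase) for phrase in VAGUE_SOURCE_PHRASES)


suspicious = (
    "SHOCKING: Government scientists EXPOSED! Sources say the deep state is "
    "covering up a massive false flag. Many believe this conspiracy, but the "
    "mainstream media doesn't want you to know the truth. Wake up, people!"
)
print(vague_source_hits(suspicious))  # 2
```

On the credible snippet both functions return zero, since it names its sources and contains no all-caps words of four or more letters.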

Output

The analyze() method returns a dictionary:

| Key | Type | Description |
| --- | --- | --- |
| classification | str | REAL, FAKE, MISLEADING, or UNVERIFIED |
| credibility_score | int | 0–100; higher = more credible |
| risk_level | str | Low Risk (≥75), Medium Risk (40–74), High Risk (<40) |
| confidence | int | System confidence in its own assessment (0–100%) |
| analysis_summary | str | 2–4 sentence summary |
| key_indicators | list[str] | Top linguistic red flags detected |
| emotional_tone | str | Dominant tone from Neutral to Highly emotional and manipulative |
| suspicious_claims | list[str] | Up to 5 sentences flagged for fact-checking |
| recommended_action | str | Actionable user guidance |
| explanation | str | Detailed assessment explanation |
| pattern_score | float | Raw pattern suspicion score (0.0–1.0) |
| patterns | dict | All 9 pattern detector outputs |
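
The risk_level thresholds listed above map onto a simple cascade. The sketch below is illustrative; the function name `risk_level_for` is hypothetical and the project may compute the label elsewhere.

```python
def risk_level_for(credibility_score: int) -> str:
    """Map a 0-100 credibility score to the risk labels listed above."""
    if not 0 <= credibility_score <= 100:
        raise ValueError(f"score must be in 0-100, got {credibility_score}")
    if credibility_score >= 75:
        return "Low Risk"
    if credibility_score >= 40:
        return "Medium Risk"
    return "High Risk"


print(risk_level_for(18))  # High Risk
```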

🖥️ CLI Usage

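from src.models import ModelLoader
from src.analyzer import CredibilityAnalyzer

# Load the trained classifier and TF-IDF vectorizer (cached after the first call)
loader = ModelLoader()
model, vectorizer = loader.load()

# Plain-text article to analyse (50 to 50,000 characters)
article_text = (
    "SHOCKING: Government scientists EXPOSED! Sources say the deep state is "
    "covering up a massive false flag."
)

analyzer = CredibilityAnalyzer()
result = analyzer.analyze(article_text, model, vectorizer)

print(result["classification"])     # e.g. "FAKE"
print(result["credibility_score"])  # e.g. 18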

🧪 Running Tests

pytest

The test suite covers:

  • test_utils.py: text preprocessing and sentence-level helpers
  • test_patterns.py: PatternDetector outputs and edge cases
  • test_claim_highlighter.py: suspicious-claim extraction
  • test_analyzer.py: full pipeline with mocked ML model

📊 Model Performance

| Model | Accuracy | Precision | Recall | F1 (weighted) |
| --- | --- | --- | --- | --- |
| Logistic Regression | ~94% | ~94% | ~94% | ~94% |
| Passive Aggressive ✅ | ~96.7% | ~96.7% | ~96.7% | ~96.7% |

Trained on the WELFake dataset (~63,000 labelled articles, 80/20 split).
5-fold cross-validation shows stable performance across folds (F1 std ≈ 0.002).


🚀 Deployment

Streamlit Cloud (recommended)

  1. Push to GitHub (model artefacts excluded; see below).
  2. Connect repo to share.streamlit.io.
  3. Set entry point: streamlit_app.py.
  4. Add models/best_model.joblib and models/tfidf_vectorizer.joblib via Streamlit Secrets or a one-time download script in streamlit_app.py.
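
A one-time download step of the kind mentioned in item 4 might look like this sketch. The helper name `ensure_artifact` is hypothetical, and the README does not say where the artefacts would be hosted, so any URL passed in is a placeholder you would supply yourself.

```python
from pathlib import Path
from urllib.request import urlretrieve


def ensure_artifact(path, url: str) -> bool:
    """Download `url` to `path` unless the file already exists.

    Returns True if a download happened, False if the file was already there.
    """
    target = Path(path)
    if target.exists():
        return False  # already present: skip the network entirely
    target.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, str(target))
    return True
```

Calling it once per artefact at the top of streamlit_app.py, before the model is loaded, would keep later reruns from touching the network.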

Cold-start optimisation: decorating load_model() with @st.cache_resource means the model is loaded once per server process and shared across sessions and reruns, rather than reloaded on every rerun.

Model caching note

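Because .joblib files are excluded from git (they are ~2.5 MB total), Streamlit Cloud cannot get them from the repository. The two options are the ones named in step 4 above: Streamlit Secrets, or a one-time download script in streamlit_app.py.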


📋 Commit Strategy

Suggested branch / PR breakdown:

| Branch | Purpose |
| --- | --- |
| feat/src-restructure | Move modules into src/ packages |
| feat/model-pipeline | Improved train_model.py with CV and JSON report |
| feat/analyzer-refactor | Singleton sub-components, constants, type hints |
| feat/model-loader | Lazy ModelLoader with lru_cache |
| feat/tests | Full pytest suite |
| chore/gitignore | Add .DS_Store, models/*.joblib, dataset/ |
| docs/readme | This README |

🔮 Limitations

  • No real-time fact-checking: analysis is purely linguistic/structural.
  • No knowledge injection: the model has no live internet access.
  • Domain drift: the model was trained on English news, so performance may degrade on non-news text or other languages.
  • Satire blind spot: satirical articles may score as MISLEADING due to sensational language patterns.

📄 License

MIT; see LICENSE.
