frostyhand/semantic-vector-internal-linking

Automating Internal Linking with Embedded Vectors

A complete AI-powered system for automating internal link suggestions using OpenAI embeddings, vector similarity search, and optional SERP intent analysis.

Demo repository for the Search Engine Land article: "Automating Internal Linking with Embedded Vectors"

🎯 What This Does

This tool automatically analyzes your website's content and suggests relevant internal linking opportunities by:

  1. Scraping your website content from XML sitemaps
  2. Generating embeddings using OpenAI's text-embedding-ada-002 model
  3. Finding similar content using FAISS vector similarity search
  4. Ranking suggestions by semantic relevance
  5. Enriching with SERP data (optional) via DataForSEO API
  6. Exporting results in multiple formats (JSON, CSV, HTML, WordPress)

🏗️ Architecture

[Website Sitemap] 
     ↓
[Content Scraper] ──> [Text Preprocessor (clean + tokenize)] 
     ↓                           ↓
[OpenAI Embeddings API]         [DataForSEO API (optional SERP intent)]
     ↓                           ↓
[Vector Store (FAISS)] ◀───┘
     ↓
[CrewAI Agent: Similarity Scanner]
     ↓
[Link Suggestion Generator]
     ↓
[CMS Integration / Report]

🚀 Quick Start

Prerequisites

  • Python 3.8+
  • OpenAI API key
  • (Optional) DataForSEO API credentials

Installation

# Clone the repository
git clone https://github.com/frostyhand/semantic-vector-internal-linking
cd semantic-vector-internal-linking

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env
# Edit .env with your API keys

Basic Usage

# Set up the environment and check dependencies
python scripts/run.py setup

# Analyze a website (replace with your sitemap URL)
python scripts/run.py analyze --sitemap-url "https://example.com/sitemap.xml"

# Limit to 10 pages for testing
python scripts/run.py analyze --sitemap-url "https://example.com/sitemap.xml" --max-pages 10

# Export only CSV format
python scripts/run.py analyze --sitemap-url "https://example.com/sitemap.xml" --format csv

# Include SERP intent analysis
python scripts/run.py analyze --sitemap-url "https://example.com/sitemap.xml" --use-serp-data

Environment Variables

# Required
export OPENAI_API_KEY="your-openai-api-key"

# Optional (for SERP analysis)
export DFSEO_USER="your-dataforseo-username"
export DFSEO_PASS="your-dataforseo-password"
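
A minimal sketch of how a script could validate these variables before starting a run. The helper name `load_credentials` and the returned dictionary keys are hypothetical and not taken from the repository; only the three variable names come from the list above.

```python
import os


def load_credentials(env=None):
    """Return API credentials, failing early if the required key is missing."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    user = env.get("DFSEO_USER", "").strip()
    password = env.get("DFSEO_PASS", "").strip()
    return {
        "openai_api_key": api_key,
        # SERP enrichment is only possible when both DataForSEO values exist.
        "serp_enabled": bool(user and password),
        "dfseo_user": user or None,
        "dfseo_pass": password or None,
    }
```

Passing a plain dictionary as `env` makes the check easy to exercise without touching the real environment.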

📊 Output Formats

The tool generates multiple output formats:

1. JSON Report

{
  "generated_at": "2024-01-15T10:30:00",
  "total_pages": 25,
  "total_suggestions": 147,
  "suggestions": {
    "https://example.com/page1": [
      {
        "url": "https://example.com/related-page",
        "title": "Related Page Title",
        "similarity_score": 0.847,
        "suggested_anchor_text": "Related Page"
      }
    ]
  }
}
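
For downstream processing, the nested structure above can be flattened into rows. The following is a sketch that assumes the report has exactly the keys shown in the sample; `flatten_report` is a hypothetical helper, not part of the tool.

```python
import json


def flatten_report(report):
    """Turn the nested "suggestions" mapping into (source, target, score) rows."""
    rows = []
    for source_url, targets in report.get("suggestions", {}).items():
        for target in targets:
            rows.append((source_url, target["url"], target["similarity_score"]))
    # Highest-scoring pairs first.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Inline sample mirroring the report shown above.
sample_report = json.loads("""
{
  "total_pages": 25,
  "suggestions": {
    "https://example.com/page1": [
      {
        "url": "https://example.com/related-page",
        "title": "Related Page Title",
        "similarity_score": 0.847,
        "suggested_anchor_text": "Related Page"
      }
    ]
  }
}
""")
rows = flatten_report(sample_report)
```

With a real export, replace the inline string with the parsed contents of the generated JSON file.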

2. CSV Report

Easy to import into Excel or Google Sheets for analysis:

  • Source URL, Target URL, Similarity Score
  • Suggested anchor text and priority levels
  • SERP intent data (if enabled)
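
If you prefer to post-process the CSV in code rather than a spreadsheet, something like the sketch below may help. The header names `source_url`, `target_url`, and `similarity_score` are assumptions; check the first line of the actual export and adjust the constants.

```python
import csv
import io
from collections import defaultdict

# Assumed header names; the real export may label its columns differently.
SOURCE, TARGET, SCORE = "source_url", "target_url", "similarity_score"


def top_targets_per_source(csv_file, limit=3):
    """Group rows by source URL and keep the `limit` best-scoring targets."""
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        grouped[row[SOURCE]].append((float(row[SCORE]), row[TARGET]))
    return {
        source: [target for _, target in sorted(pairs, reverse=True)[:limit]]
        for source, pairs in grouped.items()
    }


# Small in-memory sample standing in for the exported file.
sample = io.StringIO(
    "source_url,target_url,similarity_score\n"
    "https://example.com/a,https://example.com/b,0.81\n"
    "https://example.com/a,https://example.com/c,0.92\n"
)
result = top_targets_per_source(sample, limit=1)
```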

3. HTML Report

A visual report suited to stakeholder presentations.

4. WordPress Format

Ready-to-import format for WordPress plugins like Link Whisper

🛠️ Advanced Usage

Custom Similarity Threshold

python scripts/run.py analyze --sitemap-url "https://example.com/sitemap.xml" --min-similarity 0.8
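
Conceptually, a threshold like this discards suggestions below the cut-off. The sketch below illustrates the idea on dictionaries shaped like the JSON report entries; whether the tool treats the bound as inclusive is an assumption, and `filter_by_similarity` is a hypothetical name.

```python
def filter_by_similarity(suggestions, min_similarity=0.8):
    """Keep suggestions scoring at least `min_similarity` (inclusive by assumption)."""
    return [s for s in suggestions if s["similarity_score"] >= min_similarity]


candidates = [
    {"url": "https://example.com/a", "similarity_score": 0.847},
    {"url": "https://example.com/b", "similarity_score": 0.80},
    {"url": "https://example.com/c", "similarity_score": 0.62},
]
kept = filter_by_similarity(candidates, min_similarity=0.8)
```

A higher threshold yields fewer but more closely related suggestions; a lower one trades precision for coverage.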

Test Individual URLs

python scripts/run.py test-scraper --url "https://example.com/test-page"

List Sitemap URLs

python scripts/run.py list-urls --sitemap-url "https://example.com/sitemap.xml" --max-urls 20

🧩 Code Structure

semantic-vector-internal-linking/
├── crawler/              # Web scraping and content extraction
│   ├── __init__.py
│   └── web_scraper.py   # BeautifulSoup-based scraper
├── embeddings/           # OpenAI API integration
│   ├── __init__.py
│   └── openai_embed.py  # Embedding generation
├── vectorstore/          # FAISS vector database
│   ├── __init__.py
│   └── faiss_store.py   # Vector storage and similarity search
├── agents/               # CrewAI orchestration
│   ├── __init__.py
│   └── internal_link_agent.py  # AI agent workflows
├── suggestions/          # Output formatting
│   ├── __init__.py
│   └── formatter.py     # Multiple export formats
├── dataforseo/           # SERP analysis (optional)
│   ├── __init__.py
│   └── serp_intent.py   # Search intent enrichment
├── scripts/              # CLI interface
│   ├── __init__.py
│   └── run.py           # Main orchestrator script
├── requirements.txt      # Dependencies
├── .env.example         # Environment template
└── README.md            # This file

🔧 Key Components

1. Web Scraper (crawler/web_scraper.py)

  • Parses XML sitemaps
  • Extracts content from web pages
  • Handles rate limiting and error recovery
  • Cleans and normalizes text content
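
To illustrate the sitemap-parsing step, here is a minimal sketch using only the standard library. It is not the code in `crawler/web_scraper.py`; the function name `extract_sitemap_urls` is hypothetical, and it handles plain `<urlset>` sitemaps only, not sitemap index files.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(xml_text, max_urls=None):
    """Return the <loc> values of a <urlset> sitemap, optionally capped."""
    root = ET.fromstring(xml_text)
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    return urls if max_urls is None else urls[:max_urls]


# Inline sample; a real run would download the sitemap first.
sample_sitemap = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/test-page</loc></url>
</urlset>
"""
urls = extract_sitemap_urls(sample_sitemap)
```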

2. Embedding Generator (embeddings/openai_embed.py)

  • Interfaces with OpenAI's embedding API
  • Supports batch processing for efficiency
  • Handles API rate limits and retries
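
Batching with retries can be sketched independently of any particular client. In the example below, `embed_fn` is a hypothetical callable that takes a list of texts and returns one vector per text; in practice it would wrap whatever embedding call the project uses. The batch size, retry count, and delays are illustrative defaults, not values from the repository.

```python
import time


def embed_in_batches(texts, embed_fn, batch_size=100, max_retries=3,
                     base_delay=1.0, sleep=time.sleep):
    """Call `embed_fn` on fixed-size batches, retrying failures with backoff."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for attempt in range(max_retries + 1):
            try:
                vectors.extend(embed_fn(batch))
                break
            except Exception:
                # Real code should catch the client's rate-limit error only.
                if attempt == max_retries:
                    raise
                # Wait 1x, 2x, 4x ... the base delay before the next attempt.
                sleep(base_delay * (2 ** attempt))
    return vectors
```

The `sleep` parameter is injectable so the backoff logic can be exercised in tests without real waiting.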

3. Vector Store (vectorstore/faiss_store.py)

  • FAISS-based similarity search
  • Persistent storage and loading
  • Metadata management for pages
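
The search itself boils down to ranking pages by cosine similarity between their embeddings. The sketch below is a brute-force, pure-Python stand-in for that computation, not the FAISS code in `vectorstore/faiss_store.py`; an inner-product index over unit-length vectors (for example `faiss.IndexFlatIP`) would yield the same scores far faster. Function names are hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def top_k_similar(query_index, vectors, k=5):
    """Return (index, score) pairs for the k vectors closest to vectors[query_index]."""
    unit = [normalize(v) for v in vectors]
    query = unit[query_index]
    scored = [
        (i, sum(a * b for a, b in zip(query, candidate)))
        for i, candidate in enumerate(unit)
        if i != query_index  # a page should not be suggested as its own link
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings" for three pages.
page_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
neighbours = top_k_similar(0, page_vectors, k=2)
```

Here page 1 ranks above page 2 for page 0, because its vector points in nearly the same direction.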

4. CrewAI Agents (agents/internal_link_agent.py)

  • Orchestrates the entire pipeline
  • Manages task dependencies
  • Provides logging and error handling

5. Suggestion Formatter (suggestions/formatter.py)

  • Multiple export formats
  • Priority scoring
  • WordPress plugin compatibility
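
Priority scoring can be as simple as bucketing the similarity score. The cut-offs below are hypothetical and chosen only for illustration; the formatter in this repository may use different thresholds or additional signals.

```python
def priority_for(score, high=0.85, medium=0.75):
    """Map a similarity score to a coarse priority label (illustrative cut-offs)."""
    if score >= high:
        return "high"
    if score >= medium:
        return "medium"
    return "low"


labels = [priority_for(s) for s in (0.91, 0.847, 0.60)]
```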

📈 Performance Tips

  1. Batch Processing: Use batch embedding generation for better API efficiency
  2. Caching: Vector stores are saved to disk for reuse
  3. Rate Limiting: Built-in delays respect API rate limits
  4. Memory Management: FAISS handles large vector datasets efficiently
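
The tool persists the vector store as a whole. A related but different idea, shown here purely as a sketch, is to cache individual embeddings by content hash so unchanged pages are never re-embedded. `cached_embedding`, `embed_fn`, and the one-JSON-file-per-text layout are all assumptions, not part of the repository.

```python
import hashlib
import json
from pathlib import Path


def cached_embedding(text, embed_fn, cache_dir):
    """Return the embedding for `text`, reusing a copy on disk when available."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # The SHA-256 of the text is the cache key, so edited pages miss the cache.
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    vector = embed_fn(text)
    path.write_text(json.dumps(vector), encoding="utf-8")
    return vector
```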

🔍 SERP Intent Analysis (Optional)

When enabled with --use-serp-data, the tool enriches suggestions with:

  • Search intent classification (informational, commercial, navigational)
  • Competition level analysis
  • Top-ranking content analysis
  • People Also Ask questions

🎯 SEO Best Practices

The tool implements several SEO best practices:

  1. Semantic Relevance: Uses AI embeddings for true content similarity
  2. Anchor Text Suggestions: Provides natural anchor text recommendations
  3. Priority Scoring: Ranks suggestions by relevance and potential impact
  4. Intent Matching: Optional SERP analysis for search intent alignment
  5. Scalability: Handles large websites efficiently

🀝 Integration Options

WordPress

Use the WordPress output format with plugins like:

  • Link Whisper
  • Internal Link Juicer
  • Custom development with WordPress REST API

Other CMS Platforms

  • JSON output for custom integrations
  • CSV for manual review and implementation
  • HTML reports for stakeholder presentations

🚨 Troubleshooting

Common Issues

  1. API Key Errors

    # Check if API key is set
    echo $OPENAI_API_KEY
  2. Memory Issues with Large Sites

    # Process in smaller batches
    python scripts/run.py analyze --sitemap-url "..." --max-pages 50
  3. Rate Limiting

    • The tool includes built-in rate limiting
    • For faster processing, consider OpenAI API tier upgrades

Debug Mode

# Enable verbose logging
PYTHONPATH=. python scripts/run.py analyze --sitemap-url "..." --verbose

📚 Dependencies

  • OpenAI: Embedding generation
  • CrewAI: Agent orchestration and workflows
  • FAISS: Fast similarity search
  • BeautifulSoup: Web scraping
  • Click: CLI interface
  • Requests: HTTP client
  • NumPy: Numerical operations

📝 License

MIT License - feel free to use this for your own projects!

🙋‍♀️ Support

For issues or questions:

  1. Check the troubleshooting section above
  2. Review the code comments for implementation details
  3. Submit issues on GitHub for bugs or feature requests

Happy linking! 🔗✨
