A complete AI-powered system for automating internal link suggestions using OpenAI embeddings, vector similarity search, and optional SERP intent analysis.
Demo repository for the Search Engine Land article: "Automating Internal Linking with Embedded Vectors"
This tool automatically analyzes your website's content and suggests relevant internal linking opportunities by:
- Scraping your website content from XML sitemaps
- Generating embeddings using OpenAI's text-embedding-ada-002 model
- Finding similar content using FAISS vector similarity search
- Ranking suggestions by semantic relevance
- Enriching with SERP data (optional) via DataForSEO API
- Exporting results in multiple formats (JSON, CSV, HTML, WordPress)
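The core of the steps above is ranking pages by how close their embedding vectors are. The following is a minimal sketch of that idea only, not the repository's code: it uses brute-force cosine similarity in plain Python as a stand-in for FAISS, and toy 3-dimensional vectors in place of real OpenAI embeddings. The function names `cosine_similarity` and `suggest_links` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def suggest_links(embeddings, top_k=3):
    """For each URL, rank every other URL by similarity, highest first."""
    suggestions = {}
    for source_url, source_vec in embeddings.items():
        scored = [
            (target_url, cosine_similarity(source_vec, target_vec))
            for target_url, target_vec in embeddings.items()
            if target_url != source_url  # a page should not link to itself
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        suggestions[source_url] = scored[:top_k]
    return suggestions


# Toy vectors standing in for real embeddings (assumption for illustration).
toy_embeddings = {
    "https://example.com/running-shoes": [0.9, 0.1, 0.0],
    "https://example.com/trail-shoes": [0.8, 0.2, 0.1],
    "https://example.com/tax-guide": [0.0, 0.1, 0.9],
}
result = suggest_links(toy_embeddings, top_k=1)
```

A brute-force scan like this compares every pair of pages, which is fine for a small site; a vector index such as FAISS exists to make the same lookup fast when there are many pages.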
[Website Sitemap]
        ↓
[Content Scraper] ──> [Text Preprocessor (clean + tokenize)]
        ↓                          ↓
[OpenAI Embeddings API]    [DataForSEO API (optional SERP intent)]
        ↓                          ↓
[Vector Store (FAISS)] ←───────────┘
        ↓
[CrewAI Agent: Similarity Scanner]
        ↓
[Link Suggestion Generator]
        ↓
[CMS Integration / Report]
- Python 3.8+
- OpenAI API key
- (Optional) DataForSEO API credentials
# Clone the repository
git clone https://github.com/frostyhand/semantic-vector-internal-linking
cd semantic-vector-internal-linking
# Install dependencies
pip install -r requirements.txt
# Set up environment variables
cp .env.example .env
# Edit .env with your API keys
# Set up the environment and check dependencies
python scripts/run.py setup
# Analyze a website (replace with your sitemap URL)
python scripts/run.py analyze --sitemap-url "https://example.com/sitemap.xml"
# Limit to 10 pages for testing
python scripts/run.py analyze --sitemap-url "https://example.com/sitemap.xml" --max-pages 10
# Export only CSV format
python scripts/run.py analyze --sitemap-url "https://example.com/sitemap.xml" --format csv
# Include SERP intent analysis
python scripts/run.py analyze --sitemap-url "https://example.com/sitemap.xml" --use-serp-data
# Required
export OPENAI_API_KEY="your-openai-api-key"
# Optional (for SERP analysis)
export DFSEO_USER="your-dataforseo-username"
export DFSEO_PASS="your-dataforseo-password"
The tool generates multiple output formats:
JSON:
{
  "generated_at": "2024-01-15T10:30:00",
  "total_pages": 25,
  "total_suggestions": 147,
  "suggestions": {
    "https://example.com/page1": [
      {
        "url": "https://example.com/related-page",
        "title": "Related Page Title",
        "similarity_score": 0.847,
        "suggested_anchor_text": "Related Page"
      }
    ]
  }
}
CSV, easy to import into Excel or Google Sheets for analysis:
- Source URL, Target URL, Similarity Score
- Suggested anchor text and priority levels
- SERP intent data (if enabled)
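To show how the nested JSON report maps onto flat CSV rows, here is a small sketch using only the standard library. It is an illustration, not the tool's own formatter: the function name `suggestions_to_csv` and the exact header strings are assumptions, and the priority and SERP columns are left out because the sample JSON in this README does not show those fields.

```python
import csv
import io
import json

# A trimmed copy of the sample JSON report shown in this README.
report_json = """
{
  "suggestions": {
    "https://example.com/page1": [
      {
        "url": "https://example.com/related-page",
        "title": "Related Page Title",
        "similarity_score": 0.847,
        "suggested_anchor_text": "Related Page"
      }
    ]
  }
}
"""


def suggestions_to_csv(report):
    """Flatten the nested suggestions mapping into one CSV row per link."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Header names are assumed; the real export may label columns differently.
    writer.writerow(
        ["Source URL", "Target URL", "Similarity Score", "Suggested Anchor Text"]
    )
    for source_url, targets in report["suggestions"].items():
        for target in targets:
            writer.writerow(
                [
                    source_url,
                    target["url"],
                    target["similarity_score"],
                    target["suggested_anchor_text"],
                ]
            )
    return buffer.getvalue()


csv_text = suggestions_to_csv(json.loads(report_json))
print(csv_text)
```

Using `csv.writer` rather than joining strings by hand means titles or anchor text containing commas or quotes are escaped correctly.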
HTML: a visual report for stakeholder presentations
WordPress: a ready-to-import format for WordPress plugins like Link Whisper
# Set a custom minimum similarity threshold
python scripts/run.py analyze --sitemap-url "https://example.com/sitemap.xml" --min-similarity 0.8
# Test the scraper on a single page
python scripts/run.py test-scraper --url "https://example.com/test-page"
# List URLs found in a sitemap (up to 20)
python scripts/run.py list-urls --sitemap-url "https://example.com/sitemap.xml" --max-urls 20
semantic-vector-internal-linking/
├── crawler/                      # Web scraping and content extraction
│   ├── __init__.py
│   └── web_scraper.py            # BeautifulSoup-based scraper
├── embeddings/                   # OpenAI API integration
│   ├── __init__.py
│   └── openai_embed.py           # Embedding generation
├── vectorstore/                  # FAISS vector database
│   ├── __init__.py
│   └── faiss_store.py            # Vector storage and similarity search
├── agents/                       # CrewAI orchestration
│   ├── __init__.py
│   └── internal_link_agent.py    # AI agent workflows
├── suggestions/                  # Output formatting
│   ├── __init__.py
│   └── formatter.py              # Multiple export formats
├── dataforseo/                   # SERP analysis (optional)
│   ├── __init__.py
│   └── serp_intent.py            # Search intent enrichment
├── scripts/                      # CLI interface
│   ├── __init__.py
│   └── run.py                    # Main orchestrator script
├── requirements.txt              # Dependencies
├── .env.example                  # Environment template
└── README.md                     # This file
Web Scraper (crawler/web_scraper.py)
- Parses XML sitemaps
- Extracts content from web pages
- Handles rate limiting and error recovery
- Cleans and normalizes text content
Embeddings (embeddings/openai_embed.py)
- Interfaces with OpenAI's embedding API
- Supports batch processing for efficiency
- Handles API rate limits and retries
Vector Store (vectorstore/faiss_store.py)
- FAISS-based similarity search
- Persistent storage and loading
- Metadata management for pages
Agent (agents/internal_link_agent.py)
- Orchestrates the entire pipeline
- Manages task dependencies
- Provides logging and error handling
Suggestion Formatter (suggestions/formatter.py)
- Multiple export formats
- Priority scoring
- WordPress plugin compatibility
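As an illustration of the first component responsibility listed above, parsing XML sitemaps, here is a minimal standard-library sketch. It is not the code in `web_scraper.py`: the function name `extract_urls` and its `max_urls` parameter are hypothetical (the parameter loosely mirrors the `--max-urls` CLI flag), and the sketch reads the XML from a string instead of fetching it over HTTP.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol for <urlset> documents.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml, max_urls=None):
    """Return the <loc> values of a sitemap document, in document order."""
    root = ET.fromstring(sitemap_xml)
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text  # skip empty <loc/> elements
    ]
    return urls if max_urls is None else urls[:max_urls]


sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/post-1</loc></url>
  <url><loc>https://example.com/blog/post-2</loc></url>
</urlset>"""

all_urls = extract_urls(sample_sitemap)
first_two = extract_urls(sample_sitemap, max_urls=2)
print(all_urls)
```

This sketch handles a plain `<urlset>` only. A sitemap index file, which lists other sitemaps under `<sitemapindex>`, would need an extra step that follows each nested sitemap.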
- Batch Processing: Use batch embedding generation for better API efficiency
- Caching: Vector stores are saved to disk for reuse
- Rate Limiting: Built-in delays respect API rate limits
- Memory Management: FAISS handles large vector datasets efficiently
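The batching and caching tips above can be sketched together. This is an illustration under stated assumptions, not the repository's implementation: it caches raw vectors in a JSON file keyed by a hash of the text, whereas the tool itself saves its vector store to disk. The names `batched`, `embed_with_cache`, and `fake_embed_batch` are hypothetical, and the fake embedder stands in for a real API call.

```python
import hashlib
import json
import os
import tempfile


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def text_key(text):
    """Stable cache key derived from the text content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def embed_with_cache(texts, embed_batch, cache_path, batch_size=2):
    """Embed texts in batches, reusing vectors stored in a JSON cache file."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as handle:
            cache = json.load(handle)

    # Only texts without a cached vector are sent to the embedder.
    missing = [t for t in dict.fromkeys(texts) if text_key(t) not in cache]
    for batch in batched(missing, batch_size):
        for text, vector in zip(batch, embed_batch(batch)):
            cache[text_key(text)] = vector

    with open(cache_path, "w", encoding="utf-8") as handle:
        json.dump(cache, handle)
    return [cache[text_key(t)] for t in texts]


calls = []  # records each batch sent to the fake embedder


def fake_embed_batch(batch):
    """Hypothetical embedder: returns a tiny made-up vector per text."""
    calls.append(list(batch))
    return [[float(len(text)), 1.0] for text in batch]


cache_file = os.path.join(tempfile.mkdtemp(), "embeddings.json")
docs = ["alpha", "beta", "gamma"]
first = embed_with_cache(docs, fake_embed_batch, cache_file)
second = embed_with_cache(docs, fake_embed_batch, cache_file)  # served from cache
```

On the second run no batches are sent at all, which is the point of caching: re-analyzing an unchanged site should not pay for the same embeddings twice.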
When enabled with --use-serp-data, the tool enriches suggestions with:
- Search intent classification (informational, commercial, navigational)
- Competition level analysis
- Top-ranking content analysis
- People Also Ask questions
The tool implements several SEO best practices:
- Semantic Relevance: Uses AI embeddings for true content similarity
- Anchor Text Suggestions: Provides natural anchor text recommendations
- Priority Scoring: Ranks suggestions by relevance and potential impact
- Intent Matching: Optional SERP analysis for search intent alignment
- Scalability: Handles large websites efficiently
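One way to picture priority scoring is a threshold filter followed by a label. The sketch below is an assumption for illustration, not the tool's actual scoring: the 0.8 default echoes the `--min-similarity 0.8` example in this README, while the 0.9 cutoff, the "high" and "medium" labels, and the function name `prioritize` are hypothetical.

```python
def prioritize(suggestions, min_similarity=0.8, high_cutoff=0.9):
    """Drop weak matches, label the rest, and sort strongest first."""
    kept = []
    for item in suggestions:
        score = item["similarity_score"]
        if score < min_similarity:
            continue  # below the threshold: not worth suggesting
        priority = "high" if score >= high_cutoff else "medium"
        kept.append({**item, "priority": priority})
    return sorted(kept, key=lambda item: item["similarity_score"], reverse=True)


candidates = [
    {"url": "https://example.com/a", "similarity_score": 0.82},
    {"url": "https://example.com/b", "similarity_score": 0.95},
    {"url": "https://example.com/c", "similarity_score": 0.61},
]
ranked = prioritize(candidates)
for item in ranked:
    print(item["url"], item["priority"])
```

Sensible thresholds depend on the embedding model and the site's content, so any cutoff is best tuned against a sample of real suggestions.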
Use the WordPress output format with plugins like:
- Link Whisper
- Internal Link Juicer
- Custom development with WordPress REST API
- JSON output for custom integrations
- CSV for manual review and implementation
- HTML reports for stakeholder presentations
- API Key Errors
  # Check if API key is set
  echo $OPENAI_API_KEY
- Memory Issues with Large Sites
  # Process in smaller batches
  python scripts/run.py analyze --sitemap-url "..." --max-pages 50
- Rate Limiting
  - The tool includes built-in rate limiting
  - For faster processing, consider OpenAI API tier upgrades
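For readers wondering what handling a rate limit can look like in practice, a common pattern is retrying with exponential backoff. The sketch below shows that general pattern and is not a description of the tool's built-in behavior: `RateLimitError` is a hypothetical stand-in for whatever exception an API client raises on a rate-limit response, and `call_with_backoff` is an illustrative name.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for an API client's rate-limit exception."""


def call_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on RateLimitError with doubling delays."""
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...


# Demo: a request that is rejected twice and then succeeds.
attempts = {"count": 0}


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimitError("slow down")
    return "ok"


delays = []  # collect the delays instead of actually sleeping
outcome = call_with_backoff(flaky_request, sleep=delays.append)
print(outcome, delays)
```

Passing `sleep` as a parameter keeps the demo instant and makes the retry logic easy to test; in real use the default `time.sleep` applies.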
# Enable verbose logging
export PYTHONPATH=.
python scripts/run.py analyze --sitemap-url "..." --verbose
- OpenAI: Embedding generation
- CrewAI: Agent orchestration and workflows
- FAISS: Fast similarity search
- BeautifulSoup: Web scraping
- Click: CLI interface
- Requests: HTTP client
- NumPy: Numerical operations
MIT License - feel free to use this for your own projects!
For issues or questions:
- Check the troubleshooting section above
- Review the code comments for implementation details
- Submit issues on GitHub for bugs or feature requests
Happy linking!