Add AI bot classification for event enrichment#80

Open
jaredmixpanel wants to merge 5 commits into master from feature/ai-bot-classification

jaredmixpanel commented Feb 19, 2026

Summary

Adds AI bot classification integrated into the Mixpanel class that automatically detects AI crawler requests and enriches tracked events with classification properties.

What it does

  • Classifies user-agent strings against a database of 12 known AI bots
  • Enriches events with $is_ai_bot, $ai_bot_name, $ai_bot_provider, and $ai_bot_category properties
  • Supports custom bot patterns that take priority over built-in patterns
  • Case-insensitive matching
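
The classification described above can be illustrated with a minimal sketch. This is not the PR's code: `sketch_bot_database()` and `sketch_classify()` are hypothetical names, the database holds two entries instead of 12, and the ClaudeBot pattern and both category values are placeholders chosen for illustration. Only the `'/GPTBot\//i'` pattern and the four property names come from this description.

```php
<?php
// Hypothetical two-entry database; the PR's built-in list has 12 entries.
function sketch_bot_database()
{
    return array(
        array('pattern' => '/GPTBot\//i', 'name' => 'GPTBot',
              'provider' => 'OpenAI', 'category' => 'indexing'),
        // Assumed pattern and category, for illustration only.
        array('pattern' => '/ClaudeBot\//i', 'name' => 'ClaudeBot',
              'provider' => 'Anthropic', 'category' => 'indexing'),
    );
}

// Returns the enrichment properties for the first matching pattern.
function sketch_classify($user_agent)
{
    // Null, empty, and non-string inputs are treated as "not a bot".
    if (!is_string($user_agent) || $user_agent === '') {
        return array('$is_ai_bot' => false);
    }
    foreach (sketch_bot_database() as $bot) {
        // The /i modifier makes the match case-insensitive.
        if (preg_match($bot['pattern'], $user_agent) === 1) {
            return array(
                '$is_ai_bot'       => true,
                '$ai_bot_name'     => $bot['name'],
                '$ai_bot_provider' => $bot['provider'],
                '$ai_bot_category' => $bot['category'],
            );
        }
    }
    return array('$is_ai_bot' => false);
}

$lower = sketch_classify('Mozilla/5.0 (compatible; gptbot/1.0)');
$human = sketch_classify('Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0');
```

Because matching stops at the first hit, the order of entries in the database determines which bot wins when several patterns could match.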

AI Bots Detected

GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, Claude-User, Google-Extended, PerplexityBot, Bytespider, CCBot, Applebot-Extended, Meta-ExternalAgent, cohere-ai

Implementation Details

Architecture

  • Two integration paths: a bot_detection => true option passed to Mixpanel::getInstance(), which enriches events in track(), and the BotClassifyingConsumer strategy, which enriches them in persist()
  • Anti-double-classification guard when both paths are active (_isUsingBotClassifyingConsumer() check in track())
  • Uses full PCRE regex strings in bot database (e.g., '/GPTBot\//i')
  • Invalid custom regex patterns silently skipped via @preg_match
  • PHPUnit upgraded to v9 for PHP 8.4 compatibility (separate commit)
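
As a rough illustration of the "silently skipped" behaviour, the sketch below relies on the documented fact that `preg_match()` returns `false` (and raises a warning) when a pattern fails to compile; the `@` operator hides the warning. The helper name `sketch_first_valid_match()` and the broken pattern are hypothetical, and the PR's actual loop may be structured differently.

```php
<?php
// Returns the first pattern that matches $user_agent, ignoring patterns
// that do not compile. Hypothetical helper, not part of the PR.
function sketch_first_valid_match($user_agent, array $patterns)
{
    foreach ($patterns as $pattern) {
        // false means the regex itself is invalid (or a PCRE error occurred);
        // the @ operator suppresses the accompanying warning.
        $result = @preg_match($pattern, $user_agent);
        if ($result === false) {
            continue; // skip the bad pattern and keep checking the rest
        }
        if ($result === 1) {
            return $pattern;
        }
    }
    return null;
}

// The first pattern has an unclosed group and cannot compile.
$patterns = array('/MyBot(/i', '/GPTBot\//i');
$hit  = sketch_first_valid_match('GPTBot/1.0', $patterns);
$miss = sketch_first_valid_match('curl/8.0', $patterns);
```

One trade-off of skipping silently is that a typo in a custom pattern produces no feedback, so callers may want to validate their patterns separately.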

Public API

| Class / Method | Description |
| --- | --- |
| BotClassifier_AiBotClassifier::__construct($additional_bots = array()) | Create classifier with optional extra patterns prepended to built-in database |
| BotClassifier_AiBotClassifier::classify($user_agent) | Classify a user-agent string; returns ['$is_ai_bot' => bool, ...] |
| BotClassifier_AiBotClassifier::createClassifier($options = array()) | Static factory; reads additional_bots key from $options |
| BotClassifier_AiBotClassifier::getBotDatabase() | Returns built-in bot list without regex patterns (safe for inspection) |
| BotClassifier_AiBotDatabase::getDatabase() | Returns raw bot pattern array (pattern, name, provider, category, description) |
| BotClassifier_AiBotDatabase::getDatabaseForInspection() | Returns bot entries without patterns (name, provider, category, description) |
| ConsumerStrategies_BotClassifyingConsumer | Consumer wrapper that enriches event batches with bot classification in persist() |
| Mixpanel::__construct($token, $options) | Now accepts bot_detection and bot_additional_patterns options |
| Mixpanel::track($event, $properties) | Automatically enriches properties when bot_detection is enabled and $user_agent is present |

Notable Design Decisions

  1. Two integration paths: The flag-based path (bot_detection => true) classifies at track() time before enqueueing, while BotClassifyingConsumer classifies at persist() time on the batch. This lets users choose early enrichment (properties visible in queue) vs. deferred enrichment (consumer-level).
  2. Anti-double-classification guard: Mixpanel::track() calls _isUsingBotClassifyingConsumer() to detect when the BotClassifyingConsumer is already configured as the consumer strategy, preventing the same event from being classified twice.
  3. Custom patterns prepended: array_merge($additional_bots, getDatabase()) puts custom patterns first so they are checked before built-in patterns, allowing users to override or extend classification.
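
The effect of prepending can be shown with plain `array_merge()`, which appends numerically indexed arrays in argument order. In this sketch, `sketch_first_name()` and the `MyGPTBotOverride` entry are hypothetical, and the entries are trimmed to two fields for brevity.

```php
<?php
// One built-in entry and one custom entry that targets the same user agent.
$built_in = array(
    array('pattern' => '/GPTBot\//i', 'name' => 'GPTBot'),
);
$custom = array(
    array('pattern' => '/GPTBot\//i', 'name' => 'MyGPTBotOverride'),
);

// Custom entries first, so a first-match loop reaches them before built-ins.
$merged = array_merge($custom, $built_in);

// Hypothetical helper: name of the first entry whose pattern matches.
function sketch_first_name(array $bots, $user_agent)
{
    foreach ($bots as $bot) {
        if (preg_match($bot['pattern'], $user_agent) === 1) {
            return $bot['name'];
        }
    }
    return null;
}

$winner   = sketch_first_name($merged, 'Mozilla/5.0 (compatible; GPTBot/1.0)');
$reversed = sketch_first_name(array_merge($built_in, $custom), 'GPTBot/1.0');
```

Swapping the argument order, as in `$reversed`, would let the built-in entry shadow the custom one, which is why the merge order matters for overrides.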

Usage Examples

Flag-Based Detection

$mp = Mixpanel::getInstance("YOUR_TOKEN", array(
    "bot_detection" => true
));

// Properties are enriched automatically in track() when $user_agent is present
$mp->track("Page View", array(
    '$user_agent' => $_SERVER['HTTP_USER_AGENT']
));
// Event properties will include $is_ai_bot, $ai_bot_name, etc. if matched

Consumer Strategy

$mp = Mixpanel::getInstance("YOUR_TOKEN", array(
    "consumer"  => "bot_classifying",
    "consumers" => array(
        "bot_classifying" => "ConsumerStrategies_BotClassifyingConsumer"
    ),
    // Optional: configure the inner consumer (default: "curl")
    "bot_classifying_inner_consumer" => "curl",
    // Optional: custom user-agent property name (default: "$user_agent")
    "bot_user_agent_property" => '$user_agent'
));

$mp->track("Page View", array(
    '$user_agent' => $_SERVER['HTTP_USER_AGENT']
));
// Classification happens at flush/persist time inside BotClassifyingConsumer::persist()
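
To make the persist-time path more concrete, here is a simplified sketch of a wrapping consumer. It is an assumption-laden stand-in, not the PR's class: `SketchRecordingConsumer` and `SketchBotClassifyingConsumer` are hypothetical, the message shape (`event` plus `properties`) is assumed, and a single hard-coded pattern replaces the real classifier.

```php
<?php
// Hypothetical inner consumer that records batches instead of sending them.
class SketchRecordingConsumer
{
    public $batches = array();

    public function persist($batch)
    {
        $this->batches[] = $batch;
        return true;
    }
}

// Hypothetical wrapper: enriches each message, then delegates to the inner consumer.
class SketchBotClassifyingConsumer
{
    private $inner;
    private $ua_property;

    public function __construct($inner, $ua_property = '$user_agent')
    {
        $this->inner = $inner;
        $this->ua_property = $ua_property;
    }

    public function persist($batch)
    {
        foreach ($batch as $i => $message) {
            // Messages without a user-agent property pass through untouched.
            if (!isset($message['properties'][$this->ua_property])) {
                continue;
            }
            $ua = $message['properties'][$this->ua_property];
            // Single hard-coded pattern standing in for the full classifier.
            $is_bot = is_string($ua) && preg_match('/GPTBot\//i', $ua) === 1;
            $batch[$i]['properties']['$is_ai_bot'] = $is_bot;
        }
        return $this->inner->persist($batch);
    }
}

$inner = new SketchRecordingConsumer();
$consumer = new SketchBotClassifyingConsumer($inner);
$ok = $consumer->persist(array(
    array('event' => 'Page View',
          'properties' => array('$user_agent' => 'GPTBot/1.0', 'page' => '/home')),
    array('event' => 'Signup', 'properties' => array('plan' => 'free')),
));
```

Wrapping rather than subclassing keeps the enrichment independent of the transport, which matches the description of a consumer that can wrap any inner consumer.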

Standalone Classification

require_once("/path/to/lib/BotClassifier/AiBotClassifier.php");

$classifier = new BotClassifier_AiBotClassifier();
// or use the factory:
// $classifier = BotClassifier_AiBotClassifier::createClassifier();

$result = $classifier->classify("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.0");
// => ['$is_ai_bot' => true, '$ai_bot_name' => 'GPTBot', '$ai_bot_provider' => 'OpenAI', '$ai_bot_category' => 'indexing']

$result = $classifier->classify("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0");
// => ['$is_ai_bot' => false]

// Inspect the built-in database (without regex patterns)
$bots = $classifier->getBotDatabase();

Custom Bot Patterns

// Custom patterns are checked before built-in patterns
$mp = Mixpanel::getInstance("YOUR_TOKEN", array(
    "bot_detection" => true,
    "bot_additional_patterns" => array(
        array(
            "pattern"  => "/MyCustomBot\//i",
            "name"     => "MyCustomBot",
            "provider" => "MyCompany",
            "category" => "indexing"
        )
    )
));

// Or with the consumer strategy
$mp = Mixpanel::getInstance("YOUR_TOKEN", array(
    "consumer"  => "bot_classifying",
    "consumers" => array(
        "bot_classifying" => "ConsumerStrategies_BotClassifyingConsumer"
    ),
    "bot_additional_patterns" => array(
        array(
            "pattern"  => "/MyCustomBot\//i",
            "name"     => "MyCustomBot",
            "provider" => "MyCompany",
            "category" => "indexing"
        )
    )
));

Files Added

  • lib/BotClassifier/AiBotClassifier.php
  • lib/BotClassifier/AiBotDatabase.php
  • lib/ConsumerStrategies/BotClassifyingConsumer.php
  • test/BotClassifier/AiBotClassifierTest.php
  • test/BotClassifier/BotClassifyingIntegrationTest.php

Files Modified

  • composer.json
  • lib/Base/MixpanelBase.php
  • lib/Mixpanel.php
  • phpunit.xml.dist
  • test/Base/MixpanelBaseProducerTest.php
  • test/ConsumerStrategies/AbstractConsumerTest.php
  • test/ConsumerStrategies/CurlConsumerTest.php
  • test/ConsumerStrategies/FileConsumerTest.php
  • test/ConsumerStrategies/SocketConsumerTest.php
  • test/MixpanelTest.php
  • test/Producers/MixpanelEventsProducerTest.php
  • test/Producers/MixpanelGroupsProducerTest.php
  • test/Producers/MixpanelPeopleProducerTest.php

Test Plan

  • All 12 AI bot user-agents correctly classified
  • Non-AI-bot user-agents return $is_ai_bot: false (Chrome, Googlebot, curl, etc.)
  • Empty string and null inputs handled gracefully
  • Case-insensitive matching works
  • Custom bot patterns checked before built-in
  • Event properties preserved through enrichment
  • No regressions in existing test suite

Commits

  • Unit tests for AiBotClassifier and integration tests for Mixpanel::track() bot detection enrichment.
  • Add BotClassifier_AiBotDatabase with 12 AI bot patterns and BotClassifier_AiBotClassifier for user-agent classification. Modify Mixpanel::track() to enrich events with bot classification properties when bot_detection is enabled.
  • Add ConsumerStrategies_BotClassifyingConsumer that wraps any consumer and enriches events with AI bot classification at persist time.
  • Update composer.json, phpunit.xml.dist, and all existing test files to use PHPUnit 9 (PHPUnit\Framework\TestCase, void return types).

Each of these commits is part of the AI bot classification feature for the PHP SDK.

Copilot AI left a comment

Pull request overview

This PR adds AI bot classification functionality to the Mixpanel PHP library, enabling automatic detection of 12 known AI crawler user-agents and enriching tracked events with classification metadata. It also upgrades the testing infrastructure from PHPUnit 5.6 to PHPUnit 9.6.

Changes:

  • Adds AI bot classification system with support for 12 AI bots (GPTBot, ClaudeBot, PerplexityBot, etc.) and optional custom patterns
  • Upgrades test infrastructure to PHPUnit 9.6 with modernized test syntax (setUp/tearDown return type declarations)
  • Provides two integration approaches: opt-in bot_detection flag or BotClassifyingConsumer wrapper

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| composer.json | Upgrades PHPUnit from 5.6 to 9.6 |
| phpunit.xml.dist | Removes deprecated PHPUnit configuration options |
| lib/Base/MixpanelBase.php | Adds bot_detection and bot_additional_patterns configuration options |
| lib/Mixpanel.php | Integrates bot classifier, adds getQueue() method, enriches track() with bot classification |
| lib/BotClassifier/AiBotDatabase.php | Defines database of 12 AI bot patterns with metadata |
| lib/BotClassifier/AiBotClassifier.php | Implements classification logic with custom pattern support |
| lib/ConsumerStrategies/BotClassifyingConsumer.php | Provides alternative consumer wrapper approach for bot classification |
| test/BotClassifier/AiBotClassifierTest.php | Comprehensive unit tests for classifier (all 12 bots, edge cases, custom patterns) |
| test/BotClassifier/BotClassifyingIntegrationTest.php | Integration tests for Mixpanel class with bot detection enabled |
| test/Base/MixpanelBaseProducerTest.php | Migrates to PHPUnit 9 syntax |
| test/ConsumerStrategies/*.php | Migrates all consumer test classes to PHPUnit 9 syntax |
| test/Producers/*.php | Migrates all producer test classes to PHPUnit 9 syntax |
| test/MixpanelTest.php | Migrates main test class to PHPUnit 9 syntax |

- Add bot property assertions in integration test (via flush + file content check)
- Add double-classification guard when using BotClassifyingConsumer
- Add invalid regex handling for custom patterns