Add AI bot classification for event enrichment #155

Open
jaredmixpanel wants to merge 4 commits into master from feature/ai-bot-classification

@jaredmixpanel jaredmixpanel commented Feb 19, 2026

Summary

Adds an AI bot classification consumer wrapper that automatically detects AI crawler requests and enriches tracked events with classification properties.

What it does

  • Classifies user-agent strings against a database of 12 known AI bots
  • Enriches events with $is_ai_bot, $ai_bot_name, $ai_bot_provider, and $ai_bot_category properties
  • Supports custom bot patterns that take priority over built-in patterns
  • Case-insensitive matching

AI Bots Detected

GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, Claude-User, Google-Extended, PerplexityBot, Bytespider, CCBot, Applebot-Extended, Meta-ExternalAgent, cohere-ai
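
The matching behaviour listed under "What it does" can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the two-entry `_BOTS` list and the `classify` function are stand-ins, `ExampleBot` is made up, and the real regex patterns are not shown in this description. Only the ClaudeBot values are taken from the standalone example further down.

```python
import re
from typing import Any, Dict, Optional

# Hypothetical excerpt of a bot database. The real one has 12 entries;
# "ExampleBot" is invented purely for illustration.
_BOTS = [
    {"pattern": re.compile(r"ClaudeBot", re.IGNORECASE),
     "name": "ClaudeBot", "provider": "Anthropic", "category": "indexing"},
    {"pattern": re.compile(r"ExampleBot", re.IGNORECASE),
     "name": "ExampleBot", "provider": "ExampleCorp", "category": "example"},
]


def classify(user_agent: Optional[str]) -> Dict[str, Any]:
    """Return classification properties for a user-agent string."""
    # None and empty strings are treated as "not a bot".
    if not user_agent:
        return {"$is_ai_bot": False}
    # The first matching pattern wins; IGNORECASE makes the match case-insensitive.
    for bot in _BOTS:
        if bot["pattern"].search(user_agent):
            return {
                "$is_ai_bot": True,
                "$ai_bot_name": bot["name"],
                "$ai_bot_provider": bot["provider"],
                "$ai_bot_category": bot["category"],
            }
    return {"$is_ai_bot": False}
```

With this sketch, `classify('mozilla/5.0 claudebot/1.0')` reports the canonical name `ClaudeBot` even though the input is lowercase, because the name comes from the database entry and not from the matched text.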

Implementation Details

Architecture

  • BotClassifyingConsumer wraps any Mixpanel consumer (Consumer, BufferedConsumer, or custom) -- uses the SDK's established consumer extension model
  • Only enriches events on the 'events' endpoint; people, groups, and imports pass through unmodified via self._base.send()
  • Transparent proxy of flush() for BufferedConsumer compatibility -- delegates to self._base.flush() when the wrapped consumer supports it
  • Classification logic lives in ai_bot_classifier.py, consumer wrapping in ai_bot_consumer.py, and framework helpers in ai_bot_helpers.py -- clean separation of concerns
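
The wrapping approach described in the bullets above might look something like the sketch below. It assumes a consumer exposes `send(endpoint, json_message, api_key=None)` and optionally `flush()`. `RecordingConsumer`, `EnrichingConsumer`, and `toy_classifier` are hypothetical stand-ins written for this illustration; they are not the PR's `BotClassifyingConsumer`.

```python
import json
from typing import Any, Dict, List, Optional, Tuple


class RecordingConsumer:
    """Stand-in for a Mixpanel consumer; records what it is asked to send."""

    def __init__(self) -> None:
        self.sent: List[Tuple[str, Dict[str, Any]]] = []
        self.flushed = False

    def send(self, endpoint: str, json_message: str,
             api_key: Optional[str] = None) -> None:
        self.sent.append((endpoint, json.loads(json_message)))

    def flush(self) -> None:
        self.flushed = True


def toy_classifier(user_agent: Optional[str]) -> Dict[str, Any]:
    # Hypothetical one-bot classifier, only here to keep the sketch self-contained.
    if user_agent and "claudebot" in user_agent.lower():
        return {"$is_ai_bot": True, "$ai_bot_name": "ClaudeBot"}
    return {"$is_ai_bot": False}


class EnrichingConsumer:
    """Wraps a base consumer and enriches messages on the 'events' endpoint."""

    def __init__(self, base, user_agent_property: str = "$user_agent") -> None:
        self._base = base
        self._ua_prop = user_agent_property

    def send(self, endpoint: str, json_message: str,
             api_key: Optional[str] = None) -> None:
        if endpoint == "events":
            event = json.loads(json_message)
            props = event.setdefault("properties", {})
            # Assumption: classification is merged even when no user-agent is
            # present; the PR description does not say what the real class does.
            props.update(toy_classifier(props.get(self._ua_prop)))
            json_message = json.dumps(event)
        # Every other endpoint is forwarded untouched.
        self._base.send(endpoint, json_message, api_key)

    def flush(self) -> None:
        # Delegate only if the wrapped consumer supports flushing.
        if hasattr(self._base, "flush"):
            self._base.flush()
```

Because the wrapper only needs `send()` (and optionally `flush()`) on the wrapped object, it composes with any consumer that follows the same interface, which is the property the architecture notes rely on.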

Public API

classify_user_agent(user_agent: Optional[str]) -> Dict[str, Any]
Classify a single user-agent string. Returns {'$is_ai_bot': False} for non-bots or None/empty input. Returns {'$is_ai_bot': True, '$ai_bot_name': ..., '$ai_bot_provider': ..., '$ai_bot_category': ...} for recognized AI bots.

create_classifier(additional_bots: Optional[List[Dict[str, Any]]] = None) -> Callable
Factory that returns a classifier function with optional additional bot patterns prepended to the built-in database. Each entry requires 'pattern' (compiled regex), 'name', 'provider', and 'category' keys.
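
As a rough sketch of how such a factory could give custom patterns priority: the code below builds the search list with the custom entries first, so the first match is always a custom one when both match. `make_classifier` and `BUILT_IN_BOTS` are hypothetical names for this illustration, and the "blocked" category is invented. The actual `create_classifier()` internals are not shown in this description.

```python
import re
from typing import Any, Callable, Dict, List, Optional

# Stand-in for the built-in database: a single entry, values from the PR's example.
BUILT_IN_BOTS: List[Dict[str, Any]] = [
    {"pattern": re.compile(r"ClaudeBot", re.IGNORECASE),
     "name": "ClaudeBot", "provider": "Anthropic", "category": "indexing"},
]


def make_classifier(
    additional_bots: Optional[List[Dict[str, Any]]] = None,
) -> Callable[[Optional[str]], Dict[str, Any]]:
    # Custom entries are placed first, so they win when several patterns match.
    bots = list(additional_bots or []) + BUILT_IN_BOTS

    def classify(user_agent: Optional[str]) -> Dict[str, Any]:
        if not user_agent:
            return {"$is_ai_bot": False}
        for bot in bots:
            if bot["pattern"].search(user_agent):
                return {
                    "$is_ai_bot": True,
                    "$ai_bot_name": bot["name"],
                    "$ai_bot_provider": bot["provider"],
                    "$ai_bot_category": bot["category"],
                }
        return {"$is_ai_bot": False}

    return classify


# Hypothetical override: relabel the category of a built-in bot.
default = make_classifier()
override = make_classifier([
    {"pattern": re.compile(r"ClaudeBot", re.IGNORECASE),
     "name": "ClaudeBot", "provider": "Anthropic", "category": "blocked"},
])
```

Here `default('ClaudeBot/1.0')` yields the built-in category, while `override('ClaudeBot/1.0')` yields the custom one, without mutating the shared built-in list.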

get_bot_database() -> List[Dict[str, str]]
Returns a copy of the built-in bot database for inspection (name, provider, category, description per entry).

BotClassifyingConsumer(base_consumer, user_agent_property='$user_agent', additional_bots=None)
Consumer wrapper. Intercepts send() calls on the 'events' endpoint, parses the JSON message, extracts the user-agent from properties[user_agent_property], classifies it, and merges classification properties back before forwarding to base_consumer.send().

Framework helpers (ai_bot_helpers.py):

  • extract_request_context_django(request) -> Dict[str, str] -- extracts $user_agent from request.META['HTTP_USER_AGENT'] and $ip from HTTP_X_FORWARDED_FOR / REMOTE_ADDR
  • extract_request_context_flask(request) -> Dict[str, str] -- extracts $user_agent from request.headers['User-Agent'] and $ip from request.remote_addr
  • extract_request_context_fastapi(request) -> Dict[str, str] -- extracts $user_agent from request.headers['user-agent'] and $ip from request.client.host

The three classifier functions and the consumer wrapper are exported from mixpanel/__init__.py via: from .ai_bot_classifier import classify_user_agent, create_classifier, get_bot_database and from .ai_bot_consumer import BotClassifyingConsumer.
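
For the Django helper described above, the extraction could be sketched as below. This is an assumption-laden illustration: `extract_context` is a hypothetical name, the request is faked with `SimpleNamespace` so the snippet runs without Django, and taking the first `X-Forwarded-For` entry and omitting missing keys are guesses about behaviour the description does not spell out.

```python
from types import SimpleNamespace
from typing import Dict


def extract_context(request) -> Dict[str, str]:
    """Build tracking properties from a Django-style request.META mapping."""
    meta = request.META
    context: Dict[str, str] = {}

    user_agent = meta.get("HTTP_USER_AGENT")
    if user_agent:
        context["$user_agent"] = user_agent

    # Assumption: prefer the first X-Forwarded-For entry (conventionally the
    # original client), falling back to REMOTE_ADDR.
    forwarded = meta.get("HTTP_X_FORWARDED_FOR", "")
    ip = forwarded.split(",")[0].strip() or meta.get("REMOTE_ADDR", "")
    if ip:
        context["$ip"] = ip
    return context


# Fake request object standing in for django.http.HttpRequest.
fake_request = SimpleNamespace(META={
    "HTTP_USER_AGENT": "Mozilla/5.0 ClaudeBot/1.0",
    "HTTP_X_FORWARDED_FOR": "203.0.113.7, 10.0.0.1",
    "REMOTE_ADDR": "10.0.0.1",
})
context = extract_context(fake_request)
```

Note that `X-Forwarded-For` is client-controlled unless a trusted proxy sets it, so an implementation along these lines is only as reliable as the deployment's proxy configuration.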

Notable Design Decisions

  1. Classification at the consumer layer, not in Mixpanel.track() -- keeps the core SDK untouched and lets users opt-in by wrapping their consumer. This follows the same extension pattern the SDK already uses for custom consumers.
  2. additional_bots checked before built-ins -- create_classifier() prepends custom patterns to the database list so user-defined patterns take priority, allowing overrides of built-in bot names or categories.
  3. Only the 'events' endpoint is enriched -- people profile updates, group updates, and imports pass through unmodified since bot classification is only meaningful for event tracking where a user-agent is present.

Usage Examples

Automatic Event Enrichment

from mixpanel import Mixpanel, Consumer
from mixpanel.ai_bot_consumer import BotClassifyingConsumer

consumer = BotClassifyingConsumer(Consumer())
mp = Mixpanel('YOUR_TOKEN', consumer=consumer)

# When $user_agent contains an AI bot string, the event is automatically
# enriched with $is_ai_bot, $ai_bot_name, $ai_bot_provider, $ai_bot_category
mp.track('user_id', 'page_view', {
    '$user_agent': request.headers.get('User-Agent'),
})

Standalone Classification

from mixpanel.ai_bot_classifier import classify_user_agent

result = classify_user_agent('Mozilla/5.0 ClaudeBot/1.0')
# {'$is_ai_bot': True, '$ai_bot_name': 'ClaudeBot', '$ai_bot_provider': 'Anthropic', '$ai_bot_category': 'indexing'}

result = classify_user_agent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0')
# {'$is_ai_bot': False}

Custom Bot Patterns

import re
from mixpanel.ai_bot_classifier import create_classifier

classifier = create_classifier(additional_bots=[
    {
        'pattern': re.compile(r'MyInternalBot/', re.IGNORECASE),
        'name': 'MyInternalBot',
        'provider': 'MyCompany',
        'category': 'internal',
    },
])

result = classifier('MyInternalBot/1.0')
# {'$is_ai_bot': True, '$ai_bot_name': 'MyInternalBot', '$ai_bot_provider': 'MyCompany', '$ai_bot_category': 'internal'}

Framework Integration (Django)

from mixpanel.ai_bot_helpers import extract_request_context_django

def my_view(request):
    mp.track('user_id', 'page_view', {
        **extract_request_context_django(request),
        'page_url': request.path,
    })

Framework Integration (Flask)

from flask import request
from mixpanel.ai_bot_helpers import extract_request_context_flask

@app.route('/page')
def page():
    mp.track('user_id', 'page_view', {
        **extract_request_context_flask(request),
        'page_url': request.path,
    })

Framework Integration (FastAPI)

from fastapi import Request
from mixpanel.ai_bot_helpers import extract_request_context_fastapi

@app.get('/page')
async def page(request: Request):
    mp.track('user_id', 'page_view', {
        **extract_request_context_fastapi(request),
        'page_url': str(request.url),
    })

Files Added

  • mixpanel/ai_bot_classifier.py -- bot database, classify_user_agent(), create_classifier(), get_bot_database()
  • mixpanel/ai_bot_consumer.py -- BotClassifyingConsumer wrapper class
  • mixpanel/ai_bot_helpers.py -- Django/Flask/FastAPI request context extractors
  • test_ai_bot_classifier.py -- classifier unit tests
  • test_ai_bot_consumer.py -- consumer wrapper unit tests

Files Modified

  • mixpanel/__init__.py -- imports classify_user_agent, create_classifier, get_bot_database, BotClassifyingConsumer

Test Plan

  • All 12 AI bot user-agents correctly classified
  • Non-AI-bot user-agents return $is_ai_bot: False (Chrome, Googlebot, curl, etc.)
  • Empty string and None inputs handled gracefully
  • Case-insensitive matching works
  • Custom bot patterns checked before built-in
  • Event properties preserved through enrichment
  • No regressions in existing test suite

Part of AI bot classification feature for Python SDK.

codecov bot commented Feb 19, 2026

Codecov Report

❌ Patch coverage is 96.42857% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 94.66%. Comparing base (b0fc5e5) to head (0b6e320).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| test_ai_bot_consumer.py | 94.40% | 4 Missing and 4 partials ⚠️ |
| mixpanel/ai_bot_classifier.py | 81.81% | 2 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #155      +/-   ##
==========================================
+ Coverage   94.28%   94.66%   +0.38%     
==========================================
  Files           9       13       +4     
  Lines        1557     1893     +336     
  Branches      101      116      +15     
==========================================
+ Hits         1468     1792     +324     
- Misses         54       60       +6     
- Partials       35       41       +6     
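
The percentages in the diff above can be reproduced from the raw counts. The short check below treats coverage as hits divided by total lines. It also approximates patch coverage as new hits over new lines, which assumes no pre-existing lines changed status; that assumption happens to reproduce the reported figure here but is not how Codecov defines patch coverage in general.

```python
def coverage(hits: int, lines: int) -> float:
    """Coverage percentage, rounded to two decimals as in the report."""
    return round(100.0 * hits / lines, 2)


# Counts copied from the coverage diff.
base = coverage(1468, 1557)   # master
head = coverage(1792, 1893)   # this PR

# Approximation: attribute all added lines and hits to the patch.
new_lines = 1893 - 1557
new_hits = 1792 - 1468
patch = 100.0 * new_hits / new_lines
uncovered = new_lines - new_hits  # missing plus partial lines in the patch
```

The `uncovered` count of 12 also matches the per-file table: 4 missing and 4 partial lines in one file, 2 and 2 in the other.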

Copilot AI left a comment

Pull request overview

This PR adds AI bot classification functionality to the Mixpanel Python SDK, enabling automatic detection and enrichment of events from AI crawler requests. The implementation follows the SDK's established patterns and provides both a core classification engine and a consumer wrapper that seamlessly integrates with the existing tracking infrastructure.

Changes:

  • Adds AI bot detection for 12 known AI crawlers (GPTBot, ClaudeBot, PerplexityBot, etc.) with extensible custom bot pattern support
  • Implements BotClassifyingConsumer wrapper that enriches events with $is_ai_bot, $ai_bot_name, $ai_bot_provider, and $ai_bot_category properties
  • Provides framework-specific helper functions for Django, Flask, and FastAPI to simplify user-agent extraction

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| mixpanel/ai_bot_classifier.py | Core classification logic with bot database and pattern matching using compiled regex patterns |
| mixpanel/ai_bot_consumer.py | Consumer wrapper that intercepts events endpoint and enriches with bot classification |
| mixpanel/ai_bot_helpers.py | Framework integration helpers for extracting user-agent from Django, Flask, and FastAPI requests |
| mixpanel/__init__.py | Exports new BotClassifyingConsumer and classification functions |
| test_ai_bot_classifier.py | Comprehensive tests for classification logic covering all 12 bots plus edge cases |
| test_ai_bot_consumer.py | Tests for consumer wrapper including property preservation, endpoint filtering, and BufferedConsumer compatibility |

Address PR review: add $ai_bot_category assertions for
Bytespider, CCBot, Applebot-Extended, Meta-ExternalAgent.