Add AI bot classification for event enrichment #155
Open
jaredmixpanel wants to merge 4 commits into master
Part of AI bot classification feature for Python SDK.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #155 +/- ##
==========================================
+ Coverage 94.28% 94.66% +0.38%
==========================================
Files 9 13 +4
Lines 1557 1893 +336
Branches 101 116 +15
==========================================
+ Hits 1468 1792 +324
- Misses 54 60 +6
- Partials 35 41 +6
Pull request overview
This PR adds AI bot classification functionality to the Mixpanel Python SDK, enabling automatic detection and enrichment of events from AI crawler requests. The implementation follows the SDK's established patterns and provides both a core classification engine and a consumer wrapper that seamlessly integrates with the existing tracking infrastructure.
Changes:
- Adds AI bot detection for 12 known AI crawlers (GPTBot, ClaudeBot, PerplexityBot, etc.) with extensible custom bot pattern support
- Implements a `BotClassifyingConsumer` wrapper that enriches events with `$is_ai_bot`, `$ai_bot_name`, `$ai_bot_provider`, and `$ai_bot_category` properties
- Provides framework-specific helper functions for Django, Flask, and FastAPI to simplify user-agent extraction
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| mixpanel/ai_bot_classifier.py | Core classification logic with bot database and pattern matching using compiled regex patterns |
| mixpanel/ai_bot_consumer.py | Consumer wrapper that intercepts events endpoint and enriches with bot classification |
| mixpanel/ai_bot_helpers.py | Framework integration helpers for extracting user-agent from Django, Flask, and FastAPI requests |
| mixpanel/__init__.py | Exports new BotClassifyingConsumer and classification functions |
| test_ai_bot_classifier.py | Comprehensive tests for classification logic covering all 12 bots plus edge cases |
| test_ai_bot_consumer.py | Tests for consumer wrapper including property preservation, endpoint filtering, and BufferedConsumer compatibility |
Address PR review: add $ai_bot_category assertions for Bytespider, CCBot, Applebot-Extended, Meta-ExternalAgent.
Summary
Adds AI bot classification consumer wrapper that automatically detects AI crawler requests and enriches tracked events with classification properties.
What it does
- Enriches tracked events with the `$is_ai_bot`, `$ai_bot_name`, `$ai_bot_provider`, and `$ai_bot_category` properties

AI Bots Detected
GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, Claude-User, Google-Extended, PerplexityBot, Bytespider, CCBot, Applebot-Extended, Meta-ExternalAgent, cohere-ai
Implementation Details
Architecture
- `BotClassifyingConsumer` wraps any Mixpanel consumer (`Consumer`, `BufferedConsumer`, or custom) -- uses the SDK's established consumer extension model
- Only enriches the `'events'` endpoint; people, groups, and imports pass through unmodified via `self._base.send()`
- Exposes `flush()` for `BufferedConsumer` compatibility -- delegates to `self._base.flush()` when the wrapped consumer supports it
- Classification logic in `ai_bot_classifier.py`, consumer wrapping in `ai_bot_consumer.py`, and framework helpers in `ai_bot_helpers.py` -- clean separation of concerns

Public API
`classify_user_agent(user_agent: Optional[str]) -> Dict[str, Any]`
Classifies a single user-agent string. Returns `{'$is_ai_bot': False}` for non-bots and for `None` or empty input. Returns `{'$is_ai_bot': True, '$ai_bot_name': ..., '$ai_bot_provider': ..., '$ai_bot_category': ...}` for recognized AI bots.

`create_classifier(additional_bots: Optional[List[Dict[str, Any]]] = None) -> Callable`
Factory that returns a classifier function with optional additional bot patterns prepended to the built-in database. Each entry requires `'pattern'` (compiled regex), `'name'`, `'provider'`, and `'category'` keys.

`get_bot_database() -> List[Dict[str, str]]`
Returns a copy of the built-in bot database for inspection (name, provider, category, description per entry).

`BotClassifyingConsumer(base_consumer, user_agent_property='$user_agent', additional_bots=None)`
Consumer wrapper. Intercepts `send()` calls on the `'events'` endpoint, parses the JSON message, extracts the user-agent from `properties[user_agent_property]`, classifies it, and merges the classification properties back before forwarding to `base_consumer.send()`.

Framework helpers (`ai_bot_helpers.py`):
- `extract_request_context_django(request) -> Dict[str, str]` -- extracts `$user_agent` from `request.META['HTTP_USER_AGENT']` and `$ip` from `HTTP_X_FORWARDED_FOR`/`REMOTE_ADDR`
- `extract_request_context_flask(request) -> Dict[str, str]` -- extracts `$user_agent` from `request.headers['User-Agent']` and `$ip` from `request.remote_addr`
- `extract_request_context_fastapi(request) -> Dict[str, str]` -- extracts `$user_agent` from `request.headers['user-agent']` and `$ip` from `request.client.host`

The classifier functions and `BotClassifyingConsumer` are exported from `mixpanel/__init__.py` via `from .ai_bot_classifier import classify_user_agent, create_classifier, get_bot_database` and `from .ai_bot_consumer import BotClassifyingConsumer`.

Notable Design Decisions
- Consumer wrapper instead of modifying `Mixpanel.track()` -- keeps the core SDK untouched and lets users opt in by wrapping their consumer. This follows the same extension pattern the SDK already uses for custom consumers.
- `additional_bots` checked before built-ins -- `create_classifier()` prepends custom patterns to the database list so user-defined patterns take priority, allowing overrides of built-in bot names or categories.
- Only the `'events'` endpoint is enriched -- people profile updates, group updates, and imports pass through unmodified, since bot classification is only meaningful for event tracking where a user-agent is present.

Usage Examples
Automatic Event Enrichment
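The original snippet for this heading did not survive on this page. As a stand-in, here is a minimal, self-contained sketch of the wrapping behavior described under Architecture. It is not the PR's code: `ClassifyingConsumerSketch`, `RecordingConsumer`, and the one-entry pattern table are hypothetical, the category value is a placeholder, and the `send(endpoint, json_message, api_key=None, api_secret=None)` signature is assumed to match the SDK's consumers.

```python
import json
import re

# Hypothetical one-entry table; the PR's built-in database covers 12 bots.
# The category value is a placeholder, not taken from the PR.
_GPTBOT = {"pattern": re.compile(r"GPTBot", re.IGNORECASE),
           "name": "GPTBot", "provider": "OpenAI", "category": "placeholder"}


def classify(user_agent):
    """Return classification properties for a user-agent string."""
    if user_agent and _GPTBOT["pattern"].search(user_agent):
        return {"$is_ai_bot": True,
                "$ai_bot_name": _GPTBOT["name"],
                "$ai_bot_provider": _GPTBOT["provider"],
                "$ai_bot_category": _GPTBOT["category"]}
    return {"$is_ai_bot": False}


class RecordingConsumer:
    """Stand-in for a Mixpanel consumer: records messages instead of sending."""

    def __init__(self):
        self.sent = []

    def send(self, endpoint, json_message, api_key=None, api_secret=None):
        self.sent.append((endpoint, json_message))


class ClassifyingConsumerSketch:
    """Enriches 'events' messages and forwards everything to the base consumer."""

    def __init__(self, base_consumer, user_agent_property="$user_agent"):
        self._base = base_consumer
        self._ua_prop = user_agent_property

    def send(self, endpoint, json_message, api_key=None, api_secret=None):
        if endpoint == "events":
            event = json.loads(json_message)
            props = event.setdefault("properties", {})
            props.update(classify(props.get(self._ua_prop)))
            json_message = json.dumps(event)
        # Non-event endpoints (people, groups, imports) pass through unchanged.
        self._base.send(endpoint, json_message, api_key, api_secret)

    def flush(self):
        # Delegate only when the wrapped consumer is buffered.
        if hasattr(self._base, "flush"):
            self._base.flush()
```

With the PR applied, the equivalent setup would presumably pass `BotClassifyingConsumer(Consumer())` as the `consumer` argument of `Mixpanel(token, consumer=...)` and include `$user_agent` in the tracked properties.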
Standalone Classification
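The snippet for this heading is also missing here. The sketch below reimplements the documented contract of `classify_user_agent()` against a three-entry table so it can run without the PR installed. The table is an illustrative subset, and the category strings are placeholders, since the PR's actual category values are not shown on this page.

```python
import re
from typing import Any, Dict, Optional

# Illustrative subset of the 12 bots listed above; categories are placeholders.
_BOT_DATABASE = [
    {"pattern": re.compile(r"GPTBot", re.IGNORECASE),
     "name": "GPTBot", "provider": "OpenAI", "category": "placeholder"},
    {"pattern": re.compile(r"ClaudeBot", re.IGNORECASE),
     "name": "ClaudeBot", "provider": "Anthropic", "category": "placeholder"},
    {"pattern": re.compile(r"PerplexityBot", re.IGNORECASE),
     "name": "PerplexityBot", "provider": "Perplexity AI", "category": "placeholder"},
]


def classify_user_agent(user_agent: Optional[str]) -> Dict[str, Any]:
    """Return AI-bot classification properties for one user-agent string."""
    if not user_agent:
        return {"$is_ai_bot": False}
    for bot in _BOT_DATABASE:
        if bot["pattern"].search(user_agent):
            return {
                "$is_ai_bot": True,
                "$ai_bot_name": bot["name"],
                "$ai_bot_provider": bot["provider"],
                "$ai_bot_category": bot["category"],
            }
    return {"$is_ai_bot": False}


print(classify_user_agent("Mozilla/5.0 (compatible; ClaudeBot/1.0)"))
print(classify_user_agent("curl/8.4.0"))  # {'$is_ai_bot': False}
print(classify_user_agent(None))          # {'$is_ai_bot': False}
```

Ordinary search crawlers such as Googlebot fall through to `{'$is_ai_bot': False}` because no pattern in the table matches them.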
Custom Bot Patterns
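No snippet survives for this heading either. The following sketch shows the prepend-and-first-match behavior described under Notable Design Decisions: custom entries are checked before built-ins, so they can both add new bots and override existing ones. The built-in table, the `MyInternalCrawler` bot, and all category strings are hypothetical.

```python
import re
from typing import Any, Callable, Dict, List, Optional

# Hypothetical single built-in entry; the category is a placeholder.
_BUILT_IN: List[Dict[str, Any]] = [
    {"pattern": re.compile(r"GPTBot", re.IGNORECASE),
     "name": "GPTBot", "provider": "OpenAI", "category": "placeholder"},
]


def create_classifier(
    additional_bots: Optional[List[Dict[str, Any]]] = None,
) -> Callable[[Optional[str]], Dict[str, Any]]:
    """Build a classifier whose custom patterns are checked before built-ins."""
    bots = list(additional_bots or []) + _BUILT_IN

    def classify(user_agent: Optional[str]) -> Dict[str, Any]:
        if not user_agent:
            return {"$is_ai_bot": False}
        for bot in bots:  # first match wins, so prepended entries take priority
            if bot["pattern"].search(user_agent):
                return {
                    "$is_ai_bot": True,
                    "$ai_bot_name": bot["name"],
                    "$ai_bot_provider": bot["provider"],
                    "$ai_bot_category": bot["category"],
                }
        return {"$is_ai_bot": False}

    return classify


classify = create_classifier(additional_bots=[
    # A new, made-up bot.
    {"pattern": re.compile(r"MyInternalCrawler", re.IGNORECASE),
     "name": "MyInternalCrawler", "provider": "Example Corp", "category": "internal"},
    # An override of the built-in GPTBot entry with a different category.
    {"pattern": re.compile(r"GPTBot", re.IGNORECASE),
     "name": "GPTBot", "provider": "OpenAI", "category": "overridden"},
])
```

Per the Public API section, the PR's `BotClassifyingConsumer` accepts the same list through its `additional_bots` parameter.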
Framework Integration (Django)
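The Django snippet is not preserved on this page. Below is a rough sketch of what a helper with the documented behavior could look like. It reads only `request.META`, so a stub object stands in for a real `HttpRequest`. Taking the first entry of `X-Forwarded-For` and omitting missing keys are assumptions, not confirmed details of the PR.

```python
from types import SimpleNamespace
from typing import Dict


def extract_request_context_django(request) -> Dict[str, str]:
    """Collect $user_agent and $ip from a Django-style request.META mapping."""
    meta = request.META
    context: Dict[str, str] = {}
    user_agent = meta.get("HTTP_USER_AGENT")
    if user_agent:
        context["$user_agent"] = user_agent
    # Assumption: prefer the first X-Forwarded-For hop, else REMOTE_ADDR.
    forwarded = meta.get("HTTP_X_FORWARDED_FOR", "")
    ip = forwarded.split(",")[0].strip() or meta.get("REMOTE_ADDR", "")
    if ip:
        context["$ip"] = ip
    return context


# Stub standing in for django.http.HttpRequest.
fake_request = SimpleNamespace(META={
    "HTTP_USER_AGENT": "Mozilla/5.0 (compatible; GPTBot/1.2)",
    "HTTP_X_FORWARDED_FOR": "203.0.113.7, 10.0.0.1",
    "REMOTE_ADDR": "10.0.0.1",
})
print(extract_request_context_django(fake_request))
```

In a view, the returned dict would typically be merged into the properties passed to `mp.track(distinct_id, event_name, properties)`, so that the wrapped consumer can find `$user_agent`.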
Framework Integration (Flask)
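The Flask snippet is missing as well. This sketch mirrors the documented lookup (`request.headers['User-Agent']` and `request.remote_addr`) against a stub object, so it runs without Flask installed. Skipping absent values is an assumption about the PR's behavior.

```python
from types import SimpleNamespace
from typing import Dict


def extract_request_context_flask(request) -> Dict[str, str]:
    """Collect $user_agent and $ip from a Flask-style request object."""
    context: Dict[str, str] = {}
    user_agent = request.headers.get("User-Agent")
    if user_agent:
        context["$user_agent"] = user_agent
    if request.remote_addr:
        context["$ip"] = request.remote_addr
    return context


# Stub standing in for flask.request. Real Flask headers are
# case-insensitive; this plain dict is not.
fake_request = SimpleNamespace(
    headers={"User-Agent": "Mozilla/5.0 (compatible; ClaudeBot/1.0)"},
    remote_addr="198.51.100.23",
)
print(extract_request_context_flask(fake_request))
```

Inside a route handler, the same call would take `flask.request` directly.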
Framework Integration (FastAPI)
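The FastAPI snippet is likewise absent. A sketch under the same assumptions follows, using a stub in place of a Starlette `Request`. The guard for a missing `request.client` is an addition for safety, since Starlette allows `client` to be `None`. Whether the PR handles that case is not shown here.

```python
from types import SimpleNamespace
from typing import Dict


def extract_request_context_fastapi(request) -> Dict[str, str]:
    """Collect $user_agent and $ip from a FastAPI/Starlette-style request."""
    context: Dict[str, str] = {}
    user_agent = request.headers.get("user-agent")
    if user_agent:
        context["$user_agent"] = user_agent
    client = getattr(request, "client", None)
    if client is not None and client.host:
        context["$ip"] = client.host
    return context


# Stub standing in for fastapi.Request.
fake_request = SimpleNamespace(
    headers={"user-agent": "Mozilla/5.0 (compatible; PerplexityBot/1.0)"},
    client=SimpleNamespace(host="192.0.2.44", port=51234),
)
print(extract_request_context_fastapi(fake_request))
```

In an endpoint that declares a `request: Request` parameter, the result can be passed along as event properties in the same way as in the Django and Flask cases.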
Files Added
- `mixpanel/ai_bot_classifier.py` -- bot database, `classify_user_agent()`, `create_classifier()`, `get_bot_database()`
- `mixpanel/ai_bot_consumer.py` -- `BotClassifyingConsumer` wrapper class
- `mixpanel/ai_bot_helpers.py` -- Django/Flask/FastAPI request context extractors
- `test_ai_bot_classifier.py` -- classifier unit tests
- `test_ai_bot_consumer.py` -- consumer wrapper unit tests

Files Modified
- `mixpanel/__init__.py` -- imports `classify_user_agent`, `create_classifier`, `get_bot_database`, `BotClassifyingConsumer`

Test Plan
- Non-AI user agents are classified as `$is_ai_bot: false` (Chrome, Googlebot, curl, etc.)