
Conversation

@theomonnom theomonnom (Member) commented Jan 17, 2026

Summary by CodeRabbit

  • New Features
    • Pluggable evaluation framework with multiple judges, JudgmentResult, EvaluationResult and JudgeGroup for scoring conversations.
    • Conversation history now records agent configuration updates, including tool diffs and initial configs.
    • Session tagging (Tagger) and telemetry now emit evaluation results and session outcome metadata.
    • Job context exposes a tagger and primary session accessor; session reporting includes tagging.
    • Example frontdesk agent runs evaluations at session end and tracks appointment state.

✏️ Tip: You can customize this high-level summary in your review settings.

@chenghao-mou chenghao-mou requested a review from a team January 17, 2026 21:12
@coderabbitai coderabbitai bot (Contributor) commented Jan 17, 2026

Warning

Rate limit exceeded

@theomonnom has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 0 minutes and 39 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 5a2c218 and baed998.

📒 Files selected for processing (2)
  • livekit-agents/livekit/agents/evals/evaluation.py
  • livekit-agents/livekit/agents/evals/judge.py
📝 Walkthrough

Walkthrough

Adds an LLM-driven evaluation subsystem (judges, JudgeGroup, EvaluationResult), records agent configuration updates into chat context, and introduces a Tagger used by JobContext and telemetry to surface evaluation outcomes.

Changes

Cohort / File(s): Summary

  • Evaluation core (livekit-agents/livekit/agents/evals/judge.py, livekit-agents/livekit/agents/evals/evaluation.py, livekit-agents/livekit/agents/evals/__init__.py): New judging framework: Judge, specialized judges (task_completion, handoff, accuracy, tool_use, safety, relevancy, coherence, conciseness), JudgmentResult, Verdict type, Evaluator protocol, EvaluationResult, JudgeGroup (concurrent execution), and exports added to the evals package __all__.
  • Chat context & LLM exports (livekit-agents/livekit/agents/llm/chat_context.py, livekit-agents/livekit/agents/llm/__init__.py, livekit-agents/livekit/agents/__init__.py): Introduces the AgentConfigUpdate model, extends the ChatItem union to include it, and re-exports AgentConfigUpdate (and AgentHandoff) from llm and the top-level agents package.
  • Agent runtime tracking (livekit-agents/livekit/agents/voice/agent_activity.py): Emits AgentConfigUpdate items on instruction/tool changes and at startup; computes tools_added/tools_removed diffs and stores full tool definitions in an in-memory _tools attribute on the update.
  • Observability & job context (livekit-agents/livekit/agents/observability.py, livekit-agents/livekit/agents/job.py): Adds a Tagger class to collect tags, evaluation data, and outcome reasons; JobContext now initializes and exposes a tagger, exposes primary_session, forwards the tagger into the session-report upload, and removes run_job.
  • Telemetry & traces (livekit-agents/livekit/agents/telemetry/traces.py): Updates the _upload_session_report signature to accept a tagger; adds a logger factory; emits evaluation and outcome telemetry when the tagger provides data; consolidates logger usage.
  • Examples (examples/frontdesk/frontdesk_agent.py): Integrates JudgeGroup into on_session_end, adds appointment_booked to userdata, and uses evaluation outcomes to mark session success/failure.

Sequence Diagram

sequenceDiagram
    participant Client as Client
    participant JG as JudgeGroup
    participant J as Judge
    participant LLM as LLM
    participant T as Tagger

    Client->>JG: evaluate(chat_ctx, reference)
    JG->>J: evaluate(chat_ctx, reference, llm)
    J->>LLM: stream(prompt with chat + instructions)
    LLM-->>J: reasoning + "Verdict: <pass|fail|maybe>"
    J-->>JG: JudgmentResult
    JG->>T: _evaluation(EvaluationResult)
    JG-->>Client: EvaluationResult (aggregate)
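For orientation, here is a condensed sketch of how this flow is wired in the frontdesk example further down. The JudgeGroup constructor, judge factories, make_session_report, majority_passed, and the Tagger calls are taken from this PR; the import paths, surrounding wiring, and the fail reason string are illustrative assumptions.

from livekit.agents import JobContext
from livekit.agents.evals import JudgeGroup, accuracy_judge, task_completion_judge


async def on_session_end(ctx: JobContext) -> None:
    report = ctx.make_session_report()

    # Run a concurrent panel of judges over the recorded conversation.
    judges = JudgeGroup(
        llm="openai/gpt-4o-mini",
        judges=[task_completion_judge(), accuracy_judge()],
    )
    result = await judges.evaluate(report.chat_history)

    # Tag the session outcome; evaluation results are also surfaced via telemetry.
    if result.majority_passed:
        ctx.tagger.success()
    else:
        ctx.tagger.fail(reason="conversation did not pass the judge panel")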

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through chats and config blooms,
I gathered verdicts in tidy rooms,
Tags and tools I tucked away,
Judges chimed at close of day,
I nibbled codes and hummed hoorays! 🎉

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): coverage is 59.38%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): check skipped because CodeRabbit’s high-level summary is enabled.
  • Title check (✅ Passed): the title accurately captures the main changes: the addition of AgentConfigUpdate and an initial judges framework for evaluating agent conversations.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@livekit-agents/livekit/agents/llm/chat_context.py`:
- Around line 213-229: AgentConfigUpdate is missing the agent_id field, so
callers that set agent_id and the formatter _format_chat_ctx that reads it can
fail; add an agent_id: str | None = None (or appropriate type) to the
AgentConfigUpdate model declaration so the value is preserved and safe to
access, and ensure the new field is included before PrivateAttr/_tools in the
AgentConfigUpdate class so ChatItem (which unions AgentConfigUpdate) will carry
agent_id correctly.
♻️ Duplicate comments (2)
livekit-agents/livekit/agents/voice/agent_activity.py (1)

314-323: agent_id field missing in AgentConfigUpdate (covered in model)

This block passes agent_id; ensure the model defines it so the value isn’t lost.

livekit-agents/livekit/agents/evals/judge.py (1)

16-40: Guard agent_id access for config updates

agent_id is referenced here; ensure the model defines it (see AgentConfigUpdate).

🧹 Nitpick comments (3)
livekit-agents/livekit/agents/voice/agent_activity.py (1)

331-349: Stabilize tool diff ordering for deterministic updates

Converting the set difference to a list produces nondeterministic ordering; sorting keeps logs and tests stable.

🔧 Proposed fix
-        tools_added = list(new_tool_names - old_tool_names) or None
-        tools_removed = list(old_tool_names - new_tool_names) or None
+        tools_added = sorted(new_tool_names - old_tool_names) or None
+        tools_removed = sorted(old_tool_names - new_tool_names) or None
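A quick illustration of the ordering concern, using hypothetical tool names rather than anything from the diff:

old_tool_names = {"schedule_appointment", "cancel_appointment"}
new_tool_names = {"schedule_appointment", "lookup_availability"}

# list() over a set difference can come back in any order across runs;
# sorted() makes the recorded diff stable for logs and test assertions.
tools_added = sorted(new_tool_names - old_tool_names) or None    # ["lookup_availability"]
tools_removed = sorted(old_tool_names - new_tool_names) or None  # ["cancel_appointment"]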
livekit-agents/livekit/agents/evals/judge.py (2)

8-13: Add a Google‑style class docstring for JudgmentResult

🔧 Proposed fix
@dataclass
 class JudgmentResult:
+    """Result of a judge evaluation.
+
+    Attributes:
+        passed: Whether the evaluation passed.
+        reasoning: Model reasoning for the judgment.
+    """
     passed: bool
     """Whether the evaluation passed."""
     reasoning: str
     """Chain-of-thought reasoning for the judgment."""
As per coding guidelines, please add Google-style docstrings.

43-87: Make PASS/FAIL parsing deterministic

rfind can be tripped by “PASS/FAIL” in the reasoning. Require a final verdict line and parse only that.

🔧 Proposed fix
         prompt_parts.extend(
             [
                 "",
                 "Does the conversation meet the criteria? Don't overthink it.",
-                "Explain your reasoning step by step, then answer Pass or Fail.",
+                "Provide a brief justification, then output a final line with exactly PASS or FAIL.",
             ]
         )
@@
-        response = "".join(response_chunks)
-
-        response_upper = response.upper()
-        pass_pos = response_upper.rfind("PASS")
-        fail_pos = response_upper.rfind("FAIL")
-        passed = pass_pos > fail_pos if pass_pos != -1 else False
+        response = "".join(response_chunks).strip()
+        last_line = response.splitlines()[-1].strip().upper() if response else ""
+        passed = last_line.startswith("PASS")
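A self-contained sketch of the last-line approach (not the parser in this PR), assuming the model is instructed to end its reply with a line containing exactly PASS or FAIL:

def parse_verdict(response: str) -> bool:
    """Return True only when the final non-empty line starts with PASS."""
    lines = [line.strip() for line in response.strip().splitlines() if line.strip()]
    return bool(lines) and lines[-1].upper().startswith("PASS")


# Mentions of PASS/FAIL inside the reasoning no longer flip the result.
assert parse_verdict("The agent could FAIL stricter users on tone.\nPASS") is True
assert parse_verdict("All PASS criteria were discussed, but not met.\nFAIL") is False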
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 853bc41 and 9af2755.

📒 Files selected for processing (6)
  • livekit-agents/livekit/agents/__init__.py
  • livekit-agents/livekit/agents/evals/__init__.py
  • livekit-agents/livekit/agents/evals/judge.py
  • livekit-agents/livekit/agents/llm/__init__.py
  • livekit-agents/livekit/agents/llm/chat_context.py
  • livekit-agents/livekit/agents/voice/agent_activity.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Format code with ruff
Run ruff linter and auto-fix issues
Run mypy type checker in strict mode
Maintain line length of 100 characters maximum
Ensure Python 3.9+ compatibility
Use Google-style docstrings

Files:

  • livekit-agents/livekit/agents/evals/__init__.py
  • livekit-agents/livekit/agents/llm/__init__.py
  • livekit-agents/livekit/agents/__init__.py
  • livekit-agents/livekit/agents/voice/agent_activity.py
  • livekit-agents/livekit/agents/evals/judge.py
  • livekit-agents/livekit/agents/llm/chat_context.py
🧬 Code graph analysis (6)
livekit-agents/livekit/agents/evals/__init__.py (1)
livekit-agents/livekit/agents/evals/judge.py (6)
  • Judge (43-87)
  • JudgmentResult (9-13)
  • accuracy_judge (112-128)
  • safety_judge (151-168)
  • task_completion_judge (90-109)
  • tool_use_judge (131-148)
livekit-agents/livekit/agents/llm/__init__.py (1)
livekit-agents/livekit/agents/llm/chat_context.py (1)
  • AgentConfigUpdate (213-224)
livekit-agents/livekit/agents/__init__.py (1)
livekit-agents/livekit/agents/llm/chat_context.py (2)
  • AgentConfigUpdate (213-224)
  • AgentHandoff (205-210)
livekit-agents/livekit/agents/voice/agent_activity.py (2)
livekit-agents/livekit/agents/llm/tool_context.py (4)
  • get_fnc_tool_names (283-292)
  • tools (44-46)
  • ToolContext (295-418)
  • flatten (320-325)
livekit-agents/livekit/agents/llm/chat_context.py (1)
  • AgentConfigUpdate (213-224)
livekit-agents/livekit/agents/evals/judge.py (4)
livekit-agents/livekit/agents/voice/agent_activity.py (1)
  • llm (2815-2819)
livekit-agents/livekit/agents/llm/chat_context.py (7)
  • ChatContext (232-670)
  • items (241-242)
  • items (245-246)
  • text_content (164-173)
  • copy (297-354)
  • copy (690-691)
  • add_message (248-281)
livekit-agents/livekit/agents/voice/agent_session.py (1)
  • output (394-395)
livekit-agents/livekit/agents/voice/agent.py (1)
  • instructions (99-104)
livekit-agents/livekit/agents/llm/chat_context.py (2)
livekit-agents/livekit/agents/utils/misc.py (1)
  • shortuuid (21-22)
livekit-agents/livekit/agents/llm/tool_context.py (1)
  • Tool (31-32)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: livekit-plugins-cartesia
  • GitHub Check: livekit-plugins-deepgram
  • GitHub Check: livekit-plugins-inworld
  • GitHub Check: livekit-plugins-openai
  • GitHub Check: livekit-plugins-elevenlabs
  • GitHub Check: unit-tests
  • GitHub Check: type-check (3.13)
  • GitHub Check: type-check (3.9)
🔇 Additional comments (11)
livekit-agents/livekit/agents/llm/__init__.py (2)

1-15: LGTM: AgentConfigUpdate re-exported from llm


55-69: LGTM: __all__ updated to include AgentConfigUpdate

livekit-agents/livekit/agents/__init__.py (2)

39-49: LGTM: top-level imports updated


117-156: LGTM: __all__ export list updated

livekit-agents/livekit/agents/voice/agent_activity.py (2)

18-25: LGTM: tool diff helpers wired in


603-611: LGTM: initial config snapshot recorded

livekit-agents/livekit/agents/evals/judge.py (4)

90-109: LGTM: task completion judge instructions are clear


112-128: LGTM: accuracy judge instructions look solid


131-148: LGTM: tool-use judge instructions look solid


151-167: LGTM: safety judge instructions look solid

livekit-agents/livekit/agents/evals/__init__.py (1)

1-17: LGTM: judge APIs re-exported

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment on lines +213 to 229
class AgentConfigUpdate(BaseModel):
    id: str = Field(default_factory=lambda: utils.shortuuid("item_"))
    type: Literal["agent_config_update"] = Field(default="agent_config_update")

    instructions: str | None = None
    tools_added: list[str] | None = None
    tools_removed: list[str] | None = None

    created_at: float = Field(default_factory=time.time)

    _tools: list[Tool] = PrivateAttr(default_factory=list)
    """Full tool definitions (in-memory only, not serialized)."""


ChatItem = Annotated[
-    Union[ChatMessage, FunctionCall, FunctionCallOutput, AgentHandoff], Field(discriminator="type")
+    Union[ChatMessage, FunctionCall, FunctionCallOutput, AgentHandoff, AgentConfigUpdate],
+    Field(discriminator="type"),
]
@coderabbitai coderabbitai bot (Contributor) commented:

⚠️ Potential issue | 🟠 Major

Add agent_id to AgentConfigUpdate to match callers and formatter

Call sites set agent_id and _format_chat_ctx reads it, but the model doesn’t declare it. Add the field so the value is preserved and attribute access is safe.

🔧 Proposed fix
 class AgentConfigUpdate(BaseModel):
     id: str = Field(default_factory=lambda: utils.shortuuid("item_"))
     type: Literal["agent_config_update"] = Field(default="agent_config_update")
 
+    agent_id: str | None = None
     instructions: str | None = None
     tools_added: list[str] | None = None
     tools_removed: list[str] | None = None
🤖 Prompt for AI Agents
In `@livekit-agents/livekit/agents/llm/chat_context.py` around lines 213 - 229,
AgentConfigUpdate is missing the agent_id field, so callers that set agent_id
and the formatter _format_chat_ctx that reads it can fail; add an agent_id: str
| None = None (or appropriate type) to the AgentConfigUpdate model declaration
so the value is preserved and safe to access, and ensure the new field is
included before PrivateAttr/_tools in the AgentConfigUpdate class so ChatItem
(which unions AgentConfigUpdate) will carry agent_id correctly.
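For context, a sketch of how a call site would look once agent_id is declared on the model; the field names mirror the snippet above, while the values, the assumed existing ChatContext named chat_ctx, and the append call are illustrative.

item = AgentConfigUpdate(
    agent_id="agent_frontdesk",  # preserved only once the field exists on the model
    instructions="Greet the caller and offer available appointment slots.",
    tools_added=["schedule_appointment"],
)
chat_ctx.items.append(item)  # the ChatItem union now accepts agent_config_update items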

created_at: float = Field(default_factory=time.time)


class AgentConfigUpdate(BaseModel):
Member:

I am still very fuzzy about the name. Judging by the name, I would assume it is related to changing stt/llm/tts of the agent.

Member Author:

yeah.. do you have any name suggestion?

Member:

Maybe LLMConfigUpdate?

@theomonnom theomonnom requested a review from a team January 21, 2026 19:43
@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
livekit-agents/livekit/agents/telemetry/traces.py (1)

356-365: Custom Tagger tags are never emitted.
tagger.add/remove won’t surface anywhere because tagger.tags isn’t logged or uploaded. This makes the new tagging API effectively a no-op for custom tags.

🐛 Suggested fix: include tags in the session report log
     _log(
         chat_logger,
         body="session report",
         timestamp=int((report.started_at or report.timestamp or 0) * 1e9),
         attributes={
             "session.options": vars(report.options),
             "session.report_timestamp": report.timestamp,
             "agent_name": agent_name,
+            "session.tags": sorted(tagger.tags),
         },
     )

Also applies to: 385-412

🤖 Fix all issues with AI agents
In `@livekit-agents/livekit/agents/evals/evaluation.py`:
- Around line 61-66: The majority_passed property compares a fractional score to
a count, which is wrong; update the logic in majority_passed (in evaluation.py)
to compare like-for-like: either check if self.score > 0.5 (since score is in
[0,1]) or compute passed_count = self.score * len(self.judgments) and compare
passed_count > len(self.judgments)/2; keep the existing empty-judgments shortcut
(return True when not self.judgments).
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9af2755 and 754ce38.

📒 Files selected for processing (7)
  • examples/frontdesk/frontdesk_agent.py
  • livekit-agents/livekit/agents/evals/__init__.py
  • livekit-agents/livekit/agents/evals/evaluation.py
  • livekit-agents/livekit/agents/evals/judge.py
  • livekit-agents/livekit/agents/job.py
  • livekit-agents/livekit/agents/observability.py
  • livekit-agents/livekit/agents/telemetry/traces.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • livekit-agents/livekit/agents/evals/__init__.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Format code with ruff
Run ruff linter and auto-fix issues
Run mypy type checker in strict mode
Maintain line length of 100 characters maximum
Ensure Python 3.9+ compatibility
Use Google-style docstrings

Files:

  • livekit-agents/livekit/agents/observability.py
  • livekit-agents/livekit/agents/job.py
  • examples/frontdesk/frontdesk_agent.py
  • livekit-agents/livekit/agents/telemetry/traces.py
  • livekit-agents/livekit/agents/evals/judge.py
  • livekit-agents/livekit/agents/evals/evaluation.py
🧬 Code graph analysis (5)
livekit-agents/livekit/agents/observability.py (1)
livekit-agents/livekit/agents/evals/evaluation.py (2)
  • EvaluationResult (32-71)
  • name (18-20)
livekit-agents/livekit/agents/job.py (1)
livekit-agents/livekit/agents/observability.py (1)
  • Tagger (9-101)
examples/frontdesk/frontdesk_agent.py (4)
livekit-agents/livekit/agents/evals/evaluation.py (4)
  • JudgeGroup (73-165)
  • judges (123-125)
  • evaluate (22-28)
  • evaluate (127-165)
livekit-agents/livekit/agents/job.py (3)
  • userdata (662-663)
  • make_session_report (269-304)
  • tagger (255-267)
livekit-agents/livekit/agents/voice/agent_session.py (2)
  • userdata (371-375)
  • userdata (378-379)
livekit-agents/livekit/agents/observability.py (2)
  • success (37-46)
  • fail (48-57)
livekit-agents/livekit/agents/telemetry/traces.py (1)
livekit-agents/livekit/agents/observability.py (3)
  • Tagger (9-101)
  • evaluations (81-83)
  • outcome_reason (86-88)
livekit-agents/livekit/agents/evals/evaluation.py (2)
livekit-agents/livekit/agents/evals/judge.py (10)
  • JudgmentResult (15-34)
  • name (148-149)
  • name (199-200)
  • name (267-268)
  • evaluate (151-185)
  • evaluate (202-252)
  • evaluate (270-318)
  • passed (22-24)
  • uncertain (32-34)
  • failed (27-29)
livekit-agents/livekit/agents/job.py (4)
  • job (323-325)
  • job (692-693)
  • get_job_context (57-64)
  • tagger (255-267)
🔇 Additional comments (30)
livekit-agents/livekit/agents/job.py (3)

41-41: Tagger initialization looks solid.
Creates a per-job Tagger instance and wires it into context state cleanly.

Also applies to: 186-187


254-267: Nice public Tagger accessor.
Clear docstring and straightforward API surface.


225-231: All _upload_session_report call sites have been updated correctly.
Only one call site exists in the codebase (job.py:225), and it includes the required tagger argument along with all other parameters.

examples/frontdesk/frontdesk_agent.py (4)

28-38: Eval judge imports look good.


45-48: Userdata flag default is fine.


111-112: Marks booking success appropriately.


176-204: No changes needed. The code is correct and follows the recommended pattern from the library documentation.

  • ChatContext.copy() does support both exclude_function_call and exclude_instructions parameters as used in the code.
  • The pattern of calling judges.evaluate() without error handling matches the official JudgeGroup docstring example, which explicitly documents this as the recommended approach. Results are automatically tagged to the session.

Likely an incorrect or invalid review comment.

livekit-agents/livekit/agents/telemetry/traces.py (1)

320-383: Session-scoped logger reuse is a solid improvement.
Keeps consistent logger attributes across session report and chat-item logs.

livekit-agents/livekit/agents/observability.py (5)

32-35: State initialization is clear and minimal.


37-57: Outcome tagging logic is straightforward.


59-73: Add/remove tag helpers are clean.


75-88: Accessors return copies as expected.


90-101: Evaluation tagging hook integrates cleanly.

livekit-agents/livekit/agents/evals/evaluation.py (6)

14-28: Evaluator protocol reads well.


38-49: Score computation is consistent with the docstring.


51-60: Pass/fail helpers are clear.


68-71: none_failed helper looks good.


95-115: JudgeGroup initialization is straightforward.


127-165: Concurrent evaluation flow looks good.

livekit-agents/livekit/agents/evals/judge.py (11)

14-34: JudgmentResult definition is clear.


37-65: Chat item formatting covers all relevant item types.


68-70: ChatContext formatting helper is concise.


73-81: Latest-instructions extraction is sensible.


84-86: Handoff detection helper is fine.


89-110: Verdict parsing logic looks reasonable.


113-136: LLM evaluation flow is clean and readable.


139-185: Generic Judge evaluation path is solid.


188-253: Task completion judge prompt flow is well-structured.


255-318: Handoff judge logic is clear and pragmatic.


321-500: Built-in judge factories are well documented and consistent.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment on lines +61 to +66
    @property
    def majority_passed(self) -> bool:
        """True if more than half of the judgments passed."""
        if not self.judgments:
            return True
        return self.score > len(self.judgments) / 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

majority_passed is mathematically incorrect.
self.score is in [0,1], but it’s compared against len(self.judgments)/2 (≥1 for 2+ judges), so this almost always returns False.

🐛 Suggested fix
     def majority_passed(self) -> bool:
         """True if more than half of the judgments passed."""
         if not self.judgments:
             return True
-        return self.score > len(self.judgments) / 2
+        passed = sum(1 for j in self.judgments.values() if j.passed)
+        return passed > len(self.judgments) / 2
🤖 Prompt for AI Agents
In `@livekit-agents/livekit/agents/evals/evaluation.py` around lines 61 - 66, The
majority_passed property compares a fractional score to a count, which is wrong;
update the logic in majority_passed (in evaluation.py) to compare like-for-like:
either check if self.score > 0.5 (since score is in [0,1]) or compute
passed_count = self.score * len(self.judgments) and compare passed_count >
len(self.judgments)/2; keep the existing empty-judgments shortcut (return True
when not self.judgments).
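A concrete case of the mismatch, using plain booleans in place of JudgmentResult for brevity: with three judges and two passes, score is about 0.67 while len(judgments) / 2 is 1.5, so the current comparison returns False even though a majority passed.

judgments = {"task_completion": True, "accuracy": True, "safety": False}
score = sum(judgments.values()) / len(judgments)  # 0.666..., always in [0, 1]

buggy = score > len(judgments) / 2  # 0.666... > 1.5 -> False
fixed = score > 0.5                 # True, matches the "more than half" intent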

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
livekit-agents/livekit/agents/job.py (1)

220-231: Tags silently fail to upload when recording is disabled, contradicting documentation.

The Tagger API documentation states "Tags are uploaded to LiveKit Cloud at session end" without condition, but _upload_session_report() (which uploads tags along with session report) is only invoked when report.enable_recording is true. This means tags are silently dropped whenever recording is disabled, creating a mismatch between documented behavior and actual behavior. Either upload tags independently of the recording flag or update documentation to clarify tags only upload when recording is enabled.

🤖 Fix all issues with AI agents
In `@examples/frontdesk/frontdesk_agent.py`:
- Around line 176-204: The early return in on_session_end prevents tagging short
sessions; instead of returning when len(chat.items) < 3, skip running JudgeGroup
evaluation but still call the tagging logic: create the report
(ctx.make_session_report()), skip or bypass judges.evaluate(...) when the chat
is short, then call ctx.tagger.success() if
ctx.primary_session.userdata.appointment_booked is true or ctx.tagger.fail(...)
otherwise, ensuring the tagging runs regardless of whether judges were executed.

In `@livekit-agents/livekit/agents/evals/judge.py`:
- Around line 1-10: The import block is unsorted causing Ruff I001; reorder the
imports into proper groups: keep "from __future__ import annotations" first,
then standard library imports (re, dataclasses/dataclass, typing/Any, Literal)
sorted alphabetically, then third-party (none here), then local/package imports
sorted alphabetically — specifically ensure "from ..llm import LLM,
ChatContext", "from ..log import logger", and "from ..types import NOT_GIVEN,
NotGivenOr" are in the local imports group and sorted; run ruff to verify the
import order is now compliant.
🧹 Nitpick comments (2)
livekit-agents/livekit/agents/job.py (2)

254-267: Align the new Tagger docstring with Google-style format.

This docstring adds a new public API; please standardize it with Returns / Examples sections for consistency.

♻️ Proposed docstring format
     def tagger(self) -> Tagger:
         """Returns the Tagger for adding tags and outcomes to the session.
 
-        Tags are uploaded to LiveKit Cloud at session end.
-
-        Example:
-            ```python
-            ctx.tagger.success(reason="Task completed successfully")
-            ctx.tagger.fail(reason="User hung up before completing")
-            ctx.tagger.add("voicemail:true")
-            ```
+        Notes:
+            Tags are uploaded to LiveKit Cloud at session end.
+
+        Returns:
+            Tagger: Tagger instance scoped to this job.
+
+        Examples:
+            ctx.tagger.success(reason="Task completed successfully")
+            ctx.tagger.fail(reason="User hung up before completing")
+            ctx.tagger.add("voicemail:true")
         """
         return self._tagger

As per coding guidelines, please use Google-style docstrings.


345-351: Document primary_session return/raise behavior in Google style.

Since the accessor raises when unset, the docstring should include a Raises section for clarity and compliance.

♻️ Proposed docstring format
     def primary_session(self) -> AgentSession:
-        """Returns the primary AgentSession for this job."""
+        """Returns the primary AgentSession for this job.
+
+        Returns:
+            AgentSession: The primary session instance.
+
+        Raises:
+            RuntimeError: If no AgentSession was started for this job.
+        """
         if not self._primary_agent_session:
             raise RuntimeError("No AgentSession was started for this job")
         return self._primary_agent_session

As per coding guidelines, please use Google-style docstrings.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 754ce38 and dfd036f.

📒 Files selected for processing (3)
  • examples/frontdesk/frontdesk_agent.py
  • livekit-agents/livekit/agents/evals/judge.py
  • livekit-agents/livekit/agents/job.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Format code with ruff
Run ruff linter and auto-fix issues
Run mypy type checker in strict mode
Maintain line length of 100 characters maximum
Ensure Python 3.9+ compatibility
Use Google-style docstrings

Files:

  • examples/frontdesk/frontdesk_agent.py
  • livekit-agents/livekit/agents/evals/judge.py
  • livekit-agents/livekit/agents/job.py
🪛 GitHub Actions: CI
livekit-agents/livekit/agents/evals/judge.py

[error] 1-1: Ruff I001 Import block is un-sorted or un-formatted.

🪛 GitHub Check: ruff
livekit-agents/livekit/agents/evals/judge.py

[failure] 1-9: Ruff (I001)
livekit-agents/livekit/agents/evals/judge.py:1:1: I001 Import block is un-sorted or un-formatted

🔇 Additional comments (23)
livekit-agents/livekit/agents/evals/judge.py (18)

11-34: Clear verdict modeling and helpers.

The Verdict alias plus JudgmentResult helpers are concise and easy to consume.


37-65: Chat item formatting is consistent and readable.

Good coverage across message, tool, handoff, and config update item types.


68-70: Nice, minimal wrapper for chat context formatting.


73-81: Latest-instructions lookup is straightforward and correct.


84-86: Handoff detection is clean and efficient.


89-110: Verdict parsing logic is clear and well-structured.


113-136: LLM evaluation flow is well-scoped and deterministic.


139-185: Judge base class interface is clean and reusable.


188-253: Task-completion judge prompt construction looks solid.


255-318: Handoff judge behavior and prompt are consistent.


321-332: Factory: task_completion_judge is clear and well-scoped.


334-345: Factory: handoff_judge is clear and well-scoped.


347-365: Factory: accuracy_judge criteria are precise and actionable.


367-387: Factory: tool_use_judge criteria are comprehensive.


389-408: Factory: safety_judge criteria are appropriate for regulated contexts.


410-429: Factory: relevancy_judge criteria are clear and testable.


431-448: Factory: coherence_judge criteria are clear and concise.


451-468: Factory: conciseness_judge criteria are well-calibrated for voice.

examples/frontdesk/frontdesk_agent.py (3)

28-38: Evals import additions are clean and localized.


45-48: appointment_booked flag is a useful session signal.


102-114: Booking flag is set at the right point in the flow.

livekit-agents/livekit/agents/job.py (2)

41-42: LGTM for Tagger import.

Clean, scoped addition for the new observability capability.


186-187: LGTM for per-context Tagger initialization.

Keeping a Tagger instance on the JobContext is a clean lifecycle boundary.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment on lines 176 to 204
async def on_session_end(ctx: JobContext) -> None:
    # import json
    report = ctx.make_session_report()

    # Skip evaluation for very short conversations
    chat = report.chat_history.copy(exclude_function_call=True, exclude_instructions=True)
    if len(chat.items) < 3:
        return

    judges = JudgeGroup(
        llm="openai/gpt-4o-mini",
        judges=[
            task_completion_judge(),
            accuracy_judge(),
            tool_use_judge(),
            handoff_judge(),
            safety_judge(),
            relevancy_judge(),
            coherence_judge(),
            conciseness_judge(),
        ],
    )

    # report = ctx.make_session_report()
    # report_json = json.dumps(report.to_cloud_data(), indent=2)
    await judges.evaluate(report.chat_history)

    pass
    if ctx.primary_session.userdata.appointment_booked:
        ctx.tagger.success()
    else:
        ctx.tagger.fail(reason="Appointment was not booked")

@coderabbitai coderabbitai bot (Contributor) commented Jan 21, 2026

⚠️ Potential issue | 🟠 Major

Short-session early return skips tagging outcomes.

At Line 180–182, returning early prevents ctx.tagger.success() / fail(...) from being called, so short sessions never get tagged—even if a booking completed. Consider skipping evaluation but still tagging.

🔧 Suggested restructure
-    if len(chat.items) < 3:
-        return
-
-    judges = JudgeGroup(
-        llm="openai/gpt-4o-mini",
-        judges=[
-            task_completion_judge(),
-            accuracy_judge(),
-            tool_use_judge(),
-            handoff_judge(),
-            safety_judge(),
-            relevancy_judge(),
-            coherence_judge(),
-            conciseness_judge(),
-        ],
-    )
-
-    await judges.evaluate(report.chat_history)
+    if len(chat.items) >= 3:
+        judges = JudgeGroup(
+            llm="openai/gpt-4o-mini",
+            judges=[
+                task_completion_judge(),
+                accuracy_judge(),
+                tool_use_judge(),
+                handoff_judge(),
+                safety_judge(),
+                relevancy_judge(),
+                coherence_judge(),
+                conciseness_judge(),
+            ],
+        )
+
+        await judges.evaluate(report.chat_history)
 
     if ctx.primary_session.userdata.appointment_booked:
         ctx.tagger.success()
     else:
         ctx.tagger.fail(reason="Appointment was not booked")
🤖 Prompt for AI Agents
In `@examples/frontdesk/frontdesk_agent.py` around lines 176 - 204, The early
return in on_session_end prevents tagging short sessions; instead of returning
when len(chat.items) < 3, skip running JudgeGroup evaluation but still call the
tagging logic: create the report (ctx.make_session_report()), skip or bypass
judges.evaluate(...) when the chat is short, then call ctx.tagger.success() if
ctx.primary_session.userdata.appointment_booked is true or ctx.tagger.fail(...)
otherwise, ensuring the tagging runs regardless of whether judges were executed.

Member Author:

it's fine bro

@coderabbitai coderabbitai bot (Contributor) commented:

@theomonnom, understood! I'll leave the implementation as is. 👍

Comment on lines 1 to 10
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Any, Literal

from ..llm import LLM, ChatContext
from ..types import NOT_GIVEN, NotGivenOr
from ..log import logger

@coderabbitai coderabbitai bot (Contributor) commented:

⚠️ Potential issue | 🟠 Major

Fix import order to satisfy Ruff I001 (CI is failing).

Ruff reports an unsorted import block at Line 1. Please sort the imports to pass CI.

🧹 Proposed import order
-from __future__ import annotations
-
-import re
-from dataclasses import dataclass
-from typing import Any, Literal
-
-from ..llm import LLM, ChatContext
-from ..types import NOT_GIVEN, NotGivenOr
-from ..log import logger
+from __future__ import annotations
+
+import re
+from dataclasses import dataclass
+from typing import Any, Literal
+
+from ..llm import ChatContext, LLM
+from ..log import logger
+from ..types import NOT_GIVEN, NotGivenOr

As per coding guidelines, please run ruff and keep import blocks sorted.

🧰 Tools
🪛 GitHub Actions: CI

[error] 1-1: Ruff I001 Import block is un-sorted or un-formatted.

🪛 GitHub Check: ruff

[failure] 1-9: Ruff (I001)
livekit-agents/livekit/agents/evals/judge.py:1:1: I001 Import block is un-sorted or un-formatted

🤖 Prompt for AI Agents
In `@livekit-agents/livekit/agents/evals/judge.py` around lines 1 - 10, The import
block is unsorted causing Ruff I001; reorder the imports into proper groups:
keep "from __future__ import annotations" first, then standard library imports
(re, dataclasses/dataclass, typing/Any, Literal) sorted alphabetically, then
third-party (none here), then local/package imports sorted alphabetically —
specifically ensure "from ..llm import LLM, ChatContext", "from ..log import
logger", and "from ..types import NOT_GIVEN, NotGivenOr" are in the local
imports group and sorted; run ruff to verify the import order is now compliant.
