Add prediction variable name customization to LLM as Judge #1622
Draft
martinscooper wants to merge 1 commit into main from …
Conversation
martinscooper (Contributor, Author):
Example usage:

```python
from typing import Any, List

from unitxt import evaluate, load_dataset
from unitxt.blocks import Task, TaskCard
from unitxt.llm_as_judge import CreateCriteriaFromString
from unitxt.loaders import LoadFromDictionary
from unitxt.templates import NullTemplate

data = {
    "test": [
        {
            "question": "How is the weather?",
            "judgement": "The temperature is described in both Fahrenheit and Celsius.",
            "response_variable_name": "assistant response",
        },
        {
            "question": "Tell me a joke about cats",
            "judgement": "Is the response funny?",
            "response_variable_name": "joke",
        },
    ]
}

card = TaskCard(
    loader=LoadFromDictionary(data=data, data_classification_policy=["public"]),
    preprocess_steps=[
        CreateCriteriaFromString(field="judgement", to_field="criteria"),
    ],
    task=Task(
        input_fields={"question": str, "response_variable_name": str},
        reference_fields={"criteria": Any},
        prediction_type=List[str],
        metrics=[
            "metrics.llm_as_judge.pairwise.watsonx.llama3_1_70b[context_fields=question,criteria_field=criteria,response_variable_name_field=response_variable_name,include_prompts_in_result=True]"
        ],
        default_template=NullTemplate(),
    ),
)

test_dataset = load_dataset(card=card, split="test")

predictions = [
    [
        """On most days, the weather is warm and humid, with temperatures often soaring into the high 80s and low 90s Fahrenheit (around 31-34°C). The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.""",
        """On most days, the weather is warm and humid, with temperatures often soaring into the high 80s and low 90s Fahrenheit. The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.""",
        """On most days, the weather is warm and humid. The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.""",
    ],
    [
        """Why did the cat cross the road? To cat to the other side.""",
        """Why did the cat sit on the computer? Because it wanted to keep an eye on the mouse!""",
        """What is red, yellow and green? A traffic light.""",
    ],
]

results = evaluate(predictions=predictions, data=test_dataset)

print("Global Scores:")
print(results.global_scores.summary)
print("Instance Scores:")
print(results.instance_scores.summary)
```
martinscooper (Contributor, Author):
Assessment prompt for `"response_variable_name": "assistant response"`:

```
[{'role': 'user', 'content': 'You are provided a pair of assistant responses (Assistant response 1 and Assistant response 2) generated subject to a context.\nYou will choose the better quality assistant response subject to the evaluation criteria.\n\nThis is the context:\nquestion: How is the weather?\n\nThis is the evaluation criteria:\n\nThe temperature is described in both Fahrenheit and Celsius.\n\nAssistant response 1:\nOn most days, the weather is warm and humid, with temperatures often soaring into the high 80s and low 90s Fahrenheit (around 31-34°C). The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.\nAssistant response 2:\nOn most days, the weather is warm and humid, with temperatures often soaring into the high 80s and low 90s Fahrenheit. The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.\n\nKeeping the evaluation criteria in mind, briefly assess which assistant response is better.\nFocus on the evaluation criteria during assessment, do not provide a general assessment.\nAssessment:\n\nLets think step by step '}]
```

Assessment prompt for `"response_variable_name": "joke"`:

```
[{'role': 'user', 'content': 'You are provided a pair of jokes (Joke 1 and Joke 2) generated subject to a context.\nYou will choose the better quality joke subject to the evaluation criteria.\n\nThis is the context:\nquestion: Tell me a joke about cats\n\nThis is the evaluation criteria:\n\nIs the response funny?\n\nJoke 1:\nWhy did the cat cross the road? To cat to the other side.\nJoke 2:\nWhy did the cat sit on the computer? Because it wanted to keep an eye on the mouse!\n\nKeeping the evaluation criteria in mind, briefly assess which joke is better.\nFocus on the evaluation criteria during assessment, do not provide a general assessment.\nAssessment:\n\nLets think step by step '}]
```
martinscooper (Contributor, Author):
Note that the evaluator's prompts use the plural form of the response variable name. Currently I just append an 's' to it, but this fails for names with irregular plurals, e.g. 'summary' → 'summaries'.
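One possible way to handle this (just a sketch, not part of this PR) would be to pluralize the name with the `inflect` package rather than appending 's':

```python
# Sketch only: pluralize the response variable name with the `inflect`
# package instead of naively appending 's'. Not part of this PR.
import inflect

_engine = inflect.engine()

def pluralize(name: str) -> str:
    """Return the plural of a response variable name, e.g. 'summary' -> 'summaries'."""
    return _engine.plural(name)

print(pluralize("summary"))   # summaries
print(pluralize("response"))  # responses
print(pluralize("joke"))      # jokes
```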
Member:
I think the main point is whether it makes a difference in performance (because it complicates the API and the interface). |
This PR implements customization of the prediction name, allowing users to give the evaluator more information about what exactly is being evaluated.

Currently, the prompts refer to a generic 'response'. This PR allows the prompt to refer to other words such as 'text', 'prediction', 'assistant_response', 'summary', etc.

The customization can be done in two ways:

- setting `LLMJudge.response_variable_name`, which defaults to `'response'` so the change is backward compatible, or
- setting `LLMJudge.response_variable_name_field` and including a key whose name is the value of `response_variable_name_field` in every `task_data` instance.
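For illustration, the two options might look roughly like this (a sketch based on the example above; the argument names follow this PR's description, but the exact metric-string syntax for the first option is assumed):

```python
# Option 1 (sketch): set the name directly on the judge metric, so the same
# prediction name is used for every instance.
metric_fixed_name = (
    "metrics.llm_as_judge.pairwise.watsonx.llama3_1_70b"
    "[context_fields=question,criteria_field=criteria,"
    "response_variable_name=summary]"
)

# Option 2 (sketch): point the judge at a per-instance field, so each
# task_data instance can carry its own name under that key (as in the
# example usage earlier in this thread).
metric_per_instance_name = (
    "metrics.llm_as_judge.pairwise.watsonx.llama3_1_70b"
    "[context_fields=question,criteria_field=criteria,"
    "response_variable_name_field=response_variable_name]"
)
```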