Add prediction variable name customization to LLM as Judge #1622
Draft
martinscooper wants to merge 1 commit into main from …
Conversation
martinscooper (Contributor, Author):
Example usage:

```python
from typing import Any, List

from unitxt import evaluate, load_dataset
from unitxt.blocks import Task, TaskCard
from unitxt.llm_as_judge import CreateCriteriaFromString
from unitxt.loaders import LoadFromDictionary
from unitxt.templates import NullTemplate

data = {
    "test": [
        {
            "question": "How is the weather?",
            "judgement": "The temperature is described in both Fahrenheit and Celsius.",
            "response_variable_name": "assistant response",
        },
        {
            "question": "Tell me a joke about cats",
            "judgement": "Is the response funny?",
            "response_variable_name": "joke",
        },
    ]
}

card = TaskCard(
    loader=LoadFromDictionary(data=data, data_classification_policy=["public"]),
    preprocess_steps=[
        CreateCriteriaFromString(field="judgement", to_field="criteria"),
    ],
    task=Task(
        input_fields={"question": str, "response_variable_name": str},
        reference_fields={"criteria": Any},
        prediction_type=List[str],
        metrics=[
            "metrics.llm_as_judge.pairwise.watsonx.llama3_1_70b[context_fields=question,criteria_field=criteria,response_variable_name_field=response_variable_name,include_prompts_in_result=True]"
        ],
        default_template=NullTemplate(),
    ),
)

test_dataset = load_dataset(card=card, split="test")

predictions = [
    [
        """On most days, the weather is warm and humid, with temperatures often soaring into the high 80s and low 90s Fahrenheit (around 31-34°C). The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.""",
        """On most days, the weather is warm and humid, with temperatures often soaring into the high 80s and low 90s Fahrenheit. The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.""",
        """On most days, the weather is warm and humid. The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.""",
    ],
    [
        """Why did the cat cross the road? To cat to the other side.""",
        """Why did the cat sit on the computer? Because it wanted to keep an eye on the mouse!""",
        """What is red, yellow and green? A traffic light.""",
    ],
]

results = evaluate(predictions=predictions, data=test_dataset)

print("Global Scores:")
print(results.global_scores.summary)
print("Instance Scores:")
print(results.instance_scores.summary)
```
martinscooper (Contributor, Author):
Assessment prompt for `"response_variable_name": "assistant response"`:

```
[{'role': 'user', 'content': 'You are provided a pair of assistant responses (Assistant response 1 and Assistant response 2) generated subject to a context.\nYou will choose the better quality assistant response subject to the evaluation criteria.\n\nThis is the context:\nquestion: How is the weather?\n\nThis is the evaluation criteria:\n\nThe temperature is described in both Fahrenheit and Celsius.\n\nAssistant response 1:\nOn most days, the weather is warm and humid, with temperatures often soaring into the high 80s and low 90s Fahrenheit (around 31-34°C). The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.\nAssistant response 2:\nOn most days, the weather is warm and humid, with temperatures often soaring into the high 80s and low 90s Fahrenheit. The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.\n\nKeeping the evaluation criteria in mind, briefly assess which assistant response is better.\nFocus on the evaluation criteria during assessment, do not provide a general assessment.\nAssessment:\n\nLets think step by step '}]
```

Assessment prompt for `"response_variable_name": "joke"`:

```
[{'role': 'user', 'content': 'You are provided a pair of jokes (Joke 1 and Joke 2) generated subject to a context.\nYou will choose the better quality joke subject to the evaluation criteria.\n\nThis is the context:\nquestion: Tell me a joke about cats\n\nThis is the evaluation criteria:\n\nIs the response funny?\n\nJoke 1:\nWhy did the cat cross the road? To cat to the other side.\nJoke 2:\nWhy did the cat sit on the computer? Because it wanted to keep an eye on the mouse!\n\nKeeping the evaluation criteria in mind, briefly assess which joke is better.\nFocus on the evaluation criteria during assessment, do not provide a general assessment.\nAssessment:\n\nLets think step by step '}]
```
martinscooper (Contributor, Author):
Note that the evaluator's prompts use the plural form of the response variable name. Currently I just append an 's' to it, but this fails for names with irregular plurals, e.g. 'summary' → 'summaries'.
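One possible way to handle this (just a sketch, not part of this PR) would be to pluralize the name with the `inflect` package rather than appending 's':

```python
# Sketch only: pluralize the response variable name with the `inflect`
# package instead of naively appending 's'. Not part of this PR.
import inflect

_engine = inflect.engine()

def pluralize(name: str) -> str:
    """Return the plural of a response variable name, e.g. 'summary' -> 'summaries'."""
    return _engine.plural(name)

print(pluralize("summary"))   # summaries
print(pluralize("response"))  # responses
print(pluralize("joke"))      # jokes
```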
Member:
I think the main point is whether it makes a difference in performance (because it complicates the API and the interface). |
This PR implements customization of the prediction name, allowing users to give the evaluator more information about what exactly is being evaluated.

Currently, the prompts refer to a generic 'response'. This PR allows the prompt to refer to other words such as 'text', 'prediction', 'assistant_response', 'summary', etc.

The customization can be done in two ways:

- setting `LLMJudge.response_variable_name`, which defaults to `'response'` so the change is backward compatible, or
- setting `LLMJudge.response_variable_name_field` and including a key whose name is the value of `response_variable_name_field` in every `task_data` instance.
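For illustration, the two options might look roughly like this (a sketch based on the example above; the argument names follow this PR's description, but the exact metric-string syntax for the first option is assumed):

```python
# Option 1 (sketch): set the name directly on the judge metric, so the same
# prediction name is used for every instance.
metric_fixed_name = (
    "metrics.llm_as_judge.pairwise.watsonx.llama3_1_70b"
    "[context_fields=question,criteria_field=criteria,"
    "response_variable_name=summary]"
)

# Option 2 (sketch): point the judge at a per-instance field, so each
# task_data instance can carry its own name under that key (as in the
# example usage earlier in this thread).
metric_per_instance_name = (
    "metrics.llm_as_judge.pairwise.watsonx.llama3_1_70b"
    "[context_fields=question,criteria_field=criteria,"
    "response_variable_name_field=response_variable_name]"
)
```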