evaluation¶
duallens_analytics.evaluation¶
LLM-as-Judge evaluation for RAG responses.
This module implements two evaluation dimensions for RAG quality:
- Groundedness – Is the answer fully supported by the retrieved context (no hallucination)?
- Relevance – Does the answer directly and completely address the user’s question?
Each evaluator sends the question, context, and answer to the LLM with
a specialised system prompt and parses the structured Score / Justification
response into an EvalResult dataclass.
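The parsing step can be pictured with a small regex sketch (a hypothetical helper for illustration; the module's actual parsing lives inside EvalResult, documented below):

```python
import re

def parse_judge_response(raw: str) -> tuple[int | None, str]:
    """Extract the numeric score and justification from a judge response.

    Hypothetical helper mirroring the Score/Justification format the
    evaluators request; not part of the public API.
    """
    score_match = re.search(r"Score:\s*(\d+)", raw)
    just_match = re.search(r"Justification:\s*(.+)", raw, re.DOTALL)
    score = int(score_match.group(1)) if score_match else None
    justification = just_match.group(1).strip() if just_match else ""
    return score, justification
```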
EVAL_USER_TEMPLATE = '###Question\n{question}\n\n###Context\n{context}\n\n###Answer\n{answer}\n'
module-attribute¶
User-message template with {question}, {context}, and {answer} placeholders.
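For instance, filling the template with illustrative values (the question, context, and answer here are made up):

```python
from duallens_analytics.evaluation import EVAL_USER_TEMPLATE

user_message = EVAL_USER_TEMPLATE.format(
    question="What is the refund window?",
    context="Refunds are accepted within 30 days of purchase.",
    answer="You can request a refund within 30 days.",
)
```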
RELEVANCE_SYSTEM = 'You are tasked with rating AI-generated answers to questions posed by users.\nYou will be presented a question, context used by the AI system to generate the answer and an AI-generated answer to the question.\n\nIn the input, the question will begin with ###Question, the context will begin with ###Context while the AI-generated answer will begin with ###Answer.\n\nEvaluation criteria:\n- Relevance: Does the answer directly address the question asked?\n- Completeness: Does the answer cover the key aspects of the question?\n\nProvide:\n1. A **score** from 1 (not relevant) to 5 (highly relevant and complete).\n2. A brief **justification** (2-3 sentences).\n\nFormat your response exactly as:\nScore: <number>\nJustification: <text>\n'
module-attribute¶
System prompt for the relevance evaluator.
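This prompt is designed to pair with EVAL_USER_TEMPLATE in a standard chat-completion message list. A sketch of the shape (not the module's internal client code):

```python
from duallens_analytics.evaluation import RELEVANCE_SYSTEM

# System prompt plus the user message built from EVAL_USER_TEMPLATE above.
messages = [
    {"role": "system", "content": RELEVANCE_SYSTEM},
    {"role": "user", "content": user_message},
]
```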
EvalResult
dataclass¶
Parsed result from an LLM-as-Judge evaluation.
On instantiation the raw LLM response is automatically parsed to
extract a numeric score and a textual justification.
Attributes:

| Name | Type | Description |
|---|---|---|
| dimension | str | Evaluation dimension label (e.g. "Groundedness"). |
| raw_response | str | The full text returned by the evaluator LLM. |
| score | int \| None | Extracted integer score (1–5), or None if no score could be parsed. |
| justification | str | Extracted justification text. |
Source code in src/duallens_analytics/evaluation.py
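Assuming dimension and raw_response are the constructor inputs (with score and justification filled in by the automatic parsing on instantiation), usage might look like:

```python
from duallens_analytics.evaluation import EvalResult

# Assumption: dimension and raw_response are the init fields; score and
# justification are derived by the parsing that runs on instantiation.
result = EvalResult(
    dimension="relevance",
    raw_response="Score: 4\nJustification: Addresses the question but omits one detail.",
)
print(result.score)          # 4
print(result.justification)  # "Addresses the question but omits one detail."
```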
evaluate_groundedness(question, context, answer, settings)¶
Evaluate how well the answer is grounded in the context.
A score of 5 means the answer is fully supported by the context with no hallucinated information. A score of 1 indicates the answer contains significant unsupported claims.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| question | str | The original user question. | required |
| context | str | Concatenated retrieved context passages. | required |
| answer | str | The AI-generated answer. | required |
| settings | Settings | Application settings. | required |
Returns:

| Type | Description |
|---|---|
| EvalResult | An EvalResult with the parsed groundedness score and justification. |
Source code in src/duallens_analytics/evaluation.py
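An illustrative call (the question/context/answer values are made up, and settings stands in for an existing application Settings instance):

```python
from duallens_analytics.evaluation import evaluate_groundedness

result = evaluate_groundedness(
    question="What is the refund window?",
    context="Refunds are accepted within 30 days of purchase.",
    answer="You can request a refund within 30 days.",
    settings=settings,  # an existing application Settings instance
)
if result.score is not None and result.score < 3:
    print("Low groundedness:", result.justification)
```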
evaluate_relevance(question, context, answer, settings)¶
Evaluate how relevant and complete the answer is to the question.
A score of 5 means the answer directly addresses every aspect of the question. A score of 1 means it is off-topic or incomplete.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| question | str | The original user question. | required |
| context | str | Concatenated retrieved context passages. | required |
| answer | str | The AI-generated answer. | required |
| settings | Settings | Application settings. | required |
Returns:

| Type | Description |
|---|---|
| EvalResult | An EvalResult with the parsed relevance score and justification. |
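Mirroring the groundedness example above, an illustrative call:

```python
from duallens_analytics.evaluation import evaluate_relevance

relevance = evaluate_relevance(
    question="What is the refund window?",
    context="Refunds are accepted within 30 days of purchase.",
    answer="You can request a refund within 30 days.",
    settings=settings,  # an existing application Settings instance
)
print(f"{relevance.dimension}: {relevance.score} - {relevance.justification}")
```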