evaluation

duallens_analytics.evaluation

LLM-as-Judge evaluation for RAG responses.

This module implements two evaluation dimensions for RAG quality:

  • Groundedness – Is the answer fully supported by the retrieved context (no hallucination)?
  • Relevance – Does the answer directly and completely address the user’s question?

Each evaluator sends the question, context, and answer to the LLM with a specialised system prompt and parses the structured Score / Justification response into an EvalResult dataclass.

EVAL_USER_TEMPLATE = '###Question\n{question}\n\n###Context\n{context}\n\n###Answer\n{answer}\n' module-attribute

User-message template with {question}, {context}, and {answer} placeholders.

RELEVANCE_SYSTEM = 'You are tasked with rating AI-generated answers to questions posed by users.\nYou will be presented a question, context used by the AI system to generate the answer and an AI-generated answer to the question.\n\nIn the input, the question will begin with ###Question, the context will begin with ###Context while the AI-generated answer will begin with ###Answer.\n\nEvaluation criteria:\n- Relevance: Does the answer directly address the question asked?\n- Completeness: Does the answer cover the key aspects of the question?\n\nProvide:\n1. A **score** from 1 (not relevant) to 5 (highly relevant and complete).\n2. A brief **justification** (2-3 sentences).\n\nFormat your response exactly as:\nScore: <number>\nJustification: <text>\n' module-attribute

System prompt for the relevance evaluator.
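
As a rough sketch of how these two prompts fit together, the snippet below assembles an OpenAI-style messages list from RELEVANCE_SYSTEM and EVAL_USER_TEMPLATE. The question, context, and answer values are invented for illustration, and the module's private _evaluate helper may build the actual request differently.

from duallens_analytics.evaluation import EVAL_USER_TEMPLATE, RELEVANCE_SYSTEM

# Fill the user template with one RAG exchange (example values only).
user_message = EVAL_USER_TEMPLATE.format(
    question="What does the report say about Q3 churn?",
    context="Q3 churn rose from 3.1% to 3.4%, driven mainly by SMB accounts.",
    answer="Churn increased to 3.4% in Q3, mostly among SMB customers.",
)

# Chat-style payload: the system prompt defines the judging rubric,
# the user message carries the question/context/answer triple.
messages = [
    {"role": "system", "content": RELEVANCE_SYSTEM},
    {"role": "user", "content": user_message},
]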

EvalResult dataclass

Parsed result from an LLM-as-Judge evaluation.

On instantiation the raw LLM response is automatically parsed to extract a numeric score and a textual justification.

Attributes:

Name           Type        Description
dimension      str         Evaluation dimension label (e.g. "Groundedness").
raw_response   str         The full text returned by the evaluator LLM.
score          int | None  Extracted integer score (1–5), or None if parsing failed.
justification  str         Extracted justification text.

Source code in src/duallens_analytics/evaluation.py
@dataclass
class EvalResult:
    """Parsed result from an LLM-as-Judge evaluation.

    On instantiation the raw LLM response is automatically parsed to
    extract a numeric ``score`` and a textual ``justification``.

    Attributes:
        dimension: Evaluation dimension label (e.g. ``"Groundedness"``).
        raw_response: The full text returned by the evaluator LLM.
        score: Extracted integer score (1–5), or ``None`` if parsing
            failed.
        justification: Extracted justification text.
    """

    dimension: str
    raw_response: str
    score: int | None = None
    justification: str = ""

    def __post_init__(self) -> None:
        self._parse()

    def _parse(self) -> None:
        """Parse ``Score:`` and ``Justification:`` lines from *raw_response*."""
        for line in self.raw_response.splitlines():
            low = line.strip().lower()
            if low.startswith("score:"):
                with contextlib.suppress(ValueError, IndexError):
                    self.score = int(low.split(":", 1)[1].strip().split()[0])
            elif low.startswith("justification:"):
                self.justification = line.split(":", 1)[1].strip()
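
For example, constructing an EvalResult from a typical judge reply triggers the automatic parsing; the reply text here is made up to illustrate the expected Score / Justification format.

from duallens_analytics.evaluation import EvalResult

result = EvalResult(
    dimension="Relevance",
    raw_response="Score: 4\nJustification: The answer addresses the question but omits one detail.",
)
print(result.score)          # 4
print(result.justification)  # The answer addresses the question but omits one detail.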

evaluate_groundedness(question, context, answer, settings)

Evaluate how well the answer is grounded in the context.

A score of 5 means the answer is fully supported by the context with no hallucinated information. A score of 1 indicates the answer contains significant unsupported claims.

Parameters:

Name      Type      Description                                Default
question  str       The original user question.                required
context   str       Concatenated retrieved context passages.   required
answer    str       The AI-generated answer.                   required
settings  Settings  Application settings.                      required

Returns:

Type        Description
EvalResult  An EvalResult with the groundedness score.

Source code in src/duallens_analytics/evaluation.py
def evaluate_groundedness(
    question: str, context: str, answer: str, settings: Settings
) -> EvalResult:
    """Evaluate how well the *answer* is grounded in the *context*.

    A score of **5** means the answer is fully supported by the context
    with no hallucinated information.  A score of **1** indicates the
    answer contains significant unsupported claims.

    Args:
        question: The original user question.
        context: Concatenated retrieved context passages.
        answer: The AI-generated answer.
        settings: Application settings.

    Returns:
        An :class:`EvalResult` with the groundedness score.
    """
    return _evaluate(GROUNDEDNESS_SYSTEM, "Groundedness", question, context, answer, settings)
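
A minimal usage sketch, assuming an application Settings instance is already available (its construction is not shown on this page):

from duallens_analytics.evaluation import evaluate_groundedness

def grade_groundedness(question: str, context: str, answer: str, settings) -> tuple[int | None, str]:
    """Run the groundedness judge and return (score, justification)."""
    result = evaluate_groundedness(question, context, answer, settings)
    # score is None when the judge reply did not match the Score:/Justification: format.
    return result.score, result.justification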

evaluate_relevance(question, context, answer, settings)

Evaluate how relevant and complete the answer is to the question.

A score of 5 means the answer directly addresses every aspect of the question. A score of 1 means it is off-topic or incomplete.

Parameters:

Name      Type      Description                                Default
question  str       The original user question.                required
context   str       Concatenated retrieved context passages.   required
answer    str       The AI-generated answer.                   required
settings  Settings  Application settings.                      required

Returns:

Type        Description
EvalResult  An EvalResult with the relevance score.

Source code in src/duallens_analytics/evaluation.py
def evaluate_relevance(question: str, context: str, answer: str, settings: Settings) -> EvalResult:
    """Evaluate how relevant and complete the *answer* is to the *question*.

    A score of **5** means the answer directly addresses every aspect
    of the question.  A score of **1** means it is off-topic or
    incomplete.

    Args:
        question: The original user question.
        context: Concatenated retrieved context passages.
        answer: The AI-generated answer.
        settings: Application settings.

    Returns:
        An :class:`EvalResult` with the relevance score.
    """
    return _evaluate(RELEVANCE_SYSTEM, "Relevance", question, context, answer, settings)
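
Putting both dimensions together, a caller might evaluate a single RAG response on groundedness and relevance and collect the parsed results; as above, how Settings is constructed is left to the application.

from duallens_analytics.evaluation import evaluate_groundedness, evaluate_relevance

def judge_response(question: str, context: str, answer: str, settings) -> dict[str, tuple[int | None, str]]:
    """Run both LLM-as-Judge dimensions and return {dimension: (score, justification)}."""
    results = {
        "Groundedness": evaluate_groundedness(question, context, answer, settings),
        "Relevance": evaluate_relevance(question, context, answer, settings),
    }
    return {dim: (r.score, r.justification) for dim, r in results.items()}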