evaluation

duallens_analytics.evaluation

LLM-as-Judge evaluation for RAG responses.

This module implements two evaluation dimensions for RAG quality:

  • Groundedness – Is the answer fully supported by the retrieved context (no hallucination)?
  • Relevance – Does the answer directly and completely address the user’s question?

Each evaluator sends the question, context, and answer to the LLM with a specialised system prompt and parses the structured Score / Justification response into an EvalResult dataclass.

EVAL_USER_TEMPLATE = '###Question\n{question}\n\n###Context\n{context}\n\n###Answer\n{answer}\n' module-attribute

User-message template with {question}, {context}, and {answer} placeholders.

RELEVANCE_SYSTEM = 'You are tasked with rating AI-generated answers to questions posed by users.\nYou will be presented a question, context used by the AI system to generate the answer and an AI-generated answer to the question.\n\nIn the input, the question will begin with ###Question, the context will begin with ###Context while the AI-generated answer will begin with ###Answer.\n\nEvaluation criteria:\n- Relevance: Does the answer directly address the question asked?\n- Completeness: Does the answer cover the key aspects of the question?\n\nProvide:\n1. A **score** from 1 (not relevant) to 5 (highly relevant and complete).\n2. A brief **justification** (2-3 sentences).\n\nFormat your response exactly as:\nScore: <number>\nJustification: <text>\n' module-attribute

System prompt for the relevance evaluator.
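
As a rough sketch of how these two prompts fit together, the snippet below assembles an OpenAI-style messages list from RELEVANCE_SYSTEM and EVAL_USER_TEMPLATE. The question, context, and answer values are invented for illustration, and the module's private _evaluate helper may build the actual request differently.

from duallens_analytics.evaluation import EVAL_USER_TEMPLATE, RELEVANCE_SYSTEM

# Fill the user template with one RAG exchange (example values only).
user_message = EVAL_USER_TEMPLATE.format(
    question="What does the report say about Q3 churn?",
    context="Q3 churn rose from 3.1% to 3.4%, driven mainly by SMB accounts.",
    answer="Churn increased to 3.4% in Q3, mostly among SMB customers.",
)

# Chat-style payload: the system prompt defines the judging rubric,
# the user message carries the question/context/answer triple.
messages = [
    {"role": "system", "content": RELEVANCE_SYSTEM},
    {"role": "user", "content": user_message},
]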

EvalResult dataclass

Parsed result from an LLM-as-Judge evaluation.

On instantiation the raw LLM response is automatically parsed to extract a numeric score and a textual justification.

Attributes:

Name           Type        Description
dimension      str         Evaluation dimension label (e.g. "Groundedness").
raw_response   str         The full text returned by the evaluator LLM.
score          int | None  Extracted integer score (1–5), or None if parsing failed.
justification  str         Extracted justification text.

Source code in src/duallens_analytics/evaluation.py
@dataclass
class EvalResult:
    """Parsed result from an LLM-as-Judge evaluation.

    On instantiation the raw LLM response is automatically parsed to
    extract a numeric ``score`` and a textual ``justification``.

    Attributes:
        dimension: Evaluation dimension label (e.g. ``"Groundedness"``).
        raw_response: The full text returned by the evaluator LLM.
        score: Extracted integer score (1–5), or ``None`` if parsing
            failed.
        justification: Extracted justification text.
    """

    dimension: str
    raw_response: str
    score: int | None = None
    justification: str = ""

    def __post_init__(self) -> None:
        self._parse()

    def _parse(self) -> None:
        """Parse ``Score:`` and ``Justification:`` lines from *raw_response*."""
        for line in self.raw_response.splitlines():
            low = line.strip().lower()
            if low.startswith("score:"):
                with contextlib.suppress(ValueError, IndexError):
                    self.score = int(low.split(":", 1)[1].strip().split()[0])
            elif low.startswith("justification:"):
                self.justification = line.split(":", 1)[1].strip()
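
For example, constructing an EvalResult from a typical judge reply triggers the automatic parsing; the reply text here is made up to illustrate the expected Score / Justification format.

from duallens_analytics.evaluation import EvalResult

result = EvalResult(
    dimension="Relevance",
    raw_response="Score: 4\nJustification: The answer addresses the question but omits one detail.",
)
print(result.score)          # 4
print(result.justification)  # The answer addresses the question but omits one detail.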

evaluate_groundedness(question, context, answer, settings)

Evaluate how well the answer is grounded in the context.

A score of 5 means the answer is fully supported by the context with no hallucinated information. A score of 1 indicates the answer contains significant unsupported claims.

Parameters:

Name      Type      Description                                Default
question  str       The original user question.                required
context   str       Concatenated retrieved context passages.   required
answer    str       The AI-generated answer.                   required
settings  Settings  Application settings.                      required

Returns:

Type        Description
EvalResult  An EvalResult with the groundedness score.

Source code in src/duallens_analytics/evaluation.py
def evaluate_groundedness(
    question: str, context: str, answer: str, settings: Settings
) -> EvalResult:
    """Evaluate how well the *answer* is grounded in the *context*.

    A score of **5** means the answer is fully supported by the context
    with no hallucinated information.  A score of **1** indicates the
    answer contains significant unsupported claims.

    Args:
        question: The original user question.
        context: Concatenated retrieved context passages.
        answer: The AI-generated answer.
        settings: Application settings.

    Returns:
        An :class:`EvalResult` with the groundedness score.
    """
    return _evaluate(GROUNDEDNESS_SYSTEM, "Groundedness", question, context, answer, settings)
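
A minimal usage sketch, assuming an application Settings instance is already available (its construction is not shown on this page):

from duallens_analytics.evaluation import evaluate_groundedness

def grade_groundedness(question: str, context: str, answer: str, settings) -> tuple[int | None, str]:
    """Run the groundedness judge and return (score, justification)."""
    result = evaluate_groundedness(question, context, answer, settings)
    # score is None when the judge reply did not match the Score:/Justification: format.
    return result.score, result.justification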

evaluate_relevance(question, context, answer, settings)

Evaluate how relevant and complete the answer is to the question.

A score of 5 means the answer directly addresses every aspect of the question. A score of 1 means it is off-topic or incomplete.

Parameters:

Name      Type      Description                                Default
question  str       The original user question.                required
context   str       Concatenated retrieved context passages.   required
answer    str       The AI-generated answer.                   required
settings  Settings  Application settings.                      required

Returns:

Type        Description
EvalResult  An EvalResult with the relevance score.

Source code in src/duallens_analytics/evaluation.py
def evaluate_relevance(question: str, context: str, answer: str, settings: Settings) -> EvalResult:
    """Evaluate how relevant and complete the *answer* is to the *question*.

    A score of **5** means the answer directly addresses every aspect
    of the question.  A score of **1** means it is off-topic or
    incomplete.

    Args:
        question: The original user question.
        context: Concatenated retrieved context passages.
        answer: The AI-generated answer.
        settings: Application settings.

    Returns:
        An :class:`EvalResult` with the relevance score.
    """
    return _evaluate(RELEVANCE_SYSTEM, "Relevance", question, context, answer, settings)
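
Putting both dimensions together, a caller might evaluate a single RAG response on groundedness and relevance and collect the parsed results; as above, how Settings is constructed is left to the application.

from duallens_analytics.evaluation import evaluate_groundedness, evaluate_relevance

def judge_response(question: str, context: str, answer: str, settings) -> dict[str, tuple[int | None, str]]:
    """Run both LLM-as-Judge dimensions and return {dimension: (score, justification)}."""
    results = {
        "Groundedness": evaluate_groundedness(question, context, answer, settings),
        "Relevance": evaluate_relevance(question, context, answer, settings),
    }
    return {dim: (r.score, r.justification) for dim, r in results.items()}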