

duallens_analytics.data_loader

Load and chunk company AI-initiative PDF documents.

This module handles the first stage of the RAG ingestion pipeline:

  1. Extract the knowledge-base ZIP archive.
  2. Discover PDF files inside the extracted directory.
  3. Split each PDF into overlapping text chunks using RecursiveCharacterTextSplitter with a tiktoken-based length function.
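
As a rough illustration of step 3, here is a minimal, self-contained sketch of overlapping chunking. It counts plain characters instead of tiktoken tokens and slides a fixed window, whereas the real splitter recursively breaks on separators, so treat it only as a picture of how chunk_size and chunk_overlap interact:

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split *text* into fixed-size windows that overlap by *chunk_overlap* characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each new chunk advances by this much
    return [text[i : i + chunk_size] for i in range(0, max(len(text) - chunk_overlap, 1), step)]

chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
# -> ["abcd", "cdef", "efgh", "ghij"]; each chunk shares its last 2
#    characters with the start of the next, preserving boundary context
```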

extract_zip(zip_path=ZIP_PATH, dest=PDF_DIR)

Extract the knowledge-base ZIP and return the directory with PDFs.

If the archive contains a single top-level sub-folder the path of that sub-folder is returned; otherwise dest itself is returned.

Parameters:

    zip_path (Path, default: ZIP_PATH): Path to the .zip archive (see duallens_analytics.config.ZIP_PATH).
    dest (Path, default: PDF_DIR): Destination directory for extraction (see duallens_analytics.config.PDF_DIR).

Returns:

    Path: A pathlib.Path pointing to the directory that contains the extracted PDF files.

Source code in src/duallens_analytics/data_loader.py
def extract_zip(zip_path: Path = ZIP_PATH, dest: Path = PDF_DIR) -> Path:
    """Extract the knowledge-base ZIP and return the directory with PDFs.

    If the archive contains a single top-level sub-folder the path of
    that sub-folder is returned; otherwise *dest* itself is returned.

    Args:
        zip_path: Path to the ``.zip`` archive (default: see
            :data:`~duallens_analytics.config.ZIP_PATH`).
        dest: Destination directory for extraction (default: see
            :data:`~duallens_analytics.config.PDF_DIR`).

    Returns:
        A :class:`~pathlib.Path` pointing to the directory that contains
        the extracted PDF files.
    """
    logger.info("Extracting ZIP %s to %s", zip_path, dest)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as zf:
        zf.extractall(dest)
        logger.debug("Extracted %d files from ZIP", len(zf.namelist()))

    # The zip may wrap everything in a single top-level sub-folder; unwrap it
    subdirs = [p for p in dest.iterdir() if p.is_dir()]
    result = subdirs[0] if len(subdirs) == 1 else dest
    logger.info("PDF directory resolved to %s", result)
    return result
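
The sub-folder resolution contract above can be exercised in isolation with the standard library alone. The archive and file names below are invented for the demo; the helper mirrors the docstring's rule (single top-level folder is returned, otherwise the destination itself):

```python
import tempfile
import zipfile
from pathlib import Path

def resolve_pdf_dir(zip_path: Path, dest: Path) -> Path:
    """Extract *zip_path* into *dest*; if the archive wraps its contents in a
    single top-level folder, return that folder, otherwise return *dest*."""
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as zf:
        zf.extractall(dest)
    subdirs = [p for p in dest.iterdir() if p.is_dir()]
    return subdirs[0] if len(subdirs) == 1 else dest

# Build a throwaway archive whose entries live under one top-level folder.
work = Path(tempfile.mkdtemp())
archive = work / "kb.zip"
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("knowledge_base/report.pdf", b"%PDF-1.4 stub")

pdf_dir = resolve_pdf_dir(archive, work / "extracted")
# pdf_dir is .../extracted/knowledge_base, not .../extracted
```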

load_and_chunk(pdf_dir=None, chunk_size=1000, chunk_overlap=200, encoding_name='cl100k_base')

Load all PDFs from pdf_dir and split into LangChain Document chunks.

When pdf_dir is None the function calls extract_zip first to unpack the default ZIP archive.

Chunking parameters (chunk_size, chunk_overlap, encoding_name) can be overridden at runtime via Hydra config or directly.

Parameters:

    pdf_dir (Path | None, default: None): Directory containing .pdf files. None triggers automatic extraction from the ZIP archive.
    chunk_size (int, default: 1000): Target number of tokens per chunk.
    chunk_overlap (int, default: 200): Number of overlapping tokens between adjacent chunks to preserve context across boundaries.
    encoding_name (str, default: 'cl100k_base'): tiktoken encoding used to count tokens (e.g. "cl100k_base").

Returns:

    list[Document]: A list of langchain_core.documents.Document objects, each representing one text chunk with metadata.
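
Schematically, each returned chunk pairs a span of text with metadata such as the source file and page number. The stdlib stand-in below mimics the shape of langchain's Document for illustration; the sample content and metadata values are invented:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """Stand-in for langchain_core.documents.Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)

chunk = Chunk(
    page_content="Q3 AI initiative: roll out a retrieval-augmented support bot...",
    metadata={"source": "reports/ai_strategy.pdf", "page": 3},
)
```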

Source code in src/duallens_analytics/data_loader.py
def load_and_chunk(
    pdf_dir: Path | None = None,
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
    encoding_name: str = "cl100k_base",
) -> list[Document]:
    """Load all PDFs from *pdf_dir* and split into LangChain ``Document`` chunks.

    When *pdf_dir* is ``None`` the function calls :func:`extract_zip`
    first to unpack the default ZIP archive.

    Chunking parameters (*chunk_size*, *chunk_overlap*, *encoding_name*)
    can be overridden at runtime via Hydra config or directly.

    Args:
        pdf_dir: Directory containing ``.pdf`` files.  ``None`` triggers
            automatic extraction from the ZIP archive.
        chunk_size: Target number of tokens per chunk.
        chunk_overlap: Number of overlapping tokens between adjacent
            chunks to preserve context across boundaries.
        encoding_name: ``tiktoken`` encoding used to count tokens
            (e.g. ``"cl100k_base"``).

    Returns:
        A list of :class:`~langchain_core.documents.Document` objects,
        each representing one text chunk with metadata.
    """
    if pdf_dir is None:
        pdf_dir = extract_zip()

    logger.info(
        "Loading PDFs from %s (chunk_size=%d, overlap=%d, encoding=%s)",
        pdf_dir,
        chunk_size,
        chunk_overlap,
        encoding_name,
    )
    loader = PyPDFDirectoryLoader(str(pdf_dir))
    splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        encoding_name=encoding_name,
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
    )
    chunks = loader.load_and_split(splitter)
    logger.info("Produced %d document chunks", len(chunks))
    return chunks
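
Because consecutive chunks share chunk_overlap tokens, a document of N tokens yields roughly ceil((N - chunk_overlap) / (chunk_size - chunk_overlap)) chunks. The real splitter breaks at separators, so this is only an upper-bound estimate; the 10k-token figure below is invented for the example:

```python
import math

def approx_chunk_count(n_tokens: int, chunk_size: int = 1000, chunk_overlap: int = 200) -> int:
    """Rough estimate of chunks produced for a document of *n_tokens* tokens."""
    step = chunk_size - chunk_overlap  # net new tokens covered per chunk
    return max(1, math.ceil((n_tokens - chunk_overlap) / step))

n = approx_chunk_count(10_000)  # hypothetical 10k-token PDF -> 13 chunks
```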