data_loader¶
duallens_analytics.data_loader¶
Load and chunk company AI-initiative PDF documents.
This module handles the first stage of the RAG ingestion pipeline:
- Extract the knowledge-base ZIP archive.
- Discover PDF files inside the extracted directory.
- Split each PDF into overlapping text chunks using `langchain.text_splitter.RecursiveCharacterTextSplitter` with a `tiktoken`-based length function.
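The splitter configuration described above can be approximated as follows. This is a minimal sketch, not the module's actual source; it assumes the defaults documented for `load_and_chunk` below.

```python
import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

# tiktoken encoding matching the documented default (cl100k_base).
encoding = tiktoken.get_encoding("cl100k_base")

def tiktoken_len(text: str) -> int:
    # Measure chunk length in tokens rather than characters.
    return len(encoding.encode(text))

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # target tokens per chunk
    chunk_overlap=200,  # tokens shared between adjacent chunks
    length_function=tiktoken_len,
)
```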
extract_zip(zip_path=ZIP_PATH, dest=PDF_DIR)¶
Extract the knowledge-base ZIP and return the directory with PDFs.
If the archive contains a single top-level sub-folder, the path of that sub-folder is returned; otherwise `dest` itself is returned.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `zip_path` | `Path` | Path to the knowledge-base ZIP archive. | `ZIP_PATH` |
| `dest` | `Path` | Destination directory for extraction (default: see `PDF_DIR`). | `PDF_DIR` |
Returns:

| Type | Description |
|---|---|
| `Path` | A `Path` to the directory containing the extracted PDF files. |
Source code in src/duallens_analytics/data_loader.py
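A minimal usage sketch, assuming the default `ZIP_PATH` archive is present on disk:

```python
from duallens_analytics.data_loader import extract_zip

# Unpack the default knowledge-base archive (ZIP_PATH -> PDF_DIR) and get
# back the directory that actually holds the PDFs, which may be a single
# top-level sub-folder inside the archive.
pdf_dir = extract_zip()
print(sorted(p.name for p in pdf_dir.glob("*.pdf")))
```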
load_and_chunk(pdf_dir=None, chunk_size=1000, chunk_overlap=200, encoding_name='cl100k_base')¶
Load all PDFs from `pdf_dir` and split them into LangChain `Document` chunks.

When `pdf_dir` is `None`, the function calls `extract_zip` first to unpack the default ZIP archive.

Chunking parameters (`chunk_size`, `chunk_overlap`, `encoding_name`) can be overridden at runtime via Hydra config or passed directly.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `pdf_dir` | `Path \| None` | Directory containing the PDF files to load. | `None` |
| `chunk_size` | `int` | Target number of tokens per chunk. | `1000` |
| `chunk_overlap` | `int` | Number of overlapping tokens between adjacent chunks to preserve context across boundaries. | `200` |
| `encoding_name` | `str` | Name of the `tiktoken` encoding used to measure chunk length. | `'cl100k_base'` |
Returns:

| Type | Description |
|---|---|
| `list[Document]` | A list of `Document` objects, each representing one text chunk with metadata. |
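A usage sketch; the override values below are illustrative, not recommendations:

```python
from duallens_analytics.data_loader import load_and_chunk

# With no arguments, extract_zip() runs first and the documented defaults
# apply (1000-token chunks, 200-token overlap, cl100k_base encoding).
chunks = load_and_chunk()

# The chunking parameters can also be overridden directly:
small_chunks = load_and_chunk(chunk_size=500, chunk_overlap=100)

print(len(chunks), chunks[0].metadata)
```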