vector_store

duallens_analytics.vector_store

ChromaDB vector-store management.

This module provides functions to build, load, and query a ChromaDB vector store backed by OpenAI embeddings. The store is persisted to disk so subsequent application runs skip re-embedding.

Typical usage:

store = get_or_create_vector_store(settings)
retriever = get_retriever(store, k=5)
docs = retriever.invoke("What AI projects is Google working on?")

build_vector_store(chunks, settings, persist_dir=CHROMA_DIR, collection_name='AI_Initiatives')

Create a ChromaDB collection from pre-chunked documents.

Each document chunk is embedded via the configured OpenAI embedding model and stored in a local ChromaDB directory so that subsequent runs can load the collection without re-embedding.

Parameters:

chunks (list[Document], required): Pre-split langchain_core.documents.Document objects produced by duallens_analytics.data_loader.load_and_chunk.
settings (Settings, required): Application settings (used to select the embedding model).
persist_dir (Path, default CHROMA_DIR): Filesystem path where ChromaDB persists its data.
collection_name (str, default 'AI_Initiatives'): Logical name of the collection inside ChromaDB.

Returns:

Chroma: A langchain_chroma.Chroma instance backed by the newly created collection.

Source code in src/duallens_analytics/vector_store.py
def build_vector_store(
    chunks: list[Document],
    settings: Settings,
    persist_dir: Path = CHROMA_DIR,
    collection_name: str = "AI_Initiatives",
) -> Chroma:
    """Create a ChromaDB collection from pre-chunked documents.

    Each document chunk is embedded via the configured OpenAI embedding
    model and stored in a local ChromaDB directory so that subsequent
    runs can load the collection without re-embedding.

    Args:
        chunks: Pre-split :class:`~langchain_core.documents.Document`
            objects produced by :func:`~duallens_analytics.data_loader.load_and_chunk`.
        settings: Application settings (used to select the embedding model).
        persist_dir: Filesystem path where ChromaDB persists its data.
        collection_name: Logical name of the collection inside ChromaDB.

    Returns:
        A :class:`~langchain_chroma.Chroma` instance
        backed by the newly created collection.
    """
    logger.info(
        "Building vector store: %d chunks → collection=%s, persist=%s",
        len(chunks),
        collection_name,
        persist_dir,
    )
    persist_dir.mkdir(parents=True, exist_ok=True)
    return Chroma.from_documents(
        chunks,
        _embedding_fn(settings),
        collection_name=collection_name,
        persist_directory=str(persist_dir),
    )

collection_exists(persist_dir=CHROMA_DIR, collection_name='AI_Initiatives')

Check whether a persisted ChromaDB collection already exists.

The heuristic checks for a chroma.sqlite3 file inside persist_dir.

Parameters:

persist_dir (Path, default CHROMA_DIR): Directory that ChromaDB uses for persistence.
collection_name (str, default 'AI_Initiatives'): Unused at present but kept for future multi-collection support.

Returns:

bool: True if the SQLite database file is present.

Source code in src/duallens_analytics/vector_store.py
def collection_exists(
    persist_dir: Path = CHROMA_DIR,
    collection_name: str = "AI_Initiatives",
) -> bool:
    """Check whether a persisted ChromaDB collection already exists.

    The heuristic inspects the presence of ``chroma.sqlite3`` inside
    *persist_dir*.

    Args:
        persist_dir: Directory that ChromaDB uses for persistence.
        collection_name: Unused at present but kept for future
            multi-collection support.

    Returns:
        ``True`` if the SQLite database file is present.
    """
    chroma_sqlite = persist_dir / "chroma.sqlite3"
    exists = chroma_sqlite.is_file()
    logger.debug(
        "Checking persisted collection (path=%s, exists=%s)",
        chroma_sqlite,
        exists,
    )
    return exists
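
The file-presence heuristic can be tried in isolation with the standard library alone. The following is a minimal sketch, not the module's code: sqlite_file_present is a hypothetical helper that mirrors the check, and the temporary directory stands in for CHROMA_DIR.

```python
import tempfile
from pathlib import Path


def sqlite_file_present(persist_dir: Path) -> bool:
    """Hypothetical mirror of the check: is chroma.sqlite3 a regular file here?"""
    return (persist_dir / "chroma.sqlite3").is_file()


with tempfile.TemporaryDirectory() as tmp:
    persist_dir = Path(tmp) / "chroma"

    # A directory that does not exist yet is reported as "no store".
    missing_dir = sqlite_file_present(persist_dir)

    # An empty directory is also reported as "no store".
    persist_dir.mkdir(parents=True)
    empty_dir = sqlite_file_present(persist_dir)

    # Once the file exists the check passes; its contents are never inspected.
    (persist_dir / "chroma.sqlite3").touch()
    with_file = sqlite_file_present(persist_dir)

print(missing_dir, empty_dir, with_file)  # False False True
```

Because only the file is examined, the result says nothing about which collections the database actually holds, which is consistent with collection_name being unused.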

get_or_create_vector_store(settings, persist_dir=CHROMA_DIR, collection_name=None)

Load an existing vector store or ingest from PDFs if none is found.

This is the recommended entry point for application code. It encapsulates the load-or-build logic so callers do not need to manage ingestion themselves.

If chroma.sqlite3 is already present in persist_dir, the documents are not re-embedded, saving time and API credits.

Parameters:

settings (Settings, required): Application settings (embedding model, chunking params, collection name).
persist_dir (Path, default CHROMA_DIR): Filesystem path for ChromaDB persistence.
collection_name (str | None, default None): Override for the collection name. Defaults to settings.collection_name.

Returns:

Chroma: A langchain_chroma.Chroma instance ready for retrieval.

Source code in src/duallens_analytics/vector_store.py
def get_or_create_vector_store(
    settings: Settings,
    persist_dir: Path = CHROMA_DIR,
    collection_name: str | None = None,
) -> Chroma:
    """Load an existing vector store or ingest from PDFs if none is found.

    This is the recommended entry-point for application code.  It
    encapsulates the *load-or-build* logic so callers do not need to
    manage ingestion themselves.

    If ``chroma.sqlite3`` is already present in *persist_dir*, the
    documents are **not** re-embedded, saving time and API credits.

    Args:
        settings: Application settings (embedding model, chunking params,
            collection name).
        persist_dir: Filesystem path for ChromaDB persistence.
        collection_name: Override for the collection name. Defaults to
            ``settings.collection_name``.

    Returns:
        A :class:`~langchain_chroma.Chroma` instance
        ready for retrieval.
    """
    collection_name = collection_name or settings.collection_name

    if collection_exists(persist_dir, collection_name):
        logger.info("Persisted vector store found – loading from disk")
        return load_vector_store(settings, persist_dir, collection_name)

    logger.info("No persisted vector store found – ingesting from PDFs")
    chunks = load_and_chunk(
        chunk_size=settings.chunk_size,
        chunk_overlap=settings.chunk_overlap,
        encoding_name=settings.encoding_name,
    )
    return build_vector_store(chunks, settings, persist_dir, collection_name)
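
The load-or-build control flow can be illustrated without ChromaDB or an embedding API. This is a sketch under stated assumptions: fake_build, fake_load, get_or_create, and the JSON file are hypothetical stand-ins for ingestion, loading, and the persisted store, and embed_calls counts how often the expensive step runs.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable

embed_calls = 0  # how many times the stand-in embedding step has run


def fake_build(path: Path) -> dict:
    """Stand-in for ingest + embed: expensive, and persists its result."""
    global embed_calls
    embed_calls += 1
    store = {"vectors": [[0.1, 0.2], [0.3, 0.4]]}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(store))
    return store


def fake_load(path: Path) -> dict:
    """Stand-in for loading a persisted store: no embedding involved."""
    return json.loads(path.read_text())


def get_or_create(
    path: Path,
    load: Callable[[Path], dict] = fake_load,
    build: Callable[[Path], dict] = fake_build,
) -> dict:
    """Load the persisted store if its file exists, otherwise build it."""
    if path.is_file():
        return load(path)
    return build(path)


with tempfile.TemporaryDirectory() as tmp:
    store_file = Path(tmp) / "store" / "store.json"
    first = get_or_create(store_file)   # nothing on disk yet: builds
    second = get_or_create(store_file)  # file now present: loads

print(embed_calls)      # 1
print(first == second)  # True
```

The second call returns the same data without touching the build path, which is the saving the function is designed to provide.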

get_retriever(store, k=10, search_type='similarity')

Create a LangChain retriever over the given vector store.

Parameters:

store (Chroma, required): A ChromaDB-backed vector store.
k (int, default 10): Number of top-matching documents to return per query.
search_type (str, default 'similarity'): Retrieval strategy: "similarity" (nearest neighbors by embedding distance, using whatever distance metric the collection is configured with) or "mmr" (maximal marginal relevance).

Returns:

VectorStoreRetriever: A langchain_core.vectorstores.VectorStoreRetriever that can be called with retriever.invoke(question).

Source code in src/duallens_analytics/vector_store.py
def get_retriever(
    store: Chroma,
    k: int = 10,
    search_type: str = "similarity",
) -> VectorStoreRetriever:
    """Create a LangChain retriever over the given vector store.

    Args:
        store: A ChromaDB-backed vector store.
        k: Number of top-matching documents to return per query.
        search_type: Retrieval strategy: ``"similarity"`` (nearest
            neighbors by embedding distance) or ``"mmr"`` (maximal
            marginal relevance).

    Returns:
        A :class:`~langchain_core.vectorstores.VectorStoreRetriever`
        that can be called with ``retriever.invoke(question)``.
    """
    logger.info("Creating retriever (search_type=%s, k=%d)", search_type, k)
    return store.as_retriever(search_type=search_type, search_kwargs={"k": k})
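
To show how the two search_type values can differ, here is a small pure-Python sketch. It is illustrative only and is not LangChain's or ChromaDB's implementation: the function names, the toy two-dimensional vectors, the use of cosine similarity as the score, and the lambda_mult weighting are all assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_similarity(query, docs, k):
    """Rank documents purely by similarity to the query."""
    ranked = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
    return ranked[:k]


def top_k_mmr(query, docs, k, lambda_mult=0.5):
    """Greedy MMR: trade relevance against similarity to already-picked docs."""
    selected = []
    candidates = set(docs)
    while candidates and len(selected) < k:

        def score(name):
            relevance = cosine(query, docs[name])
            redundancy = max(
                (cosine(docs[name], docs[s]) for s in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(sorted(candidates), key=score)  # sorted() for stable ties
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy embeddings: chunk_a_dup is nearly identical to chunk_a.
docs = {
    "chunk_a": [1.0, 0.25],
    "chunk_a_dup": [1.0, 0.2],
    "chunk_b": [0.15, 1.0],
}
query = [1.0, 1.0]

print(top_k_similarity(query, docs, 2))  # ['chunk_a', 'chunk_a_dup']
print(top_k_mmr(query, docs, 2))         # ['chunk_a', 'chunk_b']
```

Plain similarity returns the near-duplicate as its second hit, while MMR penalizes it for overlapping with the first pick and prefers the less redundant chunk.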

load_vector_store(settings, persist_dir=CHROMA_DIR, collection_name='AI_Initiatives')

Load an already-persisted ChromaDB collection from disk.

This skips the embedding step entirely by reading the existing SQLite database that ChromaDB maintains inside persist_dir.

Parameters:

settings (Settings, required): Application settings (needed for the embedding function so that query-time embeddings match the stored ones).
persist_dir (Path, default CHROMA_DIR): Filesystem path where ChromaDB persists its data.
collection_name (str, default 'AI_Initiatives'): Logical name of the collection inside ChromaDB.

Returns:

Chroma: A langchain_chroma.Chroma instance.

Source code in src/duallens_analytics/vector_store.py
def load_vector_store(
    settings: Settings,
    persist_dir: Path = CHROMA_DIR,
    collection_name: str = "AI_Initiatives",
) -> Chroma:
    """Load an already-persisted ChromaDB collection from disk.

    This skips the embedding step entirely by reading the existing
    SQLite database that ChromaDB maintains inside *persist_dir*.

    Args:
        settings: Application settings (needed for the embedding function
            so that query-time embeddings match the stored ones).
        persist_dir: Filesystem path where ChromaDB persists its data.
        collection_name: Logical name of the collection inside ChromaDB.

    Returns:
        A :class:`~langchain_chroma.Chroma` instance.
    """
    logger.info("Loading existing vector store: collection=%s", collection_name)
    return Chroma(
        collection_name=collection_name,
        embedding_function=_embedding_fn(settings),
        persist_directory=str(persist_dir),
    )