Add in checkpoint saving whilst Sentence Transformer embeddings are generated #8

@agstephens

Might be able to checkpoint the transformer model.

Yes. The easiest approach is to take advantage of Chroma’s built-in persistence and store your embeddings on disk. Then on subsequent runs, you can load from the persisted database instead of re-embedding all the documents. The key steps are:

  1. Embed Your Documents with a persist_directory
  2. Call persist() to Save
  3. Load the Persisted Database on the Next Run

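Below is a simplified example of how you might do it. It uses OpenAIEmbeddings; for a local Sentence Transformer model you could swap in an embedding class such as HuggingFaceEmbeddings, and the persistence steps stay the same: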

from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

def embed_and_persist_docs():
    # 1. Load documents
    loader = DirectoryLoader("./docs", glob="*.txt")
    docs = loader.load()

    # 2. Split documents into chunks
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    docs = text_splitter.split_documents(docs)

    # 3. Create and save the embeddings
    embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

    # IMPORTANT: specify `persist_directory`
    vectordb = Chroma.from_documents(
        documents=docs,
        embedding=embeddings,
        persist_directory="./chroma_db"
    )

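    # 4. Persist to disk (with Chroma 0.4+ writes are persisted automatically
    #    and this call is deprecated, so it can be dropped there)
    vectordb.persist()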

def load_existing_db():
    # On subsequent runs, you can load directly from the persisted DB
    embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
    vectordb = Chroma(
        persist_directory="./chroma_db",
        embedding_function=embeddings
    )
    
    # Then you can query or do similarity searches without re-embedding
    query = "What is the doc about?"
    docs = vectordb.similarity_search(query)
    print(docs)
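
One way to choose between the two functions above at startup is to fingerprint the source files and re-embed only when the fingerprint changes. The following is a minimal sketch, not part of the original answer: the helper names and the manifest file are hypothetical, and it treats a change in file name, size, or modification time as "the documents changed".

```python
import hashlib
import json
from pathlib import Path


def docs_fingerprint(docs_dir, pattern="*.txt"):
    """Hash the name, size and mtime of every matching file in docs_dir."""
    digest = hashlib.sha256()
    for path in sorted(Path(docs_dir).glob(pattern)):
        stat = path.stat()
        digest.update(f"{path.name}|{stat.st_size}|{stat.st_mtime_ns}\n".encode())
    return digest.hexdigest()


def needs_reembedding(docs_dir, manifest_path, pattern="*.txt"):
    """Return True when no manifest exists or the documents changed since it was written."""
    manifest = Path(manifest_path)
    if not manifest.exists():
        return True
    saved = json.loads(manifest.read_text()).get("fingerprint")
    return saved != docs_fingerprint(docs_dir, pattern)


def record_fingerprint(docs_dir, manifest_path, pattern="*.txt"):
    """Store the current fingerprint; call this only after embedding succeeded."""
    Path(manifest_path).write_text(
        json.dumps({"fingerprint": docs_fingerprint(docs_dir, pattern)})
    )
```

A script could then call embed_and_persist_docs() followed by record_fingerprint(...) when needs_reembedding(...) returns True, and load_existing_db() otherwise.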

How This Solves Your Problem

  1. No Re-Embedding: Once the embeddings are computed and persisted to the directory (e.g. ./chroma_db), you can load them from disk in future runs. That way, you skip the embedding step entirely if nothing has changed.

  2. Checkpointing While Embedding:

    • A simple form of checkpointing is to do smaller batches of documents, then call persist() after each batch. If your script crashes or is interrupted, you can pick up from the last persisted batch rather than starting over.
    • Concretely, you might:
      1. Split all your documents into subsets (or even one document at a time).
      2. Embed each subset and add it to the Chroma database (via vectordb.add_documents(...) on the Chroma instance).
      3. Call vectordb.persist().
      4. Move on to the next subset.

    Here’s a sketch for batching:

    def batch_embedding(docs, batch_size=10):
        embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
        vectordb = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)
    
        for i in range(0, len(docs), batch_size):
            batch_docs = docs[i : i + batch_size]
            vectordb.add_documents(batch_docs)
            vectordb.persist()
            print(f"Persisted batch {i // batch_size + 1}")

    If your script dies in the middle, it should already have persisted the previous batches. On restart, you can detect which batches were completed (e.g., by checking how many documents are in the DB or by keeping a simple batch counter file) and pick up from where you left off.
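
The "simple batch counter file" idea can be sketched as follows. This is an illustration under assumptions rather than a definitive implementation: the function names and the JSON checkpoint file are hypothetical, and process_batch stands in for whatever embeds and stores one batch (for example, a function that calls vectordb.add_documents(batch) and vectordb.persist()).

```python
import json
import os


def load_next_batch(checkpoint_path):
    """Return the index of the first batch that has not been completed yet."""
    if not os.path.exists(checkpoint_path):
        return 0
    with open(checkpoint_path) as f:
        return json.load(f)["next_batch"]


def save_next_batch(checkpoint_path, next_batch):
    """Write the counter via a temp file so a crash cannot leave it half-written."""
    tmp_path = checkpoint_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"next_batch": next_batch}, f)
    os.replace(tmp_path, checkpoint_path)


def resumable_batches(docs, process_batch, checkpoint_path, batch_size=10):
    """Run process_batch on each unfinished batch, recording progress after each one."""
    start = load_next_batch(checkpoint_path)
    num_batches = (len(docs) + batch_size - 1) // batch_size
    processed = 0
    for batch_index in range(start, num_batches):
        batch = docs[batch_index * batch_size : (batch_index + 1) * batch_size]
        process_batch(batch)  # hypothetical callback: embed and store this batch
        save_next_batch(checkpoint_path, batch_index + 1)
        processed += 1
    return processed  # number of batches handled in this run
```

The counter is only advanced after a batch finishes, so a crash between storing a batch and updating the counter means that batch is processed again on restart. The document order and batch_size must also stay the same between runs for the counter to remain meaningful.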

Alternative: Caching Embedding Calls

If you simply want to avoid calling the embedding API for any text chunk that has already been embedded, note that LangChain's caching utility for LLM calls does not cover embeddings. Recent LangChain versions include a CacheBackedEmbeddings wrapper for this purpose (check that your installed version has it); otherwise you can implement your own function that checks whether the text was already embedded (e.g., by storing a hash in a local SQLite or JSON file) before calling OpenAI again.
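
A hand-rolled version of that check might look like the sketch below. It is an assumption-laden illustration: the EmbeddingCache class, its table layout, and the embed_fn callback (taking a list of strings and returning one vector per string) are all hypothetical names, not part of LangChain or Chroma.

```python
import hashlib
import json
import sqlite3


class EmbeddingCache:
    """Cache embedding vectors in SQLite, keyed by the SHA-256 of the text."""

    def __init__(self, embed_fn, db_path="embedding_cache.sqlite"):
        self.embed_fn = embed_fn  # hypothetical: list[str] -> list of vectors
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, vector TEXT)"
        )

    @staticmethod
    def _key(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed_documents(self, texts):
        keys = [self._key(t) for t in texts]

        # Look up every distinct key that is already cached.
        found = {}
        for key in set(keys):
            row = self.conn.execute(
                "SELECT vector FROM cache WHERE key = ?", (key,)
            ).fetchone()
            if row is not None:
                found[key] = json.loads(row[0])

        # Embed only the texts that are not cached yet, each one once.
        missing = {}
        for key, text in zip(keys, texts):
            if key not in found:
                missing.setdefault(key, text)
        if missing:
            vectors = self.embed_fn(list(missing.values()))
            with self.conn:  # commits the inserts on success
                for key, vector in zip(missing, vectors):
                    as_list = [float(x) for x in vector]
                    self.conn.execute(
                        "INSERT OR REPLACE INTO cache VALUES (?, ?)",
                        (key, json.dumps(as_list)),
                    )
                    found[key] = as_list

        return [found[key] for key in keys]
```

Because the key is derived from the text alone, a cache file should only be shared between runs that use the same embedding model; including the model name in the key would be one way to relax that.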

In most use cases, Chroma’s local persistence (as shown above) is sufficient to ensure you don’t re-embed the same documents every time you run your script, and it is often the option that needs the least extra code.
