Add in checkpoint saving whilst Sentence Transformer embeddings are generated #8

Might be able to checkpoint the transformer model.

Yes. The easiest approach is to take advantage of **Chroma’s built-in persistence** and store your embeddings on disk. Then on subsequent runs, you can load from the persisted database instead of re-embedding all the documents. The key steps are:

1. **Embed Your Documents with a `persist_directory`**  
2. **Call `persist()` to Save**  
3. **Load the Persisted Database on the Next Run**

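Below is a simplified example of how you might do it. It uses `OpenAIEmbeddings`; for Sentence Transformer models, LangChain's `HuggingFaceEmbeddings` can be swapped in the same way: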

```python
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

def embed_and_persist_docs():
    # 1. Load documents
    loader = DirectoryLoader("./docs", glob="*.txt")
    docs = loader.load()

    # 2. Split documents into chunks
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    docs = text_splitter.split_documents(docs)

    # 3. Create the embedding model (vectors are computed in from_documents below)
    embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

    # IMPORTANT: specify `persist_directory`
    vectordb = Chroma.from_documents(
        documents=docs,
        embedding=embeddings,
        persist_directory="./chroma_db"
    )

    # 4. Persist to disk
    vectordb.persist()

def load_existing_db():
    # On subsequent runs, you can load directly from the persisted DB
    embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
    vectordb = Chroma(
        persist_directory="./chroma_db",
        embedding_function=embeddings
    )
    
    # Then you can query or do similarity searches without re-embedding
    query = "What is the doc about?"
    docs = vectordb.similarity_search(query)
    print(docs)
```
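
To decide on each run which of the two functions above to call, one option is to check whether the persist directory already holds data. The following is a minimal sketch, not part of LangChain or Chroma: `has_persisted_db` and `load_or_build` are hypothetical helper names, and the check assumes that a non-empty directory means an earlier run wrote a database there (it cannot tell whether that run finished).

```python
import os
from typing import Callable, TypeVar

T = TypeVar("T")


def has_persisted_db(persist_directory: str) -> bool:
    """Return True if the directory exists and contains at least one entry."""
    return os.path.isdir(persist_directory) and bool(os.listdir(persist_directory))


def load_or_build(
    persist_directory: str,
    build: Callable[[], T],
    load: Callable[[], T],
) -> T:
    """Call `load` when a persisted DB is present, otherwise call `build`."""
    if has_persisted_db(persist_directory):
        return load()
    return build()
```

With the functions from the example, a call might look like `load_or_build("./chroma_db", embed_and_persist_docs, load_existing_db)`.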

### How This Solves Your Problem

1. **No Re-Embedding**: Once the embeddings are computed and persisted to the directory (e.g. `./chroma_db`), you can load them from disk in future runs. That way, you skip the embedding step entirely if nothing has changed.

2. **Checkpointing While Embedding**: 
   - A simple form of checkpointing is to do **smaller batches** of documents, then call `persist()` after each batch. If your script crashes or is interrupted, you can pick up from the last persisted batch rather than starting over.
   - Concretely, you might:
     1. Split all your documents into subsets (or even one document at a time).
     2. Embed each subset and add it to the Chroma database (via `vectordb.add_documents(...)`).
     3. Call `vectordb.persist()`.
     4. Move on to the next subset.

   Here’s a sketch for batching:

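   ```python
   from langchain.embeddings import OpenAIEmbeddings
   from langchain.vectorstores import Chroma

   def batch_embedding(docs, batch_size=10):
       embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
       # Opens the existing DB in ./chroma_db, or creates it if absent
       vectordb = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)

       for i in range(0, len(docs), batch_size):
           batch_docs = docs[i : i + batch_size]
           # Embed this batch and add it to the collection
           vectordb.add_documents(batch_docs)
           # Flush to disk so completed batches survive a crash
           vectordb.persist()
           print(f"Persisted batch {i // batch_size + 1}")
   ```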

   If your script dies in the middle, the previous batches are already persisted. Note that `batch_embedding` as written always starts from the first document, so on restart you need to detect which batches were completed (e.g., by checking how many documents are in the DB or by keeping a simple batch counter file) and skip them, picking up from where you left off.
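
The batch counter file mentioned above could be sketched roughly as follows. This is an illustration under stated assumptions rather than a Chroma or LangChain feature: `load_checkpoint`, `save_checkpoint`, and `embed_with_resume` are hypothetical names, `add_batch` stands for any callable that embeds and stores a batch (for example `vectordb.add_documents`), and the document list is assumed to be in the same order on every run.

```python
import json
import os
from typing import Callable, Sequence


def load_checkpoint(path: str) -> int:
    """Return the index of the first document not yet embedded (0 if no checkpoint)."""
    if not os.path.exists(path):
        return 0
    with open(path, "r", encoding="utf-8") as f:
        return int(json.load(f)["next_index"])


def save_checkpoint(path: str, next_index: int) -> None:
    """Write the checkpoint via a temp file so a crash cannot leave it half-written."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp_path, path)


def embed_with_resume(
    docs: Sequence,
    add_batch: Callable[[list], None],
    checkpoint_path: str,
    batch_size: int = 10,
) -> int:
    """Embed `docs` in batches, skipping documents already recorded as done.

    Returns the number of documents processed in this run.
    """
    start = load_checkpoint(checkpoint_path)
    processed = 0
    for i in range(start, len(docs), batch_size):
        batch = list(docs[i : i + batch_size])
        add_batch(batch)  # embed and store this batch
        save_checkpoint(checkpoint_path, i + len(batch))  # record progress
        processed += len(batch)
    return processed
```

Because the checkpoint is written only after `add_batch` returns, a crash between the two steps means that one batch is added again on restart. If duplicates matter, the store would need stable document IDs or a deduplication step.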

### Alternative: Caching Embedding Calls
If you simply want to avoid calling the embedding API for any text chunk that has already been embedded, you can cache the embedding calls themselves. LangChain's LLM cache covers only LLM calls, but recent versions also ship `CacheBackedEmbeddings`, which wraps an embedding model and stores each vector in a key-value store keyed by a hash of the text. Alternatively, you can implement your own function that checks whether the text was already embedded (e.g., by storing a hash in a local SQLite or JSON file) before calling OpenAI again.
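
A hand-rolled cache of the kind described above might look like the sketch below. It is not a LangChain API: `EmbeddingCache` is a hypothetical class, and `embed_fn` is assumed to be any callable that maps a list of strings to a list of vectors (for example the `embed_documents` method of an embeddings object). The key is a SHA-256 hash of the text only, so a separate cache file per embedding model is assumed.

```python
import hashlib
import json
import sqlite3
from typing import Callable, Dict, List


class EmbeddingCache:
    """Cache embeddings in SQLite, keyed by the SHA-256 hash of the text."""

    def __init__(self, db_path: str, embed_fn: Callable[[List[str]], List[List[float]]]):
        self.embed_fn = embed_fn
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS embeddings "
            "(hash TEXT PRIMARY KEY, vector TEXT NOT NULL)"
        )
        self.conn.commit()

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        keys = [self._key(t) for t in texts]

        # Look up vectors that are already cached.
        cached: Dict[str, List[float]] = {}
        for key in set(keys):
            row = self.conn.execute(
                "SELECT vector FROM embeddings WHERE hash = ?", (key,)
            ).fetchone()
            if row is not None:
                cached[key] = json.loads(row[0])

        # Collect each missing text once, in first-seen order.
        missing: Dict[str, str] = {}
        for key, text in zip(keys, texts):
            if key not in cached and key not in missing:
                missing[key] = text

        # Embed only the missing texts and store the results.
        if missing:
            vectors = self.embed_fn(list(missing.values()))
            for key, vector in zip(missing, vectors):
                cached[key] = list(vector)
                self.conn.execute(
                    "INSERT OR REPLACE INTO embeddings (hash, vector) VALUES (?, ?)",
                    (key, json.dumps(cached[key])),
                )
            self.conn.commit()

        return [cached[key] for key in keys]
```

Storing vectors as JSON text keeps the sketch dependency-free; for large collections a binary format would be more compact.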

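But in most use cases, **Chroma’s local persistence** (as shown above) is sufficient to ensure you don’t re-embed the same documents every time you run your script, and it is often the simplest way to checkpoint your embeddings. Note that since Chroma 0.4, data written with a `persist_directory` is saved automatically and the explicit `persist()` call is deprecated in newer LangChain releases, so depending on your installed versions it may be unnecessary or unavailable.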
