Might be able to checkpoint the transformer model.
Yes. The easiest approach is to take advantage of Chroma’s built-in persistence and store your embeddings on disk. Then on subsequent runs, you can load from the persisted database instead of re-embedding all the documents. The key steps are:
- Embed Your Documents with a persist_directory
- Call persist() to Save
- Load the Persisted Database on the Next Run
Below is a simplified example of how you might do it:
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma


def embed_and_persist_docs():
    # 1. Load documents
    loader = DirectoryLoader("./docs", glob="*.txt")
    docs = loader.load()

    # 2. Split documents into chunks
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    docs = text_splitter.split_documents(docs)

    # 3. Create and save the embeddings
    embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

    # IMPORTANT: specify `persist_directory`
    vectordb = Chroma.from_documents(
        documents=docs,
        embedding=embeddings,
        persist_directory="./chroma_db"
    )

    # 4. Persist to disk
    vectordb.persist()


def load_existing_db():
    # On subsequent runs, you can load directly from the persisted DB
    embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
    vectordb = Chroma(
        persist_directory="./chroma_db",
        embedding_function=embeddings
    )

    # Then you can query or do similarity searches without re-embedding
    query = "What is the doc about?"
    docs = vectordb.similarity_search(query)
    print(docs)
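To choose between the two functions at startup, one option is to check whether the persist directory already holds data. The sketch below is only an illustration: `has_persisted_db` and `get_vectordb` are hypothetical helper names, and `build_fn` / `load_fn` stand in for functions like the two above, adapted to return the vector store. It assumes that a non-empty directory means a usable database, which may not hold if an earlier run was interrupted.

```python
import os


def has_persisted_db(persist_directory: str) -> bool:
    """Return True if the directory exists and contains at least one entry."""
    return os.path.isdir(persist_directory) and len(os.listdir(persist_directory)) > 0


def get_vectordb(persist_directory, build_fn, load_fn):
    """Load the persisted DB when present; otherwise build (embed) it once.

    build_fn and load_fn are placeholders (assumptions) for your own
    embed-and-persist and load functions.
    """
    if has_persisted_db(persist_directory):
        return load_fn()
    return build_fn()
```

With this in place, the first run pays the embedding cost and later runs go straight to loading, as long as the directory is left untouched.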
How This Solves Your Problem
- No Re-Embedding: Once the embeddings are computed and persisted to the directory (e.g. ./chroma_db), you can load them from disk in future runs. That way, you skip the embedding step entirely if nothing has changed.
- Checkpointing While Embedding:
  - A simple form of checkpointing is to do smaller batches of documents, then call persist() after each batch. If your script crashes or is interrupted, you can pick up from the last persisted batch rather than starting over.
  - Concretely, you might:
    - Split all your documents into subsets (or even one document at a time).
    - Embed each subset and add them to the Chroma database (via Chroma.add_documents(...)).
    - Call vectordb.persist().
    - Move on to the next subset.
Here’s a sketch for batching:
def batch_embedding(docs, batch_size=10):
    embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
    vectordb = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)

    for i in range(0, len(docs), batch_size):
        batch_docs = docs[i : i + batch_size]
        vectordb.add_documents(batch_docs)
        vectordb.persist()
        print(f"Persisted batch {i // batch_size + 1}")
If your script dies in the middle, it should already have persisted the previous batches. On restart, you can detect which batches were completed (e.g., by checking how many documents are in the DB or by keeping a simple batch counter file) and pick up from where you left off.
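The "simple batch counter file" idea could look roughly like the sketch below. It is library-agnostic on purpose: `process_batch` is a placeholder (an assumption) for whatever embeds and persists one batch, for example a function that calls add_documents and persist. The checkpoint file name and its JSON layout are also just illustrative choices.

```python
import json
import os


def load_next_index(checkpoint_path):
    """Return the index of the first unprocessed item (0 if no checkpoint exists)."""
    if not os.path.exists(checkpoint_path):
        return 0
    with open(checkpoint_path) as f:
        return json.load(f)["next_index"]


def save_next_index(checkpoint_path, next_index):
    """Write the checkpoint via a temp file so a crash cannot leave it half-written."""
    tmp_path = checkpoint_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp_path, checkpoint_path)


def run_batches(items, batch_size, process_batch, checkpoint_path):
    """Process items in batches, resuming after the last completed batch."""
    start = load_next_index(checkpoint_path)
    for i in range(start, len(items), batch_size):
        process_batch(items[i : i + batch_size])  # e.g. embed + persist this batch
        save_next_index(checkpoint_path, min(i + batch_size, len(items)))
```

Note that the counter is written only after a batch finishes, so a crash between processing and saving means that one batch is processed again on restart. If duplicates matter, you would need stable document IDs or a check against what is already in the DB.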
Alternative: Caching Embedding Calls
If you simply want to avoid calling the embedding API for any text chunk that has already been embedded, note that LangChain's caching utility for LLM calls does not cover embeddings. Depending on your LangChain version, an embedding cache wrapper (CacheBackedEmbeddings) may be available; otherwise you'd typically implement your own function that checks whether the text was already embedded (e.g., by storing a hash in a local SQLite or JSON file) before calling OpenAI again.
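A hand-rolled version of that hash check might look like the following minimal sketch. `EmbeddingCache` is a hypothetical class, not a LangChain API, and `embed_fn` is a placeholder for whatever actually calls the embedding service. Vectors are stored as JSON text in SQLite, keyed by a SHA-256 hash of the chunk.

```python
import hashlib
import json
import sqlite3


class EmbeddingCache:
    """Cache embedding vectors in SQLite, keyed by a hash of the input text."""

    def __init__(self, db_path, embed_fn):
        self.embed_fn = embed_fn  # assumption: takes a str, returns a list of floats
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, vector TEXT)"
        )

    def embed(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        row = self.conn.execute(
            "SELECT vector FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # cache hit: no API call

        vector = self.embed_fn(text)  # cache miss: compute and store
        self.conn.execute(
            "INSERT INTO cache (key, vector) VALUES (?, ?)",
            (key, json.dumps(vector)),
        )
        self.conn.commit()
        return vector
```

If you ever switch embedding models, it is probably worth including the model name in the hashed key so that vectors from different models are not mixed.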
But in most use cases, Chroma’s local persistence (as shown above) is sufficient to ensure you don’t re-embed the same documents every time you run your script. It’s often the easiest “one-line” solution for checkpointing your embeddings.