This document explains how the Python LangChain-based RAG script was converted to R using the ragnar and ellmer packages.

| Python Package | R Package | Purpose |
|---|---|---|
| `langchain_community.document_loaders` | `ragnar::read_as_markdown()` | Document loading |
| `langchain_experimental.text_splitter.SemanticChunker` | `ragnar::markdown_chunk()` | Semantic chunking |
| `langchain_openai.embeddings.OpenAIEmbeddings` | `ragnar::embed_openai()` | Embeddings |
| `langchain_core.vectorstores.InMemoryVectorStore` | `ragnar::ragnar_store_create()` | Vector storage |
| `langchain_openai.ChatOpenAI` | `ellmer::chat_openai()` | Chat interface |
| `chatlas.ChatOllama` | `ellmer::chat_ollama()` | Local Ollama models |

Python:

```python
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader(pdf_path)
pages = [page.page_content for page in loader.lazy_load()]
```

R:

```r
library(ragnar)

pdf_markdown <- read_as_markdown(pdf_path)
```

Python:

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings  # needed by SemanticChunker below

text_splitter = SemanticChunker(OpenAIEmbeddings(model="text-embedding-3-large"))
chunks = text_splitter.create_documents(pages)
```

R:

```r
chunks <- markdown_chunk(pdf_markdown)
```

Python:

```python
from langchain_core.vectorstores import InMemoryVectorStore

vectorstore = InMemoryVectorStore.from_texts(
    texts,
    embedding=OpenAIEmbeddings(model="text-embedding-3-large"),
)
```

R:

```r
store <- ragnar_store_create(
  "store.duckdb",
  embed = \(x) embed_openai(x, model = "text-embedding-3-large")
)
ragnar_store_insert(store, chunks)
ragnar_store_build_index(store)
```

Note: The R version writes the store to a DuckDB file (`store.duckdb`), so it persists on disk, whereas the Python version keeps its vectors in memory only.
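
To make the difference between a persistent and an in-memory store concrete, here is a minimal Python sketch. It assumes `sqlite3` from the standard library as a stand-in for DuckDB, and the helper names (`save_chunks`, `load_chunks`) and the toy vectors are hypothetical, not part of ragnar or LangChain.

```python
import json
import os
import sqlite3
import tempfile


def save_chunks(db_path, chunks):
    """Write (text, embedding) pairs to a SQLite database at db_path."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS chunks (text TEXT, embedding TEXT)")
    con.executemany(
        "INSERT INTO chunks VALUES (?, ?)",
        [(text, json.dumps(vec)) for text, vec in chunks],
    )
    con.commit()
    con.close()


def load_chunks(db_path):
    """Open a fresh connection and read back whatever is stored."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS chunks (text TEXT, embedding TEXT)")
    rows = con.execute("SELECT text, embedding FROM chunks").fetchall()
    con.close()
    return [(text, json.loads(vec)) for text, vec in rows]


# Toy data: these vectors are made up, not real embeddings.
toy_chunks = [("first chunk", [0.1, 0.9]), ("second chunk", [0.8, 0.2])]

# File-backed store: the rows are still there after the connection closes.
db_file = os.path.join(tempfile.mkdtemp(), "store.sqlite")
save_chunks(db_file, toy_chunks)
persisted = load_chunks(db_file)

# In-memory store: each new connection to ":memory:" starts out empty.
save_chunks(":memory:", toy_chunks)
in_memory = load_chunks(":memory:")

print(len(persisted), len(in_memory))
```

The file-backed store can be reopened in a later session without re-embedding the document, which is the practical benefit of the DuckDB file in the R version.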
Python:

```python
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"score_threshold": 0.7, "k": 3},
)
retrieved_documents = retriever.invoke(user_query)
```

R:

```r
retrieved_chunks <- ragnar_retrieve(store, user_query, top_k = 3)
```

Note: `ragnar_retrieve()` combines both VSS (vector similarity search) and BM25 text search.
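
The idea of combining a similarity ranking with a BM25 ranking can be sketched in plain Python. This is a toy illustration, not ragnar's implementation: the cosine over raw term counts stands in for real embedding similarity, and the merge rule (union of the two top-k lists) and all function names are assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def cosine_scores(query, docs):
    """Stand-in for vector similarity: cosine over raw term counts."""
    q = Counter(tokenize(query))
    q_norm = math.sqrt(sum(v * v for v in q.values()))
    scores = []
    for doc in docs:
        d = Counter(tokenize(doc))
        d_norm = math.sqrt(sum(v * v for v in d.values()))
        dot = sum(q[t] * d[t] for t in q)
        norm = q_norm * d_norm
        scores.append(dot / norm if norm else 0.0)
    return scores


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 over whitespace tokens."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    avg_len = sum(len(t) for t in tokenized) / n
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            df = sum(1 for t in tokenized if term in t)
            if df == 0:
                continue
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def top_k(scores, k):
    """Indices of the k highest scores, best first, dropping zero scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked[:k] if scores[i] > 0]


def hybrid_retrieve(query, docs, k=3):
    """Union of the top-k hits from both rankings, without duplicates."""
    vss_hits = top_k(cosine_scores(query, docs), k)
    bm25_hits = top_k(bm25_scores(query, docs), k)
    merged = list(dict.fromkeys(vss_hits + bm25_hits))  # keeps first-seen order
    return [docs[i] for i in merged]


docs = [
    "duckdb stores vectors on disk",
    "bm25 ranks documents by term frequency",
    "cats sleep most of the day",
]
hits = hybrid_retrieve("how does bm25 rank documents", docs, k=2)
print(hits)
```

Because the two lists are merged as a union, a hybrid call with `k = 3` can return more than three chunks when the two rankings disagree.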
Python:

```python
from langchain_core.output_parsers import StrOutputParser  # used in the chain below
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)
structured_prompt = ChatPromptTemplate.from_template(prompt_template)
chain = structured_prompt | llm | StrOutputParser()
response = chain.invoke({"retrieved_documents": docs, "user_query": query})
```

R:

```r
library(ellmer)

chat <- chat_openai(
  model = "gpt-4o",
  system_prompt = "Your system prompt here"
)

# Manual retrieval approach
response <- chat$chat(formatted_prompt)

# Or use tool registration for automatic retrieval
ragnar_register_tool_retrieve(chat, store, top_k = 3)
response <- chat$chat(user_query)
```

Python:

```python
import re

def clean_hyphenated_linebreaks(text):
    # Rejoin words that were split by a hyphen at a line break
    return re.sub(r'(\w+)-\n(\w+)', r'\1\2', text)

def fix_ligatures(text):
    ligature_map = {
        '\ufb01': 'fi',
        '\ufb02': 'fl',
        # ...
    }
    for ligature, replacement in ligature_map.items():
        text = text.replace(ligature, replacement)
    return text
```

R:

```r
library(stringr)

clean_hyphenated_linebreaks <- function(text) {
  str_replace_all(text, "(\\w+)-\\n(\\w+)", "\\1\\2")
}

fix_ligatures <- function(text) {
  ligature_map <- c(
    "\ufb01" = "fi", # fi ligature
    "\ufb02" = "fl"  # fl ligature
    # ...
  )
  for (i in seq_along(ligature_map)) {
    text <- str_replace_all(text, names(ligature_map)[i], ligature_map[[i]])
  }
  return(text)
}
```

- Integrated workflow: ragnar provides an end-to-end RAG pipeline
- Persistent storage: Uses DuckDB for efficient, persistent vector storage
- Hybrid search: Combines VSS and BM25 automatically
- Tool integration: `ragnar_register_tool_retrieve()` lets LLMs retrieve on demand
- Tidyverse-friendly: Works well with dplyr/tidyverse workflows

Make sure you have your OpenAI API key set:

```
# In your .Renviron file (edit with usethis::edit_r_environ())
OPENAI_API_KEY="your-api-key-here"
```

Install the required packages:

```r
install.packages("ragnar")
install.packages("ellmer")
install.packages("stringr")
```

Then run the script:

```r
source("main.R")
```