lightrag-bigquery

Google Cloud BigQuery storage backend for LightRAG.

This package provides four BigQuery-backed storage classes as an external plugin — no modifications to LightRAG source code required.

| Storage | Class | Description |
|---|---|---|
| KV | BigQueryKVStorage | Key-value storage with JSON serialization |
| Vector | BigQueryVectorStorage | Vector storage with cosine similarity search |
| Graph | BigQueryGraphStorage | Graph storage with BigQuery Property Graph support |
| DocStatus | BigQueryDocStatusStorage | Document processing status tracking |

Installation

pip install lightrag-hku
pip install git+https://github.com/ksmin23/lightrag-bigquery.git@v0.1.0

Quick Start

import asyncio
import lightrag_bigquery
from lightrag import LightRAG, QueryParam
from lightrag.llm.openai import gpt_4o_mini_complete, openai_embed

# Register BigQuery storage classes with LightRAG
lightrag_bigquery.register()

async def main():
    rag = LightRAG(
        working_dir="./rag_storage",
        llm_model_func=gpt_4o_mini_complete,
        embedding_func=openai_embed,
        kv_storage="BigQueryKVStorage",
        vector_storage="BigQueryVectorStorage",
        graph_storage="BigQueryGraphStorage",
        doc_status_storage="BigQueryDocStatusStorage",
        addon_params={
            "bigquery_project_id": "my-project",
            "bigquery_dataset_id": "my_dataset",  # dataset IDs allow only letters, digits, and underscores
        },
    )

    await rag.initialize_storages()
    await rag.ainsert("Your document text here")
    result = await rag.aquery("Your question", param=QueryParam(mode="hybrid"))
    print(result)
    await rag.finalize_storages()

asyncio.run(main())

Configuration

BigQuery connection settings can be provided via addon_params or environment variables. An environment variable is used as a fallback only when the corresponding addon_params key is not set.

| addon_params key | Environment Variable | Description |
|---|---|---|
| bigquery_project_id | BIGQUERY_PROJECT or GOOGLE_CLOUD_PROJECT | GCP project ID |
| bigquery_dataset_id | BIGQUERY_DATASET | BigQuery dataset ID |
| bigquery_graph_name | BIGQUERY_GRAPH_NAME | Property graph name (default: lightrag_knowledge_graph) |
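The fallback order described above can be sketched as follows. This is an illustration only, not the package's actual code: `resolve_bigquery_config` and its return keys are hypothetical names, and the order in which BIGQUERY_PROJECT and GOOGLE_CLOUD_PROJECT are checked is an assumption based on the order they are listed in the table.

```python
import os
from typing import Mapping, Optional

DEFAULT_GRAPH_NAME = "lightrag_knowledge_graph"  # default stated in the table above


def resolve_bigquery_config(
    addon_params: Optional[Mapping[str, str]] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> dict:
    """Hypothetical helper: addon_params win, environment variables are the fallback."""
    params = addon_params or {}
    env = os.environ if environ is None else environ

    def pick(key: str, *env_names: str, default: Optional[str] = None) -> Optional[str]:
        # 1) An explicit addon_params value takes precedence.
        if params.get(key):
            return params[key]
        # 2) Otherwise use the first environment variable that is set.
        #    (The order between the two project variables is assumed.)
        for name in env_names:
            if env.get(name):
                return env[name]
        # 3) Otherwise fall back to the default, if there is one.
        return default

    return {
        "project_id": pick("bigquery_project_id", "BIGQUERY_PROJECT", "GOOGLE_CLOUD_PROJECT"),
        "dataset_id": pick("bigquery_dataset_id", "BIGQUERY_DATASET"),
        "graph_name": pick("bigquery_graph_name", "BIGQUERY_GRAPH_NAME", default=DEFAULT_GRAPH_NAME),
    }
```

Passing `environ` explicitly keeps the sketch easy to test; calling it without arguments reads from the real process environment.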

Using Environment Variables

export GOOGLE_CLOUD_PROJECT=my-project
export BIGQUERY_DATASET=my_dataset

With these variables exported, addon_params can be omitted in Python:

import lightrag_bigquery
from lightrag import LightRAG

lightrag_bigquery.register()
rag = LightRAG(
    kv_storage="BigQueryKVStorage",
    vector_storage="BigQueryVectorStorage",
    graph_storage="BigQueryGraphStorage",
    doc_status_storage="BigQueryDocStatusStorage",
    ...
)

LLM Authentication

LLM and embedding authentication is handled by LightRAG core, not by this package. Choose one of the following options depending on your LLM provider:

| Option | Environment Variable | Description |
|---|---|---|
| Gemini via Vertex AI | GOOGLE_GENAI_USE_VERTEXAI=true | Uses Application Default Credentials (ADC). No API key needed. Recommended on GCP. |
| Gemini via AI Studio | GEMINI_API_KEY | Uses a Gemini API key from AI Studio. |
| OpenAI | OPENAI_API_KEY | Uses an OpenAI API key. |

Note: To enable Vertex AI mode, LightRAG's gemini.py compares GOOGLE_GENAI_USE_VERTEXAI against the string "true", ignoring case. Other truthy-looking values such as "1" or "yes" will not activate Vertex AI mode.
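The behaviour in the note amounts to a check like the one below. This is a sketch of the described rule, not a copy of LightRAG's gemini.py; the function name `vertex_ai_enabled` is hypothetical.

```python
import os
from typing import Mapping, Optional


def vertex_ai_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Hypothetical check: only the string "true" (in any letter case) enables Vertex AI."""
    env = os.environ if environ is None else environ
    return env.get("GOOGLE_GENAI_USE_VERTEXAI", "").lower() == "true"
```

Under this rule `true`, `True`, and `TRUE` all enable Vertex AI mode, while `1`, `yes`, or an unset variable do not.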

Prerequisites

GCP Authentication

gcloud auth application-default login

Or with a service account key:

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json

BigQuery Dataset

The dataset is created automatically during initialization (CREATE SCHEMA IF NOT EXISTS). Tables and the property graph are also created automatically.
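As a rough illustration of the statement involved, the sketch below builds a CREATE SCHEMA IF NOT EXISTS DDL string and rejects dataset IDs that BigQuery would refuse (dataset IDs may contain only letters, digits, and underscores, up to 1,024 characters). The helper name `create_schema_ddl` is hypothetical, and the exact DDL the package issues, including any location or options, is not shown in this README.

```python
import re

# BigQuery dataset IDs: letters, digits, and underscores, up to 1,024 characters.
_DATASET_ID_PATTERN = re.compile(r"[A-Za-z0-9_]{1,1024}")


def create_schema_ddl(project_id: str, dataset_id: str) -> str:
    """Hypothetical helper: return an idempotent dataset-creation statement."""
    if not _DATASET_ID_PATTERN.fullmatch(dataset_id):
        raise ValueError(
            f"Invalid dataset ID {dataset_id!r}: use only letters, digits, and underscores"
        )
    # Backticks let the project ID contain hyphens, e.g. `my-project.lightrag`.
    return f"CREATE SCHEMA IF NOT EXISTS `{project_id}.{dataset_id}`"
```

Because of IF NOT EXISTS, running the statement against an existing dataset is a no-op, which is why initialization can issue it on every start.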

If you prefer to create the dataset manually:

export BIGQUERY_DATASET=lightrag

bq --location=US mk --dataset $GOOGLE_CLOUD_PROJECT:$BIGQUERY_DATASET

Project Structure

lightrag-bigquery/
├── pyproject.toml
├── src/
│   └── lightrag_bigquery/
│       ├── __init__.py                         # register() and public exports
│       ├── client.py                           # BigQueryClientManager and helpers
│       └── storage.py                          # All 4 storage class implementations
└── examples/
    ├── .env.example                            # Environment variable template
    ├── _config.py                              # Shared configuration loader
    ├── requirements.txt
    ├── basic_usage.py
    ├── env_var_config.py
    ├── batch_insert_and_query.py
    └── knowledge_graph_exploration.py

Design Decisions

| Decision | Approach | Rationale |
|---|---|---|
| Sync vs Async | Synchronous BigQuery SDK wrapped with asyncio.to_thread | BigQuery Python SDK is synchronous; avoids blocking the event loop |
| Upsert | MERGE INTO ... WHEN MATCHED / NOT MATCHED | BigQuery lacks INSERT OR UPDATE; MERGE is the idiomatic alternative |
| Workspace Isolation | Column-based filtering (WHERE workspace = @ws) | Avoids DDL proliferation from per-workspace tables |
| Property Graph | BigQuery Property Graph (CREATE PROPERTY GRAPH) | Native graph support for nodes and edges |
| Embedding Type | ARRAY<FLOAT64> | BigQuery's native vector type with COSINE_DISTANCE support |
| Vector Search | COSINE_DISTANCE() in ORDER BY | Simple and universal; VECTOR_SEARCH with IVF index can be added later |
| Client Reuse | Singleton BigQueryClientManager | Shares a single BigQuery client across all storage classes |
| Primary Key | PRIMARY KEY (id) NOT ENFORCED | BigQuery does not enforce primary keys; used as advisory hints |
| Fuzzy Search | LIKE-based pattern matching | BigQuery standard SQL; sufficient for entity label search |
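The first three decisions fit together roughly as in the sketch below: a MERGE statement keyed on id and workspace, executed through a blocking call that is pushed to a worker thread with asyncio.to_thread. Everything named here is an assumption made for illustration: the column names (id, workspace, data), the helpers `build_upsert_sql` and `upsert`, and the `fake_run_query` stand-in that replaces a real BigQuery client call so the example runs without credentials. It is not the package's storage.py.

```python
import asyncio
import time
from typing import Any, Callable, Dict


def build_upsert_sql(table: str) -> str:
    """Build a parameterized MERGE that updates an existing row or inserts a new one."""
    return (
        f"MERGE `{table}` AS t\n"
        "USING (SELECT @id AS id, @ws AS workspace, @data AS data) AS s\n"
        "ON t.id = s.id AND t.workspace = s.workspace\n"
        "WHEN MATCHED THEN UPDATE SET data = s.data\n"
        "WHEN NOT MATCHED THEN INSERT (id, workspace, data) "
        "VALUES (s.id, s.workspace, s.data)"
    )


async def upsert(
    run_query: Callable[[str, Dict[str, Any]], Any],
    table: str,
    workspace: str,
    key: str,
    data: str,
) -> Any:
    """Run a blocking query function in a worker thread so the event loop stays free."""
    sql = build_upsert_sql(table)
    params = {"id": key, "ws": workspace, "data": data}
    return await asyncio.to_thread(run_query, sql, params)


def fake_run_query(sql: str, params: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for a synchronous BigQuery call; sleeps to mimic network latency."""
    time.sleep(0.01)
    return {"sql": sql, "params": params}


if __name__ == "__main__":
    result = asyncio.run(
        upsert(fake_run_query, "my-project.lightrag.kv_store", "default", "doc-1", '{"text": "hello"}')
    )
    print(result["sql"])
```

Matching on both id and workspace in the ON clause is what keeps rows from different workspaces apart in a shared table, in line with the column-based isolation described in the table. asyncio.to_thread requires Python 3.9 or later.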

License

MIT
