A production-ready Retrieval-Augmented Generation (RAG) application built with Spring AI 2.0, Ollama, and PGVector. Upload your documents (PDFs, text files), chunk and embed them into a vector store, and chat with an LLM whose answers are grounded in your uploaded content — minimizing hallucinations.
This project is designed as a learning resource for developers exploring Spring AI and building RAG pipelines.
```
┌─────────────────────────────────────────────────────────┐
│                     React Frontend                      │
│           (Vite + React Router + CSS Modules)           │
│                                                         │
│  ┌──────────┐   ┌──────────────┐   ┌───────────────┐    │
│  │ HomePage │   │  UploadPage  │   │   ChatPage    │    │
│  └──────────┘   └──────┬───────┘   └───────┬───────┘    │
└────────────────────────┼───────────────────┼────────────┘
                         │ /api/upload       │ /api/chat
                         ▼                   ▼
┌─────────────────────────────────────────────────────────┐
│                   Spring Boot Backend                   │
│                                                         │
│  ┌────────────────────┐      ┌──────────────────────┐   │
│  │FileUploadController│      │    ChatController    │   │
│  │  POST /api/upload  │      │    POST /api/chat    │   │
│  │  GET /api/status   │      │                      │   │
│  └─────────┬──────────┘      └──────────┬───────────┘   │
│            │                            │               │
│  ┌─────────▼──────────┐      ┌──────────▼───────────┐   │
│  │DataIngestionService│      │    QuestionAnswer    │   │
│  │AsyncIngestion      │      │       Advisor        │   │
│  │Processor (@Async)  │      │ (similarity search)  │   │
│  └─────────┬──────────┘      └──────────┬───────────┘   │
│            │                            │               │
│  ┌─────────▼────────────────────────────▼───────────┐   │
│  │             PGVector (Vector Store)              │   │
│  │      Embeddings via Ollama nomic-embed-text      │   │
│  └─────────────────────────┬────────────────────────┘   │
│                            │                            │
│                   ┌────────▼────────┐                   │
│                   │   Ollama LLM    │                   │
│                   │   llama3.2:1b   │                   │
│                   └─────────────────┘                   │
└─────────────────────────────────────────────────────────┘
```
| Layer | Technology | Purpose |
|---|---|---|
| Backend | Spring Boot 4.0 + Spring AI 2.0 | REST API, AI orchestration |
| LLM | Ollama (any model) | Local inference, no API keys needed. Default: llama3.2:1b |
| Embeddings | Ollama (nomic-embed-text) | Document embedding for similarity search |
| Vector Store | PGVector (PostgreSQL extension) | Stores and queries document embeddings |
| Document Parsing | Apache Tika | Extracts text from PDFs, DOCX, TXT, etc. |
| Frontend | React 19 + Vite + React Router | SPA with modular dark-themed UI |
| Infrastructure | Docker Compose (Spring Boot managed) | Auto-started by Spring Boot on app launch |
This project demonstrates several key Spring AI features:
```java
// ChatController.java
this.chatClient = ChatClient.builder(ollamaChatModel)
        .defaultSystem(SYSTEM_PROMPT)
        .defaultAdvisors(/* ... */)
        .build();
```

The `ChatClient` is Spring AI's high-level abstraction for interacting with LLMs. You configure it once with a system prompt and advisors, then call `.prompt().user(message).call().content()` for each request.
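A per-request call might look like this in a handler method — a sketch only: beyond the `POST /api/chat` endpoint described in this README, the method name and parameter style are assumptions, not the project's actual code.

```java
// Hypothetical handler sketch using the configured ChatClient.
@PostMapping("/api/chat")
public String chat(@RequestBody String question) {
    return chatClient.prompt()
            .user(question)   // the user's question
            .call()           // blocking call to the LLM
            .content();       // extract the response text
}
```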
```java
QuestionAnswerAdvisor.builder(vectorStore)
        .searchRequest(SearchRequest.builder()
                .topK(3)
                .similarityThreshold(0.7)
                .build())
        .build()
```

This is the core of the RAG pipeline. The `QuestionAnswerAdvisor` automatically:
- Takes the user's question
- Performs a similarity search against the vector store
- Injects the retrieved document chunks as CONTEXT into the prompt
- Sends the augmented prompt to the LLM
- `topK(3)` — retrieve the top 3 most similar chunks (tradeoff: more chunks = more context, but slower inference).
- `similarityThreshold(0.7)` — only include chunks with ≥70% similarity (filters out irrelevant noise).
MultipartFile → Apache Tika → TikaDocumentReader → TextSplitter → VectorStore
- `TikaDocumentReader` — Spring AI's integration with Apache Tika. Reads any supported file format (PDF, DOCX, TXT, HTML) and produces `Document` objects.
- `TokenTextSplitter` — splits documents into chunks by token count, respecting sentence boundaries.
- `VectorStore.accept()` — embeds chunks using the configured embedding model and stores them in PGVector.
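The pipeline can be sketched in a few lines using Spring AI's `TikaDocumentReader`, `TokenTextSplitter`, and `VectorStore` APIs. Note this is a rough sketch — the variable names and the exact structure of `DataIngestionServiceImpl` are assumptions:

```java
// Rough sketch of the ingestion flow — the real service may be structured differently.
Resource resource = new ByteArrayResource(fileBytes);           // uploaded file contents
List<Document> parsed = new TikaDocumentReader(resource).get(); // Tika extracts the text
List<Document> chunks = textSplitter.apply(parsed);             // split into token-sized chunks
vectorStore.accept(chunks);                                     // embed + store in PGVector
```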
```java
// ChunkingConfig.java
@Bean
public TextSplitter textSplitter() {
    return new TokenTextSplitter(
            chunkSize,             // 300 tokens per chunk
            minChunkSizeChars,     // minimum 100 characters
            minChunkLengthToEmbed, // skip chunks < 50 chars
            maxNumChunks,          // cap at 5000 chunks per document
            keepSeparator,         // preserve sentence boundaries
            List.of('.', '!', '?', '\n') // split on punctuation
    );
}
```

Chunking is critical for RAG quality. Too large = irrelevant context; too small = lost meaning. The values are externalized to `application.properties` so you can tune without recompiling.
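To make the tradeoff concrete, here is a toy, framework-free chunker — not Spring AI's `TokenTextSplitter`, just an illustration — that packs sentences into chunks capped by a rough word count and drops chunks below a minimum character length:

```java
import java.util.ArrayList;
import java.util.List;

// Toy chunker — NOT Spring's TokenTextSplitter; just illustrates the size tradeoff.
public class ToyChunker {

    // Split text at sentence-ending punctuation, then pack sentences into
    // chunks of at most maxWords words, dropping chunks shorter than minChars.
    public static List<String> chunk(String text, int maxWords, int minChars) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int words = 0;
        for (String sentence : text.split("(?<=[.!?])\\s+")) {
            int sentenceWords = sentence.split("\\s+").length;
            if (words + sentenceWords > maxWords && current.length() > 0) {
                if (current.length() >= minChars) chunks.add(current.toString().trim());
                current.setLength(0);
                words = 0;
            }
            current.append(sentence).append(' ');
            words += sentenceWords;
        }
        if (current.length() >= minChars) chunks.add(current.toString().trim());
        return chunks;
    }

    public static void main(String[] args) {
        String text = "Spring AI wires the RAG pipeline. Chunking controls retrieval quality. "
                + "Small chunks lose meaning. Large chunks dilute relevance.";
        System.out.println(chunk(text, 8, 10));
    }
}
```

Raising `maxWords` here behaves like raising `chunk-size`: fewer, fatter chunks, each more likely to carry text unrelated to a given query.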
```properties
spring.ai.vectorstore.pgvector.initialize-schema=true
```

Spring AI auto-creates the `vector_store` table in PostgreSQL with the pgvector extension. No manual SQL needed.
```properties
spring.ai.ollama.base-url=http://localhost:11434
spring.ai.ollama.chat.options.model=llama3.2:1b
spring.ai.ollama.chat.options.num-ctx=2048
spring.ai.ollama.chat.options.temperature=0.1
spring.ai.ollama.init.pull-model-strategy=when_missing
```

`model` — any model available on Ollama's model library works. Just change the value:

- `llama3.2:1b` — lightweight, fast on CPU (~4s responses)
- `llama3.2:3b` — better quality, still runs on most machines
- `llama3.1:8b` — high quality, needs ~8GB RAM
- `mistral:7b` — strong general-purpose alternative
- `gemma2:9b` — Google's model, good at instruction following
- `phi3:mini` — Microsoft's compact model

`pull-model-strategy=when_missing` — automatically downloads the chosen model on first run.

`num-ctx=2048` — context window size (tokens). Larger = can process more context, but slower.

`temperature=0.1` — low temperature for factual, grounded answers (less creative, more accurate).
File ingestion can take minutes for large documents. The upload endpoint returns a job ID immediately while processing continues in the background:
```java
// AsyncIngestionProcessor.java — separate @Component bean
@Async
public void process(String jobId, byte[][] fileBytes, String[] fileNames, Map<String, JobStatus> jobs) {
    // parse, chunk, embed — runs on a separate thread
}
```

**Important:** Spring's `@Async` uses proxy-based AOP. Calling an `@Async` method from within the same class bypasses the proxy and runs synchronously. That's why the async logic lives in a separate `@Component` bean.
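The pattern looks roughly like this — class names below are hypothetical, not the project's actual classes:

```java
// Hypothetical illustration of the self-invocation pitfall.
@Service
class IngestionService {

    private final AsyncWorker worker;

    IngestionService(AsyncWorker worker) {
        this.worker = worker;
    }

    // WRONG: an @Async method declared on THIS class and called as
    // this.doWork(jobId) would bypass the Spring proxy and run synchronously.

    public void start(String jobId) {
        worker.doWork(jobId); // RIGHT: the call goes through the injected proxy bean
    }
}

@Component
class AsyncWorker {
    @Async
    public void doWork(String jobId) {
        // heavy processing runs on Spring's task executor thread
    }
}
```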
```xml
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-docker-compose</artifactId>
</dependency>
```

This is one of the most powerful features in this project. You don't need to run `docker compose up` manually. When you start the Spring Boot application with `./mvnw spring-boot:run`, Spring Boot:

- Detects `compose.yaml` in the project root
- Automatically runs `docker compose up` to start Ollama and PGVector
- Reads the container connection details (ports, credentials)
- Auto-configures the datasource, vector store, and Ollama base URL

When the application shuts down, it also stops the Docker containers. The entire infrastructure lifecycle is managed by Spring Boot — zero manual Docker commands needed for development.
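For reference, a minimal `compose.yaml` driving this setup might look like the following. This is a sketch: the service names, image tags, ports, and credentials here are assumptions, and the project's actual file may differ:

```yaml
# Hypothetical minimal compose.yaml — the project's actual file may differ.
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
  pgvector:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_DB: ragdb
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
    ports:
      - "5432:5432"
```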
```
├── compose.yaml                      # Auto-started by Spring Boot on app launch
├── pom.xml                           # Maven dependencies (Spring AI 2.0, Tika, PGVector)
├── src/main/java/ai/assistant/bot/
│   ├── BotApplication.java           # Spring Boot entry point
│   ├── config/
│   │   ├── AsyncConfig.java          # Enables @Async support
│   │   └── ChunkingConfig.java       # TokenTextSplitter bean configuration
│   ├── controller/
│   │   ├── ChatController.java       # POST /api/chat — RAG chat endpoint
│   │   └── FileUploadController.java # POST /api/upload — async file ingestion
│   ├── model/
│   │   └── JobStatus.java            # Java record for ingestion job tracking
│   └── service/
│       ├── DataIngestionService.java     # Interface
│       ├── DataIngestionServiceImpl.java # Orchestrates upload + async handoff
│       └── AsyncIngestionProcessor.java  # @Async document processing
├── src/main/resources/
│   └── application.properties        # All configuration (Ollama, PGVector, chunking)
└── ContextAI/                        # React frontend
    ├── Dockerfile                    # Multi-stage build (Node → Nginx)
    ├── nginx.conf                    # SPA routing + API proxy
    ├── src/
    │   ├── App.jsx                   # State management + routing
    │   ├── components/
    │   │   ├── navbar/               # Navigation bar
    │   │   ├── hero/                 # Homepage hero panel
    │   │   ├── upload/               # File upload widget
    │   │   ├── status/               # Ingestion job status cards
    │   │   └── chat/                 # Chat interface
    │   └── pages/
    │       ├── HomePage.jsx          # Landing page with app description
    │       ├── UploadPage.jsx        # Document upload + status tracking
    │       └── ChatPage.jsx          # Chat with your documents
    └── vite.config.js                # Dev server proxy to backend
```
- Java 25+ (Amazon Corretto or any JDK)
- Docker & Docker Compose (for Ollama and PGVector)
- Node.js 20+ (for frontend development)
- Maven (or use the included `mvnw` wrapper)
```bash
git clone https://github.com/Siddharthpratapsingh/ContextAI-SpringAI.git
cd ContextAI-SpringAI
./mvnw spring-boot:run
```

This automatically:

- Detects `compose.yaml` and runs `docker compose up` (Ollama + PGVector)
- Auto-configures datasource and Ollama connections from the running containers
- Downloads the `llama3.2:1b` model if missing
- Creates the vector store schema in PostgreSQL
- Starts the API server on `http://localhost:8080`

Note: You don't need to run `docker compose up` separately. Spring Boot manages the entire Docker lifecycle.
```bash
cd ContextAI
npm install
npm run dev
```

The frontend runs at `http://localhost:5173`, with API calls proxied to the backend.
```bash
docker compose up --build frontend
```

The frontend is then available at `http://localhost:3000`. Note that the backend already manages the Ollama and PGVector containers via Spring Boot's Docker Compose integration — this command is only needed if you want to run the frontend in a container instead of using `npm run dev`.
| Method | Endpoint | Description |
|---|---|---|
| POST | `/api/upload` | Upload files (multipart). Returns a job ID immediately. |
| GET | `/api/ingestion/status/{jobId}` | Check ingestion job status (PROCESSING / COMPLETED / FAILED). |
| DELETE | `/api/ingestion/status/{jobId}` | Remove a completed job from tracking. |
| POST | `/api/chat` | Send a question (plain text body). Returns a RAG-grounded answer. |
```bash
curl -X POST http://localhost:8080/api/upload \
  -F "file=@my-document.pdf"
```

Response:

```json
{"jobId": "a1b2c3d4-...", "status": "PROCESSING", "message": "Ingestion in progress"}
```

```bash
curl -X POST http://localhost:8080/api/chat \
  -H "Content-Type: text/plain" \
  -d "What are the key points in the uploaded document?"
```

All configuration lives in `src/main/resources/application.properties`:
```properties
# Ollama — swap the model to any from https://ollama.com/library
spring.ai.ollama.base-url=http://localhost:11434
# Model — try: llama3.1:8b, mistral:7b, gemma2:9b
spring.ai.ollama.chat.options.model=llama3.2:1b
# Context window (tokens)
spring.ai.ollama.chat.options.num-ctx=2048
# Lower = more factual
spring.ai.ollama.chat.options.temperature=0.1

# Vector Store
spring.ai.vectorstore.pgvector.initialize-schema=true

# Chunking (tune these for your documents)
# Tokens per chunk
rag.chunking.chunk-size=300
# Minimum characters per chunk
rag.chunking.min-chunk-size-chars=100
# Skip tiny chunks
rag.chunking.min-chunk-length-to-embed=50
# Max chunks per document
rag.chunking.max-num-chunks=5000

# File Upload
spring.servlet.multipart.max-file-size=50MB
spring.servlet.multipart.max-request-size=50MB
```

(In `.properties` files, `#` only starts a comment at the beginning of a line — a trailing `# ...` would become part of the value — so each comment sits above its key.)

| Parameter | Effect | Tradeoff |
|---|---|---|
| `topK` (SearchRequest) | Number of chunks retrieved | More = richer context, but slower LLM inference |
| `similarityThreshold` | Minimum relevance score (0.0–1.0) | Higher = more precise, but may miss relevant chunks |
| `chunk-size` | Tokens per chunk | Larger = more context per chunk, but less precise retrieval |
| `num-ctx` | LLM context window | Larger = can process more chunks, but uses more memory/time |
| `temperature` | LLM creativity | Lower = more factual; higher = more creative |
- **Spring AI makes RAG simple** — `QuestionAnswerAdvisor` handles the entire retrieve-augment-generate pipeline in one line.
- **Ollama runs locally** — no API keys, no cloud costs, full privacy. Swap models by changing one property (`spring.ai.ollama.chat.options.model`). Browse available models at ollama.com/library.
- **`@Async` needs separate beans** — Spring's proxy-based AOP doesn't intercept self-invocations. Always put `@Async` methods in a different `@Component`.
- **Chunk size matters** — too large and retrieval returns irrelevant context; too small and you lose semantic meaning.
- **Small models need simple prompts** — `llama3.2:1b` can't follow complex multi-rule system prompts. Keep instructions short and direct.
- **Docker Compose integration** — Spring Boot auto-detects `compose.yaml`, starts the containers on app launch, auto-configures connections, and stops them on shutdown. No manual `docker compose up` needed.
MIT
Contributions are welcome! Feel free to open issues or submit pull requests.