embabel · jasperblues · Feb 28, 2026 · Feb 21, 2026 · Feb 28, 2026
diff --git a/.env.example b/.env.example
@@ -0,0 +1,17 @@
+# Copy to .env and fill in your values:
+#   cp .env.example .env
+
+# Your personal profile name (loads application-<GUIDE_PROFILE>.yml)
+# Create your config: cp scripts/user-config/application-user.yml.example scripts/user-config/application-<yourname>.yml
+GUIDE_PROFILE=user
+
+# OpenAI API key (required for embeddings and chat)
+OPENAI_API_KEY=sk-proj-your-key-here
+
+# Neo4j (optional — defaults shown)
+# NEO4J_USERNAME=neo4j
+# NEO4J_PASSWORD=brahmsian
+# NEO4J_URI=bolt://localhost:7687
+
+# Discord bot token (optional — only needed for Discord integration)
+# DISCORD_TOKEN=your-discord-token
diff --git a/.gitignore b/.gitignore
@@ -1,6 +1,7 @@
 # Gradle
 .gradle/
 **/build/
+**/bin/
 
 # MCP resources
 embabel-agent-api/src/main/resources/mcp/**
@@ -33,6 +34,11 @@ embabel-agent-api/src/main/resources/mcp/**
 .env
 .envrc
 
+# Personal application overrides (set GUIDE_PROFILE in .env; default profile is "user")
+# Ignore all personal profile files except the checked-in example
+scripts/user-config/application-*.yml
+!scripts/user-config/application-*.yml.example
+
 # Temporary files
 *.tmp
 *.bak

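The two `.gitignore` patterns added above can be checked with `git check-ignore`, which exits 0 when a path is ignored. The sketch below is illustrative only: the throwaway repository and the helper name `ignore_status` are assumptions, not part of this PR, and the file names are examples.

```shell
#!/usr/bin/env bash
# Sketch: check which paths the new .gitignore rules hide from git.
set -eu

# Throwaway repository so the real working tree is never touched.
demo_repo="$(mktemp -d)"
trap 'rm -rf "$demo_repo"' EXIT
git init -q "$demo_repo"

# The two patterns added in this diff.
cat > "$demo_repo/.gitignore" <<'EOF'
scripts/user-config/application-*.yml
!scripts/user-config/application-*.yml.example
EOF

# Hypothetical helper: prints "ignored" or "not ignored" for a repo-relative path.
# git check-ignore also works for paths that do not exist on disk.
ignore_status() {
  if git -C "$demo_repo" check-ignore -q "$1"; then
    echo "ignored"
  else
    echo "not ignored"
  fi
}

ignore_status scripts/user-config/application-menke.yml
ignore_status scripts/user-config/application-user.yml.example
```

Because `application-*.yml` only matches names that end in `.yml`, the `.example` file is not matched by the first pattern at all; the negation line acts as an explicit safeguard.
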
diff --git a/README.md b/README.md
@@ -35,6 +35,8 @@ curl -X POST http://localhost:1337/api/v1/data/load-references
 
 To see stats on data, make a GET request or browse to http://localhost:1337/api/v1/data/stats
 
+RAG content storage uses the `ChunkingContentElementRepository` interface from the `embabel-agent-rag-core` library. The default backend is Neo4j via `DrivineStore`. You can plug in other backends by providing a different `ChunkingContentElementRepository` bean.
+
 ## Viewing and Deleting Data
 
 Go to the Neo Browser at http://localhost:7474/browser/

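The stats endpoint mentioned in the README hunk above only answers once Guide has finished starting. A minimal retry helper, assuming a hypothetical function name `wait_until` that is not part of this PR, could poll it in the same style as the Neo4j wait loop in the ingestion scripts:

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the PR.
# wait_until MAX_SECONDS INTERVAL CMD...: retry CMD until it succeeds
# or roughly MAX_SECONDS of sleeping have elapsed.
wait_until() {
  local max_wait="$1" interval="$2" elapsed=0
  shift 2
  until "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$max_wait" ]; then
      return 1  # gave up: CMD never succeeded in time
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 0
}

# Usage (not executed here): wait up to 60s for the stats endpoint.
# wait_until 60 3 curl -fsS http://localhost:1337/api/v1/data/stats
```

The command is passed as separate arguments rather than a string, so no `eval` is needed and quoting stays intact.
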
diff --git a/scripts/INGESTION-TESTING.md b/scripts/INGESTION-TESTING.md
@@ -0,0 +1,103 @@
+# Testing Guide
+
+## Run all tests
+
+```bash
+./mvnw test
+```
+
+Runs all 97 tests (unit + integration). Integration tests use Testcontainers to spin up Neo4j automatically — no local Neo4j needed.
+
+## Run specific test classes
+
+```bash
+# Single class
+./mvnw test -Dtest=IngestionResultTest
+
+# Multiple classes
+./mvnw test -Dtest="IngestionResultTest,IngestionRunnerTest,DataManagerControllerTest"
+
+# Single method
+./mvnw test -Dtest="IngestionRunnerTest#summary banner contains URL results"
+```
+
+## Test coverage by area
+
+### Ingestion pipeline (new)
+
+| Test class | Type | What it covers |
+|---|---|---|
+| `IngestionResultTest` | Unit | `IngestionResult` record: totals, `hasFailures()`, duration |
+| `IngestionRunnerTest` | Unit | `IngestionRunner`: calls `loadReferences`, prints banner with URLs/dirs/stats/port, `formatDuration` |
+| `DataManagerControllerTest` | Unit | REST endpoints: `GET /stats`, `POST /load-references` returns `IngestionResult` |
+| `DataManagerLoadReferencesIntegrationTest` | Integration | Full pipeline: DataManager → Neo4j. Ingests sample directory, verifies structured result + documents/chunks in store |
+
+Run just these:
+
+```bash
+./mvnw test -Dtest="IngestionResultTest,IngestionRunnerTest,DataManagerControllerTest,DataManagerLoadReferencesIntegrationTest"
+```
+
+### Other test areas
+
+| Test class | Type | What it covers |
+|---|---|---|
+| `GuidePropertiesPathResolutionTest` | Unit | Path resolution (`~/`, absolute, relative) |
+| `HubApiControllerTest` | Integration | Hub REST API (register, login, sessions, JWT) |
+| `HubServiceTest` | Integration | User registration validation |
+| `DrivineGuideUserRepositoryTest` | Integration | Neo4j user repository (Drivine) |
+| `GuideUserRepositoryDefaultImplTest` | Integration | Neo4j user repository (GraphView) |
+| `GuideUserServiceTest` | Integration | Anonymous web user service |
+| `McpSecurityTest` | Integration | MCP endpoints are publicly accessible |
+
+## Using local Neo4j (faster iteration)
+
+By default, tests use Testcontainers (slower startup, fully isolated). For faster runs during development:
+
+1. Start Neo4j:
+
+```bash
+docker compose up neo4j -d
+```
+
+2. Run tests with local Neo4j:
+
+```bash
+USE_LOCAL_NEO4J=true ./mvnw test
+```
+
+## Manual testing of fresh-ingest.sh
+
+To test the full ingestion flow end-to-end:
+
+1. Set up your `.env` and personal profile (see `scripts/README.md`)
+2. Run:
+
+```bash
+./scripts/fresh-ingest.sh
+```
+
+3. Watch for the **INGESTION COMPLETE** banner with:
+   - Time elapsed
+   - Loaded/failed URLs
+   - Ingested/failed directories
+   - RAG store stats (documents, chunks, elements)
+   - Port and MCP endpoint
+
+4. Verify the REST API:
+
+```bash
+# Stats
+curl http://localhost:1337/api/v1/data/stats
+
+# Trigger ingestion manually (returns JSON IngestionResult)
+curl -X POST http://localhost:1337/api/v1/data/load-references
+```
+
+5. Verify MCP:
+
+```bash
+curl -i --max-time 3 http://localhost:1337/sse
+```
+
+The response headers should include `Content-Type: text/event-stream`.
diff --git a/scripts/README.md b/scripts/README.md
@@ -1,3 +1,45 @@
 # Shell scripts
 
-- `shell.sh` runs the application in interactive shell mode.
+| Script | Purpose |
+|---|---|
+| `fresh-ingest.sh` | Wipes Neo4j RAG data and re-ingests everything from scratch. Use for first-time setup or when you want a clean slate. |
+| `append-ingest.sh` | Re-ingests without clearing existing data. Use when you've added new URLs or directories. Comment out already-ingested items in your profile to avoid re-processing them. |
+| `shell.sh` | Runs the application in interactive shell mode. |
+
+Both ingestion scripts start Neo4j in Docker, load your personal profile, and print an **INGESTION COMPLETE** banner when done.
+
+## Personal profiles
+
+Both scripts read `GUIDE_PROFILE` from `.env` (default: `user`).
+Each developer can have their own Spring profile:
+
+```bash
+cp scripts/user-config/application-user.yml.example scripts/user-config/application-yourname.yml
+# Edit to taste, then:
+echo 'GUIDE_PROFILE=yourname' >> .env
+./scripts/fresh-ingest.sh
+```
+
+This loads `application-yourname.yml` with your URLs, directories, and settings.
+See `scripts/user-config/README.md` for full details.
+
+## Using append-ingest.sh
+
+Because `append-ingest.sh` doesn't clear the store, comment out in your profile any URLs and directories that have already been ingested, so they are not processed again. For example:
+
+```yaml
+guide:
+  urls:
+    # - https://docs.embabel.com/embabel-agent/guide/0.3.5-SNAPSHOT/  # already ingested
+    - https://some-new-url.com  # new, will be ingested
+  directories:
+    # - ~/github/jmjava/guide  # already ingested
+    - ~/github/jmjava/new-repo  # new, will be ingested
+```
+
+Then run `./scripts/append-ingest.sh`. The new content is added alongside existing data in Neo4j.
+
+## Tips
+
+- **If ingestion seems stuck** on a URL: the thread is blocked on fetch -> parse -> embed. Try lowering `embedding-batch-size` to 20, or temporarily remove the slow URL.
+- **Speed up ingestion**: increase `embedding-batch-size` (default 50) or `max-chunk-size` (default 4000).
diff --git a/scripts/append-ingest.sh b/scripts/append-ingest.sh
@@ -0,0 +1,73 @@
+#!/usr/bin/env bash
+# Re-ingest content WITHOUT clearing Neo4j first.
+# Existing RAG data is kept; new/updated content is added on top.
+# IngestionRunner prints the summary when done.
+#
+# Set GUIDE_PROFILE in .env to use your own profile (default: "user").
+# e.g. GUIDE_PROFILE=menke → loads application-menke.yml
+set -e
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+GUIDE_ROOT="$(dirname "$SCRIPT_DIR")"
+cd "$GUIDE_ROOT"
+
+if [ -f .env ]; then
+  echo "Loading .env..."
+  set -a
+  source .env
+  set +a
+fi
+
+GUIDE_PORT="${GUIDE_PORT:-1337}"
+EXISTING_PID=$(lsof -ti :"$GUIDE_PORT" 2>/dev/null | head -1)
+if [ -n "$EXISTING_PID" ]; then
+  echo "Killing existing process on port $GUIDE_PORT (PID $EXISTING_PID)..."
+  kill "$EXISTING_PID" 2>/dev/null || true
+  sleep 1
+  kill -9 "$EXISTING_PID" 2>/dev/null || true
+  sleep 1
+fi
+
+echo "Ensuring Neo4j is up (Docker)..."
+docker compose up neo4j -d
+
+NEO4J_BOLT_PORT="${NEO4J_BOLT_PORT:-7687}"
+echo "Waiting for Neo4j on port $NEO4J_BOLT_PORT..."
+max_wait=60
+elapsed=0
+while [ $elapsed -lt $max_wait ]; do
+  if docker exec embabel-neo4j cypher-shell -u "${NEO4J_USERNAME:-neo4j}" -p "${NEO4J_PASSWORD:-brahmsian}" "RETURN 1" >/dev/null 2>&1; then
+    echo "Neo4j is ready."
+    break
+  fi
+  sleep 3
+  elapsed=$((elapsed + 3))
+  echo "  ... ${elapsed}s"
+done
+if [ $elapsed -ge $max_wait ]; then
+  echo "Neo4j did not become ready in time."
+  exit 1
+fi
+
+echo "Keeping existing RAG data (append mode)."
+
+GUIDE_PROFILE="${GUIDE_PROFILE:-user}"
+export SPRING_PROFILES_ACTIVE="local,${GUIDE_PROFILE}"
+export NEO4J_URI="${NEO4J_URI:-bolt://localhost:${NEO4J_BOLT_PORT}}"
+export NEO4J_HOST="${NEO4J_HOST:-localhost}"
+
+# Force ingestion on startup (IngestionRunner prints the summary)
+export GUIDE_RELOADCONTENTONSTARTUP=true
+
+echo ""
+echo "Starting Guide with profiles: $SPRING_PROFILES_ACTIVE"
+echo "Neo4j: $NEO4J_URI"
+echo ""
+echo "Ingestion will append to existing data."
+echo "Watch for the INGESTION COMPLETE banner."
+echo "Press Ctrl+C to stop."
+echo ""
+
+# Run in foreground so Ctrl+C kills it directly
+# Include scripts/user-config/ so Spring Boot finds personal profile files
+./mvnw -DskipTests spring-boot:run -Dspring-boot.run.arguments="--spring.config.additional-location=file:./scripts/user-config/"
diff --git a/scripts/fresh-ingest.sh b/scripts/fresh-ingest.sh
@@ -0,0 +1,75 @@
+#!/usr/bin/env bash
+# Wipe Neo4j RAG data and re-ingest everything from scratch.
+# Starts Neo4j (Docker), clears all ContentElement nodes, then runs Guide
+# with reload-content-on-startup=true. IngestionRunner prints the summary.
+#
+# Set GUIDE_PROFILE in .env to use your own profile (default: "user").
+# e.g. GUIDE_PROFILE=menke → loads application-menke.yml
+set -e
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+GUIDE_ROOT="$(dirname "$SCRIPT_DIR")"
+cd "$GUIDE_ROOT"
+
+if [ -f .env ]; then
+  echo "Loading .env..."
+  set -a
+  source .env
+  set +a
+fi
+
+GUIDE_PORT="${GUIDE_PORT:-1337}"
+EXISTING_PID=$(lsof -ti :"$GUIDE_PORT" 2>/dev/null | head -1)
+if [ -n "$EXISTING_PID" ]; then
+  echo "Killing existing process on port $GUIDE_PORT (PID $EXISTING_PID)..."
+  kill "$EXISTING_PID" 2>/dev/null || true
+  sleep 1
+  kill -9 "$EXISTING_PID" 2>/dev/null || true
+  sleep 1
+fi
+
+echo "Ensuring Neo4j is up (Docker)..."
+docker compose up neo4j -d
+
+NEO4J_BOLT_PORT="${NEO4J_BOLT_PORT:-7687}"
+echo "Waiting for Neo4j on port $NEO4J_BOLT_PORT..."
+max_wait=60
+elapsed=0
+while [ $elapsed -lt $max_wait ]; do
+  if docker exec embabel-neo4j cypher-shell -u "${NEO4J_USERNAME:-neo4j}" -p "${NEO4J_PASSWORD:-brahmsian}" "RETURN 1" >/dev/null 2>&1; then
+    echo "Neo4j is ready."
+    break
+  fi
+  sleep 3
+  elapsed=$((elapsed + 3))
+  echo "  ... ${elapsed}s"
+done
+if [ $elapsed -ge $max_wait ]; then
+  echo "Neo4j did not become ready in time."
+  exit 1
+fi
+
+echo "Clearing RAG content in Neo4j (ContentElement nodes)..."
+docker exec embabel-neo4j cypher-shell -u "${NEO4J_USERNAME:-neo4j}" -p "${NEO4J_PASSWORD:-brahmsian}" "MATCH (c:ContentElement) DETACH DELETE c" 2>/dev/null \
+  && echo "RAG content cleared." || echo "Warning: could not clear RAG content; continuing with existing data." >&2
+
+GUIDE_PROFILE="${GUIDE_PROFILE:-user}"
+export SPRING_PROFILES_ACTIVE="local,${GUIDE_PROFILE}"
+export NEO4J_URI="${NEO4J_URI:-bolt://localhost:${NEO4J_BOLT_PORT}}"
+export NEO4J_HOST="${NEO4J_HOST:-localhost}"
+
+# Force ingestion on startup (IngestionRunner prints the summary)
+export GUIDE_RELOADCONTENTONSTARTUP=true
+
+echo ""
+echo "Starting Guide with profiles: $SPRING_PROFILES_ACTIVE"
+echo "Neo4j: $NEO4J_URI"
+echo ""
+echo "Ingestion will run automatically on startup."
+echo "Watch for the INGESTION COMPLETE banner."
+echo "Press Ctrl+C to stop."
+echo ""
+
+# Run in foreground so Ctrl+C kills it directly
+# Include scripts/user-config/ so Spring Boot finds personal profile files
+./mvnw -DskipTests spring-boot:run -Dspring-boot.run.arguments="--spring.config.additional-location=file:./scripts/user-config/"
diff --git a/scripts/user-config/README.md b/scripts/user-config/README.md
@@ -0,0 +1,39 @@
+# Personal config
+
+Each developer can have their own Spring profile with personal settings (URLs, directories, paths, etc.).
+
+## Quick start
+
+```bash
+cp scripts/user-config/application-user.yml.example scripts/user-config/application-myname.yml
+# Edit to taste, then:
+echo 'GUIDE_PROFILE=myname' >> .env
+./scripts/fresh-ingest.sh
+```
+
+## How it works
+
+- The scripts (`fresh-ingest.sh`, `append-ingest.sh`) read `GUIDE_PROFILE` from `.env` (default: `user`)
+- Spring profiles become `local,<GUIDE_PROFILE>` → loads `application-<GUIDE_PROFILE>.yml`
+- The scripts pass `--spring.config.additional-location=file:./scripts/user-config/` so Spring picks up profiles from this directory
+- Personal profiles in `scripts/user-config/` are gitignored (only the `.example` is checked in)
+
+## Ingestion on startup
+
+The `IngestionRunner` only activates when `guide.reload-content-on-startup` is `true`. The default in `application.yml` is `false`, so normal runs (`./mvnw test`, `./mvnw spring-boot:run`) never trigger ingestion. Only the scripts set this flag: both `fresh-ingest.sh` and `append-ingest.sh` export `GUIDE_RELOADCONTENTONSTARTUP=true` before launching the app.
+
+## Failure recovery
+
+Ingestion is resilient at every level -- a single failure never prevents the remaining items from being processed:
+
+- **URLs**: each URL is ingested independently. If one times out or returns an error, the rest continue.
+- **Directories**: each configured directory is ingested independently. A missing or unreadable directory doesn't block others.
+- **Documents within a directory**: each file is written to the store individually. A single unparseable file (e.g. corrupt encoding) doesn't cause the remaining files in that directory to be skipped.
+
+All failures are collected with their source and reason into the `IngestionResult`, which is:
+- Printed in the **INGESTION COMPLETE** banner (so you can see what failed and why at a glance)
+- Returned as JSON from `POST /api/v1/data/load-references` for programmatic inspection
+
+## MCP tools
+
+All ingested content -- both URLs and local directories -- is immediately available through the MCP tools (`docs_vectorSearch`, `docs_textSearch`, etc.). The MCP tools and the ingestion pipeline share the same Neo4j store, so there is no separate sync step. Once ingestion completes, MCP clients (Cursor, Claude Desktop, etc.) can search the content right away.
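
The profile resolution described in `scripts/user-config/README.md` can be sketched as a small preflight check. The helper names `profile_config_path`, `spring_profiles_for`, and `check_profile` are hypothetical and not part of this PR; the path layout, the `local,<GUIDE_PROFILE>` profile list, and the `user` default are taken from the scripts above.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helpers; not part of the PR.

# Path of the personal profile file for a given GUIDE_PROFILE (default: user).
profile_config_path() {
  local profile="${1:-user}"
  echo "scripts/user-config/application-${profile}.yml"
}

# Spring profile list the ingestion scripts export as SPRING_PROFILES_ACTIVE.
spring_profiles_for() {
  local profile="${1:-user}"
  echo "local,${profile}"
}

# Returns non-zero with a hint on stderr when the profile file is missing.
check_profile() {
  local path
  path="$(profile_config_path "${1:-user}")"
  if [ ! -f "$path" ]; then
    echo "Missing $path; copy scripts/user-config/application-user.yml.example to create it." >&2
    return 1
  fi
}

# Usage (run from the repository root, before launching an ingestion script):
# check_profile "${GUIDE_PROFILE:-user}" && ./scripts/fresh-ingest.sh
```

Running such a check first would surface a mistyped `GUIDE_PROFILE` before Neo4j and the application are started.
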