17 changes: 17 additions & 0 deletions .env.example
@@ -0,0 +1,17 @@
# Copy to .env and fill in your values:
# cp .env.example .env

# Your personal profile name (loads application-<GUIDE_PROFILE>.yml)
# Create your config: cp scripts/user-config/application-user.yml.example scripts/user-config/application-<yourname>.yml
GUIDE_PROFILE=user

# OpenAI API key (required for embeddings and chat)
OPENAI_API_KEY=sk-proj-your-key-here

# Neo4j (optional — defaults shown)
# NEO4J_USERNAME=neo4j
# NEO4J_PASSWORD=brahmsian
# NEO4J_URI=bolt://localhost:7687

# Discord bot token (optional — only needed for Discord integration)
# DISCORD_TOKEN=your-discord-token
6 changes: 6 additions & 0 deletions .gitignore
@@ -1,6 +1,7 @@
# Gradle
.gradle/
**/build/
**/bin/

# MCP resources
embabel-agent-api/src/main/resources/mcp/**
@@ -33,6 +34,11 @@ embabel-agent-api/src/main/resources/mcp/**
.env
.envrc

# Personal application overrides (set GUIDE_PROFILE in .env; default profile is "user")
# Ignore all personal profile files except the checked-in example
scripts/user-config/application-*.yml
!scripts/user-config/application-*.yml.example

# Temporary files
*.tmp
*.bak
2 changes: 2 additions & 0 deletions README.md
@@ -35,6 +35,8 @@ curl -X POST http://localhost:1337/api/v1/data/load-references

To see stats on data, make a GET request or browse to http://localhost:1337/api/v1/data/stats

RAG content storage uses the `ChunkingContentElementRepository` interface from the `embabel-agent-rag-core` library. The default backend is Neo4j via `DrivineStore`. You can plug in other backends by providing a different `ChunkingContentElementRepository` bean.

## Viewing and Deleting Data

Go to the Neo4j Browser at http://localhost:7474/browser/
103 changes: 103 additions & 0 deletions scripts/INGESTION-TESTING.md
@@ -0,0 +1,103 @@
# Testing Guide

## Run all tests

```bash
./mvnw test
```

Runs all 97 tests (unit + integration). Integration tests use Testcontainers to spin up Neo4j automatically — no local Neo4j needed.

## Run specific test classes

```bash
# Single class
./mvnw test -Dtest=IngestionResultTest

# Multiple classes
./mvnw test -Dtest="IngestionResultTest,IngestionRunnerTest,DataManagerControllerTest"

# Single method
./mvnw test -Dtest="IngestionRunnerTest#summary banner contains URL results"
```

## Test coverage by area

### Ingestion pipeline (new)

| Test class | Type | What it covers |
|---|---|---|
| `IngestionResultTest` | Unit | `IngestionResult` record: totals, `hasFailures()`, duration |
| `IngestionRunnerTest` | Unit | `IngestionRunner`: calls `loadReferences`, prints banner with URLs/dirs/stats/port, `formatDuration` |
| `DataManagerControllerTest` | Unit | REST endpoints: `GET /stats`, `POST /load-references` returns `IngestionResult` |
| `DataManagerLoadReferencesIntegrationTest` | Integration | Full pipeline: DataManager → Neo4j. Ingests sample directory, verifies structured result + documents/chunks in store |

Run just these:

```bash
./mvnw test -Dtest="IngestionResultTest,IngestionRunnerTest,DataManagerControllerTest,DataManagerLoadReferencesIntegrationTest"
```

### Other test areas

| Test class | Type | What it covers |
|---|---|---|
| `GuidePropertiesPathResolutionTest` | Unit | Path resolution (`~/`, absolute, relative) |
| `HubApiControllerTest` | Integration | Hub REST API (register, login, sessions, JWT) |
| `HubServiceTest` | Integration | User registration validation |
| `DrivineGuideUserRepositoryTest` | Integration | Neo4j user repository (Drivine) |
| `GuideUserRepositoryDefaultImplTest` | Integration | Neo4j user repository (GraphView) |
| `GuideUserServiceTest` | Integration | Anonymous web user service |
| `McpSecurityTest` | Integration | MCP endpoints are publicly accessible |

## Using local Neo4j (faster iteration)

By default, tests use Testcontainers (slower startup, fully isolated). For faster runs during development:

1. Start Neo4j:

```bash
docker compose up neo4j -d
```

2. Run tests with local Neo4j:

```bash
USE_LOCAL_NEO4J=true ./mvnw test
```

## Manual testing of fresh-ingest.sh

To test the full ingestion flow end-to-end:

1. Set up your `.env` and personal profile (see `scripts/README.md`)
2. Run:

```bash
./scripts/fresh-ingest.sh
```

3. Watch for the **INGESTION COMPLETE** banner with:
- Time elapsed
- Loaded/failed URLs
- Ingested/failed directories
- RAG store stats (documents, chunks, elements)
- Port and MCP endpoint

4. Verify the REST API:

```bash
# Stats
curl http://localhost:1337/api/v1/data/stats

# Trigger ingestion manually (returns JSON IngestionResult)
curl -X POST http://localhost:1337/api/v1/data/load-references
```

5. Verify MCP:

```bash
curl -i --max-time 3 http://localhost:1337/sse
```

Should return `Content-Type: text/event-stream`.
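
The header check in step 5 can also be scripted. The following is a minimal sketch, not part of the PR: `is_event_stream` is a hypothetical helper name, and it only inspects whatever response headers are piped into it.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): reads HTTP response
# headers on stdin and succeeds if Content-Type is text/event-stream.
is_event_stream() {
    # -q: quiet, -i: header names and values are matched case-insensitively
    grep -qi '^content-type:[[:space:]]*text/event-stream'
}
```

Assuming the app is running on the default port, it could be used as `curl -s -D - -o /dev/null --max-time 3 http://localhost:1337/sse | is_event_stream && echo "SSE endpoint OK"`. Because the stream stays open, `curl` is expected to hit the timeout; the pipeline's status comes from the helper, not from `curl`.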
44 changes: 43 additions & 1 deletion scripts/README.md
@@ -1,3 +1,45 @@
# Shell scripts

- `shell.sh` runs the application in interactive shell mode.
| Script | Purpose |
|---|---|
| `fresh-ingest.sh` | Wipes Neo4j RAG data and re-ingests everything from scratch. Use for first-time setup or when you want a clean slate. |
| `append-ingest.sh` | Re-ingests without clearing existing data. Use when you've added new URLs or directories. Comment out already-ingested items in your profile to avoid re-processing them. |
| `shell.sh` | Runs the application in interactive shell mode. |

Both ingestion scripts start Neo4j in Docker, load your personal profile, and print an **INGESTION COMPLETE** banner when done.

## Personal profiles

Both ingestion scripts read `GUIDE_PROFILE` from `.env` (default: `user`).
Each developer can have their own Spring profile:

```bash
cp scripts/user-config/application-user.yml.example scripts/user-config/application-yourname.yml
# Edit to taste, then:
echo 'GUIDE_PROFILE=yourname' >> .env
./scripts/fresh-ingest.sh
```

This loads `application-yourname.yml` with your URLs, directories, and settings.
See `scripts/user-config/README.md` for full details.
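
As a rough illustration of the lookup described above, the sketch below mirrors what the ingestion scripts do with `.env` and `GUIDE_PROFILE`. The function name `resolve_profile_file` is hypothetical and exists only for this example; the scripts themselves inline this logic.

```bash
#!/usr/bin/env bash
# Hypothetical helper: load an env file (if present) and print the
# profile file name that the ingestion scripts would end up using.
resolve_profile_file() {
    local env_file="${1:-.env}"
    if [ -f "$env_file" ]; then
        set -a            # export every variable assigned while sourcing
        . "$env_file"
        set +a
    fi
    # Fall back to the default profile name "user" when unset or empty
    GUIDE_PROFILE="${GUIDE_PROFILE:-user}"
    echo "scripts/user-config/application-${GUIDE_PROFILE}.yml"
}
```

Running `resolve_profile_file` from the repository root prints the path of the profile file that should exist before you start an ingestion script.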

## Using append-ingest.sh

Since `append-ingest.sh` doesn't clear the store, you should comment out URLs and directories that are already ingested in your profile to avoid re-processing them. For example:

```yaml
guide:
  urls:
    # - https://docs.embabel.com/embabel-agent/guide/0.3.5-SNAPSHOT/ # already ingested
    - https://some-new-url.com # new, will be ingested
  directories:
    # - ~/github/jmjava/guide # already ingested
    - ~/github/jmjava/new-repo # new, will be ingested
```

Then run `./scripts/append-ingest.sh`. The new content is added alongside existing data in Neo4j.

## Tips

- **If ingestion seems stuck** on a URL: the thread is blocked on fetch -> parse -> embed. Try lowering `embedding-batch-size` to 20, or temporarily remove the slow URL.
- **Speed up ingestion**: increase `embedding-batch-size` (default 50) or `max-chunk-size` (default 4000).
73 changes: 73 additions & 0 deletions scripts/append-ingest.sh
@@ -0,0 +1,73 @@
#!/usr/bin/env bash
# Re-ingest content WITHOUT clearing Neo4j first.
# Existing RAG data is kept; new/updated content is added on top.
# IngestionRunner prints the summary when done.
#
# Set GUIDE_PROFILE in .env to use your own profile (default: "user").
# e.g. GUIDE_PROFILE=menke → loads application-menke.yml
set -e

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
GUIDE_ROOT="$(dirname "$SCRIPT_DIR")"
cd "$GUIDE_ROOT"

if [ -f .env ]; then
    echo "Loading .env..."
    set -a
    source .env
    set +a
fi

GUIDE_PORT="${GUIDE_PORT:-1337}"
EXISTING_PID=$(lsof -ti :"$GUIDE_PORT" 2>/dev/null | head -1)
if [ -n "$EXISTING_PID" ]; then
    echo "Killing existing process on port $GUIDE_PORT (PID $EXISTING_PID)..."
    kill "$EXISTING_PID" 2>/dev/null || true
    sleep 1
    kill -9 "$EXISTING_PID" 2>/dev/null || true
    sleep 1
fi

echo "Ensuring Neo4j is up (Docker)..."
docker compose up neo4j -d

NEO4J_BOLT_PORT="${NEO4J_BOLT_PORT:-7687}"
echo "Waiting for Neo4j on port $NEO4J_BOLT_PORT..."
max_wait=60
elapsed=0
while [ $elapsed -lt $max_wait ]; do
    if docker exec embabel-neo4j cypher-shell -u "${NEO4J_USERNAME:-neo4j}" -p "${NEO4J_PASSWORD:-brahmsian}" "RETURN 1" >/dev/null 2>&1; then
        echo "Neo4j is ready."
        break
    fi
    sleep 3
    elapsed=$((elapsed + 3))
    echo " ... ${elapsed}s"
done
if [ $elapsed -ge $max_wait ]; then
    echo "Neo4j did not become ready in time."
    exit 1
fi

echo "Keeping existing RAG data (append mode)."

GUIDE_PROFILE="${GUIDE_PROFILE:-user}"
export SPRING_PROFILES_ACTIVE="local,${GUIDE_PROFILE}"
export NEO4J_URI="${NEO4J_URI:-bolt://localhost:${NEO4J_BOLT_PORT}}"
export NEO4J_HOST="${NEO4J_HOST:-localhost}"

# Force ingestion on startup (IngestionRunner prints the summary)
export GUIDE_RELOADCONTENTONSTARTUP=true

echo ""
echo "Starting Guide with profiles: $SPRING_PROFILES_ACTIVE"
echo "Neo4j: $NEO4J_URI"
echo ""
echo "Ingestion will append to existing data."
echo "Watch for the INGESTION COMPLETE banner."
echo "Press Ctrl+C to stop."
echo ""

# Run in foreground so Ctrl+C kills it directly
# Include scripts/user-config/ so Spring Boot finds personal profile files
./mvnw -DskipTests spring-boot:run -Dspring-boot.run.arguments="--spring.config.additional-location=file:./scripts/user-config/"
75 changes: 75 additions & 0 deletions scripts/fresh-ingest.sh
@@ -0,0 +1,75 @@
#!/usr/bin/env bash
# Wipe Neo4j RAG data and re-ingest everything from scratch.
# Starts Neo4j (Docker), clears all ContentElement nodes, then runs Guide
# with reload-content-on-startup=true. IngestionRunner prints the summary.
#
# Set GUIDE_PROFILE in .env to use your own profile (default: "user").
# e.g. GUIDE_PROFILE=menke → loads application-menke.yml
set -e

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
GUIDE_ROOT="$(dirname "$SCRIPT_DIR")"
cd "$GUIDE_ROOT"

if [ -f .env ]; then
    echo "Loading .env..."
    set -a
    source .env
    set +a
fi

GUIDE_PORT="${GUIDE_PORT:-1337}"
EXISTING_PID=$(lsof -ti :"$GUIDE_PORT" 2>/dev/null | head -1)
if [ -n "$EXISTING_PID" ]; then
    echo "Killing existing process on port $GUIDE_PORT (PID $EXISTING_PID)..."
    kill "$EXISTING_PID" 2>/dev/null || true
    sleep 1
    kill -9 "$EXISTING_PID" 2>/dev/null || true
    sleep 1
fi

echo "Ensuring Neo4j is up (Docker)..."
docker compose up neo4j -d

NEO4J_BOLT_PORT="${NEO4J_BOLT_PORT:-7687}"
echo "Waiting for Neo4j on port $NEO4J_BOLT_PORT..."
max_wait=60
elapsed=0
while [ $elapsed -lt $max_wait ]; do
    if docker exec embabel-neo4j cypher-shell -u "${NEO4J_USERNAME:-neo4j}" -p "${NEO4J_PASSWORD:-brahmsian}" "RETURN 1" >/dev/null 2>&1; then
        echo "Neo4j is ready."
        break
    fi
    sleep 3
    elapsed=$((elapsed + 3))
    echo " ... ${elapsed}s"
done
if [ $elapsed -ge $max_wait ]; then
    echo "Neo4j did not become ready in time."
    exit 1
fi

echo "Clearing RAG content in Neo4j (ContentElement nodes)..."
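# Report the real outcome instead of always claiming success; a failed clear is still non-fatal.
if docker exec embabel-neo4j cypher-shell -u "${NEO4J_USERNAME:-neo4j}" -p "${NEO4J_PASSWORD:-brahmsian}" "MATCH (c:ContentElement) DETACH DELETE c" 2>/dev/null; then
    echo "RAG content cleared."
else
    echo "Warning: could not clear RAG content; continuing with existing data." >&2
fi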

GUIDE_PROFILE="${GUIDE_PROFILE:-user}"
export SPRING_PROFILES_ACTIVE="local,${GUIDE_PROFILE}"
export NEO4J_URI="${NEO4J_URI:-bolt://localhost:${NEO4J_BOLT_PORT}}"
export NEO4J_HOST="${NEO4J_HOST:-localhost}"

# Force ingestion on startup (IngestionRunner prints the summary)
export GUIDE_RELOADCONTENTONSTARTUP=true

echo ""
echo "Starting Guide with profiles: $SPRING_PROFILES_ACTIVE"
echo "Neo4j: $NEO4J_URI"
echo ""
echo "Ingestion will run automatically on startup."
echo "Watch for the INGESTION COMPLETE banner."
echo "Press Ctrl+C to stop."
echo ""

# Run in foreground so Ctrl+C kills it directly
# Include scripts/user-config/ so Spring Boot finds personal profile files
./mvnw -DskipTests spring-boot:run -Dspring-boot.run.arguments="--spring.config.additional-location=file:./scripts/user-config/"
39 changes: 39 additions & 0 deletions scripts/user-config/README.md
@@ -0,0 +1,39 @@
# Personal config

Each developer can have their own Spring profile with personal settings (URLs, directories, paths, etc.).

## Quick start

```bash
cp scripts/user-config/application-user.yml.example scripts/user-config/application-myname.yml
# Edit to taste, then:
echo 'GUIDE_PROFILE=myname' >> .env
./scripts/fresh-ingest.sh
```

## How it works

- The scripts (`fresh-ingest.sh`, `append-ingest.sh`) read `GUIDE_PROFILE` from `.env` (default: `user`)
- Spring profiles become `local,<GUIDE_PROFILE>` → loads `application-<GUIDE_PROFILE>.yml`
- The scripts pass `--spring.config.additional-location=file:./scripts/user-config/` so Spring picks up profiles from this directory
- Personal profiles in `scripts/user-config/` are gitignored (only the `.example` is checked in)
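
Spring Boot typically starts without complaint when an active profile has no matching file, so a typo in `GUIDE_PROFILE` can go unnoticed. A pre-flight check along these lines may help; `profile_file_exists` is a hypothetical helper, not something the scripts currently provide.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check: succeed only if the profile file for the
# given profile name exists in the given config directory.
profile_file_exists() {
    local dir="$1"
    local profile="${2:-user}"
    local file="${dir%/}/application-${profile}.yml"
    if [ -f "$file" ]; then
        echo "Found $file"
    else
        echo "Missing $file (copy application-user.yml.example to create it)" >&2
        return 1
    fi
}
```

For example, `profile_file_exists scripts/user-config "$GUIDE_PROFILE" || exit 1` could run before the app is launched.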

## Ingestion on startup

The `IngestionRunner` only activates when `guide.reload-content-on-startup` is `true`. The default in `application.yml` is `false`, so normal builds (`./mvnw test`, `./mvnw spring-boot:run`) never trigger ingestion. Only the ingestion scripts set this flag: both `fresh-ingest.sh` and `append-ingest.sh` export `GUIDE_RELOADCONTENTONSTARTUP=true` before launching the app.
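
The environment variable name follows Spring Boot's relaxed binding convention for environment variables: replace dots with underscores, drop dashes, and upper-case the result. The sketch below shows that mapping; `to_env_var` is a hypothetical helper written for this example and is not part of the repository.

```bash
#!/usr/bin/env bash
# Hypothetical helper: convert a Spring property name into the environment
# variable form used by Spring Boot's relaxed binding.
to_env_var() {
    local name="$1"
    name="${name//./_}"   # dots become underscores
    name="${name//-/}"    # dashes are removed
    printf '%s\n' "$name" | tr '[:lower:]' '[:upper:]'
}
```

Calling `to_env_var guide.reload-content-on-startup` yields the variable name the scripts export.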

## Failure recovery

Ingestion is resilient at every level -- a single failure never prevents the remaining items from being processed:

- **URLs**: each URL is ingested independently. If one times out or returns an error, the rest continue.
- **Directories**: each configured directory is ingested independently. A missing or unreadable directory doesn't block others.
- **Documents within a directory**: each file is written to the store individually. A single unparseable file (e.g. corrupt encoding) doesn't skip the remaining files in that directory.

All failures are collected with their source and reason into the `IngestionResult`, which is:
- Printed in the **INGESTION COMPLETE** banner (so you can see what failed and why at a glance)
- Returned as JSON from `POST /api/v1/data/load-references` for programmatic inspection
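
The collect-and-continue behaviour described above can be pictured with a small shell sketch. This is an illustration of the pattern only, not the actual `DataManager` code: `ingest_one` is a hypothetical stand-in that fails for any item containing `bad`, and the `loaded` and `failed` arrays are assumed names.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for the real per-item work; it fails for any
# item whose name contains "bad" so the failure path can be exercised.
ingest_one() {
    case "$1" in
        *bad*) return 1 ;;
        *) return 0 ;;
    esac
}

# Process every item, recording failures instead of stopping at the first one.
ingest_all() {
    loaded=()
    failed=()
    local item
    for item in "$@"; do
        if ingest_one "$item"; then
            loaded+=("$item")
        else
            failed+=("$item")
        fi
    done
}
```

After `ingest_all url-a bad-url url-c`, both healthy items are in `loaded` and only `bad-url` is in `failed`, which is the shape of information the banner and the JSON response are described as reporting.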

## MCP tools

All ingested content -- both URLs and local directories -- is immediately available through the MCP tools (`docs_vectorSearch`, `docs_textSearch`, etc.). The MCP tools and the ingestion pipeline share the same Neo4j store, so there is no separate sync step. Once ingestion completes, MCP clients (Cursor, Claude Desktop, etc.) can search the content right away.