The goal of this repository is to handle the creation of node descriptions and embeddings for graphs generated by pykagcee.
- Neo4j DBMS (local or remote). We recommend using the Neo4j Desktop application due to its better performance.
- uv tool to manage the Python virtual environment and dependencies.
- Chat model API to generate the descriptions.
- Embedding model API to generate the embeddings.
Create your .env. You can use the .env.example file as a template.
cp .env.example .env
Set your Neo4j connection details in the .env file. Note that you must first create a knowledge graph
with pykagcee.
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your_password
For the chat and embedding models, the project comes with the langchain-openai integration.
If you use a provider other than OpenAI, add its integration package with the uv add langchain-{provider} command.
See the LangChain list of available providers.
CHAT_PROVIDER=openai
CHAT_MODEL=gpt-4.1-nano
CHAT_API_KEY=sk-proj-fakekey123
EMBEDDING_PROVIDER=openai
EMBEDDING_MODEL=text-embedding-3-small
EMBEDDING_API_KEY=sk-proj-fakekey123
If you are serving an OpenAI-compatible API (e.g., with vLLM),
you can set the CHAT_BASE_URL and EMBEDDING_BASE_URL variables, keeping the provider as openai.
CHAT_PROVIDER=openai
CHAT_BASE_URL=https://example-vllm-openai-compatible-serve.test/v1
EMBEDDING_PROVIDER=openai
EMBEDDING_BASE_URL=https://example-vllm-openai-compatible-serve.test/v1
You will need to set the CHAT_MAX_CONTEXT variable to avoid exceeding the model's context length when generating
the descriptions.
CHAT_MAX_CONTEXT=3000
By default, when generating the description for a symbol, we include n random related symbols
to provide more context to the model. You can configure how many related symbols to include
by setting the MAX_RELATION_CONTEXT variable in the .env file. The default value is 3.
MAX_RELATION_CONTEXT=3
Create the environment and install the dependencies:
uv sync
To generate descriptions for a single project, use the describe command:
uv run pyastran describe /path/to/single/project --max-concurrent-queries 50
The optional --max-concurrent-queries parameter determines how many nodes are described in parallel. The default is 100.
To generate descriptions for multiple projects under a directory, use the describe --all command:
uv run pyastran describe --all /path/to/multiple/projects --max-concurrent-queries 50 --max-concurrent-tasks 2
The optional --max-concurrent-tasks parameter determines how many projects are processed in parallel. The default is 1.
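The two concurrency limits can be pictured as two nested semaphores. The sketch below is only an illustration of that idea and is not pyastran's actual implementation: describe_node, describe_project, and describe_all are hypothetical names, the chat-model call is simulated with a short sleep, and the query limit is assumed to apply per project rather than globally.

```python
import asyncio


async def describe_node(node_id: str, query_slots: asyncio.Semaphore) -> str:
    """Hypothetical stand-in for one chat-model request."""
    async with query_slots:  # at most max_concurrent_queries requests in flight
        await asyncio.sleep(0.01)  # simulates the latency of the model call
        return f"description of {node_id}"


async def describe_project(
    node_ids: list[str],
    task_slots: asyncio.Semaphore,
    max_concurrent_queries: int,
) -> dict[str, str]:
    """Describe every node of one project while holding a project slot."""
    async with task_slots:  # at most max_concurrent_tasks projects at once
        # Assumption: each project gets its own query budget.
        query_slots = asyncio.Semaphore(max_concurrent_queries)
        results = await asyncio.gather(
            *(describe_node(node_id, query_slots) for node_id in node_ids)
        )
        return dict(zip(node_ids, results))


async def describe_all(
    projects: dict[str, list[str]],
    max_concurrent_queries: int = 100,
    max_concurrent_tasks: int = 1,
) -> dict[str, dict[str, str]]:
    """Process several projects, bounded by both limits."""
    task_slots = asyncio.Semaphore(max_concurrent_tasks)
    per_project = await asyncio.gather(
        *(
            describe_project(node_ids, task_slots, max_concurrent_queries)
            for node_ids in projects.values()
        )
    )
    return dict(zip(projects, per_project))


if __name__ == "__main__":
    demo = {"project_a": ["f", "g"], "project_b": ["h"]}
    print(asyncio.run(describe_all(demo, max_concurrent_queries=50, max_concurrent_tasks=2)))
```

Under this model, lower values reduce pressure on the chat API (useful when the provider enforces rate limits), while higher values shorten the total run time.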
To generate embeddings for a single project, use the embed command:
uv run pyastran embed /path/to/single/project --max-concurrent-queries 50
This command creates the embeddings for all nodes and a vector index named description_embedding_index.
The optional --max-concurrent-queries parameter determines how many nodes are embedded in parallel. The default is 100.
To generate embeddings for multiple projects under a directory, use the embed --all command:
uv run pyastran embed --all /path/to/multiple/projects --max-concurrent-queries 50 --max-concurrent-tasks 2
This command creates the embeddings for all nodes and a vector index named description_embedding_index per project.
The optional --max-concurrent-tasks parameter determines how many projects are processed in parallel. The default is 1.
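Once the index exists, it can be used for similarity search. The following is a minimal sketch, assuming a Neo4j version that provides the db.index.vector.queryNodes procedure (5.11 or later) and a query vector produced by the same embedding model. The helper build_similarity_query is a hypothetical name and not part of pyastran; it only builds the Cypher text and its parameters.

```python
from typing import Any

# Index name taken from the embed command description above.
INDEX_NAME = "description_embedding_index"


def build_similarity_query(
    embedding: list[float], top_k: int = 5
) -> tuple[str, dict[str, Any]]:
    """Build a Cypher query returning the top_k nodes closest to `embedding`."""
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    if not embedding:
        raise ValueError("embedding must not be empty")
    query = (
        "CALL db.index.vector.queryNodes($index_name, $top_k, $embedding) "
        "YIELD node, score "
        "RETURN node, score "
        "ORDER BY score DESC"
    )
    params = {"index_name": INDEX_NAME, "top_k": top_k, "embedding": embedding}
    return query, params


# Possible usage with the official neo4j Python driver (not executed here;
# uri, user, and password are the values from your .env file):
#
#   from neo4j import GraphDatabase
#   query, params = build_similarity_query(query_vector, top_k=5)
#   with GraphDatabase.driver(uri, auth=(user, password)) as driver:
#       records, _, _ = driver.execute_query(query, params)
```

The query vector must have the same dimensionality as the stored embeddings. Which database to target for a given project depends on how the knowledge graph was created and is not covered by this sketch.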
Due to a bug in pykagcee, the file_path property of some nodes is occasionally empty.
To fix this issue, run the fix-paths --all command at any time (even if you have not generated descriptions or embeddings yet):
uv run pyastran fix-paths --all /path/to/multiple/projects
To remove all descriptions, embeddings, and indexes from all databases, use the wipe command:
uv run pyastran wipe