
miosomos/pyastran


Pyastran

The goal of this repository is to handle the creation of node descriptions and embeddings for graphs generated by pykagcee.

Requirements

  • Neo4j DBMS (local or remote). We recommend using the Neo4j Desktop application due to its better performance.
  • uv to manage the Python virtual environment and dependencies.
  • Chat model API to generate the descriptions.
  • Embedding model API to generate the embeddings.

Installation

Create your .env file. You can use the .env.example file as a template.

cp .env.example .env

Set your Neo4j connection details in the .env file. Note that you must first create a knowledge graph with pykagcee.

NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your_password
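As a quick sanity check before running any command, the connection settings can be verified with a few lines of Python. This is a minimal sketch using only the standard library; parse_env and missing_keys are hypothetical helpers for illustration and are not part of pyastran.

```python
# Hypothetical helpers: pyastran's own settings loader is not shown in this README.
REQUIRED_KEYS = ("NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


example = """
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your_password
"""
settings = parse_env(example)
print(missing_keys(settings))  # an empty list means all three keys are set
```

To check a real file, pass the contents of .env (for example, pathlib.Path(".env").read_text()) to parse_env instead of the inline example.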

For the chat and embedding models, the project comes with the langchain-openai integration. If you use a provider other than OpenAI, add its integration package with the uv add langchain-{provider} command. See available providers.

CHAT_PROVIDER=openai
CHAT_MODEL=gpt-4.1-nano
CHAT_API_KEY=sk-proj-fakekey123

EMBEDDING_PROVIDER=openai
EMBEDDING_MODEL=text-embedding-3-small
EMBEDDING_API_KEY=sk-proj-fakekey123

If you are serving an OpenAI-compatible API (e.g., with vLLM) you can set the CHAT_BASE_URL and EMBEDDING_BASE_URL variables, keeping the provider as openai.

CHAT_PROVIDER=openai
CHAT_BASE_URL=https://example-vllm-openai-compatible-serve.test/v1

EMBEDDING_PROVIDER=openai
EMBEDDING_BASE_URL=https://example-vllm-openai-compatible-serve.test/v1
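The sketch below illustrates how these variables could translate into client settings: the base URL is passed only when it is set, so the default OpenAI endpoint is used otherwise. The function openai_client_kwargs is hypothetical, and how pyastran actually wires these values is an assumption. The keyword names (model, api_key, base_url) mirror those accepted by ChatOpenAI and OpenAIEmbeddings in langchain-openai.

```python
def openai_client_kwargs(prefix: str, env: dict[str, str]) -> dict[str, str]:
    """Collect model, API key and optional base URL for one role.

    prefix is "CHAT" or "EMBEDDING"; env is a mapping such as os.environ.
    Hypothetical helper, not pyastran's actual code.
    """
    kwargs = {
        "model": env[f"{prefix}_MODEL"],
        "api_key": env[f"{prefix}_API_KEY"],
    }
    base_url = env.get(f"{prefix}_BASE_URL")
    if base_url:  # only override the endpoint when one is configured
        kwargs["base_url"] = base_url
    return kwargs


env = {
    "CHAT_MODEL": "gpt-4.1-nano",
    "CHAT_API_KEY": "sk-proj-fakekey123",
    "CHAT_BASE_URL": "https://example-vllm-openai-compatible-serve.test/v1",
    "EMBEDDING_MODEL": "text-embedding-3-small",
    "EMBEDDING_API_KEY": "sk-proj-fakekey123",
}
chat_kwargs = openai_client_kwargs("CHAT", env)
embedding_kwargs = openai_client_kwargs("EMBEDDING", env)
print(chat_kwargs)
print(embedding_kwargs)
```

With langchain-openai installed, the resulting dictionaries could be unpacked into the constructors, for example ChatOpenAI(**chat_kwargs).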

Set the CHAT_MAX_CONTEXT variable to avoid exceeding the model's context length when generating the descriptions.

CHAT_MAX_CONTEXT=3000

By default, when generating the description for a symbol, we include n random related symbols to provide more context to the model. You can configure how many related symbols to include by setting the MAX_RELATION_CONTEXT variable in the .env file. The default value is 3.

MAX_RELATION_CONTEXT=3

Create environment and install dependencies:

uv sync

Usage

Generate descriptions

To generate descriptions for a single project, use the describe command:

uv run pyastran describe /path/to/single/project --max-concurrent-queries 50

The optional --max-concurrent-queries parameter determines how many nodes are described in parallel. The default is 100.
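A limit of this kind is commonly implemented with a semaphore around each request. The sketch below shows the pattern with asyncio; describe_node is a stand-in for a real chat-model call, and whether pyastran uses this exact mechanism is an assumption.

```python
import asyncio


async def describe_node(node_id: int) -> str:
    """Stand-in for one chat-model request (hypothetical)."""
    await asyncio.sleep(0.01)
    return f"description of node {node_id}"


async def describe_all(node_ids: list[int], max_concurrent_queries: int = 100) -> list[str]:
    """Describe every node, with at most max_concurrent_queries requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrent_queries)

    async def bounded(node_id: int) -> str:
        async with semaphore:  # waits here while the limit is reached
            return await describe_node(node_id)

    # gather preserves the order of node_ids in its results
    return await asyncio.gather(*(bounded(n) for n in node_ids))


results = asyncio.run(describe_all(list(range(10)), max_concurrent_queries=3))
print(len(results))
```

Lowering the limit reduces pressure on rate-limited APIs at the cost of a longer total run time.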

To generate descriptions for multiple projects under a directory, use the describe --all command:

uv run pyastran describe --all /path/to/multiple/projects --max-concurrent-queries 50 --max-concurrent-tasks 2

The optional --max-concurrent-tasks parameter determines how many projects are processed in parallel. The default is 1.

Embed descriptions

To generate embeddings for a single project, use the embed command:

uv run pyastran embed /path/to/single/project --max-concurrent-queries 50

This command will create the embeddings for all nodes and a vector index named description_embedding_index.

The optional --max-concurrent-queries parameter determines how many nodes are embedded in parallel. The default is 100.
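For readers unfamiliar with Neo4j vector indexes, the sketch below builds the kind of Cypher statements involved in creating and querying one. Only the index name comes from this README. The node label (Symbol), the property name (description_embedding), the cosine similarity function, and the dimension count (1536, the default output size of text-embedding-3-small) are assumptions for illustration and may differ from what pyastran creates.

```python
INDEX_NAME = "description_embedding_index"  # index name taken from the README


def create_index_query(label: str, prop: str, dimensions: int) -> str:
    """Build a Neo4j 5 statement that creates a vector index if it is missing."""
    return (
        f"CREATE VECTOR INDEX {INDEX_NAME} IF NOT EXISTS "
        f"FOR (n:{label}) ON (n.{prop}) "
        "OPTIONS {indexConfig: {"
        f"`vector.dimensions`: {dimensions}, "
        "`vector.similarity_function`: 'cosine'}}"
    )


# Similarity search: pass the index name, the number of neighbours and the
# query vector as parameters ($index, $k, $embedding).
SEARCH_QUERY = (
    "CALL db.index.vector.queryNodes($index, $k, $embedding) "
    "YIELD node, score RETURN node, score"
)

# Label and property name below are assumed, not taken from pyastran.
print(create_index_query("Symbol", "description_embedding", 1536))
print(SEARCH_QUERY)
```

Once embeddings exist, SEARCH_QUERY could be run through the Neo4j Python driver with index set to description_embedding_index and embedding set to a vector produced by the same embedding model.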

To generate embeddings for multiple projects under a directory, use the embed --all command:

uv run pyastran embed --all /path/to/multiple/projects --max-concurrent-queries 50 --max-concurrent-tasks 2

This command will create the embeddings for all nodes and a vector index named description_embedding_index per project.

The optional --max-concurrent-tasks parameter determines how many projects are processed in parallel. The default is 1.

Fix path issues

Due to a bug in pykagcee, sometimes the file_path property of the nodes is empty. To fix this issue, use the fix-paths --all command at any time (even if you have not generated descriptions or embeddings yet):

uv run pyastran fix-paths --all /path/to/multiple/projects

Wipe all descriptions and embeddings

Remove all descriptions, embeddings and indexes from all databases.

uv run pyastran wipe
