This is my attempt at the technical test for Corolair. I implemented a RESTful API for knowledge retrieval and question answering on course materials, using a graph-based RAG (retrieval-augmented generation) approach on PDF documents. The API leverages GraphRAG for complex knowledge linking, document embeddings for efficient retrieval, and OpenAI models for answer generation.
- Objective
- Prerequisites
- Installation
- Project Structure
- API Overview
- Endpoints
- Usage Examples
- Testing & Documentation
- Security Considerations
- Bonus: Agent Workflow
- Notes and Future Enhancements
The API allows users to upload PDFs, create a knowledge graph, and retrieve information or direct answers based on content. Using GraphRAG enhances context understanding in question answering by creating an interconnected knowledge graph from course materials.
- Python 3.8+
- API Keys:
- OpenAI API for LLM responses and embeddings
- Libraries/Tools:
- LanceDB, LangChain (I had issues handling PDFs with Docling, so I used the LangChain loader instead), FastAPI, etc.
- Swagger for API documentation
- Clone the repository:
  git clone https://github.com/shadlia/corolair_tech_test
  cd corolair_tech_test
- Install dependencies:
  pip install -r requirements.txt
- Environment Variables:
  - Set up your OpenAI API key:
    export OPENAI_API_KEY='your_openai_api_key'
- Start your app:
  uvicorn app:app --host 0.0.0.0 --port 8000
The project is organized as follows:
├── app.py # Main application file
├── routes/ # Directory for route handlers
│ ├── answer.py # Route for answering queries
│ ├── retrieve.py # Route for retrieving content
│ └── upload.py # Route for uploading PDFs
├── utils/ # Directory for service logic
│ ├── answer_generator.py # utils for generating answers
│ ├── embeddings_generator.py # utils for handling embeddings
│ ├── gKnowlege_graph.py # utils for knowledge graph management
│ └── pdf_processor.py # utils for processing PDFs
└── README.md # Documentation for the API
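As a rough illustration of how these pieces fit together, here is a minimal sketch of what app.py could look like when wiring the route modules into the FastAPI app. It assumes each module in routes/ exposes an APIRouter instance named `router`; the actual code may differ.

```python
# Minimal sketch of app.py (assumes each routes/ module exposes `router`).
from fastapi import FastAPI

from routes import answer, retrieve, upload

app = FastAPI(title="Corolair Knowledge Retrieval API")

# Mount the three endpoint groups on the main application.
app.include_router(upload.router)
app.include_router(retrieve.router)
app.include_router(answer.router)
```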
The API provides knowledge retrieval and answering services by processing and embedding PDF documents, creating a knowledge graph, and allowing retrieval and answering through queries. The service is designed with three primary endpoints:
- /upload: Upload and process a PDF from a URL.
- /retrieve: Retrieve text chunks relevant to a query.
- /answer: Provide a direct answer based on the document content and the relevant chunks.
- Description: Uploads a PDF from a given URL, processes it, and stores the resulting data for querying.
- Parameters:
url (string): URL to the PDF document.
- Response:
document_id (string): Unique identifier for the uploaded document.
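For illustration only, here is a hedged sketch of what the /upload route in routes/upload.py could look like. The model names and the commented processing steps are assumptions, not the repository's actual implementation.

```python
# Hypothetical sketch of routes/upload.py; names and steps are illustrative.
import uuid

from fastapi import APIRouter
from pydantic import BaseModel

router = APIRouter()


class UploadRequest(BaseModel):
    url: str  # URL of the PDF document to ingest


class UploadResponse(BaseModel):
    document_id: str  # unique identifier for the processed document


@router.post("/upload", response_model=UploadResponse)
def upload_pdf(request: UploadRequest) -> UploadResponse:
    document_id = str(uuid.uuid4())
    # The real implementation would, roughly:
    #   1. download the PDF and split it into chunks (utils/pdf_processor.py),
    #   2. embed the chunks and store them (utils/embeddings_generator.py),
    #   3. build the knowledge graph (utils/gKnowlege_graph.py).
    return UploadResponse(document_id=document_id)
```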
- Description: Accepts a document ID and a query, returning relevant text chunks based on similarity scoring and the knowledge graph.
- Parameters:
document_id (string): The ID of the document to query.
query (string): The user's query.
- Response:
results (array): Array of text chunks with similarity scores.
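A hedged sketch of the /retrieve route in routes/retrieve.py is shown below; the helper `search_similar_chunks` is an illustrative placeholder for the similarity logic in utils/embeddings_generator.py.

```python
# Hypothetical sketch of routes/retrieve.py; helper names are illustrative.
from fastapi import APIRouter
from pydantic import BaseModel

router = APIRouter()


class RetrieveRequest(BaseModel):
    document_id: str  # ID returned by /upload
    query: str        # the user's query


def search_similar_chunks(document_id: str, query: str, top_k: int = 5) -> list:
    """Placeholder: embed the query, score it against the stored chunk
    embeddings for this document (e.g. cosine similarity), and return the
    top_k chunks together with their similarity scores."""
    return []


@router.post("/retrieve")
def retrieve_chunks(request: RetrieveRequest) -> dict:
    results = search_similar_chunks(request.document_id, request.query)
    return {"results": results}
```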
- Description: Accepts a document ID and a query, returning a contextual answer if available.
- Parameters:
document_id (string): The ID of the document to query.
query (string): The user's question.
- Response:
answer (string): Answer to the query, contextualized for learning.
note (string): Returns a note if an answer is not available in the document.
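Similarly, here is a hedged sketch of the /answer route in routes/answer.py. The helpers `retrieve_relevant_chunks` and `generate_answer` are illustrative stand-ins for the logic in utils/embeddings_generator.py and utils/answer_generator.py, not the actual function names.

```python
# Hypothetical sketch of routes/answer.py; helper names are illustrative.
from fastapi import APIRouter
from pydantic import BaseModel

router = APIRouter()


class AnswerRequest(BaseModel):
    document_id: str
    query: str


def retrieve_relevant_chunks(document_id: str, query: str) -> list:
    """Placeholder: the same similarity search that backs /retrieve."""
    return []


def generate_answer(query: str, chunks: list) -> str:
    """Placeholder: prompt an OpenAI model with the query plus the retrieved
    chunks as context (utils/answer_generator.py)."""
    return ""


@router.post("/answer")
def answer_query(request: AnswerRequest) -> dict:
    chunks = retrieve_relevant_chunks(request.document_id, request.query)
    if not chunks:
        return {
            "answer": None,
            "note": "The document does not contain an answer to this query.",
        }
    return {"answer": generate_answer(request.query, chunks)}
```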
# Upload a document
curl -X POST -H "Content-Type: application/json" -d '{"url":"<pdf_url>"}' http://localhost:8000/upload

# Retrieve information
curl -X POST -H "Content-Type: application/json" -d '{"document_id": "doc_id", "query": "What is Quantitative Analysis?"}' http://localhost:8000/retrieve

# Get a direct answer
curl -X POST -H "Content-Type: application/json" -d '{"document_id": "doc_id", "query": "What is Quantitative Analysis?"}' http://localhost:8000/answer

- Testing with Sample Data:
- Use CBV Institute Level I Course Notes (PDF) as test data.
- Swagger API Documentation: Interactive API docs are served by FastAPI at http://localhost:8000/docs once the app is running.
- OpenAI Key Management: Ensure OpenAI keys are stored securely and not hardcoded in the source.
- Chunk-Based Retrieval:
  - The first step is to find the chunks relevant to the user query. Using document embeddings and similarity scoring, the system retrieves the most relevant chunks from the document.
- LLM Response Generation:
  - Once relevant chunks are retrieved, the system feeds them into an LLM (e.g., an OpenAI model) to generate an answer.
- Answer Relevance Check:
  - If the generated answer is deemed relevant, it is returned to the user.
- Fallback to Agent:
  - If the answer is not relevant or the content is missing from the document, the fallback agent is invoked. This agent provides alternative sources or generates a new answer based on external knowledge sources (see the sketch after the diagram below).
[User Query] --> [Check Chunks for Relevance]
                          |
                          v
           [Relevant Chunks Found?] --- No --> [Invoke Agent for Alternative Answer]
                          |
                         Yes
                          |
                          v
              [LLM Generates Answer]
                          |
                          v
              [Answer Relevant?] --- No --> [Invoke Agent for Alternative Answer]
                          |
                         Yes
                          |
                          v
             [Return Answer to User]
This workflow ensures the API can provide accurate answers based on the content of the uploaded documents, while also providing alternative solutions when necessary.
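To make the control flow above concrete, here is a hedged sketch of the fallback logic in plain Python. Every helper function is an illustrative placeholder, not the repository's actual code.

```python
# Hypothetical sketch of the fallback workflow; every helper is a placeholder.

def search_similar_chunks(document_id: str, query: str) -> list:
    """Placeholder: embedding similarity search over the document's chunks."""
    return []


def generate_answer(query: str, chunks: list) -> str:
    """Placeholder: OpenAI call using the retrieved chunks as context."""
    return ""


def is_relevant(answer: str, query: str) -> bool:
    """Placeholder: relevance check on the generated answer."""
    return bool(answer)


def run_fallback_agent(query: str) -> str:
    """Placeholder: LangChain agent that answers from external sources."""
    return ""


def answer_with_fallback(document_id: str, query: str) -> dict:
    chunks = search_similar_chunks(document_id, query)
    if not chunks:
        # No relevant content in the document: invoke the agent directly.
        return {"answer": run_fallback_agent(query),
                "note": "Answer generated from sources outside the document."}

    answer = generate_answer(query, chunks)
    if not is_relevant(answer, query):
        # The generated answer is judged irrelevant: fall back to the agent.
        return {"answer": run_fallback_agent(query),
                "note": "Answer generated from sources outside the document."}

    return {"answer": answer}
```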
Example where the document does not provide an answer:
1. The agent starts:
2. Voilà, we get the alternative answer:

Here are a few improvements I considered to enhance the system:
- Improved Similarity Calculation: Explore more advanced methods for calculating similarity between embeddings to better match relevant content, or use LangChain/LlamaIndex retrieval functionality.
- Enhanced Knowledge Graph Construction: Use LanceDB's advanced capabilities for efficient knowledge graph storage and retrieval.
- Integration with Docling: I couldn't use Docling due to package incompatibilities with LangChain (which I used for the agent and as the PDF loader).




