This is my attempt at the technical test for Corolair. I implemented a RESTful API for knowledge retrieval and question answering on course materials, using a graph-based RAG (retrieval-augmented generation) approach on PDF documents. The API leverages GraphRAG for complex knowledge linking, document embeddings for efficient retrieval, and OpenAI models for answer generation.
- Objective
- Prerequisites
- Installation
- Project Structure
- API Overview
- Endpoints
- Usage Examples
- Testing & Documentation
- Security Considerations
- Bonus: Agent Workflow
- Notes and Future Enhancements
The API allows users to upload PDFs, create a knowledge graph, and retrieve information or direct answers based on content. Using GraphRAG enhances context understanding in question answering by creating an interconnected knowledge graph from course materials.
- Python 3.8+
- API Keys:
- OpenAI API for LLM responses and embeddings
- Libraries/Tools:
- LanceDB, LangChain (I had issues handling PDFs with Docling, so I used the LangChain loader instead), FastAPI, etc.
- Swagger for API documentation
- Clone the repository:
  git clone https://github.com/shadlia/corolair_tech_test
  cd corolair_tech_test
- Install dependencies:
  pip install -r requirements.txt
- Environment Variables:
  - Set up your OpenAI API key:
    export OPENAI_API_KEY='your_openai_api_key'
- Start your app:
  uvicorn app:app --host 0.0.0.0 --port 8000
The project is organized as follows:
├── app.py # Main application file
├── routes/ # Directory for route handlers
│ ├── answer.py # Route for answering queries
│ ├── retrieve.py # Route for retrieving content
│ └── upload.py # Route for uploading PDFs
├── utils/ # Directory for service logic
│ ├── answer_generator.py # utils for generating answers
│ ├── embeddings_generator.py # utils for handling embeddings
│ ├── gKnowlege_graph.py # utils for knowledge graph management
│ └── pdf_processor.py # utils for processing PDFs
└── README.md # Documentation for the API
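As a rough illustration of how these pieces fit together, here is a minimal sketch of what app.py could look like when wiring the route modules into the FastAPI app. It assumes each module in routes/ exposes an APIRouter instance named `router`; the actual code may differ.

```python
# Minimal sketch of app.py (assumes each routes/ module exposes `router`).
from fastapi import FastAPI

from routes import answer, retrieve, upload

app = FastAPI(title="Corolair Knowledge Retrieval API")

# Mount the three endpoint groups on the main application.
app.include_router(upload.router)
app.include_router(retrieve.router)
app.include_router(answer.router)
```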
The API provides knowledge retrieval and answering services by processing and embedding PDF documents, creating a knowledge graph, and allowing retrieval and answering through queries. The service is designed with three primary endpoints:
- /upload: Upload and process a PDF from a URL.
- /retrieve: Retrieve text chunks relevant to a query.
- /answer: Provide a direct answer based on the document content and the relevant chunks.
- Description: Uploads a PDF from a given URL, processes it, and stores the resulting data for querying.
- Parameters:
url (string): URL to the PDF document.
- Response:
document_id (string): Unique identifier for the uploaded document.
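For illustration only, here is a hedged sketch of what the /upload route in routes/upload.py could look like. The model names and the commented processing steps are assumptions, not the repository's actual implementation.

```python
# Hypothetical sketch of routes/upload.py; names and steps are illustrative.
import uuid

from fastapi import APIRouter
from pydantic import BaseModel

router = APIRouter()


class UploadRequest(BaseModel):
    url: str  # URL of the PDF document to ingest


class UploadResponse(BaseModel):
    document_id: str  # unique identifier for the processed document


@router.post("/upload", response_model=UploadResponse)
def upload_pdf(request: UploadRequest) -> UploadResponse:
    document_id = str(uuid.uuid4())
    # The real implementation would, roughly:
    #   1. download the PDF and split it into chunks (utils/pdf_processor.py),
    #   2. embed the chunks and store them (utils/embeddings_generator.py),
    #   3. build the knowledge graph (utils/gKnowlege_graph.py).
    return UploadResponse(document_id=document_id)
```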
- Description: Accepts a document ID and a query, returning relevant text chunks based on similarity scoring and the knowledge graph.
- Parameters:
document_id (string): The ID of the document to query.
query (string): The user's query.
- Response:
results (array): Array of text chunks with similarity scores.
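A hedged sketch of the /retrieve route in routes/retrieve.py is shown below; the helper `search_similar_chunks` is an illustrative placeholder for the similarity logic in utils/embeddings_generator.py.

```python
# Hypothetical sketch of routes/retrieve.py; helper names are illustrative.
from fastapi import APIRouter
from pydantic import BaseModel

router = APIRouter()


class RetrieveRequest(BaseModel):
    document_id: str  # ID returned by /upload
    query: str        # the user's query


def search_similar_chunks(document_id: str, query: str, top_k: int = 5) -> list:
    """Placeholder: embed the query, score it against the stored chunk
    embeddings for this document (e.g. cosine similarity), and return the
    top_k chunks together with their similarity scores."""
    return []


@router.post("/retrieve")
def retrieve_chunks(request: RetrieveRequest) -> dict:
    results = search_similar_chunks(request.document_id, request.query)
    return {"results": results}
```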
- Description: Accepts a document ID and a query, returning a contextual answer if available.
- Parameters:
document_id (string): The ID of the document to query.
query (string): The user's question.
- Response:
answer (string): Answer to the query, contextualized for learning.
note (string): Returns a note if an answer is not available in the document.
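Similarly, here is a hedged sketch of the /answer route in routes/answer.py. The helpers `retrieve_relevant_chunks` and `generate_answer` are illustrative stand-ins for the logic in utils/embeddings_generator.py and utils/answer_generator.py, not the actual function names.

```python
# Hypothetical sketch of routes/answer.py; helper names are illustrative.
from fastapi import APIRouter
from pydantic import BaseModel

router = APIRouter()


class AnswerRequest(BaseModel):
    document_id: str
    query: str


def retrieve_relevant_chunks(document_id: str, query: str) -> list:
    """Placeholder: the same similarity search that backs /retrieve."""
    return []


def generate_answer(query: str, chunks: list) -> str:
    """Placeholder: prompt an OpenAI model with the query plus the retrieved
    chunks as context (utils/answer_generator.py)."""
    return ""


@router.post("/answer")
def answer_query(request: AnswerRequest) -> dict:
    chunks = retrieve_relevant_chunks(request.document_id, request.query)
    if not chunks:
        return {
            "answer": None,
            "note": "The document does not contain an answer to this query.",
        }
    return {"answer": generate_answer(request.query, chunks)}
```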
# Upload a document
curl -X POST -H "Content-Type: application/json" -d '{"url":"<pdf_url>"}' http://localhost:8000/upload

# Retrieve information
curl -X POST -H "Content-Type: application/json" -d '{"document_id": "doc_id", "query": "What is Quantitative Analysis?"}' http://localhost:8000/retrieve

# Get a direct answer
curl -X POST -H "Content-Type: application/json" -d '{"document_id": "doc_id", "query": "What is Quantitative Analysis?"}' http://localhost:8000/answer

- Testing with Sample Data:
- Use CBV Institute Level I Course Notes (PDF) as test data.
- Swagger API Documentation: Interactive API docs are served by FastAPI at http://localhost:8000/docs once the app is running.
- OpenAI Key Management: Ensure OpenAI keys are stored securely and not hardcoded in the source.
- Chunk-Based Retrieval:
  - The first step is to find the chunks relevant to the user query. Using document embeddings and similarity scoring, the system retrieves the most relevant chunks from the document.
- LLM Response Generation:
  - Once relevant chunks are retrieved, the system feeds them into an LLM (e.g., an OpenAI model) to generate an answer.
- Answer Relevance Check:
  - If the generated answer is deemed relevant, it is returned to the user.
- Fallback to Agent:
  - If the answer is not relevant or the content is missing from the document, the fallback agent is invoked. This agent provides alternative sources or generates a new answer based on external knowledge sources (see the sketch after the diagram below).
[User Query] --> [Check Chunks for Relevance]
                          |
                          v
           [Relevant Chunks Found?] --- No --> [Invoke Agent for Alternative Answer]
                          |
                         Yes
                          |
                          v
              [LLM Generates Answer]
                          |
                          v
              [Answer Relevant?] --- No --> [Invoke Agent for Alternative Answer]
                          |
                         Yes
                          |
                          v
             [Return Answer to User]
This workflow ensures the API can provide accurate answers based on the content of the uploaded documents, while also providing alternative solutions when necessary.
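To make the control flow above concrete, here is a hedged sketch of the fallback logic in plain Python. Every helper function is an illustrative placeholder, not the repository's actual code.

```python
# Hypothetical sketch of the fallback workflow; every helper is a placeholder.

def search_similar_chunks(document_id: str, query: str) -> list:
    """Placeholder: embedding similarity search over the document's chunks."""
    return []


def generate_answer(query: str, chunks: list) -> str:
    """Placeholder: OpenAI call using the retrieved chunks as context."""
    return ""


def is_relevant(answer: str, query: str) -> bool:
    """Placeholder: relevance check on the generated answer."""
    return bool(answer)


def run_fallback_agent(query: str) -> str:
    """Placeholder: LangChain agent that answers from external sources."""
    return ""


def answer_with_fallback(document_id: str, query: str) -> dict:
    chunks = search_similar_chunks(document_id, query)
    if not chunks:
        # No relevant content in the document: invoke the agent directly.
        return {"answer": run_fallback_agent(query),
                "note": "Answer generated from sources outside the document."}

    answer = generate_answer(query, chunks)
    if not is_relevant(answer, query):
        # The generated answer is judged irrelevant: fall back to the agent.
        return {"answer": run_fallback_agent(query),
                "note": "Answer generated from sources outside the document."}

    return {"answer": answer}
```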
Example where the document does not provide an answer:
1. The agent starts:
2. Voilà, we get the alternative answer:

Here are a few improvements I considered to enhance the system:
- Improved Similarity Calculation: Explore more advanced methods for calculating similarity between embeddings to better match relevant content, or use LangChain/LlamaIndex retrieval functionality.
- Enhanced Knowledge Graph Construction: Use LanceDB's advanced capabilities for efficient knowledge graph storage and retrieval.
- Integration with Docling: I couldn't use Docling due to package incompatibilities with LangChain (which I used for the agent and as the PDF loader).




