Graph-Based Knowledge Retrieval API

This is my attempt to solve the technical test for Corolair. I have implemented a RESTful API for knowledge retrieval and question answering on course materials, using a graph-based RAG (retrieval-augmented generation) method on PDF documents. This API leverages GraphRAG for complex knowledge linking, document embeddings for efficient retrieval, and OpenAI models for answer generation.

Table of Contents

  1. Objective
  2. Prerequisites
  3. Installation
  4. Project Structure
  5. API Overview
  6. Endpoints
  7. Usage Examples
  8. Testing and Documentation
  9. Security Considerations
  10. Bonus: Agent Workflow
  11. Notes and Future Enhancements

Objective

The API allows users to upload PDFs, create a knowledge graph, and retrieve information or direct answers based on content. Using GraphRAG enhances context understanding in question answering by creating an interconnected knowledge graph from course materials.

Prerequisites

  • Python 3.8+
  • API Keys:
    • OpenAI API for LLM responses and embeddings
  • Libraries/Tools:
    • LanceDB, LangChain (I had issues getting Docling to handle PDFs, so I used the LangChain loader instead), FastAPI, etc.
    • Swagger for API documentation

Installation

  1. Clone the repository:

    git clone https://github.com/shadlia/corolair_tech_test
    cd corolair_tech_test
  2. Install dependencies:

    pip install -r requirements.txt
  3. Environment Variables:

    • Set up your OpenAI API key:
      export OPENAI_API_KEY='your_openai_api_key'
  4. Start your app:

     uvicorn app:app --host 0.0.0.0 --port 8000

Project Structure

The project is organized as follows:

├── app.py                 # Main application file
├── routes/                # Directory for route handlers
│   ├── answer.py          # Route for answering queries
│   ├── retrieve.py        # Route for retrieving content
│   └── upload.py          # Route for uploading PDFs
├── utils/                 # Directory for service logic
│   ├── answer_generator.py      # Utils for generating answers
│   ├── embeddings_generator.py  # Utils for handling embeddings
│   ├── gKnowlege_graph.py       # Utils for knowledge graph management
│   └── pdf_processor.py         # Utils for processing PDFs
└── README.md              # Documentation for the API

API Overview

The API provides knowledge retrieval and answering services by processing and embedding PDF documents, creating a knowledge graph, and allowing retrieval and answering through queries. The service is designed with three primary endpoints:

  • /upload: Upload and process a PDF from a URL.
  • /retrieve: Retrieve text chunks relevant to a query.
  • /answer: Provide a direct answer based on the document content and the relevant chunks.
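
To make this concrete, here is a minimal sketch of how app.py could wire the three route modules from the project structure into a FastAPI application. It assumes each module exposes an APIRouter named router; the actual variable names in this repository may differ.

from fastapi import FastAPI

from routes import answer, retrieve, upload  # route handlers listed in the project structure

app = FastAPI(title="Graph-Based Knowledge Retrieval API")

# Assumption: each routes/ module exposes an APIRouter called `router`
app.include_router(upload.router)    # POST /upload
app.include_router(retrieve.router)  # POST /retrieve
app.include_router(answer.router)    # POST /answer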

Endpoints

1. Upload Document - POST /upload

  • Description: Uploads a PDF from a given URL, processes it, and stores the resulting data for querying.
  • Parameters:
    • url (string): URL to the PDF document.
  • Response:
    • document_id (string): Unique identifier for the uploaded document.
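
An illustrative exchange (placeholder values; only the url parameter and document_id field are documented above):

Request body (illustrative):
  {"url": "https://example.com/course.pdf"}
Response body (illustrative):
  {"document_id": "a1b2c3d4"}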

2. Retrieve Content - POST /retrieve

  • Description: Accepts a document ID and a query, returning relevant text chunks based on similarity scoring and the knowledge graph.
  • Parameters:
    • document_id (string): The ID of the document to query.
    • query (string): The user's query.
  • Response:
    • results (array): Array of text chunks with similarity scores.
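
An illustrative response (placeholder values; the per-result field names are assumptions, since only "text chunks with similarity scores" is documented):

  {"results": [{"chunk": "Quantitative analysis applies mathematical methods ...", "score": 0.87}]}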

3. Answer Query - POST /answer

  • Description: Accepts a document ID and a query, returning a contextual answer if available.
  • Parameters:
    • document_id (string): The ID of the document to query.
    • query (string): The user’s question.
  • Response:
    • answer (string): Answer to the query, contextualized for learning.
    • note (string): Returns a note if an answer is not available in the document.
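
Illustrative responses for the two documented cases (the wording is placeholder text, not the API's exact output):

  {"answer": "Quantitative analysis is the use of mathematical and statistical methods to ..."}
  {"note": "The document does not contain information relevant to this query."}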

Usage Examples

# Upload a document
curl -X POST -H "Content-Type: application/json" -d '{"url":"<pdf_url>"}' http://localhost:8000/upload

(Screenshot: 2024-11-11 10-59-12)

# Retrieve information
curl -X POST -H "Content-Type: application/json" -d '{"document_id": "doc_id", "query": "What is Quantitative Analysis?"}' http://localhost:8000/retrieve

(Screenshot: 2024-11-11 11-00-15)

# Get a direct answer
curl -X POST -H "Content-Type: application/json" -d '{"document_id": "doc_id", "query": "What is Quantitative Analysis?"}' http://localhost:8000/answer

(Screenshot: 2024-11-11 11-00-55)

Example of an irrelevant query: (Screenshot: 2024-11-09 16-12-47)

Testing and Documentation

  1. Testing with Sample Data:
  2. Swagger API Documentation:
    • Access Swagger at http://localhost:8000/docs to view interactive documentation and test endpoints directly. (Screenshot: 2024-11-09 16-16-47)

Security Considerations

  • OpenAI Key Management: Ensure OpenAI keys are stored securely and not hardcoded in the source.
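
As a minimal sketch of that practice in Python (assuming the key was exported as shown in the Installation section):

import os

# Read the key from the environment instead of hardcoding it in the source
openai_api_key = os.environ.get("OPENAI_API_KEY")
if not openai_api_key:
    raise RuntimeError("OPENAI_API_KEY is not set")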

Bonus: Agent Workflow

Workflow for Query Resolution:

  1. Chunk-Based Retrieval:

    • The first step is to find the chunks relevant to the user query: using document embeddings and similarity scoring, the most relevant chunks are retrieved from the document.
  2. LLM Response Generation:

    • Once relevant chunks are retrieved, the system feeds these into an LLM (e.g., OpenAI) to generate a relevant answer.
  3. Answer Relevance Check:

    • If the generated answer is deemed relevant, it is returned to the user.
  4. Fallback to Agent:

    • If the answer is not relevant or the content is missing in the document, the fallback agent is invoked. This agent will provide alternative sources or generate a new answer based on external knowledge sources.
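
The four steps above can be condensed into a short Python sketch. The callables passed in (retrieve_chunks, generate_answer, is_relevant, fallback_agent) are hypothetical placeholders for the logic in the utils/ modules, not the repository's actual function names:

from typing import Callable, List

def resolve_query(
    document_id: str,
    query: str,
    retrieve_chunks: Callable[[str, str], List[str]],  # embedding-based chunk retrieval
    generate_answer: Callable[[str, List[str]], str],  # LLM answer generation
    is_relevant: Callable[[str, str], bool],           # relevance check on the generated answer
    fallback_agent: Callable[[str], str],              # agent that uses external knowledge
) -> str:
    chunks = retrieve_chunks(document_id, query)       # 1. chunk-based retrieval
    if not chunks:
        return fallback_agent(query)                   # no relevant chunks -> fallback agent
    answer = generate_answer(query, chunks)            # 2. LLM response generation
    if is_relevant(answer, query):                     # 3. answer relevance check
        return answer
    return fallback_agent(query)                       # 4. fallback to the agent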

Agent Workflow Diagram:

[User Query] --> [Check Chunks for Relevance]
       |
       v
[Relevant Chunks Found?] --- No --> [Invoke Agent for Alternative Answer]
       |
       v
    Yes
       |
       v
[LLM Generates Answer]
       |
       v
[Answer Relevant?] --- No --> [Invoke Agent for Alternative Answer]
       |
       v
    Yes
       |
       v
[Return Answer to User]

This workflow ensures the API can provide accurate answers based on the content of the uploaded documents, while also providing alternative solutions when necessary.

Example when the document does not provide an answer: 1) the agent starts (Screenshot: 2024-11-09 17-05-49); 2) voilà, we get the alternative answer (Screenshot: 2024-11-09 17-16-38).

Notes and Future Enhancements

Here are a few improvements I thought about to enhance the system:

  • Improved Similarity Calculation: Explore advanced methods for calculating similarity between embeddings to better match relevant content, or use LangChain/LlamaIndex functionalities (a small baseline sketch follows this list).
  • Enhanced Knowledge Graph Construction: Use LanceDB's advanced capabilities for efficient knowledge graph storage and retrieval.
  • Integration with Docling: I couldn't use Docling due to a package incompatibility with LangChain (which I used for the agent and as the PDF loader).
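
As a baseline reference for the similarity point above, plain cosine similarity over embedding vectors is just a few lines of numpy (the vectors below are toy values, not real embeddings):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank toy chunk embeddings against a toy query embedding
query_vec = np.array([0.1, 0.3, 0.5])
chunk_vecs = [np.array([0.1, 0.2, 0.6]), np.array([0.9, 0.1, 0.0])]
scores = [cosine_similarity(query_vec, v) for v in chunk_vecs]
best_chunk_index = int(np.argmax(scores))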
