
fix: pipeline hook data into FastAPI response#135

Open
frayle-ons wants to merge 2 commits into main from 134-fastapi-missing-hook-data

Conversation

@frayle-ons
Contributor

✨ Summary

The VectorStore methods search(), reverse_search() and embed() accept and return VectorStore dataclasses, which specify the required columns and their DataFrame data types to ensure proper functionality.

Users can add non-required metadata columns in several ways. Firstly, by adding metadata during the vector database indexing process; this data is then returned automatically with search results and in the FastAPI response.

The second approach is to add extra columns of data with pre- and post-processing hooks.
In post-processing hook operations, extra columns were returned from the VectorStore methods correctly, but not in the corresponding FastAPI server response. This is because the code responsible for converting the dataframe to a JSON object did not extract extra data potentially added during hook logic; the conversion code only accounted for the required columns and the index-time metadata columns.

These changes introduce a small fix to make sure that any extra columns added to the data are piped into the JSON object at conversion time.

The logic operates as follows, handling the required data, indexed metadata and 'other data' separately:

    for query_id, group_df in grouped:
        # Convert group_df to a list of dictionaries
        rows_as_dicts = group_df.to_dict(orient="records")

        # Build the list of ResultEntry objects for the current group
        response_entries = []
        for row in rows_as_dicts:
            # Extract metadata columns dynamically
            metadata_values = {meta: row[meta] for meta in meta_data}

            # Find other values - added by hooks - any other per-row columns not in reserved/meta
            other_values = {
                k: v for k, v in row.items() if k not in ["doc_id", "doc_text", "score", "rank"] and k not in meta_data
            }

            # Create a ResultEntry object
            response_entries.append(
                ResultEntry(
                    label=row["doc_id"],
                    description=row["doc_text"],
                    score=row["score"],  # Assuming `score` is a column in the DataFrame
                    rank=row["rank"],  # Assuming `rank` is a column in the DataFrame
                    **metadata_values,  # Add metadata dynamically
                    **other_values,  # Add any extra columns dynamically
                )
            )

Similar logic is applied to the VectorStore reverse-search API conversion code, with the required column names changed accordingly.
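The column-splitting step above can be isolated as a small helper. This is a minimal sketch, not the exact PR code: the reserved column names are taken from the forward-search snippet, and the `split_row` helper name is illustrative.

```python
# Reserved columns from the forward-search conversion snippet; the
# reverse-search path uses its own required column names.
RESERVED = {"doc_id", "doc_text", "score", "rank"}

def split_row(row: dict, meta_data: dict) -> tuple[dict, dict]:
    """Split one result row into indexed metadata and hook-added extras."""
    metadata_values = {m: row[m] for m in meta_data if m in row}
    other_values = {
        k: v for k, v in row.items() if k not in RESERVED and k not in meta_data
    }
    return metadata_values, other_values

row = {"doc_id": "d1", "doc_text": "text", "score": 0.9, "rank": 1,
       "country": "Kenya", "fruit": "coconut"}
meta, extra = split_row(row, {"country": str})
# meta holds the indexed metadata, extra holds the hook-added columns
```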

Finally, this ticket also changes the VectorStoreSearchOutput 'rank' column to begin at 1 rather than 0, based on package user feedback.
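A minimal sketch of the rank change, assuming ranks are assigned per query group in a pandas DataFrame (the grouping column name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "query_id": [0, 0, 0, 1, 1],
    "score": [0.9, 0.8, 0.7, 0.95, 0.6],
})

# cumcount() is 0-based, so adding 1 starts each group's ranks at 1
df["rank"] = df.groupby("query_id").cumcount() + 1
```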

Apart from the change in ranking values, the core functionality of the VectorStore remains the same; the hook data changes only affect the server module of the package.

📜 Changes Introduced

  • fix: columns in result objects that are neither required columns nor parquet metadata are now passed through to the FastAPI response like other data columns
  • chore: changed the 'rank' column to start from 1 instead of 0 in result dataclass objects

✅ Checklist

Please confirm you've completed these checks before requesting a review.

  • Code passes linting with Ruff
  • Security checks pass using Bandit
  • API and Unit tests are written and pass using pytest
  • Terraform files (if applicable) follow best practices and have been validated (terraform fmt & terraform validate)
  • DocStrings follow Google-style and are added as per Pylint recommendations
  • Documentation has been updated if needed

🔍 How to Test

I set up the following script after building from source on this branch and used uv run test_script.py to showcase the changes. It uses the testdata.csv file to index a dataset and includes 'country' metadata as an extra column. I then created a post-processing hook that adds a new column containing the corresponding capital city for a row with country X. For the reverse search method I added a trivial hook that adds a 'fruit' column with the value 'coconut' at every row. These hooks are loaded into the VectorStore and the server is started; the tester can then observe the hook-added content being returned over the API with these changes.

from classifai.servers import run_server
from classifai.vectorisers import HuggingFaceVectoriser
from classifai.indexers import VectorStore
from classifai.indexers.dataclasses import VectorStoreSearchInput, VectorStoreSearchOutput, VectorStoreReverseSearchOutput


# creating a vectoriser 
vectoriser = HuggingFaceVectoriser(model_name="sentence-transformers/all-MiniLM-L6-v2")

# first pass at creating a vectorstore
# my_vector_store = VectorStore(
#     file_name="./DEMO/data/testdata.csv",
#     data_type="csv",
#     vectoriser=vectoriser,
#     overwrite=True,
#     output_dir="test_vdb",
#     meta_data={"country": str}
# )


# defining some data that our hook will use
extra_injection_info = {
    'USA': 'Washington DC', 
    "Egypt": 'Cairo', 
    "Kenya": 'Nairobi', 
    "India": 'New Delhi', 
    "France": 'Paris',
    "Germany": 'Berlin',
    "Italy": 'Rome',
    "Spain": 'Madrid',
    "Australia": 'Canberra',
    "Brazil": 'Brasilia',
    "Japan": 'Tokyo',
    "Canada": 'Ottawa',
    "UK": 'London',
    "Nepal": 'Kathmandu',
    "South Africa": 'Pretoria',
    "Russia": 'Moscow',
    "Sweden": 'Stockholm',
    "China": 'Beijing',
    "Indonesia": 'Jakarta',
    "Philippines": 'Manila',
    "Mongolia": 'Ulaanbaatar',
    "Norway": 'Oslo',
    "Iceland": 'Reykjavik',
    "Switzerland": 'Bern',
    "Maldives": 'Male',
    "Mexico": 'Mexico City',
    }



# writing 2 post processing functions
def add_country_hook(input_data: VectorStoreSearchOutput) -> VectorStoreSearchOutput:
    input_data['capital'] = input_data['country'].map(extra_injection_info)
    return input_data

def inject_favourite_fruit_data(input_data: VectorStoreReverseSearchOutput) -> VectorStoreReverseSearchOutput:
    input_data['fruit'] = 'coconut'
    return input_data



# reloading the vectorstore with hooks attached
my_vector_store = VectorStore.from_filespace(
    folder_path="./test_vdb/",
    vectoriser=vectoriser,
    hooks={
        'search_postprocess': add_country_hook, 
        "reverse_search_postprocess": inject_favourite_fruit_data
        }
    )


# running the server that can be tested at port 8000
run_server(
    vector_stores=[my_vector_store],
    endpoint_names=["my_vector_store"],
    port=8000,
)

Moving to the server and using the Swagger API docs, you can see that the capital cities column is pulled through into the forward search response, and the fruit column shows in the reverse search response, alongside the country metadata from the VDB. Before these changes those data points would have been lost in the API.


@frayle-ons frayle-ons linked an issue Feb 16, 2026 that may be closed by this pull request
@github-actions github-actions bot added the bug Something isn't working label Feb 16, 2026
Comment on lines 126 to 131
@@ -127,12 +127,16 @@
# Extract metadata columns dynamically
metadata_values = {meta: row[meta] for meta in meta_data if meta in row}

# Find other values - added by hooks - any other per-row columns not in reserved/meta
other_values = {k: v for k, v in row.items() if k not in ["doc_id", "doc_text"] and k not in meta_data}
Collaborator


Suggest changing to the following:

        hook_columns = set(group_df.columns).difference(meta_data.keys()).difference({"doc_id", "doc_text", "score", "rank"})

        for row in rows_as_dicts:
            # Extract metadata columns dynamically
            metadata_values = {meta: row[meta] for meta in meta_data if meta in row}

            # Find other values - added by hooks - any other per-row columns not in reserved/meta
            other_values = {k: v for k, v in row.items() if k in hook_columns}

This moves identifying the extra columns outside of the for loop.
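The effect of the suggested set-difference precomputation can be sketched with illustrative column names (the `capital` hook column and `country` metadata are taken from the test script above):

```python
# Columns of one result group: required columns, indexed metadata,
# and a column added by a post-processing hook.
columns = ["doc_id", "doc_text", "score", "rank", "country", "capital"]
meta_data = {"country": str}

# Computed once per group instead of once per row
hook_columns = (
    set(columns)
    .difference(meta_data.keys())
    .difference({"doc_id", "doc_text", "score", "rank"})
)
# hook_columns now contains only the hook-added column(s)
```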

Comment on lines +177 to +185
for row in rows_as_dicts:
# Extract metadata columns dynamically
metadata_values = {meta: row[meta] for meta in meta_data}

# Find other values - added by hooks - any other per-row columns not in reserved/meta
other_values = {
k: v for k, v in row.items() if k not in ["doc_id", "doc_text", "score", "rank"] and k not in meta_data
}

Collaborator


See comment above

Collaborator

@lukeroantreeONS lukeroantreeONS left a comment


Good PR, functionality is as-described, test results in the expected output.
Further testing shows it to be robust at handling injected data with different datatypes ('standard' types only - int, float, bool).
I've requested one small change (same change in two places), to reduce some computational overhead.
Happy to approve it with that applied.


Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Hook Columns not passed to FastAPI Response

2 participants