
fix: pipeline hook data into FastAPI response#135

Open
frayle-ons wants to merge 2 commits into main from 134-fastapi-missing-hook-data

Conversation

@frayle-ons
Contributor

✨ Summary

The VectorStore methods search(), reverse_search() and embed() accept and return VectorStore dataclasses, which specify the required columns and their DataFrame data types to ensure proper functionality.

Users can add non-required metadata columns in several ways. Firstly, by adding metadata during the vector database indexing process; this data is then returned automatically with search results and in the FastAPI response.

The second approach is to add extra columns of data with pre- and post-processing hooks.
In post-processing hook operations, extra columns were returned from the VectorStore methods correctly, but not in the corresponding FastAPI server response. This is because the code responsible for converting the dataframe to a JSON object did not extract extra data potentially added during hook logic; the conversion code only accounted for the required columns and the index-time metadata columns.

These changes introduce a small fix to make sure that any extra columns added to the data are piped into the JSON object at conversion time.

The logic operates as follows, handling the required data, indexed metadata and 'other data' separately:

    for query_id, group_df in grouped:
        # Convert group_df to a list of dictionaries
        rows_as_dicts = group_df.to_dict(orient="records")

        # Build the list of ResultEntry objects for the current group
        response_entries = []
        for row in rows_as_dicts:
            # Extract metadata columns dynamically
            metadata_values = {meta: row[meta] for meta in meta_data}

            # Find other values - added by hooks - any other per-row columns not in reserved/meta
            other_values = {
                k: v for k, v in row.items() if k not in ["doc_id", "doc_text", "score", "rank"] and k not in meta_data
            }

            # Create a ResultEntry object
            response_entries.append(
                ResultEntry(
                    label=row["doc_id"],
                    description=row["doc_text"],
                    score=row["score"],  # Assuming `score` is a column in the DataFrame
                    rank=row["rank"],  # Assuming `rank` is a column in the DataFrame
                    **metadata_values,  # Add metadata dynamically
                    **other_values,  # Add any extra columns dynamically
                )
            )

Similar logic is applied to the VectorStore reverse-search API conversion code, with the required column names changed accordingly.
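The column-splitting step above can be isolated as a small helper. This is a minimal sketch, not the exact PR code: the reserved column names are taken from the forward-search snippet, and the `split_row` helper name is illustrative.

```python
# Reserved columns from the forward-search conversion snippet; the
# reverse-search path uses its own required column names.
RESERVED = {"doc_id", "doc_text", "score", "rank"}

def split_row(row: dict, meta_data: dict) -> tuple[dict, dict]:
    """Split one result row into indexed metadata and hook-added extras."""
    metadata_values = {m: row[m] for m in meta_data if m in row}
    other_values = {
        k: v for k, v in row.items() if k not in RESERVED and k not in meta_data
    }
    return metadata_values, other_values

row = {"doc_id": "d1", "doc_text": "text", "score": 0.9, "rank": 1,
       "country": "Kenya", "fruit": "coconut"}
meta, extra = split_row(row, {"country": str})
# meta holds the indexed metadata, extra holds the hook-added columns
```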

Finally, this ticket also changes the VectorStoreSearchOutput 'rank' column to begin at 1 rather than 0, based on package user feedback.
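A minimal sketch of the rank change, assuming ranks are assigned per query group in a pandas DataFrame (the grouping column name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "query_id": [0, 0, 0, 1, 1],
    "score": [0.9, 0.8, 0.7, 0.95, 0.6],
})

# cumcount() is 0-based, so adding 1 starts each group's ranks at 1
df["rank"] = df.groupby("query_id").cumcount() + 1
```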

Apart from the change in ranking values, the core functionality of the VectorStore remains the same; the hook data changes only affect the server module of the package.

📜 Changes Introduced

  • fix: columns in result objects that are neither required columns nor parquet metadata are now passed through to the FastAPI response like other data columns
  • chore: changed the 'rank' column to start from 1 instead of 0 in result dataclass objects

✅ Checklist

Please confirm you've completed these checks before requesting a review.

  • Code passes linting with Ruff
  • Security checks pass using Bandit
  • API and Unit tests are written and pass using pytest
  • Terraform files (if applicable) follow best practices and have been validated (terraform fmt & terraform validate)
  • DocStrings follow Google-style and are added as per Pylint recommendations
  • Documentation has been updated if needed

🔍 How to Test

I set up the following script after building from source on this branch and used uv run test_script.py to showcase the changes. It uses the testdata.csv file to index a dataset and includes 'country' metadata as an extra column. I then created a post-processing hook that adds a new column containing the corresponding capital city for a row with country X. For the reverse search method I added a trivial hook that adds a 'fruit' column with the value 'coconut' at every row. These hooks are loaded into the VectorStore and the server is started; the tester can then observe the hook-added content being returned over the API with these changes.

from classifai.servers import run_server
from classifai.vectorisers import HuggingFaceVectoriser
from classifai.indexers import VectorStore
from classifai.indexers.dataclasses import VectorStoreSearchInput, VectorStoreSearchOutput, VectorStoreReverseSearchOutput


# creating a vectoriser 
vectoriser = HuggingFaceVectoriser(model_name="sentence-transformers/all-MiniLM-L6-v2")

# first pass at creating a vectorstore
# my_vector_store = VectorStore(
#     file_name="./DEMO/data/testdata.csv",
#     data_type="csv",
#     vectoriser=vectoriser,
#     overwrite=True,
#     output_dir="test_vdb",
#     meta_data={"country": str}
# )


# defining some data that our hook will use
extra_injection_info = {
    'USA': 'Washington DC', 
    "Egypt": 'Cairo', 
    "Kenya": 'Nairobi', 
    "India": 'New Delhi', 
    "France": 'Paris',
    "Germany": 'Berlin',
    "Italy": 'Rome',
    "Spain": 'Madrid',
    "Australia": 'Canberra',
    "Brazil": 'Brasilia',
    "Japan": 'Tokyo',
    "Canada": 'Ottawa',
    "UK": 'London',
    "Nepal": 'Kathmandu',
    "South Africa": 'Pretoria',
    "Russia": 'Moscow',
    "Sweden": 'Stockholm',
    "China": 'Beijing',
    "Indonesia": 'Jakarta',
    "Philippines": 'Manila',
    "Mongolia": 'Ulaanbaatar',
    "Norway": 'Oslo',
    "Iceland": 'Reykjavik',
    "Switzerland": 'Bern',
    "Maldives": 'Male',
    "Mexico": 'Mexico City',
    }



# writing 2 post processing functions
def add_country_hook(input_data: VectorStoreSearchOutput) -> VectorStoreSearchOutput:
    input_data['capital'] = input_data['country'].map(extra_injection_info)
    return input_data

def inject_favourite_fruit_data(input_data: VectorStoreReverseSearchOutput) -> VectorStoreReverseSearchOutput:
    input_data['fruit'] = 'coconut'
    return input_data



# reloading the vectorstore with hooks attached
my_vector_store = VectorStore.from_filespace(
    folder_path="./test_vdb/",
    vectoriser=vectoriser,
    hooks={
        'search_postprocess': add_country_hook, 
        "reverse_search_postprocess": inject_favourite_fruit_data
        }
    )


# running the server that can be tested at port 8000
run_server(
    vector_stores=[my_vector_store],
    endpoint_names=["my_vector_store"],
    port=8000,
)

Moving to the server and using the Swagger API docs, you can see that the capital cities column is pulled through into the forward search response, and the fruit column shows in the reverse search response, alongside the country metadata from the VDB. Before these changes those data points would have been lost in the API.


@frayle-ons frayle-ons linked an issue Feb 16, 2026 that may be closed by this pull request
@github-actions github-actions bot added the bug Something isn't working label Feb 16, 2026
Comment on lines 126 to 131
@@ -127,12 +127,16 @@
# Extract metadata columns dynamically
metadata_values = {meta: row[meta] for meta in meta_data if meta in row}

# Find other values - added by hooks - any other per-row columns not in reserved/meta
other_values = {k: v for k, v in row.items() if k not in ["doc_id", "doc_text"] and k not in meta_data}
Collaborator


Suggest changing to the following:

        hook_columns = set(group_df.columns).difference(meta_data.keys()).difference({"doc_id", "doc_text", "score", "rank"})

        for row in rows_as_dicts:
            # Extract metadata columns dynamically
            metadata_values = {meta: row[meta] for meta in meta_data if meta in row}

            # Find other values - added by hooks - any other per-row columns not in reserved/meta
            other_values = {k: v for k, v in row.items() if k in hook_columns}

This moves identifying the extra columns outside of the for loop.
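The effect of the suggested set-difference precomputation can be sketched with illustrative column names (the `capital` hook column and `country` metadata are taken from the test script above):

```python
# Columns of one result group: required columns, indexed metadata,
# and a column added by a post-processing hook.
columns = ["doc_id", "doc_text", "score", "rank", "country", "capital"]
meta_data = {"country": str}

# Computed once per group instead of once per row
hook_columns = (
    set(columns)
    .difference(meta_data.keys())
    .difference({"doc_id", "doc_text", "score", "rank"})
)
# hook_columns now contains only the hook-added column(s)
```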

Comment on lines +177 to +185
for row in rows_as_dicts:
# Extract metadata columns dynamically
metadata_values = {meta: row[meta] for meta in meta_data}

# Find other values - added by hooks - any other per-row columns not in reserved/meta
other_values = {
k: v for k, v in row.items() if k not in ["doc_id", "doc_text", "score", "rank"] and k not in meta_data
}

Collaborator


See comment above

Collaborator

@lukeroantreeONS lukeroantreeONS left a comment


Good PR, functionality is as-described, test results in the expected output.
Further testing shows it to be robust at handling injected data with different datatypes ('standard' types only - int, float, bool).
I've requested one small change (same change in two places), to reduce some computational overhead.
Happy to approve it with that applied.


Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Hook Columns not passed to FastAPI Response

2 participants