fix: pipeline hook data into FastAPI response (#135)
Open
frayle-ons wants to merge 2 commits into `main` from
Conversation
…ass to fastapi response
Comment on lines 126 to 131

```diff
@@ -127,12 +127,16 @@
 # Extract metadata columns dynamically
 metadata_values = {meta: row[meta] for meta in meta_data if meta in row}

 # Find other values - added by hooks - any other per-row columns not in reserved/meta
 other_values = {k: v for k, v in row.items() if k not in ["doc_id", "doc_text"] and k not in meta_data}
```
Collaborator
Suggest changing to the following:

```python
hook_columns = set(group_df.columns).difference(meta_data.keys()).difference({"doc_id", "doc_text", "score", "rank"})
for row in rows_as_dicts:
    # Extract metadata columns dynamically
    metadata_values = {meta: row[meta] for meta in meta_data if meta in row}
    # Find other values - added by hooks - any other per-row columns not in reserved/meta
    other_values = {k: v for k, v in row.items() if k in hook_columns}
```

This moves identifying the extra columns outside of the for loop.
Comment on lines +177 to +185

```python
for row in rows_as_dicts:
    # Extract metadata columns dynamically
    metadata_values = {meta: row[meta] for meta in meta_data}

    # Find other values - added by hooks - any other per-row columns not in reserved/meta
    other_values = {
        k: v for k, v in row.items() if k not in ["doc_id", "doc_text", "score", "rank"] and k not in meta_data
    }
```
Collaborator
See comment above
lukeroantreeONS requested changes on Feb 17, 2026
Collaborator
Good PR: functionality is as described, and tests produce the expected output.
Further testing shows it to be robust at handling injected data with different datatypes ('standard' types only: int, float, bool).
I've requested one small change (the same change in two places) to reduce some computational overhead.
Happy to approve it with that applied.
✨ Summary
The VectorStore methods
`search()`, `reverse_search()` and `embed()` accept and return VectorStore dataclasses, which specify the necessary columns and their dataframe data types to ensure proper functionality. Users can add non-required metadata columns in several ways. The first is adding metadata during the vector database indexing process; this data is then returned automatically with search results and in the FastAPI response.
The second is that users can add extra columns of data with pre- and post-processing hooks.
In the post-processing hook operations, extra columns were returned from the VectorStore methods correctly, but not in the corresponding FastAPI server response. This is because the code logic responsible for converting the dataframe to a JSON object did not extract extra data potentially added during hook logic; the conversion code only accounted for the required columns and the index-time metadata columns.
These changes make sure that any extra columns added to the data are piped into the JSON object at conversion time.
The logic operates as follows, handling the required data, indexed metadata and 'other data' separately:
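A minimal sketch of this three-way split, assuming rows have already been converted to dicts (the names `rows_as_dicts` and `meta_data` follow the diffs above, but `RESERVED` and `rows_to_json` are hypothetical illustrations, not the package's actual code):

```python
# Reserved columns required by the VectorStore search output (assumed set)
RESERVED = {"doc_id", "doc_text", "score", "rank"}

def rows_to_json(rows_as_dicts, meta_data):
    """Split each row into required, indexed-metadata, and hook-added values."""
    results = []
    for row in rows_as_dicts:
        # Required data: the reserved output columns
        required = {k: row[k] for k in RESERVED if k in row}
        # Indexed metadata: columns registered at index time
        metadata_values = {m: row[m] for m in meta_data if m in row}
        # 'Other data': anything added by hooks, i.e. not reserved and not metadata
        other_values = {
            k: v for k, v in row.items()
            if k not in RESERVED and k not in meta_data
        }
        results.append({**required, **metadata_values, **other_values})
    return results
```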
Similar logic is applied to the VectorStore reverse-search API conversion code, with different required column names.
Finally, this ticket also changes the VectorStoreSearchOutput 'rank' column to begin at a value of 1 rather than 0, based on package user feedback.
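The 1-based rank change can be sketched as follows; this assumes ranks are derived per query with pandas `groupby().cumcount()`, which is 0-based (the column names here are illustrative, not the package's actual schema):

```python
import pandas as pd

# Hypothetical per-query search results, already sorted by descending score
df = pd.DataFrame({
    "query": ["q1", "q1", "q2"],
    "score": [0.9, 0.7, 0.8],
})

# cumcount() numbers rows within each group starting at 0;
# adding 1 makes ranks start at 1, per the user feedback.
df["rank"] = df.groupby("query").cumcount() + 1
```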
Apart from the change in ranking values, the core functionality of the VectorStore remains the same; the hook data changes only affect the server module of the package.
📜 Changes Introduced
✅ Checklist
- (`terraform fmt` & `terraform validate`)

🔍 How to Test
I set up the following script after building from source on this branch and used
`uv run test_script.py` to showcase the changes to the codebase. It uses the testdata.csv file to index a dataset and includes metadata 'country' as an extra column. I then created a post-processing hook that adds a new column containing the corresponding capital city for a row with country X. Finally, for the reverse search method, I added a trivial hook that adds a 'fruit' column with a value of 'coconut' on every row. These hooks are loaded into the VectorStore and the server is started; the tester can observe the hook-added content being returned over the API with these changes. Moving to the server and using the Swagger API docs, you can see that for the forward search endpoint the capital cities column is also pulled through into the response, and the fruit column shows as well, alongside the country metadata from the VDB. Before the current changes these data points would have been lost in the API.
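The hooks described above can be sketched as plain functions over a results dataframe. This is a hedged illustration only: `add_capitals`, `add_fruit`, and the `CAPITALS` lookup table are hypothetical names, and the package's actual hook registration API is not shown.

```python
import pandas as pd

# Assumed lookup table mapping the 'country' metadata to a capital city
CAPITALS = {"France": "Paris", "Japan": "Tokyo"}

def add_capitals(results: pd.DataFrame) -> pd.DataFrame:
    # Post-processing hook: derive a 'capital' column from the
    # 'country' metadata column returned by the search
    results["capital"] = results["country"].map(CAPITALS)
    return results

def add_fruit(results: pd.DataFrame) -> pd.DataFrame:
    # Trivial hook for the reverse search: attach a constant
    # 'fruit' column with the value 'coconut' on every row
    results["fruit"] = "coconut"
    return results
```

With the changes in this PR, columns like 'capital' and 'fruit' added by such hooks are carried through into the FastAPI response instead of being dropped at JSON conversion time.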