
Add optional _limitations metadata field for search quality tracking #68

@bdcdo


Problem

When using web search (use_search=True), there's no systematic way to track:

  • Which sources were actually consulted
  • Search failures or timeouts
  • Data quality/confidence level
  • What manual verification might be needed

Currently, users must parse free-text fields to find out whether the agent ran into difficulties.

Proposed Solution

Add an optional track_limitations parameter that creates a _limitations metadata column with structured info about each row's search quality.

Schema

from typing import Literal, Optional

from pydantic import BaseModel, Field


class SearchLimitations(BaseModel):
    """Auto-generated metadata about search quality"""
    confidence_level: Literal["high", "medium", "low"] = Field(
        description="high=multiple concordant sources, medium=some gaps, low=sparse/failed"
    )
    sources_consulted: list[str] = Field(
        description="List of sources actually consulted"
    )
    search_failures: list[str] = Field(
        default_factory=list,
        description="Any searches that failed (timeout, 429, etc.)"
    )
    limitations: str = Field(
        description="Description of limitations: outdated data, conflicts, missing info"
    )
    manual_verification_needed: Optional[str] = Field(
        default=None,
        description="Suggested manual checks if confidence is low"
    )

Usage

result_df = dataframeit(
    data=df,
    questions=MySchema,
    use_search=True,
    track_limitations=True,  # NEW
)

# Result includes _limitations column with structured metadata
print(result_df['_limitations'].iloc[0])
# {'confidence_level': 'medium', 'sources_consulted': ['Orphanet', 'FDA'], ...}

Implementation Notes

  1. With search_per_field=False: a single _limitations column for the whole row.

  2. With search_per_field=True, either:

    • One _limitations column aggregating all fields, OR
    • Per-field limitations in each field's nested dict (e.g., doenca_rara._limitations)

  3. The agent would be instructed to self-evaluate its search quality as part of the structured output.
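
For the first option under search_per_field=True, the aggregation step could look roughly like the sketch below. The helper name aggregate_limitations and the "weakest field wins" rule for confidence_level are assumptions made for illustration; neither is part of dataframeit today.

```python
from typing import Any

# Ordering used to pick the weakest confidence level (an assumption of this sketch).
_CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}


def aggregate_limitations(per_field: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Merge per-field limitation dicts into one row-level dict (hypothetical helper)."""
    if not per_field:
        raise ValueError("per_field must contain at least one entry")

    # The row is treated as only as trustworthy as its weakest field.
    confidence = min(
        (lim["confidence_level"] for lim in per_field.values()),
        key=_CONFIDENCE_RANK.__getitem__,
    )

    # Union of sources across fields, keeping first-seen order.
    sources = list(dict.fromkeys(
        src
        for lim in per_field.values()
        for src in lim.get("sources_consulted", [])
    ))

    # Prefix failures and notes with the field name so provenance is not lost.
    failures = [
        f"{field}: {failure}"
        for field, lim in per_field.items()
        for failure in lim.get("search_failures", [])
    ]
    notes = "; ".join(
        f"{field}: {lim['limitations']}"
        for field, lim in per_field.items()
        if lim.get("limitations")
    )

    return {
        "confidence_level": confidence,
        "sources_consulted": sources,
        "search_failures": failures,
        "limitations": notes,
    }
```

Taking the minimum confidence is the conservative choice: a row with one poorly sourced field still gets flagged for review. An average or a majority vote would be equally easy to swap in.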

Benefits

  1. Quality assurance: Easily filter rows with low confidence for manual review
  2. Debugging: Understand why certain searches failed
  3. Transparency: Document data provenance and limitations
  4. Reproducibility: Know which sources were consulted
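
To make the quality-assurance and debugging benefits concrete, here is a minimal sketch of how the proposed column could be filtered with pandas. The DataFrame is a hand-built stand-in for the output of dataframeit(..., track_limitations=True); the rows and the id column are made up for illustration.

```python
import pandas as pd

# Stand-in for the proposed output; values are illustrative only.
result_df = pd.DataFrame({
    "id": [1, 2, 3],
    "_limitations": [
        {"confidence_level": "high", "sources_consulted": ["Orphanet", "FDA"], "search_failures": []},
        {"confidence_level": "low", "sources_consulted": [], "search_failures": ["timeout"]},
        {"confidence_level": "medium", "sources_consulted": ["FDA"], "search_failures": []},
    ],
})

# Pull the confidence level out of each metadata dict.
confidence = result_df["_limitations"].apply(lambda lim: lim["confidence_level"])

# Rows to queue for manual review.
needs_review = result_df[confidence == "low"]

# Rows where at least one search failed, useful when debugging timeouts or 429s.
had_failures = result_df[
    result_df["_limitations"].apply(lambda lim: bool(lim["search_failures"]))
]
```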

Alternative: User-defined limitations field

Allow users to define their own limitations schema that gets appended to every search:

from pydantic import BaseModel


class MyLimitations(BaseModel):
    confianca: str  # confidence
    fontes: str     # sources
    problemas: str  # problems

result_df = dataframeit(
    ...,
    limitations_schema=MyLimitations,  # Auto-added to each row
)
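
One way the library could attach a user-supplied limitations schema to the questions schema is pydantic's create_model, sketched below. The helper with_limitations and the placeholder MySchema are hypothetical. Note that pydantic reserves underscore-prefixed names for private attributes, so this sketch names the model field limitations and assumes the rename to the _limitations column would happen when the DataFrame is built.

```python
from pydantic import BaseModel, create_model


class MySchema(BaseModel):
    """Placeholder for the user's questions schema."""
    doenca_rara: str


class MyLimitations(BaseModel):
    confianca: str  # confidence
    fontes: str     # sources
    problemas: str  # problems


def with_limitations(
    questions: type[BaseModel],
    limitations: type[BaseModel],
) -> type[BaseModel]:
    """Return a subclass of `questions` with a required `limitations` field.

    Hypothetical helper, not part of dataframeit.
    """
    return create_model(
        f"{questions.__name__}WithLimitations",
        __base__=questions,
        limitations=(limitations, ...),  # `...` marks the field as required
    )


# The extended model is what the agent would be asked to fill in.
Extended = with_limitations(MySchema, MyLimitations)
```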
