Embeddings are not used in _get_similar_columns #44

@shannonycj

In src/workflow/agents/information_retriever/tool_kit/retrieve_entity.py, semantic similarity scores are computed for (column, question_hint) pairs, and the scores are used for sorting:

similar_column_names.sort(key=lambda x: x[2], reverse=True)

However, the sort is not followed by any shortlisting, and the scores are then discarded:

table_column_pairs = list(set([(table, column) for table, column, _ in similar_column_names]))

The pairs are then regrouped into a different structure (a dict keyed by table name), so the sorting is useless as well:

similar_columns = self._get_similar_column_names(keywords=keywords, question=question, hint=hint)
for table_name, column_name in similar_columns:
    if table_name not in selected_columns:
        selected_columns[table_name] = []
    if column_name not in selected_columns[table_name]:
        selected_columns[table_name].append(column_name)
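
To make the point concrete, here is a small standalone reproduction of the three quoted steps. The table names, column names, and scores are made up for illustration and are not taken from the repository; only the three statements mirror the quoted code.

```python
# Hypothetical (table, column, score) triples; names and scores are invented.
similar_column_names = [
    ("orders", "order_date", 0.41),
    ("customers", "name", 0.93),
    ("orders", "customer_id", 0.78),
]

# Step 1: sort by score, highest first (same call as in the quoted line).
similar_column_names.sort(key=lambda x: x[2], reverse=True)
ranked = [(table, column) for table, column, _ in similar_column_names]

# Step 2: drop the score and deduplicate through a set.
# A set has no defined iteration order, so the ranking from step 1
# is not guaranteed to survive this conversion.
table_column_pairs = list(set(ranked))

# Step 3: regroup by table. Every candidate is kept, whatever its score.
selected_columns = {}
for table_name, column_name in table_column_pairs:
    if table_name not in selected_columns:
        selected_columns[table_name] = []
    if column_name not in selected_columns[table_name]:
        selected_columns[table_name].append(column_name)

print(selected_columns)
```

All three candidates end up in selected_columns, including the one with the lowest score, so the scores never influence which columns are selected.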

Essentially, keyword-based column retrieval relies only on difflib.SequenceMatcher(column_name, potential_column_name). The embeddings are computed but never used. Please correct me if I am wrong. Thanks a lot.
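
If the scores are meant to matter, one possible shape for the missing shortlisting step is sketched below. This is only an assumption about what the intended behaviour might be: shortlist_columns, top_k, and min_score are hypothetical names that do not exist in the repository, and the input data is invented.

```python
from typing import Dict, List, Tuple


def shortlist_columns(
    scored: List[Tuple[str, str, float]],
    top_k: int = 3,
    min_score: float = 0.0,
) -> Dict[str, List[str]]:
    """Hypothetical helper: keep only the top_k best-scoring (table, column) pairs."""
    # Deduplicate while keeping the best score seen for each pair.
    best: Dict[Tuple[str, str], float] = {}
    for table, column, score in scored:
        key = (table, column)
        if score >= min_score and score > best.get(key, float("-inf")):
            best[key] = score

    # Rank by score and cut the list, so the scores decide what survives.
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)[:top_k]

    # Regroup by table, as the existing loop does.
    selected: Dict[str, List[str]] = {}
    for (table, column), _ in ranked:
        selected.setdefault(table, []).append(column)
    return selected


# Invented example data: one pair appears twice with different scores.
scored = [
    ("orders", "order_date", 0.41),
    ("customers", "name", 0.93),
    ("orders", "customer_id", 0.78),
    ("customers", "name", 0.60),
]
print(shortlist_columns(scored, top_k=2))
```

With top_k=2 only customers.name and orders.customer_id remain, whereas the current code would keep all three columns.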
