@codeflash-ai codeflash-ai bot commented Dec 20, 2025

📄 77% (0.77x) speedup for convert_to_coco in unstructured/staging/base.py

⏱️ Runtime : 8.18 milliseconds → 4.63 milliseconds (best of 60 runs)

📝 Explanation and details

The optimization significantly improves performance by replacing expensive operations in the annotations generation loop with more efficient alternatives.

Key optimizations:

  1. Category lookup optimization: The original code ran a filtered list comprehension, `[x["id"] for x in categories if x["name"] == el["type"]][0]`, for every element, so each lookup was linear in the number of categories. The optimized version builds a dictionary `category_name_to_id = {cat["name"]: cat["id"] for cat in categories}` once, then uses O(1) dictionary lookups. This eliminates the repeated linear searches through the categories list.

  2. Coordinate access optimization: The original code called `el["metadata"].get("coordinates")` several times per element while computing the bbox and area. The optimized version stores the result once in `coordinates = el["metadata"].get("coordinates")` and reuses it, removing the redundant dictionary lookups.

  3. Loop structure improvement: Instead of a complex list comprehension for annotations, the optimized code uses an explicit loop with early variable assignment. This avoids recomputing the same coordinate values multiple times within the comprehension.

  4. Error handling preservation: The optimization keeps the original `IndexError` behavior for unknown element types by catching the `KeyError` raised by the dictionary lookup and re-raising it as `IndexError`.
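
The four points above can be sketched together in one loop. This is a minimal illustration, not the code from the PR: the helper name `build_annotations`, the annotation keys, and the way the bbox is derived from `points` are assumptions made for the example. Only the `category_name_to_id` mapping, the cached `coordinates` variable, and the `KeyError` to `IndexError` conversion come from the description.

```python
from __future__ import annotations

from typing import Any


def build_annotations(
    elements: list[dict[str, Any]], categories: list[dict[str, Any]]
) -> list[dict[str, Any]]:
    """Hypothetical helper illustrating the optimized annotation loop."""
    # Build the name -> id mapping once instead of scanning `categories` per element.
    category_name_to_id = {cat["name"]: cat["id"] for cat in categories}

    annotations = []
    for el in elements:
        try:
            category_id = category_name_to_id[el["type"]]
        except KeyError:
            # The original `[...][0]` raised IndexError for unknown types; keep that contract.
            raise IndexError("list index out of range") from None

        # Fetch the coordinates once and reuse them for both bbox and area.
        coordinates = el["metadata"].get("coordinates")
        if coordinates:
            points = coordinates["points"]
            # Assumed point order: width from points 0 and 2, height from points 0 and 1.
            width = float(abs(points[0][0] - points[2][0]))
            height = float(abs(points[0][1] - points[1][1]))
            bbox = [float(points[0][0]), float(points[0][1]), width, height]
            area = width * height
        else:
            # No coordinates: empty bbox and no area.
            bbox, area = [], None

        annotations.append(
            {"id": el["element_id"], "category_id": category_id, "bbox": bbox, "area": area}
        )
    return annotations
```

Re-raising as `IndexError` is what lets a test written against the old behavior, such as one using `pytest.raises(IndexError)`, keep passing unchanged.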

Performance impact: In the original line profile, the annotations section accounted for 58.4% of total time (25.41ms). In the optimized version that cost is spread across several smaller operations, giving a 77% speedup overall (8.18ms → 4.63ms).
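
The overall figure can be checked with a line of arithmetic. The sketch below assumes the report defines speedup as `original / optimized - 1`; that definition is not stated in the report, but it is consistent with the per-test numbers it lists. The function name `speedup_percent` is made up for the example.

```python
def speedup_percent(original: float, optimized: float) -> float:
    """Speedup as a percentage, assuming speedup = original / optimized - 1."""
    return (original / optimized - 1.0) * 100.0


# Overall runtimes from the report, in milliseconds.
overall = speedup_percent(8.18, 4.63)
print(f"overall: {overall:.1f}%")

# One of the generated tests, in microseconds (995μs -> 535μs).
large_case = speedup_percent(995, 535)
print(f"300-element test: {large_case:.1f}%")
```

Under that assumption the overall value comes out at about 76.7%, which rounds to the 77% shown in the title.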

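Test results indicate: The optimization is most effective on larger inputs: the 500-element test is 79.9% faster and the 999-element test is 94.2% faster, while most of the single-element tests run a few percent slower, presumably because of the one-time cost of building the dictionary. Replacing the per-element scan of the categories list with a dictionary lookup reduces the annotation loop from O(n·k) to O(n + k) for n elements and k categories.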
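
One way to see why the gain grows with input size is to count the work instead of timing it. The sketch below is illustrative only: the function names, the synthetic category list, and the "steps" counter are assumptions for the example, not code from `convert_to_coco`. The scan mirrors the filtered list comprehension, which visits every category for every element.

```python
def ids_by_scan(element_types, categories):
    """Look up each type with a full scan, counting name comparisons."""
    comparisons = 0
    ids = []
    for type_name in element_types:
        matches = []
        for cat in categories:
            comparisons += 1  # one name comparison per category, per element
            if cat["name"] == type_name:
                matches.append(cat["id"])
        ids.append(matches[0])
    return ids, comparisons


def ids_by_dict(element_types, categories):
    """Look up each type through a dictionary built once up front."""
    name_to_id = {cat["name"]: cat["id"] for cat in categories}
    ids = [name_to_id[type_name] for type_name in element_types]
    # Count one step per category (building the dict) plus one per lookup.
    steps = len(categories) + len(element_types)
    return ids, steps


# Synthetic input: 50 categories, 1000 elements of the same type.
categories = [{"id": i, "name": f"type{i}"} for i in range(50)]
element_types = ["type49"] * 1000

scan_ids, scan_comparisons = ids_by_scan(element_types, categories)
dict_ids, dict_steps = ids_by_dict(element_types, categories)
print(scan_comparisons, dict_steps)
```

Both functions return the same ids; only the amount of work differs, and the gap widens as either the number of elements or the number of categories grows.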

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 38 Passed |
| 🌀 Generated Regression Tests | 20 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 1 Passed |
| 📊 Tests Coverage | 100.0% |

⚙️ Existing Unit Tests and Runtime

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| `staging/test_base.py::test_convert_to_coco` | 145μs | 147μs | -0.877%⚠️ |

🌀 Generated Regression Tests and Runtime
from __future__ import annotations

# Patch the import inside convert_to_coco
from typing import Optional

# imports
from unstructured.staging.base import convert_to_coco

# --- Minimal stubs for dependencies ---

# Simulate TYPE_TO_TEXT_ELEMENT_MAP
TYPE_TO_TEXT_ELEMENT_MAP = {
    "Title": "TitleText",
    "Paragraph": "ParagraphText",
    "Table": "TableText",
    "List": "ListText",
}


# Minimal CoordinatesMetadata and ElementMetadata for the Element class
class CoordinatesMetadata:
    def __init__(self, points=None, system=None):
        self.points = points
        self.system = system

    def to_dict(self):
        d = {}
        if self.points is not None:
            d["points"] = self.points
        if self.system is not None:
            d["system"] = self.system
        # For compatibility with convert_to_coco
        d["layout_width"] = abs(self.points[0][0] - self.points[2][0]) if self.points else None
        d["layout_height"] = abs(self.points[0][1] - self.points[1][1]) if self.points else None
        return d


class ElementMetadata:
    def __init__(self):
        self.coordinates = None
        self.detection_origin = None
        self.file_directory = ""
        self.filename = ""
        self.page_number = ""

    def to_dict(self):
        d = {}
        if self.coordinates:
            d["coordinates"] = self.coordinates.to_dict()
        if self.detection_origin:
            d["detection_origin"] = self.detection_origin
        if self.file_directory:
            d["file_directory"] = self.file_directory
        if self.filename:
            d["filename"] = self.filename
        if self.page_number:
            d["page_number"] = self.page_number
        return d


# Minimal Element class
class Element:
    def __init__(
        self,
        element_id: Optional[str] = None,
        type_: Optional[str] = None,
        text: str = "",
        coordinates: Optional[tuple[tuple[float, float], ...]] = None,
        file_directory: str = "",
        filename: str = "",
        page_number: str = "",
    ):
        self._element_id = element_id or "id"
        self.type = type_ or "Paragraph"
        self.text = text
        self.metadata = ElementMetadata()
        if coordinates is not None:
            self.metadata.coordinates = CoordinatesMetadata(points=coordinates)
        self.metadata.file_directory = file_directory
        self.metadata.filename = filename
        self.metadata.page_number = page_number

    @property
    def id(self):
        return self._element_id

    def to_dict(self):
        return {
            "type": self.type,
            "element_id": self.id,
            "text": self.text,
            "metadata": self.metadata.to_dict(),
        }


# --- Unit tests for convert_to_coco ---

# BASIC TEST CASES


def test_empty_elements():
    # Test with no elements
    codeflash_output = convert_to_coco([])
    result = codeflash_output  # 13.2μs -> 15.0μs (11.9% slower)


def test_single_element_with_coordinates():
    # Test with one element with coordinates
    coords = ((0, 0), (0, 10), (20, 0), (20, 10))
    el = Element(
        element_id="el1",
        type_="Paragraph",
        text="Hello world",
        coordinates=coords,
        file_directory="/some/dir",
        filename="file.pdf",
        page_number="1",
    )
    codeflash_output = convert_to_coco(
        [el], dataset_description="desc", dataset_version="2.0", contributors=("Alice", "Bob")
    )
    result = codeflash_output  # 16.8μs -> 17.3μs (2.88% slower)
    ann = result["annotations"][0]
    # category_id matches Paragraph
    cat_id = [c["id"] for c in result["categories"] if c["name"] == "Paragraph"][0]


def test_duplicate_images_are_deduped():
    # Two elements with same image metadata -> only one image in result
    coords = ((0, 0), (0, 10), (20, 0), (20, 10))
    el1 = Element(
        element_id="el1",
        type_="Paragraph",
        text="A",
        coordinates=coords,
        file_directory="/dir",
        filename="file.pdf",
        page_number="1",
    )
    el2 = Element(
        element_id="el2",
        type_="Paragraph",
        text="B",
        coordinates=coords,
        file_directory="/dir",
        filename="file.pdf",
        page_number="1",
    )
    codeflash_output = convert_to_coco([el1, el2])
    result = codeflash_output  # 30.0μs -> 31.1μs (3.35% slower)


# EDGE TEST CASES


def test_element_with_missing_metadata_fields():
    # Element with no file_directory, filename, page_number, or coordinates
    el = Element(element_id="el1", type_="Title", text="T", coordinates=None)
    # Remove all metadata fields
    el.metadata.file_directory = ""
    el.metadata.filename = ""
    el.metadata.page_number = ""
    codeflash_output = convert_to_coco([el])
    result = codeflash_output  # 17.8μs -> 18.7μs (4.68% slower)
    img = result["images"][0]
    # Annotation has bbox as [] and area as None
    ann = result["annotations"][0]


def test_element_with_minimal_coordinates():
    # Coordinates with all zeros
    coords = ((0, 0), (0, 0), (0, 0), (0, 0))
    el = Element(element_id="el1", type_="Table", text="tab", coordinates=coords)
    codeflash_output = convert_to_coco([el])
    result = codeflash_output  # 18.1μs -> 18.8μs (3.33% slower)
    ann = result["annotations"][0]


def test_element_with_negative_coordinates():
    # Coordinates with negative values
    coords = ((-10, -5), (-10, 5), (10, -5), (10, 5))
    el = Element(element_id="el1", type_="Paragraph", text="P", coordinates=coords)
    codeflash_output = convert_to_coco([el])
    result = codeflash_output  # 17.3μs -> 18.1μs (4.36% slower)
    ann = result["annotations"][0]


def test_categories_are_sorted_and_unique():
    # Add elements with all types, including repeated types
    els = [
        Element(element_id="e1", type_="Title"),
        Element(element_id="e2", type_="Table"),
        Element(element_id="e3", type_="Paragraph"),
        Element(element_id="e4", type_="Paragraph"),
    ]
    codeflash_output = convert_to_coco(els)
    result = codeflash_output  # 30.5μs -> 28.2μs (8.11% faster)
    # Categories are sorted and unique
    names = [c["name"] for c in result["categories"]]


# LARGE SCALE TEST CASES


def test_performance_with_large_elements(monkeypatch):
    # Patch datetime to avoid repeated calls for performance
    class DummyDatetime:
        @classmethod
        def now(cls):
            class D:
                def strftime(self, fmt):
                    return "2023-01-01"

                @property
                def year(self):
                    return 2023

                @property
                def date(self):
                    return self

                def isoformat(self):
                    return "2023-01-01"

            return D()

    monkeypatch.setattr("datetime.datetime", DummyDatetime)
    # 500 elements
    elements = [
        Element(
            element_id=f"id{i}",
            type_="Paragraph",
            text="x",
            coordinates=((0, 0), (0, 1), (1, 0), (1, 1)),
            file_directory="d",
            filename="f",
            page_number="1",
        )
        for i in range(500)
    ]
    # Should run efficiently
    codeflash_output = convert_to_coco(elements)
    result = codeflash_output  # 1.74ms -> 966μs (79.9% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from datetime import date
from typing import Optional

# imports
import pytest

from unstructured.staging.base import convert_to_coco

# --- Minimal stubs for required classes and structures ---

# Simulate TYPE_TO_TEXT_ELEMENT_MAP as in unstructured.documents.elements
TYPE_TO_TEXT_ELEMENT_MAP = {
    "Title": "Title",
    "NarrativeText": "NarrativeText",
    "ListItem": "ListItem",
    "Table": "Table",
    "Figure": "Figure",
}


# Minimal CoordinatesMetadata and ElementMetadata for test
class CoordinatesMetadata:
    def __init__(self, points=None, system=None, layout_width=None, layout_height=None):
        self.points = points
        self.system = system
        self.layout_width = layout_width
        self.layout_height = layout_height

    def to_dict(self):
        d = {}
        if self.points is not None:
            d["points"] = self.points
        if self.layout_width is not None:
            d["layout_width"] = self.layout_width
        if self.layout_height is not None:
            d["layout_height"] = self.layout_height
        return d


class ElementMetadata:
    def __init__(
        self,
        coordinates=None,
        file_directory="",
        filename="",
        page_number="",
        detection_origin=None,
    ):
        self.coordinates = coordinates
        self.file_directory = file_directory
        self.filename = filename
        self.page_number = page_number
        self.detection_origin = detection_origin

    def to_dict(self):
        d = {}
        if self.coordinates is not None:
            d["coordinates"] = self.coordinates.to_dict()
        if self.file_directory:
            d["file_directory"] = self.file_directory
        if self.filename:
            d["filename"] = self.filename
        if self.page_number:
            d["page_number"] = self.page_number
        if self.detection_origin:
            d["detection_origin"] = self.detection_origin
        return d


# Minimal Element class for test
class Element:
    def __init__(
        self,
        type_: str,
        element_id: str,
        text: str = "",
        metadata: Optional[ElementMetadata] = None,
    ):
        self.type = type_
        self.element_id = element_id
        self.text = text
        self.metadata = metadata if metadata is not None else ElementMetadata()

    def to_dict(self):
        return {
            "type": self.type,
            "element_id": self.element_id,
            "text": self.text,
            "metadata": self.metadata.to_dict(),
        }


# --- Unit tests ---

# Basic Test Cases


def test_empty_elements():
    # Test with no elements
    codeflash_output = convert_to_coco([])
    result = codeflash_output  # 16.6μs -> 19.1μs (13.1% slower)
    # Categories should be all keys from TYPE_TO_TEXT_ELEMENT_MAP
    expected_cats = sorted(TYPE_TO_TEXT_ELEMENT_MAP.keys())
    actual_cats = sorted([c["name"] for c in result["categories"]])


def test_single_element_with_coordinates():
    # Test a single element with coordinates
    coords = ((1.0, 2.0), (1.0, 6.0), (5.0, 6.0), (5.0, 2.0))
    coords_md = CoordinatesMetadata(points=coords, layout_width=4.0, layout_height=4.0)
    metadata = ElementMetadata(
        coordinates=coords_md, file_directory="dir", filename="file.pdf", page_number="1"
    )
    el = Element("Title", "id1", "My Title", metadata)
    codeflash_output = convert_to_coco([el])
    result = codeflash_output  # 21.9μs -> 22.3μs (1.86% slower)
    img = result["images"][0]
    ann = result["annotations"][0]
    # Category id should match the Title category
    title_cat_id = [c["id"] for c in result["categories"] if c["name"] == "Title"][0]


def test_dataset_description_version_contributors():
    # Test custom dataset_description, version, contributors
    el = Element("Title", "id1")
    desc = "My custom dataset"
    version = "2.1"
    contributors = ("Alice", "Bob")
    codeflash_output = convert_to_coco(
        [el], dataset_description=desc, dataset_version=version, contributors=contributors
    )
    result = codeflash_output  # 13.5μs -> 14.2μs (5.00% slower)
    info = result["info"]
    # date_created should be a valid date string
    date.fromisoformat(info["date_created"])


def test_elements_with_missing_metadata_fields():
    # Element with no coordinates, file_directory, filename, page_number
    el = Element("NarrativeText", "nid")
    codeflash_output = convert_to_coco([el])
    result = codeflash_output  # 14.9μs -> 15.5μs (4.02% slower)
    img = result["images"][0]
    ann = result["annotations"][0]


# Edge Test Cases


def test_elements_with_duplicate_image_metadata():
    # Two elements with identical image metadata should deduplicate images
    coords = ((0, 0), (0, 1), (1, 1), (1, 0))
    coords_md = CoordinatesMetadata(points=coords, layout_width=1.0, layout_height=1.0)
    meta = ElementMetadata(coordinates=coords_md, file_directory="a", filename="b", page_number="1")
    el1 = Element("Title", "id1", metadata=meta)
    el2 = Element("Title", "id2", metadata=meta)
    codeflash_output = convert_to_coco([el1, el2])
    result = codeflash_output  # 20.4μs -> 19.2μs (6.06% faster)


def test_element_with_extra_metadata_fields():
    # Metadata with extra, unused fields should not break
    class ExtraMeta(ElementMetadata):
        def to_dict(self):
            d = super().to_dict()
            d["extra"] = "something"
            return d

    el = Element("Figure", "fid", metadata=ExtraMeta())
    codeflash_output = convert_to_coco([el])
    result = codeflash_output  # 24.2μs -> 23.7μs (2.11% faster)
    img = result["images"][0]


def test_element_with_unknown_type_raises():
    # Element type not in TYPE_TO_TEXT_ELEMENT_MAP should raise IndexError (original behavior)
    el = Element("UnknownType", "uid")
    with pytest.raises(IndexError):
        convert_to_coco([el])  # 18.3μs -> 19.1μs (4.14% slower)


def test_elements_with_nonstring_ids():
    # Element with non-string id should work as long as it's hashable (since function expects string, but our stub allows any)
    el = Element("Title", 123)
    codeflash_output = convert_to_coco([el])
    result = codeflash_output  # 18.0μs -> 18.5μs (2.70% slower)


# Large Scale Test Cases


def test_large_number_of_elements_with_duplicate_metadata():
    # 300 elements, but only 3 unique image metadata
    N = 300
    metas = []
    for i in range(3):
        coords = ((i, i), (i, i + 1), (i + 1, i + 1), (i + 1, i))
        coords_md = CoordinatesMetadata(points=coords, layout_width=1, layout_height=1)
        metas.append(
            ElementMetadata(
                coordinates=coords_md,
                file_directory="dir",
                filename=f"f{i}.pdf",
                page_number=str(i),
            )
        )
    elements = []
    for i in range(N):
        m = metas[i % 3]
        elements.append(Element("Table", f"id{i}", metadata=m))
    codeflash_output = convert_to_coco(elements)
    result = codeflash_output  # 995μs -> 535μs (86.0% faster)


def test_performance_on_near_1000_elements():
    # This test is not for timing, but to ensure the function completes and output is correct
    N = 999
    elements = []
    for i in range(N):
        coords = ((i, i), (i, i + 1), (i + 1, i + 1), (i + 1, i))
        coords_md = CoordinatesMetadata(points=coords, layout_width=1, layout_height=1)
        meta = ElementMetadata(coordinates=coords_md)
        elements.append(Element("ListItem", f"id{i}", metadata=meta))
    codeflash_output = convert_to_coco(elements)
    result = codeflash_output  # 3.24ms -> 1.67ms (94.2% faster)
    # Check that all annotation ids are present
    ids = set(ann["id"] for ann in result["annotations"])


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from unstructured.documents.elements import Address
from unstructured.staging.base import convert_to_coco


def test_convert_to_coco():
    convert_to_coco(
        Address(
            "",
            element_id=None,
            coordinates=None,
            coordinate_system=None,
            metadata=None,
            detection_origin="",
            embeddings=[],
        ),
        dataset_description="",
        dataset_version="",
        contributors="",
    )


def test_convert_to_coco_2():
    convert_to_coco((), dataset_description="\x00", dataset_version="", contributors="")
🔎 Concolic Coverage Tests and Runtime

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| `codeflash_concolic_e8goshnj/tmpk37npm8d/test_concolic_coverage.py::test_convert_to_coco_2` | 15.9μs | 18.8μs | -15.3%⚠️ |

To edit these changes, run `git checkout codeflash/optimize-convert_to_coco-mje6h0n3` and push.

@codeflash-ai codeflash-ai bot requested a review from aseembits93 December 20, 2025 10:51
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) labels Dec 20, 2025