Skip to content

Conversation

@codeflash-ai
Copy link

@codeflash-ai codeflash-ai bot commented Dec 20, 2025

📄 28% (0.28x) speedup for object_detection_classes in unstructured/partition/pdf_image/analysis/layout_dump.py

⏱️ Runtime : 19.8 microseconds 15.5 microseconds (best of 86 runs)

📝 Explanation and details

The optimization applies static pre-computation by moving the expensive list(LABEL_MAP.values()) operations outside the function and storing the results in module-level constants _YOLOX_CLASSES and _DETECTRON_CLASSES.

Key changes:

  • Eliminates repeated dictionary value extraction and list conversion on every function call
  • Replaces runtime list(YOLOX_LABEL_MAP.values()) and list(DETECTRON_LABEL_MAP.values()) with direct constant references

Why this is faster:
The original code calls list(dict.values()) every time the function executes, which involves iterating through dictionary values and creating a new list. With static pre-computation, this work happens only once at module import time, and subsequent calls simply return the pre-built lists.

Performance impact based on usage:
Looking at the function reference, object_detection_classes is called from a dump() method in layout analysis, suggesting it's likely called multiple times during PDF processing workflows. The 27% speedup (19.8μs → 15.5μs) becomes significant when processing many documents or layout elements.

Test case optimization patterns:

  • Small label maps (10 classes): 31-37% faster
  • Large label maps (1000 classes): 32-44% faster, showing the optimization scales well with label map size
  • Repeated calls: Up to 57% faster on subsequent calls, demonstrating the benefit of avoiding repeated list construction

This optimization is particularly effective for workloads that repeatedly query model classes during document processing pipelines.

Correctness verification report:

Test Status
⚙️ Existing Unit Tests 🔘 None Found
🌀 Generated Regression Tests 16 Passed
⏪ Replay Tests 🔘 None Found
🔎 Concolic Coverage Tests 🔘 None Found
📊 Tests Coverage 83.3%
🌀 Generated Regression Tests and Runtime
# imports
from unstructured.partition.pdf_image.analysis.layout_dump import object_detection_classes


# Simulate the external dependencies and label maps for test purposes
class UnstructuredYoloXModel:
    pass


class UnstructuredDetectronONNXModel:
    pass


YOLOX_LABEL_MAP = {
    0: "person",
    1: "bicycle",
    2: "car",
    3: "motorcycle",
    4: "airplane",
    5: "bus",
    6: "train",
    7: "truck",
    8: "boat",
    9: "traffic light",
}

DETECTRON_LABEL_MAP = {
    0: "background",
    1: "person",
    2: "bicycle",
    3: "car",
    4: "motorcycle",
    5: "airplane",
    6: "bus",
    7: "train",
    8: "truck",
    9: "boat",
}


# Simulated get_model function for testing
def get_model(model_name):
    if model_name == "yolox":
        return UnstructuredYoloXModel()
    elif model_name == "detectron":
        return UnstructuredDetectronONNXModel()
    elif model_name == "yolox_custom":
        return UnstructuredYoloXModel()
    elif model_name == "detectron_custom":
        return UnstructuredDetectronONNXModel()
    else:
        return "unknown_model_type"


# unit tests

# Basic Test Cases


def test_yolox_returns_correct_classes():
    # Test that YOLOX returns the correct class names
    expected = [
        "person",
        "bicycle",
        "car",
        "motorcycle",
        "airplane",
        "bus",
        "train",
        "truck",
        "boat",
        "traffic light",
    ]
    codeflash_output = object_detection_classes("yolox")
    result = codeflash_output  # 917ns -> 667ns (37.5% faster)


def test_large_yolox_label_map():
    # Test with a large YOLOX label map
    large_map = {i: f"class_{i}" for i in range(1000)}
    global YOLOX_LABEL_MAP
    old_map = YOLOX_LABEL_MAP
    YOLOX_LABEL_MAP = large_map
    try:
        codeflash_output = object_detection_classes("yolox")
        result = codeflash_output
    finally:
        YOLOX_LABEL_MAP = old_map  # Restore original map


def test_performance_large_scale():
    # Test that function executes quickly for large inputs (not a strict timing test, but ensures no crash)
    large_map = {i: f"fast_class_{i}" for i in range(999)}
    global YOLOX_LABEL_MAP
    old_map = YOLOX_LABEL_MAP
    YOLOX_LABEL_MAP = large_map
    try:
        codeflash_output = object_detection_classes("yolox")
        result = codeflash_output
    finally:
        YOLOX_LABEL_MAP = old_map  # Restore original map


# Edge case: Label maps with duplicate values
def test_duplicate_class_names_in_label_map():
    # Test that duplicate values in label map are preserved in output
    dup_map = {0: "person", 1: "person", 2: "car"}
    global YOLOX_LABEL_MAP
    old_map = YOLOX_LABEL_MAP
    YOLOX_LABEL_MAP = dup_map
    try:
        codeflash_output = object_detection_classes("yolox")
        result = codeflash_output
    finally:
        YOLOX_LABEL_MAP = old_map


# Edge case: Label map with non-string values
def test_non_string_class_names_in_label_map():
    # Test that non-string values in label map are returned as-is
    non_string_map = {0: "person", 1: 42, 2: None}
    global YOLOX_LABEL_MAP
    old_map = YOLOX_LABEL_MAP
    YOLOX_LABEL_MAP = non_string_map
    try:
        codeflash_output = object_detection_classes("yolox")
        result = codeflash_output
    finally:
        YOLOX_LABEL_MAP = old_map


# Edge case: Label map is empty
def test_empty_label_map():
    # Test that an empty label map returns an empty list
    empty_map = {}
    global YOLOX_LABEL_MAP
    old_map = YOLOX_LABEL_MAP
    YOLOX_LABEL_MAP = empty_map
    try:
        codeflash_output = object_detection_classes("yolox")
        result = codeflash_output
    finally:
        YOLOX_LABEL_MAP = old_map


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
# imports
import pytest

# function to test
from unstructured.partition.pdf_image.analysis.layout_dump import object_detection_classes

# unit tests

# --- Basic Test Cases ---


def test_yolox_model_returns_correct_classes():
    # Test that a YOLOX model name returns the correct class list
    # We use a known YOLOX model name from the library
    model_name = "yolox"
    codeflash_output = object_detection_classes(model_name)
    result = codeflash_output  # 875ns -> 666ns (31.4% faster)


def test_detectron_model_returns_correct_classes():
    # Test that a Detectron model name returns the correct class list
    model_name = "detectron2_onnx"
    codeflash_output = object_detection_classes(model_name)
    result = codeflash_output  # 1.54μs -> 1.33μs (15.7% faster)


def test_yolox_and_detectron_class_lists_are_different():
    # The class lists for YOLOX and Detectron should not be identical
    yolox_classes = set(object_detection_classes("yolox"))  # 875ns -> 666ns (31.4% faster)
    detectron_classes = set(
        object_detection_classes("detectron2_onnx")
    )  # 833ns -> 708ns (17.7% faster)


# --- Edge Test Cases ---


def test_numeric_model_name_raises_type_error_or_value_error():
    # Passing a numeric model name should raise an error
    with pytest.raises(Exception) as excinfo:
        object_detection_classes(123)  # 2.67μs -> 2.75μs (3.02% slower)


def test_large_number_of_classes_in_yolox_label_map(monkeypatch):
    # Simulate a YOLOX_LABEL_MAP with 1000 classes
    large_label_map = {i: f"class_{i}" for i in range(1000)}
    monkeypatch.setattr("unstructured_inference.models.yolox.YOLOX_LABEL_MAP", large_label_map)
    # The returned list should have 1000 elements, all unique
    codeflash_output = object_detection_classes("yolox")
    result = codeflash_output  # 1.54μs -> 1.17μs (32.0% faster)
    # All class names should start with "class_"
    for cls in result:
        pass


def test_large_number_of_classes_in_detectron_label_map(monkeypatch):
    # Simulate a DETECTRON_LABEL_MAP with 999 classes
    large_label_map = {i: f"dclass_{i}" for i in range(999)}
    monkeypatch.setattr(
        "unstructured_inference.models.detectron2onnx.DEFAULT_LABEL_MAP", large_label_map
    )
    codeflash_output = object_detection_classes("detectron2_onnx")
    result = codeflash_output  # 1.96μs -> 1.42μs (38.2% faster)
    for cls in result:
        pass


def test_performance_with_large_label_map(monkeypatch):
    # This test checks that the function does not take excessive time with large label maps
    import time

    large_label_map = {i: f"perfclass_{i}" for i in range(1000)}
    monkeypatch.setattr("unstructured_inference.models.yolox.YOLOX_LABEL_MAP", large_label_map)
    start = time.time()
    codeflash_output = object_detection_classes("yolox")
    result = codeflash_output  # 1.08μs -> 750ns (44.4% faster)
    end = time.time()


def test_returned_list_is_not_modified_by_caller():
    # Modifying the returned list should not affect future calls
    codeflash_output = object_detection_classes("yolox")
    orig = codeflash_output  # 958ns -> 708ns (35.3% faster)
    copy = orig.copy()
    copy.append("new_class")
    # A fresh call should not include the new class
    codeflash_output = object_detection_classes("yolox")  # 458ns -> 291ns (57.4% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes git checkout codeflash/optimize-object_detection_classes-mje75g8x and push.

Codeflash Static Badge

The optimization applies **static pre-computation** by moving the expensive `list(LABEL_MAP.values())` operations outside the function and storing the results in module-level constants `_YOLOX_CLASSES` and `_DETECTRON_CLASSES`.

**Key changes:**
- Eliminates repeated dictionary value extraction and list conversion on every function call
- Replaces runtime `list(YOLOX_LABEL_MAP.values())` and `list(DETECTRON_LABEL_MAP.values())` with direct constant references

**Why this is faster:**
The original code calls `list(dict.values())` every time the function executes, which involves iterating through dictionary values and creating a new list. With static pre-computation, this work happens only once at module import time, and subsequent calls simply return the pre-built lists.

**Performance impact based on usage:**
Looking at the function reference, `object_detection_classes` is called from a `dump()` method in layout analysis, suggesting it's likely called multiple times during PDF processing workflows. The 27% speedup (19.8μs → 15.5μs) becomes significant when processing many documents or layout elements.

**Test case optimization patterns:**
- Small label maps (10 classes): 31-37% faster
- Large label maps (1000 classes): 32-44% faster, showing the optimization scales well with label map size
- Repeated calls: Up to 57% faster on subsequent calls, demonstrating the benefit of avoiding repeated list construction

This optimization is particularly effective for workloads that repeatedly query model classes during document processing pipelines.
@codeflash-ai codeflash-ai bot requested a review from aseembits93 December 20, 2025 11:10
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Dec 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant