@codeflash-ai codeflash-ai bot commented Dec 20, 2025

📄 12% (0.12x) speedup for filter_element_types in unstructured/staging/base.py

⏱️ Runtime : 544 microseconds -> 487 microseconds (best of 104 runs)
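
The percentages in this report appear to follow speedup = original / optimized - 1. That formula is inferred from the numbers shown here, not taken from Codeflash documentation, and `speedup_pct` is a hypothetical helper used only to check the arithmetic:

```python
def speedup_pct(original: float, optimized: float) -> float:
    """Relative speedup in percent; negative values mean the optimized code is slower."""
    return (original / optimized - 1.0) * 100.0


# Overall runtime from this report, in microseconds.
overall = speedup_pct(544, 487)
# test_large_number_of_elements_include, in microseconds.
large = speedup_pct(36.9, 26.3)
# test_filter_element_types_with_include_element_type, in microseconds.
small = speedup_pct(1.92, 2.42)

print(f"overall {overall:.1f}%, large {large:.1f}%, small {small:.1f}%")
```

Under this reading, the overall figure rounds to the 12% in the title, and a negative value corresponds to the "slower" annotations on the small tests.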

📝 Explanation and details

The optimization achieves a speedup of about 12% (544μs to 487μs) through two key changes that reduce Python overhead:

1. Generator Expression in exactly_one()

  • Changed sum([(arg is not None and arg != "") for arg in kwargs.values()]) to sum((arg is not None and arg != "") for arg in kwargs.values())
  • Eliminates creation of an intermediate list, reducing memory allocation overhead
  • Though this function shows minimal improvement in isolation, it's called frequently (94 times in the profiler)
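
A minimal sketch of the two forms side by side. The helper names `count_specified_list` and `count_specified_gen` are hypothetical and only wrap the expression quoted above; they are not the actual `exactly_one()` implementation:

```python
def count_specified_list(**kwargs) -> int:
    # Original form: builds a full list of booleans, then sums it.
    return sum([(arg is not None and arg != "") for arg in kwargs.values()])


def count_specified_gen(**kwargs) -> int:
    # Optimized form: sum() consumes the booleans one at a time,
    # so no temporary list is allocated.
    return sum((arg is not None and arg != "") for arg in kwargs.values())


# One argument is specified, the other is None.
args = {"include_element_types": ["Title"], "exclude_element_types": None}
print(count_specified_list(**args), count_specified_gen(**args))
```

Both forms count the arguments that are neither `None` nor an empty string, so a caller can compare the count against 1 either way.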

2. List Comprehensions Replace Manual Loops in filter_element_types()

  • Replaced explicit for loops that call filtered_elements.append() with direct list comprehensions
  • return [element for element in elements if type(element) in include_element_types]
  • return [element for element in elements if type(element) not in exclude_element_types]
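
The change can be sketched as follows for the include path. The function names and stub classes are illustrative assumptions, not the code in unstructured/staging/base.py; only the comprehension itself is taken from the description above:

```python
class Title:
    pass


class NarrativeText:
    pass


class CustomTitle(Title):
    pass


def filter_with_loop(elements, include_element_types):
    # Before: explicit loop with one .append() call per kept element.
    filtered_elements = []
    for element in elements:
        if type(element) in include_element_types:
            filtered_elements.append(element)
    return filtered_elements


def filter_with_comprehension(elements, include_element_types):
    # After: the same filter expressed as a list comprehension.
    return [element for element in elements if type(element) in include_element_types]


elements = [Title(), NarrativeText(), CustomTitle()]
kept = filter_with_comprehension(elements, [Title])
print([type(e).__name__ for e in kept])
```

Because the check uses `type(element)` rather than `isinstance`, a subclass instance such as `CustomTitle` is not matched by `Title` in either version.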

Why This Speeds Up Execution:

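  • Reduced Python bytecode overhead: A list comprehension still runs as Python bytecode, but CPython appends each item with a dedicated LIST_APPEND instruction instead of looking up and calling the .append() method on every iteration
  • Fewer function calls: Eliminates repeated append() method calls, which have per-call overhead
  • Memory behavior is largely unchanged: CPython does not pre-allocate the result of a filtered comprehension, so the gain comes from the cheaper append path rather than from allocation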
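
One way to inspect the difference is to compare the opcodes CPython generates for each version. This is a sketch that assumes CPython; the function names are hypothetical, and opcode names other than `LIST_APPEND` vary between Python versions:

```python
import dis
import types


def filter_with_loop(elements, wanted):
    out = []
    for element in elements:
        if type(element) in wanted:
            out.append(element)
    return out


def filter_with_comprehension(elements, wanted):
    return [element for element in elements if type(element) in wanted]


def opnames(code):
    """Collect opcode names from a code object and any nested code objects."""
    names = {ins.opname for ins in dis.get_instructions(code)}
    for const in code.co_consts:
        # Older CPython versions compile a comprehension to a nested code object.
        if isinstance(const, types.CodeType):
            names |= opnames(const)
    return names


loop_ops = opnames(filter_with_loop.__code__)
comp_ops = opnames(filter_with_comprehension.__code__)
print("LIST_APPEND in loop:", "LIST_APPEND" in loop_ops)
print("LIST_APPEND in comprehension:", "LIST_APPEND" in comp_ops)
```

The helper walks nested code objects so the check works whether or not the interpreter inlines the comprehension into the enclosing function.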

Performance Impact by Test Case:

  • Large datasets benefit most: Tests with 1,000 elements run 3-40% faster, and most gain 23-40% (e.g., test_large_number_of_elements_include goes from 36.9μs to 26.3μs). The smallest gains (3.44% and 9.68%) come from the two tests where no element passes the filter, so nothing is appended
  • Small datasets have modest overhead: Tests with a handful of elements run 0-21% slower, possibly due to list comprehension setup costs. In absolute terms this is a fraction of a microsecond per call
  • The optimization is most effective when filtering large collections, which is likely typical of document processing workflows where this function operates on many document elements
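
A rough way to reproduce the small-versus-large comparison locally is a `timeit` micro-benchmark. This is a sketch with stub classes and hypothetical function names, not the Codeflash harness, and absolute timings will differ by machine and Python version:

```python
import timeit


class Title:
    pass


class NarrativeText:
    pass


def filter_with_loop(elements, wanted):
    out = []
    for element in elements:
        if type(element) in wanted:
            out.append(element)
    return out


def filter_with_comprehension(elements, wanted):
    return [element for element in elements if type(element) in wanted]


def best_time(func, elements, repeat=5, number=200):
    """Best per-call time in seconds over several repeats."""
    runs = timeit.repeat(lambda: func(elements, [Title]), repeat=repeat, number=number)
    return min(runs) / number


small = [Title(), NarrativeText(), Title()]
large = [Title() if i % 2 == 0 else NarrativeText() for i in range(1000)]

for label, data in (("small", small), ("large", large)):
    loop_t = best_time(filter_with_loop, data)
    comp_t = best_time(filter_with_comprehension, data)
    print(f"{label}: loop {loop_t * 1e6:.2f}us, comprehension {comp_t * 1e6:.2f}us")
```

Taking the minimum over repeats reduces the influence of scheduling noise, which matters most for the small inputs where each call takes about a microsecond.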

The optimization preserves behavior (all 37 existing and 36 generated tests pass) while providing measurable gains for workloads involving larger element collections.

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 37 Passed |
| 🌀 Generated Regression Tests | 36 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
⚙️ Existing Unit Tests and Runtime

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| staging/test_base.py::test_filter_element_types_with_exclude_and_include_element_type | 2.58μs | 2.58μs | 0.000%✅ |
| staging/test_base.py::test_filter_element_types_with_exclude_element_type | 1.75μs | 2.12μs | -17.6%⚠️ |
| staging/test_base.py::test_filter_element_types_with_include_element_type | 1.92μs | 2.42μs | -20.7%⚠️ |
🌀 Generated Regression Tests and Runtime
# imports
import pytest

from unstructured.staging.base import filter_element_types

# --- Minimal stubs for Element and subclasses for testing ---


class Element:
    """Base element class for testing."""

    def __init__(self, value=None):
        self.value = value


class Title(Element):
    pass


class NarrativeText(Element):
    pass


class ListItem(Element):
    pass


class Table(Element):
    pass


class Figure(Element):
    pass


# --- Unit tests ---

# ------------------ BASIC TEST CASES ------------------


def test_include_element_types_basic():
    # Only include Title and ListItem elements
    elements = [Title("t1"), NarrativeText("n1"), ListItem("l1"), Title("t2")]
    codeflash_output = filter_element_types(elements, include_element_types=[Title, ListItem])
    result = codeflash_output  # 2.00μs -> 2.25μs (11.1% slower)


def test_exclude_element_types_basic():
    # Exclude NarrativeText elements
    elements = [Title("t1"), NarrativeText("n1"), ListItem("l1"), NarrativeText("n2")]
    codeflash_output = filter_element_types(elements, exclude_element_types=[NarrativeText])
    result = codeflash_output  # 1.62μs -> 1.92μs (15.2% slower)


def test_include_element_types_single_type():
    # Only include Title elements
    elements = [Title("t1"), NarrativeText("n1"), ListItem("l1"), Title("t2")]
    codeflash_output = filter_element_types(elements, include_element_types=[Title])
    result = codeflash_output  # 1.46μs -> 1.58μs (7.90% slower)


def test_exclude_element_types_single_type():
    # Exclude ListItem elements
    elements = [Title("t1"), ListItem("l1"), NarrativeText("n1"), ListItem("l2")]
    codeflash_output = filter_element_types(elements, exclude_element_types=[ListItem])
    result = codeflash_output  # 1.46μs -> 1.62μs (10.3% slower)


def test_include_and_exclude_are_mutually_exclusive():
    # Both include and exclude specified: should raise ValueError
    elements = [Title("t1"), ListItem("l1")]
    with pytest.raises(ValueError):
        filter_element_types(
            elements, include_element_types=[Title], exclude_element_types=[ListItem]
        )  # 2.25μs -> 2.38μs (5.26% slower)


def test_neither_include_nor_exclude():
    # Neither include nor exclude specified: should raise ValueError
    elements = [Title("t1"), ListItem("l1")]
    with pytest.raises(ValueError):
        filter_element_types(elements)  # 1.92μs -> 1.92μs (0.052% slower)


# ------------------ EDGE TEST CASES ------------------


def test_empty_elements_list():
    # Empty input: should return empty list
    codeflash_output = filter_element_types([], include_element_types=[Title])
    result = codeflash_output  # 1.17μs -> 1.46μs (20.0% slower)


def test_elements_with_no_matching_types_include():
    # No elements match the include types
    elements = [NarrativeText("n1"), ListItem("l1")]
    codeflash_output = filter_element_types(elements, include_element_types=[Title])
    result = codeflash_output  # 1.50μs -> 1.88μs (20.0% slower)


def test_elements_with_no_matching_types_exclude():
    # No elements match the exclude types, so all should be returned
    elements = [NarrativeText("n1"), ListItem("l1")]
    codeflash_output = filter_element_types(elements, exclude_element_types=[Title])
    result = codeflash_output  # 1.38μs -> 1.75μs (21.4% slower)


def test_elements_with_subclass_of_element():
    # Subclass of Element not explicitly listed in include/exclude
    class CustomTitle(Title):
        pass

    elements = [CustomTitle("ct1"), Title("t1")]
    # Only Title, not CustomTitle, should be included (type, not isinstance)
    codeflash_output = filter_element_types(elements, include_element_types=[Title])
    result = codeflash_output  # 1.42μs -> 1.58μs (10.5% slower)


def test_elements_with_multiple_types():
    # Include multiple types, some elements of each
    elements = [Title("t1"), NarrativeText("n1"), ListItem("l1"), Table("tb1"), Figure("f1")]
    codeflash_output = filter_element_types(elements, include_element_types=[Title, Table])
    result = codeflash_output  # 1.50μs -> 1.67μs (9.96% slower)


def test_elements_with_duplicate_types():
    # Elements with duplicate types
    elements = [Title("t1"), Title("t2"), Title("t3")]
    codeflash_output = filter_element_types(elements, exclude_element_types=[NarrativeText])
    result = codeflash_output  # 1.42μs -> 1.62μs (12.9% slower)


def test_large_number_of_elements_include():
    # Large list of elements, include only ListItem
    elements = [Title(f"t{i}") if i % 3 == 0 else ListItem(f"l{i}") for i in range(1000)]
    codeflash_output = filter_element_types(elements, include_element_types=[ListItem])
    result = codeflash_output  # 36.9μs -> 26.3μs (40.0% faster)
    # There should be 1000 - (1000 // 3 + (1 if 1000 % 3 != 0 else 0)) ListItems
    expected_count = 1000 - len([i for i in range(1000) if i % 3 == 0])


def test_large_number_of_elements_exclude():
    # Large list, exclude Table and Figure
    elements = []
    for i in range(1000):
        if i % 5 == 0:
            elements.append(Table(f"tb{i}"))
        elif i % 7 == 0:
            elements.append(Figure(f"f{i}"))
        else:
            elements.append(NarrativeText(f"n{i}"))
    codeflash_output = filter_element_types(elements, exclude_element_types=[Table, Figure])
    result = codeflash_output  # 44.0μs -> 35.5μs (23.8% faster)
    # Count NarrativeText elements
    expected_count = len([i for i in range(1000) if i % 5 != 0 and i % 7 != 0])


def test_large_scale_all_types_included():
    # Large list, include all types
    elements = (
        [Title(f"t{i}") for i in range(250)]
        + [NarrativeText(f"n{i}") for i in range(250)]
        + [ListItem(f"l{i}") for i in range(250)]
        + [Table(f"tb{i}") for i in range(250)]
    )
    codeflash_output = filter_element_types(
        elements, include_element_types=[Title, NarrativeText, ListItem, Table]
    )
    result = codeflash_output  # 46.7μs -> 34.8μs (34.3% faster)
    # All types are present
    types = set(type(e) for e in result)


def test_large_scale_none_of_type_included():
    # Large list, include type not present
    elements = [Title(f"t{i}") for i in range(1000)]
    codeflash_output = filter_element_types(elements, include_element_types=[NarrativeText])
    result = codeflash_output  # 25.0μs -> 24.2μs (3.44% faster)


def test_large_scale_exclude_all():
    # Large list, exclude all types present
    elements = [Title(f"t{i}") for i in range(1000)]
    codeflash_output = filter_element_types(elements, exclude_element_types=[Title])
    result = codeflash_output  # 21.7μs -> 19.8μs (9.68% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import pytest  # used for our unit tests

from unstructured.staging.base import filter_element_types

# --- Dummy classes and function to support testing ---


# Minimal stub for Element and subclasses to use as test data
class Element:
    pass


class Title(Element):
    def __init__(self, value=None):
        self.value = value


class NarrativeText(Element):
    def __init__(self, value=None):
        self.value = value


class ListItem(Element):
    def __init__(self, value=None):
        self.value = value


class Table(Element):
    def __init__(self, value=None):
        self.value = value


class FakeElement:  # Not a subclass of Element
    pass


# --- Unit tests ---

# 1. BASIC TEST CASES


def test_include_single_type():
    # Only include Title elements
    elements = [Title("A"), NarrativeText("B"), Title("C")]
    codeflash_output = filter_element_types(elements, include_element_types=[Title])
    result = codeflash_output  # 1.42μs -> 1.58μs (10.6% slower)


def test_exclude_single_type():
    # Exclude NarrativeText elements
    elements = [Title("A"), NarrativeText("B"), Title("C")]
    codeflash_output = filter_element_types(elements, exclude_element_types=[NarrativeText])
    result = codeflash_output  # 1.46μs -> 1.67μs (12.5% slower)


def test_include_multiple_types():
    # Include both Title and NarrativeText
    elements = [Title("A"), NarrativeText("B"), ListItem("C"), Table("T")]
    codeflash_output = filter_element_types(elements, include_element_types=[Title, NarrativeText])
    result = codeflash_output  # 1.46μs -> 1.62μs (10.3% slower)


def test_exclude_multiple_types():
    # Exclude Title and Table
    elements = [Title("A"), NarrativeText("B"), ListItem("C"), Table("T")]
    codeflash_output = filter_element_types(elements, exclude_element_types=[Title, Table])
    result = codeflash_output  # 1.46μs -> 1.62μs (10.3% slower)


def test_include_type_not_present():
    # Include ListItem, which is not present
    elements = [Title("A"), NarrativeText("B")]
    codeflash_output = filter_element_types(elements, include_element_types=[ListItem])
    result = codeflash_output  # 1.17μs -> 1.38μs (15.2% slower)


def test_exclude_type_not_present():
    # Exclude Table, which is not present; all elements should remain
    elements = [Title("A"), NarrativeText("B")]
    codeflash_output = filter_element_types(elements, exclude_element_types=[Table])
    result = codeflash_output  # 1.25μs -> 1.46μs (14.3% slower)


def test_include_with_empty_elements():
    # Empty elements list should always return empty
    codeflash_output = filter_element_types([], include_element_types=[Title])
    result = codeflash_output  # 958ns -> 1.17μs (17.9% slower)


def test_exclude_with_empty_elements():
    codeflash_output = filter_element_types([], exclude_element_types=[Title])
    result = codeflash_output  # 1.04μs -> 1.25μs (16.7% slower)


# 2. EDGE TEST CASES


def test_both_include_and_exclude_specified():
    # Should raise ValueError if both are specified
    elements = [Title("A")]
    with pytest.raises(ValueError):
        filter_element_types(
            elements, include_element_types=[Title], exclude_element_types=[NarrativeText]
        )  # 2.12μs -> 2.25μs (5.56% slower)


def test_neither_include_nor_exclude_specified():
    # Should raise ValueError if neither is specified
    elements = [Title("A")]
    with pytest.raises(ValueError):
        filter_element_types(elements)  # 1.83μs -> 1.92μs (4.33% slower)


def test_elements_with_non_element_type():
    # Elements that are not subclasses of Element should be ignored (not included)
    elements = [Title("A"), FakeElement(), NarrativeText("B")]
    codeflash_output = filter_element_types(elements, include_element_types=[Title, NarrativeText])
    result = codeflash_output  # 1.62μs -> 1.88μs (13.3% slower)


def test_include_type_is_parent_class():
    # Matching is by exact type, so including the parent class Element matches no subclass instances
    elements = [Title("A"), NarrativeText("B"), ListItem("C")]
    codeflash_output = filter_element_types(elements, include_element_types=[Element])
    result = codeflash_output  # 1.21μs -> 1.50μs (19.5% slower)


def test_exclude_type_is_parent_class():
    # Matching is by exact type, so excluding Element removes nothing (type(e) is never Element here)
    elements = [Title("A"), NarrativeText("B"), ListItem("C")]
    codeflash_output = filter_element_types(elements, exclude_element_types=[Element])
    result = codeflash_output  # 1.42μs -> 1.67μs (14.9% slower)


def test_elements_with_duplicates():
    # Duplicated elements should be preserved
    elements = [Title("A"), Title("A"), NarrativeText("B")]
    codeflash_output = filter_element_types(elements, include_element_types=[Title])
    result = codeflash_output  # 1.33μs -> 1.46μs (8.57% slower)


def test_elements_with_none():
    # None in elements should not be included
    elements = [Title("A"), None, NarrativeText("B")]
    codeflash_output = filter_element_types(elements, include_element_types=[Title, NarrativeText])
    result = codeflash_output  # 1.38μs -> 1.50μs (8.33% slower)


def test_elements_with_mixed_types():
    # Elements of unrelated types (not subclass of Element)
    elements = [Title("A"), "string", 123, NarrativeText("B")]
    codeflash_output = filter_element_types(elements, include_element_types=[NarrativeText])
    result = codeflash_output  # 1.25μs -> 1.50μs (16.7% slower)


# 3. LARGE SCALE TEST CASES


def test_large_include():
    # Large number of elements, include one type
    elements = [Title(f"T{i}") if i % 2 == 0 else NarrativeText(f"N{i}") for i in range(1000)]
    codeflash_output = filter_element_types(elements, include_element_types=[Title])
    result = codeflash_output  # 33.2μs -> 26.0μs (28.1% faster)
    # Ensure order is preserved
    for idx, e in enumerate(result):
        pass


def test_large_exclude():
    # Large number of elements, exclude one type
    elements = [Title(f"T{i}") if i % 2 == 0 else NarrativeText(f"N{i}") for i in range(1000)]
    codeflash_output = filter_element_types(elements, exclude_element_types=[NarrativeText])
    result = codeflash_output  # 32.9μs -> 26.0μs (26.8% faster)


def test_large_all_types():
    # Large input with multiple types, include two types
    elements = []
    for i in range(250):
        elements.append(Title(f"T{i}"))
        elements.append(NarrativeText(f"N{i}"))
        elements.append(ListItem(f"L{i}"))
        elements.append(Table(f"Tb{i}"))
    codeflash_output = filter_element_types(elements, include_element_types=[Title, Table])
    result = codeflash_output  # 39.4μs -> 30.0μs (31.6% faster)
    # Check order
    for i in range(250):
        pass
import pytest

from unstructured.staging.base import filter_element_types


def test_filter_element_types():
    with pytest.raises(
        TypeError, match="__bool__\\ should\\ return\\ bool,\\ returned\\ SymbolicBool"
    ):
        filter_element_types((v1 := ()), include_element_types=None, exclude_element_types=v1)

To edit these changes git checkout codeflash/optimize-filter_element_types-mje69hec and push.

@codeflash-ai codeflash-ai bot requested a review from aseembits93 December 20, 2025 10:45
@codeflash-ai codeflash-ai bot added the labels "⚡️ codeflash" (Optimization PR opened by Codeflash AI) and "🎯 Quality: Medium" (Optimization Quality according to Codeflash) Dec 20, 2025