@codeflash-ai codeflash-ai bot commented Dec 19, 2025

📄 70% (0.70x) speedup for aggregate_embedded_text_by_block in unstructured/partition/pdf_image/pdfminer_processing.py

⏱️ Runtime : 3.98 milliseconds → 2.34 milliseconds (best of 30 runs)

📝 Explanation and details

The optimization introduces Numba JIT compilation to accelerate the most computationally intensive parts of the bounding box comparison algorithm, achieving a 70% speedup.

Key optimizations applied:

  1. Numba JIT compilation: Added @njit(cache=True, fastmath=True) decorators to create compiled versions of the core computational functions (see the sketch after this list):

    • _get_coords_from_bboxes_numba() for coordinate extraction
    • _areas_of_boxes_and_intersection_area_numba() for area calculations
    • _bboxes1_is_almost_subregion_of_bboxes2_numba() for the main comparison logic
  2. Optimized computation flow: The original code used NumPy broadcasting and vectorized operations, but the optimized version uses explicit loops within Numba-compiled functions, which can be faster for certain array sizes due to reduced memory overhead and better cache locality.

  3. Precision handling: Switched to np.float64 for higher precision calculations while maintaining the same rounding behavior.
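
For concreteness, here is a minimal sketch of the pattern named in item 1. It is not the actual helper from pdfminer_processing.py; the function name, the exact area convention, and the zero-area handling are illustrative assumptions.

```python
# Hypothetical sketch of a Numba-compiled "almost subregion" check; names and
# conventions are assumptions, not the real _bboxes1_is_almost_subregion_of_bboxes2_numba.
import numpy as np
from numba import njit


@njit(cache=True, fastmath=True)
def is_almost_subregion_sketch(bboxes1, bboxes2, threshold):
    """out[i, j] is True when bboxes1[i] sits mostly inside bboxes2[j]:
    the intersection covers >= threshold of bboxes1[i]'s area and
    bboxes1[i] is not larger than bboxes2[j]."""
    n1, n2 = bboxes1.shape[0], bboxes2.shape[0]
    out = np.zeros((n1, n2), dtype=np.bool_)
    for i in range(n1):
        x1a, y1a, x2a, y2a = bboxes1[i, 0], bboxes1[i, 1], bboxes1[i, 2], bboxes1[i, 3]
        area_a = max(x2a - x1a, 0.0) * max(y2a - y1a, 0.0)
        for j in range(n2):
            x1b, y1b, x2b, y2b = bboxes2[j, 0], bboxes2[j, 1], bboxes2[j, 2], bboxes2[j, 3]
            area_b = max(x2b - x1b, 0.0) * max(y2b - y1b, 0.0)
            inter = max(min(x2a, x2b) - max(x1a, x1b), 0.0) * max(min(y2a, y2b) - max(y1a, y1b), 0.0)
            out[i, j] = (inter >= threshold * area_a) and (area_a <= area_b)
    return out


boxes_a = np.array([[1.0, 1.0, 5.0, 5.0]], dtype=np.float64)
boxes_b = np.array([[0.0, 0.0, 10.0, 10.0]], dtype=np.float64)
print(is_almost_subregion_sketch(boxes_a, boxes_b, 0.5))  # [[ True]]
```

The explicit double loop is exactly the shape of code Numba compiles well: no temporary broadcast arrays are materialized, and each box pair is handled in registers.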

Why this leads to speedup:

  • JIT compilation: Numba compiles the Python loops to optimized machine code, eliminating Python interpreter overhead
  • Cache efficiency: The cache=True parameter writes the compiled machine code to disk, so repeated runs avoid recompilation; within a single run the compiled function is simply reused after its first call (see the warm-up sketch after this list)
  • Memory access patterns: Explicit loops in compiled code can have better cache locality than NumPy's broadcasting operations for moderate-sized arrays
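
One practical caveat: in a fresh process the first call to an @njit function still pays compilation (or on-disk cache loading) cost, so timings like the ones above are only meaningful after a warm-up call. A self-contained toy illustration (the kernel and names are hypothetical, not the PR's benchmark):

```python
# Toy demonstration of JIT warm-up; not the PR's benchmark.
import time

import numpy as np
from numba import njit


@njit(cache=True, fastmath=True)
def sum_box_areas(boxes):
    # total area of (x1, y1, x2, y2) boxes
    total = 0.0
    for i in range(boxes.shape[0]):
        total += (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
    return total


boxes = np.random.rand(1000, 4).astype(np.float64)

t0 = time.perf_counter()
sum_box_areas(boxes)  # first call: compile, or load from the on-disk cache
t1 = time.perf_counter()
sum_box_areas(boxes)  # later calls: run the compiled machine code directly
t2 = time.perf_counter()
print(f"first call {t1 - t0:.4f}s, second call {t2 - t1:.6f}s")
```

With cache=True, the compiled artifact is written to a __pycache__ directory next to the source, so subsequent processes skip most of the compilation cost as well.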

Performance characteristics from tests:

  • Small to medium arrays (typical use case): 150-220% faster across most test cases
  • Large arrays (1000+ elements): 15-50% faster, showing the optimization scales well
  • Edge cases: Consistent improvements even for boundary conditions

Impact on workloads:
Based on the function reference, this optimization significantly benefits PDF processing workflows where aggregate_embedded_text_by_block is called repeatedly in merge_out_layout_with_ocr_layout() for each invalid text element. Since OCR processing typically involves many bounding box comparisons, this 70% speedup directly translates to faster document processing times, especially for documents with many text regions requiring OCR text aggregation.
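
For orientation, the aggregation idea itself can be sketched in a few lines of plain NumPy. The helper name, the threshold semantics, and the join behavior below are illustrative assumptions, not the real implementation, which also tracks a per-region is_extracted flag as the generated tests below show:

```python
# Simplified, self-contained sketch of "aggregate embedded text by block";
# hypothetical helper, plain NumPy, no Numba.
import numpy as np


def aggregate_text_in_block_sketch(block, region_coords, region_texts, threshold=0.5):
    """Join the texts of regions whose boxes sit mostly inside `block`."""
    bx1, by1, bx2, by2 = block
    coords = np.asarray(region_coords, dtype=np.float64)
    # intersection of every region with the block
    iw = np.clip(np.minimum(coords[:, 2], bx2) - np.maximum(coords[:, 0], bx1), 0, None)
    ih = np.clip(np.minimum(coords[:, 3], by2) - np.maximum(coords[:, 1], by1), 0, None)
    inter = iw * ih
    areas = (coords[:, 2] - coords[:, 0]) * (coords[:, 3] - coords[:, 1])
    inside = inter >= threshold * np.maximum(areas, 1e-9)
    return " ".join(t for t, keep in zip(region_texts, inside) if keep and t)


print(aggregate_text_in_block_sketch(
    (0, 0, 100, 100),
    [[5, 5, 40, 20], [5, 25, 60, 40], [500, 500, 600, 600]],
    ["Hello", "world", "outside"],
))  # -> "Hello world"
```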

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 11 Passed |
| 🌀 Generated Regression Tests | 49 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
⚙️ Existing Unit Tests and Runtime
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
| --- | --- | --- | --- |
| partition/pdf_image/test_pdfminer_processing.py::test_aggregate_by_block | 55.5μs | 25.0μs | 122% ✅ |
🌀 Generated Regression Tests and Runtime
from enum import Enum

import numpy as np

# imports
from unstructured.partition.pdf_image.pdfminer_processing import aggregate_embedded_text_by_block

# --- Minimal stubs for dependencies used in the function ---


# IsExtracted enum as used in the function
class IsExtracted(Enum):
    TRUE = "true"
    FALSE = "false"


# TextRegions class as used in the function
class TextRegions:
    def __init__(self, element_coords, texts, is_extracted_array=None):
        self.element_coords = np.array(element_coords, dtype=np.float32)
        self.texts = texts
        # Default: all extracted
        self.is_extracted_array = (
            is_extracted_array
            if is_extracted_array is not None
            else [IsExtracted.TRUE] * len(texts)
        )

    def __len__(self):
        return len(self.element_coords)

    def slice(self, mask):
        # mask is a boolean array
        coords = self.element_coords[mask]
        texts = [t for t, m in zip(self.texts, mask) if m]
        is_extracted_array = [flag for flag, m in zip(self.is_extracted_array, mask) if m]
        return TextRegions(coords, texts, is_extracted_array)


# env_config stub
class EnvConfig:
    EMBEDDED_TEXT_AGGREGATION_SUBREGION_THRESHOLD = 0.5


env_config = EnvConfig()

# --- Unit tests ---

# BASIC TEST CASES


def test_basic_single_source_in_target():
    # Single source region inside a single target block
    source = TextRegions([[0, 0, 10, 10]], ["hello"])
    target = TextRegions([[0, 0, 20, 20]], ["block"])
    codeflash_output = aggregate_embedded_text_by_block(target, source)
    result = codeflash_output  # 79.1μs -> 39.0μs (103% faster)


def test_basic_multiple_sources_in_target():
    # Multiple source regions all inside a single target block
    source = TextRegions([[1, 1, 5, 5], [6, 6, 8, 8]], ["a", "b"])
    target = TextRegions([[0, 0, 10, 10]], ["block"])
    codeflash_output = aggregate_embedded_text_by_block(target, source)
    result = codeflash_output  # 54.1μs -> 21.5μs (151% faster)


def test_basic_some_sources_outside_target():
    # Only one source region inside target block
    source = TextRegions([[1, 1, 5, 5], [11, 11, 15, 15]], ["a", "b"])
    target = TextRegions([[0, 0, 10, 10]], ["block"])
    codeflash_output = aggregate_embedded_text_by_block(target, source)
    result = codeflash_output  # 49.0μs -> 18.0μs (172% faster)


def test_basic_multiple_targets():
    # Source region inside one of two target blocks
    source = TextRegions([[1, 1, 5, 5]], ["a"])
    target = TextRegions([[0, 0, 10, 10], [20, 20, 30, 30]], ["block1", "block2"])
    codeflash_output = aggregate_embedded_text_by_block(target, source)
    result = codeflash_output  # 48.9μs -> 16.8μs (191% faster)


def test_basic_empty_texts():
    # Source region inside target, but text is empty
    source = TextRegions([[1, 1, 5, 5]], [""])
    target = TextRegions([[0, 0, 10, 10]], ["block"])
    codeflash_output = aggregate_embedded_text_by_block(target, source)
    result = codeflash_output  # 45.5μs -> 16.5μs (177% faster)


def test_basic_multiple_sources_some_empty_texts():
    # Some texts empty, some valid
    source = TextRegions([[1, 1, 5, 5], [2, 2, 3, 3]], ["", "b"])
    target = TextRegions([[0, 0, 10, 10]], ["block"])
    codeflash_output = aggregate_embedded_text_by_block(target, source)
    result = codeflash_output  # 47.0μs -> 16.8μs (180% faster)


def test_basic_is_extracted_false_if_any_flag_false():
    # If any included region has IsExtracted.FALSE, result is FALSE
    source = TextRegions(
        [[1, 1, 5, 5], [2, 2, 3, 3]], ["a", "b"], [IsExtracted.TRUE, IsExtracted.FALSE]
    )
    target = TextRegions([[0, 0, 10, 10]], ["block"])
    codeflash_output = aggregate_embedded_text_by_block(target, source)
    result = codeflash_output  # 46.5μs -> 16.5μs (182% faster)


# EDGE TEST CASES


def test_edge_empty_source_regions():
    # No source regions
    source = TextRegions([], [])
    target = TextRegions([[0, 0, 10, 10]], ["block"])
    codeflash_output = aggregate_embedded_text_by_block(target, source)
    result = codeflash_output  # 459ns -> 417ns (10.1% faster)


def test_edge_empty_target_regions():
    # No target regions
    source = TextRegions([[1, 1, 5, 5]], ["a"])
    target = TextRegions([], [])
    codeflash_output = aggregate_embedded_text_by_block(target, source)
    result = codeflash_output  # 500ns -> 542ns (7.75% slower)


def test_edge_no_overlap():
    # Source region does not overlap target region
    source = TextRegions([[100, 100, 110, 110]], ["a"])
    target = TextRegions([[0, 0, 10, 10]], ["block"])
    codeflash_output = aggregate_embedded_text_by_block(target, source)
    result = codeflash_output  # 43.8μs -> 14.4μs (204% faster)


def test_edge_partial_overlap_below_threshold():
    # Source region overlaps target, but below threshold
    source = TextRegions([[9, 9, 15, 15]], ["a"])
    target = TextRegions([[0, 0, 10, 10]], ["block"])
    # The overlap area is very small compared to source area
    codeflash_output = aggregate_embedded_text_by_block(target, source, threshold=0.9)
    result = codeflash_output  # 42.5μs -> 13.7μs (210% faster)


def test_edge_partial_overlap_above_threshold():
    # Source region overlaps target, above threshold
    source = TextRegions([[0, 0, 5, 5]], ["a"])
    target = TextRegions([[0, 0, 10, 10]], ["block"])
    codeflash_output = aggregate_embedded_text_by_block(target, source, threshold=0.5)
    result = codeflash_output  # 45.6μs -> 16.8μs (171% faster)


def test_edge_source_larger_than_target():
    # Source region is strictly larger than target region
    source = TextRegions([[0, 0, 20, 20]], ["a"])
    target = TextRegions([[5, 5, 10, 10]], ["block"])
    # Should not be included because boxa_area > boxb_area
    codeflash_output = aggregate_embedded_text_by_block(target, source)
    result = codeflash_output  # 41.6μs -> 13.4μs (210% faster)


def test_edge_threshold_zero():
    # Threshold zero: any overlap counts
    source = TextRegions([[9, 9, 11, 11]], ["a"])
    target = TextRegions([[10, 10, 20, 20]], ["block"])
    codeflash_output = aggregate_embedded_text_by_block(target, source, threshold=0.0)
    result = codeflash_output  # 44.3μs -> 15.8μs (181% faster)


def test_edge_threshold_one():
    # Threshold one: only full containment counts
    source = TextRegions([[0, 0, 5, 5]], ["a"])
    target = TextRegions([[0, 0, 5, 5]], ["block"])
    codeflash_output = aggregate_embedded_text_by_block(target, source, threshold=1.0)
    result = codeflash_output  # 41.4μs -> 13.0μs (217% faster)


def test_edge_float_precision():
    # Test with float coordinates that could cause rounding issues
    source = TextRegions([[0.0000001, 0.0000001, 10.0000001, 10.0000001]], ["a"])
    target = TextRegions([[0, 0, 10, 10]], ["block"])
    codeflash_output = aggregate_embedded_text_by_block(target, source)
    result = codeflash_output  # 43.8μs -> 15.3μs (187% faster)


def test_edge_multiple_targets_and_sources():
    # Multiple targets, multiple sources, only some overlap
    source = TextRegions([[1, 1, 5, 5], [20, 20, 25, 25], [30, 30, 35, 35]], ["a", "b", "c"])
    target = TextRegions([[0, 0, 10, 10], [20, 20, 30, 30]], ["block1", "block2"])
    codeflash_output = aggregate_embedded_text_by_block(target, source)
    result = codeflash_output  # 50.5μs -> 17.7μs (186% faster)


def test_edge_source_region_exactly_matches_target():
    # Source region exactly matches target region
    source = TextRegions([[0, 0, 10, 10]], ["a"])
    target = TextRegions([[0, 0, 10, 10]], ["block"])
    codeflash_output = aggregate_embedded_text_by_block(target, source)
    result = codeflash_output  # 44.6μs -> 15.3μs (191% faster)


def test_edge_source_region_zero_area():
    # Source region has zero area (x1==x2 and y1==y2)
    source = TextRegions([[5, 5, 5, 5]], ["a"])
    target = TextRegions([[0, 0, 10, 10]], ["block"])
    codeflash_output = aggregate_embedded_text_by_block(target, source)
    result = codeflash_output  # 43.9μs -> 15.2μs (189% faster)


# LARGE SCALE TEST CASES


def test_large_many_sources_one_target():
    # 1000 source regions, all inside one target block
    n = 1000
    coords = [[i, i, i + 1, i + 1] for i in range(n)]
    texts = [str(i) for i in range(n)]
    source = TextRegions(coords, texts)
    target = TextRegions([[0, 0, n + 1, n + 1]], ["block"])
    codeflash_output = aggregate_embedded_text_by_block(target, source)
    result = codeflash_output  # 266μs -> 231μs (15.2% faster)
    # All texts should be included and concatenated
    expected_text = " ".join(texts)


def test_large_many_sources_some_outside():
    # 500 inside, 500 outside
    n = 500
    inside_coords = [[i, i, i + 1, i + 1] for i in range(n)]
    outside_coords = [[1000 + i, 1000 + i, 1000 + i + 1, 1000 + i + 1] for i in range(n)]
    coords = inside_coords + outside_coords
    texts = [str(i) for i in range(2 * n)]
    source = TextRegions(coords, texts)
    target = TextRegions([[0, 0, n + 1, n + 1]], ["block"])
    codeflash_output = aggregate_embedded_text_by_block(target, source)
    result = codeflash_output  # 225μs -> 190μs (18.5% faster)
    expected_text = " ".join([str(i) for i in range(n)])


def test_large_many_targets_and_sources():
    # 100 targets, 1000 sources, each source inside a unique target
    n_targets = 100
    n_sources = 1000
    target_coords = [[i * 10, i * 10, i * 10 + 10, i * 10 + 10] for i in range(n_targets)]
    source_coords = [[i * 1, i * 1, i * 1 + 1, i * 1 + 1] for i in range(n_sources)]
    texts = [str(i) for i in range(n_sources)]
    source = TextRegions(source_coords, texts)
    target = TextRegions(target_coords, ["block"] * n_targets)
    # The first target block is [0, 0, 10, 10], so only sources with i in [0, 9] fall inside it
    codeflash_output = aggregate_embedded_text_by_block(target, source)
    result = codeflash_output  # 806μs -> 573μs (40.8% faster)
    # Only sources with i in [0,9] should be inside first target
    expected_text = " ".join([str(i) for i in range(10)])


def test_large_all_is_extracted_false():
    # All included regions have IsExtracted.FALSE
    n = 100
    coords = [[i, i, i + 1, i + 1] for i in range(n)]
    texts = [str(i) for i in range(n)]
    flags = [IsExtracted.FALSE] * n
    source = TextRegions(coords, texts, flags)
    target = TextRegions([[0, 0, n + 1, n + 1]], ["block"])
    codeflash_output = aggregate_embedded_text_by_block(target, source)
    result = codeflash_output  # 70.4μs -> 38.5μs (82.6% faster)
    expected_text = " ".join(texts)


def test_large_no_texts():
    # All source regions inside target, but all texts empty
    n = 100
    coords = [[i, i, i + 1, i + 1] for i in range(n)]
    texts = [""] * n
    source = TextRegions(coords, texts)
    target = TextRegions([[0, 0, n + 1, n + 1]], ["block"])
    codeflash_output = aggregate_embedded_text_by_block(target, source)
    result = codeflash_output  # 66.5μs -> 36.5μs (82.5% faster)


def test_large_some_texts_empty_some_not():
    # Half texts empty, half valid
    n = 100
    coords = [[i, i, i + 1, i + 1] for i in range(n)]
    texts = ["" if i % 2 == 0 else str(i) for i in range(n)]
    source = TextRegions(coords, texts)
    target = TextRegions([[0, 0, n + 1, n + 1]], ["block"])
    expected_text = " ".join([str(i) for i in range(n) if i % 2 == 1])
    codeflash_output = aggregate_embedded_text_by_block(target, source)
    result = codeflash_output  # 66.8μs -> 36.8μs (81.3% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import numpy as np

# imports
from unstructured.partition.pdf_image.pdfminer_processing import aggregate_embedded_text_by_block


# --- Minimal stubs for required classes/enums/constants ---
class IsExtracted:
    TRUE = "TRUE"
    FALSE = "FALSE"


class TextRegions:
    def __init__(self, element_coords, texts, is_extracted_array):
        # element_coords: np.ndarray of shape (N, 4)
        # texts: list of strings, length N
        # is_extracted_array: list of IsExtracted.TRUE/FALSE, length N
        self.element_coords = np.array(element_coords, dtype=np.float32)
        self.texts = texts
        self.is_extracted_array = is_extracted_array

    def __len__(self):
        return len(self.texts)

    def slice(self, mask):
        # mask: boolean array of length N
        coords = self.element_coords[mask]
        texts = [t for t, m in zip(self.texts, mask) if m]
        is_extracted = [f for f, m in zip(self.is_extracted_array, mask) if m]
        return TextRegions(coords, texts, is_extracted)


# --- Minimal stub for env_config ---
class EnvConfig:
    EMBEDDED_TEXT_AGGREGATION_SUBREGION_THRESHOLD = 0.5


env_config = EnvConfig()

# --- Unit tests ---
# Basic Test Cases


def test_empty_source_and_target():
    # Both empty
    src = TextRegions([], [], [])
    tgt = TextRegions([], [], [])
    text, extracted = aggregate_embedded_text_by_block(tgt, src)  # 541ns -> 584ns (7.36% slower)


def test_empty_source_nonempty_target():
    # Source empty, target non-empty
    tgt = TextRegions([[0, 0, 10, 10]], ["block"], [IsExtracted.TRUE])
    src = TextRegions([], [], [])
    text, extracted = aggregate_embedded_text_by_block(tgt, src)  # 375ns -> 417ns (10.1% slower)


def test_nonempty_source_empty_target():
    # Source non-empty, target empty
    src = TextRegions([[1, 1, 5, 5]], ["text"], [IsExtracted.TRUE])
    tgt = TextRegions([], [], [])
    text, extracted = aggregate_embedded_text_by_block(tgt, src)  # 583ns -> 542ns (7.56% faster)


def test_single_overlap():
    # One source region fully inside target region
    src = TextRegions([[1, 1, 5, 5]], ["hello"], [IsExtracted.TRUE])
    tgt = TextRegions([[0, 0, 10, 10]], ["block"], [IsExtracted.TRUE])
    text, extracted = aggregate_embedded_text_by_block(tgt, src)  # 50.2μs -> 18.3μs (175% faster)


def test_single_no_overlap():
    # Source region outside target region
    src = TextRegions([[20, 20, 30, 30]], ["outside"], [IsExtracted.TRUE])
    tgt = TextRegions([[0, 0, 10, 10]], ["block"], [IsExtracted.TRUE])
    text, extracted = aggregate_embedded_text_by_block(tgt, src)  # 42.7μs -> 13.4μs (219% faster)


def test_multiple_sources_partial_overlap():
    # Multiple source regions, only some overlap
    src = TextRegions(
        [[1, 1, 5, 5], [20, 20, 30, 30], [2, 2, 4, 4]],
        ["hello", "outside", "world"],
        [IsExtracted.TRUE, IsExtracted.TRUE, IsExtracted.TRUE],
    )
    tgt = TextRegions([[0, 0, 10, 10]], ["block"], [IsExtracted.TRUE])
    text, extracted = aggregate_embedded_text_by_block(tgt, src)  # 47.4μs -> 17.0μs (179% faster)


def test_multiple_sources_some_not_extracted():
    # Some source regions overlap, but not all are extracted
    src = TextRegions(
        [[1, 1, 5, 5], [2, 2, 4, 4]], ["hello", "world"], [IsExtracted.TRUE, IsExtracted.FALSE]
    )
    tgt = TextRegions([[0, 0, 10, 10]], ["block"], [IsExtracted.TRUE])
    text, extracted = aggregate_embedded_text_by_block(tgt, src)  # 46.0μs -> 16.0μs (188% faster)


def test_source_with_empty_texts():
    # Some source regions have empty text
    src = TextRegions(
        [[1, 1, 5, 5], [2, 2, 4, 4]], ["hello", ""], [IsExtracted.TRUE, IsExtracted.TRUE]
    )
    tgt = TextRegions([[0, 0, 10, 10]], ["block"], [IsExtracted.TRUE])
    text, extracted = aggregate_embedded_text_by_block(tgt, src)  # 45.0μs -> 15.6μs (188% faster)


def test_multiple_target_regions():
    # Multiple target regions, source region overlaps only one
    src = TextRegions([[1, 1, 5, 5]], ["hello"], [IsExtracted.TRUE])
    tgt = TextRegions(
        [[0, 0, 10, 10], [20, 20, 30, 30]],
        ["block1", "block2"],
        [IsExtracted.TRUE, IsExtracted.TRUE],
    )
    text, extracted = aggregate_embedded_text_by_block(tgt, src)  # 46.7μs -> 15.7μs (198% faster)


def test_threshold_effect():
    # Test threshold: region overlaps but not enough
    src = TextRegions([[0, 0, 10, 10]], ["text"], [IsExtracted.TRUE])
    tgt = TextRegions([[0, 0, 5, 5]], ["block"], [IsExtracted.TRUE])
    # The intersection covers well under half of the source box's area, so it should not be included
    text, extracted = aggregate_embedded_text_by_block(
        tgt, src, threshold=0.5
    )  # 41.9μs -> 13.3μs (215% faster)
    # Lower threshold to 0.1, now should include
    text2, extracted2 = aggregate_embedded_text_by_block(
        tgt, src, threshold=0.1
    )  # 36.5μs -> 10.0μs (265% faster)


# Edge Test Cases


def test_source_completely_covers_target():
    # Source region is much larger than target, so not a subregion
    src = TextRegions([[0, 0, 100, 100]], ["big"], [IsExtracted.TRUE])
    tgt = TextRegions([[10, 10, 20, 20]], ["block"], [IsExtracted.TRUE])
    text, extracted = aggregate_embedded_text_by_block(tgt, src)  # 40.6μs -> 12.5μs (226% faster)


def test_source_and_target_identical():
    # Source and target have identical boxes
    src = TextRegions([[0, 0, 10, 10]], ["same"], [IsExtracted.TRUE])
    tgt = TextRegions([[0, 0, 10, 10]], ["block"], [IsExtracted.TRUE])
    text, extracted = aggregate_embedded_text_by_block(tgt, src)  # 43.2μs -> 15.4μs (181% faster)


def test_source_zero_area():
    # Source region with zero area
    src = TextRegions([[5, 5, 5, 5]], ["zero"], [IsExtracted.TRUE])
    tgt = TextRegions([[0, 0, 10, 10]], ["block"], [IsExtracted.TRUE])
    text, extracted = aggregate_embedded_text_by_block(tgt, src)  # 43.1μs -> 15.0μs (186% faster)


def test_target_zero_area():
    # Target region with zero area
    src = TextRegions([[0, 0, 10, 10]], ["text"], [IsExtracted.TRUE])
    tgt = TextRegions([[5, 5, 5, 5]], ["zero"], [IsExtracted.TRUE])
    text, extracted = aggregate_embedded_text_by_block(tgt, src)  # 40.6μs -> 12.5μs (226% faster)


def test_source_multiple_overlapping_targets():
    # Source region overlaps multiple target regions
    src = TextRegions([[3, 3, 6, 6]], ["overlap"], [IsExtracted.TRUE])
    tgt = TextRegions(
        [[0, 0, 10, 10], [4, 4, 8, 8]], ["block1", "block2"], [IsExtracted.TRUE, IsExtracted.TRUE]
    )
    text, extracted = aggregate_embedded_text_by_block(tgt, src)  # 46.1μs -> 15.2μs (204% faster)


def test_source_with_non_boolean_is_extracted():
    # is_extracted_array contains non-standard values
    src = TextRegions([[1, 1, 5, 5]], ["text"], ["not_true"])
    tgt = TextRegions([[0, 0, 10, 10]], ["block"], [IsExtracted.TRUE])
    text, extracted = aggregate_embedded_text_by_block(tgt, src)  # 43.3μs -> 14.9μs (191% faster)


def test_source_with_none_text():
    # Source region has None as text
    src = TextRegions([[1, 1, 5, 5]], [None], [IsExtracted.TRUE])
    tgt = TextRegions([[0, 0, 10, 10]], ["block"], [IsExtracted.TRUE])
    text, extracted = aggregate_embedded_text_by_block(tgt, src)  # 43.1μs -> 14.8μs (191% faster)


def test_source_with_mixed_types():
    # Source region has mixed types in text
    src = TextRegions([[1, 1, 5, 5]], ["hello", 123], [IsExtracted.TRUE, IsExtracted.TRUE])
    tgt = TextRegions([[0, 0, 10, 10]], ["block"], [IsExtracted.TRUE])
    # Should convert non-str to str
    text, extracted = aggregate_embedded_text_by_block(tgt, src)  # 44.5μs -> 16.0μs (177% faster)


# Large Scale Test Cases


def test_large_number_of_source_regions_all_overlap():
    # 1000 source regions, all overlap with target
    N = 1000
    src_coords = [[i, i, i + 1, i + 1] for i in range(N)]
    src_texts = [f"text{i}" for i in range(N)]
    src_flags = [IsExtracted.TRUE] * N
    tgt = TextRegions([[0, 0, N, N]], ["block"], [IsExtracted.TRUE])
    src = TextRegions(src_coords, src_texts, src_flags)
    text, extracted = aggregate_embedded_text_by_block(tgt, src)  # 264μs -> 230μs (14.9% faster)
    for i in range(N):
        pass


def test_large_number_of_source_regions_none_overlap():
    # 1000 source regions, none overlap with target
    N = 1000
    src_coords = [[i + N, i + N, i + N + 1, i + N + 1] for i in range(N)]
    src_texts = [f"text{i}" for i in range(N)]
    src_flags = [IsExtracted.TRUE] * N
    tgt = TextRegions([[0, 0, N, N]], ["block"], [IsExtracted.TRUE])
    src = TextRegions(src_coords, src_texts, src_flags)
    text, extracted = aggregate_embedded_text_by_block(tgt, src)  # 140μs -> 106μs (32.1% faster)


def test_large_number_of_target_regions():
    # 100 target regions, source overlaps only some
    N = 100
    tgt_coords = [[i, i, i + 10, i + 10] for i in range(N)]
    tgt_texts = [f"block{i}" for i in range(N)]
    tgt_flags = [IsExtracted.TRUE] * N
    src = TextRegions([[5, 5, 15, 15]], ["source"], [IsExtracted.TRUE])
    tgt = TextRegions(tgt_coords, tgt_texts, tgt_flags)
    text, extracted = aggregate_embedded_text_by_block(tgt, src)  # 47.2μs -> 16.5μs (186% faster)


def test_large_mixed_extracted_flags():
    # 500 source regions, half overlap, half not extracted
    N = 500
    src_coords = [[i, i, i + 1, i + 1] for i in range(N)]
    src_texts = [f"text{i}" for i in range(N)]
    src_flags = [IsExtracted.TRUE if i % 2 == 0 else IsExtracted.FALSE for i in range(N)]
    tgt = TextRegions([[0, 0, N, N]], ["block"], [IsExtracted.TRUE])
    src = TextRegions(src_coords, src_texts, src_flags)
    text, extracted = aggregate_embedded_text_by_block(tgt, src)  # 155μs -> 124μs (25.1% faster)
    # All texts included, but not all are extracted
    for i in range(N):
        pass


def test_large_sparse_overlap():
    # 1000 source regions, only every 100th overlaps
    N = 1000
    src_coords = [[i, i, i + 1, i + 1] for i in range(N)]
    src_texts = [f"text{i}" for i in range(N)]
    src_flags = [IsExtracted.TRUE] * N
    tgt = TextRegions(
        [[i, i, i + 1, i + 1] for i in range(0, N, 100)], ["block"] * 10, [IsExtracted.TRUE] * 10
    )
    src = TextRegions(src_coords, src_texts, src_flags)
    text, extracted = aggregate_embedded_text_by_block(tgt, src)  # 296μs -> 193μs (53.4% faster)
    for i in range(0, N, 100):
        pass


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-aggregate_embedded_text_by_block-mjdfd5oc` and push.

@codeflash-ai codeflash-ai bot requested a review from aseembits93 December 19, 2025 22:12
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) labels Dec 19, 2025