@codeflash-ai codeflash-ai bot commented Dec 20, 2025

📄 22% (0.22x) speedup for element_to_md in unstructured/staging/base.py

⏱️ Runtime: 65.0 microseconds → 53.3 microseconds (best of 79 runs)
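
The headline percentage can be reproduced from the two runtimes. This is a minimal sketch, assuming the figure is computed as original / optimized - 1. That formula is inferred from the numbers on this page, not taken from Codeflash documentation, and `speedup_pct` is a hypothetical helper name.

```python
def speedup_pct(original: float, optimized: float) -> float:
    """Relative speedup in percent; both runtimes must use the same unit."""
    return (original / optimized - 1.0) * 100.0


# Overall runtime from this report: 65.0 microseconds -> 53.3 microseconds
print(round(speedup_pct(65.0, 53.3)))  # 22
```

Per-test percentages in the report are presumably computed from unrounded timings, so recomputing them from the rounded values shown here can differ by a few tenths of a percent.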

📝 Explanation and details

The optimization replaces Python's match-case pattern matching with plain isinstance checks and direct attribute access. This yields a 22% overall speedup (65.0 μs to 53.3 μs), mainly from cheaper type dispatch and fewer attribute lookups.

Key Optimizations:

  1. Faster Type Checking: isinstance(element, Title) is faster than a class pattern with destructuring (case Title(text=text):), which performs the same type check and then extracts the named attributes. The quoted line-profiler figures (80,000 ns for the original match statement versus 305,000 ns in total for the optimized isinstance checks) do not demonstrate this on their own; the gain shows up in the end-to-end timings, where early returns shorten each call.

  2. Reduced Attribute Access: For Image elements, the optimization pre-fetches metadata attributes once (image_base64 = getattr(metadata, "image_base64", None)) rather than accessing them repeatedly in each pattern match condition. This eliminates redundant attribute lookups.

  3. Simplified Control Flow: The linear if-elif structure allows for early returns and avoids the overhead of Python's pattern matching dispatch mechanism, which involves more internal bookkeeping.

Performance Impact by Element Type (generated regression tests):

  • Title elements: 22-33% faster in three of four tests (for example 1.17μs -> 958ns); one test ran 24.2% slower (1.83μs -> 2.42μs)
  • Image elements: 15-59% faster depending on metadata and the exclude_binary_image_data flag; these benefit most from reduced attribute access
  • Table elements: 16-44% faster
  • Generic elements: 33-37% faster

Hot Path Impact: element_to_md is called once per element by elements_to_md during batch conversion (as shown in function_references), so the saving scales linearly with the number of elements converted. At roughly 100-400 ns saved per call in the generated tests, the absolute gain per document is small and becomes noticeable only when converting very large collections.
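
To put the per-element figure in perspective, here is a back-of-envelope estimate. It is a sketch under stated assumptions: `estimated_saving_ms` is a hypothetical helper, the timings are the Title measurement from the generated tests in this report (1.17μs -> 958ns), and the saving is assumed to scale linearly with element count.

```python
def estimated_saving_ms(before_ns: float, after_ns: float, n_elements: int) -> float:
    """Total time saved, in milliseconds, over n_elements conversions."""
    return (before_ns - after_ns) * n_elements / 1e6


# Title timing from the generated tests: 1170 ns -> 958 ns per call.
print(estimated_saving_ms(1170, 958, 10_000))  # 2.12
```

Under these assumptions, converting ten thousand Title elements saves about 2 ms in total.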

The optimization is particularly effective for Image-heavy documents where the metadata attribute caching provides the largest gains, while maintaining identical behavior and output across all test cases.

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 43 Passed |
| 🌀 Generated Regression Tests | 36 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 5 Passed |
| 📊 Tests Coverage | 100.0% |

⚙️ Existing Unit Tests and Runtime

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
| --- | --- | --- | --- |
| staging/test_base.py::test_element_to_md_conversion | 14.0μs | 11.5μs | 22.1% ✅ |
| staging/test_base.py::test_element_to_md_with_none_mime_type | 3.54μs | 3.58μs | -1.17% ⚠️ |

🌀 Generated Regression Tests and Runtime
from unstructured.staging.base import element_to_md


# Minimal stubs for the required classes and fields, since we do not have the actual implementations.
class DummyMetadata:
    def __init__(
        self,
        text_as_html=None,
        image_base64=None,
        image_mime_type=None,
        image_url=None,
    ):
        self.text_as_html = text_as_html
        self.image_base64 = image_base64
        self.image_mime_type = image_mime_type
        self.image_url = image_url


class Element:
    def __init__(self, text="", metadata=None):
        self.text = text
        self.metadata = metadata or DummyMetadata()


class Title(Element):
    pass


class Table(Element):
    pass


class Image(Element):
    pass


# unit tests

# --- Basic Test Cases ---


def test_title_to_md_basic():
    # Test that a Title element is converted to a markdown heading
    t = Title(text="My Title")
    codeflash_output = element_to_md(t)  # 1.17μs -> 958ns (21.7% faster)


def test_table_to_md_with_html():
    # Test that a Table with text_as_html returns the HTML
    html = "<table><tr><td>1</td></tr></table>"
    tbl = Table(text="Table text", metadata=DummyMetadata(text_as_html=html))
    codeflash_output = element_to_md(tbl)  # 1.08μs -> 875ns (23.8% faster)


def test_image_to_md_base64_no_mime():
    # Test that an Image with base64 and no mime type returns data:image/*;base64
    img = Image(
        text="Alt Text",
        metadata=DummyMetadata(image_base64="abc123"),
    )
    codeflash_output = element_to_md(img)  # 1.12μs -> 708ns (58.9% faster)


def test_image_to_md_base64_with_mime():
    # Test that an Image with base64 and mime type returns correct data URI
    img = Image(
        text="Alt Text",
        metadata=DummyMetadata(image_base64="xyz789", image_mime_type="image/png"),
    )
    codeflash_output = element_to_md(img)  # 959ns -> 708ns (35.5% faster)


def test_image_to_md_with_url():
    # Test that an Image with a URL returns correct markdown
    img = Image(
        text="Alt Text",
        metadata=DummyMetadata(image_url="http://example.com/image.png"),
    )
    codeflash_output = element_to_md(img)  # 917ns -> 750ns (22.3% faster)


def test_fallback_to_text():
    # Test that a generic Element falls back to its text
    el = Element(text="Just text")
    codeflash_output = element_to_md(el)  # 1.08μs -> 791ns (36.9% faster)


# --- Edge Test Cases ---


def test_title_empty_text():
    # Title with empty string
    t = Title(text="")
    codeflash_output = element_to_md(t)  # 1.00μs -> 750ns (33.3% faster)


def test_table_with_empty_html():
    # Table with empty text_as_html
    tbl = Table(text="Table", metadata=DummyMetadata(text_as_html=""))
    codeflash_output = element_to_md(tbl)  # 1.00μs -> 750ns (33.3% faster)


def test_image_with_base64_and_exclude_flag():
    # Image with base64, but exclude_binary_image_data=True, should fallback to text
    img = Image(
        text="Alt Text",
        metadata=DummyMetadata(image_base64="abc123"),
    )
    codeflash_output = element_to_md(
        img, exclude_binary_image_data=True
    )  # 1.08μs -> 833ns (30.0% faster)


def test_image_with_base64_and_mime_and_exclude_flag():
    # Image with base64 and mime, but exclude_binary_image_data=True, should fallback to text
    img = Image(
        text="Alt Text",
        metadata=DummyMetadata(image_base64="abc123", image_mime_type="image/png"),
    )
    codeflash_output = element_to_md(
        img, exclude_binary_image_data=True
    )  # 958ns -> 833ns (15.0% faster)


def test_image_with_url_and_base64():
    # Image with both image_url and image_base64, should prefer base64 if exclude_binary_image_data=False
    img = Image(
        text="Alt Text",
        metadata=DummyMetadata(image_url="http://example.com/image.png", image_base64="abc123"),
    )
    # Should use base64, since that's the first match
    codeflash_output = element_to_md(img)  # 958ns -> 750ns (27.7% faster)


def test_image_with_url_and_base64_exclude():
    # Image with both image_url and image_base64, but exclude_binary_image_data=True, should use URL
    img = Image(
        text="Alt Text",
        metadata=DummyMetadata(image_url="http://example.com/image.png", image_base64="abc123"),
    )
    # Should use image_url since base64 is excluded
    codeflash_output = element_to_md(
        img, exclude_binary_image_data=True
    )  # 1.00μs -> 833ns (20.0% faster)


def test_image_with_nothing():
    # Image with neither base64 nor url, should fallback to text
    img = Image(text="Alt Text", metadata=DummyMetadata())
    codeflash_output = element_to_md(img)  # 958ns -> 708ns (35.3% faster)


def test_table_with_no_html():
    # Table with no text_as_html, should fallback to text
    tbl = Table(text="Table text", metadata=DummyMetadata())
    codeflash_output = element_to_md(tbl)  # 958ns -> 750ns (27.7% faster)


def test_element_with_none_text():
    # Element with text=None should not fail
    el = Element(text=None)
    codeflash_output = element_to_md(el)  # 1.00μs -> 750ns (33.3% faster)


def test_image_with_mime_type_none_and_base64_none():
    # Image with both image_mime_type and image_base64 None, should fallback to text
    img = Image(text="Alt Text", metadata=DummyMetadata(image_mime_type=None, image_base64=None))
    codeflash_output = element_to_md(img)  # 1.00μs -> 708ns (41.2% faster)


def test_image_with_all_fields_none():
    # Image with all metadata fields None, should fallback to text
    img = Image(text="Alt Text", metadata=DummyMetadata())
    codeflash_output = element_to_md(img)  # 958ns -> 667ns (43.6% faster)


# --- Large Scale Test Cases ---
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

# imports
from unstructured.staging.base import element_to_md


# Minimal stubs for the element classes and their metadata to allow testing
@dataclass
class Metadata:
    text_as_html: Optional[str] = None
    image_base64: Optional[str] = None
    image_mime_type: Optional[str] = None
    image_url: Optional[str] = None


@dataclass
class Element:
    text: str
    metadata: Optional[Metadata] = None


@dataclass
class Title(Element):
    pass


@dataclass
class Table(Element):
    pass


@dataclass
class Image(Element):
    pass


# unit tests

# -------------------- BASIC TEST CASES --------------------


def test_title_to_md():
    # Test that a Title element is converted to a markdown header
    title = Title(text="My Title")
    codeflash_output = element_to_md(title)  # 1.83μs -> 2.42μs (24.2% slower)


def test_table_to_md_with_html():
    # Test that a Table with text_as_html returns the HTML string
    table = Table(
        text="Table text", metadata=Metadata(text_as_html="<table><tr><td>1</td></tr></table>")
    )
    codeflash_output = element_to_md(table)  # 1.21μs -> 958ns (26.1% faster)


def test_image_to_md_with_base64_and_mime():
    # Test that an Image with base64 and mime type returns correct markdown
    img = Image(
        text="An image", metadata=Metadata(image_base64="abc123", image_mime_type="image/png")
    )
    codeflash_output = element_to_md(img)  # 1.08μs -> 875ns (23.8% faster)


def test_image_to_md_with_base64_no_mime():
    # Test that an Image with base64 and no mime type uses image/*
    img = Image(text="No mime", metadata=Metadata(image_base64="zzz999"))
    codeflash_output = element_to_md(img)  # 959ns -> 750ns (27.9% faster)


def test_image_to_md_with_url():
    # Test that an Image with a URL returns correct markdown
    img = Image(text="Remote image", metadata=Metadata(image_url="http://example.com/img.png"))
    codeflash_output = element_to_md(img)  # 917ns -> 750ns (22.3% faster)


def test_other_element_returns_text():
    # Test that a generic Element returns its text
    el = Element(text="plain text")
    codeflash_output = element_to_md(el)  # 1.00μs -> 750ns (33.3% faster)


# -------------------- EDGE TEST CASES --------------------


def test_title_empty_text():
    # Test Title with empty string
    title = Title(text="")
    codeflash_output = element_to_md(title)  # 1.00μs -> 750ns (33.3% faster)


def test_table_with_none_metadata():
    # Table with metadata=None should fallback to .text
    table = Table(text="Fallback text", metadata=None)
    codeflash_output = element_to_md(table)  # 1.08μs -> 750ns (44.4% faster)


def test_table_with_html_empty_string():
    # Table with empty string as text_as_html
    table = Table(text="Table text", metadata=Metadata(text_as_html=""))
    codeflash_output = element_to_md(table)  # 1.00μs -> 708ns (41.2% faster)


def test_image_with_base64_and_exclude_flag():
    # Image with base64, but exclude_binary_image_data=True, should fallback to .text
    img = Image(
        text="Should not show image",
        metadata=Metadata(image_base64="abc123", image_mime_type="image/png"),
    )
    codeflash_output = element_to_md(
        img, exclude_binary_image_data=True
    )  # 1.12μs -> 875ns (28.6% faster)


def test_image_with_base64_and_url():
    # If both base64 and url are present, base64 takes precedence unless exclude_binary_image_data=True
    img = Image(
        text="Both present",
        metadata=Metadata(
            image_base64="abc123",
            image_mime_type="image/png",
            image_url="http://example.com/img.png",
        ),
    )
    # Should use base64
    codeflash_output = element_to_md(img)  # 917ns -> 667ns (37.5% faster)
    # If exclude_binary_image_data, should use URL
    codeflash_output = element_to_md(
        img, exclude_binary_image_data=True
    )  # 750ns -> 542ns (38.4% faster)


def test_image_with_only_text():
    # Image with no metadata should fallback to .text
    img = Image(text="Just text", metadata=None)
    codeflash_output = element_to_md(img)  # 958ns -> 667ns (43.6% faster)


def test_image_with_url_and_base64_none():
    # Image with url and base64=None should use url
    img = Image(text="URL only", metadata=Metadata(image_url="http://example.com/img.png"))
    codeflash_output = element_to_md(img)  # 916ns -> 667ns (37.3% faster)


def test_table_with_html_and_text():
    # Table with both text_as_html and text; should return html
    table = Table(text="Should not use this", metadata=Metadata(text_as_html="<table>...</table>"))
    codeflash_output = element_to_md(table)  # 1.00μs -> 750ns (33.3% faster)


def test_element_with_non_string_text():
    # Element with non-string text (should coerce to string if possible)
    el = Element(text=12345)
    codeflash_output = element_to_md(el)  # 1.00μs -> 750ns (33.3% faster)


def test_image_with_all_metadata_none():
    # Image with all metadata fields None
    img = Image(text="All None", metadata=Metadata())
    codeflash_output = element_to_md(img)  # 958ns -> 708ns (35.3% faster)


# -------------------- LARGE SCALE TEST CASES --------------------


def test_table_with_long_html():
    # Table with a very long HTML string
    html = "<table>" + "".join(f"<tr><td>{i}</td></tr>" for i in range(500)) + "</table>"
    table = Table(text="Long table", metadata=Metadata(text_as_html=html))
    codeflash_output = element_to_md(table)  # 1.50μs -> 1.29μs (16.1% faster)


def test_image_with_large_base64():
    # Image with a large base64 string (simulate size, not actual image data)
    base64_str = "a" * 1000
    img = Image(
        text="Large base64", metadata=Metadata(image_base64=base64_str, image_mime_type="image/png")
    )
    codeflash_output = element_to_md(img)  # 1.17μs -> 875ns (33.3% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from pathlib import Path

from unstructured.documents.elements import (
    DataSourceMetadata,
    Element,
    ElementMetadata,
    Image,
    Table,
    Title,
)
from unstructured.staging.base import element_to_md


def test_element_to_md():
    element_to_md(
        Image(
            "",
            element_id="",
            coordinates=None,
            coordinate_system=None,
            metadata=None,
            detection_origin=None,
            embeddings=[],
        ),
        exclude_binary_image_data=False,
    )


def test_element_to_md_2():
    element_to_md(
        Table(
            "",
            element_id=None,
            coordinates=None,
            coordinate_system=None,
            metadata=ElementMetadata(
                attached_to_filename="",
                bcc_recipient=[],
                category_depth=None,
                cc_recipient=None,
                coordinates=None,
                data_source=DataSourceMetadata(
                    url="",
                    version=None,
                    record_locator=None,
                    date_created=None,
                    date_modified=None,
                    date_processed=None,
                    permissions_data=None,
                ),
                detection_class_prob=None,
                emphasized_text_contents=None,
                emphasized_text_tags=[],
                file_directory=None,
                filename=Path(),
                filetype=None,
                header_footer_type=None,
                image_base64=None,
                image_mime_type="",
                image_url=None,
                image_path=None,
                is_continuation=False,
                languages=[],
                last_modified="",
                link_start_indexes=None,
                link_texts=None,
                link_urls=[],
                links=None,
                email_message_id="",
                orig_elements=[],
                page_name="",
                page_number=0,
                parent_id=None,
                sent_from=[],
                sent_to=None,
                signature=None,
                subject="",
                table_as_cells={},
                text_as_html="",
                url=None,
            ),
            detection_origin="",
            embeddings=None,
        ),
        exclude_binary_image_data=False,
    )


def test_element_to_md_3():
    element_to_md(
        Image(
            "",
            element_id="",
            coordinates=None,
            coordinate_system=None,
            metadata=ElementMetadata(
                attached_to_filename=None,
                bcc_recipient=[],
                category_depth=None,
                cc_recipient=None,
                coordinates=None,
                data_source=None,
                detection_class_prob=None,
                emphasized_text_contents=None,
                emphasized_text_tags=None,
                file_directory="\x00",
                filename="\x00",
                filetype=None,
                header_footer_type="",
                image_base64="",
                image_mime_type=None,
                image_url="",
                image_path=None,
                is_continuation=False,
                languages=[""],
                last_modified="",
                link_start_indexes=None,
                link_texts=None,
                link_urls=[],
                links=None,
                email_message_id=None,
                orig_elements=None,
                page_name=None,
                page_number=0,
                parent_id=None,
                sent_from=None,
                sent_to=[],
                signature=None,
                subject=None,
                table_as_cells={},
                text_as_html=None,
                url="",
            ),
            detection_origin=None,
            embeddings=[float("nan")],
        ),
        exclude_binary_image_data=False,
    )


def test_element_to_md_4():
    element_to_md(
        Title(
            "",
            element_id=None,
            coordinates=None,
            coordinate_system=None,
            metadata=None,
            detection_origin="",
            embeddings=[],
        ),
        exclude_binary_image_data=False,
    )


def test_element_to_md_5():
    element_to_md(
        Element(
            element_id="",
            coordinates=None,
            coordinate_system=None,
            metadata=None,
            detection_origin="",
        ),
        exclude_binary_image_data=False,
    )
🔎 Concolic Coverage Tests and Runtime

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
| --- | --- | --- | --- |
| codeflash_concolic_e8goshnj/tmpyfmgqm9s/test_concolic_coverage.py::test_element_to_md | 4.17μs | 3.54μs | 17.7% ✅ |
| codeflash_concolic_e8goshnj/tmpyfmgqm9s/test_concolic_coverage.py::test_element_to_md_2 | 958ns | 958ns | 0.000% ✅ |
| codeflash_concolic_e8goshnj/tmpyfmgqm9s/test_concolic_coverage.py::test_element_to_md_3 | 2.79μs | 2.75μs | 1.53% ✅ |
| codeflash_concolic_e8goshnj/tmpyfmgqm9s/test_concolic_coverage.py::test_element_to_md_4 | 542ns | 333ns | 62.8% ✅ |
| codeflash_concolic_e8goshnj/tmpyfmgqm9s/test_concolic_coverage.py::test_element_to_md_5 | 1.38μs | 1.04μs | 32.0% ✅ |

To edit these changes, run `git checkout codeflash/optimize-element_to_md-mje47tqi` and push.

@codeflash-ai codeflash-ai bot requested a review from aseembits93 December 20, 2025 09:48
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Dec 20, 2025