This guide covers how to use the Parallel Polars integration for DataFrame-native data enrichment.
Polars DataFrame
│
▼
parallel_enrich(df, input_columns, output_columns)
│
▼
Parallel Task Group API (batch processing)
│
▼
Polars DataFrame with new columns
The integration processes all rows in a single batch for efficiency, then adds the enriched columns back to your DataFrame.
- Python 3.12+
- Parallel API Key from platform.parallel.ai
pip install "parallel-web-tools[polars]"

Or with all dependencies:

pip install "parallel-web-tools[all]"

import polars as pl
from parallel_web_tools.integrations.polars import parallel_enrich
# Create a DataFrame
df = pl.DataFrame({
    "company": ["Google", "Microsoft", "Apple"],
    "website": ["google.com", "microsoft.com", "apple.com"],
})

# Enrich with company information
result = parallel_enrich(
    df,
    input_columns={
        "company_name": "company",
        "website": "website",
    },
    output_columns=[
        "CEO name",
        "Founding year",
        "Headquarters city",
    ],
)
# Access the enriched DataFrame
print(result.result)
print(f"Success: {result.success_count}, Errors: {result.error_count}")

Output:

shape: (3, 5)
┌───────────┬───────────────┬─────────────────┬──────────────┬──────────────────┐
│ company ┆ website ┆ ceo_name ┆ founding_year┆ headquarters_city│
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ str │
╞═══════════╪═══════════════╪═════════════════╪══════════════╪══════════════════╡
│ Google ┆ google.com ┆ Sundar Pichai ┆ 1998 ┆ Mountain View │
│ Microsoft ┆ microsoft.com ┆ Satya Nadella ┆ 1975 ┆ Redmond │
│ Apple ┆ apple.com ┆ Tim Cook ┆ 1976 ┆ Cupertino │
└───────────┴───────────────┴─────────────────┴──────────────┴──────────────────┘
Success: 3, Errors: 0
Set your API key via environment variable:
export PARALLEL_API_KEY="your-api-key"

Or pass it directly:

result = parallel_enrich(
    df,
    input_columns={"company_name": "company"},
    output_columns=["CEO name"],
    api_key="your-api-key",
)

def parallel_enrich(
    df: pl.DataFrame,
    input_columns: dict[str, str],
    output_columns: list[str],
    api_key: str | None = None,
    processor: str = "lite-fast",
    timeout: int = 600,
    include_basis: bool = False,
) -> EnrichmentResult

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| df | pl.DataFrame | required | DataFrame to enrich |
| input_columns | dict[str, str] | required | Mapping of input descriptions to column names |
| output_columns | list[str] | required | List of output column descriptions |
| api_key | str \| None | None | API key (uses env var if not provided) |
| processor | str | "lite-fast" | Parallel processor to use |
| timeout | int | 600 | Timeout in seconds |
| include_basis | bool | False | Include citations in results |
Returns: EnrichmentResult
@dataclass
class EnrichmentResult:
    result: pl.DataFrame          # Enriched DataFrame
    success_count: int            # Number of successful rows
    error_count: int              # Number of failed rows
    errors: list[dict[str, Any]]  # Error details
    elapsed_time: float           # Processing time in seconds

parallel_enrich_lazy() works the same way as parallel_enrich() but accepts a pl.LazyFrame. It collects the LazyFrame before processing.
import polars as pl
from parallel_web_tools.integrations.polars import parallel_enrich
df = pl.DataFrame({
    "name": ["Tesla", "SpaceX", "Neuralink"],
})

result = parallel_enrich(
    df,
    input_columns={"company_name": "name"},
    output_columns=[
        "CEO name",
        "Industry",
        "Year founded",
        "Headquarters",
    ],
)
print(result.result)

df = pl.DataFrame({
    "company": ["Acme Corp"],
    "domain": ["acme.com"],
    "location": ["San Francisco, CA"],
})

result = parallel_enrich(
    df,
    input_columns={
        "company_name": "company",
        "website": "domain",
        "headquarters": "location",
    },
    output_columns=[
        "Number of employees",
        "Annual revenue (USD)",
        "Main products",
    ],
)

# Fast, basic metadata
result = parallel_enrich(df, ..., processor="lite-fast")
# Standard enrichments
result = parallel_enrich(df, ..., processor="base-fast")
# Deep research
result = parallel_enrich(df, ..., processor="pro-fast")

result = parallel_enrich(
    df,
    input_columns={"company_name": "company"},
    output_columns=["CEO name"],
    include_basis=True,
)

# Access citations
for row in result.result.iter_rows(named=True):
    print(f"CEO: {row['ceo_name']}")
    print(f"Sources: {row['_basis']}")

result = parallel_enrich(df, ...)
if result.error_count > 0:
    print(f"Failed rows: {result.error_count}")
    for error in result.errors:
        print(f"  Row {error['row']}: {error['error']}")

# Filter successful rows only
successful_df = result.result.filter(
    pl.col("ceo_name").is_not_null()
)

# Read from CSV lazily
lf = pl.scan_csv("companies.csv")
# Filter and select
lf = lf.filter(pl.col("active") == True).select(["name", "website"])
# Enrich (will collect the LazyFrame)
from parallel_web_tools.integrations.polars import parallel_enrich_lazy

result = parallel_enrich_lazy(
    lf,
    input_columns={"company_name": "name", "website": "website"},
    output_columns=["CEO name"],
)

For large datasets, consider processing in batches:
def enrich_in_batches(df: pl.DataFrame, batch_size: int = 100):
    """Process large DataFrames in batches."""
    results = []
    for i in range(0, len(df), batch_size):
        batch = df.slice(i, batch_size)
        result = parallel_enrich(
            batch,
            input_columns={"company_name": "company"},
            output_columns=["CEO name"],
        )
        results.append(result.result)
    return pl.concat(results)

| Processor | Speed | Cost | Best For |
|---|---|---|---|
| lite, lite-fast | Fastest | ~$0.005/row | Basic metadata, high volume |
| base, base-fast | Fast | ~$0.01/row | Standard enrichments |
| core, core-fast | Medium | ~$0.025/row | Cross-referenced data |
| pro, pro-fast | Slow | ~$0.10/row | Deep research |
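To budget a run before launching it, the approximate per-row prices in the processor table can be multiplied out. The sketch below is illustrative only: `APPROX_COST_PER_ROW` and `estimate_cost` are hypothetical names, the prices are the rough figures from the table, and it assumes each `-fast` variant costs the same as its base tier because the table groups them together. Actual billing may differ.

```python
# Approximate per-row prices in USD, copied from the processor table.
APPROX_COST_PER_ROW = {
    "lite": 0.005,
    "base": 0.01,
    "core": 0.025,
    "pro": 0.10,
}


def estimate_cost(n_rows: int, processor: str = "lite-fast") -> float:
    """Rough cost estimate in USD for enriching n_rows rows."""
    # Assumption: "-fast" variants are priced like their base tier.
    tier = processor.removesuffix("-fast")
    if tier not in APPROX_COST_PER_ROW:
        raise ValueError(f"Unknown processor: {processor!r}")
    return n_rows * APPROX_COST_PER_ROW[tier]


# Compare the tiers for a 1,000-row DataFrame.
for name in ("lite-fast", "base-fast", "core-fast", "pro-fast"):
    print(f"{name}: ~${estimate_cost(1000, name):.2f} per 1,000 rows")
```

Passing `len(df)` as `n_rows` gives a quick order-of-magnitude check before choosing a processor.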
Output columns are automatically converted to valid Python identifiers:
| Description | Column Name |
|---|---|
| "CEO name" | ceo_name |
| "Founding year (YYYY)" | founding_year |
| "Annual revenue [USD]" | annual_revenue |
| "2024 Revenue" | col_2024_revenue |
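The conversion in the table can be approximated with a few lines of Python. This is a minimal sketch that reproduces the four examples shown, not the library's actual implementation. `to_column_name` is a hypothetical name, and the real rules may handle edge cases differently.

```python
import re


def to_column_name(description: str) -> str:
    """Hypothetical approximation of the description-to-column-name rule."""
    # Drop parenthesized or bracketed qualifiers such as "(YYYY)" or "[USD]".
    text = re.sub(r"\([^)]*\)|\[[^\]]*\]", " ", description)
    # Lowercase, then collapse each run of other characters into "_".
    name = re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")
    # Python identifiers cannot start with a digit, so prefix those.
    if name and name[0].isdigit():
        name = f"col_{name}"
    return name


for description in ("CEO name", "Founding year (YYYY)", "2024 Revenue"):
    print(f"{description!r} -> {to_column_name(description)}")
```

Knowing the resulting name in advance is useful when filtering or selecting enriched columns, for example `pl.col("ceo_name")`.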
# Good - specific descriptions
output_columns = [
    "CEO name (current CEO or equivalent leader)",
    "Founding year (YYYY format)",
    "Annual revenue (USD, most recent fiscal year)",
]

# Less specific - may get inconsistent results
output_columns = ["CEO", "Year", "Revenue"]

- High volume, basic data: Use lite-fast
- Standard company info: Use base-fast
- Research-quality data: Use pro-fast
import logging

logger = logging.getLogger(__name__)

result = parallel_enrich(df, ...)

# Check for errors before using results
if result.error_count > 0:
    logger.warning(f"{result.error_count} rows failed enrichment")
# Errors don't stop processing - partial results are returned

The integration processes all rows in a single batch. For very large datasets (1000+ rows), consider:

- Processing in smaller batches
- Using the lite-fast processor for faster results
- Increasing the timeout for large batches
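When splitting a large DataFrame into smaller batches, it can help to see which slices will be sent before any API call is made. The following plain-Python sketch uses an assumed name (`batch_slices` is hypothetical, not part of the library). It computes `(offset, length)` pairs that cover every row once, with a shorter final batch when the row count is not a multiple of the batch size.

```python
def batch_slices(n_rows: int, batch_size: int = 100) -> list[tuple[int, int]]:
    """Return (offset, length) pairs covering n_rows rows in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [
        (offset, min(batch_size, n_rows - offset))
        for offset in range(0, n_rows, batch_size)
    ]


# Plan the batches for a 250-row DataFrame.
for offset, length in batch_slices(250, batch_size=100):
    print(f"rows {offset}..{offset + length - 1}")
```

Each pair can be passed to `df.slice(offset, length)`, and the resulting batch enriched with `parallel_enrich`.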
Ensure that the values in input_columns match the column names in your DataFrame exactly (names are case-sensitive):
# Wrong - column name doesn't exist
input_columns={"company_name": "Company"} # Capital C
# Correct
input_columns={"company_name": "company"}  # Lowercase

Increase the timeout for large batches:

result = parallel_enrich(
    df,
    ...,
    timeout=1200,  # 20 minutes
)

Check your API key:
# Verify env var is set
echo $PARALLEL_API_KEY
# Or pass directly
result = parallel_enrich(..., api_key="your-key")

- See the demo notebook for more examples
- Check Parallel Documentation for API details
- View parallel-web-tools on GitHub