From f8789c82794fe903bc81be72d540c657a3d910a0 Mon Sep 17 00:00:00 2001 From: David Orel Date: Sun, 24 May 2026 17:56:36 +0200 Subject: [PATCH 01/12] doc: Refactor documentation markdown formatting across all docs Standardize markdown syntax across all documentation files: - Replace YAML-style separators (----) with three-dash (---) format - Consistently format titles with single quotes for proper escaping - Normalize code block delimiters and structure - Apply consistent heading and separator styles - No content changes, purely formatting improvements --- docs/README.md | 188 +- docs/_en/api/cache.md | 308 +-- docs/_en/api/core.md | 548 +++--- docs/_en/api/decorators.md | 513 ++--- docs/_en/api/execution.md | 546 +++--- docs/_en/api/index.md | 52 +- docs/_en/api/storage.md | 608 +++--- docs/_en/api/tracking.md | 634 +++---- docs/_en/api/websocket.md | 2 +- docs/_en/examples/api-example.md | 736 ++++---- docs/_en/examples/dag_visualization_demo.md | 2 +- docs/_en/examples/dataflow-audio-pipeline.md | 546 +++--- docs/_en/examples/index.md | 180 +- docs/_en/examples/nicegui-dag-demo.md | 2 +- docs/_en/examples/quickstart.md | 344 ++-- docs/_en/examples/registry-discovery.md | 576 +++--- docs/_en/examples/resource_aware_demo.md | 502 ++--- docs/_en/examples/retry_demo.md | 504 ++--- docs/_en/examples/scheduled-pipeline.md | 398 ++-- docs/_en/examples/secure_api_example.md | 2 +- docs/_en/examples/tracking-demo.md | 366 ++-- docs/_en/examples/websocket-demo.md | 2 +- docs/_en/guides/api.md | 1549 ++++++++-------- docs/_en/guides/cache.md | 602 +++--- docs/_en/guides/execution.md | 1028 +++++----- docs/_en/guides/index.md | 54 +- docs/_en/guides/migration.md | 904 ++++----- docs/_en/guides/performance.md | 1140 ++++++------ docs/_en/guides/pipelines.md | 1088 +++++------ docs/_en/guides/retry.md | 970 +++++----- docs/_en/guides/scheduling.md | 1750 +++++++++--------- docs/_en/guides/security.md | 24 +- docs/_en/guides/tasks.md | 982 +++++----- docs/_en/guides/tracking.md 
| 1076 +++++------ docs/_en/guides/websocket.md | 1552 ++++++++-------- docs/_en/index.md | 54 +- docs/_en/quickstart.md | 774 ++++---- docs/_fr/api/cache.md | 302 +-- docs/_fr/api/core.md | 542 +++--- docs/_fr/api/decorators.md | 516 +++--- docs/_fr/api/execution.md | 546 +++--- docs/_fr/api/index.md | 52 +- docs/_fr/api/storage.md | 408 ++-- docs/_fr/api/tracking.md | 714 +++---- docs/_fr/api/websocket.md | 2 +- docs/_fr/examples/api-example.md | 736 ++++---- docs/_fr/examples/dag_visualization_demo.md | 2 +- docs/_fr/examples/dataflow-audio-pipeline.md | 546 +++--- docs/_fr/examples/index.md | 180 +- docs/_fr/examples/nicegui-dag-demo.md | 2 +- docs/_fr/examples/quickstart.md | 344 ++-- docs/_fr/examples/registry-discovery.md | 576 +++--- docs/_fr/examples/resource_aware_demo.md | 502 ++--- docs/_fr/examples/retry_demo.md | 488 ++--- docs/_fr/examples/scheduled-pipeline.md | 396 ++-- docs/_fr/examples/secure_api_example.md | 2 +- docs/_fr/examples/tracking-demo.md | 364 ++-- docs/_fr/examples/websocket-demo.md | 2 +- docs/_fr/guides/api.md | 1698 ++++++++--------- docs/_fr/guides/cache.md | 492 ++--- docs/_fr/guides/execution.md | 1028 +++++----- docs/_fr/guides/index.md | 54 +- docs/_fr/guides/performance.md | 1120 +++++------ docs/_fr/guides/pipelines.md | 1174 ++++++------ docs/_fr/guides/retry.md | 970 +++++----- docs/_fr/guides/scheduling.md | 1730 ++++++++--------- docs/_fr/guides/security.md | 24 +- docs/_fr/guides/tasks.md | 996 +++++----- docs/_fr/guides/tracking.md | 1076 +++++------ docs/_fr/guides/websocket.md | 2 +- docs/_fr/index.md | 54 +- docs/_fr/quickstart.md | 766 ++++---- 72 files changed, 19868 insertions(+), 19644 deletions(-) diff --git a/docs/README.md b/docs/README.md index 7b1cb5b..61a30a8 100644 --- a/docs/README.md +++ b/docs/README.md @@ -1,94 +1,94 @@ ---- -title: Taskiq-Flow Documentation -nav_order: 1 -permalink: / ---- -# Taskiq-Flow Documentation - -> **Version**: {VERSION} | Last updated: 2026-05-05 - -Welcome to the official 
documentation for **Taskiq-Flow**, a powerful Python library for orchestrating asynchronous task workflows with pipelines, dataflow DAGs, real-time tracking, and distributed scheduling. - -## Language Selection - -This documentation is available in two languages: - -- **[ English Documentation]({{ '/en/' | relative_url }})** — Complete technical documentation in English (source language) -- **[ Documentation Française]({{ '/fr/' | relative_url }})** — Traduction française complète - -Both versions are kept synchronized, with code examples remaining in English for consistency. - -## Documentation Structure - -The documentation is organized into the following sections: - -### Getting Started -- **[Installation]({{ '/en/guides/installation/' | relative_url }})** — Setup instructions and configuration -- **[Quick Start]({{ '/en/quickstart/' | relative_url }})** — 5-minute tutorial to run your first pipeline -- **[Core Concepts]({{ '/en/guides/core-concepts/' | relative_url }})** — Understanding pipeline types and patterns - -### User Guides -- **[Pipelines]({{ '/en/guides/pipelines/' | relative_url }})** — Sequential and dataflow pipeline patterns -- **[Tasks]({{ '/en/guides/tasks/' | relative_url }})** — Task definition, decorators, and metadata -- **[Execution]({{ '/en/guides/execution/' | relative_url }})** — Execution models and error handling -- **[Tracking & Monitoring]({{ '/en/guides/tracking/' | relative_url }})** — Real-time progress and status -- **[WebSocket]({{ '/en/guides/websocket/' | relative_url }})** — Live event streaming -- **[Scheduling]({{ '/en/guides/scheduling/' | relative_url }})** — Cron-based pipeline scheduling -- **[Retry]({{ '/en/guides/retry/' | relative_url }})** — Error recovery and retry strategies -- **[Performance]({{ '/en/guides/performance/' | relative_url }})** — Optimization and scaling -- **[API (REST)]({{ '/en/guides/api/' | relative_url }})** — FastAPI integration and endpoints - -### API Reference -- **[Core API]({{ 
'/en/api/core/' | relative_url }})** — Pipeline, DataflowPipeline, middleware -- **[Decorators]({{ '/en/api/decorators/' | relative_url }})** — @pipeline_task and utilities -- **[Execution]({{ '/en/api/execution/' | relative_url }})** — ExecutionEngine, DAG, DAGBuilder -- **[Tracking]({{ '/en/api/tracking/' | relative_url }})** — TrackingManager and storage backends -- **[WebSocket]({{ '/en/api/websocket/' | relative_url }})** — HookManager and event system - -### Examples -- **[Example Gallery]({{ '/en/examples/' | relative_url }})** — Walkthroughs of all example scripts - - Basic pipeline - - Tracking demo - - Scheduled pipeline - - Dataflow audio pipeline - - Registry discovery - - WebSocket demo - - REST API - -## Quick Overview - -Taskiq-Flow combines **taskiq-pipelines' orchestration** with **pipefunc's dataflow model**: - -- **Sequential Pipelines** — Linear workflows with `.call_next()`, `.map()`, `.filter()`, `.group()` -- **Dataflow Pipelines** — Automatic DAG construction from task dependencies using `@pipeline_task` -- **Real-time Tracking** — Monitor execution with PipelineTrackingManager -- **WebSocket Streaming** — Live events for dashboard integration -- **Scheduling** — Cron-based pipeline execution with APScheduler -- **REST API** — FastAPI endpoints for remote management -- **Parallel Execution** — Automatic concurrency for independent tasks -- **Map-Reduce** — Built-in batch processing helpers - -## Read the Guides - -**[→ Getting Started (English)]({{ '/en/quickstart/' | relative_url }})** - -**[→ Commencer (Français)]({{ '/fr/quickstart/' | relative_url }})** - -## Quick Links - -- **Project Repository**: [GitHub - taskiq-flow](https://github.com/dorel14/taskiq-flow) -- **PyPI Package**: [taskiq-flow](https://pypi.org/project/taskiq-flow/) -- **Taskiq Documentation**: [taskiq-python.github.io](https://taskiq-python.github.io/) -- **Issue Tracker**: [GitHub Issues](https://github.com/dorel14/taskiq-flow/issues) - -## Contributing - 
-Contributions are welcome! Please read our [contributing guide](https://github.com/dorel14/taskiq-flow/blob/master/CONTRIBUTING.md) for details on how to submit pull requests, add features, or report bugs. - -## License - -This project is licensed under the MIT License — see the [LICENSE](https://github.com/dorel14/taskiq-flow/blob/main/LICENSE) file for details. - ---- - -*Maintained by SoniqueBay Team · Documentation last built: 2026-05-05* +--- +title: Taskiq-Flow Documentation +nav_order: 1 +permalink: / +--- +# Taskiq-Flow Documentation + +> **Version**: {VERSION} | Last updated: 2026-05-05 + +Welcome to the official documentation for **Taskiq-Flow**, a powerful Python library for orchestrating asynchronous task workflows with pipelines, dataflow DAGs, real-time tracking, and distributed scheduling. + +## Language Selection + +This documentation is available in two languages: + +- **[ English Documentation]({{ '/en/' | relative_url }})** — Complete technical documentation in English (source language) +- **[ Documentation Française]({{ '/fr/' | relative_url }})** — Traduction française complète + +Both versions are kept synchronized, with code examples remaining in English for consistency. 
+ +## Documentation Structure + +The documentation is organized into the following sections: + +### Getting Started +- **[Installation]({{ '/en/guides/installation/' | relative_url }})** — Setup instructions and configuration +- **[Quick Start]({{ '/en/quickstart/' | relative_url }})** — 5-minute tutorial to run your first pipeline +- **[Core Concepts]({{ '/en/guides/core-concepts/' | relative_url }})** — Understanding pipeline types and patterns + +### User Guides +- **[Pipelines]({{ '/en/guides/pipelines/' | relative_url }})** — Sequential and dataflow pipeline patterns +- **[Tasks]({{ '/en/guides/tasks/' | relative_url }})** — Task definition, decorators, and metadata +- **[Execution]({{ '/en/guides/execution/' | relative_url }})** — Execution models and error handling +- **[Tracking & Monitoring]({{ '/en/guides/tracking/' | relative_url }})** — Real-time progress and status +- **[WebSocket]({{ '/en/guides/websocket/' | relative_url }})** — Live event streaming +- **[Scheduling]({{ '/en/guides/scheduling/' | relative_url }})** — Cron-based pipeline scheduling +- **[Retry]({{ '/en/guides/retry/' | relative_url }})** — Error recovery and retry strategies +- **[Performance]({{ '/en/guides/performance/' | relative_url }})** — Optimization and scaling +- **[API (REST)]({{ '/en/guides/api/' | relative_url }})** — FastAPI integration and endpoints + +### API Reference +- **[Core API]({{ '/en/api/core/' | relative_url }})** — Pipeline, DataflowPipeline, middleware +- **[Decorators]({{ '/en/api/decorators/' | relative_url }})** — @pipeline_task and utilities +- **[Execution]({{ '/en/api/execution/' | relative_url }})** — ExecutionEngine, DAG, DAGBuilder +- **[Tracking]({{ '/en/api/tracking/' | relative_url }})** — TrackingManager and storage backends +- **[WebSocket]({{ '/en/api/websocket/' | relative_url }})** — HookManager and event system + +### Examples +- **[Example Gallery]({{ '/en/examples/' | relative_url }})** — Walkthroughs of all example scripts + - Basic 
pipeline + - Tracking demo + - Scheduled pipeline + - Dataflow audio pipeline + - Registry discovery + - WebSocket demo + - REST API + +## Quick Overview + +Taskiq-Flow combines **taskiq-pipelines' orchestration** with **pipefunc's dataflow model**: + +- **Sequential Pipelines** — Linear workflows with `.call_next()`, `.map()`, `.filter()`, `.group()` +- **Dataflow Pipelines** — Automatic DAG construction from task dependencies using `@pipeline_task` +- **Real-time Tracking** — Monitor execution with PipelineTrackingManager +- **WebSocket Streaming** — Live events for dashboard integration +- **Scheduling** — Cron-based pipeline execution with APScheduler +- **REST API** — FastAPI endpoints for remote management +- **Parallel Execution** — Automatic concurrency for independent tasks +- **Map-Reduce** — Built-in batch processing helpers + +## Read the Guides + +**[→ Getting Started (English)]({{ '/en/quickstart/' | relative_url }})** + +**[→ Commencer (Français)]({{ '/fr/quickstart/' | relative_url }})** + +## Quick Links + +- **Project Repository**: [GitHub - taskiq-flow](https://github.com/dorel14/taskiq-flow) +- **PyPI Package**: [taskiq-flow](https://pypi.org/project/taskiq-flow/) +- **Taskiq Documentation**: [taskiq-python.github.io](https://taskiq-python.github.io/) +- **Issue Tracker**: [GitHub Issues](https://github.com/dorel14/taskiq-flow/issues) + +## Contributing + +Contributions are welcome! Please read our [contributing guide](https://github.com/dorel14/taskiq-flow/blob/master/CONTRIBUTING.md) for details on how to submit pull requests, add features, or report bugs. + +## License + +This project is licensed under the MIT License — see the [LICENSE](https://github.com/dorel14/taskiq-flow/blob/main/LICENSE) file for details. 
+ +--- + +*Maintained by SoniqueBay Team · Documentation last built: 2026-05-05* diff --git a/docs/_en/api/cache.md b/docs/_en/api/cache.md index d4eafc8..9db9d55 100644 --- a/docs/_en/api/cache.md +++ b/docs/_en/api/cache.md @@ -1,154 +1,154 @@ ---- -title: API Reference: Cache -nav_order: 32 -color_scheme: dark ---- -# API Reference: Cache - -**Dogpile-based caching with cache stampede sémantics** - -> **Version**: {VERSION} | **New in v1.2.0** | **Module**: `taskiq_flow.cache`, `taskiq_flow.middlewares.cache` - ---- - -## Overview - -Taskiq-Flow v1.2.0 introduces a **cache layer** for workers built around the **Dogpile pattern**. The key principle: when a cached entry expires, only one thread/process is allowed to regenerate it. All other requestors wait and then pick up the fresh value — eliminating the stampede. - -``` -Concurrent requests at TTL expiry: - -Without Dogpile: [task runs × 10 simultaneously] → overload -With Dogpile: [1 task runs, 9 wait] → one result shared -``` - ---- - -## `BaseCacheAdapter` (ABC) - -```python -from taskiq_flow.storage.base import BaseCacheAdapter - -class MyCacheAdapter(BaseCacheAdapter): - async def get_or_create(self, key, creator, ttl_seconds=3600) -> Any: ... - async def get(self, key) -> Any | None: ... - async def set(self, key, value, ttl_seconds=3600) -> None: ... - async def invalidate(self, key) -> bool: ... - async def clear(self) -> None: ... - def get_stats(self) -> dict: ... -``` - -Abstract interface; implement it to integrate a new cache backend. - -| Method | Dogpile-Safe? 
| Description | -|--------|--------------|-------------| -| `get_or_create(key, creator, ttl)` | **Yes** | Read-through with lock: creates via `creator()` only if missing/expired, atomically | -| `get(key)` | Read side | Cache read; `None` on miss | -| `set(key, value, ttl)` | Write side | Store with optional TTL seconds | -| `invalidate(key)` | — | Evict a single entry immediately | -| `clear()` | — | Flush the entire cache | -| `get_stats()` | — | Return `{"hits", "misses", "hit_rate", "size", "keys"}` | - ---- - -## `InMemoryCacheAdapter` - -```python -from taskiq_flow.cache import InMemoryCacheAdapter - -cache = InMemoryCacheAdapter() - -result = await cache.get_or_create( - "expensive_computation", - lambda: compute_expensive(), - ttl_seconds=300, -) -print(cache.get_stats()) -# {"hits": 42, "misses": 3, "hit_rate": 0.93, "size": 41, "keys": [...]} -``` - -| Feature | Detail | -|---------|--------| -| Thread safety | Per-key `threading.Lock` | -| TTL | Monotonic clock; no system-clock dependency | -| Dogpile lock | Lock released only when creator finishes | -| Async creator | `creator()` may return a coroutine; it is awaited automatically | -| Stats | `get_stats()`: hits, misses, hit_rate, size, keys | - ---- - -## `RedisCacheAdapter` - -```python -from taskiq_flow.cache import RedisCacheAdapter - -cache = RedisCacheAdapter( - redis_url="redis://localhost:6379", - default_ttl=3600, - lock_timeout=10, -) -result = await cache.get_or_create("shared_computation", - lambda: compute_expensive(), - ttl_seconds=300) -``` - -Distributed cache with Redis-backed Dogpile locking. 
- -| Feature | Detail | -|---------|--------| -| Distributed Dogpile lock | `SETNX`-based; multiple workers safely share one entry | -| Native Redis TTL | `EXPIRE` per key | -| JSON serialization | Automatic for non-primitive types | -| Lock timeout | Configurable; prevents deadlocks if a worker crashes mid-generation | - ---- - -## `CacheMiddleware` - -```python -from taskiq_flow.middlewares import CacheMiddleware -from taskiq_flow.cache import InMemoryCacheAdapter - -broker.add_middlewares( - PipelineMiddleware(), - CacheMiddleware(cache=InMemoryCacheAdapter(), default_ttl=3600), -) -``` - -`CacheMiddleware` is the production-ready way to enable caching on a broker. It hooks into both `pre_execute` and `post_save`: - -- **`pre_execute`** — Returns cached result if present; task is skipped (short-circuited). -- **`post_save`** — Stores the successful result in cache for next time. - -| Constructor Parameter | Type | Default | Description | -|----------------------|------|---------|-------------| -| `cache` | `BaseCacheAdapter \| None` | `None` | Cache backend; `None` → `InMemoryCacheAdapter` | -| `enabled` | `bool` | `True` | Global toggle | -| `default_ttl` | `int` | `3600` | Default cache lifetime in seconds | - -**Per-task label overrides:** - -| Message Label | Values | Effect | -|---------------|--------|--------| -| `cache_ttl` | integer seconds | Override TTL for this task execution | -| `cache_errors` | `"true"` | Cache failed (error) results too | - ---- - -## Choosing a Cache Backend - -| Backend | When to use | -|---------|-------------| -| `InMemoryCacheAdapter` | Development, tests, single-worker | -| `RedisCacheAdapter` | Production, multi-worker, distributed | - ---- - -## Further Reading - -- **[Storage & Cache Middleware Guide]({{ '/en/guides/cache/' | relative_url }})** — Full middleware configuration -- **[Storage API Reference]({{ '/en/api/storage/' | relative_url }})** — Storage adapters - ---- - -*New in v1.2.0. 
Cache adapters are async and interchangeable at instantiation time.* +--- +title: 'API Reference: Cache' +nav_order: 32 +color_scheme: dark +--- +# API Reference: Cache + +**Dogpile-based caching with cache stampede sémantics** + +> **Version**: {VERSION} | **New in v1.2.0** | **Module**: `taskiq_flow.cache`, `taskiq_flow.middlewares.cache` + +--- + +## Overview + +Taskiq-Flow v1.2.0 introduces a **cache layer** for workers built around the **Dogpile pattern**. The key principle: when a cached entry expires, only one thread/process is allowed to regenerate it. All other requestors wait and then pick up the fresh value — eliminating the stampede. + +``` +Concurrent requests at TTL expiry: + +Without Dogpile: [task runs × 10 simultaneously] → overload +With Dogpile: [1 task runs, 9 wait] → one result shared +``` + +--- + +## `BaseCacheAdapter` (ABC) + +```python +from taskiq_flow.storage.base import BaseCacheAdapter + +class MyCacheAdapter(BaseCacheAdapter): + async def get_or_create(self, key, creator, ttl_seconds=3600) -> Any: ... + async def get(self, key) -> Any | None: ... + async def set(self, key, value, ttl_seconds=3600) -> None: ... + async def invalidate(self, key) -> bool: ... + async def clear(self) -> None: ... + def get_stats(self) -> dict: ... +``` + +Abstract interface; implement it to integrate a new cache backend. + +| Method | Dogpile-Safe? 
| Description | +|--------|--------------|-------------| +| `get_or_create(key, creator, ttl)` | **Yes** | Read-through with lock: creates via `creator()` only if missing/expired, atomically | +| `get(key)` | Read side | Cache read; `None` on miss | +| `set(key, value, ttl)` | Write side | Store with optional TTL seconds | +| `invalidate(key)` | — | Evict a single entry immediately | +| `clear()` | — | Flush the entire cache | +| `get_stats()` | — | Return `{"hits", "misses", "hit_rate", "size", "keys"}` | + +--- + +## `InMemoryCacheAdapter` + +```python +from taskiq_flow.cache import InMemoryCacheAdapter + +cache = InMemoryCacheAdapter() + +result = await cache.get_or_create( + "expensive_computation", + lambda: compute_expensive(), + ttl_seconds=300, +) +print(cache.get_stats()) +# {"hits": 42, "misses": 3, "hit_rate": 0.93, "size": 41, "keys": [...]} +``` + +| Feature | Detail | +|---------|--------| +| Thread safety | Per-key `threading.Lock` | +| TTL | Monotonic clock; no system-clock dependency | +| Dogpile lock | Lock released only when creator finishes | +| Async creator | `creator()` may return a coroutine; it is awaited automatically | +| Stats | `get_stats()`: hits, misses, hit_rate, size, keys | + +--- + +## `RedisCacheAdapter` + +```python +from taskiq_flow.cache import RedisCacheAdapter + +cache = RedisCacheAdapter( + redis_url="redis://localhost:6379", + default_ttl=3600, + lock_timeout=10, +) +result = await cache.get_or_create("shared_computation", + lambda: compute_expensive(), + ttl_seconds=300) +``` + +Distributed cache with Redis-backed Dogpile locking. 
+ +| Feature | Detail | +|---------|--------| +| Distributed Dogpile lock | `SETNX`-based; multiple workers safely share one entry | +| Native Redis TTL | `EXPIRE` per key | +| JSON serialization | Automatic for non-primitive types | +| Lock timeout | Configurable; prevents deadlocks if a worker crashes mid-generation | + +--- + +## `CacheMiddleware` + +```python +from taskiq_flow.middlewares import CacheMiddleware +from taskiq_flow.cache import InMemoryCacheAdapter + +broker.add_middlewares( + PipelineMiddleware(), + CacheMiddleware(cache=InMemoryCacheAdapter(), default_ttl=3600), +) +``` + +`CacheMiddleware` is the production-ready way to enable caching on a broker. It hooks into both `pre_execute` and `post_save`: + +- **`pre_execute`** — Returns cached result if present; task is skipped (short-circuited). +- **`post_save`** — Stores the successful result in cache for next time. + +| Constructor Parameter | Type | Default | Description | +|----------------------|------|---------|-------------| +| `cache` | `BaseCacheAdapter \| None` | `None` | Cache backend; `None` → `InMemoryCacheAdapter` | +| `enabled` | `bool` | `True` | Global toggle | +| `default_ttl` | `int` | `3600` | Default cache lifetime in seconds | + +**Per-task label overrides:** + +| Message Label | Values | Effect | +|---------------|--------|--------| +| `cache_ttl` | integer seconds | Override TTL for this task execution | +| `cache_errors` | `"true"` | Cache failed (error) results too | + +--- + +## Choosing a Cache Backend + +| Backend | When to use | +|---------|-------------| +| `InMemoryCacheAdapter` | Development, tests, single-worker | +| `RedisCacheAdapter` | Production, multi-worker, distributed | + +--- + +## Further Reading + +- **[Storage & Cache Middleware Guide]({{ '/en/guides/cache/' | relative_url }})** — Full middleware configuration +- **[Storage API Reference]({{ '/en/api/storage/' | relative_url }})** — Storage adapters + +--- + +*New in v1.2.0. 
Cache adapters are async and interchangeable at instantiation time.* diff --git a/docs/_en/api/core.md b/docs/_en/api/core.md index 1bfdaf3..3efe552 100644 --- a/docs/_en/api/core.md +++ b/docs/_en/api/core.md @@ -1,271 +1,277 @@ ---- -permalink: /en/api/core/ -title: API Reference: Core Components -nav_order: 30 -color_scheme: dark ---- -# API Reference: Core Components - -**Pipeline, DataflowPipeline, PipelineMiddleware, PipelineContext, and core exceptions** - -> **Version**: {VERSION} | **Module**: `taskiq_flow.core`, `taskiq_flow.pipeline`, `taskiq_flow.middleware` - ---- - -## Core Classes - -### Pipeline (SequentialPipeline) - -The classic sequential pipeline for linear task orchestration. - -```python -from taskiq_flow import Pipeline - -pipeline = Pipeline(broker) -``` - -**Constructor**: -```python -Pipeline( - broker: BaseBroker, - max_parallel: int = None, # Global parallelism limit - timeout: float = None, # Overall timeout in seconds - pipeline_id: str = None # Auto-generated if not provided -) -``` - -**Methods**: - -| Method | Signature | Description | -|--------|-----------|-------------| -| `call_next` | `call_next(task, *args, **kwargs) -> Pipeline` | Chain a task; passes previous result as first arg | -| `call_after` | `call_after(task, *args, **kwargs) -> Pipeline` | Run task without consuming previous result | -| `map` | `map(task, max_parallel=None, output_name=None) -> Pipeline` | Apply task to each element of iterable result | -| `filter` | `filter(task) -> Pipeline` | Keep elements where task returns truthy | -| `group` | `group(tasks, param_names=None) -> Pipeline` | Run multiple tasks in parallel from same input | -| `kiq` | `kiq(*args, **kwargs) -> Task` | Start pipeline execution | -| `with_tracking` | `with_tracking(tracking_manager) -> Pipeline` | Attach tracking manager | -| `with_hooks` | `with_hooks(hook_manager) -> Pipeline` | Attach hook manager for events | -| `with_retry` | `with_retry(...) 
-> Pipeline` | Configure retry policy | -| `with_timeout` | `with_timeout(seconds) -> Pipeline` | Set timeout | -| `with_context` | `with_context(enable=True) -> Pipeline` | Enable passing PipelineContext to tasks | - -**Example**: -```python -pipeline = ( - Pipeline(broker) - .call_next(task1) - .call_next(task2, factor=2) - .map(task3, max_parallel=10) - .filter(validate) - .with_tracking(tracking) -) -result = await pipeline.kiq(initial_input) -``` - ---- - -### DataflowPipeline - -Automatic DAG construction from task dependencies using `@pipeline_task` decorators. - -```python -from taskiq_flow import DataflowPipeline - -pipeline = DataflowPipeline.from_tasks( - broker, - [task_a, task_b, task_c] -) -``` - -**Constructor**: -```python -DataflowPipeline( - broker: BaseBroker, - tasks: list[Callable] = None, - max_parallel: int = None, - timeout: float = None, - pipeline_id: str = None -) -``` - -**Class Methods**: - -| Method | Description | -|--------|-------------| -| `from_tasks(broker, tasks, **kwargs)` | Build pipeline from list of task functions with `@pipeline_task` decorators | - -**Instance Methods** (most shared with `Pipeline`): - -| Method | Description | -|--------|-------------| -| `print_dag()` | Print ASCII DAG to console | -| `visualize()` | Return JSON representation of DAG | -| `visualize_dot()` | Return Graphviz DOT string | -| `kiq_dataflow(**kwargs)` | Execute pipeline with named inputs | - -**Example**: -```python -@broker.task -@pipeline_task(output="features") -def extract(data): ... - -@broker.task -@pipeline_task(output="tags") -def tag(features): ... - -pipeline = DataflowPipeline.from_tasks(broker, [extract, tag]) -pipeline.print_dag() -# Output: -# Level 0: extract -# Level 1: tag - -results = await pipeline.kiq_dataflow(data=input_data) -# results = {"features": ..., "tags": ...} -``` - ---- - -### PipelineMiddleware - -The middleware that orchestrates pipeline step execution. 
- -```python -from taskiq_flow import PipelineMiddleware - -broker.add_middlewares(PipelineMiddleware()) -``` - -**Responsibilities**: - -- Intercepts task completion -- Determines next step to execute -- Manages pipeline state transitions -- Passes results between steps -- Emits hook events - -**Note**: This middleware **must** be added to the broker for any pipeline to work. - ---- - -### PipelineContext - -Metadata passed to tasks when `with_context(enable=True)` is set. - -```python -from taskiq_flow import PipelineContext - -@broker.task -async def my_task(data: str, context: PipelineContext): - print(f"Pipeline: {context.pipeline_id}") - print(f"Step: {context.step_index}") - print(f"Task ID: {context.task_id}") -``` - -**Fields**: - -| Field | Type | Description | -|-------|------|-------------| -| `pipeline_id` | `str` | Unique pipeline instance ID | -| `step_index` | `int` | Current step number (0-indexed) | -| `task_id` | `str` | Underlying taskiq task ID | -| `execution_mode` | `str` | `"sequential"`, `"parallel"`, `"map_reduce"` | -| `started_at` | `datetime` | Pipeline start timestamp | -| `broker` | `BaseBroker` | Reference to broker instance | - ---- - -## Core Exceptions - -All exceptions inherit from `TaskiqFlowError` base class. 
- -```python -from taskiq_flow import TaskiqFlowError -``` - -| Exception | Meaning | Typical Cause | -|-----------|---------|--------------| -| `PipelineError` | Generic pipeline failure | Step failed | -| `CycleError` | Circular dependency detected | DAG has cycle | -| `TaskNotFoundError` | Task not in registry | Missing task in DataflowPipeline | -| `InvalidOutputError` | Output key conflict | Two tasks declare same output | -| `ConfigurationError` | Invalid pipeline config | Missing middleware, bad parameters | -| `TrackingError` | Tracking operation failed | Storage unavailable | - -**Example handling**: -```python -try: - result = await pipeline.kiq(data) -except CycleError as e: - print(f"DAG cycle detected: {e}") -except PipelineError as e: - print(f"Pipeline failed: {e}") -``` - ---- - -## Utilities - -### DataflowRegistry - -For manual DAG construction and inspection. - -```python -from taskiq_flow import DataflowRegistry - -registry = DataflowRegistry() -registry.register_task(task, output="out", inputs=["in"]) -dag = registry.build_dag() -``` - -See detailed documentation in `docs/en/api/dataflow.md`. - ---- - -### ExecutionEngine - -Low-level DAG executor for advanced use cases. - -```python -from taskiq_flow import ExecutionEngine - -engine = ExecutionEngine(broker, dag) -results = await engine.execute(inputs={"x": 1, "y": 2}) -``` - -See execution API docs. - ---- - -### PipelineScheduler - -Cron-based pipeline scheduling. - -```python -from taskiq_flow import PipelineScheduler - -scheduler = PipelineScheduler(broker) -await scheduler.schedule(pipeline, cron="* * * * *") -await scheduler.start() -``` - -See scheduling guide. - ---- - -## Version Compatibility - -This documentation covers **Taskiq-Flow v0.3.0+**. 
- -API stability: -- `Pipeline` and `DataflowPipeline`: Stable (v0.3+) -- `pipeline_task` decorator: Stable (v0.3+) -- `PipelineMiddleware`: Stable (v0.3+) -- `PipelineScheduler`: Stable (v0.3+) -- `PipelineTrackingManager`: Stable (v0.3+) - -Breaking changes will be noted in [CHANGELOG.md](https://github.com/dorel14/taskiq-flow/blob/main/CHANGELOG.md). - ---- - -*For detailed examples, see the [Examples]({{ '/en/examples/' | relative_url }}) section. For method-level documentation, refer to inline Python docstrings (`help(Pipeline)`).* +--- +permalink: /en/api/core/ +title: 'API Reference: Core Components' +nav_order: 30 +color_scheme: dark +--- +# API Reference: Core Components + +**Pipeline, DataflowPipeline, PipelineMiddleware, PipelineContext, and core exceptions** + +> **Version**: {VERSION} | **Module**: `taskiq_flow.core`, `taskiq_flow.pipeline`, `taskiq_flow.middleware` + +--- + +## Core Classes + +### Pipeline (SequentialPipeline) + +The classic sequential pipeline for linear task orchestration. 
+ +```python +from taskiq_flow import Pipeline + +pipeline = Pipeline(broker) +``` + +**Constructor**: + +```python +Pipeline( + broker: BaseBroker, + max_parallel: int = None, # Global parallelism limit + timeout: float = None, # Overall timeout in seconds + pipeline_id: str = None # Auto-generated if not provided +) +``` + +**Methods**: + +| Method | Signature | Description | +|--------|-----------|-------------| +| `call_next` | `call_next(task, *args, **kwargs) -> Pipeline` | Chain a task; passes previous result as first arg | +| `call_after` | `call_after(task, *args, **kwargs) -> Pipeline` | Run task without consuming previous result | +| `map` | `map(task, max_parallel=None, output_name=None) -> Pipeline` | Apply task to each element of iterable result | +| `filter` | `filter(task) -> Pipeline` | Keep elements where task returns truthy | +| `group` | `group(tasks, param_names=None) -> Pipeline` | Run multiple tasks in parallel from same input | +| `kiq` | `kiq(*args, **kwargs) -> Task` | Start pipeline execution | +| `with_tracking` | `with_tracking(tracking_manager) -> Pipeline` | Attach tracking manager | +| `with_hooks` | `with_hooks(hook_manager) -> Pipeline` | Attach hook manager for events | +| `with_retry` | `with_retry(...) -> Pipeline` | Configure retry policy | +| `with_timeout` | `with_timeout(seconds) -> Pipeline` | Set timeout | +| `with_context` | `with_context(enable=True) -> Pipeline` | Enable passing PipelineContext to tasks | + +**Example**: + +```python +pipeline = ( + Pipeline(broker) + .call_next(task1) + .call_next(task2, factor=2) + .map(task3, max_parallel=10) + .filter(validate) + .with_tracking(tracking) +) +result = await pipeline.kiq(initial_input) +``` + +--- + +### DataflowPipeline + +Automatic DAG construction from task dependencies using `@pipeline_task` decorators. 
+ +```python +from taskiq_flow import DataflowPipeline + +pipeline = DataflowPipeline.from_tasks( + broker, + [task_a, task_b, task_c] +) +``` + +**Constructor**: + +```python +DataflowPipeline( + broker: BaseBroker, + tasks: list[Callable] = None, + max_parallel: int = None, + timeout: float = None, + pipeline_id: str = None +) +``` + +**Class Methods**: + +| Method | Description | +|--------|-------------| +| `from_tasks(broker, tasks, **kwargs)` | Build pipeline from list of task functions with `@pipeline_task` decorators | + +**Instance Methods** (most shared with `Pipeline`): + +| Method | Description | +|--------|-------------| +| `print_dag()` | Print ASCII DAG to console | +| `visualize()` | Return JSON representation of DAG | +| `visualize_dot()` | Return Graphviz DOT string | +| `kiq_dataflow(**kwargs)` | Execute pipeline with named inputs | + +**Example**: + +```python +@broker.task +@pipeline_task(output="features") +def extract(data): ... + +@broker.task +@pipeline_task(output="tags") +def tag(features): ... + +pipeline = DataflowPipeline.from_tasks(broker, [extract, tag]) +pipeline.print_dag() +# Output: +# Level 0: extract +# Level 1: tag + +results = await pipeline.kiq_dataflow(data=input_data) +# results = {"features": ..., "tags": ...} +``` + +--- + +### PipelineMiddleware + +The middleware that orchestrates pipeline step execution. + +```python +from taskiq_flow import PipelineMiddleware + +broker.add_middlewares(PipelineMiddleware()) +``` + +**Responsibilities**: + +- Intercepts task completion +- Determines next step to execute +- Manages pipeline state transitions +- Passes results between steps +- Emits hook events + +**Note**: This middleware **must** be added to the broker for any pipeline to work. + +--- + +### PipelineContext + +Metadata passed to tasks when `with_context(enable=True)` is set. 
+ +```python +from taskiq_flow import PipelineContext + +@broker.task +async def my_task(data: str, context: PipelineContext): + print(f"Pipeline: {context.pipeline_id}") + print(f"Step: {context.step_index}") + print(f"Task ID: {context.task_id}") +``` + +**Fields**: + +| Field | Type | Description | +|-------|------|-------------| +| `pipeline_id` | `str` | Unique pipeline instance ID | +| `step_index` | `int` | Current step number (0-indexed) | +| `task_id` | `str` | Underlying taskiq task ID | +| `execution_mode` | `str` | `"sequential"`, `"parallel"`, `"map_reduce"` | +| `started_at` | `datetime` | Pipeline start timestamp | +| `broker` | `BaseBroker` | Reference to broker instance | + +--- + +## Core Exceptions + +All exceptions inherit from `TaskiqFlowError` base class. + +```python +from taskiq_flow import TaskiqFlowError +``` + +| Exception | Meaning | Typical Cause | +|-----------|---------|--------------| +| `PipelineError` | Generic pipeline failure | Step failed | +| `CycleError` | Circular dependency detected | DAG has cycle | +| `TaskNotFoundError` | Task not in registry | Missing task in DataflowPipeline | +| `InvalidOutputError` | Output key conflict | Two tasks declare same output | +| `ConfigurationError` | Invalid pipeline config | Missing middleware, bad parameters | +| `TrackingError` | Tracking operation failed | Storage unavailable | + +**Example handling**: + +```python +try: + result = await pipeline.kiq(data) +except CycleError as e: + print(f"DAG cycle detected: {e}") +except PipelineError as e: + print(f"Pipeline failed: {e}") +``` + +--- + +## Utilities + +### DataflowRegistry + +For manual DAG construction and inspection. + +```python +from taskiq_flow import DataflowRegistry + +registry = DataflowRegistry() +registry.register_task(task, output="out", inputs=["in"]) +dag = registry.build_dag() +``` + +See detailed documentation in `docs/en/api/dataflow.md`. 
+ +--- + +### ExecutionEngine + +Low-level DAG executor for advanced use cases. + +```python +from taskiq_flow import ExecutionEngine + +engine = ExecutionEngine(broker, dag) +results = await engine.execute(inputs={"x": 1, "y": 2}) +``` + +See execution API docs. + +--- + +### PipelineScheduler + +Cron-based pipeline scheduling. + +```python +from taskiq_flow import PipelineScheduler + +scheduler = PipelineScheduler(broker) +await scheduler.schedule(pipeline, cron="* * * * *") +await scheduler.start() +``` + +See scheduling guide. + +--- + +## Version Compatibility + +This documentation covers **Taskiq-Flow v0.3.0+**. + +API stability: + +- `Pipeline` and `DataflowPipeline`: Stable (v0.3+) +- `pipeline_task` decorator: Stable (v0.3+) +- `PipelineMiddleware`: Stable (v0.3+) +- `PipelineScheduler`: Stable (v0.3+) +- `PipelineTrackingManager`: Stable (v0.3+) + +Breaking changes will be noted in [CHANGELOG.md](https://github.com/dorel14/taskiq-flow/blob/main/CHANGELOG.md). + +--- + +*For detailed examples, see the [Examples]({{ '/en/examples/' | relative_url }}) section. For method-level documentation, refer to inline Python docstrings (`help(Pipeline)`).* diff --git a/docs/_en/api/decorators.md b/docs/_en/api/decorators.md index 5f07024..46235ec 100644 --- a/docs/_en/api/decorators.md +++ b/docs/_en/api/decorators.md @@ -1,251 +1,264 @@ ---- -permalink: /en/api/decorators/ -title: API Reference: Decorators -nav_order: 31 -color_scheme: dark ---- -# API Reference: Decorators - -**Task decorators, pipeline_task, and utility decorators** - +--- +permalink: /en/api/decorators/ +title: 'API Reference: Decorators' +nav_order: 31 +color_scheme: dark +--- +# API Reference: Decorators + +**Task decorators, pipeline_task, and utility decorators** + > **Version**: {VERSION} | **Module**: `taskiq_flow.decorators` - ---- - -## Overview - -The `@pipeline_task` decorator annotates taskiq tasks with output declarations, enabling automatic dependency resolution in DataflowPipeline. 
- ---- - -## @pipeline_task - -Marks a task with what it produces for downstream consumers. - -```python -from taskiq_flow import pipeline_task - -@broker.task -@pipeline_task(output="features") -def extract(data: list[str]) -> dict: - return compute_features(data) -``` - -**Parameters**: - -| Parameter | Type | Description | -|-----------|------|-------------| -| `output` | `str` | Single output key name | -| `outputs` | `list[str]` | Multiple output keys (for tuple returns) | -| `inputs` | `list[str]` | Explicit input dependencies (optional, auto-detected) | -| `description` | `str` | Human-readable description (for documentation) | - -**Usage patterns**: - -### Single output (most common) - -```python -@broker.task -@pipeline_task(output="processed_data") -def process(raw_data: str) -> dict: - return {"result": raw_data.upper()} -``` - -### Multiple outputs - -```python -@broker.task -@pipeline_task(outputs=["features", "metadata"]) -def split_output(audio: np.ndarray) -> tuple[dict, dict]: - features = extract_features(audio) - metadata = extract_meta(audio) - return features, metadata # unpacked to both outputs -``` - -Downstream tasks can consume either output: - -```python -@broker.task -@pipeline_task(output="tags") -def tag(features: dict): ... # consumes 'features' output - -@broker.task -@pipeline_task(output="info") -def describe(metadata: dict): ... # consumes 'metadata' output -``` - ---- - -## @pipeline_task_multi_output - -Alias for `@pipeline_task(outputs=[...])`. 
Provides clarity for multi-output tasks: - -```python -from taskiq_flow import pipeline_task_multi_output - -@broker.task -@pipeline_task_multi_output(outputs=["x", "y"]) -def split(value: int) -> tuple[int, int]: - return value // 2, value % 2 -``` - ---- - -## Utility Functions - -### get_task_outputs(task: Callable) -> list[str] - -Get declared output keys for a task: - -```python -from taskiq_flow import get_task_outputs - -outputs = get_task_outputs(extract_task) -print(outputs) # ['features'] -``` - -### get_task_inputs(task: Callable) -> list[str] - -Get declared input dependencies: - -```python -from taskiq_flow import get_task_inputs - -inputs = get_task_inputs(tag_task) -print(inputs) # ['features'] -``` - -### is_pipeline_task(task: Callable) -> bool - -Check if a function has been decorated with `@pipeline_task`: - -```python -from taskiq_flow import is_pipeline_task - -if is_pipeline_task(my_func): - print("This is a pipeline task with output declarations") -``` - -### resolve_task_dependencies(tasks: list[Callable]) -> dict - -Build a dependency map: - -```python -from taskiq_flow import resolve_task_dependencies - -deps = resolve_task_dependencies([task_a, task_b, task_c]) -# Returns: {task_a: [], task_b: ['features'], task_c: ['tags']} -``` - ---- - -## Decorator Order - -The decorator order matters: `@broker.task` must be outermost (applied last), `@pipeline_task` inner (applied first): - -```python -# CORRECT -@broker.task -@pipeline_task(output="result") -def my_task(): ... - -# INCORRECT (will fail) -@pipeline_task(output="result") -@broker.task -def my_task(): ... -``` - -Why: `@broker.task` wraps the function; `@pipeline_task` attaches metadata to the original function. Python applies decorators bottom-to-top. 
- ---- - -## Type Hints & Static Analysis - -Type hints help IDEs and static checkers understand dataflow: - -```python -from typing import TypedDict - -class AudioFeatures(TypedDict): - duration: float - tempo: float - -@broker.task -@pipeline_task(output="features") -def extract(path: str) -> AudioFeatures: - return {"duration": 180.0, "tempo": 120.0} - -@broker.task -@pipeline_task(output="tags") -def tag(features: AudioFeatures) -> list[str]: # type-safe - return ["fast", "electronic"] -``` - -Using `TypedDict` or Pydantic models provides better IDE autocomplete and mypy checking. - ---- - -## Versioning & Metadata - -Attach version and other metadata: - -```python -@broker.task( - name="extract_features_v2", - labels={"version": "2.0.0", "experimental": False} -) -@pipeline_task( - output="features", - description="Extract audio features (v2 with improvedtempo estimation)" -) -def extract(path: str) -> dict: - ... -``` - ---- - -## Common Pitfalls - -| Pitfall | Consequence | Fix | -|----------|-------------|-----| -| Missing `@broker.task` | Task not registered with broker | Add decorator | -| `output` not set | No downstream consumers can depend on it | Always declare `output` for dataflow tasks | -| Output name mismatch | Downstream task doesn't receive input | Ensure downstream parameter name matches upstream `output` | -| Using `@pipeline_task` on SequentialPipeline tasks | No effect but unnecessary | Only needed for DataflowPipeline | - ---- - -## Example: Complete Dataflow Pipeline - -```python -from taskiq import InMemoryBroker -from taskiq_flow import DataflowPipeline, pipeline_task - -broker = InMemoryBroker() - -@broker.task -@pipeline_task(output="raw") -def load(source: str) -> dict: - return {"data": read_file(source)} - -@broker.task -@pipeline_task(output="clean") -def clean(raw: dict) -> dict: - return {"data": preprocess(raw["data"])} - -@broker.task -@pipeline_task(output="stats") -def analyze(clean: dict) -> dict: - return 
compute_stats(clean["data"]) - -# Build -pipeline = DataflowPipeline.from_tasks(broker, [load, clean, analyze]) - -# Execute -results = await pipeline.kiq_dataflow(source="data.csv") -# results = {"raw": {...}, "clean": {...}, "stats": {...}} -``` - ---- - -*For the full task API, see [Tasks Guide]({{ '/en/guides/tasks/' | relative_url }}). For writing custom decorators, extend `BaseTaskDecorator` from `taskiq_flow.decorators`.* + +--- + +## Overview + +The `@pipeline_task` decorator annotates taskiq tasks with output declarations, enabling automatic dependency resolution in DataflowPipeline. + +--- + +## @pipeline_task + +Marks a task with what it produces for downstream consumers. + +{% raw %} +```python +from taskiq_flow import pipeline_task + +@broker.task +@pipeline_task(output="features") +def extract(data: list[str]) -> dict: + return compute_features(data) +``` +{% endraw %} +**Parameters**: + +| Parameter | Type | Description | +|-----------|------|-------------| +| `output` | `str` | Single output key name | +| `outputs` | `list[str]` | Multiple output keys (for tuple returns) | +| `inputs` | `list[str]` | Explicit input dependencies (optional, auto-detected) | +| `description` | `str` | Human-readable description (for documentation) | + +**Usage patterns**: + +### Single output (most common) + +{% raw %} +```python +@broker.task +@pipeline_task(output="processed_data") +def process(raw_data: str) -> dict: + return {"result": raw_data.upper()} +``` +{% endraw %} +### Multiple outputs + +{% raw %} +```python +@broker.task +@pipeline_task(outputs=["features", "metadata"]) +def split_output(audio: np.ndarray) -> tuple[dict, dict]: + features = extract_features(audio) + metadata = extract_meta(audio) + return features, metadata # unpacked to both outputs +``` +{% endraw %} +Downstream tasks can consume either output: + +{% raw %} +```python +@broker.task +@pipeline_task(output="tags") +def tag(features: dict): ... 
# consumes 'features' output + +@broker.task +@pipeline_task(output="info") +def describe(metadata: dict): ... # consumes 'metadata' output +``` +{% endraw %} +--- + +## @pipeline_task_multi_output + +Alias for `@pipeline_task(outputs=[...])`. Provides clarity for multi-output tasks: + +{% raw %} +```python +from taskiq_flow import pipeline_task_multi_output + +@broker.task +@pipeline_task_multi_output(outputs=["x", "y"]) +def split(value: int) -> tuple[int, int]: + return value // 2, value % 2 +``` +{% endraw %} +--- + +## Utility Functions + +### get_task_outputs(task: Callable) -> list[str] + +Get declared output keys for a task: + +{% raw %} +```python +from taskiq_flow import get_task_outputs + +outputs = get_task_outputs(extract_task) +print(outputs) # ['features'] +``` +{% endraw %} +### get_task_inputs(task: Callable) -> list[str] + +Get declared input dependencies: + +{% raw %} +```python +from taskiq_flow import get_task_inputs + +inputs = get_task_inputs(tag_task) +print(inputs) # ['features'] +``` +{% endraw %} +### is_pipeline_task(task: Callable) -> bool + +Check if a function has been decorated with `@pipeline_task`: + +{% raw %} +```python +from taskiq_flow import is_pipeline_task + +if is_pipeline_task(my_func): + print("This is a pipeline task with output declarations") +``` +{% endraw %} +### resolve_task_dependencies(tasks: list[Callable]) -> dict + +Build a dependency map: + +{% raw %} +```python +from taskiq_flow import resolve_task_dependencies + +deps = resolve_task_dependencies([task_a, task_b, task_c]) +# Returns: {task_a: [], task_b: ['features'], task_c: ['tags']} +``` +{% endraw %} +--- + +## Decorator Order + +The decorator order matters: `@broker.task` must be outermost (applied last), `@pipeline_task` inner (applied first): + +{% raw %} +```python +# CORRECT +@broker.task +@pipeline_task(output="result") +def my_task(): ... + +# INCORRECT (will fail) +@pipeline_task(output="result") +@broker.task +def my_task(): ... 
+``` +{% endraw %} +Why: `@broker.task` wraps the function; `@pipeline_task` attaches metadata to the original function. Python applies decorators bottom-to-top. + +--- + +## Type Hints & Static Analysis + +Type hints help IDEs and static checkers understand dataflow: + +{% raw %} +```python +from typing import TypedDict + +class AudioFeatures(TypedDict): + duration: float + tempo: float + +@broker.task +@pipeline_task(output="features") +def extract(path: str) -> AudioFeatures: + return {"duration": 180.0, "tempo": 120.0} + +@broker.task +@pipeline_task(output="tags") +def tag(features: AudioFeatures) -> list[str]: # type-safe + return ["fast", "electronic"] +``` +{% endraw %} +Using `TypedDict` or Pydantic models provides better IDE autocomplete and mypy checking. + +--- + +## Versioning & Metadata + +Attach version and other metadata: + +{% raw %} +```python +@broker.task( + name="extract_features_v2", + labels={"version": "2.0.0", "experimental": False} +) +@pipeline_task( + output="features", + description="Extract audio features (v2 with improvedtempo estimation)" +) +def extract(path: str) -> dict: + ... 
+``` +{% endraw %} +--- + +## Common Pitfalls + +| Pitfall | Consequence | Fix | +|----------|-------------|-----| +| Missing `@broker.task` | Task not registered with broker | Add decorator | +| `output` not set | No downstream consumers can depend on it | Always declare `output` for dataflow tasks | +| Output name mismatch | Downstream task doesn't receive input | Ensure downstream parameter name matches upstream `output` | +| Using `@pipeline_task` on SequentialPipeline tasks | No effect but unnecessary | Only needed for DataflowPipeline | + +--- + +## Example: Complete Dataflow Pipeline + +{% raw %} +```python +from taskiq import InMemoryBroker +from taskiq_flow import DataflowPipeline, pipeline_task + +broker = InMemoryBroker() + +@broker.task +@pipeline_task(output="raw") +def load(source: str) -> dict: + return {"data": read_file(source)} + +@broker.task +@pipeline_task(output="clean") +def clean(raw: dict) -> dict: + return {"data": preprocess(raw["data"])} + +@broker.task +@pipeline_task(output="stats") +def analyze(clean: dict) -> dict: + return compute_stats(clean["data"]) + +# Build +pipeline = DataflowPipeline.from_tasks(broker, [load, clean, analyze]) + +# Execute +results = await pipeline.kiq_dataflow(source="data.csv") +# results = {"raw": {...}, "clean": {...}, "stats": {...}} +``` +{% endraw %} +--- + +*For the full task API, see [Tasks Guide]({{ '/en/guides/tasks/' | relative_url }}). 
For writing custom decorators, extend `BaseTaskDecorator` from `taskiq_flow.decorators`.* diff --git a/docs/_en/api/execution.md b/docs/_en/api/execution.md index 09b1fe2..216015a 100644 --- a/docs/_en/api/execution.md +++ b/docs/_en/api/execution.md @@ -1,274 +1,274 @@ ---- -permalink: /en/api/execution/ -title: API Reference: Execution Engine -nav_order: 32 -color_scheme: dark ---- -# API Reference: Execution Engine - -**ExecutionEngine, DAG, map-reduce utilities, and error handling** - +--- +permalink: /en/api/execution/ +title: 'API Reference: Execution Engine' +nav_order: 32 +color_scheme: dark +--- +# API Reference: Execution Engine + +**ExecutionEngine, DAG, map-reduce utilities, and error handling** + > **Version**: {VERSION} | **Module**: `taskiq_flow.execution_engine`, `taskiq_flow.dataflow.dag`, `taskiq_flow.map_reduce` - ---- - -## ExecutionEngine - -Low-level engine for executing DAGs directly, bypassing Pipeline abstraction. - -```python -from taskiq_flow import ExecutionEngine, DataflowRegistry - -# Build registry manually -registry = DataflowRegistry() -registry.register_task(load, output="raw", inputs=[]) -registry.register_task(process, output="clean", inputs=["raw"]) -registry.register_task(save, output="saved", inputs=["clean"]) - -# Build DAG -dag = registry.build_dag() - -# Create engine -engine = ExecutionEngine(broker, dag) - -# Execute -results = await engine.execute(inputs={"source": "data.csv"}) -# results = {"raw": ..., "clean": ..., "saved": ...} -``` - -**Constructor**: -```python -ExecutionEngine( - broker: BaseBroker, - dag: DAG, - max_parallel: int = None, - on_step_complete: callable = None -) -``` - -**Methods**: - -| Method | Signature | Description | -|--------|-----------|-------------| -| `execute` | `execute(inputs: dict) -> dict` | Run the DAG with given inputs | -| `execute_async` | `execute_async(inputs: dict) -> AsyncIterator` | Stream results as they complete | -| `cancel` | `cancel()` | Stop running execution | - 
-**Events**: - -```python -async def on_step(task_name: str, result: Any): - print(f"Step {task_name} completed") - -engine = ExecutionEngine(broker, dag, on_step_complete=on_step) -``` - ---- - -## DAG (Directed Acyclic Graph) - -Represents the execution graph of tasks. - -```python -from taskiq_flow.dataflow import DAG, DAGNode - -dag = DAG() -node = DAGNode(task=my_task, output="result", inputs=["input_a"]) -dag.add_node(node) -``` - -**DAG Methods**: - -| Method | Description | -|--------|-------------| -| `add_node(node: DAGNode)` | Add a task node | -| `add_edge(from_task, to_task)` | Add dependency | -| `topological_sort() -> list[DAGNode]` | Return execution order | -| `get_parallel_levels() -> list[list[DAGNode]]` | Group nodes by parallel execution level | -| `validate()` | Check for cycles, missing nodes | -| `print()` | ASCII visualization to console | - -**DAG Properties**: - -| Property | Type | Description | -|----------|------|-------------| -| `nodes` | `list[DAGNode]` | All nodes in graph | -| `edges` | `set[tuple[DAGNode, DAGNode]]` | Dependency edges | -| `roots` | `list[DAGNode]` | Nodes with no dependencies | -| `leaves` | `list[DAGNode]` | Nodes with no dependents | - ---- - -## DAGNode - -Represents a single task in the DAG with its I/O specification. - -```python -from taskiq_flow.dataflow import DAGNode - -node = DAGNode( - task=my_task_function, - output="result_key", - inputs=["input_a", "input_b"], - metadata={"description": "My task"} -) -``` - -**Properties**: - -| Property | Type | Description | -|----------|------|-------------| -| `task` | `Callable` | The task function | -| `task_name` | `str` | Auto-generated or custom name | -| `output` | `str` | Output key (single) | -| `outputs` | `list[str]` | Output keys (multiple) | -| `inputs` | `list[str]` | Required input keys | -| `metadata` | `dict` | Arbitrary metadata | - ---- - -## DAGBuilder - -Helper to construct DAGs programmatically (less common; usually use DataflowRegistry). 
- -```python -from taskiq_flow import DAGBuilder - -builder = DAGBuilder() -builder.add_task(task1, output="a", inputs=[]) -builder.add_task(task2, output="b", inputs=["a"]) -builder.add_task(task3, output="c", inputs=["a", "b"]) - -dag = builder.build() -``` - -**Builder Pattern**: - -```python -dag = (DAGBuilder() - .node(load, output="raw", inputs=[]) - .node(process, output="clean", inputs=["raw"]) - .node(save, output="saved", inputs=["clean"]) - .build() -) -``` - ---- - -## MapReduce - -Utility for parallel map followed by reduce. - -### MapReduce.map - -```python -from taskiq_flow import MapReduce - -mapped = await MapReduce.map( - broker, - map_func, # Task function to apply - items: Iterable, # Items to process - output: str = "mapped", - max_parallel: int = None -) -# Returns: MapReduceResult (behaves like Task) -``` - -### MapReduce.reduce - -```python -reduced = await MapReduce.reduce( - broker, - reduce_func, # Aggregation function - mapped_result, # Output from MapReduce.map - input_name: str, # Name of mapped output to consume - output: str = "reduced" -) -# Returns: Task (with final result) -``` - -### MapReduce.map_reduce (combined) - -```python -final = await MapReduce.map_reduce( - broker, - map_func, - items, - reduce_func, - map_output="mapped", - reduce_output="final", - max_parallel=10 -) -``` - -All three return Task objects; call `.wait_result()` to retrieve value. - ---- - -## DataflowRegistry (Advanced) - -Manual task registration for dynamic pipeline construction. 
- -```python -from taskiq_flow import DataflowRegistry - -registry = DataflowRegistry() - -# Register tasks with explicit I/O -registry.register_task( - task=extract, - output="features", - inputs=["audio_files"] # external input -) -registry.register_task( - task=tag, - output="tags", - inputs=["features"] # depends on extract's output -) - -# Inspect -print("Tasks:", [t.task_name for t in registry.get_tasks()]) -print("Outputs:", registry.get_outputs()) -print("External inputs:", registry.get_external_inputs()) - -# Build DAG -dag = registry.build_dag() -dag.print() - -# Execute via ExecutionEngine -engine = ExecutionEngine(broker, dag) -results = await engine.execute(inputs={"audio_files": files}) -``` - -**Registry Queries**: - -| Method | Description | -|--------|-------------| -| `get_tasks()` | List all registered TaskNode objects | -| `get_outputs()` | List all output keys | -| `get_external_inputs()` | List inputs not produced by any task | -| `get_producer(output_key)` | Get task producing given output | -| `get_consumers(input_key)` | List tasks consuming given input | -| `build_dag()` | Construct DAG, validate, return ready-to-execute | - ---- - -## Version Notes - -- **ExecutionEngine** introduced in v0.3.0 -- `DAG` and `DAGNode` are used internally by DataflowPipeline -- MapReduce utility available since v0.2.0 - ---- - -## Next Steps - -- **[Tracking API]({{ '/en/api/tracking/' | relative_url }})** — Monitor execution with PipelineTrackingManager -- **[WebSocket API]({{ '/en/api/websocket/' | relative_url }})** — HookManager and event system -- **[Core API]({{ '/en/api/core/' | relative_url }})** — Pipeline and middleware reference -- **[Dataflow Audio Pipeline Example]({{ '/en/examples/dataflow-audio-pipeline/' | relative_url }})** — See ExecutionEngine used in a real DAG pipeline -- **[Registry Discovery Example]({{ '/en/examples/registry-discovery/' | relative_url }})** — Manual DAG construction and ExecutionEngine usage - ---- - -*For advanced 
use cases only. 95% of users should stick with Pipeline and DataflowPipeline abstractions.* + +--- + +## ExecutionEngine + +Low-level engine for executing DAGs directly, bypassing Pipeline abstraction. + +```python +from taskiq_flow import ExecutionEngine, DataflowRegistry + +# Build registry manually +registry = DataflowRegistry() +registry.register_task(load, output="raw", inputs=[]) +registry.register_task(process, output="clean", inputs=["raw"]) +registry.register_task(save, output="saved", inputs=["clean"]) + +# Build DAG +dag = registry.build_dag() + +# Create engine +engine = ExecutionEngine(broker, dag) + +# Execute +results = await engine.execute(inputs={"source": "data.csv"}) +# results = {"raw": ..., "clean": ..., "saved": ...} +``` + +**Constructor**: +```python +ExecutionEngine( + broker: BaseBroker, + dag: DAG, + max_parallel: int = None, + on_step_complete: callable = None +) +``` + +**Methods**: + +| Method | Signature | Description | +|--------|-----------|-------------| +| `execute` | `execute(inputs: dict) -> dict` | Run the DAG with given inputs | +| `execute_async` | `execute_async(inputs: dict) -> AsyncIterator` | Stream results as they complete | +| `cancel` | `cancel()` | Stop running execution | + +**Events**: + +```python +async def on_step(task_name: str, result: Any): + print(f"Step {task_name} completed") + +engine = ExecutionEngine(broker, dag, on_step_complete=on_step) +``` + +--- + +## DAG (Directed Acyclic Graph) + +Represents the execution graph of tasks. 
+ +```python +from taskiq_flow.dataflow import DAG, DAGNode + +dag = DAG() +node = DAGNode(task=my_task, output="result", inputs=["input_a"]) +dag.add_node(node) +``` + +**DAG Methods**: + +| Method | Description | +|--------|-------------| +| `add_node(node: DAGNode)` | Add a task node | +| `add_edge(from_task, to_task)` | Add dependency | +| `topological_sort() -> list[DAGNode]` | Return execution order | +| `get_parallel_levels() -> list[list[DAGNode]]` | Group nodes by parallel execution level | +| `validate()` | Check for cycles, missing nodes | +| `print()` | ASCII visualization to console | + +**DAG Properties**: + +| Property | Type | Description | +|----------|------|-------------| +| `nodes` | `list[DAGNode]` | All nodes in graph | +| `edges` | `set[tuple[DAGNode, DAGNode]]` | Dependency edges | +| `roots` | `list[DAGNode]` | Nodes with no dependencies | +| `leaves` | `list[DAGNode]` | Nodes with no dependents | + +--- + +## DAGNode + +Represents a single task in the DAG with its I/O specification. + +```python +from taskiq_flow.dataflow import DAGNode + +node = DAGNode( + task=my_task_function, + output="result_key", + inputs=["input_a", "input_b"], + metadata={"description": "My task"} +) +``` + +**Properties**: + +| Property | Type | Description | +|----------|------|-------------| +| `task` | `Callable` | The task function | +| `task_name` | `str` | Auto-generated or custom name | +| `output` | `str` | Output key (single) | +| `outputs` | `list[str]` | Output keys (multiple) | +| `inputs` | `list[str]` | Required input keys | +| `metadata` | `dict` | Arbitrary metadata | + +--- + +## DAGBuilder + +Helper to construct DAGs programmatically (less common; usually use DataflowRegistry). 
+ +```python +from taskiq_flow import DAGBuilder + +builder = DAGBuilder() +builder.add_task(task1, output="a", inputs=[]) +builder.add_task(task2, output="b", inputs=["a"]) +builder.add_task(task3, output="c", inputs=["a", "b"]) + +dag = builder.build() +``` + +**Builder Pattern**: + +```python +dag = (DAGBuilder() + .node(load, output="raw", inputs=[]) + .node(process, output="clean", inputs=["raw"]) + .node(save, output="saved", inputs=["clean"]) + .build() +) +``` + +--- + +## MapReduce + +Utility for parallel map followed by reduce. + +### MapReduce.map + +```python +from taskiq_flow import MapReduce + +mapped = await MapReduce.map( + broker, + map_func, # Task function to apply + items: Iterable, # Items to process + output: str = "mapped", + max_parallel: int = None +) +# Returns: MapReduceResult (behaves like Task) +``` + +### MapReduce.reduce + +```python +reduced = await MapReduce.reduce( + broker, + reduce_func, # Aggregation function + mapped_result, # Output from MapReduce.map + input_name: str, # Name of mapped output to consume + output: str = "reduced" +) +# Returns: Task (with final result) +``` + +### MapReduce.map_reduce (combined) + +```python +final = await MapReduce.map_reduce( + broker, + map_func, + items, + reduce_func, + map_output="mapped", + reduce_output="final", + max_parallel=10 +) +``` + +All three return Task objects; call `.wait_result()` to retrieve value. + +--- + +## DataflowRegistry (Advanced) + +Manual task registration for dynamic pipeline construction. 
+ +```python +from taskiq_flow import DataflowRegistry + +registry = DataflowRegistry() + +# Register tasks with explicit I/O +registry.register_task( + task=extract, + output="features", + inputs=["audio_files"] # external input +) +registry.register_task( + task=tag, + output="tags", + inputs=["features"] # depends on extract's output +) + +# Inspect +print("Tasks:", [t.task_name for t in registry.get_tasks()]) +print("Outputs:", registry.get_outputs()) +print("External inputs:", registry.get_external_inputs()) + +# Build DAG +dag = registry.build_dag() +dag.print() + +# Execute via ExecutionEngine +engine = ExecutionEngine(broker, dag) +results = await engine.execute(inputs={"audio_files": files}) +``` + +**Registry Queries**: + +| Method | Description | +|--------|-------------| +| `get_tasks()` | List all registered TaskNode objects | +| `get_outputs()` | List all output keys | +| `get_external_inputs()` | List inputs not produced by any task | +| `get_producer(output_key)` | Get task producing given output | +| `get_consumers(input_key)` | List tasks consuming given input | +| `build_dag()` | Construct DAG, validate, return ready-to-execute | + +--- + +## Version Notes + +- **ExecutionEngine** introduced in v0.3.0 +- `DAG` and `DAGNode` are used internally by DataflowPipeline +- MapReduce utility available since v0.2.0 + +--- + +## Next Steps + +- **[Tracking API]({{ '/en/api/tracking/' | relative_url }})** — Monitor execution with PipelineTrackingManager +- **[WebSocket API]({{ '/en/api/websocket/' | relative_url }})** — HookManager and event system +- **[Core API]({{ '/en/api/core/' | relative_url }})** — Pipeline and middleware reference +- **[Dataflow Audio Pipeline Example]({{ '/en/examples/dataflow-audio-pipeline/' | relative_url }})** — See ExecutionEngine used in a real DAG pipeline +- **[Registry Discovery Example]({{ '/en/examples/registry-discovery/' | relative_url }})** — Manual DAG construction and ExecutionEngine usage + +--- + +*For advanced 
use cases only. 95% of users should stick with Pipeline and DataflowPipeline abstractions.* diff --git a/docs/_en/api/index.md b/docs/_en/api/index.md index 71bc8da..4f9fc63 100644 --- a/docs/_en/api/index.md +++ b/docs/_en/api/index.md @@ -1,26 +1,26 @@ ---- -title: API Reference -nav_order: 35 -permalink: /en/api/ -color_scheme: dark ---- -# API Reference - -Complete module and class documentation for Taskiq-Flow. - -## Available API Docs - -| Module | Description | -|--------|-------------| -| **[Core Components]({{ '/en/api/core/' | relative_url }})** | Pipeline, DataflowPipeline, middleware, exceptions | -| **[Decorators]({{ '/en/api/decorators/' | relative_url }})** | `@pipeline_task` and utilities | -| **[Execution]({{ '/en/api/execution/' | relative_url }})** | ExecutionEngine, DAG, DAGBuilder | -| **[Storage]({{ '/en/api/storage/' | relative_url }})** new in v1.2.0 | Pluggable storage adapters (InMemory, Redis, SQLite), factory, StorageMiddleware | -| **[Cache]({{ '/en/api/cache/' | relative_url }})** new in v1.2.0 | Dogpile-based caching (InMemory, Redis adapters), CacheMiddleware | -| **[Tracking]({{ '/en/api/tracking/' | relative_url }})** | TrackingManager and storage backends | -| **[Optimization]({{ '/en/api/optimization/' | relative_url }})** | ResourceAwareExecutor | -| **[WebSocket]({{ '/en/api/websocket/' | relative_url }})** | HookManager and event system | - ---- - -*Not sure where to start? See the [Quick Start]({{ '/en/quickstart/' | relative_url }}) or [User Guides]({{ '/en/guides/' | relative_url }}).* +--- +title: API Reference +nav_order: 35 +permalink: /en/api/ +color_scheme: dark +--- +# API Reference + +Complete module and class documentation for Taskiq-Flow. 
+ +## Available API Docs + +| Module | Description | +|--------|-------------| +| **[Core Components]({{ '/en/api/core/' | relative_url }})** | Pipeline, DataflowPipeline, middleware, exceptions | +| **[Decorators]({{ '/en/api/decorators/' | relative_url }})** | `@pipeline_task` and utilities | +| **[Execution]({{ '/en/api/execution/' | relative_url }})** | ExecutionEngine, DAG, DAGBuilder | +| **[Storage]({{ '/en/api/storage/' | relative_url }})** new in v1.2.0 | Pluggable storage adapters (InMemory, Redis, SQLite), factory, StorageMiddleware | +| **[Cache]({{ '/en/api/cache/' | relative_url }})** new in v1.2.0 | Dogpile-based caching (InMemory, Redis adapters), CacheMiddleware | +| **[Tracking]({{ '/en/api/tracking/' | relative_url }})** | TrackingManager and storage backends | +| **[Optimization]({{ '/en/api/optimization/' | relative_url }})** | ResourceAwareExecutor | +| **[WebSocket]({{ '/en/api/websocket/' | relative_url }})** | HookManager and event system | + +--- + +*Not sure where to start? See the [Quick Start]({{ '/en/quickstart/' | relative_url }}) or [User Guides]({{ '/en/guides/' | relative_url }}).* diff --git a/docs/_en/api/storage.md b/docs/_en/api/storage.md index 060313f..477af5f 100644 --- a/docs/_en/api/storage.md +++ b/docs/_en/api/storage.md @@ -1,304 +1,304 @@ ---- -title: API Reference: Storage -nav_order: 31 -color_scheme: dark ---- -# API Reference: Storage - -**Pluggable persistence layer — adapters, factory, and `StorageMiddleware`** - -> **Version**: {VERSION} | **New in v1.2.0** | **Module**: `taskiq_flow.storage`, `taskiq_flow.middlewares.storage` - ---- - -## Overview - -Taskiq-Flow v1.2.0 introduces a **centralized storage layer** that decouples all persistence concerns (tracking, scheduling, results history) from the broker implementation. 
The storage system provides: - -- **One unified interface** — `BaseStorageAdapter` works with every backend -- **Three built-in adapters** — InMemory, Redis, SQLite/SQLAlchemy -- **Auto-detection factory** — `StorageAdapterFactory` picks the right backend automatically -- **Middleware integration** — `StorageMiddleware` plugs into the TaskIQ middleware pipeline - -Use `StorageMiddleware` instead of ad-hoc persistence code: it intercepts task events and stores results via a pluggable adapter. - ---- - -## Module: `taskiq_flow.storage` - -### `StorageEntry` - -```python -from taskiq_flow.storage import StorageEntry -from datetime import datetime, timezone - -entry = StorageEntry( - key="pipeline:my_run:task:abc123", - value={"status": "completed", "result": 42}, - expires_at=datetime.now(timezone.utc) + timedelta(hours=1), - metadata={"pipeline_id": "my_run"}, -) -``` - -A typed container for a single stored value with optional TTL and metadata. - -| Attribute | Type | Description | -|-----------|------|-------------| -| `key` | `str` | Unique storage key | -| `value` | `Any` | Stored value (JSON-serializable recommended) | -| `created_at` | `datetime` | Timestamp of creation (UTC) | -| `expires_at` | `datetime \| None` | Expiration timestamp; `None` = never expires | -| `metadata` | `dict` | Arbitrary metadata tags | - -| Method | Signature | Description | -|--------|-----------|-------------| -| `is_expired()` | `() -> bool` | Returns `True` if the entry has expired | -| `remaining_ttl()` | `() -> float \| None` | Seconds until expiry; `None` if never expires | - ---- - -### `BaseStorageAdapter` (ABC) - -```python -from taskiq_flow.storage import BaseStorageAdapter - -class MyAdapter(BaseStorageAdapter): - async def get(self, key: str) -> Any | None: ... - async def set(self, key: str, value: Any, ttl_seconds: int | None = None) -> None: ... - async def delete(self, key: str) -> bool: ... - async def exists(self, key: str) -> bool: ... 
- async def keys(self, pattern: str = "*") -> list[str]: ... - async def cleanup(self, ttl_seconds: int = 3600) -> int: ... -``` - -Abstract interface that all storage backends must implement. Use this to implement a custom backend (e.g., PostgreSQL, DynamoDB). - -| Method | Description | -|--------|-------------| -| `get(key)` | Retrieve value by key; returns `None` if missing or expired | -| `set(key, value, ttl_seconds)` | Store a value with optional TTL in seconds | -| `delete(key)` | Remove entry by key; returns `True` if deleted | -| `exists(key)` | Check whether a key exists | -| `keys(pattern)` | List keys matching a glob pattern (e.g., `"pipeline:*"`) | -| `cleanup(ttl_seconds)` | Purge expired entries; returns count deleted | - ---- - -### `InMemoryStorageAdapter` - -```python -from taskiq_flow.storage import InMemoryStorageAdapter - -storage = InMemoryStorageAdapter() -# Usage transparent — same interface as other adapters -``` - -In-process dict-based adapter with per-key TTL support. Best suited for development, testing, and single-process deployments. - -| Feature | Status | -|---------|--------| -| TTL | Per-key TTL via `set(key, value, ttl_seconds=…)` | -| Concurrency | Protected by `asyncio.Lock` | -| Persistence across restarts | Volatile | -| Distributed sharing | Process-local only | -| Pattern scanning (`keys("*")`) | `fnmatch`-based | - ---- - -### `RedisStorageAdapter` - -```python -from taskiq_flow.storage import RedisStorageAdapter - -storage = RedisStorageAdapter( - redis_url="redis://localhost:6379", - ttl_seconds=3600, # Default TTL -) -``` - -Redis-backed persistent adapter with native TTL support, JSON serialization, and async I/O. 
- -```python -# Store -await storage.set("pipeline:run42:status", {"phase": "running"}, ttl_seconds=86400) - -# Retrieve -status = await storage.get("pipeline:run42:status") - -# Pattern scan -keys = await storage.keys("pipeline:run42:*") -``` - -| Feature | Status | -|---------|--------| -| Native TTL | Redis `EXPIRE` per-key | -| JSON serialization | Automatic via `json.dumps/loads` | -| Distributed sharing | All workers share the same Redis | -| Persistent across restarts | (as long as Redis persists) | -| Dependency | `pip install redis` | - ---- - -### `SQLiteStorageAdapter` - -```python -from taskiq_flow.storage import SQLiteStorageAdapter - -storage = SQLiteStorageAdapter( - db_url="sqlite+aiosqlite:///taskiq-flow.db", - async_mode=True, -) -``` - -SQLite/SQLAlchemy-backed adapter for persistent local storage without an external service. - -```python -# Store -await storage.set("pipeline:run42:status", {"phase": "completed"}) - -# Works with any SQLAlchemy URL: SQLite, PostgreSQL, MySQL, etc. 
-pg = SQLiteStorageAdapter(db_url="postgresql+asyncpg://user:pw@host/db", async_mode=True) -``` - -| Feature | Status | -|---------|--------| -| Persistent | On-disk SQLite (or any SQLAlchemy DB) | -| Async mode | `asyncio`-compatible via `aiosqlite`/`asyncpg` | -| Distributed sharing | Only shared via a network database (PostgreSQL, MySQL) | -| Dependency | `aiosqlite` (bundled), `sqlalchemy` (bundled) | - ---- - -## Module: `taskiq_flow.storage.factory` - -### `StorageAdapterFactory` - -```python -from taskiq_flow.storage.factory import StorageAdapterFactory -from taskiq_flow.config import TaskiqFlowConfig - -# Auto-detect best adapter from configuration -config = TaskiqFlowConfig() # reads from env vars or defaults -adapter = StorageAdapterFactory.create_storage_adapter(config=config) - -# Or specify broker for broker-based detection -adapter = StorageAdapterFactory.create_storage_adapter( - config=config, - broker=redis_broker, - redis_url="redis://localhost:6379", - ttl_seconds=7200, -) -``` - -Priority order for `create_storage_adapter(type="auto")`: - -| Priority | Backend | Condition | -|----------|---------|-----------| -| 1 | `RedisStorageAdapter` | `storage_type="redis"` or broker is RedisBroker | -| 2 | `SQLiteStorageAdapter` | `storage_type="sqlite"` or `"sqlalchemy"` | -| 3 | `InMemoryStorageAdapter` | Fallback when no Redis/SQLite configured | - -| Factory Method | Description | -|----------------|-------------| -| `create_storage_adapter(config, broker, redis_url, ttl_seconds)` | Create a `BaseStorageAdapter` | -| `create_cache_adapter(config, redis_url, default_ttl, lock_timeout)` | Create a `BaseCacheAdapter` | -| `create_default_middlewares(config, broker)` | Create both `StorageMiddleware` and `CacheMiddleware` | - ---- - -## Module: `taskiq_flow.middlewares.storage` - -### `StorageMiddleware` - -```python -from taskiq_flow.middlewares import StorageMiddleware -from taskiq_flow.storage import InMemoryStorageAdapter - -storage = 
InMemoryStorageAdapter() -middleware = StorageMiddleware(storage=storage, enabled=True) - -broker.add_middlewares(middleware) -``` - -`StorageMiddleware` intercepts the TaskIQ lifecycle and persists task results -through the configured `BaseStorageAdapter`. It complements `PipelineMiddleware` -by offering **a centralized and pluggable** persistence layer. - -| Parameter | Type | Default | Description | -|-----------|------|---------|-------------| -| `storage` | `BaseStorageAdapter \| None` | `None` | Storage backend to use. Auto-creates `InMemoryStorageAdapter` if `None`. | -| `enabled` | `bool` | `True` | Toggle persistence on/off globally | - -| Hook | Signature | Description | -|------|-----------|-------------| -| `post_save(message, result)` | Persists `TaskiqResult` to storage keyed by `task_id` (optionally prefixed by `pipeline_id`) | - -**Storage key format**: `pipeline:{pipeline_id}:task:{task_id}` or `task:{task_id}`. - ---- - -## Examples - -### Using `StorageAdapterFactory` for quick setup - -```python -from taskiq_flow.storage.factory import StorageAdapterFactory - -# Zero-config — auto-detects from environment -storage = StorageAdapterFactory.create_storage_adapter() - -# From TaskiqFlowConfig -from taskiq_flow.config import TaskiqFlowConfig -config = TaskiqFlowConfig( - storage_type="redis", - storage_redis_url="redis://localhost:6379", -) -storage = StorageAdapterFactory.create_storage_adapter(config=config) -``` - -### Using `StorageMiddleware` in a broker - -```python -from taskiq import InMemoryBroker -from taskiq_flow.middlewares import StorageMiddleware -from taskiq_flow.storage import InMemoryStorageAdapter - -broker = InMemoryBroker(await_inplace=True) -middleware = StorageMiddleware(storage=InMemoryStorageAdapter()) -broker.add_middlewares(middleware, PipelineMiddleware()) -``` - -### Using `create_default_middlewares` - -```python -from taskiq_flow.storage.factory import StorageAdapterFactory - -middlewares = 
StorageAdapterFactory.create_default_middlewares() -broker.add_middlewares( - middlewares["storage"], # StorageMiddleware - middlewares["cache"], # CacheMiddleware - PipelineMiddleware(), -) -``` - ---- - -## Choosing a Storage Backend - -| Backend | Use Case | Pros | Cons | -|---------|----------|------|------| -| `InMemoryStorageAdapter` | Development, tests, single-process | Zero dependencies, fast | Volatile, not shared | -| `RedisStorageAdapter` | Production, distributed | Fast, shared, survives restarts | Requires Redis | -| `SQLiteStorageAdapter` | Lightweight persistent, no external service | No external service, SQL queries | Single-writer contention | - ---- - -## Further Reading - -- **[Storage & Cache Middleware Guide]({{ '/en/guides/cache/' | relative_url }})** — Complete middleware setup -- **[Cache API Reference]({{ '/en/api/cache/' | relative_url }})** — Dogpile-based caching -- **[Pipeline Guide]({{ '/en/guides/pipelines/' | relative_url }})** — How pipelines use storage - ---- - -*New in v1.2.0. Storage adapters are fully interchangeable: swap the adapter without changing any application logic.* +--- +title: 'API Reference: Storage' +nav_order: 31 +color_scheme: dark +--- +# API Reference: Storage + +**Pluggable persistence layer — adapters, factory, and `StorageMiddleware`** + +> **Version**: {VERSION} | **New in v1.2.0** | **Module**: `taskiq_flow.storage`, `taskiq_flow.middlewares.storage` + +--- + +## Overview + +Taskiq-Flow v1.2.0 introduces a **centralized storage layer** that decouples all persistence concerns (tracking, scheduling, results history) from the broker implementation. 
The storage system provides: + +- **One unified interface** — `BaseStorageAdapter` works with every backend +- **Three built-in adapters** — InMemory, Redis, SQLite/SQLAlchemy +- **Auto-detection factory** — `StorageAdapterFactory` picks the right backend automatically +- **Middleware integration** — `StorageMiddleware` plugs into the TaskIQ middleware pipeline + +Use `StorageMiddleware` instead of ad-hoc persistence code: it intercepts task events and stores results via a pluggable adapter. + +--- + +## Module: `taskiq_flow.storage` + +### `StorageEntry` + +```python +from taskiq_flow.storage import StorageEntry +from datetime import datetime, timedelta, timezone + +entry = StorageEntry( + key="pipeline:my_run:task:abc123", + value={"status": "completed", "result": 42}, + expires_at=datetime.now(timezone.utc) + timedelta(hours=1), + metadata={"pipeline_id": "my_run"}, +) +``` + +A typed container for a single stored value with optional TTL and metadata. + +| Attribute | Type | Description | +|-----------|------|-------------| +| `key` | `str` | Unique storage key | +| `value` | `Any` | Stored value (JSON-serializable recommended) | +| `created_at` | `datetime` | Timestamp of creation (UTC) | +| `expires_at` | `datetime \| None` | Expiration timestamp; `None` = never expires | +| `metadata` | `dict` | Arbitrary metadata tags | + +| Method | Signature | Description | +|--------|-----------|-------------| +| `is_expired()` | `() -> bool` | Returns `True` if the entry has expired | +| `remaining_ttl()` | `() -> float \| None` | Seconds until expiry; `None` if never expires | + +--- + +### `BaseStorageAdapter` (ABC) + +```python +from taskiq_flow.storage import BaseStorageAdapter + +class MyAdapter(BaseStorageAdapter): + async def get(self, key: str) -> Any | None: ... + async def set(self, key: str, value: Any, ttl_seconds: int | None = None) -> None: ... + async def delete(self, key: str) -> bool: ... + async def exists(self, key: str) -> bool: ...
+ async def keys(self, pattern: str = "*") -> list[str]: ... + async def cleanup(self, ttl_seconds: int = 3600) -> int: ... +``` + +Abstract interface that all storage backends must implement. Use this to implement a custom backend (e.g., PostgreSQL, DynamoDB). + +| Method | Description | +|--------|-------------| +| `get(key)` | Retrieve value by key; returns `None` if missing or expired | +| `set(key, value, ttl_seconds)` | Store a value with optional TTL in seconds | +| `delete(key)` | Remove entry by key; returns `True` if deleted | +| `exists(key)` | Check whether a key exists | +| `keys(pattern)` | List keys matching a glob pattern (e.g., `"pipeline:*"`) | +| `cleanup(ttl_seconds)` | Purge expired entries; returns count deleted | + +--- + +### `InMemoryStorageAdapter` + +```python +from taskiq_flow.storage import InMemoryStorageAdapter + +storage = InMemoryStorageAdapter() +# Usage transparent — same interface as other adapters +``` + +In-process dict-based adapter with per-key TTL support. Best suited for development, testing, and single-process deployments. + +| Feature | Status | +|---------|--------| +| TTL | Per-key TTL via `set(key, value, ttl_seconds=…)` | +| Concurrency | Protected by `asyncio.Lock` | +| Persistence across restarts | Volatile | +| Distributed sharing | Process-local only | +| Pattern scanning (`keys("*")`) | `fnmatch`-based | + +--- + +### `RedisStorageAdapter` + +```python +from taskiq_flow.storage import RedisStorageAdapter + +storage = RedisStorageAdapter( + redis_url="redis://localhost:6379", + ttl_seconds=3600, # Default TTL +) +``` + +Redis-backed persistent adapter with native TTL support, JSON serialization, and async I/O. 
+ +```python +# Store +await storage.set("pipeline:run42:status", {"phase": "running"}, ttl_seconds=86400) + +# Retrieve +status = await storage.get("pipeline:run42:status") + +# Pattern scan +keys = await storage.keys("pipeline:run42:*") +``` + +| Feature | Status | +|---------|--------| +| Native TTL | Redis `EXPIRE` per-key | +| JSON serialization | Automatic via `json.dumps/loads` | +| Distributed sharing | All workers share the same Redis | +| Persistent across restarts | (as long as Redis persists) | +| Dependency | `pip install redis` | + +--- + +### `SQLiteStorageAdapter` + +```python +from taskiq_flow.storage import SQLiteStorageAdapter + +storage = SQLiteStorageAdapter( + db_url="sqlite+aiosqlite:///taskiq-flow.db", + async_mode=True, +) +``` + +SQLite/SQLAlchemy-backed adapter for persistent local storage without an external service. + +```python +# Store +await storage.set("pipeline:run42:status", {"phase": "completed"}) + +# Works with any SQLAlchemy URL: SQLite, PostgreSQL, MySQL, etc. 
+pg = SQLiteStorageAdapter(db_url="postgresql+asyncpg://user:pw@host/db", async_mode=True) +``` + +| Feature | Status | +|---------|--------| +| Persistent | On-disk SQLite (or any SQLAlchemy DB) | +| Async mode | `asyncio`-compatible via `aiosqlite`/`asyncpg` | +| Distributed sharing | Only shared via a network database (PostgreSQL, MySQL) | +| Dependency | `aiosqlite` (bundled), `sqlalchemy` (bundled) | + +--- + +## Module: `taskiq_flow.storage.factory` + +### `StorageAdapterFactory` + +```python +from taskiq_flow.storage.factory import StorageAdapterFactory +from taskiq_flow.config import TaskiqFlowConfig + +# Auto-detect best adapter from configuration +config = TaskiqFlowConfig() # reads from env vars or defaults +adapter = StorageAdapterFactory.create_storage_adapter(config=config) + +# Or specify broker for broker-based detection +adapter = StorageAdapterFactory.create_storage_adapter( + config=config, + broker=redis_broker, + redis_url="redis://localhost:6379", + ttl_seconds=7200, +) +``` + +Priority order for `create_storage_adapter(type="auto")`: + +| Priority | Backend | Condition | +|----------|---------|-----------| +| 1 | `RedisStorageAdapter` | `storage_type="redis"` or broker is RedisBroker | +| 2 | `SQLiteStorageAdapter` | `storage_type="sqlite"` or `"sqlalchemy"` | +| 3 | `InMemoryStorageAdapter` | Fallback when no Redis/SQLite configured | + +| Factory Method | Description | +|----------------|-------------| +| `create_storage_adapter(config, broker, redis_url, ttl_seconds)` | Create a `BaseStorageAdapter` | +| `create_cache_adapter(config, redis_url, default_ttl, lock_timeout)` | Create a `BaseCacheAdapter` | +| `create_default_middlewares(config, broker)` | Create both `StorageMiddleware` and `CacheMiddleware` | + +--- + +## Module: `taskiq_flow.middlewares.storage` + +### `StorageMiddleware` + +```python +from taskiq_flow.middlewares import StorageMiddleware +from taskiq_flow.storage import InMemoryStorageAdapter + +storage = 
InMemoryStorageAdapter() +middleware = StorageMiddleware(storage=storage, enabled=True) + +broker.add_middlewares(middleware) +``` + +`StorageMiddleware` intercepts the TaskIQ lifecycle and persists task results +through the configured `BaseStorageAdapter`. It complements `PipelineMiddleware` +by offering **a centralized and pluggable** persistence layer. + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| `storage` | `BaseStorageAdapter \| None` | `None` | Storage backend to use. Auto-creates `InMemoryStorageAdapter` if `None`. | +| `enabled` | `bool` | `True` | Toggle persistence on/off globally | + +| Hook | Description | +|------|-------------| +| `post_save(message, result)` | Persists `TaskiqResult` to storage keyed by `task_id` (optionally prefixed by `pipeline_id`) | + +**Storage key format**: `pipeline:{pipeline_id}:task:{task_id}` or `task:{task_id}`. + +--- + +## Examples + +### Using `StorageAdapterFactory` for quick setup + +```python +from taskiq_flow.storage.factory import StorageAdapterFactory + +# Zero-config — auto-detects from environment +storage = StorageAdapterFactory.create_storage_adapter() + +# From TaskiqFlowConfig +from taskiq_flow.config import TaskiqFlowConfig +config = TaskiqFlowConfig( + storage_type="redis", + storage_redis_url="redis://localhost:6379", +) +storage = StorageAdapterFactory.create_storage_adapter(config=config) +``` + +### Using `StorageMiddleware` in a broker + +```python +from taskiq import InMemoryBroker +from taskiq_flow.middlewares import StorageMiddleware +from taskiq_flow.storage import InMemoryStorageAdapter + +broker = InMemoryBroker(await_inplace=True) +middleware = StorageMiddleware(storage=InMemoryStorageAdapter()) +broker.add_middlewares(middleware, PipelineMiddleware()) +``` + +### Using `create_default_middlewares` + +```python +from taskiq_flow.storage.factory import StorageAdapterFactory + +middlewares =
StorageAdapterFactory.create_default_middlewares() +broker.add_middlewares( + middlewares["storage"], # StorageMiddleware + middlewares["cache"], # CacheMiddleware + PipelineMiddleware(), +) +``` + +--- + +## Choosing a Storage Backend + +| Backend | Use Case | Pros | Cons | +|---------|----------|------|------| +| `InMemoryStorageAdapter` | Development, tests, single-process | Zero dependencies, fast | Volatile, not shared | +| `RedisStorageAdapter` | Production, distributed | Fast, shared, survives restarts | Requires Redis | +| `SQLiteStorageAdapter` | Lightweight persistent, no external service | No external service, SQL queries | Single-writer contention | + +--- + +## Further Reading + +- **[Storage & Cache Middleware Guide]({{ '/en/guides/cache/' | relative_url }})** — Complete middleware setup +- **[Cache API Reference]({{ '/en/api/cache/' | relative_url }})** — Dogpile-based caching +- **[Pipeline Guide]({{ '/en/guides/pipelines/' | relative_url }})** — How pipelines use storage + +--- + +*New in v1.2.0. Storage adapters are fully interchangeable: swap the adapter without changing any application logic.* diff --git a/docs/_en/api/tracking.md b/docs/_en/api/tracking.md index b5d4de4..c1a36fe 100644 --- a/docs/_en/api/tracking.md +++ b/docs/_en/api/tracking.md @@ -1,318 +1,318 @@ ---- -permalink: /en/api/tracking/ -title: API Reference: Tracking & Monitoring -nav_order: 33 -color_scheme: dark ---- -# API Reference: Tracking & Monitoring - -**PipelineTrackingManager, storage backends, and status models** - +--- +permalink: /en/api/tracking/ +title: 'API Reference: Tracking & Monitoring' +nav_order: 33 +color_scheme: dark +--- +# API Reference: Tracking & Monitoring + +**PipelineTrackingManager, storage backends, and status models** + > **Version**: {VERSION} | **Module**: `taskiq_flow.tracking`, `taskiq_flow.tracking.models` - ---- - -## PipelineTrackingManager - -Central coordinator for recording and retrieving pipeline execution data. 
- -```python -from taskiq_flow import PipelineTrackingManager - -tracking = PipelineTrackingManager() -tracking = tracking.with_auto_storage(broker) -# or -tracking = tracking.with_storage(InMemoryPipelineStorage()) -``` - -**Configuration**: - -```python -tracking = PipelineTrackingManager( - storage=None, # Optional pre-configured storage - max_history=1000, # Max pipeline records to keep (memory store only) - auto_cleanup=True # Auto-purge old records -) -``` - -**Storage selection** (via `with_auto_storage`): - -| Broker | Auto-selected storage | -|--------|----------------------| -| `InMemoryBroker` | `InMemoryPipelineStorage` | -| `RedisBroker` | `RedisPipelineStorage` | -| Other | Falls back to memory | - ---- - -## Methods - -### Attaching to Pipelines - -```python -pipeline = Pipeline(broker).with_tracking(tracking) -# or -pipeline.with_tracking(tracking) # in-place modification -``` - -The tracking manager must be attached **before** calling `pipeline.kiq()`. - -### Querying Status - -```python -# Get status of a specific pipeline execution -status = await tracking.get_status(pipeline_id: str) -> PipelineStatus | None - -# List all tracked pipelines -all_statuses = await tracking.list_pipelines( - status_filter: str | None = None, # Filter by status - limit: int = 100 -) -> list[PipelineStatus] - -# Get historical executions -history = await tracking.get_history( - since: datetime | None = None, - until: datetime | None = None, - limit: int = 100 -) -> list[PipelineStatus] -``` - -### Maintenance - -```python -# Delete specific pipeline record -await tracking.delete_pipeline(pipeline_id: str) - -# Delete records older than N days -deleted_count = await tracking.cleanup_older_than(days: int = 30) -> int - -# Get aggregate metrics -metrics = await tracking.get_metrics( - days: int = 7 -) -> TrackingMetrics -``` - -### Event Listeners - -```python -class MyListener: - async def on_pipeline_start(self, pipeline_id: str): - print(f"Pipeline {pipeline_id} 
started") - - async def on_pipeline_complete(self, pipeline_id: str, status: PipelineStatus): - send_alert_if_failed(status) - -listener = MyListener() -tracking.add_listener(listener) -``` - -**Listener hooks** (all optional): - -- `on_pipeline_start(pipeline_id)` -- `on_step_start(pipeline_id, step_name)` -- `on_step_complete(pipeline_id, step_name, result)` -- `on_pipeline_complete(pipeline_id, status)` -- `on_pipeline_error(pipeline_id, error)` - ---- - -## Storage Backends - -### InMemoryPipelineStorage - -```python -from taskiq_flow.tracking import InMemoryPipelineStorage - -storage = InMemoryPipelineStorage(max_records=1000) -tracking = PipelineTrackingManager().with_storage(storage) -``` - -**Features**: -- Zero configuration -- Fast (no I/O) -- **Not shared between workers** -- Lost on process restart -- Good for: development, testing, single-process - -**Parameters**: - -| Parameter | Type | Default | Description | -|-----------|------|---------|-------------| -| `max_records` | `int` | 1000 | Maximum pipeline records to retain (LRU eviction) | - ---- - -### RedisPipelineStorage - -```python -from taskiq_flow.tracking import RedisPipelineStorage -import redis.asyncio as redis - -redis_client = redis.Redis(host="localhost", port=6379, decode_responses=True) -storage = RedisPipelineStorage( - redis_client, - key_prefix="taskiq_flow:tracking:", - ttl_seconds=604800 # 7 days -) -tracking = PipelineTrackingManager().with_storage(storage) -``` - -**Features**: -- Shared across multiple workers -- Persistent across restarts -- Scalable (Redis cluster) -- TTL-based expiration -- Good for: production, distributed deployments - -**Parameters**: - -| Parameter | Type | Default | Description | -|-----------|------|---------|-------------| -| `redis_client` | `Redis` | **required** | Connected Redis client | -| `key_prefix` | `str` | `"taskiq_flow:tracking:"` | Prefix for all keys | -| `ttl_seconds` | `int` | 604800 (7d) | Auto-expire after N seconds | -| `serializer` 
| `Callable` | `json.dumps` | Custom serialization function | - ---- - -## Data Models - -### PipelineStatus - -Complete status of a pipeline execution. - -```python -from taskiq_flow.tracking.models import PipelineStatus - -status: PipelineStatus -``` - -**Attributes**: - -| Attribute | Type | Description | -|-----------|------|-------------| -| `pipeline_id` | `str` | Unique identifier | -| `status` | `str` | `PENDING`, `RUNNING`, `COMPLETED`, `FAILED`, `CANCELLED` | -| `pipeline_type` | `str` | `"sequential"` or `"dataflow"` | -| `started_at` | `datetime` | Execution start timestamp | -| `completed_at` | `datetime | None` | End time if finished | -| `duration_ms` | `float` | Total duration in milliseconds | -| `steps` | `list[StepStatus]` | Per-step status objects | -| `result` | `Any` | Final return value (if completed) | -| `error` | `str \| None` | Error message if failed | - -**Methods**: -- `model_dump()` — Return as dictionary (Pydantic model) -- `is_finished()` — True if terminal state (COMPLETED/FAILED/CANCELLED) - ---- - -### StepStatus - -Status of a single pipeline step. - -```python -from taskiq_flow.tracking.models import StepStatus - -step: StepStatus -``` - -**Attributes**: - -| Attribute | Type | Description | -|-----------|------|-------------| -| `step_name` | `str` | Task name | -| `status` | `str` | `PENDING`, `RUNNING`, `COMPLETED`, `FAILED` | -| `started_at` | `datetime` | Step start time | -| `completed_at` | `datetime | None` | Step end time | -| `duration_ms` | `float` | Execution duration | -| `result` | `Any` | Return value | -| `error` | `str \| None` | Error message | -| `retry_count` | `int` | Number of retry attempts | - ---- - -### TrackingMetrics - -Aggregated statistics (returned by `get_metrics()`). 
- -```python -from taskiq_flow.tracking.models import TrackingMetrics - -metrics: TrackingMetrics -``` - -**Attributes**: - -| Attribute | Type | Description | -|-----------|------|-------------| -| `total_pipelines` | `int` | Total executions tracked | -| `completed` | `int` | Successful completions | -| `failed` | `int` | Failed executions | -| `success_rate` | `float` | Ratio completed / total | -| `avg_duration_ms` | `float` | Average pipeline duration | -| `p95_duration_ms` | `float` | 95th percentile duration | -| `failure_reasons` | `dict[str, int]` | Error type → count | -| `most_frequent_step` | `str | None` | Step that fails most often | - ---- - -## Custom Storage Implementation - -Implement `TrackingStorage` protocol for custom backends: - -```python -from taskiq_flow.tracking.storage import TrackingStorage -from taskiq_flow.tracking.models import PipelineStatus - -class PostgresStorage(TrackingStorage): - async def save_status(self, status: PipelineStatus): - """Save or update pipeline status.""" - ... - - async def get_status(self, pipeline_id: str) -> PipelineStatus | None: - """Fetch pipeline status by ID.""" - ... - - async def list_pipelines(self, status_filter: str | None = None, - limit: int = 100) -> list[PipelineStatus]: - """List pipelines, optionally filtered by status.""" - ... - - async def delete_pipeline(self, pipeline_id: str): - """Remove pipeline record.""" - ... - - async def cleanup_older_than(self, days: int) -> int: - """Delete records older than N days. Returns count deleted.""" - ... - -tracking = PipelineTrackingManager().with_storage(PostgresStorage()) -``` - -All storage methods must be async. - ---- - -## Best Practices - -1. **Production**: Always use Redis storage (shared, persistent) -2. **TTL**: Set appropriate TTL (7–30 days) to bound storage growth -3. **Listeners**: Add alerting listeners for failures -4. **Cleanup**: Schedule periodic cleanup (daily cron job) -5. 
**Indexing**: For custom DB stores, index on `pipeline_id`, `started_at` for query performance - ---- - -## Troubleshooting - -| Issue | Likely Cause | Fix | -|-------|--------------|-----| -| `get_status()` returns `None` | Tracking not attached, or wrong `pipeline_id` | Ensure `pipeline.with_tracking(tracking)` called before `kiq()` | -| Storage errors | Redis connection failed | Check Redis is running, connection string valid | -| Memory growth (memory store) | Not purging old records | Set `max_records` or use Redis with TTL | -| Listeners not firing | Not added before pipeline start | Call `tracking.add_listener()` before `pipeline.kiq()` | - ---- - -*Combine with [WebSocket]({{ '/en/api/websocket/' | relative_url }}) for real-time streaming. See [Tracking Guide]({{ '/en/guides/tracking/' | relative_url }}) for usage patterns.* + +--- + +## PipelineTrackingManager + +Central coordinator for recording and retrieving pipeline execution data. + +```python +from taskiq_flow import PipelineTrackingManager + +tracking = PipelineTrackingManager() +tracking = tracking.with_auto_storage(broker) +# or +tracking = tracking.with_storage(InMemoryPipelineStorage()) +``` + +**Configuration**: + +```python +tracking = PipelineTrackingManager( + storage=None, # Optional pre-configured storage + max_history=1000, # Max pipeline records to keep (memory store only) + auto_cleanup=True # Auto-purge old records +) +``` + +**Storage selection** (via `with_auto_storage`): + +| Broker | Auto-selected storage | +|--------|----------------------| +| `InMemoryBroker` | `InMemoryPipelineStorage` | +| `RedisBroker` | `RedisPipelineStorage` | +| Other | Falls back to memory | + +--- + +## Methods + +### Attaching to Pipelines + +```python +pipeline = Pipeline(broker).with_tracking(tracking) +# or +pipeline.with_tracking(tracking) # in-place modification +``` + +The tracking manager must be attached **before** calling `pipeline.kiq()`. 
+ +### Querying Status + +```python +# Get status of a specific pipeline execution +status = await tracking.get_status(pipeline_id: str) -> PipelineStatus | None + +# List all tracked pipelines +all_statuses = await tracking.list_pipelines( + status_filter: str | None = None, # Filter by status + limit: int = 100 +) -> list[PipelineStatus] + +# Get historical executions +history = await tracking.get_history( + since: datetime | None = None, + until: datetime | None = None, + limit: int = 100 +) -> list[PipelineStatus] +``` + +### Maintenance + +```python +# Delete specific pipeline record +await tracking.delete_pipeline(pipeline_id: str) + +# Delete records older than N days +deleted_count = await tracking.cleanup_older_than(days: int = 30) -> int + +# Get aggregate metrics +metrics = await tracking.get_metrics( + days: int = 7 +) -> TrackingMetrics +``` + +### Event Listeners + +```python +class MyListener: + async def on_pipeline_start(self, pipeline_id: str): + print(f"Pipeline {pipeline_id} started") + + async def on_pipeline_complete(self, pipeline_id: str, status: PipelineStatus): + send_alert_if_failed(status) + +listener = MyListener() +tracking.add_listener(listener) +``` + +**Listener hooks** (all optional): + +- `on_pipeline_start(pipeline_id)` +- `on_step_start(pipeline_id, step_name)` +- `on_step_complete(pipeline_id, step_name, result)` +- `on_pipeline_complete(pipeline_id, status)` +- `on_pipeline_error(pipeline_id, error)` + +--- + +## Storage Backends + +### InMemoryPipelineStorage + +```python +from taskiq_flow.tracking import InMemoryPipelineStorage + +storage = InMemoryPipelineStorage(max_records=1000) +tracking = PipelineTrackingManager().with_storage(storage) +``` + +**Features**: +- Zero configuration +- Fast (no I/O) +- **Not shared between workers** +- Lost on process restart +- Good for: development, testing, single-process + +**Parameters**: + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| 
`max_records` | `int` | 1000 | Maximum pipeline records to retain (LRU eviction) | + +--- + +### RedisPipelineStorage + +```python +from taskiq_flow.tracking import RedisPipelineStorage +import redis.asyncio as redis + +redis_client = redis.Redis(host="localhost", port=6379, decode_responses=True) +storage = RedisPipelineStorage( + redis_client, + key_prefix="taskiq_flow:tracking:", + ttl_seconds=604800 # 7 days +) +tracking = PipelineTrackingManager().with_storage(storage) +``` + +**Features**: +- Shared across multiple workers +- Persistent across restarts +- Scalable (Redis cluster) +- TTL-based expiration +- Good for: production, distributed deployments + +**Parameters**: + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| `redis_client` | `Redis` | **required** | Connected Redis client | +| `key_prefix` | `str` | `"taskiq_flow:tracking:"` | Prefix for all keys | +| `ttl_seconds` | `int` | 604800 (7d) | Auto-expire after N seconds | +| `serializer` | `Callable` | `json.dumps` | Custom serialization function | + +--- + +## Data Models + +### PipelineStatus + +Complete status of a pipeline execution. 
+ +```python +from taskiq_flow.tracking.models import PipelineStatus + +status: PipelineStatus +``` + +**Attributes**: + +| Attribute | Type | Description | +|-----------|------|-------------| +| `pipeline_id` | `str` | Unique identifier | +| `status` | `str` | `PENDING`, `RUNNING`, `COMPLETED`, `FAILED`, `CANCELLED` | +| `pipeline_type` | `str` | `"sequential"` or `"dataflow"` | +| `started_at` | `datetime` | Execution start timestamp | +| `completed_at` | `datetime | None` | End time if finished | +| `duration_ms` | `float` | Total duration in milliseconds | +| `steps` | `list[StepStatus]` | Per-step status objects | +| `result` | `Any` | Final return value (if completed) | +| `error` | `str \| None` | Error message if failed | + +**Methods**: +- `model_dump()` — Return as dictionary (Pydantic model) +- `is_finished()` — True if terminal state (COMPLETED/FAILED/CANCELLED) + +--- + +### StepStatus + +Status of a single pipeline step. + +```python +from taskiq_flow.tracking.models import StepStatus + +step: StepStatus +``` + +**Attributes**: + +| Attribute | Type | Description | +|-----------|------|-------------| +| `step_name` | `str` | Task name | +| `status` | `str` | `PENDING`, `RUNNING`, `COMPLETED`, `FAILED` | +| `started_at` | `datetime` | Step start time | +| `completed_at` | `datetime | None` | Step end time | +| `duration_ms` | `float` | Execution duration | +| `result` | `Any` | Return value | +| `error` | `str \| None` | Error message | +| `retry_count` | `int` | Number of retry attempts | + +--- + +### TrackingMetrics + +Aggregated statistics (returned by `get_metrics()`). 
+ +```python +from taskiq_flow.tracking.models import TrackingMetrics + +metrics: TrackingMetrics +``` + +**Attributes**: + +| Attribute | Type | Description | +|-----------|------|-------------| +| `total_pipelines` | `int` | Total executions tracked | +| `completed` | `int` | Successful completions | +| `failed` | `int` | Failed executions | +| `success_rate` | `float` | Ratio completed / total | +| `avg_duration_ms` | `float` | Average pipeline duration | +| `p95_duration_ms` | `float` | 95th percentile duration | +| `failure_reasons` | `dict[str, int]` | Error type → count | +| `most_frequent_step` | `str | None` | Step that fails most often | + +--- + +## Custom Storage Implementation + +Implement `TrackingStorage` protocol for custom backends: + +```python +from taskiq_flow.tracking.storage import TrackingStorage +from taskiq_flow.tracking.models import PipelineStatus + +class PostgresStorage(TrackingStorage): + async def save_status(self, status: PipelineStatus): + """Save or update pipeline status.""" + ... + + async def get_status(self, pipeline_id: str) -> PipelineStatus | None: + """Fetch pipeline status by ID.""" + ... + + async def list_pipelines(self, status_filter: str | None = None, + limit: int = 100) -> list[PipelineStatus]: + """List pipelines, optionally filtered by status.""" + ... + + async def delete_pipeline(self, pipeline_id: str): + """Remove pipeline record.""" + ... + + async def cleanup_older_than(self, days: int) -> int: + """Delete records older than N days. Returns count deleted.""" + ... + +tracking = PipelineTrackingManager().with_storage(PostgresStorage()) +``` + +All storage methods must be async. + +--- + +## Best Practices + +1. **Production**: Always use Redis storage (shared, persistent) +2. **TTL**: Set appropriate TTL (7–30 days) to bound storage growth +3. **Listeners**: Add alerting listeners for failures +4. **Cleanup**: Schedule periodic cleanup (daily cron job) +5. 
**Indexing**: For custom DB stores, index on `pipeline_id`, `started_at` for query performance + +--- + +## Troubleshooting + +| Issue | Likely Cause | Fix | +|-------|--------------|-----| +| `get_status()` returns `None` | Tracking not attached, or wrong `pipeline_id` | Ensure `pipeline.with_tracking(tracking)` called before `kiq()` | +| Storage errors | Redis connection failed | Check Redis is running, connection string valid | +| Memory growth (memory store) | Not purging old records | Set `max_records` or use Redis with TTL | +| Listeners not firing | Not added before pipeline start | Call `tracking.add_listener()` before `pipeline.kiq()` | + +--- + +*Combine with [WebSocket]({{ '/en/api/websocket/' | relative_url }}) for real-time streaming. See [Tracking Guide]({{ '/en/guides/tracking/' | relative_url }}) for usage patterns.* diff --git a/docs/_en/api/websocket.md b/docs/_en/api/websocket.md index 8090f00..b793f43 100644 --- a/docs/_en/api/websocket.md +++ b/docs/_en/api/websocket.md @@ -1,6 +1,6 @@ --- permalink: /en/api/websocket/ -title: API Reference: WebSocket Integration +title: 'API Reference: WebSocket Integration' nav_order: 34 color_scheme: dark --- diff --git a/docs/_en/examples/api-example.md b/docs/_en/examples/api-example.md index 872b7d4..e5a4d87 100644 --- a/docs/_en/examples/api-example.md +++ b/docs/_en/examples/api-example.md @@ -1,369 +1,369 @@ ---- -permalink: /en/examples/api-example/ -title: Example: api_example.py -nav_order: 47 -color_scheme: dark ---- -# Example: api_example.py - -**FastAPI integration for remote pipeline management** - +--- +permalink: /en/examples/api-example/ +title: 'Example: api_example.py' +nav_order: 47 +color_scheme: dark +--- +# Example: api_example.py + +**FastAPI integration for remote pipeline management** + > **Version**: {VERSION} | **File**: `examples/api_example.py` - ---- - -## Overview - -This comprehensive example demonstrates how to build a production-ready REST API for Taskiq-Flow using FastAPI. 
It covers: - -- Setting up FastAPI with pipeline visualization endpoints -- Registering pipelines programmatically -- Adding custom endpoints for pipeline execution -- Retrieving pipeline results via API -- Full OpenAPI/Swagger documentation - -**Prerequisites**: Install FastAPI and uvicorn: -```bash -pip install fastapi uvicorn[standard] -``` - ---- - -## What This Example Shows - -- Using `PipelineVisualizationAPI` for built-in endpoints -- Registering pipelines with the API -- Creating custom endpoints to execute pipelines remotely -- Retrieving results by task ID -- Complete production API structure - ---- - -## Code Walkthrough - -### 1. Define Tasks and Pipeline - -```python -from fastapi import FastAPI, HTTPException -from taskiq import InMemoryBroker -from taskiq_flow import DataflowPipeline, pipeline_task - -broker = InMemoryBroker(await_inplace=True) - -@broker.task -@pipeline_task(output="user_data") -async def fetch_user_data(user_id: int) -> dict: - """Fetch user data from database.""" - await asyncio.sleep(0.1) - return {"id": user_id, "name": f"User{user_id}", "email": f"user{user_id}@example.com"} - -@broker.task -@pipeline_task(output="order_history") -async def fetch_orders(user_data: dict) -> list: - """Fetch order history for user.""" - await asyncio.sleep(0.2) - user_id = user_data["id"] - return [{"order_id": 100 + user_id, "total": 99.99}] - -@broker.task -@pipeline_task(output="recommendations") -async def generate_recommendations(user_data: dict, order_history: list): - """Generate recommendations.""" - await asyncio.sleep(0.15) - return ["product_A", "product_B", "product_C"] - -# Build pipeline -sample_pipeline = DataflowPipeline.from_tasks( - broker, - [fetch_user_data, fetch_orders, generate_recommendations], -) -sample_pipeline.pipeline_id = "sample_recommendation_pipeline" -``` - -**Pipeline DAG structure:** - -```mermaid -flowchart TD - A[fetch_user_data
output: user_data] --> B[fetch_orders
output: order_history] - A --> C[generate_recommendations
output: recommendations] - B --> C -``` - -### 2. Create FastAPI App with Visualization API - -```python -from taskiq_flow.api import create_visualization_api, PipelineVisualizationAPI - -def create_app() -> FastAPI: - app = FastAPI(title="TaskIQ Flow API", version="1.0.0") - - # Create visualization API (auto-mounts /pipelines endpoints) - viz_api = create_visualization_api(broker, app) - viz_api.add_pipeline("sample_recommendation_pipeline", sample_pipeline) - - # Custom endpoints below... - return app -``` - -The `create_visualization_api()` automatically adds these endpoints: -- `GET /health` -- `GET /pipelines` -- `POST /pipelines/{pipeline_id}` (register) -- `GET /pipelines/{pipeline_id}/status` -- `GET /pipelines/{pipeline_id}/dag` -- `GET /pipelines/{pipeline_id}/dag/dot` -- `GET /pipelines/{pipeline_id}/visualize` - -### 3. Add Custom Execute Endpoint - -```python -@app.post("/pipelines/{pipeline_id}/execute") -async def execute_pipeline( - pipeline_id: str, - parameters: dict[str, Any], -) -> dict[str, Any]: - """Execute a pipeline with given parameters.""" - if pipeline_id not in viz_api.pipelines: - raise HTTPException(status_code=404, detail=f"Pipeline {pipeline_id} not found") - - pipeline = viz_api.pipelines[pipeline_id] - try: - result = await pipeline.kiq_dataflow(**parameters) - return { - "status": "executed", - "pipeline_id": pipeline_id, - "task_id": result.task_id, - "message": "Pipeline execution started. Use /result/{task_id} to check status.", - } - except Exception as e: - raise HTTPException(status_code=500, detail=str(e)) from e -``` - -### 4. 
Add Result Retrieval Endpoint - -```python -@app.get("/pipelines/result/{task_id}") -async def get_result(task_id: str) -> dict[str, Any]: - """Get the result of a pipeline execution.""" - try: - result = await broker.result_backend.get_result(task_id) - if result is None: - raise HTTPException(status_code=404, detail=f"No result found for task_id {task_id}") - return {"task_id": task_id, "result": result.return_value} - except Exception as e: - raise HTTPException(status_code=500, detail=str(e)) from e -``` - -### 5. Run the Server - -```bash -uvicorn examples.api_example:create_app --reload --port 8000 -``` - -Or programmatically: - -```python -if __name__ == "__main__": - import uvicorn - uvicorn.run(create_app(), host="0.0.0.0", port=8000) -``` - ---- - -## API Endpoints Reference - -### Built-in (from `create_visualization_api`) - -| Method | Endpoint | Description | -|--------|----------|-------------| -| GET | `/health` | Health check | -| GET | `/pipelines` | List all registered pipelines | -| POST | `/pipelines/{pipeline_id}` | Register new pipeline | -| GET | `/pipelines/{pipeline_id}/status` | Get current execution status | -| GET | `/pipelines/{pipeline_id}/dag` | Get DAG as JSON | -| GET | `/pipelines/{pipeline_id}/dag/dot` | Get DAG as DOT string | -| GET | `/pipelines/{pipeline_id}/visualize` | Full visualization metadata | - -### Custom (defined in example) - -| Method | Endpoint | Description | -|--------|----------|-------------| -| POST | `/pipelines/{pipeline_id}/execute` | Execute pipeline with parameters | -| GET | `/pipelines/result/{task_id}` | Get result by task ID | - ---- - -## Testing the API - -### 1. Interactive Docs -Open http://localhost:8000/docs for Swagger UI. - -### 2. 
Execute Pipeline - -```bash -curl -X POST "http://localhost:8000/pipelines/sample_recommendation_pipeline/execute" \ - -H "Content-Type: application/json" \ - -d '{"user_id": 123}' -``` - -Response: -```json -{ - "status": "executed", - "pipeline_id": "sample_recommendation_pipeline", - "task_id": "abc123def456", - "message": "Pipeline execution started..." -} -``` - -### 3. Poll for Result - -```bash -curl "http://localhost:8000/pipelines/result/abc123def456" -``` - -Response: -```json -{ - "task_id": "abc123def456", - "result": { - "user_data": {"id": 123, "name": "User123", ...}, - "order_history": [...], - "recommendations": ["product_A", "product_B", "product_C"] - } -} -``` - -### 4. View DAG - -```bash -curl "http://localhost:8000/pipelines/sample_recommendation_pipeline/dag" -``` - -Returns JSON structure of the pipeline graph. - ---- - -## Programmatic API Usage - -You can also use the API classes directly without HTTP: - -```python -from taskiq_flow.api import PipelineVisualizationAPI - -app = FastAPI() -viz_api = PipelineVisualizationAPI(broker, app) - -# Register pipeline -viz_api.add_pipeline("my_pipe", my_pipeline) - -# List registered pipelines -for pid, p in viz_api.pipelines.items(): - print(f"Pipeline: {pid}, tasks: {len(p.visualize()['nodes'])}") - -# Get visualization -dag_json = my_pipeline.visualize() -dot = my_pipeline.visualize_dot() -``` - -This is useful for building custom dashboard backends or CLI tools. - ---- - -## Production Considerations - -### 1. Use Persistent Broker -```python -from taskiq import RedisStreamBroker -broker = RedisStreamBroker(redis_url="redis://localhost:6379") -``` - -### 2. 
Add Authentication -```python -from fastapi import Depends, Security -from fastapi.security import APIKeyHeader - -api_key_header = APIKeyHeader(name="X-API-Key") - -async def verify_api_key(api_key: str = Security(api_key_header)): - if api_key != os.getenv("API_SECRET"): - raise HTTPException(status_code=403, detail="Invalid API key") - return api_key - -@app.post("/pipelines/{pipeline_id}/execute") -async def execute(..., api_key: str = Security(verify_api_key)): - # ... -``` - -### 2b. Add JWT Authentication -```python -from jose import jwt -from fastapi import Depends - -async def get_current_user(token: str = Depends(oauth2_scheme)): - payload = jwt.decode(token, SECRET_KEY, algorithms=["HS256"]) - return payload["sub"] - -@app.post("/pipelines/{pipeline_id}/execute") -async def execute(..., user: str = Depends(get_current_user)): - logger.info(f"User {user} executed {pipeline_id}") - # ... -``` - -### 2c. Add Pipeline-Level Authorization -```python -from taskiq_flow.security.authorization import PipelineAuthorization - -authorization = PipelineAuthorization(rules={ - "admin": {"read": ["*"], "write": ["*"]}, - "viewer": {"read": ["audio_*"], "write": []}, -}) - -async def check_pipeline_access( - pipeline_id: str = Path(...), - user: dict = Depends(get_current_user), -): - if not authorization.can_read(pipeline_id, user): - raise HTTPException(status_code=403, detail="Access denied") - return user -``` - -### 3. Add Rate Limiting -```python -from slowapi import Limiter -limiter = Limiter(key_func=get_remote_address) -@app.post("/pipelines/{pipeline_id}/execute") -@limiter.limit("10/minute") -async def execute(...): - # ... -``` - -### 4. Enable CORS for Web Frontend -```python -from fastapi.middleware.cors import CORSMiddleware -app.add_middleware( - CORSMiddleware, - allow_origins=["https://your-dashboard.com"], - allow_methods=["*"], - allow_headers=["*"], -) -``` - -### 5. 
Deploy with Gunicorn -```bash -gunicorn -k uvicorn.workers.UvicornWorker -w 4 main:app --bind 0.0.0.0:8000 -``` - ---- - -## Learning Path - -After this example: - -1. **[API Guide]({{ '/en/guides/api/' | relative_url }})** — Full REST API documentation and best practices -2. **[WebSocket Guide]({{ '/en/guides/websocket/' | relative_url }})** — Add real-time updates to your API -3. **[Tracking Guide]({{ '/en/guides/tracking/' | relative_url }})** — Store execution history for analytics - ---- - -*This example provides a complete, production-ready API foundation. Extend it with authentication, rate limiting, and custom endpoints for your specific use case.* + +--- + +## Overview + +This comprehensive example demonstrates how to build a production-ready REST API for Taskiq-Flow using FastAPI. It covers: + +- Setting up FastAPI with pipeline visualization endpoints +- Registering pipelines programmatically +- Adding custom endpoints for pipeline execution +- Retrieving pipeline results via API +- Full OpenAPI/Swagger documentation + +**Prerequisites**: Install FastAPI and uvicorn: +```bash +pip install fastapi uvicorn[standard] +``` + +--- + +## What This Example Shows + +- Using `PipelineVisualizationAPI` for built-in endpoints +- Registering pipelines with the API +- Creating custom endpoints to execute pipelines remotely +- Retrieving results by task ID +- Complete production API structure + +--- + +## Code Walkthrough + +### 1. 
Define Tasks and Pipeline + +```python +from fastapi import FastAPI, HTTPException +from taskiq import InMemoryBroker +from taskiq_flow import DataflowPipeline, pipeline_task + +broker = InMemoryBroker(await_inplace=True) + +@broker.task +@pipeline_task(output="user_data") +async def fetch_user_data(user_id: int) -> dict: + """Fetch user data from database.""" + await asyncio.sleep(0.1) + return {"id": user_id, "name": f"User{user_id}", "email": f"user{user_id}@example.com"} + +@broker.task +@pipeline_task(output="order_history") +async def fetch_orders(user_data: dict) -> list: + """Fetch order history for user.""" + await asyncio.sleep(0.2) + user_id = user_data["id"] + return [{"order_id": 100 + user_id, "total": 99.99}] + +@broker.task +@pipeline_task(output="recommendations") +async def generate_recommendations(user_data: dict, order_history: list): + """Generate recommendations.""" + await asyncio.sleep(0.15) + return ["product_A", "product_B", "product_C"] + +# Build pipeline +sample_pipeline = DataflowPipeline.from_tasks( + broker, + [fetch_user_data, fetch_orders, generate_recommendations], +) +sample_pipeline.pipeline_id = "sample_recommendation_pipeline" +``` + +**Pipeline DAG structure:** + +```mermaid +flowchart TD + A[fetch_user_data
output: user_data] --> B[fetch_orders
output: order_history] + A --> C[generate_recommendations
output: recommendations] + B --> C +``` + +### 2. Create FastAPI App with Visualization API + +```python +from taskiq_flow.api import create_visualization_api, PipelineVisualizationAPI + +def create_app() -> FastAPI: + app = FastAPI(title="TaskIQ Flow API", version="1.0.0") + + # Create visualization API (auto-mounts /pipelines endpoints) + viz_api = create_visualization_api(broker, app) + viz_api.add_pipeline("sample_recommendation_pipeline", sample_pipeline) + + # Custom endpoints below... + return app +``` + +The `create_visualization_api()` automatically adds these endpoints: +- `GET /health` +- `GET /pipelines` +- `POST /pipelines/{pipeline_id}` (register) +- `GET /pipelines/{pipeline_id}/status` +- `GET /pipelines/{pipeline_id}/dag` +- `GET /pipelines/{pipeline_id}/dag/dot` +- `GET /pipelines/{pipeline_id}/visualize` + +### 3. Add Custom Execute Endpoint + +```python +@app.post("/pipelines/{pipeline_id}/execute") +async def execute_pipeline( + pipeline_id: str, + parameters: dict[str, Any], +) -> dict[str, Any]: + """Execute a pipeline with given parameters.""" + if pipeline_id not in viz_api.pipelines: + raise HTTPException(status_code=404, detail=f"Pipeline {pipeline_id} not found") + + pipeline = viz_api.pipelines[pipeline_id] + try: + result = await pipeline.kiq_dataflow(**parameters) + return { + "status": "executed", + "pipeline_id": pipeline_id, + "task_id": result.task_id, + "message": "Pipeline execution started. Use /result/{task_id} to check status.", + } + except Exception as e: + raise HTTPException(status_code=500, detail=str(e)) from e +``` + +### 4. 
Add Result Retrieval Endpoint + +```python +@app.get("/pipelines/result/{task_id}") +async def get_result(task_id: str) -> dict[str, Any]: + """Get the result of a pipeline execution.""" + try: + result = await broker.result_backend.get_result(task_id) + if result is None: + raise HTTPException(status_code=404, detail=f"No result found for task_id {task_id}") + return {"task_id": task_id, "result": result.return_value} + except Exception as e: + raise HTTPException(status_code=500, detail=str(e)) from e +``` + +### 5. Run the Server + +```bash +uvicorn examples.api_example:create_app --reload --port 8000 +``` + +Or programmatically: + +```python +if __name__ == "__main__": + import uvicorn + uvicorn.run(create_app(), host="0.0.0.0", port=8000) +``` + +--- + +## API Endpoints Reference + +### Built-in (from `create_visualization_api`) + +| Method | Endpoint | Description | +|--------|----------|-------------| +| GET | `/health` | Health check | +| GET | `/pipelines` | List all registered pipelines | +| POST | `/pipelines/{pipeline_id}` | Register new pipeline | +| GET | `/pipelines/{pipeline_id}/status` | Get current execution status | +| GET | `/pipelines/{pipeline_id}/dag` | Get DAG as JSON | +| GET | `/pipelines/{pipeline_id}/dag/dot` | Get DAG as DOT string | +| GET | `/pipelines/{pipeline_id}/visualize` | Full visualization metadata | + +### Custom (defined in example) + +| Method | Endpoint | Description | +|--------|----------|-------------| +| POST | `/pipelines/{pipeline_id}/execute` | Execute pipeline with parameters | +| GET | `/pipelines/result/{task_id}` | Get result by task ID | + +--- + +## Testing the API + +### 1. Interactive Docs +Open http://localhost:8000/docs for Swagger UI. + +### 2. 
Execute Pipeline + +```bash +curl -X POST "http://localhost:8000/pipelines/sample_recommendation_pipeline/execute" \ + -H "Content-Type: application/json" \ + -d '{"user_id": 123}' +``` + +Response: +```json +{ + "status": "executed", + "pipeline_id": "sample_recommendation_pipeline", + "task_id": "abc123def456", + "message": "Pipeline execution started..." +} +``` + +### 3. Poll for Result + +```bash +curl "http://localhost:8000/pipelines/result/abc123def456" +``` + +Response: +```json +{ + "task_id": "abc123def456", + "result": { + "user_data": {"id": 123, "name": "User123", ...}, + "order_history": [...], + "recommendations": ["product_A", "product_B", "product_C"] + } +} +``` + +### 4. View DAG + +```bash +curl "http://localhost:8000/pipelines/sample_recommendation_pipeline/dag" +``` + +Returns JSON structure of the pipeline graph. + +--- + +## Programmatic API Usage + +You can also use the API classes directly without HTTP: + +```python +from taskiq_flow.api import PipelineVisualizationAPI + +app = FastAPI() +viz_api = PipelineVisualizationAPI(broker, app) + +# Register pipeline +viz_api.add_pipeline("my_pipe", my_pipeline) + +# List registered pipelines +for pid, p in viz_api.pipelines.items(): + print(f"Pipeline: {pid}, tasks: {len(p.visualize()['nodes'])}") + +# Get visualization +dag_json = my_pipeline.visualize() +dot = my_pipeline.visualize_dot() +``` + +This is useful for building custom dashboard backends or CLI tools. + +--- + +## Production Considerations + +### 1. Use Persistent Broker +```python +from taskiq import RedisStreamBroker +broker = RedisStreamBroker(redis_url="redis://localhost:6379") +``` + +### 2. 
Add Authentication +```python +from fastapi import Depends, Security +from fastapi.security import APIKeyHeader + +api_key_header = APIKeyHeader(name="X-API-Key") + +async def verify_api_key(api_key: str = Security(api_key_header)): + if api_key != os.getenv("API_SECRET"): + raise HTTPException(status_code=403, detail="Invalid API key") + return api_key + +@app.post("/pipelines/{pipeline_id}/execute") +async def execute(..., api_key: str = Security(verify_api_key)): + # ... +``` + +### 2b. Add JWT Authentication +```python +from jose import jwt +from fastapi import Depends + +async def get_current_user(token: str = Depends(oauth2_scheme)): + payload = jwt.decode(token, SECRET_KEY, algorithms=["HS256"]) + return payload["sub"] + +@app.post("/pipelines/{pipeline_id}/execute") +async def execute(..., user: str = Depends(get_current_user)): + logger.info(f"User {user} executed {pipeline_id}") + # ... +``` + +### 2c. Add Pipeline-Level Authorization +```python +from taskiq_flow.security.authorization import PipelineAuthorization + +authorization = PipelineAuthorization(rules={ + "admin": {"read": ["*"], "write": ["*"]}, + "viewer": {"read": ["audio_*"], "write": []}, +}) + +async def check_pipeline_access( + pipeline_id: str = Path(...), + user: dict = Depends(get_current_user), +): + if not authorization.can_read(pipeline_id, user): + raise HTTPException(status_code=403, detail="Access denied") + return user +``` + +### 3. Add Rate Limiting +```python +from slowapi import Limiter +limiter = Limiter(key_func=get_remote_address) +@app.post("/pipelines/{pipeline_id}/execute") +@limiter.limit("10/minute") +async def execute(...): + # ... +``` + +### 4. Enable CORS for Web Frontend +```python +from fastapi.middleware.cors import CORSMiddleware +app.add_middleware( + CORSMiddleware, + allow_origins=["https://your-dashboard.com"], + allow_methods=["*"], + allow_headers=["*"], +) +``` + +### 5. 
Deploy with Gunicorn +```bash +gunicorn -k uvicorn.workers.UvicornWorker -w 4 main:app --bind 0.0.0.0:8000 +``` + +--- + +## Learning Path + +After this example: + +1. **[API Guide]({{ '/en/guides/api/' | relative_url }})** — Full REST API documentation and best practices +2. **[WebSocket Guide]({{ '/en/guides/websocket/' | relative_url }})** — Add real-time updates to your API +3. **[Tracking Guide]({{ '/en/guides/tracking/' | relative_url }})** — Store execution history for analytics + +--- + +*This example provides a complete, production-ready API foundation. Extend it with authentication, rate limiting, and custom endpoints for your specific use case.* diff --git a/docs/_en/examples/dag_visualization_demo.md b/docs/_en/examples/dag_visualization_demo.md index 879b58b..4d45efa 100644 --- a/docs/_en/examples/dag_visualization_demo.md +++ b/docs/_en/examples/dag_visualization_demo.md @@ -1,6 +1,6 @@ --- permalink: /en/examples/dag-visualization-demo/ -title: Example: dag_visualization_demo.py +title: 'Example: dag_visualization_demo.py' nav_order: 47 color_scheme: dark --- diff --git a/docs/_en/examples/dataflow-audio-pipeline.md b/docs/_en/examples/dataflow-audio-pipeline.md index 0021667..44283bc 100644 --- a/docs/_en/examples/dataflow-audio-pipeline.md +++ b/docs/_en/examples/dataflow-audio-pipeline.md @@ -1,268 +1,278 @@ ---- -permalink: /en/examples/dataflow-audio-pipeline/ -title: Example: dataflow_audio_pipeline.py -nav_order: 42 -color_scheme: dark ---- -# Example: dataflow_audio_pipeline.py - -**Complete dataflow DAG with parallel execution, map-reduce, and visualization** - -> **Version**: {VERSION} | **File**: `examples/dataflow_audio_pipeline.py` - ---- - -## Overview - -This comprehensive example showcases the full power of DataflowPipeline with: - -- Automatic DAG construction from task dependencies -- Parallel execution of independent tasks -- Map-reduce pattern for batch processing -- Pipeline visualization (DOT, JSON, ASCII) -- Mixed sequential 
and parallel workflows - -This is the reference example for understanding dataflow architecture. - ---- - -## What This Example Shows - -- **`@pipeline_task`** decorator usage with single and multiple outputs -- **Automatic dependency resolution** — tasks declare outputs; downstream tasks consume by parameter name -- **Parallel execution** — tasks with same dependency run concurrently -- **Map-reduce pattern** — batch processing with `.map()` and `.reduce()` -- **DAG visualization** — print ASCII, export DOT, get JSON - ---- - -## Code Walkthrough - -### Task Definitions - -```python -from taskiq import InMemoryBroker -from taskiq_flow import DataflowPipeline, pipeline_task - -broker = InMemoryBroker(await_inplace=True) - -# Task 1: Extract audio features (no dependencies) -@broker.task -@pipeline_task(output="audio_features") -async def extract_audio_features(track_paths: list[str]) -> dict: - features = {...} - return features - -# Task 2: Compute MIR features (depends on audio_features) -@broker.task -@pipeline_task(output="mir_features") -async def compute_mir_features(audio_features: dict) -> dict: - # Gets audio_features automatically - return {...} - -# Task 3: Generate tags (depends on mir_features) -@broker.task -@pipeline_task(output="tags") -async def generate_tags(mir_features: dict) -> list[str]: - return ["electronic", "dance"] - -# Task 4: Create embedding (depends on BOTH mir_features AND tags) -@broker.task -@pipeline_task(output="vector") -async def create_embedding(mir_features: dict, tags: list[str]) -> list[float]: - # Receives both inputs automatically - return [0.1, 0.5, 0.8] -``` - -The pipeline automatically builds this DAG: - -```mermaid -flowchart TD - A[extract_audio_features] --> B[compute_mir_features] - A --> C[generate_tags] - B --> D[create_embedding] - C --> D -``` - -**Note**: `create_embedding` depends on both `mir_features` (output of `compute_mir_features`) and `tags` (output of `generate_tags`), so it executes after both 
parallel tasks complete. - ---- - -## Example 1: Sequential Pipeline with Automatic Dependencies - -```python -async def example_sequential_pipeline(): - pipeline = DataflowPipeline.from_tasks( - broker, - [ - extract_audio_features, - compute_mir_features, - generate_tags, - create_embedding, - ], - ) - - pipeline.print_dag() - # Output: - # DAG Execution Order: - # Level 0 (parallel): extract_audio_features - # Level 1 (parallel): compute_mir_features - # Level 2 (parallel): generate_tags, create_embedding - # Final outputs: audio_features, mir_features, tags, vector - - results = await pipeline.kiq_dataflow(track_paths=["track1.mp3"]) - # results = { - # "audio_features": {...}, - # "mir_features": {...}, - # "tags": [...], - # "vector": [...] - # } -``` - -**Dependency resolution**: -1. `extract_audio_features` has no dependencies → runs first -2. `compute_mir_features` needs `audio_features` → runs after step 1 -3. `generate_tags` needs `mir_features` → runs after step 2 -4. `create_embedding` needs `mir_features` and `tags` → runs after both steps 2 & 3 complete - ---- - -## Example 2: Parallel Execution - -With the addition of `extract_spectral_features` which also depends only on `audio_features`: - -```python -@broker.task -@pipeline_task(output="spectral_features") -async def extract_spectral_features(audio_features: dict) -> dict: - await asyncio.sleep(0.2) - return {"spectral_rolloff": 5000.0} - -@broker.task -@pipeline_task(output="combined_features") -async def combine_features( - mir_features: dict, - spectral_features: dict, - tags: list[str], -) -> dict: - return {**mir_features, **spectral_features, "tags": tags} - -pipeline = DataflowPipeline.from_tasks( - broker, - [ - extract_audio_features, - compute_mir_features, # Level 1 - extract_spectral_features, # Level 1 (runs in parallel with compute_mir_features) - generate_tags, # Level 2 (depends on mir_features) - combine_features, # Level 2 (depends on mir_features + spectral_features + tags) - 
], -) -``` - -**Execution levels**: -- Level 0: `extract_audio_features` -- Level 1: `compute_mir_features`, `extract_spectral_features` (parallel) -- Level 2: `generate_tags`, `combine_features` (parallel after their dependencies met) - ---- - -## Example 3: Map-Reduce Pattern - -Process multiple tracks in parallel, then aggregate: - -```python -# Map: process each track independently -@broker.task -@pipeline_task(output="track_features") -async def process_single_track(track: str) -> dict: - return {"track": track, "duration": 180.0, "bpm": 120} - -# Reduce: aggregate all track features -@broker.task -@pipeline_task(output="playlist_stats") -async def aggregate_track_features(track_features: list[dict]) -> dict: - total_duration = sum(t["duration"] for t in track_features) - avg_bpm = sum(t["bpm"] for t in track_features) / len(track_features) - return {"total_tracks": len(track_features), "total_duration": total_duration, "avg_bpm": avg_bpm} - -# Build pipeline -pipeline = DataflowPipeline(broker) -pipeline.map( - process_single_track, - tracks, # ["track1.mp3", "track2.mp3", ...] - output="track_features", - max_parallel=4, -) -pipeline.reduce( - aggregate_track_features, - input_name="track_features", - output="playlist_stats", -) - -results = await pipeline.kiq_map_reduce() -# results = {"track_features": [...], "playlist_stats": {...}} -``` - ---- - -## Example 4: Visualization - -The pipeline provides multiple visualization formats: - -```python -# ASCII art (console) -pipeline.print_dag() - -# JSON (for web UIs) -viz_json = pipeline.visualize() -# Structure: -# { -# "nodes": [{"id": "task_name", "outputs": [...], "inputs": [...]}, ...], -# "edges": [{"from": "task_a", "to": "task_b"}], -# "levels": [["task1"], ["task2", "task3"], ...] 
-# } - -# DOT format (for Graphviz) -dot = pipeline.visualize_dot() -# Save and render: -# with open("pipeline.dot", "w") as f: -# f.write(dot) -# Run: dot -Tpng pipeline.dot -o pipeline.png -``` - ---- - -## Running the Example - -```bash -python examples/dataflow_audio_pipeline.py -``` - -Expected output includes: -- DAG ASCII prints showing execution order -- DAG DOT representation snippet -- DAG JSON structure snippet - ---- - -## Key Takeaways - -1. **Automatic dependency resolution** — No need to manually chain tasks; just declare outputs -2. **Parallel execution** — Independent tasks run concurrently automatically -3. **Dataflow programming** — Tasks are pure functions; output flows to inputs -4. **Visual debugging** — `print_dag()` shows exactly how tasks will execute -5. **Scalable patterns** — Map-reduce built in for batch workloads - ---- - -## Learning Path - -After this example: - -1. **[DataflowPipeline Guide]({{ '/en/guides/pipelines.md#2-dataflow-pipeline' | relative_url }})** — Deep dive into dataflow features -2. **[Execution Guide]({{ '/en/guides/execution/' | relative_url }})** — Parallelism, timeouts, error handling -3. **[Performance Guide]({{ '/en/guides/performance/' | relative_url }})** — Tuning `max_parallel`, resource profiles - ---- - -*This is the flagship example. 
Study it thoroughly to understand Taskiq-Flow's dataflow model.* +--- +permalink: /en/examples/dataflow-audio-pipeline/ +title: 'Example: dataflow_audio_pipeline.py' +nav_order: 42 +color_scheme: dark +--- +# Example: dataflow_audio_pipeline.py + +**Complete dataflow DAG with parallel execution, map-reduce, and visualization** + +> **Version**: {VERSION} | **File**: `examples/dataflow_audio_pipeline.py` + +--- + +## Overview + +This comprehensive example showcases the full power of DataflowPipeline with: + +- Automatic DAG construction from task dependencies +- Parallel execution of independent tasks +- Map-reduce pattern for batch processing +- Pipeline visualization (DOT, JSON, ASCII) +- Mixed sequential and parallel workflows + +This is the reference example for understanding dataflow architecture. + +--- + +## What This Example Shows + +- **`@pipeline_task`** decorator usage with single and multiple outputs +- **Automatic dependency resolution** — tasks declare outputs; downstream tasks consume by parameter name +- **Parallel execution** — tasks with same dependency run concurrently +- **Map-reduce pattern** — batch processing with `.map()` and `.reduce()` +- **DAG visualization** — print ASCII, export DOT, get JSON + +--- + +## Code Walkthrough + +### Task Definitions + +{% raw %} +```python +from taskiq import InMemoryBroker +from taskiq_flow import DataflowPipeline, pipeline_task + +broker = InMemoryBroker(await_inplace=True) + +# Task 1: Extract audio features (no dependencies) +@broker.task +@pipeline_task(output="audio_features") +async def extract_audio_features(track_paths: list[str]) -> dict: + features = {...} + return features + +# Task 2: Compute MIR features (depends on audio_features) +@broker.task +@pipeline_task(output="mir_features") +async def compute_mir_features(audio_features: dict) -> dict: + # Gets audio_features automatically + return {...} + +# Task 3: Generate tags (depends on mir_features) +@broker.task +@pipeline_task(output="tags") 
+async def generate_tags(mir_features: dict) -> list[str]: + return ["electronic", "dance"] + +# Task 4: Create embedding (depends on BOTH mir_features AND tags) +@broker.task +@pipeline_task(output="vector") +async def create_embedding(mir_features: dict, tags: list[str]) -> list[float]: + # Receives both inputs automatically + return [0.1, 0.5, 0.8] +``` +{% endraw %} +The pipeline automatically builds this DAG: + +{% raw %} +```mermaid +flowchart TD + A[extract_audio_features] --> B[compute_mir_features] + A --> C[generate_tags] + B --> D[create_embedding] + C --> D +``` +{% endraw %} +**Note**: `create_embedding` depends on both `mir_features` (output of `compute_mir_features`) and `tags` (output of `generate_tags`), so it executes after both parallel tasks complete. + +--- + +## Example 1: Sequential Pipeline with Automatic Dependencies + +{% raw %} +```python +async def example_sequential_pipeline(): + pipeline = DataflowPipeline.from_tasks( + broker, + [ + extract_audio_features, + compute_mir_features, + generate_tags, + create_embedding, + ], + ) + + pipeline.print_dag() + # Output: + # DAG Execution Order: + # Level 0 (parallel): extract_audio_features + # Level 1 (parallel): compute_mir_features + # Level 2 (parallel): generate_tags, create_embedding + # Final outputs: audio_features, mir_features, tags, vector + + results = await pipeline.kiq_dataflow(track_paths=["track1.mp3"]) + # results = { + # "audio_features": {...}, + # "mir_features": {...}, + # "tags": [...], + # "vector": [...] + # } +``` +{% endraw %} +**Dependency resolution**: + +1. `extract_audio_features` has no dependencies → runs first +2. `compute_mir_features` needs `audio_features` → runs after step 1 +3. `generate_tags` needs `mir_features` → runs after step 2 +4. 
`create_embedding` needs `mir_features` and `tags` → runs after both steps 2 & 3 complete + +--- + +## Example 2: Parallel Execution + +With the addition of `extract_spectral_features` which also depends only on `audio_features`: + +{% raw %} +```python +@broker.task +@pipeline_task(output="spectral_features") +async def extract_spectral_features(audio_features: dict) -> dict: + await asyncio.sleep(0.2) + return {"spectral_rolloff": 5000.0} + +@broker.task +@pipeline_task(output="combined_features") +async def combine_features( + mir_features: dict, + spectral_features: dict, + tags: list[str], +) -> dict: + return {**mir_features, **spectral_features, "tags": tags} + +pipeline = DataflowPipeline.from_tasks( + broker, + [ + extract_audio_features, + compute_mir_features, # Level 1 + extract_spectral_features, # Level 1 (runs in parallel with compute_mir_features) + generate_tags, # Level 2 (depends on mir_features) + combine_features, # Level 2 (depends on mir_features + spectral_features + tags) + ], +) +``` +{% endraw %} +**Execution levels**: + +- Level 0: `extract_audio_features` +- Level 1: `compute_mir_features`, `extract_spectral_features` (parallel) +- Level 2: `generate_tags`, `combine_features` (parallel after their dependencies met) + +--- + +## Example 3: Map-Reduce Pattern + +Process multiple tracks in parallel, then aggregate: + +{% raw %} +```python +# Map: process each track independently +@broker.task +@pipeline_task(output="track_features") +async def process_single_track(track: str) -> dict: + return {"track": track, "duration": 180.0, "bpm": 120} + +# Reduce: aggregate all track features +@broker.task +@pipeline_task(output="playlist_stats") +async def aggregate_track_features(track_features: list[dict]) -> dict: + total_duration = sum(t["duration"] for t in track_features) + avg_bpm = sum(t["bpm"] for t in track_features) / len(track_features) + return {"total_tracks": len(track_features), "total_duration": total_duration, "avg_bpm": avg_bpm} + 
+# Build pipeline +pipeline = DataflowPipeline(broker) +pipeline.map( + process_single_track, + tracks, # ["track1.mp3", "track2.mp3", ...] + output="track_features", + max_parallel=4, +) +pipeline.reduce( + aggregate_track_features, + input_name="track_features", + output="playlist_stats", +) + +results = await pipeline.kiq_map_reduce() +# results = {"track_features": [...], "playlist_stats": {...}} +``` +{% endraw %} +--- + +## Example 4: Visualization + +The pipeline provides multiple visualization formats: + +{% raw %} +```python +# ASCII art (console) +pipeline.print_dag() + +# JSON (for web UIs) +viz_json = pipeline.visualize() +# Structure: +# { +# "nodes": [{"id": "task_name", "outputs": [...], "inputs": [...]}, ...], +# "edges": [{"from": "task_a", "to": "task_b"}], +# "levels": [["task1"], ["task2", "task3"], ...] +# } + +# DOT format (for Graphviz) +dot = pipeline.visualize_dot() +# Save and render: +# with open("pipeline.dot", "w") as f: +# f.write(dot) +# Run: dot -Tpng pipeline.dot -o pipeline.png +``` +{% endraw %} +--- + +## Running the Example + +{% raw %} +```bash +python examples/dataflow_audio_pipeline.py +``` +{% endraw %} +Expected output includes: + +- DAG ASCII prints showing execution order +- DAG DOT representation snippet +- DAG JSON structure snippet + +--- + +## Key Takeaways + +1. **Automatic dependency resolution** — No need to manually chain tasks; just declare outputs +2. **Parallel execution** — Independent tasks run concurrently automatically +3. **Dataflow programming** — Tasks are pure functions; output flows to inputs +4. **Visual debugging** — `print_dag()` shows exactly how tasks will execute +5. **Scalable patterns** — Map-reduce built in for batch workloads + +--- + +## Learning Path + +After this example: + +1. **[DataflowPipeline Guide]({{ '/en/guides/pipelines.md#2-dataflow-pipeline' | relative_url }})** — Deep dive into dataflow features +2. 
**[Execution Guide]({{ '/en/guides/execution/' | relative_url }})** — Parallelism, timeouts, error handling +3. **[Performance Guide]({{ '/en/guides/performance/' | relative_url }})** — Tuning `max_parallel`, resource profiles + +--- + +*This is the flagship example. Study it thoroughly to understand Taskiq-Flow's dataflow model.* diff --git a/docs/_en/examples/index.md b/docs/_en/examples/index.md index 1cf5c79..4c1d344 100644 --- a/docs/_en/examples/index.md +++ b/docs/_en/examples/index.md @@ -1,90 +1,90 @@ ---- -title: Example Gallery -nav_order: 40 -permalink: /en/examples/ ---- -# Example Gallery - -**Working examples demonstrating key Taskiq-Flow features and patterns** - -> **Version**: {VERSION} | **Related**: [Quick Start Guide]({{ '/en/quickstart/' | relative_url }}) - ---- - -## Overview - -This gallery provides in-depth walkthroughs of the example scripts included in the `examples/` directory. Each example demonstrates a specific feature or integration pattern. - ---- - -## Example Index - -| Example | Description | Key Concepts | -|---------|-------------|--------------| -| [Basic Pipeline]({{ '/en/examples/quickstart/' | relative_url }}) | Simple sequential pipeline with map, filter, and group operations | SequentialPipeline, basic steps | -| [Tracking Demo]({{ '/en/examples/tracking-demo/' | relative_url }}) | Real-time pipeline monitoring with PipelineTrackingManager | Tracking, status storage, visualization | -| [Scheduled Pipeline]({{ '/en/examples/scheduled-pipeline/' | relative_url }}) | Cron-based recurring pipeline execution | PipelineScheduler, APScheduler, timezones | -| [Dataflow Audio Pipeline]({{ '/en/examples/dataflow-audio-pipeline/' | relative_url }}) | Full DAG with parallelism, map-reduce, and visualization | DataflowPipeline, automatic DAG, parallelism | -| [DAG Visualization Demo]({{ '/en/examples/dag-visualization-demo/' | relative_url }}) | NetworkX DAG analysis: critical path, parallel groups, multi-format export | 
DAGVisualizer, NetworkX, critical path, exports | -| [NiceGUI DAG Demo]({{ '/en/examples/nicegui-dag-demo/' | relative_url }}) | Interactive web DAG viewer via NiceGUI and MermaidGenerator | MermaidGenerator, NiceGUI, interactive web visualization | -| [Registry Discovery]({{ '/en/examples/registry-discovery/' | relative_url }}) | Manual DataflowRegistry construction, DAG inspection, and low-level execution | DataflowRegistry, ExecutionEngine, DAG introspection | -| [WebSocket Demo]({{ '/en/examples/websocket-demo/' | relative_url }}) | Real-time event streaming via WebSockets | HookManager, WebSocket transport, live tracking | -| [REST API]({{ '/en/examples/api-example/' | relative_url }}) | FastAPI integration for remote pipeline management | PipelineVisualizationAPI, custom endpoints | - ---- - -## Running the Examples - -Each example page includes: - -- **Overview** — What the example demonstrates -- **Prerequisites** — Required dependencies and setup -- **Code Walkthrough** — Line-by-line explanation -- **Key Concepts** — Core features highlighted -- **Running Instructions** — How to execute the script -- **Expected Output** — Sample output for verification -- **Common Issues** — Troubleshooting tips - -**To run an example**: - -```bash -# Navigate to the repository root -cd taskiq-flow - -# Install dependencies if needed -pip install -e . - -# Run an example script -python examples/quickstart.py -``` - -Some examples require additional services (Redis, etc.). See individual example pages for specifics. 
- ---- - -## Example Categories - -### Getting Started -- [Basic Pipeline]({{ '/en/examples/quickstart/' | relative_url }}) — Start here if you're new - -### Monitoring & Operations -- [Tracking Demo]({{ '/en/examples/tracking-demo/' | relative_url }}) -- [Scheduled Pipeline]({{ '/en/examples/scheduled-pipeline/' | relative_url }}) -- [WebSocket Demo]({{ '/en/examples/websocket-demo/' | relative_url }}) - -### Advanced Workflows -- [Dataflow Audio Pipeline]({{ '/en/examples/dataflow-audio-pipeline/' | relative_url }}) -- [DAG Visualization Demo]({{ '/en/examples/dag-visualization-demo/' | relative_url }}) -- [NiceGUI DAG Demo]({{ '/en/examples/nicegui-dag-demo/' | relative_url }}) -- [Registry Discovery]({{ '/en/examples/registry-discovery/' | relative_url }}) - -### Integration -- [REST API]({{ '/en/examples/api-example/' | relative_url }}) - ---- - -## Next Steps - -- **[Quick Start Guide]({{ '/en/quickstart/' | relative_url }})** — Run your first pipeline -- **[User Guides]({{ '/en/guides/' | relative_url }})** — Deep dives into each feature -- **[API Reference]({{ '/en/api/' | relative_url }})** — Complete module documentation +--- +title: Example Gallery +nav_order: 40 +permalink: /en/examples/ +--- +# Example Gallery + +**Working examples demonstrating key Taskiq-Flow features and patterns** + +> **Version**: {VERSION} | **Related**: [Quick Start Guide]({{ '/en/quickstart/' | relative_url }}) + +--- + +## Overview + +This gallery provides in-depth walkthroughs of the example scripts included in the `examples/` directory. Each example demonstrates a specific feature or integration pattern. 
+ +--- + +## Example Index + +| Example | Description | Key Concepts | +|---------|-------------|--------------| +| [Basic Pipeline]({{ '/en/examples/quickstart/' | relative_url }}) | Simple sequential pipeline with map, filter, and group operations | SequentialPipeline, basic steps | +| [Tracking Demo]({{ '/en/examples/tracking-demo/' | relative_url }}) | Real-time pipeline monitoring with PipelineTrackingManager | Tracking, status storage, visualization | +| [Scheduled Pipeline]({{ '/en/examples/scheduled-pipeline/' | relative_url }}) | Cron-based recurring pipeline execution | PipelineScheduler, APScheduler, timezones | +| [Dataflow Audio Pipeline]({{ '/en/examples/dataflow-audio-pipeline/' | relative_url }}) | Full DAG with parallelism, map-reduce, and visualization | DataflowPipeline, automatic DAG, parallelism | +| [DAG Visualization Demo]({{ '/en/examples/dag-visualization-demo/' | relative_url }}) | NetworkX DAG analysis: critical path, parallel groups, multi-format export | DAGVisualizer, NetworkX, critical path, exports | +| [NiceGUI DAG Demo]({{ '/en/examples/nicegui-dag-demo/' | relative_url }}) | Interactive web DAG viewer via NiceGUI and MermaidGenerator | MermaidGenerator, NiceGUI, interactive web visualization | +| [Registry Discovery]({{ '/en/examples/registry-discovery/' | relative_url }}) | Manual DataflowRegistry construction, DAG inspection, and low-level execution | DataflowRegistry, ExecutionEngine, DAG introspection | +| [WebSocket Demo]({{ '/en/examples/websocket-demo/' | relative_url }}) | Real-time event streaming via WebSockets | HookManager, WebSocket transport, live tracking | +| [REST API]({{ '/en/examples/api-example/' | relative_url }}) | FastAPI integration for remote pipeline management | PipelineVisualizationAPI, custom endpoints | + +--- + +## Running the Examples + +Each example page includes: + +- **Overview** — What the example demonstrates +- **Prerequisites** — Required dependencies and setup +- **Code Walkthrough** — 
Line-by-line explanation +- **Key Concepts** — Core features highlighted +- **Running Instructions** — How to execute the script +- **Expected Output** — Sample output for verification +- **Common Issues** — Troubleshooting tips + +**To run an example**: + +```bash +# Navigate to the repository root +cd taskiq-flow + +# Install dependencies if needed +pip install -e . + +# Run an example script +python examples/quickstart.py +``` + +Some examples require additional services (Redis, etc.). See individual example pages for specifics. + +--- + +## Example Categories + +### Getting Started +- [Basic Pipeline]({{ '/en/examples/quickstart/' | relative_url }}) — Start here if you're new + +### Monitoring & Operations +- [Tracking Demo]({{ '/en/examples/tracking-demo/' | relative_url }}) +- [Scheduled Pipeline]({{ '/en/examples/scheduled-pipeline/' | relative_url }}) +- [WebSocket Demo]({{ '/en/examples/websocket-demo/' | relative_url }}) + +### Advanced Workflows +- [Dataflow Audio Pipeline]({{ '/en/examples/dataflow-audio-pipeline/' | relative_url }}) +- [DAG Visualization Demo]({{ '/en/examples/dag-visualization-demo/' | relative_url }}) +- [NiceGUI DAG Demo]({{ '/en/examples/nicegui-dag-demo/' | relative_url }}) +- [Registry Discovery]({{ '/en/examples/registry-discovery/' | relative_url }}) + +### Integration +- [REST API]({{ '/en/examples/api-example/' | relative_url }}) + +--- + +## Next Steps + +- **[Quick Start Guide]({{ '/en/quickstart/' | relative_url }})** — Run your first pipeline +- **[User Guides]({{ '/en/guides/' | relative_url }})** — Deep dives into each feature +- **[API Reference]({{ '/en/api/' | relative_url }})** — Complete module documentation diff --git a/docs/_en/examples/nicegui-dag-demo.md b/docs/_en/examples/nicegui-dag-demo.md index cc20614..4872b9a 100644 --- a/docs/_en/examples/nicegui-dag-demo.md +++ b/docs/_en/examples/nicegui-dag-demo.md @@ -1,5 +1,5 @@ --- -title: Example: nicegui_dag_demo.py +title: 'Example: nicegui_dag_demo.py' 
nav_order: 48 color_scheme: dark --- diff --git a/docs/_en/examples/quickstart.md b/docs/_en/examples/quickstart.md index 12ca849..6ab82fa 100644 --- a/docs/_en/examples/quickstart.md +++ b/docs/_en/examples/quickstart.md @@ -1,173 +1,173 @@ ---- -permalink: /en/examples/quickstart/ -title: Example: quickstart.py -nav_order: 41 -color_scheme: dark ---- -# Example: quickstart.py - -**Basic sequential pipeline with map, filter, and group operations** - +--- +permalink: /en/examples/quickstart/ +title: 'Example: quickstart.py' +nav_order: 41 +color_scheme: dark +--- +# Example: quickstart.py + +**Basic sequential pipeline with map, filter, and group operations** + > **Version**: {VERSION} | **File**: `examples/quickstart.py` - ---- - -## Overview - -This example demonstrates the fundamentals of Taskiq-Flow using a classic sequential pipeline. It covers: - -- Task definition with `@broker.task` -- Pipeline construction with `.call_next()`, `.map()`, `.filter()` -- Running the pipeline and retrieving results -- Understanding data flow through steps - ---- - -## Code Walkthrough - -```python -import asyncio -from taskiq import InMemoryBroker -from taskiq_flow import Pipeline, PipelineMiddleware - -# 1. Initialize broker and add middleware -broker = InMemoryBroker() -broker.add_middlewares(PipelineMiddleware()) - -# 2. Define tasks -@broker.task -def add_one(value: int) -> int: - return value + 1 - -@broker.task -def repeat(value: int, times: int) -> list[int]: - return [value] * times - -@broker.task -def is_positive(value: int) -> bool: - return value >= 0 - -# 3. Build pipeline -async def main(): - pipeline = ( - Pipeline(broker) - .call_next(add_one) # Step 1: 1 → 2 - .call_next(repeat, times=4) # Step 2: 2 → [2,2,2,2] - .map(add_one) # Step 3: [2,2,2,2] → [3,3,3,3] - .filter(is_positive) # Step 4: keep positives (all kept) - ) - - # 4. 
Execute - task = await pipeline.kiq(1) - result = await task.wait_result() - print("Result:", result.return_value) # [3, 3, 3, 3] - -asyncio.run(main()) -``` - ---- - -## Step-by-Step Explanation - -### Step 1: `call_next(add_one)` - -- **Input**: `1` -- **Operation**: `add_one(1) = 2` -- **Output**: `2` - -### Step 2: `call_next(repeat, times=4)` - -- **Input**: `2` -- **Operation**: `repeat(2, times=4) = [2, 2, 2, 2]` -- **Output**: `[2, 2, 2, 2]` - -### Step 3: `map(add_one)` - -- **Input**: `[2, 2, 2, 2]` (iterable) -- **Operation**: Apply `add_one` to each element **in parallel** - - `add_one(2) = 3` - - `add_one(2) = 3` - - `add_one(2) = 3` - - `add_one(2) = 3` -- **Output**: `[3, 3, 3, 3]` - -### Step 4: `filter(is_positive)` - -- **Input**: `[3, 3, 3, 3]` (iterable) -- **Operation**: Keep elements where `is_positive(element) == True` - - All 4 elements are positive → all kept -- **Output**: `[3, 3, 3, 3]` - ---- - -## Key Concepts Demonstrated - -1. **Task definition** — Every pipeline step must be a task (`@broker.task`) -2. **Middleware requirement** — `PipelineMiddleware` **must** be added to broker -3. **Data flow** — Each step receives previous output (except `call_after`) -4. **Parallel execution** — `.map()` runs elements concurrently -5. 
**Chaining** — Methods return pipeline for fluent interface - ---- - -## Running the Example - -```bash -python examples/quickstart.py -``` - -Expected output: -``` -Result: [3, 3, 3, 3] -``` - ---- - -## Variations to Try - -### Use `filter` to remove negatives - -```python -@broker.task -def subtract_three(value: int) -> int: - return value - 5 # results in [-2, -2, -2, -2] - -pipeline = ( - Pipeline(broker) - .call_next(add_one) - .call_next(repeat, times=4) - .map(subtract_three) # [2,2,2,2] → [-2,-2,-2,-2] - .filter(is_positive) # [] — all filtered out -) -``` - -### Use `group` for parallel independent tasks - -```python -@broker.task -def task_a(x: int) -> int: return x * 2 -@broker.task -def task_b(x: int) -> int: return x + 10 -@broker.task -def task_c(x: int) -> int: return x ** 2 - -pipeline = Pipeline(broker).call_next(add_one) # 1 → 2 -pipeline.group([task_a, task_b, task_c], param_names=["x"]) -# All three receive 2 and run in parallel -# Result: [4, 12, 4] -``` - ---- - -## Learning Path - -After this example: - -1. **[Dataflow Pipelines]({{ '/en/guides/pipelines.md#2-dataflow-pipeline' | relative_url }})** — Automatic DAG construction -2. **[Task Definition]({{ '/en/guides/tasks/' | relative_url }})** — Advanced task features -3. **[Tracking]({{ '/en/guides/tracking/' | relative_url }})** — Monitor pipeline execution -4. **[MapReduce]({{ '/en/guides/execution.md#3-map-reduce-pattern' | relative_url }})** — Batch processing pattern - ---- - -*This example is the "Hello World" of Taskiq-Flow. Master it before moving to more complex patterns.* + +--- + +## Overview + +This example demonstrates the fundamentals of Taskiq-Flow using a classic sequential pipeline. 
It covers: + +- Task definition with `@broker.task` +- Pipeline construction with `.call_next()`, `.map()`, `.filter()` +- Running the pipeline and retrieving results +- Understanding data flow through steps + +--- + +## Code Walkthrough + +```python +import asyncio +from taskiq import InMemoryBroker +from taskiq_flow import Pipeline, PipelineMiddleware + +# 1. Initialize broker and add middleware +broker = InMemoryBroker() +broker.add_middlewares(PipelineMiddleware()) + +# 2. Define tasks +@broker.task +def add_one(value: int) -> int: + return value + 1 + +@broker.task +def repeat(value: int, times: int) -> list[int]: + return [value] * times + +@broker.task +def is_positive(value: int) -> bool: + return value >= 0 + +# 3. Build pipeline +async def main(): + pipeline = ( + Pipeline(broker) + .call_next(add_one) # Step 1: 1 → 2 + .call_next(repeat, times=4) # Step 2: 2 → [2,2,2,2] + .map(add_one) # Step 3: [2,2,2,2] → [3,3,3,3] + .filter(is_positive) # Step 4: keep positives (all kept) + ) + + # 4. Execute + task = await pipeline.kiq(1) + result = await task.wait_result() + print("Result:", result.return_value) # [3, 3, 3, 3] + +asyncio.run(main()) +``` + +--- + +## Step-by-Step Explanation + +### Step 1: `call_next(add_one)` + +- **Input**: `1` +- **Operation**: `add_one(1) = 2` +- **Output**: `2` + +### Step 2: `call_next(repeat, times=4)` + +- **Input**: `2` +- **Operation**: `repeat(2, times=4) = [2, 2, 2, 2]` +- **Output**: `[2, 2, 2, 2]` + +### Step 3: `map(add_one)` + +- **Input**: `[2, 2, 2, 2]` (iterable) +- **Operation**: Apply `add_one` to each element **in parallel** + - `add_one(2) = 3` + - `add_one(2) = 3` + - `add_one(2) = 3` + - `add_one(2) = 3` +- **Output**: `[3, 3, 3, 3]` + +### Step 4: `filter(is_positive)` + +- **Input**: `[3, 3, 3, 3]` (iterable) +- **Operation**: Keep elements where `is_positive(element) == True` + - All 4 elements are positive → all kept +- **Output**: `[3, 3, 3, 3]` + +--- + +## Key Concepts Demonstrated + +1. 
**Task definition** — Every pipeline step must be a task (`@broker.task`) +2. **Middleware requirement** — `PipelineMiddleware` **must** be added to broker +3. **Data flow** — Each step receives previous output (except `call_after`) +4. **Parallel execution** — `.map()` runs elements concurrently +5. **Chaining** — Methods return pipeline for fluent interface + +--- + +## Running the Example + +```bash +python examples/quickstart.py +``` + +Expected output: +``` +Result: [3, 3, 3, 3] +``` + +--- + +## Variations to Try + +### Use `filter` to remove negatives + +```python +@broker.task +def subtract_three(value: int) -> int: + return value - 5 # results in [-2, -2, -2, -2] + +pipeline = ( + Pipeline(broker) + .call_next(add_one) + .call_next(repeat, times=4) + .map(subtract_three) # [2,2,2,2] → [-2,-2,-2,-2] + .filter(is_positive) # [] — all filtered out +) +``` + +### Use `group` for parallel independent tasks + +```python +@broker.task +def task_a(x: int) -> int: return x * 2 +@broker.task +def task_b(x: int) -> int: return x + 10 +@broker.task +def task_c(x: int) -> int: return x ** 2 + +pipeline = Pipeline(broker).call_next(add_one) # 1 → 2 +pipeline.group([task_a, task_b, task_c], param_names=["x"]) +# All three receive 2 and run in parallel +# Result: [4, 12, 4] +``` + +--- + +## Learning Path + +After this example: + +1. **[Dataflow Pipelines]({{ '/en/guides/pipelines.md#2-dataflow-pipeline' | relative_url }})** — Automatic DAG construction +2. **[Task Definition]({{ '/en/guides/tasks/' | relative_url }})** — Advanced task features +3. **[Tracking]({{ '/en/guides/tracking/' | relative_url }})** — Monitor pipeline execution +4. **[MapReduce]({{ '/en/guides/execution.md#3-map-reduce-pattern' | relative_url }})** — Batch processing pattern + +--- + +*This example is the "Hello World" of Taskiq-Flow. 
Master it before moving to more complex patterns.* diff --git a/docs/_en/examples/registry-discovery.md b/docs/_en/examples/registry-discovery.md index 5392234..f8e5a27 100644 --- a/docs/_en/examples/registry-discovery.md +++ b/docs/_en/examples/registry-discovery.md @@ -1,289 +1,289 @@ ---- -permalink: /en/examples/registry-discovery/ -title: Example: registry_discovery_example.py -nav_order: 43 -color_scheme: dark ---- -# Example: registry_discovery_example.py - -**Manual DataflowRegistry construction, DAG inspection, and low-level execution** - +--- +permalink: /en/examples/registry-discovery/ +title: 'Example: registry_discovery_example.py' +nav_order: 43 +color_scheme: dark +--- +# Example: registry_discovery_example.py + +**Manual DataflowRegistry construction, DAG inspection, and low-level execution** + > **Version**: {VERSION} | **File**: `examples/registry_discovery_example.py` - ---- - -## Overview - -This advanced example demonstrates the internals of Taskiq-Flow's automatic dependency resolution system using `DataflowRegistry`. 
It shows how to: - -- Manually register tasks with their I/O declarations -- Inspect the dataflow graph before execution -- Build and validate a DAG -- Execute pipelines using `ExecutionEngine` directly -- Understand data provenance and task dependencies - -**This is the core mechanism behind `DataflowPipeline.from_tasks()`.** - ---- - -## What This Example Shows - -- Complete `DataflowRegistry` API usage -- Manual DAG construction from task metadata -- Querying task dependencies (producers/consumers) -- Topological sorting and parallel level detection -- Direct `ExecutionEngine` execution -- The `DataCache` for manual step-by-step execution -- Error detection (missing dependencies, cycles) - ---- - -## Code Walkthrough - -### Tasks Definition (same as dataflow_audio style) - -```python -from taskiq_flow.dataflow.registry import DataflowRegistry -from taskiq_flow.execution_engine import ExecutionEngine -from taskiq_flow.dataflow.cache import DataCache -from taskiq_flow.visualization import DAGVisualizer - -@broker.task -@pipeline_task(output="raw_data") -async def load_data(source: str) -> dict: - return {"source": source, "records": [...]} - -@broker.task -@pipeline_task(output="cleaned_data") -async def clean_data(raw_data: dict) -> dict: - records = [r for r in raw_data["records"] if r["value"] > 0] - return {"source": raw_data["source"], "records": records} - -@broker.task -@pipeline_task(output="features") -async def extract_features(cleaned_data:dict) -> dict: - total = sum(r["value"] for r in cleaned_data["records"]) - return {"total": total, "count": len(cleaned_data["records"])} - -@broker.task -@pipeline_task(output="report") -async def generate_report(features: dict) -> dict: - return {"report_id": "RPT-001", "summary": features} -``` - ---- - -## Example 1: Manual Registry Construction & Inspection - -```python -async def example_manual_registry(): - registry = DataflowRegistry() - - # Register tasks manually - registry.register_task(load_data, 
output="raw_data", inputs=["source"]) - registry.register_task(clean_data, output="cleaned_data", inputs=["raw_data"]) - registry.register_task(extract_features, output="features", inputs=["cleaned_data"]) - registry.register_task(generate_report, output="report", inputs=["features"]) - - # Inspect the registry - print(f"Tasks: {[t.task_name for t in registry.get_tasks()]}") - # ['load_data', 'clean_data', 'extract_features', 'generate_report'] - - # Query dependencies - deps = registry.get_data_dependencies(generate_report) - print(f"generate_report depends on: {deps}") # ['features'] - - # Find who produces 'features' - producer = registry.get_producer("features") - print(f"'features' produced by: {producer.task_name}") # extract_features - - # Find who consumes 'raw_data' - consumers = registry.get_consumers("raw_data") - print(f"'raw_data' consumed by: {[c.task_name for c in consumers]}") # [clean_data] - - # External inputs (not produced by any task) - external = registry.get_external_inputs() - print(f"External inputs: {external}") # ['source'] - - # Outputs (final results) - outputs = registry.get_outputs() - print(f"Pipeline outputs: {outputs}") # ['raw_data', 'cleaned_data', 'features', 'report'] -``` - -**Key methods**: - -| Method | Returns | -|--------|---------| -| `get_tasks()` | All registered `TaskNode` objects | -| `get_outputs()` | All output keys | -| `get_external_inputs()` | Inputs not produced by any task | -| `get_producer(output_key)` | Task that produces that output | -| `get_consumers(input_key)` | Tasks needing that input | -| `get_data_dependencies(task)` | List of input keys for a task | - ---- - -## Example 2: Building and Visualizing the DAG - -```python - # Build DAG - dag = registry.build_dag() - - print(f"DAG: {len(dag.nodes)} nodes, {len(dag.edges)} edges") - - # Execution order (topological sort) - order = dag.topological_sort() - for i, node in enumerate(order): - print(f"{i+1}. 
{node.task_name}") - - # Parallel execution levels - for level_idx, level_nodes in enumerate(dag.levels): - tasks = [n.task_name for n in level_nodes] - print(f"Level {level_idx}: {tasks}") - - # ASCII visualization - dag.print() - - # DOT format - dot = DAGVisualizer.to_dot(dag) - with open("pipeline.dot", "w") as f: - f.write(dot) -``` - -**DAG properties**: -- `dag.nodes` — All nodes -- `dag.edges` — Dependency edges -- `dag.roots` — Nodes with no dependencies -- `dag.leaves` — Nodes with no dependents -- `dag.levels` — Groups of tasks that can run in parallel -- `dag.topological_sort()` — Linear execution order - ---- - -## Example 3: Validation & Error Detection - -```python -async def example_validation(): - registry = DataflowRegistry() - registry.register_task(load_data, output="raw_data", inputs=["source"]) - - # Broken: depends on nonexistent output - @broker.task - @pipeline_task(output="result") - async def broken_task(nonexistent_data: dict): - return {"result": "broken"} - - registry.register_task(broken_task, output="result", inputs=["nonexistent_data"]) - - try: - dag = registry.build_dag() # Raises ValueError - except ValueError as e: - print(f"Caught expected error: {e}") - # "Task 'broken_task' requires input 'nonexistent_data' but no task produces it" -``` - -**Validations performed**: -- All declared inputs must be produced by some task (or be external) -- No circular dependencies (cycles) -- No duplicate output names - ---- - -## Example 4: Execution with ExecutionEngine - -```python -async def example_execution_with_engine(): - registry = DataflowRegistry() - registry.register_task(load_data, output="raw_data", inputs=["source"]) - registry.register_task(clean_data, output="cleaned_data", inputs=["raw_data"]) - registry.register_task(extract_features, output="features", inputs=["cleaned_data"]) - registry.register_task(generate_report, output="report", inputs=["features"]) - - dag = registry.build_dag() - - engine = ExecutionEngine( - 
broker=broker, - dag=dag, - fail_fast=True, - max_parallel=4, - ) - - results = await engine.execute( - inputs={"source": "local://data/file.csv"}, - pipeline_id="manual_pipeline_example", - ) - - # results = {"raw_data": ..., "cleaned_data": ..., "features": ..., "report": ...} -``` - -The `ExecutionEngine` is the low-level executor that runs a DAG. - ---- - -## Example 5: Manual Step-by-Step Execution with DataCache - -Shows the internal execution loop: - -```python -async def example_manual_execution_with_cache(): - registry = DataflowRegistry() - # register tasks... - dag = registry.build_dag() - - cache = DataCache() - - # Initialize external inputs - cache.set("source", "local://data/file.csv") - - completed_nodes = set() - - while True: - ready = dag.get_ready_tasks(completed_nodes) - if not ready: - break - - for node in ready: - task = node.task - deps = registry.get_data_dependencies(task) - - # Inject dependencies from cache - args = cache.inject(deps) # {'raw_data': {...}, ...} - - # Execute task - result = await task.kiq(**args) - output_value = (await result.wait_result()).return_value - - # Store output in cache - output_name = registry.get_task_metadata(task)["output"] - cache.set(output_name, output_value) - - completed_nodes.add(node) - - # Final outputs in cache - final_report = cache.get("report") -``` - ---- - -## Why This Matters - -Understanding `DataflowRegistry` helps you: - -1. **Debug complex pipelines** — Inspect DAG before running -2. **Build dynamic pipelines** — Construct pipelines at runtime based on config -3. **Implement custom orchestration** — Use `ExecutionEngine` directly -4. **Understand data provenance** — Trace where each output came from - ---- - -## Learning Path - -After this example: - -1. **[Dataflow Guide]({{ '/en/guides/pipelines.md#2-dataflow-pipeline' | relative_url }})** — High-level usage -2. **[ExecutionEngine API]({{ '/en/api/execution/' | relative_url }})** — Low-level execution control -3. 
**[DAGBuilder]({{ '/en/api/execution.md#dagbuilder' | relative_url }})** — Programmatic DAG construction - ---- - -*Advanced topic. Most users will use `DataflowPipeline.from_tasks()` which wraps this registry internally. Explore this only if you need dynamic pipeline construction.* + +--- + +## Overview + +This advanced example demonstrates the internals of Taskiq-Flow's automatic dependency resolution system using `DataflowRegistry`. It shows how to: + +- Manually register tasks with their I/O declarations +- Inspect the dataflow graph before execution +- Build and validate a DAG +- Execute pipelines using `ExecutionEngine` directly +- Understand data provenance and task dependencies + +**This is the core mechanism behind `DataflowPipeline.from_tasks()`.** + +--- + +## What This Example Shows + +- Complete `DataflowRegistry` API usage +- Manual DAG construction from task metadata +- Querying task dependencies (producers/consumers) +- Topological sorting and parallel level detection +- Direct `ExecutionEngine` execution +- The `DataCache` for manual step-by-step execution +- Error detection (missing dependencies, cycles) + +--- + +## Code Walkthrough + +### Tasks Definition (same as dataflow_audio style) + +```python +from taskiq_flow.dataflow.registry import DataflowRegistry +from taskiq_flow.execution_engine import ExecutionEngine +from taskiq_flow.dataflow.cache import DataCache +from taskiq_flow.visualization import DAGVisualizer + +@broker.task +@pipeline_task(output="raw_data") +async def load_data(source: str) -> dict: + return {"source": source, "records": [...]} + +@broker.task +@pipeline_task(output="cleaned_data") +async def clean_data(raw_data: dict) -> dict: + records = [r for r in raw_data["records"] if r["value"] > 0] + return {"source": raw_data["source"], "records": records} + +@broker.task +@pipeline_task(output="features") +async def extract_features(cleaned_data:dict) -> dict: + total = sum(r["value"] for r in cleaned_data["records"]) + return 
{"total": total, "count": len(cleaned_data["records"])} + +@broker.task +@pipeline_task(output="report") +async def generate_report(features: dict) -> dict: + return {"report_id": "RPT-001", "summary": features} +``` + +--- + +## Example 1: Manual Registry Construction & Inspection + +```python +async def example_manual_registry(): + registry = DataflowRegistry() + + # Register tasks manually + registry.register_task(load_data, output="raw_data", inputs=["source"]) + registry.register_task(clean_data, output="cleaned_data", inputs=["raw_data"]) + registry.register_task(extract_features, output="features", inputs=["cleaned_data"]) + registry.register_task(generate_report, output="report", inputs=["features"]) + + # Inspect the registry + print(f"Tasks: {[t.task_name for t in registry.get_tasks()]}") + # ['load_data', 'clean_data', 'extract_features', 'generate_report'] + + # Query dependencies + deps = registry.get_data_dependencies(generate_report) + print(f"generate_report depends on: {deps}") # ['features'] + + # Find who produces 'features' + producer = registry.get_producer("features") + print(f"'features' produced by: {producer.task_name}") # extract_features + + # Find who consumes 'raw_data' + consumers = registry.get_consumers("raw_data") + print(f"'raw_data' consumed by: {[c.task_name for c in consumers]}") # [clean_data] + + # External inputs (not produced by any task) + external = registry.get_external_inputs() + print(f"External inputs: {external}") # ['source'] + + # Outputs (final results) + outputs = registry.get_outputs() + print(f"Pipeline outputs: {outputs}") # ['raw_data', 'cleaned_data', 'features', 'report'] +``` + +**Key methods**: + +| Method | Returns | +|--------|---------| +| `get_tasks()` | All registered `TaskNode` objects | +| `get_outputs()` | All output keys | +| `get_external_inputs()` | Inputs not produced by any task | +| `get_producer(output_key)` | Task that produces that output | +| `get_consumers(input_key)` | Tasks needing 
that input | +| `get_data_dependencies(task)` | List of input keys for a task | + +--- + +## Example 2: Building and Visualizing the DAG + +```python + # Build DAG + dag = registry.build_dag() + + print(f"DAG: {len(dag.nodes)} nodes, {len(dag.edges)} edges") + + # Execution order (topological sort) + order = dag.topological_sort() + for i, node in enumerate(order): + print(f"{i+1}. {node.task_name}") + + # Parallel execution levels + for level_idx, level_nodes in enumerate(dag.levels): + tasks = [n.task_name for n in level_nodes] + print(f"Level {level_idx}: {tasks}") + + # ASCII visualization + dag.print() + + # DOT format + dot = DAGVisualizer.to_dot(dag) + with open("pipeline.dot", "w") as f: + f.write(dot) +``` + +**DAG properties**: +- `dag.nodes` — All nodes +- `dag.edges` — Dependency edges +- `dag.roots` — Nodes with no dependencies +- `dag.leaves` — Nodes with no dependents +- `dag.levels` — Groups of tasks that can run in parallel +- `dag.topological_sort()` — Linear execution order + +--- + +## Example 3: Validation & Error Detection + +```python +async def example_validation(): + registry = DataflowRegistry() + registry.register_task(load_data, output="raw_data", inputs=["source"]) + + # Broken: depends on nonexistent output + @broker.task + @pipeline_task(output="result") + async def broken_task(nonexistent_data: dict): + return {"result": "broken"} + + registry.register_task(broken_task, output="result", inputs=["nonexistent_data"]) + + try: + dag = registry.build_dag() # Raises ValueError + except ValueError as e: + print(f"Caught expected error: {e}") + # "Task 'broken_task' requires input 'nonexistent_data' but no task produces it" +``` + +**Validations performed**: +- All declared inputs must be produced by some task (or be external) +- No circular dependencies (cycles) +- No duplicate output names + +--- + +## Example 4: Execution with ExecutionEngine + +```python +async def example_execution_with_engine(): + registry = DataflowRegistry() + 
registry.register_task(load_data, output="raw_data", inputs=["source"]) + registry.register_task(clean_data, output="cleaned_data", inputs=["raw_data"]) + registry.register_task(extract_features, output="features", inputs=["cleaned_data"]) + registry.register_task(generate_report, output="report", inputs=["features"]) + + dag = registry.build_dag() + + engine = ExecutionEngine( + broker=broker, + dag=dag, + fail_fast=True, + max_parallel=4, + ) + + results = await engine.execute( + inputs={"source": "local://data/file.csv"}, + pipeline_id="manual_pipeline_example", + ) + + # results = {"raw_data": ..., "cleaned_data": ..., "features": ..., "report": ...} +``` + +The `ExecutionEngine` is the low-level executor that runs a DAG. + +--- + +## Example 5: Manual Step-by-Step Execution with DataCache + +Shows the internal execution loop: + +```python +async def example_manual_execution_with_cache(): + registry = DataflowRegistry() + # register tasks... + dag = registry.build_dag() + + cache = DataCache() + + # Initialize external inputs + cache.set("source", "local://data/file.csv") + + completed_nodes = set() + + while True: + ready = dag.get_ready_tasks(completed_nodes) + if not ready: + break + + for node in ready: + task = node.task + deps = registry.get_data_dependencies(task) + + # Inject dependencies from cache + args = cache.inject(deps) # {'raw_data': {...}, ...} + + # Execute task + result = await task.kiq(**args) + output_value = (await result.wait_result()).return_value + + # Store output in cache + output_name = registry.get_task_metadata(task)["output"] + cache.set(output_name, output_value) + + completed_nodes.add(node) + + # Final outputs in cache + final_report = cache.get("report") +``` + +--- + +## Why This Matters + +Understanding `DataflowRegistry` helps you: + +1. **Debug complex pipelines** — Inspect DAG before running +2. **Build dynamic pipelines** — Construct pipelines at runtime based on config +3. 
**Implement custom orchestration** — Use `ExecutionEngine` directly +4. **Understand data provenance** — Trace where each output came from + +--- + +## Learning Path + +After this example: + +1. **[Dataflow Guide]({{ '/en/guides/pipelines.md#2-dataflow-pipeline' | relative_url }})** — High-level usage +2. **[ExecutionEngine API]({{ '/en/api/execution/' | relative_url }})** — Low-level execution control +3. **[DAGBuilder]({{ '/en/api/execution.md#dagbuilder' | relative_url }})** — Programmatic DAG construction + +--- + +*Advanced topic. Most users will use `DataflowPipeline.from_tasks()` which wraps this registry internally. Explore this only if you need dynamic pipeline construction.* diff --git a/docs/_en/examples/resource_aware_demo.md b/docs/_en/examples/resource_aware_demo.md index 0c25b30..f69e447 100644 --- a/docs/_en/examples/resource_aware_demo.md +++ b/docs/_en/examples/resource_aware_demo.md @@ -1,248 +1,254 @@ ---- -permalink: /en/examples/resource-aware-demo/ -title: Example: resource_aware_demo.py -nav_order: 49 -color_scheme: dark ---- -# Example: resource_aware_demo.py - -**Dynamic parallelism based on CPU/memory** - -> **Version**: {VERSION} | **File**: `examples/resource_aware_demo.py` - ---- - -## Overview - -This example demonstrates the `ResourceAwareExecutor` and `TaskResourceProfile` features introduced in v0.4.5. It shows how to: - -- Annotate tasks with resource requirements (CPU, memory, I/O vs CPU-bound) -- Compute optimal parallelism based on current system resources -- Adjust `max_parallel` dynamically to avoid overloading the host -- Apply different parallelism strategies for I/O-bound vs CPU-bound tasks - ---- - -## What This Example Shows - -- Defining `TaskResourceProfile` for tasks -- Creating a `ResourceAwareExecutor` with system limits -- Querying `get_optimal_parallelism()` for task types -- Using resource profiles in DataflowPipeline -- Manual parallelism tuning guidelines - ---- - -## Code Walkthrough - -### 1. 
Resource-Aware Executor Setup - -```python -from taskiq_flow.optimization import ResourceAwareExecutor, TaskResourceProfile - -executor = ResourceAwareExecutor( - max_cpu_percent=80.0, # Don't exceed 80% CPU usage - max_memory_percent=80.0, # Don't exceed 80% RAM - min_parallel=1, - max_parallel=20, -) - -# Query optimal parallelism for a given task resource estimate -optimal_light = executor.get_optimal_parallelism( - task_memory_estimate=50, # 50 MB per task - task_cpu_estimate=0.2, # 0.2 cores per task -) -print(f"Optimal for light tasks: {optimal_light}") - -optimal_heavy = executor.get_optimal_parallelism( - task_memory_estimate=200, # 200 MB per task - task_cpu_estimate=1.0, # 1.0 core per task -) -print(f"Optimal for heavy tasks: {optimal_heavy}") -``` - -The executor queries current system load (via `psutil`) and computes how many tasks of the given profile can run in parallel without exceeding the configured limits. - ---- - -### 2. Annotating Tasks with Resource Profiles - -```python -@broker.task -@pipeline_task( - output="light_result", - resources=TaskResourceProfile( - estimated_memory_mb=50, - estimated_cpu_cores=0.2, - io_bound=True, - ), -) -async def light_task(item: int) -> dict: - await asyncio.sleep(0.1) # Simulate I/O - return {"item": item, "result": item * 2} - -@broker.task -@pipeline_task( - output="heavy_result", - resources=TaskResourceProfile( - estimated_memory_mb=200, - estimated_cpu_cores=1.0, - io_bound=False, - ), -) -async def heavy_task(item: int) -> dict: - total = 0 - for _ in range(100000): - total += item * 2 - return {"item": item, "result": total} -``` - -**ResourceProfile fields:** - -- `estimated_memory_mb`: Expected memory usage per task instance -- `estimated_cpu_cores`: CPU cores required (0.5 = half a core) -- `io_bound`: True for I/O-heavy tasks (network, disk), False for CPU-heavy - ---- - -### 3. Using Resource Profiles in Pipelines - -The `DataflowPipeline`'s `max_parallel` parameter acts as an upper bound. 
The `ResourceAwareExecutor` can be used to compute a dynamic `max_parallel` before launching: - -```python -# Compute optimal parallelism for current system state -current_parallel = executor.get_optimal_parallelism( - task_memory_estimate=50, - task_cpu_estimate=0.2, -) - -pipeline = DataflowPipeline(broker, max_parallel=current_parallel) -pipeline.map(light_task, items=list(range(20)), output="light_results") -results = await pipeline.kiq_dataflow() -``` - -For mixed workloads, sum resource usage across parallel tasks. - ---- - -### 4. Manual Parallelism Tuning Guidelines - -```python -import psutil - -cpu_count = psutil.cpu_count() or 4 -memory_gb = psutil.virtual_memory().total / (1024 ** 3) - -# I/O-bound tasks: can oversubscribe CPU (they spend time waiting) -io_parallel = min(50, cpu_count * 5) - -# CPU-bound tasks: limit to available cores ± a small buffer -cpu_parallel = min(cpu_count + 2, 20) - -print(f"Recommended max_parallel for I/O-bound: {io_parallel}") -print(f"Recommended max_parallel for CPU-bound: {cpu_parallel}") -``` - -Start conservative, benchmark, and adjust. - ---- - -## Expected Output - -``` -=== Resource-Aware Parallelism Demo === - -Current system state: - CPU Usage: ? (will query at runtime) - Memory: ? (will query at runtime) - ---- Light tasks (I/O bound) --- - Optimal parallelism for light tasks: 25 - ---- Heavy tasks (CPU bound) --- - Optimal parallelism for heavy tasks: 4 - -Note: Actual values depend on current system load. - - -=== Pipeline with Resource-Aware Execution === - -Pipeline structure: - [items:20] --light_task--> [light_results] - [items:10] --heavy_task--> [heavy_results] - [light_results, heavy_results] --combine--> [final] - -Executing pipeline... - Pipeline completed: {'light_results': [...], 'heavy_results': [...], 'final': {...}} - -TaskResourceProfile allows you to annotate tasks with resource requirements. -ResourceAwareExecutor uses these profiles to compute optimal parallelism. 
- - -=== Manual Parallelism Tuning === - -System: 8 CPU cores, 15.6 GB RAM -Recommended max_parallel for I/O-bound tasks: 40 -Recommended max_parallel for CPU-bound tasks: 10 -Start with conservative values and benchmark: - pipeline.map(light_task, items, max_parallel=10) - pipeline.map(heavy_task, items, max_parallel=cpu_count) - - -=== Resource-Aware Demo Complete === - -Key takeaways: -1. Use TaskResourceProfile to annotate task resource needs -2. ResourceAwareExecutor computes optimal parallelism at runtime -3. Adjust max_parallel based on task type (I/O vs CPU) -4. Monitor system resources and tune accordingly -``` - ---- - -## Key Points - -### Why Resource-Aware Parallelism? - -Without resource awareness, setting `max_parallel` too high can: -- Exhaust memory → OOM kills -- Saturate CPU → tasks thrash, overall slowdown -- Starve other services on the same host - -`ResourceAwareExecutor` prevents this by querying current system usage and computing safe parallelism levels. - -### Best Practices - -1. **Profile your tasks**: Measure actual memory/CPU usage in production -2. **Set conservative defaults**: Start with `max_parallel=5` and increase -3. **Monitor**: Watch system metrics while pipelines run -4. **Tune per task type**: I/O-bound tasks can be more parallel than CPU-bound -5. **Use `min_parallel` and `max_parallel` bounds**: `ResourceAwareExecutor` respects these - -### Integration with Monitoring - -Combine with Prometheus metrics: - -```python -from taskiq_flow.metrics import MetricsMiddleware -broker.add_middlewares(MetricsMiddleware()) -``` - -Track: -- `taskiq_flow_worker_cpu_usage_percent` -- `taskiq_flow_worker_memory_usage_bytes` -- `taskiq_flow_pipeline_executions_total` - ---- - -## Learning Path - -After this example: - -1. **[Performance Guide]({{ '/en/guides/performance/' | relative_url }})** — Resource-aware parallelism deep dive -2. 
**[Optimization Module API]({{ '/en/api/optimization/' | relative_url }})** — Full `ResourceAwareExecutor` reference -3. **[Tracking Guide]({{ '/en/guides/tracking/' | relative_url }})** — Monitor resource usage over time - ---- - -*This example keeps your pipelines from overwhelming the host. Always test resource profiles with realistic data volumes.* +--- +permalink: /en/examples/resource-aware-demo/ +title: 'Example: resource_aware_demo.py' +nav_order: 49 +color_scheme: dark +--- +# Example: resource_aware_demo.py + +**Dynamic parallelism based on CPU/memory** + +> **Version**: {VERSION} | **File**: `examples/resource_aware_demo.py` + +--- + +## Overview + +This example demonstrates the `ResourceAwareExecutor` and `TaskResourceProfile` features introduced in v0.4.5. It shows how to: + +- Annotate tasks with resource requirements (CPU, memory, I/O vs CPU-bound) +- Compute optimal parallelism based on current system resources +- Adjust `max_parallel` dynamically to avoid overloading the host +- Apply different parallelism strategies for I/O-bound vs CPU-bound tasks + +--- + +## What This Example Shows + +- Defining `TaskResourceProfile` for tasks +- Creating a `ResourceAwareExecutor` with system limits +- Querying `get_optimal_parallelism()` for task types +- Using resource profiles in DataflowPipeline +- Manual parallelism tuning guidelines + +--- + +## Code Walkthrough + +### 1. 
Resource-Aware Executor Setup + +{% raw %} +```python +from taskiq_flow.optimization import ResourceAwareExecutor, TaskResourceProfile + +executor = ResourceAwareExecutor( + max_cpu_percent=80.0, # Don't exceed 80% CPU usage + max_memory_percent=80.0, # Don't exceed 80% RAM + min_parallel=1, + max_parallel=20, +) + +# Query optimal parallelism for a given task resource estimate +optimal_light = executor.get_optimal_parallelism( + task_memory_estimate=50, # 50 MB per task + task_cpu_estimate=0.2, # 0.2 cores per task +) +print(f"Optimal for light tasks: {optimal_light}") + +optimal_heavy = executor.get_optimal_parallelism( + task_memory_estimate=200, # 200 MB per task + task_cpu_estimate=1.0, # 1.0 core per task +) +print(f"Optimal for heavy tasks: {optimal_heavy}") +``` +{% endraw %} +The executor queries current system load (via `psutil`) and computes how many tasks of the given profile can run in parallel without exceeding the configured limits. + +--- + +### 2. Annotating Tasks with Resource Profiles + +{% raw %} +```python +@broker.task +@pipeline_task( + output="light_result", + resources=TaskResourceProfile( + estimated_memory_mb=50, + estimated_cpu_cores=0.2, + io_bound=True, + ), +) +async def light_task(item: int) -> dict: + await asyncio.sleep(0.1) # Simulate I/O + return {"item": item, "result": item * 2} + +@broker.task +@pipeline_task( + output="heavy_result", + resources=TaskResourceProfile( + estimated_memory_mb=200, + estimated_cpu_cores=1.0, + io_bound=False, + ), +) +async def heavy_task(item: int) -> dict: + total = 0 + for _ in range(100000): + total += item * 2 + return {"item": item, "result": total} +``` +{% endraw %} +**ResourceProfile fields:** + +- `estimated_memory_mb`: Expected memory usage per task instance +- `estimated_cpu_cores`: CPU cores required (0.5 = half a core) +- `io_bound`: True for I/O-heavy tasks (network, disk), False for CPU-heavy + +--- + +### 3. 
Using Resource Profiles in Pipelines + +The `DataflowPipeline`'s `max_parallel` parameter acts as an upper bound. The `ResourceAwareExecutor` can be used to compute a dynamic `max_parallel` before launching: + +{% raw %} +```python +# Compute optimal parallelism for current system state +current_parallel = executor.get_optimal_parallelism( + task_memory_estimate=50, + task_cpu_estimate=0.2, +) + +pipeline = DataflowPipeline(broker, max_parallel=current_parallel) +pipeline.map(light_task, items=list(range(20)), output="light_results") +results = await pipeline.kiq_dataflow() +``` +{% endraw %} +For mixed workloads, sum resource usage across parallel tasks. + +--- + +### 4. Manual Parallelism Tuning Guidelines + +{% raw %} +```python +import psutil + +cpu_count = psutil.cpu_count() or 4 +memory_gb = psutil.virtual_memory().total / (1024 ** 3) + +# I/O-bound tasks: can oversubscribe CPU (they spend time waiting) +io_parallel = min(50, cpu_count * 5) + +# CPU-bound tasks: limit to available cores ± a small buffer +cpu_parallel = min(cpu_count + 2, 20) + +print(f"Recommended max_parallel for I/O-bound: {io_parallel}") +print(f"Recommended max_parallel for CPU-bound: {cpu_parallel}") +``` +{% endraw %} +Start conservative, benchmark, and adjust. + +--- + +## Expected Output + +{% raw %} +``` +=== Resource-Aware Parallelism Demo === + +Current system state: + CPU Usage: ? (will query at runtime) + Memory: ? (will query at runtime) + +--- Light tasks (I/O bound) --- + Optimal parallelism for light tasks: 25 + +--- Heavy tasks (CPU bound) --- + Optimal parallelism for heavy tasks: 4 + +Note: Actual values depend on current system load. + + +=== Pipeline with Resource-Aware Execution === + +Pipeline structure: + [items:20] --light_task--> [light_results] + [items:10] --heavy_task--> [heavy_results] + [light_results, heavy_results] --combine--> [final] + +Executing pipeline... 
+ Pipeline completed: {'light_results': [...], 'heavy_results': [...], 'final': {...}} + +TaskResourceProfile allows you to annotate tasks with resource requirements. +ResourceAwareExecutor uses these profiles to compute optimal parallelism. + + +=== Manual Parallelism Tuning === + +System: 8 CPU cores, 15.6 GB RAM +Recommended max_parallel for I/O-bound tasks: 40 +Recommended max_parallel for CPU-bound tasks: 10 +Start with conservative values and benchmark: + pipeline.map(light_task, items, max_parallel=10) + pipeline.map(heavy_task, items, max_parallel=cpu_count) + + +=== Resource-Aware Demo Complete === + +Key takeaways: +1. Use TaskResourceProfile to annotate task resource needs +2. ResourceAwareExecutor computes optimal parallelism at runtime +3. Adjust max_parallel based on task type (I/O vs CPU) +4. Monitor system resources and tune accordingly +``` +{% endraw %} +--- + +## Key Points + +### Why Resource-Aware Parallelism? + +Without resource awareness, setting `max_parallel` too high can: +- Exhaust memory → OOM kills +- Saturate CPU → tasks thrash, overall slowdown +- Starve other services on the same host + +`ResourceAwareExecutor` prevents this by querying current system usage and computing safe parallelism levels. + +### Best Practices + +1. **Profile your tasks**: Measure actual memory/CPU usage in production +2. **Set conservative defaults**: Start with `max_parallel=5` and increase +3. **Monitor**: Watch system metrics while pipelines run +4. **Tune per task type**: I/O-bound tasks can be more parallel than CPU-bound +5. 
**Use `min_parallel` and `max_parallel` bounds**: `ResourceAwareExecutor` respects these + +### Integration with Monitoring + +Combine with Prometheus metrics: + +{% raw %} +```python +from taskiq_flow.metrics import MetricsMiddleware +broker.add_middlewares(MetricsMiddleware()) +``` +{% endraw %} +Track: +- `taskiq_flow_worker_cpu_usage_percent` +- `taskiq_flow_worker_memory_usage_bytes` +- `taskiq_flow_pipeline_executions_total` + +--- + +## Learning Path + +After this example: + +1. **[Performance Guide]({{ '/en/guides/performance/' | relative_url }})** — Resource-aware parallelism deep dive +2. **[Optimization Module API]({{ '/en/api/optimization/' | relative_url }})** — Full `ResourceAwareExecutor` reference +3. **[Tracking Guide]({{ '/en/guides/tracking/' | relative_url }})** — Monitor resource usage over time + +--- + +*This example keeps your pipelines from overwhelming the host. Always test resource profiles with realistic data volumes.* diff --git a/docs/_en/examples/retry_demo.md b/docs/_en/examples/retry_demo.md index f1d1496..17c0e6d 100644 --- a/docs/_en/examples/retry_demo.md +++ b/docs/_en/examples/retry_demo.md @@ -1,252 +1,252 @@ ---- -permalink: /en/examples/retry-demo/ -title: Example: retry_demo.py -nav_order: 48 -color_scheme: dark ---- -# Example: retry_demo.py - -**Retry middleware and error handling modes** - -> **Version**: {VERSION} | **File**: `examples/retry_demo.py` - ---- - -## Overview - -This example demonstrates Taskiq-Flow v0.4.5's robust retry and error handling mechanisms. 
It covers: - -- `PipelineRetryMiddleware` with exponential backoff and jitter -- `ErrorHandlingMode` strategies (FAIL_FAST, CONTINUE_ON_ERROR, SKIP_FAILED, DEAD_LETTER) -- `PipelineErrorAggregator` for collecting and analyzing failures -- Configuring retry policies per pipeline - ---- - -## What This Example Shows - -- Adding retry middleware to a broker -- Automatic retry with backoff for transient failures -- Switching between error handling modes -- Aggregating errors for post-mortem analysis -- Distinguishing retryable vs non-retryable failures - ---- - -## Code Walkthrough - -### 1. Retry Middleware - -```python -from taskiq_flow.middlewares.retry import PipelineRetryMiddleware - -retry_mw = PipelineRetryMiddleware( - max_retries=3, - delay=0.5, - backoff=2.0, - jitter=True, -) -broker.add_middlewares(retry_mw) -``` - -**Parameters:** -- `max_retries`: Maximum retry attempts (3 → total of 4 tries) -- `delay`: Initial delay before first retry (0.5s) -- `backoff`: Multiplier for delay on each retry (2.0 → 0.5s, 1s, 2s) -- `jitter`: Add random variation to avoid thundering herd - ---- - -### 2. Flaky Task Demo - -```python -import random - -@broker.task -async def flaky_task(attempt: int = 0) -> str: - """Fails randomly, then eventually succeeds.""" - attempt += 1 - if random.random() < 0.7 and attempt < 3: - raise RuntimeError(f"Task failed on attempt {attempt}") - return f"Success on attempt {attempt}" -``` - -```python -async def demo_retry_middleware(): - pipeline = Pipeline(broker).call_next(flaky_task) - task = await pipeline.kiq(0) - result = await task.wait_result(timeout=10) - print(f"Pipeline succeeded! Result: {result.return_value}") - print(f"Retry count: {retry_mw.retry_counts}") -``` - -Output: - -``` -Pipeline succeeded! Result: Success on attempt 2 -Retry count: {'flaky_task': 1} -``` - -The middleware automatically retries the task once before success. - ---- - -### 3. 
Error Handling Modes - -```python -from taskiq_flow.errors import ErrorHandlingMode -from taskiq_flow.execution_engine import ExecutionEngine -from taskiq_flow.dataflow.registry import DataflowRegistry - -registry = DataflowRegistry() -registry.register_task(flaky_task, output="flaky_output", inputs=[]) -registry.register_task(process_result, output="final", inputs=["flaky_output"]) -dag = registry.build_dag() -``` - -#### FAIL_FAST (default) - -```python -engine = ExecutionEngine(broker, dag, error_mode=ErrorHandlingMode.FAIL_FAST) -# Stops immediately on first error; pipeline fails -``` - -#### CONTINUE_ON_ERROR - -```python -engine = ExecutionEngine(broker, dag, error_mode=ErrorHandlingMode.CONTINUE_ON_ERROR) -# Marks failed task as FAILED but continues with downstream tasks that don't depend on it -``` - -#### SKIP_FAILED - -```python -engine = ExecutionEngine(broker, dag, error_mode=ErrorHandlingMode.SKIP_FAILED) -# Failed tasks are skipped; downstream tasks receive default values (None) for failed inputs -``` - -#### DEAD_LETTER - -```python -engine = ExecutionEngine(broker, dag, error_mode=ErrorHandlingMode.DEAD_LETTER) -# Failed tasks are queued for later retry via a dead-letter queue -``` - ---- - -### 4. Error Aggregation - -```python -from taskiq_flow.errors import PipelineErrorAggregator - -aggregator = PipelineErrorAggregator() - -# During/after execution, errors are collected: -aggregator.add_error(task=failed_task, error=exc, context={...}) - -# Later, analyze: -print(f"Total errors: {len(aggregator.errors)}") -print(f"Failed tasks: {aggregator.failed_tasks}") -print(f"Skipped tasks: {aggregator.skipped_tasks}") - -for err in aggregator.errors: - print(f" {err.task_name}: {type(err.error).__name__}: {err.error}") -``` - -Useful for generating error reports and alerting. - ---- - -## Expected Output - -Running `python examples/retry_demo.py`: - -``` -=== Demo 1: Retry Middleware === - -Executing flaky task with retry middleware... 
-(Task may fail 1-2 times before succeeding) - - Pipeline succeeded! Result: Success on attempt 2 - -Retry count stored in middleware: {'flaky_task': 1} - - -=== Demo 2: Error Handling Modes === - ---- Mode: FAIL_FAST --- - Execution raised: RuntimeError: Task failed on attempt 3 - ---- Mode: CONTINUE_ON_ERROR --- - Execution completed. Results: ['flaky_output'] - ---- Mode: SKIP_FAILED --- - Execution completed. Results: ['flaky_output'] - -Note: ErrorHandlingMode.DEAD_LETTER would queue failures for later retry. - - -=== Demo 3: Error Aggregation === - -Total errors collected: 3 -Failed tasks: ['task_a', 'task_b', 'task_c'] - -Error details: - - task_a: RuntimeError: timeout - - task_b: ValueError: invalid data - - task_c: ConnectionError: network down - -You can use PipelineErrorAggregator to analyze failures and affected branches. - - -=== All Retry & Error Handling Demos Complete === -``` - ---- - -## Key Points - -### When to Use Which Error Mode - -| Mode | Best for | Behavior | -|------|----------|----------| -| `FAIL_FAST` | Critical pipelines where any failure invalidates the whole run | Immediate halt | -| `CONTINUE_ON_ERROR` | Best-effort analysis where partial results are valuable | Continue; mark failures | -| `SKIP_FAILED` | Data processing where missing inputs can be tolerated | Provide None defaults | -| `DEAD_LETTER` | Systems requiring manual intervention or re-play | Queue for later retry | - -### Retry Strategies - -- **Transient failures** (network timeouts, temporary resource exhaustion) → Use `PipelineRetryMiddleware` -- **Permanent failures** (invalid data, code bugs) → Use `FAIL_FAST` or `SKIP_FAILED` depending on tolerance -- **Mixed workloads** → Combine retry middleware (for transient) with error modes (for permanent) - -### Monitoring Retries - -Track retry counts in metrics or logs: - -```python -for task_name, count in retry_mw.retry_counts.items(): - logger.info(f"Task {task_name} retried {count} times") -``` - -Integrate with 
Prometheus: - -```python -from taskiq_flow.metrics import MetricsMiddleware -broker.add_middlewares(MetricsMiddleware()) -``` - ---- - -## Learning Path - -After this example: - -1. **[Retry Guide]({{ '/en/guides/retry/' | relative_url }})** — Complete retry & error handling documentation -2. **[Execution Guide]({{ '/en/guides/execution/' | relative_url }})** — Execution engine internals -3. **[Monitoring Guide]({{ '/en/guides/tracking/' | relative_url }})** — Track failed tasks and retries in production - ---- - -*This example shows all retry patterns. In production, tune retry parameters (max_retries, backoff) based on task characteristics and SLA requirements.* +--- +permalink: /en/examples/retry-demo/ +title: 'Example: retry_demo.py' +nav_order: 48 +color_scheme: dark +--- +# Example: retry_demo.py + +**Retry middleware and error handling modes** + +> **Version**: {VERSION} | **File**: `examples/retry_demo.py` + +--- + +## Overview + +This example demonstrates Taskiq-Flow v0.4.5's robust retry and error handling mechanisms. It covers: + +- `PipelineRetryMiddleware` with exponential backoff and jitter +- `ErrorHandlingMode` strategies (FAIL_FAST, CONTINUE_ON_ERROR, SKIP_FAILED, DEAD_LETTER) +- `PipelineErrorAggregator` for collecting and analyzing failures +- Configuring retry policies per pipeline + +--- + +## What This Example Shows + +- Adding retry middleware to a broker +- Automatic retry with backoff for transient failures +- Switching between error handling modes +- Aggregating errors for post-mortem analysis +- Distinguishing retryable vs non-retryable failures + +--- + +## Code Walkthrough + +### 1. 
Retry Middleware + +```python +from taskiq_flow.middlewares.retry import PipelineRetryMiddleware + +retry_mw = PipelineRetryMiddleware( + max_retries=3, + delay=0.5, + backoff=2.0, + jitter=True, +) +broker.add_middlewares(retry_mw) +``` + +**Parameters:** +- `max_retries`: Maximum retry attempts (3 → total of 4 tries) +- `delay`: Initial delay before first retry (0.5s) +- `backoff`: Multiplier for delay on each retry (2.0 → 0.5s, 1s, 2s) +- `jitter`: Add random variation to avoid thundering herd + +--- + +### 2. Flaky Task Demo + +```python +import random + +@broker.task +async def flaky_task(attempt: int = 0) -> str: + """Fails randomly, then eventually succeeds.""" + attempt += 1 + if random.random() < 0.7 and attempt < 3: + raise RuntimeError(f"Task failed on attempt {attempt}") + return f"Success on attempt {attempt}" +``` + +```python +async def demo_retry_middleware(): + pipeline = Pipeline(broker).call_next(flaky_task) + task = await pipeline.kiq(0) + result = await task.wait_result(timeout=10) + print(f"Pipeline succeeded! Result: {result.return_value}") + print(f"Retry count: {retry_mw.retry_counts}") +``` + +Output: + +``` +Pipeline succeeded! Result: Success on attempt 2 +Retry count: {'flaky_task': 1} +``` + +The middleware automatically retries the task once before success. + +--- + +### 3. 
Error Handling Modes + +```python +from taskiq_flow.errors import ErrorHandlingMode +from taskiq_flow.execution_engine import ExecutionEngine +from taskiq_flow.dataflow.registry import DataflowRegistry + +registry = DataflowRegistry() +registry.register_task(flaky_task, output="flaky_output", inputs=[]) +registry.register_task(process_result, output="final", inputs=["flaky_output"]) +dag = registry.build_dag() +``` + +#### FAIL_FAST (default) + +```python +engine = ExecutionEngine(broker, dag, error_mode=ErrorHandlingMode.FAIL_FAST) +# Stops immediately on first error; pipeline fails +``` + +#### CONTINUE_ON_ERROR + +```python +engine = ExecutionEngine(broker, dag, error_mode=ErrorHandlingMode.CONTINUE_ON_ERROR) +# Marks failed task as FAILED but continues with downstream tasks that don't depend on it +``` + +#### SKIP_FAILED + +```python +engine = ExecutionEngine(broker, dag, error_mode=ErrorHandlingMode.SKIP_FAILED) +# Failed tasks are skipped; downstream tasks receive default values (None) for failed inputs +``` + +#### DEAD_LETTER + +```python +engine = ExecutionEngine(broker, dag, error_mode=ErrorHandlingMode.DEAD_LETTER) +# Failed tasks are queued for later retry via a dead-letter queue +``` + +--- + +### 4. Error Aggregation + +```python +from taskiq_flow.errors import PipelineErrorAggregator + +aggregator = PipelineErrorAggregator() + +# During/after execution, errors are collected: +aggregator.add_error(task=failed_task, error=exc, context={...}) + +# Later, analyze: +print(f"Total errors: {len(aggregator.errors)}") +print(f"Failed tasks: {aggregator.failed_tasks}") +print(f"Skipped tasks: {aggregator.skipped_tasks}") + +for err in aggregator.errors: + print(f" {err.task_name}: {type(err.error).__name__}: {err.error}") +``` + +Useful for generating error reports and alerting. + +--- + +## Expected Output + +Running `python examples/retry_demo.py`: + +``` +=== Demo 1: Retry Middleware === + +Executing flaky task with retry middleware... 
+(Task may fail 1-2 times before succeeding) + + Pipeline succeeded! Result: Success on attempt 2 + +Retry count stored in middleware: {'flaky_task': 1} + + +=== Demo 2: Error Handling Modes === + +--- Mode: FAIL_FAST --- + Execution raised: RuntimeError: Task failed on attempt 3 + +--- Mode: CONTINUE_ON_ERROR --- + Execution completed. Results: ['flaky_output'] + +--- Mode: SKIP_FAILED --- + Execution completed. Results: ['flaky_output'] + +Note: ErrorHandlingMode.DEAD_LETTER would queue failures for later retry. + + +=== Demo 3: Error Aggregation === + +Total errors collected: 3 +Failed tasks: ['task_a', 'task_b', 'task_c'] + +Error details: + - task_a: RuntimeError: timeout + - task_b: ValueError: invalid data + - task_c: ConnectionError: network down + +You can use PipelineErrorAggregator to analyze failures and affected branches. + + +=== All Retry & Error Handling Demos Complete === +``` + +--- + +## Key Points + +### When to Use Which Error Mode + +| Mode | Best for | Behavior | +|------|----------|----------| +| `FAIL_FAST` | Critical pipelines where any failure invalidates the whole run | Immediate halt | +| `CONTINUE_ON_ERROR` | Best-effort analysis where partial results are valuable | Continue; mark failures | +| `SKIP_FAILED` | Data processing where missing inputs can be tolerated | Provide None defaults | +| `DEAD_LETTER` | Systems requiring manual intervention or re-play | Queue for later retry | + +### Retry Strategies + +- **Transient failures** (network timeouts, temporary resource exhaustion) → Use `PipelineRetryMiddleware` +- **Permanent failures** (invalid data, code bugs) → Use `FAIL_FAST` or `SKIP_FAILED` depending on tolerance +- **Mixed workloads** → Combine retry middleware (for transient) with error modes (for permanent) + +### Monitoring Retries + +Track retry counts in metrics or logs: + +```python +for task_name, count in retry_mw.retry_counts.items(): + logger.info(f"Task {task_name} retried {count} times") +``` + +Integrate with 
Prometheus: + +```python +from taskiq_flow.metrics import MetricsMiddleware +broker.add_middlewares(MetricsMiddleware()) +``` + +--- + +## Learning Path + +After this example: + +1. **[Retry Guide]({{ '/en/guides/retry/' | relative_url }})** — Complete retry & error handling documentation +2. **[Execution Guide]({{ '/en/guides/execution/' | relative_url }})** — Execution engine internals +3. **[Monitoring Guide]({{ '/en/guides/tracking/' | relative_url }})** — Track failed tasks and retries in production + +--- + +*This example shows all retry patterns. In production, tune retry parameters (max_retries, backoff) based on task characteristics and SLA requirements.* diff --git a/docs/_en/examples/scheduled-pipeline.md b/docs/_en/examples/scheduled-pipeline.md index fbc85f7..63a2ae9 100644 --- a/docs/_en/examples/scheduled-pipeline.md +++ b/docs/_en/examples/scheduled-pipeline.md @@ -1,199 +1,199 @@ ---- -permalink: /en/examples/scheduled-pipeline/ -title: Example: scheduled_pipeline.py -nav_order: 44 -color_scheme: dark ---- -# Example: scheduled_pipeline.py - -**Scheduling pipelines with cron and interval triggers** - -> **Version**: {VERSION} | **File**: `examples/scheduled_pipeline.py` - ---- - -## Overview - -This example demonstrates how to schedule pipelines to run periodically using `LabelBasedScheduler`. It covers: - -- Cron-based scheduling (with second precision) -- Interval-based scheduling -- Listing and inspecting scheduled jobs - -**Note**: This example uses `LabelBasedScheduler`, which is TaskIQ's label-based scheduling mechanism. For production cron scheduling, consider `PipelineScheduler` with APScheduler integration. 
- ---- - -## What This Example Shows - -- Creating a simple pipeline -- Using `LabelBasedScheduler` to schedule pipeline runs -- Cron expressions with second-level precision -- Interval-based scheduling -- Listing active schedules - ---- - -## Code Walkthrough - -```python -import asyncio -from taskiq import InMemoryBroker -from taskiq_flow import Pipeline, PipelineMiddleware -from taskiq_flow.scheduling import LabelBasedScheduler - -# Create broker -broker = InMemoryBroker(await_inplace=True).with_middlewares(PipelineMiddleware()) - -# Define a simple task -@broker.task -async def log_message(msg: str) -> str: - """Log a message.""" - return f"Processed: {msg}" - -async def main(): - # Create pipeline - pipeline = Pipeline(broker).call_next(log_message) - - # Create scheduler - scheduler = LabelBasedScheduler(broker) - - # Schedule with cron expression (every 5 seconds) - schedule_id = await scheduler.schedule_with_cron( - pipeline=pipeline, - label="every-5-seconds", - cron="*/5 * * * * *", # 6-field cron for second precision - args=("Hello from scheduled pipeline!",), - ) - print(f"Scheduled with cron: {schedule_id}") - - # Schedule with interval (every 3 seconds) - interval_id = await scheduler.schedule_with_interval( - pipeline=pipeline, - label="every-3-seconds", - interval_seconds=3, - args=("Interval scheduled run!",), - ) - print(f"Scheduled with interval: {interval_id}") - - # Wait for some executions to complete - print("Waiting for pipeline executions (12 seconds)...") - await asyncio.sleep(12) - - # List scheduled jobs - schedules = scheduler.list_schedules() - print(f"Active schedules: {len(schedules)}") - for sched in schedules: - print(f" - {sched['label']}: cron={sched.get('cron')}, enabled={sched['enabled']}") - -asyncio.run(main()) -``` - ---- - -## Scheduling Methods - -### Cron Scheduling - -```python -schedule_id = await scheduler.schedule_with_cron( - pipeline=pipeline, - label="my-schedule", - cron="*/5 * * * * *", # Every 5 seconds (6-field 
cron) - args=("message",), -) -``` - -**6-field cron format**: `second minute hour day month day-of-week` - -Examples: -- `*/5 * * * * *` — Every 5 seconds -- `0 * * * * *` — Every minute at second 0 -- `0 0 * * * *` — Every hour at minute 0, second 0 - -### Interval Scheduling - -```python -interval_id = await scheduler.schedule_with_interval( - pipeline=pipeline, - label="interval-3s", - interval_seconds=3, - args=("message",), -) -``` - -Runs every N seconds, regardless of system time. - ---- - -## Expected Output - -``` -Scheduled with cron: schedule_123456 -Scheduled with interval: interval_789012 -Waiting for pipeline executions (12 seconds)... -INFO:root:Processed: Hello from scheduled pipeline! -INFO:root:Processed: Interval scheduled run! -INFO:root:Processed: Hello from scheduled pipeline! -INFO:root:Processed: Interval scheduled run! -... -Active schedules: 2 - - every-5-seconds: cron=*/5 * * * * *, enabled=True - - every-3-seconds: cron=None, enabled=True -``` - -You should see the log message printed multiple times as schedules trigger. - ---- - -## Key Points - -### Label-Based Scheduling - -- Each schedule requires a unique `label` (used for identification) -- Labels can be used to enable/disable schedules dynamically -- The scheduler manages schedule persistence based on your broker - -### InMemoryBroker Limitation - -With `InMemoryBroker`, schedules only work while the process is running; they are lost on restart. For persistent scheduling, use Redis-based brokers with proper schedule stores. - -### Multiple Schedules - -You can schedule the same pipeline multiple times with different labels, cron expressions, or arguments. 
- ---- - -## Variations - -### Custom Scheduling with PipelineScheduler - -For more advanced scheduling (timezones, misfire handling), use `PipelineScheduler`: - -```python -from taskiq_flow import PipelineScheduler - -scheduler = PipelineScheduler(broker) -job_id = await scheduler.schedule( - pipeline, - cron="0 9 * * *", # Daily at 9 AM - args=("daily",) -) -await scheduler.start() -``` - -See the [Scheduling Guide]({{ '/en/guides/scheduling/' | relative_url }}) for full details on `PipelineScheduler`. - ---- - -## Learning Path - -After this example: - -1. **[Scheduling Guide]({{ '/en/guides/scheduling/' | relative_url }})** — Comprehensive cron and interval scheduling -2. **[PipelineScheduler]({{ '/en/api/core.md#pipelinescheduler' | relative_url }})** — API reference -3. **[Retry Guide]({{ '/en/guides/retry/' | relative_url }})** — Handling failures in scheduled pipelines - ---- - -*This example shows label-based scheduling basics. For production use, explore PipelineScheduler with external job stores (PostgreSQL/Redis).* +--- +permalink: /en/examples/scheduled-pipeline/ +title: 'Example: scheduled_pipeline.py' +nav_order: 44 +color_scheme: dark +--- +# Example: scheduled_pipeline.py + +**Scheduling pipelines with cron and interval triggers** + +> **Version**: {VERSION} | **File**: `examples/scheduled_pipeline.py` + +--- + +## Overview + +This example demonstrates how to schedule pipelines to run periodically using `LabelBasedScheduler`. It covers: + +- Cron-based scheduling (with second precision) +- Interval-based scheduling +- Listing and inspecting scheduled jobs + +**Note**: This example uses `LabelBasedScheduler`, which is TaskIQ's label-based scheduling mechanism. For production cron scheduling, consider `PipelineScheduler` with APScheduler integration. 
+ +--- + +## What This Example Shows + +- Creating a simple pipeline +- Using `LabelBasedScheduler` to schedule pipeline runs +- Cron expressions with second-level precision +- Interval-based scheduling +- Listing active schedules + +--- + +## Code Walkthrough + +```python +import asyncio +from taskiq import InMemoryBroker +from taskiq_flow import Pipeline, PipelineMiddleware +from taskiq_flow.scheduling import LabelBasedScheduler + +# Create broker +broker = InMemoryBroker(await_inplace=True).with_middlewares(PipelineMiddleware()) + +# Define a simple task +@broker.task +async def log_message(msg: str) -> str: + """Log a message.""" + return f"Processed: {msg}" + +async def main(): + # Create pipeline + pipeline = Pipeline(broker).call_next(log_message) + + # Create scheduler + scheduler = LabelBasedScheduler(broker) + + # Schedule with cron expression (every 5 seconds) + schedule_id = await scheduler.schedule_with_cron( + pipeline=pipeline, + label="every-5-seconds", + cron="*/5 * * * * *", # 6-field cron for second precision + args=("Hello from scheduled pipeline!",), + ) + print(f"Scheduled with cron: {schedule_id}") + + # Schedule with interval (every 3 seconds) + interval_id = await scheduler.schedule_with_interval( + pipeline=pipeline, + label="every-3-seconds", + interval_seconds=3, + args=("Interval scheduled run!",), + ) + print(f"Scheduled with interval: {interval_id}") + + # Wait for some executions to complete + print("Waiting for pipeline executions (12 seconds)...") + await asyncio.sleep(12) + + # List scheduled jobs + schedules = scheduler.list_schedules() + print(f"Active schedules: {len(schedules)}") + for sched in schedules: + print(f" - {sched['label']}: cron={sched.get('cron')}, enabled={sched['enabled']}") + +asyncio.run(main()) +``` + +--- + +## Scheduling Methods + +### Cron Scheduling + +```python +schedule_id = await scheduler.schedule_with_cron( + pipeline=pipeline, + label="my-schedule", + cron="*/5 * * * * *", # Every 5 seconds (6-field 
cron) + args=("message",), +) +``` + +**6-field cron format**: `second minute hour day month day-of-week` + +Examples: +- `*/5 * * * * *` — Every 5 seconds +- `0 * * * * *` — Every minute at second 0 +- `0 0 * * * *` — Every hour at minute 0, second 0 + +### Interval Scheduling + +```python +interval_id = await scheduler.schedule_with_interval( + pipeline=pipeline, + label="interval-3s", + interval_seconds=3, + args=("message",), +) +``` + +Runs every N seconds, regardless of system time. + +--- + +## Expected Output + +``` +Scheduled with cron: schedule_123456 +Scheduled with interval: interval_789012 +Waiting for pipeline executions (12 seconds)... +INFO:root:Processed: Hello from scheduled pipeline! +INFO:root:Processed: Interval scheduled run! +INFO:root:Processed: Hello from scheduled pipeline! +INFO:root:Processed: Interval scheduled run! +... +Active schedules: 2 + - every-5-seconds: cron=*/5 * * * * *, enabled=True + - every-3-seconds: cron=None, enabled=True +``` + +You should see the log message printed multiple times as schedules trigger. + +--- + +## Key Points + +### Label-Based Scheduling + +- Each schedule requires a unique `label` (used for identification) +- Labels can be used to enable/disable schedules dynamically +- The scheduler manages schedule persistence based on your broker + +### InMemoryBroker Limitation + +With `InMemoryBroker`, schedules only work while the process is running; they are lost on restart. For persistent scheduling, use Redis-based brokers with proper schedule stores. + +### Multiple Schedules + +You can schedule the same pipeline multiple times with different labels, cron expressions, or arguments. 
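The 6-field cron format described above (`second minute hour day month day-of-week`) can be illustrated with a small standard-library sketch. This is not how `LabelBasedScheduler` parses expressions; `field_matches` and `cron_matches` are hypothetical helpers written for illustration, and they support only `*`, `*/N`, and plain integers. Every field must match, which is a simplification of how real cron implementations combine the day and day-of-week fields.

```python
from datetime import datetime


def field_matches(field: str, value: int) -> bool:
    """Check one cron field against a value; supports '*', '*/N', and integers only."""
    if field == "*":
        return True
    if field.startswith("*/"):
        return value % int(field[2:]) == 0
    return int(field) == value


def cron_matches(expr: str, moment: datetime) -> bool:
    """Return True if a 6-field cron expression matches the given moment."""
    fields = expr.split()
    if len(fields) != 6:
        raise ValueError("expected 6 fields: second minute hour day month day-of-week")
    # Cron counts Sunday as 0, while datetime.weekday() counts Monday as 0.
    cron_weekday = (moment.weekday() + 1) % 7
    values = [moment.second, moment.minute, moment.hour,
              moment.day, moment.month, cron_weekday]
    return all(field_matches(f, v) for f, v in zip(fields, values))
```

Under these assumptions, `*/5 * * * * *` matches any moment whose second is a multiple of 5, and `0 0 * * * *` matches only the first second of each hour.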
+ +--- + +## Variations + +### Custom Scheduling with PipelineScheduler + +For more advanced scheduling (timezones, misfire handling), use `PipelineScheduler`: + +```python +from taskiq_flow import PipelineScheduler + +scheduler = PipelineScheduler(broker) +job_id = await scheduler.schedule( + pipeline, + cron="0 9 * * *", # Daily at 9 AM + args=("daily",) +) +await scheduler.start() +``` + +See the [Scheduling Guide]({{ '/en/guides/scheduling/' | relative_url }}) for full details on `PipelineScheduler`. + +--- + +## Learning Path + +After this example: + +1. **[Scheduling Guide]({{ '/en/guides/scheduling/' | relative_url }})** — Comprehensive cron and interval scheduling +2. **[PipelineScheduler]({{ '/en/api/core.md#pipelinescheduler' | relative_url }})** — API reference +3. **[Retry Guide]({{ '/en/guides/retry/' | relative_url }})** — Handling failures in scheduled pipelines + +--- + +*This example shows label-based scheduling basics. For production use, explore PipelineScheduler with external job stores (PostgreSQL/Redis).* diff --git a/docs/_en/examples/secure_api_example.md b/docs/_en/examples/secure_api_example.md index 81985b8..16bc779 100644 --- a/docs/_en/examples/secure_api_example.md +++ b/docs/_en/examples/secure_api_example.md @@ -1,6 +1,6 @@ --- permalink: /en/examples/secure-api-example/ -title: Example: secure_api_example.py +title: 'Example: secure_api_example.py' nav_order: 46 color_scheme: dark --- diff --git a/docs/_en/examples/tracking-demo.md b/docs/_en/examples/tracking-demo.md index c6cf469..23850ac 100644 --- a/docs/_en/examples/tracking-demo.md +++ b/docs/_en/examples/tracking-demo.md @@ -1,183 +1,183 @@ ---- -permalink: /en/examples/tracking-demo/ -title: Example: tracking_demo.py -nav_order: 45 -color_scheme: dark ---- -# Example: tracking_demo.py - -**Pipeline execution tracking with PipelineTrackingManager** - -> **Version**: {VERSION} | **File**: `examples/tracking_demo.py` - ---- - -## Overview - -This example demonstrates how to 
monitor pipeline execution in real-time using the `PipelineTrackingManager`. It covers: - -- Setting up tracking with automatic storage selection -- Attaching tracking to a pipeline -- Running a pipeline and checking its status -- Accessing step-by-step execution history - ---- - -## What This Example Shows - -- Creating a `PipelineTrackingManager` with auto-storage -- Using `.with_tracking()` on a pipeline -- Waiting for pipeline completion -- Querying pipeline status from the tracking manager -- Logging step progress - ---- - -## Code Walkthrough - -```python -import asyncio -import logging - -from taskiq import InMemoryBroker -from taskiq_flow import Pipeline, PipelineMiddleware -from taskiq_flow.tracking import PipelineTrackingManager - -logging.basicConfig(level=logging.INFO) -logger = logging.getLogger(__name__) - -# Create broker -broker = InMemoryBroker(await_inplace=True) - -# Define a task with a delay to show tracking in action -@broker.task -async def slow_task(x: int) -> int: - """Slow task that doubles the input.""" - await asyncio.sleep(1) - print(f"Slow task called with {x}") - return x * 2 - -async def main(): - # 1. Setup tracking with auto-storage selection - tracking_manager = PipelineTrackingManager().with_auto_storage(broker) - - # 2. Create middleware with tracking manager - middleware = PipelineMiddleware(tracking_manager=tracking_manager) - broker_with_middleware = broker.with_middlewares(middleware) - - # 3. Create pipeline with tracking enabled - pipeline = ( - Pipeline(broker_with_middleware) - .with_tracking(manager=tracking_manager) - .call_next(slow_task) - .call_next(slow_task) - ) - - # 4. Execute the pipeline - result = await pipeline.kiq(10) - await result.wait_result() - - # 5. 
Query the tracking status - pipeline_id = pipeline.pipeline_id - if pipeline_id is None: - raise RuntimeError("Pipeline has no ID") - - status = await tracking_manager.get_status(pipeline_id) - if status is None: - raise RuntimeError("Failed to get pipeline status") - - logger.info(f"Pipeline status: {status.status}") - logger.info(f"Steps completed: {len(status.steps)}") - -asyncio.run(main()) -``` - ---- - -## Key Points - -### Tracking Setup - -```python -tracking_manager = PipelineTrackingManager().with_auto_storage(broker) -``` - -- `with_auto_storage()` automatically selects the appropriate storage backend based on the broker -- For `InMemoryBroker`, uses `InMemoryPipelineStorage` -- For Redis brokers, uses `RedisPipelineStorage` - -### Attaching Tracking to Pipeline - -```python -pipeline = Pipeline(broker).with_tracking(manager=tracking_manager) -``` - -The tracking manager must be attached **before** calling `pipeline.kiq()`. - -### Inspecting Status - -After execution, the `PipelineStatus` object contains: - -- `status` — Overall status (`COMPLETED`, `FAILED`, etc.) -- `steps` — List of `StepStatus` objects, one per step -- `started_at` / `completed_at` — Timestamps -- `duration_ms` — Total execution time -- `result` — Final return value (if completed) - -Each `StepStatus` includes: - -- `step_name` — Task name -- `status` — Step's status -- `duration_ms` — How long the step took -- `result` — Step's return value - ---- - -## Expected Output - -``` -INFO:__main__:Pipeline status: COMPLETED -INFO:__main__:Steps completed: 2 -``` - -You'll also see log messages from the slow_task calls with 1-second delays. 
- ---- - -## Variations - -### Access Step Details - -```python -for step in status.steps: - logger.info(f"Step '{step.step_name}' took {step.duration_ms:.0f}ms") - if step.result: - logger.info(f" Result: {step.result}") -``` - -### Track Multiple Pipelines - -```python -# Launch several pipelines concurrently -tasks = [pipeline.kiq(i) for i in range(5)] -await asyncio.gather(*[t.wait_result() for t in tasks]) - -# List all tracked pipelines -all_statuses = await tracking_manager.list_pipelines() -for s in all_statuses: - print(f"{s.pipeline_id}: {s.status}") -``` - ---- - -## Learning Path - -After this example: - -1. **[Tracking Guide]({{ '/en/guides/tracking/' | relative_url }})** — Full tracking features (storage backends, metrics) -2. **[WebSocket Guide]({{ '/en/guides/websocket/' | relative_url }})** — Real-time streaming of tracking events -3. **[API Guide]({{ '/en/guides/api/' | relative_url }})** — Expose tracking data via REST API - ---- - -*This example shows the basics. For production, use Redis storage and add listeners for alerts.* +--- +permalink: /en/examples/tracking-demo/ +title: 'Example: tracking_demo.py' +nav_order: 45 +color_scheme: dark +--- +# Example: tracking_demo.py + +**Pipeline execution tracking with PipelineTrackingManager** + +> **Version**: {VERSION} | **File**: `examples/tracking_demo.py` + +--- + +## Overview + +This example demonstrates how to monitor pipeline execution in real-time using the `PipelineTrackingManager`. 
It covers: + +- Setting up tracking with automatic storage selection +- Attaching tracking to a pipeline +- Running a pipeline and checking its status +- Accessing step-by-step execution history + +--- + +## What This Example Shows + +- Creating a `PipelineTrackingManager` with auto-storage +- Using `.with_tracking()` on a pipeline +- Waiting for pipeline completion +- Querying pipeline status from the tracking manager +- Logging step progress + +--- + +## Code Walkthrough + +```python +import asyncio +import logging + +from taskiq import InMemoryBroker +from taskiq_flow import Pipeline, PipelineMiddleware +from taskiq_flow.tracking import PipelineTrackingManager + +logging.basicConfig(level=logging.INFO) +logger = logging.getLogger(__name__) + +# Create broker +broker = InMemoryBroker(await_inplace=True) + +# Define a task with a delay to show tracking in action +@broker.task +async def slow_task(x: int) -> int: + """Slow task that doubles the input.""" + await asyncio.sleep(1) + print(f"Slow task called with {x}") + return x * 2 + +async def main(): + # 1. Setup tracking with auto-storage selection + tracking_manager = PipelineTrackingManager().with_auto_storage(broker) + + # 2. Create middleware with tracking manager + middleware = PipelineMiddleware(tracking_manager=tracking_manager) + broker_with_middleware = broker.with_middlewares(middleware) + + # 3. Create pipeline with tracking enabled + pipeline = ( + Pipeline(broker_with_middleware) + .with_tracking(manager=tracking_manager) + .call_next(slow_task) + .call_next(slow_task) + ) + + # 4. Execute the pipeline + result = await pipeline.kiq(10) + await result.wait_result() + + # 5. 
Query the tracking status + pipeline_id = pipeline.pipeline_id + if pipeline_id is None: + raise RuntimeError("Pipeline has no ID") + + status = await tracking_manager.get_status(pipeline_id) + if status is None: + raise RuntimeError("Failed to get pipeline status") + + logger.info(f"Pipeline status: {status.status}") + logger.info(f"Steps completed: {len(status.steps)}") + +asyncio.run(main()) +``` + +--- + +## Key Points + +### Tracking Setup + +```python +tracking_manager = PipelineTrackingManager().with_auto_storage(broker) +``` + +- `with_auto_storage()` automatically selects the appropriate storage backend based on the broker +- For `InMemoryBroker`, uses `InMemoryPipelineStorage` +- For Redis brokers, uses `RedisPipelineStorage` + +### Attaching Tracking to Pipeline + +```python +pipeline = Pipeline(broker).with_tracking(manager=tracking_manager) +``` + +The tracking manager must be attached **before** calling `pipeline.kiq()`. + +### Inspecting Status + +After execution, the `PipelineStatus` object contains: + +- `status` — Overall status (`COMPLETED`, `FAILED`, etc.) +- `steps` — List of `StepStatus` objects, one per step +- `started_at` / `completed_at` — Timestamps +- `duration_ms` — Total execution time +- `result` — Final return value (if completed) + +Each `StepStatus` includes: + +- `step_name` — Task name +- `status` — Step's status +- `duration_ms` — How long the step took +- `result` — Step's return value + +--- + +## Expected Output + +``` +INFO:__main__:Pipeline status: COMPLETED +INFO:__main__:Steps completed: 2 +``` + +You'll also see log messages from the slow_task calls with 1-second delays. 
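The status fields listed under Inspecting Status can be summarized after a run. The sketch below uses stand-in dataclasses that copy only the field names from that list; `StepRecord`, `PipelineRecord`, `slowest_step`, and `failed_steps` are hypothetical names and are not part of Taskiq-Flow. It shows one way such data might be reduced to a short report.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class StepRecord:
    """Stand-in for a step entry; field names follow the StepStatus list above."""
    step_name: str
    status: str
    duration_ms: float
    result: Any = None


@dataclass
class PipelineRecord:
    """Stand-in for a pipeline status; not the real PipelineStatus class."""
    status: str
    steps: list[StepRecord] = field(default_factory=list)


def slowest_step(record: PipelineRecord) -> Optional[StepRecord]:
    """Return the step with the largest duration, or None if no steps ran."""
    if not record.steps:
        return None
    return max(record.steps, key=lambda step: step.duration_ms)


def failed_steps(record: PipelineRecord) -> list[str]:
    """Return the names of steps whose status is not COMPLETED."""
    return [s.step_name for s in record.steps if s.status != "COMPLETED"]
```

If the real status objects expose the same attribute names, the two helper functions should work on them unchanged, since they rely only on attribute access.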
+ +--- + +## Variations + +### Access Step Details + +```python +for step in status.steps: + logger.info(f"Step '{step.step_name}' took {step.duration_ms:.0f}ms") + if step.result: + logger.info(f" Result: {step.result}") +``` + +### Track Multiple Pipelines + +```python +# Launch several pipelines concurrently +tasks = [pipeline.kiq(i) for i in range(5)] +await asyncio.gather(*[t.wait_result() for t in tasks]) + +# List all tracked pipelines +all_statuses = await tracking_manager.list_pipelines() +for s in all_statuses: + print(f"{s.pipeline_id}: {s.status}") +``` + +--- + +## Learning Path + +After this example: + +1. **[Tracking Guide]({{ '/en/guides/tracking/' | relative_url }})** — Full tracking features (storage backends, metrics) +2. **[WebSocket Guide]({{ '/en/guides/websocket/' | relative_url }})** — Real-time streaming of tracking events +3. **[API Guide]({{ '/en/guides/api/' | relative_url }})** — Expose tracking data via REST API + +--- + +*This example shows the basics. For production, use Redis storage and add listeners for alerts.* diff --git a/docs/_en/examples/websocket-demo.md b/docs/_en/examples/websocket-demo.md index 4595adc..7767606 100644 --- a/docs/_en/examples/websocket-demo.md +++ b/docs/_en/examples/websocket-demo.md @@ -1,6 +1,6 @@ --- permalink: /en/examples/websocket-demo/ -title: Example: websocket_demo.py +title: 'Example: websocket_demo.py' nav_order: 46 color_scheme: dark --- diff --git a/docs/_en/guides/api.md b/docs/_en/guides/api.md index 6a2c4e2..90edbab 100644 --- a/docs/_en/guides/api.md +++ b/docs/_en/guides/api.md @@ -1,753 +1,796 @@ ---- -title: REST API Guide -nav_order: 28 ---- -# REST API Guide - -**FastAPI-based pipeline management, visualization, and remote execution** - -> **Version**: {VERSION} | **Related**: [Tracking Guide]({{ '/en/guides/tracking/' | relative_url }}), [WebSocket Guide]({{ '/en/guides/websocket/' | relative_url }}), [Dataflow Guide]({{ '/en/guides/dataflow/' | relative_url }}) - ---- - -## 
Overview - -Taskiq-Flow includes a FastAPI-based REST API for managing pipelines remotely. Build dashboards, CI/CD integrations, or any system that needs to interact with pipelines via HTTP. - -This guide covers: - -- Setting up the API server -- Available endpoints -- Pipeline visualization endpoints -- Custom endpoint extensions -- Authentication considerations -- Production deployment patterns - ---- - -## 1. Quick Setup - -```python -from fastapi import FastAPI -from taskiq import InMemoryBroker -from taskiq_flow import DataflowPipeline, pipeline_task, create_visualization_api - -# 1. Create broker and tasks -broker = InMemoryBroker(await_inplace=True) - -@broker.task -@pipeline_task(output="result") -async def process(data: str) -> dict: - return {"processed": data.upper()} - -# 2. Build pipeline -pipeline = DataflowPipeline.from_tasks(broker, [process]) -pipeline.pipeline_id = "my_pipeline" - -# 3. Create FastAPI app with visualization API -app = FastAPI(title="Taskiq-Flow API", version="{VERSION}") -viz_api = create_visualization_api(broker, app) -viz_api.add_pipeline("my_pipeline", pipeline) - -# 4. Run with uvicorn -# uvicorn main:app --reload --port 8000 -``` - -All endpoints are automatically mounted under `/pipelines`. - ---- - -## 2. Available Endpoints - -The visualization API provides these routes: - -### 2.1. Health Check - -``` -GET /health -``` - -Returns simple health status: - -```json -{ - "status": "healthy", - "timestamp": "2026-05-05T12:00:00Z" -} -``` - -### 2.2. List All Pipelines - -``` -GET /pipelines -``` - -Lists all registered pipelines with metadata: - -```json -[ - { - "pipeline_id": "audio_analysis_v1", - "pipeline_type": "dataflow", - "tasks": ["extract", "tag", "embed"], - "created_at": "2026-05-05T10:00:00Z" - } -] -``` - -### 2.3. 
Register a New Pipeline - -``` -POST /pipelines/{pipeline_id} -``` - -Request body: - -```json -{ - "pipeline_type": "dataflow", - "tasks": ["task1", "task2"] -} -``` - -Or use the Python API directly (recommended): - -```python -viz_api.add_pipeline("new_pipeline", pipeline_object) -``` - -### 2.4. Get Pipeline Status - -``` -GET /pipelines/{pipeline_id}/status -``` - -Returns current execution status if a run is active: - -```json -{ - "pipeline_id": "my_pipeline_123", - "status": "RUNNING", - "steps_completed": 3, - "total_steps": 5, - "started_at": "2026-05-05T12:00:00Z" -} -``` - -### 2.5. Get DAG as JSON - -``` -GET /pipelines/{pipeline_id}/dag -``` - -Returns the directed acyclic graph structure: - -```json -{ - "nodes": [ - {"id": "extract", "outputs": ["features"]}, - {"id": "tag", "inputs": ["features"], "outputs": ["tags"]}, - {"id": "embed", "inputs": ["features"], "outputs": ["embedding"]} - ], - "edges": [ - {"from": "extract", "to": "tag"}, - {"from": "extract", "to": "embed"} - ] -} -``` - -### 2.6. Get DAG in DOT Format - -``` -GET /pipelines/{pipeline_id}/dag/dot -``` - -Returns Graphviz-compatible DOT string for visualization: - -``` -digraph "my_pipeline" { - node [shape=box]; - extract -> tag; - extract -> embed; -} -``` - -### 2.7. Full Pipeline Visualization - -``` -GET /pipelines/{pipeline_id}/visualize -``` - -Returns comprehensive pipeline metadata: - -```json -{ - "pipeline_id": "my_pipeline", - "type": "dataflow", - "tasks": [ - { - "name": "extract", - "outputs": ["features"], - "inputs": [], - "description": "Extract features from audio" - }, - { - "name": "tag", - "inputs": ["features"], - "outputs": ["tags"], - "description": "Generate tags" - } - ], - "execution_levels": [ - ["extract"], - ["tag", "embed"] - ] -} -``` - ---- - -## 3. Executing Pipelines via API - -The core API focuses on management and visualization. 
To execute pipelines remotely, add a custom endpoint: - -```python -from fastapi import FastAPI, HTTPException -from taskiq_flow.api import PipelineVisualizationAPI - -app = FastAPI() -viz_api = PipelineVisualizationAPI(broker, app) - -@app.post("/pipelines/{pipeline_id}/execute") -async def execute_pipeline( - pipeline_id: str, - parameters: dict, - wait: bool = False, - timeout: int = 30 -): - """ - Execute a pipeline with given parameters. - - - **pipeline_id**: Registered pipeline ID - - **parameters**: Dict of input parameters - - **wait**: If True, block until completion and return result - - **timeout**: Seconds to wait before timing out - """ - if pipeline_id not in viz_api.pipelines: - raise HTTPException(status_code=404, detail="Pipeline not found") - - pipeline = viz_api.pipelines[pipeline_id] - - try: - task = await pipeline.kiq_dataflow(**parameters) - - if wait: - result = await task.wait_result(timeout=timeout) - return { - "task_id": task.task_id, - "status": "COMPLETED", - "result": result.return_value - } - else: - return { - "task_id": task.task_id, - "status": "STARTED" - } - except asyncio.TimeoutError: - raise HTTPException(status_code=504, detail="Pipeline execution timed out") - except Exception as exc: - raise HTTPException(status_code=500, detail=str(exc)) - -@app.get("/pipelines/result/{task_id}") -async def get_result(task_id: str): - """Get the result of a pipeline execution.""" - result = await broker.get_result(task_id) - if result is None: - raise HTTPException(status_code=404, detail="Result not found or expired") - return {"task_id": task_id, "result": result.return_value} -``` - -### 3.1. Execute Async (Fire-and-Forget) - -```bash -curl -X POST "http://localhost:8000/pipelines/my_pipeline/execute" \ - -H "Content-Type: application/json" \ - -d '{"parameters": {"data": "input_value"}, "wait": false}' - -# Response: -{ - "task_id": "abc123def456", - "status": "STARTED" -} -``` - -### 3.2. 
Execute Synchronous (Wait for Result) - -```bash -curl -X POST "http://localhost:8000/pipelines/my_pipeline/execute" \ - -H "Content-Type: application/json" \ - -d '{"parameters": {"data": "input_value"}, "wait": true, "timeout": 60}' - -# Response (after pipeline completes): -{ - "task_id": "abc123def456", - "status": "COMPLETED", - "result": {"processed": "INPUT_VALUE"} -} -``` - ---- - -## 4. Integration with Frontend Dashboards - -### 4.1. React Dashboard Example - -```typescript -// React component displaying pipeline status -const PipelineStatus = ({ pipelineId }) => { - const [status, setStatus] = useState(null); - - useEffect(() => { - fetch(`/pipelines/${pipelineId}/status`) - .then(res => res.json()) - .then(data => setStatus(data)); - - // Poll every 5 seconds - const interval = setInterval(() => { - fetch(`/pipelines/${pipelineId}/status`) - .then(res => res.json()) - .then(setStatus); - }, 5000); - - return () => clearInterval(interval); - }, [pipelineId]); - - return ( -
-

Pipeline: {pipelineId}

-

Status: {status?.status}

-

Progress: {status?.steps_completed} / {status?.total_steps}

-
- ); -}; -``` - -### 4.2. DAG Visualization - -Use the DOT endpoint with Graphviz: - -```javascript -const renderDAG = async (pipelineId) => { - const response = await fetch(`/pipelines/${pipelineId}/dag/dot`); - const dot = await response.text(); - - // Use viz.js or d3-graphviz client-side - d3.select("#dag") - .graphviz() - .renderDot(dot); -}; -``` - ---- - -## 5. Authentication & Security - -### 5.1. API Key Authentication - -```python -from fastapi import Security, HTTPException -from fastapi.security import APIKeyHeader - -api_key_header = APIKeyHeader(name="X-API-Key") - -async def verify_api_key(api_key: str = Security(api_key_header)): - if api_key != os.getenv("API_SECRET"): - raise HTTPException(status_code=403, detail="Invalid API key") - return api_key - -@app.get("/pipelines") -async def list_pipelines(api_key: str = Security(verify_api_key)): - return viz_api.list_pipelines() -``` - -### 5.2. JWT Authentication - -```python -from jose import jwt -from fastapi import Depends - -async def get_current_user(token: str = Depends(oauth2_scheme)): - try: - payload = jwt.decode(token, SECRET_KEY, algorithms=["HS256"]) - return payload["sub"] - except jwt.JWTError: - raise HTTPException(status_code=401, detail="Invalid token") - -@app.post("/pipelines/{pipeline_id}/execute") -async def execute( - pipeline_id: str, - parameters: dict, - user: str = Depends(get_current_user) -): - # Log user's action for audit trail - logger.info(f"User {user} executed {pipeline_id}") - return await run_pipeline(pipeline_id, parameters) -``` - -### 5.3. 
Pipeline-Level Authorization - -Define per-pipeline ACLs via `pipeline_acls` in `TaskiqFlowConfig`, then use -`verify_pipeline_access` as a route dependency : - -```python -from fastapi import Depends -from taskiq_flow.config import TaskiqFlowConfig -from taskiq_flow.security.authorization import PipelineAuthorization -from taskiq_flow.security.dependencies import verify_pipeline_access -from taskiq_flow.api import create_visualization_api - -config = TaskiqFlowConfig( - pipeline_acls={ - "my_pipeline": { - "read": ["admin", "viewer"], - "execute": ["admin"], - }, - }, -) -viz_api = create_visualization_api(broker) # reads config automatically - -# verify_pipeline_access depends on get_current_user + authorization -# → use it directly on your protected endpoints -``` - -### 5.4. Combined Security Middleware + Route Dependencies - -For production, combine the global `SecurityMiddleware` (authentication) with per-route dependencies (authorization): - -```python -from taskiq_flow.security.middleware import SecurityMiddleware -from taskiq_flow.security.auth import APIKeyAuthProvider, JWTAuthProvider -from taskiq_flow.security.authorization import PipelineAuthorization -from taskiq_flow.config import TaskiqFlowConfig - -# 1. Configure a flat TaskiqFlowConfig (no nested SecurityConfig) -config = TaskiqFlowConfig( - security_enabled=True, - auth_provider="api_key", - api_keys={ - "admin-key": { - "role": "admin", - "pipelines": ["*"], - "permissions": ["read", "execute", "admin"], - }, - }, - jwt_secret="super-secret", # pragma: allowlist secret # noqa: S105 — documented placeholder, not a real secret - require_https=True, - pipeline_acls={ - "my_pipeline": {"read": ["admin"], "execute": ["admin"]}, - }, -) - -# 2. 
Build components from config -auth_provider = APIKeyAuthProvider(keys=config.api_keys) -if config.auth_provider == "jwt": - auth_provider = JWTAuthProvider(secret=config.jwt_secret) -authorization = PipelineAuthorization(pipeline_acls=config.pipeline_acls) - -# 3. Global middleware handles authentication + audit -app.add_middleware( - SecurityMiddleware, - auth_provider=auth_provider, - authorization=authorization, -) -``` - -Or, for full automatic wiring, use `create_visualization_api` which builds all -these components from `TaskiqFlowConfig` internally: - -```python -from taskiq_flow import create_visualization_api - -config = TaskiqFlowConfig( - security_enabled=True, - auth_provider="api_key", - api_keys={"admin-key": {"role": "admin", "pipelines": ["*"], "permissions": ["read", "execute", "admin"]}}, -) -app = create_visualization_api(broker) # security auto-configured from config -``` - -### Why this hybrid approach? -- `SecurityMiddleware` sets `request.state.user` for all routes after routing -- FastAPI path params (e.g. `pipeline_id`) are only available *after* routing -- Route dependencies (e.g. `Depends(verify_pipeline_access)`) run after routing → they can read `pipeline_id` and check ACLs -``` - ---- - -## 6. Rate Limiting - -Protect the API from abuse: - -```python -from slowapi import Limiter, _rate_limit_exceeded_handler -from slowapi.util import get_remote_address -from slowapi.errors import RateLimitExceeded - -limiter = Limiter(key_func=get_remote_address) -app.state.limiter = limiter -app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler) - -@app.post("/pipelines/{pipeline_id}/execute") -@limiter.limit("10/minute") # Max 10 executions per minute per IP -async def execute_pipeline(pipeline_id: str, parameters: dict): - # ... -``` - ---- - -## 7. 
CORS Configuration - -Enable cross-origin requests for web frontend: - -```python -from fastapi.middleware.cors import CORSMiddleware - -app.add_middleware( - CORSMiddleware, - allow_origins=["https://your-dashboard.com"], - allow_credentials=True, - allow_methods=["GET", "POST"], - allow_headers=["*"], -) -``` - ---- - -## 8. Production Deployment - -### 8.1. Gunicorn + Uvicorn Workers - -```bash -# Run with multiple workers for concurrency -gunicorn -k uvicorn.workers.UvicornWorker -w 4 main:app --bind 0.0.0.0:8000 - -# 4 worker processes handle concurrent requests -``` - -### 8.2. Docker - -```dockerfile -FROM python:3.12-slim - -WORKDIR /app -COPY requirements.txt . -RUN pip install --no-cache-dir -r requirements.txt - -COPY . . - -CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"] -``` - -```yaml -# docker-compose.yml -services: - api: - build: . - ports: - - "8000:8000" - environment: - - REDIS_URL=redis://redis:6379 - depends_on: - - redis - redis: - image: redis:7-alpine -``` - -### 8.3. Behind Reverse Proxy (nginx) - -```nginx -server { - listen 80; - server_name api.taskiq-flow.example.com; - - location / { - proxy_pass http://localhost:8000; - proxy_set_header Host $host; - proxy_set_header X-Real-IP $remote_addr; - proxy_http_version 1.1; - proxy_set_header Connection ""; - } -} -``` - -### 8.4. HTTPS with Let's Encrypt - -```bash -# Using certbot with nginx -sudo certbot --nginx -d api.taskiq-flow.example.com -``` - -Configure HTTPS → redirect to HTTP upstream: - -```nginx -location / { - proxy_pass http://localhost:8000; - proxy_set_header X-Forwarded-Proto $scheme; -} -``` - ---- - -## 9. Monitoring API Health - -### 9.1. 
Health Check Endpoint - -```python -from datetime import datetime, timezone -from fastapi import FastAPI -import psutil - -app = FastAPI() - -@app.get("/health") -async def health(): - return { - "status": "healthy", - "timestamp": datetime.now(timezone.utc).isoformat(), - "broker_connected": broker.is_connected(), - "memory_mb": psutil.Process().memory_info().rss / 1024 / 1024 - } -``` - -### 9.2. Metrics with Prometheus - -```python -from prometheus_fastapi_instrumentator import Instrumentator - -Instrumentator().instrument(app).expose(app, endpoint="/metrics") -``` - -Exposes `/metrics` with standard Prometheus metrics (request count, latency, etc.). - -### 9.3. API Versioning - -```python -app = FastAPI( - title="Taskiq-Flow API", - version="1.0.0", - docs_url="/docs", - redoc_url="/redoc" -) - -# Prefix all routes with /api/v1 -from fastapi import APIRouter -api_router = APIRouter(prefix="/api/v1") -api_router.include_router(viz_api.router) -app.include_router(api_router) -``` - ---- - -## 10. Error Handling - -Centralized error handling: - -```python -from fastapi import Request -from fastapi.responses import JSONResponse - -@app.exception_handler(TaskiqError) -async def taskiq_exception_handler(request: Request, exc: TaskiqError): - return JSONResponse( - status_code=500, - content={ - "error": exc.__class__.__name__, - "message": str(exc), - "pipeline_id": getattr(exc, "pipeline_id", None) - } - ) -``` - -Standardized error responses: - -```json -{ - "error": "PipelineExecutionError", - "message": "Task 'process' failed after 3 retries", - "pipeline_id": "audio_analysis_123", - "step": "extract_audio", - "timestamp": "2026-05-05T12:00:00Z" -} -``` - ---- - -## 11. 
API Client Example - -Python client for interacting with the API: - -```python -import httpx - -class TaskiqFlowClient: - def __init__(self, base_url: str, api_key: str = None): - self.base_url = base_url.rstrip("/") - self.headers = {"X-API-Key": api_key} if api_key else {} - - async def list_pipelines(self): - async with httpx.AsyncClient() as client: - resp = await client.get(f"{self.base_url}/pipelines", headers=self.headers) - resp.raise_for_status() - return resp.json() - - async def execute(self, pipeline_id: str, parameters: dict, wait: bool = False): - async with httpx.AsyncClient() as client: - resp = await client.post( - f"{self.base_url}/pipelines/{pipeline_id}/execute", - json={"parameters": parameters, "wait": wait}, - headers=self.headers - ) - resp.raise_for_status() - return resp.json() - - async def get_result(self, task_id: str): - async with httpx.AsyncClient() as client: - resp = await client.get(f"{self.base_url}/pipelines/result/{task_id}", headers=self.headers) - resp.raise_for_status() - return resp.json() - -# Usage -client = TaskiqFlowClient("http://localhost:8000") -pipelines = await client.list_pipelines() -result = await client.execute("my_pipeline", {"data": "test"}, wait=True) -``` - ---- - -## 13. Summary - -| Feature | Endpoint | Method | -|---------|----------|--------| -| Health check | `/health` | GET | -| List pipelines | `/pipelines` | GET | -| Pipeline status | `/pipelines/{id}/status` | GET | -| Get DAG (JSON) | `/pipelines/{id}/dag` | GET | -| Get DAG (DOT) | `/pipelines/{id}/dag/dot` | GET | -| Full visualization | `/pipelines/{id}/visualize` | GET | -| Execute pipeline | `/pipelines/{id}/execute` | POST (custom) | -| Get result | `/pipelines/result/{task_id}` | GET (custom) | - -**Key takeaway**: The API gives you full control over pipeline lifecycle — register, inspect, execute, and retrieve results — perfect for custom dashboards and integrations. - ---- - -## 14. 
Next Steps - -- **[WebSocket Guide]({{ '/en/guides/websocket/' | relative_url }})** — Real-time event streaming for live updates -- **[Dataflow Guide]({{ '/en/guides/dataflow/' | relative_url }})** — DAG pipelines with automatic parallelism and DataflowPipeline -- **[Tracking Guide]({{ '/en/guides/tracking/' | relative_url }})** — Historical execution data for analytics -- **[Example: API Server]({{ '/en/examples/api-example/' | relative_url }})** — Complete working FastAPI app - ---- - -*Manage pipelines from anywhere. Build dashboards, automation, and integrations.* +--- +title: REST API Guide +nav_order: 28 +--- +# REST API Guide + +**FastAPI-based pipeline management, visualization, and remote execution** + +> **Version**: {VERSION} | **Related**: [Tracking Guide]({{ '/en/guides/tracking/' | relative_url }}), [WebSocket Guide]({{ '/en/guides/websocket/' | relative_url }}), [Dataflow Guide]({{ '/en/guides/dataflow/' | relative_url }}) + +--- + +## Overview + +Taskiq-Flow includes a FastAPI-based REST API for managing pipelines remotely. Build dashboards, CI/CD integrations, or any system that needs to interact with pipelines via HTTP. + +This guide covers: + +- Setting up the API server +- Available endpoints +- Pipeline visualization endpoints +- Custom endpoint extensions +- Authentication considerations +- Production deployment patterns + +--- + +## 1. Quick Setup + +{% raw %} +```python +from fastapi import FastAPI +from taskiq import InMemoryBroker +from taskiq_flow import DataflowPipeline, pipeline_task, create_visualization_api + +# 1. Create broker and tasks +broker = InMemoryBroker(await_inplace=True) + +@broker.task +@pipeline_task(output="result") +async def process(data: str) -> dict: + return {"processed": data.upper()} + +# 2. Build pipeline +pipeline = DataflowPipeline.from_tasks(broker, [process]) +pipeline.pipeline_id = "my_pipeline" + +# 3. 
Create FastAPI app with visualization API +app = FastAPI(title="Taskiq-Flow API", version="{VERSION}") +viz_api = create_visualization_api(broker, app) +viz_api.add_pipeline("my_pipeline", pipeline) + +# 4. Run with uvicorn +# uvicorn main:app --reload --port 8000 +``` +{% endraw %} +All endpoints are automatically mounted under `/pipelines`. + +--- + +## 2. Available Endpoints + +The visualization API provides these routes: + +### 2.1. Health Check + +{% raw %} +``` +GET /health +``` +{% endraw %} +Returns simple health status: + +{% raw %} +```json +{ + "status": "healthy", + "timestamp": "2026-05-05T12:00:00Z" +} +``` +{% endraw %} +### 2.2. List All Pipelines + +{% raw %} +``` +GET /pipelines +``` +{% endraw %} +Lists all registered pipelines with metadata: + +{% raw %} +```json +[ + { + "pipeline_id": "audio_analysis_v1", + "pipeline_type": "dataflow", + "tasks": ["extract", "tag", "embed"], + "created_at": "2026-05-05T10:00:00Z" + } +] +``` +{% endraw %} +### 2.3. Register a New Pipeline + +{% raw %} +``` +POST /pipelines/{pipeline_id} +``` +{% endraw %} +Request body: + +{% raw %} +```json +{ + "pipeline_type": "dataflow", + "tasks": ["task1", "task2"] +} +``` +{% endraw %} +Or use the Python API directly (recommended): + +{% raw %} +```python +viz_api.add_pipeline("new_pipeline", pipeline_object) +``` +{% endraw %} +### 2.4. Get Pipeline Status + +{% raw %} +``` +GET /pipelines/{pipeline_id}/status +``` +{% endraw %} +Returns current execution status if a run is active: + +{% raw %} +```json +{ + "pipeline_id": "my_pipeline_123", + "status": "RUNNING", + "steps_completed": 3, + "total_steps": 5, + "started_at": "2026-05-05T12:00:00Z" +} +``` +{% endraw %} +### 2.5. 
Get DAG as JSON + +{% raw %} +``` +GET /pipelines/{pipeline_id}/dag +``` +{% endraw %} +Returns the directed acyclic graph structure: + +{% raw %} +```json +{ + "nodes": [ + {"id": "extract", "outputs": ["features"]}, + {"id": "tag", "inputs": ["features"], "outputs": ["tags"]}, + {"id": "embed", "inputs": ["features"], "outputs": ["embedding"]} + ], + "edges": [ + {"from": "extract", "to": "tag"}, + {"from": "extract", "to": "embed"} + ] +} +``` +{% endraw %} +### 2.6. Get DAG in DOT Format + +{% raw %} +``` +GET /pipelines/{pipeline_id}/dag/dot +``` +{% endraw %} +Returns Graphviz-compatible DOT string for visualization: + +{% raw %} +``` +digraph "my_pipeline" { + node [shape=box]; + extract -> tag; + extract -> embed; +} +``` +{% endraw %} +### 2.7. Full Pipeline Visualization + +{% raw %} +``` +GET /pipelines/{pipeline_id}/visualize +``` +{% endraw %} +Returns comprehensive pipeline metadata: + +{% raw %} +```json +{ + "pipeline_id": "my_pipeline", + "type": "dataflow", + "tasks": [ + { + "name": "extract", + "outputs": ["features"], + "inputs": [], + "description": "Extract features from audio" + }, + { + "name": "tag", + "inputs": ["features"], + "outputs": ["tags"], + "description": "Generate tags" + } + ], + "execution_levels": [ + ["extract"], + ["tag", "embed"] + ] +} +``` +{% endraw %} +--- + +## 3. Executing Pipelines via API + +The core API focuses on management and visualization. To execute pipelines remotely, add a custom endpoint: + +{% raw %} +```python +from fastapi import FastAPI, HTTPException +from taskiq_flow.api import PipelineVisualizationAPI + +app = FastAPI() +viz_api = PipelineVisualizationAPI(broker, app) + +@app.post("/pipelines/{pipeline_id}/execute") +async def execute_pipeline( + pipeline_id: str, + parameters: dict, + wait: bool = False, + timeout: int = 30 +): + """ + Execute a pipeline with given parameters. 
+ + - **pipeline_id**: Registered pipeline ID + - **parameters**: Dict of input parameters + - **wait**: If True, block until completion and return result + - **timeout**: Seconds to wait before timing out + """ + if pipeline_id not in viz_api.pipelines: + raise HTTPException(status_code=404, detail="Pipeline not found") + + pipeline = viz_api.pipelines[pipeline_id] + + try: + task = await pipeline.kiq_dataflow(**parameters) + + if wait: + result = await task.wait_result(timeout=timeout) + return { + "task_id": task.task_id, + "status": "COMPLETED", + "result": result.return_value + } + else: + return { + "task_id": task.task_id, + "status": "STARTED" + } + except asyncio.TimeoutError: + raise HTTPException(status_code=504, detail="Pipeline execution timed out") + except Exception as exc: + raise HTTPException(status_code=500, detail=str(exc)) + +@app.get("/pipelines/result/{task_id}") +async def get_result(task_id: str): + """Get the result of a pipeline execution.""" + result = await broker.get_result(task_id) + if result is None: + raise HTTPException(status_code=404, detail="Result not found or expired") + return {"task_id": task_id, "result": result.return_value} +``` +{% endraw %} +### 3.1. Execute Async (Fire-and-Forget) + +{% raw %} +```bash +curl -X POST "http://localhost:8000/pipelines/my_pipeline/execute" \ + -H "Content-Type: application/json" \ + -d '{"parameters": {"data": "input_value"}, "wait": false}' + +# Response: +{ + "task_id": "abc123def456", + "status": "STARTED" +} +``` +{% endraw %} +### 3.2. Execute Synchronous (Wait for Result) + +{% raw %} +```bash +curl -X POST "http://localhost:8000/pipelines/my_pipeline/execute" \ + -H "Content-Type: application/json" \ + -d '{"parameters": {"data": "input_value"}, "wait": true, "timeout": 60}' + +# Response (after pipeline completes): +{ + "task_id": "abc123def456", + "status": "COMPLETED", + "result": {"processed": "INPUT_VALUE"} +} +``` +{% endraw %} +--- + +## 4. 
Integration with Frontend Dashboards + +### 4.1. React Dashboard Example + +{% raw %} +```typescript +// React component displaying pipeline status +const PipelineStatus = ({ pipelineId }) => { + const [status, setStatus] = useState(null); + + useEffect(() => { + fetch(`/pipelines/${pipelineId}/status`) + .then(res => res.json()) + .then(data => setStatus(data)); + + // Poll every 5 seconds + const interval = setInterval(() => { + fetch(`/pipelines/${pipelineId}/status`) + .then(res => res.json()) + .then(setStatus); + }, 5000); + + return () => clearInterval(interval); + }, [pipelineId]); + + return ( +
+

Pipeline: {pipelineId}

+

Status: {status?.status}

+

Progress: {status?.steps_completed} / {status?.total_steps}

+
+ ); +}; +``` +{% endraw %} +### 4.2. DAG Visualization + +Use the DOT endpoint with Graphviz: + +{% raw %} +```javascript +const renderDAG = async (pipelineId) => { + const response = await fetch(`/pipelines/${pipelineId}/dag/dot`); + const dot = await response.text(); + + // Use viz.js or d3-graphviz client-side + d3.select("#dag") + .graphviz() + .renderDot(dot); +}; +``` +{% endraw %} +--- + +## 5. Authentication & Security + +### 5.1. API Key Authentication + +{% raw %} +```python +from fastapi import Security, HTTPException +from fastapi.security import APIKeyHeader + +api_key_header = APIKeyHeader(name="X-API-Key") + +async def verify_api_key(api_key: str = Security(api_key_header)): + if api_key != os.getenv("API_SECRET"): + raise HTTPException(status_code=403, detail="Invalid API key") + return api_key + +@app.get("/pipelines") +async def list_pipelines(api_key: str = Security(verify_api_key)): + return viz_api.list_pipelines() +``` +{% endraw %} +### 5.2. JWT Authentication + +{% raw %} +```python +from jose import jwt +from fastapi import Depends + +async def get_current_user(token: str = Depends(oauth2_scheme)): + try: + payload = jwt.decode(token, SECRET_KEY, algorithms=["HS256"]) + return payload["sub"] + except jwt.JWTError: + raise HTTPException(status_code=401, detail="Invalid token") + +@app.post("/pipelines/{pipeline_id}/execute") +async def execute( + pipeline_id: str, + parameters: dict, + user: str = Depends(get_current_user) +): + # Log user's action for audit trail + logger.info(f"User {user} executed {pipeline_id}") + return await run_pipeline(pipeline_id, parameters) +``` +{% endraw %} +### 5.3. 
Pipeline-Level Authorization + +Define per-pipeline ACLs via `pipeline_acls` in `TaskiqFlowConfig`, then use +`verify_pipeline_access` as a route dependency : + +{% raw %} +```python +from fastapi import Depends +from taskiq_flow.config import TaskiqFlowConfig +from taskiq_flow.security.authorization import PipelineAuthorization +from taskiq_flow.security.dependencies import verify_pipeline_access +from taskiq_flow.api import create_visualization_api + +config = TaskiqFlowConfig( + pipeline_acls={ + "my_pipeline": { + "read": ["admin", "viewer"], + "execute": ["admin"], + }, + }, +) +viz_api = create_visualization_api(broker) # reads config automatically + +# verify_pipeline_access depends on get_current_user + authorization +# → use it directly on your protected endpoints +``` +{% endraw %} +### 5.4. Combined Security Middleware + Route Dependencies + +For production, combine the global `SecurityMiddleware` (authentication) with per-route dependencies (authorization): + +{% raw %} +```python +from taskiq_flow.security.middleware import SecurityMiddleware +from taskiq_flow.security.auth import APIKeyAuthProvider, JWTAuthProvider +from taskiq_flow.security.authorization import PipelineAuthorization +from taskiq_flow.config import TaskiqFlowConfig + +# 1. Configure a flat TaskiqFlowConfig (no nested SecurityConfig) +config = TaskiqFlowConfig( + security_enabled=True, + auth_provider="api_key", + api_keys={ + "admin-key": { + "role": "admin", + "pipelines": ["*"], + "permissions": ["read", "execute", "admin"], + }, + }, + jwt_secret="super-secret", # pragma: allowlist secret # noqa: S105 — documented placeholder, not a real secret + require_https=True, + pipeline_acls={ + "my_pipeline": {"read": ["admin"], "execute": ["admin"]}, + }, +) + +# 2. 
Build components from config
+auth_provider = APIKeyAuthProvider(keys=config.api_keys)
+if config.auth_provider == "jwt":
+    auth_provider = JWTAuthProvider(secret=config.jwt_secret)
+authorization = PipelineAuthorization(pipeline_acls=config.pipeline_acls)
+
+# 3. Global middleware handles authentication + audit
+app.add_middleware(
+    SecurityMiddleware,
+    auth_provider=auth_provider,
+    authorization=authorization,
+)
+```
+{% endraw %}
+Or, for full automatic wiring, use `create_visualization_api` which builds all
+these components from `TaskiqFlowConfig` internally:
+
+{% raw %}
+```python
+from taskiq_flow import create_visualization_api
+
+config = TaskiqFlowConfig(
+    security_enabled=True,
+    auth_provider="api_key",
+    api_keys={"admin-key": {"role": "admin", "pipelines": ["*"], "permissions": ["read", "execute", "admin"]}},
+)
+app = create_visualization_api(broker)  # security auto-configured from config
+```
+{% endraw %}
+### Why this hybrid approach?
+- `SecurityMiddleware` sets `request.state.user` for all routes after routing
+- FastAPI path params (e.g. `pipeline_id`) are only available *after* routing
+- Route dependencies (e.g. `Depends(verify_pipeline_access)`) run after routing → they can read `pipeline_id` and check ACLs
+
+---
+
+## 6. Rate Limiting
+
+Protect the API from abuse:
+
+{% raw %}
+```python
+from slowapi import Limiter, _rate_limit_exceeded_handler
+from slowapi.util import get_remote_address
+from slowapi.errors import RateLimitExceeded
+
+limiter = Limiter(key_func=get_remote_address)
+app.state.limiter = limiter
+app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)
+
+@app.post("/pipelines/{pipeline_id}/execute")
+@limiter.limit("10/minute")  # Max 10 executions per minute per IP
+async def execute_pipeline(pipeline_id: str, parameters: dict):
+    # ...
+```
+{% endraw %}
+---
+
+## 7. CORS Configuration
+
+Enable cross-origin requests for web frontend:
+
+{% raw %}
+```python
+from fastapi.middleware.cors import CORSMiddleware
+
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=["https://your-dashboard.com"],
+    allow_credentials=True,
+    allow_methods=["GET", "POST"],
+    allow_headers=["*"],
+)
+```
+{% endraw %}
+---
+
+## 8. Production Deployment
+
+### 8.1. Gunicorn + Uvicorn Workers
+
+{% raw %}
+```bash
+# Run with multiple workers for concurrency
+gunicorn -k uvicorn.workers.UvicornWorker -w 4 main:app --bind 0.0.0.0:8000
+
+# 4 worker processes handle concurrent requests
+```
+{% endraw %}
+### 8.2. Docker
+
+{% raw %}
+```dockerfile
+FROM python:3.12-slim
+
+WORKDIR /app
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+
+COPY . .
+
+CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
+```
+{% endraw %}
+
+{% raw %}
+```yaml
+# docker-compose.yml
+services:
+  api:
+    build: .
+    ports:
+      - "8000:8000"
+    environment:
+      - REDIS_URL=redis://redis:6379
+    depends_on:
+      - redis
+  redis:
+    image: redis:7-alpine
+```
+{% endraw %}
+### 8.3. Behind Reverse Proxy (nginx)
+
+{% raw %}
+```nginx
+server {
+    listen 80;
+    server_name api.taskiq-flow.example.com;
+
+    location / {
+        proxy_pass http://localhost:8000;
+        proxy_set_header Host $host;
+        proxy_set_header X-Real-IP $remote_addr;
+        proxy_http_version 1.1;
+        proxy_set_header Connection "";
+    }
+}
+```
+{% endraw %}
+### 8.4. HTTPS with Let's Encrypt
+
+{% raw %}
+```bash
+# Using certbot with nginx
+sudo certbot --nginx -d api.taskiq-flow.example.com
+```
+{% endraw %}
+Configure HTTPS → redirect to HTTP upstream:
+
+{% raw %}
+```nginx
+location / {
+    proxy_pass http://localhost:8000;
+    proxy_set_header X-Forwarded-Proto $scheme;
+}
+```
+{% endraw %}
+---
+
+## 9. Monitoring API Health
+
+### 9.1. Health Check Endpoint
+
+{% raw %}
+```python
+from datetime import datetime, timezone
+from fastapi import FastAPI
+import psutil
+
+app = FastAPI()
+
+@app.get("/health")
+async def health():
+    return {
+        "status": "healthy",
+        "timestamp": datetime.now(timezone.utc).isoformat(),
+        "broker_connected": broker.is_connected(),
+        "memory_mb": psutil.Process().memory_info().rss / 1024 / 1024
+    }
+```
+{% endraw %}
+### 9.2. Metrics with Prometheus
+
+{% raw %}
+```python
+from prometheus_fastapi_instrumentator import Instrumentator
+
+Instrumentator().instrument(app).expose(app, endpoint="/metrics")
+```
+{% endraw %}
+Exposes `/metrics` with standard Prometheus metrics (request count, latency, etc.).
+
+### 9.3. API Versioning
+
+{% raw %}
+```python
+app = FastAPI(
+    title="Taskiq-Flow API",
+    version="1.0.0",
+    docs_url="/docs",
+    redoc_url="/redoc"
+)
+
+# Prefix all routes with /api/v1
+from fastapi import APIRouter
+api_router = APIRouter(prefix="/api/v1")
+api_router.include_router(viz_api.router)
+app.include_router(api_router)
+```
+{% endraw %}
+---
+
+## 10. Error Handling
+
+Centralized error handling:
+
+{% raw %}
+```python
+from fastapi import Request
+from fastapi.responses import JSONResponse
+
+@app.exception_handler(TaskiqError)
+async def taskiq_exception_handler(request: Request, exc: TaskiqError):
+    return JSONResponse(
+        status_code=500,
+        content={
+            "error": exc.__class__.__name__,
+            "message": str(exc),
+            "pipeline_id": getattr(exc, "pipeline_id", None)
+        }
+    )
+```
+{% endraw %}
+Standardized error responses:
+
+{% raw %}
+```json
+{
+  "error": "PipelineExecutionError",
+  "message": "Task 'process' failed after 3 retries",
+  "pipeline_id": "audio_analysis_123",
+  "step": "extract_audio",
+  "timestamp": "2026-05-05T12:00:00Z"
+}
+```
+{% endraw %}
+---
+
+## 11. API Client Example
+
+Python client for interacting with the API:
+
+{% raw %}
+```python
+import httpx
+
+class TaskiqFlowClient:
+    def __init__(self, base_url: str, api_key: str = None):
+        self.base_url = base_url.rstrip("/")
+        self.headers = {"X-API-Key": api_key} if api_key else {}
+
+    async def list_pipelines(self):
+        async with httpx.AsyncClient() as client:
+            resp = await client.get(f"{self.base_url}/pipelines", headers=self.headers)
+            resp.raise_for_status()
+            return resp.json()
+
+    async def execute(self, pipeline_id: str, parameters: dict, wait: bool = False):
+        async with httpx.AsyncClient() as client:
+            resp = await client.post(
+                f"{self.base_url}/pipelines/{pipeline_id}/execute",
+                json={"parameters": parameters, "wait": wait},
+                headers=self.headers
+            )
+            resp.raise_for_status()
+            return resp.json()
+
+    async def get_result(self, task_id: str):
+        async with httpx.AsyncClient() as client:
+            resp = await client.get(f"{self.base_url}/pipelines/result/{task_id}", headers=self.headers)
+            resp.raise_for_status()
+            return resp.json()
+
+# Usage
+client = TaskiqFlowClient("http://localhost:8000")
+pipelines = await client.list_pipelines()
+result = await client.execute("my_pipeline", {"data": "test"}, wait=True)
+```
+{% endraw %}
+---
+
+## 13. Summary
+
+| Feature | Endpoint | Method |
+|---------|----------|--------|
+| Health check | `/health` | GET |
+| List pipelines | `/pipelines` | GET |
+| Pipeline status | `/pipelines/{id}/status` | GET |
+| Get DAG (JSON) | `/pipelines/{id}/dag` | GET |
+| Get DAG (DOT) | `/pipelines/{id}/dag/dot` | GET |
+| Full visualization | `/pipelines/{id}/visualize` | GET |
+| Execute pipeline | `/pipelines/{id}/execute` | POST (custom) |
+| Get result | `/pipelines/result/{task_id}` | GET (custom) |
+
+**Key takeaway**: The API gives you full control over pipeline lifecycle — register, inspect, execute, and retrieve results — perfect for custom dashboards and integrations.
+
+---
+
+## 14. Next Steps
+
+- **[WebSocket Guide]({{ '/en/guides/websocket/' | relative_url }})** — Real-time event streaming for live updates
+- **[Dataflow Guide]({{ '/en/guides/dataflow/' | relative_url }})** — DAG pipelines with automatic parallelism and DataflowPipeline
+- **[Tracking Guide]({{ '/en/guides/tracking/' | relative_url }})** — Historical execution data for analytics
+- **[Example: API Server]({{ '/en/examples/api-example/' | relative_url }})** — Complete working FastAPI app
+
+---
+
+*Manage pipelines from anywhere. Build dashboards, automation, and integrations.*
\ No newline at end of file
diff --git a/docs/_en/guides/cache.md b/docs/_en/guides/cache.md
index 2f7a348..8b34464 100644
--- a/docs/_en/guides/cache.md
+++ b/docs/_en/guides/cache.md
@@ -1,301 +1,301 @@
----
-title: Storage & Cache Middleware Guide
-nav_order: 23
----
-# Storage & Cache Middleware Guide
-
-**Centralized persistence with StorageMiddleware and Dogpile caching with CacheMiddleware**
-
-> **Version**: {VERSION} | **New in v1.2.0** | **Related**: [Execution Guide]({{ '/en/guides/execution/' | relative_url }}), [API Reference — Storage]({{ '/en/api/storage/' | relative_url }}), [API Reference — Cache]({{ '/en/api/cache/' | relative_url }})
-
----
-
-## Overview
-
-v1.2.0 introduces **two new middlewares** that modularize persistence and caching concerns:
-
-| Middleware | Responsibility |
-|------------|----------------|
-| `StorageMiddleware` | Centralized, pluggable persistence for task results, pipeline state, and execution history |
-| `CacheMiddleware` | Dogpile-based worker caching to avoid redundant task executions |
-
-Both implement the `TaskiqMiddleware` lifecycle (`pre_execute`, `post_save`) and can be active **simultaneously** with `PipelineMiddleware`, `TransportMiddleware`, and `PipelineRetryMiddleware`.
- ---- - -## StorageMiddleware — Centralized Persistence - -`StorageMiddleware` captures every task result and stores it via a configured `BaseStorageAdapter`. Unlike the previous approach where tracking and scheduling persisted independently, there is now **one unified store**. - -### Why StorageMiddleware? - -- **Single source of truth** — all task results, pipeline history, and scheduling metadata live in one place -- **Pluggable backend** — swap InMemory for Redis or SQLite without changing application code -- **Auto-detection** — `StorageAdapterFactory` picks the right backend from environment and config -- **Isolation** — storage, cache, and tracking concerns are each in their own layer - -### Installation - -No extra install required — included in `taskiq-flow`. - -```bash -# For Redis backend -pip install redis - -# SQLite backend included via aiosqlite (bundled) -``` - -### Basic Usage - -```python -from taskiq import InMemoryBroker -from taskiq_flow import PipelineMiddleware, DataflowPipeline, pipeline_task -from taskiq_flow.middlewares import StorageMiddleware -from taskiq_flow.storage import InMemoryStorageAdapter - -broker = InMemoryBroker(await_inplace=True) - -# central persistence layer — store all task results automatically -storage = InMemoryStorageAdapter() -broker.add_middlewares( - StorageMiddleware(storage=storage, enabled=True), - PipelineMiddleware(), -) -``` - -### Production with Redis - -```python -from taskiq_flow.middlewares import StorageMiddleware -from taskiq_flow.storage import RedisStorageAdapter - -broker.add_middlewares( - StorageMiddleware( - storage=RedisStorageAdapter( - redis_url="redis://localhost:6379", - ttl_seconds=86400, # 24-hour retention - ), - ), - PipelineMiddleware(), -) -``` - -### Task Result Keys - -`StorageMiddleware` stores results under keys derived from the `TaskiqMessage` labels: - -| Key Pattern | Example | -|-------------|---------| -| `pipeline:{pipeline_id}:task:{task_id}` | 
`pipeline:audio_v1:task:abc123` | -| `task:{task_id}` | `task:abc123` | - -Stored value shape: -```json -{ - "task_id": "abc123", - "pipeline_id": "audio_v1", - "is_err": false, - "return_value": "{...}", - "error": null, - "execution_time": 0.42 -} -``` - -### Manual Inspector Usage - -```python -storage = InMemoryStorageAdapter() -await storage.set("my_key", {"status": "running"}, ttl_seconds=600) - -# Later... -result = await storage.get("my_key") -exists = await storage.exists("my_key") - -# Pattern-based listing -keys = await storage.keys("pipeline:my_run:*") - -# Cleanup expired entries -deleted = await storage.cleanup(ttl_seconds=3600) -``` - -### TTL and Expiration - -All three adapters support per-key TTL. Entries that have expired are lazily cleaned on access and eagerly via `cleanup()`: - -```python -# Store with 24-hour TTL -await storage.set("status", {"running": True}, ttl_seconds=86_400) - -# Check remaining time before it expires -entry = StorageEntry(key="status", value={"running": True}, ...) -seconds_left = entry.remaining_ttl() -``` - ---- - -## CacheMiddleware — Dogpile Worker Caching - -`CacheMiddleware` prevents redundant task executions by caching task **outputs** at the worker level. The Dogpile pattern ensures that only one coroutine regenerates an expired entry while others wait. - -### Why CacheMiddleware? 
- -- **Reduce unnecessary work** — skip re-executing idempotent tasks whose inputs haven't changed -- **Lower latency** — cached results are returned instantly without scheduling -- **Stampede protection** — Dogpile lock prevents thundering-herd at TTL expiry -- **Pluggable backend** — InMemory for single-worker, Redis for distributed - -### Basic Usage - -```python -from taskiq_flow.middlewares import CacheMiddleware -from taskiq_flow.cache import InMemoryCacheAdapter - -# Task results are cached for 1 hour by default -broker.add_middlewares( - CacheMiddleware( - cache=InMemoryCacheAdapter(), - default_ttl=3600, - enabled=True, - ) -) -``` - -After this, every task's result is cached automatically. A second task execution with the same result is reduced to a cache lookup. - -### Producer/Consumer Middleware Ordering - -Middleware order matters. `CacheMiddleware` should be placed **before** `StorageMiddleware` in the chain so that cache hits short-circuit before any persistence write is attempted: - -```python -# Correct ordering — cache checked first, then storage -broker.add_middlewares( - CacheMiddleware(), # ← checked first (pre_execute runs first) - StorageMiddleware(), # ← persisted if not cached - PipelineMiddleware(), # ← orchestrates downstream -) -``` - -### Per-Task Overrides - -Set TTL and error-caching per task execution via `TaskiqMessage` labels: - -```python -# In a task, override cache TTL for this execution -result = await some_task.kiq( - input_data, - labels={"cache_ttl": "7200", "cache_errors": "true"}, -) -``` - -| Label | Values | Effect | -|-------|--------|--------| -| `cache_ttl` | `int` (seconds) | Override the `default_ttl` for this single execution | -| `cache_errors` | `"true"` / `"false"` | Cache error results when `"true"` (disabled by default) | - ---- - -## StorageAdapterFactory — Zero-Config Setup - -`StorageAdapterFactory` auto-creates the right adapters from `TaskiqFlowConfig` (read from env vars): - -```python -from 
taskiq_flow.storage.factory import StorageAdapterFactory -from taskiq_flow.config import TaskiqFlowConfig - -# Get both middlewares in one call — sensible defaults -config = TaskiqFlowConfig( - storage_type="redis", # "redis" | "sqlite" | "inmemory" | "auto" - storage_redis_url="redis://localhost:6379", - storage_ttl_seconds=86_400, # 24 h - cache_type="redis", - cache_redis_url="redis://localhost:6379", - cache_default_ttl=3600, -) -middlewares = StorageAdapterFactory.create_default_middlewares(config=config) - -broker.add_middlewares( - middlewares["cache"], # CacheMiddleware - middlewares["storage"], # StorageMiddleware - PipelineMiddleware(), -) -``` - -Environment variables (all optional): - -| Env Var | Description | -|---------|-------------| -| `TASKIQ_FLOW_STORAGE_TYPE` | `"redis"`, `"sqlite"`, `"inmemory"`, or `"auto"` | -| `TASKIQ_FLOW_STORAGE_REDIS_URL` | Redis URL for storage | -| `TASKIQ_FLOW_STORAGE_TTL_SECONDS` | Default storage TTL | -| `TASKIQ_FLOW_CACHE_TYPE` | `"redis"`, `"inmemory"`, or `"auto"` | -| `TASKIQ_FLOW_CACHE_REDIS_URL` | Redis URL for cache | - ---- - -## Comparison: Storage vs Cache - -| Aspect | `StorageMiddleware` | `CacheMiddleware` | -|--------|--------------------|--------------------| -| Purpose | Long-term persistence of task/pipeline state | Short-term deduplication of task results | -| TTL | Hours to days (`storage_ttl_seconds`) | Minutes to hours (`default_ttl`) | -| Scope | Pipeline IDs, task IDs, scheduling metadata | Individual task result IDs | -| Backend | InMemory / Redis / SQLite | InMemory / Redis | -| Dogpile stampede | N/A | Yes | -| Auto-dedup | N/A | Yes | - -Use **both** together for a complete production setup: - - -```python -from taskiq_flow.storage.factory import StorageAdapterFactory -from taskiq_flow.config import TaskiqFlowConfig - -config = TaskiqFlowConfig( - storage_type="redis", - storage_redis_url="redis://localhost:6379", - cache_type="redis", - cache_redis_url="redis://localhost:6379", -) 
-middlewares = StorageAdapterFactory.create_default_middlewares(config=config) -broker.add_middlewares( - middlewares["cache"], - middlewares["storage"], - PipelineMiddleware(), -) -``` - ---- - -## Monitoring - -### Cache Hit Rate - -```python -stats = cache.get_stats() -print(f"Hit rate: {stats['hit_rate']:.1%}") # 94.5% -print(f"Hits: {stats['hits']}, Misses: {stats['misses']}") -``` - -Aim for a hit rate above 80% for reproducible pipelines with stable inputs. - -### Storage Size - -```python -all_keys = await storage.keys("*") -print(f"Total stored entries: {len(all_keys)}") -``` - ---- - -## Troubleshooting - -| Symptom | Likely Cause | Fix | -|---------|-------------|-----| -| Every task is a cache miss | TTL too short or inputs too variable | Increase `default_ttl`; check task arguments | -| Cache stampede on expiry | Using `InMemoryCacheAdapter` without Dogpile | Switch to `RedisCacheAdapter` (proper distributed locking) | -| Storage grows without bounds | No TTL set on entries | Set `ttl_seconds` on `StorageMiddleware`; run `cleanup()` periodically | -| Workers share stale results | Redis TTL not respected | Verify Redis `EXPIRE` is applied; check Redis config | - ---- - -*New in v1.2.0. 
Both middlewares are additive — drop them into an existing broker without redesign.* +--- +title: Storage & Cache Middleware Guide +nav_order: 23 +--- +# Storage & Cache Middleware Guide + +**Centralized persistence with StorageMiddleware and Dogpile caching with CacheMiddleware** + +> **Version**: {VERSION} | **New in v1.2.0** | **Related**: [Execution Guide]({{ '/en/guides/execution/' | relative_url }}), [API Reference — Storage]({{ '/en/api/storage/' | relative_url }}), [API Reference — Cache]({{ '/en/api/cache/' | relative_url }}) + +--- + +## Overview + +v1.2.0 introduces **two new middlewares** that modularize persistence and caching concerns: + +| Middleware | Responsibility | +|------------|----------------| +| `StorageMiddleware` | Centralized, pluggable persistence for task results, pipeline state, and execution history | +| `CacheMiddleware` | Dogpile-based worker caching to avoid redundant task executions | + +Both implement the `TaskiqMiddleware` lifecycle (`pre_execute`, `post_save`) and can be active **simultaneously** with `PipelineMiddleware`, `TransportMiddleware`, and `PipelineRetryMiddleware`. + +--- + +## StorageMiddleware — Centralized Persistence + +`StorageMiddleware` captures every task result and stores it via a configured `BaseStorageAdapter`. Unlike the previous approach where tracking and scheduling persisted independently, there is now **one unified store**. + +### Why StorageMiddleware? + +- **Single source of truth** — all task results, pipeline history, and scheduling metadata live in one place +- **Pluggable backend** — swap InMemory for Redis or SQLite without changing application code +- **Auto-detection** — `StorageAdapterFactory` picks the right backend from environment and config +- **Isolation** — storage, cache, and tracking concerns are each in their own layer + +### Installation + +No extra install required — included in `taskiq-flow`. 
+ +```bash +# For Redis backend +pip install redis + +# SQLite backend included via aiosqlite (bundled) +``` + +### Basic Usage + +```python +from taskiq import InMemoryBroker +from taskiq_flow import PipelineMiddleware, DataflowPipeline, pipeline_task +from taskiq_flow.middlewares import StorageMiddleware +from taskiq_flow.storage import InMemoryStorageAdapter + +broker = InMemoryBroker(await_inplace=True) + +# central persistence layer — store all task results automatically +storage = InMemoryStorageAdapter() +broker.add_middlewares( + StorageMiddleware(storage=storage, enabled=True), + PipelineMiddleware(), +) +``` + +### Production with Redis + +```python +from taskiq_flow.middlewares import StorageMiddleware +from taskiq_flow.storage import RedisStorageAdapter + +broker.add_middlewares( + StorageMiddleware( + storage=RedisStorageAdapter( + redis_url="redis://localhost:6379", + ttl_seconds=86400, # 24-hour retention + ), + ), + PipelineMiddleware(), +) +``` + +### Task Result Keys + +`StorageMiddleware` stores results under keys derived from the `TaskiqMessage` labels: + +| Key Pattern | Example | +|-------------|---------| +| `pipeline:{pipeline_id}:task:{task_id}` | `pipeline:audio_v1:task:abc123` | +| `task:{task_id}` | `task:abc123` | + +Stored value shape: +```json +{ + "task_id": "abc123", + "pipeline_id": "audio_v1", + "is_err": false, + "return_value": "{...}", + "error": null, + "execution_time": 0.42 +} +``` + +### Manual Inspector Usage + +```python +storage = InMemoryStorageAdapter() +await storage.set("my_key", {"status": "running"}, ttl_seconds=600) + +# Later... +result = await storage.get("my_key") +exists = await storage.exists("my_key") + +# Pattern-based listing +keys = await storage.keys("pipeline:my_run:*") + +# Cleanup expired entries +deleted = await storage.cleanup(ttl_seconds=3600) +``` + +### TTL and Expiration + +All three adapters support per-key TTL. 
Entries that have expired are lazily cleaned on access and eagerly via `cleanup()`: + +```python +# Store with 24-hour TTL +await storage.set("status", {"running": True}, ttl_seconds=86_400) + +# Check remaining time before it expires +entry = StorageEntry(key="status", value={"running": True}, ...) +seconds_left = entry.remaining_ttl() +``` + +--- + +## CacheMiddleware — Dogpile Worker Caching + +`CacheMiddleware` prevents redundant task executions by caching task **outputs** at the worker level. The Dogpile pattern ensures that only one coroutine regenerates an expired entry while others wait. + +### Why CacheMiddleware? + +- **Reduce unnecessary work** — skip re-executing idempotent tasks whose inputs haven't changed +- **Lower latency** — cached results are returned instantly without scheduling +- **Stampede protection** — Dogpile lock prevents thundering-herd at TTL expiry +- **Pluggable backend** — InMemory for single-worker, Redis for distributed + +### Basic Usage + +```python +from taskiq_flow.middlewares import CacheMiddleware +from taskiq_flow.cache import InMemoryCacheAdapter + +# Task results are cached for 1 hour by default +broker.add_middlewares( + CacheMiddleware( + cache=InMemoryCacheAdapter(), + default_ttl=3600, + enabled=True, + ) +) +``` + +After this, every task's result is cached automatically. A second task execution with the same inputs is reduced to a cache lookup. + +### Producer/Consumer Middleware Ordering + +Middleware order matters.
`CacheMiddleware` should be placed **before** `StorageMiddleware` in the chain so that cache hits short-circuit before any persistence write is attempted: + +```python +# Correct ordering — cache checked first, then storage +broker.add_middlewares( + CacheMiddleware(), # ← checked first (pre_execute runs first) + StorageMiddleware(), # ← persisted if not cached + PipelineMiddleware(), # ← orchestrates downstream +) +``` + +### Per-Task Overrides + +Set TTL and error-caching per task execution via `TaskiqMessage` labels: + +```python +# In a task, override cache TTL for this execution +result = await some_task.kiq( + input_data, + labels={"cache_ttl": "7200", "cache_errors": "true"}, +) +``` + +| Label | Values | Effect | +|-------|--------|--------| +| `cache_ttl` | `int` (seconds) | Override the `default_ttl` for this single execution | +| `cache_errors` | `"true"` / `"false"` | Cache error results when `"true"` (disabled by default) | + +--- + +## StorageAdapterFactory — Zero-Config Setup + +`StorageAdapterFactory` auto-creates the right adapters from `TaskiqFlowConfig` (read from env vars): + +```python +from taskiq_flow.storage.factory import StorageAdapterFactory +from taskiq_flow.config import TaskiqFlowConfig + +# Get both middlewares in one call — sensible defaults +config = TaskiqFlowConfig( + storage_type="redis", # "redis" | "sqlite" | "inmemory" | "auto" + storage_redis_url="redis://localhost:6379", + storage_ttl_seconds=86_400, # 24 h + cache_type="redis", + cache_redis_url="redis://localhost:6379", + cache_default_ttl=3600, +) +middlewares = StorageAdapterFactory.create_default_middlewares(config=config) + +broker.add_middlewares( + middlewares["cache"], # CacheMiddleware + middlewares["storage"], # StorageMiddleware + PipelineMiddleware(), +) +``` + +Environment variables (all optional): + +| Env Var | Description | +|---------|-------------| +| `TASKIQ_FLOW_STORAGE_TYPE` | `"redis"`, `"sqlite"`, `"inmemory"`, or `"auto"` | +| 
`TASKIQ_FLOW_STORAGE_REDIS_URL` | Redis URL for storage | +| `TASKIQ_FLOW_STORAGE_TTL_SECONDS` | Default storage TTL | +| `TASKIQ_FLOW_CACHE_TYPE` | `"redis"`, `"inmemory"`, or `"auto"` | +| `TASKIQ_FLOW_CACHE_REDIS_URL` | Redis URL for cache | + +--- + +## Comparison: Storage vs Cache + +| Aspect | `StorageMiddleware` | `CacheMiddleware` | +|--------|--------------------|--------------------| +| Purpose | Long-term persistence of task/pipeline state | Short-term deduplication of task results | +| TTL | Hours to days (`storage_ttl_seconds`) | Minutes to hours (`default_ttl`) | +| Scope | Pipeline IDs, task IDs, scheduling metadata | Individual task result IDs | +| Backend | InMemory / Redis / SQLite | InMemory / Redis | +| Dogpile stampede | N/A | Yes | +| Auto-dedup | N/A | Yes | + +Use **both** together for a complete production setup: + + +```python +from taskiq_flow.storage.factory import StorageAdapterFactory +from taskiq_flow.config import TaskiqFlowConfig + +config = TaskiqFlowConfig( + storage_type="redis", + storage_redis_url="redis://localhost:6379", + cache_type="redis", + cache_redis_url="redis://localhost:6379", +) +middlewares = StorageAdapterFactory.create_default_middlewares(config=config) +broker.add_middlewares( + middlewares["cache"], + middlewares["storage"], + PipelineMiddleware(), +) +``` + +--- + +## Monitoring + +### Cache Hit Rate + +```python +stats = cache.get_stats() +print(f"Hit rate: {stats['hit_rate']:.1%}") # 94.5% +print(f"Hits: {stats['hits']}, Misses: {stats['misses']}") +``` + +Aim for a hit rate above 80% for reproducible pipelines with stable inputs. 
+ +### Storage Size + +```python +all_keys = await storage.keys("*") +print(f"Total stored entries: {len(all_keys)}") +``` + +--- + +## Troubleshooting + +| Symptom | Likely Cause | Fix | +|---------|-------------|-----| +| Every task is a cache miss | TTL too short or inputs too variable | Increase `default_ttl`; check task arguments | +| Cache stampede on expiry | Using `InMemoryCacheAdapter` without Dogpile | Switch to `RedisCacheAdapter` (proper distributed locking) | +| Storage grows without bounds | No TTL set on entries | Set `ttl_seconds` on `StorageMiddleware`; run `cleanup()` periodically | +| Workers share stale results | Redis TTL not respected | Verify Redis `EXPIRE` is applied; check Redis config | + +--- + +*New in v1.2.0. Both middlewares are additive — drop them into an existing broker without redesign.* diff --git a/docs/_en/guides/execution.md b/docs/_en/guides/execution.md index 35356c7..c361549 100644 --- a/docs/_en/guides/execution.md +++ b/docs/_en/guides/execution.md @@ -1,514 +1,514 @@ ---- -title: Pipeline Execution Guide -nav_order: 22 ---- -# Pipeline Execution Guide - -**Understanding execution models, modes, and result handling** - -> **Version**: {VERSION} | **Applies to**: SequentialPipeline, DataflowPipeline, MapReduce | **See also**: [Dataflow Guide]({{ '/en/guides/dataflow/' | relative_url }}) - ---- - -## Overview - -This guide covers how Taskiq-Flow executes pipelines, manages concurrency, handles errors, and returns results. - ---- - -## 1. Execution Models - -### 1.1. 
Sequential Execution (Classic Pipeline) - -The classic `Pipeline` executes steps one after another in a linear chain: - -```python -pipeline = Pipeline(broker).call_next(task1).call_next(task2).call_next(task3) -# Execution order: task1 → task2 → task3 (synchronously) -``` - -**Characteristics**: -- Each step waits for the previous to complete -- Results pass directly from one step to the next -- Predictable, deterministic execution order -- Suitable for linear workflows - -### 1.2. Parallel Execution (Dataflow & Map) - -`DataflowPipeline` automatically parallelizes independent tasks: - -```python -@broker.task -@pipeline_task(output="features") -def extract(tracks): ... - -@broker_task -@pipeline_task(output="tags") -def tag(features): ... # Runs after extract - -@broker.task -@pipeline_task(output="embedding") -def embed(features): ... # Also runs after extract, in parallel with tag - -pipeline = DataflowPipeline.from_tasks(broker, [extract, tag, embed]) -# DAG: extract → (tag & embed in parallel) -``` - -**Characteristics**: -- Tasks with no unmet dependencies run concurrently -- DAG determines execution order -- Maximum throughput for independent operations -- Controlled by `max_parallel` parameter on `.map()` and `.reduce()` - -### 1.3. Map-Reduce Parallelism - -The `MapReduce` utility explicitly processes items in parallel: - -```python -from taskiq_flow import MapReduce - -# Process 100 items with max 10 concurrent workers -result = await MapReduce.map( - broker, - process_item, - items=items_list, - output="processed", - max_parallel=10 # controls concurrency level -) -``` - -**Parallelism control**: -- `max_parallel=None` → unlimited concurrency (use with caution) -- `max_parallel=1` → sequential execution -- Recommended: `max_parallel = number_of_cpu_cores * 2` for CPU-bound tasks - ---- - -## 2. Starting a Pipeline - -There are several ways to kick off pipeline execution: - -### 2.1. 
`pipeline.kiq(...)` — Fire and Forget - -Returns a `Task` immediately; you must manually wait for results: - -```python -task = await pipeline.kiq(initial_input) -# Do other things... -result = await task.wait_result() # blocks until complete -``` - -Use when: -- You need to track the task ID for later status checks -- You want to start multiple pipelines concurrently -- You're building a task queue system - -### 2.2. `pipeline.kiq_dataflow(...)` — Dataflow Convenience - -Same as `kiq()` but specifically for DataflowPipeline, with clearer semantics: - -```python -results = await pipeline.kiq_dataflow(track_paths=["a.mp3", "b.mp3"]) -# Returns: dict mapping output names → values -``` - -### 2.3. `pipeline.kiq_map_reduce(...)` — Map-Reduce Shortcut - -Combined map and reduce in one call: - -```python -final = await pipeline.kiq_map_reduce( - items=items, - map_output="processed", - reduce_output="final" -) -``` - ---- - -## 3. Waiting for Results - -### 3.1. Blocking Wait - -```python -task = await pipeline.kiq(data) -result = await task.wait_result() # blocks -print(result.return_value) -``` - -**Options**: -- `wait_result(timeout=30)` — timeout in seconds (raises `asyncio.TimeoutError`) -- `wait_result(raise_on_error=True)` — re-raise exceptions from tasks - -### 3.2. Polling for Status - -```python -task = await pipeline.kiq(data) - -# Check periodically without blocking -while not task.is_finished: - await asyncio.sleep(0.5) - status = await task.get_status() - print(f"Status: {status}") -``` - -Useful for progress bars or interactive applications. - -### 3.3. Fetch by Task ID (Distributed) - -If you have only the task ID (from another process): - -```python -from taskiq import Task -task = Task(task_id="abc123", broker=broker) -result = await task.wait_result() -``` - ---- - -## 4. Error Handling - -### 4.1. 
Task-Level Errors - -When a single task fails, the pipeline either: - -- **Stops immediately** (default) — remaining tasks are cancelled -- **Continues** if configured with error handling policies - -```python -pipeline = Pipeline(broker) - -# Configure to continue despite errors -pipeline.on_error("continue") # options: "stop", "continue", "retry" - -# Or use a retry policy (see Retry Guide) -pipeline.with_retry( - max_attempts=3, - delay=5, - backoff=2 -) -``` - -### 4.2. Pipeline-Level Errors - -The entire pipeline may fail if: - -- A critical task (no consumers) fails -- A task times out -- The broker becomes unavailable - -Handle pipeline errors with try/except: - -```python -try: - result = await pipeline.kiq(data) - output = await result.wait_result() -except TaskiqError as exc: - print(f"Pipeline failed: {exc}") - # Access partial results if any - if result.is_failed: - print(f"Failed at step: {result.failed_step}") -``` - -### 4.3. Partial Results on Failure - -Even if a pipeline fails, you may have partial results from completed steps: - -```python -result = await pipeline.kiq(data) -try: - output = await result.wait_result() -except PipelineError: - # Some steps succeeded before failure - partial = result.partial_results # dict of completed outputs - print(f"Partial: {partial}") -``` - ---- - -## 5. Timeouts - -Set timeouts at the pipeline level: - -```python -pipeline = Pipeline(broker) - -# Global timeout for entire pipeline (seconds) -pipeline.with_timeout(60) - -# Or per-task timeout via taskiq task decorator -@broker.task(timeout=30) -def slow_task(): ... -``` - -**Timeout behavior**: -- Exceeding the timeout cancels the running task -- `asyncio.TimeoutError` is raised -- Pipeline state is set to `ERROR` - ---- - -## 6. 
Execution Context - -Each task receives an optional `context` parameter containing metadata: - -```python -from taskiq_flow import PipelineContext - -@broker.task -async def my_task(data: str, context: PipelineContext): - print(f"Pipeline ID: {context.pipeline_id}") - print(f"Step number: {context.step_index}") - print(f"Task ID: {context.task_id}") - return data.upper() -``` - -**Context fields**: - -| Field | Type | Description | -|-------|------|-------------| -| `pipeline_id` | `str` | Unique identifier of the pipeline instance | -| `step_index` | `int` | Index of this step in the pipeline sequence | -| `task_id` | `str` | ID of the underlying taskiq task | -| `execution_mode` | `str` | `"sequential"`, `"parallel"`, or `"map_reduce"` | -| `started_at` | `datetime` | Timestamp when pipeline started | -| `broker` | `BaseBroker` | Reference to the task broker | - -Enable context passing when building the pipeline: - -```python -pipeline = Pipeline(broker).with_context(enable=True) -``` - ---- - -## 7. Custom Execution Engines (Advanced) - -For low-level control, use `ExecutionEngine` directly: - -```python -from taskiq_flow import ExecutionEngine, DAGBuilder -from taskiq_flow.dataflow import DataflowRegistry - -# Build registry manually -registry = DataflowRegistry() -registry.register_task(load, output="raw", inputs=[]) -registry.register_task(process, output="clean", inputs=["raw"]) -registry.register_task(save, output="saved", inputs=["clean"]) - -# Build DAG -dag = registry.build_dag() - -# Create execution engine -engine = ExecutionEngine(broker, dag) - -# Execute with custom inputs -results = await engine.execute(inputs={"source_file": "data.csv"}) -print(results) # {"raw": ..., "clean": ..., "saved": ...} -``` - -**When to use ExecutionEngine**: -- Building dynamic pipelines at runtime -- Custom scheduling/logic outside Pipeline abstraction -- Inspecting DAG structure before execution -- Integrating with external workflow managers - ---- - -## 8. 
Result Shapes - -Different pipeline types return different result structures: - -### 8.1. Sequential Pipeline Results - -```python -task = await pipeline.kiq(input) -result = await task.wait_result() - -# result.return_value is the final output after all steps -# Example: [3, 3, 3, 3] from our quickstart pipeline -``` - -### 8.2. Dataflow Pipeline Results - -```python -result = await pipeline.kiq_dataflow(input_data) - -# Returns a dict mapping each output name to its value -{ - "features": {...}, - "tags": [...], - "embedding": [...] -} -``` - -### 8.3. MapReduce Results - -```python -mapped = await MapReduce.map(...) -print(mapped.return_value) # List of mapped results - -reduced = await MapReduce.reduce(...) -print(reduced.return_value) # Final aggregated result -``` - ---- - -## 9. Inspecting Pipeline State - -Query pipeline status during or after execution: - -```python -from taskiq_flow import PipelineTrackingManager - -tracking = PipelineTrackingManager().with_auto_storage(broker) -pipeline = Pipeline(broker).with_tracking(tracking) - -task = await pipeline.kiq(data) - -# Get detailed status -status = await tracking.get_status(pipeline.pipeline_id) -print(f"Status: {status.status}") # PENDING, RUNNING, COMPLETED, FAILED -print(f"Steps: {len(status.steps)}") # Number of completed steps -print(f"Started: {status.started_at}") -print(f"Completed: {status.completed_at}") - -# Get step-by-step history -for step in status.steps: - print(f" {step.name}: {step.status} ({step.duration_ms}ms)") -``` - -**Status values**: - -| Status | Meaning | -|--------|---------| -| `PENDING` | Pipeline queued, not started | -| `RUNNING` | Currently executing | -| `COMPLETED` | Finished successfully | -| `FAILED` | Terminated with error | -| `CANCELLED` | Manually cancelled | - -See [Tracking Guide]({{ '/en/guides/tracking/' | relative_url }}) for advanced monitoring. - ---- - -## 10. Debugging Execution - -### 10.1. 
Enable Logging - -```python -import logging -logging.basicConfig(level=logging.DEBUG) - -# Or configure specific loggers -logger = logging.getLogger("taskiq_flow") -logger.setLevel(logging.DEBUG) -``` - -### 10.2. Print DAG Before Execution - -```python -pipeline.print_dag() -# Shows execution levels and dependencies -``` - -### 10.3. Inspect Task Arguments - -```python -@broker.task -async def debug_task(data, context: PipelineContext): - print(f"Received: {data}") - print(f"Context: pipeline={context.pipeline_id}, step={context.step_index}") - return data -``` - -### 10.4. Middleware for Tracing - -```python -from taskiq_flow.middleware import PipelineMiddleware - -class DebugMiddleware(PipelineMiddleware): - async def on_step_complete(self, ctx, result): - print(f"Step {ctx.task_id} completed with: {result}") - await super().on_step_complete(ctx, result) - -broker.add_middlewares(DebugMiddleware()) -``` - ---- - -## 11. Performance Considerations - -### 11.1. Concurrency Limits - -```python -# Limit total parallel tasks globally -from taskiq_flow.optimization.parallel import set_max_parallel_tasks -set_max_parallel_tasks(20) # never more than 20 tasks simultaneously -``` - -### 11.2. Selective Parallelism - -Not all tasks benefit from parallel execution: - -```python -# CPU-bound tasks: benefit from parallelism up to core count -# I/O-bound tasks: can handle higher parallelism -# Small/fast tasks: overhead may outweigh benefits - -# Tip: Profile with varying max_parallel values -pipeline.map(process_item, items, max_parallel=8) -``` - -### 11.3. Memory Footprint - -Parallel execution loads more data into memory: - -```python -# Process large datasets in chunks -chunks = split_into_chunks(large_list, chunk_size=100) -for chunk in chunks: - results = await pipeline.kiq_dataflow(chunk) - # process results before next chunk -``` - -See [Performance Guide]({{ '/en/guides/performance/' | relative_url }}) for detailed optimization strategies. - ---- - -## 12. 
Common Pitfalls - -| Issue | Cause | Solution | -|-------|-------|----------| -| Tasks run sequentially | `max_parallel=1` or sequential pipeline type | Use DataflowPipeline or increase parallelism | -| `wait_result()` hangs forever | Broker not shared, results lost | Use persistent broker (Redis) with result backend | -| Tasks receive wrong inputs | Incorrect parameter naming | Ensure `@pipeline_task(output=...)` matches downstream param names | -| Out-of-order results | Dataflow tasks finishing at different times | Results dict preserves output names, not execution order | -| Memory explosion | Unlimited parallelism | Set `max_parallel` or process in batches | -| Deadlock during execution | Circular dependency or missing external input | Check data flow graph for cycles; provide all external inputs | -| `kiq_dataflow()` raises "No DAG built" | No tasks added to pipeline | Use `DataflowPipeline.from_tasks()` or `add_dataflow_task()` | -| Partial results only | `continue_on_error=True` with failed tasks | Check `PipelineErrorAggregator` or execution report for details | - ---- - -## 13. 
Summary - -| Feature | Sequential Pipeline | DataflowPipeline | MapReduce | -|---------|--------------------|------------------|-----------| -| **Execution** | Linear chain | Automatic DAG | Parallel map + reduce | -| **Parallelism** | None (unless `.group()` used) | Automatic (independent tasks) | Explicit per-map call | -| **Control** | Manual chaining | Declarative dependencies | Batch-oriented | -| **Best for** | Simple linear workflows | Complex branched workflows | Bulk data transformation | - ---- - -## Next Steps - -- **[Pipelines Guide]({{ '/en/guides/pipelines/' | relative_url }})** — Choosing between pipeline types and patterns -- **[Dataflow Guide]({{ '/en/guides/dataflow/' | relative_url }})** — Complete guide to dataflow pipelines, DAGs, and decorators -- **[Tracking Guide]({{ '/en/guides/tracking/' | relative_url }})** — Monitoring pipeline status and history -- **[Performance Guide]({{ '/en/guides/performance/' | relative_url }})** — Tuning for speed and resource usage - ---- - -*Understanding execution is key to building reliable pipelines. Learn more about [Dataflow Pipelines]({{ '/en/guides/dataflow/' | relative_url }}) for complex workflows.* +--- +title: Pipeline Execution Guide +nav_order: 22 +--- +# Pipeline Execution Guide + +**Understanding execution models, modes, and result handling** + +> **Version**: {VERSION} | **Applies to**: SequentialPipeline, DataflowPipeline, MapReduce | **See also**: [Dataflow Guide]({{ '/en/guides/dataflow/' | relative_url }}) + +--- + +## Overview + +This guide covers how Taskiq-Flow executes pipelines, manages concurrency, handles errors, and returns results. + +--- + +## 1. Execution Models + +### 1.1. 
Sequential Execution (Classic Pipeline) + +The classic `Pipeline` executes steps one after another in a linear chain: + +```python +pipeline = Pipeline(broker).call_next(task1).call_next(task2).call_next(task3) +# Execution order: task1 → task2 → task3 (synchronously) +``` + +**Characteristics**: +- Each step waits for the previous to complete +- Results pass directly from one step to the next +- Predictable, deterministic execution order +- Suitable for linear workflows + +### 1.2. Parallel Execution (Dataflow & Map) + +`DataflowPipeline` automatically parallelizes independent tasks: + +```python +@broker.task +@pipeline_task(output="features") +def extract(tracks): ... + +@broker.task +@pipeline_task(output="tags") +def tag(features): ... # Runs after extract + +@broker.task +@pipeline_task(output="embedding") +def embed(features): ... # Also runs after extract, in parallel with tag + +pipeline = DataflowPipeline.from_tasks(broker, [extract, tag, embed]) +# DAG: extract → (tag & embed in parallel) +``` + +**Characteristics**: +- Tasks with no unmet dependencies run concurrently +- DAG determines execution order +- Maximum throughput for independent operations +- Controlled by `max_parallel` parameter on `.map()` and `.reduce()` + +### 1.3. Map-Reduce Parallelism + +The `MapReduce` utility explicitly processes items in parallel: + +```python +from taskiq_flow import MapReduce + +# Process 100 items with max 10 concurrent workers +result = await MapReduce.map( + broker, + process_item, + items=items_list, + output="processed", + max_parallel=10 # controls concurrency level +) +``` + +**Parallelism control**: +- `max_parallel=None` → unlimited concurrency (use with caution) +- `max_parallel=1` → sequential execution +- Recommended: `max_parallel = number_of_cpu_cores * 2` for CPU-bound tasks + +--- + +## 2. Starting a Pipeline + +There are several ways to kick off pipeline execution: + +### 2.1. 
`pipeline.kiq(...)` — Fire and Forget + +Returns a `Task` immediately; you must manually wait for results: + +```python +task = await pipeline.kiq(initial_input) +# Do other things... +result = await task.wait_result() # blocks until complete +``` + +Use when: +- You need to track the task ID for later status checks +- You want to start multiple pipelines concurrently +- You're building a task queue system + +### 2.2. `pipeline.kiq_dataflow(...)` — Dataflow Convenience + +Same as `kiq()` but specifically for DataflowPipeline, with clearer semantics: + +```python +results = await pipeline.kiq_dataflow(track_paths=["a.mp3", "b.mp3"]) +# Returns: dict mapping output names → values +``` + +### 2.3. `pipeline.kiq_map_reduce(...)` — Map-Reduce Shortcut + +Combined map and reduce in one call: + +```python +final = await pipeline.kiq_map_reduce( + items=items, + map_output="processed", + reduce_output="final" +) +``` + +--- + +## 3. Waiting for Results + +### 3.1. Blocking Wait + +```python +task = await pipeline.kiq(data) +result = await task.wait_result() # blocks +print(result.return_value) +``` + +**Options**: +- `wait_result(timeout=30)` — timeout in seconds (raises `asyncio.TimeoutError`) +- `wait_result(raise_on_error=True)` — re-raise exceptions from tasks + +### 3.2. Polling for Status + +```python +task = await pipeline.kiq(data) + +# Check periodically without blocking +while not task.is_finished: + await asyncio.sleep(0.5) + status = await task.get_status() + print(f"Status: {status}") +``` + +Useful for progress bars or interactive applications. + +### 3.3. Fetch by Task ID (Distributed) + +If you have only the task ID (from another process): + +```python +from taskiq import Task +task = Task(task_id="abc123", broker=broker) +result = await task.wait_result() +``` + +--- + +## 4. Error Handling + +### 4.1. 
Task-Level Errors + +When a single task fails, the pipeline either: + +- **Stops immediately** (default) — remaining tasks are cancelled +- **Continues** if configured with error handling policies + +```python +pipeline = Pipeline(broker) + +# Configure to continue despite errors +pipeline.on_error("continue") # options: "stop", "continue", "retry" + +# Or use a retry policy (see Retry Guide) +pipeline.with_retry( + max_attempts=3, + delay=5, + backoff=2 +) +``` + +### 4.2. Pipeline-Level Errors + +The entire pipeline may fail if: + +- A critical task (no consumers) fails +- A task times out +- The broker becomes unavailable + +Handle pipeline errors with try/except: + +```python +try: + result = await pipeline.kiq(data) + output = await result.wait_result() +except TaskiqError as exc: + print(f"Pipeline failed: {exc}") + # Access partial results if any + if result.is_failed: + print(f"Failed at step: {result.failed_step}") +``` + +### 4.3. Partial Results on Failure + +Even if a pipeline fails, you may have partial results from completed steps: + +```python +result = await pipeline.kiq(data) +try: + output = await result.wait_result() +except PipelineError: + # Some steps succeeded before failure + partial = result.partial_results # dict of completed outputs + print(f"Partial: {partial}") +``` + +--- + +## 5. Timeouts + +Set timeouts at the pipeline level: + +```python +pipeline = Pipeline(broker) + +# Global timeout for entire pipeline (seconds) +pipeline.with_timeout(60) + +# Or per-task timeout via taskiq task decorator +@broker.task(timeout=30) +def slow_task(): ... +``` + +**Timeout behavior**: +- Exceeding the timeout cancels the running task +- `asyncio.TimeoutError` is raised +- Pipeline state is set to `ERROR` + +--- + +## 6. 
Execution Context + +Each task receives an optional `context` parameter containing metadata: + +```python +from taskiq_flow import PipelineContext + +@broker.task +async def my_task(data: str, context: PipelineContext): + print(f"Pipeline ID: {context.pipeline_id}") + print(f"Step number: {context.step_index}") + print(f"Task ID: {context.task_id}") + return data.upper() +``` + +**Context fields**: + +| Field | Type | Description | +|-------|------|-------------| +| `pipeline_id` | `str` | Unique identifier of the pipeline instance | +| `step_index` | `int` | Index of this step in the pipeline sequence | +| `task_id` | `str` | ID of the underlying taskiq task | +| `execution_mode` | `str` | `"sequential"`, `"parallel"`, or `"map_reduce"` | +| `started_at` | `datetime` | Timestamp when pipeline started | +| `broker` | `BaseBroker` | Reference to the task broker | + +Enable context passing when building the pipeline: + +```python +pipeline = Pipeline(broker).with_context(enable=True) +``` + +--- + +## 7. Custom Execution Engines (Advanced) + +For low-level control, use `ExecutionEngine` directly: + +```python +from taskiq_flow import ExecutionEngine, DAGBuilder +from taskiq_flow.dataflow import DataflowRegistry + +# Build registry manually +registry = DataflowRegistry() +registry.register_task(load, output="raw", inputs=[]) +registry.register_task(process, output="clean", inputs=["raw"]) +registry.register_task(save, output="saved", inputs=["clean"]) + +# Build DAG +dag = registry.build_dag() + +# Create execution engine +engine = ExecutionEngine(broker, dag) + +# Execute with custom inputs +results = await engine.execute(inputs={"source_file": "data.csv"}) +print(results) # {"raw": ..., "clean": ..., "saved": ...} +``` + +**When to use ExecutionEngine**: +- Building dynamic pipelines at runtime +- Custom scheduling/logic outside Pipeline abstraction +- Inspecting DAG structure before execution +- Integrating with external workflow managers + +--- + +## 8. 
Result Shapes + +Different pipeline types return different result structures: + +### 8.1. Sequential Pipeline Results + +```python +task = await pipeline.kiq(input) +result = await task.wait_result() + +# result.return_value is the final output after all steps +# Example: [3, 3, 3, 3] from our quickstart pipeline +``` + +### 8.2. Dataflow Pipeline Results + +```python +result = await pipeline.kiq_dataflow(input_data) + +# Returns a dict mapping each output name to its value +{ + "features": {...}, + "tags": [...], + "embedding": [...] +} +``` + +### 8.3. MapReduce Results + +```python +mapped = await MapReduce.map(...) +print(mapped.return_value) # List of mapped results + +reduced = await MapReduce.reduce(...) +print(reduced.return_value) # Final aggregated result +``` + +--- + +## 9. Inspecting Pipeline State + +Query pipeline status during or after execution: + +```python +from taskiq_flow import PipelineTrackingManager + +tracking = PipelineTrackingManager().with_auto_storage(broker) +pipeline = Pipeline(broker).with_tracking(tracking) + +task = await pipeline.kiq(data) + +# Get detailed status +status = await tracking.get_status(pipeline.pipeline_id) +print(f"Status: {status.status}") # PENDING, RUNNING, COMPLETED, FAILED +print(f"Steps: {len(status.steps)}") # Number of completed steps +print(f"Started: {status.started_at}") +print(f"Completed: {status.completed_at}") + +# Get step-by-step history +for step in status.steps: + print(f" {step.name}: {step.status} ({step.duration_ms}ms)") +``` + +**Status values**: + +| Status | Meaning | +|--------|---------| +| `PENDING` | Pipeline queued, not started | +| `RUNNING` | Currently executing | +| `COMPLETED` | Finished successfully | +| `FAILED` | Terminated with error | +| `CANCELLED` | Manually cancelled | + +See [Tracking Guide]({{ '/en/guides/tracking/' | relative_url }}) for advanced monitoring. + +--- + +## 10. Debugging Execution + +### 10.1. 
Enable Logging + +```python +import logging +logging.basicConfig(level=logging.DEBUG) + +# Or configure specific loggers +logger = logging.getLogger("taskiq_flow") +logger.setLevel(logging.DEBUG) +``` + +### 10.2. Print DAG Before Execution + +```python +pipeline.print_dag() +# Shows execution levels and dependencies +``` + +### 10.3. Inspect Task Arguments + +```python +@broker.task +async def debug_task(data, context: PipelineContext): + print(f"Received: {data}") + print(f"Context: pipeline={context.pipeline_id}, step={context.step_index}") + return data +``` + +### 10.4. Middleware for Tracing + +```python +from taskiq_flow.middleware import PipelineMiddleware + +class DebugMiddleware(PipelineMiddleware): + async def on_step_complete(self, ctx, result): + print(f"Step {ctx.task_id} completed with: {result}") + await super().on_step_complete(ctx, result) + +broker.add_middlewares(DebugMiddleware()) +``` + +--- + +## 11. Performance Considerations + +### 11.1. Concurrency Limits + +```python +# Limit total parallel tasks globally +from taskiq_flow.optimization.parallel import set_max_parallel_tasks +set_max_parallel_tasks(20) # never more than 20 tasks simultaneously +``` + +### 11.2. Selective Parallelism + +Not all tasks benefit from parallel execution: + +```python +# CPU-bound tasks: benefit from parallelism up to core count +# I/O-bound tasks: can handle higher parallelism +# Small/fast tasks: overhead may outweigh benefits + +# Tip: Profile with varying max_parallel values +pipeline.map(process_item, items, max_parallel=8) +``` + +### 11.3. Memory Footprint + +Parallel execution loads more data into memory: + +```python +# Process large datasets in chunks +chunks = split_into_chunks(large_list, chunk_size=100) +for chunk in chunks: + results = await pipeline.kiq_dataflow(chunk) + # process results before next chunk +``` + +See [Performance Guide]({{ '/en/guides/performance/' | relative_url }}) for detailed optimization strategies. + +--- + +## 12. 
Common Pitfalls + +| Issue | Cause | Solution | +|-------|-------|----------| +| Tasks run sequentially | `max_parallel=1` or sequential pipeline type | Use DataflowPipeline or increase parallelism | +| `wait_result()` hangs forever | Broker not shared, results lost | Use persistent broker (Redis) with result backend | +| Tasks receive wrong inputs | Incorrect parameter naming | Ensure `@pipeline_task(output=...)` matches downstream param names | +| Out-of-order results | Dataflow tasks finishing at different times | Results dict preserves output names, not execution order | +| Memory explosion | Unlimited parallelism | Set `max_parallel` or process in batches | +| Deadlock during execution | Circular dependency or missing external input | Check data flow graph for cycles; provide all external inputs | +| `kiq_dataflow()` raises "No DAG built" | No tasks added to pipeline | Use `DataflowPipeline.from_tasks()` or `add_dataflow_task()` | +| Partial results only | `continue_on_error=True` with failed tasks | Check `PipelineErrorAggregator` or execution report for details | + +--- + +## 13. 
Summary + +| Feature | Sequential Pipeline | DataflowPipeline | MapReduce | +|---------|--------------------|------------------|-----------| +| **Execution** | Linear chain | Automatic DAG | Parallel map + reduce | +| **Parallelism** | None (unless `.group()` used) | Automatic (independent tasks) | Explicit per-map call | +| **Control** | Manual chaining | Declarative dependencies | Batch-oriented | +| **Best for** | Simple linear workflows | Complex branched workflows | Bulk data transformation | + +--- + +## Next Steps + +- **[Pipelines Guide]({{ '/en/guides/pipelines/' | relative_url }})** — Choosing between pipeline types and patterns +- **[Dataflow Guide]({{ '/en/guides/dataflow/' | relative_url }})** — Complete guide to dataflow pipelines, DAGs, and decorators +- **[Tracking Guide]({{ '/en/guides/tracking/' | relative_url }})** — Monitoring pipeline status and history +- **[Performance Guide]({{ '/en/guides/performance/' | relative_url }})** — Tuning for speed and resource usage + +--- + +*Understanding execution is key to building reliable pipelines. Learn more about [Dataflow Pipelines]({{ '/en/guides/dataflow/' | relative_url }}) for complex workflows.* diff --git a/docs/_en/guides/index.md b/docs/_en/guides/index.md index b2546f5..f4c454f 100644 --- a/docs/_en/guides/index.md +++ b/docs/_en/guides/index.md @@ -1,27 +1,27 @@ ---- -title: User Guides -nav_order: 15 -permalink: /en/guides/ ---- -# User Guides - -In-depth guides covering all Taskiq-Flow features. 
- -## Available Guides - -| Guide | Description | -|-------|-------------| -| **[Pipelines Guide]({{ '/en/guides/pipelines/' | relative_url }})** | Sequential & dataflow pipeline patterns | -| **[Tasks Guide]({{ '/en/guides/tasks/' | relative_url }})** | Task definitions & decorators | -| **[Execution Guide]({{ '/en/guides/execution/' | relative_url }})** | Execution modes & error handling | -| **[Storage & Cache Middleware]({{ '/en/guides/cache/' | relative_url }})** new in v1.2.0 | Centralized persistence & Dogpile worker caching | -| **[Tracking Guide]({{ '/en/guides/tracking/' | relative_url }})** | Real-time monitoring | -| **[WebSocket Guide]({{ '/en/guides/websocket/' | relative_url }})** | Live dashboards | -| **[Scheduling Guide]({{ '/en/guides/scheduling/' | relative_url }})** | Cron scheduling | -| **[Retry Guide]({{ '/en/guides/retry/' | relative_url }})** | Error recovery | -| **[Performance Guide]({{ '/en/guides/performance/' | relative_url }})** | Optimization | -| **[REST API Guide]({{ '/en/guides/api/' | relative_url }})** | FastAPI integration | - ---- - -*Start with the [Quick Start]({{ '/en/quickstart/' | relative_url }}) or browse all [Examples]({{ '/en/examples/' | relative_url }}).* +--- +title: User Guides +nav_order: 15 +permalink: /en/guides/ +--- +# User Guides + +In-depth guides covering all Taskiq-Flow features. 
+ +## Available Guides + +| Guide | Description | +|-------|-------------| +| **[Pipelines Guide]({{ '/en/guides/pipelines/' | relative_url }})** | Sequential & dataflow pipeline patterns | +| **[Tasks Guide]({{ '/en/guides/tasks/' | relative_url }})** | Task definitions & decorators | +| **[Execution Guide]({{ '/en/guides/execution/' | relative_url }})** | Execution modes & error handling | +| **[Storage & Cache Middleware]({{ '/en/guides/cache/' | relative_url }})** new in v1.2.0 | Centralized persistence & Dogpile worker caching | +| **[Tracking Guide]({{ '/en/guides/tracking/' | relative_url }})** | Real-time monitoring | +| **[WebSocket Guide]({{ '/en/guides/websocket/' | relative_url }})** | Live dashboards | +| **[Scheduling Guide]({{ '/en/guides/scheduling/' | relative_url }})** | Cron scheduling | +| **[Retry Guide]({{ '/en/guides/retry/' | relative_url }})** | Error recovery | +| **[Performance Guide]({{ '/en/guides/performance/' | relative_url }})** | Optimization | +| **[REST API Guide]({{ '/en/guides/api/' | relative_url }})** | FastAPI integration | + +--- + +*Start with the [Quick Start]({{ '/en/quickstart/' | relative_url }}) or browse all [Examples]({{ '/en/examples/' | relative_url }}).* diff --git a/docs/_en/guides/migration.md b/docs/_en/guides/migration.md index f409e26..5672a64 100644 --- a/docs/_en/guides/migration.md +++ b/docs/_en/guides/migration.md @@ -1,452 +1,452 @@ -# Migration Guide - -> **Version**: {VERSION} | **From**: v0.4.5 | **To**: v1.0.2 -> **Related**: [Changelog]({{ '/CHANGELOG/' | relative_url }}), [API Reference]({{ '/api/core/' | relative_url }}) - ---- - -## Overview - -This guide helps you migrate from Taskiq-Flow v0.4.5 to v1.0.2. The migration involves changes to imports, API signatures, configuration, and new features. - -**Breaking changes** are marked with . Non-breaking additions are marked with . - ---- - -## 1. 
Package & Import Changes - -### Renamed Modules - -| Old Import (v0.4.5) | New Import (v1.0.2) | -|----------------------|----------------------| -| `taskiq_flow.pipeline` | `taskiq_flow.pipeliner` (original `Pipeline` class) | -| `taskiq_flow.pipeline` | `taskiq_flow.pipeline` (new: `DataflowPipeline`) | -| `taskiq_flow.decorators.pipeline_task` | `taskiq_flow.decorators.pipeline_task` (unchanged) | - -### New Top-Level Exports - -```python -# v1.0.2 - New imports available from taskiq_flow -from taskiq_flow import ( - DAG, - DAGBuilder, - DAGNode, - DAGVisualizer, - DataNode, - DataflowPipeline, - DataflowRegistry, - ExecutionEngine, - HookManager, - MapReduce, - Pipeline, # Re-exported for backward compatibility - PipelineMiddleware, - PipelineScheduler, - PipelineTrackingManager, - TrackingStorageFactory, - create_visualization_api, - visualize_pipeline, -) -``` - ---- - -## 2. Configuration Changes - -### Security Now Enabled by Default - -In v0.4.5, security was optional and required explicit setup. 
In v1.0.2, `TaskiqFlowConfig` enables security by default: - -```python -# v0.4.5 - No default security -config = {} # No config needed - -# v1.0.2 - Security enabled by default -from taskiq_flow.config import TaskiqFlowConfig - -config = TaskiqFlowConfig( - security_enabled=True, # Default: True - auth_provider="api_key", # or "jwt" - api_keys={ - "my-key": { - "role": "admin", - "pipelines": ["*"], - "permissions": ["read", "execute", "admin"], - } - }, -) - -# Optionally disable security for development -config = TaskiqFlowConfig(security_enabled=False) -``` - -### New Configuration Options - -```python -config = TaskiqFlowConfig( - # Security - security_enabled=True, - auth_provider="api_key", # "api_key" | "jwt" - jwt_secret="your-secret", # pragma: allowlist secret - Required for JWT auth - api_keys={...}, - require_https=True, - - # Authorization - pipeline_acls={ - "my_pipeline": { - "read": ["admin", "viewer"], - "execute": ["admin", "worker"], - "admin": ["admin"], - } - }, - - # Rate Limiting - rate_limit_enabled=True, - rate_limit_default="100/minute", - - # Metrics - metrics_enabled=True, - metrics_path="/metrics", - - # WebSocket - websocket_require_auth=True, - websocket_max_connections=1000, -) -``` - ---- - -## 3. Pipeline Creation - -### `Pipeline` Class Moved - -The original `Pipeline` class has been moved to `pipeliner`. The top-level `Pipeline` is now a re-export for backward compatibility. 
- -```python -# v0.4.5 -from taskiq_flow.pipeline import Pipeline - -# v1.0.2 (still works via re-export) -from taskiq_flow import Pipeline -# Or explicitly: -from taskiq_flow.pipeliner import Pipeline -``` - -### New `DataflowPipeline` - -The new `DataflowPipeline` is the recommended way to create pipelines with DAG support: - -```python -# v1.0.2 -from taskiq_flow import DataflowPipeline, Pipeline, pipeline_task - -# Define tasks -@broker.task -@pipeline_task(output="processed_data") -async def process_data(data): - return await transform(data) - -@broker.task -@pipeline_task(output="result") -async def aggregate(data): - return await compute(data) - -# Create as Pipeline (backward compatible) -pipe = Pipeline(broker, process_data).call_next(aggregate) - -# Or as DataflowPipeline (new) -from taskiq import InMemoryBroker -broker: InMemoryBroker = ... -pipeline = DataflowPipeline(broker) -pipeline.add_task(process_data) -pipeline.add_task(aggregate) -``` - ---- - -## 4. Execution Engine - -### `ExecutionEngine` Replaces Direct Pipeline Execution - -For DAG-based execution, use `ExecutionEngine`: - -```python -# v0.4.5 - Sequential pipeline execution -pipe = Pipeline(broker, task_a).call_next(task_b) -result = await pipe.kiq(input_data) - -# v1.0.2 - DAG-based execution with ExecutionEngine -from taskiq_flow import ExecutionEngine, DataflowPipeline - -pipeline = DataflowPipeline(broker) -# ... build DAG ... 
- -engine = ExecutionEngine( - broker=broker, - dag=pipeline._dag, - max_parallel=10, -) - -outputs = await engine.execute( - inputs={"data": input_data}, - pipeline_id="my_pipeline", -) -``` - -### New Execution Options - -```python -engine = ExecutionEngine( - broker=broker, - dag=dag, - fail_fast=True, # Stop on first error (default) - continue_on_error=False, # Continue despite errors - skip_failed=False, # Skip failed tasks - error_mode=None, # Override fail_fast/continue_on_error - max_parallel=10, # Max concurrent tasks - resource_aware=False, # Enable resource-aware scheduling - adaptive_parallelism=False, # Dynamic parallelism per level -) -``` - ---- - -## 5. Middleware Changes - -### `PipelineMiddleware` Now Supports Metrics - -The `PipelineMiddleware` constructor accepts an optional `metrics_collector`: - -```python -from taskiq_flow import PipelineMiddleware -from taskiq_flow.metrics.collector import MetricsCollector - -metrics = MetricsCollector() -middleware = PipelineMiddleware( - metrics_collector=metrics, -) - -broker = InMemoryBroker().with_middlewares(middleware) -``` - -### `TransportMiddleware` Construction Changed - -```python -# v0.4.5 - Implicit WebSocket -transport = TransportMiddleware() - -# v1.0.2 - Explicit transport type -from taskiq_flow.middleware import TransportMiddleware - -# WebSocket (default) -transport = TransportMiddleware(transport_type="websocket") - -# HTTP Streaming (SSE) - NEW -transport = TransportMiddleware(transport_type="http_stream") - -# Redis Pub/Sub -transport = TransportMiddleware( - transport_type="redis_pubsub", - redis_client=my_redis_client, -) -``` - ---- - -## 6. 
Metrics System - -### New Prometheus Metrics - -Metrics are now collected via `MetricsCollector`: - -```python -from taskiq_flow import MetricsCollector - -collector = MetricsCollector() # Singleton - -# Available metrics: -# - taskiq_flow_pipeline_executions_total -# - taskiq_flow_pipeline_duration_seconds -# - taskiq_flow_pipeline_steps_total -# - taskiq_flow_active_pipelines -# - taskiq_flow_task_executions_total -# - taskiq_flow_task_duration_seconds -# - taskiq_flow_task_retry_attempts_total -# - taskiq_flow_websocket_messages_total -# - taskiq_flow_sse_events_sent_total (new) -# - taskiq_flow_worker_cpu_usage_percent -# - taskiq_flow_worker_memory_usage_bytes -``` - -### Prometheus Endpoint - -```python -from taskiq_flow.metrics.exporters.prometheus import get_metrics_endpoint - -app.get("/metrics")(get_metrics_endpoint()) -``` - ---- - -## 7. WebSocket Changes - -### New HTTP Streaming (SSE) Transport - -```python -from taskiq_flow.transport.http_stream import HTTPStreamTransport, get_http_stream_transport - -transport = get_http_stream_transport() - -# Get FastAPI endpoint -sse_endpoint = transport.get_sse_endpoint(pipeline_id="my_pipeline") -app.get("/events")(sse_endpoint) - -# Or broadcast manually -await transport.broadcast(event) -``` - -### WebSocket Authentication Now Properly Enforced - -v0.4.5 had a bypass in the WebSocket handler. v1.0.2 properly validates tokens: - -```python -# Client must send auth message first -# WebSocket connection: -ws = new WebSocket("ws://localhost:8000/ws/my_pipeline") - -# First message MUST be auth -ws.send(JSON.stringify({ - "action": "auth", - "token": "your-jwt-or-api-key" -})) - -# Then subscribe -ws.send(JSON.stringify({ - "action": "subscribe", - "channel": "pipeline.my_pipeline" -})) -``` - -### Authorization Checks on Subscribe - -```python -# Server-side ACL enforcement -if not self.authorization.can_read(pipeline_id, self.user): - # Reject subscription -``` - ---- - -## 8. 
Tracking System - -### Tracking Manager Improvements - -```python -from taskiq_flow import PipelineTrackingManager, TrackingStorageFactory - -# With Redis backend -storage = TrackingStorageFactory.create_redis_storage(redis_url="redis://localhost:6379") -tracking = PipelineTrackingManager(storage=storage) - -# Or with auto-detection -tracking = PipelineTrackingManager().with_auto_storage(broker) - -# Query status -status = await tracking.get_status(pipeline_id) -print(status.state) # PipelineState.RUNNING, SUCCESS, FAILED, etc. -``` - ---- - -## 9. Security Setup - -### New Security Module - -```python -from taskiq_flow.security import SecurityMiddleware -from taskiq_flow.security.auth import create_auth_provider -from taskiq_flow.security.authorization import PipelineAuthorization -from taskiq_flow.security.rate_limiting import RateLimiter - -# Auth Provider -auth_provider = create_auth_provider(config) - -# Authorization -authorization = PipelineAuthorization(pipeline_acls=config.pipeline_acls) - -# Rate Limiting -rate_limiter = RateLimiter(default_limits={ - "list_pipelines": "60/minute", - "get_dag": "120/minute", - "execute_pipeline": "10/minute", -}) -``` - ---- - -## 10. Step-by-Step Migration Checklist - -1. **Update imports** - Replace old module paths with new ones -2. **Add configuration** - Create `TaskiqFlowConfig` instance -3. **Set up security** - Configure auth provider if needed -4. **Update middleware** - Add metrics collector to `PipelineMiddleware` -5. **Choose transport** - Decide between WebSocket, SSE, or Redis Pub/Sub -6. **Test locally** - Run with `security_enabled=False` first -7. **Enable security** - Set up API keys or JWT tokens -8. **Add monitoring** - Configure Prometheus metrics endpoint -9. **Run tests** - Verify all existing tests pass -10. **Deploy** - Set environment variables for production - ---- - -## 11. 
Common Issues - -### Error: "No API key provided" - -``` -# Fix: Add API key to headers or configure auth_provider -X-API-Key: your-api-key -``` - -### Error: "Authorization required" on WebSocket - -```python -# Fix: Send auth message before subscribing -ws.send(JSON.stringify({"action": "auth", "token": "your-token"})) -``` - -### Error: "Unsupported transport type" - -```python -# Fix: Use valid transport type -TransportMiddleware(transport_type="websocket") # "websocket", "http_stream", or "redis_pubsub" -``` - -### Metrics not showing up - -```python -# Fix: Add metrics_collector to PipelineMiddleware -metrics = MetricsCollector() -middleware = PipelineMiddleware(metrics_collector=metrics) -``` - ---- - -## 12. Version Compatibility - -| Feature | v0.4.5 | v1.0.0 | v1.0.2 | -|---------|--------|--------|--------| -| Basic pipeline execution | | | | -| DAG-based dataflow | | | | -| Security module | | | | -| Metrics system | | | | -| WebSocket transport | | | | -| HTTP SSE transport | | | | -| Authorization ACLs | | | | -| Rate limiting | | | | -| Resource-aware executor | | | | -| Execution engine | | | | -| Auto-configuration | | | | - ---- - -*Migration guide created for v1.0.2 release* +# Migration Guide + +> **Version**: {VERSION} | **From**: v0.4.5 | **To**: v1.0.2 +> **Related**: [Changelog]({{ '/CHANGELOG/' | relative_url }}), [API Reference]({{ '/api/core/' | relative_url }}) + +--- + +## Overview + +This guide helps you migrate from Taskiq-Flow v0.4.5 to v1.0.2. The migration involves changes to imports, API signatures, configuration, and new features. + +**Breaking changes** are marked with . Non-breaking additions are marked with . + +--- + +## 1. 
Package & Import Changes + +### Renamed Modules + +| Old Import (v0.4.5) | New Import (v1.0.2) | +|----------------------|----------------------| +| `taskiq_flow.pipeline` | `taskiq_flow.pipeliner` (original `Pipeline` class) | +| `taskiq_flow.pipeline` | `taskiq_flow.pipeline` (new: `DataflowPipeline`) | +| `taskiq_flow.decorators.pipeline_task` | `taskiq_flow.decorators.pipeline_task` (unchanged) | + +### New Top-Level Exports + +```python +# v1.0.2 - New imports available from taskiq_flow +from taskiq_flow import ( + DAG, + DAGBuilder, + DAGNode, + DAGVisualizer, + DataNode, + DataflowPipeline, + DataflowRegistry, + ExecutionEngine, + HookManager, + MapReduce, + Pipeline, # Re-exported for backward compatibility + PipelineMiddleware, + PipelineScheduler, + PipelineTrackingManager, + TrackingStorageFactory, + create_visualization_api, + visualize_pipeline, +) +``` + +--- + +## 2. Configuration Changes + +### Security Now Enabled by Default + +In v0.4.5, security was optional and required explicit setup. 
In v1.0.2, `TaskiqFlowConfig` enables security by default: + +```python +# v0.4.5 - No default security +config = {} # No config needed + +# v1.0.2 - Security enabled by default +from taskiq_flow.config import TaskiqFlowConfig + +config = TaskiqFlowConfig( + security_enabled=True, # Default: True + auth_provider="api_key", # or "jwt" + api_keys={ + "my-key": { + "role": "admin", + "pipelines": ["*"], + "permissions": ["read", "execute", "admin"], + } + }, +) + +# Optionally disable security for development +config = TaskiqFlowConfig(security_enabled=False) +``` + +### New Configuration Options + +```python +config = TaskiqFlowConfig( + # Security + security_enabled=True, + auth_provider="api_key", # "api_key" | "jwt" + jwt_secret="your-secret", # pragma: allowlist secret - Required for JWT auth + api_keys={...}, + require_https=True, + + # Authorization + pipeline_acls={ + "my_pipeline": { + "read": ["admin", "viewer"], + "execute": ["admin", "worker"], + "admin": ["admin"], + } + }, + + # Rate Limiting + rate_limit_enabled=True, + rate_limit_default="100/minute", + + # Metrics + metrics_enabled=True, + metrics_path="/metrics", + + # WebSocket + websocket_require_auth=True, + websocket_max_connections=1000, +) +``` + +--- + +## 3. Pipeline Creation + +### `Pipeline` Class Moved + +The original `Pipeline` class has been moved to `pipeliner`. The top-level `Pipeline` is now a re-export for backward compatibility. 
+ +```python +# v0.4.5 +from taskiq_flow.pipeline import Pipeline + +# v1.0.2 (still works via re-export) +from taskiq_flow import Pipeline +# Or explicitly: +from taskiq_flow.pipeliner import Pipeline +``` + +### New `DataflowPipeline` + +The new `DataflowPipeline` is the recommended way to create pipelines with DAG support: + +```python +# v1.0.2 +from taskiq_flow import DataflowPipeline, Pipeline, pipeline_task + +# Define tasks +@broker.task +@pipeline_task(output="processed_data") +async def process_data(data): + return await transform(data) + +@broker.task +@pipeline_task(output="result") +async def aggregate(data): + return await compute(data) + +# Create as Pipeline (backward compatible) +pipe = Pipeline(broker, process_data).call_next(aggregate) + +# Or as DataflowPipeline (new) +from taskiq import InMemoryBroker +broker: InMemoryBroker = ... +pipeline = DataflowPipeline(broker) +pipeline.add_task(process_data) +pipeline.add_task(aggregate) +``` + +--- + +## 4. Execution Engine + +### `ExecutionEngine` Replaces Direct Pipeline Execution + +For DAG-based execution, use `ExecutionEngine`: + +```python +# v0.4.5 - Sequential pipeline execution +pipe = Pipeline(broker, task_a).call_next(task_b) +result = await pipe.kiq(input_data) + +# v1.0.2 - DAG-based execution with ExecutionEngine +from taskiq_flow import ExecutionEngine, DataflowPipeline + +pipeline = DataflowPipeline(broker) +# ... build DAG ... 
+ +engine = ExecutionEngine( + broker=broker, + dag=pipeline._dag, + max_parallel=10, +) + +outputs = await engine.execute( + inputs={"data": input_data}, + pipeline_id="my_pipeline", +) +``` + +### New Execution Options + +```python +engine = ExecutionEngine( + broker=broker, + dag=dag, + fail_fast=True, # Stop on first error (default) + continue_on_error=False, # Continue despite errors + skip_failed=False, # Skip failed tasks + error_mode=None, # Override fail_fast/continue_on_error + max_parallel=10, # Max concurrent tasks + resource_aware=False, # Enable resource-aware scheduling + adaptive_parallelism=False, # Dynamic parallelism per level +) +``` + +--- + +## 5. Middleware Changes + +### `PipelineMiddleware` Now Supports Metrics + +The `PipelineMiddleware` constructor accepts an optional `metrics_collector`: + +```python +from taskiq_flow import PipelineMiddleware +from taskiq_flow.metrics.collector import MetricsCollector + +metrics = MetricsCollector() +middleware = PipelineMiddleware( + metrics_collector=metrics, +) + +broker = InMemoryBroker().with_middlewares(middleware) +``` + +### `TransportMiddleware` Construction Changed + +```python +# v0.4.5 - Implicit WebSocket +transport = TransportMiddleware() + +# v1.0.2 - Explicit transport type +from taskiq_flow.middleware import TransportMiddleware + +# WebSocket (default) +transport = TransportMiddleware(transport_type="websocket") + +# HTTP Streaming (SSE) - NEW +transport = TransportMiddleware(transport_type="http_stream") + +# Redis Pub/Sub +transport = TransportMiddleware( + transport_type="redis_pubsub", + redis_client=my_redis_client, +) +``` + +--- + +## 6. 
Metrics System + +### New Prometheus Metrics + +Metrics are now collected via `MetricsCollector`: + +```python +from taskiq_flow import MetricsCollector + +collector = MetricsCollector() # Singleton + +# Available metrics: +# - taskiq_flow_pipeline_executions_total +# - taskiq_flow_pipeline_duration_seconds +# - taskiq_flow_pipeline_steps_total +# - taskiq_flow_active_pipelines +# - taskiq_flow_task_executions_total +# - taskiq_flow_task_duration_seconds +# - taskiq_flow_task_retry_attempts_total +# - taskiq_flow_websocket_messages_total +# - taskiq_flow_sse_events_sent_total (new) +# - taskiq_flow_worker_cpu_usage_percent +# - taskiq_flow_worker_memory_usage_bytes +``` + +### Prometheus Endpoint + +```python +from taskiq_flow.metrics.exporters.prometheus import get_metrics_endpoint + +app.get("/metrics")(get_metrics_endpoint()) +``` + +--- + +## 7. WebSocket Changes + +### New HTTP Streaming (SSE) Transport + +```python +from taskiq_flow.transport.http_stream import HTTPStreamTransport, get_http_stream_transport + +transport = get_http_stream_transport() + +# Get FastAPI endpoint +sse_endpoint = transport.get_sse_endpoint(pipeline_id="my_pipeline") +app.get("/events")(sse_endpoint) + +# Or broadcast manually +await transport.broadcast(event) +``` + +### WebSocket Authentication Now Properly Enforced + +v0.4.5 had a bypass in the WebSocket handler. v1.0.2 properly validates tokens: + +```javascript +// Client side (browser JavaScript): the client must send an auth message first +const ws = new WebSocket("ws://localhost:8000/ws/my_pipeline") + +// Wait for the connection to open; send() throws while the socket is still connecting +ws.onopen = () => { + // First message MUST be auth + ws.send(JSON.stringify({ + "action": "auth", + "token": "your-jwt-or-api-key" + })) + + // Then subscribe + ws.send(JSON.stringify({ + "action": "subscribe", + "channel": "pipeline.my_pipeline" + })) +} +``` + +### Authorization Checks on Subscribe + +```python +# Server-side ACL enforcement +if not self.authorization.can_read(pipeline_id, self.user): + # Reject subscription +``` + +--- + +## 8. 
Tracking System + +### Tracking Manager Improvements + +```python +from taskiq_flow import PipelineTrackingManager, TrackingStorageFactory + +# With Redis backend +storage = TrackingStorageFactory.create_redis_storage(redis_url="redis://localhost:6379") +tracking = PipelineTrackingManager(storage=storage) + +# Or with auto-detection +tracking = PipelineTrackingManager().with_auto_storage(broker) + +# Query status +status = await tracking.get_status(pipeline_id) +print(status.state) # PipelineState.RUNNING, SUCCESS, FAILED, etc. +``` + +--- + +## 9. Security Setup + +### New Security Module + +```python +from taskiq_flow.security import SecurityMiddleware +from taskiq_flow.security.auth import create_auth_provider +from taskiq_flow.security.authorization import PipelineAuthorization +from taskiq_flow.security.rate_limiting import RateLimiter + +# Auth Provider +auth_provider = create_auth_provider(config) + +# Authorization +authorization = PipelineAuthorization(pipeline_acls=config.pipeline_acls) + +# Rate Limiting +rate_limiter = RateLimiter(default_limits={ + "list_pipelines": "60/minute", + "get_dag": "120/minute", + "execute_pipeline": "10/minute", +}) +``` + +--- + +## 10. Step-by-Step Migration Checklist + +1. **Update imports** - Replace old module paths with new ones +2. **Add configuration** - Create `TaskiqFlowConfig` instance +3. **Set up security** - Configure auth provider if needed +4. **Update middleware** - Add metrics collector to `PipelineMiddleware` +5. **Choose transport** - Decide between WebSocket, SSE, or Redis Pub/Sub +6. **Test locally** - Run with `security_enabled=False` first +7. **Enable security** - Set up API keys or JWT tokens +8. **Add monitoring** - Configure Prometheus metrics endpoint +9. **Run tests** - Verify all existing tests pass +10. **Deploy** - Set environment variables for production + +--- + +## 11. 
Common Issues + +### Error: "No API key provided" + +``` +# Fix: Add API key to headers or configure auth_provider +X-API-Key: your-api-key +``` + +### Error: "Authorization required" on WebSocket + +```python +# Fix: Send auth message before subscribing +ws.send(JSON.stringify({"action": "auth", "token": "your-token"})) +``` + +### Error: "Unsupported transport type" + +```python +# Fix: Use valid transport type +TransportMiddleware(transport_type="websocket") # "websocket", "http_stream", or "redis_pubsub" +``` + +### Metrics not showing up + +```python +# Fix: Add metrics_collector to PipelineMiddleware +metrics = MetricsCollector() +middleware = PipelineMiddleware(metrics_collector=metrics) +``` + +--- + +## 12. Version Compatibility + +| Feature | v0.4.5 | v1.0.0 | v1.0.2 | +|---------|--------|--------|--------| +| Basic pipeline execution | | | | +| DAG-based dataflow | | | | +| Security module | | | | +| Metrics system | | | | +| WebSocket transport | | | | +| HTTP SSE transport | | | | +| Authorization ACLs | | | | +| Rate limiting | | | | +| Resource-aware executor | | | | +| Execution engine | | | | +| Auto-configuration | | | | + +--- + +*Migration guide created for v1.0.2 release* diff --git a/docs/_en/guides/performance.md b/docs/_en/guides/performance.md index 12ce4dd..0d0e549 100644 --- a/docs/_en/guides/performance.md +++ b/docs/_en/guides/performance.md @@ -1,557 +1,583 @@ ---- -title: Performance Optimization Guide -nav_order: 27 -color_scheme: dark ---- -# Performance Optimization Guide - -**Resource-aware parallelism, memory optimization, and scaling strategies** - -> **Version**: {VERSION} | **Related**: [Execution Guide]({{ '/en/guides/execution/' | relative_url }}), [Tracking Guide]({{ '/en/guides/tracking/' | relative_url }}) - ---- - -## Overview - -Taskiq-Flow is designed for high-performance asynchronous execution. This guide covers optimization techniques to maximize throughput, minimize latency, and efficiently use system resources. 
- -Topics covered: - -- Parallelism tuning (`max_parallel`) -- CPU and RAM profiling -- Task resource profiles -- Memory management strategies -- Bottleneck identification -- Scaling from single worker to distributed - ---- - -## 1. Understanding the Performance Landscape - -Performance optimization involves tradeoffs between: - -| Dimension | What it affects | Typical trade-off | -|-----------|-----------------|-------------------| -| **Concurrency** | Throughput (# tasks/second) | Memory usage, context switching | -| **Parallelism** | CPU utilization | Overhead of coordination | -| **Latency** | Task completion time | Resource consumption | -| **Memory** | Dataset size capacity | GC pauses, cache efficiency | -| **I/O** | External service calls | Network bandwidth, connection limits | - -**Key insight**: Taskiq-Flow's parallelism is bounded by `max_parallel` settings across pipeline steps, and by available system resources (CPU cores, RAM). - ---- - -## 2. Parallelism Tuning - -### 2.1. The `max_parallel` Parameter - -Control concurrent task execution at the step level: - -```python -# Sequential Pipeline -pipeline.map(process_item, items, max_parallel=10) # Max 10 concurrent - -# Dataflow Pipeline: configure at pipeline level -pipeline = DataflowPipeline(broker, max_parallel=20) - -# MapReduce -mapped = await MapReduce.map( - broker, - process_item, - items, - max_parallel=15 -) -``` - -**Default behavior**: Without `max_parallel`, Taskiq-Flow attempts to run all independent tasks concurrently (essentially unlimited). This is fine for small numbers (<100) but dangerous for large datasets. - -### 2.2. Determining Optimal `max_parallel` - -#### For I/O-Bound Tasks (network calls, disk I/O) - -```python -# High I/O wait, low CPU: can handle many concurrent tasks -pipeline.map(fetch_url, url_list, max_parallel=50) -# Rule of thumb: 2–5× number of CPU cores -``` - -**Rationale**: While one task waits for network, another uses CPU. 
High concurrency saturates I/O pipelines. - -#### For CPU-Bound Tasks (computations, transcoding) - -```python -# CPU-intensive: limit to core count (or slightly higher) -import os -cpu_cores = os.cpu_count() or 4 -pipeline.map(transcode, files, max_parallel=cpu_cores + 2) -# Rule of thumb: CPU cores ± 2 -``` - -**Rationale**: Python's GIL limits true parallelism; `asyncio` still benefits from multiple cores when tasks release GIL (NumPy, C extensions). Over-subscription leads to context switching overhead. - -#### For Mixed Workloads - -Profile and adjust: - -```python -# Start conservative -for parallel in [5, 10, 20, 50]: - start = time.time() - await pipeline.kiq_dataflow(data) - duration = time.time() - start - print(f"Parallelism {parallel}: {duration:.2f}s") -``` - -Find the **knee of the curve** — point where increasing parallelism yields diminishing returns. - -### 2.3. Global Parallelism Limit - -Set a global cap across all pipelines: - -```python -from taskiq_flow.optimization.parallel import set_max_parallel_tasks - -set_max_parallel_tasks(100) # Never exceed 100 concurrent tasks globally -``` - -Useful in multi-tenant systems to prevent one pipeline from starving others. - ---- - -## 3. Resource-Aware Scheduling - -Taskiq-Flow can schedule tasks based on CPU/RAM requirements (requires resource-aware worker pool — advanced). - -### 3.1. Annotating Tasks with Resource Needs - -```python -from taskiq_flow import CPUProfile, RAMProfile - -@broker.task -@CPUProfile(cpu_units=2) # Needs 2 CPU cores -@RAMProfile(ram_mb=4096) # Needs 4GB RAM -def heavy_computation(data): - # Will only run on workers with sufficient resources - pass -``` - -### 3.2. 
Resource-Aware Worker Pool - -```python -from taskiq_flow import ResourceAwareWorkerPool - -pool = ResourceAwareWorkerPool( - workers=[ - {"cpu_cores": 8, "ram_gb": 32, "labels": {"gpu": True}}, - {"cpu_cores": 4, "ram_gb": 16, "labels": {"gpu": False}}, - ] -) - -# Tasks are automatically routed to compatible workers -``` - -**Note**: This feature requires custom worker implementation; standard brokers ignore resource profiles. - ---- - -## 4. Memory Optimization - -### 4.1. Avoid Large In-Memory Data Transfers - -Pass references instead of full data: - -```python -# Bad: copies entire dataset per task call -pipeline.map(process, large_dataset) # Each task gets full dataset copy - -# Better: pass identifiers, fetch inside task -@broker.task -def process(item_id: str): - item = database.get(item_id) # Fetch on-demand - return process_item(item) - -pipeline.map(process, item_ids) # Only IDs passed -``` - -### 4.2. Stream Large Datasets - -Use chunking: - -```python -def chunked(iterable, chunk_size=100): - for i in range(0, len(iterable), chunk_size): - yield iterable[i:i + chunk_size] - -for chunk in chunked(large_list, 100): - results = await pipeline.kiq_dataflow(chunk) - # Process results before next chunk to free memory -``` - -### 4.3. Clear Results After Use - -Pipeline results stay in tracking storage. Clean up after you're done: - -```python -# After processing, delete pipeline record -await tracking.delete_pipeline(pipeline.pipeline_id) -``` - -Or set TTL on storage: - -```python -RedisPipelineStorage(redis, ttl_seconds=86400) # Auto-delete after 1 day -``` - ---- - -## 5. Profiling & Bottleneck Detection - -### 5.1. Built-in Timing - -Each step records duration automatically (with tracking enabled): - -```python -status = await tracking.get_status(pipeline_id) -for step in status.steps: - print(f"{step.name}: {step.duration_ms}ms") -``` - -Identify slowest steps → optimization targets. - -### 5.2. 
Memory Profiling - -Use Python's `tracemalloc`: - -```python -import tracemalloc - -tracemalloc.start() - -# Run pipeline -await pipeline.kiq(data) - -# Check memory usage -current, peak = tracemalloc.get_traced_memory() -print(f"Current: {current/1024/1024:.1f} MB") -print(f"Peak: {peak/1024/1024:.1f} MB") -tracemalloc.stop() -``` - -### 5.3. CPU Profiling - -```python -import cProfile -import pstats - -profiler = cProfile.Profile() -profiler.enable() - -await pipeline.kiq(data) - -profiler.disable() -stats = pstats.Stats(profiler) -stats.sort_stats('cumulative') -stats.print_stats(20) # Top 20 functions -``` - -### 5.4. Async-Specific Profiling - -`uvloop` for faster event loop: - -```python -import uvloop -uvloop.install() # Replaces default asyncio event loop -``` - -Benchmark improvement: `uvloop` can provide 2×–3× speedup for I/O-bound workloads. - ---- - -## 6. Database/External Service Optimization - -### 6.1. Connection Pooling - -For databases (PostgreSQL, Redis), reuse connections: - -```python -from asyncpg import create_pool - -pool = await create_pool(database="...", min_size=5, max_size=20) - -@broker.task -async def db_task(query: str): - async with pool.acquire() as conn: - return await conn.fetch(query) -``` - -### 6.2. Batch Operations - -Instead of many small calls, batch: - -```python -# N separate calls -for item in items: - await db.insert(item) - -# Single batch insert -await db.bulk_insert(items) -``` - -### 6.3. Cache Results - -```python -from functools import lru_cache - -@broker.task -@lru_cache(maxsize=1000) -def expensive_computation(key: str): - return compute(key) -``` - -Or use Redis cache: - -```python -import redis -cache = redis.Redis(...) - -@broker.task -async def cached_task(key: str): - cached = await cache.get(key) - if cached: - return json.loads(cached) - result = await compute(key) - await cache.setex(key, 3600, json.dumps(result)) - return result -``` - ---- - -## 7. Distributed Scaling - -### 7.1. 
Multiple Workers - -Scale horizontally by running multiple worker processes: - -```bash -# Terminal 1 -taskiq worker --broker redis://localhost:6379 - -# Terminal 2 -taskiq worker --broker redis://localhost:6379 - -# Terminal 3 -taskiq worker --broker redis://localhost:6379 -``` - -All workers share the same broker (Redis) and process tasks concurrently. - -**Throughput ≈ (# workers) × (tasks/worker/second)**. - -### 7.2. Worker Pool Management - -Use a process manager (systemd, supervisord, Docker Compose): - -```yaml -# docker-compose.yml -services: - worker-1: - image: taskiq-flow-worker - command: taskiq worker --broker ${REDIS_URL} - worker-2: - image: taskiq-flow-worker - command: taskiq worker --broker ${REDIS_URL} - worker-3: - image: taskiq-flow-worker - command: taskiq worker --broker ${REDIS_URL} -``` - -### 7.3. Queue Prioritization - -Route critical pipelines to dedicated queues: - -```python -@broker.task(queue="high_priority") -def critical_task(): ... - -# Workers can be configured to process specific queues first -``` - -### 7.4. Geo-Distribution - -For low-latency global deployments, deploy workers in multiple regions with a global broker (Kafka) or regional Redis clusters with replication. - ---- - -## 8. Benchmarking - -Measure before and after optimization: - -```python -import time - -async def benchmark(pipeline, iterations=10): - durations = [] - for _ in range(iterations): - start = time.perf_counter() - result = await pipeline.kiq(data) - await result.wait_result() - duration = time.perf_counter() - start - durations.append(duration) - - avg = sum(durations) / len(durations) - p95 = sorted(durations)[int(0.95 * len(durations))] - print(f"Average: {avg:.3f}s, P95: {p95:.3f}s") - return durations -``` - -**Key metrics**: - -- **Throughput**: tasks/second -- **P50/P95/P99 latency**: median, 95th, 99th percentile -- **Memory peak**: maximum RSS/resident set size -- **CPU utilization**: % of cores used - ---- - -## 9. 
Production Checklist - -- [ ] Set `max_parallel` appropriately for task type (CPU vs I/O) -- [ ] Use connection pooling for external services -- [ ] Enable Redis storage for tracking (avoid memory leaks) -- [ ] Set TTL on tracking/result storage -- [ ] Configure timeouts on all tasks -- [ ] Add retry policies with backoff and jitter -- [ ] Monitor memory usage and set alerts -- [ ] Profile slow tasks with cProfile/tracemalloc -- [ ] Scale workers horizontally based on queue depth -- [ ] Use queue priorities for critical pipelines -- [ ] Implement DLQ and review failed tasks regularly -- [ ] Test failure scenarios (network partitions, service outages) - ---- - -## 10. Troubleshooting Performance - -### Pipeline Running Slowly - -**Diagnostic steps**: - -1. Check step durations in tracking: - ```python - status = await tracking.get_status(pipeline_id) - slowest = max(status.steps, key=lambda s: s.duration_ms) - print(f"Slowest step: {slowest.name} at {slowest.duration_ms}ms") - ``` - -2. Profile with cProfile to see where time is spent -3. Verify `max_parallel` not too low -4. Check for blocking I/O (use async libraries) - -### High Memory Usage - -**Causes & fixes**: - -| Cause | Fix | -|-------|-----| -| Large dataset in single step | Chunk data, process in batches | -| Results accumulating in tracking storage | Set TTL, delete after use | -| Memory leak in task code | Profile with `tracemalloc`, fix leaks | -| Too many parallel tasks | Reduce `max_parallel` | - -### Worker Starvation - -**Symptom**: Tasks queued but not executing. - -**Fixes**: -- Increase number of worker processes -- Ensure broker (Redis) has enough connections -- Check for long-running tasks blocking queue -- Consider task priorities or separate queues - ---- - -## 11. 
Advanced: Custom Executors - -For specialized workloads, implement custom executors: - -```python -from taskiq_flow import ExecutionEngine -from taskiq_flow.dataflow import DAG - -class GPUOptimizedEngine(ExecutionEngine): - async def schedule_task(self, task_node, inputs): - # Custom scheduling logic: route GPU tasks to GPU workers - if task_node.labels.get("requires_gpu"): - return await self.gpu_worker_pool.submit(task_node, inputs) - return await super().schedule_task(task_node, inputs) - -engine = GPUOptimizedEngine(broker, dag) -results = await engine.execute(inputs) -``` - -### 11.1. Resource-Aware Execution with `TaskResourceProfile` - -Taskiq-Flow provides a resource-aware execution pattern for pipelines that need -to allocate tasks to workers based on their CPU/RAM requirements: - -```python -from taskiq_flow import ResourceAwareExecutor, TaskResourceProfile -from taskiq_flow.dataflow import DataflowPipeline - -# Define a resource profile for heavy tasks -heavy_profile = TaskResourceProfile( - estimated_memory_mb=2048, - estimated_cpu_cores=4.0, -) - -# Annotate tasks with resource needs via labels when creating the pipeline -pipeline = DataflowPipeline( - broker=broker, - name="resource_aware_pipeline", - resource_aware=True, -) - -@pipeline.task(resource_profile=heavy_profile) -def heavy_computation(data: dict) -> dict: - """This task requires 4 CPU cores and 2 GB of RAM.""" - return process_heavy_data(data) - -# Configure the executor to respect resource profiles -executor = ResourceAwareExecutor( - broker=broker, - max_parallel=10, -) -executor.run_pipeline(pipeline, input_data) -``` - -`ResourceAwareExecutor` evaluates resource profiles of tasks and distributes them -to available workers based on their capacity. `TaskResourceProfile` lets you -annotate each task with its estimated resource needs, enabling the executor to -prevent over-subscription of workers. - ---- - -## 12. Summary - -Performance optimization is iterative: - -1. 
**Measure** — establish baseline with benchmarks -2. **Identify** — find bottlenecks with profiling -3. **Tune** — adjust `max_parallel`, resource profiles, batching -4. **Scale** — add workers, optimize external services -5. **Monitor** — track metrics in production -6. **Repeat** — optimization never ends - ---- - -## Next Steps - -- **[Tracking Guide]({{ '/en/guides/tracking/' | relative_url }})** — Monitor pipeline metrics -- **[Dataflow Guide]({{ '/en/guides/dataflow/' | relative_url }})** — Complete guide to DAG pipelines and dataflow architecture -- **[API Guide]({{ '/en/guides/api/' | relative_url }})** — Build custom dashboards for performance -- **[Example: Dataflow Audio Pipeline]({{ '/en/examples/dataflow-audio-pipeline/' | relative_url }})** — See optimization in action - ---- - -*Go fast, but measure first.* +--- +title: Performance Optimization Guide +nav_order: 27 +color_scheme: dark +--- +# Performance Optimization Guide + +**Resource-aware parallelism, memory optimization, and scaling strategies** + +> **Version**: {VERSION} | **Related**: [Execution Guide]({{ '/en/guides/execution/' | relative_url }}), [Tracking Guide]({{ '/en/guides/tracking/' | relative_url }}) + +--- + +## Overview + +Taskiq-Flow is designed for high-performance asynchronous execution. This guide covers optimization techniques to maximize throughput, minimize latency, and efficiently use system resources. + +Topics covered: + +- Parallelism tuning (`max_parallel`) +- CPU and RAM profiling +- Task resource profiles +- Memory management strategies +- Bottleneck identification +- Scaling from single worker to distributed + +--- + +## 1. 
Understanding the Performance Landscape + +Performance optimization involves tradeoffs between: + +| Dimension | What it affects | Typical trade-off | +|-----------|-----------------|-------------------| +| **Concurrency** | Throughput (# tasks/second) | Memory usage, context switching | +| **Parallelism** | CPU utilization | Overhead of coordination | +| **Latency** | Task completion time | Resource consumption | +| **Memory** | Dataset size capacity | GC pauses, cache efficiency | +| **I/O** | External service calls | Network bandwidth, connection limits | + +**Key insight**: Taskiq-Flow's parallelism is bounded by `max_parallel` settings across pipeline steps, and by available system resources (CPU cores, RAM). + +--- + +## 2. Parallelism Tuning + +### 2.1. The `max_parallel` Parameter + +Control concurrent task execution at the step level: + +{% raw %} +```python +# Sequential Pipeline +pipeline.map(process_item, items, max_parallel=10) # Max 10 concurrent + +# Dataflow Pipeline: configure at pipeline level +pipeline = DataflowPipeline(broker, max_parallel=20) + +# MapReduce +mapped = await MapReduce.map( + broker, + process_item, + items, + max_parallel=15 +) +``` +{% endraw %} +**Default behavior**: Without `max_parallel`, Taskiq-Flow attempts to run all independent tasks concurrently (essentially unlimited). This is fine for small numbers (<100) but dangerous for large datasets. + +### 2.2. Determining Optimal `max_parallel` + +#### For I/O-Bound Tasks (network calls, disk I/O) + +{% raw %} +```python +# High I/O wait, low CPU: can handle many concurrent tasks +pipeline.map(fetch_url, url_list, max_parallel=50) +# Rule of thumb: 2–5× number of CPU cores +``` +{% endraw %} +**Rationale**: While one task waits for network, another uses CPU. High concurrency saturates I/O pipelines. 
+ +#### For CPU-Bound Tasks (computations, transcoding) + +{% raw %} +```python +# CPU-intensive: limit to core count (or slightly higher) +import os +cpu_cores = os.cpu_count() or 4 +pipeline.map(transcode, files, max_parallel=cpu_cores + 2) +# Rule of thumb: CPU cores ± 2 +``` +{% endraw %} +**Rationale**: Python's GIL limits true parallelism; `asyncio` still benefits from multiple cores when tasks release GIL (NumPy, C extensions). Over-subscription leads to context switching overhead. + +#### For Mixed Workloads + +Profile and adjust: + +{% raw %} +```python +# Start conservative +for parallel in [5, 10, 20, 50]: + start = time.time() + await pipeline.kiq_dataflow(data) + duration = time.time() - start + print(f"Parallelism {parallel}: {duration:.2f}s") +``` +{% endraw %} +Find the **knee of the curve** — point where increasing parallelism yields diminishing returns. + +### 2.3. Global Parallelism Limit + +Set a global cap across all pipelines: + +{% raw %} +```python +from taskiq_flow.optimization.parallel import set_max_parallel_tasks + +set_max_parallel_tasks(100) # Never exceed 100 concurrent tasks globally +``` +{% endraw %} +Useful in multi-tenant systems to prevent one pipeline from starving others. + +--- + +## 3. Resource-Aware Scheduling + +Taskiq-Flow can schedule tasks based on CPU/RAM requirements (requires resource-aware worker pool — advanced). + +### 3.1. Annotating Tasks with Resource Needs + +{% raw %} +```python +from taskiq_flow import CPUProfile, RAMProfile + +@broker.task +@CPUProfile(cpu_units=2) # Needs 2 CPU cores +@RAMProfile(ram_mb=4096) # Needs 4GB RAM +def heavy_computation(data): + # Will only run on workers with sufficient resources + pass +``` +{% endraw %} +### 3.2. 
Resource-Aware Worker Pool + +{% raw %} +```python +from taskiq_flow import ResourceAwareWorkerPool + +pool = ResourceAwareWorkerPool( + workers=[ + {"cpu_cores": 8, "ram_gb": 32, "labels": {"gpu": True}}, + {"cpu_cores": 4, "ram_gb": 16, "labels": {"gpu": False}}, + ] +) + +# Tasks are automatically routed to compatible workers +``` +{% endraw %} +**Note**: This feature requires custom worker implementation; standard brokers ignore resource profiles. + +--- + +## 4. Memory Optimization + +### 4.1. Avoid Large In-Memory Data Transfers + +Pass references instead of full data: + +{% raw %} +```python +# Bad: copies entire dataset per task call +pipeline.map(process, large_dataset) # Each task gets full dataset copy + +# Better: pass identifiers, fetch inside task +@broker.task +def process(item_id: str): + item = database.get(item_id) # Fetch on-demand + return process_item(item) + +pipeline.map(process, item_ids) # Only IDs passed +``` +{% endraw %} +### 4.2. Stream Large Datasets + +Use chunking: + +{% raw %} +```python +def chunked(iterable, chunk_size=100): + for i in range(0, len(iterable), chunk_size): + yield iterable[i:i + chunk_size] + +for chunk in chunked(large_list, 100): + results = await pipeline.kiq_dataflow(chunk) + # Process results before next chunk to free memory +``` +{% endraw %} +### 4.3. Clear Results After Use + +Pipeline results stay in tracking storage. Clean up after you're done: + +{% raw %} +```python +# After processing, delete pipeline record +await tracking.delete_pipeline(pipeline.pipeline_id) +``` +{% endraw %} +Or set TTL on storage: + +{% raw %} +```python +RedisPipelineStorage(redis, ttl_seconds=86400) # Auto-delete after 1 day +``` +{% endraw %} +--- + +## 5. Profiling & Bottleneck Detection + +### 5.1. 
Built-in Timing + +Each step records duration automatically (with tracking enabled): + +{% raw %} +```python +status = await tracking.get_status(pipeline_id) +for step in status.steps: + print(f"{step.name}: {step.duration_ms}ms") +``` +{% endraw %} +Identify slowest steps → optimization targets. + +### 5.2. Memory Profiling + +Use Python's `tracemalloc`: + +{% raw %} +```python +import tracemalloc + +tracemalloc.start() + +# Run pipeline +await pipeline.kiq(data) + +# Check memory usage +current, peak = tracemalloc.get_traced_memory() +print(f"Current: {current/1024/1024:.1f} MB") +print(f"Peak: {peak/1024/1024:.1f} MB") +tracemalloc.stop() +``` +{% endraw %} +### 5.3. CPU Profiling + +{% raw %} +```python +import cProfile +import pstats + +profiler = cProfile.Profile() +profiler.enable() + +await pipeline.kiq(data) + +profiler.disable() +stats = pstats.Stats(profiler) +stats.sort_stats('cumulative') +stats.print_stats(20) # Top 20 functions +``` +{% endraw %} +### 5.4. Async-Specific Profiling + +`uvloop` for faster event loop: + +{% raw %} +```python +import uvloop +uvloop.install() # Replaces default asyncio event loop +``` +{% endraw %} +Benchmark improvement: `uvloop` can provide 2×–3× speedup for I/O-bound workloads. + +--- + +## 6. Database/External Service Optimization + +### 6.1. Connection Pooling + +For databases (PostgreSQL, Redis), reuse connections: + +{% raw %} +```python +from asyncpg import create_pool + +pool = await create_pool(database="...", min_size=5, max_size=20) + +@broker.task +async def db_task(query: str): + async with pool.acquire() as conn: + return await conn.fetch(query) +``` +{% endraw %} +### 6.2. Batch Operations + +Instead of many small calls, batch: + +{% raw %} +```python +# N separate calls +for item in items: + await db.insert(item) + +# Single batch insert +await db.bulk_insert(items) +``` +{% endraw %} +### 6.3. 
Cache Results + +{% raw %} +```python +from functools import lru_cache + +@broker.task +@lru_cache(maxsize=1000) +def expensive_computation(key: str): + return compute(key) +``` +{% endraw %} +Or use Redis cache: + +{% raw %} +```python +import json + +import redis.asyncio as redis # async client, so the awaits below are valid + +cache = redis.Redis(...) + +@broker.task +async def cached_task(key: str): + cached = await cache.get(key) + if cached: + return json.loads(cached) + result = await compute(key) + await cache.setex(key, 3600, json.dumps(result)) # expire after 1 hour + return result +``` +{% endraw %} +--- + +## 7. Distributed Scaling + +### 7.1. Multiple Workers + +Scale horizontally by running multiple worker processes: + +{% raw %} +```bash +# Terminal 1 +taskiq worker --broker redis://localhost:6379 + +# Terminal 2 +taskiq worker --broker redis://localhost:6379 + +# Terminal 3 +taskiq worker --broker redis://localhost:6379 +``` +{% endraw %} +All workers share the same broker (Redis) and process tasks concurrently. + +**Throughput ≈ (# workers) × (tasks/worker/second)**. + +### 7.2. Worker Pool Management + +Use a process manager (systemd, supervisord, Docker Compose): + +{% raw %} +```yaml +# docker-compose.yml +services: + worker-1: + image: taskiq-flow-worker + command: taskiq worker --broker ${REDIS_URL} + worker-2: + image: taskiq-flow-worker + command: taskiq worker --broker ${REDIS_URL} + worker-3: + image: taskiq-flow-worker + command: taskiq worker --broker ${REDIS_URL} +``` +{% endraw %} +### 7.3. Queue Prioritization + +Route critical pipelines to dedicated queues: + +{% raw %} +```python +@broker.task(queue="high_priority") +def critical_task(): ... + +# Workers can be configured to process specific queues first +``` +{% endraw %} +### 7.4. Geo-Distribution + +For low-latency global deployments, deploy workers in multiple regions with a global broker (Kafka) or regional Redis clusters with replication. + +--- + +## 8. 
Benchmarking + +Measure before and after optimization: + +{% raw %} +```python +import time + +async def benchmark(pipeline, data, iterations=10): + # `data` is the pipeline input, passed in explicitly rather than read from a global + durations = [] + for _ in range(iterations): + start = time.perf_counter() + result = await pipeline.kiq(data) + await result.wait_result() + duration = time.perf_counter() - start + durations.append(duration) + + avg = sum(durations) / len(durations) + p95 = sorted(durations)[int(0.95 * len(durations))] + print(f"Average: {avg:.3f}s, P95: {p95:.3f}s") + return durations +``` +{% endraw %} +**Key metrics**: + +- **Throughput**: tasks/second +- **P50/P95/P99 latency**: median, 95th, 99th percentile +- **Memory peak**: maximum resident set size (RSS) +- **CPU utilization**: % of cores used + +--- + +## 9. Production Checklist + +- [ ] Set `max_parallel` appropriately for task type (CPU vs I/O) +- [ ] Use connection pooling for external services +- [ ] Enable Redis storage for tracking (avoid memory leaks) +- [ ] Set TTL on tracking/result storage +- [ ] Configure timeouts on all tasks +- [ ] Add retry policies with backoff and jitter +- [ ] Monitor memory usage and set alerts +- [ ] Profile slow tasks with cProfile/tracemalloc +- [ ] Scale workers horizontally based on queue depth +- [ ] Use queue priorities for critical pipelines +- [ ] Implement DLQ and review failed tasks regularly +- [ ] Test failure scenarios (network partitions, service outages) + +--- + +## 10. Troubleshooting Performance + +### Pipeline Running Slowly + +**Diagnostic steps**: + +1. Check step durations in tracking: +{% raw %} + ```python + status = await tracking.get_status(pipeline_id) + slowest = max(status.steps, key=lambda s: s.duration_ms) + print(f"Slowest step: {slowest.name} at {slowest.duration_ms}ms") + ``` +{% endraw %} +2. Profile with cProfile to see where time is spent +3. Verify that `max_parallel` is not set too low +4. 
Check for blocking I/O (use async libraries) + +### High Memory Usage + +**Causes & fixes**: + +| Cause | Fix | +|-------|-----| +| Large dataset in single step | Chunk data, process in batches | +| Results accumulating in tracking storage | Set TTL, delete after use | +| Memory leak in task code | Profile with `tracemalloc`, fix leaks | +| Too many parallel tasks | Reduce `max_parallel` | + +### Worker Starvation + +**Symptom**: Tasks queued but not executing. + +**Fixes**: +- Increase number of worker processes +- Ensure broker (Redis) has enough connections +- Check for long-running tasks blocking queue +- Consider task priorities or separate queues + +--- + +## 11. Advanced: Custom Executors + +For specialized workloads, implement custom executors: + +{% raw %} +```python +from taskiq_flow import ExecutionEngine +from taskiq_flow.dataflow import DAG + +class GPUOptimizedEngine(ExecutionEngine): + async def schedule_task(self, task_node, inputs): + # Custom scheduling logic: route GPU tasks to GPU workers + if task_node.labels.get("requires_gpu"): + return await self.gpu_worker_pool.submit(task_node, inputs) + return await super().schedule_task(task_node, inputs) + +engine = GPUOptimizedEngine(broker, dag) +results = await engine.execute(inputs) +``` +{% endraw %} +### 11.1. 
Resource-Aware Execution with `TaskResourceProfile` + +Taskiq-Flow provides a resource-aware execution pattern for pipelines that need +to allocate tasks to workers based on their CPU/RAM requirements: + +{% raw %} +```python +from taskiq_flow import ResourceAwareExecutor, TaskResourceProfile +from taskiq_flow.dataflow import DataflowPipeline + +# Define a resource profile for heavy tasks +heavy_profile = TaskResourceProfile( + estimated_memory_mb=2048, + estimated_cpu_cores=4.0, +) + +# Annotate tasks with resource needs via labels when creating the pipeline +pipeline = DataflowPipeline( + broker=broker, + name="resource_aware_pipeline", + resource_aware=True, +) + +@pipeline.task(resource_profile=heavy_profile) +def heavy_computation(data: dict) -> dict: + """This task requires 4 CPU cores and 2 GB of RAM.""" + return process_heavy_data(data) + +# Configure the executor to respect resource profiles +executor = ResourceAwareExecutor( + broker=broker, + max_parallel=10, +) +executor.run_pipeline(pipeline, input_data) +``` +{% endraw %} +`ResourceAwareExecutor` evaluates resource profiles of tasks and distributes them +to available workers based on their capacity. `TaskResourceProfile` lets you +annotate each task with its estimated resource needs, enabling the executor to +prevent over-subscription of workers. + +--- + +## 12. Summary + +Performance optimization is iterative: + +1. **Measure** — establish baseline with benchmarks +2. **Identify** — find bottlenecks with profiling +3. **Tune** — adjust `max_parallel`, resource profiles, batching +4. **Scale** — add workers, optimize external services +5. **Monitor** — track metrics in production +6. 
**Repeat** — optimization never ends + +--- + +## Next Steps + +- **[Tracking Guide]({{ '/en/guides/tracking/' | relative_url }})** — Monitor pipeline metrics +- **[Dataflow Guide]({{ '/en/guides/dataflow/' | relative_url }})** — Complete guide to DAG pipelines and dataflow architecture +- **[API Guide]({{ '/en/guides/api/' | relative_url }})** — Build custom dashboards for performance +- **[Example: Dataflow Audio Pipeline]({{ '/en/examples/dataflow-audio-pipeline/' | relative_url }})** — See optimization in action + +--- + +*Go fast, but measure first.* diff --git a/docs/_en/guides/pipelines.md b/docs/_en/guides/pipelines.md index 94237b6..d687bc8 100644 --- a/docs/_en/guides/pipelines.md +++ b/docs/_en/guides/pipelines.md @@ -1,544 +1,544 @@ ---- -title: Pipelines Guide -nav_order: 20 ---- -# Pipelines Guide - -**Sequential and Dataflow pipeline patterns, configurations, and best practices** - -> **Version**: {VERSION} | **Related**: [Execution Guide]({{ '/en/guides/execution/' | relative_url }}), [Tasks Guide]({{ '/en/guides/tasks/' | relative_url }}), [Dataflow Guide]({{ '/en/guides/dataflow/' | relative_url }}) - ---- - -## Overview - -Taskiq-Flow provides two main pipeline types for orchestrating task workflows: - -1. **SequentialPipeline** — Manual step chaining for linear workflows -2. **DataflowPipeline** — Automatic DAG construction from task dependencies - -For a comprehensive deep-dive into dataflow patterns, see the [Dataflow Guide]({{ '/en/guides/dataflow/' | relative_url }}). - -This guide explores both types, their use cases, and how to choose between them. - ---- - -## 1. Sequential Pipeline - -The classic pipeline model where you explicitly chain steps in order. - -### 1.1. Basic Structure - -```python -from taskiq_flow import Pipeline - -pipeline = ( - Pipeline(broker) - .call_next(task1) - .call_next(task2) - .call_next(task3) -) -``` - -**Execution**: `task1 → task2 → task3` (synchronously) - -### 1.2. 
Available Operations - -#### `.call_next(task, *args, **kwargs)` - -Execute a task, passing the previous result as the first argument: - -```python -pipeline.call_next(process_data).call_next(save_result) -# process_data receives output of previous step -# save_result receives output of process_data -``` - -**Parameter binding**: -- By position: result becomes first argument -- By name: `pipeline.call_next(task, param_name=previous_result)` - -Example: -```python -@broker.task -def multiply(value: int, factor: int) -> int: - return value * factor - -pipeline.call_next(add_one).call_next(multiply, factor=3) -# add_one output → multiply(value=...) , factor=3 -``` - -#### `.call_after(task, *args, **kwargs)` - -Execute a task **without** consuming the previous result (fire-and-forget within pipeline): - -```python -pipeline.call_next(process).call_after(log_completion) -# log_completion runs after process but doesn't receive process's output -``` - -Useful for side effects (logging, notifications) that shouldn't transform the data flow. - -#### `.map(task, max_parallel=None)` - -Apply a task to each element of an iterable result in parallel: - -```python -# Previous step returned: [1, 2, 3, 4] -pipeline.map(process_item) -# Runs process_item(1), process_item(2), ... concurrently -# Collects results: [processed1, processed2, ...] 
-``` - -**Options**: -- `max_parallel=10` — limit concurrent executions -- `output_name="results"` — custom output key (default: task output name) - -#### `.filter(task)` - -Keep elements where the task returns truthy: - -```python -# Previous step returned: [1, 2, 3, 4] -pipeline.filter(is_even) -# Keeps elements where is_even(element) returns True -# Result: [2, 4] -``` - -#### `.group(tasks, param_names=None)` - -Execute multiple independent tasks in parallel, starting from the same input: - -```python -pipeline.group( - [task_a, task_b, task_c], - param_names=["x", "y", "z"] # bind input to these parameters -) -# All three tasks receive the same previous result -# Returns: [result_a, result_b, result_c] -``` - ---- - -## 2. Dataflow Pipeline - -> For a comprehensive guide on dataflow patterns, see the [Dataflow Guide]({{ '/en/guides/dataflow/' | relative_url }}). - -Automatic DAG construction using `@pipeline_task(output=...)` annotations. - -### 2.1. Declaring Task Outputs - -```python -from taskiq_flow import pipeline_task, DataflowPipeline - -@broker.task -@pipeline_task(output="features") -def extract_features(data: list[str]) -> dict: - return {"count": len(data)} - -@broker.task -@pipeline_task(output="stats") -def compute_stats(features: dict) -> dict: - return {"entries": features["count"] * 2} - -@broker.task -@pipeline_task(output="report") -def generate_report(stats: dict) -> str: - return f"Stats: {stats}" -``` - -**Key**: The `output` parameter declares what this task produces. Downstream tasks declare matching parameter names to consume those outputs. - -### 2.2. Building the Pipeline - -```python -pipeline = DataflowPipeline.from_tasks( - broker, - [extract_features, compute_stats, generate_report] -) -``` - -**Automatic dependency resolution**: - -1. `extract_features` produces `features` — no dependencies -2. `compute_stats` needs `features` — depends on `extract_features` -3. 
`generate_report` needs `stats` — depends on `compute_stats` - -**Resulting DAG**: -``` -extract_features → compute_stats → generate_report -``` - -### 2.3. Multiple Consumers - -Multiple tasks can consume the same output; they'll all wait for the producer: - -```python -@broker.task -@pipeline_task(output="features") -def extract(data): ... - -@broker.task -@pipeline_task(output="tags") -def tag(features: dict): ... # consumer 1 of features - -@broker.task -@pipeline_task(output="embedding") -def embed(features: dict): ... # consumer 2 of features - -# Both tag and embed run in parallel after extract completes -``` - -### 2.4. Input Parameters - -Dataflow pipelines accept external inputs via `kiq_dataflow(**kwargs)`: - -```python -results = await pipeline.kiq_dataflow(data=["file1.mp3", "file2.mp3"]) -# The `data` parameter is matched to any task needing it -# Must match a parameter name of a task with no producer (external input) -``` - ---- - -## 3. Pipeline Configuration - -### 3.1. Adding Tracking - -```python -from taskiq_flow import PipelineTrackingManager - -tracking = PipelineTrackingManager().with_auto_storage(broker) -pipeline = Pipeline(broker).with_tracking(tracking) -``` - -See [Tracking Guide]({{ '/en/guides/tracking/' | relative_url }}) for details. - -### 3.2. Setting a Custom Pipeline ID - -```python -pipeline.pipeline_id = "my_custom_workflow_001" -# If not set, a UUID is generated automatically -``` - -Important for tracking and WebSocket subscriptions. - -### 3.3. Attaching Hooks (WebSocket) - -```python -from taskiq_flow.hooks import HookManager - -hooks = HookManager() -pipeline = Pipeline(broker).with_hooks(hooks) -``` - -See [WebSocket Guide]({{ '/en/guides/websocket/' | relative_url }}). - -### 3.4. Retry & Error Policies - -```python -pipeline.with_retry( - max_attempts=3, - delay=1.0, - backoff=2.0 -) -pipeline.on_error("continue") # or "stop" -``` - -See [Retry Guide]({{ '/en/guides/retry/' | relative_url }}). - -### 3.5. 
Timeouts - -```python -pipeline.with_timeout(seconds=60) -``` - ---- - -## 4. Pipeline Lifecycle - -### 4.1. Creation → Execution → Completion - -``` -1. pipeline = Pipeline(broker) # Create pipeline object -2. pipeline.call_next(...) # Chain steps -3. task = await pipeline.kiq(input) # Launch -4. result = await task.wait_result() # Wait & retrieve -``` - -### 4.2. Reuseability - -Pipeline objects are **single-use**. For repeated execution, create a new pipeline or use the `PipelineScheduler`: - -```python -# Correct: Create fresh pipeline each time -async def run_workflow(data): - pipeline = Pipeline(broker).call_next(step1).call_next(step2) - return await pipeline.kiq(data) - -# For recurring schedules, use PipelineScheduler -from taskiq_flow import PipelineScheduler -scheduler = PipelineScheduler(broker) -await scheduler.schedule(pipeline, cron="* * * * *") -``` - ---- - -## 5. Visualizing Pipelines - -### 5.1. ASCII DAG (Console) - -```python -pipeline.print_dag() -``` - -Example output: -``` -DAG Execution Order: - Level 0: task_a - Level 1: task_b, task_c - Level 2: task_d -``` - -### 5.2. JSON for Web UIs - -```python -viz = pipeline.visualize() # returns dict -print(viz) -``` - -Structure: -```json -{ - "nodes": [ - {"id": "task_a", "outputs": ["x", "y"]}, - {"id": "task_b", "inputs": ["x"]} - ], - "edges": [{"from": "task_a", "to": "task_b"}] -} -``` - -### 5.3. DOT Format (Graphviz) - -```python -dot = pipeline.visualize_dot() -with open("pipeline.dot", "w") as f: - f.write(dot) -# Render: dot -Tpng pipeline.dot -o pipeline.png -``` - -Resulting diagram shows nodes, edges, and execution order. - ---- - -## 6. 
Pipeline Inspection (DataflowRegistry) - -For advanced use cases, manually construct and inspect the dataflow graph: - -```python -from taskiq_flow import DataflowRegistry - -registry = DataflowRegistry() - -# Register tasks with explicit I/O -registry.register_task( - task=load_data, - output="raw", - inputs=["source"] # external input -) -registry.register_task( - task=clean, - output="clean", - inputs=["raw"] -) -registry.register_task( - task=save, - output="saved", - inputs=["clean"] -) - -# Inspect structure -print("Tasks:", [t.task_name for t in registry.get_tasks()]) -print("Outputs:", registry.get_outputs()) # ["raw", "clean", "saved"] -print("External inputs:", registry.get_external_inputs()) # ["source"] - -# Find dependencies -producer = registry.get_producer("clean") # returns TaskNode for 'clean' -consumers = registry.get_consumers("raw") # list of tasks needing 'raw' - -# Build DAG -dag = registry.build_dag() -dag.print() -order = dag.topological_sort() # list of tasks in execution order -levels = dag.levels # list of lists (parallel groups) -``` - -See `examples/registry_discovery_example.py` for complete usage. - ---- - -## 7. Choosing Between Pipeline Types - -| Criteria | SequentialPipeline | DataflowPipeline | -|----------|-------------------|------------------| -| **Workflow shape** | Linear, with occasional branching | Complex DAG with many branches | -| **Task dependencies** | Implicit (chaining order) | Explicit (`@pipeline_task`) | -| **Parallel needs** | Manual (`.group()`) | Automatic (independent tasks) | -| **Flexibility** | Full control over order | Declarative; library optimizes | -| **Dynamic workflows** | Hard (fixed at build time) | Easy (can add tasks flexibly) | -| **Best for** | ETL linear steps, simple batch | Audio/video processing, ML pipelines | - -**Rule of thumb**: -- **SequentialPipeline** for simple, fixed-order workflows -- **DataflowPipeline** for complex, branched, or reusable workflows - ---- - -## 8. 
Best Practices - -### 8.1. Task Naming & Outputs - -Use clear, unique output names: - -```python -@pipeline_task(output="user_features") # clear -@pipeline_task(output="features_2") # ambiguous (if multiple features exist) -``` - -### 8.2. Avoid Circular Dependencies - -DataflowPipeline detects cycles and raises `CycleError` during `build_dag()`. Design with forward data flow only. - -### 8.3. Minimize Shared State - -Each task should be pure (output depends only on inputs) for parallel safety. - -### 8.4. Version Pipeline IDs - -Include version in pipeline IDs for tracking: - -```python -pipeline.pipeline_id = f"audio_analysis_v1_{int(time.time())}" -``` - -### 8.5. Use `.call_after()` for Side Effects - -Don't corrupt the data flow with logging/metrics: - -```python -pipeline.call_next(process).call_after(log_result) # correct -pipeline.call_next(process_and_log) # anti-pattern -``` - -### 8.6. Limit Parallelism for Resource-Heavy Tasks - -```python -# CPU-intensive transcoding -pipeline.map(transcode, files, max_parallel=2) -``` - -### 8.7. Validate DAG Before Execution - -```python -pipeline.print_dag() # Always inspect complex pipelines -input("Press Enter to execute...") -``` - ---- - -## 9. Common Pitfalls - -| Symptom | Likely cause | Fix | -|---------|-------------|-----| -| Task runs twice | `.call_next()` and dependent task both declared | Remove redundant call; Dataflow manages dependencies | -| Missing output key | `@pipeline_task(output=...)` doesn't match downstream param | Align output name with parameter name | -| All tasks sequential | Using Pipeline instead of DataflowPipeline | Switch to DataflowPipeline for automatic parallelism | -| Results None | Forgetting `broker.add_middlewares(PipelineMiddleware())` | Add middleware before creating pipelines | -| Stale pipeline reused | Attempting to call `kiq()` twice on same pipeline object | Create fresh pipeline per execution | - ---- - -## 10. Advanced Patterns - -### 10.1. 
Hybrid Sequential + Dataflow - -Combine both types for maximum control: - -```python -# Sequential outer shell -sequential = Pipeline(broker) - -# Inside a step, spawn a dataflow sub-pipeline -@broker.task -async def process_batch(data: list) -> dict: - sub_pipeline = DataflowPipeline.from_tasks( - broker, - [subtask1, subtask2, subtask3] - ) - return await sub_pipeline.kiq_dataflow(data=data) - -sequential.call_next(process_batch).call_next(finalize) -``` - -### 10.2. Dynamic Pipeline Construction - -Build pipelines at runtime based on configuration: - -```python -def build_pipeline(config: dict) -> Pipeline: - steps = [] - if config.get("preprocess"): - steps.append(preprocess_task) - if config.get("analyze"): - steps.append(analyze_task) - # ... - pipeline = Pipeline(broker) - for step in steps: - pipeline.call_next(step) - return pipeline -``` - -### 10.3. Conditional Branching - -Use `.filter()` and condition steps: - -```python -high_value = pipeline.filter(is_high_value) -high_value.call_next(premium_processing) -low_value = pipeline.filter(is_low_value) -low_value.call_next(standard_processing) - -# Merge back -merged = high_value.group([premium_processing, standard_processing]) -``` - -See [steps/condition.py](https://github.com/dorel14/taskiq-flow/blob/main/taskiq_flow/steps/condition.py) for `IfStep`. - ---- - -## 11. 
Summary Checklist - -Before running a pipeline, verify: - -- [ ] Pipeline type chosen appropriately (Sequential vs Dataflow) -- [ ] All functions decorated with `@broker.task` -- [ ] Dataflow: all relevant tasks decorated with `@pipeline_task(output=…)` -- [ ] Output names match downstream parameter names exactly -- [ ] `PipelineMiddleware` added to broker -- [ ] `pipeline_id` set if tracking/WebSocket needed -- [ ] DAG inspected with `print_dag()` for complex workflows -- [ ] Parallelism limits (`max_parallel`) set appropriately -- [ ] Timeouts configured for long-running tasks -- [ ] Example run completed successfully before production use - ---- - -## Further Reading - -- **[Execution Guide]({{ '/en/guides/execution/' | relative_url }})** — How pipelines run, error handling, timeouts -- **[Tasks Guide]({{ '/en/guides/tasks/' | relative_url }})** — Writing task functions and decorators -- **[Examples]({{ '/en/examples/' | relative_url }})** — End-to-end pipeline demonstrations - ---- - -*Master pipelines to orchestrate any workflow. Next, learn about [Task Definition]({{ '/en/guides/tasks/' | relative_url }}).* +--- +title: Pipelines Guide +nav_order: 20 +--- +# Pipelines Guide + +**Sequential and Dataflow pipeline patterns, configurations, and best practices** + +> **Version**: {VERSION} | **Related**: [Execution Guide]({{ '/en/guides/execution/' | relative_url }}), [Tasks Guide]({{ '/en/guides/tasks/' | relative_url }}), [Dataflow Guide]({{ '/en/guides/dataflow/' | relative_url }}) + +--- + +## Overview + +Taskiq-Flow provides two main pipeline types for orchestrating task workflows: + +1. **SequentialPipeline** — Manual step chaining for linear workflows +2. **DataflowPipeline** — Automatic DAG construction from task dependencies + +For a comprehensive deep-dive into dataflow patterns, see the [Dataflow Guide]({{ '/en/guides/dataflow/' | relative_url }}). + +This guide explores both types, their use cases, and how to choose between them. + +--- + +## 1. 
Sequential Pipeline + +The classic pipeline model where you explicitly chain steps in order. + +### 1.1. Basic Structure + +```python +from taskiq_flow import Pipeline + +pipeline = ( + Pipeline(broker) + .call_next(task1) + .call_next(task2) + .call_next(task3) +) +``` + +**Execution**: `task1 → task2 → task3` (synchronously) + +### 1.2. Available Operations + +#### `.call_next(task, *args, **kwargs)` + +Execute a task, passing the previous result as the first argument: + +```python +pipeline.call_next(process_data).call_next(save_result) +# process_data receives output of previous step +# save_result receives output of process_data +``` + +**Parameter binding**: +- By position: result becomes first argument +- By name: `pipeline.call_next(task, param_name=previous_result)` + +Example: +```python +@broker.task +def multiply(value: int, factor: int) -> int: + return value * factor + +pipeline.call_next(add_one).call_next(multiply, factor=3) +# add_one output → multiply(value=...) , factor=3 +``` + +#### `.call_after(task, *args, **kwargs)` + +Execute a task **without** consuming the previous result (fire-and-forget within pipeline): + +```python +pipeline.call_next(process).call_after(log_completion) +# log_completion runs after process but doesn't receive process's output +``` + +Useful for side effects (logging, notifications) that shouldn't transform the data flow. + +#### `.map(task, max_parallel=None)` + +Apply a task to each element of an iterable result in parallel: + +```python +# Previous step returned: [1, 2, 3, 4] +pipeline.map(process_item) +# Runs process_item(1), process_item(2), ... concurrently +# Collects results: [processed1, processed2, ...] 
+``` + +**Options**: +- `max_parallel=10` — limit concurrent executions +- `output_name="results"` — custom output key (default: task output name) + +#### `.filter(task)` + +Keep elements where the task returns truthy: + +```python +# Previous step returned: [1, 2, 3, 4] +pipeline.filter(is_even) +# Keeps elements where is_even(element) returns True +# Result: [2, 4] +``` + +#### `.group(tasks, param_names=None)` + +Execute multiple independent tasks in parallel, starting from the same input: + +```python +pipeline.group( + [task_a, task_b, task_c], + param_names=["x", "y", "z"] # bind input to these parameters +) +# All three tasks receive the same previous result +# Returns: [result_a, result_b, result_c] +``` + +--- + +## 2. Dataflow Pipeline + +> For a comprehensive guide on dataflow patterns, see the [Dataflow Guide]({{ '/en/guides/dataflow/' | relative_url }}). + +Automatic DAG construction using `@pipeline_task(output=...)` annotations. + +### 2.1. Declaring Task Outputs + +```python +from taskiq_flow import pipeline_task, DataflowPipeline + +@broker.task +@pipeline_task(output="features") +def extract_features(data: list[str]) -> dict: + return {"count": len(data)} + +@broker.task +@pipeline_task(output="stats") +def compute_stats(features: dict) -> dict: + return {"entries": features["count"] * 2} + +@broker.task +@pipeline_task(output="report") +def generate_report(stats: dict) -> str: + return f"Stats: {stats}" +``` + +**Key**: The `output` parameter declares what this task produces. Downstream tasks declare matching parameter names to consume those outputs. + +### 2.2. Building the Pipeline + +```python +pipeline = DataflowPipeline.from_tasks( + broker, + [extract_features, compute_stats, generate_report] +) +``` + +**Automatic dependency resolution**: + +1. `extract_features` produces `features` — no dependencies +2. `compute_stats` needs `features` — depends on `extract_features` +3. 
`generate_report` needs `stats` — depends on `compute_stats` + +**Resulting DAG**: +``` +extract_features → compute_stats → generate_report +``` + +### 2.3. Multiple Consumers + +Multiple tasks can consume the same output; they'll all wait for the producer: + +```python +@broker.task +@pipeline_task(output="features") +def extract(data): ... + +@broker.task +@pipeline_task(output="tags") +def tag(features: dict): ... # consumer 1 of features + +@broker.task +@pipeline_task(output="embedding") +def embed(features: dict): ... # consumer 2 of features + +# Both tag and embed run in parallel after extract completes +``` + +### 2.4. Input Parameters + +Dataflow pipelines accept external inputs via `kiq_dataflow(**kwargs)`: + +```python +results = await pipeline.kiq_dataflow(data=["file1.mp3", "file2.mp3"]) +# The `data` parameter is matched to any task needing it +# Must match a parameter name of a task with no producer (external input) +``` + +--- + +## 3. Pipeline Configuration + +### 3.1. Adding Tracking + +```python +from taskiq_flow import PipelineTrackingManager + +tracking = PipelineTrackingManager().with_auto_storage(broker) +pipeline = Pipeline(broker).with_tracking(tracking) +``` + +See [Tracking Guide]({{ '/en/guides/tracking/' | relative_url }}) for details. + +### 3.2. Setting a Custom Pipeline ID + +```python +pipeline.pipeline_id = "my_custom_workflow_001" +# If not set, a UUID is generated automatically +``` + +Important for tracking and WebSocket subscriptions. + +### 3.3. Attaching Hooks (WebSocket) + +```python +from taskiq_flow.hooks import HookManager + +hooks = HookManager() +pipeline = Pipeline(broker).with_hooks(hooks) +``` + +See [WebSocket Guide]({{ '/en/guides/websocket/' | relative_url }}). + +### 3.4. Retry & Error Policies + +```python +pipeline.with_retry( + max_attempts=3, + delay=1.0, + backoff=2.0 +) +pipeline.on_error("continue") # or "stop" +``` + +See [Retry Guide]({{ '/en/guides/retry/' | relative_url }}). + +### 3.5. 
Timeouts + +```python +pipeline.with_timeout(seconds=60) +``` + +--- + +## 4. Pipeline Lifecycle + +### 4.1. Creation → Execution → Completion + +``` +1. pipeline = Pipeline(broker) # Create pipeline object +2. pipeline.call_next(...) # Chain steps +3. task = await pipeline.kiq(input) # Launch +4. result = await task.wait_result() # Wait & retrieve +``` + +### 4.2. Reuseability + +Pipeline objects are **single-use**. For repeated execution, create a new pipeline or use the `PipelineScheduler`: + +```python +# Correct: Create fresh pipeline each time +async def run_workflow(data): + pipeline = Pipeline(broker).call_next(step1).call_next(step2) + return await pipeline.kiq(data) + +# For recurring schedules, use PipelineScheduler +from taskiq_flow import PipelineScheduler +scheduler = PipelineScheduler(broker) +await scheduler.schedule(pipeline, cron="* * * * *") +``` + +--- + +## 5. Visualizing Pipelines + +### 5.1. ASCII DAG (Console) + +```python +pipeline.print_dag() +``` + +Example output: +``` +DAG Execution Order: + Level 0: task_a + Level 1: task_b, task_c + Level 2: task_d +``` + +### 5.2. JSON for Web UIs + +```python +viz = pipeline.visualize() # returns dict +print(viz) +``` + +Structure: +```json +{ + "nodes": [ + {"id": "task_a", "outputs": ["x", "y"]}, + {"id": "task_b", "inputs": ["x"]} + ], + "edges": [{"from": "task_a", "to": "task_b"}] +} +``` + +### 5.3. DOT Format (Graphviz) + +```python +dot = pipeline.visualize_dot() +with open("pipeline.dot", "w") as f: + f.write(dot) +# Render: dot -Tpng pipeline.dot -o pipeline.png +``` + +Resulting diagram shows nodes, edges, and execution order. + +--- + +## 6. 
Pipeline Inspection (DataflowRegistry) + +For advanced use cases, manually construct and inspect the dataflow graph: + +```python +from taskiq_flow import DataflowRegistry + +registry = DataflowRegistry() + +# Register tasks with explicit I/O +registry.register_task( + task=load_data, + output="raw", + inputs=["source"] # external input +) +registry.register_task( + task=clean, + output="clean", + inputs=["raw"] +) +registry.register_task( + task=save, + output="saved", + inputs=["clean"] +) + +# Inspect structure +print("Tasks:", [t.task_name for t in registry.get_tasks()]) +print("Outputs:", registry.get_outputs()) # ["raw", "clean", "saved"] +print("External inputs:", registry.get_external_inputs()) # ["source"] + +# Find dependencies +producer = registry.get_producer("clean") # returns TaskNode for 'clean' +consumers = registry.get_consumers("raw") # list of tasks needing 'raw' + +# Build DAG +dag = registry.build_dag() +dag.print() +order = dag.topological_sort() # list of tasks in execution order +levels = dag.levels # list of lists (parallel groups) +``` + +See `examples/registry_discovery_example.py` for complete usage. + +--- + +## 7. Choosing Between Pipeline Types + +| Criteria | SequentialPipeline | DataflowPipeline | +|----------|-------------------|------------------| +| **Workflow shape** | Linear, with occasional branching | Complex DAG with many branches | +| **Task dependencies** | Implicit (chaining order) | Explicit (`@pipeline_task`) | +| **Parallel needs** | Manual (`.group()`) | Automatic (independent tasks) | +| **Flexibility** | Full control over order | Declarative; library optimizes | +| **Dynamic workflows** | Hard (fixed at build time) | Easy (can add tasks flexibly) | +| **Best for** | ETL linear steps, simple batch | Audio/video processing, ML pipelines | + +**Rule of thumb**: +- **SequentialPipeline** for simple, fixed-order workflows +- **DataflowPipeline** for complex, branched, or reusable workflows + +--- + +## 8. 
Best Practices + +### 8.1. Task Naming & Outputs + +Use clear, unique output names: + +```python +@pipeline_task(output="user_features") # clear +@pipeline_task(output="features_2") # ambiguous (if multiple features exist) +``` + +### 8.2. Avoid Circular Dependencies + +DataflowPipeline detects cycles and raises `CycleError` during `build_dag()`. Design with forward data flow only. + +### 8.3. Minimize Shared State + +Each task should be pure (output depends only on inputs) for parallel safety. + +### 8.4. Version Pipeline IDs + +Include version in pipeline IDs for tracking: + +```python +pipeline.pipeline_id = f"audio_analysis_v1_{int(time.time())}" +``` + +### 8.5. Use `.call_after()` for Side Effects + +Don't corrupt the data flow with logging/metrics: + +```python +pipeline.call_next(process).call_after(log_result) # correct +pipeline.call_next(process_and_log) # anti-pattern +``` + +### 8.6. Limit Parallelism for Resource-Heavy Tasks + +```python +# CPU-intensive transcoding +pipeline.map(transcode, files, max_parallel=2) +``` + +### 8.7. Validate DAG Before Execution + +```python +pipeline.print_dag() # Always inspect complex pipelines +input("Press Enter to execute...") +``` + +--- + +## 9. Common Pitfalls + +| Symptom | Likely cause | Fix | +|---------|-------------|-----| +| Task runs twice | `.call_next()` and dependent task both declared | Remove redundant call; Dataflow manages dependencies | +| Missing output key | `@pipeline_task(output=...)` doesn't match downstream param | Align output name with parameter name | +| All tasks sequential | Using Pipeline instead of DataflowPipeline | Switch to DataflowPipeline for automatic parallelism | +| Results None | Forgetting `broker.add_middlewares(PipelineMiddleware())` | Add middleware before creating pipelines | +| Stale pipeline reused | Attempting to call `kiq()` twice on same pipeline object | Create fresh pipeline per execution | + +--- + +## 10. Advanced Patterns + +### 10.1. 
Hybrid Sequential + Dataflow + +Combine both types for maximum control: + +```python +# Sequential outer shell +sequential = Pipeline(broker) + +# Inside a step, spawn a dataflow sub-pipeline +@broker.task +async def process_batch(data: list) -> dict: + sub_pipeline = DataflowPipeline.from_tasks( + broker, + [subtask1, subtask2, subtask3] + ) + return await sub_pipeline.kiq_dataflow(data=data) + +sequential.call_next(process_batch).call_next(finalize) +``` + +### 10.2. Dynamic Pipeline Construction + +Build pipelines at runtime based on configuration: + +```python +def build_pipeline(config: dict) -> Pipeline: + steps = [] + if config.get("preprocess"): + steps.append(preprocess_task) + if config.get("analyze"): + steps.append(analyze_task) + # ... + pipeline = Pipeline(broker) + for step in steps: + pipeline.call_next(step) + return pipeline +``` + +### 10.3. Conditional Branching + +Use `.filter()` and condition steps: + +```python +high_value = pipeline.filter(is_high_value) +high_value.call_next(premium_processing) +low_value = pipeline.filter(is_low_value) +low_value.call_next(standard_processing) + +# Merge back +merged = high_value.group([premium_processing, standard_processing]) +``` + +See [steps/condition.py](https://github.com/dorel14/taskiq-flow/blob/main/taskiq_flow/steps/condition.py) for `IfStep`. + +--- + +## 11. 
Summary Checklist + +Before running a pipeline, verify: + +- [ ] Pipeline type chosen appropriately (Sequential vs Dataflow) +- [ ] All functions decorated with `@broker.task` +- [ ] Dataflow: all relevant tasks decorated with `@pipeline_task(output=…)` +- [ ] Output names match downstream parameter names exactly +- [ ] `PipelineMiddleware` added to broker +- [ ] `pipeline_id` set if tracking/WebSocket needed +- [ ] DAG inspected with `print_dag()` for complex workflows +- [ ] Parallelism limits (`max_parallel`) set appropriately +- [ ] Timeouts configured for long-running tasks +- [ ] Example run completed successfully before production use + +--- + +## Further Reading + +- **[Execution Guide]({{ '/en/guides/execution/' | relative_url }})** — How pipelines run, error handling, timeouts +- **[Tasks Guide]({{ '/en/guides/tasks/' | relative_url }})** — Writing task functions and decorators +- **[Examples]({{ '/en/examples/' | relative_url }})** — End-to-end pipeline demonstrations + +--- + +*Master pipelines to orchestrate any workflow. Next, learn about [Task Definition]({{ '/en/guides/tasks/' | relative_url }}).* diff --git a/docs/_en/guides/retry.md b/docs/_en/guides/retry.md index 63a8c68..8e0717f 100644 --- a/docs/_en/guides/retry.md +++ b/docs/_en/guides/retry.md @@ -1,485 +1,485 @@ ---- -title: Retry & Error Handling Guide -nav_order: 26 ---- -# Retry & Error Handling Guide - -**Resilient pipeline execution with retry policies, backoff, and dead-letter queues** - -> **Version**: {VERSION} | **Related**: [Execution Guide]({{ '/en/guides/execution/' | relative_url }}), [Scheduling Guide]({{ '/en/guides/scheduling/' | relative_url }}) - ---- - -## Overview - -Failures are inevitable in distributed systems. Taskiq-Flow provides comprehensive retry and error handling mechanisms to ensure pipeline robustness. 
- -This guide covers: - -- Retry policies at task and pipeline levels -- Exponential backoff strategies -- Dead-letter queues (DLQ) for unrecoverable failures -- Conditional retry logic -- Timeout configuration -- Monitoring retry metrics - ---- - -## 1. Understanding Retries - -A **retry** is automatically re-executing a failed task with the same inputs. Retry policies define **when** and **how** to retry. - -### When to Retry - - **Good candidates for retry**: - -- Network timeouts (external API unavailable) -- Database connection errors (transient) -- Rate limit hits (retry-after header) -- Temporary resource exhaustion - - **Do NOT retry**: - -- Validation errors (bad input won't fix itself) -- Programming errors (bug in code) -- Missing data (won't reappear) -- Permanent failures (404 Not Found, 401 Unauthorized) - ---- - -## 2. Retry at Task Level - -Configure retry directly on the task decorator: - -```python -@broker.task( - max_retries=3, # Maximum retry attempts (default: 0 = no retry) - retry_delay=5.0, # Seconds between retries - retry_backoff=2.0, # Multiply delay by this after each attempt - retry_timeout=60 # Overall timeout including retries -) -async def flaky_api_call(): - response = await call_external_api() - return response.json() -``` - -**Retry sequence**: - -| Attempt | Delay | Cumulative | -|---------|-------|------------| -| 1 (initial) | 0s | 0s | -| 2 (retry 1) | 5s | 5s | -| 3 (retry 2) | 10s (5 × 2) | 15s | -| 4 (retry 3) | 20s (10 × 2) | 35s | -| Final failure | — | 35s | - ---- - -## 3. Retry at Pipeline Level - -Apply consistent retry policy to all tasks in a pipeline: - -```python -pipeline = Pipeline(broker) -pipeline.with_retry( - max_attempts=3, - delay=2.0, # Initial delay - backoff=1.5, # Backoff multiplier - on_retry=None # Optional callback -) -``` - -All tasks in this pipeline inherit this policy unless they have their own. - -**Inheritance precedence**: Task-level overrides pipeline-level. - ---- - -## 4. 
Custom Retry Policies - -For fine control, implement `RetryPolicy`: - -```python -from taskiq_flow import RetryPolicy - -class MyRetryPolicy(RetryPolicy): - def should_retry(self, attempt: int, exception: Exception) -> bool: - # Retry only on network errors, max 5 attempts - if attempt >= 5: - return False - return isinstance(exception, NetworkError) - - def get_delay(self, attempt: int) -> float: - # Custom backoff: 2^attempt + random jitter - import random - base = 2 ** attempt - jitter = random.uniform(-0.1, 0.1) * base - return max(0.5, base + jitter) - -pipeline.with_retry(policy=MyRetryPolicy()) -``` - -### 4.1. Conditional Retry (Only on Specific Exceptions) - -```python -@broker.task -async def task_with_selective_retry(): - try: - result = await call_api() - return result - except NetworkTimeout: - # This exception type should be retried - raise RetryException("Timeout, will retry") - except InvalidResponse: - # This error is permanent; don't retry - raise # Will fail immediately -``` - -**Built-in exception-based retry**: - -```python -from taskiq.exceptions import RetryException - -@broker.task(retry_on=[NetworkError, TimeoutError]) -async def task(): - # Automatically retries on these exception types - pass -``` - ---- - -## 5. 
Exponential Backoff with Jitter - -Avoid thundering herd problem (all retries happen at same time): - -```python -import random - -def exponential_backoff_with_jitter( - attempt: int, - base_delay: float = 1.0, - max_delay: float = 60.0, - backoff_factor: float = 2.0, - jitter: bool = True -) -> float: - """Calculate retry delay.""" - delay = min(max_delay, base_delay * (backoff_factor ** attempt)) - if jitter: - # Add ±10% random jitter - delay *= random.uniform(0.9, 1.1) - return delay - -# Usage in policy -class JitteredRetryPolicy(RetryPolicy): - def get_delay(self, attempt: int) -> float: - return exponential_backoff_with_jitter(attempt, base_delay=2.0) -``` - -**Why jitter?** Prevents synchronized retry storms that overwhelm services. - ---- - -## 6. Dead Letter Queues (DLQ) - -When all retries are exhausted, failed tasks need somewhere to go. - -### 6.1. Configuring DLQ - -```python -from taskiq_flow.middlewares.retry import RetryMiddleware - -broker.add_middlewares( - RetryMiddleware( - max_retries=3, - dlq_queue="failed_tasks" # Tasks go here after exhausting retries - ) -) -``` - -**Behavior**: - -1. Task fails → retry 1 (after delay) -2. Fails again → retry 2 (after longer delay) -3. Fails again → retry 3 -4. Fails all retries → move to `failed_tasks` queue - -### 6.2. DLQ Inspection & Reprocessing - -```python -from taskiq_flow.middlewares.retry import DLQManager - -dlq = DLQManager(broker) - -# List failed tasks -failed_tasks = await dlq.list_failed() -for task_info in failed_tasks: - print(f"Task {task_info.task_id} failed: {task_info.error}") - -# Replay a failed task (re-queue for execution) -await dlq.retry_task(task_id) - -# Discard a failed task permanently -await dlq.delete_task(task_id) - -# Bulk delete older than N days -await dlq.cleanup_older_than(days=7) -``` - -### 6.3. 
DLQ Alerting - -Set up alerts when tasks land in DLQ: - -```python -class DLQAlertListener: - async def on_task_to_dlq(self, task_id: str, error: str): - send_slack_alert(f"Task {task_id} failed after retries: {error}") - create_incident_ticket(task_id, error) - -dlq_manager = DLQManager(broker).with_listener(DLQAlertListener()) -``` - ---- - -## 7. Timeouts - -Prevent tasks from running indefinitely. - -### 7.1. Task-Level Timeout - -```python -@broker.task(timeout=30) # seconds -async def potentially_slow_task(): - await long_running_operation() -``` - -If the task exceeds 30 seconds, `asyncio.TimeoutError` is raised and retry policy applies. - -### 7.2. Pipeline-Level Timeout - -```python -pipeline = Pipeline(broker) -pipeline.with_timeout(seconds=300) # 5 minutes for entire pipeline -``` - -Cancels all running steps when timeout expires. - -### 7.3. Step-Level Timeout (Advanced) - -```python -from taskiq_flow.steps import TimeoutStep - -pipeline = Pipeline(broker) -pipeline.call_next(TimeoutStep(my_task, timeout=10.0)) -``` - ---- - -## 8. Error Propagation - -### 8.1. Fail Fast (Default) - -Pipeline stops at first failure: - -```python -pipeline = Pipeline(broker) -# By default: on_error="stop" - -pipeline.call_next(task1) # Fails → pipeline stops, task2 never runs -pipeline.call_next(task2) -``` - -### 8.2. Continue on Error - -Continue executing remaining steps despite failures: - -```python -pipeline = Pipeline(broker) -pipeline.on_error("continue") - -pipeline.call_next(task1) # Fails, but task2 still runs -pipeline.call_next(task2) -``` - -**Result**: Task2 receives `None` or partial result; check `result.is_failed`. - -### 8.3. Compensation (Saga Pattern) - -Execute a cleanup task if a step fails: - -```python -pipeline = Pipeline(broker) - -pipeline.call_next(allocate_resource) - .on_failure(compensate_allocation) # Run compensation if previous step failed -pipeline.call_next(process) -``` - ---- - -## 9. 
Monitoring Retries - -Track retry metrics: - -```python -from taskiq_flow import PipelineTrackingManager - -tracking = PipelineTrackingManager().with_auto_storage(broker) - -# Retry metrics exposed in PipelineStatus: -status = await tracking.get_status(pipeline_id) -print(f"Steps: {len(status.steps)}") -for step in status.steps: - if step.retry_count > 0: - print(f" {step.name}: retried {step.retry_count} times") - print(f" Errors: {step.errors}") -``` - -**Metrics to monitor**: - -- **Retry rate** (%) of tasks needing retry -- **Average retry count** per task -- **Top failing tasks** (most retries) -- **DLQ size** (tasks giving up) -- **Time spent in retries** vs actual work - -### Integration with Prometheus - -```python -from prometheus_client import Counter, Summary - -RETRY_COUNT = Counter('task_retries_total', 'Total retry attempts', ['task_name']) -TASK_FAILURES = Counter('task_failures_total', 'Tasks that failed after retries', ['task_name']) -TASK_DURATION = Summary('task_duration_seconds', 'Task execution time', ['task_name']) - -class MetricsMiddleware(PipelineMiddleware): - async def on_step_complete(self, ctx, result): - step_name = ctx.task_name - RETRY_COUNT.labels(step_name).inc(ctx.retry_count) - TASK_DURATION.labels(step_name).observe(ctx.duration_ms / 1000) -``` - ---- - -## 10. Best Practices - -### 10.1. Set Reasonable Retry Limits - -```python -# Don't retry indefinitely -@broker.task(max_retries=3) # Good: bounded -@broker.task(max_retries=None) # Bad: infinite retries -``` - -### 10.2. Use Exponential Backoff - -Implemented via `retry_backoff`: - -```python -@broker.task(max_retries=5, retry_delay=2.0, retry_backoff=2.0) -# Delays: 2s, 4s, 8s, 16s, 32s -``` - -### 10.3. Add Jitter - -Randomize delays to avoid thundering herd: - -```python -retry_backoff=2.0, retry_jitter=True # Add ±10% jitter -``` - -### 10.4. 
Set Deadlines - -```python -# Overall timeout including retries -@broker.task(retry_timeout=300) # Give up after 5 minutes total -``` - -### 10.5. Log Every Retry - -```python -import logging -logger = logging.getLogger(__name__) - -@broker.task( - max_retries=3, - on_retry=lambda attempt, exc: logger.warning(f"Retry {attempt} for task: {exc}") -) -``` - -### 10.6. Separate Transient vs Permanent Errors - -```python -@broker.task -async def smart_task(): - try: - return await call_api() - except (Timeout, ConnectionError) as e: - raise RetryException("Transient error") from e # Will retry - except NotFoundError: - raise # No retry, fail permanently -``` - -### 10.7. DLQ for Investigation - -Never discard failed tasks without review: - -```python -dlq = DLQManager(broker) -# Periodically review DLQ -failed = await dlq.list_failed(limit=100) -for task in failed: - logger.error(f"DLQ task {task.task_id}: {task.error}") - # Consider manual replay or data correction -``` - ---- - -## 11. Common Pitfalls - -| Pitfall | Consequence | Solution | -|---------|-------------|----------| -| Infinite retries (`max_retries=None`) | System stuck in retry loop | Set explicit max | -| No backoff (delay=0) | Service overwhelmed | Use exponential backoff | -| Retrying validation errors | Wasted resources | Distinguish error types | -| No DLQ | Lost failed tasks | Configure DLQ | -| Timeout shorter than retry delay | Premature timeout | Ensure timeout > sum of retry delays | -| Multiple retries on non-idempotent tasks | Duplicate side-effects | Make tasks idempotent or limit retries | - ---- - -## 12. 
Summary - -| Feature | Task-level | Pipeline-level | -|---------|-----------|----------------| -| **Retry limit** | `@broker.task(max_retries=N)` | `pipeline.with_retry(max_attempts=N)` | -| **Delay** | `retry_delay` | `delay` | -| **Backoff** | `retry_backoff` | `backoff` | -| **Timeout** | `timeout` per task | `with_timeout(seconds)` overall | -| **DLQ** | Via `RetryMiddleware` | Inherited from tasks | - -**Complete resilient pipeline**: - -```python -tracking = PipelineTrackingManager().with_auto_storage(broker) - -pipeline = Pipeline(broker).with_tracking(tracking) -pipeline.with_retry(max_attempts=3, delay=2.0, backoff=2.0) -pipeline.with_timeout(seconds=300) -pipeline.on_error("continue") # Or use compensation steps - -# Add retry middleware with DLQ -from taskiq_flow.middlewares.retry import RetryMiddleware -broker.add_middlewares(RetryMiddleware(max_retries=3, dlq_queue="failed_tasks")) -``` - ---- - -## Next Steps - -- **[Performance Guide]({{ '/en/guides/performance/' | relative_url }})** — Optimize task execution and resource usage -- **[Scheduling Guide]({{ '/en/guides/scheduling/' | relative_url }})** — Automated pipeline retries at scheduled intervals -- **[Tracking Guide]({{ '/en/guides/tracking/' | relative_url }})** — Monitor retry metrics in production - ---- - -*Failures happen. Retry smart. Track everything.* +--- +title: Retry & Error Handling Guide +nav_order: 26 +--- +# Retry & Error Handling Guide + +**Resilient pipeline execution with retry policies, backoff, and dead-letter queues** + +> **Version**: {VERSION} | **Related**: [Execution Guide]({{ '/en/guides/execution/' | relative_url }}), [Scheduling Guide]({{ '/en/guides/scheduling/' | relative_url }}) + +--- + +## Overview + +Failures are inevitable in distributed systems. Taskiq-Flow provides comprehensive retry and error handling mechanisms to ensure pipeline robustness. 
+ +This guide covers: + +- Retry policies at task and pipeline levels +- Exponential backoff strategies +- Dead-letter queues (DLQ) for unrecoverable failures +- Conditional retry logic +- Timeout configuration +- Monitoring retry metrics + +--- + +## 1. Understanding Retries + +A **retry** is automatically re-executing a failed task with the same inputs. Retry policies define **when** and **how** to retry. + +### When to Retry + + **Good candidates for retry**: + +- Network timeouts (external API unavailable) +- Database connection errors (transient) +- Rate limit hits (retry-after header) +- Temporary resource exhaustion + + **Do NOT retry**: + +- Validation errors (bad input won't fix itself) +- Programming errors (bug in code) +- Missing data (won't reappear) +- Permanent failures (404 Not Found, 401 Unauthorized) + +--- + +## 2. Retry at Task Level + +Configure retry directly on the task decorator: + +```python +@broker.task( + max_retries=3, # Maximum retry attempts (default: 0 = no retry) + retry_delay=5.0, # Seconds between retries + retry_backoff=2.0, # Multiply delay by this after each attempt + retry_timeout=60 # Overall timeout including retries +) +async def flaky_api_call(): + response = await call_external_api() + return response.json() +``` + +**Retry sequence**: + +| Attempt | Delay | Cumulative | +|---------|-------|------------| +| 1 (initial) | 0s | 0s | +| 2 (retry 1) | 5s | 5s | +| 3 (retry 2) | 10s (5 × 2) | 15s | +| 4 (retry 3) | 20s (10 × 2) | 35s | +| Final failure | — | 35s | + +--- + +## 3. Retry at Pipeline Level + +Apply consistent retry policy to all tasks in a pipeline: + +```python +pipeline = Pipeline(broker) +pipeline.with_retry( + max_attempts=3, + delay=2.0, # Initial delay + backoff=1.5, # Backoff multiplier + on_retry=None # Optional callback +) +``` + +All tasks in this pipeline inherit this policy unless they have their own. + +**Inheritance precedence**: Task-level overrides pipeline-level. + +--- + +## 4. 
Custom Retry Policies + +For fine control, implement `RetryPolicy`: + +```python +from taskiq_flow import RetryPolicy + +class MyRetryPolicy(RetryPolicy): + def should_retry(self, attempt: int, exception: Exception) -> bool: + # Retry only on network errors, max 5 attempts + if attempt >= 5: + return False + return isinstance(exception, NetworkError) + + def get_delay(self, attempt: int) -> float: + # Custom backoff: 2^attempt + random jitter + import random + base = 2 ** attempt + jitter = random.uniform(-0.1, 0.1) * base + return max(0.5, base + jitter) + +pipeline.with_retry(policy=MyRetryPolicy()) +``` + +### 4.1. Conditional Retry (Only on Specific Exceptions) + +```python +@broker.task +async def task_with_selective_retry(): + try: + result = await call_api() + return result + except NetworkTimeout: + # This exception type should be retried + raise RetryException("Timeout, will retry") + except InvalidResponse: + # This error is permanent; don't retry + raise # Will fail immediately +``` + +**Built-in exception-based retry**: + +```python +from taskiq.exceptions import RetryException + +@broker.task(retry_on=[NetworkError, TimeoutError]) +async def task(): + # Automatically retries on these exception types + pass +``` + +--- + +## 5. 
Exponential Backoff with Jitter + +Avoid thundering herd problem (all retries happen at same time): + +```python +import random + +def exponential_backoff_with_jitter( + attempt: int, + base_delay: float = 1.0, + max_delay: float = 60.0, + backoff_factor: float = 2.0, + jitter: bool = True +) -> float: + """Calculate retry delay.""" + delay = min(max_delay, base_delay * (backoff_factor ** attempt)) + if jitter: + # Add ±10% random jitter + delay *= random.uniform(0.9, 1.1) + return delay + +# Usage in policy +class JitteredRetryPolicy(RetryPolicy): + def get_delay(self, attempt: int) -> float: + return exponential_backoff_with_jitter(attempt, base_delay=2.0) +``` + +**Why jitter?** Prevents synchronized retry storms that overwhelm services. + +--- + +## 6. Dead Letter Queues (DLQ) + +When all retries are exhausted, failed tasks need somewhere to go. + +### 6.1. Configuring DLQ + +```python +from taskiq_flow.middlewares.retry import RetryMiddleware + +broker.add_middlewares( + RetryMiddleware( + max_retries=3, + dlq_queue="failed_tasks" # Tasks go here after exhausting retries + ) +) +``` + +**Behavior**: + +1. Task fails → retry 1 (after delay) +2. Fails again → retry 2 (after longer delay) +3. Fails again → retry 3 +4. Fails all retries → move to `failed_tasks` queue + +### 6.2. DLQ Inspection & Reprocessing + +```python +from taskiq_flow.middlewares.retry import DLQManager + +dlq = DLQManager(broker) + +# List failed tasks +failed_tasks = await dlq.list_failed() +for task_info in failed_tasks: + print(f"Task {task_info.task_id} failed: {task_info.error}") + +# Replay a failed task (re-queue for execution) +await dlq.retry_task(task_id) + +# Discard a failed task permanently +await dlq.delete_task(task_id) + +# Bulk delete older than N days +await dlq.cleanup_older_than(days=7) +``` + +### 6.3. 
DLQ Alerting + +Set up alerts when tasks land in the DLQ: + +```python +class DLQAlertListener: + async def on_task_to_dlq(self, task_id: str, error: str): + send_slack_alert(f"Task {task_id} failed after retries: {error}") + create_incident_ticket(task_id, error) + +dlq_manager = DLQManager(broker).with_listener(DLQAlertListener()) +``` + +--- + +## 7. Timeouts + +Prevent tasks from running indefinitely. + +### 7.1. Task-Level Timeout + +```python +@broker.task(timeout=30) # seconds +async def potentially_slow_task(): + await long_running_operation() +``` + +If the task exceeds 30 seconds, `asyncio.TimeoutError` is raised and the retry policy applies. + +### 7.2. Pipeline-Level Timeout + +```python +pipeline = Pipeline(broker) +pipeline.with_timeout(seconds=300) # 5 minutes for entire pipeline +``` + +Cancels all running steps when the timeout expires. + +### 7.3. Step-Level Timeout (Advanced) + +```python +from taskiq_flow.steps import TimeoutStep + +pipeline = Pipeline(broker) +pipeline.call_next(TimeoutStep(my_task, timeout=10.0)) +``` + +--- + +## 8. Error Propagation + +### 8.1. Fail Fast (Default) + +The pipeline stops at the first failure: + +```python +pipeline = Pipeline(broker) +# By default: on_error="stop" + +pipeline.call_next(task1) # Fails → pipeline stops, task2 never runs +pipeline.call_next(task2) +``` + +### 8.2. Continue on Error + +Continue executing the remaining steps despite failures: + +```python +pipeline = Pipeline(broker) +pipeline.on_error("continue") + +pipeline.call_next(task1) # Fails, but task2 still runs +pipeline.call_next(task2) +``` + +**Result**: `task2` receives `None` or a partial result; check `result.is_failed`. + +### 8.3. Compensation (Saga Pattern) + +Execute a cleanup task if a step fails: + +```python +pipeline = Pipeline(broker) + +pipeline.call_next(allocate_resource) \ + .on_failure(compensate_allocation) # Run compensation if previous step failed +pipeline.call_next(process) +``` + +--- + +## 9. 
Monitoring Retries + +Track retry metrics: + +```python +from taskiq_flow import PipelineTrackingManager + +tracking = PipelineTrackingManager().with_auto_storage(broker) + +# Retry metrics exposed in PipelineStatus: +status = await tracking.get_status(pipeline_id) +print(f"Steps: {len(status.steps)}") +for step in status.steps: + if step.retry_count > 0: + print(f" {step.name}: retried {step.retry_count} times") + print(f" Errors: {step.errors}") +``` + +**Metrics to monitor**: + +- **Retry rate** (%) of tasks needing retry +- **Average retry count** per task +- **Top failing tasks** (most retries) +- **DLQ size** (tasks giving up) +- **Time spent in retries** vs actual work + +### Integration with Prometheus + +```python +from prometheus_client import Counter, Summary + +RETRY_COUNT = Counter('task_retries_total', 'Total retry attempts', ['task_name']) +TASK_FAILURES = Counter('task_failures_total', 'Tasks that failed after retries', ['task_name']) +TASK_DURATION = Summary('task_duration_seconds', 'Task execution time', ['task_name']) + +class MetricsMiddleware(PipelineMiddleware): + async def on_step_complete(self, ctx, result): + step_name = ctx.task_name + RETRY_COUNT.labels(step_name).inc(ctx.retry_count) + TASK_DURATION.labels(step_name).observe(ctx.duration_ms / 1000) +``` + +--- + +## 10. Best Practices + +### 10.1. Set Reasonable Retry Limits + +```python +# Don't retry indefinitely +@broker.task(max_retries=3) # Good: bounded +@broker.task(max_retries=None) # Bad: infinite retries +``` + +### 10.2. Use Exponential Backoff + +Implemented via `retry_backoff`: + +```python +@broker.task(max_retries=5, retry_delay=2.0, retry_backoff=2.0) +# Delays: 2s, 4s, 8s, 16s, 32s +``` + +### 10.3. Add Jitter + +Randomize delays to avoid thundering herd: + +```python +retry_backoff=2.0, retry_jitter=True # Add ±10% jitter +``` + +### 10.4. 
Set Deadlines + +```python +# Overall timeout including retries +@broker.task(retry_timeout=300) # Give up after 5 minutes total +``` + +### 10.5. Log Every Retry + +```python +import logging +logger = logging.getLogger(__name__) + +@broker.task( + max_retries=3, + on_retry=lambda attempt, exc: logger.warning(f"Retry {attempt} for task: {exc}") +) +``` + +### 10.6. Separate Transient vs Permanent Errors + +```python +@broker.task +async def smart_task(): + try: + return await call_api() + except (Timeout, ConnectionError) as e: + raise RetryException("Transient error") from e # Will retry + except NotFoundError: + raise # No retry, fail permanently +``` + +### 10.7. DLQ for Investigation + +Never discard failed tasks without review: + +```python +dlq = DLQManager(broker) +# Periodically review DLQ +failed = await dlq.list_failed(limit=100) +for task in failed: + logger.error(f"DLQ task {task.task_id}: {task.error}") + # Consider manual replay or data correction +``` + +--- + +## 11. Common Pitfalls + +| Pitfall | Consequence | Solution | +|---------|-------------|----------| +| Infinite retries (`max_retries=None`) | System stuck in retry loop | Set explicit max | +| No backoff (delay=0) | Service overwhelmed | Use exponential backoff | +| Retrying validation errors | Wasted resources | Distinguish error types | +| No DLQ | Lost failed tasks | Configure DLQ | +| Timeout shorter than retry delay | Premature timeout | Ensure timeout > sum of retry delays | +| Multiple retries on non-idempotent tasks | Duplicate side-effects | Make tasks idempotent or limit retries | + +--- + +## 12. 
Summary + +| Feature | Task-level | Pipeline-level | +|---------|-----------|----------------| +| **Retry limit** | `@broker.task(max_retries=N)` | `pipeline.with_retry(max_attempts=N)` | +| **Delay** | `retry_delay` | `delay` | +| **Backoff** | `retry_backoff` | `backoff` | +| **Timeout** | `timeout` per task | `with_timeout(seconds)` overall | +| **DLQ** | Via `RetryMiddleware` | Inherited from tasks | + +**Complete resilient pipeline**: + +```python +tracking = PipelineTrackingManager().with_auto_storage(broker) + +pipeline = Pipeline(broker).with_tracking(tracking) +pipeline.with_retry(max_attempts=3, delay=2.0, backoff=2.0) +pipeline.with_timeout(seconds=300) +pipeline.on_error("continue") # Or use compensation steps + +# Add retry middleware with DLQ +from taskiq_flow.middlewares.retry import RetryMiddleware +broker.add_middlewares(RetryMiddleware(max_retries=3, dlq_queue="failed_tasks")) +``` + +--- + +## Next Steps + +- **[Performance Guide]({{ '/en/guides/performance/' | relative_url }})** — Optimize task execution and resource usage +- **[Scheduling Guide]({{ '/en/guides/scheduling/' | relative_url }})** — Automated pipeline retries at scheduled intervals +- **[Tracking Guide]({{ '/en/guides/tracking/' | relative_url }})** — Monitor retry metrics in production + +--- + +*Failures happen. Retry smart. 
Track everything.* diff --git a/docs/_en/guides/scheduling.md b/docs/_en/guides/scheduling.md index a3cf3c6..c8d9a92 100644 --- a/docs/_en/guides/scheduling.md +++ b/docs/_en/guides/scheduling.md @@ -1,875 +1,875 @@ ---- -title: Pipeline Scheduling Guide -nav_order: 25 ---- -# Pipeline Scheduling Guide - -**Cron-based, interval, and one-off pipeline scheduling with PipelineScheduler** - -> **Version**: {VERSION} | **Related**: [Execution Guide]({{ '/en/guides/execution/' | relative_url }}), [Tracking Guide]({{ '/en/guides/tracking/' | relative_url }}) - ---- - -## Overview - -Taskiq-Flow includes a powerful scheduling system for running pipelines at specific times or intervals, built on top of APScheduler. - -This guide covers: - -- `PipelineScheduler` — Main scheduling interface -- Cron expressions and patterns -- Interval-based scheduling -- One-off executions -- Timezone handling -- Job persistence and management -- Missed execution handling - ---- - -## 1. Quick Start - -> **Prerequisite**: Install the `scheduler` extra: -> `pip install "taskiq-flow[scheduler]"` -> Without it, `PipelineScheduler` raises `ImportError` at construction time. - -```python -from taskiq_flow import Pipeline, PipelineScheduler - -# Create your pipeline -pipeline = Pipeline(broker).call_next(my_task).call_next(another_task) - -# Create scheduler # requires taskiq-flow[scheduler] -scheduler = PipelineScheduler(broker) - -# Schedule to run every minute -job_id = await scheduler.schedule( - pipeline, - cron="* * * * *", # Every minute - args=("some", "data") # Arguments passed to pipeline.kiq() -) - -# Start the scheduler (runs in background) -await scheduler.start() - -# ... keep your application running ... -# scheduler runs in background tasks - -# Shutdown gracefully -await scheduler.shutdown() -``` - -That's the basics. Let's explore the features in detail. - ---- - -## 2. PipelineScheduler - -The main class for scheduling pipeline executions. - -### 2.1. 
Initialization - -```python -from taskiq_flow import PipelineScheduler - -scheduler = PipelineScheduler( - broker, - store="memory", # "memory" or "sqlite" - store_path="./scheduler_jobs.db" # for sqlite store -) -``` - -**Storage options**: - -| Store | Persistence | Multi-worker | Use case | -|-------|-------------|--------------|----------| -| `"memory"` | No | No | Development, single-process | -| `"sqlite"` | Yes | Limited* | Single-worker production, simple persistence | -| `"postgresql"` (via URL) | Yes | Yes | Production multi-worker, HA | -| `"mysql"` (via URL) | Yes | Yes | Production multi-worker, alternative | -| `"redis"` | | | **Not implemented** (raises `NotImplementedError`) | - -*SQLite store works with single scheduler instance; multiple workers need PostgreSQL/MySQL. - -**Recommendation**: -- Development/mocks → `store="memory"` -- Single-worker production → `store="sqlite"` with persistent path -- Production multi-worker → `store="postgresql://user:pass@host/dbname"` (recommended) # pragma: allowlist secret - -> **Note**: PostgreSQL and MySQL support is **already implemented** in `JobPersistenceManager` and works via SQLAlchemy async engine. See [Advanced Storage (PostgreSQL/MySQL)](#advanced-storage-postgresqlmysql) below. - -### 2.2. Starting & Stopping - -```python -# Start scheduler (begins monitoring schedules) -await scheduler.start() - -# Run in background while app is alive -# Typically integrated into FastAPI/Quart lifespan events - -# Graceful shutdown -await scheduler.shutdown() -# Waits for running jobs to finish, cancels pending -``` - -**Automatic start with context manager**: - -```python -async with PipelineScheduler(broker) as scheduler: - await scheduler.schedule(pipeline, cron="*/5 * * * *") - # Scheduler automatically starts on __aenter__ - # ... run your app ... -# Automatically shuts down on __aexit__ -``` - ---- - -## 3. Scheduling Methods - -### 3.1. 
Cron Scheduling - -```python -job_id = await scheduler.schedule( - pipeline, - cron="0 * * * *", # Every hour at minute 0 - args=("input_data",), - kwargs={"key": "value"}, - pipeline_id="hourly_job_001" -) -``` - -**Cron expression format**: `minute hour day month day-of-week` - -| Field | Allowed values | Special characters | -|-------|----------------|-------------------| -| Minute | 0-59 | `* , - /` | -| Hour | 0-23 | `* , - /` | -| Day | 1-31 | `* , - / ?` | -| Month | 1-12 | `* , - /` | -| Day of week | 0-6 (Sun-Sat) | `* , - / ?` | - -**Examples**: - -```python -"*/5 * * * *" # Every 5 minutes -"0 9 * * *" # Daily at 9:00 AM -"0 0 * * 0" # Weekly on Sunday at midnight -"0 0 1 * *" # Monthly on the 1st at midnight -"0 0 1 1 *" # Yearly on January 1st at midnight -``` - -### 3.2. Interval Scheduling - -```python -# Run every N seconds/minutes/hours/days/weeks -job_id = await scheduler.schedule_interval( - pipeline, - seconds=30, # Every 30 seconds - # minutes=5, # Every 5 minutes - # hours=1, # Every hour - args=(data,) -) -``` - -**Note**: Interval scheduling uses APScheduler's `IntervalTrigger`. Cron is generally preferred for production (more flexible, handles DST). - -### 3.3. One-Off Execution (Run At) - -Schedule a single future execution: - -```python -from datetime import datetime, timedelta - -job_id = await scheduler.schedule_at( - pipeline, - run_at=datetime.now() + timedelta(hours=2), # In 2 hours - args=(payload,) -) -``` - -Or schedule for a specific calendar time: - -```python -run_time = datetime(2026, 12, 31, 23, 59, 59) -await scheduler.schedule_at(pipeline, run_at=run_time) -``` - ---- - -## 4. Job Configuration - -### 4.1. 
Job ID - -Each scheduled job gets a unique identifier: - -```python -job_id = await scheduler.schedule(pipeline, cron="* * * * *") -print(job_id) # e.g., "job_20260505_abcdef123456" -``` - -Customize the ID: - -```python -job_id = await scheduler.schedule( - pipeline, - cron="0 9 * * *", - job_id="daily_etl_9am" # human-readable ID -) -``` - -Useful for later management (update, cancel, list). - -### 4.2. Arguments & Keyword Arguments - -Pass arguments to the pipeline's `kiq()` method: - -```python -await scheduler.schedule( - pipeline, - cron="* * * * *", - args=("positional_arg",), # tuple - kwargs={"option": True}, # dict - pipeline_id="my_pipeline" # explicit pipeline ID -) -``` - -The scheduler calls: `await pipeline.kiq(*args, **kwargs)` on each trigger. - -### 4.3. Pipeline ID - -Each scheduled execution can override the pipeline's default ID: - -```python -pipeline = Pipeline(broker) # generates random ID by default - -# Schedule with explicit ID (ensures uniqueness for tracking) -await scheduler.schedule( - pipeline, - cron="*/5 * * * *", - pipeline_id="my_pipeline_v1" -) -``` - -**Best practice**: Include timestamp or version in ID for tracking: - -```python -job_id = f"batch_process_v2_{int(time.time())}" -``` - ---- - -## 5. Job Management - -### 5.1. List Scheduled Jobs - -```python -jobs = await scheduler.list_jobs() -for job in jobs: - print(f"ID: {job.id}") - print(f" Trigger: {job.trigger}") - print(f" Next run: {job.next_run_time}") - print(f" Pipeline: {job.pipeline_id}") -``` - -### 5.2. Get Job Details - -```python -job = await scheduler.get_job(job_id) -if job: - print(f"Job {job.id} is scheduled for {job.next_run_time}") -``` - -### 5.3. Modify a Job - -```python -# Reschedule an existing job -await scheduler.reschedule_job( - job_id, - cron="0 */2 * * *" # Change to every 2 hours -) - -# Update job arguments -await scheduler.modify_job( - job_id, - args=("new_arg",), - kwargs={"updated": True} -) -``` - -### 5.4. 
Remove (Cancel) a Job - -```python -await scheduler.remove_job(job_id) -# Future executions are cancelled; running job continues -``` - -### 5.5. Pause & Resume - -```python -# Temporarily pause a job -await scheduler.pause_job(job_id) - -# Resume later -await scheduler.resume_job(job_id) -``` - ---- - -## 6. Tracking Scheduled Executions - -Each scheduled pipeline execution is automatically tracked if the pipeline has tracking enabled: - -```python -tracking = PipelineTrackingManager().with_auto_storage(broker) -pipeline = Pipeline(broker).with_tracking(tracking) - -scheduler = PipelineScheduler(broker) -await scheduler.schedule(pipeline, cron="*/5 * * * *") - -# Later, query execution history -history = await tracking.get_history() -for run in history: - print(f"Run {run.pipeline_id}: {run.status} at {run.started_at}") -``` - -**Distinguishing scheduled runs**: Use descriptive `pipeline_id` patterns: - -```python -await scheduler.schedule( - pipeline, - cron="0 2 * * *", # Daily 2AM - pipeline_id=f"daily_etl_{datetime.now().strftime('%Y%m%d')}" -) -# Each day gets a unique pipeline ID for tracking -``` - ---- - -## 7. Missed Execution Handling - -When a scheduled job's trigger time is missed (e.g., scheduler downtime, long-running job), APScheduler provides controls: - -### 7.1. Coalesce - -Combine multiple missed runs into a single execution: - -```python -from apscheduler.triggers.cron import CronTrigger - -trigger = CronTrigger( - hour=9, - minute=0, - coalesce=True # If scheduler was down at 9:00, run once at 9:05 instead of 5 times -) - -job = await scheduler.schedule(pipeline, trigger=trigger) -``` - -### 7.2. Max Instances - -Prevent overlapping runs of the same job: - -```python -# Job won't start a new execution if previous instance is still running -trigger = CronTrigger(minute="*/5", max_instances=1) - -job = await scheduler.schedule(pipeline, trigger=trigger) -# If a 9:00 run is still executing at 9:05, the 9:05 run is skipped -``` - -### 7.3. 
Misfire Grace Time - -Allow a window after scheduled time during which execution is still valid: - -```python -from apscheduler.triggers.cron import CronTrigger - -# If scheduler restarts within 10 minutes of scheduled time, still run -trigger = CronTrigger( - minute="*/5", - misfire_grace_time=600 # 10 minutes in seconds -) - -job = await scheduler.schedule(pipeline, trigger=trigger) -``` - ---- - -## 8. Timezone Handling - -By default, APScheduler uses the local system timezone. For production, set explicit timezone: - -```python -from apscheduler.triggers.cron import CronTrigger -import pytz - -# Schedule for 9:00 AM in New York timezone -trigger = CronTrigger( - hour=9, - minute=0, - timezone=pytz.timezone("America/New_York") -) - -job = await scheduler.schedule(pipeline, trigger=trigger) -``` - -Or set globally on scheduler: - -```python -scheduler = PipelineScheduler( - broker, - timezone="UTC" # or "America/Los_Angeles", "Europe/Paris", ... -) -``` - -**Daylight Saving Time (DST)**: Cron triggers with explicit timezone handle DST transitions automatically. Jobs scheduled at "9:00" will still run at 9:00 local time when clocks shift. - ---- - -## 9. Custom Triggers - -Beyond cron and intervals, use any APScheduler trigger: - -```python -from apscheduler.triggers.date import DateTrigger -from datetime import datetime, timedelta - -# Run once at specific datetime -trigger = DateTrigger(run_date=datetime(2026, 12, 31, 23, 59, 59)) -job = await scheduler.schedule(pipeline, trigger=trigger) - -# Run after a delay (from now) -trigger = DateTrigger(run_date=datetime.now() + timedelta(minutes=10)) -job = await scheduler.schedule(pipeline, trigger=trigger) -``` - -See APScheduler documentation for advanced triggers (calendar-based, etc.). - ---- - -## 10. Error Handling - -### 10.1. 
Catch Job Execution Errors - -Wrap pipeline execution with error handling: - -```python -@broker.task -async def my_pipeline_task(data): - try: - result = await process(data) - return result - except Exception as exc: - # Log error, but let scheduler continue - logger.error(f"Pipeline failed: {exc}") - raise # Scheduler records failure, continues with next schedule -``` - -### 10.2. Scheduler-Level Error Callbacks - -```python -scheduler = PipelineScheduler(broker) - -@scheduler.on_error -async def handle_scheduler_error(job_id, exception): - logger.error(f"Job {job_id} failed with: {exception}") - send_alert_email(job_id, exception) - -await scheduler.start() -``` - -### 10.3. Dead Letter Queue (DLQ) - -For jobs that repeatedly fail, route to DLQ: - -```python -from taskiq_flow.middlewares.retry import RetryMiddleware - -# Configure retry with backoff -broker.add_middlewares( - RetryMiddleware( - max_retries=3, - delay=10, - backoff=2 - ) -) - -# After max retries, task goes to DLQ (if broker supports it) -# RedisStreamBroker: dead_letter_stream # requires taskiq-flow[brokers] -# KafkaBroker: dead_letter_topic -``` - ---- - -## 11. Monitoring Scheduled Jobs - -### 11.1. Health Check - -```python -async def scheduler_health(): - stats = scheduler.get_stats() - return { - "scheduled_jobs": len(scheduler.get_jobs()), - "running_jobs": stats.active_jobs, - "next_run": min(job.next_run_time for job in scheduler.get_jobs()) - } -``` - -### 11.2. Logging - -Configure structured logging: - -```python -import logging -logger = logging.getLogger("taskiq_flow.scheduler") - -logging.basicConfig( - level=logging.INFO, - format='%(asctime)s - %(name)s - %(levelname)s - %(message)s' -) - -# Scheduler logs: -# 2026-05-05 10:00:00 - taskiq_flow.scheduler - INFO - Running job daily_etl_9am -# 2026-05-05 10:00:05 - taskiq_flow.scheduler - INFO - Job daily_etl_9am completed successfully -``` - -### 11.3. 
Metrics - -Integrate with Prometheus: - -```python -from prometheus_client import Counter, Gauge - -SCHEDULED_JOBS = Gauge('scheduled_jobs_total', 'Total scheduled jobs') -JOB_RUNS = Counter('scheduler_job_runs_total', 'Job executions', ['job_id']) -JOB_FAILURES = Counter('scheduler_job_failures_total', 'Job failures', ['job_id']) - -class MetricsScheduler(PipelineScheduler): - async def _run_job(self, job_id, pipeline): - JOB_RUNS.labels(job_id=job_id).inc() - try: - await super()._run_job(job_id, pipeline) - except Exception: - JOB_FAILURES.labels(job_id=job_id).inc() - raise -``` - ---- - -## 12. Production Considerations - -### 12.1. High Availability - -For production HA deployments, run multiple scheduler instances with a shared job store: - -```python -# Scheduler 1 -scheduler1 = PipelineScheduler( - broker, - store="postgresql", - db_url="postgresql+asyncpg://user:pass@host/db" # pragma: allowlist secret -) - -# Scheduler 2 (identical config) — only one will acquire jobs -scheduler2 = PipelineScheduler( - broker, - store="postgresql", - # pragma: allownextline secret - db_url="postgresql+asyncpg://user:pass@host/db" # pragma: allowlist secret -) -# APScheduler's job stores use row-level locking; one scheduler per job -``` - -See [Advanced Storage (PostgreSQL/MySQL)](#advanced-storage-postgresqlmysql) for detailed configuration. - -### 12.2. Long-Running Jobs - -If a pipeline execution might exceed its schedule interval: - -```python -# Ensure no overlap -trigger = CronTrigger(minute="*/5", max_instances=1, coalesce=True) -job = await scheduler.schedule(pipeline, trigger=trigger) - -# Pipeline itself has timeout -pipeline.with_timeout(seconds=300) # 5 minutes max -``` - -### 12.3. 
Start-up Behavior - -On scheduler restart, missed jobs are handled according to `misfire_grace_time`: - -```python -# Scheduler restarts at 9:05 AM, job was scheduled for 9:00 -# With misfire_grace_time=600 (10 min): job runs at 9:05 -# With misfire_grace_time=0: job is skipped -trigger = CronTrigger(hour=9, misfire_grace_time=600) -``` - -### 12.4. Job Store Size - -The job store accumulates job history. Periodically clean up: - -```python -# Remove jobs older than 30 days -old_jobs = await scheduler.list_jobs() -for job in old_jobs: - if job.next_run_time < datetime.now() - timedelta(days=30): - await scheduler.remove_job(job.id) -``` - -### 12.5. Advanced Storage (PostgreSQL/MySQL) - -`JobPersistenceManager` natively supports PostgreSQL and MySQL via SQLAlchemy AsyncEngine. - -#### PostgreSQL Configuration (Recommended for Production) - -```python -from taskiq_flow.scheduling.storage import JobPersistenceManager - -# PostgreSQL with asyncpg -storage = JobPersistenceManager( - # pragma: allownextline secret - db_url="postgresql+asyncpg://user:pass@localhost:5432/taskiq_flow", # pragma: allowlist secret - async_mode=True, -) - -# Using the URL helper -storage = JobPersistenceManager( - db_url=JobPersistenceManager.get_connection_url( - "postgresql", - host="localhost", - port=5432, - user="taskiq", - # pragma: allownextline secret - password="secret", # pragma: allowlist secret - database="taskiq_flow", - ), - async_mode=True, -) -``` - -#### MySQL Configuration - -```python -storage = JobPersistenceManager( - # pragma: allownextline secret - db_url="mysql+aiomysql://user:pass@localhost:3306/taskiq_flow", # pragma: allowlist secret - async_mode=True, -) -``` - -#### SQLite Configuration (Development) - -```python -# Sync (development only) -storage = JobPersistenceManager( - db_url="sqlite:///jobs.db", - async_mode=False, -) - -# Async (recommended even for SQLite) -storage = JobPersistenceManager( - db_url="sqlite+aiosqlite:///jobs.db", - async_mode=True, -) -``` 
- -#### Integration with PipelineScheduler - -```python -from taskiq_flow.scheduling.scheduler import PipelineScheduler - -scheduler = PipelineScheduler( - broker, - job_store_url="postgresql+asyncpg://user:pass@localhost:5432/taskiq_flow", # pragma: allowlist secret -) -``` - -#### CRUD Operations with JobPersistenceManager - -```python -from datetime import datetime, timezone -from taskiq_flow.scheduling.storage import JobPersistenceManager, SchedulerJob, PipelineExecution - -storage = JobPersistenceManager(db_url="sqlite:///test.db") - -# Save a job -job = SchedulerJob( - id="job_001", - pipeline_id="etl_daily", - label="Daily ETL", - cron="0 2 * * *", - timezone="UTC", -) -await storage.save_job(job) - -# Load all jobs -jobs = await storage.load_jobs() -for j in jobs: - print(f"{j.id}: {j.cron} - {j.pipeline_id}") - -# Save execution history -execution = PipelineExecution( - job_id="job_001", - pipeline_id="etl_daily", - status="success", - started_at=datetime.now(timezone.utc), - completed_at=datetime.now(timezone.utc), - duration_seconds=45.2, -) -await storage.save_execution_history("job_001", execution) - -# Retrieve history -history = await storage.get_execution_history("job_001", limit=10) -for run in history: - print(f" {run.status} - {run.duration_seconds}s at {run.started_at}") -``` - -| Backend | Async | Multi-worker | Production | -|---------|-------|--------------|------------| -| SQLite | `sqlite+aiosqlite` | Single-writer | Dev / small projects | -| PostgreSQL | `postgresql+asyncpg` | Full | Recommended | -| MySQL | `mysql+aiomysql` | Full | Supported | - ---- - -## 13. Common Patterns - -### 13.1. Daily ETL Pipeline - -```python -@scheduler.schedule( - pipeline=etl_pipeline, - cron="0 2 * * *", # 2:00 AM daily - pipeline_id="daily_etl" -) -async def run_daily_etl(): - pass -``` - -### 13.2. 
Periodic Health Check - -```python -health_pipeline = Pipeline(broker).call_next(health_check_task) - -await scheduler.schedule_interval( - health_pipeline, - minutes=5, - pipeline_id="health_check_5m" -) -``` - -### 13.3. Dynamic Scheduling - -Create and cancel jobs at runtime: - -```python -# Schedule on-demand -job_id = await scheduler.schedule( - pipeline, - run_at=datetime.now() + timedelta(minutes=10) -) - -# Cancel if no longer needed -await scheduler.remove_job(job_id) -``` - -### 13.4. Chained Pipelines - -Pipeline A triggers Pipeline B via scheduling: - -```python -@broker.task -async def pipeline_a_finished(result): - # Schedule pipeline B to run after A completes - job_id = await scheduler.schedule_at( - pipeline_b, - run_at=datetime.now() + timedelta(minutes=5) - ) - return job_id -``` - ---- - -## 14. Troubleshooting - -### Jobs Not Running - -**Symptom**: Scheduled jobs never execute. - -**Fixes**: -- Ensure `await scheduler.start()` is called -- Check cron expression validity: `CronTrigger.from_crontab("* * * * *")` -- Verify timezone matches expected time (check server TZ) -- Confirm job was successfully scheduled (non-None job_id) -- Check scheduler logs for errors - -### Duplicate Job Execution - -**Symptom**: Same job runs multiple times concurrently. - -**Fixes**: -- Set `max_instances=1` in trigger -- Use `coalesce=True` to combine missed runs -- Ensure only one scheduler instance is running (HA needs shared store) - -### Job Store Persistence Not Working - -**Symptom**: Jobs disappear after restart despite sqlite store. - -**Fixes**: -- Use `store="sqlite"` and specify `store_path` -- Ensure file path is writable and persistent between restarts -- Don't mix memory and sqlite stores in same app - -### Timezone Issues - -**Symptom**: Job runs at wrong time (off by hours). 
- -**Fixes**: -- Set explicit timezone on scheduler: `PipelineScheduler(broker, timezone="UTC")` -- Or on trigger: `CronTrigger(hour=9, timezone=pytz.timezone("America/New_York"))` -- Verify server's system timezone matches expectations - ---- - -## 15. Summary - -PipelineScheduler provides robust, production-ready scheduling: - -| Feature | API | -|---------|-----| -| **Cron** | `scheduler.schedule(pipeline, cron="* * * * *")` | -| **Interval** | `scheduler.schedule_interval(pipeline, minutes=5)` | -| **One-off** | `scheduler.schedule_at(pipeline, run_at=datetime)` | -| **Management** | `list_jobs()`, `remove_job()`, `pause_job()` | -| **Persistence** | SQLite (single-worker), PostgreSQL/MySQL (multi-worker) | -| **Tracking** | Automatic with `PipelineTrackingManager` | -| **Concurrency** | `max_instances`, `coalesce` controls | - -**Typical production setup**: - -```python -tracking = PipelineTrackingManager().with_storage(RedisPipelineStorage(redis)) -pipeline = Pipeline(broker).with_tracking(tracking) - -scheduler = PipelineScheduler( - broker, - job_store_url="postgresql+asyncpg://user:pass@host/taskiq_flow", # pragma: allowlist secret -) -await scheduler.start() - -# Schedule your jobs... -``` - ---- - -## Next Steps - -- **[Retry Guide]({{ '/en/guides/retry/' | relative_url }})** — Error recovery and retry policies -- **[Performance Guide]({{ '/en/guides/performance/' | relative_url }})** — Optimize scheduled pipeline performance -- **[Tracking Guide]({{ '/en/guides/tracking/' | relative_url }})** — Monitor scheduled job history - ---- - -*Schedule pipelines like cron jobs. 
Track them like never before.* +--- +title: Pipeline Scheduling Guide +nav_order: 25 +--- +# Pipeline Scheduling Guide + +**Cron-based, interval, and one-off pipeline scheduling with PipelineScheduler** + +> **Version**: {VERSION} | **Related**: [Execution Guide]({{ '/en/guides/execution/' | relative_url }}), [Tracking Guide]({{ '/en/guides/tracking/' | relative_url }}) + +--- + +## Overview + +Taskiq-Flow includes a powerful scheduling system for running pipelines at specific times or intervals, built on top of APScheduler. + +This guide covers: + +- `PipelineScheduler` — Main scheduling interface +- Cron expressions and patterns +- Interval-based scheduling +- One-off executions +- Timezone handling +- Job persistence and management +- Missed execution handling + +--- + +## 1. Quick Start + +> **Prerequisite**: Install the `scheduler` extra: +> `pip install "taskiq-flow[scheduler]"` +> Without it, `PipelineScheduler` raises `ImportError` at construction time. + +```python +from taskiq_flow import Pipeline, PipelineScheduler + +# Create your pipeline +pipeline = Pipeline(broker).call_next(my_task).call_next(another_task) + +# Create scheduler # requires taskiq-flow[scheduler] +scheduler = PipelineScheduler(broker) + +# Schedule to run every minute +job_id = await scheduler.schedule( + pipeline, + cron="* * * * *", # Every minute + args=("some", "data") # Arguments passed to pipeline.kiq() +) + +# Start the scheduler (runs in background) +await scheduler.start() + +# ... keep your application running ... +# scheduler runs in background tasks + +# Shutdown gracefully +await scheduler.shutdown() +``` + +That's the basics. Let's explore the features in detail. + +--- + +## 2. PipelineScheduler + +The main class for scheduling pipeline executions. + +### 2.1. 
Initialization + +```python +from taskiq_flow import PipelineScheduler + +scheduler = PipelineScheduler( + broker, + store="memory", # "memory" or "sqlite" + store_path="./scheduler_jobs.db" # for sqlite store +) +``` + +**Storage options**: + +| Store | Persistence | Multi-worker | Use case | +|-------|-------------|--------------|----------| +| `"memory"` | No | No | Development, single-process | +| `"sqlite"` | Yes | Limited* | Single-worker production, simple persistence | +| `"postgresql"` (via URL) | Yes | Yes | Production multi-worker, HA | +| `"mysql"` (via URL) | Yes | Yes | Production multi-worker, alternative | +| `"redis"` | | | **Not implemented** (raises `NotImplementedError`) | + +*SQLite store works with single scheduler instance; multiple workers need PostgreSQL/MySQL. + +**Recommendation**: +- Development/mocks → `store="memory"` +- Single-worker production → `store="sqlite"` with persistent path +- Production multi-worker → `store="postgresql://user:pass@host/dbname"` (recommended) # pragma: allowlist secret + +> **Note**: PostgreSQL and MySQL support is **already implemented** in `JobPersistenceManager` and works via SQLAlchemy async engine. See [Advanced Storage (PostgreSQL/MySQL)](#advanced-storage-postgresqlmysql) below. + +### 2.2. Starting & Stopping + +```python +# Start scheduler (begins monitoring schedules) +await scheduler.start() + +# Run in background while app is alive +# Typically integrated into FastAPI/Quart lifespan events + +# Graceful shutdown +await scheduler.shutdown() +# Waits for running jobs to finish, cancels pending +``` + +**Automatic start with context manager**: + +```python +async with PipelineScheduler(broker) as scheduler: + await scheduler.schedule(pipeline, cron="*/5 * * * *") + # Scheduler automatically starts on __aenter__ + # ... run your app ... +# Automatically shuts down on __aexit__ +``` + +--- + +## 3. Scheduling Methods + +### 3.1. 
Cron Scheduling + +```python +job_id = await scheduler.schedule( + pipeline, + cron="0 * * * *", # Every hour at minute 0 + args=("input_data",), + kwargs={"key": "value"}, + pipeline_id="hourly_job_001" +) +``` + +**Cron expression format**: `minute hour day month day-of-week` + +| Field | Allowed values | Special characters | +|-------|----------------|-------------------| +| Minute | 0-59 | `* , - /` | +| Hour | 0-23 | `* , - /` | +| Day | 1-31 | `* , - / ?` | +| Month | 1-12 | `* , - /` | +| Day of week | 0-6 (Sun-Sat) | `* , - / ?` | + +**Examples**: + +```python +"*/5 * * * *" # Every 5 minutes +"0 9 * * *" # Daily at 9:00 AM +"0 0 * * 0" # Weekly on Sunday at midnight +"0 0 1 * *" # Monthly on the 1st at midnight +"0 0 1 1 *" # Yearly on January 1st at midnight +``` + +### 3.2. Interval Scheduling + +```python +# Run every N seconds/minutes/hours/days/weeks +job_id = await scheduler.schedule_interval( + pipeline, + seconds=30, # Every 30 seconds + # minutes=5, # Every 5 minutes + # hours=1, # Every hour + args=(data,) +) +``` + +**Note**: Interval scheduling uses APScheduler's `IntervalTrigger`. Cron is generally preferred for production (more flexible, handles DST). + +### 3.3. One-Off Execution (Run At) + +Schedule a single future execution: + +```python +from datetime import datetime, timedelta + +job_id = await scheduler.schedule_at( + pipeline, + run_at=datetime.now() + timedelta(hours=2), # In 2 hours + args=(payload,) +) +``` + +Or schedule for a specific calendar time: + +```python +run_time = datetime(2026, 12, 31, 23, 59, 59) +await scheduler.schedule_at(pipeline, run_at=run_time) +``` + +--- + +## 4. Job Configuration + +### 4.1. 
Job ID + +Each scheduled job gets a unique identifier: + +```python +job_id = await scheduler.schedule(pipeline, cron="* * * * *") +print(job_id) # e.g., "job_20260505_abcdef123456" +``` + +Customize the ID: + +```python +job_id = await scheduler.schedule( + pipeline, + cron="0 9 * * *", + job_id="daily_etl_9am" # human-readable ID +) +``` + +Useful for later management (update, cancel, list). + +### 4.2. Arguments & Keyword Arguments + +Pass arguments to the pipeline's `kiq()` method: + +```python +await scheduler.schedule( + pipeline, + cron="* * * * *", + args=("positional_arg",), # tuple + kwargs={"option": True}, # dict + pipeline_id="my_pipeline" # explicit pipeline ID +) +``` + +The scheduler calls: `await pipeline.kiq(*args, **kwargs)` on each trigger. + +### 4.3. Pipeline ID + +Each scheduled execution can override the pipeline's default ID: + +```python +pipeline = Pipeline(broker) # generates random ID by default + +# Schedule with explicit ID (ensures uniqueness for tracking) +await scheduler.schedule( + pipeline, + cron="*/5 * * * *", + pipeline_id="my_pipeline_v1" +) +``` + +**Best practice**: Include timestamp or version in ID for tracking: + +```python +job_id = f"batch_process_v2_{int(time.time())}" +``` + +--- + +## 5. Job Management + +### 5.1. List Scheduled Jobs + +```python +jobs = await scheduler.list_jobs() +for job in jobs: + print(f"ID: {job.id}") + print(f" Trigger: {job.trigger}") + print(f" Next run: {job.next_run_time}") + print(f" Pipeline: {job.pipeline_id}") +``` + +### 5.2. Get Job Details + +```python +job = await scheduler.get_job(job_id) +if job: + print(f"Job {job.id} is scheduled for {job.next_run_time}") +``` + +### 5.3. Modify a Job + +```python +# Reschedule an existing job +await scheduler.reschedule_job( + job_id, + cron="0 */2 * * *" # Change to every 2 hours +) + +# Update job arguments +await scheduler.modify_job( + job_id, + args=("new_arg",), + kwargs={"updated": True} +) +``` + +### 5.4. 
Remove (Cancel) a Job + +```python +await scheduler.remove_job(job_id) +# Future executions are cancelled; running job continues +``` + +### 5.5. Pause & Resume + +```python +# Temporarily pause a job +await scheduler.pause_job(job_id) + +# Resume later +await scheduler.resume_job(job_id) +``` + +--- + +## 6. Tracking Scheduled Executions + +Each scheduled pipeline execution is automatically tracked if the pipeline has tracking enabled: + +```python +tracking = PipelineTrackingManager().with_auto_storage(broker) +pipeline = Pipeline(broker).with_tracking(tracking) + +scheduler = PipelineScheduler(broker) +await scheduler.schedule(pipeline, cron="*/5 * * * *") + +# Later, query execution history +history = await tracking.get_history() +for run in history: + print(f"Run {run.pipeline_id}: {run.status} at {run.started_at}") +``` + +**Distinguishing scheduled runs**: Use descriptive `pipeline_id` patterns: + +```python +await scheduler.schedule( + pipeline, + cron="0 2 * * *", # Daily 2AM + pipeline_id=f"daily_etl_{datetime.now().strftime('%Y%m%d')}" +) +# Each day gets a unique pipeline ID for tracking +``` + +--- + +## 7. Missed Execution Handling + +When a scheduled job's trigger time is missed (e.g., scheduler downtime, long-running job), APScheduler provides controls: + +### 7.1. Coalesce + +Combine multiple missed runs into a single execution: + +```python +from apscheduler.triggers.cron import CronTrigger + +trigger = CronTrigger( + hour=9, + minute=0, + coalesce=True # If scheduler was down at 9:00, run once at 9:05 instead of 5 times +) + +job = await scheduler.schedule(pipeline, trigger=trigger) +``` + +### 7.2. Max Instances + +Prevent overlapping runs of the same job: + +```python +# Job won't start a new execution if previous instance is still running +trigger = CronTrigger(minute="*/5", max_instances=1) + +job = await scheduler.schedule(pipeline, trigger=trigger) +# If a 9:00 run is still executing at 9:05, the 9:05 run is skipped +``` + +### 7.3. 
Misfire Grace Time + +Allow a window after scheduled time during which execution is still valid: + +```python +from apscheduler.triggers.cron import CronTrigger + +# If scheduler restarts within 10 minutes of scheduled time, still run +trigger = CronTrigger( + minute="*/5", + misfire_grace_time=600 # 10 minutes in seconds +) + +job = await scheduler.schedule(pipeline, trigger=trigger) +``` + +--- + +## 8. Timezone Handling + +By default, APScheduler uses the local system timezone. For production, set explicit timezone: + +```python +from apscheduler.triggers.cron import CronTrigger +import pytz + +# Schedule for 9:00 AM in New York timezone +trigger = CronTrigger( + hour=9, + minute=0, + timezone=pytz.timezone("America/New_York") +) + +job = await scheduler.schedule(pipeline, trigger=trigger) +``` + +Or set globally on scheduler: + +```python +scheduler = PipelineScheduler( + broker, + timezone="UTC" # or "America/Los_Angeles", "Europe/Paris", ... +) +``` + +**Daylight Saving Time (DST)**: Cron triggers with explicit timezone handle DST transitions automatically. Jobs scheduled at "9:00" will still run at 9:00 local time when clocks shift. + +--- + +## 9. Custom Triggers + +Beyond cron and intervals, use any APScheduler trigger: + +```python +from apscheduler.triggers.date import DateTrigger +from datetime import datetime, timedelta + +# Run once at specific datetime +trigger = DateTrigger(run_date=datetime(2026, 12, 31, 23, 59, 59)) +job = await scheduler.schedule(pipeline, trigger=trigger) + +# Run after a delay (from now) +trigger = DateTrigger(run_date=datetime.now() + timedelta(minutes=10)) +job = await scheduler.schedule(pipeline, trigger=trigger) +``` + +See APScheduler documentation for advanced triggers (calendar-based, etc.). + +--- + +## 10. Error Handling + +### 10.1. 
Catch Job Execution Errors + +Wrap pipeline execution with error handling: + +```python +@broker.task +async def my_pipeline_task(data): + try: + result = await process(data) + return result + except Exception as exc: + # Log error, but let scheduler continue + logger.error(f"Pipeline failed: {exc}") + raise # Scheduler records failure, continues with next schedule +``` + +### 10.2. Scheduler-Level Error Callbacks + +```python +scheduler = PipelineScheduler(broker) + +@scheduler.on_error +async def handle_scheduler_error(job_id, exception): + logger.error(f"Job {job_id} failed with: {exception}") + send_alert_email(job_id, exception) + +await scheduler.start() +``` + +### 10.3. Dead Letter Queue (DLQ) + +For jobs that repeatedly fail, route to DLQ: + +```python +from taskiq_flow.middlewares.retry import RetryMiddleware + +# Configure retry with backoff +broker.add_middlewares( + RetryMiddleware( + max_retries=3, + delay=10, + backoff=2 + ) +) + +# After max retries, task goes to DLQ (if broker supports it) +# RedisStreamBroker: dead_letter_stream # requires taskiq-flow[brokers] +# KafkaBroker: dead_letter_topic +``` + +--- + +## 11. Monitoring Scheduled Jobs + +### 11.1. Health Check + +```python +async def scheduler_health(): + stats = scheduler.get_stats() + return { + "scheduled_jobs": len(scheduler.get_jobs()), + "running_jobs": stats.active_jobs, + "next_run": min(job.next_run_time for job in scheduler.get_jobs()) + } +``` + +### 11.2. Logging + +Configure structured logging: + +```python +import logging +logger = logging.getLogger("taskiq_flow.scheduler") + +logging.basicConfig( + level=logging.INFO, + format='%(asctime)s - %(name)s - %(levelname)s - %(message)s' +) + +# Scheduler logs: +# 2026-05-05 10:00:00 - taskiq_flow.scheduler - INFO - Running job daily_etl_9am +# 2026-05-05 10:00:05 - taskiq_flow.scheduler - INFO - Job daily_etl_9am completed successfully +``` + +### 11.3. 
Metrics + +Integrate with Prometheus: + +```python +from prometheus_client import Counter, Gauge + +SCHEDULED_JOBS = Gauge('scheduled_jobs_total', 'Total scheduled jobs') +JOB_RUNS = Counter('scheduler_job_runs_total', 'Job executions', ['job_id']) +JOB_FAILURES = Counter('scheduler_job_failures_total', 'Job failures', ['job_id']) + +class MetricsScheduler(PipelineScheduler): + async def _run_job(self, job_id, pipeline): + JOB_RUNS.labels(job_id=job_id).inc() + try: + await super()._run_job(job_id, pipeline) + except Exception: + JOB_FAILURES.labels(job_id=job_id).inc() + raise +``` + +--- + +## 12. Production Considerations + +### 12.1. High Availability + +For production HA deployments, run multiple scheduler instances with a shared job store: + +```python +# Scheduler 1 +scheduler1 = PipelineScheduler( + broker, + store="postgresql", + db_url="postgresql+asyncpg://user:pass@host/db" # pragma: allowlist secret +) + +# Scheduler 2 (identical config) — only one will acquire jobs +scheduler2 = PipelineScheduler( + broker, + store="postgresql", + # pragma: allownextline secret + db_url="postgresql+asyncpg://user:pass@host/db" # pragma: allowlist secret +) +# APScheduler's job stores use row-level locking; one scheduler per job +``` + +See [Advanced Storage (PostgreSQL/MySQL)](#advanced-storage-postgresqlmysql) for detailed configuration. + +### 12.2. Long-Running Jobs + +If a pipeline execution might exceed its schedule interval: + +```python +# Ensure no overlap +trigger = CronTrigger(minute="*/5", max_instances=1, coalesce=True) +job = await scheduler.schedule(pipeline, trigger=trigger) + +# Pipeline itself has timeout +pipeline.with_timeout(seconds=300) # 5 minutes max +``` + +### 12.3. 
Start-up Behavior + +On scheduler restart, missed jobs are handled according to `misfire_grace_time`: + +```python +# Scheduler restarts at 9:05 AM, job was scheduled for 9:00 +# With misfire_grace_time=600 (10 min): job runs at 9:05 +# With misfire_grace_time=0: job is skipped +trigger = CronTrigger(hour=9, misfire_grace_time=600) +``` + +### 12.4. Job Store Size + +The job store accumulates job history. Periodically clean up: + +```python +# Remove jobs older than 30 days +old_jobs = await scheduler.list_jobs() +for job in old_jobs: + if job.next_run_time < datetime.now() - timedelta(days=30): + await scheduler.remove_job(job.id) +``` + +### 12.5. Advanced Storage (PostgreSQL/MySQL) + +`JobPersistenceManager` natively supports PostgreSQL and MySQL via SQLAlchemy AsyncEngine. + +#### PostgreSQL Configuration (Recommended for Production) + +```python +from taskiq_flow.scheduling.storage import JobPersistenceManager + +# PostgreSQL with asyncpg +storage = JobPersistenceManager( + # pragma: allownextline secret + db_url="postgresql+asyncpg://user:pass@localhost:5432/taskiq_flow", # pragma: allowlist secret + async_mode=True, +) + +# Using the URL helper +storage = JobPersistenceManager( + db_url=JobPersistenceManager.get_connection_url( + "postgresql", + host="localhost", + port=5432, + user="taskiq", + # pragma: allownextline secret + password="secret", # pragma: allowlist secret + database="taskiq_flow", + ), + async_mode=True, +) +``` + +#### MySQL Configuration + +```python +storage = JobPersistenceManager( + # pragma: allownextline secret + db_url="mysql+aiomysql://user:pass@localhost:3306/taskiq_flow", # pragma: allowlist secret + async_mode=True, +) +``` + +#### SQLite Configuration (Development) + +```python +# Sync (development only) +storage = JobPersistenceManager( + db_url="sqlite:///jobs.db", + async_mode=False, +) + +# Async (recommended even for SQLite) +storage = JobPersistenceManager( + db_url="sqlite+aiosqlite:///jobs.db", + async_mode=True, +) +``` 
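
The three configurations above differ only in the `db_url` string, so one way to avoid hard-coding credentials is to assemble that string from the environment. The following is a minimal sketch, not part of taskiq-flow: the function name `job_store_url` and every `JOB_STORE_*` variable are hypothetical names chosen for illustration. Only the driver prefixes (`postgresql+asyncpg`, `mysql+aiomysql`, `sqlite+aiosqlite`) come from the examples above. The user and password are percent-encoded because characters such as `@` or `/` would otherwise break URL parsing.

```python
import os
from typing import Mapping, Optional
from urllib.parse import quote_plus

# Driver prefixes taken from the configuration examples in this guide.
_DRIVERS = {"postgresql": "postgresql+asyncpg", "mysql": "mysql+aiomysql"}
_DEFAULT_PORTS = {"postgresql": "5432", "mysql": "3306"}


def job_store_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Build a job-store URL from (hypothetical) JOB_STORE_* variables."""
    env = os.environ if env is None else env
    backend = env.get("JOB_STORE_BACKEND", "sqlite")

    if backend == "sqlite":
        # Async SQLite driver, as recommended above even for development.
        return f"sqlite+aiosqlite:///{env.get('JOB_STORE_PATH', 'jobs.db')}"

    if backend not in _DRIVERS:
        raise ValueError(f"Unsupported job store backend: {backend!r}")

    # Percent-encode credentials so special characters survive URL parsing.
    user = quote_plus(env["JOB_STORE_USER"])
    password = quote_plus(env["JOB_STORE_PASSWORD"])
    host = env.get("JOB_STORE_HOST", "localhost")
    port = env.get("JOB_STORE_PORT", _DEFAULT_PORTS[backend])
    database = env.get("JOB_STORE_DB", "taskiq_flow")
    return f"{_DRIVERS[backend]}://{user}:{password}@{host}:{port}/{database}"
```

The returned string could then be passed as `db_url` to `JobPersistenceManager`, assuming the constructor accepts it as in the examples above.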
+ +#### Integration with PipelineScheduler + +```python +from taskiq_flow.scheduling.scheduler import PipelineScheduler + +scheduler = PipelineScheduler( + broker, + job_store_url="postgresql+asyncpg://user:pass@localhost:5432/taskiq_flow", # pragma: allowlist secret +) +``` + +#### CRUD Operations with JobPersistenceManager + +```python +from datetime import datetime, timezone +from taskiq_flow.scheduling.storage import JobPersistenceManager, SchedulerJob, PipelineExecution + +storage = JobPersistenceManager(db_url="sqlite:///test.db") + +# Save a job +job = SchedulerJob( + id="job_001", + pipeline_id="etl_daily", + label="Daily ETL", + cron="0 2 * * *", + timezone="UTC", +) +await storage.save_job(job) + +# Load all jobs +jobs = await storage.load_jobs() +for j in jobs: + print(f"{j.id}: {j.cron} - {j.pipeline_id}") + +# Save execution history +execution = PipelineExecution( + job_id="job_001", + pipeline_id="etl_daily", + status="success", + started_at=datetime.now(timezone.utc), + completed_at=datetime.now(timezone.utc), + duration_seconds=45.2, +) +await storage.save_execution_history("job_001", execution) + +# Retrieve history +history = await storage.get_execution_history("job_001", limit=10) +for run in history: + print(f" {run.status} - {run.duration_seconds}s at {run.started_at}") +``` + +| Backend | Async | Multi-worker | Production | +|---------|-------|--------------|------------| +| SQLite | `sqlite+aiosqlite` | Single-writer | Dev / small projects | +| PostgreSQL | `postgresql+asyncpg` | Full | Recommended | +| MySQL | `mysql+aiomysql` | Full | Supported | + +--- + +## 13. Common Patterns + +### 13.1. Daily ETL Pipeline + +```python +@scheduler.schedule( + pipeline=etl_pipeline, + cron="0 2 * * *", # 2:00 AM daily + pipeline_id="daily_etl" +) +async def run_daily_etl(): + pass +``` + +### 13.2. 
Periodic Health Check + +```python +health_pipeline = Pipeline(broker).call_next(health_check_task) + +await scheduler.schedule_interval( + health_pipeline, + minutes=5, + pipeline_id="health_check_5m" +) +``` + +### 13.3. Dynamic Scheduling + +Create and cancel jobs at runtime: + +```python +# Schedule on-demand +job_id = await scheduler.schedule( + pipeline, + run_at=datetime.now() + timedelta(minutes=10) +) + +# Cancel if no longer needed +await scheduler.remove_job(job_id) +``` + +### 13.4. Chained Pipelines + +Pipeline A triggers Pipeline B via scheduling: + +```python +@broker.task +async def pipeline_a_finished(result): + # Schedule pipeline B to run after A completes + job_id = await scheduler.schedule_at( + pipeline_b, + run_at=datetime.now() + timedelta(minutes=5) + ) + return job_id +``` + +--- + +## 14. Troubleshooting + +### Jobs Not Running + +**Symptom**: Scheduled jobs never execute. + +**Fixes**: +- Ensure `await scheduler.start()` is called +- Check cron expression validity: `CronTrigger.from_crontab("* * * * *")` +- Verify timezone matches expected time (check server TZ) +- Confirm job was successfully scheduled (non-None job_id) +- Check scheduler logs for errors + +### Duplicate Job Execution + +**Symptom**: Same job runs multiple times concurrently. + +**Fixes**: +- Set `max_instances=1` in trigger +- Use `coalesce=True` to combine missed runs +- Ensure only one scheduler instance is running (HA needs shared store) + +### Job Store Persistence Not Working + +**Symptom**: Jobs disappear after restart despite sqlite store. + +**Fixes**: +- Use `store="sqlite"` and specify `store_path` +- Ensure file path is writable and persistent between restarts +- Don't mix memory and sqlite stores in same app + +### Timezone Issues + +**Symptom**: Job runs at wrong time (off by hours). 
+ +**Fixes**: +- Set explicit timezone on scheduler: `PipelineScheduler(broker, timezone="UTC")` +- Or on trigger: `CronTrigger(hour=9, timezone=pytz.timezone("America/New_York"))` +- Verify server's system timezone matches expectations + +--- + +## 15. Summary + +PipelineScheduler provides robust, production-ready scheduling: + +| Feature | API | +|---------|-----| +| **Cron** | `scheduler.schedule(pipeline, cron="* * * * *")` | +| **Interval** | `scheduler.schedule_interval(pipeline, minutes=5)` | +| **One-off** | `scheduler.schedule_at(pipeline, run_at=datetime)` | +| **Management** | `list_jobs()`, `remove_job()`, `pause_job()` | +| **Persistence** | SQLite (single-worker), PostgreSQL/MySQL (multi-worker) | +| **Tracking** | Automatic with `PipelineTrackingManager` | +| **Concurrency** | `max_instances`, `coalesce` controls | + +**Typical production setup**: + +```python +tracking = PipelineTrackingManager().with_storage(RedisPipelineStorage(redis)) +pipeline = Pipeline(broker).with_tracking(tracking) + +scheduler = PipelineScheduler( + broker, + job_store_url="postgresql+asyncpg://user:pass@host/taskiq_flow", # pragma: allowlist secret +) +await scheduler.start() + +# Schedule your jobs... +``` + +--- + +## Next Steps + +- **[Retry Guide]({{ '/en/guides/retry/' | relative_url }})** — Error recovery and retry policies +- **[Performance Guide]({{ '/en/guides/performance/' | relative_url }})** — Optimize scheduled pipeline performance +- **[Tracking Guide]({{ '/en/guides/tracking/' | relative_url }})** — Monitor scheduled job history + +--- + +*Schedule pipelines like cron jobs. 
Track them like never before.* diff --git a/docs/_en/guides/security.md b/docs/_en/guides/security.md index 31f95cc..8233be0 100644 --- a/docs/_en/guides/security.md +++ b/docs/_en/guides/security.md @@ -21,6 +21,7 @@ TaskIQ-Flow provides a flexible security system that can be enabled via configur Security features are configured in the `TaskiqFlowConfig` object or via environment variables. The main security settings are: +{% raw %} ```python from taskiq_flow import TaskiqFlowConfig @@ -59,7 +60,7 @@ config = TaskiqFlowConfig( websocket_max_connections=1000, ) ``` - +{% endraw %} Audit logs are handled by :class:`~taskiq_flow.security.audit.AuditLogger`, instantiated automatically by the API (no configuration field needed). @@ -87,19 +88,21 @@ Clients must include their API key in the `X-API-Key` header for HTTP requests o Example HTTP request: +{% raw %} ```http GET /api/pipelines X-API-Key: admin-key #pragma: allowlist secret ``` - +{% endraw %} ### JWT Authentication If a JWT secret is configured (via `jwt_secret`), clients can authenticate using a JSON Web Token (JWT) in the `Authorization` header: +{% raw %} ``` Authorization: Bearer ``` - +{% endraw %} The JWT must contain a `sub` (subject) field identifying the user and a `roles` list. ## Authorization @@ -131,6 +134,7 @@ The middleware respects the `X-Forwarded-Proto` header so deployments behind a T Run TaskIQ-Flow behind a WSGI/ASGI server such as Uvicorn with Docker: +{% raw %} ```dockerfile # Dockerfile FROM python:3.12-slim @@ -144,7 +148,8 @@ COPY . . 
EXPOSE 8000 CMD ["uvicorn", "my_app:app", "--host", "0.0.0.0", "--port", "8000"] ``` - +{% endraw %} +{% raw %} ```yaml # docker-compose.yml services: @@ -170,11 +175,12 @@ services: volumes: redis_data: ``` - +{% endraw %} ### Reverse Proxy (nginx) Place nginx in front of the application to terminate TLS, enforce HTTPS, and add security headers: +{% raw %} ```nginx # /etc/nginx/sites-available/taskiq-flow server { @@ -217,7 +223,7 @@ server { } } ``` - +{% endraw %} With this configuration: 1. All HTTP traffic is redirected to HTTPS (`require_https` also enforced inside the app) 2. Security headers are added to every response @@ -266,6 +272,7 @@ WebSocket connections follow the same security model as HTTP: Here is a complete example of securing a TaskIQ-Flow API with the **current** API (`create_visualization_api`, flat `TaskiqFlowConfig` fields): +{% raw %} ```python from taskiq import Taskiq, InMemoryBroker from taskiq_flow import TaskiqFlowConfig, create_visualization_api @@ -312,9 +319,10 @@ audit_logger = AuditLogger() # uvicorn app:app --host 0.0.0.0 --port 8000 # All endpoints now require authentication. ``` - +{% endraw %} ## Testing Security +{% raw %} ```bash # No credentials → 401 Unauthorized curl -i http://localhost:8000/pipelines @@ -325,7 +333,7 @@ curl -i -H "X-API-Key: invalid-key" http://localhost:8000/pipelines # Valid viewer key → 200 OK curl -i -H "X-API-Key: viewer-key" http://localhost:8000/pipelines ``` - +{% endraw %} For WebSocket testing, use a WebSocket client library and include the `X-API-Key` header during the upgrade request. 
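
The curl checks above can also be scripted, for example as part of a smoke test after deployment. The sketch below uses only the Python standard library. The helper names `status_for` and `check_security` are illustrative and not part of TaskIQ-Flow. The `/pipelines` path, the `X-API-Key` header, and the `invalid-key` value mirror the curl commands, and the expected 401/403/200 outcomes are the ones listed there.

{% raw %}
```python
import urllib.error
import urllib.request
from typing import Dict, Optional


def status_for(url: str, api_key: Optional[str] = None, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET request, optionally sending X-API-Key."""
    request = urllib.request.Request(url)
    if api_key is not None:
        request.add_header("X-API-Key", api_key)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # urllib raises on 4xx/5xx responses; the status code is what we inspect.
        return exc.code


def check_security(base_url: str, viewer_key: str) -> Dict[str, int]:
    """Repeat the three curl checks and collect the observed status codes."""
    url = f"{base_url}/pipelines"
    return {
        "no_credentials": status_for(url),
        "invalid_key": status_for(url, "invalid-key"),
        "viewer_key": status_for(url, viewer_key),
    }
```
{% endraw %}

Against a server configured as in the complete example, `check_security("http://localhost:8000", "viewer-key")` would be expected to report 401, 403, and 200 respectively.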
## Conclusion diff --git a/docs/_en/guides/tasks.md b/docs/_en/guides/tasks.md index a754a0d..45ba622 100644 --- a/docs/_en/guides/tasks.md +++ b/docs/_en/guides/tasks.md @@ -1,491 +1,491 @@ ---- -title: Tasks Guide -nav_order: 21 ---- -# Tasks Guide - -**Defining tasks, decorators, metadata, and resource management** - -> **Version**: {VERSION} | **Related**: [Pipelines Guide]({{ '/en/guides/pipelines/' | relative_url }}), [Execution Guide]({{ '/en/guides/execution/' | relative_url }}) - ---- - -## Overview - -Tasks are the fundamental building blocks of Taskiq-Flow pipelines. This guide covers: - -- Task definition with `@broker.task` -- The `@pipeline_task` decorator for dataflow pipelines -- Task metadata and annotations -- Resource profiles and constraints -- Retry configuration -- Input/output specification - ---- - -## 1. What Is a Task? - -A **Task** is an asynchronous function that can be executed by a Taskiq broker, optionally with retry logic, timeouts, and metadata for pipeline orchestration. - -### Minimal Task Definition - -```python -from taskiq import InMemoryBroker - -broker = InMemoryBroker() - -@broker.task -async def my_task(value: int) -> int: - return value * 2 -``` - -**Requirements**: -- Must be an `async def` function (or regular `def` for sync tasks) -- Must be decorated with `@broker.task` (or `@broker.task(...)` with options) -- Can accept any serializable parameters -- Must return a JSON-serializable value - ---- - -## 2. Task Decorators - -### 2.1. `@broker.task` — Basic Task - -```python -@broker.task -def add(a: int, b: int) -> int: - return a + b -``` - -**Options**: - -```python -@broker.task( - timeout=30, # Seconds before task times out - retry_policy=None, # Custom RetryPolicy (see Retry Guide) - max_retries=3, # Override global default - queue="default", # Route to specific queue - labels={"type": "cpu"} # Custom metadata labels -) -async def slow_task(): - await asyncio.sleep(10) - return "done" -``` - -### 2.2. 
`@pipeline_task` — Dataflow Annotation - -For `DataflowPipeline`, use `@pipeline_task(output=...)` to declare what the task produces: - -```python -from taskiq_flow import pipeline_task - -@broker.task -@pipeline_task(output="features") -def extract(data: list[str]) -> dict: - return {"features": compute_features(data)} - -# Downstream task automatically receives 'features' parameter: -@broker.task -@pipeline_task(output="tags") -def tag(features: dict) -> list[str]: - # 'features' is automatically passed from extract_task - return generate_tags(features) -``` - -**Parameters**: - -| Parameter | Type | Description | -|-----------|------|-------------| -| `output` | `str` | Output key name (must match downstream parameter names) | -| `outputs` | `list[str]` | Multiple outputs (for tuple-returning tasks) | -| `inputs` | `list[str]` | Explicit input dependencies (overrides automatic) | -| `description` | `str` | Human-readable task description | - -**Multiple outputs**: -```python -@broker.task -@pipeline_task(outputs=["features", "metadata"]) -def split_output(data: str) -> tuple[dict, dict]: - features = extract_features(data) - metadata = extract_metadata(data) - return features, metadata # tuple unpacked to both outputs -``` - -### 2.3. `@pipeline_task_multi_output` — Alternative - -Same as `@pipeline_task(outputs=[...])`; provided for clarity: - -```python -from taskiq_flow import pipeline_task_multi_output - -@broker.task -@pipeline_task_multi_output(outputs=["x", "y"]) -def split(value: int) -> tuple[int, int]: - return value // 2, value % 2 -``` - ---- - -## 3. Task Metadata - -Enhance tasks with metadata for documentation, monitoring, and auto-discovery. - -### 3.1. 
Standard Attributes - -```python -@broker.task( - name="process_audio_track", # Override auto-generated name - labels={ - "category": "audio_processing", - "priority": "high" - } -) -async def process_track(track_id: str) -> dict: - return {"track": track_id, "status": "processed"} -``` - -### 3.2. Custom Task Info - -```python -from taskiq_flow import TaskInfo - -task_info = TaskInfo( - name="extract_spectrogram", - description="Extract mel-spectrogram from audio waveform", - parameters={ - "sample_rate": {"type": "int", "default": 22050}, - "n_mels": {"type": "int", "default": 128} - }, - outputs=["spectrogram", "sample_rate"] -) - -@broker.task -@pipeline_task(output="spectrogram", description=task_info.description) -def extract_spectrogram(audio: np.ndarray, sample_rate: int = 22050, n_mels: int = 128): - # implementation... - return spectrogram -``` - ---- - -## 4. Resource Profiles - -Control CPU and memory allocation per task for resource-aware scheduling. - -### 4.1. CPU Profile - -```python -from taskiq_flow import CPUProfile - -@broker.task -@CPUProfile(cpu_units=2) # Requires 2 CPU cores -def heavy_computation(data): - # This task will be scheduled on workers with at least 2 cores - pass -``` - -**`cpu_units` values**: - -| Value | Meaning | -|-------|---------| -| `0.5` | Half a core (background task) | -| `1` | One full core (default) | -| `2` | Two cores (CPU-intensive) | - -### 4.2. RAM Profile - -```python -from taskiq_flow import RAMProfile - -@broker.task -@RAMProfile(ram_mb=2048) # Requires 2GB RAM -def memory_intensive(data): - # Will only run on workers with at least 2GB available RAM - pass -``` - -**Resource-aware scheduling** (requires compatible worker pool): - -```python -from taskiq_flow import ResourceAwareWorkerPool - -pool = ResourceAwareWorkerPool( - workers=[ - {"cpu_cores": 4, "ram_gb": 8}, - {"cpu_cores": 2, "ram_gb": 4}, - ] -) -# Tasks are routed to workers with sufficient resources -``` - -### 4.3. 
Combined Profiles - -```python -from taskiq_flow import CPUProfile, RAMProfile - -@broker.task -@CPUProfile(cpu_units=4) -@RAMProfile(ram_mb=4096) -def gpu_style_task(data): - # High-resource task - pass -``` - ---- - -## 5. Input/Output Specification - -### 5.1. Type Hints for Documentation - -```python -@broker.task -async def process( - text: str, # Required input - max_length: int = 100, # Optional with default - *, - strict: bool = False # Keyword-only argument -) -> dict: - return {"processed": text[:max_length]} -``` - -### 5.2. Pydantic Models (Recommended for Complex Data) - -```python -from pydantic import BaseModel - -class AudioFeatures(BaseModel): - duration: float - tempo: float - key: str - -@broker.task -async def extract_features(audio_path: str) -> AudioFeatures: - # Pydantic validates and serializes automatically - return AudioFeatures(duration=180.0, tempo=120.0, key="C") -``` - -### 5.3. Output Multiple Values - -Tasks can return any JSON-serializable type: - -```python -@broker.task -def split(data: str) -> tuple[str, str]: - return data[:10], data[10:] # Returns two values - -# With @pipeline_task(outputs=["first", "second"]) -@pipeline_task(outputs=["head", "tail"]) -def split(data): - return data[:10], data[10:] -# Produces two outputs: "head" and "tail" -``` - ---- - -## 6. Retry Configuration - -### 6.1. Decorator-Level Retry - -```python -@broker.task( - retry_policy={ - "max_retries": 3, - "delay": 5.0, - "backoff": 2.0 # Exponential backoff multiplier - } -) -async def flaky_task(): - # Will retry up to 3 times with delays: 5s, 10s, 20s - possibly_fails() -``` - -### 6.2. Pipeline-Level Retry - -Apply retry to all tasks in a pipeline: - -```python -pipeline = Pipeline(broker) -pipeline.with_retry( - max_attempts=3, - delay=1.0, - backoff=1.5 -) -``` - -### 6.3. 
Conditional Retry - -Only retry on specific exceptions: - -```python -from taskiq.exceptions import RetryException - -@broker.task -async def task_with_conditional_retry(): - try: - call_external_api() - except NetworkError: - raise RetryException("Network error, retry allowed") - except ValidationError: - raise # No retry, fail immediately -``` - -Detailed retry strategies covered in [Retry Guide]({{ '/en/guides/retry/' | relative_url }}). - ---- - -## 7. Task Discovery & Registry - -### 7.1. Automatic Discovery - -`DataflowPipeline.from_tasks()` automatically detects dependencies via type hints and `@pipeline_task` decorators. - -### 7.2. Manual Registration - -For dynamic pipelines, use `DataflowRegistry`: - -```python -from taskiq_flow import DataflowRegistry - -registry = DataflowRegistry() - -# Register with explicit I/O mapping -registry.register_task( - task=process_data, - output="processed", - inputs=["raw"] # depends on task that outputs "raw" -) - -# Discover from module -import my_tasks -for task in my_tasks.ALL_TASKS: - registry.register_task_from_object(task) -``` - -See `examples/registry_discovery_example.py`. - ---- - -## 8. Writing Testable Tasks - -Tasks should be pure functions for easy testing: - -```python -@broker.task -def process(data: dict) -> dict: - # Pure function: output depends only on input - return {"result": data["value"] * 2} - -# Unit test -def test_process(): - assert process({"value": 5}) == {"result": 10} -``` - -**Testing with broker**: - -```python -import pytest -from taskiq import InMemoryBroker - -@pytest.fixture -def test_broker(): - return InMemoryBroker(await_inplace=True) - -async def test_task_execution(test_broker): - @test_broker.task - async def my_task(x: int) -> int: - return x + 1 - - result = await my_task.kiq(5) - value = await result.wait_result() - assert value.return_value == 6 -``` - ---- - -## 9. Common Patterns - -### 9.1. 
Idempotency - -Design tasks to be safely re-runnable: - -```python -@broker.task -@pipeline_task(output="user_processed") -def process_user(user_id: str) -> dict: - # Check if already processed - if cache.get(f"processed:{user_id}"): - return {"status": "already_done"} - # Perform processing - result = heavy_compute(user_id) - cache.set(f"processed:{user_id}", result, ttl=3600) - return result -``` - -### 9.2. Composability - -Break complex logic into small, reusable tasks: - -```python -@broker.task -def validate(data): ... - -@broker.task -def transform(data): ... - -@broker.task -def enrich(data): ... - -# Compose in multiple pipelines -pipeline1 = Pipeline(broker).call_next(validate).call_next(transform) -pipeline2 = Pipeline(broker).call_next(validate).call_next(enrich) -``` - -### 9.3. Progress Reporting - -For long-running tasks, report progress via callbacks or logging: - -```python -@broker.task -async def long_task(items: list, progress_callback=None): - for i, item in enumerate(items): - result = process(item) - if progress_callback: - await progress_callback(i / len(items)) - return "done" -``` - ---- - -## 10. Anti-Patterns to Avoid - -| Anti-pattern | Why it's bad | Better approach | -|--------------|--------------|-----------------| -| Side effects in tasks | Makes testing/hard to reason about | Keep tasks pure; use `.call_after()` for side effects | -| Large return values | High memory, slow serialization | Store large results externally (DB, S3); return reference | -| Shared mutable state | Race conditions in parallel | Each task independent; pass data via return values | -| Blocking I/O without async | Blocks event loop | Use async libraries (aiohttp, asyncpg, etc.) | -| Tasks doing too much | Hard to reuse, test, debug | Break into smaller, focused tasks | - ---- - -## 11. 
Summary - -Taskiq-Flow tasks are: - -- **Flexible** — Regular Python functions with `@broker.task` -- **Observable** — Metadata, labels, and tracking -- **Resilient** — Retry policies, timeouts, error handling -- **Composable** — Small functions combine into complex workflows -- **Resource-aware** — CPU/RAM profiles for optimized scheduling - ---- - -## Next Steps - -- **[Pipeline Types]({{ '/en/guides/pipelines/' | relative_url }})** — Building workflows with tasks -- **[Execution Guide]({{ '/en/guides/execution/' | relative_url }})** — Running pipelines and handling results -- **[Retry Guide]({{ '/en/guides/retry/' | relative_url }})** — Robust error recovery strategies - ---- - -*Tasks are your workflow atoms. Learn to compose them in [Pipelines]({{ '/en/guides/pipelines/' | relative_url }}).* +--- +title: Tasks Guide +nav_order: 21 +--- +# Tasks Guide + +**Defining tasks, decorators, metadata, and resource management** + +> **Version**: {VERSION} | **Related**: [Pipelines Guide]({{ '/en/guides/pipelines/' | relative_url }}), [Execution Guide]({{ '/en/guides/execution/' | relative_url }}) + +--- + +## Overview + +Tasks are the fundamental building blocks of Taskiq-Flow pipelines. This guide covers: + +- Task definition with `@broker.task` +- The `@pipeline_task` decorator for dataflow pipelines +- Task metadata and annotations +- Resource profiles and constraints +- Retry configuration +- Input/output specification + +--- + +## 1. What Is a Task? + +A **Task** is an asynchronous function that can be executed by a Taskiq broker, optionally with retry logic, timeouts, and metadata for pipeline orchestration. 
+ +### Minimal Task Definition + +```python +from taskiq import InMemoryBroker + +broker = InMemoryBroker() + +@broker.task +async def my_task(value: int) -> int: + return value * 2 +``` + +**Requirements**: +- Must be an `async def` function (or regular `def` for sync tasks) +- Must be decorated with `@broker.task` (or `@broker.task(...)` with options) +- Can accept any serializable parameters +- Must return a JSON-serializable value + +--- + +## 2. Task Decorators + +### 2.1. `@broker.task` — Basic Task + +```python +@broker.task +def add(a: int, b: int) -> int: + return a + b +``` + +**Options**: + +```python +@broker.task( + timeout=30, # Seconds before task times out + retry_policy=None, # Custom RetryPolicy (see Retry Guide) + max_retries=3, # Override global default + queue="default", # Route to specific queue + labels={"type": "cpu"} # Custom metadata labels +) +async def slow_task(): + await asyncio.sleep(10) + return "done" +``` + +### 2.2. `@pipeline_task` — Dataflow Annotation + +For `DataflowPipeline`, use `@pipeline_task(output=...)` to declare what the task produces: + +```python +from taskiq_flow import pipeline_task + +@broker.task +@pipeline_task(output="features") +def extract(data: list[str]) -> dict: + return {"features": compute_features(data)} + +# Downstream task automatically receives 'features' parameter: +@broker.task +@pipeline_task(output="tags") +def tag(features: dict) -> list[str]: + # 'features' is automatically passed from extract_task + return generate_tags(features) +``` + +**Parameters**: + +| Parameter | Type | Description | +|-----------|------|-------------| +| `output` | `str` | Output key name (must match downstream parameter names) | +| `outputs` | `list[str]` | Multiple outputs (for tuple-returning tasks) | +| `inputs` | `list[str]` | Explicit input dependencies (overrides automatic) | +| `description` | `str` | Human-readable task description | + +**Multiple outputs**: +```python +@broker.task 
+@pipeline_task(outputs=["features", "metadata"]) +def split_output(data: str) -> tuple[dict, dict]: + features = extract_features(data) + metadata = extract_metadata(data) + return features, metadata # tuple unpacked to both outputs +``` + +### 2.3. `@pipeline_task_multi_output` — Alternative + +Same as `@pipeline_task(outputs=[...])`; provided for clarity: + +```python +from taskiq_flow import pipeline_task_multi_output + +@broker.task +@pipeline_task_multi_output(outputs=["x", "y"]) +def split(value: int) -> tuple[int, int]: + return value // 2, value % 2 +``` + +--- + +## 3. Task Metadata + +Enhance tasks with metadata for documentation, monitoring, and auto-discovery. + +### 3.1. Standard Attributes + +```python +@broker.task( + name="process_audio_track", # Override auto-generated name + labels={ + "category": "audio_processing", + "priority": "high" + } +) +async def process_track(track_id: str) -> dict: + return {"track": track_id, "status": "processed"} +``` + +### 3.2. Custom Task Info + +```python +from taskiq_flow import TaskInfo + +task_info = TaskInfo( + name="extract_spectrogram", + description="Extract mel-spectrogram from audio waveform", + parameters={ + "sample_rate": {"type": "int", "default": 22050}, + "n_mels": {"type": "int", "default": 128} + }, + outputs=["spectrogram", "sample_rate"] +) + +@broker.task +@pipeline_task(output="spectrogram", description=task_info.description) +def extract_spectrogram(audio: np.ndarray, sample_rate: int = 22050, n_mels: int = 128): + # implementation... + return spectrogram +``` + +--- + +## 4. Resource Profiles + +Control CPU and memory allocation per task for resource-aware scheduling. + +### 4.1. 
CPU Profile + +```python +from taskiq_flow import CPUProfile + +@broker.task +@CPUProfile(cpu_units=2) # Requires 2 CPU cores +def heavy_computation(data): + # This task will be scheduled on workers with at least 2 cores + pass +``` + +**`cpu_units` values**: + +| Value | Meaning | +|-------|---------| +| `0.5` | Half a core (background task) | +| `1` | One full core (default) | +| `2` | Two cores (CPU-intensive) | + +### 4.2. RAM Profile + +```python +from taskiq_flow import RAMProfile + +@broker.task +@RAMProfile(ram_mb=2048) # Requires 2GB RAM +def memory_intensive(data): + # Will only run on workers with at least 2GB available RAM + pass +``` + +**Resource-aware scheduling** (requires compatible worker pool): + +```python +from taskiq_flow import ResourceAwareWorkerPool + +pool = ResourceAwareWorkerPool( + workers=[ + {"cpu_cores": 4, "ram_gb": 8}, + {"cpu_cores": 2, "ram_gb": 4}, + ] +) +# Tasks are routed to workers with sufficient resources +``` + +### 4.3. Combined Profiles + +```python +from taskiq_flow import CPUProfile, RAMProfile + +@broker.task +@CPUProfile(cpu_units=4) +@RAMProfile(ram_mb=4096) +def gpu_style_task(data): + # High-resource task + pass +``` + +--- + +## 5. Input/Output Specification + +### 5.1. Type Hints for Documentation + +```python +@broker.task +async def process( + text: str, # Required input + max_length: int = 100, # Optional with default + *, + strict: bool = False # Keyword-only argument +) -> dict: + return {"processed": text[:max_length]} +``` + +### 5.2. Pydantic Models (Recommended for Complex Data) + +```python +from pydantic import BaseModel + +class AudioFeatures(BaseModel): + duration: float + tempo: float + key: str + +@broker.task +async def extract_features(audio_path: str) -> AudioFeatures: + # Pydantic validates and serializes automatically + return AudioFeatures(duration=180.0, tempo=120.0, key="C") +``` + +### 5.3. 
Output Multiple Values + +Tasks can return any JSON-serializable type: + +```python +@broker.task +def split(data: str) -> tuple[str, str]: + return data[:10], data[10:] # Returns two values + +# With @pipeline_task(outputs=["first", "second"]) +@pipeline_task(outputs=["head", "tail"]) +def split(data): + return data[:10], data[10:] +# Produces two outputs: "head" and "tail" +``` + +--- + +## 6. Retry Configuration + +### 6.1. Decorator-Level Retry + +```python +@broker.task( + retry_policy={ + "max_retries": 3, + "delay": 5.0, + "backoff": 2.0 # Exponential backoff multiplier + } +) +async def flaky_task(): + # Will retry up to 3 times with delays: 5s, 10s, 20s + possibly_fails() +``` + +### 6.2. Pipeline-Level Retry + +Apply retry to all tasks in a pipeline: + +```python +pipeline = Pipeline(broker) +pipeline.with_retry( + max_attempts=3, + delay=1.0, + backoff=1.5 +) +``` + +### 6.3. Conditional Retry + +Only retry on specific exceptions: + +```python +from taskiq.exceptions import RetryException + +@broker.task +async def task_with_conditional_retry(): + try: + call_external_api() + except NetworkError: + raise RetryException("Network error, retry allowed") + except ValidationError: + raise # No retry, fail immediately +``` + +Detailed retry strategies covered in [Retry Guide]({{ '/en/guides/retry/' | relative_url }}). + +--- + +## 7. Task Discovery & Registry + +### 7.1. Automatic Discovery + +`DataflowPipeline.from_tasks()` automatically detects dependencies via type hints and `@pipeline_task` decorators. + +### 7.2. 
Manual Registration + +For dynamic pipelines, use `DataflowRegistry`: + +```python +from taskiq_flow import DataflowRegistry + +registry = DataflowRegistry() + +# Register with explicit I/O mapping +registry.register_task( + task=process_data, + output="processed", + inputs=["raw"] # depends on task that outputs "raw" +) + +# Discover from module +import my_tasks +for task in my_tasks.ALL_TASKS: + registry.register_task_from_object(task) +``` + +See `examples/registry_discovery_example.py`. + +--- + +## 8. Writing Testable Tasks + +Tasks should be pure functions for easy testing: + +```python +@broker.task +def process(data: dict) -> dict: + # Pure function: output depends only on input + return {"result": data["value"] * 2} + +# Unit test +def test_process(): + assert process({"value": 5}) == {"result": 10} +``` + +**Testing with broker**: + +```python +import pytest +from taskiq import InMemoryBroker + +@pytest.fixture +def test_broker(): + return InMemoryBroker(await_inplace=True) + +async def test_task_execution(test_broker): + @test_broker.task + async def my_task(x: int) -> int: + return x + 1 + + result = await my_task.kiq(5) + value = await result.wait_result() + assert value.return_value == 6 +``` + +--- + +## 9. Common Patterns + +### 9.1. Idempotency + +Design tasks to be safely re-runnable: + +```python +@broker.task +@pipeline_task(output="user_processed") +def process_user(user_id: str) -> dict: + # Check if already processed + if cache.get(f"processed:{user_id}"): + return {"status": "already_done"} + # Perform processing + result = heavy_compute(user_id) + cache.set(f"processed:{user_id}", result, ttl=3600) + return result +``` + +### 9.2. Composability + +Break complex logic into small, reusable tasks: + +```python +@broker.task +def validate(data): ... + +@broker.task +def transform(data): ... + +@broker.task +def enrich(data): ... 
+ +# Compose in multiple pipelines +pipeline1 = Pipeline(broker).call_next(validate).call_next(transform) +pipeline2 = Pipeline(broker).call_next(validate).call_next(enrich) +``` + +### 9.3. Progress Reporting + +For long-running tasks, report progress via callbacks or logging: + +```python +@broker.task +async def long_task(items: list, progress_callback=None): + for i, item in enumerate(items): + result = process(item) + if progress_callback: + # Report the fraction completed so far; reaches 1.0 after the last item + await progress_callback((i + 1) / len(items)) + return "done" +``` + +--- + +## 10. Anti-Patterns to Avoid + +| Anti-pattern | Why it's bad | Better approach | +|--------------|--------------|-----------------| +| Side effects in tasks | Makes tasks hard to test and reason about | Keep tasks pure; use `.call_after()` for side effects | +| Large return values | High memory, slow serialization | Store large results externally (DB, S3); return reference | +| Shared mutable state | Race conditions in parallel | Each task independent; pass data via return values | +| Blocking I/O without async | Blocks event loop | Use async libraries (aiohttp, asyncpg, etc.) | +| Tasks doing too much | Hard to reuse, test, debug | Break into smaller, focused tasks | + +--- + +## 11. Summary + +Taskiq-Flow tasks are: + +- **Flexible** — Regular Python functions with `@broker.task` +- **Observable** — Metadata, labels, and tracking +- **Resilient** — Retry policies, timeouts, error handling +- **Composable** — Small functions combine into complex workflows +- **Resource-aware** — CPU/RAM profiles for optimized scheduling + +--- + +## Next Steps + +- **[Pipeline Types]({{ '/en/guides/pipelines/' | relative_url }})** — Building workflows with tasks +- **[Execution Guide]({{ '/en/guides/execution/' | relative_url }})** — Running pipelines and handling results +- **[Retry Guide]({{ '/en/guides/retry/' | relative_url }})** — Robust error recovery strategies + +--- + +*Tasks are your workflow atoms. 
Learn to compose them in [Pipelines]({{ '/en/guides/pipelines/' | relative_url }}).* diff --git a/docs/_en/guides/tracking.md b/docs/_en/guides/tracking.md index fe6074d..4365d89 100644 --- a/docs/_en/guides/tracking.md +++ b/docs/_en/guides/tracking.md @@ -1,538 +1,538 @@ ---- -title: Pipeline Tracking & Monitoring Guide -nav_order: 23 ---- -# Pipeline Tracking & Monitoring Guide - -**Real-time and historical execution monitoring with PipelineTrackingManager** - -> **Version**: {VERSION} | **Related**: [Execution Guide]({{ '/en/guides/execution/' | relative_url }}), [WebSocket Guide]({{ '/en/guides/websocket/' | relative_url }}) - ---- - -## Overview - -Taskiq-Flow provides comprehensive tracking capabilities to monitor pipeline executions in real-time and historically. This guide covers: - -- `PipelineTrackingManager` — Central tracking coordinator -- Storage backends (Memory, Redis) -- Status queries and history -- Pipeline metrics collection -- Hooking into step-level events - ---- - -## 1. Quick Start - -```python -from taskiq_flow import Pipeline, PipelineTrackingManager - -# Initialize tracking with automatic storage selection -tracking = PipelineTrackingManager().with_auto_storage(broker) - -# Attach tracking to your pipeline -pipeline = Pipeline(broker).with_tracking(tracking) - -# Execute -task = await pipeline.kiq(data) -result = await task.wait_result() - -# Query status -status = await tracking.get_status(pipeline.pipeline_id) -print(f"Status: {status.status}") # COMPLETED -print(f"Steps: {len(status.steps)}") # Number of steps executed -print(f"Duration: {status.duration_ms}ms") -``` - -That's the basic pattern. Let's dive deeper. - ---- - -## 2. PipelineTrackingManager - -The central component for recording and retrieving pipeline execution data. - -### 2.1. 
Initialization - -```python -from taskiq_flow import PipelineTrackingManager, InMemoryPipelineStorage, RedisPipelineStorage - -# Option 1: Auto-select based on broker (recommended) -tracking = PipelineTrackingManager().with_auto_storage(broker) -# Uses Redis if broker supports it, otherwise falls back to Memory - -# Option 2: Explicit memory storage (development only) -tracking = PipelineTrackingManager().with_storage(InMemoryPipelineStorage()) - -# Option 3: Explicit Redis storage (production) -tracking = PipelineTrackingManager().with_storage( - RedisPipelineStorage(redis_client) -) - -# Option 4: Custom storage backend -tracking = PipelineTrackingManager().with_storage(MyCustomStorage()) -``` - -### 2.2. Storage Lifetime - -- **InMemoryPipelineStorage**: Lives only in current process; cleared on restart -- **RedisPipelineStorage**: Persistent across processes; survives restarts - -Choose based on deployment: -- Local development → Memory -- Single-worker production → Memory (if no restart needed) -- Multi-worker / distributed → Redis (or other shared storage) - ---- - -## 3. 
Pipeline Status Model - -Every tracked pipeline produces a `PipelineStatus` object: - -```python -from taskiq_flow.tracking.models import PipelineStatus - -status: PipelineStatus -``` - -**Fields**: - -| Field | Type | Description | -|-------|------|-------------| -| `pipeline_id` | `str` | Unique pipeline instance ID | -| `status` | `str` | `PENDING`, `RUNNING`, `COMPLETED`, `FAILED`, `CANCELLED` | -| `pipeline_type` | `str` | `"sequential"` or `"dataflow"` | -| `started_at` | `datetime` | Execution start timestamp | -| `completed_at` | `datetime` | Execution end timestamp (if finished) | -| `duration_ms` | `float` | Total execution time in milliseconds | -| `steps` | `list[StepStatus]` | Per-step breakdown | -| `result` | `Any` | Final return value (if completed) | -| `error` | `str` | Error message (if failed) | - -**StepStatus** fields: - -| Field | Type | Description | -|-------|------|-------------| -| `step_name` | `str` | Name of the task | -| `status` | `str` | `PENDING`, `RUNNING`, `COMPLETED`, `FAILED` | -| `started_at` | `datetime` | Step start time | -| `completed_at` | `datetime` | Step end time | -| `duration_ms` | `float` | Step execution time | -| `result` | `Any` | Step return value | -| `error` | `str` | Error message if failed | - ---- - -## 4. Querying Status - -### 4.1. Get Status of a Pipeline - -```python -status = await tracking.get_status(pipeline_id) - -if status.status == "COMPLETED": - print(f"Pipeline finished in {status.duration_ms}ms") - print(f"Result: {status.result}") -elif status.status == "FAILED": - print(f"Failed: {status.error}") -``` - -### 4.2. List All Pipelines - -```python -all_statuses = await tracking.list_pipelines() -for status in all_statuses: - print(f"{status.pipeline_id}: {status.status}") -``` - -### 4.3. 
Filter by Status - -```python -running = await tracking.list_pipelines(status_filter="RUNNING") -failed = await tracking.list_pipelines(status_filter="FAILED") -completed = await tracking.list_pipelines(status_filter="COMPLETED") -``` - -### 4.4. Get Pipeline History - -```python -# Get last 10 pipelines -history = await tracking.get_history(limit=10) - -# Filter by date range -from datetime import datetime, timedelta -week_ago = datetime.now() - timedelta(days=7) -recent = await tracking.get_history(since=week_ago) -``` - -### 4.5. Delete Old Records - -```python -# Delete records older than 30 days -deleted = await tracking.cleanup_older_than(days=30) -print(f"Deleted {deleted} old pipeline records") - -# Delete specific pipeline -await tracking.delete_pipeline(pipeline_id) -``` - ---- - -## 5. Storage Backends - -### 5.1. InMemoryPipelineStorage - -```python -from taskiq_flow.tracking import InMemoryPipelineStorage - -storage = InMemoryPipelineStorage() -tracking = PipelineTrackingManager().with_storage(storage) - -# Data lives only in current Python process -# On restart, all history is lost -# Suitable for: development, testing, single-run scripts -``` - -**Pros**: -- Zero configuration -- Fast (no network I/O) -- Simple - -**Cons**: -- Not shareable between workers -- Lost on restart -- Limited history size - -### 5.2. 
RedisPipelineStorage - -```python -from taskiq_flow.tracking import RedisPipelineStorage -import redis.asyncio as redis - -redis_client = redis.Redis(host="localhost", port=6379, decode_responses=True) -storage = RedisPipelineStorage(redis_client) -tracking = PipelineTrackingManager().with_storage(storage) -``` - -**Configuration**: - -```python -# With custom key prefix and TTL -storage = RedisPipelineStorage( - redis_client, - key_prefix="taskiq_flow:tracking:", - ttl_seconds=604800 # 7 days retention -) -``` - -**Pros**: -- Shared across multiple workers -- Persists across restarts -- Scalable -- Can be clustered for HA - -**Cons**: -- Requires Redis server -- Network latency -- TTL management needed (avoid unbounded growth) - -### 5.3. Custom Storage - -Implement `TrackingStorage` protocol: - -```python -from taskiq_flow.tracking.storage import TrackingStorage -from taskiq_flow.tracking.models import PipelineStatus - -class PostgresStorage(TrackingStorage): - async def save_status(self, status: PipelineStatus): - # Insert/update in PostgreSQL - pass - - async def get_status(self, pipeline_id: str) -> PipelineStatus | None: - # Fetch from DB - pass - - async def list_pipelines(self, status_filter: str | None = None): - # Query with optional filter - pass - - async def delete_pipeline(self, pipeline_id: str): - # Remove record - pass - -tracking = PipelineTrackingManager().with_storage(PostgresStorage()) -``` - ---- - -## 6. Real-Time Tracking with WebSocket - -For live dashboard updates, combine `PipelineTrackingManager` with `HookManager`: - -```python -from taskiq_flow.hooks import HookManager, TrackingEventBroadcaster - -hook_manager = HookManager() -broadcaster = TrackingEventBroadcaster(tracking, hook_manager) -tracking.add_listener(broadcaster.on_status_update) - -pipeline = Pipeline(broker).with_hooks(hook_manager).with_tracking(tracking) -``` - -Now pipeline events are broadcast via WebSocket as they happen. 
- -See [WebSocket Guide]({{ '/en/guides/websocket/' | relative_url }}) for complete setup. - ---- - -## 7. Metrics Collection - -Track pipeline performance over time: - -```python -# Collect statistics -stats = await tracking.get_metrics(days=7) - -print(f"Total executions: {stats.total_pipelines}") -print(f"Success rate: {stats.success_rate:.1%}") -print(f"Avg duration: {stats.avg_duration_ms:.0f}ms") -print(f"Failure reasons: {stats.failure_reasons}") -``` - -**Common metrics**: - -- Throughput (pipelines/minute) -- Success/failure ratio -- Average step duration -- Longest-running steps -- Busy hours - -Integrate with monitoring systems (Prometheus, Grafana): - -```python -from prometheus_client import Counter, Histogram - -PIPELINE_COUNT = Counter('pipelines_total', 'Total pipelines', ['status']) -PIPELINE_DURATION = Histogram('pipeline_duration_seconds', 'Pipeline runtime') - -class PrometheusExporter: - async def on_pipeline_complete(self, status: PipelineStatus): - PIPELINE_COUNT.labels(status=status.status).inc() - PIPELINE_DURATION.observe(status.duration_ms / 1000) -``` - ---- - -## 8. 
Event Listeners - -Attach callbacks to tracking events: - -```python -class MyListener: - async def on_pipeline_start(self, pipeline_id: str): - print(f"Pipeline {pipeline_id} started") - send_slack_notification(f"Pipeline {pipeline_id} started") - - async def on_step_complete(self, pipeline_id: str, step_name: str, result: Any): - log_step_metric(step_name, result) - - async def on_pipeline_complete(self, pipeline_id: str, status: PipelineStatus): - if status.status == "FAILED": - alert_on_failure(pipeline_id) - -listener = MyListener() -tracking.add_listener(listener) -``` - -**Listener methods** (all optional): - -- `on_pipeline_start(pipeline_id: str)` -- `on_step_start(pipeline_id: str, step_name: str)` -- `on_step_complete(pipeline_id: str, step_name: str, result: Any)` -- `on_pipeline_complete(pipeline_id: str, status: PipelineStatus)` -- `on_pipeline_error(pipeline_id: str, error: str)` - ---- - -## 9. Visualizing Tracking Data - -### 9.1. Console Output - -```python -status = await tracking.get_status(pipeline_id) -print(f"\n{'='*60}") -print(f"Pipeline: {status.pipeline_id}") -print(f"Status: {status.status}") -print(f"Duration: {status.duration_ms:.0f}ms") -print(f"Steps:") -for step in status.steps: - bar = "█" * int(step.duration_ms / 10) - print(f" {step.step_name:<30} {bar} {step.duration_ms:.0f}ms") -``` - -### 9.2. JSON Export - -```python -import json -status_dict = status.model_dump(mode="json", exclude={"result"}) # exclude large results -print(json.dumps(status_dict, indent=2, default=str)) -``` - -### 9.3. Integration with Dashboards - -Use the REST API endpoints (see [API Guide]({{ '/en/guides/api/' | relative_url }})) to build custom dashboards: - -```javascript -// Frontend fetch -fetch('/api/pipelines/{pipeline_id}/status') - .then(res => res.json()) - .then(status => { - // Render timeline chart of step durations - // Show success/failure badges - }); -``` - ---- - -## 10. Production Best Practices - -### 10.1. 
Use Redis for Production - -Always use `RedisPipelineStorage` in production: - -```python -# config.py -REDIS_URL = os.getenv("REDIS_URL", "redis://localhost:6379") - -# app.py -from redis.asyncio import Redis -redis_client = Redis.from_url(REDIS_URL) -tracking = PipelineTrackingManager().with_storage( - RedisPipelineStorage(redis_client, ttl_seconds=2592000) # 30 days -) -``` - -### 10.2. Set Up Retention Policies - -```python -# Periodic cleanup job (daily) -async def cleanup_old_trackings(): - deleted = await tracking.cleanup_older_than(days=7) - print(f"Cleaned up {deleted} old pipeline records") - -# Use APScheduler to run daily -from taskiq_flow import PipelineScheduler -scheduler = PipelineScheduler(broker) -scheduler.schedule_at(cleanup_old_trackings, run_at="0 3 * * *") # 3 AM daily -``` - -### 10.3. Monitor Tracker Health - -```python -# Health check for monitoring systems -async def tracking_health(): - try: - test_pipeline = Pipeline(broker).with_tracking(tracking) - await test_pipeline.kiq("health_check") - return {"status": "healthy"} - except Exception as e: - return {"status": "unhealthy", "error": str(e)} -``` - -### 10.4. Limit History Size - -```python -# Keep only last N pipelines per pipeline_id pattern -import fnmatch - -patterns = ["batch_job_*", "etl_*"] -for pattern in patterns: - old = await tracking.list_pipelines() - matching = [p for p in old if fnmatch.fnmatch(p.pipeline_id, pattern)] - if len(matching) > 100: - for old_pipeline in matching[-100:]: - await tracking.delete_pipeline(old_pipeline.pipeline_id) -``` - ---- - -## 11. 
Troubleshooting - -### "No storage configured" Error - -**Symptom**: `RuntimeError: No tracking storage configured` - -**Fix**: Add storage before using tracking: - -```python -tracking = PipelineTrackingManager().with_auto_storage(broker) -# or -tracking = PipelineTrackingManager().with_storage(InMemoryPipelineStorage()) -``` - -### Tracking Data Missing - -**Symptom**: `get_status()` returns `None` even though pipeline ran - -**Causes & fixes**: - -1. **Tracking not attached**: - ```python - pipeline = Pipeline(broker).with_tracking(tracking) # Must call with_tracking() - ``` - -2. **Using different brokers** — Ensure same `broker` instance across task and pipeline. - -3. **Storage lifetime** — InMemory storage lost on restart; switch to Redis. - -4. **Pipeline ID mismatch** — Confirm `pipeline.pipeline_id` matches what you query. - -### Performance Degradation with Redis - -**Symptom**: Tracking slows down pipeline execution - -**Fixes**: -- Use Redis connection pooling -- Batch status updates (bundle multiple steps) -- Async batch writes (default behavior) -- Increase Redis `maxmemory` and use appropriate eviction policy - ---- - -## 12. 
Summary - -| Feature | In-Memory | Redis | -|---------|-----------|-------| -| **Multi-process** | No | Yes | -| **Persistent** | No | Yes | -| **Shared state** | No | Yes | -| **Speed** | Fastest | Fast (network) | -| **Config required** | None | Redis server | - -**Basic recipe**: -```python -tracking = PipelineTrackingManager().with_auto_storage(broker) -pipeline = Pipeline(broker).with_tracking(tracking) -``` - -**Production recipe**: -```python -tracking = PipelineTrackingManager().with_storage( - RedisPipelineStorage(redis_client, ttl_seconds=604800) -) -pipeline = Pipeline(broker).with_tracking(tracking) -``` - ---- - -## Next Steps - -- **[WebSocket Streaming]({{ '/en/guides/websocket/' | relative_url }})** — Real-time event delivery for dashboards -- **[Dataflow Guide]({{ '/en/guides/dataflow/' | relative_url }})** — Full DAG pipelines with automatic parallelism -- **[Scheduling]({{ '/en/guides/scheduling/' | relative_url }})** — Automated recurring pipeline execution -- **[Performance Tuning]({{ '/en/guides/performance/' | relative_url }})** — Optimize tracking overhead - ---- - -*Track everything. Visualize with [WebSocket]({{ '/en/guides/websocket/' | relative_url }}).* +--- +title: Pipeline Tracking & Monitoring Guide +nav_order: 23 +--- +# Pipeline Tracking & Monitoring Guide + +**Real-time and historical execution monitoring with PipelineTrackingManager** + +> **Version**: {VERSION} | **Related**: [Execution Guide]({{ '/en/guides/execution/' | relative_url }}), [WebSocket Guide]({{ '/en/guides/websocket/' | relative_url }}) + +--- + +## Overview + +Taskiq-Flow provides comprehensive tracking capabilities to monitor pipeline executions in real-time and historically. This guide covers: + +- `PipelineTrackingManager` — Central tracking coordinator +- Storage backends (Memory, Redis) +- Status queries and history +- Pipeline metrics collection +- Hooking into step-level events + +--- + +## 1. 
Quick Start + +```python +from taskiq_flow import Pipeline, PipelineTrackingManager + +# Initialize tracking with automatic storage selection +tracking = PipelineTrackingManager().with_auto_storage(broker) + +# Attach tracking to your pipeline +pipeline = Pipeline(broker).with_tracking(tracking) + +# Execute +task = await pipeline.kiq(data) +result = await task.wait_result() + +# Query status +status = await tracking.get_status(pipeline.pipeline_id) +print(f"Status: {status.status}") # COMPLETED +print(f"Steps: {len(status.steps)}") # Number of steps executed +print(f"Duration: {status.duration_ms}ms") +``` + +That's the basic pattern. Let's dive deeper. + +--- + +## 2. PipelineTrackingManager + +The central component for recording and retrieving pipeline execution data. + +### 2.1. Initialization + +```python +from taskiq_flow import PipelineTrackingManager, InMemoryPipelineStorage, RedisPipelineStorage + +# Option 1: Auto-select based on broker (recommended) +tracking = PipelineTrackingManager().with_auto_storage(broker) +# Uses Redis if broker supports it, otherwise falls back to Memory + +# Option 2: Explicit memory storage (development only) +tracking = PipelineTrackingManager().with_storage(InMemoryPipelineStorage()) + +# Option 3: Explicit Redis storage (production) +tracking = PipelineTrackingManager().with_storage( + RedisPipelineStorage(redis_client) +) + +# Option 4: Custom storage backend +tracking = PipelineTrackingManager().with_storage(MyCustomStorage()) +``` + +### 2.2. Storage Lifetime + +- **InMemoryPipelineStorage**: Lives only in current process; cleared on restart +- **RedisPipelineStorage**: Persistent across processes; survives restarts + +Choose based on deployment: +- Local development → Memory +- Single-worker production → Memory (if no restart needed) +- Multi-worker / distributed → Redis (or other shared storage) + +--- + +## 3. 
Pipeline Status Model + +Every tracked pipeline produces a `PipelineStatus` object: + +```python +from taskiq_flow.tracking.models import PipelineStatus + +status: PipelineStatus +``` + +**Fields**: + +| Field | Type | Description | +|-------|------|-------------| +| `pipeline_id` | `str` | Unique pipeline instance ID | +| `status` | `str` | `PENDING`, `RUNNING`, `COMPLETED`, `FAILED`, `CANCELLED` | +| `pipeline_type` | `str` | `"sequential"` or `"dataflow"` | +| `started_at` | `datetime` | Execution start timestamp | +| `completed_at` | `datetime` | Execution end timestamp (if finished) | +| `duration_ms` | `float` | Total execution time in milliseconds | +| `steps` | `list[StepStatus]` | Per-step breakdown | +| `result` | `Any` | Final return value (if completed) | +| `error` | `str` | Error message (if failed) | + +**StepStatus** fields: + +| Field | Type | Description | +|-------|------|-------------| +| `step_name` | `str` | Name of the task | +| `status` | `str` | `PENDING`, `RUNNING`, `COMPLETED`, `FAILED` | +| `started_at` | `datetime` | Step start time | +| `completed_at` | `datetime` | Step end time | +| `duration_ms` | `float` | Step execution time | +| `result` | `Any` | Step return value | +| `error` | `str` | Error message if failed | + +--- + +## 4. Querying Status + +### 4.1. Get Status of a Pipeline + +```python +status = await tracking.get_status(pipeline_id) + +if status.status == "COMPLETED": + print(f"Pipeline finished in {status.duration_ms}ms") + print(f"Result: {status.result}") +elif status.status == "FAILED": + print(f"Failed: {status.error}") +``` + +### 4.2. List All Pipelines + +```python +all_statuses = await tracking.list_pipelines() +for status in all_statuses: + print(f"{status.pipeline_id}: {status.status}") +``` + +### 4.3. 
Filter by Status + +```python +running = await tracking.list_pipelines(status_filter="RUNNING") +failed = await tracking.list_pipelines(status_filter="FAILED") +completed = await tracking.list_pipelines(status_filter="COMPLETED") +``` + +### 4.4. Get Pipeline History + +```python +# Get last 10 pipelines +history = await tracking.get_history(limit=10) + +# Filter by date range +from datetime import datetime, timedelta +week_ago = datetime.now() - timedelta(days=7) +recent = await tracking.get_history(since=week_ago) +``` + +### 4.5. Delete Old Records + +```python +# Delete records older than 30 days +deleted = await tracking.cleanup_older_than(days=30) +print(f"Deleted {deleted} old pipeline records") + +# Delete specific pipeline +await tracking.delete_pipeline(pipeline_id) +``` + +--- + +## 5. Storage Backends + +### 5.1. InMemoryPipelineStorage + +```python +from taskiq_flow.tracking import InMemoryPipelineStorage + +storage = InMemoryPipelineStorage() +tracking = PipelineTrackingManager().with_storage(storage) + +# Data lives only in current Python process +# On restart, all history is lost +# Suitable for: development, testing, single-run scripts +``` + +**Pros**: +- Zero configuration +- Fast (no network I/O) +- Simple + +**Cons**: +- Not shareable between workers +- Lost on restart +- Limited history size + +### 5.2. 
RedisPipelineStorage + +```python +from taskiq_flow.tracking import RedisPipelineStorage +import redis.asyncio as redis + +redis_client = redis.Redis(host="localhost", port=6379, decode_responses=True) +storage = RedisPipelineStorage(redis_client) +tracking = PipelineTrackingManager().with_storage(storage) +``` + +**Configuration**: + +```python +# With custom key prefix and TTL +storage = RedisPipelineStorage( + redis_client, + key_prefix="taskiq_flow:tracking:", + ttl_seconds=604800 # 7 days retention +) +``` + +**Pros**: +- Shared across multiple workers +- Persists across restarts +- Scalable +- Can be clustered for HA + +**Cons**: +- Requires Redis server +- Network latency +- TTL management needed (avoid unbounded growth) + +### 5.3. Custom Storage + +Implement `TrackingStorage` protocol: + +```python +from taskiq_flow.tracking.storage import TrackingStorage +from taskiq_flow.tracking.models import PipelineStatus + +class PostgresStorage(TrackingStorage): + async def save_status(self, status: PipelineStatus): + # Insert/update in PostgreSQL + pass + + async def get_status(self, pipeline_id: str) -> PipelineStatus | None: + # Fetch from DB + pass + + async def list_pipelines(self, status_filter: str | None = None): + # Query with optional filter + pass + + async def delete_pipeline(self, pipeline_id: str): + # Remove record + pass + +tracking = PipelineTrackingManager().with_storage(PostgresStorage()) +``` + +--- + +## 6. Real-Time Tracking with WebSocket + +For live dashboard updates, combine `PipelineTrackingManager` with `HookManager`: + +```python +from taskiq_flow.hooks import HookManager, TrackingEventBroadcaster + +hook_manager = HookManager() +broadcaster = TrackingEventBroadcaster(tracking, hook_manager) +tracking.add_listener(broadcaster.on_status_update) + +pipeline = Pipeline(broker).with_hooks(hook_manager).with_tracking(tracking) +``` + +Now pipeline events are broadcast via WebSocket as they happen. 
+ +See [WebSocket Guide]({{ '/en/guides/websocket/' | relative_url }}) for complete setup. + +--- + +## 7. Metrics Collection + +Track pipeline performance over time: + +```python +# Collect statistics +stats = await tracking.get_metrics(days=7) + +print(f"Total executions: {stats.total_pipelines}") +print(f"Success rate: {stats.success_rate:.1%}") +print(f"Avg duration: {stats.avg_duration_ms:.0f}ms") +print(f"Failure reasons: {stats.failure_reasons}") +``` + +**Common metrics**: + +- Throughput (pipelines/minute) +- Success/failure ratio +- Average step duration +- Longest-running steps +- Busy hours + +Integrate with monitoring systems (Prometheus, Grafana): + +```python +from prometheus_client import Counter, Histogram + +PIPELINE_COUNT = Counter('pipelines_total', 'Total pipelines', ['status']) +PIPELINE_DURATION = Histogram('pipeline_duration_seconds', 'Pipeline runtime') + +class PrometheusExporter: + async def on_pipeline_complete(self, status: PipelineStatus): + PIPELINE_COUNT.labels(status=status.status).inc() + PIPELINE_DURATION.observe(status.duration_ms / 1000) +``` + +--- + +## 8. 
Event Listeners + +Attach callbacks to tracking events: + +```python +class MyListener: + async def on_pipeline_start(self, pipeline_id: str): + print(f"Pipeline {pipeline_id} started") + send_slack_notification(f"Pipeline {pipeline_id} started") + + async def on_step_complete(self, pipeline_id: str, step_name: str, result: Any): + log_step_metric(step_name, result) + + async def on_pipeline_complete(self, pipeline_id: str, status: PipelineStatus): + if status.status == "FAILED": + alert_on_failure(pipeline_id) + +listener = MyListener() +tracking.add_listener(listener) +``` + +**Listener methods** (all optional): + +- `on_pipeline_start(pipeline_id: str)` +- `on_step_start(pipeline_id: str, step_name: str)` +- `on_step_complete(pipeline_id: str, step_name: str, result: Any)` +- `on_pipeline_complete(pipeline_id: str, status: PipelineStatus)` +- `on_pipeline_error(pipeline_id: str, error: str)` + +--- + +## 9. Visualizing Tracking Data + +### 9.1. Console Output + +```python +status = await tracking.get_status(pipeline_id) +print(f"\n{'='*60}") +print(f"Pipeline: {status.pipeline_id}") +print(f"Status: {status.status}") +print(f"Duration: {status.duration_ms:.0f}ms") +print(f"Steps:") +for step in status.steps: + bar = "█" * int(step.duration_ms / 10) + print(f" {step.step_name:<30} {bar} {step.duration_ms:.0f}ms") +``` + +### 9.2. JSON Export + +```python +import json +status_dict = status.model_dump(mode="json", exclude={"result"}) # exclude large results +print(json.dumps(status_dict, indent=2, default=str)) +``` + +### 9.3. Integration with Dashboards + +Use the REST API endpoints (see [API Guide]({{ '/en/guides/api/' | relative_url }})) to build custom dashboards: + +```javascript +// Frontend fetch +fetch('/api/pipelines/{pipeline_id}/status') + .then(res => res.json()) + .then(status => { + // Render timeline chart of step durations + // Show success/failure badges + }); +``` + +--- + +## 10. Production Best Practices + +### 10.1. 
Use Redis for Production + +Always use `RedisPipelineStorage` in production: + +```python +# config.py +REDIS_URL = os.getenv("REDIS_URL", "redis://localhost:6379") + +# app.py +from redis.asyncio import Redis +redis_client = Redis.from_url(REDIS_URL) +tracking = PipelineTrackingManager().with_storage( + RedisPipelineStorage(redis_client, ttl_seconds=2592000) # 30 days +) +``` + +### 10.2. Set Up Retention Policies + +```python +# Periodic cleanup job (daily) +async def cleanup_old_trackings(): + deleted = await tracking.cleanup_older_than(days=7) + print(f"Cleaned up {deleted} old pipeline records") + +# Use APScheduler to run daily +from taskiq_flow import PipelineScheduler +scheduler = PipelineScheduler(broker) +scheduler.schedule_at(cleanup_old_trackings, run_at="0 3 * * *") # 3 AM daily +``` + +### 10.3. Monitor Tracker Health + +```python +# Health check for monitoring systems +async def tracking_health(): + try: + test_pipeline = Pipeline(broker).with_tracking(tracking) + await test_pipeline.kiq("health_check") + return {"status": "healthy"} + except Exception as e: + return {"status": "unhealthy", "error": str(e)} +``` + +### 10.4. Limit History Size + +```python +# Keep only last N pipelines per pipeline_id pattern +import fnmatch + +patterns = ["batch_job_*", "etl_*"] +for pattern in patterns: + old = await tracking.list_pipelines() + matching = [p for p in old if fnmatch.fnmatch(p.pipeline_id, pattern)] + if len(matching) > 100: + for old_pipeline in matching[-100:]: + await tracking.delete_pipeline(old_pipeline.pipeline_id) +``` + +--- + +## 11. 
Troubleshooting + +### "No storage configured" Error + +**Symptom**: `RuntimeError: No tracking storage configured` + +**Fix**: Add storage before using tracking: + +```python +tracking = PipelineTrackingManager().with_auto_storage(broker) +# or +tracking = PipelineTrackingManager().with_storage(InMemoryPipelineStorage()) +``` + +### Tracking Data Missing + +**Symptom**: `get_status()` returns `None` even though pipeline ran + +**Causes & fixes**: + +1. **Tracking not attached**: + ```python + pipeline = Pipeline(broker).with_tracking(tracking) # Must call with_tracking() + ``` + +2. **Using different brokers** — Ensure same `broker` instance across task and pipeline. + +3. **Storage lifetime** — InMemory storage lost on restart; switch to Redis. + +4. **Pipeline ID mismatch** — Confirm `pipeline.pipeline_id` matches what you query. + +### Performance Degradation with Redis + +**Symptom**: Tracking slows down pipeline execution + +**Fixes**: +- Use Redis connection pooling +- Batch status updates (bundle multiple steps) +- Async batch writes (default behavior) +- Increase Redis `maxmemory` and use appropriate eviction policy + +--- + +## 12. 
Summary + +| Feature | In-Memory | Redis | +|---------|-----------|-------| +| **Multi-process** | No | Yes | +| **Persistent** | No | Yes | +| **Shared state** | No | Yes | +| **Speed** | Fastest | Fast (network) | +| **Config required** | None | Redis server | + +**Basic recipe**: +```python +tracking = PipelineTrackingManager().with_auto_storage(broker) +pipeline = Pipeline(broker).with_tracking(tracking) +``` + +**Production recipe**: +```python +tracking = PipelineTrackingManager().with_storage( + RedisPipelineStorage(redis_client, ttl_seconds=604800) +) +pipeline = Pipeline(broker).with_tracking(tracking) +``` + +--- + +## Next Steps + +- **[WebSocket Streaming]({{ '/en/guides/websocket/' | relative_url }})** — Real-time event delivery for dashboards +- **[Dataflow Guide]({{ '/en/guides/dataflow/' | relative_url }})** — Full DAG pipelines with automatic parallelism +- **[Scheduling]({{ '/en/guides/scheduling/' | relative_url }})** — Automated recurring pipeline execution +- **[Performance Tuning]({{ '/en/guides/performance/' | relative_url }})** — Optimize tracking overhead + +--- + +*Track everything. Visualize with [WebSocket]({{ '/en/guides/websocket/' | relative_url }}).* diff --git a/docs/_en/guides/websocket.md b/docs/_en/guides/websocket.md index dce53a4..72c2b5b 100644 --- a/docs/_en/guides/websocket.md +++ b/docs/_en/guides/websocket.md @@ -1,776 +1,776 @@ ---- -title: WebSocket Guide -nav_order: 24 ---- -# WebSocket Guide - -**Real-time event streaming for live dashboards and monitoring** - -> **Version**: {VERSION} | **Related**: [Tracking Guide]({{ '/en/guides/tracking/' | relative_url }}), [API Guide]({{ '/en/guides/api/' | relative_url }}) - ---- - -## Overview - -Taskiq-Flow's WebSocket integration provides live streaming of pipeline execution events — perfect for building real-time dashboards, progress displays, and monitoring tools. 
- -This guide covers: - -- Setting up a WebSocket server -- Subscribing clients to pipeline events -- Event types and payloads -- Transport layer configuration -- Production deployment considerations - ---- - -## 1. Architecture - -``` -[Pipeline] → [HookManager] → [WebSocketBridge] → [FastAPI WS Manager] → [Clients] -``` - -**Components**: - -1. **Pipeline** — Emits events via hooks at each lifecycle stage -2. **HookManager** — Collects events from pipelines -3. **WebSocketBridge** — Connects HookManager to WebSocket transport -4. **FastAPI WebSocket Manager** — Manages client connections and broadcasts via FastAPI WebSocket routes -5. **Client** — Web browser, monitoring app, dashboard - -> **Note**: Taskiq-Flow uses a **FastAPI-only** WebSocket integration. The legacy standalone picows server (`get_websocket_server`, `PipelineWebSocketServer`) was removed in v1.1. WebSocket events are now exposed through FastAPI `WebSocket` endpoints mounted in your FastAPI application. - ---- - -## 2. Quick Start - -### 2.1. Server-Side Setup - -```python -import asyncio -from taskiq import InMemoryBroker -from taskiq_flow import Pipeline -from taskiq_flow.hooks import HookManager, setup_websocket_bridge - -# 1. Create broker and hook manager -broker = InMemoryBroker() -hook_manager = HookManager() - -# 2. Set up the WebSocket bridge -setup_websocket_bridge(hook_manager) # connects HookManager → WebSocket transport - -# 3. Create a pipeline with hooks attached -pipeline = Pipeline(broker) -pipeline.pipeline_id = "demo_workflow" -pipeline.with_hooks(hook_manager) - -# Add tasks to pipeline... - -# 4. For FastAPI: mount the WebSocket route in your app -# (See Section 6 below for the full FastAPI integration example) - -# 5. Execute the pipeline -result = await pipeline.kiq(data) -``` - -### 2.2. 
Client Connection (JavaScript) - -```javascript -// Connect to WebSocket server -const ws = new WebSocket('ws://localhost:8765'); - -// Subscribe to a specific pipeline -ws.onopen = () => { - ws.send(JSON.stringify({ - type: 'subscribe', - pipeline_id: 'demo_workflow' - })); -}; - -// Receive events -ws.onmessage = (event) => { - const eventData = JSON.parse(event.data); - console.log('Pipeline event:', eventData); - - switch (eventData.type) { - case 'PipelineStartEvent': - showPipelineStarted(); - break; - case 'StepStartEvent': - showStepProgress(eventData.step_name); - break; - case 'StepCompleteEvent': - updateProgress(eventData.step_name, eventData.duration_ms); - break; - case 'PipelineCompleteEvent': - showResults(eventData.result); - break; - case 'PipelineErrorEvent': - showError(eventData.error); - break; - } -}; -``` - ---- - -## 3. Event Types - -All events are JSON-serializable with a `type` field indicating the event kind. - -### 3.1. PipelineStartEvent - -```json -{ - "type": "PipelineStartEvent", - "pipeline_id": "demo_workflow", - "pipeline_type": "sequential", - "timestamp": "2026-04-29T18:50:19+02:00", - "input": {...} -} -``` - -Emitted when a pipeline begins execution. - -### 3.2. StepStartEvent - -```json -{ - "type": "StepStartEvent", - "pipeline_id": "demo_workflow", - "step_name": "process_data", - "step_index": 2, - "task_id": "abc123", - "timestamp": "2026-04-29T18:50:19.5+02:00" -} -``` - -Emitted before each step starts. - -### 3.3. StepCompleteEvent - -```json -{ - "type": "StepCompleteEvent", - "pipeline_id": "demo_workflow", - "step_name": "process_data", - "step_index": 2, - "result": {"processed": 42}, - "duration_ms": 150.5, - "timestamp": "2026-04-29T18:50:19.7+02:00" -} -``` - -Emitted after each step completes successfully. - -### 3.4. 
PipelineCompleteEvent - -```json -{ - "type": "PipelineCompleteEvent", - "pipeline_id": "demo_workflow", - "pipeline_type": "sequential", - "status": "COMPLETED", - "duration_ms": 1250.3, - "result": {"final": "output"}, - "timestamp": "2026-04-29T18:50:20.5+02:00" -} -``` - -Emitted when the entire pipeline finishes successfully. - -### 3.5. StepErrorEvent - -```json -{ - "type": "StepErrorEvent", - "pipeline_id": "demo_workflow", - "step_name": "failing_task", - "error": "ValueError: invalid input", - "timestamp": "2026-04-29T18:50:19.9+02:00" -} -``` - -Emitted when a step fails. - -### 3.6. PipelineErrorEvent - -```json -{ - "type": "PipelineErrorEvent", - "pipeline_id": "demo_workflow", - "error": "Pipeline failed at step 'validate'", - "timestamp": "2026-04-29T18:50:20.2+02:00" -} -``` - -Emitted when the pipeline aborts due to an unrecoverable error. - ---- - -## 4. Client-Side Implementation - -### 4.1. Basic JavaScript Client - -```javascript -class PipelineMonitor { - constructor(url, pipelineId) { - this.url = url; - this.pipelineId = pipelineId; - this.ws = null; - this.events = []; - this.callbacks = {}; - } - - connect() { - this.ws = new WebSocket(this.url); - - this.ws.onopen = () => { - console.log('Connected to WebSocket server'); - this.subscribe(this.pipelineId); - }; - - this.ws.onmessage = (event) => { - const data = JSON.parse(event.data); - this.handleEvent(data); - }; - - this.ws.onerror = (err) => { - console.error('WebSocket error:', err); - }; - - this.ws.onclose = () => { - console.log('WebSocket connection closed'); - this.reconnect(); - }; - } - - subscribe(pipelineId) { - this.ws.send(JSON.stringify({ - type: 'subscribe', - pipeline_id: pipelineId - })); - } - - handleEvent(event) { - this.events.push(event); - const eventType = event.type; - - if (this.callbacks[eventType]) { - this.callbacks[eventType](event); - } - - // Generic event handler - if (this.callbacks['*']) { - this.callbacks['*'](event); - } - } - - on(eventType, 
callback) { - this.callbacks[eventType] = callback; - } - - reconnect() { - setTimeout(() => this.connect(), 3000); - } -} - -// Usage -monitor = new PipelineMonitor('ws://localhost:8765', 'pipeline_123'); -monitor.on('StepCompleteEvent', (event) => { - console.log(`Step ${event.step_name} completed in ${event.duration_ms}ms`); -}); -monitor.on('PipelineCompleteEvent', (event) => { - console.log('Pipeline finished with status:', event.status); -}); -monitor.connect(); -``` - -### 4.2. Python Client (for scripts) - -```python -import asyncio -import websockets -import json - -async def monitor_pipeline(uri, pipeline_id): - async with websockets.connect(uri) as websocket: - # Subscribe - await websocket.send(json.dumps({ - "type": "subscribe", - "pipeline_id": pipeline_id - })) - - # Receive events - async for message in websocket: - event = json.loads(message) - print(f"[{event['type']}] {event}") - - if event['type'] == 'PipelineCompleteEvent': - print(f"Pipeline finished: {event['status']}") - -asyncio.run(monitor_pipeline('ws://localhost:8765', 'pipeline_123')) -``` - ---- - -## 5. Subscription Management - -### 5.1. Subscribing to a Pipeline - -Clients send a subscription message: - -```json -{ - "type": "subscribe", - "pipeline_id": "my_pipeline_001" -} -``` - -After subscribing, all events for that pipeline are forwarded. - -### 5.2. Unsubscribing - -```json -{ - "type": "unsubscribe", - "pipeline_id": "my_pipeline_001" -} -``` - -### 5.3. Subscribing to All Pipelines (Wildcard) - -```json -{ - "type": "subscribe", - "pipeline_id": "*" -} -``` - -**Caution**: Broadcasting all events can generate significant traffic in high-throughput systems. - -### 5.4. Multiple Subscriptions - -A client can subscribe to multiple pipelines: - -```javascript -monitor.subscribe('pipeline_1'); -monitor.subscribe('pipeline_2'); -// Receive events for both, distinguished by pipeline_id field -``` - ---- - -## 6. Server Configuration - -### 6.1. 
WebSocket via FastAPI Route - -WebSocket events are exposed through a FastAPI `WebSocket` route at `/ws/{pipeline_id}`. The client connects directly to your FastAPI application: - -```python -from fastapi import FastAPI, WebSocket -from taskiq_flow.integration.websocket.fastapi_ws import ( - fastapi_websocket_endpoint, - get_fastapi_ws_manager, -) - -app = FastAPI() - -@app.websocket("/ws/{pipeline_id}") -async def ws_endpoint(websocket: WebSocket, pipeline_id: str): - await fastapi_websocket_endpoint(websocket, pipeline_id) -``` - -> **Prerequisite**: Install FastAPI: `pip install fastapi uvicorn`. - -The WebSocket URL is tied to your FastAPI server address: - -```javascript -// Client connects using your FastAPI app address -const ws = new WebSocket('ws://localhost:8000/ws/demo_workflow'); -``` - -### 6.2. CORS and Security Headers - -If behind a reverse proxy (nginx, Traefik), configure CORS headers: - -```nginx -# nginx.conf -location /ws { - proxy_pass http://localhost:8765; - proxy_http_version 1.1; - proxy_set_header Upgrade $http_upgrade; - proxy_set_header Connection "upgrade"; - add_header Access-Control-Allow-Origin "*"; - add_header Access-Control-Allow-Credentials true; -} -``` - -### 6.3. SSL/TLS Termination - -Terminate SSL at reverse proxy: - -```nginx -# HTTPS → WSS forwarding -location /ws { - proxy_pass http://localhost:8765; - # WSS (secure WebSocket) handled by nginx SSL config -} -``` - -Client connects with: - -```javascript -const ws = new WebSocket('wss://yourdomain.com/ws'); -``` - -### 6.4. 
Multi-Worker / Redis Transport - -For multi-worker deployments where multiple processes need to share event state, use the `RedisPubSubTransport`: - -```python -from taskiq_flow.transport import RedisPubSubTransport -import redis - -redis_client = redis.Redis(host="localhost", port=6379) -transport = RedisPubSubTransport(redis_client) - -# Pass the transport when initializing the bridge -from taskiq_flow.hooks.bridge import setup_websocket_bridge -setup_websocket_bridge(hook_manager, use_fastapi=True) - -# All workers connected to the same Redis channel share events -# WebSocket is always served via the FastAPI route -``` - -> **Prerequisite**: Install the `[brokers]` extra: `pip install "taskiq-flow[brokers]"` for Redis support. - -With this setup, workers share event state via Redis Pub/Sub while WebSocket connections are always managed through the FastAPI application. - ---- - -## 7. Filtering Events - -Reduce bandwidth by filtering on the server side: - -```python -from taskiq_flow.hooks import EventFilter - -# Only send events for specific pipelines -filter = EventFilter(pipeline_ids=['pipeline_1', 'pipeline_2']) -hook_manager.add_filter(filter) - -# Only send step events (not pipeline-level) -filter = EventFilter(event_types=['StepStartEvent', 'StepCompleteEvent']) -hook_manager.add_filter(filter) -``` - -Client-side filtering also possible: - -```javascript -monitor.on('StepCompleteEvent', (event) => { - if (event.step_name === 'important_step') { - highlightStep(event.step_name); - } -}); -``` - ---- - -## 8. 
Message Format Reference - -### Subscription Request - -| Field | Type | Description | -|-------|------|-------------| -| `type` | `"subscribe"` | Message type | -| `pipeline_id` | `str` or `"*"` | Pipeline to subscribe to | - -### Unsubscription Request - -| Field | Type | Description | -|-------|------|-------------| -| `type` | `"unsubscribe"` | Message type | -| `pipeline_id` | `str` | Pipeline to unsubscribe from | - -### Event Message (server → client) - -| Field | Type | Description | -|-------|------|-------------| -| `type` | `str` | Event type (see Section 3) | -| `pipeline_id` | `str` | Origin pipeline ID | -| `timestamp` | `ISO 8601 str` | Event time | - -Additional fields per event type (see above). - ---- - -## 9. Production Deployment - -### 9.1. Docker Deployment - -```dockerfile -# Dockerfile -FROM python:3.12-slim - -WORKDIR /app -COPY requirements.txt . -RUN pip install -r requirements.txt - -COPY . . - -CMD ["python", "-m", "my_websocket_app"] -``` - -```yaml -# docker-compose.yml -version: '3.8' -services: - redis: - image: redis:7-alpine - app: - build: . - ports: - - "8765:8765" - environment: - - REDIS_URL=redis://redis:6379 - depends_on: - - redis -``` - -### 9.2. Systemd Service - -```ini -# /etc/systemd/system/taskiq-flow-ws.service -[Unit] -Description=Taskiq-Flow WebSocket Server -After=network.target - -[Service] -Type=simple -User=appuser -WorkingDirectory=/opt/taskiq-flow -ExecStart=/usr/bin/python3 -m my_websocket_app -Restart=always -RestartSec=10 - -[Install] -WantedBy=multi-user.target -``` - -### 9.3. Monitoring - -Health check endpoint: - -```python -from aiohttp import web - -async def health(request): - return web.json_response({"status": "healthy"}) - -app = web.Application() -app.router.add_get('/health', health) -``` - -Or use the built-in API health endpoint (/health) from [API Guide]({{ '/en/guides/api/' | relative_url }}). - -### 9.4. 
Scalability - -For high-volume deployments: - -- **Horizontal scaling**: Deploy multiple WebSocket server instances with sticky sessions or Redis Pub/Sub transport -- **Load balancing**: Use nginx or HAProxy with WebSocket support -- **Connection limits**: Configure max connections per worker (OS limits) -- **Message compression**: Enable permessage-deflate for large payloads - ---- - -## 10. Security Considerations - -### 10.1. Authentication - -Require authentication tokens on connection: - -```python -# Server-side validation -async def authenticate(websocket, token): - if not validate_token(token): - await websocket.close(code=4001, reason="Unauthorized") - return False - return True - -# Client sends token upon connection -ws = new WebSocket(`ws://localhost:8765?token=${authToken}`); -``` - -### 10.2. Authorization - -Filter events by user permissions: - -```python -class AuthFilter(EventFilter): - def __init__(self, user_id, allowed_pipelines): - self.user_id = user_id - self.allowed = allowed_pipelines - - def should_emit(self, event): - return event.pipeline_id in self.allowed -``` - -### 10.3. Rate Limiting - -Prevent abuse: - -```python -from collections import defaultdict -import time - -class RateLimiter: - def __init__(self, max_events_per_second=100): - self.limits = defaultdict(list) - - def allow(self, client_id): - now = time.time() - self.limits[client_id] = [ - t for t in self.limits[client_id] if now - t < 1 - ] - if len(self.limits[client_id]) < 100: - self.limits[client_id].append(now) - return True - return False -``` - -### 10.4. SSL/TLS Encryption (WSS) - -Since the WebSocket integration is **FastAPI-only**, SSL termination is handled at the reverse proxy layer. 
Do **not** implement WSS directly at the application level — use a reverse proxy (nginx, Traefik, Caddy): - -```nginx -server { - listen 443 ssl; - server_name ws.taskiq-flow.example.com; - - ssl_certificate /etc/letsencrypt/live/ws.example.com/fullchain.pem; - ssl_certificate_key /etc/letsencrypt/live/ws.example.com/privkey.pem; - - location /ws { - proxy_pass http://localhost:8000; - proxy_http_version 1.1; - proxy_set_header Upgrade $http_upgrade; - proxy_set_header Connection "upgrade"; - proxy_set_header Host $host; - } -} -``` - -Client connects with: - -```javascript -const ws = new WebSocket("wss://ws.taskiq-flow.example.com/ws"); -``` - -> **Note**: The legacy `PipelineWebSocketServer` with inline SSL (`ssl_cert`/`ssl_key` parameters) was removed in v1.1. Use a reverse proxy for TLS termination. - -```nginx -server { - listen 443 ssl; - server_name ws.taskiq-flow.example.com; - - ssl_certificate /etc/letsencrypt/live/ws.example.com/fullchain.pem; - ssl_certificate_key /etc/letsencrypt/live/ws.example.com/privkey.pem; - - location /ws { - proxy_pass http://localhost:8765; - proxy_http_version 1.1; - proxy_set_header Upgrade $http_upgrade; - proxy_set_header Connection "upgrade"; - proxy_set_header Host $host; - } -} -``` - -```javascript -// Client connects via wss through the proxy -const ws = new WebSocket("wss://ws.taskiq-flow.example.com/ws"); -``` - ---- - -## 11. Troubleshooting - -### Connection Refused - -**Symptom**: Client can't connect, "Connection refused" error. - -**Fixes**: -- Verify your FastAPI application is running: `uvicorn app:app` -- Check firewall rules allow the FastAPI port (default: 8000) -- Ensure the WebSocket route is mounted (`/ws/{pipeline_id}`) -- Ensure `setup_websocket_bridge(hook_manager)` is called before the pipeline starts - -### No Events Received After Connection - -**Symptom**: Connection succeeds, but no events arrive. 
- -**Fixes**: -- Ensure pipeline has `pipeline_id` set -- Confirm `pipeline.with_hooks(hook_manager)` called -- Verify `setup_websocket_bridge(hook_manager)` called before pipeline starts -- Check subscription message format (see Section 5) - -### High Memory Usage - -**Symptom**: Server memory grows over time. - -**Fixes**: -- Limit number of tracked pipelines -- Implement automatic cleanup of disconnected clients -- Use Redis transport to offload state from process memory -- Set max connections limit - -### Events Out of Order - -**Symptom**: Client receives StepComplete before StepStart. - -**Fixes**: -- Use sequential delivery guarantees (default for WebSocket) -- Ensure all hooks are correctly attached -- Check for custom middleware that may emit events asynchronously - ---- - -## 12. Summary - -| Component | Responsibility | -|-----------|----------------| -| `Pipeline` | Generates execution events | -| `HookManager` | Collects events from pipelines | -| `WebSocketBridge` | Routes events to WebSocket transport | -| `FastAPIWebSocketManager` | Manages client connections, broadcasts | -| `Client` | Subscribes, receives, displays events | - -**Basic setup (3 lines for hooks + FastAPI route):** - -```python -# 1. Connect event pipeline -hooks = HookManager() -setup_websocket_bridge(hooks) -pipeline = Pipeline(broker).with_hooks(hooks) - -# 2. 
Mount WebSocket route in your FastAPI app -# @app.websocket("/ws/{pipeline_id}") -# async def ws_endpoint(websocket: WebSocket, pipeline_id: str): -# await fastapi_websocket_endpoint(websocket, pipeline_id) -``` - ---- - -## Next Steps - -- **[Tracking Guide]({{ '/en/guides/tracking/' | relative_url }})** — Backend storage and historical queries -- **[Dataflow Guide]({{ '/en/guides/dataflow/' | relative_url }})** — DAG pipelines with automatic parallelism and WebSocket-compatible events -- **[API Guide]({{ '/en/guides/api/' | relative_url }})** — REST endpoints for dashboard backends -- **[Examples: WebSocket Demo]({{ '/en/examples/websocket-demo/' | relative_url }})** — Complete working code - ---- - -*Stream live pipeline events. Combine with [Tracking Storage]({{ '/en/guides/tracking/' | relative_url }}) for persistent history.* +--- +title: WebSocket Guide +nav_order: 24 +--- +# WebSocket Guide + +**Real-time event streaming for live dashboards and monitoring** + +> **Version**: {VERSION} | **Related**: [Tracking Guide]({{ '/en/guides/tracking/' | relative_url }}), [API Guide]({{ '/en/guides/api/' | relative_url }}) + +--- + +## Overview + +Taskiq-Flow's WebSocket integration provides live streaming of pipeline execution events — perfect for building real-time dashboards, progress displays, and monitoring tools. + +This guide covers: + +- Setting up a WebSocket server +- Subscribing clients to pipeline events +- Event types and payloads +- Transport layer configuration +- Production deployment considerations + +--- + +## 1. Architecture + +``` +[Pipeline] → [HookManager] → [WebSocketBridge] → [FastAPI WS Manager] → [Clients] +``` + +**Components**: + +1. **Pipeline** — Emits events via hooks at each lifecycle stage +2. **HookManager** — Collects events from pipelines +3. **WebSocketBridge** — Connects HookManager to WebSocket transport +4. **FastAPI WebSocket Manager** — Manages client connections and broadcasts via FastAPI WebSocket routes +5. 
**Client** — Web browser, monitoring app, dashboard + +> **Note**: Taskiq-Flow uses a **FastAPI-only** WebSocket integration. The legacy standalone picows server (`get_websocket_server`, `PipelineWebSocketServer`) was removed in v1.1. WebSocket events are now exposed through FastAPI `WebSocket` endpoints mounted in your FastAPI application. + +--- + +## 2. Quick Start + +### 2.1. Server-Side Setup + +```python +import asyncio +from taskiq import InMemoryBroker +from taskiq_flow import Pipeline +from taskiq_flow.hooks import HookManager, setup_websocket_bridge + +# 1. Create broker and hook manager +broker = InMemoryBroker() +hook_manager = HookManager() + +# 2. Set up the WebSocket bridge +setup_websocket_bridge(hook_manager) # connects HookManager → WebSocket transport + +# 3. Create a pipeline with hooks attached +pipeline = Pipeline(broker) +pipeline.pipeline_id = "demo_workflow" +pipeline.with_hooks(hook_manager) + +# Add tasks to pipeline... + +# 4. For FastAPI: mount the WebSocket route in your app +# (See Section 6 below for the full FastAPI integration example) + +# 5. Execute the pipeline +result = await pipeline.kiq(data) +``` + +### 2.2. 
Client Connection (JavaScript) + +```javascript +// Connect to WebSocket server +const ws = new WebSocket('ws://localhost:8765'); + +// Subscribe to a specific pipeline +ws.onopen = () => { + ws.send(JSON.stringify({ + type: 'subscribe', + pipeline_id: 'demo_workflow' + })); +}; + +// Receive events +ws.onmessage = (event) => { + const eventData = JSON.parse(event.data); + console.log('Pipeline event:', eventData); + + switch (eventData.type) { + case 'PipelineStartEvent': + showPipelineStarted(); + break; + case 'StepStartEvent': + showStepProgress(eventData.step_name); + break; + case 'StepCompleteEvent': + updateProgress(eventData.step_name, eventData.duration_ms); + break; + case 'PipelineCompleteEvent': + showResults(eventData.result); + break; + case 'PipelineErrorEvent': + showError(eventData.error); + break; + } +}; +``` + +--- + +## 3. Event Types + +All events are JSON-serializable with a `type` field indicating the event kind. + +### 3.1. PipelineStartEvent + +```json +{ + "type": "PipelineStartEvent", + "pipeline_id": "demo_workflow", + "pipeline_type": "sequential", + "timestamp": "2026-04-29T18:50:19+02:00", + "input": {...} +} +``` + +Emitted when a pipeline begins execution. + +### 3.2. StepStartEvent + +```json +{ + "type": "StepStartEvent", + "pipeline_id": "demo_workflow", + "step_name": "process_data", + "step_index": 2, + "task_id": "abc123", + "timestamp": "2026-04-29T18:50:19.5+02:00" +} +``` + +Emitted before each step starts. + +### 3.3. StepCompleteEvent + +```json +{ + "type": "StepCompleteEvent", + "pipeline_id": "demo_workflow", + "step_name": "process_data", + "step_index": 2, + "result": {"processed": 42}, + "duration_ms": 150.5, + "timestamp": "2026-04-29T18:50:19.7+02:00" +} +``` + +Emitted after each step completes successfully. + +### 3.4. 
PipelineCompleteEvent + +```json +{ + "type": "PipelineCompleteEvent", + "pipeline_id": "demo_workflow", + "pipeline_type": "sequential", + "status": "COMPLETED", + "duration_ms": 1250.3, + "result": {"final": "output"}, + "timestamp": "2026-04-29T18:50:20.5+02:00" +} +``` + +Emitted when the entire pipeline finishes successfully. + +### 3.5. StepErrorEvent + +```json +{ + "type": "StepErrorEvent", + "pipeline_id": "demo_workflow", + "step_name": "failing_task", + "error": "ValueError: invalid input", + "timestamp": "2026-04-29T18:50:19.9+02:00" +} +``` + +Emitted when a step fails. + +### 3.6. PipelineErrorEvent + +```json +{ + "type": "PipelineErrorEvent", + "pipeline_id": "demo_workflow", + "error": "Pipeline failed at step 'validate'", + "timestamp": "2026-04-29T18:50:20.2+02:00" +} +``` + +Emitted when the pipeline aborts due to an unrecoverable error. + +--- + +## 4. Client-Side Implementation + +### 4.1. Basic JavaScript Client + +```javascript +class PipelineMonitor { + constructor(url, pipelineId) { + this.url = url; + this.pipelineId = pipelineId; + this.ws = null; + this.events = []; + this.callbacks = {}; + } + + connect() { + this.ws = new WebSocket(this.url); + + this.ws.onopen = () => { + console.log('Connected to WebSocket server'); + this.subscribe(this.pipelineId); + }; + + this.ws.onmessage = (event) => { + const data = JSON.parse(event.data); + this.handleEvent(data); + }; + + this.ws.onerror = (err) => { + console.error('WebSocket error:', err); + }; + + this.ws.onclose = () => { + console.log('WebSocket connection closed'); + this.reconnect(); + }; + } + + subscribe(pipelineId) { + this.ws.send(JSON.stringify({ + type: 'subscribe', + pipeline_id: pipelineId + })); + } + + handleEvent(event) { + this.events.push(event); + const eventType = event.type; + + if (this.callbacks[eventType]) { + this.callbacks[eventType](event); + } + + // Generic event handler + if (this.callbacks['*']) { + this.callbacks['*'](event); + } + } + + on(eventType, 
callback) { + this.callbacks[eventType] = callback; + } + + reconnect() { + setTimeout(() => this.connect(), 3000); + } +} + +// Usage +monitor = new PipelineMonitor('ws://localhost:8765', 'pipeline_123'); +monitor.on('StepCompleteEvent', (event) => { + console.log(`Step ${event.step_name} completed in ${event.duration_ms}ms`); +}); +monitor.on('PipelineCompleteEvent', (event) => { + console.log('Pipeline finished with status:', event.status); +}); +monitor.connect(); +``` + +### 4.2. Python Client (for scripts) + +```python +import asyncio +import websockets +import json + +async def monitor_pipeline(uri, pipeline_id): + async with websockets.connect(uri) as websocket: + # Subscribe + await websocket.send(json.dumps({ + "type": "subscribe", + "pipeline_id": pipeline_id + })) + + # Receive events + async for message in websocket: + event = json.loads(message) + print(f"[{event['type']}] {event}") + + if event['type'] == 'PipelineCompleteEvent': + print(f"Pipeline finished: {event['status']}") + +asyncio.run(monitor_pipeline('ws://localhost:8765', 'pipeline_123')) +``` + +--- + +## 5. Subscription Management + +### 5.1. Subscribing to a Pipeline + +Clients send a subscription message: + +```json +{ + "type": "subscribe", + "pipeline_id": "my_pipeline_001" +} +``` + +After subscribing, all events for that pipeline are forwarded. + +### 5.2. Unsubscribing + +```json +{ + "type": "unsubscribe", + "pipeline_id": "my_pipeline_001" +} +``` + +### 5.3. Subscribing to All Pipelines (Wildcard) + +```json +{ + "type": "subscribe", + "pipeline_id": "*" +} +``` + +**Caution**: Broadcasting all events can generate significant traffic in high-throughput systems. + +### 5.4. Multiple Subscriptions + +A client can subscribe to multiple pipelines: + +```javascript +monitor.subscribe('pipeline_1'); +monitor.subscribe('pipeline_2'); +// Receive events for both, distinguished by pipeline_id field +``` + +--- + +## 6. Server Configuration + +### 6.1. 
WebSocket via FastAPI Route + +WebSocket events are exposed through a FastAPI `WebSocket` route at `/ws/{pipeline_id}`. The client connects directly to your FastAPI application: + +```python +from fastapi import FastAPI, WebSocket +from taskiq_flow.integration.websocket.fastapi_ws import ( + fastapi_websocket_endpoint, + get_fastapi_ws_manager, +) + +app = FastAPI() + +@app.websocket("/ws/{pipeline_id}") +async def ws_endpoint(websocket: WebSocket, pipeline_id: str): + await fastapi_websocket_endpoint(websocket, pipeline_id) +``` + +> **Prerequisite**: Install FastAPI: `pip install fastapi uvicorn`. + +The WebSocket URL is tied to your FastAPI server address: + +```javascript +// Client connects using your FastAPI app address +const ws = new WebSocket('ws://localhost:8000/ws/demo_workflow'); +``` + +### 6.2. CORS and Security Headers + +If behind a reverse proxy (nginx, Traefik), configure CORS headers: + +```nginx +# nginx.conf +location /ws { + proxy_pass http://localhost:8765; + proxy_http_version 1.1; + proxy_set_header Upgrade $http_upgrade; + proxy_set_header Connection "upgrade"; + add_header Access-Control-Allow-Origin "*"; + add_header Access-Control-Allow-Credentials true; +} +``` + +### 6.3. SSL/TLS Termination + +Terminate SSL at reverse proxy: + +```nginx +# HTTPS → WSS forwarding +location /ws { + proxy_pass http://localhost:8765; + # WSS (secure WebSocket) handled by nginx SSL config +} +``` + +Client connects with: + +```javascript +const ws = new WebSocket('wss://yourdomain.com/ws'); +``` + +### 6.4. 
Multi-Worker / Redis Transport + +For multi-worker deployments where multiple processes need to share event state, use the `RedisPubSubTransport`: + +```python +from taskiq_flow.transport import RedisPubSubTransport +import redis + +redis_client = redis.Redis(host="localhost", port=6379) +transport = RedisPubSubTransport(redis_client) + +# Pass the transport when initializing the bridge +from taskiq_flow.hooks.bridge import setup_websocket_bridge +setup_websocket_bridge(hook_manager, use_fastapi=True) + +# All workers connected to the same Redis channel share events +# WebSocket is always served via the FastAPI route +``` + +> **Prerequisite**: Install the `[brokers]` extra: `pip install "taskiq-flow[brokers]"` for Redis support. + +With this setup, workers share event state via Redis Pub/Sub while WebSocket connections are always managed through the FastAPI application. + +--- + +## 7. Filtering Events + +Reduce bandwidth by filtering on the server side: + +```python +from taskiq_flow.hooks import EventFilter + +# Only send events for specific pipelines +filter = EventFilter(pipeline_ids=['pipeline_1', 'pipeline_2']) +hook_manager.add_filter(filter) + +# Only send step events (not pipeline-level) +filter = EventFilter(event_types=['StepStartEvent', 'StepCompleteEvent']) +hook_manager.add_filter(filter) +``` + +Client-side filtering also possible: + +```javascript +monitor.on('StepCompleteEvent', (event) => { + if (event.step_name === 'important_step') { + highlightStep(event.step_name); + } +}); +``` + +--- + +## 8. 
Message Format Reference + +### Subscription Request + +| Field | Type | Description | +|-------|------|-------------| +| `type` | `"subscribe"` | Message type | +| `pipeline_id` | `str` or `"*"` | Pipeline to subscribe to | + +### Unsubscription Request + +| Field | Type | Description | +|-------|------|-------------| +| `type` | `"unsubscribe"` | Message type | +| `pipeline_id` | `str` | Pipeline to unsubscribe from | + +### Event Message (server → client) + +| Field | Type | Description | +|-------|------|-------------| +| `type` | `str` | Event type (see Section 3) | +| `pipeline_id` | `str` | Origin pipeline ID | +| `timestamp` | `ISO 8601 str` | Event time | + +Additional fields per event type (see above). + +--- + +## 9. Production Deployment + +### 9.1. Docker Deployment + +```dockerfile +# Dockerfile +FROM python:3.12-slim + +WORKDIR /app +COPY requirements.txt . +RUN pip install -r requirements.txt + +COPY . . + +CMD ["python", "-m", "my_websocket_app"] +``` + +```yaml +# docker-compose.yml +version: '3.8' +services: + redis: + image: redis:7-alpine + app: + build: . + ports: + - "8765:8765" + environment: + - REDIS_URL=redis://redis:6379 + depends_on: + - redis +``` + +### 9.2. Systemd Service + +```ini +# /etc/systemd/system/taskiq-flow-ws.service +[Unit] +Description=Taskiq-Flow WebSocket Server +After=network.target + +[Service] +Type=simple +User=appuser +WorkingDirectory=/opt/taskiq-flow +ExecStart=/usr/bin/python3 -m my_websocket_app +Restart=always +RestartSec=10 + +[Install] +WantedBy=multi-user.target +``` + +### 9.3. Monitoring + +Health check endpoint: + +```python +from aiohttp import web + +async def health(request): + return web.json_response({"status": "healthy"}) + +app = web.Application() +app.router.add_get('/health', health) +``` + +Or use the built-in API health endpoint (/health) from [API Guide]({{ '/en/guides/api/' | relative_url }}). + +### 9.4. 
Scalability + +For high-volume deployments: + +- **Horizontal scaling**: Deploy multiple WebSocket server instances with sticky sessions or Redis Pub/Sub transport +- **Load balancing**: Use nginx or HAProxy with WebSocket support +- **Connection limits**: Configure max connections per worker (OS limits) +- **Message compression**: Enable permessage-deflate for large payloads + +--- + +## 10. Security Considerations + +### 10.1. Authentication + +Require authentication tokens on connection: + +```python +# Server-side validation +async def authenticate(websocket, token): + if not validate_token(token): + await websocket.close(code=4001, reason="Unauthorized") + return False + return True + +# Client sends token upon connection +ws = new WebSocket(`ws://localhost:8765?token=${authToken}`); +``` + +### 10.2. Authorization + +Filter events by user permissions: + +```python +class AuthFilter(EventFilter): + def __init__(self, user_id, allowed_pipelines): + self.user_id = user_id + self.allowed = allowed_pipelines + + def should_emit(self, event): + return event.pipeline_id in self.allowed +``` + +### 10.3. Rate Limiting + +Prevent abuse: + +```python +from collections import defaultdict +import time + +class RateLimiter: + def __init__(self, max_events_per_second=100): + self.limits = defaultdict(list) + + def allow(self, client_id): + now = time.time() + self.limits[client_id] = [ + t for t in self.limits[client_id] if now - t < 1 + ] + if len(self.limits[client_id]) < 100: + self.limits[client_id].append(now) + return True + return False +``` + +### 10.4. SSL/TLS Encryption (WSS) + +Since the WebSocket integration is **FastAPI-only**, SSL termination is handled at the reverse proxy layer. 
Do **not** implement WSS directly at the application level — use a reverse proxy (nginx, Traefik, Caddy): + +```nginx +server { + listen 443 ssl; + server_name ws.taskiq-flow.example.com; + + ssl_certificate /etc/letsencrypt/live/ws.example.com/fullchain.pem; + ssl_certificate_key /etc/letsencrypt/live/ws.example.com/privkey.pem; + + location /ws { + proxy_pass http://localhost:8000; + proxy_http_version 1.1; + proxy_set_header Upgrade $http_upgrade; + proxy_set_header Connection "upgrade"; + proxy_set_header Host $host; + } +} +``` + +Client connects with: + +```javascript +const ws = new WebSocket("wss://ws.taskiq-flow.example.com/ws"); +``` + +> **Note**: The legacy `PipelineWebSocketServer` with inline SSL (`ssl_cert`/`ssl_key` parameters) was removed in v1.1. Use a reverse proxy for TLS termination. + +```nginx +server { + listen 443 ssl; + server_name ws.taskiq-flow.example.com; + + ssl_certificate /etc/letsencrypt/live/ws.example.com/fullchain.pem; + ssl_certificate_key /etc/letsencrypt/live/ws.example.com/privkey.pem; + + location /ws { + proxy_pass http://localhost:8765; + proxy_http_version 1.1; + proxy_set_header Upgrade $http_upgrade; + proxy_set_header Connection "upgrade"; + proxy_set_header Host $host; + } +} +``` + +```javascript +// Client connects via wss through the proxy +const ws = new WebSocket("wss://ws.taskiq-flow.example.com/ws"); +``` + +--- + +## 11. Troubleshooting + +### Connection Refused + +**Symptom**: Client can't connect, "Connection refused" error. + +**Fixes**: +- Verify your FastAPI application is running: `uvicorn app:app` +- Check firewall rules allow the FastAPI port (default: 8000) +- Ensure the WebSocket route is mounted (`/ws/{pipeline_id}`) +- Ensure `setup_websocket_bridge(hook_manager)` is called before the pipeline starts + +### No Events Received After Connection + +**Symptom**: Connection succeeds, but no events arrive. 
+ +**Fixes**: +- Ensure the pipeline has `pipeline_id` set +- Confirm `pipeline.with_hooks(hook_manager)` was called +- Verify `setup_websocket_bridge(hook_manager)` was called before the pipeline starts +- Check subscription message format (see Section 5) + +### High Memory Usage + +**Symptom**: Server memory grows over time. + +**Fixes**: +- Limit the number of tracked pipelines +- Implement automatic cleanup of disconnected clients +- Use Redis transport to offload state from process memory +- Set a maximum connection limit + +### Events Out of Order + +**Symptom**: Client receives StepComplete before StepStart. + +**Fixes**: +- Use sequential delivery guarantees (default for WebSocket) +- Ensure all hooks are correctly attached +- Check for custom middleware that may emit events asynchronously + +--- + +## 12. Summary + +| Component | Responsibility | +|-----------|----------------| +| `Pipeline` | Generates execution events | +| `HookManager` | Collects events from pipelines | +| `WebSocketBridge` | Routes events to WebSocket transport | +| `FastAPIWebSocketManager` | Manages client connections, broadcasts | +| `Client` | Subscribes, receives, displays events | + +**Basic setup (3 lines for hooks + FastAPI route):** + +```python +# 1. Connect event pipeline +hooks = HookManager() +setup_websocket_bridge(hooks) +pipeline = Pipeline(broker).with_hooks(hooks) + +# 2. 
Mount WebSocket route in your FastAPI app +# @app.websocket("/ws/{pipeline_id}") +# async def ws_endpoint(websocket: WebSocket, pipeline_id: str): +# await fastapi_websocket_endpoint(websocket, pipeline_id) +``` + +--- + +## Next Steps + +- **[Tracking Guide]({{ '/en/guides/tracking/' | relative_url }})** — Backend storage and historical queries +- **[Dataflow Guide]({{ '/en/guides/dataflow/' | relative_url }})** — DAG pipelines with automatic parallelism and WebSocket-compatible events +- **[API Guide]({{ '/en/guides/api/' | relative_url }})** — REST endpoints for dashboard backends +- **[Examples: WebSocket Demo]({{ '/en/examples/websocket-demo/' | relative_url }})** — Complete working code + +--- + +*Stream live pipeline events. Combine with [Tracking Storage]({{ '/en/guides/tracking/' | relative_url }}) for persistent history.* diff --git a/docs/_en/index.md b/docs/_en/index.md index 28a0e82..c3181b8 100644 --- a/docs/_en/index.md +++ b/docs/_en/index.md @@ -1,27 +1,27 @@ ---- -title: English Documentation -nav_order: 5 -permalink: /en/ ---- -# Taskiq-Flow Documentation (English) - -Welcome to the English documentation for **Taskiq-Flow**. 
- -## Start Here - -- **[Quick Start Guide]({{ '/en/quickstart/' | relative_url }})** — Get up and running in 5 minutes -- **[User Guides]({{ '/en/guides/' | relative_url }})** — In-depth guides on pipelines, tasks, execution, tracking, WebSocket, scheduling, retry, performance, and REST API -- **[API Reference]({{ '/en/api/' | relative_url }})** — Complete module and class documentation -- **[Examples]({{ '/en/examples/' | relative_url }})** — Working code examples for all features - -## Quick Links - -- [Project README](https://github.com/dorel14/taskiq-flow/blob/main/README.md) — Project overview and philosophy -- [PyPI Package](https://pypi.org/project/taskiq-flow/) — Install with `pip install taskiq-flow` -- [GitHub Repository](https://github.com/dorel14/taskiq-flow) — Source code and issue tracker -- [Contributing Guide](https://github.com/dorel14/taskiq-flow/blob/main/CONTRIBUTING.md) — How to contribute -- [License](https://github.com/dorel14/taskiq-flow/blob/main/LICENSE) — MIT License - ---- - -*Maintained by SoniqueBay Team* +--- +title: English Documentation +nav_order: 5 +permalink: /en/ +--- +# Taskiq-Flow Documentation (English) + +Welcome to the English documentation for **Taskiq-Flow**. 
+ +## Start Here + +- **[Quick Start Guide]({{ '/en/quickstart/' | relative_url }})** — Get up and running in 5 minutes +- **[User Guides]({{ '/en/guides/' | relative_url }})** — In-depth guides on pipelines, tasks, execution, tracking, WebSocket, scheduling, retry, performance, and REST API +- **[API Reference]({{ '/en/api/' | relative_url }})** — Complete module and class documentation +- **[Examples]({{ '/en/examples/' | relative_url }})** — Working code examples for all features + +## Quick Links + +- [Project README](https://github.com/dorel14/taskiq-flow/blob/main/README.md) — Project overview and philosophy +- [PyPI Package](https://pypi.org/project/taskiq-flow/) — Install with `pip install taskiq-flow` +- [GitHub Repository](https://github.com/dorel14/taskiq-flow) — Source code and issue tracker +- [Contributing Guide](https://github.com/dorel14/taskiq-flow/blob/main/CONTRIBUTING.md) — How to contribute +- [License](https://github.com/dorel14/taskiq-flow/blob/main/LICENSE) — MIT License + +--- + +*Maintained by SoniqueBay Team* diff --git a/docs/_en/quickstart.md b/docs/_en/quickstart.md index 51a97aa..7ee9852 100644 --- a/docs/_en/quickstart.md +++ b/docs/_en/quickstart.md @@ -1,387 +1,387 @@ ---- -title: Quick Start Guide -nav_order: 10 -color_scheme: dark ---- -# Quick Start Guide - -**Getting up and running with Taskiq-Flow in 5 minutes** - -> **Version**: {VERSION} | **Prerequisites**: Python 3.9+, asyncio basics - ---- - -## Overview - -This guide will help you create your first pipelines with Taskiq-Flow. 
By the end, you'll understand: - -- How to set up a broker and add the PipelineMiddleware -- Defining tasks with `@broker.task` -- Building sequential pipelines with `.call_next()`, `.map()`, `.filter()` -- Running pipelines and retrieving results -- Basic dataflow pipelines with `@pipeline_task` - ---- - -## Prerequisites - -```bash -pip install taskiq taskiq-flow -``` - -For this guide, we'll use the in-memory broker which requires no external services. - ---- - -## 1. Basic Sequential Pipeline - -### 1.1. Setup - -Create a new Python file `quickstart_basic.py`: - -```python -import asyncio -from taskiq import InMemoryBroker -from taskiq_flow import Pipeline, PipelineMiddleware - -# Initialize broker and add required middleware -broker = InMemoryBroker() -broker.add_middlewares(PipelineMiddleware()) -``` - -### 1.2. Define Tasks - -All functions in a pipeline must be taskiq tasks (decorated with `@broker.task`): - -```python -@broker.task -def add_one(value: int) -> int: - """Add 1 to the input value.""" - return value + 1 - -@broker.task -def repeat(value: int, times: int) -> list[int]: - """Repeat a value multiple times.""" - return [value] * times - -@broker.task -def is_positive(value: int) -> bool: - """Check if value is non-negative.""" - return value >= 0 -``` - -### 1.3. Build & Run the Pipeline - -```python -async def main(): - # Build the pipeline by chaining operations - pipeline = ( - Pipeline(broker) - .call_next(add_one) # Step 1: 1 → 2 - .call_next(repeat, times=4) # Step 2: 2 → [2, 2, 2, 2] - .map(add_one) # Step 3: apply to each element → [3, 3, 3, 3] - .filter(is_positive) # Step 4: keep elements where result is True - ) - - # Kick off the pipeline with initial input - task = await pipeline.kiq(1) - - # Wait for completion and retrieve the result - result = await task.wait_result() - print("Result:", result.return_value) # Output: [3, 3, 3, 3] - -asyncio.run(main()) -``` - -**Expected output**: -``` -Result: [3, 3, 3, 3] -``` - -### 1.4. 
How It Works - -| Step | Operation | Input | Output | -|------|-----------|-------|--------| -| 1 | `.call_next(add_one)` | `1` | `2` | -| 2 | `.call_next(repeat, times=4)` | `2` | `[2, 2, 2, 2]` | -| 3 | `.map(add_one)` | `[2, 2, 2, 2]` | `[3, 3, 3, 3]` (parallel) | -| 4 | `.filter(is_positive)` | `[3, 3, 3, 3]` | `[3, 3, 3, 3]` (unchanged) | - -**Key points**: - -- The `PipelineMiddleware` handles task routing; it **must** be added to your broker. -- Each step receives the previous step's output as input. -- `.map()` and `.filter()` operate on iterable results and run elements in parallel. -- `pipeline.kiq(initial_input)` starts the pipeline and returns a `Task` object. -- `task.wait_result()` blocks until the pipeline finishes. - ---- - -## 2. Dataflow Pipeline (Automatic DAG) - -For more complex workflows, use `DataflowPipeline` which automatically builds a dependency graph. - -### 2.1. Define Tasks with `@pipeline_task` - -Mark task outputs using the `@pipeline_task` decorator: - -```python -from taskiq_flow import DataflowPipeline, pipeline_task - -@broker.task -@pipeline_task(output="features") -def extract_audio(track_paths: list[str]) -> dict: - """Extract audio features from tracks.""" - print(f"Extracting features from {len(track_paths)} tracks...") - return {"duration": 180.0, "tempo": 120.0, "energy": 0.8} - -@broker.task -@pipeline_task(output="tags") -def generate_tags(features: dict) -> list[str]: - """Generate tags based on audio features.""" - print(f"Generating tags from features: {features}") - return ["electronic", "dance", "upbeat"] - -@broker.task -@pipeline_task(output="embedding") -def compute_embedding(features: dict) -> list[float]: - """Compute vector embedding from features.""" - print(f"Computing embedding from {features}") - return [0.1, 0.2, 0.3, 0.4, 0.5] -``` - -**How dependency resolution works**: -- `extract_audio` declares `output="features"` -- `generate_tags` has parameter `features: dict` → automatically depends on 
`extract_audio` -- `compute_embedding` also depends on `extract_audio` (same `features` param) -- Taskiq-Flow constructs a DAG and runs independent tasks in parallel - -### 2.2. Build & Execute - -```python -async def main(): - # Auto-build the DAG from task list - pipeline = DataflowPipeline.from_tasks( - broker, - [extract_audio, generate_tags, compute_embedding] - ) - - # Optional: visualize the DAG - pipeline.print_dag() - - # Execute with input data (only external inputs needed) - results = await pipeline.kiq_dataflow(track_paths=["song1.mp3", "song2.mp3"]) - print("Results:", results) - # Output: { - # "features": {"duration": 180.0, ...}, - # "tags": ["electronic", "dance", "upbeat"], - # "embedding": [0.1, 0.2, 0.3, 0.4, 0.5] - # } - -asyncio.run(main()) -``` - -**Sample DAG output** (printed to console): -``` -DAG Execution Order: - Level 0 (parallel): extract_audio - Level 1 (parallel): generate_tags, compute_embedding - Final outputs: features, tags, embedding -``` - -### 2.3. Visualizing the Pipeline - -```python -# ASCII DAG in console -pipeline.print_dag() - -# JSON representation for web UIs -viz_json = pipeline.visualize() -print(viz_json) - -# DOT format for Graphviz -dot = pipeline.visualize_dot() -with open("pipeline.dot", "w") as f: - f.write(dot) -# Render: dot -Tpng pipeline.dot -o pipeline.png -``` - ---- - -## 3. Common Patterns - -### 3.1. 
Map-Reduce Pattern - -Process items in parallel, then aggregate: - -```python -from taskiq_flow import MapReduce - -# Map phase: process each track independently -mapped = await MapReduce.map( - broker, - process_track, # task function - track_list, # iterable of items - output="processed", # name of intermediate output - max_parallel=10 # limit concurrency -) - -# Reduce phase: aggregate all results -reduced = await MapReduce.reduce( - broker, - aggregate_results, # aggregation function - mapped, # MapReduceResult object - input_name="processed", # consume the mapped output - output="final_stats" -) - -print("Final:", reduced.return_value) -``` - -See `examples/dataflow_audio_pipeline.py` for a complete audio-processing pipeline. - -### 3.2. Group Parallel Execution - -Run multiple independent tasks simultaneously: - -```python -pipeline = Pipeline(broker) - -pipeline.group( - [task_a, task_b, task_c], - param_names=["input_a", "input_b", "input_c"] -) -# Returns: [result_a, result_b, result_c] -``` - -### 3.3. Pipeline with Tracking - -Monitor pipeline status in real-time: - -```python -from taskiq_flow import PipelineTrackingManager - -tracking = PipelineTrackingManager().with_auto_storage(broker) -pipeline = Pipeline(broker).with_tracking(tracking) - -task = await pipeline.kiq(data) - -# Check status later -status = await tracking.get_status(pipeline.pipeline_id) -print(f"Status: {status.status}, Steps completed: {len(status.steps)}") -``` - ---- - -## 4. 
Running Example Scripts - -The `examples/` directory contains complete runnable demonstrations: - -```bash -# Basic sequential pipeline -python examples/quickstart.py - -# Tracking and monitoring -python examples/tracking_demo.py - -# Scheduled pipelines (cron) -python examples/scheduled_pipeline.py - -# Full dataflow DAG with map-reduce -python examples/dataflow_audio_pipeline.py - -# Manual DAG construction with DataflowRegistry -python examples/registry_discovery_example.py - -# WebSocket event streaming -python examples/websocket_demo.py - -# REST API with FastAPI -python examples/api_example.py -``` - ---- - -## 5. Next Steps - -With the basics under your belt, explore the deeper guides: - -| Topic | Guide | -|-------|-------| -| Sequential & Dataflow Pipelines | [Pipelines Guide]({{ '/en/guides/pipelines/' | relative_url }}) | -| **Dataflow Deep Dive** | **[Dataflow Guide]({{ '/en/guides/dataflow/' | relative_url }})** | -| Task definitions & decorators | [Tasks Guide]({{ '/en/guides/tasks/' | relative_url }}) | -| Execution modes & error handling | [Execution Guide]({{ '/en/guides/execution/' | relative_url }}) | -| Real-time monitoring | [Tracking Guide]({{ '/en/guides/tracking/' | relative_url }}) | -| Live dashboards | [WebSocket Guide]({{ '/en/guides/websocket/' | relative_url }}) | -| Cron scheduling | [Scheduling Guide]({{ '/en/guides/scheduling/' | relative_url }}) | -| Error recovery | [Retry Guide]({{ '/en/guides/retry/' | relative_url }}) | -| Performance tuning | [Performance Guide]({{ '/en/guides/performance/' | relative_url }}) | -| REST API integration | [API Guide]({{ '/en/guides/api/' | relative_url }}) | -| Full API reference | [API Reference]({{ '/en/api/' | relative_url }}) | - ---- - -## Troubleshooting - -### "PipelineMiddleware not found" Error - -**Symptom**: Tasks fail with middleware errors. 
- -**Fix**: Ensure `PipelineMiddleware()` is added to your broker before creating pipelines: - -```python -broker.add_middlewares(PipelineMiddleware()) # Must be called -``` - -### "Task not found" or "Result is None" - -**Symptom**: `wait_result()` returns `None`. - -**Cause**: InMemoryBroker only works within the same process. For multi-worker distributed setups, use Redis or another persistent broker. - -**Fix**: Install the `[brokers]` extra and switch to `RedisStreamBroker`: - -```bash -pip install "taskiq-flow[brokers]" -``` - -```python -from taskiq_flow.broker import RedisStreamBroker -broker = RedisStreamBroker(redis_url="redis://localhost:6379") # requires taskiq-flow[brokers] -``` - -### WebSocket Connection Refused - -**Symptom**: Client cannot connect to WebSocket server. - -**Fix**: WebSocket is served through your FastAPI application. Ensure the FastAPI app is running and the WebSocket route is mounted: - -```python -from fastapi import FastAPI, WebSocket -from taskiq_flow.integration.websocket.fastapi_ws import fastapi_websocket_endpoint - -app = FastAPI() - -@app.websocket("/ws/{pipeline_id}") -async def ws_endpoint(websocket: WebSocket, pipeline_id: str): - await fastapi_websocket_endpoint(websocket, pipeline_id) - -# Run with: uvicorn app:app --host 0.0.0.0 --port 8000 -``` - -Then connect with `ws://localhost:8000/ws/{pipeline_id}`. - -> **Prerequisite**: Install the `[brokers]` extra: `pip install "taskiq-flow[brokers]"` for Redis-backed setups. - ---- - -## Further Reading - -- **[Full API Reference]({{ '/en/api/' | relative_url }})** — Complete class and method documentation -- **[Example Gallery]({{ '/en/examples/' | relative_url }})** — Detailed walkthroughs of each example script -- **[Project README](https://github.com/dorel14/taskiq-flow/blob/main/README.md)** — Project overview, installation, and philosophy - ---- - -*Ready to dive deeper? 
Continue with the [Pipelines Guide]({{ '/en/guides/pipelines/' | relative_url }}).* +--- +title: Quick Start Guide +nav_order: 10 +color_scheme: dark +--- +# Quick Start Guide + +**Getting up and running with Taskiq-Flow in 5 minutes** + +> **Version**: {VERSION} | **Prerequisites**: Python 3.9+, asyncio basics + +--- + +## Overview + +This guide will help you create your first pipelines with Taskiq-Flow. By the end, you'll understand: + +- How to set up a broker and add the PipelineMiddleware +- Defining tasks with `@broker.task` +- Building sequential pipelines with `.call_next()`, `.map()`, `.filter()` +- Running pipelines and retrieving results +- Basic dataflow pipelines with `@pipeline_task` + +--- + +## Prerequisites + +```bash +pip install taskiq taskiq-flow +``` + +For this guide, we'll use the in-memory broker which requires no external services. + +--- + +## 1. Basic Sequential Pipeline + +### 1.1. Setup + +Create a new Python file `quickstart_basic.py`: + +```python +import asyncio +from taskiq import InMemoryBroker +from taskiq_flow import Pipeline, PipelineMiddleware + +# Initialize broker and add required middleware +broker = InMemoryBroker() +broker.add_middlewares(PipelineMiddleware()) +``` + +### 1.2. Define Tasks + +All functions in a pipeline must be taskiq tasks (decorated with `@broker.task`): + +```python +@broker.task +def add_one(value: int) -> int: + """Add 1 to the input value.""" + return value + 1 + +@broker.task +def repeat(value: int, times: int) -> list[int]: + """Repeat a value multiple times.""" + return [value] * times + +@broker.task +def is_positive(value: int) -> bool: + """Check if value is non-negative.""" + return value >= 0 +``` + +### 1.3. 
Build & Run the Pipeline + +```python +async def main(): + # Build the pipeline by chaining operations + pipeline = ( + Pipeline(broker) + .call_next(add_one) # Step 1: 1 → 2 + .call_next(repeat, times=4) # Step 2: 2 → [2, 2, 2, 2] + .map(add_one) # Step 3: apply to each element → [3, 3, 3, 3] + .filter(is_positive) # Step 4: keep elements where result is True + ) + + # Kick off the pipeline with initial input + task = await pipeline.kiq(1) + + # Wait for completion and retrieve the result + result = await task.wait_result() + print("Result:", result.return_value) # Output: [3, 3, 3, 3] + +asyncio.run(main()) +``` + +**Expected output**: +``` +Result: [3, 3, 3, 3] +``` + +### 1.4. How It Works + +| Step | Operation | Input | Output | +|------|-----------|-------|--------| +| 1 | `.call_next(add_one)` | `1` | `2` | +| 2 | `.call_next(repeat, times=4)` | `2` | `[2, 2, 2, 2]` | +| 3 | `.map(add_one)` | `[2, 2, 2, 2]` | `[3, 3, 3, 3]` (parallel) | +| 4 | `.filter(is_positive)` | `[3, 3, 3, 3]` | `[3, 3, 3, 3]` (unchanged) | + +**Key points**: + +- The `PipelineMiddleware` handles task routing; it **must** be added to your broker. +- Each step receives the previous step's output as input. +- `.map()` and `.filter()` operate on iterable results and run elements in parallel. +- `pipeline.kiq(initial_input)` starts the pipeline and returns a `Task` object. +- `task.wait_result()` blocks until the pipeline finishes. + +--- + +## 2. Dataflow Pipeline (Automatic DAG) + +For more complex workflows, use `DataflowPipeline` which automatically builds a dependency graph. + +### 2.1. 
Define Tasks with `@pipeline_task` + +Mark task outputs using the `@pipeline_task` decorator: + +```python +from taskiq_flow import DataflowPipeline, pipeline_task + +@broker.task +@pipeline_task(output="features") +def extract_audio(track_paths: list[str]) -> dict: + """Extract audio features from tracks.""" + print(f"Extracting features from {len(track_paths)} tracks...") + return {"duration": 180.0, "tempo": 120.0, "energy": 0.8} + +@broker.task +@pipeline_task(output="tags") +def generate_tags(features: dict) -> list[str]: + """Generate tags based on audio features.""" + print(f"Generating tags from features: {features}") + return ["electronic", "dance", "upbeat"] + +@broker.task +@pipeline_task(output="embedding") +def compute_embedding(features: dict) -> list[float]: + """Compute vector embedding from features.""" + print(f"Computing embedding from {features}") + return [0.1, 0.2, 0.3, 0.4, 0.5] +``` + +**How dependency resolution works**: +- `extract_audio` declares `output="features"` +- `generate_tags` has parameter `features: dict` → automatically depends on `extract_audio` +- `compute_embedding` also depends on `extract_audio` (same `features` param) +- Taskiq-Flow constructs a DAG and runs independent tasks in parallel + +### 2.2. 
Build & Execute + +```python +async def main(): + # Auto-build the DAG from task list + pipeline = DataflowPipeline.from_tasks( + broker, + [extract_audio, generate_tags, compute_embedding] + ) + + # Optional: visualize the DAG + pipeline.print_dag() + + # Execute with input data (only external inputs needed) + results = await pipeline.kiq_dataflow(track_paths=["song1.mp3", "song2.mp3"]) + print("Results:", results) + # Output: { + # "features": {"duration": 180.0, ...}, + # "tags": ["electronic", "dance", "upbeat"], + # "embedding": [0.1, 0.2, 0.3, 0.4, 0.5] + # } + +asyncio.run(main()) +``` + +**Sample DAG output** (printed to console): +``` +DAG Execution Order: + Level 0 (parallel): extract_audio + Level 1 (parallel): generate_tags, compute_embedding + Final outputs: features, tags, embedding +``` + +### 2.3. Visualizing the Pipeline + +```python +# ASCII DAG in console +pipeline.print_dag() + +# JSON representation for web UIs +viz_json = pipeline.visualize() +print(viz_json) + +# DOT format for Graphviz +dot = pipeline.visualize_dot() +with open("pipeline.dot", "w") as f: + f.write(dot) +# Render: dot -Tpng pipeline.dot -o pipeline.png +``` + +--- + +## 3. Common Patterns + +### 3.1. Map-Reduce Pattern + +Process items in parallel, then aggregate: + +```python +from taskiq_flow import MapReduce + +# Map phase: process each track independently +mapped = await MapReduce.map( + broker, + process_track, # task function + track_list, # iterable of items + output="processed", # name of intermediate output + max_parallel=10 # limit concurrency +) + +# Reduce phase: aggregate all results +reduced = await MapReduce.reduce( + broker, + aggregate_results, # aggregation function + mapped, # MapReduceResult object + input_name="processed", # consume the mapped output + output="final_stats" +) + +print("Final:", reduced.return_value) +``` + +See `examples/dataflow_audio_pipeline.py` for a complete audio-processing pipeline. + +### 3.2. 
Group Parallel Execution + +Run multiple independent tasks simultaneously: + +```python +pipeline = Pipeline(broker) + +pipeline.group( + [task_a, task_b, task_c], + param_names=["input_a", "input_b", "input_c"] +) +# Returns: [result_a, result_b, result_c] +``` + +### 3.3. Pipeline with Tracking + +Monitor pipeline status in real-time: + +```python +from taskiq_flow import PipelineTrackingManager + +tracking = PipelineTrackingManager().with_auto_storage(broker) +pipeline = Pipeline(broker).with_tracking(tracking) + +task = await pipeline.kiq(data) + +# Check status later +status = await tracking.get_status(pipeline.pipeline_id) +print(f"Status: {status.status}, Steps completed: {len(status.steps)}") +``` + +--- + +## 4. Running Example Scripts + +The `examples/` directory contains complete runnable demonstrations: + +```bash +# Basic sequential pipeline +python examples/quickstart.py + +# Tracking and monitoring +python examples/tracking_demo.py + +# Scheduled pipelines (cron) +python examples/scheduled_pipeline.py + +# Full dataflow DAG with map-reduce +python examples/dataflow_audio_pipeline.py + +# Manual DAG construction with DataflowRegistry +python examples/registry_discovery_example.py + +# WebSocket event streaming +python examples/websocket_demo.py + +# REST API with FastAPI +python examples/api_example.py +``` + +--- + +## 5. 
Next Steps + +With the basics under your belt, explore the deeper guides: + +| Topic | Guide | +|-------|-------| +| Sequential & Dataflow Pipelines | [Pipelines Guide]({{ '/en/guides/pipelines/' | relative_url }}) | +| **Dataflow Deep Dive** | **[Dataflow Guide]({{ '/en/guides/dataflow/' | relative_url }})** | +| Task definitions & decorators | [Tasks Guide]({{ '/en/guides/tasks/' | relative_url }}) | +| Execution modes & error handling | [Execution Guide]({{ '/en/guides/execution/' | relative_url }}) | +| Real-time monitoring | [Tracking Guide]({{ '/en/guides/tracking/' | relative_url }}) | +| Live dashboards | [WebSocket Guide]({{ '/en/guides/websocket/' | relative_url }}) | +| Cron scheduling | [Scheduling Guide]({{ '/en/guides/scheduling/' | relative_url }}) | +| Error recovery | [Retry Guide]({{ '/en/guides/retry/' | relative_url }}) | +| Performance tuning | [Performance Guide]({{ '/en/guides/performance/' | relative_url }}) | +| REST API integration | [API Guide]({{ '/en/guides/api/' | relative_url }}) | +| Full API reference | [API Reference]({{ '/en/api/' | relative_url }}) | + +--- + +## Troubleshooting + +### "PipelineMiddleware not found" Error + +**Symptom**: Tasks fail with middleware errors. + +**Fix**: Ensure `PipelineMiddleware()` is added to your broker before creating pipelines: + +```python +broker.add_middlewares(PipelineMiddleware()) # Must be called +``` + +### "Task not found" or "Result is None" + +**Symptom**: `wait_result()` returns `None`. + +**Cause**: InMemoryBroker only works within the same process. For multi-worker distributed setups, use Redis or another persistent broker. 
+ +**Fix**: Install the `[brokers]` extra and switch to `RedisStreamBroker`: + +```bash +pip install "taskiq-flow[brokers]" +``` + +```python +from taskiq_flow.broker import RedisStreamBroker +broker = RedisStreamBroker(redis_url="redis://localhost:6379") # requires taskiq-flow[brokers] +``` + +### WebSocket Connection Refused + +**Symptom**: Client cannot connect to WebSocket server. + +**Fix**: WebSocket is served through your FastAPI application. Ensure the FastAPI app is running and the WebSocket route is mounted: + +```python +from fastapi import FastAPI, WebSocket +from taskiq_flow.integration.websocket.fastapi_ws import fastapi_websocket_endpoint + +app = FastAPI() + +@app.websocket("/ws/{pipeline_id}") +async def ws_endpoint(websocket: WebSocket, pipeline_id: str): + await fastapi_websocket_endpoint(websocket, pipeline_id) + +# Run with: uvicorn app:app --host 0.0.0.0 --port 8000 +``` + +Then connect with `ws://localhost:8000/ws/{pipeline_id}`. + +> **Prerequisite**: Install the `[brokers]` extra: `pip install "taskiq-flow[brokers]"` for Redis-backed setups. + +--- + +## Further Reading + +- **[Full API Reference]({{ '/en/api/' | relative_url }})** — Complete class and method documentation +- **[Example Gallery]({{ '/en/examples/' | relative_url }})** — Detailed walkthroughs of each example script +- **[Project README](https://github.com/dorel14/taskiq-flow/blob/main/README.md)** — Project overview, installation, and philosophy + +--- + +*Ready to dive deeper? 
Continue with the [Pipelines Guide]({{ '/en/guides/pipelines/' | relative_url }}).* diff --git a/docs/_fr/api/cache.md b/docs/_fr/api/cache.md index 3f6a5eb..6c1d1a4 100644 --- a/docs/_fr/api/cache.md +++ b/docs/_fr/api/cache.md @@ -1,151 +1,151 @@ ---- -title: Référence API : Cache -nav_order: 32 -color_scheme: dark ---- -# Référence API : Cache - -**Mise en cache avec sémantiques Dogpile (anti-stampede)** - -> **Version** : {VERSION} | **Nouveau en v1.2.0** | **Module** : `taskiq_flow.cache`, `taskiq_flow.middlewares.cache` - ---- - -## Aperçu - -Taskiq-Flow v1.2.0 introduit une **couche de cache** pour les workers, construite autour du **pattern Dogpile**. Le principe clé : lorsqu'une entrée de cache expire, seul un thread/poste est autorisé à la régénérer. Tous les autres attendent et récupèrent la valeur fraîche — annulant complètement le stampede. - -``` -Requêtes concurrentes à expiration TTL : - -Sans Dogpile : [tâche exécutée × 10 en parallèle] → surcharge -Avec Dogpile : [1 tâche s'exécute, 9 attendent] → résultat unique partagé -``` - ---- - -## `BaseCacheAdapter` (ABC) - -```python -from taskiq_flow.storage.base import BaseCacheAdapter - -class MonAdaptateurCache(BaseCacheAdapter): - async def get_or_create(self, key, creator, ttl_seconds=3600) -> Any: ... - async def get(self, key) -> Any | None: ... - async def set(self, key, value, ttl_seconds=3600) -> None: ... - async def invalidate(self, key) -> bool: ... - async def clear(self) -> None: ... - def get_stats(self) -> dict: ... -``` - -Interface abstraite à implémenter pour tout nouveau backend de cache. - -| Méthode | Anti-Stampede ? 
| Description | -|---------|----------------|-------------| -| `get_or_create(cle, creator, ttl)` | **Oui** | Lecture atomique : exécute `creator()` seulement si absent/expiré, avec verrou | -| `get(cle)` | Côté lecture | Consultation cache ; `None` en cas de miss | -| `set(cle, valeur, ttl)` | Côté écriture | Stockage avec TTL optionnel en secondes | -| `invalidate(cle)` | — | Éviction immédiate d'une entrée | -| `clear()` | — | Vider le cache entièrement | -| `get_stats()` | — | `{"hits", "misses", "hit_rate", "size", "keys"}` | - ---- - -## `InMemoryCacheAdapter` - -```python -from taskiq_flow.cache import InMemoryCacheAdapter - -cache = InMemoryCacheAdapter() - -resultat = await cache.get_or_create( - "calcul_couteux", - lambda: calculer_couteux(), - ttl_seconds=300, -) -stats = cache.get_stats() -``` - -| Fonctionnalité | Détail | -|---------------|--------| -| Sécurité thread | Verrou par clé `threading.Lock` | -| TTL | Horloge monotone ; indépendant de l'heure système | -| Verrou Dogpile | Libéré seulement quand `creator` termine | -| `creator()` async | Si `creator()` retourne une coroutine, elle est `await`ée | -| Statistiques | `get_stats()` : hits, misses, hit_rate, size, keys | - ---- - -## `RedisCacheAdapter` - -```python -from taskiq_flow.cache import RedisCacheAdapter - -cache = RedisCacheAdapter( - redis_url="redis://localhost:6379", - default_ttl=3600, - lock_timeout=10, -) -resultat = await cache.get_or_create("calcul_partage", - lambda: calculer_couteux(), - ttl_seconds=300) -``` - -Cache distribué avec verrouillage Redis pour Dogpile anti-stampede. 
- -| Fonctionnalité | Détail | -|---------------|--------| -| Verrou distribué | `SETNX` : plusieurs workers partagent une seule entrée | -| TTL natif Redis | `EXPIRE` par clé | -| Sérialisation JSON | Automatique pour types non primitifs | -| Délai de verrou max | Configurable ; évite les deadlocks si worker crashe | - ---- - -## `CacheMiddleware` - -```python -from taskiq_flow.middlewares import CacheMiddleware -broker.add_middlewares( - PipelineMiddleware(), - CacheMiddleware(cache=InMemoryCacheAdapter(), default_ttl=3600), -) -``` - -`CacheMiddleware` est la manière production-ready d'activer le cache sur un broker. Il se branche sur `pre_execute` et `post_save` : - -- **`pre_execute`** — Retourne le résultat en cache si présent ; la tâche est court-circuitée. -- **`post_save`** — Stocke le résultat en cache pour la prochaine exécution. - -| Paramètre du Constructeur | Type | Défaut | Description | -|--------------------------|------|--------|-------------| -| `cache` | `BaseCacheAdapter \| None` | `None` | Backend ; `None` → `InMemoryCacheAdapter` | -| `enabled` | `bool` | `True` | Toggle global | -| `default_ttl` | `int` | `3600` | Durée de vie par défaut en secondes | - -**Surcharges par label de tâche :** - -| Label Message | Valeurs | Effet | -|--------------|---------|-------| -| `cache_ttl` | secondes (entier) | Remplacer le TTL défaut pour cette exécution | -| `cache_errors` | `"true"` | Mettre en cache les résultats d'erreur aussi | - ---- - -## Choix d'un Backend de Cache - -| Backend | Quand l'utiliser | -|---------|-----------------| -| `InMemoryCacheAdapter` | Développement, tests, worker unique | -| `RedisCacheAdapter` | Production, multi-worker, distribué | - ---- - -## Lectures Associées - -- **[Guide des Middlewares]({{ '/fr/guides/cache/' | relative_url }})** — Configuration complète -- **[Référence API : Stockage]({{ '/fr/api/storage/' | relative_url }})** — Adaptateurs de stockage - ---- - -*Nouveau en v1.2.0. 
Les adaptateurs de cache sont async et interchangeables à l'instanciation.* +--- +title: 'Référence API : Cache' +nav_order: 32 +color_scheme: dark +--- +# Référence API : Cache + +**Mise en cache avec sémantiques Dogpile (anti-stampede)** + +> **Version** : {VERSION} | **Nouveau en v1.2.0** | **Module** : `taskiq_flow.cache`, `taskiq_flow.middlewares.cache` + +--- + +## Aperçu + +Taskiq-Flow v1.2.0 introduit une **couche de cache** pour les workers, construite autour du **pattern Dogpile**. Le principe clé : lorsqu'une entrée de cache expire, seul un thread/poste est autorisé à la régénérer. Tous les autres attendent et récupèrent la valeur fraîche — annulant complètement le stampede. + +``` +Requêtes concurrentes à expiration TTL : + +Sans Dogpile : [tâche exécutée × 10 en parallèle] → surcharge +Avec Dogpile : [1 tâche s'exécute, 9 attendent] → résultat unique partagé +``` + +--- + +## `BaseCacheAdapter` (ABC) + +```python +from taskiq_flow.storage.base import BaseCacheAdapter + +class MonAdaptateurCache(BaseCacheAdapter): + async def get_or_create(self, key, creator, ttl_seconds=3600) -> Any: ... + async def get(self, key) -> Any | None: ... + async def set(self, key, value, ttl_seconds=3600) -> None: ... + async def invalidate(self, key) -> bool: ... + async def clear(self) -> None: ... + def get_stats(self) -> dict: ... +``` + +Interface abstraite à implémenter pour tout nouveau backend de cache. + +| Méthode | Anti-Stampede ? 
| Description | +|---------|----------------|-------------| +| `get_or_create(cle, creator, ttl)` | **Oui** | Lecture atomique : exécute `creator()` seulement si absent/expiré, avec verrou | +| `get(cle)` | Côté lecture | Consultation cache ; `None` en cas de miss | +| `set(cle, valeur, ttl)` | Côté écriture | Stockage avec TTL optionnel en secondes | +| `invalidate(cle)` | — | Éviction immédiate d'une entrée | +| `clear()` | — | Vider le cache entièrement | +| `get_stats()` | — | `{"hits", "misses", "hit_rate", "size", "keys"}` | + +--- + +## `InMemoryCacheAdapter` + +```python +from taskiq_flow.cache import InMemoryCacheAdapter + +cache = InMemoryCacheAdapter() + +resultat = await cache.get_or_create( + "calcul_couteux", + lambda: calculer_couteux(), + ttl_seconds=300, +) +stats = cache.get_stats() +``` + +| Fonctionnalité | Détail | +|---------------|--------| +| Sécurité thread | Verrou par clé `threading.Lock` | +| TTL | Horloge monotone ; indépendant de l'heure système | +| Verrou Dogpile | Libéré seulement quand `creator` termine | +| `creator()` async | Si `creator()` retourne une coroutine, elle est `await`ée | +| Statistiques | `get_stats()` : hits, misses, hit_rate, size, keys | + +--- + +## `RedisCacheAdapter` + +```python +from taskiq_flow.cache import RedisCacheAdapter + +cache = RedisCacheAdapter( + redis_url="redis://localhost:6379", + default_ttl=3600, + lock_timeout=10, +) +resultat = await cache.get_or_create("calcul_partage", + lambda: calculer_couteux(), + ttl_seconds=300) +``` + +Cache distribué avec verrouillage Redis pour Dogpile anti-stampede. 
+ +| Fonctionnalité | Détail | +|---------------|--------| +| Verrou distribué | `SETNX` : plusieurs workers partagent une seule entrée | +| TTL natif Redis | `EXPIRE` par clé | +| Sérialisation JSON | Automatique pour types non primitifs | +| Délai de verrou max | Configurable ; évite les deadlocks si worker crashe | + +--- + +## `CacheMiddleware` + +```python +from taskiq_flow.middlewares import CacheMiddleware +broker.add_middlewares( + PipelineMiddleware(), + CacheMiddleware(cache=InMemoryCacheAdapter(), default_ttl=3600), +) +``` + +`CacheMiddleware` est la manière production-ready d'activer le cache sur un broker. Il se branche sur `pre_execute` et `post_save` : + +- **`pre_execute`** — Retourne le résultat en cache si présent ; la tâche est court-circuitée. +- **`post_save`** — Stocke le résultat en cache pour la prochaine exécution. + +| Paramètre du Constructeur | Type | Défaut | Description | +|--------------------------|------|--------|-------------| +| `cache` | `BaseCacheAdapter \| None` | `None` | Backend ; `None` → `InMemoryCacheAdapter` | +| `enabled` | `bool` | `True` | Toggle global | +| `default_ttl` | `int` | `3600` | Durée de vie par défaut en secondes | + +**Surcharges par label de tâche :** + +| Label Message | Valeurs | Effet | +|--------------|---------|-------| +| `cache_ttl` | secondes (entier) | Remplacer le TTL défaut pour cette exécution | +| `cache_errors` | `"true"` | Mettre en cache les résultats d'erreur aussi | + +--- + +## Choix d'un Backend de Cache + +| Backend | Quand l'utiliser | +|---------|-----------------| +| `InMemoryCacheAdapter` | Développement, tests, worker unique | +| `RedisCacheAdapter` | Production, multi-worker, distribué | + +--- + +## Lectures Associées + +- **[Guide des Middlewares]({{ '/fr/guides/cache/' | relative_url }})** — Configuration complète +- **[Référence API : Stockage]({{ '/fr/api/storage/' | relative_url }})** — Adaptateurs de stockage + +--- + +*Nouveau en v1.2.0. 
Les adaptateurs de cache sont async et interchangeables à l'instanciation.* diff --git a/docs/_fr/api/core.md b/docs/_fr/api/core.md index d7a0736..3708c57 100644 --- a/docs/_fr/api/core.md +++ b/docs/_fr/api/core.md @@ -1,271 +1,271 @@ ---- -permalink: /fr/api/core/ -title: Référence API: Composants Principaux -nav_order: 30 -color_scheme: dark ---- -# Référence API: Composants Principaux - -**Pipeline, DataflowPipeline, PipelineMiddleware, PipelineContext et exceptions principales** - -> **Version** : {VERSION} | **Module** : `taskiq_flow.core`, `taskiq_flow.pipeline`, `taskiq_flow.middleware` - ---- - -## Classes Principales - -### Pipeline (SequentialPipeline) - -Le pipeline séquentiel classique pour l'orchestration linéaire de tâches. - -```python -from taskiq_flow import Pipeline - -pipeline = Pipeline(broker) -``` - -**Constructeur**: -```python -Pipeline( - broker: BaseBroker, - max_parallel: int = None, # Limite globale de parallélisme - timeout: float = None, # Timeout global en secondes - pipeline_id: str = None # Auto-généré si non fourni -) -``` - -**Méthodes**: - -| Méthode | Signature | Description | -|---------|-----------|-------------| -| `call_next` | `call_next(task, *args, **kwargs) -> Pipeline` | Enchaîne une tâche; passe résultat précédent comme premier arg | -| `call_after` | `call_after(task, *args, **kwargs) -> Pipeline` | Exécute tâche sans consommer résultat précédent | -| `map` | `map(task, max_parallel=None, output_name=None) -> Pipeline` | Applique tâche à chaque élément d'un résultat itérable | -| `filter` | `filter(task) -> Pipeline` | Garde éléments où tâche retourne truthy | -| `group` | `group(tasks, param_names=None) -> Pipeline` | Exécute multiples tâches en parallèle depuis même entrée | -| `kiq` | `kiq(*args, **kwargs) -> Task` | Démarre exécution pipeline | -| `with_tracking` | `with_tracking(tracking_manager) -> Pipeline` | Attache gestionnaire de suivi | -| `with_hooks` | `with_hooks(hook_manager) -> Pipeline` | Attache 
gestionnaire hooks pour événements | -| `with_retry` | `with_retry(...) -> Pipeline` | Configure politique de retry | -| `with_timeout` | `with_timeout(seconds) -> Pipeline` | Définit timeout | -| `with_context` | `with_context(enable=True) -> Pipeline` | Active passage PipelineContext aux tâches | - -**Example**: -```python -pipeline = ( - Pipeline(broker) - .call_next(task1) - .call_next(task2, factor=2) - .map(task3, max_parallel=10) - .filter(validate) - .with_tracking(tracking) -) -result = await pipeline.kiq(initial_input) -``` - ---- - -### DataflowPipeline - -Construction automatique de DAG depuis dépendances entre tâches via décorateurs `@pipeline_task`. - -```python -from taskiq_flow import DataflowPipeline - -pipeline = DataflowPipeline.from_tasks( - broker, - [task_a, task_b, task_c] -) -``` - -**Constructeur**: -```python -DataflowPipeline( - broker: BaseBroker, - tasks: list[Callable] = None, - max_parallel: int = None, - timeout: float = None, - pipeline_id: str = None -) -``` - -**Méthodes de Classe**: - -| Méthode | Description | -|----------|-------------| -| `from_tasks(broker, tasks, **kwargs)` | Construit pipeline depuis liste de fonctions de tâche avec décorateurs `@pipeline_task` | - -**Méthodes d'Instance** (la plupart partagées avec `Pipeline`): - -| Méthode | Description | -|----------|-------------| -| `print_dag()` | Affiche DAG ASCII en console | -| `visualize()` | Retourne représentation JSON du DAG | -| `visualize_dot()` | Retourne chaîne DOT Graphviz | -| `kiq_dataflow(**kwargs)` | Exécute pipeline avec entrées nommées | - -**Exemple**: -```python -@broker.task -@pipeline_task(output="features") -def extract(données): ... - -@broker.task -@pipeline_task(output="tags") -def tag(features): ... 
- -pipeline = DataflowPipeline.from_tasks(broker, [extract, tag]) -pipeline.print_dag() -# Sortie: -# Niveau 0: extract -# Niveau 1: tag - -résultats = await pipeline.kiq_dataflow(data=données_entrée) -# résultats = {"features": ..., "tags": ...} -``` - ---- - -### PipelineMiddleware - -Le middleware qui orchestre l'exécution des étapes de pipeline. - -```python -from taskiq_flow import PipelineMiddleware - -broker.add_middlewares(PipelineMiddleware()) -``` - -**Responsabilités**: - -- Intercepte completion des tâches -- Détermine prochaine étape à exécuter -- Gère transitions d'état du pipeline -- Passe résultats entre étapes -- Émet événements hooks - -**Note** : Ce middleware **doit** être ajouté au broker pour que tout pipeline fonctionne. - ---- - -### PipelineContext - -Métadonnées passées aux tâches quand `with_context(enable=True)` est défini. - -```python -from taskiq_flow import PipelineContext - -@broker.task -async def my_task(data: str, context: PipelineContext): - print(f"Pipeline: {context.pipeline_id}") - print(f"Step: {context.step_index}") - print(f"Task ID: {context.task_id}") -``` - -**Champs**: - -| Champ | Type | Description | -|-------|------|-------------| -| `pipeline_id` | `str` | ID unique instance pipeline | -| `step_index` | `int` | Numéro étape courante (0-indexé) | -| `task_id` | `str` | ID tâche taskiq sous-jacente | -| `execution_mode` | `str` | `"sequential"`, `"parallel"`, `"map_reduce"` | -| `started_at` | `datetime` | Horodatage début pipeline | -| `broker` | `BaseBroker` | Référence instance broker | - ---- - -## Exceptions Principales - -Toutes exceptions héritent de classe de base `TaskiqFlowError`. 
- -```python -from taskiq_flow import TaskiqFlowError -``` - -| Exception | Signification | Cause Typique | -|-----------|---------------|---------------| -| `PipelineError` | Échec générique pipeline | Étape échouée | -| `CycleError` | Dépendance circulaire détectée | DAG a cycle | -| `TaskNotFoundError` | Tâche non dans registry | Tâche manquante dans DataflowPipeline | -| `InvalidOutputError` | Conflit clé de sortie | Deux tâches déclarent même sortie | -| `ConfigurationError` | Config pipeline invalide | Middleware manquant, paramètres incorrects | -| `TrackingError` | Échec opération suivi | Stockage indisponible | - -**Exemple gestion**: -```python -try: - résultat = await pipeline.kiq(données) -except CycleError as e: - print(f"Cycle DAG détecté: {e}") -except PipelineError as e: - print(f"Pipeline échoué: {e}") -``` - ---- - -## Utilitaires - -### DataflowRegistry - -Pour construction manuelle de DAG et inspection. - -```python -from taskiq_flow import DataflowRegistry - -registry = DataflowRegistry() -registry.register_task(tache, output="sortie", inputs=["entrée"]) -dag = registry.build_dag() -``` - -Voir documentation détaillée dans `docs/fr/api/dataflow.md`. - ---- - -### ExecutionEngine - -Exécuteur de DAG bas niveau pour cas avancés. - -```python -from taskiq_flow import ExecutionEngine - -engine = ExecutionEngine(broker, dag) -résultats = await engine.execute(inputs={"x": 1, "y": 2}) -``` - -Voir API docs execution. - ---- - -### PipelineScheduler - -Planification cron de pipelines. - -```python -from taskiq_flow import PipelineScheduler - -scheduler = PipelineScheduler(broker) -await scheduler.schedule(pipeline, cron="* * * * *") -await scheduler.start() -``` - -Voir guide planification. - ---- - -## Compatibilité Version - -Cette documentation couvre **Taskiq-Flow v0.3.0+**. 
- -Stabilité API: -- `Pipeline` et `DataflowPipeline`: Stable (v0.3+) -- Décorateur `pipeline_task`: Stable (v0.3+) -- `PipelineMiddleware`: Stable (v0.3+) -- `PipelineScheduler`: Stable (v0.3+) -- `PipelineTrackingManager`: Stable (v0.3+) - -Changements cassants notés dans [CHANGELOG.md](https://github.com/dorel14/taskiq-flow/blob/main/CHANGELOG.md). - ---- - -*Pour exemples détaillés, voir section [Exemples]({{ '/fr/examples/' | relative_url }}). Pour doc méthode par méthode, se référer aux docstrings Python inline (`help(Pipeline)`).* +--- +permalink: /fr/api/core/ +title: 'Référence API: Composants Principaux' +nav_order: 30 +color_scheme: dark +--- +# Référence API: Composants Principaux + +**Pipeline, DataflowPipeline, PipelineMiddleware, PipelineContext et exceptions principales** + +> **Version** : {VERSION} | **Module** : `taskiq_flow.core`, `taskiq_flow.pipeline`, `taskiq_flow.middleware` + +--- + +## Classes Principales + +### Pipeline (SequentialPipeline) + +Le pipeline séquentiel classique pour l'orchestration linéaire de tâches. 
+ +```python +from taskiq_flow import Pipeline + +pipeline = Pipeline(broker) +``` + +**Constructeur**: +```python +Pipeline( + broker: BaseBroker, + max_parallel: int = None, # Limite globale de parallélisme + timeout: float = None, # Timeout global en secondes + pipeline_id: str = None # Auto-généré si non fourni +) +``` + +**Méthodes**: + +| Méthode | Signature | Description | +|---------|-----------|-------------| +| `call_next` | `call_next(task, *args, **kwargs) -> Pipeline` | Enchaîne une tâche; passe résultat précédent comme premier arg | +| `call_after` | `call_after(task, *args, **kwargs) -> Pipeline` | Exécute tâche sans consommer résultat précédent | +| `map` | `map(task, max_parallel=None, output_name=None) -> Pipeline` | Applique tâche à chaque élément d'un résultat itérable | +| `filter` | `filter(task) -> Pipeline` | Garde éléments où tâche retourne truthy | +| `group` | `group(tasks, param_names=None) -> Pipeline` | Exécute multiples tâches en parallèle depuis même entrée | +| `kiq` | `kiq(*args, **kwargs) -> Task` | Démarre exécution pipeline | +| `with_tracking` | `with_tracking(tracking_manager) -> Pipeline` | Attache gestionnaire de suivi | +| `with_hooks` | `with_hooks(hook_manager) -> Pipeline` | Attache gestionnaire hooks pour événements | +| `with_retry` | `with_retry(...) -> Pipeline` | Configure politique de retry | +| `with_timeout` | `with_timeout(seconds) -> Pipeline` | Définit timeout | +| `with_context` | `with_context(enable=True) -> Pipeline` | Active passage PipelineContext aux tâches | + +**Example**: +```python +pipeline = ( + Pipeline(broker) + .call_next(task1) + .call_next(task2, factor=2) + .map(task3, max_parallel=10) + .filter(validate) + .with_tracking(tracking) +) +result = await pipeline.kiq(initial_input) +``` + +--- + +### DataflowPipeline + +Construction automatique de DAG depuis dépendances entre tâches via décorateurs `@pipeline_task`. 
+ +```python +from taskiq_flow import DataflowPipeline + +pipeline = DataflowPipeline.from_tasks( + broker, + [task_a, task_b, task_c] +) +``` + +**Constructeur**: +```python +DataflowPipeline( + broker: BaseBroker, + tasks: list[Callable] = None, + max_parallel: int = None, + timeout: float = None, + pipeline_id: str = None +) +``` + +**Méthodes de Classe**: + +| Méthode | Description | +|----------|-------------| +| `from_tasks(broker, tasks, **kwargs)` | Construit pipeline depuis liste de fonctions de tâche avec décorateurs `@pipeline_task` | + +**Méthodes d'Instance** (la plupart partagées avec `Pipeline`): + +| Méthode | Description | +|----------|-------------| +| `print_dag()` | Affiche DAG ASCII en console | +| `visualize()` | Retourne représentation JSON du DAG | +| `visualize_dot()` | Retourne chaîne DOT Graphviz | +| `kiq_dataflow(**kwargs)` | Exécute pipeline avec entrées nommées | + +**Exemple**: +```python +@broker.task +@pipeline_task(output="features") +def extract(données): ... + +@broker.task +@pipeline_task(output="tags") +def tag(features): ... + +pipeline = DataflowPipeline.from_tasks(broker, [extract, tag]) +pipeline.print_dag() +# Sortie: +# Niveau 0: extract +# Niveau 1: tag + +résultats = await pipeline.kiq_dataflow(data=données_entrée) +# résultats = {"features": ..., "tags": ...} +``` + +--- + +### PipelineMiddleware + +Le middleware qui orchestre l'exécution des étapes de pipeline. + +```python +from taskiq_flow import PipelineMiddleware + +broker.add_middlewares(PipelineMiddleware()) +``` + +**Responsabilités**: + +- Intercepte completion des tâches +- Détermine prochaine étape à exécuter +- Gère transitions d'état du pipeline +- Passe résultats entre étapes +- Émet événements hooks + +**Note** : Ce middleware **doit** être ajouté au broker pour que tout pipeline fonctionne. + +--- + +### PipelineContext + +Métadonnées passées aux tâches quand `with_context(enable=True)` est défini. 
+ +```python +from taskiq_flow import PipelineContext + +@broker.task +async def my_task(data: str, context: PipelineContext): + print(f"Pipeline: {context.pipeline_id}") + print(f"Step: {context.step_index}") + print(f"Task ID: {context.task_id}") +``` + +**Champs**: + +| Champ | Type | Description | +|-------|------|-------------| +| `pipeline_id` | `str` | ID unique instance pipeline | +| `step_index` | `int` | Numéro étape courante (0-indexé) | +| `task_id` | `str` | ID tâche taskiq sous-jacente | +| `execution_mode` | `str` | `"sequential"`, `"parallel"`, `"map_reduce"` | +| `started_at` | `datetime` | Horodatage début pipeline | +| `broker` | `BaseBroker` | Référence instance broker | + +--- + +## Exceptions Principales + +Toutes exceptions héritent de classe de base `TaskiqFlowError`. + +```python +from taskiq_flow import TaskiqFlowError +``` + +| Exception | Signification | Cause Typique | +|-----------|---------------|---------------| +| `PipelineError` | Échec générique pipeline | Étape échouée | +| `CycleError` | Dépendance circulaire détectée | DAG a cycle | +| `TaskNotFoundError` | Tâche non dans registry | Tâche manquante dans DataflowPipeline | +| `InvalidOutputError` | Conflit clé de sortie | Deux tâches déclarent même sortie | +| `ConfigurationError` | Config pipeline invalide | Middleware manquant, paramètres incorrects | +| `TrackingError` | Échec opération suivi | Stockage indisponible | + +**Exemple gestion**: +```python +try: + résultat = await pipeline.kiq(données) +except CycleError as e: + print(f"Cycle DAG détecté: {e}") +except PipelineError as e: + print(f"Pipeline échoué: {e}") +``` + +--- + +## Utilitaires + +### DataflowRegistry + +Pour construction manuelle de DAG et inspection. + +```python +from taskiq_flow import DataflowRegistry + +registry = DataflowRegistry() +registry.register_task(tache, output="sortie", inputs=["entrée"]) +dag = registry.build_dag() +``` + +Voir documentation détaillée dans `docs/fr/api/dataflow.md`. 
+ +--- + +### ExecutionEngine + +Exécuteur de DAG bas niveau pour cas avancés. + +```python +from taskiq_flow import ExecutionEngine + +engine = ExecutionEngine(broker, dag) +résultats = await engine.execute(inputs={"x": 1, "y": 2}) +``` + +Voir API docs execution. + +--- + +### PipelineScheduler + +Planification cron de pipelines. + +```python +from taskiq_flow import PipelineScheduler + +scheduler = PipelineScheduler(broker) +await scheduler.schedule(pipeline, cron="* * * * *") +await scheduler.start() +``` + +Voir guide planification. + +--- + +## Compatibilité Version + +Cette documentation couvre **Taskiq-Flow v0.3.0+**. + +Stabilité API: +- `Pipeline` et `DataflowPipeline`: Stable (v0.3+) +- Décorateur `pipeline_task`: Stable (v0.3+) +- `PipelineMiddleware`: Stable (v0.3+) +- `PipelineScheduler`: Stable (v0.3+) +- `PipelineTrackingManager`: Stable (v0.3+) + +Changements cassants notés dans [CHANGELOG.md](https://github.com/dorel14/taskiq-flow/blob/main/CHANGELOG.md). + +--- + +*Pour exemples détaillés, voir section [Exemples]({{ '/fr/examples/' | relative_url }}). 
Pour doc méthode par méthode, se référer aux docstrings Python inline (`help(Pipeline)`).* diff --git a/docs/_fr/api/decorators.md b/docs/_fr/api/decorators.md index 0bea978..4770e15 100644 --- a/docs/_fr/api/decorators.md +++ b/docs/_fr/api/decorators.md @@ -1,252 +1,268 @@ ---- -permalink: /fr/api/decorators/ -title: Référence API: Décorateurs -nav_order: 31 -color_scheme: dark ---- -# Référence API: Décorateurs - -**Décorateurs de tâches, @pipeline_task, et utilitaires** - +--- +permalink: /fr/api/decorators/ +title: 'Référence API: Décorateurs' +nav_order: 31 +color_scheme: dark +--- +# Référence API: Décorateurs + +**Décorateurs de tâches, @pipeline_task, et utilitaires** + > **Version** : {VERSION} | **Module** : `taskiq_flow.decorators` - ---- - -## Aperçu - -Le décorateur `@pipeline_task` annote les tâches taskiq avec des déclarations de sortie, permettant la résolution automatique de dépendances dans DataflowPipeline. - ---- - -## @pipeline_task - -Marque une tâche avec ce qu'elle produit pour les consommateurs en aval. 
- -```python -from taskiq_flow import pipeline_task - -@broker.task -@pipeline_task(output="features") -def extract(données: list[str]) -> dict: - return compute_features(données) -``` - -**Paramètres**: - -| Paramètre | Type | Description | -|-----------|------|-------------| -| `output` | `str` | Nom clé sortie unique | -| `outputs` | `list[str]` | Clés sortie multiples (pour retours tuple) | -| `inputs` | `list[str]` | Dépendances entrée explicites (optionnel, auto-détecté) | -| `description` | `str` | Description lisible humain (pour documentation) | - -### Sortie unique (plus courant) - -```python -@broker.task -@pipeline_task(output="données_traitées") -def process(données_brutes: str) -> dict: - return {"result": données_brutes.upper()} -``` - -### Sorties multiples - -```python -@broker.task -@pipeline_task(outputs=["features", "metadata"]) -def split_output(audio: np.ndarray) -> tuple[dict, dict]: - features = extract_features(audio) - metadata = extract_meta(audio) - return features, metadata # déballé vers les deux sorties -``` - -Les tâches en aval peuvent consommer soit sortie: - -```python -@broker.task -@pipeline_task(output="tags") -def tag(features: dict): ... # consomme sortie 'features' - -@broker.task -@pipeline_task(output="info") -def describe(metadata: dict): ... # consomme sortie 'metadata' -``` - ---- - -## @pipeline_task_multi_output - -Alias pour `@pipeline_task(outputs=[...])`. 
Apporte clarté pour tâches multi-sorties: - -```python -from taskiq_flow import pipeline_task_multi_output - -@broker.task -@pipeline_task_multi_output(outputs=["x", "y"]) -def split(valeur: int) -> tuple[int, int]: - return valeur // 2, valeur % 2 -``` - ---- - -## Fonctions Utilitaires - -### get_task_outputs(task: Callable) -> list[str] - -Obtenir clés sortie déclarées pour une tâche: - -```python -from taskiq_flow import get_task_outputs - -outputs = get_task_outputs(extract_task) -print(outputs) # ['features'] -``` - -### get_task_inputs(task: Callable) -> list[str] - -Get declared input dependencies: - -```python -from taskiq_flow import get_task_inputs - -inputs = get_task_inputs(tag_task) -print(inputs) # ['features'] -``` - -### is_pipeline_task(task: Callable) -> bool - -Check if function is decorated with `@pipeline_task`: - -```python -from taskiq_flow import is_pipeline_task - -if is_pipeline_task(my_function): - print("This is a pipeline task with output declarations") -``` - -### resolve_task_dependencies(tasks: list[Callable]) -> dict - -Build dependency map: - -```python -from taskiq_flow import resolve_task_dependencies - -deps = resolve_task_dependencies([task_a, task_b, task_c]) -# Returns: {task_a: [], task_b: ['features'], task_c: ['tags']} -``` - ---- - -## Decorator Order - -The order of decorators matters: `@broker.task` must be the outermost (applied last), `@pipeline_task` inner (applied first): - -```python -# CORRECT -@broker.task -@pipeline_task(output="result") -def my_task(): ... - -# INCORRECT (will fail) -@pipeline_task(output="result") -@broker.task -def my_task(): ... -``` - -Why: `@broker.task` wraps the function; `@pipeline_task` attaches metadata to the original function. Python applies decorators bottom-to-top. + +--- + +## Aperçu + +Le décorateur `@pipeline_task` annote les tâches taskiq avec des déclarations de sortie, permettant la résolution automatique de dépendances dans DataflowPipeline. 
+ +--- + +## @pipeline_task + +Marque une tâche avec ce qu'elle produit pour les consommateurs en aval. + +{% raw %} +```python +from taskiq_flow import pipeline_task + +@broker.task +@pipeline_task(output="features") +def extract(données: list[str]) -> dict: + return compute_features(données) ``` - -Pourquoi: `@broker.task` enveloppe la fonction; `@pipeline_task` attache métadonnées à la fonction originale. Python applique décorateurs bas-vers-haut. - ---- - -## Type Hints & Analyse Statique - -Les type hints aident IDEs et linters à comprendre le dataflow: - -```python -from typing import TypedDict - -class AudioFeatures(TypedDict): - duration: float - tempo: float - -@broker.task -@pipeline_task(output="features") -def extract(chemin: str) -> AudioFeatures: - return {"duration": 180.0, "tempo": 120.0} - -@broker.task -@pipeline_task(output="tags") -def tag(features: AudioFeatures) -> list[str]: # type-safe - return ["rapide", "électronique"] -``` - -Utiliser `TypedDict` ou modèles Pydantic pour meilleure autocomplétion IDE et vérification mypy. - ---- - -## Versionnage & Métadonnées - -Attacher version et autres métadonnées: - -```python -@broker.task( - nom="extract_features_v2", - labels={"version": "2.0.0", "expérimental": False} -) -@pipeline_task( - output="features", - description="Extraire caractéristiques audio (v2 avec estimation tempo améliorée)" -) -def extract(chemin: str) -> dict: - ... 
-``` - ---- - -## Pièges Courants - -| Piège | Conséquence | Correction | -|-------|-------------|------------| -| `@broker.task` manquant | Tâche non enregistrée avec broker | Ajouter décorateur | -| `output` non défini | Aucun consommateur en aval ne peut en dépendre | Toujours déclarer `output` pour tâches dataflow | -| Mismatch nom sortie | Tâche en aval ne reçoit pas entrée | S'assurer nom paramètre en aval correspond `output` amont | -| Utiliser `@pipeline_task` sur tâches SequentialPipeline | Aucun effet mais inutile | Seulement nécessaire pour DataflowPipeline | - ---- - -## Exemple: Pipeline Dataflow Complet - -```python -from taskiq import InMemoryBroker -from taskiq_flow import DataflowPipeline, pipeline_task - -broker = InMemoryBroker() - -@broker.task -@pipeline_task(output="brut") -def charger(source: str) -> dict: - return {"data": lire_fichier(source)} - -@broker.task -@pipeline_task(output="propre") -def nettoyer(brut: dict) -> dict: - return {"data": prétraiter(brut["data"])} - -@broker.task -@pipeline_task(output="stats") -def analyser(propre: dict) -> dict: - return calculer_stats(propre["data"]) - -# Construire -pipeline = DataflowPipeline.from_tasks(broker, [charger, nettoyer, analyser]) - -# Exécuter -résultats = await pipeline.kiq_dataflow(source="data.csv") -# résultats = {"brut": {...}, "propre": {...}, "stats": {...}} -``` - ---- - -*Pour API tâches complète, voir [Guide des Tâches]({{ '/fr/guides/tasks/' | relative_url }}). 
Pour écrire décorateurs personnalisés, étendre `BaseTaskDecorator` depuis `taskiq_flow.decorators`.* +{% endraw %} +**Paramètres**: + +| Paramètre | Type | Description | +|-----------|------|-------------| +| `output` | `str` | Nom clé sortie unique | +| `outputs` | `list[str]` | Clés sortie multiples (pour retours tuple) | +| `inputs` | `list[str]` | Dépendances entrée explicites (optionnel, auto-détecté) | +| `description` | `str` | Description lisible humain (pour documentation) | + +### Sortie unique (plus courant) + +{% raw %} +```python +@broker.task +@pipeline_task(output="données_traitées") +def process(données_brutes: str) -> dict: + return {"result": données_brutes.upper()} +``` +{% endraw %} +### Sorties multiples + +{% raw %} +```python +@broker.task +@pipeline_task(outputs=["features", "metadata"]) +def split_output(audio: np.ndarray) -> tuple[dict, dict]: + features = extract_features(audio) + metadata = extract_meta(audio) + return features, metadata # déballé vers les deux sorties +``` +{% endraw %} +Les tâches en aval peuvent consommer soit sortie: + +{% raw %} +```python +@broker.task +@pipeline_task(output="tags") +def tag(features: dict): ... # consomme sortie 'features' + +@broker.task +@pipeline_task(output="info") +def describe(metadata: dict): ... # consomme sortie 'metadata' +``` +{% endraw %} +--- + +## @pipeline_task_multi_output + +Alias pour `@pipeline_task(outputs=[...])`. 
Apporte clarté pour tâches multi-sorties: + +{% raw %} +```python +from taskiq_flow import pipeline_task_multi_output + +@broker.task +@pipeline_task_multi_output(outputs=["x", "y"]) +def split(valeur: int) -> tuple[int, int]: + return valeur // 2, valeur % 2 +``` +{% endraw %} +--- + +## Fonctions Utilitaires + +### get_task_outputs(task: Callable) -> list[str] + +Obtenir clés sortie déclarées pour une tâche: + +{% raw %} +```python +from taskiq_flow import get_task_outputs + +outputs = get_task_outputs(extract_task) +print(outputs) # ['features'] +``` +{% endraw %} +### get_task_inputs(task: Callable) -> list[str] + +Get declared input dependencies: + +{% raw %} +```python +from taskiq_flow import get_task_inputs + +inputs = get_task_inputs(tag_task) +print(inputs) # ['features'] +``` +{% endraw %} +### is_pipeline_task(task: Callable) -> bool + +Check if function is decorated with `@pipeline_task`: + +{% raw %} +```python +from taskiq_flow import is_pipeline_task + +if is_pipeline_task(my_function): + print("This is a pipeline task with output declarations") +``` +{% endraw %} +### resolve_task_dependencies(tasks: list[Callable]) -> dict + +Build dependency map: + +{% raw %} +```python +from taskiq_flow import resolve_task_dependencies + +deps = resolve_task_dependencies([task_a, task_b, task_c]) +# Returns: {task_a: [], task_b: ['features'], task_c: ['tags']} +``` +{% endraw %} +--- + +## Decorator Order + +The order of decorators matters: `@broker.task` must be the outermost (applied last), `@pipeline_task` inner (applied first): + +{% raw %} +```python +# CORRECT +@broker.task +@pipeline_task(output="result") +def my_task(): ... + +# INCORRECT (will fail) +@pipeline_task(output="result") +@broker.task +def my_task(): ... +``` +{% endraw %} +Why: `@broker.task` wraps the function; `@pipeline_task` attaches metadata to the original function. Python applies decorators bottom-to-top. 
+{% raw %} +``` + +Pourquoi: `@broker.task` enveloppe la fonction; `@pipeline_task` attache métadonnées à la fonction originale. Python applique décorateurs bas-vers-haut. + +--- + +## Type Hints & Analyse Statique + +Les type hints aident IDEs et linters à comprendre le dataflow: + +```python +{% endraw %} + +class AudioFeatures(TypedDict): + duration: float + tempo: float + +@broker.task +@pipeline_task(output="features") +def extract(chemin: str) -> AudioFeatures: + return {"duration": 180.0, "tempo": 120.0} + +@broker.task +@pipeline_task(output="tags") +def tag(features: AudioFeatures) -> list[str]: # type-safe + return ["rapide", "électronique"] +{% raw %} +``` + +Utiliser `TypedDict` ou modèles Pydantic pour meilleure autocomplétion IDE et vérification mypy. + +--- + +## Versionnage & Métadonnées + +Attacher version et autres métadonnées: + +```python +{% endraw %} + nom="extract_features_v2", + labels={"version": "2.0.0", "expérimental": False} +) +@pipeline_task( + output="features", + description="Extraire caractéristiques audio (v2 avec estimation tempo améliorée)" +) +def extract(chemin: str) -> dict: + ... 
+{% raw %} +``` + +--- + +## Pièges Courants + +| Piège | Conséquence | Correction | +|-------|-------------|------------| +| `@broker.task` manquant | Tâche non enregistrée avec broker | Ajouter décorateur | +| `output` non défini | Aucun consommateur en aval ne peut en dépendre | Toujours déclarer `output` pour tâches dataflow | +| Mismatch nom sortie | Tâche en aval ne reçoit pas entrée | S'assurer nom paramètre en aval correspond `output` amont | +| Utiliser `@pipeline_task` sur tâches SequentialPipeline | Aucun effet mais inutile | Seulement nécessaire pour DataflowPipeline | + +--- + +## Exemple: Pipeline Dataflow Complet + +```python +{% endraw %} +from taskiq_flow import DataflowPipeline, pipeline_task + +broker = InMemoryBroker() + +@broker.task +@pipeline_task(output="brut") +def charger(source: str) -> dict: + return {"data": lire_fichier(source)} + +@broker.task +@pipeline_task(output="propre") +def nettoyer(brut: dict) -> dict: + return {"data": prétraiter(brut["data"])} + +@broker.task +@pipeline_task(output="stats") +def analyser(propre: dict) -> dict: + return calculer_stats(propre["data"]) + +# Construire +pipeline = DataflowPipeline.from_tasks(broker, [charger, nettoyer, analyser]) + +# Exécuter +résultats = await pipeline.kiq_dataflow(source="data.csv") +# résultats = {"brut": {...}, "propre": {...}, "stats": {...}} +{% raw %} +``` + +--- + +*Pour API tâches complète, voir [Guide des Tâches]({{ '/fr/guides/tasks/' | relative_url }}). 
Pour écrire décorateurs personnalisés, étendre `BaseTaskDecorator` depuis `taskiq_flow.decorators`.* + +{% endraw %} \ No newline at end of file diff --git a/docs/_fr/api/execution.md b/docs/_fr/api/execution.md index 9f2384b..8ecd2a1 100644 --- a/docs/_fr/api/execution.md +++ b/docs/_fr/api/execution.md @@ -1,274 +1,274 @@ ---- -permalink: /fr/api/execution/ -title: Référence API: Moteur d'Exécution -nav_order: 32 -color_scheme: dark ---- -# Référence API: Moteur d'Exécution - -**ExecutionEngine, DAG, utilitaires map-reduce, et gestion d'erreurs** - +--- +permalink: /fr/api/execution/ +title: 'Référence API: Moteur d''Exécution' +nav_order: 32 +color_scheme: dark +--- +# Référence API: Moteur d'Exécution + +**ExecutionEngine, DAG, utilitaires map-reduce, et gestion d'erreurs** + > **Version** : {VERSION} | **Module** : `taskiq_flow.execution_engine`, `taskiq_flow.dataflow.dag`, `taskiq_flow.map_reduce` - ---- - -## ExecutionEngine - -Moteur de bas niveau pour exécuter des DAGs directement, évitant l'abstraction Pipeline. 
- -```python -from taskiq_flow import ExecutionEngine, DataflowRegistry - -# Construire le registre manuellement -registry = DataflowRegistry() -registry.register_task(load, output="raw", inputs=[]) -registry.register_task(process, output="clean", inputs=["raw"]) -registry.register_task(save, output="saved", inputs=["clean"]) - -# Construire le DAG -dag = registry.build_dag() - -# Créer le moteur -engine = ExecutionEngine(broker, dag) - -# Exécuter -results = await engine.execute(inputs={"source": "data.csv"}) -# results = {"raw": ..., "clean": ..., "saved": ...} -``` - -**Constructeur** : -```python -ExecutionEngine( - broker: BaseBroker, - dag: DAG, - max_parallel: int = None, - on_step_complete: callable = None -) -``` - -**Méthodes** : - -| Méthode | Signature | Description | -|---------|-----------|-------------| -| `execute` | `execute(inputs: dict) -> dict` | Exécute le DAG avec les entrées données | -| `execute_async` | `execute_async(inputs: dict) -> AsyncIterator` | Stream les résultats au fur et à mesure | -| `cancel` | `cancel()` | Arrête l'exécution en cours | - -**Événements** : - -```python -async def on_step(task_name: str, result: Any): - print(f"Étape {task_name} terminée") - -engine = ExecutionEngine(broker, dag, on_step_complete=on_step) -``` - ---- - -## DAG (Directed Acyclic Graph) - -Représente le graphe d'exécution des tâches. 
- -```python -from taskiq_flow.dataflow import DAG, DAGNode - -dag = DAG() -node = DAGNode(task=my_task, output="result", inputs=["input_a"]) -dag.add_node(node) -``` - -**Méthodes DAG** : - -| Méthode | Description | -|---------|-------------| -| `add_node(node: DAGNode)` | Ajoute un nœud tâche | -| `add_edge(from_task, to_task)` | Ajoute une dépendance | -| `topological_sort() -> list[DAGNode]` | Retourne l'ordre d'exécution | -| `get_parallel_levels() -> list[list[DAGNode]]` | Groupe les nœuds par niveau d'exécution parallèle | -| `validate()` | Vérifie cycles, nœuds manquants | -| `print()` | Visualisation ASCII vers console | - -**Propriétés DAG** : - -| Propriété | Type | Description | -|-----------|------|-------------| -| `nodes` | `list[DAGNode]` | Tous les nœuds du graphe | -| `edges` | `set[tuple[DAGNode, DAGNode]]` | Arêtes de dépendance | -| `roots` | `list[DAGNode]` | Nœuds sans dépendances | -| `leaves` | `list[DAGNode]` | Nœuds sans dépendants | - ---- - -## DAGNode - -Représente une tâche unique dans le DAG avec sa spécification E/S. - -```python -from taskiq_flow.dataflow import DAGNode - -node = DAGNode( - task=my_task_function, - output="result_key", - inputs=["input_a", "input_b"], - metadata={"description": "Ma tâche"} -) -``` - -**Propriétés** : - -| Propriété | Type | Description | -|-----------|------|-------------| -| `task` | `Callable` | La fonction tâche | -| `task_name` | `str` | Nom auto-généré ou personnalisé | -| `output` | `str` | Clé de sortie (unique) | -| `outputs` | `list[str]` | Clés de sortie (multiples) | -| `inputs` | `list[str]` | Clés d'entrée requises | -| `metadata` | `dict` | Métadonnées arbitraires | - ---- - -## DAGBuilder - -Helper pour construire des DAGs par programmation (moins courant ; utilisez généralement DataflowRegistry). 
- -```python -from taskiq_flow import DAGBuilder - -builder = DAGBuilder() -builder.add_task(task1, output="a", inputs=[]) -builder.add_task(task2, output="b", inputs=["a"]) -builder.add_task(task3, output="c", inputs=["a", "b"]) - -dag = builder.build() -``` - -**Pattern Builder** : - -```python -dag = (DAGBuilder() - .node(load, output="raw", inputs=[]) - .node(process, output="clean", inputs=["raw"]) - .node(save, output="saved", inputs=["clean"]) - .build() -) -``` - ---- - -## MapReduce - -Utilitaire pour map parallèle suivi d'un reduce. - -### MapReduce.map - -```python -from taskiq_flow import MapReduce - -mapped = await MapReduce.map( - broker, - map_func, # Fonction tâche à appliquer - items: Iterable, # Éléments à traiter - output: str = "mapped", - max_parallel: int = None -) -# Retourne : MapReduceResult (comme une Task) -``` - -### MapReduce.reduce - -```python -reduced = await MapReduce.reduce( - broker, - reduce_func, # Fonction d'agrégation - mapped_result, # Résultat de MapReduce.map - input_name: str, # Nom de la sortie mappée à consommer - output: str = "reduced" -) -# Retourne : Task (avec résultat final) -``` - -### MapReduce.map_reduce (combiné) - -```python -final = await MapReduce.map_reduce( - broker, - map_func, - items, - reduce_func, - map_output="mapped", - reduce_output="final", - max_parallel=10 -) -``` - -Les trois retournent des objets Task ; appelez `.wait_result()` pour récupérer la valeur. - ---- - -## DataflowRegistry (Avancé) - -Enregistrement manuel des tâches pour construction dynamique de pipeline. 
- -```python -from taskiq_flow import DataflowRegistry - -registry = DataflowRegistry() - -# Enregistrer les tâches avec E/S explicites -registry.register_task( - task=extract, - output="features", - inputs=["audio_files"] # entrée externe -) -registry.register_task( - task=tag, - output="tags", - inputs=["features"] # dépend de la sortie de extract -) - -# Inspection -print("Tâches:", [t.task_name for t in registry.get_tasks()]) -print("Sorties:", registry.get_outputs()) -print("Entrées externes:", registry.get_external_inputs()) - -# Construction du DAG -dag = registry.build_dag() -dag.print() - -# Exécution via ExecutionEngine -engine = ExecutionEngine(broker, dag) -results = await engine.execute(inputs={"audio_files": files}) -``` - -**Requêtes Registry** : - -| Méthode | Description | -|---------|-------------| -| `get_tasks()` | Liste tous les objets TaskNode | -| `get_outputs()` | Liste toutes les clés de sortie | -| `get_external_inputs()` | Liste les entrées non produites par une tâche | -| `get_producer(output_key)` | Retourne la tâche produisant cette sortie | -| `get_consumers(input_key)` | Liste les tâches consommant cette entrée | -| `build_dag()` | Construit le DAG, valide, retourne prêt à exécuter | - ---- - -## Notes de Version - -- **ExecutionEngine** introduit en v0.3.0 -- `DAG` et `DAGNode` sont utilisés en interne par DataflowPipeline -- MapReduce disponible depuis v0.2.0 - ---- - -## Prochaines Étapes - -- **[API Suivi]({{ '/fr/api/tracking/' | relative_url }})** — Surveiller l'exécution avec PipelineTrackingManager -- **[API WebSocket]({{ '/fr/api/websocket/' | relative_url }})** — HookManager et système d'événements -- **[API Core]({{ '/fr/api/core/' | relative_url }})** — Référence Pipeline et middleware -- **[Exemple Pipeline Audio Dataflow]({{ '/fr/examples/dataflow-audio-pipeline/' | relative_url }})** — Voir ExecutionEngine utilisé dans un pipeline DAG réel -- **[Exemple Découverte Registry]({{ '/fr/examples/registry-discovery/' | 
relative_url }})** — Construction manuelle de DAG et utilisation d'ExecutionEngine - ---- - -*Pour cas avancés uniquement. 95% des utilisateurs devraient se contenter des abstractions Pipeline et DataflowPipeline.* + +--- + +## ExecutionEngine + +Moteur de bas niveau pour exécuter des DAGs directement, évitant l'abstraction Pipeline. + +```python +from taskiq_flow import ExecutionEngine, DataflowRegistry + +# Construire le registre manuellement +registry = DataflowRegistry() +registry.register_task(load, output="raw", inputs=[]) +registry.register_task(process, output="clean", inputs=["raw"]) +registry.register_task(save, output="saved", inputs=["clean"]) + +# Construire le DAG +dag = registry.build_dag() + +# Créer le moteur +engine = ExecutionEngine(broker, dag) + +# Exécuter +results = await engine.execute(inputs={"source": "data.csv"}) +# results = {"raw": ..., "clean": ..., "saved": ...} +``` + +**Constructeur** : +```python +ExecutionEngine( + broker: BaseBroker, + dag: DAG, + max_parallel: int = None, + on_step_complete: callable = None +) +``` + +**Méthodes** : + +| Méthode | Signature | Description | +|---------|-----------|-------------| +| `execute` | `execute(inputs: dict) -> dict` | Exécute le DAG avec les entrées données | +| `execute_async` | `execute_async(inputs: dict) -> AsyncIterator` | Stream les résultats au fur et à mesure | +| `cancel` | `cancel()` | Arrête l'exécution en cours | + +**Événements** : + +```python +async def on_step(task_name: str, result: Any): + print(f"Étape {task_name} terminée") + +engine = ExecutionEngine(broker, dag, on_step_complete=on_step) +``` + +--- + +## DAG (Directed Acyclic Graph) + +Représente le graphe d'exécution des tâches. 
+ +```python +from taskiq_flow.dataflow import DAG, DAGNode + +dag = DAG() +node = DAGNode(task=my_task, output="result", inputs=["input_a"]) +dag.add_node(node) +``` + +**Méthodes DAG** : + +| Méthode | Description | +|---------|-------------| +| `add_node(node: DAGNode)` | Ajoute un nœud tâche | +| `add_edge(from_task, to_task)` | Ajoute une dépendance | +| `topological_sort() -> list[DAGNode]` | Retourne l'ordre d'exécution | +| `get_parallel_levels() -> list[list[DAGNode]]` | Groupe les nœuds par niveau d'exécution parallèle | +| `validate()` | Vérifie cycles, nœuds manquants | +| `print()` | Visualisation ASCII vers console | + +**Propriétés DAG** : + +| Propriété | Type | Description | +|-----------|------|-------------| +| `nodes` | `list[DAGNode]` | Tous les nœuds du graphe | +| `edges` | `set[tuple[DAGNode, DAGNode]]` | Arêtes de dépendance | +| `roots` | `list[DAGNode]` | Nœuds sans dépendances | +| `leaves` | `list[DAGNode]` | Nœuds sans dépendants | + +--- + +## DAGNode + +Représente une tâche unique dans le DAG avec sa spécification E/S. + +```python +from taskiq_flow.dataflow import DAGNode + +node = DAGNode( + task=my_task_function, + output="result_key", + inputs=["input_a", "input_b"], + metadata={"description": "Ma tâche"} +) +``` + +**Propriétés** : + +| Propriété | Type | Description | +|-----------|------|-------------| +| `task` | `Callable` | La fonction tâche | +| `task_name` | `str` | Nom auto-généré ou personnalisé | +| `output` | `str` | Clé de sortie (unique) | +| `outputs` | `list[str]` | Clés de sortie (multiples) | +| `inputs` | `list[str]` | Clés d'entrée requises | +| `metadata` | `dict` | Métadonnées arbitraires | + +--- + +## DAGBuilder + +Helper pour construire des DAGs par programmation (moins courant ; utilisez généralement DataflowRegistry). 
+ +```python +from taskiq_flow import DAGBuilder + +builder = DAGBuilder() +builder.add_task(task1, output="a", inputs=[]) +builder.add_task(task2, output="b", inputs=["a"]) +builder.add_task(task3, output="c", inputs=["a", "b"]) + +dag = builder.build() +``` + +**Pattern Builder** : + +```python +dag = (DAGBuilder() + .node(load, output="raw", inputs=[]) + .node(process, output="clean", inputs=["raw"]) + .node(save, output="saved", inputs=["clean"]) + .build() +) +``` + +--- + +## MapReduce + +Utilitaire pour map parallèle suivi d'un reduce. + +### MapReduce.map + +```python +from taskiq_flow import MapReduce + +mapped = await MapReduce.map( + broker, + map_func, # Fonction tâche à appliquer + items: Iterable, # Éléments à traiter + output: str = "mapped", + max_parallel: int = None +) +# Retourne : MapReduceResult (comme une Task) +``` + +### MapReduce.reduce + +```python +reduced = await MapReduce.reduce( + broker, + reduce_func, # Fonction d'agrégation + mapped_result, # Résultat de MapReduce.map + input_name: str, # Nom de la sortie mappée à consommer + output: str = "reduced" +) +# Retourne : Task (avec résultat final) +``` + +### MapReduce.map_reduce (combiné) + +```python +final = await MapReduce.map_reduce( + broker, + map_func, + items, + reduce_func, + map_output="mapped", + reduce_output="final", + max_parallel=10 +) +``` + +Les trois retournent des objets Task ; appelez `.wait_result()` pour récupérer la valeur. + +--- + +## DataflowRegistry (Avancé) + +Enregistrement manuel des tâches pour construction dynamique de pipeline. 
+ +```python +from taskiq_flow import DataflowRegistry + +registry = DataflowRegistry() + +# Enregistrer les tâches avec E/S explicites +registry.register_task( + task=extract, + output="features", + inputs=["audio_files"] # entrée externe +) +registry.register_task( + task=tag, + output="tags", + inputs=["features"] # dépend de la sortie de extract +) + +# Inspection +print("Tâches:", [t.task_name for t in registry.get_tasks()]) +print("Sorties:", registry.get_outputs()) +print("Entrées externes:", registry.get_external_inputs()) + +# Construction du DAG +dag = registry.build_dag() +dag.print() + +# Exécution via ExecutionEngine +engine = ExecutionEngine(broker, dag) +results = await engine.execute(inputs={"audio_files": files}) +``` + +**Requêtes Registry** : + +| Méthode | Description | +|---------|-------------| +| `get_tasks()` | Liste tous les objets TaskNode | +| `get_outputs()` | Liste toutes les clés de sortie | +| `get_external_inputs()` | Liste les entrées non produites par une tâche | +| `get_producer(output_key)` | Retourne la tâche produisant cette sortie | +| `get_consumers(input_key)` | Liste les tâches consommant cette entrée | +| `build_dag()` | Construit le DAG, valide, retourne prêt à exécuter | + +--- + +## Notes de Version + +- **ExecutionEngine** introduit en v0.3.0 +- `DAG` et `DAGNode` sont utilisés en interne par DataflowPipeline +- MapReduce disponible depuis v0.2.0 + +--- + +## Prochaines Étapes + +- **[API Suivi]({{ '/fr/api/tracking/' | relative_url }})** — Surveiller l'exécution avec PipelineTrackingManager +- **[API WebSocket]({{ '/fr/api/websocket/' | relative_url }})** — HookManager et système d'événements +- **[API Core]({{ '/fr/api/core/' | relative_url }})** — Référence Pipeline et middleware +- **[Exemple Pipeline Audio Dataflow]({{ '/fr/examples/dataflow-audio-pipeline/' | relative_url }})** — Voir ExecutionEngine utilisé dans un pipeline DAG réel +- **[Exemple Découverte Registry]({{ '/fr/examples/registry-discovery/' | 
relative_url }})** — Construction manuelle de DAG et utilisation d'ExecutionEngine + +--- + +*Pour cas avancés uniquement. 95% des utilisateurs devraient se contenter des abstractions Pipeline et DataflowPipeline.* diff --git a/docs/_fr/api/index.md b/docs/_fr/api/index.md index 2cd520b..5c29de1 100644 --- a/docs/_fr/api/index.md +++ b/docs/_fr/api/index.md @@ -1,26 +1,26 @@ ---- -title: Référence API -nav_order: 35 -permalink: /fr/api/ -color_scheme: dark ---- -# Référence API - -Documentation complète des modules et classes de Taskiq-Flow. - -## Disponible - -| Module | Description | -|--------|-------------| -| **[Composants Cœurs]({{ '/fr/api/core/' | relative_url }})** | Pipeline, DataflowPipeline, middleware, exceptions | -| **[Décorateurs]({{ '/fr/api/decorators/' | relative_url }})** | `@pipeline_task` et utilitaires | -| **[Exécution]({{ '/fr/api/execution/' | relative_url }})** | ExecutionEngine, DAG, DAGBuilder | -| **[Stockage]({{ '/fr/api/storage/' | relative_url }})** nouveau en v1.2.0 | Adaptateurs de stockage interchangeables (InMemory, Redis, SQLite), factory, StorageMiddleware | -| **[Cache]({{ '/fr/api/cache/' | relative_url }})** nouveau en v1.2.0 | Cache Dogpile (adaptateurs InMemory et Redis), CacheMiddleware | -| **[Suivi]({{ '/fr/api/tracking/' | relative_url }})** | TrackingManager et backends de stockage | -| **[Optimisation]({{ '/fr/api/optimization/' | relative_url }})** | ResourceAwareExecutor | -| **[WebSocket]({{ '/fr/api/websocket/' | relative_url }})** | HookManager et système d'événements | - ---- - -*Pas sûr où commencer ? Voir le [Guide de Démarrage Rapide]({{ '/fr/quickstart/' | relative_url }}) ou les [Guides]({{ '/fr/guides/' | relative_url }}).* +--- +title: Référence API +nav_order: 35 +permalink: /fr/api/ +color_scheme: dark +--- +# Référence API + +Documentation complète des modules et classes de Taskiq-Flow. 
+ +## Disponible + +| Module | Description | +|--------|-------------| +| **[Composants Cœurs]({{ '/fr/api/core/' | relative_url }})** | Pipeline, DataflowPipeline, middleware, exceptions | +| **[Décorateurs]({{ '/fr/api/decorators/' | relative_url }})** | `@pipeline_task` et utilitaires | +| **[Exécution]({{ '/fr/api/execution/' | relative_url }})** | ExecutionEngine, DAG, DAGBuilder | +| **[Stockage]({{ '/fr/api/storage/' | relative_url }})** nouveau en v1.2.0 | Adaptateurs de stockage interchangeables (InMemory, Redis, SQLite), factory, StorageMiddleware | +| **[Cache]({{ '/fr/api/cache/' | relative_url }})** nouveau en v1.2.0 | Cache Dogpile (adaptateurs InMemory et Redis), CacheMiddleware | +| **[Suivi]({{ '/fr/api/tracking/' | relative_url }})** | TrackingManager et backends de stockage | +| **[Optimisation]({{ '/fr/api/optimization/' | relative_url }})** | ResourceAwareExecutor | +| **[WebSocket]({{ '/fr/api/websocket/' | relative_url }})** | HookManager et système d'événements | + +--- + +*Pas sûr où commencer ? Voir le [Guide de Démarrage Rapide]({{ '/fr/quickstart/' | relative_url }}) ou les [Guides]({{ '/fr/guides/' | relative_url }}).* diff --git a/docs/_fr/api/storage.md b/docs/_fr/api/storage.md index ab42ed8..9d68793 100644 --- a/docs/_fr/api/storage.md +++ b/docs/_fr/api/storage.md @@ -1,204 +1,204 @@ ---- -title: Référence API : Stockage -nav_order: 31 -color_scheme: dark ---- -# Référence API : Stockage - -**Couche de persistance centralisée — adaptateurs, factory et StorageMiddleware** - -> **Version** : {VERSION} | **Nouveau en v1.2.0** | **Module** : `taskiq_flow.storage`, `taskiq_flow.middlewares.storage` - ---- - -## Aperçu - -Taskiq-Flow v1.2.0 introduit une **couche de stockage centralisée** qui découple les préoccupations de persistance du broker sous-jacent. 
Le système de stockage offre : - -- **Une interface unifiée** — `BaseStorageAdapter` fonctionne avec tous les backends -- **Trois adaptateurs natifs** — InMemory, Redis, SQLite/SQLAlchemy -- **Factory d'auto-détection** — `StorageAdapterFactory` choisit le bon backend automatiquement -- **Intégration middleware** — `StorageMiddleware` se branche dans le cycle de vie TaskIQ - -Utilisez `StorageMiddleware` plutôt que du code ad-hoc : il intercepte les événements de tâche et persiste les résultats via un adaptateur interchangeable. - ---- - -## Module `taskiq_flow.storage` - -### `StorageEntry` - -```python -from taskiq_flow.storage import StorageEntry -from datetime import datetime, timezone - -entry = StorageEntry( - key="pipeline:run42:task:abc123", - value={"statut": "terminé", "resultat": 42}, - expires_at=datetime.now(timezone.utc) + timedelta(hours=1), - metadata={"pipeline_id": "run42"}, -) -``` - -Conteneur typé pour une valeur stockée avec TTL optionnel et métadonnées. - -| Attribut | Type | Description | -|----------|------|-------------| -| `key` | `str` | Clé unique de l'entrée | -| `value` | `Any` | Valeur stockée (recommandé : sérialisable JSON) | -| `created_at` | `datetime` | Horodatage de création (UTC) | -| `expires_at` | `datetime \| None` | Horodatage d'expiration ; `None` = pas d'expiration | -| `metadata` | `dict` | Métadonnées arbitraires | - -| Méthode | Signature | Description | -|---------|-----------|-------------| -| `is_expired()` | `() -> bool` | `True` si l'entrée a expiré | -| `remaining_ttl()` | `() -> float \| None` | Secondes restantes avant expiration ; `None` si jamais | - ---- - -### `BaseStorageAdapter` (ABC) - -```python -from taskiq_flow.storage import BaseStorageAdapter - -class MonAdaptateur(BaseStorageAdapter): - async def get(self, key: str) -> Any | None: ... - async def set(self, key: str, value: Any, ttl_seconds=None) -> None: ... - async def delete(self, key: str) -> bool: ... 
- async def exists(self, key: str) -> bool: ... - async def keys(self, pattern="*") -> list[str]: ... - async def cleanup(self, ttl_seconds=3600) -> int: ... -``` - -Interface abstraite que tous les backends de stockage doivent implémenter. Utilisez-la pour créer un backend personnalisé (PostgreSQL, DynamoDB, etc.). - -| Méthode | Description | -|---------|-------------| -| `get(cle)` | Récupérer une valeur par clé ; `None` si absente ou expirée | -| `set(cle, valeur, ttl_seconds)` | Stocker une valeur avec TTL optionnel en secondes | -| `delete(cle)` | Supprimer l'entrée ; retourne `True` si supprimée | -| `exists(cle)` | Vérifier l'existence d'une clé | -| `keys(motif)` | Lister les clés correspondant à un motif glob (ex. `"pipeline:*"`) | -| `cleanup(ttl_seconds)` | Purger les entrées expirées ; retourne le nombre supprimé | - ---- - -### `InMemoryStorageAdapter` - -```python -from taskiq_flow.storage import InMemoryStorageAdapter - -stockage = InMemoryStorageAdapter() -``` - -Adaptateur en mémoire basé sur un `dict` avec support de TTL par clé. Idéal pour le développement, les tests et les déploiements mono-processus. - ---- - -### `RedisStorageAdapter` - -```python -from taskiq_flow.storage import RedisStorageAdapter - -stockage = RedisStorageAdapter( - redis_url="redis://localhost:6379", - ttl_seconds=3600, -) -``` - -Adaptateur persistant basé sur Redis avec TTL natif et sérialisation JSON. - -| Fonctionnalité | Statut | -|---------------|--------| -| TTL natif | Par clé via `EXPIRE` Redis | -| Sérialisation JSON | Automatique | -| Partage distribué | Tous les workers partagent le même Redis | -| Persistance | Tant que Redis persiste | - ---- - -### `SQLiteStorageAdapter` - -```python -from taskiq_flow.storage import SQLiteStorageAdapter - -stockage = SQLiteStorageAdapter( - db_url="sqlite+aiosqlite:///taskiq-flow.db", - async_mode=True, -) -``` - -Adaptateur SQLite/SQLAlchemy pour une persistance locale sans service externe. 
- ---- - -## Module `taskiq_flow.storage.factory` - -### `StorageAdapterFactory` - -```python -from taskiq_flow.storage.factory import StorageAdapterFactory -config = TaskiqFlowConfig() -adaptateur = StorageAdapterFactory.create_storage_adapter(config=config) -``` - -Ordre de priorité pour `create_storage_adapter(type="auto")` : - -| Priorité | Backend | Condition | -|----------|---------|-----------| -| 1 | `RedisStorageAdapter` | `storage_type="redis"` ou broker est RedisBroker | -| 2 | `SQLiteStorageAdapter` | `storage_type="sqlite"` ou `"sqlalchemy"` | -| 3 | `InMemoryStorageAdapter` | Fallback | - -| Méthode de Factory | Description | -|-------------------|-------------| -| `create_storage_adapter(config, broker, …)` | Crée un `BaseStorageAdapter` | -| `create_cache_adapter(config, …)` | Crée un `BaseCacheAdapter` | -| `create_default_middlewares(config, broker)` | Crée `StorageMiddleware` et `CacheMiddleware` | - ---- - -## Module `taskiq_flow.middlewares.storage` - -### `StorageMiddleware` - -```python -from taskiq_flow.middlewares import StorageMiddleware -broker.add_middlewares( - StorageMiddleware(storage=InMemoryStorageAdapter(), enabled=True), - PipelineMiddleware(), -) -``` - -Intercepte le cycle de vie TaskIQ et persiste les résultats de tâche via l'adaptateur de stockage configuré. 
- -| Paramètre | Type | Défaut | Description | -|-----------|------|--------|-------------| -| `storage` | `BaseStorageAdapter \| None` | `None` | Backend de stockage | -| `enabled` | `bool` | `True` | Active/désactive la persistance | - -| Hook | Signature | Description | -|------|-----------|-------------| -| `post_save(message, result)` | Persiste `TaskiqResult` dans le stockage | Clé : `task:{task_id}` ou `pipeline:{pipeline_id}:task:{task_id}` | - ---- - -## Choix d'un Backend - -| Backend | Cas d'usage | Avantages | Inconvénients | -|---------|------------|-----------|---------------| -| `InMemoryStorageAdapter` | Dev, tests, mono-processus | Zéro dépendance, rapide | Volatile, non partagé | -| `RedisStorageAdapter` | Production, distribué | Rapide, partagé, persisté | Requiert Redis | -| `SQLiteStorageAdapter` | Persistance légère sans service externe | Pas de service externe | Contention mono-écriture | - ---- - -## Lectures Associées - -- **[Guide Stockage & Cache]({{ '/fr/guides/cache/' | relative_url }})** — Configuration complète des middlewares -- **[Référence API : Cache]({{ '/fr/api/cache/' | relative_url }})** — Adaptateurs de cache Dogpile - ---- - -*Nouveau en v1.2.0. Les adaptateurs de stockage sont entièrement interchangeables : changez l'adaptateur sans toucher la logique métier.* +--- +title: 'Référence API : Stockage' +nav_order: 31 +color_scheme: dark +--- +# Référence API : Stockage + +**Couche de persistance centralisée — adaptateurs, factory et StorageMiddleware** + +> **Version** : {VERSION} | **Nouveau en v1.2.0** | **Module** : `taskiq_flow.storage`, `taskiq_flow.middlewares.storage` + +--- + +## Aperçu + +Taskiq-Flow v1.2.0 introduit une **couche de stockage centralisée** qui découple les préoccupations de persistance du broker sous-jacent. 
Le système de stockage offre : + +- **Une interface unifiée** — `BaseStorageAdapter` fonctionne avec tous les backends +- **Trois adaptateurs natifs** — InMemory, Redis, SQLite/SQLAlchemy +- **Factory d'auto-détection** — `StorageAdapterFactory` choisit le bon backend automatiquement +- **Intégration middleware** — `StorageMiddleware` se branche dans le cycle de vie TaskIQ + +Utilisez `StorageMiddleware` plutôt que du code ad-hoc : il intercepte les événements de tâche et persiste les résultats via un adaptateur interchangeable. + +--- + +## Module `taskiq_flow.storage` + +### `StorageEntry` + +```python +from taskiq_flow.storage import StorageEntry +from datetime import datetime, timezone + +entry = StorageEntry( + key="pipeline:run42:task:abc123", + value={"statut": "terminé", "resultat": 42}, + expires_at=datetime.now(timezone.utc) + timedelta(hours=1), + metadata={"pipeline_id": "run42"}, +) +``` + +Conteneur typé pour une valeur stockée avec TTL optionnel et métadonnées. + +| Attribut | Type | Description | +|----------|------|-------------| +| `key` | `str` | Clé unique de l'entrée | +| `value` | `Any` | Valeur stockée (recommandé : sérialisable JSON) | +| `created_at` | `datetime` | Horodatage de création (UTC) | +| `expires_at` | `datetime \| None` | Horodatage d'expiration ; `None` = pas d'expiration | +| `metadata` | `dict` | Métadonnées arbitraires | + +| Méthode | Signature | Description | +|---------|-----------|-------------| +| `is_expired()` | `() -> bool` | `True` si l'entrée a expiré | +| `remaining_ttl()` | `() -> float \| None` | Secondes restantes avant expiration ; `None` si jamais | + +--- + +### `BaseStorageAdapter` (ABC) + +```python +from taskiq_flow.storage import BaseStorageAdapter + +class MonAdaptateur(BaseStorageAdapter): + async def get(self, key: str) -> Any | None: ... + async def set(self, key: str, value: Any, ttl_seconds=None) -> None: ... + async def delete(self, key: str) -> bool: ... 
+ async def exists(self, key: str) -> bool: ... + async def keys(self, pattern="*") -> list[str]: ... + async def cleanup(self, ttl_seconds=3600) -> int: ... +``` + +Interface abstraite que tous les backends de stockage doivent implémenter. Utilisez-la pour créer un backend personnalisé (PostgreSQL, DynamoDB, etc.). + +| Méthode | Description | +|---------|-------------| +| `get(cle)` | Récupérer une valeur par clé ; `None` si absente ou expirée | +| `set(cle, valeur, ttl_seconds)` | Stocker une valeur avec TTL optionnel en secondes | +| `delete(cle)` | Supprimer l'entrée ; retourne `True` si supprimée | +| `exists(cle)` | Vérifier l'existence d'une clé | +| `keys(motif)` | Lister les clés correspondant à un motif glob (ex. `"pipeline:*"`) | +| `cleanup(ttl_seconds)` | Purger les entrées expirées ; retourne le nombre supprimé | + +--- + +### `InMemoryStorageAdapter` + +```python +from taskiq_flow.storage import InMemoryStorageAdapter + +stockage = InMemoryStorageAdapter() +``` + +Adaptateur en mémoire basé sur un `dict` avec support de TTL par clé. Idéal pour le développement, les tests et les déploiements mono-processus. + +--- + +### `RedisStorageAdapter` + +```python +from taskiq_flow.storage import RedisStorageAdapter + +stockage = RedisStorageAdapter( + redis_url="redis://localhost:6379", + ttl_seconds=3600, +) +``` + +Adaptateur persistant basé sur Redis avec TTL natif et sérialisation JSON. + +| Fonctionnalité | Statut | +|---------------|--------| +| TTL natif | Par clé via `EXPIRE` Redis | +| Sérialisation JSON | Automatique | +| Partage distribué | Tous les workers partagent le même Redis | +| Persistance | Tant que Redis persiste | + +--- + +### `SQLiteStorageAdapter` + +```python +from taskiq_flow.storage import SQLiteStorageAdapter + +stockage = SQLiteStorageAdapter( + db_url="sqlite+aiosqlite:///taskiq-flow.db", + async_mode=True, +) +``` + +Adaptateur SQLite/SQLAlchemy pour une persistance locale sans service externe. 
+ +--- + +## Module `taskiq_flow.storage.factory` + +### `StorageAdapterFactory` + +```python +from taskiq_flow.storage.factory import StorageAdapterFactory +config = TaskiqFlowConfig() +adaptateur = StorageAdapterFactory.create_storage_adapter(config=config) +``` + +Ordre de priorité pour `create_storage_adapter(type="auto")` : + +| Priorité | Backend | Condition | +|----------|---------|-----------| +| 1 | `RedisStorageAdapter` | `storage_type="redis"` ou broker est RedisBroker | +| 2 | `SQLiteStorageAdapter` | `storage_type="sqlite"` ou `"sqlalchemy"` | +| 3 | `InMemoryStorageAdapter` | Fallback | + +| Méthode de Factory | Description | +|-------------------|-------------| +| `create_storage_adapter(config, broker, …)` | Crée un `BaseStorageAdapter` | +| `create_cache_adapter(config, …)` | Crée un `BaseCacheAdapter` | +| `create_default_middlewares(config, broker)` | Crée `StorageMiddleware` et `CacheMiddleware` | + +--- + +## Module `taskiq_flow.middlewares.storage` + +### `StorageMiddleware` + +```python +from taskiq_flow.middlewares import StorageMiddleware +broker.add_middlewares( + StorageMiddleware(storage=InMemoryStorageAdapter(), enabled=True), + PipelineMiddleware(), +) +``` + +Intercepte le cycle de vie TaskIQ et persiste les résultats de tâche via l'adaptateur de stockage configuré. 
+ +| Paramètre | Type | Défaut | Description | +|-----------|------|--------|-------------| +| `storage` | `BaseStorageAdapter \| None` | `None` | Backend de stockage | +| `enabled` | `bool` | `True` | Active/désactive la persistance | + +| Hook | Signature | Description | +|------|-----------|-------------| +| `post_save(message, result)` | Persiste `TaskiqResult` dans le stockage | Clé : `task:{task_id}` ou `pipeline:{pipeline_id}:task:{task_id}` | + +--- + +## Choix d'un Backend + +| Backend | Cas d'usage | Avantages | Inconvénients | +|---------|------------|-----------|---------------| +| `InMemoryStorageAdapter` | Dev, tests, mono-processus | Zéro dépendance, rapide | Volatile, non partagé | +| `RedisStorageAdapter` | Production, distribué | Rapide, partagé, persisté | Requiert Redis | +| `SQLiteStorageAdapter` | Persistance légère sans service externe | Pas de service externe | Contention mono-écriture | + +--- + +## Lectures Associées + +- **[Guide Stockage & Cache]({{ '/fr/guides/cache/' | relative_url }})** — Configuration complète des middlewares +- **[Référence API : Cache]({{ '/fr/api/cache/' | relative_url }})** — Adaptateurs de cache Dogpile + +--- + +*Nouveau en v1.2.0. 
Les adaptateurs de stockage sont entièrement interchangeables : changez l'adaptateur sans toucher la logique métier.* diff --git a/docs/_fr/api/tracking.md b/docs/_fr/api/tracking.md index bdc6fe3..d661ffb 100644 --- a/docs/_fr/api/tracking.md +++ b/docs/_fr/api/tracking.md @@ -1,361 +1,361 @@ ---- -permalink: /fr/api/tracking/ -title: Référence API: Suivi & Monitoring -nav_order: 33 -color_scheme: dark ---- -# Référence API: Suivi & Monitoring - -**PipelineTrackingManager, backends de stockage, et modèles de statut** - +--- +permalink: /fr/api/tracking/ +title: 'Référence API: Suivi & Monitoring' +nav_order: 33 +color_scheme: dark +--- +# Référence API: Suivi & Monitoring + +**PipelineTrackingManager, backends de stockage, et modèles de statut** + > **Version** : {VERSION} | **Module** : `taskiq_flow.tracking`, `taskiq_flow.tracking.models` - ---- - -## PipelineTrackingManager - -Coordinateur central pour enregistrer et récupérer les données d'exécution des pipelines. - -```python -from taskiq_flow import PipelineTrackingManager - -tracking = PipelineTrackingManager() -tracking = tracking.with_auto_storage(broker) -# or -tracking = tracking.with_storage(InMemoryPipelineStorage()) -``` - -**Configuration**: - -```python -tracking = PipelineTrackingManager( - storage=None, # Optional pre-configured storage - max_history=1000, # Max pipeline records (memory store only) - auto_cleanup=True # Auto-purge old records -) -``` - -**Sélection stockage** (via `with_auto_storage`): - -| Broker | Stockage auto-sélectionné | -|--------|-------------------------| -| `InMemoryBroker` | `InMemoryPipelineStorage` | -| `RedisBroker` | `RedisPipelineStorage` | -| Autre | Fallback mémoire | - ---- - -## Méthodes - -### Attacher aux Pipelines - -```python -pipeline = Pipeline(broker).with_tracking(tracking) -# or -pipeline.with_tracking(tracking) # in-place modification -``` - -The tracking manager **must** be attached **before** calling `pipeline.kiq()`. 
- -### Interroger les Statuts - -```python -# Get status of specific pipeline execution -status = await tracking.get_status(pipeline_id: str) -> PipelineStatus | None - -# List all tracked pipelines -all_statuses = await tracking.list_pipelines( - filter_status: str | None = None, # Filter by status - limit: int = 100 -) -> list[PipelineStatus] - -# Get execution history -history = await tracking.get_history( - since: datetime | None = None, - until: datetime | None = None, - limit: int = 100 -) -> list[PipelineStatus] -``` - -### Maintenance - -```python -# Delete specific pipeline record -await tracking.delete_pipeline(pipeline_id: str) - -# Delete records older than N days -deleted = await tracking.cleanup_older_than(days: int = 30) -> int - -# Get aggregated metrics -metrics = await tracking.get_metrics( - days: int = 7 -) -> TrackingMetrics -``` - -### Event Listeners - -```python -class MyListener: - async def on_pipeline_start(self, pipeline_id: str): - print(f"Pipeline {pipeline_id} démarré") - - async def on_pipeline_complete(self, pipeline_id: str, status: PipelineStatus): - alert_if_failed(status) - -listener = MyListener() -tracking.add_listener(listener) -``` - -**Hooks écouteur** (tous optionnels): - -- `on_pipeline_start(pipeline_id)` -- `on_step_start(pipeline_id, step_name)` -- `on_step_complete(pipeline_id, step_name, result)` -- `on_pipeline_complete(pipeline_id, statut)` -- `on_pipeline_error(pipeline_id, error)` - ---- - + +--- + +## PipelineTrackingManager + +Coordinateur central pour enregistrer et récupérer les données d'exécution des pipelines. 
+ +```python +from taskiq_flow import PipelineTrackingManager + +tracking = PipelineTrackingManager() +tracking = tracking.with_auto_storage(broker) +# or +tracking = tracking.with_storage(InMemoryPipelineStorage()) +``` + +**Configuration**: + +```python +tracking = PipelineTrackingManager( + storage=None, # Optional pre-configured storage + max_history=1000, # Max pipeline records (memory store only) + auto_cleanup=True # Auto-purge old records +) +``` + +**Sélection stockage** (via `with_auto_storage`): + +| Broker | Stockage auto-sélectionné | +|--------|-------------------------| +| `InMemoryBroker` | `InMemoryPipelineStorage` | +| `RedisBroker` | `RedisPipelineStorage` | +| Autre | Fallback mémoire | + +--- + +## Méthodes + +### Attacher aux Pipelines + +```python +pipeline = Pipeline(broker).with_tracking(tracking) +# or +pipeline.with_tracking(tracking) # in-place modification +``` + +The tracking manager **must** be attached **before** calling `pipeline.kiq()`. + +### Interroger les Statuts + +```python +# Get status of specific pipeline execution +status = await tracking.get_status(pipeline_id: str) -> PipelineStatus | None + +# List all tracked pipelines +all_statuses = await tracking.list_pipelines( + filter_status: str | None = None, # Filter by status + limit: int = 100 +) -> list[PipelineStatus] + +# Get execution history +history = await tracking.get_history( + since: datetime | None = None, + until: datetime | None = None, + limit: int = 100 +) -> list[PipelineStatus] +``` + +### Maintenance + +```python +# Delete specific pipeline record +await tracking.delete_pipeline(pipeline_id: str) + +# Delete records older than N days +deleted = await tracking.cleanup_older_than(days: int = 30) -> int + +# Get aggregated metrics +metrics = await tracking.get_metrics( + days: int = 7 +) -> TrackingMetrics +``` + +### Event Listeners + +```python +class MyListener: + async def on_pipeline_start(self, pipeline_id: str): + print(f"Pipeline {pipeline_id} 
démarré") + + async def on_pipeline_complete(self, pipeline_id: str, status: PipelineStatus): + alert_if_failed(status) + +listener = MyListener() +tracking.add_listener(listener) +``` + +**Hooks écouteur** (tous optionnels): + +- `on_pipeline_start(pipeline_id)` +- `on_step_start(pipeline_id, step_name)` +- `on_step_complete(pipeline_id, step_name, result)` +- `on_pipeline_complete(pipeline_id, statut)` +- `on_pipeline_error(pipeline_id, error)` + +--- + ## Backends de Stockage - -### InMemoryPipelineStorage - -```python -from taskiq_flow.tracking import InMemoryPipelineStorage - -storage = InMemoryPipelineStorage(max_records=1000) -tracking = PipelineTrackingManager().with_storage(storage) -``` - -**Characteristics**: -- Zero configuration -- Fast (no I/O) -- **Not shared between workers** -- Lost on process restart -- Good for: development, testing, single-process - -**Parameters**: - -| Parameter | Type | Default | Description | -|-----------|------|---------|-------------| -| `max_records` | `int` | 1000 | Maximum pipeline records to keep (LRU eviction) | - ---- - -### RedisPipelineStorage - -```python -from taskiq_flow.tracking import RedisPipelineStorage -import redis.asyncio as redis - -redis_client = redis.Redis(host="localhost", port=6379, decode_responses=True) -storage = RedisPipelineStorage( - redis_client, - key_prefix="taskiq_flow:tracking:", - ttl_seconds=604800 # 7 days -) -tracking = PipelineTrackingManager().with_storage(storage) -``` - -**Characteristics**: -- Shared between multiple workers -- Survives restarts -- Scalable (Redis cluster) -- TTL-based expiration -- Good for: production, distributed deployments - -**Parameters**: - -| Parameter | Type | Default | Description | -|-----------|------|---------|-------------| -| `redis_client` | `Redis` | **required** | Connected Redis client | -| `key_prefix` | `str` | `"taskiq_flow:tracking:"` | Prefix for all keys | -| `ttl_seconds` | `int` | 604800 (7d) | Automatic expiration after N seconds | 
-| `serializer` | `Callable` | `json.dumps` | Custom serialization function | + +### InMemoryPipelineStorage + +```python +from taskiq_flow.tracking import InMemoryPipelineStorage + +storage = InMemoryPipelineStorage(max_records=1000) +tracking = PipelineTrackingManager().with_storage(storage) +``` + +**Characteristics**: +- Zero configuration +- Fast (no I/O) +- **Not shared between workers** +- Lost on process restart +- Good for: development, testing, single-process + +**Parameters**: + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| `max_records` | `int` | 1000 | Maximum pipeline records to keep (LRU eviction) | + +--- + +### RedisPipelineStorage + +```python +from taskiq_flow.tracking import RedisPipelineStorage +import redis.asyncio as redis + +redis_client = redis.Redis(host="localhost", port=6379, decode_responses=True) +storage = RedisPipelineStorage( + redis_client, + key_prefix="taskiq_flow:tracking:", + ttl_seconds=604800 # 7 days +) +tracking = PipelineTrackingManager().with_storage(storage) +``` + +**Characteristics**: +- Shared between multiple workers +- Survives restarts +- Scalable (Redis cluster) +- TTL-based expiration +- Good for: production, distributed deployments + +**Parameters**: + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| `redis_client` | `Redis` | **required** | Connected Redis client | +| `key_prefix` | `str` | `"taskiq_flow:tracking:"` | Prefix for all keys | +| `ttl_seconds` | `int` | 604800 (7d) | Automatic expiration after N seconds | +| `serializer` | `Callable` | `json.dumps` | Custom serialization function | +``` + +**Caractéristiques**: +- Zéro configuration +- Rapide (pas d'I/O) +- **Non partagé entre workers** +- Perdu au redémarrage du processus +- Bon pour: développement, tests, mono-processus + +**Paramètres**: + +| Paramètre | Type | Défaut | Description | +|-----------|------|---------|-------------| +| `max_records` | 
`int` | 1000 | Max enregistrements pipelines à retenir (éviction LRU) | + +--- + +### RedisPipelineStorage + +```python +from taskiq_flow.tracking import RedisPipelineStorage +import redis.asyncio as redis + +client_redis = redis.Redis(host="localhost", port=6379, decode_responses=True) +stockage = RedisPipelineStorage( + client_redis, + key_prefix="taskiq_flow:tracking:", + ttl_seconds=604800 # 7 jours +) +tracking = PipelineTrackingManager().with_storage(stockage) +``` + +**Caractéristiques**: +- Partagé entre multiples workers +- Persiste au redémarrage +- Évolutif (cluster Redis) +- Expiration basée TTL +- Bon pour: production, déploiements distribués + +**Paramètres**: + +| Paramètre | Type | Défaut | Description | +|-----------|------|---------|-------------| +| `redis_client` | `Redis` | **requis** | Client Redis connecté | +| `key_prefix` | `str` | `"taskiq_flow:tracking:"` | Préfixe pour toutes clés | +| `ttl_seconds` | `int` | 604800 (7j) | Expiration automatique après N secondes | +| `serializer` | `Callable` | `json.dumps` | Fonction de sérialisation personnalisée | + + +--- + +## Modèles de Données + +### PipelineStatus + +Statut complet d'une exécution de pipeline.
+ +```python +from taskiq_flow.tracking.models import PipelineStatus + +statut: PipelineStatus +``` + +**Attributs**: + +| Attribut | Type | Description | +|----------|------|-------------| +| `pipeline_id` | `str` | Identifiant unique | +| `status` | `str` | `PENDING`, `RUNNING`, `COMPLETED`, `FAILED`, `CANCELLED` | +| `pipeline_type` | `str` | `"sequential"` ou `"dataflow"` | +| `started_at` | `datetime` | Horodatage début exécution | +| `completed_at` | `datetime \| None` | Heure de fin si terminé | +| `duration_ms` | `float` | Durée totale en millisecondes | +| `steps` | `list[StepStatus]` | Objets statut par étape | +| `result` | `Any` | Valeur de retour finale (si terminé) | +| `error` | `str \| None` | Message d'erreur si échec | + +**Méthodes**: +- `model_dump()` — Retourne dictionnaire (modèle Pydantic) +- `is_finished()` — True si état terminal (COMPLETED/FAILED/CANCELLED) + +--- + +### StepStatus + +Statut d'une seule étape de pipeline. + +```python +from taskiq_flow.tracking.models import StepStatus + +étape: StepStatus +``` + +**Attributs**: + +| Attribut | Type | Description | +|----------|------|-------------| +| `step_name` | `str` | Nom de la tâche | +| `status` | `str` | `PENDING`, `RUNNING`, `COMPLETED`, `FAILED` | +| `started_at` | `datetime` | Heure début étape | +| `completed_at` | `datetime \| None` | Heure fin étape | +| `duration_ms` | `float` | Durée d'exécution | +| `result` | `Any` | Valeur de retour | +| `error` | `str \| None` | Message d'erreur | +| `retry_count` | `int` | Nombre de tentatives de retry | + +--- + +### TrackingMetrics + +Statistiques agrégées (retournées par `get_metrics()`).
+ +```python +from taskiq_flow.tracking.models import TrackingMetrics + +métriques: TrackingMetrics +``` + +**Attributs**: + +| Attribut | Type | Description | +|----------|------|-------------| +| `total_pipelines` | `int` | Total exécutions suivies | +| `completed` | `int` | Complétions réussies | +| `failed` | `int` | Exécutions échouées | +| `success_rate` | `float` | Ratio complété / total | +| `avg_duration_ms` | `float` | Durée moyenne pipeline | +| `p95_duration_ms` | `float` | Durée percentile 95 | +| `failure_reasons` | `dict[str, int]` | Type erreur → compte | +| `most_frequent_step` | `str \| None` | Étape échouant le plus souvent | + +--- + +## Implémentation Stockage Personnalisé + +Implémenter protocole `TrackingStorage` pour backend personnalisé: + +```python +from taskiq_flow.tracking.storage import TrackingStorage +from taskiq_flow.tracking.models import PipelineStatus + +class PostgresStorage(TrackingStorage): + async def save_status(self, status: PipelineStatus): + """Save status to PostgreSQL.""" + ... + + async def get_status(self, pipeline_id: str) -> PipelineStatus | None: + """Fetch from DB.""" + ... + + async def list_pipelines(self, filter_status: str | None = None): + """Query with optional filter.""" + ... + + async def delete_pipeline(self, pipeline_id: str): + """Remove record.""" + ...
+ +tracking = PipelineTrackingManager().with_storage(PostgresStorage()) ``` - -**Caractéristiques**: -- Zéro configuration -- Rapide (pas d'I/O) -- **Non partagé entre workers** -- Perdu au redémarrage du processus -- Bon pour: développement, tests, mono-processus - -**Paramètres**: - -| Paramètre | Type | Défaut | Description | -|-----------|------|---------|-------------| -| `max_records` | `int` | 1000 | Max enregistrements pipelines à retenir (éviction LRU) | - ---- - -### RedisPipelineStorage - -```python -from taskiq_flow.tracking import RedisPipelineStorage -import redis.asyncio as redis - -client_redis = redis.Redis(host="localhost", port=6379, decode_responses=True) -stockage = RedisPipelineStorage( - client_redis, - key_prefix="taskiq_flow:tracking:", - ttl_seconds=604800 # 7 jours -) -tracking = PipelineTrackingManager().with_storage(storage) -``` - -**Caractéristiques**: -- Partagé entre multiples workers -- Persiste au redémarrage -- Évolutif (cluster Redis) -- Expiration basée TTL -- Bon pour: production, déploiements distribués - -**Paramètres**: - -| Paramètre | Type | Défaut | Description | -|-----------|------|---------|-------------| -| `client_redis` | `Redis` | **requis** | Client Redis connecté | -| `key_prefix` | `str` | `"taskiq_flow:tracking:"` | Préfixe pour toutes clés | -| `ttl_seconds` | `int` | 604800 (7j) | Expiration automatique après N secondes | -| `serializer` | `Callable` | `json.dumps` | Fonction de sérialisation personnalisée | - - ---- - -## Modèles de Données - -### PipelineStatus - -Statut complet d'une exécution de pipeline. 
- -```python -from taskiq_flow.tracking.models import PipelineStatus - -statut: PipelineStatus -``` - -**Attributs**: - -| Attribut | Type | Description | -|----------|------|-------------| -| `pipeline_id` | `str` | Identifiant unique | -| `status` | `str` | `PENDING`, `RUNNING`, `COMPLETED`, `FAILED`, `CANCELLED` | -| `pipeline_type` | `str` | `"sequential"` ou `"dataflow"` | -| `started_at` | `datetime` | Horodatage début exécution | -| `completed_at` | `datetime | None` | He fin si terminé | -| `duration_ms` | `float` | Durée totale en millisecondes | -| `steps` | `list[StepStatus]` | Objets statut par étape | -| `result` | `Any` | Valeur de retour finale (si terminé) | -| `error` | `str | None` | Message d'erreur si échec | - -**Méthodes**: -- `model_dump()` — Retourne dictionnaire (modèle Pydantic) -- `is_finished()` — True si état terminal (COMPLETED/FAILED/CANCELLED) - ---- - -### StepStatus - -Statut d'une seule étape de pipeline. - -```python -from taskiq_flow.tracking.models import StepStatus - -étape: StepStatus -``` - -**Attributs**: - -| Attribut | Type | Description | -|----------|------|-------------| -| `step_name` | `str` | Nom de la tâche | -| `status` | `str` | `PENDING`, `RUNNING`, `COMPLETED`, `FAILED` | -| `started_at` | `datetime` | Heure début étape | -| `completed_at` | `datetime | None` | Heure fin étape | -| `duration_ms` | `float` | Durée d'exécution | -| `result` | `Any` | Valeur de retour | -| `error` | `str | None` | Message d'erreur | -| `retry_count` | `int` | Nombre de tentatives de retry | - ---- - -### TrackingMetrics - -Statistiques agrégées (retournées par `get_metrics()`). 
- -```python -from taskiq_flow.tracking.models import TrackingMetrics - -métriques: TrackingMetrics -``` - -**Attributs**: - -| Attribut | Type | Description | -|----------|------|-------------| -| `total_pipelines` | `int` | Total exécutions suivies | -| `completed` | `int` | Complétions réussies | -| `failed` | `int` | Exécutions échouées | -| `success_rate` | `float` | Ratio complété / total | -| `avg_duration_ms` | `float` | Durée moyenne pipeline | -| `p95_duration_ms` | `float` | Durée percentile 95 | -| `failure_reasons` | `dict[str, int]` | Type erreur → compte | -| `most_frequent_step` | `str | None` | Étape échouant le plus souvent | - ---- - -## Implémentation Stockage Personnalisé - -Implémenter protocole `TrackingStorage` pour backend personnalisé: - -```python -from taskiq_flow.tracking.storage import TrackingStorage -from taskiq_flow.tracking.models import PipelineStatus - -class PostgresStorage(TrackingStorage): - async def save_status(self, status: PipelineStatus): - """Save status to PostgreSQL.""" - ... - - async def get_status(self, pipeline_id: str) -> PipelineStatus | None: - """Fetch from DB.""" - ... - - async def list_pipelines(self, filter_status: str | None = None): - """Query with optional filter.""" - ... - - async def delete_pipeline(self, pipeline_id: str): - """Remove record.""" - ... - -tracking = PipelineTrackingManager().with_storage(PostgresStorage()) -``` - + All storage methods must be async. - ---- - -## Meilleures Pratiques - -1. **Production** : Toujours utiliser stockage Redis (partagé, persistant) -2. **TTL** : Définir TTL approprié (7–30 jours) pour limiter croissance stockage -3. **Écouteurs** : Ajouter écouteurs d'alerte pour échecs -4. **Nettoyage** : Planifier nettoyage périodique (cron quotidien) -5. 
**Indexation** : Pour stores DB personnalisés, indexer sur `pipeline_id`, `started_at` pour performance requêtes - ---- - -## Dépannage - -| Problème | Cause Probable | Solution | -|----------|----------------|----------| -| `get_status()` returns `None` | Tracking not attached, or wrong `pipeline_id` | Ensure `pipeline.with_tracking(tracking)` called before `kiq()` | -| Storage errors | Redis connection failed | Check Redis is running, connection string valid | -| Memory growth (memory store) | No old record cleanup | Set `max_records` or use Redis with TTL | -| Listeners not firing | Not added before pipeline start | Call `tracking.add_listener()` before `pipeline.kiq()` | - ---- - -*Combiner avec [WebSocket]({{ '/fr/api/websocket/' | relative_url }}) pour streaming temps réel. Voir [Guide de Suivi]({{ '/fr/guides/tracking/' | relative_url }}) pour patterns d'utilisation.* + +--- + +## Meilleures Pratiques + +1. **Production** : Toujours utiliser stockage Redis (partagé, persistant) +2. **TTL** : Définir TTL approprié (7–30 jours) pour limiter croissance stockage +3. **Écouteurs** : Ajouter écouteurs d'alerte pour échecs +4. **Nettoyage** : Planifier nettoyage périodique (cron quotidien) +5. 
**Indexation** : Pour stores DB personnalisés, indexer sur `pipeline_id`, `started_at` pour performance requêtes + +--- + +## Dépannage + +| Problème | Cause Probable | Solution | +|----------|----------------|----------| +| `get_status()` returns `None` | Tracking not attached, or wrong `pipeline_id` | Ensure `pipeline.with_tracking(tracking)` called before `kiq()` | +| Storage errors | Redis connection failed | Check Redis is running, connection string valid | +| Memory growth (memory store) | No old record cleanup | Set `max_records` or use Redis with TTL | +| Listeners not firing | Not added before pipeline start | Call `tracking.add_listener()` before `pipeline.kiq()` | + +--- + +*Combiner avec [WebSocket]({{ '/fr/api/websocket/' | relative_url }}) pour streaming temps réel. Voir [Guide de Suivi]({{ '/fr/guides/tracking/' | relative_url }}) pour patterns d'utilisation.* diff --git a/docs/_fr/api/websocket.md b/docs/_fr/api/websocket.md index 823441e..85de779 100644 --- a/docs/_fr/api/websocket.md +++ b/docs/_fr/api/websocket.md @@ -1,6 +1,6 @@ --- permalink: /fr/api/websocket/ -title: Référence API: Intégration WebSocket +title: 'Référence API: Intégration WebSocket' nav_order: 34 color_scheme: dark --- diff --git a/docs/_fr/examples/api-example.md b/docs/_fr/examples/api-example.md index 58235b0..861b6d4 100644 --- a/docs/_fr/examples/api-example.md +++ b/docs/_fr/examples/api-example.md @@ -1,369 +1,369 @@ ---- -permalink: /fr/examples/api-example/ -title: Exemple: api_example.py -nav_order: 47 -color_scheme: dark ---- -# Exemple: api_example.md - -**Intégration FastAPI pour gestion distante de pipelines** - +--- +permalink: /fr/examples/api-example/ +title: 'Exemple: api_example.py' +nav_order: 47 +color_scheme: dark +--- +# Exemple: api_example.md + +**Intégration FastAPI pour gestion distante de pipelines** + > **Version** : {VERSION} | **Fichier** : `examples/api_example.py` - ---- - -## Aperçu - -Cet exemple exhaustif démontre comment construire une API 
REST production-ready pour Taskiq-Flow en utilisant FastAPI. Il couvre: - -- Configuration FastAPI avec endpoints visualization pipeline -- Enregistrer pipelines programmatiquement -- Ajouter endpoints personnalisés pour exécution distante pipeline -- Récupérer résultats pipeline via API -- Documentation OpenAPI/Swagger complète - -**Prérequis**: Installer FastAPI et uvicorn: -```bash -pip install fastapi uvicorn[standard] -``` - ---- - -## Ce Que Cet Exemple Montre - -- Utilisation de `PipelineVisualizationAPI` pour endpoints built-in -- Enregistrement pipelines avec l'API -- Création endpoints personnalisés pour exécution distante pipeline -- Récupération résultats par task ID -- Structure API complète production - ---- - -## Explication du Code - -### 1. Définir Tâches et Pipeline - -Le pipeline suivant illustre un cas d'usage commun de recommandation: - -```python -from fastapi import FastAPI, HTTPException -from taskiq import InMemoryBroker -from taskiq_flow import DataflowPipeline, pipeline_task - -broker = InMemoryBroker(await_inplace=True) - -@broker.task -@pipeline_task(output="user_data") -async def fetch_user_data(user_id: int) -> dict: - """Récupérer données utilisateur depuis base.""" - await asyncio.sleep(0.1) - return {"id": user_id, "name": f"User{user_id}", "email": f"user{user_id}@example.com"} - -@broker.task -@pipeline_task(output="order_history") -async def fetch_orders(user_data: dict) -> list: - """Récupérer historique commandes utilisateur.""" - await asyncio.sleep(0.2) - user_id = user_data["id"] - return [{"order_id": 100 + user_id, "total": 99.99}] - -@broker.task -@pipeline_task(output="recommendations") -async def generate_recommendations(user_data: dict, order_history: list): - """Générer recommandations.""" - await asyncio.sleep(0.15) - return ["product_A", "product_B", "product_C"] - -# Construire pipeline -sample_pipeline = DataflowPipeline.from_tasks( - broker, - [fetch_user_data, fetch_orders, generate_recommendations], -) 
-sample_pipeline.pipeline_id = "sample_recommendation_pipeline" -``` - -**Structure du DAG:** - -```mermaid -flowchart TD - A[fetch_user_data
output: user_data] --> B[fetch_orders
output: order_history] - A --> C[generate_recommendations
output: recommendations] - B --> C -``` - -### 2. Créer App FastAPI avec Visualization API - -```python -from taskiq_flow.api import create_visualization_api, PipelineVisualizationAPI - -def create_app() -> FastAPI: - app = FastAPI(title="TaskIQ Flow API", version="1.0.0") - - # Créer visualization API (monte automatiquement endpoints /pipelines) - viz_api = create_visualization_api(broker, app) - viz_api.add_pipeline("sample_recommendation_pipeline", sample_pipeline) - - # Ajouter endpoints personnalisés ci-dessous... - return app -``` - -`create_visualization_api()` ajoute automatiquement endpoints: -- `GET /pipelines` — Lister tous les pipelines enregistrés -- `GET /pipelines/{pipeline_id}` — Obtenir pipeline par ID -- `GET /pipelines/{pipeline_id}/dag` — DAG JSON -- `GET /pipelines/{pipeline_id}/dag/dot` — DAG format DOT -- `GET /pipelines/{pipeline_id}/visualize` — Métadonnées complètes - -### 3. Ajouter Endpoint Exécution Personnalisé - -```python -@app.post("/pipelines/{pipeline_id}/execute") -async def execute_pipeline( - pipeline_id: str, - parameters: dict[str, Any], -) -> dict[str, Any]: - """Exécute un pipeline avec paramètres donnés.""" - if pipeline_id not in viz_api.pipelines: - raise HTTPException(status_code=404, detail=f"Pipeline {pipeline_id} non trouvé") - - pipeline = viz_api.pipelines[pipeline_id] - try: - result = await pipeline.kiq_dataflow(**parameters) - return { - "status": "executed", - "pipeline_id": pipeline_id, - "task_id": result.task_id, - "message": "Pipeline execution started. Use /result/{task_id} pour vérifier statut.", - } - except Exception as e: - raise HTTPException(status_code=500, detail=str(e)) from e -``` - -### 4. 
Ajouter Endpoint Récupération Résultat - -```python -@app.get("/pipelines/result/{task_id}") -async def get_result(task_id: str) -> dict[str, Any]: - """Récupère le résultat d'une exécution de pipeline.""" - try: - result = await broker.result_backend.get_result(task_id) - if result is None: - raise HTTPException(status_code=404, detail=f"Aucun résultat trouvé pour task_id {task_id}") - return {"task_id": task_id, "result": result.return_value} - except Exception as e: - raise HTTPException(status_code=500, detail=str(e)) from e -``` - -### 5. Lancer le Serveur - -```bash -uvicorn examples.api_example:create_app --reload --port 8000 -``` - -Ou programmatiquement: - -```python -if __name__ == "__main__": - import uvicorn - uvicorn.run(create_app(), host="0.0.0.0", port=8000) -``` - ---- - -## Référence des Endpoints API - -### Built-in (depuis `create_visualization_api`) - -| Méthode | Endpoint | Description | -|---------|----------|-------------| -| GET | `/health` | Health check | -| GET | `/pipelines` | Lister pipelines enregistrés | -| POST | `/pipelines/{pipeline_id}` | Enregistrer nouveau pipeline | -| GET | `/pipelines/{pipeline_id}/status` | Obtenir statut exécution courant | -| GET | `/pipelines/{pipeline_id}/dag` | Obtenir DAG en JSON | -| GET | `/pipelines/{pipeline_id}/dag/dot` | Obtenir DAG en format DOT | -| GET | `/pipelines/{pipeline_id}/visualize` | Métadonnées complètes visualization | - -### Personnalisés (définis dans exemple) - -| Méthode | Endpoint | Description | -|---------|----------|-------------| -| POST | `/pipelines/{pipeline_id}/execute` | Exécuter pipeline avec paramètres | -| GET | `/pipelines/result/{task_id}` | Obtenir résultat par task ID | - ---- - -## Tester l'API - -### 1. Docs Interactives -Ouvrir http://localhost:8000/docs pour Swagger UI. - -### 2. 
Exécuter Pipeline - -```bash -curl -X POST "http://localhost:8000/pipelines/sample_recommendation_pipeline/execute" \ - -H "Content-Type: application/json" \ - -d '{"user_id": 123}' -``` - -Réponse: -```json -{ - "status": "executed", - "pipeline_id": "sample_recommendation_pipeline", - "task_id": "abc123def456", - "message": "Pipeline execution started..." -} -``` - -### 3. Sonder pour Résultat - -```bash -curl "http://localhost:8000/pipelines/result/abc123def456" -``` - -Réponse: -```json -{ - "task_id": "abc123def456", - "result": { - "user_data": {"id": 123, "name": "User123", ...}, - "order_history": [...], - "recommendations": ["product_A", "product_B", "product_C"] - } -} -``` - -### 4. Voir DAG - -```bash -curl "http://localhost:8000/pipelines/sample_recommendation_pipeline/dag" -``` - -Retourne structure JSON du graphe pipeline. - ---- - -## Utilisation API Programmatique - -Vous pouvez aussi utiliser classes API directement sans HTTP: - -```python -from taskiq_flow.api import PipelineVisualizationAPI - -app = FastAPI() -viz_api = PipelineVisualizationAPI(broker, app) - -# Register pipeline -viz_api.add_pipeline("my_pipe", my_pipeline) - -# List registered pipelines -for pid, p in viz_api.pipelines.items(): - print(f"Pipeline: {pid}, tasks: {len(p.visualize()['nodes'])}") - -# Get visualization -dag_json = my_pipeline.visualize() -dot = my_pipeline.visualize_dot() -``` - -Utile pour construire backends dashboard personnalisés ou outils CLI. - ---- - -## Considérations Production - -### 1. Utiliser Broker Persistant -```python -from taskiq import RedisStreamBroker -broker = RedisStreamBroker(redis_url="redis://localhost:6379") -``` - -### 2. 
Ajouter Authentication -```python -from fastapi import Depends, Security -from fastapi.security import APIKeyHeader - -api_key_header = APIKeyHeader(name="X-API-Key") - -async def verify_api_key(api_key: str = Security(api_key_header)): - if api_key != os.getenv("API_SECRET"): - raise HTTPException(status_code=403, detail="Clé API invalide") - return api_key - -@app.post("/pipelines/{pipeline_id}/execute") -async def execute(..., api_key: str = Security(verify_api_key)): - # ... -``` - -### 2b. Ajouter Authentification JWT -```python -from jose import jwt -from fastapi import Depends - -async def get_current_user(token: str = Depends(oauth2_scheme)): - payload = jwt.decode(token, SECRET_KEY, algorithms=["HS256"]) - return payload["sub"] - -@app.post("/pipelines/{pipeline_id}/execute") -async def execute(..., user: str = Depends(get_current_user)): - logger.info(f"Utilisateur {user} a exécuté {pipeline_id}") - # ... -``` - -### 2c. Ajouter Autorisation au Niveau Pipeline -```python -from taskiq_flow.security.authorization import PipelineAuthorization - -authorization = PipelineAuthorization(rules={ - "admin": {"read": ["*"], "write": ["*"]}, - "viewer": {"read": ["audio_*"], "write": []}, -}) - -async def check_pipeline_access( - pipeline_id: str = Path(...), - user: dict = Depends(get_current_user), -): - if not authorization.can_read(pipeline_id, user): - raise HTTPException(status_code=403, detail="Accès refusé") - return user -``` - -### 3. Ajouter Rate Limiting -```python -from slowapi import Limiter -limiter = Limiter(key_func=get_remote_address) -@app.post("/pipelines/{pipeline_id}/execute") -@limiter.limit("10/minute") -async def execute(...): - # ... -``` - -### 4. Activer CORS pour Frontend Web -```python -from fastapi.middleware.cors import CORSMiddleware -app.add_middleware( - CORSMiddleware, - allow_origins=["https://votre-dashboard.com"], - allow_methods=["*"], - allow_headers=["*"], -) -``` - -### 5. 
Déployer avec Gunicorn -```bash -gunicorn -k uvicorn.workers.UvicornWorker -w 4 main:app --bind 0.0.0.0:8000 -``` - ---- - -## Chemin d'Apprentissage - -Après cet exemple: - -1. **[Guide API]({{ '/fr/guides/api/' | relative_url }})** — Documentation complète endpoints REST et meilleures pratiques -2. **[Guide WebSocket]({{ '/fr/guides/websocket/' | relative_url }})** — Ajouter mises à jour temps réel à votre API -3. **[Guide de Suivi]({{ '/fr/guides/tracking/' | relative_url }})** — Stocker historique exécution pour analytics - ---- - -*Cet exemple fournit fondation API complète production-ready. Étendez-le avec authentication, rate limiting, et endpoints personnalisés pour votre cas d'usage spécifique.* + +--- + +## Aperçu + +Cet exemple exhaustif démontre comment construire une API REST production-ready pour Taskiq-Flow en utilisant FastAPI. Il couvre: + +- Configuration FastAPI avec endpoints visualization pipeline +- Enregistrer pipelines programmatiquement +- Ajouter endpoints personnalisés pour exécution distante pipeline +- Récupérer résultats pipeline via API +- Documentation OpenAPI/Swagger complète + +**Prérequis**: Installer FastAPI et uvicorn: +```bash +pip install fastapi uvicorn[standard] +``` + +--- + +## Ce Que Cet Exemple Montre + +- Utilisation de `PipelineVisualizationAPI` pour endpoints built-in +- Enregistrement pipelines avec l'API +- Création endpoints personnalisés pour exécution distante pipeline +- Récupération résultats par task ID +- Structure API complète production + +--- + +## Explication du Code + +### 1. 
Définir Tâches et Pipeline + +Le pipeline suivant illustre un cas d'usage commun de recommandation: + +```python +from fastapi import FastAPI, HTTPException +from taskiq import InMemoryBroker +from taskiq_flow import DataflowPipeline, pipeline_task + +broker = InMemoryBroker(await_inplace=True) + +@broker.task +@pipeline_task(output="user_data") +async def fetch_user_data(user_id: int) -> dict: + """Récupérer données utilisateur depuis base.""" + await asyncio.sleep(0.1) + return {"id": user_id, "name": f"User{user_id}", "email": f"user{user_id}@example.com"} + +@broker.task +@pipeline_task(output="order_history") +async def fetch_orders(user_data: dict) -> list: + """Récupérer historique commandes utilisateur.""" + await asyncio.sleep(0.2) + user_id = user_data["id"] + return [{"order_id": 100 + user_id, "total": 99.99}] + +@broker.task +@pipeline_task(output="recommendations") +async def generate_recommendations(user_data: dict, order_history: list): + """Générer recommandations.""" + await asyncio.sleep(0.15) + return ["product_A", "product_B", "product_C"] + +# Construire pipeline +sample_pipeline = DataflowPipeline.from_tasks( + broker, + [fetch_user_data, fetch_orders, generate_recommendations], +) +sample_pipeline.pipeline_id = "sample_recommendation_pipeline" +``` + +**Structure du DAG:** + +```mermaid +flowchart TD + A[fetch_user_data
output: user_data] --> B[fetch_orders
output: order_history] + A --> C[generate_recommendations
output: recommendations] + B --> C +``` + +### 2. Créer App FastAPI avec Visualization API + +```python +from taskiq_flow.api import create_visualization_api, PipelineVisualizationAPI + +def create_app() -> FastAPI: + app = FastAPI(title="TaskIQ Flow API", version="1.0.0") + + # Créer visualization API (monte automatiquement endpoints /pipelines) + viz_api = create_visualization_api(broker, app) + viz_api.add_pipeline("sample_recommendation_pipeline", sample_pipeline) + + # Ajouter endpoints personnalisés ci-dessous... + return app +``` + +`create_visualization_api()` ajoute automatiquement endpoints: +- `GET /pipelines` — Lister tous les pipelines enregistrés +- `GET /pipelines/{pipeline_id}` — Obtenir pipeline par ID +- `GET /pipelines/{pipeline_id}/dag` — DAG JSON +- `GET /pipelines/{pipeline_id}/dag/dot` — DAG format DOT +- `GET /pipelines/{pipeline_id}/visualize` — Métadonnées complètes + +### 3. Ajouter Endpoint Exécution Personnalisé + +```python +@app.post("/pipelines/{pipeline_id}/execute") +async def execute_pipeline( + pipeline_id: str, + parameters: dict[str, Any], +) -> dict[str, Any]: + """Exécute un pipeline avec paramètres donnés.""" + if pipeline_id not in viz_api.pipelines: + raise HTTPException(status_code=404, detail=f"Pipeline {pipeline_id} non trouvé") + + pipeline = viz_api.pipelines[pipeline_id] + try: + result = await pipeline.kiq_dataflow(**parameters) + return { + "status": "executed", + "pipeline_id": pipeline_id, + "task_id": result.task_id, + "message": "Pipeline execution started. Use /result/{task_id} pour vérifier statut.", + } + except Exception as e: + raise HTTPException(status_code=500, detail=str(e)) from e +``` + +### 4. 
Ajouter Endpoint Récupération Résultat + +```python +@app.get("/pipelines/result/{task_id}") +async def get_result(task_id: str) -> dict[str, Any]: + """Récupère le résultat d'une exécution de pipeline.""" + try: + result = await broker.result_backend.get_result(task_id) + if result is None: + raise HTTPException(status_code=404, detail=f"Aucun résultat trouvé pour task_id {task_id}") + return {"task_id": task_id, "result": result.return_value} + except Exception as e: + raise HTTPException(status_code=500, detail=str(e)) from e +``` + +### 5. Lancer le Serveur + +```bash +uvicorn examples.api_example:create_app --reload --port 8000 +``` + +Ou programmatiquement: + +```python +if __name__ == "__main__": + import uvicorn + uvicorn.run(create_app(), host="0.0.0.0", port=8000) +``` + +--- + +## Référence des Endpoints API + +### Built-in (depuis `create_visualization_api`) + +| Méthode | Endpoint | Description | +|---------|----------|-------------| +| GET | `/health` | Health check | +| GET | `/pipelines` | Lister pipelines enregistrés | +| POST | `/pipelines/{pipeline_id}` | Enregistrer nouveau pipeline | +| GET | `/pipelines/{pipeline_id}/status` | Obtenir statut exécution courant | +| GET | `/pipelines/{pipeline_id}/dag` | Obtenir DAG en JSON | +| GET | `/pipelines/{pipeline_id}/dag/dot` | Obtenir DAG en format DOT | +| GET | `/pipelines/{pipeline_id}/visualize` | Métadonnées complètes visualization | + +### Personnalisés (définis dans exemple) + +| Méthode | Endpoint | Description | +|---------|----------|-------------| +| POST | `/pipelines/{pipeline_id}/execute` | Exécuter pipeline avec paramètres | +| GET | `/pipelines/result/{task_id}` | Obtenir résultat par task ID | + +--- + +## Tester l'API + +### 1. Docs Interactives +Ouvrir http://localhost:8000/docs pour Swagger UI. + +### 2. 
Exécuter Pipeline + +```bash +curl -X POST "http://localhost:8000/pipelines/sample_recommendation_pipeline/execute" \ + -H "Content-Type: application/json" \ + -d '{"user_id": 123}' +``` + +Réponse: +```json +{ + "status": "executed", + "pipeline_id": "sample_recommendation_pipeline", + "task_id": "abc123def456", + "message": "Pipeline execution started..." +} +``` + +### 3. Sonder pour Résultat + +```bash +curl "http://localhost:8000/pipelines/result/abc123def456" +``` + +Réponse: +```json +{ + "task_id": "abc123def456", + "result": { + "user_data": {"id": 123, "name": "User123", ...}, + "order_history": [...], + "recommendations": ["product_A", "product_B", "product_C"] + } +} +``` + +### 4. Voir DAG + +```bash +curl "http://localhost:8000/pipelines/sample_recommendation_pipeline/dag" +``` + +Retourne structure JSON du graphe pipeline. + +--- + +## Utilisation API Programmatique + +Vous pouvez aussi utiliser classes API directement sans HTTP: + +```python +from taskiq_flow.api import PipelineVisualizationAPI + +app = FastAPI() +viz_api = PipelineVisualizationAPI(broker, app) + +# Register pipeline +viz_api.add_pipeline("my_pipe", my_pipeline) + +# List registered pipelines +for pid, p in viz_api.pipelines.items(): + print(f"Pipeline: {pid}, tasks: {len(p.visualize()['nodes'])}") + +# Get visualization +dag_json = my_pipeline.visualize() +dot = my_pipeline.visualize_dot() +``` + +Utile pour construire backends dashboard personnalisés ou outils CLI. + +--- + +## Considérations Production + +### 1. Utiliser Broker Persistant +```python +from taskiq import RedisStreamBroker +broker = RedisStreamBroker(redis_url="redis://localhost:6379") +``` + +### 2. 
Ajouter Authentication +```python +from fastapi import Depends, Security +from fastapi.security import APIKeyHeader + +api_key_header = APIKeyHeader(name="X-API-Key") + +async def verify_api_key(api_key: str = Security(api_key_header)): + if api_key != os.getenv("API_SECRET"): + raise HTTPException(status_code=403, detail="Clé API invalide") + return api_key + +@app.post("/pipelines/{pipeline_id}/execute") +async def execute(..., api_key: str = Security(verify_api_key)): + # ... +``` + +### 2b. Ajouter Authentification JWT +```python +from jose import jwt +from fastapi import Depends + +async def get_current_user(token: str = Depends(oauth2_scheme)): + payload = jwt.decode(token, SECRET_KEY, algorithms=["HS256"]) + return payload["sub"] + +@app.post("/pipelines/{pipeline_id}/execute") +async def execute(..., user: str = Depends(get_current_user)): + logger.info(f"Utilisateur {user} a exécuté {pipeline_id}") + # ... +``` + +### 2c. Ajouter Autorisation au Niveau Pipeline +```python +from taskiq_flow.security.authorization import PipelineAuthorization + +authorization = PipelineAuthorization(rules={ + "admin": {"read": ["*"], "write": ["*"]}, + "viewer": {"read": ["audio_*"], "write": []}, +}) + +async def check_pipeline_access( + pipeline_id: str = Path(...), + user: dict = Depends(get_current_user), +): + if not authorization.can_read(pipeline_id, user): + raise HTTPException(status_code=403, detail="Accès refusé") + return user +``` + +### 3. Ajouter Rate Limiting +```python +from slowapi import Limiter +limiter = Limiter(key_func=get_remote_address) +@app.post("/pipelines/{pipeline_id}/execute") +@limiter.limit("10/minute") +async def execute(...): + # ... +``` + +### 4. Activer CORS pour Frontend Web +```python +from fastapi.middleware.cors import CORSMiddleware +app.add_middleware( + CORSMiddleware, + allow_origins=["https://votre-dashboard.com"], + allow_methods=["*"], + allow_headers=["*"], +) +``` + +### 5. 
Déployer avec Gunicorn +```bash +gunicorn -k uvicorn.workers.UvicornWorker -w 4 main:app --bind 0.0.0.0:8000 +``` + +--- + +## Chemin d'Apprentissage + +Après cet exemple: + +1. **[Guide API]({{ '/fr/guides/api/' | relative_url }})** — Documentation complète endpoints REST et meilleures pratiques +2. **[Guide WebSocket]({{ '/fr/guides/websocket/' | relative_url }})** — Ajouter mises à jour temps réel à votre API +3. **[Guide de Suivi]({{ '/fr/guides/tracking/' | relative_url }})** — Stocker historique exécution pour analytics + +--- + +*Cet exemple fournit fondation API complète production-ready. Étendez-le avec authentication, rate limiting, et endpoints personnalisés pour votre cas d'usage spécifique.* diff --git a/docs/_fr/examples/dag_visualization_demo.md b/docs/_fr/examples/dag_visualization_demo.md index 5307bc1..c260289 100644 --- a/docs/_fr/examples/dag_visualization_demo.md +++ b/docs/_fr/examples/dag_visualization_demo.md @@ -1,6 +1,6 @@ --- permalink: /fr/examples/dag-visualization-demo/ -title: Exemple: dag_visualization_demo.py +title: 'Exemple: dag_visualization_demo.py' nav_order: 47 color_scheme: dark --- diff --git a/docs/_fr/examples/dataflow-audio-pipeline.md b/docs/_fr/examples/dataflow-audio-pipeline.md index 4324d13..f81b2d3 100644 --- a/docs/_fr/examples/dataflow-audio-pipeline.md +++ b/docs/_fr/examples/dataflow-audio-pipeline.md @@ -1,269 +1,279 @@ ---- -permalink: /fr/examples/dataflow-audio-pipeline/ -title: Exemple: dataflow_audio_pipeline.py -nav_order: 42 -color_scheme: dark ---- -# Exemple: dataflow_audio_pipeline.py - -**DAG dataflow complet avec exécution parallèle, map-reduce et visualisation** - -> **Version** : {VERSION} | **Fichier** : `examples/dataflow_audio_pipeline.py` - ---- - -## Aperçu - -Cet exemple exhaustif démontre toute la puissance de DataflowPipeline avec: - -- Construction automatique de DAG depuis dépendances tâches -- Exécution parallèle de tâches indépendantes -- Motif map-reduce pour traitement par lots -- 
Visualisation de pipeline (DOT, JSON, ASCII) -- Workflows mixtes séquentiels et parallèles - -C'est l'exemple de référence pour comprendre l'architecture dataflow. - ---- - -## Ce Que Cet Exemple Montre - -- Utilisation du décorateur **`@pipeline_task`** avec sorties simples et multiples -- **Résolution automatique de dépendances** — les tâches déclarent leurs sorties; tâches en aval consomment par nom de paramètre -- **Exécution parallèle** — tâches avec même dépendance s'exécutent concurremment -- **Motif map-reduce** — traitement batch avec `.map()` et `.reduce()` -- **Visualisation DAG** — affichage ASCII, export DOT, JSON - ---- - -## Explication du Code - -### Définition des Tâches - -```python -from taskiq import InMemoryBroker -from taskiq_flow import DataflowPipeline, pipeline_task - -broker = InMemoryBroker(await_inplace=True) - -# Tâche 1: Extraire caractéristiques audio (aucune dépendance) -@broker.task -@pipeline_task(output="audio_features") -async def extract_audio_features(track_paths: list[str]) -> dict: - features = {...} - return features - -# Tâche 2: Calculer features MIR (dépend de audio_features) -@broker.task -@pipeline_task(output="mir_features") -async def compute_mir_features(audio_features: dict) -> dict: - # Reçoit audio_features automatiquement - return {...} - -# Tâche 3: Générer tags (dépend de mir_features) -@broker.task -@pipeline_task(output="tags") -async def generate_tags(mir_features: dict) -> list[str]: - return ["electronic", "dance"] - -# Tâche 4: Créer embedding (dépend de mir_features ET tags) -@broker.task -@pipeline_task(output="vector") -async def create_embedding(mir_features: dict, tags: list[str]) -> list[float]: - # Reçoit les deux entrées automatiquement - return [0.1, 0.5, 0.8] -``` - -Le pipeline construit automatiquement ce DAG: - -```mermaid -flowchart TD - A[extract_audio_features] --> B[compute_mir_features] - A --> C[generate_tags] - B --> D[create_embedding] - C --> D -``` - +--- +permalink: 
/fr/examples/dataflow-audio-pipeline/ +title: 'Exemple: dataflow_audio_pipeline.py' +nav_order: 42 +color_scheme: dark +--- +# Exemple: dataflow_audio_pipeline.py + +**DAG dataflow complet avec exécution parallèle, map-reduce et visualisation** + +> **Version** : {VERSION} | **Fichier** : `examples/dataflow_audio_pipeline.py` + +--- + +## Aperçu + +Cet exemple exhaustif démontre toute la puissance de DataflowPipeline avec: + +- Construction automatique de DAG depuis dépendances tâches +- Exécution parallèle de tâches indépendantes +- Motif map-reduce pour traitement par lots +- Visualisation de pipeline (DOT, JSON, ASCII) +- Workflows mixtes séquentiels et parallèles + +C'est l'exemple de référence pour comprendre l'architecture dataflow. + +--- + +## Ce Que Cet Exemple Montre + +- Utilisation du décorateur **`@pipeline_task`** avec sorties simples et multiples +- **Résolution automatique de dépendances** — les tâches déclarent leurs sorties; tâches en aval consomment par nom de paramètre +- **Exécution parallèle** — tâches avec même dépendance s'exécutent concurremment +- **Motif map-reduce** — traitement batch avec `.map()` et `.reduce()` +- **Visualisation DAG** — affichage ASCII, export DOT, JSON + +--- + +## Explication du Code + +### Définition des Tâches + +{% raw %} +```python +from taskiq import InMemoryBroker +from taskiq_flow import DataflowPipeline, pipeline_task + +broker = InMemoryBroker(await_inplace=True) + +# Tâche 1: Extraire caractéristiques audio (aucune dépendance) +@broker.task +@pipeline_task(output="audio_features") +async def extract_audio_features(track_paths: list[str]) -> dict: + features = {...} + return features + +# Tâche 2: Calculer features MIR (dépend de audio_features) +@broker.task +@pipeline_task(output="mir_features") +async def compute_mir_features(audio_features: dict) -> dict: + # Reçoit audio_features automatiquement + return {...} + +# Tâche 3: Générer tags (dépend de mir_features) +@broker.task 
+@pipeline_task(output="tags") +async def generate_tags(mir_features: dict) -> list[str]: + return ["electronic", "dance"] + +# Tâche 4: Créer embedding (dépend de mir_features ET tags) +@broker.task +@pipeline_task(output="vector") +async def create_embedding(mir_features: dict, tags: list[str]) -> list[float]: + # Reçoit les deux entrées automatiquement + return [0.1, 0.5, 0.8] +``` +{% endraw %} +Le pipeline construit automatiquement ce DAG: + +{% raw %} +```mermaid +flowchart TD + A[extract_audio_features] --> B[compute_mir_features] + A --> C[generate_tags] + B --> D[create_embedding] + C --> D +``` +{% endraw %} **Note**: `create_embedding` dépend à la fois de `mir_features` (sortie de `compute_mir_features`) et `tags` (sortie de `generate_tags`), donc il s'exécute après que les deux tâches parallèles sont terminées. -``` - ---- - -## Exemple 1: Pipeline Séquentiel avec Dépendances Automatiques - -```python -async def example_sequential_pipeline(): - pipeline = DataflowPipeline.from_tasks( - broker, - [ - extract_audio_features, - compute_mir_features, - generate_tags, - create_embedding, - ], - ) - - pipeline.print_dag() - # Sortie: - # Ordre Exécution DAG: - # Niveau 0 (parallèle): extract_audio_features - # Niveau 1 (parallèle): compute_mir_features - # Niveau 2 (parallèle): generate_tags, create_embedding - # Sorties finales: audio_features, mir_features, tags, vector - - résultats = await pipeline.kiq_dataflow(track_paths=["track1.mp3"]) - # résultats = { - # "audio_features": {...}, - # "mir_features": {...}, - # "tags": [...], - # "vector": [...] - # } -``` - -**Résolution dépendances**: -1. `extract_audio_features` aucune dépendance → exécute en premier -2. `compute_mir_features` besoin `audio_features` → exécute après étape 1 -3. `generate_tags` besoin `mir_features` → exécute après étape 2 -4. 
`create_embedding` besoin `mir_features` et `tags` → exécute après étapes 2 & 3 complétées - ---- - -## Exemple 2: Exécution Parallèle - -Avec ajout de `extract_spectral_features` qui dépend aussi seulement de `audio_features`: - -```python -@broker.task -@pipeline_task(output="spectral_features") -async def extract_spectral_features(audio_features: dict) -> dict: - await asyncio.sleep(0.2) - return {"spectral_rolloff": 5000.0} - -@broker.task -@pipeline_task(output="combined_features") -async def combine_features( - mir_features: dict, - spectral_features: dict, - tags: list[str], -) -> dict: - return {**mir_features, **spectral_features, "tags": tags} - -pipeline = DataflowPipeline.from_tasks( - broker, - [ - extract_audio_features, - compute_mir_features, # Niveau 1 - extract_spectral_features, # Niveau 1 (parallèle à compute_mir_features) - generate_tags, # Niveau 2 (dépend de mir_features) - combine_features, # Niveau 2 (dépend de mir_features + spectral_features + tags) - ], -) -``` - -**Niveaux d'exécution**: -- Niveau 0: `extract_audio_features` -- Niveau 1: `compute_mir_features`, `extract_spectral_features` (parallèle) -- Niveau 2: `generate_tags`, `combine_features` (parallèle après leurs dépendances satisfaites) - ---- - -## Exemple 3: Motif Map-Reduce - -Traiter multiples pistes en parallèle, puis agréger: - -```python -# Map: traiter chaque piste indépendamment -@broker.task -@pipeline_task(output="track_features") -async def process_single_track(track: str) -> dict: - return {"track": track, "duration": 180.0, "bpm": 120} - -# Reduce: agréger toutes features de pistes -@broker.task -@pipeline_task(output="playlist_stats") -async def aggregate_track_features(track_features: list[dict]) -> dict: - total_duration = sum(t["duration"] for t in track_features) - avg_bpm = sum(t["bpm"] for t in track_features) / len(track_features) - return {"total_tracks": len(track_features), "total_duration": total_duration, "avg_bpm": avg_bpm} - -# Construire pipeline 
-pipeline = DataflowPipeline(broker) -pipeline.map( - process_single_track, - tracks, # ["track1.mp3", "track2.mp3", ...] - output="track_features", - max_parallel=4, -) -pipeline.reduce( - aggregate_track_features, - input_name="track_features", - output="playlist_stats", -) - -résultats = await pipeline.kiq_map_reduce() -# résultats = {"track_features": [...], "playlist_stats": {...}} -``` - ---- - -## Exemple 4: Visualisation - -Le pipeline fournit multiples formats de visualisation: - -```python -# ASCII art (console) -pipeline.print_dag() - -# JSON (for web UIs) -viz_json = pipeline.visualize() -# Structure: -# { -# "nodes": [{"id": "task_name", "outputs": [...], "inputs": [...]}, ...], -# "edges": [{"from": "task_a", "to": "task_b"}], -# "levels": [["task1"], ["task2", "task3"], ...] -# } - -# DOT format (for Graphviz) -dot = pipeline.visualize_dot() -# Save and render: -# with open("pipeline.dot", "w") as f: -# f.write(dot) -# Run: dot -Tpng pipeline.dot -o pipeline.png -``` - ---- - -## Exécuter l'Exemple - -```bash -python examples/dataflow_audio_pipeline.py -``` - -Sortie attendue inclut: -- Affichages DAG ASCII montrant ordre exécution -- Représentation DOT DAG extrait -- Structure JSON DAG extrait - ---- - -## Points Clés à Retenir - -1. **Résolution automatique de dépendances** — Pas besoin d'enchaîner manuellement; juste déclarer sorties -2. **Exécution parallèle** — Tâches indépendantes s'exécutent concurremment automatiquement -3. **Programmation dataflow** — Tâches sont fonctions pures; sortie va vers entrées -4. **Débogage visuel** — `print_dag()` montre exactement comment tâches s'exécuteront -5. **Motifs évolutifs** — Map-reduce intégré pour charges batch - ---- - -## Chemin d'Apprentissage - -Après cet exemple: - -1. **[Guide DataflowPipeline]({{ '/fr/guides/pipelines.md#2-pipeline-dataflow' | relative_url }})** — Plongée profonde fonctionnalités dataflow -2. 
**[Guide d'Exécution]({{ '/fr/guides/execution/' | relative_url }})** — Parallélisme, timeouts, gestion erreurs -3. **[Guide de Performance]({{ '/fr/guides/performance/' | relative_url }})** — Réglage `max_parallel`, profils ressources - ---- - -*C'est l'exemple flagship. Étudiez-le thoroughly pour comprendre modèle dataflow Taskiq-Flow.* +{% raw %} +``` + +--- + +## Exemple 1: Pipeline Séquentiel avec Dépendances Automatiques + +```python +{% endraw %} + pipeline = DataflowPipeline.from_tasks( + broker, + [ + extract_audio_features, + compute_mir_features, + generate_tags, + create_embedding, + ], + ) + + pipeline.print_dag() + # Sortie: + # Ordre Exécution DAG: + # Niveau 0 (parallèle): extract_audio_features + # Niveau 1 (parallèle): compute_mir_features + # Niveau 2 (parallèle): generate_tags, create_embedding + # Sorties finales: audio_features, mir_features, tags, vector + + résultats = await pipeline.kiq_dataflow(track_paths=["track1.mp3"]) + # résultats = { + # "audio_features": {...}, + # "mir_features": {...}, + # "tags": [...], + # "vector": [...] + # } +{% raw %} +``` + +**Résolution dépendances**: +1. `extract_audio_features` aucune dépendance → exécute en premier +2. `compute_mir_features` besoin `audio_features` → exécute après étape 1 +3. `generate_tags` besoin `mir_features` → exécute après étape 2 +4. 
`create_embedding` besoin `mir_features` et `tags` → exécute après étapes 2 & 3 complétées + +--- + +## Exemple 2: Exécution Parallèle + +Avec ajout de `extract_spectral_features` qui dépend aussi seulement de `audio_features`: + +```python +{% endraw %} +@pipeline_task(output="spectral_features") +async def extract_spectral_features(audio_features: dict) -> dict: + await asyncio.sleep(0.2) + return {"spectral_rolloff": 5000.0} + +@broker.task +@pipeline_task(output="combined_features") +async def combine_features( + mir_features: dict, + spectral_features: dict, + tags: list[str], +) -> dict: + return {**mir_features, **spectral_features, "tags": tags} + +pipeline = DataflowPipeline.from_tasks( + broker, + [ + extract_audio_features, + compute_mir_features, # Niveau 1 + extract_spectral_features, # Niveau 1 (parallèle à compute_mir_features) + generate_tags, # Niveau 2 (dépend de mir_features) + combine_features, # Niveau 2 (dépend de mir_features + spectral_features + tags) + ], +) +{% raw %} +``` + +**Niveaux d'exécution**: +- Niveau 0: `extract_audio_features` +- Niveau 1: `compute_mir_features`, `extract_spectral_features` (parallèle) +- Niveau 2: `generate_tags`, `combine_features` (parallèle après leurs dépendances satisfaites) + +--- + +## Exemple 3: Motif Map-Reduce + +Traiter multiples pistes en parallèle, puis agréger: + +```python +{% endraw %} +@broker.task +@pipeline_task(output="track_features") +async def process_single_track(track: str) -> dict: + return {"track": track, "duration": 180.0, "bpm": 120} + +# Reduce: agréger toutes features de pistes +@broker.task +@pipeline_task(output="playlist_stats") +async def aggregate_track_features(track_features: list[dict]) -> dict: + total_duration = sum(t["duration"] for t in track_features) + avg_bpm = sum(t["bpm"] for t in track_features) / len(track_features) + return {"total_tracks": len(track_features), "total_duration": total_duration, "avg_bpm": avg_bpm} + +# Construire pipeline +pipeline = 
DataflowPipeline(broker) +pipeline.map( + process_single_track, + tracks, # ["track1.mp3", "track2.mp3", ...] + output="track_features", + max_parallel=4, +) +pipeline.reduce( + aggregate_track_features, + input_name="track_features", + output="playlist_stats", +) + +résultats = await pipeline.kiq_map_reduce() +# résultats = {"track_features": [...], "playlist_stats": {...}} +{% raw %} +``` + +--- + +## Exemple 4: Visualisation + +Le pipeline fournit multiples formats de visualisation: + +```python +{% endraw %} +pipeline.print_dag() + +# JSON (for web UIs) +viz_json = pipeline.visualize() +# Structure: +# { +# "nodes": [{"id": "task_name", "outputs": [...], "inputs": [...]}, ...], +# "edges": [{"from": "task_a", "to": "task_b"}], +# "levels": [["task1"], ["task2", "task3"], ...] +# } + +# DOT format (for Graphviz) +dot = pipeline.visualize_dot() +# Save and render: +# with open("pipeline.dot", "w") as f: +# f.write(dot) +# Run: dot -Tpng pipeline.dot -o pipeline.png +{% raw %} +``` + +--- + +## Exécuter l'Exemple + +```bash +{% endraw %} +{% raw %} +``` + +Sortie attendue inclut: +- Affichages DAG ASCII montrant ordre exécution +- Représentation DOT DAG extrait +- Structure JSON DAG extrait + +--- + +## Points Clés à Retenir + +1. **Résolution automatique de dépendances** — Pas besoin d'enchaîner manuellement; juste déclarer sorties +2. **Exécution parallèle** — Tâches indépendantes s'exécutent concurremment automatiquement +3. **Programmation dataflow** — Tâches sont fonctions pures; sortie va vers entrées +4. **Débogage visuel** — `print_dag()` montre exactement comment tâches s'exécuteront +5. **Motifs évolutifs** — Map-reduce intégré pour charges batch + +--- + +## Chemin d'Apprentissage + +Après cet exemple: + +1. **[Guide DataflowPipeline]({{ '/fr/guides/pipelines.md#2-pipeline-dataflow' | relative_url }})** — Plongée profonde fonctionnalités dataflow +2. 
**[Guide d'Exécution]({{ '/fr/guides/execution/' | relative_url }})** — Parallélisme, timeouts, gestion erreurs +3. **[Guide de Performance]({{ '/fr/guides/performance/' | relative_url }})** — Réglage `max_parallel`, profils ressources + +--- + +*C'est l'exemple flagship. Étudiez-le thoroughly pour comprendre modèle dataflow Taskiq-Flow.* + +{% endraw %} \ No newline at end of file diff --git a/docs/_fr/examples/index.md b/docs/_fr/examples/index.md index 0db61c5..dc1aa05 100644 --- a/docs/_fr/examples/index.md +++ b/docs/_fr/examples/index.md @@ -1,90 +1,90 @@ ---- -title: Galerie d'Exemples -nav_order: 40 -permalink: /fr/examples/ ---- -# Galerie d'Exemples - -**Exemples fonctionnels démontrant les fonctionnalités et motifs clés de Taskiq-Flow** - -> **Version** : {VERSION} | **Lié** : [Guide de Démarrage Rapide]({{ '/fr/quickstart/' | relative_url }}) - ---- - -## Aperçu - -Cette galerie propose des parcours détaillés des scripts d'exemple inclus dans le répertoire `examples/`. Chaque exemple démontre une fonctionnalité spécifique ou un motif d'intégration. 
- ---- - -## Index des Exemples - -| Exemple | Description | Concepts Clés | -|---------|-------------|---------------| -| [Pipeline Basique]({{ '/fr/examples/quickstart/' | relative_url }}) | Pipeline séquentiel simple avec opérations map, filter, group | SequentialPipeline, étapes de base | -| [Démonstration Suivi]({{ '/fr/examples/tracking-demo/' | relative_url }}) | Surveillance en temps réel avec PipelineTrackingManager | Suivi, stockage d'état, visualisation | -| [Pipeline Planifié]({{ '/fr/examples/scheduled-pipeline/' | relative_url }}) | Exécution périodique de pipeline via cron | PipelineScheduler, APScheduler, fuseaux horaires | -| [Pipeline Audio Dataflow]({{ '/fr/examples/dataflow-audio-pipeline/' | relative_url }}) | DAG complet avec parallélisme, map-reduce et visualisation | DataflowPipeline, DAG automatique, parallélisme | -| [Démonstration Visualisation DAG]({{ '/fr/examples/dag-visualization-demo/' | relative_url }}) | Analyse DAG NetworkX : chemin critique, groupes parallèles, export | DAGVisualizer, NetworkX, chemin critique, exports | -| [Démo NiceGUI DAG]({{ '/fr/examples/nicegui-dag-demo/' | relative_url }}) | Visualiseur web interactif de DAG via NiceGUI et MermaidGenerator | MermaidGenerator, NiceGUI, visualisation web interactive | -| [Découverte Registry]({{ '/fr/examples/registry-discovery/' | relative_url }}) | Construction manuelle de DataflowRegistry, introspection DAG, exécution bas niveau | DataflowRegistry, ExecutionEngine, introspection | -| [Démo WebSocket]({{ '/fr/examples/websocket-demo/' | relative_url }}) | Streaming d'événements en temps réel via WebSockets | HookManager, transport WebSocket, suivi live | -| [API REST]({{ '/fr/examples/api-example/' | relative_url }}) | Intégration FastAPI pour gestion distante de pipelines | PipelineVisualizationAPI, endpoints personnalisés | - ---- - -## Exécuter les Exemples - -Chaque page d'exemple inclut : - -- **Aperçu** — Ce que démontre l'exemple -- **Prérequis** — Dépendances et 
configuration requises -- **Parcours du code** — Explication ligne par ligne -- **Concepts clés** — Fonctionnalités illustrées -- **Instructions d'exécution** — Comment lancer le script -- **Sortie attendue** — Exemple de résultat pour vérification -- **Problèmes courants** — Conseils de dépannage - -**Pour exécuter un exemple** : - -```bash -# Se placer à la racine du dépôt -cd taskiq-flow - -# Installer les dépendances si nécessaire -pip install -e . - -# Lancer un script d'exemple -python examples/quickstart.py -``` - -Certains exemples nécessitent des services additionnels (Redis, etc.). Voir les pages individuelles pour les détails. - ---- - -## Catégories d'Exemples - -### Démarrage -- [Pipeline Basique]({{ '/fr/examples/quickstart/' | relative_url }}) — Commencez ici si vous êtes novice - -### Monitoring & Opérations -- [Démonstration Suivi]({{ '/fr/examples/tracking-demo/' | relative_url }}) -- [Pipeline Planifié]({{ '/fr/examples/scheduled-pipeline/' | relative_url }}) -- [Démo WebSocket]({{ '/fr/examples/websocket-demo/' | relative_url }}) - -### Workflows Avancés -- [Pipeline Audio Dataflow]({{ '/fr/examples/dataflow-audio-pipeline/' | relative_url }}) -- [Démonstration Visualisation DAG]({{ '/fr/examples/dag-visualization-demo/' | relative_url }}) -- [Démo NiceGUI DAG]({{ '/fr/examples/nicegui-dag-demo/' | relative_url }}) -- [Découverte Registry]({{ '/fr/examples/registry-discovery/' | relative_url }}) - -### Intégration -- [API REST]({{ '/fr/examples/api-example/' | relative_url }}) - ---- - -## Prochaines Étapes - -- **[Guide de Démarrage Rapide]({{ '/fr/quickstart/' | relative_url }})** — Lancez votre premier pipeline -- **[Guides Utilisateur]({{ '/fr/guides/' | relative_url }})** — Approfondissements par fonctionnalité -- **[Référence API]({{ '/fr/api/' | relative_url }})** — Documentation complète des modules +--- +title: Galerie d'Exemples +nav_order: 40 +permalink: /fr/examples/ +--- +# Galerie d'Exemples + +**Exemples fonctionnels démontrant 
les fonctionnalités et motifs clés de Taskiq-Flow** + +> **Version** : {VERSION} | **Lié** : [Guide de Démarrage Rapide]({{ '/fr/quickstart/' | relative_url }}) + +--- + +## Aperçu + +Cette galerie propose des parcours détaillés des scripts d'exemple inclus dans le répertoire `examples/`. Chaque exemple démontre une fonctionnalité spécifique ou un motif d'intégration. + +--- + +## Index des Exemples + +| Exemple | Description | Concepts Clés | +|---------|-------------|---------------| +| [Pipeline Basique]({{ '/fr/examples/quickstart/' | relative_url }}) | Pipeline séquentiel simple avec opérations map, filter, group | SequentialPipeline, étapes de base | +| [Démonstration Suivi]({{ '/fr/examples/tracking-demo/' | relative_url }}) | Surveillance en temps réel avec PipelineTrackingManager | Suivi, stockage d'état, visualisation | +| [Pipeline Planifié]({{ '/fr/examples/scheduled-pipeline/' | relative_url }}) | Exécution périodique de pipeline via cron | PipelineScheduler, APScheduler, fuseaux horaires | +| [Pipeline Audio Dataflow]({{ '/fr/examples/dataflow-audio-pipeline/' | relative_url }}) | DAG complet avec parallélisme, map-reduce et visualisation | DataflowPipeline, DAG automatique, parallélisme | +| [Démonstration Visualisation DAG]({{ '/fr/examples/dag-visualization-demo/' | relative_url }}) | Analyse DAG NetworkX : chemin critique, groupes parallèles, export | DAGVisualizer, NetworkX, chemin critique, exports | +| [Démo NiceGUI DAG]({{ '/fr/examples/nicegui-dag-demo/' | relative_url }}) | Visualiseur web interactif de DAG via NiceGUI et MermaidGenerator | MermaidGenerator, NiceGUI, visualisation web interactive | +| [Découverte Registry]({{ '/fr/examples/registry-discovery/' | relative_url }}) | Construction manuelle de DataflowRegistry, introspection DAG, exécution bas niveau | DataflowRegistry, ExecutionEngine, introspection | +| [Démo WebSocket]({{ '/fr/examples/websocket-demo/' | relative_url }}) | Streaming d'événements en temps réel via WebSockets | 
HookManager, transport WebSocket, suivi live | +| [API REST]({{ '/fr/examples/api-example/' | relative_url }}) | Intégration FastAPI pour gestion distante de pipelines | PipelineVisualizationAPI, endpoints personnalisés | + +--- + +## Exécuter les Exemples + +Chaque page d'exemple inclut : + +- **Aperçu** — Ce que démontre l'exemple +- **Prérequis** — Dépendances et configuration requises +- **Parcours du code** — Explication ligne par ligne +- **Concepts clés** — Fonctionnalités illustrées +- **Instructions d'exécution** — Comment lancer le script +- **Sortie attendue** — Exemple de résultat pour vérification +- **Problèmes courants** — Conseils de dépannage + +**Pour exécuter un exemple** : + +```bash +# Se placer à la racine du dépôt +cd taskiq-flow + +# Installer les dépendances si nécessaire +pip install -e . + +# Lancer un script d'exemple +python examples/quickstart.py +``` + +Certains exemples nécessitent des services additionnels (Redis, etc.). Voir les pages individuelles pour les détails. 
+ +--- + +## Catégories d'Exemples + +### Démarrage +- [Pipeline Basique]({{ '/fr/examples/quickstart/' | relative_url }}) — Commencez ici si vous êtes novice + +### Monitoring & Opérations +- [Démonstration Suivi]({{ '/fr/examples/tracking-demo/' | relative_url }}) +- [Pipeline Planifié]({{ '/fr/examples/scheduled-pipeline/' | relative_url }}) +- [Démo WebSocket]({{ '/fr/examples/websocket-demo/' | relative_url }}) + +### Workflows Avancés +- [Pipeline Audio Dataflow]({{ '/fr/examples/dataflow-audio-pipeline/' | relative_url }}) +- [Démonstration Visualisation DAG]({{ '/fr/examples/dag-visualization-demo/' | relative_url }}) +- [Démo NiceGUI DAG]({{ '/fr/examples/nicegui-dag-demo/' | relative_url }}) +- [Découverte Registry]({{ '/fr/examples/registry-discovery/' | relative_url }}) + +### Intégration +- [API REST]({{ '/fr/examples/api-example/' | relative_url }}) + +--- + +## Prochaines Étapes + +- **[Guide de Démarrage Rapide]({{ '/fr/quickstart/' | relative_url }})** — Lancez votre premier pipeline +- **[Guides Utilisateur]({{ '/fr/guides/' | relative_url }})** — Approfondissements par fonctionnalité +- **[Référence API]({{ '/fr/api/' | relative_url }})** — Documentation complète des modules diff --git a/docs/_fr/examples/nicegui-dag-demo.md b/docs/_fr/examples/nicegui-dag-demo.md index d918930..25307c6 100644 --- a/docs/_fr/examples/nicegui-dag-demo.md +++ b/docs/_fr/examples/nicegui-dag-demo.md @@ -1,5 +1,5 @@ --- -title: Exemple: nicegui_dag_demo.py +title: 'Exemple: nicegui_dag_demo.py' nav_order: 48 color_scheme: dark --- diff --git a/docs/_fr/examples/quickstart.md b/docs/_fr/examples/quickstart.md index e5f1547..b2ae61e 100644 --- a/docs/_fr/examples/quickstart.md +++ b/docs/_fr/examples/quickstart.md @@ -1,173 +1,173 @@ ---- -permalink: /fr/examples/quickstart/ -title: Exemple: quickstart.py -nav_order: 41 -color_scheme: dark ---- -# Exemple: quickstart.py - -**Pipeline séquentiel simple avec opérations map, filter, group** - +--- +permalink: 
/fr/examples/quickstart/ +title: 'Exemple: quickstart.py' +nav_order: 41 +color_scheme: dark +--- +# Exemple: quickstart.py + +**Pipeline séquentiel simple avec opérations map, filter, group** + > **Version** : {VERSION} | **Fichier** : `examples/quickstart.py` - ---- - -## Aperçu - -Cet exemple démontre les fondamentaux de Taskiq-Flow en utilisant un pipeline séquentiel classique. Il couvre: - -- Définition de tâches avec `@broker.task` -- Construction de pipeline avec `.call_next()`, `.map()`, `.filter()` -- Exécution du pipeline et récupération des résultats -- Compréhension du flux de données à travers étapes - ---- - -## Explication Pas-à-Pas du Code - -```python -import asyncio -from taskiq import InMemoryBroker -from taskiq_flow import Pipeline, PipelineMiddleware - -# 1. Initialiser broker et ajouter middleware -broker = InMemoryBroker() -broker.add_middlewares(PipelineMiddleware()) - -# 2. Définir les tâches -@broker.task -def add_one(value: int) -> int: - return value + 1 - -@broker.task -def repeat(value: int, times: int) -> list[int]: - return [value] * times - -@broker.task -def is_positive(value: int) -> bool: - return value >= 0 - -# 3. Construire le pipeline -async def main(): - pipeline = ( - Pipeline(broker) - .call_next(add_one) # Étape 1: 1 → 2 - .call_next(repeat, times=4) # Étape 2: 2 → [2,2,2,2] - .map(add_one) # Étape 3: [2,2,2,2] → [3,3,3,3] - .filter(is_positive) # Étape 4: garder positifs (tous gardés) - ) - - # 4. 
Exécuter - task = await pipeline.kiq(1) - result = await task.wait_result() - print("Résultat:", result.return_value) # [3, 3, 3, 3] - -asyncio.run(main()) -``` - ---- - -## Explication Étape par Étape - -### Étape 1: `call_next(add_one)` - -- **Entrée**: `1` -- **Opération**: `add_one(1) = 2` -- **Sortie**: `2` - -### Étape 2: `call_next(repeat, times=4)` - -- **Entrée**: `2` -- **Opération**: `repeat(2, times=4) = [2, 2, 2, 2]` -- **Sortie**: `[2, 2, 2, 2]` - -### Étape 3: `map(add_one)` - -- **Entrée**: `[2, 2, 2, 2]` (itérable) -- **Opération**: Appliquer `add_one` à chaque élément **en parallèle** - - `add_one(2) = 3` - - `add_one(2) = 3` - - `add_one(2) = 3` - - `add_one(2) = 3` -- **Sortie**: `[3, 3, 3, 3]` - -### Étape 4: `filter(is_positive)` - -- **Entrée**: `[3, 3, 3, 3]` (itérable) -- **Opération**: Garder éléments où `is_positive(element) == True` - - Tous 4 éléments positifs → tous gardés -- **Sortie**: `[3, 3, 3, 3]` - ---- - -## Concepts Clés Démontrés - -1. **Définition de tâche** — Toute étape de pipeline doit être une tâche (`@broker.task`) -2. **Exigence middleware** — `PipelineMiddleware` **doit** être ajouté au broker -3. **Flux de données** — Chaque étape reçoit sortie précédente (sauf `call_after`) -4. **Exécution parallèle** — `.map()` exécute éléments concurremment -5. 
**Enchaînement** — Les méthodes retournent pipeline pour interface fluide - ---- - -## Exécuter l'Exemple - -```bash -python examples/quickstart.py -``` - -Sortie attendue: -``` -Résultat: [3, 3, 3, 3] -``` - ---- - -## Variations à Tester - -### Utiliser `filter` pour éliminer négatifs - -```python -@broker.task -def subtract_three(valeur: int) -> int: - return valeur - 5 # résultats en [-2, -2, -2, -2] - -pipeline = ( - Pipeline(broker) - .call_next(add_one) - .call_next(repeat, times=4) - .map(subtract_three) # [2,2,2,2] → [-2,-2,-2,-2] - .filter(is_positive) # [] — tous filtrés -) -``` - -### Utiliser `group` pour tâches indépendantes parallèles - -```python -@broker.task -def task_a(x: int) -> int: return x * 2 -@broker.task -def task_b(x: int) -> int: return x + 10 -@broker.task -def task_c(x: int) -> int: return x ** 2 - -pipeline = Pipeline(broker).call_next(add_one) # 1 → 2 -pipeline.group([task_a, task_b, task_c], param_names=["x"]) -# All three receive 2 and execute in parallel -# Result: [4, 12, 4] -``` - ---- - -## Chemin d'Apprentissage - -Après cet exemple: - -1. **[Pipelines Dataflow]({{ '/fr/guides/pipelines.md#2-pipeline-dataflow' | relative_url }})** — Construction automatique de DAG -2. **[Définition des Tâches]({{ '/fr/guides/tasks/' | relative_url }})** — Fonctionnalités avancées de tâches -3. **[Suivi]({{ '/fr/guides/tracking/' | relative_url }})** — Monitor exécutions de pipeline -4. **[MapReduce]({{ '/fr/guides/execution.md#3-motif-map-reduce' | relative_url }})** — Motif de traitement par lots - ---- - -*Cet exemple est le "Hello World" de Taskiq-Flow. Maîtriser-le avant de passer à motifs plus complexes.* + +--- + +## Aperçu + +Cet exemple démontre les fondamentaux de Taskiq-Flow en utilisant un pipeline séquentiel classique. 
Il couvre: + +- Définition de tâches avec `@broker.task` +- Construction de pipeline avec `.call_next()`, `.map()`, `.filter()` +- Exécution du pipeline et récupération des résultats +- Compréhension du flux de données à travers étapes + +--- + +## Explication Pas-à-Pas du Code + +```python +import asyncio +from taskiq import InMemoryBroker +from taskiq_flow import Pipeline, PipelineMiddleware + +# 1. Initialiser broker et ajouter middleware +broker = InMemoryBroker() +broker.add_middlewares(PipelineMiddleware()) + +# 2. Définir les tâches +@broker.task +def add_one(value: int) -> int: + return value + 1 + +@broker.task +def repeat(value: int, times: int) -> list[int]: + return [value] * times + +@broker.task +def is_positive(value: int) -> bool: + return value >= 0 + +# 3. Construire le pipeline +async def main(): + pipeline = ( + Pipeline(broker) + .call_next(add_one) # Étape 1: 1 → 2 + .call_next(repeat, times=4) # Étape 2: 2 → [2,2,2,2] + .map(add_one) # Étape 3: [2,2,2,2] → [3,3,3,3] + .filter(is_positive) # Étape 4: garder positifs (tous gardés) + ) + + # 4. 
Exécuter + task = await pipeline.kiq(1) + result = await task.wait_result() + print("Résultat:", result.return_value) # [3, 3, 3, 3] + +asyncio.run(main()) +``` + +--- + +## Explication Étape par Étape + +### Étape 1: `call_next(add_one)` + +- **Entrée**: `1` +- **Opération**: `add_one(1) = 2` +- **Sortie**: `2` + +### Étape 2: `call_next(repeat, times=4)` + +- **Entrée**: `2` +- **Opération**: `repeat(2, times=4) = [2, 2, 2, 2]` +- **Sortie**: `[2, 2, 2, 2]` + +### Étape 3: `map(add_one)` + +- **Entrée**: `[2, 2, 2, 2]` (itérable) +- **Opération**: Appliquer `add_one` à chaque élément **en parallèle** + - `add_one(2) = 3` + - `add_one(2) = 3` + - `add_one(2) = 3` + - `add_one(2) = 3` +- **Sortie**: `[3, 3, 3, 3]` + +### Étape 4: `filter(is_positive)` + +- **Entrée**: `[3, 3, 3, 3]` (itérable) +- **Opération**: Garder éléments où `is_positive(element) == True` + - Tous 4 éléments positifs → tous gardés +- **Sortie**: `[3, 3, 3, 3]` + +--- + +## Concepts Clés Démontrés + +1. **Définition de tâche** — Toute étape de pipeline doit être une tâche (`@broker.task`) +2. **Exigence middleware** — `PipelineMiddleware` **doit** être ajouté au broker +3. **Flux de données** — Chaque étape reçoit sortie précédente (sauf `call_after`) +4. **Exécution parallèle** — `.map()` exécute éléments concurremment +5. 
**Enchaînement** — Les méthodes retournent pipeline pour interface fluide + +--- + +## Exécuter l'Exemple + +```bash +python examples/quickstart.py +``` + +Sortie attendue: +``` +Résultat: [3, 3, 3, 3] +``` + +--- + +## Variations à Tester + +### Utiliser `filter` pour éliminer négatifs + +```python +@broker.task +def subtract_three(valeur: int) -> int: + return valeur - 5 # résultats en [-2, -2, -2, -2] + +pipeline = ( + Pipeline(broker) + .call_next(add_one) + .call_next(repeat, times=4) + .map(subtract_three) # [2,2,2,2] → [-2,-2,-2,-2] + .filter(is_positive) # [] — tous filtrés +) +``` + +### Utiliser `group` pour tâches indépendantes parallèles + +```python +@broker.task +def task_a(x: int) -> int: return x * 2 +@broker.task +def task_b(x: int) -> int: return x + 10 +@broker.task +def task_c(x: int) -> int: return x ** 2 + +pipeline = Pipeline(broker).call_next(add_one) # 1 → 2 +pipeline.group([task_a, task_b, task_c], param_names=["x"]) +# All three receive 2 and execute in parallel +# Result: [4, 12, 4] +``` + +--- + +## Chemin d'Apprentissage + +Après cet exemple: + +1. **[Pipelines Dataflow]({{ '/fr/guides/pipelines.md#2-pipeline-dataflow' | relative_url }})** — Construction automatique de DAG +2. **[Définition des Tâches]({{ '/fr/guides/tasks/' | relative_url }})** — Fonctionnalités avancées de tâches +3. **[Suivi]({{ '/fr/guides/tracking/' | relative_url }})** — Monitor exécutions de pipeline +4. **[MapReduce]({{ '/fr/guides/execution.md#3-motif-map-reduce' | relative_url }})** — Motif de traitement par lots + +--- + +*Cet exemple est le "Hello World" de Taskiq-Flow. 
Maîtriser-le avant de passer à motifs plus complexes.* diff --git a/docs/_fr/examples/registry-discovery.md b/docs/_fr/examples/registry-discovery.md index eb367f5..86d7087 100644 --- a/docs/_fr/examples/registry-discovery.md +++ b/docs/_fr/examples/registry-discovery.md @@ -1,289 +1,289 @@ ---- -permalink: /fr/examples/registry-discovery/ -title: Exemple: registry_discovery_example.py -nav_order: 43 -color_scheme: dark ---- -# Exemple: registry_discovery_example.py - -**Construction manuelle de DataflowRegistry, introspection DAG, exécution bas niveau** - +--- +permalink: /fr/examples/registry-discovery/ +title: 'Exemple: registry_discovery_example.py' +nav_order: 43 +color_scheme: dark +--- +# Exemple: registry_discovery_example.py + +**Construction manuelle de DataflowRegistry, introspection DAG, exécution bas niveau** + > **Version** : {VERSION} | **Fichier** : `examples/registry_discovery_example.py` - ---- - -## Aperçu - -Cet exemple avancé démontre les mécanismes internes du système de résolution automatique de dépendances de Taskiq-Flow utilisant `DataflowRegistry`. 
Il montre comment : - -- Enregistrer manuellement les tâches avec leurs déclarations E/S -- Inspecter le graphe de flux de données avant exécution -- Construire et valider un DAG -- Exécuter des pipelines en utilisant directement `ExecutionEngine` -- Comprendre la provenance des données et les dépendances des tâches - -**C'est le mécanisme central derrière `DataflowPipeline.from_tasks()`.** - ---- - -## Ce Que Cet Exemple Montre - -- API complète de `DataflowRegistry` -- Construction manuelle de DAG à partir des métadonnées de tâche -- Interrogation des dépendances (producteurs/consommateurs) -- Tri topologique et détection de niveaux parallèles -- Exécution directe via `ExecutionEngine` -- Le `DataCache` pour l'exécution manuelle étape par étape -- Détection d'erreurs (dépendances manquantes, cycles) - ---- - -## Parcours du Code - -### Définition des Tâches (identique au style dataflow_audio) - -```python -from taskiq_flow.dataflow.registry import DataflowRegistry -from taskiq_flow.execution_engine import ExecutionEngine -from taskiq_flow.dataflow.cache import DataCache -from taskiq_flow.visualization import DAGVisualizer - -@broker.task -@pipeline_task(output="raw_data") -async def load_data(source: str) -> dict: - return {"source": source, "records": [...]} - -@broker.task -@pipeline_task(output="cleaned_data") -async def clean_data(raw_data: dict) -> dict: - records = [r for r in raw_data["records"] if r["value"] > 0] - return {"source": raw_data["source"], "records": records} - -@broker.task -@pipeline_task(output="features") -async def extract_features(cleaned_data:dict) -> dict: - total = sum(r["value"] for r in cleaned_data["records"]) - return {"total": total, "count": len(cleaned_data["records"])} - -@broker.task -@pipeline_task(output="report") -async def generate_report(features: dict) -> dict: - return {"report_id": "RPT-001", "summary": features} -``` - ---- - -## Exemple 1 : Construction Manuelle du Registre & Inspection - -```python -async def 
example_manual_registry(): - registry = DataflowRegistry() - - # Enregistrer les tâches manuellement - registry.register_task(load_data, output="raw_data", inputs=["source"]) - registry.register_task(clean_data, output="cleaned_data", inputs=["raw_data"]) - registry.register_task(extract_features, output="features", inputs=["cleaned_data"]) - registry.register_task(generate_report, output="report", inputs=["features"]) - - # Inspecter le registre - print(f"Tâches: {[t.task_name for t in registry.get_tasks()]}") - # ['load_data', 'clean_data', 'extract_features', 'generate_report'] - - # Interroger les dépendances - deps = registry.get_data_dependencies(generate_report) - print(f"generate_report dépend de: {deps}") # ['features'] - - # Trouver qui produit 'features' - producer = registry.get_producer("features") - print(f"'features' produit par: {producer.task_name}") # extract_features - - # Trouver qui consomme 'raw_data' - consumers = registry.get_consumers("raw_data") - print(f"'raw_data' consommé par: {[c.task_name for c in consumers]}") # [clean_data] - - # Entrées externes (non produites par une tâche) - external = registry.get_external_inputs() - print(f"Entrées externes: {external}") # ['source'] - - # Sorties (résultats finaux) - outputs = registry.get_outputs() - print(f"Sorties du pipeline: {outputs}") # toutes les sorties -``` - -**Méthodes clés** : - -| Méthode | Retourne | -|---------|----------| -| `get_tasks()` | Tous les objets `TaskNode` enregistrés | -| `get_outputs()` | Toutes les clés de sortie | -| `get_external_inputs()` | Entrées non produites par une tâche | -| `get_producer(output_key)` | Tâche produisant cette sortie | -| `get_consumers(input_key)` | Tâches consommant cette entrée | -| `get_data_dependencies(task)` | Liste des clés d'entrée pour une tâche | - ---- - -## Exemple 2 : Construction et Visualisation du DAG - -```python - # Construction du DAG - dag = registry.build_dag() - - print(f"DAG: {len(dag.nodes)} nœuds, 
{len(dag.edges)} arêtes") - - # Ordre d'exécution (tri topologique) - order = dag.topological_sort() - for i, node in enumerate(order): - print(f"{i+1}. {node.task_name}") - - # Niveaux d'exécution parallèle - for level_idx, level_nodes in enumerate(dag.levels): - tasks = [n.task_name for n in level_nodes] - print(f"Niveau {level_idx}: {tasks}") - - # Visualisation ASCII - dag.print() - - # Format DOT - dot = DAGVisualizer.to_dot(dag) - with open("pipeline.dot", "w") as f: - f.write(dot) -``` - -**Propriétés DAG** : -- `dag.nodes` — Tous les nœuds -- `dag.edges` — Arêtes de dépendance -- `dag.roots` — Nœuds sans dépendances -- `dag.leaves` — Nœuds sans dépendants -- `dag.levels` — Groupes de tâches exécutables en parallèle -- `dag.topological_sort()` — Ordre d'exécution linéaire - ---- - -## Exemple 3 : Validation & Détection d'Erreurs - -```python -async def example_validation(): - registry = DataflowRegistry() - registry.register_task(load_data, output="raw_data", inputs=["source"]) - - # Cassé : dépend d'une sortie inexistante - @broker.task - @pipeline_task(output="result") - async def broken_task(nonexistent_data: dict): - return {"result": "broken"} - - registry.register_task(broken_task, output="result", inputs=["nonexistent_data"]) - - try: - dag = registry.build_dag() # Lève ValueError - except ValueError as e: - print(f"Erreur attendue capturée: {e}") - # "La tâche 'broken_task' requiert l'entrée 'nonexistent_data' mais aucune tâche ne la produit" -``` - -**Validations effectuées** : -- Toutes les entrées déclarées doivent être produites par une tâche (ou être externes) -- Pas de dépendances circulaires (cycles) -- Pas de noms de sortie en double - ---- - -## Exemple 4 : Exécution avec ExecutionEngine - -```python -async def example_execution_with_engine(): - registry = DataflowRegistry() - registry.register_task(load_data, output="raw_data", inputs=["source"]) - registry.register_task(clean_data, output="cleaned_data", inputs=["raw_data"]) - 
registry.register_task(extract_features, output="features", inputs=["cleaned_data"]) - registry.register_task(generate_report, output="report", inputs=["features"]) - - dag = registry.build_dag() - - engine = ExecutionEngine( - broker=broker, - dag=dag, - fail_fast=True, - max_parallel=4, - ) - - results = await engine.execute( - inputs={"source": "local://data/file.csv"}, - pipeline_id="manual_pipeline_example", - ) - - # results = {"raw_data": ..., "cleaned_data": ..., "features": ..., "report": ...} -``` - -`ExecutionEngine` est l'exécuteur de bas niveau qui exécute un DAG. - ---- - -## Exemple 5 : Exécution Pas-à-Pas Manuelle avec DataCache - -Montre la boucle d'exécution interne : - -```python -async def example_manual_execution_with_cache(): - registry = DataflowRegistry() - # enregistrer les tâches... - dag = registry.build_dag() - - cache = DataCache() - - # Initialiser les entrées externes - cache.set("source", "local://data/file.csv") - - completed_nodes = set() - - while True: - ready = dag.get_ready_tasks(completed_nodes) - if not ready: - break - - for node in ready: - task = node.task - deps = registry.get_data_dependencies(task) - - # Injection des dépendances depuis le cache - args = cache.inject(deps) # {'raw_data': {...}, ...} - - # Exécution de la tâche - result = await task.kiq(**args) - output_value = (await result.wait_result()).return_value - - # Stockage de la sortie dans le cache - output_name = registry.get_task_metadata(task)["output"] - cache.set(output_name, output_value) - - completed_nodes.add(node) - - # Sorties finales dans le cache - final_report = cache.get("report") -``` - ---- - -## Pourquoi C'Important - -Comprendre `DataflowRegistry` vous aide à : - -1. **Déboguer des pipelines complexes** — Inspecter le DAG avant exécution -2. **Construire des pipelines dynamiques** — Assembler des pipelines à la volée selon la configuration -3. **Implémenter une orchestration personnalisée** — Utiliser `ExecutionEngine` directement -4. 
**Comprendre la provenance des données** — Tracer l'origine de chaque sortie - ---- - -## Chemin d'Apprentissage - -Après cet exemple : - -1. **[Guide Dataflow]({{ '/fr/guides/pipelines.md#2-dataflow-pipeline' | relative_url }})** — Utilisation haut niveau -2. **[API ExecutionEngine]({{ '/fr/api/execution/' | relative_url }})** — Contrôle d'exécution bas niveau -3. **[DAGBuilder]({{ '/fr/api/execution.md#dagbuilder' | relative_url }})** — Construction programmatique de DAG - ---- - - *Sujet avancé. La plupart des utilisateurs utiliseront `DataflowPipeline.from_tasks()` qui encapsule ce registry en interne. Explorez ceci uniquement si vous avez besoin de construction dynamique de pipelines.* + +--- + +## Aperçu + +Cet exemple avancé démontre les mécanismes internes du système de résolution automatique de dépendances de Taskiq-Flow utilisant `DataflowRegistry`. Il montre comment : + +- Enregistrer manuellement les tâches avec leurs déclarations E/S +- Inspecter le graphe de flux de données avant exécution +- Construire et valider un DAG +- Exécuter des pipelines en utilisant directement `ExecutionEngine` +- Comprendre la provenance des données et les dépendances des tâches + +**C'est le mécanisme central derrière `DataflowPipeline.from_tasks()`.** + +--- + +## Ce Que Cet Exemple Montre + +- API complète de `DataflowRegistry` +- Construction manuelle de DAG à partir des métadonnées de tâche +- Interrogation des dépendances (producteurs/consommateurs) +- Tri topologique et détection de niveaux parallèles +- Exécution directe via `ExecutionEngine` +- Le `DataCache` pour l'exécution manuelle étape par étape +- Détection d'erreurs (dépendances manquantes, cycles) + +--- + +## Parcours du Code + +### Définition des Tâches (identique au style dataflow_audio) + +```python +from taskiq_flow.dataflow.registry import DataflowRegistry +from taskiq_flow.execution_engine import ExecutionEngine +from taskiq_flow.dataflow.cache import DataCache +from taskiq_flow.visualization import 
DAGVisualizer + +@broker.task +@pipeline_task(output="raw_data") +async def load_data(source: str) -> dict: + return {"source": source, "records": [...]} + +@broker.task +@pipeline_task(output="cleaned_data") +async def clean_data(raw_data: dict) -> dict: + records = [r for r in raw_data["records"] if r["value"] > 0] + return {"source": raw_data["source"], "records": records} + +@broker.task +@pipeline_task(output="features") +async def extract_features(cleaned_data:dict) -> dict: + total = sum(r["value"] for r in cleaned_data["records"]) + return {"total": total, "count": len(cleaned_data["records"])} + +@broker.task +@pipeline_task(output="report") +async def generate_report(features: dict) -> dict: + return {"report_id": "RPT-001", "summary": features} +``` + +--- + +## Exemple 1 : Construction Manuelle du Registre & Inspection + +```python +async def example_manual_registry(): + registry = DataflowRegistry() + + # Enregistrer les tâches manuellement + registry.register_task(load_data, output="raw_data", inputs=["source"]) + registry.register_task(clean_data, output="cleaned_data", inputs=["raw_data"]) + registry.register_task(extract_features, output="features", inputs=["cleaned_data"]) + registry.register_task(generate_report, output="report", inputs=["features"]) + + # Inspecter le registre + print(f"Tâches: {[t.task_name for t in registry.get_tasks()]}") + # ['load_data', 'clean_data', 'extract_features', 'generate_report'] + + # Interroger les dépendances + deps = registry.get_data_dependencies(generate_report) + print(f"generate_report dépend de: {deps}") # ['features'] + + # Trouver qui produit 'features' + producer = registry.get_producer("features") + print(f"'features' produit par: {producer.task_name}") # extract_features + + # Trouver qui consomme 'raw_data' + consumers = registry.get_consumers("raw_data") + print(f"'raw_data' consommé par: {[c.task_name for c in consumers]}") # [clean_data] + + # Entrées externes (non produites par une tâche) + 
external = registry.get_external_inputs() + print(f"Entrées externes: {external}") # ['source'] + + # Sorties (résultats finaux) + outputs = registry.get_outputs() + print(f"Sorties du pipeline: {outputs}") # toutes les sorties +``` + +**Méthodes clés** : + +| Méthode | Retourne | +|---------|----------| +| `get_tasks()` | Tous les objets `TaskNode` enregistrés | +| `get_outputs()` | Toutes les clés de sortie | +| `get_external_inputs()` | Entrées non produites par une tâche | +| `get_producer(output_key)` | Tâche produisant cette sortie | +| `get_consumers(input_key)` | Tâches consommant cette entrée | +| `get_data_dependencies(task)` | Liste des clés d'entrée pour une tâche | + +--- + +## Exemple 2 : Construction et Visualisation du DAG + +```python + # Construction du DAG + dag = registry.build_dag() + + print(f"DAG: {len(dag.nodes)} nœuds, {len(dag.edges)} arêtes") + + # Ordre d'exécution (tri topologique) + order = dag.topological_sort() + for i, node in enumerate(order): + print(f"{i+1}. 
{node.task_name}") + + # Niveaux d'exécution parallèle + for level_idx, level_nodes in enumerate(dag.levels): + tasks = [n.task_name for n in level_nodes] + print(f"Niveau {level_idx}: {tasks}") + + # Visualisation ASCII + dag.print() + + # Format DOT + dot = DAGVisualizer.to_dot(dag) + with open("pipeline.dot", "w") as f: + f.write(dot) +``` + +**Propriétés DAG** : +- `dag.nodes` — Tous les nœuds +- `dag.edges` — Arêtes de dépendance +- `dag.roots` — Nœuds sans dépendances +- `dag.leaves` — Nœuds sans dépendants +- `dag.levels` — Groupes de tâches exécutables en parallèle +- `dag.topological_sort()` — Ordre d'exécution linéaire + +--- + +## Exemple 3 : Validation & Détection d'Erreurs + +```python +async def example_validation(): + registry = DataflowRegistry() + registry.register_task(load_data, output="raw_data", inputs=["source"]) + + # Cassé : dépend d'une sortie inexistante + @broker.task + @pipeline_task(output="result") + async def broken_task(nonexistent_data: dict): + return {"result": "broken"} + + registry.register_task(broken_task, output="result", inputs=["nonexistent_data"]) + + try: + dag = registry.build_dag() # Lève ValueError + except ValueError as e: + print(f"Erreur attendue capturée: {e}") + # "La tâche 'broken_task' requiert l'entrée 'nonexistent_data' mais aucune tâche ne la produit" +``` + +**Validations effectuées** : +- Toutes les entrées déclarées doivent être produites par une tâche (ou être externes) +- Pas de dépendances circulaires (cycles) +- Pas de noms de sortie en double + +--- + +## Exemple 4 : Exécution avec ExecutionEngine + +```python +async def example_execution_with_engine(): + registry = DataflowRegistry() + registry.register_task(load_data, output="raw_data", inputs=["source"]) + registry.register_task(clean_data, output="cleaned_data", inputs=["raw_data"]) + registry.register_task(extract_features, output="features", inputs=["cleaned_data"]) + registry.register_task(generate_report, output="report", inputs=["features"]) 
+ + dag = registry.build_dag() + + engine = ExecutionEngine( + broker=broker, + dag=dag, + fail_fast=True, + max_parallel=4, + ) + + results = await engine.execute( + inputs={"source": "local://data/file.csv"}, + pipeline_id="manual_pipeline_example", + ) + + # results = {"raw_data": ..., "cleaned_data": ..., "features": ..., "report": ...} +``` + +`ExecutionEngine` est l'exécuteur de bas niveau qui exécute un DAG. + +--- + +## Exemple 5 : Exécution Pas-à-Pas Manuelle avec DataCache + +Montre la boucle d'exécution interne : + +```python +async def example_manual_execution_with_cache(): + registry = DataflowRegistry() + # enregistrer les tâches... + dag = registry.build_dag() + + cache = DataCache() + + # Initialiser les entrées externes + cache.set("source", "local://data/file.csv") + + completed_nodes = set() + + while True: + ready = dag.get_ready_tasks(completed_nodes) + if not ready: + break + + for node in ready: + task = node.task + deps = registry.get_data_dependencies(task) + + # Injection des dépendances depuis le cache + args = cache.inject(deps) # {'raw_data': {...}, ...} + + # Exécution de la tâche + result = await task.kiq(**args) + output_value = (await result.wait_result()).return_value + + # Stockage de la sortie dans le cache + output_name = registry.get_task_metadata(task)["output"] + cache.set(output_name, output_value) + + completed_nodes.add(node) + + # Sorties finales dans le cache + final_report = cache.get("report") +``` + +--- + +## Pourquoi C'Important + +Comprendre `DataflowRegistry` vous aide à : + +1. **Déboguer des pipelines complexes** — Inspecter le DAG avant exécution +2. **Construire des pipelines dynamiques** — Assembler des pipelines à la volée selon la configuration +3. **Implémenter une orchestration personnalisée** — Utiliser `ExecutionEngine` directement +4. **Comprendre la provenance des données** — Tracer l'origine de chaque sortie + +--- + +## Chemin d'Apprentissage + +Après cet exemple : + +1. 
**[Guide Dataflow]({{ '/fr/guides/pipelines.md#2-dataflow-pipeline' | relative_url }})** — Utilisation haut niveau +2. **[API ExecutionEngine]({{ '/fr/api/execution/' | relative_url }})** — Contrôle d'exécution bas niveau +3. **[DAGBuilder]({{ '/fr/api/execution.md#dagbuilder' | relative_url }})** — Construction programmatique de DAG + +--- + + *Sujet avancé. La plupart des utilisateurs utiliseront `DataflowPipeline.from_tasks()` qui encapsule ce registry en interne. Explorez ceci uniquement si vous avez besoin de construction dynamique de pipelines.* diff --git a/docs/_fr/examples/resource_aware_demo.md b/docs/_fr/examples/resource_aware_demo.md index 0998c82..515dc56 100644 --- a/docs/_fr/examples/resource_aware_demo.md +++ b/docs/_fr/examples/resource_aware_demo.md @@ -1,248 +1,254 @@ ---- -permalink: /fr/examples/resource-aware-demo/ -title: Exemple: resource_aware_demo.py -nav_order: 49 -color_scheme: dark ---- -# Exemple: resource_aware_demo.py - -**Parallélisme dynamique basé sur CPU/mémoire** - -> **Version** : {VERSION} | **Fichier** : `examples/resource_aware_demo.py` - ---- - -## Aperçu - -Cet exemple démontre les fonctionnalités `ResourceAwareExecutor` et `TaskResourceProfile` introduites en v0.4.5. Il montre comment : - -- Annoter tâches avec besoins ressources (CPU, mémoire, I/O vs CPU) -- Calculer parallélisme optimal basé sur ressources système courantes -- Ajuster `max_parallel` dynamiquement pour ne pas surcharger l'hôte -- Appliquer différentes stratégies de parallélisme pour tâches I/O-bound vs CPU-bound - ---- - -## Ce Que Cet Exemple Montre - -- Définition de `TaskResourceProfile` pour tâches -- Création d'un `ResourceAwareExecutor` avec limites système -- Interrogation de `get_optimal_parallelism()` pour types de tâches -- Utilisation de profils ressource dans DataflowPipeline -- Directives de réglage manuel de parallélisme - ---- - -## Parcours Du Code - -### 1. 
Configuration Resource-Aware Executor - -```python -from taskiq_flow.optimization import ResourceAwareExecutor, TaskResourceProfile - -executor = ResourceAwareExecutor( - max_cpu_percent=80.0, # Ne pas dépasser 80% d'usage CPU - max_memory_percent=80.0, # Ne pas dépasser 80% de RAM - min_parallel=1, - max_parallel=20, -) - -# Interroger parallélisme optimal pour une estimation de tâche -optimal_light = executor.get_optimal_parallelism( - task_memory_estimate=50, # 50 MB par tâche - task_cpu_estimate=0.2, # 0.2 cores par tâche -) -print(f"Optimal pour tâches légères: {optimal_light}") - -optimal_heavy = executor.get_optimal_parallelism( - task_memory_estimate=200, # 200 MB par tâche - task_cpu_estimate=1.0, # 1.0 core par tâche -) -print(f"Optimal pour tâches lourdes: {optimal_heavy}") -``` - -L'exécuteur interroge l'usage système courant (via `psutil`) et calcule combien de tâches du profil donné peuvent s'exécuter en parallèle sans dépasser les limites configurées. - ---- - -### 2. Annoter Tâches avec Profils Ressources - -```python -@broker.task -@pipeline_task( - output="light_result", - resources=TaskResourceProfile( - estimated_memory_mb=50, - estimated_cpu_cores=0.2, - io_bound=True, - ), -) -async def light_task(item: int) -> dict: - await asyncio.sleep(0.1) # Simule I/O - return {"item": item, "result": item * 2} - -@broker.task -@pipeline_task( - output="heavy_result", - resources=TaskResourceProfile( - estimated_memory_mb=200, - estimated_cpu_cores=1.0, - io_bound=False, - ), -) -async def heavy_task(item: int) -> dict: - total = 0 - for _ in range(100000): - total += item * 2 - return {"item": item, "result": total} -``` - -**Champs ResourceProfile :** - -- `estimated_memory_mb`: Usage mémoire estimé par instance de tâche -- `estimated_cpu_cores`: Cores CPU requis (0.5 = demi-core) -- `io_bound`: True pour tâches I/O-heavy (réseau, disque), False pour CPU-heavy - ---- - -### 3. 
Utiliser Profils Ressources dans Pipelines - -Le paramètre `max_parallel` de `DataflowPipeline` agit comme borne supérieure. `ResourceAwareExecutor` peut calculer un `max_parallel` dynamique avant lancement : - -```python -# Calculer parallélisme optimal pour état système courant -current_parallel = executor.get_optimal_parallelism( - task_memory_estimate=50, - task_cpu_estimate=0.2, -) - -pipeline = DataflowPipeline(broker, max_parallel=current_parallel) -pipeline.map(light_task, items=list(range(20)), output="light_results") -results = await pipeline.kiq_dataflow() -``` - -Pour charges de travail mixtes, sommez l'usage ressource à travers tâches parallèles. - ---- - -### 4. Directives de Réglage Manuel Parallélisme - -```python -import psutil - -cpu_count = psutil.cpu_count() or 4 -memory_gb = psutil.virtual_memory().total / (1024 ** 3) - -# Tâches I/O-bound : pouvez oversubscribe CPU (passent du temps en attente) -io_parallel = min(50, cpu_count * 5) - -# Tâches CPU-bound : limitez aux cores disponibles ± petite marge -cpu_parallel = min(cpu_count + 2, 20) - -print(f"max_parallel recommandé pour tâches I/O-bound: {io_parallel}") -print(f"max_parallel recommandé pour tâches CPU-bound: {cpu_parallel}") -``` - -Commencez conservateur, benchmarkez, et ajustez. - ---- - -## Sortie Attendue - -``` -=== Resource-Aware Parallelism Demo === - -Current system state: - CPU Usage: ? (will query at runtime) - Memory: ? (will query at runtime) - ---- Light tasks (I/O bound) --- - Optimal parallelism for light tasks: 25 - ---- Heavy tasks (CPU bound) --- - Optimal parallelism for heavy tasks: 4 - -Note: Actual values depend on current system load. - - -=== Pipeline with Resource-Aware Execution === - -Pipeline structure: - [items:20] --light_task--> [light_results] - [items:10] --heavy_task--> [heavy_results] - [light_results, heavy_results] --combine--> [final] - -Executing pipeline... 
- Pipeline completed: {'light_results': [...], 'heavy_results': [...], 'final': {...}} - -TaskResourceProfile allows you to annotate tasks with resource requirements. -ResourceAwareExecutor uses these profiles to compute optimal parallelism. - - -=== Manual Parallelism Tuning === - -System: 8 CPU cores, 15.6 GB RAM -Recommended max_parallel for I/O-bound tasks: 40 -Recommended max_parallel for CPU-bound tasks: 10 -Start with conservative values and benchmark: - pipeline.map(light_task, items, max_parallel=10) - pipeline.map(heavy_task, items, max_parallel=cpu_count) - - -=== Resource-Aware Demo Complete === - -Key takeaways: -1. Use TaskResourceProfile to annotate task resource needs -2. ResourceAwareExecutor computes optimal parallelism at runtime -3. Adjust max_parallel based on task type (I/O vs CPU) -4. Monitor system resources and tune accordingly -``` - ---- - -## Points Clés - -### Pourquoi Parallélisme Aware-Ressource ? - -Sans conscience ressource, `max_parallel` trop haut peut : -- Épuiser mémoire → OOM kills -- Saturer CPU → tâches thrashing, ralentissement global -- Priver autres services sur même hôte - -`ResourceAwareExecutor` empêche ça en interrogeant usage système courant et calculant niveaux de parallélisme sûrs. - -### Meilleures Pratiques - -1. **Profilez vos tâches** : Mesurez usage mémoire/CPU réel en production -2. **Valeurs par défaut conservatrices** : Commencez avec `max_parallel=5` et augmentez -3. **Monitorer** : Surveillez métriques système pendant exécution pipelines -4. **Ajustez par type de tâche** : Tâches I/O-bound peuvent être plus parallèles que CPU-bound -5. 
**Utilisez bornes `min_parallel` et `max_parallel`** : `ResourceAwareExecutor` respecte ces bornes - -### Intégration avec Monitoring - -Combinez avec métriques Prometheus : - -```python -from taskiq_flow.metrics import MetricsMiddleware -broker.add_middlewares(MetricsMiddleware()) -``` - -Suivez : -- `taskiq_flow_worker_cpu_usage_percent` -- `taskiq_flow_worker_memory_usage_bytes` -- `taskiq_flow_pipeline_executions_total` - ---- - -## Chemin d'Apprentissage - -Après cet exemple : - -1. **[Guide Performance]({{ '/fr/guides/performance/' | relative_url }})** — Plongée profonde parallélisme aware-ressource -2. **[API Module Optimization]({{ '/fr/api/optimization/' | relative_url }})** — Référence complète `ResourceAwareExecutor` -3. **[Guide Suivi]({{ '/fr/guides/tracking/' | relative_url }})** — Monitorer usage ressource dans le temps - ---- - -*Cet exemple empêche vos pipelines de submerger l'hôte. Testez toujours les profils ressource avec volumes de données réalistes.* +--- +permalink: /fr/examples/resource-aware-demo/ +title: 'Exemple: resource_aware_demo.py' +nav_order: 49 +color_scheme: dark +--- +# Exemple: resource_aware_demo.py + +**Parallélisme dynamique basé sur CPU/mémoire** + +> **Version** : {VERSION} | **Fichier** : `examples/resource_aware_demo.py` + +--- + +## Aperçu + +Cet exemple démontre les fonctionnalités `ResourceAwareExecutor` et `TaskResourceProfile` introduites en v0.4.5. 
Il montre comment : + +- Annoter tâches avec besoins ressources (CPU, mémoire, I/O vs CPU) +- Calculer parallélisme optimal basé sur ressources système courantes +- Ajuster `max_parallel` dynamiquement pour ne pas surcharger l'hôte +- Appliquer différentes stratégies de parallélisme pour tâches I/O-bound vs CPU-bound + +--- + +## Ce Que Cet Exemple Montre + +- Définition de `TaskResourceProfile` pour tâches +- Création d'un `ResourceAwareExecutor` avec limites système +- Interrogation de `get_optimal_parallelism()` pour types de tâches +- Utilisation de profils ressource dans DataflowPipeline +- Directives de réglage manuel de parallélisme + +--- + +## Parcours Du Code + +### 1. Configuration Resource-Aware Executor + +{% raw %} +```python +from taskiq_flow.optimization import ResourceAwareExecutor, TaskResourceProfile + +executor = ResourceAwareExecutor( + max_cpu_percent=80.0, # Ne pas dépasser 80% d'usage CPU + max_memory_percent=80.0, # Ne pas dépasser 80% de RAM + min_parallel=1, + max_parallel=20, +) + +# Interroger parallélisme optimal pour une estimation de tâche +optimal_light = executor.get_optimal_parallelism( + task_memory_estimate=50, # 50 MB par tâche + task_cpu_estimate=0.2, # 0.2 cores par tâche +) +print(f"Optimal pour tâches légères: {optimal_light}") + +optimal_heavy = executor.get_optimal_parallelism( + task_memory_estimate=200, # 200 MB par tâche + task_cpu_estimate=1.0, # 1.0 core par tâche +) +print(f"Optimal pour tâches lourdes: {optimal_heavy}") +``` +{% endraw %} +L'exécuteur interroge l'usage système courant (via `psutil`) et calcule combien de tâches du profil donné peuvent s'exécuter en parallèle sans dépasser les limites configurées. + +--- + +### 2. 
Annoter Tâches avec Profils Ressources + +{% raw %} +```python +@broker.task +@pipeline_task( + output="light_result", + resources=TaskResourceProfile( + estimated_memory_mb=50, + estimated_cpu_cores=0.2, + io_bound=True, + ), +) +async def light_task(item: int) -> dict: + await asyncio.sleep(0.1) # Simule I/O + return {"item": item, "result": item * 2} + +@broker.task +@pipeline_task( + output="heavy_result", + resources=TaskResourceProfile( + estimated_memory_mb=200, + estimated_cpu_cores=1.0, + io_bound=False, + ), +) +async def heavy_task(item: int) -> dict: + total = 0 + for _ in range(100000): + total += item * 2 + return {"item": item, "result": total} +``` +{% endraw %} +**Champs ResourceProfile :** + +- `estimated_memory_mb`: Usage mémoire estimé par instance de tâche +- `estimated_cpu_cores`: Cores CPU requis (0.5 = demi-core) +- `io_bound`: True pour tâches I/O-heavy (réseau, disque), False pour CPU-heavy + +--- + +### 3. Utiliser Profils Ressources dans Pipelines + +Le paramètre `max_parallel` de `DataflowPipeline` agit comme borne supérieure. `ResourceAwareExecutor` peut calculer un `max_parallel` dynamique avant lancement : + +{% raw %} +```python +# Calculer parallélisme optimal pour état système courant +current_parallel = executor.get_optimal_parallelism( + task_memory_estimate=50, + task_cpu_estimate=0.2, +) + +pipeline = DataflowPipeline(broker, max_parallel=current_parallel) +pipeline.map(light_task, items=list(range(20)), output="light_results") +results = await pipeline.kiq_dataflow() +``` +{% endraw %} +Pour charges de travail mixtes, sommez l'usage ressource à travers tâches parallèles. + +--- + +### 4. 
Directives de Réglage Manuel Parallélisme + +{% raw %} +```python +import psutil + +cpu_count = psutil.cpu_count() or 4 +memory_gb = psutil.virtual_memory().total / (1024 ** 3) + +# Tâches I/O-bound : pouvez oversubscribe CPU (passent du temps en attente) +io_parallel = min(50, cpu_count * 5) + +# Tâches CPU-bound : limitez aux cores disponibles ± petite marge +cpu_parallel = min(cpu_count + 2, 20) + +print(f"max_parallel recommandé pour tâches I/O-bound: {io_parallel}") +print(f"max_parallel recommandé pour tâches CPU-bound: {cpu_parallel}") +``` +{% endraw %} +Commencez conservateur, benchmarkez, et ajustez. + +--- + +## Sortie Attendue + +{% raw %} +``` +=== Resource-Aware Parallelism Demo === + +Current system state: + CPU Usage: ? (will query at runtime) + Memory: ? (will query at runtime) + +--- Light tasks (I/O bound) --- + Optimal parallelism for light tasks: 25 + +--- Heavy tasks (CPU bound) --- + Optimal parallelism for heavy tasks: 4 + +Note: Actual values depend on current system load. + + +=== Pipeline with Resource-Aware Execution === + +Pipeline structure: + [items:20] --light_task--> [light_results] + [items:10] --heavy_task--> [heavy_results] + [light_results, heavy_results] --combine--> [final] + +Executing pipeline... + Pipeline completed: {'light_results': [...], 'heavy_results': [...], 'final': {...}} + +TaskResourceProfile allows you to annotate tasks with resource requirements. +ResourceAwareExecutor uses these profiles to compute optimal parallelism. + + +=== Manual Parallelism Tuning === + +System: 8 CPU cores, 15.6 GB RAM +Recommended max_parallel for I/O-bound tasks: 40 +Recommended max_parallel for CPU-bound tasks: 10 +Start with conservative values and benchmark: + pipeline.map(light_task, items, max_parallel=10) + pipeline.map(heavy_task, items, max_parallel=cpu_count) + + +=== Resource-Aware Demo Complete === + +Key takeaways: +1. Use TaskResourceProfile to annotate task resource needs +2. 
ResourceAwareExecutor computes optimal parallelism at runtime +3. Adjust max_parallel based on task type (I/O vs CPU) +4. Monitor system resources and tune accordingly +``` +{% endraw %} +--- + +## Points Clés + +### Pourquoi Parallélisme Aware-Ressource ? + +Sans conscience ressource, `max_parallel` trop haut peut : +- Épuiser mémoire → OOM kills +- Saturer CPU → tâches thrashing, ralentissement global +- Priver autres services sur même hôte + +`ResourceAwareExecutor` empêche ça en interrogeant usage système courant et calculant niveaux de parallélisme sûrs. + +### Meilleures Pratiques + +1. **Profilez vos tâches** : Mesurez usage mémoire/CPU réel en production +2. **Valeurs par défaut conservatrices** : Commencez avec `max_parallel=5` et augmentez +3. **Monitorer** : Surveillez métriques système pendant exécution pipelines +4. **Ajustez par type de tâche** : Tâches I/O-bound peuvent être plus parallèles que CPU-bound +5. **Utilisez bornes `min_parallel` et `max_parallel`** : `ResourceAwareExecutor` respecte ces bornes + +### Intégration avec Monitoring + +Combinez avec métriques Prometheus : + +{% raw %} +```python +from taskiq_flow.metrics import MetricsMiddleware +broker.add_middlewares(MetricsMiddleware()) +``` +{% endraw %} +Suivez : +- `taskiq_flow_worker_cpu_usage_percent` +- `taskiq_flow_worker_memory_usage_bytes` +- `taskiq_flow_pipeline_executions_total` + +--- + +## Chemin d'Apprentissage + +Après cet exemple : + +1. **[Guide Performance]({{ '/fr/guides/performance/' | relative_url }})** — Plongée profonde parallélisme aware-ressource +2. **[API Module Optimization]({{ '/fr/api/optimization/' | relative_url }})** — Référence complète `ResourceAwareExecutor` +3. **[Guide Suivi]({{ '/fr/guides/tracking/' | relative_url }})** — Monitorer usage ressource dans le temps + +--- + +*Cet exemple empêche vos pipelines de submerger l'hôte. 
Testez toujours les profils ressource avec volumes de données réalistes.* diff --git a/docs/_fr/examples/retry_demo.md b/docs/_fr/examples/retry_demo.md index a520e6a..858a66d 100644 --- a/docs/_fr/examples/retry_demo.md +++ b/docs/_fr/examples/retry_demo.md @@ -1,244 +1,244 @@ ---- -permalink: /fr/examples/retry-demo/ -title: Exemple: retry_demo.py -nav_order: 48 -color_scheme: dark ---- -# Exemple: retry_demo.py - -**Middleware retry et modes de gestion d'erreurs** - -> **Version** : {VERSION} | **Fichier** : `examples/retry_demo.py` - ---- - -## Aperçu - -Cet exemple démontre les mécanismes robustes de retry et gestion d'erreurs de Taskiq-Flow v0.4.5. Il couvre : - -- `PipelineRetryMiddleware` avec backoff exponentiel et jitter -- Stratégies `ErrorHandlingMode` (FAIL_FAST, CONTINUE_ON_ERROR, SKIP_FAILED, DEAD_LETTER) -- `PipelineErrorAggregator` pour collecter et analyser les échecs -- Configuration des politiques de retry par pipeline - ---- - -## Ce Que Cet Exemple Montre - -- Ajout du middleware retry à un broker -- Retry automatique avec backoff pour échecs transitoires -- Changement entre modes de gestion d'erreurs -- Agrégation des erreurs pour analyse post-mortem -- Distinction échecs retryables vs non-retryables - ---- - -## Parcours Du Code - -### 1. Middleware Retry - -```python -from taskiq_flow.middlewares.retry import PipelineRetryMiddleware - -retry_mw = PipelineRetryMiddleware( - max_retries=3, - delay=0.5, - backoff=2.0, - jitter=True, -) -broker.add_middlewares(retry_mw) -``` - -**Paramètres :** -- `max_retries`: Nombre max de tentatives (3 → 4 essais totaux) -- `delay`: Délai initial avant première retry (0.5s) -- `backoff`: Multiplicateur de délai à chaque retry (2.0 → 0.5s, 1s, 2s) -- `jitter`: Ajoute variation aléatoire pour éviter "thundering herd" - ---- - -### 2. 
Démo Tâche Flaky (capricieuse) - -```python -import random - -@broker.task -async def flaky_task(attempt: int = 0) -> str: - """Échoue aléatoirement, puis réussit parfois.""" - attempt += 1 - if random.random() < 0.7 and attempt < 3: - raise RuntimeError(f"Task failed on attempt {attempt}") - return f"Success on attempt {attempt}" -``` - -```python -async def demo_retry_middleware(): - pipeline = Pipeline(broker).call_next(flaky_task) - task = await pipeline.kiq(0) - result = await task.wait_result(timeout=10) - print(f"Pipeline succeeded! Result: {result.return_value}") - print(f"Retry count: {retry_mw.retry_counts}") -``` - -Sortie : - -``` -Pipeline succeeded! Result: Success on attempt 2 -Retry count: {'flaky_task': 1} -``` - -Le middleware retente automatiquement la tâche une fois avant succès. - ---- - -### 3. Modes de Gestion d'Erreurs - -```python -from taskiq_flow.errors import ErrorHandlingMode -from taskiq_flow.execution_engine import ExecutionEngine -from taskiq_flow.dataflow.registry import DataflowRegistry - -registry = DataflowRegistry() -registry.register_task(flaky_task, output="flaky_output", inputs=[]) -registry.register_task(process_result, output="final", inputs=["flaky_output"]) -dag = registry.build_dag() -``` - -#### FAIL_FAST (défaut) - -```python -engine = ExecutionEngine(broker, dag, error_mode=ErrorHandlingMode.FAIL_FAST) -# Arrêt immédiat à première erreur ; pipeline échoue -``` - -#### CONTINUE_ON_ERROR - -```python -engine = ExecutionEngine(broker, dag, error_mode=ErrorHandlingMode.CONTINUE_ON_ERROR) -# Marque tâche échouée comme FAILED mais continue avec tâches en aval qui ne dépendent pas d'elle -``` - -#### SKIP_FAILED - -```python -engine = ExecutionEngine(broker, dag, error_mode=ErrorHandlingMode.SKIP_FAILED) -# Tâches échouées sont skipées ; tâches en aval reçoivent valeurs par défaut (None) pour intrants échoués -``` - -#### DEAD_LETTER - -```python -engine = ExecutionEngine(broker, dag, 
error_mode=ErrorHandlingMode.DEAD_LETTER) -# Tâches échouées sont mises en file "dead-letter" pour retry ultérieur -``` - ---- - -### 4. Agrégation d'Erreurs - -```python -from taskiq_flow.errors import PipelineErrorAggregator - -aggregator = PipelineErrorAggregator() - -# During/after execution, errors are collected: -aggregator.add_error(task=failed_task, error=exc, context={...}) -``` - -Utile pour générer rapports d'erreur et alertes. - ---- - -## Sortie Attendue - -Lancer `python examples/retry_demo.py` : - -``` -=== Demo 1: Retry Middleware === - -Executing flaky task with retry middleware... -(Task may fail 1-2 times before succeeding) - - Pipeline succeeded! Result: Success on attempt 2 - -Retry count stored in middleware: {'flaky_task': 1} - - -=== Demo 2: Error Handling Modes === - ---- Mode: FAIL_FAST --- - Execution raised: RuntimeError: Task failed on attempt 3 - ---- Mode: CONTINUE_ON_ERROR --- - Execution completed. Results: ['flaky_output'] - ---- Mode: SKIP_FAILED --- - Execution completed. Results: ['flaky_output'] - -Note: ErrorHandlingMode.DEAD_LETTER would queue failures for later retry. - - -=== Demo 3: Error Aggregation === - -Total errors collected: 3 -Failed tasks: ['task_a', 'task_b', 'task_c'] - -Error details: - - task_a: RuntimeError: timeout - - task_b: ValueError: invalid data - - task_c: ConnectionError: network down - -You can use PipelineErrorAggregator to analyze failures and affected branches. - - -=== All Retry & Error Handling Demos Complete === -``` - ---- - -## Points Clés - -### Quel Mode Choisir ? 
- -| Mode | Idéal pour | Comportement | -|------|------------|--------------| -| `FAIL_FAST` | Pipelines critiques où tout échec invalide l'ensemble | Arrêt immédiat | -| `CONTINUE_ON_ERROR` | Analyses best-effort où résultats partiels ont de la valeur | Continue ; marque échecs | -| `SKIP_FAILED` | Traitement données où intrants manquants tolérés | Fournit None par défaut | -| `DEAD_LETTER` | Systèmes nécessitant intervention manuelle ou re-jeu | File d'attente pour retry ultérieur | - -### Stratégies de Retry - -- **Échecs transitoires** (timeouts réseau, épuisement temporaire ressources) → Utiliser `PipelineRetryMiddleware` -- **Échecs permanents** (données invalides, bugs code) → Utiliser `FAIL_FAST` ou `SKIP_FAILED` selon tolérance -- **Chargements mixtes** → Combiner retry middleware (pour transitoires) avec modes erreur (pour permanents) - -### Monitoring des Retries - -Suivez compteurs retry dans métriques ou logs : - -```python -for task_name, count in retry_mw.retry_counts.items(): - logger.info(f"Task {task_name} retried {count} times") -``` - -Intégrez avec Prometheus : - -```python -from taskiq_flow.metrics import MetricsMiddleware -broker.add_middlewares(MetricsMiddleware()) -``` - ---- - -## Chemin d'Apprentissage - -Après cet exemple : - -1. **[Guide Retry]({{ '/fr/guides/retry/' | relative_url }})** — Documentation complète retry & gestion d'erreurs -2. **[Guide Exécution]({{ '/fr/guides/execution/' | relative_url }})** — Moteur d'exécution interne -3. **[Guide Monitoring]({{ '/fr/guides/tracking/' | relative_url }})** — Suivre tâches échouées et retries en production - ---- - -*Cet exemple montre tous les patterns de retry. 
En production, ajustez paramètres retry (max_retries, backoff) selon caractéristiques tâches et exigences SLA.* +--- +permalink: /fr/examples/retry-demo/ +title: 'Exemple: retry_demo.py' +nav_order: 48 +color_scheme: dark +--- +# Exemple: retry_demo.py + +**Middleware retry et modes de gestion d'erreurs** + +> **Version** : {VERSION} | **Fichier** : `examples/retry_demo.py` + +--- + +## Aperçu + +Cet exemple démontre les mécanismes robustes de retry et gestion d'erreurs de Taskiq-Flow v0.4.5. Il couvre : + +- `PipelineRetryMiddleware` avec backoff exponentiel et jitter +- Stratégies `ErrorHandlingMode` (FAIL_FAST, CONTINUE_ON_ERROR, SKIP_FAILED, DEAD_LETTER) +- `PipelineErrorAggregator` pour collecter et analyser les échecs +- Configuration des politiques de retry par pipeline + +--- + +## Ce Que Cet Exemple Montre + +- Ajout du middleware retry à un broker +- Retry automatique avec backoff pour échecs transitoires +- Changement entre modes de gestion d'erreurs +- Agrégation des erreurs pour analyse post-mortem +- Distinction échecs retryables vs non-retryables + +--- + +## Parcours Du Code + +### 1. Middleware Retry + +```python +from taskiq_flow.middlewares.retry import PipelineRetryMiddleware + +retry_mw = PipelineRetryMiddleware( + max_retries=3, + delay=0.5, + backoff=2.0, + jitter=True, +) +broker.add_middlewares(retry_mw) +``` + +**Paramètres :** +- `max_retries`: Nombre max de tentatives (3 → 4 essais totaux) +- `delay`: Délai initial avant première retry (0.5s) +- `backoff`: Multiplicateur de délai à chaque retry (2.0 → 0.5s, 1s, 2s) +- `jitter`: Ajoute variation aléatoire pour éviter "thundering herd" + +--- + +### 2. 
Démo Tâche Flaky (capricieuse) + +```python +import random + +@broker.task +async def flaky_task(attempt: int = 0) -> str: + """Échoue aléatoirement, puis réussit parfois.""" + attempt += 1 + if random.random() < 0.7 and attempt < 3: + raise RuntimeError(f"Task failed on attempt {attempt}") + return f"Success on attempt {attempt}" +``` + +```python +async def demo_retry_middleware(): + pipeline = Pipeline(broker).call_next(flaky_task) + task = await pipeline.kiq(0) + result = await task.wait_result(timeout=10) + print(f"Pipeline succeeded! Result: {result.return_value}") + print(f"Retry count: {retry_mw.retry_counts}") +``` + +Sortie : + +``` +Pipeline succeeded! Result: Success on attempt 2 +Retry count: {'flaky_task': 1} +``` + +Le middleware retente automatiquement la tâche une fois avant succès. + +--- + +### 3. Modes de Gestion d'Erreurs + +```python +from taskiq_flow.errors import ErrorHandlingMode +from taskiq_flow.execution_engine import ExecutionEngine +from taskiq_flow.dataflow.registry import DataflowRegistry + +registry = DataflowRegistry() +registry.register_task(flaky_task, output="flaky_output", inputs=[]) +registry.register_task(process_result, output="final", inputs=["flaky_output"]) +dag = registry.build_dag() +``` + +#### FAIL_FAST (défaut) + +```python +engine = ExecutionEngine(broker, dag, error_mode=ErrorHandlingMode.FAIL_FAST) +# Arrêt immédiat à première erreur ; pipeline échoue +``` + +#### CONTINUE_ON_ERROR + +```python +engine = ExecutionEngine(broker, dag, error_mode=ErrorHandlingMode.CONTINUE_ON_ERROR) +# Marque tâche échouée comme FAILED mais continue avec tâches en aval qui ne dépendent pas d'elle +``` + +#### SKIP_FAILED + +```python +engine = ExecutionEngine(broker, dag, error_mode=ErrorHandlingMode.SKIP_FAILED) +# Tâches échouées sont skipées ; tâches en aval reçoivent valeurs par défaut (None) pour intrants échoués +``` + +#### DEAD_LETTER + +```python +engine = ExecutionEngine(broker, dag, 
error_mode=ErrorHandlingMode.DEAD_LETTER) +# Tâches échouées sont mises en file "dead-letter" pour retry ultérieur +``` + +--- + +### 4. Agrégation d'Erreurs + +```python +from taskiq_flow.errors import PipelineErrorAggregator + +aggregator = PipelineErrorAggregator() + +# During/after execution, errors are collected: +aggregator.add_error(task=failed_task, error=exc, context={...}) +``` + +Utile pour générer rapports d'erreur et alertes. + +--- + +## Sortie Attendue + +Lancer `python examples/retry_demo.py` : + +``` +=== Demo 1: Retry Middleware === + +Executing flaky task with retry middleware... +(Task may fail 1-2 times before succeeding) + + Pipeline succeeded! Result: Success on attempt 2 + +Retry count stored in middleware: {'flaky_task': 1} + + +=== Demo 2: Error Handling Modes === + +--- Mode: FAIL_FAST --- + Execution raised: RuntimeError: Task failed on attempt 3 + +--- Mode: CONTINUE_ON_ERROR --- + Execution completed. Results: ['flaky_output'] + +--- Mode: SKIP_FAILED --- + Execution completed. Results: ['flaky_output'] + +Note: ErrorHandlingMode.DEAD_LETTER would queue failures for later retry. + + +=== Demo 3: Error Aggregation === + +Total errors collected: 3 +Failed tasks: ['task_a', 'task_b', 'task_c'] + +Error details: + - task_a: RuntimeError: timeout + - task_b: ValueError: invalid data + - task_c: ConnectionError: network down + +You can use PipelineErrorAggregator to analyze failures and affected branches. + + +=== All Retry & Error Handling Demos Complete === +``` + +--- + +## Points Clés + +### Quel Mode Choisir ? 
+ +| Mode | Idéal pour | Comportement | +|------|------------|--------------| +| `FAIL_FAST` | Pipelines critiques où tout échec invalide l'ensemble | Arrêt immédiat | +| `CONTINUE_ON_ERROR` | Analyses best-effort où résultats partiels ont de la valeur | Continue ; marque échecs | +| `SKIP_FAILED` | Traitement données où intrants manquants tolérés | Fournit None par défaut | +| `DEAD_LETTER` | Systèmes nécessitant intervention manuelle ou re-jeu | File d'attente pour retry ultérieur | + +### Stratégies de Retry + +- **Échecs transitoires** (timeouts réseau, épuisement temporaire ressources) → Utiliser `PipelineRetryMiddleware` +- **Échecs permanents** (données invalides, bugs code) → Utiliser `FAIL_FAST` ou `SKIP_FAILED` selon tolérance +- **Chargements mixtes** → Combiner retry middleware (pour transitoires) avec modes erreur (pour permanents) + +### Monitoring des Retries + +Suivez compteurs retry dans métriques ou logs : + +```python +for task_name, count in retry_mw.retry_counts.items(): + logger.info(f"Task {task_name} retried {count} times") +``` + +Intégrez avec Prometheus : + +```python +from taskiq_flow.metrics import MetricsMiddleware +broker.add_middlewares(MetricsMiddleware()) +``` + +--- + +## Chemin d'Apprentissage + +Après cet exemple : + +1. **[Guide Retry]({{ '/fr/guides/retry/' | relative_url }})** — Documentation complète retry & gestion d'erreurs +2. **[Guide Exécution]({{ '/fr/guides/execution/' | relative_url }})** — Moteur d'exécution interne +3. **[Guide Monitoring]({{ '/fr/guides/tracking/' | relative_url }})** — Suivre tâches échouées et retries en production + +--- + +*Cet exemple montre tous les patterns de retry. 
En production, ajustez paramètres retry (max_retries, backoff) selon caractéristiques tâches et exigences SLA.* diff --git a/docs/_fr/examples/scheduled-pipeline.md b/docs/_fr/examples/scheduled-pipeline.md index 4b1b870..841781f 100644 --- a/docs/_fr/examples/scheduled-pipeline.md +++ b/docs/_fr/examples/scheduled-pipeline.md @@ -1,199 +1,199 @@ ---- -permalink: /fr/examples/scheduled-pipeline/ -title: Exemple: scheduled_pipeline.py -nav_order: 44 -color_scheme: dark ---- -# Exemple: scheduled_pipeline.py - -**Planification de pipelines avec déclencheurs cron et intervalles** - +--- +permalink: /fr/examples/scheduled-pipeline/ +title: 'Exemple: scheduled_pipeline.py' +nav_order: 44 +color_scheme: dark +--- +# Exemple: scheduled_pipeline.py + +**Planification de pipelines avec déclencheurs cron et intervalles** + > **Version** : {VERSION} | **Fichier** : `examples/scheduled_pipeline.py` - ---- - -## Aperçu - -Cet exemple démontre comment planifier des pipelines pour exécution périodique en utilisant `LabelBasedScheduler`. Il couvre: - -- Planification cron (avec précision seconde) -- Planification par intervalle -- Lister et inspecter jobs planifiés - -**Note** : Cet exemple utilise `LabelBasedScheduler`, mécanisme de planification par labels de TaskIQ. Pour planification cron production, considérer `PipelineScheduler` avec intégration APScheduler. 
- ---- - -## Ce Que Cet Exemple Montre - -- Créer un pipeline simple -- Utiliser `LabelBasedScheduler` pour planifier exécutions pipeline -- Expressions cron avec précision seconde -- Planification par intervalle -- Lister schedules actifs - ---- - -## Explication du Code - -```python -import asyncio -from taskiq import InMemoryBroker -from taskiq_flow import Pipeline, PipelineMiddleware -from taskiq_flow.scheduling import LabelBasedScheduler - -# Créer broker -broker = InMemoryBroker(await_inplace=True).with_middlewares(PipelineMiddleware()) - -# Définir tâche simple -@broker.task -async def log_message(msg: str) -> str: - """Logger un message.""" - return f"Traited: {msg}" - -async def main(): - # Créer pipeline - pipeline = Pipeline(broker).call_next(log_message) - - # Créer scheduler - scheduler = LabelBasedScheduler(broker) - - # Planifier avec expression cron (toutes les 5 secondes) - schedule_id = await scheduler.schedule_with_cron( - pipeline=pipeline, - label="every-5-seconds", - cron="*/5 * * * * *", # cron 6 champs pour précision seconde - args=("Hello from scheduled pipeline!",), - ) - print(f"Scheduled with cron: {schedule_id}") - - # Planifier avec intervalle (toutes les 3 secondes) - interval_id = await scheduler.schedule_with_interval( - pipeline=pipeline, - label="every-3-seconds", - interval_seconds=3, - args=("Interval scheduled run!",), - ) - print(f"Scheduled with interval: {interval_id}") - - # Attendre quelques exécutions - print("Waiting for pipeline executions (12 seconds)...") - await asyncio.sleep(12) - - # Lister jobs planifiés - schedules = scheduler.list_schedules() - print(f"Active schedules: {len(schedules)}") - for sched in schedules: - print(f" - {sched['label']}: cron={sched.get('cron')}, enabled={sched['enabled']}") - -asyncio.run(main()) -``` - ---- - -## Méthodes de Planification - -### Planification Cron - -```python -schedule_id = await scheduler.schedule_with_cron( - pipeline=pipeline, - label="mon-schedule", - cron="*/5 * * 
* * *", # Toutes 5 secondes (cron 6 champs) - args=("message",), -) -``` - -**Format cron 6 champs**:`seconde minute Heure jour mois jour-semaine` - -Exemples: -- `*/5 * * * * *` — Toutes les 5 secondes -- `0 * * * * *` — Toutes les minutes à seconde 0 -- `0 0 * * * *` — Toutes les heures à minute 0, seconde 0 - -### Planification Intervalle - -```python -interval_id = await scheduler.schedule_with_interval( - pipeline=pipeline, - label="interval-3s", - interval_seconds=3, - args=("message",), -) -``` - -Exécute toutes les N secondes, indépendamment de l'heure système. - ---- - -## Sortie Attendue - -``` -Scheduled with cron: schedule_123456 -Scheduled with interval: interval_789012 -Waiting for pipeline executions (12 seconds)... -INFO:root:Processed: Hello from scheduled pipeline! -INFO:root:Processed: Interval scheduled run! -INFO:root:Processed: Hello from scheduled pipeline! -INFO:root:Processed: Interval scheduled run! -... -Active schedules: 2 - - every-5-seconds: cron=*/5 * * * * *, enabled=True - - every-3-seconds: cron=None, enabled=True -``` - -Vous devriez voir le message logué plusieurs fois comme schedules se déclenchent. - ---- - -## Points Clés - -### Planification par Label - -- Chaque schedule requiert un `label` unique (utilisé pour identification) -- Les labels peuvent activer/désactiver schedules dynamiquement -- Le scheduler gère persistance schedules selon votre broker - -### Limite InMemoryBroker - -Avec `InMemoryBroker`, schedules fonctionnent seulement pendant processus en cours; perdus au redémarrage. Pour planification persistante, utiliser brokers Redis avec stores de schedule appropriés. - -### Multiples Schedules - -Vous pouvez planifier même pipeline plusieurs fois avec labels, expressions cron, ou arguments différents. 
- ---- - -## Variations - -### Planification Avancée avec PipelineScheduler - -Pour planification plus avancée (timezones, gestion misfire), utiliser `PipelineScheduler`: - -```python -from taskiq_flow import PipelineScheduler - -scheduler = PipelineScheduler(broker) -job_id = await scheduler.schedule( - pipeline, - cron="0 9 * * *", # Quotidien à 9h - args=("daily",) -) -await scheduler.start() -``` - -Voir [Guide de Planification]({{ '/fr/guides/scheduling/' | relative_url }}) pour détails complets sur `PipelineScheduler`. - ---- - -## Chemin d'Apprentissage - -Après cet exemple: - -1. **[Guide de Planification]({{ '/fr/guides/scheduling/' | relative_url }})** — Planification cron et intervalle complète -2. **[PipelineScheduler API]({{ '/fr/api/core.md#pipelinescheduler' | relative_url }})** — Référence API -3. **[Guide de Retry]({{ '/fr/guides/retry/' | relative_url }})** — Gestion échecs pipelines planifiés - ---- - -*Cet exemple montre bases planification par label. Pour production, explorer PipelineScheduler avec stores de jobs externes (PostgreSQL/Redis).* + +--- + +## Aperçu + +Cet exemple démontre comment planifier des pipelines pour exécution périodique en utilisant `LabelBasedScheduler`. Il couvre: + +- Planification cron (avec précision seconde) +- Planification par intervalle +- Lister et inspecter jobs planifiés + +**Note** : Cet exemple utilise `LabelBasedScheduler`, mécanisme de planification par labels de TaskIQ. Pour planification cron production, considérer `PipelineScheduler` avec intégration APScheduler. 
+ +--- + +## Ce Que Cet Exemple Montre + +- Créer un pipeline simple +- Utiliser `LabelBasedScheduler` pour planifier exécutions pipeline +- Expressions cron avec précision seconde +- Planification par intervalle +- Lister schedules actifs + +--- + +## Explication du Code + +```python +import asyncio +from taskiq import InMemoryBroker +from taskiq_flow import Pipeline, PipelineMiddleware +from taskiq_flow.scheduling import LabelBasedScheduler + +# Créer broker +broker = InMemoryBroker(await_inplace=True).with_middlewares(PipelineMiddleware()) + +# Définir tâche simple +@broker.task +async def log_message(msg: str) -> str: + """Logger un message.""" + return f"Traited: {msg}" + +async def main(): + # Créer pipeline + pipeline = Pipeline(broker).call_next(log_message) + + # Créer scheduler + scheduler = LabelBasedScheduler(broker) + + # Planifier avec expression cron (toutes les 5 secondes) + schedule_id = await scheduler.schedule_with_cron( + pipeline=pipeline, + label="every-5-seconds", + cron="*/5 * * * * *", # cron 6 champs pour précision seconde + args=("Hello from scheduled pipeline!",), + ) + print(f"Scheduled with cron: {schedule_id}") + + # Planifier avec intervalle (toutes les 3 secondes) + interval_id = await scheduler.schedule_with_interval( + pipeline=pipeline, + label="every-3-seconds", + interval_seconds=3, + args=("Interval scheduled run!",), + ) + print(f"Scheduled with interval: {interval_id}") + + # Attendre quelques exécutions + print("Waiting for pipeline executions (12 seconds)...") + await asyncio.sleep(12) + + # Lister jobs planifiés + schedules = scheduler.list_schedules() + print(f"Active schedules: {len(schedules)}") + for sched in schedules: + print(f" - {sched['label']}: cron={sched.get('cron')}, enabled={sched['enabled']}") + +asyncio.run(main()) +``` + +--- + +## Méthodes de Planification + +### Planification Cron + +```python +schedule_id = await scheduler.schedule_with_cron( + pipeline=pipeline, + label="mon-schedule", + cron="*/5 * * 
* * *", # Toutes 5 secondes (cron 6 champs) + args=("message",), +) +``` + +**Format cron 6 champs**:`seconde minute Heure jour mois jour-semaine` + +Exemples: +- `*/5 * * * * *` — Toutes les 5 secondes +- `0 * * * * *` — Toutes les minutes à seconde 0 +- `0 0 * * * *` — Toutes les heures à minute 0, seconde 0 + +### Planification Intervalle + +```python +interval_id = await scheduler.schedule_with_interval( + pipeline=pipeline, + label="interval-3s", + interval_seconds=3, + args=("message",), +) +``` + +Exécute toutes les N secondes, indépendamment de l'heure système. + +--- + +## Sortie Attendue + +``` +Scheduled with cron: schedule_123456 +Scheduled with interval: interval_789012 +Waiting for pipeline executions (12 seconds)... +INFO:root:Processed: Hello from scheduled pipeline! +INFO:root:Processed: Interval scheduled run! +INFO:root:Processed: Hello from scheduled pipeline! +INFO:root:Processed: Interval scheduled run! +... +Active schedules: 2 + - every-5-seconds: cron=*/5 * * * * *, enabled=True + - every-3-seconds: cron=None, enabled=True +``` + +Vous devriez voir le message logué plusieurs fois comme schedules se déclenchent. + +--- + +## Points Clés + +### Planification par Label + +- Chaque schedule requiert un `label` unique (utilisé pour identification) +- Les labels peuvent activer/désactiver schedules dynamiquement +- Le scheduler gère persistance schedules selon votre broker + +### Limite InMemoryBroker + +Avec `InMemoryBroker`, schedules fonctionnent seulement pendant processus en cours; perdus au redémarrage. Pour planification persistante, utiliser brokers Redis avec stores de schedule appropriés. + +### Multiples Schedules + +Vous pouvez planifier même pipeline plusieurs fois avec labels, expressions cron, ou arguments différents. 
+ +--- + +## Variations + +### Planification Avancée avec PipelineScheduler + +Pour planification plus avancée (timezones, gestion misfire), utiliser `PipelineScheduler`: + +```python +from taskiq_flow import PipelineScheduler + +scheduler = PipelineScheduler(broker) +job_id = await scheduler.schedule( + pipeline, + cron="0 9 * * *", # Quotidien à 9h + args=("daily",) +) +await scheduler.start() +``` + +Voir [Guide de Planification]({{ '/fr/guides/scheduling/' | relative_url }}) pour détails complets sur `PipelineScheduler`. + +--- + +## Chemin d'Apprentissage + +Après cet exemple: + +1. **[Guide de Planification]({{ '/fr/guides/scheduling/' | relative_url }})** — Planification cron et intervalle complète +2. **[PipelineScheduler API]({{ '/fr/api/core.md#pipelinescheduler' | relative_url }})** — Référence API +3. **[Guide de Retry]({{ '/fr/guides/retry/' | relative_url }})** — Gestion échecs pipelines planifiés + +--- + +*Cet exemple montre bases planification par label. Pour production, explorer PipelineScheduler avec stores de jobs externes (PostgreSQL/Redis).* diff --git a/docs/_fr/examples/secure_api_example.md b/docs/_fr/examples/secure_api_example.md index a0fa673..110d707 100644 --- a/docs/_fr/examples/secure_api_example.md +++ b/docs/_fr/examples/secure_api_example.md @@ -1,6 +1,6 @@ --- permalink: /fr/examples/secure-api-example/ -title: Exemple: secure_api_example.py +title: 'Exemple: secure_api_example.py' nav_order: 46 color_scheme: dark --- diff --git a/docs/_fr/examples/tracking-demo.md b/docs/_fr/examples/tracking-demo.md index 1e13864..055121d 100644 --- a/docs/_fr/examples/tracking-demo.md +++ b/docs/_fr/examples/tracking-demo.md @@ -1,183 +1,183 @@ ---- -permalink: /fr/examples/tracking-demo/ -title: Exemple: tracking_demo.py -nav_order: 45 -color_scheme: dark ---- -# Exemple: tracking_demo.py - -**Surveillance en temps réel avec PipelineTrackingManager** - +--- +permalink: /fr/examples/tracking-demo/ +title: 'Exemple: tracking_demo.py' 
+nav_order: 45 +color_scheme: dark +--- +# Exemple: tracking_demo.py + +**Surveillance en temps réel avec PipelineTrackingManager** + > **Version** : {VERSION} | **Fichier** : `examples/tracking_demo.py` - ---- - -## Aperçu - -Cet exemple démontre comment monitorer l'exécution de pipeline en temps réel en utilisant `PipelineTrackingManager`. Il couvre: - -- Configuration du suivi avec sélection automatique de stockage -- Attacher le suivi à un pipeline -- Exécuter un pipeline et vérifier son statut -- Accéder à l'historique d'exécution étape par étape - ---- - -## Ce Que Cet Example Montre - -- Créer un `PipelineTrackingManager` avec stockage auto -- Utiliser `.with_tracking()` sur un pipeline -- Attendre complétion du pipeline -- Interroger le statut du pipeline depuis le tracking manager -- Logger la progression des étapes - ---- - -## Explication du Code - -```python -import asyncio -import logging - -from taskiq import InMemoryBroker -from taskiq_flow import Pipeline, PipelineMiddleware -from taskiq_flow.tracking import PipelineTrackingManager - -logging.basicConfig(level=logging.INFO) -logger = logging.getLogger(__name__) - -# Créer broker -broker = InMemoryBroker(await_inplace=True) - -# Définir une tâche avec délai pour montrer le suivi en action -@broker.task -async def slow_task(x: int) -> int: - """Tâche lente qui double l'entrée.""" - await asyncio.sleep(1) - print(f"Slow task appelée avec {x}") - return x * 2 - -async def main(): - # 1. Configuration suivi avec sélection auto stockage - tracking_manager = PipelineTrackingManager().with_auto_storage(broker) - - # 2. Créer middleware avec tracking manager - middleware = PipelineMiddleware(tracking_manager=tracking_manager) - broker_with_middleware = broker.with_middlewares(middleware) - - # 3. Créer pipeline avec suivi activé - pipeline = ( - Pipeline(broker_with_middleware) - .with_tracking(manager=tracking_manager) - .call_next(slow_task) - .call_next(slow_task) - ) - - # 4. 
Exécuter le pipeline - result = await pipeline.kiq(10) - await result.wait_result() - - # 5. Interroger le statut de tracking - pipeline_id = pipeline.pipeline_id - if pipeline_id is None: - raise RuntimeError("Pipeline has no ID") - - status = await tracking_manager.get_status(pipeline_id) - if status is None: - raise RuntimeError("Failed to get pipeline status") - - logger.info(f"Pipeline status: {status.status}") - logger.info(f"Steps completed: {len(status.steps)}") - -asyncio.run(main()) -``` - ---- - -## Points Clés - -### Configuration Tracking - -```python -tracking_manager = PipelineTrackingManager().with_auto_storage(broker) -``` - -- `with_auto_storage()` sélectionne automatiquement backend stockage selon broker -- Pour `InMemoryBroker`, utilise `InMemoryPipelineStorage` -- Pour brokers Redis, utilise `RedisPipelineStorage` - -### Attacher Suivi au Pipeline - -```python -pipeline = Pipeline(broker).with_tracking(manager=tracking_manager) -``` - -Le tracking manager **doit** être attaché **avant** l'appel à `pipeline.kiq()`. - -### Inspection Statut - -Après exécution, l'objet `PipelineStatus` contient: - -- `status` — Statut global (`COMPLETED`, `FAILED`, etc.) -- `steps` — Liste d'objets `StepStatus`, un par étape -- `started_at` / `completed_at` — Horodatages -- `duration_ms` — Temps exécution total -- `result` — Valeur retour finale (si terminé) - -Chaque `StepStatus` inclut: - -- `step_name` — Nom de la tâche -- `status` — Statut de l'étape -- `duration_ms` — Durée d'exécution -- `result` — Valeur de retour - ---- - -## Sortie Attendue - -``` -INFO:__main__:Pipeline status: COMPLETED -INFO:__main__:Steps completed: 2 -``` - -Vous verrez aussi logs des appels slow_task avec délais 1 seconde. 
- ---- - -## Variations - -### Accéder aux Détails d'Étape - -```python -for step in status.steps: - logger.info(f"Étape '{step.step_name}' a pris {step.duration_ms:.0f}ms") - if step.result: - logger.info(f" Résultat: {step.result}") -``` - -### Suivre Multiples Pipelines - -```python -# Lancer plusieurs pipelines concurremment -tasks = [pipeline.kiq(i) for i in range(5)] -await asyncio.gather(*[t.wait_result() for t in tasks]) - -# Lister tous pipelines suivis -all_statuses = await tracking_manager.list_pipelines() -for s in all_statuses: - print(f"{s.pipeline_id}: {s.status}") -``` - ---- - -## Chemin d'Apprentissage - -Après cet exemple: - -1. **[Guide de Suivi]({{ '/fr/guides/tracking/' | relative_url }})** — Fonctionnalités complètes tracking (backends stockage, métriques) -2. **[Guide WebSocket]({{ '/fr/guides/websocket/' | relative_url }})** — Streaming temps réel événements de tracking -3. **[Guide API]({{ '/fr/guides/api/' | relative_url }})** — Exposer données tracking via REST API - ---- - -*Cet exemple montre les bases. Pour production, utiliser stockage Redis et ajouter écouteurs pour alertes.* + +--- + +## Aperçu + +Cet exemple démontre comment monitorer l'exécution de pipeline en temps réel en utilisant `PipelineTrackingManager`. 
Il couvre: + +- Configuration du suivi avec sélection automatique de stockage +- Attacher le suivi à un pipeline +- Exécuter un pipeline et vérifier son statut +- Accéder à l'historique d'exécution étape par étape + +--- + +## Ce Que Cet Example Montre + +- Créer un `PipelineTrackingManager` avec stockage auto +- Utiliser `.with_tracking()` sur un pipeline +- Attendre complétion du pipeline +- Interroger le statut du pipeline depuis le tracking manager +- Logger la progression des étapes + +--- + +## Explication du Code + +```python +import asyncio +import logging + +from taskiq import InMemoryBroker +from taskiq_flow import Pipeline, PipelineMiddleware +from taskiq_flow.tracking import PipelineTrackingManager + +logging.basicConfig(level=logging.INFO) +logger = logging.getLogger(__name__) + +# Créer broker +broker = InMemoryBroker(await_inplace=True) + +# Définir une tâche avec délai pour montrer le suivi en action +@broker.task +async def slow_task(x: int) -> int: + """Tâche lente qui double l'entrée.""" + await asyncio.sleep(1) + print(f"Slow task appelée avec {x}") + return x * 2 + +async def main(): + # 1. Configuration suivi avec sélection auto stockage + tracking_manager = PipelineTrackingManager().with_auto_storage(broker) + + # 2. Créer middleware avec tracking manager + middleware = PipelineMiddleware(tracking_manager=tracking_manager) + broker_with_middleware = broker.with_middlewares(middleware) + + # 3. Créer pipeline avec suivi activé + pipeline = ( + Pipeline(broker_with_middleware) + .with_tracking(manager=tracking_manager) + .call_next(slow_task) + .call_next(slow_task) + ) + + # 4. Exécuter le pipeline + result = await pipeline.kiq(10) + await result.wait_result() + + # 5. 
Interroger le statut de tracking + pipeline_id = pipeline.pipeline_id + if pipeline_id is None: + raise RuntimeError("Pipeline has no ID") + + status = await tracking_manager.get_status(pipeline_id) + if status is None: + raise RuntimeError("Failed to get pipeline status") + + logger.info(f"Pipeline status: {status.status}") + logger.info(f"Steps completed: {len(status.steps)}") + +asyncio.run(main()) +``` + +--- + +## Points Clés + +### Configuration Tracking + +```python +tracking_manager = PipelineTrackingManager().with_auto_storage(broker) +``` + +- `with_auto_storage()` sélectionne automatiquement backend stockage selon broker +- Pour `InMemoryBroker`, utilise `InMemoryPipelineStorage` +- Pour brokers Redis, utilise `RedisPipelineStorage` + +### Attacher Suivi au Pipeline + +```python +pipeline = Pipeline(broker).with_tracking(manager=tracking_manager) +``` + +Le tracking manager **doit** être attaché **avant** l'appel à `pipeline.kiq()`. + +### Inspection Statut + +Après exécution, l'objet `PipelineStatus` contient: + +- `status` — Statut global (`COMPLETED`, `FAILED`, etc.) +- `steps` — Liste d'objets `StepStatus`, un par étape +- `started_at` / `completed_at` — Horodatages +- `duration_ms` — Temps exécution total +- `result` — Valeur retour finale (si terminé) + +Chaque `StepStatus` inclut: + +- `step_name` — Nom de la tâche +- `status` — Statut de l'étape +- `duration_ms` — Durée d'exécution +- `result` — Valeur de retour + +--- + +## Sortie Attendue + +``` +INFO:__main__:Pipeline status: COMPLETED +INFO:__main__:Steps completed: 2 +``` + +Vous verrez aussi logs des appels slow_task avec délais 1 seconde. 
+ +--- + +## Variations + +### Accéder aux Détails d'Étape + +```python +for step in status.steps: + logger.info(f"Étape '{step.step_name}' a pris {step.duration_ms:.0f}ms") + if step.result: + logger.info(f" Résultat: {step.result}") +``` + +### Suivre Multiples Pipelines + +```python +# Lancer plusieurs pipelines concurremment +tasks = [pipeline.kiq(i) for i in range(5)] +await asyncio.gather(*[t.wait_result() for t in tasks]) + +# Lister tous pipelines suivis +all_statuses = await tracking_manager.list_pipelines() +for s in all_statuses: + print(f"{s.pipeline_id}: {s.status}") +``` + +--- + +## Chemin d'Apprentissage + +Après cet exemple: + +1. **[Guide de Suivi]({{ '/fr/guides/tracking/' | relative_url }})** — Fonctionnalités complètes tracking (backends stockage, métriques) +2. **[Guide WebSocket]({{ '/fr/guides/websocket/' | relative_url }})** — Streaming temps réel événements de tracking +3. **[Guide API]({{ '/fr/guides/api/' | relative_url }})** — Exposer données tracking via REST API + +--- + +*Cet exemple montre les bases. 
Pour production, utiliser stockage Redis et ajouter écouteurs pour alertes.* diff --git a/docs/_fr/examples/websocket-demo.md b/docs/_fr/examples/websocket-demo.md index b32c11b..ff92687 100644 --- a/docs/_fr/examples/websocket-demo.md +++ b/docs/_fr/examples/websocket-demo.md @@ -1,6 +1,6 @@ --- permalink: /fr/examples/websocket-demo/ -title: Exemple: websocket_demo.py +title: 'Exemple: websocket_demo.py' nav_order: 46 color_scheme: dark --- diff --git a/docs/_fr/guides/api.md b/docs/_fr/guides/api.md index 33cfee3..43885da 100644 --- a/docs/_fr/guides/api.md +++ b/docs/_fr/guides/api.md @@ -1,826 +1,872 @@ ---- -title: Guide API REST -nav_order: 28 ---- -# Guide API REST - -**Gestion de pipelines via FastAPI : exécution distante, visualisation, tableaux de bord** - -> **Version** : {VERSION} | **Lié** : [Guide de Suivi]({{ '/fr/guides/tracking/' | relative_url }}), [Guide WebSocket]({{ '/fr/guides/websocket/' | relative_url }}), [Guide Dataflow]({{ '/fr/guides/dataflow/' | relative_url }}) - ---- - -## Aperçu - -Taskiq-Flow inclut une API REST basée sur FastAPI pour gérer les pipelines à distance. Construisez des tableaux de bord, intégrations CI/CD, ou tout système interagissant avec les pipelines via HTTP. - -Ce guide couvre : - -- Configuration du serveur API -- Endpoints disponibles -- Visualisation des DAG -- Exécution distante de pipelines -- Extensions d'endpoints personnalisés -- Authentification et sécurité -- Déploiement en production - ---- - -## 1. Configuration Rapide - -```python -from fastapi import FastAPI -from taskiq import InMemoryBroker -from taskiq_flow import DataflowPipeline, pipeline_task, create_visualization_api - -# 1. Create broker and tasks -broker = InMemoryBroker(await_inplace=True) - -@broker.task -@pipeline_task(output="result") -async def process(data: str) -> dict: - return {"processed": data.upper()} - -# 2. Build pipeline -pipeline = DataflowPipeline.from_tasks(broker, [process]) -pipeline.pipeline_id = "my_pipeline" - -# 3. 
Create FastAPI app with visualization API -app = FastAPI(title="Taskiq-Flow API", version="{VERSION}") -viz_api = create_visualization_api(broker, app) -viz_api.add_pipeline("my_pipeline", pipeline) - -# 4. Run with uvicorn -# uvicorn main:app --reload --port 8000 -``` - -Tous les endpoints sont automatiquement montés sous `/pipelines`. - ---- - -## 2. Endpoints Disponibles - -L'API de visualisation fournit ces routes : - -### 2.1. Health Check - -``` -GET /health -``` - -Retourne statut simple: - -```json -{ - "statut": "healthy", - "timestamp": "2026-05-05T12:00:00Z" -} -``` - -### 2.2. Lister Tous les Pipelines - -``` -GET /pipelines -``` - -Liste tous les pipelines enregistrés avec métadonnées: - -```json -[ - { - "pipeline_id": "analyse_audio_v1", - "pipeline_type": "dataflow", - "tasks": ["extract", "tag", "embed"], - "created_at": "2026-05-05T10:00:00Z" - } -] -``` - -### 2.3. Enregistrer un Nouveau Pipeline - -``` -POST /pipelines/{pipeline_id} -``` - -Corps de requête: - -```json -{ - "pipeline_type": "dataflow", - "tasks": ["task1", "task2"] -} -``` - -Ou utiliser l'API Python directement (recommandé): - -```python -viz_api.add_pipeline("nouveau_pipeline", objet_pipeline) -``` - -### 2.4. Obtenir le Statut d'un Pipeline - -``` -GET /pipelines/{pipeline_id}/status -``` - -Retourne statut d'exécution courant si un run est actif: - -```json -{ - "pipeline_id": "my_pipeline_123", - "status": "RUNNING", - "steps_completed": 3, - "total_steps": 5, - "started_at": "2026-05-05T12:00:00Z" -} -``` - -### 2.5. Obtenir le DAG en JSON - -``` -GET /pipelines/{pipeline_id}/dag -``` - -Retourne la structure de graphe orienté acyclique: - -```json -{ - "nodes": [ - {"id": "extract", "outputs": ["features"]}, - {"id": "tag", "inputs": ["features"], "outputs": ["tags"]}, - {"id": "embed", "inputs": ["features"], "outputs": ["embedding"]} - ], - "edges": [ - {"from": "extract", "to": "tag"}, - {"from": "extract", "to": "embed"} - ] -} -``` - -### 2.6. 
Obtenir le DAG au Format DOT - -``` -GET /pipelines/{pipeline_id}/dag/dot -``` - -Retourne chaîne DOT compatible Graphviz: - -``` -digraph "my_pipeline" { - node [shape=box]; - extract -> tag; - extract -> embed; -} -``` - -### 2.7. Visualisation Complète de Pipeline - -``` -GET /pipelines/{pipeline_id}/visualize -``` - -Retourne métadonnées complètes du pipeline: - -```json -{ - "pipeline_id": "my_pipeline", - "type": "dataflow", - "tasks": [ - { - "name": "extract", - "outputs": ["features"], - "inputs": [], - "description": "Extract audio features" - }, - { - "name": "tag", - "inputs": ["features"], - "outputs": ["tags"], - "description": "Generate tags" - } - ], - "execution_levels": [ - ["extract"], - ["tag", "embed"] - ] -} -``` - ---- - -## 3. Exécution de Pipelines via API - -L'API de base se concentre sur gestion et visualisation. Pour exécuter des pipelines à distance, ajouter un endpoint personnalisé: - -```python -from fastapi import FastAPI, HTTPException -from taskiq_flow.api import PipelineVisualizationAPI - -app = FastAPI() -viz_api = PipelineVisualizationAPI(broker, app) - -@app.post("/pipelines/{pipeline_id}/execute") -async def execute_pipeline( - pipeline_id: str, - parameters: dict, - wait: bool = False, - timeout: int = 30 -): - """ - Exécute un pipeline avec paramètres donnés. 
- - - **pipeline_id**: ID pipeline enregistré - - **parameters**: Dict paramètres d'entrée - - **wait**: Si True, bloque jusqu'à complétion et retourne résultat - - **timeout**: Secondes avant timeout - """ - if pipeline_id not in viz_api.pipelines: - raise HTTPException(status_code=404, detail="Pipeline non trouvé") - - pipeline = viz_api.pipelines[pipeline_id] - - try: - task = await pipeline.kiq_dataflow(**parameters) - - if wait: - result = await task.wait_result(timeout=timeout) - return { - "task_id": task.task_id, - "statut": "COMPLETED", - "resultat": result.return_value - } - else: - return { - "task_id": task.task_id, - "statut": "STARTED" - } - except asyncio.TimeoutError: - raise HTTPException(status_code=504, detail="Exécution pipeline timed out") - except Exception as exc: - raise HTTPException(status_code=500, detail=str(exc)) - -@app.get("/pipelines/result/{task_id}") -async def get_result(task_id: str): - """Récupère le résultat d'une exécution de pipeline.""" - result = await broker.get_result(task_id) - if result is None: - raise HTTPException(status_code=404, detail="Résultat non trouvé ou expiré") - return {"task_id": task_id, "resultat": result.return_value} -``` - -### 3.1. Exécution Async (Fire-and-Forget) - -```bash -curl -X POST "http://localhost:8000/pipelines/my_pipeline/execute" \ - -H "Content-Type: application/json" \ - -d '{"parameters": {"data": "value"}, "wait": false}' - -# Response: -{ - "task_id": "abc123def456", - "status": "STARTED" -} -``` - -### 3.2. Synchronous Execution (Wait for Result) - -```bash -curl -X POST "http://localhost:8000/pipelines/my_pipeline/execute" \ - -H "Content-Type: application/json" \ - -d '{"parameters": {"data": "value"}, "wait": true, "timeout": 60}' - -# Response (after pipeline completion): -{ - "task_id": "abc123def456", - "status": "COMPLETED", - "result": {"processed": "VALUE"} -} -``` - ---- - -## 4. Intégration avec Tableaux de Bord Frontend - -### 4.1. 
Exemple Dashboard React - -```typescript -const PipelineStatus = ({ pipelineId }) => { - const [status, setStatus] = useState(null); - - useEffect(() => { - fetch(`/pipelines/${pipelineId}/status`) - .then(res => res.json()) - .then(data => setStatus(data)); - - // Poll toutes les 5 secondes - const interval = setInterval(() => { - fetch(`/pipelines/${pipelineId}/status`) - .then(res => res.json()) - .then(setStatus); - }, 5000); - - return () => clearInterval(interval); - }, [pipelineId]); - - return ( -
<div> - <p>Pipeline: {pipelineId}</p> - <p>Statut: {status?.statut}</p> - <p>Progression: {status?.étapes_complétées} / {status?.total_étapes}</p> - </div>
- ); -}; -``` - -### 4.2. Visualisation DAG - -Utiliser endpoint DOT avec Graphviz: - -```javascript -const renderDAG = async (pipelineId) => { - const response = await fetch(`/pipelines/${pipelineId}/dag/dot`); - const dot = await response.text(); - - // Utiliser viz.js ou d3-graphviz côté client - d3.select("#dag") - .graphviz() - .renderDot(dot); -}; -``` - ---- - -## 5. Authentification & Sécurité - -### 5.1. Authentification par Clé API - -```python -from fastapi import Security, HTTPException -from fastapi.security import APIKeyHeader - -api_key_header = APIKeyHeader(name="X-API-Key") - -async def verify_api_key(api_key: str = Security(api_key_header)): - if api_key != os.getenv("API_SECRET"): - raise HTTPException(status_code=403, detail="Invalid API key") - return api_key - -@app.get("/pipelines") -async def list_pipelines(api_key: str = Security(verify_api_key)): - return viz_api.list_pipelines() -``` - -### 5.2. Authentification JWT - -```python -from jose import jwt -from fastapi import Depends - -async def get_current_user(token: str = Depends(oauth2_scheme)): - try: - payload = jwt.decode(token, SECRET_KEY, algorithms=["HS256"]) - return payload["sub"] - except jwt.JWTError: - raise HTTPException(status_code=401, detail="Invalid token") - -@app.post("/pipelines/{pipeline_id}/execute") -async def execute( - pipeline_id: str, - parameters: dict, - user: str = Depends(get_current_user) -): - # Logger action user pour audit - logger.info(f"User {user} executed {pipeline_id}") - return await run_pipeline(pipeline_id, parameters) -``` - ---- - -## 6. 
Limitation de Débit (Rate Limiting) - -Protéger l'API contre abus: - -```python -from slowapi import Limiter, _rate_limit_exceeded_handler -from slowapi.util import get_remote_address -from slowapi.errors import RateLimitExceeded - -limiter = Limiter(key_func=get_remote_address) -app.state.limiter = limiter -app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler) - -@app.post("/pipelines/{pipeline_id}/execute") -@limiter.limit("10/minute") # Max 10 exécutions par minute par IP -async def execute_pipeline(pipeline_id: str, parameters: dict): - # ... -``` - ---- - -## 7. Configuration CORS - -Permettre requêtes cross-origin pour frontend web: - -```python -from fastapi.middleware.cors import CORSMiddleware - -app.add_middleware( - CORSMiddleware, - allow_origins=["https://votre-dashboard.com"], - allow_credentials=True, - allow_methods=["GET", "POST"], - allow_headers=["*"], -) -``` - ---- - -## 8. Déploiement en Production - -### 8.1. Gunicorn + Workers Uvicorn - -```bash -# Lancer avec multiples workers pour concurrence -gunicorn -k uvicorn.workers.UvicornWorker -w 4 main:app --bind 0.0.0.0:8000 - -# 4 processus workers gèrent requêtes concurrentes -``` - -### 8.2. Docker - -```dockerfile -FROM python:3.12-slim - -WORKDIR /app -COPY requirements.txt . -RUN pip install --no-cache-dir -r requirements.txt - -COPY . . - -CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"] -``` - -```yaml -# docker-compose.yml -services: - api: - build: . - ports: - - "8000:8000" - environment: - - REDIS_URL=redis://redis:6379 - depends_on: - - redis - redis: - image: redis:7-alpine -``` - -### 8.3. Derrière Reverse Proxy (nginx) - -```nginx -server { - listen 80; - server_name api.taskiq-flow.example.com; - - location / { - proxy_pass http://localhost:8000; - proxy_set_header Host $host; - proxy_set_header X-Real-IP $remote_addr; - proxy_http_version 1.1; - proxy_set_header Connection ""; - } -} -``` - -### 8.4. 
HTTPS avec Let's Encrypt - -```bash -# Utiliser certbot avec nginx -sudo certbot --nginx -d api.taskiq-flow.example.com -``` - -Configurer HTTPS → redirect vers HTTP upstream: - -```nginx -location / { - proxy_pass http://localhost:8000; - proxy_set_header X-Forwarded-Proto $scheme; -} -``` - ---- - -## 9. Sécurité de l'API - -### 9.1. Authentification par Clé API - -```python -from fastapi import Security, HTTPException -from fastapi.security import APIKeyHeader - -api_key_header = APIKeyHeader(name="X-API-Key") - -async def verify_api_key(api_key: str = Security(api_key_header)): - if api_key != os.getenv("API_SECRET"): - raise HTTPException(status_code=403, detail="Clé API invalide") - return api_key - -@app.get("/pipelines") -async def list_pipelines(api_key: str = Security(verify_api_key)): - return viz_api.list_pipelines() -``` - -### 9.2. Authentification JWT - -```python -from jose import jwt -from fastapi import Depends - -async def get_current_user(token: str = Depends(oauth2_scheme)): - try: - payload = jwt.decode(token, SECRET_KEY, algorithms=["HS256"]) - return payload["sub"] - except jwt.JWTError: - raise HTTPException(status_code=401, detail="Token invalide") - -@app.post("/pipelines/{pipeline_id}/execute") -async def execute( - pipeline_id: str, - parameters: dict, - user: str = Depends(get_current_user) -): - logger.info(f"Utilisateur {user} a exécuté {pipeline_id}") - return await run_pipeline(pipeline_id, parameters) -``` - -### 9.3. 
Autorisation au Niveau Pipeline - -Définissez les ACLs par pipeline via `pipeline_acls` dans `TaskiqFlowConfig`, -puis utilisez `verify_pipeline_access` comme dépendance de route : - -```python -from fastapi import Depends -from taskiq_flow.config import TaskiqFlowConfig -from taskiq_flow.security.authorization import PipelineAuthorization -from taskiq_flow.security.dependencies import verify_pipeline_access -from taskiq_flow.api import create_visualization_api - -config = TaskiqFlowConfig( - pipeline_acls={ - "my_pipeline": { - "read": ["admin", "viewer"], - "execute": ["admin"], - }, - }, -) -viz_api = create_visualization_api(broker) # lit config automatiquement - -# verify_pipeline_access dépend de get_current_user + authorization -# → utilisez-la directement sur vos endpoints protégés -``` - -### 9.4. Combinaison Middleware + Dépendances de Route - -Pour la production, combinez le middleware global (authentification) avec les dépendances de route (autorisation) : - -```python -from taskiq_flow.security.middleware import SecurityMiddleware -from taskiq_flow.security.auth import APIKeyAuthProvider, JWTAuthProvider -from taskiq_flow.security.authorization import PipelineAuthorization -from taskiq_flow.config import TaskiqFlowConfig - -# 1. Configuration via TaskiqFlowConfig (champs plats, pas de SecurityConfig imbriqué) -config = TaskiqFlowConfig( - security_enabled=True, - auth_provider="api_key", - api_keys={ - "admin-key": { - "role": "admin", - "pipelines": ["*"], - "permissions": ["read", "execute", "admin"], - }, - }, - jwt_secret="super-secret", # pragma: allowlist secret # noqa: S105 — valeur de documentation, pas un secret réel - require_https=True, - pipeline_acls={ - "my_pipeline": {"read": ["admin"], "execute": ["admin"]}, - }, -) - -# 2. 
Construire les composants depuis la config -auth_provider = APIKeyAuthProvider(keys=config.api_keys) -if config.auth_provider == "jwt": - auth_provider = JWTAuthProvider(secret=config.jwt_secret) -authorization = PipelineAuthorization(pipeline_acls=config.pipeline_acls) - -# 3. Middleware global gère authentification + audit -app.add_middleware( - SecurityMiddleware, - auth_provider=auth_provider, - authorization=authorization, -) -``` - -Ou, pour un câblage automatique complet, utilisez `create_visualization_api` qui -construit tous les composants depuis `TaskiqFlowConfig` : - -```python -from taskiq_flow import create_visualization_api - -config = TaskiqFlowConfig( - security_enabled=True, - auth_provider="api_key", - api_keys={"admin-key": {"role": "admin", "pipelines": ["*"], "permissions": ["read", "execute", "admin"]}}, -) -app = create_visualization_api(broker) # sécurité auto-configurée depuis config -``` - -### Pourquoi cette approche hybride ? -- `SecurityMiddleware` place `request.state.user` pour toutes les routes après le routage -- Les paramètres de chemin FastAPI (ex. `pipeline_id`) ne sont disponibles qu'**après** le routage -- Les dépendances de route (ex. `Depends(verify_pipeline_access)`) s'exécutent après le routage → elles peuvent lire `pipeline_id` et vérifier les ACLs -``` - -**Pourquoi cette approche hybride ?** -- Le middleware s'exécute **avant** le routage → `path_params` est vide -- Les dépendances de route s'exécutent **après** le routage → `pipeline_id` est disponible -- Le middleware définit `request.state.user` pour toutes les routes -- Les dépendances lisent `request.state.user` et vérifient l'ACL par pipeline - ---- - -## 10. 
Limitation de Débit - -Protégez l'API contre les abus: - -```python -from slowapi import Limiter, _rate_limit_exceeded_handler -from slowapi.util import get_remote_address -from slowapi.errors import RateLimitExceeded - -limiter = Limiter(key_func=get_remote_address) -app.state.limiter = limiter -app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler) - -@app.post("/pipelines/{pipeline_id}/execute") -@limiter.limit("10/minute") # Max 10 exécutions par minute par IP -async def execute_pipeline(pipeline_id: str, parameters: dict): - # ... -``` - ---- - -## 11. Monitoring de Santé API - -### 11.1. Endpoint Health Check - -```python -from datetime import datetime, timezone -from fastapi import FastAPI -import psutil - -app = FastAPI() - -@app.get("/health") -async def health(): - return { - "statut": "sain", - "timestamp": datetime.now(timezone.utc).isoformat(), - "broker_connecté": broker.is_connected(), - "memoire_mb": psutil.Process().memory_info().rss / 1024 / 1024 - } -``` - -### 11.2. Métriques avec Prometheus - -```python -from prometheus_fastapi_instrumentator import Instrumentator - -Instrumentator().instrument(app).expose(app, endpoint="/metrics") -``` - -Expose `/metrics` avec métriques Prometheus standard (compte requêtes, latence, etc.). - -### 11.3. Versionnement API - -```python -app = FastAPI( - title="API Taskiq-Flow", - version="1.0.0", - docs_url="/docs", - redoc_url="/redoc" -) - -# Préfixer toutes routes avec /api/v1 -from fastapi import APIRouter -api_router = APIRouter(prefix="/api/v1") -api_router.include_router(viz_api.router) -app.include_router(api_router) -``` - ---- - -## 12. 
Gestion des Erreurs - -Gestion centralisée erreurs: - -```python -from fastapi import Request -from fastapi.responses import JSONResponse -from taskiq.exceptions import TaskiqError - -@app.exception_handler(TaskiqError) -async def taskiq_exception_handler(request: Request, exc: TaskiqError): - return JSONResponse( - status_code=500, - content={ - "error": exc.__class__.__name__, - "message": str(exc), - "pipeline_id": getattr(exc, "pipeline_id", None) - } - ) -``` - -Réponses d'erreur standardisées: - -```json -{ - "error": "PipelineExecutionError", - "message": "Task 'process' échoué après 3 retries", - "pipeline_id": "analyse_audio_123", - "step": "extract_audio", - "timestamp": "2026-05-05T12:00:00Z" -} -``` - ---- - -## 13. Exemple Client API - -Client Python pour interagir avec l'API: - -```python -import httpx - -class ClientTaskiqFlow: - def __init__(self, base_url: str, api_key: str = None): - self.base_url = base_url.rstrip("/") - self.headers = {"X-API-Key": api_key} if api_key else {} - - async def list_pipelines(self): - async with httpx.AsyncClient() as client: - resp = await client.get(f"{self.base_url}/pipelines", headers=self.headers) - resp.raise_for_status() - return resp.json() - - async def execute(self, pipeline_id: str, parameters: dict, wait: bool = False): - async with httpx.AsyncClient() as client: - resp = await client.post( - f"{self.base_url}/pipelines/{pipeline_id}/execute", - json={"parameters": parameters, "wait": wait}, - headers=self.headers - ) - resp.raise_for_status() - return resp.json() - - async def get_result(self, task_id: str): - async with httpx.AsyncClient() as client: - resp = await client.get(f"{self.base_url}/pipelines/result/{task_id}", headers=self.headers) - resp.raise_for_status() - return resp.json() - -# Usage -client = ClientTaskiqFlow("http://localhost:8000") -pipelines = await client.list_pipelines() -result = await client.execute("my_pipeline", {"data": "test"}, wait=True) -``` - ---- - -## 14. 
Résumé - -| Fonctionnalité | Endpoint | Méthode | -|----------------|----------|---------| -| Health check | `/health` | GET | -| Lister pipelines | `/pipelines` | GET | -| Statut pipeline | `/pipelines/{id}/status` | GET | -| DAG (JSON) | `/pipelines/{id}/dag` | GET | -| DAG (DOT) | `/pipelines/{id}/dag/dot` | GET | -| Visualisation complète | `/pipelines/{id}/visualize` | GET | -| Exécuter pipeline | `/pipelines/{id}/execute` | POST (custom) | -| Obtenir résultat | `/pipelines/result/{task_id}` | GET (custom) | - -**Point clé**: L'API donne contrôle complet sur cycle de vie pipeline — enregistrer, inspecter, exécuter, récupérer résultats — parfait pour tableaux de bord et intégrations personnalisés. - ---- - -## Prochaines Étapes - -- **[Guide WebSocket]({{ '/fr/guides/websocket/' | relative_url }})** — Streaming d'événements en temps réel pour mises à jour live -- **[Guide Dataflow]({{ '/fr/guides/dataflow/' | relative_url }})** — Pipeline DAG complet avec DataflowPipeline -- **[Guide de Suivi]({{ '/fr/guides/tracking/' | relative_url }})** — Données historiques d'exécution pour analytics -- **[Exemple: Serveur API]({{ '/fr/examples/api-example/' | relative_url }})** — App FastAPI complète fonctionnelle - ---- - -*Gérez des pipelines de n'importe où. Construisez tableaux de bord, automatisation, intégrations.* +--- +title: Guide API REST +nav_order: 28 +--- +# Guide API REST + +**Gestion de pipelines via FastAPI : exécution distante, visualisation, tableaux de bord** + +> **Version** : {VERSION} | **Lié** : [Guide de Suivi]({{ '/fr/guides/tracking/' | relative_url }}), [Guide WebSocket]({{ '/fr/guides/websocket/' | relative_url }}), [Guide Dataflow]({{ '/fr/guides/dataflow/' | relative_url }}) + +--- + +## Aperçu + +Taskiq-Flow inclut une API REST basée sur FastAPI pour gérer les pipelines à distance. Construisez des tableaux de bord, intégrations CI/CD, ou tout système interagissant avec les pipelines via HTTP. 
+ +Ce guide couvre : + +- Configuration du serveur API +- Endpoints disponibles +- Visualisation des DAG +- Exécution distante de pipelines +- Extensions d'endpoints personnalisés +- Authentification et sécurité +- Déploiement en production + +--- + +## 1. Configuration Rapide + +{% raw %} +```python +from fastapi import FastAPI +from taskiq import InMemoryBroker +from taskiq_flow import DataflowPipeline, pipeline_task, create_visualization_api + +# 1. Create broker and tasks +broker = InMemoryBroker(await_inplace=True) + +@broker.task +@pipeline_task(output="result") +async def process(data: str) -> dict: + return {"processed": data.upper()} + +# 2. Build pipeline +pipeline = DataflowPipeline.from_tasks(broker, [process]) +pipeline.pipeline_id = "my_pipeline" + +# 3. Create FastAPI app with visualization API +app = FastAPI(title="Taskiq-Flow API", version="{VERSION}") +viz_api = create_visualization_api(broker, app) +viz_api.add_pipeline("my_pipeline", pipeline) + +# 4. Run with uvicorn +# uvicorn main:app --reload --port 8000 +``` +{% endraw %} +Tous les endpoints sont automatiquement montés sous `/pipelines`. + +--- + +## 2. Endpoints Disponibles + +L'API de visualisation fournit ces routes : + +### 2.1. Health Check + +{% raw %} +``` +GET /health +``` +{% endraw %} +Retourne statut simple: + +{% raw %} +```json +{ + "statut": "healthy", + "timestamp": "2026-05-05T12:00:00Z" +} +``` +{% endraw %} +### 2.2. Lister Tous les Pipelines + +{% raw %} +``` +GET /pipelines +``` +{% endraw %} +Liste tous les pipelines enregistrés avec métadonnées: + +{% raw %} +```json +[ + { + "pipeline_id": "analyse_audio_v1", + "pipeline_type": "dataflow", + "tasks": ["extract", "tag", "embed"], + "created_at": "2026-05-05T10:00:00Z" + } +] +``` +{% endraw %} +### 2.3. 
Enregistrer un Nouveau Pipeline + +{% raw %} +``` +POST /pipelines/{pipeline_id} +``` +{% endraw %} +Corps de requête: + +{% raw %} +```json +{ + "pipeline_type": "dataflow", + "tasks": ["task1", "task2"] +} +``` +{% endraw %} +Ou utiliser l'API Python directement (recommandé): + +{% raw %} +```python +viz_api.add_pipeline("nouveau_pipeline", objet_pipeline) +``` +{% endraw %} +### 2.4. Obtenir le Statut d'un Pipeline + +{% raw %} +``` +GET /pipelines/{pipeline_id}/status +``` +{% endraw %} +Retourne statut d'exécution courant si un run est actif: + +{% raw %} +```json +{ + "pipeline_id": "my_pipeline_123", + "status": "RUNNING", + "steps_completed": 3, + "total_steps": 5, + "started_at": "2026-05-05T12:00:00Z" +} +``` +{% endraw %} +### 2.5. Obtenir le DAG en JSON + +{% raw %} +``` +GET /pipelines/{pipeline_id}/dag +``` +{% endraw %} +Retourne la structure de graphe orienté acyclique: + +{% raw %} +```json +{ + "nodes": [ + {"id": "extract", "outputs": ["features"]}, + {"id": "tag", "inputs": ["features"], "outputs": ["tags"]}, + {"id": "embed", "inputs": ["features"], "outputs": ["embedding"]} + ], + "edges": [ + {"from": "extract", "to": "tag"}, + {"from": "extract", "to": "embed"} + ] +} +``` +{% endraw %} +### 2.6. Obtenir le DAG au Format DOT + +{% raw %} +``` +GET /pipelines/{pipeline_id}/dag/dot +``` +{% endraw %} +Retourne chaîne DOT compatible Graphviz: + +{% raw %} +``` +digraph "my_pipeline" { + node [shape=box]; + extract -> tag; + extract -> embed; +} +``` +{% endraw %} +### 2.7. 
Visualisation Complète de Pipeline + +{% raw %} +``` +GET /pipelines/{pipeline_id}/visualize +``` +{% endraw %} +Retourne métadonnées complètes du pipeline: + +{% raw %} +```json +{ + "pipeline_id": "my_pipeline", + "type": "dataflow", + "tasks": [ + { + "name": "extract", + "outputs": ["features"], + "inputs": [], + "description": "Extract audio features" + }, + { + "name": "tag", + "inputs": ["features"], + "outputs": ["tags"], + "description": "Generate tags" + } + ], + "execution_levels": [ + ["extract"], + ["tag", "embed"] + ] +} +``` +{% endraw %} +--- + +## 3. Exécution de Pipelines via API + +L'API de base se concentre sur gestion et visualisation. Pour exécuter des pipelines à distance, ajouter un endpoint personnalisé: + +{% raw %} +```python +from fastapi import FastAPI, HTTPException +from taskiq_flow.api import PipelineVisualizationAPI + +app = FastAPI() +viz_api = PipelineVisualizationAPI(broker, app) + +@app.post("/pipelines/{pipeline_id}/execute") +async def execute_pipeline( + pipeline_id: str, + parameters: dict, + wait: bool = False, + timeout: int = 30 +): + """ + Exécute un pipeline avec paramètres donnés. 
+ + - **pipeline_id**: ID pipeline enregistré + - **parameters**: Dict paramètres d'entrée + - **wait**: Si True, bloque jusqu'à complétion et retourne résultat + - **timeout**: Secondes avant timeout + """ + if pipeline_id not in viz_api.pipelines: + raise HTTPException(status_code=404, detail="Pipeline non trouvé") + + pipeline = viz_api.pipelines[pipeline_id] + + try: + task = await pipeline.kiq_dataflow(**parameters) + + if wait: + result = await task.wait_result(timeout=timeout) + return { + "task_id": task.task_id, + "statut": "COMPLETED", + "resultat": result.return_value + } + else: + return { + "task_id": task.task_id, + "statut": "STARTED" + } + except asyncio.TimeoutError: + raise HTTPException(status_code=504, detail="Exécution pipeline timed out") + except Exception as exc: + raise HTTPException(status_code=500, detail=str(exc)) + +@app.get("/pipelines/result/{task_id}") +async def get_result(task_id: str): + """Récupère le résultat d'une exécution de pipeline.""" + result = await broker.get_result(task_id) + if result is None: + raise HTTPException(status_code=404, detail="Résultat non trouvé ou expiré") + return {"task_id": task_id, "resultat": result.return_value} +``` +{% endraw %} +### 3.1. Exécution Async (Fire-and-Forget) + +{% raw %} +```bash +curl -X POST "http://localhost:8000/pipelines/my_pipeline/execute" \ + -H "Content-Type: application/json" \ + -d '{"parameters": {"data": "value"}, "wait": false}' + +# Response: +{ + "task_id": "abc123def456", + "status": "STARTED" +} +``` +{% endraw %} +### 3.2. Synchronous Execution (Wait for Result) + +{% raw %} +```bash +curl -X POST "http://localhost:8000/pipelines/my_pipeline/execute" \ + -H "Content-Type: application/json" \ + -d '{"parameters": {"data": "value"}, "wait": true, "timeout": 60}' + +# Response (after pipeline completion): +{ + "task_id": "abc123def456", + "status": "COMPLETED", + "result": {"processed": "VALUE"} +} +``` +{% endraw %} +--- + +## 4. 
Intégration avec Tableaux de Bord Frontend + +### 4.1. Exemple Dashboard React + +{% raw %} +```typescript +const PipelineStatus = ({ pipelineId }) => { + const [status, setStatus] = useState(null); + + useEffect(() => { + fetch(`/pipelines/${pipelineId}/status`) + .then(res => res.json()) + .then(data => setStatus(data)); + + // Poll toutes les 5 secondes + const interval = setInterval(() => { + fetch(`/pipelines/${pipelineId}/status`) + .then(res => res.json()) + .then(setStatus); + }, 5000); + + return () => clearInterval(interval); + }, [pipelineId]); + + return ( +
    <div>
+      <h3>Pipeline: {pipelineId}</h3>
+      <p>Statut: {status?.status}</p>
+      <p>Progression: {status?.steps_completed} / {status?.total_steps}</p>
+    </div>
+ ); +}; +``` +{% endraw %} +### 4.2. Visualisation DAG + +Utiliser endpoint DOT avec Graphviz: + +{% raw %} +```javascript +const renderDAG = async (pipelineId) => { + const response = await fetch(`/pipelines/${pipelineId}/dag/dot`); + const dot = await response.text(); + + // Utiliser viz.js ou d3-graphviz côté client + d3.select("#dag") + .graphviz() + .renderDot(dot); +}; +``` +{% endraw %} +--- + +## 5. Authentification & Sécurité + +### 5.1. Authentification par Clé API + +{% raw %} +```python +from fastapi import Security, HTTPException +from fastapi.security import APIKeyHeader + +api_key_header = APIKeyHeader(name="X-API-Key") + +async def verify_api_key(api_key: str = Security(api_key_header)): + if api_key != os.getenv("API_SECRET"): + raise HTTPException(status_code=403, detail="Invalid API key") + return api_key + +@app.get("/pipelines") +async def list_pipelines(api_key: str = Security(verify_api_key)): + return viz_api.list_pipelines() +``` +{% endraw %} +### 5.2. Authentification JWT + +{% raw %} +```python +from jose import jwt +from fastapi import Depends + +async def get_current_user(token: str = Depends(oauth2_scheme)): + try: + payload = jwt.decode(token, SECRET_KEY, algorithms=["HS256"]) + return payload["sub"] + except jwt.JWTError: + raise HTTPException(status_code=401, detail="Invalid token") + +@app.post("/pipelines/{pipeline_id}/execute") +async def execute( + pipeline_id: str, + parameters: dict, + user: str = Depends(get_current_user) +): + # Logger action user pour audit + logger.info(f"User {user} executed {pipeline_id}") + return await run_pipeline(pipeline_id, parameters) +``` +{% endraw %} +--- + +## 6. 
Limitation de Débit (Rate Limiting) + +Protéger l'API contre abus: + +{% raw %} +```python +from slowapi import Limiter, _rate_limit_exceeded_handler +from slowapi.util import get_remote_address +from slowapi.errors import RateLimitExceeded + +limiter = Limiter(key_func=get_remote_address) +app.state.limiter = limiter +app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler) + +@app.post("/pipelines/{pipeline_id}/execute") +@limiter.limit("10/minute") # Max 10 exécutions par minute par IP +async def execute_pipeline(pipeline_id: str, parameters: dict): + # ... +``` +{% endraw %} +--- + +## 7. Configuration CORS + +Permettre requêtes cross-origin pour frontend web: + +{% raw %} +```python +from fastapi.middleware.cors import CORSMiddleware + +app.add_middleware( + CORSMiddleware, + allow_origins=["https://votre-dashboard.com"], + allow_credentials=True, + allow_methods=["GET", "POST"], + allow_headers=["*"], +) +``` +{% endraw %} +--- + +## 8. Déploiement en Production + +### 8.1. Gunicorn + Workers Uvicorn + +{% raw %} +```bash +# Lancer avec multiples workers pour concurrence +gunicorn -k uvicorn.workers.UvicornWorker -w 4 main:app --bind 0.0.0.0:8000 + +# 4 processus workers gèrent requêtes concurrentes +``` +{% endraw %} +### 8.2. Docker + +{% raw %} +```dockerfile +FROM python:3.12-slim + +WORKDIR /app +COPY requirements.txt . +RUN pip install --no-cache-dir -r requirements.txt + +COPY . . + +CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"] +``` +{% endraw %} +{% raw %} +```yaml +# docker-compose.yml +services: + api: + build: . + ports: + - "8000:8000" + environment: + - REDIS_URL=redis://redis:6379 + depends_on: + - redis + redis: + image: redis:7-alpine +``` +{% endraw %} +### 8.3. 
Derrière Reverse Proxy (nginx) + +{% raw %} +```nginx +server { + listen 80; + server_name api.taskiq-flow.example.com; + + location / { + proxy_pass http://localhost:8000; + proxy_set_header Host $host; + proxy_set_header X-Real-IP $remote_addr; + proxy_http_version 1.1; + proxy_set_header Connection ""; + } +} +``` +{% endraw %} +### 8.4. HTTPS avec Let's Encrypt + +{% raw %} +```bash +# Utiliser certbot avec nginx +sudo certbot --nginx -d api.taskiq-flow.example.com +``` +{% endraw %} +Configurer HTTPS → redirect vers HTTP upstream: + +{% raw %} +```nginx +location / { + proxy_pass http://localhost:8000; + proxy_set_header X-Forwarded-Proto $scheme; +} +``` +{% endraw %} +--- + +## 9. Sécurité de l'API + +### 9.1. Authentification par Clé API + +{% raw %} +```python +from fastapi import Security, HTTPException +from fastapi.security import APIKeyHeader + +api_key_header = APIKeyHeader(name="X-API-Key") + +async def verify_api_key(api_key: str = Security(api_key_header)): + if api_key != os.getenv("API_SECRET"): + raise HTTPException(status_code=403, detail="Clé API invalide") + return api_key + +@app.get("/pipelines") +async def list_pipelines(api_key: str = Security(verify_api_key)): + return viz_api.list_pipelines() +``` +{% endraw %} +### 9.2. Authentification JWT + +{% raw %} +```python +from jose import jwt +from fastapi import Depends + +async def get_current_user(token: str = Depends(oauth2_scheme)): + try: + payload = jwt.decode(token, SECRET_KEY, algorithms=["HS256"]) + return payload["sub"] + except jwt.JWTError: + raise HTTPException(status_code=401, detail="Token invalide") + +@app.post("/pipelines/{pipeline_id}/execute") +async def execute( + pipeline_id: str, + parameters: dict, + user: str = Depends(get_current_user) +): + logger.info(f"Utilisateur {user} a exécuté {pipeline_id}") + return await run_pipeline(pipeline_id, parameters) +``` +{% endraw %} +### 9.3. 
Autorisation au Niveau Pipeline + +Définissez les ACLs par pipeline via `pipeline_acls` dans `TaskiqFlowConfig`, +puis utilisez `verify_pipeline_access` comme dépendance de route : + +{% raw %} +```python +from fastapi import Depends +from taskiq_flow.config import TaskiqFlowConfig +from taskiq_flow.security.authorization import PipelineAuthorization +from taskiq_flow.security.dependencies import verify_pipeline_access +from taskiq_flow.api import create_visualization_api + +config = TaskiqFlowConfig( + pipeline_acls={ + "my_pipeline": { + "read": ["admin", "viewer"], + "execute": ["admin"], + }, + }, +) +viz_api = create_visualization_api(broker) # lit config automatiquement + +# verify_pipeline_access dépend de get_current_user + authorization +# → utilisez-la directement sur vos endpoints protégés +``` +{% endraw %} +### 9.4. Combinaison Middleware + Dépendances de Route + +Pour la production, combinez le middleware global (authentification) avec les dépendances de route (autorisation) : + +{% raw %} +```python +from taskiq_flow.security.middleware import SecurityMiddleware +from taskiq_flow.security.auth import APIKeyAuthProvider, JWTAuthProvider +from taskiq_flow.security.authorization import PipelineAuthorization +from taskiq_flow.config import TaskiqFlowConfig + +# 1. Configuration via TaskiqFlowConfig (champs plats, pas de SecurityConfig imbriqué) +config = TaskiqFlowConfig( + security_enabled=True, + auth_provider="api_key", + api_keys={ + "admin-key": { + "role": "admin", + "pipelines": ["*"], + "permissions": ["read", "execute", "admin"], + }, + }, + jwt_secret="super-secret", # pragma: allowlist secret # noqa: S105 — valeur de documentation, pas un secret réel + require_https=True, + pipeline_acls={ + "my_pipeline": {"read": ["admin"], "execute": ["admin"]}, + }, +) + +# 2. 
Construire les composants depuis la config +auth_provider = APIKeyAuthProvider(keys=config.api_keys) +if config.auth_provider == "jwt": + auth_provider = JWTAuthProvider(secret=config.jwt_secret) +authorization = PipelineAuthorization(pipeline_acls=config.pipeline_acls) + +# 3. Middleware global gère authentification + audit +app.add_middleware( + SecurityMiddleware, + auth_provider=auth_provider, + authorization=authorization, +) +``` +{% endraw %} +Ou, pour un câblage automatique complet, utilisez `create_visualization_api` qui +construit tous les composants depuis `TaskiqFlowConfig` : + +{% raw %} +```python +from taskiq_flow import create_visualization_api + +config = TaskiqFlowConfig( + security_enabled=True, + auth_provider="api_key", + api_keys={"admin-key": {"role": "admin", "pipelines": ["*"], "permissions": ["read", "execute", "admin"]}}, +) +app = create_visualization_api(broker) # sécurité auto-configurée depuis config +``` +{% endraw %} +### Pourquoi cette approche hybride ? +- `SecurityMiddleware` place `request.state.user` pour toutes les routes après le routage +- Les paramètres de chemin FastAPI (ex. `pipeline_id`) ne sont disponibles qu'**après** le routage +- Les dépendances de route (ex. `Depends(verify_pipeline_access)`) s'exécutent après le routage → elles peuvent lire `pipeline_id` et vérifier les ACLs +{% raw %} +``` + +**Pourquoi cette approche hybride ?** +- Le middleware s'exécute **avant** le routage → `path_params` est vide +- Les dépendances de route s'exécutent **après** le routage → `pipeline_id` est disponible +- Le middleware définit `request.state.user` pour toutes les routes +- Les dépendances lisent `request.state.user` et vérifient l'ACL par pipeline + +--- + +## 10. 
Limitation de Débit + +Protégez l'API contre les abus: + +```python +{% endraw %} +from slowapi.util import get_remote_address +from slowapi.errors import RateLimitExceeded + +limiter = Limiter(key_func=get_remote_address) +app.state.limiter = limiter +app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler) + +@app.post("/pipelines/{pipeline_id}/execute") +@limiter.limit("10/minute") # Max 10 exécutions par minute par IP +async def execute_pipeline(pipeline_id: str, parameters: dict): + # ... +{% raw %} +``` + +--- + +## 11. Monitoring de Santé API + +### 11.1. Endpoint Health Check + +```python +{% endraw %} +from fastapi import FastAPI +import psutil + +app = FastAPI() + +@app.get("/health") +async def health(): + return { + "statut": "sain", + "timestamp": datetime.now(timezone.utc).isoformat(), + "broker_connecté": broker.is_connected(), + "memoire_mb": psutil.Process().memory_info().rss / 1024 / 1024 + } +{% raw %} +``` + +### 11.2. Métriques avec Prometheus + +```python +{% endraw %} + +Instrumentator().instrument(app).expose(app, endpoint="/metrics") +{% raw %} +``` + +Expose `/metrics` avec métriques Prometheus standard (compte requêtes, latence, etc.). + +### 11.3. Versionnement API + +```python +{% endraw %} + title="API Taskiq-Flow", + version="1.0.0", + docs_url="/docs", + redoc_url="/redoc" +) + +# Préfixer toutes routes avec /api/v1 +from fastapi import APIRouter +api_router = APIRouter(prefix="/api/v1") +api_router.include_router(viz_api.router) +app.include_router(api_router) +{% raw %} +``` + +--- + +## 12. 
Gestion des Erreurs + +Gestion centralisée erreurs: + +```python +{% endraw %} +from fastapi.responses import JSONResponse +from taskiq.exceptions import TaskiqError + +@app.exception_handler(TaskiqError) +async def taskiq_exception_handler(request: Request, exc: TaskiqError): + return JSONResponse( + status_code=500, + content={ + "error": exc.__class__.__name__, + "message": str(exc), + "pipeline_id": getattr(exc, "pipeline_id", None) + } + ) +{% raw %} +``` + +Réponses d'erreur standardisées: + +```json +{% endraw %} + "error": "PipelineExecutionError", + "message": "Task 'process' échoué après 3 retries", + "pipeline_id": "analyse_audio_123", + "step": "extract_audio", + "timestamp": "2026-05-05T12:00:00Z" +} +{% raw %} +``` + +--- + +## 13. Exemple Client API + +Client Python pour interagir avec l'API: + +```python +{% endraw %} + +class ClientTaskiqFlow: + def __init__(self, base_url: str, api_key: str = None): + self.base_url = base_url.rstrip("/") + self.headers = {"X-API-Key": api_key} if api_key else {} + + async def list_pipelines(self): + async with httpx.AsyncClient() as client: + resp = await client.get(f"{self.base_url}/pipelines", headers=self.headers) + resp.raise_for_status() + return resp.json() + + async def execute(self, pipeline_id: str, parameters: dict, wait: bool = False): + async with httpx.AsyncClient() as client: + resp = await client.post( + f"{self.base_url}/pipelines/{pipeline_id}/execute", + json={"parameters": parameters, "wait": wait}, + headers=self.headers + ) + resp.raise_for_status() + return resp.json() + + async def get_result(self, task_id: str): + async with httpx.AsyncClient() as client: + resp = await client.get(f"{self.base_url}/pipelines/result/{task_id}", headers=self.headers) + resp.raise_for_status() + return resp.json() + +# Usage +client = ClientTaskiqFlow("http://localhost:8000") +pipelines = await client.list_pipelines() +result = await client.execute("my_pipeline", {"data": "test"}, wait=True) +{% raw %} +``` + 
+--- + +## 14. Résumé + +| Fonctionnalité | Endpoint | Méthode | +|----------------|----------|---------| +| Health check | `/health` | GET | +| Lister pipelines | `/pipelines` | GET | +| Statut pipeline | `/pipelines/{id}/status` | GET | +| DAG (JSON) | `/pipelines/{id}/dag` | GET | +| DAG (DOT) | `/pipelines/{id}/dag/dot` | GET | +| Visualisation complète | `/pipelines/{id}/visualize` | GET | +| Exécuter pipeline | `/pipelines/{id}/execute` | POST (custom) | +| Obtenir résultat | `/pipelines/result/{task_id}` | GET (custom) | + +**Point clé**: L'API donne contrôle complet sur cycle de vie pipeline — enregistrer, inspecter, exécuter, récupérer résultats — parfait pour tableaux de bord et intégrations personnalisés. + +--- + +## Prochaines Étapes + +- **[Guide WebSocket]({{ '/fr/guides/websocket/' | relative_url }})** — Streaming d'événements en temps réel pour mises à jour live +- **[Guide Dataflow]({{ '/fr/guides/dataflow/' | relative_url }})** — Pipeline DAG complet avec DataflowPipeline +- **[Guide de Suivi]({{ '/fr/guides/tracking/' | relative_url }})** — Données historiques d'exécution pour analytics +- **[Exemple: Serveur API]({{ '/fr/examples/api-example/' | relative_url }})** — App FastAPI complète fonctionnelle + +--- + +*Gérez des pipelines de n'importe où. 
Construisez tableaux de bord, automatisation, intégrations.* + +{% endraw %} \ No newline at end of file diff --git a/docs/_fr/guides/cache.md b/docs/_fr/guides/cache.md index c671751..6c0df00 100644 --- a/docs/_fr/guides/cache.md +++ b/docs/_fr/guides/cache.md @@ -1,246 +1,246 @@ ---- -title: Guide des Middlewares Stockage & Cache -nav_order: 23 ---- -# Guide des Middlewares Stockage & Cache - -**Persistance centralisée avec StorageMiddleware et cache Dogpile avec CacheMiddleware** - -> **Version** : {VERSION} | **Nouveau en v1.2.0** | **Lié** : [Guide d'Exécution]({{ '/fr/guides/execution/' | relative_url }}), [Référence API — Stockage]({{ '/fr/api/storage/' | relative_url }}), [Référence API — Cache]({{ '/fr/api/cache/' | relative_url }}) - ---- - -## Aperçu - -v1.2.0 introduit **deux nouveaux middlewares** qui modularisent la persistance et le cache : - -| Middleware | Responsabilité | -|------------|----------------| -| `StorageMiddleware` | Persistance centralisée et pluggable des résultats, de l'état pipeline et de l'historique d'exécution | -| `CacheMiddleware` | Cache de workers Dogpile pour éviter les exécutions redondantes | - -Implémentent tous deux le cycle de vie `TaskiqMiddleware` (`pre_execute`, `post_save`) et peuvent être **actifs simultanément** avec `PipelineMiddleware`, `TransportMiddleware` et `PipelineRetryMiddleware`. - ---- - -## StorageMiddleware — Persistance Centralisée - -`StorageMiddleware` capture chaque résultat de tâche et le stocke via un `BaseStorageAdapter` configuré. Contrairement à l'ancienne approche où le suivi et la planification persistaient indépendamment, il y a maintenant **un seul magasin unifié**. - -### Pourquoi StorageMiddleware ? 
- -- **Source de vérité unique** — tous les résultats, l'état des pipelines et la planification dans un seul endroit -- **Backend interchangeable** — passez de InMemory à Redis ou SQLite sans modifier le code métier -- **Auto-détection** — `StorageAdapterFactory` choisit automatiquement le bon backend -- **Isolation** — stockage, cache et suivi sont chacun dans leur propre couche - -### Utilisation Basique - -```python -from taskiq import InMemoryBroker -from taskiq_flow import PipelineMiddleware, DataflowPipeline, pipeline_task -from taskiq_flow.middlewares import StorageMiddleware -from taskiq_flow.storage import InMemoryStorageAdapter - -broker = InMemoryBroker(await_inplace=True) -broker.add_middlewares( - StorageMiddleware(storage=InMemoryStorageAdapter(), enabled=True), - PipelineMiddleware(), -) -``` - -### Production avec Redis - -```python -from taskiq_flow.middlewares import StorageMiddleware -from taskiq_flow.storage import RedisStorageAdapter - -broker.add_middlewares( - StorageMiddleware( - storage=RedisStorageAdapter( - redis_url="redis://localhost:6379", - ttl_seconds=86400, - ), - ), - PipelineMiddleware(), -) -``` - -### Clés de Stockage - -`StorageMiddleware` stocke les résultats sous des clés dérivées des labels `TaskiqMessage` : - -| Motif de Clé | Exemple | -|-------------|---------| -| `pipeline:{pipeline_id}:task:{task_id}` | `pipeline:audio_v1:task:abc123` | -| `task:{task_id}` | `task:abc123` | - -Forme de la valeur stockée : -```json -{ - "task_id": "abc123", - "pipeline_id": "audio_v1", - "is_err": false, - "return_value": "{...}", - "error": null, - "execution_time": 0.42 -} -``` - -### TTL et Expiration - -Tous les adaptateurs supportent le TTL par clé. Les entrées expirées sont nettoyées paresseusement à l'accès et activement via `cleanup()` : - ---- - -## CacheMiddleware — Cache Dogpile Workers - -`CacheMiddleware` évite les exécutions redondantes de tâches en mettant en cache les **résultats** des tâches au niveau worker. 
Le pattern Dogpile garantit qu'un seul coroutine régénère une entrée expirée. - -### Pourquoi CacheMiddleware ? - -- **Réduire le travail inutile** — sauter les tâches idempotentes dont les entrées n'ont pas changé -- **Latence plus faible** — les résultats en cache sont retournés instantanément -- **Protection anti-stampede** — verrou Dogpile empêche la foule à l'expiration TTL -- **Backend interchangeable** — InMemory pour mono-worker, Redis pour distribué - -### Ordre des Middlewares - -L'ordre des middlewares importe. `CacheMiddleware` doit être placé **avant** `StorageMiddleware` : - -```python -# Ordre correct — cache vérifié d'abord, stockage ensuite -broker.add_middlewares( - CacheMiddleware(), # ← vérifié en premier - StorageMiddleware(), # ← écrit en base seulement si pas en cache - PipelineMiddleware(), # ← orchestre les tâches en aval -) -``` - -### Surcharges par Tâche via Labels - -```python -# Sur une tâche spécifique, augmenter TTL à 2 heures et cacher les erreurs -result = await tache_couteuse.kiq( - donnees_entree, - labels={"cache_ttl": "7200", "cache_errors": "true"}, -) -``` - ---- - -## StorageAdapterFactory — Configuration Zéro - -`StorageAdapterFactory` crée automatiquement les bons adaptateurs depuis `TaskiqFlowConfig` : - -```python -from taskiq_flow.storage.factory import StorageAdapterFactory -from taskiq_flow.config import TaskiqFlowConfig - -config = TaskiqFlowConfig( - storage_type="redis", - storage_redis_url="redis://localhost:6379", - storage_ttl_seconds=86400, - cache_type="redis", - cache_redis_url="redis://localhost:6379", -) -middlewares = StorageAdapterFactory.create_default_middlewares(config=config) - -broker.add_middlewares( - middlewares["cache"], # CacheMiddleware - middlewares["storage"], # StorageMiddleware - PipelineMiddleware(), -) -``` - -Variables d'environnement (toutes optionnelles) : - -| Var d'env | Description | -|-----------|-------------| -| `TASKIQ_FLOW_STORAGE_TYPE` | `"redis"`, `"sqlite"`, `"inmemory"`, 
`"auto"` | -| `TASKIQ_FLOW_CACHE_TYPE` | `"redis"`, `"inmemory"`, `"auto"` | - ---- - -## Comparatif : Stockage vs Cache - -| Aspect | `StorageMiddleware` | `CacheMiddleware` | -|--------|--------------------|-------------------| -| Objectif | Persistance long terme (état, résultats, planification) | Déduplication court terme des résultats de tâches | -| TTL typique | Heures à jours | Minutes à heures | -| Portée | IDs de pipelines et de tâches | IDs de résultats de tâches individuelles | -| Backend | InMemory / Redis / SQLite | InMemory / Redis | -| Anti-stampede | N/A | Oui | -| Auto-dédup | N/A | Oui | - ---- - -## Monitoring - -### Taux de Hits de Cache - -```python -stats = cache.get_stats() -print(f"Taux de hit : {stats['hit_rate']:.1%}") -print(f"Hits: {stats['hits']}, Misses: {stats['misses']}") -``` - -Ciblez un taux de hit > 80 % pour des pipelines reproduductibles avec entrées stables. - -### Surveillance du Stockage - -```python -from datetime import datetime, timezone - -# Lister tous les pipelines suivis -toutes_cles = await storage.keys("pipeline:*") -print(f"Entrées totales : {len(toutes_cles)}") - -# Nettoyage périodique des entrées expirées -supprimes = await storage.cleanup(ttl_seconds=3600) -print(f"Entrées expirées supprimées : {supprimes}") -``` - ---- - -## Dépannage - -| Symptôme | Cause Probable | Correctif | -|---------|----------------|-----------| -| Tous les caches sont misses | TTL trop court ou entrées trop variables | Augmenter `default_ttl` ; vérifier les arguments des tâches | -| Stampede sur expirations | `InMemoryCacheAdapter` sans Dogpile distribué | Passer à `RedisCacheAdapter` pour verrou distribué | -| Croissance stockage illimitée | Aucun TTL défini | Définir `ttl_seconds` ; exécuter `cleanup()` régulièrement | -| Workers partagent résultats périmés | Redis TTL non appliqué | Vérifier `EXPIRE` Redis ; contrôler la configuration Redis | - ---- - -## Installation Complète Production - -```bash -pip install "taskiq-flow[all]" # 
Toutes les fonctionnalités -docker run -p 6379:6379 redis:7 # Redis pour stockage et cache distribué -``` - -```python -from taskiq_flow.storage.factory import StorageAdapterFactory -from taskiq_flow.config import TaskiqFlowConfig - -config = TaskiqFlowConfig( - storage_type="redis", - storage_redis_url="redis://localhost:6379", - storage_ttl_seconds=86_400, - cache_type="redis", - cache_redis_url="redis://localhost:6379", -) -middlewares = StorageAdapterFactory.create_default_middlewares(config=config) - -broker.add_middlewares( - middlewares["cache"], - middlewares["storage"], - PipelineMiddleware(), -) -``` - ---- - -*Nouveau en v1.2.0. Les deux middlewares sont additifs — ajoutez-les à un broker existant sans refactoring.* +--- +title: Guide des Middlewares Stockage & Cache +nav_order: 23 +--- +# Guide des Middlewares Stockage & Cache + +**Persistance centralisée avec StorageMiddleware et cache Dogpile avec CacheMiddleware** + +> **Version** : {VERSION} | **Nouveau en v1.2.0** | **Lié** : [Guide d'Exécution]({{ '/fr/guides/execution/' | relative_url }}), [Référence API — Stockage]({{ '/fr/api/storage/' | relative_url }}), [Référence API — Cache]({{ '/fr/api/cache/' | relative_url }}) + +--- + +## Aperçu + +v1.2.0 introduit **deux nouveaux middlewares** qui modularisent la persistance et le cache : + +| Middleware | Responsabilité | +|------------|----------------| +| `StorageMiddleware` | Persistance centralisée et pluggable des résultats, de l'état pipeline et de l'historique d'exécution | +| `CacheMiddleware` | Cache de workers Dogpile pour éviter les exécutions redondantes | + +Implémentent tous deux le cycle de vie `TaskiqMiddleware` (`pre_execute`, `post_save`) et peuvent être **actifs simultanément** avec `PipelineMiddleware`, `TransportMiddleware` et `PipelineRetryMiddleware`. + +--- + +## StorageMiddleware — Persistance Centralisée + +`StorageMiddleware` capture chaque résultat de tâche et le stocke via un `BaseStorageAdapter` configuré. 
Contrairement à l'ancienne approche où le suivi et la planification persistaient indépendamment, il y a maintenant **un seul magasin unifié**. + +### Pourquoi StorageMiddleware ? + +- **Source de vérité unique** — tous les résultats, l'état des pipelines et la planification dans un seul endroit +- **Backend interchangeable** — passez de InMemory à Redis ou SQLite sans modifier le code métier +- **Auto-détection** — `StorageAdapterFactory` choisit automatiquement le bon backend +- **Isolation** — stockage, cache et suivi sont chacun dans leur propre couche + +### Utilisation Basique + +```python +from taskiq import InMemoryBroker +from taskiq_flow import PipelineMiddleware, DataflowPipeline, pipeline_task +from taskiq_flow.middlewares import StorageMiddleware +from taskiq_flow.storage import InMemoryStorageAdapter + +broker = InMemoryBroker(await_inplace=True) +broker.add_middlewares( + StorageMiddleware(storage=InMemoryStorageAdapter(), enabled=True), + PipelineMiddleware(), +) +``` + +### Production avec Redis + +```python +from taskiq_flow.middlewares import StorageMiddleware +from taskiq_flow.storage import RedisStorageAdapter + +broker.add_middlewares( + StorageMiddleware( + storage=RedisStorageAdapter( + redis_url="redis://localhost:6379", + ttl_seconds=86400, + ), + ), + PipelineMiddleware(), +) +``` + +### Clés de Stockage + +`StorageMiddleware` stocke les résultats sous des clés dérivées des labels `TaskiqMessage` : + +| Motif de Clé | Exemple | +|-------------|---------| +| `pipeline:{pipeline_id}:task:{task_id}` | `pipeline:audio_v1:task:abc123` | +| `task:{task_id}` | `task:abc123` | + +Forme de la valeur stockée : +```json +{ + "task_id": "abc123", + "pipeline_id": "audio_v1", + "is_err": false, + "return_value": "{...}", + "error": null, + "execution_time": 0.42 +} +``` + +### TTL et Expiration + +Tous les adaptateurs supportent le TTL par clé. 
Les entrées expirées sont nettoyées paresseusement à l'accès et activement via `cleanup()` : + +--- + +## CacheMiddleware — Cache Dogpile Workers + +`CacheMiddleware` évite les exécutions redondantes de tâches en mettant en cache les **résultats** des tâches au niveau worker. Le pattern Dogpile garantit qu'un seul coroutine régénère une entrée expirée. + +### Pourquoi CacheMiddleware ? + +- **Réduire le travail inutile** — sauter les tâches idempotentes dont les entrées n'ont pas changé +- **Latence plus faible** — les résultats en cache sont retournés instantanément +- **Protection anti-stampede** — verrou Dogpile empêche la foule à l'expiration TTL +- **Backend interchangeable** — InMemory pour mono-worker, Redis pour distribué + +### Ordre des Middlewares + +L'ordre des middlewares importe. `CacheMiddleware` doit être placé **avant** `StorageMiddleware` : + +```python +# Ordre correct — cache vérifié d'abord, stockage ensuite +broker.add_middlewares( + CacheMiddleware(), # ← vérifié en premier + StorageMiddleware(), # ← écrit en base seulement si pas en cache + PipelineMiddleware(), # ← orchestre les tâches en aval +) +``` + +### Surcharges par Tâche via Labels + +```python +# Sur une tâche spécifique, augmenter TTL à 2 heures et cacher les erreurs +result = await tache_couteuse.kiq( + donnees_entree, + labels={"cache_ttl": "7200", "cache_errors": "true"}, +) +``` + +--- + +## StorageAdapterFactory — Configuration Zéro + +`StorageAdapterFactory` crée automatiquement les bons adaptateurs depuis `TaskiqFlowConfig` : + +```python +from taskiq_flow.storage.factory import StorageAdapterFactory +from taskiq_flow.config import TaskiqFlowConfig + +config = TaskiqFlowConfig( + storage_type="redis", + storage_redis_url="redis://localhost:6379", + storage_ttl_seconds=86400, + cache_type="redis", + cache_redis_url="redis://localhost:6379", +) +middlewares = StorageAdapterFactory.create_default_middlewares(config=config) + +broker.add_middlewares( + middlewares["cache"], # 
CacheMiddleware + middlewares["storage"], # StorageMiddleware + PipelineMiddleware(), +) +``` + +Variables d'environnement (toutes optionnelles) : + +| Var d'env | Description | +|-----------|-------------| +| `TASKIQ_FLOW_STORAGE_TYPE` | `"redis"`, `"sqlite"`, `"inmemory"`, `"auto"` | +| `TASKIQ_FLOW_CACHE_TYPE` | `"redis"`, `"inmemory"`, `"auto"` | + +--- + +## Comparatif : Stockage vs Cache + +| Aspect | `StorageMiddleware` | `CacheMiddleware` | +|--------|--------------------|-------------------| +| Objectif | Persistance long terme (état, résultats, planification) | Déduplication court terme des résultats de tâches | +| TTL typique | Heures à jours | Minutes à heures | +| Portée | IDs de pipelines et de tâches | IDs de résultats de tâches individuelles | +| Backend | InMemory / Redis / SQLite | InMemory / Redis | +| Anti-stampede | N/A | Oui | +| Auto-dédup | N/A | Oui | + +--- + +## Monitoring + +### Taux de Hits de Cache + +```python +stats = cache.get_stats() +print(f"Taux de hit : {stats['hit_rate']:.1%}") +print(f"Hits: {stats['hits']}, Misses: {stats['misses']}") +``` + +Ciblez un taux de hit > 80 % pour des pipelines reproductibles avec entrées stables. 
+ +### Surveillance du Stockage + +```python +from datetime import datetime, timezone + +# Lister tous les pipelines suivis +toutes_cles = await storage.keys("pipeline:*") +print(f"Entrées totales : {len(toutes_cles)}") + +# Nettoyage périodique des entrées expirées +supprimes = await storage.cleanup(ttl_seconds=3600) +print(f"Entrées expirées supprimées : {supprimes}") +``` + +--- + +## Dépannage + +| Symptôme | Cause Probable | Correctif | +|---------|----------------|-----------| +| Tous les caches sont misses | TTL trop court ou entrées trop variables | Augmenter `default_ttl` ; vérifier les arguments des tâches | +| Stampede sur expirations | `InMemoryCacheAdapter` sans Dogpile distribué | Passer à `RedisCacheAdapter` pour verrou distribué | +| Croissance stockage illimitée | Aucun TTL défini | Définir `ttl_seconds` ; exécuter `cleanup()` régulièrement | +| Workers partagent résultats périmés | Redis TTL non appliqué | Vérifier `EXPIRE` Redis ; contrôler la configuration Redis | + +--- + +## Installation Complète Production + +```bash +pip install "taskiq-flow[all]" # Toutes les fonctionnalités +docker run -p 6379:6379 redis:7 # Redis pour stockage et cache distribué +``` + +```python +from taskiq_flow.storage.factory import StorageAdapterFactory +from taskiq_flow.config import TaskiqFlowConfig + +config = TaskiqFlowConfig( + storage_type="redis", + storage_redis_url="redis://localhost:6379", + storage_ttl_seconds=86_400, + cache_type="redis", + cache_redis_url="redis://localhost:6379", +) +middlewares = StorageAdapterFactory.create_default_middlewares(config=config) + +broker.add_middlewares( + middlewares["cache"], + middlewares["storage"], + PipelineMiddleware(), +) +``` + +--- + +*Nouveau en v1.2.0. 
Les deux middlewares sont additifs — ajoutez-les à un broker existant sans refactoring.* diff --git a/docs/_fr/guides/execution.md b/docs/_fr/guides/execution.md index eb7b74f..3f8e320 100644 --- a/docs/_fr/guides/execution.md +++ b/docs/_fr/guides/execution.md @@ -1,514 +1,514 @@ ---- -title: Guide d'Exécution des Pipelines -nav_order: 22 ---- -# Guide d'Exécution des Pipelines - -**Comprendre les modèles d'exécution, les modes et la gestion des résultats** - -> **Version** : {VERSION} | **S'applique à** : SequentialPipeline, DataflowPipeline, MapReduce | **Voir aussi** : [Guide Dataflow]({{ '/fr/guides/dataflow/' | relative_url }}) - ---- - -## Aperçu - -Ce guide couvre l'exécution des pipelines par Taskiq-Flow, la gestion de la concurrence, la gestion des erreurs et le renvoi des résultats. - ---- - -## 1. Modèles d'Exécution - -### 1.1. Exécution Séquentielle (Pipeline Classique) - -Le `Pipeline` classique exécute les étapes une après l'autre en chaîne linéaire : - -```python -pipeline = Pipeline(broker).call_next(task1).call_next(task2).call_next(task3) -# Ordre d'exécution : task1 → task2 → task3 (synchroniquement) -``` - -**Caractéristiques**: -- Chaque étape attend que la précédente se termine -- Les résultats passent directement d'une étape à la suivante -- Ordre d'exécution prévisible et déterministe -- Adapté aux workflows linéaires - -### 1.2. Exécution Parallèle (Dataflow & Map) - -`DataflowPipeline` parallélise automatiquement les tâches indépendantes: - -```python -@broker.task -@pipeline_task(output="features") -def extract(tracks): ... - -@broker.task -@pipeline_task(output="tags") -def tag(features): ... # Exécuté après extract - -@broker.task -@pipeline_task(output="embedding") -def embed(features): ... 
# Aussi après extract, parallèle à tag - -pipeline = DataflowPipeline.from_tasks(broker, [extract, tag, embed]) -# DAG: extract → (tag et embed en parallèle) -``` - -**Caractéristiques**: -- Les tâches sans dépendances non satisfaites s'exécutent concurremment -- Le DAG détermine l'ordre d'exécution -- Débit maximal pour les opérations indépendantes -- Contrôlé par le paramètre `max_parallel` sur `.map()` et `.reduce()` - -### 1.3. Parallélisme Map-Reduce - -L'utilitaire `MapReduce` traite explicitement les éléments en parallèle: - -```python -from taskiq_flow import MapReduce - -# Traiter 100 éléments avec max 10 workers concurrents -result = await MapReduce.map( - broker, - process_item, - items=items_list, - output="processed", - max_parallel=10 # contrôle le niveau de concurrence -) -``` - -**Contrôle de parallélisme**: -- `max_parallel=None` → concurrence illimitée (à utiliser avec précaution) -- `max_parallel=1` → exécution séquentielle -- Recommandé : `max_parallel = nombre_de_coeurs_CPU * 2` pour les tâches liées au CPU - ---- - -## 2. Démarrer un Pipeline - -Plusieurs façons de lancer l'exécution d'un pipeline: - -### 2.1. `pipeline.kiq(...)` — Fire and Forget - -Retourne une `Task` immédiatement ; vous devez attendre les résultats manuellement: - -```python -task = await pipeline.kiq(entrée_initiale) -# Faire d'autres choses... -result = await task.wait_result() # bloque jusqu'à la fin -``` - -Utiliser quand: -- Vous avez besoin de l'ID de tâche pour des vérifications ultérieures -- Vous voulez démarrer plusieurs pipelines concurremment -- Vous construisez un système de file d'attente de tâches - -### 2.2. `pipeline.kiq_dataflow(...)` — Convenance Dataflow - -Identique à `kiq()` mais spécifique aux DataflowPipeline, avec une sémantique plus claire: - -```python -results = await pipeline.kiq_dataflow(track_paths=["a.mp3", "b.mp3"]) -# Retourne : dict mappant les noms de sortie vers les valeurs -``` - -### 2.3. 
`pipeline.kiq_map_reduce(...)` — Raccourci Map-Reduce - -Combine map et reduce en un seul appel: - -```python -final = await pipeline.kiq_map_reduce( - items=items, - map_output="processed", - reduce_output="final" -) -``` - ---- - -## 3. Attente des Résultats - -### 3.1. Attente Bloquante - -```python -task = await pipeline.kiq(données) -result = await task.wait_result() # bloque -print(result.return_value) -``` - -**Options**: -- `wait_result(timeout=30)` — timeout en secondes (lève `asyncio.TimeoutError`) -- `wait_result(raise_on_error=True)` — re-lance les exceptions des tâches - -### 3.2. Interrogation du Statut (Polling) - -```python -task = await pipeline.kiq(données) - -# Vérifier périodiquement sans bloquer -while not task.is_finished: - await asyncio.sleep(0.5) - statut = await task.get_status() - print(f"Statut: {statut}") -``` - -Utile pour les barres de progression ou applications interactives. - -### 3.3. Récupération par ID de Tâche (Distribué) - -Si vous n'avez que l'ID de tâche (depuis un autre processus): - -```python -from taskiq import Task -task = Task(task_id="abc123", broker=broker) -result = await task.wait_result() -``` - ---- - -## 4. Gestion des Erreurs - -### 4.1. Erreurs au Niveau Tâche - -Quand une tâche échoue, le pipeline : - -- **S'arrête immédiatement** (par défaut) — les tâches restantes sont annulées -- **Continue** si configuré avec des politiques de gestion d'erreurs - -```python -pipeline = Pipeline(broker) - -# Configurer pour continuer malgré les erreurs -pipeline.on_error("continue") # options : "stop", "continue", "retry" - -# Ou utiliser une politique de retry (voir Guide Retry) -pipeline.with_retry( - max_attempts=3, - delay=5, - backoff=2 -) -``` - -### 4.2. 
Erreurs au Niveau Pipeline - -Le pipeline entier peut échouer si: - -- Une tâche critique (sans consommateurs) échoue -- Une tâche dépasse son timeout -- Le broker devient indisponible - -Gérer les erreurs de pipeline avec try/except: - -```python -try: - result = await pipeline.kiq(données) - sortie = await result.wait_result() -except TaskiqError as exc: - print(f"Pipeline échoué: {exc}") - # Accéder aux résultats partiels s'il y en a - if result.is_failed: - print(f"Échec à l'étape: {result.failed_step}") -``` - -### 4.3. Résultats Partiels en Cas d'Échec - -Même si un pipeline échoue, vous pouvez avoir des résultats partiels des étapes complétées: - -```python -result = await pipeline.kiq(données) -try: - sortie = await result.wait_result() -except PipelineError: - # Certaines étapes ont réussi avant l'échec - partiel = result.partial_results # dict des sorties complétées - print(f"Partiel: {partiel}") -``` - ---- - -## 5. Timeouts - -Définir des timeouts au niveau pipeline: - -```python -pipeline = Pipeline(broker) - -# Timeout global pour tout le pipeline (secondes) -pipeline.with_timeout(60) - -# Or per-task timeout via the taskiq decorator -@broker.task(timeout=30) -def slow_task(): ... -``` - -**Comportement des timeouts**: -- Dépasser le timeout annule la tâche en cours -- `asyncio.TimeoutError` est levée -- Le statut du pipeline est défini à `ERROR` - ---- - -## 6. 
Contexte d'Exécution - -Chaque tâche reçoit un paramètre optionnel `context` contenant des métadonnées: - -```python -from taskiq_flow import PipelineContext - -@broker.task -async def my_task(data: str, context: PipelineContext): - print(f"Pipeline ID: {context.pipeline_id}") - print(f"Step index: {context.step_index}") - print(f"Task ID: {context.task_id}") - return data.upper() -``` - -**Champs du contexte**: - -| Champ | Type | Description | -|-------|------|-------------| -| `pipeline_id` | `str` | Identifiant unique de l'instance de pipeline | -| `step_index` | `int` | Index de cette étape dans la séquence | -| `task_id` | `str` | ID de la tâche taskiq sous-jacente | -| `execution_mode` | `str` | `"sequential"`, `"parallel"`, ou `"map_reduce"` | -| `started_at` | `datetime` | Horodatage de démarrage du pipeline | -| `broker` | `BaseBroker` | Référence au broker de tâches | - -Activer le passage de contexte lors de la construction du pipeline: - -```python -pipeline = Pipeline(broker).with_context(enable=True) -``` - ---- - -## 7. 
Moteurs d'Exécution Personnalisés (Avancé) - -Pour un contrôle de bas niveau, utilisez `ExecutionEngine` directement: - -```python -from taskiq_flow import ExecutionEngine, DAGBuilder -from taskiq_flow.dataflow import DataflowRegistry - -# Construire le registry manuellement -registry = DataflowRegistry() -registry.register_task(load, output="raw", inputs=[]) -registry.register_task(process, output="clean", inputs=["raw"]) -registry.register_task(save, output="saved", inputs=["clean"]) - -# Construire le DAG -dag = registry.build_dag() - -# Créer le moteur d'exécution -engine = ExecutionEngine(broker, dag) - -# Exécuter avec des entrées personnalisées -résultats = await engine.execute(inputs={"source_file": "data.csv"}) -print(résultats) # {"raw": ..., "clean": ..., "saved": ...} -``` - -**Quand utiliser ExecutionEngine**: -- Construire des pipelines dynamiques à l'exécution -- Ordonnancement/logique personnalisée hors de l'abstraction Pipeline -- Inspecter la structure du DAG avant exécution -- Intégration avec des gestionnaires de workflow externes - ---- - -## 8. Formes des Résultats - -Les différents types de pipelines renvoient des structures de résultats différentes: - -### 8.1. Résultats de Pipeline Séquentiel - -```python -task = await pipeline.kiq(entrée) -result = await task.wait_result() - -# result.return_value est la sortie finale après toutes les étapes -# Exemple: [3, 3, 3, 3] depuis notre pipeline quickstart -``` - -### 8.2. Résultats de Pipeline Dataflow - -```python -result = await pipeline.kiq_dataflow(données) - -# Retourne un dict mappant chaque nom de sortie vers sa valeur -{ - "features": {...}, - "tags": [...], - "embedding": [...] -} -``` - -### 8.3. Résultats MapReduce - -```python -mappé = await MapReduce.map(...) -print(mappé.return_value) # Liste des résultats mappés - -réduit = await MapReduce.reduce(...) -print(réduit.return_value) # Résultat final agrégé -``` - ---- - -## 9. 
Inspection de l'État du Pipeline - -Interroger le statut du pipeline pendant ou après exécution: - -```python -from taskiq_flow import PipelineTrackingManager - -tracking = PipelineTrackingManager().with_auto_storage(broker) -pipeline = Pipeline(broker).with_tracking(tracking) - -task = await pipeline.kiq(données) - -# Obtenir le statut détaillé -statut = await tracking.get_status(pipeline.pipeline_id) -print(f"Statut: {statut.statut}") # EN_ATTENTE, EN_COURSE, TERMINÉ, ÉCHOUÉ -print(f"Étapes: {len(statut.étapes)}") # Nombre d'étapes complétées -print(f"Démarré: {statut.démarré_à}") -print(f"Terminé: {statut.terminé_à}") - -# Obtenir l'historique étape par étape -for étape in statut.étapes: - print(f" {étape.nom}: {étape.statut} ({étape.durée_ms}ms)") -``` - -**Valeurs de statut**: - -| Statut | Signification | -|--------|---------------| -| `EN_ATTENTE` | Pipeline en file, pas encore démarré | -| `EN_COURSE` | En cours d'exécution | -| `TERMINÉ` | Terminé avec succès | -| `ÉCHOUÉ` | Terminé avec erreur | -| `ANNULÉ` | Annulé manuellement | - -Voir [Guide de Suivi]({{ '/fr/guides/tracking/' | relative_url }}) pour le monitoring avancé. - ---- - -## 10. Débogage de l'Exécution - -### 10.1. Activer les Logs - -```python -import logging -logging.basicConfig(level=logging.DEBUG) - -# Ou configurer des loggers spécifiques -logger = logging.getLogger("taskiq_flow") -logger.setLevel(logging.DEBUG) -``` - -### 10.2. Afficher le DAG Avant Exécution - -```python -pipeline.print_dag() -# Montre les niveaux d'exécution et les dépendances -``` - -### 10.3. Inspecter les Arguments des Tâches - -```python -@broker.task -async def debug_task(data, context: PipelineContext): - print(f"Received: {data}") - print(f"Context: pipeline={context.pipeline_id}, step={context.step_index}") - return data -``` - -### 10.4. 
Middleware de Traçage - -```python -from taskiq_flow.middleware import PipelineMiddleware - -class DebugMiddleware(PipelineMiddleware): - async def on_step_complete(self, ctx, résultat): - print(f"Étape {ctx.task_id} complétée avec: {résultat}") - await super().on_step_complete(ctx, résultat) - -broker.add_middlewares(DebugMiddleware()) -``` - ---- - -## 11. Considérations de Performance - -### 11.1. Limites de Concurrence - -```python -# Limiter le total des tâches parallèles globalement -from taskiq_flow.optimization.parallel import set_max_parallel_tasks -set_max_parallel_tasks(20) # jamais plus de 20 tâches simultanées -``` - -### 11.2. Parallélisme Sélectif - -Toutes les tâches ne bénéficient pas du parallélisme: - -```python -# Tâches liées au CPU: bénéficient du parallélisme jusqu'au nombre de cœurs -# Tâches liées aux E/S: peuvent gérer un parallélisme plus élevé -# Tâches petites/rapides: le surcoût peut l'emporter sur les bénéfices - -# Astuce: Profiler avec différentes valeurs max_parallel -pipeline.map(process_item, items, max_parallel=8) -``` - -### 11.3. Empreinte Mémoire - -L'exécution parallèle charge plus de données en mémoire: - -```python -# Traiter les grands jeux de données par morceaux -morceaux = diviser_en_morceaux(grande_liste, taille_morceau=100) -for morceau in morceaux: - résultats = await pipeline.kiq_dataflow(morceau) - # traiter les résultats avant le morceau suivant -``` - -Voir [Guide de Performance]({{ '/fr/guides/performance/' | relative_url }}) pour des stratégies d'optimisation détaillées. - ---- - -## 12. 
Pièges Courants - -| Problème | Cause | Solution | -|---------|--------|----------| -| Tâches exécutées séquentiellement | `max_parallel=1` ou type de pipeline séquentiel | Utiliser DataflowPipeline ou augmenter le parallélisme | -| `wait_result()` reste bloqué indéfiniment | Broker non partagé, résultats perdus | Utiliser un broker persistant (Redis) avec backend de résultats | -| Tâches reçoivent de mauvaises entrées | Nommage incorrect des paramètres | S'assurer que `@pipeline_task(output=...)` correspond aux noms de paramètres en aval | -| Résultats dans le désordre | Tâches dataflow finissant à des moments différents | Le dict des résultats préserve les noms de sortie, pas l'ordre d'exécution | -| Explosion mémoire | Parallélisme illimité | Définir `max_parallel` ou traiter par lots | -| Deadlock détecté | Dépendance circulaire ou entrée externe manquante | Vérifier le graphe de flux de données pour les cycles ; fournir toutes les entrées externes | -| `kiq_dataflow()` lève "No DAG built" | Aucune tâche ajoutée au pipeline | Utiliser `DataflowPipeline.from_tasks()` ou `add_dataflow_task()` | -| Résultats partiels uniquement | `continue_on_error=True` avec des tâches échouées | Vérifier `PipelineErrorAggregator` ou le rapport d'exécution | - ---- - -## 13. 
Résumé - -| Fonctionnalité | Pipeline Séquentiel | DataflowPipeline | MapReduce | -|----------------|--------------------|------------------|-----------| -| **Exécution** | Chaîne linéaire | DAG automatique | Map parallèle + reduce | -| **Parallélisme** | Aucun (sauf `.group()`) | Automatique (tâches indépendantes) | Explicitement par appel map | -| **Contrôle** | Enchaînement manuel | Dépendances déclaratives | Orienté traitement par lots | -| **Idéal pour** | Workflows linéaires simples | Workflows complexes ramifiés | Transformation de données en masse | - ---- - -## Prochaines Étapes - -- **[Guide des Pipelines]({{ '/fr/guides/pipelines/' | relative_url }})** — Choisir entre types de pipelines et motifs -- **[Guide Dataflow]({{ '/fr/guides/dataflow/' | relative_url }})** — Guide complet sur les pipelines dataflow, DAGs et décorateurs -- **[Guide de Suivi]({{ '/fr/guides/tracking/' | relative_url }})** — Surveillance du statut et historique des pipelines -- **[Guide de Performance]({{ '/fr/guides/performance/' | relative_url }})** — Réglage pour vitesse et ressources - ---- - -*Comprendre l'exécution est essentiel pour construire des pipelines fiables. Découvrez les [Pipelines Dataflow]({{ '/fr/guides/dataflow/' | relative_url }}) pour des workflows complexes.* +--- +title: Guide d'Exécution des Pipelines +nav_order: 22 +--- +# Guide d'Exécution des Pipelines + +**Comprendre les modèles d'exécution, les modes et la gestion des résultats** + +> **Version** : {VERSION} | **S'applique à** : SequentialPipeline, DataflowPipeline, MapReduce | **Voir aussi** : [Guide Dataflow]({{ '/fr/guides/dataflow/' | relative_url }}) + +--- + +## Aperçu + +Ce guide couvre l'exécution des pipelines par Taskiq-Flow, la gestion de la concurrence, la gestion des erreurs et le renvoi des résultats. + +--- + +## 1. Modèles d'Exécution + +### 1.1. 
Exécution Séquentielle (Pipeline Classique) + +Le `Pipeline` classique exécute les étapes une après l'autre en chaîne linéaire : + +```python +pipeline = Pipeline(broker).call_next(task1).call_next(task2).call_next(task3) +# Ordre d'exécution : task1 → task2 → task3 (synchroniquement) +``` + +**Caractéristiques**: +- Chaque étape attend que la précédente se termine +- Les résultats passent directement d'une étape à la suivante +- Ordre d'exécution prévisible et déterministe +- Adapté aux workflows linéaires + +### 1.2. Exécution Parallèle (Dataflow & Map) + +`DataflowPipeline` parallélise automatiquement les tâches indépendantes: + +```python +@broker.task +@pipeline_task(output="features") +def extract(tracks): ... + +@broker.task +@pipeline_task(output="tags") +def tag(features): ... # Exécuté après extract + +@broker.task +@pipeline_task(output="embedding") +def embed(features): ... # Aussi après extract, parallèle à tag + +pipeline = DataflowPipeline.from_tasks(broker, [extract, tag, embed]) +# DAG: extract → (tag et embed en parallèle) +``` + +**Caractéristiques**: +- Les tâches sans dépendances non satisfaites s'exécutent concurremment +- Le DAG détermine l'ordre d'exécution +- Débit maximal pour les opérations indépendantes +- Contrôlé par le paramètre `max_parallel` sur `.map()` et `.reduce()` + +### 1.3. Parallélisme Map-Reduce + +L'utilitaire `MapReduce` traite explicitement les éléments en parallèle: + +```python +from taskiq_flow import MapReduce + +# Traiter 100 éléments avec max 10 workers concurrents +result = await MapReduce.map( + broker, + process_item, + items=items_list, + output="processed", + max_parallel=10 # contrôle le niveau de concurrence +) +``` + +**Contrôle de parallélisme**: +- `max_parallel=None` → concurrence illimitée (à utiliser avec précaution) +- `max_parallel=1` → exécution séquentielle +- Recommandé : `max_parallel = nombre_de_coeurs_CPU * 2` pour les tâches liées au CPU + +--- + +## 2. 
Démarrer un Pipeline + +Plusieurs façons de lancer l'exécution d'un pipeline: + +### 2.1. `pipeline.kiq(...)` — Fire and Forget + +Retourne une `Task` immédiatement ; vous devez attendre les résultats manuellement: + +```python +task = await pipeline.kiq(entrée_initiale) +# Faire d'autres choses... +result = await task.wait_result() # bloque jusqu'à la fin +``` + +Utiliser quand: +- Vous avez besoin de l'ID de tâche pour des vérifications ultérieures +- Vous voulez démarrer plusieurs pipelines concurremment +- Vous construisez un système de file d'attente de tâches + +### 2.2. `pipeline.kiq_dataflow(...)` — Convenance Dataflow + +Identique à `kiq()` mais spécifique aux DataflowPipeline, avec une sémantique plus claire: + +```python +results = await pipeline.kiq_dataflow(track_paths=["a.mp3", "b.mp3"]) +# Retourne : dict mappant les noms de sortie vers les valeurs +``` + +### 2.3. `pipeline.kiq_map_reduce(...)` — Raccourci Map-Reduce + +Combine map et reduce en un seul appel: + +```python +final = await pipeline.kiq_map_reduce( + items=items, + map_output="processed", + reduce_output="final" +) +``` + +--- + +## 3. Attente des Résultats + +### 3.1. Attente Bloquante + +```python +task = await pipeline.kiq(données) +result = await task.wait_result() # bloque +print(result.return_value) +``` + +**Options**: +- `wait_result(timeout=30)` — timeout en secondes (lève `asyncio.TimeoutError`) +- `wait_result(raise_on_error=True)` — re-lance les exceptions des tâches + +### 3.2. Interrogation du Statut (Polling) + +```python +task = await pipeline.kiq(données) + +# Vérifier périodiquement sans bloquer +while not task.is_finished: + await asyncio.sleep(0.5) + statut = await task.get_status() + print(f"Statut: {statut}") +``` + +Utile pour les barres de progression ou applications interactives. + +### 3.3. 
Récupération par ID de Tâche (Distribué) + +Si vous n'avez que l'ID de tâche (depuis un autre processus): + +```python +from taskiq import Task +task = Task(task_id="abc123", broker=broker) +result = await task.wait_result() +``` + +--- + +## 4. Gestion des Erreurs + +### 4.1. Erreurs au Niveau Tâche + +Quand une tâche échoue, le pipeline : + +- **S'arrête immédiatement** (par défaut) — les tâches restantes sont annulées +- **Continue** si configuré avec des politiques de gestion d'erreurs + +```python +pipeline = Pipeline(broker) + +# Configurer pour continuer malgré les erreurs +pipeline.on_error("continue") # options : "stop", "continue", "retry" + +# Ou utiliser une politique de retry (voir Guide Retry) +pipeline.with_retry( + max_attempts=3, + delay=5, + backoff=2 +) +``` + +### 4.2. Erreurs au Niveau Pipeline + +Le pipeline entier peut échouer si: + +- Une tâche critique (sans consommateurs) échoue +- Une tâche dépasse son timeout +- Le broker devient indisponible + +Gérer les erreurs de pipeline avec try/except: + +```python +try: + result = await pipeline.kiq(données) + sortie = await result.wait_result() +except TaskiqError as exc: + print(f"Pipeline échoué: {exc}") + # Accéder aux résultats partiels s'il y en a + if result.is_failed: + print(f"Échec à l'étape: {result.failed_step}") +``` + +### 4.3. Résultats Partiels en Cas d'Échec + +Même si un pipeline échoue, vous pouvez avoir des résultats partiels des étapes complétées: + +```python +result = await pipeline.kiq(données) +try: + sortie = await result.wait_result() +except PipelineError: + # Certaines étapes ont réussi avant l'échec + partiel = result.partial_results # dict des sorties complétées + print(f"Partiel: {partiel}") +``` + +--- + +## 5. 
Timeouts + +Définir des timeouts au niveau pipeline: + +```python +pipeline = Pipeline(broker) + +# Timeout global pour tout le pipeline (secondes) +pipeline.with_timeout(60) + +# Or per-task timeout via the taskiq decorator +@broker.task(timeout=30) +def slow_task(): ... +``` + +**Comportement des timeouts**: +- Dépasser le timeout annule la tâche en cours +- `asyncio.TimeoutError` est levée +- Le statut du pipeline est défini à `ERROR` + +--- + +## 6. Contexte d'Exécution + +Chaque tâche reçoit un paramètre optionnel `context` contenant des métadonnées: + +```python +from taskiq_flow import PipelineContext + +@broker.task +async def my_task(data: str, context: PipelineContext): + print(f"Pipeline ID: {context.pipeline_id}") + print(f"Step index: {context.step_index}") + print(f"Task ID: {context.task_id}") + return data.upper() +``` + +**Champs du contexte**: + +| Champ | Type | Description | +|-------|------|-------------| +| `pipeline_id` | `str` | Identifiant unique de l'instance de pipeline | +| `step_index` | `int` | Index de cette étape dans la séquence | +| `task_id` | `str` | ID de la tâche taskiq sous-jacente | +| `execution_mode` | `str` | `"sequential"`, `"parallel"`, ou `"map_reduce"` | +| `started_at` | `datetime` | Horodatage de démarrage du pipeline | +| `broker` | `BaseBroker` | Référence au broker de tâches | + +Activer le passage de contexte lors de la construction du pipeline: + +```python +pipeline = Pipeline(broker).with_context(enable=True) +``` + +--- + +## 7. 
Moteurs d'Exécution Personnalisés (Avancé) + +Pour un contrôle de bas niveau, utilisez `ExecutionEngine` directement: + +```python +from taskiq_flow import ExecutionEngine, DAGBuilder +from taskiq_flow.dataflow import DataflowRegistry + +# Construire le registry manuellement +registry = DataflowRegistry() +registry.register_task(load, output="raw", inputs=[]) +registry.register_task(process, output="clean", inputs=["raw"]) +registry.register_task(save, output="saved", inputs=["clean"]) + +# Construire le DAG +dag = registry.build_dag() + +# Créer le moteur d'exécution +engine = ExecutionEngine(broker, dag) + +# Exécuter avec des entrées personnalisées +résultats = await engine.execute(inputs={"source_file": "data.csv"}) +print(résultats) # {"raw": ..., "clean": ..., "saved": ...} +``` + +**Quand utiliser ExecutionEngine**: +- Construire des pipelines dynamiques à l'exécution +- Ordonnancement/logique personnalisée hors de l'abstraction Pipeline +- Inspecter la structure du DAG avant exécution +- Intégration avec des gestionnaires de workflow externes + +--- + +## 8. Formes des Résultats + +Les différents types de pipelines renvoient des structures de résultats différentes: + +### 8.1. Résultats de Pipeline Séquentiel + +```python +task = await pipeline.kiq(entrée) +result = await task.wait_result() + +# result.return_value est la sortie finale après toutes les étapes +# Exemple: [3, 3, 3, 3] depuis notre pipeline quickstart +``` + +### 8.2. Résultats de Pipeline Dataflow + +```python +result = await pipeline.kiq_dataflow(données) + +# Retourne un dict mappant chaque nom de sortie vers sa valeur +{ + "features": {...}, + "tags": [...], + "embedding": [...] +} +``` + +### 8.3. Résultats MapReduce + +```python +mappé = await MapReduce.map(...) +print(mappé.return_value) # Liste des résultats mappés + +réduit = await MapReduce.reduce(...) +print(réduit.return_value) # Résultat final agrégé +``` + +--- + +## 9. 
Inspection de l'État du Pipeline + +Interroger le statut du pipeline pendant ou après exécution: + +```python +from taskiq_flow import PipelineTrackingManager + +tracking = PipelineTrackingManager().with_auto_storage(broker) +pipeline = Pipeline(broker).with_tracking(tracking) + +task = await pipeline.kiq(données) + +# Obtenir le statut détaillé +statut = await tracking.get_status(pipeline.pipeline_id) +print(f"Statut: {statut.statut}") # EN_ATTENTE, EN_COURS, TERMINÉ, ÉCHOUÉ +print(f"Étapes: {len(statut.étapes)}") # Nombre d'étapes complétées +print(f"Démarré: {statut.démarré_à}") +print(f"Terminé: {statut.terminé_à}") + +# Obtenir l'historique étape par étape +for étape in statut.étapes: + print(f" {étape.nom}: {étape.statut} ({étape.durée_ms}ms)") +``` + +**Valeurs de statut**: + +| Statut | Signification | +|--------|---------------| +| `EN_ATTENTE` | Pipeline en file, pas encore démarré | +| `EN_COURS` | En cours d'exécution | +| `TERMINÉ` | Terminé avec succès | +| `ÉCHOUÉ` | Terminé avec erreur | +| `ANNULÉ` | Annulé manuellement | + +Voir [Guide de Suivi]({{ '/fr/guides/tracking/' | relative_url }}) pour le monitoring avancé. + +--- + +## 10. Débogage de l'Exécution + +### 10.1. Activer les Logs + +```python +import logging +logging.basicConfig(level=logging.DEBUG) + +# Ou configurer des loggers spécifiques +logger = logging.getLogger("taskiq_flow") +logger.setLevel(logging.DEBUG) +``` + +### 10.2. Afficher le DAG Avant Exécution + +```python +pipeline.print_dag() +# Montre les niveaux d'exécution et les dépendances +``` + +### 10.3. Inspecter les Arguments des Tâches + +```python +@broker.task +async def debug_task(data, context: PipelineContext): + print(f"Received: {data}") + print(f"Context: pipeline={context.pipeline_id}, step={context.step_index}") + return data +``` + +### 10.4. 
Middleware de Traçage + +```python +from taskiq_flow.middleware import PipelineMiddleware + +class DebugMiddleware(PipelineMiddleware): + async def on_step_complete(self, ctx, résultat): + print(f"Étape {ctx.task_id} complétée avec: {résultat}") + await super().on_step_complete(ctx, résultat) + +broker.add_middlewares(DebugMiddleware()) +``` + +--- + +## 11. Considérations de Performance + +### 11.1. Limites de Concurrence + +```python +# Limiter le total des tâches parallèles globalement +from taskiq_flow.optimization.parallel import set_max_parallel_tasks +set_max_parallel_tasks(20) # jamais plus de 20 tâches simultanées +``` + +### 11.2. Parallélisme Sélectif + +Toutes les tâches ne bénéficient pas du parallélisme: + +```python +# Tâches liées au CPU: bénéficient du parallélisme jusqu'au nombre de cœurs +# Tâches liées aux E/S: peuvent gérer un parallélisme plus élevé +# Tâches petites/rapides: le surcoût peut l'emporter sur les bénéfices + +# Astuce: Profiler avec différentes valeurs max_parallel +pipeline.map(process_item, items, max_parallel=8) +``` + +### 11.3. Empreinte Mémoire + +L'exécution parallèle charge plus de données en mémoire: + +```python +# Traiter les grands jeux de données par morceaux +morceaux = diviser_en_morceaux(grande_liste, taille_morceau=100) +for morceau in morceaux: + résultats = await pipeline.kiq_dataflow(morceau) + # traiter les résultats avant le morceau suivant +``` + +Voir [Guide de Performance]({{ '/fr/guides/performance/' | relative_url }}) pour des stratégies d'optimisation détaillées. + +--- + +## 12. 
Pièges Courants + +| Problème | Cause | Solution | +|---------|--------|----------| +| Tâches exécutées séquentiellement | `max_parallel=1` ou type de pipeline séquentiel | Utiliser DataflowPipeline ou augmenter le parallélisme | +| `wait_result()` reste bloqué indéfiniment | Broker non partagé, résultats perdus | Utiliser un broker persistant (Redis) avec backend de résultats | +| Tâches reçoivent de mauvaises entrées | Nommage incorrect des paramètres | S'assurer que `@pipeline_task(output=...)` correspond aux noms de paramètres en aval | +| Résultats dans le désordre | Tâches dataflow finissant à des moments différents | Le dict des résultats préserve les noms de sortie, pas l'ordre d'exécution | +| Explosion mémoire | Parallélisme illimité | Définir `max_parallel` ou traiter par lots | +| Deadlock détecté | Dépendance circulaire ou entrée externe manquante | Vérifier le graphe de flux de données pour les cycles ; fournir toutes les entrées externes | +| `kiq_dataflow()` lève "No DAG built" | Aucune tâche ajoutée au pipeline | Utiliser `DataflowPipeline.from_tasks()` ou `add_dataflow_task()` | +| Résultats partiels uniquement | `continue_on_error=True` avec des tâches échouées | Vérifier `PipelineErrorAggregator` ou le rapport d'exécution | + +--- + +## 13. 
Résumé + +| Fonctionnalité | Pipeline Séquentiel | DataflowPipeline | MapReduce | +|----------------|--------------------|------------------|-----------| +| **Exécution** | Chaîne linéaire | DAG automatique | Map parallèle + reduce | +| **Parallélisme** | Aucun (sauf `.group()`) | Automatique (tâches indépendantes) | Explicitement par appel map | +| **Contrôle** | Enchaînement manuel | Dépendances déclaratives | Orienté traitement par lots | +| **Idéal pour** | Workflows linéaires simples | Workflows complexes ramifiés | Transformation de données en masse | + +--- + +## Prochaines Étapes + +- **[Guide des Pipelines]({{ '/fr/guides/pipelines/' | relative_url }})** — Choisir entre types de pipelines et motifs +- **[Guide Dataflow]({{ '/fr/guides/dataflow/' | relative_url }})** — Guide complet sur les pipelines dataflow, DAGs et décorateurs +- **[Guide de Suivi]({{ '/fr/guides/tracking/' | relative_url }})** — Surveillance du statut et historique des pipelines +- **[Guide de Performance]({{ '/fr/guides/performance/' | relative_url }})** — Réglage pour vitesse et ressources + +--- + +*Comprendre l'exécution est essentiel pour construire des pipelines fiables. Découvrez les [Pipelines Dataflow]({{ '/fr/guides/dataflow/' | relative_url }}) pour des workflows complexes.* diff --git a/docs/_fr/guides/index.md b/docs/_fr/guides/index.md index b7546fa..32fc731 100644 --- a/docs/_fr/guides/index.md +++ b/docs/_fr/guides/index.md @@ -1,27 +1,27 @@ ---- -title: Guides Utilisateur -nav_order: 15 -permalink: /fr/guides/ ---- -# Guides Utilisateur - -Guides détaillés couvrant toutes les fonctionnalités de Taskiq-Flow. 
- -## Guides Disponibles - -| Guide | Description | -|-------|-------------| -| **[Guide des Pipelines]({{ '/fr/guides/pipelines/' | relative_url }})** | Patterns de pipelines séquentiels et dataflow | -| **[Guide des Tâches]({{ '/fr/guides/tasks/' | relative_url }})** | Définitions de tâches et décorateurs | -| **[Guide d'Exécution]({{ '/fr/guides/execution/' | relative_url }})** | Modes d'exécution et gestion d'erreurs | -| **[Guide Stockage & Cache]({{ '/fr/guides/cache/' | relative_url }})** nouveau en v1.2.0 | Persistance centralisée & cache Dogpile workers | -| **[Guide de Suivi]({{ '/fr/guides/tracking/' | relative_url }})** | Monitoring en temps réel | -| **[Guide WebSocket]({{ '/fr/guides/websocket/' | relative_url }})** | Tableaux de bord en direct | -| **[Guide de Planification]({{ '/fr/guides/scheduling/' | relative_url }})** | Planification cron | -| **[Guide de Retry]({{ '/fr/guides/retry/' | relative_url }})** | Récupération d'erreurs | -| **[Guide de Performance]({{ '/fr/guides/performance/' | relative_url }})** | Optimisation | -| **[Guide API REST]({{ '/fr/guides/api/' | relative_url }})** | Intégration FastAPI | - ---- - -*Commencez par le [Guide de Démarrage Rapide]({{ '/fr/quickstart/' | relative_url }}) ou explorez les [Exemples]({{ '/fr/examples/' | relative_url }}).* +--- +title: Guides Utilisateur +nav_order: 15 +permalink: /fr/guides/ +--- +# Guides Utilisateur + +Guides détaillés couvrant toutes les fonctionnalités de Taskiq-Flow. 
+ +## Guides Disponibles + +| Guide | Description | +|-------|-------------| +| **[Guide des Pipelines]({{ '/fr/guides/pipelines/' | relative_url }})** | Patterns de pipelines séquentiels et dataflow | +| **[Guide des Tâches]({{ '/fr/guides/tasks/' | relative_url }})** | Définitions de tâches et décorateurs | +| **[Guide d'Exécution]({{ '/fr/guides/execution/' | relative_url }})** | Modes d'exécution et gestion d'erreurs | +| **[Guide Stockage & Cache]({{ '/fr/guides/cache/' | relative_url }})** nouveau en v1.2.0 | Persistance centralisée & cache Dogpile workers | +| **[Guide de Suivi]({{ '/fr/guides/tracking/' | relative_url }})** | Monitoring en temps réel | +| **[Guide WebSocket]({{ '/fr/guides/websocket/' | relative_url }})** | Tableaux de bord en direct | +| **[Guide de Planification]({{ '/fr/guides/scheduling/' | relative_url }})** | Planification cron | +| **[Guide de Retry]({{ '/fr/guides/retry/' | relative_url }})** | Récupération d'erreurs | +| **[Guide de Performance]({{ '/fr/guides/performance/' | relative_url }})** | Optimisation | +| **[Guide API REST]({{ '/fr/guides/api/' | relative_url }})** | Intégration FastAPI | + +--- + +*Commencez par le [Guide de Démarrage Rapide]({{ '/fr/quickstart/' | relative_url }}) ou explorez les [Exemples]({{ '/fr/examples/' | relative_url }}).* diff --git a/docs/_fr/guides/performance.md b/docs/_fr/guides/performance.md index 6a0282a..a9fc4f3 100644 --- a/docs/_fr/guides/performance.md +++ b/docs/_fr/guides/performance.md @@ -1,547 +1,573 @@ ---- -title: Guide d'Optimisation des Performances -nav_order: 27 -color_scheme: dark ---- -# Guide d'Optimisation des Performances - -**Parallélisme conscient des ressources, optimisation mémoire et stratégies de mise à l'échelle** - -> **Version** : {VERSION} | **Lié** : [Guide d'Exécution]({{ '/fr/guides/execution/' | relative_url }}), [Guide de Suivi]({{ '/fr/guides/tracking/' | relative_url }}) - ---- - -## Aperçu - -Taskiq-Flow est conçu pour une exécution asynchrone hautes 
performances. Ce guide couvre les techniques d'optimisation pour maximiser le débit, minimiser la latence et utiliser efficacement les ressources système. - -Sujets abordés : - -- Réglage du parallélisme (`max_parallel`) -- Profilage CPU et RAM -- Profils de ressources des tâches -- Stratégies de gestion mémoire -- Identification des goulots d'étranglement -- Passage d'un worker unique à distribué - ---- - -## 1. Comprendre le Paysage des Performances - -L'optimisation des performances implique des compromis : - -| Dimension | Ce qui est affecté | Compromis typique | -|-----------|--------------------|-------------------| -| **Concurrence** | Débit (tâches/seconde) | Utilisation mémoire, changement de contexte | -| **Parallélisme** | Utilisation CPU | Surcharge de coordination | -| **Latence** | Temps de complétion des tâches | Consommation de ressources | -| **Mémoire** | Capacité du jeu de données | Pauses GC, efficacité du cache | -| **I/O** | Appels services externes | Bande passante réseau, limites de connexions | - -**Aperçu clé** : Le parallélisme de Taskiq-Flow est limité par les paramètres `max_parallel` à travers les étapes du pipeline, et par les ressources système disponibles (cœurs CPU, RAM). - ---- - -## 2. Réglage du Parallélisme - -### 2.1. Le Paramètre `max_parallel` - -Contrôle l'exécution concurrente des tâches au niveau de l'étape : - -```python -# Pipeline Séquentiel -pipeline.map(process_item, items, max_parallel=10) # Max 10 concurrentes - -# Pipeline Dataflow : configuration au niveau pipeline -pipeline = DataflowPipeline(broker, max_parallel=20) - -# MapReduce -mapped = await MapReduce.map( - broker, - process_item, - items, - max_parallel=15 -) -``` - -**Comportement par défaut** : Sans `max_parallel`, Taskiq-Flow tente d'exécuter toutes les tâches indépendantes concurremment (essentiellement illimité). C'est acceptable pour les petits nombres (<100) mais dangereux pour les grands jeux de données. - -### 2.2. 
Déterminer le `max_parallel` Optimal - -#### Pour les Tâches Liées aux I/O (appels réseau, I/O disque) - -```python -# Attente I/O élevée, CPU faible : peut gérer beaucoup de tâches concurrentes -pipeline.map(fetch_url, url_list, max_parallel=50) -# Règle empirique : 2–5 × nombre de cœurs CPU -``` - -**Justification** : Pendant qu'une tâche attend le réseau, une autre utilise le CPU. Une haute concurrence sature les pipelines d'I/O. - -#### Pour les Tâches Intensives en CPU (calculs, transcodage) - -```python -# Intensif en CPU : limiter au nombre de cœurs (ou légèrement plus) -import os -cpu_cores = os.cpu_count() or 4 -pipeline.map(transcode, files, max_parallel=cpu_cores + 2) -# Règle empirique : cœurs CPU ± 2 -``` - -**Justification** : Le GIL de Python limite le vrai parallélisme ; `asyncio` bénéficie toujours de plusieurs cœurs quand les tâches libèrent le GIL (NumPy, extensions C). Une sur-inscription entraîne des surcoûts de changement de contexte. - -#### Pour les Charges de Travail Mixtes - -Profilez et ajustez : - -```python -# Commencez prudent -for parallel in [5, 10, 20, 50]: - start = time.time() - await pipeline.kiq_dataflow(data) - duration = time.time() - start - print(f"Parallélisme {parallel} : {duration:.2f}s") -``` - -Trouvez le **coude de la courbe** — point où augmenter le parallélisme donne des rendements décroissants. - -### 2.3. Limite Globale de Parallélisme - -Définissez une limite globale sur tous les pipelines : - -```python -from taskiq_flow.optimization.parallel import set_max_parallel_tasks - -set_max_parallel_tasks(100) # Ne jamais dépasser 100 tâches concurrentes globalement -``` - -Utile dans les systèmes multi-tenants pour empêcher un pipeline d'en asphyxier d'autres. - ---- - -## 3. Ordonnancement Conscient des Ressources - -Taskiq-Flow peut ordonnancer les tâches selon les besoins CPU/RAM (nécessite un pool de workers conscient des ressources — avancé). - -### 3.1. 
Annoter les Tâches avec Besoins en Ressources - -```python -from taskiq_flow import CPUProfile, RAMProfile - -@broker.task -@CPUProfile(cpu_units=2) # Nécessite 2 cœurs CPU -@RAMProfile(ram_mb=4096) # Nécessite 4 Go RAM -def heavy_computation(data): - # Ne s'exécutera que sur des workers avec ressources suffisantes - pass -``` - -### 3.2. Pool de Workers Conscient des Ressources - -```python -from taskiq_flow import ResourceAwareWorkerPool - -pool = ResourceAwareWorkerPool( - workers=[ - {"cpu_cores": 8, "ram_gb": 32, "labels": {"gpu": True}}, - {"cpu_cores": 4, "ram_gb": 16, "labels": {"gpu": False}}, - ] -) - -# Les tâches sont automatiquement routées vers les workers compatibles -``` - -**Note** : Cette fonctionnalité nécessite une implémentation worker personnalisée ; les brokers standards ignorent les profils de ressources. - ---- - -## 4. Optimisation Mémoire - -### 4.1. Éviter les Transferts de Données Volumineuses en Mémoire - -Passez des références au lieu des données complètes : - -```python -# Mauvais : copie le jeu de données complet pour chaque appel de tâche -pipeline.map(process, large_dataset) # Chaque tâche reçoit une copie complète - -# Mieux : passez des identifiants, récupérez dans la tâche -@broker.task -def process(item_id: str): - item = database.get(item_id) # Récupération à la demande - return process_item(item) - -pipeline.map(process, item_ids) # Seuls les IDs sont passés -``` - -### 4.2. Streamer les Gros Jeux de Données - -Utilisez le découpage en chunks : - -```python -def chunked(iterable, chunk_size=100): - for i in range(0, len(iterable), chunk_size): - yield iterable[i:i + chunk_size] - -for chunk in chunked(large_list, 100): - results = await pipeline.kiq_dataflow(chunk) - # Traitez les résultats avant le prochain chunk pour libérer la mémoire -``` - -### 4.3. Nettoyer les Résultats Après Utilisation - -Les résultats de pipeline restent dans le stockage de suivi. 
Nettoyez après usage : - -```python -# Après traitement, supprimez l'enregistrement du pipeline -await tracking.delete_pipeline(pipeline.pipeline_id) -``` - -Ou définissez un TTL sur le stockage : - -```python -RedisPipelineStorage(redis, ttl_seconds=86400) # Suppression auto après 1 jour -``` - ---- - -## 5. Profilage & Détection des Goulots d'Étranglement - -### 5.1. Chronométrage Intégré - -Chaque étape enregistre la durée automatiquement (avec le suivi activé) : - -```python -status = await tracking.get_status(pipeline_id) -for step in status.steps: - print(f"{step.name}: {step.duration_ms}ms") -``` - -Identifiez les étapes les plus lentes → cibles d'optimisation. - -### 5.2. Profilage Mémoire - -Utilisez `tracemalloc` de Python : - -```python -import tracemalloc - -tracemalloc.start() - -# Exécutez le pipeline -await pipeline.kiq(data) - -# Vérifiez l'usage mémoire -current, peak = tracemalloc.get_traced_memory() -print(f"Actuel : {current/1024/1024:.1f} Mo") -print(f"Pic : {peak/1024/1024:.1f} Mo") -tracemalloc.stop() -``` - -### 5.3. Profilage CPU - -```python -import cProfile -import pstats - -profiler = cProfile.Profile() -profiler.enable() - -await pipeline.kiq(data) - -profiler.disable() -stats = pstats.Stats(profiler) -stats.sort_stats('cumulative') -stats.print_stats(20) # Top 20 fonctions -``` - -### 5.4. Profilage Spécifique Asyncio - -`uvloop` pour une boucle d'événements plus rapide : - -```python -import uvloop -uvloop.install() # Remplace la boucle asyncio par défaut -``` - -Amélioration benchmark : `uvloop` peut fournir un gain 2×–3× pour les charges liées aux I/O. - ---- - -## 6. Optimisation Base de Données / Services Externes - -### 6.1. 
Pool de Connexions - -Pour les bases de données (PostgreSQL, Redis), réutilisez les connexions : - -```python -from asyncpg import create_pool - -pool = await create_pool(database="...", min_size=5, max_size=20) - -@broker.task -async def db_task(query: str): - async with pool.acquire() as conn: - return await conn.fetch(query) -``` - -### 6.2. Opérations par Lots - -Au lieu de nombreux petits appels, faites des lots : - -```python -# N appels séparés -for item in items: - await db.insert(item) - -# Insertion par lot unique -await db.bulk_insert(items) -``` - -### 6.3. Mise en Cache des Résultats - -```python -from functools import lru_cache - -@broker.task -@lru_cache(maxsize=1000) -def expensive_computation(key: str): - return compute(key) -``` - -Ou utilisez un cache Redis : - -```python -import redis -cache = redis.Redis(...) - -@broker.task -async def cached_task(key: str): - cached = await cache.get(key) - if cached: - return json.loads(cached) - result = await compute(key) - await cache.setex(key, 3600, json.dumps(result)) - return result -``` - ---- - -## 7. Mise à l'Échelle Distribuée - -### 7.1. Workers Multiples - -Mise à l'échelle horizontale en lançant plusieurs processus worker : - -```bash -# Terminal 1 -taskiq worker --broker redis://localhost:6379 - -# Terminal 2 -taskiq worker --broker redis://localhost:6379 - -# Terminal 3 -taskiq worker --broker redis://localhost:6379 -``` - -Tous les workers partagent le même broker (Redis) et traitent les tâches concurremment. - -**Débit ≈ (# workers) × (tâches/worker/seconde)**. - -### 7.2. 
Gestion du Pool de Workers - -Utilisez un gestionnaire de processus (systemd, supervisord, Docker Compose) : - -```yaml -# docker-compose.yml -services: - worker-1: - image: taskiq-flow-worker - command: taskiq worker --broker ${REDIS_URL} - worker-2: - image: taskiq-flow-worker - command: taskiq worker --broker ${REDIS_URL} - worker-3: - image: taskiq-flow-worker - command: taskiq worker --broker ${REDIS_URL} -``` - -### 7.3. Priorisation des Files - -Routez les pipelines critiques vers des files dédiées : - -```python -@broker.task(queue="high_priority") -def critical_task(): ... - -# Les workers peuvent être configurés pour traiter certaines files en priorité -``` - -### 7.4. Géodistribution - -Pour des déploiements mondiaux à faible latence, déployez des workers dans plusieurs régions avec un broker global (Kafka) ou des clusters Redis régionaux avec réplication. - ---- - -## 8. Benchmarking - -Mesurez avant et après optimisation : - -```python -import time - -async def benchmark(pipeline, iterations=10): - durations = [] - for _ in range(iterations): - start = time.perf_counter() - result = await pipeline.kiq(data) - await result.wait_result() - duration = time.perf_counter() - start - durations.append(duration) - - avg = sum(durations) / len(durations) - p95 = sorted(durations)[int(0.95 * len(durations))] - print(f"Moyenne: {avg:.3f}s, P95: {p95:.3f}s") - return durations -``` - -**Métriques clés** : - -- **Débit** : tâches/seconde -- **Latence P50/P95/P99** : médiane, 95ème, 99ème percentile -- **Pic mémoire** : mémoire résidente maximale -- **Utilisation CPU** : % de cœurs utilisés - ---- - -## 9. 
Checklist Production - -- [ ] Définir `max_parallel` adapté au type de tâche (CPU vs I/O) -- [ ] Utiliser le pool de connexions pour services externes -- [ ] Activer le stockage Redis pour le suivi (éviter les fuites mémoire) -- [ ] Définir un TTL sur le stockage de suivi/résultats -- [ ] Configurer les timeouts sur toutes les tâches -- [ ] Ajouter des politiques de retry avec backoff et jitter -- [ ] Surveiller l'usage mémoire et définir des alertes -- [ ] Profiler les tâches lentes avec cProfile/tracemalloc -- [ ] Mettre à l'échelle les workers horizontalement selon la profondeur de file -- [ ] Utiliser les priorités de file pour les pipelines critiques -- [ ] Implémenter la DLQ et réviser régulièrement les tâches échouées -- [ ] Tester les scénarios de panne (partitions réseau, pannes service) - ---- - -## 10. Dépannage des Performances - -### Pipeline Lent - -**Étapes de diagnostic** : - -1. Vérifiez les durées d'étapes dans le suivi : - ```python - status = await tracking.get_status(pipeline_id) - slowest = max(status.steps, key=lambda s: s.duration_ms) - print(f"Étape la plus lente : {slowest.name} à {slowest.duration_ms}ms") - ``` - -2. Profilez avec cProfile pour voir où le temps est passé -3. Vérifiez que `max_parallel` n'est pas trop bas -4. Cherchez des I/O bloquants (utilisez des librairies async) - -### Utilisation Mémoire Élevée - -**Causes & corrections** : - -| Cause | Correction | -|-------|------------| -| Gros jeu de données dans une seule étape | Découper les données, traiter par lots | -| Résultats s'accumulant dans le stockage de suivi | Définir TTL, supprimer après usage | -| Fuite mémoire dans le code de tâche | Profiler avec `tracemalloc`, corriger les fuites | -| Trop de tâches parallèles | Réduire `max_parallel` | - -### Worker en Manque (Starvation) - -**Symptôme** : Tâches en file mais non exécutées. 
- -**Corrections** : -- Augmenter le nombre de processus workers -- Vérifier que le broker (Redis) a assez de connexions -- Chercher des tâches longues bloquant la file -- Envisager les priorités de tâches ou files séparées - ---- - -## 11. Avancé : Exécuteurs Personnalisés - -Pour des charges spécialisées, implémentez des exécuteurs personnalisés : - -```python -from taskiq_flow import ExecutionEngine -from taskiq_flow.dataflow import DAG - -class GPUOptimizedEngine(ExecutionEngine): - async def schedule_task(self, task_node, inputs): - # Logique d'ordonnancement personnalisée : router les tâches GPU vers workers GPU - if task_node.labels.get("requires_gpu"): - return await self.gpu_worker_pool.submit(task_node, inputs) - return await super().schedule_task(task_node, inputs) - -engine = GPUOptimizedEngine(broker, dag) -results = await engine.execute(inputs) -``` - -### 11.1. ResourceAwareExecutor et TaskResourceProfile - -TaskIQ-Flow fournit un exécuteur conscient des ressources qui peut être utilisé -pour allouer des tâches aux workers en fonction de leurs besoins en ressources : - -```python -from taskiq_flow import ResourceAwareExecutor, TaskResourceProfile - -# Définir un profil de ressources pour les tâches lourdes -heavy_profile = TaskResourceProfile( - estimated_memory_mb=2048, - estimated_cpu_cores=4.0, -) - -@broker.task -@heavy_profile -def heavy_computation(data): - # Cette tâche nécessite 4 cœurs CPU et 2 Go de RAM - return process_heavy_data(data) - -# Utiliser ResourceAwareExecutor pour l'exécution -executor = ResourceAwareExecutor( - broker=broker, - max_parallel=10, -) -``` - -`ResourceAwareExecutor` évalue les profils de ressources des tâches et les -distribue aux workers disponibles en fonction de leur capacité. -`TaskResourceProfile` permet d'annoter chaque tâche avec ses besoins estimés -en mémoire et CPU. - -## 13. Résumé - -L'optimisation des performances est itérative : - -1. **Mesurer** — établir une baseline avec des benchmarks -2. 
**Identifier** — trouver les goulots avec le profilage -3. **Régler** — ajuster `max_parallel`, profils de ressources, batch -4. **Mettre à l'échelle** — ajouter des workers, optimiser services externes -5. **Surveiller** — suivre les métriques en production -6. **Répéter** — l'optimisation ne s'arrête jamais - ---- - -## Prochaines Étapes - -- **[Guide de Suivi]({{ '/fr/guides/tracking/' | relative_url }})** — Surveiller les métriques des pipelines -- **[Guide Dataflow]({{ '/fr/guides/dataflow/' | relative_url }})** — Guide complet sur les pipelines DAG et l'architecture dataflow -- **[Guide API]({{ '/fr/guides/api/' | relative_url }})** — Construire des tableaux de bord pour la performance -- **[Exemple : Pipeline Audio Dataflow]({{ '/fr/examples/dataflow-audio-pipeline/' | relative_url }})** — Voir l'optimisation en action - ---- - - *Allez vite, mais mesurez d'abord.* +--- +title: Guide d'Optimisation des Performances +nav_order: 27 +color_scheme: dark +--- +# Guide d'Optimisation des Performances + +**Parallélisme conscient des ressources, optimisation mémoire et stratégies de mise à l'échelle** + +> **Version** : {VERSION} | **Lié** : [Guide d'Exécution]({{ '/fr/guides/execution/' | relative_url }}), [Guide de Suivi]({{ '/fr/guides/tracking/' | relative_url }}) + +--- + +## Aperçu + +Taskiq-Flow est conçu pour une exécution asynchrone hautes performances. Ce guide couvre les techniques d'optimisation pour maximiser le débit, minimiser la latence et utiliser efficacement les ressources système. + +Sujets abordés : + +- Réglage du parallélisme (`max_parallel`) +- Profilage CPU et RAM +- Profils de ressources des tâches +- Stratégies de gestion mémoire +- Identification des goulots d'étranglement +- Passage d'un worker unique à distribué + +--- + +## 1. 
Comprendre le Paysage des Performances + +L'optimisation des performances implique des compromis : + +| Dimension | Ce qui est affecté | Compromis typique | +|-----------|--------------------|-------------------| +| **Concurrence** | Débit (tâches/seconde) | Utilisation mémoire, changement de contexte | +| **Parallélisme** | Utilisation CPU | Surcharge de coordination | +| **Latence** | Temps de complétion des tâches | Consommation de ressources | +| **Mémoire** | Capacité du jeu de données | Pauses GC, efficacité du cache | +| **I/O** | Appels services externes | Bande passante réseau, limites de connexions | + +**Aperçu clé** : Le parallélisme de Taskiq-Flow est limité par les paramètres `max_parallel` à travers les étapes du pipeline, et par les ressources système disponibles (cœurs CPU, RAM). + +--- + +## 2. Réglage du Parallélisme + +### 2.1. Le Paramètre `max_parallel` + +Contrôle l'exécution concurrente des tâches au niveau de l'étape : + +{% raw %} +```python +# Pipeline Séquentiel +pipeline.map(process_item, items, max_parallel=10) # Max 10 concurrentes + +# Pipeline Dataflow : configuration au niveau pipeline +pipeline = DataflowPipeline(broker, max_parallel=20) + +# MapReduce +mapped = await MapReduce.map( + broker, + process_item, + items, + max_parallel=15 +) +``` +{% endraw %} +**Comportement par défaut** : Sans `max_parallel`, Taskiq-Flow tente d'exécuter toutes les tâches indépendantes concurremment (essentiellement illimité). C'est acceptable pour les petits nombres (<100) mais dangereux pour les grands jeux de données. + +### 2.2. Déterminer le `max_parallel` Optimal + +#### Pour les Tâches Liées aux I/O (appels réseau, I/O disque) + +{% raw %} +```python +# Attente I/O élevée, CPU faible : peut gérer beaucoup de tâches concurrentes +pipeline.map(fetch_url, url_list, max_parallel=50) +# Règle empirique : 2–5 × nombre de cœurs CPU +``` +{% endraw %} +**Justification** : Pendant qu'une tâche attend le réseau, une autre utilise le CPU. 
Une haute concurrence sature les pipelines d'I/O. + +#### Pour les Tâches Intensives en CPU (calculs, transcodage) + +{% raw %} +```python +# Intensif en CPU : limiter au nombre de cœurs (ou légèrement plus) +import os +cpu_cores = os.cpu_count() or 4 +pipeline.map(transcode, files, max_parallel=cpu_cores + 2) +# Règle empirique : cœurs CPU ± 2 +``` +{% endraw %} +**Justification** : Le GIL de Python limite le vrai parallélisme ; `asyncio` bénéficie toujours de plusieurs cœurs quand les tâches libèrent le GIL (NumPy, extensions C). Une sur-inscription entraîne des surcoûts de changement de contexte. + +#### Pour les Charges de Travail Mixtes + +Profilez et ajustez : + +{% raw %} +```python +# Commencez prudent +for parallel in [5, 10, 20, 50]: + start = time.time() + await pipeline.kiq_dataflow(data) + duration = time.time() - start + print(f"Parallélisme {parallel} : {duration:.2f}s") +``` +{% endraw %} +Trouvez le **coude de la courbe** — point où augmenter le parallélisme donne des rendements décroissants. + +### 2.3. Limite Globale de Parallélisme + +Définissez une limite globale sur tous les pipelines : + +{% raw %} +```python +from taskiq_flow.optimization.parallel import set_max_parallel_tasks + +set_max_parallel_tasks(100) # Ne jamais dépasser 100 tâches concurrentes globalement +``` +{% endraw %} +Utile dans les systèmes multi-tenants pour empêcher un pipeline d'en asphyxier d'autres. + +--- + +## 3. Ordonnancement Conscient des Ressources + +Taskiq-Flow peut ordonnancer les tâches selon les besoins CPU/RAM (nécessite un pool de workers conscient des ressources — avancé). + +### 3.1. Annoter les Tâches avec Besoins en Ressources + +{% raw %} +```python +from taskiq_flow import CPUProfile, RAMProfile + +@broker.task +@CPUProfile(cpu_units=2) # Nécessite 2 cœurs CPU +@RAMProfile(ram_mb=4096) # Nécessite 4 Go RAM +def heavy_computation(data): + # Ne s'exécutera que sur des workers avec ressources suffisantes + pass +``` +{% endraw %} +### 3.2. 
Pool de Workers Conscient des Ressources + +{% raw %} +```python +from taskiq_flow import ResourceAwareWorkerPool + +pool = ResourceAwareWorkerPool( + workers=[ + {"cpu_cores": 8, "ram_gb": 32, "labels": {"gpu": True}}, + {"cpu_cores": 4, "ram_gb": 16, "labels": {"gpu": False}}, + ] +) + +# Les tâches sont automatiquement routées vers les workers compatibles +``` +{% endraw %} +**Note** : Cette fonctionnalité nécessite une implémentation worker personnalisée ; les brokers standards ignorent les profils de ressources. + +--- + +## 4. Optimisation Mémoire + +### 4.1. Éviter les Transferts de Données Volumineuses en Mémoire + +Passez des références au lieu des données complètes : + +{% raw %} +```python +# Mauvais : copie le jeu de données complet pour chaque appel de tâche +pipeline.map(process, large_dataset) # Chaque tâche reçoit une copie complète + +# Mieux : passez des identifiants, récupérez dans la tâche +@broker.task +def process(item_id: str): + item = database.get(item_id) # Récupération à la demande + return process_item(item) + +pipeline.map(process, item_ids) # Seuls les IDs sont passés +``` +{% endraw %} +### 4.2. Streamer les Gros Jeux de Données + +Utilisez le découpage en chunks : + +{% raw %} +```python +def chunked(iterable, chunk_size=100): + for i in range(0, len(iterable), chunk_size): + yield iterable[i:i + chunk_size] + +for chunk in chunked(large_list, 100): + results = await pipeline.kiq_dataflow(chunk) + # Traitez les résultats avant le prochain chunk pour libérer la mémoire +``` +{% endraw %} +### 4.3. Nettoyer les Résultats Après Utilisation + +Les résultats de pipeline restent dans le stockage de suivi. 
Nettoyez après usage : + +{% raw %} +```python +# Après traitement, supprimez l'enregistrement du pipeline +await tracking.delete_pipeline(pipeline.pipeline_id) +``` +{% endraw %} +Ou définissez un TTL sur le stockage : + +{% raw %} +```python +RedisPipelineStorage(redis, ttl_seconds=86400) # Suppression auto après 1 jour +``` +{% endraw %} +--- + +## 5. Profilage & Détection des Goulots d'Étranglement + +### 5.1. Chronométrage Intégré + +Chaque étape enregistre la durée automatiquement (avec le suivi activé) : + +{% raw %} +```python +status = await tracking.get_status(pipeline_id) +for step in status.steps: + print(f"{step.name}: {step.duration_ms}ms") +``` +{% endraw %} +Identifiez les étapes les plus lentes → cibles d'optimisation. + +### 5.2. Profilage Mémoire + +Utilisez `tracemalloc` de Python : + +{% raw %} +```python +import tracemalloc + +tracemalloc.start() + +# Exécutez le pipeline +await pipeline.kiq(data) + +# Vérifiez l'usage mémoire +current, peak = tracemalloc.get_traced_memory() +print(f"Actuel : {current/1024/1024:.1f} Mo") +print(f"Pic : {peak/1024/1024:.1f} Mo") +tracemalloc.stop() +``` +{% endraw %} +### 5.3. Profilage CPU + +{% raw %} +```python +import cProfile +import pstats + +profiler = cProfile.Profile() +profiler.enable() + +await pipeline.kiq(data) + +profiler.disable() +stats = pstats.Stats(profiler) +stats.sort_stats('cumulative') +stats.print_stats(20) # Top 20 fonctions +``` +{% endraw %} +### 5.4. Profilage Spécifique Asyncio + +`uvloop` pour une boucle d'événements plus rapide : + +{% raw %} +```python +import uvloop +uvloop.install() # Remplace la boucle asyncio par défaut +``` +{% endraw %} +Amélioration benchmark : `uvloop` peut fournir un gain 2×–3× pour les charges liées aux I/O. + +--- + +## 6. Optimisation Base de Données / Services Externes + +### 6.1. 
Pool de Connexions + +Pour les bases de données (PostgreSQL, Redis), réutilisez les connexions : + +{% raw %} +```python +from asyncpg import create_pool + +pool = await create_pool(database="...", min_size=5, max_size=20) + +@broker.task +async def db_task(query: str): + async with pool.acquire() as conn: + return await conn.fetch(query) +``` +{% endraw %} +### 6.2. Opérations par Lots + +Au lieu de nombreux petits appels, faites des lots : + +{% raw %} +```python +# N appels séparés +for item in items: + await db.insert(item) + +# Insertion par lot unique +await db.bulk_insert(items) +``` +{% endraw %} +### 6.3. Mise en Cache des Résultats + +{% raw %} +```python +from functools import lru_cache + +@broker.task +@lru_cache(maxsize=1000) # Cache local à chaque processus worker +def expensive_computation(key: str): + return compute(key) +``` +{% endraw %} +Ou utilisez un cache Redis : + +{% raw %} +```python +import json + +import redis.asyncio as redis # Client asynchrone : requis pour utiliser `await` +cache = redis.Redis(...) + +@broker.task +async def cached_task(key: str): + cached = await cache.get(key) + if cached: + return json.loads(cached) + result = await compute(key) + await cache.setex(key, 3600, json.dumps(result)) # Expire après 1 heure + return result +``` +{% endraw %} +--- + +## 7. Mise à l'Échelle Distribuée + +### 7.1. Workers Multiples + +Mise à l'échelle horizontale en lançant plusieurs processus worker : + +{% raw %} +```bash +# Terminal 1 +taskiq worker --broker redis://localhost:6379 + +# Terminal 2 +taskiq worker --broker redis://localhost:6379 + +# Terminal 3 +taskiq worker --broker redis://localhost:6379 +``` +{% endraw %} +Tous les workers partagent le même broker (Redis) et traitent les tâches concurremment. + +**Débit ≈ (# workers) × (tâches/worker/seconde)**. + +### 7.2. 
Gestion du Pool de Workers + +Utilisez un gestionnaire de processus (systemd, supervisord, Docker Compose) : + +{% raw %} +```yaml +# docker-compose.yml +services: + worker-1: + image: taskiq-flow-worker + command: taskiq worker --broker ${REDIS_URL} + worker-2: + image: taskiq-flow-worker + command: taskiq worker --broker ${REDIS_URL} + worker-3: + image: taskiq-flow-worker + command: taskiq worker --broker ${REDIS_URL} +``` +{% endraw %} +### 7.3. Priorisation des Files + +Routez les pipelines critiques vers des files dédiées : + +{% raw %} +```python +@broker.task(queue="high_priority") +def critical_task(): ... + +# Les workers peuvent être configurés pour traiter certaines files en priorité +``` +{% endraw %} +### 7.4. Géodistribution + +Pour des déploiements mondiaux à faible latence, déployez des workers dans plusieurs régions avec un broker global (Kafka) ou des clusters Redis régionaux avec réplication. + +--- + +## 8. Benchmarking + +Mesurez avant et après optimisation : + +{% raw %} +```python +import time + +async def benchmark(pipeline, data, iterations=10): + # `data` : entrée du pipeline, passée explicitement à chaque itération + durations = [] + for _ in range(iterations): + start = time.perf_counter() + result = await pipeline.kiq(data) + await result.wait_result() + duration = time.perf_counter() - start + durations.append(duration) + + avg = sum(durations) / len(durations) + p95 = sorted(durations)[int(0.95 * len(durations))] + print(f"Moyenne: {avg:.3f}s, P95: {p95:.3f}s") + return durations +``` +{% endraw %} +**Métriques clés** : + +- **Débit** : tâches/seconde +- **Latence P50/P95/P99** : médiane, 95ème, 99ème percentile +- **Pic mémoire** : mémoire résidente maximale +- **Utilisation CPU** : % de cœurs utilisés + +--- + +## 9. 
Checklist Production + +- [ ] Définir `max_parallel` adapté au type de tâche (CPU vs I/O) +- [ ] Utiliser le pool de connexions pour services externes +- [ ] Activer le stockage Redis pour le suivi (éviter les fuites mémoire) +- [ ] Définir un TTL sur le stockage de suivi/résultats +- [ ] Configurer les timeouts sur toutes les tâches +- [ ] Ajouter des politiques de retry avec backoff et jitter +- [ ] Surveiller l'usage mémoire et définir des alertes +- [ ] Profiler les tâches lentes avec cProfile/tracemalloc +- [ ] Mettre à l'échelle les workers horizontalement selon la profondeur de file +- [ ] Utiliser les priorités de file pour les pipelines critiques +- [ ] Implémenter la DLQ et réviser régulièrement les tâches échouées +- [ ] Tester les scénarios de panne (partitions réseau, pannes service) + +--- + +## 10. Dépannage des Performances + +### Pipeline Lent + +**Étapes de diagnostic** : + +1. Vérifiez les durées d'étapes dans le suivi : +{% raw %} + ```python + status = await tracking.get_status(pipeline_id) + slowest = max(status.steps, key=lambda s: s.duration_ms) + print(f"Étape la plus lente : {slowest.name} à {slowest.duration_ms}ms") + ``` +{% endraw %} +2. Profilez avec cProfile pour voir où le temps est passé +3. Vérifiez que `max_parallel` n'est pas trop bas +4. Cherchez des I/O bloquants (utilisez des librairies async) + +### Utilisation Mémoire Élevée + +**Causes & corrections** : + +| Cause | Correction | +|-------|------------| +| Gros jeu de données dans une seule étape | Découper les données, traiter par lots | +| Résultats s'accumulant dans le stockage de suivi | Définir TTL, supprimer après usage | +| Fuite mémoire dans le code de tâche | Profiler avec `tracemalloc`, corriger les fuites | +| Trop de tâches parallèles | Réduire `max_parallel` | + +### Worker en Manque (Starvation) + +**Symptôme** : Tâches en file mais non exécutées. 
+ +**Corrections** : +- Augmenter le nombre de processus workers +- Vérifier que le broker (Redis) a assez de connexions +- Chercher des tâches longues bloquant la file +- Envisager les priorités de tâches ou files séparées + +--- + +## 11. Avancé : Exécuteurs Personnalisés + +Pour des charges spécialisées, implémentez des exécuteurs personnalisés : + +{% raw %} +```python +from taskiq_flow import ExecutionEngine +from taskiq_flow.dataflow import DAG + +class GPUOptimizedEngine(ExecutionEngine): + async def schedule_task(self, task_node, inputs): + # Logique d'ordonnancement personnalisée : router les tâches GPU vers workers GPU + if task_node.labels.get("requires_gpu"): + return await self.gpu_worker_pool.submit(task_node, inputs) + return await super().schedule_task(task_node, inputs) + +engine = GPUOptimizedEngine(broker, dag) +results = await engine.execute(inputs) +``` +{% endraw %} +### 11.1. ResourceAwareExecutor et TaskResourceProfile + +Taskiq-Flow fournit un exécuteur conscient des ressources qui peut être utilisé +pour allouer des tâches aux workers en fonction de leurs besoins en ressources : + +{% raw %} +```python +from taskiq_flow import ResourceAwareExecutor, TaskResourceProfile + +# Définir un profil de ressources pour les tâches lourdes +heavy_profile = TaskResourceProfile( + estimated_memory_mb=2048, + estimated_cpu_cores=4.0, +) + +@broker.task +@heavy_profile +def heavy_computation(data): + # Cette tâche nécessite 4 cœurs CPU et 2 Go de RAM + return process_heavy_data(data) + +# Utiliser ResourceAwareExecutor pour l'exécution +executor = ResourceAwareExecutor( + broker=broker, + max_parallel=10, +) +``` +{% endraw %} +`ResourceAwareExecutor` évalue les profils de ressources des tâches et les +distribue aux workers disponibles en fonction de leur capacité. +`TaskResourceProfile` permet d'annoter chaque tâche avec ses besoins estimés +en mémoire et CPU. + +## 12. Résumé + +L'optimisation des performances est itérative : + +1. 
**Mesurer** — établir une baseline avec des benchmarks +2. **Identifier** — trouver les goulots avec le profilage +3. **Régler** — ajuster `max_parallel`, profils de ressources, batch +4. **Mettre à l'échelle** — ajouter des workers, optimiser services externes +5. **Surveiller** — suivre les métriques en production +6. **Répéter** — l'optimisation ne s'arrête jamais + +--- + +## Prochaines Étapes + +- **[Guide de Suivi]({{ '/fr/guides/tracking/' | relative_url }})** — Surveiller les métriques des pipelines +- **[Guide Dataflow]({{ '/fr/guides/dataflow/' | relative_url }})** — Guide complet sur les pipelines DAG et l'architecture dataflow +- **[Guide API]({{ '/fr/guides/api/' | relative_url }})** — Construire des tableaux de bord pour la performance +- **[Exemple : Pipeline Audio Dataflow]({{ '/fr/examples/dataflow-audio-pipeline/' | relative_url }})** — Voir l'optimisation en action + +--- + + *Allez vite, mais mesurez d'abord.* diff --git a/docs/_fr/guides/pipelines.md b/docs/_fr/guides/pipelines.md index 02271c8..9773e1b 100644 --- a/docs/_fr/guides/pipelines.md +++ b/docs/_fr/guides/pipelines.md @@ -1,587 +1,587 @@ ---- -title: Guide des Pipelines -nav_order: 20 ---- -# Guide des Pipelines - -**Motifs de pipelines séquentiels et dataflow, configurations et bonnes pratiques** - -> **Version** : {VERSION} | **Lié** : [Guide d'Exécution]({{ '/fr/guides/execution/' | relative_url }}), [Guide des Tâches]({{ '/fr/guides/tasks/' | relative_url }}), [Guide Dataflow]({{ '/fr/guides/dataflow/' | relative_url }}) - ---- - -## Aperçu - -Taskiq-Flow propose deux types principaux de pipelines pour orchestrer des workflows de tâches: - -1. **SequentialPipeline** — Enchaînement manuel des étapes pour des workflows linéaires -2. **DataflowPipeline** — Construction automatique de DAG depuis les dépendances entre tâches - -Pour une exploration approfondie des patterns dataflow, voir le [Guide Dataflow]({{ '/fr/guides/dataflow/' | relative_url }}). 
- -Ce guide explore les deux types, leurs cas d'usage, et comment choisir entre eux. - ---- - -## 1. Pipeline Séquentiel - -Le modèle classique où vous enchaînez explicitement les étapes dans l'ordre. - -### 1.1. Structure de Base - -```python -from taskiq_flow import Pipeline - -pipeline = ( - Pipeline(broker) - .call_next(task1) - .call_next(task2) - .call_next(task3) -) -``` - -**Exécution** : `task1 → task2 → task3` (synchroniquement) - -### 1.2. Opérations Disponibles - -#### `.call_next(task, *args, **kwargs)` - -Exécute une tâche, passant le résultat précédent comme premier argument: - -```python -pipeline.call_next(process_data).call_next(save_result) -# process_data receives output of previous step -# save_result receives output of process_data -``` - -**Parameter binding**: -- By position: result becomes first argument -- By name: `pipeline.call_next(task, param_name=previous_result)` - -Example: -```python -@broker.task -def multiply(value: int, factor: int) -> int: - return value * factor - -pipeline.call_next(add_one).call_next(multiply, factor=3) -# add_one output → multiply(value=...), factor=3 -``` - -#### `.call_after(task, *args, **kwargs)` - -Execute a task **without** consuming the previous result (fire-and-forget within pipeline): - -```python -pipeline.call_next(process).call_after(log_completion) -# log_completion runs after process but doesn't receive process's output -``` - -Useful for side effects (logging, notifications) that shouldn't transform the data flow. - -#### `.map(task, max_parallel=None)` - -Apply a task to each element of an iterable result in parallel: - -```python -# Previous step returned: [1, 2, 3, 4] -pipeline.map(process_item) -# Runs process_item(1), process_item(2), ... concurrently -# Collects results: [processed1, processed2, ...] 
-``` - -**Options**: -- `max_parallel=10` — limit concurrent executions -- `output_name="results"` — custom output key (default: task output name) - -#### `.filter(task)` - -Keep elements where the task returns truthy: - -```python -# Previous step returned: [1, 2, 3, 4] -pipeline.filter(is_even) -# Keeps elements where is_even(element) returns True -# Result: [2, 4] -``` - -#### `.group(tasks, param_names=None)` - -Execute multiple independent tasks in parallel, starting from the same input: - -```python -pipeline.group( - [task_a, task_b, task_c], - param_names=["x", "y", "z"] # bind input to these parameters -) -# All three tasks receive the same previous result -# Returns: [result_a, result_b, result_c] -``` - ---- - -## 2. Pipeline Dataflow - -> Pour un guide complet sur les patterns dataflow, voir le [Guide Dataflow]({{ '/fr/guides/dataflow/' | relative_url }}). - -Construction automatique de DAG via annotations `@pipeline_task(output=...)`. - -### 2.1. Déclaration des Sorties de Tâche - -```python -from taskiq_flow import pipeline_task, DataflowPipeline - -@broker.task -@pipeline_task(output="features") -def extract_features(data: list[str]) -> dict: - return {"count": len(data)} - -@broker.task -@pipeline_task(output="stats") -def compute_stats(features: dict) -> dict: - return {"entries": features["count"] * 2} - -@broker.task -@pipeline_task(output="report") -def generate_report(stats: dict) -> str: - return f"Stats: {stats}" -``` - -**Key**: The `output` parameter declares what this task produces. Downstream tasks declare matching parameter names to consume those outputs. - -### 2.2. Building the Pipeline - -```python -pipeline = DataflowPipeline.from_tasks( - broker, - [extract_features, compute_stats, generate_report] -) -``` - -**Automatic dependency resolution**: - -1. `extract_features` produces `features` — no dependencies -2. `compute_stats` needs `features` — depends on `extract_features` -3. 
`generate_report` needs `stats` — depends on `compute_stats` - -**Resulting DAG**: -``` -extract_features → compute_stats → generate_report -``` - -### 2.3. Multiple Consumers - -Multiple tasks can consume the same output; they'll all wait for the producer: - -```python -@broker.task -@pipeline_task(output="features") -def extract(data): ... - -@broker.task -@pipeline_task(output="tags") -def tag(features: dict): ... # consumer 1 of features - -@broker.task -@pipeline_task(output="embedding") -def embed(features: dict): ... # consumer 2 of features -``` - -**Clé** : Le paramètre `output` déclare ce que cette tâche produit. Les tâches en aval déclarent des noms de paramètres correspondants pour consommer ces sorties. - -### 2.2. Construction du Pipeline - -```python -pipeline = DataflowPipeline.from_tasks( - broker, - [extract_features, compute_stats, generate_report] -) - -**Automatic dependency resolution**: - -1. `extract_features` produces `features` — no dependencies -2. `compute_stats` needs `features` — depends on `extract_features` -3. `generate_report` needs `stats` — depends on `compute_stats` - -**Resulting DAG**: -``` -extract_features → compute_stats → generate_report -``` - -**Résolution automatique des dépendances**: - -1. `extraire_features` produit `features` — aucune dépendance -2. `calculer_stats` a besoin de `features` — dépend de `extraire_features` -3. `générer_rapport` a besoin de `stats` — dépend de `calculer_stats` - -**DAG résultant**: -``` -extraire_features → calculer_stats → générer_rapport -``` - -### 2.3. Multiple Consommateurs - -Multiple tasks can consume the same output; they'll all wait for the producer: - -```python -@broker.task -@pipeline_task(output="features") -def extract(data): ... - -@broker.task -@pipeline_task(output="tags") -def tag(features: dict): ... # consumer 1 of features - -@broker.task -@pipeline_task(output="embedding") -def embed(features: dict): ... # consumer 2 of features -``` - -### 2.4. 
Paramètres d'Entrée - -Les pipelines dataflow acceptent des entrées externes via `kiq_dataflow(**kwargs)`: - -```python -résultats = await pipeline.kiq_dataflow(data=["fichier1.mp3", "fichier2.mp3"]) -# Le paramètre `data` est apparié à toute tâche en ayant besoin -# Doit correspondre à un nom de paramètre d'une tâche sans producteur (entrée externe) -``` - ---- - -## 3. Configuration du Pipeline - -### 3.1. Ajout du Suivi - -```python -from taskiq_flow import PipelineTrackingManager - -suivi = PipelineTrackingManager().with_auto_storage(broker) -pipeline = Pipeline(broker).with_tracking(suivi) -``` - -Voir [Guide de Suivi]({{ '/fr/guides/tracking/' | relative_url }}) pour plus de détails. - -### 3.2. Définition d'un ID de Pipeline Personnalisé - -```python -pipeline.pipeline_id = "my_workflow_001" -# If not set, a UUID is automatically generated -``` - -Important pour le suivi et les abonnements WebSocket. - -### 3.3. Attachement des Hooks (WebSocket) - -```python -from taskiq_flow.hooks import HookManager - -hooks = HookManager() -pipeline = Pipeline(broker).with_hooks(hooks) -``` - -Voir [Guide WebSocket]({{ '/fr/guides/websocket/' | relative_url }}). - -### 3.4. Retry & Politiques d'Erreur - -```python -pipeline.with_retry( - max_attempts=3, - delay=1.0, - backoff=2.0 -) -pipeline.on_error("continue") # ou "stop" -``` - -Voir [Guide de Retry]({{ '/fr/guides/retry/' | relative_url }}). - -### 3.5. Timeouts - -```python -pipeline.with_timeout(seconds=60) -``` - ---- - -## 4. Cycle de Vie du Pipeline - -### 4.1. Création → Exécution → Completion - -``` -1. pipeline = Pipeline(broker) # Créer l'objet pipeline -2. pipeline.call_next(...) # Enchaîner les étapes -3. task = await pipeline.kiq(entrée) # Lancer -4. résultat = await task.wait_result() # Attendre & récupérer -``` - -### 4.2. Réutilisabilité - -Les objets Pipeline sont **à usage unique**. 
Pour des exécutions répétées, créez un nouveau pipeline ou utilisez `PipelineScheduler`: - -```python -# Correct: create a fresh pipeline each time -async def execute_workflow(data): - pipeline = Pipeline(broker).call_next(step1).call_next(step2) - return await pipeline.kiq(data) -``` - ---- - -## 5. Visualisation des Pipelines - -### 5.1. DAG ASCII (Console) - -```python -pipeline.print_dag() -``` - -Example output: -``` -Execution Order DAG: - Level 0: task_a - Level 1: task_b, task_c - Level 2: task_d -``` - -### 5.2. JSON for Web UIs - -```python -viz = pipeline.visualize() # returns a dict -print(viz) -``` - -Structure: -```json -{ - "nodes": [ - {"id": "task_a", "outputs": ["x", "y"]}, - {"id": "task_b", "inputs": ["x"]} - ], - "edges": [{"from": "task_a", "to": "task_b"}] -} -``` - -### 5.3. Format DOT (Graphviz) - -```python -dot = pipeline.visualize_dot() -with open("pipeline.dot", "w") as f: - f.write(dot) -# Rendre: dot -Tpng pipeline.dot -o pipeline.png -``` - -Le diagramme résultant montre les nœuds, liens et ordre d'exécution. - ---- - -## 6. 
Inspection du Pipeline (DataflowRegistry) - -For advanced use cases, manually construct and inspect the dataflow graph: - -```python -from taskiq_flow import DataflowRegistry - -registry = DataflowRegistry() - -# Register tasks with explicit I/O -registry.register_task( - task=load_data, - output="raw", - inputs=["source"] # external input -) -registry.register_task( - task=clean, - output="clean", - inputs=["raw"] -) -registry.register_task( - task=save, - output="saved", - inputs=["clean"] -) - -# Inspect structure -print("Tasks:", [t.task_name for t in registry.get_tasks()]) -print("Outputs:", registry.get_outputs()) # ["raw", "clean", "saved"] -print("External inputs:", registry.get_external_inputs()) # ["source"] - -# Find dependencies -producer = registry.get_producer("clean") # returns TaskNode for 'clean' -consumers = registry.get_consumers("raw") # list of tasks needing 'raw' - -# Build DAG -dag = registry.build_dag() -dag.print() -order = dag.topological_sort() # list of tasks in execution order -levels = dag.levels # list of lists (parallel groups) -``` - -Voir `examples/registry_discovery_example.py` pour une utilisation complète. - ---- - -## 7. 
Choix entre Types de Pipeline - -| Critère | SequentialPipeline | DataflowPipeline | -|---------|-------------------|------------------| -| **Forme du workflow** | Linéaire, avec embranchements occasionnels | DAG complexe avec nombreuses branches | -| **Dépendances des tâches** | Implicites (ordre d'enchaînement) | Explicites (`@pipeline_task`) | -| **Parallélisme** | Manuel (`.group()`) | Automatique (tâches indépendantes) | -| **Flexibilité** | Contrôle total de l'ordre | Déclaratif ; la bibliothèque optimise | -| **Workflows dynamiques** | Difficile (fixé au moment de la construction) | Facile (peut ajouter des tâches flexiblement) | -| **Idéal pour** | ETL étapes linéaires, batch simple | Traitement audio/vidéo, pipelines ML | - -**Règle empirique**: -- **SequentialPipeline** pour des workflows simples à ordre fixe -- **DataflowPipeline** pour des workflows complexes, ramifiés ou réutilisables - ---- - -## 8. Bonnes Pratiques - -### 8.1. Nommage des Tâches et Sorties - -Utiliser des noms de sortie clairs et uniques: - -```python -@pipeline_task(output="user_features") # clair -@pipeline_task(output="features_2") # ambigu (si plusieurs features existent) -``` - -### 8.2. Éviter les Dépendances Circulaires - -DataflowPipeline détecte les cycles et lève `CycleError` pendant `build_dag()`. Concevoir avec un flux de données avant uniquement. - -### 8.3. Minimiser l'État Partagé - -Chaque tâche doit être pure (la sortie dépend uniquement des entrées) pour la sécurité en parallèle. - -### 8.4. Versionner les IDs de Pipeline - -Inclure la version dans les IDs de pipeline pour le suivi: - -```python -pipeline.pipeline_id = f"analyse_audio_v1_{int(time.time())}" -``` - -### 8.5. Utiliser `.call_after()` pour les Effets Secondaires - -Ne pas corrompre le flux de données avec logs/métriques: - -```python -pipeline.call_next(processus).call_after(journaliser_résultat) # correct -pipeline.call_next(processus_et_journaliser) # anti-pattern -``` - -### 8.6. 
Limiter le Parallélisme pour les Tâches Ressource-Intensives - -```python -# Transcodage intensif en CPU -pipeline.map(transcoder, fichiers, max_parallel=2) -``` - -### 8.7. Valider le DAG Avant Exécution - -```python -pipeline.print_dag() # Toujours inspecter les pipelines complexes -input("Appuyer sur Entrée pour exécuter...") -``` - ---- - -## 9. Pièges Courants - -| Symptôme | Cause probable | Correction | -|----------|----------------|------------| -| Tâche exécutée deux fois | `.call_next()` et tâche dépendante tous deux déclarés | Supprimer l'appel redondant; Dataflow gère les dépendances | -| Sortie manquante | `@pipeline_task(output=...)` ne correspond pas au paramètre en aval | Aligner le nom de sortie avec le nom du paramètre | -| Toutes les tâches séquentielles | Utilisation de Pipeline au lieu de DataflowPipeline | Passer à DataflowPipeline pour le parallélisme automatique | -| Résultats None | Oubli de `broker.add_middlewares(PipelineMiddleware())` | Ajouter le middleware avant de créer des pipelines | -| Pipeline stale réutilisé | Tentative d'appeler `kiq()` deux fois sur le même objet pipeline | Créer un pipeline frais par exécution | - ---- - -## 10. Motifs Avancés - -### 10.1. Hybride Séquentiel + Dataflow - -Combiner les deux types pour un contrôle maximal: - -```python -# Coquille séquentielle -séquentiel = Pipeline(broker) - -# À l'intérieur d'une étape, lancer un sous-pipeline dataflow -@broker.task -async def traiter_lot(données: list) -> dict: - sous_pipeline = DataflowPipeline.from_tasks( - broker, - [sous_tache1, sous_tache2, sous_tache3] - ) - return await sous_pipeline.kiq_dataflow(data=données) - -séquentiel.call_next(traiter_lot).call_next(finaliser) -``` - -### 10.2. 
Construction de Pipeline Dynamique - -Construire des pipelines à l'exécution selon la configuration: - -```python -def build_pipeline(config: dict) -> Pipeline: - steps = [] - if config.get("preprocess"): - steps.append(preprocess_task) - if config.get("analyze"): - steps.append(analyze_task) - # ... - pipeline = Pipeline(broker) - for step in steps: - pipeline.call_next(step) - return pipeline -``` - -### 10.3. Branchement Conditionnel - -Utiliser `.filter()` et les étapes de condition: - -```python -high_value = pipeline.filter(is_high_value) -high_value.call_next(premium_processing) -low_value = pipeline.filter(is_low_value) -low_value.call_next(standard_processing) - -# Merge -merged = high_value.group([premium_processing, standard_processing]) -``` - -Voir [steps/condition.py](https://github.com/dorel14/taskiq-flow/blob/main/taskiq_flow/steps/condition.py) pour `IfStep`. - ---- - -## 11. Checklist de Vérification - -Avant d'exécuter un pipeline, vérifier : - -- [ ] Type de pipeline choisi correctement (Séquentiel vs Dataflow) -- [ ] Toutes les fonctions décorées avec `@broker.task` -- [ ] Dataflow: toutes les tâches concernées décorées avec `@pipeline_task(output=…)` -- [ ] Les noms de sortie correspondent exactement aux noms de paramètres en aval -- [ ] `PipelineMiddleware` ajouté au broker -- [ ] `pipeline_id` défini si suivi/WebSocket nécessaire -- [ ] DAG inspecté avec `print_dag()` pour les pipelines complexes -- [ ] Limites de parallélisme (`max_parallel`) définies appropriément -- [ ] Timeouts configurés pour les tâches longues -- [ ] Exécution d'exemple réussie avant utilisation en production - ---- - -## Lectures Complémentaires - -- **[Guide d'Exécution]({{ '/fr/guides/execution/' | relative_url }})** — Comment les pipelines s'exécutent, gestion d'erreurs, timeouts -- **[Guide des Tâches]({{ '/fr/guides/tasks/' | relative_url }})** — Écriture des fonctions de tâche et décorateurs -- **[Exemples]({{ '/fr/examples/' | relative_url }})** — 
Démonstrations complètes de pipelines - ---- - -*Maîtriser les pipelines pour orchestrer n'importe quel workflow. Ensuite, apprendre sur la [Définition des Tâches]({{ '/fr/guides/tasks/' | relative_url }}).* +--- +title: Guide des Pipelines +nav_order: 20 +--- +# Guide des Pipelines + +**Motifs de pipelines séquentiels et dataflow, configurations et bonnes pratiques** + +> **Version** : {VERSION} | **Lié** : [Guide d'Exécution]({{ '/fr/guides/execution/' | relative_url }}), [Guide des Tâches]({{ '/fr/guides/tasks/' | relative_url }}), [Guide Dataflow]({{ '/fr/guides/dataflow/' | relative_url }}) + +--- + +## Aperçu + +Taskiq-Flow propose deux types principaux de pipelines pour orchestrer des workflows de tâches: + +1. **SequentialPipeline** — Enchaînement manuel des étapes pour des workflows linéaires +2. **DataflowPipeline** — Construction automatique de DAG depuis les dépendances entre tâches + +Pour une exploration approfondie des patterns dataflow, voir le [Guide Dataflow]({{ '/fr/guides/dataflow/' | relative_url }}). + +Ce guide explore les deux types, leurs cas d'usage, et comment choisir entre eux. + +--- + +## 1. Pipeline Séquentiel + +Le modèle classique où vous enchaînez explicitement les étapes dans l'ordre. + +### 1.1. Structure de Base + +```python +from taskiq_flow import Pipeline + +pipeline = ( + Pipeline(broker) + .call_next(task1) + .call_next(task2) + .call_next(task3) +) +``` + +**Exécution** : `task1 → task2 → task3` (synchroniquement) + +### 1.2. 
Opérations Disponibles + +#### `.call_next(task, *args, **kwargs)` + +Exécute une tâche, passant le résultat précédent comme premier argument: + +```python +pipeline.call_next(process_data).call_next(save_result) +# process_data receives output of previous step +# save_result receives output of process_data +``` + +**Parameter binding**: +- By position: result becomes first argument +- By name: `pipeline.call_next(task, param_name=previous_result)` + +Example: +```python +@broker.task +def multiply(value: int, factor: int) -> int: + return value * factor + +pipeline.call_next(add_one).call_next(multiply, factor=3) +# add_one output → multiply(value=...), factor=3 +``` + +#### `.call_after(task, *args, **kwargs)` + +Execute a task **without** consuming the previous result (fire-and-forget within pipeline): + +```python +pipeline.call_next(process).call_after(log_completion) +# log_completion runs after process but doesn't receive process's output +``` + +Useful for side effects (logging, notifications) that shouldn't transform the data flow. + +#### `.map(task, max_parallel=None)` + +Apply a task to each element of an iterable result in parallel: + +```python +# Previous step returned: [1, 2, 3, 4] +pipeline.map(process_item) +# Runs process_item(1), process_item(2), ... concurrently +# Collects results: [processed1, processed2, ...] 
+``` + +**Options**: +- `max_parallel=10` — limit concurrent executions +- `output_name="results"` — custom output key (default: task output name) + +#### `.filter(task)` + +Keep elements where the task returns truthy: + +```python +# Previous step returned: [1, 2, 3, 4] +pipeline.filter(is_even) +# Keeps elements where is_even(element) returns True +# Result: [2, 4] +``` + +#### `.group(tasks, param_names=None)` + +Execute multiple independent tasks in parallel, starting from the same input: + +```python +pipeline.group( + [task_a, task_b, task_c], + param_names=["x", "y", "z"] # bind input to these parameters +) +# All three tasks receive the same previous result +# Returns: [result_a, result_b, result_c] +``` + +--- + +## 2. Pipeline Dataflow + +> Pour un guide complet sur les patterns dataflow, voir le [Guide Dataflow]({{ '/fr/guides/dataflow/' | relative_url }}). + +Construction automatique de DAG via annotations `@pipeline_task(output=...)`. + +### 2.1. Déclaration des Sorties de Tâche + +```python +from taskiq_flow import pipeline_task, DataflowPipeline + +@broker.task +@pipeline_task(output="features") +def extract_features(data: list[str]) -> dict: + return {"count": len(data)} + +@broker.task +@pipeline_task(output="stats") +def compute_stats(features: dict) -> dict: + return {"entries": features["count"] * 2} + +@broker.task +@pipeline_task(output="report") +def generate_report(stats: dict) -> str: + return f"Stats: {stats}" +``` + +**Key**: The `output` parameter declares what this task produces. Downstream tasks declare matching parameter names to consume those outputs. + +### 2.2. Building the Pipeline + +```python +pipeline = DataflowPipeline.from_tasks( + broker, + [extract_features, compute_stats, generate_report] +) +``` + +**Automatic dependency resolution**: + +1. `extract_features` produces `features` — no dependencies +2. `compute_stats` needs `features` — depends on `extract_features` +3. 
`generate_report` needs `stats` — depends on `compute_stats` + +**Resulting DAG**: +``` +extract_features → compute_stats → generate_report +``` + +### 2.3. Multiple Consumers + +Multiple tasks can consume the same output; they'll all wait for the producer: + +```python +@broker.task +@pipeline_task(output="features") +def extract(data): ... + +@broker.task +@pipeline_task(output="tags") +def tag(features: dict): ... # consumer 1 of features + +@broker.task +@pipeline_task(output="embedding") +def embed(features: dict): ... # consumer 2 of features +``` + +### 2.4. 
Paramètres d'Entrée + +Les pipelines dataflow acceptent des entrées externes via `kiq_dataflow(**kwargs)`: + +```python +résultats = await pipeline.kiq_dataflow(data=["fichier1.mp3", "fichier2.mp3"]) +# Le paramètre `data` est apparié à toute tâche en ayant besoin +# Doit correspondre à un nom de paramètre d'une tâche sans producteur (entrée externe) +``` + +--- + +## 3. Configuration du Pipeline + +### 3.1. Ajout du Suivi + +```python +from taskiq_flow import PipelineTrackingManager + +suivi = PipelineTrackingManager().with_auto_storage(broker) +pipeline = Pipeline(broker).with_tracking(suivi) +``` + +Voir [Guide de Suivi]({{ '/fr/guides/tracking/' | relative_url }}) pour plus de détails. + +### 3.2. Définition d'un ID de Pipeline Personnalisé + +```python +pipeline.pipeline_id = "my_workflow_001" +# If not set, a UUID is automatically generated +``` + +Important pour le suivi et les abonnements WebSocket. + +### 3.3. Attachement des Hooks (WebSocket) + +```python +from taskiq_flow.hooks import HookManager + +hooks = HookManager() +pipeline = Pipeline(broker).with_hooks(hooks) +``` + +Voir [Guide WebSocket]({{ '/fr/guides/websocket/' | relative_url }}). + +### 3.4. Retry & Politiques d'Erreur + +```python +pipeline.with_retry( + max_attempts=3, + delay=1.0, + backoff=2.0 +) +pipeline.on_error("continue") # ou "stop" +``` + +Voir [Guide de Retry]({{ '/fr/guides/retry/' | relative_url }}). + +### 3.5. Timeouts + +```python +pipeline.with_timeout(seconds=60) +``` + +--- + +## 4. Cycle de Vie du Pipeline + +### 4.1. Création → Exécution → Completion + +``` +1. pipeline = Pipeline(broker) # Créer l'objet pipeline +2. pipeline.call_next(...) # Enchaîner les étapes +3. task = await pipeline.kiq(entrée) # Lancer +4. résultat = await task.wait_result() # Attendre & récupérer +``` + +### 4.2. Réutilisabilité + +Les objets Pipeline sont **à usage unique**. 
Pour des exécutions répétées, créez un nouveau pipeline ou utilisez `PipelineScheduler`: + +```python +# Correct: create a fresh pipeline each time +async def execute_workflow(data): + pipeline = Pipeline(broker).call_next(step1).call_next(step2) + return await pipeline.kiq(data) +``` + +--- + +## 5. Visualisation des Pipelines + +### 5.1. DAG ASCII (Console) + +```python +pipeline.print_dag() +``` + +Example output: +``` +Execution Order DAG: + Level 0: task_a + Level 1: task_b, task_c + Level 2: task_d +``` + +### 5.2. JSON for Web UIs + +```python +viz = pipeline.visualize() # returns a dict +print(viz) +``` + +Structure: +```json +{ + "nodes": [ + {"id": "task_a", "outputs": ["x", "y"]}, + {"id": "task_b", "inputs": ["x"]} + ], + "edges": [{"from": "task_a", "to": "task_b"}] +} +``` + +### 5.3. Format DOT (Graphviz) + +```python +dot = pipeline.visualize_dot() +with open("pipeline.dot", "w") as f: + f.write(dot) +# Rendre: dot -Tpng pipeline.dot -o pipeline.png +``` + +Le diagramme résultant montre les nœuds, liens et ordre d'exécution. + +--- + +## 6. 
Inspection du Pipeline (DataflowRegistry) + +For advanced use cases, manually construct and inspect the dataflow graph: + +```python +from taskiq_flow import DataflowRegistry + +registry = DataflowRegistry() + +# Register tasks with explicit I/O +registry.register_task( + task=load_data, + output="raw", + inputs=["source"] # external input +) +registry.register_task( + task=clean, + output="clean", + inputs=["raw"] +) +registry.register_task( + task=save, + output="saved", + inputs=["clean"] +) + +# Inspect structure +print("Tasks:", [t.task_name for t in registry.get_tasks()]) +print("Outputs:", registry.get_outputs()) # ["raw", "clean", "saved"] +print("External inputs:", registry.get_external_inputs()) # ["source"] + +# Find dependencies +producer = registry.get_producer("clean") # returns TaskNode for 'clean' +consumers = registry.get_consumers("raw") # list of tasks needing 'raw' + +# Build DAG +dag = registry.build_dag() +dag.print() +order = dag.topological_sort() # list of tasks in execution order +levels = dag.levels # list of lists (parallel groups) +``` + +Voir `examples/registry_discovery_example.py` pour une utilisation complète. + +--- + +## 7. 
Choix entre Types de Pipeline + +| Critère | SequentialPipeline | DataflowPipeline | +|---------|-------------------|------------------| +| **Forme du workflow** | Linéaire, avec embranchements occasionnels | DAG complexe avec nombreuses branches | +| **Dépendances des tâches** | Implicites (ordre d'enchaînement) | Explicites (`@pipeline_task`) | +| **Parallélisme** | Manuel (`.group()`) | Automatique (tâches indépendantes) | +| **Flexibilité** | Contrôle total de l'ordre | Déclaratif ; la bibliothèque optimise | +| **Workflows dynamiques** | Difficile (fixé au moment de la construction) | Facile (peut ajouter des tâches flexiblement) | +| **Idéal pour** | ETL étapes linéaires, batch simple | Traitement audio/vidéo, pipelines ML | + +**Règle empirique**: +- **SequentialPipeline** pour des workflows simples à ordre fixe +- **DataflowPipeline** pour des workflows complexes, ramifiés ou réutilisables + +--- + +## 8. Bonnes Pratiques + +### 8.1. Nommage des Tâches et Sorties + +Utiliser des noms de sortie clairs et uniques: + +```python +@pipeline_task(output="user_features") # clair +@pipeline_task(output="features_2") # ambigu (si plusieurs features existent) +``` + +### 8.2. Éviter les Dépendances Circulaires + +DataflowPipeline détecte les cycles et lève `CycleError` pendant `build_dag()`. Concevoir avec un flux de données avant uniquement. + +### 8.3. Minimiser l'État Partagé + +Chaque tâche doit être pure (la sortie dépend uniquement des entrées) pour la sécurité en parallèle. + +### 8.4. Versionner les IDs de Pipeline + +Inclure la version dans les IDs de pipeline pour le suivi: + +```python +pipeline.pipeline_id = f"analyse_audio_v1_{int(time.time())}" +``` + +### 8.5. Utiliser `.call_after()` pour les Effets Secondaires + +Ne pas corrompre le flux de données avec logs/métriques: + +```python +pipeline.call_next(processus).call_after(journaliser_résultat) # correct +pipeline.call_next(processus_et_journaliser) # anti-pattern +``` + +### 8.6. 
Limiter le Parallélisme pour les Tâches Ressource-Intensives + +```python +# Transcodage intensif en CPU ; `.map()` itère sur le résultat de l'étape précédente +pipeline.map(transcoder, max_parallel=2) +``` + +### 8.7. Valider le DAG Avant Exécution + +```python +pipeline.print_dag() # Toujours inspecter les pipelines complexes +input("Appuyer sur Entrée pour exécuter...") +``` + +--- + +## 9. Pièges Courants + +| Symptôme | Cause probable | Correction | +|----------|----------------|------------| +| Tâche exécutée deux fois | `.call_next()` et tâche dépendante tous deux déclarés | Supprimer l'appel redondant; Dataflow gère les dépendances | +| Sortie manquante | `@pipeline_task(output=...)` ne correspond pas au paramètre en aval | Aligner le nom de sortie avec le nom du paramètre | +| Toutes les tâches séquentielles | Utilisation de Pipeline au lieu de DataflowPipeline | Passer à DataflowPipeline pour le parallélisme automatique | +| Résultats None | Oubli de `broker.add_middlewares(PipelineMiddleware())` | Ajouter le middleware avant de créer des pipelines | +| Pipeline périmé réutilisé | Tentative d'appeler `kiq()` deux fois sur le même objet pipeline | Créer un nouveau pipeline par exécution | + +--- + +## 10. Motifs Avancés + +### 10.1. Hybride Séquentiel + Dataflow + +Combiner les deux types pour un contrôle maximal: + +```python +# Coquille séquentielle +séquentiel = Pipeline(broker) + +# À l'intérieur d'une étape, lancer un sous-pipeline dataflow +@broker.task +async def traiter_lot(données: list) -> dict: + sous_pipeline = DataflowPipeline.from_tasks( + broker, + [sous_tache1, sous_tache2, sous_tache3] + ) + return await sous_pipeline.kiq_dataflow(data=données) + +séquentiel.call_next(traiter_lot).call_next(finaliser) +``` + +### 10.2. 
Construction de Pipeline Dynamique + +Construire des pipelines à l'exécution selon la configuration: + +```python +def build_pipeline(config: dict) -> Pipeline: + steps = [] + if config.get("preprocess"): + steps.append(preprocess_task) + if config.get("analyze"): + steps.append(analyze_task) + # ... + pipeline = Pipeline(broker) + for step in steps: + pipeline.call_next(step) + return pipeline +``` + +### 10.3. Branchement Conditionnel + +Utiliser `.filter()` et les étapes de condition: + +```python +high_value = pipeline.filter(is_high_value) +high_value.call_next(premium_processing) +low_value = pipeline.filter(is_low_value) +low_value.call_next(standard_processing) + +# Merge +merged = high_value.group([premium_processing, standard_processing]) +``` + +Voir [steps/condition.py](https://github.com/dorel14/taskiq-flow/blob/main/taskiq_flow/steps/condition.py) pour `IfStep`. + +--- + +## 11. Checklist de Vérification + +Avant d'exécuter un pipeline, vérifier : + +- [ ] Type de pipeline choisi correctement (Séquentiel vs Dataflow) +- [ ] Toutes les fonctions décorées avec `@broker.task` +- [ ] Dataflow: toutes les tâches concernées décorées avec `@pipeline_task(output=…)` +- [ ] Les noms de sortie correspondent exactement aux noms de paramètres en aval +- [ ] `PipelineMiddleware` ajouté au broker +- [ ] `pipeline_id` défini si suivi/WebSocket nécessaire +- [ ] DAG inspecté avec `print_dag()` pour les pipelines complexes +- [ ] Limites de parallélisme (`max_parallel`) définies de manière appropriée +- [ ] Timeouts configurés pour les tâches longues +- [ ] Exécution d'exemple réussie avant utilisation en production + +--- + +## Lectures Complémentaires + +- **[Guide d'Exécution]({{ '/fr/guides/execution/' | relative_url }})** — Comment les pipelines s'exécutent, gestion d'erreurs, timeouts +- **[Guide des Tâches]({{ '/fr/guides/tasks/' | relative_url }})** — Écriture des fonctions de tâche et décorateurs +- **[Exemples]({{ '/fr/examples/' | relative_url }})** — 
Démonstrations complètes de pipelines + +--- + +*Maîtriser les pipelines pour orchestrer n'importe quel workflow. Ensuite, apprendre sur la [Définition des Tâches]({{ '/fr/guides/tasks/' | relative_url }}).* diff --git a/docs/_fr/guides/retry.md b/docs/_fr/guides/retry.md index a60d84a..019603f 100644 --- a/docs/_fr/guides/retry.md +++ b/docs/_fr/guides/retry.md @@ -1,485 +1,485 @@ ---- -title: Guide des Retentatives et de la Gestion d'Erreurs -nav_order: 26 ---- -# Guide des Retentatives et de la Gestion d'Erreurs - -**Exécution de pipeline résiliente avec politiques de retry, backoff et files de lettres mortes** - -> **Version** : {VERSION} | **Lié** : [Guide d'Exécution]({{ '/fr/guides/execution/' | relative_url }}), [Guide d'Ordonnancement]({{ '/fr/guides/scheduling/' | relative_url }}) - ---- - -## Aperçu - -Les pannes sont inévitables dans les systèmes distribués. Taskiq-Flow fournit des mécanismes complets de retry et de gestion d'erreurs pour garantir la robustesse des pipelines. - -Ce guide couvre : - -- Politiques de retry au niveau tâche et pipeline -- Stratégies d'exponential backoff -- Dead Letter Queues (DLQ) pour les échecs irrécupérables -- Logique de retry conditionnel -- Configuration des timeouts -- Surveillance des métriques de retry - ---- - -## 1. Comprendre les Retentatives - -Une **retry** (réessai) est la ré-exécution automatique d'une tâche échouée avec les mêmes entrées. Les politiques de retry définissent **quand** et **comment** réessayer. - -### Quand Retenter - - **Bons candidats pour le retry** : - -- Timeouts réseau (API externe indisponible) -- Erreurs de connexion base de données (transitoires) -- Limitation de débit (header retry-after) -- Épuisement temporaire des ressources - - **Ne PAS retenter** : - -- Erreurs de validation (mauvaise entrée ne se corrigera pas) -- Erreurs de programmation (bug dans le code) -- Données manquantes (ne réapparaîtront pas) -- Échecs permanents (404 Not Found, 401 Unauthorized) - ---- - -## 2. 
Retry au Niveau Tâche - -Configurez le retry directement sur le décorateur de tâche : - -```python -@broker.task( - max_retries=3, # Nombre maximum de tentatives (défaut : 0 = pas de retry) - retry_delay=5.0, # Secondes entre les retentatives - retry_backoff=2.0, # Multiplice le délai après chaque tentative - retry_timeout=60 # Timeout global incluant les retentatives -) -async def flaky_api_call(): - response = await call_external_api() - return response.json() -``` - -**Séquence de retry** : - -| Tentative | Délai | Cumulé | -|-----------|-------|--------| -| 1 (initiale) | 0s | 0s | -| 2 (retry 1) | 5s | 5s | -| 3 (retry 2) | 10s (5 × 2) | 15s | -| 4 (retry 3) | 20s (10 × 2) | 35s | -| Échec final | — | 35s | - ---- - -## 3. Retry au Niveau Pipeline - -Appliquez une politique de retry cohérente à toutes les tâches d'un pipeline : - -```python -pipeline = Pipeline(broker) -pipeline.with_retry( - max_attempts=3, - delay=2.0, # Délai initial - backoff=1.5, # Multiplicateur de backoff - on_retry=None # Callback optionnel -) -``` - -Toutes les tâches de ce pipeline héritent de cette politique à moins qu'elles n'en aient une propre. - -**Priorité** : Le niveau tâche écrase le niveau pipeline. - ---- - -## 4. Politiques de Retry Personnalisées - -Pour un contrôle fin, implémentez `RetryPolicy` : - -```python -from taskiq_flow import RetryPolicy - -class MyRetryPolicy(RetryPolicy): - def should_retry(self, attempt: int, exception: Exception) -> bool: - # Retente uniquement sur erreurs réseau, max 5 tentatives - if attempt >= 5: - return False - return isinstance(exception, NetworkError) - - def get_delay(self, attempt: int) -> float: - # Backoff personnalisé : 2^attempt + jitter aléatoire - import random - base = 2 ** attempt - jitter = random.uniform(-0.1, 0.1) * base - return max(0.5, base + jitter) - -pipeline.with_retry(policy=MyRetryPolicy()) -``` - -### 4.1. 
Retry Conditionnel (Sur Exceptions Spécifiques) - -```python -@broker.task -async def task_with_selective_retry(): - try: - result = await call_api() - return result - except NetworkTimeout: - # Cette exception doit être retentée - raise RetryException("Timeout, réessai autorisé") - except InvalidResponse: - # Erreur permanente ; pas de retry - raise # Échec immédiat -``` - -**Retry basé sur les exceptions** : - -```python -from taskiq.exceptions import RetryException - -@broker.task(retry_on=[NetworkError, TimeoutError]) -async def task(): - # Retente automatiquement sur ces types d'exceptions - pass -``` - ---- - -## 5. Exponential Backoff avec Jitter - -Évitez le problème du "thundering herd" (tous les retentements en même temps) : - -```python -import random - -def exponential_backoff_with_jitter( - attempt: int, - base_delay: float = 1.0, - max_delay: float = 60.0, - backoff_factor: float = 2.0, - jitter: bool = True -) -> float: - """Calcule le délai de retry.""" - delay = min(max_delay, base_delay * (backoff_factor ** attempt)) - if jitter: - # Ajoute ±10% de jitter aléatoire - delay *= random.uniform(0.9, 1.1) - return delay - -# Utilisation dans une policy -class JitteredRetryPolicy(RetryPolicy): - def get_delay(self, attempt: int) -> float: - return exponential_backoff_with_jitter(attempt, base_delay=2.0) -``` - -**Pourquoi le jitter ?** Empêche les vagues synchronisées de retentements qui submergent les services. - ---- - -## 6. Dead Letter Queues (DLQ) - -Lorsque tous les retentatifs sont épuisés, les tâches échouées doivent être stockées quelque part. - -### 6.1. Configuration DLQ - -```python -from taskiq_flow.middlewares.retry import RetryMiddleware - -broker.add_middlewares( - RetryMiddleware( - max_retries=3, - dlq_queue="failed_tasks" # Les tâches vont ici après épuisement des retentatives - ) -) -``` - -**Comportement** : - -1. Tâche échoue → retry 1 (après délai) -2. Échoue à nouveau → retry 2 (délai plus long) -3. Échoue à nouveau → retry 3 -4. 
Échoue tous les retentatives → déplacement vers la file `failed_tasks` - -### 6.2. Inspection & Rejeu DLQ - -```python -from taskiq_flow.middlewares.retry import DLQManager - -dlq = DLQManager(broker) - -# Lister les tâches échouées -failed_tasks = await dlq.list_failed() -for task_info in failed_tasks: - print(f"Tâche {task_info.task_id} échouée : {task_info.error}") - -# Rejouer une tâche échouée (remettre en file d'attente) -await dlq.retry_task(task_id) - -# Supprimer définitivement une tâche échouée -await dlq.delete_task(task_id) - -# Suppression en masse plus ancienne que N jours -await dlq.cleanup_older_than(days=7) -``` - -### 6.3. Alerting DLQ - -Mettez en place des alertes lorsque des tâches vont en DLQ : - -```python -class DLQAlertListener: - async def on_task_to_dlq(self, task_id: str, error: str): - send_slack_alert(f"Tâche {task_id} échouée après retentatives : {error}") - create_incident_ticket(task_id, error) - -dlq_manager = DLQManager(broker).with_listener(DLQAlertListener()) -``` - ---- - -## 7. Timeouts - -Évitez que les tâches ne s'exécutent indéfiniment. - -### 7.1. Timeout au Niveau Tâche - -```python -@broker.task(timeout=30) # secondes -async def potentially_slow_task(): - await long_running_operation() -``` - -Si la tâche dépasse 30 secondes, une `asyncio.TimeoutError` est levée et la politique de retry s'applique. - -### 7.2. Timeout au Niveau Pipeline - -```python -pipeline = Pipeline(broker) -pipeline.with_timeout(seconds=300) # 5 minutes pour l'ensemble du pipeline -``` - -Annule toutes les étapes en cours lorsque le timeout expire. - -### 7.3. Timeout au Niveau Étape (Avancé) - -```python -from taskiq_flow.steps import TimeoutStep - -pipeline = Pipeline(broker) -pipeline.call_next(TimeoutStep(my_task, timeout=10.0)) -``` - ---- - -## 8. Propagation des Erreurs - -### 8.1. 
Échec Rapide (Par Défaut) - -Le pipeline s'arrête à la première erreur : - -```python -pipeline = Pipeline(broker) -# Par défaut : on_error="stop" - -pipeline.call_next(task1) # Échoue → le pipeline s'arrête, task2 ne s'exécute jamais -pipeline.call_next(task2) -``` - -### 8.2. Continuer en Cas d'Erreur - -Continue d'exécuter les étapes restantes malgré les échecs : - -```python -pipeline = Pipeline(broker) -pipeline.on_error("continue") - -pipeline.call_next(task1) # Échoue, mais task2 s'exécute quand même -pipeline.call_next(task2) -``` - -**Résultat** : Task2 reçoit `None` ou un résultat partiel ; vérifiez `result.is_failed`. - -### 8.3. Compensation (Pattern Saga) - -Exécute une tâche de nettoyage si une étape échoue : - -```python -pipeline = Pipeline(broker) - -pipeline.call_next(allocate_resource) - .on_failure(compensate_allocation) # Exécute la compensation si l'étape précédente a échoué -pipeline.call_next(process) -``` - ---- - -## 9. Surveillance des Retentatives - -Suivez les métriques de retry : - -```python -from taskiq_flow import PipelineTrackingManager - -tracking = PipelineTrackingManager().with_auto_storage(broker) - -# Métriques de retry exposées dans PipelineStatus: -status = await tracking.get_status(pipeline_id) -print(f"Étapes : {len(status.steps)}") -for step in status.steps: - if step.retry_count > 0: - print(f" {step.name} : retenté {step.retry_count} fois") - print(f" Erreurs : {step.errors}") -``` - -**Métriques à surveiller** : - -- **Taux de retry** (%) de tâches nécessitant un retry -- **Nombre moyen de retentatives** par tâche -- **Top des tâches échouantes** (plus de retentatives) -- **Taille de la DLQ** (tâches abandonnées) -- **Temps passé en retry** vs travail réel - -### Intégration avec Prometheus - -```python -from prometheus_client import Counter, Summary - -RETRY_COUNT = Counter('task_retries_total', 'Total des tentatives de retry', ['task_name']) -TASK_FAILURES = Counter('task_failures_total', 'Tâches ayant échoué après 
retentatives', ['task_name']) -TASK_DURATION = Summary('task_duration_seconds', 'Temps d\'exécution des tâches', ['task_name']) - -class MetricsMiddleware(PipelineMiddleware): - async def on_step_complete(self, ctx, result): - step_name = ctx.task_name - RETRY_COUNT.labels(step_name).inc(ctx.retry_count) - TASK_DURATION.labels(step_name).observe(ctx.duration_ms / 1000) -``` - ---- - -## 10. Bonnes Pratiques - -### 10.1. Définir des Limites de Retry Raisonnables - -```python -# Ne pas retenter indéfiniment -@broker.task(max_retries=3) # Bon : borné -@broker.task(max_retries=None) # Mauvais : retentatives infinies -``` - -### 10.2. Utiliser l'Exponential Backoff - -Implémenté via `retry_backoff` : - -```python -@broker.task(max_retries=5, retry_delay=2.0, retry_backoff=2.0) -# Délais : 2s, 4s, 8s, 16s, 32s -``` - -### 10.3. Ajouter du Jitter - -Randomisez les délais pour éviter le "thundering herd" : - -```python -retry_backoff=2.0, retry_jitter=True # Ajoute ±10% de jitter -``` - -### 10.4. Fixer des Délais Max - -```python -# Timeout global incluant les retentatives -@broker.task(retry_timeout=300) # Abandon après 5 minutes totales -``` - -### 10.5. Logger Chaque Retry - -```python -import logging -logger = logging.getLogger(__name__) - -@broker.task( - max_retries=3, - on_retry=lambda attempt, exc: logger.warning(f"Retry {attempt} pour la tâche : {exc}") -) -``` - -### 10.6. Séparer Erreurs Transitoires vs Permanentes - -```python -@broker.task -async def smart_task(): - try: - return await call_api() - except (Timeout, ConnectionError) as e: - raise RetryException("Erreur transitoire") from e # Sera retentée - except NotFoundError: - raise # Pas de retry, échec permanent -``` - -### 10.7. 
DLQ pour Investigation - -Ne jetez jamais les tâches échouées sans revue : - -```python -dlq = DLQManager(broker) -# Examiner périodiquement la DLQ -failed = await dlq.list_failed(limit=100) -for task in failed: - logger.error(f"Tâche DLQ {task.task_id} : {task.error}") - # Penser à rejouer manuellement ou corriger les données -``` - ---- - -## 11. Pièges Courants - -| Piège | Conséquence | Solution | -|-------|-------------|----------| -| Retentatives infinies (`max_retries=None`) | Système bloqué en boucle de retry | Fixer une limite explicite | -| Pas de backoff (delay=0) | Service submergé | Utiliser exponential backoff | -| Retenter sur erreurs de validation | Ressources gaspillées | Distinguer les types d'erreur | -| Pas de DLQ | Tâches échouées perdues | Configurer la DLQ | -| Timeout plus court que délai de retry | Timeout prématuré | S'assurer que timeout > somme des délais de retry | -| Multiples retentatives sur tâches non-idempotentes | Effets de bord en double | Rendre les tâches idempotentes ou limiter retry | - ---- - -## 12. 
Résumé - -| Fonctionnalité | Niveau Tâche | Niveau Pipeline | -|----------------|--------------|-----------------| -| **Limite de retry** | `@broker.task(max_retries=N)` | `pipeline.with_retry(max_attempts=N)` | -| **Délai** | `retry_delay` | `delay` | -| **Backoff** | `retry_backoff` | `backoff` | -| **Timeout** | `timeout` par tâche | `with_timeout(seconds)` global | -| **DLQ** | Via `RetryMiddleware` | Hérité des tâches | - -**Pipeline résilient complet** : - -```python -tracking = PipelineTrackingManager().with_auto_storage(broker) - -pipeline = Pipeline(broker).with_tracking(tracking) -pipeline.with_retry(max_attempts=3, delay=2.0, backoff=2.0) -pipeline.with_timeout(seconds=300) -pipeline.on_error("continue") # Ou utiliser des étapes de compensation - -# Ajouter middleware de retry avec DLQ -from taskiq_flow.middlewares.retry import RetryMiddleware -broker.add_middlewares(RetryMiddleware(max_retries=3, dlq_queue="failed_tasks")) -``` - ---- - -## Prochaines Étapes - -- **[Guide de Performance]({{ '/fr/guides/performance/' | relative_url }})** — Optimiser l'exécution et l'usage des ressources -- **[Guide d'Ordonnancement]({{ '/fr/guides/scheduling/' | relative_url }})** — Ordonnancement automatique des pipelines -- **[Guide de Suivi]({{ '/fr/guides/tracking/' | relative_url }})** — Surveiller les métriques de retry en production - ---- - -*Les pannes arrivent. Réessayez intelligemment. Tout suivez.* +--- +title: Guide des Retentatives et de la Gestion d'Erreurs +nav_order: 26 +--- +# Guide des Retentatives et de la Gestion d'Erreurs + +**Exécution de pipeline résiliente avec politiques de retry, backoff et files de lettres mortes** + +> **Version** : {VERSION} | **Lié** : [Guide d'Exécution]({{ '/fr/guides/execution/' | relative_url }}), [Guide d'Ordonnancement]({{ '/fr/guides/scheduling/' | relative_url }}) + +--- + +## Aperçu + +Les pannes sont inévitables dans les systèmes distribués. 
Taskiq-Flow fournit des mécanismes complets de retry et de gestion d'erreurs pour garantir la robustesse des pipelines. + +Ce guide couvre : + +- Politiques de retry au niveau tâche et pipeline +- Stratégies d'exponential backoff +- Dead Letter Queues (DLQ) pour les échecs irrécupérables +- Logique de retry conditionnel +- Configuration des timeouts +- Surveillance des métriques de retry + +--- + +## 1. Comprendre les Retentatives + +Une **retry** (réessai) est la ré-exécution automatique d'une tâche échouée avec les mêmes entrées. Les politiques de retry définissent **quand** et **comment** réessayer. + +### Quand Retenter + + **Bons candidats pour le retry** : + +- Timeouts réseau (API externe indisponible) +- Erreurs de connexion base de données (transitoires) +- Limitation de débit (header retry-after) +- Épuisement temporaire des ressources + + **Ne PAS retenter** : + +- Erreurs de validation (mauvaise entrée ne se corrigera pas) +- Erreurs de programmation (bug dans le code) +- Données manquantes (ne réapparaîtront pas) +- Échecs permanents (404 Not Found, 401 Unauthorized) + +--- + +## 2. Retry au Niveau Tâche + +Configurez le retry directement sur le décorateur de tâche : + +```python +@broker.task( + max_retries=3, # Nombre maximum de retentatives (défaut : 0 = pas de retry) + retry_delay=5.0, # Secondes entre les retentatives + retry_backoff=2.0, # Multiplie le délai après chaque tentative + retry_timeout=60 # Timeout global incluant les retentatives +) +async def flaky_api_call(): + response = await call_external_api() + return response.json() +``` + +**Séquence de retry** : + +| Tentative | Délai | Cumulé | +|-----------|-------|--------| +| 1 (initiale) | 0s | 0s | +| 2 (retry 1) | 5s | 5s | +| 3 (retry 2) | 10s (5 × 2) | 15s | +| 4 (retry 3) | 20s (10 × 2) | 35s | +| Échec final | — | 35s | + +--- + +## 3. 
Retry au Niveau Pipeline + +Appliquez une politique de retry cohérente à toutes les tâches d'un pipeline : + +```python +pipeline = Pipeline(broker) +pipeline.with_retry( + max_attempts=3, + delay=2.0, # Délai initial + backoff=1.5, # Multiplicateur de backoff + on_retry=None # Callback optionnel +) +``` + +Toutes les tâches de ce pipeline héritent de cette politique à moins qu'elles n'en aient une propre. + +**Priorité** : Le niveau tâche écrase le niveau pipeline. + +--- + +## 4. Politiques de Retry Personnalisées + +Pour un contrôle fin, implémentez `RetryPolicy` : + +```python +from taskiq_flow import RetryPolicy + +class MyRetryPolicy(RetryPolicy): + def should_retry(self, attempt: int, exception: Exception) -> bool: + # Retente uniquement sur erreurs réseau, max 5 tentatives + if attempt >= 5: + return False + return isinstance(exception, NetworkError) + + def get_delay(self, attempt: int) -> float: + # Backoff personnalisé : 2^attempt + jitter aléatoire + import random + base = 2 ** attempt + jitter = random.uniform(-0.1, 0.1) * base + return max(0.5, base + jitter) + +pipeline.with_retry(policy=MyRetryPolicy()) +``` + +### 4.1. Retry Conditionnel (Sur Exceptions Spécifiques) + +```python +@broker.task +async def task_with_selective_retry(): + try: + result = await call_api() + return result + except NetworkTimeout: + # Cette exception doit être retentée + raise RetryException("Timeout, réessai autorisé") + except InvalidResponse: + # Erreur permanente ; pas de retry + raise # Échec immédiat +``` + +**Retry basé sur les exceptions** : + +```python +from taskiq.exceptions import RetryException + +@broker.task(retry_on=[NetworkError, TimeoutError]) +async def task(): + # Retente automatiquement sur ces types d'exceptions + pass +``` + +--- + +## 5. 
Exponential Backoff avec Jitter + +Évitez le problème du "thundering herd" (tous les retentements en même temps) : + +```python +import random + +def exponential_backoff_with_jitter( + attempt: int, + base_delay: float = 1.0, + max_delay: float = 60.0, + backoff_factor: float = 2.0, + jitter: bool = True +) -> float: + """Calcule le délai de retry.""" + delay = min(max_delay, base_delay * (backoff_factor ** attempt)) + if jitter: + # Ajoute ±10% de jitter aléatoire + delay *= random.uniform(0.9, 1.1) + return delay + +# Utilisation dans une policy +class JitteredRetryPolicy(RetryPolicy): + def get_delay(self, attempt: int) -> float: + return exponential_backoff_with_jitter(attempt, base_delay=2.0) +``` + +**Pourquoi le jitter ?** Empêche les vagues synchronisées de retentements qui submergent les services. + +--- + +## 6. Dead Letter Queues (DLQ) + +Lorsque tous les retentatifs sont épuisés, les tâches échouées doivent être stockées quelque part. + +### 6.1. Configuration DLQ + +```python +from taskiq_flow.middlewares.retry import RetryMiddleware + +broker.add_middlewares( + RetryMiddleware( + max_retries=3, + dlq_queue="failed_tasks" # Les tâches vont ici après épuisement des retentatives + ) +) +``` + +**Comportement** : + +1. Tâche échoue → retry 1 (après délai) +2. Échoue à nouveau → retry 2 (délai plus long) +3. Échoue à nouveau → retry 3 +4. Échoue tous les retentatives → déplacement vers la file `failed_tasks` + +### 6.2. 
Inspection & Rejeu DLQ + +```python +from taskiq_flow.middlewares.retry import DLQManager + +dlq = DLQManager(broker) + +# Lister les tâches échouées +failed_tasks = await dlq.list_failed() +for task_info in failed_tasks: + print(f"Tâche {task_info.task_id} échouée : {task_info.error}") + +# Rejouer une tâche échouée (remettre en file d'attente) +await dlq.retry_task(task_id) + +# Supprimer définitivement une tâche échouée +await dlq.delete_task(task_id) + +# Suppression en masse plus ancienne que N jours +await dlq.cleanup_older_than(days=7) +``` + +### 6.3. Alerting DLQ + +Mettez en place des alertes lorsque des tâches vont en DLQ : + +```python +class DLQAlertListener: + async def on_task_to_dlq(self, task_id: str, error: str): + send_slack_alert(f"Tâche {task_id} échouée après retentatives : {error}") + create_incident_ticket(task_id, error) + +dlq_manager = DLQManager(broker).with_listener(DLQAlertListener()) +``` + +--- + +## 7. Timeouts + +Évitez que les tâches ne s'exécutent indéfiniment. + +### 7.1. Timeout au Niveau Tâche + +```python +@broker.task(timeout=30) # secondes +async def potentially_slow_task(): + await long_running_operation() +``` + +Si la tâche dépasse 30 secondes, une `asyncio.TimeoutError` est levée et la politique de retry s'applique. + +### 7.2. Timeout au Niveau Pipeline + +```python +pipeline = Pipeline(broker) +pipeline.with_timeout(seconds=300) # 5 minutes pour l'ensemble du pipeline +``` + +Annule toutes les étapes en cours lorsque le timeout expire. + +### 7.3. Timeout au Niveau Étape (Avancé) + +```python +from taskiq_flow.steps import TimeoutStep + +pipeline = Pipeline(broker) +pipeline.call_next(TimeoutStep(my_task, timeout=10.0)) +``` + +--- + +## 8. Propagation des Erreurs + +### 8.1. 
Échec Rapide (Par Défaut) + +Le pipeline s'arrête à la première erreur : + +```python +pipeline = Pipeline(broker) +# Par défaut : on_error="stop" + +pipeline.call_next(task1) # Échoue → le pipeline s'arrête, task2 ne s'exécute jamais +pipeline.call_next(task2) +``` + +### 8.2. Continuer en Cas d'Erreur + +Continue d'exécuter les étapes restantes malgré les échecs : + +```python +pipeline = Pipeline(broker) +pipeline.on_error("continue") + +pipeline.call_next(task1) # Échoue, mais task2 s'exécute quand même +pipeline.call_next(task2) +``` + +**Résultat** : Task2 reçoit `None` ou un résultat partiel ; vérifiez `result.is_failed`. + +### 8.3. Compensation (Pattern Saga) + +Exécute une tâche de nettoyage si une étape échoue : + +```python +pipeline = Pipeline(broker) + +pipeline.call_next(allocate_resource) \ + .on_failure(compensate_allocation) # Exécute la compensation si l'étape précédente a échoué +pipeline.call_next(process) +``` + +--- + +## 9. Surveillance des Tentatives + +Suivez les métriques de retry : + +```python +from taskiq_flow import PipelineTrackingManager + +tracking = PipelineTrackingManager().with_auto_storage(broker) + +# Métriques de retry exposées dans PipelineStatus: +status = await tracking.get_status(pipeline_id) +print(f"Étapes : {len(status.steps)}") +for step in status.steps: + if step.retry_count > 0: + print(f" {step.name} : retenté {step.retry_count} fois") + print(f" Erreurs : {step.errors}") +``` + +**Métriques à surveiller** : + +- **Taux de retry** (%) de tâches nécessitant un retry +- **Nombre moyen de tentatives** par tâche +- **Tâches échouant le plus souvent** (le plus de tentatives) +- **Taille de la DLQ** (tâches abandonnées) +- **Temps passé en retry** vs travail réel + +### Intégration avec Prometheus + +```python +from prometheus_client import Counter, Summary + +RETRY_COUNT = Counter('task_retries_total', 'Total des tentatives de retry', ['task_name']) +TASK_FAILURES = Counter('task_failures_total', 'Tâches ayant échoué après 
toutes les tentatives', ['task_name']) +TASK_DURATION = Summary('task_duration_seconds', 'Temps d\'exécution des tâches', ['task_name']) + +class MetricsMiddleware(PipelineMiddleware): + async def on_step_complete(self, ctx, result): + step_name = ctx.task_name + RETRY_COUNT.labels(step_name).inc(ctx.retry_count) + TASK_DURATION.labels(step_name).observe(ctx.duration_ms / 1000) +``` + +--- + +## 10. Bonnes Pratiques + +### 10.1. Définir des Limites de Retry Raisonnables + +```python +# Ne pas retenter indéfiniment +@broker.task(max_retries=3) # Bon : borné +@broker.task(max_retries=None) # Mauvais : tentatives infinies +``` + +### 10.2. Utiliser l'Exponential Backoff + +Implémenté via `retry_backoff` : + +```python +@broker.task(max_retries=5, retry_delay=2.0, retry_backoff=2.0) +# Délais : 2s, 4s, 8s, 16s, 32s +``` + +### 10.3. Ajouter du Jitter + +Randomisez les délais pour éviter le "thundering herd" : + +```python +retry_backoff=2.0, retry_jitter=True # Ajoute ±10% de jitter +``` + +### 10.4. Fixer des Délais Max + +```python +# Timeout global incluant les tentatives +@broker.task(retry_timeout=300) # Abandon après 5 minutes au total +``` + +### 10.5. Logger Chaque Retry + +```python +import logging +logger = logging.getLogger(__name__) + +@broker.task( + max_retries=3, + on_retry=lambda attempt, exc: logger.warning(f"Retry {attempt} pour la tâche : {exc}") +) +``` + +### 10.6. Séparer Erreurs Transitoires vs Permanentes + +```python +@broker.task +async def smart_task(): + try: + return await call_api() + except (Timeout, ConnectionError) as e: + raise RetryException("Erreur transitoire") from e # Sera retentée + except NotFoundError: + raise # Pas de retry, échec permanent +``` + +### 10.7.
DLQ pour Investigation + +Ne jetez jamais les tâches échouées sans revue : + +```python +dlq = DLQManager(broker) +# Examiner périodiquement la DLQ +failed = await dlq.list_failed(limit=100) +for task in failed: + logger.error(f"Tâche DLQ {task.task_id} : {task.error}") + # Penser à rejouer manuellement ou corriger les données +``` + +--- + +## 11. Pièges Courants + +| Piège | Conséquence | Solution | +|-------|-------------|----------| +| Tentatives infinies (`max_retries=None`) | Système bloqué en boucle de retry | Fixer une limite explicite | +| Pas de backoff (delay=0) | Service submergé | Utiliser exponential backoff | +| Retenter sur erreurs de validation | Ressources gaspillées | Distinguer les types d'erreur | +| Pas de DLQ | Tâches échouées perdues | Configurer la DLQ | +| Timeout plus court que délai de retry | Timeout prématuré | S'assurer que timeout > somme des délais de retry | +| Tentatives multiples sur tâches non-idempotentes | Effets de bord en double | Rendre les tâches idempotentes ou limiter retry | + +--- + +## 12.
Résumé + +| Fonctionnalité | Niveau Tâche | Niveau Pipeline | +|----------------|--------------|-----------------| +| **Limite de retry** | `@broker.task(max_retries=N)` | `pipeline.with_retry(max_attempts=N)` | +| **Délai** | `retry_delay` | `delay` | +| **Backoff** | `retry_backoff` | `backoff` | +| **Timeout** | `timeout` par tâche | `with_timeout(seconds)` global | +| **DLQ** | Via `RetryMiddleware` | Hérité des tâches | + +**Pipeline résilient complet** : + +```python +tracking = PipelineTrackingManager().with_auto_storage(broker) + +pipeline = Pipeline(broker).with_tracking(tracking) +pipeline.with_retry(max_attempts=3, delay=2.0, backoff=2.0) +pipeline.with_timeout(seconds=300) +pipeline.on_error("continue") # Ou utiliser des étapes de compensation + +# Ajouter middleware de retry avec DLQ +from taskiq_flow.middlewares.retry import RetryMiddleware +broker.add_middlewares(RetryMiddleware(max_retries=3, dlq_queue="failed_tasks")) +``` + +--- + +## Prochaines Étapes + +- **[Guide de Performance]({{ '/fr/guides/performance/' | relative_url }})** — Optimiser l'exécution et l'usage des ressources +- **[Guide d'Ordonnancement]({{ '/fr/guides/scheduling/' | relative_url }})** — Ordonnancement automatique des pipelines +- **[Guide de Suivi]({{ '/fr/guides/tracking/' | relative_url }})** — Surveiller les métriques de retry en production + +--- + +*Les pannes arrivent. Réessayez intelligemment. 
Suivez tout.* diff --git a/docs/_fr/guides/scheduling.md b/docs/_fr/guides/scheduling.md index b6cfb1d..135b369 100644 --- a/docs/_fr/guides/scheduling.md +++ b/docs/_fr/guides/scheduling.md @@ -1,865 +1,865 @@ ---- -title: Guide de Planification des Pipelines -nav_order: 25 ---- -# Guide de Planification des Pipelines - -**Planification cron, intervalles et exécutions uniques avec PipelineScheduler** - -> **Version** : {VERSION} | **Lié** : [Guide d'Exécution]({{ '/fr/guides/execution/' | relative_url }}), [Guide de Suivi]({{ '/fr/guides/tracking/' | relative_url }}) - ---- - -## Aperçu - -Taskiq-Flow inclut un système de planification puissant pour exécuter des pipelines à heures fixes ou intervalles réguliers, construit sur APScheduler. - -Ce guide couvre : - -- `PipelineScheduler` — Interface principale de planification -- Expressions cron et motifs -- Planification par intervalle -- Exécutions uniques (one-off) -- Gestion des fuseaux horaires -- Persistance et gestion des jobs -- Gestion des exécutions manquées - ---- - -## 1. Démarrage Rapide - -> **Prérequis** : Installer l'extra `scheduler` : -> `pip install "taskiq-flow[scheduler]"` -> Sans lui, `PipelineScheduler` lève `ImportError` à la construction. - -```python -from taskiq_flow import Pipeline, PipelineScheduler - -# Create your pipeline -pipeline = Pipeline(broker).call_next(my_task).call_next(another_task) - -# Créer le planificateur # nécessite taskiq-flow[scheduler] -scheduler = PipelineScheduler(broker) - -# Planifier pour exécution chaque minute -job_id = await scheduler.schedule( - pipeline, - cron="* * * * *", # Toutes les minutes - args=("données",) # Arguments passés à pipeline.kiq() -) - -# Démarrer le planificateur (tourne en arrière-plan) -await scheduler.start() - -# ... garder votre application en vie ... -# le scheduler tourne en tâches de fond - -# Arrêt gracieux -await scheduler.shutdown() -``` - -C'est la base. Explorons les fonctionnalités en détail. - ---- - -## 2.
PipelineScheduler - -La classe principale pour planifier les exécutions de pipeline. - -### 2.1. Initialisation - -```python -from taskiq_flow import PipelineScheduler - -scheduler = PipelineScheduler( - broker, - store="memory", # "memory" ou "sqlite" - store_path="./scheduler_jobs.db" # pour store sqlite -) -``` - -**Options de stockage**: - -| Store | Persistance | Multi-worker | Cas d'usage | -|-------|-------------|--------------|-------------| -| `"memory"` | Non | Non | Développement, mono-processus | -| `"sqlite"` | Oui | Limité* | Production mono-worker, persistance simple | -| `"postgresql"` (via URL) | Oui | Oui | Production multi-worker, haute disponibilité | -| `"mysql"` (via URL) | Oui | Oui | Production multi-worker, alternative PostgreSQL | -| `"redis"` | | | **Non implémenté** (placeholder lève `NotImplementedError`) | - -*Le store sqlite fonctionne avec une seule instance de scheduler ; multiples workers nécessitent DB externe (PostgreSQL/MySQL). - -**Recommandation** : -- Dev/mocks → `store="memory"` -- Production mono-worker → `store="sqlite"` avec chemin persistant -- Production distribué → `store="postgresql://user:pass@host/dbname"` (recommandé) #pragma: allowlist secret - -> **Note** : Le support PostgreSQL et MySQL est **déjà implémenté** dans `taskiq_flow.scheduling.storage.JobPersistenceManager` et fonctionne via SQLAlchemy avec `sqlalchemy.asyncio.AsyncSession`. Voir la section [Stockage Avancé (PostgreSQL/MySQL)](#stockage-avancé-postgresqlmysql) ci-dessous. - -### 2.2. 
Démarrage & Arrêt - -```python -# Démarrer le scheduler (commence à surveiller les schedules) -await scheduler.start() - -# Tourner en arrière-plan pendant que l'app tourne -# Typiquement intégré aux événements de lifespan FastAPI/Quart - -# Arrêt gracieux -await scheduler.shutdown() -# Attend que les jobs en cours finissent, annule les pending -``` - -**Démarrage automatique avec context manager**: - -```python -async with PipelineScheduler(broker) as scheduler: - await scheduler.schedule(pipeline, cron="*/5 * * * *") - # Le scheduler démarre automatiquement sur __aenter__ - # ... exécuter votre app ... -# Arrêt automatique sur __aexit__ -``` - ---- - -## 3. Méthodes de Planification - -### 3.1. Planification Cron - -```python -job_id = await scheduler.schedule( - pipeline, - cron="0 * * * *", # Toutes les heures à minute 0 - args=("input_data",), - kwargs={"key": "value"}, - pipeline_id="job_horaire_001" -) -``` - -**Format expression cron**: `minute heure jour mois jour-semaine` - -| Champ | Valeurs autorisées | Caractères spéciaux | -|-------|-------------------|---------------------| -| Minute | 0-59 | `* , - /` | -| Heure | 0-23 | `* , - /` | -| Jour | 1-31 | `* , - / ?` | -| Mois | 1-12 | `* , - /` | -| Jour semaine | 0-6 (Dim-Sam) | `* , - / ?` | - -**Exemples**: - -```python -"*/5 * * * *" # Toutes les 5 minutes -"0 9 * * *" # Quotidien à 9h00 -"0 0 * * 0" # Hebdomadaire dimanche à minuit -"0 0 1 * *" # Mensuel le 1er à minuit -"0 0 1 1 *" # Annuel 1er janvier à minuit -``` - -### 3.2. Planification par Intervalle - -```python -# Exécuter toutes les N secondes/minutes/heures/jours/semaines -job_id = await scheduler.schedule_interval( - pipeline, - seconds=30, # Toutes les 30 secondes - # minutes=5, # Toutes les 5 minutes - # hours=1, # Toutes les heures - args=(data,) -) -``` - -**Note** : La planification par intervalle utilise `IntervalTrigger` d'APScheduler. Le cron est généralement préféré en production (plus flexible, gère DST). - -### 3.3. 
Exécution Unique (Run At) - -Planifier une seule exécution future: - -```python -from datetime import datetime, timedelta - -job_id = await scheduler.schedule_at( - pipeline, - run_at=datetime.now() + timedelta(hours=2), # Dans 2 heures - args=(payload,) -) -``` - -Ou planifier pour un horaire calendaire spécifique: - -```python -run_time = datetime(2026, 12, 31, 23, 59, 59) -await scheduler.schedule_at(pipeline, run_at=run_time) -``` - ---- - -## 4. Configuration du Job - -### 4.1. ID de Job - -Chaque job planifié reçoit un identifiant unique: - -```python -job_id = await scheduler.schedule(pipeline, cron="* * * * *") -print(job_id) # ex: "job_20260505_abcdef123456" -``` - -Personnaliser l'ID: - -```python -job_id = await scheduler.schedule( - pipeline, - cron="0 9 * * *", - job_id="etl_quotidien_9h" # ID lisible par humain -) -``` - -Utile pour gestion ultérieure (update, cancel, list). - -### 4.2. Arguments & Kwargs - -Passer des arguments à la méthode `kiq()` du pipeline: - -```python -await scheduler.schedule( - pipeline, - cron="* * * * *", - args=("positional_arg",), # tuple - kwargs={"option": True}, # dict - pipeline_id="my_pipeline" # explicit pipeline ID -) -``` - -Le scheduler appelle : `await pipeline.kiq(*args, **kwargs)` à chaque déclenchement. - -### 4.3. ID de Pipeline - -Chaque exécution planifiée peut surcharger l'ID par défaut du pipeline: - -```python -pipeline = Pipeline(broker) # génère ID aléatoire par défaut - -# Planifier avec ID explicite (assure unicité pour suivi) -await scheduler.schedule( - pipeline, - cron="*/5 * * * *", - pipeline_id="my_pipeline_v1" -) -``` - -**Bonne pratique** : Inclure timestamp ou version dans l'ID pour suivi: - -```python -job_id = f"batch_process_v2_{int(time.time())}" -``` - ---- - -## 5. Gestion des Jobs - -### 5.1. 
Lister les Jobs Planifiés - -```python -jobs = await scheduler.list_jobs() -for job in jobs: - print(f"ID: {job.id}") - print(f" Trigger: {job.trigger}") - print(f" Next run: {job.next_run_time}") - print(f" Pipeline: {job.pipeline_id}") -``` - -### 5.2. Obtenir les Détails d'un Job - -```python -job = await scheduler.get_job(job_id) -if job: - print(f"Job {job.id} prévu pour {job.next_run_time}") -``` - -### 5.3. Modifier un Job - -```python -# Replanifier un job existant -await scheduler.reschedule_job( - job_id, - cron="0 */2 * * *" # Changer pour toutes les 2 heures -) - -# Mettre à jour les arguments du job -await scheduler.modify_job( - job_id, - args=("nouvel_arg",), - kwargs={"mis_à_jour": True} -) -``` - -### 5.4. Supprimer (Annuler) un Job - -```python -await scheduler.remove_job(job_id) -# Les exécutions futures sont annulées ; le job en cours continue -``` - -### 5.5. Pause & Reprise - -```python -# Mettre en pause temporairement un job -await scheduler.pause_job(job_id) - -# Reprendre plus tard -await scheduler.resume_job(job_id) -``` - ---- - -## 6. Suivi des Exécutions Planifiées - -Chaque exécution de pipeline planifiée est automatiquement suivie si le pipeline a le suivi activé: - -```python -tracking = PipelineTrackingManager().with_auto_storage(broker) -pipeline = Pipeline(broker).with_tracking(tracking) - -scheduler = PipelineScheduler(broker) -await scheduler.schedule(pipeline, cron="*/5 * * * *") - -# Later, query execution history -history = await tracking.get_history() -for run in history: - print(f"Run {run.pipeline_id}: {run.status} at {run.started_at}") -``` - -**Distinguer les runs planifiés** : Utiliser des `pipeline_id` descriptifs: - -```python -await scheduler.schedule( - pipeline, - cron="0 2 * * *", # Quotidien 2h - pipeline_id=f"etl_quotidien_{datetime.now().strftime('%Y%m%d')}" -) -# Chaque jour reçoit un ID unique pour suivi -``` - ---- - -## 7. 
Gestion des Exécutions Manquées - -Quand l'heure de déclenchement d'un job planifié est manquée (ex: downtime du scheduler, job long), APScheduler fournit des contrôles: - -### 7.1. Coalescing (Regroupement) - -Combiner multiples runs manqués en une seule exécution: - -```python -from apscheduler.triggers.cron import CronTrigger - -trigger = CronTrigger( - hour=9, - minute=0, - coalesce=True # Si scheduler down à 9h00, lance une fois à 9h05 au lieu de 5 fois -) - -job = await scheduler.schedule(pipeline, trigger=trigger) -``` - -### 7.2. Max Instances (Instances Max) - -Empêcher exécutions qui se chevauchent du même job: - -```python -# Un nouveau run ne démarre pas si l'instance précédente tourne encore -trigger = CronTrigger(minute="*/5", max_instances=1, coalesce=True) -job = await scheduler.schedule(pipeline, trigger=trigger) -# Si un run 9h00 est encore en cours à 9h05, le run 9h05 est sauté -``` - -### 7.3. Misfire Grace Time (Délai de grâce après manqué) - -Permettre une fenêtre après l'heure planifiée pendant laquelle l'exécution est toujours valide: - -```python -from apscheduler.triggers.cron import CronTrigger - -# Si le scheduler redémarre dans les 10 minutes après l'heure planifiée, lance quand même -trigger = CronTrigger( - minute="*/5", - misfire_grace_time=600 # 10 minutes en secondes -) - -job = await scheduler.schedule(pipeline, trigger=trigger) -``` - ---- - -## 8. Fuseaux Horaires - -Par défaut, APScheduler utilise le fuseau horaire système. Pour production, définir explicitement: - -```python -from apscheduler.triggers.cron import CronTrigger -import pytz - -# Planifier pour 9h00 dans le fuseau New York -trigger = CronTrigger( - hour=9, - minute=0, - timezone=pytz.timezone("America/New_York") -) - -job = await scheduler.schedule(pipeline, trigger=trigger) -``` - -Ou définir globalement sur le scheduler: - -```python -scheduler = PipelineScheduler( - broker, - timezone="UTC" # ou "America/Los_Angeles", "Europe/Paris", ... 
-) -``` - -**Gestion de l'heure d'été (DST)** : Les triggers cron avec fuseau explicite gèrent automatiquement les transitions DST. Les jobs planifiés à "9h00" s'exécutent toujours à 9h00 locale quand l'horloge change. - ---- - -## 9. Triggers Personnalisés - -Au-delà du cron et intervalles, utiliser n'importe quel trigger APScheduler: - -```python -from apscheduler.triggers.date import DateTrigger -from datetime import datetime, timedelta - -# Exécution unique à datetime spécifique -trigger = DateTrigger(run_date=datetime(2026, 12, 31, 23, 59, 59)) -job = await scheduler.schedule(pipeline, trigger=trigger) - -# Exécution après délai (from now) -trigger = DateTrigger(run_date=datetime.now() + timedelta(minutes=10)) -job = await scheduler.schedule(pipeline, trigger=trigger) -``` - -Voir documentation APScheduler pour triggers avancés (calendaires, etc.). - ---- - -## 10. Gestion des Erreurs - -### 10.1. Capturer les Erreurs d'Exécution de Job - -Encapsuler l'exécution du pipeline avec gestion d'erreur: - -```python -@broker.task -async def my_pipeline_task(data): - try: - result = await process(data) - return result - except Exception as exc: - # Log error, but let scheduler continue - logger.error(f"Pipeline failed: {exc}") - raise # Scheduler records failure, continues with next schedule -``` - -### 10.2. Callbacks d'Erreur au Niveau Scheduler - -```python -scheduler = PipelineScheduler(broker) - -@scheduler.on_error -async def handle_scheduler_error(job_id, exception): - logger.error(f"Job {job_id} échoué avec: {exception}") - envoyer_alerte_email(job_id, exception) - -await scheduler.start() -``` - -### 10.3. 
Dead Letter Queue (DLQ) - -Pour les jobs qui échouent répétitivement, router vers DLQ: - -```python -from taskiq_flow.middlewares.retry import RetryMiddleware - -# Configurer retry avec backoff -broker.add_middlewares( - RetryMiddleware( - max_retries=3, - delay=10, - backoff=2 - ) -) - -# Après max retries, la tâche va dans DLQ (si broker supporte) -# RedisStreamBroker: dead_letter_stream # nécessite taskiq-flow[brokers] -# KafkaBroker: dead_letter_topic -``` - ---- - -## 11. Monitoring des Jobs Planifiés - -### 11.1. Health Check - -```python -async def scheduler_health(): - stats = scheduler.get_stats() - return { - "scheduled_jobs": len(scheduler.get_jobs()), - "running_jobs": stats.active_jobs, - "next_run": min(job.next_run_time for job in scheduler.get_jobs()) - } -``` - -### 11.2. Logging - -Configurer logging structuré: - -```python -import logging -logger = logging.getLogger("taskiq_flow.scheduler") - -logging.basicConfig( - level=logging.INFO, - format='%(asctime)s - %(name)s - %(levelname)s - %(message)s' -) - -# Logs du scheduler: -# 2026-05-05 10:00:00 - taskiq_flow.scheduler - INFO - Running job daily_etl_9am -# 2026-05-05 10:00:05 - taskiq_flow.scheduler - INFO - Job daily_etl_9am completed successfully -``` - -### 11.3. Métriques - -Intégrer avec Prometheus: - -```python -from prometheus_client import Counter, Gauge - -SCHEDULED_JOBS = Gauge('scheduled_jobs_total', 'Total jobs planifiés') -JOB_RUNS = Counter('scheduler_job_runs_total', 'Exécutions job', ['job_id']) -JOB_FAILURES = Counter('scheduler_job_failures_total', 'Échecs job', ['job_id']) - -class MetricsScheduler(PipelineScheduler): - async def _run_job(self, job_id, pipeline): - JOB_RUNS.labels(job_id=job_id).inc() - try: - await super()._run_job(job_id, pipeline) - except Exception: - JOB_FAILURES.labels(job_id=job_id).inc() - raise -``` - ---- - -## 12. Considérations de Production - -### 12.1. 
Haute Disponibilité - -Pour déploiements production HA, lancer multiples instances de scheduler avec un job store partagé: - -```python -# Scheduler 1 -scheduler1 = PipelineScheduler( - broker, - store="postgresql", - # pragma: allownextline secret - db_url="postgresql+asyncpg://user:pass@host/db" # pragma: allowlist secret - -# Scheduler 2 (config identique) — seul un acquittera les jobs -scheduler2 = PipelineScheduler( - broker, - store="postgresql", - # pragma: allownextline secret - db_url="postgresql+asyncpg://user:pass@host/db" # pragma: allowlist secret -) -# Les job stores d'APScheduler utilisent un verrouillage ligne ; un scheduler par job -``` - -Voir [Stockage Avancé (PostgreSQL/MySQL)](#stockage-avancé-postgresqlmysql) pour la configuration détaillée. - -### 12.2. Jobs de Longue Durée - -Si une exécution de pipeline peut dépasser son intervalle de schedule: - -```python -# S'assurer pas de chevauchement -trigger = CronTrigger(minute="*/5", max_instances=1, coalesce=True) -job = await scheduler.schedule(pipeline, trigger=trigger) - -# Le pipeline lui-même a timeout -pipeline.with_timeout(seconds=300) # 5 minutes max -``` - -### 12.3. Comportement au Démarrage - -Au redémarrage du scheduler, les jobs manqués sont gérés selon `misfire_grace_time`: - -```python -# Scheduler redémarre à 9h05, job planifié pour 9h00 -# Avec misfire_grace_time=600 (10 min) : job lance à 9h05 -# Avec misfire_grace_time=0 : job sauté -trigger = CronTrigger(hour=9, misfire_grace_time=600) -``` - -### 12.5. Stockage Avancé (PostgreSQL/MySQL) - -`JobPersistenceManager` supporte nativement PostgreSQL et MySQL via SQLAlchemy AsyncEngine. 
- -#### Configuration PostgreSQL (recommandé pour production) - -```python -from taskiq_flow.scheduling.storage import JobPersistenceManager - -# PostgreSQL avec asyncpg -storage = JobPersistenceManager( - db_url="postgresql+asyncpg://user:pass@localhost:5432/taskiq_flow", # pragma: allowlist secret - async_mode=True, -) - -# Avec le helper pour générer l'URL -storage = JobPersistenceManager( - db_url=JobPersistenceManager.get_connection_url( - "postgresql", - host="localhost", - port=5432, - user="taskiq", - password="secret", # pragma: allowlist secret - database="taskiq_flow", - ), - async_mode=True, -) -``` - -#### Configuration MySQL - -```python -storage = JobPersistenceManager( - db_url="mysql+aiomysql://user:pass@localhost:3306/taskiq_flow", # pragma: allowlist secret - async_mode=True, -) -``` - -#### Configuration SQLite (développement) - -```python -# Sync (développement) -storage = JobPersistenceManager( - db_url="sqlite:///jobs.db", - async_mode=False, -) - -# Async (recommandé même pour SQLite en production) -storage = JobPersistenceManager( - db_url="sqlite+aiosqlite:///jobs.db", - async_mode=True, -) -``` - -#### Intégration avec la Persistance APScheduler - -```python -from taskiq_flow.scheduling.scheduler import PipelineScheduler -from taskiq_flow.scheduling.storage import JobPersistenceManager - -storage = JobPersistenceManager( - db_url="postgresql+asyncpg://user:pass@localhost:5432/taskiq_flow", # pragma: allowlist secret -) - -# Le store URL est passé au PipelineScheduler -scheduler = PipelineScheduler( - broker, - job_store_url="postgresql+asyncpg://user:pass@localhost:5432/taskiq_flow", # pragma: allowlist secret -) -``` - -#### Opérations CRUD du JobPersistenceManager - -```python -from datetime import datetime, timezone -from taskiq_flow.scheduling.storage import JobPersistenceManager, SchedulerJob, PipelineExecution - -storage = JobPersistenceManager(db_url="sqlite:///test.db") - -# Sauvegarder un job -job = SchedulerJob( - id="job_001", 
- pipeline_id="etl_daily", - label="ETL Quotidien", - cron="0 2 * * *", - timezone="UTC", -) -await storage.save_job(job) - -# Charger tous les jobs -jobs = await storage.load_jobs() -for j in jobs: - print(f"{j.id}: {j.cron} - {j.pipeline_id}") - -# Sauvegarder l'historique d'exécution -execution = PipelineExecution( - job_id="job_001", - pipeline_id="etl_daily", - status="success", - started_at=datetime.now(timezone.utc), - completed_at=datetime.now(timezone.utc), - duration_seconds=45.2, -) -await storage.save_execution_history("job_001", execution) - -# Récupérer l'historique -history = await storage.get_execution_history("job_001", limit=10) -for run in history: - print(f" {run.status} - {run.duration_seconds}s at {run.started_at}") -``` - -| Backend | Async | Multi-worker | Production | -|---------|-------|--------------|------------| -| SQLite | `sqlite+aiosqlite` | Single-writer | Dev / petits projets | -| PostgreSQL | `postgresql+asyncpg` | Full | Recommandé | -| MySQL | `mysql+aiomysql` | Full | Supporté | - ---- - -## 13. Motifs Courants - -### 13.1. Pipeline ETL Quotidien - -```python -@scheduler.schedule( - pipeline=etl_pipeline, - cron="0 2 * * *", # 2h00 quotidien - pipeline_id="etl_quotidien" -) -async def run_daily_etl(): - pass -``` - -### 13.2. Health Check Périodique - -```python -health_pipeline = Pipeline(broker).call_next(health_check_task) - -await scheduler.schedule_interval( - health_pipeline, - minutes=5, - pipeline_id="health_check_5m" -) -``` - -### 13.3. Planification Dynamique - -Créer et annuler des jobs à la volée: - -```python -# Planifier on-demand -job_id = await scheduler.schedule( - pipeline, - run_at=datetime.now() + timedelta(minutes=10) -) - -# Annuler si plus nécessaire -await scheduler.remove_job(job_id) -``` - -### 13.4. 
Pipelines en Chaîne - -Pipeline A déclenche Pipeline B via scheduling: - -```python -@broker.task -async def pipeline_a_finished(result): - # Schedule pipeline B after completion of A - job_id = await scheduler.schedule_at( - pipeline_b, - run_at=datetime.now() + timedelta(minutes=5) - ) - return job_id -``` - ---- - -## 14. Dépannage - -### Jobs Ne Lancés Pas - -**Symptôme** : Les jobs planifiés ne s'exécutent jamais. - -**Corrections** : -- Vérifier `await scheduler.start()` est appelé -- Vérifier validité expression cron: `CronTrigger.from_crontab("* * * * *")` -- Vérifier timezone correspond à l'heure attendue (vérifier TZ serveur) -- Confirmer job bien planifié (job_id non None) -- Vérifier logs scheduler pour erreurs - -### Exécution Dupliquée - -**Symptôme** : Même job s'exécute fois multiples concurremment. - -**Corrections** : -- Définir `max_instances=1` dans trigger -- Utiliser `coalesce=True` pour combiner runs manqués -- S'assurer qu'une seule instance de scheduler tourne (HA a besoin de store partagé) - -### Persistance Job Store Ne Fonctionne Pas - -**Symptôme** : Jobs disparaissent après restart malgré store sqlite. - -**Corrections** : -- Utiliser `store="sqlite"` et spécifier `store_path` -- S'assurer que le chemin de fichier est accessible et persiste entre redémarrages -- Ne pas mélanger stores memory et sqlite dans même app - -### Problèmes Timezone - -**Symptôme** : Job s'exécute à mauvaise heure (décalage de plusieurs heures). - -**Corrections** : -- Définir timezone explicite sur scheduler: `PipelineScheduler(broker, timezone="UTC")` -- Ou sur trigger: `CronTrigger(hour=9, timezone=pytz.timezone("America/New_York"))` -- Vérifier timezone système du serveur correspond aux attentes - ---- - -## 15. 
Résumé - -PipelineScheduler fournit planification robuste, production-ready : - -| Fonctionnalité | API | -|----------------|-----| -| **Cron** | `scheduler.schedule(pipeline, cron="* * * * *")` | -| **Intervalle** | `scheduler.schedule_interval(pipeline, minutes=5)` | -| **One-off** | `scheduler.schedule_at(pipeline, run_at=datetime)` | -| **Gestion** | `list_jobs()`, `remove_job()`, `pause_job()` | -| **Persistance** | SQLite (mono-worker), PostgreSQL/MySQL (multi-worker) | -| **Tracking** | Automatique avec PipelineTrackingManager | -| **Concurrence** | `max_instances`, `coalesce` contrôles | - -**Setup production typique**: - -```python -tracking = PipelineTrackingManager().with_storage(RedisPipelineStorage(redis)) -pipeline = Pipeline(broker).with_tracking(tracking) - -scheduler = PipelineScheduler( - broker, - job_store_url="postgresql+asyncpg://user:pass@host/taskiq_flow", # pragma: allowlist secret -) -await scheduler.start() - -# Schedule your jobs... -``` - ---- - -## Prochaines Étapes - -- **[Guide de Retry]({{ '/fr/guides/retry/' | relative_url }})** — Récupération d'erreur et politiques de retry -- **[Guide de Performance]({{ '/fr/guides/performance/' | relative_url }})** — Optimiser performance pipelines planifiés -- **[Guide de Suivi]({{ '/fr/guides/tracking/' | relative_url }})** — Monitorer l'historique des jobs planifiés - ---- - -*Planifiez des pipelines comme des cron jobs. Suivez-les comme jamais.* +--- +title: Guide de Planification des Pipelines +nav_order: 25 +--- +# Guide de Planification des Pipelines + +**Planification cron, intervalles et exécutions uniques avec PipelineScheduler** + +> **Version** : {VERSION} | **Lié** : [Guide d'Exécution]({{ '/fr/guides/execution/' | relative_url }}), [Guide de Suivi]({{ '/fr/guides/tracking/' | relative_url }}) + +--- + +## Aperçu + +Taskiq-Flow inclut un système de planification puissant pour exécuter des pipelines à heures fixes ou intervalles réguliers, construit sur APScheduler. 
+ +Ce guide couvre : + +- `PipelineScheduler` — Interface principale de planification +- Expressions cron et motifs +- Planification par intervalle +- Exécutions uniques (one-off) +- Gestion des fuseaux horaires +- Persistance et gestion des jobs +- Gestion des exécutions manquées + +--- + +## 1. Démarrage Rapide + +> **Prérequis** : Installer l'extra `scheduler` : +> `pip install "taskiq-flow[scheduler]"` +> Sans lui, `PipelineScheduler` lève `ImportError` à la construction. + +```python +from taskiq_flow import Pipeline, PipelineScheduler + +# Create your pipeline +pipeline = Pipeline(broker).call_next(my_task).call_next(another_task) + +# Créer le planificateur # nécessite taskiq-flow[scheduler] +scheduler = PipelineScheduler(broker) + +# Planifier pour exécution chaque minute +job_id = await scheduler.schedule( + pipeline, + cron="* * * * *", # Toutes les minutes + args=("données",) # Arguments passés à pipeline.kiq() +) + +# Démarrer le planificateur (tourne en arrière-plan) +await scheduler.start() + +# ... garder votre application en vie ... +# le scheduler tourne en tâches de fond + +# Arrêt gracieux +await scheduler.shutdown() +``` + +C'est la base. Explorons les fonctionnalités en détail. + +--- + +## 2. PipelineScheduler + +La classe principale pour planifier les exécutions de pipeline. + +### 2.1. 
Initialisation + +```python +from taskiq_flow import PipelineScheduler + +scheduler = PipelineScheduler( + broker, + store="memory", # "memory" ou "sqlite" + store_path="./scheduler_jobs.db" # pour store sqlite +) +``` + +**Options de stockage**: + +| Store | Persistance | Multi-worker | Cas d'usage | +|-------|-------------|--------------|-------------| +| `"memory"` | Non | Non | Développement, mono-processus | +| `"sqlite"` | Oui | Limité* | Production mono-worker, persistance simple | +| `"postgresql"` (via URL) | Oui | Oui | Production multi-worker, haute disponibilité | +| `"mysql"` (via URL) | Oui | Oui | Production multi-worker, alternative PostgreSQL | +| `"redis"` | | | **Non implémenté** (placeholder lève `NotImplementedError`) | + +*Le store sqlite fonctionne avec une seule instance de scheduler ; plusieurs workers nécessitent une base de données externe (PostgreSQL/MySQL). + +**Recommandation** : +- Dev/mocks → `store="memory"` +- Production mono-worker → `store="sqlite"` avec chemin persistant +- Production distribué → `store="postgresql://user:pass@host/dbname"` (recommandé) #pragma: allowlist secret + +> **Note** : Le support PostgreSQL et MySQL est **déjà implémenté** dans `taskiq_flow.scheduling.storage.JobPersistenceManager` et fonctionne via SQLAlchemy avec `sqlalchemy.ext.asyncio.AsyncSession`. Voir la section [Stockage Avancé (PostgreSQL/MySQL)](#stockage-avancé-postgresqlmysql) ci-dessous. + +### 2.2.
Démarrage & Arrêt + +```python +# Démarrer le scheduler (commence à surveiller les schedules) +await scheduler.start() + +# Tourner en arrière-plan pendant que l'app tourne +# Typiquement intégré aux événements de lifespan FastAPI/Quart + +# Arrêt gracieux +await scheduler.shutdown() +# Attend que les jobs en cours finissent, annule les pending +``` + +**Démarrage automatique avec context manager**: + +```python +async with PipelineScheduler(broker) as scheduler: + await scheduler.schedule(pipeline, cron="*/5 * * * *") + # Le scheduler démarre automatiquement sur __aenter__ + # ... exécuter votre app ... +# Arrêt automatique sur __aexit__ +``` + +--- + +## 3. Méthodes de Planification + +### 3.1. Planification Cron + +```python +job_id = await scheduler.schedule( + pipeline, + cron="0 * * * *", # Toutes les heures à minute 0 + args=("input_data",), + kwargs={"key": "value"}, + pipeline_id="job_horaire_001" +) +``` + +**Format expression cron**: `minute heure jour mois jour-semaine` + +| Champ | Valeurs autorisées | Caractères spéciaux | +|-------|-------------------|---------------------| +| Minute | 0-59 | `* , - /` | +| Heure | 0-23 | `* , - /` | +| Jour | 1-31 | `* , - / ?` | +| Mois | 1-12 | `* , - /` | +| Jour semaine | 0-6 (Dim-Sam) | `* , - / ?` | + +**Exemples**: + +```python +"*/5 * * * *" # Toutes les 5 minutes +"0 9 * * *" # Quotidien à 9h00 +"0 0 * * 0" # Hebdomadaire dimanche à minuit +"0 0 1 * *" # Mensuel le 1er à minuit +"0 0 1 1 *" # Annuel 1er janvier à minuit +``` + +### 3.2. Planification par Intervalle + +```python +# Exécuter toutes les N secondes/minutes/heures/jours/semaines +job_id = await scheduler.schedule_interval( + pipeline, + seconds=30, # Toutes les 30 secondes + # minutes=5, # Toutes les 5 minutes + # hours=1, # Toutes les heures + args=(data,) +) +``` + +**Note** : La planification par intervalle utilise `IntervalTrigger` d'APScheduler. Le cron est généralement préféré en production (plus flexible, gère DST). + +### 3.3. 
Exécution Unique (Run At) + +Planifier une seule exécution future: + +```python +from datetime import datetime, timedelta + +job_id = await scheduler.schedule_at( + pipeline, + run_at=datetime.now() + timedelta(hours=2), # Dans 2 heures + args=(payload,) +) +``` + +Ou planifier pour un horaire calendaire spécifique: + +```python +run_time = datetime(2026, 12, 31, 23, 59, 59) +await scheduler.schedule_at(pipeline, run_at=run_time) +``` + +--- + +## 4. Configuration du Job + +### 4.1. ID de Job + +Chaque job planifié reçoit un identifiant unique: + +```python +job_id = await scheduler.schedule(pipeline, cron="* * * * *") +print(job_id) # ex: "job_20260505_abcdef123456" +``` + +Personnaliser l'ID: + +```python +job_id = await scheduler.schedule( + pipeline, + cron="0 9 * * *", + job_id="etl_quotidien_9h" # ID lisible par humain +) +``` + +Utile pour gestion ultérieure (update, cancel, list). + +### 4.2. Arguments & Kwargs + +Passer des arguments à la méthode `kiq()` du pipeline: + +```python +await scheduler.schedule( + pipeline, + cron="* * * * *", + args=("positional_arg",), # tuple + kwargs={"option": True}, # dict + pipeline_id="my_pipeline" # explicit pipeline ID +) +``` + +Le scheduler appelle : `await pipeline.kiq(*args, **kwargs)` à chaque déclenchement. + +### 4.3. ID de Pipeline + +Chaque exécution planifiée peut surcharger l'ID par défaut du pipeline: + +```python +pipeline = Pipeline(broker) # génère ID aléatoire par défaut + +# Planifier avec ID explicite (assure unicité pour suivi) +await scheduler.schedule( + pipeline, + cron="*/5 * * * *", + pipeline_id="my_pipeline_v1" +) +``` + +**Bonne pratique** : Inclure timestamp ou version dans l'ID pour suivi: + +```python +job_id = f"batch_process_v2_{int(time.time())}" +``` + +--- + +## 5. Gestion des Jobs + +### 5.1. 
Lister les Jobs Planifiés + +```python +jobs = await scheduler.list_jobs() +for job in jobs: + print(f"ID: {job.id}") + print(f" Trigger: {job.trigger}") + print(f" Next run: {job.next_run_time}") + print(f" Pipeline: {job.pipeline_id}") +``` + +### 5.2. Obtenir les Détails d'un Job + +```python +job = await scheduler.get_job(job_id) +if job: + print(f"Job {job.id} prévu pour {job.next_run_time}") +``` + +### 5.3. Modifier un Job + +```python +# Replanifier un job existant +await scheduler.reschedule_job( + job_id, + cron="0 */2 * * *" # Changer pour toutes les 2 heures +) + +# Mettre à jour les arguments du job +await scheduler.modify_job( + job_id, + args=("nouvel_arg",), + kwargs={"mis_à_jour": True} +) +``` + +### 5.4. Supprimer (Annuler) un Job + +```python +await scheduler.remove_job(job_id) +# Les exécutions futures sont annulées ; le job en cours continue +``` + +### 5.5. Pause & Reprise + +```python +# Mettre en pause temporairement un job +await scheduler.pause_job(job_id) + +# Reprendre plus tard +await scheduler.resume_job(job_id) +``` + +--- + +## 6. Suivi des Exécutions Planifiées + +Chaque exécution de pipeline planifiée est automatiquement suivie si le pipeline a le suivi activé: + +```python +tracking = PipelineTrackingManager().with_auto_storage(broker) +pipeline = Pipeline(broker).with_tracking(tracking) + +scheduler = PipelineScheduler(broker) +await scheduler.schedule(pipeline, cron="*/5 * * * *") + +# Later, query execution history +history = await tracking.get_history() +for run in history: + print(f"Run {run.pipeline_id}: {run.status} at {run.started_at}") +``` + +**Distinguer les runs planifiés** : Utiliser des `pipeline_id` descriptifs: + +```python +await scheduler.schedule( + pipeline, + cron="0 2 * * *", # Quotidien 2h + pipeline_id=f"etl_quotidien_{datetime.now().strftime('%Y%m%d')}" +) +# Chaque jour reçoit un ID unique pour suivi +``` + +--- + +## 7. 
Gestion des Exécutions Manquées + +Quand l'heure de déclenchement d'un job planifié est manquée (ex: downtime du scheduler, job long), APScheduler fournit des contrôles: + +### 7.1. Coalescing (Regroupement) + +Combiner multiples runs manqués en une seule exécution: + +```python +from apscheduler.triggers.cron import CronTrigger + +trigger = CronTrigger( + hour=9, + minute=0, + coalesce=True # Si scheduler down à 9h00, lance une fois à 9h05 au lieu de 5 fois +) + +job = await scheduler.schedule(pipeline, trigger=trigger) +``` + +### 7.2. Max Instances (Instances Max) + +Empêcher exécutions qui se chevauchent du même job: + +```python +# Un nouveau run ne démarre pas si l'instance précédente tourne encore +trigger = CronTrigger(minute="*/5", max_instances=1, coalesce=True) +job = await scheduler.schedule(pipeline, trigger=trigger) +# Si un run 9h00 est encore en cours à 9h05, le run 9h05 est sauté +``` + +### 7.3. Misfire Grace Time (Délai de grâce après manqué) + +Permettre une fenêtre après l'heure planifiée pendant laquelle l'exécution est toujours valide: + +```python +from apscheduler.triggers.cron import CronTrigger + +# Si le scheduler redémarre dans les 10 minutes après l'heure planifiée, lance quand même +trigger = CronTrigger( + minute="*/5", + misfire_grace_time=600 # 10 minutes en secondes +) + +job = await scheduler.schedule(pipeline, trigger=trigger) +``` + +--- + +## 8. Fuseaux Horaires + +Par défaut, APScheduler utilise le fuseau horaire système. Pour production, définir explicitement: + +```python +from apscheduler.triggers.cron import CronTrigger +import pytz + +# Planifier pour 9h00 dans le fuseau New York +trigger = CronTrigger( + hour=9, + minute=0, + timezone=pytz.timezone("America/New_York") +) + +job = await scheduler.schedule(pipeline, trigger=trigger) +``` + +Ou définir globalement sur le scheduler: + +```python +scheduler = PipelineScheduler( + broker, + timezone="UTC" # ou "America/Los_Angeles", "Europe/Paris", ... 
+) +``` + +**Gestion de l'heure d'été (DST)** : Les triggers cron avec fuseau explicite gèrent automatiquement les transitions DST. Les jobs planifiés à "9h00" s'exécutent toujours à 9h00 locale quand l'horloge change. + +--- + +## 9. Triggers Personnalisés + +Au-delà du cron et intervalles, utiliser n'importe quel trigger APScheduler: + +```python +from apscheduler.triggers.date import DateTrigger +from datetime import datetime, timedelta + +# Exécution unique à datetime spécifique +trigger = DateTrigger(run_date=datetime(2026, 12, 31, 23, 59, 59)) +job = await scheduler.schedule(pipeline, trigger=trigger) + +# Exécution après délai (from now) +trigger = DateTrigger(run_date=datetime.now() + timedelta(minutes=10)) +job = await scheduler.schedule(pipeline, trigger=trigger) +``` + +Voir documentation APScheduler pour triggers avancés (calendaires, etc.). + +--- + +## 10. Gestion des Erreurs + +### 10.1. Capturer les Erreurs d'Exécution de Job + +Encapsuler l'exécution du pipeline avec gestion d'erreur: + +```python +@broker.task +async def my_pipeline_task(data): + try: + result = await process(data) + return result + except Exception as exc: + # Log error, but let scheduler continue + logger.error(f"Pipeline failed: {exc}") + raise # Scheduler records failure, continues with next schedule +``` + +### 10.2. Callbacks d'Erreur au Niveau Scheduler + +```python +scheduler = PipelineScheduler(broker) + +@scheduler.on_error +async def handle_scheduler_error(job_id, exception): + logger.error(f"Job {job_id} échoué avec: {exception}") + envoyer_alerte_email(job_id, exception) + +await scheduler.start() +``` + +### 10.3. 
Dead Letter Queue (DLQ) + +Pour les jobs qui échouent répétitivement, router vers DLQ: + +```python +from taskiq_flow.middlewares.retry import RetryMiddleware + +# Configurer retry avec backoff +broker.add_middlewares( + RetryMiddleware( + max_retries=3, + delay=10, + backoff=2 + ) +) + +# Après max retries, la tâche va dans DLQ (si broker supporte) +# RedisStreamBroker: dead_letter_stream # nécessite taskiq-flow[brokers] +# KafkaBroker: dead_letter_topic +``` + +--- + +## 11. Monitoring des Jobs Planifiés + +### 11.1. Health Check + +```python +async def scheduler_health(): + stats = scheduler.get_stats() + return { + "scheduled_jobs": len(scheduler.get_jobs()), + "running_jobs": stats.active_jobs, + "next_run": min(job.next_run_time for job in scheduler.get_jobs()) + } +``` + +### 11.2. Logging + +Configurer logging structuré: + +```python +import logging +logger = logging.getLogger("taskiq_flow.scheduler") + +logging.basicConfig( + level=logging.INFO, + format='%(asctime)s - %(name)s - %(levelname)s - %(message)s' +) + +# Logs du scheduler: +# 2026-05-05 10:00:00 - taskiq_flow.scheduler - INFO - Running job daily_etl_9am +# 2026-05-05 10:00:05 - taskiq_flow.scheduler - INFO - Job daily_etl_9am completed successfully +``` + +### 11.3. Métriques + +Intégrer avec Prometheus: + +```python +from prometheus_client import Counter, Gauge + +SCHEDULED_JOBS = Gauge('scheduled_jobs_total', 'Total jobs planifiés') +JOB_RUNS = Counter('scheduler_job_runs_total', 'Exécutions job', ['job_id']) +JOB_FAILURES = Counter('scheduler_job_failures_total', 'Échecs job', ['job_id']) + +class MetricsScheduler(PipelineScheduler): + async def _run_job(self, job_id, pipeline): + JOB_RUNS.labels(job_id=job_id).inc() + try: + await super()._run_job(job_id, pipeline) + except Exception: + JOB_FAILURES.labels(job_id=job_id).inc() + raise +``` + +--- + +## 12. Considérations de Production + +### 12.1. 
Haute Disponibilité + +Pour les déploiements production HA, lancer plusieurs instances de scheduler avec un job store partagé: + +```python +# Scheduler 1 +scheduler1 = PipelineScheduler( + broker, + store="postgresql", + # pragma: allownextline secret + db_url="postgresql+asyncpg://user:pass@host/db" # pragma: allowlist secret +) + +# Scheduler 2 (config identique) : seul un acquittera les jobs +scheduler2 = PipelineScheduler( + broker, + store="postgresql", + # pragma: allownextline secret + db_url="postgresql+asyncpg://user:pass@host/db" # pragma: allowlist secret +) +# Les job stores d'APScheduler utilisent un verrouillage ligne ; un scheduler par job +``` + +Voir [Stockage Avancé (PostgreSQL/MySQL)](#stockage-avancé-postgresqlmysql) pour la configuration détaillée. + +### 12.2. Jobs de Longue Durée + +Si une exécution de pipeline peut dépasser son intervalle de schedule: + +```python +# S'assurer qu'il n'y a pas de chevauchement +trigger = CronTrigger(minute="*/5", max_instances=1, coalesce=True) +job = await scheduler.schedule(pipeline, trigger=trigger) + +# Le pipeline lui-même a un timeout +pipeline.with_timeout(seconds=300) # 5 minutes max +``` + +### 12.3. Comportement au Démarrage + +Au redémarrage du scheduler, les jobs manqués sont gérés selon `misfire_grace_time`: + +```python +# Scheduler redémarre à 9h05, job planifié pour 9h00 +# Avec misfire_grace_time=600 (10 min) : job lancé à 9h05 +# Avec misfire_grace_time=0 : job sauté +trigger = CronTrigger(hour=9, misfire_grace_time=600) +``` + +### 12.5. Stockage Avancé (PostgreSQL/MySQL) + +`JobPersistenceManager` supporte nativement PostgreSQL et MySQL via SQLAlchemy AsyncEngine. 
+ +#### Configuration PostgreSQL (recommandé pour production) + +```python +from taskiq_flow.scheduling.storage import JobPersistenceManager + +# PostgreSQL avec asyncpg +storage = JobPersistenceManager( + db_url="postgresql+asyncpg://user:pass@localhost:5432/taskiq_flow", # pragma: allowlist secret + async_mode=True, +) + +# Avec le helper pour générer l'URL +storage = JobPersistenceManager( + db_url=JobPersistenceManager.get_connection_url( + "postgresql", + host="localhost", + port=5432, + user="taskiq", + password="secret", # pragma: allowlist secret + database="taskiq_flow", + ), + async_mode=True, +) +``` + +#### Configuration MySQL + +```python +storage = JobPersistenceManager( + db_url="mysql+aiomysql://user:pass@localhost:3306/taskiq_flow", # pragma: allowlist secret + async_mode=True, +) +``` + +#### Configuration SQLite (développement) + +```python +# Sync (développement) +storage = JobPersistenceManager( + db_url="sqlite:///jobs.db", + async_mode=False, +) + +# Async (recommandé même pour SQLite en production) +storage = JobPersistenceManager( + db_url="sqlite+aiosqlite:///jobs.db", + async_mode=True, +) +``` + +#### Intégration avec la Persistance APScheduler + +```python +from taskiq_flow.scheduling.scheduler import PipelineScheduler +from taskiq_flow.scheduling.storage import JobPersistenceManager + +storage = JobPersistenceManager( + db_url="postgresql+asyncpg://user:pass@localhost:5432/taskiq_flow", # pragma: allowlist secret +) + +# Le store URL est passé au PipelineScheduler +scheduler = PipelineScheduler( + broker, + job_store_url="postgresql+asyncpg://user:pass@localhost:5432/taskiq_flow", # pragma: allowlist secret +) +``` + +#### Opérations CRUD du JobPersistenceManager + +```python +from datetime import datetime, timezone +from taskiq_flow.scheduling.storage import JobPersistenceManager, SchedulerJob, PipelineExecution + +storage = JobPersistenceManager(db_url="sqlite:///test.db") + +# Sauvegarder un job +job = SchedulerJob( + id="job_001", 
+ pipeline_id="etl_daily", + label="ETL Quotidien", + cron="0 2 * * *", + timezone="UTC", +) +await storage.save_job(job) + +# Charger tous les jobs +jobs = await storage.load_jobs() +for j in jobs: + print(f"{j.id}: {j.cron} - {j.pipeline_id}") + +# Sauvegarder l'historique d'exécution +execution = PipelineExecution( + job_id="job_001", + pipeline_id="etl_daily", + status="success", + started_at=datetime.now(timezone.utc), + completed_at=datetime.now(timezone.utc), + duration_seconds=45.2, +) +await storage.save_execution_history("job_001", execution) + +# Récupérer l'historique +history = await storage.get_execution_history("job_001", limit=10) +for run in history: + print(f" {run.status} - {run.duration_seconds}s at {run.started_at}") +``` + +| Backend | Async | Multi-worker | Production | +|---------|-------|--------------|------------| +| SQLite | `sqlite+aiosqlite` | Single-writer | Dev / petits projets | +| PostgreSQL | `postgresql+asyncpg` | Full | Recommandé | +| MySQL | `mysql+aiomysql` | Full | Supporté | + +--- + +## 13. Motifs Courants + +### 13.1. Pipeline ETL Quotidien + +```python +@scheduler.schedule( + pipeline=etl_pipeline, + cron="0 2 * * *", # 2h00 quotidien + pipeline_id="etl_quotidien" +) +async def run_daily_etl(): + pass +``` + +### 13.2. Health Check Périodique + +```python +health_pipeline = Pipeline(broker).call_next(health_check_task) + +await scheduler.schedule_interval( + health_pipeline, + minutes=5, + pipeline_id="health_check_5m" +) +``` + +### 13.3. Planification Dynamique + +Créer et annuler des jobs à la volée: + +```python +# Planifier on-demand +job_id = await scheduler.schedule( + pipeline, + run_at=datetime.now() + timedelta(minutes=10) +) + +# Annuler si plus nécessaire +await scheduler.remove_job(job_id) +``` + +### 13.4. 
Pipelines en Chaîne + +Pipeline A déclenche Pipeline B via scheduling: + +```python +@broker.task +async def pipeline_a_finished(result): + # Schedule pipeline B after completion of A + job_id = await scheduler.schedule_at( + pipeline_b, + run_at=datetime.now() + timedelta(minutes=5) + ) + return job_id +``` + +--- + +## 14. Dépannage + +### Jobs Non Lancés + +**Symptôme** : Les jobs planifiés ne s'exécutent jamais. + +**Corrections** : +- Vérifier que `await scheduler.start()` est appelé +- Vérifier la validité de l'expression cron: `CronTrigger.from_crontab("* * * * *")` +- Vérifier que la timezone correspond à l'heure attendue (vérifier la TZ du serveur) +- Confirmer que le job est bien planifié (job_id non None) +- Vérifier les logs du scheduler pour d'éventuelles erreurs + +### Exécution Dupliquée + +**Symptôme** : Le même job s'exécute plusieurs fois simultanément. + +**Corrections** : +- Définir `max_instances=1` dans le trigger +- Utiliser `coalesce=True` pour combiner les runs manqués +- S'assurer qu'une seule instance de scheduler tourne (la HA nécessite un store partagé) + +### Persistance du Job Store Ne Fonctionne Pas + +**Symptôme** : Les jobs disparaissent après un restart malgré le store sqlite. + +**Corrections** : +- Utiliser `store="sqlite"` et spécifier `store_path` +- S'assurer que le chemin de fichier est accessible et persiste entre redémarrages +- Ne pas mélanger les stores memory et sqlite dans la même app + +### Problèmes de Timezone + +**Symptôme** : Le job s'exécute à la mauvaise heure (décalage de plusieurs heures). + +**Corrections** : +- Définir une timezone explicite sur le scheduler: `PipelineScheduler(broker, timezone="UTC")` +- Ou sur le trigger: `CronTrigger(hour=9, timezone=pytz.timezone("America/New_York"))` +- Vérifier que la timezone système du serveur correspond aux attentes + +--- + +## 15. 
Résumé + +PipelineScheduler fournit planification robuste, production-ready : + +| Fonctionnalité | API | +|----------------|-----| +| **Cron** | `scheduler.schedule(pipeline, cron="* * * * *")` | +| **Intervalle** | `scheduler.schedule_interval(pipeline, minutes=5)` | +| **One-off** | `scheduler.schedule_at(pipeline, run_at=datetime)` | +| **Gestion** | `list_jobs()`, `remove_job()`, `pause_job()` | +| **Persistance** | SQLite (mono-worker), PostgreSQL/MySQL (multi-worker) | +| **Tracking** | Automatique avec PipelineTrackingManager | +| **Concurrence** | `max_instances`, `coalesce` contrôles | + +**Setup production typique**: + +```python +tracking = PipelineTrackingManager().with_storage(RedisPipelineStorage(redis)) +pipeline = Pipeline(broker).with_tracking(tracking) + +scheduler = PipelineScheduler( + broker, + job_store_url="postgresql+asyncpg://user:pass@host/taskiq_flow", # pragma: allowlist secret +) +await scheduler.start() + +# Schedule your jobs... +``` + +--- + +## Prochaines Étapes + +- **[Guide de Retry]({{ '/fr/guides/retry/' | relative_url }})** — Récupération d'erreur et politiques de retry +- **[Guide de Performance]({{ '/fr/guides/performance/' | relative_url }})** — Optimiser performance pipelines planifiés +- **[Guide de Suivi]({{ '/fr/guides/tracking/' | relative_url }})** — Monitorer l'historique des jobs planifiés + +--- + +*Planifiez des pipelines comme des cron jobs. Suivez-les comme jamais.* diff --git a/docs/_fr/guides/security.md b/docs/_fr/guides/security.md index 4811c0a..c705bf7 100644 --- a/docs/_fr/guides/security.md +++ b/docs/_fr/guides/security.md @@ -26,6 +26,7 @@ Les fonctionnalités de sécurité sont configurées dans l'objet :class:`~taskiq_flow.config.TaskiqFlowConfig` ou via des variables d'environnement. 
Les principaux paramètres sont : +{% raw %} ```python from taskiq_flow import TaskiqFlowConfig @@ -64,7 +65,7 @@ config = TaskiqFlowConfig( websocket_max_connections=1000, ) ``` - +{% endraw %} La journalisation d'audit est gérée par :class:`~taskiq_flow.security.audit.AuditLogger`, instanciée automatiquement par l'API (aucun champ de configuration requis). @@ -92,19 +93,21 @@ TaskIQ-Flow prend en charge deux méthodes d'authentification : Les clients doivent inclure leur clé API dans l'en-tête ``X-API-Key`` pour les requêtes HTTP ou dans le champ ``auth`` des messages de connexion WebSocket. Exemple de requête HTTP : +{% raw %} ```http GET /api/pipelines X-API-Key: admin-key #pragma: allowlist secret ``` - +{% endraw %} ### Authentification JWT Si une clé secrète JWT est configurée (``jwt_secret``), les clients peuvent s'authentifier à l'aide d'un jeton Web Token (JWT) dans l'en-tête ``Authorization`` : +{% raw %} ``` Authorization: Bearer ``` - +{% endraw %} Le JWT doit contenir un champ ``sub`` (sujet) identifiant l'utilisateur et une liste ``roles``. ## Autorisation @@ -139,6 +142,7 @@ correctement derrière un reverse proxy ou un load-balancer qui termine TLS. Exécutez TaskIQ-Flow derrière un serveur ASGI tel qu'Uvicorn avec Docker : +{% raw %} ```dockerfile # Dockerfile FROM python:3.12-slim @@ -152,7 +156,8 @@ COPY . . EXPOSE 8000 CMD ["uvicorn", "mon_app:app", "--host", "0.0.0.0", "--port", "8000"] ``` - +{% endraw %} +{% raw %} ```yaml # docker-compose.yml services: @@ -178,11 +183,12 @@ services: volumes: redis_data: ``` - +{% endraw %} ### Reverse Proxy (nginx) Placez nginx devant l'application pour terminer TLS, renforcer HTTPS et ajouter des en-têtes de sécurité : +{% raw %} ```nginx # /etc/nginx/sites-available/taskiq-flow server { @@ -225,7 +231,7 @@ server { } } ``` - +{% endraw %} Avec cette configuration : 1. Tout le trafic HTTP est redirigé vers HTTPS (`require_https` est également appliqué côté application) 2. 
Des en-têtes de sécurité sont ajoutés à chaque réponse @@ -283,6 +289,7 @@ Les connexions WebSocket suivent le même modèle de sécurité que HTTP : Voici un exemple complet utilisant l'**API actuelle** (``create_visualization_api``, champs plats sur ``TaskiqFlowConfig``) : +{% raw %} ```python from taskiq import Taskiq, InMemoryBroker from taskiq_flow import TaskiqFlowConfig, create_visualization_api @@ -325,11 +332,12 @@ app = create_visualization_api(broker) # ── 3. Journaliseur d'audit personnalisé (optionnel) ────────────── audit_logger = AuditLogger() ``` - +{% endraw %} Lancez l'application avec ``uvicorn app:app --host 0.0.0.0 --port 8000``. Tous les endpoints nécessiteront désormais une authentification. ## Tests de sécurité +{% raw %} ```bash # Sans identifiants → 401 Unauthorized curl -i http://localhost:8000/pipelines @@ -340,7 +348,7 @@ curl -i -H "X-API-Key: invalid-key" http://localhost:8000/pipelines # Clé viewer valide → 200 OK curl -i -H "X-API-Key: viewer-key" http://localhost:8000/pipelines ``` - +{% endraw %} Pour tester WebSocket, utilisez une bibliothèque client WebSocket et incluez l'en-tête ``X-API-Key`` lors de la requête de mise à niveau. ## Conclusion diff --git a/docs/_fr/guides/tasks.md b/docs/_fr/guides/tasks.md index 2de298f..9cca06d 100644 --- a/docs/_fr/guides/tasks.md +++ b/docs/_fr/guides/tasks.md @@ -1,498 +1,498 @@ ---- -title: Guide des Tâches -nav_order: 21 ---- -# Guide des Tâches - -**Définition des tâches, décorateurs, métadonnées et gestion des ressources** - -> **Version** : {VERSION} | **Lié** : [Guide des Pipelines]({{ '/fr/guides/pipelines/' | relative_url }}), [Guide d'Exécution]({{ '/fr/guides/execution/' | relative_url }}) - ---- - -## Aperçu - -Les tâches sont les blocs de construction fondamentaux des pipelines Taskiq-Flow. 
Ce guide couvre : - -- Définition des tâches avec `@broker.task` -- Le décorateur `@pipeline_task` pour les pipelines dataflow -- Métadonnées et annotations des tâches -- Profils de ressources et contraintes -- Configuration des retentatives -- Spécification des entrées/sorties - ---- - -## 1. Qu'est-ce qu'une Tâche ? - -Une **Tâche** est une fonction asynchrone qui peut être exécutée par un broker Taskiq, éventuellement avec une logique de retry, des timeouts et des métadonnées pour l'orchestration de pipeline. - -### Définition Minimale d'une Tâche - -```python -from taskiq import InMemoryBroker - -broker = InMemoryBroker() - -@broker.task -async def my_task(value: int) -> int: - return value * 2 -``` - -**Exigences** : - -- Doit être une fonction `async def` (ou `def` normale pour les tâches synchrones) -- Doit être décorée avec `@broker.task` (ou `@broker.task(...)` avec options) -- Peut accepter n'importe quels paramètres sérialisables -- Doit retourner une valeur sérialisable en JSON - ---- - -## 2. Décorateurs de Tâche - -### 2.1. `@broker.task` — Tâche de Base - -```python -@broker.task -def add(a: int, b: int) -> int: - return a + b -``` - -**Options** : - -```python -@broker.task( - timeout=30, # Secondes avant timeout de la tâche - retry_policy=None, # RetryPolicy personnalisée (voir Guide des Retentatives) - max_retries=3, # Remplacer la valeur globale par défaut - queue="default", # Router vers une file spécifique - labels={"type": "cpu"} # Métadonnées labels personnalisées -) -async def slow_task(): - await asyncio.sleep(10) - return "done" -``` - -### 2.2. 
`@pipeline_task` — Annotation Dataflow - -Pour `DataflowPipeline`, utilisez `@pipeline_task(output=...)` pour déclarer ce que la tâche produit : - -```python -from taskiq_flow import pipeline_task - -@broker.task -@pipeline_task(output="features") -def extract(data: list[str]) -> dict: - return {"features": compute_features(data)} - -# La tâche en aval reçoit automatiquement le paramètre 'features' : -@broker.task -@pipeline_task(output="tags") -def tag(features: dict) -> list[str]: - # 'features' est automatiquement passé depuis extract_task - return generate_tags(features) -``` - -**Paramètres** : - -| Paramètre | Type | Description | -|-----------|------|-------------| -| `output` | `str` | Nom de la clé de sortie (doit correspondre aux noms de paramètres en aval) | -| `outputs` | `list[str]` | Sorties multiples (pour les tâches renvoyant un tuple) | -| `inputs` | `list[str]` | Dépendances d'entrée explicites (remplace la détection automatique) | -| `description` | `str` | Description lisible de la tâche | - -**Sorties multiples** : - -```python -@broker.task -@pipeline_task(outputs=["features", "metadata"]) -def split_output(data: str) -> tuple[dict, dict]: - features = extract_features(data) - metadata = extract_metadata(data) - return features, metadata # tuple déballé vers les deux sorties -``` - -### 2.3. `@pipeline_task_multi_output` — Alternative - -Identique à `@pipeline_task(outputs=[...])` ; fourni pour plus de clarté : - -```python -from taskiq_flow import pipeline_task_multi_output - -@broker.task -@pipeline_task_multi_output(outputs=["x", "y"]) -def split(value: int) -> tuple[int, int]: - return value // 2, value % 2 -``` - ---- - -## 3. Métadonnées des Tâches - -Enrichissez les tâches avec des métadonnées pour la documentation, la surveillance et la découverte automatique. - -### 3.1. 
Attributs Standard - -```python -@broker.task( - name="process_audio_track", # Remplacer le nom auto-généré - labels={ - "category": "audio_processing", - "priority": "high" - } -) -async def process_track(track_id: str) -> dict: - return {"track": track_id, "status": "processed"} -``` - -### 3.2. Informations Personnalisées de Tâche - -```python -from taskiq_flow import TaskInfo - -task_info = TaskInfo( - name="extract_spectrogram", - description="Extraire le mel-spectrogramme d'un signal audio", - parameters={ - "sample_rate": {"type": "int", "default": 22050}, - "n_mels": {"type": "int", "default": 128} - }, - outputs=["spectrogram", "sample_rate"] -) - -@broker.task -@pipeline_task(output="spectrogram", description=task_info.description) -def extract_spectrogram(audio: np.ndarray, sample_rate: int = 22050, n_mels: int = 128): - # implémentation... - return spectrogram -``` - ---- - -## 4. Profils de Ressources - -Contrôlez l'allocation CPU et mémoire par tâche pour un ordonnancement conscient des ressources. - -### 4.1. Profil CPU - -```python -from taskiq_flow import CPUProfile - -@broker.task -@CPUProfile(cpu_units=2) # Requiert 2 cœurs CPU -def heavy_computation(data): - # Cette tâche sera exécutée sur des workers avec au moins 2 cœurs - pass -``` - -**Valeurs de `cpu_units`** : - -| Valeur | Signification | -|--------|---------------| -| `0.5` | Half a core (tâche d'arrière-plan) | -| `1` | Un cœur complet (par défaut) | -| `2` | Deux cœurs (intensif en CPU) | - -### 4.2. 
Profil RAM - -```python -from taskiq_flow import RAMProfile - -@broker.task -@RAMProfile(ram_mb=2048) # Requiert 2 Go de RAM -def memory_intensive(data): - # S'exécute uniquement sur les workers avec au moins 2 Go de RAM disponible - pass -``` - -**Ordonnancement conscient des ressources** (nécessite un pool de workers compatible) : - -```python -from taskiq_flow import ResourceAwareWorkerPool - -pool = ResourceAwareWorkerPool( - workers=[ - {"cpu_cores": 4, "ram_gb": 8}, - {"cpu_cores": 2, "ram_gb": 4}, - ] -) -# Les tâches sont routées vers les workers avec ressources suffisantes -``` - -### 4.3. Profils Combinés - -```python -from taskiq_flow import CPUProfile, RAMProfile - -@broker.task -@CPUProfile(cpu_units=4) -@RAMProfile(ram_mb=4096) -def gpu_style_task(data): - # Tâche à hautes ressources - pass -``` - ---- - -## 5. Spécification des Entrées/Sorties - -### 5.1.Annotations de Type pour la Documentation - -```python -@broker.task -async def process( - text: str, # Entrée requise - max_length: int = 100, # Optionnel avec valeur par défaut - *, - strict: bool = False # Argument mot-clé uniquement -) -> dict: - return {"processed": text[:max_length]} -``` - -### 5.2. Modèles Pydantic (Recommandé pour Données Complexes) - -```python -from pydantic import BaseModel - -class AudioFeatures(BaseModel): - duration: float - tempo: float - key: str - -@broker.task -async def extract_features(audio_path: str) -> AudioFeatures: - # Pydantic valide et sérialise automatiquement - return AudioFeatures(duration=180.0, tempo=120.0, key="C") -``` - -### 5.3. 
Retourner Plusieurs Valeurs - -Les tâches peuvent retourner n'importe quel type sérialisable en JSON : - -```python -@broker.task -def split(data: str) -> tuple[str, str]: - return data[:10], data[10:] # Retourne deux valeurs - -# Avec @pipeline_task(outputs=["first", "second"]) -@pipeline_task(outputs=["head", "tail"]) -def split(data): - return data[:10], data[10:] -# Produit deux sorties : "head" et "tail" -``` - ---- - -## 6. Configuration des Retentatives - -### 6.1. Retry au Niveau Tâche - -```python -@broker.task( - retry_policy={ - "max_retries": 3, - "delay": 5.0, - "backoff": 2.0 # Multiplicateur d'exponential backoff - } -) -async def flaky_task(): - # Réessayera jusqu'à 3 fois avec des délais : 5s, 10s, 20s - possibly_fails() -``` - -### 6.2. Retry au Niveau Pipeline - -Appliquez une politique de retry à toutes les tâches d'un pipeline : - -```python -pipeline = Pipeline(broker) -pipeline.with_retry( - max_attempts=3, - delay=2.0, # Délai initial - backoff=1.5, # Multiplicateur de backoff - on_retry=None # Callback optionnel -) -``` - -Toutes les tâches de ce pipeline héritent de cette politique à moins qu'elles n'en aient une propre. - -**Prévalence** : Le niveau tâche écrase le niveau pipeline. - -### 6.3. Retry Conditionnel - -Ne réessayez que pour des exceptions spécifiques : - -```python -from taskiq.exceptions import RetryException - -@broker.task -async def task_with_conditional_retry(): - try: - call_external_api() - except NetworkError: - raise RetryException("Erreur réseau, réessai autorisé") - except ValidationError: - raise # Échec immédiat, pas de réessai -``` - -Les stratégies de retry détaillées sont couvertes dans le [Guide des Retentatives]({{ '/fr/guides/retry/' | relative_url }}). - ---- - -## 7. Découverte & Registre des Tâches - -### 7.1. Découverte Automatique - -`DataflowPipeline.from_tasks()` détecte automatiquement les dépendances via les annotations de type et les décorateurs `@pipeline_task`. - -### 7.2. 
Enregistrement Manuel - -Pour des pipelines dynamiques, utilisez `DataflowRegistry` : - -```python -from taskiq_flow import DataflowRegistry - -registry = DataflowRegistry() - -# Enregistrer avec un mapping E/S explicite -registry.register_task( - task=process_data, - output="processed", - inputs=["raw"] # dépend de la tâche qui produit "raw" -) - -# Découverte depuis un module -import my_tasks -for task in my_tasks.ALL_TASKS: - registry.register_task_from_object(task) -``` - -Voir `examples/registry_discovery_example.py`. - ---- - -## 8. Écriture de Tâches Testables - -Les tâches doivent être des fonctions pures pour faciliter les tests : - -```python -@broker.task -def process(data: dict) -> dict: - # Fonction pure : la sortie dépend uniquement de l'entrée - return {"result": data["value"] * 2} - -# Test unitaire -def test_process(): - assert process({"value": 5}) == {"result": 10} -``` - -**Test avec broker** : - -```python -import pytest -from taskiq import InMemoryBroker - -@pytest.fixture -def test_broker(): - return InMemoryBroker(await_inplace=True) - -async def test_task_execution(test_broker): - @test_broker.task - async def my_task(x: int) -> int: - return x + 1 - - result = await my_task.kiq(5) - value = await result.wait_result() - assert value.return_value == 6 -``` - ---- - -## 9. Motifs Courants - -### 9.1. Idempotence - -Concevez les tâches pour être ré-exécutables en toute sécurité : - -```python -@broker.task -@pipeline_task(output="user_processed") -def process_user(user_id: str) -> dict: - # Vérifie si déjà traité - if cache.get(f"processed:{user_id}"): - return {"status": "already_done"} - # Exécute le traitement - result = heavy_compute(user_id) - cache.set(f"processed:{user_id}", result, ttl=3600) - return result -``` - -### 9.2. Composabilité - -Décomposez la logique complexe en petites tâches réutilisables : - -```python -@broker.task -def validate(data): ... - -@broker.task -def transform(data): ... - -@broker.task -def enrich(data): ... 
- -# Composition dans plusieurs pipelines -pipeline1 = Pipeline(broker).call_next(validate).call_next(transform) -pipeline2 = Pipeline(broker).call_next(validate).call_next(enrich) -``` - -### 9.3. Rapports de Progression - -Pour les tâches longues, signalez la progression via des callbacks ou logs : - -```python -@broker.task -async def long_task(items: list, progress_callback=None): - for i, item in enumerate(items): - result = process(item) - if progress_callback: - await progress_callback(i / len(items)) - return "done" -``` - ---- - -## 10. Antipatterns à Éviter - -| Anti-pattern | Pourquoi c'est mauvais | Meilleure approche | -|--------------|----------------------|-------------------| -| Effets de bord dans les tâches | Rend les tests difficiles, logique obscure | Gardez les tâches pures ; utilisez `.call_after()` pour les effets de bord | -| Retours de valeurs volumineux | Mémoire élevée, sérialisation lente | Stockez les résultats volumineux en externe (DB, S3) ; retournez une référence | -| État mutable partagé | Conditions de course en parallèle | Chaque tâche indépendante ; passez les données via les retours | -| I/O bloquant sans async | Bloque la boucle d'événements | Utilisez des librairies async (aiohttp, asyncpg, etc.) | -| Tâches trop grosses | Difficile à réutiliser, tester, déboguer | Découpez en tâches plus petites et ciblées | - ---- - -## 11. 
Résumé - -Les tâches Taskiq-Flow sont : - -- **Flexibles** — Fonctions Python classiques avec `@broker.task` -- **Observables** — Métadonnées, labels et suivi -- **Résilientes** — Politiques de retry, timeouts, gestion d'erreurs -- **Composables** — Petites fonctions combinées en workflows complexes -- **Conscientes des ressources** — Profils CPU/RAM pour un ordonnancement optimisé - ---- - -## Prochaines Étapes - -- **[Types de Pipelines]({{ '/fr/guides/pipelines/' | relative_url }})** — Construire des workflows avec des tâches -- **[Guide d'Exécution]({{ '/fr/guides/execution/' | relative_url }})** — Exécuter les pipelines et gérer les résultats -- **[Guide des Retentatives]({{ '/fr/guides/retry/' | relative_url }})** — Stratégies robustes de récupération d'erreurs - ---- - -*Les tâches sont vos atomes de workflow. Apprenez à les composer dans [Pipelines]({{ '/fr/guides/pipelines/' | relative_url }}).* +--- +title: Guide des Tâches +nav_order: 21 +--- +# Guide des Tâches + +**Définition des tâches, décorateurs, métadonnées et gestion des ressources** + +> **Version** : {VERSION} | **Lié** : [Guide des Pipelines]({{ '/fr/guides/pipelines/' | relative_url }}), [Guide d'Exécution]({{ '/fr/guides/execution/' | relative_url }}) + +--- + +## Aperçu + +Les tâches sont les blocs de construction fondamentaux des pipelines Taskiq-Flow. Ce guide couvre : + +- Définition des tâches avec `@broker.task` +- Le décorateur `@pipeline_task` pour les pipelines dataflow +- Métadonnées et annotations des tâches +- Profils de ressources et contraintes +- Configuration des retentatives +- Spécification des entrées/sorties + +--- + +## 1. Qu'est-ce qu'une Tâche ? + +Une **Tâche** est une fonction asynchrone qui peut être exécutée par un broker Taskiq, éventuellement avec une logique de retry, des timeouts et des métadonnées pour l'orchestration de pipeline. 
+ +### Définition Minimale d'une Tâche + +```python +from taskiq import InMemoryBroker + +broker = InMemoryBroker() + +@broker.task +async def my_task(value: int) -> int: + return value * 2 +``` + +**Exigences** : + +- Doit être une fonction `async def` (ou `def` normale pour les tâches synchrones) +- Doit être décorée avec `@broker.task` (ou `@broker.task(...)` avec options) +- Peut accepter n'importe quels paramètres sérialisables +- Doit retourner une valeur sérialisable en JSON + +--- + +## 2. Décorateurs de Tâche + +### 2.1. `@broker.task` — Tâche de Base + +```python +@broker.task +def add(a: int, b: int) -> int: + return a + b +``` + +**Options** : + +```python +@broker.task( + timeout=30, # Secondes avant timeout de la tâche + retry_policy=None, # RetryPolicy personnalisée (voir Guide des Retentatives) + max_retries=3, # Remplacer la valeur globale par défaut + queue="default", # Router vers une file spécifique + labels={"type": "cpu"} # Métadonnées labels personnalisées +) +async def slow_task(): + await asyncio.sleep(10) + return "done" +``` + +### 2.2. 
`@pipeline_task` — Annotation Dataflow + +Pour `DataflowPipeline`, utilisez `@pipeline_task(output=...)` pour déclarer ce que la tâche produit : + +```python +from taskiq_flow import pipeline_task + +@broker.task +@pipeline_task(output="features") +def extract(data: list[str]) -> dict: + return {"features": compute_features(data)} + +# La tâche en aval reçoit automatiquement le paramètre 'features' : +@broker.task +@pipeline_task(output="tags") +def tag(features: dict) -> list[str]: + # 'features' est automatiquement passé depuis extract_task + return generate_tags(features) +``` + +**Paramètres** : + +| Paramètre | Type | Description | +|-----------|------|-------------| +| `output` | `str` | Nom de la clé de sortie (doit correspondre aux noms de paramètres en aval) | +| `outputs` | `list[str]` | Sorties multiples (pour les tâches renvoyant un tuple) | +| `inputs` | `list[str]` | Dépendances d'entrée explicites (remplace la détection automatique) | +| `description` | `str` | Description lisible de la tâche | + +**Sorties multiples** : + +```python +@broker.task +@pipeline_task(outputs=["features", "metadata"]) +def split_output(data: str) -> tuple[dict, dict]: + features = extract_features(data) + metadata = extract_metadata(data) + return features, metadata # tuple déballé vers les deux sorties +``` + +### 2.3. `@pipeline_task_multi_output` — Alternative + +Identique à `@pipeline_task(outputs=[...])` ; fourni pour plus de clarté : + +```python +from taskiq_flow import pipeline_task_multi_output + +@broker.task +@pipeline_task_multi_output(outputs=["x", "y"]) +def split(value: int) -> tuple[int, int]: + return value // 2, value % 2 +``` + +--- + +## 3. Métadonnées des Tâches + +Enrichissez les tâches avec des métadonnées pour la documentation, la surveillance et la découverte automatique. + +### 3.1. 
Attributs Standard + +```python +@broker.task( + name="process_audio_track", # Remplacer le nom auto-généré + labels={ + "category": "audio_processing", + "priority": "high" + } +) +async def process_track(track_id: str) -> dict: + return {"track": track_id, "status": "processed"} +``` + +### 3.2. Informations Personnalisées de Tâche + +```python +from taskiq_flow import TaskInfo + +task_info = TaskInfo( + name="extract_spectrogram", + description="Extraire le mel-spectrogramme d'un signal audio", + parameters={ + "sample_rate": {"type": "int", "default": 22050}, + "n_mels": {"type": "int", "default": 128} + }, + outputs=["spectrogram", "sample_rate"] +) + +@broker.task +@pipeline_task(output="spectrogram", description=task_info.description) +def extract_spectrogram(audio: np.ndarray, sample_rate: int = 22050, n_mels: int = 128): + # implémentation... + return spectrogram +``` + +--- + +## 4. Profils de Ressources + +Contrôlez l'allocation CPU et mémoire par tâche pour un ordonnancement conscient des ressources. + +### 4.1. Profil CPU + +```python +from taskiq_flow import CPUProfile + +@broker.task +@CPUProfile(cpu_units=2) # Requiert 2 cœurs CPU +def heavy_computation(data): + # Cette tâche sera exécutée sur des workers avec au moins 2 cœurs + pass +``` + +**Valeurs de `cpu_units`** : + +| Valeur | Signification | +|--------|---------------| +| `0.5` | Half a core (tâche d'arrière-plan) | +| `1` | Un cœur complet (par défaut) | +| `2` | Deux cœurs (intensif en CPU) | + +### 4.2. 
Profil RAM + +```python +from taskiq_flow import RAMProfile + +@broker.task +@RAMProfile(ram_mb=2048) # Requiert 2 Go de RAM +def memory_intensive(data): + # S'exécute uniquement sur les workers avec au moins 2 Go de RAM disponible + pass +``` + +**Ordonnancement conscient des ressources** (nécessite un pool de workers compatible) : + +```python +from taskiq_flow import ResourceAwareWorkerPool + +pool = ResourceAwareWorkerPool( + workers=[ + {"cpu_cores": 4, "ram_gb": 8}, + {"cpu_cores": 2, "ram_gb": 4}, + ] +) +# Les tâches sont routées vers les workers avec ressources suffisantes +``` + +### 4.3. Profils Combinés + +```python +from taskiq_flow import CPUProfile, RAMProfile + +@broker.task +@CPUProfile(cpu_units=4) +@RAMProfile(ram_mb=4096) +def gpu_style_task(data): + # Tâche à hautes ressources + pass +``` + +--- + +## 5. Spécification des Entrées/Sorties + +### 5.1. Annotations de Type pour la Documentation + +```python +@broker.task +async def process( + text: str, # Entrée requise + max_length: int = 100, # Optionnel avec valeur par défaut + *, + strict: bool = False # Argument mot-clé uniquement +) -> dict: + return {"processed": text[:max_length]} +``` + +### 5.2. Modèles Pydantic (Recommandé pour Données Complexes) + +```python +from pydantic import BaseModel + +class AudioFeatures(BaseModel): + duration: float + tempo: float + key: str + +@broker.task +async def extract_features(audio_path: str) -> AudioFeatures: + # Pydantic valide et sérialise automatiquement + return AudioFeatures(duration=180.0, tempo=120.0, key="C") +``` + +### 5.3. 
Retourner Plusieurs Valeurs + +Les tâches peuvent retourner n'importe quel type sérialisable en JSON : + +```python +@broker.task +def split(data: str) -> tuple[str, str]: + return data[:10], data[10:] # Retourne deux valeurs + +# Avec @pipeline_task(outputs=["first", "second"]) +@pipeline_task(outputs=["head", "tail"]) +def split(data): + return data[:10], data[10:] +# Produit deux sorties : "head" et "tail" +``` + +--- + +## 6. Configuration des Retentatives + +### 6.1. Retry au Niveau Tâche + +```python +@broker.task( + retry_policy={ + "max_retries": 3, + "delay": 5.0, + "backoff": 2.0 # Multiplicateur d'exponential backoff + } +) +async def flaky_task(): + # Réessayera jusqu'à 3 fois avec des délais : 5s, 10s, 20s + possibly_fails() +``` + +### 6.2. Retry au Niveau Pipeline + +Appliquez une politique de retry à toutes les tâches d'un pipeline : + +```python +pipeline = Pipeline(broker) +pipeline.with_retry( + max_attempts=3, + delay=2.0, # Délai initial + backoff=1.5, # Multiplicateur de backoff + on_retry=None # Callback optionnel +) +``` + +Toutes les tâches de ce pipeline héritent de cette politique à moins qu'elles n'en aient une propre. + +**Prévalence** : Le niveau tâche écrase le niveau pipeline. + +### 6.3. Retry Conditionnel + +Ne réessayez que pour des exceptions spécifiques : + +```python +from taskiq.exceptions import RetryException + +@broker.task +async def task_with_conditional_retry(): + try: + call_external_api() + except NetworkError: + raise RetryException("Erreur réseau, réessai autorisé") + except ValidationError: + raise # Échec immédiat, pas de réessai +``` + +Les stratégies de retry détaillées sont couvertes dans le [Guide des Retentatives]({{ '/fr/guides/retry/' | relative_url }}). + +--- + +## 7. Découverte & Registre des Tâches + +### 7.1. Découverte Automatique + +`DataflowPipeline.from_tasks()` détecte automatiquement les dépendances via les annotations de type et les décorateurs `@pipeline_task`. + +### 7.2. 
Enregistrement Manuel + +Pour des pipelines dynamiques, utilisez `DataflowRegistry` : + +```python +from taskiq_flow import DataflowRegistry + +registry = DataflowRegistry() + +# Enregistrer avec un mapping E/S explicite +registry.register_task( + task=process_data, + output="processed", + inputs=["raw"] # dépend de la tâche qui produit "raw" +) + +# Découverte depuis un module +import my_tasks +for task in my_tasks.ALL_TASKS: + registry.register_task_from_object(task) +``` + +Voir `examples/registry_discovery_example.py`. + +--- + +## 8. Écriture de Tâches Testables + +Les tâches doivent être des fonctions pures pour faciliter les tests : + +```python +@broker.task +def process(data: dict) -> dict: + # Fonction pure : la sortie dépend uniquement de l'entrée + return {"result": data["value"] * 2} + +# Test unitaire +def test_process(): + assert process({"value": 5}) == {"result": 10} +``` + +**Test avec broker** : + +```python +import pytest +from taskiq import InMemoryBroker + +@pytest.fixture +def test_broker(): + return InMemoryBroker(await_inplace=True) + +async def test_task_execution(test_broker): + @test_broker.task + async def my_task(x: int) -> int: + return x + 1 + + result = await my_task.kiq(5) + value = await result.wait_result() + assert value.return_value == 6 +``` + +--- + +## 9. Motifs Courants + +### 9.1. Idempotence + +Concevez les tâches pour être ré-exécutables en toute sécurité : + +```python +@broker.task +@pipeline_task(output="user_processed") +def process_user(user_id: str) -> dict: + # Vérifie si déjà traité + if cache.get(f"processed:{user_id}"): + return {"status": "already_done"} + # Exécute le traitement + result = heavy_compute(user_id) + cache.set(f"processed:{user_id}", result, ttl=3600) + return result +``` + +### 9.2. Composabilité + +Décomposez la logique complexe en petites tâches réutilisables : + +```python +@broker.task +def validate(data): ... + +@broker.task +def transform(data): ... + +@broker.task +def enrich(data): ... 
+ +# Composition dans plusieurs pipelines +pipeline1 = Pipeline(broker).call_next(validate).call_next(transform) +pipeline2 = Pipeline(broker).call_next(validate).call_next(enrich) +``` + +### 9.3. Rapports de Progression + +Pour les tâches longues, signalez la progression via des callbacks ou logs : + +```python +@broker.task +async def long_task(items: list, progress_callback=None): + for i, item in enumerate(items): + result = process(item) + if progress_callback: + await progress_callback(i / len(items)) + return "done" +``` + +--- + +## 10. Antipatterns à Éviter + +| Anti-pattern | Pourquoi c'est mauvais | Meilleure approche | +|--------------|----------------------|-------------------| +| Effets de bord dans les tâches | Rend les tests difficiles, logique obscure | Gardez les tâches pures ; utilisez `.call_after()` pour les effets de bord | +| Retours de valeurs volumineux | Mémoire élevée, sérialisation lente | Stockez les résultats volumineux en externe (DB, S3) ; retournez une référence | +| État mutable partagé | Conditions de course en parallèle | Chaque tâche indépendante ; passez les données via les retours | +| I/O bloquant sans async | Bloque la boucle d'événements | Utilisez des librairies async (aiohttp, asyncpg, etc.) | +| Tâches trop grosses | Difficile à réutiliser, tester, déboguer | Découpez en tâches plus petites et ciblées | + +--- + +## 11. 
Résumé + +Les tâches Taskiq-Flow sont : + +- **Flexibles** — Fonctions Python classiques avec `@broker.task` +- **Observables** — Métadonnées, labels et suivi +- **Résilientes** — Politiques de retry, timeouts, gestion d'erreurs +- **Composables** — Petites fonctions combinées en workflows complexes +- **Conscientes des ressources** — Profils CPU/RAM pour un ordonnancement optimisé + +--- + +## Prochaines Étapes + +- **[Types de Pipelines]({{ '/fr/guides/pipelines/' | relative_url }})** — Construire des workflows avec des tâches +- **[Guide d'Exécution]({{ '/fr/guides/execution/' | relative_url }})** — Exécuter les pipelines et gérer les résultats +- **[Guide des Retentatives]({{ '/fr/guides/retry/' | relative_url }})** — Stratégies robustes de récupération d'erreurs + +--- + +*Les tâches sont vos atomes de workflow. Apprenez à les composer dans [Pipelines]({{ '/fr/guides/pipelines/' | relative_url }}).* diff --git a/docs/_fr/guides/tracking.md b/docs/_fr/guides/tracking.md index 8b37256..c6d4634 100644 --- a/docs/_fr/guides/tracking.md +++ b/docs/_fr/guides/tracking.md @@ -1,538 +1,538 @@ ---- -title: Guide de Suivi et Monitoring des Pipelines -nav_order: 23 ---- -# Guide de Suivi et Monitoring des Pipelines - -**Suivi en temps réel et historique des exécutions avec PipelineTrackingManager** - -> **Version**: 1.0.0 | **Lié** : [Guide d'Exécution]({{ '/fr/guides/execution/' | relative_url }}), [Guide WebSocket]({{ '/fr/guides/websocket/' | relative_url }}) - ---- - -## Aperçu - -Taskiq-Flow offre des capacités complètes de suivi pour monitorer les exécutions de pipeline en temps réel et historiquement. Ce guide couvre : - -- `PipelineTrackingManager` — Coordonnateur central de suivi -- Backends de stockage (Mémoire, Redis) -- Requêtes de statut et historique -- Collecte de métriques -- Écoute d'événements au niveau étape - ---- - -## 1. 
Démarrage Rapide - -```python -from taskiq_flow import Pipeline, PipelineTrackingManager - -# Initialize tracking with automatic storage selection -tracking = PipelineTrackingManager().with_auto_storage(broker) - -# Attach tracking to pipeline -pipeline = Pipeline(broker).with_tracking(tracking) - -# Execute -task = await pipeline.kiq(data) -result = await task.wait_result() - -# Query status -status = await tracking.get_status(pipeline.pipeline_id) -print(f"Status: {status.status}") # COMPLETED -print(f"Steps: {len(status.steps)}") # Number of steps executed -print(f"Duration: {status.duration_ms}ms") -``` - -C'est le pattern de base. Approfondissons. - ---- - -## 2. PipelineTrackingManager - -Le composant central pour enregistrer et récupérer les données d'exécution des pipelines. - -### 2.1. Initialisation - -```python -from taskiq_flow import PipelineTrackingManager, InMemoryPipelineStorage, RedisPipelineStorage - -# Option 1: Auto-select based on broker (recommended) -tracking = PipelineTrackingManager().with_auto_storage(broker) -# Uses Redis if broker supports it, else falls back to Memory - -# Option 2: Explicit memory storage (development only) -tracking = PipelineTrackingManager().with_storage(InMemoryPipelineStorage()) - -# Option 3: Explicit Redis storage (production) -tracking = PipelineTrackingManager().with_storage( - RedisPipelineStorage(redis_client) -) - -# Option 4: Custom storage backend -tracking = PipelineTrackingManager().with_storage(CustomStorage()) -``` - -### 2.2. Durée de Vie du Stockage - -- **InMemoryPipelineStorage** : Vit dans le processus Python seulement ; perdu au redémarrage -- **RedisPipelineStorage** : Persistant entre processus ; survit aux redémarrages - -Choisir selon le déploiement: -- Développement local → Mémoire -- Production mono-worker → Mémoire (si pas de redémarrage) -- Multi-workers / distribué → Redis (ou autre stockage partagé) - ---- - -## 3. 
Modèle de Statut de Pipeline - -Chaque pipeline suivi produit un objet `PipelineStatus`: - -```python -from taskiq_flow.tracking.models import PipelineStatus - -statut: PipelineStatus -``` - -**Champs**: - -| Champ | Type | Description | -|-------|------|-------------| -| `pipeline_id` | `str` | Identifiant unique de l'instance de pipeline | -| `statut` | `str` | `EN_ATTENTE`, `EN_COURSE`, `TERMINÉ`, `ÉCHOUÉ`, `ANNULÉ` | -| `pipeline_type` | `str` | `"sequential"` ou `"dataflow"` | -| `démarré_à` | `datetime` | Horodatage de début d'exécution | -| `terminé_à` | `datetime` | Horodatage de fin (si terminé) | -| `durée_ms` | `float` | Temps d'exécution total en millisecondes | -| `étapes` | `list[StepStatus]` | Détail par étape | -| `résultat` | `Any` | Valeur de retour finale (si terminé) | -| `erreur` | `str` | Message d'erreur (si échoué) | - -**Champs StepStatus**: - -| Champ | Type | Description | -|-------|------|-------------| -| `step_name` | `str` | Nom de la tâche | -| `statut` | `str` | `EN_ATTENTE`, `EN_COURSE`, `TERMINÉ`, `ÉCHOUÉ` | -| `démarré_à` | `datetime` | Heure de début d'étape | -| `terminé_à` | `datetime` | Heure de fin d'étape | -| `durée_ms` | `float` | Temps d'exécution de l'étape | -| `résultat` | `Any` | Valeur de retour de l'étape | -| `erreur` | `str` | Message d'erreur si échec | - ---- - -## 4. Interrogation des Statuts - -### 4.1. Obtenir le Statut d'un Pipeline - -```python -status = await tracking.get_status(pipeline_id) - -if status.status == "COMPLETED": - print(f"Pipeline completed in {status.duration_ms}ms") - print(f"Result: {status.result}") -elif status.status == "FAILED": - print(f"Failed: {status.error}") -``` - -### 4.2. Lister Tous les Pipelines - -```python -all_statuses = await tracking.list_pipelines() -for status in all_statuses: - print(f"{status.pipeline_id}: {status.status}") -``` - -### 4.3. 
Filtrer par Statut - -```python -running = await tracking.list_pipelines(filter_status="RUNNING") -failed = await tracking.list_pipelines(filter_status="FAILED") -completed = await tracking.list_pipelines(filter_status="COMPLETED") -``` - -### 4.4. Obtenir l'Historique - -```python -# Get last 10 pipelines -history = await tracking.get_history(limit=10) - -# Filter by date range -from datetime import datetime, timedelta -week_ago = datetime.now() - timedelta(days=7) -recent = await tracking.get_history(since=week_ago) -``` - -### 4.5. Supprimer les Anciens Enregistrements - -```python -# Delete records older than 30 days -deleted = await tracking.cleanup_old(days=30) -print(f"Deleted {deleted} old pipeline records") - -# Delete specific pipeline -await tracking.delete_pipeline(pipeline_id) -``` - ---- - -## 5. Backends de Stockage - -### 5.1. InMemoryPipelineStorage - -```python -from taskiq_flow.tracking import InMemoryPipelineStorage - -storage = InMemoryPipelineStorage() -tracking = PipelineTrackingManager().with_storage(storage) - -# Data lives only in Python process -# Lost on restart -# Suitable for: development, testing, one-shot scripts -``` - -**Avantages**: -- Zéro configuration -- Rapide (pas d'I/O réseau) -- Simple - -**Inconvénients**: -- Non partageable entre workers -- Perdu au redémarrage -- Taille d'historique limitée - -### 5.2. 
RedisPipelineStorage - -```python -from taskiq_flow.tracking import RedisPipelineStorage -import redis.asyncio as redis - -client_redis = redis.Redis(host="localhost", port=6379, decode_responses=True) -stockage = RedisPipelineStorage(client_redis) -tracking = PipelineTrackingManager().with_storage(stockage) -``` - -**Configuration**: - -```python -# Avec préfixe de clé et TTL personnalisés -stockage = RedisPipelineStorage( - client_redis, - key_prefix="taskiq_flow:suivi:", - ttl_secondes=604800 # rétention 7 jours -) -``` - -**Avantages**: -- Partagé entre multiples workers -- Persiste au redémarrage -- Évolutif -- Peut être en cluster pour haute disponibilité - -**Inconvénients**: -- Requiert un serveur Redis -- Latence réseau -- Gestion TTL nécessaire (éviter croissance illimitée) - -### 5.3. Stockage Personnalisé - -Implémenter le protocole `TrackingStorage`: - -```python -from taskiq_flow.tracking.storage import TrackingStorage -from taskiq_flow.tracking.models import PipelineStatus - -class PostgresStorage(TrackingStorage): - async def save_status(self, status: PipelineStatus): - # Insert/update in PostgreSQL - pass - - async def get_status(self, pipeline_id: str) -> PipelineStatus | None: - # Fetch from DB - pass - - async def list_pipelines(self, filter_status: str | None = None): - # Query with optional filter - pass - - async def delete_pipeline(self, pipeline_id: str): - # Remove record - pass - -tracking = PipelineTrackingManager().with_storage(PostgresStorage()) -``` - ---- - -## 6. 
Suivi en Temps Réel avec WebSocket - -Pour des mises à jour de tableau de bord en direct, combiner `PipelineTrackingManager` avec `HookManager`: - -```python -from taskiq_flow.hooks import HookManager, TrackingEventBroadcaster - -hook_manager = HookManager() -broadcaster = TrackingEventBroadcaster(tracking, hook_manager) -tracking.add_listener(broadcaster.on_status_update) - -pipeline = Pipeline(broker).with_hooks(hook_manager).with_tracking(tracking) -``` - -Les événements de pipeline sont maintenant diffusés via WebSocket en temps réel. - -Voir [Guide WebSocket]({{ '/fr/guides/websocket/' | relative_url }}) pour la configuration complète。 - ---- - -## 7. Collecte de Métriques - -Collecter des statistiques de performance au fil du temps: - -```python -# Collect statistics -stats = await tracking.get_metrics(days=7) - -print(f"Total executions: {stats.total_pipelines}") -print(f"Success rate: {stats.success_rate:.1%}") -print(f"Avg duration: {stats.avg_duration_ms:.0f}ms") -print(f"Failure reasons: {stats.failure_reasons}") -``` - -**Métriques courantes**: - -- Débit (pipelines/minute) -- Ratio succès/échec -- Durée moyenne des étapes -- Étapes les plus longues -- Heures de pointe - -Intégrer avec des systèmes de monitoring (Prometheus, Grafana): - -```python -from prometheus_client import Counter, Histogram - -PIPELINES_TOTAL = Counter('pipelines_total', 'Total pipelines', ['status']) -PIPELINE_DURATION = Histogram('pipeline_duration_seconds', 'Pipeline execution duration') - -class PrometheusExporter: - async def on_pipeline_complete(self, status: PipelineStatus): - PIPELINES_TOTAL.labels(status=status.status).inc() - PIPELINE_DURATION.observe(status.duration_ms / 1000) -``` - ---- - -## 8. 
Écouteurs d'Événements - -Attacher des callbacks aux événements de suivi: - -```python -class MyListener: - async def on_pipeline_start(self, pipeline_id: str): - print(f"Pipeline {pipeline_id} started") - send_slack_notification(f"Pipeline {pipeline_id} started") - - async def on_step_complete(self, pipeline_id: str, step_name: str, result: Any): - log_step_metric(step_name, result) - - async def on_pipeline_complete(self, pipeline_id: str, status: PipelineStatus): - if status.status == "FAILED": - alert_failure(pipeline_id) - -listener = MyListener() -tracking.add_listener(listener) -``` - -**Méthodes d'écouteur** (toutes optionnelles): - -- `on_pipeline_start(pipeline_id: str)` -- `on_step_start(pipeline_id: str, step_name: str)` -- `on_step_complete(pipeline_id: str, step_name: str, résultat: Any)` -- `on_pipeline_complete(pipeline_id: str, statut: PipelineStatus)` -- `on_pipeline_error(pipeline_id: str, erreur: str)` - ---- - -## 9. Visualisation des Données de Suivi - -### 9.1. Sortie Console - -```python -status = await tracking.get_status(pipeline_id) -print(f"\n{'='*60}") -print(f"Pipeline: {status.pipeline_id}") -print(f"Status: {status.status}") -print(f"Duration: {status.duration_ms:.0f}ms") -print(f"Steps:") -for step in status.steps: - bar = "█" * int(step.duration_ms / 10) - print(f" {step.step_name:<30} {bar} {step.duration_ms:.0f}ms") -``` - -### 9.2. JSON Export - -```python -import json -status_dict = status.model_dump(mode="json", exclude={"result"}) # exclude large results -print(json.dumps(status_dict, indent=2, default=str)) -``` - -### 9.3. 
Intégration avec Tableaux de Bord - -Utiliser les endpoints API REST (voir [Guide API]({{ '/fr/guides/api/' | relative_url }})) pour construire des tableaux de bord personnalisés: - -```javascript -// Frontend fetch -fetch('/api/pipelines/{pipeline_id}/status') - .then(res => res.json()) - .then(statut => { - // Rendre graphique temporel des durées d'étapes - // Afficher badges succès/échec - }); -``` - ---- - -## 10. Meilleures Pratiques de Production - -### 10.1. Utiliser Redis en Production - -Toujours utiliser `RedisPipelineStorage` en production: - -```python -# config.py -URL_REDIS = os.getenv("URL_REDIS", "redis://localhost:6379") - -# app.py -from redis.asyncio import Redis -client_redis = Redis.from_url(REDIS_URL) -tracking = PipelineTrackingManager().with_storage( - RedisPipelineStorage(client_redis, ttl_seconds=2592000) # 30 days -) -``` - -### 10.2. Configurer des Politiques de Rétention - -```python -# Periodic cleanup job (daily) -async def cleanup_old_tracking(): - deleted = await tracking.cleanup_old(days=7) - print(f"Cleaned up {deleted} old pipeline records") - -# Use APScheduler to run daily -from taskiq_flow import PipelineScheduler -scheduler = PipelineScheduler(broker) -scheduler.schedule_at(cleanup_old_tracking, run_at="0 3 * * *") # 3am daily -``` - -### 10.3. Monitor Tracking Health - -```python -# Health check for monitoring systems -async def health_check(): - try: - test_pipeline = Pipeline(broker).with_tracking(tracking) - await test_pipeline.kiq("health_check") - return {"status": "healthy"} - except Exception as e: - return {"status": "unhealthy", "error": str(e)} -``` - -### 10.4. 
Limiter la Taille de l'Historique - -```python -# Keep only last N pipelines per pipeline_id pattern -import fnmatch - -patterns = ["batch_job_*", "etl_*"] -for pattern in patterns: - old = await tracking.list_pipelines() - matches = [p for p in old if fnmatch.fnmatch(p.pipeline_id, pattern)] - if len(matches) > 100: - for old_pipeline in matches[-100:]: - await tracking.delete_pipeline(old_pipeline.pipeline_id) -``` - ---- - -## 11. Dépannage - -### Erreur "Aucun stockage configuré" - -**Symptôme** : `RuntimeError: No tracking storage configured` - -**Solution** : Add storage before using tracking: - -```python -tracking = PipelineTrackingManager().with_auto_storage(broker) -# or -tracking = PipelineTrackingManager().with_storage(InMemoryPipelineStorage()) -``` - -### Missing Tracking Data - -**Symptom**: `get_status()` returns `None` even though pipeline ran - -**Causes & fixes**: - -1. **Tracking not attached**: - ```python - pipeline = Pipeline(broker).with_tracking(tracking) # Must call with_tracking() - ``` - -2. **Different brokers** — Ensure same `broker` instance between task and pipeline. - -3. **Storage lifetime** — In-memory storage lost on restart; switch to Redis. - -4. **Pipeline ID mismatch** — Confirm `pipeline.pipeline_id` matches the query. - -### Dégradation des Performance avec Redis - -**Symptôme** : Le suivi ralentit l'exécution du pipeline - -**Correctifs**: -- Utiliser le pooling de connexions Redis -- Mettre à jour les statuts en batch (regrouper plusieurs étapes) -- Écritures batch asynchrones (comportement par défaut) -- Augmenter `maxmemory` Redis et utiliser politique d'éviction appropriée - ---- - -## 12. 
Résumé - -| Fonctionnalité | Mémoire | Redis | -|----------------|---------|-------| -| **Multi-processus** | Non | Oui | -| **Persistant** | Non | Oui | -| **État partagé** | Non | Oui | -| **Vitesse** | Plus rapide | Rapide (réseau) | -| **Configuration requise** | Aucune | Serveur Redis | - -**Recette basique**: -```python -suivi = PipelineTrackingManager().with_auto_storage(broker) -pipeline = Pipeline(broker).with_tracking(suivi) -``` - -**Recette production**: -```python -suivi = PipelineTrackingManager().with_storage( - RedisPipelineStorage(client_redis, ttl_secondes=604800) -) -pipeline = Pipeline(broker).with_tracking(suivi) -``` - ---- - -## Prochaines Étapes - -- **[Streaming WebSocket]({{ '/fr/guides/websocket/' | relative_url }})** — Livraison d'événements en direct pour tableaux de bord -- **[Guide Dataflow]({{ '/fr/guides/dataflow/' | relative_url }})** — Pipeline DAG complet avec parallélisme automatique -- **[Planification]({{ '/fr/guides/scheduling/' | relative_url }})** — Exécution périodique automatique de pipelines -- **[Performance]({{ '/fr/guides/performance/' | relative_url }})** — Optimiser la surcharge de suivi - ---- - -*Tout suivre. Visualiser avec [WebSocket]({{ '/fr/guides/websocket/' | relative_url }}).* +--- +title: Guide de Suivi et Monitoring des Pipelines +nav_order: 23 +--- +# Guide de Suivi et Monitoring des Pipelines + +**Suivi en temps réel et historique des exécutions avec PipelineTrackingManager** + +> **Version**: 1.0.0 | **Lié** : [Guide d'Exécution]({{ '/fr/guides/execution/' | relative_url }}), [Guide WebSocket]({{ '/fr/guides/websocket/' | relative_url }}) + +--- + +## Aperçu + +Taskiq-Flow offre des capacités complètes de suivi pour monitorer les exécutions de pipeline en temps réel et historiquement. 
Ce guide couvre : + +- `PipelineTrackingManager` — Coordonnateur central de suivi +- Backends de stockage (Mémoire, Redis) +- Requêtes de statut et historique +- Collecte de métriques +- Écoute d'événements au niveau étape + +--- + +## 1. Démarrage Rapide + +```python +from taskiq_flow import Pipeline, PipelineTrackingManager + +# Initialize tracking with automatic storage selection +tracking = PipelineTrackingManager().with_auto_storage(broker) + +# Attach tracking to pipeline +pipeline = Pipeline(broker).with_tracking(tracking) + +# Execute +task = await pipeline.kiq(data) +result = await task.wait_result() + +# Query status +status = await tracking.get_status(pipeline.pipeline_id) +print(f"Status: {status.status}") # COMPLETED +print(f"Steps: {len(status.steps)}") # Number of steps executed +print(f"Duration: {status.duration_ms}ms") +``` + +C'est le pattern de base. Approfondissons. + +--- + +## 2. PipelineTrackingManager + +Le composant central pour enregistrer et récupérer les données d'exécution des pipelines. + +### 2.1. Initialisation + +```python +from taskiq_flow import PipelineTrackingManager, InMemoryPipelineStorage, RedisPipelineStorage + +# Option 1: Auto-select based on broker (recommended) +tracking = PipelineTrackingManager().with_auto_storage(broker) +# Uses Redis if broker supports it, else falls back to Memory + +# Option 2: Explicit memory storage (development only) +tracking = PipelineTrackingManager().with_storage(InMemoryPipelineStorage()) + +# Option 3: Explicit Redis storage (production) +tracking = PipelineTrackingManager().with_storage( + RedisPipelineStorage(redis_client) +) + +# Option 4: Custom storage backend +tracking = PipelineTrackingManager().with_storage(CustomStorage()) +``` + +### 2.2. 
Durée de Vie du Stockage + +- **InMemoryPipelineStorage** : Vit dans le processus Python seulement ; perdu au redémarrage +- **RedisPipelineStorage** : Persistant entre processus ; survit aux redémarrages + +Choisir selon le déploiement: +- Développement local → Mémoire +- Production mono-worker → Mémoire (si pas de redémarrage) +- Multi-workers / distribué → Redis (ou autre stockage partagé) + +--- + +## 3. Modèle de Statut de Pipeline + +Chaque pipeline suivi produit un objet `PipelineStatus`: + +```python +from taskiq_flow.tracking.models import PipelineStatus + +statut: PipelineStatus +``` + +**Champs**: + +| Champ | Type | Description | +|-------|------|-------------| +| `pipeline_id` | `str` | Identifiant unique de l'instance de pipeline | +| `statut` | `str` | `EN_ATTENTE`, `EN_COURSE`, `TERMINÉ`, `ÉCHOUÉ`, `ANNULÉ` | +| `pipeline_type` | `str` | `"sequential"` ou `"dataflow"` | +| `démarré_à` | `datetime` | Horodatage de début d'exécution | +| `terminé_à` | `datetime` | Horodatage de fin (si terminé) | +| `durée_ms` | `float` | Temps d'exécution total en millisecondes | +| `étapes` | `list[StepStatus]` | Détail par étape | +| `résultat` | `Any` | Valeur de retour finale (si terminé) | +| `erreur` | `str` | Message d'erreur (si échoué) | + +**Champs StepStatus**: + +| Champ | Type | Description | +|-------|------|-------------| +| `step_name` | `str` | Nom de la tâche | +| `statut` | `str` | `EN_ATTENTE`, `EN_COURSE`, `TERMINÉ`, `ÉCHOUÉ` | +| `démarré_à` | `datetime` | Heure de début d'étape | +| `terminé_à` | `datetime` | Heure de fin d'étape | +| `durée_ms` | `float` | Temps d'exécution de l'étape | +| `résultat` | `Any` | Valeur de retour de l'étape | +| `erreur` | `str` | Message d'erreur si échec | + +--- + +## 4. Interrogation des Statuts + +### 4.1. 
Obtenir le Statut d'un Pipeline + +```python +status = await tracking.get_status(pipeline_id) + +if status.status == "COMPLETED": + print(f"Pipeline completed in {status.duration_ms}ms") + print(f"Result: {status.result}") +elif status.status == "FAILED": + print(f"Failed: {status.error}") +``` + +### 4.2. Lister Tous les Pipelines + +```python +all_statuses = await tracking.list_pipelines() +for status in all_statuses: + print(f"{status.pipeline_id}: {status.status}") +``` + +### 4.3. Filtrer par Statut + +```python +running = await tracking.list_pipelines(filter_status="RUNNING") +failed = await tracking.list_pipelines(filter_status="FAILED") +completed = await tracking.list_pipelines(filter_status="COMPLETED") +``` + +### 4.4. Obtenir l'Historique + +```python +# Get last 10 pipelines +history = await tracking.get_history(limit=10) + +# Filter by date range +from datetime import datetime, timedelta +week_ago = datetime.now() - timedelta(days=7) +recent = await tracking.get_history(since=week_ago) +``` + +### 4.5. Supprimer les Anciens Enregistrements + +```python +# Delete records older than 30 days +deleted = await tracking.cleanup_old(days=30) +print(f"Deleted {deleted} old pipeline records") + +# Delete specific pipeline +await tracking.delete_pipeline(pipeline_id) +``` + +--- + +## 5. Backends de Stockage + +### 5.1. InMemoryPipelineStorage + +```python +from taskiq_flow.tracking import InMemoryPipelineStorage + +storage = InMemoryPipelineStorage() +tracking = PipelineTrackingManager().with_storage(storage) + +# Data lives only in Python process +# Lost on restart +# Suitable for: development, testing, one-shot scripts +``` + +**Avantages**: +- Zéro configuration +- Rapide (pas d'I/O réseau) +- Simple + +**Inconvénients**: +- Non partageable entre workers +- Perdu au redémarrage +- Taille d'historique limitée + +### 5.2. 
RedisPipelineStorage + +```python +from taskiq_flow.tracking import RedisPipelineStorage +import redis.asyncio as redis + +client_redis = redis.Redis(host="localhost", port=6379, decode_responses=True) +stockage = RedisPipelineStorage(client_redis) +tracking = PipelineTrackingManager().with_storage(stockage) +``` + +**Configuration**: + +```python +# Avec préfixe de clé et TTL personnalisés +stockage = RedisPipelineStorage( + client_redis, + key_prefix="taskiq_flow:suivi:", + ttl_secondes=604800 # rétention 7 jours +) +``` + +**Avantages**: +- Partagé entre multiples workers +- Persiste au redémarrage +- Évolutif +- Peut être en cluster pour haute disponibilité + +**Inconvénients**: +- Requiert un serveur Redis +- Latence réseau +- Gestion TTL nécessaire (éviter croissance illimitée) + +### 5.3. Stockage Personnalisé + +Implémenter le protocole `TrackingStorage`: + +```python +from taskiq_flow.tracking.storage import TrackingStorage +from taskiq_flow.tracking.models import PipelineStatus + +class PostgresStorage(TrackingStorage): + async def save_status(self, status: PipelineStatus): + # Insert/update in PostgreSQL + pass + + async def get_status(self, pipeline_id: str) -> PipelineStatus | None: + # Fetch from DB + pass + + async def list_pipelines(self, filter_status: str | None = None): + # Query with optional filter + pass + + async def delete_pipeline(self, pipeline_id: str): + # Remove record + pass + +tracking = PipelineTrackingManager().with_storage(PostgresStorage()) +``` + +--- + +## 6. 
Suivi en Temps Réel avec WebSocket + +Pour des mises à jour de tableau de bord en direct, combiner `PipelineTrackingManager` avec `HookManager`: + +```python +from taskiq_flow.hooks import HookManager, TrackingEventBroadcaster + +hook_manager = HookManager() +broadcaster = TrackingEventBroadcaster(tracking, hook_manager) +tracking.add_listener(broadcaster.on_status_update) + +pipeline = Pipeline(broker).with_hooks(hook_manager).with_tracking(tracking) +``` + +Les événements de pipeline sont maintenant diffusés via WebSocket en temps réel. + +Voir [Guide WebSocket]({{ '/fr/guides/websocket/' | relative_url }}) pour la configuration complète。 + +--- + +## 7. Collecte de Métriques + +Collecter des statistiques de performance au fil du temps: + +```python +# Collect statistics +stats = await tracking.get_metrics(days=7) + +print(f"Total executions: {stats.total_pipelines}") +print(f"Success rate: {stats.success_rate:.1%}") +print(f"Avg duration: {stats.avg_duration_ms:.0f}ms") +print(f"Failure reasons: {stats.failure_reasons}") +``` + +**Métriques courantes**: + +- Débit (pipelines/minute) +- Ratio succès/échec +- Durée moyenne des étapes +- Étapes les plus longues +- Heures de pointe + +Intégrer avec des systèmes de monitoring (Prometheus, Grafana): + +```python +from prometheus_client import Counter, Histogram + +PIPELINES_TOTAL = Counter('pipelines_total', 'Total pipelines', ['status']) +PIPELINE_DURATION = Histogram('pipeline_duration_seconds', 'Pipeline execution duration') + +class PrometheusExporter: + async def on_pipeline_complete(self, status: PipelineStatus): + PIPELINES_TOTAL.labels(status=status.status).inc() + PIPELINE_DURATION.observe(status.duration_ms / 1000) +``` + +--- + +## 8. 
Écouteurs d'Événements + +Attacher des callbacks aux événements de suivi: + +```python +class MyListener: + async def on_pipeline_start(self, pipeline_id: str): + print(f"Pipeline {pipeline_id} started") + send_slack_notification(f"Pipeline {pipeline_id} started") + + async def on_step_complete(self, pipeline_id: str, step_name: str, result: Any): + log_step_metric(step_name, result) + + async def on_pipeline_complete(self, pipeline_id: str, status: PipelineStatus): + if status.status == "FAILED": + alert_failure(pipeline_id) + +listener = MyListener() +tracking.add_listener(listener) +``` + +**Méthodes d'écouteur** (toutes optionnelles): + +- `on_pipeline_start(pipeline_id: str)` +- `on_step_start(pipeline_id: str, step_name: str)` +- `on_step_complete(pipeline_id: str, step_name: str, résultat: Any)` +- `on_pipeline_complete(pipeline_id: str, statut: PipelineStatus)` +- `on_pipeline_error(pipeline_id: str, erreur: str)` + +--- + +## 9. Visualisation des Données de Suivi + +### 9.1. Sortie Console + +```python +status = await tracking.get_status(pipeline_id) +print(f"\n{'='*60}") +print(f"Pipeline: {status.pipeline_id}") +print(f"Status: {status.status}") +print(f"Duration: {status.duration_ms:.0f}ms") +print(f"Steps:") +for step in status.steps: + bar = "█" * int(step.duration_ms / 10) + print(f" {step.step_name:<30} {bar} {step.duration_ms:.0f}ms") +``` + +### 9.2. JSON Export + +```python +import json +status_dict = status.model_dump(mode="json", exclude={"result"}) # exclude large results +print(json.dumps(status_dict, indent=2, default=str)) +``` + +### 9.3. 
Intégration avec Tableaux de Bord + +Utiliser les endpoints API REST (voir [Guide API]({{ '/fr/guides/api/' | relative_url }})) pour construire des tableaux de bord personnalisés: + +```javascript +// Frontend fetch +fetch('/api/pipelines/{pipeline_id}/status') + .then(res => res.json()) + .then(statut => { + // Rendre graphique temporel des durées d'étapes + // Afficher badges succès/échec + }); +``` + +--- + +## 10. Meilleures Pratiques de Production + +### 10.1. Utiliser Redis en Production + +Toujours utiliser `RedisPipelineStorage` en production: + +```python +# config.py +URL_REDIS = os.getenv("URL_REDIS", "redis://localhost:6379") + +# app.py +from redis.asyncio import Redis +client_redis = Redis.from_url(REDIS_URL) +tracking = PipelineTrackingManager().with_storage( + RedisPipelineStorage(client_redis, ttl_seconds=2592000) # 30 days +) +``` + +### 10.2. Configurer des Politiques de Rétention + +```python +# Periodic cleanup job (daily) +async def cleanup_old_tracking(): + deleted = await tracking.cleanup_old(days=7) + print(f"Cleaned up {deleted} old pipeline records") + +# Use APScheduler to run daily +from taskiq_flow import PipelineScheduler +scheduler = PipelineScheduler(broker) +scheduler.schedule_at(cleanup_old_tracking, run_at="0 3 * * *") # 3am daily +``` + +### 10.3. Monitor Tracking Health + +```python +# Health check for monitoring systems +async def health_check(): + try: + test_pipeline = Pipeline(broker).with_tracking(tracking) + await test_pipeline.kiq("health_check") + return {"status": "healthy"} + except Exception as e: + return {"status": "unhealthy", "error": str(e)} +``` + +### 10.4. 
Limiter la Taille de l'Historique + +```python +# Keep only last N pipelines per pipeline_id pattern +import fnmatch + +patterns = ["batch_job_*", "etl_*"] +for pattern in patterns: + old = await tracking.list_pipelines() + matches = [p for p in old if fnmatch.fnmatch(p.pipeline_id, pattern)] + if len(matches) > 100: + for old_pipeline in matches[-100:]: + await tracking.delete_pipeline(old_pipeline.pipeline_id) +``` + +--- + +## 11. Dépannage + +### Erreur "Aucun stockage configuré" + +**Symptôme** : `RuntimeError: No tracking storage configured` + +**Solution** : Add storage before using tracking: + +```python +tracking = PipelineTrackingManager().with_auto_storage(broker) +# or +tracking = PipelineTrackingManager().with_storage(InMemoryPipelineStorage()) +``` + +### Missing Tracking Data + +**Symptom**: `get_status()` returns `None` even though pipeline ran + +**Causes & fixes**: + +1. **Tracking not attached**: + ```python + pipeline = Pipeline(broker).with_tracking(tracking) # Must call with_tracking() + ``` + +2. **Different brokers** — Ensure same `broker` instance between task and pipeline. + +3. **Storage lifetime** — In-memory storage lost on restart; switch to Redis. + +4. **Pipeline ID mismatch** — Confirm `pipeline.pipeline_id` matches the query. + +### Dégradation des Performance avec Redis + +**Symptôme** : Le suivi ralentit l'exécution du pipeline + +**Correctifs**: +- Utiliser le pooling de connexions Redis +- Mettre à jour les statuts en batch (regrouper plusieurs étapes) +- Écritures batch asynchrones (comportement par défaut) +- Augmenter `maxmemory` Redis et utiliser politique d'éviction appropriée + +--- + +## 12. 
Résumé + +| Fonctionnalité | Mémoire | Redis | +|----------------|---------|-------| +| **Multi-processus** | Non | Oui | +| **Persistant** | Non | Oui | +| **État partagé** | Non | Oui | +| **Vitesse** | Plus rapide | Rapide (réseau) | +| **Configuration requise** | Aucune | Serveur Redis | + +**Recette basique**: +```python +suivi = PipelineTrackingManager().with_auto_storage(broker) +pipeline = Pipeline(broker).with_tracking(suivi) +``` + +**Recette production**: +```python +suivi = PipelineTrackingManager().with_storage( + RedisPipelineStorage(client_redis, ttl_secondes=604800) +) +pipeline = Pipeline(broker).with_tracking(suivi) +``` + +--- + +## Prochaines Étapes + +- **[Streaming WebSocket]({{ '/fr/guides/websocket/' | relative_url }})** — Livraison d'événements en direct pour tableaux de bord +- **[Guide Dataflow]({{ '/fr/guides/dataflow/' | relative_url }})** — Pipeline DAG complet avec parallélisme automatique +- **[Planification]({{ '/fr/guides/scheduling/' | relative_url }})** — Exécution périodique automatique de pipelines +- **[Performance]({{ '/fr/guides/performance/' | relative_url }})** — Optimiser la surcharge de suivi + +--- + +*Tout suivre. Visualiser avec [WebSocket]({{ '/fr/guides/websocket/' | relative_url }}).* diff --git a/docs/_fr/guides/websocket.md b/docs/_fr/guides/websocket.md index 1cd1a39..8571e49 100644 --- a/docs/_fr/guides/websocket.md +++ b/docs/_fr/guides/websocket.md @@ -550,7 +550,7 @@ WantedBy=multi-user.target ### 9.3. Monitoring -Utilisez l'endpoint `/health` intégré de l'API FastAPI (voir [Guide API]({{ '/fr/guides/api/' | relative_url })). +Utilisez l'endpoint `/health` intégré de l'API FastAPI (voir [Guide API]({{ '/fr/guides/api/' | relative_url }})). ### 9.4. 
Scalabilité diff --git a/docs/_fr/index.md b/docs/_fr/index.md index 6a0e949..5959815 100644 --- a/docs/_fr/index.md +++ b/docs/_fr/index.md @@ -1,27 +1,27 @@ ---- -title: Documentation Française -nav_order: 5 -permalink: /fr/ ---- -# Taskiq-Flow Documentation (Français) - -Bienvenue dans la documentation française de **Taskiq-Flow**. - -## Commencer - -- **[Guide de Démarrage Rapide]({{ '/fr/quickstart/' | relative_url }})** — Prenez en main en 5 minutes -- **[Guides Utilisateur]({{ '/fr/guides/' | relative_url }})** — Guides détaillés sur les pipelines, tâches, exécution, suivi, WebSocket, planification, retry, performance et API REST -- **[Référence API]({{ '/fr/api/' | relative_url }})** — Documentation complète des modules et classes -- **[Exemples]({{ '/fr/examples/' | relative_url }})** — Scripts d'exemple fonctionnels pour toutes les fonctionnalités - -## Liens Rapides - -- [README du Projet](https://github.com/dorel14/taskiq-flow/blob/main/README.fr.md) — Vue d'ensemble et philosophie du projet -- [Package PyPI](https://pypi.org/project/taskiq-flow/) — Installez avec `pip install taskiq-flow` -- [Dépôt GitHub](https://github.com/dorel14/taskiq-flow) — Code source et suivi d'issues -- [Guide de Contribution](https://github.com/dorel14/taskiq-flow/blob/main/CONTRIBUTING.md) — Comment contribuer -- [Licence](https://github.com/dorel14/taskiq-flow/blob/main/LICENSE) — Licence MIT - ---- - -*Maintenu par l'équipe SoniqueBay* +--- +title: Documentation Française +nav_order: 5 +permalink: /fr/ +--- +# Taskiq-Flow Documentation (Français) + +Bienvenue dans la documentation française de **Taskiq-Flow**. 
+ +## Commencer + +- **[Guide de Démarrage Rapide]({{ '/fr/quickstart/' | relative_url }})** — Prenez en main en 5 minutes +- **[Guides Utilisateur]({{ '/fr/guides/' | relative_url }})** — Guides détaillés sur les pipelines, tâches, exécution, suivi, WebSocket, planification, retry, performance et API REST +- **[Référence API]({{ '/fr/api/' | relative_url }})** — Documentation complète des modules et classes +- **[Exemples]({{ '/fr/examples/' | relative_url }})** — Scripts d'exemple fonctionnels pour toutes les fonctionnalités + +## Liens Rapides + +- [README du Projet](https://github.com/dorel14/taskiq-flow/blob/main/README.fr.md) — Vue d'ensemble et philosophie du projet +- [Package PyPI](https://pypi.org/project/taskiq-flow/) — Installez avec `pip install taskiq-flow` +- [Dépôt GitHub](https://github.com/dorel14/taskiq-flow) — Code source et suivi d'issues +- [Guide de Contribution](https://github.com/dorel14/taskiq-flow/blob/main/CONTRIBUTING.md) — Comment contribuer +- [Licence](https://github.com/dorel14/taskiq-flow/blob/main/LICENSE) — Licence MIT + +--- + +*Maintenu par l'équipe SoniqueBay* diff --git a/docs/_fr/quickstart.md b/docs/_fr/quickstart.md index 5ba4a6e..732ee86 100644 --- a/docs/_fr/quickstart.md +++ b/docs/_fr/quickstart.md @@ -1,383 +1,383 @@ ---- -title: Guide de Démarrage Rapide -nav_order: 10 -color_scheme: dark ---- -# Guide de Démarrage Rapide - -**Se familiariser avec Taskiq-Flow en 5 minutes** - -> **Version** : {VERSION} | **Prérequis** : Python 3.9+, bases d'asyncio - ---- - -## Aperçu - -Ce guide vous aidera à créer vos premiers pipelines avec Taskiq-Flow. 
À la fin, vous comprendrez : - -- Comment configurer un broker et ajouter le PipelineMiddleware -- Définir des tâches avec `@broker.task` -- Construire des pipelines séquentiels avec `.call_next()`, `.map()`, `.filter()` -- Exécuter des pipelines et récupérer les résultats -- Les bases des pipelines dataflow avec `@pipeline_task` - ---- - -## Prérequis - -```bash -pip install taskiq taskiq-flow -``` - -Pour ce guide, nous utilisons le broker en mémoire qui ne nécessite aucun service externe. - ---- - -## 1. Pipeline Séquentiel Basique - -### 1.1. Configuration - -Créez un fichier Python `quickstart_basic.py` : - -```python -import asyncio -from taskiq import InMemoryBroker -from taskiq_flow import Pipeline, PipelineMiddleware - -# Initialiser le broker et ajouter le middleware requis -broker = InMemoryBroker() -broker.add_middlewares(PipelineMiddleware()) -``` - -### 1.2. Définir les Tâches - -Toutes les fonctions dans un pipeline doivent être des tâches taskiq (décorées avec `@broker.task`) : - -```python -@broker.task -def add_one(value: int) -> int: - """Ajouter 1 à la valeur d'entrée.""" - return value + 1 - -@broker.task -def repeat(value: int, times: int) -> list[int]: - """ Répéter une valeur plusieurs fois.""" - return [value] * times - -@broker.task -def is_positive(value: int) -> bool: - """Vérifier si la valeur est positive ou nulle.""" - return value >= 0 -``` - -### 1.3. 
Construire et Exécuter le Pipeline - -```python -async def main(): - # Construire le pipeline en enchaînant les opérations - pipeline = ( - Pipeline(broker) - .call_next(add_one) # Étape 1: 1 → 2 - .call_next(repeat, times=4) # Étape 2: 2 → [2, 2, 2, 2] - .map(add_one) # Étape 3: appliquer à chaque élément → [3, 3, 3, 3] - .filter(is_positive) # Étape 4: garder les éléments où le résultat est True - ) - - # Lancer le pipeline avec une entrée initiale - task = await pipeline.kiq(1) - - # Attendre la fin et récupérer le résultat - result = await task.wait_result() - print("Résultat :", result.return_value) # Sortie: [3, 3, 3, 3] - -asyncio.run(main()) -``` - -**Sortie attendue** : -``` -Résultat : [3, 3, 3, 3] -``` - -### 1.4. Comment Ça Marche - -| Étape | Opération | Entrée | Sortie | -|-------|-----------|--------|--------| -| 1 | `.call_next(add_one)` | `1` | `2` | -| 2 | `.call_next(repeat, times=4)` | `2` | `[2, 2, 2, 2]` | -| 3 | `.map(add_one)` | `[2, 2, 2, 2]` | `[3, 3, 3, 3]` (parallèle) | -| 4 | `.filter(is_positive)` | `[3, 3, 3, 3]` | `[3, 3, 3, 3]` (inchangé) | - -**Points clés** : - -- Le `PipelineMiddleware` gère le routage des tâches ; il **doit** être ajouté au broker. -- Chaque étape reçoit la sortie de l'étape précédente comme entrée. -- `.map()` et `.filter()` opèrent sur des résultats itérables et exécutent les éléments en parallèle. -- `pipeline.kiq(entrée_initiale)` démarre le pipeline et renvoie un objet `Task`. -- `task.wait_result()` bloque jusqu'à la fin du pipeline. - ---- - -## 2. Pipeline Dataflow (DAG Automatique) - -Pour des workflows plus complexes, utilisez `DataflowPipeline` qui construit automatiquement un graphe de dépendances. - -### 2.1. 
Définir des Tâches avec `@pipeline_task` - -Marquez les sorties de tâche avec le décorateur `@pipeline_task` : - -```python -from taskiq_flow import DataflowPipeline, pipeline_task - -@broker.task -@pipeline_task(output="features") -def extract_audio(track_paths: list[str]) -> dict: - """Extraire les caractéristiques audio des pistes.""" - print(f"Extraction des caractéristiques de {len(track_paths)} pistes...") - return {"duration": 180.0, "tempo": 120.0, "energy": 0.8} - -@broker.task -@pipeline_task(output="tags") -def generate_tags(features: dict) -> list[str]: - """Générer des tags basés sur les caractéristiques audio.""" - print(f"Génération de tags depuis les caractéristiques : {features}") - return ["electronic", "dance", "upbeat"] - -@broker.task -@pipeline_task(output="embedding") -def compute_embedding(features: dict) -> list[float]: - """Calculer l'incorporation vectorielle depuis les caractéristiques.""" - print(f"Calcul de l'incorporation depuis {features}") - return [0.1, 0.2, 0.3, 0.4, 0.5] -``` - -**Fonctionnement de la résolution de dépendances** : -- `extract_audio` déclare `output="features"` -- `generate_tags` a le paramètre `features: dict` → dépend automatiquement de `extract_audio` -- `compute_embedding` dépend aussi de `extract_audio` (même paramètre `features`) -- Taskiq-Flow construit un DAG et exécute les tâches indépendantes en parallèle - -### 2.2. 
Construire et Exécuter - -```python -async def main(): - # Construire automatiquement le DAG depuis la liste de tâches - pipeline = DataflowPipeline.from_tasks( - broker, - [extract_audio, generate_tags, compute_embedding] - ) - - # Optionnel: visualiser le DAG - pipeline.print_dag() - - # Exécuter avec les données d'entrée (seulement les entrées externes nécessaires) - results = await pipeline.kiq_dataflow(track_paths=["chanson1.mp3", "chanson2.mp3"]) - print("Résultats :", results) - # Sortie: { - # "features": {"duration": 180.0, ...}, - # "tags": ["electronic", "dance", "upbeat"], - # "embedding": [0.1, 0.2, 0.3, 0.4, 0.5] - # } - -asyncio.run(main()) -``` - -**Exemple de sortie DAG** (affiché dans la console): -``` -Ordre d'Exécution DAG: - Niveau 0 (parallèle): extract_audio - Niveau 1 (parallèle): generate_tags, compute_embedding - Sorties finales: features, tags, embedding -``` - -### 2.3. Visualiser le Pipeline - -```python -# DAG ASCII dans la console -pipeline.print_dag() - -# Représentation JSON pour interfaces web -viz_json = pipeline.visualize() -print(viz_json) - -# Format DOT pour Graphviz -dot = pipeline.visualize_dot() -with open("pipeline.dot", "w") as f: - f.write(dot) -# Rendre: dot -Tpng pipeline.dot -o pipeline.png -``` - ---- - -## 3. Motifs Courants - -### 3.1. 
Motif Map-Reduce - -Traiter des éléments en parallèle, puis agréger : - -```python -from taskiq_flow import MapReduce - -# Phase Map: traiter chaque piste indépendamment -mapped = await MapReduce.map( - broker, - process_track, # fonction de tâche - track_list, # itérable d'éléments - output="processed", # nom de la sortie intermédiaire - max_parallel=10 # limiter la concurrence -) - -# Phase Reduce: agréger tous les résultats -reduced = await MapReduce.reduce( - broker, - aggregate_results, # fonction d'agrégation - mapped, # objet MapReduceResult - input_name="processed", # consommer la sortie mappée - output="final_stats" -) - -print("Final :", reduced.return_value) -``` - -Voir `examples/dataflow_audio_pipeline.py` pour un pipeline audio complet. - -### 3.2. Exécution Parallèle Groupée - -Exécuter plusieurs tâches indépendantes simultanément : - -```python -pipeline = Pipeline(broker) - -pipeline.group( - [task_a, task_b, task_c], - param_names=["input_a", "input_b", "input_c"] -) -# Retourne : [resultat_a, resultat_b, resultat_c] -``` - -### 3.3. Pipeline avec Suivi - -Surveiller le statut du pipeline en temps réel : - -```python -from taskiq_flow import PipelineTrackingManager - -tracking = PipelineTrackingManager().with_auto_storage(broker) -pipeline = Pipeline(broker).with_tracking(tracking) - -task = await pipeline.kiq(données) - -# Vérifier le statut ultérieurement -statut = await tracking.get_status(pipeline.pipeline_id) -print(f"Statut : {statut.status}, Étapes complétées : {len(statut.steps)}") -``` - ---- - -## 4. 
Exécuter les Exemples - -Le répertoire `examples/` contient des démonstrations complètes exécutables : - -```bash -# Pipeline séquentiel basique -python examples/quickstart.py - -# Suivi et monitoring -python examples/tracking_demo.py - -# Pipelines planifiés (cron) -python examples/scheduled_pipeline.py - -# DAG dataflow complet avec map-reduce -python examples/dataflow_audio_pipeline.py - -# Construction manuelle de DAG avec DataflowRegistry -python examples/registry_discovery_example.py - -# Streaming d'événements WebSocket -python examples/websocket_demo.py - -# API REST avec FastAPI -python examples/api_example.py -``` - ---- - -## 5. Prochaines Étapes - -Avec les bases acquises, explorez les guides approfondis : - -| Sujet | Guide | -|-------|-------| -| Pipelines séquentiels et dataflow | [Guide des Pipelines]({{ '/fr/guides/pipelines/' | relative_url }}) | -| **Approfondissement Dataflow** | **[Guide Dataflow]({{ '/fr/guides/dataflow/' | relative_url }})** | -| Définition des tâches et décorateurs | [Guide des Tâches]({{ '/fr/guides/tasks/' | relative_url }}) | -| Modes d'exécution et gestion d'erreurs | [Guide d'Exécution]({{ '/fr/guides/execution/' | relative_url }}) | -| Monitoring en temps réel | [Guide de Suivi]({{ '/fr/guides/tracking/' | relative_url }}) | -| Tableaux de bord en direct | [Guide WebSocket]({{ '/fr/guides/websocket/' | relative_url }}) | -| Planification cron | [Guide de Planification]({{ '/fr/guides/scheduling/' | relative_url }}) | -| Récupération d'erreurs | [Guide de Retry]({{ '/fr/guides/retry/' | relative_url }}) | -| Optimisation des performances | [Guide de Performance]({{ '/fr/guides/performance/' | relative_url }}) | -| Intégration API REST | [Guide API]({{ '/fr/guides/api/' | relative_url }}) | -| Référence API complète | [Référence API]({{ '/fr/api/' | relative_url }}) | - ---- - -## Dépannage - -### Erreur "PipelineMiddleware non trouvé" - -**Symptôme** : Les tâches échouent avec des erreurs de middleware. 
- -**Solution** : Assurez-vous que `PipelineMiddleware()` est ajouté au broker avant de créer des pipelines : - -```python -broker.add_middlewares(PipelineMiddleware()) # Obligatoire -``` - -### Erreur "Task not found" ou "Result is None" - -**Symptôme** : `wait_result()` retourne `None`. - -**Cause** : InMemoryBroker fonctionne uniquement dans le même processus. Pour des setups multi-Workers distribués, utilisez Redis ou un broker persistant. - -**Solution** : Passez à `RedisStreamBroker` avec un backend de résultats partagé : - -```python -from taskiq_flow.broker import RedisStreamBroker -broker = RedisStreamBroker(redis_url="redis://localhost:6379") -``` - -### Connexion WebSocket Refusée - -**Symptôme** : Le client ne peut pas se connecter au serveur WebSocket. - -**Solution** : Assurez-vous que l'application FastAPI est en cours d'exécution et que la route WebSocket est montée : - -```python -from fastapi import FastAPI, WebSocket -from taskiq_flow.integration.websocket.fastapi_ws import fastapi_websocket_endpoint - -app = FastAPI() - -@app.websocket("/ws/{pipeline_id}") -async def ws_endpoint(websocket: WebSocket, pipeline_id: str): - await fastapi_websocket_endpoint(websocket, pipeline_id) - -# Lancer avec : uvicorn app:app --host 0.0.0.0 --port 8000 -``` - -Puis se connecter avec `ws://localhost:8000/ws/{pipeline_id}`. - -> **Prérequis** : Installer l'extra `[brokers]` : `pip install "taskiq-flow[brokers]"` pour les setups avec Redis. - ---- - -## Lectures Complémentaires - -- **[Référence API Complète]({{ '/fr/api/' | relative_url }})** — Documentation complète des classes et méthodes -- **[Galerie d'Exemples]({{ '/fr/examples/' | relative_url }})** — Explications détaillées de chaque script d'exemple -- **[README du Projet](https://github.com/dorel14/taskiq-flow/blob/main/README.fr.md)** — Vue d'ensemble, installation et philosophie - ---- - -*Prêt à approfondir ? 
Continuez avec le [Guide des Pipelines]({{ '/fr/guides/pipelines/' | relative_url }}).* +--- +title: Guide de Démarrage Rapide +nav_order: 10 +color_scheme: dark +--- +# Guide de Démarrage Rapide + +**Se familiariser avec Taskiq-Flow en 5 minutes** + +> **Version** : {VERSION} | **Prérequis** : Python 3.9+, bases d'asyncio + +--- + +## Aperçu + +Ce guide vous aidera à créer vos premiers pipelines avec Taskiq-Flow. À la fin, vous comprendrez : + +- Comment configurer un broker et ajouter le PipelineMiddleware +- Définir des tâches avec `@broker.task` +- Construire des pipelines séquentiels avec `.call_next()`, `.map()`, `.filter()` +- Exécuter des pipelines et récupérer les résultats +- Les bases des pipelines dataflow avec `@pipeline_task` + +--- + +## Prérequis + +```bash +pip install taskiq taskiq-flow +``` + +Pour ce guide, nous utilisons le broker en mémoire qui ne nécessite aucun service externe. + +--- + +## 1. Pipeline Séquentiel Basique + +### 1.1. Configuration + +Créez un fichier Python `quickstart_basic.py` : + +```python +import asyncio +from taskiq import InMemoryBroker +from taskiq_flow import Pipeline, PipelineMiddleware + +# Initialiser le broker et ajouter le middleware requis +broker = InMemoryBroker() +broker.add_middlewares(PipelineMiddleware()) +``` + +### 1.2. Définir les Tâches + +Toutes les fonctions dans un pipeline doivent être des tâches taskiq (décorées avec `@broker.task`) : + +```python +@broker.task +def add_one(value: int) -> int: + """Ajouter 1 à la valeur d'entrée.""" + return value + 1 + +@broker.task +def repeat(value: int, times: int) -> list[int]: + """ Répéter une valeur plusieurs fois.""" + return [value] * times + +@broker.task +def is_positive(value: int) -> bool: + """Vérifier si la valeur est positive ou nulle.""" + return value >= 0 +``` + +### 1.3. 
Construire et Exécuter le Pipeline + +```python +async def main(): + # Construire le pipeline en enchaînant les opérations + pipeline = ( + Pipeline(broker) + .call_next(add_one) # Étape 1: 1 → 2 + .call_next(repeat, times=4) # Étape 2: 2 → [2, 2, 2, 2] + .map(add_one) # Étape 3: appliquer à chaque élément → [3, 3, 3, 3] + .filter(is_positive) # Étape 4: garder les éléments où le résultat est True + ) + + # Lancer le pipeline avec une entrée initiale + task = await pipeline.kiq(1) + + # Attendre la fin et récupérer le résultat + result = await task.wait_result() + print("Résultat :", result.return_value) # Sortie: [3, 3, 3, 3] + +asyncio.run(main()) +``` + +**Sortie attendue** : +``` +Résultat : [3, 3, 3, 3] +``` + +### 1.4. Comment Ça Marche + +| Étape | Opération | Entrée | Sortie | +|-------|-----------|--------|--------| +| 1 | `.call_next(add_one)` | `1` | `2` | +| 2 | `.call_next(repeat, times=4)` | `2` | `[2, 2, 2, 2]` | +| 3 | `.map(add_one)` | `[2, 2, 2, 2]` | `[3, 3, 3, 3]` (parallèle) | +| 4 | `.filter(is_positive)` | `[3, 3, 3, 3]` | `[3, 3, 3, 3]` (inchangé) | + +**Points clés** : + +- Le `PipelineMiddleware` gère le routage des tâches ; il **doit** être ajouté au broker. +- Chaque étape reçoit la sortie de l'étape précédente comme entrée. +- `.map()` et `.filter()` opèrent sur des résultats itérables et exécutent les éléments en parallèle. +- `pipeline.kiq(entrée_initiale)` démarre le pipeline et renvoie un objet `Task`. +- `task.wait_result()` bloque jusqu'à la fin du pipeline. + +--- + +## 2. Pipeline Dataflow (DAG Automatique) + +Pour des workflows plus complexes, utilisez `DataflowPipeline` qui construit automatiquement un graphe de dépendances. + +### 2.1. 
Définir des Tâches avec `@pipeline_task` + +Marquez les sorties de tâche avec le décorateur `@pipeline_task` : + +```python +from taskiq_flow import DataflowPipeline, pipeline_task + +@broker.task +@pipeline_task(output="features") +def extract_audio(track_paths: list[str]) -> dict: + """Extraire les caractéristiques audio des pistes.""" + print(f"Extraction des caractéristiques de {len(track_paths)} pistes...") + return {"duration": 180.0, "tempo": 120.0, "energy": 0.8} + +@broker.task +@pipeline_task(output="tags") +def generate_tags(features: dict) -> list[str]: + """Générer des tags basés sur les caractéristiques audio.""" + print(f"Génération de tags depuis les caractéristiques : {features}") + return ["electronic", "dance", "upbeat"] + +@broker.task +@pipeline_task(output="embedding") +def compute_embedding(features: dict) -> list[float]: + """Calculer l'incorporation vectorielle depuis les caractéristiques.""" + print(f"Calcul de l'incorporation depuis {features}") + return [0.1, 0.2, 0.3, 0.4, 0.5] +``` + +**Fonctionnement de la résolution de dépendances** : +- `extract_audio` déclare `output="features"` +- `generate_tags` a le paramètre `features: dict` → dépend automatiquement de `extract_audio` +- `compute_embedding` dépend aussi de `extract_audio` (même paramètre `features`) +- Taskiq-Flow construit un DAG et exécute les tâches indépendantes en parallèle + +### 2.2. 
Construire et Exécuter + +```python +async def main(): + # Construire automatiquement le DAG depuis la liste de tâches + pipeline = DataflowPipeline.from_tasks( + broker, + [extract_audio, generate_tags, compute_embedding] + ) + + # Optionnel: visualiser le DAG + pipeline.print_dag() + + # Exécuter avec les données d'entrée (seulement les entrées externes nécessaires) + results = await pipeline.kiq_dataflow(track_paths=["chanson1.mp3", "chanson2.mp3"]) + print("Résultats :", results) + # Sortie: { + # "features": {"duration": 180.0, ...}, + # "tags": ["electronic", "dance", "upbeat"], + # "embedding": [0.1, 0.2, 0.3, 0.4, 0.5] + # } + +asyncio.run(main()) +``` + +**Exemple de sortie DAG** (affiché dans la console): +``` +Ordre d'Exécution DAG: + Niveau 0 (parallèle): extract_audio + Niveau 1 (parallèle): generate_tags, compute_embedding + Sorties finales: features, tags, embedding +``` + +### 2.3. Visualiser le Pipeline + +```python +# DAG ASCII dans la console +pipeline.print_dag() + +# Représentation JSON pour interfaces web +viz_json = pipeline.visualize() +print(viz_json) + +# Format DOT pour Graphviz +dot = pipeline.visualize_dot() +with open("pipeline.dot", "w") as f: + f.write(dot) +# Rendre: dot -Tpng pipeline.dot -o pipeline.png +``` + +--- + +## 3. Motifs Courants + +### 3.1. 
Motif Map-Reduce + +Traiter des éléments en parallèle, puis agréger : + +```python +from taskiq_flow import MapReduce + +# Phase Map: traiter chaque piste indépendamment +mapped = await MapReduce.map( + broker, + process_track, # fonction de tâche + track_list, # itérable d'éléments + output="processed", # nom de la sortie intermédiaire + max_parallel=10 # limiter la concurrence +) + +# Phase Reduce: agréger tous les résultats +reduced = await MapReduce.reduce( + broker, + aggregate_results, # fonction d'agrégation + mapped, # objet MapReduceResult + input_name="processed", # consommer la sortie mappée + output="final_stats" +) + +print("Final :", reduced.return_value) +``` + +Voir `examples/dataflow_audio_pipeline.py` pour un pipeline audio complet. + +### 3.2. Exécution Parallèle Groupée + +Exécuter plusieurs tâches indépendantes simultanément : + +```python +pipeline = Pipeline(broker) + +pipeline.group( + [task_a, task_b, task_c], + param_names=["input_a", "input_b", "input_c"] +) +# Retourne : [resultat_a, resultat_b, resultat_c] +``` + +### 3.3. Pipeline avec Suivi + +Surveiller le statut du pipeline en temps réel : + +```python +from taskiq_flow import PipelineTrackingManager + +tracking = PipelineTrackingManager().with_auto_storage(broker) +pipeline = Pipeline(broker).with_tracking(tracking) + +task = await pipeline.kiq(données) + +# Vérifier le statut ultérieurement +statut = await tracking.get_status(pipeline.pipeline_id) +print(f"Statut : {statut.status}, Étapes complétées : {len(statut.steps)}") +``` + +--- + +## 4. 
Exécuter les Exemples + +Le répertoire `examples/` contient des démonstrations complètes exécutables : + +```bash +# Pipeline séquentiel basique +python examples/quickstart.py + +# Suivi et monitoring +python examples/tracking_demo.py + +# Pipelines planifiés (cron) +python examples/scheduled_pipeline.py + +# DAG dataflow complet avec map-reduce +python examples/dataflow_audio_pipeline.py + +# Construction manuelle de DAG avec DataflowRegistry +python examples/registry_discovery_example.py + +# Streaming d'événements WebSocket +python examples/websocket_demo.py + +# API REST avec FastAPI +python examples/api_example.py +``` + +--- + +## 5. Prochaines Étapes + +Avec les bases acquises, explorez les guides approfondis : + +| Sujet | Guide | +|-------|-------| +| Pipelines séquentiels et dataflow | [Guide des Pipelines]({{ '/fr/guides/pipelines/' | relative_url }}) | +| **Approfondissement Dataflow** | **[Guide Dataflow]({{ '/fr/guides/dataflow/' | relative_url }})** | +| Définition des tâches et décorateurs | [Guide des Tâches]({{ '/fr/guides/tasks/' | relative_url }}) | +| Modes d'exécution et gestion d'erreurs | [Guide d'Exécution]({{ '/fr/guides/execution/' | relative_url }}) | +| Monitoring en temps réel | [Guide de Suivi]({{ '/fr/guides/tracking/' | relative_url }}) | +| Tableaux de bord en direct | [Guide WebSocket]({{ '/fr/guides/websocket/' | relative_url }}) | +| Planification cron | [Guide de Planification]({{ '/fr/guides/scheduling/' | relative_url }}) | +| Récupération d'erreurs | [Guide de Retry]({{ '/fr/guides/retry/' | relative_url }}) | +| Optimisation des performances | [Guide de Performance]({{ '/fr/guides/performance/' | relative_url }}) | +| Intégration API REST | [Guide API]({{ '/fr/guides/api/' | relative_url }}) | +| Référence API complète | [Référence API]({{ '/fr/api/' | relative_url }}) | + +--- + +## Dépannage + +### Erreur "PipelineMiddleware non trouvé" + +**Symptôme** : Les tâches échouent avec des erreurs de middleware. 
+ +**Solution** : Assurez-vous que `PipelineMiddleware()` est ajouté au broker avant de créer des pipelines : + +```python +broker.add_middlewares(PipelineMiddleware()) # Obligatoire +``` + +### Erreur "Task not found" ou "Result is None" + +**Symptôme** : `wait_result()` retourne `None`. + +**Cause** : InMemoryBroker fonctionne uniquement dans le même processus. Pour des setups multi-Workers distribués, utilisez Redis ou un broker persistant. + +**Solution** : Passez à `RedisStreamBroker` avec un backend de résultats partagé : + +```python +from taskiq_flow.broker import RedisStreamBroker +broker = RedisStreamBroker(redis_url="redis://localhost:6379") +``` + +### Connexion WebSocket Refusée + +**Symptôme** : Le client ne peut pas se connecter au serveur WebSocket. + +**Solution** : Assurez-vous que l'application FastAPI est en cours d'exécution et que la route WebSocket est montée : + +```python +from fastapi import FastAPI, WebSocket +from taskiq_flow.integration.websocket.fastapi_ws import fastapi_websocket_endpoint + +app = FastAPI() + +@app.websocket("/ws/{pipeline_id}") +async def ws_endpoint(websocket: WebSocket, pipeline_id: str): + await fastapi_websocket_endpoint(websocket, pipeline_id) + +# Lancer avec : uvicorn app:app --host 0.0.0.0 --port 8000 +``` + +Puis se connecter avec `ws://localhost:8000/ws/{pipeline_id}`. + +> **Prérequis** : Installer l'extra `[brokers]` : `pip install "taskiq-flow[brokers]"` pour les setups avec Redis. + +--- + +## Lectures Complémentaires + +- **[Référence API Complète]({{ '/fr/api/' | relative_url }})** — Documentation complète des classes et méthodes +- **[Galerie d'Exemples]({{ '/fr/examples/' | relative_url }})** — Explications détaillées de chaque script d'exemple +- **[README du Projet](https://github.com/dorel14/taskiq-flow/blob/main/README.fr.md)** — Vue d'ensemble, installation et philosophie + +--- + +*Prêt à approfondir ? 
Continuez avec le [Guide des Pipelines]({{ '/fr/guides/pipelines/' | relative_url }}).* From f6c48ef7d2b605efac5bd85a798910bfb8fb0d53 Mon Sep 17 00:00:00 2001 From: "github-actions[bot]" Date: Sun, 24 May 2026 15:56:51 +0000 Subject: [PATCH 02/12] docs: auto-update llms context files --- llms-full.txt | 276 ++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 246 insertions(+), 30 deletions(-) diff --git a/llms-full.txt b/llms-full.txt index 826005c..9895af2 100644 --- a/llms-full.txt +++ b/llms-full.txt @@ -271,6 +271,7 @@ pipeline = Pipeline(broker) ``` **Constructor**: + ```python Pipeline( broker: BaseBroker, @@ -297,6 +298,7 @@ Pipeline( | `with_context` | `with_context(enable=True) -> Pipeline` | Enable passing PipelineContext to tasks | **Example**: + ```python pipeline = ( Pipeline(broker) @@ -325,6 +327,7 @@ pipeline = DataflowPipeline.from_tasks( ``` **Constructor**: + ```python DataflowPipeline( broker: BaseBroker, @@ -351,6 +354,7 @@ DataflowPipeline( | `kiq_dataflow(**kwargs)` | Execute pipeline with named inputs | **Example**: + ```python @broker.task @pipeline_task(output="features") @@ -439,6 +443,7 @@ from taskiq_flow import TaskiqFlowError | `TrackingError` | Tracking operation failed | Storage unavailable | **Example handling**: + ```python try: result = await pipeline.kiq(data) @@ -504,6 +509,7 @@ See scheduling guide. This documentation covers **Taskiq-Flow v0.3.0+**. API stability: + - `Pipeline` and `DataflowPipeline`: Stable (v0.3+) - `pipeline_task` decorator: Stable (v0.3+) - `PipelineMiddleware`: Stable (v0.3+) @@ -537,6 +543,7 @@ The `@pipeline_task` decorator annotates taskiq tasks with output declarations, Marks a task with what it produces for downstream consumers. 
+ ```python from taskiq_flow import pipeline_task @@ -559,6 +566,7 @@ def extract(data: list[str]) -> dict: ### Single output (most common) + ```python @broker.task @pipeline_task(output="processed_data") @@ -568,6 +576,7 @@ def process(raw_data: str) -> dict: ### Multiple outputs + ```python @broker.task @pipeline_task(outputs=["features", "metadata"]) @@ -579,6 +588,7 @@ def split_output(audio: np.ndarray) -> tuple[dict, dict]: Downstream tasks can consume either output: + ```python @broker.task @pipeline_task(output="tags") @@ -595,6 +605,7 @@ def describe(metadata: dict): ... # consumes 'metadata' output Alias for `@pipeline_task(outputs=[...])`. Provides clarity for multi-output tasks: + ```python from taskiq_flow import pipeline_task_multi_output @@ -612,6 +623,7 @@ def split(value: int) -> tuple[int, int]: Get declared output keys for a task: + ```python from taskiq_flow import get_task_outputs @@ -623,6 +635,7 @@ print(outputs) # ['features'] Get declared input dependencies: + ```python from taskiq_flow import get_task_inputs @@ -634,6 +647,7 @@ print(inputs) # ['features'] Check if a function has been decorated with `@pipeline_task`: + ```python from taskiq_flow import is_pipeline_task @@ -645,6 +659,7 @@ if is_pipeline_task(my_func): Build a dependency map: + ```python from taskiq_flow import resolve_task_dependencies @@ -658,6 +673,7 @@ deps = resolve_task_dependencies([task_a, task_b, task_c]) The decorator order matters: `@broker.task` must be outermost (applied last), `@pipeline_task` inner (applied first): + ```python # CORRECT @broker.task @@ -678,6 +694,7 @@ Why: `@broker.task` wraps the function; `@pipeline_task` attaches metadata to th Type hints help IDEs and static checkers understand dataflow: + ```python from typing import TypedDict @@ -704,6 +721,7 @@ Using `TypedDict` or Pydantic models provides better IDE autocomplete and mypy c Attach version and other metadata: + ```python @broker.task( name="extract_features_v2", @@ -732,6 +750,7 @@ def 
extract(path: str) -> dict: ## Example: Complete Dataflow Pipeline + ```python from taskiq import InMemoryBroker from taskiq_flow import DataflowPipeline, pipeline_task @@ -3051,6 +3070,7 @@ This is the reference example for understanding dataflow architecture. ### Task Definitions + ```python from taskiq import InMemoryBroker from taskiq_flow import DataflowPipeline, pipeline_task @@ -3087,6 +3107,7 @@ async def create_embedding(mir_features: dict, tags: list[str]) -> list[float]: The pipeline automatically builds this DAG: + ```mermaid flowchart TD A[extract_audio_features] --> B[compute_mir_features] @@ -3101,6 +3122,7 @@ flowchart TD ## Example 1: Sequential Pipeline with Automatic Dependencies + ```python async def example_sequential_pipeline(): pipeline = DataflowPipeline.from_tasks( @@ -3131,6 +3153,7 @@ async def example_sequential_pipeline(): ``` **Dependency resolution**: + 1. `extract_audio_features` has no dependencies → runs first 2. `compute_mir_features` needs `audio_features` → runs after step 1 3. 
`generate_tags` needs `mir_features` → runs after step 2 @@ -3142,6 +3165,7 @@ async def example_sequential_pipeline(): With the addition of `extract_spectral_features` which also depends only on `audio_features`: + ```python @broker.task @pipeline_task(output="spectral_features") @@ -3171,6 +3195,7 @@ pipeline = DataflowPipeline.from_tasks( ``` **Execution levels**: + - Level 0: `extract_audio_features` - Level 1: `compute_mir_features`, `extract_spectral_features` (parallel) - Level 2: `generate_tags`, `combine_features` (parallel after their dependencies met) @@ -3181,6 +3206,7 @@ pipeline = DataflowPipeline.from_tasks( Process multiple tracks in parallel, then aggregate: + ```python # Map: process each track independently @broker.task @@ -3220,6 +3246,7 @@ results = await pipeline.kiq_map_reduce() The pipeline provides multiple visualization formats: + ```python # ASCII art (console) pipeline.print_dag() @@ -3245,11 +3272,13 @@ dot = pipeline.visualize_dot() ## Running the Example + ```bash python examples/dataflow_audio_pipeline.py ``` Expected output includes: + - DAG ASCII prints showing execution order - DAG DOT representation snippet - DAG JSON structure snippet @@ -4127,6 +4156,7 @@ This example demonstrates the `ResourceAwareExecutor` and `TaskResourceProfile` ### 1. Resource-Aware Executor Setup + ```python from taskiq_flow.optimization import ResourceAwareExecutor, TaskResourceProfile @@ -4157,6 +4187,7 @@ The executor queries current system load (via `psutil`) and computes how many ta ### 2. Annotating Tasks with Resource Profiles + ```python @broker.task @pipeline_task( @@ -4199,6 +4230,7 @@ async def heavy_task(item: int) -> dict: The `DataflowPipeline`'s `max_parallel` parameter acts as an upper bound. 
The `ResourceAwareExecutor` can be used to compute a dynamic `max_parallel` before launching: + ```python # Compute optimal parallelism for current system state current_parallel = executor.get_optimal_parallelism( @@ -4217,6 +4249,7 @@ For mixed workloads, sum resource usage across parallel tasks. ### 4. Manual Parallelism Tuning Guidelines + ```python import psutil @@ -4239,6 +4272,7 @@ Start conservative, benchmark, and adjust. ## Expected Output + ``` === Resource-Aware Parallelism Demo === @@ -4313,6 +4347,7 @@ Without resource awareness, setting `max_parallel` too high can: Combine with Prometheus metrics: + ```python from taskiq_flow.metrics import MetricsMiddleware broker.add_middlewares(MetricsMiddleware()) @@ -5562,6 +5597,7 @@ This guide covers: ## 1. Quick Setup + ```python from fastapi import FastAPI from taskiq import InMemoryBroker @@ -5598,12 +5634,14 @@ The visualization API provides these routes: ### 2.1. Health Check + ``` GET /health ``` Returns simple health status: + ```json { "status": "healthy", @@ -5613,12 +5651,14 @@ Returns simple health status: ### 2.2. List All Pipelines + ``` GET /pipelines ``` Lists all registered pipelines with metadata: + ```json [ { @@ -5632,12 +5672,14 @@ Lists all registered pipelines with metadata: ### 2.3. Register a New Pipeline + ``` POST /pipelines/{pipeline_id} ``` Request body: + ```json { "pipeline_type": "dataflow", @@ -5647,18 +5689,21 @@ Request body: Or use the Python API directly (recommended): + ```python viz_api.add_pipeline("new_pipeline", pipeline_object) ``` ### 2.4. Get Pipeline Status + ``` GET /pipelines/{pipeline_id}/status ``` Returns current execution status if a run is active: + ```json { "pipeline_id": "my_pipeline_123", @@ -5671,12 +5716,14 @@ Returns current execution status if a run is active: ### 2.5. 
Get DAG as JSON + ``` GET /pipelines/{pipeline_id}/dag ``` Returns the directed acyclic graph structure: + ```json { "nodes": [ @@ -5693,12 +5740,14 @@ Returns the directed acyclic graph structure: ### 2.6. Get DAG in DOT Format + ``` GET /pipelines/{pipeline_id}/dag/dot ``` Returns Graphviz-compatible DOT string for visualization: + ``` digraph "my_pipeline" { node [shape=box]; @@ -5709,12 +5758,14 @@ digraph "my_pipeline" { ### 2.7. Full Pipeline Visualization + ``` GET /pipelines/{pipeline_id}/visualize ``` Returns comprehensive pipeline metadata: + ```json { "pipeline_id": "my_pipeline", @@ -5746,6 +5797,7 @@ Returns comprehensive pipeline metadata: The core API focuses on management and visualization. To execute pipelines remotely, add a custom endpoint: + ```python from fastapi import FastAPI, HTTPException from taskiq_flow.api import PipelineVisualizationAPI @@ -5804,6 +5856,7 @@ async def get_result(task_id: str): ### 3.1. Execute Async (Fire-and-Forget) + ```bash curl -X POST "http://localhost:8000/pipelines/my_pipeline/execute" \ -H "Content-Type: application/json" \ @@ -5818,6 +5871,7 @@ curl -X POST "http://localhost:8000/pipelines/my_pipeline/execute" \ ### 3.2. Execute Synchronous (Wait for Result) + ```bash curl -X POST "http://localhost:8000/pipelines/my_pipeline/execute" \ -H "Content-Type: application/json" \ @@ -5837,6 +5891,7 @@ curl -X POST "http://localhost:8000/pipelines/my_pipeline/execute" \ ### 4.1. React Dashboard Example + ```typescript // React component displaying pipeline status const PipelineStatus = ({ pipelineId }) => { @@ -5871,6 +5926,7 @@ const PipelineStatus = ({ pipelineId }) => { Use the DOT endpoint with Graphviz: + ```javascript const renderDAG = async (pipelineId) => { const response = await fetch(`/pipelines/${pipelineId}/dag/dot`); @@ -5889,6 +5945,7 @@ const renderDAG = async (pipelineId) => { ### 5.1. 
API Key Authentication + ```python from fastapi import Security, HTTPException from fastapi.security import APIKeyHeader @@ -5907,6 +5964,7 @@ async def list_pipelines(api_key: str = Security(verify_api_key)): ### 5.2. JWT Authentication + ```python from jose import jwt from fastapi import Depends @@ -5934,6 +5992,7 @@ async def execute( Define per-pipeline ACLs via `pipeline_acls` in `TaskiqFlowConfig`, then use `verify_pipeline_access` as a route dependency : + ```python from fastapi import Depends from taskiq_flow.config import TaskiqFlowConfig @@ -5959,6 +6018,7 @@ viz_api = create_visualization_api(broker) # reads config automatically For production, combine the global `SecurityMiddleware` (authentication) with per-route dependencies (authorization): + ```python from taskiq_flow.security.middleware import SecurityMiddleware from taskiq_flow.security.auth import APIKeyAuthProvider, JWTAuthProvider @@ -6000,6 +6060,7 @@ app.add_middleware( Or, for full automatic wiring, use `create_visualization_api` which builds all these components from `TaskiqFlowConfig` internally: + ```python from taskiq_flow import create_visualization_api @@ -6015,6 +6076,7 @@ app = create_visualization_api(broker) # security auto-configured from config - `SecurityMiddleware` sets `request.state.user` for all routes after routing - FastAPI path params (e.g. `pipeline_id`) are only available *after* routing - Route dependencies (e.g. 
`Depends(verify_pipeline_access)`) run after routing → they can read `pipeline_id` and check ACLs + ``` --- @@ -6024,7 +6086,7 @@ app = create_visualization_api(broker) # security auto-configured from config Protect the API from abuse: ```python -from slowapi import Limiter, _rate_limit_exceeded_handler + from slowapi.util import get_remote_address from slowapi.errors import RateLimitExceeded @@ -6036,6 +6098,7 @@ app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler) @limiter.limit("10/minute") # Max 10 executions per minute per IP async def execute_pipeline(pipeline_id: str, parameters: dict): # ... + ``` --- @@ -6045,7 +6108,7 @@ async def execute_pipeline(pipeline_id: str, parameters: dict): Enable cross-origin requests for web frontend: ```python -from fastapi.middleware.cors import CORSMiddleware + app.add_middleware( CORSMiddleware, @@ -6054,6 +6117,7 @@ app.add_middleware( allow_methods=["GET", "POST"], allow_headers=["*"], ) + ``` --- @@ -6063,16 +6127,17 @@ app.add_middleware( ### 8.1. Gunicorn + Uvicorn Workers ```bash -# Run with multiple workers for concurrency + gunicorn -k uvicorn.workers.UvicornWorker -w 4 main:app --bind 0.0.0.0:8000 # 4 worker processes handle concurrent requests + ``` ### 8.2. Docker ```dockerfile -FROM python:3.12-slim + WORKDIR /app COPY requirements.txt . @@ -6081,10 +6146,11 @@ RUN pip install --no-cache-dir -r requirements.txt COPY . . CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"] + ``` ```yaml -# docker-compose.yml + services: api: build: . @@ -6096,12 +6162,13 @@ services: - redis redis: image: redis:7-alpine + ``` ### 8.3. Behind Reverse Proxy (nginx) ```nginx -server { + listen 80; server_name api.taskiq-flow.example.com; @@ -6113,22 +6180,25 @@ server { proxy_set_header Connection ""; } } + ``` ### 8.4. 
HTTPS with Let's Encrypt ```bash -# Using certbot with nginx + sudo certbot --nginx -d api.taskiq-flow.example.com + ``` Configure HTTPS → redirect to HTTP upstream: ```nginx -location / { + proxy_pass http://localhost:8000; proxy_set_header X-Forwarded-Proto $scheme; } + ``` --- @@ -6138,7 +6208,7 @@ location / { ### 9.1. Health Check Endpoint ```python -from datetime import datetime, timezone + from fastapi import FastAPI import psutil @@ -6152,14 +6222,16 @@ async def health(): "broker_connected": broker.is_connected(), "memory_mb": psutil.Process().memory_info().rss / 1024 / 1024 } + ``` ### 9.2. Metrics with Prometheus ```python -from prometheus_fastapi_instrumentator import Instrumentator + Instrumentator().instrument(app).expose(app, endpoint="/metrics") + ``` Exposes `/metrics` with standard Prometheus metrics (request count, latency, etc.). @@ -6167,7 +6239,7 @@ Exposes `/metrics` with standard Prometheus metrics (request count, latency, etc ### 9.3. API Versioning ```python -app = FastAPI( + title="Taskiq-Flow API", version="1.0.0", docs_url="/docs", @@ -6179,6 +6251,7 @@ from fastapi import APIRouter api_router = APIRouter(prefix="/api/v1") api_router.include_router(viz_api.router) app.include_router(api_router) + ``` --- @@ -6188,7 +6261,7 @@ app.include_router(api_router) Centralized error handling: ```python -from fastapi import Request + from fastapi.responses import JSONResponse @app.exception_handler(TaskiqError) @@ -6201,18 +6274,20 @@ async def taskiq_exception_handler(request: Request, exc: TaskiqError): "pipeline_id": getattr(exc, "pipeline_id", None) } ) + ``` Standardized error responses: ```json -{ + "error": "PipelineExecutionError", "message": "Task 'process' failed after 3 retries", "pipeline_id": "audio_analysis_123", "step": "extract_audio", "timestamp": "2026-05-05T12:00:00Z" } + ``` --- @@ -6222,7 +6297,7 @@ Standardized error responses: Python client for interacting with the API: ```python -import httpx + class TaskiqFlowClient: def 
__init__(self, base_url: str, api_key: str = None): @@ -6255,6 +6330,7 @@ class TaskiqFlowClient: client = TaskiqFlowClient("http://localhost:8000") pipelines = await client.list_pipelines() result = await client.execute("my_pipeline", {"data": "test"}, wait=True) + ``` --- @@ -8626,6 +8702,7 @@ Performance optimization involves tradeoffs between: Control concurrent task execution at the step level: + ```python # Sequential Pipeline pipeline.map(process_item, items, max_parallel=10) # Max 10 concurrent @@ -8648,6 +8725,7 @@ mapped = await MapReduce.map( #### For I/O-Bound Tasks (network calls, disk I/O) + ```python # High I/O wait, low CPU: can handle many concurrent tasks pipeline.map(fetch_url, url_list, max_parallel=50) @@ -8658,6 +8736,7 @@ pipeline.map(fetch_url, url_list, max_parallel=50) #### For CPU-Bound Tasks (computations, transcoding) + ```python # CPU-intensive: limit to core count (or slightly higher) import os @@ -8672,6 +8751,7 @@ pipeline.map(transcode, files, max_parallel=cpu_cores + 2) Profile and adjust: + ```python # Start conservative for parallel in [5, 10, 20, 50]: @@ -8687,6 +8767,7 @@ Find the **knee of the curve** — point where increasing parallelism yields dim Set a global cap across all pipelines: + ```python from taskiq_flow.optimization.parallel import set_max_parallel_tasks @@ -8703,6 +8784,7 @@ Taskiq-Flow can schedule tasks based on CPU/RAM requirements (requires resource- ### 3.1. Annotating Tasks with Resource Needs + ```python from taskiq_flow import CPUProfile, RAMProfile @@ -8716,6 +8798,7 @@ def heavy_computation(data): ### 3.2. 
Resource-Aware Worker Pool + ```python from taskiq_flow import ResourceAwareWorkerPool @@ -8739,6 +8822,7 @@ pool = ResourceAwareWorkerPool( Pass references instead of full data: + ```python # Bad: copies entire dataset per task call pipeline.map(process, large_dataset) # Each task gets full dataset copy @@ -8756,6 +8840,7 @@ pipeline.map(process, item_ids) # Only IDs passed Use chunking: + ```python def chunked(iterable, chunk_size=100): for i in range(0, len(iterable), chunk_size): @@ -8770,6 +8855,7 @@ for chunk in chunked(large_list, 100): Pipeline results stay in tracking storage. Clean up after you're done: + ```python # After processing, delete pipeline record await tracking.delete_pipeline(pipeline.pipeline_id) @@ -8777,6 +8863,7 @@ await tracking.delete_pipeline(pipeline.pipeline_id) Or set TTL on storage: + ```python RedisPipelineStorage(redis, ttl_seconds=86400) # Auto-delete after 1 day ``` @@ -8789,6 +8876,7 @@ RedisPipelineStorage(redis, ttl_seconds=86400) # Auto-delete after 1 day Each step records duration automatically (with tracking enabled): + ```python status = await tracking.get_status(pipeline_id) for step in status.steps: @@ -8801,6 +8889,7 @@ Identify slowest steps → optimization targets. Use Python's `tracemalloc`: + ```python import tracemalloc @@ -8818,6 +8907,7 @@ tracemalloc.stop() ### 5.3. CPU Profiling + ```python import cProfile import pstats @@ -8837,6 +8927,7 @@ stats.print_stats(20) # Top 20 functions `uvloop` for faster event loop: + ```python import uvloop uvloop.install() # Replaces default asyncio event loop @@ -8852,6 +8943,7 @@ Benchmark improvement: `uvloop` can provide 2×–3× speedup for I/O-bound work For databases (PostgreSQL, Redis), reuse connections: + ```python from asyncpg import create_pool @@ -8867,6 +8959,7 @@ async def db_task(query: str): Instead of many small calls, batch: + ```python # N separate calls for item in items: @@ -8878,6 +8971,7 @@ await db.bulk_insert(items) ### 6.3. 
Cache Results + ```python from functools import lru_cache @@ -8889,6 +8983,7 @@ def expensive_computation(key: str): Or use Redis cache: + ```python import redis cache = redis.Redis(...) @@ -8911,6 +9006,7 @@ async def cached_task(key: str): Scale horizontally by running multiple worker processes: + ```bash # Terminal 1 taskiq worker --broker redis://localhost:6379 @@ -8930,6 +9026,7 @@ All workers share the same broker (Redis) and process tasks concurrently. Use a process manager (systemd, supervisord, Docker Compose): + ```yaml # docker-compose.yml services: @@ -8948,6 +9045,7 @@ services: Route critical pipelines to dedicated queues: + ```python @broker.task(queue="high_priority") def critical_task(): ... @@ -8965,6 +9063,7 @@ For low-latency global deployments, deploy workers in multiple regions with a gl Measure before and after optimization: + ```python import time @@ -9016,6 +9115,7 @@ async def benchmark(pipeline, iterations=10): **Diagnostic steps**: 1. Check step durations in tracking: + ```python status = await tracking.get_status(pipeline_id) slowest = max(status.steps, key=lambda s: s.duration_ms) @@ -9053,6 +9153,7 @@ async def benchmark(pipeline, iterations=10): For specialized workloads, implement custom executors: + ```python from taskiq_flow import ExecutionEngine from taskiq_flow.dataflow import DAG @@ -9073,6 +9174,7 @@ results = await engine.execute(inputs) Taskiq-Flow provides a resource-aware execution pattern for pipelines that need to allocate tasks to workers based on their CPU/RAM requirements: + ```python from taskiq_flow import ResourceAwareExecutor, TaskResourceProfile from taskiq_flow.dataflow import DataflowPipeline @@ -11249,6 +11351,7 @@ TaskIQ-Flow provides a flexible security system that can be enabled via configur Security features are configured in the `TaskiqFlowConfig` object or via environment variables. 
The main security settings are: + ```python from taskiq_flow import TaskiqFlowConfig @@ -11315,6 +11418,7 @@ Clients must include their API key in the `X-API-Key` header for HTTP requests o Example HTTP request: + ```http GET /api/pipelines X-API-Key: admin-key #pragma: allowlist secret @@ -11324,6 +11428,7 @@ X-API-Key: admin-key #pragma: allowlist secret If a JWT secret is configured (via `jwt_secret`), clients can authenticate using a JSON Web Token (JWT) in the `Authorization` header: + ``` Authorization: Bearer ``` @@ -11359,6 +11464,7 @@ The middleware respects the `X-Forwarded-Proto` header so deployments behind a T Run TaskIQ-Flow behind a WSGI/ASGI server such as Uvicorn with Docker: + ```dockerfile # Dockerfile FROM python:3.12-slim @@ -11373,6 +11479,7 @@ EXPOSE 8000 CMD ["uvicorn", "my_app:app", "--host", "0.0.0.0", "--port", "8000"] ``` + ```yaml # docker-compose.yml services: @@ -11403,6 +11510,7 @@ volumes: Place nginx in front of the application to terminate TLS, enforce HTTPS, and add security headers: + ```nginx # /etc/nginx/sites-available/taskiq-flow server { @@ -11494,6 +11602,7 @@ WebSocket connections follow the same security model as HTTP: Here is a complete example of securing a TaskIQ-Flow API with the **current** API (`create_visualization_api`, flat `TaskiqFlowConfig` fields): + ```python from taskiq import Taskiq, InMemoryBroker from taskiq_flow import TaskiqFlowConfig, create_visualization_api @@ -11543,6 +11652,7 @@ audit_logger = AuditLogger() ## Testing Security + ```bash # No credentials → 401 Unauthorized curl -i http://localhost:8000/pipelines @@ -14221,6 +14331,7 @@ Le décorateur `@pipeline_task` annote les tâches taskiq avec des déclarations Marque une tâche avec ce qu'elle produit pour les consommateurs en aval. 
+ ```python from taskiq_flow import pipeline_task @@ -14241,6 +14352,7 @@ def extract(données: list[str]) -> dict: ### Sortie unique (plus courant) + ```python @broker.task @pipeline_task(output="données_traitées") @@ -14250,6 +14362,7 @@ def process(données_brutes: str) -> dict: ### Sorties multiples + ```python @broker.task @pipeline_task(outputs=["features", "metadata"]) @@ -14261,6 +14374,7 @@ def split_output(audio: np.ndarray) -> tuple[dict, dict]: Les tâches en aval peuvent consommer soit sortie: + ```python @broker.task @pipeline_task(output="tags") @@ -14277,6 +14391,7 @@ def describe(metadata: dict): ... # consomme sortie 'metadata' Alias pour `@pipeline_task(outputs=[...])`. Apporte clarté pour tâches multi-sorties: + ```python from taskiq_flow import pipeline_task_multi_output @@ -14294,6 +14409,7 @@ def split(valeur: int) -> tuple[int, int]: Obtenir clés sortie déclarées pour une tâche: + ```python from taskiq_flow import get_task_outputs @@ -14305,6 +14421,7 @@ print(outputs) # ['features'] Get declared input dependencies: + ```python from taskiq_flow import get_task_inputs @@ -14316,6 +14433,7 @@ print(inputs) # ['features'] Check if function is decorated with `@pipeline_task`: + ```python from taskiq_flow import is_pipeline_task @@ -14327,6 +14445,7 @@ if is_pipeline_task(my_function): Build dependency map: + ```python from taskiq_flow import resolve_task_dependencies @@ -14340,6 +14459,7 @@ deps = resolve_task_dependencies([task_a, task_b, task_c]) The order of decorators matters: `@broker.task` must be the outermost (applied last), `@pipeline_task` inner (applied first): + ```python # CORRECT @broker.task @@ -14353,6 +14473,7 @@ def my_task(): ... ``` Why: `@broker.task` wraps the function; `@pipeline_task` attaches metadata to the original function. Python applies decorators bottom-to-top. + ``` Pourquoi: `@broker.task` enveloppe la fonction; `@pipeline_task` attache métadonnées à la fonction originale. Python applique décorateurs bas-vers-haut. 
@@ -14364,7 +14485,7 @@ Pourquoi: `@broker.task` enveloppe la fonction; `@pipeline_task` attache métado Les type hints aident IDEs et linters à comprendre le dataflow: ```python -from typing import TypedDict + class AudioFeatures(TypedDict): duration: float @@ -14379,6 +14500,7 @@ def extract(chemin: str) -> AudioFeatures: @pipeline_task(output="tags") def tag(features: AudioFeatures) -> list[str]: # type-safe return ["rapide", "électronique"] + ``` Utiliser `TypedDict` ou modèles Pydantic pour meilleure autocomplétion IDE et vérification mypy. @@ -14390,7 +14512,7 @@ Utiliser `TypedDict` ou modèles Pydantic pour meilleure autocomplétion IDE et Attacher version et autres métadonnées: ```python -@broker.task( + nom="extract_features_v2", labels={"version": "2.0.0", "expérimental": False} ) @@ -14400,6 +14522,7 @@ Attacher version et autres métadonnées: ) def extract(chemin: str) -> dict: ... + ``` --- @@ -14418,7 +14541,7 @@ def extract(chemin: str) -> dict: ## Exemple: Pipeline Dataflow Complet ```python -from taskiq import InMemoryBroker + from taskiq_flow import DataflowPipeline, pipeline_task broker = InMemoryBroker() @@ -14444,6 +14567,7 @@ pipeline = DataflowPipeline.from_tasks(broker, [charger, nettoyer, analyser]) # Exécuter résultats = await pipeline.kiq_dataflow(source="data.csv") # résultats = {"brut": {...}, "propre": {...}, "stats": {...}} + ``` --- @@ -16706,6 +16830,7 @@ C'est l'exemple de référence pour comprendre l'architecture dataflow. 
### Définition des Tâches + ```python from taskiq import InMemoryBroker from taskiq_flow import DataflowPipeline, pipeline_task @@ -16742,6 +16867,7 @@ async def create_embedding(mir_features: dict, tags: list[str]) -> list[float]: Le pipeline construit automatiquement ce DAG: + ```mermaid flowchart TD A[extract_audio_features] --> B[compute_mir_features] @@ -16751,6 +16877,7 @@ flowchart TD ``` **Note**: `create_embedding` dépend à la fois de `mir_features` (sortie de `compute_mir_features`) et `tags` (sortie de `generate_tags`), donc il s'exécute après que les deux tâches parallèles sont terminées. + ``` --- @@ -16758,7 +16885,7 @@ flowchart TD ## Exemple 1: Pipeline Séquentiel avec Dépendances Automatiques ```python -async def example_sequential_pipeline(): + pipeline = DataflowPipeline.from_tasks( broker, [ @@ -16784,6 +16911,7 @@ async def example_sequential_pipeline(): # "tags": [...], # "vector": [...] # } + ``` **Résolution dépendances**: @@ -16799,7 +16927,7 @@ async def example_sequential_pipeline(): Avec ajout de `extract_spectral_features` qui dépend aussi seulement de `audio_features`: ```python -@broker.task + @pipeline_task(output="spectral_features") async def extract_spectral_features(audio_features: dict) -> dict: await asyncio.sleep(0.2) @@ -16824,6 +16952,7 @@ pipeline = DataflowPipeline.from_tasks( combine_features, # Niveau 2 (dépend de mir_features + spectral_features + tags) ], ) + ``` **Niveaux d'exécution**: @@ -16838,7 +16967,7 @@ pipeline = DataflowPipeline.from_tasks( Traiter multiples pistes en parallèle, puis agréger: ```python -# Map: traiter chaque piste indépendamment + @broker.task @pipeline_task(output="track_features") async def process_single_track(track: str) -> dict: @@ -16868,6 +16997,7 @@ pipeline.reduce( résultats = await pipeline.kiq_map_reduce() # résultats = {"track_features": [...], "playlist_stats": {...}} + ``` --- @@ -16877,7 +17007,7 @@ résultats = await pipeline.kiq_map_reduce() Le pipeline fournit multiples 
formats de visualisation: ```python -# ASCII art (console) + pipeline.print_dag() # JSON (for web UIs) @@ -16895,6 +17025,7 @@ dot = pipeline.visualize_dot() # with open("pipeline.dot", "w") as f: # f.write(dot) # Run: dot -Tpng pipeline.dot -o pipeline.png + ``` --- @@ -16902,7 +17033,8 @@ dot = pipeline.visualize_dot() ## Exécuter l'Exemple ```bash -python examples/dataflow_audio_pipeline.py + + ``` Sortie attendue inclut: @@ -17786,6 +17918,7 @@ Cet exemple démontre les fonctionnalités `ResourceAwareExecutor` et `TaskResou ### 1. Configuration Resource-Aware Executor + ```python from taskiq_flow.optimization import ResourceAwareExecutor, TaskResourceProfile @@ -17816,6 +17949,7 @@ L'exécuteur interroge l'usage système courant (via `psutil`) et calcule combie ### 2. Annoter Tâches avec Profils Ressources + ```python @broker.task @pipeline_task( @@ -17858,6 +17992,7 @@ async def heavy_task(item: int) -> dict: Le paramètre `max_parallel` de `DataflowPipeline` agit comme borne supérieure. `ResourceAwareExecutor` peut calculer un `max_parallel` dynamique avant lancement : + ```python # Calculer parallélisme optimal pour état système courant current_parallel = executor.get_optimal_parallelism( @@ -17876,6 +18011,7 @@ Pour charges de travail mixtes, sommez l'usage ressource à travers tâches para ### 4. Directives de Réglage Manuel Parallélisme + ```python import psutil @@ -17898,6 +18034,7 @@ Commencez conservateur, benchmarkez, et ajustez. ## Sortie Attendue + ``` === Resource-Aware Parallelism Demo === @@ -17972,6 +18109,7 @@ Sans conscience ressource, `max_parallel` trop haut peut : Combinez avec métriques Prometheus : + ```python from taskiq_flow.metrics import MetricsMiddleware broker.add_middlewares(MetricsMiddleware()) @@ -19214,6 +19352,7 @@ Ce guide couvre : ## 1. Configuration Rapide + ```python from fastapi import FastAPI from taskiq import InMemoryBroker @@ -19250,12 +19389,14 @@ L'API de visualisation fournit ces routes : ### 2.1. 
Health Check + ``` GET /health ``` Retourne statut simple: + ```json { "statut": "healthy", @@ -19265,12 +19406,14 @@ Retourne statut simple: ### 2.2. Lister Tous les Pipelines + ``` GET /pipelines ``` Liste tous les pipelines enregistrés avec métadonnées: + ```json [ { @@ -19284,12 +19427,14 @@ Liste tous les pipelines enregistrés avec métadonnées: ### 2.3. Enregistrer un Nouveau Pipeline + ``` POST /pipelines/{pipeline_id} ``` Corps de requête: + ```json { "pipeline_type": "dataflow", @@ -19299,18 +19444,21 @@ Corps de requête: Ou utiliser l'API Python directement (recommandé): + ```python viz_api.add_pipeline("nouveau_pipeline", objet_pipeline) ``` ### 2.4. Obtenir le Statut d'un Pipeline + ``` GET /pipelines/{pipeline_id}/status ``` Retourne statut d'exécution courant si un run est actif: + ```json { "pipeline_id": "my_pipeline_123", @@ -19323,12 +19471,14 @@ Retourne statut d'exécution courant si un run est actif: ### 2.5. Obtenir le DAG en JSON + ``` GET /pipelines/{pipeline_id}/dag ``` Retourne la structure de graphe orienté acyclique: + ```json { "nodes": [ @@ -19345,12 +19495,14 @@ Retourne la structure de graphe orienté acyclique: ### 2.6. Obtenir le DAG au Format DOT + ``` GET /pipelines/{pipeline_id}/dag/dot ``` Retourne chaîne DOT compatible Graphviz: + ``` digraph "my_pipeline" { node [shape=box]; @@ -19361,12 +19513,14 @@ digraph "my_pipeline" { ### 2.7. Visualisation Complète de Pipeline + ``` GET /pipelines/{pipeline_id}/visualize ``` Retourne métadonnées complètes du pipeline: + ```json { "pipeline_id": "my_pipeline", @@ -19398,6 +19552,7 @@ Retourne métadonnées complètes du pipeline: L'API de base se concentre sur gestion et visualisation. Pour exécuter des pipelines à distance, ajouter un endpoint personnalisé: + ```python from fastapi import FastAPI, HTTPException from taskiq_flow.api import PipelineVisualizationAPI @@ -19456,6 +19611,7 @@ async def get_result(task_id: str): ### 3.1. 
Exécution Async (Fire-and-Forget) + ```bash curl -X POST "http://localhost:8000/pipelines/my_pipeline/execute" \ -H "Content-Type: application/json" \ @@ -19470,6 +19626,7 @@ curl -X POST "http://localhost:8000/pipelines/my_pipeline/execute" \ ### 3.2. Synchronous Execution (Wait for Result) + ```bash curl -X POST "http://localhost:8000/pipelines/my_pipeline/execute" \ -H "Content-Type: application/json" \ @@ -19489,6 +19646,7 @@ curl -X POST "http://localhost:8000/pipelines/my_pipeline/execute" \ ### 4.1. Exemple Dashboard React + ```typescript const PipelineStatus = ({ pipelineId }) => { const [status, setStatus] = useState(null); @@ -19522,6 +19680,7 @@ const PipelineStatus = ({ pipelineId }) => { Utiliser endpoint DOT avec Graphviz: + ```javascript const renderDAG = async (pipelineId) => { const response = await fetch(`/pipelines/${pipelineId}/dag/dot`); @@ -19540,6 +19699,7 @@ const renderDAG = async (pipelineId) => { ### 5.1. Authentification par Clé API + ```python from fastapi import Security, HTTPException from fastapi.security import APIKeyHeader @@ -19558,6 +19718,7 @@ async def list_pipelines(api_key: str = Security(verify_api_key)): ### 5.2. Authentification JWT + ```python from jose import jwt from fastapi import Depends @@ -19586,6 +19747,7 @@ async def execute( Protéger l'API contre abus: + ```python from slowapi import Limiter, _rate_limit_exceeded_handler from slowapi.util import get_remote_address @@ -19607,6 +19769,7 @@ async def execute_pipeline(pipeline_id: str, parameters: dict): Permettre requêtes cross-origin pour frontend web: + ```python from fastapi.middleware.cors import CORSMiddleware @@ -19625,6 +19788,7 @@ app.add_middleware( ### 8.1. Gunicorn + Workers Uvicorn + ```bash # Lancer avec multiples workers pour concurrence gunicorn -k uvicorn.workers.UvicornWorker -w 4 main:app --bind 0.0.0.0:8000 @@ -19634,6 +19798,7 @@ gunicorn -k uvicorn.workers.UvicornWorker -w 4 main:app --bind 0.0.0.0:8000 ### 8.2. 
Docker + ```dockerfile FROM python:3.12-slim @@ -19646,6 +19811,7 @@ COPY . . CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"] ``` + ```yaml # docker-compose.yml services: @@ -19663,6 +19829,7 @@ services: ### 8.3. Derrière Reverse Proxy (nginx) + ```nginx server { listen 80; @@ -19680,6 +19847,7 @@ server { ### 8.4. HTTPS avec Let's Encrypt + ```bash # Utiliser certbot avec nginx sudo certbot --nginx -d api.taskiq-flow.example.com @@ -19687,6 +19855,7 @@ sudo certbot --nginx -d api.taskiq-flow.example.com Configurer HTTPS → redirect vers HTTP upstream: + ```nginx location / { proxy_pass http://localhost:8000; @@ -19700,6 +19869,7 @@ location / { ### 9.1. Authentification par Clé API + ```python from fastapi import Security, HTTPException from fastapi.security import APIKeyHeader @@ -19718,6 +19888,7 @@ async def list_pipelines(api_key: str = Security(verify_api_key)): ### 9.2. Authentification JWT + ```python from jose import jwt from fastapi import Depends @@ -19744,6 +19915,7 @@ async def execute( Définissez les ACLs par pipeline via `pipeline_acls` dans `TaskiqFlowConfig`, puis utilisez `verify_pipeline_access` comme dépendance de route : + ```python from fastapi import Depends from taskiq_flow.config import TaskiqFlowConfig @@ -19769,6 +19941,7 @@ viz_api = create_visualization_api(broker) # lit config automatiquement Pour la production, combinez le middleware global (authentification) avec les dépendances de route (autorisation) : + ```python from taskiq_flow.security.middleware import SecurityMiddleware from taskiq_flow.security.auth import APIKeyAuthProvider, JWTAuthProvider @@ -19810,6 +19983,7 @@ app.add_middleware( Ou, pour un câblage automatique complet, utilisez `create_visualization_api` qui construit tous les composants depuis `TaskiqFlowConfig` : + ```python from taskiq_flow import create_visualization_api @@ -19825,6 +19999,7 @@ app = create_visualization_api(broker) # sécurité auto-configurée depuis con - `SecurityMiddleware` 
place `request.state.user` pour toutes les routes après le routage - Les paramètres de chemin FastAPI (ex. `pipeline_id`) ne sont disponibles qu'**après** le routage - Les dépendances de route (ex. `Depends(verify_pipeline_access)`) s'exécutent après le routage → elles peuvent lire `pipeline_id` et vérifier les ACLs + ``` **Pourquoi cette approche hybride ?** @@ -19840,7 +20015,7 @@ app = create_visualization_api(broker) # sécurité auto-configurée depuis con Protégez l'API contre les abus: ```python -from slowapi import Limiter, _rate_limit_exceeded_handler + from slowapi.util import get_remote_address from slowapi.errors import RateLimitExceeded @@ -19852,6 +20027,7 @@ app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler) @limiter.limit("10/minute") # Max 10 exécutions par minute par IP async def execute_pipeline(pipeline_id: str, parameters: dict): # ... + ``` --- @@ -19861,7 +20037,7 @@ async def execute_pipeline(pipeline_id: str, parameters: dict): ### 11.1. Endpoint Health Check ```python -from datetime import datetime, timezone + from fastapi import FastAPI import psutil @@ -19875,14 +20051,16 @@ async def health(): "broker_connecté": broker.is_connected(), "memoire_mb": psutil.Process().memory_info().rss / 1024 / 1024 } + ``` ### 11.2. Métriques avec Prometheus ```python -from prometheus_fastapi_instrumentator import Instrumentator + Instrumentator().instrument(app).expose(app, endpoint="/metrics") + ``` Expose `/metrics` avec métriques Prometheus standard (compte requêtes, latence, etc.). @@ -19890,7 +20068,7 @@ Expose `/metrics` avec métriques Prometheus standard (compte requêtes, latence ### 11.3. 
Versionnement API ```python -app = FastAPI( + title="API Taskiq-Flow", version="1.0.0", docs_url="/docs", @@ -19902,6 +20080,7 @@ from fastapi import APIRouter api_router = APIRouter(prefix="/api/v1") api_router.include_router(viz_api.router) app.include_router(api_router) + ``` --- @@ -19911,7 +20090,7 @@ app.include_router(api_router) Gestion centralisée erreurs: ```python -from fastapi import Request + from fastapi.responses import JSONResponse from taskiq.exceptions import TaskiqError @@ -19925,18 +20104,20 @@ async def taskiq_exception_handler(request: Request, exc: TaskiqError): "pipeline_id": getattr(exc, "pipeline_id", None) } ) + ``` Réponses d'erreur standardisées: ```json -{ + "error": "PipelineExecutionError", "message": "Task 'process' échoué après 3 retries", "pipeline_id": "analyse_audio_123", "step": "extract_audio", "timestamp": "2026-05-05T12:00:00Z" } + ``` --- @@ -19946,7 +20127,7 @@ Réponses d'erreur standardisées: Client Python pour interagir avec l'API: ```python -import httpx + class ClientTaskiqFlow: def __init__(self, base_url: str, api_key: str = None): @@ -19979,6 +20160,7 @@ class ClientTaskiqFlow: client = ClientTaskiqFlow("http://localhost:8000") pipelines = await client.list_pipelines() result = await client.execute("my_pipeline", {"data": "test"}, wait=True) + ``` --- @@ -21840,6 +22022,7 @@ L'optimisation des performances implique des compromis : Contrôle l'exécution concurrente des tâches au niveau de l'étape : + ```python # Pipeline Séquentiel pipeline.map(process_item, items, max_parallel=10) # Max 10 concurrentes @@ -21862,6 +22045,7 @@ mapped = await MapReduce.map( #### Pour les Tâches Liées aux I/O (appels réseau, I/O disque) + ```python # Attente I/O élevée, CPU faible : peut gérer beaucoup de tâches concurrentes pipeline.map(fetch_url, url_list, max_parallel=50) @@ -21872,6 +22056,7 @@ pipeline.map(fetch_url, url_list, max_parallel=50) #### Pour les Tâches Intensives en CPU (calculs, transcodage) + ```python # Intensif en 
CPU : limiter au nombre de cœurs (ou légèrement plus) import os @@ -21886,6 +22071,7 @@ pipeline.map(transcode, files, max_parallel=cpu_cores + 2) Profilez et ajustez : + ```python # Commencez prudent for parallel in [5, 10, 20, 50]: @@ -21901,6 +22087,7 @@ Trouvez le **coude de la courbe** — point où augmenter le parallélisme donne Définissez une limite globale sur tous les pipelines : + ```python from taskiq_flow.optimization.parallel import set_max_parallel_tasks @@ -21917,6 +22104,7 @@ Taskiq-Flow peut ordonnancer les tâches selon les besoins CPU/RAM (nécessite u ### 3.1. Annoter les Tâches avec Besoins en Ressources + ```python from taskiq_flow import CPUProfile, RAMProfile @@ -21930,6 +22118,7 @@ def heavy_computation(data): ### 3.2. Pool de Workers Conscient des Ressources + ```python from taskiq_flow import ResourceAwareWorkerPool @@ -21953,6 +22142,7 @@ pool = ResourceAwareWorkerPool( Passez des références au lieu des données complètes : + ```python # Mauvais : copie le jeu de données complet pour chaque appel de tâche pipeline.map(process, large_dataset) # Chaque tâche reçoit une copie complète @@ -21970,6 +22160,7 @@ pipeline.map(process, item_ids) # Seuls les IDs sont passés Utilisez le découpage en chunks : + ```python def chunked(iterable, chunk_size=100): for i in range(0, len(iterable), chunk_size): @@ -21984,6 +22175,7 @@ for chunk in chunked(large_list, 100): Les résultats de pipeline restent dans le stockage de suivi. 
Nettoyez après usage : + ```python # Après traitement, supprimez l'enregistrement du pipeline await tracking.delete_pipeline(pipeline.pipeline_id) @@ -21991,6 +22183,7 @@ await tracking.delete_pipeline(pipeline.pipeline_id) Ou définissez un TTL sur le stockage : + ```python RedisPipelineStorage(redis, ttl_seconds=86400) # Suppression auto après 1 jour ``` @@ -22003,6 +22196,7 @@ RedisPipelineStorage(redis, ttl_seconds=86400) # Suppression auto après 1 jour Chaque étape enregistre la durée automatiquement (avec le suivi activé) : + ```python status = await tracking.get_status(pipeline_id) for step in status.steps: @@ -22015,6 +22209,7 @@ Identifiez les étapes les plus lentes → cibles d'optimisation. Utilisez `tracemalloc` de Python : + ```python import tracemalloc @@ -22032,6 +22227,7 @@ tracemalloc.stop() ### 5.3. Profilage CPU + ```python import cProfile import pstats @@ -22051,6 +22247,7 @@ stats.print_stats(20) # Top 20 fonctions `uvloop` pour une boucle d'événements plus rapide : + ```python import uvloop uvloop.install() # Remplace la boucle asyncio par défaut @@ -22066,6 +22263,7 @@ Amélioration benchmark : `uvloop` peut fournir un gain 2×–3× pour les charg Pour les bases de données (PostgreSQL, Redis), réutilisez les connexions : + ```python from asyncpg import create_pool @@ -22081,6 +22279,7 @@ async def db_task(query: str): Au lieu de nombreux petits appels, faites des lots : + ```python # N appels séparés for item in items: @@ -22092,6 +22291,7 @@ await db.bulk_insert(items) ### 6.3. Mise en Cache des Résultats + ```python from functools import lru_cache @@ -22103,6 +22303,7 @@ def expensive_computation(key: str): Ou utilisez un cache Redis : + ```python import redis cache = redis.Redis(...) 
@@ -22125,6 +22326,7 @@ async def cached_task(key: str): Mise à l'échelle horizontale en lançant plusieurs processus worker : + ```bash # Terminal 1 taskiq worker --broker redis://localhost:6379 @@ -22144,6 +22346,7 @@ Tous les workers partagent le même broker (Redis) et traitent les tâches concu Utilisez un gestionnaire de processus (systemd, supervisord, Docker Compose) : + ```yaml # docker-compose.yml services: @@ -22162,6 +22365,7 @@ services: Routez les pipelines critiques vers des files dédiées : + ```python @broker.task(queue="high_priority") def critical_task(): ... @@ -22179,6 +22383,7 @@ Pour des déploiements mondiaux à faible latence, déployez des workers dans pl Mesurez avant et après optimisation : + ```python import time @@ -22230,6 +22435,7 @@ async def benchmark(pipeline, iterations=10): **Étapes de diagnostic** : 1. Vérifiez les durées d'étapes dans le suivi : + ```python status = await tracking.get_status(pipeline_id) slowest = max(status.steps, key=lambda s: s.duration_ms) @@ -22267,6 +22473,7 @@ async def benchmark(pipeline, iterations=10): Pour des charges spécialisées, implémentez des exécuteurs personnalisés : + ```python from taskiq_flow import ExecutionEngine from taskiq_flow.dataflow import DAG @@ -22287,6 +22494,7 @@ results = await engine.execute(inputs) TaskIQ-Flow fournit un exécuteur conscient des ressources qui peut être utilisé pour allouer des tâches aux workers en fonction de leurs besoins en ressources : + ```python from taskiq_flow import ResourceAwareExecutor, TaskResourceProfile @@ -24491,6 +24699,7 @@ Les fonctionnalités de sécurité sont configurées dans l'objet :class:`~taskiq_flow.config.TaskiqFlowConfig` ou via des variables d'environnement. 
Les principaux paramètres sont : + ```python from taskiq_flow import TaskiqFlowConfig @@ -24557,6 +24766,7 @@ TaskIQ-Flow prend en charge deux méthodes d'authentification : Les clients doivent inclure leur clé API dans l'en-tête ``X-API-Key`` pour les requêtes HTTP ou dans le champ ``auth`` des messages de connexion WebSocket. Exemple de requête HTTP : + ```http GET /api/pipelines X-API-Key: admin-key #pragma: allowlist secret @@ -24566,6 +24776,7 @@ X-API-Key: admin-key #pragma: allowlist secret Si une clé secrète JWT est configurée (``jwt_secret``), les clients peuvent s'authentifier à l'aide d'un jeton Web Token (JWT) dans l'en-tête ``Authorization`` : + ``` Authorization: Bearer ``` @@ -24604,6 +24815,7 @@ correctement derrière un reverse proxy ou un load-balancer qui termine TLS. Exécutez TaskIQ-Flow derrière un serveur ASGI tel qu'Uvicorn avec Docker : + ```dockerfile # Dockerfile FROM python:3.12-slim @@ -24618,6 +24830,7 @@ EXPOSE 8000 CMD ["uvicorn", "mon_app:app", "--host", "0.0.0.0", "--port", "8000"] ``` + ```yaml # docker-compose.yml services: @@ -24648,6 +24861,7 @@ volumes: Placez nginx devant l'application pour terminer TLS, renforcer HTTPS et ajouter des en-têtes de sécurité : + ```nginx # /etc/nginx/sites-available/taskiq-flow server { @@ -24748,6 +24962,7 @@ Les connexions WebSocket suivent le même modèle de sécurité que HTTP : Voici un exemple complet utilisant l'**API actuelle** (``create_visualization_api``, champs plats sur ``TaskiqFlowConfig``) : + ```python from taskiq import Taskiq, InMemoryBroker from taskiq_flow import TaskiqFlowConfig, create_visualization_api @@ -24795,6 +25010,7 @@ Lancez l'application avec ``uvicorn app:app --host 0.0.0.0 --port 8000``. Tous l ## Tests de sécurité + ```bash # Sans identifiants → 401 Unauthorized curl -i http://localhost:8000/pipelines @@ -26404,7 +26620,7 @@ WantedBy=multi-user.target ### 9.3. 
Monitoring -Utilisez l'endpoint `/health` intégré de l'API FastAPI (voir [Guide API]({{ '/fr/guides/api/' | relative_url })). +Utilisez l'endpoint `/health` intégré de l'API FastAPI (voir [Guide API]({{ '/fr/guides/api/' | relative_url }})). ### 9.4. Scalabilité From e69b9c9ce719a3d1aadd03d948b80e6f9ddff6ec Mon Sep 17 00:00:00 2001 From: David Orel Date: Sun, 24 May 2026 18:00:33 +0200 Subject: [PATCH 03/12] Update docs/_en/api/cache.md Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --- docs/_en/api/cache.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/_en/api/cache.md b/docs/_en/api/cache.md index 9db9d55..ae99d8f 100644 --- a/docs/_en/api/cache.md +++ b/docs/_en/api/cache.md @@ -5,7 +5,7 @@ color_scheme: dark --- # API Reference: Cache -**Dogpile-based caching with cache stampede sémantics** +**Dogpile-based caching with cache stampede semantics** > **Version**: {VERSION} | **New in v1.2.0** | **Module**: `taskiq_flow.cache`, `taskiq_flow.middlewares.cache` From 6b9b71fe61143602c47cecc0171679d5d64d4202 Mon Sep 17 00:00:00 2001 From: David Orel Date: Sun, 24 May 2026 18:00:51 +0200 Subject: [PATCH 04/12] Update docs/_en/api/decorators.md Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --- docs/_en/api/decorators.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/_en/api/decorators.md b/docs/_en/api/decorators.md index 46235ec..99c104f 100644 --- a/docs/_en/api/decorators.md +++ b/docs/_en/api/decorators.md @@ -209,7 +209,7 @@ Attach version and other metadata: @pipeline_task( output="features", description="Extract audio features (v2 with improvedtempo estimation)" -) + description="Extract audio features (v2 with improved tempo estimation)" def extract(path: str) -> dict: ... 
``` From ebc6029800f3dfab2989fbe0539151c36169844c Mon Sep 17 00:00:00 2001 From: David Orel Date: Sun, 24 May 2026 18:01:12 +0200 Subject: [PATCH 05/12] Update docs/_en/guides/api.md Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --- docs/_en/guides/api.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/_en/guides/api.md b/docs/_en/guides/api.md index 90edbab..590fa62 100644 --- a/docs/_en/guides/api.md +++ b/docs/_en/guides/api.md @@ -531,7 +531,7 @@ async def execute_pipeline(pipeline_id: str, parameters: dict): {% raw %} ``` ---- +viz_api = create_visualization_api(broker, app) # reads config automatically ## 7. CORS Configuration From fff9b1c04cf151d1d1c88d4f5ef5c81a04338bf0 Mon Sep 17 00:00:00 2001 From: David Orel Date: Sun, 24 May 2026 18:01:25 +0200 Subject: [PATCH 06/12] Update docs/_en/guides/api.md Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --- docs/_en/guides/api.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/_en/guides/api.md b/docs/_en/guides/api.md index 590fa62..23f6e9e 100644 --- a/docs/_en/guides/api.md +++ b/docs/_en/guides/api.md @@ -542,7 +542,7 @@ Enable cross-origin requests for web frontend: app.add_middleware( CORSMiddleware, - allow_origins=["https://your-dashboard.com"], +`verify_pipeline_access` as a route dependency: allow_credentials=True, allow_methods=["GET", "POST"], allow_headers=["*"], From 020c91f184bc51d30bb556211b7f8dc83383769b Mon Sep 17 00:00:00 2001 From: David Orel Date: Sun, 24 May 2026 18:01:47 +0200 Subject: [PATCH 07/12] Update docs/_en/guides/api.md Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --- docs/_en/guides/api.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/_en/guides/api.md b/docs/_en/guides/api.md index 23f6e9e..b343d2c 100644 --- a/docs/_en/guides/api.md +++ 
b/docs/_en/guides/api.md @@ -595,7 +595,7 @@ services: {% raw %} ``` -### 8.3. Behind Reverse Proxy (nginx) +create_visualization_api(broker, app) # security auto-configured from config ```nginx {% endraw %} From 0642630337f9a19be0ead99e6ae4eb27e87de34c Mon Sep 17 00:00:00 2001 From: "github-actions[bot]" Date: Sun, 24 May 2026 16:06:15 +0000 Subject: [PATCH 08/12] docs: auto-update llms context files --- llms-full.txt | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/llms-full.txt b/llms-full.txt index 9895af2..c8f3aa6 100644 --- a/llms-full.txt +++ b/llms-full.txt @@ -99,7 +99,7 @@ This project is licensed under the MIT License — see the [LICENSE](https://git # API Reference: Cache -**Dogpile-based caching with cache stampede sémantics** +**Dogpile-based caching with cache stampede semantics** > **Version**: {VERSION} | **New in v1.2.0** | **Module**: `taskiq_flow.cache`, `taskiq_flow.middlewares.cache` @@ -730,7 +730,7 @@ Attach version and other metadata: @pipeline_task( output="features", description="Extract audio features (v2 with improvedtempo estimation)" -) + description="Extract audio features (v2 with improved tempo estimation)" def extract(path: str) -> dict: ... ``` @@ -6101,7 +6101,7 @@ async def execute_pipeline(pipeline_id: str, parameters: dict): ``` ---- +viz_api = create_visualization_api(broker, app) # reads config automatically ## 7. CORS Configuration @@ -6112,7 +6112,7 @@ Enable cross-origin requests for web frontend: app.add_middleware( CORSMiddleware, - allow_origins=["https://your-dashboard.com"], +`verify_pipeline_access` as a route dependency: allow_credentials=True, allow_methods=["GET", "POST"], allow_headers=["*"], @@ -6165,7 +6165,7 @@ services: ``` -### 8.3. 
Behind Reverse Proxy (nginx) +create_visualization_api(broker, app) # security auto-configured from config ```nginx From 1456d316719e7b1e06c91ac930656e43b995f411 Mon Sep 17 00:00:00 2001 From: David Orel Date: Sun, 24 May 2026 18:10:02 +0200 Subject: [PATCH 09/12] fix: fix doc problems --- docs/_en/guides/api.md | 1 + docs/_en/guides/performance.md | 33 ++++++++++----- docs/archive/docs-review-thread.md | 64 ++++++++++++++++++++++++++++++ 3 files changed, 89 insertions(+), 9 deletions(-) create mode 100644 docs/archive/docs-review-thread.md diff --git a/docs/_en/guides/api.md b/docs/_en/guides/api.md index 90edbab..dc221b5 100644 --- a/docs/_en/guides/api.md +++ b/docs/_en/guides/api.md @@ -693,6 +693,7 @@ Centralized error handling: ```python {% endraw %} from fastapi.responses import JSONResponse +from taskiq.exceptions import TaskiqError @app.exception_handler(TaskiqError) async def taskiq_exception_handler(request: Request, exc: TaskiqError): diff --git a/docs/_en/guides/performance.md b/docs/_en/guides/performance.md index 0d0e549..29b4e28 100644 --- a/docs/_en/guides/performance.md +++ b/docs/_en/guides/performance.md @@ -538,23 +538,38 @@ pipeline = DataflowPipeline( resource_aware=True, ) -@pipeline.task(resource_profile=heavy_profile) +@broker.task +@pipeline_task(output="heavy_result", resources=heavy_profile.model_dump()) def heavy_computation(data: dict) -> dict: """This task requires 4 CPU cores and 2 GB of RAM.""" return process_heavy_data(data) -# Configure the executor to respect resource profiles +# Configure the executor to compute optimal parallelism executor = ResourceAwareExecutor( - broker=broker, - max_parallel=10, + max_cpu_percent=80.0, + max_memory_percent=80.0, + min_parallel=1, + max_parallel=20, +) + +# Get optimal parallelism for the task +optimal_parallel = executor.get_optimal_parallelism( + task_memory_estimate=2048, + task_cpu_estimate=4.0, +) + +# Apply optimal parallelism to the pipeline before execution +pipeline = 
DataflowPipeline.from_tasks( + broker, + [heavy_computation], + max_parallel=optimal_parallel, ) -executor.run_pipeline(pipeline, input_data) +results = await pipeline.kiq_dataflow(data=input_data) ``` {% endraw %} -`ResourceAwareExecutor` evaluates resource profiles of tasks and distributes them -to available workers based on their capacity. `TaskResourceProfile` lets you -annotate each task with its estimated resource needs, enabling the executor to -prevent over-subscription of workers. +`ResourceAwareExecutor` evaluates resource profiles of tasks and computes optimal +parallelism via `get_optimal_parallelism()`. Use this value to configure the pipeline's +`max_parallel` setting before calling `kiq_dataflow()` for execution. --- diff --git a/docs/archive/docs-review-thread.md b/docs/archive/docs-review-thread.md new file mode 100644 index 0000000..9b7ca96 --- /dev/null +++ b/docs/archive/docs-review-thread.md @@ -0,0 +1,64 @@ +--- +title: Documentation Review Thread +date: 2026-05-24 +author: gemini-code-assist +status: archived +--- + +# Documentation Review Comments + +## Thread Summary + +Review comments on documentation files that were addressed and archived. + +--- + +## Comment 1: Missing TaskiqError Import + +**File**: `docs/_en/guides/api.md` + +### Issue +The TaskiqError exception is used in the function signature but is not imported in this code snippet. The import from taskiq import TaskiqError should be added. + +### Resolution +Added the missing import: +```python +from taskiq.exceptions import TaskiqError +``` + +--- + +## Comment 2: Inconsistent @pipeline.task Decorator + +**File**: `docs/_en/guides/api.md` + +### Issue +The use of the @pipeline.task decorator is inconsistent with the rest of the documentation, which recommends the combined use of @broker.task and @pipeline_task. + +### Resolution +The documentation already correctly uses the `@broker.task` + `@pipeline_task` pattern. No changes needed. 
+ +--- + +## Comment 3: Non-existent run_pipeline Method + +**File**: `docs/_en/guides/performance.md` + +### Issue +The run_pipeline method does not exist on the ResourceAwareExecutor class. This executor should be used to obtain the optimal parallelism via get_optimal_parallelism, which is then applied to the pipeline before it is executed via kiq_dataflow. + +### Resolution +Updated the code example to show the correct pattern: +1. Use `ResourceAwareExecutor` to compute optimal parallelism via `get_optimal_parallelism()` +2. Apply the computed value to the pipeline configuration +3. Execute via `pipeline.kiq_dataflow()` + +--- + +## Changes Made + +1. Added `TaskiqError` import to `docs/_en/guides/api.md` +2. Fixed resource-aware execution example in `docs/_en/guides/performance.md`: + - Changed `@pipeline.task` to correct `@broker.task` + `@pipeline_task` pattern + - Replaced `executor.run_pipeline()` with correct pattern using `get_optimal_parallelism()` + - Updated execution to use `pipeline.kiq_dataflow()` \ No newline at end of file From 4a733ec3618a425b0c98df6b2fa67174ccb53e40 Mon Sep 17 00:00:00 2001 From: "github-actions[bot]" Date: Sun, 24 May 2026 16:10:20 +0000 Subject: [PATCH 10/12] docs: auto-update llms context files --- llms-full.txt | 95 ++++++++++++++++++++++++++++++++++++++++++++++----- llms.txt | 1 + 2 files changed, 87 insertions(+), 9 deletions(-) diff --git a/llms-full.txt b/llms-full.txt index c8f3aa6..ddf5478 100644 --- a/llms-full.txt +++ b/llms-full.txt @@ -6263,6 +6263,7 @@ Centralized error handling: ```python from fastapi.responses import JSONResponse +from taskiq.exceptions import TaskiqError @app.exception_handler(TaskiqError) async def taskiq_exception_handler(request: Request, exc: TaskiqError): @@ -9192,23 +9193,38 @@ pipeline = DataflowPipeline( resource_aware=True, ) -@pipeline.task(resource_profile=heavy_profile) +@broker.task +@pipeline_task(output="heavy_result", resources=heavy_profile.model_dump()) def 
heavy_computation(data: dict) -> dict: """This task requires 4 CPU cores and 2 GB of RAM.""" return process_heavy_data(data) -# Configure the executor to respect resource profiles +# Configure the executor to compute optimal parallelism executor = ResourceAwareExecutor( - broker=broker, - max_parallel=10, + max_cpu_percent=80.0, + max_memory_percent=80.0, + min_parallel=1, + max_parallel=20, +) + +# Get optimal parallelism for the task +optimal_parallel = executor.get_optimal_parallelism( + task_memory_estimate=2048, + task_cpu_estimate=4.0, ) -executor.run_pipeline(pipeline, input_data) + +# Apply optimal parallelism to the pipeline before execution +pipeline = DataflowPipeline.from_tasks( + broker, + [heavy_computation], + max_parallel=optimal_parallel, +) +results = await pipeline.kiq_dataflow(data=input_data) ``` -`ResourceAwareExecutor` evaluates resource profiles of tasks and distributes them -to available workers based on their capacity. `TaskResourceProfile` lets you -annotate each task with its estimated resource needs, enabling the executor to -prevent over-subscription of workers. +`ResourceAwareExecutor` evaluates resource profiles of tasks and computes optimal +parallelism via `get_optimal_parallelism()`. Use this value to configure the pipeline's +`max_parallel` setting before calling `kiq_dataflow()` for execution. --- @@ -27212,6 +27228,67 @@ Puis se connecter avec `ws://localhost:8000/ws/{pipeline_id}`. *Prêt à approfondir ? Continuez avec le [Guide des Pipelines]({{ '/fr/guides/pipelines/' | relative_url }}).* +## DOCUMENT: Docs Review Thread + +# Documentation Review Comments + +## Thread Summary + +Review comments on documentation files that were addressed and archived. + +--- + +## Comment 1: Missing TaskiqError Import + +**File**: `docs/_en/guides/api.md` + +### Issue +L'exception TaskiqError est utilisée dans la signature de la fonction mais n'est pas importée dans cet extrait de code. Il faudrait ajouter from taskiq import TaskiqError. 
+ +### Resolution +Added the missing import: +```python +from taskiq.exceptions import TaskiqError +``` + +--- + +## Comment 2: Inconsistent @pipeline.task Decorator + +**File**: `docs/_en/guides/api.md` + +### Issue +L'utilisation du décorateur @pipeline.task est incohérente avec le reste de la documentation qui préconise l'utilisation combinée de @broker.task et @pipeline_task. + +### Resolution +The documentation already correctly uses `@broker.task` + `@pipeline_task` pattern. No changes needed. + +--- + +## Comment 3: Non-existent run_pipeline Method + +**File**: `docs/_en/guides/performance.md` + +### Issue +La méthode run_pipeline n'existe pas sur la classe ResourceAwareExecutor. Cet exécuteur doit être utilisé pour obtenir le parallélisme optimal via get_optimal_parallelism, qui est ensuite appliqué au pipeline avant son exécution via kiq_dataflow. + +### Resolution +Updated the code example to show the correct pattern: +1. Use `ResourceAwareExecutor` to compute optimal parallelism via `get_optimal_parallelism()` +2. Apply the computed value to the pipeline configuration +3. Execute via `pipeline.kiq_dataflow()` + +--- + +## Changes Made + +1. Added `TaskiqError` import to `docs/_en/guides/api.md` +2. 
Fixed resource-aware execution example in `docs/_en/guides/performance.md`: + - Changed `@pipeline.task` to correct `@broker.task` + `@pipeline_task` pattern + - Replaced `executor.run_pipeline()` with correct pattern using `get_optimal_parallelism()` + - Updated execution to use `pipeline.kiq_dataflow()` + + # Code Examples diff --git a/llms.txt b/llms.txt index 9d9c289..cdedd18 100644 --- a/llms.txt +++ b/llms.txt @@ -86,6 +86,7 @@ - [Websocket](https://dorel14.github.io/taskiq-flow/_fr/guides/websocket) - [Index](https://dorel14.github.io/taskiq-flow/_fr/index) - [Quickstart](https://dorel14.github.io/taskiq-flow/_fr/quickstart) +- [Docs Review Thread](https://dorel14.github.io/taskiq-flow/archive/docs-review-thread) ## Code Examples & Recipes From 05ac19a0ed507ad430e123151ba91a6b238fb683 Mon Sep 17 00:00:00 2001 From: David Orel Date: Sun, 24 May 2026 18:18:21 +0200 Subject: [PATCH 11/12] fix : doc fixes --- docs/_en/guides/performance.md | 15 ++++------- docs/_fr/guides/performance.md | 41 +++++++++++++++++++++--------- docs/archive/docs-review-thread.md | 3 ++- 3 files changed, 36 insertions(+), 23 deletions(-) diff --git a/docs/_en/guides/performance.md b/docs/_en/guides/performance.md index 29b4e28..db082a8 100644 --- a/docs/_en/guides/performance.md +++ b/docs/_en/guides/performance.md @@ -515,6 +515,7 @@ engine = GPUOptimizedEngine(broker, dag) results = await engine.execute(inputs) ``` {% endraw %} + ### 11.1. 
Resource-Aware Execution with `TaskResourceProfile` Taskiq-Flow provides a resource-aware execution pattern for pipelines that need @@ -522,8 +523,9 @@ to allocate tasks to workers based on their CPU/RAM requirements: {% raw %} ```python -from taskiq_flow import ResourceAwareExecutor, TaskResourceProfile -from taskiq_flow.dataflow import DataflowPipeline +from taskiq_flow import pipeline_task +from taskiq_flow.optimization import ResourceAwareExecutor, TaskResourceProfile +from taskiq_flow.pipeline import DataflowPipeline # Define a resource profile for heavy tasks heavy_profile = TaskResourceProfile( @@ -531,13 +533,6 @@ heavy_profile = TaskResourceProfile( estimated_cpu_cores=4.0, ) -# Annotate tasks with resource needs via labels when creating the pipeline -pipeline = DataflowPipeline( - broker=broker, - name="resource_aware_pipeline", - resource_aware=True, -) - @broker.task @pipeline_task(output="heavy_result", resources=heavy_profile.model_dump()) def heavy_computation(data: dict) -> dict: @@ -558,7 +553,7 @@ optimal_parallel = executor.get_optimal_parallelism( task_cpu_estimate=4.0, ) -# Apply optimal parallelism to the pipeline before execution +# Build pipeline with optimal parallelism pipeline = DataflowPipeline.from_tasks( broker, [heavy_computation], diff --git a/docs/_fr/guides/performance.md b/docs/_fr/guides/performance.md index a9fc4f3..d4f5943 100644 --- a/docs/_fr/guides/performance.md +++ b/docs/_fr/guides/performance.md @@ -518,11 +518,13 @@ results = await engine.execute(inputs) ### 11.1. 
ResourceAwareExecutor et TaskResourceProfile TaskIQ-Flow fournit un exécuteur conscient des ressources qui peut être utilisé -pour allouer des tâches aux workers en fonction de leurs besoins en ressources : +pour calculer le parallélisme optimal selon les ressources CPU et mémoire : {% raw %} ```python -from taskiq_flow import ResourceAwareExecutor, TaskResourceProfile +from taskiq_flow import pipeline_task +from taskiq_flow.optimization import ResourceAwareExecutor, TaskResourceProfile +from taskiq_flow.pipeline import DataflowPipeline # Définir un profil de ressources pour les tâches lourdes heavy_profile = TaskResourceProfile( @@ -531,22 +533,37 @@ heavy_profile = TaskResourceProfile( ) @broker.task -@heavy_profile -def heavy_computation(data): - # Cette tâche nécessite 4 cœurs CPU et 2 Go de RAM +@pipeline_task(output="heavy_result", resources=heavy_profile.model_dump()) +def heavy_computation(data: dict) -> dict: + """Cette tâche nécessite 4 cœurs CPU et 2 Go de RAM.""" return process_heavy_data(data) -# Utiliser ResourceAwareExecutor pour l'exécution +# Configurer l'exécuteur pour calculer le parallélisme optimal executor = ResourceAwareExecutor( - broker=broker, - max_parallel=10, + max_cpu_percent=80.0, + max_memory_percent=80.0, + min_parallel=1, + max_parallel=20, +) + +# Obtenir le parallélisme optimal pour la tâche +optimal_parallel = executor.get_optimal_parallelism( + task_memory_estimate=2048, + task_cpu_estimate=4.0, +) + +# Construire le pipeline avec le parallélisme optimal +pipeline = DataflowPipeline.from_tasks( + broker, + [heavy_computation], + max_parallel=optimal_parallel, ) +results = await pipeline.kiq_dataflow(data=input_data) ``` {% endraw %} -`ResourceAwareExecutor` évalue les profils de ressources des tâches et les -distribue aux workers disponibles en fonction de leur capacité. -`TaskResourceProfile` permet d'annoter chaque tâche avec ses besoins estimés -en mémoire et CPU. 
+`ResourceAwareExecutor` évalue les profils de ressources des tâches et calcule +le parallélisme optimal via `get_optimal_parallelism()`. Utilisez cette valeur +pour configurer `max_parallel` du pipeline avant d'appeler `kiq_dataflow()`. ## 13. Résumé diff --git a/docs/archive/docs-review-thread.md b/docs/archive/docs-review-thread.md index 9b7ca96..b4351f8 100644 --- a/docs/archive/docs-review-thread.md +++ b/docs/archive/docs-review-thread.md @@ -61,4 +61,5 @@ Updated the code example to show the correct pattern: 2. Fixed resource-aware execution example in `docs/_en/guides/performance.md`: - Changed `@pipeline.task` to correct `@broker.task` + `@pipeline_task` pattern - Replaced `executor.run_pipeline()` with correct pattern using `get_optimal_parallelism()` - - Updated execution to use `pipeline.kiq_dataflow()` \ No newline at end of file + - Updated execution to use `pipeline.kiq_dataflow()` +3. Fixed the same issues in `docs/_fr/guides/performance.md` \ No newline at end of file From 852a03ec21860e145d9be64b65814adf8872f3d1 Mon Sep 17 00:00:00 2001 From: "github-actions[bot]" Date: Sun, 24 May 2026 16:19:02 +0000 Subject: [PATCH 12/12] docs: auto-update llms context files --- llms-full.txt | 57 +++++++++++++++++++++++++++++++-------------------- 1 file changed, 35 insertions(+), 22 deletions(-) diff --git a/llms-full.txt b/llms-full.txt index ddf5478..2c13e7f 100644 --- a/llms-full.txt +++ b/llms-full.txt @@ -9170,6 +9170,7 @@ engine = GPUOptimizedEngine(broker, dag) results = await engine.execute(inputs) ``` + ### 11.1. 
Resource-Aware Execution with `TaskResourceProfile` Taskiq-Flow provides a resource-aware execution pattern for pipelines that need @@ -9177,8 +9178,9 @@ to allocate tasks to workers based on their CPU/RAM requirements: ```python -from taskiq_flow import ResourceAwareExecutor, TaskResourceProfile -from taskiq_flow.dataflow import DataflowPipeline +from taskiq_flow import pipeline_task +from taskiq_flow.optimization import ResourceAwareExecutor, TaskResourceProfile +from taskiq_flow.pipeline import DataflowPipeline # Define a resource profile for heavy tasks heavy_profile = TaskResourceProfile( @@ -9186,13 +9188,6 @@ heavy_profile = TaskResourceProfile( estimated_cpu_cores=4.0, ) -# Annotate tasks with resource needs via labels when creating the pipeline -pipeline = DataflowPipeline( - broker=broker, - name="resource_aware_pipeline", - resource_aware=True, -) - @broker.task @pipeline_task(output="heavy_result", resources=heavy_profile.model_dump()) def heavy_computation(data: dict) -> dict: @@ -9213,7 +9208,7 @@ optimal_parallel = executor.get_optimal_parallelism( task_cpu_estimate=4.0, ) -# Apply optimal parallelism to the pipeline before execution +# Build pipeline with optimal parallelism pipeline = DataflowPipeline.from_tasks( broker, [heavy_computation], @@ -22508,11 +22503,13 @@ results = await engine.execute(inputs) ### 11.1. 
ResourceAwareExecutor et TaskResourceProfile TaskIQ-Flow fournit un exécuteur conscient des ressources qui peut être utilisé -pour allouer des tâches aux workers en fonction de leurs besoins en ressources : +pour calculer le parallélisme optimal selon les ressources CPU et mémoire : ```python -from taskiq_flow import ResourceAwareExecutor, TaskResourceProfile +from taskiq_flow import pipeline_task +from taskiq_flow.optimization import ResourceAwareExecutor, TaskResourceProfile +from taskiq_flow.pipeline import DataflowPipeline # Définir un profil de ressources pour les tâches lourdes heavy_profile = TaskResourceProfile( @@ -22521,22 +22518,37 @@ heavy_profile = TaskResourceProfile( ) @broker.task -@heavy_profile -def heavy_computation(data): - # Cette tâche nécessite 4 cœurs CPU et 2 Go de RAM +@pipeline_task(output="heavy_result", resources=heavy_profile.model_dump()) +def heavy_computation(data: dict) -> dict: + """Cette tâche nécessite 4 cœurs CPU et 2 Go de RAM.""" return process_heavy_data(data) -# Utiliser ResourceAwareExecutor pour l'exécution +# Configurer l'exécuteur pour calculer le parallélisme optimal executor = ResourceAwareExecutor( - broker=broker, - max_parallel=10, + max_cpu_percent=80.0, + max_memory_percent=80.0, + min_parallel=1, + max_parallel=20, +) + +# Obtenir le parallélisme optimal pour la tâche +optimal_parallel = executor.get_optimal_parallelism( + task_memory_estimate=2048, + task_cpu_estimate=4.0, ) + +# Construire le pipeline avec le parallélisme optimal +pipeline = DataflowPipeline.from_tasks( + broker, + [heavy_computation], + max_parallel=optimal_parallel, +) +results = await pipeline.kiq_dataflow(data=input_data) ``` -`ResourceAwareExecutor` évalue les profils de ressources des tâches et les -distribue aux workers disponibles en fonction de leur capacité. -`TaskResourceProfile` permet d'annoter chaque tâche avec ses besoins estimés -en mémoire et CPU. 
+`ResourceAwareExecutor` évalue les profils de ressources des tâches et calcule +le parallélisme optimal via `get_optimal_parallelism()`. Utilisez cette valeur +pour configurer `max_parallel` du pipeline avant d'appeler `kiq_dataflow()`. ## 13. Résumé @@ -27287,6 +27299,7 @@ Updated the code example to show the correct pattern: - Changed `@pipeline.task` to correct `@broker.task` + `@pipeline_task` pattern - Replaced `executor.run_pipeline()` with correct pattern using `get_optimal_parallelism()` - Updated execution to use `pipeline.kiq_dataflow()` +3. Fixed the same issues in `docs/_fr/guides/performance.md` # Code Examples