From 975f9f93c2f464028ab5537d34f35c6a369fe6c9 Mon Sep 17 00:00:00 2001 From: Max Bohomolov Date: Thu, 22 Jan 2026 17:13:49 +0000 Subject: [PATCH 1/5] add deploy on AWS --- .../aws_lambda_beautifulsoup_crawler.mdx | 93 +++++++++++++++ .../aws_lambda_playwright_crawler.mdx | 107 ++++++++++++++++++ .../aws/beautifulsoup_crawler_lambda.py | 61 ++++++++++ .../aws/playwright_crawler_lambda.py | 73 ++++++++++++ .../code_examples/aws/playwright_dockerfile | 36 ++++++ website/sidebars.js | 16 +-- 6 files changed, 378 insertions(+), 8 deletions(-) create mode 100644 docs/deployment/aws_lambda_beautifulsoup_crawler.mdx create mode 100644 docs/deployment/aws_lambda_playwright_crawler.mdx create mode 100644 docs/deployment/code_examples/aws/beautifulsoup_crawler_lambda.py create mode 100644 docs/deployment/code_examples/aws/playwright_crawler_lambda.py create mode 100644 docs/deployment/code_examples/aws/playwright_dockerfile diff --git a/docs/deployment/aws_lambda_beautifulsoup_crawler.mdx b/docs/deployment/aws_lambda_beautifulsoup_crawler.mdx new file mode 100644 index 0000000000..65796eb619 --- /dev/null +++ b/docs/deployment/aws_lambda_beautifulsoup_crawler.mdx @@ -0,0 +1,93 @@ +--- +id: aws-lambda-beautifulsoup +title: BeautifulSoup crawler on AWS Lambda +description: Prepare your BeautifulSoupCrawler to run in Lambda functions on Amazon Web Services. +--- + +import ApiLink from '@site/src/components/ApiLink'; + +import CodeBlock from '@theme/CodeBlock'; + +import BeautifulSoupCrawlerLambda from '!!raw-loader!./code_examples/aws/beautifulsoup_crawler_lambda.py'; + +[AWS Lambda](https://docs.aws.amazon.com/lambda/latest/dg/welcome.html) is a serverless compute service that lets you run code without provisioning or managing servers. It is well suited for deploying simple crawlers that don't require browser rendering. For simple projects, you can deploy using a ZIP archive. + +## Updating the code + +For the project foundation, use `BeautifulSoupCrawler` as described in this [example](../examples/beautifulsoup-crawler). + +When instantiating a crawler, use `MemoryStorageClient`. By default, Crawlee uses file-based storage, but the Lambda filesystem is read-only (except for `/tmp`). Using `MemoryStorageClient` tells Crawlee to use in-memory storage instead. + +Wrap the crawler logic in a `lambda_handler` function. This is the entry point that AWS will execute. + +:::important + +Make sure to always instantiate a new crawler for every Lambda invocation. AWS keeps the environment running for some time after the first execution (to reduce cold-start times), so subsequent calls may access an already-used crawler instance. + +**TLDR: Keep your Lambda stateless.** + +::: + +Finally, return the scraped data from the Lambda when the crawler run ends. + + + {BeautifulSoupCrawlerLambda} + + +## Deploying the project + +### Preparing the environment + +Lambda requires all dependencies to be included in the deployment package. Create a virtual environment and install dependencies: + +```bash +python3.14 -m venv .venv +source .venv/bin/activate +pip install 'crawlee[beautifulsoup]' 'boto3' 'aws-lambda-powertools' +``` + +### Creating the ZIP archive + +Create a ZIP archive from your project, including dependencies from the virtual environment: + +```bash +cd .venv/lib/python3.14/site-packages +zip -r ../../../../package.zip . +cd ../../../../ +zip package.zip lambda_function.py +``` + +:::note Large dependencies? + +AWS has a limit of 50MB for direct upload and 250MB for unzipped deployment package size. 
+ +A better way to manage dependencies is by using Lambda Layers. With Layers, you can share files between multiple Lambda functions and keep the actual code as slim as possible. + +To create a Lambda Layer: + +1. Create a `python/` folder and copy dependencies from `site-packages` into it +2. Create a zip archive: `zip -r layer.zip python/` +3. Create a new Lambda Layer from the archive (you may need to upload it to S3 first) +4. Attach the Layer to your Lambda function + +::: + +### Uploading and configuring + +Upload `package.zip` as the code source in the AWS Lambda Console using the "Upload from" button. + +In Lambda Runtime Settings, set the handler. Since the file is named `lambda_function.py` and the function is `lambda_handler`, you can use the default value `lambda_function.lambda_handler`. + +:::tip Configuration + +In the Configuration tab, you can adjust: + +- **Memory**: Memory size can greatly affect execution speed. A minimum of 256-512 MB is recommended. +- **Timeout**: Set according to the size of the website you are scraping (1 minute for code in example). +- **Ephemeral storage**: Size of the `/tmp` directory. + +See the [official documentation](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) to learn how performance and cost scale with memory. + +::: + +After the Lambda deploys, you can test it by clicking the "Test" button. The event contents don't matter for a basic test, but you can parameterize your crawler by analyzing the event object that AWS passes as the first argument to the handler. diff --git a/docs/deployment/aws_lambda_playwright_crawler.mdx b/docs/deployment/aws_lambda_playwright_crawler.mdx new file mode 100644 index 0000000000..20950a7359 --- /dev/null +++ b/docs/deployment/aws_lambda_playwright_crawler.mdx @@ -0,0 +1,107 @@ +--- +id: aws-lambda-playwright +title: PlaywrightCrawler on AWS Lambda +description: Prepare your PlaywrightCrawler to run in Lambda functions on Amazon Web Services. +--- + +import ApiLink from '@site/src/components/ApiLink'; + +import CodeBlock from '@theme/CodeBlock'; + +import PlaywrightCrawlerLambda from '!!raw-loader!./code_examples/aws/playwright_crawler_lambda.py'; +import PlaywrightCrawlerDockerfile from '!!raw-loader!./code_examples/aws/playwright_dockerfile'; + +[AWS Lambda](https://docs.aws.amazon.com/lambda/latest/dg/welcome.html) is a serverless compute service that runs code without provisioning or managing servers. For crawlers that require browser rendering, you need to deploy using Docker container images because Playwright and browser binaries exceed Lambda's ZIP deployment size limits. + +## Updating the code + +For the project foundation, use `BeautifulSoupCrawler` as described in this [example](../examples/beautifulsoup-crawler). We will update it to work with `PlaywrightCrawler`. + +When instantiating a crawler, use `MemoryStorageClient`. By default, Crawlee uses file-based storage, but the Lambda filesystem is read-only (except for `/tmp`). Using `MemoryStorageClient` tells Crawlee to use in-memory storage instead. Replace `BeautifulSoupCrawler` with `PlaywrightCrawler` and configure `browser_launch_options` with flags optimized for serverless environments. These flags disable sandboxing and GPU features that aren't available in Lambda's containerized runtime. + +Wrap the crawler logic in a `lambda_handler` function. This is the entry point that AWS will execute. + +:::important + +Make sure to always instantiate a new crawler for every Lambda invocation. 
AWS keeps the environment running for some time after the first execution (to reduce cold-start times), so subsequent calls may access an already-used crawler instance. + +**TLDR: Keep your Lambda stateless.** + +::: + +Finally, return the scraped data from the Lambda when the crawler run ends. + + + {PlaywrightCrawlerLambda} + + +## Deploying the project + +### Installing and configuring AWS CLI + +Install AWS CLI following the [official documentation](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) according to your operating system. + +Authenticate by running: + +```bash +aws login +``` + +### Preparing the project + +Initialize the project by running `uvx 'crawlee[cli]' create`. + +Or use a single command if you don't need interactive mode: + +```bash +uvx 'crawlee[cli]' create aws_playwright --crawler-type playwright --http-client impit --package-manager uv --no-apify --start-url 'https://crawlee.dev' --install +``` + +Add additional dependencies: + +```bash +uv add awslambdaric aws-lambda-powertools boto3 +``` + +The project is created with a Dockerfile that needs to be modified for AWS Lambda by adding `ENTRYPOINT` and updating `CMD`: + + + {PlaywrightCrawlerDockerfile} + + +### Building and pushing the Docker image + +Create a repository `lambda/aws-playwright` in [Amazon Elastic Container Registry](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html) in the same region where your Lambda functions will run. To learn more, refer to the [official documentation](https://docs.aws.amazon.com/AmazonECR/latest/userguide/getting-started-cli.html). + +Navigate to the created repository and click the "View push commands" button. This will open a window with console commands for uploading the Docker image to your repository. Execute them. + +Example: +```bash +aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin {user-specific-data} +docker build --platform linux/amd64 --provenance=false -t lambda/aws-playwright . +docker tag lambda/aws-playwright:latest {user-specific-data}/lambda/aws-playwright:latest +docker push {user-specific-data}/lambda/aws-playwright:latest +``` + +### Creating the Lambda function + +1. In the AWS Lambda Console, click "Create function" +2. Select "Container image" +3. Browse and select your ECR image +4. Configure the function settings + +:::tip Configuration + +Playwright crawlers require more resources than HTTP-based crawlers: + +- **Memory**: Minimum 1024 MB recommended. Browser operations are memory-intensive, so 2048 MB or more may be needed for complex pages. +- **Timeout**: Set according to crawl size. Browser startup adds overhead, so allow at least 5 minutes even for simple crawls. +- **Ephemeral storage**: Default 512 MB is usually sufficient unless downloading large files. + +See the [official documentation](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) to learn how performance and cost scale with memory. + +::: + +## Testing the function + +After the Lambda deploys, click the "Test" button to invoke it. The event contents don't matter for a basic test, but you can parameterize your crawler by parsing the event object that AWS passes as the first argument to the handler. 
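As a rough sketch of such parameterization (the `start_urls` event key is only an assumed convention, and it presumes `main` is adjusted to accept a list of start URLs), the handler could look like this:

```python
import asyncio
from typing import Any

from aws_lambda_powertools.utilities.typing import LambdaContext


def lambda_handler(event: dict[str, Any], _context: LambdaContext) -> dict[str, Any]:
    # `start_urls` is a hypothetical field of your test event payload, not something AWS provides.
    start_urls = event.get('start_urls', ['https://crawlee.dev'])
    # Assumes `main` from the example is modified to accept the start URLs and pass them to `crawler.run()`.
    result = asyncio.run(main(start_urls))
    return {'statusCode': 200, 'body': result}
```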
diff --git a/docs/deployment/code_examples/aws/beautifulsoup_crawler_lambda.py b/docs/deployment/code_examples/aws/beautifulsoup_crawler_lambda.py new file mode 100644 index 0000000000..3fb8bfe3b1 --- /dev/null +++ b/docs/deployment/code_examples/aws/beautifulsoup_crawler_lambda.py @@ -0,0 +1,61 @@ +import asyncio +import json +from datetime import timedelta +from typing import Any + +from aws_lambda_powertools.utilities.typing import LambdaContext + +from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext +from crawlee.storage_clients import MemoryStorageClient +from crawlee.storages import Dataset, RequestQueue + + +async def main() -> str: + # highlight-start + # Disable writing storage data to the file system + storage_client = MemoryStorageClient() + # highlight-end + + # Initialize storages + dataset = await Dataset.open(storage_client=storage_client) + request_queue = await RequestQueue.open(storage_client=storage_client) + + crawler = BeautifulSoupCrawler( + storage_client=storage_client, + max_request_retries=1, + request_handler_timeout=timedelta(seconds=30), + max_requests_per_crawl=10, + ) + + @crawler.router.default_handler + async def request_handler(context: BeautifulSoupCrawlingContext) -> None: + context.log.info(f'Processing {context.request.url} ...') + + data = { + 'url': context.request.url, + 'title': context.soup.title.string if context.soup.title else None, + 'h1s': [h1.text for h1 in context.soup.find_all('h1')], + 'h2s': [h2.text for h2 in context.soup.find_all('h2')], + 'h3s': [h3.text for h3 in context.soup.find_all('h3')], + } + + await context.push_data(data) + await context.enqueue_links() + + await crawler.run(['https://crawlee.dev']) + + # Extract data saved in `Dataset` + data = await crawler.get_data() + + # Clean up storages after the crawl + await dataset.drop() + await request_queue.drop() + + # Serialize the list of scraped items to JSON string + return json.dumps(data.items) + + +def lambda_handler(_event: dict[str, Any], _context: LambdaContext) -> dict[str, Any]: + result = asyncio.run(main()) + # Return the response with results + return {'statusCode': 200, 'body': result} diff --git a/docs/deployment/code_examples/aws/playwright_crawler_lambda.py b/docs/deployment/code_examples/aws/playwright_crawler_lambda.py new file mode 100644 index 0000000000..d1c831ef51 --- /dev/null +++ b/docs/deployment/code_examples/aws/playwright_crawler_lambda.py @@ -0,0 +1,73 @@ +import asyncio +import json +from datetime import timedelta +from typing import Any + +from aws_lambda_powertools.utilities.typing import LambdaContext + +from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext +from crawlee.storage_clients import MemoryStorageClient +from crawlee.storages import Dataset, RequestQueue + + +async def main() -> str: + # highlight-start + # Disable writing storage data to the file system + storage_client = MemoryStorageClient() + # highlight-end + + # Initialize storages + dataset = await Dataset.open(storage_client=storage_client) + request_queue = await RequestQueue.open(storage_client=storage_client) + + crawler = PlaywrightCrawler( + storage_client=storage_client, + max_request_retries=1, + request_handler_timeout=timedelta(seconds=30), + max_requests_per_crawl=10, + # highlight-start + # Configure Playwright to run in AWS Lambda environment + browser_launch_options={ + 'args': [ + '--no-sandbox', + '--disable-setuid-sandbox', + '--disable-dev-shm-usage', + '--disable-gpu', + '--single-process', + ] + }, + # 
highlight-end + ) + + @crawler.router.default_handler + async def request_handler(context: PlaywrightCrawlingContext) -> None: + context.log.info(f'Processing {context.request.url} ...') + + data = { + 'url': context.request.url, + 'title': await context.page.title(), + 'h1s': await context.page.locator('h1').all_text_contents(), + 'h2s': await context.page.locator('h2').all_text_contents(), + 'h3s': await context.page.locator('h3').all_text_contents(), + } + + await context.push_data(data) + await context.enqueue_links() + + await crawler.run(['https://crawlee.dev']) + + # Extract data saved in `Dataset` + data = await crawler.get_data() + + # Clean up storages after the crawl + await dataset.drop() + await request_queue.drop() + + # Serialize the list of scraped items to JSON string + return json.dumps(data.items) + + +def lambda_handler(_event: dict[str, Any], _context: LambdaContext) -> dict[str, Any]: + result = asyncio.run(main()) + # Return the response with results + return {'statusCode': 200, 'body': result} diff --git a/docs/deployment/code_examples/aws/playwright_dockerfile b/docs/deployment/code_examples/aws/playwright_dockerfile new file mode 100644 index 0000000000..618587e55f --- /dev/null +++ b/docs/deployment/code_examples/aws/playwright_dockerfile @@ -0,0 +1,36 @@ +FROM apify/actor-python-playwright:3.14 + +RUN apt update && apt install -yq git && rm -rf /var/lib/apt/lists/* + +RUN pip install -U pip setuptools \ + && pip install 'uv<1' + +ENV UV_PROJECT_ENVIRONMENT="/usr/local" + +COPY pyproject.toml uv.lock ./ + +RUN echo "Python version:" \ + && python --version \ + && echo "Installing dependencies:" \ + && PLAYWRIGHT_INSTALLED=$(pip freeze | grep -q playwright && echo "true" || echo "false") \ + && if [ "$PLAYWRIGHT_INSTALLED" = "true" ]; then \ + echo "Playwright already installed, excluding from uv sync" \ + && uv sync --frozen --no-install-project --no-editable -q --no-dev --inexact --no-install-package playwright; \ + else \ + echo "Playwright not found, installing all dependencies" \ + && uv sync --frozen --no-install-project --no-editable -q --no-dev --inexact; \ + fi \ + && echo "All installed Python packages:" \ + && pip freeze + +COPY . ./ + +RUN python -m compileall -q . 
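+# (The compileall step above pre-compiles the sources to bytecode so they are not compiled during the first, cold invocation.)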
+ +# highlight-start +# AWS Lambda entrypoint +ENTRYPOINT [ "/usr/local/bin/python3", "-m", "awslambdaric" ] + +# Lambda handler function +CMD [ "aws_playwright.main.lambda_handler" ] +# highlight-end diff --git a/website/sidebars.js b/website/sidebars.js index e843ac1336..9aa98d18f0 100644 --- a/website/sidebars.js +++ b/website/sidebars.js @@ -54,14 +54,14 @@ module.exports = { id: 'deployment/apify-platform', label: 'Deploy on Apify', }, - // { - // type: 'category', - // label: 'Deploy on AWS', - // items: [ - // 'deployment/aws-cheerio', - // 'deployment/aws-browsers', - // ], - // }, + { + type: 'category', + label: 'Deploy on AWS', + items: [ + 'deployment/aws-lambda-beautifulsoup', + 'deployment/aws-lambda-playwright', + ], + }, { type: 'category', label: 'Deploy to Google Cloud', From a4d81fb1770f8b6064cb23f311dea767aa6f8468 Mon Sep 17 00:00:00 2001 From: Max Bohomolov <34358312+Mantisus@users.noreply.github.com> Date: Mon, 26 Jan 2026 16:20:57 +0200 Subject: [PATCH 2/5] Update docs/deployment/aws_lambda_beautifulsoup_crawler.mdx Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- docs/deployment/aws_lambda_beautifulsoup_crawler.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/deployment/aws_lambda_beautifulsoup_crawler.mdx b/docs/deployment/aws_lambda_beautifulsoup_crawler.mdx index 65796eb619..a670a8e5d4 100644 --- a/docs/deployment/aws_lambda_beautifulsoup_crawler.mdx +++ b/docs/deployment/aws_lambda_beautifulsoup_crawler.mdx @@ -90,4 +90,4 @@ See the [official documentation](https://docs.aws.amazon.com/lambda/latest/dg/ge ::: -After the Lambda deploys, you can test it by clicking the "Test" button. The event contents don't matter for a basic test, but you can parameterize your crawler by analyzing the event object that AWS passes as the first argument to the handler. +After the Lambda deploys, you can test it by clicking the "Test" button. The event contents don't matter for a basic test, but you can parameterize your crawler by parsing the event object that AWS passes as the first argument to the handler. From 36af0c2711d9fc85fc27ef6c7743fd80f6b737d0 Mon Sep 17 00:00:00 2001 From: Max Bohomolov Date: Mon, 26 Jan 2026 16:40:57 +0000 Subject: [PATCH 3/5] update docs --- .../aws_lambda_beautifulsoup_crawler.mdx | 15 +++++++++++++-- .../deployment/aws_lambda_playwright_crawler.mdx | 16 ++++++++++------ 2 files changed, 23 insertions(+), 8 deletions(-) diff --git a/docs/deployment/aws_lambda_beautifulsoup_crawler.mdx b/docs/deployment/aws_lambda_beautifulsoup_crawler.mdx index a670a8e5d4..4b460aef0e 100644 --- a/docs/deployment/aws_lambda_beautifulsoup_crawler.mdx +++ b/docs/deployment/aws_lambda_beautifulsoup_crawler.mdx @@ -46,6 +46,8 @@ source .venv/bin/activate pip install 'crawlee[beautifulsoup]' 'boto3' 'aws-lambda-powertools' ``` +[`boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) is the AWS SDK for Python. Including it in your dependencies is recommended to avoid version misalignment issues with the Lambda runtime. + ### Creating the ZIP archive Create a ZIP archive from your project, including dependencies from the virtual environment: @@ -72,9 +74,18 @@ To create a Lambda Layer: ::: -### Uploading and configuring +### Create Lambda function + +Create the Lambda function in the AWS Lambda Console: + +1. Navigate to `Lambda` in [AWS Management Console](https://aws.amazon.com/console/). +2. Click **Create function**. +3. Select **Author from scratch**. +4. 
Enter a **Function name**, for example `BeautifulSoupTest`. +5. Choose a **Python runtime** that matches the version used in your virtual environment (for example, Python 3.14). +6. Click **Create function** to finish. -Upload `package.zip` as the code source in the AWS Lambda Console using the "Upload from" button. +Once created, upload `package.zip` as the code source in the AWS Lambda Console using the "Upload from" button. In Lambda Runtime Settings, set the handler. Since the file is named `lambda_function.py` and the function is `lambda_handler`, you can use the default value `lambda_function.lambda_handler`. diff --git a/docs/deployment/aws_lambda_playwright_crawler.mdx b/docs/deployment/aws_lambda_playwright_crawler.mdx index 20950a7359..279d3d8de1 100644 --- a/docs/deployment/aws_lambda_playwright_crawler.mdx +++ b/docs/deployment/aws_lambda_playwright_crawler.mdx @@ -1,6 +1,6 @@ --- id: aws-lambda-playwright -title: PlaywrightCrawler on AWS Lambda +title: Playwright crawler on AWS Lambda description: Prepare your PlaywrightCrawler to run in Lambda functions on Amazon Web Services. --- @@ -63,6 +63,9 @@ Add additional dependencies: uv add awslambdaric aws-lambda-powertools boto3 ``` +[`boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) is the AWS SDK for Python. Use it if your function integrates with any other AWS services. + + The project is created with a Dockerfile that needs to be modified for AWS Lambda by adding `ENTRYPOINT` and updating `CMD`: @@ -85,14 +88,15 @@ docker push {user-specific-data}/lambda/aws-playwright:latest ### Creating the Lambda function -1. In the AWS Lambda Console, click "Create function" -2. Select "Container image" -3. Browse and select your ECR image -4. Configure the function settings +1. Navigate to `Lambda` in [AWS Management Console](https://aws.amazon.com/console/). +2. Click **Create function**. +3. Select **Container image**. +4. Browse and select your ECR image. +5. Click **Create function** to finish. :::tip Configuration -Playwright crawlers require more resources than HTTP-based crawlers: +In the Configuration tab, you can adjust resources. Playwright crawlers require more resources than HTTP-based crawlers: - **Memory**: Minimum 1024 MB recommended. Browser operations are memory-intensive, so 2048 MB or more may be needed for complex pages. - **Timeout**: Set according to crawl size. Browser startup adds overhead, so allow at least 5 minutes even for simple crawls. From a6e62907f353a6083bdf11cb0c09ceb2a48f3459 Mon Sep 17 00:00:00 2001 From: Max Bohomolov Date: Wed, 28 Jan 2026 00:40:53 +0000 Subject: [PATCH 4/5] merge to one guide --- docs/deployment/aws_lambda.mdx | 190 ++++++++++++++++++ .../aws_lambda_beautifulsoup_crawler.mdx | 104 ---------- .../aws_lambda_playwright_crawler.mdx | 111 ---------- website/sidebars.js | 9 +- 4 files changed, 193 insertions(+), 221 deletions(-) create mode 100644 docs/deployment/aws_lambda.mdx delete mode 100644 docs/deployment/aws_lambda_beautifulsoup_crawler.mdx delete mode 100644 docs/deployment/aws_lambda_playwright_crawler.mdx diff --git a/docs/deployment/aws_lambda.mdx b/docs/deployment/aws_lambda.mdx new file mode 100644 index 0000000000..664b82d91b --- /dev/null +++ b/docs/deployment/aws_lambda.mdx @@ -0,0 +1,190 @@ +--- +id: aws-lambda +title: Deploy on AWS Lambda +description: Prepare your crawler to run on AWS Lambda. 
+--- + +import ApiLink from '@site/src/components/ApiLink'; + +import CodeBlock from '@theme/CodeBlock'; + +import BeautifulSoupCrawlerLambda from '!!raw-loader!./code_examples/aws/beautifulsoup_crawler_lambda.py'; +import PlaywrightCrawlerLambda from '!!raw-loader!./code_examples/aws/playwright_crawler_lambda.py'; +import PlaywrightCrawlerDockerfile from '!!raw-loader!./code_examples/aws/playwright_dockerfile'; + +[AWS Lambda](https://docs.aws.amazon.com/lambda/latest/dg/welcome.html) is a serverless compute service that lets you run code without provisioning or managing servers. This guide covers deploying `BeautifulSoupCrawler` and `PlaywrightCrawler`. + +The code examples are based on the [BeautifulSoupCrawler example](../examples/beautifulsoup-crawler). + +## BeautifulSoupCrawler on AWS Lambda + +For simple crawlers that don't require browser rendering, you can deploy using a ZIP archive. + +### Updating the code + +When instantiating a crawler, use `MemoryStorageClient`. By default, Crawlee uses file-based storage, but the Lambda filesystem is read-only (except for `/tmp`). Using `MemoryStorageClient` tells Crawlee to use in-memory storage instead. + +Wrap the crawler logic in a `lambda_handler` function. This is the entry point that AWS will execute. + +:::important + +Make sure to always instantiate a new crawler for every Lambda invocation. AWS keeps the environment running for some time after the first execution (to reduce cold-start times), so subsequent calls may access an already-used crawler instance. + +**TLDR: Keep your Lambda stateless.** + +::: + +Finally, return the scraped data from the Lambda when the crawler run ends. + + + {BeautifulSoupCrawlerLambda} + + +### Preparing the environment + +Lambda requires all dependencies to be included in the deployment package. Create a virtual environment and install dependencies: + +```bash +python3.14 -m venv .venv +source .venv/bin/activate +pip install 'crawlee[beautifulsoup]' 'boto3' 'aws-lambda-powertools' +``` + +[`boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) is the AWS SDK for Python. Including it in your dependencies is recommended to avoid version misalignment issues with the Lambda runtime. + +### Creating the ZIP archive + +Create a ZIP archive from your project, including dependencies from the virtual environment: + +```bash +cd .venv/lib/python3.14/site-packages +zip -r ../../../../package.zip . +cd ../../../../ +zip package.zip lambda_function.py +``` + +:::note Large dependencies? + +AWS has a limit of 50 MB for direct upload and 250 MB for unzipped deployment package size. + +A better way to manage dependencies is by using Lambda Layers. With Layers, you can share files between multiple Lambda functions and keep the actual code as slim as possible. + +To create a Lambda Layer: + +1. Create a `python/` folder and copy dependencies from `site-packages` into it +2. Create a zip archive: `zip -r layer.zip python/` +3. Create a new Lambda Layer from the archive (you may need to upload it to S3 first) +4. Attach the Layer to your Lambda function + +::: + +### Creating the Lambda function + +Create the Lambda function in the AWS Lambda Console: + +1. Navigate to `Lambda` in [AWS Management Console](https://aws.amazon.com/console/). +2. Click **Create function**. +3. Select **Author from scratch**. +4. Enter a **Function name**, for example `BeautifulSoupTest`. +5. Choose a **Python runtime** that matches the version used in your virtual environment (for example, Python 3.14). +6. 
Click **Create function** to finish. + +Once created, upload `package.zip` as the code source in the AWS Lambda Console using the "Upload from" button. + +In Lambda Runtime Settings, set the handler. Since the file is named `lambda_function.py` and the function is `lambda_handler`, you can use the default value `lambda_function.lambda_handler`. + +:::tip Configuration + +In the Configuration tab, you can adjust: + +- **Memory**: Memory size can greatly affect execution speed. A minimum of 256-512 MB is recommended. +- **Timeout**: Set according to the size of the website you are scraping (1 minute for the example code). +- **Ephemeral storage**: Size of the `/tmp` directory. + +See the [official documentation](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) to learn how performance and cost scale with memory. + +::: + +After the Lambda deploys, you can test it by clicking the "Test" button. The event contents don't matter for a basic test, but you can parameterize your crawler by parsing the event object that AWS passes as the first argument to the handler. + +## PlaywrightCrawler on AWS Lambda + +For crawlers that require browser rendering, you need to deploy using Docker container images because Playwright and browser binaries exceed Lambda's ZIP deployment size limits. + +### Updating the code + +As with `BeautifulSoupCrawler`, use `MemoryStorageClient` and wrap the logic in a `lambda_handler` function. Additionally, configure `browser_launch_options` with flags optimized for serverless environments. These flags disable sandboxing and GPU features that aren't available in Lambda's containerized runtime. + + + {PlaywrightCrawlerLambda} + + +### Installing and configuring AWS CLI + +Install AWS CLI following the [official documentation](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) according to your operating system. + +Authenticate by running: + +```bash +aws login +``` + +### Preparing the project + +Initialize the project by running `uvx 'crawlee[cli]' create`. + +Or use a single command if you don't need interactive mode: + +```bash +uvx 'crawlee[cli]' create aws_playwright --crawler-type playwright --http-client impit --package-manager uv --no-apify --start-url 'https://crawlee.dev' --install +``` + +Add the following dependencies: + +```bash +uv add awslambdaric aws-lambda-powertools boto3 +``` + +[`boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) is the AWS SDK for Python. Use it if your function integrates with any other AWS services. + +The project is created with a Dockerfile that needs to be modified for AWS Lambda by adding `ENTRYPOINT` and updating `CMD`: + + + {PlaywrightCrawlerDockerfile} + + +### Building and pushing the Docker image + +Create a repository `lambda/aws-playwright` in [Amazon Elastic Container Registry](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html) in the same region where your Lambda functions will run. To learn more, refer to the [official documentation](https://docs.aws.amazon.com/AmazonECR/latest/userguide/getting-started-cli.html). + +Navigate to the created repository and click the "View push commands" button. This will open a window with console commands for uploading the Docker image to your repository. Execute them. + +Example: +```bash +aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin {user-specific-data} +docker build --platform linux/amd64 --provenance=false -t lambda/aws-playwright . 
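# {user-specific-data} stands for your account-specific ECR registry URI (account ID, region and repository path) shown in the console's push commands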
+docker tag lambda/aws-playwright:latest {user-specific-data}/lambda/aws-playwright:latest +docker push {user-specific-data}/lambda/aws-playwright:latest +``` + +### Creating the Lambda function + +1. Navigate to `Lambda` in [AWS Management Console](https://aws.amazon.com/console/). +2. Click **Create function**. +3. Select **Container image**. +4. Browse and select your ECR image. +5. Click **Create function** to finish. + +:::tip Configuration + +In the Configuration tab, you can adjust resources. Playwright crawlers require more resources than BeautifulSoup crawlers: + +- **Memory**: Minimum 1024 MB recommended. Browser operations are memory-intensive, so 2048 MB or more may be needed for complex pages. +- **Timeout**: Set according to crawl size. Browser startup adds overhead, so allow at least 5 minutes even for simple crawls. +- **Ephemeral storage**: Default 512 MB is usually sufficient unless downloading large files. + +See the [official documentation](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) to learn how performance and cost scale with memory. + +::: + +After the Lambda deploys, click the "Test" button to invoke it. The event contents don't matter for a basic test, but you can parameterize your crawler by parsing the event object that AWS passes as the first argument to the handler. diff --git a/docs/deployment/aws_lambda_beautifulsoup_crawler.mdx b/docs/deployment/aws_lambda_beautifulsoup_crawler.mdx deleted file mode 100644 index 4b460aef0e..0000000000 --- a/docs/deployment/aws_lambda_beautifulsoup_crawler.mdx +++ /dev/null @@ -1,104 +0,0 @@ ---- -id: aws-lambda-beautifulsoup -title: BeautifulSoup crawler on AWS Lambda -description: Prepare your BeautifulSoupCrawler to run in Lambda functions on Amazon Web Services. ---- - -import ApiLink from '@site/src/components/ApiLink'; - -import CodeBlock from '@theme/CodeBlock'; - -import BeautifulSoupCrawlerLambda from '!!raw-loader!./code_examples/aws/beautifulsoup_crawler_lambda.py'; - -[AWS Lambda](https://docs.aws.amazon.com/lambda/latest/dg/welcome.html) is a serverless compute service that lets you run code without provisioning or managing servers. It is well suited for deploying simple crawlers that don't require browser rendering. For simple projects, you can deploy using a ZIP archive. - -## Updating the code - -For the project foundation, use `BeautifulSoupCrawler` as described in this [example](../examples/beautifulsoup-crawler). - -When instantiating a crawler, use `MemoryStorageClient`. By default, Crawlee uses file-based storage, but the Lambda filesystem is read-only (except for `/tmp`). Using `MemoryStorageClient` tells Crawlee to use in-memory storage instead. - -Wrap the crawler logic in a `lambda_handler` function. This is the entry point that AWS will execute. - -:::important - -Make sure to always instantiate a new crawler for every Lambda invocation. AWS keeps the environment running for some time after the first execution (to reduce cold-start times), so subsequent calls may access an already-used crawler instance. - -**TLDR: Keep your Lambda stateless.** - -::: - -Finally, return the scraped data from the Lambda when the crawler run ends. - - - {BeautifulSoupCrawlerLambda} - - -## Deploying the project - -### Preparing the environment - -Lambda requires all dependencies to be included in the deployment package. 
Create a virtual environment and install dependencies: - -```bash -python3.14 -m venv .venv -source .venv/bin/activate -pip install 'crawlee[beautifulsoup]' 'boto3' 'aws-lambda-powertools' -``` - -[`boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) is the AWS SDK for Python. Including it in your dependencies is recommended to avoid version misalignment issues with the Lambda runtime. - -### Creating the ZIP archive - -Create a ZIP archive from your project, including dependencies from the virtual environment: - -```bash -cd .venv/lib/python3.14/site-packages -zip -r ../../../../package.zip . -cd ../../../../ -zip package.zip lambda_function.py -``` - -:::note Large dependencies? - -AWS has a limit of 50MB for direct upload and 250MB for unzipped deployment package size. - -A better way to manage dependencies is by using Lambda Layers. With Layers, you can share files between multiple Lambda functions and keep the actual code as slim as possible. - -To create a Lambda Layer: - -1. Create a `python/` folder and copy dependencies from `site-packages` into it -2. Create a zip archive: `zip -r layer.zip python/` -3. Create a new Lambda Layer from the archive (you may need to upload it to S3 first) -4. Attach the Layer to your Lambda function - -::: - -### Create Lambda function - -Create the Lambda function in the AWS Lambda Console: - -1. Navigate to `Lambda` in [AWS Management Console](https://aws.amazon.com/console/). -2. Click **Create function**. -3. Select **Author from scratch**. -4. Enter a **Function name**, for example `BeautifulSoupTest`. -5. Choose a **Python runtime** that matches the version used in your virtual environment (for example, Python 3.14). -6. Click **Create function** to finish. - -Once created, upload `package.zip` as the code source in the AWS Lambda Console using the "Upload from" button. - -In Lambda Runtime Settings, set the handler. Since the file is named `lambda_function.py` and the function is `lambda_handler`, you can use the default value `lambda_function.lambda_handler`. - -:::tip Configuration - -In the Configuration tab, you can adjust: - -- **Memory**: Memory size can greatly affect execution speed. A minimum of 256-512 MB is recommended. -- **Timeout**: Set according to the size of the website you are scraping (1 minute for code in example). -- **Ephemeral storage**: Size of the `/tmp` directory. - -See the [official documentation](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) to learn how performance and cost scale with memory. - -::: - -After the Lambda deploys, you can test it by clicking the "Test" button. The event contents don't matter for a basic test, but you can parameterize your crawler by parsing the event object that AWS passes as the first argument to the handler. diff --git a/docs/deployment/aws_lambda_playwright_crawler.mdx b/docs/deployment/aws_lambda_playwright_crawler.mdx deleted file mode 100644 index 279d3d8de1..0000000000 --- a/docs/deployment/aws_lambda_playwright_crawler.mdx +++ /dev/null @@ -1,111 +0,0 @@ ---- -id: aws-lambda-playwright -title: Playwright crawler on AWS Lambda -description: Prepare your PlaywrightCrawler to run in Lambda functions on Amazon Web Services. 
---- - -import ApiLink from '@site/src/components/ApiLink'; - -import CodeBlock from '@theme/CodeBlock'; - -import PlaywrightCrawlerLambda from '!!raw-loader!./code_examples/aws/playwright_crawler_lambda.py'; -import PlaywrightCrawlerDockerfile from '!!raw-loader!./code_examples/aws/playwright_dockerfile'; - -[AWS Lambda](https://docs.aws.amazon.com/lambda/latest/dg/welcome.html) is a serverless compute service that runs code without provisioning or managing servers. For crawlers that require browser rendering, you need to deploy using Docker container images because Playwright and browser binaries exceed Lambda's ZIP deployment size limits. - -## Updating the code - -For the project foundation, use `BeautifulSoupCrawler` as described in this [example](../examples/beautifulsoup-crawler). We will update it to work with `PlaywrightCrawler`. - -When instantiating a crawler, use `MemoryStorageClient`. By default, Crawlee uses file-based storage, but the Lambda filesystem is read-only (except for `/tmp`). Using `MemoryStorageClient` tells Crawlee to use in-memory storage instead. Replace `BeautifulSoupCrawler` with `PlaywrightCrawler` and configure `browser_launch_options` with flags optimized for serverless environments. These flags disable sandboxing and GPU features that aren't available in Lambda's containerized runtime. - -Wrap the crawler logic in a `lambda_handler` function. This is the entry point that AWS will execute. - -:::important - -Make sure to always instantiate a new crawler for every Lambda invocation. AWS keeps the environment running for some time after the first execution (to reduce cold-start times), so subsequent calls may access an already-used crawler instance. - -**TLDR: Keep your Lambda stateless.** - -::: - -Finally, return the scraped data from the Lambda when the crawler run ends. - - - {PlaywrightCrawlerLambda} - - -## Deploying the project - -### Installing and configuring AWS CLI - -Install AWS CLI following the [official documentation](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) according to your operating system. - -Authenticate by running: - -```bash -aws login -``` - -### Preparing the project - -Initialize the project by running `uvx 'crawlee[cli]' create`. - -Or use a single command if you don't need interactive mode: - -```bash -uvx 'crawlee[cli]' create aws_playwright --crawler-type playwright --http-client impit --package-manager uv --no-apify --start-url 'https://crawlee.dev' --install -``` - -Add additional dependencies: - -```bash -uv add awslambdaric aws-lambda-powertools boto3 -``` - -[`boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) is the AWS SDK for Python. Use it if your function integrates with any other AWS services. - - -The project is created with a Dockerfile that needs to be modified for AWS Lambda by adding `ENTRYPOINT` and updating `CMD`: - - - {PlaywrightCrawlerDockerfile} - - -### Building and pushing the Docker image - -Create a repository `lambda/aws-playwright` in [Amazon Elastic Container Registry](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html) in the same region where your Lambda functions will run. To learn more, refer to the [official documentation](https://docs.aws.amazon.com/AmazonECR/latest/userguide/getting-started-cli.html). - -Navigate to the created repository and click the "View push commands" button. This will open a window with console commands for uploading the Docker image to your repository. Execute them. 
- -Example: -```bash -aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin {user-specific-data} -docker build --platform linux/amd64 --provenance=false -t lambda/aws-playwright . -docker tag lambda/aws-playwright:latest {user-specific-data}/lambda/aws-playwright:latest -docker push {user-specific-data}/lambda/aws-playwright:latest -``` - -### Creating the Lambda function - -1. Navigate to `Lambda` in [AWS Management Console](https://aws.amazon.com/console/). -2. Click **Create function**. -3. Select **Container image**. -4. Browse and select your ECR image. -5. Click **Create function** to finish. - -:::tip Configuration - -In the Configuration tab, you can adjust resources. Playwright crawlers require more resources than HTTP-based crawlers: - -- **Memory**: Minimum 1024 MB recommended. Browser operations are memory-intensive, so 2048 MB or more may be needed for complex pages. -- **Timeout**: Set according to crawl size. Browser startup adds overhead, so allow at least 5 minutes even for simple crawls. -- **Ephemeral storage**: Default 512 MB is usually sufficient unless downloading large files. - -See the [official documentation](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) to learn how performance and cost scale with memory. - -::: - -## Testing the function - -After the Lambda deploys, click the "Test" button to invoke it. The event contents don't matter for a basic test, but you can parameterize your crawler by parsing the event object that AWS passes as the first argument to the handler. diff --git a/website/sidebars.js b/website/sidebars.js index 9aa98d18f0..a2115c4ff4 100644 --- a/website/sidebars.js +++ b/website/sidebars.js @@ -55,12 +55,9 @@ module.exports = { label: 'Deploy on Apify', }, { - type: 'category', - label: 'Deploy on AWS', - items: [ - 'deployment/aws-lambda-beautifulsoup', - 'deployment/aws-lambda-playwright', - ], + type: 'doc', + id: 'deployment/aws-lambda', + label: 'Deploy on AWS Lambda' }, { type: 'category', From f81346b6e8722b50aaea4739f8ee05db1e95e2c2 Mon Sep 17 00:00:00 2001 From: Max Bohomolov <34358312+Mantisus@users.noreply.github.com> Date: Wed, 28 Jan 2026 02:45:45 +0200 Subject: [PATCH 5/5] Update docs/deployment/aws_lambda.mdx Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- docs/deployment/aws_lambda.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/deployment/aws_lambda.mdx b/docs/deployment/aws_lambda.mdx index 664b82d91b..cbe49c9ff0 100644 --- a/docs/deployment/aws_lambda.mdx +++ b/docs/deployment/aws_lambda.mdx @@ -30,7 +30,7 @@ Wrap the crawler logic in a `lambda_handler` function. This is the entry point t Make sure to always instantiate a new crawler for every Lambda invocation. AWS keeps the environment running for some time after the first execution (to reduce cold-start times), so subsequent calls may access an already-used crawler instance. -**TLDR: Keep your Lambda stateless.** +**TL;DR: Keep your Lambda stateless.** :::
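As an alternative to the console's "Test" button, the function can also be invoked programmatically with `boto3`, which is already listed among the dependencies. A minimal sketch, assuming the function name `BeautifulSoupTest` used earlier in this guide (substitute your own):

```python
import json

import boto3
from botocore.config import Config

# Raise the read timeout: a synchronous invocation waits for the whole crawl to finish.
client = boto3.client('lambda', config=Config(read_timeout=300))

response = client.invoke(
    FunctionName='BeautifulSoupTest',  # the example name from this guide; use your own function name
    Payload=json.dumps({}),  # the example handlers ignore the event, so an empty payload is enough
)

# The response payload is the {'statusCode': ..., 'body': ...} dict returned by the handler.
print(json.loads(response['Payload'].read()))
```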