JamesCarter526/Solve-AWS-WAF-CAPTCHA

Solving AWS WAF CAPTCHA for Web Scraping

1. Introduction

As developers, we often encounter AWS Web Application Firewall (WAF) CAPTCHA challenges during web scraping tasks. This guide explores methods for bypassing AWS WAF CAPTCHA, focusing on API-based solutions such as CapSolver that streamline the scraping process.

2. Why We Encounter AWS WAF CAPTCHA

AWS WAF CAPTCHAs are part of Amazon’s layered defense system, built to protect web applications against bots and abuse.
A CAPTCHA is triggered when AWS WAF detects patterns such as:

  • High request frequency from a single IP address
  • Identical request headers or user-agent strings
  • Missing browser behaviors like JavaScript execution or scrolling

For developers running scrapers or automation pipelines, these signals often lead AWS to issue a CAPTCHA challenge page requiring human verification before proceeding.

3. Bypass AWS WAF CAPTCHA Using CapSolver

One of the most direct and reliable approaches to solving AWS WAF CAPTCHA is using specialized CAPTCHA-solving APIs.
CapSolver provides a dedicated service for solving AWS WAF challenges automatically. The workflow has three steps:

  • Extract CAPTCHA parameters (iv, key, context, challengeJS) from the target page.
  • Send them to CapSolver’s endpoint.
  • Receive a valid aws-waf-token cookie that allows your scraper to continue requests.

CapSolver handles AWS CAPTCHA variants dynamically and keeps its solver updated to adapt to new formats. This makes it a practical option for developers managing large-scale automation without frequent human input.

Code Example (Python)

import requests
import re
import time

# Your CapSolver API Key
CAPSOLVER_API_KEY = "YOUR_CAPSOLVER_API_KEY"
CAPSOLVER_CREATE_TASK_ENDPOINT = "https://api.capsolver.com/createTask"
CAPSOLVER_GET_TASK_RESULT_ENDPOINT = "https://api.capsolver.com/getTaskResult"

# The URL of the website protected by AWS WAF
WEBSITE_URL = "https://efw47fpad9.execute-api.us-east-1.amazonaws.com/latest" # Example URL

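def solve_aws_waf_captcha(website_url, capsolver_api_key):
    """Return the aws-waf-token value for website_url, or None on failure."""
    client = requests.Session()
    # Fetch the challenge page; the timeout keeps the call from hanging indefinitely
    response = client.get(website_url, timeout=30)
    script_content = response.text

    # Pull the challenge parameters out of the page source
    key_match = re.search(r'"key":"([^"]+)"', script_content)
    iv_match = re.search(r'"iv":"([^"]+)"', script_content)
    context_match = re.search(r'"context":"([^"]+)"', script_content)
    # Takes the first <script src="..."> tag as the challenge script URL
    jschallenge_match = re.search(r'<script.*?src="(.*?)".*?></script>', script_content)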

    key = key_match.group(1) if key_match else None
    iv = iv_match.group(1) if iv_match else None
    context = context_match.group(1) if context_match else None
    jschallenge = jschallenge_match.group(1) if jschallenge_match else None

    if not all([key, iv, context, jschallenge]):
        print("Error: AWS WAF parameters not found in the page content.")
        return None

    task_payload = {
        "clientKey": capsolver_api_key,
        "task": {
            "type": "AntiAwsWafTaskProxyLess",
            "websiteURL": website_url,
            "awsKey": key,
            "awsIv": iv,
            "awsContext": context,
            "awsChallengeJS": jschallenge
        }
    }

    # Submit the extracted parameters as a new solving task
    create_task_response = client.post(CAPSOLVER_CREATE_TASK_ENDPOINT, json=task_payload, timeout=30).json()
    task_id = create_task_response.get('taskId')

    if not task_id:
        print(f"Error creating CapSolver task: {create_task_response.get('errorId')}, {create_task_response.get('errorCode')}")
        return None

    print(f"CapSolver task created with ID: {task_id}")

    # Poll for task result
    for _ in range(10): # Try up to 10 times with 5-second intervals
        time.sleep(5)
        get_result_payload = {"clientKey": capsolver_api_key, "taskId": task_id}
        get_result_response = client.post(CAPSOLVER_GET_TASK_RESULT_ENDPOINT, json=get_result_payload, timeout=30).json()

        if get_result_response.get('status') == 'ready':
            aws_waf_token_cookie = get_result_response['solution']['cookie']
            print("CapSolver successfully solved the CAPTCHA.")
            return aws_waf_token_cookie
        elif get_result_response.get('status') == 'failed':
            print(f"CapSolver task failed: {get_result_response.get('errorId')}, {get_result_response.get('errorCode')}")
            return None

    print("CapSolver task timed out.")
    return None

# Example usage:
# aws_waf_token = solve_aws_waf_captcha(WEBSITE_URL, CAPSOLVER_API_KEY)
# if aws_waf_token:
#     print(f"Received AWS WAF Token: {aws_waf_token}")
#     # Use the token in your subsequent requests
#     final_response = requests.get(WEBSITE_URL, cookies={"aws-waf-token": aws_waf_token})
#     print(final_response.text)

Once you obtain the token, attach it to subsequent requests as a session cookie to maintain uninterrupted scraping.


4. Use Cases

Integrating an automated AWS CAPTCHA solver like CapSolver ensures uninterrupted and reliable data collection across a variety of development and analytics tasks.

Reliable Data Feeds for Machine Learning: Maintain consistent training datasets by automatically bypassing CAPTCHA challenges. Ensure temporal continuity and improve model accuracy without manual intervention.

Continuous Market Intelligence: Monitor competitor pricing, product availability, and promotions in real time. Prevent interruptions caused by AWS protections and maintain complete market visibility.

Consistent Business Intelligence Reporting: Keep ETL pipelines and dashboards updated with accurate data. Avoid gaps and broken metrics caused by CAPTCHA blocks.

Scalable SEO and Marketing Analytics: Collect keyword rankings, ad placements, and content metrics efficiently. Scale scraping operations without losing coverage due to AWS WAF protections.

Public Data and Research Collection: Preserve reproducible datasets for academic or policy research. Eliminate manual CAPTCHA resolution and maintain regular updates across large-scale data sources.

5. Complementary Techniques to Handle AWS WAF

Proxy Rotation and User-Agent Management

AWS WAF flags repetitive patterns from a single IP or user-agent. Rotating proxies and user-agent strings helps automated traffic look more like organic user behavior.
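
As a rough illustration of the rotation idea, the sketch below cycles through a proxy list in round-robin order and picks a user-agent string at random for each request. The proxy URLs, the user-agent strings, and the `request_settings` helper are all hypothetical placeholders, not part of any library; the only assumption about `requests` is its standard `proxies` and `headers` keyword arguments.

```python
import itertools
import random

# Hypothetical placeholder values; substitute proxies you are entitled to use.
PROXIES = [
    "http://proxy-a.example.com:8080",
    "http://proxy-b.example.com:8080",
]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]


def request_settings(proxies, user_agents, rng=random):
    """Yield keyword arguments for requests.get(), one set per request."""
    proxy_cycle = itertools.cycle(proxies)  # round-robin over the proxy list
    while True:
        proxy = next(proxy_cycle)
        yield {
            "proxies": {"http": proxy, "https": proxy},
            "headers": {"User-Agent": rng.choice(user_agents)},
        }


settings = request_settings(PROXIES, USER_AGENTS)
first = next(settings)
second = next(settings)
third = next(settings)
```

Each yielded dictionary could then be unpacked into a call such as `requests.get(url, timeout=30, **next(settings))`.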

Simulating Human Behavior

Use browser automation tools (e.g., Selenium, Playwright) configured with:

  • Random mouse movements
  • Delays between clicks
  • Variable scrolling patterns

These small changes mimic human activity, reducing the likelihood of detection.

Cookie and Session Management

After passing a CAPTCHA, save and reuse cookies for persistent sessions. This prevents repeated CAPTCHA triggers on every new request.
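
One possible way to persist cookies between runs is sketched below using only the standard library. The helper names `save_cookies` and `load_cookies`, the JSON file layout, and the `max_age_seconds` cutoff are assumptions made for illustration; the guide does not state how long an aws-waf-token stays valid, so the cutoff has to be chosen by observation.

```python
import json
import time
from pathlib import Path


def save_cookies(path, cookies):
    """Write a cookie dict to disk together with the time it was saved."""
    payload = {"saved_at": time.time(), "cookies": cookies}
    Path(path).write_text(json.dumps(payload))


def load_cookies(path, max_age_seconds):
    """Return the saved cookie dict, or None if missing or too old."""
    file = Path(path)
    if not file.exists():
        return None
    data = json.loads(file.read_text())
    if time.time() - data["saved_at"] > max_age_seconds:
        return None  # treat stale cookies as absent so a new token is requested
    return data["cookies"]
```

A scraper could call `load_cookies` first, fall back to `solve_aws_waf_captcha` only when it returns `None`, and pass the resulting dict to `requests.get(..., cookies=...)`.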

Request Throttling

Throttle requests and introduce random delays. AWS WAF monitors activity rates, and consistent request intervals are a common red flag for bots.
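
A minimal sketch of randomized throttling follows. The function names and the default delay values are illustrative assumptions rather than recommended settings; suitable values depend on the target site's rate limits.

```python
import random
import time


def jittered_delay(base_seconds, jitter_seconds, rng=random):
    """Return base_seconds plus a random offset in [0, jitter_seconds]."""
    return base_seconds + rng.uniform(0, jitter_seconds)


def throttled(items, base_seconds=2.0, jitter_seconds=3.0, sleep=time.sleep):
    """Yield items one by one, pausing for a randomized delay between them."""
    for index, item in enumerate(items):
        if index > 0:  # no pause before the first item
            sleep(jittered_delay(base_seconds, jitter_seconds))
        yield item
```

Wrapping a URL list as `for url in throttled(urls): ...` spaces the requests out with irregular gaps. The `sleep` parameter is injectable so the pacing can be checked without actually waiting.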

HTTP Header Optimization

Match real browser headers (Accept-Language, Referer, Connection). Inconsistent or incomplete headers are often the easiest signal for AWS to block automated agents.

JavaScript Rendering and Fingerprinting Evasion

AWS WAF CAPTCHA relies on client-side JavaScript. Headless browsers that execute JavaScript, combined with modified fingerprint identifiers such as WebGL or screen resolution, can bypass this layer of defense.

6. Conclusion

Handling AWS WAF CAPTCHA effectively requires techniques like proxy rotation, user-agent rotation, session management, and human-like interaction. Automated CAPTCHA solvers, such as CapSolver, provide reliable token generation and integrate directly into scraping workflows. Using these methods helps maintain stable, uninterrupted data collection with minimal manual intervention.

About

A comprehensive guide to solving AWS challenges in web scraping projects. Python example code inside.
