feat: add Playwright connector for Amazon profile and order history#18

Open
letonchanh wants to merge 4 commits into main from feat/amazon-connector

Conversation

@letonchanh
Member

@letonchanh letonchanh commented Mar 3, 2026

Summary

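  • Adds a new Amazon connector that exports profile info (name, Prime status) and full order history (items, prices, dates, delivery status) using Playwright browser automation and DOM scraping. Email was dropped from the profile scope in a follow-up commit because it requires re-authentication that cannot be done headless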
  • Two-phase architecture: visible browser for manual login (handles CAPTCHA/2FA), then headless scraping of account settings page + paginated order history year-by-year
  • Adds data-connect as a playwright-runner search path in test-connector.cjs
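
The year-by-year pagination in the second bullet can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the connector's actual code: `fetchOrderPage` is a hypothetical callback standing in for whatever scrapes one page of results, and the page size of 10 is inferred from the test plan's ">10 orders" wording.

```javascript
// Hypothetical sketch: collect every order for each year by paging until
// a page comes back short. PAGE_SIZE = 10 is an assumption.
const PAGE_SIZE = 10;

async function collectOrders(years, fetchOrderPage) {
  const orders = [];
  for (const year of years) {
    let startIndex = 0;
    while (true) {
      // fetchOrderPage(year, startIndex) resolves to the orders on that page.
      const page = await fetchOrderPage(year, startIndex);
      orders.push(...page);
      if (page.length < PAGE_SIZE) break; // short page: last one for this year
      startIndex += PAGE_SIZE;
    }
  }
  return orders;
}
```

Stopping on a short page costs one extra request when a year has an exact multiple of 10 orders, but it avoids depending on a "next page" control whose markup may change.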

Design Decisions

Session Verification

Amazon aggressively caches cookies — the nav bar shows "Hello, Name" even with expired sessions. The connector uses a two-step verification: quick nav bar check → deep check by navigating to /your-orders/orders and detecting sign-in redirects.
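
The two-step check might look something like the sketch below. It assumes `page` behaves like a standard Playwright `Page` (the runner used by this connector may expose a different API), that the greeting lives under the hypothetical selector `#nav-link-accountList`, and that an expired session redirects to a path starting with `/ap/signin`.

```javascript
const ORDERS_URL = "https://www.amazon.com/your-orders/orders";

// Assumption: sign-in redirects land on a path beginning with /ap/signin.
function isSignInUrl(url) {
  return new URL(url).pathname.startsWith("/ap/signin");
}

async function isSessionValid(page) {
  // Step 1 (quick): no greeting element means the user is not logged in.
  // The selector is a hypothetical placeholder.
  const greeting = await page.$("#nav-link-accountList");
  if (!greeting) return false;

  // Step 2 (deep): the greeting can be stale, so load a protected page
  // and see whether we get bounced to sign-in.
  await page.goto(ORDERS_URL, { waitUntil: "domcontentloaded" });
  return !isSignInUrl(page.url());
}
```

The quick check only ever rules a session out; a positive result always has to be confirmed by the deep check.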

Per-Item Price Fetching

Amazon's order list page (/your-orders/orders) only shows the order total, not individual item prices. To get per-item prices, the connector fetches each order's detail page (/gp/your-account/order-details?orderID=xxx) after collecting the order list. This adds ~1.5s per order but provides complete price data.
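
As a rough illustration of the detail-page pass, the sketch below builds the detail URL from an order ID and visits orders one at a time. `fetchItems` is a hypothetical callback that loads a detail page and returns its items; the `https://www.amazon.com` origin is assumed, and the cost estimate simply multiplies the ~1.5s figure quoted above.

```javascript
const SECONDS_PER_ORDER = 1.5; // approximate figure from the description above

function orderDetailUrl(orderId) {
  const url = new URL("https://www.amazon.com/gp/your-account/order-details");
  url.searchParams.set("orderID", orderId);
  return url.toString();
}

// Rough extra runtime for the detail-page pass.
function estimateExtraSeconds(orderCount) {
  return orderCount * SECONDS_PER_ORDER;
}

// Visit detail pages sequentially and attach the items to each order.
// fetchItems(url) is hypothetical: it resolves to [{ name, price }, ...].
async function attachItemPrices(orders, fetchItems) {
  const result = [];
  for (const order of orders) {
    const items = await fetchItems(orderDetailUrl(order.orderId));
    result.push({ ...order, items });
  }
  return result;
}
```

At this rate an account with 200 orders would spend about five extra minutes on detail pages.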

Delivery Status Extraction

Uses innerText instead of textContent to match delivery status patterns, because textContent includes embedded <script> tag content (e.g., JS return; keyword) that false-matches the "Return" status pattern.
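
The false match can be reproduced without a browser by comparing the two kinds of string. In this sketch the status pattern and both sample strings are made up for illustration; the only point is that `textContent` includes script source while `innerText` returns just the rendered text.

```javascript
// Hypothetical, case-insensitive status pattern.
const STATUS_PATTERN = /\b(delivered|arriving|shipped|return|refunded|cancelled)\b/i;

function matchStatus(text) {
  const m = text.match(STATUS_PATTERN);
  return m ? m[1] : null;
}

// What textContent might yield for a card containing an inline <script>.
const textContentLike = "function track() { if (!ready) return; } Delivered March 1";
// What innerText might yield for the same card: rendered text only.
const innerTextLike = "Delivered March 1";

console.log(matchStatus(textContentLike)); // prints "return"
console.log(matchStatus(innerTextLike));   // prints "Delivered"
```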

DOM Scraping Approach

Amazon uses server-rendered HTML with no clean JSON APIs and aggressive A/B testing of DOM structure. The connector uses text-based regex matching on card content (order ID pattern ###-#######-#######, date pattern, price pattern) rather than brittle CSS selectors for metadata extraction.
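
A minimal sketch of the text-based extraction, assuming US-style dates ("March 3, 2026") and dollar prices. The regexes and the `parseOrderCard` helper are illustrative, not the connector's actual patterns; only the order ID shape (###-#######-#######) comes from the description above.

```javascript
const ORDER_ID = /\b\d{3}-\d{7}-\d{7}\b/;
// Assumed date format: full English month name, day, four-digit year.
const ORDER_DATE =
  /\b(?:January|February|March|April|May|June|July|August|September|October|November|December) \d{1,2}, \d{4}\b/;
// Assumed price format: dollar sign, digits with optional commas, two decimals.
const PRICE = /\$\d[\d,]*\.\d{2}/;

// Pull metadata out of a card's visible text; missing fields come back null.
function parseOrderCard(text) {
  const pick = (re) => {
    const m = text.match(re);
    return m ? m[0] : null;
  };
  return {
    orderId: pick(ORDER_ID),
    orderDate: pick(ORDER_DATE),
    orderTotal: pick(PRICE), // first price found in the card text
  };
}

const card = "ORDER PLACED March 3, 2026 TOTAL $1,234.56 ORDER # 112-1234567-7654321";
console.log(parseOrderCard(card));
```

Because the patterns key on the shape of the text and not on class names or element nesting, they should keep working when an A/B test moves the same fields to different markup.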

Files Changed

File | Description
amazon/amazon-playwright.js | Main connector script (~470 lines)
amazon/amazon-playwright.json | Connector metadata (scopes, connect URL, selectors)
schemas/amazon.profile.json | Schema for profile scope
schemas/amazon.orders.json | Schema for orders scope
registry.json | Added Amazon entry with SHA-256 checksums
test-connector.cjs | Added data-connect runner path

Test plan

  • Run node test-connector.cjs ./amazon/amazon-playwright.js --headed and log in manually
  • Verify profile extraction: name and Prime status populated
  • Verify order list: orders across multiple years with correct orderId, orderDate, orderTotal, deliveryStatus
  • Verify per-item prices: detail page fetching populates individual item prices
  • Verify delivery status does not contain JS artifacts like "return;"
  • Verify pagination: connector fetches all pages for years with >10 orders
  • Verify result matches amazon.profile and amazon.orders schemas
  • Test with expired session: confirm browser is shown for re-login

🤖 Generated with Claude Code

letonchanh and others added 3 commits March 3, 2026 12:15
Add Amazon connector that exports profile info (name, email, Prime status)
and full order history with per-item prices via DOM scraping.

Two-phase architecture:
- Phase 1 (visible browser): Manual login with CAPTCHA/2FA support
- Phase 2 (headless): Scrape account settings + paginated order history

Key design decisions:
- Uses two-step session verification (nav bar check + orders page redirect)
  to handle stale cookies that make the nav bar show "Hello, Name" even
  with expired sessions
- Uses innerText instead of textContent for delivery status to avoid
  matching JS keywords from embedded <script> tags
- Fetches each order's detail page to get per-item prices, since the
  order list page only shows order totals
- Year-by-year extraction via the time filter dropdown with pagination

Also adds data-connect as a playwright-runner search path in test-connector.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Email requires re-authentication on Amazon's account pages which
cannot be done in headless mode. Remove email field from profile
scope entirely rather than returning empty values.

Fix isPrime detection: old selectors matched promotional "Try Prime"
elements present for non-members. Now checks for actual membership
indicators like "Your Prime" or "Prime Benefits" text.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

After goHeadless() the browser starts on a blank page with no DOM,
so the nav bar greeting selector found nothing. Add page.goto to
load amazon.com first so the nav bar is available.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@letonchanh letonchanh requested a review from volod-vana March 3, 2026 18:57
@letonchanh
Member Author

@copilot Review the PR

Copilot AI commented Mar 4, 2026

@letonchanh I've opened a new pull request, #19, to work on those changes. Once the pull request is ready, I'll request review from you.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2 participants