A production-ready TypeScript scraping service and web UI for extracting Google Play Store developer metadata at scale.
> [!NOTE]
> This repository contains a full-stack scraper application (API + UI). It is not a standalone logging package, but it includes structured request logging and retry diagnostics in the server runtime.
- Title and Description
- Features
- Tech Stack & Architecture
- Getting Started
- Testing
- Deployment
- Usage
- Configuration
- License
- Support the Project
## Features

- Batch URL processing from a multiline input box (one Google Play URL per line).
- Server-side scraping endpoint at `POST /api/scrape` with schema validation via `zod`.
- Automatic retry with exponential backoff to improve reliability on transient network/parsing failures.
- Per-request timeout enforcement using `AbortController` to avoid hung scraping jobs.
- Extraction pipeline for:
  - App title
  - Studio/developer name
  - Support email
  - Developer email (from contact section)
  - External website URL
  - Download count (`1K+`, `1M+`, `500M+`, etc.)
- Defensive fallback parsing for website/email/download count to maximize useful yield.
- Front-end progress tracking for long URL batches.
- Result table UX with copy-to-clipboard JSON and JSON file export.
- Shared API contract definitions in `shared/` to keep client/server payloads consistent.
- Vite development server integration and production static serving via Express.
- Build pipeline that bundles the server with `esbuild` and the client with Vite.
> [!TIP]
> For large batches, keep requests sequential (the current default) to reduce Play Store throttling risk.
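The sequential batching pattern can be sketched as a small helper that processes one URL at a time. The function and callback names here are illustrative stand-ins, not the repo's actual client code:

```typescript
// Sketch of sequential batch processing: one request at a time to limit
// Play Store throttling risk. `scrape` is any async function (e.g. a POST
// to /api/scrape); all names here are illustrative, not the repo's code.
type ScrapeResult<T> = { url: string; ok: boolean; data?: T; error?: string };

async function scrapeSequentially<T>(
  urls: string[],
  scrape: (url: string) => Promise<T>,
  onProgress?: (done: number, total: number) => void,
): Promise<ScrapeResult<T>[]> {
  const results: ScrapeResult<T>[] = [];
  for (const url of urls) {
    try {
      results.push({ url, ok: true, data: await scrape(url) });
    } catch (err) {
      // A single bad URL should not abort the whole batch.
      results.push({ url, ok: false, error: String(err) });
    }
    onProgress?.(results.length, urls.length);
  }
  return results;
}
```

Because each `await` completes before the next URL starts, at most one request is in flight at any moment, which is what keeps the batch gentle on the remote host.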
## Tech Stack & Architecture

- Language: TypeScript (`strict` typing patterns across server/client).
- Backend: Express + native `fetch` + `cheerio`.
- Frontend: React + Wouter + TanStack Query + Tailwind CSS.
- Validation & Contracts: `zod` shared between client/server.
- Build Tooling: Vite (client), `esbuild` (server), `tsx` (runtime/dev scripts).
```
.
├── client/
│   ├── src/
│   │   ├── components/
│   │   ├── pages/
│   │   │   ├── home.tsx
│   │   │   └── not-found.tsx
│   │   ├── lib/
│   │   ├── hooks/
│   │   ├── App.tsx
│   │   └── main.tsx
│   ├── index.html
│   └── requirements.md
├── server/
│   ├── index.ts
│   ├── routes.ts
│   ├── static.ts
│   └── vite.ts
├── shared/
│   ├── routes.ts
│   └── schema.ts
├── script/
│   └── build.ts
├── package.json
├── render.yaml
└── LICENSE
```
- **Shared Contracts First:** Request/response schemas are colocated in `shared/` and consumed by both app tiers.
- **Retry + Timeout Composition:** Scrape operations are wrapped in retries while each attempt has a hard timeout.
- **Resilient Parsing Strategy:** Primary selectors are paired with fallback heuristics for when Google Play markup changes.
- **Operational Logging:** Server request middleware logs API execution duration and JSON payload snapshots for diagnostics.
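The retry-plus-timeout composition can be sketched as below. A `retryWithBackoff(...)` helper does exist in `server/routes.ts`, but this signature and the delay math are a hedged reconstruction matching the documented defaults, not the repo's exact code:

```typescript
// Sketch of composing retries with a hard per-attempt timeout.
// The signal would be passed to fetch() so an overlong attempt is aborted
// rather than left hanging. Signature is illustrative, not the repo's exact API.
async function retryWithBackoff<T>(
  operation: (signal: AbortSignal) => Promise<T>,
  retries = 2,          // 2 retries => 3 total attempts
  initialDelayMs = 1000,
  timeoutMs = 60_000,   // hard cap per attempt
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      return await operation(controller.signal);
    } catch (err) {
      lastError = err;
      if (attempt < retries) {
        // Exponential backoff: 1000 ms, 2000 ms, ...
        await new Promise((r) => setTimeout(r, initialDelayMs * 2 ** attempt));
      }
    } finally {
      clearTimeout(timer);
    }
  }
  throw lastError;
}
```

The key property is that the timeout bounds each *attempt*, while the retry loop bounds the *total* number of attempts, so a slow response and a transient failure are handled independently.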
```mermaid
flowchart TD
    A[User inputs Play Store URLs] --> B[React UI batches URLs sequentially]
    B --> C[POST /api/scrape]
    C --> D[Zod input validation]
    D --> E[retryWithBackoff wrapper]
    E --> F[fetch Play Store HTML with timeout]
    F --> G[Cheerio parsing + regex fallbacks]
    G --> H[Zod-shaped JSON response]
    H --> I[UI table update + progress update]
    I --> J[Copy JSON or export file]
```
> [!IMPORTANT]
> Google Play DOM structures change periodically; selector drift is expected and should be handled by updating parsing selectors and fallback rules.
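When primary CSS selectors drift, regex fallbacks over the raw HTML can still recover fields. The patterns below illustrate that fallback style; they are examples, not the repo's actual regexes:

```typescript
// Illustrative fallback extraction over raw Play Store HTML for when CSS
// selectors break. These patterns are examples, not the repo's actual ones.
function fallbackEmail(html: string): string | null {
  const m = html.match(/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/);
  return m ? m[0] : null;
}

function fallbackDownloadCount(html: string): string | null {
  // Matches strings like "1K+", "1M+", "500M+", "1B+".
  const m = html.match(/\b\d+(?:\.\d+)?[KMB]\+/);
  return m ? m[0] : null;
}
```

Since these scan the whole document rather than a specific node, they trade precision for resilience, which is why they run only after the primary selectors fail.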
## Getting Started

Prerequisites:

- Node.js `>= 20.0.0`
- npm `>= 10` (recommended)
- Linux/macOS/WSL environment for the provided scripts

```bash
git clone <your-repo-url>
cd Google-Play-Store-Scraper
npm install
```

Start development mode:

```bash
npm run dev
```

This starts the Express server and Vite-powered front-end on the configured `PORT` (default `5000`).
## Testing

Run static checks and validate type safety:

```bash
npm run check
```

Recommended local quality commands (manual convention for this repo):

```bash
# Unit/integration tests (if you add a test framework such as vitest/jest)
npm run test

# Linting (if ESLint is configured)
npm run lint
```

> [!WARNING]
> `test` and `lint` scripts are not currently defined in `package.json`; add them before wiring CI gates.
## Deployment

Build and start the production bundle:

```bash
npm run build
npm run start
```

Build flow details:

- The client bundle is produced by Vite.
- The server is bundled to `dist/index.cjs` using `esbuild`.
- The runtime serves static assets in production mode via `serveStatic`.
A minimal CI pipeline should:
- Install dependencies with the lockfile (`npm ci`).
- Run type checks (`npm run check`).
- Run the build (`npm run build`).
- Optionally boot an app smoke test (`npm run start`).
If deploying with Docker:
- Use a Node 20+ base image.
- Run `npm ci --omit=dev` in the runtime stage if prebuilt artifacts are copied.
- Expose `PORT` and set `NODE_ENV=production`.
> [!CAUTION]
> Scraping workloads can trigger remote anti-bot defenses. Add rate controls, retries, and observability before high-volume production usage.
## Usage

- Open the app home page.
- Paste one Play Store URL per line.
- Click Start Scraping.
- Review extracted rows.
- Use Copy JSON or Save Array.
```bash
curl -X POST http://localhost:5000/api/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://play.google.com/store/apps/details?id=com.supercell.clashofclans"
  }'
```

Example response:
```json
{
  "originalUrl": "https://play.google.com/store/apps/details?id=com.supercell.clashofclans",
  "studioName": "Supercell",
  "gameTitle": "Clash of Clans",
  "supportEmail": "support@supercell.com",
  "developerEmail": "support@supercell.com",
  "websiteUrl": "https://supercell.com",
  "downloadCount": "500M+"
}
```

```ts
// Minimal Node client for the scraper API
const response = await fetch("http://localhost:5000/api/scrape", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    url: "https://play.google.com/store/apps/details?id=com.king.candycrushsaga",
  }),
});

if (!response.ok) {
  throw new Error(`Scrape failed: ${response.status}`);
}

const payload = await response.json();
console.log(payload); // Extracted app metadata
```

## Configuration

Environment variables:

- `NODE_ENV`: `development` or `production`.
- `PORT`: HTTP port binding (defaults to `5000`).
Defaults in the current implementation:

- Scrape timeout per attempt: `60000` ms.
- Retry attempts: `2` retries (`3` total attempts).
- Initial retry delay: `1000` ms with exponential backoff.
- URL input validation: strict URL validation using `zod`.
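The strict URL check is enforced with a `zod` schema in `shared/`; a dependency-free approximation of the same gate using the WHATWG `URL` class is shown below. The specific host and path restrictions are assumptions about what "strict" means here, not the repo's exact schema:

```typescript
// Dependency-free approximation of strict Play Store URL validation.
// The repo uses a zod schema in shared/; the host/path/query checks below
// are illustrative assumptions, not the actual schema.
function isValidPlayStoreUrl(input: string): boolean {
  try {
    const url = new URL(input); // throws if not a parseable absolute URL
    return (
      url.protocol === "https:" &&
      url.hostname === "play.google.com" &&
      url.pathname === "/store/apps/details" &&
      url.searchParams.has("id")
    );
  } catch {
    return false;
  }
}
```

Validating the parsed components (scheme, host, path, query) rather than pattern-matching the raw string avoids accepting look-alike URLs on other hosts.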
Where to adjust behavior:

- `server/routes.ts`
  - `retryWithBackoff(...)` parameters.
  - DOM selectors for title/studio/contact extraction.
  - Fallback regex patterns for download counts/emails.
- `server/index.ts`
  - Request logging format and inclusion criteria (`/api` routes only).
- `script/build.ts`
  - `allowlist` and the external-dependency bundling strategy.
## License

This project is licensed under the GNU GPL v3.0. See LICENSE for full legal terms.
## Support the Project

If you find this tool useful, consider leaving a star on GitHub or supporting the author directly.