Web UI Guide
Step 1 — Enter URLs
In the Target URLs textarea, enter one URL per line. The scraper will visit each URL in order.
https://inside.fifa.com/fifa-world-ranking/men
https://example.com
https://news.ycombinator.com
Step 2 — Build a Command Pipeline
Click buttons in the Commands panel to add them to your pipeline. Each command appears as a block with editable parameters. Drag blocks to reorder them.
The scraper executes commands top-to-bottom on every URL.
Step 3 — Run
Click Run Scraper. The status bar shows progress. When finished, result cards appear below — one per URL. The first card is expanded, the rest are collapsed. Click any card header to expand or collapse it.
Step 4 — Export
- Click Copy on a result card to copy its HTML to clipboard.
- Click Download to save a single .html file.
- Click Download All (.zip) to get every result as a zip archive.
- Click Download CSV to export extracted data.
Pipeline Save / Load
Use the Export button to save your command pipeline as a JSON file. Use Import to load a previously saved pipeline. The JSON contains an array of command objects with type and params.
[
{ "type": "click", "params": { "text": "Show full rankings" } },
{ "type": "wait_timeout", "params": { "ms": "3000" } },
{ "type": "scroll", "params": { "times": "1" } }
]
Pipeline Example
Scrape the full FIFA rankings by clicking "Show full rankings" and waiting for the table to load:
1. Click → text: "Show full rankings"
2. Wait (ms) → 3000
3. Scroll → times: 1
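The same pipeline can be sent straight to the API (see the API Reference below). A sketch, using the FIFA URL from Step 1:
POST /api/scrape
{
  "url": "https://inside.fifa.com/fifa-world-ranking/men",
  "commands": [
    { "type": "click", "params": { "text": "Show full rankings" } },
    { "type": "wait_timeout", "params": { "ms": 3000 } },
    { "type": "scroll", "params": { "times": 1 } }
  ]
}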
Commands Reference
Scroll
Scrolls to the bottom of the page. Useful for pages that load content lazily (infinite scroll).
| Parameter | Type | Default | Description |
|---|---|---|---|
| times | int | 1 | How many times to scroll down |
| delay_ms | int | 1500 | Wait between scrolls (ms). Lower = faster |
Example: Scroll 5 times with 500ms delay to load a long feed.
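In pipeline JSON, that example looks like this (a sketch using the parameter names from the table above):
{ "type": "scroll", "params": { "times": 5, "delay_ms": 500 } }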
Click
Clicks an element by CSS selector or by visible text. Case-insensitive when matching by text. Scrolls the element into view first.
| Parameter | Type | Default | Description |
|---|---|---|---|
| selector | string | — | CSS selector (alternative to text) |
| text | string | — | Visible text of the button/link (partial match OK) |
| wait_after_ms | int | 2000 | Wait after clicking (ms) for content to load |
Example: Text = Show full rankings clicks the FIFA rankings expand button.
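The selector form works the same way. A sketch (the #show-more selector is hypothetical, not taken from the FIFA page):
{ "type": "click", "params": { "selector": "#show-more", "wait_after_ms": 2000 } }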
Extract
Extracts data from elements matching a CSS selector. The values are returned in the response under the extracted key.
| Parameter | Type | Default | Description |
|---|---|---|---|
| selector | string | required | CSS selector for elements to extract |
| attr | string | text | What to extract: text, html, or an attribute name like href |
Example: Selector = .ranking-table td.name, attr = text extracts team names.
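As pipeline JSON, following the same { type, params } shape as the other commands:
{ "type": "extract", "params": { "selector": ".ranking-table td.name", "attr": "text" } }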
Wait for Selector
Pauses until a CSS selector becomes visible on the page. Useful after a click that loads new content.
| Parameter | Type | Default | Description |
|---|---|---|---|
| selector | string | required | CSS selector (e.g. .results, #modal) |
| timeout | int | 10000 | Max wait time (ms) |
Example: Selector = .ranking-table tbody tr waits for table rows to appear.
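As pipeline JSON (the wait_selector type name appears in the validation error list under Error Handling below):
{ "type": "wait_selector", "params": { "selector": ".ranking-table tbody tr", "timeout": 10000 } }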
Wait (ms)
Pauses for a fixed duration. Use as a last resort when other wait strategies don't work.
| Parameter | Type | Default | Description |
|---|---|---|---|
| ms | int | required | Duration in milliseconds |
Example: 3000 waits 3 seconds.
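As pipeline JSON (the same command appears, with string-typed params, in the Pipeline Save / Load example above):
{ "type": "wait_timeout", "params": { "ms": 3000 } }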
API Reference
All API endpoints and the web UI /scrape endpoint are rate-limited to 30 requests per minute per IP.
Health Check
GET /api/health
Response 200:
{ "status": "ok" }
List Commands
GET /api/commands
Response 200:
{
"commands": [
{
"type": "scroll",
"description": "Scroll to the bottom of the page",
"params": {
"times": { "type": "int", "default": 1 },
"delay_ms": { "type": "int", "default": 1500 }
}
},
...
]
}
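Useful for discovering command types and parameter defaults programmatically, e.g.:
curl http://localhost:5000/api/commands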
Scrape
POST /api/scrape
Content-Type: application/json
{
"url": "https://example.com",
"commands": [
{ "type": "click", "params": { "text": "Show more" } },
{ "type": "scroll", "params": { "times": 3 } }
],
"html": true
}
Fields:
| Field | Type | Required | Description |
|---|---|---|---|
| url | string | one of url/urls | Single URL to scrape |
| urls | string[] | one of url/urls | Multiple URLs to scrape |
| commands | object[] | no | Command pipeline (see Commands Reference) |
| html | bool | no | Include HTML in response (default true) |
| page | int | no | Page number (default 1) |
| per_page | int | no | URLs per page, max 100 (default 10) |
Response:
{
"results": [
{
"url": "https://example.com",
"status": "ok",
"html_length": 528,
"html": "<!DOCTYPE html>..."
}
],
"pagination": {
"page": 1,
"per_page": 10,
"total": 1,
"total_pages": 1
}
}
Multi-URL example:
POST /api/scrape
{
"urls": ["https://example.com", "https://example.org"],
"commands": [{ "type": "scroll", "params": { "times": 2 } }],
"html": false
}
Set html to false when you only need lengths or success/failure status.
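A minimal response with html set to false might then look like this (a sketch; the html_length values are illustrative):
{
  "results": [
    { "url": "https://example.com", "status": "ok", "html_length": 528 },
    { "url": "https://example.org", "status": "ok", "html_length": 312 }
  ],
  "pagination": { "page": 1, "per_page": 10, "total": 2, "total_pages": 1 }
}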
Scrape to ZIP
Same input as /api/scrape, but returns a .zip file with one HTML file per URL.
POST /api/scrape/zip
Content-Type: application/json
{
"urls": ["https://example.com", "https://example.org"]
}
Response 200:
Content-Type: application/zip
Content-Disposition: attachment; filename=scraped.zip
Error Handling
Validation errors return 400:
POST /api/scrape
{ "url": "...", "commands": [{ "type": "bad" }] }
Response 400:
{ "error": "commands[0].type 'bad' is invalid. Must be one of: click, extract, scroll, wait_selector, wait_timeout" }
Scrape failures (DNS, timeout, etc.) are per-URL — other URLs still succeed:
{
"results": [
{ "url": "https://example.com", "status": "ok", "html_length": 528 },
{ "url": "https://bad.invalid", "status": "error", "error": "net::ERR_NAME_NOT_RESOLVED" }
],
"pagination": { "page": 1, "per_page": 10, "total": 2, "total_pages": 1 }
}
Rate limit exceeded returns 429:
Response 429:
{ "error": "Rate limit exceeded. Try again later." }
cURL Examples
# Single URL
curl -X POST http://localhost:5000/api/scrape \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}'
# Multiple URLs with pagination (page 2, 5 per page)
curl -X POST http://localhost:5000/api/scrape \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://a.com", "https://b.com", "https://c.com", "https://d.com", "https://e.com", "https://f.com"],
"page": 2,
"per_page": 5
}'
# Multiple URLs with commands
curl -X POST http://localhost:5000/api/scrape \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com", "https://example.org"],
"commands": [{"type": "scroll", "params": {"times": 2}}]
}'
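# Full pipeline sketch: expand, wait for, and extract the FIFA rankings
# (URL and selectors reused from the Commands Reference examples above)
curl -X POST http://localhost:5000/api/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://inside.fifa.com/fifa-world-ranking/men",
    "commands": [
      {"type": "click", "params": {"text": "Show full rankings"}},
      {"type": "wait_selector", "params": {"selector": ".ranking-table tbody tr"}},
      {"type": "extract", "params": {"selector": ".ranking-table td.name", "attr": "text"}}
    ],
    "html": false
  }'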
# Download as zip
curl -X POST http://localhost:5000/api/scrape/zip \
-H "Content-Type: application/json" \
-d '{"urls": ["https://example.com"]}' \
-o scraped.zip