Web UI Guide

Step 1 — Enter URLs

In the Target URLs textarea, enter one URL per line. The scraper will visit each URL in order.

https://inside.fifa.com/fifa-world-ranking/men
https://example.com
https://news.ycombinator.com
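
If you keep the same one-URL-per-line list in a file and want to send it to the API instead, it can be turned into a urls array. The sketch below is only an illustration: parse_url_lines is a hypothetical helper, and skipping blank lines is an assumption about how such input should be treated, not a description of what the web UI does.

```python
def parse_url_lines(text: str) -> list[str]:
    """Split textarea-style input into a list of URLs, one per line."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if line:  # assumption: blank lines carry no URL and are skipped
            urls.append(line)
    return urls


textarea = """https://inside.fifa.com/fifa-world-ranking/men
https://example.com

https://news.ycombinator.com
"""

# The "urls" field is the multi-URL field described in the API Reference.
payload = {"urls": parse_url_lines(textarea)}
print(payload)
```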

Step 2 — Build a Command Pipeline

Click buttons in the Commands panel to add them to your pipeline. Each command appears as a block with editable parameters. Drag blocks to reorder them.

The scraper executes commands top-to-bottom on every URL.

Step 3 — Run

Click Run Scraper. The status bar shows progress. When finished, result cards appear below — one per URL. The first card is expanded, the rest are collapsed. Click any card header to expand or collapse it.

Step 4 — Export

Pipeline Save / Load

Use the Export button to save your command pipeline as a JSON file. Use Import to load a previously saved pipeline. The file contains an array of command objects, each with a type and a params object.

[
  { "type": "click", "params": { "text": "Show full rankings" } },
  { "type": "wait_timeout", "params": { "ms": "3000" } },
  { "type": "scroll", "params": { "times": "1" } }
]
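
A saved pipeline can be sanity-checked before it is imported or reused elsewhere. The following is a minimal sketch, not part of the tool: load_pipeline is a hypothetical helper, and the set of known types is copied from the validation error message shown under Error Handling.

```python
import json

# Command types as listed in the API's validation error message.
KNOWN_TYPES = {"click", "extract", "scroll", "wait_selector", "wait_timeout"}


def load_pipeline(text: str) -> list[dict]:
    """Parse an exported pipeline and check its basic shape."""
    pipeline = json.loads(text)
    if not isinstance(pipeline, list):
        raise ValueError("pipeline must be a JSON array")
    for i, cmd in enumerate(pipeline):
        if not isinstance(cmd, dict) or cmd.get("type") not in KNOWN_TYPES:
            raise ValueError(f"command {i} has a missing or unknown type")
        if not isinstance(cmd.get("params", {}), dict):
            raise ValueError(f"command {i} params must be an object")
    return pipeline


exported = """[
  { "type": "click", "params": { "text": "Show full rankings" } },
  { "type": "wait_timeout", "params": { "ms": "3000" } },
  { "type": "scroll", "params": { "times": "1" } }
]"""

pipeline = load_pipeline(exported)
print([cmd["type"] for cmd in pipeline])
```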

Pipeline Example

Scrape the full FIFA rankings by clicking "Show full rankings" and waiting for the table to load:

1. Click       → text: "Show full rankings"
2. Wait (ms)   → 3000
3. Scroll      → times: 1

Commands Reference

Scroll

Scrolls to the bottom of the page. Useful for pages that load content lazily (infinite scroll).

Parameter  Type  Default  Description
times      int   1        How many times to scroll down
delay_ms   int   1500     Wait between scrolls (ms). Lower = faster

Example: Scroll 5 times with 500ms delay to load a long feed.

Click

Clicks an element matched by CSS selector or by visible text. Text matching is case-insensitive. The element is scrolled into view before the click.

Parameter      Type    Default  Description
selector       string  -        CSS selector (alternative to text)
text           string  -        Visible text of the button/link (partial match OK)
wait_after_ms  int     2000     Wait after clicking (ms) for content to load

Example: text = "Show full rankings" clicks the expand button on the FIFA rankings page.

Extract

Extracts data from elements matching a CSS selector. Returns values in the response under extracted.

Parameter  Type    Default   Description
selector   string  required  CSS selector for elements to extract
attr       string  text      What to extract: text, html, or an attribute name like href

Example: Selector = .ranking-table td.name, attr = text extracts team names.

Wait for Selector

Pauses until a CSS selector becomes visible on the page. Useful after a click that loads new content.

Parameter  Type    Default   Description
selector   string  required  CSS selector (e.g. .results, #modal)
timeout    int     10000     Max wait time (ms)

Example: Selector = .ranking-table tbody tr waits for table rows to appear.

Wait (ms)

Pauses for a fixed duration. Use as a last resort when other wait strategies don't work.

Parameter  Type  Default   Description
ms         int   required  Duration in milliseconds

Example: 3000 waits 3 seconds.
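
Since a fixed wait is the last resort, a pipeline can often swap it for a Wait for Selector step. The sketch below builds such a pipeline as the command list the API accepts. The command helper is hypothetical. The selectors are the example selectors from this reference and may not match the live page.

```python
def command(cmd_type: str, **params) -> dict:
    """Build one command object in the {type, params} shape used by pipelines."""
    return {"type": cmd_type, "params": params}


pipeline = [
    # Expand the table, then wait for rows instead of sleeping a fixed 3000 ms.
    command("click", text="Show full rankings"),
    command("wait_selector", selector=".ranking-table tbody tr", timeout=10000),
    command("extract", selector=".ranking-table td.name", attr="text"),
]

for step in pipeline:
    print(step)
```

The wait ends as soon as the rows become visible, so the pipeline does not spend the full 3 seconds on pages that load quickly.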

API Reference

All API endpoints and the web UI /scrape endpoint are rate-limited to 30 requests per minute per IP.
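
A limit of 30 requests per minute averages out to one request every 2 seconds. A client that sends many requests could pace itself on its side. The Pacer class below is a hypothetical sketch of that idea. How the server counts requests inside the minute is not documented here, so even spacing is an assumption.

```python
import time


class Pacer:
    """Space calls evenly so that at most per_minute calls start per minute."""

    def __init__(self, per_minute=30, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute  # 2.0 seconds for 30 per minute
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed, then reserve the slot after it."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.interval


pacer = Pacer(per_minute=30)
print(pacer.interval)
```

Call pacer.wait() immediately before each request. The first call returns at once and later calls sleep only as long as needed.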

Health Check

GET /api/health

Response 200:
{ "status": "ok" }

List Commands

GET /api/commands

Response 200:
{
  "commands": [
    {
      "type": "scroll",
      "description": "Scroll to the bottom of the page",
      "params": {
        "times":  { "type": "int", "default": 1 },
        "delay_ms": { "type": "int", "default": 1500 }
      }
    },
    ...
  ]
}

Scrape

POST /api/scrape
Content-Type: application/json

{
  "url": "https://example.com",
  "commands": [
    { "type": "click", "params": { "text": "Show more" } },
    { "type": "scroll", "params": { "times": 3 } }
  ],
  "html": true
}

Fields:

Field     Type      Required         Description
url       string    one of url/urls  Single URL to scrape
urls      string[]  one of url/urls  Multiple URLs to scrape
commands  object[]  no               Command pipeline (see Commands)
html      bool      no               Include HTML in response (default true)
page      int       no               Page number (default 1)
per_page  int       no               URLs per page, max 100 (default 10)
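
The field rules above can be checked on the client before a request is sent. The sketch below uses a hypothetical build_scrape_payload helper. It assumes that exactly one of url and urls should be given, that page starts at 1, and that per_page must be at least 1. The table only states the upper bound of 100.

```python
def build_scrape_payload(url=None, urls=None, commands=None,
                         html=True, page=1, per_page=10) -> dict:
    """Assemble a request body for /api/scrape from the documented fields."""
    if (url is None) == (urls is None):
        raise ValueError("provide exactly one of url or urls")  # assumption
    if page < 1:
        raise ValueError("page must be 1 or greater")  # assumption
    if not 1 <= per_page <= 100:
        raise ValueError("per_page must be between 1 and 100")

    payload = {"html": html, "page": page, "per_page": per_page}
    if url is not None:
        payload["url"] = url
    else:
        payload["urls"] = list(urls)
    if commands:
        payload["commands"] = list(commands)
    return payload


body = build_scrape_payload(
    urls=["https://example.com", "https://example.org"],
    commands=[{"type": "scroll", "params": {"times": 2}}],
    html=False,
)
print(body)
```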

Response:

{
  "results": [
    {
      "url": "https://example.com",
      "status": "ok",
      "html_length": 528,
      "html": "<!DOCTYPE html>..."
    }
  ],
  "pagination": {
    "page": 1,
    "per_page": 10,
    "total": 1,
    "total_pages": 1
  }
}
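
The pagination block can be reasoned about with simple arithmetic. The sketch below assumes that the server splits the urls list in order into pages of per_page entries and that total_pages is the rounded-up quotient. Both are consistent with the sample response, but neither is confirmed by it. The page_slice helper is hypothetical.

```python
import math


def page_slice(urls, page=1, per_page=10):
    """Return the URLs expected on a page plus a pagination summary."""
    total = len(urls)
    start = (page - 1) * per_page
    return urls[start:start + per_page], {
        "page": page,
        "per_page": per_page,
        "total": total,
        "total_pages": math.ceil(total / per_page),
    }


urls = ["https://a.com", "https://b.com", "https://c.com",
        "https://d.com", "https://e.com", "https://f.com"]
on_page, pagination = page_slice(urls, page=2, per_page=5)
print(on_page, pagination)
```

Under these assumptions, six URLs at five per page give two pages, and page 2 holds only the sixth URL.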

Multi-URL example:

POST /api/scrape
{
  "urls": ["https://example.com", "https://example.org"],
  "commands": [{ "type": "scroll", "params": { "times": 2 } }],
  "html": false
}

Set html to false when you only need lengths or success/failure status.

Scrape to ZIP

Same input as /api/scrape, but returns a .zip file with one HTML file per URL.

POST /api/scrape/zip
Content-Type: application/json

{
  "urls": ["https://example.com", "https://example.org"]
}

Response 200:
Content-Type: application/zip
Content-Disposition: attachment; filename=scraped.zip
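
Once the response body is downloaded, the archive can be inspected with Python's standard zipfile module. The sketch below is only an illustration. The file names inside the real archive are not documented here, so the example builds a stand-in archive in memory with made-up names.

```python
import io
import zipfile


def list_html_files(zip_bytes: bytes) -> dict[str, int]:
    """Map each file name in the archive to its uncompressed size in bytes."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {info.filename: info.file_size for info in archive.infolist()}


# Stand-in for a downloaded scraped.zip; the entry names are hypothetical.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("example.com.html", "<!DOCTYPE html><p>one</p>")
    archive.writestr("example.org.html", "<!DOCTYPE html><p>two</p>")

print(list_html_files(buffer.getvalue()))
```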

Error Handling

Validation errors return 400:

POST /api/scrape { "url": "...", "commands": [{"type": "bad"}] }

Response 400:
{ "error": "commands[0].type 'bad' is invalid. Must be one of: click, extract, scroll, wait_selector, wait_timeout" }

Scrape failures (DNS, timeout, etc.) are per-URL — other URLs still succeed:

{
  "results": [
    { "url": "https://example.com", "status": "ok", "html_length": 528 },
    { "url": "https://bad.invalid", "status": "error", "error": "net::ERR_NAME_NOT_RESOLVED" }
  ],
  "pagination": { "page": 1, "per_page": 10, "total": 2, "total_pages": 1 }
}
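
Because failures are reported per URL, a client usually wants to separate the two groups after each call. A minimal sketch using the response shape shown above follows. The split_results helper is hypothetical.

```python
def split_results(response: dict):
    """Separate successful results from failed ones by their status field."""
    ok = [r for r in response["results"] if r.get("status") == "ok"]
    failed = [r for r in response["results"] if r.get("status") != "ok"]
    return ok, failed


response = {
    "results": [
        {"url": "https://example.com", "status": "ok", "html_length": 528},
        {"url": "https://bad.invalid", "status": "error",
         "error": "net::ERR_NAME_NOT_RESOLVED"},
    ],
    "pagination": {"page": 1, "per_page": 10, "total": 2, "total_pages": 1},
}

ok, failed = split_results(response)
for result in failed:
    print(f"{result['url']}: {result['error']}")
```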

Rate limit exceeded returns 429:

Response 429:
{ "error": "Rate limit exceeded. Try again later." }
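
One way a client might react to a 429 is to retry with growing delays. The sketch below is an assumption about reasonable client behavior, not something the API prescribes. post_with_retry is a hypothetical helper, send stands for any callable that performs the request and returns a (status, body) pair, and the delay values are arbitrary.

```python
import time


def post_with_retry(send, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call send() until it returns a status other than 429 or attempts run out."""
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 2 s, 4 s, 8 s, ...
    return status, body


# Demo with a fake sender and a recording sleep, so nothing really waits.
replies = iter([
    (429, {"error": "Rate limit exceeded. Try again later."}),
    (200, {"results": []}),
])
delays = []
status, body = post_with_retry(lambda: next(replies), sleep=delays.append)
print(status, delays)
```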

cURL Examples

# Single URL
curl -X POST http://localhost:5000/api/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

# Multiple URLs with pagination (page 2, 5 per page)
curl -X POST http://localhost:5000/api/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://a.com", "https://b.com", "https://c.com", "https://d.com", "https://e.com", "https://f.com"],
    "page": 2,
    "per_page": 5
  }'

# Multiple URLs with commands
curl -X POST http://localhost:5000/api/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com", "https://example.org"],
    "commands": [{"type": "scroll", "params": {"times": 2}}]
  }'

# Download as zip
curl -X POST http://localhost:5000/api/scrape/zip \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com"]}' \
  -o scraped.zip
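
The same requests can be made from Python with only the standard library. The sketch below builds the request without sending it, so it runs without a server. make_scrape_request is a hypothetical helper, and the base URL is the one used in the cURL examples.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # same host and port as the cURL examples


def make_scrape_request(payload: dict, path: str = "/api/scrape") -> urllib.request.Request:
    """Build a JSON POST request for one of the scrape endpoints."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = make_scrape_request({"url": "https://example.com"})
print(request.get_method(), request.full_url)

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     data = json.load(response)
```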