Web Parser — User Guide

What the Web Parser module does

The module automatically crawls a website by following internal links, extracts text from each page, and (optionally) sends it to your chosen AI provider with your prompt — turning HTML into structured data. The output is JSON or CSV, ready for further processing. Useful for one-off scrapes and for large extraction jobs with complex selectors.

What you get:

Crawler with auto-discovery — set a start URL and a page limit, and the module follows internal links itself.
AI page processing — every page is sent to the LLM with your prompt and data template; the result is saved separately.
JSON/CSV export — via the Download button in the report.
Spider-trap detection — built-in detection of cyclic URLs and a 15-segment path-depth limit.
Proxy and auth — each template can use a chosen proxy and an auth profile (headers, cookies).
Pause and resume — stop a job and continue later without losing progress.
Rate control — requests-per-second limit on the template.
Local storage — every page and result lives in the YoBench database.
Infinite scroll on the pages list — large crawls (hundreds of pages) load on demand instead of paging.
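
The spider-trap and internal-link rules from the list above can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the module's actual code: the function names, the repeat threshold of 3, reading "cyclic URL" as "the same path segment repeated several times", and matching internal links by host are all illustrative guesses. Only the 15-segment depth limit comes from this guide.

```python
from collections import Counter
from urllib.parse import urlparse

MAX_PATH_SEGMENTS = 15   # depth limit stated in this guide
MAX_SEGMENT_REPEATS = 3  # assumed threshold, not documented


def path_segments(url: str) -> list[str]:
    """Split the URL path into its non-empty segments."""
    return [s for s in urlparse(url).path.split("/") if s]


def looks_like_spider_trap(url: str) -> bool:
    """True if the path is deeper than the limit or repeats a segment too often."""
    segments = path_segments(url)
    if len(segments) > MAX_PATH_SEGMENTS:
        return True
    counts = Counter(segments)
    return any(n >= MAX_SEGMENT_REPEATS for n in counts.values())


def is_internal(url: str, start_url: str) -> bool:
    """Assumption: a link is internal when its host equals the start URL's host."""
    return urlparse(url).netloc == urlparse(start_url).netloc
```

A crawler built this way would enqueue a discovered link only when is_internal(...) is true and looks_like_spider_trap(...) is false.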

Template parameters

A template is a reusable bundle of parameters. You specify:

Start URL — the entry point.
Max pages — max_pages. Default 100. Caps the total job size.
Rate (req/sec) — rate_limit. Default 2. Throttles request rate.
AI enabled — ai_enabled flag. When off, the module saves only page text without AI processing.
AI provider — which provider from the AI Chat registry to use.
LLM prompt — instructions on how to extract data from the page.
Data template — a JSON skeleton the LLM should fill (data_template).
Output format — none / json / csv.
Proxy (optional) — route traffic through a selected proxy profile.
Auth profile (optional) — headers, cookies, query parameters for authentication.
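
To make the parameter list concrete, here is a hedged sketch of a template as a Python data structure. Only max_pages, rate_limit, ai_enabled, data_template and the documented defaults (100 pages, 2 req/sec) come from this guide. The class name, the other field names, the defaults for ai_enabled and output_format, the example URL and the provider name are hypothetical, and the module's real storage format may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional

OUTPUT_FORMATS = {"none", "json", "csv"}  # formats listed in this guide


@dataclass
class ParseTemplate:
    start_url: str                     # hypothetical field name
    max_pages: int = 100               # documented default
    rate_limit: float = 2.0            # requests per second, documented default
    ai_enabled: bool = False           # default value is an assumption
    ai_provider: Optional[str] = None  # hypothetical field name
    prompt: str = ""                   # hypothetical field name
    data_template: dict = field(default_factory=dict)
    output_format: str = "none"        # default value is an assumption
    proxy: Optional[str] = None        # hypothetical field name
    auth_profile: Optional[str] = None # hypothetical field name

    def __post_init__(self) -> None:
        if self.output_format not in OUTPUT_FORMATS:
            raise ValueError(f"unknown output format: {self.output_format}")
        if self.max_pages < 1 or self.rate_limit <= 0:
            raise ValueError("max_pages and rate_limit must be positive")

    def min_delay_seconds(self) -> float:
        """Spacing between requests if the limit is applied evenly (an assumption)."""
        return 1.0 / self.rate_limit


template = ParseTemplate(
    start_url="https://example.com/catalog",
    ai_enabled=True,
    ai_provider="my-provider",
    prompt="Extract the product name and price from the page text.",
    data_template={"name": "", "price": None, "currency": ""},
    output_format="json",
)
print(json.dumps(asdict(template), indent=2))
```

The data_template here is the JSON skeleton the LLM is asked to fill: keys name the fields you want, and the values are empty placeholders.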

Technical caveats (not exposed in the UI):

Requests use the fetch API (no headless browser, no JavaScript rendering). JS-only sites parse poorly — use Site Audit with Chromium for those.
User-Agent is fixed: Mozilla/5.0 ... Chrome/120 (it looks like a regular desktop browser).
Request timeout is 30 seconds per page.
robots.txt is not respected — be courteous and use the rate limit.
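
Because the module does not read robots.txt, you may want to check a site's rules yourself before choosing a start URL and a rate. The sketch below uses Python's standard urllib.robotparser on robots.txt text you have already obtained. The helper name, the sample rules and the example URLs are illustrative assumptions and are not part of the module.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def check_robots(robots_txt: str, url: str, user_agent: str = "*") -> tuple[bool, Optional[float]]:
    """Return (allowed, crawl_delay) for a URL given the text of robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url), parser.crawl_delay(user_agent)


# Sample rules for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

allowed, delay = check_robots(SAMPLE_ROBOTS, "https://example.com/catalog")
blocked, _ = check_robots(SAMPLE_ROBOTS, "https://example.com/private/data")
```

If the site declares a crawl delay, a courteous choice is a template rate that keeps requests at least that far apart.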

Global settings

The module has no dedicated entries in the central Settings section — everything is configured at the template level.

Actions

On a job:

Start — kicks a job off from a template: a page queue is created, the parser starts.
Pause — persists the current queue and LLM processing state.
Resume — resumes from the pause point.
Stop — cancels the job and clears the queue.
Download JSON / CSV — exports structured data for every processed page.

On a page:

View the original text, the LLM response, and the resulting JSON.

States

Job: queued → running → paused | completed | stopped | error. A paused job returns to running on Resume.

Page: crawled → processing → processed | error.
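
The state lists above can be read as small transition tables. The sketch below is one possible encoding, not the module's implementation: the table and function names are hypothetical, and allowing Stop from the paused state is an assumption (the guide only documents Resume from pause).

```python
JOB_TRANSITIONS = {
    "queued": {"running"},
    "running": {"paused", "completed", "stopped", "error"},
    "paused": {"running", "stopped"},  # "stopped" from paused is an assumption
    "completed": set(),
    "stopped": set(),
    "error": set(),
}

PAGE_TRANSITIONS = {
    "crawled": {"processing"},
    "processing": {"processed", "error"},
    "processed": set(),
    "error": set(),
}


def can_transition(table: dict[str, set[str]], current: str, target: str) -> bool:
    """True if moving from current to target is allowed by the given table."""
    return target in table.get(current, set())
```

For example, can_transition(JOB_TRANSITIONS, "paused", "running") models the Resume action, while a completed job has no outgoing transitions.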

Workflow

1. Create a template

Open the Web Parser module from the left sidebar.
On the Parse templates tab, click Create template.
Fill in the start URL, page cap, rate.
Toggle AI on, pick a provider, write the prompt and the data template.
Pick the output format (JSON / CSV) or none if you only need raw text.
Optionally pick a proxy and auth profile.
Save.

2. Start a job

Next to the template, click Start. A new job appears with the running status.
The Reports tab shows progress: pages crawled / processed / failed, current status.
If needed, click Pause. The job state is saved to the database and survives an app restart.
To continue, click Resume.

3. Download results

For a completed job, click Download JSON or Download CSV. The file contains structured_data for every processed page.
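
If you post-process the JSON export in an external script, something along these lines may help. The exact layout of the export is not described in this guide, so the sketch assumes a JSON array of page objects, each with a url field and a structured_data object. Both the layout and the helper names are assumptions to adjust against a real file.

```python
import csv
import io
import json


def export_to_rows(export_json: str) -> list[dict]:
    """Flatten each page's structured_data into one row, keeping the page URL."""
    pages = json.loads(export_json)
    rows = []
    for page in pages:
        data = page.get("structured_data") or {}
        rows.append({"url": page.get("url", ""), **data})
    return rows


def rows_to_csv(rows: list[dict]) -> str:
    """Write rows as CSV; fields missing on a page are left empty."""
    other_keys = sorted({k for row in rows for k in row if k != "url"})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["url", *other_keys], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical export content, for illustration only.
SAMPLE_EXPORT = json.dumps([
    {"url": "https://example.com/p/1", "structured_data": {"name": "Widget", "price": 9.99}},
    {"url": "https://example.com/p/2", "structured_data": {"name": "Gadget"}},
])

rows = export_to_rows(SAMPLE_EXPORT)
csv_text = rows_to_csv(rows)
```

Collecting the column names across all rows keeps the CSV rectangular even when the LLM omits a field on some pages.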

4. Inspect

Open a specific page in the report — you see the original text, the LLM response and the resulting JSON.
Feed the JSON into other modules (e.g. import into the Context Manager or an external script).

Next steps

Connect AI providers — without them AI processing does nothing (the module can still save text only).
For sites that need JS rendering, use Site Audit with Chromium.
For ongoing availability checks, run Health Check alongside.

Help and feedback

Want a headless browser, robots.txt support or a scheduler? Contact us via the feedback form.