Web Parser — User Guide | YoBench
How to use the Web Parser module in YoBench: website crawler, AI page processing, JSON/CSV export, pause/resume, proxy and rate limit.
What the Web Parser module does
The module automatically crawls a website by following internal links, extracts text from each page, and (optionally) sends it to your chosen AI provider with your prompt — turning HTML into structured data. The output is JSON or CSV, ready for further processing. Useful for one-off scrapes and for large extraction jobs with complex selectors.
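The link-following step described above can be illustrated with a minimal sketch. It is not YoBench's actual crawler: the names LinkCollector and internal_links are hypothetical, and "internal" is assumed here to mean "same host as the current page".

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects raw href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(page_url, html):
    """Return absolute same-host links found in html, fragments removed."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc  # assumption: internal == same host
    found = []
    for href in collector.hrefs:
        # Resolve relative links against the page URL, then drop "#fragment".
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parsed = urlparse(absolute)
        is_web = parsed.scheme in ("http", "https")
        if is_web and parsed.netloc == host and absolute not in found:
            found.append(absolute)
    return found


sample = (
    '<a href="/pricing">Pricing</a> '
    '<a href="https://other.example/x">External</a> '
    '<a href="docs#intro">Docs</a>'
)
links = internal_links("https://example.com/start/", sample)
```

In this sketch the external link is dropped and the two remaining links are resolved to absolute URLs, which is the kind of list a crawler would append to its page queue.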
What you get:
- Crawler with auto-discovery — set a start URL and a page limit, and the module follows internal links itself.
- AI page processing — every page is sent to the LLM with your prompt and data template; the result is saved separately.
- JSON/CSV export — via the Download button in the report.
- Spider-trap detection — built-in detection of cyclic URLs and a 15-segment path-depth limit.
- Proxy and auth — each template can use a chosen proxy and an auth profile (headers, cookies).
- Pause and resume — stop a job and continue later without losing progress.
- Rate control — requests-per-second limit on the template.
- Local storage — every page and result lives in the YoBench database.
- Infinite scroll on the pages list — large crawls (hundreds of pages) load on demand instead of paging.
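To make the spider-trap item above more concrete, here is a rough sketch of how such a check could work. Only the 15-segment depth limit comes from this guide; the repeated-segment heuristic, the MAX_SEGMENT_REPEATS threshold and the function name looks_like_trap are assumptions for illustration, not the module's real logic.

```python
from collections import Counter
from urllib.parse import urlparse

MAX_PATH_SEGMENTS = 15   # depth limit stated in this guide
MAX_SEGMENT_REPEATS = 2  # assumption: illustrative threshold only


def looks_like_trap(url):
    """Return True if the URL path is too deep or repeats a segment too often."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) > MAX_PATH_SEGMENTS:
        return True
    # Cyclic URLs such as /cal/next/cal/next/cal/next repeat the same segments.
    counts = Counter(segments)
    return any(n > MAX_SEGMENT_REPEATS for n in counts.values())


deep_url = "https://example.com/" + "/".join(f"s{i}" for i in range(16))
cyclic_url = "https://example.com/cal/next/cal/next/cal/next"
normal_url = "https://example.com/blog/2024/post"
```

A crawler using a check like this would skip deep_url and cyclic_url and keep normal_url in the queue.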
Template parameters
A template is a reusable bundle of parameters. You specify:
- Start URL: the entry point.
- Max pages (max_pages): default 100. Caps the total job size.
- Rate, req/sec (rate_limit): default 2. Throttles the request rate.
- AI enabled (ai_enabled flag): when off, the module saves only page text without AI processing.
- AI provider: which provider from the AI Chat registry to use.
- LLM prompt: instructions on how to extract data from the page.
- Data template (data_template): a JSON skeleton the LLM should fill.
- Output format: none / json / csv.
- Proxy (optional): route traffic through a selected proxy profile.
- Auth profile (optional): headers, cookies, query parameters for authentication.
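As a way to picture how these parameters fit together, the sketch below models a template as a small Python dataclass. The field names max_pages, rate_limit, ai_enabled and data_template come from this guide; every other name (ParseTemplate, start_url, ai_provider, prompt, output_format, proxy, auth_profile) and the validation rules are hypothetical and may not match how YoBench stores templates.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ParseTemplate:
    start_url: str
    max_pages: int = 100              # default from this guide
    rate_limit: float = 2.0           # requests per second, default from this guide
    ai_enabled: bool = False          # assumption: the real default is not documented
    ai_provider: Optional[str] = None
    prompt: str = ""
    data_template: dict = field(default_factory=dict)
    output_format: str = "none"       # one of: none, json, csv
    proxy: Optional[str] = None
    auth_profile: Optional[str] = None

    def validate(self):
        """Return a list of human-readable problems (empty if none)."""
        problems = []
        if not self.start_url.startswith(("http://", "https://")):
            problems.append("start_url must be an http(s) URL")
        if self.max_pages < 1:
            problems.append("max_pages must be at least 1")
        if self.rate_limit <= 0:
            problems.append("rate_limit must be positive")
        if self.output_format not in ("none", "json", "csv"):
            problems.append("output_format must be none, json or csv")
        if self.ai_enabled and not self.ai_provider:
            problems.append("ai_provider is required when ai_enabled is on")
        return problems

    def min_interval(self):
        """Seconds to wait between requests to honor rate_limit."""
        return 1.0 / self.rate_limit


template = ParseTemplate(
    start_url="https://example.com/catalog",
    ai_enabled=True,
    ai_provider="my-provider",  # hypothetical provider name
    prompt="Extract the product name and price.",
    data_template={"name": "", "price": ""},
    output_format="json",
)
```

With the default rate of 2 requests per second, min_interval() works out to 0.5 seconds between requests.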
Technical caveats (not exposed in the UI):
- Requests use the fetch API (no headless browser, no JavaScript rendering). JS-only sites parse poorly — use Site Audit with Chromium for those.
- User-Agent is fixed: Mozilla/5.0 ... Chrome/120 (looks like a regular desktop).
- Request timeout is 30 seconds per page.
- robots.txt is not respected — be courteous and use the rate limit.
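Since the module does not check robots.txt itself, one courteous option is to check a site's rules yourself before choosing a start URL. The sketch below uses Python's standard urllib.robotparser on robots.txt text you have already downloaded; the helper name allowed_by_robots and the sample rules are illustrative assumptions and are not part of YoBench.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content for illustration.
sample_rules = """User-agent: *
Disallow: /private/
"""

public_ok = allowed_by_robots(sample_rules, "https://example.com/catalog")
private_ok = allowed_by_robots(sample_rules, "https://example.com/private/data")
```

Here public_ok is True and private_ok is False, so a start URL under /private/ would be one to avoid.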
Global settings
The module has no dedicated entries in the central Settings section — everything is configured at the template level.
Actions
On a job:
- Start — kicks a job off from a template: a page queue is created, the parser starts.
- Pause — persists the current queue and LLM processing state.
- Resume — resumes from the pause point.
- Stop — cancels the job and clears the queue.
- Download JSON / CSV — exports structured data for every processed page.
On a page:
- View original text, LLM response and the resulting JSON.
States
Job: queued → running → paused | completed | stopped | error.
Page: crawled → processing → processed | error.
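The two state chains above can be read as small transition tables. The sketch below encodes them that way; the transitions out of paused (back to running on Resume, or to stopped on Stop) are an assumption inferred from the Actions list, and the names JOB_TRANSITIONS, PAGE_TRANSITIONS and can_transition are hypothetical.

```python
# Allowed next states for a job, following the chain in this guide.
JOB_TRANSITIONS = {
    "queued": {"running"},
    "running": {"paused", "completed", "stopped", "error"},
    "paused": {"running", "stopped"},  # assumption: Resume or Stop from paused
    "completed": set(),
    "stopped": set(),
    "error": set(),
}

# Allowed next states for a single page.
PAGE_TRANSITIONS = {
    "crawled": {"processing"},
    "processing": {"processed", "error"},
    "processed": set(),
    "error": set(),
}


def can_transition(table, current, new):
    """Return True if moving from current to new is allowed by the table."""
    return new in table.get(current, set())
```

For example, can_transition(JOB_TRANSITIONS, "running", "paused") is True, while can_transition(JOB_TRANSITIONS, "completed", "running") is False because completed is a terminal state in this model.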
Workflow
1. Create a template
- Open the Web Parser module from the left sidebar.
- On the Parse templates tab, click Create template.
- Fill in the start URL, page cap, rate.
- Toggle AI on, pick a provider, write the prompt and the data template.
- Pick the output format (JSON / CSV) or none if you only need raw text.
- Optionally pick a proxy and auth profile.
- Save.
2. Start a job
- Next to the template, click Start. A new job appears with running status.
- The Reports tab shows progress: pages crawled / processed / failed, current status.
- If needed, click Pause — the job persists to the database and survives an app restart.
- To continue, click Resume.
3. Download results
For a completed job, click Download JSON or Download CSV. The file contains structured_data for every processed page.
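If you post-process the JSON export in an external script, something like the sketch below may be a starting point. The exact layout of the export file is not documented here, so this assumes a JSON array of records, each with a url field and a structured_data object; only the structured_data name comes from this guide, and export_to_csv is a hypothetical helper.

```python
import csv
import io
import json


def export_to_csv(export_json):
    """Flatten an assumed export layout into CSV text, one row per page."""
    records = json.loads(export_json)
    rows = []
    for record in records:
        data = record.get("structured_data") or {}
        rows.append({"url": record.get("url", ""), **data})
    # Collect column names in first-seen order across all rows.
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical export content for illustration.
sample_export = json.dumps([
    {"url": "https://example.com/a",
     "structured_data": {"name": "Widget", "price": "9.99"}},
    {"url": "https://example.com/b",
     "structured_data": {"name": "Gadget"}},
])
csv_text = export_to_csv(sample_export)
```

Pages whose structured_data lacks a field get an empty cell, so rows stay aligned even when the LLM fills the data template only partially.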
4. Inspect
- Open a specific page in the report — you see the original text, the LLM response and the resulting JSON.
- Feed the JSON into other modules (e.g. import into the Context Manager or an external script).
Next steps
- Connect AI providers — without them AI processing does nothing (the module can still save text only).
- For sites that need JS rendering, use Site Audit on Chromium.
- For ongoing availability checks, run Health Check alongside.
Help and feedback
Want a headless browser, robots.txt support or a scheduler? Contact us via the feedback form.