Structured Extraction

Turn read() output into validated JSON records (schema-first) so agents can extract data reliably instead of scraping brittle HTML.

This page covers:

A schema-first extraction pattern (validated outputs)
How to use extraction on top of read() markdown
How to handle failures deterministically

Extraction should produce validated data, not “maybe JSON”.

Use extraction when you need structured records (items, prices, metadata) that downstream code can trust.

Table of Contents

Concept
Python: typed extraction
Failure modes

Concept

The stable path to structured data is:

1. Read the page into markdown/text (read())
2. Extract a typed object from that content (extract(...))
3. Validate the object with a schema so callers don’t need prompt heuristics

Python: typed extraction

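from pydantic import BaseModel
from predicate import read, extract

# Schema for one extracted record; passed to extract() via schema=.
class Item(BaseModel):
    name: str
    price: str

# `browser` and `llm` are assumed to be created earlier
# (an open browser session and an LLM client).

# Read the page as markdown. `md` is not passed to extract() in this example;
# it is read here so the extraction input can be inspected or logged.
md = read(browser, format="markdown")["content"]

result = extract(browser, llm, "Extract item name and price", schema=Item)
if result.ok:
    # result.data holds the extracted fields defined on Item.
    print(result.data.name, result.data.price)
else:
    # result.error carries the failure reason; see "Failure modes".
    print("extract failed:", result.error)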

Failure modes

Extraction can fail for deterministic reasons:

invalid JSON
schema mismatch (missing/extra fields)
not enough information in read() output
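
The three failure reasons above can be told apart mechanically. The sketch below is not the SDK's implementation; it uses only the standard library (a dataclass stands in for the pydantic schema), and the names ExtractResult, validate_item, and the error prefixes are hypothetical. Treating an empty read() output as "not enough information" is a simplifying assumption.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Item:
    name: str
    price: str


@dataclass
class ExtractResult:
    # Hypothetical result shape, mirroring the ok/data/error fields used above.
    ok: bool
    data: Optional[Item] = None
    error: Optional[str] = None


def validate_item(raw: str, source_text: str) -> ExtractResult:
    """Classify a raw model response into success or one named failure."""
    # Not enough information: assumed here to mean the read() output is empty.
    if not source_text.strip():
        return ExtractResult(ok=False, error="insufficient_content: read() output is empty")

    # Invalid JSON: the response cannot be parsed at all.
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return ExtractResult(ok=False, error=f"invalid_json: {exc.msg}")

    # Schema mismatch: wrong top-level type, or missing/extra fields.
    if not isinstance(payload, dict):
        return ExtractResult(ok=False, error="schema_mismatch: expected a JSON object")
    expected = {f.name for f in fields(Item)}
    missing = sorted(expected - set(payload))
    extra = sorted(set(payload) - expected)
    if missing or extra:
        return ExtractResult(ok=False, error=f"schema_mismatch: missing={missing} extra={extra}")

    # Schema mismatch: every field of Item is declared as str.
    wrong_type = sorted(k for k in expected if not isinstance(payload[k], str))
    if wrong_type:
        return ExtractResult(ok=False, error=f"schema_mismatch: non-string fields={wrong_type}")

    return ExtractResult(ok=True, data=Item(**payload))
```

Because each failure maps to a fixed error prefix, callers can branch on the reason instead of parsing free-form messages.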

When extraction fails, treat it as a normal verification failure:

retry with a clarified prompt
narrow the read scope
fall back to a snapshot-based workflow if the page is too dynamic
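
One way to make that recovery order explicit is a small wrapper that tries each prompt in turn and only then falls back. This is a sketch under stated assumptions, not part of the SDK: Attempt, extract_with_fallback, run_extract, and fallback are hypothetical names, and the callables are supplied by the caller (for example, a closure around extract(...) and a snapshot-based routine).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class Attempt:
    # Hypothetical result shape with the same ok/data/error fields as above.
    ok: bool
    data: Optional[dict] = None
    error: Optional[str] = None


def extract_with_fallback(
    run_extract: Callable[[str], Attempt],
    prompts: Sequence[str],
    fallback: Callable[[], Attempt],
) -> Attempt:
    """Try each prompt in order; call the fallback only if all of them fail."""
    errors: List[str] = []
    for prompt in prompts:
        attempt = run_extract(prompt)
        if attempt.ok:
            return attempt
        # Record why this prompt failed so the final error is traceable.
        errors.append(f"{prompt!r}: {attempt.error}")

    result = fallback()
    if not result.ok:
        # Surface every failure, not just the last one.
        result.error = "; ".join(errors + [f"fallback: {result.error}"])
    return result
```

The prompt list is fixed up front, so the number of retries is bounded and the same inputs always take the same path. Narrowing the read scope can be handled inside run_extract, since the wrapper does not care how each attempt gets its input.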

Content Reading — produce markdown/text inputs
Tool Registry — typed tool contracts (related concept)