Structured Extraction
Turn read() output into validated JSON records (schema-first) so agents can extract data reliably instead of scraping brittle HTML.
This page covers:
- A schema-first extraction pattern (validated outputs)
- How to use extraction on top of read() markdown
- How to handle failures deterministically
Extraction should produce validated data, not “maybe JSON”.
Use extraction when you need structured records (items, prices, metadata) that downstream code can trust.
Table of Contents
- Concept
- Python: typed extraction
- Failure modes
Concept
The stable path to structured data is:
- Read the page into markdown/text (read())
- Extract a typed object from that content (extract(...))
- Validate the object with a schema so callers don’t need prompt heuristics
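Python: typed extraction

from pydantic import BaseModel
from predicate import read, extract


class Item(BaseModel):
    name: str
    price: str


# `browser` and `llm` are assumed to be an already-initialized browser session and LLM client.
# Read the page into markdown, the content that extraction works on top of.
md = read(browser, format="markdown")["content"]

# Extract a typed object; passing schema=Item requests output validated against Item.
result = extract(browser, llm, "Extract item name and price", schema=Item)

if result.ok:
    print(result.data.name, result.data.price)
else:
    print("extract failed:", result.error)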
Failure modes
Extraction can fail for deterministic reasons:
- invalid JSON
- schema mismatch (missing/extra fields)
- not enough information in read() output
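As a rough illustration of how these failures can be told apart deterministically, the sketch below classifies a raw model response using only the standard library. The `classify` function, the status strings, and the rule that an empty field means the read() output lacked the information are all assumptions made for this example; they are not part of the SDK, and a dataclass stands in for the pydantic schema.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional, Tuple


@dataclass
class Item:
    # Stand-in for the pydantic Item schema used on this page.
    name: str
    price: str


def classify(raw: str) -> Tuple[str, Optional[Item]]:
    """Return (status, item) for a raw response (hypothetical helper).

    status is one of: "ok", "invalid_json", "schema_mismatch",
    "insufficient_content".
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return "invalid_json", None

    expected = {f.name for f in fields(Item)}
    # Missing or extra fields both count as a schema mismatch.
    if not isinstance(payload, dict) or set(payload) != expected:
        return "schema_mismatch", None
    # Every field in this schema must be a string.
    if not all(isinstance(payload[key], str) for key in expected):
        return "schema_mismatch", None
    # Assumed convention: a blank value means the source text lacked the data.
    if any(not payload[key].strip() for key in expected):
        return "insufficient_content", None

    return "ok", Item(**payload)


if __name__ == "__main__":
    print(classify('{"name": "Widget", "price": "$9.99"}'))
    print(classify("not json"))
    print(classify('{"name": "Widget"}'))
    print(classify('{"name": "Widget", "price": ""}'))
```

Because each branch depends only on the response text and the schema, the same input always maps to the same status, which is what lets callers branch on the failure instead of guessing.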
When extraction fails, treat it as a normal verification failure:
- retry with a clarified prompt
- narrow the read scope
- fall back to a snapshot-based workflow if the page is too dynamic
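The recovery options above can be arranged as a fixed, ordered sequence so that failure handling stays deterministic. The following is a minimal sketch under stated assumptions: `Result` is a hypothetical stand-in that mirrors the ok/data/error shape shown on this page, `run_ladder` is an illustrative helper rather than an SDK function, and the lambdas are stubs where real code would call extract(...) or a snapshot-based routine.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class Result:
    # Hypothetical stand-in mirroring the ok/data/error fields used on this page.
    ok: bool
    data: Any = None
    error: Optional[str] = None


Step = Tuple[str, Callable[[], Result]]


def run_ladder(steps: List[Step]) -> Tuple[Optional[str], Result]:
    """Run steps in order and return (label, result) for the first success.

    If every step fails, return (None, Result) with all errors joined,
    so the caller sees exactly which attempts were made and why they failed.
    """
    errors = []
    for label, step in steps:
        result = step()
        if result.ok:
            return label, result
        errors.append(f"{label}: {result.error}")
    return None, Result(ok=False, error="; ".join(errors))


# Stubbed attempts; real steps would wrap extract(...) with different prompts or scopes.
steps: List[Step] = [
    ("original prompt", lambda: Result(ok=False, error="schema mismatch")),
    ("clarified prompt", lambda: Result(ok=False, error="not enough information")),
    ("narrowed read scope", lambda: Result(ok=True, data={"name": "Widget", "price": "$9.99"})),
    ("snapshot fallback", lambda: Result(ok=True, data=None)),
]

if __name__ == "__main__":
    label, result = run_ladder(steps)
    print(label, result.data)
```

Keeping the order fixed, and putting the snapshot fallback last, means a given page failure always triggers the same sequence of attempts, which makes runs easier to compare and debug.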
Related
- Content Reading: produce markdown/text inputs
- Tool Registry: typed tool contracts (related concept)