Introduction
SignalExtract AI turns messy, real-world documents — PDFs, emails, and reports — into structured, evidence-linked signals. Unlike a simple parser or a single LLM call, it layers deterministic rules, language-model understanding, evidence grounding, and human review to stay reliable on inconsistent, ambiguous text.
Evidence-grounded
Every signal links to a verbatim source span.
Hybrid engine
Rules + LLM, merged — never dependent on one path.
Human-in-the-loop
Approve, reject, or edit with calibrated confidence.
Structured output
Typed signals with provenance, JSON or CSV.
Quick start
The full flow is: upload → extract text → extract signals → review → export. When REQUIRE_API_KEY is enabled, every request must carry your API key in the x-api-key header.
API=http://localhost:8000/api/v1
KEY="sk_cleanextract_internal_123456"
# 1. Upload a document
DOC=$(curl -s -X POST "$API/documents/upload" -H "x-api-key: $KEY" \
-F "file=@report.pdf" | jq -r .id)
# 2. Extract text, then signals
curl -s -X POST "$API/documents/$DOC/extract-text" -H "x-api-key: $KEY"
curl -s -X POST "$API/documents/$DOC/extract-signals" -H "x-api-key: $KEY"
# 3. List the extracted signals
curl -s "$API/documents/$DOC/signals" -H "x-api-key: $KEY"
Or just use the workspace UI: upload from the Documents page, run extraction, and review signals visually.
Core concepts
Signal
A single structured data point extracted from a document — a typed value with the evidence that supports it. Example: a recommendation with the verbatim sentence it came from.
Evidence grounding
Every signal carries an evidence span lifted verbatim from the source. If a model proposes a signal whose evidence can't be located in the text, it's discarded. This is the primary guard against hallucinations.
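The grounding check can be pictured with a minimal sketch. It assumes exact substring matching; the `Candidate` class, its field names, and both function names are hypothetical and are not part of the SignalExtract API. The real engine may normalize whitespace or match more leniently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """Hypothetical candidate signal; field names are illustrative only."""
    signal_type: str
    value: str
    evidence: str


def locate_evidence(source: str, evidence: str) -> Optional[tuple[int, int]]:
    """Return (start, end) character offsets of a verbatim match, or None."""
    if not evidence:
        return None
    start = source.find(evidence)
    if start == -1:
        return None
    return start, start + len(evidence)


def ground(source: str, candidates: list[Candidate]) -> list[tuple[Candidate, tuple[int, int]]]:
    """Keep only candidates whose evidence appears verbatim in the source."""
    grounded = []
    for cand in candidates:
        span = locate_evidence(source, cand.evidence)
        if span is not None:
            grounded.append((cand, span))
    return grounded
```

A candidate whose evidence string does not occur in the source never reaches the returned list, which mirrors the discard rule described above.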
Confidence
A 0.0–1.0 score per signal. Signals at or above the REVIEW_THRESHOLD are treated as high-confidence; the rest are surfaced for closer review.
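As a rough illustration of the threshold rule, the sketch below splits signals into two groups. The dictionary shape and function names are assumptions made for the example; only the "at or above" comparison and the 0.6 sample value (from the Configuration section) come from this page.

```python
REVIEW_THRESHOLD = 0.6  # sample value from the Configuration section


def is_high_confidence(confidence: float, threshold: float = REVIEW_THRESHOLD) -> bool:
    """True when the score is at or above the threshold."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be within 0.0 and 1.0, got {confidence}")
    return confidence >= threshold


def split_by_confidence(signals: list[dict], threshold: float = REVIEW_THRESHOLD):
    """Split signals into (high_confidence, needs_closer_review).

    Each signal is assumed to be a dict with a 'confidence' key (hypothetical shape).
    """
    high, closer_review = [], []
    for signal in signals:
        if is_high_confidence(signal["confidence"], threshold):
            high.append(signal)
        else:
            closer_review.append(signal)
    return high, closer_review
```

Note that a score exactly equal to the threshold counts as high-confidence.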
Extraction modes
Set globally via EXTRACTION_MODE, or rely on the default. The mode controls how deterministic rules and the LLM combine.
| Mode | Behavior | LLM dependency |
|---|---|---|
| rule_based | Deterministic patterns only — fully offline. | None |
| hybrid (default) | Union of rules + LLM, de-duplicated. | Optional — falls back to rules |
| llm | LLM-first, falls back to rules if it returns nothing. | Preferred |
In hybrid and llm, if the LLM is unreachable or over quota the rule-based pass still returns results — extraction never hard-fails on the model.
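One way to read the table is as the dispatch logic sketched below. This is an illustration under stated assumptions, not the engine's code: signals are simplified to `(type, value)` tuples, `run_extraction` and its parameters are hypothetical names, and any exception from the LLM call is treated as "unreachable or over quota".

```python
from typing import Callable

Signal = tuple[str, str]  # (signal_type, value): a simplified, hypothetical shape
Extractor = Callable[[str], list[Signal]]


def run_extraction(text: str, mode: str, rules: Extractor, llm: Extractor) -> list[Signal]:
    """Combine the rule-based and LLM extractors according to the mode."""
    if mode == "rule_based":
        return rules(text)  # the LLM is never called
    if mode not in ("hybrid", "llm"):
        raise ValueError(f"unknown extraction mode: {mode!r}")

    try:
        llm_signals = llm(text)
    except Exception:
        # Simplification: any failure is treated as the model being unavailable.
        llm_signals = []

    if mode == "llm":
        # LLM-first; fall back to rules only when the LLM returned nothing.
        return llm_signals if llm_signals else rules(text)

    # hybrid: union of both passes, de-duplicated, first occurrence wins.
    return list(dict.fromkeys(rules(text) + llm_signals))
```

In this sketch a failing model yields an empty LLM result, so both `hybrid` and `llm` still return whatever the rules find.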
Signal types
Two families of signals are supported: precise entities and higher-order claim-level statements.
Entities
Claim-level
How extraction works
The engine is built around one principle: recall first, precision second. Generate candidate signals broadly, then ground and verify strictly.
1. Ingest. Parse PDF / email / docx to text, preserving layout and character offsets.
2. Extract. Rule anchors and the LLM each propose candidate signals (high recall).
3. Ground. Each candidate's evidence must be found in the source, or it's dropped.
4. Merge. Duplicates across strategies are collapsed into one signal.
5. Review. Confidence routes uncertain signals to a human approve / reject queue.
6. Export. Approved signals leave as JSON or CSV with full provenance.
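The merge step can be sketched as follows. How the engine actually decides that two candidates are duplicates is not documented here, so this example assumes a key of signal type plus case- and whitespace-normalized value, and keeps the higher-confidence copy. The `Signal` class, its fields, and the `sources` labels are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Signal:
    """Hypothetical signal record; field names are illustrative only."""
    signal_type: str
    value: str
    evidence: str
    confidence: float
    sources: set[str] = field(default_factory=set)  # e.g. {"rules"}, {"llm"}


def merge_key(signal: Signal) -> tuple[str, str]:
    """Assumed duplicate key: type plus lowercased, whitespace-collapsed value."""
    return signal.signal_type, " ".join(signal.value.lower().split())


def merge(candidates: list[Signal]) -> list[Signal]:
    """Collapse duplicates, keeping the highest-confidence copy of each."""
    merged: dict[tuple[str, str], Signal] = {}
    for cand in candidates:
        key = merge_key(cand)
        kept = merged.get(key)
        if kept is None:
            # Copy so the caller's objects are not mutated.
            merged[key] = Signal(cand.signal_type, cand.value, cand.evidence,
                                 cand.confidence, set(cand.sources))
            continue
        kept.sources |= cand.sources  # remember every strategy that proposed it
        if cand.confidence > kept.confidence:
            kept.value, kept.evidence, kept.confidence = (
                cand.value, cand.evidence, cand.confidence)
    return list(merged.values())
```

Tracking which strategies proposed a signal is a design choice of this sketch: agreement between rules and the LLM could reasonably feed into confidence, though this page does not say whether it does.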
Review workflow
Each signal has a review_status that moves through human-in-the-loop states:
| Status | Meaning |
|---|---|
| pending | Awaiting review (default after extraction). |
| approved | Confirmed correct — included in approved-only exports. |
| rejected | Marked incorrect — excluded from approved exports. |
| edited | Value corrected by a reviewer, then accepted. |
curl -X PATCH "$API/signals/$SIGNAL_ID/review" \
-H "x-api-key: $KEY" -H "content-type: application/json" \
-d '{"review_status":"approved","reviewed_by":"analyst@org"}'
API reference
All routes are under /api/v1. When REQUIRE_API_KEY is enabled, send your key in the x-api-key header.
| Method | Endpoint | Purpose |
|---|---|---|
| GET | /health · /ready | Liveness and DB readiness |
| GET | /stats | Aggregate metrics |
| POST | /documents/upload | Upload a file (multipart) |
| GET | /documents | List documents |
| GET | /documents/{id} | Get one document |
| DELETE | /documents/{id} | Delete a document |
| POST | /documents/{id}/extract-text | Extract document text |
| POST | /documents/{id}/extract-signals | Run signal extraction |
| GET | /documents/{id}/signals | List signals (filterable) |
| GET | /documents/{id}/export.json | Export signals as JSON |
| GET | /documents/{id}/export.csv | Export signals as CSV |
| PATCH | /signals/{id}/review | Approve / reject / edit a signal |
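For callers who prefer Python over curl, the sketch below builds the review request with the standard library only. The endpoint, header name, and JSON fields are taken from this page; the function name and its parameters are hypothetical, and the base URL is the local one used in the quick start.

```python
import json
import urllib.request

API = "http://localhost:8000/api/v1"  # base URL from the quick start


def build_review_request(signal_id: str, review_status: str, reviewed_by: str,
                         api_key: str, api: str = API) -> urllib.request.Request:
    """Build (but do not send) a PATCH /signals/{id}/review request."""
    body = json.dumps({
        "review_status": review_status,
        "reviewed_by": reviewed_by,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{api}/signals/{signal_id}/review",
        data=body,
        method="PATCH",
        headers={"x-api-key": api_key, "content-type": "application/json"},
    )
```

Passing the returned object to `urllib.request.urlopen` sends it. Building the request separately keeps the example testable without a running server.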
Configuration
Configured through environment variables (a .env file on the backend).
# Extraction
EXTRACTION_MODE=hybrid # rule_based | hybrid | llm
REVIEW_THRESHOLD=0.6
# LLM provider (Ollama or Anthropic)
LLM_PROVIDER=ollama # ollama | anthropic
LLM_ENDPOINT=http://localhost:11434
LLM_MODEL=llama3.1
LLM_API_KEY= # required for cloud / Anthropic
# Storage & limits
DATABASE_URL=sqlite:///./signalextract.db
MAX_UPLOAD_MB=20
ALLOWED_EXTENSIONS=.txt,.pdf,.docx,.eml
# Security
REQUIRE_API_KEY=true
API_KEY=replace-me
FAQ
Does it depend on the LLM?
No. In hybrid mode the rule-based extractor always runs, so when the model is down or rate-limited, extraction degrades gracefully instead of failing.
Can I use a local model?
Yes. Set LLM_PROVIDER=ollama and point LLM_ENDPOINT at a local Ollama server. For Ollama Cloud or Anthropic, set the matching LLM_PROVIDER, the endpoint, and LLM_API_KEY.
How are hallucinations prevented?
Evidence grounding: any signal whose evidence can't be matched back to the source text is discarded before it reaches review.