Introduction

SignalExtract AI turns messy, real-world documents — PDFs, emails, and reports — into structured, evidence-linked signals. Unlike a simple parser or a single LLM call, it layers deterministic rules, language-model understanding, evidence grounding, and human review to stay reliable on inconsistent, ambiguous text.

Evidence-grounded

Every signal links to a verbatim source span.

Hybrid engine

Rules + LLM, merged — never dependent on one path.

Human-in-the-loop

Approve, reject, or edit with calibrated confidence.

Structured output

Typed signals with provenance, JSON or CSV.

Quick start

The full flow is: upload → extract text → extract signals → review → export. When API-key protection is enabled, every request must include your key in the x-api-key header.

bash
API=http://localhost:8000/api/v1
KEY="sk_cleanextract_internal_123456"

# 1. Upload a document
DOC=$(curl -s -X POST "$API/documents/upload" -H "x-api-key: $KEY" \
  -F "file=@report.pdf" | jq -r .id)

# 2. Extract text, then signals
curl -s -X POST "$API/documents/$DOC/extract-text"    -H "x-api-key: $KEY"
curl -s -X POST "$API/documents/$DOC/extract-signals" -H "x-api-key: $KEY"

# 3. List the extracted signals
curl -s "$API/documents/$DOC/signals" -H "x-api-key: $KEY"

Or just use the workspace UI — upload from the Documents page, run extraction, and review signals visually.

Core concepts

Signal

A single structured data point extracted from a document — a typed value with the evidence that supports it. Example: a recommendation with the verbatim sentence it came from.

Evidence grounding

Every signal carries an evidence span lifted verbatim from the source. If a model proposes a signal whose evidence can't be located in the text, it's discarded; this is the primary guard against hallucinations.
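
The check described above can be sketched in a few lines. This is a minimal illustration, not the engine's actual code: it assumes each candidate is a dict with a hypothetical evidence field and that grounding is an exact substring match, whereas the real matcher may be more tolerant (for example, of whitespace differences).

```python
from typing import Optional


def ground(candidate: dict, source_text: str) -> Optional[dict]:
    """Return the candidate with character offsets, or None if ungrounded."""
    evidence = candidate.get("evidence", "")
    if not evidence:
        return None  # an empty span can never count as evidence
    start = source_text.find(evidence)  # exact, verbatim match (assumption)
    if start == -1:
        return None  # evidence is not in the source: discard the signal
    return {**candidate, "start": start, "end": start + len(evidence)}


source = "Revenue grew 12% in Q3. We recommend hiring two analysts."
candidates = [
    {"type": "percentage", "value": "12%", "evidence": "Revenue grew 12% in Q3."},
    # This evidence was paraphrased, so it cannot be found verbatim.
    {"type": "recommendation", "value": "hire staff", "evidence": "We advise hiring."},
]
grounded = [g for c in candidates if (g := ground(c, source)) is not None]
print(grounded)
```

Only the first candidate survives; the second is dropped because its evidence does not appear word for word in the source.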

Confidence

A 0.0–1.0 score per signal. Signals at or above the REVIEW_THRESHOLD are treated as high-confidence; the rest are surfaced for closer review.
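
As a rough illustration of the threshold rule, the sketch below partitions signals at REVIEW_THRESHOLD. The confidence field name and the sample signals are assumptions made for the example; 0.6 is the value used in the sample configuration.

```python
REVIEW_THRESHOLD = 0.6  # value from the sample configuration


def split_by_confidence(signals, threshold=REVIEW_THRESHOLD):
    """Partition signals into (high_confidence, needs_closer_review)."""
    high, review = [], []
    for s in signals:
        # "At or above" the threshold counts as high-confidence.
        (high if s["confidence"] >= threshold else review).append(s)
    return high, review


signals = [
    {"value": "2024-03-01", "confidence": 0.92},
    {"value": "ACME Corp", "confidence": 0.60},
    {"value": "follow up soon", "confidence": 0.41},
]
high, review = split_by_confidence(signals)
print(len(high), "high-confidence;", len(review), "for closer review")
```

The comparison is inclusive, so a signal scoring exactly 0.6 lands in the high-confidence group.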

Extraction modes

Set globally via EXTRACTION_MODE, or rely on the default. The mode controls how deterministic rules and the LLM combine.

Mode | Behavior | LLM dependency
rule_based | Deterministic patterns only; fully offline. | None
hybrid (default) | Union of rules + LLM, de-duplicated. | Optional; falls back to rules
llm | LLM-first; falls back to rules if it returns nothing. | Preferred

In hybrid and llm, if the LLM is unreachable or over quota the rule-based pass still returns results — extraction never hard-fails on the model.
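
One way to picture how the three modes combine the two strategies is the sketch below. It is an assumption-laden illustration: the extract function, the extractor callables, and the (type, value) de-duplication key are all hypothetical, and the real engine may handle model errors more selectively than a blanket except.

```python
from typing import Callable, List

Extractor = Callable[[str], List[dict]]


def extract(text: str, mode: str, rules: Extractor, llm: Extractor) -> List[dict]:
    """Combine rule and LLM candidates according to the extraction mode."""
    if mode not in ("rule_based", "hybrid", "llm"):
        raise ValueError(f"unknown extraction mode: {mode}")
    if mode == "rule_based":
        return rules(text)  # the model is never called
    try:
        llm_signals = llm(text)
    except Exception:
        # Unreachable or over-quota model: treat it as "no LLM output".
        llm_signals = []
    if mode == "llm":
        # LLM-first; fall back to rules only when the model yields nothing.
        return llm_signals or rules(text)
    # hybrid: union of both strategies, de-duplicated on (type, value).
    merged = {}
    for s in rules(text) + llm_signals:
        merged.setdefault((s["type"], s["value"]), s)
    return list(merged.values())


def demo_rules(text: str) -> List[dict]:
    return [{"type": "email", "value": "ops@example.org"}]


def unreachable_llm(text: str) -> List[dict]:
    raise ConnectionError("model unreachable")


for mode in ("rule_based", "hybrid", "llm"):
    print(mode, extract("Contact ops@example.org", mode, demo_rules, unreachable_llm))
```

With the model unreachable, all three modes still return the rule-based result, which is the graceful degradation described above.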

Signal types

Two families of signals are supported: precise entities and higher-order claim-level statements.

Entities

date, amount, percentage, email, phone, url, identifier, measurement, person_name, organization, location, medical_code

Claim-level

finding, recommendation, action, key_statement

How extraction works

The engine is built around one principle: recall first, precision second. Generate candidate signals broadly, then ground and verify strictly.

  1. Ingest. Parse PDF / email / docx to text, preserving layout and character offsets.
  2. Extract. Rule anchors and the LLM each propose candidate signals (high recall).
  3. Ground. Each candidate's evidence must be found in the source, or it's dropped.
  4. Merge. Duplicates across strategies are collapsed into one signal.
  5. Review. Confidence routes uncertain signals to a human approve / reject queue.
  6. Export. Approved signals leave as JSON or CSV with full provenance.
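
The merge step (step 4) could look roughly like the sketch below. How the engine decides that two candidates are duplicates is not specified here, so the key used (same type plus case-insensitive value), the source field, and the "keep the highest confidence" rule are all assumptions for illustration.

```python
def merge(candidates):
    """Collapse duplicate candidates, keeping the highest-confidence copy."""
    merged = {}
    for c in candidates:
        # Hypothetical duplicate key: same type and case-insensitive value.
        key = (c["type"], c["value"].strip().lower())
        best = merged.get(key)
        if best is None:
            merged[key] = {**c, "sources": [c["source"]]}
            continue
        # Remember every strategy that proposed this signal.
        sources = best["sources"]
        if c["source"] not in sources:
            sources = sources + [c["source"]]
        winner = c if c["confidence"] > best["confidence"] else best
        merged[key] = {**winner, "sources": sources}
    return list(merged.values())


candidates = [
    {"type": "organization", "value": "ACME Corp", "confidence": 0.70, "source": "rules"},
    {"type": "organization", "value": "acme corp ", "confidence": 0.85, "source": "llm"},
    {"type": "date", "value": "2024-03-01", "confidence": 0.95, "source": "rules"},
]
signals = merge(candidates)
print(signals)
```

Tracking which strategies agreed on a signal is a design choice of this sketch; agreement between rules and the LLM is a natural input to a confidence score.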

Review workflow

Each signal has a review_status that moves through human-in-the-loop states:

Status | Meaning
pending | Awaiting review (default after extraction).
approved | Confirmed correct; included in approved-only exports.
rejected | Marked incorrect; excluded from approved exports.
edited | Value corrected by a reviewer, then accepted.

bash
curl -X PATCH "$API/signals/$SIGNAL_ID/review" \
  -H "x-api-key: $KEY" -H "content-type: application/json" \
  -d '{"review_status":"approved","reviewed_by":"analyst@org"}'
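
The same review call can be scripted. The sketch below builds the PATCH request with Python's standard library and validates the status before anything is sent. The build_review_request helper and the sig-123 ID are hypothetical; the payload uses only the two fields shown in the curl example, and whether the API accepts every status in the table as an update is an assumption.

```python
import json
import urllib.request

API = "http://localhost:8000/api/v1"
VALID_STATUSES = {"pending", "approved", "rejected", "edited"}


def build_review_request(signal_id, status, reviewer, api_key):
    """Build (but do not send) the PATCH request for one review decision."""
    if status not in VALID_STATUSES:
        raise ValueError(f"invalid review_status: {status!r}")
    body = json.dumps({"review_status": status, "reviewed_by": reviewer})
    return urllib.request.Request(
        f"{API}/signals/{signal_id}/review",
        data=body.encode("utf-8"),
        method="PATCH",
        headers={"x-api-key": api_key, "content-type": "application/json"},
    )


req = build_review_request("sig-123", "rejected", "analyst@org", "replace-me")
print(req.get_method(), req.full_url)
# Send it with urllib.request.urlopen(req) once the backend is running.
```

Rejecting an unknown status on the client side catches typos before they reach the server.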

API reference

All routes are under /api/v1. When REQUIRE_API_KEY is enabled, send your key in the x-api-key header.

Method | Endpoint | Purpose
GET | /health · /ready | Liveness and DB readiness
GET | /stats | Aggregate metrics
POST | /documents/upload | Upload a file (multipart)
GET | /documents | List documents
GET | /documents/{id} | Get one document
DELETE | /documents/{id} | Delete a document
POST | /documents/{id}/extract-text | Extract document text
POST | /documents/{id}/extract-signals | Run signal extraction
GET | /documents/{id}/signals | List signals (filterable)
GET | /documents/{id}/export.json | Export signals as JSON
GET | /documents/{id}/export.csv | Export signals as CSV
PATCH | /signals/{id}/review | Approve / reject / edit a signal
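
As an example of calling these routes from a script, the sketch below downloads an export using only the standard library. The helper names and the doc-42 ID are hypothetical, and the response is written to disk as-is because its exact shape is not documented here.

```python
import urllib.request

API = "http://localhost:8000/api/v1"


def export_request(doc_id: str, fmt: str, api_key: str) -> urllib.request.Request:
    """Build the GET request for a document's JSON or CSV export."""
    if fmt not in ("json", "csv"):
        raise ValueError("fmt must be 'json' or 'csv'")
    return urllib.request.Request(
        f"{API}/documents/{doc_id}/export.{fmt}",
        headers={"x-api-key": api_key},
    )


def download_export(doc_id: str, fmt: str, api_key: str, path: str) -> None:
    """Fetch the export and write it to disk (needs a running backend)."""
    with urllib.request.urlopen(export_request(doc_id, fmt, api_key)) as resp:
        with open(path, "wb") as out:
            out.write(resp.read())


req = export_request("doc-42", "csv", "replace-me")
print(req.get_method(), req.full_url)
```

Calling download_export("doc-42", "csv", key, "signals.csv") would perform the actual request; it is left uncalled here so the snippet runs without a server.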

Configuration

SignalExtract AI is configured through environment variables (a .env file on the backend).

.env
# Extraction
EXTRACTION_MODE=hybrid          # rule_based | hybrid | llm
REVIEW_THRESHOLD=0.6

# LLM provider (Ollama or Anthropic)
LLM_PROVIDER=ollama             # ollama | anthropic
LLM_ENDPOINT=http://localhost:11434
LLM_MODEL=llama3.1
LLM_API_KEY=                    # required for cloud / Anthropic

# Storage & limits
DATABASE_URL=sqlite:///./signalextract.db
MAX_UPLOAD_MB=20
ALLOWED_EXTENSIONS=.txt,.pdf,.docx,.eml

# Security
REQUIRE_API_KEY=true
API_KEY=replace-me
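
To show how these variables might be read and validated at startup, here is a minimal sketch. It is not the backend's actual loader: the Settings class and load_settings function are hypothetical, and apart from hybrid (the documented default mode) the fallback values simply mirror the sample file and may differ from the real defaults.

```python
import os
from dataclasses import dataclass

MODES = {"rule_based", "hybrid", "llm"}


@dataclass(frozen=True)
class Settings:
    extraction_mode: str
    review_threshold: float
    max_upload_mb: int
    allowed_extensions: tuple
    require_api_key: bool


def load_settings(env=os.environ) -> Settings:
    """Read settings from the environment, failing fast on bad values."""
    mode = env.get("EXTRACTION_MODE", "hybrid")
    if mode not in MODES:
        raise ValueError(f"EXTRACTION_MODE must be one of {sorted(MODES)}")
    threshold = float(env.get("REVIEW_THRESHOLD", "0.6"))
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("REVIEW_THRESHOLD must be between 0.0 and 1.0")
    exts = env.get("ALLOWED_EXTENSIONS", ".txt,.pdf,.docx,.eml")
    return Settings(
        extraction_mode=mode,
        review_threshold=threshold,
        max_upload_mb=int(env.get("MAX_UPLOAD_MB", "20")),
        allowed_extensions=tuple(e.strip().lower() for e in exts.split(",") if e.strip()),
        require_api_key=env.get("REQUIRE_API_KEY", "true").lower() == "true",
    )


# A plain dict stands in for the process environment in this example.
settings = load_settings({"EXTRACTION_MODE": "llm", "REVIEW_THRESHOLD": "0.75"})
print(settings)
```

Validating the mode and threshold at load time surfaces a typo in the .env file immediately rather than during the first extraction.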

FAQ

Does it depend on the LLM?

No. In hybrid mode the rule-based extractor always runs, so a down or rate-limited model degrades gracefully instead of failing.

Can I use a local model?

Yes. Set LLM_PROVIDER=ollama and point LLM_ENDPOINT at a local Ollama server. To use Ollama Cloud or Anthropic instead, set the provider, endpoint, and LLM_API_KEY accordingly.

How are hallucinations prevented?

Evidence grounding: any signal whose evidence can't be matched back to the source text is discarded before it reaches review.