SignalExtract AI

Introduction

SignalExtract AI turns messy, real-world documents — PDFs, emails, and reports — into structured, evidence-linked signals. Unlike a simple parser or a single LLM call, it layers deterministic rules, language-model understanding, evidence grounding, and human review to stay reliable on inconsistent, ambiguous text.

Evidence-grounded

Every signal links to a verbatim source span.

Hybrid engine

Rules + LLM, merged — never dependent on one path.

Human-in-the-loop

Approve, reject, or edit with calibrated confidence.

Structured output

Typed signals with provenance, JSON or CSV.

Quick start

The full flow is: upload → extract text → extract signals → review → export. Every request includes your API key when key protection is enabled.

bash

API=http://localhost:8000/api/v1
KEY="replace-me"   # must match API_KEY in the backend .env

# 1. Upload a document
DOC=$(curl -s -X POST "$API/documents/upload" -H "x-api-key: $KEY" \
  -F "file=@report.pdf" | jq -r .id)

# 2. Extract text, then signals
curl -s -X POST "$API/documents/$DOC/extract-text"    -H "x-api-key: $KEY"
curl -s -X POST "$API/documents/$DOC/extract-signals" -H "x-api-key: $KEY"

# 3. List the extracted signals
curl -s "$API/documents/$DOC/signals" -H "x-api-key: $KEY"

Or just use the workspace UI — upload from the Documents page, run extraction, and review signals visually.

Core concepts

Signal

A single structured data point extracted from a document — a typed value with the evidence that supports it. Example: a recommendation with the verbatim sentence it came from.

Evidence grounding

Every signal carries an evidence span lifted verbatim from the source. If a model proposes a signal whose evidence can't be located in the text, it's discarded. This is the primary guard against hallucinations.
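The grounding check can be sketched in a few lines of Python. This is a minimal illustration, assuming evidence is matched by exact substring search; the helper name `ground` and the candidate dictionaries are hypothetical, and the real matcher may normalize whitespace or tolerate small differences.

```python
from typing import Optional, Tuple


def ground(evidence: str, source_text: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of the evidence in the source,
    or None when the span cannot be located verbatim.

    Hypothetical helper: exact substring matching is an assumption.
    """
    if not evidence:
        return None  # an empty span proves nothing
    start = source_text.find(evidence)
    if start == -1:
        return None  # ungrounded: the candidate would be discarded
    return start, start + len(evidence)


source = "We recommend quarterly audits. Revenue grew 12% in 2023."
candidates = [
    {"type": "recommendation", "evidence": "We recommend quarterly audits."},
    {"type": "percentage", "evidence": "Revenue grew 15%"},  # not in the source
]

# Keep only candidates whose evidence is found verbatim in the source.
grounded = [c for c in candidates if ground(c["evidence"], source) is not None]
```

Here only the recommendation survives, because "Revenue grew 15%" never appears in the source text.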

Confidence

A 0.0–1.0 score per signal. Signals at or above the REVIEW_THRESHOLD are treated as high-confidence; the rest are surfaced for closer review.
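As a rough sketch of how the threshold splits signals, assuming the sample REVIEW_THRESHOLD of 0.6 from the configuration section (the function name and the signal dictionaries are illustrative, not part of the API):

```python
REVIEW_THRESHOLD = 0.6  # sample value from the .env example; adjust per deployment


def is_high_confidence(confidence: float, threshold: float = REVIEW_THRESHOLD) -> bool:
    """Signals at or above the threshold count as high-confidence."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be in [0.0, 1.0], got {confidence}")
    return confidence >= threshold


signals = [
    {"id": "a", "confidence": 0.92},
    {"id": "b", "confidence": 0.60},  # exactly at the threshold: high-confidence
    {"id": "c", "confidence": 0.41},
]

high = [s["id"] for s in signals if is_high_confidence(s["confidence"])]
closer_review = [s["id"] for s in signals if not is_high_confidence(s["confidence"])]
```

Note that the comparison is inclusive, matching "at or above" in the definition.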

Extraction modes

Set globally via EXTRACTION_MODE, or rely on the default. The mode controls how deterministic rules and the LLM combine.

Mode	Behavior	LLM dependency
rule_based	Deterministic patterns only — fully offline.	None
hybrid (default)	Union of rules + LLM, de-duplicated.	Optional — falls back to rules
llm	LLM-first, falls back to rules if it returns nothing.	Preferred

In hybrid and llm, if the LLM is unreachable or over quota the rule-based pass still returns results — extraction never hard-fails on the model.
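The fallback behavior in the table can be approximated as follows. This is a sketch only: `rules` and `llm` are hypothetical callables standing in for the two extraction passes, and catching every exception from the model is a simplification of whatever error handling the engine really uses.

```python
from typing import Callable, List

Extractor = Callable[[str], List[dict]]
VALID_MODES = {"rule_based", "hybrid", "llm"}


def run_extraction(text: str, mode: str, rules: Extractor, llm: Extractor) -> List[dict]:
    """Combine the two extractors according to the mode table (illustrative)."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown EXTRACTION_MODE: {mode!r}")
    if mode == "rule_based":
        return rules(text)  # the LLM is never called

    try:
        llm_signals = llm(text)
    except Exception:
        llm_signals = []  # unreachable or over quota: continue without the model

    if mode == "hybrid":
        return rules(text) + llm_signals  # union; de-duplication happens afterwards
    return llm_signals or rules(text)  # "llm": fall back only if nothing came back


def sample_rules(text: str) -> List[dict]:
    return [{"type": "date", "value": "2024-01-05", "source": "rule"}]


def broken_llm(text: str) -> List[dict]:
    raise ConnectionError("model unreachable")


def working_llm(text: str) -> List[dict]:
    return [{"type": "finding", "value": "Costs rose.", "source": "llm"}]


hybrid_offline = run_extraction("doc", "hybrid", sample_rules, broken_llm)
llm_offline = run_extraction("doc", "llm", sample_rules, broken_llm)
llm_online = run_extraction("doc", "llm", sample_rules, working_llm)
```

With the model down, both `hybrid_offline` and `llm_offline` still contain the rule-based date signal.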

Signal types

Two families of signals are supported: precise entities and higher-order claim-level statements.

Entities

date, amount, percentage, email, phone, url, identifier, measurement, person_name, organization, location, medical_code

Claim-level

finding, recommendation, action, key_statement

How extraction works

The engine is built around one principle: recall first, precision second. Generate candidate signals broadly, then ground and verify strictly.

1. Ingest. Parse PDF / email / docx to text, preserving layout and character offsets.
2. Extract. Rule anchors and the LLM each propose candidate signals (high recall).
3. Ground. Each candidate's evidence must be found in the source, or it's dropped.
4. Merge. Duplicates across strategies are collapsed into one signal.
5. Review. Confidence routes uncertain signals to a human approve / reject queue.
6. Export. Approved signals leave as JSON or CSV with full provenance.
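The merge step is the least obvious of the six, so here is one way it could work. The sketch assumes two candidates are duplicates when their type matches and their values are equal after collapsing whitespace and lowercasing, and that the higher-confidence copy wins; both rules, and the `source` field, are assumptions for illustration rather than the engine's documented behavior.

```python
from typing import Dict, List, Tuple


def merge_signals(candidates: List[dict]) -> List[dict]:
    """Collapse duplicates, keeping the highest-confidence copy of each signal.

    Assumption: duplicates share a type and a whitespace/case-normalized value.
    """
    best: Dict[Tuple[str, str], dict] = {}
    for cand in candidates:
        key = (cand["type"], " ".join(cand["value"].split()).lower())
        kept = best.get(key)
        if kept is None or cand["confidence"] > kept["confidence"]:
            best[key] = cand
    return list(best.values())  # dicts preserve first-seen key order


candidates = [
    {"type": "email", "value": "Ops@Example.com", "confidence": 0.9, "source": "rule"},
    {"type": "email", "value": " ops@example.com ", "confidence": 0.7, "source": "llm"},
    {"type": "date", "value": "2024-01-05", "confidence": 0.8, "source": "llm"},
]

merged = merge_signals(candidates)
```

The two email candidates collapse into the rule-based one, leaving two signals in total.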

Review workflow

Each signal has a review_status that moves through human-in-the-loop states:

Status	Meaning
pending	Awaiting review (default after extraction).
approved	Confirmed correct — included in approved-only exports.
rejected	Marked incorrect — excluded from approved exports.
edited	Value corrected by a reviewer, then accepted.

bash

curl -X PATCH "$API/signals/$SIGNAL_ID/review" \
  -H "x-api-key: $KEY" -H "content-type: application/json" \
  -d '{"review_status":"approved","reviewed_by":"analyst@org"}'
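The same review call can be built from Python with only the standard library. This sketch constructs the request without sending it; the base URL, the key, and the signal ID `sig-123` are placeholders, and `build_review_request` is a hypothetical helper. The allowed statuses are taken from the table above, though whether the endpoint accepts every one of them is not confirmed here.

```python
import json
import urllib.request

REVIEW_STATUSES = {"pending", "approved", "rejected", "edited"}  # from the status table


def build_review_request(api: str, key: str, signal_id: str,
                         status: str, reviewer: str) -> urllib.request.Request:
    """Build (but do not send) the PATCH request shown in the curl example."""
    if status not in REVIEW_STATUSES:
        raise ValueError(f"unknown review_status: {status!r}")
    body = json.dumps({"review_status": status, "reviewed_by": reviewer}).encode("utf-8")
    return urllib.request.Request(
        f"{api}/signals/{signal_id}/review",
        data=body,
        headers={"x-api-key": key, "content-type": "application/json"},
        method="PATCH",
    )


req = build_review_request("http://localhost:8000/api/v1", "replace-me",
                           "sig-123", "approved", "analyst@org")
# To send it against a running backend: urllib.request.urlopen(req)
```

Validating the status locally catches typos before they reach the server.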

API reference

All routes are under /api/v1. When REQUIRE_API_KEY is enabled, send your key in the x-api-key header.

Method	Endpoint	Purpose
GET	/health · /ready	Liveness and DB readiness
GET	/stats	Aggregate metrics
POST	/documents/upload	Upload a file (multipart)
GET	/documents	List documents
GET	/documents/{id}	Get one document
DELETE	/documents/{id}	Delete a document
POST	/documents/{id}/extract-text	Extract document text
POST	/documents/{id}/extract-signals	Run signal extraction
GET	/documents/{id}/signals	List signals (filterable)
GET	/documents/{id}/export.json	Export signals as JSON
GET	/documents/{id}/export.csv	Export signals as CSV
PATCH	/signals/{id}/review	Approve / reject / edit a signal

Configuration

Configured through environment variables (a .env file on the backend).

.env

# Extraction
EXTRACTION_MODE=hybrid          # rule_based | hybrid | llm
REVIEW_THRESHOLD=0.6

# LLM provider (Ollama or Anthropic)
LLM_PROVIDER=ollama             # ollama | anthropic
LLM_ENDPOINT=http://localhost:11434
LLM_MODEL=llama3.1
LLM_API_KEY=                    # required for cloud / Anthropic

# Storage & limits
DATABASE_URL=sqlite:///./signalextract.db
MAX_UPLOAD_MB=20
ALLOWED_EXTENSIONS=.txt,.pdf,.docx,.eml

# Security
REQUIRE_API_KEY=true
API_KEY=replace-me
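For readers wiring these variables into their own tooling, the following sketch shows one way to read and validate a few of them in Python. It is not the backend's actual loader: `load_settings` is a hypothetical name, the fallback values simply mirror the sample file above, and treating 1 MB as 1024 * 1024 bytes is an assumption.

```python
import os
from typing import Mapping

VALID_MODES = {"rule_based", "hybrid", "llm"}


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read a subset of the variables above, using the sample values as fallbacks."""
    mode = env.get("EXTRACTION_MODE", "hybrid")
    if mode not in VALID_MODES:
        raise ValueError(f"EXTRACTION_MODE must be one of {sorted(VALID_MODES)}")

    threshold = float(env.get("REVIEW_THRESHOLD", "0.6"))
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("REVIEW_THRESHOLD must be between 0.0 and 1.0")

    raw_extensions = env.get("ALLOWED_EXTENSIONS", ".txt,.pdf,.docx,.eml")
    extensions = [e.strip().lower() for e in raw_extensions.split(",") if e.strip()]

    return {
        "mode": mode,
        "review_threshold": threshold,
        "max_upload_bytes": int(env.get("MAX_UPLOAD_MB", "20")) * 1024 * 1024,
        "allowed_extensions": extensions,
        "require_api_key": env.get("REQUIRE_API_KEY", "true").strip().lower() == "true",
    }


settings = load_settings({"EXTRACTION_MODE": "llm", "REVIEW_THRESHOLD": "0.75"})
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to test.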

FAQ

Does it depend on the LLM?

No. In hybrid mode the rule-based extractor always runs, so if the model is down or rate-limited, extraction degrades gracefully instead of failing.

Can I use a local model?

Yes. Set LLM_PROVIDER=ollama and point LLM_ENDPOINT at a local Ollama server. To use Ollama Cloud or Anthropic instead, set the provider, endpoint, and LLM_API_KEY accordingly.

How are hallucinations prevented?

Evidence grounding: any signal whose evidence can't be matched back to the source text is discarded before it reaches review.