State Government Agency

Government Document Intelligence

Automated extraction of 167 fields per document with >95% accuracy in under 60 seconds

167 fields per document

>95% accuracy

663 automated tests

<60s processing

Government document intelligence case study: large-scale OCR and classification of public-records documents for a state government, with full audit lineage

Context

A state government agency relied on manual data entry for scanned 4-page forms. Each document took 15-20 minutes to key in, with error rates of 2-5% across its 167 fields.

Constraint

Documents contained PII requiring strict protection. Scan quality varied: skewed pages, poor print, and handwritten entries. Regulatory compliance was mandatory.

Intervention

Built a multi-engine OCR system in Rust with consensus merging, coordinate-based PII redaction across 65 zones, and a production Axum API, backed by 663 automated tests.

Outcome

167 fields extracted at >95% accuracy in under 60 seconds (16× faster than manual entry), with full PII compliance and 663 automated tests, delivered in 15 weeks.

Architecture

From scanned form to structured, validated data

Document Preprocessing

Scanned TIFF/image files are normalized, deskewed, and prepared for extraction. Multi-page documents are split and classified to identify form sections and field locations.

Multi-Engine Extraction

Multiple OCR and extraction engines process each document independently. A consensus-merging layer cross-validates results per field, selecting the highest-confidence value based on field type and validation rules.
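As a rough illustration of per-field consensus merging, the sketch below drops candidates that fail a field-type validation rule, sums the reported confidence of engines that agree on a value, and keeps the value with the highest total. The `Candidate`, `FieldKind`, and `merge_field` names, the two validation rules, and the sum-of-confidence scoring are all assumptions made for this example; the case study does not describe the production scoring logic.

```rust
use std::collections::HashMap;

/// One engine's reading of a single field (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub engine: &'static str,
    pub value: String,
    /// Confidence in 0.0..=1.0 as reported by the engine.
    pub confidence: f64,
}

/// Illustrative field types, each with a simple validation rule.
#[derive(Debug, Clone, Copy)]
pub enum FieldKind {
    /// Exactly `len` ASCII digits (e.g. a postal code).
    Digits { len: usize },
    /// Any non-blank text.
    FreeText,
}

fn is_valid(kind: FieldKind, value: &str) -> bool {
    match kind {
        FieldKind::Digits { len } => {
            value.len() == len && value.chars().all(|c| c.is_ascii_digit())
        }
        FieldKind::FreeText => !value.trim().is_empty(),
    }
}

/// Merge candidates for one field. Invalid readings are discarded, agreeing
/// engines pool their confidence, and the value with the highest pooled score
/// wins. Ties go to the lexicographically smallest value so the result is
/// deterministic. Returns `None` when no candidate passes validation.
pub fn merge_field(kind: FieldKind, candidates: &[Candidate]) -> Option<(String, f64)> {
    let mut totals: HashMap<&str, f64> = HashMap::new();
    for c in candidates.iter().filter(|c| is_valid(kind, &c.value)) {
        *totals.entry(c.value.as_str()).or_insert(0.0) += c.confidence;
    }
    totals
        .into_iter()
        .max_by(|a, b| a.1.total_cmp(&b.1).then_with(|| b.0.cmp(a.0)))
        .map(|(value, score)| (value.to_string(), score))
}
```

With this scoring, two engines that agree on `12345` at 0.6 and 0.5 outrank a single engine reporting `54321` at 0.9, and a reading such as `1234S` is rejected by the digit rule no matter how confident its engine is. The pooled score can exceed 1.0, so it is a ranking value rather than a probability.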

PII Protection

65 calibrated coordinate-based redaction zones mask sensitive fields. Pattern detection catches PII outside expected locations. Every redaction is verified against ground truth in the test suite.
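The two layers described here can be sketched in a few lines: fixed rectangles painted over known field positions, plus a text scan that flags PII-shaped strings elsewhere. Everything below is a minimal sketch under assumptions: the `Zone` and `Page` types, the 8-bit grayscale buffer, and the SSN-like `ddd-dd-dddd` shape are hypothetical stand-ins, not the project's calibrated zones or detection patterns.

```rust
/// A redaction rectangle in pixel coordinates on one page (hypothetical layout).
#[derive(Debug, Clone, Copy)]
pub struct Zone {
    pub page: usize,
    pub x: usize,
    pub y: usize,
    pub w: usize,
    pub h: usize,
}

/// An 8-bit grayscale page stored row-major; 255 is white, 0 is black.
pub struct Page {
    pub width: usize,
    pub height: usize,
    pub pixels: Vec<u8>,
}

impl Page {
    pub fn blank(width: usize, height: usize) -> Self {
        Page { width, height, pixels: vec![255; width * height] }
    }
}

/// Paint every zone black, clamping each rectangle to the page bounds and
/// skipping zones that reference a missing page. Returns the number of pixel
/// writes (overlapping zones are counted once per zone).
pub fn redact(pages: &mut [Page], zones: &[Zone]) -> usize {
    let mut written = 0;
    for z in zones {
        if let Some(page) = pages.get_mut(z.page) {
            let x_end = (z.x + z.w).min(page.width);
            let y_end = (z.y + z.h).min(page.height);
            for row in z.y..y_end {
                for col in z.x..x_end {
                    page.pixels[row * page.width + col] = 0;
                    written += 1;
                }
            }
        }
    }
    written
}

/// Return the byte offsets of SSN-like tokens (ddd-dd-dddd) in OCR text.
/// This is a shape check only; it does not look at surrounding characters.
pub fn find_ssn_like(text: &str) -> Vec<usize> {
    let bytes = text.as_bytes();
    let shape = b"ddd-dd-dddd";
    let mut hits = Vec::new();
    if bytes.len() < shape.len() {
        return hits;
    }
    for start in 0..=bytes.len() - shape.len() {
        let window = &bytes[start..start + shape.len()];
        let matches = window.iter().zip(shape.iter()).all(|(&c, &s)| match s {
            b'd' => c.is_ascii_digit(),
            _ => c == s,
        });
        if matches {
            hits.push(start);
        }
    }
    hits
}
```

A ground-truth test in this style would redact a known fixture page and assert that every pixel inside each expected zone is black, then run the pattern scan over the OCR text of the unredacted regions and assert that it returns no hits.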

Production API

An Axum-based REST API handles document submission, processing status, and result retrieval. Rate limiting, structured logging, and health checks ensure production reliability.
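The submit, status, and retrieve flow implies a small job lifecycle behind the HTTP handlers. The sketch below models that lifecycle with the standard library only, so it stays independent of any particular Axum version. The `JobStore` and `JobStatus` names, the in-memory map, and the allowed transitions are assumptions for illustration and are not taken from the production service.

```rust
use std::collections::HashMap;

/// Lifecycle of one submitted document (hypothetical states).
#[derive(Debug, Clone, PartialEq)]
pub enum JobStatus {
    Queued,
    Processing,
    Done { fields: usize },
    Failed { reason: String },
}

/// In-memory job registry that a submission handler and a status handler
/// could share. A real service would likely use durable storage instead.
#[derive(Default)]
pub struct JobStore {
    next_id: u64,
    jobs: HashMap<u64, JobStatus>,
}

impl JobStore {
    /// Register a new document and return its job id.
    pub fn submit(&mut self) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.jobs.insert(id, JobStatus::Queued);
        id
    }

    /// Look up the current status, as a status endpoint would.
    pub fn status(&self, id: u64) -> Option<&JobStatus> {
        self.jobs.get(&id)
    }

    /// Move a job forward. Only Queued -> Processing and
    /// Processing -> Done/Failed are accepted; anything else is an error.
    pub fn advance(&mut self, id: u64, next: JobStatus) -> Result<(), String> {
        let current = self
            .jobs
            .get_mut(&id)
            .ok_or_else(|| format!("unknown job {id}"))?;
        let allowed = matches!(
            (&*current, &next),
            (JobStatus::Queued, JobStatus::Processing)
                | (JobStatus::Processing, JobStatus::Done { .. })
                | (JobStatus::Processing, JobStatus::Failed { .. })
        );
        if !allowed {
            return Err(format!("invalid transition {current:?} -> {next:?}"));
        }
        *current = next;
        Ok(())
    }
}
```

Rejecting out-of-order transitions keeps a result endpoint from ever serving a job that skipped processing, and the `Result` return gives a handler something concrete to map onto an HTTP error response.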

Tech Stack

Core Language

Rust

Cloud

AWS (ECS Fargate, Textract, S3, Secrets Manager)

API Framework

Axum

Document Processing

TIFF/image processing, multi-engine OCR

Deployment

Docker, ECS Fargate

Testing

663 automated tests

Results

167

Fields extracted per document

>95%

Extraction accuracy

<60s

Processing time (was 15-20 min)

663

Automated tests

Zero-PII

Compliance-grade pipeline

16×

Speed improvement
