DocCustomKIE

An end-to-end pipeline for custom Key Information Extraction from documents

A Complete Pipeline for Document Information Extraction

DocCustomKIE is an end-to-end solution for extracting key information from any type of document. The project provides a complete workflow from data annotation to model deployment, requiring minimal data and no GPU for training or inference.

From left to right: intuitive annotation interface, automated training pipeline, and inference with extracted entities visualization.

Key Features

The pipeline is built to make document processing accessible and efficient:

  • 📄 Minimal Data Requirements: Only about 10 documents needed for good performance
  • 🎨 Intuitive Labeling Interface: Create custom labels for any information you need to extract
  • 🔍 Advanced OCR: Intelligent region detection using Sobel gradients for optimal text recognition
  • 🔄 Automatic Data Augmentation: Generate labeled variants to strengthen model robustness
  • 🤖 Lightweight Transformer Fine-tuning: Adapt Microsoft’s LayoutLMv3-base without GPU requirements
  • 🖥️ Modular Inference Interface: Visualize extracted information or customize the pipeline for your needs

Technical Approach

The system works in three stages: document processing, annotation and training, and inference.

DocCustomKIE Pipeline Architecture

  1. Document Processing: Document Upload (PDF/PNG/JPG) → Sobel Gradient Block Detection → Block Processing (Upscale 3x + Sharpen) → Tesseract OCR (Per Block) → Bbox Reconstruction (Original Scale)
  2. Annotation & Training: Manual Annotation (Custom Labels) → BIO Tagging (B-/I-/O Format) → Data Augmentation (Noise, Rotation, etc.) → JSONL Export (Normalized Bboxes) → LayoutLMv3 Fine-tuning (Token Classification)
  3. Inference: New Document (Same OCR Pipeline) → Model Prediction (Label per Token) → BIO Merging (Entity Reconstruction) → Post-processing (Clean Duplicates) → Structured Output (JSON + Visualization)

Future: Feedback Loop (Human Corrections → Retraining)

BIO Tagging Example

Original document text:

  Patient: John Smith
  Born: 15/03/1985 in Paris, France
  Address: 123 Main Street, Lyon
  Doctor: Dr. Marie Dupont

After OCR tokenization:

  Patient: John Smith Born: 15/03/1985 in Paris, France

BIO tagged tokens:

  Token        Tag
  Patient:     O
  John         B-NAME
  Smith        I-NAME
  Born:        O
  15/03/1985   B-DATE
  in           O
  Paris,       B-CITY
  France       I-CITY
  Doctor:      O
  Dr.          B-DOC
  Marie        I-DOC
  Dupont       I-DOC

Legend: B- = Beginning of entity, I- = Inside entity, O = Outside (not an entity)

System architecture showing the complete pipeline, and example of BIO tagging for multi-token entities.

1. Intelligent OCR with Block Detection

The system uses Sobel gradient analysis to detect text regions in documents. Each region is upscaled 3x and sharpened before Tesseract OCR is applied to it, and the resulting bounding boxes are mapped back to the original scale. Running OCR per region in this way improves text recognition accuracy compared to standard whole-image OCR.
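
As a rough illustration of the block detection idea, the sketch below computes a Sobel gradient magnitude on a small grayscale grid, groups high-gradient rows into horizontal bands, and maps a box found in a 3x-upscaled crop back to page coordinates. It is pure Python for readability and is not the project's implementation: the function names, the row-band heuristic, and the threshold are assumptions, and the upscaling, sharpening, and Tesseract steps are left out.

```python
def sobel_magnitude(img):
    """Approximate gradient magnitude (|gx| + |gy|) of a 2D list of gray values."""
    h, w = len(img), len(img[0])
    mag = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Horizontal Sobel kernel: right column minus left column.
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
            # Vertical Sobel kernel: bottom row minus top row.
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1])
            mag[y][x] = abs(gx) + abs(gy)
    return mag


def detect_row_blocks(mag, threshold=0):
    """Group consecutive rows whose summed gradient exceeds `threshold`.

    Returns a list of (top_row, bottom_row) bands, both inclusive.
    The row-sum heuristic is an assumption made for this sketch.
    """
    blocks, start = [], None
    for y, row in enumerate(mag):
        active = sum(row) > threshold
        if active and start is None:
            start = y
        elif not active and start is not None:
            blocks.append((start, y - 1))
            start = None
    if start is not None:
        blocks.append((start, len(mag) - 1))
    return blocks


def to_original_scale(bbox, block_origin, scale=3):
    """Map (x0, y0, x1, y1) from an upscaled block crop back to page coordinates.

    `block_origin` is the (x, y) of the crop's top-left corner on the page.
    """
    x0, y0, x1, y1 = bbox
    ox, oy = block_origin
    return (ox + x0 // scale, oy + y0 // scale, ox + x1 // scale, oy + y1 // scale)
```

On a white 10x10 page with a dark patch on rows 4 and 5, `detect_row_blocks` returns a single band spanning rows 3 to 6: the patch plus the one-pixel margin where the gradient is still non-zero.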

2. BIO Tagging for Entity Recognition

The annotation interface automatically handles multi-token entities using the BIO (Beginning-Inside-Outside) tagging scheme. When annotating “John Smith Ltd.”, the system generates:

  • B-COMPANY for “John”
  • I-COMPANY for “Smith”
  • I-COMPANY for “Ltd.”

This allows the model to learn complex entity boundaries and handle fragmented text naturally.
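
A minimal sketch of how annotated spans can be turned into BIO tags. The span format `(start, end_exclusive, label)` and the function name are assumptions for illustration; the annotation interface may store its selections differently.

```python
def bio_tags(tokens, spans):
    """Return one BIO tag per token.

    tokens: list of OCR tokens.
    spans:  list of (start, end_exclusive, label) token ranges (assumed format).
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


tokens = ["John", "Smith", "Ltd."]
print(list(zip(tokens, bio_tags(tokens, [(0, 3, "COMPANY")]))))
```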

3. Data Augmentation Pipeline

After annotation, the system automatically generates augmented versions of your documents with various transformations:

  • Contrast and brightness adjustments
  • Gaussian noise addition
  • Slight rotations and perspective changes

The original annotations are preserved and transferred to each augmented variant using Levenshtein distance matching.
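
To show what Levenshtein-based label transfer can look like, the sketch below re-labels the tokens OCR'd from an augmented image by finding the closest original token. The matching rule (nearest token by edit distance, accepted only below a length-relative cutoff) and the `max_ratio` parameter are assumptions made for this example, not the project's actual logic.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def transfer_labels(original, augmented_tokens, max_ratio=0.34):
    """Copy tags from (token, tag) pairs onto re-OCR'd tokens.

    A token inherits the tag of its nearest original token if the edit
    distance is at most `max_ratio` of its length; otherwise it gets "O".
    """
    tags = []
    for tok in augmented_tokens:
        best_tag, best_dist = "O", None
        for src, tag in original:
            d = levenshtein(tok, src)
            if best_dist is None or d < best_dist:
                best_tag, best_dist = tag, d
        limit = max_ratio * max(len(tok), 1)
        tags.append(best_tag if best_dist is not None and best_dist <= limit else "O")
    return tags
```

This version ignores token position, so repeated words with different labels would be ambiguous; combining the distance with bounding-box proximity would be one way to tighten it.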

Workflow

annotate.py → layoutlmv3_ft.py → inference.py

  1. Annotation Phase: Upload documents, run OCR, and annotate key information with custom labels
  2. Training Phase: Fine-tune LayoutLMv3 on your annotated data (typically 2-4 hours on CPU)
  3. Inference Phase: Process new documents and extract structured information
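
During the inference phase, per-token predictions have to be merged back into whole entities. The sketch below shows one common way to do that for BIO tags; the output shape (a list of `label`/`text` dictionaries) and the handling of a stray `I-` tag as the start of a new entity are assumptions, not necessarily what inference.py does.

```python
def merge_bio(tokens, tags):
    """Merge BIO-tagged tokens into a list of {"label", "text"} entities."""
    entities, current = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        starts_new = prefix == "B" or (
            prefix == "I" and (current is None or current["label"] != label)
        )
        if starts_new:
            if current:
                entities.append(current)
            current = {"label": label, "text": token}
        elif prefix == "I":
            current["text"] += " " + token   # continue the open entity
        else:                                # "O": close any open entity
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return entities


tokens = ["Patient:", "John", "Smith", "Born:", "15/03/1985"]
tags = ["O", "B-NAME", "I-NAME", "O", "B-DATE"]
print(merge_bio(tokens, tags))
```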

Future Development: Feedback Loop

The next major enhancement will introduce a feedback loop in the inference pipeline, allowing continuous improvement of the model through human corrections:

Planned feedback loop architecture: predictions → human corrections → model retraining with incremental learning.

Planned Features:

  1. Interactive Correction Interface: Review and correct model predictions directly in the inference UI
  2. Incremental Learning: Retrain the model on corrected data without forgetting previous knowledge
  3. Active Learning: Prioritize uncertain predictions for human review
  4. Version Control: Track model improvements over time with performance metrics
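
Since these features are still planned, the following is only a hypothetical sketch of how uncertain predictions might be prioritized for review. It assumes each document comes with a list of per-token confidence scores (for example, the maximum softmax probability) and ranks documents by their least confident token; both the input format and the least-confidence heuristic are assumptions.

```python
def rank_for_review(predictions, top_k=2):
    """Return the `top_k` document ids with the lowest minimum token confidence.

    predictions: dict mapping a document id to a list of per-token
    confidence scores in [0, 1] (assumed format).
    """
    scored = []
    for doc_id, confidences in predictions.items():
        # A document with no tokens is treated as maximally uncertain.
        score = min(confidences) if confidences else 0.0
        scored.append((score, doc_id))
    scored.sort()  # lowest confidence first
    return [doc_id for _, doc_id in scored[:top_k]]
```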

This feedback mechanism is intended to let the system keep improving its accuracy on your specific document types as corrections accumulate.

Use Cases

DocCustomKIE has been successfully applied to various document types:

  • Medical records and prescriptions
  • Administrative forms and certificates
  • Financial documents and invoices
  • Legal contracts and agreements
  • Any structured or semi-structured documents requiring information extraction

Technical Stack

  • OCR: Tesseract with custom pre-processing pipeline
  • Model: Microsoft LayoutLMv3-base (multimodal transformer)
  • Framework: PyTorch with Hugging Face Transformers
  • Interface: Flask web application
  • Data Format: JSONL for annotations, supporting standard KIE datasets
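
For the JSONL annotation format, a record might be written along the lines of the sketch below. The field names (`tokens`, `bboxes`, `ner_tags`) are assumptions and may not match the project's files. The 0 to 1000 scale is the convention LayoutLM-family models expect for bounding boxes; that the project's "normalized bboxes" use exactly this scale is also an assumption.

```python
import json


def normalize_bbox(bbox, width, height):
    """Scale pixel coordinates (x0, y0, x1, y1) to the 0-1000 range."""
    x0, y0, x1, y1 = bbox
    return [
        int(1000 * x0 // width),
        int(1000 * y0 // height),
        int(1000 * x1 // width),
        int(1000 * y1 // height),
    ]


def to_jsonl_line(tokens, bboxes, tags, width, height):
    """Serialize one annotated page as a single JSON line."""
    record = {
        "tokens": tokens,
        "bboxes": [normalize_bbox(b, width, height) for b in bboxes],
        "ner_tags": tags,
    }
    return json.dumps(record, ensure_ascii=False)


print(to_jsonl_line(["John"], [(100, 50, 300, 100)], ["B-NAME"], 1000, 500))
```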

The system is designed to be lightweight and accessible, running efficiently on standard hardware without specialized GPU requirements.