DocCustomKIE | Camille Barboule

A Complete Pipeline for Document Information Extraction

DocCustomKIE is an end-to-end solution for extracting key information from any type of document. The project provides a complete workflow from data annotation to model deployment, requiring minimal data and no GPU for training or inference.

From left to right: intuitive annotation interface, automated training pipeline, and inference with extracted entities visualization.

Key Features

The pipeline combines the following features to make document processing accessible on modest hardware and small datasets:

📄 Minimal Data Requirements: About 10 annotated documents are typically enough for good performance
🎨 Intuitive Labeling Interface: Create custom labels for any information you need to extract
🔍 Advanced OCR: Text regions are detected with Sobel gradients before recognition
🔄 Automatic Data Augmentation: Generate labeled variants to strengthen model robustness
🤖 Lightweight Transformer Fine-tuning: Fine-tune Microsoft’s LayoutLMv3-base on CPU, no GPU required
🖥️ Modular Inference Interface: Visualize extracted information or customize the pipeline for your needs

Technical Approach

The system works in three stages:

System architecture showing the complete pipeline, and example of BIO tagging for multi-token entities.

1. Intelligent OCR with Block Detection

The system uses Sobel gradient analysis to detect text regions in a document. Each region is then upscaled 3x and sharpened before being passed to Tesseract OCR, which improves text recognition accuracy compared with running OCR on the whole image.
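
As a rough illustration of gradient-based region detection, here is a minimal pure-Python sketch on a tiny grayscale grid. The function names, the threshold value, and the grouping of edge rows into horizontal bands are assumptions made for this example; the project's actual implementation is not shown here and presumably relies on an image library and two-dimensional regions.

```python
def sobel_magnitude(img):
    """Return the Sobel gradient magnitude of a 2D grayscale grid (list of lists)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Horizontal and vertical 3x3 Sobel kernels.
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1])
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out


def detect_row_bands(img, threshold=100.0):
    """Group consecutive rows containing strong edges into (top, bottom) bands.

    Hypothetical simplification: real documents need 2D boxes, not just bands.
    """
    mag = sobel_magnitude(img)
    active = [any(v > threshold for v in row) for row in mag]
    bands, start = [], None
    for y, on in enumerate(active):
        if on and start is None:
            start = y
        elif not on and start is not None:
            bands.append((start, y - 1))
            start = None
    if start is not None:
        bands.append((start, len(active) - 1))
    return bands


def upscale_nearest(img, factor=3):
    """Nearest-neighbour upscale, standing in for the 3x resize before OCR."""
    return [[v for v in row for _ in range(factor)]
            for row in img for _ in range(factor)]
```

Because the kernel is 3x3, each detected band extends one row beyond the dark pixels on either side, which conveniently leaves a small margin around the text.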

2. BIO Tagging for Entity Recognition

The annotation interface automatically handles multi-token entities using the BIO (Beginning-Inside-Outside) tagging scheme. When you annotate “John Smith Ltd.” with a COMPANY label, the system generates:

B-COMPANY for “John”
I-COMPANY for “Smith”
I-COMPANY for “Ltd.”

This allows the model to learn complex entity boundaries and handle fragmented text naturally.
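
The tagging rule can be sketched in a few lines of Python. The function name `tag_sequence` and the `(start, end, label)` span format are hypothetical, chosen for this example rather than taken from annotate.py.

```python
def tag_sequence(tokens, spans):
    """Assign BIO tags to tokens.

    spans: list of (start, end_exclusive, label) over token indices
    (a hypothetical input format for this sketch).
    """
    tags = ["O"] * len(tokens)  # everything is Outside by default
    for start, end, label in spans:
        for i in range(start, end):
            # The first token of an entity gets B-, the rest get I-.
            tags[i] = ("B-" if i == start else "I-") + label
    return tags


tokens = ["Invoice", "from", "John", "Smith", "Ltd."]
tags = tag_sequence(tokens, [(2, 5, "COMPANY")])
# tags == ["O", "O", "B-COMPANY", "I-COMPANY", "I-COMPANY"]
```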

3. Data Augmentation Pipeline

After annotation, the system automatically generates augmented versions of your documents with various transformations:

Contrast and brightness adjustments
Gaussian noise addition
Slight rotations and perspective changes

The original annotations are preserved and transferred to each augmented variant using Levenshtein distance matching.
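
One way Levenshtein matching could carry labels over to an augmented document is sketched below: each token read from the augmented image takes the label of its closest original token, provided the normalized edit distance is small enough. The function names and the `max_ratio` threshold are assumptions for illustration. This sketch also ignores token positions and repeated words, which a real pipeline would likely need to handle.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def transfer_labels(original, augmented_tokens, max_ratio=0.34):
    """Give each augmented token the label of its nearest original token.

    original: list of (token, label) pairs from the annotated document.
    max_ratio: hypothetical cut-off on distance / longest length; tokens
    with no close match are labelled "O".
    """
    labels = []
    for tok in augmented_tokens:
        best_label, best_ratio = "O", None
        for src, label in original:
            dist = levenshtein(tok.lower(), src.lower())
            ratio = dist / max(len(tok), len(src), 1)
            if ratio <= max_ratio and (best_ratio is None or ratio < best_ratio):
                best_label, best_ratio = label, ratio
        labels.append(best_label)
    return labels
```

With this threshold, an OCR misread such as “J0hn” still matches “John” (one substitution out of four characters), while an unrelated token falls back to “O”.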

Workflow

annotate.py → layoutlmv3_ft.py → inference.py

Annotation Phase: Upload documents, run OCR, and annotate key information with custom labels
Training Phase: Fine-tune LayoutLMv3 on your annotated data (typically 2-4 hours on CPU)
Inference Phase: Process new documents and extract structured information
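
At inference time, token-level BIO predictions have to be merged back into whole entities before they can be shown as structured output. The following sketch shows one common way to do this; `decode_entities` and the output dictionary layout are hypothetical and not taken from inference.py.

```python
def decode_entities(tokens, tags):
    """Merge BIO-tagged tokens into a list of {"label", "text"} entities."""
    entities, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity still open.
            if current:
                entities.append(current)
            current = {"label": tag[2:], "text": tok}
        elif tag.startswith("I-") and current and current["label"] == tag[2:]:
            # Continuation of the entity currently being built.
            current["text"] += " " + tok
        else:
            # "O" tag, or an I- tag with no matching B- (dropped in this sketch).
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return entities
```

Dropping an I- tag that has no preceding B- tag is a design choice made here for simplicity; another reasonable option is to treat it as the start of a new entity.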

Future Development: Feedback Loop

The next major enhancement will introduce a feedback loop in the inference pipeline, allowing continuous improvement of the model through human corrections:

Planned feedback loop architecture: predictions → human corrections → model retraining with incremental learning.

Planned Features:

Interactive Correction Interface: Review and correct model predictions directly in the inference UI
Incremental Learning: Retrain the model on corrected data without forgetting previous knowledge
Active Learning: Prioritize uncertain predictions for human review
Version Control: Track model improvements over time with performance metrics

This feedback mechanism will let the system keep improving its accuracy on your specific document types as corrections accumulate.
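
Since the feedback loop is still planned, the following is only one possible way to prioritize uncertain predictions for review: score each document by the highest per-token entropy of its predicted label distribution and review the highest scores first. The function names and the input format are assumptions for this sketch.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one token's label probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def rank_for_review(documents):
    """Order document ids from most to least uncertain.

    documents: dict mapping a document id to a non-empty list of per-token
    probability distributions (hypothetical format for this sketch).
    """
    scores = {doc_id: max(token_entropy(p) for p in token_probs)
              for doc_id, token_probs in documents.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Using the maximum rather than the mean means a single very uncertain field is enough to surface a document, which suits extraction tasks where one wrong value can matter.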

Use Cases

DocCustomKIE has been successfully applied to various document types:

Medical records and prescriptions
Administrative forms and certificates
Financial documents and invoices
Legal contracts and agreements
Other structured or semi-structured documents requiring information extraction

Technical Stack

OCR: Tesseract with custom pre-processing pipeline
Model: Microsoft LayoutLMv3-base (multimodal transformer)
Framework: PyTorch with Hugging Face Transformers
Interface: Flask web application
Data Format: JSONL for annotations, supporting standard KIE datasets

The system is designed to be lightweight and accessible: it runs on standard hardware and does not require a GPU.
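
To make the JSONL annotation format listed in the stack more concrete, here is a sketch of how one annotated document could be serialized as a single line. The field names (`id`, `words`, `bboxes`, `ner_tags`) are hypothetical and may differ from the project's actual schema. The 0-1000 box scaling follows the convention expected by LayoutLM-family models in Hugging Face Transformers.

```python
import json


def normalize_bbox(bbox, width, height):
    """Scale a pixel box (x0, y0, x1, y1) to the 0-1000 range."""
    x0, y0, x1, y1 = bbox
    return [int(1000 * x0 / width), int(1000 * y0 / height),
            int(1000 * x1 / width), int(1000 * y1 / height)]


def to_jsonl_line(doc_id, words, boxes, tags, width, height):
    """Serialize one annotated page as a JSON string (one line of a JSONL file)."""
    record = {
        "id": doc_id,  # hypothetical field names throughout
        "words": words,
        "bboxes": [normalize_bbox(b, width, height) for b in boxes],
        "ner_tags": tags,
    }
    return json.dumps(record, ensure_ascii=False)


line = to_jsonl_line(
    "doc-001",
    ["John", "Smith"],
    [(100, 50, 300, 100), (320, 50, 500, 100)],
    ["B-COMPANY", "I-COMPANY"],
    width=1000, height=500,
)
```

Writing one such line per document keeps the file appendable, which is convenient when new annotations arrive incrementally.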