DocCustomKIE
An end-to-end pipeline for custom Key Information Extraction from documents
A Complete Pipeline for Document Information Extraction
DocCustomKIE is an end-to-end solution for extracting key information from any type of document. The project provides a complete workflow from data annotation to model deployment, requiring minimal data and no GPU for training or inference.



Key Features
The pipeline offers several innovative features that make document processing accessible and efficient:
- 📄 Minimal Data Requirements: Only about 10 documents needed for good performance
- 🎨 Intuitive Labeling Interface: Create custom labels for any information you need to extract
- 🔍 Advanced OCR: Intelligent region detection using Sobel gradients for optimal text recognition
- 🔄 Automatic Data Augmentation: Generate labeled variants to strengthen model robustness
- 🤖 Lightweight Transformer Fine-tuning: Adapt Microsoft’s LayoutLMv3-base without GPU requirements
- 🖥️ Modular Inference Interface: Visualize extracted information or customize the pipeline for your needs
Technical Approach
The system employs a sophisticated multi-stage process:
1. Intelligent OCR with Block Detection
The system uses Sobel gradient analysis to intelligently detect text regions in documents. Each region is then upscaled (3x) and sharpened before applying Tesseract OCR, resulting in significantly improved text recognition accuracy compared to standard whole-image OCR.
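The block-detection idea can be illustrated without any imaging dependencies. The snippet below is a minimal sketch, not the project's actual code: it assumes a grayscale image stored as a list of pixel rows, and the function names (`sobel_magnitude`, `detect_text_bands`, `upscale`) and the threshold value are hypothetical. It only locates horizontal bands with strong gradients and shows a nearest-neighbour 3x upscale; sharpening and the Tesseract call are left out.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale rows, values 0-255


def sobel_magnitude(img: Image) -> List[List[int]]:
    """Approximate gradient magnitude |Gx| + |Gy| using 3x3 Sobel kernels.

    Border pixels are left at 0 because the kernel does not fit there.
    """
    h, w = len(img), len(img[0])
    mag = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1])
            mag[y][x] = abs(gx) + abs(gy)
    return mag


def detect_text_bands(img: Image, threshold: int = 200) -> List[Tuple[int, int]]:
    """Return inclusive (top, bottom) row ranges that contain strong edges."""
    bands: List[Tuple[int, int]] = []
    start = None
    mag = sobel_magnitude(img)
    for y, row in enumerate(mag):
        active = any(v >= threshold for v in row)
        if active and start is None:
            start = y                      # a new band begins
        elif not active and start is not None:
            bands.append((start, y - 1))   # the band ended on the previous row
            start = None
    if start is not None:
        bands.append((start, len(mag) - 1))
    return bands


def upscale(img: Image, factor: int = 3) -> Image:
    """Nearest-neighbour upscaling, standing in for the 3x resize step."""
    return [[v for v in row for _ in range(factor)]
            for row in img for _ in range(factor)]
```

Each detected band would then be cropped, upscaled, sharpened, and passed to the OCR engine separately. A real implementation would more likely use an image-processing library for the gradient and resize steps.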
2. BIO Tagging for Entity Recognition
The annotation interface automatically handles multi-token entities using the BIO (Beginning-Inside-Outside) tagging scheme. When annotating “John Smith Ltd.”, the system generates:
- B-COMPANY for “John”
- I-COMPANY for “Smith”
- I-COMPANY for “Ltd.”
This allows the model to learn complex entity boundaries and handle fragmented text naturally.
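To make the tagging scheme concrete, here is a minimal sketch of how token-level BIO tags could be derived from annotated spans. The function name `bio_tags` and the span representation (start index, exclusive end index, label) are assumptions for illustration and are not taken from annotate.py.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start token index, end index exclusive, label)


def bio_tags(tokens: List[str], spans: List[Span]) -> List[str]:
    """Assign one BIO tag per token; tokens outside every span get "O"."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if start >= end:
            continue  # ignore empty spans
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags


tokens = ["Invoice", "from", "John", "Smith", "Ltd."]
print(list(zip(tokens, bio_tags(tokens, [(2, 5, "COMPANY")]))))
```

Keeping spans as token index ranges means a single user selection covering several OCR tokens maps directly to one B- tag followed by I- tags.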
3. Data Augmentation Pipeline
After annotation, the system automatically generates augmented versions of your documents with various transformations:
- Contrast and brightness adjustments
- Gaussian noise addition
- Slight rotations and perspective changes
The original annotations are preserved throughout: labels are transferred to each augmented variant by matching its OCR tokens to the original tokens using Levenshtein distance.
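The label-transfer step can be sketched as fuzzy token matching. The example below is an illustration under stated assumptions, not the project's implementation: `transfer_labels` and the `max_ratio` threshold are hypothetical, and matching is done on text alone, so repeated words with different labels would need positional information as well.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def transfer_labels(original: List[Tuple[str, str]],
                    augmented_tokens: List[str],
                    max_ratio: float = 0.3) -> List[str]:
    """Give each augmented token the label of its closest original token.

    A token stays "O" when even the best match differs by more than
    max_ratio of the longer string's length.
    """
    labels = []
    for tok in augmented_tokens:
        best_label, best_ratio = "O", 1.0
        for orig_tok, orig_label in original:
            longest = max(len(tok), len(orig_tok), 1)
            ratio = levenshtein(tok, orig_tok) / longest
            if ratio < best_ratio:
                best_label, best_ratio = orig_label, ratio
        labels.append(best_label if best_ratio <= max_ratio else "O")
    return labels
```

Normalizing the distance by token length keeps short tokens from matching too easily, since a single edit is a much larger change for a three-letter word than for a ten-letter one.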
Workflow
annotate.py → layoutlmv3_ft.py → inference.py
- Annotation Phase: Upload documents, run OCR, and annotate key information with custom labels
- Training Phase: Fine-tune LayoutLMv3 on your annotated data (typically 2-4 hours on CPU)
- Inference Phase: Process new documents and extract structured information
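Between annotation and training, the annotated data usually has to be converted into the inputs a LayoutLM-style token classifier expects: integer label ids and bounding boxes scaled to a 0-1000 coordinate range. The following is a minimal sketch assuming pixel boxes in (x0, y0, x1, y1) order; the helper names `build_label_maps` and `normalize_bbox` are hypothetical and are not taken from layoutlmv3_ft.py.

```python
from typing import Dict, List, Sequence, Tuple


def build_label_maps(entity_types: Sequence[str]) -> Tuple[Dict[str, int], Dict[int, str]]:
    """Create label2id / id2label maps for BIO tags, with "O" as id 0."""
    labels = ["O"]
    for name in entity_types:
        labels += [f"B-{name}", f"I-{name}"]
    label2id = {label: i for i, label in enumerate(labels)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label


def normalize_bbox(bbox: Sequence[int], width: int, height: int) -> List[int]:
    """Scale a pixel box to the 0-1000 range, clamping out-of-page values."""
    x0, y0, x1, y1 = bbox
    scaled = [1000 * x0 // width, 1000 * y0 // height,
              1000 * x1 // width, 1000 * y1 // height]
    return [min(1000, max(0, v)) for v in scaled]


label2id, id2label = build_label_maps(["COMPANY", "DATE"])
print(label2id)
print(normalize_bbox((50, 100, 150, 200), width=500, height=1000))
```

The resulting maps are the kind of `label2id` / `id2label` dictionaries that can be handed to a Hugging Face token-classification model configuration.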
Future Development: Feedback Loop
The next major enhancement will introduce a feedback loop in the inference pipeline, allowing continuous improvement of the model through human corrections.

Planned Features:
- Interactive Correction Interface: Review and correct model predictions directly in the inference UI
- Incremental Learning: Retrain the model on corrected data without forgetting previous knowledge
- Active Learning: Prioritize uncertain predictions for human review
- Version Control: Track model improvements over time with performance metrics
This feedback mechanism will enable the system to continuously improve its accuracy on your specific document types, creating a truly adaptive solution.
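Because these features are only planned, the snippet below is purely illustrative. It shows one common way the active-learning item could work: rank documents by the average entropy of their per-token class probabilities and send the most uncertain ones to a reviewer first. The function names and the input layout (a mapping from document id to a list of probability vectors) are assumptions.

```python
import math
from typing import Dict, List


def token_entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of one token's class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def rank_for_review(predictions: Dict[str, List[List[float]]]) -> List[str]:
    """Return document ids ordered from most to least uncertain."""
    def mean_entropy(token_probs: List[List[float]]) -> float:
        if not token_probs:
            return 0.0  # nothing predicted, nothing to review
        return sum(token_entropy(p) for p in token_probs) / len(token_probs)

    return sorted(predictions,
                  key=lambda doc_id: mean_entropy(predictions[doc_id]),
                  reverse=True)
```

Other uncertainty measures, such as the margin between the top two classes, would fit the same ranking function with a different scoring helper.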
Use Cases
DocCustomKIE has been successfully applied to various document types:
- Medical records and prescriptions
- Administrative forms and certificates
- Financial documents and invoices
- Legal contracts and agreements
- Any structured or semi-structured documents requiring information extraction
Technical Stack
- OCR: Tesseract with custom pre-processing pipeline
- Model: Microsoft LayoutLMv3-base (multimodal transformer)
- Framework: PyTorch with Hugging Face Transformers
- Interface: Flask web application
- Data Format: JSONL for annotations, supporting standard KIE datasets
The system is designed to be lightweight and accessible, running efficiently on standard hardware without specialized GPU requirements.
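As an illustration of the JSONL annotation format, the sketch below writes and reads one record per line. The field names (`id`, `tokens`, `bboxes`, `labels`) are an assumed schema chosen for the example; the project's actual keys may differ.

```python
import io
import json
from typing import Any, Dict, Iterable, List, TextIO

Record = Dict[str, Any]


def write_jsonl(records: Iterable[Record], fh: TextIO) -> None:
    """Write one JSON object per line."""
    for record in records:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(fh: TextIO) -> List[Record]:
    """Read records back, skipping blank lines."""
    return [json.loads(line) for line in fh if line.strip()]


# Hypothetical annotation record: parallel lists of tokens, boxes, and BIO tags.
example: Record = {
    "id": "doc-001",
    "tokens": ["John", "Smith", "Ltd."],
    "bboxes": [[100, 100, 180, 120], [190, 100, 280, 120], [290, 100, 340, 120]],
    "labels": ["B-COMPANY", "I-COMPANY", "I-COMPANY"],
}

buffer = io.StringIO()
write_jsonl([example], buffer)
buffer.seek(0)
print(read_jsonl(buffer)[0]["labels"])
```

One record per line keeps the file appendable, so new annotations can be added without rewriting the existing data.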