CV
You can find my CV here.
Basics
Name | Camille Barboule |
Label | Data Scientist |
Email | camille.barboule@gmail.com |
Summary | Data Scientist specialized in Natural Language Processing and Information Retrieval, interested in Explainable AI |
Work
-
2025.03 - Present Machine Learning Researcher
Orange Innovation
Exploration of example-based explainability approaches for LLMs, to analyze when the model uses its parametric knowledge vs its contextual knowledge when generating text: what happens when the context given to the LLM (within the prompt, as in RAG) contradicts the model's memorized knowledge?
- Explored training data attribution techniques (influence functions) applied to LLMs
- Explored context attribution techniques (saliency, attention-based)
- Explored mechanistic interpretability approaches
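The idea behind the saliency-style context attribution mentioned above can be illustrated with a toy, model-free sketch (the scorer and values here are hypothetical stand-ins, not the actual research code): each context token is scored by how strongly a small perturbation of that token changes the model's output.

```python
def toy_score(weights, tokens):
    """Stand-in for a model's output logit: a weighted sum over token values."""
    return sum(w * t for w, t in zip(weights, tokens))

def saliency(weights, tokens, eps=1e-5):
    """Finite-difference gradient magnitude per input token."""
    base = toy_score(weights, tokens)
    scores = []
    for i in range(len(tokens)):
        bumped = list(tokens)
        bumped[i] += eps
        scores.append(abs(toy_score(weights, bumped) - base) / eps)
    return scores

w = [0.1, 2.0, 0.3]
x = [1.0, 1.0, 1.0]
attr = saliency(w, x)
# The second "token" dominates the score, so it gets the largest attribution.
assert max(range(3), key=lambda i: attr[i]) == 1
```

In practice the perturbation/gradient is taken through the actual LLM rather than a toy scorer, but the attribution logic is the same.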
-
2024.05 - 2025.04 Machine Learning Researcher
Orange Innovation
Collaboration with ISIR (formerly LIP6) on a cutting-edge ML research project about visually-rich document understanding
- Published a comprehensive survey of state-of-the-art Visually-Rich Document Understanding methods, emphasizing their strengths and limitations and proposing promising research directions: https://arxiv.org/pdf/2501.02235
- Explored approaches to handle documents with LLMs: adding the 2D position of tokens (bounding boxes from PDFs, PPTs, or OCRized documents) to the attention scores of transformers with RoPE, to better understand tables and charts within documents
- Explored position encoding: techniques to mitigate positional bias (lost-in-the-middle effect, attention sink effect) in transformers, and to isolate position information within transformers to make this position explicit and manipulable
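The RoPE mechanism referenced above can be sketched in a few lines (a simplified illustration, not the project's actual code): rotating query/key dimension pairs by position-dependent angles makes the attention score depend only on the relative offset between two tokens.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotary position embedding: rotate consecutive dimension pairs
    of `vec` by angles that grow with the token position `pos`."""
    out = []
    d = len(vec)
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        x, y = vec[i], vec[i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.0, 0.5, 0.5]
k = [0.2, 0.9, 0.1, 0.4]

# The rotations are orthogonal, so the q.k score depends only on the
# relative offset between the two positions (here, 2 in both cases):
s1 = dot(rope_rotate(q, 3), rope_rotate(k, 1))
s2 = dot(rope_rotate(q, 10), rope_rotate(k, 8))
assert abs(s1 - s2) < 1e-8
```

Extending this to 2D layouts amounts to deriving the rotation angles from a token's bounding-box coordinates instead of its 1D sequence index.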
-
2022.09 - 2024.05 Data Scientist
Orange Innovation
Research project on LLM adaptation to the telecommunications industry, aiming to get a "small" (7B-parameter) model to perform as well as large LLMs on telecom use cases
- Explored best practices for fine-tuning LLMs for domain adaptation: data selection and preprocessing, continual pretraining (self-supervised fine-tuning on raw domain texts), instruction-tuning on telco instructions (QA, MCQs, summarization), the fine-tuning process itself (packing vs non-packing, loss computed on all tokens (auto-regressively) vs on output tokens only, ...), and parameter-efficient fine-tuning methods (LoRA vs QLoRA vs full fine-tuning)
- Implemented an end-to-end pipeline for LLM domain adaptation (from data collection and preprocessing to fine-tuning and evaluation)
- Trained models on an internal SLURM cluster using several parallelization methods
- Wrote a research paper about this work: https://arxiv.org/abs/2412.15891
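The parameter efficiency of LoRA mentioned above comes from learning a low-rank update instead of the full weight matrix. A minimal pure-Python sketch of the merge step (illustrative only, with made-up shapes and values):

```python
def matmul(A, B):
    """Naive dense matrix multiply for small nested-list matrices."""
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def lora_merge(W, A, B, alpha):
    """Merge a LoRA adapter into a frozen weight: W' = W + (alpha / r) * B @ A,
    where A is (r x d_in), B is (d_out x r), and r is the adapter rank."""
    r = len(A)
    BA = matmul(B, A)
    scale = alpha / r
    return [[W[i][j] + scale * BA[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# Rank-1 adapter on a 3x3 weight: 6 trainable numbers instead of 9.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[0.1, 0.2, 0.3]]        # r=1, d_in=3
B = [[1.0], [0.0], [2.0]]    # d_out=3, r=1
W_merged = lora_merge(W, A, B, alpha=1.0)
assert abs(W_merged[2][0] - 0.2) < 1e-12  # 0.0 + 1.0 * (2.0 * 0.1)
```

For realistic layer sizes the saving is dramatic: a rank-16 adapter on a 4096x4096 projection trains about 131k parameters instead of 16.8M.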
-
2021.04 - 2022.08 Machine Learning Engineer
Delpha.io
Developed large-scale string comparison algorithms and deployed ML solutions on AWS for client name deduplication and relationship detection in Salesforce environments.
- Optimized sparse matrix multiplication for large-scale string comparison
- AWS deployment using EMR, Lambda, S3, and CloudFormation
- Product owner role with Agile/SCRUM methodology
-
2020.01 - 2020.06 Leveraged Finance Intern
ING Bank
Financial analysis and LBO modeling during Covid-19 crisis, including participation in Credit Modification process for government-guaranteed loans.
- Built LBO models challenging management business plans
- Financial analysis including P&L, cashflow, and deleveraging profiles
-
2019.07 - 2019.12 Leveraged Finance Origination Intern
BNP Paribas
Pitched funds on potential LBOs and conducted comprehensive financial analysis of LBO targets.
- Financial analysis of LBO targets including industry research and risk assessment
- Built LBO models for various transaction types
-
2018.12 - 2019.03 Asset Management Intern
Aqua Asset Management
Conducted financial analysis using FactSet for asset management operations.
- Financial analysis using FactSet platform
Education
-
2020.08 - 2022.09 Brest, France
Engineering Diploma
IMT Atlantique
Engineering
- Machine Learning
- Deep Learning
- Advanced Mathematics
- Optimized Deep Learning
- Design of Communicating Objects
-
2019.04 - 2019.07 Bamberg, Germany
-
2018.09 - 2018.12 Cambridge, UK
-
2017.09 - 2021.07 Grenoble, France
-
2015.01 - 2017.09 Toulouse, France
Awards
- 2025.02.01
Winner of the Hackathon GenAI for Public Good
DINUM
Organised by the DINUM, this hackathon aimed at using GenAI for Document Understanding on public use cases (criminal-record understanding to help magistrates handle their cases). The documents were scanned and contained images. The first task was question answering over documents, so we used ColPali as a retriever to fetch the page(s) relevant to the prompt. The second task was document summarization, so we fine-tuned GOT to convert the document into structured text, then passed that text to the Qwen LLM with a 1M-token context window.
Publications
-
2025.01.01 Visually-Rich Document Understanding Methods: A Comprehensive Survey
arXiv
Comprehensive overview of state-of-the-art approaches in visually-rich document understanding, emphasizing strengths and limitations while proposing promising research directions.
-
2024.12.01 LLM Domain Adaptation for Telecommunications
arXiv
Research paper on best practices for fine-tuning LLMs for domain adaptation in the telecommunications industry.
Skills
Programming | |
Python | |
dependency managers (uv, poetry) | |
code quality checkers (mypy, ruff, black) | |
static typing in Python: mypy | |
virtualenv | |
code documentation | |
Java |
Machine Learning & AI | |
PyTorch | |
Transformers | |
Scikit-Learn | |
TensorFlow | |
LangChain | |
MCP | |
vLLM | |
llama.cpp | |
quantization to integer precision & to binary | |
pruning (structured & unstructured) | |
distillation (response & feature-based) |
GPU & Cloud Infrastructure | |
SLURM | |
AWS EC2, EMR, Lambda, SageMaker | |
GCP Vertex AI, Cloud Functions, Cloud Run | |
Cloud storage systems: AWS S3 & GCS | |
CLI & SDK: AWS CLI & Python boto3, AWS CloudFormation, Google gcloud, Terraform |
Finance | |
LBO Modeling | |
Financial Analysis | |
FactSet |
Languages
French | |
Native speaker |
English | |
C2 |
German | |
B1 |
Interests
Machine Learning Research | |
Document Understanding | |
LLM Adaptation | |
Explainable AI | |
Position Encoding | |
Information Retrieval | |
Natural Language Processing |
Dancing | |
Rock | |
Salsa |
Projects
- 2023.09 - 2024.04
LLM Adaptation for Telecommunications
Research project developing a pipeline for domain-specific adaptation of large language models to the telecommunications industry, achieving performance comparable to larger models on telecom use cases.
- Explored continual pretraining and instruction-tuning techniques
- Implemented parameter-efficient fine-tuning methods
- Trained models using SLURM cluster with parallelization
- 2024.09 - 2025.03
Visually-Rich Document Understanding
Comprehensive research on state-of-the-art methods for understanding documents with complex layouts, including tables, charts, and mixed text-image content.
- Published comprehensive survey paper
- Developed novel position encoding techniques for transformers
- Research on handling 2D document layouts with LLMs
- 2021.04 - 2022.08
Large-Scale String Comparison Algorithm
Developed optimized sparse matrix multiplication algorithm for large-scale client name comparison and deduplication in Salesforce environments.
- Vectorization and cosine similarity optimization
- AWS deployment with full MLOps pipeline
- Detection of duplicate and related client names
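The core idea of the comparison algorithm above can be sketched with character trigrams and cosine similarity (a simplified, stdlib-only illustration; the production version vectorized names and used optimized sparse matrix multiplication instead of pairwise loops):

```python
import math
from collections import Counter

def trigrams(name):
    """Character-trigram counts of a normalized name, padded with spaces."""
    s = f"  {name.lower().strip()} "
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def cosine(c1, c2):
    """Cosine similarity between two sparse trigram-count vectors."""
    num = sum(c1[g] * c2[g] for g in c1)
    den = (math.sqrt(sum(v * v for v in c1.values()))
           * math.sqrt(sum(v * v for v in c2.values())))
    return num / den if den else 0.0

# Near-duplicate client names score far higher than unrelated ones:
assert cosine(trigrams("Orange S.A."), trigrams("ORANGE SA")) > 0.6
assert cosine(trigrams("Orange S.A."), trigrams("BNP Paribas")) < 0.2
```

Representing every name as a sparse trigram vector turns deduplication into one sparse matrix product, which scales to millions of Salesforce records where a pairwise Python loop would not.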