CV
You can find my CV here.
Basics
Name | Camille Barboule |
Label | Data Scientist |
Email | camille.barboule@gmail.com |
Summary | Data Scientist specialized in Natural Language Processing and Information Retrieval, interested in Explainable AI |
Work
-
2025.03 - Present Machine Learning Researcher
Orange Innovation
Exploration of example-based explainability approaches for LLMs, to analyze when the model uses its parametric knowledge vs its contextual knowledge when generating text: what happens when the context given to the LLM (within the prompt, as in RAG) contradicts the model's memorized knowledge?
- Explored training data attribution techniques (influence functions) applied to LLMs
- Explored context attribution techniques (saliency, attention-based)
- Explored mechanistic interpretability approaches
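The idea behind the saliency-style context attribution mentioned above can be illustrated with a toy, model-free sketch (the scorer and values here are hypothetical stand-ins, not the actual research code): each context token is scored by how strongly a small perturbation of that token changes the model's output.

```python
def toy_score(weights, tokens):
    """Stand-in for a model's output logit: a weighted sum over token values."""
    return sum(w * t for w, t in zip(weights, tokens))

def saliency(weights, tokens, eps=1e-5):
    """Finite-difference gradient magnitude per input token."""
    base = toy_score(weights, tokens)
    scores = []
    for i in range(len(tokens)):
        bumped = list(tokens)
        bumped[i] += eps
        scores.append(abs(toy_score(weights, bumped) - base) / eps)
    return scores

w = [0.1, 2.0, 0.3]
x = [1.0, 1.0, 1.0]
attr = saliency(w, x)
# The second "token" dominates the score, so it gets the largest attribution.
assert max(range(3), key=lambda i: attr[i]) == 1
```

In practice the perturbation/gradient is taken through the actual LLM rather than a toy scorer, but the attribution logic is the same.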
-
2024.05 - 2025.04 Machine Learning Researcher
Orange Innovation
Collaboration with ISIR (formerly LIP6) on a cutting-edge ML research project about visually-rich document understanding
- Published a comprehensive survey of state-of-the-art Visually-Rich Document Understanding methods, emphasizing their strengths and limitations and proposing promising research directions: https://arxiv.org/pdf/2501.02235
- Explored approaches to handle documents with LLMs: adding the 2D position of tokens (bounding boxes from PDFs, PPTs, or OCRized documents) to the attention scores of transformers with RoPE, to better understand tables and charts within documents
- Explored position encoding: techniques to mitigate positional bias (lost-in-the-middle effect, attention sink effect) in transformers, and to isolate position information within transformers to make this position explicit and manipulable
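The RoPE mechanism referenced above can be sketched in a few lines (a simplified illustration, not the project's actual code): rotating query/key dimension pairs by position-dependent angles makes the attention score depend only on the relative offset between two tokens.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotary position embedding: rotate consecutive dimension pairs
    of `vec` by angles that grow with the token position `pos`."""
    out = []
    d = len(vec)
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        x, y = vec[i], vec[i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.0, 0.5, 0.5]
k = [0.2, 0.9, 0.1, 0.4]

# The rotations are orthogonal, so the q.k score depends only on the
# relative offset between the two positions (here, 2 in both cases):
s1 = dot(rope_rotate(q, 3), rope_rotate(k, 1))
s2 = dot(rope_rotate(q, 10), rope_rotate(k, 8))
assert abs(s1 - s2) < 1e-8
```

Extending this to 2D layouts amounts to deriving the rotation angles from a token's bounding-box coordinates instead of its 1D sequence index.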
-
2022.09 - 2024.05 Data Scientist
Orange Innovation
Research project on LLM adaptation to the telecommunications industry, aiming to get a "small" (7B-parameter) model to perform as well as large LLMs on telecom use cases
- Explored best practices for fine-tuning LLMs for domain adaptation: data selection and preprocessing, continual pretraining (self-supervised fine-tuning on raw domain texts), instruction-tuning on telco instructions (QA, MCQs, summarization), the fine-tuning process itself (packing vs non-packing, loss computed on all tokens (auto-regressively) vs on output tokens only, ...), and parameter-efficient fine-tuning methods (LoRA vs QLoRA vs full fine-tuning)
- Implemented an end-to-end pipeline for LLM domain adaptation (from data collection and preprocessing to fine-tuning and evaluation)
- Trained models on an internal SLURM cluster using several parallelization methods
- Wrote a research paper about this work: https://arxiv.org/abs/2412.15891
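The parameter efficiency of LoRA mentioned above comes from learning a low-rank update instead of the full weight matrix. A minimal pure-Python sketch of the merge step (illustrative only, with made-up shapes and values):

```python
def matmul(A, B):
    """Naive dense matrix multiply for small nested-list matrices."""
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def lora_merge(W, A, B, alpha):
    """Merge a LoRA adapter into a frozen weight: W' = W + (alpha / r) * B @ A,
    where A is (r x d_in), B is (d_out x r), and r is the adapter rank."""
    r = len(A)
    BA = matmul(B, A)
    scale = alpha / r
    return [[W[i][j] + scale * BA[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# Rank-1 adapter on a 3x3 weight: 6 trainable numbers instead of 9.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[0.1, 0.2, 0.3]]        # r=1, d_in=3
B = [[1.0], [0.0], [2.0]]    # d_out=3, r=1
W_merged = lora_merge(W, A, B, alpha=1.0)
assert abs(W_merged[2][0] - 0.2) < 1e-12  # 0.0 + 1.0 * (2.0 * 0.1)
```

For realistic layer sizes the saving is dramatic: a rank-16 adapter on a 4096x4096 projection trains about 131k parameters instead of 16.8M.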
-
2021.04 - 2022.08 Machine Learning Engineer
Delpha.io
Developed large-scale string comparison algorithms and deployed ML solutions on AWS for client name deduplication and relationship detection in Salesforce environments.
- Optimized sparse matrix multiplication for large-scale string comparison
- AWS deployment using EMR, Lambda, S3, and CloudFormation
- Product owner role with Agile/SCRUM methodology
-
2020.01 - 2020.06 Leveraged Finance Intern
ING Bank
Financial analysis and LBO modeling during Covid-19 crisis, including participation in Credit Modification process for government-guaranteed loans.
- Built LBO models challenging management business plans
- Financial analysis including P&L, cashflow, and deleveraging profiles
-
2019.07 - 2019.12 Leveraged Finance Origination Intern
BNP Paribas
Pitched funds on potential LBOs and conducted comprehensive financial analysis of LBO targets.
- Financial analysis of LBO targets including industry research and risk assessment
- Built LBO models for various transaction types
-
2018.12 - 2019.03 Asset Management Intern
Aqua Asset Management
Conducted financial analysis using FactSet for asset management operations.
- Financial analysis using FactSet platform
Education
-
2020.08 - 2022.09 Brest, France
Engineering Diploma
IMT Atlantique
Engineering
- Machine Learning
- Deep Learning
- Advanced Mathematics
- Optimized Deep Learning
- Design of Communicating Objects
-
2019.04 - 2019.07 Bamberg, Germany
-
2018.09 - 2018.12 Cambridge, UK
-
2017.09 - 2021.07 Grenoble, France
-
2015.01 - 2017.09 Toulouse, France
Awards
- 2025.02.01
Winner of the Hackathon GenAI for Public Good
DINUM
Organised by the DINUM, this hackathon aimed at using GenAI for Document Understanding on public use cases (criminal-record understanding to help magistrates handle their cases). The documents were scanned and contained images. The first task was question answering over documents, so we used ColPali as a retriever to fetch the page(s) relevant to the prompt. The second task was document summarization, so we fine-tuned GOT to convert the document into structured text, then passed that text to the Qwen LLM with a 1M-token context window.
Publications
-
2025.01.01 Visually-Rich Document Understanding Methods: A Comprehensive Survey
arXiv
Comprehensive overview of state-of-the-art approaches in visually-rich document understanding, emphasizing strengths and limitations while proposing promising research directions.
-
2024.12.01 LLM Domain Adaptation for Telecommunications
arXiv
Research paper on best practices for fine-tuning LLMs for domain adaptation in the telecommunications industry.
Skills
Programming | |
Python | |
dependency managers (uv, poetry) | |
code quality checkers (mypy, ruff, black) | |
static typing in Python: mypy | |
virtualenv | |
code documentation | |
Java |
Machine Learning & AI | |
PyTorch | |
Transformers | |
Scikit-Learn | |
TensorFlow | |
LangChain | |
MCP | |
vLLM | |
llama.cpp | |
quantization to integer precision & to binary | |
pruning (structured & unstructured) | |
distillation (response & feature-based) |
GPU & Cloud Infrastructure | |
SLURM | |
AWS EC2, EMR, Lambda, SageMaker | |
GCP Vertex AI, Cloud Functions, Cloud Run | |
Cloud storage systems: AWS S3 & GCS | |
CLI & SDK: AWS CLI & Python boto3, AWS CloudFormation, Google gcloud, Terraform |
Finance | |
LBO Modeling | |
Financial Analysis | |
FactSet |
Languages
French | |
Native speaker |
English | |
C2 |
German | |
B1 |
Interests
Machine Learning Research | |
Document Understanding | |
LLM Adaptation | |
Explainable AI | |
Position Encoding | |
Information Retrieval | |
Natural Language Processing |
Dancing | |
Rock | |
Salsa |
Projects
- 2023.09 - 2024.04
LLM Adaptation for Telecommunications
Research project developing a pipeline for domain-specific adaptation of large language models to the telecommunications industry, achieving performance comparable to larger models on telecom use cases.
- Explored continual pretraining and instruction-tuning techniques
- Implemented parameter-efficient fine-tuning methods
- Trained models using SLURM cluster with parallelization
- 2024.09 - 2025.03
Visually-Rich Document Understanding
Comprehensive research on state-of-the-art methods for understanding documents with complex layouts, including tables, charts, and mixed text-image content.
- Published comprehensive survey paper
- Developed novel position encoding techniques for transformers
- Research on handling 2D document layouts with LLMs
- 2021.04 - 2022.08
Large-Scale String Comparison Algorithm
Developed optimized sparse matrix multiplication algorithm for large-scale client name comparison and deduplication in Salesforce environments.
- Vectorization and cosine similarity optimization
- AWS deployment with full MLOps pipeline
- Detection of duplicate and related client names
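The core idea of the comparison algorithm above can be sketched with character trigrams and cosine similarity (a simplified, stdlib-only illustration; the production version vectorized names and used optimized sparse matrix multiplication instead of pairwise loops):

```python
import math
from collections import Counter

def trigrams(name):
    """Character-trigram counts of a normalized name, padded with spaces."""
    s = f"  {name.lower().strip()} "
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def cosine(c1, c2):
    """Cosine similarity between two sparse trigram-count vectors."""
    num = sum(c1[g] * c2[g] for g in c1)
    den = (math.sqrt(sum(v * v for v in c1.values()))
           * math.sqrt(sum(v * v for v in c2.values())))
    return num / den if den else 0.0

# Near-duplicate client names score far higher than unrelated ones:
assert cosine(trigrams("Orange S.A."), trigrams("ORANGE SA")) > 0.6
assert cosine(trigrams("Orange S.A."), trigrams("BNP Paribas")) < 0.2
```

Representing every name as a sparse trigram vector turns deduplication into one sparse matrix product, which scales to millions of Salesforce records where a pairwise Python loop would not.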