cv

You can find my CV here.

Basics

Name Camille Barboule
Label Data Scientist
Email camille.barboule@gmail.com
Summary Data Scientist specialized in Natural Language Processing and Information Retrieval, interested in Explainable AI

Work

  • 2025.03 - Present
    Machine Learning Researcher
    Orange Innovation
    Exploration of example-based explainability approaches for LLMs, to analyze when the model relies on its parametric knowledge vs its contextual knowledge when generating text: what happens when the context given to the LLM (within the prompt, cf. RAG) contradicts the model's memorized knowledge?
    • Explored training data attribution techniques (influence functions) applied to LLMs
    • Explored context attribution techniques (saliency, attention-based)
    • Explored mechanistic interpretability approaches
  • 2025.03 - Present
    Machine Learning Engineer
    Orange Innovation
    Implementing an MCP server for RAG over visually-rich documents
  • 2024.05 - 2025.04
    Machine Learning Researcher
    Orange Innovation
    Collaboration with ISIR (former LIP6) on a cutting-edge ML research project on visually-rich document understanding
    • Published research on Visually-Rich Document Understanding methods, giving a comprehensive overview of state-of-the-art approaches, emphasizing their strengths and limitations, and proposing promising research directions: https://arxiv.org/pdf/2501.02235
    • Explored approaches to handling documents with LLMs: adding the 2D position of tokens (bounding boxes from PDFs, PPTs, or OCR-processed documents) to the attention scores of transformers with RoPE, to better understand tables and charts within documents
    • Explored position encoding: techniques to mitigate positional bias (lost-in-the-middle effect, attention sink effect) in transformers, and to isolate position information within transformers to make this position explicit and manipulable
  • 2022.09 - 2024.05
    Data Scientist
    Orange Innovation
    Research project on adapting LLMs to the telecommunications industry, aiming for a "small" (7B-parameter) model to perform as well as large LLMs on telecom use cases
    • Explored best practices for fine-tuning LLMs for domain adaptation: data and preprocessing, continual pretraining (self-supervised fine-tuning on raw domain texts), instruction-tuning on telco instructions (QA, MCQs, summarization), the fine-tuning process itself (packing vs non-packing, loss computed on all tokens (auto-regressive way) vs on output tokens only, ...), and parameter-efficient fine-tuning methods (LoRA vs QLoRA vs full fine-tuning)
    • Implemented a pipeline for domain adaptation of LLMs (from data collection and preprocessing to fine-tuning and evaluation of results)
    • Trained models on an internal SLURM cluster using several parallelization methods
    • Wrote a research paper about this work: https://arxiv.org/abs/2412.15891
  • 2021.04 - 2022.08
    Machine Learning Engineer
    Delpha.io
    Developed large-scale string comparison algorithms and deployed ML solutions on AWS for client name deduplication and relationship detection in Salesforce environments.
    • Optimized sparse matrix multiplication for large-scale string comparison
    • AWS deployment using EMR, Lambda, S3, and CloudFormation
    • Product owner role with Agile/SCRUM methodology
  • 2020.01 - 2020.06
    Leveraged Finance Intern
    ING Bank
    Financial analysis and LBO modeling during the Covid-19 crisis, including participation in the Credit Modification process for government-guaranteed loans.
    • Built LBO models challenging management business plans
    • Financial analysis including P&L, cashflow, and deleveraging profiles
  • 2019.07 - 2019.12
    Leveraged Finance Origination Intern
    BNP Paribas
    Pitched funds on potential LBOs and conducted comprehensive financial analysis of LBO targets.
    • Financial analysis of LBO targets including industry research and risk assessment
    • Built LBO models for various transaction types
  • 2018.12 - 2019.03
    Asset Management Intern
    Aqua Asset Management
    Conducted financial analysis using FactSet for asset management operations.
    • Financial analysis using FactSet platform

Education

  • 2020.08 - 2022.09

    Brest, France

    Engineering Diploma
    IMT Atlantique
    Engineering
    • Machine Learning
    • Deep Learning
    • Advanced Mathematics
    • Optimized Deep Learning
    • Design of Communicating Objects
  • 2019.04 - 2019.07

    Bamberg, Germany

    Erasmus Program
    Otto-Friedrich-Universität Bamberg
    Finance and Economics
    • Finance
    • Economics
  • 2018.09 - 2018.12

    Cambridge, UK

    Exchange Program
    University of Cambridge
    Finance and Economics
    • Finance
    • Economics
  • 2017.09 - 2021.07

    Grenoble, France

    Grande École Diploma
    Grenoble École de Management
    Management
    • Finance
    • Economics
  • 2015.01 - 2017.09

    Toulouse, France

    Classe Préparatoire ECS
    Pierre de Fermat
    Economics and Commerce

Awards

  • 2025.02.01
    Winner of the GenAI for Public Good Hackathon
    DINUM
    Organised by the DINUM, this hackathon aimed at using GenAI for document understanding on a public-sector use case (criminal record understanding, to help magistrates handle their cases). Documents were scanned and contained images. The first task was QA on documents, so we used ColPali as a retriever to get the page(s) relevant to the prompt. The second task was document summarization, so we fine-tuned GOT to convert the document into structured text, then passed it to the Qwen LLM with a 1M context window.

Publications

Skills

Programming
Python
Dependency managers (uv, poetry)
Code quality checkers (mypy, ruff, black)
Static typing in Python: mypy
virtualenv
code documentation
Java
Machine Learning & AI
PyTorch
Transformers
Scikit-Learn
TensorFlow
LangChain
MCP
vLLM
llama.cpp
quantization to integer precision & to binary
pruning (structured & unstructured)
distillation (response & feature-based)
GPU & Cloud Infrastructure
SLURM
AWS EC2, EMR, Lambda, SageMaker
GCP Vertex AI, Cloud Functions, Cloud Run
Cloud storage systems: AWS S3 & GCS
CLI & SDK: AWS CLI, boto3 (Python), AWS CloudFormation, Google gcloud, Terraform
Finance
LBO Modeling
Financial Analysis
FactSet

Languages

French
Native speaker
English
C2
German
B1

Interests

Machine Learning Research
Document Understanding
LLM Adaptation
Explainable AI
Position Encoding
Information Retrieval
Natural Language Processing
Dancing
Rock
Salsa

Projects

  • 2023.09 - 2024.04
    LLM Adaptation for Telecommunications
    Research project developing a pipeline for domain-specific adaptation of large language models to the telecommunications industry, achieving performance comparable to larger models on telecom use cases.
    • Explored continual pretraining and instruction-tuning techniques
    • Implemented parameter-efficient fine-tuning methods
    • Trained models on a SLURM cluster with parallelization
  • 2024.09 - 2025.03
    Visually-Rich Document Understanding
    Comprehensive research on state-of-the-art methods for understanding documents with complex layouts, including tables, charts, and mixed text-image content.
    • Published comprehensive survey paper
    • Developed novel position encoding techniques for transformers
    • Research on handling 2D document layouts with LLMs
  • 2021.04 - 2022.08
    Large-Scale String Comparison Algorithm
    Developed optimized sparse matrix multiplication algorithm for large-scale client name comparison and deduplication in Salesforce environments.
    • Vectorization and cosine similarity optimization
    • AWS deployment with full MLOps pipeline
    • Detection of duplicate and related client names