RAGGAE: A multipurpose local RAG system for Adservio

    Retrieval-Augmented Generation Generalized Architecture for Enterprise

Olivier Vitrac, PhD, HDR | olivier.vitrac@adservio.fr – 2025-11-05

    Summary

This early note discusses the design of a generic RAG/embeddings library that can serve CVs, reports, and tenders, relying on per-document-type adapters over a shared semantic core (retrieval + re-rank + annotation + scoring). Hybrid retrieval (dense + sparse) followed by a cross-encoder re-ranker is proposed. The POC adds domain tuning and natural language inference (NLI) checks, and is designed from day one for traceability (provenance spans, scores, reasons). The whole system is designed to run on minimal infrastructure: a fully local MVP on a GPU with 8 GB of VRAM, possibly even on CPU.

The RAGGAE project is now mature and available as an Adservio GitHub project. All details are in README.md. The POC can be launched as: uvicorn RAGGAE.cli.demo_app:app --host 0.0.0.0 --port 8000 --reload

For access to all files, read this file in PDF.


    1 | Technical Review

    1.1 | Embedding options (and when to use which)

    A. Dense text embeddings (bi-encoders) — default for RAG

    B. Cross-encoders (re-rankers) — for precision at the top

    C. Hybrid retrieval (dense + sparse) — when vocabulary matters

    D. Domain-tuned embeddings — when your domain dominates

    E. Multilingual & French

    F. Long-document strategies (tenders/CVs/reports)
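
To make options A–C concrete, here is a minimal sketch (not the RAGGAE API; model names are illustrative choices consistent with the shortlist below): a dense bi-encoder retrieves broadly and cheaply, and a cross-encoder re-scores the top hits for precision.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# A. Dense bi-encoder: embed query and passages independently (fast, scalable).
# E5 models expect "query: " / "passage: " prefixes.
bi = SentenceTransformer("intfloat/multilingual-e5-small")
passages = ["passage: The supplier must provide ISO 27001 certification.",
            "passage: Invoices are payable within 60 days."]
q = "query: security certification requirements"
dense_scores = util.cos_sim(bi.encode(q), bi.encode(passages))[0]

# B. Cross-encoder: score (query, passage) pairs jointly (slow, precise),
# applied only to the top-k candidates from step A.
ce = CrossEncoder("cross-encoder/mmarco-mMiniLMv2-L12-H384-v1")
rerank_scores = ce.predict([("security certification requirements", p) for p in passages])
print(dense_scores.tolist(), rerank_scores.tolist())
```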


    1.2 | “Semantic analysis” we’ll want beyond embeddings

We think of these as signals layered on top of retrieval: they feed the scoring/ranking (fit, risk, attractiveness) and, later, the form pre-fill.


    1.3 | Can one library handle CVs, reports, tenders? (Yes—if you design it right)

Design a document-agnostic semantic layer with adapters (a code sketch is given in section 6):

    Result: same embedding/retrieval engine, different adapters and scoring logic.


    1.4 | Minimal technical blueprint


    1.5 | Choosing an embedding setup (quick decision guide)


    Absolutely feasible locally: E5-small + BM25 + optional cross-encoder, FAISS index, Ollama (7–8B Q4) for NLI/extraction.

One generic library with adapters lets you handle tenders, CVs, and reports with the same semantic core.

1.6 | Ranking & classification for tenders


    1.7 | Toward pre-filling response forms (step 2)


1.8 | Evaluation of the prototype from day 1


    1.9 | Practical shortlist (safe bets to prototype)


    Bottom line



    2 | Local MVP stack (FR/EN tenders, CVs, reports)

    2.1 | Retrieval (dense)

    2.2 | Re-ranking (cross-encoder)

    2.3 | Sparse retrieval (for jargon & exact clauses)

    2.4 | Vector store

    2.5 | Parsers & chunking

2.6 | Local NLI/extraction (for “does this clause match?” and pre-fill)


    3 | Minimal pipeline (drop-in code)
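
A minimal sketch consistent with the stack in section 2 (the exact original snippet lives in the repo): E5-small dense embeddings in FAISS plus BM25 for sparse matching, blended into one hybrid search that keeps a chunk-level provenance handle. Parsing/chunking is reduced to a toy list; the blending weight alpha is illustrative.

```python
import faiss
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

# Toy corpus standing in for parsed & chunked documents.
chunks = [
    "The bidder shall hold ISO 27001 certification.",
    "Delivery is expected within 12 weeks of signature.",
    "Penalties apply for late delivery beyond 30 days.",
]

# Dense side: normalized E5 embeddings + FAISS inner product (= cosine).
model = SentenceTransformer("intfloat/multilingual-e5-small")
emb = model.encode([f"passage: {c}" for c in chunks], normalize_embeddings=True)
index = faiss.IndexFlatIP(emb.shape[1])
index.add(np.asarray(emb, dtype="float32"))

# Sparse side: BM25 over lowercase whitespace tokens (jargon & exact clauses).
bm25 = BM25Okapi([c.lower().split() for c in chunks])

def hybrid_search(query: str, k: int = 3, alpha: float = 0.6):
    """Blend dense and sparse scores; alpha weights the dense side."""
    qv = model.encode([f"query: {query}"], normalize_embeddings=True)
    dense, ids = index.search(np.asarray(qv, dtype="float32"), len(chunks))
    blend = {int(i): alpha * float(d) for d, i in zip(dense[0], ids[0])}
    sparse = bm25.get_scores(query.lower().split())
    sparse = sparse / (sparse.max() or 1.0)  # crude normalization
    for i, s in enumerate(sparse):
        blend[i] = blend.get(i, 0.0) + (1 - alpha) * float(s)
    best = sorted(blend.items(), key=lambda kv: -kv[1])[:k]
    # keep the chunk id as a provenance handle
    return [(chunks[i], score, i) for i, score in best]

for text, score, cid in hybrid_search("certification requirements"):
    print(f"[chunk {cid}] {score:.3f} {text}")
```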

    Swap in a cross-encoder re-ranker later (e.g., jinaai/jina-reranker-v1-base-multilingual) on the hits[:100] to boost precision@5.
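
A sketch of that swap, applied to the hits returned by hybrid_search() above (a multilingual MS MARCO cross-encoder is used here as a stand-in; any of the re-rankers from section 2.2 plugs in the same way):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/mmarco-mMiniLMv2-L12-H384-v1")

def rerank(query, hits, top_n=5):
    """Re-score (text, score, chunk_id) hits jointly with the query."""
    scores = reranker.predict([(query, text) for text, _, _ in hits])
    order = sorted(zip(scores, hits), key=lambda p: -p[0])
    return [(text, float(s), cid) for s, (text, _, cid) in order[:top_n]]
```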


    4 | Using Ollama locally (NLI/extraction)
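
A minimal sketch of the call, assuming a local Ollama server on its default port (11434) and a 7–8B instruct model already pulled; the model name and prompt are illustrative, not the POC's exact ones:

```python
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

prompt = """You are an NLI checker. Clause: "The bidder shall hold ISO 27001."
Requirement: "Information-security certification is mandatory."
Answer in JSON with keys verdict (entail/contradict/neutral) and reason."""

resp = requests.post(OLLAMA_URL, json={
    "model": "mistral",   # any 7-8B Q4 model pulled into Ollama
    "prompt": prompt,
    "format": "json",     # ask Ollama to constrain output to valid JSON
    "stream": False,
}, timeout=120)
print(json.loads(resp.json()["response"]))
```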

Example response after NLI-style extraction (illustrative; the actual output depends on the model):

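```json
{"verdict": "entail", "reason": "ISO 27001 certification satisfies the information-security requirement."}
```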


    5 | What fits in 8 GB VRAM (comfortably)


    6 | Can the same lib read CVs, reports, tenders? Yes — via adapters

Keep a shared semantic core and add thin adapters: all three document types reuse the same parser → chunker → embeddings → FAISS/BM25 → (optional) reranker → scorers.
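
A sketch of the adapter seam (class and field names are illustrative, not RAGGAE's actual API):

```python
import re
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Chunk:
    text: str
    source: str            # file path, kept for provenance
    span: tuple[int, int]  # character offsets, kept for traceability

class DocumentAdapter(Protocol):
    """Per-document-type layer; the semantic core stays identical."""
    def parse(self, path: str) -> list[Chunk]: ...
    def score(self, hits: list[tuple[Chunk, float]]) -> dict: ...

class TenderAdapter:
    """Example adapter: splits a tender into numbered clauses."""
    CLAUSE = re.compile(r"(?m)^\s*\d+(\.\d+)*\s")

    def parse(self, path: str) -> list[Chunk]:
        text = open(path, encoding="utf-8").read()
        starts = [m.start() for m in self.CLAUSE.finditer(text)] + [len(text)]
        return [Chunk(text[a:b].strip(), path, (a, b))
                for a, b in zip(starts, starts[1:])]

    def score(self, hits: list[tuple[Chunk, float]]) -> dict:
        # tender-specific fit/risk aggregation; CVs and reports plug in their own
        return {"fit": sum(s for _, s in hits) / max(len(hits), 1)}
```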


7 | Folder scaffold (ready to uv/pip)
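
One plausible layout, inferred from the demo module path (RAGGAE.cli.demo_app) and the adapter design above; the authoritative tree is in the repository:

```
RAGGAE/
├── core/              # parser, chunker, embeddings, FAISS/BM25 index, reranker
├── adapters/          # tenders.py, cvs.py, reports.py (thin, per document type)
├── scoring/           # fit / risk / attractiveness scorers
├── cli/
│   └── demo_app.py    # FastAPI demo: uvicorn RAGGAE.cli.demo_app:app
└── tests/             # smoke tests (section 9)
```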


    8 | Early-phase eval (so you can show value next week)
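
Even a handful of hand-labeled query → relevant-chunk pairs is enough to start tracking retrieval quality. A sketch of precision@k / recall@k over such a set (the labeled-data shape is an assumption; `search` is any function with the hybrid_search() signature from section 3):

```python
def precision_recall_at_k(search, labeled: dict[str, set[int]], k: int = 5):
    """labeled maps each query to the set of relevant chunk ids."""
    p, r = [], []
    for query, relevant in labeled.items():
        got = {cid for _, _, cid in search(query, k=k)}
        hit = len(got & relevant)
        p.append(hit / k)
        r.append(hit / max(len(relevant), 1))
    n = len(labeled)
    return sum(p) / n, sum(r) / n

# Example with the section 3 sketch:
# labeled = {"certification requirements": {0}}
# print(precision_recall_at_k(hybrid_search, labeled, k=2))
```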


    TL;DR


    9 | Python environment

    9.1 | Check CUDA version (within conda env torch_env)
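
For example, inside torch_env (nvidia-smi reports the driver-side CUDA version; the Python snippet reports the wheel's):

```bash
nvidia-smi
python - <<'PY'
import torch
print("torch:", torch.__version__)
print("torch.version.cuda:", torch.version.cuda)   # None => CPU-only wheel
print("cuda available:", torch.cuda.is_available())
PY
```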

The output on LX-Olivier2023:


    9.2 | Environment env-adservio-raggae

    Use:
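
The exact pinned spec ships with the repository; a plausible recreation, consistent with the stack above (Python 3.12, conda-forge NumPy, pip for the rest — all package choices here are assumptions):

```bash
conda create -n adservio-raggae -c conda-forge python=3.12 numpy
conda activate adservio-raggae
pip install faiss-cpu sentence-transformers rank-bm25 fastapi "uvicorn[standard]" requests
```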

    9.3 | Smoke test | part 1

The smoke test checks the setup. If it runs fine, your core loop (parse → embed → index → hybrid search → provenance) is ready to plug into the adapters (tenders/CVs/reports).
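
A condensed stand-in for the actual script, reusing hybrid_search() from section 3:

```python
# Condensed check: parse -> embed -> index -> hybrid search -> provenance.
hits = hybrid_search("late delivery penalties", k=2)
assert hits, "hybrid search returned nothing"
for text, score, cid in hits:
    print(f"OK  chunk={cid}  score={score:.3f}  {text[:60]}")
```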

You should see:


    9.4 | Smoke test | part 2

If it runs correctly, you should see:


9.5 | Troubleshooting PyTorch without CUDA

If your Spyder is still using a CPU-only PyTorch wheel (symptom: torch.version.cuda is None), fix it cleanly by installing the CUDA build from the pytorch and nvidia channels and avoiding any pip/conda-forge Torch that might override it.


Solution: reinstall PyTorch
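
Something along these lines (match pytorch-cuda to your driver; 12.1 is illustrative):

```bash
# inside the affected env (e.g., torch_env)
conda remove -y pytorch torchvision torchaudio
conda install -y pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
```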


    Retest
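
For example:

```bash
python - <<'PY'
import torch
# torch.version.cuda must no longer be None; expect e.g. "2.x.x 12.1 True"
print(torch.__version__, torch.version.cuda, torch.cuda.is_available())
PY
```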


In case a prior pin drags a CPU build back in, make a fresh env that only uses the correct channels:
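
For example (the env name and CUDA pin are illustrative):

```bash
conda create -n torch-cuda -c pytorch -c nvidia -c defaults --strict-channel-priority \
      python=3.12 pytorch torchvision torchaudio pytorch-cuda=12.1
conda activate torch-cuda
```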


9.6 | Troubleshooting the NumPy `broadcast_to` import error

    This error means your NumPy install in adservio-raggae is inconsistent (CPU-only Torch is fine now). Let’s fix NumPy cleanly and avoid mixed pip/conda wheels.

    Quick check (run in the same Spyder console)
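
For example:

```python
import numpy, sys
print(sys.executable)            # confirm Spyder runs the adservio-raggae interpreter
print(numpy.__version__)
print(numpy.__file__)            # a pip/conda mix shows up as an unexpected path
from numpy import broadcast_to   # the failing import, in isolation
print(broadcast_to([1, 2, 3], (2, 3)))
```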

    Fix (close Spyder, then in a terminal)
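
A plausible sequence (remove any stray pip wheel first, then reinstall from conda-forge only):

```bash
conda activate adservio-raggae
pip uninstall -y numpy
conda install -y -c conda-forge --force-reinstall numpy
```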

    If you prefer staying on NumPy 1.x: use numpy=1.26.* (it supports Python 3.12). broadcast_to exists in both.

    Relaunch & re-test

1. Start Spyder from the env:

2. In Spyder, run (both steps are sketched below):
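
Both steps, as a sketch:

```bash
# 1. launch Spyder from the fixed env so it inherits the clean NumPy
conda activate adservio-raggae
spyder &
```

```python
# 2. in the Spyder console
import numpy
from numpy import broadcast_to
print(numpy.__version__, numpy.__file__)
```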

    If it still fails, show me the output of:
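
For example (assuming the adservio-raggae env):

```bash
conda list -n adservio-raggae numpy
python -c "import numpy, sys; print(sys.executable); print(numpy.__version__, numpy.__file__)"
```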

and we’ll zero in; in 99% of cases the clean conda-forge NumPy reinstall resolves the broadcast_to import error.