RAGGAE: Retrieval-Augmented Generation Generalized Architecture for Enterprise
A multipurpose local RAG system for processing and analyzing documents (tenders, CVs, reports) with semantic search, hybrid retrieval, and NLI-based compliance scoring.
RAGGAE is a production-ready, modular Retrieval-Augmented Generation (RAG) system designed to run entirely on local infrastructure. It combines:
Dense embeddings (bi-encoders like E5, GTE, BGE)
Sparse retrieval (BM25 for exact term matching)
Hybrid fusion (linear combination of dense and sparse scores)
Cross-encoder re-ranking (optional, for precision at the top)
Natural Language Inference (NLI) for compliance checking via local LLMs (Ollama)
Traceability with provenance tracking (document, page, block, bounding box)
The system is designed with a document-agnostic semantic core and pluggable adapters for different document types (PDFs, DOCX, ODT, TXT, MD), making it suitable for:
Tender analysis (requirements extraction, compliance scoring)
CV/Resume processing (skills matching, experience extraction)
Technical reports (semantic search, section extraction)
Multi-document batch processing
✨ Fully Local: No external APIs required—runs on CPU or GPU (8GB VRAM sufficient)
🔍 Hybrid Retrieval: Dense (FAISS) + Sparse (BM25) with configurable fusion
📄 Multi-Format Support: PDF, DOCX, ODT, TXT, MD with layout-aware parsing
🎯 NLI Compliance: Automatic requirement satisfaction checking via Ollama (Mistral, Llama3)
📊 Fit Scoring: Weighted requirement verdicts with exportable audit trails (JSON, CSV)
🌐 Web UI: Modern, responsive interface for upload, index, search, and scoring
🔌 RESTful API: FastAPI backend for integration with existing workflows
🧪 Fully Tested: Comprehensive test suite with mocked NLI for CI/CD
🌍 Multilingual: FR/EN support with E5 embeddings; extensible to other languages
📦 Extensible: Pluggable document adapters, embedding providers, and scoring strategies
```text
RAGGAE/
├── core/                         # Semantic core modules
│   ├── embeddings.py             # Embedding providers (E5, GTE, etc.)
│   ├── index_faiss.py            # FAISS vector index + metadata
│   ├── retriever.py              # Hybrid retrieval (dense + sparse)
│   ├── scoring.py                # Fit scoring from NLI verdicts
│   └── nli_ollama.py             # Local NLI via Ollama
├── io/                           # Document parsers
│   ├── pdf.py                    # PDF parsing (PyMuPDF)
│   ├── tables.py                 # Table extraction (future)
│   └── textloaders.py            # DOCX, ODT, TXT, MD loaders
├── adapters/                     # Domain-specific adapters (future)
│   ├── tenders.py                # Tender-specific logic
│   ├── cv.py                     # CV/resume parsing
│   └── reports.py                # Technical report adapters
├── cli/                          # Command-line tools
│   ├── index_doc.py              # Index PDFs into FAISS
│   ├── search.py                 # Semantic search CLI
│   ├── quickscore.py             # NLI-based scoring CLI
│   └── demo_app.py               # FastAPI web application
├── web/                          # Frontend UI
│   ├── index.html                # Single-page app
│   ├── script.js                 # Vanilla JS (no framework)
│   └── styles.css                # Modern dark/light theme
├── tests/                        # Test suite
│   ├── conftest.py               # Pytest fixtures
│   ├── test_core.py              # Core module tests
│   ├── test_core_embeddings.py   # Embedding tests
│   ├── test_core_index_retriever.py
│   ├── test_scoring.py
│   └── test_nli_mock.py          # Mocked NLI tests
├── data/                         # Data files
│   └── labels/                   # Few-shot seeds (future)
├── uploads/                      # Upload storage (auto-created)
├── examples/                     # Example documents (optional)
├── index.md                      # Original design document
├── README.md                     # This file
├── LICENSE                       # MIT License
└── requirements.txt              # Python dependencies (if using pip)
```
Python 3.12+ (tested on 3.12)
8GB RAM minimum (16GB recommended)
GPU with 8GB VRAM (optional, but recommended for faster embeddings)
Ollama (for NLI/compliance checks): ollama.com
```bash
# Create environment
mamba env create -f env-adservio-raggae.yml
mamba activate adservio-raggae

# Or create manually
mamba create -n adservio-raggae -c conda-forge -c pytorch -c nvidia \
    python=3.12 \
    pytorch pytorch-cuda=12.1 \
    faiss-cpu sentence-transformers \
    pymupdf pypdf python-docx odfpy \
    fastapi uvicorn pydantic \
    numpy scipy scikit-learn tqdm rich \
    pytest

# Install BM25 and Ollama client via pip
pip install rank-bm25 ollama
```

Environment file (`env-adservio-raggae.yml`):
```yaml
name: adservio-raggae
channels:
  - pytorch
  - nvidia
  - conda-forge
dependencies:
  - python=3.12
  # Core ML stack
  - pytorch>=2.4
  - pytorch-cuda=12.1
  - torchvision
  - torchaudio
  # RAG / retrieval
  - faiss-cpu
  - sentence-transformers
  - numpy
  - scipy
  - scikit-learn
  - tqdm
  # PDF / text parsing
  - pymupdf
  - pypdf
  - python-docx
  - odfpy
  # Web API
  - fastapi
  - uvicorn
  - pydantic
  # Testing
  - pytest
  # Utils
  - rich
  - pip
  - pip:
      - rank-bm25
      - ollama
```

Alternatively, with pip and a virtual environment:

```bash
python3.12 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

pip install --upgrade pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install faiss-cpu sentence-transformers
pip install pymupdf pypdf python-docx odfpy
pip install fastapi uvicorn pydantic
pip install numpy scipy scikit-learn tqdm rich
pip install rank-bm25 ollama
pip install pytest
```

If you have a CUDA-capable GPU:
```bash
# Check CUDA availability
python -c "import torch; print('CUDA:', torch.cuda.is_available(), 'Device:', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None')"

# If CUDA is False, reinstall PyTorch with CUDA support
mamba install -c pytorch -c nvidia pytorch=2.5.* pytorch-cuda=12.1 torchvision torchaudio

# For FAISS GPU acceleration (optional, requires faiss-gpu)
mamba install -c pytorch faiss-gpu
```

Core:
sentence-transformers — Embedding models (E5, GTE, BGE)
faiss-cpu / faiss-gpu — Vector similarity search
rank-bm25 — Sparse retrieval (BM25)
ollama — Local LLM client (Mistral, Llama3)
Parsing:
pymupdf (fitz) — PDF parsing with layout
pypdf — Fallback PDF reader
python-docx — DOCX parsing
odfpy — ODT parsing
Web:
fastapi — API framework
uvicorn — ASGI server
pydantic — Data validation
Testing:
pytest — Test framework
Index a document:

```bash
python -m cli.index_doc \
    --pdf /path/to/tender.pdf \
    --out ./tender.idx \
    --model intfloat/multilingual-e5-small \
    --e5
```

Output:
```text
Indexed 342 chunks → ./tender.idx.faiss + ./tender.idx.jsonl
intfloat/multilingual-e5-small [cuda] dim=384 (e5)
```
Supported flags:
--pdf — Path to PDF document
--out — Output index prefix (creates .faiss and .jsonl files)
--model — HuggingFace model ID (default: intfloat/multilingual-e5-small)
--e5 — Use E5-style prefixes (passage: / query:)
Search the index:

```bash
python -m cli.search \
    --index ./tender.idx \
    --model intfloat/multilingual-e5-small \
    --e5 \
    --query "Plateforme MLOps avec MLflow sur Kubernetes" \
    --k 10
```

Output:

```text
Top-10 for: 'Plateforme MLOps avec MLflow sur Kubernetes'
• 0.8423 (p.3, b12) La plateforme MLOps repose sur MLflow déployé sur un cluster Kubernetes…
• 0.7891 (p.5, b23) L'orchestration des workflows ML utilise Argo Workflows sur K8s…
• 0.7654 (p.8, b45) Monitoring des modèles via Prometheus et Grafana sur Kubernetes…
...
```
Score requirements against the index:

```bash
python -m cli.quickscore \
    --index ./tender.idx \
    --model intfloat/multilingual-e5-small \
    --e5 \
    --req "Provider must be ISO 27001 certified" \
    --req "Platform uses MLflow for MLOps" \
    --req "Deployments on Kubernetes with GitOps" \
    --topk 5
```

Output:

```text
Fit score: 83.3/100
- Provider must be ISO 27001 certified: Yes
- Platform uses MLflow for MLOps: Yes
- Deployments on Kubernetes with GitOps: Partial
```
Prerequisites: Ollama must be running with a model (e.g., mistral)
```bash
# Start Ollama daemon (if not running)
ollama serve

# Pull model
ollama pull mistral:latest

# Or use Llama3
ollama pull llama3:8b
```

Launch the web application:

```bash
uvicorn cli.demo_app:app --host 0.0.0.0 --port 8000 --reload
```

Open http://localhost:8000 in your browser.
Features:
Index Tab: Upload documents (PDF, DOCX, TXT, ODT, MD, or ZIP), configure indexing parameters
Search Tab: Semantic search with provenance (file, page, block, score)
Quickscore Tab: NLI-based compliance checking with audit trail export (JSON/CSV)
Keyboard shortcuts:
Cmd/Ctrl + K — Focus search input
Esc — Clear current form
Base URL: http://localhost:8000
Health check:

```bash
curl http://localhost:8000/health
```

Response:

```json
{
  "ok": true,
  "service": "raggae",
  "version": "0.1.2"
}
```

Single file or ZIP:
```bash
curl -F "file=@/path/to/tender.pdf" http://localhost:8000/upload
```

Response:

```json
{
  "ok": true,
  "type": "pdf",
  "key": "20251031-143022/tender.pdf",
  "size": 2458123
}
```

Multiple files:
```bash
curl -F "files=@tender1.pdf" -F "files=@tender2.docx" http://localhost:8000/upload-multi
```

Response:

```json
{
  "ok": true,
  "key": "20251031-143022",
  "files": ["20251031-143022/tender1.pdf", "20251031-143022/tender2.docx"]
}
```

Index the uploaded files:

```bash
curl -X POST http://localhost:8000/index \
  -H "Content-Type: application/json" \
  -d '{
    "key": "20251031-143022",
    "index_path": "./tender.idx",
    "model": "intfloat/multilingual-e5-small",
    "e5": true,
    "min_chars": 40,
    "extensions": ["pdf", "docx", "txt"]
  }'
```

Response:
xxxxxxxxxx{ "indexed": 342, "files": ["tender1.pdf", "tender2.docx"], "index_path": "./tender.idx", "encoder": "intfloat/multilingual-e5-small [cuda] dim=384 (e5)"}xxxxxxxxxxcurl -X POST http://localhost:8000/search \ -H "Content-Type: application/json" \ -d '{ "index_path": "./tender.idx", "model": "intfloat/multilingual-e5-small", "e5": true, "query": "MLflow sur Kubernetes ISO 27001", "k": 5 }' | jqResponse:
xxxxxxxxxx{ "query": "MLflow sur Kubernetes ISO 27001", "k": 5, "hits": [ { "score": 0.8423, "page": 3, "block": 12, "file": "tender1.pdf", "ext": "pdf", "snippet": "La plateforme MLOps repose sur MLflow déployé sur un cluster Kubernetes avec conformité ISO 27001…" }, ]}xxxxxxxxxxcurl -X POST http://localhost:8000/quickscore \ -H "Content-Type: application/json" \ -d '{ "index_path": "./tender.idx", "model": "intfloat/multilingual-e5-small", "e5": true, "requirements": [ "Provider must be ISO 27001 certified", "Platform uses MLflow for MLOps", "Deployments on Kubernetes with GitOps" ], "topk": 5, "ollama_model": "mistral", "nli_lang": "auto" }' | jqResponse:
xxxxxxxxxx{ "fit_score": 83.3, "verdicts": [ { "requirement": "Provider must be ISO 27001 certified", "verdict": "Yes", "rationale": "The document explicitly states ISO/IEC 27001:2022 certification.", "evidence": { "file": "tender1.pdf", "ext": "pdf", "page": 5, "block": 23, "snippet": "Le prestataire détient la certification ISO/IEC 27001:2022 pour…", "score": 0.7654 }, "evaluated": [] }, ], "summary": [ {"requirement": "Provider must be ISO 27001 certified", "label": "Yes"}, {"requirement": "Platform uses MLflow for MLOps", "label": "Yes"}, {"requirement": "Deployments on Kubernetes with GitOps", "label": "Partial"} ]}xxxxxxxxxx# JSON exportcurl -X POST http://localhost:8000/quickscore/export \ -H "Content-Type: application/json" \ -d '{ "index_path": "./tender.idx", "requirements": ["ISO 27001 certified"], "format": "json" }' > quickscore.json
# CSV exportcurl -X POST http://localhost:8000/quickscore/export \ -H "Content-Type: application/json" \ -d '{ "index_path": "./tender.idx", "requirements": ["ISO 27001 certified", "MLflow on K8s"], "format": "csv" }' > quickscore.csvRAGGAE combines dense (semantic) and sparse (lexical) retrieval:
Dense: Sentence-Transformers bi-encoder (e.g., E5-small) → 384-dim vectors → FAISS inner-product search
Sparse: BM25 on tokenized text (exact term matching)
Fusion: score = α·dense + (1-α)·sparse (default α=0.6); see the sketch after the next list
Why hybrid?
Dense: captures semantic similarity ("MLOps platform" ≈ "machine learning operations")
Sparse: preserves exact matches (acronyms, IDs, legal clauses)
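For intuition, here is a minimal sketch of the fusion step. The min-max normalization is an assumption for illustration; HybridRetriever's internal normalization may differ.

```python
import numpy as np

def fuse_scores(dense: np.ndarray, sparse: np.ndarray, alpha: float = 0.6) -> np.ndarray:
    """Linear fusion of dense and sparse scores over the same candidate set."""
    def minmax(x: np.ndarray) -> np.ndarray:
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)

    # Bring both score distributions to [0, 1] before mixing, since
    # inner-product scores and raw BM25 scores live on different scales.
    return alpha * minmax(dense) + (1 - alpha) * minmax(sparse)

# Three candidate chunks, scored by both retrievers
dense = np.array([0.84, 0.79, 0.61])   # FAISS inner-product scores
sparse = np.array([12.3, 4.1, 9.8])    # raw BM25 scores
print(fuse_scores(dense, sparse))
```

In practice you call the retriever directly, as shown below.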
```python
from cli.core.embeddings import STBiEncoder
from cli.core.retriever import HybridRetriever

# Build index
encoder = STBiEncoder("intfloat/multilingual-e5-small", prefix_mode="e5")
texts = ["MLOps with MLflow on K8s", "ISO 27001 certification required"]
retriever = HybridRetriever.build(encoder, texts)

# Search
hits = retriever.search("MLflow on Kubernetes", k=10, alpha=0.6)
for h in hits:
    print(h.score, h.text)
```

Natural Language Inference (NLI) determines whether a clause satisfies a requirement:
Input: (clause, requirement) pair
Output: {"label": "Yes|No|Partial", "rationale": "..."}
Model: Local LLM via Ollama (Mistral, Llama3, etc.)
Example:
```python
from cli.core.nli_ollama import NLIClient, NLIConfig

nli = NLIClient(NLIConfig(model="mistral", lang="auto"))
result = nli.check(
    clause="Le prestataire est certifié ISO/IEC 27001:2022.",
    requirement="Provider must be ISO 27001 certified",
)
# result.label = "Yes"
# result.rationale = "The clause explicitly states ISO/IEC 27001:2022 certification."
```

Robustness:
Language auto-detection: Retries with fallback language if rationale is invalid
JSON parsing: Handles malformed LLM outputs gracefully
Label sanitization: Ensures label ∈ {"Yes", "No", "Partial"} (sketched below)
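A minimal sketch of this kind of defensive parsing; `parse_nli_output` is a hypothetical helper for illustration, not RAGGAE's actual implementation:

```python
import json
import re

VALID_LABELS = {"Yes", "No", "Partial"}

def parse_nli_output(raw: str) -> dict:
    """Best-effort parse of an LLM reply into {'label', 'rationale'}."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to the first {...} span embedded in chatty output
        match = re.search(r"\{.*\}", raw, re.DOTALL)
        try:
            data = json.loads(match.group(0)) if match else {}
        except json.JSONDecodeError:
            data = {}

    label = str(data.get("label", "No")).strip().capitalize()
    if label not in VALID_LABELS:
        label = "No"  # conservative default for unrecognized labels
    return {"label": label, "rationale": str(data.get("rationale", ""))}

print(parse_nli_output('Sure! {"label": "partial", "rationale": "Only K8s is mentioned."}'))
# {'label': 'Partial', 'rationale': 'Only K8s is mentioned.'}
```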
Aggregate compliance across multiple requirements:
```python
from cli.core.scoring import FitScorer, RequirementVerdict

verdicts = [
    RequirementVerdict("ISO 27001", "Yes", weight=1.5),
    RequirementVerdict("MLflow on K8s", "Partial", weight=1.0),
    RequirementVerdict("Data in EU", "No", weight=1.0),
]

scorer = FitScorer()
score = scorer.fit_score(verdicts)     # 0.56
percentage = scorer.to_percent(score)  # 56.0
```

Weights:
Reflect requirement importance (e.g., mandatory vs. optional)
Default: 1.0 for all requirements (see the sketch below)
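Conceptually the aggregation is a weighted mean of verdict values. The sketch below assumes Yes=1.0, Partial=0.5, No=0.0; FitScorer's exact mapping may differ slightly (which is why it reports 0.56 rather than 0.57 for the example above):

```python
# Hypothetical label→value mapping; FitScorer's actual values may differ.
LABEL_VALUES = {"Yes": 1.0, "Partial": 0.5, "No": 0.0}

def weighted_fit(verdicts: list[tuple[str, float]]) -> float:
    """Weighted mean: sum(w_i * value_i) / sum(w_i)."""
    total_weight = sum(w for _, w in verdicts)
    return sum(LABEL_VALUES[label] * w for label, w in verdicts) / total_weight

print(weighted_fit([("Yes", 1.5), ("Partial", 1.0), ("No", 1.0)]))  # ≈ 0.57
```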
Adapters translate document-specific formats into a unified Block abstraction:
```python
# PDF
from cli.io.pdf import extract_blocks

blocks = extract_blocks("tender.pdf", min_chars=40)
# → List[PDFBlock(text, page, block, bbox)]

# DOCX / ODT / TXT / MD
from cli.io.textloaders import load_blocks_any

blocks = load_blocks_any("report.docx", min_chars=20)
# → List[TextBlock(text, page=1, block, bbox=(0,0,0,0))]
```

Future adapters (in `adapters/`):
TenderAdapter: Extract lots, requirements (MUST/SHALL), deadlines
CVAdapter: Parse roles, skills, certifications, experience periods
ReportAdapter: Section hierarchy, methods, results, annexes
Custom embedding provider:

```python
import numpy as np

from cli.core.embeddings import EmbeddingProvider, EmbeddingInfo

class MyCustomEncoder(EmbeddingProvider):
    def info(self) -> EmbeddingInfo:
        return EmbeddingInfo(model_name="my-model", device="cpu", dimension=512)

    def embed_texts(self, texts) -> np.ndarray:
        # Your embedding logic
        return np.random.rand(len(texts), 512).astype("float32")

    def embed_query(self, text: str) -> np.ndarray:
        return self.embed_texts([text])[0]
```

Custom scoring strategy:

```python
from cli.core.scoring import FitScorer, RequirementVerdict

class CustomScorer(FitScorer):
    def fit_score(self, verdicts, extra_signals=None):
        # Custom weighting logic: penalize any failed high-weight requirement
        base = super().fit_score(verdicts, extra_signals)
        penalty = 0.1 if any(v.label == "No" for v in verdicts if v.weight > 1.0) else 0
        return max(0, base - penalty)
```

Custom document adapter:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class TenderBlock:
    text: str
    page: int
    block: int
    section: str           # e.g., "Lot 1", "Annex A"
    requirement_type: str  # "MUST" | "SHALL" | "SHOULD"

    def as_metadata(self) -> Dict:
        return {
            "page": self.page,
            "block": self.block,
            "section": self.section,
            "req_type": self.requirement_type,
        }

def parse_tender(path: str) -> List[TenderBlock]:
    # Your custom tender parsing logic
    pass
```

Two-stage retrieval with cross-encoder re-ranking:

```python
from sentence_transformers import CrossEncoder

# Stage 1: Hybrid retrieval (top-100)
hits = retriever.search(query, k_dense=100, k=100)

# Stage 2: Cross-encoder re-ranking (top-20)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, h.text) for h in hits]
scores = reranker.predict(pairs)
reranked = sorted(zip(hits, scores), key=lambda x: x[1], reverse=True)[:20]
```

Swapping the vector store (e.g., Qdrant):

```python
# Current: FAISS (embedded)
from cli.core.index_faiss import FaissIndex

# Future: Qdrant (server-based, with filters)
import qdrant_client

class QdrantIndex:
    def __init__(self, client, collection_name):
        self.client = client
        self.collection = collection_name

    def add(self, vectors, texts, metadatas):
        # Insert into Qdrant
        pass

    def search(self, query_vec, k):
        # Search with filters
        pass
```

Run the test suite:

```bash
# Install pytest
mamba install -c conda-forge pytest

# Run all tests
pytest -q

# Run with coverage
pytest --cov=cli --cov-report=html

# Run specific test file
pytest tests/test_core_embeddings.py -v

# Run tests in parallel (requires pytest-xdist)
mamba install -c conda-forge pytest-xdist
pytest -n auto
```

Test structure:
```text
tests/
├── conftest.py                   # Fixtures (sample data, mocked NLI)
├── test_core.py                  # Core abstractions
├── test_core_embeddings.py       # Embedding providers
├── test_core_index_retriever.py  # FAISS + hybrid retrieval
├── test_scoring.py               # Fit scoring
└── test_nli_mock.py              # Mocked NLI (CI-friendly)
```
Mocking Ollama for CI:
```python
# tests/conftest.py
import pytest

from cli.core.nli_ollama import NLIResult

@pytest.fixture
def mock_nli(monkeypatch):
    def fake_check(self, clause, requirement):
        if "ISO" in clause and "ISO" in requirement:
            return NLIResult(label="Yes", rationale="ISO mentioned")
        return NLIResult(label="No", rationale="No match")

    monkeypatch.setattr("cli.core.nli_ollama.NLIClient.check", fake_check)
```

Code style:

PEP 8 compliance (use black for formatting)
Type hints for all public APIs
Docstrings (NumPy style, as shown below)
```bash
# Format code
pip install black
black cli/ tests/

# Type checking
pip install mypy
mypy cli/

# Linting
pip install flake8
flake8 cli/ --max-line-length=120
```

All modules, classes, and public functions include docstrings:
xxxxxxxxxx"""Brief one-line summary.
Extended description with usage notes.
Parameters----------param1 : type Description.
Returns-------type Description.
Examples-------->>> from cli.core.embeddings import STBiEncoder>>> enc = STBiEncoder("intfloat/multilingual-e5-small")>>> enc.embed_query("test")array([0.1, 0.2, ...], dtype=float32)"""Semantic versioning: MAJOR.MINOR.PATCH
MAJOR: Breaking API changes
MINOR: New features (backward-compatible)
PATCH: Bug fixes
Embedding model benchmarks:

| Model | Dim | CPU (docs/sec) | GPU (docs/sec) | Fits in 8 GB VRAM |
|---|---|---|---|---|
| multilingual-e5-small | 384 | ~30 | ~200 | ✅ |
| multilingual-e5-base | 768 | ~15 | ~120 | ✅ |
| gte-base-en-v1.5 | 768 | ~18 | ~150 | ✅ |
Optimization:
Use batch_size=64 for bulk encoding
Cache embeddings on disk if re-indexing frequently (see the sketch below)
Consider faiss-gpu for multi-million document collections
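A minimal caching sketch combining both tips; `encode_with_cache` and the cache layout are illustrative, not part of RAGGAE's API:

```python
import hashlib
from pathlib import Path

import numpy as np
from sentence_transformers import SentenceTransformer

def encode_with_cache(texts: list[str], model_name: str, cache_dir: str = ".emb_cache") -> np.ndarray:
    """Encode in batches of 64, reusing an on-disk cache keyed by model + content."""
    key = hashlib.sha256((model_name + "\x00".join(texts)).encode()).hexdigest()
    cache_file = Path(cache_dir) / f"{key}.npy"
    if cache_file.exists():
        return np.load(cache_file)

    model = SentenceTransformer(model_name)
    vecs = model.encode(texts, batch_size=64, normalize_embeddings=True,
                        convert_to_numpy=True)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    np.save(cache_file, vecs)
    return vecs
```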
FAISS index types:

| Index type | Search speed | Memory | Accuracy |
|---|---|---|---|
| IndexFlatIP | Fast (exact) | High | 100% |
| IndexIVFFlat | Very fast | Medium | ~99% |
| IndexHNSWFlat | Fastest | Highest | ~98% |
When to upgrade:
>100K documents: Use IndexIVFFlat with nlist=sqrt(N) (sketched below)
>1M documents: Use IndexHNSWFlat or quantized index
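As a sketch, building an IVF index under these rules of thumb (the `nprobe` heuristic is an assumption balancing speed and recall):

```python
import math

import faiss
import numpy as np

def build_ivf_index(vectors: np.ndarray) -> faiss.Index:
    """IVF index with nlist ≈ sqrt(N), for collections beyond ~100K chunks."""
    vectors = np.ascontiguousarray(vectors, dtype="float32")
    n, d = vectors.shape
    nlist = max(1, int(math.sqrt(n)))
    quantizer = faiss.IndexFlatIP(d)  # coarse quantizer over cluster centroids
    index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
    index.train(vectors)                # k-means clustering pass
    index.add(vectors)
    index.nprobe = max(1, nlist // 16)  # clusters visited per query
    return index
```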
NLI latency (via Ollama):

| Model | Quantization | Latency (per check) | VRAM |
|---|---|---|---|
| mistral:7b | Q4_K_M | ~2-3 s | 4-5 GB |
| llama3:8b | Q4_K_M | ~3-4 s | 5-6 GB |
| phi-3:mini | Q4_K_M | ~1-2 s | 2-3 GB |
Optimization:
Batch NLI checks in parallel (Ollama supports concurrent requests)
Use smaller models (Phi-3 mini) for faster scoring
Cache NLI results for repeated requirements (sketched below)
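A sketch combining parallel checks with a naive in-memory cache; the helper below is illustrative and reuses the NLIClient API shown earlier:

```python
from concurrent.futures import ThreadPoolExecutor

from cli.core.nli_ollama import NLIClient, NLIConfig

nli = NLIClient(NLIConfig(model="mistral", lang="auto"))
cache: dict[tuple[str, str], object] = {}

def cached_check(pair: tuple[str, str]):
    if pair not in cache:  # reuse verdicts for repeated (clause, requirement) pairs
        clause, requirement = pair
        cache[pair] = nli.check(clause=clause, requirement=requirement)
    return cache[pair]

clauses = ["Le prestataire est certifié ISO/IEC 27001:2022."]
requirements = ["Provider must be ISO 27001 certified", "Platform uses MLflow for MLOps"]
pairs = [(c, r) for c in clauses for r in requirements]

with ThreadPoolExecutor(max_workers=4) as pool:  # Ollama handles concurrent requests
    results = list(pool.map(cached_check, pairs))
```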
CUDA not detected

Symptom: `torch.cuda.is_available() == False`
Solution:
```bash
mamba activate adservio-raggae
mamba remove -y pytorch torchvision torchaudio cpuonly
python -m pip uninstall -y torch torchvision torchaudio
mamba install -y -c pytorch -c nvidia pytorch=2.5.* pytorch-cuda=12.1 torchvision torchaudio
```

Verify:

```bash
python -c "import torch; print('CUDA:', torch.cuda.is_available())"
```

Ollama connection error

Symptom: `requests.exceptions.ConnectionError: Ollama not running`
Solution:
```bash
# Start Ollama daemon
ollama serve

# In another terminal, pull a model
ollama pull mistral:latest

# Test
ollama run mistral "Hello"
```

NumPy `broadcast_to` import error

Symptom: `AttributeError: module 'numpy' has no attribute 'broadcast_to'`
Solution:
```bash
mamba activate adservio-raggae
python -m pip uninstall -y numpy
mamba install -y -c conda-forge "numpy>=1.26"
```

FAISS dimension mismatch

Symptom: `AssertionError: d == index.d`
Cause: Embedding model changed between indexing and search.
Solution:
Re-index with the correct model
Or ensure --model matches the original indexing model; a quick dimension check is sketched below
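A quick diagnostic sketch (assumes the `.faiss` file written by `cli.index_doc` and the `info()` accessor shown earlier):

```python
import faiss

from cli.core.embeddings import STBiEncoder

index = faiss.read_index("./tender.idx.faiss")
encoder = STBiEncoder("intfloat/multilingual-e5-small", prefix_mode="e5")

# The FAISS index dimension must equal the encoder's output dimension.
print("index dim:", index.d, "| encoder dim:", encoder.info().dimension)
```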
Web UI not loading

Symptom: `404 Not Found` or blank page
Solution:
```bash
# Ensure FastAPI is serving static files
# Check that the web/ directory exists:
ls -la web/

# Restart server with --reload
uvicorn cli.demo_app:app --host 0.0.0.0 --port 8000 --reload

# Access via http://localhost:8000 (not /app)
```

Contributions are welcome! Please follow these guidelines:
Fork the repository
Create a feature branch: git checkout -b feature/amazing-feature
Add tests for new functionality
Ensure tests pass: pytest
Format code: black cli/ tests/
Commit: git commit -m "Add amazing feature"
Push: git push origin feature/amazing-feature
Open a Pull Request
Code review checklist:
Tests pass (pytest)
Code formatted (black)
Type hints added (mypy)
Docstrings updated
README updated (if API changed)
This project is licensed under the MIT License - see the LICENSE file for details.
Dr. Olivier Vitrac, PhD, HDR
Email: olivier.vitrac@adservio.com
Organization: Adservio
Date: October 31, 2025
Sentence-Transformers (Nils Reimers, UKP Lab) — Embedding models
FAISS (Facebook AI Research) — Vector similarity search
Ollama — Local LLM inference
FastAPI (Sebastián Ramírez) — Modern Python web framework
PyMuPDF — Robust PDF parsing
Hugging Face — Model hosting and ecosystem
Inspirations:
LangChain, LlamaIndex (RAG frameworks)
ColBERT, SPLADE (advanced retrieval)
MS MARCO, BEIR (retrieval benchmarks)
If you use RAGGAE in your research or production systems, please cite:
```bibtex
@software{raggae2025,
  author    = {Vitrac, Olivier},
  title     = {RAGGAE: Retrieval-Augmented Generation Generalized Architecture for Enterprise},
  year      = {2025},
  publisher = {GitHub},
  url       = {https://github.com/adservio/raggae}
}
```

End of README
For questions, issues, or feature requests, please open an issue on GitHub or contact olivier.vitrac@adservio.com.