Retrieval-Augmented Generation Generalized Architecture for Enterprise
Olivier Vitrac, PhD., HDR | olivier.vitrac@adservio.fr – 2025-11-05
Summary
This early note discusses the design of a generic RAG/embeddings library that can serve CVs, reports, and tenders: different document adapters sit on top of a shared semantic core (retrieval + re-rank + annotation + scoring). A hybrid (dense + sparse) retriever combined with a cross-encoder re-ranker is proposed. The POC adds domain tuning and Natural Language Inference (NLI) checks, and is designed from day one for traceability (provenance spans, scores, reasons). The whole system is designed to run on minimal infrastructure: a fully local MVP on a GPU with 8 GB VRAM, possibly even on CPU.
The project RAGGAE is now mature and is available as an Adservio GitHub project; all details are in README.md. The POC can be launched as:
```bash
uvicorn RAGGAE.cli.demo_app:app --host 0.0.0.0 --port 8000 --reload
```
A. Dense bi-encoders (semantic retrieval)
General English/Multilingual: E5-family, GTE-family, bge-family, Jina, Sentence-Transformers (MiniLM, MPNet), Cohere, OpenAI, etc.
Pros: fast, scalable, cheap to store/query; perfect for “retrieve top-k chunks.”
Cons: retrieval scores are approximate; for high-precision ranking add a re-ranker.
B. Cross-encoder re-rankers (precision)
BERT/DeBERTa/modern LLM cross-encoders (e.g., ms-marco-tuned) that score (query, passage) jointly.
Use: take the top 50–200 dense hits, re-rank to get very accurate top-10.
Trade-off: slower and costlier per query, but best quality for tenders.
C. Hybrid retrieval (dense + sparse) — when vocabulary matters
Combine BM25 / SPLADE (sparse, exact terms) with dense vectors (semantics).
Use: tenders have jargon, acronyms, legal clauses—hybrid boosts recall on rare terms.
D. Domain-tuned embeddings
Light fine-tuning (or adapters) using your historic tenders, SoWs, CVs, past responses.
Use: improves intent matching on “DevOps/MLOps” specifics, vendor boilerplate, compliance phrasing.
E. Multilingual handling (FR/EN)
Choose a multilingual model (FR/EN at minimum). If not, keep separate indices per language and route queries.
Consider language-aware chunking and query translation as a fallback.
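As a fallback routing sketch (assuming a language-ID library such as langdetect, and placeholder per-language index handles), queries can be routed like this:

```python
# Sketch: route a query to a per-language index.
# Assumes `langdetect` is installed; the index registry below is illustrative, not part of RAGGAE.
from langdetect import detect, DetectorFactory
DetectorFactory.seed = 0  # deterministic detection

indices = {"fr": "faiss_fr.index", "en": "faiss_en.index"}  # placeholder index handles

def pick_index(query: str, default: str = "en") -> str:
    try:
        lang = detect(query)          # returns ISO codes such as 'fr' or 'en'
    except Exception:
        lang = default
    return indices.get(lang, indices[default])

print(pick_index("Plateforme MLOps avec exigences ISO 27001"))  # likely 'fr' -> faiss_fr.index
```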
F. Long-document strategies (tenders/CVs/reports)
Hierarchical embeddings: section → paragraph → sentence; route queries to the right level.
Layout-aware chunking: keep tables, bullets, headers/footers; preserve section numbers and annex links.
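A minimal sketch of layout-aware, section-path chunking over PyMuPDF blocks; the big-font heading heuristic is an assumption for illustration, not RAGGAE's actual rule set:

```python
import fitz  # PyMuPDF

def chunk_with_section_path(pdf_path, heading_size=14.0, min_chars=40):
    """Attach a running section path to each text block (naive: big font => heading)."""
    doc, out, path = fitz.open(pdf_path), [], []
    for pno in range(len(doc)):
        for blk in doc[pno].get_text("dict")["blocks"]:
            for line in blk.get("lines", []):
                text = "".join(s["text"] for s in line["spans"]).strip()
                size = max((s["size"] for s in line["spans"]), default=0.0)
                if not text:
                    continue
                if size >= heading_size:            # crude heading heuristic
                    path = [text]                    # reset/update the section path
                elif len(text) >= min_chars:
                    out.append({"text": text, "page": pno + 1,
                                "section_path": " > ".join(path)})
    return out
```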
We think of this as signals layered on top of retrieval:
Document structure parsing: title, sections, annexes, tables, numbered requirements (MUST/SHALL/DOIT).
Keyphrase & requirement mining: extract capabilities (e.g., “K8s, ArgoCD, MLflow, ISO 27001, on-prem”), constraints (SLA, RPO/RTO, sovereignty).
NER & taxonomy mapping: map entities/skills/standards to an Adservio capability ontology (DevOps, MLOps, Security, Cloud, Data).
Entailment/NLI checks: “Does our offer satisfy clause 4.2.3?” (Yes/No/Partial + rationale).
De-duplication & canonicalization: normalize synonyms (“GPU farm” ≈ “on-prem compute with NVIDIA A-series”).
Risk & eligibility flags: deadlines, mandatory certifications, exclusion criteria, IP/sovereignty clauses.
These features feed your scoring/ranking (fit, risk, attractiveness) and later your form pre-fill.
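For instance, the requirement-mining signal can start as a plain regex pass over modal keywords (the pattern and keyword list below are illustrative):

```python
import re

# Illustrative modal-keyword pattern; each adapter would extend the FR/EN variants.
REQ_PATTERN = re.compile(
    r"^.*\b(MUST|SHALL|DOIT|DOIVENT|REQUIRED|OBLIGATOIRE)\b.*$",
    flags=re.IGNORECASE | re.MULTILINE,
)

def mine_requirements(text: str) -> list[dict]:
    """Return candidate requirement sentences with the modal keyword that triggered them."""
    return [{"text": m.group(0).strip(), "keyword": m.group(1).upper()}
            for m in REQ_PATTERN.finditer(text)]

sample = "Le prestataire DOIT être certifié ISO 27001.\nThe platform SHALL expose an MLflow registry."
for r in mine_requirements(sample):
    print(r["keyword"], "->", r["text"])
```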
Design a document-agnostic semantic layer with adapters:
Core abstractions:
Document (metadata + pages + spans + tables)
Chunk (text, layout anchors, section path)
EmbeddingProvider (pluggable: dense, sparse, hybrid)
Indexer/Retriever (vector DB + BM25)
Reranker (cross-encoder)
Annotator (NER, keyphrases, taxonomy linker)
Scorer (tender-fit, confidence, risk)
Extractor (field mappers for pre-fill)
Adapters per doc type: TenderAdapter, CVAdapter, ReportAdapter implement:
Parsing rules (e.g., numbered requirements vs. experiences vs. results)
Chunking rules (keep bullets, tables, job periods)
Field mappers (e.g., “Lot 2 scope” → scope.devops, “Years exp in K8s” → cv.skills.k8s.years)
Result: same embedding/retrieval engine, different adapters and scoring logic.
```python
class EmbeddingProvider:
    def embed_texts(self, texts: list[str]) -> list[list[float]]: ...
    def embed_query(self, text: str) -> list[float]: ...

class DenseBiEncoder(EmbeddingProvider): ...

class SparseBM25: ...

class HybridRetriever:
    def __init__(self, dense: EmbeddingProvider, sparse: SparseBM25, alpha=0.6): ...
    def search(self, query: str, k=100) -> list["Hit"]: ...

class CrossEncoderReranker:
    def rerank(self, query: str, hits: list["Hit"], top_k=20) -> list["Hit"]: ...

class DocumentAdapter:
    def parse(self, raw_bytes) -> "Document": ...
    def chunk(self, doc: "Document") -> list["Chunk"]: ...
    def annotate(self, chunks) -> list["Chunk"]: ...
    def score(self, query, chunks) -> list["ScoredChunk"]: ...

# Pipeline
adapter = TenderAdapter(lang="fr")
doc = adapter.parse(pdf_bytes)
chunks = adapter.chunk(doc)
vectors = dense.embed_texts([c.text for c in chunks])
index.upsert(chunks, vectors, metadata=adapter.annotations)

hits = hybrid.search(query, k=150)
hits = reranker.rerank(query, hits, top_k=25)
```

Early phase / fast demo: Multilingual dense bi-encoder + BM25 hybrid; add a small cross-encoder re-ranker.
Production quality for tenders: Same as above plus (a) domain-tuning on historical tenders & responses, (b) taxonomy-aware scoring, (c) NLI compliance checks.
High privacy / on-prem: Prefer open models (no external API), self-host vector DB (FAISS, Qdrant, Milvus).
Strict FR/EN mix: Multilingual embeddings or per-language indices with automatic routing.
Lots of tables/forms: Ensure layout-aware parsing (tables become key-value triples; keep cell coordinates).
Absolutely feasible locally: E5-small + BM25 + optional cross-encoder, FAISS index, Ollama (7–8B Q4) for NLI/extraction.
One generic library with adapters lets you handle tenders, CVs, and reports with the same semantic core.
1.6 | Ranking & classification for tenders
Relevance ranking: Hybrid retrieve → cross-encode re-rank.
Fit scoring: weighted signals (must-haves met, certifications present, tech match, budget window, delivery window, jurisdiction).
Classification buckets: DevOps/MLOps/Lot-based labels via:
Zero-shot (NLI prompt + label descriptions) for cold start.
Few-shot supervised (logistic regression or small classifier on embeddings) once you have labeled data.
Topic modeling (BERTopic/Top2Vec on embeddings) for discovery of recurring themes.
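A minimal sketch of the few-shot supervised option above, training a logistic regression on sentence embeddings (seed texts and labels are made up for illustration):

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

model = SentenceTransformer("intfloat/multilingual-e5-small")

# Illustrative labeled seeds; in practice these come from the labels/ folder.
seed_texts = [
    "passage: CI/CD GitLab, ArgoCD, Helm, GitOps",          # DevOps
    "passage: MLflow registry, model serving, pipelines",    # MLOps
    "passage: déploiement Kubernetes et GitOps",             # DevOps
    "passage: suivi des expériences et des modèles ML",      # MLOps
]
seed_labels = ["DevOps", "MLOps", "DevOps", "MLOps"]

X = model.encode(seed_texts, normalize_embeddings=True)
clf = LogisticRegression(max_iter=1000).fit(X, seed_labels)

query = ["passage: plateforme MLOps fondée sur MLflow et Kubernetes"]
print(clf.predict(model.encode(query, normalize_embeddings=True)))
```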
Field schema registry: define each target field with a canonical name, regex/ontology, and examples.
Extractor chain: retrieval → NER/regex → NLI validation → LLM with constrained generation to map spans to fields.
Traceability: keep source spans + page numbers (for audit and human review).
Safety gates: mandatory fields coverage, confidence thresholds, red-flag clauses (IP/sovereignty/insurance).
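A sketch of what the field schema registry and the regex stage of the extractor chain could look like; field names, patterns, and examples are placeholders, not the actual RAGGAE schema:

```python
from dataclasses import dataclass, field
import re

@dataclass
class FieldSpec:
    name: str                      # canonical field name, e.g. "tender.sla.availability"
    pattern: str                   # regex used by the first pass of the extractor chain
    examples: list[str] = field(default_factory=list)

REGISTRY = {
    "tender.certification.iso27001": FieldSpec(
        name="tender.certification.iso27001",
        pattern=r"ISO\s*27001",
        examples=["Exigence ISO 27001", "ISO 27001 certified provider"],
    ),
    "tender.sla.availability": FieldSpec(
        name="tender.sla.availability",
        pattern=r"SLA[^%\n]*?(\d{2}(?:\.\d+)?)\s*%",
        examples=["SLA attendu 99.9%"],
    ),
}

def extract_fields(text: str) -> dict:
    """Regex first pass: return matched values with source spans for traceability."""
    out = {}
    for key, spec in REGISTRY.items():
        m = re.search(spec.pattern, text, flags=re.IGNORECASE)
        if m:
            out[key] = {"value": m.group(0), "span": m.span()}
    return out

print(extract_fields("SLA attendu 99.9%, exigence ISO 27001 et hébergement UE."))
```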
Retrieval: Recall@k, nDCG on a seed set of queries (FR/EN).
Re-ranking: MRR@10, precision@5.
Classification: F1 per class, macro-F1; calibration curve.
Extraction (pre-fill): exact-match / relaxed-match and provenance coverage (% fields with verified source span).
Human-in-the-loop: review time saved per tender.
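Minimal helpers for these retrieval metrics, assuming binary relevance judgments on a hand-labeled seed set of queries:

```python
import numpy as np

def recall_at_k(ranked_ids, relevant_ids, k=50):
    rel = set(relevant_ids)
    return len(rel & set(ranked_ids[:k])) / max(len(rel), 1)

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    rel = set(relevant_ids)
    gains = [1.0 if i in rel else 0.0 for i in ranked_ids[:k]]
    dcg = sum(g / np.log2(rank + 2) for rank, g in enumerate(gains))
    ideal = sum(1.0 / np.log2(rank + 2) for rank in range(min(len(rel), k)))
    return dcg / ideal if ideal > 0 else 0.0

def mrr_at_k(ranked_ids, relevant_ids, k=10):
    rel = set(relevant_ids)
    for rank, i in enumerate(ranked_ids[:k], start=1):
        if i in rel:
            return 1.0 / rank
    return 0.0

# toy check: relevant chunks are 3 and 7
print(recall_at_k([3, 1, 7, 2], [3, 7], k=3), ndcg_at_k([3, 1, 7], [3, 7]), mrr_at_k([1, 3], [3, 7]))
```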
Dense bi-encoder: a strong multilingual Sentence-Transformers-style model (or equivalent GTE/bge multilingual).
Sparse: BM25; consider SPLADE later if needed.
Re-ranker: MS-MARCO-style cross-encoder or a modern cross-encoder fine-tuned on your domain pairs.
Vector DB: FAISS (embedded) → Qdrant/Milvus (server) when scaling.
Parsers: pdfminer/pymupdf + table extraction (camelot/tabula) + a layout-retaining schema.
Yes, one generic RAG/embeddings library can serve CVs, reports, and tenders if you separate document adapters from a shared semantic core (retrieval + re-rank + annotation + scoring).
Start hybrid (dense+sparse) + cross-encoder, add domain-tuning and NLI checks, and design from day one for traceability (provenance spans, scores, reasons).
This sets you up cleanly for step-2 form pre-fill with auditable mappings.
Multilingual small (fits easily):
intfloat/multilingual-e5-small or sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 (both ~118M parameters).
English-optimized (if you need max quality in EN):
thenlper/gte-small or Alibaba-NLP/gte-base-en-v1.5 (base is fine on CPU/GPU).
Tip: start with multilingual-e5-small for FR/EN, upgrade to multilingual-e5-base when you want a tiny quality boost.
Light & accurate: cross-encoder/ms-marco-MiniLM-L-6-v2 (EN).
Multilingual option: jinaai/jina-reranker-v1-base-multilingual (base size, still comfy on 8 GB).
Use it only on top-100 dense hits → top-20 final.
BM25: rank_bm25 (pure Python) to start.
Later: Elastic (OpenSearch) or SPLADE if recall needs help.
FAISS for embedded mode (simple and fast).
Optional server mode later: Qdrant (Docker) when you need multi-user + filters.
PyMuPDF (fitz) + metadata/page anchors.
Camelot/tabula for tables → convert to key-value triples with cell coordinates.
Chunk by sections/bullets; keep (doc_id, section_path, page, bbox) in metadata for traceability.
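To illustrate the table handling, a sketch that flattens Camelot tables into key-value triples; page/table/cell indices stand in for full bbox coordinates, which depend on the parsing layer:

```python
import camelot  # pip install camelot-py[cv]

def table_to_triples(pdf_path, pages="1"):
    """Flatten detected tables into (row_label, column_label, value) triples with provenance."""
    triples = []
    for t_idx, table in enumerate(camelot.read_pdf(pdf_path, pages=pages)):
        df = table.df                       # pandas DataFrame; first row assumed to be the header
        header = df.iloc[0].tolist()
        for r in range(1, len(df)):
            row_label = df.iat[r, 0]
            for c in range(1, len(header)):
                value = df.iat[r, c]
                if str(value).strip():
                    triples.append({
                        "row": row_label, "col": header[c], "value": value,
                        "page": table.page, "table": t_idx, "cell": (r, c),
                    })
    return triples
```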
With Ollama: mistral:7b-instruct or llama3:8b-instruct in Q4_K_M quant runs on 8 GB.
Use for: entailment checks, short rationales, and field extraction with constrained prompts.
```python
# pip install sentence-transformers faiss-cpu rank-bm25 pypdf pymupdf tqdm
from sentence_transformers import SentenceTransformer
import faiss, numpy as np
from rank_bm25 import BM25Okapi
import fitz  # PyMuPDF

# 1) Parse & chunk
def parse_pdf(path):
    doc = fitz.open(path)
    chunks = []
    for pno in range(len(doc)):
        page = doc[pno]
        text = page.get_text("blocks")  # retains block order
        for i, (_, _, _, _, t, _, _) in enumerate(text):
            t = (t or "").strip()
            if len(t) > 40:
                chunks.append({"text": t, "page": pno + 1, "block": i})
    return chunks

chunks = parse_pdf("tender.pdf")
texts = [c["text"] for c in chunks]

# 2) Dense embeddings
model = SentenceTransformer("intfloat/multilingual-e5-small")
# e5 expects "query: ..." vs "passage: ..." prefixes for best results
passages = [f"passage: {t}" for t in texts]
E = np.vstack(model.encode(passages, normalize_embeddings=True))

# 3) FAISS index
index = faiss.IndexFlatIP(E.shape[1])
index.add(E.astype("float32"))

# 4) BM25
bm25 = BM25Okapi([t.split() for t in texts])

# 5) Hybrid search
def hybrid_search(q, k_dense=100, k=20, alpha=0.6):
    q_dense = model.encode([f"query: {q}"], normalize_embeddings=True)
    D, I = index.search(q_dense.astype("float32"), k_dense)
    dense_scores = {i: float(s) for i, s in zip(I[0], D[0])}
    # BM25 scores
    bm = bm25.get_scores(q.split())
    # Normalize BM25
    bm = (bm - bm.min()) / (bm.ptp() + 1e-9)
    # Fuse
    fused = []
    for i, ds in dense_scores.items():
        fs = alpha * ds + (1 - alpha) * float(bm[i])
        fused.append((i, fs))
    fused.sort(key=lambda x: x[1], reverse=True)
    return [chunks[i] | {"score": s} for i, s in fused[:k]]

hits = hybrid_search("ISO 27001, MLOps platform avec MLflow et K8s")
for h in hits[:5]:
    print(h["score"], h["page"], h["text"][:120], "…")
```

Swap in a cross-encoder re-ranker later (e.g., jinaai/jina-reranker-v1-base-multilingual) on the `hits[:100]` to boost precision@5.
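A sketch of that swap using the sentence-transformers CrossEncoder API; the EN ms-marco model is shown for simplicity, and the multilingual Jina re-ranker would follow the same pattern:

```python
from sentence_transformers import CrossEncoder

# EN-only but light; swap for a multilingual re-ranker for mixed FR/EN corpora.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, hits, top_k=20):
    """Score (query, passage) pairs jointly and keep the best top_k hits."""
    scores = reranker.predict([(query, h["text"]) for h in hits])
    for h, s in zip(hits, scores):
        h["ce_score"] = float(s)
    return sorted(hits, key=lambda h: h["ce_score"], reverse=True)[:top_k]

hits = rerank("ISO 27001, MLOps platform avec MLflow et K8s", hits[:100], top_k=20)
```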
```bash
# examples: mistral & llama3 in 4-bit quant
ollama pull mistral:latest
ollama pull llama3:8b
```

```python
# Python: pip install ollama
import ollama

PROMPT = """You are a compliance checker.
Clause: "{clause}"
Requirement: "Provider must be ISO 27001 certified"
Answer with JSON: {{"label":"Yes/No/Partial","rationale": "..."}}"""

def nli_check(clause):
    r = ollama.chat(model="mistral",
                    messages=[{"role": "user", "content": PROMPT.format(clause=clause)}])
    return r["message"]["content"]
```

Example of response after the NLI (natural language inference) check:
```
NLI result: {
  "label": "No",
  "rationale": "The clause does not mention the provider's ISO 27001 certification. Therefore, it cannot be confirmed that the provider is certified."
}
```
Embeddings: “small/base” sentence-transformers (CPU or GPU).
Re-rankers: MiniLM-class and multilingual base rerankers (GPU helps; CPU is fine).
LLM for reasoning/extraction: 7B–8B quantized via Ollama (Q4_) — good for short answers and NLI.
You don’t need bigger models for step-1 retrieval/ranking.
Keep a shared semantic core and add thin adapters:
TenderAdapter: numbered requirements (MUST/SHALL), lots/eligibility, deadlines.
CVAdapter: roles, durations, skills, certs; normalize to a capability ontology (e.g., devops.k8s, mlops.mlflow, security.iso27001).
ReportAdapter: sections, methods, results, conclusions, annexes/tables.
All three reuse the same: parser → chunker → embeddings → FAISS/BM25 → (optional) reranker → scorers.
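For example, a tiny synonym-to-ontology normalizer shared by all three adapters (the synonym table is illustrative; in RAGGAE it would come from ontology.yaml):

```python
# Illustrative synonym table; the real mapping lives in data/ontology.yaml.
ONTOLOGY_SYNONYMS = {
    "devops.k8s": ["kubernetes", "k8s", "openshift"],
    "mlops.mlflow": ["mlflow", "ml flow"],
    "security.iso27001": ["iso 27001", "iso27001", "iso/iec 27001"],
}

def normalize_capabilities(text: str) -> set[str]:
    """Map free-text mentions to canonical capability keys."""
    t = text.lower()
    return {key for key, syns in ONTOLOGY_SYNONYMS.items() if any(s in t for s in syns)}

print(normalize_capabilities("Exigence ISO 27001, déploiement Kubernetes (K8s), MLflow registry"))
# -> {'security.iso27001', 'devops.k8s', 'mlops.mlflow'} (set order may vary)
```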
Proposed package layout (uv/pip):

```
RAGGAE/
  core/
    embeddings.py     # providers (E5, GTE, bge…)
    retriever.py      # hybrid retriever
    reranker.py       # optional cross-encoder
    index_faiss.py    # vector index
    scoring.py        # signals + weighted fit score
    nli_ollama.py     # local NLI/extractor
  io/
    pdf.py            # PyMuPDF parsing
    tables.py         # camelot/tabula wrappers
  adapters/
    tenders.py        # parse/chunk/fields
    cv.py             # parse/chunk/fields
    reports.py
  cli/
    index_doc.py      # index PDFs
    search.py         # query + show provenance
    quickscore.py     # tender fit score
  data/
    ontology.yaml     # skills, certs, standards
    labels/           # few-shot seeds for classifiers
```
Retrieval: Recall@50 on 10–20 real tender questions (FR/EN).
Top-k quality: nDCG@10 with cross-encoder on/off (demo the delta).
Classification: Zero-shot labels (DevOps/MLOps/Lot) → quick F1 from a tiny hand-labeled set.
Traceability: Every hit printed with (doc, page, block, score) — reviewers love this.
Absolutely feasible locally: E5-small + BM25 + optional cross-encoder, FAISS index, Ollama (7–8B Q4) for NLI/extraction.
One generic library with adapters lets you handle tenders, CVs, and reports with the same semantic core.
Start with the code above; you can add the cross-encoder and a simple fit score next (must-haves met, tech match, risk flags).
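As a starting point, a minimal weighted fit score over those signals (weights and signal names are placeholders to tune per tender type):

```python
# Hypothetical signal values in [0, 1] produced by the scorers (coverage of must-haves, etc.).
SIGNAL_WEIGHTS = {"must_haves_met": 0.5, "tech_match": 0.3, "risk_flags_ok": 0.2}

def fit_score(signals: dict[str, float]) -> float:
    """Weighted average of normalized signals, returned on a 0-100 scale."""
    total = sum(SIGNAL_WEIGHTS.values())
    score = sum(SIGNAL_WEIGHTS[k] * signals.get(k, 0.0) for k in SIGNAL_WEIGHTS) / total
    return round(100 * score, 1)

print(fit_score({"must_haves_met": 0.8, "tech_match": 1.0, "risk_flags_ok": 0.5}))  # -> 80.0
```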
Check that PyTorch sees the GPU (conda env `torch_env`):

```bash
python - <<'PY'
import torch
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA version (runtime):", torch.version.cuda)
    print("GPU:", torch.cuda.get_device_name(0))
    print("Capability:", torch.cuda.get_device_capability(0))
PY
nvidia-smi
```

The output on LX-Olivier2023:
```
PyTorch version: 2.5.1
CUDA available: True
CUDA version (runtime): 12.1
GPU: NVIDIA RTX A2000 8GB Laptop GPU
Capability: (8, 6)
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A2000 8GB Lap...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   49C    P8             4W /   35W  |     114MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      6701      G   /usr/lib/xorg/Xorg                              4MiB |
|    0   N/A  N/A      7200    C+G   ...c/gnome-remote-desktop-daemon               83MiB |
+-----------------------------------------------------------------------------------------+
```
`env-adservio-raggae.yml`:

```yaml
name: adservio-raggae
channels:
  - pytorch
  - nvidia
  - conda-forge
dependencies:
  - python=3.12
  - spyder
  # Core ML stack
  - pytorch>=2.4
  - pytorch-cuda=12.1   # uses LX-Olivier2023 NVIDIA GPU (CUDA 12.x)
  - torchvision
  - torchaudio
  # RAG / retrieval
  - faiss-cpu           # simple & stable; upgrade to faiss-gpu later if needed
  - sentence-transformers
  - numpy
  - scipy
  - scikit-learn
  - tqdm
  # PDF / parsing
  - pymupdf             # (import as `fitz`)
  - pypdf
  # utils
  - uvicorn
  - rich
  - pip
  - pip:
      - rank-bm25
      - ollama          # python client for your local Ollama
```
Use:
```bash
mamba env create -f env-adservio-raggae.yml
conda activate adservio-raggae
```

The smoke test checks the setup. If it runs fine, your core loop (parse → embed → index → hybrid search → provenance) is ready for plugging into adapters (tenders/CVs/reports). It should print:
GPU + CUDA info printed.
Embedding shape (N, 384) and timing.
FAISS indexed count.
A ranked list of top matches with scores and (optional) PDF page/block.
Part 2 will show a short JSON-like verdict from the Ollama block if you enable it.
```python
# -*- coding: utf-8 -*-
"""Smoke test for RAGGAE
Adservio | 2025-10-27"""

#%% 0) Environment check (GPU, versions)
import torch, sys, platform, time
print("Python:", platform.python_version(), "| Torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available(), "| Torch CUDA runtime:", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0), "| Compute:", torch.cuda.get_device_capability(0))

#%% 1) Imports
from sentence_transformers import SentenceTransformer
import numpy as np, faiss
from rank_bm25 import BM25Okapi

# Optional PDF parsing
PDF_PATH = ""  # set to a local tender PDF path, e.g., "/home/olivi/Documents/tender.pdf"
try:
    import fitz  # PyMuPDF
except Exception as e:
    fitz = None
    print("PyMuPDF not available:", e)

#%% 2) Tiny corpus + optional PDF chunks
def parse_pdf_blocks(path, min_chars=40, max_blocks=300):
    """Return list[{'text','page','block'}] from a PDF, keeping simple text blocks."""
    out = []
    doc = fitz.open(path)
    for pno in range(len(doc)):
        page = doc[pno]
        for bi, blk in enumerate(page.get_text("blocks")):
            # blk: (x0, y0, x1, y1, text, block_no, block_type)
            txt = (blk[4] or "").strip()
            if len(txt) >= min_chars:
                out.append({"text": txt, "page": pno + 1, "block": bi})
            if len(out) >= max_blocks:
                return out
    return out

seed_chunks = [
    {"text": "Adservio propose une offre MLOps fondée sur MLflow et Kubernetes (K8s).", "page": 0, "block": 0},
    {"text": "Exigence ISO 27001 et hébergement des données en Union Européenne.", "page": 0, "block": 1},
    {"text": "DevOps CI/CD avec GitLab, ArgoCD, Helm et GitOps pour déploiement cloud.", "page": 0, "block": 2},
    {"text": "SLA attendu 99.9%, RPO 15 minutes, RTO 1 heure. Support 24/7 requis.", "page": 0, "block": 3},
]

if PDF_PATH and fitz:
    try:
        pdf_chunks = parse_pdf_blocks(PDF_PATH)
        print(f"Parsed {len(pdf_chunks)} blocks from PDF")
        chunks = pdf_chunks or seed_chunks
    except Exception as e:
        print("PDF parse failed, using seed chunks:", e)
        chunks = seed_chunks
else:
    if PDF_PATH and not fitz:
        print("PyMuPDF missing; set PDF_PATH='' or install it in the env.")
    chunks = seed_chunks

texts = [c["text"] for c in chunks]
print(f"Corpus size: {len(texts)}")

#%% 3) Load embedding model (GPU if available)
MODEL = "intfloat/multilingual-e5-small"  # FR/EN good starter
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer(MODEL, device=device)
print("Embedding model loaded on:", device)

#%% 4) Build embeddings (timed)
t0 = time.time()
with torch.inference_mode():
    passages = [f"passage: {t}" for t in texts]  # E5-style prefix
    E = model.encode(passages, normalize_embeddings=True, convert_to_numpy=True,
                     batch_size=64, show_progress_bar=False)
print("Emb shape:", E.shape, "| secs:", round(time.time() - t0, 2))

#%% 5) FAISS index (inner product / cosine with normalized vecs)
index = faiss.IndexFlatIP(E.shape[1])
index.add(E.astype("float32"))
print("FAISS indexed:", index.ntotal, "vectors")

#%% 6) BM25 on same corpus
bm25 = BM25Okapi([t.split() for t in texts])

def _minmax(x):
    x = np.asarray(x, dtype=np.float32)
    return (x - x.min()) / (x.ptp() + 1e-9)

#%% 7) Hybrid search
def hybrid_search(query, k_dense=100, k=10, alpha=0.6):
    # dense
    qv = model.encode([f"query: {query}"], normalize_embeddings=True,
                      convert_to_numpy=True).astype("float32")
    D, I = index.search(qv, min(k_dense, len(texts)))
    dense_scores = {int(i): float(s) for i, s in zip(I[0], D[0])}
    # bm25
    bm = bm25.get_scores(query.split())
    bm = _minmax(bm)  # normalize to [0,1]
    # fuse
    fused = []
    for i, ds in dense_scores.items():
        fs = alpha * ds + (1 - alpha) * float(bm[i])
        fused.append((i, fs))
    fused.sort(key=lambda x: x[1], reverse=True)
    out = []
    for i, s in fused[:k]:
        out.append(chunks[i] | {"score": round(s, 4)})
    return out

#%% 8) Run a query
query = "Plateforme MLOps avec MLflow sur Kubernetes, exigences ISO 27001 et GitOps"
hits = hybrid_search(query, k=5)
print("\nQuery:", query)
for h in hits:
    loc = f"(p.{h['page']}, block {h['block']})" if h.get("page") else ""
    print(f"- score={h['score']:.4f} {loc} :: {h['text'][:110]}…")

#%% 9) Optional: quick provenance pretty-print
def show_hit(h, max_len=400):
    print(f"\n[score={h['score']}] page={h.get('page','?')} block={h.get('block','?')}")
    print(h["text"][:max_len] + ("…" if len(h["text"]) > max_len else ""))

if hits:
    show_hit(hits[0])
```
You should see output like:
```
Projects/raggae/smoke_test.py
modules.json: 100%|██████████| 387/387 [00:00<00:00, 1.24MB/s]
README.md: 498kB [00:00, 63.3MB/s]
sentence_bert_config.json: 100%|██████████| 57.0/57.0 [00:00<00:00, 90.9kB/s]
config.json: 100%|██████████| 655/655 [00:00<00:00, 2.17MB/s]
model.safetensors: 100%|██████████| 471M/471M [00:13<00:00, 34.8MB/s]
tokenizer_config.json: 100%|██████████| 443/443 [00:00<00:00, 1.54MB/s]
sentencepiece.bpe.model: 100%|██████████| 5.07M/5.07M [00:00<00:00, 10.3MB/s]
tokenizer.json: 100%|██████████| 17.1M/17.1M [00:00<00:00, 38.9MB/s]
special_tokens_map.json: 100%|██████████| 167/167 [00:00<00:00, 483kB/s]
config.json: 100%|██████████| 200/200 [00:00<00:00, 591kB/s]
Embedding model loaded on: cpu
```
```python
#%% 10a) (Optional) NLI/compliance check with Ollama (requires daemon running)
# Uncomment to test. Example: does the clause satisfy an ISO 27001 requirement?
import ollama, json

clause = hits[0]["text"] if hits else "Le prestataire dispose d’une certification ISO 27001."
prompt = f'''You are a compliance checker.
Clause: "{clause}"
Requirement: "Provider must be ISO 27001 certified."
Answer JSON with keys: label in ["Yes","No","Partial"], rationale (short).'''
res = ollama.chat(model="mistral", messages=[{"role": "user", "content": prompt}])
print("\nNLI result:", res["message"]["content"])
```
```python
# %% 10b) More sophisticated RAG: hardened NLI helper (deterministic + JSON-safe + FR/EN)
import ollama, json, re

NLI_SYS = (
    "You are a strict compliance checker. "
    "Return ONLY compact JSON with keys: label, rationale. "
    "label ∈ ['Yes','No','Partial']."
)

def parse_json_loose(s: str):
    # strip code fences and grab the first {...}
    s = s.strip()
    s = re.sub(r"^```(?:json)?\s*|\s*```$", "", s, flags=re.I | re.M)
    m = re.search(r"\{.*\}", s, flags=re.S)
    if not m:
        return None
    try:
        return json.loads(m.group(0))
    except Exception:
        return None

def nli_check(clause: str, requirement: str, lang="auto"):
    prompt = (
        f"Language: {lang}. "
        f'Clause: "{clause}"\n'
        f'Requirement: "{requirement}"\n'
        'Respond as JSON: {"label":"Yes|No|Partial","rationale":"..."}'
    )
    r = ollama.chat(
        model="mistral",
        options={"temperature": 0, "num_ctx": 4096},
        messages=[{"role": "system", "content": NLI_SYS},
                  {"role": "user", "content": prompt}]
    )
    out = parse_json_loose(r["message"]["content"]) or {"label": "No", "rationale": "Invalid or non-JSON output"}
    # normalize label
    lbl = out.get("label", "").strip().title()
    if lbl not in {"Yes", "No", "Partial"}:
        lbl = "No"
    out["label"] = lbl
    return out

# quick test (use your best clause from hits)
clause = hits[0]["text"] if hits else "Le prestataire dispose d’une certification ISO 27001."
print(nli_check(clause, "Provider must be ISO 27001 certified"))

# %% 10c) Batch matrix (requirements x top-k clauses)
import pandas as pd

REQUIREMENTS = [
    "Provider must be ISO 27001 certified",
    "Platform uses MLflow for MLOps",
    "Deployments on Kubernetes with GitOps",
    "Data hosted in the European Union",
]

def requirement_matrix(hits, requirements=REQUIREMENTS, topk=5):
    rows = []
    for req in requirements:
        for i, h in enumerate(hits[:topk]):
            res = nli_check(h["text"], req)
            rows.append({
                "requirement": req,
                "hit_rank": i + 1,
                "label": res["label"],
                "rationale": res["rationale"],
                "page": h.get("page"),
                "block": h.get("block"),
                "snippet": h["text"][:160].replace("\n", " "),
            })
    df = pd.DataFrame(rows)
    # simple per-requirement verdict: first Yes > Partial > No
    order = {"Yes": 2, "Partial": 1, "No": 0}
    verdict = (df.assign(score=df["label"].map(order))
                 .groupby("requirement")["score"].max()
                 .map({2: "Yes", 1: "Partial", 0: "No"}))
    return df, verdict

df_checks, verdict = requirement_matrix(hits, REQUIREMENTS, topk=5)
print("\nOverall verdict per requirement:\n", verdict)
print("\nSample rows:\n", df_checks.head(6))

# %% 10d) Fit score from NLI labels (0..100)
label_w = {"Yes": 1.0, "Partial": 0.5, "No": 0.0}
fit_score = round(100 * verdict.map(label_w).mean(), 1)
print(f"\nTender fit score (NLI): {fit_score}/100")
```

If it works well, you should see:
```
NLI result: {
  "label": "No",
  "rationale": "The clause does not mention the provider's ISO 27001 certification. Therefore, it cannot be confirmed that the provider is certified."
}

{'label': 'Partial', 'rationale': 'The text does not explicitly state that Adservio is ISO 27001 certified. However, mentioning Kubernetes (K8s) implies a certain level of compliance as it is often used in enterprise environments where such certifications are required.'}

Overall verdict per requirement:
requirement
Data hosted in the European Union        Yes
Deployments on Kubernetes with GitOps    Yes
Platform uses MLflow for MLOps           Yes
Provider must be ISO 27001 certified     Yes
Name: score, dtype: object

Sample rows:
                             requirement  ...                                            snippet
0   Provider must be ISO 27001 certified  ...  Adservio propose une offre MLOps fondée sur ML...
1   Provider must be ISO 27001 certified  ...  Exigence ISO 27001 et hébergement des données ...
2   Provider must be ISO 27001 certified  ...  DevOps CI/CD avec GitLab, ArgoCD, Helm et GitO...
3   Provider must be ISO 27001 certified  ...  SLA attendu 99.9%, RPO 15 minutes, RTO 1 heure...
4         Platform uses MLflow for MLOps  ...  Adservio propose une offre MLOps fondée sur ML...
5         Platform uses MLflow for MLOps  ...  Exigence ISO 27001 et hébergement des données ...

[6 rows x 7 columns]

Tender fit score (NLI): 100.0/100
```
If your Spyder is still using a CPU-only PyTorch wheel (torch.version.cuda is None), fix it cleanly by installing the CUDA build from the pytorch + nvidia channels and avoiding any pip/conda-forge Torch that might override it.
```python
# checking torch with CUDA
import torch
print("CUDA available:", torch.cuda.is_available())
print("torch.version.cuda:", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

# Example of misconfiguration
# ---------------------------
# Conda env           : adservio-raggae
# Torch ver           : 2.7.1
# torch.version.cuda  : None
# CUDA available      : False
# Built with CUDA?    : False
# cuDNN version       : None
# CUDA_VISIBLE_DEVICES: None

# additional check
import subprocess
print(subprocess.check_output(["nvidia-smi"]).decode()[:300])
# You should get something similar to:
# == nvidia-smi == nvidia-smi not callable here: Command '['nvidia-smi',
# '--query-gpu=name,driver_version,cuda_version', '--format=csv,noheader']' returned non-zero exit status 2.
```
Solution: reinstall pytorch
```bash
# 0) Close Spyder

# 1) Purge any CPU Torch left-overs in this env
conda activate adservio-raggae
# remove conda packages
mamba remove -y pytorch torchvision torchaudio cpuonly
# remove any pip wheels that might shadow conda packages
python -m pip uninstall -y torch torchvision torchaudio

# 2) Enforce channel priority (important)
conda config --env --set channel_priority strict

# 3) Install the CUDA build (match your runtime: 12.1)
# Do **not** add `-c conda-forge` to this command; it can pull a CPU build.
mamba install -y -c pytorch -c nvidia \
  pytorch=2.5.* pytorch-cuda=12.1 torchvision torchaudio

# 4) (Optional) ensure Spyder can attach to this env
mamba install -y spyder-kernels

# 5) Launch Spyder from this env
```
Retest
```python
# 6) Re-check in the same Spyder console
import torch, sys
print("Python exe:", sys.executable)
print("Torch:", torch.__version__)
print("torch.version.cuda:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

# You should now see:
# Python exe: /home/olivi/anaconda3/envs/adservio-raggae/bin/python3.12
# Torch: 2.5.1
# torch.version.cuda: 12.1
# CUDA available: True
# GPU: NVIDIA RTX A2000 8GB Laptop GPU
```
Then point Spyder to this interpreter (same path pattern as above).
---
## Note on `nvidia-smi`
Inside Spyder, just run the plain command if you're curious:
```python
import subprocess, shutil
print("nvidia-smi path:", shutil.which("nvidia-smi"))
print(subprocess.check_output(["nvidia-smi"]).decode().splitlines()[0])
```
But **PyTorch CUDA working** is the real goal; `nvidia-smi` availability inside the IDE is optional.
This sequence fixes ~99% of “CUDA False in Spyder” cases (wrong channel, CPU wheel, or kernel mismatch).
Just in case a prior pin drags a CPU build back in, create a fresh env that only uses the correct channels:
```bash
mamba create -n adservio-raggae12 -y -c pytorch -c nvidia -c conda-forge \
  python=3.12 pytorch=2.5.* pytorch-cuda=12.1 torchvision torchaudio \
  faiss-cpu sentence-transformers numpy scipy scikit-learn tqdm \
  pymupdf pypdf uvicorn rich spyder-kernels

mamba activate adservio-raggae12
spyder &
```
`broadcast_to` import error: this means the NumPy install in adservio-raggae is inconsistent (the CPU-only Torch issue is fixed now). Let's fix NumPy cleanly and avoid mixed pip/conda wheels.
Quick check (run in the same Spyder console):

```python
import numpy, sys
print("NumPy:", numpy.__version__, "| path:", numpy.__file__)
```

Fix (close Spyder, then in a terminal):
```bash
mamba activate adservio-raggae

# 1) Remove any pip wheel that might be shadowing conda’s NumPy
python -m pip uninstall -y numpy

# 2) Install a clean conda-forge build compatible with Python 3.12
#    (both 1.26.x and 2.1.x work; 2.1.x is current and stable)
mamba install -y -c conda-forge "numpy=2.1.*" "scipy>=1.11"

# Optional: harmonize key libs to conda-forge to avoid ABI mismatches
mamba install -y -c conda-forge scikit-learn tqdm pymupdf pypdf faiss-cpu sentence-transformers
```

If you prefer staying on NumPy 1.x: use `numpy=1.26.*` (it supports Python 3.12). `broadcast_to` exists in both.
Relaunch & re-test
Start Spyder from the env:
```bash
spyder &
```

In Spyder, run:
```python
import numpy as np
from sentence_transformers import SentenceTransformer
print("NumPy OK:", np.__version__)
m = SentenceTransformer("intfloat/multilingual-e5-small")
print("ST OK. dim:", m.get_sentence_embedding_dimension())

# you should see
# NumPy OK: 1.26.4
# ST OK. dim: 384
```

If it still fails, show me the output of:
```bash
mamba list | egrep 'numpy|scipy|torch|sentence|faiss'
```

and we'll zero in, but in 99% of cases the clean conda-forge NumPy reinstall resolves the `broadcast_to` import error.