VeritasRAG
Production-grade RAG document Q&A platform — upload PDFs, DOCX, and TXT files, then query them with grounded, cited answers powered by hybrid vector search and Gemini
Role
Full Stack Developer
Team
Solo
- Proxying SSE stream from FastAPI → Django → Next.js Route Handler → browser without buffering or timeout
- Hybrid ANN + BM25 reciprocal rank fusion where RRF scores (~0.016) must not be compared against cosine similarity thresholds
- Cross-encoder reranker (~80MB) loaded at FastAPI startup within Render free-tier 512MB RAM constraint
- httpOnly JWT cookies set exclusively via Next.js Route Handlers — no token ever reaches browser JavaScript
- Atomic document deletion cascading Celery ingestion status, pgvector chunks, embeddings, and Cloudinary asset in one transaction
- Intent-aware retrieval routing — combined intent classification + HyDE in a single LLM call with standalone query rewriting for follow-up messages
- pgvector HNSW cosine ANN combined with PostgreSQL BM25 full-text search via reciprocal rank fusion for higher recall on technical documents
- Cross-encoder reranker (ms-marco-MiniLM-L-6-v2) scoring merged top-20 candidates down to top-5 before generation
- FastAPI internal-only architecture — Django is the sole public-facing API; FastAPI enforces X-Internal-Key on every endpoint
- Celery async ingestion pipeline decoupling upload from extract → chunk → embed → pgvector write
- Redis cache-aside at three layers: full RAG response (1h), retrieved chunks (30min), dashboard stats (5min)
- Grounding score threshold gating — explicit low-confidence warning when top similarity scores fall below 0.75
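The reciprocal rank fusion mentioned above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the function name, the sample chunk IDs, and the constant `k = 60` are assumptions. The value 60 is a common default and is consistent with the ~0.016 figure quoted above, since 1 / (60 + 1) ≈ 0.0164.

```python
from collections import defaultdict

RRF_K = 60  # assumed smoothing constant; 1 / (60 + 1) is roughly 0.0164


def reciprocal_rank_fusion(rankings, k=RRF_K):
    """Fuse several ranked lists of chunk IDs into one list sorted by RRF score.

    Each chunk receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1 for the best hit in that list.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical hits from the vector (ANN) and keyword (BM25) retrievers.
ann_hits = ["c7", "c2", "c9"]
bm25_hits = ["c2", "c4", "c7"]
fused = reciprocal_rank_fusion([ann_hits, bm25_hits])
```

With two input lists, no fused score can exceed 2 / 61 ≈ 0.033, which is why a cosine-style cutoff such as 0.75 is meaningless when applied to RRF scores. Any similarity threshold has to be checked against the cosine scores themselves.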
Overview
VeritasRAG is a full-stack, production-grade Retrieval-Augmented Generation platform. Users upload PDF, DOCX, and TXT documents, which are ingested asynchronously through a multi-stage pipeline — text extraction, sliding-window chunking, embedding via Google text-embedding-004, and storage in pgvector. Once indexed, users query documents through a persistent chat interface and receive answers grounded strictly in retrieved source content — never hallucinated — with exact citations showing chunk text, source document, page number, and similarity score.
The system is a three-service architecture: a Next.js 16 (App Router) frontend, a Django 5 backend owning auth, document metadata, chat persistence, and Redis caching, and a FastAPI AI service owning the entire RAG pipeline. All services run in Docker Compose locally and deploy independently to Render and Vercel.
Key Features
Document Management
- Upload PDF, DOCX, TXT (max 50MB) — browser uploads directly to Cloudinary via Django-signed URL; credentials never reach the client.
- Async ingestion via Celery: status moves through uploaded → processing → ready (or failed), tracked in PostgreSQL and polled by the frontend every 3 seconds.
- Delete document cascades chunk deletion, embedding deletion, Cloudinary asset removal, and related chat messages atomically.
- Per-document chunk count and processed-at timestamp visible in the document list.
RAG Pipeline
- Extraction: pdfplumber (PDF), python-docx (DOCX), plain read (TXT).
- Chunking: sliding window, 512 tokens, 50-token overlap, preserving page number metadata.
- Embedding: Google text-embedding-004 → 768-dimension vectors stored in pgvector with HNSW index.
- Retrieval: hybrid pgvector cosine ANN (top-20) + PostgreSQL BM25 full-text search (top-20), merged via Reciprocal Rank Fusion.
- Reranking: cross-encoder/ms-marco-MiniLM-L-6-v2 reranks the merged top-20 to top-5.
- Generation: gemini-2.0-flash with a strict grounding prompt that answers only from retrieved excerpts.
- Streaming: token-by-token SSE stream from FastAPI → Django → Next.js Route Handler → browser.
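The chunking step in the list above (512-token window, 50-token overlap, page number kept per chunk) might look roughly like the following sketch. Several details are assumptions made for illustration: tokens are approximated by whitespace splitting rather than a real tokenizer, each chunk is tagged with the page on which it starts, and the names `Chunk` and `chunk_pages` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_index: int
    content: str
    token_count: int
    page_number: int  # page on which the chunk starts (an assumption)


def chunk_pages(pages, window=512, overlap=50):
    """Split (page_number, text) pairs into overlapping sliding-window chunks."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    # Flatten every page into (token, page_number) pairs so windows may span pages.
    tokens = [(tok, page_no) for page_no, text in pages for tok in text.split()]
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        piece = tokens[start:start + window]
        chunks.append(
            Chunk(
                chunk_index=len(chunks),
                content=" ".join(tok for tok, _ in piece),
                token_count=len(piece),
                page_number=piece[0][1],
            )
        )
        if start + window >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks
```

Each window advances by `window - overlap` tokens, so consecutive chunks share exactly `overlap` tokens. That shared margin is what keeps a sentence cut at a chunk boundary retrievable from either side.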
Intent-Aware Retrieval
- Single LLM call combines intent classification + HyDE (hypothetical document embedding) to avoid a redundant round-trip on factual queries.
- Intent categories: factual, boolean, definition, comparison, summary, procedural, analytical, troubleshooting, recommendation, out-of-scope.
- Standalone query rewriting for follow-up messages using conversation history.
- top_k tuned per intent: 20 for comparison/factual, 15 for boolean/definition.
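The per-intent top_k values above amount to a small lookup table. The sketch below is illustrative only: the table and function names are hypothetical, and the fallback of 20 for intents the list does not mention is an assumption, not a documented setting.

```python
# Values stated in the feature list above.
TOP_K_BY_INTENT = {
    "comparison": 20,
    "factual": 20,
    "boolean": 15,
    "definition": 15,
}

# Assumed fallback for intents whose top_k is not documented here.
DEFAULT_TOP_K = 20


def top_k_for(intent: str) -> int:
    """Return how many candidates to retrieve for a classified intent."""
    return TOP_K_BY_INTENT.get(intent, DEFAULT_TOP_K)
```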
Citation Grounding
- Every answer accompanied by citation cards: chunk text, source document, page number, similarity score.
- grounding_score (0–1) per answer; warning banner renders if score < 0.6.
- If all top-5 similarity scores < 0.75, response explicitly states the documents do not contain sufficient information.
- Low-confidence queries logged to QUERY_LOGS for admin review.
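The two thresholds above can be combined into a single gating decision. The following is a sketch under stated assumptions: the names `GroundingDecision` and `assess_grounding` are hypothetical, and treating either condition as "low confidence" for logging purposes is a guess, since the exact logging rule is not spelled out here.

```python
from dataclasses import dataclass

GROUNDING_WARNING_THRESHOLD = 0.6  # banner threshold from the list above
SIMILARITY_FLOOR = 0.75            # per-chunk cosine similarity floor


@dataclass
class GroundingDecision:
    insufficient_context: bool  # answer should say the documents lack the information
    show_warning: bool          # UI should render the low-confidence banner
    log_low_confidence: bool    # assumed rule: log when either condition holds


def assess_grounding(grounding_score, top_similarities):
    """Apply both gates to an answer's grounding score and its top cosine similarities.

    top_similarities must be cosine similarities, not RRF scores.
    An empty list counts as insufficient context.
    """
    top5 = list(top_similarities)[:5]
    insufficient = all(score < SIMILARITY_FLOOR for score in top5)
    warning = grounding_score < GROUNDING_WARNING_THRESHOLD
    return GroundingDecision(insufficient, warning, insufficient or warning)
```

For example, `assess_grounding(0.5, [0.7, 0.6, 0.5])` trips both gates, while `assess_grounding(0.9, [0.9, 0.8])` trips neither.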
Chat Interface
- Create named chat sessions scoped to one or more documents.
- Streaming message rendering — tokens appear progressively as the SSE stream arrives.
- Collapsible citation panel per answer.
- Full chat history: previous sessions and all messages accessible from sidebar.
- Message metadata: latency_ms, retrieval score, cache hit/miss badge.
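Progressive rendering depends on each token being emitted as its own SSE frame the moment it is generated. A minimal sketch of that framing follows, using only the standard library. The event names `token` and `done` and both function names are hypothetical; in a real service a generator like this would be handed to the framework's streaming response type.

```python
from typing import Iterable, Iterator, Optional


def format_sse(data: str, event: Optional[str] = None) -> str:
    """Encode one Server-Sent Events frame; a blank line terminates the frame."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # Multi-line payloads need one "data:" field per line.
    lines.extend(f"data: {line}" for line in data.split("\n"))
    return "\n".join(lines) + "\n\n"


def token_stream(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one frame per token as it arrives, then a final 'done' frame."""
    for token in tokens:
        yield format_sse(token, event="token")
    yield format_sse("", event="done")
```

Because `token_stream` is a generator, nothing is pulled from the upstream model until the consumer asks for the next frame, so no hop in the chain needs to hold the full answer in memory.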
Caching (Redis, cache-aside)
- Full RAG response cached by hash(question + doc_ids), 1 hour TTL.
- Retrieved chunks cached by same key, 30 minute TTL.
- Dashboard stats cached per user — 5 minute TTL.
- JWT session data — 24 hour TTL.
- Cache hit badge visible in UI; cache hit rate shown on dashboard.
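The cache-aside pattern and the `hash(question + doc_ids)` key described above could be sketched as follows. To keep the example self-contained, an in-memory dictionary stands in for Redis. The key normalization (trimmed, lowercased question and sorted document IDs), the SHA-256 choice, and all names are assumptions for illustration.

```python
import hashlib
import json
import time

RESPONSE_TTL_SECONDS = 3600  # full RAG response: 1 hour
CHUNKS_TTL_SECONDS = 1800    # retrieved chunks: 30 minutes


def cache_key(prefix, question, doc_ids):
    """Derive a stable key from the question and the selected document IDs."""
    payload = json.dumps(
        {"q": question.strip().lower(), "docs": sorted(doc_ids)},
        separators=(",", ":"),
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"{prefix}:{digest}"


class InMemoryCache:
    """Tiny stand-in for Redis with per-key expiry (illustration only)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._data[key]
            return None
        return value

    def set(self, key, value, ttl):
        self._data[key] = (value, time.monotonic() + ttl)


def get_or_compute(cache, key, ttl, compute):
    """Cache-aside read: return (value, cache_hit), computing only on a miss."""
    cached = cache.get(key)
    if cached is not None:
        return cached, True
    value = compute()
    cache.set(key, value, ttl)
    return value, False
```

The boolean returned alongside the value is the kind of signal that could drive a cache hit badge and a hit-rate statistic.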
Auth & Security
- Email/password signup with JWT access (15 min) + refresh (7 day) tokens via SimpleJWT.
- Tokens set as httpOnly; Secure; SameSite=Strict cookies exclusively by Next.js Route Handlers, never in localStorage and never in JavaScript-accessible cookies.
- Every Django view enforces IsAuthenticated + .filter(user=request.user) row-level ownership.
- FastAPI endpoints are internal-only: X-Internal-Key header enforced on every request; only Django calls FastAPI.
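The X-Internal-Key check above reduces to comparing a request header against a shared secret. Below is a framework-free sketch, assuming the secret is supplied by configuration. The function name is hypothetical, and in the real service this logic would presumably sit in a dependency or middleware that rejects the request when the check fails.

```python
import hmac


def is_internal_request(headers, expected_key):
    """Return True only if the request carries the expected X-Internal-Key value.

    Header names are matched case-insensitively, and the comparison is
    constant-time so response timing does not leak how much of the key matched.
    """
    if not expected_key:
        return False  # refuse everything when no secret is configured
    normalized = {name.lower(): value for name, value in headers.items()}
    supplied = normalized.get("x-internal-key", "")
    return hmac.compare_digest(supplied.encode("utf-8"), expected_key.encode("utf-8"))
```

Failing closed when no secret is configured avoids the case where an empty environment variable silently matches an empty header.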
Dashboard & Observability
- Stats panel: document count, query count, average grounding score, cache hit rate — Redis-cached, 5 minute TTL.
- Every query logged: question text, latency_ms, chunk count, top similarity score, grounding score, cache hit, model used.
- Keep-alive ping on app mount warms Render cold-start before first user interaction.
- Health check endpoints on both Django (/api/health/) and FastAPI (/health) for Render readiness probes.
Architecture
Three independently deployable services behind a Next.js BFF layer.
Ingestion Flow
Query Flow
Data Model
Eight PostgreSQL tables plus pgvector:
- users: UUID PK, email, hashed password, full name.
- documents: user-owned; filename, storage_key, file_type, status, chunk_count.
- chunks: document-scoped; chunk_index, content, token_count, page_number.
- embeddings: one-to-one with chunks; vector(768), model_name. Decoupled so that re-embedding with a new model doesn't touch chunk data.
- chat_sessions: user-owned; title, document_ids (jsonb array).
- messages: session-scoped; role, content, retrieval_score, grounding_score, latency_ms, cache_hit.
- citations: message-scoped; chunk_id FK, similarity_score, citation_order.
- query_logs: full observability row per query: question, latency, grounding score, cache hit, model used, low_confidence flag.
Outcome
VeritasRAG demonstrates a production-grade RAG system — combining hybrid vector + keyword retrieval, cross-encoder reranking, intent-aware query routing, streaming SSE generation, Redis multi-layer caching, and strict citation grounding across a three-service Docker-native architecture. Every answer is traceable to an exact source chunk with page-level provenance, and every architectural boundary is enforced at both the API and database layers.
