VeritasRAG

Overview

VeritasRAG is a full-stack, production-grade Retrieval-Augmented Generation platform. Users upload PDF, DOCX, and TXT documents, which are ingested asynchronously through a multi-stage pipeline: text extraction, sliding-window chunking, embedding via Google text-embedding-004, and storage in pgvector. Once indexed, users query documents through a persistent chat interface and receive answers that the generation prompt restricts to retrieved source content, with exact citations showing chunk text, source document, page number, and similarity score.

The system is a three-service architecture: a Next.js 16 (App Router) frontend, a Django 5 backend owning auth, document metadata, chat persistence, and Redis caching, and a FastAPI AI service owning the entire RAG pipeline. All services run in Docker Compose locally and deploy independently to Render and Vercel.

Key Features

Document Management

Upload PDF, DOCX, or TXT files (max 50 MB). The browser uploads directly to Cloudinary using a Django-signed URL, so Cloudinary credentials never reach the client.
Async ingestion via Celery: status moves uploaded → processing → ready (or failed on error), is tracked in PostgreSQL, and is polled by the frontend every 3 seconds.
Deleting a document cascades atomically to its chunks, embeddings, Cloudinary asset, and related chat messages.
Per-document chunk count and processed-at timestamp visible in the document list.
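
The status lifecycle above can be sketched as a small state machine. This is a minimal illustration, not the project's actual code: the names DocumentStatus, ALLOWED_TRANSITIONS, can_transition, and is_terminal are hypothetical, and treating both ready and failed as terminal states is an assumption.

```python
from enum import Enum


class DocumentStatus(str, Enum):
    UPLOADED = "uploaded"
    PROCESSING = "processing"
    READY = "ready"
    FAILED = "failed"


# Assumed transition table: ready and failed are both treated as terminal.
ALLOWED_TRANSITIONS = {
    DocumentStatus.UPLOADED: {DocumentStatus.PROCESSING},
    DocumentStatus.PROCESSING: {DocumentStatus.READY, DocumentStatus.FAILED},
    DocumentStatus.READY: set(),
    DocumentStatus.FAILED: set(),
}


def can_transition(current: DocumentStatus, new: DocumentStatus) -> bool:
    """Return True if moving from `current` to `new` is allowed."""
    return new in ALLOWED_TRANSITIONS[current]


def is_terminal(status: DocumentStatus) -> bool:
    """A polling client can stop once a terminal status is reached."""
    return not ALLOWED_TRANSITIONS[status]
```

Under these assumptions, the 3-second frontend poll would stop as soon as is_terminal returns True for the reported status.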

RAG Pipeline

Extraction: pdfplumber (PDF), python-docx (DOCX), plain read (TXT).
Chunking: sliding window, 512 tokens, 50-token overlap, preserving page number metadata.
Embedding: Google text-embedding-004 → 768-dimension vectors stored in pgvector with HNSW index.
Retrieval: hybrid pgvector cosine ANN (top-20) + PostgreSQL BM25 full-text search (top-20), merged via Reciprocal Rank Fusion.
Reranking: cross-encoder/ms-marco-MiniLM-L-6-v2 reranks the merged top-20 to top-5.
Generation: gemini-2.0-flash with a strict grounding prompt — answers only from retrieved excerpts.
Streaming: token-by-token SSE stream from FastAPI → Django → Next.js Route Handler → browser.
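
The Reciprocal Rank Fusion step in the retrieval stage can be illustrated as follows. This is a sketch under stated assumptions: the function name and signature are hypothetical, and the constant k = 60 is the value commonly used for RRF, not one confirmed by this project.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60, top_n: int = 20
) -> List[str]:
    """Merge several ranked lists of chunk IDs into one list.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used default.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in ordered[:top_n]]
```

A chunk that appears near the top of both the vector and keyword lists outranks one that appears in only a single list, which is the property that makes the merge useful before reranking.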

Intent-Aware Retrieval

A single LLM call performs both intent classification and HyDE (Hypothetical Document Embeddings) generation, which avoids a redundant second round-trip on factual queries.
Intent categories: factual, boolean, definition, comparison, summary, procedural, analytical, troubleshooting, recommendation, out-of-scope.
Standalone query rewriting for follow-up messages using conversation history.
top_k tuned per intent: 20 for comparison/factual, 15 for boolean/definition.
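
The per-intent top_k tuning could be expressed as a simple lookup. The four values below come from the list above; the fallback DEFAULT_TOP_K and the helper name top_k_for are assumptions, since no value is stated for the remaining intents.

```python
# Values for these four intents are taken from the feature list.
TOP_K_BY_INTENT = {
    "comparison": 20,
    "factual": 20,
    "boolean": 15,
    "definition": 15,
}

# Assumption: intents without a stated value fall back to 20.
DEFAULT_TOP_K = 20


def top_k_for(intent: str) -> int:
    """Return how many candidates to retrieve for a classified intent."""
    return TOP_K_BY_INTENT.get(intent, DEFAULT_TOP_K)
```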

Citation Grounding

Every answer is accompanied by citation cards: chunk text, source document, page number, similarity score.
grounding_score (0–1) per answer; warning banner renders if score < 0.6.
If all top-5 similarity scores < 0.75, response explicitly states the documents do not contain sufficient information.
Low-confidence queries are flagged in query_logs (low_confidence) for admin review.
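
The two thresholds above can be sketched as a pair of checks. The thresholds 0.75 and 0.6 are from the list; the function and constant names are hypothetical, and treating an empty result set as insufficient evidence is an assumption.

```python
from typing import List

SIMILARITY_FLOOR = 0.75            # per-chunk similarity threshold
GROUNDING_WARNING_THRESHOLD = 0.6  # per-answer grounding threshold


def has_sufficient_evidence(similarity_scores: List[float]) -> bool:
    """False when every retrieved chunk scores below the floor.

    An empty list also counts as insufficient (an assumption).
    """
    return any(score >= SIMILARITY_FLOOR for score in similarity_scores)


def should_show_warning(grounding_score: float) -> bool:
    """True when the UI should render the low-grounding warning banner."""
    return grounding_score < GROUNDING_WARNING_THRESHOLD
```

When has_sufficient_evidence returns False, the response would state that the documents do not contain sufficient information rather than attempt an answer.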

Chat Interface

Create named chat sessions scoped to one or more documents.
Streaming message rendering — tokens appear progressively as the SSE stream arrives.
Collapsible citation panel per answer.
Full chat history: previous sessions and all messages accessible from sidebar.
Message metadata: latency_ms, retrieval score, cache hit/miss badge.

Caching (Redis, cache-aside)

Full RAG response cached by hash(question + doc_ids) — 1 hour TTL.
Retrieved chunks cached by same key — 30 minute TTL.
Dashboard stats cached per user — 5 minute TTL.
JWT session data — 24 hour TTL.
Cache hit badge visible in UI; cache hit rate shown on dashboard.
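
A minimal sketch of the cache-aside pattern with a hash(question + doc_ids) key. Several details are assumptions: the question normalization (strip and lowercase), sorting doc_ids so their order does not change the key, the "rag:response:" prefix, and the InMemoryCache class, which stands in for Redis so the example runs without a server and does not expire entries.

```python
import hashlib
import json
from typing import Callable, Dict, Iterable, Optional, Tuple

RESPONSE_TTL_SECONDS = 60 * 60  # 1 hour, as listed above


def rag_cache_key(question: str, doc_ids: Iterable[str]) -> str:
    """Derive a stable key from the question and the selected documents."""
    payload = json.dumps(
        {"q": question.strip().lower(), "docs": sorted(doc_ids)},
        separators=(",", ":"),
    )
    return "rag:response:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


class InMemoryCache:
    """Hypothetical Redis stand-in; accepts a timeout but never expires."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._store.get(key)

    def set(self, key: str, value: str, timeout: int) -> None:
        self._store[key] = value


def answer_with_cache(
    cache: InMemoryCache,
    question: str,
    doc_ids: Iterable[str],
    run_pipeline: Callable[[str], str],
) -> Tuple[str, bool]:
    """Cache-aside: return (answer, cache_hit)."""
    key = rag_cache_key(question, doc_ids)
    cached = cache.get(key)
    if cached is not None:
        return cached, True
    answer = run_pipeline(question)  # only runs on a cache miss
    cache.set(key, answer, RESPONSE_TTL_SECONDS)
    return answer, False
```

The returned cache_hit flag is the kind of value that could drive the cache hit badge and the dashboard hit rate.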

Auth & Security

Email/password signup with JWT access (15 min) + refresh (7 day) tokens via SimpleJWT.
Tokens set as httpOnly; Secure; SameSite=Strict cookies exclusively by Next.js Route Handlers — never in localStorage, never in JavaScript-accessible cookies.
Every Django view enforces IsAuthenticated + .filter(user=request.user) row-level ownership.
FastAPI endpoints are internal-only — X-Internal-Key header enforced on every request; only Django calls FastAPI.
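
The X-Internal-Key check might look like the sketch below. The function name is hypothetical, and the use of a constant-time comparison is an assumption about good practice rather than something the source confirms. The example uses a plain dict for headers, which is case-sensitive; a real framework's header object may behave differently.

```python
import hmac
from typing import Mapping


def is_internal_request(headers: Mapping[str, str], expected_key: str) -> bool:
    """Return True only if the request carries the shared internal key.

    hmac.compare_digest compares in constant time, which avoids leaking
    how many leading characters of the key matched.
    """
    provided = headers.get("X-Internal-Key", "")
    if not expected_key:
        # Refuse everything if the service was started without a key.
        return False
    return hmac.compare_digest(provided.encode("utf-8"), expected_key.encode("utf-8"))
```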

Dashboard & Observability

Stats panel: document count, query count, average grounding score, cache hit rate — Redis-cached, 5 minute TTL.
Every query logged: question text, latency_ms, chunk count, top similarity score, grounding score, cache hit, model used.
A keep-alive ping on app mount wakes the Render services from cold start before the first user interaction.
Health check endpoints on both Django (/api/health/) and FastAPI (/health) for Render readiness probes.

Architecture

Three independently deployable services, with Next.js Route Handlers acting as the BFF layer in front of the backend:

Browser
  └── Next.js 16 (Vercel)
        ├── Route Handlers (BFF) — cookie management, auth proxy, SSE proxy, upload signature
        └── App Router pages — dashboard, documents, chat, citations

Next.js Route Handlers
  └── Django 5 (Render Web Service)
        ├── Auth: SimpleJWT signup/login/logout/refresh
        ├── Documents: CRUD, Cloudinary signature, status polling
        ├── Chat: session + message persistence, Redis cache-aside, SSE proxy
        ├── Stats: dashboard aggregates (Redis cached)
        └── Celery dispatcher → Redis broker

Celery Worker (Render Background Worker, same Django image)
  └── POST /ingest → FastAPI

FastAPI (Render Web Service, internal-only)
  ├── POST /ingest — extract → chunk → embed → pgvector write
  ├── POST /query — embed → hybrid search → rerank → generate → SSE stream
  └── GET /health

PostgreSQL + pgvector (Neon)  |  Redis (Upstash)  |  Files (Cloudinary)

Ingestion Flow

POST /api/documents/ (Django)
  → DOCUMENT record (status = uploaded)
  → Celery task dispatched

Celery: process_document(document_id)
  → FastAPI POST /ingest {document_id, storage_key, file_type}
  → fetch file from Cloudinary
  → extract text → chunk (512t, 50 overlap) → embed (text-embedding-004)
  → write CHUNKS + EMBEDDINGS to pgvector
  → return {chunk_count}
  → DOCUMENT.status = ready
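
The chunking step in this flow (512 tokens, 50-token overlap) can be sketched as a sliding window. This is an illustration under assumptions: tokens are passed in as a pre-split list because the project's tokenizer is not specified, chunking is done per page so that each chunk carries a single page_number, and the function name is hypothetical. The output keys mirror the chunks table columns.

```python
from typing import Dict, List


def sliding_window_chunks(
    tokens: List[str], page_number: int, size: int = 512, overlap: int = 50
) -> List[Dict]:
    """Split one page's tokens into overlapping windows."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # each window starts `step` tokens after the last
    chunks: List[Dict] = []
    start = 0
    while start < len(tokens):
        window = tokens[start:start + size]
        chunks.append(
            {
                "chunk_index": len(chunks),
                "content": " ".join(window),
                "token_count": len(window),
                "page_number": page_number,
            }
        )
        if start + size >= len(tokens):
            break  # this window already reached the end of the page
        start += step
    return chunks
```

With these defaults, consecutive chunks share 50 tokens, so a sentence that straddles a window boundary still appears whole in at least one chunk when it is shorter than the overlap.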

Query Flow

user sends message
  → Django: Redis cache check hash(question + doc_ids)
  → cache hit: return in <5ms
  → cache miss: FastAPI POST /query (httpx streaming)
      → embed question (text-embedding-004)
      → pgvector HNSW ANN top-20 + BM25 FTS top-20
      → Reciprocal Rank Fusion merge
      → cross-encoder rerank top-20 → top-5
      → grounding check: all scores < 0.75 → grounding_score = 0.0
      → grounding prompt + gemini-2.0-flash SSE stream
  → Django proxies SSE → Next.js Route Handler → browser
  → on complete: persist MESSAGE + CITATIONS + QUERY_LOG; cache response
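
The streaming leg of this flow relies on Server-Sent Events framing, which can be sketched as a generator. The "data:" and "event:" fields and the blank-line terminator follow the SSE wire format; the JSON payload shape, the "done" event name, and the function name are assumptions for illustration.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Frame generated tokens as Server-Sent Events.

    Each token is JSON-encoded so that newlines inside a token cannot
    break the event framing.
    """
    for token in tokens:
        yield f"data: {json.dumps({'token': token})}\n\n"
    # Assumed end-of-stream marker so the client knows when to stop reading.
    yield "event: done\ndata: {}\n\n"
```

Because each proxy hop only forwards these text frames, Django and the Next.js Route Handler can pass them through without parsing the payload.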

Data Model

Eight PostgreSQL tables plus pgvector:

users — UUID PK, email, hashed password, full name.
documents — user-owned; filename, storage_key, file_type, status, chunk_count.
chunks — document-scoped; chunk_index, content, token_count, page_number.
embeddings — one-to-one with chunks; vector(768), model_name. Decoupled so re-embedding with a new model doesn't touch chunk data.
chat_sessions — user-owned; title, document_ids (jsonb array).
messages — session-scoped; role, content, retrieval_score, grounding_score, latency_ms, cache_hit.
citations — message-scoped; chunk_id FK, similarity_score, citation_order.
query_logs — full observability row per query: question, latency, grounding score, cache hit, model used, low_confidence flag.
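
The chunks/embeddings split can be illustrated with two small record types. This is a sketch, not the project's ORM models: the dataclasses mirror the columns listed above, while the reembed helper and the embed callable are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List
from uuid import UUID, uuid4


@dataclass(frozen=True)
class Chunk:
    id: UUID
    document_id: UUID
    chunk_index: int
    content: str
    token_count: int
    page_number: int


@dataclass(frozen=True)
class Embedding:
    chunk_id: UUID  # one-to-one link back to the chunk
    vector: List[float]
    model_name: str


def reembed(
    chunks: Iterable[Chunk],
    embed: Callable[[str], List[float]],
    model_name: str,
) -> Dict[UUID, Embedding]:
    """Build fresh embeddings for existing chunks without modifying them."""
    return {
        chunk.id: Embedding(chunk.id, embed(chunk.content), model_name)
        for chunk in chunks
    }
```

Switching models then means replacing rows in embeddings only; the chunk text, token counts, and page numbers stay as they were.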

Outcome

VeritasRAG demonstrates a production-grade RAG system — combining hybrid vector + keyword retrieval, cross-encoder reranking, intent-aware query routing, streaming SSE generation, Redis multi-layer caching, and strict citation grounding across a three-service Docker-native architecture. Every answer is traceable to an exact source chunk with page-level provenance, and every architectural boundary is enforced at both the API and database layers.