COMPLETED

VeritasRAG

Production-grade RAG document Q&A platform — upload PDFs, DOCX, and TXT files, then query them with grounded, cited answers powered by hybrid vector search and Gemini


Role

Full Stack Developer

Team

Solo

Stack
TypeScript, Next.js, React, Django, FastAPI, Python, PostgreSQL, pgvector, Google Gemini, Cloudinary, Celery, Redis, Tailwind CSS, shadcn/ui, TanStack Query, Docker
Challenges
  • Proxying SSE stream from FastAPI → Django → Next.js Route Handler → browser without buffering or timeout
  • Hybrid ANN + BM25 reciprocal rank fusion where RRF scores (~0.016) must not be compared against cosine similarity thresholds
  • Cross-encoder reranker (~80MB) loaded at FastAPI startup within Render free-tier 512MB RAM constraint
  • httpOnly JWT cookies set exclusively via Next.js Route Handlers — no token ever reaches browser JavaScript
  • Atomic document deletion cascading Celery ingestion status, pgvector chunks, embeddings, and Cloudinary asset in one transaction
  • Intent-aware retrieval routing — combined intent classification + HyDE in a single LLM call with standalone query rewriting for follow-up messages
Insights
  • pgvector HNSW cosine ANN combined with PostgreSQL BM25 full-text search via reciprocal rank fusion for higher recall on technical documents
  • Cross-encoder reranker (ms-marco-MiniLM-L-6-v2) scoring merged top-20 candidates down to top-5 before generation
  • FastAPI internal-only architecture — Django is the sole public-facing API; FastAPI enforces X-Internal-Key on every endpoint
  • Celery async ingestion pipeline decoupling upload from extract → chunk → embed → pgvector write
  • Redis cache-aside at three layers: full RAG response (1h), retrieved chunks (30min), dashboard stats (5min)
  • Grounding score threshold gating — explicit low-confidence warning when top similarity scores fall below 0.75

Overview

VeritasRAG is a full-stack, production-grade Retrieval-Augmented Generation platform. Users upload PDF, DOCX, and TXT documents, which are ingested asynchronously through a multi-stage pipeline: text extraction, sliding-window chunking, embedding via Google text-embedding-004, and storage in pgvector. Once indexed, users query documents through a persistent chat interface and receive answers constrained to retrieved source content rather than the model's general knowledge, with exact citations showing chunk text, source document, page number, and similarity score.

The system is a three-service architecture: a Next.js 16 (App Router) frontend, a Django 5 backend owning auth, document metadata, chat persistence, and Redis caching, and a FastAPI AI service owning the entire RAG pipeline. All services run in Docker Compose locally and deploy independently to Render and Vercel.


Key Features

Document Management

  • Upload PDF, DOCX, TXT (max 50MB) — browser uploads directly to Cloudinary via Django-signed URL; credentials never reach the client.
  • Async ingestion via Celery: status moves uploaded → processing → ready (or failed on error), is tracked in PostgreSQL, and is polled by the frontend every 3 seconds.
  • Deleting a document atomically cascades to its chunks, embeddings, Cloudinary asset, and related chat messages.
  • Per-document chunk count and processed-at timestamp visible in the document list.
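
The ingestion statuses listed above can be guarded with a small transition table. This is a minimal sketch, not the project's code: it reads the flow as uploaded, then processing, then either ready or failed, and it assumes both end states are terminal (the source does not say whether failed documents can be retried). The names `ALLOWED_TRANSITIONS` and `advance_status` are hypothetical.

```python
# Hypothetical transition table for the document status field.
# Assumption: "ready" and "failed" are terminal; retries are not modelled.
ALLOWED_TRANSITIONS = {
    "uploaded": {"processing"},
    "processing": {"ready", "failed"},
    "ready": set(),
    "failed": set(),
}


def advance_status(current: str, new: str) -> str:
    """Return the new status, or raise if the move is not allowed."""
    if current not in ALLOWED_TRANSITIONS:
        raise ValueError(f"unknown status: {current!r}")
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current!r} -> {new!r}")
    return new


status = advance_status("uploaded", "processing")
status = advance_status(status, "ready")
```

A guard like this keeps a late or duplicated Celery callback from moving a finished document back to processing.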

RAG Pipeline

  • Extraction: pdfplumber (PDF), python-docx (DOCX), plain read (TXT).
  • Chunking: sliding window, 512 tokens, 50-token overlap, preserving page number metadata.
  • Embedding: Google text-embedding-004 → 768-dimension vectors stored in pgvector with HNSW index.
  • Retrieval: hybrid pgvector cosine ANN (top-20) + PostgreSQL BM25 full-text search (top-20), merged via Reciprocal Rank Fusion.
  • Reranking: cross-encoder/ms-marco-MiniLM-L-6-v2 reranks the merged top-20 to top-5.
  • Generation: gemini-2.0-flash with a strict grounding prompt — answers only from retrieved excerpts.
  • Streaming: token-by-token SSE stream from FastAPI → Django → Next.js Route Handler → browser.
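
The Reciprocal Rank Fusion step in the retrieval bullet can be sketched as follows. This is an illustration under stated assumptions rather than the project's implementation: the constant `k=60` is the conventional default (it matches the ~0.016 score scale mentioned under Challenges, since 1/61 ≈ 0.0164), and the function name and chunk ids are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=20):
    """Fuse several ranked lists of chunk ids into one list of (id, score).

    Each list contributes 1 / (k + rank) for every id it contains,
    with rank starting at 1. Only ranks matter, so the cosine
    similarities and keyword scores never need a common scale.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    merged = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return merged[:top_n]


# Hypothetical result lists from the ANN and full-text searches.
ann_hits = ["c1", "c2", "c3"]
bm25_hits = ["c3", "c1", "c4"]
fused = reciprocal_rank_fusion([ann_hits, bm25_hits])
```

Because a chunk that tops a single list scores only about 0.016, fused scores are useful for ordering candidates but should not be checked against a cosine threshold such as 0.75.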

Intent-Aware Retrieval

  • Single LLM call combines intent classification + HyDE (Hypothetical Document Embeddings) to avoid a redundant round-trip on factual queries.
  • Intent categories: factual, boolean, definition, comparison, summary, procedural, analytical, troubleshooting, recommendation, out-of-scope.
  • Standalone query rewriting for follow-up messages using conversation history.
  • top_k tuned per intent: 20 for comparison/factual, 15 for boolean/definition.
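
One way to picture the routing above is to parse the single LLM reply into a decision object and look up `top_k` from the intent. The sketch below assumes the model returns JSON with `intent` and `hyde_passage` fields; that schema, the fallback to `factual`, and the default `top_k` for intents the list above does not cover are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass

INTENTS = {
    "factual", "boolean", "definition", "comparison", "summary",
    "procedural", "analytical", "troubleshooting", "recommendation",
    "out-of-scope",
}

# Values stated in the source; other intents use DEFAULT_TOP_K (assumed).
TOP_K_BY_INTENT = {"comparison": 20, "factual": 20, "boolean": 15, "definition": 15}
DEFAULT_TOP_K = 20


@dataclass(frozen=True)
class RoutingDecision:
    intent: str
    hyde_passage: str  # hypothetical answer text to embed for retrieval
    top_k: int


def parse_routing(raw: str) -> RoutingDecision:
    """Turn the combined classification + HyDE reply into a decision."""
    payload = json.loads(raw)
    intent = payload.get("intent", "")
    if intent not in INTENTS:
        intent = "factual"  # assumed fallback for unrecognised labels
    return RoutingDecision(
        intent=intent,
        hyde_passage=payload.get("hyde_passage", ""),
        top_k=TOP_K_BY_INTENT.get(intent, DEFAULT_TOP_K),
    )


decision = parse_routing('{"intent": "boolean", "hyde_passage": "Yes, it does."}')
```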

Citation Grounding

  • Every answer accompanied by citation cards: chunk text, source document, page number, similarity score.
  • grounding_score (0–1) per answer; warning banner renders if score < 0.6.
  • If all top-5 similarity scores < 0.75, response explicitly states the documents do not contain sufficient information.
  • Low-confidence queries logged to QUERY_LOGS for admin review.
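
The two thresholds above (0.75 on chunk similarity, 0.6 on the answer's grounding score) can be combined in one gate. This is a sketch under assumptions: how `grounding_score` is computed is not described here, so it is passed in as an input, and treating any answer that shows the warning as low-confidence for logging is a guess.

```python
SIMILARITY_FLOOR = 0.75   # every top chunk below this: insufficient evidence
WARNING_THRESHOLD = 0.6   # grounding_score below this: show warning banner


def grounding_decision(similarities, grounding_score):
    """Apply both gates and return flags for the response and the log row."""
    # all() over an empty list is True, so "no chunks retrieved" is
    # also treated as insufficient evidence.
    insufficient = all(score < SIMILARITY_FLOOR for score in similarities)
    if insufficient:
        grounding_score = 0.0
    show_warning = grounding_score < WARNING_THRESHOLD
    return {
        "insufficient": insufficient,
        "grounding_score": grounding_score,
        "show_warning": show_warning,
        "low_confidence": show_warning,  # assumption: same condition is logged
    }


weak = grounding_decision([0.70, 0.62, 0.55, 0.41, 0.30], grounding_score=0.9)
strong = grounding_decision([0.91, 0.72, 0.66], grounding_score=0.8)
```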

Chat Interface

  • Create named chat sessions scoped to one or more documents.
  • Streaming message rendering — tokens appear progressively as the SSE stream arrives.
  • Collapsible citation panel per answer.
  • Full chat history: previous sessions and all messages accessible from sidebar.
  • Message metadata: latency_ms, retrieval score, cache hit/miss badge.

Caching (Redis, cache-aside)

  • Full RAG response cached by hash(question + doc_ids) — 1 hour TTL.
  • Retrieved chunks cached by same key — 30 minute TTL.
  • Dashboard stats cached per user — 5 minute TTL.
  • JWT session data — 24 hour TTL.
  • Cache hit badge visible in UI; cache hit rate shown on dashboard.
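
A cache-aside lookup keyed by hash(question + doc_ids) might look like the sketch below. The key normalisation (trimming and lowercasing the question, sorting document ids), the key prefix, and the `InMemoryCache` stand-in are assumptions; the stand-in exposes `get` and `setex` so that a redis-py client could be swapped in with the same calls.

```python
import hashlib
import json
import time

TTL_RESPONSE = 60 * 60  # full RAG response, 1 hour
TTL_CHUNKS = 30 * 60    # retrieved chunks, 30 minutes


def cache_key(prefix, question, doc_ids):
    """Build a deterministic key from the question and document ids."""
    payload = json.dumps(
        {"q": question.strip().lower(), "docs": sorted(doc_ids)},
        separators=(",", ":"),
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"{prefix}:{digest}"


class InMemoryCache:
    """Test stand-in offering the get/setex subset of a Redis client."""

    def __init__(self):
        self._data = {}

    def get(self, name):
        entry = self._data.get(name)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._data[name]
            return None
        return value

    def setex(self, name, ttl_seconds, value):
        self._data[name] = (value, time.monotonic() + ttl_seconds)


def cache_aside(cache, key, ttl_seconds, compute):
    """Return (value, cache_hit); compute and store only on a miss."""
    cached = cache.get(key)
    if cached is not None:
        return cached, True
    value = compute()
    cache.setex(key, ttl_seconds, value)
    return value, False


cache = InMemoryCache()
key = cache_key("rag", "What is the refund policy?", ["doc-b", "doc-a"])
first = cache_aside(cache, key, TTL_RESPONSE, lambda: "answer text")
second = cache_aside(cache, key, TTL_RESPONSE, lambda: "recomputed")
```

The boolean in the returned pair is what would feed the cache hit badge and the dashboard hit rate.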

Auth & Security

  • Email/password signup with JWT access (15 min) + refresh (7 day) tokens via SimpleJWT.
  • Tokens set as httpOnly; Secure; SameSite=Strict cookies exclusively by Next.js Route Handlers — never in localStorage, never in JavaScript-accessible cookies.
  • Every Django view enforces IsAuthenticated + .filter(user=request.user) row-level ownership.
  • FastAPI endpoints are internal-only — X-Internal-Key header enforced on every request; only Django calls FastAPI.
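
The X-Internal-Key check can be reduced to a constant-time comparison of the supplied header against the configured secret. The sketch below is framework-agnostic and uses only the standard library; in the FastAPI service it would presumably sit in a dependency or middleware. The function name and the plain-dict header shape are assumptions.

```python
import hmac


def is_internal_request(headers, expected_key):
    """Return True only when X-Internal-Key matches the configured secret."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    supplied = normalised.get("x-internal-key", "")
    if not expected_key:
        return False  # refuse everything if the secret is unset
    # compare_digest avoids leaking how many leading bytes matched.
    return hmac.compare_digest(
        supplied.encode("utf-8"), expected_key.encode("utf-8")
    )


allowed = is_internal_request({"X-Internal-Key": "s3cret"}, "s3cret")
denied = is_internal_request({"X-Internal-Key": "guess"}, "s3cret")
```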

Dashboard & Observability

  • Stats panel: document count, query count, average grounding score, cache hit rate — Redis-cached, 5 minute TTL.
  • Every query logged: question text, latency_ms, chunk count, top similarity score, grounding score, cache hit, model used.
  • Keep-alive ping on app mount warms Render cold-start before first user interaction.
  • Health check endpoints on both Django (/api/health/) and FastAPI (/health) for Render readiness probes.

Architecture

Three independently deployable services behind a Next.js BFF layer:

Browser
  └── Next.js 16 (Vercel)
        ├── Route Handlers (BFF) — cookie management, auth proxy, SSE proxy, upload signature
        └── App Router pages — dashboard, documents, chat, citations

Next.js Route Handlers
  └── Django 5 (Render Web Service)
        ├── Auth: SimpleJWT signup/login/logout/refresh
        ├── Documents: CRUD, Cloudinary signature, status polling
        ├── Chat: session + message persistence, Redis cache-aside, SSE proxy
        ├── Stats: dashboard aggregates (Redis cached)
        └── Celery dispatcher → Redis broker

Celery Worker (Render Background Worker, same Django image)
  └── POST /ingest → FastAPI

FastAPI (Render Web Service, internal-only)
  ├── POST /ingest — extract → chunk → embed → pgvector write
  ├── POST /query — embed → hybrid search → rerank → generate → SSE stream
  └── GET /health

PostgreSQL + pgvector (Neon)  |  Redis (Upstash)  |  Files (Cloudinary)

Ingestion Flow

POST /api/documents/ (Django)
  → DOCUMENT record (status = uploaded)
  → Celery task dispatched

Celery: process_document(document_id)
  → FastAPI POST /ingest {document_id, storage_key, file_type}
  → fetch file from Cloudinary
  → extract text → chunk (512t, 50 overlap) → embed (text-embedding-004)
  → write CHUNKS + EMBEDDINGS to pgvector
  → return {chunk_count}
  → DOCUMENT.status = ready
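
The chunking step in this flow (512 tokens with a 50-token overlap, keeping page numbers) can be sketched as below. The tokenizer is not specified in the source, so the function takes an already tokenized list plus a parallel list of page numbers; both inputs and the choice to tag each chunk with the page of its first token are assumptions.

```python
def sliding_window_chunks(tokens, page_numbers, size=512, overlap=50):
    """Split tokens into overlapping windows, keeping page metadata.

    page_numbers[i] is the page that tokens[i] came from.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    if len(tokens) != len(page_numbers):
        raise ValueError("tokens and page_numbers must align")

    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + size]
        chunks.append({
            "chunk_index": len(chunks),
            "tokens": window,
            "token_count": len(window),
            "page_number": page_numbers[start],  # page of the first token
        })
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Hypothetical 1000-token document: 600 tokens on page 1, 400 on page 2.
tokens = [f"t{i}" for i in range(1000)]
pages = [1] * 600 + [2] * 400
chunks = sliding_window_chunks(tokens, pages)
```

With these defaults each window starts 462 tokens after the previous one, so consecutive chunks share exactly 50 tokens.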

Query Flow

user sends message
  → Django: Redis cache check hash(question + doc_ids)
  → cache hit: return in <5ms
  → cache miss: FastAPI POST /query (httpx streaming)
      → embed question (text-embedding-004)
      → pgvector HNSW ANN top-20 + BM25 FTS top-20
      → Reciprocal Rank Fusion merge
      → cross-encoder rerank top-20 → top-5
      → grounding check: all scores < 0.75 → grounding_score = 0.0
      → grounding prompt + gemini-2.0-flash SSE stream
  → Django proxies SSE → Next.js Route Handler → browser
  → on complete: persist MESSAGE + CITATIONS + QUERY_LOG; cache response
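
For the streaming leg of this flow, each hop forwards Server-Sent Events frames: one or more `data:` lines ended by a blank line. The sketch below shows only the framing. The JSON payload shape and the `done` event name are hypothetical, and the real service would wrap such a generator in its framework's streaming response.

```python
import json


def sse_event(data, event=None):
    """Format one SSE frame; multi-line data becomes several data: lines."""
    lines = []
    if event:
        lines.append(f"event: {event}")
    for line in data.splitlines() or [""]:
        lines.append(f"data: {line}")
    return "\n".join(lines) + "\n\n"  # blank line terminates the frame


def stream_answer(tokens, citations):
    """Yield one frame per token, then a final frame with the citations."""
    for token in tokens:
        yield sse_event(json.dumps({"token": token}))
    yield sse_event(json.dumps({"citations": citations}), event="done")


frames = list(stream_answer(["Hel", "lo"], [{"chunk_id": 1}]))
```

A proxy that passes these frames through unchanged and flushes after each one keeps tokens arriving progressively in the browser.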

Data Model

Eight PostgreSQL tables plus pgvector:

  • users — UUID PK, email, hashed password, full name.
  • documents — user-owned; filename, storage_key, file_type, status, chunk_count.
  • chunks — document-scoped; chunk_index, content, token_count, page_number.
  • embeddings: one-to-one with chunks; vector(768), model_name. Decoupled so re-embedding with a new model doesn't touch chunk data.
  • chat_sessions — user-owned; title, document_ids (jsonb array).
  • messages — session-scoped; role, content, retrieval_score, grounding_score, latency_ms, cache_hit.
  • citations — message-scoped; chunk_id FK, similarity_score, citation_order.
  • query_logs — full observability row per query: question, latency, grounding score, cache hit, model used, low_confidence flag.

Outcome

VeritasRAG demonstrates a production-grade RAG system — combining hybrid vector + keyword retrieval, cross-encoder reranking, intent-aware query routing, streaming SSE generation, Redis multi-layer caching, and strict citation grounding across a three-service Docker-native architecture. Every answer is traceable to an exact source chunk with page-level provenance, and every architectural boundary is enforced at both the API and database layers.