Case study · 2026
DeepScholar
AI-powered RAG copilot that converts academic PDFs into a grounded, citable knowledge base.
Overview
DeepScholar is a research assistant that ingests academic PDFs and exposes them as a semantic knowledge base. Users ask questions and get answers tied to real passages, with inline citations they can expand. Ingestion and retrieval are separate services, so each stage is easy to test and scale on its own.
Problem
- Manual literature review is slow and hard to scale across dozens of papers
- Keyword search misses semantic relationships between concepts
- General-purpose LLMs hallucinate citations and fabricate references
- Outputs lack verifiable sourcing, making them unreliable for academic work
Architecture
FastAPI runs a multi-stage ingest: PyMuPDF pulls text, a sentence-aware chunker splits it with configurable overlap, and OpenAI embeddings land in PostgreSQL with pgvector on Supabase. At query time, cosine search over those vectors builds a prompt that only allows answers from retrieved chunks, with structured JSON citations on every claim.
- Structured extraction pipeline: after parse, an LLM step fills a validated JSON schema per paper (title, authors, abstract, methodology, datasets, metrics, limitations) before chunking and embedding. Bad payloads fail at the service boundary and retry with a smaller context window. Extraction runs as its own service, separate from chunking and embedding.
- Global state layer: React Context backed by localStorage rehydration. FileResult[] persists across route changes and hard refreshes. addDocuments deduplicates by filename; removeDocument and clearDocuments provide full store control.
Technical Highlights
- Magic-byte validation rejects non-PDF payloads before any processing begins
- Sentence-aware chunking with configurable overlap preserves cross-boundary context
- RAG prompt contract: model must cite retrieved passages or explicitly decline. No fabrication.
- Citations emitted as structured JSON: source filename, chunk_id, and verbatim passage
- pgvector cosine similarity search with IVFFlat indexing for sub-100 ms P99 retrieval
- Supabase schema designed for multi-document workspaces and per-user isolation
- Async ingestion pipeline keeps the API responsive under concurrent uploads
- Deterministic document fingerprinting prevents duplicate vector entries on re-upload
- LLM-assisted structured extraction enforces a typed JSON schema per document; field-level partial failure does not invalidate full document extraction
Frontend
- Drag-and-drop PDF upload with live ingestion progress feedback
- Per-file upload progress via XHR onprogress, keyed by filename in a reducer map (not one blended percentage)
- Indexed Library table reads from global context, not the last upload response, so it survives remounts and route changes
- Status badges follow the FileResult enum: pending, extracting, chunking, indexed, failed. Failures show error_message inline
- chunks_stored and pages_extracted rendered as secondary metadata per document card
- Chat interface streams answers with inline expandable citation cards
- Each citation surfaces the source passage and document origin
- Research Insights Panel: expandable cards per extraction field (methodology, datasets, metrics, limitations), mounted inline in upload results and on a dedicated document detail page. Collapsed by default; expand state tracked per card per document. Missing fields render a degraded state cell, not an empty render
- Paper comparison at /compare: pick papers from the global store, columns per doc, rows for methodology, datasets, metrics, and limitations. Renders from stored extraction JSON only; no LLM on page load and no new API routes
- Single-workspace UI optimized for focused, distraction-free research sessions
Roadmap
Next phase work is split into six shippable slices. None of them need changes to the ingest schema:
- Hybrid search: BM25 keyword index running in parallel with dense vector retrieval, merged via reciprocal rank fusion
- Reranking model re-scores the merged candidate set before context window construction
- Citation grounding: answer spans without a traceable retrieved chunk are rejected at the output layer
- Research Memory Layer: structured facts stored as a knowledge graph alongside chunk vectors, enabling cross-paper reasoning without re-retrieval
- Query Intelligence Layer: rewrite step decomposes abstract queries into concrete entity terms before hitting the index
- Paper Graph: methods, datasets, and tasks stored as typed nodes with edges defined by co-occurrence, rendered as an interactive force-directed layout
- Evaluation system: recall@k, citation correctness, and LLM-judge faithfulness scoring surfaced on an internal eval dashboard per pipeline version
- Research Assistant Behaviors: conflict detection across retrieved chunks, confidence scoring from retrieval score distribution, output schema extended with a reasoning trace