Building a Production-Grade RAG System from Scratch
LLMRAGFastAPI

Building a Production-Grade RAG System from Scratch

February 28, 202510 min read

Why RAG?

Large language models have a fixed knowledge cutoff and a finite context window. Fine-tuning is expensive, brittle to data drift, and doesn't scale to knowledge that changes frequently. Retrieval-Augmented Generation solves this by giving the model a dynamic, queryable external memory your documents, your database, your organization's knowledge base at inference time.

The idea is simple: before answering a question, retrieve the most relevant passages from your corpus and inject them into the prompt. The model's job shifts from knowing the answer to synthesizing it from retrieved evidence.

Simple in theory. Surprisingly complex in production.

System Architecture Overview

A production RAG pipeline has five distinct stages:

Query
  │
  ▼
[1. Query Processing]     rewrite, expand, classify intent
  │
  ▼
[2. Retrieval]            dense + sparse hybrid search
  │
  ▼
[3. Re-ranking]           cross-encoder scoring
  │
  ▼
[4. Context Assembly]     pack retrieved chunks into prompt
  │
  ▼
[5. Generation]           LLM produces grounded response

Each stage has decisions that materially affect output quality. Let's go through them.

Stage 1: Document Ingestion and Chunking

Before retrieval, you need to embed your documents. This starts with chunking splitting documents into pieces small enough to embed meaningfully but large enough to contain coherent information.

Chunking strategies

Fixed-size chunking splits every N tokens with an overlap of K. Simple, predictable, but semantically unaware a sentence about one topic can span two chunks.

def chunk_fixed(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    tokens = tokenizer.encode(text)
    chunks = []
    for i in range(0, len(tokens), size - overlap):
        chunks.append(tokenizer.decode(tokens[i : i + size]))
    return chunks

Recursive character text splitting (LangChain's default) tries to split on paragraph boundaries first, then sentences, then words. Better semantic coherence in practice.

Semantic chunking embeds every sentence, then splits where cosine similarity between adjacent sentences drops below a threshold. Produces semantically coherent chunks but is slower and harder to tune.

For most document types, recursive splitting with a 512-token size and 10% overlap is a strong default. Adjust based on your document structure shorter chunks for dense technical text, longer for narrative content.

Embedding model selection

The embedding model is the most important decision in your RAG system. A weak embedder produces poor retrievals that no amount of re-ranking can recover.

ModelDimMTEB ScoreNotes
text-embedding-3-small153662.3OpenAI, fast, cheap
all-mpnet-base-v276857.8Open-source, self-hostable
bge-large-en-v1.5102464.2Best open-source for English
e5-mistral-7b409666.6Expensive, very high quality

For the portfolio chatbot I built, all-mpnet-base-v2 via Sentence-Transformers hit a good balance of quality and latency for self-hosted inference.

Stage 2: Vector Storage and Retrieval

ChromaDB setup

ChromaDB is a solid choice for projects under ~1M documents. It runs in-process (no server overhead), persists to disk, and has a clean Python API.

import chromadb
from chromadb.utils import embedding_functions

ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-mpnet-base-v2"
)

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    name="portfolio_knowledge",
    embedding_function=ef,
    metadata={"hnsw:space": "cosine"},
)

Critical parameter: hnsw:space. Default is L2 distance. For text embeddings, cosine similarity is almost always better. Set this at collection creation it cannot be changed after.

Hybrid retrieval

Pure dense retrieval (vector similarity) misses exact keyword matches. Pure sparse retrieval (BM25) misses semantic similarity. Hybrid retrieval combines both:

def hybrid_search(query: str, k: int = 5) -> list[Document]:
    dense_results = collection.query(
        query_texts=[query],
        n_results=k * 2,  # over-fetch for re-ranking
    )
    sparse_results = bm25_index.search(query, k=k * 2)

    # Reciprocal Rank Fusion
    return rrf_merge(dense_results, sparse_results, k=k)

Reciprocal Rank Fusion (RRF) is a parameter-free merging strategy that tends to outperform weighted linear combinations in practice.

Stage 3: Re-ranking

Initial retrieval casts a wide net. Re-ranking scores each candidate more precisely using a cross-encoder a model that jointly processes the query and each document as a pair (unlike bi-encoders which embed them separately).

Cross-encoders are slower (can't pre-compute document embeddings) but significantly more accurate. Apply them only to the top-K from retrieval, not the full corpus.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], k: int = 3) -> list[str]:
    pairs = [(query, doc) for doc in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(scores, candidates), reverse=True)
    return [doc for _, doc in ranked[:k]]

Stage 4: Context Assembly

How you pack retrieved chunks into the prompt matters more than most people think.

Lost in the middle: LLMs attend better to content at the beginning and end of the context window than the middle. Put the most relevant chunk first, not in the middle.

Context window budgeting: Reserve tokens for the system prompt, the user query, and the response. With a 16K context LLM, a reasonable allocation:

  • System prompt: ~500 tokens
  • Retrieved context: ~8000 tokens
  • User query: ~200 tokens
  • Response buffer: ~2000 tokens

Source attribution: Include document metadata (file name, section, page) in each chunk's prefix. The model will naturally include citations if you ask it to.

def assemble_context(chunks: list[Chunk]) -> str:
    parts = []
    for i, chunk in enumerate(chunks, 1):
        parts.append(
            f"[Source {i}: {chunk.metadata['source']}, p.{chunk.metadata.get('page', '?')}]\n"
            f"{chunk.text}"
        )
    return "\n\n---\n\n".join(parts)

Stage 5: FastAPI Orchestration

The full pipeline wraps into a clean FastAPI endpoint:

@app.post("/chat")
async def chat(request: ChatRequest) -> ChatResponse:
    # 1. Retrieve
    candidates = hybrid_search(request.query, k=10)
    
    # 2. Re-rank
    top_chunks = rerank(request.query, candidates, k=3)
    
    # 3. Assemble
    context = assemble_context(top_chunks)
    
    # 4. Generate
    response = await llm_client.complete(
        system=SYSTEM_PROMPT,
        context=context,
        query=request.query,
        history=request.history,
    )
    
    return ChatResponse(answer=response.text, sources=top_chunks)

What Actually Goes Wrong

Retrieval failures are the most common issue. If the right document isn't in the top-K, no prompt engineering will fix the answer. Invest in retrieval quality first add query expansion, tune chunk size, check your embedding model.

Hallucination in the presence of partial context. The model has a retrieved chunk about topic A and is asked about topic B. It may blend the two. Explicit prompting helps: "If the answer is not in the retrieved context, say so."

Chunk boundary artifacts. A sentence starts in chunk N and ends in chunk N+1. Retrieval returns N, the model sees a truncated sentence. Overlap mitigates this; consider sentence-aware splitting.

Latency. Cross-encoder re-ranking adds 200–400ms. Cache embeddings aggressively, run re-ranking in parallel if your framework supports it, and consider async generation with streaming for perceived latency.


The RAG chatbot powering this portfolio's AI widget is built on this exact architecture. Feel free to test it ask it anything about my background, projects, or research.