Welcome back to this series on Retrieval Augmented Generation (RAG). In our first post, “Grounding LLMs: An Introduction to Retrieval Augmented Generation (RAG)”, we laid the groundwork, explaining what RAG is, why it’s a helpful extension for Large Language Models (LLMs), and the fundamental benefits it brings – like reducing hallucinations, enabling access to up-to-date information, and providing domain-specific knowledge. We saw how RAG connects LLMs to external knowledge sources, allowing them to generate more accurate, relevant, and trustworthy responses.


This post will peel back the layers of a typical RAG system, dissecting its core components. We’ll explore the intricacies of each stage, from the initial ingestion of raw data to the final synthesis of an answer. Think of it as opening the hood of a high-performance engine to understand how each part contributes to the overall power and efficiency.


Why bother with such a deep dive? Because the effectiveness of a RAG system isn’t determined by a single monolithic block; it’s the sum of its parts. This post aims to give you a comprehensive overview of those individual parts, the choices each of them offers, and the trade-offs involved.


We’ll embark on a journey through the typical RAG pipeline, examining each critical stage:

  1. Data Ingestion
  2. Representation
  3. Retrieval & Refinement
  4. Answer Generation
  5. Advanced Considerations


Let’s get started!

Section 1: Data Ingestion – Laying the Foundation with Quality Content

The old adage “garbage in, garbage out” also holds true for RAG systems. The quality of your data ingestion process directly impacts everything downstream.

File Parsing: The First Hurdle – Extracting Clean Text and Data

RAG systems rarely deal with just plain .txt files. Your knowledge base might consist of PDFs, Word documents, PowerPoint presentations, HTML pages, structured data from databases, and more. Each format presents unique parsing challenges.

Common Parsing Challenges and Solutions:

  • Scanned Documents: Extracting text from scanned PDFs or images requires Optical Character Recognition (OCR). Two main approaches exist: dedicated OCR Services/Software and Language-Vision Models (LVMs) / Multimodal Models. LVMs like GPT-4V or Gemini can perform OCR, often with a better understanding of context and complex layouts. Documents often contain crucial layout information (e.g., bold headers, sender/receiver addresses on an envelope, multi-column text). LVMs excel here because they can interpret the meaning and context of the extracted text alongside its visual presentation. For instance, if an LVM sees “Date: ” at the top of a letter, it understands that a date format (like “2025-05-04”) is expected, improving accuracy for ambiguous characters that a traditional OCR might misread.
    Performance and Costs: Dedicated OCR can be fast and cost-effective for clean, simple documents. LVMs might be slower and more expensive per document if used solely for OCR but can offer superior accuracy on complex or degraded documents due to their contextual understanding.
    Processing for LLMs: The extracted information, including layout cues, needs to be encoded for downstream LLM consumption. For example, a bold header might be translated into Markdown (“## Header Text”), table structures preserved, and reading order from columns correctly linearized. Capturing this structural and semantic information accurately during parsing is vital for effective RAG.
  • Multi-Column Layouts: Academic papers, newsletters, and some reports often use multi-column layouts. A naive text extraction might interleave text from different columns, destroying the logical flow. Parsers need to be sophisticated enough to recognize column boundaries and reconstruct the correct reading order, often leveraging the layout information from advanced OCR.
  • Presentation Files (e.g., PowerPoint): Extracting content from presentations isn’t just about grabbing text from slides. You also need to consider how to handle embedded objects or diagrams (which we’ll touch on in Multimodal RAG). The order of text boxes on a slide can also be ambiguous without layout analysis.
  • Complex Structures (e.g., Headers, Footers): Documents often contain headers, footers, page numbers, tables of contents, and other boilerplate text that might be irrelevant or even detrimental if included in chunks. Effective parsers should identify and optionally exclude such elements, again potentially using layout cues.
  • Tables: Tables are information-rich but require careful thought on representation for an LLM. Should they be converted to Markdown tables, which preserves some structure and is generally LLM-friendly? Or should complex tables use HTML tags (e.g., <table>, <tr>, <td>) for better structural fidelity? Another approach is to flatten tables into descriptive sentences (e.g., “For item X, the value in column Y is Z”), making the information accessible but potentially verbose and losing relational context. The chosen representation method significantly impacts how well an LLM can utilize the tabular data.
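
To make the options concrete, here is a minimal sketch (in Python, with hypothetical helper names) that renders the same rows either as a Markdown table or as flattened sentences; the HTML variant follows the same pattern with <table>/<tr>/<td> tags.

```python
def table_to_markdown(rows: list[dict]) -> str:
    """Render a list of row dicts as a Markdown table (structure-preserving, LLM-friendly)."""
    headers = list(rows[0].keys())
    lines = ["| " + " | ".join(headers) + " |",
             "| " + " | ".join("---" for _ in headers) + " |"]
    lines += ["| " + " | ".join(str(row[h]) for h in headers) + " |" for row in rows]
    return "\n".join(lines)

def table_to_sentences(rows: list[dict], key: str) -> list[str]:
    """Flatten each row into a descriptive sentence ('For X, column Y is Z'): accessible but verbose."""
    return [f"For {row[key]}, " +
            ", ".join(f"{col} is {val}" for col, val in row.items() if col != key) + "."
            for row in rows]

rows = [{"item": "Widget A", "price": 12.50, "stock": 3}]
print(table_to_markdown(rows))
print(table_to_sentences(rows, key="item"))  # ['For Widget A, price is 12.5, stock is 3.']
```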


Metadata is Gold: Extracting and Utilizing Source Information


During parsing, extract as much metadata as possible: filename, document title, author, creation/modification date, section headings, page numbers, etc. This metadata is invaluable for:

  • Filtering during retrieval (e.g., “find information from documents created in the last year”).
  • Providing context to the LLM.
  • Citing sources in the final answer.
  • Debugging and understanding retrieval results.
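
As a simple illustration (field names are illustrative, not a fixed schema), a chunk record can carry its metadata alongside the text, which turns the filters above into one-liners:

```python
# A chunk record: the text plus metadata captured during parsing.
chunk = {
    "text": "Revenue grew 12% year over year ...",
    "metadata": {
        "filename": "annual_report_2024.pdf",
        "title": "Annual Report 2024",
        "page": 17,
        "section": "Financial Highlights",
        "modified": "2024-03-31",
    },
}
chunks = [chunk]  # in practice: every chunk in your knowledge base

# "Find information from documents created in the last year" becomes a plain filter:
recent = [c for c in chunks if c["metadata"]["modified"] >= "2024-01-01"]
```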

Chunking Strategies: Dividing and Conquering Your Documents

Once you have clean text, the next step is to divide it into smaller, manageable “chunks.” The RAG system will embed and retrieve these individual chunks.


The Rationale: Why Effective Chunking is Crucial for Retrieval Relevance.

  • Context Limits: Embedding models and LLMs have maximum context window sizes. Chunks must fit these limits.
  • Specificity: Smaller, focused chunks can lead to more precise retrieval. If a chunk is too broad, it might be retrieved for queries that are only tangentially related to its core content.
  • Noise Reduction: Including irrelevant information within a chunk can confuse the embedding model and the downstream LLM.


Popular Chunking Methods and Their Trade-offs:

  • Fixed-Size Chunking: Simply splitting text into chunks of N characters or tokens.
    • Pros: Easy to implement.
    • Cons: Can arbitrarily cut sentences or even words, breaking semantic meaning. This often leads to poor quality chunks.
  • Sentence/Paragraph Chunking: Splitting based on sentence terminators (., !, ?) or paragraph breaks.
    • Pros: More semantically coherent as sentences and paragraphs usually represent complete thoughts.
    • Cons: Sentence/paragraph lengths can vary wildly. A single sentence could be very long, or a paragraph very short.
  • Content-Aware Chunking: Using document structure identified during parsing (e.g., splitting by sections, subsections, list items).
    • Pros: Often yields highly relevant and semantically meaningful chunks.
    • Cons: Relies on well-structured documents and a good parser.
  • Recursive Chunking: Iteratively splitting text using a predefined list of separators (e.g., “\n\n” for paragraphs, then “\n” for lines, then “ ” for words). The process continues until chunks are below a certain size (see the sketch after this list).
    • Pros: Tries to keep semantically related text together by prioritizing larger separators first.
    • Cons: The choice of separators and order matters.
  • Semantic Chunking: A more advanced technique where an embedding model is used to identify semantic breaks in the text. Chunks are formed where the meaning shifts significantly.
    • Pros: Can create highly coherent chunks based on meaning.
    • Cons: More computationally intensive; effectiveness depends on the quality of the model used for semantic segmentation.
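
Here is a minimal sketch of the recursive strategy described above, character-based for simplicity (production splitters usually count tokens instead):

```python
def recursive_split(text: str, max_len: int = 500,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ")) -> list[str]:
    """Recursively split text so every chunk is at most max_len characters,
    trying larger separators (paragraphs, then lines, then words) first."""
    if len(text) <= max_len:
        return [text]
    if not separators:
        # No separators left: fall back to a hard cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        if len(part) > max_len:
            # This piece is still too big: flush what we have and recurse with finer separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(part, max_len, rest))
        elif not current or len(current) + len(sep) + len(part) <= max_len:
            current = part if not current else current + sep + part
        else:
            chunks.append(current)
            current = part
    if current:
        chunks.append(current)
    return chunks
```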

The Overlap Dilemma: How Much Context Should Chunks Share?

To avoid losing context at the boundaries of chunks, it’s common to introduce an overlap. For example, if chunking by 200 tokens with a 20-token overlap, the last 20 tokens of chunk N would be the first 20 tokens of chunk N+1.

  • Benefit: Ensures that information isn’t missed if a relevant piece of text spans across a chunk boundary.
  • Consideration: Too much overlap increases redundancy and storage/computation. Too little might miss context. Typical overlaps range from 10-25% of the chunk size.
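
A minimal sketch of this sliding-window idea, assuming `tokens` is the output of whichever tokenizer you use:

```python
def fixed_size_chunks(tokens: list[str], chunk_size: int = 200, overlap: int = 20) -> list[list[str]]:
    """Split a token sequence into fixed-size windows where each chunk
    repeats the last `overlap` tokens of the previous one."""
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]
```

With `chunk_size=200` and `overlap=20` this reproduces the example above; the final window may simply be shorter.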

Considerations: The ideal chunking strategy depends on your document types (prose vs. code vs. structured data), the context window of your chosen embedding model, and the nature of expected user queries (broad vs. specific).

Section 2: Representation – Turning Content into Searchable Knowledge

Once data is chunked, we need to convert these textual chunks into a format that allows for efficient semantic search – typically dense vector embeddings.

Embeddings: The Language of Semantic Similarity

A Quick Refresher: What Are Embeddings?


Embeddings are numerical vector representations of text (or other data types) in a high-dimensional space. The key idea is that semantically similar pieces of text will have embeddings that are “close” to each other in this vector space (e.g., by cosine similarity or Euclidean distance).
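
For reference, cosine similarity is just the normalized dot product of two vectors; a minimal version with NumPy:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """1.0 means the vectors point in the same direction; values near 0 mean unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```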


How Embedding Models Learn: A Look Under the Hood


Embedding models aren’t magic; they are trained on vast amounts of text data to learn these representations.

  • Byproduct of LLMs: Many powerful embedding models are derived from pre-trained LLMs (like BERT, RoBERTa, or even larger models). The embeddings can be taken from the hidden states of one of the last layers, or from the representation of a special token (like the [CLS] token in BERT).
  • Task-Specific Training for Semantic Search: While general-purpose embeddings are useful, models fine-tuned for specific semantic search tasks can perform better. Training methodologies include:
    • Identifying Similar Text (Symmetric Search): Training on datasets where pairs of texts are known to be similar or dissimilar. For example, the Quora Duplicate Questions dataset, where the task is to determine if two questions have the same meaning. Techniques like Siamese networks or SetFit are used, where the model learns to pull embeddings of similar items together and push dissimilar ones apart.
    • Matching Questions to Answers (Asymmetric Search): Training on datasets of (question, relevant_passage) pairs. Examples include MS MARCO (Microsoft Machine Reading Comprehension), SQuAD (Stanford Question Answering Dataset), or Natural Questions. The model learns to map a question to a space where relevant answer passages are nearby.
    • Triplet Loss: A common technique used in this type of training. For an “anchor” (e.g., a query or a passage), the model is given a “positive” example (a relevant passage/query) and a “negative” example (an irrelevant passage/query). The loss function encourages the distance between the anchor and positive to be smaller than the distance between the anchor and negative, often by a certain margin.
    • The Role of “Hard Negatives”: Simply picking random negative examples isn’t always effective. “Hard negatives” are examples that are superficially similar to the anchor (e.g., share keywords) but are actually irrelevant or answer a different question. Training with hard negatives helps the model learn finer distinctions and become more robust. As discussed, finding or generating good hard negatives is a challenge but crucial for high performance.
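
The triplet objective above, written out for a single (anchor, positive, negative) triple. This is a conceptual sketch using Euclidean distance; training frameworks apply a batched, differentiable version of the same idea (e.g., PyTorch's TripletMarginLoss):

```python
import numpy as np

def triplet_loss(anchor: np.ndarray, positive: np.ndarray, negative: np.ndarray,
                 margin: float = 0.5) -> float:
    """Penalize the model unless the anchor is at least `margin` closer
    to the positive example than to the negative one."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)
```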


Choosing Your Embedding Model: Key Factors


The market for embedding models is exploding. Consider these factors:

  • Context Window Size: The maximum number of tokens the model can process. Your chunks must be smaller than or equal to this limit.
  • Embedding Dimension: The size of the output vector (e.g., 384, 768, 1024, or even more).
    • Higher dimensions can capture more nuance but also mean more storage, slower similarity searches, and a greater risk of overfitting if there isn’t enough training data to support that dimensionality.
    • Lower dimensions are more efficient but might lose some representational power.
  • Matryoshka Embeddings (e.g., Matryoshka Representation Learning - MRL): This innovative approach trains embeddings such that prefixes of the full embedding vector are also effective embeddings of a lower dimension. For example, an MRL model might output a 768-dimension embedding, but its first 512 dimensions also form a good 512-d embedding, and its first 256 dimensions form a good 256-d embedding. This provides flexibility: you can choose a dimension that fits your performance/cost trade-off without retraining, as if the smaller embeddings are “nested” within the larger ones, like Matryoshka dolls. The key idea is often to structure the training or the model architecture so that more important information is concentrated in the earlier dimensions, though this isn’t always a strict sorting by importance (see the short sketch after this list).
  • Domain Adaptation: General-purpose embedding models (trained on diverse web text) are a good starting point. However, if your documents are highly specialized (e.g., legal texts, medical research, internal company jargon), these models might not capture the domain-specific nuances well. Fine-tuning an embedding model on your own domain-specific corpus can improve retrieval relevance.
  • Performance on Benchmarks: Leaderboards like MTEB (Massive Text Embedding Benchmark) can provide insights into model performance across various tasks. However, always evaluate models on your own data and use case, as benchmark performance doesn’t always translate directly.
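
To illustrate the Matryoshka idea from the list above: with an MRL-trained model, dropping to a smaller dimension is just truncation plus re-normalization (this only works if the model was actually trained that way):

```python
import numpy as np

def truncate_embedding(full_embedding: np.ndarray, dim: int = 256) -> np.ndarray:
    """Keep the first `dim` dimensions of an MRL embedding and re-normalize
    so that cosine similarity remains meaningful at the lower dimension."""
    truncated = full_embedding[:dim]
    return truncated / np.linalg.norm(truncated)
```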

Vector Stores: The Memory of Your RAG System

Once you have embeddings for your chunks, you need a place to store them and efficiently search through them. This is the role of a vector store (also known as a vector database).


Why Not a Traditional Database?


Traditional relational or NoSQL databases are typically optimized for exact matches or range queries on scalar values. Vector stores are designed for similarity search on dense vectors using Approximate Nearest Neighbor (ANN) algorithms. ANN is crucial because finding the exact nearest neighbors in a high-dimensional space with millions or billions of vectors is computationally prohibitive for real-time applications.


Core Functionalities:

  • Storing Embeddings and Metadata: Efficiently stores the vector embeddings alongside their corresponding text chunks (or pointers to them) and any associated metadata.
  • ANN Search: Implements algorithms (e.g., HNSW, FAISS, ScaNN) to quickly find vectors in the database that are closest to a given query vector.
  • Metadata Filtering: Allows filtering of search results based on metadata before or after the vector search (e.g., “find relevant chunks from documents published after 2022 and tagged ‘finance’”).
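
To make the mechanics tangible, here is a deliberately naive, exact-search sketch in NumPy: it pre-filters on metadata and then scores every remaining chunk by cosine similarity. A real vector store replaces the exhaustive scan with an ANN index (HNSW, IVF, etc.), but the inputs and outputs look much the same.

```python
import numpy as np

def search(query_emb: np.ndarray, chunk_embs: np.ndarray, chunks: list[str],
           metadata: list[dict], k: int = 5, min_year: int | None = None):
    """Brute-force cosine search with optional metadata pre-filtering (illustrative only)."""
    # 1. Pre-filter candidates on metadata (here: a hypothetical 'year' field).
    idx = [i for i, m in enumerate(metadata)
           if min_year is None or m.get("year", 0) >= min_year]
    embs = chunk_embs[idx]
    # 2. Cosine similarity of the query against every remaining candidate.
    sims = embs @ query_emb / (np.linalg.norm(embs, axis=1) * np.linalg.norm(query_emb))
    # 3. Return the top-k chunks with their scores.
    top = np.argsort(-sims)[:k]
    return [(chunks[idx[i]], float(sims[i])) for i in top]
```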


Keeping the Vector Store Current: The Synchronization Challenge


A common and critical challenge is ensuring the vector store reflects the current state of your source documents. Files get added, updated, or deleted.

  • Problem: Stale data in the vector store leads to outdated or incorrect information in RAG responses.
  • Strategies for Updates:
    • Periodic Full Re-indexing: The simplest approach. Periodically, discard the entire vector store and re-process and re-index all source documents.
      • Pros: Ensures consistency.
      • Cons: Can be very time-consuming and resource-intensive for large datasets; system might be offline or serving stale data during re-indexing.
    • Incremental Updates: More sophisticated systems track changes in source documents.
      • Adding New Documents: Parse, chunk, embed, and add new documents to the store.
      • Updating Changed Documents: Identify changed documents. This can be tricky. Do you re-process the whole document? Or try to identify only changed sections? A common approach is to delete the old chunks associated with the document and add the new ones. This requires mapping documents to their chunks.
      • Deleting Old Documents: Remove the chunks and embeddings associated with deleted documents.

Section 3: Retrieval & Refinement – Finding the Needles in the Haystack

With our data ingested and represented, the next step is to retrieve relevant information when a user asks a query.

Initial Retrieval: Casting a Wide Net

Vector Search as the Foundation:

The user’s query is converted into an embedding using the same model that was used for the document chunks. The vector store then performs an ANN search to find the top-K chunks whose embeddings have the highest cosine similarity (or lowest distance) to the query embedding.


Hybrid Search: The Best of Both Worlds

While semantic search with embeddings is powerful for understanding intent, it can sometimes miss exact keyword matches that are important. Hybrid search combines dense (embedding-based) retrieval with sparse (keyword-based) retrieval.

  • Sparse Retrievers (e.g., BM25/TF-IDF): These traditional information retrieval algorithms excel at matching keywords and term frequencies. The core idea behind BM25 is to score how relevant a document is to a query based on the query terms appearing in the document. It gives more weight to terms that are frequent in a specific document but rare across all documents (e.g., a specific technical term like “zygomorphic” is more important than a common word like “function”). BM25 also considers document length, so a query term found in a shorter document typically contributes more to the relevance score than the same term in a much longer document.
  • Dense Retrievers (Embeddings): Capture semantic meaning and relationships, even if the exact words don’t match.
  • Combining Scores: The results from sparse and dense retrievers are then combined. A common technique is Reciprocal Rank Fusion (RRF), which scores documents based on their rank in each retrieval list, rather than their raw scores, making it robust to score incomparability.
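
Reciprocal Rank Fusion is small enough to sketch in full: it only needs the rank of each document ID in each result list (k=60 is the value commonly used in the literature):

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs: each document earns
    1 / (k + rank) per list it appears in, and the highest total wins."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g., fused = reciprocal_rank_fusion([bm25_ids, dense_ids])
```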

Metadata Filtering:


As mentioned, metadata can be used to narrow the search space before vector search (if the vector store supports pre-filtering efficiently) or to filter the retrieved candidates after the search. This can significantly improve relevance and efficiency.


Contextual Expansion:


Sometimes, the single retrieved chunk might not be enough. A strategy might be: “If I match paragraph X, automatically include paragraph X-1 and X+1 as well.” This provides more surrounding context to the LLM but also increases the amount of text it has to process. The decision depends on chunk size and the nature of your documents.

Re-rankers: Adding a Layer of Precision

The initial retrieval step (top-K) is optimized for speed and recall – finding a broad set of potentially relevant documents. A re-ranking step can then improve precision by applying a more powerful, but slower, model to this smaller candidate set.


Why a Second Pass?


ANN search is an approximation. The top-K results might not be in the perfect order of relevance, or might contain some less relevant items. Re-rankers aim to re-order these candidates to push the most relevant ones to the very top.


Cross-Encoders Explained:

  • Architecture: Unlike bi-encoders (typical for initial retrieval) which create separate embeddings for the query and document and then compare them, cross-encoders process the query and a candidate document together in a single input sequence (e.g., [CLS] query [SEP] document_chunk [SEP]). The model then outputs a single score representing the relevance of that document chunk to the query.
  • Benefit: Because the cross-encoder sees both the query and document simultaneously, it can perform deeper interaction and attention between them, leading to a more accurate relevance judgment. They often outperform bi-encoders in terms of quality.
  • Cost: Cross-encoders are significantly more computationally expensive than bi-encoders. You cannot pre-compute document representations. For each query, you must run the cross-encoder for every candidate document from the initial retrieval list (e.g., top 50-100 candidates). This is why they are used as a second-stage re-ranker.
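
A minimal re-ranking sketch using the sentence-transformers CrossEncoder class; the checkpoint name is just a commonly used public example, so swap in whatever re-ranker you actually deploy:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example public checkpoint

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    """Score each (query, candidate) pair jointly and keep the top_n best."""
    scores = reranker.predict([(query, passage) for passage in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:top_n]]
```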


Training Re-rankers:


Cross-encoders are typically fine-tuned on (query, relevant_passage) pairs, often with a classification objective (relevant/not relevant) or a regression objective (relevance score). Datasets like MS MARCO are commonly used. They can also be trained with (query, relevant_passage, irrelevant_passage) triplets.


Other Re-ranking Approaches:


  • Smaller LLMs as Re-rankers: A smaller, faster LLM can be prompted to assess the relevance of each retrieved chunk to the query.
  • Diversity-Based Re-ranking: If the top retrieved chunks are very similar to each other (e.g., slight variations of the same point), a diversity-based re-ranker might try to promote chunks that cover different aspects of the query, even if their raw relevance score is slightly lower. This helps provide a more comprehensive set of information. One common method is Maximal Marginal Relevance (MMR), which iteratively selects chunks that are relevant to the query but also dissimilar to already selected chunks, thus balancing relevance with novelty.
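
A compact MMR sketch over embedding vectors; `lambda_` balances relevance against novelty (a value of 1.0 reduces it to plain relevance ranking):

```python
import numpy as np

def mmr(query_emb: np.ndarray, cand_embs: list[np.ndarray],
        lambda_: float = 0.7, top_n: int = 5) -> list[int]:
    """Maximal Marginal Relevance: greedily pick candidates that are relevant
    to the query but dissimilar to the candidates already selected."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    selected: list[int] = []
    remaining = list(range(len(cand_embs)))
    while remaining and len(selected) < top_n:
        def mmr_score(i):
            relevance = cos(query_emb, cand_embs[i])
            redundancy = max((cos(cand_embs[i], cand_embs[j]) for j in selected), default=0.0)
            return lambda_ * relevance - (1 - lambda_) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected  # indices of the chosen chunks, in selection order
```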

Section 4: Answer Generation – Synthesizing Information into Coherent Responses

Finally, the LLM takes the user’s query and the refined, retrieved context to generate a human-like answer.

Prompt Engineering for RAG:

The way you present the retrieved context and the query to the LLM (the “prompt”) is critical. A well-designed prompt guides the LLM to:

  • Base its answer primarily on the provided context.
  • Indicate if the context doesn’t contain the answer.
  • Synthesize information from multiple chunks if necessary.
  • Maintain a specific tone or persona.
  • Cite sources (by referencing metadata from the chunks).

Example prompt structure:

You are a helpful AI assistant. Answer the user’s question based only on the provided context. If the context does not contain the answer, say “I do not have enough information in the provided documents to answer this question.”

Context:  
[Chunk 1 text] (Source: [Metadata for Chunk 1])  
[Chunk 2 text] (Source: [Metadata for Chunk 2])  
...

User Question: [User's original query]

Answer:
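
In code, assembling this prompt is little more than string formatting; a sketch assuming each retrieved chunk is a dict with hypothetical 'text' and 'source' keys:

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Fill the RAG prompt template above with the retrieved chunks and their sources."""
    context = "\n".join(f"{c['text']} (Source: {c['source']})" for c in chunks)
    return (
        "You are a helpful AI assistant. Answer the user's question based only on the "
        "provided context. If the context does not contain the answer, say "
        '"I do not have enough information in the provided documents to answer this question."\n\n'
        f"Context:\n{context}\n\n"
        f"User Question: {question}\n\n"
        "Answer:"
    )
```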

The Challenge of Conversational Context vs. Standalone Questions

Users rarely ask perfect, standalone questions. Queries are often part of an ongoing conversation and may rely on anaphora (e.g., “What about it?”) or omit details mentioned earlier.

  • Problem: If you only retrieve based on the latest, ambiguous query, you’ll get poor results.
  • Solutions:
    • Query Rewriting: Before hitting the retrieval system, use an LLM to “rewrite” the conversational query into a standalone, self-contained query (a minimal sketch follows this list). For example, if the history is:
      User: “Tell me about polar bears.”
      AI: [Information about polar bears]
      User: “What do they eat?”
      The query rewriter would transform “What do they eat?” into “What do polar bears eat?”
    • Including Chat History:
      • In Retrieval: Embed the recent chat history along with the current query to retrieve context relevant to the ongoing conversation.
      • In Generation: Provide the chat history (in addition to retrieved documents) to the generator LLM so it understands the full conversational flow.
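
A sketch of the query-rewriting step, deliberately model-agnostic: `llm` stands for any callable that maps a prompt to a completion (an assumption, not a specific API):

```python
def rewrite_query(chat_history: list[str], question: str, llm) -> str:
    """Turn a conversational follow-up into a standalone query before retrieval."""
    history = "\n".join(chat_history)
    prompt = (
        "Rewrite the final user question as a standalone question that can be "
        "understood without reading the conversation.\n\n"
        f"Conversation:\n{history}\n\n"
        f"Final question: {question}\n\n"
        "Standalone question:"
    )
    return llm(prompt).strip()

# e.g., rewrite_query(["User: Tell me about polar bears.", "AI: ..."], "What do they eat?", llm)
# would ideally return "What do polar bears eat?"
```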

Grounding and Attribution:

Encourage or instruct the LLM to clearly indicate which parts of its answer come from which retrieved chunks. This improves transparency and allows users to verify the information.

Section 5: Advanced Considerations

The RAG landscape is constantly evolving. Here are a couple of areas seeing exciting developments:

Multimodal RAG: Incorporating Images, Diagrams, and More

Documents are often more than just text; they contain images, charts, tables, and diagrams that convey crucial information.

The Challenge: How do you search and reason over this non-textual content?

Approaches:

  • Embedding Visuals Directly: Multimodal embedding models like CLIP (Contrastive Language-Image Pre-training) or similar language-vision models can create embeddings for images that live in the same vector space as text embeddings. This allows searching for images using text queries, or vice-versa, and retrieving images alongside text chunks (see the sketch after this list).
  • Generating Textual Descriptions: Use a vision-LLM (e.g., GPT-4V, LLaVA) to generate detailed textual descriptions (captions) of images or diagrams. Index these descriptions as text. The LLM generating the final answer might still benefit from seeing the original image as well if possible.
  • Supporting Image-Based Queries: Allow users to input an image as part of their query to find relevant textual or visual information.
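
A small sketch of the direct-embedding route, assuming the sentence-transformers library with a CLIP checkpoint (model name and file name are illustrative):

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# CLIP-style model that places images and text in the same vector space.
model = SentenceTransformer("clip-ViT-B-32")

image_emb = model.encode(Image.open("architecture_diagram.png"))
text_emb = model.encode("a diagram of a retrieval augmented generation pipeline")

print(util.cos_sim(image_emb, text_emb))  # higher score = closer text/image match
```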

Agentic RAG & Iterative Refinement (e.g., RAG-Fusion, Self-Reflective RAG)

Instead of a fixed pipeline, some newer approaches involve LLMs acting more like “agents” in the retrieval process:

  • Query Transformation: Generating multiple query variations from an initial user query to retrieve a broader, more diverse set of documents, then fusing the results (e.g., RAG-Fusion).
  • Iterative Retrieval: The LLM might decide if the initially retrieved information is sufficient, or if it needs to issue new or refined queries to gather more context.
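
A sketch of the query-transformation-plus-fusion loop; `llm` and `retrieve` are placeholder callables for your own model and retriever, and the fusion step is the same RRF scheme shown in Section 3:

```python
def rag_fusion_retrieve(question: str, llm, retrieve, n_variants: int = 3, k: int = 60) -> list[str]:
    """Generate query variants, retrieve for each, and fuse the ranked ID lists with RRF."""
    prompt = (f"Generate {n_variants} alternative search queries for the question below, "
              f"one per line.\n\nQuestion: {question}")
    variants = [q.strip() for q in llm(prompt).splitlines() if q.strip()]
    result_lists = [retrieve(q) for q in [question] + variants]  # each a ranked list of doc IDs
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```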

Section 6: Conclusion – Assembling Your Optimal RAG System

We’ve journeyed through the core components of a Retrieval Augmented Generation system, from parsing raw files and strategically chunking them, to selecting and training powerful embedding models, to implementing robust retrieval and re-ranking strategies, and finally, to prompting an LLM for answer synthesis. At each step, we’ve seen a multitude of options and trade-offs.


The optimal RAG configuration depends on your specific data, your users’ query patterns, and your application’s requirements for accuracy, latency, and cost. Building a high-performing RAG system is often an iterative process of experimentation, evaluation, and refinement.


Understanding the components is one thing; knowing how well they are working together is another. In the next post, we’ll explore how to measure the effectiveness of your retrieval, the quality of your generated answers, and how to build robust evaluation pipelines.


Stay tuned, and happy building!


Disclaimer: This post is the result of me chatting with an AI to structure my thoughts on this topic. The AI then kindly drafted this summary based on our talk and my outline, serving as my personal ‘don’t forget!’ note for the future – because apparently, my brain isn’t a perfect recording device. I’ve made minor edits for clarity.