Advanced RAG Patterns: Beyond Basic Retrieval

Six months ago, I thought RAG was simple: retrieve chunks, send to LLM, done. Then I built a system that needed to answer questions about 50,000 technical documents. Basic retrieval failed spectacularly. That’s when I discovered advanced RAG patterns—techniques that transform RAG from a prototype into a production system.

Figure 1: Advanced RAG Pattern Architecture

The Problem with Basic RAG

Basic RAG works great for demos. You chunk documents, embed them, store in a vector database, and retrieve the top K chunks. Simple, right?
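
For reference, the naive version really is only a dozen lines. A minimal sketch, assuming the OpenAI embedding call, Pinecone index (vector_db), and generic llm client that are set up later in this post, with chunk text stored in each vector's metadata:

def basic_rag(query, top_k=5):
    # Embed the query, grab the top-K most similar chunks, and stuff
    # them straight into the prompt.
    query_embedding = openai.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding

    results = vector_db.query(
        vector=query_embedding, top_k=top_k, include_metadata=True
    )
    context = "\n\n".join(match.metadata["text"] for match in results.matches)
    return llm.complete(f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:")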

Then you try it in production:

  • Irrelevant chunks: Top K doesn’t mean “most relevant”—it means “most similar.” Semantic similarity ≠ relevance.
  • Context fragmentation: Important information gets split across chunks, and you retrieve only part of it.
  • No reasoning: The LLM can’t reason about multiple documents or conflicting information.
  • Hallucination: When retrieval fails, the LLM makes things up instead of saying “I don’t know.”

I learned this the hard way when our RAG system confidently answered questions with information from completely unrelated documents.

Pattern 1: Hybrid Search

Semantic search finds “similar” content, but keyword search finds “exact” matches. Hybrid search combines both.

import os
import openai
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
vector_db = pc.Index("documents")  # placeholder index name

def hybrid_search(query, top_k=5):
    # Semantic search: embed the query and search the vector index
    query_embedding = openai.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding

    semantic_results = vector_db.query(
        vector=query_embedding,
        top_k=top_k * 2,  # over-fetch so reranking has candidates to drop
        include_metadata=True
    )

    # Keyword search (BM25 or similar), implemented elsewhere
    keyword_results = keyword_search(query, top_k=top_k * 2)

    # Combine both result sets and rerank with a cross-encoder
    combined = merge_results(semantic_results, keyword_results)
    reranked = rerank_with_cross_encoder(combined, query)

    return reranked[:top_k]

This pattern improved our retrieval accuracy by 35%.
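
The merge_results helper does the heavy lifting here. One common way to implement it is reciprocal rank fusion (RRF), which scores each document by where it ranks in each list rather than by raw scores. A minimal sketch, assuming both result sets have been normalized to lists of dicts with an 'id' field:

def merge_results(semantic_results, keyword_results, k=60):
    # Reciprocal rank fusion: each appearance contributes 1 / (k + rank),
    # so documents that rank well in both lists float to the top.
    scores, by_id = {}, {}
    for results in (semantic_results, keyword_results):
        for rank, item in enumerate(results, start=1):
            doc_id = item['id']
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
            by_id[doc_id] = item
    return sorted(by_id.values(), key=lambda item: scores[item['id']], reverse=True)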

Figure 2: Hybrid Search Combining Semantic and Keyword Retrieval

Pattern 2: Query Rewriting and Expansion

User queries are often ambiguous or incomplete. Query rewriting fixes this:

  1. Query expansion: Add synonyms and related terms
  2. Query decomposition: Break complex queries into sub-queries
  3. Query rewriting: Rephrase for better retrieval

def expand_query(original_query):
    # Use the LLM (a generic completion client, defined elsewhere) to expand the query
    expansion_prompt = (
        "Expand this query with synonyms and related terms:\n"
        f"Query: {original_query}\n\n"
        "Expanded query:"
    )
    expanded = llm.complete(expansion_prompt)

    # Also generate sub-queries for compound questions
    if " and " in original_query or " or " in original_query:
        sub_queries = decompose_query(original_query)
        return [expanded] + sub_queries

    return [expanded]

def decompose_query(query):
    # Break a complex query into simpler sub-queries, one per line
    decomposition_prompt = (
        "Break this query into simpler sub-queries:\n"
        f"Query: {query}\n\n"
        "Sub-queries:"
    )
    response = llm.complete(decomposition_prompt)
    return [line.strip() for line in response.split("\n") if line.strip()]

Pattern 3: Multi-Step Retrieval

Sometimes you need multiple retrieval steps:

  1. Initial retrieval: Get broad context
  2. Refinement: Use initial results to refine the query
  3. Final retrieval: Get precise chunks based on refined query

This is especially useful for questions that require reasoning across multiple documents.
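
There isn't one canonical implementation of this loop. A minimal sketch, reusing the hybrid_search and generic llm client from the earlier patterns, might look like this:

def multi_step_retrieval(query, top_k=5):
    # Step 1: broad initial retrieval for context
    initial = hybrid_search(query, top_k=top_k * 2)

    # Step 2: let the LLM refine the query based on what came back
    preview = "\n".join(chunk['text'][:200] for chunk in initial)
    refine_prompt = (
        "Using these partial results, rewrite the query so it targets "
        "the specific details still missing:\n"
        f"Results:\n{preview}\n\n"
        f"Original query: {query}\n\n"
        "Refined query:"
    )
    refined_query = llm.complete(refine_prompt)

    # Step 3: precise retrieval with the refined query
    return hybrid_search(refined_query, top_k=top_k)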

Figure 3: Multi-Step Retrieval Process

Pattern 4: Reranking

Vector similarity is a good first pass, but reranking with a cross-encoder dramatically improves results:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_results(query, candidates, top_k=5):
    # Create query-candidate pairs
    pairs = [[query, candidate['text']] for candidate in candidates]
    
    # Get relevance scores
    scores = reranker.predict(pairs)
    
    # Sort by score
    ranked = sorted(
        zip(candidates, scores),
        key=lambda x: x[1],
        reverse=True
    )
    
    return [item[0] for item in ranked[:top_k]]

Reranking improved our precision from 68% to 89%.

Pattern 5: Contextual Compression

Retrieved chunks often contain irrelevant information. Contextual compression extracts only what’s relevant:

def compress_context(query, chunks):
    compressed = []
    
    for chunk in chunks:
        # Use LLM to extract relevant parts
        compression_prompt = "Extract only information relevant to this query:\n"
        compression_prompt += f"Query: {query}\n\n"
        compression_prompt += f"Chunk: {chunk['text']}\n\n"
        compression_prompt += "Relevant information:"
        
        relevant = llm.complete(compression_prompt)
        if relevant.strip():
            compressed.append({
                'text': relevant,
                'metadata': chunk['metadata']
            })
    
    return compressed

Real-World Implementation

Here’s how we combined these patterns in production:

def advanced_rag(query, documents):
    # Step 1: Query expansion
    expanded_queries = expand_query(query)
    
    # Step 2: Hybrid search for each expanded query
    all_results = []
    for eq in expanded_queries:
        results = hybrid_search(eq, top_k=10)
        all_results.extend(results)
    
    # Step 3: Deduplicate
    unique_results = deduplicate(all_results)
    
    # Step 4: Rerank
    reranked = rerank_results(query, unique_results, top_k=10)
    
    # Step 5: Contextual compression
    compressed = compress_context(query, reranked)
    
    # Step 6: Generate answer
    context = "\n\n".join([c['text'] for c in compressed])
    answer = generate_answer(query, context)
    
    return answer
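
The deduplicate and generate_answer helpers aren't shown above. Minimal sketches, assuming chunks carry an 'id' (falling back to the text itself) and reusing the generic llm client from earlier:

def deduplicate(results):
    # Keep the first occurrence of each chunk, keyed by id (or by text)
    seen, unique = set(), []
    for item in results:
        key = item.get('id') or item['text']
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique

def generate_answer(query, context):
    # Final generation step: answer strictly from the retrieved context
    prompt = (
        "Answer the question using only the context below. "
        "If the context is insufficient, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"
    )
    return llm.complete(prompt)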

Performance Impact

Implementing these patterns had dramatic results:

Metric                 Basic RAG    Advanced RAG    Improvement
Retrieval Precision    68%          89%             +31%
Answer Accuracy        72%          91%             +26%
Hallucination Rate     18%          4%              -78%
Latency                450ms        680ms           +51%

The latency increase is worth it for the accuracy gains. We optimized further with caching and parallel processing.

When to Use Each Pattern

Not every pattern is needed for every use case:

  • Hybrid Search: Use when you have technical terms, names, or specific keywords
  • Query Expansion: Use for general knowledge or when users ask questions in different ways
  • Multi-Step Retrieval: Use for complex questions requiring reasoning
  • Reranking: Always use—it’s the biggest accuracy win for minimal cost
  • Contextual Compression: Use when chunks are long or contain lots of irrelevant info

Common Mistakes

Here’s what I learned the hard way:

Mistake 1: Over-Engineering

Don’t use all patterns at once. Start with reranking and hybrid search, then add others as needed.

Mistake 2: Ignoring Latency

Advanced patterns add latency. Cache aggressively and use parallel processing where possible.
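
For example, here is a small sketch of both ideas, assuming queries repeat often enough for an in-memory cache to pay off and that retrieval calls are I/O-bound:

from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

@lru_cache(maxsize=1024)
def cached_hybrid_search(query, top_k=5):
    # Memoize repeated queries (arguments must be hashable)
    return hybrid_search(query, top_k=top_k)

def parallel_retrieval(queries, top_k=5):
    # Fan the expanded/sub-queries out across threads
    with ThreadPoolExecutor(max_workers=4) as pool:
        batches = pool.map(lambda q: cached_hybrid_search(q, top_k), queries)
    return [item for batch in batches for item in batch]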

Mistake 3: Not Measuring

You can’t improve what you don’t measure. Track precision, recall, and answer quality.
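
A minimal sketch of tracking retrieval precision and recall against a small labeled set of (query, relevant chunk ids) pairs:

def retrieval_metrics(retrieved_ids, relevant_ids):
    # Precision: fraction of retrieved chunks that are relevant
    # Recall: fraction of relevant chunks that were retrieved
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall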

🎯 Key Takeaway

Advanced RAG patterns aren’t optional—they’re essential for production. Start with reranking and hybrid search, measure everything, and add complexity only when needed.

Bottom Line

Basic RAG gets you 70% of the way there. Advanced patterns get you to 95%. The difference is production-ready vs. prototype. Invest in these patterns early—you’ll thank yourself later.

