April 2024 – C4: Container, Code, Cloud & Context

Semantic Caching for LLMs: Embedding-Based Similarity and Cache Strategies

Posted on April 28, 2024 by Nithin Mohan TK · 13 min read

Introduction: LLM API calls are expensive and slow; semantic caching reduces both cost and latency by reusing responses for similar queries. Unlike exact-match caching, semantic caching uses embeddings to find queries that are semantically similar even when they are worded differently. This enables cache hits for paraphrased questions, cutting latency from seconds to milliseconds on a hit and avoiding the API cost of a repeated call. This guide […]
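The idea in this excerpt can be sketched roughly as follows. This is a minimal illustration, not the post's implementation: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and `SemanticCache`, `cached_call`, and the 0.8 threshold are names and values assumed here for demonstration.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Stores (embedding, response) pairs and returns the closest match above a threshold."""

    def __init__(self, embed=toy_embed, threshold: float = 0.8):
        self.embed = embed
        self.threshold = threshold  # assumed value; tune per embedding model
        self.entries = []

    def get(self, query: str):
        q = self.embed(query)
        best_score, best_response = 0.0, None
        # Linear scan; a real system would likely use a vector index instead.
        for emb, response in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))


def cached_call(cache: SemanticCache, query: str, call_llm):
    """Return a cached response for a similar query, else call the model and store the result."""
    hit = cache.get(query)
    if hit is not None:
        return hit
    response = call_llm(query)
    cache.put(query, response)
    return response
```

With a real embedding model swapped in for `toy_embed`, the threshold becomes the main tuning knob: set it too low and unrelated questions share answers, too high and paraphrases miss the cache.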


What Is Retrieval-Augmented Generation (RAG)?

Posted on April 27, 2024 by Nithin Mohan TK · 16 min read

Introduction: Welcome to a tour of Retrieval-Augmented Generation (RAG), a technique that changes how AI systems interact with external knowledge. Imagine an AI system that not only generates text but also taps into vast repositories of information to deliver […]
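As a rough illustration of "generate text, but consult external information first", the retrieval step and prompt assembly might look like the sketch below. It is an assumption-laden toy: `retrieve` ranks documents by simple word overlap rather than embeddings, and `build_prompt` and its wording are hypothetical names chosen here, not taken from the post.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: list, k: int = 2) -> list:
    """Rank documents by word overlap with the query (a toy stand-in for vector search)."""
    q = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, documents: list, k: int = 2) -> str:
    """Assemble a prompt that places the retrieved passages ahead of the question."""
    context = "\n".join(f"- {chunk}" for chunk in retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
```

The resulting string would then be sent to whichever LLM client is in use; that call is left out here because it depends on the provider.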


Async LLM Patterns: Maximizing Throughput with Concurrent Processing

Posted on April 25, 2024 by Nithin Mohan TK · 9 min read

Introduction: LLM API calls are slow—often 1-10 seconds per request. Sequential processing kills throughput. Async patterns let you process multiple requests concurrently, dramatically improving performance for batch operations, parallel tool calls, and high-traffic applications. This guide covers async LLM patterns in Python: using asyncio with OpenAI and Anthropic clients, managing concurrency with semaphores, implementing retry […]
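The semaphore pattern mentioned in this excerpt can be sketched with the standard library alone. In this sketch `fake_llm_call` is a hypothetical placeholder that sleeps instead of calling a real API, and `run_batch` and `max_concurrency` are illustrative names, not the post's code.

```python
import asyncio


async def fake_llm_call(prompt: str) -> str:
    """Hypothetical stand-in for an async LLM client call; sleeps to simulate latency."""
    await asyncio.sleep(0.01)
    return f"response to: {prompt}"


async def run_batch(prompts, call=fake_llm_call, max_concurrency: int = 3):
    """Run all prompts concurrently, with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(prompt: str) -> str:
        # The semaphore blocks here once max_concurrency calls are already running.
        async with semaphore:
            return await call(prompt)

    # gather returns results in the same order as the input prompts.
    return await asyncio.gather(*(bounded(p) for p in prompts))


if __name__ == "__main__":
    print(asyncio.run(run_batch(["a", "b", "c", "d"], max_concurrency=2)))
```

Capping concurrency this way keeps a large batch from exceeding provider rate limits while still overlapping the slow network waits.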


Function Calling Patterns: Tool Schemas, Execution Pipelines, and Agent Loops

Posted on April 18, 2024 by Nithin Mohan TK · 13 min read

Introduction: Function calling transforms LLMs from text generators into capable agents that can interact with external systems. By defining tools with clear schemas, models can decide when to call functions, extract parameters from natural language, and incorporate results into responses. This guide covers practical function calling patterns: defining tool schemas, handling multiple tool calls, implementing […]
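To make "tools with clear schemas" concrete, here is a minimal, provider-agnostic sketch. The `get_weather` tool, its canned return value, the `TOOLS` registry, and the shape of the `tool_call` dict are all assumptions made for illustration; real providers each define their own request and response formats.

```python
import json


def get_weather(city: str, unit: str = "celsius") -> dict:
    """Hypothetical tool; returns canned data instead of calling a weather service."""
    return {"city": city, "temperature": 21, "unit": unit}


# Registry mapping tool names to a callable plus a JSON-Schema-style description.
TOOLS = {
    "get_weather": {
        "function": get_weather,
        "schema": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["city"],
            },
        },
    }
}


def execute_tool_call(tool_call: dict) -> str:
    """Run one model-requested call and return a JSON string to send back to the model."""
    tool = TOOLS.get(tool_call["name"])
    if tool is None:
        return json.dumps({"error": f"unknown tool: {tool_call['name']}"})
    # Arguments are assumed to arrive as a JSON-encoded string.
    args = json.loads(tool_call["arguments"])
    required = tool["schema"]["parameters"].get("required", [])
    missing = [name for name in required if name not in args]
    if missing:
        return json.dumps({"error": f"missing arguments: {missing}"})
    return json.dumps(tool["function"](**args))
```

Returning errors as JSON rather than raising lets the model see what went wrong and retry with corrected arguments.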


Fine-Tuning LLMs: From Data Preparation to Production Deployment

Posted on April 15, 2024 by Nithin Mohan TK · 6 min read

Introduction: Fine-tuning transforms a general-purpose LLM into a specialized model tailored to your domain, style, or task. While prompt engineering can get you far, fine-tuning offers consistent behavior, reduced token usage, and capabilities that prompting alone cannot achieve. This guide covers the complete fine-tuning workflow—from data preparation to deployment—using both cloud APIs (OpenAI, Together AI) […]


Advanced RAG Patterns: Beyond Basic Retrieval

Posted on April 10, 2024 by Nithin Mohan TK · 5 min read

Six months ago, I thought RAG was simple: retrieve chunks, send to LLM, done. Then I built a system that needed to answer questions about 50,000 technical documents. Basic retrieval failed spectacularly. That’s when I discovered advanced RAG patterns: techniques that transform RAG from a prototype into a production system. […]

