Exercise 1: Simple RAG for 10-K filings
The objective of this exercise series is to develop a prototype of a Retrieval-Augmented Generation (RAG) system capable of answering questions based on 10-K filings submitted to the U.S. Securities and Exchange Commission (SEC). The full series includes six Colab notebooks, each exploring progressively advanced concepts in RAG systems and their applications:

Exercise 1: Simple RAG for 10-K filings
Exercise 2: RAG with Reranker for 10-K filings
Exercise 3: RAG with Query Decomposition & Tracing with Langsmith/Langfuse
Exercise 4: RAG with Agentic Pattern: ReAct (Reasoning and Action)
Exercise 5: RAG with Agentic Pattern: ReAct + Reflection
Exercise 6: RAG with tabular data handling

These exercises build incrementally on basic RAG, with a focus on "why" before "what" and "how".

This first tutorial focuses on developing a basic end-to-end RAG pipeline. It is divided into three parts to provide a comprehensive understanding of building a simple RAG system for 10-K filings:

RAG Fundamentals
High-Level Overview of Underlying Models
Code with Explanation, posted here: Colab Notebook Link

We strongly encourage readers to go through the RAG fundamentals before diving into the code.

RAG Fundamentals

10-K SEC filings are comprehensive annual reports that provide an in-depth overview of a publicly traded company's operations, financial performance, and risks. These documents are essential for investors, analysts, and regulators, offering insights into business strategies, legal issues, financial health, and future outlook. However, their length and complexity often make extracting specific information time-consuming and challenging, especially when dealing with multiple filings across different companies.

RAG systems address these challenges by combining traditional retrieval methods with the generative capabilities of large language models (LLMs). By structuring and embedding text from 10-K filings into a searchable database, RAG systems can quickly retrieve and synthesize relevant information, enabling users to answer complex queries efficiently. In this exercise, we will work with 10-K filings from companies like Tesla and GM, utilizing their SEC-hosted webpages as data sources.

Pre-processing Data for RAG (Retrieval-Augmented Generation)

Pre-processing text for RAG systems (e.g., company policy documents, emails, website content, and reports) involves key steps to prepare and organize data for efficient querying and retrieval. The primary steps are chunking, embedding generation, and vector database integration. Here's a breakdown:

Chunking

Chunking is the process of breaking down large texts into smaller, manageable pieces that are easier to process and retrieve. In knowledge bases with lengthy documents, breaking them into smaller chunks enables RAG models to query and retrieve only the most relevant sections for user queries. This targeted retrieval promotes contextually coherent responses while reducing off-topic content and conserving computational resources, making the process more efficient and scalable.

A key consideration in chunking is determining the appropriate chunk size to balance context preservation and semantic specificity. Semantic specificity refers to how distinctly and unambiguously a text conveys an idea. Larger chunks excel at maintaining discussion context and keeping related ideas together, which helps models understand references and pronouns. This is particularly valuable for tasks like document summarization or question answering that require comprehensive topic understanding. However, larger chunks can encompass multiple themes, potentially diluting the semantic focus of their embeddings and leading to less precise retrievals when queries target specific aspects. Conversely, smaller chunks typically focus on single ideas, generating highly focused and semantically rich embeddings that can be matched precisely with specific queries. The drawback is the potential loss of broader context, where important background information or pronoun references might fall outside the chunk's scope. This can result in retrieved chunks that, while semantically relevant, miss crucial context for coherent responses.

The optimal chunk size depends on the specific application requirements and often involves experimentation. To address the risk of splitting important information across chunks, an overlapping-sentences approach is often used. This involves adding a portion of the end of one chunk to the beginning of the next, helping preserve context and the semantic integrity of ideas that span chunk boundaries. This ensures the model maintains a better understanding of the text as a whole, enhancing information continuity before moving into the vectorization phase of the RAG model's data pre-processing pipeline.
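To make this concrete, here is a minimal sketch of fixed-size chunking with overlap. The function name, chunk size, and overlap values are illustrative assumptions rather than the notebook's actual settings; in practice you might instead use a library splitter such as LangChain's RecursiveCharacterTextSplitter.

```python
# Minimal sketch: fixed-size, word-based chunking with overlap.
# chunk_size and overlap are illustrative values, not tuned settings.

def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split `text` into chunks of ~chunk_size words, repeating the last
    `overlap` words of each chunk at the start of the next one."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(words):
            break
    return chunks

# Toy usage with placeholder 10-K-style text.
sample = "Our business depends on a complex global supply chain. " * 300
print(len(chunk_text(sample)), "chunks")
```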
Generating Chunk Embeddings using Embedding Model

Think of embeddings as a way to translate text into a sequence of numbers that computers can understand and compare. When you convert text into embeddings (also referred to as vectors), you're essentially creating a numerical "fingerprint" that captures the meaning of that text. In a RAG (Retrieval-Augmented Generation) system, embeddings serve three key functions:

They convert chunks of your company's documents (manuals, reports, policies) into these numerical fingerprints.
They similarly convert user questions into numerical fingerprints.
They allow rapid searching by comparing these fingerprints to find relevant matching chunks.

Let's say an attorney has a new case about a contract dispute in which a software company failed to deliver the custom AI features it promised to build for a client. The attorney has this case summary:

"Contract dispute: Client paid $2M for custom AI software development. Contract specified 6-month delivery. Vendor delivered incomplete features after 8 months, failing to meet specifications. Client seeking damages."

When this query is converted to an embedding, it captures key legal concepts like breach of contract, delayed delivery, and incomplete work. The system compares this numerical pattern against thousands of past cases' embeddings to find similar precedents. More precisely, the system compares it against embeddings of chunks from past legal cases and finds chunks with similar numerical patterns about breaches of software development contracts, delayed project deliveries, and incomplete or non-conforming deliverables. By comparing embeddings of chunks rather than entire cases, attorneys can quickly pinpoint not only the precedent cases but also the most relevant sections within those cases. This helps attorneys rapidly identify relevant precedents without reading through thousands of unrelated cases.
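As an illustration, the snippet below sketches how chunks could be converted into embeddings with the sentence-transformers library. The model name all-MiniLM-L6-v2 and the sample chunks are illustrative assumptions, not necessarily what the Colab notebook uses.

```python
# Sketch: turning text chunks into dense vectors with a sentence-embedding model.
# "all-MiniLM-L6-v2" is an illustrative choice; any suitable model from the
# MTEB leaderboard follows the same usage pattern.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Tesla's revenue grew due to higher vehicle deliveries.",
    "GM identified supply-chain constraints as a key risk factor.",
]
embeddings = model.encode(chunks, normalize_embeddings=True)
print(embeddings.shape)  # (2, 384): one 384-dimensional vector per chunk
```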
Storing Chunk Embeddings in Vector Database

After generating embeddings for text chunks, storing them effectively becomes crucial for a RAG system's performance. While traditional relational databases are excellent for structured data, they face significant challenges when handling embeddings due to their high-dimensional nature. For context, embeddings generated from BERT-Base models produce vectors containing 768 numbers, while BERT-Large models create even larger vectors with 1024 elements. Traditional databases simply weren't designed to efficiently manage and query data with such high dimensionality.

This is where vector databases come into play, offering a specialized solution designed specifically for handling these high-dimensional vectors. These databases implement sophisticated indexing techniques that allow for rapid similarity searches, making them particularly well-suited for RAG applications. When a user submits a query, the system needs to quickly identify and retrieve the most semantically similar chunks from potentially millions of stored embeddings. Vector databases excel at this task, providing the necessary infrastructure for swift and accurate information retrieval that would be impractical or impossible with traditional database systems.

Popular vector database solutions include FAISS and Pinecone, which are specifically optimized for storing and querying these high-dimensional embeddings. These databases implement efficient similarity search mechanisms, typically using cosine similarity measures, enabling them to rapidly identify and retrieve the most relevant chunks of information in response to user queries. This capability is essential for maintaining the responsiveness and effectiveness of RAG systems, particularly when dealing with large-scale knowledge bases.

Handling User's Query

After preprocessing data and setting up the vector database infrastructure, the RAG system needs to handle real-time user queries effectively. This process happens in four key stages: query vectorization, vector database retrieval, prompt creation, and response generation.

Generating Query Embeddings using Embedding Model

First, query vectorization converts incoming user questions or requests into the same type of numerical representations (embeddings) used for the stored knowledge base chunks. This step is crucial and must use the exact same embedding model that was employed during the preprocessing phase. For instance, if BERT-Base was used to generate the 768-dimensional vectors for your stored chunks, the same model must be used for converting user queries into embeddings. This consistency ensures that both the stored chunks and user queries exist in the same semantic space, making similarity comparisons meaningful and accurate. Using different embedding models for queries versus stored chunks would be like trying to compare distances between points on two different maps with different scales – the results would be unreliable.

Retrieving Relevant Chunks using Vector Database

Once the query has been converted into an embedding, the vector database performs a similarity search to find the most relevant chunks from the knowledge base. This search typically employs cosine similarity or other distance metrics to identify stored vectors that are closest to the query vector in the high-dimensional space. Modern vector databases can execute these similarity searches extremely efficiently, even across millions of chunks. The system then retrieves the original text chunks corresponding to the most similar vectors, providing the contextually relevant information needed for the RAG model to generate its response.
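Continuing the sketch, the snippet below indexes the chunk embeddings in FAISS and retrieves the closest chunks for a query embedded with the same model. It reuses the hypothetical `model`, `chunks`, and `embeddings` from the previous snippet; because the vectors are normalized, the inner-product search is equivalent to cosine similarity.

```python
# Sketch: indexing chunk embeddings in FAISS and retrieving the top matches
# for a user query. Assumes `model`, `chunks`, and `embeddings` from the
# previous snippet; with normalized vectors, inner product == cosine similarity.
import numpy as np
import faiss

dim = embeddings.shape[1]
index = faiss.IndexFlatIP(dim)                     # exact inner-product search
index.add(np.asarray(embeddings, dtype="float32"))

query = "What risks did GM report about its supply chain?"
query_vec = model.encode([query], normalize_embeddings=True)

scores, ids = index.search(np.asarray(query_vec, dtype="float32"), 2)
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {chunks[i]}")
```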
Creating Effective Prompts with Retrieved Context

After retrieving the most relevant chunks, the next crucial step is constructing an effective prompt that helps the language model generate accurate and contextually appropriate responses. This process requires careful consideration of how to structure and combine the retrieved information with the user's query.

The basic structure of a RAG prompt typically consists of three main components: instructions for the model, the retrieved context, and the user's query. Think of this like preparing a subject matter expert for a consultation – you first explain how they should approach the task (instructions), provide them with relevant reference materials (retrieved context), and then present the specific question they need to address (user's query). Consider this approach:

Give an answer for the `question` using only the given `context`. Use only the provided `context` to answer the `question`. If the information needed isn't in the `context`, acknowledge this limitation rather than making assumptions. Provide a detailed answer with thorough explanations, avoiding summaries.

question: {question}
context: {context}
Answer:

The instructions at the top set the foundation for how the model should process and utilize the retrieved information. This helps ensure the model stays grounded in the retrieved information rather than hallucinating or drawing from its pre-trained knowledge. The context section would typically join the ranked chunks with newline characters (\n\n) before inserting them into the prompt template. This preserves the ranking while creating a readable and processable format for the language model.
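A minimal sketch of this assembly step is shown below. The template mirrors the prompt above; the `build_prompt` helper and the variables reused from the earlier snippets are illustrative assumptions, not a fixed API.

```python
# Sketch: assembling the RAG prompt from retrieved chunks and the user query.
PROMPT_TEMPLATE = """Give an answer for the `question` using only the given `context`. Use only the provided `context` to answer the `question`. If the information needed isn't in the `context`, acknowledge this limitation rather than making assumptions. Provide a detailed answer with thorough explanations, avoiding summaries.

question: {question}
context: {context}
Answer:"""

def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    # Join the ranked chunks with blank lines so the LLM sees them as separate passages.
    context = "\n\n".join(retrieved_chunks)
    return PROMPT_TEMPLATE.format(question=question, context=context)

# Uses the query and retrieval results from the FAISS snippet above.
prompt = build_prompt(
    "What risks did GM report about its supply chain?",
    [chunks[i] for i in ids[0]],
)
```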
Response Generation

After generating the prompt with its carefully structured components, the RAG system passes this combined input to a Large Language Model (LLM) for response generation. The LLM processes the instructions, context (retrieved chunks), and user query together to produce a coherent, contextually appropriate response that addresses the user's needs.

The LLM leverages the context to ground its responses rather than relying solely on its pre-trained knowledge. This approach significantly reduces hallucination risks since the model is explicitly instructed to base its response on the provided context. If the retrieved context lacks sufficient information to fully address the query, the model acknowledges these limitations instead of making unsupported claims. The effectiveness of response generation heavily depends on the quality of the prompt engineering discussed earlier. Depending on the requirements, the response from the LLM can be further customized or refined based on additional criteria, such as tone, style, or specific user preferences.

Note: The implementation of robust guardrails is crucial when deploying LLMs in RAG systems to ensure responsible and reliable output. A comprehensive validation system should verify that the model's responses strictly align with the provided context, preventing both subtle and obvious forms of hallucination. Additional checks should evaluate responses for potential biases and ethical concerns, including screening for harmful content, discriminatory language, or inappropriate recommendations. These guardrails should also ensure compliance with company policies, regulatory requirements, and societal norms while maintaining appropriate tone and professionalism. The system should be designed to either automatically modify responses that don't meet these criteria or flag them for human review, ensuring a balance between accurate information delivery and responsible AI behavior.

Technical Details

Embedding Model

Embedding models are specialized versions of encoder architectures (like BERT) that are fine-tuned specifically to create meaningful vectors (sequences of numbers) for entire sentences or passages, rather than individual words or tokens. Base encoder models create contextual representations at the token level, meaning each word is represented by a vector that depends on surrounding words. However, they aren't trained to directly optimize for sentence-level similarity. In contrast, embedding models are explicitly trained on sentence-pair tasks using contrastive learning. During training, they learn to generate sentence vectors that:

Place similar sentences close together in vector space
Push dissimilar sentences far apart
Capture high-level semantic relationships rather than just word-level patterns

This targeted training makes them much better at tasks requiring sentence-level semantic understanding, like finding similar legal cases or matching questions to relevant documents.

Note: The terms vectors, embeddings, and representations are often used interchangeably; all refer to sequences of numbers that represent data in a machine-readable format. In large language models (LLMs):

Tokenization: Input text is first broken down into smaller units called tokens. The process maps the text to elements of a predefined vocabulary or dictionary. Since the vocabulary may not contain every possible word, tokenization handles out-of-vocabulary (OOV) words by breaking them into subwords, characters, or other smaller components, depending on the tokenization strategy used.

Token Embeddings: Each token is then converted into a numerical vector (embedding). At this stage, these embeddings are static, meaning they do not depend on the context provided by surrounding tokens.

Contextualized Embeddings: These are embeddings generated after processing token embeddings through the layers of the transformer model. Unlike static embeddings, contextualized embeddings reflect the meaning of each token based on its surrounding tokens in the input sequence. For example, in the phrases "sits by a river bank" and "went to a bank to deposit a check," the word "bank" has different meanings. Contextualized embeddings capture these differences by producing distinct representations for the word "bank" in each context.

The choice of embedding model can significantly impact the quality of your vectors and retrieval effectiveness. Since new embedding models come out regularly, you can select an appropriate model from the MTEB leaderboard.
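The short sketch below illustrates these ideas with the Hugging Face transformers library: it shows subword tokenization and compares the contextualized vectors that BERT produces for the word "bank" in the two sentences above. The model choice bert-base-uncased is an illustrative assumption.

```python
# Sketch: BERT-style tokenization and contextualized embeddings.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Out-of-vocabulary words are split into subword pieces (marked with '##');
# the exact split depends on the model's vocabulary.
print(tokenizer.tokenize("Tesla reported gigafactory expansion costs"))

# The same word gets different contextualized vectors in different sentences.
sentences = ["He sits by a river bank.", "She went to a bank to deposit a check."]
encoded = tokenizer(sentences, return_tensors="pt", padding=True)
with torch.no_grad():
    hidden = model(**encoded).last_hidden_state  # shape: (2, seq_len, 768)

bank_id = tokenizer.convert_tokens_to_ids("bank")
vectors = []
for i in range(2):
    position = (encoded["input_ids"][i] == bank_id).nonzero()[0].item()
    vectors.append(hidden[i, position])

similarity = torch.cosine_similarity(vectors[0], vectors[1], dim=0)
print(f"Cosine similarity between the two 'bank' vectors: {similarity.item():.2f}")
```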
Response Generation Model

The Large Language Models (LLMs) used for response generation in RAG systems are primarily based on decoder architectures, exemplified by models like ChatGPT, Claude, Llama, and Qwen. These decoder models operate fundamentally differently from the encoder-based models used in the embedding generation and reranking stages. Their core objective is next-token prediction, where the model can only see and process tokens that come before the current position, unlike encoder models, which have full visibility of the entire input sequence. This architectural constraint creates a more challenging training task, as the model must learn to generate coherent and contextually appropriate text while working with limited future context. This limitation drives these models to develop stronger reasoning capabilities and a deeper understanding of language patterns, as they must make predictions based solely on previous context.

A crucial development stage for decoder models is instruction tuning, which enables them to understand and follow specific directives in prompts. Without this specialized training, these models would simply continue the pattern of text generation rather than providing appropriate responses to instructions. For example, when presented with a prompt like "How are you?", a base model might simply complete the phrase with "doing today", while an instruction-tuned model would recognize the question format and respond appropriately with something like "I'm fine, thank you. How about yourself?" This capability is essential for RAG systems, where the model needs to interpret prompts that combine retrieved context with specific instructions about how to use that information.

The complexity of the text generation task necessitates significantly larger model architectures compared to embedding and reranking models. These decoder models typically employ many more parameters and layers to support their advanced reasoning capabilities. The scale difference is substantial: while embedding models might operate with hundreds of millions of parameters, modern decoder models often contain hundreds of billions of parameters. This massive scale translates directly to computational costs, with training expenses often reaching hundreds of millions of dollars. As a result, most organizations opt to access these capabilities through APIs provided by services like ChatGPT and Claude, or leverage open-weight models such as the 405-billion-parameter Llama hosted on platforms like Together.ai, rather than training their own models from scratch.

The combination of this complex architecture, instruction tuning, and massive scale enables decoder models to perform the sophisticated task of synthesizing information from retrieved context into coherent, relevant responses. In a RAG system, this manifests as the ability to not just understand the retrieved chunks and user query, but to reason about their relationships and generate new text that effectively addresses the user's needs while remaining grounded in the provided context.
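As a sketch of what the final generation call might look like, the snippet below sends the RAG prompt assembled earlier to an API-hosted, instruction-tuned model. The OpenAI client, the model name, and the system message are illustrative assumptions; any hosted or open-weight chat model could be substituted.

```python
# Sketch: sending the assembled RAG prompt to an API-hosted, instruction-tuned LLM.
# "gpt-4o-mini" and the system message are illustrative choices.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You answer questions about 10-K filings."},
        {"role": "user", "content": prompt},  # prompt built in the earlier snippet
    ],
    temperature=0,  # keep answers grounded in the retrieved context
)
print(response.choices[0].message.content)
```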