Introduction to RAG
Large Language Models (LLMs), or any kind of foundational models, are typically trained on static datasets. Retraining these models every time new data becomes available on the internet is not feasible. While domain-specific fine-tuning can be beneficial for particular use cases, it is not a scalable or timely solution for keeping up with continuously evolving information.
For example, if you ask ChatGPT to list papers on "Multi-Agent Systems for Autonomous Driving," it might return publications available up to 2023 or 2024. While useful, what if you're looking for research published just a few minutes ago? In rapidly evolving fields like Machine Learning and Deep Learning, hundreds—if not thousands—of papers are published daily. Continuously retraining LLMs to keep up with this influx is simply impractical.
This is where techniques like Retrieval-Augmented Generation (RAG), Prompt Engineering, and Fine-Tuning become invaluable. These approaches help dynamically ground the responses of foundational models based on the user’s query and the latest available information. Each method has its strengths and limitations, and the appropriate choice depends on the specific application and context.
Alternative Approaches to RAG
Prompt Engineering
Prompt engineering refers to crafting inputs (prompts) in a way that effectively elicits the desired response from large language models (LLMs) such as those by OpenAI or Meta. These models demonstrate zero-shot and few-shot reasoning capabilities when prompted correctly.
- Advantages: Simple to use. No additional training or fine-tuning is required. Effective with sufficiently large models.
- Disadvantages: Responses are limited by the model's pre-trained capabilities. Less reliable for domain-specific or nuanced queries.
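As a sketch of the idea, the few-shot prompt below embeds two worked examples before the new input, letting the model infer the task from the pattern. The review texts, labels, and helper name are invented for illustration:

```python
# Labeled examples the model can generalize from (invented for illustration).
examples = [
    ("The battery dies within an hour.", "negative"),
    ("Setup took two minutes and it just works.", "positive"),
]

def build_few_shot_prompt(examples, query):
    """Format labeled examples followed by the new input to classify."""
    lines = ["Classify the sentiment of each review as positive or negative.", ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")  # the model completes this line
    return "\n".join(lines)

prompt = build_few_shot_prompt(examples, "The screen scratches far too easily.")
print(prompt)
```

The prompt ends mid-pattern, so the model's most natural continuation is the label itself.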
Fine-tuning
Fine-tuning involves taking a pre-trained LLM and further training it on a smaller, domain-specific dataset. This helps the model specialize in tasks or behavior relevant to a particular application.
- Advantages: Tailors the model to specific domains or applications. Improves performance on narrow tasks.
- Disadvantages: Requires additional training and computational resources. May reduce the model's generalization ability.
Relying on only one method when working with LLMs often does not yield the best results. A fine-tuned model can still fail on queries outside its training distribution, in which case grounding it with retrieved context via RAG can help. RAG can also be combined with chain-of-thought prompting to build agentic RAG systems.
What is RAG?
RAG can be thought of as an intelligent pre-processing layer for LLMs. Instead of feeding a model an enormous corpus of documents from diverse sources (e.g., web pages, PDFs), RAG selectively retrieves and supplies only the most relevant chunks of documents as context, along with the user’s query. The model then generates a grounded response based on this curated input.
At a high level, a typical RAG system consists of three core components:
- Indexing
- Retrieval
- Generation
1. Indexing
Indexing involves breaking down source documents into smaller chunks that fit within the context window of the embedding model. Each chunk is then passed through an embedding model that converts it into a vector representation, capturing the semantic meaning of the text. These vectors are stored in a vector database, where semantically similar chunks are placed close together in vector space.
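The indexing step can be sketched in plain Python. The character chunker and the word-count "embedding" below are toy stand-ins for a real text splitter and embedding model, and the "index" is just a list standing in for a vector database; all names are illustrative:

```python
from collections import Counter

def chunk_text(text, chunk_size=40, overlap=10):
    """Split text into overlapping character chunks (a toy stand-in for a real splitter)."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

def embed(text):
    """Toy 'embedding': a word-count vector. A real system would call an embedding model."""
    return Counter(text.lower().split())

# The 'index' pairs each chunk with its vector, standing in for a vector database.
doc = "RAG retrieves relevant chunks. Chunks are embedded as vectors. Vectors live in a vector database."
index = [(chunk, embed(chunk)) for chunk in chunk_text(doc, chunk_size=50, overlap=10)]
```

The overlap between consecutive chunks mirrors what real splitters do: it keeps sentences that straddle a chunk boundary recoverable from at least one chunk.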
2. Retrieval
When a user submits a query, it is also embedded into a vector using the same embedding model. The system then searches for the most semantically relevant document chunks in the vector database.
A basic approach is brute-force k-Nearest Neighbors (k-NN) search, which computes the cosine similarity between the query vector and every vector in the database. Because each query must be compared against all N stored vectors, this runs in O(N) time and becomes inefficient as the collection grows.
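Brute-force search can be sketched directly: score the query against every stored vector and keep the top k. The three-dimensional "embeddings" and chunk labels below are made up for illustration:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def knn_search(query_vec, index, k=2):
    """Brute-force k-NN: score every stored vector, hence O(N) per query."""
    scored = [(cosine_similarity(query_vec, vec), chunk) for chunk, vec in index]
    scored.sort(reverse=True)
    return [chunk for _, chunk in scored[:k]]

# Tiny made-up 3-d 'embeddings' for three chunks.
index = [
    ("chunk about dogs", [0.9, 0.1, 0.0]),
    ("chunk about cats", [0.8, 0.2, 0.1]),
    ("chunk about taxes", [0.0, 0.1, 0.9]),
]
print(knn_search([1.0, 0.0, 0.0], index, k=2))
```

The single pass over `index` is exactly the O(N) cost that ANN structures like HNSW are designed to avoid.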
To overcome this, more scalable methods like Approximate Nearest Neighbor (ANN) algorithms are used. A popular technique is the Hierarchical Navigable Small World (HNSW) graph, which combines ideas from proximity graphs and skip lists to reduce expected search time to roughly O(log N), at the cost of returning approximate rather than exact nearest neighbors.
3. Generation
The final step is generating a meaningful response using the retrieved context. This stage consists of two key components:
- Choice of LLM: The selected model depends on factors such as context window size, speed, cost, and domain specialization.
The table below compares several popular large language models (LLMs) used in RAG pipelines. Each model is evaluated based on its type (open-source or commercial), strengths, and limitations. Open-source models such as LLaMA 2 and Mistral are often preferred for local deployment or cost-effective experimentation, while commercial models like GPT-4 and Claude typically offer stronger reasoning capabilities and more reliable output at a higher cost.
Choosing the right LLM depends on your application needs—whether you prioritize reasoning quality, response time, token limits, or infrastructure control.
Comparison of LLMs
| LLM | Type | Strengths | Weaknesses |
|---|---|---|---|
| LLaMA 2 | Open-source | Locally deployable, no API costs, highly customizable | Requires setup, GPU-intensive |
| Mistral | Open-source | Lightweight, fast, suitable for edge devices | Limited support for long-context inputs |
| Falcon | Open-source | Highly scalable, permissive licensing | Inconsistent output formatting |
| Claude | Closed-source | Excellent reasoning capabilities, safe output generation | Slower performance, less control over formatting |
| GPT-4 | Closed-source | High-quality responses, strong reasoning | Limited API availability, higher cost |
| Gemini Flash series | Closed-source | Fast, consistent formatting, large context support | Requires API key and service integration |
The context window size determines how much input text (in tokens) the model can process at once. Larger context windows allow the LLM to reason over longer documents or conversations without losing track of earlier information.
For example, GPT-4o and Gemini models support up to 128K tokens, making them well-suited for use cases like multi-document summarization, long conversations, or complex RAG pipelines. In contrast, models with smaller context windows, such as GPT-3 and Mistral, may require aggressive chunking of inputs.
Context Window Sizes
| Model | Context Window Size |
|---|---|
| GPT-3 | 2,049 tokens |
| GPT-4o | 128K tokens |
| LLaMA 3 | 32K tokens |
| Gemini | 128K tokens |
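The context-window constraint above can be sketched as a rough token-budget check. The ~4-characters-per-token estimate is a common rule of thumb for English text, not an exact tokenizer, and the helper names below are hypothetical:

```python
def approx_tokens(text):
    """Very rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def fits_context(chunks, query, context_window, reserve_for_output=512):
    """Check whether the retrieved chunks plus the query fit the model's context window,
    reserving some tokens for the model's response."""
    used = approx_tokens(query) + sum(approx_tokens(c) for c in chunks)
    return used + reserve_for_output <= context_window

chunks = ["a" * 4000, "b" * 4000]   # roughly 1000 estimated tokens each
print(fits_context(chunks, "Summarize the paper.", context_window=2049))     # GPT-3-sized window
print(fits_context(chunks, "Summarize the paper.", context_window=128_000))  # GPT-4o-sized window
```

A real pipeline would use the model's own tokenizer for the count, but the budgeting logic is the same: when the check fails, the retriever must return fewer or smaller chunks.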
Takeaway: Use open-source models when you need more control, the ability to fine-tune, and cost-efficiency. Commercial models with accessible APIs may be more convenient for quick deployment.
All-in-One RAG Pipeline (LangChain + HuggingFace)
This is the simplest form of a RAG implementation using LangChain and HuggingFace.

```python
# 1. Load documents
from langchain_community.document_loaders import TextLoader, PyPDFLoader

# Load from .txt
txt_loader = TextLoader("ADD PATH FOR YOUR FILE")
txt_docs = txt_loader.load()

# Load from PDF
pdf_loader = PyPDFLoader("ADD PATH FOR YOUR FILE")
pdf_docs = pdf_loader.load()

# Combine all documents
docs = txt_docs + pdf_docs

# 2. Chunk the text
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100
)
# split_documents keeps each chunk's source metadata attached
texts = text_splitter.split_documents(docs)

# 3. Generate embeddings
from langchain_community.embeddings import HuggingFaceEmbeddings

embedding = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# 4. Create vector store
from langchain_community.vectorstores import FAISS

db = FAISS.from_documents(texts, embedding)
retriever = db.as_retriever()

# 5. Retrieve relevant documents and prepare prompt
query = "What are the related works that have been done on this?"
retrieved_docs = retriever.invoke(query)

context = "\n\n".join(doc.page_content for doc in retrieved_docs)

prompt = f"""
You are reading a research paper. The following is extracted content from the abstract and related work sections:

{context}

Please summarize the related work section of the paper, highlighting prior approaches, key methodologies, and how this paper builds upon or differs from them.
"""

# 6. Generate a response using your preferred model
# Replace this with an actual LLM call (OpenAI, HuggingFace, etc.)
print(prompt)
# print(response)  # Uncomment after adding model integration
```