🔍 [AI] An Introduction to RAG Systems

In 2025, discussing RAG (Retrieval-Augmented Generation) technology might not seem particularly fashionable in the rapidly evolving AI landscape. However, from a practical implementation perspective, RAG offers low technical barriers and significant performance improvements. With clear expected benefits, many institutions, companies, and departments are implementing RAG solutions, while researchers continue proposing various optimization approaches to enhance accuracy.
Previously, to support search functionality for internal network services, I built a RAG server. During this process, I researched numerous RAG approaches and implemented several solutions. Today, I'd like to share my insights on RAG through this document.
Due to time and space constraints, this article primarily focuses on organizing the relationships between various modules in RAG systems. The content leans towards educational concepts rather than delving into code and algorithmic details. For further technical specifics, please refer to the external links cited throughout the article.
Conceptual Foundation
Introducing RAG concepts in a textbook manner might be somewhat boring, so I'll use a practical example to illustrate this approach.
In early 2025, one of the most talked-about events in China was the release of the movie "Nezha: The Devil Boy" (《哪吒之魔童闹海》), which went on to become the highest-grossing animated film globally. Existing large language models weren't pre-trained on data about this new movie, making it a perfect test case.

For demonstration purposes, I chose DeepSeek as the LLM. I selected it because its chatbot keeps the interface simple: it doesn't search the internet unless the "web search" option is explicitly checked. Other chatbot products, such as Doubao, automatically trigger searches based on user intent, which makes it hard to control variables.

Stage 1: Let's ask a question: "In the movie 'Nezha: The Devil Boy', what is the name of Shen Gongbao's brother?" As you might expect, when dealing with completely untrained data, the LLM will generate fabricated information - what's commonly known as hallucination:

Stage 2: Of course, there's a low-cost yet highly effective technique when using LLMs: directly providing content that the model doesn't know in the prompt, then asking it to answer questions based on the given context. When we provide the plot summary of "Nezha 2," the answer becomes accurate:

Stage 3: Copy-pasting is still too cumbersome and provides poor user experience. Additionally, the need to summarize based on latest data is quite common, so various LLM chatbots quickly introduced "web search" functionality. The principle is straightforward: first search the query using search engines, then insert the top N webpage contents into the prompt, and have the LLM answer the original question based on this contextual information:

In fact, Stage 3 represents the application of RAG (Retrieval-Augmented Generation):
- Retrieval-Augmented: Search content from the internet, then place the retrieved content into the prompt
- Generation: Have the LLM generate desired information based on retrieved content combined with the original question
Through this combination approach, we can significantly improve the accuracy of LLM responses on untrained data.
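To make this retrieve-then-generate pattern concrete, here's a minimal sketch of how the Stage 3 prompt might be assembled. The `search_web` and `call_llm` helpers are hypothetical placeholders rather than any particular product's API:

```python
# Minimal retrieve-then-generate sketch; `search_web` and `call_llm` are hypothetical callables.

def build_prompt(question: str, passages: list[str]) -> str:
    """Insert the retrieved passages into the prompt as context for the LLM."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

def answer_with_rag(question: str, search_web, call_llm, top_n: int = 3) -> str:
    passages = search_web(question)[:top_n]    # Retrieval-Augmented: fetch the top-N results
    prompt = build_prompt(question, passages)  # place them into the prompt
    return call_llm(prompt)                    # Generation: answer based on the context
```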
The greatest strength of RAG technology lies in its ability to simultaneously leverage external knowledge bases and the emergent intelligence of LLMs to efficiently handle previously time-consuming tasks of retrieval, collection, and summarization. It's important to note that LLMs themselves cannot access private data during training, such as confidential information and knowledge bases from various organizations. By internally deploying RAG servers alongside LLM servers, companies can significantly improve efficiency in handling relevant internal affairs.
Fundamental Technical Concepts
This section analyzes the technical components required to build a basic RAG service.
As we've seen from the previous sections, RAG isn't strongly related to traditional AI infrastructure. When discussing LLMs, we typically think of GPU/CUDA/pre-training/post-training, but RAG is more like a search-centric complex collaborative system that integrates various data, components, and models of different scales.
Below is a typical RAG service workflow. I'll explain each step in detail:

Document Parsing
When implementing RAG services within enterprises, the biggest difference from internet search engines lies in the information carriers.
Web search information carriers are primarily HTML, and after years of development there are very mature web crawling and HTML parsing solutions. Internet companies with strong internal digitalization (for example, ByteDance with Lark) mainly carry text information in HTML/Markdown formats, which also have mature parsing solutions.
However, for most traditional companies, information carriers are primarily in formats like Word/Excel/PDF/Image/Chart, along with various non-standard files in vertical domains. Therefore, parsing these file formats and storing them in databases becomes the first step of RAG:
- Common formats like Word/Excel have many parsing tools available for direct use
- PDF/Image formats rely on OCR solutions, which are mature as traditional algorithms but may require customization for specific data
- Professional data formats require one-to-one specialized solutions
In summary, data processing and cleaning are crucial as they directly affect final output quality - otherwise, it's garbage in, garbage out. This is typically the grunt work that requires continuous debugging to improve recognition accuracy.
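As a rough illustration of the first two bullet points above, here's a sketch that pulls plain text out of .docx and PDF files. It assumes the third-party packages python-docx and pypdf are installed; real pipelines need far more cleanup (tables, headers/footers, OCR for scanned pages):

```python
# Sketch only: extract raw text from common office formats.
# Assumes `pip install python-docx pypdf`; scanned PDFs or images would need OCR instead.
from docx import Document
from pypdf import PdfReader

def parse_docx(path: str) -> str:
    doc = Document(path)
    return "\n".join(p.text for p in doc.paragraphs if p.text.strip())

def parse_pdf(path: str) -> str:
    reader = PdfReader(path)
    return "\n".join((page.extract_text() or "") for page in reader.pages)
```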
Since ByteDance has excellent internal information digitalization, I haven't focused much on these details. If interested, you can explore more about Document Intelligence in this field.
Chunking
Overview
Chunking refers to the process of dividing a long document into smaller segments or chunks.
Using the "Nezha 2" example from earlier, when my question is "What is the name of Shen Gongbao's brother?", the plot context I provided contains 90% irrelevant content to the question. This information is essentially redundant.

Ideally, if we chunk this content and RAG search only retrieves the sentence "Shen Gongbao's brother Shen Xiaobao came to seek refuge," we could theoretically save 90% of token costs! Moreover, shorter prompt content improves LLM recognition accuracy.

From another perspective, chunking is also related to subsequent "vector storage." If we store entire documents directly, we'll encounter performance and accuracy issues during future storage and search operations. Therefore, the best approach is to divide and conquer.
Chunking Strategies
For demonstration purposes, let's assume one Chinese character equals one token, and we'll chunk the following test text: 天劫之后,哪吒、敖丙的灵魂虽保住了,但肉身很快会魂飞魄散。太乙真人打算用七色宝莲给二人重塑肉身。但是在重塑肉身的过程中却遇到重重困难,哪吒、敖丙的命运将走向何方? (After the tribulation, although the souls of Nezha and Ao Bing were preserved, their physical bodies would soon disintegrate. Taiyi Zhenren planned to use the Seven-Colored Treasure Lotus to rebuild their bodies. However, they encountered numerous difficulties during the body reconstruction process. Where would the fate of Nezha and Ao Bing lead?)
There are several common chunking strategies:
1. Fixed-Length Chunking
This is the simplest approach: fix a chunk token count (typically the maximum token count accepted by the database) and directly split.
For the example above, assuming a chunk length of 20, the splitting results are as follows. The chunks are neatly divided but somewhat rigid, and each chunk loses some of its surrounding context:
[
'天劫之后,哪吒、敖丙的灵魂虽保住了,但肉',
'身很快会魂飞魄散。太乙真人打算用七色宝莲',
'给二人重塑肉身。但是在重塑肉身的过程中却',
'遇到重重困难,哪吒、敖丙的命运将走向何方',
]
A common approach is to allow content overlap between chunks, strengthening contextual connections through redundancy. Here, while keeping chunk_size at 20, setting chunk_overlap to 5 shows improvement, though the cuts remain somewhat harsh:
[
'天劫之后,哪吒、敖丙的灵魂虽保住了,但肉',
'住了,但肉身很快会魂飞魄散。太乙真人打算',
'乙真人打算用七色宝莲给二人重塑肉身。但是',
'肉身。但是在重塑肉身的过程中却遇到重重困',
'遇到重重困难,哪吒、敖丙的命运将走向何方',
]
For demonstration purposes, very small chunk_size was used here. In actual engineering practice, chunk sizes of 512/1024/4096 are relatively better.
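Here's a minimal sketch of this idea in Python, treating one character as one token as in the example; production splitters (such as those shipped with LangChain) add extra handling for separators and remainders:

```python
def fixed_length_chunks(text: str, chunk_size: int = 20, chunk_overlap: int = 0) -> list[str]:
    """Cut text into fixed-size chunks; consecutive chunks share `chunk_overlap` characters."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

plot = "天劫之后,哪吒、敖丙的灵魂虽保住了,但肉身很快会魂飞魄散。太乙真人打算用七色宝莲给二人重塑肉身。但是在重塑肉身的过程中却遇到重重困难,哪吒、敖丙的命运将走向何方?"
print(fixed_length_chunks(plot, 20))     # rigid cuts
print(fixed_length_chunks(plot, 20, 5))  # overlapping cuts preserve more context
```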
2. Sentence/Paragraph-based Chunking
The previous example shows that fixed-length chunking is quite rigid. We can adopt a different approach utilizing meta information from documents - punctuation marks. When humans write text, periods and line breaks serve as natural structural divisions that we can leverage.
Here, splitting by 。 (period) shows significantly better results without sentence fragmentation issues:
[
'天劫之后,哪吒、敖丙的灵魂虽保住了,但肉身很快会魂飞魄散',
'太乙真人打算用七色宝莲给二人重塑肉身。',
'但是在重塑肉身的过程中却遇到重重困难,哪吒、敖丙的命运将走向何方?',
]
We can also apply an overlap strategy when chunking by sentences/paragraphs, prepending the preceding sentence to each chunk to enrich its context:
[
'天劫之后,哪吒、敖丙的灵魂虽保住了,但肉身很快会魂飞魄散',
'天劫之后,哪吒、敖丙的灵魂虽保住了,但肉身很快会魂飞魄散。太乙真人打算用七色宝莲给二人重塑肉身。',
'太乙真人打算用七色宝莲给二人重塑肉身。但是在重塑肉身的过程中却遇到重重困难,哪吒、敖丙的命运将走向何方?',
]
Of course, we can also combine both approaches: first split into sentences/paragraphs, then concatenate with rolling windows to approach fixed chunk sizes, balancing both advantages.
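Here's a sketch that combines sentence splitting with a rolling window, so chunks approach a target size while respecting sentence boundaries; the punctuation set and sizes are illustrative:

```python
import re

def sentence_chunks(text: str, max_len: int = 1024, overlap_sentences: int = 1) -> list[str]:
    """Split on sentence-ending punctuation, then pack sentences into chunks of up to
    `max_len` characters, carrying the last `overlap_sentences` sentences into the next chunk."""
    # Split after Chinese/Western sentence-ending punctuation, keeping the punctuation attached.
    sentences = [s for s in re.split(r"(?<=[。!?!?])", text) if s.strip()]
    chunks, window = [], []
    for sent in sentences:
        if window and sum(len(s) for s in window) + len(sent) > max_len:
            chunks.append("".join(window))
            window = window[-overlap_sentences:] if overlap_sentences else []
        window.append(sent)
    if window:
        chunks.append("".join(window))
    return chunks
```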
3. Document Structure-based Chunking
Many documents contain heading structures that are reflected in their carriers. For example, HTML has <h1> and <h2> tags, while markdown has # and ## structural identifiers. This is natural metadata that human authors have already written and confirmed. For documents with clear structure, we can cut along these elements to better balance chunk_size and chunk context.
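For markdown sources, a structure-aware splitter could be sketched like this (the heading depth is illustrative; overly long sections can then be split further with the strategies above):

```python
import re

def split_markdown_by_headings(md_text: str, max_level: int = 2) -> list[str]:
    """Split a markdown document into sections at headings of level 1..max_level."""
    heading = re.compile(rf"^#{{1,{max_level}}}\s")  # e.g. lines starting with "# " or "## "
    sections, current = [], []
    for line in md_text.splitlines():
        if heading.match(line) and current:          # a new heading starts a new section
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections
```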
Metadata Optimization
When constructing chunks, one often overlooked aspect is adding metadata.
For example, documents often come with metadata that has already been organized or classified, such as directory structure, filenames, links, authors, and categories.
When building chunks, we can attach this metadata so that during subsequent search and retrieval we can first filter on it. Filtering on such explicit classifications is usually more accurate than relying on the LLM to guess.
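A sketch of what metadata-carrying chunks and a pre-filter might look like; the field names and the second example chunk are made up for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)  # e.g. filename, author, category, section path

chunks = [
    Chunk("天劫之后,哪吒、敖丙的灵魂虽保住了……",
          {"source": "nezha2_plot.md", "category": "movie/plot"}),
    Chunk("2024 annual budget notes ...",
          {"source": "budget_2024.docx", "category": "finance"}),
]

# Pre-filter on explicit metadata before running any vector search.
movie_chunks = [c for c in chunks if c.metadata.get("category", "").startswith("movie")]
```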
Data Storage
Whether the data is left unprocessed or finely chunked, it has to be stored. Let's revisit RAG's core requirement: efficiently search and retrieve content relevant to the question. Data storage must therefore align with this "search" requirement.
With internet development, storage content is no longer limited to text. Text, audio, video, images, and interactive behaviors all have storage and retrieval needs. Traditional relational databases and text-based search algorithms increasingly struggle to meet the growing demand for retrieval.
Of course, technology continues to evolve, leading to the development of vector embeddings - a technical concept that can embed various types of data into the same space. In this embedding space, semantically similar content has "closer" distances. This fundamental capability is crucial for systems with search requirements (such as search engines, recommendation systems, and RAG).
(Figure: example visualizations of an embedding space and of clustering within that space.)
From the above, we can see that embeddings have two focus points:
- How to convert data into appropriate vectors
- How to compare the "distance" between two vectors (a minimal sketch of this follows below)
Let's look at how each of these two focus points has evolved. Due to space and focus considerations, we won't delve into algorithmic details, and for simplicity the following introduces the related algorithms using text as the example.
Evolution of Embeddings
This section discusses how to convert text into vectors.
➊ TF-IDF
TF-IDF stands for Term Frequency-Inverse Document Frequency.
- TF (Term Frequency): The more frequently a word appears in a single document, the more important it might be.
- IDF (Inverse Document Frequency): If a word appears more commonly across all documents (like "the", "is"), it's less important.
TF-IDF = TF × IDF - the higher the value, the more unique and critical the word is to the current document.
Through this frequency calculation, each document can be represented as a vector of per-word weights. From the formula, we can see that rare, distinctive words in a document receive higher weights. However, this is still purely frequency-based and cannot capture the semantics behind the words.
Another issue with this method is that it produces very sparse vectors: the vector length equals the vocabulary size of the corpus. English has approximately 470k unique words, while a sentence contains only dozens of them, so sentence vectors are extremely sparse (99.99% of the values are 0) and carry very little information per dimension, which is unfavorable for storage and computation.
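As a rough illustration, scikit-learn's TfidfVectorizer builds exactly these sparse document vectors; the toy corpus below is made up and assumes scikit-learn is installed:

```python
# Assumes `pip install scikit-learn`; the corpus here is a toy example.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock prices fell sharply on monday",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)        # sparse matrix: one row per document

print(X.shape)                              # (3 documents, vocabulary-sized columns)
print(vectorizer.get_feature_names_out())   # the vocabulary backing each dimension
print(X.toarray())                          # mostly zeros even on this tiny corpus
```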


