🔍 [AI] An Introduction to RAG Systems

In 2025, discussing RAG (Retrieval-Augmented Generation) technology might not seem particularly fashionable in the rapidly evolving AI landscape. However, from a practical implementation perspective, RAG offers low technical barriers and significant performance improvements. With clear expected benefits, many institutions, companies, and departments are implementing RAG solutions, while researchers continue proposing various optimization approaches to enhance accuracy.
Previously, to support search functionality for internal network services, I built a RAG server. During this process, I researched numerous RAG approaches and implemented several solutions. Today, I'd like to share my insights on RAG through this document.
Due to time and space constraints, this article primarily focuses on organizing the relationships between various modules in RAG systems. The content leans towards educational concepts rather than delving into code and algorithmic details. For further technical specifics, please refer to the external links cited throughout the article.
Conceptual Foundation
Introducing RAG concepts in a textbook manner might be somewhat boring, so I'll use a practical example to illustrate this approach.
In early 2025, one of the most talked-about events in China was the release of the movie "Nezha: The Devil Boy" (《哪吒之魔童闹海》), which went on to become the highest-grossing animated film globally. Existing large language models weren't pre-trained on data about this new movie, making it a perfect test case.

For demonstration purposes, I chose DeepSeek as the LLM. I selected it because its chatbot keeps the interface simple: it doesn't search the internet unless the "web search" option is explicitly checked. Other chatbot products, such as Doubao, automatically trigger searches based on user intent, which makes it hard to control variables.

Stage 1: Let's ask a question: "In the movie 'Nezha: The Devil Boy', what is the name of Shen Gongbao's brother?" As you might expect, when dealing with completely untrained data, the LLM will generate fabricated information - what's commonly known as hallucination:

Stage 2: Of course, there's a low-cost yet highly effective technique when using LLMs: directly providing content that the model doesn't know in the prompt, then asking it to answer questions based on the given context. When we provide the plot summary of "Nezha 2," the answer becomes accurate:

Stage 3: Copy-pasting is still too cumbersome and provides poor user experience. Additionally, the need to summarize based on latest data is quite common, so various LLM chatbots quickly introduced "web search" functionality. The principle is straightforward: first search the query using search engines, then insert the top N webpage contents into the prompt, and have the LLM answer the original question based on this contextual information:

In fact, Stage 3 represents the application of RAG (Retrieval-Augmented Generation):
- Retrieval-Augmented: Search content from the internet, then place the retrieved content into the prompt
- Generation: Have the LLM generate desired information based on retrieved content combined with the original question
Through this combination approach, we can significantly improve the accuracy of LLM responses on untrained data.
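To make this retrieve-then-generate pattern concrete, here's a minimal sketch of how the Stage 3 prompt might be assembled. The `search_web` and `call_llm` helpers are hypothetical placeholders rather than any particular product's API:

```python
# Minimal retrieve-then-generate sketch; `search_web` and `call_llm` are hypothetical callables.

def build_prompt(question: str, passages: list[str]) -> str:
    """Insert the retrieved passages into the prompt as context for the LLM."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

def answer_with_rag(question: str, search_web, call_llm, top_n: int = 3) -> str:
    passages = search_web(question)[:top_n]    # Retrieval-Augmented: fetch the top-N results
    prompt = build_prompt(question, passages)  # place them into the prompt
    return call_llm(prompt)                    # Generation: answer based on the context
```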
The greatest strength of RAG technology lies in its ability to simultaneously leverage external knowledge bases and the emergent intelligence of LLMs to efficiently handle previously time-consuming tasks of retrieval, collection, and summarization. It's important to note that LLMs themselves cannot access private data during training, such as confidential information and knowledge bases from various organizations. By internally deploying RAG servers alongside LLM servers, companies can significantly improve efficiency in handling relevant internal affairs.
Fundamental Technical Concepts
This section analyzes the technical components required to build a basic RAG service.
As we've seen from the previous sections, RAG isn't strongly related to traditional AI infrastructure. When discussing LLMs, we typically think of GPU/CUDA/pre-training/post-training, but RAG is more like a search-centric complex collaborative system that integrates various data, components, and models of different scales.
Below is a typical RAG service workflow. I'll explain each step in detail:

Document Parsing
When implementing RAG services within enterprises, the biggest difference from internet search engines lies in the information carriers.
Web search information carriers are primarily HTML, and after years of development there are very mature web crawling and HTML parsing solutions. Internet companies with strong internal digitalization (for example, ByteDance with Lark) mainly carry text information in HTML/Markdown formats, which also have mature parsing solutions.
However, for most traditional companies, information carriers are primarily in formats like Word/Excel/PDF/Image/Chart, along with various non-standard files in vertical domains. Therefore, parsing these file formats and storing them in databases becomes the first step of RAG:
- Common formats like Word/Excel have many parsing tools available for direct use
- PDF/Image formats rely on OCR solutions, which are mature as traditional algorithms but may require customization for specific data
- Professional data formats require one-to-one specialized solutions
In summary, data processing and cleaning are crucial as they directly affect final output quality - otherwise, it's garbage in, garbage out. This is typically the grunt work that requires continuous debugging to improve recognition accuracy.
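As a rough illustration of the first two bullet points above, here's a sketch that pulls plain text out of .docx and PDF files. It assumes the third-party packages python-docx and pypdf are installed; real pipelines need far more cleanup (tables, headers/footers, OCR for scanned pages):

```python
# Sketch only: extract raw text from common office formats.
# Assumes `pip install python-docx pypdf`; scanned PDFs or images would need OCR instead.
from docx import Document
from pypdf import PdfReader

def parse_docx(path: str) -> str:
    doc = Document(path)
    return "\n".join(p.text for p in doc.paragraphs if p.text.strip())

def parse_pdf(path: str) -> str:
    reader = PdfReader(path)
    return "\n".join((page.extract_text() or "") for page in reader.pages)
```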
Since ByteDance has excellent internal information digitalization, I haven't focused much on these details. If interested, you can explore more about Document Intelligence in this field.
Chunking
Overview
Chunking refers to the process of dividing a long document into smaller segments or chunks.
Using the "Nezha 2" example from earlier, when my question is "What is the name of Shen Gongbao's brother?", the plot context I provided contains 90% irrelevant content to the question. This information is essentially redundant.

Ideally, if we chunk this content and RAG search only retrieves the sentence "Shen Gongbao's brother Shen Xiaobao came to seek refuge," we could theoretically save 90% of token costs! Moreover, shorter prompt content improves LLM recognition accuracy.

From another perspective, chunking is also related to subsequent "vector storage." If we store entire documents directly, we'll encounter performance and accuracy issues during future storage and search operations. Therefore, the best approach is to divide and conquer.
Chunking Strategies
For demonstration purposes, let's assume one Chinese character equals one token, and we'll chunk the following test text: 天劫之后,哪吒、敖丙的灵魂虽保住了,但肉身很快会魂飞魄散。太乙真人打算用七色宝莲给二人重塑肉身。但是在重塑肉身的过程中却遇到重重困难,哪吒、敖丙的命运将走向何方? (After the tribulation, although the souls of Nezha and Ao Bing were preserved, their physical bodies would soon disintegrate. Taiyi Zhenren planned to use the Seven-Colored Treasure Lotus to rebuild their bodies. However, they encountered numerous difficulties during the body reconstruction process. Where would the fate of Nezha and Ao Bing lead?)
There are several common chunking strategies:
1. Fixed-Length Chunking
This is the simplest approach: fix a chunk token count (typically the maximum token count accepted by the database) and directly split.
For the example above, assuming a chunk length of 20, the splitting results are as follows. The chunks are neatly divided but somewhat rigid, and each chunk loses some of its surrounding context:
[
'天劫之后,哪吒、敖丙的灵魂虽保住了,但肉',
'身很快会魂飞魄散。太乙真人打算用七色宝莲',
'给二人重塑肉身。但是在重塑肉身的过程中却',
'遇到重重困难,哪吒、敖丙的命运将走向何方',
]
A common approach is to allow content overlap between chunks, strengthening contextual connections through redundancy. Here, while keeping chunk_size at 20, setting chunk_overlap to 5 shows improvement, though the cuts remain somewhat harsh:
[
'天劫之后,哪吒、敖丙的灵魂虽保住了,但肉',
'住了,但肉身很快会魂飞魄散。太乙真人打算',
'乙真人打算用七色宝莲给二人重塑肉身。但是',
'肉身。但是在重塑肉身的过程中却遇到重重困',
'遇到重重困难,哪吒、敖丙的命运将走向何方',
]
For demonstration purposes, very small chunk_size was used here. In actual engineering practice, chunk sizes of 512/1024/4096 are relatively better.
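Here's a minimal sketch of this idea in Python, treating one character as one token as in the example; production splitters (such as those shipped with LangChain) add extra handling for separators and remainders:

```python
def fixed_length_chunks(text: str, chunk_size: int = 20, chunk_overlap: int = 0) -> list[str]:
    """Cut text into fixed-size chunks; consecutive chunks share `chunk_overlap` characters."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

plot = "天劫之后,哪吒、敖丙的灵魂虽保住了,但肉身很快会魂飞魄散。太乙真人打算用七色宝莲给二人重塑肉身。但是在重塑肉身的过程中却遇到重重困难,哪吒、敖丙的命运将走向何方?"
print(fixed_length_chunks(plot, 20))     # rigid cuts
print(fixed_length_chunks(plot, 20, 5))  # overlapping cuts preserve more context
```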
2. Sentence/Paragraph-based Chunking
The previous example shows that fixed-length chunking is quite rigid. We can adopt a different approach utilizing meta information from documents - punctuation marks. When humans write text, periods and line breaks serve as natural structural divisions that we can leverage.
Here, splitting by 。 (period) shows significantly better results without sentence fragmentation issues:
[
'天劫之后,哪吒、敖丙的灵魂虽保住了,但肉身很快会魂飞魄散',
'太乙真人打算用七色宝莲给二人重塑肉身。',
'但是在重塑肉身的过程中却遇到重重困难,哪吒、敖丙的命运将走向何方?',
]
We can also apply an overlap strategy when chunking by sentences/paragraphs, prepending the preceding sentence to each chunk to enrich its context:
[
'天劫之后,哪吒、敖丙的灵魂虽保住了,但肉身很快会魂飞魄散',
'天劫之后,哪吒、敖丙的灵魂虽保住了,但肉身很快会魂飞魄散。太乙真人打算用七色宝莲给二人重塑肉身。',
'太乙真人打算用七色宝莲给二人重塑肉身。但是在重塑肉身的过程中却遇到重重困难,哪吒、敖丙的命运将走向何方?',
]
Of course, we can also combine both approaches: first split into sentences/paragraphs, then concatenate with rolling windows to approach fixed chunk sizes, balancing both advantages.
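Here's a sketch that combines sentence splitting with a rolling window, so chunks approach a target size while respecting sentence boundaries; the punctuation set and sizes are illustrative:

```python
import re

def sentence_chunks(text: str, max_len: int = 1024, overlap_sentences: int = 1) -> list[str]:
    """Split on sentence-ending punctuation, then pack sentences into chunks of up to
    `max_len` characters, carrying the last `overlap_sentences` sentences into the next chunk."""
    # Split after Chinese/Western sentence-ending punctuation, keeping the punctuation attached.
    sentences = [s for s in re.split(r"(?<=[。!?!?])", text) if s.strip()]
    chunks, window = [], []
    for sent in sentences:
        if window and sum(len(s) for s in window) + len(sent) > max_len:
            chunks.append("".join(window))
            window = window[-overlap_sentences:] if overlap_sentences else []
        window.append(sent)
    if window:
        chunks.append("".join(window))
    return chunks
```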
3. Document Structure-based Chunking
Many documents contain heading structures that are reflected in their carriers. For example, HTML has <h1> and <h2> tags, while markdown has # and ## structural identifiers. This is natural metadata that human authors have already written and confirmed. For documents with clear structure, we can cut along these elements to better balance chunk_size and chunk context.
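For markdown sources, a structure-aware splitter could be sketched like this (the heading depth is illustrative; overly long sections can then be split further with the strategies above):

```python
import re

def split_markdown_by_headings(md_text: str, max_level: int = 2) -> list[str]:
    """Split a markdown document into sections at headings of level 1..max_level."""
    heading = re.compile(rf"^#{{1,{max_level}}}\s")  # e.g. lines starting with "# " or "## "
    sections, current = [], []
    for line in md_text.splitlines():
        if heading.match(line) and current:          # a new heading starts a new section
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections
```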
Metadata Optimization
When constructing chunks, one often overlooked aspect is adding metadata.
For example, documents often come with metadata that has already been organized or classified, such as directory structure, filenames, links, authors, and categories.
When building chunks, we can attach this metadata so that during subsequent search and retrieval we can first filter on it. Filtering on such explicit classifications is usually more accurate than relying on the LLM to guess.
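A sketch of what metadata-carrying chunks and a pre-filter might look like; the field names and the second example chunk are made up for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)  # e.g. filename, author, category, section path

chunks = [
    Chunk("天劫之后,哪吒、敖丙的灵魂虽保住了……",
          {"source": "nezha2_plot.md", "category": "movie/plot"}),
    Chunk("2024 annual budget notes ...",
          {"source": "budget_2024.docx", "category": "finance"}),
]

# Pre-filter on explicit metadata before running any vector search.
movie_chunks = [c for c in chunks if c.metadata.get("category", "").startswith("movie")]
```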
Data Storage
Whether the data is left unprocessed or finely chunked, it has to be stored. Let's revisit RAG's core requirement: efficiently search and retrieve content relevant to the question. Data storage must therefore align with this "search" requirement.
With internet development, storage content is no longer limited to text. Text, audio, video, images, and interactive behaviors all have storage and retrieval needs. Traditional relational databases and text-based search algorithms increasingly struggle to meet the growing demand for retrieval.
Of course, technology continues to evolve, leading to the development of vector embeddings - a technical concept that can embed various types of data into the same space. In this embedding space, semantically similar content has "closer" distances. This fundamental capability is crucial for systems with search requirements (such as search engines, recommendation systems, and RAG).
(Figure: example visualizations of an embedding space and of clustering within that space.)
From the above, we can see that embeddings have two focus points:
- How to convert data into appropriate vectors
- How to compare the "distance" between two vectors (a minimal sketch of this follows below)
Let's look at how each of these two focus points has evolved. Due to space and focus considerations, we won't delve into algorithmic details, and for simplicity the following introduces the related algorithms using text as the example.
Evolution of Embeddings
This section discusses how to convert text into vectors.
➊ TF-IDF
TF-IDF stands for Term Frequency-Inverse Document Frequency.
- TF (Term Frequency): The more frequently a word appears in a single document, the more important it might be.
- IDF (Inverse Document Frequency): If a word appears more commonly across all documents (like "the", "is"), it's less important.
TF-IDF = TF × IDF - the higher the value, the more unique and critical the word is to the current document.
Through this frequency calculation, each document can be represented as a vector of per-word weights. From the formula, we can see that rare, distinctive words in a document receive higher weights. However, this is still purely frequency-based and cannot capture the semantics behind the words.
Another issue with this method is that it produces very sparse vectors: the vector length equals the vocabulary size of the corpus. English has approximately 470k unique words, while a sentence contains only dozens of them, so sentence vectors are extremely sparse (99.99% of the values are 0) and carry very little information per dimension, which is unfavorable for storage and computation.
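As a rough illustration, scikit-learn's TfidfVectorizer builds exactly these sparse document vectors; the toy corpus below is made up and assumes scikit-learn is installed:

```python
# Assumes `pip install scikit-learn`; the corpus here is a toy example.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock prices fell sharply on monday",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)        # sparse matrix: one row per document

print(X.shape)                              # (3 documents, vocabulary-sized columns)
print(vectorizer.get_feature_names_out())   # the vocabulary backing each dimension
print(X.toarray())                          # mostly zeros even on this tiny corpus
```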


