Retrieval-Augmented Generation (RAG) and GPT (Generative Pre-trained Transformers) models

Notes on Retrieval-Augmented Generation (RAG)


Starting from scratch

What is it?

Generation, in the context of LLMs, is basically producing new text in response to a query (prompt). But there are some common problems with this text generation, because LLMs answer based only on their training data. These fall into:

  • No verifiable source backing the answer
  • Answers that are out of date

What happens when we add Retrieval (a retriever)?

We connect a content store, which can be the internet or a set of documents, feeding indexed (but otherwise unstructured) information to the LLM. The LLM can then retrieve information from the content store before answering the query. This data is not real time, as it needs to be previously indexed and stored in vector DBs (see the section on GPT encoders below).

So now, before generating the answer to the prompt, the LLM asks the retriever to find relevant information and hand it over as context, and only THEN generates the answer. Information can thus be supported by sources and does not need to be out of date.

Also, with RAG, instead of the model generating gibberish when it does not know the answer, it can simply say "I don't know" when it finds no relevant information in the data source.
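As a rough sketch of this flow, assuming a toy word-overlap "retriever" in place of a real vector store, and printing the assembled prompt instead of calling an actual LLM:

```python
import re

# Retrieve-then-generate sketch: look up context first, fall back to
# "I don't know" below a relevance threshold, otherwise build the prompt
# that a real system would send to the LLM.

DOCS = [
    "RAG grounds answers in retrieved documents.",
    "Transformers process the whole input at once via self-attention.",
]

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))

def relevance(query: str, doc: str) -> float:
    # Toy word-overlap score; a real retriever compares vector embeddings.
    q = tokens(query)
    return len(q & tokens(doc)) / len(q) if q else 0.0

def answer(query: str, threshold: float = 0.3) -> str:
    doc = max(DOCS, key=lambda d: relevance(query, d))
    if relevance(query, doc) < threshold:
        return "I don't know."  # no relevant context in the store
    # A real system would now send this prompt to the LLM.
    return f"Context: {doc}\nQuestion: {query}\nAnswer using the context."

print(answer("How does RAG ground answers?"))
print(answer("What is the capital of Freedonia?"))  # -> I don't know.
```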

However, here lies another problem: the retriever must be powerful enough to feed the model the best possible data, because otherwise the model may fail to answer the prompt or give an inaccurate response.

RAG vs Function calling

While RAG and function calling might seem similar, certain points differentiate them. RAG fetches unstructured data from external documents or knowledge bases that were previously indexed (vector stores), whereas function calling calls an API or service through predefined functions, returning fresh, structured, real-time data. In this context, RAG is more focused on complex, knowledge-intensive queries, while function calling is mostly used for dynamic, real-time data retrieval.
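For contrast, a minimal sketch of the function-calling side; get_weather and its schema are hypothetical, loosely following the OpenAI-style tool format:

```python
import json

# Function-calling sketch: the application exposes a predefined function
# schema, the model picks the function and its arguments, and the
# application executes it, returning fresh, structured data.

tools = [{
    "name": "get_weather",  # hypothetical function for illustration
    "description": "Return the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

def get_weather(city: str) -> dict:
    # Stand-in for a real API call returning real-time data.
    return {"city": city, "temp_c": 21, "conditions": "clear"}

# Suppose the model returned this call after reading the tools schema:
model_call = {"name": "get_weather", "arguments": json.dumps({"city": "Madrid"})}
result = get_weather(**json.loads(model_call["arguments"]))
print(result)  # structured result, fed back to the model for the final answer
```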

RAG vs Fine-tuning

RAG is useful for bringing in new knowledge, while fine-tuning can be used to improve model performance and efficiency by improving its internal knowledge. These approaches are not exclusive and complement each other in LLM applications that need complex knowledge and intensive, scalable behavior.

Understanding RAG

It is a combination of two key components:

  • Retrieval mechanism: Fetches relevant documents or pieces of information from a large corpus based on the input
  • Generation mechanism: Generates a response based on both the input query and the retrieved documents

Retrieval mechanisms

  • Semantic Search: Retrieves documents based on meaning rather than exact word match. Typically uses vector embeddings to represent the semantic meaning of texts and retrieves documents based on their similarity to the query. More often than not, RAG systems use GPT or BERT models to create these vector embeddings.

  • Traditional Keyword-Based Search: Matches words between the query and the documents in the corpus

  • Hybrid: Combines both (a toy hybrid-scoring sketch follows this list)
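Something like the following, where the bag-of-letters "embedding" is purely illustrative; a real system would use a learned embedding model such as BERT or a GPT-style encoder:

```python
from math import sqrt
import re

# Hybrid retrieval sketch: blend keyword overlap with semantic similarity.

def keyword_score(query: str, doc: str) -> float:
    q = set(re.findall(r"[a-z]+", query.lower()))
    d = set(re.findall(r"[a-z]+", doc.lower()))
    return len(q & d) / len(q) if q else 0.0

def embed(text: str) -> list[float]:
    # Toy "embedding": letter-frequency vector, for illustration only.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isascii() and ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def semantic_score(query: str, doc: str) -> float:
    a, b = embed(query), embed(doc)
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_score(query: str, doc: str, alpha: float = 0.5) -> float:
    # alpha weights the semantic signal against exact keyword matching.
    return alpha * semantic_score(query, doc) + (1 - alpha) * keyword_score(query, doc)

docs = ["vector embeddings capture meaning", "exact keyword match search"]
print(sorted(docs, key=lambda d: hybrid_score("meaning of embeddings", d), reverse=True))
```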

What is a GPT?

GPT stands for Generative Pre-trained Transformer: neural network models that use the transformer architecture to power generative AI applications. The value of these kinds of models lies in the speed and scale at which they can operate. Ex: writing a document on nuclear plants would take a human hours, while a GPT can do it in seconds.

How does it work?

These GPT models are neural network-based language prediction models built on the Transformer architecture. They analyze natural language (prompts) and predict the best possible response based on their understanding of such language.

This understanding is based on the knowledge gained during training, encoded across hundreds of billions of parameters. They can take the input context into account and attend to different parts of the input. Ex: we could ask a GPT for Shakespeare-inspired content and it would produce it, recalling that style and constructing new sentences in a similar vein.

Neural Networks

There are different types of neural networks, such as recurrent and convolutional ones, or, in the case of GPT, transformer neural networks. These use a self-attention mechanism to focus on different parts of the input during each processing step, which lets them capture more context and improves performance on NLP tasks. Transformer models have two parts:

  • Encoder: Pre-processes the input into embeddings (fixed-length vector representations): mathematical representations of words in a vector space, where words that are close together are expected to have similar meanings. These embeddings are processed taking the contextual information of the input into account. Once the input is received, the encoder splits the words into embeddings and assigns each a weight indicating the word's relevance. Positional encoders also help GPT models avoid ambiguous meanings when a word is used in another part of a sentence.

  • Decoder: Uses the vector representations to predict the requested output. It has self-attention mechanisms to focus on different parts of the input and infer the matching output (see the sketch at the end of this section). Compared to previous neural nets, transformers are more parallelizable because they don't process words sequentially; instead, they process the whole input at once during the learning cycle. Thanks to training and fine-tuning, GPT models are able to give fluent answers back.

Thanks to these encoding and decoding techniques, we can train a GPT further by feeding it more inputs that match our expected output.
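A minimal NumPy sketch of two ingredients described above: sinusoidal positional encoding added to the token embeddings, then scaled dot-product self-attention. The weight matrices are random here; in a trained GPT they are learned parameters:

```python
import numpy as np

def positional_encoding(seq_len: int, d: int) -> np.ndarray:
    # Standard sinusoidal encoding: gives each position a unique signature.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d)
    enc = np.zeros((seq_len, d))
    enc[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions: sine
    enc[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions: cosine
    return enc

def self_attention(x: np.ndarray) -> np.ndarray:
    """x: (seq_len, d) token embeddings with positions already added."""
    d = x.shape[1]
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(d)  # how strongly each token attends to the others
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the sequence
    return weights @ V  # context-aware token representations

embeddings = np.random.default_rng(1).standard_normal((4, 8))  # 4 tokens, d=8
x = embeddings + positional_encoding(4, 8)
print(self_attention(x).shape)  # (4, 8)
```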

RAG framework

Retrieval

Deals with retrieving highly relevant context from a source. This is done through a retriever. The latter can be enhanced by:

  • Enhancing semantic representation: Improving semantic representations by chunking (choosing the right chunking strategy; a minimal sketch follows this list) and fine-tuning embedding models (adapting the vector embeddings of the chunks to the specialized domain).

  • Aligning queries and documents: Align queries with documents in the semantic space. This may be needed when a user's query lacks semantic information or uses imprecise phrasing. Query rewriting or embedding transformation (optimizing query representations to align them with a latent space more closely tied to the task) can be used for improvement.

  • Aligning retriever and LLM: Align the retriever's outputs with the preferences of the LLM. For this we can use adapters or a fine-tuned retriever that learns to produce the kind of output the LLM works best with.
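A minimal chunking sketch, assuming a simple fixed-size strategy with overlap (the size and overlap values below are arbitrary):

```python
# Chunking sketch: split a document into overlapping fixed-size chunks
# before embedding. Chunk size and overlap are the knobs a chunking
# strategy tunes; overlap preserves context across chunk boundaries.
def chunk(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    words = text.split()
    step = size - overlap  # must stay positive to make progress
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

doc = ("word " * 120).strip()
for i, c in enumerate(chunk(doc)):
    print(i, len(c.split()))  # chunks of <=50 words, sharing 10 with the previous
```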

Generation

Deals with generating a coherent response based on the retrieved context. This is done through a generation model. This step involves diverse data, which may sometimes require effort to adapt the language model to the input data. For that we could use:

  • Post-retrieval optimization with frozen LLMs: Leaves the LLM untouched and focuses on enhancing the quality of the retrieved data. Compressing and reranking the information can help reduce noise and improve generation, because the most relevant elements in the data get prioritized (see the rerank sketch after this list).
  • Fine-tuning the LLM for RAG: The generator can be fine-tuned to ensure the generated text is natural.
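A small sketch of the post-retrieval rerank idea, with a toy word-overlap scorer standing in for a real reranker model:

```python
import re

# Rerank sketch for a frozen LLM: score the retrieved passages against
# the query and keep only the top-k, so the prompt contains the most
# relevant context and less noise.

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))

def rerank(query: str, passages: list[str], k: int = 2) -> list[str]:
    score = lambda p: len(tokens(query) & tokens(p))  # toy relevance score
    return sorted(passages, key=score, reverse=True)[:k]

retrieved = ["self-attention in transformers", "RAG retrieval pipelines",
             "cooking recipes", "reranking retrieved passages"]
print(rerank("how does RAG rerank retrieved passages?", retrieved))
```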

Augmentation

Involves effectively integrating the context from retrieved passages into the current generation task. Augmentation can happen at different stages, such as pre-training, fine-tuning, or inference.

We can also do augmentation at the source, categorizing data into unstructured, structured and LLM-generated.

Augmentation can also be applied to the RAG process itself by augmenting the retrieval stage: iterative retrieval (multiple retrieval cycles that enhance the depth and relevance of the information), recursive retrieval (recursively feeding the output of one retrieval step as the input to the next, which enables delving deeper into relevant information for complex, multi-step queries), or adaptive retrieval (tailoring the retrieval process to specific demands, determining the optimal moments and context for retrieval). A toy sketch of iterative retrieval follows.
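Here each cycle folds the newly retrieved passage into the query for the next cycle; the retriever and the "query refinement" are naive stand-ins for a real retriever and an LLM-driven query rewriter:

```python
import re

# Iterative retrieval sketch: several retrieve-then-refine cycles that
# progressively deepen the gathered context.

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))

DOCS = ["transformers use self-attention",
        "attention weights form a softmax",
        "softmax normalizes scores into probabilities"]

def iterative_retrieval(query: str, cycles: int = 3) -> list[str]:
    context, q = [], query
    for _ in range(cycles):
        remaining = [d for d in DOCS if d not in context]
        if not remaining:
            break
        doc = max(remaining, key=lambda d: len(tokens(q) & tokens(d)))
        context.append(doc)
        q = q + " " + doc  # fold the retrieved text into the next query
    return context

# Each cycle pulls in a passage the previous one made reachable.
print(iterative_retrieval("how do transformers compute attention?"))
```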
