Introduction to Retrieval Augmented Generators (RAG): Enhancing Virtual Assistants with Domain-Specific Knowledge
Retrieval Augmented Generators (RAG) are an innovative approach to supplementing Virtual Assistants (built on Large Language Models, LLMs) with proprietary, private, and domain-specific information and insights. Building on our preceding discussion, "Integrating Enterprise Knowledge with LLMs", we remain convinced of the efficacy, precision, and vast potential of RAG. This conviction underpins RAG's pivotal role in the architecture of our PondhouseAI, which aims to substantially simplify how we access and interact with existing knowledge.
As is customary with the advent of new technologies, initial skepticism is inevitable. Nonetheless, the pathway to widespread acceptance and trust hinges on demonstrable proof of utility and value. It is with this understanding that we embrace our role in demystifying these technologies. Today's focus brings us to the core of our PondhouseAI—the "Retrieval Augmented Generators." This article endeavors to acquaint you with the fundamental principles, architectural nuances, and the inherent potential of RAG technology.
Required System Capabilities
In the traditional approach of training or fine-tuning a model to absorb specific knowledge, all relevant data is fed into, and retained by, the internals of a single model. While this may offer certain performance benefits, it comes with operational and financial challenges: complex data preparation, long training times, high training costs, and a lack of transparency. It also requires extensive training and testing by professionals to ensure that the model not only improves its performance but also behaves correctly and as expected.
In contrast to this monolithic paradigm, Retrieval Augmented Generators (RAG) deliberately avoid consolidating all knowledge within a single, expansive model, as doing so poses significant drawbacks for most companies and applications. Instead, the required capabilities of the overall system are decomposed into a well-defined process with distinct architectural components.
Consider the scenario from the perspective of a user: Imagine being a technical support engineer at a company that specializes in manufacturing machinery. Our objective is for our virtual assistant to deliver swift and precise resolutions to the issues our customers report. To streamline this interaction, the virtual assistant is designed to converse with us using natural language. For instance, when we inquire, "When does alarm 5334 occur for machine type XYZ?", it is imperative for the assistant to comprehend the query, recall relevant information previously provided, and formulate a suitable response.
Figure: Required system capabilities
From a bird's-eye view, this process demands a suite of competencies:
- Learning: The system must possess the capability to assimilate and process new information, integrating it with the knowledge already in place.
- Memorizing: Newly acquired information must be readily accessible, ensuring it can be "remembered" as necessary. (Indeed, one might argue that memorization is an aspect of learning, yet it is critical to underscore this function as it plays a pivotal role in the system.)
- Associating: The system needs to discern which segments of available knowledge are pertinent to the current inquiry and task.
- Communicating: To simplify interactions with human operators, the system must efficiently process the posed question or task and present the necessary information in a manner that adequately addresses the query.
Base architecture
Figure: Base architecture of RAG
Let's delve into a more technical perspective on equipping our system with these capabilities. The process we've referred to as "Learning" involves our data ingestion and preparation pathway, labeled A-D in our diagram. This involves processing all files that contain the knowledge we wish the system to absorb. Such files could range across various formats, including PDFs, Word documents, CSVs, Markdown files, exports from ticketing systems, and database connections, among others (A). These files, in their raw form, undergo an "information extraction" phase (B), where multiple techniques are employed to distill the necessary information. This vast amount of data is then segmented into much smaller, more manageable units referred to as "chunks" (C). Each chunk holds a fragment of information that the system can later retrieve when relevant to a query.
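To make the ingestion pathway a bit more concrete, here is a minimal sketch of the chunking step (C) in Python. It assumes that the information extraction step (B) has already produced plain-text files in a hypothetical extracted_docs folder; the chunk size and overlap values are purely illustrative.

```python
# Minimal chunking sketch: split already-extracted text into overlapping chunks.
# Chunk size and overlap are illustrative values, not recommendations.
from pathlib import Path


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split a text into overlapping, character-based chunks."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
    return chunks


# Chunk every already-extracted .txt file in a (hypothetical) folder.
documents = {}
for path in Path("extracted_docs").glob("*.txt"):
    documents[path.name] = chunk_text(path.read_text(encoding="utf-8"))
```

In practice, chunking is often done along sentence or paragraph boundaries rather than fixed character windows, so that each chunk remains a coherent piece of information.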
However, evaluating the relevance of each chunk's raw textual content to a given task can be cumbersome, making it challenging to ascertain whether a piece of information will contribute to answering a query. To address this, the chunks are converted into a semantic format that encapsulates the meaning of the information. In practice, this conversion involves transforming the text into a series of numerical vectors known as "embeddings", where texts of similar meanings have similar embeddings, and those with divergent meanings differ significantly. Specialized "embedding models" facilitate this transformation from raw text to a numerical semantic representation.
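As a small illustration of this step, the sketch below uses the open source sentence-transformers library with the all-MiniLM-L6-v2 model as one possible embedding model; the two example chunks are invented for illustration. Any embedding model can be used, as long as the very same model is later applied to the user's queries.

```python
# Embedding sketch: turn text chunks into numerical vectors ("embeddings").
# The model name is just one common open source choice; documents and queries
# must be embedded with the same model.
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Alarm 5334 on machine type XYZ indicates a blocked coolant filter.",  # invented example text
    "Machine type XYZ requires a filter inspection every 2000 operating hours.",  # invented example text
]

embeddings = embedding_model.encode(chunks)  # one vector per chunk
print(embeddings.shape)  # (2, 384) for this particular model
```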
These embeddings, along with the original raw texts, are then stored in a vector database (D). But what exactly is a vector database? It is a database designed to manage and perform operations on these large vectors, effectively acting as the system's "memory".
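To keep the running example self-contained, the following sketch stands in for a vector database with a toy in-memory store that keeps each embedding next to its raw chunk text. A production system would use a dedicated vector database (for example pgvector, Chroma, or Qdrant) instead; the class below only illustrates the idea behind step (D).

```python
# Toy stand-in for a vector database: stores embeddings plus raw chunk texts.
import numpy as np


class InMemoryVectorStore:
    def __init__(self) -> None:
        self.vectors: list[np.ndarray] = []
        self.texts: list[str] = []

    def add(self, embedding: np.ndarray, text: str) -> None:
        # Normalise the vector so that a dot product later equals cosine similarity.
        self.vectors.append(embedding / np.linalg.norm(embedding))
        self.texts.append(text)


# Store each chunk from the previous sketch together with its embedding (step D).
store = InMemoryVectorStore()
for chunk, vector in zip(chunks, embeddings):
    store.add(vector, chunk)
```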
When a user submits a request (1), it, too, is transformed into an embedding using the same model. This embedding serves as a reference for querying the vector database, searching for information that closely matches, or more precisely, has a similar embedding to, the user's query (2). Taking our earlier example, "When does alarm 5334 occur for machine type XYZ?", the system seeks out information related to "machine type XYZ" and "alarm 5334". One might wonder why we employ this complex process involving embedding models and a specialized vector database instead of simply searching for the exact terms 'machine type XYZ' and 'alarm 5334'. A purely keyword-based search could yield inadequate results, because it requires an exact match between the query's phrasing and the stored information. By working with semantic representations in our embeddings, the exact terminology becomes less critical. This semantic search approach is therefore exceptionally resilient to variations in phrasing, making it highly advantageous for this task. Semantic search does, however, have limitations with highly specific terminology and character sequences, such as alarm codes, since it focuses on similarity rather than exact matches. To counter this, a 'hybrid search' that combines semantic search with keyword matching is sometimes employed.
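Continuing the sketch, retrieval (steps 1 and 2) embeds the question with the same model and ranks the stored chunks by cosine similarity. A hybrid search would additionally compute a keyword score per chunk (for example with BM25) and combine the two rankings; that part is omitted here.

```python
# Retrieval sketch: embed the user's question and find the most similar chunks.
import numpy as np


def semantic_search(question: str, store, model, top_k: int = 3) -> list[str]:
    query_vec = model.encode([question])[0]
    query_vec = query_vec / np.linalg.norm(query_vec)
    # Dot products against normalised vectors are cosine similarities.
    scores = np.array([vec @ query_vec for vec in store.vectors])
    best = np.argsort(scores)[::-1][:top_k]
    return [store.texts[i] for i in best]


relevant_chunks = semantic_search(
    "When does alarm 5334 occur for machine type XYZ?",
    store, embedding_model,
)
```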
What then becomes of the information deemed relevant? We amalgamate the original textual content from these relevant chunks with the user's query, forwarding this compilation to our LLM. This enables the LLM to use the provided data as a reference resource for responding to the query, embodying the concept of "retrieval-augmentation" in its name.
In the final stage, the LLM leverages its full capabilities, understanding the context of the question, selecting the necessary information from the "look-up resources", and crafting an apt response.
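Put together, the augmentation and generation steps can look roughly like the sketch below, shown here with the OpenAI chat completions API as one possible LLM backend. The prompt wording, system message, and model name are illustrative assumptions, not part of the architecture itself.

```python
# Augmentation and generation sketch: combine the retrieved chunks with the
# user's question and let an LLM answer based on that reference material.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY to be set in the environment

question = "When does alarm 5334 occur for machine type XYZ?"
context = "\n\n".join(relevant_chunks)  # chunks retrieved in the previous sketch

prompt = (
    "Answer the question using only the reference material below.\n\n"
    f"Reference material:\n{context}\n\n"
    f"Question: {question}"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative; any capable chat model works
    messages=[
        {"role": "system", "content": "You are a technical support assistant."},
        {"role": "user", "content": prompt},
    ],
)
print(response.choices[0].message.content)
```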
Benefits of RAG
At first glance, this system may appear quite intricate, yet it is structured so that each component of the process is assigned a specific and limited role, offering numerous advantages. This segmentation allows for the independent enhancement of each component. Since the "knowledge" resides in a database rather than within the LLM itself, updating the system's knowledge base becomes straightforward and enables immediate updates. This accessibility to new information circumvents the need for costly and time-consuming training and testing cycles associated with LLMs, facilitating the use and integration of pre-trained models without altering their proven and validated performance. Moreover, this structure affords complete transparency at every stage, an attribute not attainable with a monolithic LLM.
Further Reading
Interested in how to train your very own Large Language Model?
We prepared a well-researched guide on how to use the latest advancements in open source technology to fine-tune your own LLM. This has many advantages, such as:
- Cost control
- Data privacy
- Excellent performance, adjusted specifically for your intended use