Beyond the Hype: Making RAG Work for Your Business

Dr M Maruf Hossain, PhD, GAICD
Mar 2

Since the advent of Large Language Models (LLMs), many initiatives have sought to enhance user experiences through Natural Language Processing (NLP), and a vast array of articles has encouraged organisations to embrace generative artificial intelligence (AI). These pieces, however, have predominantly focused on the impressive ability of LLMs to understand user inquiries and generate responses, while often criticising LLMs for overlooking factual accuracy when generating those responses.

Originally published at LinkedIn Pulse on 6 February 2024.

To ensure that LLMs continue to provide accurate information and adapt to rapidly evolving content, a technique known as Retrieval Augmented Generation (RAG) has emerged. However, the discourse surrounding RAG has largely been technical, with scant attention paid to the solution’s accuracy as data scales beyond the scope of a Proof of Concept (POC).

Organisations are now facing various challenges in implementing such advanced AI technologies:

Underestimating the retrieval aspect. There’s a common misconception that AI will intuitively understand our requirements. However, traditional challenges in information retrieval remain, and relying exclusively on vector search may not always yield optimal results due to the limitations of similarity-based ranking. The distinction between “relevance” and “similarity” is crucial for effective information retrieval with RAG. While similarity concerns matching words, relevance emphasises the linkage of ideas. The importance of context cannot be overstated. The model needs to understand the context in which it is functioning. This understanding includes the user’s intent, the history of the conversation, and the broader context in which the dialogue is taking place.
Single prompt for all cases. Adapting a single prompt to address a specific issue might unintentionally compromise its performance in other scenarios. For example, a prompt tuned to produce a sequential, step-by-step procedure for accomplishing a task may be significantly constrained when the response needs to be a binary (yes/no) answer with justifications.
Chunking strategy. In the field of NLP, “chunking” refers to the process of dividing text into small, meaningful units. RAG systems are particularly adept at extracting context from smaller text chunks rather than from larger documents, thereby enhancing both speed and accuracy. This has led to the success of numerous POCs. However, these systems often fall short in large-scale production scenarios. Consequently, the critical question arises: What constitutes an ideal chunk size? Determining the optimal size for text chunks is challenging. If the chunks are excessively large, the model may struggle to pinpoint the relevant context. Conversely, if they are too small, they may lack the necessary information.
Speed and size of LLM. The time it takes for the model to respond can pose a significant challenge. Larger, more complex models generally take longer to respond, and the associated costs of cloud computing and infrastructure escalate correspondingly. In the context of RAG, the model’s primary function is to comprehend user queries and enhance the final response, rather than to draw on the information stored within the hidden layers of the neural network that constitutes the model.
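
The chunk-size trade-off described above can be made concrete with a small sketch. The function below is an illustration only, not a recommended production chunker: it splits on whitespace-separated words (real systems usually count model tokens), and the name chunk_words and its parameters are assumptions introduced for this example. The optional overlap parameter anticipates the overlapping-chunks strategy discussed later in this article.

```python
def chunk_words(text: str, size: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most `size` words.

    Consecutive chunks share `overlap` words. Splitting on whitespace is a
    simplifying assumption; production systems typically count tokens.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    # Larger chunks mean fewer, broader pieces; smaller chunks mean more,
    # narrower pieces that may lack surrounding context.
    for size in (2, 4, 8):
        print(size, len(chunk_words(sample, size)))
```

Running the same document through several sizes like this gives a quick feel for how many chunks the retriever will have to rank, and how much context each one carries.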

To overcome these challenges, organisations can start with the following strategies:

Experiment with different information retrieval techniques. Vector search is just one approach to extracting information from embeddings. When text chunks are very small, a vector search may simply become another form of keyword search rather than a genuine semantic search. Knowledge graphs offer a viable alternative to vector search. Leveraging more sophisticated tools is essential for accurately identifying and retrieving relevant content, because the primary objective remains to locate the correct source of the desired information. In some cases, the embedding-based approach can be avoided entirely. For instance, Microsoft employs LLMs solely to enhance responses across multiple web pages in its Bing AI Search. They continue to utilise their conventional search engine to identify relevant pages and subsequently use these pages as context when generating responses.
Prompt engineering. Carefully crafting and refining prompts to establish base instructions and handle exceptions can significantly enhance performance. Rather than depending on a single prompt, a variety of prompts can be employed for diverse question types. A simple NLP-based classification model can be used to determine the type of user query.
Optimal chunking strategy. Striking a balance in determining the appropriate chunk size is crucial for capturing essential information without sacrificing speed. The implementation of overlapping chunks can help maintain this equilibrium. Notably, the Sliding Window technique can be beneficial, although it may incur substantial costs. Other alternatives, such as adaptive chunking, dynamic context window, hierarchical retrieval, or post-retrieval merging, can also prove useful.
Automated testing. Implementing automated testing, similar to unit tests, can safeguard against alterations to the prompt or chunk size disrupting previously successful cases. In addition, automated testing aids data scientists conducting experiments with varying prompts, chunk sizes, or both, enabling them to swiftly monitor their experiment outcomes.
User training and communication. A solution is only as effective as its acceptance by the end-users who will engage with it. It is crucial to provide clear explanations of the system’s objectives, advantages, and alignment with the project’s goals to ensure successful implementation.
Document versioning. Introducing document versioning, or better still, connecting the RAG system to a document and records management system, can mitigate some of the challenges associated with RAG implementation. This strategy allows changes to documents to be tracked and managed over time, which helps resolve issues of inconsistency, reproducibility, and accountability. Document versioning can ensure that the RAG system retrieves the most up-to-date and relevant version of a document, avoiding outdated or conflicting information. It can enable the RAG system to reproduce the same results for the same query, if and when necessary, regardless of any changes made to the documents. It can also provide a record of who made which changes to the documents and when, enabling better quality control and auditing.
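
As a rough illustration of the prompt engineering and automated testing strategies above, the sketch below routes each question to a prompt by query type and keeps a small regression suite that can be rerun whenever a prompt or rule changes. Everything in it is an assumption made for the example: the keyword rules stand in for the simple NLP-based classification model mentioned above, and the prompt wording, the names classify_query, build_prompt and run_regression, and the test questions are all hypothetical.

```python
import re

# Hypothetical prompt templates, one per question type (wording is illustrative).
PROMPTS = {
    "procedure": (
        "Using only the context, answer as a numbered list of steps.\n\n"
        "Context:\n{context}\n\nQuestion: {question}"
    ),
    "yes_no": (
        "Using only the context, answer 'Yes' or 'No' first, then justify.\n\n"
        "Context:\n{context}\n\nQuestion: {question}"
    ),
    "general": (
        "Using only the context, answer concisely.\n\n"
        "Context:\n{context}\n\nQuestion: {question}"
    ),
}

# Keyword rules standing in for a trained query classifier (an assumption).
PROCEDURE = re.compile(r"\b(how (do|can|to)|steps?|procedure|process for)\b", re.IGNORECASE)
YES_NO = re.compile(
    r"^(is|are|was|were|do|does|did|can|could|should|will|would|has|have)\b",
    re.IGNORECASE,
)


def classify_query(question: str) -> str:
    """Return 'procedure', 'yes_no', or 'general' for a user question."""
    q = question.strip()
    if PROCEDURE.search(q):
        return "procedure"
    if YES_NO.match(q):
        return "yes_no"
    return "general"


def build_prompt(question: str, context: str) -> str:
    """Fill the template that matches the question type."""
    template = PROMPTS[classify_query(question)]
    return template.format(context=context, question=question)


# Previously successful cases, kept so later changes cannot silently break them.
REGRESSION_CASES = [
    ("How do I reset my password?", "procedure"),
    ("Is remote work allowed on Fridays?", "yes_no"),
    ("What is the annual leave policy?", "general"),
]


def run_regression(cases=REGRESSION_CASES):
    """Return (question, expected, actual) for every case that now fails."""
    failures = []
    for question, expected in cases:
        actual = classify_query(question)
        if actual != expected:
            failures.append((question, expected, actual))
    return failures


if __name__ == "__main__":
    print(run_regression())  # an empty list means no known case regressed
```

The same pattern extends to end-to-end checks: each experiment with a new prompt or chunk size can be scored against the stored cases before it replaces the current configuration.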

These are just a few of the challenges, along with strategies that may help address them. The specifics will depend on the organisation’s unique needs and context. Implementing RAG is a complex process that requires careful planning and execution. However, with the right approach and resources, it can significantly enhance the quality and relevance of the generated content.