New Delhi, June 26 -- The majority of RAG prototypes are good. They have acceptable context information, the language model gives an acceptable response and the prototype works. At this point, we can think that we have an architecture-a front end, a back end, and some pipes. The prototype has been successful, and it is just engineering to production.

It is not. The problem is that once the system is required to support enterprises with different data sensitivities, loads, freshness requirements, and support (service level agreements, SLAs), the prototype assumptions are no longer valid. One knowledge base becomes many. A central index is a curse. You could take 2 seconds to respond to a query if you're presenting a system to a user, but you can't if you have to create a system for financial services that needs to answer hundreds of questions per hour. The problems to be solved when you have a single user and a single corpus are different from the problems to be solved for multiple users and multiple corpora.

The challenges to be solved in making RAG enterprise-ready are not the same as the challenges to be solved in the proof of concept. The following sections discuss the design choices that are needed after the proof of concept: how to safely multi-tenant the vector store, how to preserve the quality of retrieval as corpora change and become more diverse, how to plan for the cost and latency of invoking the LLM for many users, and how to build the observability platform to proactively detect quality changes before your customers notice them.

Tenant Isolation in the Vector Store

The vector database is where multi-tenancy comes in, and the first choice of how to separate tenants is important. The first decision is to isolate tenants by namespace: the index is shared, documents are tagged with tenant IDs, and filtering is used to search only the tenant's documents. This is easy to manage, but there's a problem here that becomes unacceptable in multi-tenancy. It's possible that a filter or query will match documents belonging to other tenants, and for data under regulation, this is usually unacceptable, although highly unlikely.

The urge is to begin to trend backwards: one collection or index per tenant, with no sharing. This approach guarantees minimum tenant interference and easy onboarding and offboarding of tenants. But at a cost. The cost of providing the service grows with the number of tenants, and index use is very inefficient with tenants of varying sizes. The enterprise tenant that shares the same space with several smaller tenants pays for the indexes of the pool and has accounting and pricing issues.

The most common approach to enterprise deployments is to use a mix of both worlds. Tenants with segments that need to be secure or performant have their own isolated indexes. The average tenant gets a partitioned index, but with a twist: the filtering of metadata, which is typically done by the application, is done at the time of retrieval of the entry. Doing that at the store, rather than relying on the call stack to pass the data, means that we aren't doing much damage with the next bug-and there will be a next bug.

Retrieval Quality at Scale

There's a downside to retrieval in a multi-tenant system. The best embedding model for the overall benchmark might not be the best embedding model for the legal documents of one user or the technical runbooks of another, and it's not always easy to see that-it can be justified with slightly wrong results that are blamed on the model. It's often not feasible to fine-tune an embedding model for each tenant, so the retrieval system has to adapt in other ways.

The best of these is hybrid retrieval, which uses dense vector embeddings and sparse vector embeddings, such as BM25. The rationale is simple. In enterprise knowledge bases, the precise words are more important than in search, and dense vector embeddings tend to de- emphasise rare but relevant words-words like those in contracts, incident reports and technical articles. Combining the two search results and then re-ranking them with a cross-encoder model before the language model gets the context adds a layer of quality without modifying the index.

Another factor that might not be tuned, but should be, is chunk size. In practice, chunk size varies according to the document. For small technical documents, small overlapping chunks are required, whereas for policy documents, longer chunks are required to capture the arguments that lend context to individual sentences. Chunking that can be set per-tenant document corpus, rather than across the entire system at design time, means the system works better as it matures and document corpora diverge.

Latency, Cost, and the LLM Call Budget

The process of obtaining the output from the LLM is the most time- and resource-intensive aspect of any RAG workflow, making it the most expensive part as well. Operating under a multi-tenant structure, where each user can send an infinite number of queries to the LLM, may incur unpredictable costs and result in long wait times for responses from the LLM. There are two methodologies that can be employed to limit costs and improve response times.

One method is semantic caching, which allows the storing of past query-response pairs for future use in order to serve users who submit similar queries (i.e., by embedding the semantic meaning of the user's query rather than using its precise string). If there are multiple users within an enterprise who submit a variety of different versions of the same business question, then the potential success of utilising semantic caching when users submit these queries could be in excess of 90%. It is imperative, however, to tune the threshold so that cached responses do not become stale (i.e., are no longer current) or irrelevant (i.e., do not relate to the user's query).

The second method is to impose limits on the number of requests that each user or tenant can submit to the LLM, such that all requests from all tenants are managed and allocated according to operational need (e.g., each person using the shared LLM will share equally in the capacity of the LLM; therefore, if one tenant uses a high percentage or number of requests, that would reduce the capacity (i.e., speed) at which the other tenants receive their responses). All request limits should be enforced prior to submitting a request to the LLM, so that during the data retrieval process, if the request limit for a tenant has been exceeded, there is a response (e.g., a portion of the entire response or a cached response) rather than simply generating an error.

Observability and Ground Truth

You cannot methodically upgrade a RAG system without having retrieval observability. Teams working on RAG must be able to view per-query information about the retrieved chunks, retrieval scores, as well as whether their answers have been grounded in retrieved context or have relied on the parametric knowledge of the model. Claims made in a response are grounded when they can be traced back to a specific tenant document that was recovered at the time of the query. Drift occurs when the model uses pre-trained knowledge to supply fluency but not accuracy in the answer and does not use the same grounding as the retrieved context, instead relying on parametric knowledge.

In a multi-tenant architecture, whether a response is grounded or has drifted makes a big difference in terms of quality. Grounded responses comply with tenant data boundaries by delivering consistency within the response; drifted responses are based on pre-training and as such, can be generic and may contain erroneous or outdated information, or include concepts or terminology that do not align with the tenant's policies. This is especially pertinent for regulated tenants that may have conflicting or different base sources of knowledge.

Logging this information on a per-tenant basis-for example, retrieval scores, selected chunks, and whether the response is grounded-assists in recognising tenants or corpora that have higher rates of drift. As retrieval failures are generally not uniform, having this information is critical for diagnosing failures that could result from embeddings, chunking or index types.

Having an evaluation harness that makes use of curated QA pairs by tenant helps ensure that changes made to the pipeline do not produce unintended regressions and allows for ongoing monitoring of RAG quality rather than a one-time validation.

From Prototype to Production

In reality, the gap between a fully functioning RAG prototype and a true enterprise-grade RAG system is much larger than what you can see from a demo-room view. To close this gap, you must make intentional decisions at every level of the system, such as how tenant data will be separated in the vector store; how to maintain retrieval quality across multiple corpora; how to control LLM calling costs while still providing users with an adequate experience; and how to ensure no regressions occur before reaching the user. All of these issues are solvable, but none will resolve themselves. The best teams that build enterprise RAG systems with reliability and safety in mind are those that have considered production readiness as a core aspect of engineering from the beginning (as opposed to waiting until after they are done with their "clean-up" phase).

NOTE: No VCCircle Journalist was involved in the creation of this content.

Published by HT Digital Content Services with permission from TechCircle.