RAG Architecture Best Practices for UK Enterprise Deployments
Retrieval-augmented generation has moved from a research concept to a core enterprise AI pattern in a remarkably short time. The fundamental idea – grounding a large language model’s responses in an organisation’s own data by retrieving relevant documents at query time – addresses the most persistent objection to deploying generative AI in business: hallucinated, inaccurate answers.
We previously published an introduction to RAG on this site, explaining what retrieval-augmented generation is and why it matters. This article goes deeper. It is a practical implementation guide for UK enterprises that are moving from proof-of-concept to production RAG deployments, covering the architectural decisions, engineering trade-offs, and regulatory considerations that determine whether a RAG system delivers genuine business value or becomes an expensive experiment.
At McKenna Consultants, we provide RAG implementation consultancy for UK organisations building custom AI chatbots and enterprise knowledge base systems. The guidance in this article reflects the patterns and practices we apply in real deployments.
RAG Architecture Overview
A production RAG system comprises several components working in concert:
- Document ingestion pipeline – Extracts content from source documents (PDFs, Word files, web pages, databases, SharePoint libraries), processes them into chunks, and generates vector embeddings.
- Vector store – Stores the document chunks and their embeddings, enabling similarity search at query time.
- Retrieval engine – Takes a user query, converts it to an embedding, searches the vector store for relevant chunks, and optionally applies re-ranking and filtering.
- Generation engine – Sends the retrieved context along with the user’s query to a large language model, which generates a grounded response.
- Orchestration layer – Manages the end-to-end flow, including query preprocessing, retrieval strategy selection, prompt construction, and response post-processing.
Each component involves architectural decisions that materially affect the system’s accuracy, performance, and cost. We will examine the most consequential decisions in turn.
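To make the flow concrete, the sketch below wires these components together in Python. The embed, search and complete callables are placeholders for whichever embedding model, vector store and LLM a given deployment uses; none of them refers to a specific product API.

```python
# Minimal sketch of the orchestration flow. embed(), search() and complete()
# are placeholders injected by the deployment, not real library calls.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str

def answer(query: str, embed, search, complete, top_k: int = 5) -> str:
    # 1. Convert the user query into an embedding.
    query_vector = embed(query)

    # 2. Retrieve the most similar chunks from the vector store.
    chunks: list[Chunk] = search(query_vector, top_k=top_k)

    # 3. Build a grounded prompt pairing the question with its evidence.
    context = "\n\n".join(f"[{c.source}]\n{c.text}" for c in chunks)
    prompt = (
        "Answer the question using only the context below. "
        "Cite the source of each claim.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

    # 4. Generate the response and return it for post-processing.
    return complete(prompt)
```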
Vector Database Selection
The vector store is the foundation of a RAG system’s retrieval capability. The choice of vector database affects search performance, scalability, operational complexity, and cost.
Dedicated Vector Databases
Purpose-built vector databases such as Pinecone, Weaviate, Qdrant, and Milvus are designed specifically for high-performance similarity search over large embedding collections. They offer:
- Optimised indexing algorithms (HNSW, IVF) for fast approximate nearest-neighbour search
- Metadata filtering alongside vector search
- Managed hosting options that reduce operational burden
- Purpose-built APIs for embedding operations
For enterprise deployments handling millions of document chunks with strict latency requirements, a dedicated vector database is typically the right choice.
Vector Extensions for Existing Databases
PostgreSQL with pgvector, Azure Cosmos DB with vector search, and Azure AI Search all provide vector search capability within platforms that organisations may already operate. The advantage is reduced operational complexity – one fewer database to manage, monitor, and secure.
The trade-off is that these extensions may not match the search performance or feature set of dedicated vector databases at scale. For RAG deployments with moderate document volumes (tens of thousands to low hundreds of thousands of chunks), these integrated options are often entirely sufficient and reduce infrastructure sprawl.
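As an illustration of the integrated route, a similarity search against PostgreSQL with pgvector can be as simple as the sketch below. The chunks table, its content and embedding columns, and the connection string are assumptions about how the index was built.

```python
# Sketch of a similarity search with PostgreSQL + pgvector.
# Assumes a `chunks` table with `content text` and `embedding vector(...)` columns,
# and that query_embedding is a numpy array produced by the embedding model.
import psycopg
from pgvector.psycopg import register_vector

def top_chunks(query_embedding, k: int = 5):
    with psycopg.connect("dbname=rag user=rag_app") as conn:
        register_vector(conn)  # register the pgvector type with psycopg
        rows = conn.execute(
            # <=> is pgvector's cosine distance operator; smaller means more similar
            "SELECT content, embedding <=> %s AS distance "
            "FROM chunks ORDER BY distance LIMIT %s",
            (query_embedding, k),
        ).fetchall()
    return rows
```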
Selection Criteria for UK Enterprises
For UK organisations, the vector database selection should account for:
- Data residency. Where is the data stored? Can the provider guarantee UK or EU data residency? For systems processing personal data, storing or transferring that data outside the UK or EU engages the UK GDPR’s international transfer rules, so residency guarantees matter.
- Existing cloud platform. If the organisation is an Azure customer, Azure AI Search or Cosmos DB may be preferable to introducing a new cloud provider relationship.
- Operational capability. Does the team have the skills to operate a new database platform, or is a managed service essential?
- Cost model. Vector databases have varied pricing models – some charge per vector stored, others per query, others per compute hour. Model the cost at your expected scale before committing.
Chunking Strategies
How documents are split into chunks has a profound effect on retrieval quality. Chunks that are too large dilute the relevant information with irrelevant context. Chunks that are too small lose the surrounding context that makes the information meaningful.
Fixed-Size Chunking
The simplest approach: split documents into chunks of a fixed token count (e.g., 512 tokens) with a defined overlap between consecutive chunks (e.g., 50 tokens). This is easy to implement and provides predictable chunk sizes.
The limitation is that fixed-size chunking ignores document structure. A chunk boundary may fall in the middle of a paragraph, separating a claim from its supporting evidence, or split a table across two chunks.
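A minimal sketch of fixed-size chunking with overlap is shown below. It uses a whitespace tokeniser for brevity; a production pipeline would count tokens with the embedding model’s own tokeniser.

```python
# Fixed-size chunking with overlap, sketched with a whitespace tokeniser.
def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    tokens = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start : start + chunk_size]
        if window:
            chunks.append(" ".join(window))
    return chunks
```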
Semantic Chunking
Semantic chunking uses the document’s structure to define chunk boundaries. Section headings, paragraph breaks, list boundaries, and table structures are used as natural split points. This preserves the logical coherence of each chunk.
Implementation requires parsing document structure, which varies by source format. Well-structured HTML and Markdown are straightforward. PDFs are notoriously challenging – especially scanned documents or documents with complex layouts.
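For well-structured Markdown, a semantic chunker can be as simple as splitting on headings, as in the sketch below; PDFs and scanned documents need a dedicated parsing step before anything like this applies.

```python
# Sketch of semantic chunking for Markdown: each chunk is a heading plus
# the prose beneath it.
import re

def markdown_chunks(text: str) -> list[str]:
    # Split immediately before any line that starts with one to six '#' characters
    sections = re.split(r"\n(?=#{1,6}\s)", text)
    return [s.strip() for s in sections if s.strip()]
```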
Hierarchical Chunking
Hierarchical chunking creates multiple representations of the same content at different granularities. A section-level chunk provides broad context; paragraph-level chunks within that section provide detail. At retrieval time, the system can match on the detailed chunk and then include the parent section chunk in the context window to provide surrounding context.
This approach is particularly effective for long-form technical documents where both the specific detail and the broader context are needed to generate an accurate response.
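The sketch below illustrates one way to build parent-child chunk pairs: each section becomes a parent chunk, and each paragraph within it becomes a child chunk carrying its parent’s identifier. The data structures are illustrative rather than tied to any particular vector store.

```python
# Sketch of hierarchical (parent/child) chunking. Child chunks are embedded
# and searched; the parent section is fetched at retrieval time to supply
# surrounding context.
from dataclasses import dataclass
import uuid

@dataclass
class ParentChunk:
    id: str
    text: str

@dataclass
class ChildChunk:
    id: str
    parent_id: str
    text: str

def build_hierarchy(sections: list[str]) -> tuple[list[ParentChunk], list[ChildChunk]]:
    parents, children = [], []
    for section in sections:
        parent = ParentChunk(id=str(uuid.uuid4()), text=section)
        parents.append(parent)
        for paragraph in section.split("\n\n"):
            if paragraph.strip():
                children.append(
                    ChildChunk(id=str(uuid.uuid4()), parent_id=parent.id, text=paragraph.strip())
                )
    return parents, children
```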
Practical Recommendations
In our RAG implementation consultancy work, we have found that the following approach works well for most enterprise document collections:
- Start with semantic chunking that respects document structure.
- Target a chunk size of 300-500 tokens, which balances specificity with context.
- Include 50-100 tokens of overlap to maintain continuity across chunk boundaries.
- For documents with hierarchical structure (technical manuals, policy documents), implement parent-child chunk relationships.
- Always include document-level metadata (title, source, date, author, classification) with each chunk for filtering and attribution; an illustrative chunk record follows this list.
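To illustrate the final point, a chunk record might carry its metadata as follows; the field values are invented for illustration.

```python
# Illustrative chunk record: the payload stored alongside each embedding so
# that results can be filtered (e.g. by classification) and attributed.
chunk_record = {
    "text": "Employees may carry over up to five days of annual leave...",
    "metadata": {
        "title": "Annual Leave Policy",
        "source": "policies/annual-leave-v3.docx",
        "date": "2025-01-15",
        "author": "HR Policy Team",
        "classification": "internal",
    },
}
```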
Embedding Model Comparison
The embedding model converts text into vector representations that enable similarity search. The choice of embedding model affects retrieval accuracy, latency, and cost.
Key Dimensions
- Dimensionality. Models produce embeddings of different sizes – 384, 768, 1024, 1536, or 3072 dimensions are common. Higher dimensionality captures more nuance but increases storage and search costs.
- Context window. Some models handle only 512 tokens; others support 8192 or more. For longer chunks, a model with a larger context window is essential.
- Domain adaptation. General-purpose embedding models perform well across topics, but domain-specific models (trained on legal, medical, or technical text) can improve retrieval accuracy for specialised corpora.
- Multilingual capability. For UK organisations with international operations, multilingual embedding models enable cross-language retrieval.
Models to Consider in 2025
- OpenAI text-embedding-3-large (3072 dimensions) – Strong general-purpose performance, available via API. Because document content and queries are sent to OpenAI’s API, UK GDPR data processing and international transfer considerations apply.
- Cohere embed-v3 (1024 dimensions) – Competitive retrieval accuracy with support for multiple languages and a compression feature that reduces storage costs.
- Open-source models (e.g., BGE, GTE, E5) – Can be self-hosted, keeping all data within the organisation’s infrastructure. Essential for deployments where data must not leave the corporate network.
For UK enterprise deployments handling sensitive data, self-hosted open-source embedding models eliminate the data protection concerns associated with sending content to external APIs. The trade-off is increased infrastructure and operational complexity.
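As a sketch of the self-hosted route, the snippet below embeds document chunks with the sentence-transformers library. BAAI/bge-small-en-v1.5 is one example of an open-source BGE model; substitute whichever model your own evaluation favours.

```python
# Sketch of self-hosted embedding with sentence-transformers.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

chunks = [
    "Employees may carry over up to five days of annual leave.",
    "Purchase orders above £25,000 require director approval.",
]

# normalize_embeddings=True gives unit-length vectors, so cosine similarity
# reduces to a dot product at query time.
embeddings = model.encode(chunks, normalize_embeddings=True)
print(embeddings.shape)  # (2, 384) for this particular model
```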
Hybrid Search Approaches
Pure vector search relies on semantic similarity, which is powerful but has limitations. It can miss documents that use different terminology to express the same concept, or it can match documents that are semantically similar but factually irrelevant.
Hybrid search combines vector similarity search with traditional keyword search (BM25 or similar) to overcome these limitations. The two search methods are complementary:
- Vector search excels at finding conceptually relevant documents even when the exact terminology differs.
- Keyword search excels at finding documents that contain specific terms, codes, reference numbers, or proper nouns that are critical for accuracy.
Reciprocal Rank Fusion
The most common approach to combining vector and keyword search results is reciprocal rank fusion (RRF). Each search method produces a ranked list of results. RRF assigns a score to each document based on its rank in each list and combines the scores to produce a final ranking.
The formula is straightforward:
RRF_score(d) = sum(1 / (k + rank_i(d)))
Where k is a constant (typically 60) and rank_i(d) is the rank of document d in the ith result list.
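In code, RRF reduces to a few lines; the sketch below fuses any number of ranked lists of document identifiers.

```python
# Reciprocal rank fusion over any number of ranked result lists.
# Each result list is a list of document IDs, best match first.
from collections import defaultdict

def rrf(result_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a vector-search ranking with a BM25 keyword ranking
fused = rrf([["doc-a", "doc-b", "doc-c"], ["doc-b", "doc-d", "doc-a"]])
```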
Re-Ranking
After initial retrieval (whether vector, keyword, or hybrid), a re-ranking model can re-order the results based on their relevance to the specific query. Cross-encoder models such as Cohere Rerank or open-source models like BGE Reranker assess the relevance of each query-document pair more accurately than the initial retrieval, at the cost of additional latency.
For enterprise deployments where accuracy is paramount – legal research, regulatory compliance, medical information – re-ranking is a valuable addition. For lower-stakes applications such as internal FAQ systems, the initial retrieval may be sufficient.
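A minimal sketch of cross-encoder re-ranking with the sentence-transformers library is shown below; the model name is one commonly used open-source reranker, not a recommendation specific to any deployment.

```python
# Sketch of cross-encoder re-ranking: score each (query, passage) pair and
# keep the highest-scoring passages.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank(query: str, passages: list[str], top_n: int = 5) -> list[str]:
    scores = reranker.predict([(query, p) for p in passages])
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:top_n]]
```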
Measuring Accuracy
A RAG system that cannot demonstrate its accuracy is not production-ready. Establishing robust evaluation metrics is essential for building stakeholder confidence and for guiding iterative improvement.
Retrieval Metrics
- Recall@k – Of the relevant documents in the corpus, what proportion appears in the top-k retrieved results? A recall@10 of 0.8 means that 80% of relevant documents are retrieved in the top 10 results.
- Precision@k – Of the top-k retrieved results, what proportion is actually relevant? High precision means less irrelevant noise in the context window.
- Mean Reciprocal Rank (MRR) – How highly is the first relevant document ranked? An MRR close to 1.0 means relevant documents consistently appear at the top of results.
Generation Metrics
- Faithfulness – Does the generated answer accurately reflect the content of the retrieved documents? This measures hallucination risk.
- Answer relevance – Does the generated answer actually address the user’s question?
- Groundedness – Can every claim in the generated answer be attributed to a specific retrieved document?
Building an Evaluation Dataset
Create a dataset of representative questions paired with known correct answers and the documents that contain those answers. This dataset becomes the benchmark against which you measure retrieval and generation quality.
For enterprise deployments, this evaluation dataset should be built in collaboration with subject matter experts who can validate both the questions and the expected answers. It should cover the range of question types the system will encounter: factual lookups, procedural guidance, comparative questions, and edge cases.
Automated evaluation frameworks such as RAGAS and DeepEval can compute these metrics at scale, enabling continuous monitoring of system quality as the document corpus evolves.
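Alongside such frameworks, the retrieval metrics defined above can be computed directly from a labelled evaluation set. A minimal sketch:

```python
# recall@k, precision@k and MRR over a labelled evaluation set.
# `retrieved` is the ranked list of document IDs the system returned;
# `relevant` is the set of IDs judged relevant by subject matter experts.
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    hits = len(set(retrieved[:k]) & relevant)
    return hits / k

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```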
Data Governance and UK Regulatory Considerations
For UK enterprises, RAG deployments must navigate a regulatory landscape that includes GDPR, the UK Data Protection Act 2018, and the evolving framework around AI governance.
GDPR Implications for RAG Systems
A RAG system that indexes documents containing personal data is processing that data within the meaning of GDPR. This has several implications:
- Lawful basis. The organisation must have a lawful basis for processing personal data through the RAG system. For internal employee-facing systems, legitimate interest may apply. For customer-facing systems, the analysis depends on the data being processed and the purpose.
- Data minimisation. Only index documents that are necessary for the system’s purpose. A RAG system that indiscriminately indexes all corporate documents is likely to process personal data beyond what is necessary.
- Right of erasure. If an individual exercises their right to have their personal data deleted, the organisation must be able to remove that data from the RAG system’s vector store and document index – not just from the source documents. A deletion sketch follows this list.
- Data protection impact assessment. For RAG systems processing personal data at scale or for novel purposes, a DPIA is likely required under Article 35 of the UK GDPR.
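As one illustration of the erasure point above, the sketch below deletes every chunk derived from a given source document using Qdrant’s delete-by-filter API; the collection name and the document_id payload key are assumptions about how the ingestion pipeline labelled its chunks.

```python
# Sketch of right-of-erasure support with Qdrant: delete every chunk whose
# payload records the affected source document.
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

def erase_document(document_id: str) -> None:
    client.delete(
        collection_name="enterprise_docs",   # assumed collection name
        points_selector=models.FilterSelector(
            filter=models.Filter(
                must=[
                    models.FieldCondition(
                        key="document_id",   # assumed payload key set at ingestion
                        match=models.MatchValue(value=document_id),
                    )
                ]
            )
        ),
    )
```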
The ICO’s Position on AI and Personal Data
The Information Commissioner’s Office has published guidance on AI and data protection that is directly relevant to RAG deployments. The ICO emphasises that organisations must:
- Be transparent about how AI systems use personal data
- Ensure that automated decisions affecting individuals include appropriate human oversight
- Maintain records of processing activities that include AI system details
- Conduct regular reviews of AI system accuracy and fairness
For RAG systems that provide information to employees or customers, particularly in regulated sectors such as financial services or healthcare, these requirements are not optional enhancements – they are legal obligations.
The EU AI Act and UK Implications
Although the UK is not directly subject to the EU AI Act, many UK enterprises operate in EU markets and must comply with its requirements. The Act classifies AI systems by risk level and imposes obligations accordingly. A RAG system used for HR decisions, credit assessments, or medical information could fall into a higher risk category.
Even for purely UK-domestic deployments, the UK government’s approach to AI regulation – currently principles-based rather than prescriptive – is expected to evolve. Building RAG systems with strong governance foundations now positions organisations to adapt to future regulatory requirements.
The intersection of enterprise AI governance, the EU AI Act, and UK data protection law is an area where technical architecture decisions have direct legal consequences. Access controls, audit logging, data lineage tracking, and the ability to explain how a particular response was generated are not merely best practices – they are increasingly legal requirements.
AI Agents and RAG: Extending Beyond Q&A
While the most common RAG use case is a custom AI chatbot querying an enterprise knowledge base, the pattern extends naturally to AI agents that take actions based on retrieved information.
An AI agent for business process automation might retrieve relevant policy documents, extract the applicable rules, and then execute a workflow step based on those rules – approving a request, routing a ticket, or generating a report. The RAG component ensures that the agent’s actions are grounded in current organisational policy rather than potentially outdated training data.
This agent pattern is gaining traction in areas such as:
- IT service management – Agents that retrieve resolution procedures from a knowledge base and guide technicians through troubleshooting steps.
- Procurement – Agents that retrieve supplier contracts, check terms and pricing, and pre-populate purchase orders.
- Compliance monitoring – Agents that retrieve regulatory requirements and assess whether proposed activities or documents comply.
For UK enterprises exploring AI agents for business process automation, RAG provides the grounding mechanism that makes agent actions trustworthy and auditable.
Implementation Roadmap
Based on our experience delivering RAG implementation consultancy for UK organisations, we recommend the following phased approach:
Phase 1: Foundation (Weeks 1-4)
- Define the use case and success criteria
- Identify and catalogue source documents
- Select vector database and embedding model
- Build the document ingestion pipeline
- Create an initial evaluation dataset with subject matter experts
Phase 2: Core RAG (Weeks 5-8)
- Implement chunking strategy and index the document corpus
- Build the retrieval engine with hybrid search
- Integrate the LLM for response generation
- Implement basic prompt engineering for grounded responses
- Measure baseline accuracy against the evaluation dataset
Phase 3: Optimisation (Weeks 9-12)
- Implement re-ranking to improve retrieval precision
- Tune chunking parameters based on evaluation results
- Add metadata filtering for access control and source scoping
- Implement citation and source attribution in responses
- Conduct user acceptance testing with target user group
Phase 4: Production Hardening (Weeks 13-16)
- Implement monitoring and observability (query logging, latency tracking, accuracy monitoring)
- Build operational dashboards
- Implement access controls aligned with existing identity infrastructure
- Complete data protection impact assessment
- Document data lineage and processing records for GDPR compliance
- Deploy to production with phased rollout
Conclusion
Building a production-grade RAG system for UK enterprise deployment requires careful attention to architectural decisions that compound in their effects: chunking strategy, embedding model selection, hybrid search configuration, and accuracy measurement. Each decision affects retrieval quality, which in turn affects the accuracy and trustworthiness of generated responses.
Equally important for UK organisations are the regulatory and governance dimensions. GDPR compliance, ICO guidance on AI, and the evolving EU AI Act framework all impose requirements on how RAG systems process, store, and provide access to data. These requirements must be designed into the architecture from the outset, not bolted on as an afterthought.
The organisations achieving the strongest results from RAG are those that treat it as a proper engineering discipline – with rigorous evaluation, continuous monitoring, and iterative improvement – rather than a one-off project. With the right architecture, governance, and measurement framework, a RAG system becomes a strategic asset: a reliable, grounded interface between your organisation’s accumulated knowledge and the people who need to access it.
McKenna Consultants provides RAG implementation consultancy for UK enterprises, from initial architecture design through to production deployment and ongoing optimisation. If you are planning a RAG deployment and want to ensure it is built on solid technical and regulatory foundations, we would welcome the opportunity to discuss your requirements.