RAG Architecture for Business Knowledge: Build AI That Knows Your Company

RAG Architecture for Business Knowledge: Build AI That Knows Your Company

Victor Valentine Romo ·

RAG Architecture for Business Knowledge: Build AI That Knows Your Company

Quick Summary

  • What this covers: Practical guidance for building and scaling your online presence.
  • Who it's for: Business operators, consultants, and professionals using AI + search.
  • Key takeaway: Read the first section for the core framework, then apply what fits your situation.

Retrieval-Augmented Generation (RAG) connects AI models to your internal knowledge base—documents, SOPs, emails, CRM data—so they answer questions using your proprietary information, not generic training data. Standard AI tools like ChatGPT or Claude can't access your internal systems. RAG fixes this by retrieving relevant context from your knowledge base before generating answers. This transforms AI from generic assistant to company-specific intelligence layer.

Why Generic AI Models Can't Answer Company-Specific Questions

Public AI models train on internet data—Wikipedia, articles, books, code repositories. They don't know your company's processes, client history, pricing strategies, or internal decisions. Ask ChatGPT "What's our client onboarding SOP?" and it hallucinates an answer or says "I don't know."

RAG solves this by implementing retrieval step before generation. When you ask "What's our client onboarding SOP?" the system:

  1. Retrieves relevant documents from your knowledge base (client onboarding SOP, related templates, past examples).
  2. Augments the AI prompt with retrieved context.
  3. Generates an answer grounded in your actual documentation.

This architecture unlocks AI for internal use cases: instant SOP lookup, onboarding Q&A bots, client history summaries, pricing guidance, troubleshooting documentation, and research assistance. Instead of searching through Notion, Google Drive, or Confluence manually, you ask natural language questions and get instant, sourced answers.

RAG also prevents hallucinations. Without context, AI models fill gaps with plausible-sounding fiction. With RAG, answers cite sources: "According to the client onboarding SOP (last updated 2026-01-15), step 3 is..."

Finally, RAG enables continuous learning. As your knowledge base grows—new SOPs, updated processes, recent projects—RAG systems automatically incorporate them. No retraining needed. Update a doc, and future queries reflect that update.

RAG System Components: Vector Databases, Embeddings, and LLMs

RAG systems have three core components: document storage (vector database), semantic search (embeddings), and answer generation (LLM).

Component 1: Vector Database Stores documents as embeddings (numerical representations of text). When you ask a question, the system searches embeddings for semantically similar documents, not just keyword matches.

Popular vector databases:

  • Pinecone: Managed service, easy setup, scales automatically. $70-$140/month for SMB use cases.
  • Weaviate: Open-source, self-hosted or cloud. Free self-hosted, $25+/month cloud.
  • Chroma: Lightweight, embeds into Python apps. Free, open-source.
  • Qdrant: Fast, supports filtering metadata. Free self-hosted, $25+/month cloud.

Start with Chroma or Pinecone. Chroma for prototypes or small teams (<5 users). Pinecone for production or teams >10 users.

Component 2: Embedding Model Converts text into vectors (arrays of numbers) that capture semantic meaning. "Client onboarding process" and "How to onboard new clients" have similar embeddings even though they use different words.

Popular embedding models:

  • OpenAI text-embedding-3-small: $0.02 per 1M tokens. Fast, high quality.
  • Cohere Embed: $0.10 per 1M tokens. Supports multi-language better.
  • Open-source models (sentence-transformers): Free but require self-hosting (compute costs).

Start with OpenAI text-embedding-3-small. Cheap, reliable, easy integration.

Component 3: LLM for Generation Generates answers using retrieved context. Supports GPT-4, Claude Opus, or Gemini.

  • GPT-4: $10-$30 per 1M tokens (varies by model). Strong general capabilities.
  • Claude Opus 4.6: $15-$75 per 1M tokens. Better for long-context and reasoning.
  • Claude Sonnet: $3-$15 per 1M tokens. Cheaper, faster, good for most queries.

Start with Claude Sonnet for cost efficiency. Upgrade to Opus if queries need deep reasoning or long-context handling.

System flow:

  1. User asks: "What's our pricing for SEO retainers?"
  2. Embedding model converts question to vector.
  3. Vector database searches for documents with similar embeddings (pricing docs, proposal templates).
  4. Top 3-5 most relevant documents retrieved.
  5. LLM receives prompt: "Answer this question using only the following context: [retrieved docs]. Question: What's our pricing for SEO retainers?"
  6. LLM generates answer citing sources.

Building a Minimal Viable RAG System in 4 Hours

You don't need ML expertise to build basic RAG. Use LangChain or LlamaIndex (Python libraries) to assemble components quickly.

Tech stack:

  • Python 3.10+
  • LangChain or LlamaIndex (orchestration frameworks)
  • Chroma or Pinecone (vector database)
  • OpenAI API (embeddings + LLM)

Step 1: Collect documents Gather internal docs you want RAG to access: SOPs, templates, meeting notes, client files, product docs. Export as markdown, PDF, or text files. Store in one folder: /knowledge_base/.

Step 2: Install dependencies

pip install langchain chromadb openai python-dotenv

Step 3: Set up API keys Create .env file:

OPENAI_API_KEY=your_openai_key_here

Step 4: Ingest documents Load documents, split into chunks, generate embeddings, store in Chroma:

from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
import os
from dotenv import load_dotenv

load_dotenv()

# Load documents
loader = DirectoryLoader('./knowledge_base/', glob="**/*.md")
documents = loader.load()

# Split into chunks (1000 chars, 200 overlap)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)

# Generate embeddings and store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")
print(f"Ingested {len(chunks)} chunks from {len(documents)} documents")

Run this script once to populate vector database. As you add/update docs, re-run to refresh.

Step 5: Query the system Retrieve relevant chunks and generate answer:

from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# Load vectorstore
vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=OpenAIEmbeddings())

# Create retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(model="gpt-3.5-turbo"),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3})
)

# Ask question
question = "What's our client onboarding process?"
answer = qa_chain.run(question)
print(answer)

Step 6: Add source citations Modify to return sources:

qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(model="gpt-3.5-turbo"),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True
)

result = qa_chain({"query": question})
print("Answer:", result['result'])
print("\nSources:")
for doc in result['source_documents']:
    print(f"- {doc.metadata['source']}")

This builds a functional RAG system in 2-4 hours depending on familiarity with Python.

Optimizing Retrieval Quality with Chunking and Metadata

Poor retrieval ruins RAG systems. If irrelevant documents get retrieved, LLM generates bad answers. Optimize retrieval through chunking strategy and metadata filtering.

Chunking strategy: Documents split into chunks before embedding. Chunk size affects retrieval quality.

  • Too small (100-200 chars): Chunks lack context. "Our pricing is $5K" without explaining what $5K refers to.
  • Too large (5,000+ chars): Chunks contain too much unrelated information. Dilutes semantic meaning.
  • Sweet spot: 800-1,200 chars with 200-char overlap. Overlap ensures concepts spanning chunk boundaries aren't lost.

Test chunk sizes. Run queries against different configurations, evaluate answer quality.

Metadata enrichment: Add metadata to chunks (document type, date, author, category). This enables filtering during retrieval.

Example metadata:

documents = [
    Document(page_content="Client onboarding steps...", metadata={"type": "SOP", "category": "sales", "date": "2026-01-15"}),
    Document(page_content="Pricing for retainers...", metadata={"type": "pricing", "category": "sales", "date": "2025-12-10"})
]

Filter retrieval by metadata:

retriever = vectorstore.as_retriever(
    search_kwargs={"k": 3, "filter": {"category": "sales"}}
)

This retrieves only sales-related docs, improving relevance for sales queries.

Hybrid search (semantic + keyword): Pure semantic search sometimes misses exact terms. Combine semantic search (embeddings) with keyword search (BM25):

from langchain.retrievers import BM25Retriever, EnsembleRetriever

# Semantic retriever
semantic_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Keyword retriever
keyword_retriever = BM25Retriever.from_documents(documents)
keyword_retriever.k = 5

# Ensemble (combines both)
ensemble_retriever = EnsembleRetriever(retrievers=[semantic_retriever, keyword_retriever], weights=[0.5, 0.5])

This catches both semantically similar and keyword-exact matches.

Re-ranking retrieved documents: After initial retrieval, re-rank results using a more sophisticated model (Cohere Rerank, cross-encoders). Re-ranking improves top-3 relevance by 20-30%.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank

# Base retriever
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Rerank top 10 down to top 3
compressor = CohereRerank(model="rerank-english-v2.0", top_n=3)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever
)

This retrieves 10 candidates, re-ranks, returns top 3. Cost: ~$1-$2 per 1K queries.

Adding Chat Memory for Multi-Turn Conversations

Basic RAG answers one question at a time. Chat memory enables follow-up questions: "What's our pricing?" → "How does that compare to competitors?" without repeating context.

LangChain conversation memory:

from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

qa_chain = ConversationalRetrievalChain.from_llm(
    llm=OpenAI(model="gpt-3.5-turbo"),
    retriever=vectorstore.as_retriever(),
    memory=memory
)

# First question
response = qa_chain({"question": "What's our client onboarding process?"})
print(response['answer'])

# Follow-up (memory retains context)
response = qa_chain({"question": "How long does that typically take?"})
print(response['answer'])

Memory persists conversation history. LLM uses prior context to interpret follow-ups.

For production, persist memory per user session (store in Redis, PostgreSQL, or session cookies). This enables multi-session continuity—users can return days later and resume conversations.

Deploying RAG Systems: Chatbots, Slack Bots, and APIs

RAG systems need interfaces. Deploy as web chatbots, Slack bots, or APIs.

Web chatbot (Streamlit): Build UI in Python using Streamlit:

import streamlit as st
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI

st.title("Company Knowledge Base")

vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=OpenAIEmbeddings())
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(model="gpt-3.5-turbo"),
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)

question = st.text_input("Ask a question:")
if st.button("Submit"):
    result = qa_chain({"query": question})
    st.write("**Answer:**", result['result'])
    st.write("**Sources:**")
    for doc in result['source_documents']:
        st.write(f"- {doc.metadata['source']}")

Run: streamlit run app.py. Access at localhost:8501.

Deploy to cloud (Streamlit Cloud free, Render $7/month, Fly.io $5-$10/month).

Slack bot: Integrate RAG into Slack using Slack Bolt (Python SDK):

from slack_bolt import App
from langchain.chains import RetrievalQA
# ... (same setup as above)

app = App(token=os.environ["SLACK_BOT_TOKEN"])

@app.message(".*")
def handle_message(message, say):
    question = message['text']
    result = qa_chain({"query": question})
    say(result['result'])

if __name__ == "__main__":
    app.start(port=3000)

Deploy to Heroku, Render, or Fly.io. Users ask questions in Slack, bot responds with RAG answers.

REST API (FastAPI): Expose RAG as API for custom integrations:

from fastapi import FastAPI
from pydantic import BaseModel
# ... (same setup)

app = FastAPI()

class Query(BaseModel):
    question: str

@app.post("/ask")
def ask(query: Query):
    result = qa_chain({"query": query.question})
    return {
        "answer": result['result'],
        "sources": [doc.metadata['source'] for doc in result['source_documents']]
    }

Run: uvicorn api:app --reload. Access at localhost:8000/ask.

Security, Access Control, and Compliance

RAG systems access internal data. Secure them properly.

Access control:

  • Implement user authentication (OAuth, SSO).
  • Role-based access: Sales sees sales docs, Ops sees ops docs. Filter retrieval by user role metadata.
  • Log queries: Track who asked what, when. Audit for misuse.

Data privacy:

  • Use zero data retention AI models (Claude for Enterprise, OpenAI Enterprise) to prevent training on your data.
  • For sensitive docs, self-host LLMs (use Llama 3, Mistral via Ollama) to keep data on-premises.
  • Encrypt vector databases and data at rest.

Compliance (GDPR, HIPAA, SOC 2):

  • Redact PII before ingesting docs into RAG.
  • Implement data deletion workflows: when docs are deleted from source, remove from vector DB.
  • For regulated industries, consult legal before deploying RAG on sensitive data.

RAG Use Cases for B2B Operations

Use case 1: Internal knowledge Q&A Employees ask questions about company policies, SOPs, benefits, IT troubleshooting. RAG replaces searching through Confluence, Notion, or asking coworkers.

Use case 2: Client history summaries Sales or support staff ask "Summarize interactions with Client X in the past 6 months." RAG retrieves CRM notes, emails, support tickets, generates summary.

Use case 3: Onboarding assistant New hires ask questions about onboarding: "Where do I submit expenses?" "Who's my manager?" RAG answers using onboarding docs and org charts.

Use case 4: Proposal and contract generation Sales team asks "Draft a proposal for [Client Y] including [services]." RAG retrieves similar proposals, templates, pricing docs, generates customized draft.

Use case 5: Research and competitive intel Analysts ask "What's our competitive positioning vs. Competitor Z?" RAG retrieves internal strategy docs, market research, past analyses.

Use case 6: Troubleshooting and support Technical support asks "How do we fix [error X]?" RAG retrieves troubleshooting guides, past ticket resolutions, engineering docs.

Use case 7: Compliance and legal queries Legal or compliance teams ask "What's our data retention policy?" RAG retrieves policies, regulatory docs, past decisions.

FAQ: RAG Architecture for Business Knowledge

How much does it cost to run a RAG system?

Setup: $0-$500 (developer time, no external costs if using open-source tools). Monthly operating: $50-$300 depending on usage. Pinecone ($70-$140/month), OpenAI API ($10-$100/month for embeddings + LLM), hosting ($10-$50/month). Total: $90-$290/month for small-medium teams (<50 users).

How do I keep the knowledge base updated?

Re-run ingestion script whenever docs change. Automate with cron jobs (nightly) or webhooks (trigger ingestion when docs are updated in Notion, Google Drive, etc.). Use Zapier or n8n to connect doc platforms to ingestion pipeline.

Can RAG work with non-text data (PDFs, images, videos)?

Yes, with additional preprocessing. PDFs: Extract text using PyPDF2 or pdfplumber before ingesting. Images: Use OCR (Tesseract) or vision models (GPT-4 Vision) to extract text. Videos: Transcribe audio using Whisper (OpenAI), ingest transcripts.

What if the AI hallucinates despite having retrieval?

Hallucinations occur when: (1) retrieval fails (no relevant docs found), (2) LLM ignores context, (3) context is ambiguous. Fixes: Improve retrieval quality (better chunking, metadata), constrain LLM ("Only answer using provided context. If context doesn't answer the question, say 'I don't know.'"), use citations to verify answers.

Can I build RAG without coding?

Limited options. Glean, Dashworks, and Hebbia offer no-code RAG for enterprise knowledge bases. Cost: $5K-$50K/year. For small teams, basic coding (Python + LangChain) is more cost-effective and flexible.

Related: personal-knowledge-management-operators.html, prompt-libraries-b2b-teams.html, operator-tech-stack-tools.html


When This Doesn't Apply

Skip this if your situation is fundamentally different from what's described above. Not every framework fits every business. Use the diagnostic in the first section to determine whether this approach matches your current stage and goals.

← All articles

This is one piece of the system.

I build AI memory systems for people who run businesses. Claude Code + Obsidian vault architecture with persistent memory across conversations. The open-source repo is the architecture. The service is making it yours.