Knowledge Systems & RAG

What RAG actually does

Retrieval-augmented generation is a pattern, not a product. The pattern: when a user asks a question, your system searches your documents for relevant passages, then asks a language model to answer using only those passages. It’s how you get a chatbot that answers from your runbooks instead of making things up about your runbooks.
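To make the pattern concrete, here is a minimal sketch of the two steps: find relevant passages, then build a prompt that restricts the model to them. Everything in it is illustrative. The word-overlap scoring stands in for real retrieval, and the passage ids, the `RUNBOOK` contents, and the function names are hypothetical. The model call itself is left out.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase word set; a stand-in for real retrieval scoring."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question: str, passages: dict[str, str], k: int = 2) -> list[str]:
    """Return ids of up to k passages sharing the most words with the question."""
    q = tokenize(question)
    scored = [(len(q & tokenize(text)), pid) for pid, text in passages.items()]
    # Drop passages with no shared words, then sort best first (ties by id).
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [pid for _, pid in ranked[:k]]

def build_prompt(question: str, passages: dict[str, str], ids: list[str]) -> str:
    """Assemble a prompt that limits the model to the retrieved passages."""
    context = "\n".join(f"[{pid}] {passages[pid]}" for pid in ids)
    return (
        "Answer using only the passages below. "
        "If they do not contain the answer, say so.\n\n"
        f"{context}\n\nQuestion: {question}"
    )

# Hypothetical runbook snippets, keyed by a made-up passage id.
RUNBOOK = {
    "vpn-01": "To reset a VPN token, open the self-service portal and choose Reset Token.",
    "print-03": "Printer queues are cleared by restarting the spooler service.",
    "onb-02": "New hires receive a laptop and VPN token on their first day.",
}

question = "How do I reset my VPN token?"
ids = retrieve(question, RUNBOOK)
prompt = build_prompt(question, RUNBOOK, ids)
print(prompt)
```

The prompt that comes out contains only the two VPN-related passages, each tagged with its id, which is what lets the answer point back to a source.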

For most SMBs and municipal IT teams, the value isn’t “chat with your documents.” It’s “stop losing 20 minutes a day to people asking questions that are already answered in the SharePoint folder no one can find.” That’s a real, boring, valuable problem.

When RAG is worth the build

We’ve shipped RAG systems we’re proud of, and we’ve talked clients out of building RAG systems they didn’t need. The honest answer on when it’s worth it:

Document count: 100-50,000. Below 100 documents, prompt stuffing (just paste the docs into the system prompt) often works fine and costs less to maintain. Above 50,000, you’re in territory where the indexing and re-ranking strategy matters more than the model choice.
Documents change at a stable rate. RAG systems work best when the document corpus updates predictably (weekly SOP changes, monthly policy updates). If your docs change hourly, you’ll spend more on index maintenance than on inference.
The questions cluster. If the same 30 questions get asked every week, RAG pays back fast. If every question is bespoke, it pays back slower or not at all.
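One rough way to check whether your questions cluster is to measure how much of a question log the most common questions cover. The sketch below assumes you can export past questions as plain strings. The sample log and the function names are hypothetical, and exact matching after normalization is deliberately crude: it misses paraphrases, so treat the result as a lower bound.

```python
import re
from collections import Counter

def normalize(question: str) -> str:
    """Lowercase and strip punctuation so trivially different phrasings match."""
    return " ".join(re.findall(r"[a-z0-9]+", question.lower()))

def top_n_share(questions: list[str], n: int = 30) -> float:
    """Fraction of all questions accounted for by the n most frequent ones."""
    counts = Counter(normalize(q) for q in questions)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in counts.most_common(n))
    return covered / total

# Hypothetical one-week question log.
log = [
    "How do I reset my VPN token?",
    "how do I reset my vpn token",
    "Where is the PTO policy?",
    "Where is the PTO policy",
    "Who approves badge access?",
]
share = top_n_share(log, n=2)
print(f"Top 2 questions cover {share:.0%} of the log")
```

A high share for a small n suggests the same questions keep coming back. A low share suggests most questions are bespoke.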

What we build with

Anthropic Claude for the answer-generation step when accuracy on long-context retrieval matters more than latency. Claude’s long context window means we can sometimes skip the vector store entirely for small corpora.
OpenAI embeddings as the default embedding model for retrieval, paired with a Postgres + pgvector or Pinecone index depending on scale.
Microsoft 365 Copilot when the documents are already in SharePoint and the org has the M365 Copilot license. There’s no point reinventing retrieval over SharePoint when Microsoft already shipped it.
Open-source embeddings (BGE, E5) when document content can’t leave the org’s infrastructure. We run these self-hosted on the org’s own GPUs or on a private VPC.
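Whichever embedding model and index sit in the stack, the retrieval step comes down to ranking stored passage vectors by similarity to the question vector. Here is a minimal sketch of that ranking in plain Python, using cosine similarity. The three-dimensional vectors and passage ids are made up for illustration; real embeddings have hundreds or thousands of dimensions, and in production the vector index does this ranking for you.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k stored vectors most similar to the query vector."""
    ranked = sorted(index, key=lambda pid: cosine(query_vec, index[pid]), reverse=True)
    return ranked[:k]

# Toy index: passage id -> made-up embedding.
index = {
    "vpn-01": [0.9, 0.1, 0.0],
    "print-03": [0.0, 0.2, 0.9],
    "onb-02": [0.6, 0.6, 0.1],
}
query_vec = [1.0, 0.2, 0.0]  # pretend embedding of a VPN question

nearest = top_k(query_vec, index, k=2)
print(nearest)
```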

The right stack depends on where your documents live, what compliance constraints you have, and whether you have a technical owner who’ll maintain it. We pick the stack on the discovery call.

What we ship

A typical RAG engagement runs 4-8 weeks. At the full eight weeks, the schedule looks like this:

Week 1: Document audit. We read a sample of your corpus and find out how clean it is. (Spoiler: usually it isn’t. That’s fine; we factor cleanup into the build.)
Weeks 2-3: Index design and chunking strategy. The chunking strategy is where most public RAG implementations are quietly broken; we tune for your document shape.
Weeks 4-5: Retrieval pipeline, including re-ranking and citation. Every answer in our RAG systems cites the source passage. No source, no answer.
Weeks 6-7: Eval harness. We build a test set of 50-100 real questions from your team and measure retrieval recall and citation precision against it. We share the eval, not just the score.
Week 8: Deploy, train, hand off.
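Two of the steps above are easy to show in miniature: chunking with overlap, and the "no source, no answer" rule. The sketch below is an illustration under stated assumptions, not our production pipeline. The word-window chunker, the window sizes, the 0.5 score threshold, the refusal text, and all names are hypothetical. Real chunking usually follows the document's own structure (headings, numbered steps) rather than fixed word counts.

```python
from dataclasses import dataclass
from typing import Optional

REFUSAL = "I could not find this in the indexed documents."  # hypothetical wording

def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of `size` words; each shares `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks

@dataclass
class Answer:
    text: str
    source_id: Optional[str]

def gated_answer(draft: str, hits: list[tuple[str, float]], min_score: float = 0.5) -> Answer:
    """Return the draft with a citation only if some retrieved passage clears min_score."""
    if not hits:
        return Answer(REFUSAL, None)
    best_id, best_score = max(hits, key=lambda hit: hit[1])
    if best_score < min_score:
        return Answer(REFUSAL, None)
    return Answer(f"{draft} [source: {best_id}]", best_id)

# Chunking a 12-word stand-in document into 5-word windows with 2 words of overlap.
sop = " ".join(f"w{i}" for i in range(12))
chunks = chunk_words(sop, size=5, overlap=2)

# Hits are (passage id, retrieval score) pairs; the scores here are made up.
grounded = gated_answer("Restart the spooler service.", [("print-03", 0.82), ("onb-02", 0.31)])
ungrounded = gated_answer("Probably reboot it.", [("onb-02", 0.31)])
print(chunks)
print(grounded.text)
print(ungrounded.text)
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk. The gate trades a few unanswered questions for never returning an answer without a passage behind it.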

Pricing for a RAG pilot is quoted in the proposal once we’ve sized the corpus and integration depth on the discovery call. No “let’s discuss investment” theater; you’ll see the number in writing within 48 hours.

Where we draw the line

We don’t ship RAG systems for organizations that won’t fund the corpus cleanup. RAG retrieves whatever you point it at; pointing it at a stale SharePoint with five conflicting versions of the same SOP gets you a confidently wrong chatbot. We require at least one round of document cleanup as part of the engagement.

Where to start

Read our AVA case study: AVA uses retrieval over historical Autotask tickets, which is a RAG pattern over a structured data source. Take the AI Readiness Assessment if you’re trying to figure out whether your data is ready. Or book a 30-minute discovery call, bring a sample document, and we’ll tell you on the call whether RAG is the right shape for your problem.

                recall@1   recall@5   p95 latency
──────────────  ─────────  ─────────  ───────────
naive bm25      0.41       0.68       180 ms
+ rerank        0.58       0.81       310 ms
+ hyde          0.62       0.86       420 ms
+ tuned chunks  0.71       0.91       430 ms
─────────────────────────────────────────────────
shipping: tuned-chunks + rerank (no hyde)

When is RAG worth it?

When your team asks the same kinds of questions every day and the answers already exist somewhere in your documents. If the questions cluster, RAG earns its keep. If every question is a one-off, you’re better off with a simpler search system or a basic chat-with-docs tool.

What’s the document floor?

Below ~100 documents, the engineering overhead of a real RAG system usually loses to simpler alternatives such as prompt stuffing. Above ~50,000 documents, you start needing a search backend with proper relevance tuning. The sweet spot is in between.

How do you evaluate retrieval quality?

We build a test set of 50–100 real questions from your team and measure recall@k (did we retrieve the right document?) and precision (did the answer cite the right passage?). Every prompt change is gated by this eval.
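As a rough sketch of what that gate can look like: recall@k is the fraction of test questions whose expected document shows up in the top k retrieved results, and a change ships only if the score does not drop below the current baseline. The question ids, document ids, and function names below are hypothetical, and a real harness would also score citation precision.

```python
def recall_at_k(results: dict[str, list[str]], expected: dict[str, str], k: int) -> float:
    """Fraction of questions whose expected document id is among the top k retrieved ids."""
    if not expected:
        return 0.0
    hits = sum(
        1 for question, doc_id in expected.items()
        if doc_id in results.get(question, [])[:k]
    )
    return hits / len(expected)

def passes_gate(candidate: float, baseline: float, tolerance: float = 0.0) -> bool:
    """A change passes if its score is no worse than the baseline, within tolerance."""
    return candidate >= baseline - tolerance

# Hypothetical test set: question id -> document that should be retrieved.
expected = {"q1": "vpn-01", "q2": "pto-04", "q3": "badge-07", "q4": "print-03"}

# Hypothetical retrieval output: question id -> ranked document ids.
results = {
    "q1": ["vpn-01", "onb-02"],
    "q2": ["onb-02", "pto-04"],
    "q3": ["print-03", "vpn-01"],
    "q4": ["print-03"],
}

r1 = recall_at_k(results, expected, k=1)
r2 = recall_at_k(results, expected, k=2)
print(f"recall@1={r1:.2f} recall@2={r2:.2f}")
print("ship" if passes_gate(r2, baseline=0.75) else "block")
```

Keeping the test set and the per-question results, not just the final number, is what makes a regression debuggable: you can see which question stopped retrieving the right document.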