Benchmarking 4 Embedding Models on Real Documents for RAG
I tested 4 encoders on 141 SEC filings. Voyage led retrieval at half the cost, BGE nearly matched it for free, and this is just the baseline.
The embedding model you pick determines what the LLM ever gets to read. If retrieval fails, no prompt engineering fixes it. No chain-of-thought trick, no system prompt rewrite, no temperature tweak. None of it matters if the right chunks never made it into the context window.
I assumed OpenAI's text-embedding-3-large would be the safe default. It's the model most RAG tutorials reach for. I ran the benchmarks anyway. The data told a different story.
This is the first experiment in a series I'm running to build a production-quality RAG pipeline. Before I can optimize chunking, re-ranking, or prompt strategies, I need to answer a foundational question: which encoder gives me the best starting point? This article covers that experiment: four encoders, 141 SEC filings, 459 query-document pairs, and what the retrieval metrics revealed.
Talk is cheap. Here's the notebook so you can run it yourself.
Why Encoder Choice Matters
Most RAG discussions focus on the generation side: prompt templates, model selection, chain architecture. The encoder gets treated as a configuration detail. Pick OpenAI, set the dimension, move on.
That's a mistake. The encoder is the gatekeeper. It decides which chunks score high enough to reach the LLM. A mediocre encoder that buries the relevant passage at rank 4 instead of rank 1 degrades every downstream metric: faithfulness, answer relevance, context relevance. The LLM can only work with what it receives.
I wanted to know exactly how much encoder choice matters on a real corpus. Not synthetic benchmarks. Not MTEB leaderboard scores. Actual retrieval performance on domain-specific documents with measurable quality gaps.

End-to-end pipeline: SEC filings are chunked, embedded by four encoders, stored in ChromaDB, and evaluated with both retrieval metrics and LLM-as-judge scoring.
The Setup
The corpus: 141 SEC filings from 10 companies across 5 sectors (tech, finance, healthcare, energy, consumer). Filing types include 10-K annual reports, 8-K current reports, and DEF 14A proxy statements. SEC filings are hard in the right ways. Dense prose, domain-specific terminology, shared vocabulary across companies that makes semantic disambiguation genuinely difficult.
The encoders:
- OpenAI text-embedding-3-large (API)
- Voyage AI (API)
- BGE (open source, self-hostable)
- MiniLM (open source, 384 dimensions, the fastest of the four)
Evaluation: I built a test set of 459 query-document pairs with ground-truth relevance judgments and evaluated 50 queries (sampled for stable confidence intervals while keeping GPT-4o-mini judge costs under $2). Retrieval metrics: MRR, NDCG@5, Recall@5. Generation quality: Faithfulness, Answer Relevancy, and Context Relevancy, scored by GPT-4o-mini as judge.
Chunking: 500 characters with 25% overlap using LangChain's RecursiveCharacterTextSplitter. LangChain is used only for text splitting. Everything else (embedding, retrieval, evaluation, the RAG pipeline) is implemented from scratch to keep full control over the pipeline. Weights & Biases tracks every run.
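For reference, the chunking setup looks roughly like this (a minimal sketch; the import path depends on your LangChain version, and the sample text below stands in for a full filing):

```python
# Sketch of the chunking config described above: 500-character chunks, 25% (125-char) overlap.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,     # measured in characters by default
    chunk_overlap=125,  # 25% of 500
)

filing_text = "Item 1A. Risk Factors. The Company's business, reputation, ..."  # placeholder for a full SEC filing
chunks = splitter.split_text(filing_text)  # list of overlapping ~500-character strings
```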
Every encoder implements a clean abstraction so swapping models is a config change, not a refactor:

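A minimal sketch of what such an abstraction can look like (class names and the exact interface here are illustrative, not the repo's actual API):

```python
from abc import ABC, abstractmethod


class Encoder(ABC):
    """The one interface every encoder has to satisfy."""

    @abstractmethod
    def embed(self, texts: list[str]) -> list[list[float]]:
        """Return one embedding vector per input text."""


class OpenAIEncoder(Encoder):
    def __init__(self, model: str = "text-embedding-3-large"):
        from openai import OpenAI
        self.client = OpenAI()  # reads OPENAI_API_KEY from the environment
        self.model = model

    def embed(self, texts: list[str]) -> list[list[float]]:
        resp = self.client.embeddings.create(model=self.model, input=texts)
        return [item.embedding for item in resp.data]


class LocalEncoder(Encoder):
    """Covers self-hosted models like BGE or MiniLM via sentence-transformers."""

    def __init__(self, model_name: str):
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer(model_name)

    def embed(self, texts: list[str]) -> list[list[float]]:
        return self.model.encode(texts).tolist()
```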
This pattern matters. When you're comparing encoders, you want the only variable to be the model itself. Same chunking, same retrieval logic, same evaluation. The abstraction enforces that.
Retrieval is a single function call. Embed the query, hit ChromaDB, return ranked results:

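A sketch of what that function can look like, reusing the Encoder interface above (the collection name and return shape are illustrative; the ChromaDB calls are the library's standard client API):

```python
import chromadb

chroma = chromadb.PersistentClient(path="./chroma")
collection = chroma.get_or_create_collection("sec_filings")  # hypothetical collection name


def retrieve(query: str, encoder: Encoder, k: int = 5) -> list[dict]:
    """Embed the query, query ChromaDB, and return the top-k chunks in rank order."""
    query_embedding = encoder.embed([query])[0]
    results = collection.query(query_embeddings=[query_embedding], n_results=k)
    return [
        {"id": doc_id, "text": text, "distance": dist}
        for doc_id, text, dist in zip(
            results["ids"][0], results["documents"][0], results["distances"][0]
        )
    ]
```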
The Results

Full results across all four encoders. Voyage leads on retrieval metrics; OpenAI leads on generation quality; BGE is surprisingly competitive for a free model.
Voyage is the strongest retrieval model in this setup. MRR of 0.69 vs OpenAI's 0.61. That's the difference between the relevant document landing at rank 1 vs rank 2 on average. NDCG@5 tells the same story: 0.72 vs 0.66. And Voyage does this at $0.14 for the full corpus, half of what OpenAI costs.
OpenAI produces the highest quality generation outputs. Highest Faithfulness (0.85), best Answer Relevancy (0.72), and near-best Context Relevancy (0.85). If your priority is minimizing hallucinations in the generated answer, OpenAI's encoder feeds the LLM better context.
BGE is the surprise. It leads on Recall@5 (0.84), trails Voyage by 0.09 on MRR, and costs nothing. For teams that can't use external APIs or simply don't want to, BGE is a serious option.
MiniLM is 17x faster (19 seconds vs 328 seconds for OpenAI) but the quality gap is real. MRR drops to 0.59, and generation metrics fall across the board.
The Insight: High Recall Doesn't Mean High MRR
The most important finding is this. High recall does not mean good ranking. Look at the Recall@5 column: all four encoders land between 0.80 and 0.84. The relevant document is somewhere in the top 5 for nearly every query, regardless of which encoder you use.
But MRR ranges from 0.58 to 0.69. That's a meaningful spread.

95% confidence intervals for MRR across encoders. The spread between Recall@5 (tight) and MRR (wide) reveals that ranking quality, not just retrieval, separates these models.
This directly impacts RAG performance, because most pipelines pass the top-k chunks to the LLM in rank order. The chunk at position 1 gets the most attention. A document buried at position 4 might technically count as "retrieved," but the LLM gives it less weight, especially with longer context windows where attention degrades.
For RAG, MRR matters more than Recall. Getting the document somewhere in the top 5 isn't enough. You need it at rank 1. This is where Voyage's 0.69 MRR vs MiniLM's 0.59 translates into measurably better generated answers.
The MRR computation captures this directly. It's the reciprocal rank of the first relevant document:

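A minimal implementation of that computation (field names here are illustrative; the actual evaluation harness may differ):

```python
def mean_reciprocal_rank(ranked_lists: list[list[str]], relevant: list[set[str]]) -> float:
    """Average over queries of 1/rank of the first relevant document (0 if none retrieved)."""
    scores = []
    for retrieved_ids, relevant_ids in zip(ranked_lists, relevant):
        score = 0.0
        for rank, doc_id in enumerate(retrieved_ids, start=1):
            if doc_id in relevant_ids:
                score = 1.0 / rank  # only the first relevant hit counts
                break
        scores.append(score)
    return sum(scores) / len(scores)
```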
A relevant document at rank 1 scores 1.0. At rank 2, it scores 0.5. At rank 5, just 0.2. The metric punishes exactly the behavior that hurts RAG: burying the right answer below less relevant chunks.
Cost vs Quality
The cost picture is straightforward:

Cost-quality Pareto analysis. Voyage sits on the efficient frontier: best retrieval quality at the lowest API cost. BGE achieves near-Voyage quality at zero cost.
Voyage embeds the entire corpus for $0.14; OpenAI costs double that, for lower retrieval scores. At scale, this gap compounds. A corpus 100x larger means $14. A corpus 1000x larger means $140. The per-query cost difference is small, but the per-corpus cost difference adds up fast.
BGE sits at the extreme: zero cost, and it trails Voyage by only 0.03 MRR. If you're self-hosting and willing to manage GPU infrastructure, BGE delivers remarkable value.
The real question is whether OpenAI's generation-quality advantage (Faithfulness 0.85 vs Voyage's 0.83, Answer Relevancy 0.72 vs 0.67) justifies the 2x cost. For applications where hallucination risk is critical (legal, medical, compliance) it might. For most other cases, Voyage offers the better tradeoff.
Where Encoders Disagree
The error analysis revealed a clear pattern: query specificity is the biggest differentiator.
All four encoders handle broad topical queries well. "What are Apple's risk factors?" retrieves relevant chunks regardless of encoder. The failures cluster around two query types:
Entity-specific queries: "What responsibilities does Lead Independent Director Mr. Horton hold?" requires the encoder to distinguish one named individual from dozens of executives mentioned across the filing. Voyage and OpenAI handle this; BGE and MiniLM struggle.
Date-anchored queries: "What is the fiscal year end date in Apple's 10-K dated October 31, 2025?" demands precise temporal matching. All encoders occasionally fail here, but MiniLM fails most often. Its 384-dimensional space simply can't encode the specificity needed.
The pattern is consistent: as query specificity increases, the gap between encoders widens. Generic queries are easy. Precise queries reveal the quality difference you're paying for.
The t-SNE projections below compress each encoder's high-dimensional embedding space into 2D. Each point is a single SEC filing chunk; color represents K-Means clusters (k=8) learned on the raw embeddings.

t-SNE projection of document embeddings colored by sector. Cluster separation varies by encoder, another signal that encoder choice shapes what the retrieval system "sees."
Tighter, more distinct clusters indicate stronger semantic organization: the encoder is placing similar content closer together, which directly helps retrieval. Notice how Voyage and OpenAI produce more separated clusters than MiniLM, which shows a more diffuse cloud. This mirrors the MRR gap: encoders that organize the space well rank the right document higher.
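For reference, a minimal sketch of how a projection like this can be produced with scikit-learn (the random matrix below is a placeholder for one encoder's real chunk embeddings):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

embeddings = np.random.rand(500, 1024)  # placeholder for one encoder's (n_chunks, dim) embeddings

# Clusters are learned on the raw high-dimensional embeddings (k=8, as in the figure).
labels = KMeans(n_clusters=8, random_state=42, n_init=10).fit_predict(embeddings)

# t-SNE compresses the space to 2D purely for visualization.
coords = TSNE(n_components=2, random_state=42, perplexity=30).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=8)
plt.title("Chunk embeddings, t-SNE projection")
plt.show()
```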
Key Insights
Benchmark on YOUR data. What wins on SEC filings may not win on your corpus. Domain vocabulary, document structure, and query patterns all shift the results. MTEB leaderboard rankings are a starting point, not an answer.
MRR > Recall for RAG. All four encoders achieve 0.80–0.84 Recall@5, but MRR ranges from 0.59 to 0.69. Getting the right document somewhere in the top 5 isn't enough. Rank 1 is what matters for generation quality.
Free models are viable. BGE trails Voyage by 0.09 MRR and costs nothing. For teams with API restrictions or tight budgets, it's a legitimate production choice.
Speed vs quality is a real tradeoff. MiniLM embeds 17x faster than OpenAI (19s vs 328s). If you're indexing millions of documents in real-time, that matters. But the quality drop (MRR 0.59) is measurable and shows up in downstream generation metrics.
Recommendations (Baseline RAG — Before Optimization)
- Start with Voyage if you're building a new RAG system and want the best retrieval-to-cost ratio.
- Use BGE if you can't use external APIs or need to self-host.
- Consider OpenAI only if minimizing hallucinations matters more than retrieval precision (legal, medical, compliance).
- Skip MiniLM unless you're indexing at massive scale and can tolerate the quality drop.
What This Experiment Tells Me (and What It Doesn't)
This was Experiment 1. The goal was narrow: pick the encoder that gives me the best foundation to build on. On that question, Voyage AI is the clear frontrunner. Best MRR (0.69), best NDCG (0.72), half the cost of OpenAI. BGE is the fallback if I need zero API dependency.
But I want to be honest about where these numbers stand. An MRR of 0.69 and NDCG of 0.72 are not production-ready. For a RAG system I'd actually ship, I need those metrics significantly higher. The encoder was the first variable to isolate. Now that I have a viable baseline, the real optimization work begins.
With the baseline established, the next step is improving ranking, retrieval coverage, and context quality. These are the main levers in RAG systems.
Chunking R&D. Different chunking strategies (size, overlap, semantic boundaries) directly affect what each embedding represents. I kept chunking fixed in this experiment to isolate the encoder variable. Next, I'll vary it.
Encoder R&D. Now that I have a test set and evaluation harness, I can revisit encoder selection with a larger query sample and newer models as they're released.
Prompt improvement. The current prompts are basic. Adding general context, the current date, and conversation history to the LLM call should improve answer quality without touching retrieval at all.
Document pre-processing. Using an LLM to clean, restructure, or enrich chunks before encoding. SEC filings have boilerplate, tables, and legalese that could be transformed into more embeddable text.
Query rewriting. Using an LLM to convert a user's natural question into a better RAG query before it hits the encoder.
Query expansion. Turning one question into multiple retrieval queries to improve coverage across different phrasings and angles.
Re-ranking. Using an LLM to sub-select from the initial retrieval results. This is where I expect the biggest MRR gains: retrieve broadly, then rank precisely.
Hierarchical retrieval. Summarizing documents at multiple levels so the system can match at different granularities.
Hybrid search. Combining dense vector retrieval with sparse keyword search (BM25). Neither alone is sufficient. Vector search misses exact terms; BM25 misses semantics. Every production RAG system I've seen uses both.
Contextual retrieval. Prepending LLM-generated context to each chunk before embedding, explaining what document and section the chunk belongs to. Anthropic's research shows this reduces failed retrievals by 49%, and by 67% when combined with re-ranking.
Each of these is a separate experiment. I'll be writing up the results as I go.
Limitations I want to be transparent about: 50 of 459 queries were evaluated (sampled for cost efficiency), and relevance judgments are synthetic. Rankings could shift with the full query set, especially where encoder gaps are small. This is an experiment, not a production audit.
This is a fast-moving space. Models improve, APIs change, and best practices evolve. I'm constantly learning and implementing alongside this series. If you spot an inconsistency or something that has since become outdated, I'd genuinely appreciate the feedback; drop a comment or reach out.
The full notebook with reproducible code, data, and W&B experiment tracking is on GitHub: Notebook Link
Next up: chunking strategies, re-ranking, and hybrid search, the levers I expect to move these metrics the most.
I'm also expanding the encoder lineup. If there's a model you think should be in the comparison, newer releases, domain-specific encoders, or something that's worked well in your stack, please let me know.