
Cross-Encoder: How It Improves Retrieval in AI Chatbots

A cross-encoder is a transformer model that scores a query and document together for relevance, powering the reranking step in modern RAG pipelines. Learn how it works.

More about Cross-Encoder

A cross-encoder is a type of transformer model that takes a query and a document as a single combined input and produces a relevance score between them. Unlike a bi-encoder, which embeds the query and document separately and then compares them, a cross-encoder attends to both pieces of text jointly, which lets it capture subtle relationships that independent embeddings miss.

The trade-off is cost. A cross-encoder has to run an attention pass over every query-document pair, so you cannot use it to search millions of documents directly. Instead, it sits at the second stage of a retrieval pipeline: a fast bi-encoder retrieves a shortlist of candidates, and the cross-encoder reranks them to surface the very best.

How a Cross-Encoder Works

The architecture is a standard transformer encoder like BERT, but used differently:

  • The query and document are concatenated into a single input sequence, usually separated by a special separator token such as [SEP].
  • The model runs full self-attention across the entire sequence, so every token of the query attends to every token of the document.
  • A classification head on top produces a single relevance score.

Because the two texts are processed together, the model can reason about whether a specific phrase in the query actually matches a specific phrase in the document, not just whether the overall topics are similar.
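The joint-input step above can be sketched in a few lines. This is a minimal illustration of how pairs are built; the library and model name in the comments are assumptions (Sentence-Transformers is one common choice, not necessarily what any given pipeline uses):

```python
def build_pairs(query: str, docs: list[str]) -> list[tuple[str, str]]:
    """Each pair becomes one joint input sequence, roughly:
    [CLS] query [SEP] document [SEP] -- so self-attention spans both texts."""
    return [(query, doc) for doc in docs]

query = "how do I change my billing email"
docs = [
    "Update the billing email on the Payments page.",
    "Change which events trigger notification emails.",
]
pairs = build_pairs(query, docs)

# With the sentence-transformers library, scoring would look like this
# (assumed API, shown for orientation rather than executed here):
#   from sentence_transformers import CrossEncoder
#   model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
#   scores = model.predict(pairs)  # one joint relevance score per pair
```

The key point is that each pair is a single forward pass: the model never sees the query or document in isolation.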

Cross-Encoder vs. Bi-Encoder

Both are transformer-based, but they work at different stages:

  • Bi-encoder: embeds query and documents into separate vectors. Fast, scalable, used for the initial retrieval over a vector database. The vector comparison is cheap, but the model never sees query and document together, so it misses nuance.
  • Cross-encoder: embeds query and document jointly. Slow, expensive, used to rerank a shortlist. Much higher precision.

The rule of thumb in modern retrieval: bi-encoders for the first pass, cross-encoders for the final ranking.
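The two-stage pattern can be sketched as follows. Both scoring functions here are toy stand-ins (word and bigram overlap) chosen only to make the control flow runnable; a real pipeline would use embedding cosine similarity and a cross-encoder model instead:

```python
def bi_score(query: str, doc: str) -> float:
    """Stand-in for cosine similarity between precomputed embeddings."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q | d), 1)

def cross_score(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder's joint relevance score.
    Rewards exact phrase (bigram) matches, which a real cross-encoder
    can detect because it attends over both texts at once."""
    bigrams = lambda tokens: set(zip(tokens, tokens[1:]))
    return len(bigrams(query.lower().split()) & bigrams(doc.lower().split()))

def retrieve_and_rerank(query, corpus, shortlist=20, top_k=3):
    # Stage 1: cheap scorer over the whole corpus -> shortlist.
    candidates = sorted(corpus, key=lambda d: bi_score(query, d),
                        reverse=True)[:shortlist]
    # Stage 2: expensive scorer over the shortlist only -> final ranking.
    return sorted(candidates, key=lambda d: cross_score(query, d),
                  reverse=True)[:top_k]

corpus = [
    "how to change your billing email address",
    "change email notification preferences",
    "reset your password",
]
top = retrieve_and_rerank("change my billing email", corpus, top_k=1)
```

Only the shortlist ever reaches the expensive scorer, which is what keeps the pattern affordable at corpus scale.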

Why Cross-Encoders Matter for Chatbots

In a real chatbot, semantic search with a bi-encoder returns a reasonable set of candidates, but the top result is not always the most relevant. Users notice. They ask "how do I change my billing email" and the bot surfaces the article about changing notification preferences, because both are about "email" in embedding space.

A cross-encoder reranker fixes this. Running the top 20 bi-encoder results through a cross-encoder and taking the top 3 typically lifts precision at 3 by 10 to 20 percentage points. That is the difference between a chatbot that answers correctly most of the time and one that answers correctly almost every time.
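Measuring that lift comes down to precision@k over an evaluation set. A minimal sketch, with illustrative document IDs rather than real measurements:

```python
def precision_at_k(ranked_ids, relevant_ids, k=3):
    """Fraction of the top-k results that are actually relevant."""
    top = ranked_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / k

# Before reranking: the relevant doc sits at position 4.
before = ["d7", "d2", "d9", "d1"]
# After reranking the same shortlist: the relevant doc is promoted.
after = ["d1", "d7", "d2", "d9"]
relevant = {"d1"}

print(precision_at_k(before, relevant))  # 0.0
print(precision_at_k(after, relevant))   # 0.3333333333333333
```

Averaged over a set of real user queries with labeled relevant documents, this is the number a reranker is expected to move.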

SiteSpeak uses Cohere's reranker in exactly this position: Pinecone handles fast initial retrieval, and the cross-encoder reranks the shortlist before the passages are handed to the large language model for answer generation. The extra step adds modest latency and significantly improves answer quality.

Where to Use a Cross-Encoder

Cross-encoders are worth adding when:

  • Retrieval quality matters more than latency. Typical for customer support and knowledge base chatbots.
  • The corpus is large enough that a single-stage bi-encoder returns noisy top results.
  • You want to combine signals: a cross-encoder trained on domain data can weigh relevance, freshness, and authority together.
  • You are already doing retrieval augmented generation and want to reduce AI hallucination risk by surfacing better grounding passages.

They are less necessary when:

  • The corpus is small and bi-encoder results are already precise.
  • Latency budgets are very tight (sub-100ms end to end).
  • You can afford to return 10 to 20 passages and let the LLM pick.

Common Cross-Encoder Models

Popular options include:

  • Cohere Rerank: hosted API, strong out-of-the-box performance, multilingual.
  • Sentence-Transformers cross-encoders (ms-marco-MiniLM, ms-marco-MPNet): open-source, easy to self-host.
  • BGE-Reranker: open-source, competitive with commercial options.
  • Jina Reranker: hosted, fast.

Most teams start with a hosted reranker for speed of integration and only move to self-hosted when they need specific model behaviour or strict data residency.

Limitations

Cross-encoders are powerful but not free:

  • Inference cost: each rerank call costs more than a vector search. Batching helps but does not eliminate the gap.
  • Shortlist dependency: the cross-encoder can only pick from what the bi-encoder hands it. If the right document is not in the top 50, reranking will not save you.
  • Training data drift: a reranker trained on generic web data may underperform on niche domain text. Fine-tuning on your own query-document pairs is sometimes worth it.

Treat the reranker as one lever in your retrieval stack, not a magic fix for poor indexing or retrieval.

Frequently Asked Questions

What is the difference between a cross-encoder and a bi-encoder?

A bi-encoder encodes query and document separately into vectors, then compares them with cosine similarity. Fast, scales to millions of documents. A cross-encoder takes query and document together and produces a joint relevance score. Slower, much more accurate for a given pair, but too expensive to run across an entire corpus. Modern pipelines use a bi-encoder first, then a cross-encoder to rerank.

When should a chatbot add a cross-encoder reranker?

When the top 1 to 3 results from your semantic search are noticeably noisy, or when you want to reduce AI hallucination by giving the model better grounding passages. Most production chatbots with more than a few hundred documents benefit from reranking. The latency cost is usually 50 to 200ms for a top-20 rerank, which is a reasonable trade-off for the quality gain.

Should I use a hosted or self-hosted reranker?

Hosted rerankers like Cohere Rerank are the fastest path to a working setup, with no infrastructure to maintain and multilingual support out of the box. Self-host an open-source model like a Sentence-Transformers cross-encoder or BGE-Reranker when you need strict data residency, very high request volumes where per-call pricing becomes expensive, or when you want to fine-tune the reranker on your own data.

