Search is where bilingual publications quietly lose readers.
The problem is rarely raw speed. Static search tools like Pagefind are already fast enough for serious editorial workloads. The real failure happens one layer up: users do not always search with the same vocabulary we used in the headline. That gets worse when the site operates in English and Arabic, where a reader may remember the concept but not the exact wording.
Why keyword-only search breaks down on bilingual sites
Keyword search is excellent when the query and the article title share the same nouns. It starts to wobble when the reader remembers intent instead of literal phrasing:
- “MCP permissions” versus “tool boundaries”
- “Arabic search” versus “multilingual retrieval”
- “local translation workflow” versus “draft localization pipeline”
On a bilingual site, you also get cross-language intent. An Arabic-speaking reader may search in English for a concept they saw on social media, while an English-speaking reader may search by product name and still expect Arabic coverage to surface when it is the strongest match.
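To make the vocabulary gap concrete, here is a minimal sketch of a literal token-overlap score. It is a toy measure written for illustration, not Pagefind's actual ranking, and the two article titles are made up. The function names `tokenize` and `keywordOverlap` are assumptions as well.

```ts
// Lowercase and split on anything that is not a Latin letter, digit,
// or a character from the basic Arabic Unicode block.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9\u0600-\u06FF]+/)
    .filter((token) => token.length > 0);
}

// Fraction of query tokens that literally appear in the title.
function keywordOverlap(query: string, title: string): number {
  const queryTokens = tokenize(query);
  if (queryTokens.length === 0) return 0;
  const titleTokens = new Set(tokenize(title));
  const hits = queryTokens.filter((token) => titleTokens.has(token)).length;
  return hits / queryTokens.length;
}

// Hypothetical titles: same topic, different vocabulary.
console.log(keywordOverlap("MCP permissions", "Setting MCP permissions per tool")); // 1
console.log(keywordOverlap("MCP permissions", "Drawing tool boundaries for agents")); // 0
```

The second title covers the same intent but shares no tokens with the query, so a purely literal matcher has nothing to rank it by.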
Layer 1: Keep Pagefind as the operational baseline
Pagefind should still be the first answer for most user input. It is static, portable, and cheap to run: the index is built at publish time and searched in the browser, so there is no search server to operate. It also keeps the failure mode civilized: even if every advanced feature is disabled, the site still ships with dependable search.
The right mindset is simple:
- Pagefind handles exact titles, strong keyword matches, and instant results.
- A semantic layer helps when phrasing drifts or when English and Arabic express the same idea differently.
- Editorial curation still wins for homepage surfacing and section pages.
That is how you keep the system explainable: every result can be traced back to a keyword match, a semantic match, or an editor's decision.
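The division of labor above could be wired up roughly as follows. This is a sketch under stated assumptions: `layeredSearch`, `SearchHit`, `SearchFn`, and the `minKeywordHits` threshold are all hypothetical names, and the two search backends are injected as plain functions so the keyword one can wrap Pagefind and the semantic one can be left out entirely.

```ts
type SearchHit = { url: string; title: string };
type SearchFn = (query: string) => Promise<SearchHit[]>;

// Keyword results always come first. Semantic hits are only appended when
// the keyword layer returns fewer than `minKeywordHits` results.
async function layeredSearch(
  query: string,
  keywordSearch: SearchFn,
  semanticSearch: SearchFn | null,
  minKeywordHits = 3,
): Promise<SearchHit[]> {
  const keywordHits = await keywordSearch(query);
  if (semanticSearch === null || keywordHits.length >= minKeywordHits) {
    return keywordHits;
  }

  let semanticHits: SearchHit[] = [];
  try {
    semanticHits = await semanticSearch(query);
  } catch {
    // The semantic layer is optional: if it fails, keep the keyword results.
    return keywordHits;
  }

  // Skip anything the keyword layer already returned.
  const seen = new Set(keywordHits.map((hit) => hit.url));
  const extras = semanticHits.filter((hit) => !seen.has(hit.url));
  return keywordHits.concat(extras);
}
```

Passing `null` for the semantic function, or having it throw, leaves the reader with exactly what the keyword layer returned, which is the civilized failure mode described above.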
Layer 2: Use embeddings to improve recall, not to invent relevance
The second layer is where multilingual embeddings earn their keep. Instead of requiring a literal token overlap, you convert both the query and each document into vectors, then rank by similarity. The point is not to sound magical. The point is to catch strong conceptual matches that keyword systems miss.
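As a rough illustration of "rank by similarity", the sketch below scores documents against a query vector with cosine similarity. It assumes the vectors have already been produced by some embedding model. The `EmbeddedDoc` type and the function names are hypothetical, and the example vectors in the usage note are two-dimensional toys rather than real embeddings.

```ts
type EmbeddedDoc = { url: string; locale: "en" | "ar"; vector: number[] };

// Cosine similarity: the dot product divided by the product of the vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every document against the query and keep the `topK` closest.
function rankBySimilarity(queryVector: number[], docs: EmbeddedDoc[], topK = 5) {
  return docs
    .map((doc) => ({
      url: doc.url,
      locale: doc.locale,
      score: cosineSimilarity(queryVector, doc.vector),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```

Because the ranking only looks at vectors, an Arabic article and an English article on the same topic can both surface for one query, provided the model places them close together.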
In practice, the best use cases are:
- returning Arabic and English articles for the same topic cluster
- rescuing searches where the headline vocabulary is too narrow
- powering “Related coverage” below the fold
- auditing content gaps between locales
The wrong use case is replacing all ranking logic with a black box. Semantic retrieval should be additive. It works best when your taxonomy, featured priorities, and section design are already disciplined.
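One of the use cases listed above, auditing content gaps between locales, lends itself to a small offline script. The sketch below is illustrative only: it assumes every document already has an L2-normalized embedding (so a dot product equals cosine similarity), and the `0.8` threshold is an arbitrary placeholder that would need tuning against real content. `findLocaleGaps` and the types are hypothetical names.

```ts
type Locale = "en" | "ar";
type LocalizedVector = { url: string; locale: Locale; vector: number[] };

// Dot product; equals cosine similarity only when both vectors are L2-normalized.
function dotProduct(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

// List documents in `from` whose closest counterpart in `to` scores below `threshold`.
function findLocaleGaps(
  docs: LocalizedVector[],
  from: Locale,
  to: Locale,
  threshold = 0.8,
): { url: string; bestScore: number }[] {
  const sources = docs.filter((doc) => doc.locale === from);
  const targets = docs.filter((doc) => doc.locale === to);
  const gaps: { url: string; bestScore: number }[] = [];

  for (const source of sources) {
    // -1 is the lowest possible cosine score; it is reported when `to` has no documents.
    let bestScore = -1;
    for (const target of targets) {
      bestScore = Math.max(bestScore, dotProduct(source.vector, target.vector));
    }
    if (bestScore < threshold) {
      gaps.push({ url: source.url, bestScore });
    }
  }
  return gaps;
}
```

The output is a review list for editors, not an automatic verdict: a low score means "no obviously equivalent article was found", and a human still decides whether that is a real gap.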
A production-friendly architecture
For editorial teams, the cleanest pattern is build-time indexing plus optional semantic enrichment:
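```ts
// Shape of one searchable article, shared by both indexes.
type SearchDocument = {
  id: string;
  locale: "en" | "ar";
  url: string;
  title: string;
  description: string;
  category: "devhub" | "security" | "reviews";
  tags: string[];
  body: string;
};

// Hypothetical helpers: neither is defined in this article. They are declared
// here only so the sketch type-checks; their signatures are assumptions.
declare function buildPagefindIndex(docs: SearchDocument[]): Promise<unknown>;
declare function buildEmbeddingIndex(
  docs: SearchDocument[],
  options: { model: string },
): Promise<unknown>;

export async function buildSearchArtifacts(docs: SearchDocument[]) {
  // Fast path: the static keyword index that every request can rely on.
  const pagefindIndex = await buildPagefindIndex(docs);

  // Optional sidecar: multilingual embeddings for recall and cross-language matches.
  const semanticIndex = await buildEmbeddingIndex(docs, {
    model: "ibm-granite/granite-embedding-278m-multilingual",
  });

  return {
    pagefindIndex,
    semanticIndex,
  };
}
```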
This keeps the fast path obvious. Pagefind handles immediate search. Semantic search becomes a sidecar, not a mandatory dependency for every request.
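The embedding step is left abstract above, so here is one way it might look at build time. This is a sketch, not the article's implementation: `embedDocuments`, `Embedder`, the batch size, and the 2,000-character body cutoff are all assumptions, and the embedding model is injected as a function so the sketch does not depend on any particular inference library.

```ts
type IndexableDoc = {
  id: string;
  locale: "en" | "ar";
  url: string;
  title: string;
  description: string;
  body: string;
};
type SemanticEntry = { id: string; locale: "en" | "ar"; url: string; vector: number[] };

// Any function that turns a batch of texts into one vector per text.
type Embedder = (texts: string[]) => Promise<number[][]>;

async function embedDocuments(
  docs: IndexableDoc[],
  embed: Embedder,
  batchSize = 16,
): Promise<SemanticEntry[]> {
  const entries: SemanticEntry[] = [];
  for (let start = 0; start < docs.length; start += batchSize) {
    const batch = docs.slice(start, start + batchSize);
    // Embed the title, the description, and the opening of the body together.
    const texts = batch.map(
      (doc) => `${doc.title}\n${doc.description}\n${doc.body.slice(0, 2000)}`,
    );
    const vectors = await embed(texts);
    if (vectors.length !== batch.length) {
      throw new Error("Embedder returned the wrong number of vectors");
    }
    batch.forEach((doc, i) => {
      entries.push({ id: doc.id, locale: doc.locale, url: doc.url, vector: vectors[i] });
    });
  }
  return entries;
}
```

The result is a plain array that can be written to a JSON file next to the Pagefind output, which keeps the semantic layer a build artifact rather than a live service.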
Arabic quality control matters more than model choice
Teams often obsess over model benchmarks and then ignore the parts that actually hurt the Arabic experience:
- inconsistent slug strategy
- weak tag hygiene
- English-only excerpts for Arabic pages
- typography that makes Arabic look visually lighter than English
If those fundamentals are broken, a stronger embedding model will not save the experience. Multilingual retrieval is not just a model problem. It is an editorial systems problem.
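Several of these fundamentals can be checked mechanically before any model is involved. The sketch below is a hedged example of such a build-time lint: the `PageMeta` shape, the `/ar/` URL prefix convention, and the issue messages are assumptions made for illustration. Typography is not covered, since it cannot be judged from metadata.

```ts
type PageMeta = {
  url: string;
  locale: "en" | "ar";
  description: string;
  tags: string[];
};

// Matches any character in the basic Arabic Unicode block (U+0600 to U+06FF).
const ARABIC_CHARACTER = /[\u0600-\u06FF]/;

// Return a list of human-readable problems for one Arabic page.
function auditArabicPage(page: PageMeta): string[] {
  const issues: string[] = [];
  if (page.locale !== "ar") return issues;

  // Assumed slug convention: every Arabic URL lives under /ar/.
  if (!page.url.startsWith("/ar/")) {
    issues.push("URL does not follow the /ar/ prefix convention");
  }
  // An excerpt with no Arabic characters is almost certainly an English leftover.
  if (!ARABIC_CHARACTER.test(page.description)) {
    issues.push("excerpt contains no Arabic text");
  }
  if (page.tags.length === 0) {
    issues.push("page has no tags");
  }
  // Tags that differ only by case or surrounding whitespace count as duplicates.
  const normalized = page.tags.map((tag) => tag.trim().toLowerCase());
  if (new Set(normalized).size !== normalized.length) {
    issues.push("duplicate tags after normalization");
  }
  return issues;
}
```

Running a check like this in the build turns "weak tag hygiene" from a vague complaint into a list of pages an editor can fix.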
Recommended rollout for 2026
If I were building a bilingual publication from scratch, I would ship search in four steps:
1. Pagefind only, with clean titles, excerpts, and tags.
2. Taxonomy-aware related content for every article.
3. Multilingual embeddings for recall and cross-language discovery.
4. Editorial analytics that show what readers searched for but never found.
That order matters. It protects performance, keeps the system explainable, and prevents the team from skipping the editorial discipline that semantic tooling depends on.
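The last step, analytics on searches that found nothing, needs very little machinery. A minimal sketch, assuming search events are already being logged somewhere with a query, a locale, and a result count (the `SearchEvent` shape and `topMissedQueries` are hypothetical names):

```ts
type SearchEvent = { query: string; locale: "en" | "ar"; resultCount: number };
type MissedQuery = { query: string; locale: "en" | "ar"; misses: number };

// Count zero-result searches per locale and normalized query, most frequent first.
function topMissedQueries(events: SearchEvent[], limit = 10): MissedQuery[] {
  const counts = new Map<string, MissedQuery>();
  for (const event of events) {
    if (event.resultCount > 0) continue;
    const normalized = event.query.trim().toLowerCase();
    if (normalized.length === 0) continue;

    const key = `${event.locale}:${normalized}`;
    const existing = counts.get(key);
    if (existing) {
      existing.misses += 1;
    } else {
      counts.set(key, { query: normalized, locale: event.locale, misses: 1 });
    }
  }
  return Array.from(counts.values())
    .sort((a, b) => b.misses - a.misses)
    .slice(0, limit);
}
```

Keeping the locale in the grouping key matters on a bilingual site: the same English query failing for Arabic-locale readers is a different editorial signal than it failing for English-locale readers.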
The best bilingual search stack in 2026 is not the most complicated one. It is the one that stays fast under pressure, stays legible to editors, and respects how readers actually move between English and Arabic when they are looking for a technical answer.