MediaWiki Code2Code Search (code2codesearch) is a semantic code search engine designed for the MediaWiki ecosystem.
Unlike traditional string-based search, it utilises neural retrieval to find code snippets based on semantics, capturing the intent of functions and types across more than 2,500 code repositories.
It is specifically optimised for Toolforge, staying within strict memory constraints, and also returns results on the order of seconds on a commodity laptop.
Try it out: toolforge:code2codesearch
Implementation

MediaWiki Code2Code Search uses a neural retrieval pipeline optimised for asymmetric hardware (a GPU-equipped machine for offline indexing, a CPU-only host for serving):
- Neural Model: Uses the Qwen3-Embedding (0.6B) model to generate semantic vectors.
- Vector Engine: Uses FAISS with IndexIVFPQ quantisation for memory-efficient similarity search.
- Storage: Metadata are managed via an indexed SQLite database instead of RAM-heavy JSON structures to maintain a low memory footprint (optimised for the 6 GiB Toolforge limit).
- Backend: Powered by FastAPI and uvicorn.
- Supported languages: PHP, Python, JavaScript, TypeScript, Go, Java, C, C++, Lua, Rust, Ruby, and Perl (12 languages, parsed via Tree-sitter).
- Index size: The compressed FAISS IVF-PQ index occupies ~169 MB (96.6% smaller than a flat float32 baseline of ~4.9 GB), staying within Toolforge's 6 GiB RAM limit.
- Frontend: A clean and accessible interface built with Wikimedia's Codex Design System and vanilla JavaScript. The interface is available in 17 languages and offers multi-select filtering by repository group, programming language, and entry type.
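
The storage choice in the list above can be illustrated with a minimal sketch. It assumes a hypothetical `snippets` table keyed by the integer ID that a vector search returns; the table layout, column names, sample rows, and the helper functions `build_metadata_db` and `fetch_metadata` are illustrative assumptions, not the tool's actual schema or code. It shows how an indexed SQLite lookup can turn a ranked list of IDs into metadata rows, with an optional programming-language filter, without holding all metadata in RAM.

```python
import sqlite3


def build_metadata_db(path, rows):
    """Create a hypothetical metadata table keyed by vector ID and fill it."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS snippets ("
        " vector_id INTEGER PRIMARY KEY,"  # same integer ID the vector index returns
        " repo TEXT NOT NULL,"
        " language TEXT NOT NULL,"
        " entry_type TEXT NOT NULL,"
        " name TEXT NOT NULL)"
    )
    # A secondary index keeps language filtering cheap on large tables.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_snippets_language ON snippets(language)"
    )
    conn.executemany("INSERT OR REPLACE INTO snippets VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def fetch_metadata(conn, ranked_ids, languages=None):
    """Resolve ranked vector IDs to metadata rows, preserving the rank order."""
    if not ranked_ids:
        return []
    sql = (
        "SELECT vector_id, repo, language, entry_type, name FROM snippets "
        "WHERE vector_id IN (" + ",".join("?" * len(ranked_ids)) + ")"
    )
    params = list(ranked_ids)
    if languages:
        sql += " AND language IN (" + ",".join("?" * len(languages)) + ")"
        params += list(languages)
    by_id = {row[0]: row for row in conn.execute(sql, params)}
    # SQL does not guarantee result order, so re-apply the ranking here.
    return [by_id[i] for i in ranked_ids if i in by_id]


if __name__ == "__main__":
    # Illustrative rows only; repository and function names are made up.
    conn = build_metadata_db(":memory:", [
        (0, "example/extension-a", "PHP", "function", "renderThing"),
        (1, "example/tool-b", "Lua", "function", "format_thing"),
        (2, "example/extension-a", "PHP", "class", "ThingRenderer"),
    ])
    for row in fetch_metadata(conn, [2, 0, 1], languages=["PHP"]):
        print(row)
```

Only the rows for the current result page are read from disk, so memory use depends on the number of hits returned rather than on the size of the corpus.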

Included repositories
The engine indexes the entire MediaWiki development ecosystem, covering more than 1,289,000 code snippets originating from over 2,500 unique repositories:
- MediaWiki core.
- All MediaWiki extensions and skins hosted on Gerrit.
- Software services by WMF SRE, Performance, and Analytics.
- Shared libraries and WMF-specific services.
Heavy code vectorisation is performed offline on a GPU-equipped machine, while the resulting index and metadata are served online in the constrained CPU-only environment on Toolforge.
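
The memory figures quoted in the implementation notes can be sanity-checked with a little arithmetic, which also shows why a compressed index matters for the CPU-only serving side. The sketch below assumes 1024-dimensional float32 embeddings (the vector dimension is not stated in this page, so treat it as an assumption) and reads "~169 MB" as mebibytes; the snippet count is the lower bound given in the text. The function names are hypothetical.

```python
def flat_index_bytes(num_vectors, dim, bytes_per_value=4):
    """Size of an uncompressed index storing every vector as float32."""
    return num_vectors * dim * bytes_per_value


def size_reduction(compressed_bytes, baseline_bytes):
    """Fraction of the baseline size saved by compression."""
    return 1.0 - compressed_bytes / baseline_bytes


NUM_SNIPPETS = 1_289_000           # from the text ("more than 1,289,000")
ASSUMED_DIM = 1024                 # assumption: embedding dimension
COMPRESSED_BYTES = 169 * 1024**2   # "~169 MB", read here as MiB (assumption)

baseline = flat_index_bytes(NUM_SNIPPETS, ASSUMED_DIM)

if __name__ == "__main__":
    print(f"flat float32 baseline: {baseline / 1024**3:.2f} GiB")
    print(f"size reduction:        {size_reduction(COMPRESSED_BYTES, baseline):.1%}")
    print(f"bytes per snippet:     {COMPRESSED_BYTES / NUM_SNIPPETS:.0f}")
```

Under these assumptions the flat baseline comes to roughly 4.9 GiB and the reduction to roughly 96.6%, in line with the figures given for the index size.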
Graphics
Frontend
The frontend is fully internationalised and renders consistently across scripts and languages.
- English
- French
- Italian
- Telugu
Presentations
- Wikimedia Hackathon 2026
Administration
The tool is hosted on Wikimedia Toolforge. Detailed development guidelines and architectural constraints are available in the repository documentation.
Source
The source code of Code2Code Search is hosted on GitHub and is available under the terms of the Apache License 2.0.
Please file bug reports and feature requests in the GitHub issue tracker.
See also
Assets
Code-related services
- MediaWiki Codesearch, the string-based code search tool based on classical pattern matching
- Software Heritage, the universal source-code archive used to reference and cite persistent code snippets via SWHIDs
- Wikimedia Global Search, used to find on-wiki uses of JavaScript and CSS code
- Huma, a type-aware static-analysis index for PHP code
Vector-search engines for Wikimedia
- Wikidata: Wikidata Embedding Project, a semantic graph search tool over Wikidata by Wikimedia Deutschland (WMDE) in collaboration with Jina.AI and IBM DataStax
- Wikimedia Commons: WISE Search Engine, a multimodal AI search tool for Wikimedia Commons created by the Visual Geometry Group (VGG) of the University of Oxford
- Wikipedia: Semantic search over Wikipedia pages by the WMF Product and Technology group