CS Knowledge Base

MediaWiki Code2Code Search (code2codesearch) is a semantic code search engine designed for the MediaWiki ecosystem.

Unlike traditional string-based search, it utilises neural retrieval to find code snippets based on semantics, capturing the intent of functions and types across more than 2,500 code repositories.

It is optimised to run on Toolforge within strict memory constraints, and it responds within seconds even on a commodity laptop.
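The idea of ranking by meaning rather than by matching strings can be illustrated with a toy example. This is only a sketch: the three-dimensional vectors and the snippet names below are hypothetical stand-ins for real model embeddings, and cosine similarity is assumed here as the similarity measure; the tool's actual metric is not stated on this page.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings"; real model vectors are far wider.
corpus = {
    "gcd_recursive": [0.9, 0.1, 0.0],
    "gcd_iterative": [0.8, 0.2, 0.1],
    "parse_wikitext": [0.0, 0.2, 0.9],
}

def rank(query, corpus):
    """Return snippet names ordered from most to least similar to the query."""
    return sorted(corpus, key=lambda name: cosine(query, corpus[name]), reverse=True)

# Hypothetical embedding of a query snippet that computes a GCD.
query = [1.0, 0.1, 0.0]
ranking = rank(query, corpus)
print(ranking)
```

Both GCD variants rank above the unrelated parser even though they need not share any identifiers with the query, which is the property a string-based search cannot offer.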

Demo: searching for code similar to a Python implementation of the greatest common divisor (v1.0.0).

Try it out: toolforge:code2codesearch

Implementation

MediaWiki Code2Code Search uses a neural retrieval pipeline designed for asymmetric hardware (GPU-based indexing offline, CPU-only serving online):

Neural Model: Uses the Qwen3-Embedding (0.6B) model to generate semantic vectors.
Vector Engine: Uses FAISS with IndexIVFPQ quantisation for memory-efficient similarity search.
Storage: Metadata are managed via an indexed SQLite database instead of RAM-heavy JSON structures to maintain a low memory footprint (optimised for the 6 GiB Toolforge limit).
Backend: Powered by FastAPI and uvicorn.
Supported languages: PHP, Python, JavaScript, TypeScript, Go, Java, C, C++, Lua, Rust, Ruby, and Perl (12 languages, parsed via Tree-sitter).
Index size: The compressed FAISS IVF-PQ index occupies ~169 MB (96.6% smaller than a flat float32 baseline of ~4.9 GB), staying within Toolforge's 6 GiB RAM limit.
Frontend: A clean and accessible interface built with Wikimedia's Codex Design System and vanilla JavaScript. The interface is available in 17 languages and features multi-select filtering by repository group, programming language, and entry type.
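The index size figures above can be checked with a few lines of arithmetic. The sketch below assumes 1024-dimensional float32 embeddings (an assumption; the vector width is not stated on this page), takes the snippet count from the "Included repositories" section, and reads "~169 MB" as 169 MiB.

```python
NUM_VECTORS = 1_289_000          # approximate snippet count given on this page
DIMENSIONS = 1024                # assumption: embedding width, not stated here
BYTES_PER_FLOAT32 = 4
COMPRESSED_BYTES = 169 * 2**20   # assumption: "~169 MB" interpreted as MiB

# A flat index stores every vector component as an uncompressed float32.
flat_bytes = NUM_VECTORS * DIMENSIONS * BYTES_PER_FLOAT32
flat_gib = flat_bytes / 2**30

# Relative saving of the compressed index over the flat baseline.
reduction_pct = 100 * (1 - COMPRESSED_BYTES / flat_bytes)

# Average storage per snippet in the compressed index.
bytes_per_vector = COMPRESSED_BYTES / NUM_VECTORS

print(f"flat baseline: {flat_gib:.1f} GiB")
print(f"reduction: {reduction_pct:.1f} %")
print(f"compressed: {bytes_per_vector:.0f} bytes per snippet")
```

Under these assumptions the flat baseline comes to about 4.9 GiB and the reduction to about 96.6%, consistent with the figures quoted above, at roughly 137 bytes per snippet instead of 4,096.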

Included repositories

The engine indexes the entire MediaWiki development ecosystem, covering more than 1,289,000 code snippets originating from over 2,500 unique repositories:

MediaWiki core.
All MediaWiki extensions and skins hosted on Gerrit.
Software services by WMF SRE, Performance, and Analytics.
Shared libraries and WMF-specific services.

Heavy code vectorisation is performed offline on a GPU-equipped machine, while the resulting index and metadata are served online in the constrained CPU-only environment on Toolforge.
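On the serving side, the SQLite-backed metadata lookup mentioned under "Implementation" might look roughly like the following. This is a minimal sketch, not the tool's actual schema: the table name, column names, sample rows, and the `fetch_metadata` helper are all hypothetical, and an in-memory database stands in for the on-disk file.

```python
import sqlite3

# In-memory database for illustration; a real deployment would open a file.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE snippets (
        vector_id INTEGER PRIMARY KEY,  -- position of the vector in the index
        repo      TEXT NOT NULL,
        language  TEXT NOT NULL,
        path      TEXT NOT NULL,
        name      TEXT NOT NULL
    )
    """
)
# Secondary index so that language filters do not scan the whole table.
conn.execute("CREATE INDEX idx_snippets_language ON snippets (language)")

# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO snippets VALUES (?, ?, ?, ?, ?)",
    [
        (0, "example/repo-a", "php", "src/Math.php", "gcd"),
        (1, "example/repo-b", "python", "utils/math.py", "gcd"),
        (2, "example/repo-c", "lua", "lib/math.lua", "gcd"),
    ],
)

def fetch_metadata(conn, vector_ids, language=None):
    """Return metadata rows for the given ids, keeping the ranking order."""
    if not vector_ids:
        return []
    placeholders = ",".join("?" * len(vector_ids))
    sql = (
        "SELECT vector_id, repo, language, path, name "
        f"FROM snippets WHERE vector_id IN ({placeholders})"
    )
    params = list(vector_ids)
    if language is not None:
        sql += " AND language = ?"
        params.append(language)
    by_id = {row[0]: row for row in conn.execute(sql, params)}
    # SQL does not guarantee result order, so restore the order of the ids.
    return [by_id[i] for i in vector_ids if i in by_id]

# Ids as they might come back from the vector search, best match first.
results = fetch_metadata(conn, [2, 0, 1])
python_only = fetch_metadata(conn, [2, 0, 1], language="python")
print(results)
print(python_only)
```

Only the rows for the returned ids are read from disk per query, so the full metadata set never has to be held in memory, which is the point of preferring SQLite over a JSON structure loaded at start-up.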

Graphics

Frontend

The frontend is fully internationalised and renders consistently across scripts and languages.

English
French
Italian
Telugu

Presentations

Wikimedia Hackathon 2026

Administration

The tool is hosted on Wikimedia Toolforge. Detailed development guidelines and architectural constraints are available in the repository documentation.

Source

The source code of Code2Code Search is hosted on GitHub and is available under the terms of the Apache License 2.0.

Please file bugs and feature requests in the GitHub issue tracker.

See also

Assets

Code-related services

Vector-search engines for Wikimedia