MediaWiki Code2Code Search (code2codesearch) is a semantic code search engine designed for the MediaWiki ecosystem.

Unlike traditional string-based search, it utilises neural retrieval to find code snippets based on semantics, capturing the intent of functions and types across more than 2,500 code repositories.

It is optimised to run on Toolforge within strict memory constraints, and it returns results within seconds even on a commodity laptop.

Demo: searching for code similar to a Python implementation of the greatest common divisor (v1.0.0).

Try it out: toolforge:code2codesearch

Implementation

Flow diagram of the code-to-code search scheme for MediaWiki: in the Transformer-based processing scheme, offline indexing (steps 1'-4') meets online querying (steps 1-5).

MediaWiki Code2Code Search uses a neural retrieval pipeline designed for asymmetric hardware, with GPU-based indexing offline and CPU-only serving online:

  • Neural Model: Uses the Qwen3-Embedding (0.6B) model to generate semantic vectors.
  • Vector Engine: Uses a FAISS IndexIVFPQ index (an inverted file with product quantisation) for memory-efficient similarity search.
  • Storage: Metadata are managed via an indexed SQLite database instead of RAM-heavy JSON structures to maintain a low memory footprint (optimised for the 6 GiB Toolforge limit).
  • Backend: Powered by FastAPI and uvicorn.
  • Supported languages: PHP, Python, JavaScript, TypeScript, Go, Java, C, C++, Lua, Rust, Ruby, and Perl (12 languages, parsed via Tree-sitter).
  • Index size: The compressed FAISS IVF-PQ index occupies ~169 MB (96.6% smaller than a flat float32 baseline of ~4.9 GB), staying within Toolforge's 6 GiB RAM limit.
  • Frontend: A clean and accessible interface built with Wikimedia's Codex Design System and vanilla JavaScript. The interface is available in 17 languages and offers multi-select filtering by repository group, programming language, and entry type.
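The index-size figures in the list above can be checked with a few lines of arithmetic. The sketch below is only a sanity check, not part of the tool: the snippet count is the one quoted on this page, while the embedding width of 1,024 float32 values, the reading of "MB" and "GB" as binary units, and the helper names are assumptions made for illustration.

```python
# Sanity check of the index-size figures quoted above.
# Assumptions (not confirmed by the page): 1,024-dimensional float32
# embeddings, and "MB"/"GB" read as binary units (MiB/GiB).

N_SNIPPETS = 1_289_000   # snippet count quoted on this page
EMBEDDING_DIM = 1024     # assumed embedding width
BYTES_PER_FLOAT32 = 4
COMPRESSED_MIB = 169     # compressed index size quoted above


def flat_index_bytes(n_vectors: int, dim: int) -> int:
    """Size of an uncompressed float32 matrix with one row per snippet."""
    return n_vectors * dim * BYTES_PER_FLOAT32


def reduction_percent(compressed: float, baseline: float) -> float:
    """How much smaller `compressed` is than `baseline`, in percent."""
    return 100.0 * (1.0 - compressed / baseline)


flat_bytes = flat_index_bytes(N_SNIPPETS, EMBEDDING_DIM)
flat_gib = flat_bytes / 2**30
flat_mib = flat_bytes / 2**20
reduction = reduction_percent(COMPRESSED_MIB, flat_mib)

print(f"flat float32 baseline: {flat_gib:.1f} GiB")
print(f"size reduction:        {reduction:.1f} %")
```

Under these assumptions the sketch reproduces both the ~4.9 GB baseline and the 96.6% reduction, which suggests the quoted sizes are binary units.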
Diagram of code snippets embedded into a vector space: snippets are mapped into a shared semantic vector space, so that functionally similar code lies close together regardless of programming language (v1.0.0).
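The idea that functionally similar code lies close together in the vector space can be illustrated with a toy nearest-neighbour search. This is a minimal sketch under stated assumptions: the three-dimensional vectors and file names below are made up for illustration, whereas the real tool obtains its vectors from the embedding model and searches them with FAISS rather than with a Python loop.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, corpus, k=2):
    """Return the names of the k corpus entries most similar to the query."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in corpus.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Hypothetical toy embeddings; real embeddings have far more dimensions.
TOY_CORPUS = {
    "gcd.php": [0.9, 0.1, 0.0],
    "gcd.lua": [0.8, 0.2, 0.1],
    "parse_title.js": [0.1, 0.9, 0.2],
    "render_skin.php": [0.0, 0.2, 0.9],
}

# Stand-in for the embedding of a Python gcd implementation.
query_vector = [1.0, 0.1, 0.0]

print(nearest(query_vector, TOY_CORPUS, k=2))  # ['gcd.php', 'gcd.lua']
```

The two gcd implementations rank first although they are written in different languages, because the ranking depends only on the direction of the vectors and not on any shared tokens.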

Included repositories

The engine indexes the entire MediaWiki development ecosystem, covering more than 1,289,000 code snippets originating from over 2,500 unique repositories:

  • MediaWiki core.
  • All MediaWiki extensions and skins hosted on Gerrit.
  • Software services by WMF SRE, Performance, and Analytics.
  • Shared libraries and WMF-specific services.

Heavy code vectorisation is performed offline on a GPU-equipped machine, while the resulting index and metadata are served online in the constrained CPU-only environment on Toolforge.
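On the online side, the vector index only returns numeric ids, which then have to be resolved to snippet metadata without loading everything into RAM. The following is a minimal sketch of how such a lookup could work with an indexed SQLite table, including a multi-select language filter. The schema, column names, sample rows, and function names are hypothetical and are not taken from the tool's repository; an in-memory database is used only to keep the example self-contained.

```python
import sqlite3


def build_metadata_db(rows):
    """Create a small metadata table keyed by the vector id (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")  # a real deployment would open a file on disk
    conn.execute(
        """CREATE TABLE snippets (
               vector_id  INTEGER PRIMARY KEY,
               repo       TEXT NOT NULL,
               language   TEXT NOT NULL,
               entry_type TEXT NOT NULL,
               path       TEXT NOT NULL
           )"""
    )
    # Secondary index so that language filters do not scan the whole table.
    conn.execute("CREATE INDEX idx_snippets_language ON snippets (language)")
    conn.executemany("INSERT INTO snippets VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def resolve_hits(conn, vector_ids, languages=None):
    """Map ranked vector ids to metadata rows, keeping the ranking order."""
    if not vector_ids:
        return []
    id_marks = ",".join("?" * len(vector_ids))
    sql = (
        "SELECT vector_id, repo, language, entry_type, path "
        f"FROM snippets WHERE vector_id IN ({id_marks})"
    )
    params = list(vector_ids)
    if languages:  # optional multi-select filter
        lang_marks = ",".join("?" * len(languages))
        sql += f" AND language IN ({lang_marks})"
        params += list(languages)
    by_id = {row[0]: row for row in conn.execute(sql, params)}
    return [by_id[i] for i in vector_ids if i in by_id]


# Hypothetical sample rows for illustration only.
SAMPLE_ROWS = [
    (0, "mediawiki/core", "php", "function", "includes/a.php"),
    (1, "mediawiki/extensions/Foo", "javascript", "function", "resources/b.js"),
    (2, "operations/software", "python", "class", "tools/c.py"),
]

conn = build_metadata_db(SAMPLE_ROWS)
hits = resolve_hits(conn, [2, 0, 1], languages=["php", "python"])
print([hit[4] for hit in hits])  # ['tools/c.py', 'includes/a.php']
```

Only the rows for the requested ids are read, so memory use stays proportional to the number of results rather than to the size of the corpus.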

Graphics

Frontend

The frontend is fully internationalised and renders consistently across scripts and languages.

Administration

The tool is hosted on Wikimedia Toolforge. Detailed development guidelines and architectural constraints are available in the repository documentation.

Source

The source code of Code2Code Search is hosted on GitHub and is available under the terms of the Apache License 2.0.

Please report bugs and feature requests in the GitHub issue tracker.

See also

Assets

  • MediaWiki Codesearch, the string-based code search tool based on classical pattern matching
  • Software Heritage, the universal source-code archive used to reference and cite persistent code snippets via SWHIDs
  • Wikimedia Global Search, used to find on-wiki uses of JavaScript and CSS code
  • Huma, a type-aware static-analysis index for PHP code

Vector-search engines for Wikimedia

  • Wikidata: Wikidata Embedding Project, a semantic graph search tool over Wikidata by Wikimedia Deutschland (WMDE) in collaboration with Jina AI and IBM DataStax
  • Wikimedia Commons: WISE Search Engine, a multimodal AI search tool for Wikimedia Commons created by the Visual Geometry Group (VGG) of the University of Oxford
  • Wikipedia: Semantic search over Wikipedia pages, by the WMF Product and Technology group