mechanistic-interpretability
Here are 501 public repositories matching this topic...
Stanford NLP Python library for understanding and improving PyTorch models via interventions
Updated Mar 6, 2026 - Python
This repository collects all relevant resources about interpretability in LLMs
Updated Nov 1, 2024
Performant framework for training, analyzing and visualizing Sparse Autoencoders (SAEs) and their frontier variants.
Updated Jun 16, 2026 - Python
From teacher to tiles — a from-scratch LLM distillation & serving engine: custom Triton/CUDA kernels, FSDP distillation, paged-KV continuous batching, speculative decoding, a Rust gateway, a JAX oracle, and interpretability tooling.
Updated Jun 5, 2026 - Python
A curated collection of resources focused on the Mechanistic Interpretability (MI) of Large Multimodal Models (LMMs). This repository aggregates surveys, blog posts, and research papers that explore how LMMs represent, transform, and align multimodal information internally.
Updated Mar 4, 2026
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
Updated Mar 12, 2026 - Python
Steering vectors for transformer language models in PyTorch / Hugging Face
Updated Feb 21, 2025 - Python
Decomposing and Editing Predictions by Modeling Model Computation
Updated Jun 12, 2024 - Jupyter Notebook
Agent orchestration & security template featuring MCP tool building, agent2agent workflows, mechanistic interpretability on sleeper agents, and agent integration via CLI wrappers
Updated Jun 5, 2026 - Rust
Unified access to Large Language Model modules using NNsight
Updated May 6, 2026 - Python
A carefully curated collection of high-quality libraries, projects, tutorials, research papers, and other essential resources focused on Mechanistic Interpretability, a growing subfield in machine learning interpretability research that aims to reverse-engineer neural networks into understandable computational components.
Updated Jun 18, 2026 - JavaScript
Mechanistically interpretable neurosymbolic AI (Nature Comput Sci 2024): losslessly compressing NNs to computer code and discovering new algorithms which generalize out-of-distribution and outperform human-designed algorithms
Updated Feb 20, 2024 - Python
[ICLR 2025] Code and Data Repo for Paper "Latent Space Chain-of-Embedding Enables Output-free LLM Self-Evaluation"
Updated Dec 19, 2024 - Python
A Mechanistic Interpretability Toolkit for Cross-Layer Transcoder Training and Attribution-Graph Visualization
Updated Apr 16, 2026 - Python
Interpreting how transformers simulate agents performing RL tasks
Updated Oct 23, 2023 - Jupyter Notebook
Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers".
Updated Mar 11, 2024 - Jupyter Notebook
🧠 Starter templates for doing interpretability research
Updated Jul 16, 2023
Automatically extract executable programs from pruned mechanistic circuits, extending OpenAI's Sparse Circuits
Updated Nov 23, 2025 - Python
Full code for the sparse probing paper.
Updated Dec 17, 2023 - Jupyter Notebook
Generating and validating natural-language explanations for the brain.
Updated Jun 1, 2026 - Jupyter Notebook