mechanistic-interpretability

Here are 501 public repositories matching this topic...

stanfordnlp / pyvene

Stanford NLP Python library for understanding and improving PyTorch models via interventions

intervention interpretability mechanistic-interpretability activation-intervention activation-patching

Updated Mar 6, 2026
Python

ruizheliUOA / Awesome-Interpretability-in-Large-Language-Models

This repository collects all relevant resources about interpretability in LLMs

dictionary-learning sparse-autoencoder interpretability-and-explainability mechanistic-interpretability

Updated Nov 1, 2024

OpenMOSS / Llamascopium

Performant framework for training, analyzing and visualizing Sparse Autoencoders (SAEs) and their frontier variants.

sparse-autoencoders interpretability sparse-dictionary mechanistic-interpretability

Updated Jun 16, 2026
Python

zengxiao-he / tessera

From teacher to tiles — a from-scratch LLM distillation & serving engine: custom Triton/CUDA kernels, FSDP distillation, paged-KV continuous batching, speculative decoding, a Rust gateway, a JAX oracle, and interpretability tooling.

rust cuda pytorch triton quantization knowledge-distillation inference-engine jax kv-cache ml-systems llm mechanistic-interpretability fsdp flash-attention speculative-decoding paged-attention

Updated Jun 5, 2026
Python

itsqyh / Awesome-LMMs-Mechanistic-Interpretability

A curated collection of resources focused on the Mechanistic Interpretability (MI) of Large Multimodal Models (LMMs). This repository aggregates surveys, blog posts, and research papers that explore how LMMs represent, transform, and align multimodal information internally.

generative-model generative paperlist vision-models large-language-models mechanistic-interpretability large-vision-language-models large-multimodal-models vision-foundation-model

Updated Mar 4, 2026

stanfordnlp / axbench

Stanford NLP Python library for benchmarking the utility of LLM interpretability methods

intervention interpretability large-language-models mechanistic-interpretability llm-steering

Updated Mar 12, 2026
Python

steering-vectors / steering-vectors

Steering vectors for transformer language models in PyTorch / Hugging Face

nlp ai pytorch gpt huggingface mechanistic-interpretability representation-engineering

Updated Feb 21, 2025
Python

MadryLab / modelcomponents

Decomposing and Editing Predictions by Modeling Model Computation

attribution pytorch interpretability model-editing mechanistic-interpretability

Updated Jun 12, 2024
Jupyter Notebook

AndrewAltimit / template-repo

Agent orchestration & security template featuring MCP tool building, agent2agent workflows, mechanistic interpretability on sleeper agents, and agent integration via CLI wrappers

docker mcp ai-safety text2speech ai-agents text2image github-actions ai-policy agent-framework text2video ai-governance mechanistic-interpretability gemini-cli model-context-protocol agent-orchestration agent-security claude-code codex-cli sleeper-agents

Updated Jun 5, 2026
Rust

ndif-team / nnterp

Unified access to Large Language Model modules using NNsight

mechanistic-interpretability nnsight patchscopes

Updated May 6, 2026
Python

AI-in-Transportation-Lab / awesome-mechanistic-interpretability

A carefully curated collection of high-quality libraries, projects, tutorials, research papers, and other essential resources focused on Mechanistic Interpretability, a growing subfield in machine learning interpretability research that aims to reverse-engineer neural networks into understandable computational components.

llms mechanistic-interpretability

Updated Jun 18, 2026
JavaScript

pauljblazek / deepdistilling

Mechanistically interpretable neurosymbolic AI (Nature Comput Sci 2024): losslessly compressing NNs to computer code and discovering new algorithms which generalize out-of-distribution and outperform human-designed algorithms

program-synthesis knowledge-distillation inductive-logic-programming domain-adaptation explainable-ai interpretable distilling neurosymbolic model-distillation out-of-distribution-generalization mechanistic-interpretability

Updated Feb 20, 2024
Python

Alsace08 / Chain-of-Embedding

[ICLR 2025] Code and Data Repo for Paper "Latent Space Chain-of-Embedding Enables Output-free LLM Self-Evaluation"

interpretability trustworthy-ai large-language-models mechanistic-interpretability self-evaluation hallucination-detection iclr-2025

Updated Dec 19, 2024
Python

LLM-Interp / CLT-Forge

A Mechanistic Interpretability Toolkit for Cross-Layer Transcoder Training and Attribution-Graph Visualization

transcoder visual-interface mechanistic-interpretability ai-interpretability attribution-graphs auto-interpretability cross-layer-transcoder transformer-circuits

Updated Apr 16, 2026
Python

jbloomAus / DecisionTransformerInterpretability

Interpreting how transformers simulate agents performing RL tasks

reinforcement-learning mechanistic-interpretability

Updated Oct 23, 2023
Jupyter Notebook

epfl-dlab / llm-latent-language

Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers".

multilingual-nlp llm mechanistic-interpretability llama2

Updated Mar 11, 2024
Jupyter Notebook

apartresearch / interpretability-starter

🧠 Starter templates for doing interpretability research

interpretability interpretability-jam alignment-jam mechanistic-interpretability

Updated Jul 16, 2023

neelsomani / symbolic-circuit-distillation

Automatically extract executable programs from pruned mechanistic circuits, extending OpenAI's Sparse Circuits

machine-learning formal-verification mechanistic-interpretability

Updated Nov 23, 2025
Python

wesg52 / sparse-probing-paper

Full code for the sparse probing paper.

ai-safety interpretability ai-alignment mechanistic-interpretability

Updated Dec 17, 2023
Jupyter Notebook

microsoft / automated-brain-explanations

Generating and validating natural-language explanations for the brain.

data-science machine-learning natural-language-processing neuroscience artificial-intelligence fmri gpt explanation language-model interpretability xai fmri-data-analysis huggingface gpt4 large-language-models ai-for-science mechanistic-interpretability automated-interpretability interpretable-embeddings

Updated Jun 1, 2026
Jupyter Notebook
