A curated list of explainability-related papers, articles, and resources focused on Large Language Models (LLMs). This repository aims to provide researchers, practitioners, and enthusiasts with insights into the explainability implications, challenges, and advancements surrounding these powerful models.
We've curated a collection of the latest 📈, most comprehensive 📚, and most valuable 💡 resources on large language model explainability (LLM Explainability). Beyond papers, we also include relevant talks, tutorials, conferences, news, and articles. Our repository is constantly updated to ensure you have the most current information at your fingertips.
- LLM Explainability or Controllability Improvements with Tensor Networks, ChemicalQDevice, March 28.
- AI Explained: Inference, Guardrails, and Observability for LLMs
- LLM Explainability, Mitigating Hallucinations & Ensuring Ethical Practices, April 2nd, 5:30 - 9pm CEST, Berlin.
Date | Institute | Publication | Paper Title | GitHub |
---|---|---|---|---|
2024 | New Jersey Institute of Technology | ACM TIST | Explainability for Large Language Models: A Survey | GitHub |
2024 | Imperial College | Arxiv | From Understanding to Utilization: A Survey on Explainability for Large Language Models | N/A |
2024 | Hong Kong University of Science and Technology | Arxiv | Explainable Artificial Intelligence for Scientific Discovery | N/A |
2024 | UMaT | Arxiv | Explainable Artificial Intelligence (XAI): from Inherent Explainability to Large Language Models | N/A |
2024 | Nanyang Technological University | Arxiv | XAI meets LLMs: A Survey of the Relation between Explainable AI and Large Language Models | N/A |
2024 | University of Maryland | Arxiv | Large Language Models and Causal Inference in Collaboration: A Comprehensive Survey | N/A |
Date | Institute | Publication | Paper Title | Code |
---|---|---|---|---|
2023 | Tsinghua University | Arxiv | Scaling LLM-as-Critic for Effective and Explainable Evaluation of Large Language Model Generation | GitHub |
2023 | UC Berkeley | NIPS23 | Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena | GitHub |
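The table above lists work on LLM-as-a-judge evaluation (e.g., MT-Bench and Chatbot Arena). As a rough illustration of the pattern, here is a minimal sketch of prompting a judge model to grade an answer. The judge prompt, model name, and 1-10 scale are illustrative assumptions, not the MT-Bench implementation, and the snippet assumes the `openai` package plus an `OPENAI_API_KEY` environment variable.

```python
# Minimal LLM-as-a-judge sketch (illustrative only; not the MT-Bench code).
# Assumes the `openai` package and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

def judge(question: str, answer: str, model: str = "gpt-4o-mini") -> str:
    """Ask a judge model to grade an answer on a 1-10 scale with a short rationale."""
    prompt = (
        "You are an impartial judge. Rate the assistant's answer to the user's "
        "question on a scale of 1-10 and briefly explain your rating.\n\n"
        f"[Question]\n{question}\n\n[Answer]\n{answer}\n\nRating and rationale:"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(judge("What causes tides?", "Tides are caused mainly by the Moon's gravity."))
```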
Date | Institute | Publication | Paper Title |
---|---|---|---|
2023 | MIT/Harvard | Arxiv | Finding Neurons in a Haystack: Case Studies with Sparse Probing |
2023 | UoTexas/DeepMind | Arxiv | Copy Suppression: Comprehensively Understanding an Attention Head |
2023 | UCL | Arxiv | Towards Automated Circuit Discovery for Mechanistic Interpretability |
2023 | OpenAI | OpenAI Publication | Language models can explain neurons in language models |
2023 | MIT | NIPS23 | Toward a Mechanistic Understanding of Stepwise Inference in Transformers: A Synthetic Graph Navigation Model |
2023 | Cambridge | Arxiv | Successor Heads: Recurring, Interpretable Attention Heads In The Wild |
2023 | Meta | Arxiv | Neurons in Large Language Models: Dead, N-gram, Positional |
2023 | Redwood/UC Berkeley | Arxiv | Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small |
2023 | Microsoft | Arxiv | Explaining black box text modules in natural language with language models |
2023 | ApartR/Oxford | ICLR23 | N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models |
2023 | --- | Blog | Interpreting GPT: the Logit Lens |
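For the logit-lens entry above, here is a minimal sketch of the idea: project the residual stream at each layer through the model's final layer norm and unembedding matrix to see how the next-token prediction evolves with depth. It assumes the `transformers` and `torch` packages and uses GPT-2 small purely as an example; it illustrates the technique rather than reproducing code from the blog post.

```python
# Minimal logit-lens sketch in the spirit of "Interpreting GPT: the Logit Lens".
# Assumes `torch` and `transformers`; GPT-2 small is used only as an example model.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The Eiffel Tower is located in the city of"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states: tuple of (n_layers + 1) tensors of shape [batch, seq, d_model]
for layer, hidden in enumerate(outputs.hidden_states):
    # Project the residual stream at the last position through the final
    # layer norm and the unembedding matrix, as the logit lens prescribes.
    last = model.transformer.ln_f(hidden[0, -1])
    logits = model.lm_head(last)
    top_token = tokenizer.decode(logits.argmax().item())
    print(f"layer {layer:2d}: top next-token prediction = {top_token!r}")
```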
Date | Institute | Publication | Paper Title |
---|---|---|---|
YYYY-MM-DD | Institute | Journal | Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias |
YYYY-MM-DD | Institute | Journal | Discovering Latent Knowledge in Language Models Without Supervision |
YYYY-MM-DD | Institute | Journal | Towards Monosemanticity: Decomposing Language Models With Dictionary Learning |
YYYY-MM-DD | Institute | Journal | SPINE: SParse Interpretable Neural Embeddings |
YYYY-MM-DD | Institute | Journal | Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors |
YYYY-MM-DD | Institute | Journal | Sparse Autoencoders Find Highly Interpretable Features in Language Models |
YYYY-MM-DD | Institute | Journal | Attribution Patching: Activation Patching At Industrial Scale |
YYYY-MM-DD | Institute | Journal | Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] |
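Several entries above (e.g., "Towards Monosemanticity" and "Sparse Autoencoders Find Highly Interpretable Features in Language Models") use sparse autoencoders for dictionary learning on model activations. The sketch below shows the basic recipe, with random data standing in for cached activations; the layer sizes and the L1 coefficient are illustrative assumptions, not values from the papers.

```python
# Minimal sparse-autoencoder (dictionary learning) sketch. Random data stands in
# for real residual-stream activations; all sizes and hyperparameters are illustrative.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        # Non-negative codes; the L1 penalty below pushes them toward sparsity.
        codes = torch.relu(self.encoder(x))
        return self.decoder(codes), codes

d_model, d_dict, l1_coeff = 64, 512, 1e-3
sae = SparseAutoencoder(d_model, d_dict)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)

activations = torch.randn(4096, d_model)  # stand-in for cached model activations

for step in range(200):
    batch = activations[torch.randint(0, len(activations), (256,))]
    recon, codes = sae(batch)
    loss = ((recon - batch) ** 2).mean() + l1_coeff * codes.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 50 == 0:
        sparsity = (codes > 0).float().mean().item()
        print(f"step {step}: loss={loss.item():.4f}, fraction of active codes={sparsity:.3f}")
```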
Date | Institute | Publication | Paper Title |
---|---|---|---|
2023 | EleutherAI | Arxiv | Linear Representations of Sentiment in Large Language Models |
2023 | Michigan/Harvard | Arxiv | Emergent Linear Representations in World Models of Self-Supervised Sequence Models |
2023 | MIT/Stanford/Oxford | Arxiv | Measuring Feature Sparsity in Language Models |
2023 | Flatiron | Arxiv | Polysemanticity and capacity in neural networks |
2019 | Google/Cambridge | NeurIPS | Visualizing and measuring the geometry of BERT |
2024 | NEU/MIT | Arxiv | The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets |
2021 | Google/EPFL | ICML21 | Attention is not all you need: pure attention loses rank doubly exponentially with depth |
2019 | NCKU | arXiv | Probing neural network comprehension of natural language arguments |
2024 | Tsinghua University | arXiv | How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States |
2024 | IIITDM | arXiv | HULLMI: Human vs LLM identification with explainability |
2024 | HUST | arXiv | Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM |
2024 | UMass Amherst | arXiv | Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates |
2024 | Tsinghua University | arXiv | CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation |
2024 | UGA | arXiv | Explainable AI Reloaded: Challenging the XAI Status Quo in the Era of Large Language Models |
2024 | UC Berkeley/Stanford | arXiv | Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference |
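Entries such as "Linear Representations of Sentiment" and "The Geometry of Truth" in the table above study linear structure in hidden states. A common starting point is a linear probe: fit a logistic regression on hidden-state vectors and treat its weight vector as a candidate direction. The sketch below uses synthetic activations with a planted direction, so it only illustrates the method, not any paper's results.

```python
# Minimal linear-probe sketch: recover a planted "concept direction" from
# synthetic hidden states. Everything here is illustrative, not paper code.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, n_examples = 128, 2000

# Pretend hidden states: class-1 examples are shifted along a planted direction.
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=n_examples)
hidden_states = rng.normal(size=(n_examples, d_model)) + 2.0 * labels[:, None] * direction

X_train, X_test, y_train, y_test = train_test_split(hidden_states, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

learned = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
print("probe accuracy:", probe.score(X_test, y_test))
print("cosine similarity with planted direction:", float(learned @ direction))
```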
Date | Institute | Publication | Paper Title |
---|---|---|---|
YYYY-MM-DD | Institute | Journal | Large Language Models Are Not Robust Multiple Choice Selectors |
YYYY-MM-DD | Institute | Journal | The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Language Models |
YYYY-MM-DD | Institute | Journal | ChainPoll: A High Efficacy Method for LLM Hallucination Detection |
2023 | Princeton University | Online Presentation | Evaluating LLMs is a minefield |
Date | Institute | Publication | Paper Title |
---|---|---|---|
YYYY-MM-DD | Institute | Journal | Let's Verify Step by Step |
YYYY-MM-DD | Institute | Journal | Interpretability Illusions in the Generalization of Simplified Models |
YYYY-MM-DD | Institute | Journal | Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling |
2024 | Polytechnique Montreal | Arxiv | Can Large Language Models Explain Themselves? |
YYYY-MM-DD | Institute | Journal | A Mechanistic Interpretability Analysis of Grokking |
YYYY-MM-DD | Institute | Journal | 200 Concrete Open Problems in Mechanistic Interpretability |
YYYY-MM-DD | Institute | Journal | Interpretability at Scale: Identifying Causal Mechanisms in Alpaca |
YYYY-MM-DD | Institute | Journal | Representation Engineering: A Top-Down Approach to AI Transparency |
2023 | UC Berkeley | Nature Communications | Augmenting Interpretable Models with LLMs during Training |
Date | Institute | Publication | Paper Title |
---|---|---|---|
YYYY-MM-DD | Institute | Journal | Emergent world representations: Exploring a sequence model trained on a synthetic task |
YYYY-MM-DD | Institute | Journal | How does GPT-2 compute greater than?: Interpreting mathematical abilities in a pre-trained language model |
YYYY-MM-DD | Institute | Journal | Interpreting the Inner Mechanisms of Large Language Models in Mathematical Addition |
YYYY-MM-DD | Institute | Journal | An Overview of Early Vision in InceptionV1 |
Date | Institute | Publication | Paper Title |
---|---|---|---|
YYYY-MM-DD | Institute | Journal | A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations |
YYYY-MM-DD | Institute | Journal | The Quantization Model of Neural Scaling |
YYYY-MM-DD | Institute | Journal | Toy Models of Superposition |
YYYY-MM-DD | Institute | Journal | Engineering monosemanticity in toy models |
YYYY-MM-DD | Institute | Journal | A New Approach to Computation Reimagines Artificial Intelligence |
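The "Toy Models of Superposition" entry above studies how a network packs more sparse features than it has hidden dimensions. Below is a minimal sketch of that setup: a tiny linear bottleneck trained to reconstruct sparse feature vectors, after which the feature-direction overlap matrix shows the interference superposition creates. All sizes and the sparsity level are illustrative assumptions.

```python
# Minimal toy-model-of-superposition sketch: more sparse features than hidden
# dimensions, reconstructed through a linear bottleneck. Sizes are illustrative.
import torch
import torch.nn as nn

n_features, d_hidden, feature_sparsity = 20, 5, 0.9

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n_features, d_hidden) * 0.1)
        self.b = nn.Parameter(torch.zeros(n_features))

    def forward(self, x):
        hidden = x @ self.W                              # compress into a small hidden space
        return torch.relu(hidden @ self.W.T + self.b)    # reconstruct with the same matrix

model = ToyModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

for step in range(2000):
    # Sparse feature vectors: each feature is active with probability 1 - sparsity.
    features = torch.rand(1024, n_features)
    mask = (torch.rand(1024, n_features) > feature_sparsity).float()
    x = features * mask
    loss = ((model(x) - x) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Off-diagonal mass in the overlap matrix indicates features sharing directions.
overlaps = model.W @ model.W.T
print("feature-direction overlap matrix (rounded):")
print(overlaps.detach().round(decimals=2))
```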
- Georgia Deaconu (December 2023), Towards LLM Explainability: Why Did My Model Produce This Output?
- Anthropic -- Adly Templeton et al. (May 2024), Mapping the Mind of a Large Language Model
- OpenAI -- (May 2023), Language Models Can Explain Neurons in Language Models, GitHub, Neuron Viewer.
Please feel free to check out CONTRIBUTING and CODE-OF-CONDUCT to collaborate with us.
- One future direction is Fairness-Explainability Evaluation for LLMs.