
Awesome-LLM-Explainability

A curated list of explainability-related papers, articles, and resources focused on Large Language Models (LLMs). This repository aims to provide researchers, practitioners, and enthusiasts with insights into the explainability implications, challenges, and advancements surrounding these powerful models.

🚧 This repository is under construction (with daily updates) 🚧


Introduction

We've curated a collection of the latest 📈, most comprehensive 📚, and most valuable 💡 resources on large language model explainability (LLM Explainability). We don't stop there: the list also includes relevant talks, tutorials, conferences, news, and articles. The repository is updated continually to ensure you have the most current information at your fingertips.

Webinars

Recorded Videos

Events

Papers

Survey Papers

| Date | Institute | Publication | Paper Title | GitHub |
|------|-----------|-------------|-------------|--------|
| 2024 | New Jersey Institute of Technology | ACM TIST | Explainability for Large Language Models: A Survey | GitHub |
| 2024 | Imperial College | arXiv | From Understanding to Utilization: A Survey on Explainability for Large Language Models | N/A |
| 2024 | Hong Kong University of Science and Technology | arXiv | Explainable Artificial Intelligence for Scientific Discovery | N/A |
| 2024 | UMaT | arXiv | Explainable Artificial Intelligence (XAI): from Inherent Explainability to Large Language Models | N/A |
| 2024 | Nanyang Technological University | arXiv | XAI meets LLMs: A Survey of the Relation between Explainable AI and Large Language Models | N/A |
| 2024 | University of Maryland | arXiv | Large Language Models and Causal Inference in Collaboration: A Comprehensive Survey | N/A |

LLM Explainability Evaluation

| Date | Institute | Publication | Paper Title | Code |
|------|-----------|-------------|-------------|------|
| 2023 | Tsinghua University | arXiv | Scaling LLM-as-Critic for Effective and Explainable Evaluation of Large Language Model Generation | GitHub |
| 2023 | UC Berkeley | NeurIPS 2023 | Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena | GitHub |

Neural Network Analysis

| Date | Institute | Publication | Paper Title |
|------|-----------|-------------|-------------|
| 2023 | MIT/Harvard | arXiv | Finding Neurons in a Haystack: Case Studies with Sparse Probing |
| 2023 | UoTexas/DeepMind | arXiv | Copy Suppression: Comprehensively Understanding an Attention Head |
| 2023 | UCL | arXiv | Towards Automated Circuit Discovery for Mechanistic Interpretability |
| 2023 | OpenAI | OpenAI Publication | Language models can explain neurons in language models |
| 2023 | MIT | NeurIPS 2023 | Toward a Mechanistic Understanding of Stepwise Inference in Transformers: A Synthetic Graph Navigation Model |
| 2023 | Cambridge | arXiv | Successor Heads: Recurring, Interpretable Attention Heads In The Wild |
| 2023 | Meta | arXiv | Neurons in Large Language Models: Dead, N-gram, Positional |
| 2023 | Redwood/UC Berkeley | arXiv | Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small |
| 2023 | Microsoft | arXiv | Explaining black box text modules in natural language with language models |
| 2023 | ApartR/Oxford | ICLR 2023 | N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models |
| 2023 | --- | Blog | Interpreting GPT: the Logit Lens |
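The logit-lens technique referenced above can be sketched in a few lines: apply the model's unembedding matrix to the *intermediate* residual stream at each layer to read off a per-layer "best guess" token. Everything below (the dimensions, the random weights, the `tanh` layer updates) is an illustrative toy stand-in, not any real model:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, n_layers = 8, 5, 3
W_U = rng.normal(size=(d_model, vocab))      # toy unembedding matrix

resid = rng.normal(size=d_model)             # residual stream after embedding
guesses = []
for layer in range(n_layers):
    # Each "layer" writes an update into the residual stream.
    resid = resid + 0.3 * np.tanh(rng.normal(size=(d_model, d_model)) @ resid)
    # Logit lens: decode the intermediate state with the *final* unembedding.
    logits = resid @ W_U
    guesses.append(int(logits.argmax()))

print(guesses)  # per-layer "best guess" token ids
```

Watching how the per-layer guesses converge toward the final prediction is the core observation the blog post above builds on.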

Algorithmic Approaches

| Date | Institute | Publication | Paper Title |
|------|-----------|-------------|-------------|
| YYYY-MM-DD | Institute | Journal | Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias |
| YYYY-MM-DD | Institute | Journal | Discovering Latent Knowledge in Language Models Without Supervision |
| YYYY-MM-DD | Institute | Journal | Towards Monosemanticity: Decomposing Language Models With Dictionary Learning |
| YYYY-MM-DD | Institute | Journal | Spine: Sparse interpretable neural embeddings |
| YYYY-MM-DD | Institute | Journal | Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors |
| YYYY-MM-DD | Institute | Journal | Sparse Autoencoders Find Highly Interpretable Features in Language Models |
| YYYY-MM-DD | Institute | Journal | Attribution Patching: Activation Patching At Industrial Scale |
| YYYY-MM-DD | Institute | Journal | Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] |
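Activation patching, which several of the papers above scale up, has a simple core: run the model on a "clean" and a "corrupted" input, copy one internal activation from the clean run into the corrupted run, and measure how much of the clean output is restored. This minimal sketch uses an invented two-layer numpy network as a stand-in for a transformer component:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 2-layer network; weights are random placeholders.
W1 = rng.normal(size=(4, 6))
W2 = rng.normal(size=(6, 1))

def forward(x, patch=None):
    h = np.maximum(x @ W1, 0.0)        # hidden activations
    if patch is not None:              # activation patching:
        idx, value = patch             # overwrite one hidden unit
        h = h.copy()
        h[idx] = value
    return float(h @ W2)

x_clean = rng.normal(size=4)           # "clean" input
x_corr = rng.normal(size=4)            # "corrupted" input
h_clean = np.maximum(x_clean @ W1, 0.0)
y_corr = forward(x_corr)

# Patch each clean hidden activation into the corrupted run; the output
# change attributes the behaviour difference to individual units.
effects = [forward(x_corr, patch=(i, h_clean[i])) - y_corr
           for i in range(6)]
```

Units with large `effects` entries are the ones that matter for this input pair; attribution patching approximates this loop with gradients instead of one forward pass per unit.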

Representation Analysis

| Date | Institute | Publication | Paper Title |
|------|-----------|-------------|-------------|
| 2023 | EleutherAI | arXiv | Linear Representations of Sentiment in Large Language Models |
| 2023 | Michigan/Harvard | arXiv | Emergent Linear Representations in World Models of Self-Supervised Sequence Models |
| 2023 | MIT/Stanford/Oxford | arXiv | Measuring Feature Sparsity in Language Models |
| 2023 | Flatiron | arXiv | Polysemanticity and capacity in neural networks |
| 2019 | Google/Cambridge | NeurIPS | Visualizing and measuring the geometry of BERT |
| 2024 | NEU/MIT | arXiv | The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets |
| 2021 | Google | ICML 2021 | Attention is not all you need: pure attention loses rank doubly exponentially with depth |
| 2019 | NCKU | arXiv | Probing neural network comprehension of natural language arguments |
| 2024 | Tsinghua University | arXiv | How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States |
| 2024 | IIITDM | arXiv | HULLMI: Human vs LLM identification with explainability |
| 2024 | HUST | arXiv | Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM |
| 2024 | UMass Amherst | arXiv | Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates |
| 2024 | Tsinghua University | arXiv | CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation |
| 2024 | UGA | arXiv | Explainable AI Reloaded: Challenging the XAI Status Quo in the Era of Large Language Models |
| 2024 | Stanford/California | arXiv | Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference |
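A recurring finding in the representation-analysis papers above is that concepts (sentiment, truth value) are encoded as roughly *linear* directions that a simple probe can recover. The sketch below plants a concept direction in synthetic "representations" and recovers it with a least-squares probe; the dimensions, noise level, and data are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 500

# Plant a ground-truth "concept direction" in synthetic representations.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

labels = rng.integers(0, 2, size=n)            # binary concept, e.g. true/false
reps = np.outer(labels - 0.5, direction) + 0.1 * rng.normal(size=(n, d))

# Least-squares linear probe: its weight vector should align with the
# planted direction up to sign and scale.
w, *_ = np.linalg.lstsq(reps, labels - 0.5, rcond=None)
cosine = abs(float(w @ direction) / np.linalg.norm(w))
print(round(cosine, 3))
```

With real model activations the same recipe applies, except the probe is fit on hidden states extracted from labelled prompts rather than synthetic vectors.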

Bias and Robustness Studies

| Date | Institute | Publication | Paper Title |
|------|-----------|-------------|-------------|
| YYYY-MM-DD | Institute | Journal | Large Language Models Are Not Robust Multiple Choice Selectors |
| YYYY-MM-DD | Institute | Journal | The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Language Models |
| YYYY-MM-DD | Institute | Journal | ChainPoll: A High Efficacy Method for LLM Hallucination Detection |
| 2023 | PrincetonU | Online Presentation | Evaluating LLMs is a minefield |

Interpretability Frameworks

| Date | Institute | Publication | Paper Title |
|------|-----------|-------------|-------------|
| YYYY-MM-DD | Institute | Journal | Let's Verify Step by Step |
| YYYY-MM-DD | Institute | Journal | Interpretability Illusions in the Generalization of Simplified Models |
| YYYY-MM-DD | Institute | Journal | Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling |
| 2024 | Polytechnique Montreal | arXiv | Can Large Language Models Explain Themselves? |
| YYYY-MM-DD | Institute | Journal | A Mechanistic Interpretability Analysis of Grokking |
| YYYY-MM-DD | Institute | Journal | 200 Concrete Open Problems in Mechanistic Interpretability |
| YYYY-MM-DD | Institute | Journal | Interpretability at Scale: Identifying Causal Mechanisms in Alpaca |
| YYYY-MM-DD | Institute | Journal | Representation Engineering: A Top-Down Approach to AI Transparency |
| 2023 | UC Berkeley | Nature Communications | Augmenting Interpretable Models with LLMs during Training |

Application-Specific Studies

| Date | Institute | Publication | Paper Title |
|------|-----------|-------------|-------------|
| YYYY-MM-DD | Institute | Journal | Emergent world representations: Exploring a sequence model trained on a synthetic task |
| YYYY-MM-DD | Institute | Journal | How does GPT-2 compute greater than?: Interpreting mathematical abilities in a pre-trained language model |
| YYYY-MM-DD | Institute | Journal | Interpreting the Inner Mechanisms of Large Language Models in Mathematical Addition |
| YYYY-MM-DD | Institute | Journal | An Overview of Early Vision in InceptionV1 |

Theoretical Approaches

| Date | Institute | Publication | Paper Title |
|------|-----------|-------------|-------------|
| YYYY-MM-DD | Institute | Journal | A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations |
| YYYY-MM-DD | Institute | Journal | The Quantization Model of Neural Scaling |
| YYYY-MM-DD | Institute | Journal | Toy Models of Superposition |
| YYYY-MM-DD | Institute | Journal | Engineering monosemanticity in toy models |
| YYYY-MM-DD | Institute | Journal | A New Approach to Computation Reimagines Artificial Intelligence |

Related GitHub Repositories:

Blogs

Medium Blogs

Big Player's Blogs

Tools

Related Communities

Contribution and Collaboration:

Please feel free to check out CONTRIBUTING and CODE-OF-CONDUCT to collaborate with us.

Future Research Directions

  • One future direction is Fairness-Explainability Evaluation for LLMs.
