Papers - Observability and Interpretability
JoMA: Demystifying Multilayer Transformers via Joint Dynamics of MLP and Attention (arXiv:2310.00535)
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small (arXiv:2211.00593)
Rethinking Interpretability in the Era of Large Language Models (arXiv:2402.01761)
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla (arXiv:2307.09458)
Sparse Autoencoders Find Highly Interpretable Features in Language Models (arXiv:2309.08600)
Interpretability at Scale: Identifying Causal Mechanisms in Alpaca (arXiv:2305.08809)
Natural Language Decomposition and Interpretation of Complex Utterances (arXiv:2305.08677)
Information Flow Routes: Automatically Interpreting Language Models at Scale (arXiv:2403.00824)
Structural Similarities Between Language Models and Neural Response Measurements (arXiv:2306.01930)
The Impact of Depth and Width on Transformer Language Model Generalization (arXiv:2310.19956)
The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI (arXiv:2310.16787)
A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity (arXiv:2305.13169)
A Watermark for Large Language Models (arXiv:2301.10226)
Universal and Transferable Adversarial Attacks on Aligned Language Models (arXiv:2307.15043)
Vision Transformers Need Registers (arXiv:2309.16588)
Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks (arXiv:2309.17410)
On the Origin of LLMs: An Evolutionary Tree and Graph for 15,821 Large Language Models (arXiv:2307.09793)
Tools for Verifying Neural Models' Training Data (arXiv:2307.00682)
Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models (arXiv:2310.02949)
Chain-of-Thought Reasoning Without Prompting (arXiv:2402.10200)
Building and Interpreting Deep Similarity Models (arXiv:2003.05431)
Long-form factuality in large language models (arXiv:2403.18802)
Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models (arXiv:2403.20331)
Locating and Editing Factual Associations in Mamba (arXiv:2404.03646)
BERT Rediscovers the Classical NLP Pipeline (arXiv:1905.05950)
Prompt-to-Prompt Image Editing with Cross Attention Control (arXiv:2208.01626)
LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models (arXiv:2404.03118)