Notes

A Formal Theory of Inductive Inference
Solomonoff's foundational paper on algorithmic probability and universal induction. Basis for AIXI.
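
Its central object, written here in modern notation rather than Solomonoff's original typography, is the universal prior: a string's probability is the total weight of every program that makes a universal machine emit output beginning with it, each program weighted by two to the minus its length.

```latex
M(x) \;=\; \sum_{p \,:\, U(p) \text{ begins with } x} 2^{-\ell(p)}
```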

Attention Is All You Need
Introduced the Transformer architecture. The paper that started everything.
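
As a reminder of how small the core operation is, here is a minimal numpy sketch of single-head scaled dot-product attention (no masking, no learned projections; shapes are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (seq, d_k); V: (seq, d_v). Scores are query-key dot products,
    # scaled by sqrt(d_k) to keep the softmax well-behaved.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted average of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))  # 4 toy tokens
out = scaled_dot_product_attention(Q, K, V)            # shape (4, 8)
```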

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Bidirectional pre-training via masked language modeling. Defined the pre-train/fine-tune paradigm.
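
A sketch of the masking step, simplified from the paper's recipe (the real procedure also leaves 10% of selected tokens unchanged and swaps 10% for random tokens):

```python
import random

def mask_for_mlm(tokens, p=0.15, mask_token="[MASK]"):
    # Select ~15% of positions, hide them, and record the originals as
    # prediction targets; the model sees context on both sides of each mask.
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() < p:
            targets[i] = tok
            masked[i] = mask_token
    return masked, targets

masked, targets = mask_for_mlm("the cat sat on the mat".split())
```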

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Step-by-step reasoning via prompting. Unlocked a new capability class.
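
The whole technique is a prompt format: the few-shot demonstrations include the reasoning, so the model emits intermediate steps before its answer. Paraphrasing the paper's running example:

```text
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
   Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is
   6 tennis balls. 5 + 6 = 11. The answer is 11.

Q: <your question>
A:
```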

Constitutional AI: Harmlessness from AI Feedback
Self-critique and revision guided by written principles instead of human labels.

Deep Reinforcement Learning from Human Preferences
Foundational RLHF paper. Learning reward models from pairwise human comparisons.
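
The reward model is fit to the comparisons with a Bradley-Terry style objective: the preferred item should score higher, and the loss is the negative log-sigmoid of the score gap. A numpy sketch (the scores are stand-ins for reward-model outputs):

```python
import numpy as np

def pairwise_reward_loss(r_chosen, r_rejected):
    # Bradley-Terry: P(chosen beats rejected) = sigmoid(r_chosen - r_rejected);
    # the loss is -log of that probability, i.e. log(1 + exp(-gap)).
    return np.mean(np.log1p(np.exp(-(r_chosen - r_rejected))))

# toy reward-model scores for preferred vs. dispreferred responses
r_chosen = np.array([1.2, 0.3, 2.0])
r_rejected = np.array([0.4, 0.5, 1.1])
print(pairwise_reward_loss(r_chosen, r_rejected))
```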

Direct Preference Optimization: Your Language Model Is Secretly a Reward Model
Bypasses reward modeling entirely. Simpler alignment pipeline, comparable results.
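
DPO's loss needs only log-probabilities from the policy and a frozen reference model; the implicit reward is the β-scaled log-ratio between them. A numpy sketch of the per-batch loss (inputs are per-response sequence log-probs):

```python
import numpy as np

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Implicit reward of a response = beta * log(pi(y|x) / pi_ref(y|x)).
    # The loss is -log sigmoid of the reward margin between the chosen (w)
    # and rejected (l) responses; no reward model, no RL rollout.
    margin = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    return np.mean(np.log1p(np.exp(-margin)))

# toy sequence log-probs under the policy and the frozen reference
print(dpo_loss(np.array([-12.0]), np.array([-15.0]),
               np.array([-13.0]), np.array([-14.0])))
```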

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
IO-aware exact attention that is both faster and more memory-efficient. Essential infrastructure.
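
The enabling trick is an online softmax: attention can be computed block by block while carrying only a running max, a normalizer, and a weighted sum, so the full n×n score matrix never materializes. A single-query numpy sketch (tiling over keys/values only; the real kernel also tiles queries and fuses everything on-chip):

```python
import numpy as np

def online_softmax_attention(score_blocks, value_blocks):
    # One query, keys/values streamed in blocks. Keep a running max (m),
    # normalizer (l), and unnormalized output (acc); rescale both whenever
    # a new block raises the max. Memory is O(block size), not O(n).
    m, l, acc = -np.inf, 0.0, 0.0
    for s, v in zip(score_blocks, value_blocks):
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)        # rescale old contributions
        e = np.exp(s - m_new)
        l = l * scale + e.sum()
        acc = acc * scale + e @ v
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
scores, values = rng.normal(size=12), rng.normal(size=(12, 4))
out = online_softmax_attention(
    [scores[i:i + 4] for i in range(0, 12, 4)],
    [values[i:i + 4] for i in range(0, 12, 4)])
dense = np.exp(scores - scores.max()) @ values / np.exp(scores - scores.max()).sum()
assert np.allclose(out, dense)           # identical to the dense result
```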

Generating Long Sequences with Sparse Transformers
Sparse attention patterns for long-range dependencies. O(n√n) attention.
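
A rough picture of where O(n√n) comes from: each position attends to a local window of about √n neighbors plus one position per √n-sized block. A numpy sketch of such a mask (a simplified stand-in for the paper's strided/fixed patterns):

```python
import numpy as np

def sparse_attention_mask(n):
    # Each query attends to (a) a causal local window of ~sqrt(n) positions
    # and (b) one position per sqrt(n)-sized block, so nonzeros per row are
    # O(sqrt(n)) and total work is O(n * sqrt(n)) instead of O(n^2).
    stride = int(np.ceil(np.sqrt(n)))
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        mask[i, max(0, i - stride + 1):i + 1] = True   # local window
        mask[i, stride - 1:i + 1:stride] = True        # strided positions
    return mask

m = sparse_attention_mask(64)
print(int(m.sum()), "of", 64 * 64, "entries attended")   # ~n * 2*sqrt(n)
```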

Language Models are Few-Shot Learners (GPT-3)
175B parameters. In-context learning emerges at scale. Changed the field.

Language Models are Unsupervised Multitask Learners (GPT-2)
Showed large LMs can perform tasks zero-shot. Introduced the scaling intuition.

LLaMA: Open and Efficient Foundation Language Models
Open-weight models competitive with GPT-3. Catalyzed the open-source LLM ecosystem.

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Mixture of Experts with learned gating. Conditional computation at scale.
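
The layer's essence: a learned gate scores all experts per token, but only the top-k actually run, so parameter count grows while per-token compute stays roughly flat. A minimal numpy sketch (linear experts, top-2 gating, no load balancing):

```python
import numpy as np

def moe_layer(x, experts, gate_w, k=2):
    # x: (d,) token vector; experts: list of (d, d) matrices; gate_w: (E, d).
    # The gate scores every expert but only the top-k execute; outputs are
    # mixed with softmax weights renormalized over the selected experts.
    logits = gate_w @ x
    top = np.argsort(logits)[-k:]
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()
    return sum(wi * (experts[i] @ x) for wi, i in zip(w, top))

rng = np.random.default_rng(0)
d, E = 16, 8
experts = [rng.normal(size=(d, d)) for _ in range(E)]
gate_w = rng.normal(size=(E, d))
y = moe_layer(rng.normal(size=d), experts, gate_w)   # only 2 of 8 experts ran
```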

Probabilistic Graphical Models: Principles and Techniques
Koller and Friedman's comprehensive reference on graphical models, inference, and learning.

ReAct: Synergizing Reasoning and Acting in Language Models
Interleaving reasoning traces and actions. The prompting pattern behind most LLM agents.
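
The control loop behind the pattern, as a rough Python sketch; `llm` and `run_tool` are hypothetical stand-ins, not names from the paper's code, and the string parsing is far cruder than a production agent's:

```python
def react_agent(question, llm, run_tool, max_steps=8):
    # llm(text) -> model completion (str); run_tool(action) -> tool output.
    context = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(context + "Thought:")   # model reasons, then picks an action
        context += "Thought:" + step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        if "Action:" not in step:
            break                          # model failed to emit an action
        action = step.split("Action:")[-1].strip()
        context += f"Observation: {run_tool(action)}\n"  # feed result back in
    return None
```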

Scaling Laws for Neural Language Models
Power-law relationships between compute, data, parameters, and loss. Empirical scaling science.
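
For example, with data and compute unconstrained, the paper fits loss as a power law in parameter count, roughly L(N) = (N_c/N)^0.076 with N_c ≈ 8.8e13. A quick illustration using those fitted constants:

```python
def loss_from_params(n, n_c=8.8e13, alpha_n=0.076):
    # L(N) = (N_c / N) ** alpha_N: loss in nats per token as a function of
    # non-embedding parameter count, other resources unconstrained.
    return (n_c / n) ** alpha_n

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> predicted loss {loss_from_params(n):.2f}")
```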

The Unreasonable Effectiveness of Recurrent Neural Networks
Seminal blog post demonstrating char-level RNN power: Shakespeare, LaTeX, and Linux kernel code generation.

Toolformer: Language Models Can Teach Themselves to Use Tools
LMs learning when and how to call external tools. Key step toward agentic LMs.

Training Compute-Optimal Large Language Models (Chinchilla)
Showed most LLMs were undertrained. Derived the compute-optimal ratio of data to parameters.
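
The fitted recipe comes out near 20 training tokens per parameter, and with the standard C ≈ 6·N·D estimate of training FLOPs the compute-optimal sizes follow by arithmetic. A sketch (constants are the commonly quoted round numbers, not exact fits):

```python
def chinchilla_optimal(compute_flops, tokens_per_param=20):
    # With C ~= 6 * N * D and the fitted D ~= 20 * N, the optimal model size
    # is N = sqrt(C / 120); data scales in lockstep with parameters.
    n = (compute_flops / (6 * tokens_per_param)) ** 0.5
    return n, tokens_per_param * n

n, d = chinchilla_optimal(5.76e23)   # roughly the Gopher/Chinchilla budget
print(f"{n:.2e} params, {d:.2e} tokens")   # ~7.0e10 params, ~1.4e12 tokens
```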

Training Language Models to Follow Instructions with Human Feedback (InstructGPT)
RLHF applied to GPT-3. The bridge from raw LM to useful assistant.