Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafailov, Sharma, Mitchell, Ermon, Manning, Finn
Notes
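Bypasses explicit reward modeling and the RL loop entirely: the policy is trained directly on preference pairs with a simple classification-style loss. Simpler alignment pipeline, with results the paper reports as comparable to or better than PPO-based RLHF.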
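A minimal sketch of the per-example DPO loss, to make the "no separate reward model" point concrete. It assumes the summed log-probabilities of the chosen and rejected responses are already computed under both the policy and a frozen reference model. The function name, argument names, and the default `beta=0.1` are illustrative choices, not taken from these notes.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities (hypothetical helper)."""
    # Implicit reward of each response: how much more likely the policy
    # makes it than the frozen reference model does (in log space).
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled margin between the preferred and dispreferred response.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference, the margin is zero and the loss is log 2 (about 0.693). Raising the policy's log-probability of the chosen response relative to the rejected one pushes the loss below that value.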