prompting.validators.reward.dpo
Module Contents
Classes
- class prompting.validators.reward.dpo.DirectPreferenceRewardModel(device)
  Bases: prompting.validators.reward.reward.BaseRewardModel
  - Parameters:
    - device (str) – Device on which the reference reward model is run.
- reward_single(prompt, completion, name, with_penalty=True)
  Calculates a direct preference optimization (DPO) style reward for a completion: the reference model's average log-probability of the completion tokens given the prompt. Uses guidance from eric-mitchell/direct-preference-optimization.
  - Parameters:
    - prompt (str) – Prompt that the completion responds to.
    - completion (str) – Completion text to score.
    - name (str)
    - with_penalty (bool) – Whether to apply a penalty to the raw reward. Defaults to True.
  - Return type:
    float
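The sketch below illustrates the scoring idea behind reward_single: average the reference model's log-probabilities over the completion tokens, conditioned on the prompt. It is a minimal sketch, assuming a Hugging Face causal LM stands in for the reference model; the "gpt2" checkpoint and the avg_completion_logprob helper are illustrative placeholders, not the validator's actual configuration, and the with_penalty adjustment is omitted.

```python
# Sketch: DPO-style reward as the reference model's average log-probability
# of the completion tokens given the prompt.
# Assumptions: any Hugging Face causal LM ("gpt2" here) stands in for the
# reference model; avg_completion_logprob is an illustrative helper, not the
# validator's actual implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ref_name = "gpt2"  # placeholder reference model
tokenizer = AutoTokenizer.from_pretrained(ref_name)
model = AutoModelForCausalLM.from_pretrained(ref_name).eval()


def avg_completion_logprob(prompt: str, completion: str) -> float:
    """Average log-probability of the completion's tokens, conditioned on the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits  # (1, seq_len, vocab)
    # Position t-1 predicts token t, so shift logits and targets by one.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]
    token_logps = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the tokens belonging to the completion.
    completion_logps = token_logps[:, prompt_ids.shape[1] - 1:]
    return completion_logps.mean().item()


print(avg_completion_logprob("The capital of France is", " Paris."))
```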
- get_rewards(prompt, completions, name)
  Computes a DPO-style reward for each completion in completions against the given prompt.
  - Parameters:
    - prompt (str)
    - completions (list[str])
    - name (str)
  - Return type:
    torch.FloatTensor
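A hedged usage sketch of the class interface, assuming the prompting package is importable in the current environment; the prompt, completions, and task name below are placeholders.

```python
# Illustrative usage sketch; the prompt, completions, and name are placeholders.
import torch
from prompting.validators.reward.dpo import DirectPreferenceRewardModel

reward_model = DirectPreferenceRewardModel(
    device="cuda" if torch.cuda.is_available() else "cpu"
)

prompt = "Summarize the benefits of unit testing."
completions = [
    "Unit tests catch regressions early and document intended behavior.",
    "idk",
]
# "example-task" is a placeholder task name.
rewards = reward_model.get_rewards(prompt, completions, name="example-task")
# Higher rewards correspond to completions the reference model assigns
# higher average log-probability.
print(rewards)
```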