This repository was archived by the owner on Nov 19, 2025. It is now read-only.

Conversation

@Davood-M Davood-M commented Sep 24, 2024

What does this PR do?

This PR adds RPO on multiple responses for alignment. RPO can take a dataset with a variable number of responses per prompt.

Changelog

  • Please update the CHANGELOG.md under next version with high level changes in this PR.

Usage

  • The dataset should be formatted like this:
{
"prompt": ...,
"responses": [ list of responses ],
"rewards": [ list of rewards ]
}
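As a rough illustration of the format above, the following sketch builds two records with different numbers of responses and checks their shape. It is not code from this PR: the `validate_record` helper, the sample prompts, the one-reward-per-response check, and the JSON Lines serialization are all assumptions made for the example.

```python
import json


def validate_record(record):
    """Check one record against the prompt/responses/rewards layout.

    Hypothetical helper, not part of this PR. It assumes each response
    has exactly one reward at the same list position.
    """
    for key in ("prompt", "responses", "rewards"):
        if key not in record:
            raise ValueError(f"missing key: {key}")
    responses, rewards = record["responses"], record["rewards"]
    if not responses:
        raise ValueError("at least one response is required")
    if len(responses) != len(rewards):
        raise ValueError("responses and rewards must have the same length")
    return len(responses)


# Two made-up records; the number of responses differs per prompt.
records = [
    {"prompt": "2 + 2 = ?", "responses": ["4", "5"], "rewards": [1.0, 0.0]},
    {
        "prompt": "Capital of France?",
        "responses": ["Paris", "Lyon", "Nice"],
        "rewards": [1.0, 0.2, 0.1],
    },
]

counts = [validate_record(r) for r in records]

# Assumption: one JSON object per line (JSON Lines); the PR does not state the file layout.
jsonl = "\n".join(json.dumps(r) for r in records)
```

Running a check like this before training can catch records where the rewards list does not line up with the responses list.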

Before your PR is "Ready for review"

Pre checks:

Checklist when contributing a new algorithm

  • Does the trainer resume and restore all model states?
  • Does the trainer support all parallelism techniques (PP, TP, DP)?
  • Does the trainer support max_steps=-1 and validation?
  • Does the trainer only call APIs defined in alignable_interface.py?
  • Does the trainer have proper logging?

Additional Information

  • Related to # (issue)

shengyangs and others added 19 commits July 3, 2024 15:16
Signed-off-by: Shengyang Sun <shengyangs@nvidia.com>
Signed-off-by: Shengyang Sun <shengyangs@nvidia.com>
Signed-off-by: Shengyang Sun <shengyangs@nvidia.com>
Signed-off-by: Shengyang Sun <shengyangs@nvidia.com>
Signed-off-by: Shengyang Sun <shengyangs@nvidia.com>
…-ref-policy

Signed-off-by: Shengyang Sun <shengyangs@nvidia.com>
Signed-off-by: David Mosallanezhad <dmosallanezh@nvidia.com>
Signed-off-by: David Mosallanezhad <dmosallanezh@nvidia.com>
Signed-off-by: David Mosallanezhad <dmosallanezh@cw-dfw-cs-001-dc-01.cm.cluster>
Signed-off-by: David Mosallanezhad <dmosallanezh@cw-dfw-cs-001-dc-01.cm.cluster>
Signed-off-by: David Mosallanezhad <dmosallanezh@cw-dfw-cs-001-dc-01.cm.cluster>
Signed-off-by: David Mosallanezhad <dmosallanezh@cw-dfw-cs-001-dc-01.cm.cluster>
Signed-off-by: David Mosallanezhad <dmosallanezh@cw-dfw-cs-001-dc-01.cm.cluster>
Signed-off-by: David Mosallanezhad <dmosallanezh@cw-dfw-cs-001-dc-01.cm.cluster>
Signed-off-by: David Mosallanezhad <dmosallanezh@cw-dfw-cs-001-dc-01.cm.cluster>
@Davood-M changed the title from "Davidm/rpo multi resp" to "RPO on multiple responses" Sep 24, 2024

2 participants