PPO #12

Jemoka · 2025-08-13T20:41:06Z

Pull Request

Description

Implements the Proximal Policy Optimization (PPO) solver. As a consequence, this enables an implementation of Perez et al., 2022.

Type of Change

🐛 Bug fix (non-breaking change which fixes an issue)
✨ New feature (non-breaking change which adds functionality)
💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
📚 Documentation update
🔧 Refactoring (no functional changes)
⚡ Performance improvement
🧪 Test additions or improvements
🏗️ Infrastructure/build changes

Changes Made

Implements the PPO algorithm.
To support PPO, implements a ValueFunctionProblem, which extends Problem and lets the user train a value function for actor-critic-style algorithms.
Refactors the entire codebase to return per-token logprobs for the whole sequence (so that sequence-wide supervision is available); where the combined (multiplied-out) sequence probability is needed, the per-token logprobs are reduced right before use.
Adds examples/ast_ppo.py, an example of how to run PPO with our package.
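
For reviewers unfamiliar with token-wise PPO, the following is a minimal, dependency-free sketch of the quantities involved. It is not this package's API: the function names (`sequence_logprob`, `ppo_clip_loss`, `value_loss`), the `clip_eps` default, and the plain-list inputs are all assumptions made for illustration. It shows why per-token logprobs are useful (the clipped ratio is computed per token) and how they can be reduced to a single sequence value by summing in log space.

```python
import math


def sequence_logprob(token_logprobs):
    """Reduce per-token logprobs to one sequence-level logprob.

    Multiplying token probabilities is the same as summing their logs.
    (Hypothetical helper, not part of the package.)
    """
    return sum(token_logprobs)


def ppo_clip_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Token-wise clipped surrogate loss (to be minimised), averaged over tokens.

    new_logprobs / old_logprobs: per-token logprobs under the current and the
    rollout policy; advantages: per-token advantage estimates.
    clip_eps=0.2 is an assumed default, not taken from the PR.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Pessimistic (minimum) of the unclipped and clipped objectives.
        total += min(ratio * adv, clipped * adv)
    return -total / len(new_logprobs)


def value_loss(values, returns):
    """Mean squared error between predicted values and observed returns,
    the usual critic objective in actor-critic methods."""
    return sum((v - r) ** 2 for v, r in zip(values, returns)) / len(values)
```

When the new and old policies agree, every ratio is 1 and the policy loss reduces to the negative mean advantage; the clip only takes effect once the ratio leaves the interval [1 - clip_eps, 1 + clip_eps].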

Testing

Tested manually: the loss decreases on the example after a few gradient steps.

Documentation

Updated docstrings for new/modified functions
Updated README.md if applicable
Updated documentation in docs/ if applicable
Added examples for new features

Pre-submission Checklist

Code follows the project's style guidelines
Pre-commit hooks pass (pre-commit run --all-files)
Self-review of the code completed
Comments added for hard-to-understand areas
No new warnings introduced
Changes are backwards compatible (or breaking changes are documented)


Jemoka added 7 commits August 12, 2025 17:17

[wip] refactoring such that logprobs are exposed in full (for tokenwise PPO)

efdf49d

PPO algorithm implementation

68866dc

patch method resolution order to not double inherit from generic

7bc35e5

pipe problem through

6e8ae0d

Export PPO to be used

f9ab96f

patch various small PPO implementation bugs

483bfa4

updating documentation and example for PPO

bcfdeac

Jemoka requested a review from duncaneddy August 13, 2025 20:41

Jemoka added 2 commits August 19, 2025 14:36

Merge remote-tracking branch 'origin/main' into feat/ppo

21fa2dc

add value calculation into hf extension, and fix typo

1458ca8

Jemoka merged commit 351cd04 into main Aug 19, 2025
2 checks passed

Jemoka deleted the feat/ppo branch August 19, 2025 23:37
