PCPO
Proportionate Credit Policy Optimization for Aligning Image Generation Models
Proportionate Credit Policy Optimization (PCPO) is a framework that addresses the instability and high variance of policy-gradient methods for text-to-image (T2I) model alignment.
Our analysis reveals that disproportionate credit assignment—arising from the mathematical structure of generative samplers—causes volatile feedback across timesteps. PCPO resolves this by enforcing proportional credit assignment through:
- A stable objective reformulation.
- A principled reweighting of timesteps.
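To make the reweighting idea concrete, here is a minimal, hypothetical NumPy sketch (not the paper's actual implementation): each denoising timestep's credit is scaled inversely to its raw per-timestep gradient magnitude, so no single timestep dominates the policy-gradient update. The function names, the inverse-scale weighting rule, and the mean-one normalization are illustrative assumptions.

```python
import numpy as np

def proportional_timestep_weights(grad_scales):
    """Hypothetical reweighting: weight each timestep inversely to its raw
    per-timestep gradient scale, then normalize weights to mean 1, so every
    timestep contributes proportionally to the update."""
    grad_scales = np.asarray(grad_scales, dtype=float)
    w = 1.0 / np.maximum(grad_scales, 1e-8)  # guard against division by zero
    return w * (len(w) / w.sum())            # mean-one normalization

def reweighted_policy_loss(logprobs, advantage, weights):
    """Sum of per-timestep REINFORCE-style terms, each scaled by its weight."""
    logprobs = np.asarray(logprobs, dtype=float)
    return -float(advantage) * float(np.sum(weights * logprobs))
```

Under this sketch, a timestep whose raw gradients are 4x larger receives 4x less weight, flattening the credit profile across the sampler trajectory.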
This approach stabilizes training, accelerates convergence, and substantially improves image quality by mitigating model collapse. PCPO outperforms existing baselines, including the state-of-the-art DanceGRPO.
Figure: Qualitative comparison showing PCPO's ability to maintain diversity and fidelity while baselines suffer from model collapse.