3

Generalized on-policy distillation with reward extrapolation