Generalized on-policy distillation with reward extrapolation / Hacker News