We ran 600 agent evals – steering hooks hit 100% accuracy, prompts hit 82% / Hacker News