
Show HN: Endgame – Production-aware ML under the sklearn API

Most ML frameworks optimize for leaderboard accuracy. But in finance and healthcare, accuracy is often the least interesting part of the system. If you can’t explain a prediction, you can’t deploy it. If your probabilities aren’t calibrated, you can’t trust them. If your pipeline doesn’t enforce constraints, you can’t ship it. I built Endgame after repeatedly running into that gap in production.

Anti-money laundering (banks)

Early in my career, I was hired to improve an anti-money-laundering system. The incumbent model was 28 hard-coded rules. If enough thresholds fired (e.g., $3,000 ATM withdrawals over 30 days), the account was flagged. No one knew where the thresholds came from. There was no modeling of the underlying behavior, just rule accumulation. I convinced the bank to provide the raw financial features behind those rule firings, and we trained an interpretable ML model directly on the underlying activity patterns. The result: ~200% more true positives (accounts actually involved in fraud or laundering). But what leadership cared about most wasn't the metric. It was this: "Why is this account suspicious?" That theme repeated across industries.

Insurance claim adjudication

I later built a claim adjudication model for a major health insurer. The legacy system was massive, brittle, and effectively a black box. It would frequently deny claims incorrectly, and no one fully understood how it worked. We built a new ML system that brought claim-level adjudication accuracy to ~95%. Again, the metric wasn't the headline internally. The headline was: "Why did this claim get denied?" In regulated environments, interpretability isn't optional.

Stock forecasting and calibration

I also learned this lesson personally. I built stock-forecasting models that performed well in historical backtests. Some predictions showed 80% probability of a price increase. Then the market regime shifted. The probabilities were overconfident, some trades went the opposite direction, and I lost money. Accuracy ≠ trustworthy probabilities. Calibration and drift awareness matter far more in deployment than most tutorials suggest. That experience fundamentally changed how I think about ML systems.
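To make the accuracy-vs-calibration distinction concrete, here is a minimal sketch using the Brier score (mean squared error between predicted probabilities and outcomes). The numbers are illustrative toy data, not from my trading history: both models below predict "up" for every trade, so they have identical accuracy, but the overconfident one scores worse.

```python
# Illustrative sketch: a model can be accurate yet overconfident.
# Toy numbers, not real market data.

def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

# Both models predict "price up" (p > 0.5) on every trade, so at a 0.5
# threshold their accuracy is identical: 60%.
outcomes      = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # 60% of trades go up
overconfident = [0.8] * 10                       # claims 80% confidence
calibrated    = [0.6] * 10                       # matches the true rate

print(brier_score(overconfident, outcomes))  # 0.28 -- penalized for overconfidence
print(brier_score(calibrated, outcomes))     # 0.24 -- lower (better)
```

Same accuracy, different trustworthiness: the Brier score (and reliability diagrams, which plot predicted probability against observed frequency) is what exposes the gap that a backtest accuracy number hides.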

The core idea

Endgame is my attempt to encode those lessons into a framework. It's not trying to replace scikit-learn. Every estimator implements fit / predict / transform. But it extends the ecosystem with:

- Glass-box models (EBM, GAM, CORELS, SLIM, GOSDT, etc.)
- SOTA deep tabular models (FT-Transformer, TabPFN, SAINT, etc.)
- Conformal prediction and Venn-ABERS calibration
- Deployment guardrails (leakage detection, latency constraints, drift checks)
- 42 self-contained HTML visualizations
- Super Learner, BMA, and cascade ensembles
- A full AutoML pipeline that respects deployment constraints

All under a unified sklearn-compatible API.

Agent-native ML (MCP)

We're in the agentic AI era. You can ask an LLM to build a pipeline for you, but it often takes multiple prompts and manual corrections. Endgame ships with a native MCP server. This lets agents:

- load data
- train models
- compare results
- generate reports
- export reproducible scripts

through structured tool calls, not fragile prompt chains. My belief is that ML pipelines will increasingly become conversational infrastructure.
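To show what "structured tool calls" look like on the wire, here is a sketch of the kind of JSON-RPC request an agent would send to an MCP server. The "tools/call" envelope follows the MCP spec; the tool name "train_model" and its arguments are hypothetical placeholders, not Endgame's actual tool schema.

```python
import json

# Hypothetical MCP tool call (JSON-RPC 2.0, per the MCP spec).
# Tool name and argument names are illustrative, not Endgame's schema.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "train_model",            # hypothetical tool name
        "arguments": {                    # hypothetical arguments
            "dataset": "claims.csv",
            "task": "classification",
            "constraints": {"max_latency_ms": 50, "calibrated": True},
        },
    },
}
print(json.dumps(request, indent=2))
```

The point is that the agent exchanges typed, validated structures like this instead of free-form prompts, which is what makes the resulting pipelines reproducible and debuggable.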

A small contrarian view

The ML community is underestimating the problems left to solve in tabular data and overestimating the demand for accuracy-optimized models. Most real-world data in business, healthcare, and finance is tabular (often multimodal). And most real-world systems need to be interpretable, calibrated, and deployable, not just accurate.

Endgame v1.0.0 is open source (Apache 2.0). Python 3.10+. If you work on production ML systems, especially in regulated domains, I'd genuinely value feedback.

GitHub: https://github.com/allianceai/endgame
Install: pip install endgame-ml

Happy to answer technical questions.