Show HN: ElasticMM – 4.2× Faster Multimodal LLM Serving (NeurIPS 2025 Oral)

ElasticMM is a newly released open-source serving system designed for modern multimodal large language models (MLLMs). The work was selected as an Oral presentation at NeurIPS 2025.

Unlike existing serving stacks such as vLLM, which are primarily optimized for text-only workloads, ElasticMM introduces Elastic Multimodal Parallelism (EMP), a new execution paradigm that adapts parallelism across different inference stages and modalities.

Key findings from the paper:

Up to 4.2× reduction in time-to-first-token (TTFT)

3.2×–4.5× higher throughput under mixed multimodal workloads

Core techniques: modality-aware scheduling, elastic stage partitioning, unified prefix caching, and non-blocking encoding

Paper (OpenReview PDF): https://openreview.net/pdf?id=Zd6VyjmN1S

GitHub repo: https://github.com/hpdps-group/ElasticMM

Curious to hear what the HN community thinks, especially from those building LLM/MLLM inference stacks or dealing with multimodal serving in production.
