I don't think HTML is the right approach. HTML is better than PDF, but it is still a format for displaying/rendering.
The actual paper content format should be separated from its rendering.
I.e. it should contain the abstract, sections, equations, figures, citations, etc., but it shouldn't specify font sizes, layout, etc.
Viewer platforms should then be able to style the content differently.
Perfect is the enemy of good. HTML is good enough. Let’s get this done.
And as another commenter has pointed out, HTML does exactly what you ask for. If it’s done correctly, it doesn’t contain font sizes or layout. Users can style HTML differently with custom CSS.
Wouldn’t that be CSS?
no
<div class="abstract-container">
  <div class="abstract">
    <pre><code> abstract text ... </code></pre>
  </div>
  <div class="author-list">
    <ol>
      <li>author one</li>
      <li>author two</li>
    </ol>
  </div>
</div>
should be just:
[abstract]
abstract text
[authors]
author one | email | affiliation
author two | email | affiliation
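To make the "content separate from rendering" idea concrete, here is a minimal sketch of a parser for the bracket-section format above. The format itself is only the commenter's illustration, so the `Paper` and `Author` types, the `parse_paper` function, and the rule that a missing field becomes an empty string are all assumptions made for this example, not an existing standard or tool.

```python
from dataclasses import dataclass, field


@dataclass
class Author:
    name: str
    email: str
    affiliation: str


@dataclass
class Paper:
    abstract: str = ""
    authors: list[Author] = field(default_factory=list)


def parse_paper(text: str) -> Paper:
    """Parse the hypothetical bracket-section format into a Paper."""
    paper = Paper()
    section = None
    abstract_lines = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # A line like "[abstract]" starts a new section.
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].lower()
        elif section == "abstract":
            abstract_lines.append(line)
        elif section == "authors":
            # Pad so a missing email or affiliation becomes "" (an assumption).
            parts = [p.strip() for p in line.split("|")] + ["", ""]
            paper.authors.append(Author(*parts[:3]))
    paper.abstract = " ".join(abstract_lines)
    return paper


sample = """\
[abstract]
abstract text
[authors]
author one | email | affiliation
author two | email | affiliation
"""

paper = parse_paper(sample)
print(paper.abstract)
print([a.name for a in paper.authors])
```

The result carries no fonts or layout at all; a viewer could render the same `Paper` object as HTML, plain text, or anything else.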
Sounds like XML and XSL would be a great fit here. Shame it’s being deprecated…
Is this new or somehow updated? HTML versions of papers have been available for several years now.
EDIT: indeed, it was introduced in 2023: https://blog.arxiv.org/2023/12/21/accessibility-update-arxiv...
From the paper...
Why "experimental" HTML?
Did you know that 90% of submissions to arXiv are in TeX format, mostly LaTeX? That poses a unique accessibility challenge: to accurately convert from TeX—a very extensible language used in myriad unique ways by authors—to HTML, a language that is much more accessible to screen readers and text-to-speech software, screen magnifiers, and mobile devices. In addition to the technical challenges, the conversion must be both rapid and automated in order to maintain arXiv’s core service of free and fast dissemination.
No, I mean _arXiv_ has had experimental support for generating HTML versions of papers for years now. If you visit arXiv, you'll see that a lot of papers have generated HTML alongside the usual PDF, so I'm trying to understand whether the article discusses any new developments. It seems like it's not new at all.
>Did you know that 90% of submissions to arXiv are in TeX format, mostly LaTeX? That poses a unique accessibility challenge: to accurately convert from TeX—a very extensible language used in myriad unique ways by authors—to HTML, a language that is much more accessible to screen readers and text-to-speech software, screen magnifiers, and mobile devices.
Challenging. Good work!
Accessibility barriers in research are not new, but they are urgent. The message we have heard from our community is that arXiv can have the most impact in the shortest time by offering HTML papers alongside the existing PDF.
Hello, I was going through the HTML versions of my preprints on arXiv. Thank you for all that you guys do. Please do let me know if the community can contribute to this in any way.
Unfortunately I didn't see the recommendation there on what can be done for old papers. I checked, and only my papers after 2022 have an HTML version. I wish they'd make some kind of 'try html' button for those.
Do the older papers work via [ar5iv](https://ar5iv.labs.arxiv.org/)?
> View any arXiv article URL [in HTML] by changing the X to a 5
The line
> Sources upto the end of November 2025.
sounds to me like this is indeed intended for older articles.
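The "change the X to a 5" trick quoted above can be sketched as a small URL rewrite. This is only an illustration: the `to_ar5iv` helper is a hypothetical name, the paper ID in the usage line is just an example, and whether ar5iv actually has a rendering for a given paper is not something this code checks.

```python
from urllib.parse import urlsplit, urlunsplit


def to_ar5iv(url: str) -> str:
    """Rewrite an arxiv.org article URL to its ar5iv.org counterpart."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    if host != "arxiv.org":
        raise ValueError(f"not an arxiv.org URL: {url}")
    # Only the host changes; path, query, and fragment are kept as-is.
    return urlunsplit(parts._replace(netloc="ar5iv.org"))


print(to_ar5iv("https://arxiv.org/abs/1706.03762"))
```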
[Sept 2023] as per the Wayback Machine.
Pandoc can convert to svg. It can then be inlined in html. Looks just like latex, though copy/paste isn't very useful
Seeing the Gemini 3 capabilities, I can imagine a near future where file formats are effectively irrelevant.
Files.
Truth in general, if we aren't careful.
I was reading through this article too, glad to have found it on here
Maybe unpopular, but papers should be in a Markdown flavor (to be determined). Just to have them more machine-readable.
Compared to HTML, Markdown is very bad at being machine-readable.
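As a rough illustration of why tagged HTML is convenient for machines, here is a sketch that pulls the abstract out of a page using only the Python standard library. The markup and the `abstract` class name are made up for the example and are not arXiv's actual HTML; `ClassTextExtractor` is likewise a hypothetical helper.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text inside the first element carrying a given class."""

    def __init__(self, wanted_class: str):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0  # > 0 while inside the wanted element
        self.chunks = []
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif self.wanted_class in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


# Hypothetical markup, assumed only for this example.
html = (
    '<section class="abstract"><p>abstract text</p></section>'
    '<ol class="authors"><li>author one</li></ol>'
)

extractor = ClassTextExtractor("abstract")
extractor.feed(html)
abstract = " ".join("".join(extractor.chunks).split())
print(abstract)
```

Doing the same with Markdown would mean guessing from heading text, since Markdown has no way to label a block as "the abstract".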
Can't help but wonder if this was motivated in part by people feeding papers into LLMs for summary, search, or review. PDF is awful for LLMs. You're effectively pigeonholed into using (PAYING for) Adobe's proprietary app and models which barely hold a candle to Gemini or Claude. There are PDF-to-text converters, but they often munge up the formatting.
Not sure when you last tried, but Gemini, Claude, and ChatGPT have all supported pretty effective PDF input for quite a while.