How HTML changes in ePub

> Then there was the problem of fragility: any syntax problems with your XHTML and your users would get a blank screen

I don't call that fragile, I call that well-founded. It has always perturbed me that, when encountering an error, HTML parsers will guess what they think you meant instead of throwing it back. I don't want my parser to do guesswork with potentially undefined behavior. I don't want my mistakes to be obscured so they can later come back to bite me - I want to be called out on issues loud and clear before my users see them.

Perhaps it works in the context of hand-authored markup written with minimal effort, so I can see why the choice was made. These days it's yet another reason why the web is a precarious pile of sticks. HTML freely lets you put a broken, oddly-shaped stick right in the middle and topple the whole stack.

The people turning the web from a handcrafted document sharing system into the world's premier application platform should have made XHTML win.

It’s Postel’s law at the end of the day: “be conservative in what you do, be liberal in what you accept from others”. As a site owner I want my site to fail loudly and quickly before a user sees a problem; as a user I never want to see a problem.

ePub is in a nice place: the number of documents to check for errors is reasonable, and the resulting artefact is designed to be shipped and never (or rarely) amended. That means that we can shift the balance towards strict parsing. But for a web site of thousands (or millions) of documents that are being amended regularly, the balance shifts back to loose parsing as the best way of meeting user needs.
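
To make the "reasonable number of documents to check" point concrete, here is a minimal sketch of a well-formedness pass over the XHTML files inside an epub, which is just a zip archive. The function names and the demo archive are made up for illustration. It only checks XML well-formedness with Python's standard library, which is far less than a real epub validator does, and the standard-library parser will also reject HTML named entities such as `&nbsp;` because no DTD is loaded.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def find_malformed_xhtml(epub_file):
    """Return {member name: error message} for .xhtml members that are not well-formed XML."""
    errors = {}
    with zipfile.ZipFile(epub_file) as archive:
        for name in archive.namelist():
            if not name.endswith(".xhtml"):
                continue
            try:
                ET.fromstring(archive.read(name))
            except ET.ParseError as exc:
                errors[name] = str(exc)
    return errors


def make_demo_archive():
    """Build a tiny in-memory zip standing in for an epub (hypothetical, not a valid epub)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("text/good.xhtml", "<html><body><p>Fine</p></body></html>")
        # The <p> element is never closed, so a strict parser must reject this file.
        archive.writestr("text/bad.xhtml", "<html><body><p>Unclosed</body></html>")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for name, message in find_malformed_xhtml(make_demo_archive()).items():
        print(f"{name}: {message}")
```

Because the book is built once and then shipped, a check like this can run in the build step and fail the build, so the strict parser's error reaches the author and not the reader.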

Isn't the developer always the first user? With strict parsing, testing a site before launch would show you the problem right there and allow you to fix it and launch a bug-free site.

What about a 5-year-old client hitting a new server, or the reverse? Is the only solution "just don't do that"?

Postel's Law sounds nice but it can result in major problems. It results in a de facto spec that differs from the written spec, and disagreements about what a piece of data actually means can lead to bugs and even security vulnerabilities.

Having strictly parsed HTML from the start would be fine. You'd check it before you ship it and you'd make sure it's valid.

Requiring it now would be a disaster, of course. There's so much malformed HTML out there. But making HTML parsers accept garbage at the beginning was the wrong choice.

The widespread acceptance of Postel’s Law also encourages poor authorship, because if you know clients have to be liberal in what they accept, there is no incentive to be conservative in what you send.

The problem is that you are collapsing two users with very different needs into a single one.

1. If you are authoring an XHTML file, yes, you want the renderer to be as picky as possible and yell loud and clear if you make a mistake. This helps you produce a valid, well-formed document.

2. If you are an end user reading an XHTML file, it's not your file, it's not your fault if there are bugs, and there's jack shit you can do to fix it. You just want the browser to do its best to show you something reasonable so you can read the page and get on with your life.

XHTML optimizes for 1 at the expense of 2. HTML5 optimizes for 2 at the expense of 1.
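
A rough sketch of the two behaviours, using only Python's standard library. The helper names are made up for this example, and `html.parser` is merely a tolerant tokenizer, not the HTML5 parsing algorithm browsers implement, so treat it as a stand-in for "lenient" and nothing more.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

BROKEN = "<p>Some <b>bold text</p>"  # the <b> element is never closed


class TextCollector(HTMLParser):
    """Lenient reader: collects whatever text it comes across."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def read_strict(markup):
    """Author's view (case 1): raise ET.ParseError on any well-formedness error."""
    return "".join(ET.fromstring(markup).itertext())


def read_lenient(markup):
    """Reader's view (case 2): never complain, return whatever text was recovered."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


if __name__ == "__main__":
    print(read_lenient(BROKEN))
    try:
        read_strict(BROKEN)
    except ET.ParseError as exc:
        print(f"strict parser refused the document: {exc}")
```

The same input gives the reader readable text in one case and gives the author an error in the other, which is the trade-off described in the two cases above.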

For some reason generating a valid wire format seems to be no problem for people when it comes to JSON. Forgot to escape a quote? Whoops, that’s on me, should have used a serializer.

But add a few angled braces in there and lord have-a mercy, ain’t nobody can understand this ampersand mumbo jumbo, I wanna hand write my documents and generate wutever, yous better jus deal with it gosh dangit.

I prefer the current situation too, but I still think it’s funny that somehow we just never bought into serializers for HTML. Maybe the idea was before its time? I’m sure you’d have no such parsing problems in the wild if you introduced JTML now. Clearly people know how to serialize.
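
For what it's worth, the serializer route exists for markup too. A minimal sketch with Python's standard library follows; the sample string and variable names are invented for the example, and a real project would more likely use a templating engine with auto-escaping.

```python
import json
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

title = 'Tom & Jerry say "hi" <loudly>'

# JSON: nobody escapes by hand, the serializer takes care of it.
as_json = json.dumps({"title": title})

# Markup written by string pasting: the raw & and < make this ill-formed XML.
hand_written = f"<p>{title}</p>"

# Markup with the text escaped before pasting: well-formed.
escaped = f"<p>{escape(title)}</p>"

# Markup produced by a serializer from a tree: text and attributes are escaped for us.
element = ET.Element("p", {"title": title})
element.text = title
serialized = ET.tostring(element, encoding="unicode")

if __name__ == "__main__":
    print(as_json)
    print(escaped)
    print(serialized)
    try:
        ET.fromstring(hand_written)
    except ET.ParseError as exc:
        print(f"hand-written markup is rejected: {exc}")
```

Building the tree and serializing it is the markup equivalent of `json.dumps`: the author never touches the escaping rules directly.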

> 2. If you are an end user reading an XHTML file, it's not your file, it's not your fault if there are bugs, and there's jack shit you can do to fix it. You just want the browser to do its best to show you something reasonable so you can read the page and get on with your life.

Here's the thing though, if all XHTML clients are strict about it then that means the content is broken for EVERYONE which presumably means it gets noticed pretty quickly as long as the site is being maintained by anyone.

Compare that to HTML where, if a page is doing something wrong but in a way that happens to work in WebKit/Blink while it barfs all over the place in Gecko, it could go ignored for ages. Those of us who are old enough remember an era where a huge number of web sites only targeted Trident and didn't care in the slightest whether it worked in any other engine.

There has to be an opposite to Postel's Law that acknowledges it's better in some cases to ensure that breakage for anybody becomes breakage for everybody because that means the breakage can't be ignored.

If authored files had to be valid in order to work, how would the author have sold you an invalid file in the first place? They would have seen that it didn’t work when they were making it, and fixed it. If they’d sold you a book that didn’t open, you’d be entitled to a refund.

Point 1 ensures that point 2 doesn't happen. Not doing 1 pretty much guarantees that 2 will happen.

My favorite is how this interacts with the oh so fun mistake many people make of adding a `<div/>` thinking they are doing it right.

Especially since it is correct in JSX, adding to the confusion.

It doesn't help that self-closing/void tags are actually a thing in HTML: input, img, meta, br. These can be misleading to a naive developer. You can try to say something about how a div requires text content, but empty divs get used structurally by design teams all the time in practice...
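
A small sketch of why `<div/>` bites. An XML parser treats it as an empty element, while an HTML parser ignores the trailing slash on a non-void HTML element and reads it as an opening `<div>`, so whatever follows ends up nested inside it. The regex below is a rough illustration and not a real tokenizer: it will wrongly flag self-closing SVG or MathML elements, and it will trip over attribute values containing `>`. The function name and the snippet are made up for the example; the void-element list is the one in the HTML Living Standard as far as I know.

```python
import re
import xml.etree.ElementTree as ET

# Void elements per the HTML Living Standard (assumed current at the time of writing).
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})

# Rough pattern for a tag written with self-closing syntax.
SELF_CLOSING = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^<>]*/>")


def misleading_self_closing_tags(html_source):
    """Return names of non-void elements written with self-closing syntax."""
    return [
        match.group(1).lower()
        for match in SELF_CLOSING.finditer(html_source)
        if match.group(1).lower() not in VOID_ELEMENTS
    ]


snippet = '<section><div class="spacer"/><br/><p>Text</p></section>'

if __name__ == "__main__":
    # In XML the div is empty and has two siblings after it.
    root = ET.fromstring(snippet)
    print([child.tag for child in root], "children of div:", len(root[0]))
    # In HTML the same bytes would open a div that swallows the rest.
    print("would be misread by an HTML parser:", misleading_self_closing_tags(snippet))
```

A lint like this is only useful for documents served as HTML; in XHTML (and therefore in ePub) the self-closing form is correct.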

It's hilarious that browsers will use loosey-goosey parsing on the HTML of a web page but strictly interpret a JPEG to which its img tag refers.

Why? Why do car rentals typically not require cash up front, but hotel rentals universally do? The economics are similar. Sometimes it's simple path dependency. Something is a certain way because it's always been that way and it'd be too expensive to change it now.

At least browsers all use the same loosey-goosey HTML parsing now. It was hell on earth when each browser had its own error recovery strategy. In a sense, there is no longer any such thing as an invalid HTML document: every code point sequence has a canonical interpretation now.

What about evolving standards in a system that must handle clients or servers which implement anything from tomorrow's features today to those of 10 years ago? Shouldn't failures be as graceful as possible?

Author here, happy to answer any questions / clarify anything.

Thanks for all the explanations. I always thought it was regular HTML, but now I know to watch out for the differences.

Can you say a few more words about the library https://github.com/standardebooks/tools ? Can it generate ePub3 from markdown files, or do I have to feed it HTML already? Any repo with usage examples of the `--white-label` option would be nice.

The tooling does two main things: create a valid epub3 skeleton for your content, and build your book into “compatible”, Kobo and Kindle versions. You need to supply the valid XHTML.

There’s more info on the build process at https://news.ycombinator.com/item?id=46469341

Are there any sites that provide e-reader engine support charts for ePub, similar to what MDN provides for HTML?

Thank you, this was very interesting. I knew about XHTML in EPUB, but not that it was extendible - I had assumed it was mostly frozen.

Others have requested a list of features supported in different e-readers. I second this request.

Sometimes I wish some kind of weird disaster would strike that somehow only erases the protocols and styling/markup languages invented in the last 60 years -- without losing any data -- to force us to start over, but with the benefit of hindsight.

Oh, and JavaScript.

> A few decades ago XML emerged from the pit. XML [...] could be used for documents, data transfer, and a bunch of other things, and people genuinely liked it [...] They liked it so much that a concerted effort was started to take HTML and rebuild it on top of XML.

XML didn't "emerge" and then get repurposed for HTML; it was designed for new vocabularies on the web. The first sentence of the XML spec reads:

> The Extensible Markup Language (XML) is a subset of SGML that is completely described in this document. Its goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML.
