
Data evolution with set-theoretic types

Author here. This is probably the article that took me the longest to write, roughly 15 months, and I may still not have explained every concept with the clarity it deserves and that I intended. If there are any questions or feedback, I'd be glad to answer them!

4 days ago | josevalim

This is great stuff! As a Haskell fan who uses Elixir (and Erlang) in my day job, I'm really looking forward to seeing how we can leverage more of the new set-theoretic type features showing up in Elixir. Revisions seem like something I've never really seen before in any of the type systems I've used, so I'm excited to see how it can be leveraged in a real production system. Thanks for the post.

3 days ago | ddellacosta

Stuff like this is why I don't like type systems. What you want to do is easy, but it becomes difficult to explain in a sane way (15 months difficult), because you need to work around the limitations of type systems. When you say "set-theoretic types", I hear, "get rid of types, just give me logic".

3 days ago | practal

The work to develop the base theory, which this article presents, took 15 months, but it doesn't take 15 months to read it (and hopefully it won't take as long to use it either). Whenever you use a programming language, you may work with data structures that took months to formalize and several more years to optimize, yet no one is saying "throw our data structures away". Even things like pretty printing and formatting a float have collectively several years of research behind them, yet the API is often a single function call.

Of course, you can still not like types, and making it harder to evolve libraries over time is a good reason. But using the time it takes to formalize their underlying concepts is not a strong argument against them. The goal is that someone will spend this time, precisely so you don't have to. :)

3 days ago | josevalim

I have seen multiple users of one certain popular programming language claim that data structures besides a dynamic length array and a hash table have no useful application.

3 days ago | Conscat

They wouldn't also happen to throw out memory safety guarantees one level up by using a ton of array indexing, would they?

3 days ago | caspper69

Clojure? Sounds like something Hickey would say.

3 days ago | tadfisher

Oh, I like formalising things, don't get me wrong, and I don't mind spending time on it at all. I just don't like doing it via types, and looking at how much time you spent on what, I rest my case.

3 days ago | practal

> what you want to do is easy

Easy to implement, hard to get correct. It inverts where you do work in a system. It can be hard to implement robust types, but once that’s done it’s easy to know what you are writing is correct.

3 days ago | ryanschaefer

> because you need to work around the limitations of type systems.

What limitations of type systems are you talking about?

> When you say "set-theoretic types", I hear, "get rid of types, just give me logic".

A type theory/type system is a logic, so I don't really understand this point. https://en.wikipedia.org/wiki/Type_theory#Logic

3 days ago | ddellacosta

Yes, a type theory is a logic, but not a particularly good one. It is limiting to have to under-approximate the mathematical universe via statically typed chunks before being able to talk about the objects in the universe, and the article just describes another instance of why that is problematic in practice.

The alternative I propose is simpler than you would think, and definitely simpler than type theory, but it is also very new: http://abstractionlogic.com . And of course, it doesn't come with a fixed algorithm for type checking, but going forward and in the age of AI, I think that will turn out to be an advantage.

3 days ago | practal

> Yes, a type theory is a logic, but not a particularly good one.

Given the context of the original post, type theory at least has the benefit of actually existing in industrial languages in the form of a bunch of implemented and in-use type systems.

> It is limiting to have to under-approximate the mathematical universe via statically typed chunks before being able to talk about the objects in the universe, and the article just describes another instance of why that is problematic in practice.

>

> The alternative I propose is simpler than you would think, and definitely simpler than type theory, but it is also very new: http://abstractionlogic.com . And of course, it doesn't come with a fixed algorithm for type checking, but going forward and in the age of AI, I think that will turn out to be an advantage.

That's helpful in understanding where you're coming from, but now I'm even more perplexed--what you've linked to seems extremely theoretical. The closest to something my industrial programmer brain can grasp and read that I'd care about (in the context of the original post we're discussing) is https://obua.com/publications/automating-abstraction-logic/1..., which seems focused on theorem proving.

None of this is meant to suggest that I don't see the value in what you've linked to--it will take me some time to absorb, but I will take a closer look--but if you want to come into a discussion about a post like this one and criticize it, and more generally type systems, I think you'd be better off showing more directly how your approach can solve the same problem (or how it doesn't need to solve it, which it sounds like you're implying) in a way that doesn't take reading through a bunch of fairly abstract posts and papers or watching a five-part video series first. And if there's no actual implementation (that I can see at least) then don't expect me, an industrial programmer--like most folks on HN I would guess vs. e.g. lambda-the-ultimate or whatever--to consider it worthwhile past idle curiosity.

I see value in Valim's post because it identifies a problem with type systems that I understand and have encountered, and provides an incremental and pragmatic solution to that. It doesn't seem like you're offering a coherent, practical alternative, regardless of the validity of your arguments ("It is limiting to have to under-approximate the mathematical universe via statically typed chunks before being able to talk about the objects in the universe" makes intuitive sense to me in this context, at least).

2 days ago | ddellacosta

How much work you want to put into understanding abstraction logic is very much up to you.

There are no tools for abstraction logic right now. I hope there will be in 15 months ;-) Maybe check back then.

2 days ago | practal

While it is a bit unfortunate that the time spent to write the article continues to be used as a criticism, and perhaps as a dig, I’ll only add that I have no formal background in type theory or logic. So much of the 15 months was also spent on learning the underlying concepts, mostly in my free time, and I am quite happy to even produce _something_ on a topic I have only recently got into. I am sure people smarter or more familiar with the theory than me would produce more in less time.

2 days ago | josevalim

No, I was just joking in my last comment, as I know how time-consuming true progress can be. I'm also familiar with how type theory can overcomplicate simple things. When I see this happening, I can't help but point it out, although my criticism might seem out of context. I would be very happy if I could deliver tools for abstraction logic in just 15 months.

2 days ago | practal

These are great examples of difficulties people will encounter in all popular statically typed languages sooner or later.

I find the solution presented interesting but it is limited to the 3 operations mentioned.

The alternative to this is runtime schema checks and treating your data as data - the data you are working with is coming from an external system and needs to be runtime-validated anyways.

Throw in some nil-punting and you can have a very flexible system that is guarded by runtime checks.
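For what it's worth, in Elixir terms that could be as simple as a runtime check at the boundary (field names made up for illustration):

```elixir
# A minimal sketch of "treat data as data": validate the incoming map at
# runtime and pass plain data along. The "id"/"email" fields are illustrative.
defmodule Input do
  @required ["id", "email"]

  def validate(%{} = params) do
    case Enum.reject(@required, &Map.has_key?(params, &1)) do
      [] -> {:ok, Map.take(params, @required)}
      missing -> {:error, {:missing_fields, missing}}
    end
  end
end

# Input.validate(%{"id" => 1, "email" => "a@b.c"})
# #=> {:ok, %{"id" => 1, "email" => "a@b.c"}}
```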

3 days ago | beders

Agreed that it is naive to assume that static types will keep some kind of consistency on the bytes that come in on the wire, but that misses the point.

With static types, you do one transformation from bytes to a proper data type at the boundary.

The static type information propagates bidirectionally deep into the guts of the system.

If you're 10 layers in, and want to invoke getUserId() on an object, you just add it to your type. Now your endpoint can return 400s without you needing to add extra nil-checks somewhere (or multiple places) from layers 1 to 10 (which realistically will turn into 500s).
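To sketch the idea in Elixir terms (struct and field names are made up, not from the article):

```elixir
# Decode the raw payload once at the boundary into a struct, so code ten
# layers down can rely on the field being present instead of nil-checking.
defmodule User do
  @enforce_keys [:id]
  defstruct [:id, :name]
end

defmodule Boundary do
  # The 400 happens here, at layer 1, if the payload is malformed.
  def decode(%{"id" => id} = params), do: {:ok, %User{id: id, name: params["name"]}}
  def decode(_other), do: {:error, :bad_request}
end

defmodule DeepLayer do
  # Layer 10: the struct guarantees :id is present.
  def user_id(%User{id: id}), do: id
end
```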

3 days ago | mrkeen

This is some interesting shit, I love it! At least the parts I think I understand :P

> The goal of data versioning is to provide more mechanisms for library authors to evolve their schemas without imposing breaking changes often. Application developers will have limited use of this feature, as they would rather update their existing codebases and their types right away. Although this may find use cases around durable data and distributed systems.

"Hard cutover" is definitely a lot less laborious, but often times incremental updates are necessary - large teams, existing tables and whatnot. To that end, I would foresee a lot of appeal in app dev when applied to database migrations.

3 days ago | Moosieus

Some initial concerns when thinking about this:

- Understanding how it expands the surface area of both the API and discussion of a struct (and thus also libraries and dependencies of the librar{y,ies}). Now you not only need to ask someone whether they're using v1.0 of some package, but also which revision of a struct they're using, which they may not even know. This compounds when a library releases a new breaking major version, presumably - `%SomeStruct{}` for v2 may be different from `%SomeStruct{}` v1r2.

- Documentation seems like it would be more complex, as well. If `%SomeStruct{}` in v1.0 has some shape, and you publish `v1.1`, then realize you want to update the revision of `%SomeStruct{}` with a modified (or added) field, would there be docs for `v1.1` and `v1.1r2`? or would `v1.1` docs be retroactively modified to include `%SomeStruct{}v1.1r2`?

- The first example is a common situation, where an author realizes after it's too late that the representation of some data isn't ideal, or maybe even isn't sufficient. Typically, this is solved with a single change, or rarely in a couple or few changes. I'm not sure if the complexity is worth it. I understand the desire to not fragment users due to breaking changes, but I'm not sure if this is the appropriate solution.

- How does this interact with (de-)serialization?

I happen to be working with the SQLite file format currently, and I generally really enjoy data formats. It's not exactly the same as this, since runtime data structures are technically ephemeral, but in reality they are not. The typical strategy for any blob format is to hoard a reasonable amount of extra space for when you realize you made a mistake. This is usually fine, since previous readers and writers ought to ignore any extra data in reserved space.

One of the things one quickly realizes when working with various blob formats is that the header usually has (as a guess) 25% of it allocated to reserved space. However, looking at many data formats over the last several decades, it's extraordinarily rare to ever see an update to them that uses it. Maybe once or twice, but it's not common. One solution is to have more variable-length data, but this has its own problems.
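As a toy illustration (a made-up header layout, not SQLite's actual one), the reserved-space trick looks like:

```elixir
# Hypothetical blob header: 4-byte magic, 2-byte version, 26 reserved bytes
# that current readers skip over, then the body. Old readers keep working
# even if a later revision starts using the reserved region.
blob = "BLOB" <> <<1::16>> <> :binary.copy(<<0>>, 26) <> "payload"

<<magic::binary-size(4), version::16, _reserved::binary-size(26), body::binary>> = blob

IO.inspect({magic, version, body})  #=> {"BLOB", 1, "payload"}
```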

So, in general, I'm very interested in this problem (which is a real problem!), and I'm also skeptical of this solution. I am very willing to explore these ideas though, because they're an interesting approach that doesn't have prior art to look at, as far as I know.

EDIT: Also, thanks for all the work to everyone on the type system! I'm a huge fan of it!

3 days ago | harrisi

> This compounds when a library releases a new breaking major version, presumably - `%SomeStruct{}` for v2 may be different from `%SomeStruct{}` v1r2.

You should never reuse the revisions. If you launch a major version, then it means you only support r5 onwards. Do not reset it back to r1.

I am also not sure if the user needs to know the version. Remember that if I am in r5, because of subtyping, the code naturally supports r1, r2, r3, and r4, and those already have to be pretty printed. All of the work here is to "generate automatic proofs" via type signatures; no new capability or requirement is being introduced on the type representation of things.

> Documentation seems like it would be more complex, as well. If `%SomeStruct{}` in v1.0 has some shape, and you publish `v1.1`, then realize you want to update the revision of `%SomeStruct{}` with a modified (or added) field

I don't think documentation is more complex because that's something already factored in today. Functions add new options in new releases. Structs add new fields. And today we already annotate the version they were added. Perhaps this approach can automate some of it instead.

---

Other than that, we should be skeptical indeed. :)

3 days ago | josevalim

As a preamble, versioning is difficult to discuss since it's abstract, so this is hard to talk about in general terms.

> You should never reuse the revisions. If you launch a major version, then it means you only support r5 onwards. Do not reset it back to r1.

Hm, so hypothetically a v1.0.0 of a library could begin its life with structs at a greater-than-1 revision number. This seems odd, since a major version bump is allowed to be breaking, but revisions don't have the same semantic meaning - a revision just means "it's different from before".

This kind of breaks my mental model of major version bumps. I consider v1 of something to be distinct from v2, where it just happens to be that v2 typically is mostly compatible with v1. However, with revisions, it maintains that connection explicitly. I suppose the answer to that would be to rename the structs and have them start at revision 1, but now we're back at square one, where I could've just done that without revisions to maintain backwards compatibility by introducing a new struct.

I'm not sure if I'm conveying what's odd about that sufficiently. Let me know if I should think on that more and try explaining it again.

> I don't think documentation is more complex because that's something already factored in today. Functions add new options in new releases. Structs add new fields. And today we already annotate the version they were added. Perhaps this approach can automate some of it instead.

The expectation here would be that a struct revision would be a minor version bump, correct? That seems like it kind of breaks semantic versioning, since a user of v1.0.0 wouldn't necessarily have any behavioral changes in v1.1.0, since the structs and users of the structs would be identical, right? It kind of pushes the minor version bump onto the consumer code at their discretion, but not in a typical way. Now if I bump a dependency's minor, I don't necessarily want to bump my own minor - but maybe I do? Especially if I'm a middleman library, where I pass behavior from client code to one of my dependencies.

Considering these hypothetical situations is getting a bit confusing. The main gut feeling I have about this is the same as about versioning in general - it's fine until it's not, which is usually some unknown point in the future when things get complex. The gist of my hesitation is just that versioning is already a complicated problem, and introducing more variation into it seems like it really needs to be a long experiment with the expectation that it may not work. Unfortunately, the situations in which versioning rears its ugly head are typically only in long-used, complex projects. So I don't really know how a test bed for this could exist to give realistic results.

As a bit of a disclaimer to everything I just said, semver is not the holy grail, in my opinion, and I think it's perfectly reasonable to experiment with it and try alternatives. Maybe part of my issue is just that I haven't entertained other versioning schemes that can sufficiently handle the difference between behavioral and representational changes.

Thanks for the discussion!

3 days ago | harrisi

> This kind of breaks my mental model of major version bumps.

If you want, you can reset the revisions in a major version, as you could rename all modules and change the behavior of all functions, but that's hardly a good idea. Revisions are not any different. If you give a revision a new meaning in a major version, it is going to be as confusing as giving a new implementation to a function: users upgrading will have to figure out why something has subtly changed, instead of a clear error that something has gone away.

> The expectation here would be that a struct revision would be a minor version bump, correct?

We could make either minor or major versions (as described by semver) work. For example, if we want to allow new revisions to be minor versions, we could make it so every time you install a dependency, we automatically lock all structs it provides. This way, if a new version is out, you have to explicitly upgrade to any new revision too.

Of course, you could also release a new major version of the library once you introduce a new revision and _that's fine_. The goal is not to avoid major versions of the library that defines a struct, the goal is to avoid forcing its dependents to release major versions, which has a cascading effect in the ecosystem. Putting it in Elixir terms, I want to be able to release Phoenix v2 without forcing LiveView to immediately become v2. LiveView should be able to stay on v1.x and say it supports both Phoenix v1 and v2 at once. But in most typed languages, if you change a field definition, it is game over.
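In dependency terms (the version numbers here are purely illustrative), the goal is for a downstream library to be able to declare something like this mix.exs excerpt:

```elixir
# Hypothetical mix.exs entry: the library stays on its own v1.x while
# accepting two major versions of its dependency at once.
defp deps do
  [
    {:phoenix, "~> 1.7 or ~> 2.0"}
  ]
end
```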

The guarantees you need to provide are not that many: as long as two revisions overlap at one point in time, you can offer a better upgrading experience to everyone, instead of assuming *all code must be updated at once*.

Thanks for the convo!

2 days ago | josevalim

I forgot to mention: the typical workaround in these situations, for added elements, is to add `_ex` or the like, which is not a great solution. You can see this in various places in Erlang, especially around erl_interface and related aspects.

On the flip side, what if you realize a data structure needs to remove elements? Say revision 2 adds a field, but revision 3 removes it?

3 days ago | harrisi

Yea, this is an issue rather near and dear to my heart (due to pain). I very much appreciate strong and safe types, but it tends to mean enormous pain when making small obvious fixes to past mistakes, and you really don't want to block those. It just makes everything harder in the long term.

As an up-front caveat for below: I don't know Elixir in detail, so I'm approaching this as a general type/lang-design issue. And this is a bit stream-of-consciousness-y and I'm not really seeking any goal beyond maybe discussion.

---

Structural sub-typing with inference (or similar, e.g. perhaps something fancier with dependent types) seems like kinda the only real option, as you need to be able to adapt to whatever bowl of Hyrum Slaw[1] has been created when you weren't looking. Allowing code that is still provably correct to continue to work without modification seems like a natural fit, and for changes I've made it would fairly often mean >90% of users would do absolutely nothing and simply get better safety and better behavior for free. It might even be an ideal end state.

I kinda like the ergonomics of `revision 2` here, it's clear what it's doing and can provide tooling hints in a rather important and complicated situation... but tbh I'm just not sure how much this offers vs actual structural typing, e.g. just having an implicit revision per field. With explicit revisions you can bundle interrelated changes (which is quite good, and doing this with types alone ~always requires some annoying ceremony), but it seems like you'll also be forcing code to accept all of v2..N-1 to get the change in vN because they're not independent.

The "you must accept all intermediate changes" part is in some ways natural, but you'll also be (potentially) forcing it on your users, and/or writing a lot of transitioning code to avoid constraining them.

I'm guessing this is mostly due to Elixir's type system, and explicit versions are a pragmatic tradeoff? A linear rather than combinatoric growth of generated types?

>It is unlikely - or shall we say, not advisable to - for a given application to depend on several revisions over a long period of time. They are meant to be transitory.

An application, yes - applications should migrate when they are able, and because they are usage-leaf-nodes they can do that without backwards compatibility concerns. But any library that uses other libraries generally benefits from supporting as many versions as possible, to constrain the parent-library's users as little as possible.

It's exactly the same situation as you see in normal SAT-like dependency management: applications should pin versions for stability, libraries should try to allow as broad of a range as possible to avoid conflicts.

>Would downcasting actually be useful in practice? That is yet to be seen.

I would pretty-strongly assume both "yes" and "it's complicated". For end-users directly touching those fields: yes absolutely, pick your level of risk and live with it! This kind of thing is great for isolated workarounds and "trust me, it's fine" scenarios, code has varying risk/goals and that's good. Those happen all the time, even if nobody really likes them afterward.

But if those choices apply to all libraries using it in a project... well then it gets complicated. Unless you know how all of them use it, and they all agree, you can't safely make that decision. Ruby has refinements which can at least somewhat deal with this, by restricting when those decisions apply, and Lisps with continuations have another kind of tool, but most popular languages do not... and I have no idea how possible either would be in Elixir.

---

All that probably summarizes as: if we could boil the ocean, would this be meaningfully different than structural typing with type inference, and no versioning? It sounds like this might be a reasonable middle-ground for Elixir, but what about in general, when trying to apply this strategy to other languages? And viewed through that lens, are there other structural typing tools worth looking at?

[1] https://en.wiktionary.org/wiki/Hyrum%27s_law

3 days ago | Groxx

Thank you for the comments. Your questions at the end resonate a lot with what I have been asking myself!

> Structural sub-typing with inference

Can we have structural sub-typing with inference that is relatively fast and will generate reasonable error reports? We have been bulking up the amount of inference for dynamic code in our system and sometimes the inferred types get quite large, which can make troubleshooting daunting. In any case, better inference is a win even without taking data evolution into account.

> but tbh I'm just not sure how much this offers vs actual structural typing

The bulk of the work is definitely achieved by structural typing. The revisions help generate automated type signatures that guarantee you have not widened the output for old versions. If all you have is inference, you could accidentally introduce breaking changes?

I guess there may be some automated way where we could check old inferred types against new ones but I am not sure how it could be done without annotating that _something_ has changed?

> The "you must accept all intermediate changes" part is in some ways natural, but you'll also be (potentially) forcing it on your users, and/or writing a lot of transitioning code to avoid constraining them.

Theoretically, you do not need to support all revisions, only more than one revision at a time. A library that provides r1-r2-r3 can be supported downstream through the pairs `r1-r2` and then `r2-r3`, and that should hopefully provide a smoother upgrade experience to everyone compared to `r1`, `r2`, and `r3` being part of distinct major versions.
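As a rough, hypothetical illustration (field names made up, side-stepping the article's actual revision syntax), downstream code bridging two adjacent shapes can simply match on whether the newer field is present:

```elixir
# Hypothetical: a newer revision added an optional :source field. One clause
# handles the new shape, the other keeps the old shape working unchanged.
defmodule Downstream do
  def describe(%{name: name, source: source}) when not is_nil(source) do
    "#{name} from #{source}"
  end

  def describe(%{name: name}) do
    name
  end
end
```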