Vapour: A typed superset of the R programming language

I have some questions that are not answered by the homepage.

1) How does this work with function parameters that are intended to be captured unevaluated with substitute()? Do you type the input as "any" and document separately that the parameter is kept "unevaluated" as a symbol/name or call?

2) How does this work with existing untyped R code? Does it at least include types for the standard library (or some subset thereof)?

3) Is there any type inference, or does it require explicit type annotation everywhere?

4) How do you propose to handle NA (which can appear "within" any typed vector)? Does the compiler support refinement types? If not, how does checking for and preventing nullability work, when checking for NA values requires a runtime check?

5) How do data frames work? Are they typed like structs?

6) Which object systems does it support, if any? S3, S4, Reference Classes, or the 3rd-party R6?

As much as I like static types, I feel like R is maybe the language where I need or want them the _least_. How often do you really run into a situation where you pass a character vector to a function that requires a numeric vector and it crashes your program?

99% of the time what you really want is known-valid data frames for data processing, and statically-sized arrays for math stuff.

> As much as I like static types, I feel like R is maybe the language where I need or want them the _least_.

I really disagree with this.

I think one of the main reasons there is a whole Tidyverse ecosystem is that the behavior of (some) R code is unintuitive in a way that adding typing would absolutely improve.

It seems like you're deeply familiar with the R ecosystem, but as a user what I want is a safe subset of R that I can use.

> How often do you really run into a situation where you pass a character vector to a function that requires a numeric vector and it crashes your program?

In R the more likely situation is that you pass in the wrong typed thing and it silently continues with very unexpected values being passed, causing trouble or errors much later in the program. Which is very much a problem that typing helps with.

> In R the more likely situation is that you pass in the wrong typed thing and it silently continues with very unexpected values being passed, causing trouble or errors much later in the program. Which is very much a problem that typing helps with.

Can you name one practical example of this happening as a result of passing in a vector of the wrong class()/mode()? Not a data frame with the wrong column types, but an actual standalone vector. Can you name an example in the Tidyverse ecosystem that specifically improves on the type-safety ("class/mode-safety") of the standard library? I can't, but maybe that's just because it's been too long since I did anything serious with the language.

I can definitely think of complicated interfaces where you can silently get strange results by passing in the wrong thing. sweep() and apply() are obvious examples, where you can accidentally swap the argument order and silently get a nonsensical result. But that's a matter of array shape, not of type. Try passing an argument of the wrong type (again, where "type" in this case means the class or mode of the vector) to sweep() or apply(), and watch what happens: you get an error message informing you that you passed a value of the wrong type. At worst, you get an obtuse error message informing you that you passed a value of the wrong type, but bubbled up from some internal code. But you get an error all the same.

R is actually very strongly-typed, and abstracts over some details that would otherwise cut into that type-strength. For example, R doesn't have the Numpy problem of exposing different physical storage sizes for integers and floats! It just has abstract numeric arrays, backed by whatever the hell storage type the R language implementers decided to back them with, with no opportunity for the user to accidentally mix things up and lose precision, overflow, or crash on contact with some pre-compiled Numba function.

I maintain that a much, much more pertinent problem is that array shape is not part of the type system, and moreover that a lot of R code is (by design) highly polymorphic with respect to array shape, precisely because there is no such thing as a "scalar" number or string but we still want to let people use numbers and strings in scalar-like fashion.

NULL I think falls into this category as well. NULL in R is a bit like nil in Lua or undefined in JavaScript, in that it has a kind of dual function as a "value that is not any other value" and a "non-value that cannot be inserted into a collection, instead deleting whatever was previously there". But when was the last time someone got a NULL and a numeric vector mixed up, and wasn't able to figure out what happened? Is all the complexity of a static compiler really necessary to catch that relatively rare mistake?

Maybe the one exception here is the factor class. But there's no mention of factors here, and (as with array shape), validating factor levels is probably more important than validating that the thing is a factor in the first place, as opposed to character.

The NA checking proposed is another story. Now that would be useful, but so would checking things like min/max ranges, the presence of certain columns in a data frame, etc. For example Python has its data frame input validation framework Pandera that offers at least some of these guarantees at the type level.
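To make the kind of checks listed above concrete (required columns, NA, min/max ranges), here is a hand-rolled sketch in plain Python. It is not Pandera's API and not anything Vapour provides; `validate_frame`, the schema keys, and the dict-of-lists stand-in for a data frame are all made up for illustration.

```python
import math


def validate_frame(frame, schema):
    """Return a list of problems; an empty list means the frame passed.

    `frame` maps column name -> list of values (a stand-in for a data frame).
    `schema` maps column name -> dict with optional keys
    'type', 'min', 'max', 'allow_na'. All of these names are hypothetical.
    """
    problems = []
    for name, rule in schema.items():
        if name not in frame:
            problems.append(f"missing column: {name}")
            continue
        for i, value in enumerate(frame[name]):
            # Treat None and float NaN as the missing-value marker.
            is_na = value is None or (isinstance(value, float) and math.isnan(value))
            if is_na:
                if not rule.get("allow_na", False):
                    problems.append(f"{name}[{i}]: NA not allowed")
                continue
            if "type" in rule and not isinstance(value, rule["type"]):
                problems.append(f"{name}[{i}]: expected {rule['type'].__name__}")
                continue
            if "min" in rule and value < rule["min"]:
                problems.append(f"{name}[{i}]: {value} below {rule['min']}")
            if "max" in rule and value > rule["max"]:
                problems.append(f"{name}[{i}]: {value} above {rule['max']}")
    return problems


schema = {
    "age": {"type": int, "min": 0, "max": 130},
    "name": {"type": str, "allow_na": True},
}
frame = {"age": [34, None, 250], "name": ["a", None, "c"]}
problems = validate_frame(frame, schema)
print(problems)
```

Every one of these checks needs the actual values, so they run at load time; a static checker can at best confirm that the validated result is what downstream code asks for.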

As for classes, I noticed that they implement what looks like a nice concise syntax for creating S3 class objects with structure(). That's great, but you could have just written a helper library to do that.

Anyway, here's a project where someone designed a whole language and wrote a compiler for it, and I'm just one cantankerous former R user doubting whether that project is ever going to be useful. If this is just a hobby project to scratch someone's itch: ignore me. But if this is intended to be a serious thing for serious use in production, then I'd encourage the creators to reconsider how they portray their value proposition, and to maybe reconsider whether the goal of their project aligns with the needs and desires of actual R users in industry, of whom there are still many, but definitely not as many as there used to be.

You might sometimes end up with a factor with numerical labels, where I think you can get a surprise or two. E.g., the label is 2 but the underlying level code is 1.
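For anyone who hasn't hit this: a factor stores integer codes that index into its levels, so converting it straight to numeric gives you the codes, not the labels. Below is a rough Python model of that representation. `Factor`, `make_factor`, `as_numeric`, and `labels_as_numeric` are invented names for illustration (not R or Vapour APIs), and sorting the distinct labels is a simplification of how factor() picks its default levels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Factor:
    levels: tuple[str, ...]  # distinct labels, sorted
    codes: tuple[int, ...]   # 1-based positions into `levels`, as in R


def make_factor(values: list[str]) -> Factor:
    """Hypothetical stand-in for factor() applied to a character vector."""
    levels = tuple(sorted(set(values)))
    position = {label: i + 1 for i, label in enumerate(levels)}
    return Factor(levels, tuple(position[v] for v in values))


def as_numeric(f: Factor) -> list[float]:
    """Models as.numeric(f): returns the integer codes, not the labels."""
    return [float(code) for code in f.codes]


def labels_as_numeric(f: Factor) -> list[float]:
    """Models as.numeric(as.character(f)): converts the labels themselves."""
    return [float(f.levels[code - 1]) for code in f.codes]


f = make_factor(["2", "3", "2"])
print(as_numeric(f))         # [1.0, 2.0, 1.0]  (label "2" has code 1)
print(labels_as_numeric(f))  # [2.0, 3.0, 2.0]
```

Both results are perfectly valid numeric vectors, which is why nothing complains when you use the wrong one.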

Static types seem like a bad idea for most R use cases. Contracts, on the other hand, would be absolutely stellar. À la SQLite.

Is there not a decent contracts framework for R yet?

There is. But that's still a far cry from a fully integrated language feature.

As an R programmer the examples given on the landing page seem very foreign to me -- you are almost always writing vectorized code in R, so I would think that would be front and center.

    let x: int = 1

Is this a list of ints or a pure singleton? R doesn't have scalar types, so it would seem the former, but the example makes it unclear. Later in the docs it makes it clearer:

    let x: int = (1, 2, 3)

And this, as an R developer, I can definitely get behind -- the c(...) syntax is always awkward and having a native syntax for static arrays is a welcome change.

Yeah, it's not an idiomatic example. I like the idea, but this makes me worry that the project does not have the right priorities. I.e., supporting my use cases :D

How does an R developer differ from an R programmer? (Asking you as an R programmer)

How do I find jobs that use the R language? It's impossible to search for the letter "R" on LinkedIn or Indeed without getting a bunch of unrelated job postings.

"R" is the only programming language I know, and I can't find a job that uses R because job search engines don't allow you to sort by skill.

"R language" is the closest substitute on LinkedIn, but the results are still a jumbled mess of jobs, some looking more for other skills (SQL/Python).

I know R-heavy jobs exist but finding them on LinkedIn is virtually impossible

How does "R language" compare to searching for one of the popular R packages? Searching for "tidyverse", "dplyr", or "ggplot" seems to get a good chunk of hits. That being said, yeah, there does seem to be a trio of skills that often go together (R, python, SQL)

If you search specific packages on LinkedIn the number of jobs is usually very small

E.g. tidyverse or dplyr is like 20-40 jobs. ggplot is 88. There are definitely way more than 100 companies looking for R-heavy users.

I tried using "r" (with quotes) on Indeed, and got some hits where R was listed as one of the necessary skills.

Perhaps #rlang would work? Or #tidyverse if you are feeling tibblish :)

Why would you do that? R is just a tool for doing statistics or research. You need to search for jobs in your subject area like "ecologist", "econometrician", "green energy researcher", etc.

There are hedge funds that like hiring people who know how to manipulate data in R using dplyr and data.table

Looking for a similar job where my desire/interest to spend all day in Rstudio is a value add to a business

With apologies if this breaks guidelines: https://hymans.current-vacancies.com/Jobs/Advert/3525353?cid...

Because if you work on a team you need to use a language that the whole team can work with. If I'm the one R guy at a Python shop, it's not going to work out well. It depends a lot on org structure of course. But I think it's telling that the jobs you highlight are mostly academic jobs where the practitioner would be expected to be a highly competent individual working largely alone, or in a very small group, carrying out research on behalf of some stakeholders, and not likely to have to put anything "into production" any time soon.

For example, I used R (data.table) when I was a solo data scientist working on a consulting project where I needed to work with a dataset on the order of a few billion rows. I had nobody around to constrain my choice of tools, so I went with whatever felt convenient, familiar, and ergonomic for getting the job done.

Today, I am on a team of 5 other people, none of whom know a lick of R, and my code needs to run in production pipelines that need to at least in theory be debuggable, auditable, fixable, etc. by people other than me. Therefore I use Python, because we are a Python team and that's the language that we use, end of story. (Python also happens to be a good choice on our team for other reasons, but that's not the point here).

Maybe the best industry where you are likely to find people doing "production" work in R is some form of insurance. But even back in 2017-2020, things were shifting towards Python at the one P&C company I worked for.

Lots of quantitative research uses R. It’s still very popular in the industry.

Insurance is also still using a lot of R. Actuaries I know still use it, and they talk of Python, but I don’t see anyone actually moving to using Python.

The main reason we shy away from R for production apps is all the silent errors where things seem to succeed while being horribly wrong if you take a look. Typing would certainly help mitigate that.

This isn't specifically about Vapour, just about what's become the common way to specify types.

I know this is totally bike shedding, semantics, vi vs Emacs, BigEndian vs LittleEndian and it's too late now to affect anything, but to me using a colon after the variable is just wrong!

    let x : int = 1

    func add(x: int, y: int): int { return x + y }

I see that and it looks like int = 1 and the function's return type is totally lost.

This seems completely backwards to me. Maybe I'm just used to the way C did it, but the variable modifiers should come first.

    let int x = 1

    func int add(int x, int y) { return x + y }

Why we reversed it and added in the colon just doesn't make much sense to me.

": type" has a long heritage, going back to (at least) Pascal in 1970.

I didn't know that! Still doesn't make much sense to me.

Will this fix the problems it claims to? The power of R is the rich package ecosystem. It caters to people who don’t want to think about engineering concerns but want a fast way to access the powers of computation rather than building a scalable system, two very different things. It excels at the former. A new language will not fix this, because this type of thinking has infected the entire package ecosystem. Frankly with code translation you probably don’t need a new language. Prototype in R and code translate to Python or whatever you want to use in prod. Or frankly just do code gen directly in Python so you can skip having to confirm if the results match.

To be clear, I love R. It excels at prototyping, but I have seen too many real-world struggles of folks trying to move to prod, so I would say save it for EDA projects and one-time analyses.

I often find I want a specific statistical package that's only in R, but want a more general purpose language for all the other stuff that's involved (parsing, filesystem stuff, error handling etc). I don't want to risk re-writing the statistical methods and all their dependencies in the sensible language, so I end up calling R only for the statistical methods, but I can see this as an alternative.

> A new language will not fix this, because this type of thinking has infected the entire package ecosystem.

Do you think the culture of the package ecosystem could possibly change in the future?

Even if it does the problem will still be there

I think this is a great idea for the project. I don't dislike the syntax, but the syntax seems more ML than R to me. I think keeping the syntax more R-like could be worthwhile.

I think this is super interesting. I’m not convinced by the examples that this is the right next step, but I hope that either this is alpha enough to experiment more and change direction, settle into a niche within R (like maybe libraries are developed in Vapour and compiled and distributed in R?) or that this inspires more ways of supersetting or subsetting R.

Statisticians and researchers, is this helpful?

I would say that the vast majority of type problems in data science/stats workflows come from data tables "trojan-horsing" type or missing data issues, rather than type problems strictly at the code level. Type annotations won't help you when your upstreams decide they want to change the format of their year-quarter strings without telling you.

> Type annotations won't help you when your upstreams decide they want to change the format of their year-quarter strings without telling you.

IME with both Python and JS/TS, it helps a lot (which is different than completely solving the problem), for reasons which should generalize to other typing add-ons/supersets for untyped languages. Typing your code forces validations at the boundaries, which obviously doesn't stop upstream sources from messing with formats but it does mean that you are much more likely to catch it at the boundary rather than having weird breakages deep in your code that you have to trace back to bad upstream data.

Is the idea that if my year_quarter parser is properly typed then it should detect the format change and throw an error? (kind of a silly example, just trying to be illustrative)

Yes. Your type can encode what the proper format for a string should be, and if a string is passed that does not meet that format, it will throw an error, allowing you to make any necessary adjustments to handle the new year_quarter format.

e.g. a TypeScript template literal type:

    type DateString = `${number}/${number}/${number}`

A super naïve check for using "/" instead of "-" as the separator character for a date formatted as a string. If a date is provided with some other separator character, it will produce an error. If my function takes a DateString, the string must be formatted correctly to pass the type check. Obviously this isn't enough (YYYY/MM/DD is different from DD/MM/YYYY), but the intention was to show a way to enforce something via types: rather than validating a string to check that you have a DateString, you can simply enforce that you have one.

"Typing your code forces validations at the boundaries" was too strong because of course you can type your code without actually doing the validations, but you can structure your code such that that won't happen accidentally: https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...

The idea is that checking should be the only way of making a value of the type. That prevents you from forgetting to check when you turn some broader type (say, string) into the more narrow one (date, in this case).
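A minimal sketch of that idea in Python, using the year-quarter example from upthread. The "YYYY-Qn" format, the `YearQuarter` class, and `quarters_between` are assumptions made up for illustration. Python can't fully seal off the constructor the way some languages can, so the range check is repeated in `__post_init__` to keep construction and checking tied together.

```python
import re
from dataclasses import dataclass

# Assumed wire format for illustration: four digits, "-Q", then 1-4.
_PATTERN = re.compile(r"(\d{4})-Q([1-4])")


@dataclass(frozen=True)
class YearQuarter:
    year: int
    quarter: int

    def __post_init__(self) -> None:
        # Direct construction is checked too, so no unchecked value exists.
        if not 1 <= self.quarter <= 4:
            raise ValueError(f"quarter out of range: {self.quarter}")

    @classmethod
    def parse(cls, text: str) -> "YearQuarter":
        """Turn the broad type (str) into the narrow one, or fail loudly."""
        match = _PATTERN.fullmatch(text)
        if match is None:
            raise ValueError(f"not a YYYY-Qn year-quarter: {text!r}")
        return cls(int(match.group(1)), int(match.group(2)))


def quarters_between(start: YearQuarter, end: YearQuarter) -> int:
    """Downstream code takes YearQuarter, so it never re-validates strings."""
    return (end.year - start.year) * 4 + (end.quarter - start.quarter)


span = quarters_between(YearQuarter.parse("2023-Q4"), YearQuarter.parse("2024-Q3"))
print(span)
```

If upstream silently switches to something like "2024Q3", `parse` raises at the boundary instead of letting a malformed string wander deeper into the code.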

> "Typing your code forces validations at the boundaries" was too strong because of course you can type your code without actually doing the validations

Yeah, of course you can cheat the typechecking in the code at the boundary in several ways, or convert from wire format to internal types in a way which plugs in type-valid defaults for bad data rather than erroring, or just use too-broad internal types to start with (you can have "stringly-typed code"), and fail to help the problems. But if you use the types that make sense internally for what the code is doing, then conversion including validation at the boundary becomes the path of least resistance in most cases. "Forces" is not strictly true, but my experience is that adding types does create a strong push for boundary validation.

No, it will get in the way more than anything else. As has been said elsewhere in the thread, what we need to ensure in R is mostly runtime constraints (array shape, number in a specific interval, etc.). This would require a super heavy and complex type system, with at least refinement types and probably fully dependent types. It would be too complex to use for most people and use cases. A contract system would be far more practical and useful. See CHECK and other constraints in SQLite. That's exactly what we need.
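For readers who haven't used them, this is roughly what SQLite's constraints look like in practice, driven here from Python's standard sqlite3 module. The table, column names, and `try_insert` helper are invented for illustration; the point is only that the rules are declared once and enforced on every write.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE measurements (
        sample_id  TEXT NOT NULL,
        proportion REAL NOT NULL CHECK (proportion BETWEEN 0 AND 1),
        n_obs      INTEGER NOT NULL CHECK (n_obs > 0)
    )
    """
)


def try_insert(row):
    """Hypothetical helper: report whether SQLite accepted the row."""
    try:
        conn.execute("INSERT INTO measurements VALUES (?, ?, ?)", row)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


results = [
    try_insert(("a", 0.25, 10)),   # satisfies every constraint
    try_insert(("b", 1.7, 10)),    # proportion outside [0, 1]
    try_insert(("c", 0.5, None)),  # missing value in a NOT NULL column
]
for line in results:
    print(line)
```

Note that NOT NULL carries real weight here: a CHECK expression that evaluates to NULL counts as passing, so the missing-value rule has to be stated separately from the range rule.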

It is probably helpful in some cases and unhelpful in others. R uses multiple dispatch, so calling `foo` on different types can produce different output. It isn't clear to me how Vapour handles this. In general though, folks are passing around data.frame or similar objects.

Not really, because honestly a lot of us who came into programming via research never learned typed languages or unit tests or any of those best practices - we were just hacking around in MATLAB, R, or Python from the start. What I really need is a seamless and easy way to run statistical models that can only be fit in R, but from Python or Node. There are several categories of statistical modeling where R completely blows Python out of the water, and it's incredibly wasteful (and error-prone) to try to re-implement these yourself in Python.

rpy2 can be used to call R from Python: https://rviews.rstudio.com/2022/05/25/calling-r-from-python-...

reticulate works for going in the other direction: https://rstudio.github.io/reticulate/

With the good interoperability these days, let's stop rewriting functionality in other languages. If the interoperability is no good, work on fixing that, please.

Looks interesting! What types of programs do you think people would write in this language? I don't see an obvious need for traditional R programs which are usually just scripts for working with data, but maybe people could write R packages in this language?

Never thought someone would do this for R. Really nice work.

Vapour is an interesting choice. Hope it’s in name only :)

This looks nice. I find R to be an unreadable mess. The comparison shows a great improvement.

The default IDE workflow is like a Python "notebook" where code can be, and is, run in whatever order the creator wants. Every piece of R code I've read treats it as such, and it results in an absolute mess to read and manage.

Is it fair to compare this to what TypeScript provides for JavaScript?

Cool idea! Looking forward to exploring it this weekend

I took a couple stabs at this long ago (even before there was a TypeScript for inspiration). The first attempt was to add types to the syntax of R, but that would have required a lot more time than I had. Properly catching errors is a massive undertaking requiring a lot of background I don't have. The second attempt was to add syntax for types to R and then compile the code to another language. That's easy to do, but really boring, so I wasn't able to stick with it. It comes with the advantages of static typing and R code that runs very fast. I gave up and went with embedding R inside a statically typed language. Very happy with my choice.

Good luck to the authors of this. I believe it solves an important problem for R package authors and others wanting to write bigger programs. It's hard to argue with the benefits of static typing for this type of work.

Sounds like vapourware. ;)

I mean, there is an alpha you can download. If it was just a landing page and an email waitlist, then that would be vaporware.

I was commenting on the naming choice.

Yeah I caught that, but I thought you were doing a double-entendre since it’s in early alpha.


[flagged]

First, how is that "giving myself an excuse"? Second, it's a total non sequitur, and even then, it's a day old; has it broken?

the syntax might change, things will break, expect bugs.

Bugs are a normal part of software development.

Changing syntax and breaking things make work for everyone else for the convenience of developers. Reliability is what makes a tool a tool.

> Changing syntax and breaking things make work

How else might one explore a new language (vapour) in the open among interested like-minded developers seeking to iterate on a tool found lacking (R)?

Changing and iterating things makes.

They aren’t wrong. Backwards compatibility is supposed to be one of the first promises of any mature programming language, unless you make it explicit by noting breaking changes in major version updates (1.X.X -> 2.X.X) or the language is purely for R&D and makes no guarantee of anything.

The website says, "EARLY ALPHA Vapour is extremely young, the syntax might change, things will break, expect bugs."

What part of this is giving any sense of stability? It's clearly an experimental language, so I find it hard to understand why you are discussing stability and compatibility at all.

I have no skin in this game, but why are you expecting maturity from a new language? They're not claiming to be production-ready – quite the opposite. It's on version 0.0.5, for goodness' sake.
