arrow, rrrow, rcher, spurrrow


arrow R c++ python

The naming conundrum

Here I am again at the conundrum of choosing a name for a thing. This is hard. I like it when it’s over and I have the perfect name, and I finally feel free to try to match the personality of the code to the name.

On Tuesdays, I’ll be working with Wes on Apache Arrow, specifically on adding R as a front end to the C++ library.

The Python front end is called pyarrow, but I guess names are less of an issue with Python, as the first thing I’ve seen in many scripts using pyarrow is import pyarrow as pa.

There are some proofs of concept of R and Arrow:

- rarrow from Jim
- Rarrow from Clark

So it looks like [rR]arrow is the natural pattern to use to name the bindings. I have mixed feelings about this, so I coined rrrow instead, making the regex [rR]a?rrow.
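As a quick sanity check of that pattern, here is a small Python sketch using the standard re module. The candidate list is just the names floating around in this post, and fullmatch is my choice so that only whole names count.

```python
import re

# The pattern from the paragraph above: an "r" or "R", an optional "a", then "rrow".
PATTERN = re.compile(r"[rR]a?rrow")

# Names mentioned in this post (picked for illustration only).
candidates = ["rarrow", "Rarrow", "rrrow", "arrow", "rcher", "pyarrow"]

# fullmatch requires the whole name to fit the pattern, not just a substring.
matches = {name: bool(PATTERN.fullmatch(name)) for name in candidates}

for name, ok in matches.items():
    print(f"{name}: {'matches' if ok else 'no match'}")
```

The two existing proofs of concept and rrrow fit the pattern; plain arrow does not, since the pattern insists on a leading r or R.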

I like rrrow because it’s weird, it has 3 consecutive r like some other R package you might have heard of, and JD did find the perfect imagery for it.

Here is an extract from my conversation with Wes about it:

arrow makes a lot of sense actually, we’re already in R so we don’t need a prefix reminder.

I kind of like rcher too, without the capital R though in the interest of saving ⌨️ time, and I ❤️ the idea of pretty much outsourcing the marketing to Mara, who has the superpower to tweet archer gifs faster than … well I don’t remember the typical expression for something fast, but pretty fast …

More things to learn

I don’t just sit around and think about naming things all day, I also sometimes procrastinate, but not today, I’ll procrastinate tomorrow.

Arrow is already a mature and somewhat complex project with many moving parts, so being tasked to “do the r thing” is kind of intimidating at first. I’ll try not to give in to anxiety too soon.

I spent my first #arrowtuesday reading documentation, installing things, and generally getting a feel for the project, mostly through the Python front end.

I need to learn about Python, here’s my current Amazon cart. I’ve been meaning to read Wes’s book for some time and I’m not the only member of my 👪 who wants to learn about 🐍

Plan

In essence, the task is to make the Arrow data structures accessible to R, and to stay in line with Arrow’s principle of keeping copies to a minimum.

The tool we have at our disposal in R for this is the external pointer. It lets us get hold of an instance of a C++ class, with enough hooks to destroy the object once the wrapping R object around it (the external pointer, aka EXTPTRSXP) goes out of scope.
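To make the life cycle idea concrete, here is a rough analogy in Python rather than R. This is not R’s external pointer API: NativeTable and Handle are hypothetical names made up for this sketch, and weakref.finalize stands in for the finalizer hook that runs when the wrapper is collected.

```python
import weakref

released = []  # records which "native" resources have been freed


class NativeTable:
    """Hypothetical stand-in for a C++ object living outside the host language's heap."""

    def __init__(self, name):
        self.name = name

    def destroy(self):
        # In the real setting this is where the C++ destructor would run.
        released.append(self.name)


class Handle:
    """Hypothetical wrapper playing the role of the external pointer."""

    def __init__(self, native):
        self._native = native
        # Register the cleanup hook; it runs when this wrapper is collected.
        # The callback refers to the native object only, never to the wrapper.
        weakref.finalize(self, native.destroy)


handle = Handle(NativeTable("table-1"))
assert released == []  # nothing is freed while the wrapper is alive

del handle  # CPython collects the wrapper as soon as its last reference is gone
```

Once the wrapper is gone, the hook has run and the native resource is released, which is the same contract the external pointer gives us on the R side, only with R’s garbage collector deciding when.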

Rcpp has modules built around external pointers, but I’m not really satisfied with them because they take forever to compile and, at the moment, still require a lot of boilerplate work when used with a C++ library that goes beyond hello world.

Using external pointers is the path Jim has followed with rarrow; it’s the right tool.

But we need to go further, because external pointers only give you ways to get hold of an object and maintain its life cycle. As soon as you want to do anything in R with the data, you have to convert it to R data types. However, there’s ALTREP on the horizon.

ALTREP is a big deal. It makes it possible to decouple the metadata of R objects (all the stuff that goes in the SEXPREC bits) from the actual data: whereas now the actual data directly follows the header, ALTREP adds abstractions that we can use to add indirections.

This is still somewhat obscure to me, but in short, if the data can be elsewhere, it can definitely come from some Arrow structure. Exciting times ahead. I’m leaving this here; the thread has some references about 📦 using ALTREP.

See you next Tuesday for more R and Arrow stuff.