ALTREP and C++
Traditionally, R vectors (numeric, …) are formed of one header directly followed by the actual contiguous data. I’ll spare the details about the headers because even though I have a fair understanding, I don’t necessarily want to propagate my misconceptions 😬, and also because it is mostly irrelevant to this post.
Since R 3.5.0, the implementation of vectors benefits from ALTREP (alternative representation), which challenges this and decouples the header from the actual data. This has two main use cases:
- when the memory comes from somewhere else (a memory mapped file, an arrow array or whatever), i.e. some contiguous memory that already exists. ALTREP lets you create an R vector that points to that memory.
- when you only need part of the vector, e.g. a single value. Here ALTREP lets you define what the i-th value of the vector is, without having to materialize it. The canonical example is for(i in 1:n){}. You never need the entire vector 1:n, you only need one value at a time.
As the projects I’m currently involved with (dplyr and arrow) might ultimately benefit from ALTREP, I’ve 🤹 with it in the altrepisode 📦.
In this post, I’m covering the first use case with numeric vectors, in particular creating an R altrep vector that borrows data from a C++ std::vector<double>.
The Altrep.h header
To manipulate ALTREP objects or create your own classes, you need to include the R_ext/Altrep.h header file. Unfortunately in R 3.5.0, Altrep.h is not C++ friendly, so you need to be careful when you include it. The situation has been fixed recently, so if you rely on R-devel things are easier.
In the meantime, here is my workaround:
// to manipulate R objects, aka SEXP
#include <R.h>
#include <Rinternals.h>
#include <Rversion.h>
// because we need to initialize the altrep class
#include <R_ext/Rdynload.h>
#if R_VERSION < R_Version(3, 6, 0)
// workaround because R's <R_ext/Altrep.h> not so conveniently uses `class`
// as a variable name, and C++ is not happy about that
//
// SEXP R_new_altrep(R_altrep_class_t class, SEXP data1, SEXP data2);
//
#define class klass
// Because functions declared in <R_ext/Altrep.h> have C linkage
extern "C" {
#include <R_ext/Altrep.h>
}
// undo the workaround
#undef class
#else
#include <R_ext/Altrep.h>
#endif
The <200 lines of (fairly documented) C++ code in stdvec_doubles.cpp contain the implementation that is discussed here. This uses Rcpp but only for code generation using the attributes feature. This is not using any Rcpp classes (e.g. NumericVector), which at this point would eliminate the benefits of ALTREP.
Motivation
Let’s start at the end of the file to get some motivation. The doubles function below creates an R altrep object backed by a std::vector<double>.
//' an altrep object that wraps a std::vector<double>
//'
//' @export
// [[Rcpp::export]]
SEXP doubles() {
// create a new std::vector<double>
//
// this uses `new` because we want the vector to survive
// it is deleted when the altrep object is garbage collected
auto v = new std::vector<double> {-2.0, -1.0, 0.0, 1.0, 2.0};
// The altrep object owns the std::vector<double>
return stdvec_double::Make(v, true);
}
The details will follow, but for now, let’s look at what the object looks like when back on the R side:
library(altrepisode)
x <- doubles()
x
## [1] -2 -1 0 1 2
mode(x)
## [1] "numeric"
str(x)
## num [1:5] -2 -1 0 1 2
It looks and feels like any other R numeric vector, that’s the point. As far as the R code is concerned, this is no different from an object that would have been created by c:
y <- c(-2.0, -1.0, 0.0, 1.0, 2.0)
identical(x, y)
## [1] TRUE
To see a difference, you have to look at the object with .Internal(inspect()), as ALTREP gives you a way to control how your ALTREP objects are inspected.
.Internal(inspect(x))
## @7fca66a1f990 14 REALSXP g0c0 [NAM(3)] std::vector<double> (len=5, ptr=0x7fca5f46da60)
.Internal(inspect(y))
## @7fca654f7dc8 14 REALSXP g0c4 [NAM(3)] (len=5, tl=0) -2,-1,0,1,2
Register the ALTREP class
The stdvec_double::Make function from the above code chunk creates an R object of ALTREP class (the ALTREP class is completely orthogonal to the R class: again, as far as R code is concerned, nothing has changed). For this we need to register an R_altrep_class_t object with (in the case of an altrep numeric vector) the R_make_altreal_class function.
This is done at 📦 initialisation time, thanks to the new Rcpp::init attribute.
// static initialization of stdvec_double::class_t
R_altrep_class_t stdvec_double::class_t;
// Called when the package is loaded (needs Rcpp 0.12.18.3)
// [[Rcpp::init]]
void init_stdvec_double(DllInfo* dll){
stdvec_double::Init(dll);
}
ALTREP is a C API, relying on C functions, but because this is C++, I’ve squashed the functions together as static functions of the stdvec_double C++ class, hence the stdvec_double::Init call here. Init looks like this:
static void Init(DllInfo* dll){
class_t = R_make_altreal_class("stdvec_double", "altrepisode", dll);
// altrep
R_set_altrep_Length_method(class_t, Length);
R_set_altrep_Inspect_method(class_t, Inspect);
// altvec
R_set_altvec_Dataptr_method(class_t, Dataptr);
R_set_altvec_Dataptr_or_null_method(class_t, Dataptr_or_null);
// altreal
R_set_altreal_Elt_method(class_t, real_Elt);
R_set_altreal_Get_region_method(class_t, Get_region);
}
First, we register the class with the R_make_altreal_class function, then we replace default methods with custom functions that know how to deal with our std::vector<double>.
- Length: what is the length of the vector?
- Inspect: what happens when we .Internal(inspect()) the object
- Dataptr: where is the data (more on that later)?
- Dataptr_or_null: where is the data (but don’t look too hard)
- real_Elt: what is the i-th element?
- Get_region: a contiguous region of the data
This is I believe 🤷 the bare minimum.
In addition to that, the stdvec_double class hosts:
- Make: to construct one such object from a pointer to a std::vector.
- Finalize: to delete the object at the proper time if we own it
- Ptr: to get the pointer
- Get: to get a reference to the std::vector<double>
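The "create a class, then replace default methods" idea can be sketched without any R headers, as a plain table of function pointers. This is only an analogy and not how R implements ALTREP classes: method_table, make_class, make_stdvec_class and the two defaults are hypothetical names, and the defaults here simply return zero.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an ALTREP class: a table of function pointers.
struct method_table {
  std::size_t (*length)(const void* self);
  double (*elt)(const void* self, std::size_t i);
};

// Hypothetical defaults, used until a class overrides them.
inline std::size_t default_length(const void*) { return 0; }
inline double default_elt(const void*, std::size_t) { return 0.0; }

// Loose analogue of R_make_altreal_class(): defaults everywhere.
inline method_table make_class() {
  return method_table{default_length, default_elt};
}

// Custom methods that know the payload is a std::vector<double>.
inline std::size_t vec_length(const void* self) {
  return static_cast<const std::vector<double>*>(self)->size();
}
inline double vec_elt(const void* self, std::size_t i) {
  const std::vector<double>& v =
      *static_cast<const std::vector<double>*>(self);
  assert(i < v.size());  // debug-only bounds check
  return v[i];
}

// Loose analogue of Init(): create the class, then replace some defaults.
inline method_table make_stdvec_class() {
  method_table cls = make_class();
  cls.length = vec_length;
  cls.elt = vec_elt;
  return cls;
}
```

Calling cls.length(&v) on the table returned by make_stdvec_class() dispatches to vec_length, in the same spirit as R dispatching to the Length function registered in Init.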
Construction
The stdvec_double::Make function creates the altrep R object backed by the std::vector<double>:
// Make an altrep object of class `stdvec_double::class_t`
static SEXP Make(std::vector<double>* data, bool owner){
// The std::vector<double> pointer is wrapped into an R external pointer
//
// `xp` needs protection because R_new_altrep allocates
SEXP xp = PROTECT(R_MakeExternalPtr(data, R_NilValue, R_NilValue));
// If we own the std::vector<double>*, we need to delete it
// when the R object is being garbage collected
if (owner) {
R_RegisterCFinalizerEx(xp, stdvec_double::Finalize, TRUE);
}
// make a new altrep object of class `stdvec_double::class_t`
SEXP res = R_new_altrep(class_t, xp, R_NilValue);
// xp no longer needs protection, as it has been adopted by `res`
UNPROTECT(1);
return res;
}
Finally, the R object (aka SEXP) is created with the R_new_altrep function, which takes the altrep class as first argument, and two other arbitrary R objects. These two R objects can later be accessed with R_altrep_data1 and R_altrep_data2 and can be just about anything you like.
Here we use an external pointer, created with R_MakeExternalPtr, as data1 and we don’t need anything for data2 so we use NULL (R_NilValue). If we own the C++ vector, as indicated by the owner argument, we register a finalizer so that when the external pointer (the R object) is garbage collected, the destructor of the C++ object is invoked.
The Finalize, Ptr and Get functions are conveniences that allow us to go from the altrep R object to the C++ vector:
// finalizer for the external pointer
static void Finalize(SEXP xp){
delete static_cast<std::vector<double>*>(R_ExternalPtrAddr(xp));
}
// get the std::vector<double>* from the altrep object `x`
static std::vector<double>* Ptr(SEXP x) {
return static_cast<std::vector<double>*>(R_ExternalPtrAddr(R_altrep_data1(x)));
}
// same, but as a reference, for convenience
static std::vector<double>& Get(SEXP vec) {
return *Ptr(vec) ;
}
Given the altrep object x we need to first get to its data1 with R_altrep_data1 and then cast that to the underlying C++ vector with R_ExternalPtrAddr and a static_cast<>. Once we have this, the rest follows naturally.
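The round trip from an untyped address back to the C++ vector, together with the optional finalizer, can be sketched in plain C++ without R. The names ext_ptr, finalize_vec, finalized, ptr and get are hypothetical stand-ins, and one difference matters: here the finalizer runs when the handle goes out of scope, whereas R runs it whenever the garbage collector gets to the external pointer.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for an R external pointer: an untyped address
// plus an optional finalizer, run when the handle is destroyed.
struct ext_ptr {
  void* addr;
  void (*finalizer)(void*);

  ext_ptr(void* a, void (*f)(void*)) : addr(a), finalizer(f) {}
  ~ext_ptr() { if (finalizer) finalizer(addr); }

  // non-copyable, so the finalizer cannot run twice
  ext_ptr(const ext_ptr&) = delete;
  ext_ptr& operator=(const ext_ptr&) = delete;
};

// Counts finalizer runs, only to make the sketch observable.
int finalized = 0;

// Analogue of Finalize(): recover the real type, then delete.
inline void finalize_vec(void* p) {
  delete static_cast<std::vector<double>*>(p);
  ++finalized;
}

// Analogue of Ptr(): untyped address back to a typed pointer.
inline std::vector<double>* ptr(const ext_ptr& xp) {
  return static_cast<std::vector<double>*>(xp.addr);
}

// Analogue of Get(): same, as a reference.
inline std::vector<double>& get(const ext_ptr& xp) {
  assert(xp.addr != nullptr);
  return *ptr(xp);
}
```

Passing finalize_vec corresponds to owner = true in Make. Passing nullptr corresponds to borrowing a vector that somebody else will delete.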
ALTREP methods
ALTREP is divided in several layers depending on the type of object we altrep (maybe it’s too soon for this to be a verb). The first layer is generic and applies to all ALTREP objects. There might be other methods, but for this I’ve implemented Length and Inspect:
// The length of the object
static R_xlen_t Length(SEXP vec){
return Get(vec).size();
}
// What gets printed when .Internal(inspect()) is used
static Rboolean Inspect(SEXP x, int pre, int deep, int pvec, void (*inspect_subtree)(SEXP, int, int, int)){
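// R_xlen_t is not an int and %p expects a void*, so cast explicitly
Rprintf("std::vector<double> (len=%ld, ptr=%p)\n", static_cast<long>(Length(x)), static_cast<void*>(Ptr(x)));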
return TRUE;
}
In the Length method, we are given the altreped (still assuming this is a verb) object. From this object, we extract the std::vector<double>& with Get and then simply call size() on it.
The Inspect method is a bit more involved, let’s just skip it 🙈.
ALTVEC methods
Then, we have methods that are only relevant to vector type R objects. In this implementation, I have defined the Dataptr and Dataptr_or_null methods:
// The start of the data, i.e. the underlying double* array from the std::vector<double>
//
// This is guaranteed to never allocate (in the R sense)
static const void* Dataptr_or_null(SEXP vec){
return Get(vec).data();
}
// same in this case, writeable is ignored
static void* Dataptr(SEXP vec, Rboolean writeable){
return Get(vec).data();
}
In this example, they are identical, but it’s not necessarily the case for all altrep class implementations.
The difference is that the Dataptr_or_null method is guaranteed to not allocate additional R memory. If you already have access to a contiguous chunk of memory, then return that, otherwise return a null pointer, but this should never allocate.
The Dataptr method is the big 🔨. Whatever your class has to do, it must return a pointer to a contiguous chunk of memory where the data is. We’ll illustrate this better in another post, but in short if you already have that contiguous memory, then return it, if not do whatever it takes, allocate if you have to, but eventually get me that memory.
Dataptr is what most ALTREP unaware code (e.g. the constructor of Rcpp::NumericVector or the mean function from base R) will use.
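For a class that does not already sit on contiguous memory, the two methods behave differently. Here is a minimal sketch in plain C++, with no R involved, of a hypothetical lazy_seq class for the sequence 1, 2, ..., n. Its member functions dataptr_or_null and dataptr mirror the two contracts described above, but the class and its names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical lazy sequence 1, 2, ..., n that only stores `n`
// until someone insists on contiguous memory.
class lazy_seq {
public:
  explicit lazy_seq(std::size_t n) : n_(n) {}

  std::size_t size() const { return n_; }

  // "Don't look too hard": never allocates, may return nullptr.
  const double* dataptr_or_null() const {
    return cache_.empty() ? nullptr : cache_.data();
  }

  // The big hammer: allocate and fill if needed, then return the memory.
  double* dataptr() {
    if (cache_.empty() && n_ > 0) {
      cache_.resize(n_);
      for (std::size_t i = 0; i < n_; ++i) {
        cache_[i] = static_cast<double>(i + 1);
      }
    }
    assert(cache_.size() == n_);
    return cache_.data();
  }

private:
  std::size_t n_;
  std::vector<double> cache_;  // empty until dataptr() is called
};
```

Before the first call to dataptr(), dataptr_or_null() answers with a null pointer, which tells the caller to fall back to element-wise access. Afterwards both return the same address.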
ALTREAL methods
Finally, there are methods specific to numeric vectors:
// the element at the index `i`
//
// this does not do bounds checking because that's expensive, so
// the caller must take care of that
static double real_Elt(SEXP vec, R_xlen_t i){
return Get(vec)[i];
}
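// Copy the region of the data starting at index `start`, of at most
// `size` elements, into the caller-supplied buffer `out`.
//
// The return value is the number of elements actually copied, so the caller
// must not read beyond that. Uses std::copy, so this needs #include <algorithm>.
static R_xlen_t Get_region(SEXP vec, R_xlen_t start, R_xlen_t size, double* out){
const std::vector<double>& v = Get(vec);
R_xlen_t len = static_cast<R_xlen_t>(v.size()) - start;
R_xlen_t n = len < size ? len : size;
if (n <= 0) return 0;
std::copy(v.data() + start, v.data() + start + n, out);
return n;
}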
real_Elt gives you the element at offset i of the underlying object. Here this calls std::vector<double>::operator[], but you can imagine situations where this is implemented differently, e.g. for 1:n you can imagine this would just return i+1 without having to rely on anything else.
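To make the 1:n case concrete, here is a small sketch in plain C++ of an element accessor that is pure arithmetic. It is not R's actual compact sequence implementation: compact_seq, seq_elt and seq_sum are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical compact sequence from, from+1, ..., from+n-1.
// Only two numbers are stored, whatever the length.
struct compact_seq {
  double from;
  std::size_t n;
};

// Analogue of real_Elt() for the sequence: no memory is read.
// Like real_Elt(), the caller is responsible for keeping i in range.
inline double seq_elt(const compact_seq& s, std::size_t i) {
  assert(i < s.n);  // debug-only check
  return s.from + static_cast<double>(i);
}

// A consumer that only ever needs one value at a time.
inline double seq_sum(const compact_seq& s) {
  double total = 0.0;
  for (std::size_t i = 0; i < s.n; ++i) total += seq_elt(s, i);
  return total;
}
```

With compact_seq s{1.0, 100}, seq_sum(s) walks the hundred values of 1:100 without a hundred doubles ever being allocated.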
Get_region again is a bit more involved, so I’ll skip it too. You can do what I do and guess what it is supposed to do based on its name. There can be situations where you can have access to contiguous memory for part of the vector.
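As far as I understand the contract, a Get_region method copies at most a given number of elements, starting at a given index, into a buffer supplied by the caller, and returns how many elements it copied. A sketch in plain C++ of that contract and of a consumer reading a vector chunk by chunk could look like this, where get_region and sum_by_regions are hypothetical helpers and not part of R's API:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a Get_region method: copy at most `size`
// elements starting at `start` into `out`, return how many were copied.
inline std::size_t get_region(const std::vector<double>& v, std::size_t start,
                              std::size_t size, double* out) {
  if (start >= v.size()) return 0;
  std::size_t n = std::min(size, v.size() - start);
  std::copy(v.data() + start, v.data() + start + n, out);
  return n;
}

// Sum a vector through a small fixed-size buffer, one region at a time.
inline double sum_by_regions(const std::vector<double>& v, std::size_t chunk) {
  assert(chunk > 0);
  std::vector<double> buffer(chunk);
  double total = 0.0;
  for (std::size_t start = 0; start < v.size();) {
    std::size_t n = get_region(v, start, chunk, buffer.data());
    for (std::size_t i = 0; i < n; ++i) total += buffer[i];
    start += n;  // n > 0 here because start < v.size() and chunk > 0
  }
  return total;
}
```

The consumer never asks for the whole data pointer. It only needs a buffer of chunk elements, which is the point when the full vector would be expensive to materialize.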
What else to read
I’ll write some more about this in another post. In the meantime, here are a few pointers:
- Luke Tierney’s presentation about ALTREP.
- The ALTREP Examples organization on github hosts packages that showcase ALTREP. Most of what I did on altrepisode is inspired by the simplemmap 📦. The mutable 📦 is newer and easier to grasp.
- The altvecR 📦 from Simon Urbanek, which is a toy package that re-routes ALTREP/ALTVEC methods to R functions for experimentation.
- The actual ALTREP code, which mainly lives in the altrep.c and Altrep.h files.
- My altrepisode. This contains the stdvec_double class described here and another class that I’ll discuss in a follow up post.