Strings know their own length
Strings are arrays of bytes
Strings are hard, especially when you start to add concepts like encoding and locales, etc … but that’s not what we are talking about here. Let’s ⏪.
Strings are simple, they are just a consecutive sequences of characters, let’s keep hiding a lot under the carpet 🙊, strings are just a (null terminated) array of bytes. The length of the string is the number of non null bytes.
A simple way to get that information in C would be a loop (it’s ok to write loops in C).
int string_length( const char* s){
int n = 0 ;
for( char* p = s; *p != 0; p++) n++ ;
return n ;
}
string_length( "head, shoulders, knees and toes") ;
This just counts the number of bytes (aka char
) in the string, obviously in real life
what a character is is more complicated than a byte
some characters, like say emojis are
made of several bytes, but we said this was another day’s problem.
We don’t have to write the loop manually because we know about the
C function strlen
;
strlen( "head, shoulders, knees and toes" ) ;
What about R strings
You don’t typically see individual strings in R because R has no
notion of scalars, you see character vectors of length one, perhaps through
the Rcpp::CharacterVector
class.
You don’t see them, but they exist internally as one of the SEXP types, namely CHARSXP
.
When you have a character vector (a STRSXP
) and you want a single string (a CHARSXP
)
you may use the STRING_ELT
macro.
Then if you want the actual string, you can use the CHAR
macro.
Rcpp::cppFunction( '
int rstring_length_strlen( SEXP sv ){
SEXP s = STRING_ELT(sv, 0) ;
return strlen( CHAR(s) ) ;
}
')
s <- "head, shoulders, knees and toes"
rstring_length_strlen( s )
## [1] 31
nchar(s)
## [1] 31
But CHAR
is not the only thing you can do to a CHARSXP
, you can also
use the LENGTH
macro, which retrieves information that is stored in the SEXP
alonside the actual string, so R strings already know their length.
Rcpp::cppFunction( '
int rstring_length_LENGTH( SEXP sv ){
SEXP s = STRING_ELT(sv, 0) ;
return LENGTH( s ) ;
}
')
rstring_length_LENGTH( s )
## [1] 31
Back story
This might be an obscure post, it’s been on this blog issues for a while now. I guess I’ve not materialised it yet is because of my conception that this is either known by people for which this relevant, or it is not relevant to other people. Either you already know, or you don’t have to care.
I’ll try to force myself to write about these types of things when I can identify that at some point in the past, I would have benefited from knowing and I did not.
And that’s quite the case, otherwise the size
method of Rcpp::String
would use
length
when in fact there isn’t even a size
method 😱.