About 20% of the packages in the CRAN and BIOC repositories include some native code, and more than half of those include some C++ code. This number is rather high, given that the R API and runtime were designed for C (or Fortran) and cannot be used reliably from C++ without extensive effort and restrictions. To avoid nasty bugs in such code, one needs to know R internals well, and when following the restrictions, one cannot use much of C++ anyway. This text describes some of these technical issues and gives some recommendations.
A summary of the recommendation would be: don’t use C++ to interface with R.
If you need to implement some computation in native code, use C (or perhaps
Fortran), not C++, or completely avoid interacting with the R runtime (e.g.
the .C or .Fortran interfaces are fine; indeed, many external libraries are
written in C++).
I came to write this text mostly from my experience helping package authors
who get rchk (a PROTECT bug-finding tool) reports for their C++ code but
believe the reports are false alarms. When I read the referenced lines of
their code, I often concluded that they really were false alarms (unlike
with C, where false alarms are by now quite rare), but I would also see some
problem with using C++ with the R API on those lines or very close to them.
Unfortunately, these problems are very common and can lead to crashes and
other hard-to-find bugs.
RAII
RAII (resource acquisition is initialization) is a feature/idiom sometimes considered the core innovation of C++ over C. It makes it easy to allocate memory on the C stack and safely release it as the stack is unwound, whether by normal returns or by C++ exceptions. When used wisely, it allows for elegant and fast scoped-memory management. Indeed, there is more to C++, but the other features can also be obtained in C, though perhaps less elegantly.
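To make the idiom concrete, here is a minimal sketch in plain C++ with no R involved. The Buffer class, the live_buffers counter and the function names are hypothetical and exist only for this illustration. It shows the destructor releasing memory both on a normal return and when a C++ exception unwinds the stack.

```cpp
#include <stdexcept>

// Number of Buffer objects whose memory is currently allocated.
static int live_buffers = 0;

// Hypothetical RAII wrapper: acquires memory in the constructor and
// releases it in the destructor.
class Buffer {
public:
    explicit Buffer(int n) : data_(new double[n]) { ++live_buffers; }
    ~Buffer() { delete[] data_; --live_buffers; }
    Buffer(const Buffer &) = delete;             // avoid a double delete
    Buffer &operator=(const Buffer &) = delete;
    double *data() { return data_; }
private:
    double *data_;
};

// Fills a scoped buffer and returns 0 + 1 + ... + (n - 1);
// throws instead if fail is set.
double sum_first(int n, bool fail) {
    Buffer buf(n);
    for (int i = 0; i < n; ++i) buf.data()[i] = i;
    if (fail) throw std::runtime_error("simulated failure");
    double sum = 0;
    for (int i = 0; i < n; ++i) sum += buf.data()[i];
    return sum;   // ~Buffer runs on this normal return path
}

// Returns true if the buffer was released even though sum_first threw.
bool released_after_exception() {
    try {
        sum_first(4, true);
    } catch (const std::runtime_error &) {
        // ~Buffer already ran during stack unwinding
    }
    return live_buffers == 0;
}
```

Both exit paths leave live_buffers at zero. That guarantee holds only for returns and C++ exceptions.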
Unfortunately, RAII does not work with the setjmp/longjmp functions provided
by the C runtime for exception handling. In case of a long jump, destructors
of local variables on the stack (automatic variables) are not executed. This
is a property of the C/C++ runtimes and a consequence of incompatibilities
between the performance goals and implementations of C++ exceptions and
setjmp/longjmp. Typically, C++ exceptions are designed to have minimal
overhead when not taken, because they are used to implement error paths.
Long jumps, however, have to be very fast when they happen, because they are
used in language interpreters for control flow of the interpreted language;
there it makes sense to pay some performance overhead even when these jumps
are not taken. Still, it is frustrating that long jumps cannot run the
destructors, even at the cost of some performance overhead.
R internally uses setjmp/longjmp to implement control flow of interpreted
loops and return statements (the byte-code compiler can sometimes, but not
always, elide these long jumps), but also for error handling. An R error,
e.g. the result of a call to error() or of an allocation failure when
allocating from the R heap, causes a long jump. If called from C++, the long
jump will not run the destructors.
Consequently, one cannot rely on destructors being run in a package
implemented in C++. The memory on the stack will still be freed (the long
jump does that), but memory allocated using the new operator, say within the
constructor of a stack-allocated object, and de-allocated using delete in
the destructor of that object, will not be freed, causing a memory leak.
This is a common error.
R restores the protection stack depth before taking a long jump, so if a C++
destructor includes, say, an UNPROTECT(1) call to restore the protection
stack depth, it does not matter that it is not executed, because R will do
that automatically. This is unfortunately the only thing one can safely do
inside a destructor, but a common error is that destructors are written to
do much more.
Wrapping R API calls
One cannot easily guess which R API functions may long jump, and this may
change between R versions without notice. When programming in C, this is
not a problem and the long jump will lead to standard R error handling.
When programming in C++, if one wants to use destructors (and, well, C++
without destructors is probably quite useless), the only option is to wrap
all R API calls using code that will convert the long jumps to C++
exceptions, or that will otherwise run some cleanup code. This conversion
is possible, e.g. using R_UnwindProtect (see Writing R Extensions 6.12), but
it is far from trivial and requires some verbose boiler-plate code. Rcpp
currently uses this API.
If R long jumps are converted to C++ exceptions, these exceptions also need to be converted back to long jumps when the code returns from C++ to C (R runtime).
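The shape of such a wrapper can be sketched without R. The following is a simulation only: api_error and api_divide are hypothetical stand-ins for a C API that reports errors by long jump, and the wrapper uses its own setjmp. With R itself the jump target belongs to R, so the real conversion would go through R_UnwindProtect as described above. The sketch covers only the first half, long jump to exception, and not the conversion back.

```cpp
#include <csetjmp>
#include <stdexcept>

// --- Hypothetical C-style API that reports errors by long jump ---
static std::jmp_buf *active_handler = nullptr;   // innermost jump target
static const char *pending_message = "";

static void api_error(const char *msg) {
    pending_message = msg;
    std::longjmp(*active_handler, 1);            // never returns
}

static int api_divide(int a, int b) {
    if (b == 0) api_error("division by zero");
    return a / b;
}

// --- Wrapper: long jump in, C++ exception out ---
// This frame and the API frames below it hold only trivially destructible
// locals, so the long jump that lands here skips no destructor.
int checked_divide(int a, int b) {
    std::jmp_buf env;
    std::jmp_buf *outer = active_handler;
    active_handler = &env;
    if (setjmp(env) == 0) {
        int result = api_divide(a, b);
        active_handler = outer;
        return result;
    }
    active_handler = outer;                      // reached via longjmp
    throw std::runtime_error(pending_message);
}

// A caller that relies on a destructor.
struct Tracker {
    bool &destroyed;
    explicit Tracker(bool &flag) : destroyed(flag) {}
    ~Tracker() { destroyed = true; }
};

// Returns true if the Tracker destructor ran when the simulated API failed.
bool destructor_ran_on_error() {
    bool destroyed = false;
    try {
        Tracker t(destroyed);
        checked_divide(1, 0);                    // throws instead of jumping past t
    } catch (const std::runtime_error &) {
    }
    return destroyed;
}
```

Because the long jump ends inside checked_divide and only an exception travels further up, the caller's destructor runs. Every call into the API needs such a wrapper, which is where the boiler-plate comes from.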
PROTECT errors on function return
Even if we convert the long jumps to C++ exceptions and back, there is
unfortunately another issue with destructors. Functions that return SEXP
by convention return it unprotected, and the caller protects it. However, if
any destructor that is run when such a function is exiting allocates, R GC
may run and destroy the value before the caller gets a chance to protect it.
Unfortunately, in such destructors we do not have access to the variable
that holds the object, so we cannot protect it. One should therefore avoid
allocation from the R heap in destructors, but that is hard given that
almost any R API function can allocate: one should just not call any R API
function from a destructor.
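The ordering behind this problem can be observed in plain C++. In the sketch below all names are hypothetical and an int stands in for an unprotected SEXP. The destructor of a local object runs after the return value has been created but before the caller receives it.

```cpp
#include <string>
#include <vector>

// Trace of what happened, in order.
static std::vector<std::string> events;

// Stand-in for an object whose destructor calls into the R API.
struct ScopeGuard {
    ~ScopeGuard() { events.push_back("destructor runs"); }
};

// Stand-in for allocating the (unprotected) result object.
static int make_result() {
    events.push_back("result created");
    return 42;
}

// Stand-in for the caller's PROTECT of the returned value.
static void caller_protects(int) {
    events.push_back("caller protects");
}

int build() {
    ScopeGuard guard;
    return make_result();   // value is created first, then ~ScopeGuard runs
}

// Runs the scenario and returns the recorded order of events.
std::vector<std::string> trace() {
    events.clear();
    caller_protects(build());
    return events;
}
```

The recorded order is "result created", "destructor runs", "caller protects". If the middle step allocated from the R heap, GC could run while the result is still unprotected.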
We found an error like this in the NAM package (detected by a CRAN check
using ASAN, but it required some time to analyze): an Rcpp function used an
Rcpp RNGScope object, which restores the state of the random number
generator in its destructor. Unfortunately, this means it has to call into
the R API (PutRNGstate), which allocates and hence may run GC, which in turn
destroyed the value to be returned from that function. Indeed, debugging
these things is far from trivial; in this case we were lucky that ASAN
caught it.
Similar errors could easily happen in various operators and copy constructors, when the return value from one function is being passed to another function. If some of these calls happen implicitly, it would be easy for the caller to forget to protect the value.
Memory leaks and asynchronous de-initialization
Memory leaks of dynamically allocated memory are possible also in packages
written in plain C, but I’ve seen them often in C++ code interfacing with R:
memory allocated using new, freed using delete, with calls to the R API in
between, often even with explicit calls to error, and without any attempt to
recover from a long jump (if long jumps were converted to C++ exceptions,
one would have to handle those instead). In case of an error this memory is
permanently leaked. With C, one can use, say, R_alloc, which allocates
memory that is deallocated automatically, including on a long jump (see
Writing R Extensions 6.1.1).
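The leak pattern can be reproduced without R. In this sketch fake_error is a hypothetical stand-in for R's error(), and top_level stands in for the R context that the long jump returns to. The outstanding pointer exists only so that the demonstration can report the leak and then clean up after itself. No object with a destructor is involved, so the long jump is well defined here; it simply skips the delete.

```cpp
#include <csetjmp>

static std::jmp_buf top_level;          // stand-in for R's top-level context
static double *outstanding = nullptr;   // demo-only record of the unfreed buffer

// Hypothetical stand-in for R's error(): jumps to the top level.
[[noreturn]] static void fake_error() { std::longjmp(top_level, 1); }

// The pattern described above: new, a call that may fail, then delete.
static void compute(bool fail) {
    double *buf = new double[1024];
    outstanding = buf;
    if (fail) fake_error();    // jumps away; the delete[] below is skipped
    delete[] buf;
    outstanding = nullptr;
}

// Returns true if compute() left its buffer allocated.
bool leaks(bool fail) {
    outstanding = nullptr;
    if (setjmp(top_level) == 0) compute(fail);
    bool leaked = (outstanding != nullptr);
    delete[] outstanding;      // demo cleanup; real code has lost the pointer
    outstanding = nullptr;
    return leaked;
}
```

leaks(false) reports no leak and leaks(true) reports one. In a real package nothing keeps the pointer after the jump, so the memory cannot be recovered.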
This can be solved using a stack-allocated object with a destructor
(provided the long jumps have been converted to exceptions), or using an R
object with a finalizer. One can create such a dummy R object on the R heap,
PROTECT it, give it a finalizer to release the memory using delete, and
UNPROTECT it at the end of the function, if this is where the freeing should
happen (or could first happen).
This way, one can get something like a destructor, which will be run eventually (except e.g. at R shutdown), but not synchronously with the end of the scope, so it is not RAII. This idiom could be used instead of C++ destructors, e.g. when conversion of long jumps is not in place, but it also adds a bit of boiler-plate code. One has to be careful when calling back into R from the finalizer, as R is not really reentrant (see Writing R Extensions 5.13), but not as careful as in a destructor, where, as I mentioned, one should not call any function that might allocate.
Automated unprotection
If R were implemented in C++ with an interface in C++, it would probably have some form of automated unprotection: objects would be unprotected automatically when they go out of scope (using RAII), and that would avoid some kinds of protection imbalance errors. There is no way to get this in standard C.
Packages implemented in C++ sometimes employ some form of automated
unprotection, but I would not switch a package from C to C++ just to get
automated unprotection; I think there is a benefit in using the standard API
for better maintenance and tool support. The protection imbalance errors
are very easy to find using the rchk tool, now run regularly for checking
CRAN packages and available in a container, and they are not nearly as
common as other protection errors (typically one forgets to protect). Such
harder protection errors can also often be found by the tool, but rarely
when a non-standard API is used (the automated unprotection will confuse
the tool).
In addition, the previous restrictions apply. Automated unprotection cannot
simply use R_PreserveObject/R_ReleaseObject, because of long jumps bypassing
the destructor, and hence not releasing the object (unless the long jumps
were prevented/converted). Automated unprotection should not use
UNPROTECT_PTR for the reasons I described earlier (Unprotecting by Value).
In principle, automated unprotection can do something like UNPROTECT(n), but
care needs to be taken that the C++ object is not allocated dynamically or
that n is the same for all objects allocated, otherwise destructors could be
run in the wrong order and cause PROTECT errors or memory leaks. The
solution with R_PreserveObject/R_ReleaseObject, if long jumps are converted
to exceptions and back, seems safest, but it also requires a lot of work for
the conversion.
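The ordering constraint can be illustrated with a simulated protection stack. Everything in this sketch is a stand-in: protect_stack plays the role of R's pointer protection stack, and ScopedProtect is a hypothetical guard, not an existing API. Its constructor pushes, like PROTECT, and its destructor pops, like UNPROTECT(1). Deleting operator new is one possible way to rule out dynamic allocation of the guard, so that guards are destroyed strictly in reverse order of construction.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for R's pointer protection stack.
static std::vector<const void *> protect_stack;

// Hypothetical scoped guard; only usable as a local (stack) object.
class ScopedProtect {
public:
    explicit ScopedProtect(const void *obj) { protect_stack.push_back(obj); }
    ~ScopedProtect() { protect_stack.pop_back(); }   // always pops exactly one
    ScopedProtect(const ScopedProtect &) = delete;
    ScopedProtect &operator=(const ScopedProtect &) = delete;
    static void *operator new(std::size_t) = delete; // forbid "new ScopedProtect"
};

// Protects two objects in a nested scope. Stores the stack depth seen
// inside the scope in `inside` and returns the depth after leaving it.
std::size_t depth_inside_and_after(std::size_t &inside) {
    int a = 1, b = 2;
    {
        ScopedProtect pa(&a);
        ScopedProtect pb(&b);
        inside = protect_stack.size();
    }   // pb is destroyed first, then pa: LIFO, matching the stack
    return protect_stack.size();
}
```

Because each guard pops exactly one entry and guards on the stack die in reverse order, the depth is restored. A guard allocated with new could outlive guards created after it, and its eventual pop would then remove somebody else's entry.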
Summary
When I started working on rchk, the PROTECT bug-finding tool, I first wanted
to use plain C to interface with LLVM. Even though the C interface existed,
I soon ran into problems as it was poorly documented, rather clumsy, and not
much used. LLVM is written in C++ and the intended and supported way to use
it is indeed through its C++ interface. Luckily, I switched to C++ already
at the beginning and wrote the tool completely in C++.
To interface with R from native code, the right interface is C. Apart from avoiding the problems I’ve described here, it is the language of the interface that is documented, supported and maintained by R Core, and described, together with the various restrictions and low-level rules that have to be followed, in one place. Using the C interface makes the code easier to review and easier to debug than any external wrapper interface. Using sophisticated C++ code on top of the C interface requires tracking things back to the original C interface and thinking about the restrictions (such as what destructors do, but also how and when it is ok to modify objects, etc., things that are much harder to find out than in the original interface).
The best option for those who need to use C++, e.g. to interface with
external libraries where the only meaningful interface is in C++, is to
avoid interfacing with R in any way from the C++ code (e.g. extend R via the
.C interface, or if via .Call then with thorough isolation using a C layer).
Such C++ code would operate on objects on the C heap (not R heap, except
perhaps pointers to existing objects that are allowed to be modified by the
R semantics), and would never call into R in any way.
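As an illustration of such isolation, here is a sketch of an entry point shaped for the .C interface, where arguments arrive as plain pointers. The names pkg_median and median_of and the status codes are hypothetical. The C++ part copies the input onto the C heap, never calls into R, and no exception is allowed to escape across the C boundary.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Plain C++ helper: knows nothing about R.
static double median_of(std::vector<double> values) {
    std::sort(values.begin(), values.end());
    std::size_t mid = values.size() / 2;
    return values.size() % 2 == 1
               ? values[mid]
               : 0.5 * (values[mid - 1] + values[mid]);
}

// Entry point with C linkage and pointer arguments, as .C passes them.
// status: 0 = ok, 1 = C++ exception caught, 2 = empty input.
extern "C" void pkg_median(double *x, int *n, double *result, int *status) {
    try {
        if (*n <= 0) {
            *status = 2;
            return;
        }
        *result = median_of(std::vector<double>(x, x + *n));
        *status = 0;
    } catch (...) {
        *status = 1;   // e.g. std::bad_alloc; never let it propagate into R
    }
}
```

From R this could be called along the lines of .C("pkg_median", as.double(x), length(x), result = double(1), status = integer(1)), with the R code checking status and raising the error itself.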
Packages that are already using C++ would best be carefully reviewed and
fixed by their authors. When the use of C++ is very limited and easy to
avoid, perhaps the best option is to remove it; otherwise one could use some
of the tricks I’ve described here. Note that using Rcpp does not release
package authors from thinking about these problems: indeed with Rcpp one can
still call the R API directly, but even if that is avoided, one can
introduce PROTECT errors by incorrectly using existing objects (like in the
RNGScope example), by introducing complicated destructors of their own
objects (e.g. calling an allocating R API function from a destructor), or
cause a memory leak by allocating memory dynamically without thinking about
exceptions.