--- title: "Unprotecting by Value" author: "Tomas Kalibera" date: 2018-12-10 categories: ["Internals", "User-visible Behavior"] tags: ["parsing", "PROTECT bugs"] ---
In short, UNPROTECT_PTR
is dangerous and should not be used. This text
describes why and describes how to replace it, including mset-based
functions that have been introduced as a substitute for situations when
unprotection by value is really needed. This could be of interest to anyone
who writes native code to interface with the R heap, and definitely to all
who use UNPROTECT_PTR
in their code.
R provides several functions to protect pointers to R objects held by local
C variables (typed SEXP
) from the garbage collector. As documented in
Writing R Extensions,
there are two structures to hold protected pointers: the pointer
protection stack and the precious list.
The pointer protection stack is accessed using PROTECT
/UNPROTECT
.
Pointers are unprotected by being removed from the top of the stack. One
can also use PROTECT_WITH_INDEX
and then REPROTECT
to replace a pointer
defined by its position in the stack, which allows to simplify and speed-up
code that repeatedly updates local variables holding pointers (in such
scenarios, one could in principle still use a sequence of
PROTECT
/UNPROTECT
operations, instead). The pointer protection stack
needs to be managed in-line with the C call stack: after returning from a
function, the stack depth should be the same as when the function was called
(pointer protection balance). These and other rules are described in
Writing R
Extensions
and The danger of PROTECT
errors,
they are relatively easy follow and check, both visually and by tools (also,
pointer-protection balance is checked to some level at runtime). The
stack-based protection and unprotection are fast, do not require additional
allocation and are automatically handled during R errors (long jumps): a
long jump recovers the previous stack depth, unprotecting the values that
have been left on the stack by the code executed after the jump was set but
before the jump was executed.
Although such situations are very rare, sometimes achieving pointer
protection balance is difficult, sometimes say package code wishes to keep
some allocated space without returning a pointer to it (hence without making
the caller protect it, when we have global variables pointing to R heap and
for some reason cannot turn them into locals). This is addressed by the
precious list, which is accessed using
R_PreserveObject
/R_ReleaseObject
. It is implement as a linked list (and
yes, R_PreserveObject
allocates!) and objects are unprotected by value.
There is no automated unprotection on error, the user is always responsible
for unprotecting objects stored on the precious list. To achieve that in
case of R errors (long jumps) or in callbacks (e.g. unloading of a
package), it may be necessary to allocate a dummy object, set up its
finalizer, and let the finalizer release needed objects from the precious
list. R_PreserveObject
and R_ReleaseObject
are also much slower than
PROTECT
/UNPROTECT
.
The API was still not sufficient for very special applications, applications
which used generated code that allocated memory from the R heap, such as the
R parser generated by bison
. The parser code uses a stack of semantic
values, which are pointers to objects on the R heap. Values are pushed on
the stack by the tokenizer during shift operations, are both pushed and
removed during actions of reduce operations, and are removed on some parse
errors. R errors (long jumps) can also occur during parsing. The stack is
local to a parsing function. The key problem is that the code of the parser
is generated and bison
cannot be customized enough to ensure insertion of
PROTECT
/UNPROTECT
operations. It would be natural to allocate the
semantic values stack on the R heap, protect it, and protect semantic
values when held in local variables but not yet on the semantic values
stack, all using PROTECT
/UNPROTECT
. But, this is not possible. In
principle, R_PreserveObject
/R_ReleaseObject
could be used, but one would
have to handle the errors and, most importantly, the performance overhead
would not be acceptable.
To work around this problem, UNPROTECT_PTR
has been introduced. It allows
relatively fast unprotect-by-value operation for semantic values protected
in the pointer protection stack. When new semantic values are created, they
are immediatelly put on the protection stack using PROTECT
by the
tokenizer and reduce rules. The values are unprotected by UNPROTECT_PTR
inside the reduce rules, and the pointer protection stack depth is restored
after certain parse errors that did not cause a long jump (one can also
define a destructor
in bison
for some tokens and make it call
UNPROTECT_PTR
, as done in the Rd
parser in package tools
).
UNPROTECT_PTR
removes the first occurrence of the pointer (starting at
stack top) and squeezes the stack, reducing the stack depth. Using
UNPROTECT_PTR
this way causes pointer protection imbalance by design (the
tokenizer and reduce rules are implemented in different functions), which
increases cognitive complexity of the code. It is, however, faster than the
precious list and uses less memory, when used carefully it works with R long
jumps (automated unprotection), and it may well be that there is not a
better way to do protection in the parser than unprotect-by-value (if we
don’t modify the generated parser code). It has been used for many years in
the parser and, unfortunately, started to be used also outside the parser
where not necessary.
It has been known and documented
that combining UNPROTECT_PTR
with PROTECT_WITH_INDEX
is dangerous,
because by removing a certain object from the stack by UNPROTECT_PTR
and
squeezing the stack, the protect index may become invalid/unexpected
(objects locations on the stack change). REPROTECT
would then replace the
wrong pointer, resulting in a memory leak (the object intended for
unprotection stays protected) and, worse, premature unprotection
(REPROTECT
would replace an object that still was to be protected). Code
which uses UNPROTECT_PTR
is also rather hard to read.
While working on some improvements of the parser I realized that
UNPROTECT_PTR
is unsafe also in combination with PROTECT/UNPROTECT
. The
problem occurs when the same pointer is stored multiple times on the
protection stack. One can accidentally use UNPROTECT_PTR
to unprotect the
unintended instance of the object, an instance that was intended to be
unprotected by UNPROTECT
, instead. At the point of UNPROTECT_PTR
,
nothing bad yet happens, but, when one later gets to the UNPROTECT
, the
wrong object gets unprotected, resulting in a premature unprotection
(protect bug). Unfortunately, it is quite common particularly in the parser
for the same pointer to be protected multiple times (R_NilValue
, symbols).
To illustrate this, imagine this sequence of pointers on the stack (3 is protected last, A and A’ are the same pointer, A is intended for unprotection by value):
1A2A'3
after UNPROTECT_PTR(A), we get
1A23
instead of what the enclosing code expected:
12A'3
The depth is ok, say the code later does UNPROTECT(1)
intending to
unprotect 3 and actually doing so, so still ok. But, then it calls
UNPROTECT(1)
intending to unprotect A'
, but instead unprotecting 2. As
a result, A'
(A
) will still be kept alive (memory leak, possibly
temporary, so not that bad), but 2 will be prematurely unprotected, causing
a protect bug (and one that may be very hard to debug).
In principle, R_NilValue
and symbols do not need to be protected at all,
but they are and sometimes it makes the code more readable when the
distinction is not made. Moreover, any function returning a pointer may
sometimes return a fresh pointer and sometimes a pointer that already exists
(including in the parser, where some list manipulating functions work(ed)
that way). So, this seems to be a real danger. Also, using UNPROTECT_PTR
the way as in the parser makes verification of other, purely stack-based
PROTECT/UNPROTECT
operations, harder, both manually and by tools, because
it is not made explicit which pointers were intended to be unprotected by
UNPROTECT_PTR
and which by R_ReleaseObject
.
I have thus removed the use of UNPROTECT_PTR
from all R base code. It was
relatively easy in the few cases when used outside the parser, I have just
rewritten the code using stack-based protection functions. I think in all
cases this actually simplified the code.
For use in the parsers (the R parser and the two parsers from package
tools
), I’ve introduced API for value-based unprotection outside the
pointer protection stack. These functions use a precious multi-set
to
protect these objects; the multi-set is allocated on the R heap and needs to
be protected by the caller (e.g. using PROTECT
). Consequently, it is
automatically unprotected on the long jump, and hence all pointers protected
in the mset get indirectly unprotected as well. The current implementation
uses a (vector-) list instead of a pair-list, so is also faster than
R_PreserveObject
/R_ReleaseObject
, but this is just an implementation
detail that can change and certainly the unprotection could be made faster
if it turns out to be a bottleneck in practice. The main benefit is that
these functions use a separate structure for unprotection by value, not
polluting the pointer protection stack.
SEXP R_NewPreciousMSet(int initialSize);
void R_PreserveInMSet(SEXP x, SEXP mset);
void R_ReleaseFromMSet(SEXP x, SEXP mset);
void R_ReleaseMSet(SEXP mset, int keepSize);
To use this API, one needs first to create a new mset using
R_NewPreciousMSet
and PROTECT
it. The mset is expanded automatically as
needed (R_PreserveInMSet
may allocate). Objects are released by value via
R_ReleaseFromMSet
using the same (naive) algorithm as was used in
UNPROTECT_PTR
, so there should be no performance hit (in principle, the
operations could be faster as they do not have to deal with objects intended
for stack-based protection). One does not have to release objects
explicitly, they will all be released when the mset is garbage collected
(e.g. on a long jump that would unprotect the mset). For performance
reasons, one may however use R_ReleaseMSet
to clear the mset but keep it
allocated, if the allocated size is not bigger than given number of
elements (this can be used e.g. on errors that are not implemented as long
jumps). As anything in R-devel code base, the API is still subject to
change.
Switching from UNPROTECT_PTR
to the new API is harder than it may first
seem as one has to identify the PROTECT
operations that are intended for
unprotection by value (and rewrite the code when some code paths unprotect
the same “variable” in one way and other code paths in another).
I think for memory protection one should always use PROTECT
/UNPROTECT
,
possibly with PROTECT_WITH_INDEX
/REPROTECT
in performance critical code.
R_PreserveObject
/R_ReleaseObject
help if we have global variables
holding on to R memory, but global variables should be avoided anyway also
for other reasons, so this should be very rare. Also, arranging for
unprotection on error is a bit tedious.
R_PreserveInMSet
/R_ReleaseFromMSet
should be used only in bison
/yacc
parsers and UNPROTECT_PTR
should be phased out from all code.