--- title: "Unprotecting by Value" author: "Tomas Kalibera" date: 2018-12-10 categories: ["Internals", "User-visible Behavior"] tags: ["parsing", "PROTECT bugs"] --- ```{r setup, include=FALSE} knitr::opts_chunk$set(collapse = TRUE) ``` In short, `UNPROTECT_PTR` is dangerous and should not be used. This text describes why and describes how to replace it, including mset-based functions that have been introduced as a substitute for situations when unprotection by value is really needed. This could be of interest to anyone who writes native code to interface with the R heap, and definitely to all who use `UNPROTECT_PTR` in their code. # Background R provides several functions to protect pointers to R objects held by local C variables (typed `SEXP`) from the garbage collector. As documented in [Writing R Extensions](https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Garbage-Collection), there are two structures to hold protected pointers: the pointer protection stack and the precious list. The *pointer protection stack* is accessed using `PROTECT`/`UNPROTECT`. Pointers are unprotected by being removed from the top of the stack. One can also use `PROTECT_WITH_INDEX` and then `REPROTECT` to replace a pointer defined by its position in the stack, which allows to simplify and speed-up code that repeatedly updates local variables holding pointers (in such scenarios, one could in principle still use a sequence of `PROTECT`/`UNPROTECT` operations, instead). The pointer protection stack needs to be managed in-line with the C call stack: after returning from a function, the stack depth should be the same as when the function was called (pointer protection balance). These and other rules are described in [Writing R Extensions](https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Garbage-Collection) and [The danger of PROTECT errors](https://github.com/kalibera/cran-checks/blob/master/rchk/PROTECT.md), they are relatively easy follow and check, both visually and by tools (also, pointer-protection balance is checked to some level at runtime). The stack-based protection and unprotection are fast, do not require additional allocation and are automatically handled during R errors (long jumps): a long jump recovers the previous stack depth, unprotecting the values that have been left on the stack by the code executed after the jump was set but before the jump was executed. Although such situations are very rare, sometimes achieving pointer protection balance is difficult, sometimes say package code wishes to keep some allocated space without returning a pointer to it (hence without making the caller protect it, when we have global variables pointing to R heap and for some reason cannot turn them into locals). This is addressed by the *precious list*, which is accessed using `R_PreserveObject`/`R_ReleaseObject`. It is implement as a linked list (and yes, `R_PreserveObject` allocates!) and objects are unprotected by value. There is no automated unprotection on error, the user is always responsible for unprotecting objects stored on the precious list. To achieve that in case of R errors (long jumps) or in callbacks (e.g. unloading of a package), it may be necessary to allocate a dummy object, set up its finalizer, and let the finalizer release needed objects from the precious list. `R_PreserveObject` and `R_ReleaseObject` are also much slower than `PROTECT`/`UNPROTECT`. 
The API was still not sufficient for very special applications, applications which use generated code that allocates memory from the R heap, such as the R parser generated by `bison`. The parser code uses a stack of semantic values, which are pointers to objects on the R heap. Values are pushed on the stack by the tokenizer during shift operations, are both pushed and removed during actions of reduce operations, and are removed on some parse errors. R errors (long jumps) can also occur during parsing. The stack is local to a parsing function.

The key problem is that the code of the parser is generated and `bison` cannot be customized enough to ensure insertion of `PROTECT`/`UNPROTECT` operations. It would be natural to allocate the semantic values stack on the R heap, protect it, and protect semantic values while they are held in local variables but not yet on the semantic values stack, all using `PROTECT`/`UNPROTECT`. But this is not possible. In principle, `R_PreserveObject`/`R_ReleaseObject` could be used, but one would have to handle the errors and, most importantly, the performance overhead would not be acceptable.

To work around this problem, `UNPROTECT_PTR` has been introduced. It allows a relatively fast unprotect-by-value operation for semantic values protected on the pointer protection stack. When new semantic values are created, they are immediately put on the protection stack using `PROTECT` by the tokenizer and the reduce rules. The values are unprotected by `UNPROTECT_PTR` inside the reduce rules, and the pointer protection stack depth is restored after certain parse errors that did not cause a long jump (one can also define a `destructor` in `bison` for some tokens and make it call `UNPROTECT_PTR`, as done in the `Rd` parser in package `tools`). `UNPROTECT_PTR` removes the first occurrence of the pointer (starting at the stack top) and squeezes the stack, reducing the stack depth.

Using `UNPROTECT_PTR` this way causes pointer protection imbalance by design (the tokenizer and reduce rules are implemented in different functions), which increases the cognitive complexity of the code. It is, however, faster than the precious list and uses less memory, when used carefully it works with R long jumps (automated unprotection), and it may well be that there is no better way to do protection in the parser than unprotect-by-value (if we don't modify the generated parser code). It has been used for many years in the parser and, unfortunately, started to be used also outside the parser where it was not necessary.

It has been known and [documented](https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Garbage-Collection) that combining `UNPROTECT_PTR` with `PROTECT_WITH_INDEX` is dangerous, because by removing an object from the stack and squeezing the stack, `UNPROTECT_PTR` may make a *protect index* invalid/unexpected (objects' locations on the stack change). `REPROTECT` would then replace the wrong pointer, resulting in a memory leak (the object intended for unprotection stays protected) and, worse, premature unprotection (`REPROTECT` would replace an object that was still to be protected). Code which uses `UNPROTECT_PTR` is also rather hard to read.

# UNPROTECT_PTR is dangerous

While working on some improvements of the parser I realized that `UNPROTECT_PTR` is unsafe also in combination with `PROTECT`/`UNPROTECT`. The problem occurs when the same pointer is stored multiple times on the protection stack.
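For example (a minimal, hypothetical fragment, not taken from the parser), protecting an interned symbol in two places puts the same pointer on the stack twice:

```
SEXP a = PROTECT(install("mean")); /* symbols are interned, so ...          */
/* ... some other values get protected here ... */
SEXP b = PROTECT(install("mean")); /* ... a and b are the same pointer,
                                      now on the protection stack twice     */
```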
One can accidentally use `UNPROTECT_PTR` to unprotect an unintended instance of the object, an instance that was meant to be unprotected by `UNPROTECT` instead. At the point of the `UNPROTECT_PTR` call nothing bad happens yet, but when one later gets to the `UNPROTECT`, the wrong object gets unprotected, resulting in premature unprotection (a protect bug). Unfortunately, particularly in the parser it is quite common for the same pointer to be protected multiple times (`R_NilValue`, symbols).

To illustrate this, imagine this sequence of pointers on the stack (3 is protected last, A and A' are the same pointer, A is intended for unprotection by value):

```
1A2A'3
```

after `UNPROTECT_PTR(A)`, we get

```
1A23
```

instead of what the enclosing code expected:

```
12A'3
```

The depth is ok: say the code later does `UNPROTECT(1)` intending to unprotect 3, and it actually does so, so all is still fine. But then it calls `UNPROTECT(1)` intending to unprotect `A'`, yet instead unprotects 2. As a result, `A'` (`A`) will still be kept alive (a memory leak, possibly temporary, so not that bad), but 2 will be prematurely unprotected, causing a protect bug (and one that may be very hard to debug).

In principle, `R_NilValue` and symbols do not need to be protected at all, but they often are, and sometimes the code is more readable when the distinction is not made. Moreover, any function returning a pointer may sometimes return a fresh pointer and sometimes a pointer that already exists (including in the parser, where some list-manipulating functions work(ed) that way). So, this seems to be a real danger. Also, using `UNPROTECT_PTR` the way it is used in the parser makes verification of other, purely stack-based `PROTECT`/`UNPROTECT` operations harder, both manually and by tools, because it is not made explicit which pointers were intended to be unprotected by `UNPROTECT_PTR` and which by `UNPROTECT`.

# Phasing out UNPROTECT_PTR

I have thus removed the use of `UNPROTECT_PTR` from all R base code. It was relatively easy in the few cases where it was used outside the parser; I have just rewritten the code using stack-based protection functions. I think in all cases this actually simplified the code.

For use in the parsers (the R parser and the two parsers from package `tools`), I've introduced an API for value-based unprotection outside the pointer protection stack. These functions use a *precious multi-set* to protect objects; the multi-set is allocated on the R heap and needs to be protected by the caller (e.g. using `PROTECT`). Consequently, it is automatically unprotected on a long jump, and hence all pointers protected in the mset get indirectly unprotected as well. The current implementation uses a (vector-) list instead of a pair-list, so it is also faster than `R_PreserveObject`/`R_ReleaseObject`, but this is just an implementation detail that can change, and certainly the unprotection could be made faster if it turns out to be a bottleneck in practice. The main benefit is that these functions use a separate structure for unprotection by value, not polluting the pointer protection stack.

```
SEXP R_NewPreciousMSet(int initialSize);
void R_PreserveInMSet(SEXP x, SEXP mset);
void R_ReleaseFromMSet(SEXP x, SEXP mset);
void R_ReleaseMSet(SEXP mset, int keepSize);
```

To use this API, one first needs to create a new mset using `R_NewPreciousMSet` and `PROTECT` it. The mset is expanded automatically as needed (`R_PreserveInMSet` may allocate).
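A minimal sketch of the intended usage pattern (hypothetical function names and sizes, not the actual parser code):

```
#include <Rinternals.h>

SEXP do_parse_something(void)
{
    /* one mset per parsing routine, protected on the ordinary stack */
    SEXP mset = PROTECT(R_NewPreciousMSet(200));

    SEXP sv = allocVector(INTSXP, 1); /* a semantic value */
    R_PreserveInMSet(sv, mset);       /* protect it by value (may allocate,
                                         growing the mset) */

    /* ... generated parser code runs; it may allocate and it may long-jump.
       A long jump unprotects mset and with it, indirectly, everything
       preserved in it ... */

    R_ReleaseFromMSet(sv, mset);      /* release by value when no longer needed */
    UNPROTECT(1);                     /* the mset itself */
    return R_NilValue;
}
```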
Objects are released by value via `R_ReleaseFromMSet`, using the same (naive) algorithm as was used in `UNPROTECT_PTR`, so there should be no performance hit (in principle, the operations could be faster, as they do not have to deal with objects intended for stack-based protection). One does not have to release objects explicitly; they will all be released when the mset is garbage collected (e.g. after a long jump that unprotects the mset). For performance reasons, one may however use `R_ReleaseMSet` to clear the mset but keep it allocated, as long as the allocated size is not bigger than the given number of elements (this can be used e.g. on errors that are not implemented as long jumps). As with anything in the R-devel code base, the API is still subject to change.

Switching from `UNPROTECT_PTR` to the new API is harder than it may first seem, as one has to identify the `PROTECT` operations that are intended for unprotection by value (and rewrite the code when some code paths unprotect the same "variable" in one way and other code paths in another).

# Choosing the right API

I think for memory protection one should always use `PROTECT`/`UNPROTECT`, possibly with `PROTECT_WITH_INDEX`/`REPROTECT` in performance-critical code. `R_PreserveObject`/`R_ReleaseObject` help if we have global variables holding on to R memory, but global variables should be avoided anyway, also for other reasons, so this should be very rare. Also, arranging for unprotection on error is a bit tedious. `R_PreserveInMSet`/`R_ReleaseFromMSet` should be used only in `bison`/`yacc` parsers, and `UNPROTECT_PTR` should be phased out from all code.