--- title: "Staged Install" author: "Tomas Kalibera" date: 2019-02-14 categories: ["Package installation"] tags: ["R CMD INSTALL", "parallel install", "shared objects"] --- ```{r setup, include=FALSE} knitr::opts_chunk$set(collapse = TRUE) ``` This text is about a new feature in R, staged installation of packages. It may be of interest to package authors and maintainers, and particularly to those who maintain packages that are affected. # The problem I often have to run checks for all CRAN and BIOC packages to test the impact of my changes to R. This is to find about my own bugs, but often I also wake up existing bugs in packages or R or find out that some packages rely on undocumented API or behavior. I run all CRAN/BIOC package tests for the baseline R-devel version, then for my modified version, and then I compare the outcomes looking for packages newly failing or newly with warnings. In each run, I install (the same version of) packages afresh, and indeed to get that in a reasonable time, the installation is run in parallel. During the last months this process has been increasingly complicated by randomly appearing warnings during installation, like ```Warning: S3 methods '[.fun_list', '[.grouped_df', 'all.equal.tbl_df' ... [... truncated]```. These warnings appeared for many packages, but not repeatably, so they complicated the analysis of check results. Some of the processing is automated, re-checking packages in base and modified version to reduce the number of differences due to temporary unavailability of remote systems. Initially the install warnings were also accompanied by check warnings like: ```Warning in grep(pattern, x, invert = TRUE, value = TRUE, ...) : input string 1 is invalid in this locale``` These check warnings turned out to be emitted because of the truncation that sometimes accidentally split multi-byte UTF-8 characters. I fixed the truncation and then found out the original installation warning was actually saying "S3 methods were declared in NAMESPACE but not found". Incidentally, there were just two distinct (very long) lists of methods in the warnings across all installed packages in my run, but repeated for many packages. It turned out that they were lists of exported methods from `dplyr` and `rlang` packages. These two packages take very long to install due to C++ code compilation. They also have a lot of reverse dependencies and so while they are being installed, it is very likely that another package being installed would use them in a partially-installed state, and this is why these warnings were emitted. I learned that the CRAN team indeed had been affected by this problem as well for long and that they have seen it unsurprisingly caused by also other packages that took long time to install, not just `dplyr` and `rlang`. In principle, this problem does not only happen during parallel installation and does not affect only repository maintainers and R core developers who regularly check all CRAN and/or BIOC packages. The problem is present any time the same R library is used from different R sessions (and in some installations there could be sessions run by different users). The package installation process has become complicated and can run arbitrary code, even from packages themselves, so the consequences of accessing other packages in inconsistent/partially-installed state are unpredictable and potentially dangerous. The probability of this race condition happening seems to have increased in the last years with wider use of C++ (in patterns that take long to compile), as the problem has not been observed before. # Existing lock directories do not solve the problem The current implementation of package installation by default backs up the old installation of the package by moving it into a per-library `00LOCK` directory (or per-package `00LOCK-pkgname`). The installation is performed directly into the final directory `pkgname` in the library. If it fails, it is by default cleaned up and the old version is moved back; otherwise, if it succeeds, the old version is deleted. If the lock directory already exists when the installation is requested, the installation fails with an error and one typically would delete the directory manually. During parallel install, the per package locking is used (`00LOCK-pkgname`). This locking mechanism works for backing-up and recovering previous versions of packages in case of error, but it does not prevent access to partially installed packages. I've been trying initially to extend it to do so, after all, it would seem natural to make R respect the lock directories and ignore packages that were "locked", getting a cheap partial solution to the problem. "Partial" because of the obvious race condition - what happens between checking the existence of a lock directory and accessing the package. It turned out to be neither cheap nor easy to implement, and in the end we decided for *staged install*, instead. The first observation was that one cannot simply hide/ignore the packages for which there is a lock directory -- this is not possible because during installation, one needs to be able to see the (partially installed) package. For example, this is while the lazy loading database is being built (so one has to be able to load the namespace), but also when running a custom installation script from the package (`install.libs.R`). One would have to customize all package access/discovery functions so that they would make the locked package visible just to the R session(s) that were installing the package. Passing function arguments all the way down to the package discovery functions would not be realistic, but in principle this would be possible via environment variables, some of which are already in use. For a start, I've looked at how packages check if another package is installed. This is a surprisingly common task and I found many popular ways (`installed.packages()`, `requireNamespace()`, `require()`, `.packages()`, `system.file()`, `find.package()`, `packageVersion`). I may have easily overlooked some cases as I've just grepped the source code of all the packages and there will be most likely many more types of access to packages than just checking if they were installed. If we missed to handle any of the cases, the resulting race conditions would be extremely hard to debug (not repeatable runs, only showing on some systems, etc). Also, it is not impossible that some tools or packages are looking directly into the library directory to discover packages. Finally, there will be a non-trivial performance overhead in package access functions. # Staged installation Staged installation is hence the implemented solution to the problem. It only works together with the lock directories, which are used by default. A package is first installed into a temporary directory under the lock directory (under `00LOCK` or `00LOCK-pkgname`). When the package is being installed, this temporary directory is the R library for that R session, so the R session sees the partially installed package using the standard means. Other packages, however, do not see it. After the package is installed (byte-compiled, lazy loading database created, native code compiled and built, test-loaded, etc), it is moved to the final location (`pkgname`) and becomes visible to other packages. Directory move is very fast operation within the same filesystem and in POSIX/Unix it is atomic (on Windows it is also fast, but not easily done to be guaranteed atomic). Staged installation thus provides isolation of partially installed packages on the file-system level and all package access APIs or even file-based API usage can stay as they are now. It was clear from the beginning that the problems would, instead, arise from the fact that packages are moved to a different directory after they are installed and the original directory no longer exists. Packages fail with staged install when they hard-code the temporary installation directory name (save it to some configuration file, keep it in an R object, or save it via linker to a shared object as absolute path or linker `rpath`). Luckily, this is the case with only a small number of packages from CRAN and BIOC and it is relatively easy to find out without spending days of debugging (compared to debugging that would be needed if package access code had to be updated to respect lock directories). # Paths hard-coded in shared objects This problem exists only in several packages from CRAN and BIOC, when a package dynamically links one of its shared objects against another of _its_ shared objects and uses linker `rpath` (`runpath`) or an absolute shared object path when doing so. This problem does not exist on Windows where paths cannot be hard-coded this way, but exists on Linux, Solaris, macOS and other Unix systems. The affected packages would ideally be updated to avoid such linking. Note that linking against shared objects from _other_ packages is not a problem for staged install. On Windows, packages cannot do this, and so they would use static linking within the same package. I think it would just be simplest to do the same on all systems; the disk space overhead due to the code size is hardly relevant these days and, if that is possible on Windows, why not on other systems, too. An example is `Rhtslib` from BIOC, which now uses static linking on Windows and macOS, but dynamic linking with `rpath` on other systems including Linux. If static linking was not possible for some reason, one could still use symbolic dynamic linker variables. On Linux and Solaris, `$ORIGIN` is a linker variable that points to where the current shared object was found, so one can set `rpath` e.g. to `\$ORIGIN/../usrlibs` (the `..` gets out of `libs`, the common directory for shared objects in packages). On macOS, one can use `@loader_path` the same way. These symbolic variables get interpreted by the dynamic linker, so the dependencies are found even after the package is moved to the final location. During staged installation on non-Windows systems, R will check for hard-coded paths in shared objects. This requires OS-specific external tools which are normally available on systems that build packages from source. On Linux, it uses `readelf`, which is part of `binutils`. On macOS, it uses `otool`, which is part of CLT (Command Line Tools) and hence should be available on all systems that build packages from source. On Solaris, `elfedit` is used. Finally, R fixes the hard-coded paths in shared objects automatically when installing packages and the needed OS-specific external tools are available. On Linux, `patchelf` is used when available to fix both `rpath` and absolute linking paths, it is usually available in a separate package named `patchelf` and unfortunately not usually installed by default. On macOS, `install_name_tool` is used and it is part of CLT like `otool`, so should be available. On Solaris, `elfedit` is used and should be available in the OS. On Linux and Solaris, `chrpath` can also be used but only to fix the `rpath`, not absolute paths to other shared libraries, but they should be rare on non-macOS systems. The detection of the hard-coded paths and fixing is done automatically during staged install, with informative messages. When paths cannot be fixed (tools are not available or they did not succeed fixing), installation will fail. Also, the package is test-loaded also from its final location, which can detect problems with some hard-coded paths on its own, even when tools to analyze the shared objects were not available. Packages during their installation typically get their installation directory name from `R_PACKAGE_DIR` environment variable, e.g. for use with in build scripts or make files. With staged install, this variable holds the _temporary_ installation directory. Note that the package, after the native code is built, is test-loaded from its _temporary_ installation directory first. Packages should not attempt to refer to the final installation directory name in any way. # Paths hard-coded in R code Packages often need to access files from their own installation directory, which can always be obtained by `system.file(package=)` call. Some packages save the directory names obtained by `system.file()`, but that practice is dangerous with staged install and should be avoided. With staged install, it may happen that the saving of the directory is executed when the package still runs in the temporary installation directory, typically while the package is being prepared for lazy loading. The preparation for lazy loading involves sourcing all R files of the package, hence also executing all the assignments to global variables. Therefore, assignments like this (from `pd.ecoli`) at the top level in an R source file in a package save the temporary installation directory: ``` globals$DB_PATH <- system.file("extdata", "pd.ecoli.sqlite", package="pd.ecoli") ``` Sometimes the calls to `system.file(package=)` are hidden deeper in assignments that are executed when the namespace is loaded for preparation of lazy loading database, including in assignments setting up S4 classes. I think the best way to fix these patterns is to just always call `system.file()`, so in this case have a function like below, _and_ never save the result in anything that is not an obviously local variable in a function. ``` getDbPath <- function() system.file("extdata", "pd.ecoli.sqlite", package="pd.ecoli") ``` However, even though not ideal, it is also possible to fix such hard-coded paths in `.onLoad` package hook (`pd.ecoli` does already fix them, even before staged install, but only in `.onAttach`, so one can still access the wrong path): ``` .onAttach <- function(libname, pkgname) { globals$DB_PATH <- system.file("extdata", "pd.ecoli.sqlite", package="pd.ecoli", lib.loc=libname) ... ``` The problem with fixing in `.onLoad` is that the binary image of the package still includes the hard-coded temporary installation directory name, and thus checking tools that look at the files without loading the namespace would report errors (the tool described later in this text, however, loads the namespace so it would see the state after hooks have been executed). During staged installation, R checks for hard-coded paths that include the temporary installation directory, and if it finds any, the installation fails with an informative message. This is a conservative approach, because in some cases the hard-coded installation directory would never really be used to access files, but it is a prevention against hard-to-find bugs. The problem of hard-coded paths in R code is a bit more common that of the paths in shared objects, but it still directly affects only a small number of packages from CRAN and BIOC. # Testing packages for staged install Package authors can test their packages for staged installation by attempting the install using `R CMD INSTALL --staged-install` with a recent version of R-devel. The checks during the installation should be defensive enough to catch most problems: if staged installation succeeds and the package worked with non-staged installation (to be applied also to package dependencies), it should also work with staged installation. Currently, the only known exception is when a package saves its temporary installation path into an external file, which is not checked automatically. I would be happy for reports about any other issues that are undetected by the checks. My tests on Linux suggest that currently 21 CRAN and 4 BIOC packages fail to install because they have hard-coded temporary installation paths in their R code. 2 CRAN and 2 BIOC packages fail to install because they have hard-coded temporary installation paths in their shared objects. Some packages fail to install because they depend on these: in total, out of CRAN/BIOC, 48 packages failed to install with staged installation, but could be installed with non-staged installation. The CRAN team has been running many more tests with on multiple platforms and with multiple C compilers. The problem of hard-coded paths in shared objects is trivial to diagnose from the installation log/output, which contains the name of the shared object in the error message and typically also the compilation/linking commands used for building the native code of the package (so most of the times one can just search the output for "rpath"). Also, package authors did have to specify linking using `rpath` or absolute path explicitly, so there needs to be a record of it in build scripts or make files of the package. The problem of hard-coded paths in R code is a bit harder to diagnose, the installation only performs a trivial check to find out that there is a hard-coded path, but checking out where is a bit more time consuming. I've written a simple program (`sicheck`) that finds out what are the hard-coded paths (already knowing the path sometimes helps, when one can search the suffix in R package sources). It also tries to find out R expressions (object paths) how to get to these hard-coded paths from the environment of the package namespace. The program and results for recent versions of CRAN and BIOC 3.9 packages can be found [here](https://github.com/kalibera/rstagedinst). For example, package `franc` has these reports: ``` Package contains these hard-coded paths (sercheck): CONTAINS: franc/speakers.json CONTAINS: franc/data.json Package contains these objects with hard-coded paths (walkcheck): OBJPATH: as.list(getNamespace("franc"), all.names=TRUE)[["speakers_file"]] franc/speakers.json SPATH: franc$speakers_file franc/speakers.json OBJPATH: as.list(getNamespace("franc"), all.names=TRUE)[["datafile"]] franc/data.json SPATH: franc$datafile franc/data.json ``` In the above, `CONTAINS: franc/speakers.json` means that `sicheck` tool found hard-coded path to `franc/speakers.json` (the output copied to this text excludes the prefix of the full path including the `00LOCK-franc` directory). The name is hard-coded in variable `datafile` of the package namespace (`OBJPATH:` and `SPATH:` sections). It is easy to see that this happens because source file `speakers.R` of the package has this assignment at the top-level: ``` speakers_file <- system.file("speakers.json", package = packageName()) ``` A slightly less trivial example is package `zonator`. Its report includes: ``` CONTAINS: zonator/extdata/test_project/zsetup/01/01_out OBJPATH: as.list(as.list(getNamespace("zonator"), all.names=TRUE)[[".options"]],all.names=TRUE)[["results.dir"]] zonator/extdata/test_project/zsetup/01/01_out SPATH: zonator$.options$results.dir zonator/extdata/test_project/zsetup/01/01_out ``` The hard-coded path is `extdata/test_project/zsetup/01/01_out`. It is being hard-coded in source file `options.R` of the package, in (top-level command): ``` assign("results.dir", file.path(.options$setup.dir, "01/01_out"), envir = .options) ``` I found this line of code first using `grep` on the sources, looking for `01_out`. It is probably always easiest to try this first before trying to interpret more complicated object paths, but it does not help when the hard-coded path does not have a unique suffix, e.g. when it is just path to the root of the package installation. Then, one needs to analyze the object path. In this example, the object path is is still easy to understand. The executable one (`OBJPATH`) can be executed to get the value (excluding hard-coded path prefix) in R: ``` > as.list(as.list(getNamespace("zonator"), all.names=TRUE)[[".options"]],all.names=TRUE)[["results.dir"]] Registered S3 methods overwritten by 'ggplot2': method from [.quosures rlang c.quosures rlang print.quosures rlang [1] "zonator/extdata/test_project/zsetup/01/01_out" ``` `SPATH` (`zonator$.options$results.dir`) tries to be more concise, but is not executable. The special elements of these paths are: $name | named vector element [i] | unnamed vector element -A | attributes -E | environment @ | S4 data part Note that currently the tool does not attempt to find the shortest path to the object. # Opting out Staged installation is not currently turned on by default but the plan is to do so soon. Packages that for some reason could not be fixed for staged installation (or could not be fixed in time) can be still installed after the switch using the current, non-staged, procedure. Packages can opt-out via `StagedInstall` field in their `DESCRIPTION` file. There is no need for packages to opt-in as this is going to be the default. There are also new options for `R CMD INSTALL`: `--staged-install` and `--no-staged-install`. # Summary Staged installation is a new feature of `R CMD INSTALL` in R-devel, which is intended to be soon turned on by default. It isolates packages during installation time so that they are not accidentally accessed by other R sessions, which is key to correct function of parallel installation, but is relevant to any installation that may use multiple R sessions. Some packages need to be fixed to work with staged installation and package authors are kindly asked to cooperate with repository maintainers and update their packages promptly. It may not be immediately obvious that the role of the repository maintainers is very important also in the process of enhancing R. Adding a feature to R often puts a significant amount of work on them as they test packages on different platforms, analyze the outputs, and sometimes debug the packages to figure out whom to report the bugs to or to help package maintainers who do not have enough technical skill to do so on their own. In addition to that "usual" load for repository maintainers, this feature has been implemented in close collaboration with the CRAN team and particularly Brian Ripley has provided valuable advice, comments, reviews and found a number of issues by testing.