--- title: "Faster downloads" author: "Tomas Kalibera" date: 2024-12-02 categories: ["User-visible Behavior", "Internals"] tags: ["downloading files", "package installation"] --- ```{r setup, include=FALSE} knitr::opts_chunk$set(collapse = TRUE) ``` Most R users would sometimes install or update R packages and hence are impacted by how long this takes. The parts of package installation that take potentially longest have already been addressed by support for binary packages and parallel installation. A remaining overhead that may be rather surprising, but is easy to reduce, is package download. The overhead may be noticeable when installing many mostly small packages in parallel, because so far the package download has been sequential even with parallel installation. This post reports on recent work in R-devel, the development version of R (to become R 4.5.0), on improving the support for simultaneous download. As any work in the development version of R, this should be regarded as experimental and may be updated or changed before release. The text includes technical details. The short story for R users not necessarily interested in those is that R 4.5 will probably download packages for installation several times faster than R 4.4. In certain specific situations, hopefully rare, users might have to increase the internet timeout (see `?download.file`, look for `timeout`), which applies to downloading individual files. For best performance of their users, maintainers of package repository mirrors might consider enabling `HTTP 2`. With `HTTP 1.1`, they might experience more concurrent connections than with previous versions of R. In either case, they would experience more concurrent transfers to the same client. # Rsync Some, very small, group of R users would often download and install all or almost all packages from a repository, typically CRAN or Bioconductor, for R or R package testing. The work presented here doesn't impact such use. For such use, the best practice is to have a local mirror of each package and install packages as files from the mirror (one would use `rsync` to create and update the mirror). This will save network traffic and resources on the repository servers. The improved download in R presented here wasn't meant to compete with `rsync`. # Progress bar In already released versions of R (R 4.4 and older), a simple way to speed up downloads of very many small files is to disable the progress bar by passing `quiet=TRUE` to `download.packages()` or `install.packages()`. The progress bar is only displayed when downloading a single file, but even when installing several packages at once, they are downloaded sequentially. Disabling the progress bar helps particularly on Windows. # Simultaneous download R has support for simultaneous download since version 3.2.0 (so for over 9 years, see `?download.file`) using download method `libcurl`: with the download method explicitly specified to `download.file`, one can provide a vector of URLs to download from and a vector of destination file names. At the low level, the implementation creates a single curl "multi-handle" for the whole operation and a curl "easy-handle" for each URL. Transfers would be initiated and performed concurrently in a single thread (R main thread). This offers a substantial speedup. The key is that R doesn't have to wait for the transfer of each file individually to be set up, but while setting it up, it can keep setting up other transfers and downloading. # Package download and installation The simultaneous download, however, hasn't been used by `download.packages()` and `install.packages()`, so couldn't be easily used to speed up package downloading. These days, R requires the curl library and the `libcurl` download method is the default on all platforms, so it makes sense to enable simultaneous download for package installation. It has been done in R-devel for `download.packages()` and for `install.packages()` when the default download method is used (or when `libcurl` is requested explicitly). `install.packages()` needs to know download of which files has succeeded, so this has been added to the `download.file()` API. With the existing version of simultaneous download from R 3.2.0, I could get about 2-3x speedup on downloading 100 CRAN packages via `download.packages()`, but see below for the disclaimer on reproducibility of such measurements in setups one does not control. The simultaneous download doesn't use progress bars, so I have disabled them with sequential download as well to have more relevant comparisons. The experiments also exposed a limitation that prevented downloading of a large number of files (on my system over 500, but, it could be much less on other systems). The existing implementation creates and opens all files upfront, but, this can run into the limit on the number of open files imposed by the C library: the current implementation uses C library standard I/O streams with the default curl callback. # Reducing the concurrency and resource usage The code has hence been refactored to create and open the output files on the fly. At most a pre-defined number of files (and hence curl easy-handles) is used at a time, currently 15, but it is adaptively less when opening files fails. As a byproduct, this also reduces the number of concurrent connections to remote servers held at a time (even though curl can do this part for the user). In practice, when installing packages, a number of connection could be to the same server, and then it makes even more sense to limit the number of connections in order not to overload it. As now not all transfers happen concurrently, but only up to a defined limit, the connection re-use done by curl becomes more visible and saves some resources. Established connections to the same server are re-used for additional transfers. With HTTP 2 connections, multiple transfers from the same server can be multi-plexed over the same connection, further reducing the number of concurrent connections to the same server to one. An original version of HTTP 1.1 protocol asked for at most 2 concurrent connections to the same server. This restriction/suggestion has been removed in a later version (RFC7230 vs RFC2616). Most browsers seem to use 6, some a bit more. In the current implementation in R, at most 6 concurrent connections are used to the same server (via a curl option). This means that at most 6 files from the same server are being downloaded at a time, compared to at most 15 using HTTP 2. # Timeout In R, downloading files is subject to a timeout (by default 60s), documented in `?download.file`, `?options`, `?connections`. The timeout serves as a limit for blocking, low-level network operations and ensures that R would not get blocked due to network problems. But, higher-level R operations typically do more than a single low-level network operation, so may take longer than the value of the timeout. In practice, most users probably experience internet timeout only when the remote server is down, and at implementation level this can result in the connection timing out. Users downloading large files may have to increase the timeout, because the transfer itself is protected by it as well. In some sense, absolute time timeout also gives some level of guarantee that operations would finish and do so in reasonable time. Some imprecision in the definition of R timeout is probably necessary to be able to provide implementations with different download methods and libraries, which may have their views on how reasonable timeout should work. With libcurl, the R timeout has been mapped to curl connection timeout and overall transfer timeout of each file being downloaded (which also includes the connection time). With simultaneous download, this becomes more complicated, and yet more with the changes described above. The curl overall timeout includes also the time a transfer has been paused by curl e.g. in case there are already too many concurrent connections to a given server or another connection is being established which might (or might not) allow multiplexing. The first of these pauses could in theory be very long, and caused by transfers of different files, but would cause a timeout. Therefore, a custom implementation of the timeout has been provided in R-devel, instead, on top of curl callbacks. As the callbacks only allow to cover the transfer part of the timeout, the curl connection timeout is still used in addition, and this gives a bit more time for each of the transfer. In principle, if the local bandwidth was the bottleneck rather than network problems, it is possible that the R timeout with simultaneous download would not be sufficient for a download when in sequential download it was. Similar situations could be caused by other applications on the same machine or network, or temporary issues on the way to the server, which were situations users already had to cope with (e.g. by increasing the timeout). It has to be seen whether such issue would appear in practice and whether the current implementation should be adjusted. An alternative to absolute-time timeouts for file transfer as in R is aborting the transfer when it seems it has stopped (no data transferred within an interval, available in wget) or when it became too slow (average speed below a specific limit during an interval, available in libcurl). An obvious advantage is that these are independent on the file size. On the other hand, these limits might keep slow downloads too long for other reasons (e.g. by accident downloading a too large file), and anecdotically some proxies batch transfers so much to cause an undesirable abort. At the implementation level, currently in R-devel, R falls back to the speed limit with older versions of curl to cover a phase between a request is ready to be sent to the server and the first data is transferred: that phase doesn't seem to be otherwise easily covered by a custom timeout implementation. # Speedup I've measured the speedup in several scenarios. First, on fast academic networks (client is a server machine on a well-connected academic network), downloading 500 CRAN packages from the CRAN cloud mirror (CDN). Second, on laptops connected via a decent home internet connection on different operating systems (on the client), downloading 100 CRAN packages from the CRAN cloud mirror. Then, to test the impact of connection speed, I tried also downloading from fixed CRAN mirrors at selected locations (New Zealand, UK, Europe) to a macOS laptop in Europe. I'm not reporting the numbers, because they are not repeatable: the download times are too dynamic and the involved systems out of my control. Measuring the same thing the next day can give a very different speedup. The usual speedups were 2-5x (that is, twice to five times shorter download time), but sometimes it was much more (even around 30x speedup) and sometimes it was about the same speed. One could measure on a lab setup (own CRAN mirror, emulated network of different characteristics), but that is probably not worth the effort, and it would be hard to know what are realistic setups. It still seems from the experiments that big speedups can be achieved even on slower connections (I expect because of longer latency). Also, the speedup can be bigger when HTTP 2 is enabled on the server. On Windows, R doesn't yet support HTTP 2 because Rtools 4.4 use a version of curl without HTTP 2 support, but the next version of Rtools and R is expected to support it. I've also tested that the speedup is bigger compared to just a sequential download with re-using connections. The actual speedup seen by users would be different, as they wouldn't be just downloading the packages, but also installing them. I find the speedup noticeable when installing a large number of packages in parallel on many cores, even source packages. It is also visible when installing binary packages on Windows, not necessarily in parallel, where the installation is very fast. It certainly wouldn't be visible when installing just several packages from source which use C++, where the compile time would dominate. # Possible further improvements In principle, it would be possible to push things further by using more concurrent transfers, but that in turn would put bigger stress on the servers - the actual performance is a balance between the download time on the client and peak load on the server. Should the number of parallel transfers be still increased, say with HTTP 2, one could solve the problem of limited number of I/O streams (or file handles) by partially buffering the downloads in memory and closing/re-opening files during the download as needed.