Windows/UTF-8 Build of R and CRAN Packages

R-devel-win.exe is an experimental installer of R, set up to download experimental binary builds of CRAN packages. It sets UTF-8 as the current default encoding on Windows (Windows 10 November 2019 release or newer). 92% of CRAN packages are supported. Intended as a demonstration of this option to support Unicode characters in R on Windows, not for production use.

To play with this version of R, start cmd.exe, run chcp 65001 (to set UTF-8 code page), go to Properties/Font/Font and select NSimFun (a font with glyphs needed for this example), go to C:\Program Files\R\R-devel\bin (default installation directory of the demo), run R.

Check that both the C library encoding (codepage) and system encoding (system.codepage) are 65001, which is UTF-8.

> l10n_info()
$MBCS
[1] TRUE

$`UTF-8`
[1] TRUE

$`Latin-1`
[1] FALSE

$codepage
[1] 65001

$system.codepage
[1] 65001

Try plotting a histogram with captions in Japanese:

hist(mtcars[,"mpg"], xlab="マイル/ガロン", ylab="車の数")

On the standard version of R, this would work only in Windows running in a locale that supports Japanese. On other Windows systems, one would not be able to even paste the Japanese characters (“Miles/Gallon” and “Number of Cars”) into the RTerm window. With this demo build of R, it should work on any Windows 10 (November 2019 release or newer, last tested with April 2020 release).

We can do the same with external packages. Running

install.packages("ggplot2")

will install ggplot2 from the pre-set demo binary repository at https://www.r-project.org/nosvn/winutf8/demo. With it, we can run

library(ggplot2)
ggplot(mtcars,aes(x=mpg)) + geom_histogram(binwidth=5) + xlab('マイル/ガロン') + ylab('車の数')

More usage examples can be found in my previous post from May. At that time, only base and recommended packages were supported.

Background

Windows 10 (November 2019 release and newer) allows applications to use UTF-8 as their native encoding when interfacing both with the C library (needs to be UCRT) and with the operating system. This new Windows feature, present in Unix systems for many years, finally allows R on Windows to work reliably with all Unicode characters.

Applications that already worked reliably with all Unicode characters on Windows before used proprietary Windows API and wide-character strings, which required implementing and maintaining a lot of Windows-specific code. R did not go that route completely, except for RGui and particularly Windows-specific code interfacing with the file system / operating system (in some cases on Windows this is also needed for other reasons than character encoding). Today, Windows can’t even encode all Unicode characters using one wide character (wide characters are 16-bit, UTF16-LE is used, and hence two wide characters are needed to represent some Unicode characters), so the old Windows way to support Unicode in addition does not seem to have any technical advantage. The new way, via UTF-8, will instead allow to eventually phase out some Windows-specific code from R.

To use UTF-8 as native encoding on Windows, only trivial changes to R are needed, but we need to rebuild R and packages to use UCRT as the C runtime. We need a new toolchain for Windows for that and we need to rebuild all code with the new toolchain. It is no longer possible to re-use binary code built using previous toolchains in form of object files or static libraries. Unfortunately, a common practice when building R packages on Windows is to download pre-built static libraries from external sources, which no longer works when those are built with MSVCRT. In principle, this is bad practice on Windows anyway, such code should be compiled with the same compiler toolchain. Compatibility between different C runtimes can only be assumed between DLLs, and even that has its limits, such as the interpretation of what is the current native encoding.

The Demo

A new experimental compiler toolchain and a number of libraries for R and packages have been built using GCC 9 and MXE (cross-compiled on Linux). It can be downloaded as gcc9_ucrt2.txz, 733M.

This part of the work required updating some MXE build configurations to use newer software, to build with UCRT, to build with newer GCC, etc. Also this required patching some external software to build at all or build with MXE (which is a cross-compiling environment). Several more external libraries would be needed to support all CRAN packages. It took ~5 days to finish this to the level that R with base and recommended packages could be built, about ~5 more days to add more libraries so that those 92% of CRAN packages are supported (note: many CRAN packages don’t have any native code).

R has been patched to allow UTF-8 as native encoding on Windows, which has been trivial (supported on other OSes) and is already part of R-devel. The demo build is then patched to set UTF-8 via fusion manifest (has to be done for an executable at build time), to build only the 64-bit binaries/installer, plus some minor changes. Available as r_gcc9_ucrt2_2.diff.

Over 50 CRAN packages have been patched to build with the new toolchain. Almost all the patches only removed downloading of external binary code at installation time and replaced it by linking against static libraries built together with the experimental toolchain. Note that many packages are downloading code that is otherwise available even in Rtools4. Also, some packages are downloading source code of external software and building it, even though it is included even in Rtools4.

Patching the packages to build and building binary versions of all packages took about 5 days. The remaining packages need some more libraries in the toolchain. The patches are available here (and clearly would have to be revisited and cleaned up if considered for a production toolchain).

In addition, the packages were tested via R CMD check. The outputs are available for CRAN packages and for several BIOC packages required by those CRAN packages. Out of CRAN and required BIOC packages (15793 in the snapshot used), about 92% were built and passed their tests with OK/NOTE (14656 CRAN and 59 BIOC packages). All created binary packages are available for use, even those that failed their tests.

Finally, some time was spent on writing various texts about this, see references below, so about ~20 days on this demo spent in total.

Testing

In case anyone finds an encoding-related problem in this build, something that should be working with UTF-8 as native encoding but does not, a report would be highly welcome. It is not impossible that some parts of R have buried assumptions that the native encoding on Windows is never multi-byte (that it is either single- or double-byte), and those cases would have to be fixed. It is unlikely that such issues would have been found by running R CMD check, as the tests/examples don’t include unusual characters.

Conclusions

Based also on this experience, I believe that switching to UCRT is already possible and I expect that building a complete toolchain should take a small number of months. It is I think the only realistic way to support Unicode characters (not representable in native encoding) reliably in R on Windows.

Investing more effort into adding and fixing various “shortcuts” (avoiding conversions to native encoding in some cases) only complicates the code, introduces more bugs, and cannot solve the problem completely.

Rewriting all of R and R packages to use the old proprietary Windows way to support Unicode (wide-characters, proprietary Windows API instead of standard C library) is I believe out of question, that would require orders of magnitude more effort and hardly anyone would want to do it. It would require experts on R internals (e.g. rewriting the parser or the connections code) and either would duplicate a lot of code, or cause inefficiencies on Unix systems, neither of which seems acceptable. Some external libraries used would likely have to be replaced or extended.

Building a new UCRT toolchain and libraries does not require any special knowledge of R nor its internals. Patches to external software to build with UCRT (e.g. when contributed back to MXE or similar systems, and currently it seems good patches to this effect would be highly welcome) could be re-used by other projects, completely unrelated to R. Finally, a switch from MSVCRT to UCRT will likely be necessary at some point, anyway.

References

https://developer.r-project.org/WindowsBuilds/winutf8/winutf8.html. Detailed text about building R on Windows and about how this demo was created.
https://www.r-project.org/nosvn/winutf8/gcc9_ucrt2.txz, 733M. Experimental toolchain based on GCC 9/UCRT with libraries for most CRAN packages, only 64-bit. Built using a modified version of MXE available from https://www.r-project.org/nosvn/winutf8/mxe_gcc9_ucrt2.tgz.
https://www.r-project.org/nosvn/winutf8/demo. Repository of binary CRAN packages and required BIOC packages, includes patches and outputs from their checks. Built using the experimental toolchain. Only 64-bit.
https://www.r-project.org/nosvn/winutf8/R-devel-win.exe. R installer, built using the experimental toolchain from R-devel 78739, patched with https://www.r-project.org/nosvn/winutf8/r_gcc9_ucrt2_2.diff. Only 64-bit.