--- title: "Issues While Switching R to UTF-8 and UCRT on Windows" author: "Tomas Kalibera" date: 2022-11-07 categories: ["Internals", "Windows"] tags: ["UTF-8", "UCRT", "encodings"] ---
From version 4.2.0, released in April 2022, R on Windows uses UTF-8 as the native encoding via UCRT as the new C Windows runtime. The transition for R and its packages has been a non-trivial effort which took several years. This post gives a summary of some technical obstacles found on the way, focusing on aspects that may be interesting to other projects.
R is implemented in C and Fortran (and R). It requires a Fortran 90 compiler. R code tries to be as platform-independent as possible, using the standard C library functions instead of OS-specific APIs. A lot of the code has been primarily developed for POSIX systems.
The same applies to extension packages. Currently, there are nearly 19,000 packages on CRAN, out of which nearly 4,500 include C, C++ or Fortran code.
R on Windows uses static linking for external libraries. They are linked statically to R itself and to R packages, specifically to the dynamic library of R and the dynamic libraries of individual packages.
R packages are primarily distributed in source form. For Windows, CRAN provides binary builds of R packages and a distribution of the compiler toolchain and pre-compiled static libraries to build R and R packages. Before the transition to UCRT/UTF-8, R used a GCC/MinGW-w64 toolchain targeting MSVCRT as the C runtime.
CRAN checks the published packages and requires package maintainers to fix problems and adapt packages to changes in R. Development versions of R are tested using CRAN package checks to foresee any problems. Hence, R with UTF-8 as the native encoding and UCRT as the C runtime was only released once CRAN (and Bioconductor) packages were ready. Helping package authors with the necessary changes to their packages has been a significant part of the work.
Only object files compiled for the same C runtime can be linked together on Windows. This means that a transition from MSVCRT to UCRT requires that all static libraries are re-compiled from scratch using a UCRT toolchain.
Building the new toolchain and static libraries and re-building R packages required the most effort in the transition, but that might be different for other projects and may be best described in a separate post.
The key thing is to what level a project allows the complete software stack to be re-built automatically from sources, from scratch, using a new compiler toolchain, without re-using/downloading pre-compiled code from various sources. This wasn't the case for R.
The decision on the toolchain and software distribution was made 2 years ago, and it was to stay with GCC/MinGW-w64, using GCC 10 at the time. At the R 4.2.0 release time, it was GCC 10.3 and MinGW-w64 9. LLVM/Clang wasn't an option because of the need for a Fortran 90 compiler.
The MXE cross-compilation environment was chosen, as it made it easy to ensure that the toolchain and all libraries were rebuilt for UCRT from source, while it supported building static libraries. A number of different options would be available today, particularly for projects not requiring Fortran 90 or static libraries.
UCRT is different from MSVCRT, and this required some modifications of source code before it could be re-compiled. Two common problems mentioned below were definitely linked to the transition to UCRT. Compilation problems likely related to an update of MinGW, the GCC version or other involved software are excluded.
A surprising obstacle was that one could not print a 64-bit integer using e.g. `printf` in C without getting a warning from GCC: both the `%lld` (C99, supported by UCRT) and `%I64d` (Microsoft) formats resulted in a warning.
This caused trouble for building external libraries, because sometimes warnings were automatically turned into errors, and tweaks of compiler options were necessary (`-Wno-format`, or not turning warnings into errors).
CRAN requires format warnings to be addressed in packages, so disabling them wasn’t an option there at all.
This is a GCC bug, which has been reported, and I have offered a patch, which is used in the new toolchain for R and also in Msys2. It hasn't been adopted by GCC to this day, but the main part of the problem has been solved differently in GCC 11.
The remaining part, addressed by the patch, is that a format specifier that is wrong in both the C99 and the Microsoft formats emits two warnings instead of one. See GCC PR#95130 for more details.
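A common, standards-based way around hard-coding either format is to use the `inttypes.h` macros; whether this silences the warning depends on what the macro expands to on a given MinGW/UCRT toolchain, so take this only as a sketch:

```c
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
    int64_t x = 9007199254740993LL; /* does not fit into 32 bits */

    /* PRId64 expands to the conversion the C runtime expects, so the
       choice between %lld and %I64d is not hard-coded in the source. */
    printf("x = %" PRId64 "\n", x);
    return 0;
}
```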
Some software explicitly sets the `__MSVCRT_VERSION__` C preprocessor macro, and the values used accidentally imply the use of MSVCRT, which breaks the build, usually at link time. Removing the setting typically resolved the problem. This macro probably should not be set manually at all outside of the C runtime.
There were only a few encoding-unrelated issues detected at runtime by the transition to UCRT.
UCRT is stricter in checking arguments to the runtime functions. Problems newly appeared with setting locale categories not available on Windows, a double close of a file descriptor, and an invalid descriptor passed to `dup2`.
By MinGW default, the invalid parameter handlers did nothing, but e.g. when linked into applications built by MSVC, these problems would cause program termination. With MSVCRT builds, these problems were hidden/benign.
To detect these problems, one can set a custom handler via `_set_invalid_parameter_handler` and run tests. Debugging these things is usually easy once the handler is set, as long as the test coverage allows.
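A minimal sketch of installing such a handler with UCRT (the handler signature is fixed by the runtime; the reporting and the `abort()` policy are just illustrative choices for testing):

```c
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

/* Called by UCRT when a runtime function receives an invalid argument,
   e.g. an invalid descriptor passed to dup2 or a double close. */
static void invalid_param_handler(const wchar_t *expression,
                                  const wchar_t *function,
                                  const wchar_t *file,
                                  unsigned int line, uintptr_t reserved)
{
    /* In release builds the arguments are typically NULL/0, so just
       report that the handler fired and fail loudly. */
    (void)expression; (void)function; (void)file; (void)line; (void)reserved;
    fprintf(stderr, "invalid parameter passed to a C runtime function\n");
    abort();
}

int main(void)
{
    _set_invalid_parameter_handler(invalid_param_handler);
    /* ... run the tests here ... */
    return 0;
}
```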
Broken directory symbolic links (junctions) now appear as non-existent via `_wstati64`, but before they were reported as existing. This new behavior seems consistent with the documentation and matches what happens on Unix.
We haven't run into this, but switching an application to UCRT while some DLLs linked to it remain built for MSVCRT could expose interoperability problems. One example would be accidental dynamic allocation by `malloc` in one runtime and release by `free` in the other. However, mixing runtimes across DLLs is not good for the encoding support anyway (more below).
The rest of the text covers encoding-related issues found during the transition to UCRT/UTF-8.
MSVCRT does not allow UTF-8 to be the encoding of the C runtime (as reported by the `setlocale()` function and used by standard C functions). Applications linked to MSVCRT, in order to support Unicode, hence have to use either the Windows-specific UTF-16LE API for anything that involves strings, or some third-party library, such as ICU.
UCRT supports UTF-8 as the encoding of the C runtime, so one can use the standard C library functions, which is much better for writing portable code, and it is the way Microsoft now seems to recommend as well.
UCRT is the new Microsoft C runtime and it is expected that applications will eventually have to switch to it, anyway.
While preferring the standard C API, R itself also uses Windows-specific functions, both the `*A` and `*W` forms where necessary. The `*A` calls use the encoding defined by the active code page (sometimes referred to as the system encoding), which may be different from the C library encoding but typically is the same. Normally the active code page is specified system-wide and changing it requires a reboot.
The code of R and packages is not designed to always carefully differentiate between the two encodings, and it would become substantially more complex if this were to be done just in base R, not to mention R packages and external libraries. Also, the goal is to have Unicode strings supported always, so we would want the active code page to also be UTF-8.
The active code page can now be set to UTF-8 for the whole process via the fusion manifest, so it is decided at build time, without requiring system-wide changes or a reboot.
R hence specifies that in the manifest and then sets the C encoding to whatever the active code page is, so the two encodings are always the same. The active code page can be made UTF-8 via the manifest only on recent Windows (on the desktop, Windows 10 November 2019 or newer). On older systems, this part of the manifest is ignored, the active code page becomes whatever is used system-wide, and the C encoding then follows it.
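To illustrate what this amounts to at process startup, here is a hedged sketch (not R's actual code) that queries the active code page and sets the C runtime encoding to match:

```c
#include <stdio.h>
#include <locale.h>
#include <windows.h>

int main(void)
{
    /* On recent Windows the active code page comes from the embedded
       manifest; on older systems it is the system-wide "ANSI" code page. */
    UINT acp = GetACP();

    /* setlocale(LC_CTYPE, "") selects the user-default locale, whose
       encoding follows the active code page, so the C runtime encoding
       and the *A Windows API then agree. */
    const char *loc = setlocale(LC_CTYPE, "");

    printf("active code page: %u (%s)\n", acp,
           acp == CP_UTF8 ? "UTF-8" : "not UTF-8");
    printf("C runtime locale: %s\n", loc ? loc : "(failed)");
    return 0;
}
```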
Another consequence concerns "embedding". When R is used as a dynamic library linked into a different application, it uses the active code page (and hence the C encoding) of that application. If such an application is designed in a way that does not allow setting UTF-8 as the active code page, it needs to be split: one may create a new small embedding application using UTF-8, which would then communicate with the original embedding application.
While in theory an application can link to dynamic libraries using different C runtimes, MSVCRT cannot use UTF-8 as the native encoding. So, string operations would not work with mixed runtimes. Given that R uses UTF-8 as the active code page, an MSVCRT-based DLL would not work properly even when performing string operations in isolation.
Even before, different applications on Windows on the same system could use different encodings (of the C runtime), but typically they did not, and it was often silently assumed that all data was in the default system encoding.
We have run into this problem with the `aspell` tool, which luckily allows specifying UTF-8, and with a small test application shipped with and used in an R package.
Clearly, with the advent of applications on Windows using different “ANSI” encodings (at least UTF-8 or the default one from the system locale), it is now necessary to be encoding-aware even in “ANSI” code, including, say, processing of command-line arguments.
While R by default sets the C library encoding to the active code page via `setlocale(LC_CTYPE, "")`, this can be overridden by the user, or R may run on old Windows not allowing UTF-8 as the active code page, or be embedded in an application with a different active code page. It is therefore necessary to be able to detect the C library encoding. R does it by parsing the result of the call `setlocale(LC_CTYPE, NULL)`.
The encoding is usually given in a suffix `.<codepage_num>`, e.g. `Czech_Czechia.1250` stands for CP1250 (similar to Latin 2). For UTF-8, the code page on Windows is 65001, but the suffix is given as `.utf8`, so it has to be treated specially. According to the Microsoft documentation, all of `.UTF8`, `.UTF-8`, `.utf8`, `utf-8` are allowed on input, so R now detects any of these. Sadly, the output of `setlocale(LC_CTYPE, NULL)` is not explicitly specified.
The locale names do not always include the code page, for example when they are in the form `cs-CZ` or `cs_CZ`. In that case, according to the documentation, one can find it as the default locale ANSI code page (`LOCALE_IDEFAULTANSICODEPAGE` from `GetLocaleInfoEx`), which is now supported by R. This was added to R recently and didn't work before the transition to UTF-8; I didn't find an easy way now to locate the MSVCRT documentation to check whether such names were supported by that runtime.
In either case, it is worth checking the documentation for “UCRT Locale names, Languages, and Country/Region strings” when switching to UCRT and comparing it with the assumptions made by the application.
Text in the Windows clipboard can either be in UTF-16LE, in which case no special handling should be needed, or in a “text” encoding. The latter causes trouble, as described below.
In R, this was fixed by always using “Unicode text” in UTF-16LE, as that seemed the simplest solution. It is ironic that switching to the UTF-16LE interface of a component is needed for a transition to UTF-8.
Even though one may specify the locale for the “text” content in the clipboard, and that locale defines the encoding the “text” is in, there are two problems. First, no locale as far as I could find has UTF-8 as the encoding, so one cannot really use arbitrary Unicode text, which we would want in order to support UTF-8 (also, using such a locale would normally also affect other applications). So, while Windows should allow UTF-8 to be used wherever an “ANSI” encoding has been used before, it doesn't really do that for the clipboard.
Further, some applications do not fill in the locale for the “text”, and then Windows automatically uses the current input language, i.e. the “keyboard” selected when the user pastes the data to the clipboard. This was also the case for R.
With programmatic access to the clipboard, this default behavior doesn't make sense, because the string would normally have been encoded at a different time from when the write operation is invoked. The problem that the locale is set implicitly when the write operation takes place, however, existed even before the switch to UTF-8: the user may switch the input language between creating and sending the string.
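A sketch of writing to the clipboard as UTF-16LE via `CF_UNICODETEXT`, which avoids declaring any locale/encoding for a “text” format at all (error handling shortened; not R's actual code):

```c
#include <windows.h>
#include <string.h>
#include <wchar.h>

/* Place a UTF-16LE string on the clipboard as CF_UNICODETEXT. */
static BOOL copy_to_clipboard(const wchar_t *s)
{
    SIZE_T bytes = (wcslen(s) + 1) * sizeof(wchar_t);
    HGLOBAL h = GlobalAlloc(GMEM_MOVEABLE, bytes);
    if (!h) return FALSE;
    memcpy(GlobalLock(h), s, bytes);
    GlobalUnlock(h);

    if (!OpenClipboard(NULL)) { GlobalFree(h); return FALSE; }
    EmptyClipboard();
    BOOL ok = SetClipboardData(CF_UNICODETEXT, h) != NULL;
    CloseClipboard();
    if (!ok) GlobalFree(h); /* on success the system owns the memory */
    return ok;
}
```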
Some `*A` functions of the Windows API do not get the encoding to use from the active code page, but from the font charset in the device context. This includes the function `TextOutA`, used to write text to a dialog box. When a font is being created by `CreateFontIndirect`, one can specify a charset, where `DEFAULT_CHARSET` is a value set based on the current system locale, so e.g. for English it is `ANSI_CHARSET`, a non-UTF-8 encoding.
It turns out it is possible to get a UTF-8 charset, but one has to do it explicitly via `TranslateCharsetInfo`, passing `65001` as the source code page. This is another instance of the problem where the encoding is specified via a locale, but the locale doesn't carry the information that we are using UTF-8.
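A sketch of creating a font with the UTF-8 charset obtained this way (the face name and the fallback are illustrative choices):

```c
#include <windows.h>
#include <string.h>

/* Create a font whose charset corresponds to the UTF-8 code page (65001),
   instead of DEFAULT_CHARSET derived from the current system locale. */
static HFONT create_utf8_font(void)
{
    CHARSETINFO csi;
    LOGFONTA lf;

    memset(&lf, 0, sizeof(lf));
    lf.lfHeight = -16;
    strcpy_s(lf.lfFaceName, LF_FACESIZE, "Consolas");

    /* Translate the source code page 65001 into a charset value. */
    if (TranslateCharsetInfo((DWORD *)(UINT_PTR)65001, &csi, TCI_SRCCODEPAGE))
        lf.lfCharSet = (BYTE)csi.ciCharset;
    else
        lf.lfCharSet = DEFAULT_CHARSET; /* fall back to the locale default */

    return CreateFontIndirectA(&lf);
}
```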
Rgui, a graphical front-end for R, offers a script editor. It is an editor window where one can edit some R code, save it to a file, read it from a file, and execute it in the R interpreter. The editor is implemented using the RichEdit 2.0 control.
There is no encoding information saved in R source code files. Before, the files were assumed to be in the default system encoding, which differed between systems. It made sense to switch to UTF-8, to support all Unicode characters and always have the same encoding, at the price that older script files will have to be converted by users.
The hard part was to make RichEdit work with UTF-8. I wasn’t able to find documentation for this behavior, nor any other sources, so what is written here is based on experimentation, guesses, trial and error.
R uses the `EM_LINEFROMCHAR` message to get the index of the current line and then the `EM_GETLINE` message to get the text from the line of the script to execute it. R used the `RichEdit20A` control (so the “ANSI” version), but, when UTF-8 is the active code page, the returned text is still in the default system (so current locale) encoding, not in UTF-8.
R is not compiled with the `_UNICODE` flag and cannot be, and it wouldn't be desirable now anyway, as we want to use UTF-8 via the `*A` calls instead of UTF-16LE.
Still, it turns out that with the `RichEdit20W` control (so the “Unicode” version), the returned text is actually UTF-8 (not UTF-16LE) when the active code page is UTF-8, so it is what we want. R hence explicitly uses `RichEdit20W` as the class name.
Still, the `RichEdit20W` control appears to not accept UTF-8 in the `EM_FINDTEXTEX` message (for the “Search” operation), so the “ANSI” strings in the documentation do not really cover UTF-8 in this case. Switching to UTF-16LE and `EM_FINDTEXTEXW` worked.
The `EM_EXSETSEL` and `EM_EXGETSEL` messages seem to work correctly with character indexes; probably the control internally uses Unicode in either case, so messages passing character indexes work.
However, the `EM_GETSELTEXT` message produces a UTF-16LE string, not UTF-8 (while `EM_GETLINE` produces UTF-8, not UTF-16LE). I didn't find an explanation for that.
These messages are used in R's implementation of Search/Replace in the Rgui script editor. A hint that a problem may be due to UTF-8 being expected but UTF-16LE being received is that things work only for a single (ASCII) character, where part of the UTF-16LE representation looks like a string terminator in UTF-8.
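A sketch of one way to handle this (not necessarily R's exact code): take the selection as UTF-16LE via `EM_GETSELTEXT` and convert it to UTF-8 with `WideCharToMultiByte`; buffer sizing is simplified here, a real version would size it via `EM_EXGETSEL`:

```c
#include <windows.h>
#include <richedit.h>
#include <stdlib.h>

/* Return the current selection of a RichEdit20W control as a freshly
   allocated UTF-8 string (caller frees), or NULL on failure. */
static char *get_selection_utf8(HWND hwndRE)
{
    wchar_t wbuf[1024]; /* simplified fixed-size buffer */

    /* From the "W" control, EM_GETSELTEXT arrives as UTF-16LE. */
    SendMessageW(hwndRE, EM_GETSELTEXT, 0, (LPARAM)wbuf);

    int bytes = WideCharToMultiByte(CP_UTF8, 0, wbuf, -1, NULL, 0, NULL, NULL);
    if (bytes <= 0) return NULL;
    char *out = malloc(bytes);
    if (out)
        WideCharToMultiByte(CP_UTF8, 0, wbuf, -1, out, bytes, NULL, NULL);
    return out;
}
```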
It may be that switching applications which already used newer versions of the control would have been easier, but I don't have experience with that to comment on. Investing in updating to the newer control in Rgui may not be worth the effort.
An important feature of the switch to UTF-8 should also be that the user can print and enter any Unicode characters in the console, not only that those can be processed internally.
The Windows console, at least some implementations, needs to be told to switch to UTF-8. One may do this by running `chcp 65001` e.g. in `cmd.exe`, but it is also possible to do it programmatically from the application via the `SetConsoleOutputCP` call. Rterm, the console front-end for R on Windows, now uses `SetConsoleOutputCP` and `SetConsoleCP` to set the output and input code pages to UTF-8 (65001) whenever using UTF-8.
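A minimal sketch of what a console front-end can do at startup (Rterm does the equivalent when running in UTF-8; the sample string assumes the source file and execution character set are UTF-8):

```c
#include <stdio.h>
#include <windows.h>

int main(void)
{
    /* Interpret console output and input as UTF-8, the programmatic
       equivalent of running "chcp 65001" in the console. */
    SetConsoleOutputCP(CP_UTF8);
    SetConsoleCP(CP_UTF8);

    printf("příliš žluťoučký kůň\n"); /* multi-byte UTF-8 output */
    return 0;
}
```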
The fonts in the console need to have glyphs for the characters to be used, and this remains the responsibility of the user if the defaults are not sufficient. One may have to switch to the NSimSun font in `cmd.exe` to display some Asian characters.
Rterm uses the Windows console API, and specifically the `ReadConsoleInputW` function, to read input from the console. Each event received includes information on the key code, the scan code, whether the key is pressed or released, and a Unicode character.
How specific strings entered into the console are received depends on the console application/terminal: `cmd.exe`, PowerShell, mintty/bash, the Windows Terminal app. It is not unusual for Windows Terminal, mintty/bash and `cmd.exe` in particular to differ. I am not aware of any documentation/specification of this behavior.
One source of differences, so far not related to UTF-8 support but good to illustrate the challenge, is whether `Alt+xxx` sequences are interpreted already by the console, or whether the application (Rterm) receives the raw key presses. For example, `Alt+65` produces the `A` key. Mintty interprets the sequence and sends the character only. Windows Terminal sends the Alt, the interpreted character, and the release of Alt. `cmd.exe` sends all the key events but also interprets them and sends the character as well. When numlock is off, Windows Terminal instead sends the uninterpreted keys but not the resulting character. One needs to extrapolate from this to produce an algorithm which reads `A` once in all cases, so which knows how to interpret the sequence but also doesn't accidentally get the character twice. The frustrating part is when users run into a corner-case difference not spotted while testing.
It may seem that the use of `Alt+` sequences is rather niche, but it is used even when pasting characters not present on the keyboard with the current input method, e.g. the tilde on the Italian keyboard. It is sent as `Alt+126` (and the tilde is used in the R language).
Now an example of a problem specific to UTF-8 support. Supplementary Unicode characters, so those which are represented using surrogate pairs in UTF-16LE, are received differently. For example, the “Bear Face” character (`U+1F43B`).
When one presses a key and then releases it, the application typically receives two events, one for the pressing (with zero as the character) and one for the releasing (with a non-zero character code). This is also what happens with the “Bear Face” emoji in `cmd.exe` and mintty, but not with Windows Terminal for this supplementary character. There, the character code is received with both the key pressing and the key releasing event.
It also turns out that Unicode sequences (such as `<U+63><U+30C>` for “c” with caron) work with terminals in surprising ways. It hasn't been resolved in R yet, and it is not clear to me whether the console support in Windows is ready for that.
The switch to UTF-8 uncovered problems which existed before in Rterm/getline with support for multi-width and multi-byte characters, and also with support for input using `Alt+` sequences. R 4.1 already received a rewrite of this code, which was already aiming at UTF-8 support. More details are in Improved Multi-byte Support in RTerm.
What seems to have been useful in the transition to UTF-8: fixing support for various `Alt+` input sequences (with and without numlock, on the numpad and the main keyboard), a diagnostic mode which prints the keyboard events received (`Alt+I` in Rterm), and testing with different terminals (`cmd.exe`, PowerShell, Windows Terminal, mintty, Linux terminal and ssh). More work will be needed to make surrogate pairs work reliably, and then possibly the Unicode sequences.
It may be that switching applications which already used conPTY to UTF-8 would have been easier, but I don't have experience with that to comment on. It may be that updating Rterm to use conPTY and the ANSI escape sequence API on input will be useful in the future.
It turns out that the UCRT functions for case changing, `towlower` and `towupper`, do not work with some non-English characters, such as the German `U+F6` and `U+D6` (ö, Ö), which are multi-byte in UTF-8. This worked with MSVCRT.
R has its own replacement functions for case changing, which now had to be selected on Windows as well. Otherwise, one could probably use ICU.
Several additional issues were found in the GraphApp library. A customized version is part of R and is used for the graphical user interface on Windows. It heavily uses the Windows API and the UTF-16LE interface, so it was a bit surprising that it was impacted.
But there is a special mode of operation used when running in a multi-byte locale, which had been missing some features and apparently has not been much used in the past. This changed with the switch to UTF-8, when users previously running in a single-byte locale ended up using the other code path. As these issues are very R-specific, they may be best covered in more detail in another post.
It is known that `std::regex`, a C++ interface to regular expressions, is not reliable with multi-byte encodings, and this has been the case on other platforms as well. With the switch to UTF-8, some R packages using C++ have run into this problem also on Windows.
The experience with R seems to suggest that transitioning a large project to UCRT/UTF-8 on Windows is possible. The changes that had to be made to the code were not large. Some time has been needed to debug the issues, and hopefully this list will help others to save some of theirs.
It was, surprisingly, harder to make Windows-specific code work than plain C code using the standard C library (provided the C code is aware that the current encoding may be multi-byte).
It is good to know that there are “two current encodings”, the C runtime encoding and the active code page, and one needs to decide how to deal with them. R requires that both are the same (and UTF-8), at the price that old Windows systems won't be supported.
Some Windows functionality works with encoding specified indirectly via the current locale, which cannot be UTF-8. This requires special handling and work-arounds. We have run into such issues with fonts, clipboard and RichEdit.
The console support for Unicode via UTF-8 may require some effort; code using the legacy Windows API may have to be rewritten.
The obvious part: this may wake up issues not seen before. Characters previously single-byte on systems running Latin languages will sometimes be multi-byte.
And, all code should be recompiled for UCRT.