--- title: "Upcoming Changes in R 4.2.1 on Windows" author: "Tomas Kalibera" date: 2022-06-16 categories: ["User-visible Behavior", "Windows"] tags: ["Rgui", "UTF-8", "UCRT", "encodings"] ---
R 4.2.1 is scheduled to be released next week with a number of Windows-specific fixes. All Windows R users currently using R 4.2.0 should upgrade to R 4.2.1. This text has more details on some of the fixes.
R 4.2.0 on Windows came with a significant improvement. It uses UTF-8 as the native encoding and for that it switched to the Universal C Runtime (UCRT). This in turn required creating a new R toolchain for Windows and re-building R, R packages and all (statically linked) dependencies with it (Rtools42, more details on the transition).
Using UTF-8 as the native encoding significantly reduces the number of encoding conversion issues when working with characters not representable in the encoding used normally by Windows, so e.g. problems with Asian characters on systems running in Europe, Americas or anywhere else where latin scripts are used.
R 4.2.0 has been regularly tested with CRAN and Bioconductor packages before the release, but several issues not covered by automated R/package testing and missed by the limited manual testing have been found by users after the release. Thanks to users who reported issues via R bugzilla, R-devel mailing list, R-help mailing list as well as private messages, soon after the R 4.2.0 release, these issues were fixed for R 4.2.1. Moreover, the good news is that no major issues with the rather significant transition to UTF-8/UCRT have been found to this date.
It would be nice to get more help from the R community volunteers with testing R before releases, as detailed in a blog post from April 2021. As far I can tell from when we are receiving bug reports, this is still not happening much. Such testing doesn’t have to be only “manual”, a lot of interactive testing in principle can be automated as well, but in either case that requires effort and time that would have to be contributed.
Clipboard connection support in R on Windows (see ?connection
and search
for “clipboard”) was rewritten in R 4.2.0 to use Unicode (UTF16-LE) Windows
API interface to fix encoding issues
(PR#18267).
Unfortunately, there was an error in computing offsets in the connection stream
which resulted in an bug observed during consecutive writes
(PR#18332), fixed in
R 4.2.1. This only impacted programmatic access to the clipboard via the R
connections API.
It was a rather embarrassing omission of a pair of parentheses and apparently I was only testing the original bug fix using a single write operation, not multiple. While fixing the bug with consecutive writes, I also found and fixed a spurious warning about an ignored encoding argument, which is a by-product of internal conversions to/from UTF16-LE inside the connections code.
Clipboard connection testing is for good reasons not allowed in automated CRAN package checks (as clipboard is a user/system-wide device, regarded the same as user’s home files pace, see CRAN Repository Policy), so the issue hence could not have been found that way.
Another issue found after the release was with the R Sys.getlocale
function attempting to query an unsupported locale category on Windows. The
function is documented to accept also LC_MESSAGES
, LC_PAPER
and
LC_MEASUREMENT
categories on Windows, even though they are not supported
there; Sys.getlocale
returns an empty string.
The implementation used to call the C runtime function setlocale
to obtain
the locale information even for LC_MESSAGES
, and that worked in the past.
But, it does no longer with UCRT when invalid parameter handlers are enabled
(see Parameter
Validation
in MSDN).
By default, MinGW-W64 and hence applications built using Rtools42 disable the invalid parameter handlers, so we have never ran into that during automated CRAN and Bioconductor package checking, nor during manual testing using the “normal” builds. But, if R is embedded in an application built using Microsoft compilers, the invalid parameter handlers may be enabled by default and may terminate/crash R.
This has only been found after R 4.2.0 release inside RStudio which had the
handlers enabled. It was reported that rJava
crashed during
initialization, because it was using Sys.getlocale
to query the
LC_MESSAGES
locale category.
The getlocale
implementation has been fixed in R 4.2.1 not to query the
unsupported locale categories. In addition, R-devel has been extended to
optionally enable these handlers for checking (via
_R_WIN_CHECK_INVALID_PARAMETERS_
), and CRAN package checks were ran using
this setting. Luckily, only few packages have been affected. One package
trigered invalid parameter handler by accidentally closing a handle twice,
so attempting to close an invalid handle.
As usual, checking all CRAN packages is not only a service to the package maintainers, but also serves as a check for R itself.
Perhaps surprisingly, a number of users have found issues in Rgui after the R 4.2.0 release. This shows that Rgui is still actively used, and not only directly, but also as an interactive R console window connected to and controlled by other applications (Dasher, Tinn-R).
Problems with transition to UTF-8 were somewhat surprising to me as Rgui has been designed as a Unicode application and, using the GraphApp library, written to support Unicode characters not representable in the native/ANSI Windows encoding. Rgui has limitations in supporting non-BMP characters, but that was not the issue here. GraphApp, at least the version included and customized in the R distribution, has two very distinct modes of operation: “Unicode” and “non-Unicode” windows. Both modes support working with characters not representable in the native/ANSI Windows encoding.
However, by default, “non-Unicode” windows are used in a single-byte locale (the native/ANSI) and in some contexts are also used by accident even on multi-byte locale (due to initialization/bootstrapping issues). Hence, Windows systems of R users of languages using single-byte encoding have always been using “non-Unicode” GraphApp windows, and it wasn’t discovered/reported that the “Unicode” windows lacked some features and had some bugs. As R 4.2.0 switched to UTF-8, a multi-byte locale, Rgui started using “Unicode” GraphApp windows and these issues popped up. The reports were from users from Europe and South America.
One of the consequences was that the accent keys (dead keys) almost didn’t
work. Some were not supported at all and some couldn’t be typed without
combining them with the next character. The reported cases (and a number of
additional I found while debugging) have been fixed. However, handling of
these characters, at least in the form done in GraphApp “Unicode” windows,
is very language-specific and depends on keyboard layouts. It is hence
definitely not impossible that some accents via dead keys still will not
work: in that case, the best course of action is to use copy-paste (or some
other input method common for the specific language) as a work around, and
report a bug. As a last resort, non-ASCII characters in string literals can
be represented using \u
and \U
escapes.
GraphApp “Unicode” windows are internally designed differently and respond
to different Windows API messages. Hence, injection of text via
SendInput
, as used in Dasher, didn’t work. Luckily this has still been
fixable and is fixed in R 4.2.1. Tinn-R used WM_CHAR
messages, instead,
and they stopped working as well. This seems unfixable without bigger
changes to GraphApp, because the “Unicode” windows are simply designed to
handle related messages differently, but Tinn-R luckily can switch to
SendInput
, which is also a better way to do text injection, despite it has
also limitations (more details
here). If
there are other similar applications that used WM_CHAR
or
WM_KEYDOWN
/WM_KEYUP
messages, the best/simplest course of action is to
switch to SendInput
. Switching to embedding may be more flexible and
reliable in the long-term, but require a much higher investment.
Rgui has a “Script Editor”, which is implemented using a RichEdit control
(part of Windows). GraphApp has been using the ANSI (*A
) interface to the
control, so one would expect that it should work with UTF-8 as it worked
before with whatever was the ANSI encoding (even double-byte). However, it
turned out that the RichEdit20A
version of the control does not, it was
not possible to copy and execute a line of R code which contained non-ASCII
characters (characters were received in the ANSI encoding, not respecting
that Rgui opted for UTF-8 in its manifest). However, the RichEdit20W
version of the control accepts UTF-8 properly, even using the ANSI (*A
).
If any expert on these things is reading these lines, I would be happy for a
review of the current code or for an explanation, as this doesn’t seem to be
documented.
Rgui has also experienced a significant performance regression of
txtProgressBar
. The progress bar is based on carriage return characters
and repeated rewriting of the previous state. Rgui has a not very efficient
way of implementing these: it remembers the full history of the line,
interpreting the carriage returns only on redraws. While redrawing a line,
Rgui computes width of each character. So, every update of the progress bar
adds to the work to be done on the next redraw, and even previous lines
shown in the window have to be redrawn, so, if one runs the progress bar
several times, the performance overheads are increasing.
This has only been detected in R 4.2.0 running in UTF-8, because UTF-8 is a multi-byte locale and a different code path to compute the character widths has been used. It turns out that this code contributed long time ago to R had a bug in caching a locale identifier, so it was re-computed on every character, plus an optimization for ASCII characters (relevant for the progress bar) accidentally only took place after the broken caching. Fixing this old performance bug in R fixed this performance regression in Rgui and potentially will improve performance also on other systems where R is built to use the internal width calculation.
Rtools42 have been updated and the official build of R 4.2.1 (at the time of this writing R-4.2.1 release candidate) will be built using version 5253.
Compared to version 5168 used to build R 4.2.0, there is now also the tidy
tool for checking HTML in packages and a number of libraries have been
updated, from which R itself and then all CRAN packages using those would
benefit: 15 out of those are used by R and recommended packages, see a
complete list for
details. All CRAN packages have been tested (and where needed updated) for the
new versions. Note that CRAN packages are required to use libraries from
Rtools when those are available CRAN Repository Policy
has more details).
For a summary of additional updates in R 4.2.1, see the NEWS file of the R-patched branch and look for “Changes in R 4.2.0 patched” (when still before the release) or to “Changes in R 4.2.1” (when after the release).