---
title: "Why to avoid \\x in regular expressions"
author: "Tomas Kalibera"
date: 2022-06-27
categories: ["User-visible Behavior", "Package Development"]
tags: ["encodings", "regular expressions"]
---
Using `\x` in string literals is almost always a bad idea, but using it in regular expressions is particularly dangerous. Consider this “don’t do” example in R 4.2.1 or earlier:
text <- "Hello\u00a0R"
gsub("\xa0", "", text)
`a0` is the code point of the Unicode “NO-BREAK SPACE” and the example runs in a UTF-8 locale. The intention is to remove the space; a slightly more complicated variant was discussed on the R-devel mailing list about half a year ago.

The result is `"Hello R"`; the space is not removed. The following example, while slightly contrived, gives a clue why:
text <- "Only ASCII <,>,a and digits: <a0><a1><a2>"
gsub("\xa0", "", text)
The result is ""Only ASCII <,>,a and digits: <a1><a2>"
, so the “<a0>
”
portion of the string is removed. The problem is that R converts the \xa0
in the regular expression to ASCII string "<a0>"
before passing it to
the regular expression engine.
R does this because the string is invalid. First, the parser expands `\xa0` into the byte `a0`, and when `gsub` then needs to convert the string to UTF-16LE for use with TRE, it cannot, because `a0` is invalid UTF-8 (and we are running in a UTF-8 locale); the code point `a0` is instead encoded as `c2 a0` in UTF-8. R therefore escapes the invalid byte to `"<a0>"` and produces a valid UTF-16LE string, but not the one intended. Additional checks are now in place in R-devel, so that R reports an error instead of escaping invalid bytes (more on that below).
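A minimal sketch, run in a UTF-8 locale, illustrating these claims: the single byte `a0` is not valid UTF-8, the UTF-8 encoding of the code point is the two bytes `c2 a0`, and escaping the invalid byte produces the `"<a0>"` form seen above (the `iconv` call with `sub="byte"` is just one way to make that escaping visible):

```r
validUTF8("\xa0")          # FALSE: the single byte a0 is not valid UTF-8
charToRaw("\u00a0")        # c2 a0: the UTF-8 encoding of U+00A0
iconv("\xa0", from = "UTF-8", to = "UTF-8", sub = "byte")  # "<a0>"
```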
The origin of the problem is, however, a user error: the pattern itself is an invalid string. Likely this used to work when R was run in a Latin-1 locale, where the `a0` byte represents the no-break space, and maybe it was only tested there but not in other locales. Latin-1 locales are used very rarely with recent R, so this issue now bites users much more.
To somewhat mitigate this issue, one could pass `\x` to the regular expression engine itself, i.e. double the backslash in the regular expression. `\\x` is an ASCII string and hence will always be valid. But, see below.
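For concreteness, a sketch of this variant applied to the first example; it uses `perl=TRUE` (see the next paragraph for why) and relies on the inputs being UTF-8, so it does not escape the caveats discussed below:

```r
text <- "Hello\u00a0R"
gsub("\\xa0", "", text, perl = TRUE)  # "HelloR": the escape is interpreted
                                      # by PCRE, which runs here in UTF mode
```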
By default (`perl=FALSE`, `fixed=FALSE`), POSIX extended regular expressions as described in `?regex` are used, and those are not documented to support the `\x` escapes. While the currently used implementation, TRE, supports them, one should hence not use this feature (e.g. as a precaution against the implementation in R switching to a different engine). So, for this one should use Perl regular expressions (`perl=TRUE`), which have other advantages, so it is not limiting apart from having to remember it.
Notable advantages of Perl regular expressions include that one usually saves the encoding conversions to UTF-16LE (and back) and that one has access to Unicode properties. So, when re-visiting existing code to fix issues like this one, it may be beneficial to switch to Perl regular expressions anyway (but this requires some care, as the expressions are not exactly the same, see `?regex`).
But, worse yet, using `\\x` escapes carries the risk that their interpretation depends on the mode of the regular expression engine and can hence still be locale-specific. This example works in an ISO-8859-2 locale (the result is `"cesky"`) but not in a UTF-8 locale:
text <- "\u010desky"
text <- iconv(text, from="UTF-8", to="")
gsub("\\xe8", "c", text, perl=TRUE)
This works in a UTF-8 locale, but not in ISO-8859-2:
text <- "\u010desky"
text <- iconv(text, from="UTF-8", to="")
gsub("\\x{010d}", "c", text, perl=TRUE)
The reason is that the first time, the Perl regular expression runs in the locale mode (`e8` is the code of “LATIN SMALL LETTER C WITH CARON” in ISO-8859-2), while the second time it runs in the UTF mode (where `010d`, the Unicode code point number, is the code).
Which mode is used depends on the locale and on the input strings. One can force the UTF mode by ensuring that one of the inputs is in UTF-8 (excluding ASCII). But still, if the text argument is a vector with multiple elements, we cannot pick just any element to convert to UTF-8; we have to pick one that is not ASCII. Or, all of them may be ASCII, in which case we would have to convert the pattern or the replacement. So perhaps convert all inputs explicitly to UTF-8? Technically this would work, but is it worth the hassle?
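A sketch of that “convert everything to UTF-8” approach, applied to the second example from above (`enc2utf8` is used here just to force the `text` argument into UTF-8, which in turn forces the UTF mode regardless of the locale):

```r
text <- "\u010desky"
text <- iconv(text, from = "UTF-8", to = "")          # native encoding
gsub("\\x{010d}", "c", enc2utf8(text), perl = TRUE)   # "cesky" also in ISO-8859-2
```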
There is a much easier way to ensure that the result is locale-independent (and that one of the inputs is UTF-8, specifically, the pattern):
text <- "Hello\u00a0R"
gsub("\u00a0", "", text)
and
text <- "\u010desky"
text <- iconv(text, from="UTF-8", to="")
gsub("\u010d", "c", text)
This works with the default (POSIX extended) regular expressions, with Perl regular expressions and with “fixed” expressions, because the Unicode (UTF-8) character is created by the parser. (`\\x` does not work with fixed expressions and is only documented to work with Perl regular expressions.)
In principle, the first example of removing the no-break space could often be generalized to also cover other kinds of spaces, e.g. via Unicode properties, which are supported by Perl regular expressions.
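A sketch of such a generalization; the `\p{Zs}` property class used here matches Unicode space separators (which include the no-break space), so this is an illustration rather than a complete definition of “space”:

```r
text <- "Hello\u00a0R"
gsub("[\\s\\p{Zs}]", "", text, perl = TRUE)   # "HelloR"
```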
With a recent version of R-devel, invalid strings passed to regular expressions are now detected also in cases where they were not detected before:
> text <- "Hello\u00a0R"
gsub("\xa0", "", text)
Error in gsub("\xa0", "", text) : 'pattern' is invalid
In addition: Warning message:
In gsub("\xa0", "", text) : unable to translate '<a0>' to a wide string
18 CRAN and 5 Bioconductor package checks now fail visibly because of the new checks, allowing package authors to fix the issues. But, all package authors using `\x` (or `\\x`) in their regular expressions should fix those.
It is not recommended to disable the newly added checks via `useBytes`, because that could also lead to the creation of invalid strings, essentially only hiding the problem, apart from potentially breaking the code by changing the mode of operation of the regular expression functions. And even if that is not detected by R now, it might be detected soon.
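As an illustration of why hiding the problem this way is dangerous (a sketch; the exact printed form of the result may differ), with `useBytes=TRUE` the matching is done byte by byte, so the pattern strips only the second byte of the two-byte UTF-8 encoding of the no-break space:

```r
text <- "Hello\u00a0R"
gsub("\xa0", "", text, useBytes = TRUE)   # removes just the a0 byte, leaving
                                          # a stray c2 byte: an invalid string
```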