--- title: "Improvements in handling bytes encoding" author: "Tomas Kalibera" date: 2022-10-10 categories: ["User-visible Behavior"] tags: ["encodings", "bytes"] ---
In R, a string can be declared to be in bytes encoding. According to ?Encoding, it must be a non-ASCII string which should be manipulated as bytes and never converted to a character encoding (e.g. Latin 1, UTF-8). This text summarizes recent improvements in how R handles bytes encoded strings and offers some thoughts about what they should and shouldn’t be used for today.
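For a concrete illustration (a minimal sketch; the byte values are arbitrary), a string can be declared to be bytes encoded via Encoding<-:
x <- "fa\xE7ile"          # contains the non-ASCII byte 0xE7
Encoding(x) <- "bytes"    # declare the string to be in bytes encoding
Encoding(x)               # "bytes"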
Particularly for readers not familiar with R, it may be useful to highlight how strings are supported in the language. A character vector is a vector of strings. Like any vector, it may have zero or more elements, and its elements may be NA. The string type is not visible at the R level, but a single string is represented using a character vector of length one, with that string as the element.
A string literal, such as "hello"
, is a character vector of length one:
> x <- "hello"
> length(x)
[1] 1
> x[1]
[1] "hello"
Similarly, there is no type in R to hold a single character. One may extract a single character using e.g. the substring function, but such a character would be represented as a string (so a character vector of length one with that single-character string as the element):
> substring(x, 1, 1)
[1] "h"
Strings are immutable: they cannot be constructed incrementally, e.g. by filling in individual bytes or characters as in C. Creating a string is a potentially expensive operation: strings are cached/interned and some of their properties are examined and recorded. Currently, it is checked and recorded whether the string is ASCII.
Encoding information is attached to the string, so one character vector may contain strings in different encodings. Supported encodings are currently “UTF-8”, “latin1”, “bytes” and “native/unknown” (more about which comes later).
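As a small sketch, the encoding flag can be queried per element with Encoding(); here an ASCII string and a UTF-8 string coexist in one vector:
> x <- c("hello", "\u00e7")   # an ASCII string and a UTF-8 string ("ç")
> Encoding(x)
[1] "unknown" "UTF-8"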
Functions accepting character vectors handle strings according to their encoding. E.g. substring counts in bytes for bytes encoded strings, but in characters for character strings (“UTF-8”, “latin1” and “native/unknown”). Not all functions support bytes encoded strings, e.g. nchar(,type="chars") is a runtime error, because a bytes encoded string has no characters.
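A sketch of that difference (byte values chosen arbitrarily; the exact error message may differ between R versions):
x <- "\xE7a"
Encoding(x) <- "bytes"
substring(x, 1, 1)          # the first byte (substring counts in bytes here)
nchar(x, type = "chars")    # runtime error: a bytes encoded string has no characters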
Functions have to deal with the situation where different strings are in different encodings. Individual functions differ in how they do it, but often character strings are converted to a single character encoding, usually UTF-8; when that happens, any newly created result strings are also in UTF-8. The user doesn’t have to worry as long as the strings are valid, because they can then always be represented in UTF-8.
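For example (a sketch; see ?paste for the documented rule that an element is in UTF-8 when any of its inputs is declared UTF-8 and none is bytes encoded):
x <- "\xE7"; Encoding(x) <- "latin1"   # "ç" in latin1
y <- "\u00FC"                          # "ü" in UTF-8
Encoding(paste0(x, y))                 # "UTF-8": the inputs were converted to UTF-8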
This is more complicated with bytes encoded strings, which cannot be converted to a character encoding. Some functions, such as gsub or match, switch to a different mode if any of the input strings is bytes encoded. In this mode, they ignore encodings of strings and treat all of them as byte sequences. As discussed later, this only makes sense in certain situations.
From the above, it is clear that bytes encoded strings are not like byte arrays in Java or Python or char arrays in C, because one cannot refer to the individual bytes in them. Also, one cannot modify individual bytes using the [] operator.
There are additional differences. The zero byte is reserved and cannot be included in any string. Also, every bytes encoded string must contain at least one byte of value greater than 127, because the string must be non-ASCII. ASCII strings are always encoded as “native/unknown” (and while encoding flags can sometimes be manipulated, this rule cannot be violated). It will become clearer later that this is due to identity/comparison of ASCII strings.
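A small sketch of that rule (see also ?Encoding, which notes that ASCII strings are never marked with a declared encoding):
> x <- "abc"
> Encoding(x) <- "bytes"   # has no effect on an ASCII string
> Encoding(x)
[1] "unknown"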
So, bytes encoded strings are not usable to represent binary data. Instead, there are raw vectors in R for that. Elements of a raw vector are arbitrary bytes (including zero) and can be indexed and mutated at the R level using []. They don’t work like strings, aren’t printed as strings and aren’t supported by string functions.
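A short sketch of raw vectors, which can hold arbitrary bytes (including zero) and are mutable by index:
> r <- as.raw(c(0x00, 0xE7, 0x61))   # arbitrary bytes, including zero
> r[2] <- as.raw(0xFF)               # individual bytes can be replaced
> r
[1] 00 ff 61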
Particularly in the past, when there were only single-byte encodings, it made sense to think of encoding-agnostic string operations. Not only because the input encoding sometimes wasn’t reliably known, but also because old code that was not encoding aware, or not aware of newer encodings, could possibly be re-used. Also, there were many different encodings in use.
When all strings are in the same (stateless) single-byte encoding, one can concatenate them without knowing the encoding, and one can do search/replace. If they are all in a super-set of ASCII (the encodings supported by R all are), one can even parse a language that is all-ASCII, including trivial parts such as splitting lines and columns.
People sometimes had input files in a genuinely unknown encoding (the provider of the file didn’t say). And as long as most of the bytes/characters were ASCII, many things could be done at the byte level, ignoring encodings.
A concrete example that still exists in today’s R is the package DESCRIPTION file. The file may be in different encodings, but the encoding is defined in a field named Encoding: inside that file. The file can even have records in different encodings, each with its own Encoding: field. Parsing such a file in R requires some encoding-agnostic operations: one doesn’t know the encoding in advance of reading the file.
With multi-byte encodings, things are much more complicated and encoding-agnostic operations no longer really make sense. Still, UTF-8 allows some of them, to the point that it is supported in DESCRIPTION files. UTF-8 is ASCII safe: a multi-byte character is encoded using only non-ASCII bytes, so all ASCII bytes represent ASCII characters. Also, in UTF-8, searching can be based on bytes: the byte representation of a multi-byte character doesn’t include the byte representation of another character. Still, Debian Control Files (DCF), on which DESCRIPTION files are based, currently do not allow the encoding to be defined inside them; today they are required to be in UTF-8. It would make sense to eventually move to UTF-8 in DESCRIPTION files as well.
Of course, even with UTF-8, some basic encoding-agnostic operations are not possible, as characters may be represented by multiple bytes. Other multi-byte and particularly stateful encodings make encoding-agnostic operations on the byte stream impossible.
The current trend seems to be that files must be in a defined known encoding (known without parsing text of the file), and often this encoding is known implicitly as it is required to be UTF-8.
Still, to support old-style files, such as current DESCRIPTION (or e.g. old LaTeX), byte-based encoding-agnostic operations are needed, and bytes encoded strings are the right tool for that in R.
R has an encoding referred to as “unknown” (see e.g. ?Encoding).
In most parts of R today, strings in this encoding are expected to be valid strings in the native encoding of the R session, and this is why I use “unknown/native” elsewhere in this text. Any encoding conversion (typically to UTF-8) relies on this. If it doesn’t hold, there is an error, a warning, substitution of invalid bytes, etc., depending on the operation.
Such string conversions may happen at almost any time internally, without direct control by the user, so using “unknown/native” strings to perform encoding-agnostic operations is brittle and error prone. It is still sometimes possible, as string validity is not currently checked at creation, but it is not impossible that this would be turned into an error in the future, as invalid strings are often simply created by user error.
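As a hedged illustration (in a UTF-8 locale; the exact outcome depends on the operation and R version), an invalid string flagged “unknown” may be rejected or altered as soon as a conversion happens:
x <- "fa\xE7ile"   # a latin1 byte, but the string stays flagged "unknown"
enc2utf8(x)        # typically fails, warns or substitutes in a UTF-8 locale,
                   # because the bytes are not valid UTF-8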
Bytes encoded strings, on the other hand, are safe against accidental conversion, as by design/definition they cannot be converted to a character encoding.
For completeness, it should be said that some parts of R allow for a certain uncertainty about strings with “unknown/native” encoding. They are meant to be valid strings in the native encoding, but the idea is that we are not so sure we believe that until it is confirmed by an explicit declaration by the user (some functions allow marking such strings UTF-8 or Latin 1) or by a successful conversion to a different encoding. Still, whenever the string encoding is actually needed, it is expected to be the native encoding, and if it is not, there is an error, warning, substitution, transliteration, etc.
In the past, and until recently on Windows, the native encoding was often single-byte (e.g. Latin 1), so conversions did not detect invalid bytes as often as now and the results were often acceptable for the reasons described above. Now, when the native encoding is mostly UTF-8, where many byte values cannot be a lead byte, conversions more often detect invalid bytes in old single-byte encoded files.
Particularly in the past, another source of uncertainty was what the native encoding actually was, and even today, finding that out is platform specific. So, strings were assumed to be in the native encoding, but it was sometimes unknown what that encoding actually was.
Finally, while it is discouraged, the R session encoding can be changed at runtime. This makes the existing strings in “native/unknown” encoding invalid, or in other words, it is then not known which strings are in which encoding.
I think that all these sources of uncertainty are becoming of less concern today and that the “unknown” encoding should be understood as “native”, with all strings marked with that encoding valid in it. The R session encoding should never be changed at runtime. On recent Windows, it should never be changed at all (it should be UTF-8, because that is the build-time choice for the system encoding of R on Windows). Definitely, the “unknown” encoding should not be used for encoding-agnostic operations: we have the bytes encoding for that.
It turned out that the existing support for the “bytes” encoding had several limitations, which have recently been fixed.
First, it wasn’t possible to read lines from a text file (such as DESCRIPTION) as strings in “bytes” encoding. One would normally read using readLines without specifying an encoding, and then mark as “bytes” using Encoding(x) <- "bytes", but that approach uses invalid strings, because for a short while the strings are marked as “native/unknown”. This has been improved and now one can use readLines(,encoding="bytes") to read lines from the file as “bytes”. Of course, this assumes that line separators have that meaning (which must be the case for encoding-agnostic operations).
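A minimal sketch (the file name is just an example):
lines <- readLines("DESCRIPTION", encoding = "bytes")
Encoding(lines)   # non-ASCII lines are flagged "bytes", ASCII lines remain "unknown"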
Then, there was a problem with the regexp operations gsub, sub and strsplit. These operations sometimes create new strings, by substitution or splitting, and the question is what encoding these strings should have. When any of the inputs is encoded as bytes, these operations “use bytes” (work at the byte level). But, for historical reasons, they used to return these new strings as “unknown/native”.
Hence, by processing an input line from, say, a DESCRIPTION file, represented as a bytes encoded string, one could get an invalid “native/unknown” string, which could then be corrupted by accidental conversion to some other encoding. One would have to always change the encoding of the result of every single regexp operation to “bytes”, but that is inconvenient and sometimes cannot easily be done by the user, e.g. when calling a function that isn’t doing that (e.g. trimws, which may apply two regexp operations in sequence).
These functions were changed to mark the newly created strings as bytes when at least one of the inputs is marked as bytes. It should be said that while the regexp functions allow mixed-encoding use, only a small subset of that makes any sense. Either all inputs are in a character encoding (so convertible to UTF-8), and then the results will also be in a character encoding. Or all inputs are bytes encoded or ASCII, and then the results will also be bytes encoded or ASCII. Mixing bytes encoded and other non-ASCII strings doesn’t make sense.
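A sketch of the change (byte values arbitrary): when an input is bytes encoded and the new string is still non-ASCII, it is now also flagged “bytes”:
x <- "fa\xE7ile"
Encoding(x) <- "bytes"
y <- sub("ile$", "", x)   # substitution creates a new string, "fa\xE7"
Encoding(y)               # now "bytes"; before the change the new string was flagged "unknown"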
Now, a natural question is whether we shouldn’t also do this whenever useBytes=TRUE is given, that is, whether the newly created strings, or possibly all strings returned, should not be marked as bytes.
This has been tried in R-devel but reverted for further analysis as it broke too much existing code. I first wanted to mark only the newly created strings as bytes (because we haven’t changed the old ones, so why forget their encoding). This would conceptually make sense, but it broke this pattern in user code:
xx <- gsub(<something_strange>, "", x, useBytes = TRUE)
stopifnot(identical(xx, yy))
The pattern removes “something strange” from an input text in a character encoding. When replacement happens, the result element is bytes encoded after the change (but “unknown/native” before the change). When replacement doesn’t happen, it is encoded in the original character encoding of x.
However, a bytes encoded string is never treated as identical to a string in a character encoding. So, the change introduced type instability (character vs bytes encoding) where there was none before, and tests started failing. I tried to fix this by making all strings returned by the function bytes encoded, but while “stable”, it broke even more code, because it ended up passing bytes encoded strings to string functions that did not (and some could not) support them.
Above, I wrote that using a mixture of bytes and character encoded non-ASCII strings on input doesn’t make sense. useBytes = TRUE with inputs in multiple different character encodings doesn’t make sense either, for the same reasons (simply, the bytes in different inputs mean different things). But useBytes = TRUE has historically been used, as in this pattern, to achieve some level of robustness against invalid input UTF-8 strings. This works with a subset of regular expressions on UTF-8 inputs with some invalid bytes.
Being able to process UTF-8 with invalid bytes is a useful feature, e.g. when processing textual logs from multiple parallel processes without proper synchronization: multi-byte characters may not be written atomically. While PCRE2 today has better support for invalid bytes in UTF-8 strings, R doesn’t yet provide access to it. Indeed, for some applications, one could simply substitute invalid bytes using iconv and get a valid UTF-8 string to process.
It should be noted here that the “bytes” encoding (and also the character encodings) already have another type instability wrt ASCII. If an operation on a bytes encoded string, say, extracts some parts of the string or otherwise processes them, the result may be bytes encoded (when it has at least one non-ASCII byte) or “native/unknown” (when it is ASCII). substring is a trivial example. Hence, results of string operations should already be treated with some type instability in mind.
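A sketch of that instability with substring (byte values arbitrary):
x <- "a\xE7b"
Encoding(x) <- "bytes"
Encoding(substring(x, 1, 1))   # "unknown": the extracted part "a" is ASCII
Encoding(substring(x, 1, 2))   # "bytes": the extracted part still contains a non-ASCII byte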
It would seem the pattern above could be handled by (WARNING: this doesn’t work, see below):
xx <- gsub(<something_strange>, "", x, useBytes = TRUE)
xx <- iconv(xx)
stopifnot(identical(xx, yy))
which would re-flag bytes encoded elements of xx as “unknown/native” and convert elements in a character encoding to “unknown/native” as well. But this has two problems. The first is that some of the input characters may not be representable in the “unknown/native” encoding (on old systems where UTF-8 is not the native encoding). That could be solved by using xx <- iconv(xx, to="UTF-8").
But there is another problem: iconv(,from="") historically ignored the encoding flag of the input string and always converted from the “unknown/native” encoding, so it misinterpreted strings in other encodings. This behavior of iconv has been changed. Now, the encoding flag of the input string takes precedence if it is UTF-8 or Latin 1. This is a change to the documented behavior, but in principle it could only break code that used to depend on using invalid strings. Checking all of CRAN and Bioconductor packages revealed that only one package started failing after the change, and it was actually a good thing, because the package had an error; it worked by accident with the old behavior.
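A hedged sketch of the difference (relevant in a UTF-8 locale; byte values arbitrary): with the change, the latin1 flag on the input takes precedence over from = "":
x <- "\xE7"
Encoding(x) <- "latin1"            # "ç" declared as latin1
iconv(x, from = "", to = "UTF-8")  # now converted from latin1 (giving "ç");
                                   # previously the bytes were assumed to be in the native encoding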
I believe that when considering useBytes = TRUE, it should primarily be decided whether invalid inputs need to be supported at all; in many applications they probably don’t, but in some they do. Then, I think one should first consider whether substitution using iconv(,sub=) to obtain valid UTF-8 input would be acceptable. If so, that is the simplest, most defensive and future-compatible option for accepting invalid strings.
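A sketch of that substitution approach (the input is a hypothetical invalid UTF-8 string):
x <- "log: fa\xE7ile"                            # 0xE7 is not valid UTF-8 here
xx <- iconv(x, from = "UTF-8", to = "UTF-8", sub = "byte")
xx                                               # the invalid byte is replaced by "<e7>", giving valid UTF-8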
Only if that is not acceptable, and useBytes = TRUE with regular expressions is to be used, should the code handle the type instability wrt getting results in bytes or “native/unknown” encoding, as discussed above.
The documentation of the regexp operations has been updated to make it explicit that in some cases it is unspecified whether the results will be “bytes” or “unknown/native” encoded (before, this was only unspecified indirectly).
Code should be made robust against possible changes within this range (which
may not only be a result of cleanups, but also performance optimizations or
refactorings to support new features). Once R gets safer regexp support for
handling invalid UTF-8 inputs, such code may have to be updated, anyway.
I would not consider using useBytes = TRUE in regexp operations for any other reason, because of not only the type instability, but also the limitations of the regular expressions that may be used. In the past, this has been done for performance, but the performance of regexp operations has recently been improved for this reason (see this blog post).
It also used to be done when the support for handling UTF-8 strings in R was limited, but that should no longer be a valid reason.
The bytes encoding in R is a somewhat unusual feature, suitable for encoding-agnostic operations at the byte level.
It allows such operations to be performed safely. Unsafe alternatives used in the past included using invalid strings in the “unknown/native” encoding, sometimes together with changing the R session locale, but these lead to wrong results (due to accidental transliteration and substitution) or to warnings and errors. The unsafe alternatives are also only possible because R tolerates the creation of invalid strings, which in turn hides errors in user code and packages that could otherwise be detected by checking string validity at string creation time.
Recent improvements in R made it easier to use bytes encoding for encoding-agnostic operations at byte level, when they are needed. This text, however, also argues that encoding-agnostic operations should not be much needed in the future when encodings are properly supported, known (and ideally/mostly UTF-8).
Providing safe alternatives to what are now unsafe operations with the “native/unknown” encoding, in the form of the bytes encoding improvements, better support for regular expressions on invalid UTF-8 inputs, and regexp speedups, should make it easier to detect encoding bugs which now cause incorrect results or errors, and should also allow the encoding support in R to be simplified in the future. Since UTF-8 is now the native encoding also on recent Windows, it should eventually be possible to have only UTF-8 as the character encoding supported in R.