Encodings and R
The use of encodings is raised sporadically on the R mailing lists,
with discussion of ideas to `do better'. R has been developed by
authors speaking English or a Western European language, and its
current mindset is the ISO Latin 1 (aka ISO 8859-1) character
set. Even these authors find some problems, for example the lack
of some currency symbols (notably the Euro, € if it displays for
you). Users of R in Central Europe need more characters and are
sometimes puzzled that Latin 2 (aka ISO 8859-2) works only
partially.
Other languages present much greater challenges, and there is a project
to `Japanize' R which (for obvious reasons) is little known outside
Japan.
One of the challenges is that in European usage nchar(x)
is the number of characters in the string, and that is also the
number used for adjusting layouts. In other encodings there can be
three different values:
- The number of characters in a string,
- The number of bytes used to store a string, and
- The number of columns used to display the string -- some characters
may be double width even in a monospaced font.
Fortunately nchar is little used at R level (and then often just to
test whether a string is empty), but its C-level equivalents are widely
used, and both the R-level and C-level versions are used in all three
senses.
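As a small illustration (using the type argument of nchar() added in
R 2.1.0 and the \u escapes, both described later in this document, and
assuming a UTF-8 locale):

    x <- "\u00e9\u4e2d"         # e-acute followed by a CJK character
    nchar(x, type = "chars")    # 2: the number of characters
    nchar(x, type = "bytes")    # 5: stored in UTF-8 as 2 + 3 bytes
    nchar(x, type = "width")    # 3: the CJK character is double width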
Update: This document was first
written in December 2003: see below
for the changes made for R 2.1.0.
Encoding in R 1.8.x
The default behaviour is to treat characters as a stream of 8-bit
bytes, and not to interpret them other than to assume that each byte
represents one character. The only exceptions are
- The connections mechanism allows the remapping of input files to
the `native' encoding. Since the encoding defaults to
getOption("encoding") it is possible to use this within
read.table fairly easily. Note that this is a
byte-level remapping, and that not all of R's input goes through the
connections mechanism.
- Those graphical devices which name glyphs, notably
postscript and pdf, do have to deal with
encoding, and they allow the user to specify the byte-level mapping of
code to glyphs. This has been one of the problem areas as the
standard Adobe font metrics included in R only cover ISO Latin 1 and
not for example the Euro (although the URW font metrics supplied do
have it). Similarly, the Adobe fonts do not cover all of ISO
Latin 2.
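For example, one plausible way to ask the postscript() device for a
Latin-2 byte-to-glyph mapping is via its encoding argument
("ISOLatin2.enc" is one of the encoding files shipped with R; whether
every glyph actually prints still depends on the font metrics, as
noted above):

    postscript("latin2.ps", family = "Helvetica", encoding = "ISOLatin2.enc")
    plot(1:10, main = "title text supplied in the Latin-2 encoding")
    dev.off()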
With these exceptions, character encoding is the responsibility of the
environment provided by the OS, so
- What glyph is displayed on a graphics device depends on the
encoding of the font selected.
- How output is displayed in the terminal depends on the font and
locale selected.
- What numeric code is generated by keystrokes depends on the
keyboard mapping or locale in use.
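The locale in use can at least be inspected (and, within the limits
the OS allows, changed) from within R, for example:

    Sys.getlocale("LC_CTYPE")                  # e.g. "en_GB.ISO8859-1"
    Sys.setlocale("LC_CTYPE", "en_GB.utf8")    # locale names are OS-dependent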
Towards Unicode?
It seems generally agreed that Unicode is the way to cope with all
known character sets. There is a comprehensive FAQ. Unicode defines a
numbering of characters up to 31 bits, although it seems agreed that
only 21 bits will ever be used. However, to use that numbering directly as an
encoding would be rather wasteful, and most people seem to use UTF-8
(see this FAQ, rather
Unix-oriented), in which each character is represented as 1,2,...,6
bytes (and how many can be deduced from the first byte). As
7-bit ASCII characters are represented as a single byte (with the high
bit zero) there is no storage overhead unless non-American characters
are used.
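A small illustration of the variable-length encoding, assuming a
UTF-8 locale (charToRaw() shows the underlying bytes; the \u escapes
are the syntax discussed below):

    charToRaw("A")        # 41        -- ASCII, one byte
    charToRaw("\u00e9")   # c3 a9     -- a Latin-1 letter, two bytes
    charToRaw("\u20ac")   # e2 82 ac  -- the Euro sign, three bytes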
An alternative encoding is UTF-16, which encodes most characters in
two bytes and the remainder as a pair of two-byte units (`surrogate
pairs'). UTF-16 without surrogates is sometimes known as UCS-2,
and was the Unicode standard prior to version 3.0. (Note that
the ISO C99 wide characters need not be encoded as UCS-2.)
UTF-16 is big-endian unless otherwise specified (as UTF-16LE).
There is the concept of a BOM (byte-order mark), a non-printing
first character that can be used to determine the endianness (and
which Microsoft code expects to see in UTF-16 files).
Not only can a single
character be stored in a variable number of bytes but it can be
displayed in 1, 2 or even 0 columns.
Linux and other systems based on glibc are moving towards
UTF-8 support: if the locale is set to en_GB.utf8 then
the run-time assumes UTF-8 encoding is required. Here
is a somewhat outdated Linux HOWTO: its advice is to use wide
characters internally and ISO C99 facilities to convert to and from
external representations.
Major Unix distributions
(e.g. Solaris
2.8) are also incorporating UTF-8 support. It appears that the Mac part
of MacOS X uses UTF-16.
Windows has long supported `wide characters', that is 2-byte
representations of characters, and provides fonts covering a very wide
range of glyphs (at least under NT-based versions of Windows).
This appears to be little-endian UCS-2, and it is said that internally
Windows NT uses wide characters, converting to and from the usual
byte-based characters as needed. Some Asian versions of Windows
use a double-byte character set (DBCS) which appears to represent
characters in one or two bytes: this is the meaning of
char in Japanese-language versions of Windows.
Long filenames are stored in `Unicode', and are sometimes
automatically translated to the `OEM' character set (that is ASCII
plus an 8-bit extension set by the code page). Windows 2000 and later
have limited support for the surrogate pairs of UTF-16. Translations
from `Unicode' to UTF-8 and vice versa by functions
WideCharToMultiByte and MultiByteToWideChar
are supported in NT-based Windows versions, and in earlier ones with
the `Microsoft Layer for Unicode'.
Implementation issues
If R were to use UTF-8 internally we would need to handle at least the
following issues
- Conversion to UTF-8 on input. This would be easy for
connections-based input (although a more general way to describe the
source encoding would be required), but all the console/keyboard-based
input routines would need to be modified, and a more comprehensive way
to specify encodings would be needed. Possibilities are to
use libiconv
(if installed, or to install it ourselves) or a DIY approach like that of Tcl/Tk.
- Conversion of text output. This would be easy for
connections-based output, but dialog-box based output would need to be
handled, for example. It is not clear what to do with characters
which cannot be mapped -- the graphical devices currently map to
space.
- Handling of file names. It is quite common to read,
manipulate and process file names. If these are allowed to be
UTF-8 this would be straightforward, but are they? Probably
usually not. Note that Unix kernels expect single bytes for NUL
and / at least, so cannot work with UTF-16 file names.
On MacOS X and Windows the encoding of file names depends on the file
system: the modern file systems use UCS-2 file names.
- Graphical text output. This boils down to either selecting
suitable fonts or converting to the encoding of the fonts. I
suspect that under Windows a 2-byte encoding would be used, and X
servers can make use of ISO10646-1 fonts but the present device
would need its font-handling internationalized.
- Text manipulation, for example match,
grep and tolower. For some of these
UTF-8 versions are readily available; others we would have to
rewrite. And a lucky few, like match, would work
directly on the encoded strings. For PCRE, only UTF-8 is available as
there is no wide-character version, whereas the GNU regex
has a wide-character version but not a UTF-8 one (and according to Markus
Kuhn is 100x slower as a result).
String collation is also an
issue in a few places, but strcoll should be UTF-8-aware
on suitable OSes.
Most widespread is the use of
snprintf, strncmp and simple loops to e.g.
map \ to / (and the latter is fine in UTF-8, as no
ASCII character can occur as part of a multi-byte sequence). The use of
character classification functions such as isalpha would need
replacement (and probably coercion to wide characters would be
easiest).
substr(ing) and strsplit will need to be
aware of character boundaries (a short sketch follows this
list). Note that Unicode has three cases, not two (the extra one
being `title').
- We would need to support the \uxxxx format for
arbitrary Unicode characters.
- The format for the distribution of R sources. Fortunately
only a few files are not in ASCII: some .Rd files, as
well as the THANKS file.
- Help files. Most modern Web browsers can display UTF-8, and Perl
5.8 is apparently aware of UTF-8 (and uses it internally), so it
may be fairly easy to make use of our existing
Rdconv. I have just added a
charset=iso-8859-1 declaration to the header of the converted HTML
help files, and this would need to be changed. Since LaTeX cannot
handle Unicode we would have to convert the encoding of LaTeX help
files or use Lambda (and tell it they were in UTF-8).
- Environment variables could have both names and values in UTF-8.
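As a sketch of what character-level (rather than byte-level)
behaviour looks like, using the R 2.1.0 facilities described in the
next section and assuming a UTF-8 locale:

    x <- "caf\u00e9"        # \uxxxx escape for e-acute
    substr(x, 4, 4)         # "é": the fourth character, not the fourth byte
    strsplit(x, "")[[1]]    # "c" "a" "f" "é": split at character boundaries
    toupper(x)              # "CAFÉ", upper-cased via wide characters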
The API for extending R would be problematical. There are a few
hundred R extensions written in C and FORTRAN, and a few of them
manipulate character vectors. They would not be expecting UTF-8
encoding (and probably have not thought about encodings at all).
Possible ways forward are
- To map to a single-byte encoding (Latin1?) and back again when .C
does the copying.
- Just to pass through the stream of bytes.
This does raise the issue of whether the CHAR internal
type should be used for UTF-8 or a new type created. It would
probably be better to create a new type for raw bytes.
Eiji Nakama's paper
on `Japanizing R' seems to take the earlier multi-byte character
approach rather than UTF-8 or UCS-2, except for Windows fonts.
Functions such as isalpha do not work correctly in MBCSs
(including UTF-8).
The Debian guide
to internationalization is a useful background resource. Note
that internationalization is often abbreviated as 'i18n', and
localization (support for a particular locale) as 'L10n'. The
main other internationalization/localization issue is to allow for the
translation of messages (and to translate them).
Encodings in R 2.1.0
Work started in December 2004 on implementing UTF-8 support for R
2.1.0, which is expected to be released in April 2005. Currently
implemented are:
- The parser has been made aware of multi-byte characters in UTF-8
and so works in character (rather than byte) units.
- An internationalized version of the regexp code. For the
basic and extended regexps we use the code from glibc-2.3.3
which internally uses widechars and so supports all
multi-byte character sets, e.g. UTF-8. For the Perl versions
we use PCRE, which has UTF-8 (but not general MBCS) support
available.
- Replacement versions of chartr,
toupper and tolower work via
conversion to widechar and so handle any MBCS that the OS supports
as the current locale.
- substr() and make.names() work with
characters, not bytes.
- nchar() has an additional argument to return the
number of bytes, the number of characters or the display
width. It was often used in conjunction with
substr() to truncate character strings: that
should be done in terms of display width for which there is a
new function strtrim().
- A new function iconv() allows character vectors to
be mapped between encodings (where it is available: GNU libiconv has
been grafted on for the Windows build).
- The 'encoding' argument of connections has been
changed from a numeric vector to a character string naming an
encoding that iconv knows about, and re-encoding on the
fly can now be done on both input and output. Note that this
does not apply to the 'terminal' connections or to text connections,
but it does apply to all file-like connections (a short sketch follows
this list). If input is redirected from
a file (or pipe), the input encoding can be specified by the
command-line flag --encoding.
- The postscript() and pdf() devices
handle UTF-8 strings by remapping to Latin1 (this is currently
hardcoded).
- A start has been made on converting the X11()
device and the X11-based data editor using Nakama's Japanization
patches, with X input methods added to the data editor, so it does
now work in a (Western) UTF-8 locale.
- scan() needs single-byte chars for its decimal,
comment and separator characters -- this is now enforced. It still
uses isspace and isdigit, so only ASCII
space and digit characters are recognized (but this seems to cause
few problems).
- abbreviate() is a problem: its algorithm is
hardcoded for English (e.g. which bytes are vowels) and it now warns
if given non-ASCII text.
- print()ing looks for valid characters and only
escapes non-printable characters (rather than bytes). It does so
by converting to widechars and using the wctype functions
in the current locale.
- UTF-8 strings are passed to and from the tcltk
package (this applies in any MBCS).
- There is some support for pch=n > 127 and pch="c" in UTF-8 locales,
where a number is taken to be the Unicode character number, and the
first MBCS character is taken.
- The replacement for strptime has been rewritten to work a character
at a time, using widechars internally.
- The Hershey fonts are encoded in Latin-1, so the vfont support has
been rewritten to re-encode to Latin-1.
- A new function localeToCharset attempts to deduce plausible
character sets from the locale name (on Unix and on Windows). This is
used by source to test out plausible encodings if the (new) argument
encoding = "unknown" is specified.
- .Rd has a new directive \encoding{} to set the encoding to be
assumed for the file and hence its HTML translation (and this is also
given as a comment in the example file). Note that one has to be
careful here, as some implementations of iconv do not allow any 8-bit
chars in the C locale, and the lack of standards for charset names is
also a problem.
- The Windows console and data editor have been modified to work with
MBCS character sets, as well as having support for double-width
characters.
- readChar and writeChar work in characters, not bytes.
- .C supports a new argument ENCODING= to specify the encoding
expected for character strings.
- delimMatch (tools) returns the position and match length in
characters, not bytes, and allows multi-byte delimiters.
For many of these features R needs to be configured with
--enable-utf8.
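A short sketch pulling together several of the facilities listed
above (the file name is illustrative; the exact results depend on the
locale and the iconv implementation):

    iconv("caf\xe9", from = "latin1", to = "UTF-8")      # re-encode a string
    con <- file("data-latin1.txt", encoding = "latin1")  # re-encoded on input
    x <- readLines(con); close(con)
    strtrim("a longish label", 6)                        # truncate by display width
    localeToCharset()                                    # plausible charsets for this locale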
Implementation details
The C code often handles character strings as a whole. We have
identified the following places where character-level access is used:
- In the parser to identify tokens. (gram.y)
- do_nchar, do_substr, do_substrgets, do_strsplit,
do_abbrev, do_makenames, do_grep, do_gsub, do_regexpr, do_tolower,
do_chartr, do_agrep, do_strtrim (character.c),
do_pgrep, do_pgsub, do_pregexpr
(pcre.c)
- GEText, GEMetricInfo. (engine.c)
- RenderStr. (plotmath.c)
- RStrlen, EncodeString
(printutils.c)
- The dataentry editor (the various dataentry.c files)
- Graphics devices in handling encoded text, and in metric
info. (Currently devX11.c, rotated.c and
devPS.c have been changed, and devPicTeX.c
is tied to TeX which is a byte-based program.)
- The ASCII versions of load and save. As
these are a reversible representation of objects in ASCII, it does
not matter if they are handled as byte streams.
- New wrapper functions Rf_strchr,
Rf_strrchr and R_fixslash cover
comparisons with single ASCII characters.
backquotify (deparse.c),
do_dircreate (platform.c), do_getwd,
do_basename, do_dirname and
isBlankString (util.c)
are now MBCS-aware.
There are many other places which do a comparison with a single ASCII
character (such as . or / or \ or LF) and so cause no problem in UTF-8
but might in other MBCSs. These include filbuf
(platform.c, which looks for CR and LF and these seem
safe), fillBuffer (scan.c) and there are
others.
Encodings which are likely to cause problems include
- Vietnamese (VISCII). This uses 186 characters including the
control characters 0x02, 0x05, 0x06, 0x14, 0x19, 0x1e:
the Windows GUI makes use of these as control characters.
- Big5, GBK, Shift-JIS. These are all 1- or 2-byte encodings
including ASCII as 1-byte chars (except that Shift-JIS replaces backslash
by ¥), but whose second byte can overlap the ASCII range.
fillBuffer (scan.c) has now been
rewritten to be aware of double-byte character sets and to only test
the lead byte.
Windows
Windows does things somewhat differently. `Standard' versions of
Windows have only single-byte locales, with the interpretation of
those bytes being determined by code pages. However, `East
Asian' versions (an optional install at least on Windows XP) use
double-byte locales in which characters can be represented by one or
two bytes (and can be one or two columns wide).
Windows also has `Unicode' (UCS-2) applications in which all
information is transferred as 16-bit wide characters, and the locale
does not affect the interpretation. Windows 2000 and later have
optional support for surrogate pairs (UTF-16) but this is not normally
enabled. (See here for how to
enable it.)
Currently R-devel has three levels of MBCS support under Windows.
- By default, all character strings are interpreted as single bytes.
- If SUPPORT_MBCS is defined in MkRules
and in config.h, R.dll will recognize
multi-byte characters if run in an MBCS locale and generally (but not
always, notably in scan) treat them as whole units.
- If in addition SUPPORT_GUI_MBCS is defined in
MkRules, RGui is compiled to be aware of
multi-byte characters if run in an MBCS locale, and cursor movements
will work in whole characters, with the cursor width adapting to the
current character's width.
- If SUPPORT_UTF8 is defined in addition to
SUPPORT_MBCS, most of R.dll will assume it
is running in a UTF-8 locale. As there are no such locales under
Windows, this is only useful with a custom front-end that
communicates in UTF-8 (and even then there are issues with file
names and content, and environment variables).
Localization of messages
As from 2005-01-25, R uses GNU gettext where available.
So far only the start-up message is marked for translation, as a
proof-of-concept: there are several thousand C-level messages that
could potentially be translated.
The same mechanism could be applied to R packages, provided they call
dgettext with a PACKAGE domain specific to the
package, and install their own PACKAGE.mo files, say via
an inst/po directory. The splines package
has been converted to show how this might be done: it only has one
error message.
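As a sketch of how the R level of a package might eventually use the
same mechanism, assuming R-level wrappers around gettext (the
"R-splines" domain name and the use of bindtextdomain()/gettextf()
here are illustrative assumptions, not part of what is described
above):

    ## hypothetical: translate a message in a package's own domain
    bindtextdomain("R-splines", system.file("po", package = "splines"))
    stop(gettextf("invalid value for '%s'", "knots", domain = "R-splines"))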
Brian Ripley
2004-01-11, 2005-01-25