Encodings and R
The use of encodings is raised sporadically on the R mailing lists,
with discussion of ideas to `do better'. R has been developed by
authors speaking English or a Western European language, and its
current mindset is the ISO Latin 1 (aka ISO 8859-1) character
set. Even these authors find some problems, for example the lack
of some currency symbols (notably the Euro, € if it displays for
you). Users of R in Central Europe need more characters and are
sometimes puzzled that Latin 2 (aka ISO 8859-2) works only
partially.
Other languages present much greater challenges, and there is a project
to `Japanize' R which (for obvious reasons) is little known outside
Japan.
One of the challenges is that in European usage nchar(x)
is the number of characters in the string, and that is also the
number used for adjusting layouts. In other encodings there can be
three different values:
- The number of characters in a string,
- The number of bytes used to store a string, and
- The number of columns used to display the string -- some characters
may be double width even in a monospaced font.
Fortunately nchar is little used at R level (and then often just to
test whether a string is empty), but its C-level equivalents are widely
used, and both the R-level and C-level versions are used in all three
senses.
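As a small illustration (using the type argument of nchar() added in
R 2.1.0 and the \u escapes, both described later in this document, and
assuming a UTF-8 locale):

    x <- "\u00e9\u4e2d"         # e-acute followed by a CJK character
    nchar(x, type = "chars")    # 2: the number of characters
    nchar(x, type = "bytes")    # 5: stored in UTF-8 as 2 + 3 bytes
    nchar(x, type = "width")    # 3: the CJK character is double width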
Update: This document was first
written in December 2003: see below
for the changes made for R 2.1.0.
Encoding in R 1.8.x
The default behaviour is to treat characters as a stream of 8-bit
bytes, and not to interpret them other than to assume that each byte
represents one character. The only exceptions are
- The connections mechanism allows the remapping of input files to
the `native' encoding. Since the encoding defaults to
getOption("encoding") it is possible to use this within
read.table fairly easily. Note that this is a
byte-level remapping, and that not all of R's input goes through the
connections mechanism.
- Those graphical devices which name glyphs, notably
postscript and pdf, do have to deal with
encoding, and they allow the user to specify the byte-level mapping of
code to glyphs. This has been one of the problem areas as the
standard Adobe font metrics included in R only cover ISO Latin 1 and
not for example the Euro (although the URW font metrics supplied do
have it). Similarly, the Adobe fonts do not cover all of ISO
Latin 2.
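For example, one plausible way to ask the postscript() device for a
Latin-2 byte-to-glyph mapping is via its encoding argument
("ISOLatin2.enc" is one of the encoding files shipped with R; whether
every glyph actually prints still depends on the font metrics, as
noted above):

    postscript("latin2.ps", family = "Helvetica", encoding = "ISOLatin2.enc")
    plot(1:10, main = "title text supplied in the Latin-2 encoding")
    dev.off()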
With these exceptions, character encoding is the responsibility of the
environment provided by the OS, so
- What glyph is displayed on a graphics device depends on the
encoding of the font selected.
- How output is displayed in the terminal depends on the font and
locale selected.
- What numeric code is generated by keystrokes depends on the
keyboard mapping or locale in use.
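The locale in use can at least be inspected (and, within the limits
the OS allows, changed) from within R, for example:

    Sys.getlocale("LC_CTYPE")                  # e.g. "en_GB.ISO8859-1"
    Sys.setlocale("LC_CTYPE", "en_GB.utf8")    # locale names are OS-dependent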
Towards Unicode?
It seems generally agreed that Unicode is the way to cope with all
known character sets. There is a comprehensive FAQ. Unicode defines a
numbering of characters up to 31 bits, although it seems agreed that
only 21 bits will ever be used. However, to use that numbering directly as an
encoding would be rather wasteful, and most people seem to use UTF-8
(see this FAQ, rather
Unix-oriented), in which each character is represented as 1,2,...,6
bytes (and how many can be deduced from the first byte). As
7-bit ASCII characters are represented as a single byte (with the high
bit zero) there is no storage overhead unless non-American characters
are used.
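A small illustration of the variable-length encoding, assuming a
UTF-8 locale (charToRaw() shows the underlying bytes; the \u escapes
are the syntax discussed below):

    charToRaw("A")        # 41        -- ASCII, one byte
    charToRaw("\u00e9")   # c3 a9     -- a Latin-1 letter, two bytes
    charToRaw("\u20ac")   # e2 82 ac  -- the Euro sign, three bytes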
An alternative encoding is UTF-16, which encodes most characters in
two bytes and the remainder as a pair of two-byte units (`surrogate
pairs'). UTF-16 without surrogates is sometimes known as UCS-2,
and was the Unicode standard prior to version 3.0. (Note that
the ISO C99 wide characters need not be encoded as UCS-2.)
UTF-16 is big-endian unless otherwise specified (as UTF-16LE).
There is the concept of a BOM (byte-order mark), a non-printing
first character that can be used to determine the endianness (and
which Microsoft code expects to see in UTF-16 files).
Not only can a single
character be stored in a variable number of bytes but it can be
displayed in 1, 2 or even 0 columns.
Linux and other systems based on glibc are moving towards
UTF-8 support: if the locale is set to en_GB.utf8 then
the run-time assumes UTF-8 encoding is required. Here
is a somewhat outdated Linux HOWTO: its advice is to use wide
characters internally and ISO C99 facilities to convert to and from
external representations.
Major Unix distributions
(e.g. Solaris
2.8) are also incorporating UTF-8 support. It appears that the Mac part
of MacOS X uses UTF-16.
Windows has long supported `wide characters', that is 2-byte
representations of characters, and provides fonts covering a very wide
range of glyphs (at least under NT-based versions of Windows).
This appears to be little-endian UCS-2, and it is said that internally
Windows NT uses wide characters, converting to and from the usual
byte-based characters as needed. Some Asian versions of Windows
use a double-byte character set (DBCS) which appears to represent
characters in one or two bytes: this is the meaning of
char in Japanese-language versions of Windows.
Long filenames are stored in `Unicode', and are sometimes
automatically translated to the `OEM' character set (that is ASCII
plus an 8-bit extension set by the code page). Windows 2000 and later
have limited support for the surrogate pairs of UTF-16. Translations
from `Unicode' to UTF-8 and vice versa by functions
WideCharToMultiByte and MultiByteToWideChar
are supported in NT-based Windows versions, and in earlier ones with
the `Microsoft Layer for Unicode'.
Implementation issues
If R were to use UTF-8 internally we would need to handle at least the
following issues
- Conversion to UTF-8 on input. This would be easy for
connections-based input (although a more general way to describe the
source encoding would be required), but all the console/keyboard-based
input routines would need to be modified, and a more comprehensive way
to specify encodings would be needed. Possibilities are to
use libiconv
(if installed, or to install it ourselves) or a DIY approach like that of Tcl/Tk.
- Conversion of text output. This would be easy for
connections-based output, but dialog-box based output would need to be
handled, for example. It is not clear what to do with characters
which cannot be mapped -- the graphical devices currently map to
space.
- Handling of file names. It is quite common to read,
manipulate and process file names. If these are allowed to be
UTF-8 this would be straightforward, but are they? Probably
usually not. Note that Unix kernels expect single bytes for NUL
and / at least, so cannot work with UTF-16 file names.
On MacOS X and Windows the encoding of file names depends on the file
system: the modern file systems use UCS-2 file names.
- Graphical text output. This boils down to either selecting
suitable fonts or converting to the encoding of the fonts. I
suspect that under Windows a 2-byte encoding would be used, and X
servers can make use of ISO10646-1 fonts but the present device
would need its font-handling internationalized.
- Text manipulation, for example match,
grep and tolower. For some of these
UTF-8 versions are readily available; others we would have to
rewrite. And a lucky few, like match, would work
directly on the encoded strings. For PCRE, only UTF-8 is available as
there is no wide-character version, whereas the GNU regex
has a wide-character version but not a UTF-8 one (and according to Markus
Kuhn is 100x slower as a result).
String collation is also an
issue in a few places, but strcoll should be UTF-8-aware
on suitable OSes.
Most widespread is the use of
snprintf, strncmp and simple loops to e.g.
map \ to / (and the latter is fine in UTF-8, as no
ASCII character can occur as part of a multi-byte sequence). The use of
character classification functions such as isalpha would need
replacement (and probably coercion to wide characters would be
easiest).
substr(ing) and strsplit will need to be
aware of character boundaries (a short sketch follows this
list). Note that Unicode has three cases, not two (the extra one
being `title').
- We would need to support the \uxxxx format for
arbitrary Unicode characters.
- The format for the distribution of R sources. Fortunately
only a few files are not in ASCII: some .Rd files, as
well as the THANKS file.
- Help files. Most modern Web browsers can display UTF-8, and Perl
5.8 is apparently aware of UTF-8 (and uses it internally), so it
may be fairly easy to make use of our existing
Rdconv. I have just added a
charset=iso-8859-1 declaration to the header of the converted HTML
help files, and this would need to be changed. Since LaTeX cannot
handle Unicode we would have to convert the encoding of LaTeX help
files or use Lambda (and tell it they were in UTF-8).
- Environment variables could have both names and values in UTF-8.
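As a sketch of what character-level (rather than byte-level)
behaviour looks like, using the R 2.1.0 facilities described in the
next section and assuming a UTF-8 locale:

    x <- "caf\u00e9"        # \uxxxx escape for e-acute
    substr(x, 4, 4)         # "é": the fourth character, not the fourth byte
    strsplit(x, "")[[1]]    # "c" "a" "f" "é": split at character boundaries
    toupper(x)              # "CAFÉ", upper-cased via wide characters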
The API for extending R would be problematical. There are a few
hundred R extensions written in C and FORTRAN, and a few of them
manipulate character vectors. They would not be expecting UTF-8
encoding (and probably have not thought about encodings at all).
Possible ways forward are
- To map to a single-byte encoding (Latin1?) and back again when .C
does the copying.
- Just to pass through the stream of bytes.
This does raise the issue of whether the CHAR internal
type should be used for UTF-8 or a new type created. It would
probably be better to create a new type for raw bytes.
Eiji Nakama's paper
on `Japanizing R' seems to take the earlier multi-byte character
approach rather than UTF-8 or UCS-2, except for Windows fonts.
Functions such as isalpha do not work correctly in MBCSs
(including UTF-8).
The Debian guide
to internationalization is a useful background resource. Note
that internationalization is often abbreviated as 'i18n', and
localization (support for a particular locale) as 'L10n'. The
main other internationalization/localization issue is to allow for the
translation of messages (and to translate them).
Encodings in R 2.1.0
Work started in December 2004 on implementing UTF-8 support for R
2.1.0, which is expected to be released in April 2005. Currently
implemented are:
- The parser has been made aware of multi-byte characters in UTF-8
and so works in character (rather than byte) units.
- An internationalized version of the regexp code. For the
basic and extended regexps we use the code from glibc-2.3.3
which internally uses widechars and so supports all
multi-byte character sets, e.g. UTF-8. For the Perl versions
we use PCRE, which has UTF-8 (but not general MBCS) support
available.
- Replacement versions of chartr,
toupper and tolower work via
conversion to widechar and so handle any MBCS that the OS supports
as the current locale.
- substr() and make.names() work with
characters, not bytes.
- nchar() has an additional argument to return the
number of bytes, the number of characters or the display
width. It was often used in conjunction with
substr() to truncate character strings: that
should be done in terms of display width for which there is a
new function strtrim().
- A new function iconv() allows character vectors to
be mapped between encodings (where it is available: GNU libiconv has
been grafted on for the Windows build).
- The 'encoding' argument of connections has been
changed from a numeric vector to a character string naming an
encoding that iconv knows about, and re-encoding on the
fly can now be done on both input and output. Note that this
does not apply to the 'terminal' connections or to text connections,
but it does apply to all file-like connections (a short sketch follows
this list). If input is redirected from
a file (or pipe), the input encoding can be specified by the
command-line flag --encoding.
- The postscript() and pdf() devices
handle UTF-8 strings by remapping to Latin1 (this is currently
hardcoded).
- A start has been made on converting the X11()
device and the X11-based data editor using Nakama's Japanization
patches, with X input methods added to the data editor, so it does
now work in a (Western) UTF-8 locale.
- scan() needs single-byte chars for its decimal,
comment and separator characters -- this is now enforced. It still
uses isspace and isdigit, so only ASCII
space and digit characters are recognized (but this seems to cause
few problems).
- abbreviate() is a problem: its algorithm is
hardcoded for English (e.g. which bytes are vowels) and it now warns
if given non-ASCII text.
- print()ing looks for valid characters and only
escapes non-printable characters (rather than bytes). It does so
by converting to widechars and using the wctype functions
in the current locale.
- UTF-8 strings are passed to and from the tcltk
package (this applies in any MBCS).
- There is some support for pch=n > 127 and pch="c" in UTF-8 locales,
where a number is taken to be the Unicode character number, and the
first MBCS character is taken.
- The replacement for strptime has been rewritten to work a character
at a time, using widechars internally.
- The Hershey fonts are encoded in Latin-1, so the vfont support has
been rewritten to re-encode to Latin-1.
- A new function localeToCharset attempts to deduce plausible
character sets from the locale name (on Unix and on Windows). This is
used by source to test out plausible encodings if the (new) argument
encoding = "unknown" is specified.
- .Rd has a new directive \encoding{} to set the encoding to be
assumed for the file and hence its HTML translation (and this is also
given as a comment in the example file). Note that one has to be
careful here, as some implementations of iconv do not allow any 8-bit
chars in the C locale, and the lack of standards for charset names is
also a problem.
- The Windows console and data editor have been modified to work with
MBCS character sets, as well as having support for double-width
characters.
- readChar and writeChar work in characters, not bytes.
- .C supports a new argument ENCODING= to specify the encoding
expected for character strings.
- delimMatch (tools) returns the position and match length in
characters, not bytes, and allows multi-byte delimiters.
For many of these features R needs to be configured with
--enable-utf8.
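A short sketch pulling together several of the facilities listed
above (the file name is illustrative; the exact results depend on the
locale and the iconv implementation):

    iconv("caf\xe9", from = "latin1", to = "UTF-8")      # re-encode a string
    con <- file("data-latin1.txt", encoding = "latin1")  # re-encoded on input
    x <- readLines(con); close(con)
    strtrim("a longish label", 6)                        # truncate by display width
    localeToCharset()                                    # plausible charsets for this locale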
Implementation details
The C code often handles character strings as a whole. We have
identified the following places where character-level access is used:
- In the parser to identify tokens. (gram.y)
- do_nchar, do_substr, do_substrgets, do_strsplit,
do_abbrev, do_makenames, do_grep, do_gsub, do_regexpr, do_tolower,
do_chartr, do_agrep, do_strtrim (character.c),
do_pgrep, do_pgsub, do_pregexpr
(pcre.c)
- GEText, GEMetricInfo. (engine.c)
- RenderStr. (plotmath.c)
- RStrlen, EncodeString
(printutils.c)
- The dataentry editor (the various dataentry.c files)
- Graphics devices in handling encoded text, and in metric
info. (Currently devX11.c, rotated.c and
devPS.c have been changed, and devPicTeX.c
is tied to TeX which is a byte-based program.)
- The ASCII versions of load and save. As
these are a reversible representation of objects in ASCII, it does
not matter if they are handled as byte streams.
- New wrapper functions Rf_strchr,
Rf_strrchr and R_fixslash cover
comparisons with single ASCII characters.
backquotify (deparse.c),
do_dircreate (platform.c), do_getwd,
do_basename, do_dirname and
isBlankString (util.c)
are now MBCS-aware.
There are many other places which do a comparison with a single ASCII
character (such as . or / or \ or LF) and so cause no problem in UTF-8
but might in other MBCSs. These include filbuf
(platform.c, which looks for CR and LF and these seem
safe), fillBuffer (scan.c) and there are
others.
Encodings which are likely to cause problems include
- Vietnamese (VISCII). This uses 186 characters including the
control characters 0x02, 0x05, 0x06, 0x14, 0x19, 0x1e:
the Windows GUI makes use of these as control characters.
- Big5, GBK, Shift-JIS. These are all 1- or 2-byte encodings
including ASCII as 1-byte chars (except that Shift-JIS replaces backslash
by ¥), but whose second byte can overlap the ASCII range.
fillBuffer (scan.c) has now been
rewritten to be aware of double-byte character sets and to only test
the lead byte.
Windows
Windows does things somewhat differently. `Standard' versions of
Windows have only single-byte locales, with the interpretation of
those bytes being determined by code pages. However, `East
Asian' versions (an optional install at least on Windows XP) use
double-byte locales in which characters can be represented by one or
two bytes (and can be one or two columns wide).
Windows also has `Unicode' (UCS-2) applications in which all
information is transferred as 16-bit wide characters, and the locale
does not affect the interpretation. Windows 2000 and later have
optional support for surrogate pairs (UTF-16) but this is not normally
enabled. (See here for how to
enable it.)
Currently R-devel has three levels of MBCS support under Windows.
- By default, all character strings are interpreted as single bytes.
- If SUPPORT_MBCS is defined in MkRules
and in config.h, R.dll will recognize
multi-byte characters if run in an MBCS locale and generally (but not
always, notably in scan) treat them as whole units.
- If in addition SUPPORT_GUI_MBCS is defined in
MkRules, RGui is compiled to be aware of
multi-byte characters if run in an MBCS locale, and cursor movements
will work in whole characters, with the cursor width adapting to the
current character's width.
- If SUPPORT_UTF8 is defined in addition to
SUPPORT_MBCS, most of R.dll will assume it
is running in a UTF-8 locale. As there are no such locales under
Windows, this is only useful with a custom front-end that
communicates in UTF-8 (and even then there are issues with file
names and content, and environment variables).
Localization of messages
As from 2005-01-25, R uses GNU gettext where available.
So far only the start-up message is marked for translation, as a
proof-of-concept: there are several thousand C-level messages that
could potentially be translated.
The same mechanism could be applied to R packages, provided they call
dgettext with a PACKAGE domain specific to the
package, and install their own PACKAGE.mo files, say via
an inst/po directory. The splines package
has been converted to show how this might be done: it only has one
error message.
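As a sketch of how the R level of a package might eventually use the
same mechanism, assuming R-level wrappers around gettext (the
"R-splines" domain name and the use of bindtextdomain()/gettextf()
here are illustrative assumptions, not part of what is described
above):

    ## hypothetical: translate a message in a package's own domain
    bindtextdomain("R-splines", system.file("po", package = "splines"))
    stop(gettextf("invalid value for '%s'", "knots", domain = "R-splines"))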
Brian Ripley
2004-01-11, 2005-01-25