---
title: "Testing R on Emulated Platforms"
author: "Tomas Kalibera"
date: 2020-05-29
categories: ["Testing"]
tags: ["QEMU", "Aarch64", "Power"]
---

Sometimes it is useful to test R on unusual platforms, even when the expected number of users is not large at the moment. It is better to be ready when a new platform arrives or becomes more widely used, it may be easier to find some bugs on one platform than on another, and such testing may reveal code that is unintentionally platform-specific.

Recently I wanted to test R on 64-bit ARM (Aarch64) and on Power. This turned out to be easy with QEMU, an instruction-level emulator, running on a decent multi-processor 64-bit Intel host machine. This text gives the exact steps to get these emulated systems running and describes two concrete problems I debugged on Power: one with the long double type preventing R from starting, and another with the foreign package not reading Stata files correctly. The latter comes with some interesting technical details and a rather surprising conclusion.

## Aarch64

Testing on this platform was inspired by predictions that 64-bit ARM CPUs could soon be used in Apple laptops. One might order, say, a recent version of Raspberry Pi for the testing, but installing QEMU is faster, and the experience also applies to testing other platforms. I based my setup on an Ubuntu wiki page, running Ubuntu 19.10 as the host and Ubuntu 18.04 as the guest.

Install QEMU:

apt-get install qemu-system-arm qemu-efi

Prepare flash memory for boot:

dd if=/dev/zero of=flash0.img bs=1M count=64
dd if=/usr/share/qemu-efi/QEMU_EFI.fd of=flash0.img conv=notrunc
dd if=/dev/zero of=flash1.img bs=1M count=64

Get Ubuntu 18.04 cloud image:

wget http://cloud-images.ubuntu.com/releases/bionic/release/ubuntu-18.04-server-cloudimg-arm64.img

Set root password in that image:

virt-customize -a ubuntu-18.04-server-cloudimg-arm64.img --root-password password:my_root_password

Resize that image so that it can hold R and dependencies (add 10G, perhaps a lot less would be ok):

qemu-img resize ubuntu-18.04-server-cloudimg-arm64.img +10G

Run the guest system:

qemu-system-aarch64 -smp 8 -m 8192 -cpu cortex-a57 -M virt -nographic -pflash flash0.img \
  -pflash flash1.img -drive if=none,file=ubuntu-18.04-server-cloudimg-arm64.img,id=hd0 \
  -device virtio-blk-device,drive=hd0 \
  -device e1000,netdev=net0 -netdev user,id=net0,hostfwd=tcp::5555-:22

This runs a system with 8 cores, 8G of memory, where port 5555 of the host forwards to port 22 of the guest, so one can log in via

ssh -p 5555 username@localhost

once the system is set up for that. One can also right away log in as root to the console with the password set previously.

Files can be transferred using ssh/scp, but if needed, the image can also be mounted to the host system:

modprobe nbd max_part=8
qemu-nbd --connect=/dev/nbd0 ubuntu-18.04-server-cloudimg-arm64.img
mount /dev/nbd0p1 mnt/

and unmounted again:

umount mnt
nbd-client -d /dev/nbd0

After logging into the guest system, upgrade it:

apt-get update
apt-get upgrade

Install a text editor, enable “src” repositories in /etc/apt/sources.list (uncomment lines starting with deb-src). Then install R build dependencies and some tools:

apt-get update
apt-get build-dep r-base
apt-get install subversion rsync screen libpcre2-dev

Create a user account (adduser username) and enable ssh password authentication in /etc/ssh/sshd_config. Log in to that account by ssh from the host system via port 5555 of the host; that way the terminal works better than via the emulated console.

Now check out the R source code from SVN, get the recommended packages via rsync, build from source (via configure, make -j, preferably out-of-tree) and run the checks (check-all) as usual, as documented in R Installation and Administration.

The build is not fast on the emulated system, but running in parallel helps a lot. The tests are best run overnight as they may take several hours, particularly “check-all” for all recommended packages.

One caveat is that some tests still have hard-coded time limits, which are often not met on an emulated system. In case of related failures, one has to disable these checks manually and re-start. These checks should be removed soon and are discouraged in Writing R Extensions, because they cannot really reveal performance regressions. To detect those, one would have to run the tests repeatedly, perhaps with increased runtimes, on an isolated system and in both versions between which regressions are to be detected.

One can use the emulated system as usual, narrowing down failing tests to reproducible examples, trying them out interactively, debugging via gdb, etc. The only limitation is that the system is slower, but on a decent host server it was fast enough for interactive work.

Apart from the timing constraints, an incorrect assumption about the /proc filesystem content was found in one package, and several tests had overly tight numerical checks. However, I did not find any significant problem on the platform (this was done before the 4.0.0 release).

## Power 9

Soon after the 4.0.0 release, we got a report that R did not even build on Power, so I repeated this experiment for Power as well. These steps worked for me:

apt-get install qemu-system-ppc64
wget https://cloud-images.ubuntu.com/releases/bionic/release/ubuntu-18.04-server-cloudimg-ppc64el.img
sudo virt-customize -a ubuntu-18.04-server-cloudimg-ppc64el.img --root-password password:my_root_password
qemu-img resize ubuntu-18.04-server-cloudimg-ppc64el.img +10G

To boot the system:

qemu-system-ppc64le -smp 8 -m 8192 -cpu power9 -nographic \
  -hda ubuntu-18.04-server-cloudimg-ppc64el.img -L pc-bios -boot c \
  -device e1000,netdev=net0 -netdev user,id=net0,hostfwd=tcp::5555-:22

The remaining steps are the same as for Aarch64: upgrade the system, enable source repositories, install R build dependencies and some tools, create a user account, check out R, configure, build, run tests.

It turned out that R falls into an infinite loop on startup. As the R binary is itself used when building R, the build process did not finish, either. Using gdb in the guest system, one could see it was looping inside machar_LD, a long-double version of the detection of machine-specific floating point characteristics. These routines are from Netlib BLAS code (machar.c), based on M. Malcolm: Algorithms to reveal properties of floating-point arithmetic and W. J. Cody: ALGORITHM 665 MACHAR: A Subroutine to Dynamically Determine Machine Parameters.

These routines don’t work with the unusual double-double implementation of the long double type on Power, which has known issues. Muller et al., Handbook of Floating-Point Arithmetic: “there are still open specification and/or implementation issues concerning the largest exponent to regard this format as a valid long double type” and “some properties requiring strict floating-point arithmetic (such as Sterbenz’s lemma) will not always be true for the long double type, and corresponding floating-point algorithms may no longer work.” Sterbenz’s lemma requires that if x and y are floating-point numbers such that x/2 <= y <= 2x, then x-y will be computed exactly.

A quick fix for R is to avoid using the long double type on Power, which can be done via configure when building R. It is still work in progress to provide some support for long double on Power, with primary contributions from Bryan Lewis and Martin Maechler. The problem was bisected to a specific R SVN release by Dirk Eddelbuettel.

QEMU has been useful in identifying and debugging this issue. However, once I had it running, I decided to run check-all on the platform also with a quick fix for the machine detection: what if there were other issues caused by e.g. the unusual long double type implementation?

Apart from the expected problems with hard-coded timing checks, overly tight numerical checks and one more case of parsing the /proc filesystem, I found that the foreign package was failing its tests, unable to read some Stata files properly.

On Power, I got

> foreign::read.dta("./tests/foreign.Rcheck/tests/compressed.dta")[1:10,"alkphos"]
 [1] 1.718000e+03 7.394800e+03 7.394800e+03 1.836710e-40 6.121800e+03
 [6] 6.121800e+03 6.710000e+02 6.710000e+02 1.175494e-38 1.175494e-38

while on x86_64

> foreign::read.dta("./tests/foreign.Rcheck/tests/compressed.dta")[1:10,"alkphos"]
 [1] 1718.0 7394.8 7394.8  516.0 6121.8 6121.8  671.0  671.0 944.0  944.0

So, some of the floating point numbers were read properly, but not all; e.g. 516 became 1.836710e-40. These were single precision (32-bit) floating point numbers, which are documented to be proper IEEE 754 even on Power, so the problem is unrelated to the long double type.

These numbers from the example are stored as binary, IEEE 754, big-endian in the file. Power is a bi-endian architecture and the emulator invocation shown above runs it as little endian. The routine to read the values is simple and seems correct:

static double InFloatBinary(FILE * fp, int naok, int swapends)
{
    float i;
    if (fread(&i, sizeof(float), 1, fp) != 1)
        error(_("a binary read error occurred"));
    if (swapends)
        reverse_float(i);
    return (((i == STATA_FLOAT_NA) & !naok) ? NA_REAL :  (double) i);
}

Small modifications of this routine, even just adding print statements, sometimes masked the bug: all values were then read properly. Printing the whole value did not mask the bug, but printing individual bytes did. Marking i as volatile masked the bug as well. When not masked, the bug affected only some numbers (e.g. 516) but not others (e.g. 1718).

Reducing the code to this still preserved the bug:

static double InFloatBinaryBugPresent(FILE * fp, int naok, int swapends)
{
    float i;

    if (fread(&i, sizeof(float), 1, fp) != 1)
        return (double)0;

    if (swapends)
        reverse_float(i);

    return (double)i;
}

but reducing more masked it:

static double InFloatBinaryBugMasked(FILE * fp, int naok, int swapends)
{
    float i;

    fread(&i, sizeof(float), 1, fp);        

    if (swapends)
        reverse_float(i);

    return (double)i;
}

When not masked, the bug already affects the result of InFloatBinary. Small modifications mask it, but only for some numbers, so it does not look like memory corruption or a stack alignment issue; besides, stack alignment would be unlikely to matter for 32-bit values. Nor does it look like a problem of not reversing the byte order, or of reversing it incorrectly, as some values with all four bytes different were read correctly.

At the same time, the above showed that even the presence of dead code (the branch returning 0 that is never taken) affected whether the bug appeared. Removing, for instance, the branch if (swapends) masks the bug. Perhaps the compiler is miscompiling the code, or, more likely, the code depends on unspecified behavior. The next step hence is to look at the assembly.

An annotated version of InFloatBinaryBugMasked that does not have the bug:

<+0>:     addis   r2,r12,2
<+4>:     addi    r2,r2,5504
<+8>:     mflr    r0  <== save link register to r0
<+12>:    std     r31,-8(r1) <== save r31 to stack
<+16>:    mr      r6,r3  <====== move r3 to r6  [FILE *fp]
<+20>:    mr      r31,r4 <====== move r4 to r31 [swapends; naok was optimized out]
<+24>:    li      r5,1 <======== move 1 to r5
<+28>:    li      r4,4 <======== move 4 to r4 (size of float)
<+32>:    std     r0,16(r1) <=== save r0 to stack [link register save area]
<+36>:    stdu    r1,-64(r1)
<+40>:    addi    r3,r1,36
<+44>:    ld      r9,-28688(r13) <=== [stack checking]
<+48>:    std     r9,40(r1) <=== save r9 to stack [TOC save area]
<+52>:    li      r9,0 <======== move 0 to r9
<+56>:    bl      0x7ffff4bc4720 <0000011a.plt_call.fread@@GLIBC_2.17>
<+60>:    ld      r2,24(r1) <=== [compiler area]
<+64>:    cmpdi   cr7,r31,0 <=== check swapends
<+68>:    beq     cr7,0x7ffff4bd6910 <InFloatBinary+144> <==== jump if not swapping
<+72>:    addi    r9,r1,36
<+76>:    lwbrx   r9,0,r9  <==== load from address r9, reverse bytes, store to r9
<+80>:    rldicr  r9,r9,32,31 <= extract upper 32 bits of r9 to r9 (shift right)
<+84>:    mtvsrd  vs1,r9 <====== move r9 to vector register
<+88>:    xscvspdpn vs1,vs1 <=== convert vs1 to double precision
<+92>:    ld      r9,40(r1)  <== [stack checking] restore r9
<+96>:    ld      r10,-28688(r13)
<+100>:   xor.    r9,r9,r10
<+104>:   li      r10,0 <======= move 0 to r10
<+108>:   bne     0x7ffff4bd6918 <InFloatBinary+152>
<+112>:   addi    r1,r1,64   <== r1 += 64
<+116>:   ld      r0,16(r1)  <== restore r0
<+120>:   ld      r31,-8(r1) <== restore r31 
<+124>:   mtlr    r0 <========== restore link register from r0
<+128>:   blr     <============= return
<+132>:   nop
<+136>:   nop
<+140>:   ori     r2,r2,0
<+144>:   lfs     f1,36(r1)
<+148>:   b       0x7ffff4bd68dc <InFloatBinary+92>
<+152>:   bl      0x7ffff4bc4260 <0000011a.plt_call.__stack_chk_fail@@GLIBC_2.17>
<+156>:   ld      r2,24(r1)
<+160>:   .long 0x0
<+164>:   .long 0x1000000
<+168>:   .long 0x180

The code calls fread, reverses the byte order in the result, promotes the result to double precision and returns it. In addition, there is some stack corruption checking and saving/restoring registers in the function prologue/epilogue.

Key architecture details for this example:

- register r1 points to the top of the stack
- r3 and r4 contain the first two fixed-point parameters to the function
- the link register holds the address to return to, and blr is a jump to that address (so it returns from the function)
- cr7 is a condition register used here explicitly to hold the result of a comparison
- cr0 is a condition register used implicitly by compare instructions of fixed-point operands ending with a dot (e.g. xor.)
- floating point registers f0/f1 correspond to vector registers vs0/vs1

The disassembly of InFloatBinaryBugPresent has a similar structure, but the byte swapping code is different:

<+128>:   cmpdi   cr7,r31,0  <===== check swapends
<+132>:   lfs     f1,36(r1)  <===== load float extended to double into f1 (vs1)
<+136>:   beq     cr7,0x7ffff4bd68cc <InFloatBinary+76> <== jump if not swapping
<+140>:   xscvdpspn vs0,vs1 <====== convert double precision vs1 to single precision vs0
<+144>:   mfvsrd  r9,vs0 <========= move vs0 to r9
<+148>:   rldicl  r9,r9,32,32 <==== shift right by 32 bits (extract upper 32 bits)
<+152>:   rotlwi  r10,r9,24 <====== rlwinm r10,r9,24,0,31; 32-bit rotate right 1-byte: 4321 => 1432 
                                    store "1432" to r10
<+156>:   rlwimi  r10,r9,8,8,15 <== rotate left 1 byte (4321 => 3214), store 2nd byte to r10
                                    store "1232" to r10
<+160>:   rlwimi  r10,r9,8,24,31 <= rotate left 1 byte (4321 => 3214), store 4th byte to r10
                                    store "1234" to r10
<+164>:   rldicr  r9,r10,32,31 <=== extract r10 as high word of r9 (shift left 4 bytes)
<+168>:   mtvsrd  vs1,r9 <========= move r9 to vector register
<+172>:   xscvspdpn vs1,vs1 <====== convert vs1 to double precision

In both cases, the generated code seems ok at first glance. To find the cause, the next step is instruction-level debugging.

Useful gdb commands were set disassemble-next-line on to see the disassembly when stepping, stepi to execute a single instruction, nexti to execute to the next instruction, and i r f1 vs1 to print the values of the floating point register f1 and the corresponding vector unit floating point register vs1. To jump to the invocations of InFloatBinaryBugPresent, ignore 1 23 was useful (ignore the next 23 crossings of breakpoint 1, to debug the reading of value 516 in the example). To dump memory at a specific address, e.g. x/16bx 0x7fffffffb2e0.

The key observation is then that the byte-reversed value of 516 is read correctly by fread and correctly stored on the stack at 36(r1). However, the lfs instruction at <+132> destroys the value when converting it from float to double.

When byte-reversed, the value of 516 is a denormalized floating point number. Still, even such numbers should be converted to double without losing precision or information. To demonstrate that this was the problem, it was then easy to extract an example that only converts floats to doubles: byte-reversed 516 (a denormalized number) was converted incorrectly, and the operation systematically gave surprising results.

The seemingly inconsequential code changes masked the bug because the compiler sometimes included an unnecessary promotion of the float value to double and back before the bytes were reversed.

How could Power fail to convert a denormalized float number to its well-defined double version? This made me ask for an account at Minicloud and test there. On their Power machine the conversion worked correctly, yet that was a Power8 (not Power9), and it turned out to also be emulated by QEMU, not bare hardware.

Then I tried with the latest version of QEMU, and the conversion worked as well. After looking into the changelog of QEMU, I found

commit c0e6616b6685ffdb4c5e091bc152e46e14703dd1
Author: Paul A. Clarke <pc@us.ibm.com>
Date:   Mon Aug 19 16:42:16 2019 -0500

    ppc: Fix emulated single to double denormalized conversions
    
    helper_todouble() was not properly converting any denormalized 32 bit
    float to 64 bit double.

QEMU was easy to build on my Ubuntu machine, and I tested this commit and the previous one: no need to re-install the VM, just rebuild QEMU and restart. And yes, the problem was caused by a bug in QEMU that was fixed by this commit, which demonstrates a possible downside of testing on emulated platforms.

The original C code in the foreign package works fine on Power in the latest QEMU, but it is still storing an arbitrary bit pattern (a byte-reversed floating point value) in a variable typed as float. By the C standard, converting from float to double should preserve the value, and converting back to float should restore it.

Yet, it is not completely clear that exactly the same bit pattern of the original float value will always be preserved. Denormalized numbers could be disabled by compiler of CPU options (“flush to zero”, “denormals are zero”). It is probably safer to avoid this unnecessary risk and rewrite the code to only store valid floating point values (after reversing the bytes if needed) in a float variable. Still, finding out about this problem that is probably unlikely to trigger has been rather expensive. A good practice in similar cases might be to check with different versions of QEMU earlier.