--- title: "Testing R on Emulated Platforms" author: "Tomas Kalibera" date: 2020-05-29 categories: ["Testing"] tags: ["QEMU", "Aarch64", "Power"] --- ```{r setup, include=FALSE} knitr::opts_chunk$set(collapse = TRUE) ``` Sometimes it is useful to test R on unusual platforms, even when the expected number of users is not large at the moment. It is better to be ready when a new platform arrives or becomes more widely used, it may be easier to find some bugs on one platform than other, and such testing may reveal code unintentionally too platform specific. Recently I wanted to test R on 64-bit ARM (Aarch64) and on Power. It turned out easy in QEMU, an instruction-level emulator, running on a decent multi-processor 64-bit Intel host machine. This text gives the exact steps to get these emulated systems and it describes two concrete problems I was debugging on Power: one with long double types preventing R to start and another with `foreign` package not reading STATA files correctly. The latter comes with some interesting technical details and a rather surprising conclusion. ## Aarch64 Testing on this platform was inspired by some predictions that 64-bit ARM CPUs could be soon used in Apple laptops. One might order say a recent version of Raspberry Pi for the testing, but installing QEMU is still faster and the experience could be applicable also to testing of other platforms. I've based my setup on an Ubuntu wiki [page](https://wiki.ubuntu.com/ARM64/QEMU), running on Ubuntu 19.10 as host and Ubuntu 18.04 as guest. Install QEMU: ``` apt-get install qemu-system-arm qemu-efi ``` Prepare flash memory for boot: ``` dd if=/dev/zero of=flash0.img bs=1M count=64 dd if=/usr/share/qemu-efi/QEMU_EFI.fd of=flash0.img conv=notrunc dd if=/dev/zero of=flash1.img bs=1M count=64 ``` Get Ubuntu 18.04 cloud image: ``` wget http://cloud-images.ubuntu.com/releases/bionic/release/ubuntu-18.04-server-cloudimg-arm64.img ``` Set root password in that image: ``` virt-customize -a ubuntu-18.04-server-cloudimg-arm64.img --root-password password:my_root_password ``` Resize that image so that it can hold R and dependencies (add 10G, perhaps a lot less would be ok): ``` qemu-img resize ubuntu-18.04-server-cloudimg-arm64.img +10G ``` Run the guest system: ``` qemu-system-aarch64 -smp 8 -m 8192 -cpu cortex-a57 -M virt -nographic -pflash flash0.img \ -pflash flash1.img -drive if=none,file=ubuntu-18.04-server-cloudimg-arm64.img,id=hd0 \ -device virtio-blk-device,drive=hd0 \ -device e1000,netdev=net0 -netdev user,id=net0,hostfwd=tcp::5555-:22 ``` This runs a system with 8 cores, 8G of memory, where port 5555 of the host forwards to port 22 of the guest, so one can log in via ``` ssh -p 5555 username@localhost ``` once the system is set up for that. One can also right away log in as root to the console with the password set previously. Files can be transferred using ssh/scp, but if needed, the image can also be mounted to the host system: ``` modprobe nbd max_part=8 qemu-nbd --connect=/dev/nbd0 ubuntu-18.04-server-cloudimg-arm64.img mount /dev/nbd0p1 mnt/ ``` and unmounted ``` umount mnt nbd-client -d /dev/nbd0 ``` After logging into the guest system, upgrade it: ``` apt-get update apt-get upgrade ``` Install a text editor, enable "src" repositories in `/etc/apt/sources.list` (uncomment lines starting with deb-src). Then install R build dependencies and some tools: ``` apt-get update apt-get build-dep r-base apt-get install subversion rsync screen libpcre2-dev ``` Create a user account (`adduser username`) and enable ssh password authentication in `/etc/ssh/sshd_config`. Log in to that account by ssh from the host system via port 5555 of host, that way the terminal works better than via the emulated console. Now check out R source code from SVN, get the recommended packages via rsync, build from source (via configure, `make -j`, preferably out-of-tree) and run checks (check-all) as usual and documented in [R Installation and Administration](https://cran.r-project.org/doc/manuals/r-devel/R-admin.html). The build is not fast on the emulated system, but running in parallel helps a lot. The tests are best run overnight as they may take several hours, particularly "check-all" for all recommended packages. One caveat is that some tests still have hard-coded time limits, which are often not met on an emulated system. In case of related failures, one has to disable these checks manually and re-start. These checks should be removed soon and are discouraged in [Writing R Extensions](https://cran.r-project.org/doc/manuals/r-devel/R-exts.html), because they cannot really reveal performance regressions. For that one would have to run the tests repeatedly, perhaps increasing their runtimes, on an isolated system and in both versions where regressions are to be tested. One can use the emulated system as usual, narrowing down failing tests to reproducible examples, trying them out interactively, debugging via `gdb`, etc. The only limitation is that the system is slower, but on a decent host server it was fast enough for interactive work. Apart from the timing constraints, an incorrect assumption about `/proc` filesystem content was found in one package. Several tests had too tight numerical checks. However, I did not find any significant problem on the platform (this was done before the 4.0.0 release). ## Power 9 Soon after the 4.0.0 release, we got a report that R did not even build on Power, so I've repeated this experiment for Power as well. These steps worked for me: ``` apt-get install qemu-system-ppc64 wget https://cloud-images.ubuntu.com/releases/bionic/release/ubuntu-18.04-server-cloudimg-ppc64el.img sudo virt-customize -a ubuntu-18.04-server-cloudimg-ppc64el.img --root-password password:my_root_password qemu-img resize ubuntu-18.04-server-cloudimg-ppc64el.img +10G ``` To boot the system: ``` qemu-system-ppc64le -smp 8 -m 8192 -cpu power9 -nographic \ -hda ubuntu-18.04-server-cloudimg-ppc64el.img -L pc-bios -boot c \ -device e1000,netdev=net0 -netdev user,id=net0,hostfwd=tcp::5555-:22 ``` The remaining steps are the same as for Aarch64: upgrade the system, enable source repositories, install R build dependencies, install some tools, create a user account, checkout R, configure, build, run tests. It turned out that R falls into an infinite loop on startup. As R binary is already used itself when building R, the build process did not finish, either. Using gdb running in the guest system, one could see it was looping inside `machar_LD`, a long-double version of the detection of machine-specific floating point characteristics. These routines are from Netlib BLAS code ([machar.c](http://netlib.org/blas/machar.c)), based on [M. Malcolm: Algorithms to reveal properties of floating-point arithmetic](https://dl.acm.org/doi/10.1145/355606.361870) and [W. J. Cody:ALGORITHM 665 MACHAR: A Subroutine to Dynamically Determine Machine Parameters](https://dl.acm.org/doi/10.1145/50063.51907). These routines don't work on the unusual double-double implementation of long double type on Power, which has known issues. Muller et al: Handbook of Floating Point Arithmetics: "there are still open specification and/or implementation issues concerning the largest exponent to regard this format as a valid long double type" and "some properties requiring strict floating-point arithmetic (such as Sterbenz's lemma) will not always be true for the long double type, and corresponding floating-point algorithms may no longer work." Sterbenz's lemma requires that "if x and y are floating-point number such that x/2 <= y <= 2x, then x-y will be computed exactly". A quick fix for R is to avoid using the long double type on Power, which can be done via `configure` when building R. It is still work in progress to provide some support for long double on Power, with primary contributions from Bryan Lewis and Martin Maechler. The problem was bisected to a specific R SVN release by Dirk Eddelbuettel. QEMU has been useful in identifying and debugging this issue. However, once I had it running, I decided to run check-all on the platform also with a quick fix for the machine detection: what if there were other issues caused by e.g. the unusual long double type implementation? Apart from expected problems with hard-coded timing checks, too tight numerical checks and one more case of parsing `/proc` filesystem, I found that `foreign` package was failing its tests, unable to read some STATA files properly. On Power, I got ``` > foreign::read.dta("./tests/foreign.Rcheck/tests/compressed.dta")[1:10,"alkphos"] [1] 1.718000e+03 7.394800e+03 7.394800e+03 1.836710e-40 6.121800e+03 [6] 6.121800e+03 6.710000e+02 6.710000e+02 1.175494e-38 1.175494e-38 ``` while on x86_64 ``` > foreign::read.dta("./tests/foreign.Rcheck/tests/compressed.dta")[1:10,"alkphos"] [1] 1718.0 7394.8 7394.8 516.0 6121.8 6121.8 671.0 671.0 944.0 944.0 ``` So, some of the floating point numbers were read properly, but not all, e.g. `516` became `1.836710e-40`. These were single precision floating point numbers (32-bit), which are documented to be proper IEEE 754 even on Power, so an unrelated problem to the long double type. These numbers from the example are stored as binary, IEEE 754, big-endian in the file. Power is a bi-endian architecture and the emulator invocation shown above runs it as little endian. The routine to read the values is simple and seems correct: ``` static double InFloatBinary(FILE * fp, int naok, int swapends) { float i; if (fread(&i, sizeof(float), 1, fp) != 1) error(_("a binary read error occurred")); if (swapends) reverse_float(i); return (((i == STATA_FLOAT_NA) & !naok) ? NA_REAL : (double) i); } ``` Small modifications of this routine, even just adding print statements, sometimes masked the bug: all values were then read properly. Printing the whole value did not mask the bug, but printing individual bytes masked it. Marking `i` as volatile masked the bug as well. When not masked, the bug affected only some numbers (e.g. 516) but not other (e.g. 1718). Reducing the code to this still preserved the bug: ``` static double InFloatBinaryBugPresent(FILE * fp, int naok, int swapends) { float i; if (fread(&i, sizeof(float), 1, fp) != 1) return (double)0; if (swapends) reverse_float(i); return (double)i; } ``` but reducing more masked it: ``` static double InFloatBinaryBugMasked(FILE * fp, int naok, int swapends) { float i; fread(&i, sizeof(float), 1, fp); if (swapends) reverse_float(i); return (double)i; } ``` When not masked, the bug already affects the result of InFloatBinary. Small modifications mask it, but only for some numbers, so it does not seem like a memory corruption nor like stack alignment issue. Also, stack alignment would unlikely be a problem with 32-bit values. It does not seem like a problem in non-reversing the byte order nor in reversing it incorrectly, as some values with all four bytes different were read correctly. At the same time, the above showed that even the presence of dead code (the branch returning 0 that is never taken) impacted presence of the bug. Removing for instance the branch `if (swapends)` masks the bug. Perhaps the compiler could possibly be miscompiling the code, or more likely the code could be depending on unspecified behavior. The next step hence is to look at the assembly. An annotated version of `InFloatBinaryBugMasked` that does not have the bug: ``` <+0>: addis r2,r12,2 <+4>: addi r2,r2,5504 <+8>: mflr r0 <== save link register to r0 <+12>: std r31,-8(r1) <== save r31 to stack <+16>: mr r6,r3 <====== move r3 to r6 [FILE *fp] <+20>: mr r31,r4 <====== move r4 to r31 [swapends; naok was optimized out] <+24>: li r5,1 <======== move 1 to r5 <+28>: li r4,4 <======== move 4 to r4 (size of float) <+32>: std r0,16(r1) <=== save r0 to stack [link register save area] <+36>: stdu r1,-64(r1) <+40>: addi r3,r1,36 <+44>: ld r9,-28688(r13) <=== [stack checking] <+48>: std r9,40(r1) <=== save r9 to stack [TOC save area] <+52>: li r9,0 <======== move 0 to r9 <+56>: bl 0x7ffff4bc4720 <0000011a.plt_call.fread@@GLIBC_2.17> <+60>: ld r2,24(r1) <=== [compiler area] <+64>: cmpdi cr7,r31,0 <=== check swapends <+68>: beq cr7,0x7ffff4bd6910 <==== jump if not swapping <+72>: addi r9,r1,36 <+76>: lwbrx r9,0,r9 <==== load from address r9, reverse bytes, store to r9 <+80>: rldicr r9,r9,32,31 <= extract upper 32 bits of r9 to r9 (shift right) <+84>: mtvsrd vs1,r9 <====== move r9 to vector register <+88>: xscvspdpn vs1,vs1 <=== convert vs1 to double precision <+92>: ld r9,40(r1) <== [stack checking] restore r9 <+96>: ld r10,-28688(r13) <+100>: xor. r9,r9,r10 <+104>: li r10,0 <======= move 0 to r10 <+108>: bne 0x7ffff4bd6918 <+112>: addi r1,r1,64 <== r1 += 64 <+116>: ld r0,16(r1) <== restore r0 <+120>: ld r31,-8(r1) <== restore r31 <+124>: mtlr r0 <========== restore link register from r0 <+128>: blr <============= return <+132>: nop <+136>: nop <+140>: ori r2,r2,0 <+144>: lfs f1,36(r1) <+148>: b 0x7ffff4bd68dc <+152>: bl 0x7ffff4bc4260 <0000011a.plt_call.__stack_chk_fail@@GLIBC_2.17> <+156>: ld r2,24(r1) <+160>: .long 0x0 <+164>: .long 0x1000000 <+168>: .long 0x180 ``` The code calls `fread`, reverses the byte order in the result, promotes the result to double precision and returns it. In addition, there is some stack corruption checking and saving/restoring registers in the function prologue/epilogue. Key architecture details for this example: register `r1` points to top of stack, `r3` and `r4` contain the first two fixed-point parameters to the function, the link register holds the address to return to and `blr` is a jump to that address (so it returns from the function), `cr7` is a condition register used here explicitly to hold the result of a comparison, `cr0` is a condition register used by compare instructions of fixed-point operands ending with a dot (e.g. `xor.`), floating point registers f0/f1 correspond to vector registers vs0/vs1. The disassembly of `InFloatBinaryBugPresent` has a similar structure, but the byte swapping code is different: ``` <+128>: cmpdi cr7,r31,0 <===== check swapends <+132>: lfs f1,36(r1) <===== load float extended to double into f1 (vs1) <+136>: beq cr7,0x7ffff4bd68cc <== jump if not swapping <+140>: xscvdpspn vs0,vs1 <====== convert double precision vs1 to single precision vs0 <+144>: mfvsrd r9,vs0 <========= move vs0 to r9 <+148>: rldicl r9,r9,32,32 <==== shift right by 32 bits (extract upper 32 bits) <+152>: rotlwi r10,r9,24 <====== rlwinm r10,r9,24,0,31; 32-bit rotate right 1-byte: 4321 => 1432 store "1432" to r10 <+156>: rlwimi r10,r9,8,8,15 <== rotate left 1 byte (4321 => 3214), store 2nd byte to r10 store "1232" to r10 <+160>: rlwimi r10,r9,8,24,31 <= rotate left 1 byte (4321 => 3214), store 4th byte to r10 store "1234"" to r10 <+164>: rldicr r9,r10,32,31 <=== extract r10 as high word of r9 (shift left 4 bytes) <+168>: mtvsrd vs1,r9 <========= move r9 to vector register <+172>: xscvspdpn vs1,vs1 <====== convert vs1 to double precision ``` In both cases, the generated code seems ok on the first look. To find the cause, the next step is instruction-level debugging. Useful gdb commands were `set disassemble-next-line on` to see the disassembly when stepping, `stepi` to step into a single instruction, `nexti` to execute to the next instruction, `i r f1 vs1` to print register values of the floating point register `f1` and the corresponding vector unit floating point register `vs1`. To jump to the invocations of `InFloatBinaryBugPresent`, `ignore 1 23` was useful (ignore next 23 crossings of breakpoint 1 to debug reading of value 516 in the example). To dump memory at specific address, e.g. `x/16bx 0x7fffffffb2e0`. The key observation is then that the value of byte-reversed 516 is read correctly by `fread` and is correctly stored on the stack at `36(r1)`. However, instruction `lfs` at `<+132>` destroys the value when converting it from float to double. When byte-reversed, the value of 516 is a denormalized floating point number. Still, even such numbers should be converted to double without losing precision/information. To demonstrate this was the problem, it was then easy to extract an example that only converts floats to doubles: 516 (a denormalized number) was converted incorrectly. This operation systematically gives surprising results. The masking of the bug depending seemingly on inconsequential changes of the code was caused by the compiler sometimes including unnecessary promotion of the float value to double and back yet before reversing the bytes. How come that Power cannot convert a denormalized float number to its well defined double version? This made me ask for an account at [Minicloud](https://github.com/Unicamp-OpenPower/minicloud/wiki) and test there. On their Power machine, the conversion was working correctly, yet it was Power8 (not Power9) and it turned out also emulated by QEMU, not bare hardware. Then I tried with the latest version of QEMU, and the conversion worked as well. After looking into the changelog of QEMU, I found ``` commit c0e6616b6685ffdb4c5e091bc152e46e14703dd1 Author: Paul A. Clarke Date: Mon Aug 19 16:42:16 2019 -0500 ppc: Fix emulated single to double denormalized conversions helper_todouble() was not properly converting any denormalized 32 bit float to 64 bit double. ``` QEMU was easy to build on my Ubuntu machine and I tested this commit and the previous one - no need to re-install the VM, just rebuild QEMU and restart. And yes, it was caused by a bug in QEMU that was fixed by this commit, which demonstrates a possible downside of testing on emulated platforms. The original C code in `foreign` package works fine on Power in the latest QEMU, but still, it is storing an arbitrary bit pattern (byte-reversed floating point value) in a variable typed as float. By the C standard, converting from float to double should preserve the value, as well as after converting back to float. Yet, it is not completely clear that exactly the same bit pattern of the original float value will always be preserved. Denormalized numbers could be disabled by compiler of CPU options ("flush to zero", "denormals are zero"). It is probably safer to avoid this unnecessary risk and rewrite the code to only store valid floating point values (after reversing the bytes if needed) in a float variable. Still, finding out about this problem that is probably unlikely to trigger has been rather expensive. A good practice in similar cases might be to check with different versions of QEMU earlier.