This post is a story of a bug in Rterm, the console R front-end on Windows, which has been bugging me for several years, but only two weeks ago it showed up unwarily so that I could trace it down and fix.
The terminal sometimes crashed during completion, so after I pressed the tab key, but it was very rare, there seemed to be no way to reproduce the problem, and it was only happening with the mintty terminal (from Msys2, running bash), never with cmd.exe.
I thought it was related to the rather complicated conversion between stream-based and function-call based console control, which such terminals do, and I assumed it was not a bug in R. But, I still felt rather uneasy whenever it showed up again. One cannot really know where the bug isn’t unless it is traced down.
Two weeks ago Rterm crashed for me during completion even in cmd.exe, the default Windows terminal, and it happened several times in a row. I would start Rterm, press the tab key, and R would crash with an error that “C stack usage is too close to the limit”, giving a very large stack usage. So, clearly not a problem of mintty. I could get the same crash also in Power Shell.
This error message usually means infinite recursion. In this example,
f <- function(x) { print(x); f(x+1) }
f(1)
function f
will always print the nesting level and invoke itself
recursively. Eventually, the execution will result in an error
[1] 733
[1] 734
[1] 735
Error: C stack usage 7970564 is too close to the limit
>
The actual numbers above are system specific, but the key is that infinite recursion becomes a regular R error and R doesn’t crash.
How could completion cause infinite recursion in R code, why would it crash R, and how come that it crashes only very rarely?
The completion options are computed in R code, function
.win32consoleCompletion
in utils
package, which is part of base R. So,
the computation does run in R, but why would it only run into infinite
recursion sometimes, in a clean fresh start of R?
I was lucky that the problem happened to be repeatable also in a build of R
without C compiler optimizations (-O0
) and with debug symbols included. I
thought it would hence be easy to run R in gdb and see the problem there.
But, for some reason, the problem wouldn’t happen when running in gdb. Neither when running R in gdb from the beginning, nor when an already running version of R was attached to gdb.
The only thing I could get was a stack trace from windbg, a windows debugger invoked “post-mortem”. Post-mortem means that the debugger is invoked automatically by Windows when a program, Rterm in this case, crashes:
# Child-SP RetAddr Call Site
00 000000e3`8bff82d0 00007ffb`04e6af0a ntdll!RtlUnwindEx+0x4e2
01 000000e3`8bff89f0 00007ffa`b87061ee ucrtbase!_longjmp_internal+0xea
02 000000e3`8bff8f30 00007ffa`b8742543 R!R_jumpctxt+0xe9
03 000000e3`8bff8f70 00007ffa`b8741d6b R!Rf_error+0x575
04 000000e3`8bff91e0 00007ffa`b8741ed5 R!R_curErrorBuf+0xb79
05 000000e3`8bffd4d0 00007ffa`b8746691 R!Rf_errorcall+0x131
06 000000e3`8bfff510 00007ffa`b873f673 R!R_signalErrorConditionEx+0xe7
07 000000e3`8bfff550 00007ffa`b8748667 R!R_SignalCStackOverflow+0xed
08 000000e3`8bfff590 00007ffa`b869868e R!Rf_eval+0x1e4
09 000000e3`8bfff850 00007ffa`b88eb654 R!get_R_HOME+0x74e
0a 000000e3`8bfffa50 00007ffa`b86a29e0 R!Rf_rsignrank+0x205b
0b 000000e3`8bfffaa0 00007ffa`b86a2888 R!R_WriteConsoleEx+0x308
0c 000000e3`8bfffae0 00007ffb`04e4514b R!R_WriteConsoleEx+0x1b0
0d 000000e3`8bfffb10 00007ffb`05c77034 ucrtbase!thread_start<void (__cdecl*)(void *),0>+0x7b
0e 000000e3`8bfffb40 00007ffb`075c26a1 KERNEL32!BaseThreadInitThunk+0x14
0f 000000e3`8bfffb70 00000000`00000000 ntdll!RtlUserThreadStart+0x21
And the application had some other threads, one of which was:
# Child-SP RetAddr Call Site
00 000000e3`fbdfe408 00007ffa`eddf11b0 win32u!NtUserWaitMessage+0x14
01 000000e3`fbdfe410 00007ffa`b86a25b1 Rgraphapp!GA_waitevent+0x1a
02 000000e3`fbdfe440 00007ffa`b86a2953 R!R_WaitEvent+0x16
03 000000e3`fbdfe470 00007ffa`b86a267d R!R_WriteConsoleEx+0x27b
04 000000e3`fbdfe4b0 00007ffa`b8789ea2 R!R_ReadConsole+0x3e
05 000000e3`fbdfe4e0 00007ffa`b878a26a R!Rf_ReplIteration+0x7c
06 000000e3`fbdfe550 00007ffa`b878b8cd R!Rf_ReplIteration+0x444
07 000000e3`fbdff5b0 00007ffa`b878b8eb R!run_Rmainloop+0x71
08 000000e3`fbdff5e0 00007ff7`10181b9b R!Rf_mainloop+0x12
09 000000e3`fbdff610 00007ff7`10181592 Rterm+0x1b9b
0a 000000e3`fbdff790 00007ff7`101813c1 Rterm+0x1592
0b 000000e3`fbdff7d0 00007ff7`101814f6 Rterm+0x13c1
0c 000000e3`fbdff8a0 00007ffb`05c77034 Rterm+0x14f6
0d 000000e3`fbdff8d0 00007ffb`075c26a1 KERNEL32!BaseThreadInitThunk+0x14
0e 000000e3`fbdff900 00000000`00000000 ntdll!RtlUserThreadStart+0x21
The second thread above seems to be the main R/Rterm thread, which is
waiting for some Windows event. The first thread above runs into R
interpreter and immediately gets a stack overflow error. But, how come?
Normally, with infinite recursion, one should see a deep stack with nested
calls, there is only one call to R_eval
in the first trace.
This rang a bell for me, as I worked before on the stack detection in R. The stack overflow checking is based on checking how far the current frame is from the known stack start, given the known stack growth direction. The stack start is detected at R startup and is stored in a global variable.
The first thread above uses the stack start detected in the second thread above (the R main thread). But, indeed, the stack of the first thread is somewhere else. Based on where the operating system allocates the new thread stack, R happens to detect a stack overflow or not, but it always uses wrong stack start for the detection. This explains why the error was so hard to repeat and why the source of non-determinism is external to R. Also it makes it less surprising that it doesn’t happen inside gdb.
R is not thread safe. If R or a package creates a new thread, that thread must never call into R. It became clear that the console implementation on Windows will have to be fixed to ensure that. There is a pattern already in base R: the embedded HTTP server used for serving the HTML help. The worker threads also need to run R code, and they ask the main R thread to do it for them. The same had to be done for the console.
Now, while I had a very plausible explanation for the crash, I really wanted a fully reproducible way to trigger it, also in gdb, trigger it always, so that I can test before and after a fix and be confident I got all the problems causing the (very sporadic) crashes. “Very sporadic” turned rather euphemistic, because after a reboot of the Windows machine, I couldn’t get the crash again at all, not even in mintty where it used to happen more often before.
I’ve tried to provoke it by running this code always in
.win32consoleCompletion
:
f <- function(x) { print(x); f(x+1) }
f(1)
That unsurprisingly always ran into stack overflow error detected, but, R still would not crash in gdb. Not only that gdb would have given me stack traces with debug symbols (unlike windbg above), it showed that the crash wouldn’t be reliably repeatable, and so lack of it after a fix would give limited confidence.
To fully isolate the problem of detecting stack overflow and handling an error in completion computation, I’ve instead added
stop("xxx")
to .win32consoleCompletion
. It crashed R. An R error in computation of
completion should not cause a crash. The computation needs to be guarded
against long jumps, which are used for implementing errors in R. The
embedded HTTP server does this using R_ToplevelExec
.
So, there are actually two bugs involved in this problem: R code evaluated on a new thread (causing invalid stack overflow error) and R code evaluation not guarded against errors.
The second error is trivial to fix using R_ToplevelExec
and a reliably
reproducible example is a modified version of R, which always runs stop()
inside .win32consoleCompletion
, as above.
Back to the first error, I wanted a reliably reproducible example. There is
code in main.c
in base R to access all bytes on what R believes to be its
C stack.
The code is to debug the detection of the stack start and stack growth
direction. I’ve arranged for do_inspect
to call this code, so to always
access all (remaining) bytes of the stack, and then made
.win32consoleCompletion
always call .Internal(inspect(1))
. This
modified version of R would crash when the C stack is detected incorrectly
while R computes the completion. This “worked” for me also in gdb, finally
a reproducible example.
To fix the first error, I had to arrange for the computation of completion to run on the main R thread. The main R thread runs an event loop like this to get a single line of input from the console (comments added):
SetEvent(EhiWakeUp);
while (1) {
R_WaitEvent(); // WaitMessage()
if (lineavailable) break;
doevent(); // PeekMessage()
// TranslateMessage()
// DispatchMessage()
if(R_Tcl_do) R_Tcl_do();
}
lineavailable = 0;
The console reader thread runs a loop like this to get the line of code from getline and inform the main thread about it:
while(1) {
WaitForSingleObject(EhiWakeUp,INFINITE);
tlen = InThreadReadConsole(tprompt,tbuf,tlen,thist);
lineavailable = 1;
PostThreadMessage(mainThreadId, 0, 0, 0);
}
The extra console reader thread is needed, because the call to
InThreadReadConsole
is blocking (in the getline library), but we need to
process Windows events and Tcl events while the line is being entered. So,
in the above, the reader thread sets lineavailable
and sends a thread
message to the main thread to wake it up from WaitMessage()
.
When completion is needed, getline calls a tab hook function in the reader
thread. I changed that hook to instead ask the main thread to do the work.
Following the pattern in this code, one could add a flag completionneeded
,
pass the inputs via globals, and add a Windows event (a variant of a
semaphore) by which the main thread would inform the reader thread to pick
up results in other globals. This natural extension, however, has a race
condition on accessing the globals (found by Luke Tierney) and another risk
of race condition on the thread message. For the easy of presentation, I
describe those on the original code.
In the code above, variable lineavailable
is set in the reader thread and
then a thread message is posted. Sending that thread message should include
a memory barrier, so, after receiving the message, it should be safe for the
main thread to access lineavailable
even though it is a global variable
not locked. The main thread would zero the variable, but the reader thread
would not touch it before the EhiWakeUp
event, so the consistency should
be assured. But, what if the R_WaitEvent
above returns with a different
message to the main thread? Then, the main thread may accidentally see an
inconsistent version of lineavailable
.
This is probably benign (atomic writes to 32-bit integers are atomic) in the
base version, but the extended version would in addition to
completionneeded
fill in other globals with arguments for the completion,
and those may be seen by the main thread in an unlucky order, picking up
incomplete arguments.
In addition, there is a potential problem with the thread message sent by
PostThreadMessage
to the main thread. If, by accident, any code ran on the
main thread (via doevent
or R_Tcl_do
) included calls to PeekMessage()
,
TranslateMessage()
and DispatchMessage()
, which would accidentally get
the thread message from the reader thread, that message will be lost,
because it doesn’t belong to any window. The main thread would then be able
to respond to a line available or completion request only when another
windows message arrives.
I’ve hence changed the synchronization in the console to use a dummy window
for receiving messages. The main thread loop looks exactly as before
(above), but lineavailable
is only ever accessed in the main thread. The
main thread creates the dummy window with a windows procedure which sets
lineavailable
upon receiving a message from the reader thread. Similarly,
the windows procedure computes completion upon receiving a message from the
reader thread, and this doesn’t require any communication with the loop in
the main thread.
This makes the synchronization more robust by not using thread messages, and makes it a bit more similar to how the embedded HTTP server does it, somewhat reducing cognitive complexity of the code.
In summary, these crashes have revealed two bugs and two potential race conditions in Rterm. For regular R users on Windows, the details presented here don’t matter, it is just enough if R doesn’t crash loosing all their work when they press the tab key. Hopefully it got fixed soon enough, after all the bug hasn’t been reported, yet. A “work-around” is indeed not to use completion in Rterm. The fix is now in R-devel and will be patched to R 4.2.3, it was too late for R 4.2.2.
The work of maintaining R involves debugging and fixing issues like this on regular basis. As the operating systems, external libraries and hardware evolve, code that worked in the past may stop working, be it due to bugs that were benign before or even correct code requiring updates due to external changes.
It is not possible to normally report bugs in this level of detail, but I thought this one might be an interesting example. A list of the most important/visible bug fixes can be found in the R NEWS file, and all changes are visible in the versioning system.