---
title: "Sporadic Rterm Crashes with Completion"
author: "Tomas Kalibera"
date: 2022-11-01
categories: ["Windows", "Internals"]
tags: ["debugging", "completion"]

---

```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE)
```

This post is a story of a bug in Rterm, the console R front-end on Windows,
which has been bugging me for several years, but only two weeks ago it
showed up unwarily so that I could trace it down and fix.

The terminal sometimes crashed during completion, so after I pressed the tab
key, but it was very rare, there seemed to be no way to reproduce the
problem, and it was only happening with the mintty terminal (from Msys2,
running bash), never with cmd.exe.

I thought it was related to the rather complicated conversion between
stream-based and function-call based console control, which such terminals
do, and I assumed it was not a bug in R.  But, I still felt rather uneasy
whenever it showed up again.  One cannot really know where the bug isn't
unless it is traced down.

Two weeks ago Rterm crashed for me during completion even in cmd.exe, the
default Windows terminal, and it happened several times in a row. I would
start Rterm, press the tab key, and R would crash with an error that "C
stack usage is too close to the limit", giving a very large stack usage. So,
clearly not a problem of mintty. I could get the same crash also in
Power Shell.

This error message usually means infinite recursion. In this example,

```
f <- function(x) { print(x); f(x+1) }
f(1)
```

function `f` will always print the nesting level and invoke itself
recursively. Eventually, the execution will result in an error

```
[1] 733
[1] 734
[1] 735
Error: C stack usage  7970564 is too close to the limit
> 
```
The actual numbers above are system specific, but the key is that infinite
recursion becomes a regular R error and R doesn't crash.

How could completion cause infinite recursion in R code, why would it crash
R, and how come that it crashes only very rarely?

The completion options are computed in R code, function
`.win32consoleCompletion` in `utils` package, which is part of base R.  So,
the computation does run in R, but why would it only run into infinite
recursion sometimes, in a clean fresh start of R?

I was lucky that the problem happened to be repeatable also in a build of R
without C compiler optimizations (`-O0`) and with debug symbols included. I
thought it would hence be easy to run R in gdb and see the problem there.

But, for some reason, the problem wouldn't happen when running in gdb. 
Neither when running R in gdb from the beginning, nor when an already
running version of R was attached to gdb.

The only thing I could get was a stack trace from windbg, a windows debugger
invoked "post-mortem". Post-mortem means that the debugger is invoked
automatically by Windows when a program, Rterm in this case, crashes:

```
 # Child-SP          RetAddr               Call Site
00 000000e3`8bff82d0 00007ffb`04e6af0a     ntdll!RtlUnwindEx+0x4e2
01 000000e3`8bff89f0 00007ffa`b87061ee     ucrtbase!_longjmp_internal+0xea
02 000000e3`8bff8f30 00007ffa`b8742543     R!R_jumpctxt+0xe9
03 000000e3`8bff8f70 00007ffa`b8741d6b     R!Rf_error+0x575
04 000000e3`8bff91e0 00007ffa`b8741ed5     R!R_curErrorBuf+0xb79
05 000000e3`8bffd4d0 00007ffa`b8746691     R!Rf_errorcall+0x131
06 000000e3`8bfff510 00007ffa`b873f673     R!R_signalErrorConditionEx+0xe7
07 000000e3`8bfff550 00007ffa`b8748667     R!R_SignalCStackOverflow+0xed
08 000000e3`8bfff590 00007ffa`b869868e     R!Rf_eval+0x1e4
09 000000e3`8bfff850 00007ffa`b88eb654     R!get_R_HOME+0x74e
0a 000000e3`8bfffa50 00007ffa`b86a29e0     R!Rf_rsignrank+0x205b
0b 000000e3`8bfffaa0 00007ffa`b86a2888     R!R_WriteConsoleEx+0x308
0c 000000e3`8bfffae0 00007ffb`04e4514b     R!R_WriteConsoleEx+0x1b0
0d 000000e3`8bfffb10 00007ffb`05c77034     ucrtbase!thread_start<void (__cdecl*)(void *),0>+0x7b
0e 000000e3`8bfffb40 00007ffb`075c26a1     KERNEL32!BaseThreadInitThunk+0x14
0f 000000e3`8bfffb70 00000000`00000000     ntdll!RtlUserThreadStart+0x21
```

And the application had some other threads, one of which was:

```
 # Child-SP          RetAddr               Call Site
00 000000e3`fbdfe408 00007ffa`eddf11b0     win32u!NtUserWaitMessage+0x14
01 000000e3`fbdfe410 00007ffa`b86a25b1     Rgraphapp!GA_waitevent+0x1a
02 000000e3`fbdfe440 00007ffa`b86a2953     R!R_WaitEvent+0x16
03 000000e3`fbdfe470 00007ffa`b86a267d     R!R_WriteConsoleEx+0x27b
04 000000e3`fbdfe4b0 00007ffa`b8789ea2     R!R_ReadConsole+0x3e
05 000000e3`fbdfe4e0 00007ffa`b878a26a     R!Rf_ReplIteration+0x7c
06 000000e3`fbdfe550 00007ffa`b878b8cd     R!Rf_ReplIteration+0x444
07 000000e3`fbdff5b0 00007ffa`b878b8eb     R!run_Rmainloop+0x71
08 000000e3`fbdff5e0 00007ff7`10181b9b     R!Rf_mainloop+0x12
09 000000e3`fbdff610 00007ff7`10181592     Rterm+0x1b9b
0a 000000e3`fbdff790 00007ff7`101813c1     Rterm+0x1592
0b 000000e3`fbdff7d0 00007ff7`101814f6     Rterm+0x13c1
0c 000000e3`fbdff8a0 00007ffb`05c77034     Rterm+0x14f6
0d 000000e3`fbdff8d0 00007ffb`075c26a1     KERNEL32!BaseThreadInitThunk+0x14
0e 000000e3`fbdff900 00000000`00000000     ntdll!RtlUserThreadStart+0x21
```

The second thread above seems to be the main R/Rterm thread, which is
waiting for some Windows event.  The first thread above runs into R
interpreter and immediately gets a stack overflow error.  But, how come? 
Normally, with infinite recursion, one should see a deep stack with nested
calls, there is only one call to `R_eval` in the first trace.

This rang a bell for me, as I worked before on the stack detection in R. 
The stack overflow checking is based on checking how far the current frame
is from the known stack start, given the known stack growth direction.  The
stack start is detected at R startup and is stored in a global variable. 

The first thread above uses the stack start detected in the second thread
above (the R main thread).  But, indeed, the stack of the first thread is
somewhere else.  Based on where the operating system allocates the new
thread stack, R happens to detect a stack overflow or not, but it always
uses wrong stack start for the detection.  This explains why the error was
so hard to repeat and why the source of non-determinism is external to R.
Also it makes it less surprising that it doesn't happen inside gdb.

R is not thread safe.  If R or a package creates a new thread, that thread
must never call into R.  It became clear that the console implementation on
Windows will have to be fixed to ensure that.  There is a pattern already in
base R: the embedded HTTP server used for serving the HTML help.  The worker
threads also need to run R code, and they ask the main R thread to do it for
them. The same had to be done for the console.

Now, while I had a very plausible explanation for the crash, I really wanted
a fully reproducible way to trigger it, also in gdb, trigger it always, so
that I can test before and after a fix and be confident I got all the
problems causing the (very sporadic) crashes.  "Very sporadic" turned rather
euphemistic, because after a reboot of the Windows machine, I couldn't get
the crash again at all, not even in mintty where it used to happen more
often before.

I've tried to provoke it by running this code always in
`.win32consoleCompletion`:

```
f <- function(x) { print(x); f(x+1) }
f(1)
```

That unsurprisingly always ran into stack overflow error detected, but, R
still would not crash in gdb.  Not only that gdb would have given me stack
traces with debug symbols (unlike windbg above), it showed that the crash
wouldn't be reliably repeatable, and so lack of it after a fix would give
limited confidence.

To fully isolate the problem of detecting stack overflow and handling an
error in completion computation, I've instead added

```
stop("xxx")
```

to `.win32consoleCompletion`.  It crashed R.  An R error in computation of
completion should not cause a crash.  The computation needs to be guarded
against long jumps, which are used for implementing errors in R.  The
embedded HTTP server does this using `R_ToplevelExec`.

So, there are actually two bugs involved in this problem: R code evaluated
on a new thread (causing invalid stack overflow error) and R code evaluation
not guarded against errors.

The second error is trivial to fix using `R_ToplevelExec` and a reliably
reproducible example is a modified version of R, which always runs `stop()`
inside `.win32consoleCompletion`, as above.

Back to the first error, I wanted a reliably reproducible example.  There is
code in `main.c` in base R to access all bytes on what R believes to be its
C stack.

The code is to debug the detection of the stack start and stack growth
direction.  I've arranged for `do_inspect` to call this code, so to always
access all (remaining) bytes of the stack, and then made
`.win32consoleCompletion` always call `.Internal(inspect(1))`.  This
modified version of R would crash when the C stack is detected incorrectly
while R computes the completion.  This "worked" for me also in gdb, finally
a reproducible example.

To fix the first error, I had to arrange for the computation of completion
to run on the main R thread.  The main R thread runs an event loop like this
to get a single line of input from the console (comments added):

```
SetEvent(EhiWakeUp);
while (1) {
    R_WaitEvent(); // WaitMessage()
    if (lineavailable) break;
    doevent(); // PeekMessage()
               // TranslateMessage()
               // DispatchMessage()
    if(R_Tcl_do) R_Tcl_do();
}
lineavailable = 0;
```

The console reader thread runs a loop like this to get the line of code from
getline and inform the main thread about it:

```
while(1) {
    WaitForSingleObject(EhiWakeUp,INFINITE);
    tlen = InThreadReadConsole(tprompt,tbuf,tlen,thist);
    lineavailable = 1;
    PostThreadMessage(mainThreadId, 0, 0, 0);
}
```

The extra console reader thread is needed, because the call to
`InThreadReadConsole` is blocking (in the getline library), but we need to
process Windows events and Tcl events while the line is being entered.  So,
in the above, the reader thread sets `lineavailable` and sends a thread
message to the main thread to wake it up from `WaitMessage()`.

When completion is needed, getline calls a tab hook function in the reader
thread.  I changed that hook to instead ask the main thread to do the work. 
Following the pattern in this code, one could add a flag `completionneeded`,
pass the inputs via globals, and add a Windows event (a variant of a
semaphore) by which the main thread would inform the reader thread to pick
up results in other globals.  This natural extension, however, has a race
condition on accessing the globals (found by Luke Tierney) and another risk
of race condition on the thread message.  For the easy of presentation, I
describe those on the original code.

In the code above, variable `lineavailable` is set in the reader thread and
then a thread message is posted. Sending that thread message should include
a memory barrier, so, after receiving the message, it should be safe for the
main thread to access `lineavailable` even though it is a global variable
not locked. The main thread would zero the variable, but the reader thread
would not touch it before the `EhiWakeUp` event, so the consistency should
be assured. But, what if the `R_WaitEvent` above returns with a different
message to the main thread? Then, the main thread may accidentally see an
inconsistent version of `lineavailable`. 

This is probably benign (atomic writes to 32-bit integers are atomic) in the
base version, but the extended version would in addition to
`completionneeded` fill in other globals with arguments for the completion,
and those may be seen by the main thread in an unlucky order, picking up
incomplete arguments.

In addition, there is a potential problem with the thread message sent by
`PostThreadMessage` to the main thread. If, by accident, any code ran on the
main thread (via `doevent` or `R_Tcl_do`) included calls to `PeekMessage()`,
`TranslateMessage()` and `DispatchMessage()`, which would accidentally get
the thread message from the reader thread, that message will be lost,
because it doesn't belong to any window. The main thread would then be able
to respond to a line available or completion request only when another
windows message arrives.

I've hence changed the synchronization in the console to use a dummy window
for receiving messages.  The main thread loop looks exactly as before
(above), but `lineavailable` is only ever accessed in the main thread.  The
main thread creates the dummy window with a windows procedure which sets
`lineavailable` upon receiving a message from the reader thread.  Similarly,
the windows procedure computes completion upon receiving a message from the
reader thread, and this doesn't require any communication with the loop in
the main thread.

This makes the synchronization more robust by not using thread messages, and
makes it a bit more similar to how the embedded HTTP server does it,
somewhat reducing cognitive complexity of the code.

In summary, these crashes have revealed two bugs and two potential race
conditions in Rterm.  For regular R users on Windows, the details presented
here don't matter, it is just enough if R doesn't crash loosing all their
work when they press the tab key.  Hopefully it got fixed soon enough, after
all the bug hasn't been reported, yet.  A "work-around" is indeed not to use
completion in Rterm.  The fix is now in R-devel and will be patched to R
4.2.3, it was too late for R 4.2.2.

The work of maintaining R involves debugging and fixing issues like this on
regular basis.  As the operating systems, external libraries and hardware
evolve, code that worked in the past may stop working, be it due to bugs
that were benign before or even correct code requiring updates due to
external changes.

It is not possible to normally report bugs in this level of detail, but I
thought this one might be an interesting example.  A list of the most
important/visible bug fixes can be found in the [R
NEWS](https://stat.ethz.ch/R-manual/R-devel/doc/html/NEWS.html) file, and
all changes are visible in the versioning system.