---
title: "Long input lines"
author: "Tomas Kalibera"
date: 2024-08-30
categories: ["Internals", "User-visible Behavior", "Windows","Unix"]
tags: ["front-ends", "line editing", "parsing"]

---

```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE)
```

When using R interactively via a console, one edits a line of input,
confirms it by pressing ENTER, then R parses the line, evaluates it, prints
the output and lets the user enter another line.  This is also known as REPL
(Read-Eval-Print-Loop).

The maximum length of the input line is sometimes limited.  It is
essentially impossible that one would run into a limit when typing the
commands, but there was a report from a user who pasted generated content to
the console and have ran into a limit of 4096 bytes.  The input has been
truncated. Other bug reports have indicated that users paste generated code
into R console.

This text explains how R processes long input lines and reports on recent
improvements in R-devel, the development version of R.  In short, it is now
possible to paste lines of unlimited length into R consoles that are part of
the R distribution (so running R from the Linux/macOS terminal, running
Rterm on Windows, with some limitations also running Rgui on Windows).

Despite these improvements, a recommendation for defensive users would be to
instead save generated code to a file and then let software (R or any other)
read it from there.  Consoles, terminals or even clipboards have been
designed for interactive work and may have not been tested with such long
inputs or may have input length limitations.

# How input gets to R

This describes (my interpretation) of the current design of R components
interacting with the input, and how they deal with lines of unlimited
length.

## Parser

The parser is not to blame.  The parser must be and is able to to work with
input of unlimited length, spanning multiple lines.  Think of a large piece
of R code placed inside a branch of the `if` statement, the parser needs to
see the whole `if` statement.  The parser is used to process not only input
provided by the user interactively, but also non-interactive input including
that read from files, which can be very large.

The parser is always given a piece of input of certain length by the REPL
and tries to parse some of it.  When it cannot make progress parsing any
part of it without seeing more, it reports this back to the REPL and the
REPL instead gives it a bigger chunk of input.  So, the parser can work with
input of unlimited length.  There are some limits in the parser/lexer, but
not in principle for the total length of input (but e.g.  to the number of
digits in a number, which indeed is also limited semantically).  The parser
can accept for example arbitrarily long string literals, or arbitrary number
of commands on a line, so things that would be most likely seen in generated
code.

This mechanism is not only to support very long lines of input, but also to
support commands spanning multiple lines. When one enters say

```
if (TRUE) {
```

the parser needs to ask for additional input to make any progress.

## REPL

The REPL gets the input from the console via R console API, function
`R_ReadConsole`.  The interface is modelled after the C function `fgets()`. 
The caller provides a fixed-size buffer and the size of the buffer in bytes. 
The console fills that buffer with a line of input or a segment of it.  If
the whole line fits, including the newline (`\n`) and NUL (`\0`), it is
placed into the buffer.  If it doesn't fit, only as much as fits is added
and terminated by the NUL byte.  The caller (REPL) hence knows whether it
received the full line or not, and this interface allows to receive lines of
unlimited length via repeated calls.

The size of the console buffer used by the REPL for this is a constant
defined in R source code, which has been 4096 since year 2008, before it was
1024.  The REPL copies the input into a growable buffer, which is then
presented to the parser.  When the parser asks for more input to be able to
make any progress, the REPL gets more data from the console, adds that to
the growable buffer, and runs the parser again.  So, the REPL can process
lines of unlimited length.

## Console

When the REPL asks the console for input, it also provides it with a prompt
for the user to edit the line.  Normally it is "`>`", but it can be
something else, e.g.  when debugging (in a browser).  When the REPL asks for
a continuation of a line, the prompt is normally "`+`".

This can be easily seen when one enters 

```
if (TRUE) {
```

to the console.  The next console prompt asks for a continuation using
"`+`", because the parser could not have made any progress parsing just the
previous command; it didn't see the end of the `if` statement.

This design also allows R to get lines of unlimited length from a console
that itself provides lines of _limited_ length.  Normally the user would
enter a line that fits into the limit within the normal prompt ("`>`").  But
if the user tries to add one more character than could fit, the console
returns the segment of the line to the REPL.  The REPL then asks the console
for the continuation.  The console allows the user then to edit another
segment of the line using the continuation prompt ("`+`"), and there may be
multiple continuation prompts of a line.  The user can only edit the current
segment of the line, it is not possible to edit the previously entered
segments.

The standard "console" used non-interactively in R works this way.  It reads
the input using C `fgets()`.  Passing an R script with very long lines to `R
--vanilla` provides an output with the continuation prompts ("`+`").

This is similar to when these are shown to indicate that more lines are needed to
parser the input, where running `R --vanilla` on

```
if (TRUE) {
  cat("Hello.\n")
}

```

yields


```
> if (TRUE) {
+   cat("Hello.\n")
+ }
Hello.
>
```

# Improvements

Unfortunately, there is a limit on the input line length when R is used
interactively, with some console implementations in R.  The situation
differs for each console.

## Rgui

On Windows, in Rgui, the console implementation is part of base R and has
support for the continuation mechanism described above. So, the user is
provided with a continuation prompt to enter an additional segment of the
line.

While testing this in more detail, however, several issues were discovered
(R was compiled with a small console buffer size so that the problems were
more visible).  I've observed that a newline `\n` character gets introduced
into the text at the continuation segments, so e.g.  long string literals
are corrupted.  Also, the allowed length of the line with multi-byte
characters varied and pasting characters sometimes introduced garbage into
the continuation segments (some other characters than the overflow character
appeared on the continuation line).

A conceptual problem with the continuation mechanism based on `fgets()` is
that the limit is given in bytes, not characters.  It may be that several
bytes of a character would fit, but the additional ones not.  Also, an
editor based on lines of _limited_ length would probably want to limit the
number of characters (or perhaps width with a fixed-width font): the user
shouldn't be exposed to that how many bytes are needed to represent
different characters.

This has been fixed by an intermediate interface, which is prepared to that
Rgui provides segments of lines of arbitrary length (note that `fgets()`
would always fill up the buffer when not returning the last segment of the
line).  With this interface, Rgui can limit the line by the number of
characters (as I think was the original design), while entering lines of
arbitrary length to that intermediate interface (e.g.  less than the limit
when some bytes of the overflow character fits but others don't, or much
more than the limit when many multi-byte characters have been entered).

The problem of incorrect character or garbage on the continuation line was
related to incorrect implementation of pushback (internal to Rgui).  This
pushback is only used when a line overflows, so this understandably went
unnoticed and has now been fixed.

## Rterm

On Windows, in Rterm, the console is implemented using a significantly
rewritten version of the getline library, which is part of R source code. 
The original as well as the rewritten version has a limit on the number of
bytes per line.  It is not possible to enter more than that - so essentially
the input is truncated, with a beep signalling this has happened.  The limit
is the same as the size of the buffer provided by the REPL, so R's console
buffer size.  Getline, unlike Rgui, unfortunately doesn't support the
continuation mechanism.

As part of this work, getline in R has been extended to support lines of
arbitrary length via re-allocation.  To work via R's console interface, an
intermediate interface was implemented which presents segments of the
already edited line to the REPL.  So, the user first edits the whole line,
which is then provided in segments to the REPL.  The continuation prompt is
never displayed to ask for continuation of a long line, because such request
is always fulfilled by the intermediate interface, which remembers the rest
of the line.

## Unix/readline

R on the Linux/macOS (Unix) command line in interactive mode uses the
readline library.  The library itself provides a line editor supporting
lines of unlimited length, dynamically allocated, but R truncates those
lines.

This was fixed in R-devel, this console implementation has been extended so
that it returns the additional segments of the line to the REPL.  The user
experience would be the same as newly with Rterm: the continuation prompt
would never be displayed to ask for the rest of a line (it could be
displayed to ask for the rest of an R command spanning multiple lines) and
the user would be able to fully edit the whole line.

## REPL, parser

Some corner cases issues related to continuation of long lines were fixed
also in the REPL and the parser (lexer).  Some of these involved multi-byte
characters: the lexer didn't correctly handle when the input for the parser
ended within such a character.  Some were general programming errors.

# Limitations

One may run into limitations imposed by other software.  For example, on
Linux, the line editor of the terminal (so called cannonical input mode) has
a limit of 4096 bytes.  More characters can be entered, but they are not
stored, so it appears as if the input is truncated.  This limit of the
terminal does not affect the normal use of R with readline.

On macOS, there is a similar limit of around 1K, and at least observed on
one system, one cannot even enter more than that, the input is stopped. 
Again, this does not affect the normal use of R with readline.

The continuation support visible to the user, as presented in Rgui, is less
flexible than support for fully editing the line, as now present in Rterm
and on Unix console with readline.  With Rgui, one cannot go to previous
segments of the line and overflow is only handled at the end of the line. 
Therefore, e.g.  pasting some very long generated text _inside_ an existing
line wouldn't work.  For any new console implementation/front-ends, it would
probably make sense to implement full editing of arbitrarily long lines.