--- title: "Parser Speedups" author: "Tomas Kalibera" date: 2019-01-07 categories: ["Internals", "Performance"] tags: ["parsing", "source references", "parse data"] ---
It wasn’t my primary goal to improve parser performance nor to measure it.
I’ve been working on optimizations to reduce the runtime overhead of
including source reference into packages (this is not done by default due to
space and execution time overheads). I’ve added an option to exclude parse
data from source references and enabled it by default for packages, as parse
data account for most of the runtime overhead of source references while
they are rarely needed. Trivially, not collecting parse data also speeds up
parsing with source references. While debugging a bug in parse data I found
that parental information in the parse data was updated even when source
references were not collected; fixing this also improved performance. I’ve
then made some fixes to initialization, memory allocation and memory
protection of the parser structures, following on bug reports and addressing
bugs I’ve discovered while reading the code, including the potential problem
of mixing UNPROTECT_PTR
with UNPROTECT
calls. I’ve been profiling my
changes to verify that I was not unnecessarily adding performance
regressions, and this profiling has uncovered pathological performance
overhead of parse data post-processing on large inputs. For the profiling,
I artificially created a large file by concatenating all files from R base
packages and I got over 90% of time spent in parse data post-processing
(and 10% in all the rest including parsing), but this percentage depends on
the size of the input, for smaller files it is much smaller. I fixed its
primary cause (the pass of finding parent nodes for comments). While
profiling on real (non-concatenated) files from CRAN and BIOC packages, I
found and fixed another pathological case that made parse data
post-processing very slow with large generated expressions (updating
parental information with nodes excluded from the parse data). While
profiling I’ve also found that there was some performance improvement
recently in the parser (between 3.4 and 3.5) for which I could not find any
plausible explanation in commit logs for the parser code. I became curious
which they were, so I decided to measure parser performance in a bit more
detail.
The performance numbers below need to be seen in context of that they depend
on the files chosen for measurements. It would not be hard at all to create
input files to report an arbitrarily large relative performance improvement
with some of the optimizations - at least for the finding of parents for
comments, larger the input file with comments is, larger is the relative
speedup of the optimization. I’ve chosen to measure parse time of
concatenated sources of the base
package, compiler
, tools
, utils
,
and stats
from R 3.5.2 to represent different coding styles and file
sizes. I’ve also added all
as concatenation of all sources from the
packages in the base distribution, including the previous ones measured
individually, to represent large, hand-written code with comments (this is
also the same set I was profiling while working on the optimizations). Then
I added the largest files from CRAN/BIOC packages which also represent some
of the pathological cases: rgtk2
is gtkFuncs.R
from package RGtk2
version 2.20.35 (a large file with no comments), phylosim
is
PhyloSimSource.R
from package phylosim
version 3.0.2 (a large file with
a lot of inline comments used for inline technical documentation), and
rcoletum
is test-GetAnswersComplexForm.R
from package RColetum
version
0.2.0 (a large generated file with a call to structure()
with a very large
number of arguments). These are also the three files with the largest
number of lines from CRAN+BIOC packages (from the subset that I regularly
check, which excludes several packages with dependencies not easily
available).
Lines | Chars | |
---|---|---|
base | 20K | 730K |
stats | 29K | 1057K |
compiler | 3K | 102K |
tools | 42K | 1634K |
utils | 20K | 729K |
all | 156K | 5793K |
phylosim | 52K | 1074K |
rcoletum | 82K | 5418K |
rgtk2 | 41K | 780K |
Each iteration runs a fresh instance of RScript
with
options(keep.parse.data = KEEP_PARSE_DATA)
system.time(
dummy <- parse(FILENAME, keep.source= KEEP_SRCREFS)
)
I’ve run each of the three settings (no source references, source references without parse data, source references with parse data) 20 times, with every R version and source file, reporting average elapsed time. I’ve calculated also 95% bootstrap confidence intervals, which are used to decide to how many digits to round the results to, but are not reported directly. Measured on 64-bit Ubuntu laptop. The numbers are in seconds, so lower is better.
I’ve used the latest R tarballs for versions 3.3, 3.4, and 3.5 and then selected R-devel versions (I had to measure a bit more than shown here to validate where the interesting changes happened; note that 73746 and 73747 are between 3.4 and 3.5, so in the “old” R-devel, but 75177 and later are already the current R-devel).
KEEP_PARSE_DATA=TRUE, KEEP_SRCREFS=TRUE
(source references with parse
data) corresponds to calling parse()
explicitly to parse a file with
default arguments.
3.3.3 | 3.4.4 | 3.5.2 | 73746 | 73747 | 75177 | 75178 | 75754 | 75873 | 75883 | |
---|---|---|---|---|---|---|---|---|---|---|
all.r | 20.700 | 20.600 | 25.8000 | 26.0000 | 25.7000 | 25.800 | 25.8000 | 25.0000 | 1.560 | 1.600 |
base.r | 0.372 | 0.370 | 0.4400 | 0.4480 | 0.4480 | 0.440 | 0.4390 | 0.4600 | 0.103 | 0.102 |
compiler.r | 0.017 | 0.023 | 0.0218 | 0.0189 | 0.0189 | 0.023 | 0.0222 | 0.0229 | 0.019 | 0.019 |
phylosim.r | 2.000 | 1.500 | 2.4000 | 2.3900 | 2.4000 | 2.400 | 2.4200 | 2.6000 | 0.110 | 0.110 |
rcoletum.r | 18.700 | 18.600 | 19.0000 | 18.6000 | 18.0000 | 19.000 | 19.0000 | 18.3000 | 18.300 | 0.770 |
rgtk2.r | 0.293 | 0.308 | 0.1990 | 0.2070 | 0.2060 | 0.199 | 0.2000 | 0.1200 | 0.120 | 0.130 |
stats.r | 0.720 | 0.720 | 0.7800 | 0.7990 | 0.7900 | 0.790 | 0.8100 | 0.7900 | 0.178 | 0.178 |
tools.r | 0.720 | 0.702 | 0.6890 | 0.6890 | 0.6230 | 0.700 | 0.7000 | 0.7000 | 0.300 | 0.298 |
utils.r | 0.270 | 0.270 | 0.2720 | 0.2680 | 0.2690 | 0.274 | 0.2710 | 0.2880 | 0.100 | 0.101 |
The numbers show a speedup in R-devel 75873 for all sample files except
rcoletum
and rgtk2
. This is the speedup in finding parent nodes for
comments, rcoletum
is not affected because it has only a trivial number of
comments, rgtk2
because it has no comments at all. The original algorithm
for finding parent nodes is quadratic in the worst case as it has to go
potentially through the whole remaining parse data for each comment. The
new algorithm is almost linear in the size of the parse data, it takes
advantage of the parental information already available in the parse data
for non-comment nodes, and of certain ordering properties of entries in the
parse data met by the parser (more details available in the source code
comments). All large files are likely to benefit, and there is even some
improvement visible for the smallest sample file (compiler
, about 33%).
The numbers also show a speedup in R-devel 75883 for rcoletum
. This is a
simple optimization in updating parental information with nodes excluded
from the parse data. For each node, one has to traverse through a sequence
of excluded nodes until one finds a non-excluded node. For very large
expressions (list of arguments in this case, but could also be a large
arithmetic expression), the sequences of excluded nodes are long and the
excluded nodes are repeated, hence the searches are repeated as well. The
optimization keeps record of the parents already found and re-uses it during
followup searches. The speedup depends on the shape of the expression tree
and likely only generated files would benefit. The cost for other files
should be negligible (and this is in line with the numbers).
There also seem to be some slowdowns between 3.4.4 and 3.5.2 for all
,
phylosim
and base
. These slowdowns are erased in 75873 and 75883. They
may be (but I did not verify) caused by virtualization of access to integer
vectors used in parse data post-processing. In principle, one could get
further speedups over 75883 by rewriting two loops for parse data
post-processing, but it is not clear whether is worth the somewhat increased
cognitive complexity of the code. 73747 helped in some cases, but this is
again not interesting anymore due to big changes in 75873 and 75883.
KEEP_PARSE_DATA=FALSE, KEEP_SRCREFS=TRUE
(source references without parse
data) corresponds to what is now done by default when installing packages
using R_KEEP_PKG_SOURCE=yes
.
75177 | 75178 | 75754 | 75873 | 75883 | |
---|---|---|---|---|---|
all.r | 1.2700 | 1.0000 | 0.8100 | 0.813 | 0.820 |
base.r | 0.0720 | 0.0640 | 0.0610 | 0.062 | 0.063 |
compiler.r | 0.0138 | 0.0122 | 0.0122 | 0.012 | 0.013 |
phylosim.r | 0.0730 | 0.0600 | 0.0600 | 0.070 | 0.070 |
rcoletum.r | 0.3900 | 0.3280 | 0.3470 | 0.347 | 0.400 |
rgtk2.r | 0.1670 | 0.1560 | 0.0900 | 0.088 | 0.087 |
stats.r | 0.1200 | 0.1040 | 0.1100 | 0.107 | 0.107 |
tools.r | 0.2100 | 0.2000 | 0.1980 | 0.200 | 0.200 |
utils.r | 0.0680 | 0.0590 | 0.0610 | 0.062 | 0.062 |
The numbers are only for the current R-devel versions, because the option to
exclude parse data has been added in R-devel 74761. In prior versions, one
always got parse data with source references (which had nearly 25x overhead
with all
).
There is a visible speedup for all
and rgtk2
in R-devel 75754. This
optimization uses stretchy lists for intermediate structures in the parser
that hold source references before they are compacted into fixed-size
newlist blocks. Stretchy lists are linked lists that allow constant-time
append operation and have been already in use in the parser, but the
temporary source reference lists used the traditional pairlists with linear
time append operation. This optimization is expected to help only when such
lists are long, such as in large files with a lot of (short) functions
(probably it is why it helps so much in rgtk2
) or with functions with very
long bodies.
There is an improvement in version 75178, the optimization removes unnecessary updating parental information for the parse data. Such updating is not needed when parse data are not collected.
KEEP_PARSE_DATA=FALSE, KEEP_SRCREFS=FALSE
(no source references, no parse
data) corresponds to what is now (in 3.5 and current R-devel) done by
default when installing packages.
3.3.3 | 3.4.4 | 3.5.2 | 73746 | 73747 | 75177 | 75178 | 75754 | 75873 | 75883 | |
---|---|---|---|---|---|---|---|---|---|---|
all.r | 0.7310 | 0.7200 | 0.4490 | 0.7040 | 0.5500 | 0.5220 | 0.460 | 0.470 | 0.480 | 0.4810 |
base.r | 0.0630 | 0.0632 | 0.0477 | 0.0500 | 0.0511 | 0.0570 | 0.049 | 0.051 | 0.052 | 0.0520 |
compiler.r | 0.0090 | 0.0092 | 0.0090 | 0.0080 | 0.0082 | 0.0110 | 0.010 | 0.010 | 0.010 | 0.0102 |
phylosim.r | 0.0618 | 0.0620 | 0.0482 | 0.0500 | 0.0507 | 0.0570 | 0.049 | 0.051 | 0.052 | 0.0520 |
rcoletum.r | 0.5600 | 0.5720 | 0.3000 | 0.5310 | 0.3640 | 0.3600 | 0.300 | 0.330 | 0.330 | 0.3300 |
rgtk2.r | 0.0849 | 0.0840 | 0.0659 | 0.0690 | 0.0700 | 0.0761 | 0.067 | 0.069 | 0.071 | 0.0710 |
stats.r | 0.0980 | 0.0972 | 0.0760 | 0.0900 | 0.0900 | 0.0880 | 0.077 | 0.081 | 0.083 | 0.0830 |
tools.r | 0.1570 | 0.1560 | 0.0960 | 0.1150 | 0.1200 | 0.1140 | 0.097 | 0.102 | 0.104 | 0.1100 |
utils.r | 0.0640 | 0.0629 | 0.0470 | 0.0498 | 0.0501 | 0.0562 | 0.048 | 0.051 | 0.051 | 0.0510 |
In absolute numbers, parse times are small in this configuration even for large source files, but they still contribute to the installation time of packages.
Like when collecting source references without parse data, there is an improvement in version 75178 (not updating parents for parse data as they are not collected).
In addition, there has been a speedup between 3.4.4 and 3.5.2 for all sample
files (perhaps except compiler
). Having a closer look it is clear that
while there are almost no changes in the parser code; one exception is a
port of 75178, which avoided updating parental information when source
references were not collected, ported because it was a bugfix. Most of the
speedups between 3.4.4 and 3.5.2 happen indirectly through other changes in
the R code, and seem to be a result of many smaller performance changes,
both slowdowns and speedups.
One of the bigger speedups was in 73747 (when R-devel was to become R 3.5). In this commit, Luke Tierney changed the GC heuristics to grow the heap for nodes (cells) more aggressively in the first few heap adjustments after R starts. This followed profiling we have been doing that identified several situations when byte-compiled code in the byte-code interpreter ran slower than AST interpreter (the overhead was not in compilation - the code has already been compiled). This turned up to be caused by GC pressure (unnecessarily too many GC runs) with compiled code when the code took a non-trivial amount of memory for the initial heap sizes. The parser and particularly the tokenizer take quite a bit of memory, so it is not surprising GC heuristics had an impact.
Another speedup was in 73554 (not shown in these numbers, I just measured
for gtk2
), again due to GC, I’ve updated the GC initial vector memory
size, causing the vector heap to grow faster. The change made sense given
the current usual memory sizes and it followed again profiling some cases
when loading of the compiled code has been taking too long, resulting in
some cases in slowdown with already compiled byte-code.
R parser performance has been improved since version 3.4 already in version 3.5 and the current R-devel has additional improvements when source references with parse data are collected. In the current R-devel, one can also collect source references without parse data, which is significantly faster in addition to reduced installation size.