---
title: "Concordances"
author: "Duncan Murdoch"
date: "2022-10-20"
output: html_document
---
One of the strengths of R is its ability to help in
producing documents. Sweave
and knitr
can work with .Rnw
files, evaluating
and automatically inserting the results of R code
to produce a LaTeX document in a .tex
file. We call
this “preprocessing”, since the later steps were originally
designed with the assumption that the .tex
file was
directly edited by the user and then processed to produce
PDF or other output formats.
R Markdown (using knitr) does the same for documents
written in the Markdown language.
A difficulty with preprocessors is that errors arising
in the later steps will produce error messages that refer to
the intermediate files: for example,
LaTeX errors will refer to the .tex
file rather than the .Rnw
file
that is the true source.
Errors in the HTML
code generated from help files are reported by the HTML Tidy utility
according to their line in the .html
file, not the .Rd
or .R
file
which the user originally wrote.
Concordances address this issue. A concordance is a mapping between lines
in the intermediate file and lines in the input file. If an error
is reported at "file:line"
by LaTeX or HTML Tidy, the concordance
allows that location to be translated into the corresponding location
in the .Rnw
or .Rd
file. I added concordances to Sweave
many
years ago, and wrote the patchDVI
package to use them with
previewers and to translate LaTeX error messages.
(See the details in the history below.) With the upcoming
release 4.3.0 of R, concordances have been extended to help files.
Messages from HTML Tidy will be reported with both the .html
file
location and the .Rd
file location.
For example, the file hello.Rd
could contain this code:
\name{hello}
\alias{hello}
\title{Hello, World!}
\usage{
hello()
}
\description{
Prints 'Hello, world!'.
\out{<foobar>}
}
The second last line inserts the literal text <foobar>
into the output.
This is not a legal HTML token, and HTML Tidy will complain.
With the new changes, the complaint will be shown as
* checking HTML version of manual ... NOTE
Found the following HTML validation problems:
hello.html:25:1 (hello.Rd:10): Error: <foobar> is not recognized!
hello.html:25:1 (hello.Rd:10): Warning: discarding unexpected <foobar>
This indicates that the bad token was spotted by HTML Tidy
in column 1 of line 25 of the hello.html file, and that line
originated from line 10 of hello.Rd. There may also be an error
reported in producing the PDF version of the manual; at present
those are not automatically translated by R, but as shown below,
the location can be found manually.
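To make the format of these messages concrete, here is a minimal sketch in base R of how a source location could be appended to a Tidy-style location. The helper annotate_location() is hypothetical (it is not part of R and is not how R builds the message internally), and src_line simply hard-codes the source lines for the first 25 lines of hello.html, as implied by the concordance for this file shown later in this post.

```r
# Source line in hello.Rd for each of the first 25 lines of hello.html
# (hard-coded here for illustration).
src_line <- c(rep(3L, 20), 7L, 7L, 8L, 9L, 10L)

# Hypothetical helper: append "(srcfile:line)" to a "file:line:col" location.
annotate_location <- function(location, src_file, src_line) {
  parts <- strsplit(location, ":", fixed = TRUE)[[1]]
  out_line <- as.integer(parts[2])   # second field is the output line number
  sprintf("%s (%s:%d)", location, src_file, src_line[out_line])
}

annotate_location("hello.html:25:1", "hello.Rd", src_line)
```

The concordance is nothing more than the lookup vector; everything else is string handling.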
The concordance code is mainly intended for internal use, but
it is being made available to package writers.
One package that might be able to use it is roxygen2; among
other things, it creates help files from .R
source files. The new code would allow it
to embed its own concordance in the .Rd
file so that HTML Tidy would
report a reference to the true source
in the .R
file.
(There are some difficult issues in producing that
concordance due to Pandoc limitations, so this might not happen
soon.)
There’s a new class named "Rconcordance", and three
related functions exported by the tools package. The
"Rconcordance" objects are simple lists with three fields:
- offset: If only part of the output file is related to the
  input file, the initial offset lines can be skipped.
- srcLine: This is a vector of line numbers from the original
  source file corresponding to a range of lines of the output file
  starting at line offset + 1.
- srcFile: In simple cases, this is a single filename for the source
  file; in more complicated cases, it can be a vector of filenames
  of the same length as srcLine, possibly giving a different source
  file for each of those lines.
There is a print method for the class:
library(tools)
concordance <- structure(list(offset = 5,
srcLine = 20:30,
srcFile = "myHelpfile.Rd"),
class = "Rconcordance")
concordance
## srcFile srcLine
## 6 myHelpfile.Rd 20
## 7 myHelpfile.Rd 21
## 8 myHelpfile.Rd 22
## 9 myHelpfile.Rd 23
## 10 myHelpfile.Rd 24
## 11 myHelpfile.Rd 25
## 12 myHelpfile.Rd 26
## 13 myHelpfile.Rd 27
## 14 myHelpfile.Rd 28
## 15 myHelpfile.Rd 29
## 16 myHelpfile.Rd 30
The row labels are the output line numbers; the columns give the source filename and line corresponding to each.
The as.character
method for "Rconcordance"
objects converts them
into one or more fairly compact strings, suitable for inclusion
in a final document. For example,
conc_as_char <- as.character(concordance)
conc_as_char
## [1] "concordance::myHelpfile.Rd:ofs 5:20 10 1"
The as.Rconcordance
function is a generic function, with a default
method defined. That method looks for strings like the one above
in its input, and combines all of them into a single concordance object. For example:
newconcordance <- as.Rconcordance(conc_as_char)
newconcordance
## srcFile srcLine
## 6 myHelpfile.Rd 20
## 7 myHelpfile.Rd 21
## 8 myHelpfile.Rd 22
## 9 myHelpfile.Rd 23
## 10 myHelpfile.Rd 24
## 11 myHelpfile.Rd 25
## 12 myHelpfile.Rd 26
## 13 myHelpfile.Rd 27
## 14 myHelpfile.Rd 28
## 15 myHelpfile.Rd 29
## 16 myHelpfile.Rd 30
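To show what the compact string encodes, the following base R sketch decodes one by hand. It assumes the layout that can be inferred from the strings in this post: colon-separated fields (the word concordance, a possibly empty target file, the source file, an optional "ofs N" field), followed by a numeric payload made of the first source line and then (count, difference) pairs that run-length encode the differences between successive source lines. decode_concordance() is a hypothetical helper written only for illustration; in practice as.Rconcordance() does this work, and it also copes with surrounding text such as HTML comment markers, which this sketch does not.

```r
# Hypothetical helper, not part of tools: decode one bare concordance string.
decode_concordance <- function(x) {
  fields <- strsplit(x, ":", fixed = TRUE)[[1]]
  src_file <- fields[3]
  offset <- 0L
  payload <- fields[4]
  if (startsWith(payload, "ofs ")) {          # optional offset field
    offset <- as.integer(sub("^ofs ", "", payload))
    payload <- fields[5]
  }
  nums <- as.integer(strsplit(trimws(payload), " +")[[1]])
  rest <- nums[-1]
  counts <- rest[c(TRUE, FALSE)]              # 1st, 3rd, ... entries: run lengths
  diffs <- rest[c(FALSE, TRUE)]               # 2nd, 4th, ... entries: differences
  # Undo the run-length encoding, then accumulate from the first line.
  src_line <- nums[1] + c(0L, cumsum(rep(diffs, counts)))
  list(offset = offset, srcLine = src_line, srcFile = src_file)
}

decoded <- decode_concordance("concordance::myHelpfile.Rd:ofs 5:20 10 1")
str(decoded)
```

Here "20 10 1" reads as "start at line 20, then add 1 ten times", which reproduces srcLine = 20:30 with an offset of 5.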
Finally, the tools::matchConcordance
function does the translation
of locations in intermediate files to locations in the source file.
For example, when proofreading the HTML help files, you may have
noticed “Hello, world!” on lines 1, 19 and 23 of the hello.html file
and decided to change it. In a real help file, which could be much
larger than my example, finding the corresponding source lines
isn’t trivial. So what you could do is the following:
Run tools::Rd2HTML("hello.Rd", concordance = TRUE). This will
print the HTML source for the help page, ending with
<!-- concordance::hello.Rd:3 19 0 1 4 1 0 3 1 2 0 1 -6 1 0 1 1 3 0 1 7 1 0 1 1 5 0 -->
Then convert that comment to a concordance object and look up the lines:
concordance <- tools::as.Rconcordance("<!-- concordance::hello.Rd:3 19 0 1 4 1 0 3 1 2 0 1 -6 1 0 1 1 3 0 1 7 1 0 1 1 5 0 -->")
tools::matchConcordance(c(1, 19, 23), concordance)
## srcFile srcLine
## [1,] "hello.Rd" "3"
## [2,] "hello.Rd" "3"
## [3,] "hello.Rd" "8"
The first two arose from the \title{}
specification, and the
third one came from a line of text in the \description
section.
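The same object can also be read in the other direction. The sketch below asks which lines of hello.html were generated from a given line of hello.Rd. It relies only on as.Rconcordance() and the three fields described above, so it needs R 4.3.0 or later; output_lines_for() is a hypothetical helper, not part of the tools package.

```r
library(tools)  # as.Rconcordance() requires R >= 4.3.0

concordance <- as.Rconcordance(
  "<!-- concordance::hello.Rd:3 19 0 1 4 1 0 3 1 2 0 1 -6 1 0 1 1 3 0 1 7 1 0 1 1 5 0 -->"
)

# Hypothetical helper: output lines that originated from one source line.
output_lines_for <- function(concordance, src_file, src_line) {
  # srcFile may be a single name or one name per line; recycle to be safe.
  files <- rep_len(concordance$srcFile, length(concordance$srcLine))
  hits <- which(files == src_file & concordance$srcLine == src_line)
  hits + concordance$offset   # srcLine[1] describes output line offset + 1
}

output_lines_for(concordance, "hello.Rd", 8)  # the "Prints ..." line
output_lines_for(concordance, "hello.Rd", 3)  # the \title{} line
```

If the decoding matches the matchConcordance() output above, the first call should point at line 23 of hello.html, and the second at the block of lines at the top of the file that all trace back to the \title{} line.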
Many years ago I used Sweave
for writing papers, presentations,
exams, etc. It took .Rnw
files as input, and produced
.tex
files as output. I would run those files through
latex
to get .dvi
files which I could preview, print, or convert to
PDF for distribution.
Previewers existed in those days that let you click on a particular
word in the preview, and they’d tell your text editor to jump to the
corresponding location in the .tex
file. That was kind of nice,
but also kind of irritating: I then had to figure out the right
location in the .Rnw
file to make my edits, or make the edits in the
.tex
file and be frustrated when they
got wiped out by Sweave
on the next run!
My first solution to this problem was to get Sweave
in R
2.5.0 to
keep a record of the
correspondence between the lines of the .Rnw
file and the .tex
file
it produced, which I called the “concordance”. Given a line in the
.tex
file, it was then possible to find the corresponding line in the
.Rnw
file. By embedding this record in the latex
output, this
could be made automatic. I wrote the patchDVI
package to
modify the links in the .dvi
file so that the previewer would
automatically jump to the right place in the right file. Happiness!
Over the years there were lots of developments. I started using
pdflatex, which skipped the .dvi stage but supported synctex,
so I added support for that into Sweave and patchDVI. knitr
arrived to improve on Sweave, and included concordance support.
I switched text editors and previewers
several times, writing new scripts each time to connect things.
Unfortunately, R Markdown is processed by Pandoc, and as far as I
know, Pandoc doesn’t support any way to relate input lines to output
lines. I’d love to be corrected if I’m wrong about that! So
concordances don’t work with R Markdown or other processors
like Quarto
that rely on Pandoc. I believe roxygen2 uses
Pandoc for processing some help files, so producing concordances for it will also be difficult.