Concordances

One of the strengths of R is its ability to help in producing documents. Sweave and knitr can work with .Rnw files, evaluating and automatically inserting the results of R code to produce a LaTeX document in a .tex file. We call this “preprocessing”, since the later steps were originally designed with the assumption that the .tex file was directly edited by the user and then processed to produce PDF or other output formats. R Markdown (using knitr) does the same for documents written in the Markdown language.

A difficulty with preprocessors is that errors arising in the later steps will produce error messages that refer to the intermediate files: for example, LaTeX errors will refer to the .tex file rather than the .Rnw file that is the true source. Errors in the HTML code generated from help files are reported by the HTML Tidy utility according to their line in the .html file, not the .Rd or .R file which the user originally wrote.

Concordances address this issue. A concordance is a mapping between lines in the intermediate file and lines in the input file. If an error is reported at "file:line" by LaTeX or HTML Tidy, the concordance allows that location to be translated into the corresponding location in the .Rnw or .Rd file. I added concordances to Sweave many years ago, and wrote the patchDVI package to use them with previewers and to translate LaTeX error messages. (See the details in the history below.) With the upcoming release 4.3.0 of R, concordances have been extended to help files. Messages from HTML Tidy will be reported with both the .html file location and the .Rd file location.
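
To make the idea concrete, here is a toy sketch in base R. The "@repeat" directive and all variable names are invented for this illustration; this is not how Sweave or Rd2HTML build their concordances, only a minimal picture of what the mapping records.

```r
# Toy preprocessor (hypothetical): an input line starting with "@repeat"
# expands to three output lines; every other line is copied unchanged.
input <- c("Title", "@repeat hello", "The end")

output <- character()
src_line <- integer()
for (i in seq_along(input)) {
  if (startsWith(input[i], "@repeat")) {
    new_lines <- rep(sub("^@repeat ", "", input[i]), 3)
  } else {
    new_lines <- input[i]
  }
  output <- c(output, new_lines)
  # Record which input line produced each new output line.
  src_line <- c(src_line, rep(i, length(new_lines)))
}

# src_line plays the role of the concordance: element k is the input
# line that produced output line k.
src_line

# An error reported at output line 4 is translated to this input line:
src_line[4]
```

Here an error on line 4 of the output translates back to line 2 of the input, the line holding the directive.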

For example, the file hello.Rd could contain this code:

\name{hello}
\alias{hello}
\title{Hello, World!}
\usage{
hello()
}
\description{
Prints 'Hello, world!'.

\out{<foobar>}
}

The second-to-last line inserts the literal text <foobar> into the output. This is not a legal HTML token, and HTML Tidy will complain. With the new changes, the complaint will be shown as:

* checking HTML version of manual ... NOTE
Found the following HTML validation problems:
hello.html:25:1 (hello.Rd:10): Error: <foobar> is not recognized!
hello.html:25:1 (hello.Rd:10): Warning: discarding unexpected <foobar>

This indicates that the bad token was spotted by HTML Tidy in column 1 of line 25 of the hello.html file, and that line originated from line 10 of hello.Rd. There may also be an error reported in producing the PDF version of the manual; at present those are not automatically translated by R, but as shown below, the location can be found manually.

Concordance code

The concordance code is mainly intended for internal use, but it is being made available to package writers. One package that might be able to use it is roxygen2; among other things, it creates help files from .R source files. The new code would allow it to embed its own concordance in the .Rd file so that HTML Tidy would report a reference to the true source in the .R file. (There are some difficult issues in producing that concordance due to Pandoc limitations, so this might not happen soon.)

Some details about the new code

There’s a new class named "Rconcordance", and three related functions exported by the tools package. The "Rconcordance" objects are simple lists with three fields:

  • offset: If only part of the output file is related to the input file, this is the number of initial output lines to skip.
  • srcLine: This is a vector of line numbers from the original source file corresponding to a range of lines of the output file starting at line offset + 1.
  • srcFile: In simple cases, this is a single filename for the source file; in more complicated cases, it can be a vector of filenames of the same length as srcLine, possibly giving a different source file for each of those lines.

There is a print method for the class:

library(tools)
concordance <- structure(list(offset = 5,
                  srcLine = 20:30,
                  srcFile = "myHelpfile.Rd"),
             class = "Rconcordance")
concordance
##          srcFile srcLine
## 6  myHelpfile.Rd      20
## 7  myHelpfile.Rd      21
## 8  myHelpfile.Rd      22
## 9  myHelpfile.Rd      23
## 10 myHelpfile.Rd      24
## 11 myHelpfile.Rd      25
## 12 myHelpfile.Rd      26
## 13 myHelpfile.Rd      27
## 14 myHelpfile.Rd      28
## 15 myHelpfile.Rd      29
## 16 myHelpfile.Rd      30

The row labels are the output line numbers; the columns give the source filename and source line corresponding to each.
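
The relationship between the three fields and that table can be sketched by hand in base R. The code uses a plain list with no class attached, so it does not depend on the tools package; the names conc, out_line and tab are only for this illustration.

```r
# A plain list with the same three fields as the example object.
conc <- list(offset = 5, srcLine = 20:30, srcFile = "myHelpfile.Rd")

# Output line numbers start at offset + 1, one per element of srcLine.
out_line <- conc$offset + seq_along(conc$srcLine)

# Rebuild the table by hand. rep_len() handles both a single filename
# and a vector with one filename per line.
tab <- data.frame(
  srcFile = rep_len(conc$srcFile, length(conc$srcLine)),
  srcLine = conc$srcLine,
  row.names = as.character(out_line)
)
tab

# Manual lookup: which source line produced output line 12?
conc$srcLine[12 - conc$offset]
```

The manual lookup gives 26, matching the row labelled 12 in the printed table.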

The as.character method for "Rconcordance" objects converts them into one or more fairly compact strings, suitable for inclusion in a final document. For example,

conc_as_char <- as.character(concordance)
conc_as_char
## [1] "concordance::myHelpfile.Rd:ofs 5:20 10 1"
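
The numbers at the end of the string appear to be a run-length encoding: the first source line, followed by (count, increment) pairs. That reading is an inference from the examples in this post, not a documented specification, and decode_concordance_values() below is a hypothetical helper written only to illustrate it. In real code, as.Rconcordance is the function to use.

```r
# Hypothetical helper, not part of tools. Assumes the numeric part of a
# concordance string is "first line, then (count, increment) pairs".
decode_concordance_values <- function(values) {
  nums <- as.integer(strsplit(trimws(values), "[[:space:]]+")[[1]])
  first <- nums[1]
  rest <- nums[-1]
  counts <- rest[c(TRUE, FALSE)]      # 1st, 3rd, 5th, ... of the rest
  increments <- rest[c(FALSE, TRUE)]  # 2nd, 4th, 6th, ... of the rest
  # Expand the runs into per-line increments and accumulate them.
  cumsum(c(first, rep(increments, times = counts)))
}

# "20 10 1": start at 20, then 10 steps of +1, i.e. source lines 20 to 30.
decode_concordance_values("20 10 1")
```

Under this assumption, "20 10 1" expands to the eleven source lines 20 through 30, which agrees with the srcLine field of the object that was converted.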

The as.Rconcordance function is a generic function, with a default method defined. That method looks for strings like the one above in its input, and combines all of them into a single concordance object. For example:

newconcordance <- as.Rconcordance(conc_as_char)
newconcordance
##          srcFile srcLine
## 6  myHelpfile.Rd      20
## 7  myHelpfile.Rd      21
## 8  myHelpfile.Rd      22
## 9  myHelpfile.Rd      23
## 10 myHelpfile.Rd      24
## 11 myHelpfile.Rd      25
## 12 myHelpfile.Rd      26
## 13 myHelpfile.Rd      27
## 14 myHelpfile.Rd      28
## 15 myHelpfile.Rd      29
## 16 myHelpfile.Rd      30

Finally, the tools::matchConcordance function does the translation of locations in intermediate files to locations in the source file. For example, when proofreading the HTML help files, you may have noticed “Hello, world!” on lines 1, 19 and 23 of the hello.html file and decided to change it. In a large help file, finding the corresponding source lines isn’t the trivial problem it is in my example. Here is what you could do:

  1. Run tools::Rd2HTML("hello.Rd", concordance = TRUE). This will print the HTML source for the help page, ending with
<!-- concordance::hello.Rd:3 19 0 1 4 1 0 3 1 2 0 1 -6 1 0 1 1 3 0 1 7 1 0 1 1 5 0 -->
  2. Convert that string to a concordance object using
concordance <- tools::as.Rconcordance("<!-- concordance::hello.Rd:3 19 0 1 4 1 0 3 1 2 0 1 -6 1 0 1 1 3 0 1 7 1 0 1 1 5 0 -->")
  3. Find the source corresponding to lines 1, 19 and 23 using
tools::matchConcordance(c(1, 19, 23), concordance)
##      srcFile    srcLine
## [1,] "hello.Rd" "3"    
## [2,] "hello.Rd" "3"    
## [3,] "hello.Rd" "8"

The first two arose from the \title{} specification, and the third one came from a line of text in the \description section.
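
The same lookup can be used to check the HTML Tidy report from the earlier example by hand. This is a small sketch that assumes R 4.3.0 or later, where as.Rconcordance and matchConcordance are available in tools; the variable name conc is just for illustration.

```r
library(tools)  # as.Rconcordance() and matchConcordance() need R >= 4.3.0

# The concordance comment printed at the end of the HTML for hello.Rd.
conc <- as.Rconcordance("<!-- concordance::hello.Rd:3 19 0 1 4 1 0 3 1 2 0 1 -6 1 0 1 1 3 0 1 7 1 0 1 1 5 0 -->")

# HTML Tidy complained about line 25 of hello.html; look up its source.
matchConcordance(25, conc)
```

The result should agree with the hello.Rd:10 location that R appended to the Tidy messages.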

Ancient History

Many years ago I used Sweave for writing papers, presentations, exams, etc. It took .Rnw files as input, and produced .tex files as output. I would run those files through latex to get .dvi files which I could preview, print, or convert to PDF for distribution.

Previewers existed in those days that let you click on a particular word in the preview, and they’d tell your text editor to jump to the corresponding location in the .tex file. That was kind of nice, but also kind of irritating: I then had to figure out the right location in the .Rnw file to make my edits, or make the edits in the .tex file and be frustrated when they got wiped out by Sweave on the next run!

My first solution to this problem was to get Sweave in R 2.5.0 to keep a record of the correspondence between the lines of the .Rnw file and the .tex file it produced, which I called the “concordance”. Given a line in the .tex file, it was then possible to find the corresponding line in the .Rnw file. By embedding this record in the LaTeX output, this could be made automatic. I wrote the patchDVI package to modify the links in the .dvi file so that the previewer would automatically jump to the right place in the right file. Happiness!

Over the years there were lots of developments. I started using pdflatex which skipped the .dvi stage, but supported synctex, so I added support for that into Sweave and patchDVI. knitr arrived to improve on Sweave, and included concordance support. I switched text editors and previewers several times, writing new scripts each time to connect things.

Unfortunately, R Markdown is processed by Pandoc, and as far as I know, Pandoc doesn’t support any way to relate input lines to output lines. I’d love to be corrected if I’m wrong about that! So concordances don’t work with R Markdown or other processors like Quarto that rely on Pandoc. I believe roxygen2 uses Pandoc for processing some help files, so it will also be difficult.