Introduction ************ This document contains answers to some of the most frequently asked questions about R. Legalese ======== This document is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2, or (at your option) any later version. This document is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. A copy of the GNU General Public License is available via WWW at `http://www.gnu.org/copyleft/gpl.html'. You can also obtain it by writing to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. Obtaining this Document ======================= The latest version of this document is always available from `http://www.ci.tuwien.ac.at/~hornik/R/' From there, you can obtain versions converted to plain ASCII text (http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.txt), DVI (http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.dvi.gz), GNU info (http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.info.gz), HTML (http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html), PDF (http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.pdf), PostScript (http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.ps.gz) as well as the Texinfo source (http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.texi) used for creating all these formats using the GNU Texinfo system. You can also obtain the R FAQ from the `doc/FAQ' subdirectory of a CRAN site (*Note What Is CRAN?::). Notation ======== Everything should be pretty standard. `R>' is used for the R prompt, and a `$' for the shell prompt (where applicable). Feedback ======== Feedback is of course most welcome. In particular, note that I do not have access to Windows or Mac systems. If you have information on these systems that you think should be added to this document, please let me know. 
R Basics ******** What Is R? ========== R is a system for statistical computation and graphics. It consists of a language plus a run-time environment with graphics, a debugger, access to certain system functions, and the ability to run programs stored in script files. The design of R has been heavily influenced by two existing languages: Becker, Chambers & Wilks' S (*note What Is S?::.) and Sussman's Scheme (http://www.cs.indiana.edu/scheme-repository/home.html). Whereas the resulting language is very similar in appearance to S, the underlying implementation and semantics are derived from Scheme. *Note What Are the Differences between R and S?:: for a discussion of the differences between R and S. R was initially written by Ross Ihaka and Robert Gentleman, who are Senior Lecturers at the Department of Statistics of the University of Auckland in Auckland, New Zealand. In addition, a large group of individuals has contributed to R by sending code and bug reports. Since mid-1997 there has been a core group (the "R Core Team") who can modify the R source code CVS archive. The group currently consists of Doug Bates, Peter Dalgaard, Robert Gentleman, Kurt Hornik, Ross Ihaka, Friedrich Leisch, Thomas Lumley, Martin Maechler, Guido Masarotto, Paul Murrell, Brian Ripley, Heiner Schwarte, and Luke Tierney. R has a home page at `http://stat.auckland.ac.nz/r/r.html'. It is free software distributed under a GNU-style copyleft, and an official part of the GNU project ("GNU S"). What Machines Does R Run on? ============================ R is being developed for the Unix, Windows and Mac families of operating systems. The current version of R will configure and build under a number of common Unix platforms including i386-freebsd, i386-linux, ppc-linux, mips-sgi-irix, alpha-linux, alpha-dec-osf4, sparc-linux, and sparc-sun-solaris; see the file `PLATFORMS' in the R distribution for more information. If you know about other platforms, please drop us a note.
What Is the Current Version of R?
=================================

The current stable Unix version is 0.63.3, the unstable one is 0.64.0. Typically, new features are introduced in the development versions; updates of stable versions are mostly for bug fixes. The Windows version tracks the stable Unix version quite closely. The version for the Mac is pre-alpha.

How Can R Be Obtained?
======================

Sources, binaries and documentation for R can be obtained via CRAN, the "Comprehensive R Archive Network" (see *Note What Is CRAN?::).

How Can R Be Installed?
=======================

How Can R Be Installed (Unix)
-----------------------------

If binaries are available for your platform (see *Note Are there Unix Binaries for R?::), you can use these, following the instructions that come with them.

Otherwise, you can compile and install R yourself, which can be done very easily under a number of common Unix platforms (see *Note What Machines Does R Run on?::). The file `INSTALL' that comes with the R distribution contains instructions.

Note that as of version 0.62, you need a FORTRAN compiler or `f2c' in addition to a C compiler to build R. Also (as of 0.60), you need Perl version 5 to build the documentation. If this is not available on your system, you can obtain precompiled documentation files via CRAN.

In the simplest case, untar the R source code, cd to the directory thus created, and issue the following commands (at the shell prompt):

     $ ./configure
     $ make

If these commands execute successfully, the R binary and a shell script front-end called `R' are created and copied to the `bin' directory. You can copy the script to a place where users can invoke it, for example to `/usr/local/bin'. In addition, plain text help pages as well as HTML and LaTeX versions of the documentation are built.

Use `make dvi' to obtain a dvi version of the R manual.
This creates the file `Man.dvi' in the `doc/manual' subdirectory which can be previewed and printed using standard programs such as `xdvi' and `dvips'. (Note that you have to build this file in the source tree.)

You can also perform a "system-wide" installation using

     $ make install

This will install to the following directories:

`${prefix}/bin'
     (some) executables

`${prefix}/man/man1'
     man pages

`${prefix}/lib/R'
     all the rest (libraries, on-line help system, ...)

where `prefix' is determined during configuration (typically `/usr/local') and can be set by running `configure' with the option

     $ ./configure --prefix=/where/you/want/R/to/go

(E.g., the R executable will then be installed into `/where/you/want/R/to/go/bin'.)

How Can R Be Installed (Windows)
--------------------------------

The `bin/ms-windows' directory of a CRAN site currently contains two binary distributions of R for MS Windows: one by Robert Gentleman in the `windows' subdirectory, and one by Guido Masarotto in `windows-9x'. The latter only works on 32 bit versions of Windows (i.e., 95, 98 or NT), the former also on 3.11. See the respective directories for more information.

Binary distributions for a large number of add-on packages (basically all those on CRAN except purely data collections) for use with Guido's version are available in the `windows-9x/contrib' subdirectory.

Note that when uncompressing the zip files, the pkunzip program needs to be invoked with the `-D' flag to create subdirectories. Also, be aware that some decompression programs do not preserve long file names properly.

How Can R Be Installed (Macintosh)
----------------------------------

The CRAN `bin/macintosh' directory contains `R.sea.hqx', a binhexed self-extracting archive, and installation instructions in `README.MACINTOSH'. Note that the version in it is nowhere near the quality of the current Unix version.

The Power Macintosh port is temporarily on hold.

Are there Unix Binaries for R?
==============================

The `bin/linux' directory contains Debian 2.0 packages for the i386 platform (now part of the Debian distribution and maintained by Doug Bates) as well as Red Hat 5.1 packages for the i386, alpha and sparc platforms (maintained by Martyn Plummer, Nassib Nassar, and Vin Everett, respectively), S.u.S.E. 5.3 i386 packages by Albrecht Gebhardt, and RPMs for the ppc platform by Alex Buerkle. (Note that conversion between Debian and Red Hat using Debian's `alien'(1) tool unfortunately works only partially, as the systems use different versions or numbering for libreadline and libncurses.)

There are also `tar' distributions for NEXTSTEP on the i386 and m68k platforms in `bin/nextstep/i386' and `bin/nextstep/m68k'.

No other binary distributions have thus far been made publicly available.

Which Documentation Exists for R?
=================================

Online documentation for most of the functions and variables in R exists, and can be printed on-screen by typing `help(NAME)' (or `?NAME') at the R prompt, where NAME is the name of the topic help is sought for. (In the case of unary and binary operators and control-flow special forms, the name may need to be quoted.)

This documentation can also be made available as HTML, and as hardcopy via LaTeX; see *Note How Can R Be Installed?::. An up-to-date HTML version is always available for web browsing at `http://stat.ethz.ch/R/manual/'.

An R manual ("Notes on R: A Programming Environment for Data Analysis and Graphics") is currently being written, based on the "Notes on S-PLUS" by Bill Venables and David Smith. The current version can be obtained as `Rnotes.tgz' (LaTeX source) in a CRAN `doc' directory. Note that the "conversion" from S(-PLUS) to R is not complete yet.

Further documentation on R and the R API is currently being written.
In the absence of an R manual, documentation for S/S-PLUS (see *Note R and S::) can be used in combination with this FAQ (*note What Are the Differences between R and S?::.). We recommend W. N. Venables and B. D. Ripley (1997), "Modern Applied Statistics with S-PLUS. Second Edition". Springer, ISBN 0-387-98214-0. which has a home page at `http://www.stats.ox.ac.uk/pub/MASS2/' providing additional material, in particular `R' Complements which describe how to use the book with R. These complements provide both descriptions of some of the differences between R and S, and the modifications needed to run the examples in the book. More introductory books are P. Spector (1994), "An introduction to S and S-PLUS", Duxbury Press. A. Krause and M. Olsen (1997), "The Basics of S and S-PLUS", Springer. Last, but not least, Ross' and Robert's experience in designing and implementing R is described in: @article{, author = {Ross Ihaka and Robert Gentleman}, title = {R: A Language for Data Analysis and Graphics}, journal = {Journal of Computational and Graphical Statistics}, year = 1996, volume = 5, number = 3, pages = {299--314} } This is also the reference for R to use in publications. Which Mailing Lists Exist for R? ================================ Thanks to Martin Maechler , there are three mailing lists devoted to R. `r-announce' This list is for announcements about the development of R and the availability of new code. `r-devel' This list is for discussions about the future of R and pre-testing of new versions. It is meant for those who maintain an active position in the development of R. `r-help' The `main' R mailing list, for announcements about the development of R and the availability of new code, questions and answers about problems and solutions using R, enhancements and patches to the source code and documentation of R, comparison and compatibility with S and S-PLUS, and for the posting of nice examples and benchmarks. 
Note that the r-announce list is gatewayed into r-help, so you don't need to subscribe to both of them. Send email to to reach everyone on the r-help mailing list. To subscribe (or unsubscribe) to this list send `subscribe' (or `unsubscribe') in the BODY of the message (not in the subject!) to . Information about the list can be obtained by sending an email with `info' as its contents to . Subscription and posting to the other lists is done analogously, with `r-help' replaced by `r-announce' and `r-devel', respectively. It is recommended that you send mail to r-help rather than only to the R developers (who are also subscribed to the list, of course). This may save them precious time they can use for constantly improving R, and will typically also result in much quicker feedback for yourself. Of course, in the case of bug reports it would be very helpful to have code which reliably reproduces the problem. Also, make sure that you include information on the system and version of R being used. See *Note R Bugs:: for more details. Archives of the above three mailing lists are made available on the net on a monthly schedule via the `doc/mail/mail.html' file in CRAN. An HTML archive of the lists is available via `http://www.ens.gu.edu.au/robertk/R/'. The R Core Team can be reached at for comments and reports. What Is CRAN? ============= The "Comprehensive R Archive Network" (CRAN) is a collection of sites which carry identical material, consisting of the R distribution(s), the contributed extensions, documentation for R, and binaries.
The CRAN master site can be found at the URL `http://www.ci.tuwien.ac.at/R/' (Austria) and is currently being mirrored daily at `http://SunSITE.auc.dk/R/' (Denmark) `http://www.stat.unipg.it/pub/stat/statlib/R/CRAN/'(Italy) `ftp://ftp.u-aizu.ac.jp/pub/lang/R/CRAN/' (Japan) `ftp://dola.snu.ac.kr/pub/R/CRAN/' (Korea) `http://stat.ethz.ch/CRAN/' (Switzerland) `http://www.stats.bris.ac.uk/R/' (United Kingdom) `http://lib.stat.cmu.edu/R/CRAN/' (USA/Pennsylvania) `ftp://ftp.biostat.washington.edu/mirrors/R/CRAN/'(USA/Washington) `http://cran.stat.wisc.edu/' (USA/Wisconsin) Please use the CRAN site closest to you to reduce network load. From CRAN, you can obtain the latest official release of R, daily snapshots of R for Unix systems (copy of the current CVS tree), as gzipped and bzipped tar files or as two gzipped tar files (ready for 1.4M floppies), a wealth of additional contributed code, as well as prebuilt binaries for various operating systems (Linux, Nextstep, MacOS, MSWin) and pre-formatted help pages. CRAN also provides access to documentation on R, existing mailing lists and the R Bug Tracking system. To "submit" to CRAN, simply upload to `ftp://ftp.ci.tuwien.ac.at/incoming' and send an email to . *Note:* It is very important that you indicate the copyright (license) information (GPL, BSD, Artistic, ...) in your submission. R and S ******* What Is S? ========== S is a very high level language and an environment for data analysis and graphics. S was written by Richard A. Becker, John M. Chambers, and Allan R. Wilks of AT&T Bell Laboratories Statistics Research Department. The primary references for S are two books by the creators of S. * Richard A. Becker, John M. Chambers and Allan R. Wilks (1988), "The New S Language," Chapman & Hall, London. This book is often called the "*Blue Book*". * John M. Chambers and Trevor J. Hastie (1992), "Statistical Models in S," Chapman & Hall, London. This is also called the "*White Book*". 
There is a huge amount of user-contributed code for S, available at the S Repository (http://lib.stat.cmu.edu) at CMU. See the "Frequently Asked Questions about S" (http://lib.stat.cmu.edu/S/faq) for further information about S. What Is S-PLUS? =============== S-PLUS is a value-added version of S sold by Statistical Sciences, Inc. (now a division of Mathsoft, Inc.). S is a subset of S-PLUS, and hence anything which may be done in S may be done in S-PLUS. In addition S-PLUS has extended functionality in a wide variety of areas, including robust regression, modern non-parametric regression, time series, survival analysis, multivariate analysis, classical statistical tests, quality control, and graphics drivers. Add-on modules provide additional capabilities for wavelet analysis, spatial statistics, and design of experiments. See the MathSoft S-PLUS page (http://www.mathsoft.com/splus.html) for further information. What Are the Differences between R and S? ========================================= Lexical Scoping --------------- Whereas the developers of R have tried to stick to the S language as defined in "The New S Language" (Blue Book, see *Note What Is S?::), they have adopted the evaluation model of Scheme. This difference becomes manifest when *free* variables occur in a function. Free variables are those which are neither formal parameters (occurring in the argument list of the function) nor local variables (created by assigning to them in the body of the function). Whereas S (like C) by default uses *static* scoping, R (like Scheme) has adopted *lexical* scoping. This means the values of free variables are determined by a set of global variables in S, but in R by the bindings that were in effect at the time the function was created.
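The lookup rule for free variables can be sketched in R as follows. This is only an illustration: the names `make_scaler', `k' and `triple' are made up for this example and are not part of R.

```r
# Hypothetical example: 'k' is a free variable of the inner function.
make_scaler <- function(k) {
  # The returned function has one formal parameter, x. Within it, k is
  # neither a formal parameter nor a local variable, so k is free.
  function(x) x * k
}

k <- 100                  # a global variable with the same name
triple <- make_scaler(3)  # the inner function is created while k is bound to 3
result <- triple(2)       # R finds k = 3 in the defining environment
print(result)
```

Under R this prints 6, because the binding `k = 3' that was in effect when the inner function was created is used, not the global value 100. Following the rule stated above, S would consult the global `k' instead.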
Consider the following function:

     cube <- function(n) {
       sq <- function() n * n
       n * sq()
     }

Under S, `sq()' does not "know" about the variable `n' unless it is defined globally:

     S> cube(2)
     Error in sq():  Object "n" not found
     Dumped
     S> n <- 3
     S> cube(2)
     [1] 18

In R, the "environment" created when `cube()' was invoked is also looked in:

     R> cube(2)
     [1] 8

As a more "interesting" real-world problem, suppose you want to write a function which returns the density function of the r-th order statistic from a sample of size n from a (continuous) distribution. For simplicity, we shall use both the cdf and pdf of the distribution as explicit arguments. (Example compiled from various postings by Luke Tierney.)

The S-PLUS documentation for `call' basically suggests the following:

     dorder <- function(n, r, pfun, dfun) {
       f <- function(x) NULL
       con <- round(exp(lgamma(n + 1) - lgamma(r) - lgamma(n - r + 1)))
       PF <- call(substitute(pfun), as.name("x"))
       DF <- call(substitute(dfun), as.name("x"))
       f[[length(f)]] <-
         call("*", con,
              call("*", call("^", PF, r - 1),
                   call("*", call("^", call("-", 1, PF), n - r),
                        DF)))
       f
     }

Rather tricky, isn't it? The code uses the fact that in S, functions are just lists of special mode with the function body as the last argument, and hence does not work in R (one could make the idea work, though).

A version which makes heavy use of `substitute()' and seems to work under both S and R is

     dorder <- function(n, r, pfun, dfun) {
       con <- round(exp(lgamma(n + 1) - lgamma(r) - lgamma(n - r + 1)))
       eval(substitute(function(x) K * PF(x)^a * (1 - PF(x))^b * DF(x),
                       list(PF = substitute(pfun), DF = substitute(dfun),
                            a = r - 1, b = n - r, K = con)))
     }

(the `eval' is not needed in S).
However, in R there is a much easier solution:

     dorder <- function(n, r, pfun, dfun) {
       con <- round(exp(lgamma(n + 1) - lgamma(r) - lgamma(n - r + 1)))
       function(x) {
         con * pfun(x)^(r - 1) * (1 - pfun(x))^(n - r) * dfun(x)
       }
     }

This seems to be the "natural" implementation, and it works because the free variables in the returned function can be looked up in the defining environment (this is lexical scope).

Note that what you really need is the function *closure*, i.e., the body along with all variable bindings needed for evaluating it. Since in the above version, the free variables in the value function are not modified, you can actually use it in S as well if you abstract out the closure operation into a function `MC()' (for "make closure"):

     dorder <- function(n, r, pfun, dfun) {
       con <- round(exp(lgamma(n + 1) - lgamma(r) - lgamma(n - r + 1)))
       MC(function(x) {
            con * pfun(x)^(r - 1) * (1 - pfun(x))^(n - r) * dfun(x)
          },
          list(con = con, pfun = pfun, dfun = dfun, r = r, n = n))
     }

Given the appropriate definitions of the closure operator, this works in both R and S, and is much "cleaner" than a substitute/eval solution (or one which overrules the default scoping rules by using explicit access to evaluation frames, as is of course possible in both R and S).
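As a quick check of the closure-based `dorder()' under R, the sketch below repeats that definition and applies it to the standard uniform distribution. The choice of `punif'/`dunif', the sample size 5 and the name `f_med' are assumptions made purely for illustration.

```r
# The closure-based dorder() from the text, repeated so that this sketch
# is self-contained.
dorder <- function(n, r, pfun, dfun) {
  con <- round(exp(lgamma(n + 1) - lgamma(r) - lgamma(n - r + 1)))
  function(x) {
    con * pfun(x)^(r - 1) * (1 - pfun(x))^(n - r) * dfun(x)
  }
}

# Density of the median (r = 3) of n = 5 standard uniform observations.
f_med <- dorder(5, 3, punif, dunif)

# Here con = 5! / (2! * 2!) = 30, so the value at 0.5 is
# 30 * 0.5^2 * 0.5^2 * 1 = 1.875.
at_half <- f_med(0.5)

# A density should integrate to 1 over the support.
total <- integrate(f_med, 0, 1)$value

print(c(at_half = at_half, total = total))
```

The returned function still "remembers" `con', `pfun', `dfun', `r' and `n' after `dorder()' has returned, which is exactly the closure behavior described above.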
For R, `MC()' simply is

     MC <- function(f, env) f

(lexical scope!), a version for S is

     MC <- function(f, env = NULL) {
       env <- as.list(env)
       if (mode(f) != "function")
         stop(paste("not a function:", f))
       if (length(env) > 0 && any(names(env) == ""))
         stop(paste("not all arguments are named:", env))
       fargs <- if(length(f) > 1) f[1:(length(f) - 1)] else NULL
       fargs <- c(fargs, env)
       if (any(duplicated(names(fargs))))
         stop(paste("duplicated arguments:", paste(names(fargs)),
              collapse = ", "))
       fbody <- f[length(f)]
       cf <- c(fargs, fbody)
       mode(cf) <- "function"
       return(cf)
     }

Similarly, most optimization (or zero-finding) routines need some arguments to be optimized over and have other parameters that depend on the data but are fixed with respect to optimization. With R scoping rules, this is a trivial problem; simply make up the function with the required definitions in the same environment and scoping takes care of it. With S, one solution is to add an extra parameter to the function and to the optimizer to pass in these extras, which however can only work if the optimizer supports this (and typically, the builtin ones do not).

Lexical scoping allows using function closures and maintaining local state. A simple example (taken from Abelson and Sussman) is obtained by typing `demo(scoping)' at the R prompt. Further information is provided in the standard R reference "R: A Language for Data Analysis and Graphics" (*note Which Documentation Exists for R?::.) and a paper on "Lexical Scope and Statistical Computing" by Robert Gentleman and Ross Ihaka which can be obtained from the `doc/misc' directory of a CRAN site.

Lexical scoping also implies a further major difference. Whereas S stores all objects as separate files in a directory somewhere (usually `.Data' under the current directory), R does not. All objects in R are stored internally. When R is started up it grabs a very large piece of memory and uses it to store the objects.
R performs its own memory management of this piece of memory. Having everything in memory is necessary because it is not really possible to externally maintain all relevant "environments" of symbol/value pairs. This difference also seems to make R *much faster* than S. The down side is that if R crashes you will lose all the work for the current session. Saving and restoring the memory "images" (the functions and data stored in R's internal memory at any time) can be a bit slow, especially if they are big. In S this does not happen, because everything is saved in disk files and if you crash nothing is likely to happen to them. (In fact, one might conjecture that the S developers felt that the price of changing their approach to persistent storage just to accommodate lexical scope was far too expensive.) R is still in a beta stage, and may crash from time to time. Hence, for important work you should consider saving often (see *Note How Can I Save My Workspace?::). Other possibilities are logging your sessions, or have your R commands stored in text files which can be read in using `source()'. *Note:* If you run R from within Emacs (see *Note R and Emacs::), you can save the contents of the interaction buffer to a file and conveniently manipulate it using `ess-transcript-mode', as well as save source copies of all functions and data used. Models ------ There are some differences in the modeling code, such as * Whereas in S, you would use `lm(y ~ x^3)' to regress `y' on `x^3', in R, you have to insulate powers of numeric vectors (using `I()'), i.e., you have to use `lm(y ~ I(x^3))'. * The glm family objects are implemented differently in R and S. The same functionality is available but the components have different names. * Terms objects are stored differently. In S a terms object is an expression with attributes, in R it is a formula with attributes. The attributes have the same names but are mostly stored differently. 
The major difference in functionality is that a terms object is subscriptable in S but not in R. If you can't imagine why this would matter then you don't need to know. * Finally, in R `y~x+0' is an alternative to `y~x-1' for specifying a model with no intercept. Models with no parameters at all can be specified by `y~0'. Others ------ Apart from lexical scoping and its implications, R follows the S language definition in the Blue Book as much as possible, and hence really is an "implementation" of S. There are some intentional differences where the behavior of S is considered "not clean". In general, the rationale is that R should help you detect programming errors, while at the same time being as compatible as possible with S. Some known differences are the following. * In R, if `x' is a list, then `x[i] <- NULL' and `x[[i]] <- NULL' remove the specified elements from `x'. The first of these is incompatible with S, where it is a no-op. (Note that you can set elements to `NULL' using `x[i] <- list(NULL)'.) * In S, the functions named `.First' and `.Last' in the `.Data' directory can be used for customizing, as they are executed at the very beginning and end of a session, respectively. In R, the startup mechanism is as follows. R first sources the system startup file ``$RHOME'/library/base/R/Rprofile'. Then, it searches for a site-wide startup profile unless the command line option `--no-site-file' was given. The name of this file is taken from the value of the `RPROFILE' environment variable. If that variable is unset, the default is ``$RHOME'/etc/Rprofile'. Then, unless `--no-init-file' was given, R searches for a file called `.Rprofile' in the current directory or in the user's home directory (in that order) and sources it. It also loads a saved image from `.RData' in case there is one (unless `--no-restore' was specified). If needed, the functions `.First()' and `.Last()' should be defined in the appropriate startup profiles. 
* In R, `T' and `F' are just variables being set to `TRUE' and `FALSE', respectively, but are not reserved words as in S and hence can be overwritten by the user. (This helps e.g. when you have factors with levels "T" or "F".) Hence, when writing code you should always use `TRUE' and `FALSE'. * In R, `dyn.load()' can only load *shared libraries*, as created for example by `R SHLIB'. * Whereas in S, `abs(z)' is the same as `Mod(z)' for complex `z', in R you *must* use `Mod(z)', since `abs()' is a function of real numbers only. * In R, `attach()' currently only works for lists and data frames (not for directories). Also, you cannot attach at position 1. * Categories do not exist in R, and never will as they are deprecated now in S. Use factors instead. * In R, `For()' loops are not necessary and hence not supported. * In R, `assign()' uses the argument `envir=' rather than `where=' as in S. * The random number generators are different, and the seeds have different length. * R uses only double precision and so can only pass numeric arguments to C/FORTRAN subroutines as double * or DOUBLE PRECISION, respectively. * By default, `ls()' returns the names of the objects in the current (under R) and global (under S) environment, respectively. For example, given x <- 1; fun <- function() {y <- 1; ls()} then `fun()' returns `"y"' in R and `"x"' (together with the rest of the global environment) in S. * R allows for zero-extent matrices (and arrays, i.e., some elements of the `dim' attribute vector can be 0). This has been determined a useful feature as it helps reducing the need for special-case tests for empty subsets. For example, if `x' is a matrix, `x[, FALSE]' is not `NULL' but a "matrix" with 0 columns. Hence, such objects need to be tested for by checking whether their `length()' is zero (which works in both R and S), and not using `is.null()'. * Named vectors are vectors in R but not in S (e.g., `is.vector(c(a=1:3))' returns `FALSE' in S and `TRUE' in R). 
* Data frames are not considered as matrices in R (i.e., if `DF' is a data frame, then `is.matrix(DF)' returns `FALSE' in R and `TRUE' in S). * R by default uses treatment contrasts in the unordered case, whereas S uses the Helmert ones. This is a deliberate difference reflecting the opinion that treatment contrasts are more natural. * In R, the last argument (which corresponds to the right hand side) of an assignment function must be named `value'. E.g., `fun(a) <- b' is evaluated as `(fun<-)(a, value = b)'. * In S, `substitute' searches for names for substitution in the given expression in three places: the actual and the default arguments of the matching call, and the local frame (in that order). R looks in the local frame only, with the special rule to use a "promise" if a variable is not evaluated. Since the local frame is initialized with the actual arguments or the default expressions, this is usually equivalent to S, until assignment takes place. * In R, `eval(EXPR, sys.parent())' does not work. Instead, one should use `eval(EXPR, sys.frame(sys.parent()))', which also works in S. * In S, the index variable in a `for()' loop is local to the inside of the loop. In R it is local to the environment where the `for()' statement is executed. There are also differences which are not intentional, and result from missing or incorrect code in R. The developers would appreciate hearing about any deficiencies you may find (in a written report fully documenting the difference as you see it). Of course, it would be useful if you were to implement the change yourself and make sure it works. Is There Anything R Can Do that S-PLUS Cannot? ============================================== Since almost anything you can do in R has source code that you could port to S-PLUS with little effort, there will never be much you can do in R that you couldn't do in S-PLUS (or vanilla S) if you wanted to. (Note that using lexical scoping may simplify matters considerably, though.)
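One way lexical scoping can simplify matters is in optimization, where the data-dependent part of an objective function can be captured in a closure so that the optimizer only sees the argument to be optimized over. The following is a minimal sketch under stated assumptions: the helper `make_sse' and the data `obs' are hypothetical, and `optimize()' is used here as R's one-dimensional minimizer.

```r
# Hypothetical helper: returns the sum of squared deviations as a function
# of the location m only; the data x stay fixed in the enclosing environment.
make_sse <- function(x) {
  function(m) sum((x - m)^2)
}

obs <- c(1, 2, 3, 4, 10)   # made-up data for illustration
sse <- make_sse(obs)

# The optimizer needs no extra argument for passing the data along.
fit <- optimize(sse, interval = range(obs))

# The minimizer should be close to mean(obs), the least-squares location.
print(c(minimum = fit$minimum, mean = mean(obs)))
```

The design choice is that the objective takes exactly one argument, so it can be handed to any routine that minimizes a function of a single variable, whether or not that routine supports passing extra parameters.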
R offers several graphics features that S-PLUS does not, such as finer handling of line types, more convenient color handling (via palettes), gamma correction for color, fixing aspect ratios, and, most importantly, it allows mathematics in plot texts. Unfortunately, this feature still is mostly undocumented, but in a nutshell, "R has TeX inside". R Web Interfaces **************** *Rcgi* is a CGI WWW interface to R by Mark J Ray. Recent versions have the ability to use "embedded code": you can mix user input and code, allowing the HTML author to do anything from loading data sets to entering most of the commands for users without writing CGI scripts. Graphical output is possible in PostScript or GIF formats and the executed code is presented to the user for revision. Demo and download are available from `http://www.mth.uea.ac.uk/~h089/Rcgi/'. *Rweb* is developed and maintained by Jeff Banfield. The Rweb Home Page (http://www.math.montana.edu/Rweb) provides access to all three versions of Rweb--a simple text entry form that returns output and graphs, a more sophisticated Javascript version that provides a multiple window environment, and a set of point and click modules that are useful for introductory statistics courses and require no knowledge of the R language. All of the Rweb versions can analyze Web accessible datasets if a URL is provided. A paper on Rweb, providing a detailed explanation of the different versions of Rweb and an overview of how Rweb works, will soon appear in the Journal of Statistical Software (`http://www.stat.ucla.edu/journals/jss/'). R Add-On Packages ***************** Which Add-on Packages Exist for R? ================================== The R distribution comes with the following extra packages: *eda* Exploratory Data Analysis. Currently only contains functions for robust line fitting, and median polish and smoothing. *lqs* Resistant regression and covariance estimation. *modreg* MODern REGression: smoothing and local methods.
*mva* MultiVariate Analysis. Currently contains code for principal components, canonical correlations, hierarchical clustering, and metric multidimensional scaling. *stepfun* Code for dealing with step functions, including empirical cumulative distribution functions (`ecdf'). The following packages are available from the CRAN `src/contrib' area. *KernSmooth* Functions for kernel smoothing (and density estimation) corresponding to the book "Kernel Smoothing" by M. P. Wand and M. C. Jones, 1995. *RmSQL* An interface between R and the mSQL database system. *MASS* The main package from Venables and Ripley, "Modern Applied Statistics with S-Plus" (2nd edition). Contains all data sets. Some code does not work (yet) under R. Contained in `VR'. *acepack* ace (Alternating Conditional Expectations) and avas (Additivity and VAriance Stabilization for regression) for selecting regression transformations. *akima* An R implementation of the S-PLUS function `interp()'. *ash* Programs for 1D, 2D and 3D density estimation. *bindata* Generation of correlated artificial binary data. *boot* Functions and datasets for bootstrapping from the book "Bootstrap Methods and Their Applications" by A. C. Davison and D. V. Hinkley, 1997, Cambridge University Press. *bootstrap* Software (bootstrap, cross-validation, jackknife), data and errata for the book "An Introduction to the Bootstrap" by B. Efron and R. Tibshirani, 1993, Chapman and Hall. *cclust* Convex clustering methods, including k-means algorithm, on-line update algorithm (Hard Competitive Learning) and Neural Gas algorithm (Soft Competitive Learning) and calculation of several indexes for finding the number of clusters in a data set. *chron* A package for working with chronological objects (times and dates). *class* Functions for classification (k-nearest neighbor and LVQ). Contained in `VR'. *cluster* Functions for cluster analysis. *coda* Output analysis and diagnostics for Markov Chain Monte Carlo (MCMC) simulations.

*ctest*
     A collection of classical tests, including the Bartlett, Fisher, Kruskal-Wallis, Kolmogorov-Smirnov, and Wilcoxon tests.

*date*
     Functions for dealing with dates. The most useful of them accepts a vector of input dates in any of the forms `8/30/53', `30Aug53', `30 August 1953', ..., `August 30 53', or any mixture of these.

*e1071*
     Miscellaneous functions used at the Department of Statistics at TU Wien (E1071).

*event*
     Procedures for event history analysis.

*fracdiff*
     Maximum likelihood estimation of the parameters of a fractionally differenced ARIMA(p,d,q) model (Haslett and Raftery, Applied Statistics, 1989).

*funfits*
     An integrated set of functions for fitting curves and surfaces including thin plate splines, kriging and neural networks.

*gee*
     An implementation of the Liang/Zeger generalized estimating equation approach to GLMs for dependent data.

*gnlm*
     Generalized nonlinear regression models.

*growth*
     Normal theory repeated measurements models.

*integrate*
     Code for adaptive quadrature.

*leaps*
     A package which performs an exhaustive search for the best subsets of a given set of potential regressors, using a branch-and-bound algorithm, and also performs searches using a number of less time-consuming techniques.

*lme*
     Fit and compare Gaussian linear mixed-effects models.

*lmtest*
     A collection of tests on the assumptions of linear regression models from the book "The linear regression model under test" by W. Kraemer and H. Sonnberger (1986).

*locfit*
     Local Regression, likelihood and density estimation.

*logspline*
     Logspline density estimation.

*mclust*
     Model-based cluster analysis.

*mlbench*
     A collection of artificial and real-world machine learning benchmark problems, including the Boston housing data.

*multiv*
     Functions for hierarchical clustering, partitioning, bond energy algorithm, Sammon mapping, PCA and correspondence analysis.

*nnet*
     Software for single hidden layer perceptrons ("feed-forward neural networks") and for multinomial log-linear models.
     Contained in `VR'.

*oz*
     Functions for plotting Australia's coastline and state boundaries.

*pls*
     Univariate Partial Least Squares Regression.

*polymars*
     Polychotomous regression based on Multivariate Adaptive Regression Splines.

*polynom*
     A collection of functions to implement a class for univariate polynomial manipulations.

*principal.curve*
     Code for fitting a principal curve to a matrix of points in arbitrary dimension.

*pspline*
     Smoothing splines with penalties on order m derivatives.

*quadprog*
     For solving quadratic programming problems.

*quantreg*
     Compute regression quantiles and some related rank statistics.

*ratetables*
     US national and state mortality data (requires *survival4* and *date*).

*repeated*
     Models for non-normal repeated measurements.

*rmutil*
     Tools for repeated measurements.

*rpart*
     Recursive Partitioning.

*sfb*
     Functions and datasets related to the SFB `Adaptive Modeling'.

*sgeostat*
     An object-oriented framework for geostatistical modeling.

*sm*
     Software linked to the book "Applied Smoothing Techniques for Data Analysis: The Kernel Approach with S-PLUS Illustrations" by A. W. Bowman and A. Azzalini.

*spatial*
     Interface to FORTRAN functions for universal kriging; K-fn and pseudo-likelihood analyses of spatial point patterns. Contains several spatial datasets. Contained in `VR'.

*splines*
     Functions and classes for defining B-spline representations or polynomial spline representations of regression splines or interpolation splines.

*stable*
     Density, distribution, quantile and hazard functions of a stable variate; generalized linear models for the parameters of a stable distribution.

*survival4*
     Functions for survival analysis (requires *splines*).

*tree*
     Classification and regression trees.

*tripack*
     A constrained two-dimensional Delaunay triangulation package.

*xgobi*
     Interface to the XGobi program for graphical data analysis.

See CRAN `src/contrib/INDEX' for more information.

There is also a CRAN `src/contrib/Devel' directory which contains packages still "under development" or depending on features only present in the current development versions of R. Volunteers are invited to give these a try, of course. This area of CRAN currently contains

*dopt*
     Finding D-optimal experimental designs.

*dse*
     A multivariate time series package (Dynamic Systems Estimation, DSE) by Paul Gilbert which implements an object oriented approach to time series models (using classes and methods in R/S). The package provides state-space models and the Kalman filter, VARMA and cointegration models, and numerical differentiation. It also contains Troll models as another class (the Troll interface is not yet functional in R). For further information see `http://www.bank-banque-canada.ca/pgilbert'.

*mda*
     Code for mixture discriminant analysis (MDA), flexible discriminant analysis (FDA), penalized discriminant analysis (PDA), multivariate additive regression splines (MARS), adaptive back-fitting splines (BRUTO), and penalized regression.

*nls*
     Nonlinear regression routines for R.

*timeslab*
     Time series routines.

Harald Fekjaer has written *addreg*, a package for additive hazards regression, which can be obtained from `http://www.med.uio.no/imb/stat/addreg/'.

More code has been posted to the r-help mailing list, and can be obtained from the mailing list archive.

How Can Add-on Packages Be Installed?
=====================================

(Unix only.) The add-on packages on CRAN come as gzipped tar files (which may contain more than one package). First "unpack" the files of interest. If you have GNU tar, you can use `tar zxf NAME', otherwise you can use `gunzip -c NAME | tar xf -'.

Let PKGDIR_1, ..., PKGDIR_N be the (relative or absolute) path names of the packages to be installed. (In the simplest case, the unpacking creates a single package directory, and its name is used.)
To install to the default R directory tree (the `library' subdirectory of `RHOME'), type

     $ R INSTALL PKGDIR_1 ... PKGDIR_N

at the shell prompt. To install to another tree (e.g., your private one), use

     $ R INSTALL -l LIB PKGDIR_1 ... PKGDIR_N

where LIB gives the path to the library tree to install to.

You can use several library trees of add-on packages. The easiest way to tell R to use these is via the environment variable `RLIBS' which should be a colon-separated list of directories at which R library trees are rooted. You do not have to specify the default tree in `RLIBS'. E.g., to use a private tree in `$HOME/lib/R' and a public site-wide tree in `/usr/local/lib/R/site', put

     RLIBS="$HOME/lib/R:/usr/local/lib/R/site"; export RLIBS

into your (Bourne) shell profile.

How Can Add-on Packages Be Used?
================================

To find out which additional packages are available on your system, type

     library()

at the R prompt.

This produces something like

     Packages in `/home/me/lib/R':

     mystuff       My own R functions, nicely packaged and not documented

     Packages in `/usr/local/lib/R/library':

     MASS          Main package of Venables and Ripley's MASS
     acepack       ace() and avas() for selecting regression transformations
     base          The R base package
     class         Functions for classification
     cluster       Functions for clustering
     ctest         Classical Tests
     date          Functions for handling dates
     eda           Exploratory Data Analysis
     gee           Generalized Estimating Equation models
     lme           Linear mixed effects library
     locfit        Local Regression, Likelihood and Density Estimtion.
     lqs           Resistant Regression and Covariance Estimation
     modreg        Modern regression: smoothing and local methods
     mva           Classical Multivariate Analysis
     nnet          Software for feed-forward neural networks with a single
                   hidden layer and for multinomial log-linear models.
     splines       Regression Spline Functions and Classes
     stepfun       Step Functions, including Empirical Distributions
     survival4     Survival analysis (needs `splines')

You can "load" the installed package PKG by

     library(PKG)

You can then find out which functions it provides by typing one of

     help(package = PKG)
     library(help = PKG)

You can unload the loaded package PKG by

     detach("package:PKG")

How Can Add-on Packages Be Removed?
===================================

To remove the packages PKG_1, ..., PKG_N from the default library or the library LIB, do

     $ R REMOVE PKG_1 ... PKG_N

or

     $ R REMOVE -l LIB PKG_1 ... PKG_N

respectively.

How Can I Create an R Package?
==============================

A package consists of a subdirectory containing the files `DESCRIPTION', `INDEX', and `TITLE', and the subdirectories `R', `data', `exec', `man' and `src' (some of which can be missing).

The `DESCRIPTION' file contains basic information about the package in the following format:

     Package: e1071
     Version: 0.7-3
     Author: Compiled by Fritz Leisch.
     Description: Miscellaneous functions used at the Department of
       Statistics at TU Wien (E1071).
     Depends:
     License: GPL version 2 or later

The license field should contain an explicit statement or a well-known abbreviation (such as `GPL', `LGPL', `BSD' and `Artistic'), maybe followed by a reference to the actual license file. It is very important that you include this information! Otherwise, it may not even be legally correct for others to distribute copies of the package.

The `TITLE' file contains a line giving the name of the package and a brief description. `INDEX' contains a line for each sufficiently interesting object in the package, giving its name and a description (functions such as print methods not usually called explicitly might not be included). Note that you can automatically create this file using something like `R CMD Rdindex man/*.Rd > INDEX' provided that Perl is available on your system.

The `R' subdirectory contains code files.
The code files to be installed must start with a (lower- or uppercase) letter and have one of the extensions `.R', `.S', `.q', `.r', or `.s'. We recommend using `.R', as this extension seems not to be used by any other software. It should be possible to read in the files using `source()', so R objects must be created by assignments. Note that there need not be any connection between the name of the file and the R objects created by it. If necessary, one of these files (historically `zzz.R') should use `library.dynam()' *inside* `.First.lib()' to load compiled code.

The `man' subdirectory should contain documentation files for the objects in the package. The documentation files to be installed must also start with a (lower- or uppercase) letter and have the extension `.Rd' (the default) or `.rd'.

C or FORTRAN source and optionally a `Makefile' for the compiled code go in `src'. A sample `Makefile' can be found in the standard *eda* package.

The `data' subdirectory is for additional data files the package makes available for loading using `data()'. Currently, data files can have one of three types as indicated by their extension: plain R code (`.R' or `.r'), tables (`.tab', `.txt', or `.csv'), or `save()' images (`.RData' or `.rda').

Finally, `exec' could contain additional executables the package needs, typically Shell or Perl scripts. This mechanism is currently not used by any package, and is still experimental.

See the documentation for `library()' for more information.

The web page `http://www.biostat.washington.edu/~thomas/Rlib.html' maintained by Thomas Lumley provides information on porting S packages to R.

*Note What Is CRAN?:: for information on uploading a package to CRAN.

How Can I Contribute to R?
==========================

R is currently still in alpha (or pre-alpha) state, so simply using it and communicating problems is certainly of great value.

One place where functionality is still missing is the modeling software as described in "Statistical Models in S" (see *Note What Is S?::). `gam' and the nonlinear modeling code are not there yet.

See also the `PROJECTS' file in the top level R source directory.

Many (more) of the packages available at the Statlib S Repository might be worth porting to R.

If you are interested in working on any of these projects, please notify Kurt Hornik.

R and Emacs
***********

Is there Emacs Support for R?
=============================

There is an Emacs-Lisp interface for interactive statistical programming and data analysis called ESS ("Emacs Speaks Statistics"). Languages supported include: S dialects (S 3/4, S-PLUS 3.x, and R), LispStat dialects (XLispStat, ViSta), and SAS. Stata and SPSS dialect (SPSS, PSPP) support is being examined for possible future implementation (a preliminary Stata mode is distributed).

ESS grew out of the desire for bug fixes and extensions to S-mode 4.8 (which was a GNU Emacs interface to S/S-PLUS version 3 only). In particular, XEmacs support as well as extensions to incorporate R were desired. In addition, with new modes being developed for R, Stata, and SAS, it was felt that providing for a unifying framework would eliminate differences in the user interface, as well as provide for faster development of production tools and statistical analysis.

Version 5.0 has, for its guts, the basic framework from S-mode. However, it has been cleaned, streamlined, brought closer to conformance as a standard GNU Emacs package, and redesigned for modularity and reuse.

R support contains code for editing R source code (syntactic indentation and highlighting of source code, partial evaluations of code, loading and error-checking of code, and source code revision maintenance) and documentation (including sending examples to a running R process and previewing), interacting with an inferior R process from within Emacs (command-line editing, searchable command history, command-line completion of R object and file names, quick access to object and search lists, transcript recording, and an interface to the help system), and transcript manipulation (in particular for re-evaluating commands from transcript files).

The latest versions of ESS are available from `http://ess.stat.wisc.edu/pub/ESS/' or `ftp://ess.stat.wisc.edu/pub/ESS/', or via CRAN. The HTML version of the documentation can be found at `http://stat.ethz.ch/ESS/'.

ESS comes with detailed installation instructions.

Should I Run R from Within Emacs?
=================================

Yes, *definitely*. Inferior R mode provides a readline/history mechanism, object name completion, and syntax-based highlighting of the interaction buffer using Font Lock mode, as well as a very convenient interface to the R help system.

Of course, it also integrates nicely with the mechanisms for editing R source using Emacs. One can write code in one Emacs buffer and send all or parts of it for execution to R; this is helpful for both data analysis and programming. One can also seamlessly integrate with a revision control system, in order to maintain a log of changes in your programs and data, as well as to allow for the retrieval of past versions of the code.

In addition, it allows you to keep a record of your session, which can also be used for error recovery through the use of the transcript mode.

To specify command line arguments for the inferior R process, use `C-u M-x R' for starting R.
This prompts you for the arguments; in particular, you can increase the memory size this way (*note Why Does R Run out of Memory?::.).

R Miscellanea
*************

Why Does R Run out of Memory?
=============================

R (currently) uses a *static* memory model. This means that when it starts up, it asks the operating system to reserve a fixed amount of memory for it. The size of this chunk cannot be changed subsequently. Hence, it can happen that not enough memory was allocated, e.g., when trying to read large data sets into R.

In these cases, you should restart R with more memory available, using the command line options `--nsize' and `--vsize'. To understand these options, one needs to know that R maintains separate areas for fixed and variable sized objects. The first of these is allocated as an array of "cons cells" (Lisp programmers will know what they are, others may think of them as the building blocks of the language itself, parse trees, etc.), and the second are thrown on a "heap". The `--nsize' option can be used to specify the number of cons cells (each occupying 16 bytes) which R is to use (the default is 200000), and the `--vsize' option to specify the size of the vector heap in bytes (the default is 2 MB). Both options must either be integers or integers ending with `M', `K', or `k' meaning `Mega' (2^20), (computer) `Kilo' (2^10), or regular `kilo' (1000).

E.g., to read in a table of 5000 observations on 40 numeric variables, `R --vsize 6M' should do.

Note that the information on where to find vectors and strings on the heap is stored using cons cells. Thus, it may also be necessary to allocate more space for cons cells in order to perform computations with very "large" variable-size objects.

You can find out the current memory consumption (the proportion of heap and cons cells used) by typing `gc()' at the R prompt. This may help you in finding out whether to increase `--vsize' or `--nsize'.
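
As a rough check of the figures above, the arithmetic can be sketched in R. This is only an illustration: `parse_size()' is a hypothetical helper written for this example (it is not part of R), and the 8 bytes per numeric value is an assumption about double-precision storage.

```r
# Hypothetical helper (not part of R): turn a size such as "6M" into a number,
# following the suffix rules described above (M = 2^20, K = 2^10, k = 1000).
parse_size <- function(s) {
  last <- substr(s, nchar(s), nchar(s))
  mult <- switch(last, M = 2^20, K = 2^10, k = 1000, 1)
  digits <- if (mult == 1) s else substr(s, 1, nchar(s) - 1)
  as.numeric(digits) * mult
}

# Raw storage for 5000 observations on 40 numeric variables,
# assuming 8 bytes per double-precision value.
table_bytes <- 5000 * 40 * 8

heap_bytes <- parse_size("6M")   # vector heap requested by `R --vsize 6M'
cons_bytes <- 200000 * 16        # default number of cons cells, 16 bytes each

table_bytes / heap_bytes         # fraction of the heap taken by the raw data
```

Under these assumptions the raw data occupy about a quarter of a 6M heap, which leaves headroom for intermediate copies made while the table is being read.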
Note that following `gcinfo(TRUE)', automatic garbage collection always prints memory use statistics. As of version 0.62.3, R will tell you whether you ran out of cons or heap memory.

When using `read.table()', the memory requirements are in fact higher than anticipated, because the file is first read in as one long string which is then split again. Use `scan()' if possible in case you run out of memory when reading in a large table.

Why Does Sourcing a Correct File Fail?
======================================

R sometimes has problems parsing a file which does not end in a newline. This can happen for example when Emacs is used for editing the file and `next-line-add-newlines' is set to `nil'. To avoid the problem, either set `require-final-newline' to a non-`nil' value in one of your Emacs startup files, or make sure R-mode (*note Is there Emacs Support for R?::.) is used for editing R source files (which locally ensures this setting).

Earlier R versions had a similar problem when reading in data files, but this should have been taken care of now.

How Can I Set Components of a List to NULL?
===========================================

You can use

     x[i] <- list(NULL)

to set component `i' of the list `x' to `NULL', similarly for named components. Do not set `x[i]' or `x[[i]]' to `NULL', because this will remove the corresponding component from the list.

For dropping the row names of a matrix `x', it may be easier to use `rownames(x) <- NULL', similarly for column names.

How Can I Save My Workspace?
============================

`save.image()' saves the objects in the user's `.GlobalEnv' to the file `.RData' in the R startup directory. (This is also what happens after `q("yes")'.) Using `save.image(FILE)' one can save the image under a different name.

How Can I Clean Up My Workspace?
================================

To remove all objects in the currently active environment (typically `.GlobalEnv'), you can do

     rm(list = ls())

How Can I Get eval() and D() to Work?
=====================================

Strange things will happen if you use `eval(print(x), envir = e)' or `D(x^2, "x")'. The first one will either tell you that "`x'" is not found, or print the value of the wrong `x'. The other one will likely return zero if `x' exists, and an error otherwise.

This is because in both cases, the first argument is evaluated in the calling environment first. The result (which should be an object of mode `expression' or `call') is then evaluated or differentiated. What you (most likely) really want is obtained by "quoting" the first argument upon surrounding it with `expression()'. For example,

     R> D(expression(x^2), "x")
     2 * x

Although this behavior may initially seem to be rather strange, it is perfectly logical. The "intuitive" behavior could easily be implemented, but problems would arise whenever the expression is contained in a variable, passed as a parameter, or is the result of a function call. Consider for instance the semantics in cases like

     D2 <- function(e, n) D(D(e, n), n)

or

     g <- function(y) eval(substitute(y), sys.frame(sys.parent(n = 2)))
     g(a * b)

See the help pages for more examples.

Why Do My Matrices Lose Dimensions?
===================================

When a matrix with a single row or column is created by a subscripting operation, e.g., `row <- mat[2, ]', it is by default turned into a vector. In a similar way if an array with dimension, say, 2 x 3 x 1 x 4 is created by subscripting it will be coerced into a 2 x 3 x 4 array, losing the unnecessary dimension. After much discussion this has been determined to be a *feature*.

To prevent this happening, add the option `drop = FALSE' to the subscripting. For example,

     rowmatrix <- mat[2, , drop = FALSE]  # creates a row matrix
     colmatrix <- mat[, 2, drop = FALSE]  # creates a column matrix
     a <- b[1, 1, 1, drop = FALSE]        # creates a 1 x 1 x 1 array

The `drop = FALSE' option should be used defensively when programming.
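
The effect of dropping can be checked directly with `dim()'. The following is a minimal sketch; the matrix `mat' and the array `arr' are made-up examples, not objects from any package.

```r
mat <- matrix(1:12, nrow = 3)           # hypothetical 3 x 4 example matrix

row_vec <- mat[2, ]                     # dimensions dropped: a plain vector
row_mat <- mat[2, , drop = FALSE]       # kept as a 1 x 4 matrix
col_mat <- mat[, 2, drop = FALSE]       # kept as a 3 x 1 matrix

is.null(dim(row_vec))                   # TRUE: no dim attribute left
dim(row_mat)
dim(col_mat)

arr <- array(1:48, dim = c(2, 3, 2, 4)) # hypothetical 2 x 3 x 2 x 4 array
dim(arr[, , 1, ])                       # extent-1 dimension removed
dim(arr[, , 1, , drop = FALSE])         # extent-1 dimension kept
```

Code that later calls `nrow()', `ncol()' or matrix multiplication on the result is where the silently dropped dimension tends to cause trouble.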
For example, the statement

     somerows <- mat[index, ]

will return a vector rather than a matrix if `index' happens to have length 1, causing errors later in the code. It should probably be rewritten as

     somerows <- mat[index, , drop = FALSE]

How Does Autoloading Work?
==========================

R has a special environment called `.AutoloadEnv'. Using `autoload(NAME, PKG)', where NAME and PKG are strings giving the names of an object and the package containing it, stores some information in this environment. When R tries to evaluate NAME, it loads the corresponding package PKG and reevaluates NAME in the new package's environment.

Using this mechanism makes R behave as if the package was loaded, but does not occupy memory (yet).

See the help page for `autoload()' for a very nice example.

How Should I Set Options?
=========================

The function `options()' allows setting and examining a variety of global "options" which affect the way in which R computes and displays its results. The variable `.Options' holds the current values of these options, but should never directly be assigned to unless you want to drive yourself crazy--simply pretend that it is a "read-only" variable.

For example, given

     test1 <- function(x = pi, dig = 3) {
       oo <- options(digits = dig)
       on.exit(options(oo))
       cat(.Options$digits, x, "\n")
     }
     test2 <- function(x = pi, dig = 3) {
       .Options$digits <- dig
       cat(.Options$digits, x, "\n")
     }

we obtain:

     R> test1()
     3 3.14
     R> test2()
     3 3.141593

What is really used is the *global* value of `.Options', and using `options(OPT = VAL)' correctly updates it. Local copies of `.Options', either in `.GlobalEnv' or in a function environment (frame), are just silently disregarded.

How Do File Names Work in Windows?
==================================

As R uses C-style string handling, `\' is treated as an escape character, so that for example one can enter a newline as `\n'. When you really need a `\', you have to escape it with another `\'.
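
That an escaped backslash is stored as a single character can be verified with `nchar()'. This is a small sketch; the variable names `bs' and `nl' are chosen only for illustration.

```r
bs <- "\\"    # one backslash, typed with an escaping backslash
nl <- "\n"    # one newline character

nchar(bs)     # 1
nchar(nl)     # 1

cat("a", bs, "b", "\n", sep = "")   # prints a\b
```

Printing the string with `cat()' rather than `print()' shows the characters as stored, without the escapes that `print()' adds back.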
Thus, in filenames use something like `"c:\\data\\money.dat"'. You can also rotate `\' by 90 degrees (`"c:/data/money.dat"').

Why Does Plotting Give a Color Allocation Error?
================================================

Sometimes plotting, e.g., when running `demo(image)', results in "Error: color allocation error". This is an X problem, and only indirectly related to R. It occurs when applications started prior to R have used all the available colors. (How many colors are available depends on the X configuration; sometimes only 256 colors can be used.)

One application which is notorious for "eating" colors is Netscape. If the problem occurs when Netscape is running, try (re)starting it with either the `-no-install' (to use the default colormap) or the `-install' (to install a private colormap) option.

R Programming
*************

How Should I Write Summary Methods?
===================================

Suppose you want to provide a summary method for class `foo'. Then `summary.foo()' should not print anything, but return an object of class `summary.foo', *and* you should write a method `print.summary.foo()' which nicely prints the summary information and invisibly returns its object. This approach is preferred over having `summary.foo()' print summary information and return something useful, as sometimes you need to grab something computed by `summary()' inside a function or similar. In such cases you don't want anything printed.

How Can I Debug Dynamically Loaded Code?
========================================

According to Doug Bates, the secret of symbolic debugging of dynamically loaded code is to

   * Call the debugger on the R executable. In the shell, you can e.g. use `R -d gdb'; see below for debugging from within Emacs.

   * Start the R program.

   * At the R prompt, use `dyn.load()' to load your library.

   * Send an interrupt signal. Inside of GUD mode in Emacs, you send `C-c C-c'. This will put you back to the debugger prompt.

   * Set the breakpoints in your code.

   * Continue execution of R by typing `signal 0'.

When using GUD mode for debugging from within Emacs, you may find it most convenient to use the directory with your code in it as the current working directory and then make a symbolic link from that directory to the R executable (`R.binary'). That way `.gdbinit' can stay in the directory with the code and be used to set up the environment and the search paths for the source, e.g. as follows:

     set env RHOME /usr/lib/R
     set env R_PAPERSIZE letter
     set env R_PRINTCMD lpr
     dir /usr/lib/R/src/appl
     dir /usr/lib/R/src/main
     dir /usr/lib/R/src/nmath
     dir /usr/lib/R/src/unix

How Can I Inspect R Objects When Debugging?
===========================================

In the C implementation underlying R, all objects are so-called SEXPs (from Lisp's "S-expressions" which comes from "symbolic expression"), which are pointers to data structures called SEXPRECs. (See the file `src/include/Defn.h' in the R sources for the definition of the SEXPREC type.)

For example, let

     R> DF <- data.frame(a = 1:3, b = 4:6)

By setting a breakpoint at `do_get' and typing `get("DF")' at the R prompt, one can find out the address in memory of `DF', e.g.

     Value returned is $1 = (SEXPREC *) 0x40583e1c
     (gdb) p *$1
     $2 = {
       sxpinfo = {type = 19, obj = 1, named = 1, gp = 0, mark = 0,
         debug = 0, trace = 0, = 0},
       attrib = 0x40583e80,
       u = {
         vecsxp = {
           length = 2,
           type = {c = 0x40634700 "0>X@D>X@0>X@", i = 0x40634700,
             f = 0x40634700, z = 0x40634700, s = 0x40634700},
           truelength = 1075851272,
         },
         primsxp = {offset = 2},
         symsxp = {pname = 0x2, value = 0x40634700, internal = 0x40203008},
         listsxp = {carval = 0x2, cdrval = 0x40634700, tagval = 0x40203008},
         envsxp = {frame = 0x2, enclos = 0x40634700},
         closxp = {formals = 0x2, body = 0x40634700, env = 0x40203008},
         promsxp = {value = 0x2, expr = 0x40634700, env = 0x40203008}
       }
     }

(Debugger output reformatted for better legibility).

Using `PrintValue' one can "inspect" the values of the various elements of the SEXP, e.g.,

     (gdb) p PrintValue($1->attrib)
     $names
     [1] "a" "b"

     $row.names
     [1] "1" "2" "3"

     $class
     [1] "data.frame"

     $3 = void

(Make sure, however, to use `PrintValue' on SEXPs with the `obj' bit turned on only.)

To find out where exactly the corresponding information is stored, one needs to go "deeper":

     (gdb) set $a = $1->attrib

     (gdb) p $a->u.listsxp.tagval->u.symsxp.pname->u.vecsxp.type.c
     $4 = 0x405d40e8 "names"
     (gdb) p $a->u.listsxp.carval->u.vecsxp.type.s[1]->u.vecsxp.type.c
     $5 = 0x40634378 "b"
     (gdb) p $1->u.vecsxp.type.s[0]->u.vecsxp.type.i[0]
     $6 = 1
     (gdb) p $1->u.vecsxp.type.s[1]->u.vecsxp.type.i[1]
     $7 = 5

R Bugs
******

What Is a Bug?
==============

If R executes an illegal instruction, or dies with an operating system error message that indicates a problem in the program (as opposed to something like "disk full"), then it is certainly a bug. If you call `.Internal()', `.C()' or `.Fortran()' yourself (or in a function you wrote), you can always crash R by using wrong argument types (modes). This is not a bug.

Taking forever to complete a command can be a bug, but you must make certain that it was really R's fault. Some commands simply take a long time. If the input was such that you *know* it should have been processed quickly, report a bug. If you don't know whether the command should take a long time, find out by looking in the manual or by asking for assistance.

If a command you are familiar with causes an R error message in a case where its usual definition ought to be reasonable, it is probably a bug. If a command does the wrong thing, that is a bug. But be sure you know for certain what it ought to have done. If you aren't familiar with the command, or don't know for certain how the command is supposed to work, then it might actually be working right. Rather than jumping to conclusions, show the problem to someone who knows for certain.

Finally, a command's intended definition may not be best for statistical analysis. This is a very important sort of problem, but it is also a matter of judgment. Also, it is easy to come to such a conclusion out of ignorance of some of the existing features. It is probably best not to complain about such a problem until you have checked the documentation in the usual ways, feel confident that you understand it, and know for certain that what you want is not available. If you are not sure what the command is supposed to do after a careful reading of the manual, this indicates a bug in the manual. The manual's job is to make everything clear. It is just as important to report documentation bugs as program bugs. However, we know that the introductory documentation is seriously inadequate, so you don't need to report this.

If the online argument list of a function disagrees with the manual, one of them must be wrong, so report the bug.

How to Report a Bug
===================

When you decide that there is a bug, it is important to report it and to report it in a way which is useful. What is most useful is an exact description of what commands you type, starting with the shell command to run R, until the problem happens. Always include the version of R, machine, and operating system that you are using; type `version' in R to print this.

The most important principle in reporting a bug is to report *facts*, not hypotheses or categorizations. It is always easier to report the facts, but people seem to prefer to strain to posit explanations and report them instead. If the explanations are based on guesses about how R is implemented, they will be useless; we will have to try to figure out what the facts must have been to lead to such speculations. Sometimes this is impossible. But in any case, it is unnecessary work for us.

For example, suppose that on a data set which you know to be quite large the command

     R> data.frame(x, y, z, monday, tuesday)

never returns.
Do not report that `data.frame()' fails for large data sets. Perhaps it fails when a variable name is a day of the week. If this is so then when we got your report we would try out the `data.frame()' command on a large data set, probably with no day of the week variable name, and not see any problem. There is no way in the world that we could guess that we should try a day of the week variable name.

Or perhaps the command fails because the last command you used was a method for `"["()' that had a bug causing R's internal data structures to be corrupted and making the `data.frame()' command fail from then on. This is why we need to know what other commands you have typed (or read from your startup file).

It is very useful to try and find simple examples that produce apparently the same bug, and somewhat useful to find simple examples that might be expected to produce the bug but actually do not. If you want to debug the problem and find exactly what caused it, that is wonderful. You should still report the facts as well as any explanations or solutions.

Invoking R with the `--vanilla' option may help in isolating a bug. This ensures that the site profile and saved data files are not read.

On Unix systems a bug report can be generated using the function `bug.report()'. This automatically includes the version information and sends the bug to the correct address. Alternatively the bug report can be sent by email or submitted to the Web page at `http://r-bugs.biostat.ku.dk'.

Acknowledgments
***************

Of course, many many thanks to Robert and Ross for the R system, and to the package writers and porters for adding to it.

Special thanks go to Doug Bates, Peter Dalgaard, Paul Gilbert, Fritz Leisch, Jim Lindsey, Thomas Lumley, Martin Maechler, Brian D. Ripley, Anthony Rossini, and Andreas Weingessel for their comments which helped me improve this FAQ.

More to come soon ...