% -*- mode: ess-noweb; ess-noweb-code-mode: R-mode -*- \documentclass[11pt]{article} \usepackage{hyperref} \usepackage[headings]{fullpage} \usepackage{verbatim} \usepackage{noweb} % This is a minor modification to the new verbatim environment that uses % the same size as noweb's code size and indents the same amount at % noweb's code chunks. I redefine verbatim instead of defining my own % environment since the html converter only seems to understand % verbatim, not new definitions. \makeatletter \addto@hook{\every@verbatim}{\nowebsize\setlength{\leftmargin}{50mm}} \def\verbatim@processline{\hspace{\codemargin}\the\verbatim@line\par} \makeatother % The following try to prevent wasteful page breaks \def\nwendcode{\endtrivlist \endgroup} \let\nwdocspar=\par \pagestyle{noweb} \bibliographystyle{plain} \noweboptions{noidentxref,longchunks,smallcode} \title{A Byte Code Compiler for R} \author{Luke Tierney\\ Department of Statistics and Actuarial Science\\ University of Iowa} \begin{document} \maketitle This document presents the current implementation of the byte code compiler for R. The compiler produces code for a virtual machine that is then executed by a virtual machine runtime system. The virtual machine is a stack based machine. Thus instructions for the virtual machine take arguments off a stack and may leave one or more results on the stack. Byte code objects consists of an integer vector representing instruction opcodes and operands, and a generic vector representing a constant pool. The compiler is implemented almost entirely in R, with just a few support routines in C to manage compiled code objects. The virtual machine instruction set is designed to allow much of the interpreter internals to be re-used. In particular, for now the mechanism for calling functions of all types from compiled code remains the same as the function calling mechanism for interpreted code. There are opportunities for efficiency improvements through using a different mechanism for calls from compiled functions to compiled functions, or changing the mechanism for both interpreted and compiled code; this will be explored in future work. The style used by the compiler for building up code objects is imperative: A code buffer object is created that contains buffers for the instruction stream and the constant pool. Instructions and constants are written to the buffer as the compiler processes an expression tree, and at the end a code object is constructed. A more functional design in which each compiler step returns a modified code object might be more elegant in principle, but it would be more difficult to make efficient. A multi-pass compiler in which a first pass produces an intermediate representation, subsequent passes optimize the intermediate representation, and a final pass produces actual code would also be useful and might be able to produce better code. A future version of the compiler may use this approach. But for now to keep things simple a single pass is used. %% **** Some peephole optimization is probably possible, and at least %% **** some constant folding could be done on the bytecode, but more %% **** sophisticated optimizations like inlining or R code would require %% **** a more suitable intermediate representation. %% **** I _think_ conversion from stack-based byte code to register-based %% **** code is reasonably straight forward but I haven't thought it %% **** through thoroughly yet. \section{The compiler interface} The compiler can be used either explicitly by calling certain functions to carry out compilations, or implicitly by enabling compilation to occur automatically at certain points. \subsection{Explicit compilation} The primary functions for explicit compilation are [[compile]], [[cmpfun]], and [[cmpfile]]. The [[compile]] function compiles an expression and returns a byte code object, which can then be passed to [[eval]]. A simple example is \begin{verbatim} > library(compiler) > compile(quote(1+3)) > eval(compile(quote(1+3))) [1] 4 \end{verbatim} A closure can be compiled using [[cmpfun]]. If the function [[f]] is defined as \begin{verbatim} f <- function(x) { s <- 0.0 for (y in x) s <- s + y s } \end{verbatim} then a compiled version is produced by \begin{verbatim} fc <- cmpfun(f) \end{verbatim} We can then compare the performance of the interpreted and compiled versions: \begin{verbatim} > x <- as.double(1 : 10000000) > system.time(f(x)) user system elapsed 6.470 0.010 6.483 > system.time(fc(x)) user system elapsed 1.870 0.000 1.865 \end{verbatim} A source file can be compiled with [[cmpfile]]. For now, the resulting file has to then be loaded with [[loadcmp]]. In the future it may make sense to allow [[source]] to either load a pre-compiled file or to optionally compile while sourcing. \subsection{Implicit compilation} Implicit compilation can be used to compile packages as they are installed or for just-in-time (JIT) compilation of functions or expressions. The mechanism for enabling these is experimental and likely to change. For now, compilation of packages requires the use of lazy loading and can be enabled either by calling [[compilePKGS]] with argument [[TRUE]] or by starting R with the environment variable [[_R_COMPILE_PKGS_]] set to a positive integer value. These settings are used internally during R build to compile the base package (and tools, utils, methods, etc) and by [[R CMD INSTALL]]. Functions are compiled as they are written to the lazy loading database. Compilation of packages should only be enabled for that time, because it adds noticeable time and space overhead to any serialization. In a UNIX-like environment, for example, installing a package with \begin{verbatim} env R_COMPILE_PKGS=1 R CMD INSTALL foo.tar.gz \end{verbatim} will internally enable package compilation using [[compilePKGS]]. If R is installed from source then the base and required packages can be compiled on installation using \begin{verbatim} make bytecode \end{verbatim} This does not require setting the [[_R_COMPILE_PKGS_]] environment variable. JIT compilation can be enabled from within R by calling [[enableJIT]] with a non-negative integer argument or by starting R with the environment variable [[R_ENABLE_JIT]] set to a non-negative integer. The possible values of the argument to [[enableJIT]] and their meanings are \begin{itemize} \item[0] turn off JIT \item[1] compile closures before they are called the first time \item[2] same as 1, plus compile closures before duplicating (useful for packages that store closures in lists, like lattice) \item[3] same as 2, plus compile all [[for()]], [[while()]], and [[repeat()]] loops before executing. \end{itemize} R may initially be somewhat sluggish if JIT is enabled and base and recommended packages have not been pre-compiled as almost everything will initially need some compilation. \section{The basic compiler} This section presents the basic compiler for compiling R expressions to byte code objects. \subsection{The compiler top level} R expressions consist of function calls, variable references, and literal constants. To create a byte code object representing an R expression the compiler has to walk the expression tree and emit code for the different node types in encounters. The code emitted may depend on the environment in which the expression will be evaluated as well as various compiler option settings. The simplest function in the top level compiler interface is the function [[compile]]. This function requires an expression argument and takes three optional arguments: an environment, a list of options and source code reference. The default environment is the global environment. By default, the source reference argument is [[NULL]] and the source reference is taken from the [[srcref]] attribute of the expression argument. <<[[compile]] function>>= compile <- function(e, env = .GlobalEnv, options = NULL, srcref = NULL) { cenv <- makeCenv(env) cntxt <- make.toplevelContext(cenv, options) cntxt$env <- addCenvVars(cenv, findLocals(e, cntxt)) if (mayCallBrowser(e, cntxt)) ## NOTE: compilation will be attempted repeatedly e else if (is.null(srcref)) genCode(e, cntxt) else genCode(e, cntxt, loc = list(expr = e, srcref = srcref)) } @ %def compile The supplied environment is converted into a compilation environment data structure. This compilation environment and any options provided are then used to construct a compiler context. The function [[genCode]] is then used to generate a byte code object for the expression and the constructed compilation context. Compilation environments are described in Section \ref{sec:environments} and compiler contexts in Section \ref{sec:contexts}. The [[genCode]] function is defined as <<[[genCode]] function>>= genCode <- function(e, cntxt, gen = NULL, loc = NULL) { cb <- make.codeBuf(e, loc) if (is.null(gen)) cmp(e, cb, cntxt, setloc = FALSE) else gen(cb, cntxt) codeBufCode(cb, cntxt) } @ %def genCode [[genCode]] creates a code buffer, fills the code buffer, and then calls [[codeBufCode]] to extract and return the byte code object. In the most common case [[genCode]] uses the low level recursive compilation function [[cmp]], described in Section \ref{subsec:cmp}, to generate the code. For added flexibility it can be given a generator function that emits code into the code buffer based on the provided context. This is used in Section \ref{sec:loops} for compilation of loop bodies in loops that require an explicit loop context (and a long jump in the byte-code interpreter). \subsection{Basic code buffer interface} Code buffers are used to accumulate the compiled code and related constant values. A code buffer [[cb]] is a list containing a number of closures used to manipulate the content of the code buffer. In this section two closures are used, [[putconst]] and [[putcode]]. The closure [[cb$putconst]] is used to enter constants into the constant pool. It takes a single argument, an arbitrary R object to be entered into the constant pool, and returns an integer index into the pool. The [[cb$putcode]] closure takes an instruction opcode and any operands the opcode requires and emits them into the code buffer. The operands are typically constant pool indices or labels, to be introduced in Section \ref{sec:codebuf}. As an example, the [[GETVAR]] instruction takes one operand, the index in the constant pool of a symbol. The opcode for this instruction is [[GETVAR.OP]]. The instruction retrieves the symbol from the constant pool, looks up its value in the current environment, and pushes the value on the stack. If [[sym]] is a variable with value of a symbol, then code to enter the symbol in the constant pool and emit an instruction to get its value would be <>= ci <- cb$putconst(sym) cb$putcode(GETVAR.OP, ci) @ %def The complete code buffer implementation is given in Section \ref{sec:codebuf}. \subsection{The recursive code generator} \label{subsec:cmp} The function [[cmp]] is the basic code generation function. It recursively traverses the expression tree and emits code as it visits each node in the tree. Before generating code for an expression the function [[cmp]] attempts to determine the value of the expression by constant folding using the function [[constantFold]]. If constant folding is successful then [[contantFold]] returns a named list containing a [[value]] element. Otherwise it returns [[NULL]]. If constant folding is successful, then the result is compiled as a constant. Otherwise, the standard code generation process is used. %% **** comment on alternative of doing constant folding as an %% **** optimization on the bytecode or an intermediate representation? In the interpreter there are four types of objects that are not treated as constants, i.e. as evaluating to themselves: function calls of type [["language"]], variable references of type [["symbol"]], promises, and byte code objects. Neither promises nor byte code objects should appear as literals in code so an error is signaled for those. The language, symbol, and constant cases are each handled by their own code generators. %% **** promises do appear in the expressions generated by the %% **** interpreter for evaluating complex assignment expressions <>= if (typeof(e) == "language") cmpCall(e, cb, cntxt) else if (typeof(e) == "symbol") cmpSym(e, cb, cntxt, missingOK) else if (typeof(e) == "bytecode") cntxt$stop(gettext("cannot compile byte code literals in code"), cntxt) else if (typeof(e) == "promise") cntxt$stop(gettext("cannot compile promise literals in code"), cntxt) else cmpConst(e, cb, cntxt) @ The function [[cmp]] is then defined as <<[[cmp]] function>>= cmp <- function(e, cb, cntxt, missingOK = FALSE, setloc = TRUE) { if (setloc) { sloc <- cb$savecurloc() cb$setcurexpr(e) } ce <- constantFold(e, cntxt) if (is.null(ce)) { <> } else cmpConst(ce$value, cb, cntxt) if (setloc) cb$restorecurloc(sloc) } @ %def cmp The call code generator [[cmpCall]] will recursively call [[cmp]]. %% **** should promises/byte code produce compiler errors or runtime errors?? \subsection{Compiling constant expressions} The constant code generator [[cmpConst]] is the simplest of the three generators. A simplified generator can be defined as <>= cmpConst <- function(val, cb, cntxt) { ci <- cb$putconst(val) cb$putcode(LDCONST.OP, ci) if (cntxt$tailcall) cb$putcode(RETURN.OP) } @ %def cmpConst This function enters the constant in the constant pool using the closure [[cb$putconst]]. The value returned by this closure is an index for the constant in the constant pool. Then the code generator emits an instruction to load the constant at the specified constant pool index and push it onto the stack. If the expression appears in tail position then a [[RETURN]] instruction is emitted as well. %% **** explain tail position here?? Certain constant values, such as [[TRUE]], [[FALSE]], and [[NULL]] appear very often in code. It may be useful to provide and use special instructions for loading these. The resulting code will have slightly smaller constant pools and may be a little faster, though the difference is likely to be small. A revised definition of [[cmpConst]] that makes use of instructions for loading these particular values is given by <<[[cmpConst]] function>>= cmpConst <- function(val, cb, cntxt) { if (identical(val, NULL)) cb$putcode(LDNULL.OP) else if (identical(val, TRUE)) cb$putcode(LDTRUE.OP) else if (identical(val, FALSE)) cb$putcode(LDFALSE.OP) else { ci <- cb$putconst(val) cb$putcode(LDCONST.OP, ci) } if (cntxt$tailcall) cb$putcode(RETURN.OP) } @ %def cmpConst It might be useful to handle other constants in a similar way, such as [[NA]] or small integer values; this may be done in the future. %% **** check out if small integers is worth doing. %% **** mention peephole optimization as alternative The implementation marks values in the constant pool as read-only after they are loaded. In the past, all values were duplicated as they were retrieved from the constant pool as a precaution against bad package code: several packages in the wild assumed that an expression [[TRUE]], for example, appearing in code would result in a freshly allocated value that could be freely modified in [[.C]] calls. \subsection{Compiling variable references} The function [[cmpSym]] handles compilation of variable references. For standard variables this involves entering the symbol in the constant pool, emitting code to look up the value of the variable at the specified constant pool location in the current environment, and, if necessary, emitting a [[RETURN]] instruction. In addition to standard variables there is the ellipsis variable [[...]] and the accessors [[..1]], [[..2]], and so on that need to be considered. The ellipsis variable can only appear as an argument in function calls, so [[cmp]], like the interpreter [[eval]] itself, should not encounter it. The interpreter signals an error if it does encounter a [[...]] variable, and the compiler emits code that does the same at runtime. The compiler also emits a warning at compile time. Variables representing formal parameters may not have values provided in their calls, i.e. may have missing values. In some cases this should signal an error; in others the missing value can be passed on (for example in expressions of the form [[x[]]]). To support this, [[cmpSym]] takes an optional argument for allowing missing argument values. <<[[cmpSym]] function>>= cmpSym <- function(sym, cb, cntxt, missingOK = FALSE) { if (sym == "...") { notifyWrongDotsUse("...", cntxt) cb$putcode(DOTSERR.OP) } else if (is.ddsym(sym)) { <> } else { <> } } @ %def cmpSym References to [[..n]] variables are also only appropriate when a [[...]] variable is available, so a warning is given if that is not the case. The virtual machine provides instructions [[DDVAL]] and [[DDVAL_MISSOK]] for the case where missing arguments are not allowed and for the case where they are, and the appropriate instruction is used based on the [[missingOK]] argument to [[cmpSym]]. <>= if (! findLocVar("...", cntxt)) notifyWrongDotsUse(sym, cntxt) ci <- cb$putconst(sym) if (missingOK) cb$putcode(DDVAL_MISSOK.OP, ci) else cb$putcode(DDVAL.OP, ci) if (cntxt$tailcall) cb$putcode(RETURN.OP) @ %def There are also two instructions available for obtaining the value of a general variable from the current environment, one that allows missing values and one that does not. <>= if (! findVar(sym, cntxt)) notifyUndefVar(sym, cntxt) ci <- cb$putconst(sym) if (missingOK) cb$putcode(GETVAR_MISSOK.OP, ci) else cb$putcode(GETVAR.OP, ci) if (cntxt$tailcall) cb$putcode(RETURN.OP) @ %def For now, these instructions only take an index in the constant pool for the symbol as operands, not any information about where the variable can be found within the environment. This approach to obtaining the value of variables requires a search of the current environment for every variable reference. In a less dynamic language it would be possible to compute locations of variable bindings within an environment at compile time and to choose environment representations that allow constant time access to any variable's value. Since bindings in R can be added or removed at runtime this would require a semantic change that would need some form of declaration to make legitimate. Another approach that may be worth exploring is some sort of caching mechanism in which the location of each variable is stored when it is first found by a full search, and that cached location is used until an event occurs that forces flushing of the cache. If such events are rare, as they typically are, then this may be effective. %% **** need to look into caching strategies %% **** looks like a simple cache of the local frame speeds up sum and %% **** Neal's em by about 10% (just lookup, not assignment -- with %% **** assignment should be a bit better) %% **** Is it really necessary for bcEval to save/restore stack tops? %% **** Shouldn't that happen automatically? %% **** Is it possible to have closure calling stay in the same bc? %% **** maybe at least for promises? \subsection{Compiling function calls} Conceptually, the R function calling mechanism uses lazy evaluation of arguments. Thus calling a function involves three steps: \begin{itemize} \item finding the function to call \item packaging up the argument expressions into deferred evaluation objects, or promises \item executing the call \end{itemize} Code for this process is generated by the function [[cmpCall]]. A simplified version is defined as <>= cmpCall <- function(call, cb, cntxt) { cntxt <- make.callContext(cntxt, call) fun <- call[[1]] args <- call[-1] if (typeof(fun) == "symbol") cmpCallSymFun(fun, args, call, cb, cntxt) else cmpCallExprFun(fun, args, call, cb, cntxt) } @ %def cmpCall Call expressions in which the function is represented by a symbol are compiled by [[cmpCallSymFun]]. This function emits a [[GETFUN]] instruction and then compiles the arguments. <<[[cmpCallSymFun]] function>>= cmpCallSymFun <- function(fun, args, call, cb, cntxt) { ci <- cb$putconst(fun) cb$putcode(GETFUN.OP, ci) <> } @ %def cmpCallSymFun The [[GETFUN]] instruction takes a constant pool index of the symbol as an operand, looks for a function binding to the symbol in the current environment, places it on the stack, and prepares the stack for handling function call arguments. %% **** need a fun cache and a var cache??? Argument compilation is carried out by the function [[cmpCallArgs]], presented in Section \ref{subsec:callargs}, and is followed by emitting code to execute the call and, if necessary, return a result. <>= cmpCallArgs(args, cb, cntxt) ci <- cb$putconst(call) cb$putcode(CALL.OP, ci) if (cntxt$tailcall) cb$putcode(RETURN.OP) @ %def The call expression itself is stored in the constant pool and is available to the [[CALL]] instruction. Calls in which the function is represented by an expression other than a symbol are handled by [[cmpCallExprFun]]. This emits code to evaluate the expression, leaving the value in the stack, and then emits a [[CHECKFUN]] instruction. This instruction checks that the value on top of the stack is a function and prepares the stack for receiving call arguments. Generation of argument code and the [[CALL]] instruction are handled as for symbol function calls. <<[[cmpCallExprFun]] function>>= cmpCallExprFun <- function(fun, args, call, cb, cntxt) { ncntxt <- make.nonTailCallContext(cntxt) cmp(fun, cb, ncntxt) cb$putcode(CHECKFUN.OP) <> } @ %def cmpCallExprFun The actual definition of [[cmpCall]] is a bit more complex than the simplified one given above: <<[[cmpCall]] function>>= cmpCall <- function(call, cb, cntxt, inlineOK = TRUE) { sloc <- cb$savecurloc() cb$setcurexpr(call) cntxt <- make.callContext(cntxt, call) fun <- call[[1]] args <- call[-1] if (typeof(fun) == "symbol") { if (! (inlineOK && tryInline(call, cb, cntxt))) { <> cmpCallSymFun(fun, args, call, cb, cntxt) } } else { <> cmpCallExprFun(fun, args, call, cb, cntxt) } cb$restorecurloc(sloc) } @ %def cmpCall The main addition is the use of a [[tryInline]] function which tries to generate more efficient code for particular functions. The [[inlineOK]] argument can be used to disable inlining. This function returns [[TRUE]] if it has handled code generation and [[FALSE]] if it has not. Code will be generated by the inline mechanism if inline handlers for the particular function are available and the optimization level permits their use. Details of the inlining mechanism are given in Section \ref{sec:inlining}. In addition to the inlining mechanism, some checking of the call is carried out for symbol calls. The checking code is <>= if (findLocVar(fun, cntxt)) notifyLocalFun(fun, cntxt) else { def <- findFunDef(fun, cntxt) if (is.null(def)) notifyUndefFun(fun, cntxt) else checkCall(def, call, function(w) notifyBadCall(w, cntxt)) } @ and [[checkCall]] is defined as <<[[checkCall]] function>>= ## **** figure out how to handler multi-line deparses ## **** e.g. checkCall(`{`, quote({})) ## **** better design would capture error object, wrap it up, and pass it on ## **** use approach from codetools to capture partial argument match ## **** warnings if enabled? checkCall <- function(def, call, signal = warning) { if (typeof(def) %in% c("builtin", "special")) def <- args(def) if (typeof(def) != "closure" || any.dots(call)) NA else { msg <- tryCatch({match.call(def, call); NULL}, error = function(e) conditionMessage(e)) if (! is.null(msg)) { emsg <- gettextf("possible error in '%s': %s", deparse(call, 20)[1], msg) if (! is.null(signal)) signal(emsg) FALSE } else TRUE } } @ %def checkCall Finally, for calls where the function is an expression a hack is currently needed for dealing with the way the parser currently parses expressions of the form [[break()]] and [[next()]]. To be able to compile as many [[break]] and [[next]] calls as possible as simple [[GOTO]] instructions these need to be handled specially to avoid placing things on the stack. A better solution would probably be to modify the parser to make expressions of the form [[break()]] be syntax errors. <>= ## **** this hack is needed for now because of the way the ## **** parser handles break() and next() calls if (typeof(fun) == "language" && typeof(fun[[1]]) == "symbol" && as.character(fun[[1]]) %in% c("break", "next")) return(cmp(fun, cb, cntxt)) @ \subsection{Compiling call arguments} \label{subsec:callargs} Function calls can contain four kinds of arguments: \begin{itemize} \item missing arguments \item [[...]] arguments \item general expressions \end{itemize} In the first and third cases the arguments can also be named. The argument compilation function [[cmpCallArgs]] loops over the argument lists and handles each of the three cases, in addition to signaling errors for arguments that are literal bytecode or promise objects: <<[[cmpCallArgs]] function>>= cmpCallArgs <- function(args, cb, cntxt) { names <- names(args) pcntxt <- make.promiseContext(cntxt) for (i in seq_along(args)) { a <- args[[i]] n <- names[[i]] <> <> <> <> } } @ %def cmpCallArgs The missing argument case is handled by <>= if (missing(a)) { ## better test for missing?? cb$putcode(DOMISSING.OP) cmpTag(n, cb) } @ %def Computations on the language related to missing arguments are tricky. The use of [[missing]] is a little odd, but for now at least it does work. An ellipsis argument [[...]] is handled by the [[DODOTS]] instruction: <>= else if (is.symbol(a) && a == "...") { if (! findLocVar("...", cntxt)) notifyWrongDotsUse("...", cntxt) cb$putcode(DODOTS.OP) } @ %def A warning is issued if no [[...]] argument is visible. As in [[cmp]], errors are signaled for literal bytecode or promise values as arguments. <>= else if (typeof(a) == "bytecode") cntxt$stop(gettext("cannot compile byte code literals in code"), cntxt) else if (typeof(a) == "promise") cntxt$stop(gettext("cannot compile promise literals in code"), cntxt) @ %def A general non-constant argument expression is compiled to a separate byte code object which is stored in the constant pool. The compiler then emits a [[MAKEPROM]] instruction that uses the stored code object. Promises are not needed for literal constant arguments as these are self-evaluating. Within the current implementation both the evaluation process and use of [[substitute]] will work properly if constants are placed directly in the argument list rather than being wrapped in promises. This could also be done in the interpreter, though the benefit is less clear as a runtime determination of whether an argument is a constant would be needed. This may still be cheap enough compared to the cost of allocating a promise to be worth doing. Constant folding in [[cmp]] may also produce more constants, but promises are needed in this case in order for [[substitute]] to work properly. These promises could be created as evaluated promises, though it is not clean how much this would gain. <>= else { if (is.symbol(a) || typeof(a) == "language") { ci <- cb$putconst(genCode(a, pcntxt, loc = cb$savecurloc())) cb$putcode(MAKEPROM.OP, ci) } else cmpConstArg(a, cb, cntxt) cmpTag(n, cb) } @ %def %% **** look into using evaluated promises for constant folded arguments %% **** then we would use a variant of this: % else { % ca <- constantFold(a, cntxt) % if (is.null(ca)) { % if (is.symbol(a) || typeof(a) == "language") { % ci <- cb$putconst(genCode(a, pcntxt)) % cb$putcode(MAKEPROM.OP, ci) % } % else % cmpConstArg(a, cb, cntxt) % } % else % cmpConstArg(ca$value, cb, cntxt) % cmpTag(n, cb) % } For calls to closures the [[MAKEPROM]] instruction retrieves the code object, creates a promise from the code object and the current environment, and pushes the promise on the argument stack. For calls to functions of type [[BULTIN]] the [[MAKEPROM]] instruction actually executes the code object in the current environment and pushes the resulting value on the stack. For calls to functions of type [[SPECIAL]] the [[MAKEPROM]] instruction does nothing as these calls use only the call expression. Constant arguments are compiled by [[cmpConstArg]]. Again there are special instructions for the common special constants [[NULL]], [[TRUE]], and [[FALSE]]. <<[[cmpConstArg]]>>= cmpConstArg <- function(a, cb, cntxt) { if (identical(a, NULL)) cb$putcode(PUSHNULLARG.OP) else if (identical(a, TRUE)) cb$putcode(PUSHTRUEARG.OP) else if (identical(a, FALSE)) cb$putcode(PUSHFALSEARG.OP) else { ci <- cb$putconst(a) cb$putcode(PUSHCONSTARG.OP, ci) } } @ %def cmpConstArg Code to install names for named arguments is generated by [[cmpTag]]: <<[[cmpTag]] function>>= cmpTag <- function(n, cb) { if (! is.null(n) && n != "") { ci <- cb$putconst(as.name(n)) cb$putcode(SETTAG.OP, ci) } } @ %def cmpTag The current implementation allocates a linked list of call arguments, stores tags in the list cells, and allocates promises. Alternative implementations that avoid some or all allocation are worth exploring. Also worth exploring is having an instruction specifically for calls that do not require matching of named arguments to formal arguments, since cases that use only order of arguments, not names, are quite common and are known at compile time. In the case of calls to functions with definitions known at compile time matching of named arguments to formal ones could also be done at compile time. \subsection{Discussion} The framework presented in this section, together with some support functions, is actually able to compile any legal R code. But this is somewhat deceptive. The R implementation, and the [[CALL]] opcode, support three kinds of functions: closures (i.e. R-level functions), primitive functions of type [[BUILTIN]], and primitive functions of type [[SPECIAL]]. Primitives of type [[BUILTIN]] always evaluate their arguments in order, so creating promises is not necessary and in fact the [[MAKEPROM]] instruction does not do so --- if the function to be called is a [[BUILTIN]] then [[MAKEPROM]] runs the code for computing the argument in the current environment and pushes the value on the stack. On the other hand, primitive functions of type [[SPECIAL]] use the call expression and evaluate bits of it as needed. As a result, they will be running interpreted code. Since core functions like the sequencing function [[{]] and the conditional evaluation function [[if]] are of type [[SPECIAL]] this means most non-trivial code will be run by the standard interpreter. This will be addressed by defining inlining rules that allow functions like [[{]] and [[if]] to be compiled properly. \section{The code buffer} \label{sec:codebuf} The code buffer is a collection of closures that accumulate code and constants in variables in their defining environment. For a code buffer [[cb]] the closures [[cb$putcode]] and [[cb$putconst]] write an instruction sequence and a constant, respectively, into the code buffer. The closures [[cb$code]] and [[cb$consts]] extract the code vector and the constant pool. The function [[make.codeBuf]] creates a set of closures for managing the instruction stream buffer and the constant pool buffer and returns a list of these closures for use by the compilation functions. In addition, the expression to be compiled into the code buffer is stored as the first constant in the constant pool; this can be used to retrieve the source code for a compiled expression. <<[[make.codeBuf]] function>>= make.codeBuf <- function(expr, loc = NULL) { <> <> <> <