JavaScript Object Notation (JSON) and R: The RJSONIO Package

Duncan Temple Lang
University of California at Davis
duncan@r-project.org
http://www.omegahat.org

Abstract: The JavaScript Object Notation (JSON) format is becoming widely used as a means of exchanging data. It is used in Web Services, client-server applications and, to some extent, as a means of serializing data for use between applications. We describe an R package that facilitates both importing data from JSON into R and exporting R objects as JSON content. The package is quite simple, but its architecture allows R programmers to customize several aspects of the computations.

Keywords: input/output, application-independent data format, JavaScript, Web Services
Things To Fix
  1. Character Encoding
  2. Base64 Encoding
JavaScript Object Notation

Notes: Purpose of JSON and the need to be able to deal with it in R. Common uses. REST, inserting objects into JavaScript code, e.g. HTML documents. ECMAScript and ActionScript for Flash. Comparison with XML: not a competition, just different purposes and strengths. Basics of the package: fromJSON and toJSON.

The RJSONIO package offers two primary operations: transforming JSON content into R objects and serializing R objects to JSON format. This allows us to import and export JSON from R. The two functions that do this are fromJSON and toJSON. fromJSON can read JSON content from a file, from a general connection, or from a string in memory. The latter is convenient when we obtain the JSON content from some other computation, such as a Web request, and the content is already in memory.

Before developing the RJSONIO package, we used rjson when serializing R objects into JavaScript code within HTML pages. This worked very well for small objects but was too slow for large objects. RJSONIO solves some of the speed issues, primarily by vectorizing the code that generates the content. We developed the RJSONIO package to be a direct substitute for rjson, so code that used the latter does not need to be changed to use RJSONIO. We can therefore think of RJSONIO as a second generation of rjson, with a focus on efficiency that was not warranted when rjson was being developed as a means to enable entirely new facilities for R.

RJSONIO also changes the approach used to parse JSON content into R. It uses a C library, libjson. This should yield two benefits. Firstly, it should be faster than pure interpreted R code. Secondly, it relies on code developed by others and used in other applications. The benefit of this is that we do not have to maintain it and we benefit from any updates as they are made in the libjson project.
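To make the two operations concrete, here is a minimal sketch of both directions (the JSON string is an invented example):

```r
library(RJSONIO)

# Parse JSON content held in a string in memory; asText = TRUE makes it
# explicit that the argument is content, not a file name.
x = fromJSON('{"a": [1, 2, 3], "b": "text"}', asText = TRUE)
# x is a named list with elements a and b.

# Serialize a named R list back to a JSON string.
json = toJSON(list(a = 1:3, b = "text"))
```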
Relying on libjson means that we also suffer from its deficiencies and bugs, and do not have the flexibility to design things as we want. The hope is that, because libjson is used in other projects, bugs will be identified by a larger audience than if we had developed the code ourselves for use only in R. Unfortunately, this may not be the case. Since we use libjson, it would appear we have an additional dependency that users must satisfy. However, to simplify installation, we have included a copy of the libjson code with the package. We use that copy only if we cannot find a version of libjson on the machine on which the source package is being installed. This means that R users can elect to use newer versions of libjson but do not have to. While RJSONIO acts as a direct replacement for rjson, it also offers additional features and controls.

The remainder of this paper is organized as follows. We start with a brief description of the simple but general JSON format. We then illustrate how to read JSON content into R in three different contexts: reading local JSON files from Kiva.org, parsing results from a request to the Twitter API, and interacting with CouchDB, a simple client-server database. Next we discuss how to serialize R objects in JSON format, and how one can customize the parsing of JSON content. We end with some notes about how the package could be extended and made more general.
The JSON Format

The JSON format is quite simple and reflects the basic data structures in JavaScript and other programming languages. We'll start with scalar values. Logical values are represented by true and false. There is no distinction in JSON between an integer and a real-valued number. Scientific notation, e.g. 123e10 and 123E-5, is supported. Strings are enclosed within pairs of ", i.e. the double-quote character. Arrays are ordered collections of values, with elements separated by a comma (,). Associative arrays have names for each of the elements. These are equivalent to named lists in R, although the order is not guaranteed. Regular arrays are enclosed by [ ] pairs, again with elements separated by a comma. Associative arrays are enclosed within { }. Each element is given in the form name: value. The name term should be enclosed within quotes. Not all JSON parsers insist on this (including libjson and hence RJSONIO), but it is good practice to ensure these names are quoted. Each element in an associative or regular array can be an arbitrary JSON object. This allows us to nest values so that we can have an array whose elements are arrays, associative arrays and scalar values. The final element of the format is the literal value null. It represents the null object in JavaScript and is a special constant object there, useful for comparing the value of a variable to this special state. In some senses, it corresponds to NULL in R. However, it might also map to an empty vector. The format is very simple. Note that it does not have support for mathematical terms such as infinity, pi or e. Nor does it have the notion of an NA, the missing value in R. Valid JSON requires that the top-level content be either an array or an object. This means that simple literal values such as "2" or 'abc' are not valid, but "[2]" and "{xyz: 'abc'}" are valid. How do we map null to a value in R? How do we map empty vectors in R to JSON? JSON is written as plain text.
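The following invented document exercises each of the constructs just described: scalars, a regular array, an associative array and nesting:

```r
library(RJSONIO)

# Scalars, regular arrays, and a nested associative array in one document.
json = '{
  "name": "sample",
  "values": [1, 2.5, 123e10],
  "flags": [true, false],
  "location": { "geo": [38.54, -121.74] }
}'

obj = fromJSON(json, asText = TRUE)
# obj is a named list; obj$location$geo is a numeric vector of length 2.
```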
It would appear that we cannot include binary content such as an image. There is, however, a way around this. We can take arbitrary binary content and convert it to text using base64 encoding, commonly used to include binary content in email messages. There are several implementations of functions that convert to and from base64 encoding in various R packages, including caTools, RCurl and readMzXmlData. While we can easily include binary content in JSON using base64 encoding, it is imperative that the consumer of that JSON content be aware that the content is base64-encoded and so can decode it appropriately. Unfortunately, JSON doesn't provide a standard or convenient mechanism for identifying such meta-data about elements of the content. Note also that JSON is valid JavaScript and can be evaluated directly by a JavaScript interpreter; this raises security concerns, since evaluating untrusted content can execute arbitrary code.
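As a sketch of the round trip, using the base64Encode and base64Decode functions from RCurl (the file name is hypothetical, and the "image" field name is our own convention, which the consumer must know about):

```r
library(RCurl)
library(RJSONIO)

# Read the raw bytes of a binary file.
f = "logo.png"
bytes = readBin(f, "raw", file.info(f)$size)

# Encode the bytes as base64 text so they can travel inside JSON.
enc = as.character(base64Encode(bytes))
json = toJSON(list(image = enc))

# The consumer must know the field holds base64 content and decode it.
decoded = base64Decode(fromJSON(json, asText = TRUE)$image, "raw")
```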
Examples
Reading Non-Rectangular Data

Many data sets come to us as rectangular tables made up of rows and columns, corresponding to observations each with the same number of variables. This works reasonably well, but is not rich enough for many more complex data structures. We may have repeated measurements for different observational units, and so not the same number of variables in each "row". For each observation, we might have hierarchical structures such as an address or location. We could collapse these into separate variables at the top level, but this might be a different format for different types of observational units. In short, we need a richer format to represent raw data before we project it into a rectangular format or data frame in R.

An example of a moderately complex data set is the dump of the Kiva database from Kiva.org. Kiva is a non-profit organization that connects lenders and borrowers on-line to provide micro-loans for people in developing countries. They make several details of loans, borrowers and lenders available both via a Web Service API and by serializing their database. They provide this serialization in both XML and JSON formats, and the data can be downloaded from their Web site. We download and extract the files, and this produces two directories, one for lenders and another for loans. Each of these contains a collection of files with the .json extension, numbered from 1 to the number of files in that directory. We'll look at the loan files first. We can read one of these files with

loans1 = fromJSON("loans/1.json")

The result in loans1 is a list with two elements. The first is named "header" and provides information about the contents of the file, e.g. the number of loans and the date it was serialized. The second element ("loans") contains the data for each loan. Strangely, there are 795 repeated elements, which we can identify by examining the "id" element.
This has nothing to do with JSON but with the way the data were dumped from the database; the same duplicates occur in the XML version. So we remove the duplicates with:

w = duplicated(sapply(loans1$loans, `[[`, "id"))
loans1$loans = loans1$loans[ ! w ]

Now we can look at each loan. We can look at the types of each element:

table(unlist(lapply(loans1$loans, function(x) sapply(x, class))))

So each loan has 5 lists, e.g. description, terms, location, borrowers. The location is made up of several fields identifying the town and country, and also the latitude and longitude in a separate list named "geo":

loans1$loans[[1]]$location

How we choose to represent and work with this data in R depends on what we want to do with it.

The lender files can be read in the same manner:

lenders1 = fromJSON("lenders/1.json")

The result is a list with two elements:

names(lenders1)

The header element gives us overview information about the contents of the file. The lenders element contains an element for each of the 1000 lenders described in the file. Each lender object is a list. There are many fields in common, but not all lender objects have all of the fields. We can see how often each field occurs with

sort(table(unlist(lapply(lenders1$lenders, names))))

Each lender object is actually quite simple. All of the variables except image are simple scalar values; image is a character vector of length 2. So the lender data have a simpler structure than the loans, and although not all lenders have all variables (a ragged array again), this example could quite easily be represented as a rectangular table, with NA values for the empty or missing cells.

Notes: Comparison with XML and speed for overall processing, or XPath to get sub-elements. XQuery also.
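Since all the per-lender fields except image are simple scalars, one way to project the ragged list into rectangular form is sketched below (reusing the lenders1 object read above):

```r
# Gather the union of field names across all lenders, then build one row
# per lender, filling NA where a lender lacks a field.
fields = unique(unlist(lapply(lenders1$lenders, names)))
fields = setdiff(fields, "image")  # image is a length-2 vector, not a scalar

rows = lapply(lenders1$lenders,
              function(x)
                sapply(fields,
                       function(f) if (is.null(x[[f]])) NA else x[[f]]))
lenders.df = as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
```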
Web Services

JSON is commonly used in Web Services, specifically as the result format in REST (Representational State Transfer) services. The idea is that we make an HTTP request to query the information we want, specifying a URL and possibly additional arguments to parameterize the request. Let's use the Twitter API as an example. Twitter allows us to query the 20 most recent public "statuses" or activities on Twitter. We send a request to the URL http://api.twitter.com/1/statuses/public_timeline. We can control the format of the result by appending one of the strings "xml", "json", "rss" or "atom", separated by a period.

url = "http://api.twitter.com/1/statuses/public_timeline"
txt = getURLContent(sprintf("%s.json", url))

This returns a string containing the JSON content. The object also has attributes that identify the content type ("application/json") and the character encoding. These are extracted from the header of the HTTP response. In older versions of RCurl, this was returned as a binary object; now, RCurl recognizes the content type "application/json" as text. Now that we have the JSON content as a string, we can convert it to R values via fromJSON:

tweets = fromJSON(txt, asText = TRUE)

We use the asText argument to ensure the function does not mistake the value for the name of a file. The function will typically guess correctly, but since we know we have the JSON content as a string, it is good practice to indicate this to fromJSON. The result is a list with twenty elements. Each element is also a list, with 19 named elements:

names(tweets[[1]])

The "user" element is also a list:

names(tweets[[1]]$user)

Notes: Compelling example: NYTimes? What others? Gloss over the RCurl requests.
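Having parsed the tweets, we can, for example, pull a couple of the fields into a data frame. This is a sketch assuming the tweets object from above; "text" and "screen_name" are field names in Twitter's status objects:

```r
# One row per tweet: the message text and the poster's screen name.
tweets.df = data.frame(
  text = sapply(tweets, `[[`, "text"),
  user = sapply(tweets, function(x) x$user[["screen_name"]]),
  stringsAsFactors = FALSE)
```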
CouchDB Mention that others have built an R-CouchDB interface.
Creating JSON Content from R

To this point, we have seen how we can consume or import JSON content in R. We now turn our attention to how we create JSON content from R and so export it to other applications. Basically, we want to generate text, stored as a string or written to a connection, that consists of JSON content. Any R programmer can create arbitrary JSON content using R commands such as paste, sprintf and cat together with character vectors or connections (including textConnection). We focus here, however, on serializing arbitrary R objects in JSON format so that the information can be restored within another JSON-enabled application.

The basic function that takes an R object and serializes it to a JSON string is toJSON. This function takes an R object and serializes its elements. Basically, it maps R vectors (logical, integer, numeric, character) to either a dictionary (an object, in JSON terms) or a regular array. If the R vector has names, we preserve these and use a dictionary:

x = c(a = 1, b = 10, c = 20)
toJSON(x)

There are occasions when we have names on an R object but want the resulting JSON value to be a simple array. We can use the .withNames parameter to control this. Passing a value of FALSE causes the names to be ignored and a regular array to be created, e.g.

x = c(a = 1, b = 10, c = 20)
toJSON(x, .withNames = FALSE)

There are methods for serializing various classes of R objects to JSON. These allow us to customize how particular R objects are translated. For example, a matrix in R is merely a one-dimensional vector with a dim attribute that allows R to treat it as a two- or more-dimensional object. As a result, by default it would be serialized as a single long vector in column-wise order. However, a matrix might be represented in JSON as an array of row arrays, i.e. a top-level container in which each element is itself a one-dimensional array for a given row. So we define a method to handle matrix objects in R.
It is defined as

setMethod("toJSON", "matrix",
function(x, container = length(x) > 1 || length(names(x)) > 0,
         collapse = "\n", ..., .level = 1L,
         .withNames = length(x) > 0 && length(names(x)) > 0)
{
   tmp = paste(apply(x, 1, toJSON), collapse = sprintf(",%s", collapse))
   if(!container)
     return(tmp)

   if(.withNames)
     paste("{", paste(dQuote(names(x)), tmp, sep = ": "), "}")
   else
     paste("[", tmp, "]")
})

With this defined, the code

toJSON(matrix(1:10, 5, 2))

yields an array of five row arrays.

toJSON and its methods could be extended to write to a connection. The default connection could be a textConnection, and if this was not specified (i.e. missing in the initial call), the string rather than the connection would be returned. This would allow us to avoid collecting the JSON text in memory for an entire object and to emit/flush content to a connection as it was generated. This would save memory and could be important for large objects.
Customizing the Parser We can provide our own handlers to process each element as it is encountered by the JSON parser. This is similar to the SAX style of parsing for XML.
Future Directions

At present, we omit/drop attributes on R objects when serializing to JSON. We use the length, dim and names attributes for vectors, but ignore them for other types of R objects, and ignore any other attributes entirely. To serialize attributes on R objects, we could use either an empty name or .Data for the data part of an object, and then "attributes" to identify the list of attributes.

We cannot serialize R functions easily to JavaScript, as they do not make a great deal of sense in that language. Instead, we can serialize the source code for a function as a string. This loses information, which is very important if the function has a non-standard environment. We have also experimented with approaches to translating R syntax to JavaScript code, and possibly R code to JavaScript, in an effort to simplify authoring JavaScript code for use in, e.g., Web pages.

The libjson parser expects the entire JSON content to be in memory when it starts to read, i.e. passed as a single string. We would like the parser to be able to read from a file or connection and access additional bytes of the input stream as it requires them. This would reduce the memory required, as we wouldn't have to load the file into memory ahead of time; instead, we would retain a smaller buffer of content that is being processed.

We have developed a bi-directional interface between R and the JavaScript interpreter SpiderMonkey used in the Mozilla/Firefox browser. This allows us to pass R objects to JavaScript and vice versa using C-level references. However, we can also transfer objects by value between the two languages using RJSONIO, with very little infrastructure.