Using Cookies for Connected Requests with <omg:pkg>RCurl</omg:pkg>

Duncan Temple Lang
University of California at Davis, Department of Statistics

The Problem

This is an example of using RCurl with cookies. It comes from a question on the R-help mailing list on Sep 18th, 2012. In a Web browser, we visit a data page on the www.wateroffice.ec.gc.ca Web site. Before allowing us access to the data, the Web site presents us with a disclaimer page. We have to click on the I Agree button and then we are forwarded to a page with the actual data. We want to read that using, for example, readHTMLTable in the XML package.

What happens when we click on the I Agree button? That sets a cookie. After that, we include that cookie in each request to that server and this confirms that we have agreed to the disclaimer. The Web server will process each request containing the cookie knowing we have agreed, and so give us the data.

So we first need to make a request in R that emulates clicking the I Agree button. We have to arrange for that request to recognize the cookie in the response and then use that cookie in all subsequent requests to that server. We could do this manually, but there is no need to. We simply use the same curl handle in all of the requests. In the first request, libcurl will process the response and retrieve the cookie. By using the same curl handle in subsequent requests, libcurl will automatically send the cookie in those requests.

We create the curl handle object with

    library(RCurl)
    curl = getCurlHandle(cookiefile = "", verbose = TRUE)

The empty string for cookiefile enables cookies in the handle without reading any from a file, and nothing is written to a file either. We could store the cookie in a file when the curl handle is deleted and then use it in subsequent sessions or in other curl handles. However, there is no need to do this, as we can just agree to the disclaimer each time. If we do want to store the cookie in a file (written when the curl handle is deleted), we can specify a file name as the value for the cookiejar option; the cookiefile option names a file from which cookies are read.

The disclaimer page is a form. We send the request to http://www.wateroffice.ec.gc.ca/include/disclaimer.php with the parameter named disclaimer_action and the value "I Agree".
We can get this information by reading the HTML of the disclaimer page and looking for the form element. Alternatively, we could use the RHTMLForms package. We can make the request with

    postForm("http://www.wateroffice.ec.gc.ca/include/disclaimer.php",
             disclaimer_action = "I Agree", curl = curl)

We can ignore the result as we just want the side-effect of getting the cookie in the curl handle.

We can now access the actual data at the original URL. We cannot use readHTMLTable directly on the URL, as that does not use our curl handle and so does not know about the cookie. Instead, we use getURLContent to get the content of the page, giving it the full URL (including the query string) and our curl handle. We can then pass this text to readHTMLTable. Personally, I prefer to use getForm and give the query parameters as separate arguments:

    txt = getForm("http://www.wateroffice.ec.gc.ca/graph/graph_e.html",
                  mode = "text", stn = "05ND012", prm1 = 3,
                  syr = "2012", smo = "09", sday = "15",
                  eyr = "2012", emo = "09", eday = "18",
                  curl = curl)

This makes it easier to change individual inputs. The result should contain the actual data.

    library(XML)
    tbl = readHTMLTable(txt, asText = TRUE)

We can find the number of rows and columns in each table with

    sapply(tbl, dim)

         dataTable hydroTable
    [1,]       852          1
    [2,]         2          4

We want the first one, dataTable, which has 852 rows and 2 columns. The columns are, by default, strings or factors. The numbers have a * on them. We can post-process the table to get the values:

    tbl = readHTMLTable(txt, asText = TRUE, which = 1, stringsAsFactors = FALSE)
    # Remove the "*" markers and convert the second column to numbers.
    tbl[[2]] = as.numeric(gsub("\\*", "", tbl[[2]]))
    # Convert the first column from strings to date-times.
    tbl[[1]] = strptime(tbl[[1]], "%Y-%m-%d %H:%M:%S")

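To make the post-processing step concrete without contacting the server, here is a minimal sketch that applies the same kind of cleanup to a few made-up rows. The data frame `raw`, its column names, the trailing position of the `*`, and the UTC time zone are all assumptions for illustration; they are not taken from the actual page. The sketch also keeps a logical `flagged` column and uses `as.POSIXct` rather than `strptime`, which is a design choice of this example, not something the code above does.

```r
# Hypothetical rows shaped like the scraped table: a timestamp column and a
# value column in which some readings carry a "*" marker (assumed trailing).
raw <- data.frame(
  time  = c("2012-09-15 00:00:00", "2012-09-15 00:05:00", "2012-09-15 00:10:00"),
  value = c("1.234*", "1.240", "1.251*"),
  stringsAsFactors = FALSE
)

clean <- raw
# Record which readings were marked before the "*" is stripped.
clean$flagged <- grepl("*", raw$value, fixed = TRUE)
# Remove the marker and convert the remaining text to numbers.
clean$value <- as.numeric(gsub("*", "", raw$value, fixed = TRUE))
# Parse the timestamps; the time zone is an assumption (UTC).
clean$time <- as.POSIXct(raw$time, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

str(clean)
```

Keeping the `flagged` column means the information carried by the `*` is not lost when the values become numeric.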
Using <omg:pkg>RHTMLForms</omg:pkg> to Find the Disclaimer Form

The RHTMLForms package can both read an HTML page and get a description of all of its forms, and also generate an R function corresponding to each form so that we can invoke the form as if it were a local function in R. We get the descriptions with

    library(RHTMLForms)
    forms = getHTMLFormDescription(u, FALSE)

Here u holds the URL of the page that presents the disclaimer form. We need to keep the buttons in the forms, hence the FALSE as the second argument to getHTMLFormDescription. We create the function for this form with

    fun = createFunction(forms[[1]])

We can invoke this function using the curl handle we created to capture the cookies:

    fun(.curl = curl)

This will agree to the disclaimer on our behalf.
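
If RHTMLForms is not installed, the form's target and parameter names can also be found with the XML package alone, by reading the page and looking for the form element. The sketch below parses an inline HTML string that is a hypothetical stand-in for the disclaimer page; the real markup may differ, and the "I Disagree" button in particular is an assumption. Only the action path, the parameter name disclaimer_action and the value "I Agree" come from the request shown earlier.

```r
library(XML)

# Hypothetical stand-in for the disclaimer page; the real markup may differ.
html <- '
<html><body>
  <form action="/include/disclaimer.php" method="post">
    <input type="submit" name="disclaimer_action" value="I Agree"/>
    <input type="submit" name="disclaimer_action" value="I Disagree"/>
  </form>
</body></html>'

doc  <- htmlParse(html, asText = TRUE)
form <- getNodeSet(doc, "//form")[[1]]

# Where the form is submitted, and with which HTTP method.
action <- xmlGetAttr(form, "action")
method <- xmlGetAttr(form, "method")

# Collect the name and value of each input, including the submit buttons.
inputs <- xpathApply(form, ".//input", xmlAttrs)
fields <- data.frame(
  name  = sapply(inputs, function(a) a[["name"]]),
  value = sapply(inputs, function(a) a[["value"]]),
  stringsAsFactors = FALSE
)

print(action)
print(fields)
```

The `action` and `fields` values are what we would pass to postForm as the URL and the named parameter.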