Using Cookies for Connected Requests with RCurl
Duncan Temple Lang
University of California at Davis
Department of Statistics
The Problem
This is an example of using RCurl with cookies.
This comes from a question on the R-help mailing list on Sep 18th,
2012.
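In a Web browser, we visit the page
http://www.wateroffice.ec.gc.ca/graph/graph_e.html
(with a query string that selects the station and the dates).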
Before allowing us access to the data, the Web site presents us with a disclaimer page.
We have to click on the I Agree button and then we are forwarded to a page with the actual
data. We want to read that using, for example, readHTMLTable in the
XML package.
What happens when we click on the I Agree button?
That sets a cookie. After that, we include that cookie in each request to that server
and this confirms that we have agreed to the disclaimer. The Web server will process
each request containing the cookie knowing we have agreed and so give us the data.
So we first need to make a request in R that emulates clicking the I Agree button.
We have to arrange for that request to recognize the cookie in the response and
then use that cookie in all subsequent requests to that server.
We could do this manually, but there is no need to.
We simply use the same curl object in all of the requests.
In the first request, libcurl will process the response and retrieve the cookie.
By using the same curl handle in subsequent requests, libcurl will automatically
send the cookie in those requests.
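For contrast, the manual route would mean pulling the cookie out of the Set-Cookie response header ourselves and attaching it to each later request. The following is only a rough sketch of that bookkeeping in base R: the header value and the cookie name are hypothetical (the real ones sent by the server are not shown here), and cookiePair is a helper name made up for illustration.

```r
# Hypothetical Set-Cookie header value; the real cookie name and
# attributes sent by the server are not shown in this document.
setCookie = "disclaimer=agree; path=/; HttpOnly"

# Illustrative helper: keep only the leading name=value pair,
# dropping attributes such as path.
cookiePair = function(header) trimws(sub(";.*$", "", header))

ck = cookiePair(setCookie)

# The pair could then be sent by hand on a later request through
# libcurl's cookie option, e.g. (requires RCurl and network access):
# getURLContent(url, cookie = ck)
```

Sharing one curl handle makes all of this unnecessary, which is why we take that approach.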
We create the curl handle object with
library(RCurl)
curl = getCurlHandle(cookiefile = "", verbose = TRUE)
This enables cookies in the handle, but does not arrange to write them to a file.
We could store the cookies in a file when the curl handle is deleted,
and then reuse them in later sessions or in other curl handles.
There is no need to do this here, since we can just agree to the disclaimer each time.
If we do want to store the cookies in a file, libcurl's cookiejar option names the file
to write when the handle is deleted. The cookiefile option names a file from which to read
cookies, and the empty string used above simply switches cookie handling on.
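As a minimal sketch of keeping the cookies between sessions: in libcurl's option set, cookiejar is the option that writes cookies out when the handle is cleaned up, while cookiefile names a file to read them from. The file location below is a hypothetical choice, and we have not needed this for the example in this document.

```r
library(RCurl)

# Hypothetical location for the cookie file.
jar = file.path(tempdir(), "wateroffice-cookies.txt")

# cookiefile: read any cookies already saved there (a missing file is fine).
# cookiejar: write the handle's cookies to this file when the handle
#            is cleaned up.
curl = getCurlHandle(cookiefile = jar, cookiejar = jar)
```

The file should appear once the handle is released, for example at the end of the session or after the handle has been removed and garbage collected.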
The disclaimer page is a form.
We send the request to http://www.wateroffice.ec.gc.ca/include/disclaimer.php
with the parameter named disclaimer_action and the value "I Agree".
We can get this information by reading the HTML source of the disclaimer page
and looking for the form element.
Alternatively, we could use the RHTMLForms package.
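Looking for the form element by hand can be sketched with the XML package. The markup below is a hypothetical stand-in for the disclaimer page (the real page's HTML is not reproduced here); only the action URL path, the parameter name and its value are taken from the request described above.

```r
library(XML)

# Hypothetical markup standing in for the disclaimer page.
html = '<html><body>
<form action="/include/disclaimer.php" method="post">
  <input type="submit" name="disclaimer_action" value="I Agree"/>
</form>
</body></html>'

doc = htmlParse(html, asText = TRUE)

# Where the form submits to.
action = xpathSApply(doc, "//form", xmlGetAttr, "action")

# Names and values of the inputs the form would send.
inputNames  = xpathSApply(doc, "//form//input", xmlGetAttr, "name")
inputValues = xpathSApply(doc, "//form//input", xmlGetAttr, "value")
```

On the real page we would pass the page content retrieved from the server rather than a literal string.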
We can make the request with
postForm("http://www.wateroffice.ec.gc.ca/include/disclaimer.php",
disclaimer_action = "I Agree", curl = curl)
We can ignore the result as we just want the side-effect of getting the cookie in the
curl handle.
We can now access the actual data at the original URL.
We cannot use readHTMLTable directly as
that does not use a curl handle, and does not know about the cookie.
Instead, we use getURLContent to get the content of the
page. We can then pass this text to readHTMLTable.
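So we make the request with getURLContent, passing the full URL, including its
query string, together with curl = curl.
Personally, I prefer to use getForm and give each parameter separately: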
txt = getForm("http://www.wateroffice.ec.gc.ca/graph/graph_e.html",
mode = "text", stn = "05ND012", prm1 = 3,
syr = "2012", smo = "09", sday = "15", eyr = "2012", emo = "09",
eday = "18", curl = curl)
This makes it easier to change individual inputs.
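For comparison, the single-URL form of the same request might look like the following sketch. buildQueryURL is a hypothetical helper written here in base R, not part of RCurl; it simply pastes the same parameters into a query string.

```r
# Hypothetical helper: build a query URL from named parameters (base R only).
buildQueryURL = function(base, ...) {
    params = list(...)
    values = sapply(params, function(x)
                      utils::URLencode(as.character(x), reserved = TRUE))
    query = paste(names(params), values, sep = "=", collapse = "&")
    paste(base, query, sep = "?")
}

u = buildQueryURL("http://www.wateroffice.ec.gc.ca/graph/graph_e.html",
                  mode = "text", stn = "05ND012", prm1 = 3,
                  syr = "2012", smo = "09", sday = "15",
                  eyr = "2012", emo = "09", eday = "18")

# With the curl handle created earlier (requires RCurl and network access):
# txt = getURLContent(u, curl = curl)
```

Changing the station or a date then means editing one argument rather than a long URL string, which is the same convenience getForm gives us.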
The result should contain the actual data.
library(XML)
tbl = readHTMLTable(txt, asText = TRUE)
We can find the number of rows and columns in each table with
sapply(tbl, dim)
     dataTable hydroTable
[1,]       852          1
[2,]         2          4
We want the first one.
The columns are, by default, strings or factors.
The numbers have a * on them. We can post-process this
to get the values.
tbl = readHTMLTable(txt, asText = TRUE, which = 1,
stringsAsFactors = FALSE)
tbl[[2]] = as.numeric(gsub("\\*", "", tbl[[2]]))
tbl[[1]] = strptime(tbl[[1]], "%Y-%m-%d %H:%M:%S")
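The same clean-up can be tried on a small stand-in table without contacting the server. The column names and values below are made up purely for illustration. This sketch also uses as.POSIXct rather than strptime, an alternative that yields a compact date-time vector, and sets the time zone to UTC only as an assumption.

```r
# Toy stand-in for the downloaded table; the values are made up.
toy = data.frame(Date  = c("2012-09-15 00:00:00", "2012-09-15 00:05:00"),
                 Value = c("1.234*", "1.240*"),
                 stringsAsFactors = FALSE)

# Strip the literal * and convert the strings to numbers.
toy$Value = as.numeric(gsub("*", "", toy$Value, fixed = TRUE))

# Parse the time stamps; the UTC time zone is an assumption.
toy$Date = as.POSIXct(toy$Date, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
```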
Using RHTMLForms to Find the Disclaimer Form
The RHTMLForms package can both read an HTML page
and get a description of all of its forms,
and also generate an R function corresponding to each form
so that we can invoke the form as if it were a local function in R.
We get the descriptions with
library(RHTMLForms)
# u is assumed to hold the URL of the page that shows the disclaimer form
forms = getHTMLFormDescription(u, FALSE)
We need to keep the buttons in the forms, hence the FALSE as the second
argument to getHTMLFormDescription.
We create the function for this form with
fun = createFunction(forms[[1]])
We can invoke this function using the curl handle we created to capture the cookies:
fun(.curl = curl)
This will agree to the disclaimer on our behalf.