\name{cv.glm}
\alias{cv.glm}
\title{
Cross-validation for Generalized Linear Models
}
\description{
This function calculates the estimated K-fold cross-validation prediction
error for generalized linear models.
}
\usage{
cv.glm(data, glmfit, cost, K)
}
\arguments{
\item{data}{
A matrix or data frame containing the data.  The rows should be cases and
the columns correspond to variables, one of which is the response.
}
\item{glmfit}{
An object of class \code{"glm"} containing the results of a generalized
linear model fitted to \code{data}.
}
\item{cost}{
A function of two vector arguments specifying the cost function for the
cross-validation.  The first argument to \code{cost} should correspond to
the observed responses and the second argument should correspond to the
predicted or fitted responses from the generalized linear model.
\code{cost} must return a non-negative scalar value.  The default is the
average squared error function.
}
\item{K}{
The number of groups into which the data should be split to estimate the
cross-validation prediction error.  The value of \code{K} must be such
that all groups are of approximately equal size.  If the supplied value of
\code{K} does not satisfy this criterion then it will be set to the
closest integer which does and a warning is generated specifying the value
of \code{K} used.  The default is to set \code{K} equal to the number of
observations in \code{data}, which gives the usual leave-one-out
cross-validation.
}
}
\value{
The returned value is a list with the following components.
\item{call}{
The original call to \code{cv.glm}.
}
\item{K}{
The value of \code{K} used for the K-fold cross-validation.
}
\item{delta}{
A vector of length two.  The first component is the raw cross-validation
estimate of prediction error.  The second component is the adjusted
cross-validation estimate.  The adjustment is designed to compensate for
the bias introduced by not using leave-one-out cross-validation.
}
\item{seed}{
The value of \code{.Random.seed} when \code{cv.glm} was called.
}
}
\section{Side Effects}{
The value of \code{.Random.seed} is updated.
}
\details{
The data is divided randomly into \code{K} groups.  For each group the
generalized linear model is fitted to \code{data} omitting that group,
then the function \code{cost} is applied to the observed responses in the
group that was omitted from the fit and the predictions made by the
fitted model for those observations.

When \code{K} is the number of observations leave-one-out
cross-validation is used and all the possible splits of the data are
used.  When \code{K} is less than the number of observations the \code{K}
splits to be used are found by randomly partitioning the data into
\code{K} groups of approximately equal size.  In this latter case a
certain amount of bias is introduced.  This can be reduced by using a
simple adjustment (see equation 6.48 in Davison and Hinkley, 1997).  The
second value returned in \code{delta} is the estimate adjusted by this
method.
}
\references{
Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984)
\emph{Classification and Regression Trees}. Wadsworth.

Burman, P. (1989) A comparative study of ordinary cross-validation,
\emph{v}-fold cross-validation and repeated learning-testing methods.
\emph{Biometrika}, \bold{76}, 503--514.

Davison, A.C. and Hinkley, D.V. (1997) \emph{Bootstrap Methods and Their
Application}. Cambridge University Press.

Efron, B. (1986) How biased is the apparent error rate of a prediction
rule?
\emph{Journal of the American Statistical Association}, \bold{81},
461--470.

Stone, M. (1974) Cross-validatory choice and assessment of statistical
predictions (with Discussion). \emph{Journal of the Royal Statistical
Society, B}, \bold{36}, 111--147.
}
\seealso{
\code{\link{glm}}, \code{\link{glm.diag}}, \code{\link{predict}}
}
\examples{
# leave-one-out and 6-fold cross-validation prediction error for
# the mammals data set.
data(mammals, package="MASS")
mammals.glm <- glm(log(brain) ~ log(body), data = mammals)
(cv.err <- cv.glm(mammals, mammals.glm)$delta)
(cv.err.6 <- cv.glm(mammals, mammals.glm, K = 6)$delta)

# As this is a linear model we could calculate the leave-one-out
# cross-validation estimate without any extra model-fitting.
muhat <- fitted(mammals.glm)
mammals.diag <- glm.diag(mammals.glm)
(cv.err <- mean((mammals.glm$y - muhat)^2/(1 - mammals.diag$h)^2))

# leave-one-out and 11-fold cross-validation prediction error for
# the nodal data set.  Since the response is a binary variable an
# appropriate cost function is
cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)
nodal.glm <- glm(r ~ stage + xray + acid, binomial, data = nodal)
(cv.err <- cv.glm(nodal, nodal.glm, cost, K = nrow(nodal))$delta)
(cv.11.err <- cv.glm(nodal, nodal.glm, cost, K = 11)$delta)
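# A minimal sketch of the points made under 'cost' and 'Side Effects'
# above.  Supplying the average squared error explicitly should match
# the leave-one-out default; 'cost.mse' is just an illustrative name.
cost.mse <- function(y, yhat) mean((y - yhat)^2)
(cv.glm(mammals, mammals.glm, cost.mse)$delta)

# When K is less than the number of observations the split is random,
# so fixing the seed beforehand makes the estimates reproducible;
# delta[1] is the raw estimate and delta[2] the adjusted one.
set.seed(123)
(cv.err.6.fixed <- cv.glm(mammals, mammals.glm, K = 6)$delta)
}
\keyword{regression}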