speedglm {speedglm}    R Documentation
Description

speedglm and speedglm.wfit fit GLMs to medium-large data sets, that is,
those that can be stored in R memory. The best performance, in terms of
computation time, is obtained when R is linked against an optimized BLAS,
such as ATLAS. The function shglm is for data sets stored in a file whose
size exceeds the available memory; it takes as an argument a function that
manipulates connections.
Usage

## S3 method for class 'data.frame':
speedglm(formula, data, family = gaussian(), weights = NULL, start = NULL,
         etastart = NULL, mustart = NULL, offset = NULL, maxit = 25, k = 2,
         sparse = NULL, set.default = list(), ...)

## S3 method for class 'matrix':
speedglm.wfit(y, X, intercept = TRUE, weights = NULL, row.chunk = NULL,
              family = gaussian(), start = NULL, etastart = NULL,
              mustart = NULL, offset = NULL, acc = 1e-08, maxit = 25, k = 2,
              sparselim = .9, camp = .01, eigendec = TRUE, tol.values = 1e-7,
              tol.vectors = 1e-7, tol.solve = .Machine$double.eps,
              method = "eigen", sparse = NULL, ...)

## S3 method for class 'function':
shglm(formula, data, family = gaussian(), weights.fo = NULL, start = NULL,
      etastart = NULL, mustart = NULL, offset = NULL, maxit = 25, k = 2,
      chunksize = 5000, sparse = NULL, all.levels = FALSE,
      set.default = list(), ...)
Arguments

Most arguments are the same as those of glm or bigglm, but with some
differences.
formula: the same as in glm.

data: as in bigglm, either a data frame or a function which uses
connections. The latter case applies only to shglm.

family: the same as in glm, but it must be specified with parentheses,
e.g. gaussian() rather than gaussian (see the sketch after this list).

start: the same as in glm.

weights: the same as in glm, but it must be specified as data$weights.

weights.fo: weights for the response; these must be specified as a
formula (see the example below).

etastart: the same as in glm.

mustart: the same as in glm.

offset: the same as in glm.

intercept: the same as in glm.

X: the same as x in glm.fit.

y: the same as in glm.

maxit: the same as in glm.

k: numeric, the penalty per parameter to be used; the default k = 2 gives
the classical AIC.

sparse: logical. Is the model matrix sparse? The default is NULL, in
which case a quick survey of a sample of the matrix is made.

chunksize: an integer indicating the number of rows of the data file to
read at a time.

all.levels: logical. Are all factor levels present in each data chunk?

set.default: a list in which to specify the parameters below.

sparselim: a real number in the interval [0, 1] giving the minimal
proportion of zeros in the data matrix X for X to be considered sparse.

camp: see the function is.sparse.

eigendec: logical. Should the rank of X be checked? You may set it to
FALSE if you are sure that X is of full rank.

row.chunk: an integer; see the function cp for details.

acc: tolerance to be used for the estimation.

tol.solve: see the function solve.

tol.values: see the function control.

tol.vectors: see the function control.

method: see the function control.

...: further optional arguments.
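As a small illustration of the family and weights conventions above, here
is a sketch on simulated data (the data frame and variable names are
invented for this example):

set.seed(1)
da <- data.frame(y = rgamma(100, 2, 1), x = rnorm(100),
                 weights = runif(100, 0.5, 1.5))
# the family is specified with parentheses: Gamma(log), not Gamma
# prior weights are passed as data$weights
m1 <- speedglm(y ~ x, data = da, family = Gamma(log), weights = da$weights)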
Details

The function shglm works like bigglm, but it checks for singularity and
does not impose restrictions on factors. Since the IWLS estimation in
shglm makes repeated passes over a data file stored, for example, on the
hard disk, the estimation time can be very long.

Unlike glm or bigglm, the functions of class 'speedglm' do not use the QR
decomposition: they solve the estimating equations directly by
Iteratively (Re-)Weighted Least Squares (IWLS). The memory size of an
object of class 'speedglm' is O(p^2), where p is the number of
covariates. If an optimized BLAS is not installed, an attempt to speed up
the calculations can be made by setting row.chunk to some value, usually
less than 1000, in set.default. See the function cp for details. If the
model matrix is (very) sparse, the package Matrix could be used.
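For instance, without an optimized BLAS one might try a blocked
computation of the cross-products, or declare sparsity explicitly to skip
the sample survey. A sketch (the value 500 is an arbitrary illustrative
choice):

set.seed(2)
da <- data.frame(y = rgamma(1000, 2, 1), x1 = rnorm(1000), x2 = rnorm(1000))
# blocked cross-products; row.chunk values below 1000 are usually worth trying
m2 <- speedglm(y ~ x1 + x2, data = da, family = Gamma(log),
               set.default = list(row.chunk = 500))
# an explicit sparse value skips the sample survey of the model matrix
m3 <- speedglm(y ~ x1 + x2, data = da, family = Gamma(log), sparse = FALSE)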
Value

All of the above functions return an object of class 'speedglm', with the
following components:

coefficients: the estimated coefficients.

logLik: the log-likelihood of the fitted model.

iter: the number of IWLS iterations used.

tol: the maximal tolerance value reached.

convergence: a logical value indicating whether convergence was reached.

family: the family object used.

link: the link function used.

df: the degrees of freedom of the model.

XTX: the product X'X (weighted, if applicable).

dispersion: the estimated dispersion parameter of the model.

ok: the set of column indices of the model matrix where the model has
been fitted.

rank: the rank of the model matrix.

RSS: the estimated residual sum of squares of the fitted model.

aic: the estimated Akaike Information Criterion.

sparse: a logical value indicating whether the model matrix is sparse.

deviance: the estimated deviance of the fitted model.

nulldf: the degrees of freedom of the null model.

nulldev: the estimated deviance of the null model.

ngoodobs: the number of observations with non-zero weight.

n: the number of observations.

intercept: a logical value indicating whether an intercept has been used.

terms: the terms object used.

call: the matched call.
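The components are accessed as ordinary list elements; for example (a
sketch on simulated data):

set.seed(3)
da <- data.frame(y = rpois(500, 3), x = rnorm(500))
m <- speedglm(y ~ x, data = da, family = poisson(log))
m$coefficients  # estimated coefficients
m$aic           # Akaike Information Criterion
m$convergence   # TRUE if the IWLS iterations converged
m$iter          # number of IWLS iterations used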
Author(s)

Marco ENEA
References

Enea, M. (2009) Fitting Linear Models and Generalized Linear Models with
large data sets in R. In Book of Short Papers, conference on "Statistical
Methods for the Analysis of Large Data-sets", Italian Statistical Society,
Chieti-Pescara, 23-25 September 2009, 411-414.

Bates, D. (2009) Comparing Least Squares Calculations. Technical report.
Available at
http://cran.rakanu.com/web/packages/Matrix/vignettes/Comparisons.pdf

Lumley, T. (2009) biglm: bounded memory linear and generalized linear
models. R package version 0.7. http://CRAN.R-project.org/package=biglm
Examples

# Not run: the example may take a long time.
# The following comparison among glm(), bigglm() and speedglm() cannot be
# considered rigorous or exhaustive; it only gives an idea of the
# computation times.
require(biglm)
n <- 50000
k <- 80
y <- rgamma(n, 1.5, 1)
x <- round(matrix(rnorm(n * k), n, k), digits = 3)
colnames(x) <- paste("s", 1:k, sep = "")
da <- data.frame(y, x)
fo <- as.formula(paste("y~", paste(paste("s", 1:k, sep = ""), collapse = "+")))
system.time(m1 <- glm(fo, data = da, family = Gamma(log)))
system.time(m2 <- bigglm(fo, data = da, family = Gamma(log)))
system.time(m3 <- speedglm(fo, data = da, family = Gamma(log)))

# You may also try speedglm when R is linked against an optimized BLAS;
# otherwise, try the following call. On some computers it is faster for
# large data sets.
system.time(m4 <- speedglm(fo, data = da, family = Gamma(log),
                           set.default = list(row.chunk = 1000)))

##################
## The function from the bigglm help page
make.data <- function(urlname, chunksize, ...) {
  conn <- NULL
  function(reset = FALSE) {
    if (reset) {
      if (!is.null(conn)) close(conn)
      conn <<- url(urlname, open = "r")
    } else {
      rval <- read.table(conn, nrows = chunksize, ...)
      if (nrow(rval) == 0) {
        close(conn)
        conn <<- NULL
        rval <- NULL
      }
      return(rval)
    }
  }
}

require(biglm)
da <- make.data("http://dssm.unipa.it/enea/data1.txt", chunksize = 50,
                col.names = c("y", "fat1", "x1", "x2"))
# fat1 is a factor with four levels
b1 <- shglm(y ~ factor(fat1) + x1, weights.fo = ~I(x2^2), data = da,
            family = Gamma(log))
b2 <- bigglm(y ~ factor(fat1) + x1, weights = ~I(x2^2), data = da,
             family = Gamma(log))
summary(b1)
summary(b2)
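To experiment with shglm without a network connection, the same
connection-based protocol works with a local file. The following is a
sketch, not part of the original examples: make.local.data is a
hypothetical local-file analogue of make.data, using file() instead of
url() and a tryCatch() guard at end-of-file.

# write simulated data to a temporary file (no header, as with make.data)
tf <- tempfile()
n <- 10000
d <- data.frame(y = rgamma(n, 1.5, 1), x1 = rnorm(n), x2 = runif(n))
write.table(d, tf, row.names = FALSE, col.names = FALSE)

make.local.data <- function(filename, chunksize, ...) {
  conn <- NULL
  function(reset = FALSE) {
    if (reset) {
      if (!is.null(conn)) close(conn)
      conn <<- file(filename, open = "r")
    } else {
      # at end-of-file read.table raises an error; treat it as "no more data"
      rval <- tryCatch(read.table(conn, nrows = chunksize, ...),
                       error = function(e) NULL)
      if (is.null(rval) || nrow(rval) == 0) {
        close(conn)
        conn <<- NULL
        rval <- NULL
      }
      return(rval)
    }
  }
}

da.loc <- make.local.data(tf, chunksize = 1000,
                          col.names = c("y", "x1", "x2"))
b3 <- shglm(y ~ x1 + x2, data = da.loc, family = Gamma(log))
summary(b3)
unlink(tf)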