speedlm {speedglm}    R Documentation

Fitting Linear Models to Large Data Sets

Description

The functions of class 'speedlm' may speed up the fitting of linear models (LMs) to large data sets. High performance can be obtained especially if R is linked against an optimized BLAS, such as ATLAS.

Usage

# S3 method of class 'data.frame'
speedlm(formula,data,weights=NULL,offset=NULL,sparse=NULL,set.default=list(),...)

# S3 method of class 'matrix'
speedlm.fit(y,X,intercept=FALSE,offset=NULL,row.chunk=NULL,sparselim=.9,camp=.01,
                      eigendec=TRUE,tol.solve=.Machine$double.eps,sparse=NULL,
                      tol.values=1e-7,tol.vectors=1e-7, method = "eigen",...)

speedlm.wfit(y,X,w,intercept=FALSE,offset=NULL,row.chunk=NULL,sparselim=.9,camp=.01,
                      eigendec=TRUE,tol.solve=.Machine$double.eps,sparse=NULL,
                      tol.values=1e-7,tol.vectors=1e-7, method = "eigen",...)                                            
                      
# S3 method of class 'speedlm' (object) and 'data.frame' (data)                    
update.speedlm(object,data,weights=NULL,offset=NULL,sparse=NULL,all.levels=FALSE,
               set.default=list(),...)     

Arguments

Most arguments are the same as those of the function lm, with some differences.

formula the same as in function lm.
data the same as in function lm, but it must always be specified.
weights the same as in function lm, but it must be specified as data$weights.
w the same as weights.
intercept a logical value which indicates if an intercept is used.
offset the same as in function lm.
X the same as x in function lm.
y the same as in function lm.
sparse logical. Is the model matrix sparse? By default it is NULL, so a quick sample survey will be made.
set.default a list in which to specify the parameters to pass to the functions cp, control and is.sparse.
sparselim a value in the interval [0, 1]. It indicates the minimal proportion of zeroes, in the model matrix X, in order to consider X as sparse.
camp see function is.sparse.
eigendec logical. Should the rank of X be investigated? You may set it to FALSE if you are sure that X is full rank.
row.chunk an integer, see the function cp for details.
tol.solve see function solve.
tol.values see function control.
tol.vectors see function control.
method see function control.
object an object of class 'speedlm'.
all.levels logical. Are all levels of any factors present in each data chunk? If so, set all.levels to TRUE to speed up the fitting.
... further optional arguments.
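
The "quick sample survey" mentioned for sparse can be pictured with the following minimal sketch in base R. It is not the package's is.sparse function: the helper name guess_sparse is hypothetical, and the sketch assumes that camp is the fraction of entries of X to sample and that X counts as sparse when the sampled proportion of zeros reaches sparselim.

```r
# Hypothetical helper (an assumption, not the package's is.sparse()):
# estimate the proportion of zeros in X from a random sample of its entries.
guess_sparse <- function(X, sparselim = 0.9, camp = 0.01) {
  n_cells <- length(X)
  n_samp <- max(1L, round(n_cells * camp))   # assumed meaning of camp
  cells <- X[sample.int(n_cells, n_samp)]    # sample entries without replacement
  p_zero <- mean(cells == 0)                 # sampled proportion of zeros
  p_zero >= sparselim
}

set.seed(1)
dense <- matrix(rnorm(1e4), 100, 100)
mostly_zero <- matrix(0, 100, 100)
mostly_zero[sample.int(1e4, 200)] <- 1       # 2% non-zero entries

guess_sparse(dense)
guess_sparse(mostly_zero)
```

Setting camp = 1 in this sketch inspects every entry, which removes the sampling error at the cost of a full pass over X.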

Details

Unlike the functions lm or biglm, the functions of class 'speedlm' do not use the QR decomposition but directly solve the normal equations. In some extreme cases this may cause problems of numerical stability, but it can take advantage of an optimized BLAS. The memory size of an object of class 'speedlm' is O(p^2), where p is the number of covariates. If an optimized BLAS library is not installed, an attempt to speed up calculations may be made by setting row.chunk to some value, usually less than 1000, in set.default. See the function cp for details. Factors are permitted without limitations.
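
The idea of solving the normal equations from accumulated cross-products can be illustrated in base R. This is only a sketch under stated assumptions, not the package's internal code: the names XtX, Xty and chunks are illustrative, and the four-way row split is arbitrary. It shows why only p x p and p x 1 objects need to be kept between chunks.

```r
set.seed(42)
n <- 1000; p <- 3
X <- cbind(1, matrix(rnorm(n * p), n, p))    # model matrix with an intercept column
y <- drop(X %*% c(1, 2, -1, 0.5) + rnorm(n))

# Accumulate X'X and X'y over row chunks (illustrative split into 4 blocks).
chunks <- split(seq_len(n), rep(1:4, each = n / 4))
XtX <- matrix(0, ncol(X), ncol(X))
Xty <- numeric(ncol(X))
for (idx in chunks) {
  Xi <- X[idx, , drop = FALSE]
  XtX <- XtX + crossprod(Xi)                 # adds Xi'Xi
  Xty <- Xty + drop(crossprod(Xi, y[idx]))   # adds Xi'yi
}

beta_ne <- solve(XtX, Xty)                   # solve the normal equations
beta_qr <- unname(qr.coef(qr(X), y))         # QR-based solution for comparison
all.equal(beta_ne, beta_qr)
```

On a well-conditioned problem like this one the two solutions should agree to numerical tolerance; the stability concern mentioned above arises when X'X is close to singular.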

Value

coefficients the estimated coefficients.
df.residual the residual degrees of freedom.
XTX the product X'X (weighted, if applicable).
A the product X'X (weighted, if applicable) not checked for singularity.
Xy the product X'y (weighted, if applicable).
ok the set of column indices of the model matrix where the model has been fitted.
rank the numeric rank of the fitted linear model.
pivot see the function control.
RSS the estimated residual sum of squares of the fitted model.
sparse a logical value indicating if the model matrix is sparse.
deviance the estimated deviance of the fitted model.
weights the weights used in the last updating.
zero.w the number of non-zero weighted observations.
n.obs the number of observations.
nvar the number of independent variables.
terms the terms object used.
intercept a logical value which indicates if an intercept has been used.
call the matched call.
... other values necessary to update the estimation.
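
One way to think about the numeric rank reported in rank is through the eigenvalues of X'X. The base R sketch below is an illustration under assumptions, not the procedure used by the function control: the relative tolerance tol and the name num_rank are chosen for the example only.

```r
set.seed(7)
X <- cbind(1, matrix(rnorm(200), 100, 2))
X <- cbind(X, X[, 2] + X[, 3])               # fourth column is a linear combination

XtX <- crossprod(X)
ev <- eigen(XtX, symmetric = TRUE, only.values = TRUE)$values

tol <- 1e-7                                  # assumed tolerance, relative to the largest eigenvalue
num_rank <- sum(ev > tol * max(ev))          # count eigenvalues that are not numerically zero
num_rank
ncol(X)
```

Because the fourth column is built from the second and third, the count of non-negligible eigenvalues falls below ncol(X), which is the situation the eigendec check is meant to detect.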

Note

All the above functions return an object of class 'speedlm'.

Author(s)

Marco ENEA

References

Enea, M. (2009) Fitting Linear Models and Generalized Linear Models With Large Data Sets in R. In book of short papers, conference on "Statistical Methods for the analysis of large data-sets", Italian Statistical Society, Chieti-Pescara, 23-25 September 2009, 411-414.

Klotz, J.H. (1995) Updating Simple Linear Regression. Statistica Sinica, 5, 399-403.

Bates, D. (2009) Comparing Least Squares Calculations. Technical report. Available at http://cran.rakanu.com/web/packages/Matrix/vignettes/Comparisons.pdf

Lumley, T. (2009) biglm: bounded memory linear and generalized linear models. R package version 0.7 http://CRAN.R-project.org/package=biglm.

See Also

summary.speedlm, speedglm, lm, and biglm

Examples

# Simulate a data set and split it into three chunks
n <- 1000
k <- 3
y <- rnorm(n)
x <- round(matrix(rnorm(n * k), n, k), digits = 3)
colnames(x) <- c("s1", "s2", "s3")
da <- data.frame(y, x)
do1 <- da[1:300,]
do2 <- da[301:700,]
do3 <- da[701:1000,]

# Fit on the first chunk, then update the fit with the remaining chunks
m1 <- speedlm(y ~ s1 + s2 + s3, data = do1)
m1 <- update(m1, data = do2)
m1 <- update(m1, data = do3)

# Fit on the full data with lm for comparison
m2 <- lm(y ~ s1 + s2 + s3, data = da)
summary(m1)
summary(m2)

[Package speedglm version 0.1 Index]