attrEval {CORElearn}    R Documentation

Attribute evaluation

Description

The method evaluates the quality of the features (attributes, independent variables) specified by the formula with the selected heuristic method. Feature evaluation algorithms available for classification problems include several variants of the Relief and ReliefF algorithms (ReliefF, cost-sensitive ReliefF, ...), gain ratio, Gini index, MDL, DKM, information gain, and others. For regression problems there are RReliefF, MSE, MAE, and others.

Usage

  attrEval(formula, data, costMatrix = NULL, estimator, ...)

Arguments

formula Formula specifying the predictors to be evaluated and the target variable.
data Data frame with evaluation data.
costMatrix Optional cost matrix.
estimator The name of the evaluation method.
... Additional options used by specific evaluation methods.

Details

The formula parameter selects the features (attributes) and the prediction variable (class). Only simple terms can be used; interactions expressed in formula syntax are not supported. The simplest way is to specify just the response variable: class ~ . In this case all other attributes in the data set are evaluated. See also the example below.
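As a minimal sketch of the two formula styles described above, the following evaluates all attributes of the built-in iris data and then only two of them. The choice of "InfGain" and of the two attributes is arbitrary and only for illustration.

```r
library(CORElearn)

# Evaluate all attributes of iris against the class Species.
estAll <- attrEval(Species ~ ., iris, estimator = "InfGain")

# Evaluate only two selected attributes by naming them as simple terms.
estTwo <- attrEval(Species ~ Petal.Length + Sepal.Width, iris,
                   estimator = "InfGain")

print(estAll)
print(estTwo)
```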

The optional parameter costMatrix provides a nonuniform cost matrix to the cost-sensitive classification measures (ReliefFexpC, ReliefFavgC, ReliefFpe, ReliefFpa, ReliefFsmp, GainRatioCost, DKMcost, ReliefKukar, and MDLsmp). For other measures this parameter is ignored. The format of the matrix is costMatrix(true class, predicted class). By default uniform costs are assumed, i.e., costMatrix(i, i) = 0 and costMatrix(i, j) = 1 for i not equal to j.
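A cost matrix in this format might be built as follows. This is a sketch only: the cost value 5 is a made-up assumption, and the row and column order is assumed to follow the order of the class levels (setosa, versicolor, virginica for iris).

```r
library(CORElearn)

# Start from the default uniform costs: 0 on the diagonal, 1 elsewhere.
# Rows are true classes, columns are predicted classes.
nClasses <- nlevels(iris$Species)
costs <- matrix(1, nrow = nClasses, ncol = nClasses)
diag(costs) <- 0

# Hypothetical assumption for illustration: predicting class 2 (versicolor)
# when the true class is 3 (virginica) is five times as costly.
costs[3, 2] <- 5

estCost <- attrEval(Species ~ ., iris, estimator = "ReliefFexpC",
                    costMatrix = costs, ReliefIterations = 30)
print(estCost)
```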

The estimator parameter selects the evaluation heuristic. For classification problems it must be one of the names returned by infoCore(what="attrEval"), and for regression problems it must be one of the names returned by infoCore(what="attrEvalReg"). The majority of these feature evaluation measures are described in the references given below; only a short description is given here. For classification problems they are:

"ReliefFequalK"
ReliefF algorithm where k nearest instances have equal weight.
"ReliefFexpRank"
ReliefF algorithm where k nearest instances have weight exponentially decreasing with increasing rank. The rank of a nearest instance is determined by its increasing (Manhattan) distance from the selected instance. This is the default choice for methods taking conditional dependencies among the attributes into account.
"ReliefFbestK"
ReliefF algorithm where all possible k (representing k nearest instances) are tested and for each feature the highest score is returned. Nearest instances have equal weights.
"Relief"
Original algorithm of Kira and Rendell (1991) working on two-class problems.
"InfGain"
Information gain.
"GainRatio"
Gain ratio, which is normalized information gain to prevent bias to multi-valued attributes.
"MDL"
Acronym for Minimum Description Length; the method was introduced in (Kononenko, 1995) and has a favorable bias for multi-valued and multi-class problems. It might be the best method among those not taking conditional dependencies into account.
"Gini"
Gini-index.
"MyopicReliefF"
Myopic version of ReliefF resulting from assumption of no local dependencies and attribute dependencies upon class.
"Accuracy"
Accuracy of resulting split.
"BinAccuracy"
Accuracy of resulting binary split.
"ReliefFmerit"
ReliefF algorithm where for each random instance the merit of each attribute is normalized by the sum of differences in all attributes.
"ReliefFdistance"
ReliefF algorithm where k nearest instances are weighted directly by the inverse of their distance from the selected instance. Using ranks instead of distances, as in ReliefFexpRank, is usually more effective.
"ReliefFsqrDistance"
ReliefF algorithm where k nearest instances are weighted by the inverse square of their distance from the selected instance.
"DKM"
Measure named after Dietterich, Kearns, and Mansour who proposed it in 1996.
"ReliefFexpC"
Cost-sensitive ReliefF algorithm with expected costs.
"ReliefFavgC"
Cost-sensitive ReliefF algorithm with average costs.
"ReliefFpe"
Cost-sensitive ReliefF algorithm with expected probability.
"ReliefFpa"
Cost-sensitive ReliefF algorithm with average probability.
"ReliefFsmp"
Cost-sensitive ReliefF algorithm with cost sensitive sampling.
"GainRatioCost"
Cost-sensitive variant of GainRatio.
"DKMcost"
Cost-sensitive variant of DKM.
"ReliefKukar"
Cost-sensitive Relief algorithm introduced by Kukar in 1999.
"MDLsmp"
Cost-sensitive variant of MDL where costs are introduced through sampling.
"ImpurityEuclid"
Euclidean distance as impurity function on within node class distributions.
"ImpurityHellinger"
Hellinger distance as impurity function on within node class distributions.
"UniformDKM"
Dietterich-Kearns-Mansour (DKM) with uniform priors.
"UniformGini"
Gini index with uniform priors.
"UniformInf"
Information score with uniform priors.
"UniformAccuracy"
Accuracy with uniform priors.
"EqualDKM"
Dietterich-Kearns-Mansour (DKM) with equal weights for splits.
"EqualGini"
Gini index with equal weights for splits.
"EqualInf"
Information score with equal weights for splits.
"EqualHellinger"
Hellinger distance based on two equally weighted splits.
"DistHellinger"
Hellinger distance between class distributions in branches.
"DistAUC"
AUC distance between splits.
"DistAngle"
Cosine of angular distance between splits.
"DistEuclid"
Euclidean distance between splits.
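To compare several of the classification measures listed above on the same data, one possible approach is to loop over estimator names. The four names chosen here are an arbitrary selection, and the check against infoCore is included only as a safeguard.

```r
library(CORElearn)

# An arbitrary selection of impurity-based estimators for illustration.
estimators <- c("InfGain", "GainRatio", "Gini", "MDL")

# Make sure every chosen name is a valid classification estimator.
stopifnot(all(estimators %in% infoCore(what = "attrEval")))

# One column per estimator, one row per evaluated attribute.
scores <- sapply(estimators,
                 function(e) attrEval(Species ~ ., iris, estimator = e))
print(round(scores, 3))
```

Scores from different estimators are on different scales, so such a table is best read column by column, comparing the ordering of attributes rather than the raw values.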
For regression problems the implemented measures are:
"RReliefFequalK"
RReliefF algorithm where k nearest instances have equal weight.
"RReliefFexpRank"
RReliefF algorithm where k nearest instances have weight exponentially decreasing with increasing rank. The rank of a nearest instance is determined by its increasing (Manhattan) distance from the selected instance. This is the default choice for methods taking conditional dependencies among the attributes into account.
"RReliefFbestK"
RReliefF algorithm where all possible k (representing k nearest instances) are tested and for each feature the highest score is returned. Nearest instances have equal weights.
"RReliefFwithMSE"
A combination of RReliefF and MSE algorithms.
"MSEofMean"
Mean Squared Error used as a heuristic: the error of predicting with the mean value after a split on the feature.
"MSEofModel"
Mean Squared Error of an arbitrary model used on splits resulting from the feature. The model is chosen with parameter modelTypeReg.
"MAEofModel"
Mean Absolute Error of an arbitrary model used on splits resulting from the feature. The model is chosen with parameter modelTypeReg. If we use the median as the model, we get a robust equivalent of MSEofMean.
"RReliefFdistance"
RReliefF algorithm where k nearest instances are weighted directly by the inverse of their distance from the selected instance. Using ranks instead of distances, as in RReliefFexpRank, is usually more effective.
"RReliefFsqrDistance"
RReliefF algorithm where k nearest instances are weighted by the inverse square of their distance from the selected instance.
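For a regression problem the call has the same shape, with a numeric response and one of the regression estimators. The sketch below uses the numeric column Sepal.Length of iris as the target purely for illustration.

```r
library(CORElearn)

# Sepal.Length is numeric, so a regression estimator is required.
estimator <- "RReliefFexpRank"
stopifnot(estimator %in% infoCore(what = "attrEvalReg"))

# The remaining four columns (three numeric, one factor) are evaluated.
estReg <- attrEval(Sepal.Length ~ ., iris, estimator = estimator)
print(estReg)
```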

Additional parameters, passed through ..., are used by specific evaluation heuristics. Their list and short descriptions are available by calling optionCore; see the section on attribute evaluation there.

Evaluation and visualization of ordered attributes is covered in function ordEval.

Value

Vector of evaluations for the features in the order specified by the formula.
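Since the result is a plain numeric vector, ranking the features can be sketched with base R. This assumes the vector is named by attribute, as the printed output of the package example suggests; "GainRatio" is an arbitrary choice.

```r
library(CORElearn)

est <- attrEval(Species ~ ., iris, estimator = "GainRatio")

# Order the attributes from the highest to the lowest evaluation.
ranked <- sort(est, decreasing = TRUE)
print(ranked)

# Names of the two best-scoring attributes (assumes a named vector).
print(head(names(ranked), 2))
```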

Author(s)

Marko Robnik-Sikonja, Petr Savicky

References

Marko Robnik-Sikonja, Igor Kononenko: Theoretical and Empirical Analysis of ReliefF and RReliefF. Machine Learning Journal, 53:23-69, 2003

Marko Robnik-Sikonja: Experiments with Cost-sensitive Feature Evaluation. In Lavrac et al.(eds): Machine Learning, Proceedings of ECML 2003, Springer, Berlin, 2003, pp. 325-336

Igor Kononenko: On Biases in Estimating Multi-Valued Attributes. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI'95), pp. 1034-1040, 1995

Some of these references are also available from http://lkm.fri.uni-lj.si/rmarko/papers/

See Also

CORElearn, CoreModel, ordEval, optionCore, infoCore.

Examples

# use iris data

# run method ReliefF with exponential rank distance
estReliefF <- attrEval(Species ~ ., iris,
                       estimator="ReliefFexpRank", ReliefIterations=30)
print(estReliefF)

# print all available estimators
infoCore(what="attrEval")

[Package CORElearn version 0.9.22 Index]