Title: | Tunes AdaBoost, Elastic Net, Support Vector Machines, and Gradient Boosting Machines |
---|---|
Description: | Contains two functions that are intended to make tuning supervised learning methods easy. The eztune function uses a genetic algorithm or Hooke-Jeeves optimizer to find the best set of tuning parameters. The user can choose the optimizer, the learning method, and if optimization will be based on accuracy obtained through validation error, cross validation, or resubstitution. The function eztune_cv will compute a cross validated error rate. The purpose of eztune_cv is to provide a cross validated accuracy or MSE when resubstitution or validation data are used for optimization because error measures from both approaches can be misleading. |
Authors: | Jill Lundell [aut, cre] |
Maintainer: | Jill Lundell <[email protected]> |
License: | GPL-3 |
Version: | 3.1.2 |
Built: | 2024-11-05 04:30:43 UTC |
Source: | https://github.com/jillbo1000/eztune |
eztune is a function that automatically tunes adaboost, support vector machines, gradient boosting machines, and elastic net. An optimization algorithm is used to find a good set of tuning parameters for the selected model. The function optimizes on a validation dataset, cross validated accuracy, or resubstitution accuracy.
eztune( x, y, method = "svm", optimizer = "hjn", fast = TRUE, cross = NULL, loss = "default" )
x |
Matrix or data frame containing the independent (predictor) variables. |
y |
Vector of responses. Can either be a factor or a numeric vector. |
method |
Model to be fit. Choices are " |
optimizer |
Optimization method. Options are " |
fast |
Indicates if the function should use a subset of the
observations when optimizing to speed up calculation time. A value
of |
cross |
If an integer k > 1 is specified, k-fold cross-validation
is used to fit the model. This method is very slow for large datasets.
This parameter is ignored unless |
loss |
The type of loss function used for optimization. Options
for models with a binary response are " |
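The examples for this function describe `fast = 50` as a training set of 50 observations and `fast = 0.25` as 25% of the observations. The base-R sketch below shows one way such a value could be turned into a set of training rows. The helper `training_rows` is hypothetical and not part of EZtune, and the rule it applies (values below 1 are proportions, larger values are counts, with rounding) is an assumption made for illustration.

```r
# Hypothetical helper (not an EZtune function): choose the rows used for
# training from a numeric `fast` value.
training_rows <- function(n_obs, fast) {
  stopifnot(is.numeric(fast), length(fast) == 1, fast > 0)
  # Assumption: values below 1 are proportions, values of 1 or more are counts.
  n_train <- if (fast < 1) round(fast * n_obs) else min(fast, n_obs)
  sort(sample(n_obs, n_train))
}

set.seed(42)
rows_count <- training_rows(100, 50)    # 50 of 100 observations
rows_prop  <- training_rows(100, 0.25)  # 25% of 100 observations

length(rows_count)
length(rows_prop)
```

The observations left out of the training rows would then serve as validation data for the optimizer.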
Function returns an object of class "eztune" which contains a summary of the tuning parameters for the best model, the best loss measure achieved (classification accuracy, AUC, MSE, or MAE), and the best model.
loss |
Best loss measure obtained by the optimizer. This is the measure specified by the user that the optimizer uses to choose a "best" model (classification accuracy, AUC, MSE, or MAE). Note that if the default option is used it is the classification accuracy for a binary response and the MSE for a continuous response. |
model |
Best model found by the optimizer. Adaboost model
comes from package |
n |
Number of observations used in model training when the fast option is used. |
nfold |
Number of folds used if cross validation is used for optimization. |
iter |
Tuning parameter for adaboost. |
nu |
Tuning parameter for adaboost. |
shrinkage |
Tuning parameter for adaboost and gbm. |
lambda |
Tuning parameter for elastic net. |
alpha |
Tuning parameter for elastic net. |
n.trees |
Tuning parameter for gbm. |
interaction.depth |
Tuning parameter for gbm. |
n.minobsinnode |
Tuning parameter for gbm. |
cost |
Tuning parameter for svm. |
gamma |
Tuning parameter for svm. |
epsilon |
Tuning parameter for svm regression. |
levels |
If the model has a binary response, the levels of y are listed. |
library(EZtune)
library(mlbench)
data(Sonar)
sonar <- Sonar[sample(1:nrow(Sonar), 100), ]

y <- sonar[, 61]
x <- sonar[, 1:10]

# Optimize an SVM using the default fast setting and Hooke-Jeeves
eztune(x, y)

# Optimize an SVM with 3-fold cross validation and Hooke-Jeeves
eztune(x, y, fast = FALSE, cross = 3)

# Optimize GBM using training set of 50 observations and Hooke-Jeeves
eztune(x, y, method = "gbm", fast = 50, loss = "auc")

# Optimize SVM with 25% of the observations as a training dataset
# using a genetic algorithm
eztune(x, y, method = "svm", optimizer = "ga", fast = 0.25)
eztune_cv returns the cross-validated loss measures for a model returned by eztune.
The function eztune can tune a model using validation data, cross validation, data splitting, or resubstitution. If resubstitution or a data splitting method (via the fast option) is used to tune the model, the loss measure reported by eztune may be misleading. The function eztune_cv will return cross-validated accuracy measures for any model returned by eztune.
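To illustrate why a resubstitution measure can mislead, the base-R sketch below compares resubstitution accuracy with 5-fold cross-validated accuracy. It uses a hand-written 1-nearest-neighbour classifier on simulated data rather than any EZtune model; the function `nn1_predict` and the simulated data are assumptions made for this illustration only.

```r
# 1-nearest-neighbour classifier in base R (illustrative, not part of EZtune).
nn1_predict <- function(train_x, train_y, test_x) {
  apply(test_x, 1, function(row) {
    d <- colSums((t(train_x) - row)^2)  # squared distance to every training row
    train_y[which.min(d)]
  })
}

set.seed(1)
n <- 100
x <- matrix(rnorm(2 * n), ncol = 2)
y <- ifelse(x[, 1] + rnorm(n) > 0, "a", "b")  # noisy class labels

# Resubstitution: predict the same rows the classifier was built on.
# Every point is its own nearest neighbour, so all labels are reproduced.
resub_acc <- mean(nn1_predict(x, y, x) == y)

# 5-fold cross validation: each row is predicted by a fit that excludes it.
k <- 5
folds <- sample(rep(seq_len(k), length.out = n))
cv_pred <- character(n)
for (f in seq_len(k)) {
  test <- folds == f
  cv_pred[test] <- nn1_predict(x[!test, , drop = FALSE], y[!test],
                               x[test, , drop = FALSE])
}
cv_acc <- mean(cv_pred == y)

c(resubstitution = resub_acc, cross_validated = cv_acc)
```

The resubstitution accuracy here is perfect by construction, while the cross-validated figure reflects how the classifier behaves on rows it has not seen.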
eztune_cv(x, y, model, cross = 10)
x |
Matrix or data frame containing the independent (predictor) variables used to create the model. |
y |
Vector of the response used to create the model. Can be either numeric or a factor. |
model |
An Object of class |
cross |
Number of folds to use for n-fold cross-validation. |
Function returns a numeric value that represents the cross-validated accuracy of the model. Both classification accuracy and the AUC are returned for models with a binary response. MSE and mean absolute error (MAE) are returned for models with a continuous response.
accuracy |
Cross-validated classification accuracy. |
auc |
Cross-validated AUC. |
mse |
Cross-validated MSE. |
mae |
Cross-validated MAE. |
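For reference, the four measures named above can be written in a few lines of base R. This is a sketch of the standard definitions, not code taken from EZtune. The function names are illustrative, and the AUC uses the rank (Mann-Whitney) formulation with the truth coded as 0/1.

```r
# Illustrative base-R definitions; not taken from the EZtune source code.
accuracy <- function(truth, pred) mean(pred == truth)
mse <- function(truth, pred) mean((truth - pred)^2)
mae <- function(truth, pred) mean(abs(truth - pred))

# AUC via ranks: `score` is the predicted probability of the positive class
# and `truth` is coded 0 (negative) or 1 (positive).
auc <- function(truth, score) {
  r <- rank(score)
  n_pos <- sum(truth == 1)
  n_neg <- sum(truth == 0)
  (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

truth_class <- c(0, 0, 1, 1)
score <- c(0.1, 0.4, 0.35, 0.8)

acc_value <- accuracy(truth_class, as.numeric(score > 0.5))  # 0.75
auc_value <- auc(truth_class, score)                         # 0.75
mse_value <- mse(c(1, 2, 3), c(1, 2, 5))                     # 4/3
mae_value <- mae(c(1, 2, 3), c(1, 2, 5))                     # 2/3
```

Higher values are better for accuracy and AUC, while lower values are better for MSE and MAE.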
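library(EZtune)
library(mlbench)
data(Sonar)
sonar <- Sonar[sample(1:nrow(Sonar), 100), ]

y <- sonar[, 61]
x <- sonar[, 1:10]

# Tune with the default settings, then cross validate the result
sonar_default <- eztune(x, y)
eztune_cv(x, y, sonar_default)

# Tune with 3-fold cross validation, then cross validate the result
sonar_svm <- eztune(x, y, fast = FALSE, cross = 3)
eztune_cv(x, y, sonar_svm)

# Tune a GBM on 50 training observations, then cross validate the result
sonar_gbm <- eztune(x, y, method = "gbm", fast = 50)
eztune_cv(x, y, sonar_gbm)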
Data were collected between 1993 and 1999 as part of the Lichen Air Quality surveys on public lands in Oregon and southern Washington. Observations were obtained from 1-acre (0.4 ha) plots at Current Vegetation Survey (CVS) sites. Indicator variables denote the presence or absence of 7 lichen species. Data for each sampled plot include the topographic variables elevation, aspect, and slope; bioclimatic predictors including maximum, minimum, daily, and average temperatures, relative humidity, precipitation, evapotranspiration, and vapor pressure; and vegetation variables including the average age of the dominant conifer and percent conifer cover. The data in lichenTest were collected from half-acre plots at CVS sites in the same geographical region and contain many of the same variables, including presences and absences for the 7 lichen species. As such, they are a good test dataset for predictive methods applied to the Lichen Air Quality data.
lichen
A data frame with 840 observations and 40 variables. One variable is a location identifier, 7 (coded as 0 and 1) identify the presence or absence of a type of lichen species, and 32 are characteristics of the survey site where the data were collected.
There were 12 monthly values in the original data for each of the bioclimatic predictors. Principal components analyses suggested that for each of these predictors 2 principal components explained the vast majority (95.0%-99.5%) of the total variability. Based on these analyses, indices were created for each set of bioclimatic predictors. The variables with the suffix Ave in the variable name are the average of 12 monthly variables. The variables with the suffix Diff are contrasts between the sum of the April-September monthly values and the sum of the October-December and January-March monthly values, divided by 12. Roughly speaking, these are summer-to-winter contrasts.
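The two indices described above can be computed directly from 12 monthly values. The base-R sketch below follows the stated definitions; the monthly values are made up for illustration and do not come from the dataset.

```r
# Hypothetical monthly values (January to December) for one site.
monthly <- c(2, 3, 5, 8, 12, 16, 19, 18, 14, 9, 5, 3)
names(monthly) <- month.abb

summer <- c("Apr", "May", "Jun", "Jul", "Aug", "Sep")
winter <- setdiff(month.abb, summer)  # Oct-Dec and Jan-Mar

# "Ave" index: average of the 12 monthly values.
ave_index <- mean(monthly)

# "Diff" index: summer sum minus winter sum, divided by 12.
diff_index <- (sum(monthly[summer]) - sum(monthly[winter])) / 12

c(Ave = ave_index, Diff = diff_index)
```

For these values the average is 9.5 and the summer-to-winter contrast is 5.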
The variables are summarized as follows:
Identifier of the section of forest from which the data were collected.
Lobaria oregana (Absent = 0, Present = 1)
Lobaria pulmonaria (Absent = 0, Present = 1)
Nephroma bellum (Absent = 0, Present = 1)
Nephroma helveticum (Absent = 0, Present = 1)
Pseudocyphellaria anomala (Absent = 0, Present = 1)
Pseudocyphellaria anthraspis (Absent = 0, Present = 1)
Pseudocyphellaria crocata (Absent = 0, Present = 1)
Average monthly potential evapotranspiration in mm
Summer-to-winter difference in monthly potential evapotranspiration in mm
Average monthly moisture index in cm
Summer-to-winter difference in monthly moisture index in cm
Average monthly precipitation in cm
Summer-to-winter difference in monthly precipitation in cm
Average monthly relative humidity in percent
Summer-to-winter difference in monthly relative humidity in percent
Average monthly potential global radiation in kJ
Summer-to-winter difference in monthly potential global radiation in kJ
Average monthly average temperature in degrees Celsius
Summer-to-winter difference in monthly average temperature in degrees Celsius
Average monthly maximum temperature in degrees Celsius
Summer-to-winter difference in monthly maximum temperature in degrees Celsius
Average monthly minimum temperature in degrees Celsius
Summer-to-winter difference in monthly minimum temperature in degrees Celsius
Mean average daytime temperature in degrees Celsius
Summer-to-winter difference in average daytime temperature in degrees Celsius
Average monthly average ambient vapor pressure in Pa
Summer-to-winter difference in monthly average ambient vapor pressure in Pa
Average monthly average saturated vapor pressure in Pa
Summer-to-winter difference in monthly average saturated vapor pressure in Pa
Aspect in degrees
Transformed Aspect: TransAspect=(1-cos(Aspect))/2
Elevation in meters
Percent slope
Reserve Status (Reserve, Matrix)
Stand Age Class (< 80 years, 80+ years)
Average age of the dominant conifer in years
Percent vegetation cover
Percent conifer cover
Percent broadleaf cover
Live tree (> 1 inch DBH) biomass, above ground, dry weight.
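The transformed aspect listed above, TransAspect=(1-cos(Aspect))/2, maps a compass direction onto the interval from 0 to 1. The base-R sketch below assumes that Aspect is recorded in degrees, as the variable list states, and therefore converts to radians before calling `cos()`; the function name `trans_aspect` is illustrative.

```r
# Transformed aspect, assuming the input is in degrees (R's cos() expects radians).
trans_aspect <- function(aspect_deg) {
  (1 - cos(aspect_deg * pi / 180)) / 2
}

# 0 and 360 degrees give 0, 180 degrees gives 1, and 90 and 270 degrees both give 0.5.
aspect_values <- trans_aspect(c(0, 90, 180, 270, 360))
round(aspect_values, 3)
```

The transformation removes the discontinuity between 359 and 0 degrees, so nearly identical directions receive nearly identical values.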
Cutler, D. Richard, Thomas C. Edwards Jr., Karen H. Beard, Adele Cutler, Kyle T. Hess, Jacob Gibson, and Joshua J. Lawler. 2007. Random Forests for Classification in Ecology. Ecology 88(11): 2783-2792.
Data were collected as part of the Northwest Forest Conservation Plan. Data were collected from 300 half-acre (0.2 ha) sites on the Current Vegetation Survey grid in Gifford-Pinchot National Forest, the Umpqua Basin, and the Oregon Coast. Samples were collected between 2002 and 2003. Indicator variables denoted the presence or absence of 7 lichen species. This dataset may be used as a test dataset for the lichen dataset included in this package.
lichenTest
A data frame with 300 observations and 40 variables. One variable is a location identifier, 7 identify the presence or absence of the lichen species, and 32 are characteristics of the survey site where the data were collected.
As with the Lichen Air Quality data, the variables with the suffix Ave in the variable name are the average of 12 monthly variables. The variables with the suffix Diff are contrasts between the sum of the April-September monthly values and the sum of the October-December and January-March monthly values, divided by 12. Roughly speaking, these are summer-to-winter contrasts.
The variables are summarized as follows:
Identifier of the section of forest from which the data were collected.
Lobaria oregana (Absent = 0, Present = 1)
Lobaria pulmonaria (Absent = 0, Present = 1)
Nephroma bellum (Absent = 0, Present = 1)
Nephroma helveticum (Absent = 0, Present = 1)
Pseudocyphellaria anomala (Absent = 0, Present = 1)
Pseudocyphellaria anthraspis (Absent = 0, Present = 1)
Pseudocyphellaria crocata (Absent = 0, Present = 1)
Average monthly potential evapotranspiration in mm
Summer-to-winter difference in monthly potential evapotranspiration in mm
Average monthly moisture index in cm
Summer-to-winter difference in monthly moisture index in cm
Average monthly precipitation in cm
Summer-to-winter difference in monthly precipitation in cm
Average monthly relative humidity in percent
Summer-to-winter difference in monthly relative humidity in percent
Average monthly potential global radiation in kJ
Summer-to-winter difference in monthly potential global radiation in kJ
Average monthly average temperature in degrees Celsius
Summer-to-winter difference in monthly average temperature in degrees Celsius
Average monthly maximum temperature in degrees Celsius
Summer-to-winter difference in monthly maximum temperature in degrees Celsius
Average monthly minimum temperature in degrees Celsius
Summer-to-winter difference in monthly minimum temperature in degrees Celsius
Mean average daytime temperature in degrees Celsius
Summer-to-winter difference in average daytime temperature in degrees Celsius
Average monthly average ambient vapor pressure in Pa
Summer-to-winter difference in monthly average ambient vapor pressure in Pa
Average monthly average saturated vapor pressure in Pa
Summer-to-winter difference in monthly average saturated vapor pressure in Pa
Aspect in degrees
Transformed Aspect: TransAspect=(1-cos(Aspect))/2
Elevation in meters
Percent slope
Reserve Status (Reserve, Matrix)
Stand Age Class (< 80 years, 80+ years)
Average age of the dominant conifer in years
Percent vegetation cover
Percent conifer cover
Percent broadleaf cover
Live tree (> 1 inch DBH) biomass, above ground, dry weight.
Cutler, D. Richard, Thomas C. Edwards Jr., Karen H. Beard, Adele Cutler, Kyle T. Hess, Jacob Gibson, and Joshua J. Lawler. 2007. Random Forests for Classification in Ecology. Ecology 88(11): 2783-2792.
This dataset contains information about the presence and absence of common mullein (Verbascum thapsus) at Lava Beds National Monument. The park was digitally divided into 30m by 30m pixels. Park personnel provided data on 6,047 sites at which mullein was detected and treated between 2000 and 2005, and these data were augmented by 6,047 randomly selected pseudo-absences. For each 30m by 30m site there are data on elevation, aspect, slope, proximity to roads and trails, and interpolated bioclimatic variables such as minimum, maximum, and average temperature, precipitation, relative humidity, and evapotranspiration. The dataset called mulleinTest is a test dataset collected in Lava Beds National Monument in 2006 that can be used to evaluate predictive statistical procedures applied to the mullein dataset.
mullein
A data frame with 12,094 observations and 32 variables. One variable identifies the presence or absence of mullein in a 30m by 30m site and 31 variables are characteristics of the site where the data were collected.
In the original data there were 12 monthly values for each of the bioclimatic predictors. Principal components analyses suggested that for each of these predictors 2 principal components explained the vast majority (95.0% - 99.5%) of the total variability. Based on these analyses, indices were created for each set of bioclimatic predictors. The variables with the suffix Ave in the variable name are the average of 12 monthly variables. The variables with the suffix Diff are contrasts between the sum of the April-September monthly values and the sum of the October-December and January-March monthly values, divided by 12. Roughly speaking, these are summer-to-winter contrasts. The variables are summarized as follows:
Presence or absence of Verbascum thapsus, common mullein, (Absent = 0, Present = 1)
Degree days in degrees Celsius
Average monthly potential evapotranspiration in mm
Summer-to-winter difference in monthly potential evapotranspiration in mm
Average monthly moisture index in cm
Summer-to-winter difference in monthly moisture index in cm
Average monthly precipitation in cm
Summer-to-winter difference in monthly precipitation in cm
Average monthly relative humidity in percent
Summer-to-winter difference in monthly relative humidity in percent
Average monthly potential global radiation in kJ
Summer-to-winter difference in monthly potential global radiation in kJ
Average monthly average temperature in degrees Celsius
Summer-to-winter difference in monthly average temperature in degrees Celsius
Average monthly minimum temperature in degrees Celsius
Summer-to-winter difference in monthly minimum temperature in degrees Celsius
Average monthly maximum temperature in degrees Celsius
Summer-to-winter difference in monthly maximum temperature in degrees Celsius
Mean average daytime temperature in degrees Celsius
Summer-to-winter difference in average daytime temperature in degrees Celsius
Average monthly average ambient vapor pressure in Pa
Summer-to-winter difference in monthly average ambient vapor pressure in Pa
Average monthly average saturated vapor pressure in Pa
Summer-to-winter difference in monthly average saturated vapor pressure in Pa
Average monthly average vapor pressure deficit in Pa
Summer-to-winter difference in monthly average vapor pressure deficit in Pa
Elevation in meters
Percent slope
Transformed Aspect: TransAspect=(1-cos(Aspect))/2
Distance to the nearest road in meters
Distance to the nearest trail in meters
Distance to the nearest road or trail in meters
Cutler, D. Richard, Thomas C. Edwards Jr., Karen H. Beard, Adele Cutler, Kyle T. Hess, Jacob Gibson, and Joshua J. Lawler. 2007. Random Forests for Classification in Ecology. Ecology 88(11): 2783-2792.
This dataset contains information about the presence and absence of common mullein (Verbascum thapsus) at 1,512 randomly selected sites in Lava Beds National Monument. The data were collected in summer 2006. This dataset may be used to evaluate predictive statistical procedures that have been fit on the mullein dataset.
mulleinTest
A data frame with 1,512 observations and 32 variables. One variable identifies the presence or absence of mullein in a 30m by 30m site and 31 variables are characteristics of the site where the data were collected.
In the original data there were 12 monthly values for each of the bioclimatic predictors. Principal components analyses suggested that for each of these predictors 2 principal components explained the vast majority (95.0%-99.5%) of the total variability. Based on these analyses, indices were created for each set of bioclimatic predictors. The variables with the suffix Ave in the variable name are the average of 12 monthly variables. The variables with the suffix Diff are contrasts between the sum of the April-September monthly values and the sum of the October-December and January-March monthly values, divided by 12. Roughly speaking, these are summer-to-winter contrasts.
The variables are summarized as follows:
Presence or absence of Verbascum thapsus, common mullein, (Absent = 0, Present = 1)
Degree days in degrees Celsius
Average monthly potential evapotranspiration in mm
Summer-to-winter difference in monthly potential evapotranspiration in mm
Average monthly moisture index in cm
Summer-to-winter difference in monthly moisture index in cm
Average monthly precipitation in cm
Summer-to-winter difference in monthly precipitation in cm
Average monthly relative humidity in percent
Summer-to-winter difference in monthly relative humidity in percent
Average monthly potential global radiation in kJ
Summer-to-winter difference in monthly potential global radiation in kJ
Average monthly average temperature in degrees Celsius
Summer-to-winter difference in monthly average temperature in degrees Celsius
Average monthly minimum temperature in degrees Celsius
Summer-to-winter difference in monthly minimum temperature in degrees Celsius
Average monthly maximum temperature in degrees Celsius
Summer-to-winter difference in monthly maximum temperature in degrees Celsius
Mean average daytime temperature in degrees Celsius
Summer-to-winter difference in average daytime temperature in degrees Celsius
Average monthly average ambient vapor pressure in Pa
Summer-to-winter difference in monthly average ambient vapor pressure in Pa
Average monthly average saturated vapor pressure in Pa
Summer-to-winter difference in monthly average saturated vapor pressure in Pa
Average monthly average vapor pressure deficit in Pa
Summer-to-winter difference in monthly average vapor pressure deficit in Pa
Elevation in meters
Percent slope
Transformed Aspect: TransAspect=(1-cos(Aspect))/2
Distance to the nearest road in meters
Distance to the nearest trail in meters
Distance to the nearest road or trail in meters
Cutler, D. Richard, Thomas C. Edwards Jr., Karen H. Beard, Adele Cutler, Kyle T. Hess, Jacob Gibson, and Joshua J. Lawler. 2007. Random Forests for Classification in Ecology. Ecology 88(11): 2783-2792.
predict.eztune
Computes predictions for a validation dataset.
## S3 method for class 'eztune'
predict(object, newdata, ...)
object |
An object of class " |
newdata |
Matrix or data frame containing the test or validation dataset. |
... |
Additional parameters to pass to predict. |
Function returns a vector of predictions if the response is continuous. If the response is binary, a data.frame with the predicted response and the probabilities of each response type is returned.
library(EZtune)
data(lichen)
data(lichenTest)

y <- lichen[, 2]
x <- lichen[, 9:41]

# Optimize an SVM classification model using the default settings
mod1 <- eztune(x, y)

# Obtain predictions using the lichenTest dataset and compute classification
# accuracy
pred <- predict(mod1, lichenTest)
mean(pred$predictions == as.factor(lichenTest$LobaOreg))

# Optimize an SVM regression model using the default settings
library(mlbench)
library(dplyr)
library(yardstick)
data(BostonHousing2)
bh <- mutate(BostonHousing2, lcrim = log(crim)) %>%
  select(-town, -medv, -crim)
x <- bh[, c(1:3, 5:17)]
y <- bh[, 4]
mod2 <- eztune(x, y)

# Obtain predictions from the original data and compute the rmse
pred <- predict(mod2, x)
rmse_vec(pred, y)