If you’ve been using R for a while, and you’ve been working with basic data visualization and data exploration techniques, the next logical step is to start learning some machine learning. To help you begin learning about machine learning in R, I’m going to introduce you to an R package: the caret package. We’ll build a very simple machine learning model as a way to learn some of caret’s basic syntax and functionality.

But before diving into caret, let’s quickly discuss what machine learning is and why we use it. Machine learning is the study of data-driven, computational methods for making inferences and predictions. Without going into extreme depth here, let’s unpack that by looking at an example.

Note: We use the as_tibble function from the tibble package to restructure our data following the introduction of the dummyVars dummy variables. This is mainly because we would like to include the species variable with the labels Adelie, Chinstrap and Gentoo, rather than the numbers 1, 2 and 3.

```r
library(tibble)
dummy_penguins <- dummyVars(species ~ ., data = ml_penguins)
ml_penguins_updated <- as_tibble(predict(dummy_penguins, newdata = ml_penguins))
# remember to include the outcome variable too
ml_penguins_updated <- cbind(species = ml_penguins$species, ml_penguins_updated)
head(ml_penguins_updated)
# species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex.female
```

Now, instead of sex taking the values of female or male, this variable has been replaced by the dummy variables sex.female and sex.male. Notice that in the first row, we have a value of 0 for sex.female and a value of 1 for sex.male - in other words, the data in the first row is for a male penguin.

```r
nearZeroVar(ml_penguins_updated, saveMetrics = T)
# freqRatio percentUnique zeroVar nzv
```

Here, we can see that, as identified previously, none of the variables have zero or near-zero variance (as shown in columns 3 and 4 of the output).

The freqRatio column computes the frequency of the most prevalent value recorded for that variable, divided by the frequency of the second most prevalent value. If we check this column, we see that all feature variables have a freqRatio value close to 1. This is good news, and means that we don’t have an unbalanced data set where one value is being recorded significantly more frequently than other values.

Finally, if we check the percentUnique column, we see the number of unique values recorded for each variable, divided by the total number of samples, and expressed as a percentage. If we only have a few unique values (i.e. the feature variable has near-zero variance) then the percentUnique value will be small. Therefore, higher values are considered better, but it is worth noting that as our data set increases in size, this percentage will naturally decrease.

Based on these results, we can see that none of the variables show concerning characteristics. All the variables have freqRatio values close to 1. The species, sex.male and sex.female variables have low percentUnique values, but this is to be expected for these types of variables (if they were continuous numeric variables, then this could be cause for concern). In other words, categorical variables, e.g. dummy variables, often have low percentUnique values. This is normal, and a low percentUnique value for a categorical feature variable is not by itself sufficient reason to remove the feature variable.
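To make the two diagnostic columns concrete, here is a minimal sketch that reproduces freqRatio and percentUnique by hand. The vector `x` is made-up toy data (not the penguin data set), used purely to illustrate the definitions:

```r
# Toy categorical vector (hypothetical data, for illustration only)
x <- c("a", "a", "a", "a", "b", "b", "c", "c", "c", "c")

# freqRatio: frequency of the most common value divided by
# the frequency of the second most common value
tab <- sort(table(x), decreasing = TRUE)
freq_ratio <- as.numeric(tab[1] / tab[2])

# percentUnique: number of distinct values as a percentage of sample size
percent_unique <- 100 * length(unique(x)) / length(x)

freq_ratio      # 4 / 4 = 1: the two most common values are equally frequent
percent_unique  # 3 unique values out of 10 samples = 30
```

A freqRatio near 1 means the top two values occur about equally often (a balanced variable), while a very large freqRatio combined with a tiny percentUnique is what leads caret to flag a variable as near-zero variance.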