ML Model Comparison for Fish Threat in R

Author

Elke Windschitl

Published

October 11, 2023

Description: In this qmd, I evaluate different supervised machine learning algorithms for predicting IUCN Red List status of fish based on ecological and morphological characteristics. These characteristics were retrieved from FishBase and joined with the IUCN data in a separate script.

Introduction

Global human activity threatens many species with extinction. According to the International Union for Conservation of Nature (IUCN), “More than 41,000 species are threatened with extinction. That is still 28% of all assessed species.”1. Increased extinction and loss of biodiversity can have severe ecological, economic, and cultural impacts. Cardinale et al.’s deep dive into biodiversity and ecosystem services research concludes that biodiversity loss reduces ecological communities’ efficiency, stability, and productivity. Decreased productivity from ecosystem services can have a negative impact on ecosystem economics2. Additionally, cultures worldwide have strong ties to local flora and fauna, much of which now face extinction risk. Improving understanding of extinction risk is ecologically, economically, and culturally important.

The IUCN Red List classifies species into various categories based on how vulnerable they are to extinction. The Red List also has many species that are listed as “Data Deficient” or “Not Evaluated”. Filling in these data gaps is extremely important when it comes to conservation. Evaluating marine populations, however, can prove especially challenging. It can be helpful to build on existing knowledge to inform where evaluation resources should be spent. Here, I build various machine learning models that predict binary Red List status of saltwater fish based on their ecological and morphological traits according to FishBase. I apply the most successful model to Red List Data Deficient and Not Evaluated species.

This work builds on my previous work Identifying Key Traits in Hawaiian Fish that Predict Risk of Extinction. However, here I am looking at all fish listed on the IUCN Red List – not just those in Hawaii – and I am using a Tidymodels machine learning approach.

The Data

For my analysis I use the IUCN Red List data accessed via the IUCN Red List API1 and package rredlist3. Consistent with Munstermann et al., living species listed as ‘Vulnerable’, ‘Endangered’, or ‘Critically Endangered’ were categorized as ‘Threatened’. Living species listed as ‘Least Concern’ and ‘Near Threatened’ were categorized as ‘Nonthreatened’4. I also chose to add ‘Extinct in the Wild’ to the ‘Threatened’ category. Fully extinct species were not included. The IUCN Red List data are limited in that many marine species have not been listed yet or have been identified as too data deficient to be evaluated. The lack of data on elusive fish may introduce bias into the models.
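The recoding rule is simple enough to sketch in dplyr. The snippet below is illustrative only (the actual join and cleaning happen in a separate script), and iucn and its category column of IUCN codes are assumed names, not objects from that script.

# Illustrative sketch of the Threatened/Nonthreatened recoding; iucn and
# category are assumed names, not the actual objects from the separate script
library(dplyr)
iucn <- iucn %>%
  mutate(IsOfConcern = case_when(
    category %in% c("VU", "EN", "CR", "EW") ~ 1, # Threatened, incl. Extinct in the Wild
    category %in% c("LC", "NT") ~ 0,             # Nonthreatened
    TRUE ~ NA_real_                              # Data Deficient / Not Evaluated
  ))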

Fish ecological data were accessed from FishBase5 via package rfishbase6. Different species in the FishBase data were originally described by different people, possibly leading to errors or biases. Measurement errors in length may be present, as there are various common ways to measure the length of a fish. The species recorded in FishBase may be biased towards fish with commercial value. Data were wrangled in R and formatted in a tidy data table with the following variables.

SpecCode GenusSpecies BodyShapeI DemersPelag AirBreathing DepthRangeShallow DepthRangeDeep LongevityWild Length Weight Importance PriceCateg MainCatchingMethod UsedforAquaculture GameFish Dangerous Electrogenic Neritic SupraLittoralZone Saltmarshes LittoralZone TidePools Intertidal SubLittoral Caves Oceanic Epipelagic Mesopelagic Bathypelagic Abyssopelagic Hadopelagic Estuaries Mangroves MarshesSwamps CaveAnchialine Stream Lakes Cave Cave2 FeedingType Parasitism Solitary Symbiosis Symphorism Commensalism Mutualism Epiphytic Schooling Shoaling Benthic Sessile Mobile Demersal Endofauna Pelagic Megabenthos Macrobenthos Meiobenthos SoftBottom Sand Coarse Fine Level Sloping Silt Mud Ooze Detritus Organic HardBottom Rocky Rubble Gravel SexualAttributes SexMorphology SexColors StrikingFeatures Forehead OperculumPresent TypeofEyes TypeofMouth PosofMouth MandibleTeeth MaxillaTeeth VomerineTeeth Palatine PharyngealTeeth TeethonTongue TypeofScales Scutes Keels HorStripesTTI HorStripesTTII VerStripesTTI VerStripesTTII VerStripesTTIII DiaStripesTTI DiaStripesTTII DiaStripesTTIII CurStripesTTI CurStripesTTII CurStripesTTIII SpotsTTI SpotsTTII SpotsTTIII DorsalFinI DorsalFinII CaudalFinI CaudalFinII AnalFinI AnalFinII LateralLinesNo LLinterrupted ScalesLateralmin ScalesLateralmax PoredScalesMin PoredScalesMax LatSeriesMin LatSeriesMax ScaleRowsAboveMin ScaleRowsAboveMax ScaleRowsBelowMin ScaleRowsBelowMax ScalesPeduncMin ScalesPeduncMax BarbelsNo BarbelsType GillCleftsNo Spiracle GillRakersLowMin GillRakersLowMax GillRakersUpMin GillRakersUpMax GillRakersTotalMin GillRakersTotalMax Vertebrae VertebraePreanMin VertebraePreanMax VertebraeTotalMin VertebraeTotalMax DorsalAttributes Dfinno DorsalSpinesMin DorsalSpinesMax Notched DorsalSoftRaysMin DorsalSoftRaysMax Adifin DFinletsmin DFinletsmax VFinletsmin VFinletsmax CShape Attributes Afinno AnalFinSpinesMin AnalFinSpinesMax Araymin Araymax PectoralAttributes Pspines2 Praymin Praymax PelvicsAttributes VPosition VPosition2 Vspines Vraymin Vraymax StandardLengthCm Forklength Totallength HeadLength PreDorsalLength PrePelvicsLength PreAnalLength PostHeadDepth PostTrunkDepth MaximumDepth PeduncleDepth PeduncleLength CaudalHeight PreorbitalLength EyeLength GasBladder ReproMode Fertilization MatingSystem MonogamyType MatingQuality SpawnAgg Spawning BatchSpawner ParentalCare MainCommonName IsOfConcern
(Ten example rows of this table, one species per row from Engraulis ringens through Epinephelus itajara, appear here in the rendered document; the 196-column rows are omitted because they do not lay out legibly in plain text.)
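For a sense of where these columns come from, here is a minimal rfishbase sketch. It is an assumption about the retrieval, not the separate script actually used, and the exact fields and join keys may differ.

# Assumed retrieval sketch: species(), ecology(), and morphology() are
# rfishbase table functions; SpecCode is FishBase's species identifier
library(rfishbase)
library(dplyr)
spp <- species(fields = c("SpecCode", "Genus", "Species", "BodyShapeI",
                          "DemersPelag", "DepthRangeDeep", "Length"))
fish_traits <- spp %>%
  left_join(ecology(), by = "SpecCode") %>%
  left_join(morphology(), by = "SpecCode")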

Methods

To get started, there are several packages I will be using. Tidyverse packages help with further cleaning and preparing data. Tidymodels packages have almost all of what I need for the machine learning steps. kknn helps me build my knn model. knitr is used to create kable tables. baguette is used in my bagging model. doParallel allows for parallel computing on my laptop. vip helps to identify variable importance.

# Load libraries
library(tidyverse)
library(tidymodels)
library(kknn)
library(knitr)
library(baguette)
library(doParallel)
library(vip)

First I read in the data. These data were cleaned and joined in a separate script, but they still need a bit of preprocessing. The outcome variable in this dataset is labeled IsOfConcern and indicates whether the species is at risk of extinction (1) or not (0). I start out by exploring the data dimensions.

# Read in data
fish_dat_full <- read_csv("/Users/elkewindschitl/Documents/data-sci/fish_data.csv") 
fish_dat <- fish_dat_full %>%
  filter(!is.na(IsOfConcern)) # remove rows that don't have the outcome variable

# Explore some characteristics of the dataset
cols <- ncol(fish_dat)
rows <- nrow(fish_dat)
df_chars <- data.frame(
  Metric = c("Number of Columns", "Number of Rows"),
  Count = c(ncol(fish_dat), nrow(fish_dat)))
kable(df_chars,
      col.names = c("", "Count"))
Count
Number of Columns 196
Number of Rows 9875
fish_dat %>% 
  group_by(IsOfConcern) %>%
  count() %>% 
  kable(col.names = c("Species threat is of concern", "Count"))
Species threat is of concern Count
0 9156
1 719

Data Prep

There are a lot of NA values in this dataset. Since I have a lot of columns, I can reduce the missingness by removing columns with a high proportion of NA values. Here I keep only columns where fewer than 20% of rows have NA values.

# Calculate the proportion of NA values in each column
na_proportion <- colMeans(is.na(fish_dat), na.rm = TRUE)

# Remove columns with extreme NA proportions (more than 20%)
# Define the threshold (20% or 0.20)
threshold <- 0.20

# Find columns with more than the threshold proportion of NA values
columns_meeting_threshold <- names(na_proportion[na_proportion <= threshold])

# Print the column names that meet the threshold
columns_meeting_threshold %>% kable(col.names = "Columns that are below NA threshold")
Columns that are below NA threshold
SpecCode
GenusSpecies
BodyShapeI
DemersPelag
AirBreathing
DepthRangeDeep
Length
PriceCateg
UsedforAquaculture
GameFish
Dangerous
Electrogenic
MainCommonName
IsOfConcern
# Select for just those columns that meet my criteria
fish_short <- fish_dat %>% 
  select(all_of(columns_meeting_threshold))

There is still more to be done. I want to make sure numeric columns are numeric, and that character columns are treated as factors. I need to make sure my outcome variable is a factor as well. I have a small enough data frame that I am able to do this by looking at which columns are character and mutating them to be factors. I want unordered factors, and I need to remove columns that are identifiers rather than features. Because some of the algorithms I am working with here do not handle missing data well, I chose to remove all of the rows that had NA values. This did unfortunately cut down on the amount of data that I have to train and test the algorithms on.

# Find character columns that need to be converted to factor
sapply(fish_short, class) %>% kable(col.names = c("Column", "Class"))
Column Class
SpecCode numeric
GenusSpecies character
BodyShapeI character
DemersPelag character
AirBreathing character
DepthRangeDeep numeric
Length numeric
PriceCateg character
UsedforAquaculture character
GameFish numeric
Dangerous character
Electrogenic character
MainCommonName character
IsOfConcern numeric
# List of character columns to convert to factors
character_columns_to_convert <- c("GenusSpecies", "BodyShapeI", "DemersPelag", "AirBreathing", "PriceCateg", "UsedforAquaculture", "Dangerous", "Electrogenic", "MainCommonName")

# Convert the specified character columns to factors
fish <- fish_short %>%
  mutate(across(all_of(character_columns_to_convert), as.factor))

# If feature is a factor DON'T order, remove identifying columns
fish <- fish %>% mutate_if(is.ordered, .funs = factor, ordered = F)  %>% 
  select(-GenusSpecies) %>% 
  select(-SpecCode) %>% 
  select(-MainCommonName)

# Make outcome factor
fish$IsOfConcern <- as.factor(fish$IsOfConcern)

# Remove rows with any remaining missing values
fish <- na.omit(fish)

# Check the new df
sapply(fish, class) %>% kable(col.names = c("Column", "Class"))
Column Class
BodyShapeI factor
DemersPelag factor
AirBreathing factor
DepthRangeDeep numeric
Length numeric
PriceCateg factor
UsedforAquaculture factor
GameFish numeric
Dangerous factor
Electrogenic factor
IsOfConcern factor
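As a quick check (not part of the original output), the cost of na.omit() can be quantified before moving on.

# Count and proportion of rows lost to removing NA values
rows_dropped <- nrow(fish_short) - nrow(fish)
rows_dropped
rows_dropped / nrow(fish_short)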

After my data are prepped, I need to split the data into training and testing sets. I use a 70/30 split. I have unbalanced data, so I stratify by my outcome variable, IsOfConcern.

set.seed(123)
# Initial split of data, 70/30 (initial_split() defaults to 75/25, so prop is set explicitly)
fish_split <- initial_split(fish, prop = 0.7, strata = IsOfConcern)
fish_train <- training(fish_split)  # Training data
fish_test <- testing(fish_split)    # Test data

Preprocessing

I create a recipe for the preprocessing steps. I create dummy columns so that each level of the factor (categorical) variables gets its own column. I remove columns where there is no variation in the data. Then I normalize the numeric columns, because the lasso and knn algorithms require normalization to keep certain features from dominating the model. I use the same preprocessing steps for all algorithms so the comparison is fair.

set.seed(123)
# Preprocess the data within the recipe
fish_recipe <- recipe(IsOfConcern ~ ., data = fish_train) %>% 
  step_dummy(all_factor(), -all_outcomes(), one_hot = TRUE) %>% 
  step_zv(all_predictors()) %>%  
  step_normalize(all_numeric(), -all_outcomes())

# Check test and train dfs look as expected
prepped <- fish_recipe %>% 
  prep()
fish_baked_train <- bake(prepped, fish_train)
fish_baked_test <- bake(prepped, fish_test)

# Use below to check for NA values in the entire dataframes
#any(is.na(fish_baked_train))
#any(is.na(fish_baked_test))

Dummy Classifier

Because my data are unbalanced, with many more non-threatened species, a model that always chose non-threatened would have a high accuracy. Of course, that is not very helpful when trying to predict which species are or might be threatened. Here I derive a dummy accuracy by calculating the accuracy of a model that always predicts non-threatened. This will serve as a baseline for judging whether a model is performing well (better than the dummy) or not. However, because this dataset is so imbalanced, I will be using area under the ROC curve (AUC) for model selection.

# Calculate dummy classifier for baseline comparison
# Calculate the number of rows where IsOfConcern is 0
num_is_0 <- sum(fish_test$IsOfConcern == 0)

# Calculate the number of rows where IsOfConcern is not 0
num_is_not_0 <- nrow(fish_test) - num_is_0

# Calculate the accuracy of the dummy classifier (always predicting the majority class)
dummy <- num_is_0 / nrow(fish_test)

The dummy classifier accuracy is 0.94. This will serve as the baseline for other algorithms. Now I will proceed with building various models and training with the training data. I will be building Lasso, K-Nearest Neighbors, Decision Tree, Bagged Decision Tree, Random Forest, and Gradient Boosted Decision Tree models.
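As an aside, tidymodels ships a formal null model that always predicts the majority class; a sketch with parsnip's null_model() would give the same baseline as the hand-computed version above.

# Sketch of an equivalent baseline using parsnip's built-in null model
null_spec <- null_model() %>%
  set_engine("parsnip") %>%
  set_mode("classification")
null_fit <- fit(null_spec, IsOfConcern ~ ., data = fish_train)
predict(null_fit, fish_test) %>%
  bind_cols(fish_test) %>%
  accuracy(truth = IsOfConcern, estimate = .pred_class)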

Lasso for Classification

set.seed(123)

# Set up k-fold cross validation with 10 folds. This can be used for all the algorithms
fish_cv <- fish_train %>% 
  vfold_cv(v = 10,
           strata = IsOfConcern)

# Set specifications
tune_l_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

# Define a workflow
wf_l <- workflow() %>%
  add_model(tune_l_spec) %>% 
  add_recipe(fish_recipe)
# Set up the penalty grid
lambda_grid <- grid_regular(penalty(), levels = 50)

doParallel::registerDoParallel()
set.seed(123)

# Tune lasso model
lasso_grid <- wf_l %>% 
  tune_grid(
    resamples = fish_cv,
    grid = lambda_grid
)

# Plot the mean accuracy and AUC at each penalty
lasso_grid %>%
  collect_metrics() %>%
  ggplot(aes(penalty, mean, color = .metric)) +
  geom_errorbar(aes(ymin = mean - std_err,
                    ymax = mean + std_err),
                alpha = 0.5) +
  geom_line(size = 1.5) +
  facet_wrap(~.metric, 
             scales = "free", 
             strip.position = "left",
             nrow = 2, labeller = as_labeller(c(`accuracy` = "Accuracy", 
                                                `roc_auc` = "Area under ROC curve"))) +
  scale_x_log10(name = "Penalty") +
  scale_y_continuous(name = "") +
  scale_color_manual(values = c("#4a6c75", "#57ba72")) +
  theme_minimal() +
  theme(
    strip.placement = "outside",
    legend.position = "none",
    panel.background = element_blank(),
    plot.background = element_blank()
  ) +
  labs(title = "Results of penalty tuning")

# View table
lasso_grid %>%
  tune::show_best(metric = "roc_auc") %>%
  slice_head(n = 5) %>% 
  kable(caption = "Performance of the best models and the associated estimates for the penalty parameter values.")
Performance of the best models and the associated estimates for the penalty parameter values.
penalty .metric .estimator mean n std_err .config
0.0022230 roc_auc binary 0.8411065 10 0.0163830 Preprocessor1_Model37
0.0013895 roc_auc binary 0.8406888 10 0.0174560 Preprocessor1_Model36
0.0008685 roc_auc binary 0.8404846 10 0.0176116 Preprocessor1_Model35
0.0035565 roc_auc binary 0.8397288 10 0.0156837 Preprocessor1_Model38
0.0005429 roc_auc binary 0.8392379 10 0.0172147 Preprocessor1_Model34
# Select the model with the highest auc
best_lasso <- lasso_grid %>%
  select_best("roc_auc")

final_l_wf <- wf_l %>% 
  finalize_workflow(best_lasso)

# Perform a last fit to see how the model performs on the test data
final_lasso_fit <- last_fit(final_l_wf, fish_split)

# Collect metrics on the test data
tibble_lasso <- final_lasso_fit %>% collect_metrics()
tibble_lasso %>% 
  kable(caption = "Accuracy and area under ther receiver operator curve of the final fit.")
Accuracy and area under ther receiver operator curve of the final fit.
.metric .estimator .estimate .config
accuracy binary 0.9433472 Preprocessor1_Model1
roc_auc binary 0.8017930 Preprocessor1_Model1
# Grab the model accuracy on the testing data
final_lasso_accuracy <- tibble_lasso %>%
  filter(.metric == "accuracy") %>%
  pull(.estimate)
final_lasso_auc <- tibble_lasso %>%
  filter(.metric == "roc_auc") %>%
  pull(.estimate)

# Bind predictions and original data
lasso_test_rs <- cbind(fish_test, final_lasso_fit$.predictions)[, -16]# Remove duplicate column

# Compute a confusion matrix
cm_lasso <- lasso_test_rs %>% yardstick::conf_mat(truth = IsOfConcern, estimate = .pred_class) 

# Create a custom color palette
custom_palette <- scale_fill_gradient(
  high = "#4a6c75",   
  low = "#d3e6eb"  
)

# Create the confusion matrix heatmap plot
autoplot(cm_lasso, type = "heatmap") +
  custom_palette +  # Apply the custom color palette
  theme(
    axis.text.x = element_text(size = 12),
    axis.text.y = element_text(size = 12),
    axis.title = element_text(size = 14),
    panel.background = element_blank(),
    plot.background = element_blank()
  ) +
  labs(title = "Confusion matrix of lasso predictions on test data")

# Calculate rates of true pos., false neg., etc. from the confusion matrix
TP_las <- cm_lasso$table[2, 2]
FP_las <- cm_lasso$table[2, 1]
TN_las <- cm_lasso$table[1, 1]
FN_las <- cm_lasso$table[1, 2]
TPR_las <- TP_las / (TP_las + FN_las)  # True Positive Rate
FPR_las <- FP_las / (FP_las + TN_las)  # False Positive Rate
TNR_las <- TN_las / (TN_las + FP_las)  # True Negative Rate
FNR_las <- FN_las / (TP_las + FN_las)  # False Negative Rate

# Create cm df to hold all false pos, etc. metrics
lasso_cm_vec <- c(TPR_las, FPR_las, TNR_las, FNR_las)
row_names <- c("True positive rate", "False positive rate", "True negative rate", "False negative rate")
cm_df <- bind_cols(Metric = row_names, Lasso = lasso_cm_vec)

The accuracy for the lasso model was 0.943, which is slightly better than our dummy classifier with its accuracy of 0.94. This model had an AUC of 0.802.
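It can also be informative to see which coefficients the lasso kept. A sketch (not part of the model comparison) using the fitted workflow returned by last_fit():

# Sketch: nonzero lasso coefficients at the selected penalty
final_lasso_fit %>%
  extract_fit_parsnip() %>%
  tidy() %>%
  filter(estimate != 0) %>%
  arrange(desc(abs(estimate)))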

K-Nearest Neighbors

set.seed(123)

# Define the KNN model with tuning
knn_spec_tune <- nearest_neighbor(neighbors = tune()) %>% # tune k
  set_mode("classification") %>% 
  set_engine("kknn")

# Define a new workflow
wf_knn_tune <- workflow() %>% 
  add_model(knn_spec_tune) %>% 
  add_recipe(fish_recipe)
    
# Fit the workflow on the predefined folds and hyperparameters
fit_knn_cv <- wf_knn_tune %>% 
  tune_grid( 
    fish_cv, 
    grid = data.frame(neighbors = c(1,5,10,15,seq(20,200,10))))

# Use autoplot() to examine how different parameter configurations relate to accuracy
autoplot(fit_knn_cv) +
  theme_light() +
  labs(
    x = "Number of neighbors (K)",
    title = "Results of neighbor tuning"
  ) +
  theme(
    legend.position = "none",
    panel.background = element_blank(),
    plot.background = element_blank()
  ) +
  facet_wrap(
    ~.metric,
    nrow = 2,
    labeller = labeller(.metric = c("accuracy" = "Accuracy", "roc_auc" = "Area under ROC curve"))
  )

# View table
fit_knn_cv %>%
  tune::show_best(metric = "roc_auc") %>%
  slice_head(n = 5) %>% 
  kable(caption = "Performance of the best models and the associated estimates for the number of neighbors parameter values.")
Performance of the best models and the associated estimates for the number of neighbors parameter values.
neighbors .metric .estimator mean n std_err .config
150 roc_auc binary 0.8630058 10 0.0132910 Preprocessor1_Model18
130 roc_auc binary 0.8628657 10 0.0131475 Preprocessor1_Model16
140 roc_auc binary 0.8628149 10 0.0132096 Preprocessor1_Model17
160 roc_auc binary 0.8627477 10 0.0133205 Preprocessor1_Model19
170 roc_auc binary 0.8625283 10 0.0133526 Preprocessor1_Model20
# Select the model with the highest auc
best_knn <- fit_knn_cv %>%
  select_best("roc_auc")

# The final workflow for our KNN model
final_knn_wf <-
  wf_knn_tune %>% 
  finalize_workflow(best_knn)

# Use last_fit() approach to apply model to test data
final_knn_fit <- last_fit(final_knn_wf, fish_split)

# Collect metrics on the test data
tibble_knn <- final_knn_fit %>% collect_metrics()
tibble_knn %>% 
  kable(caption = "Accuracy and area under the receiver operator curve of the final fit.")
Accuracy and area under the receiver operator curve of the final fit.
.metric .estimator .estimate .config
accuracy binary 0.9407484 Preprocessor1_Model1
roc_auc binary 0.8511741 Preprocessor1_Model1
# Store accuracy and AUC
final_knn_accuracy <- tibble_knn %>%
  filter(.metric == "accuracy") %>%
  pull(.estimate)
final_knn_auc <- tibble_knn %>%
  filter(.metric == "roc_auc") %>%
  pull(.estimate)

# Bind predictions and original data
knn_test_rs <- cbind(fish_test, final_knn_fit$.predictions)[, -16]

# Compute a confusion matrix
cm_knn <- knn_test_rs %>% yardstick::conf_mat(truth = IsOfConcern, estimate = .pred_class) 

# Create the confusion matrix heatmap plot
autoplot(cm_knn, type = "heatmap") +
  custom_palette +
  theme(
    axis.text.x = element_text(size = 12),
    axis.text.y = element_text(size = 12),
    axis.title = element_text(size = 14),
    panel.background = element_blank(),
    plot.background = element_blank()
  ) +
  labs(title = "Confusion matrix of knn predictions on test data")

# Calculate rates from the confusion matrix
TP_knn <- cm_knn$table[2, 2]
FP_knn <- cm_knn$table[2, 1]
TN_knn <- cm_knn$table[1, 1]
FN_knn <- cm_knn$table[1, 2]
TPR_knn <- TP_knn / (TP_knn + FN_knn)  # True Positive Rate
FPR_knn <- FP_knn / (FP_knn + TN_knn)  # False Positive Rate
TNR_knn <- TN_knn / (TN_knn + FP_knn)  # True Negative Rate
FNR_knn <- FN_knn / (TP_knn + FN_knn)  # False Negative Rate

# Add rates to cm df
knn_cm_vec <- c(TPR_knn, FPR_knn, TNR_knn, FNR_knn)
cm_df$KNN <- knn_cm_vec

The k-nearest neighbors model had nearly the same accuracy at predicting threat status as the dummy classifier. The accuracy of the model was 0.941. This model had an AUC of 0.851, which is better than the lasso model.

Decision Tree

# Tell the model that we are tuning hyperparams
tree_spec_tune <- decision_tree(
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()) %>% 
  set_engine("rpart") %>% 
  set_mode("classification")

# Set up grid
tree_grid <- grid_regular(cost_complexity(), tree_depth(), min_n(), levels = 5)

# Check grid
#tree_grid 

# Define a workflow with the recipe and specification
wf_tree_tune <- workflow() %>% 
  add_recipe(fish_recipe) %>% 
  add_model(tree_spec_tune)

doParallel::registerDoParallel(cores = 3) #build trees in parallel

# Tune
tree_rs <- tune_grid(
  wf_tree_tune,
  resamples = fish_cv,
  grid = tree_grid,
  metrics = metric_set(roc_auc)
)

# Use autoplot() to examine how different parameter configurations relate to auc
autoplot(tree_rs) + 
  theme_light() +
  scale_color_manual(values = c("#4a6c75", "#57ba72", "#d596e0", "#e06d53", "#d6cf81")) + 
  labs(x = "Cost-complexity parameter",
       y = "Area under the ROC curve",
       title = "Results of tree tuning") +
  theme(
    plot.title = element_text(size = 16, hjust = 0.5),
    panel.background = element_blank(),
    plot.background = element_blank()
  )

# View table
tree_rs %>%
  tune::show_best(metric = "roc_auc") %>%
  slice_head(n = 5) %>% 
  kable(caption = "Performance of the best models and the associated estimates for the tuned tree parameter values.")
Performance of the best models and the associated estimates for the tuned tree parameter values.
cost_complexity tree_depth min_n .metric .estimator mean n std_err .config
0.0000000 11 11 roc_auc binary 0.8010909 10 0.0156204 Preprocessor1_Model041
0.0000000 11 11 roc_auc binary 0.8010909 10 0.0156204 Preprocessor1_Model042
0.0000032 11 11 roc_auc binary 0.8010909 10 0.0156204 Preprocessor1_Model043
0.0005623 11 11 roc_auc binary 0.8010909 10 0.0156204 Preprocessor1_Model044
0.0005623 15 11 roc_auc binary 0.7960204 10 0.0160636 Preprocessor1_Model049
# Finalize the model specs with the best hyperparameter result
final_tree <- finalize_model(tree_spec_tune, select_best(tree_rs))

# Final fit to test data 
final_tree_fit <- last_fit(final_tree, IsOfConcern~., fish_split) # does training fit then final prediction as well

# Collect metrics from fit
tibble_tree <- final_tree_fit %>% collect_metrics()
tibble_tree %>% kable(caption = "Accuracy and area under the receiver operator curve of the final fit.")
Accuracy and area under the receiver operator curve of the final fit.
.metric .estimator .estimate .config
accuracy binary 0.9438669 Preprocessor1_Model1
roc_auc binary 0.7751052 Preprocessor1_Model1
# Store accuracy and auc metrics
final_tree_accuracy <- tibble_tree %>%
  filter(.metric == "accuracy") %>%
  pull(.estimate)

final_tree_auc <- tibble_tree %>%
  filter(.metric == "roc_auc") %>%
  pull(.estimate)

# Bind predictions and original data
tree_test_rs <- cbind(fish_test, final_tree_fit$.predictions)[, -16]

# Compute a confusion matrix
cm_tree <- tree_test_rs %>% yardstick::conf_mat(truth = IsOfConcern, estimate = .pred_class) 

# Create the confusion matrix heatmap plot
autoplot(cm_tree, type = "heatmap") +
  custom_palette +
  theme(
    axis.text.x = element_text(size = 12),
    axis.text.y = element_text(size = 12),
    axis.title = element_text(size = 14),
    panel.background = element_blank(),
    plot.background = element_blank()
  ) +
  labs(title = "Confusion matrix of decision tree predictions on test data")

# Calculate rates from the confusion matrix
TP_tree <- cm_tree$table[2, 2]
FP_tree <- cm_tree$table[2, 1]
TN_tree <- cm_tree$table[1, 1]
FN_tree <- cm_tree$table[1, 2]
TPR_tree <- TP_tree / (TP_tree + FN_tree)  # True Positive Rate
FPR_tree <- FP_tree / (FP_tree + TN_tree)  # False Positive Rate
TNR_tree <- TN_tree / (TN_tree + FP_tree)  # True Negative Rate
FNR_tree <- FN_tree / (TP_tree + FN_tree)  # False Negative Rate

# Add rates to cm df
tree_cm_vec <- c(TPR_tree, FPR_tree, TNR_tree, FNR_tree)
cm_df$DecisionTree <- tree_cm_vec

The decision tree model had a higher accuracy at predicting threat status than the dummy classifier. The accuracy of the decision tree was 0.944. The AUC is 0.775, which is lower than both the lasso and KNN models.
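One advantage of a single decision tree is that it can be drawn. A sketch, assuming the rpart.plot package (which is not loaded above):

# Sketch: draw the final tree from the underlying rpart engine fit
library(rpart.plot)
final_tree_fit %>%
  extract_fit_engine() %>%
  rpart.plot(roundint = FALSE)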

Bagging

set.seed(123)
# Set bagging tuning specifications
bag_spec <- 
  bag_tree(cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()) %>% 
  set_engine("rpart", times = 75) %>% # 25 ensemble members 
  set_mode("classification")

# Set up tuning grid
bag_grid <- grid_regular(cost_complexity(), tree_depth(), min_n(), levels = 5)

# Check the grid space
#bag_grid

# Set up bagging workflow
wf_bag <- workflow() %>% 
  add_recipe(fish_recipe) %>% 
  add_model(bag_spec)

doParallel::registerDoParallel() #build trees in parallel

# Run tuning
bag_rs <- tune_grid(
  wf_bag,
  resamples = fish_cv,
  grid = bag_grid,
  metrics = metric_set(roc_auc)
)

# Use autoplot() to examine how different parameter configurations relate to auc
autoplot(bag_rs) + 
  theme_light() +
  scale_color_manual(values = c("#4a6c75", "#57ba72", "#d596e0", "#e06d53", "#d6cf81")) + 
  labs(x = "Cost-complexity parameter",
       y = "Area under the ROC curve",
       title = "Results of bagging tuning") +
  theme(
    plot.title = element_text(size = 16, hjust = 0.5),
    panel.background = element_blank(),
    plot.background = element_blank()
  )

# View table
bag_rs %>%
  tune::show_best(metric = "roc_auc") %>%
  slice_head(n = 5) %>% 
  kable(caption = "Performance of the best models and the associated estimates for the tuned tree parameter values.")
Performance of the best models and the associated estimates for the tuned tree parameter values.
cost_complexity tree_depth min_n .metric .estimator mean n std_err .config
0.0005623 15 11 roc_auc binary 0.8820775 10 0.0098248 Preprocessor1_Model049
0.0005623 15 21 roc_auc binary 0.8812587 10 0.0123964 Preprocessor1_Model074
0.0000032 15 11 roc_auc binary 0.8791856 10 0.0137413 Preprocessor1_Model048
0.0000032 11 21 roc_auc binary 0.8787266 10 0.0122258 Preprocessor1_Model068
0.0000000 15 11 roc_auc binary 0.8779793 10 0.0117088 Preprocessor1_Model047
# Finalize the model specs with the best performing model
final_bag <- finalize_model(bag_spec, select_best(bag_rs))

# Perform a last fit of the model on the testing data
final_bag_fit <- last_fit(final_bag, IsOfConcern~., fish_split)

# Collect metrics from fit
tibble_bag <- final_bag_fit %>% collect_metrics()
tibble_bag %>% 
  kable(caption = "Accuracy and area under ther receiver operator curve of the final fit.")
Accuracy and area under ther receiver operator curve of the final fit.
.metric .estimator .estimate .config
accuracy binary 0.9438669 Preprocessor1_Model1
roc_auc binary 0.8568390 Preprocessor1_Model1
# Store model accuracy and auc on testing data
final_bag_accuracy <- tibble_bag %>%
  filter(.metric == "accuracy") %>%
  pull(.estimate)
final_bag_auc <- tibble_bag %>%
  filter(.metric == "roc_auc") %>%
  pull(.estimate)

# Bind predictions and original data
bag_test_rs <- cbind(fish_test, final_bag_fit$.predictions)[, -16]

# Compute a confusion matrix
cm_bag <- bag_test_rs %>% yardstick::conf_mat(truth = IsOfConcern, estimate = .pred_class) 

# Create the confusion matrix heatmap plot
autoplot(cm_bag, type = "heatmap") +
  custom_palette +  # Apply the custom color palette
  theme(
    axis.text.x = element_text(size = 12),
    axis.text.y = element_text(size = 12),
    axis.title = element_text(size = 14),
    panel.background = element_blank(),
    plot.background = element_blank()
  ) +
  labs(title = "Confusion matrix of bagging predictions on test data")

# Calculate rates from the confusion matrix
TP_bag <- cm_bag$table[2, 2]
FP_bag <- cm_bag$table[2, 1]
TN_bag <- cm_bag$table[1, 1]
FN_bag <- cm_bag$table[1, 2]
TPR_bag <- TP_bag / (TP_bag + FN_bag)  # True Positive Rate
FPR_bag <- FP_bag / (FP_bag + TN_bag)  # False Positive Rate
TNR_bag <- TN_bag / (TN_bag + FP_bag)  # True Negative Rate
FNR_bag <- FN_bag / (TP_bag + FN_bag)  # False Negative Rate

# Add rates to cm df
bag_cm_vec <- c(TPR_bag, FPR_bag, TNR_bag, FNR_bag)
cm_df$Bagging <- bag_cm_vec

The bagging model had a similar accuracy at predicting threat status as the decision tree and lasso models. The accuracy of the bagged model was 0.944. The AUC is 0.857, which is similar to the KNN model.

Random Forest

set.seed(123)

# Set forest specifications
forest_spec <- 
  rand_forest(min_n = tune(),
              mtry = tune(),
              trees = tune()) %>%
  set_engine("ranger") %>%
  set_mode("classification")

# Set grid for tuning
forest_grid <- grid_regular(min_n(), mtry(c(1,44)), trees(), levels = 5)

# Check the grid space
#forest_grid

# Set the workflow with the recipe and forest specs
wf_forest <- workflow() %>% 
  add_recipe(fish_recipe) %>% 
  add_model(forest_spec)

doParallel::registerDoParallel()

# Perform tuning
forest_rs <- tune_grid(
  wf_forest,
  resamples = fish_cv,
  grid = forest_grid,
  metrics = metric_set(roc_auc)
)

# Use autoplot() to examine how different parameter configurations relate to auc
autoplot(forest_rs) + 
  theme_light() +
  scale_color_manual(values = c("#4a6c75", "#57ba72", "#d596e0", "#e06d53", "#d6cf81")) + 
  labs(x = "# Randomly selected predictors",
       y = "Area under the ROC curve",
       title = "Results of random forest tuning") +
  theme(
    plot.title = element_text(size = 16, hjust = 0.5),
    panel.background = element_blank(),
    plot.background = element_blank()
  )

# View table
forest_rs %>%
  tune::show_best(metric = "roc_auc") %>%
  slice_head(n = 5) %>% 
  kable(caption = "Performance of the best models and the associated estimates for the tree and forest parameter values.")
Performance of the best models and the associated estimates for the tree and forest parameter values.
mtry trees min_n .metric .estimator mean n std_err .config
11 2000 11 roc_auc binary 0.8855631 10 0.0134102 Preprocessor1_Model107
11 500 2 roc_auc binary 0.8850774 10 0.0133817 Preprocessor1_Model031
11 1500 11 roc_auc binary 0.8850390 10 0.0138726 Preprocessor1_Model082
11 1500 2 roc_auc binary 0.8847734 10 0.0135810 Preprocessor1_Model081
11 1000 11 roc_auc binary 0.8846533 10 0.0141003 Preprocessor1_Model057
# Finalize model with specs and best hyperparameters
final_forest <- finalize_model(forest_spec, select_best(forest_rs))

# Perform last fit
final_forest_fit <- last_fit(final_forest, IsOfConcern~., fish_split)

# Collect performance metrics
tibble_forest <- final_forest_fit %>% collect_metrics()
tibble_forest %>% 
  kable(caption = "Accuracy and area under ther receiver operator curve of the final fit.")
Accuracy and area under ther receiver operator curve of the final fit.
.metric .estimator .estimate .config
accuracy binary 0.9433472 Preprocessor1_Model1
roc_auc binary 0.8240705 Preprocessor1_Model1
# Store accuracy and auc metrics
final_forest_accuracy <- tibble_forest %>%
  filter(.metric == "accuracy") %>%
  pull(.estimate)
final_forest_auc <- tibble_forest %>%
  filter(.metric == "roc_auc") %>%
  pull(.estimate)

# Bind predictions and original data
forest_test_rs <- cbind(fish_test, final_forest_fit$.predictions)[, -16]

# Compute a confusion matrix
cm_forest <- forest_test_rs %>% yardstick::conf_mat(truth = IsOfConcern, estimate = .pred_class) 

# Create the confusion matrix heatmap plot
autoplot(cm_forest, type = "heatmap") +
  custom_palette +  # Apply the custom color palette
  theme(
    axis.text.x = element_text(size = 12),
    axis.text.y = element_text(size = 12),
    axis.title = element_text(size = 14),
    panel.background = element_blank(),
    plot.background = element_blank()
  ) +
  labs(title = "Confusion matrix of random forest predictions on test data")

# Calculate rates from the confusion matrix
TP_forest <- cm_forest$table[2, 2]
FP_forest <- cm_forest$table[2, 1]
TN_forest <- cm_forest$table[1, 1]
FN_forest <- cm_forest$table[1, 2]
TPR_forest <- TP_forest / (TP_forest + FN_forest)  # True Positive Rate
FPR_forest <- FP_forest / (FP_forest + TN_forest)  # False Positive Rate
TNR_forest <- TN_forest / (TN_forest + FP_forest)  # True Negative Rate
FNR_forest <- FN_forest / (TP_forest + FN_forest)  # False Negative Rate

# Add rates to cm df
forest_cm_vec <- c(TPR_forest, FPR_forest, TNR_forest, FNR_forest)
cm_df$RandomForest <- forest_cm_vec

The accuracy of the forest was 0.943. This is again similar to the other models. This model had an AUC of 0.824.
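As an aside, ranger only records variable importance if it is requested at fit time, so inspecting the forest with vip (as I do for boosting later) would require refitting with the importance engine argument set. A sketch under that assumption:

# Sketch: refit the best forest with impurity importance recorded, then plot it
forest_imp_spec <- final_forest %>%
  set_engine("ranger", importance = "impurity")
workflow() %>%
  add_recipe(fish_recipe) %>%
  add_model(forest_imp_spec) %>%
  fit(fish_train) %>%
  extract_fit_parsnip() %>%
  vip()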

Boosting

# Tune learning rate first
# Set up specs for learning rate tuning
lr_spec <- parsnip::boost_tree(mode = "classification",
                                engine = "xgboost",
                                trees = 3000,
                                learn_rate = tune())

# Set up tuning grid
lr_grid <- expand.grid(learn_rate = seq(0.0001, 0.5, length.out = 50))

# Set up workflow
wf_lr_tune <- workflow() %>% 
  add_recipe(fish_recipe) %>% 
  add_model(lr_spec)

doParallel::registerDoParallel()
set.seed(123)

# Tune
lr_rs <- tune_grid(
  wf_lr_tune,
  resamples = fish_cv,
  grid = lr_grid
)

# Use autoplot() to examine how different parameter configurations relate to accuracy 
autoplot(lr_rs) +   
  theme_light() +
  labs(
    x = "Learning rate",
    title = "Results from learning rate tuning"
  ) +
  theme(
    legend.position = "none",
    panel.background = element_blank(),
    plot.background = element_blank()
  ) +
  facet_wrap(
    ~.metric,
    nrow = 2,
    labeller = labeller(.metric = c("accuracy" = "Accuracy", "roc_auc" = "Area under ROC curve"))
  )

# Identify best values from the tuning process
lr_rs %>%
  tune::show_best(metric = "roc_auc") %>%
  slice_head(n = 5) %>% 
  kable(caption = "Performance of the best models and the associated estimates for the learning rate parameter values.")
Performance of the best models and the associated estimates for the learning rate parameter values.
learn_rate .metric .estimator mean n std_err .config
0.0103020 roc_auc binary 0.8697860 10 0.0125356 Preprocessor1_Model02
0.0307061 roc_auc binary 0.8587251 10 0.0141796 Preprocessor1_Model04
0.0205041 roc_auc binary 0.8584129 10 0.0126799 Preprocessor1_Model03
0.0511102 roc_auc binary 0.8574748 10 0.0135668 Preprocessor1_Model06
0.0409082 roc_auc binary 0.8572797 10 0.0133747 Preprocessor1_Model05
# Select best learning rate hyperparameter
best_learn <- lr_rs %>%
  tune::select_best("roc_auc")

# Tune tree parameters next
# Create a new specification that fixes the learning rate and tunes the tree parameters
boost_tree_spec <- parsnip::boost_tree(mode = "classification",
                                engine = "xgboost",
                                trees = 3000,
                                learn_rate = best_learn$learn_rate,
                                min_n = tune(),
                                tree_depth = tune(),
                                loss_reduction = tune()
                                )

# Define parameters to be tuned
boost_params <- dials::parameters(
  min_n(),
  tree_depth(),
  loss_reduction()
)

# Set up a tuning grid using grid_max_entropy() to get a representative sampling of the parameter space.
boost_tree_grid <- dials::grid_max_entropy(boost_params, size = 50)

# Set up workflow
wf_boost_tree_tune <- workflow() %>% 
  add_recipe(fish_recipe) %>% 
  add_model(boost_tree_spec)

set.seed(123)
doParallel::registerDoParallel()

# Tune
boost_tree_rs <- tune_grid(
  wf_boost_tree_tune,
  resamples = fish_cv,
  grid = boost_tree_grid
)

# Identify best values from the tuning process
boost_tree_rs %>%
  tune::show_best(metric = "roc_auc") %>% 
  slice_head(n = 5) %>% 
  kable(caption = "Performance of the best models and the associated estimates for the tree parameter values.")
Performance of the best models and the associated estimates for the tree parameter values.
min_n tree_depth loss_reduction .metric .estimator mean n std_err .config
5 13 1.4059617 roc_auc binary 0.8826142 10 0.0139348 Preprocessor1_Model11
5 6 0.5640640 roc_auc binary 0.8751618 10 0.0139514 Preprocessor1_Model25
10 3 0.0000086 roc_auc binary 0.8744951 10 0.0127641 Preprocessor1_Model34
11 3 0.0000000 roc_auc binary 0.8741408 10 0.0133400 Preprocessor1_Model21
15 8 0.0000000 roc_auc binary 0.8734961 10 0.0128297 Preprocessor1_Model50
# Select best tree hyperparameters
boost_best_trees <- boost_tree_rs %>%
  tune::select_best("roc_auc")

# Tune Stochastic Parameters
# Create another specification that fixes the learning rate and tree parameters and tunes the stochastic parameters.
boost_stoc_spec <- parsnip::boost_tree(mode = "classification",
                                engine = "xgboost",
                                trees = 3000,
                                learn_rate = best_learn$learn_rate,
                                min_n = boost_best_trees$min_n,
                                tree_depth = boost_best_trees$tree_depth,
                                mtry = tune(),                   
                                loss_reduction = boost_best_trees$loss_reduction,
                                sample_size = tune(),
                                stop_iter = tune()
                                )

# Set up a tuning grid using grid_max_entropy() again.
# Define parameters to be tuned
boost_stoc_params <- dials::parameters(
  finalize(mtry(),
           select(fish_baked_train,-IsOfConcern)),
  sample_size = sample_prop(c(.4, .9)),
  stop_iter())

# Set up grid
boost_stoc_grid <- dials::grid_max_entropy(boost_stoc_params, size = 50)

# Set up workflow
wf_boost_stoc <- workflow() %>% 
  add_recipe(fish_recipe) %>% 
  add_model(boost_stoc_spec)

set.seed(123)
doParallel::registerDoParallel()

# Tune
boost_stoc_rs <- tune_grid(
  wf_boost_stoc,
  resamples = fish_cv,
  grid = boost_stoc_grid
)

# Identify best values from the tuning process
boost_stoc_rs %>%
  tune::show_best(metric = "roc_auc") %>%
  slice_head(n = 5) %>% 
  kable(caption = "Performance of the best models and the associated estimates for the stochastic parameter values.")
Performance of the best models and the associated estimates for the stochastic parameter values.
mtry sample_size stop_iter .metric .estimator mean n std_err .config
8 0.7172139 6 roc_auc binary 0.8839072 10 0.0144816 Preprocessor1_Model36
6 0.7955117 11 roc_auc binary 0.8838613 10 0.0143617 Preprocessor1_Model14
4 0.8387867 18 roc_auc binary 0.8826144 10 0.0145272 Preprocessor1_Model11
12 0.8546558 8 roc_auc binary 0.8821623 10 0.0136444 Preprocessor1_Model22
7 0.6991360 10 roc_auc binary 0.8820257 10 0.0143307 Preprocessor1_Model46
# Select best hyperparameters from tuning
boost_best_stoch <- boost_stoc_rs %>%
  tune::select_best("roc_auc")

# Finalize workflow
# Assemble final workflow with all of the optimized parameters and do a final fit.
boost_final_spec <- parsnip::boost_tree(mode = "classification",
                                engine = "xgboost",
                                trees = 1000,
                                learn_rate = best_learn$learn_rate,
                                min_n = boost_best_trees$min_n,
                                tree_depth = boost_best_trees$tree_depth,
                                mtry = boost_best_stoch$mtry,                   
                                loss_reduction = boost_best_trees$loss_reduction,
                                stop_iter = boost_best_stoch$stop_iter,
                                sample_size = boost_best_stoch$sample_size
                                )

# Set up workflow
wf_boost_final <- workflow() %>% 
  add_recipe(fish_recipe) %>% 
  add_model(boost_final_spec)

# Fit to just the training data (needed for later)
final_simple_fit <- wf_boost_final %>%
  fit(data = fish_train)

# Final fit
final_boost_fit <- last_fit(boost_final_spec, IsOfConcern~., fish_split)

# Store accuracy and auc metrics
tibble_boost <- final_boost_fit %>% collect_metrics()
tibble_boost %>% 
  kable(caption = "Accuracy and area under ther receiver operator curve of the final fit.")
Accuracy and area under ther receiver operator curve of the final fit.
.metric .estimator .estimate .config
accuracy binary 0.9438669 Preprocessor1_Model1
roc_auc binary 0.8593097 Preprocessor1_Model1
final_boost_accuracy <- tibble_boost %>%
  filter(.metric == "accuracy") %>%
  pull(.estimate)

final_boost_auc <- tibble_boost %>%
  filter(.metric == "roc_auc") %>%
  pull(.estimate)

# Bind predictions and original data
boost_test_rs <- cbind(fish_test, final_boost_fit$.predictions)[, -16]

# Compute a confusion matrix
cm_boost <- boost_test_rs %>% yardstick::conf_mat(truth = IsOfConcern, estimate = .pred_class) 

# Create the confusion matrix heatmap plot
autoplot(cm_boost, type = "heatmap") +
  custom_palette +  # Apply the custom color palette
  theme(
    axis.text.x = element_text(size = 12),
    axis.text.y = element_text(size = 12),
    axis.title = element_text(size = 14),
    panel.background = element_blank(),
    plot.background = element_blank()
  ) +
  labs(title = "Confusion matrix of random boosted predictions on test data")

# Calculate rates from the confusion matrix
TP_boost <- cm_boost$table[2, 2]
FP_boost <- cm_boost$table[2, 1]
TN_boost <- cm_boost$table[1, 1]
FN_boost <- cm_boost$table[1, 2]
TPR_boost <- TP_boost / (TP_boost + FN_boost)  # True Positive Rate
FPR_boost <- FP_boost / (FP_boost + TN_boost)  # False Positive Rate
TNR_boost <- TN_boost / (TN_boost + FP_boost)  # True Negative Rate
FNR_boost <- FN_boost / (TP_boost + FN_boost)  # False Negative Rate

# Add rates to cm df
boost_cm_vec <- c(TPR_boost, FPR_boost, TNR_boost, FNR_boost)
cm_df$Boosting <- boost_cm_vec

The accuracy of the boosted model was 0.944, again similar to the other models. This model had an AUC of 0.859, which is the best of all models by a small margin.

Model Results

Model selection

I want to compare accuracy and area under the curve of all models created.

# Name models in vec
models <- c("Dummy", "Lasso", "KNN", "Decision Tree", "Bagging", "Random Forest", "Boosting")

# Create accuracy vec
accuracy <- c(dummy, final_lasso_accuracy, final_knn_accuracy, final_tree_accuracy, final_bag_accuracy, final_forest_accuracy, final_boost_accuracy)

# Make df
accuracy_df <- data.frame(models, accuracy)

# Create a factor with the desired order for models
accuracy_df$models <- factor(accuracy_df$models, levels = c("Dummy", "Lasso", "KNN", "Decision Tree", "Bagging", "Random Forest", "Boosting"))

# Create the plot
ggplot(accuracy_df, aes(x = models, y = accuracy)) +
  geom_col(fill = "#4a6c75") +
  theme_minimal() +
  labs(title = "Accuracy was similar across all models",
    x = "Model",        
    y = "Accuracy") +
  geom_text(aes(label = round(accuracy, 3)), vjust = -0.5) +
  theme(plot.background = element_blank(),
        panel.background = element_blank())

# Create auc vec
auc <- c(final_lasso_auc, final_knn_auc, final_tree_auc, final_bag_auc, final_forest_auc, final_boost_auc)

# Make df
auc_df <- data.frame(models = models[-1], auc)

# Create a factor with the desired order for models
auc_df$models <- factor(auc_df$models, levels = c("Lasso", "KNN", "Decision Tree", "Bagging", "Random Forest", "Boosting"))

ggplot(auc_df, aes(x = models, y = auc)) +
  geom_col(fill = "#4a6c75") +
  theme_minimal() +
  labs(title = "Boosting was the highest performing model",
    x = NULL,        
    y = "Area under ROC curve") +
    geom_text(aes(label = round(auc, 3)), vjust = -0.5) +
  theme(plot.background = element_blank(),
        panel.background = element_blank())

Because the dataset is very imbalanced, with the proportion of threatened species being small, the accuracies don’t vary much between models. It is difficult to be much more accurate than always choosing not threatened (the dummy classifier), because that alone has an accuracy of 0.94. All models performed better than the dummy classifier, though, so that is encouraging. Looking at AUC, we see that boosting decision trees was the most successful of the models when applied to the testing data.
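For a closer look at the winning model, the ROC curve behind its AUC can be plotted from the predictions saved by last_fit(). A sketch:

# Sketch: ROC curve for the boosted model's test-set predictions;
# .pred_0 is the probability of the first factor level ("0"), which
# yardstick treats as the event by default
final_boost_fit %>%
  collect_predictions() %>%
  roc_curve(truth = IsOfConcern, .pred_0) %>%
  autoplot()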

I also want to compare the confusion matrix values of each model.

# Print table
cm_df %>% 
  kable()
Metric Lasso KNN DecisionTree Bagging RandomForest Boosting
True positive rate 0.1826087 0.0086957 0.3217391 0.2521739 0.2434783 0.2521739
False positive rate 0.0082919 0.0000000 0.0165837 0.0121614 0.0121614 0.0121614
True negative rate 0.9917081 1.0000000 0.9834163 0.9878386 0.9878386 0.9878386
False negative rate 0.8173913 0.9913043 0.6782609 0.7478261 0.7565217 0.7478261
# Create a named vector with colors for each Metric
color_palette <- c("True positive rate" = "#8ae68d",
                   "False positive rate" = "#F3A6DE",
                   "True negative rate" = "#1b4a1c",
                   "False negative rate" = "#B3297A")

# Plot the true pos etc rates
cm_df %>% 
  pivot_longer(cols = Lasso:Boosting) %>% 
  ggplot(aes(x = name, y = value, fill = Metric)) +
  geom_col() +
  theme_minimal() +
  labs(x = NULL,
       y = "Value",
       title = "Rates of true positive, true negative, etc. for each model")  +
  theme(plot.background = element_blank(),
        panel.background = element_blank()) +
  scale_fill_manual(values = color_palette)

The decision tree had the highest true positive rate and the lowest false negative rate of all the models. However, it also had the largest false positive rate (although all false positive rates were generally small). In the context of investigating fish populations, false positives could cause a misuse of resources. The true negative rate is comparable across all models. Based on these metrics and the AUC, either boosting or a decision tree appears to be a good option. Because I want to minimize false positive outcomes, I will proceed with the boosted model for predictions.
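
If minimizing false positives is the priority, another option would be to raise the classification threshold on the boosted model's predicted probabilities instead of using the default 0.5 cutoff. A hypothetical sketch, assuming a probability column named .pred_1 is available alongside the test data:

# Hypothetical stricter cutoff: only flag a fish as threatened when its
# predicted probability exceeds 0.7 (assumes a .pred_1 probability column)
boost_test_rs %>%
  mutate(.pred_strict = factor(as.integer(.pred_1 > 0.7), levels = c(0, 1))) %>%
  yardstick::conf_mat(truth = IsOfConcern, estimate = .pred_strict)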

Variable Importance

I want to look at the variable importance of the most successful model, the boosted tree model.

# Variable importance (vi() comes from the vip package)
var_imp_boost <- wf_boost_final %>% 
  fit(fish_train) %>% 
  extract_fit_parsnip() %>%  # current replacement for the deprecated pull_workflow_fit()
  vi() %>%
  mutate(
    Importance = abs(Importance),
    Variable = fct_reorder(Variable, Importance))
var_imp_boost %>% 
  ggplot(aes(x = Importance, y = Variable)) +
  geom_col(fill = "#57ba72") +
  theme_minimal() +
  scale_x_continuous(expand = c(0, 0)) +
  labs(y = "Feature",
       x = "Importance",
       title = "Feature importance of boosting model") +
  theme(plot.background = element_blank(),
        panel.background = element_blank())

It looks like length, other body shape, depth range (deep), and having no special electrogenic ability are driving the model the most.
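
With this many candidate features, the full importance plot can be crowded. Restricting it to the top contributors is a simple refinement; a small sketch reusing the objects above:

# Optional: plot only the ten most important features for readability
var_imp_boost %>%
  slice_max(Importance, n = 10) %>%
  ggplot(aes(x = Importance, y = Variable)) +
  geom_col(fill = "#57ba72") +
  theme_minimal() +
  labs(title = "Top ten features of the boosting model")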

Prediction Results

I want to use the boosted model to predict Threatened/Nonthreatened status for species that are listed as 'Data Deficient' or NA on the IUCN Red List. I read in the uncategorized species, apply the same preparation steps to them, and then apply the model to the new data.

set.seed(123)
# Read in data for predictions
prediction_dat <- read_csv("/Users/elkewindschitl/Documents/data-sci/predict_data.csv")

# Remove columns not used above
predict_short <- prediction_dat %>% 
  select(all_of(columns_meeting_threshold[1:13]))

# Convert the specified character columns to factors
predict_fac <- predict_short %>%
  mutate(across(all_of(character_columns_to_convert), as.factor))

predict_fac$Index <- 1:nrow(predict_fac)

# If feature is a factor DON'T order, remove identifying columns
predict_fac <- predict_fac %>% 
  mutate_if(is.ordered, .funs = factor, ordered = F)
predict_features <- predict_fac %>% 
  select(-GenusSpecies) %>% 
  select(-SpecCode) %>% 
  select(-MainCommonName)

# Remove rows with any remaining missing values
predict_features_no_na <- na.omit(predict_features)

# Remove index column
predict_no_index <- predict_features_no_na[-11]

# Apply the model
predictions <- final_simple_fit %>% predict(new_data = predict_no_index)

# Pull name columns
species_names_p <- predict_fac %>%
  select(GenusSpecies, MainCommonName, Index)

# Bind to data and join w names
predicted_df <- cbind(predict_features_no_na, predictions) %>% 
  left_join(y = species_names_p, 
            by = join_by(Index))

# Isolate threatened species
predicted_threatened <- predicted_df %>% 
  filter(.pred_class == 1) %>% 
  select(GenusSpecies, MainCommonName)

# Display threatened species
predicted_threatened %>% 
  kable(col.names = c("Genus Species", "Main Common Name"),
        caption = "Data deficient species predicted to be threatened.")
Data deficient species predicted to be threatened.
Genus Species Main Common Name
Makaira mazara NA
Cephaloscyllium isabella NA
Carcharhinus perezii NA
Raja texana NA
Hippocampus guttulatus Long-snouted Seahorse
Anarhichas lupus NA
Pristis microdon NA
Zapteryx exasperata Banded Guitarfish
Myliobatis californica NA
Mobula japanica NA
Anarhichas orientalis NA
Epinephelus lanceolatus Giant Grouper
Dasyatis brevis NA
Raja cervigoni NA
Rhina ancylostomus NA
Raja herwigi Cape Verde Skate
Raja rouxi NA
Dasyatis tortonesei Tortonese’s Stingray
Pseudobatos glaucostigma NA
Echinophryne crassispina NA
Hemitrygon bennettii NA
Rhinobatos formosensis NA
Acroteriobatus ocellatus Speckled Guitarfish
Hippocampus coronatus High-crowned Seahorse
Lophius litulon Yellow Goosefish
Echinophryne reynoldsi NA
Mobula eregoodootenkee NA
Dipturus trachyderma NA
Tetronarce semipelagica NA
Hippocampus colemani Coleman’s Pygmy Seahorse
Squatina punctata NA

Of the 3,560 data deficient species with complete feature information, my model flagged 31 species as likely threatened.
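
Both counts can be read directly off the objects created above:

# Counts behind the figures above
nrow(predict_features_no_na)  # data deficient species with complete features
nrow(predicted_threatened)    # species predicted to be threatened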

Conclusions

Machine learning algorithms can be a helpful tool for classifying data that would otherwise be missing. Here, I predicted which data deficient IUCN Red List saltwater fish species might be threatened based on a set of their ecological and morphological characteristics. However, it is important to note the limitations of my approach. Because some of the algorithms I used are sensitive to missing values in the features, I only worked with species that had complete data. Elusive fish may not be represented in FishBase and therefore might not be represented in my model. I also cannot apply the model to fish that lack data on both the Red List and FishBase. Finally, these data were very imbalanced; a continuation of this analysis might explore techniques that address class imbalance, such as the resampling sketch below.
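
For example, the themis package extends tidymodels recipes with steps for rebalancing classes during training. A hypothetical sketch (the recipe here is illustrative, not the one used above):

library(themis)

# Hypothetical rebalanced recipe: downsample the majority class so threatened
# and nonthreatened fish are equally represented during training
balanced_recipe <- recipe(IsOfConcern ~ ., data = fish_train) %>%
  step_downsample(IsOfConcern)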

References

1 “IUCN,” IUCN Red List of Threatened Species. Version 2022-1, 2022. https://www.iucnredlist.org/ (accessed Dec. 02, 2022).

2 B. J. Cardinale et al., “Biodiversity loss and its impact on humanity,” Nature, vol. 486, no. 7401, Art. no. 7401, Jun. 2012, doi: 10.1038/nature11148.

3 “IUCN,” IUCN Red List of Threatened Species. Version 2022-1, 2015. www.iucnredlist.org

4 M. J. Munstermann et al., “A global ecological signal of extinction risk in terrestrial vertebrates,” Conserv. Biol., vol. 36, no. 3, p. e13852, 2022, doi: 10.1111/cobi.13852.

5 R. Froese and D. Pauly, “FishBase,” 2022. www.fishbase.org

6 C. Boettiger, D. Temple Lang, and P. Wainwright, “rfishbase: exploring, manipulating and visualizing FishBase data from R,” J. Fish Biol., 2012, doi: 10.1111/j.1095-8649.2012.03464.x.