Cleaning and Joining Data from FishBase and IUCN RedList in R

Author

Elke Windschitl

Published

December 9, 2022

Description: In this qmd, I showcase how to access, clean, and join data from FishBase and the IUCN Red List.

Introduction

As a data scientist, it is frequently necessary to join data from two or more sources. Sometimes this can be trickier than others depending on the state of the original data sources. Here, I will access data from FishBase and the The International Union for Conservation of Nature (IUCN) Red List. Luckily, these data sources already have fairly thoughtful layouts and clean data. Both data sources have quite a lot of data on fish and other animal species, so I will need to select and filter for what I am interested in. Here, I am interested in fish in Hawaii. This data accessing, cleaning, and join, was done in preparation for a logistic regression analysis to identify key traits in Hawaiian fish that predict risk of extinction (Not shown here).

The Data

I use the IUCN Red List data accessed via the IUCN Red List API 1 and package rredlist. The Red List categorizes species by Not Evaluated, Data Deficient, Least Concern, Near Threatened, Vulnerable, Endangered, Critically Endangered, Extinct in the Wild and Extinct. Extinct species were not evaluated in this analysis. The IUCN Red List data are limited in that many marine species have not been listed yet or have been identified as too data deficient to be evaluated.

Fish ecological data were accessed from FishBase 2 via package rfishbase. Different species in the FishBase data were originally described by different people, possibly leading to errors or biases.

Methods

To get started, there are several packages I will be using. I will use rfishbase to access the FishBase data, rredlist to access the IUCN Red List data, tidyverse for data cleaning and joining, janitor for data cleaning, and knitr for displaying data tables.

# Load libraries
library(rfishbase)
library(tidyverse)
library(rredlist)
library(janitor)
library(knitr)

I access the species and country data from FishBase. I filter for fish in Hawaii then join the species information to the Hawaiian fish with a left join.

# Load all species in FishBase
species <- fb_tbl("species") %>% 
  janitor::clean_names()
  species %>% 
    select(-"comments") %>% 
    head() %>% kable()
spec_code genus species species_ref_no author f_bname pic_preferred_name pic_preferred_name_m pic_preferred_name_f pic_preferred_name_j fam_code subfamily gen_code sub_gen_code body_shape_i source author_ref remark tax_issue fresh brack saltwater demers_pelag air_breathing air_breathing_ref ana_cat migrat_ref depth_range_shallow depth_range_deep depth_range_ref depth_range_com_shallow depth_range_com_deep depth_com_ref longevity_wild longevity_wild_ref longevity_captive longevity_cap_ref vulnerability vulnerability_climate length l_type_max_m length_female l_type_max_f max_length_ref common_length l_type_com_m common_length_f l_type_com_f common_length_ref weight weight_female max_weight_ref pic picture_female larva_pic egg_pic importance_ref importance price_categ price_reliability remarks7 landing_statistics landings main_catching_method ii m_seines m_gillnets m_castnets m_traps m_spears m_trawls m_dredges m_liftnets m_hooks_lines m_other usedfor_aquaculture life_cycle aquaculture_ref usedas_bait bait_ref aquarium aquarium_fish_ii aquarium_ref game_fish game_ref dangerous dangerous_ref electrogenic electro_ref complete google_image profile pd50 emblematic entered date_entered modified date_modified expert date_checked ts
24523 Aborichthys kempi 39226 Chaudhuri, 1913 NA Abkem_u1.jpg NA NA NA 692 NA 784 NA elongated O NA NA 0 1 0 0 demersal WaterAssumed NA NA NA NA NA NA NA NA NA NA NA NA NA 10.00 NA 8.10 SL NA NA 39226 NA NA NA NA NA NA NA NA NA NA NA NA 4832 of no interest unknown NA NA NA NA NA NA 0 0 0 0 0 0 0 0 0 0 never/rarely NA NA never/rarely NA never/rarely NA NA 0 NA harmless NA no special ability NA NA 1 NA 0.5039 0 10 1996-07-18 65 2013-07-18 10 2001-09-08 NA
65802 Aborichthys rosammai 95217 Sen, 2009 NA NA NA NA NA 692 NA 784 NA Elongated O NA NA 0 1 0 0 demersal WaterAssumed NA NA NA NA NA NA NA NA NA NA NA NA NA 10.00 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 0 0 0 0 0 0 0 0 0 0 never/rarely NA NA never/rarely NA never/rarely NA NA 0 NA harmless NA no special ability NA NA 1 NA 0.5039 0 10 2010-10-05 65 2021-10-25 NA NA NA
24524 Aborichthys tikaderi 4832 Barman, 1985 NA NA NA NA NA 692 NA 784 NA Elongated O NA NA 0 1 0 0 demersal WaterAssumed NA NA NA NA NA NA NA NA NA NA NA NA NA 10.00 NA 10.90 SL NA NA 95217 NA NA NA NA NA NA NA NA NA NA NA NA 4832 of no interest unknown NA NA NA NA NA NA 0 0 0 0 0 0 0 0 0 0 never/rarely NA NA never/rarely NA never/rarely NA NA 0 NA harmless NA no special ability NA NA 1 NA 0.5039 0 10 1996-07-18 65 2021-10-25 10 2003-01-09 NA
67300 Aborichthys verticauda 95164 Arunachalam, Raja, Malaiammal & Mayden, 2014 NA NA NA NA NA 692 NA 784 NA elongated O NA NA NA 1 0 0 demersal WaterAssumed NA NA NA NA NA NA NA NA NA NA NA NA NA 10.00 NA 7.75 SL NA NA 95164 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 0 0 0 0 0 0 0 0 0 0 never/rarely NA NA never/rarely NA never/rarely NA NA 0 NA harmless NA no special ability NA NA 1 NA 0.5039 0 10 2014-04-05 10 2014-09-04 NA NA NA
67609 Aborichthys waikhomi 95217 Kosygin, 2012 NA NA NA NA NA 692 NA 784 NA elongated O NA NA NA 1 0 0 demersal WaterAssumed NA NA NA NA NA NA NA NA NA NA NA NA NA 10.00 NA 6.80 SL NA NA 95217 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 0 0 0 0 0 0 0 0 0 0 never/rarely NA NA never/rarely NA never/rarely NA NA 0 NA harmless NA no special ability NA NA 1 NA 0.5039 0 10 2015-02-02 NA 2015-02-03 NA NA NA
268 Abramis brama 59043 (Linnaeus, 1758) Freshwater bream Abbra_ue.jpg NA NA NA 756 Leuciscinae 50 NA fusiform / normal O NA NA 0 1 1 0 benthopelagic WaterAssumed NA potamodromous 51243 1 NA 9696 NA NA NA 23 796 17 72462 61.85 NA 82.00 TL NA NA 6114 25 TL NA NA 3561 6010 NA 4699 ABBRA_U0 NA NA NA 4931 highly commercial unknown NA NA from 50,000 to 100,000 Russia (Ref. 9988) seines 0 1 0 1 0 1 0 1 1 0 commercial 12108 usually NA never/rarely 274 1 4699 harmless NA no special ability NA NA 1 NA 1.0000 0 2 1990-10-17 2 2013-08-28 97 2003-01-03 NA
# Load all species by country
country <- fb_tbl("country") %>% 
  janitor::clean_names() 
country %>% 
  select(-"comments") %>% 
  head() %>% kable()
autoctr stockcode c_code spec_code country_ref_no also_ref status current_presence freshwater brackish saltwater land abundance ref_abundance importance ref_importance ex_vessel_price aquaculture ref_aquaculture live_export_org live_export ref_live_export game bait regulations ref_regulations threatened country_sub_comp entered date_entered modified date_modified expert date_checked ts
435 1 638 2 13246 1739 introduced present 1 1 0 0 NA NA NA NA NA NA NA 0 NA NA 0 0 NA NA 0 0 19 1992-08-27 1 2000-11-21 NA NA NA
436 1 646 2 2 35818 native present 1 0 0 0 NA NA NA NA NA commercial 7306 0 NA NA 0 0 NA NA 0 0 19 1992-08-27 10 2013-02-05 NA NA NA
437 1 662 2 12228 NA introduced present 1 0 0 0 NA NA NA NA NA commercial NA 0 NA NA 0 0 NA NA 0 0 97 1997-03-12 97 1998-02-02 NA NA NA
438 1 670 2 26042 NA introduced present 1 0 0 0 NA NA NA NA NA NA NA 0 NA NA 0 0 NA NA 0 0 97 1998-02-02 1 2000-11-21 409 2000-09-28 NA
439 1 678 2 28179 NA introduced present 1 0 0 0 NA NA NA NA NA experimental 28179 0 NA NA 0 0 NA NA 0 0 97 2002-10-25 NA 2002-10-25 NA NA NA
440 1 682 2 104 12228 introduced present 1 1 0 0 NA NA minor commercial 8984 NA commercial 7306 0 NA NA 0 0 NA NA 0 0 18 1993-08-30 1 1999-03-22 NA NA NA
# Find all species that might live in Hawaii
hi_fish_c <- country %>% 
  filter(c_code == "840B") # Found in documentation
hi_fish_c %>% 
  select(-"comments") %>% 
  head() %>% kable()
autoctr stockcode c_code spec_code country_ref_no also_ref status current_presence freshwater brackish saltwater land abundance ref_abundance importance ref_importance ex_vessel_price aquaculture ref_aquaculture live_export_org live_export ref_live_export game bait regulations ref_regulations threatened country_sub_comp entered date_entered modified date_modified expert date_checked ts
75 3 840B 3 5360 6362 introduced present 1 1 0 0 NA NA NA NA NA NA NA 0 NA NA 0 0 NA NA 0 0 19 1992-08-27 1472 2008-01-07 56 1993-04-08 NA
582 9 840B 7 58302 NA native present 0 0 1 0 NA NA NA NA NA NA NA 0 NA NA 0 0 NA NA 0 0 2 1990-12-11 1472 2008-01-17 NA NA NA
154 16 840B 10 583 NA native present 0 1 1 0 NA NA NA NA NA NA NA 0 NA NA 0 0 NA NA 0 0 10 1993-02-24 NA 1996-11-15 NA NA NA
1323 50 840B 40 58302 NA native present 0 0 1 0 NA NA NA NA NA NA NA 0 NA NA 0 0 NA NA 0 0 10 1993-02-16 1472 2008-01-09 NA NA NA
1524 56 840B 46 58302 NA native present 0 0 1 0 NA NA NA NA NA NA NA 0 NA NA 0 0 NA NA 0 0 1472 2007-06-13 NA 2007-06-13 NA NA NA
2574 87 840B 77 43 NA native present 0 0 1 0 NA NA NA NA NA NA NA 0 NA NA 0 0 NA NA 0 0 1 1990-11-10 1472 2008-01-24 NA NA NA
# Join hi_fish with species info
hi_fish <- left_join(hi_fish_c, species, by = "spec_code") 

I do some cleaning, isolate the species list, and further explore what is offered in FishBase. FishBase has a LOT of information available. I want to add some ecological and common name information to my data frame of Hawaiian fish. Then, I select for the columns and rows I need for my later analysis that contain currently present fish in Hawaii and their ecological traits of interest.

# Concatenate Genus and Species
hi_fish$genus_species <- paste(hi_fish$genus, hi_fish$species)
hi_fish <- hi_fish %>% relocate(genus_species, .after = spec_code) %>% 
  relocate(length, weight, .after = genus_species)

# Isolate list of species
hi_species_list <- as.character(hi_fish$genus_species)

# Check out what is available in the various tables
tables <- docs()
ecology <- fb_tbl("ecology")
com_names <- fb_tbl("comnames")

# Grab the ecological traits I'm looking for with my Hawaiian species list
hi_ecol <- ecology(species_list = hi_species_list, 
                   fields = c("SpecCode", "CoralReefs", "FeedingType", "Schooling")) %>% 
  clean_names()# Theres duplicates here but I remove them later
hi_ecol %>% head() %>% kable()
spec_code coral_reefs feeding_type schooling
6652 -1 variable -1
5689 -1 variable 0
6630 -1 grazing on aquatic plants 0
5402 -1 hunting macrofauna (predator) 0
5403 -1 hunting macrofauna (predator) 0
972 0 hunting macrofauna (predator) -1
# Grab the common names I'm looking for with my Hawaiian species list
hi_com_names <- common_names(species_list = hi_species_list,
                             fields = c("SpecCode", "ComName")) %>% 
  clean_names() 
hi_com_names %>% head() %>% kable()
spec_code species com_name language
3 Oreochromis mossambicus African mouthbrooder English
3 Oreochromis mossambicus African perch English
3 Oreochromis mossambicus Blue bream English
3 Oreochromis mossambicus Blue tilapia English
3 Oreochromis mossambicus Common tilapia English
3 Oreochromis mossambicus Hawaiian perch English
# Check to see if any hi_ecol species are duplicated, then remove the duplicate
sum(duplicated(hi_ecol$spec_code))
[1] 1
duplicated_species <- hi_ecol$spec_code[duplicated(hi_ecol$spec_code)]
hi_ecol[hi_ecol$spec_code %in% duplicated_species, ]
# A tibble: 2 × 4
  spec_code coral_reefs feeding_type              schooling
      <int>       <int> <chr>                         <int>
1      1412           0 grazing on aquatic plants         0
2      1412           0 <NA>                              0
hi_ecol <- hi_ecol[!duplicated(hi_ecol$spec_code), ]
sum(duplicated(hi_ecol$spec_code))
[1] 0
# Combine data sets then select and filter
hi_fish_chars <- left_join(hi_fish, hi_ecol, by = "spec_code") %>% 
  select(c("spec_code", "genus_species", "length", "l_type_max_m", "weight", 
           "status", "current_presence", "genus", "species", "importance.y", 
           "price_categ", "coral_reefs", "feeding_type", "schooling")) %>% 
  filter(current_presence == "present") %>% # Data frame full of fish characteristics
  rename(length_cm = length) 
hi_fish_chars %>% head() %>% kable()
spec_code genus_species length_cm l_type_max_m weight status current_presence genus species importance.y price_categ coral_reefs feeding_type schooling
3 Oreochromis mossambicus 39.0 SL 1130 introduced present Oreochromis mossambicus highly commercial unknown 0 browsing on substrate 0
7 Coryphaena equiselis 145.7 TL NA native present Coryphaena equiselis minor commercial unknown 0 hunting macrofauna (predator) 0
10 Alectis indica 165.0 TL 25000 native present Alectis indica commercial medium -1 hunting macrofauna (predator) 0
40 Argyropelecus affinis 8.4 SL NA native present Argyropelecus affinis of no interest unknown 0 selective plankton feeding 0
46 Melanolagus bericoides 20.0 SL NA native present Melanolagus bericoides of no interest unknown 0 selective plankton feeding 0
77 Istiophorus platypterus 348.0 FL 100240 native present Istiophorus platypterus commercial very high -1 hunting macrofauna (predator) 0

Next, I access the IUCN Red List data through the IUCN Red List API. I load the entire list of Red List species in 16 requests – the number of available pages at the time of this analysis.

# Identify token for accessing IUCN API
iucn_token <- Sys.getenv("IUCN_KEY")

# Create an empty list to store the data frames
species_list <- list()

# Import all species on IUCN Redlist
for (i in 0:15) {
  # Get data from API and assign to a variable with a name
  species_list[[i + 1]] <- rl_sp(page = i, key = iucn_token)$result
}

# Combine all data frames into one and select only columns I need
all_iucn_species <- do.call(rbind, species_list) %>% 
  select(c("scientific_name",
           "category",
           "main_common_name")) %>% 
  rename(genus_species = scientific_name) 
all_iucn_species %>% head() %>% kable()
genus_species category main_common_name
Aaadonta angaurana CR NA
Aaadonta constricta EN NA
Aaadonta fuscozonata EN NA
Aaadonta irregularis CR NA
Aaadonta kinlochi CR NA
Aaadonta pelewana CR NA

Now I have a large data frame with all species on the Red List, their status, and their common name. I don’t need ALL species, though, so I need to combine these data with my Hawaii fish data from FishBase. I do this by left joining the Red List data to the Hawaii fish characteristics. I remove species that show up multiple times in the Red List data and might not be reliable.

# Join data
hi_fish_status <- left_join(hi_fish_chars, all_iucn_species,
                            by = "genus_species") # Duplicate rows introduced

# Identify which rows are here multiple times
status_unique <- as.data.frame(table(hi_fish_status$spec_code)) %>% 
  setNames(c("spec_code", "freq")) %>% 
  filter(!freq != 1) # remove rows w freq > 1
#view(status_unique)

# Recombine with status df
hi_fish_status$spec_code <- as.factor(hi_fish_status$spec_code)
hi_fish_status <- left_join(status_unique, hi_fish_status, 
                            by = "spec_code")
hi_fish_status %>% head() %>% kable()
spec_code freq genus_species length_cm l_type_max_m weight status current_presence genus species importance.y price_categ coral_reefs feeding_type schooling category main_common_name
3 1 Oreochromis mossambicus 39.0 SL 1130 introduced present Oreochromis mossambicus highly commercial unknown 0 browsing on substrate 0 VU Mozambique Tilapia
6 1 Coryphaena hippurus 210.0 TL 40000 native present Coryphaena hippurus highly commercial high -1 hunting macrofauna (predator) 0 LC Common Dolphinfish
7 1 Coryphaena equiselis 145.7 TL NA native present Coryphaena equiselis minor commercial unknown 0 hunting macrofauna (predator) 0 LC Pompano Dolphinfish
10 1 Alectis indica 165.0 TL 25000 native present Alectis indica commercial medium -1 hunting macrofauna (predator) 0 LC Indian Threadfish
40 1 Argyropelecus affinis 8.4 SL NA native present Argyropelecus affinis of no interest unknown 0 selective plankton feeding 0 LC Pacific Hatchet Fish
46 1 Melanolagus bericoides 20.0 SL NA native present Melanolagus bericoides of no interest unknown 0 selective plankton feeding 0 LC Bigscale Deepsea Smelt

I remove columns where fish were data deficient or not assessed. Then, consistent with Munstermann et al., living species listed as ‘Vulnerable’, ‘Endangered’, or ‘Critically Endangered’ were categorized as ‘Threatened’. Living species listed as ‘Least Concern’ and ‘Near Threatened’ were categorized as ‘Nonthreatened’ 3.

# Drop all rows with na values of interest
hi_status_drop_na <- hi_fish_status %>% 
  filter(!category == "NA") %>% 
  filter(!category == "DD")

# Make a binary column with 1 as some level of concern and 0 as least concern
tidy_fish_data <- hi_status_drop_na %>% 
  mutate(is_of_concern = case_when(category == "CR" | 
                                     category == "EN" |
                                     category == "VU" ~ 1,
                                   category == "LR/nt" |
                                     category == "NT" |
                                     category == "LC" ~ 0)) %>% 
  mutate(coral_reefs = coral_reefs * - 1) %>% 
  mutate(reef_associated = case_when(coral_reefs == 1 ~ "yes",
                                     coral_reefs == 0 ~ "no")) %>% 
  mutate(is_endemic = case_when(status == "endemic" ~ "yes",
                                status == "native" |
                                  status == "introduced" ~ "no")) %>% 
  select(-c("freq"))

Results

hi_status_drop_na %>% 
  arrange(genus_species) %>% 
  slice_head(n = 20) %>% kable()
spec_code freq genus_species length_cm l_type_max_m weight status current_presence genus species importance.y price_categ coral_reefs feeding_type schooling category main_common_name
11144 1 Abantennarius analis 7.8 SL NA native present Abantennarius analis of no interest unknown NA NA NA LC Tailjet Anglerfish
5402 1 Abantennarius coccineus 13.0 TL NA native present Abantennarius coccineus of no interest unknown -1 hunting macrofauna (predator) 0 LC Freckled Anglerfish
69666 1 Abantennarius drombus 12.0 SL NA endemic present Abantennarius drombus NA NA NA NA NA LC Hawaiian Freckled Frogfish
27298 1 Abantennarius duescus 3.0 SL NA native present Abantennarius duescus NA unknown NA NA NA LC Side-Jet Frogfish
972 1 Ablennes hians 140.0 TL 4800 native present Ablennes hians minor commercial high 0 hunting macrofauna (predator) -1 LC Flat Needlefish
6652 1 Abudefduf abdominalis 30.0 TL NA native present Abudefduf abdominalis NA unknown -1 variable -1 LC Green Damselfish
5689 1 Abudefduf sordidus 24.0 TL NA native present Abudefduf sordidus minor commercial unknown -1 variable 0 LC Yellow-banded Sergeant-major
6630 1 Abudefduf vaigiensis 20.0 TL NA native present Abudefduf vaigiensis subsistence fisheries unknown -1 grazing on aquatic plants 0 LC Five-banded Sergeant-major
89 1 Acanthocybium solandri 250.0 TL 83000 native present Acanthocybium solandri commercial very high 0 hunting macrofauna (predator) 0 LC Wahoo
4306 1 Acanthurus achilles 24.0 TL NA native present Acanthurus achilles commercial medium -1 grazing on aquatic plants 0 LC Achilles Tang
4750 1 Acanthurus blochii 45.0 SL NA native present Acanthurus blochii commercial medium -1 grazing on aquatic plants 0 LC Ringtail Surgeonfish
1256 1 Acanthurus dussumieri 54.0 TL NA native present Acanthurus dussumieri minor commercial medium -1 browsing on substrate 0 LC Eyestripe Surgeonfish
4736 1 Acanthurus guttatus 26.0 TL NA native present Acanthurus guttatus commercial medium -1 browsing on substrate 0 LC Whitespotted Surgeonfish
4737 1 Acanthurus leucopareius 25.0 TL NA native present Acanthurus leucopareius minor commercial medium -1 browsing on substrate 0 LC Whitebar Surgeonfish
1258 1 Acanthurus lineatus 38.0 TL NA native present Acanthurus lineatus commercial medium -1 browsing on substrate 0 LC Lined Surgeonfish
6011 1 Acanthurus nigricans 36.0 FL 620 native present Acanthurus nigricans subsistence fisheries medium -1 browsing on substrate 0 LC Golden Rim Surgeonfish
4739 1 Acanthurus nigrofuscus 21.0 TL NA native present Acanthurus nigrofuscus commercial medium -1 browsing on substrate 0 LC Brown Surgeonfish
4738 1 Acanthurus nigroris 25.0 TL NA native present Acanthurus nigroris NA medium -1 browsing on substrate 0 LC Bluelined Surgeonfish
4744 1 Acanthurus olivaceus 35.0 TL NA native present Acanthurus olivaceus commercial medium -1 grazing on aquatic plants 0 LC Orange band Surgeonfish
4734 1 Acanthurus thompsoni 27.0 TL NA native present Acanthurus thompsoni NA medium -1 selective plankton feeding 0 LC Thompson’s Surgeonfish

Conclusion

Now I have a data frame of extant fish species in Hawaii, whether or not they are threatened, and some of their associated characteristics. These data are ready for logistic regression or classification analyses. The data are ready to be exported and are much smaller than all of the existing data in FishBase and the Red List.

Sources

1 “IUCN,” IUCN Red List of Threatened Species. Version 2022-1, 2022. https://www.iucnredlist.org/ (accessed Dec. 02, 2022).

2 R. Froese and D. Pauly, “FishBase,” 2022. www.fishbase.org

3 B. J. Cardinale et al., “Biodiversity loss and its impact on humanity,” Nature, vol. 486, no. 7401, Art. no. 7401, Jun. 2012, doi: 10.1038/nature11148. [3] M. J. Munstermann et al., “A global ecological signal of extinction risk in terrestrial vertebrates,” Conserv. Biol., vol. 36, no. 3, p. e13852, 2022, doi: 10.1111/cobi.13852.