Cleaning Mammalian and Avian Species Names
Data collected on species can contain errors and may not be standardized. Standardizing species names is essential for many methods, especially table joins and merges.
To standardize the species names in our data, we can match them against a standard dataset such as the one provided by the IUCN.
http://www.iucnredlist.org/technical-documents/spatial-data
This vignette shows example code in which a species list is matched against IUCN species names using fuzzy matching, which finds incomplete or inexact matches.
The Python package fuzzywuzzy has a few functions that can help with this matching.
To install the package, use:
#pip install fuzzywuzzy
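To make the idea of a matching ratio concrete, the sketch below scores pairs of strings with Python's standard-library `difflib`. It is only a stand-in for fuzzywuzzy's scorers, so the exact numbers fuzzywuzzy reports may differ, and the helper name `similarity` is hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score, ignoring case and outer whitespace."""
    a, b = a.strip().lower(), b.strip().lower()
    return round(100 * SequenceMatcher(None, a, b).ratio())


# A one-letter typo still scores high; an unrelated name scores low.
print(similarity("Accipter nisus", "Accipiter nisus"))
print(similarity("Accipter nisus", "Alces alces"))
```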
Importing packages
pandas for dataframe management
numpy for NaN values
fuzzywuzzy for fuzzy matching
import numpy as np
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
"""Setting data paths"""
data_path = r'C:\Users\Falco\Desktop\directory\Link_Prediction\data'
Reading data into pandas dataframes
species_list: dataframe of species names. This is the data we want to standardize.
IUCN: reference database with standardized species names
species_list = pd.read_pickle(data_path + r'\Species_list.pkl')
IUCN = pd.read_csv(data_path + r'\IUCN Mammals, Birds, Reptiles, and Amphibians.csv')
IUCN["ScientificName"] = IUCN["Genus"].map(str) +' '+IUCN["Species"]
species_list.head()
Index | ScientificName | Source |
---|---|---|
0 | | 8 |
1 | Accipiter cooperii | 3 |
2 | Accipiter gentilis | 7 |
3 | Accipiter nisus | 1 |
4 | Accipiter striatus | 1 |
List of Correct Names
The fuzzywuzzy package finds the best-matching string in a list and returns it together with its matching ratio.
list_of_correct_names = IUCN['ScientificName'].tolist()
The following functions wrap the fuzzywuzzy package to extract the best-matching name and its matching ratio for each row of a pandas dataframe.
They return a value only if the matching ratio is at least 90%; otherwise they return NaN.
def fillVName(c):
    # Best-matching reference name, or NaN if the ratio is below 90
    name_to_check = c.ScientificName
    a = process.extract(name_to_check, list_of_correct_names, limit=1)
    if a[0][1] >= 90:
        return a[0][0]
    else:
        return np.nan

def fillVmatch(c):
    # Matching ratio of the best match, or NaN if it is below 90
    name_to_check = c.ScientificName
    a = process.extract(name_to_check, list_of_correct_names, limit=1)
    if a[0][1] >= 90:
        return a[0][1]
    else:
        return np.nan
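Both wrappers search the whole reference list, so each name is matched twice. A minimal sketch of a single-pass alternative that returns the name and the score together is given below. It uses the standard-library `difflib` as a stand-in scorer rather than fuzzywuzzy, so scores may differ from those in this vignette; `best_match` and the short `reference` list are hypothetical.

```python
from difflib import SequenceMatcher


def best_match(name, candidates, cutoff=90):
    """Return (best_candidate, score), or (None, None) if below the cutoff."""
    query = name.strip().lower()
    if not query:  # blank names cannot be matched
        return None, None
    scored = (
        (round(100 * SequenceMatcher(None, query, c.lower()).ratio()), c)
        for c in candidates
    )
    score, match = max(scored, default=(0, None))
    return (match, score) if score >= cutoff else (None, None)


reference = ["Accipiter nisus", "Alouatta belzebul", "Cuniculus paca"]
print(best_match("Accipter nisus", reference))  # close spelling: matched
print(best_match("Agouti paca", reference))     # renamed species: no match
```

Returning both values from one call halves the number of searches compared with running two separate wrappers.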
%%time
species_list['Matched_ScientificName'] = species_list.apply(fillVName, axis=1)
WARNING:root:Applied processor reduces input query to empty string, all comparisons will have score 0. [Query: '']
%%time
species_list['ratio'] = species_list.apply(fillVmatch, axis=1)
WARNING:root:Applied processor reduces input query to empty string, all comparisons will have score 0. [Query: '']
Wall time: 12min 25s
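The warning above appears to come from the blank name in row 0, and the long run time comes from scoring every row against the full reference list. One possible way to reduce both is to skip blank names and match each distinct name only once. The sketch below illustrates this with the standard-library `difflib.get_close_matches` as a stand-in for fuzzywuzzy; `match_unique` and the sample lists are hypothetical.

```python
from difflib import get_close_matches


def match_unique(names, reference, cutoff=0.9):
    """Map each distinct non-blank name to its closest reference name (or None)."""
    results = {}
    for name in names:
        key = name.strip()
        if not key or key in results:  # skip blanks and names already seen
            continue
        hits = get_close_matches(key, reference, n=1, cutoff=cutoff)
        results[key] = hits[0] if hits else None
    return results


raw = ["", "Accipter nisus", "Accipter nisus", "Alces alces"]
reference = ["Accipiter nisus", "Alces alces"]
print(match_unique(raw, reference))
```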
Matches with a ratio of 95% or higher are generally spelling mistakes.
For example, in row 6 the database entry was 'Accipter nisus', whereas the correct name is Accipiter nisus (97% match).
Unidentified species with "sp" in place of a species epithet are assigned NaN because their best match has a ratio below 90%.
Some species have a different name altogether because of taxonomic updates, for example a change of genus name.
In row 21, Agouti paca is now called Cuniculus paca.
I would manually check all entries with a matching ratio below 95% and correct them by hand.
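Names that changed through taxonomic revision cannot be recovered by string similarity, so one option is to resolve them with a small hand-maintained lookup before fuzzy matching. The sketch below assumes such a lookup; `KNOWN_SYNONYMS` and `resolve_synonym` are hypothetical names, and the two pairs are taken from the tables in this vignette.

```python
# Hypothetical lookup of outdated names mapped to their current names.
KNOWN_SYNONYMS = {
    "Agouti paca": "Cuniculus paca",
    "Uncia uncia": "Panthera uncia",
}


def resolve_synonym(name):
    """Return the current name for a known outdated name, else the name unchanged."""
    return KNOWN_SYNONYMS.get(name.strip(), name)


print(resolve_synonym("Agouti paca"))
print(resolve_synonym("Alces alces"))
```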
species_list.drop('Source', inplace=True, axis=1)
species_list.head(50)
Index | ScientificName | Matched_ScientificName | ratio |
---|---|---|---|
0 | | NaN | NaN |
1 | Accipiter cooperii | Accipiter cooperii | 100.0 |
2 | Accipiter gentilis | Accipiter gentilis | 100.0 |
3 | Accipiter nisus | Accipiter nisus | 100.0 |
4 | Accipiter striatus | Accipiter striatus | 100.0 |
5 | Accipitridae sp | NaN | NaN |
6 | Accipter nisus | Accipiter nisus | 97.0 |
7 | Acerodon jubatus | Acerodon jubatus | 100.0 |
8 | Acomys cahirinus | Acomys cahirinus | 100.0 |
9 | Acrocephalus palustris | Acrocephalus palustris | 100.0 |
10 | Acrocephalus schoenobaenus | Acrocephalus schoenobaenus | 100.0 |
11 | Acrocephalus scirpaceus | Acrocephalus scirpaceus | 100.0 |
12 | Actitis macularius | Actitis macularius | 100.0 |
13 | Aegolius funereus | Aegolius funereus | 100.0 |
14 | Aegypius monachus | Aegypius monachus | 100.0 |
15 | Aepyceros melampus | Aepyceros melampus | 100.0 |
16 | Aethomys kaiseri | Aethomys kaiseri | 100.0 |
17 | Aethomys namaquensis | Aethomys namaquensis | 100.0 |
18 | Agelaioides badius | Agelaioides badius | 100.0 |
19 | Agelaius phoeniceus | Agelaius phoeniceus | 100.0 |
20 | Agelaius tricolor | Agelaius tricolor | 100.0 |
21 | Agouti paca | NaN | NaN |
22 | Ailurus fulgens | Ailurus fulgens | 100.0 |
23 | Akodon mimus | Akodon mimus | 100.0 |
24 | Akodon montensis | Akodon montensis | 100.0 |
25 | Akodon simulator | Akodon simulator | 100.0 |
26 | Alcedo atthis | Alcedo atthis | 100.0 |
27 | Alcelaphus buselaphus | Alcelaphus buselaphus | 100.0 |
28 | Alces alces | Alces alces | 100.0 |
29 | Alectoris rufa | Alectoris rufa | 100.0 |
30 | Allactaga williamsi | Allactaga williamsi | 100.0 |
31 | Allenopithecus nigroviridis | Allenopithecus nigroviridis | 100.0 |
32 | Alouatta beizebul | Alouatta belzebul | 94.0 |
33 | Alouatta belzebul | Alouatta belzebul | 100.0 |
34 | Alouatta caraya | Alouatta caraya | 100.0 |
35 | Alouatta guariba | Alouatta guariba | 100.0 |
36 | Alouatta palliata | Alouatta palliata | 100.0 |
37 | Alouatta pigra | Alouatta pigra | 100.0 |
38 | Alouatta sara | Alouatta sara | 100.0 |
39 | Alouatta seniculus | NaN | NaN |
40 | Alouatta sp | NaN | NaN |
41 | Alticola argentatus | Alticola argentatus | 100.0 |
42 | Ammospermophilus nelsoni | Ammospermophilus nelsoni | 100.0 |
43 | Anas diazi | NaN | NaN |
44 | Anas platyrhynchos | Anas platyrhynchos | 100.0 |
45 | Anatidae sp | NaN | NaN |
46 | Anaxyrus fowleri | NaN | NaN |
47 | Andropadus virens | Andropadus virens | 100.0 |
48 | Anisognathus flavinucha | NaN | NaN |
49 | Anomalurus derbianus | Anomalurus derbianus | 100.0 |
Entries with a matching ratio below 100%
species_list['ratio'] = species_list['ratio'].fillna(0)
manual_edit = species_list[species_list.ratio <100]
manual_edit.head()
Index | ScientificName | Matched_ScientificName | ratio |
---|---|---|---|
0 | | NaN | 0.0 |
5 | Accipitridae sp | NaN | 0.0 |
6 | Accipter nisus | Accipiter nisus | 97.0 |
21 | Agouti paca | NaN | 0.0 |
32 | Alouatta beizebul | Alouatta belzebul | 94.0 |
'there are ' + str(manual_edit.shape[0]) + ' entries which do not have a perfect match'
'there are 219 entries which do not have a perfect match'
Removing entries with unidentified species names
a = manual_edit[~manual_edit.ScientificName.str.contains(".sp")]
a.head()
Index | ScientificName | Matched_ScientificName | ratio |
---|---|---|---|
0 | | NaN | 0.0 |
6 | Accipter nisus | Accipiter nisus | 97.0 |
21 | Agouti paca | NaN | 0.0 |
32 | Alouatta beizebul | Alouatta belzebul | 94.0 |
39 | Alouatta seniculus | NaN | 0.0 |
'there are ' + str(a.shape[0]) + ' entries which do not have a perfect match and are completely identified'
'there are 151 entries which do not have a perfect match and are completely identified'
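Because `str.contains` treats its pattern as a regular expression by default, ".sp" means "any character followed by sp" and can also match names that merely contain those letters. The sketch below compares that pattern with a stricter one that only matches a trailing "sp" or "spp" token. The `STRICT` pattern is a hypothetical alternative, and using it could change the counts reported above.

```python
import re

LOOSE = re.compile(".sp")               # pattern used above: any character, then "sp"
STRICT = re.compile(r"\s(sp|spp)\.?$")  # hypothetical: trailing "sp"/"spp" token only

for name in ["Accipitridae sp", "Alouatta sp", "Ammospermophilus nelsoni"]:
    # Columns: name, loose match?, strict match?
    print(name, bool(LOOSE.search(name)), bool(STRICT.search(name)))
```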
Most likely spelling mistakes
Entries with a matching ratio of 95% or higher
b = a[a.ratio >= 95]
b.head(50)
Index | ScientificName | Matched_ScientificName | ratio |
---|---|---|---|
6 | Accipter nisus | Accipiter nisus | 97.0 |
63 | Apodemus aregenteus | Apodemus argenteus | 97.0 |
131 | Calidris ruficollis minata | Calidris ruficollis | 95.0 |
173 | Cercopithecus diana | Cercopithecus diana | 97.0 |
193 | Cercoptheus nictitans | Cercopithecus nictitans | 95.0 |
258 | Crocidura olivieria | Crocidura olivieri | 97.0 |
266 | Ctenodactylus gundi | Ctenodactylus gundi | 97.0 |
287 | Dendrocygna biocolor | Dendrocygna bicolor | 97.0 |
360 | Gallus domesticus | Gallus gallus | 95.0 |
374 | Gerbillus unknown | Gerbillus gerbillus | 95.0 |
376 | Gis Glis | Glis glis | 95.0 |
391 | Hieaaetus pennatus | Hieraaetus pennatus | 97.0 |
393 | Hipposideros pomona | Hipposideros pomona | 97.0 |
434 | Larus novae-hollandia | Larus novaehollandiae | 95.0 |
451 | Lissonycteris angolensis | Lissonycteris angolensis | 98.0 |
455 | Lophocebus aterrimus | Lophocebus aterrimus | 98.0 |
473 | Macaca nemestrina leonina | Macaca nemestrina | 95.0 |
548 | Murina aurata | Murina aurata | 96.0 |
568 | Myotis blythi omari | Myotis myotis | 95.0 |
576 | Myotis ricketti | Myotis myotis | 95.0 |
597 | Nyctalus n. noctula | Nyctalus noctula | 95.0 |
636 | Pan troglodyte | Pan troglodytes | 97.0 |
639 | Panthera leo | Panthera leo | 96.0 |
640 | Panthera tigris | Panthera tigris | 97.0 |
648 | Papio doguera | Papio papio | 95.0 |
650 | Papio unknown | Papio papio | 95.0 |
658 | Pecari tajacu | Pecari tajacu | 96.0 |
701 | Pipistrellus pipistrellus | Pipistrellus pipistrellus | 98.0 |
705 | Pipstrellus kuhlii | Pipistrellus kuhlii | 97.0 |
784 | Rhinolophous affinis | Rhinolophus affinis | 97.0 |
836 | Sigmondon toltecus | Sigmodon toltecus | 97.0 |
872 | Sylvia attricapilla | Sylvia atricapilla | 97.0 |
881 | Tadardida brasiliensis | Tadarida brasiliensis | 98.0 |
911 | Tragelaphus anagasii | Tragelaphus angasii | 97.0 |
932 | Uncia uncia | Panthera uncia | 95.0 |
944 | Vicugna pacos | Vicugna vicugna | 95.0 |
'there are ' + str(b.shape[0]) + ' entries which are most likely spelling mistakes'
'there are 36 entries which are most likely spelling mistakes'
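Several rows in the table above that score exactly 95 pair names whose species epithets differ, for example 'Papio unknown' with 'Papio papio' and 'Gallus domesticus' with 'Gallus gallus', so a high ratio alone may not indicate a typo. One possible extra check is to compare the epithets on their own. The sketch below does this with the standard-library `difflib` as a stand-in scorer; `epithet_similarity` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def epithet_similarity(name, matched):
    """Score (0-100) how closely the species epithets of two binomials agree."""
    a_parts, b_parts = name.lower().split(), matched.lower().split()
    if len(a_parts) < 2 or len(b_parts) < 2:
        return 0  # not a full binomial, nothing to compare
    return round(100 * SequenceMatcher(None, a_parts[1], b_parts[1]).ratio())


pairs = [("Pan troglodyte", "Pan troglodytes"), ("Papio unknown", "Papio papio")]
for name, matched in pairs:
    print(name, "->", matched, epithet_similarity(name, matched))
```

A low epithet score on a row with a high overall ratio suggests the row needs manual review rather than automatic correction.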
Needs detailed investigation
Entries with a matching ratio below 95%
c = a[a.ratio < 95]
c.head()
Index | ScientificName | Matched_ScientificName | ratio |
---|---|---|---|
0 | | NaN | 0.0 |
21 | Agouti paca | NaN | 0.0 |
32 | Alouatta beizebul | Alouatta belzebul | 94.0 |
39 | Alouatta seniculus | NaN | 0.0 |
43 | Anas diazi | NaN | 0.0 |
'there are ' + str(c.shape[0]) + ' entries which need detailed investigation'
'there are 115 entries which need detailed investigation'
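The three groups used in this vignette (perfect matches, likely spelling mistakes, and entries needing investigation) can be summarized as a single labeling step. The sketch below is one way to express the thresholds; `triage` is a hypothetical helper, and the sample scores are taken from rows 1, 6, 21, and 32 of the tables above.

```python
from collections import Counter


def triage(score):
    """Bucket a matching ratio using the thresholds from this vignette."""
    if score == 100:
        return "exact"
    if score >= 95:
        return "likely spelling mistake"
    return "needs investigation"


sample_scores = [100.0, 97.0, 0.0, 94.0]
print(Counter(triage(s) for s in sample_scores))
```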
species_list.to_csv(data_path + r'\Corrected_species_names.csv')