Cleaning Mammalian and Avian Species Names

Species data collected from different sources often contain errors and are not standardized. Standardizing species names is essential for many analyses, especially for table joins and merges.

To standardize the species names in our data, we can match them against a reference dataset such as the one provided by the IUCN:
http://www.iucnredlist.org/technical-documents/spatial-data

This vignette shows example code in which a species list is matched against IUCN species names using fuzzy matching, that is, matching that tolerates incomplete or inexact strings.
The Python package fuzzywuzzy provides functions that help with this matching. To install the package, use
#pip install fuzzywuzzy
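
To make the idea of a match ratio concrete, here is a minimal sketch that scores two spellings of the same name. It uses Python's standard-library difflib as a stand-in, not fuzzywuzzy itself, so scores may differ from the ones fuzzywuzzy reports. The helper name similarity_percent is hypothetical.

```python
from difflib import SequenceMatcher


def similarity_percent(a, b):
    """Return a 0-100 character-level similarity score (case-insensitive)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


# A one-letter misspelling scores high; an unrelated name scores low.
print(similarity_percent("Accipter nisus", "Accipiter nisus"))
print(similarity_percent("Accipter nisus", "Alces alces"))
```

A threshold on this kind of score is what separates likely typos from names that need manual checking.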

Importing packages

pandas for dataframe management
numpy for missing values (np.nan)
fuzzywuzzy for fuzzy matching

import numpy as np
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
"""Setting data paths"""
# Raw string, so that backslashes in the Windows path are not read as escape sequences
data_path = r'C:\Users\Falco\Desktop\directory\Link_Prediction\data'
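
String concatenation makes it easy to drop or mistype a path separator. As an alternative sketch, pathlib can build the file paths. The directory below is a shortened, hypothetical stand-in for the real one, and PureWindowsPath is used only so that the example behaves the same on any operating system.

```python
from pathlib import PureWindowsPath

# Hypothetical directory, used for illustration only
data_dir = PureWindowsPath(r"C:\example\data")

# The / operator inserts the separator, so none can be forgotten
species_file = data_dir / "Species_list.pkl"
output_file = data_dir / "Corrected_species_names.csv"

print(species_file)
print(output_file)
```

In a real session, pathlib.Path would normally be used instead, and the resulting objects can be passed directly to the pandas read and write functions.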

Reading data into pandas dataframes

species_list: dataframe of species names. This is the data we want to standardize.
IUCN: reference database with standardized species names

species_list = pd.read_pickle(data_path + r'\Species_list.pkl')
IUCN = pd.read_csv(data_path + r'\IUCN Mammals, Birds, Reptiles, and Amphibians.csv')
IUCN["ScientificName"] = IUCN["Genus"].map(str) +' '+IUCN["Species"]
species_list.head()
ScientificName Source
0 8
1 Accipiter cooperii 3
2 Accipiter gentilis 7
3 Accipiter nisus 1
4 Accipiter striatus 1

List of Correct Names

The fuzzywuzzy package finds the best-matching string in a list of candidates and returns it along with the match ratio. The candidates here are the IUCN scientific names.

list_of_correct_names = IUCN['ScientificName'].tolist()

The following are wrapper functions around the fuzzywuzzy package that extract the best-matching name and the match ratio for each row of a pandas dataframe.

Each function returns a value only if the match ratio is at least 90%; otherwise it returns NaN.

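def fillVName(c):
    """Return the best-matching IUCN name for a row, or NaN if the ratio is below 90."""
    name_to_check = c.ScientificName
    # process.extract returns a list of (candidate, ratio) tuples; limit=1 keeps the best one
    a = process.extract(name_to_check, list_of_correct_names, limit=1)
    if a[0][1] >= 90:
        return a[0][0]
    else:
        return np.nan

def fillVmatch(c):
    """Return the ratio of the best match for a row, or NaN if it is below 90."""
    name_to_check = c.ScientificName
    a = process.extract(name_to_check, list_of_correct_names, limit=1)
    if a[0][1] >= 90:
        return a[0][1]
    else:
        return np.nan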
%%time
species_list['Matched_ScientificName'] = species_list.apply(fillVName, axis=1)
WARNING:root:Applied processor reduces input query to empty string, all comparisons will have score 0. [Query: '']
%%time
species_list['ratio'] = species_list.apply(fillVmatch, axis=1)
WARNING:root:Applied processor reduces input query to empty string, all comparisons will have score 0. [Query: '']


Wall time: 12min 25s
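
Because fillVName and fillVmatch each search the full reference list, the expensive search runs twice for every row. The sketch below shows one way to get both the name and the score from a single search. It is an illustration under stated assumptions: it uses the standard-library difflib as a stand-in for fuzzywuzzy, and the helper best_match is hypothetical. Empty names are skipped up front, which would also avoid the empty-query warning shown above.

```python
from difflib import SequenceMatcher


def best_match(name, choices, cutoff=90):
    """Return (best_name, score) from one pass over choices, or (None, None)."""
    if not name.strip():
        # Nothing to match: skip empty names instead of scoring them
        return None, None
    best_name, best_score = None, -1
    for candidate in choices:
        score = round(100 * SequenceMatcher(None, name.lower(), candidate.lower()).ratio())
        if score > best_score:
            best_name, best_score = candidate, score
    if best_score >= cutoff:
        return best_name, best_score
    return None, None


# Small illustrative reference list; the real one is list_of_correct_names
reference = ["Accipiter nisus", "Alouatta belzebul", "Cuniculus paca"]
for raw in ["Accipter nisus", "Alouatta beizebul", "Agouti paca", ""]:
    print(repr(raw), "->", best_match(raw, reference))
```

In the notebook, the returned pairs could be unpacked into the Matched_ScientificName and ratio columns from a single apply call, roughly halving the run time.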

Matches with a ratio of 95% or higher are generally spelling mistakes.
For example, in row 6 the spelling in the database was 'Accipter nisus', but the correct name is Accipiter nisus (97% match).

All unidentified species, those with 'sp' in place of a species epithet, are assigned NaN because their best match has a ratio below 90%.

Some species have completely changed names because of taxonomic updates to the genus name. For example, row 21, Agouti paca, is now called Cuniculus paca.
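
Renamed taxa cannot be recovered by string similarity, because the old and new names share little text. One option is a small, manually curated synonym table that is consulted before fuzzy matching. The sketch below assumes a hypothetical dictionary KNOWN_SYNONYMS and a hypothetical helper resolve_synonym; its single entry is the Agouti paca example.

```python
# Hypothetical, manually maintained map from outdated names to current ones
KNOWN_SYNONYMS = {
    "Agouti paca": "Cuniculus paca",
}


def resolve_synonym(name, synonyms=KNOWN_SYNONYMS):
    """Collapse stray whitespace, then replace a known outdated name."""
    cleaned = " ".join(name.split())
    return synonyms.get(cleaned, cleaned)


print(resolve_synonym("Agouti paca"))
print(resolve_synonym("Alces alces"))
```

Names that are not in the table pass through unchanged, so the fuzzy matching step can still handle them.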

I would manually check and correct every matched entry with a ratio below 95%.

species_list.drop('Source', inplace=True, axis=1)
species_list.head(50)
ScientificName Matched_ScientificName ratio
0 NaN NaN
1 Accipiter cooperii Accipiter cooperii 100.0
2 Accipiter gentilis Accipiter gentilis 100.0
3 Accipiter nisus Accipiter nisus 100.0
4 Accipiter striatus Accipiter striatus 100.0
5 Accipitridae sp NaN NaN
6 Accipter nisus Accipiter nisus 97.0
7 Acerodon jubatus Acerodon jubatus 100.0
8 Acomys cahirinus Acomys cahirinus 100.0
9 Acrocephalus palustris Acrocephalus palustris 100.0
10 Acrocephalus schoenobaenus Acrocephalus schoenobaenus 100.0
11 Acrocephalus scirpaceus Acrocephalus scirpaceus 100.0
12 Actitis macularius Actitis macularius 100.0
13 Aegolius funereus Aegolius funereus 100.0
14 Aegypius monachus Aegypius monachus 100.0
15 Aepyceros melampus Aepyceros melampus 100.0
16 Aethomys kaiseri Aethomys kaiseri 100.0
17 Aethomys namaquensis Aethomys namaquensis 100.0
18 Agelaioides badius Agelaioides badius 100.0
19 Agelaius phoeniceus Agelaius phoeniceus 100.0
20 Agelaius tricolor Agelaius tricolor 100.0
21 Agouti paca NaN NaN
22 Ailurus fulgens Ailurus fulgens 100.0
23 Akodon mimus Akodon mimus 100.0
24 Akodon montensis Akodon montensis 100.0
25 Akodon simulator Akodon simulator 100.0
26 Alcedo atthis Alcedo atthis 100.0
27 Alcelaphus buselaphus Alcelaphus buselaphus 100.0
28 Alces alces Alces alces 100.0
29 Alectoris rufa Alectoris rufa 100.0
30 Allactaga williamsi Allactaga williamsi 100.0
31 Allenopithecus nigroviridis Allenopithecus nigroviridis 100.0
32 Alouatta beizebul Alouatta belzebul 94.0
33 Alouatta belzebul Alouatta belzebul 100.0
34 Alouatta caraya Alouatta caraya 100.0
35 Alouatta guariba Alouatta guariba 100.0
36 Alouatta palliata Alouatta palliata 100.0
37 Alouatta pigra Alouatta pigra 100.0
38 Alouatta sara Alouatta sara 100.0
39 Alouatta seniculus NaN NaN
40 Alouatta sp NaN NaN
41 Alticola argentatus Alticola argentatus 100.0
42 Ammospermophilus nelsoni Ammospermophilus nelsoni 100.0
43 Anas diazi NaN NaN
44 Anas platyrhynchos Anas platyrhynchos 100.0
45 Anatidae sp NaN NaN
46 Anaxyrus fowleri NaN NaN
47 Andropadus virens Andropadus virens 100.0
48 Anisognathus flavinucha NaN NaN
49 Anomalurus derbianus Anomalurus derbianus 100.0

Entries with a matching ratio below 100%

# Assign the result back rather than filling in place on a column accessor
species_list['ratio'] = species_list['ratio'].fillna(0)
manual_edit = species_list[species_list.ratio <100]
manual_edit.head()
ScientificName Matched_ScientificName ratio
0 NaN 0.0
5 Accipitridae sp NaN 0.0
6 Accipter nisus Accipiter nisus 97.0
21 Agouti paca NaN 0.0
32 Alouatta beizebul Alouatta belzebul 94.0
'there are ' + str(manual_edit.shape[0]) + ' entries which do not have a perfect match'
'there are 219 entries which do not have a perfect match'

Removing entries with unidentified species names

a = manual_edit[~manual_edit.ScientificName.str.contains(".sp")]
a.head()
ScientificName Matched_ScientificName ratio
0 NaN 0.0
6 Accipter nisus Accipiter nisus 97.0
21 Agouti paca NaN 0.0
32 Alouatta beizebul Alouatta belzebul 94.0
39 Alouatta seniculus NaN 0.0
'there are ' + str(a.shape[0]) + ' entries which do not have a perfect match and are completely identified'
'there are 151 entries which do not have a perfect match and are completely identified'
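
Note that str.contains treats its pattern as a regular expression by default, so the dot in ".sp" matches any character and the filter also drops any name that merely contains the letters "sp" after another character. The sketch below compares that pattern with a stricter, hypothetical one that only matches a standalone trailing "sp" or "sp.". It uses the standard re module, and the names are taken from the table above.

```python
import re

LOOSE = re.compile(r".sp")         # pattern used in the filter above
STRICT = re.compile(r"\ssp\.?$")   # hypothetical alternative: trailing " sp" or " sp."

names = [
    "Accipitridae sp",
    "Alouatta sp",
    "Ammospermophilus nelsoni",
    "Agouti paca",
]

for name in names:
    loose = bool(LOOSE.search(name))
    strict = bool(STRICT.search(name))
    print(f"{name!r}: loose={loose}, strict={strict}")
```

With pandas, the stricter pattern could be passed to str.contains in the same way. Doing so may change the counts reported in this section, so they would need to be recomputed.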

Most likely spelling mistakes
Entries with a matching ratio of 95% or higher

b =a[a.ratio >=95]
b.head(50)
ScientificName Matched_ScientificName ratio
6 Accipter nisus Accipiter nisus 97.0
63 Apodemus aregenteus Apodemus argenteus 97.0
131 Calidris ruficollis minata Calidris ruficollis 95.0
173 Cercopithecus diana Cercopithecus diana 97.0
193 Cercoptheus nictitans Cercopithecus nictitans 95.0
258 Crocidura olivieria Crocidura olivieri 97.0
266 Ctenodactylus gundi Ctenodactylus gundi 97.0
287 Dendrocygna biocolor Dendrocygna bicolor 97.0
360 Gallus domesticus Gallus gallus 95.0
374 Gerbillus unknown Gerbillus gerbillus 95.0
376 Gis Glis Glis glis 95.0
391 Hieaaetus pennatus Hieraaetus pennatus 97.0
393 Hipposideros pomona Hipposideros pomona 97.0
434 Larus novae-hollandia Larus novaehollandiae 95.0
451 Lissonycteris angolensis Lissonycteris angolensis 98.0
455 Lophocebus aterrimus Lophocebus aterrimus 98.0
473 Macaca nemestrina leonina Macaca nemestrina 95.0
548 Murina aurata Murina aurata 96.0
568 Myotis blythi omari Myotis myotis 95.0
576 Myotis ricketti Myotis myotis 95.0
597 Nyctalus n. noctula Nyctalus noctula 95.0
636 Pan troglodyte Pan troglodytes 97.0
639 Panthera leo Panthera leo 96.0
640 Panthera tigris Panthera tigris 97.0
648 Papio doguera Papio papio 95.0
650 Papio unknown Papio papio 95.0
658 Pecari tajacu Pecari tajacu 96.0
701 Pipistrellus pipistrellus Pipistrellus pipistrellus 98.0
705 Pipstrellus kuhlii Pipistrellus kuhlii 97.0
784 Rhinolophous affinis Rhinolophus affinis 97.0
836 Sigmondon toltecus Sigmodon toltecus 97.0
872 Sylvia attricapilla Sylvia atricapilla 97.0
881 Tadardida brasiliensis Tadarida brasiliensis 98.0
911 Tragelaphus anagasii Tragelaphus angasii 97.0
932 Uncia uncia Panthera uncia 95.0
944 Vicugna pacos Vicugna vicugna 95.0
'there are ' + str(b.shape[0]) + ' entries which are most likely spelling mistakes'
'there are 36 entries which are most likely spelling mistakes'
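
Several rows above are not pure misspellings: 'Gallus domesticus' is paired with 'Gallus gallus' and 'Papio unknown' with 'Papio papio', each at 95. A likely reason is that process.extract uses fuzzywuzzy's weighted scorer by default, which also rewards partial and token-level overlap. As a rough cross-check, the sketch below rescores each pair with a plain character-level ratio from the standard-library difflib (a stand-in, not fuzzywuzzy's own scorer). The helper names and the threshold of 90 are assumptions.

```python
from difflib import SequenceMatcher


def plain_ratio(a, b):
    """Character-level similarity on a 0-100 scale, ignoring case."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def looks_like_typo(raw, matched, threshold=90):
    """True when the two full strings are nearly identical character by character."""
    return plain_ratio(raw, matched) >= threshold


pairs = [
    ("Accipter nisus", "Accipiter nisus"),
    ("Gallus domesticus", "Gallus gallus"),
    ("Papio unknown", "Papio papio"),
]

for raw, matched in pairs:
    print(raw, "->", matched, "| typo-like:", looks_like_typo(raw, matched))
```

Pairs that fail this check are better moved to the group that needs detailed investigation than corrected automatically.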

Needs detailed investigation
Entries with a matching ratio lower than 95%

c =a[a.ratio <95]
c.head()
ScientificName Matched_ScientificName ratio
0 NaN 0.0
21 Agouti paca NaN 0.0
32 Alouatta beizebul Alouatta belzebul 94.0
39 Alouatta seniculus NaN 0.0
43 Anas diazi NaN 0.0
'there are ' + str(c.shape[0]) + ' entries which need detailed investigation'
'there are 115 entries which need detailed investigation'
# Include the path separator so the file is written inside the data directory
species_list.to_csv(data_path + r'\Corrected_species_names.csv')