Cleaning Mammalian and Avian Species Names

Species data often contain erroneous or non-standardized names. Standardizing species names is essential for many analyses, especially for table joins and merges, where names must match exactly.

To standardize species names, we can match the names in our data against a standard dataset, such as the one provided by the IUCN:
http://www.iucnredlist.org/technical-documents/spatial-data

This vignette shows example code in which a species list is matched against IUCN species names using fuzzy matching, that is, matching that tolerates incomplete or inexact strings.
The Python package fuzzywuzzy has a few functions that can help with this matching. To install the package, use
#pip install fuzzywuzzy

Importing packages

pandas for dataframe management
fuzzywuzzy for fuzzy matching

import numpy as np  # np.nan is returned by the wrapper functions below
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

"""Setting data paths"""
# Raw string, so that the backslashes are not read as escape sequences
data_path = r'C:\Users\Falco\Desktop\directory\Link_Prediction\data'

Reading data into pandas dataframes

species_list: dataframe of species names. This is the data that we want to standardize.
IUCN: reference database with standardized species names

species_list = pd.read_pickle(data_path + r'\Species_list.pkl')
IUCN = pd.read_csv(data_path + r'\IUCN Mammals, Birds, Reptiles, and Amphibians.csv')
IUCN["ScientificName"] = IUCN["Genus"].map(str) +' '+IUCN["Species"]

species_list.head()

	ScientificName	Source
0		8
1	Accipiter cooperii	3
2	Accipiter gentilis	7
3	Accipiter nisus	1
4	Accipiter striatus	1

List of Correct Names

The fuzzywuzzy package finds the best-matching string in a list and returns it together with its matching ratio (a score from 0 to 100).

list_of_correct_names = IUCN['ScientificName'].tolist()
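
To make the idea of a matching ratio concrete, here is a minimal sketch that uses the standard library's difflib as a stand-in for fuzzywuzzy, so that it runs without extra installs. The function name `best_match`, the three-name reference list, and the use of `SequenceMatcher` are assumptions for illustration; fuzzywuzzy's default scorer combines several ratios and can give different numbers.

```python
from difflib import SequenceMatcher


def best_match(name, choices):
    """Return (best_choice, score), where score is a 0-100 similarity."""
    def score(choice):
        # Case-insensitive character similarity, scaled to 0-100.
        return round(100 * SequenceMatcher(None, name.lower(), choice.lower()).ratio())

    best = max(choices, key=score)
    return best, score(best)


# Hypothetical miniature reference list, standing in for the IUCN names.
reference = ["Accipiter nisus", "Alouatta belzebul", "Cuniculus paca"]

print(best_match("Accipter nisus", reference))
print(best_match("Agouti paca", reference))
```

A misspelling scores close to 100 against the intended name, while a name whose genus has been renamed, such as Agouti paca, scores well below the 90 cutoff against every candidate.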

The following wrapper functions around the fuzzywuzzy package extract the best-matching name and its matching ratio for each row of a pandas dataframe.

Each function returns a value only if the matching ratio is at least 90; otherwise it returns NaN.

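def fillVName(c):
    """Return the best-matching IUCN name for a row, or NaN if its ratio is below 90."""
    name_to_check = c.ScientificName
    # process.extract returns a list of (match, score) tuples; limit=1 keeps only the best
    a = process.extract(name_to_check, list_of_correct_names, limit=1)
    if a[0][1] >= 90:
        return a[0][0]
    else:
        return np.nan

def fillVmatch(c):
    """Return the matching ratio of the best match for a row, or NaN if it is below 90."""
    name_to_check = c.ScientificName
    a = process.extract(name_to_check, list_of_correct_names, limit=1)
    if a[0][1] >= 90:
        return a[0][1]
    else:
        return np.nan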
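
The two wrappers above each run the full search, so every name is matched twice. The sketch below shows one possible way to get both the name and the ratio from a single search per distinct name, and to skip empty strings, which are what trigger the "empty string" warning. It uses difflib as a stand-in scorer so that it runs on its own; `similarity`, `match_all`, the default threshold, and the sample names are assumptions, not part of the original notebook.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """0-100 character similarity (a difflib stand-in for a fuzzywuzzy scorer)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def match_all(names, choices, threshold=90):
    """Map each distinct name to (matched_name, score), or (None, None)."""
    results = {}
    for name in set(names):
        if not name.strip():
            # Empty queries cannot match anything; skip the search entirely.
            results[name] = (None, None)
            continue
        best = max(choices, key=lambda choice: similarity(name, choice))
        score = similarity(name, best)
        results[name] = (best, score) if score >= threshold else (None, None)
    return results


# Hypothetical sample data for illustration only.
choices = ["Accipiter nisus", "Accipiter gentilis"]
names = ["", "Accipter nisus", "Accipiter nisus", "Accipitridae sp"]
matches = match_all(names, choices)
for name in names:
    print(repr(name), "->", matches[name])
```

With fuzzywuzzy installed, `process.extractOne(name, choices, score_cutoff=90)` could take the place of the inner search. The resulting dictionary can then be mapped onto the ScientificName column to fill both new columns in one pass.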

%%time
species_list['Matched_ScientificName'] = species_list.apply(fillVName, axis=1)

WARNING:root:Applied processor reduces input query to empty string, all comparisons will have score 0. [Query: '']

%%time
species_list['ratio'] = species_list.apply(fillVmatch, axis=1)

WARNING:root:Applied processor reduces input query to empty string, all comparisons will have score 0. [Query: '']

Wall time: 12min 25s

Matches with a ratio of 95% or higher are generally spelling mistakes. For example, in row 6 the spelling in the database was ‘Accipter nisus’, but the correct name is Accipiter nisus (97% match).

All unidentified species with "sp" in their binomial name are assigned NaN, because their best match has a ratio below 90%.

Some species have completely changed names due to taxonomic updates in the genus name. For example, in row 21, Agouti paca is now called Cuniculus paca. Such names are not matched and also receive NaN.

I would manually check all entries with a matching ratio below 95% and correct them by hand.
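
Manual corrections can be recorded in a small lookup table so that they stay reproducible. The sketch below is one hypothetical way to do that; the `manual_overrides` dictionary and the `resolve_name` helper are illustrative names, and the only entry shown is the Agouti paca example mentioned above.

```python
# Manually verified corrections: original spelling -> accepted name.
manual_overrides = {
    "Agouti paca": "Cuniculus paca",  # genus renamed (example from the text)
}


def resolve_name(original, matched, overrides=manual_overrides):
    """Prefer a manual override, then the fuzzy match (which may be None)."""
    if original in overrides:
        return overrides[original]
    return matched


print(resolve_name("Agouti paca", None))
print(resolve_name("Accipter nisus", "Accipiter nisus"))
```

In pandas, the same dictionary could be applied with `Series.map` on the original names, falling back to the matched column with `fillna`.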

species_list.drop('Source', inplace=True, axis=1)
species_list.head(50)

	ScientificName	Matched_ScientificName	ratio
0		NaN	NaN
1	Accipiter cooperii	Accipiter cooperii	100.0
2	Accipiter gentilis	Accipiter gentilis	100.0
3	Accipiter nisus	Accipiter nisus	100.0
4	Accipiter striatus	Accipiter striatus	100.0
5	Accipitridae sp	NaN	NaN
6	Accipter nisus	Accipiter nisus	97.0
7	Acerodon jubatus	Acerodon jubatus	100.0
8	Acomys cahirinus	Acomys cahirinus	100.0
9	Acrocephalus palustris	Acrocephalus palustris	100.0
10	Acrocephalus schoenobaenus	Acrocephalus schoenobaenus	100.0
11	Acrocephalus scirpaceus	Acrocephalus scirpaceus	100.0
12	Actitis macularius	Actitis macularius	100.0
13	Aegolius funereus	Aegolius funereus	100.0
14	Aegypius monachus	Aegypius monachus	100.0
15	Aepyceros melampus	Aepyceros melampus	100.0
16	Aethomys kaiseri	Aethomys kaiseri	100.0
17	Aethomys namaquensis	Aethomys namaquensis	100.0
18	Agelaioides badius	Agelaioides badius	100.0
19	Agelaius phoeniceus	Agelaius phoeniceus	100.0
20	Agelaius tricolor	Agelaius tricolor	100.0
21	Agouti paca	NaN	NaN
22	Ailurus fulgens	Ailurus fulgens	100.0
23	Akodon mimus	Akodon mimus	100.0
24	Akodon montensis	Akodon montensis	100.0
25	Akodon simulator	Akodon simulator	100.0
26	Alcedo atthis	Alcedo atthis	100.0
27	Alcelaphus buselaphus	Alcelaphus buselaphus	100.0
28	Alces alces	Alces alces	100.0
29	Alectoris rufa	Alectoris rufa	100.0
30	Allactaga williamsi	Allactaga williamsi	100.0
31	Allenopithecus nigroviridis	Allenopithecus nigroviridis	100.0
32	Alouatta beizebul	Alouatta belzebul	94.0
33	Alouatta belzebul	Alouatta belzebul	100.0
34	Alouatta caraya	Alouatta caraya	100.0
35	Alouatta guariba	Alouatta guariba	100.0
36	Alouatta palliata	Alouatta palliata	100.0
37	Alouatta pigra	Alouatta pigra	100.0
38	Alouatta sara	Alouatta sara	100.0
39	Alouatta seniculus	NaN	NaN
40	Alouatta sp	NaN	NaN
41	Alticola argentatus	Alticola argentatus	100.0
42	Ammospermophilus nelsoni	Ammospermophilus nelsoni	100.0
43	Anas diazi	NaN	NaN
44	Anas platyrhynchos	Anas platyrhynchos	100.0
45	Anatidae sp	NaN	NaN
46	Anaxyrus fowleri	NaN	NaN
47	Andropadus virens	Andropadus virens	100.0
48	Anisognathus flavinucha	NaN	NaN
49	Anomalurus derbianus	Anomalurus derbianus	100.0

Entries with a matching ratio below 100%

# Assign the result back instead of relying on an in-place fill of a column accessor
species_list['ratio'] = species_list['ratio'].fillna(0)
manual_edit = species_list[species_list.ratio <100]

manual_edit.head()

	ScientificName	Matched_ScientificName	ratio
0		NaN	0.0
5	Accipitridae sp	NaN	0.0
6	Accipter nisus	Accipiter nisus	97.0
21	Agouti paca	NaN	0.0
32	Alouatta beizebul	Alouatta belzebul	94.0

'there are '+ str(manual_edit.shape[0])+ ' entries which do not have a perfect match'

'there are 219 entries which do not have a perfect match'

Removing entries with unidentified species names

a = manual_edit[~manual_edit.ScientificName.str.contains(".sp")]
a.head()

	ScientificName	Matched_ScientificName	ratio
0		NaN	0.0
6	Accipter nisus	Accipiter nisus	97.0
21	Agouti paca	NaN	0.0
32	Alouatta beizebul	Alouatta belzebul	94.0
39	Alouatta seniculus	NaN	0.0

'there are '+ str(a.shape[0])+ ' entries which do not have a perfect match and are completely identified'

'there are 151 entries which do not have a perfect match and are completely identified'
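
Note that `str.contains` treats its pattern as a regular expression by default, so ".sp" means "any character followed by sp" and also matches names such as Ammospermophilus nelsoni. A stricter pattern, sketched here with the standard `re` module, only flags names whose last token is sp, sp. or spp. That convention is an assumption about how unidentified species are recorded in this data, and `is_unidentified` is a hypothetical helper.

```python
import re

# Assumed convention: unidentified records end in "sp", "sp." or "spp."
UNIDENTIFIED = re.compile(r"\bspp?\.?\s*$")


def is_unidentified(name):
    """True if the name ends with a standalone sp / sp. / spp. token."""
    return bool(UNIDENTIFIED.search(name))


for name in ["Alouatta sp", "Anatidae sp.", "Ammospermophilus nelsoni"]:
    loose = bool(re.search(".sp", name))  # the looser pattern discussed above
    print(name, "| loose:", loose, "| strict:", is_unidentified(name))
```

The same pattern string can be passed to `str.contains`. Doing so may change which rows are kept, so the counts shown in this vignette would need to be recomputed.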

Most likely spelling mistakes
Entries with a matching ratio of 95% or higher

b =a[a.ratio >=95]
b.head(50)

	ScientificName	Matched_ScientificName	ratio
6	Accipter nisus	Accipiter nisus	97.0
63	Apodemus aregenteus	Apodemus argenteus	97.0
131	Calidris ruficollis minata	Calidris ruficollis	95.0
173	Cercopithecus diana	Cercopithecus diana	97.0
193	Cercoptheus nictitans	Cercopithecus nictitans	95.0
258	Crocidura olivieria	Crocidura olivieri	97.0
266	Ctenodactylus gundi	Ctenodactylus gundi	97.0
287	Dendrocygna biocolor	Dendrocygna bicolor	97.0
360	Gallus domesticus	Gallus gallus	95.0
374	Gerbillus unknown	Gerbillus gerbillus	95.0
376	Gis Glis	Glis glis	95.0
391	Hieaaetus pennatus	Hieraaetus pennatus	97.0
393	Hipposideros pomona	Hipposideros pomona	97.0
434	Larus novae-hollandia	Larus novaehollandiae	95.0
451	Lissonycteris angolensis	Lissonycteris angolensis	98.0
455	Lophocebus aterrimus	Lophocebus aterrimus	98.0
473	Macaca nemestrina leonina	Macaca nemestrina	95.0
548	Murina aurata	Murina aurata	96.0
568	Myotis blythi omari	Myotis myotis	95.0
576	Myotis ricketti	Myotis myotis	95.0
597	Nyctalus n. noctula	Nyctalus noctula	95.0
636	Pan troglodyte	Pan troglodytes	97.0
639	Panthera leo	Panthera leo	96.0
640	Panthera tigris	Panthera tigris	97.0
648	Papio doguera	Papio papio	95.0
650	Papio unknown	Papio papio	95.0
658	Pecari tajacu	Pecari tajacu	96.0
701	Pipistrellus pipistrellus	Pipistrellus pipistrellus	98.0
705	Pipstrellus kuhlii	Pipistrellus kuhlii	97.0
784	Rhinolophous affinis	Rhinolophus affinis	97.0
836	Sigmondon toltecus	Sigmodon toltecus	97.0
872	Sylvia attricapilla	Sylvia atricapilla	97.0
881	Tadardida brasiliensis	Tadarida brasiliensis	98.0
911	Tragelaphus anagasii	Tragelaphus angasii	97.0
932	Uncia uncia	Panthera uncia	95.0
944	Vicugna pacos	Vicugna vicugna	95.0

'there are '+ str(b.shape[0])+ ' entries which are most likely spelling mistakes'

'there are 36 entries which are most likely spelling mistakes'
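
The table above also contains matches that are not spelling mistakes, for example Gallus domesticus matched to Gallus gallus and Papio unknown matched to Papio papio, both at 95%. One possible way to separate these, sketched below, is to compare the genus and the species epithet separately and to accept a match as a typo only if both parts are similar. The helper names, the 0.8 cutoff, and the use of difflib rather than fuzzywuzzy are assumptions for illustration.

```python
from difflib import SequenceMatcher


def part_similarity(a, b):
    """Character similarity between two words, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def likely_typo(original, matched, cutoff=0.8):
    """True if genus and epithet each resemble their matched counterpart."""
    orig_parts = original.split()
    match_parts = matched.split()
    if len(orig_parts) < 2 or len(match_parts) < 2:
        return False  # not a binomial; leave for manual review
    # Compare only the first two tokens (genus, species epithet).
    return all(
        part_similarity(o, m) >= cutoff
        for o, m in zip(orig_parts[:2], match_parts[:2])
    )


# Pairs taken from the table of matches with a ratio of 95 or higher.
pairs = [
    ("Accipter nisus", "Accipiter nisus"),
    ("Gallus domesticus", "Gallus gallus"),
    ("Papio unknown", "Papio papio"),
    ("Uncia uncia", "Panthera uncia"),
]
for original, matched in pairs:
    print(original, "->", matched, "| likely typo:", likely_typo(original, matched))
```

Pairs that fail this check are better treated like the entries that need detailed investigation, even though their overall ratio is high.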

Needs detailed investigation
Entries with a matching ratio below 95%

c =a[a.ratio <95]
c.head()

	ScientificName	Matched_ScientificName	ratio
0		NaN	0.0
21	Agouti paca	NaN	0.0
32	Alouatta beizebul	Alouatta belzebul	94.0
39	Alouatta seniculus	NaN	0.0
43	Anas diazi	NaN	0.0

'there are '+ str(c.shape[0])+ ' entries which need detailed investigation'

'there are 115 entries which need detailed investigation'

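# The path separator was missing, so the file was written next to the data folder rather than inside it
species_list.to_csv(data_path + r'\Corrected_species_names.csv')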

Matching_Species_names

Wrapper function around fuzzywuzzy python package to fuzzy match species names in database with standard names.
