Species Mismatch Analysis

This notebook identifies species from the provided list that are not valid for Chicago, Illinois. It also generates statistics and visualizations for the mismatched species.

In [1]:
# Import Required Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from pytaxize import scicomm

Next, we load the SpeciesNet taxonomy release. This is the master list of labels that SpeciesNet uses for classification, and it includes the taxonomic hierarchy we use to infer a simple class name.

In [2]:
# Read the file without headers
columns = ['uuid', 'class', 'order', 'family', 'genus', 'species', 'common_name']
df_taxa = pd.read_csv("taxonomy_release.txt", 
                 delimiter=";", 
                 names=columns,
                 header=None)

# Display the first few rows to verify
display(df_taxa.head())
uuid class order family genus species common_name
0 00049ff0-2ffa-4d82-8cf3-c861fbbfa9d5 mammalia rodentia muridae rattus NaN rattus species
1 000e4049-11cd-4630-afd6-ea16a908d5ff mammalia cetartiodactyla bovidae gazella gazella mountain gazelle
2 000f61aa-c02a-46f4-b7a7-81fe76a9212f mammalia carnivora canidae lycaon NaN lycaon species
3 001795ae-1963-47f2-91cc-9dd627643a06 mammalia cetartiodactyla bovidae nesotragus NaN nesotragus species
4 00339477-70ab-42aa-9a4f-ed2cca9a028f aves musophagiformes musophagidae tauraco schuettii black-billed turaco

The taxonomic level of each row is visible once the dataset is sorted by the hierarchy columns: the deepest column that is not NaN is the row's level. In the example below, rana species < true frogs < frogs < amphibian,

where amphibia is the class, anura is the order, ranidae is the family, and rana is the genus.

The cane toad is resolved all the way to species: cane toad < marina < rhinella < bufonidae < anura < amphibia.

In [3]:
# Sort taxa by level 
df_sorted = df_taxa.sort_values(by=['class', 'order', 'family', 'genus', 'species'], ascending=True)
display(df_sorted.head())
uuid class order family genus species common_name
59 04eda76f-c0e7-4e9e-85c3-5b1542db2915 amphibia anura bufonidae rhinella marina cane toad
337 17b8145a-a164-4059-a68b-47b8b7438428 amphibia anura ranidae rana NaN rana species
1168 5351aa59-81ba-4c9f-b453-f7a2830ae892 amphibia anura ranidae NaN NaN true frogs
2063 96632fbc-d0d0-4880-9df7-f747f6f5ec11 amphibia anura NaN NaN NaN frogs
499 23a6f03b-b3d0-471b-a67d-88f10cb64e59 amphibia NaN NaN NaN NaN amphibian
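
The rule described above (the deepest populated column is the row's level) can be sketched in a few lines. This is a minimal illustration on a hand-built frame that copies the five rows shown above; the helper name `taxon_level` and the `taxon_level` column are assumptions for this example, not part of the SpeciesNet release.

```python
import pandas as pd

RANKS = ['class', 'order', 'family', 'genus', 'species']

def taxon_level(row):
    """Return the deepest rank that is populated, or 'none' if every rank is missing."""
    filled = [rank for rank in RANKS if pd.notna(row[rank])]
    return filled[-1] if filled else 'none'

# Toy copy of the sorted rows displayed above (uuid column omitted).
toy = pd.DataFrame(
    [
        ['amphibia', 'anura', 'bufonidae', 'rhinella', 'marina', 'cane toad'],
        ['amphibia', 'anura', 'ranidae', 'rana', None, 'rana species'],
        ['amphibia', 'anura', 'ranidae', None, None, 'true frogs'],
        ['amphibia', 'anura', None, None, None, 'frogs'],
        ['amphibia', None, None, None, None, 'amphibian'],
    ],
    columns=RANKS + ['common_name'],
)

toy['taxon_level'] = toy.apply(taxon_level, axis=1)
print(toy[['common_name', 'taxon_level']])
```

Labels with no hierarchy at all come out as 'none' under this rule.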

Next, we query the GBIF data on BigQuery for a unique list of species observed within the manually marked geofence:

WITH unique_species AS (
    SELECT DISTINCT
    class,
    `order`,
    family,
    genus,
    species,
    taxonkey

    FROM
    `bigquery-public-data.gbif.occurrences` 

    WHERE 
    ST_WITHIN(
        ST_GEOGPOINT(decimallongitude, decimallatitude),
        ST_GEOGFROMTEXT('POLYGON((
            -87.69081115722656 42.005312912238956, 
            -87.66952514648438 41.955818412264705, 
            -87.61596679687501 41.905774595463853, 
            -87.60910034179689 41.85779952612765, 
            -87.62626647949219 41.815801430687642, 
            -87.7196502685547 41.808127409160392, 
            -87.71690368652345 41.842908943268263, 
            -87.67982482910158 41.88533726561532, 
            -87.72377014160158 41.946119107705776, 
            -87.78625488281251 41.99051961904691, 
            -87.69081115722656 42.005312912238956
        ))')
    )
    AND LOWER(phylum) = "chordata" # This is the phylum that includes mammals and birds
    
    LIMIT 1000 
)

SELECT
  t1.*,
  ARRAY_AGG(DISTINCT countrycode IGNORE NULLS) AS country_codes

FROM unique_species t1
LEFT JOIN `bigquery-public-data.gbif.occurrences` 
USING(taxonkey)

GROUP BY 
  class,
  `order`,
  family,
  genus,
  species,
  taxonkey
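
To sanity-check the geofence locally, the containment test can be approximated in plain Python. The sketch below uses a standard ray-casting test on flat longitude/latitude coordinates, which is only an approximation of BigQuery's geography-based ST_WITHIN but should agree for points that are not close to an edge. The function name `point_in_polygon` and the test points are assumptions chosen for illustration.

```python
# Geofence vertices as (longitude, latitude), copied from the query above.
# The closing vertex is omitted because the loop below wraps around.
GEOFENCE = [
    (-87.69081115722656, 42.005312912238956),
    (-87.66952514648438, 41.955818412264705),
    (-87.61596679687501, 41.905774595463853),
    (-87.60910034179689, 41.85779952612765),
    (-87.62626647949219, 41.815801430687642),
    (-87.7196502685547, 41.808127409160392),
    (-87.71690368652345, 41.842908943268263),
    (-87.67982482910158, 41.88533726561532),
    (-87.72377014160158, 41.946119107705776),
    (-87.78625488281251, 41.99051961904691),
]

def point_in_polygon(lon, lat, polygon):
    """Planar ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            # Longitude at which this edge crosses the point's latitude.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > lon:
                inside = not inside
    return inside

print(point_in_polygon(-87.65, 41.85, GEOFENCE))  # point between the west and east edges
print(point_in_polygon(-87.50, 41.90, GEOFENCE))  # point east of the fence
```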

A little cleaning separates the scientific name (genus plus species epithet) from the species epithet itself, so that the columns line up with the SpeciesNet taxonomy:

In [4]:
# Try the gbq download
df = pd.read_csv("bq-withcountry.csv")

df_filtered = df.map(lambda x: x.lower() if isinstance(x, str) else x)

# extract actual speciesname
df_filtered.rename(columns={'species': 'scientific_name'}, inplace=True)

# create an actual species column
df_filtered['species'] = df_filtered['scientific_name'].str.split().str[-1]

df_gbq = df_filtered.copy()

display(df_filtered.head())
class order family genus scientific_name taxonkey country_codes species
0 testudines NaN emydidae trachemys trachemys scripta 7062200 [pl,hu,ar,es,ch,do,cz,at,tw,ae,dk,bm,sk,hn,ro,... scripta
1 aves passeriformes vireonidae vireo vireo gilvus 7191625 [ca,gt,us,sv,ni,mx,ec] gilvus
2 aves passeriformes parulidae protonotaria protonotaria citrea 2489871 [bs,tc,ec,ni,gf,bl,na,ai,jm,bz,zz,cr,sv,kn,hn,... citrea
3 aves pelecaniformes ardeidae bubulcus bubulcus ibis 4408439 [am,gy,bm,ma,aq,lr,hu,ky,au,ly,td,kn,uz,pa,sa,... ibis
4 aves strigiformes strigidae aegolius aegolius funereus 5739298 [by,hr,im,ir,md,gg,cn,it,al,mn,cz,us,me,tj,no,... funereus

Now we join the BigQuery results (df_gbq) with the SpeciesNet taxonomy hierarchy (df_taxa) on the taxonomic ranks. Matching on the hierarchy is more reliable than matching on common names, because common names vary between datasets.

In [5]:
# Merge dataframes on taxa hierarchy
merged_df = pd.merge(
    df_gbq,
    df_taxa[['class', 'order', 'family', 'genus', 'species', 'common_name']],
    on=['class', 'order', 'family', 'genus', 'species'],
    how='left'
)
merged_df.tail()
Out[5]:
class order family genus scientific_name taxonkey country_codes species common_name
920 aves piciformes picidae colaptes colaptes auratus 6177448 [zz,us,ca] auratus northern flicker
921 mammalia rodentia cricetidae microtus microtus pennsylvanicus 7194068 [us,ca] pennsylvanicus NaN
922 mammalia lagomorpha leporidae sylvilagus sylvilagus transitionalis 2436905 [us] transitionalis NaN
923 squamata NaN colubridae regina regina grahamii 5222697 [us] grahamii NaN
924 aves passeriformes passerellidae passerella passerella iliaca 5788850 [us,ca] iliaca red fox-sparrow

In this merged table, a missing common_name means the SpeciesNet taxonomy has no entry with the same hierarchy, so that specific animal will not be classified by SpeciesNet.

Because the table was built with a left join from the BigQuery results (species observed in Chicago), the NaN values tell us what fraction of locally observed species SpeciesNet can potentially classify.

In [6]:
# Display matches and non-matches
print("Rows with matching common names:")
display(merged_df[merged_df['common_name'].notna()].head())

print("\nRows without matching common names:")
display(merged_df[merged_df['common_name'].isna()].head())

# Count matches
total_matches = merged_df['common_name'].notna().sum()
print(f"\nTotal matches found: {total_matches}")
print(f"Total rows without matches: {len(merged_df) - total_matches}")
Rows with matching common names:
class order family genus scientific_name taxonkey country_codes species common_name
3 aves pelecaniformes ardeidae bubulcus bubulcus ibis 4408439 [am,gy,bm,ma,aq,lr,hu,ky,au,ly,td,kn,uz,pa,sa,... ibis cattle egret
5 aves galliformes phasianidae rollulus rollulus rouloul 2474113 [bn,zz,us,dk,de,br,th,ca,mm,la,my,id,in,ch,sg,... rouloul crested partridge
6 aves podicipediformes podicipedidae podilymbus podilymbus podiceps 7191591 [sr,zz,cu,gt,ni,mx,vi,hn,br,nl,ca,sv,bs,it,bo,... podiceps pied-billed grebe
8 aves columbiformes columbidae columba columba livia 7191490 [dk,gr,ma,jo,gb,es,mk,gt,jp,pt,zw,cl,eg,iq,bg,... livia rock dove
9 NaN NaN NaN NaN NaN 44 [nf,mu,pg,om,mv,ke,nl,sv,fi,gg,vn,je,tf,ie,mz,... NaN animal
Rows without matching common names:
class order family genus scientific_name taxonkey country_codes species common_name
0 testudines NaN emydidae trachemys trachemys scripta 7062200 [pl,hu,ar,es,ch,do,cz,at,tw,ae,dk,bm,sk,hn,ro,... scripta NaN
1 aves passeriformes vireonidae vireo vireo gilvus 7191625 [ca,gt,us,sv,ni,mx,ec] gilvus NaN
2 aves passeriformes parulidae protonotaria protonotaria citrea 2489871 [bs,tc,ec,ni,gf,bl,na,ai,jm,bz,zz,cr,sv,kn,hn,... citrea NaN
4 aves strigiformes strigidae aegolius aegolius funereus 5739298 [by,hr,im,ir,md,gg,cn,it,al,mn,cz,us,me,tj,no,... funereus NaN
7 aves columbiformes columbidae streptopelia streptopelia roseogrisea 2495688 [er,tc,gm,sa,nl,zz,qa,au,dz,hn,gd,et,sn,my,na,... roseogrisea NaN
Total matches found: 391
Total rows without matches: 534
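
One detail of the join is worth checking on a small example: pandas matches NaN keys to each other in a merge, unlike SQL. That is probably why the GBIF row with no taxonomy at all (index 9 above) picked up the label 'animal', and why 'vehicle' and 'blank' also appear among the matches. The sketch below uses toy stand-ins for df_gbq and df_taxa (the values are illustrative) and adds `indicator=True` so unmatched rows can be counted directly.

```python
import numpy as np
import pandas as pd

keys = ['class', 'order', 'family', 'genus', 'species']

# Toy stand-in for df_gbq: two species plus one row with no taxonomy at all.
gbq = pd.DataFrame({
    'class': ['aves', 'aves', np.nan],
    'order': ['columbiformes', 'passeriformes', np.nan],
    'family': ['columbidae', 'vireonidae', np.nan],
    'genus': ['columba', 'vireo', np.nan],
    'species': ['livia', 'gilvus', np.nan],
})

# Toy stand-in for df_taxa: one species and two labels without a hierarchy.
taxa = pd.DataFrame({
    'class': ['aves', np.nan, np.nan],
    'order': ['columbiformes', np.nan, np.nan],
    'family': ['columbidae', np.nan, np.nan],
    'genus': ['columba', np.nan, np.nan],
    'species': ['livia', np.nan, np.nan],
    'common_name': ['rock dove', 'animal', 'blank'],
})

# indicator=True adds a _merge column: 'both' or 'left_only' for a left join.
merged = gbq.merge(taxa, on=keys, how='left', indicator=True)
print(merged[['genus', 'common_name', '_merge']])
print(merged['_merge'].value_counts())
```

The all-NaN row on the left matches every all-NaN row on the right, so it is duplicated once per hierarchy-free label. Dropping rows whose keys are all missing before the merge would avoid this, if that behaviour is not wanted.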

The matched rows are the intersection of SpeciesNet's taxonomy release (every label SpeciesNet can predict) and the set of species found within the BigQuery geofence polygon.

In [7]:
# Save the matching dataset
df_matched = merged_df[merged_df['common_name'].notna()]

df_matched.to_csv('gbq_to_taxa_common_matched.csv')

# Store the matching common_names for comparison to the speciesnet output
chicago_species = df_matched['common_name'].tolist()

print(chicago_species)
['cattle egret', 'crested partridge', 'pied-billed grebe', 'rock dove', 'animal', 'vehicle', 'blank', 'yellow-crowned night-heron', 'great blue heron', 'busard saint-martin', 'lesser black-backed gull', 'northern mockingbird', 'belted kingfisher', 'golden eagle', 'blue-winged teal', 'american wigeon', 'blue grosbeak', 'eastern cottontail', 'cinnamon teal', "wilson's snipe", 'green heron', 'great blue heron', 'peromyscus species', 'long-billed dowitcher', 'brown thrasher', 'myrtle warbler', 'blue jay', 'sharp-tailed grouse', 'marsh wren', 'double-crested cormorant', 'mexican flying squirrel', 'american robin', 'coyote', 'red-headed woodpecker', 'black-capped chickadee', 'brown-headed cowbird', 'hairy woodpecker', "cassin's sparrow", 'spotted sandpiper', 'héron à dos vert', 'empidonax species', 'common grackle', 'moorhen', 'belted kingfisher', 'hairy woodpecker', 'red-winged blackbird', 'hairy woodpecker', 'american badger', 'chukar', 'larus species', 'short-eared owl', 'héron à dos vert', 'domestic horse', 'zosterops species', 'faucon pèlerin', 'house mouse', 'mute swan', 'spotted towhee', 'american beaver', 'gadwall', 'chipping sparrow', 'eurasian collared-dove', 'common starling', 'russet-backed thrush', 'california gull', 'wren', 'ring-necked pheasant', 'superb starling', 'european rabbit', 'american black duck', 'common yellowthroat', 'north american river otter', 'vesper sparrow', 'house wren', 'antrostomus species', 'swamp sparrow', 'eastern screech-owl', 'western meadowlark', 'northern bobwhite', 'pine warbler', 'song sparrow', 'russet-backed thrush', 'american crow', 'horned lark', 'sphyrapicus species', 'blue jay', 'muscovy duck', 'botaurus species', 'virginia rail', 'domestic cat', 'sciurus species', 'solitary sandpiper', 'red fox-sparrow', 'ovenbird', 'house rat', 'canada goose', 'goosander', 'microtus species', 'moorhen', 'great crested flycatcher', 'double-crested cormorant', 'spotted sandpiper', 'accipitridae family', 'short-eared owl', 'american white 
pelican', 'veery', 'yellow-headed blackbird', 'marsh wren', 'common ground-dove', 'eastern kingbird', 'palm warbler', 'woodchuck', 'northern goshawk', 'red-shouldered hawk', 'double-crested cormorant', 'colaptes species', 'eastern chipmunk', 'barred owl', 'dark-eyed junco', 'great horned owl', 'gray-cheeked thrush', 'brown thrasher', 'long-eared owl', 'gray-cheeked thrush', 'house sparrow', 'house finch', 'greater white-fronted goose', 'northern bobwhite', 'grey catbird', 'barn swallow', 'house sparrow', 'mourning dove', 'golden eagle', 'ring-necked duck', 'common yellowthroat', 'mallard', 'faucon pèlerin', 'mourning dove', 'song sparrow', 'american avocet', 'hirundinidae family', 'gray-cheeked thrush', 'black-and-white warbler', 'western tanager', 'american coot', 'virginia opossum', 'great white egret', 'red-shouldered hawk', 'western grebe', 'song sparrow', 'blue jay', 'pine warbler', 'mallard', 'eastern screech-owl', 'barred owl', 'eastern bluebird', 'common nighthawk', 'eastern fox squirrel', "steller's jay", 'killdeer', 'rough-legged buzzard', 'little blue heron', 'anatidae family', 'white-winged dove', 'moorhen', 'american kestrel', 'weasel species', 'red-bellied woodpecker', 'green heron', 'chipping sparrow', 'western kingbird', 'greater yellowlegs', 'snowy egret', 'wood duck', 'canada goose', 'muskrat', 'downy woodpecker', 'palm warbler', "wilson's warbler", 'song sparrow', 'ovenbird', 'cedar waxwing', 'loggerhead shrike', 'swamp sparrow', 'north american deermouse', 'lark sparrow', 'grasshopper sparrow', "swainson's hawk", 'red-headed woodpecker', 'white-breasted nuthatch', "cooper's hawk", 'eastern phoebe', "williamson's sapsucker", 'lark bunting', 'eastern gray squirrel', 'brown rat', 'black-bellied whistling-duck', 'upland sandpiper', 'domestic turkey', 'russet-backed thrush', 'moorhen', 'common grackle', 'grey wolf', 'pine warbler', 'dark-eyed junco', 'belted kingfisher', 'northern raccoon', 'american redstart', 'busard saint-martin', 'american 
bittern', 'bald eagle', 'wood thrush', 'rusty blackbird', "bewick's wren", 'northern cardinal', "wilson's warbler", 'song sparrow', 'northern raccoon', 'bald eagle', 'clay-colored sparrow', "brewer's blackbird", 'house finch', 'long-tailed weasel', 'rusty blackbird', 'red-breasted nuthatch', 'palm warbler', 'moorhen', 'aythya species', 'least bittern', 'grey wolf', 'northern flicker', 'domestic guineafowl', 'common barn owl', 'mallard', 'horned lark', 'willet', 'american wigeon', 'northern pintail', 'goosander', 'goéland argenté', 'red-winged blackbird', 'american crow', 'common starling', 'rallidae family', 'broad-winged hawk', 'green-winged teal', 'dark-eyed junco', 'black-headed grosbeak', 'broad-winged hawk', 'house finch', 'red fox', 'long-eared owl', 'palm warbler', 'northern flicker', "gambel's quail", 'white-breasted nuthatch', 'palm warbler', 'eastern gray squirrel', 'buff-bellied pipit', 'hairy woodpecker', 'grasshopper sparrow', 'eastern wood-pewee', "bewick's wren", "cassin's sparrow", 'domestic turkey', 'northern raccoon', 'marsh wren', 'bat', 'blue grosbeak', 'mallard', 'least bittern', 'red fox', 'black-crowned night-heron', 'ring-necked pheasant', 'cattle egret', 'killdeer', 'red-tailed hawk', 'rock dove', 'sylvilagus species', 'black-crowned night-heron', 'virginia rail', 'california quail', 'blue-winged teal', 'white-throated sparrow', 'myrtle warbler', 'geothlypis species', 'golden-crowned kinglet', 'busard saint-martin', 'setophaga species', 'hooded merganser', 'downy woodpecker', 'canada goose', 'hermit thrush', 'pine warbler', 'veery', 'american robin', 'hermit thrush', 'great horned owl', 'red fox-sparrow', 'veery', 'varied thrush', 'northern goshawk', 'american herring gull', 'vesper sparrow', 'loggerhead shrike', 'lark sparrow', 'song sparrow', 'red-winged blackbird', 'marsh wren', 'striped skunk', 'black-billed magpie', 'white-crowned sparrow', "townsend's solitaire", 'budgerigar', 'rock dove', 'red junglefowl', 'red-tailed hawk', 'bird', 
'rock dove', 'northern goshawk', 'rufous-sided towhee', 'black-capped chickadee', 'buff-bellied pipit', 'gadwall', 'buteo species', 'house mouse', 'dryobates species', 'great black-backed gull', 'bufflehead', 'junco species', 'striped skunk', 'ring-billed gull', 'house wren', 'green peafowl', 'brown-headed cowbird', 'american woodcock', 'white ibis', 'white-crowned sparrow', 'spotted towhee', 'ruby-crowned kinglet', 'ruby-crowned kinglet', 'red-winged blackbird', 'swamp sparrow', 'eastern fox squirrel', 'black-billed magpie', "wilson's warbler", 'house finch', 'eastern screech-owl', 'dark-eyed junco', 'eastern screech-owl', 'north american deermouse', 'ovenbird', 'american herring gull', 'rattus species', 'jabiru', 'common nighthawk', "wilson's warbler", 'barn swallow', 'american coot', 'belted kingfisher', 'great white egret', 'tufted titmouse', 'northern cardinal', 'black-crowned night-heron', 'marbled godwit', 'sora', 'goldfinch', 'solitary sandpiper', 'golden-crowned kinglet', 'passer species', 'ruby-crowned kinglet', 'myrtle warbler', 'greater white-fronted goose', 'myrtle warbler', 'pied-billed grebe', 'american kestrel', 'sage thrasher', 'virginia opossum', 'eastern bluebird', 'mallard', 'grey crowned-crane', 'coyote', "wilson's snipe", 'white-footed mouse', 'downy woodpecker', 'black-capped chickadee', 'rufous-sided towhee', 'ruffed grouse', 'ferruginous hawk', 'northern flicker', 'red fox-sparrow']

Now we can load the species data from our first runs using SpeciesNet.

In [8]:
# Load Species Data
species_data = pd.read_csv('../runs/full-no-geo/simple_class_counts.csv')
species_data.reset_index(drop=True, inplace=True)
species_data.head()
Out[8]:
Unnamed: 0 class count %
0 1 blank 89424 84.70
1 2 bird 4948 4.70
2 3 human 4196 4.00
3 4 western pond turtle 1621 1.50
4 5 mallard 579 0.55
In [9]:
# Compare Species Lists
species_list = species_data['class'].tolist()
mismatched_species = [species for species in species_list if species not in chicago_species]
print(f"Total mismatched species: {len(mismatched_species)}")
print(f"Total matched species: {len(species_list)-len(mismatched_species)}")
Total mismatched species: 38
Total matched species: 25
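
The same comparison can be written with a set, which removes the duplicates in chicago_species and makes each membership test constant-time. This is a small sketch on toy lists; the values are illustrative and the variable names `predicted` and `chicago` are assumptions for this example.

```python
# Toy stand-ins for species_list and chicago_species.
predicted = ['blank', 'bird', 'human', 'western pond turtle', 'mallard']
chicago = ['mallard', 'blank', 'bird', 'mallard', 'rock dove']  # note the duplicate

chicago_set = set(chicago)  # duplicates collapse here

# List comprehensions keep the original order of the predictions.
matched = [s for s in predicted if s in chicago_set]
mismatched = [s for s in predicted if s not in chicago_set]

print(f"Matched ({len(matched)}): {matched}")
print(f"Mismatched ({len(mismatched)}): {mismatched}")
```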

Matched species: labels predicted by SpeciesNet that are both in the SpeciesNet taxonomy release and in the geofenced Chicago list.

Mismatched species: labels predicted by SpeciesNet that are in the SpeciesNet taxonomy release but not in the geofenced Chicago list. We treat these as candidate false positives.

In [10]:
# Generate Statistics on Mismatched Species
total_species = len(species_list)
mismatched_percentage = (len(mismatched_species) / total_species) * 100
matched_percentage = ( (total_species - len(mismatched_species) ) / total_species) * 100
print(f"Percentage of mismatched species: {mismatched_percentage:.2f}%")
print(f"Percentage of matched species: {matched_percentage:.2f}%")
Percentage of mismatched species: 60.32%
Percentage of matched species: 39.68%

This means that about 40% of the distinct classes SpeciesNet predicted on our images (25 of 63) are valid Chicago species, and about 60% (38 of 63) are classes that do not appear in the Chicago list. These percentages count classes, not images.
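
To see how different an image-weighted view can be, the sketch below computes both shares on the five most frequent classes only (counts copied from the simple_class_counts.csv preview above). The set `chicago_species` here is a hand-picked subset for illustration, so the numbers are not the notebook's full result.

```python
import pandas as pd

# First five rows of species_data, as displayed above.
species_data = pd.DataFrame({
    'class': ['blank', 'bird', 'human', 'western pond turtle', 'mallard'],
    'count': [89424, 4948, 4196, 1621, 579],
})

# Illustrative subset of the Chicago list.
chicago_species = {'blank', 'bird', 'mallard'}

species_data['in_chicago'] = species_data['class'].isin(chicago_species)

# Share of distinct classes that are on the Chicago list.
share_by_class = species_data['in_chicago'].mean()

# Share of images whose predicted class is on the Chicago list.
matched_images = species_data.loc[species_data['in_chicago'], 'count'].sum()
share_by_image = matched_images / species_data['count'].sum()

print(f"Matched share by class: {share_by_class:.1%}")
print(f"Matched share by image: {share_by_image:.1%}")
```

Because a handful of classes dominate the image counts, the two shares can differ widely, so it helps to state which one is being reported.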

In [11]:
# List the matching vs mismatching species
matching_species = [species for species in species_list if species in chicago_species]
mismatched_species = [species for species in species_list if species not in chicago_species]

print(f"Matching species ({len(matching_species)}): {matching_species}")
print(f"Mismatched species ({len(mismatched_species)}): {mismatched_species}")
Matching species (25): ['blank', 'bird', 'mallard', 'american coot', 'northern raccoon', 'great blue heron', 'vehicle', 'eastern cottontail', 'brown rat', 'domestic cat', 'muskrat', 'wood duck', 'coyote', 'canada goose', 'american beaver', 'eastern gray squirrel', 'domestic horse', 'american robin', 'white-crowned sparrow', 'sylvilagus species', 'song sparrow', 'snowy egret', 'california quail', 'north american river otter', 'eastern chipmunk']
Mismatched species (38): ['human', 'western pond turtle', 'anseriformes order', 'domestic dog', 'reptile', 'domestic cattle', 'wild turkey', 'central american agouti', 'white-tailed deer', 'nutria', 'crocodile', 'wild boar', 'mammal', 'common tapeti', "tome's spiny rat", 'ocellated turkey', 'rodent', 'collared peccary', 'spotted paca', 'madagascar crested ibis', 'canis species', 'red squirrel', 'bushy-tailed woodrat', 'pronghorn', 'eastern red forest rat', 'domestic chicken', 'blood pheasant', 'white-lipped peccary', 'red acouchi', 'desert cottontail', 'plains zebra', 'rufescent tiger-heron', 'owl', 'common wombat', 'bearded pig', 'fossa', 'nine-banded armadillo', "guenther's dik-dik"]

Some of the mismatches are expected: the GBIF query has no matching hierarchy for humans or domestic dogs, or for coarse labels such as reptile.

We built the join on the taxonomic hierarchy (genus, family, order, and so on) to get the simple class names, and some SpeciesNet labels have no counterpart there because the SpeciesNet team added them to make the model's predictions more useful.

In [12]:
from itertools import zip_longest

review_df = pd.DataFrame(
    list(zip_longest(matching_species, mismatched_species, fillvalue=pd.NA)),
    columns=['matching', 'non_matching (speciesnet!gbif)']
)

print(review_df)
                      matching non_matching (speciesnet!gbif)
0                        blank                          human
1                         bird            western pond turtle
2                      mallard             anseriformes order
3                american coot                   domestic dog
4             northern raccoon                        reptile
5             great blue heron                domestic cattle
6                      vehicle                    wild turkey
7           eastern cottontail        central american agouti
8                    brown rat              white-tailed deer
9                 domestic cat                         nutria
10                     muskrat                      crocodile
11                   wood duck                      wild boar
12                      coyote                         mammal
13                canada goose                  common tapeti
14             american beaver               tome's spiny rat
15       eastern gray squirrel               ocellated turkey
16              domestic horse                         rodent
17              american robin               collared peccary
18       white-crowned sparrow                   spotted paca
19          sylvilagus species        madagascar crested ibis
20                song sparrow                  canis species
21                 snowy egret                   red squirrel
22            california quail           bushy-tailed woodrat
23  north american river otter                      pronghorn
24            eastern chipmunk         eastern red forest rat
25                        <NA>               domestic chicken
26                        <NA>                 blood pheasant
27                        <NA>           white-lipped peccary
28                        <NA>                    red acouchi
29                        <NA>              desert cottontail
30                        <NA>                   plains zebra
31                        <NA>          rufescent tiger-heron
32                        <NA>                            owl
33                        <NA>                  common wombat
34                        <NA>                    bearded pig
35                        <NA>                          fossa
36                        <NA>          nine-banded armadillo
37                        <NA>             guenther's dik-dik
In [13]:
# Some mismatched labels are expected (humans, pets, coarse labels); exclude them from the count
to_remove = ['human', 'domestic dog', 'mammal', 'rodent', 'reptile', 'anseriformes order', 'canis species']  # example items to remove, edit as needed
mismatched_species = [s for s in mismatched_species if s not in to_remove]


# Filter species_data for mismatched species
mismatched_stats = species_data[species_data['class'].isin(mismatched_species)][['class', 'count', '%']]

print(mismatched_stats['class'].to_list())
display(mismatched_stats)
['western pond turtle', 'domestic cattle', 'wild turkey', 'central american agouti', 'white-tailed deer', 'nutria', 'crocodile', 'wild boar', 'common tapeti', "tome's spiny rat", 'ocellated turkey', 'collared peccary', 'spotted paca', 'madagascar crested ibis', 'red squirrel', 'bushy-tailed woodrat', 'pronghorn', 'eastern red forest rat', 'domestic chicken', 'blood pheasant', 'white-lipped peccary', 'red acouchi', 'desert cottontail', 'plains zebra', 'rufescent tiger-heron', 'owl', 'common wombat', 'bearded pig', 'fossa', 'nine-banded armadillo', "guenther's dik-dik"]
class count %
3 western pond turtle 1621 1.500
14 domestic cattle 41 0.039
15 wild turkey 41 0.039
16 central american agouti 29 0.027
20 white-tailed deer 20 0.019
22 nutria 17 0.016
23 crocodile 16 0.015
26 wild boar 11 0.010
28 common tapeti 7 0.007
30 tome's spiny rat 6 0.006
31 ocellated turkey 5 0.005
34 collared peccary 3 0.003
36 spotted paca 3 0.003
38 madagascar crested ibis 2 0.002
42 red squirrel 1 0.001
43 bushy-tailed woodrat 1 0.001
44 pronghorn 1 0.001
45 eastern red forest rat 1 0.001
47 domestic chicken 1 0.001
49 blood pheasant 1 0.001
50 white-lipped peccary 1 0.001
51 red acouchi 1 0.001
52 desert cottontail 1 0.001
53 plains zebra 1 0.001
54 rufescent tiger-heron 1 0.001
55 owl 1 0.001
56 common wombat 1 0.001
58 bearded pig 1 0.001
60 fossa 1 0.001
61 nine-banded armadillo 1 0.001
62 guenther's dik-dik 1 0.001
In [34]:
# For each species that should not have been found in Chicago, look up the two-letter country codes where it has been recorded.

df_missmatched = df_taxa[df_taxa['common_name'].isin(mismatched_species)].sort_values(by=['class', 'order', 'family', 'genus', 'species'])

df_missmatched["scientific_name"] = df_missmatched['genus'] + " " + df_missmatched['species']

print(f'total missmatched and in taxa list: {len(df_missmatched)}')
display(df_missmatched.head())
total missmatched and in taxa list: 30
uuid class order family genus species common_name scientific_name
2806 cae9534f-f302-4229-9e11-b91138333d92 aves galliformes phasianidae gallus gallus domesticus domestic chicken gallus gallus domesticus
1574 710eb57f-77d9-44ca-a1e1-4077f868c0fa aves galliformes phasianidae ithaginis cruentus blood pheasant ithaginis cruentus
992 466b25f0-a916-432c-823e-394a69391328 aves galliformes phasianidae meleagris ocellata ocellated turkey meleagris ocellata
250 1110460b-7f99-405b-a9b0-65a09ecccca1 aves pelecaniformes ardeidae tigrisoma lineatum rufescent tiger-heron tigrisoma lineatum
2544 b84b365a-b420-4164-85a1-124afc96f1d7 aves pelecaniformes threskiornithidae lophotibis cristata madagascar crested ibis lophotibis cristata
In [35]:
from pygbif import species as species_gbif


def get_gbif_key(name):
    """Get GBIF usageKey (taxonKey) for a single scientific name."""
    if not isinstance(name, str) or not name.strip():
        return None
    try:
        result = species_gbif.name_backbone(name=name)
        key = result.get("usageKey")
        # usageKey is absent when GBIF finds no match; return None instead of raising
        return int(key) if key is not None else None
    except Exception as e:
        print(f"Error with name '{name}': {e}")
        return None
    
# Apply function
df_missmatched["taxanomic"] = df_missmatched.apply(lambda row: get_gbif_key(row["scientific_name"]), axis=1)
In [36]:
# Display the result
df_missmatched["taxanomic"] = pd.to_numeric(df_missmatched["taxanomic"], errors='coerce').astype('Int64')


display(df_missmatched.head())
uuid class order family genus species common_name scientific_name taxanomic
2806 cae9534f-f302-4229-9e11-b91138333d92 aves galliformes phasianidae gallus gallus domesticus domestic chicken gallus gallus domesticus 9457155
1574 710eb57f-77d9-44ca-a1e1-4077f868c0fa aves galliformes phasianidae ithaginis cruentus blood pheasant ithaginis cruentus 2474060
992 466b25f0-a916-432c-823e-394a69391328 aves galliformes phasianidae meleagris ocellata ocellated turkey meleagris ocellata 5227755
250 1110460b-7f99-405b-a9b0-65a09ecccca1 aves pelecaniformes ardeidae tigrisoma lineatum rufescent tiger-heron tigrisoma lineatum 2480869
2544 b84b365a-b420-4164-85a1-124afc96f1d7 aves pelecaniformes threskiornithidae lophotibis cristata madagascar crested ibis lophotibis cristata 2480814
In [38]:
# get country codes
from pygbif import occurrences as occ

def get_countries_and_us_states(taxa_key):
    countries = []
    us_states = []
    try:
        # convert to int if possible, else raise error or skip
        taxa_key_int = int(taxa_key)
        cc = occ.search(taxonKey=taxa_key_int, facet='country', facetLimit=200)
        countries = [item['name'] for item in cc.get('facets', [])[0].get('counts', [])]

        if 'US' in countries:
            sc = occ.search(taxonKey=taxa_key_int, country='US', facet='stateProvince', facetLimit=200)
            us_states = [item['name'] for item in sc.get('facets', [])[0].get('counts', [])]

    except Exception as e:
        print(f"Error with id '{taxa_key}': {e}")
    return {"countries": countries, "us_states": us_states}

# Apply function
df_missmatched[['countries', 'us_states']] = df_missmatched['taxanomic'].apply(get_countries_and_us_states).apply(pd.Series)
Error with id 'nan': cannot convert float NaN to integer
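
The error above comes from a row whose taxon key is missing. One way to avoid it is to skip missing keys before any request is made. The sketch below shows the guard in isolation: `safe_lookup` is a hypothetical wrapper, and `fake_lookup` is a stand-in for the real GBIF request so the example runs offline.

```python
import pandas as pd

def safe_lookup(taxon_key, lookup):
    """Call lookup(int_key) only when the key is present; otherwise return empty lists."""
    if pd.isna(taxon_key):
        return {"countries": [], "us_states": []}
    return lookup(int(taxon_key))

def fake_lookup(key):
    """Hypothetical stand-in for the GBIF facet request; returns fixed values."""
    return {"countries": ["MG"], "us_states": []}

# A nullable integer column like df_missmatched['taxanomic'], with one missing key.
keys = pd.Series([2480814, pd.NA], dtype="Int64")

results = [safe_lookup(k, fake_lookup) for k in keys]
print(results)
```

With this guard, rows without a key get empty lists silently, and the except branch is left for genuine request failures.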
In [55]:
# Finally see the completed df
display(df_missmatched.head())

df_missmatched.to_csv('missmatched_with_locations.csv')
uuid class order family genus species common_name scientific_name taxanomic countries us_states
2806 cae9534f-f302-4229-9e11-b91138333d92 aves galliformes phasianidae gallus gallus domesticus domestic chicken gallus gallus domesticus 9457155 [CO, US, NL, BE, SG, PR, NZ, FR, TH, AU, DE, K... [hawaii, florida, california, texas, north car...
1574 710eb57f-77d9-44ca-a1e1-4077f868c0fa aves galliformes phasianidae ithaginis cruentus blood pheasant ithaginis cruentus 2474060 [CN, BT, IN, NP, MM, ZZ, DE, AT, TH, US] []
992 466b25f0-a916-432c-823e-394a69391328 aves galliformes phasianidae meleagris ocellata ocellated turkey meleagris ocellata 5227755 [BZ, GT, MX, US, ZZ, DE, HN, CH, CO, DK, NL] [california, florida, hawaii, oklahoma]
250 1110460b-7f99-405b-a9b0-65a09ecccca1 aves pelecaniformes ardeidae tigrisoma lineatum rufescent tiger-heron tigrisoma lineatum 2480869 [BR, AR, CO, PA, EC, PE, VE, BO, PY, UY, CR, G... [hawaii]
2544 b84b365a-b420-4164-85a1-124afc96f1d7 aves pelecaniformes threskiornithidae lophotibis cristata madagascar crested ibis lophotibis cristata 2480814 [MG, BE, CH, DE, NG, ZZ] []
In [47]:
import plotly.express as px

df_exploded = df_missmatched.explode('countries').rename(columns={'countries': 'country_code'})

df_grouped = df_exploded.groupby('country_code')['common_name'].apply(lambda x: '<br>'.join(x.unique())).reset_index()
# Add a presence column for coloring (all 1s)
df_grouped['presence'] = 1

display(df_grouped.head())
country_code common_name presence
0 AD domestic cattle<br>wild boar 1
1 AE domestic chicken<br>domestic cattle 1
2 AF domestic cattle<br>wild boar 1
3 AG domestic chicken 1
4 AI domestic chicken 1
In [52]:
import pycountry

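def alpha2_to_alpha3(alpha2):
    """Convert an ISO 3166-1 alpha-2 code to alpha-3; return None for unknown codes such as 'ZZ'."""
    try:
        return pycountry.countries.get(alpha_2=alpha2).alpha_3
    except Exception:
        # Unknown codes have no pycountry entry, so the attribute access fails
        return None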

df_grouped['country_code_3'] = df_grouped['country_code'].apply(alpha2_to_alpha3)

display(df_grouped.head())

# Drop rows where conversion failed (None)
df_grouped = df_grouped.dropna(subset=['country_code_3'])
fig = px.choropleth(
    df_grouped,
    locations='country_code_3',
    color='presence',             # just presence
    hover_name='country_code',
    hover_data={'common_name': True, 'presence': False},
    color_continuous_scale='Viridis',
    scope='world',
    labels={'common_name': 'Species Found'},
    title='Species Occurrence by Country'
)

fig.update_traces(
    hovertemplate='<b>%{location}</b><br>Species:<br>%{customdata[0]}<extra></extra>'
)

fig.show()
country_code common_name presence country_code_3
0 AD domestic cattle<br>wild boar 1 AND
1 AE domestic chicken<br>domestic cattle 1 ARE
2 AF domestic cattle<br>wild boar 1 AFG
3 AG domestic chicken 1 ATG
4 AI domestic chicken 1 AIA
In [53]:
import re

# Mapping full state names to their 2-letter abbreviations
us_states = {
    'Alabama': 'AL', 'Alaska': 'AK', 'Arizona': 'AZ', 'Arkansas': 'AR',
    'California': 'CA', 'Colorado': 'CO', 'Connecticut': 'CT', 'Delaware': 'DE',
    'Florida': 'FL', 'Georgia': 'GA', 'Hawaii': 'HI', 'Idaho': 'ID',
    'Illinois': 'IL', 'Indiana': 'IN', 'Iowa': 'IA', 'Kansas': 'KS',
    'Kentucky': 'KY', 'Louisiana': 'LA', 'Maine': 'ME', 'Maryland': 'MD',
    'Massachusetts': 'MA', 'Michigan': 'MI', 'Minnesota': 'MN', 'Mississippi': 'MS',
    'Missouri': 'MO', 'Montana': 'MT', 'Nebraska': 'NE', 'Nevada': 'NV',
    'New Hampshire': 'NH', 'New Jersey': 'NJ', 'New Mexico': 'NM', 'New York': 'NY',
    'North Carolina': 'NC', 'North Dakota': 'ND', 'Ohio': 'OH', 'Oklahoma': 'OK',
    'Oregon': 'OR', 'Pennsylvania': 'PA', 'Rhode Island': 'RI', 'South Carolina': 'SC',
    'South Dakota': 'SD', 'Tennessee': 'TN', 'Texas': 'TX', 'Utah': 'UT',
    'Vermont': 'VT', 'Virginia': 'VA', 'Washington': 'WA', 'West Virginia': 'WV',
    'Wisconsin': 'WI', 'Wyoming': 'WY', 'District of Columbia': 'DC'
}

# Set of valid 2-letter abbreviations, used to validate codes found in the input
abbrev_set = set(us_states.values())

def normalize_state(state_str):
    """Normalize various US state formats to 2-letter code."""
    if not isinstance(state_str, str):
        return None

    state_str = state_str.strip()

    # Case 1: Matches format like "Texas (TX)"
    m = re.search(r'\((\w{2})\)', state_str)
    if m:
        code = m.group(1).upper()
        if code in abbrev_set:
            return code

    # Case 2: Is already a 2-letter abbreviation?
    if len(state_str) == 2 and state_str.upper() in abbrev_set:
        return state_str.upper()

    # Case 3: Full state name (case insensitive)
    # Compare in lowercase; capitalizing each word would miss "District of Columbia"
    lowered = ' '.join(state_str.split()).lower()
    for name, code in us_states.items():
        if name.lower() == lowered:
            return code

    # No match found
    return None
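
A quick sanity check of the three input formats handled above can be sketched as follows. The snippet is self-contained, so it uses a shortened, hypothetical lookup table (`STATE_CODES`) and a condensed helper (`to_state_code`) instead of the notebook's full `us_states` dictionary; neither name is part of the original notebook, and the sample inputs are made up for illustration.

```python
import re

# Hypothetical, shortened lookup table for illustration only
STATE_CODES = {"texas": "TX", "illinois": "IL", "district of columbia": "DC"}
VALID_CODES = set(STATE_CODES.values())

def to_state_code(value):
    """Condensed sketch of the normalization logic: 2-letter code or None."""
    if not isinstance(value, str):
        return None
    text = " ".join(value.split())  # trim and collapse whitespace
    # Format like "Texas (TX)"
    match = re.search(r"\(([A-Za-z]{2})\)", text)
    if match and match.group(1).upper() in VALID_CODES:
        return match.group(1).upper()
    # Already a 2-letter abbreviation
    if text.upper() in VALID_CODES:
        return text.upper()
    # Full state name, compared in lowercase
    return STATE_CODES.get(text.lower())

for raw in ["Texas (TX)", "il", "district of columbia", "Atlantis", None]:
    print(repr(raw), "->", to_state_code(raw))
```

Comparing names in lowercase avoids depending on how each word of a multi-word state name happens to be capitalized in the source data.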
In [57]:
# Assumes the 'us_states' column holds a list of strings per row, e.g. ['Texas (TX)', 'CA', 'Florida']

# Explode list into rows
df_exploded = df_missmatched.explode('us_states')

# Normalize states
df_exploded['state_code'] = df_exploded['us_states'].apply(normalize_state)

# Drop invalid or None
df_exploded = df_exploded.dropna(subset=['state_code'])

# Now you can group and plot with these cleaned 2-letter codes
df_exploded.head()
Out[57]:
uuid class order family genus species common_name scientific_name taxanomic countries us_states state_code
2806 cae9534f-f302-4229-9e11-b91138333d92 aves galliformes phasianidae gallus gallus domesticus domestic chicken gallus gallus domesticus 9457155 [CO, US, NL, BE, SG, PR, NZ, FR, TH, AU, DE, K... hawaii HI
2806 cae9534f-f302-4229-9e11-b91138333d92 aves galliformes phasianidae gallus gallus domesticus domestic chicken gallus gallus domesticus 9457155 [CO, US, NL, BE, SG, PR, NZ, FR, TH, AU, DE, K... florida FL
2806 cae9534f-f302-4229-9e11-b91138333d92 aves galliformes phasianidae gallus gallus domesticus domestic chicken gallus gallus domesticus 9457155 [CO, US, NL, BE, SG, PR, NZ, FR, TH, AU, DE, K... california CA
2806 cae9534f-f302-4229-9e11-b91138333d92 aves galliformes phasianidae gallus gallus domesticus domestic chicken gallus gallus domesticus 9457155 [CO, US, NL, BE, SG, PR, NZ, FR, TH, AU, DE, K... texas TX
2806 cae9534f-f302-4229-9e11-b91138333d92 aves galliformes phasianidae gallus gallus domesticus domestic chicken gallus gallus domesticus 9457155 [CO, US, NL, BE, SG, PR, NZ, FR, TH, AU, DE, K... north carolina NC
In [58]:
# Group by state to aggregate species names
df_grouped = df_exploded.groupby('state_code')['common_name'].apply(lambda x: '<br>'.join(x.unique())).reset_index()

# Add presence column (all 1s)
df_grouped['presence'] = 1

# Plot US choropleth
fig = px.choropleth(
    df_grouped,
    locations='state_code',
    locationmode='USA-states',
    color='presence',
    hover_name='state_code',
    hover_data={'common_name': True, 'presence': False},
    scope='usa',
    color_continuous_scale='Viridis',
    labels={'common_name': 'Species Found'},
    title='Species Occurrence by US State'
)

fig.update_traces(
    hovertemplate='<b>%{location}</b><br>Species:<br>%{customdata[0]}<extra></extra>'
)

fig.show()

Interestingly, even after all of that, narrowing the location to the state level doesn't exclude everything: some species on our mismatched list do in fact have records in Illinois.

However, we can now subset the list to the species that definitely have no records in Illinois (IL).

In [81]:
# Group by common_name and get list of states
df_definite = df_exploded.groupby('common_name')['state_code'].agg(list).reset_index()

# Filter rows where 'IL' is NOT in the state_code list
df_definite = df_definite[~df_definite['state_code'].apply(lambda states: "IL" in states)]

# Save and display
df_definite.to_csv('subset_definitely_not_in_illinois.csv', index=False)
display(df_definite)
common_name state_code
0 bushy-tailed woodrat [CO, CA, WY, OR, UT, WA, NM, MT, SD, NV, ID, N...
1 central american agouti [TX]
3 common wombat [NM, WA]
4 crocodile [FL]
5 desert cottontail [CA, AZ, CO, NM, TX, NV, UT, WY, MT, NE, SD, K...
8 fossa [NE, TX]
9 guenther's dik-dik [TX]
12 ocellated turkey [CA, FL, HI, OK]
13 plains zebra [CA, NM, TX, WA, AK, CO, MA, OH, OR, UT]
14 pronghorn [WY, CO, NM, UT, AZ, SD, MT, OR, ID, TX, NV, C...
15 red acouchi [NE, NY, WA]
17 rufescent tiger-heron [HI]
18 spotted paca [CO, CA, RI, UT, WA]
19 western pond turtle [CA, OR, NV]
20 white-lipped peccary [TX, CO, TX]

This demonstrates the framework we would need for geofencing, and for checking detections against a geofence, in the future.

We could also use this type of API search to generate a valid state-level geofence list for SpeciesNet if desired.
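
As a rough illustration of what such a state-level geofence check might look like, the sketch below builds a species-to-states mapping and queries it. It is a minimal example under stated assumptions, not SpeciesNet's actual geofencing mechanism: `df_pairs` is a small made-up stand-in for a frame shaped like `df_exploded`, and `geofence` and `is_plausible` are hypothetical names introduced here.

```python
import pandas as pd

# Hypothetical stand-in for df_exploded: one row per (species, state) record
df_pairs = pd.DataFrame({
    "common_name": ["pronghorn", "pronghorn", "wild boar", "wild boar", "crocodile"],
    "state_code": ["WY", "CO", "TX", "IL", "FL"],
})

# Build the geofence: species -> set of state codes with at least one record.
# Using a set removes duplicate state entries for the same species.
geofence = {
    name: set(codes)
    for name, codes in df_pairs.groupby("common_name")["state_code"]
}

def is_plausible(common_name, state_code, fence=geofence):
    """Return True if the species has a record in the state, False if it
    only has records elsewhere, and None if the species is not in the fence."""
    states = fence.get(common_name)
    if states is None:
        return None
    return state_code in states

for species in ["pronghorn", "wild boar", "unicorn"]:
    print(species, "in IL:", is_plausible(species, "IL"))
```

Returning `None` for unknown species keeps "no data" separate from "recorded only in other states", so a missing record is not mistaken for evidence of absence.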