This repository has been archived by the owner on Sep 4, 2024. It is now read-only.

Commit

Data cleaning for occurrences
cjgrady committed May 24, 2022
1 parent 3a92489 commit 79bff3d
Showing 4 changed files with 218,694 additions and 196 deletions.
269 changes: 93 additions & 176 deletions _sphinx_config/tutorial/data_cleaning.rst
@@ -2,8 +2,11 @@
Data Cleaning
=============

Cleaning Occurrences
====================

Introduction
============
------------
One of the first steps in creating
:term:`species distribution models<Species Distribution Model>`, let alone
multi-species analyses, is acquiring and preparing specimen occurrence records. There
@@ -13,178 +16,92 @@ dataset, which involves converting records to a common format, grouping, and cle
The lmpy library provides tools for performing these aggregation and cleaning steps to
greatly simplify the process for the user.


Reading a CSV File
==================
Read a CSV file that has fields "decimalLongitude" and "decimalLatitude" for x and y
and "speciesName" for the species binomial.

See: `PointCsvReader <../autoapi/lmpy/point/index.html#lmpy.point.PointCsvReader>`_

>>> reader = PointCsvReader(csv_filename, 'speciesName', 'decimalLongitude', 'decimalLatitude')


Reading a Darwin Core Archive File
==================================
Read a :term:`Darwin Core Archive<DWCA>` file. The file is assumed to be valid and
metadata will be pulled from the 'meta.xml' file contained within the zip file.

See: `PointDwcaReader <../autoapi/lmpy/point/index.html#lmpy.point.PointDwcaReader>`_

>>> reader = PointDwcaReader(dwca_filename)


Filtering Records
=================

Built-in Filters
----------------
Filter a list of :term:`Point` objects so that those with fewer than four (4) decimal
places of precision are removed.

>>> points = [Point('Species A', 10.3, 23.1),
... Point('Species B', 11.34123, 12.2314),
... Point('Species C', 13.23131, 18.3123)]
>>> precision_filter = get_decimal_precision_filter(4)
>>> flt_points = precision_filter(points)
>>> print(flt_points)
[Point('Species B', 11.34123, 12.2314), Point('Species C', 13.23131, 18.3123)]
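
As a rough illustration of what a decimal precision check involves, the sketch below
counts decimal places with the standard library ``decimal`` module. It works on plain
``(species_name, x, y)`` tuples rather than lmpy ``Point`` objects, and the helper names
``count_decimal_places`` and ``has_min_precision`` are hypothetical. It is not lmpy's
implementation, and it assumes that both coordinates must meet the threshold.

```python
from decimal import Decimal


def count_decimal_places(value):
    """Count the digits after the decimal point in the number's string form."""
    exponent = Decimal(str(value)).as_tuple().exponent
    return max(0, -exponent)


def has_min_precision(record, decimal_places):
    """Return True if both coordinates have at least `decimal_places` digits."""
    _, x, y = record
    return min(count_decimal_places(x), count_decimal_places(y)) >= decimal_places


# Hypothetical records mirroring the example above (assumed tuple layout).
records = [
    ("Species A", 10.3, 23.1),
    ("Species B", 11.34123, 12.2314),
    ("Species C", 13.23131, 18.3123),
]
kept = [rec for rec in records if has_min_precision(rec, 4)]
```

With these values, only the "Species B" and "Species C" records remain in ``kept``,
which matches the filtered output shown above.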


Custom Filters
--------------
Filter a list of :term:`points<Point>` so that those without a species epithet are
removed.

>>> def genus_filter_func(point):
...     return len(point.species_name.split(' ')) > 1
>>> genus_filter = get_occurrence_filter(genus_filter_func)
>>> points = [Point('Species A', 1, 2),
... Point('Genus', 3, 4), Point('Genus', 9, 3), Point('Species B', 2, 1)]
>>> flt_points = genus_filter(points)
>>> print(flt_points)
[Point('Species A', 1, 2), Point('Species B', 2, 1)]


Modifying Records
=================

Built-in Modifiers
------------------
Use the accepted name modifier with a file, ACCEPTED_TAXA_FILENAME, containing accepted
name mappings.

>>> accepted_name_modifier = get_accepted_name_modifier(ACCEPTED_TAXA_FILENAME)
>>> points = [Point('Accepted species', 1, 2),
... Point('Synonym species', 5, 3), Point('Another synonym', 4, 4)]
>>> mod_points = accepted_name_modifier(points)
>>> print(mod_points)
[Point('Accepted species', 1, 2), Point('Accepted species', 5, 3),
Point('Accepted species', 4, 4)]
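
Conceptually, an accepted name modifier is a lookup from each recorded name to its
accepted name. The sketch below shows that idea with a plain dictionary and
``(species_name, x, y)`` tuples. The mapping contents and the ``resolve_names`` helper
are hypothetical, and leaving unknown names unchanged is an assumption of this sketch;
lmpy may handle unmapped names differently.

```python
# Hypothetical synonym-to-accepted-name mapping; in the example above this
# information would come from the accepted taxa file.
accepted_names = {
    "Accepted species": "Accepted species",
    "Synonym species": "Accepted species",
    "Another synonym": "Accepted species",
}


def resolve_names(records, name_map):
    """Replace each record's name with its accepted name, if one is known."""
    # Assumption: names missing from the mapping are passed through unchanged.
    return [(name_map.get(name, name), x, y) for name, x, y in records]


records = [
    ("Accepted species", 1, 2),
    ("Synonym species", 5, 3),
    ("Another synonym", 4, 4),
]
resolved = resolve_names(records, accepted_names)
```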


Putting It All Together
=======================

Aggregate and Clean Multiple Data Files
---------------------------------------
For this example, we will process occurrence data from three sources, a Darwin Core
Archive, a JSON file, and a CSV file. The Darwin Core Archive file is at
DWCA_FILENAME, the JSON data is at JSON_FILENAME and the records are under the 'items'
key with 'scientificName' as the species key and 'lon' and 'lat' as the x and y keys
under the 'geopoint' key. For the CSV file, CSV_FILENAME, the fields are 'taxonName',
'decimalLongitude', and 'decimalLatitude' for the species, x, and y fields
respectively.

First, define a modifier function that ensures points from each source share a common
format containing only the species name, x, and y. We can do this by keeping only the
attributes 'species_name', 'x', and 'y'. We will also send the points through an
accepted name modifier, with a mapping file at ACCEPTED_NAMES_FILENAME, to ensure that
the species name for each point is an accepted name. To apply both modifiers in
sequence, we define a 'get_chained_modifier' function that returns a single chained
modifier.

>>> import json
>>> accepted_name_modifier = get_accepted_name_modifier(ACCEPTED_NAMES_FILENAME)
>>> def common_format_modifier(points):
...     return [Point(pt.species_name, pt.x, pt.y) for pt in points]
>>> def get_chained_modifier(*modifiers):
...     def chained_modifier(points):
...         for modifier in modifiers:
...             points = modifier(points)
...         return points
...     return chained_modifier
>>> chained_modifier = get_chained_modifier(
...     accepted_name_modifier,
...     common_format_modifier
... )
>>> all_points = []
>>> # Process the Darwin Core Archive
>>> with PointDwcaReader(DWCA_FILENAME) as dwca_reader:
...     for points in dwca_reader:
...         all_points.extend(chained_modifier(points))
>>> # Process the JSON file
>>> with open(JSON_FILENAME) as in_file:
...     json_point_data = json.load(in_file)
>>> raw_json_points = []
>>> for item in json_point_data['items']:
...     raw_json_points.append(
...         Point(
...             item['scientificName'],
...             item['geopoint']['lon'],
...             item['geopoint']['lat']
...         )
...     )
>>> # Convert the JSON points to the common format for consistency
>>> all_points.extend(chained_modifier(raw_json_points))
>>> # Process the CSV file
>>> with PointCsvReader(
...     CSV_FILENAME,
...     'taxonName',
...     'decimalLongitude',
...     'decimalLatitude'
... ) as csv_reader:
...     for points in csv_reader:
...         all_points.extend(chained_modifier(points))

In this example, we assume that the number of points is small enough to sort in memory
at once. For large datasets, it may be necessary to split the data first before
attempting to sort. We sort the points and write them to a temporary file so that,
when we read them back, each group contains all of the points for a single species.

>>> import tempfile
>>> # Sort points and write to a temporary file
>>> temp_filename = tempfile.NamedTemporaryFile(suffix='.csv', delete=True).name
>>> with PointCsvWriter(temp_filename, ['species_name', 'x', 'y']) as csv_writer:
...     csv_writer.write_points(sorted(all_points))
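
The reason for sorting first is that grouping by species only works on adjacent
records. The sketch below shows the same idea with the standard library
``itertools.groupby`` on plain ``(species_name, x, y)`` tuples. The record values are
hypothetical and this is not how the lmpy readers are implemented; it only illustrates
why sorted input yields one complete group per species.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical records as (species_name, x, y) tuples, deliberately unsorted.
records = [
    ("Species B", 2.0, 1.0),
    ("Species A", 1.0, 2.0),
    ("Species B", 4.0, 4.0),
]

# groupby only merges adjacent items, so sort by species name first.
records.sort(key=itemgetter(0))
groups = {
    name: list(group) for name, group in groupby(records, key=itemgetter(0))
}
```

Without the sort, the two "Species B" records would end up in separate groups.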

Now we have an aggregated CSV file containing all of the specimen records from each of
the three sources that is grouped and sorted by species name. Next, we will filter
the specimen records so that we only keep those with at least four decimal places of
precision, only unique localities, and only keep species with at least 12 points.
Write the cleaned data points to OUTPUT_POINTS_FILENAME.

>>> # Set up filters (except for duplicate localities)
>>> chain_filters = [
...     get_decimal_precision_filter(4),
...     get_minimum_points_filter(12),
... ]
>>> with PointCsvWriter(
...     OUTPUT_POINTS_FILENAME,
...     ['species_name', 'x', 'y']
... ) as csv_writer:
...     with PointCsvReader(temp_filename, 'species_name', 'x', 'y') as csv_reader:
...         for points in csv_reader:
...             dup_filter = get_unique_localities_filter()
...             points = dup_filter(points)
...             for flt in chain_filters:
...                 if points:  # Stop trying to filter if there are no points
...                     points = flt(points)
...             dup_filter = None  # Reset to preserve memory
...             if points:  # If any points remain, write them
...                 csv_writer.write_points(points)
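
The order of the filters matters: duplicate localities are removed before the minimum
point count is checked, so a species only passes if it has enough distinct locations.
The sketch below illustrates this with plain ``(species_name, x, y)`` tuples. The
``unique_localities`` and ``minimum_points`` helpers are hypothetical stand-ins, not the
lmpy filters, and treating two points as duplicates only when x and y match exactly is
an assumption.

```python
def unique_localities(records):
    """Keep the first record seen at each (x, y) location."""
    seen = set()
    unique = []
    for name, x, y in records:
        if (x, y) not in seen:
            seen.add((x, y))
            unique.append((name, x, y))
    return unique


def minimum_points(records, min_count):
    """Return the group unchanged if it is large enough, else an empty list."""
    return records if len(records) >= min_count else []


# One species group with a duplicated locality (hypothetical values).
group = [
    ("Species A", 1.0, 2.0),
    ("Species A", 1.0, 2.0),
    ("Species A", 3.0, 4.0),
]
deduplicated = unique_localities(group)
cleaned = minimum_points(deduplicated, 3)
```

Here the group has three records but only two distinct localities, so it is dropped
when a minimum of three points is required.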

That's it! We have processed data from three sources, ensured that all records
have an accepted taxon name, filtered out records with low coordinate decimal
precision, and kept only taxa with a minimum number of unique localities, so that all
of the resulting data can be used for computing species distribution models.

Occurrence Data Wrangler Configuration
--------------------------------------
You can either use the
`Data Wrangler Factory <../autoapi/lmpy/data_wrangling/factory/index.html#lmpy.data_wrangling.factory.WranglerFactory>`_
or instantiate occurrence data wrangler classes directly. This example uses the
factory with the configuration below.

.. code-block:: json

    [
        {
            "wrangler_type": "DecimalPrecisionFilter",
            "decimal_places": 4
        },
        {
            "wrangler_type": "BoundingBoxFilter",
            "min_x": 0.0,
            "min_y": -90.0,
            "max_x": 180.0,
            "max_y": 0.0
        },
        {
            "wrangler_type": "UniqueLocalitiesFilter"
        }
    ]

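
When the configuration is saved to a file it needs to be valid JSON, which has no
comments and no ``dict(...)`` calls. One way to produce it is to build the list in
Python and serialize it with the standard library ``json`` module, as sketched below.
The keys and values are copied from the configuration above; the variable names are
only illustrative.

```python
import json

# The same three wranglers as above, expressed as plain dictionaries.
wrangler_config = [
    {"wrangler_type": "DecimalPrecisionFilter", "decimal_places": 4},
    {
        "wrangler_type": "BoundingBoxFilter",
        "min_x": 0.0,
        "min_y": -90.0,
        "max_x": 180.0,
        "max_y": 0.0,
    },
    {"wrangler_type": "UniqueLocalitiesFilter"},
]

# Serialize to JSON text; writing this text to a file gives a usable config file.
config_text = json.dumps(wrangler_config, indent=4)
```

Writing ``config_text`` to ``./occurrence_wrangler_config.json`` would give the file
used in the console script example.
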
Example - Console Script
------------------------
For this example, we will use the raw occurrence data found in the sample data
directory in lmpy at ``occurrence/Crocodylus porosus.csv``. The example wrangler
configuration should first be written to ``./occurrence_wrangler_config.json``. The
cleaned data will be written to ``./clean_data.csv``.

.. code-block:: bash

    $ wrangle_occurrences "./lmpy/sample_data/occurrence/Crocodylus porosus.csv" \
        ./clean_data.csv \
        ./occurrence_wrangler_config.json

Example - Python
----------------
For this example, we will use the raw occurrence data found in the sample data
directory in lmpy at ``occurrence/Crocodylus porosus.csv`` and the example wrangler
configuration. The cleaned data will be written to ``./clean_data.csv``.

.. code-block:: python

    from lmpy.data_wrangling.factory import WranglerFactory
    from lmpy.point import PointCsvReader, PointCsvWriter

    raw_occurrences_filename = './lmpy/sample_data/occurrence/Crocodylus porosus.csv'
    clean_occurrences_filename = './clean_data.csv'
    wrangler_config = [
        # Decimal precision
        dict(
            wrangler_type='DecimalPrecisionFilter',
            decimal_places=4
        ),
        # Bounding box
        dict(
            wrangler_type='BoundingBoxFilter',
            min_x=0.0,
            min_y=-90.0,
            max_x=180.0,
            max_y=0.0
        ),
        # Unique localities
        dict(wrangler_type='UniqueLocalitiesFilter')
    ]
    factory = WranglerFactory()
    wranglers = factory.get_wranglers(wrangler_config)
    with PointCsvReader(
        raw_occurrences_filename,
        'species_name',
        'x',
        'y'
    ) as reader:
        with PointCsvWriter(
            clean_occurrences_filename, ['species_name', 'x', 'y']
        ) as writer:
            for points in reader:
                for wrangler in wranglers:
                    points = wrangler.wrangle_points(points)
                if len(points) > 0:
                    writer.write_points(points)
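
The inner loop above passes the output of each wrangler to the next one. The sketch
below shows that pattern in isolation, without lmpy, using plain
``(species_name, x, y)`` tuples. ``BoundingBoxSketch`` and ``apply_wranglers`` are
hypothetical stand-ins that only mimic the ``wrangle_points`` method used above; the
inclusive bounds and the coordinate values are assumptions.

```python
class BoundingBoxSketch:
    """Hypothetical stand-in for a bounding box wrangler (not the lmpy class)."""

    def __init__(self, min_x, min_y, max_x, max_y):
        self.bounds = (min_x, min_y, max_x, max_y)

    def wrangle_points(self, points):
        """Keep only the (species_name, x, y) tuples inside the box."""
        min_x, min_y, max_x, max_y = self.bounds
        return [
            pt for pt in points
            if min_x <= pt[1] <= max_x and min_y <= pt[2] <= max_y
        ]


def apply_wranglers(points, wranglers):
    """Feed the output of each wrangler into the next one."""
    for wrangler in wranglers:
        points = wrangler.wrangle_points(points)
    return points


# Hypothetical coordinates: one inside and one outside the example box.
points = [
    ("Crocodylus porosus", 130.84, -12.46),
    ("Crocodylus porosus", -80.19, 25.76),
]
wranglers = [BoundingBoxSketch(0.0, -90.0, 180.0, 0.0)]
cleaned = apply_wranglers(points, wranglers)
```

Only the first point falls inside the box from the example configuration, so it is
the only one left in ``cleaned``.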