Dataloader module

This module provides functions to load input data before enrichment. The package supports two types of input: occurrences and areas. Occurrences can be loaded directly from GBIF, from a local Darwin Core archive, or from a custom CSV file. Areas must be loaded from a CSV file; see geoenrich.dataloader.load_areas_file().

geoenrich.dataloader.download_requested(request_key)

Download the GBIF data for the given request key. If the requested data is available, download it; otherwise, print the request status.

Parameters:

request_key (int) – Request key as returned by the geoenrich.dataloader.request_from_gbif() function.

Returns:

None

geoenrich.dataloader.get_taxon_key(query)

Look up a taxonomic category in the GBIF database, print the best match, and return its unique ID.

Parameters:

query (str) – Scientific name of the genus or species to search for.

Returns:

GBIF taxon ID

Return type:

int

geoenrich.dataloader.import_occurrences_csv(path, id_col, date_col, lat_col, lon_col, date_format=None, crs='EPSG:4326', *args, **kwargs)

Load data from a custom CSV file. Additional arguments are passed down to pandas.read_csv. Rows with a missing event date or missing coordinates are removed. Return a GeoDataFrame of the remaining occurrences.

Parameters:
  • path (str) – Path to the csv file to open.

  • id_col (int or str) – Name or index of the column containing individual occurrence ids.

  • date_col (int or str) – Name or index of the column containing occurrence dates.

  • lat_col (int or str) – Name or index of the column containing occurrence latitudes (decimal degrees).

  • lon_col (int or str) – Name or index of the column containing occurrence longitudes (decimal degrees).

  • date_format (str) – To avoid date parsing mistakes, specify your date format (according to strftime syntax).

  • crs (str) – Coordinate reference system (CRS) of the provided coordinates.

Returns:

occurrences data (only relevant columns are included)

Return type:

geopandas.GeoDataFrame
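
The call can be sketched as follows for a small file written on the fly. The column names (obs_id, when, lat, lon), the sample rows, and the helpers write_sample_csv and load_sample are hypothetical; only the import_occurrences_csv signature comes from the description above. The geoenrich import is kept inside load_sample so that the rest of the snippet runs without the package installed.

```python
import csv
import tempfile
from datetime import datetime
from pathlib import Path

# Hypothetical day-first dates: "03/04/2021" is ambiguous without a format.
DATE_FORMAT = "%d/%m/%Y"

ROWS = [
    {"obs_id": "a1", "when": "03/04/2021", "lat": "-21.10", "lon": "55.50"},
    {"obs_id": "a2", "when": "12/11/2020", "lat": "-20.90", "lon": "55.30"},
]


def write_sample_csv(path):
    """Write the hypothetical sample rows to a csv file; return its path as str."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(ROWS[0]))
        writer.writeheader()
        writer.writerows(ROWS)
    return str(path)


def load_sample(path):
    """Load the sample file; needs geoenrich installed and configured."""
    from geoenrich import dataloader  # imported lazily on purpose

    return dataloader.import_occurrences_csv(
        path,
        id_col="obs_id",
        date_col="when",
        lat_col="lat",
        lon_col="lon",
        date_format=DATE_FORMAT,
        crs="EPSG:4326",
    )


sample_path = write_sample_csv(Path(tempfile.mkdtemp()) / "occurrences.csv")

# With the explicit format, the first date reads as 3 April, not 4 March.
first_date = datetime.strptime(ROWS[0]["when"], DATE_FORMAT)
print(sample_path, first_date.date())
```

Passing date_format here is what keeps "03/04/2021" from being read month-first; calling load_sample(sample_path) would then hand the same format to the loader.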

geoenrich.dataloader.load_areas_file(path, date_format=None, crs='EPSG:4326', *args, **kwargs)

Load the data needed to download a variable for specific areas. Bound columns must be named {dim}_min and {dim}_max, where {dim} is latitude, longitude, or date. Additional arguments are passed down to pandas.read_csv.

Parameters:
  • path (str) – Path to the csv file to open.

  • date_format (str) – To avoid date parsing mistakes, specify your date format (according to strftime syntax).

  • crs (str) – Coordinate reference system (CRS) of the provided coordinates.

Returns:

areas bounds (only relevant columns are included)

Return type:

geopandas.GeoDataFrame
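
The naming rule for bound columns can be checked before loading. The following is a minimal sketch using only the standard library; the helpers bound_columns and missing_bounds and the sample values are hypothetical, and the check covers only the six bound columns named above, not any other requirement the loader may have.

```python
import csv
import io

DIMS = ("latitude", "longitude", "date")


def bound_columns():
    """Return the six bound column names, in {dim}_min, {dim}_max order."""
    return [f"{dim}_{bound}" for dim in DIMS for bound in ("min", "max")]


def missing_bounds(header):
    """List the bound columns that are absent from a csv header."""
    return [name for name in bound_columns() if name not in header]


# Hypothetical file content with one area; values are illustrative only.
sample = io.StringIO(
    "latitude_min,latitude_max,longitude_min,longitude_max,date_min,date_max\n"
    "-22.0,-20.0,54.0,56.0,2021-01-01,2021-01-31\n"
)
header = next(csv.reader(sample))
print(missing_bounds(header))            # []
print(missing_bounds(["latitude_min"]))  # the five other bound columns
```

A file whose header passes this check could then be given to load_areas_file(path, date_format="%Y-%m-%d"), with the format matching the dates used in the file.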

geoenrich.dataloader.open_dwca(path=None, taxonKey=None, max_number=10000)

Load data from the Darwin Core Archive located at the given path. If no path is given, try to open a previously downloaded GBIF archive for the given taxonomic key. Rows with a missing event date are removed. Return a GeoDataFrame with all occurrences if there are fewer than max_number; otherwise, return a random sample of max_number occurrences.

Parameters:
  • path (str) – Path to the DarwinCoreArchive (.zip) to open.

  • taxonKey (int) – Taxonomic key of a previously downloaded archive from GBIF.

  • max_number (int) – Maximum number of rows to import. If the archive contains more rows, a random sample of this size is selected.

Returns:

occurrences data (only relevant columns are included)

Return type:

geopandas.GeoDataFrame
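
The max_number rule can be illustrated on a plain list. This is a sketch of the behavior described above, not the library's implementation; cap_rows and its seed argument are hypothetical names introduced for the example.

```python
import random


def cap_rows(rows, max_number=10000, seed=None):
    """Return all rows if they fit in max_number, else a random sample."""
    rows = list(rows)
    if len(rows) <= max_number:
        # At exactly max_number rows, a sample would contain every row anyway.
        return rows
    # Sampling without replacement; a fixed seed makes the draw repeatable.
    return random.Random(seed).sample(rows, max_number)


small = cap_rows(range(5), max_number=10)
large = cap_rows(range(50), max_number=10, seed=0)
print(len(small), len(large))  # 5 10
```

Because the sample is random, two calls on a large archive may return different occurrences unless the random state is fixed.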

geoenrich.dataloader.request_from_gbif(taxon_key, override=False)

Request all georeferenced occurrences for the given taxon key and return the request key. If the same request has already been made for this GBIF account, return the key of the first request; in that case, a new request can be forced with override=True.

Parameters:
  • taxon_key (int) – GBIF ID of the taxonomic category to request.

  • override (bool) – Force new request to be made if one already exists.

Returns:

Request key

Return type:

int
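
Taken together, the GBIF functions suggest a two-step workflow: request first, download later. The following is a minimal sketch, assuming the dataloader module is passed in as dl (for example after `from geoenrich import dataloader as dl`). The helpers request_species and fetch_species are hypothetical, and the sketch assumes GBIF has finished preparing the archive by the time fetch_species runs.

```python
def request_species(dl, name, override=False):
    """Resolve a scientific name and ask GBIF for its occurrences."""
    taxon_key = dl.get_taxon_key(name)
    request_key = dl.request_from_gbif(taxon_key, override=override)
    return taxon_key, request_key


def fetch_species(dl, taxon_key, request_key, max_number=10000):
    """Download a finished request, then open the archive by taxon key."""
    # If GBIF is not done yet, this call only prints the request status.
    dl.download_requested(request_key)
    return dl.open_dwca(taxonKey=taxon_key, max_number=max_number)


# Typical use (requires geoenrich installed and a GBIF account):
#   from geoenrich import dataloader as dl
#   taxon_key, request_key = request_species(dl, "<scientific name>")
#   ... wait until GBIF has prepared the download ...
#   occurrences = fetch_species(dl, taxon_key, request_key)
```

Keeping both keys is useful because download_requested expects the request key, whereas open_dwca looks up the downloaded archive by taxon key.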