Dataloader module

This module provides functions to load input data before enrichment. The package supports two types of input: occurrences and areas. Occurrences can be loaded straight from GBIF, from a local DarwinCore archive, or from a custom csv file. Areas have to be loaded from a csv file (see geoenrich.dataloader.load_areas_file()).

geoenrich.dataloader.import_occurrences_csv(path, id_col, date_col, lat_col, lon_col, depth_col=None, date_format=None, crs='EPSG:4326', *args, **kwargs)

Load occurrence data from a custom csv file. Additional arguments are passed down to pandas.read_csv. Rows with a missing event date or missing coordinates are removed. The remaining occurrences are returned as a geodataframe.

Parameters:
  • path (str) – Path to the csv file to open.

  • id_col (int or str) – Name or index of the column containing unique occurrence ids (must be numeric).

  • date_col (int or str) – Name or index of the column containing occurrence dates.

  • lat_col (int or str) – Name or index of the column containing occurrence latitudes (decimal degrees).

  • lon_col (int or str) – Name or index of the column containing occurrence longitudes (decimal degrees).

  • depth_col (int or str) – Name or index of the column containing occurrence depths (meters from the surface).

  • date_format (str) – To avoid date parsing mistakes, specify your date format (according to strftime syntax).

  • crs (str) – Coordinate reference system (CRS) of the provided coordinates.

Returns:

occurrences data (only relevant columns are included)

Return type:

geopandas.GeoDataFrame
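
As a minimal sketch of how this function might be called, the example below writes a small csv file and defines a helper that loads it. The column names (obs_id, obs_date, lat, lon, depth_m), the sample values, and the helper names write_sample_csv and load_sample are hypothetical and chosen for illustration only. The load_sample helper assumes geoenrich is installed.

```python
import csv
import os
import tempfile

# Hypothetical sample rows: (id, date, latitude, longitude, depth in meters).
ROWS = [
    (1, "2021-03-15", 43.10, 5.90, 12.0),
    (2, "2021-04-02", 42.75, 6.35, 3.5),
    (3, "", 43.00, 6.00, 8.0),  # missing date, to exercise the documented row removal
]


def write_sample_csv(path):
    """Write the sample rows to `path`, preceded by a header line."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["obs_id", "obs_date", "lat", "lon", "depth_m"])
        writer.writerows(ROWS)
    return path


def load_sample(path):
    """Load the sample file with geoenrich (requires the package)."""
    # Imported here so the csv helper above works without geoenrich.
    from geoenrich.dataloader import import_occurrences_csv

    return import_occurrences_csv(
        path,
        id_col="obs_id",
        date_col="obs_date",
        lat_col="lat",
        lon_col="lon",
        depth_col="depth_m",
        date_format="%Y-%m-%d",  # strftime syntax matching the sample dates
    )


sample_path = write_sample_csv(os.path.join(tempfile.mkdtemp(), "occurrences.csv"))
```

Calling load_sample(sample_path) should then return a geopandas.GeoDataFrame. Going by the description above, the third row would be expected to be dropped because its date is missing.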

geoenrich.dataloader.load_areas_file(path, date_format=None, crs='EPSG:4326', *args, **kwargs)

Load area bounds from a csv file, in order to download a variable for specific areas. An “id” column must be present and contain a unique numeric identifier. Bound columns must be named {dim}_min and {dim}_max, where {dim} is one of latitude, longitude, or date (for example latitude_min and latitude_max). Additional arguments are passed down to pandas.read_csv.

Parameters:
  • path (str) – Path to the csv file to open.

  • date_format (str) – To avoid date parsing mistakes, specify your date format (according to strftime syntax).

  • crs (str) – Coordinate reference system (CRS) of the provided coordinates.

Returns:

areas bounds (only relevant columns are included)

Return type:

geopandas.GeoDataFrame
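
The column naming rule above can be illustrated with a short sketch that builds the expected header and writes a small areas file. The area values and the helper names write_areas_csv and load_sample_areas are hypothetical. The load_sample_areas helper assumes geoenrich is installed.

```python
import csv
import os
import tempfile

# Bound columns follow the documented {dim}_min / {dim}_max pattern.
DIMS = ["latitude", "longitude", "date"]
HEADER = ["id"] + [f"{dim}_{bound}" for dim in DIMS for bound in ("min", "max")]

# Hypothetical areas: id, latitude bounds, longitude bounds, date bounds.
AREAS = [
    (1, 42.0, 43.5, 4.0, 6.5, "2021-01-01", "2021-01-31"),
    (2, -21.5, -20.5, 55.0, 56.0, "2020-06-01", "2020-06-15"),
]


def write_areas_csv(path):
    """Write the header and the sample areas to `path`."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        writer.writerows(AREAS)
    return path


def load_sample_areas(path):
    """Load the areas file with geoenrich (requires the package)."""
    # Imported here so the csv helper above works without geoenrich.
    from geoenrich.dataloader import load_areas_file

    return load_areas_file(path, date_format="%Y-%m-%d")


areas_path = write_areas_csv(os.path.join(tempfile.mkdtemp(), "areas.csv"))
```

Passing date_format explicitly, as in load_sample_areas, follows the advice in the parameter description and avoids ambiguous date parsing.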

geoenrich.dataloader.load_paths()

Load paths for caching biodiversity and satellite data. If the config.yml file does not exist, it is created and the cache path is set to a geoenrich_cache folder in the user’s home directory.

Parameters:

None

Returns:

(biodiv_path, sat_path)

Return type:

tuple

geoenrich.dataloader.open_dwca(path=None, taxonKey=None, max_number=10000)

Load data from the DarwinCore archive located at the given path. If no path is given, try to open a previously downloaded GBIF archive for the given taxonomic key. Rows with a missing event date are removed. Return a geodataframe with all occurrences if there are fewer than max_number. Otherwise, return a random sample of max_number occurrences.

Parameters:
  • path (str) – Path to the DarwinCoreArchive (.zip) to open.

  • taxonKey (int) – Taxonomic key of a previously downloaded archive from GBIF.

  • max_number (int) – Maximum number of rows to import. If the archive contains more rows, a random sample of this size is selected.

Returns:

occurrences data (only relevant columns are included)

Return type:

geopandas.GeoDataFrame
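
The max_number behaviour described above can be sketched in plain Python. This is an illustration of the documented rule, not geoenrich's actual implementation. The function names cap_rows, load_from_archive, and load_from_gbif_cache, and the seed argument, are hypothetical. The two loader helpers assume geoenrich is installed.

```python
import random


def cap_rows(rows, max_number=10000, seed=None):
    """Return all rows if there are at most max_number, else a random sample."""
    if len(rows) <= max_number:
        return list(rows)
    rng = random.Random(seed)  # the seed is only here to make the sketch repeatable
    return rng.sample(rows, max_number)


def load_from_archive(path, max_number=10000):
    """Open a local DarwinCore archive (.zip) with geoenrich."""
    from geoenrich.dataloader import open_dwca

    return open_dwca(path=path, max_number=max_number)


def load_from_gbif_cache(taxon_key, max_number=10000):
    """Open a previously downloaded GBIF archive by its taxonomic key."""
    from geoenrich.dataloader import open_dwca

    return open_dwca(taxonKey=taxon_key, max_number=max_number)


small = cap_rows(list(range(5)), max_number=10)           # kept whole
large = cap_rows(list(range(50)), max_number=10, seed=0)  # sampled down to 10
```

In this sketch, a dataset smaller than max_number is returned unchanged, while a larger one is reduced to exactly max_number rows chosen at random.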