Dataloader module
This module provides functions to load input data before enrichment.
The package supports two types of input: occurrences or areas.
Occurrences can be loaded directly from GBIF, from a local DarwinCore archive, or from a custom CSV file.
Areas must be loaded from a CSV file. See geoenrich.dataloader.load_areas_file().
- geoenrich.dataloader.import_occurrences_csv(path, id_col, date_col, lat_col, lon_col, depth_col=None, date_format=None, crs='EPSG:4326', *args, **kwargs)
Load data from a custom CSV file. Additional arguments are passed down to pandas.read_csv. Remove rows with a missing event date or missing coordinates. Return a GeoDataFrame with all occurrences if fewer than max_number. Otherwise, return a random sample of max_number occurrences.
- Parameters:
path (str) – Path to the csv file to open.
id_col (int or str) – Name or index of the column containing unique occurrence ids (must be numeric).
date_col (int or str) – Name or index of the column containing occurrence dates.
lat_col (int or str) – Name or index of the column containing occurrence latitudes (decimal degrees).
lon_col (int or str) – Name or index of the column containing occurrence longitudes (decimal degrees).
depth_col (int or str) – Name or index of the column containing occurrence depths (meters from the surface).
date_format (str) – To avoid date parsing mistakes, specify your date format (according to strftime syntax).
crs (str) – Coordinate reference system (CRS) of the provided coordinates.
- Returns:
occurrences data (only relevant columns are included)
- Return type:
geopandas.GeoDataFrame
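As an illustration of the parameters above, the sketch below writes a small CSV file and wraps the loader call. The sample rows, the column names (id, date, lat, lon, depth), and the helper names write_sample_occurrences and load_sample_occurrences are assumptions made for this example. geoenrich is imported inside the helper so that the file-writing part runs without the package installed.

```python
import csv

# Hypothetical column names: any names work as long as they are
# passed to the loader through id_col, date_col, lat_col, lon_col, depth_col.
FIELDS = ["id", "date", "lat", "lon", "depth"]

# Made-up sample occurrences, for illustration only.
SAMPLE_ROWS = [
    {"id": 1, "date": "2021-03-14", "lat": -21.10, "lon": 55.50, "depth": 5},
    {"id": 2, "date": "2021-04-02", "lat": -20.95, "lon": 55.28, "depth": 12},
]


def write_sample_occurrences(path):
    """Write the sample occurrences to a CSV file with a header row."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(SAMPLE_ROWS)


def load_sample_occurrences(path):
    """Load the sample file, using the signature documented above."""
    # Imported lazily: requires geoenrich to be installed.
    from geoenrich.dataloader import import_occurrences_csv

    return import_occurrences_csv(
        path=path,
        id_col="id",
        date_col="date",
        lat_col="lat",
        lon_col="lon",
        depth_col="depth",
        date_format="%Y-%m-%d",  # strftime syntax matching the sample dates
        crs="EPSG:4326",
    )
```

Passing date_format explicitly, as here, avoids ambiguity between day-first and month-first dates.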
- geoenrich.dataloader.load_areas_file(path, date_format=None, crs='EPSG:4326', *args, **kwargs)
Load the area bounds for which a variable will be downloaded. An “id” column must be present and contain a unique numeric identifier. Bound columns must be named {dim}_min and {dim}_max, where {dim} is one of latitude, longitude, or date. Additional arguments are passed down to pandas.read_csv.
- Parameters:
path (str) – Path to the csv file to open.
date_format (str) – To avoid date parsing mistakes, specify your date format (according to strftime syntax).
crs (str) – Coordinate reference system (CRS) of the provided coordinates.
- Returns:
area bounds (only relevant columns are included)
- Return type:
geopandas.GeoDataFrame
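The column naming rule above can be checked before calling the loader. The following is a minimal sketch using only the standard library. The helper names expected_area_columns and missing_area_columns and the sample row are hypothetical, and the check is not part of geoenrich.

```python
import csv
import io

# Dimensions named in the description of load_areas_file.
DIMENSIONS = ("latitude", "longitude", "date")


def expected_area_columns():
    """Return the column names an areas file is expected to contain."""
    cols = ["id"]
    for dim in DIMENSIONS:
        cols += [f"{dim}_min", f"{dim}_max"]
    return cols


def missing_area_columns(header):
    """Return the expected columns that are absent from a CSV header."""
    present = set(header)
    return [c for c in expected_area_columns() if c not in present]


# Made-up areas file with a single area, for illustration only.
sample = io.StringIO(
    "id,latitude_min,latitude_max,longitude_min,longitude_max,date_min,date_max\n"
    "1,-22.0,-20.0,54.0,56.0,2021-01-01,2021-01-31\n"
)
header = next(csv.reader(sample))
print(missing_area_columns(header))  # prints []
```

A non-empty result lists the columns to add or rename before passing the file to load_areas_file.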
- geoenrich.dataloader.load_paths()
Load paths for caching biodiversity and satellite data. If the config.yml file does not exist, it is created and the cache path is set to a geoenrich_cache folder in the user’s home directory.
- Parameters:
None
- Returns:
(biodiv_path, sat_path)
- Return type:
tuple
- geoenrich.dataloader.open_dwca(path=None, taxonKey=None, max_number=10000)
Load data from the DarwinCoreArchive located at the given path. If no path is given, try to open a previously downloaded GBIF archive for the given taxonomic key. Remove rows with a missing event date. Return a GeoDataFrame with all occurrences if fewer than max_number. Otherwise, return a random sample of max_number occurrences.
- Parameters:
path (str) – Path to the DarwinCoreArchive (.zip) to open.
taxonKey (int) – Taxonomic key of a previously downloaded archive from GBIF.
max_number (int) – Maximum number of rows to import. A random sample is selected.
- Returns:
occurrences data (only relevant columns are included)
- Return type:
geopandas.GeoDataFrame
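The max_number rule described for open_dwca can be sketched on a plain list of rows. This illustrates the documented behavior and is not geoenrich's implementation. The function name cap_rows and the seed argument are assumptions added for reproducibility, and the source does not say how geoenrich draws or orders its sample.

```python
import random


def cap_rows(rows, max_number=10000, seed=None):
    """Return all rows if there are at most max_number, else a random sample."""
    rows = list(rows)
    if len(rows) <= max_number:
        # Nothing to discard: keep every row in its original order.
        return rows
    # Sample without replacement; seed is a hypothetical convenience
    # that makes the draw repeatable.
    rng = random.Random(seed)
    return rng.sample(rows, max_number)


small = cap_rows(range(5), max_number=10)
large = cap_rows(range(100), max_number=10, seed=0)
print(len(small), len(large))  # prints 5 10
```

Because the sample is random, two calls on a large archive may return different subsets unless the draw is seeded.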