Enrichment module

This is the main module of the package. It handles the local enrichment files, as well as the download of enrichment data from remote servers.

Main functions

geoenrich.enrichment.create_enrichment_file(gdf, dataset_ref)

Create the database file that will be used to save enrichment metadata. The DataFrame index is used as the unique occurrence ID.

Parameters:

gdf (geopandas.GeoDataFrame or pandas.DataFrame) – Data to enrich (output of geoenrich.dataloader.open_dwca() or geoenrich.dataloader.import_csv() or geoenrich.dataloader.load_areas_file()).
dataset_ref (str) – The enrichment file name (e.g. gbif_taxonKey). Must be unique.

Returns:

None

geoenrich.enrichment.enrich(dataset_ref, var_id, geo_buff=None, time_buff=None, depth_request='surface', downsample={}, slice=None, maxpoints=None, force_download=False, progress_callback=None)

Enrich the given dataset with data of the requested variable. All data within the given buffers are downloaded (if needed) and stored locally in netCDF files. The enrichment file is updated with the coordinates of the relevant netCDF subsets. If the enrichment file is large, use the slice argument to enrich only some rows.

Parameters:

dataset_ref (str) – The enrichment file name (e.g. gbif_taxonKey). Must be unique.
var_id (str) – ID of the variable to download.
geo_buff (int) – Geographic buffer for which to download data around occurrence point (kilometers).
time_buff (float list) – Time bounds for which to download data around occurrence day (days). For instance, time_buff = [-7, 0] will download data from 7 days before the occurrence to the occurrence date.
depth_request (str) – For 4D data: ‘all’ -> data for all depths. ‘nearest’ -> closest available depth. ‘nearest_lower’ -> closest lower available depth. Anything else downloads surface data.
downsample (dict) – BROKEN, DO NOT USE. Number of points to skip between each downloaded point, for each dimension, using its standard name as a key.
slice (int tuple) – Slice of the enrichment file to use for enrichment.
maxpoints (int) – Maximum number of points to download.
force_download (bool) – If True, download data regardless of cache status.
progress_callback (class) – If provided, this class is used to create a custom tqdm progress bar, for instance in a web application. It should be a subclass of tqdm with the same signature.

Returns:

None
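
As an illustration of the slice argument, the sketch below generates row ranges that cover a large enrichment file in chunks. It assumes that slice is a (start, stop) tuple with half-open semantics, as in Python slicing; this is an assumption, not something stated above. The helper row_slices is hypothetical and not part of geoenrich.

```python
def row_slices(n_rows, chunk_size):
    """Yield (start, stop) tuples covering n_rows rows in chunks of chunk_size.

    Assumption: the tuples follow Python's half-open convention, so the
    row at index `stop` is excluded.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_rows, chunk_size):
        yield (start, min(start + chunk_size, n_rows))


# Example: split a file of 2500 rows into chunks of at most 1000 rows.
slices = list(row_slices(n_rows=2500, chunk_size=1000))
print(slices)  # [(0, 1000), (1000, 2000), (2000, 2500)]
```

Under that assumption, each tuple could be passed as the slice argument in successive calls to enrich, so that a failed download only affects one chunk.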

geoenrich.enrichment.enrichment_status(dataset_ref)

Return the number of occurrences of the given dataset that are already enriched, for each variable.

Parameters: dataset_ref (str) – The enrichment file name (e.g. gbif_taxonKey).
Returns: A table of variables and statuses of enrichment.
Return type: pandas.DataFrame

geoenrich.enrichment.read_ids(dataset_ref)

Return a list of all ids of the given enrichment file.

Parameters: dataset_ref (str) – The enrichment file name (e.g. gbif_taxonKey).
Returns: List of all present ids.
Return type: list

geoenrich.enrichment.reset_enrichment_file(dataset_ref, var_ids_to_remove)

Remove enrichment data for the given variables from the enrichment file. Does not remove downloaded data from netCDF files.

Parameters:

dataset_ref (str) – The enrichment file name (e.g. gbif_taxonKey).
var_ids_to_remove (str list) – List of variables to delete from the enrichment file. _all_ removes everything.

Returns:

None

Other functions (for internal use)

geoenrich.enrichment.add_bounds(geodf1, geo_buff, time_buff)

Calculate geo buffer and time buffer. Add columns for cube limits: ‘minx’, ‘maxx’, ‘miny’, ‘maxy’, ‘mint’, ‘maxt’.

Parameters:

geodf1 (geopandas.GeoDataFrame) – Data to calculate buffers for.
geo_buff (int) – Geographic buffer for which to download data around occurrence point (kilometers).
time_buff (float list) – Time bounds for which to download data around occurrence day (days). For instance, time_buff = [-7, 0] will download data from 7 days before the occurrence to the occurrence date.

Returns:

Updated GeoDataFrame with geographical and time boundaries.

Return type:

geopandas.GeoDataFrame
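
To make the six cube limits concrete, here is a minimal sketch of how a kilometer buffer and a day buffer could be turned into bounds for a single occurrence. It is not the library's implementation: it assumes a spherical approximation of about 111.32 km per degree of latitude, and the function name cube_bounds and the sample coordinates are hypothetical.

```python
import math
from datetime import date, timedelta

# Assumption: approximate length of one degree of latitude, in kilometers.
KM_PER_DEGREE = 111.32


def cube_bounds(lon, lat, day, geo_buff, time_buff):
    """Return minx/maxx/miny/maxy/mint/maxt for one occurrence.

    geo_buff is in kilometers, time_buff is [start_offset, end_offset] in days.
    The longitude span is widened by 1/cos(latitude); this breaks down
    very close to the poles.
    """
    dlat = geo_buff / KM_PER_DEGREE
    dlon = geo_buff / (KM_PER_DEGREE * math.cos(math.radians(lat)))
    return {
        "minx": lon - dlon,
        "maxx": lon + dlon,
        "miny": lat - dlat,
        "maxy": lat + dlat,
        "mint": day + timedelta(days=time_buff[0]),
        "maxt": day + timedelta(days=time_buff[1]),
    }


# Example: 115 km around a point, from 7 days before to the occurrence day.
bounds = cube_bounds(lon=-4.5, lat=48.4, day=date(2020, 6, 15),
                     geo_buff=115, time_buff=[-7, 0])
print(bounds["mint"], bounds["maxt"])  # 2020-06-08 2020-06-15
```

With time_buff = [-7, 0], mint falls seven days before the occurrence date and maxt equals the occurrence date, which matches the description of the parameter above.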

geoenrich.enrichment.calculate_indices(row, dimdict, var, depth_request, downsample)

Calculate indices of interest for the given bounds, according to variable dimensions.

Parameters:

row (pandas.Series) – GeoDataFrame row to enrich.
dimdict (dict) – Dictionary of dimensions as returned by geoenrich.satellite.get_metadata.
var (dict) – Variable dictionary as returned by geoenrich.satellite.get_metadata.
depth_request (str) – For 4D data: ‘all’ -> data for all depths. ‘nearest’ -> closest available depth. ‘nearest_lower’ -> closest lower available depth. Anything else downloads surface data.
downsample (dict) – Number of points to skip between each downloaded point, for each dimension, using its standard name as a key.

Returns:

Dictionary of indices for each dimension (keys are standard dimension names).

Return type:

dict
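
The downsample description ("number of points to skip between each downloaded point") can be read as a slicing step of skip + 1. The sketch below illustrates that reading only; the helper dimension_indices and the dictionary keys are hypothetical, and the documentation of enrich above marks downsample as broken, so this is not a recommendation to use it.

```python
def dimension_indices(min_idx, max_idx, skip=0):
    """Return the indices kept between min_idx and max_idx (inclusive).

    Assumption: skipping `skip` points between kept points is the same
    as slicing with a step of skip + 1.
    """
    return list(range(min_idx, max_idx + 1, skip + 1))


# Hypothetical downsample dictionary, keyed by dimension standard name.
downsample = {"latitude": 1, "longitude": 1}

lat_idx = dimension_indices(10, 16, downsample.get("latitude", 0))
time_idx = dimension_indices(3, 5, downsample.get("time", 0))
print(lat_idx)   # [10, 12, 14, 16]
print(time_idx)  # [3, 4, 5]
```

Dimensions that are absent from the dictionary are not downsampled in this sketch, which is why every time index is kept.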

geoenrich.enrichment.checksize(ind)

Calculate the number of points to be downloaded.

Parameters: ind (pandas.Series) – Series of data indices as output by geoenrich.enrichment.calculate_indices().
Returns: Number of points to be downloaded.
Return type: int
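
The point count of a gridded subset is the product of the number of indices kept along each dimension. The following sketch shows that arithmetic under an assumed layout for the indices (a min, max and step per dimension); the real structure produced by calculate_indices may differ, and count_points is a hypothetical name.

```python
from math import prod


def count_points(ind):
    """Multiply the number of kept indices along every dimension.

    Assumption: `ind` maps each dimension name to a dict with inclusive
    'min' and 'max' indices and a 'step'.
    """
    return prod(
        len(range(d["min"], d["max"] + 1, d["step"])) for d in ind.values()
    )


ind = {
    "time": {"min": 0, "max": 7, "step": 1},         # 8 time steps
    "latitude": {"min": 10, "max": 19, "step": 1},   # 10 rows
    "longitude": {"min": 30, "max": 49, "step": 2},  # 10 columns
}
print(count_points(ind))  # 800
```

A count of this kind is what a limit such as maxpoints would be compared against before starting a download.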

geoenrich.enrichment.compute_variable(var_id, base_data)

Calculate a composite variable.

Parameters:

var_id (str) – ID of the variable to compute.
base_data (numpy.ma.MaskedArray dict) – Required data for the computation.

Returns:

Output data.

Return type:

numpy.ma.MaskedArray

geoenrich.enrichment.download_data(remote_ds, local_ds, bool_ds, var, dimdict, ind, force_download)

Download missing data from the remote dataset to the local dataset.

Parameters:

remote_ds (netCDF4.Dataset) – Remote dataset.
local_ds (netCDF4.Dataset) – Local dataset.
bool_ds (netCDF4.Dataset) – Local dataset recording whether data has already been downloaded.
var (dict) – Variable dictionary as returned by geoenrich.satellite.get_metadata.
dimdict (dict) – Dictionary of dimensions as returned by geoenrich.satellite.get_metadata.
ind (dict) – Dictionary with ordered slicing indices for all dimensions.
force_download (bool) – If True, download data regardless of cache status.

Returns:

None
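
The role of bool_ds and force_download can be illustrated with plain Python lists standing in for the three datasets. This is a simplified sketch of the caching idea, not the netCDF-based implementation; download_missing and the sample values are hypothetical.

```python
def download_missing(remote, local, downloaded, indices, force_download=False):
    """Copy values from remote to local for the requested indices.

    `downloaded` plays the role of the boolean dataset: a cell is only
    fetched if it is not flagged yet, unless force_download is True.
    Returns the number of cells actually fetched.
    """
    fetched = 0
    for i in indices:
        if force_download or not downloaded[i]:
            local[i] = remote[i]
            downloaded[i] = True
            fetched += 1
    return fetched


remote = [10.0, 11.0, 12.0, 13.0, 14.0]
local = [None] * 5
downloaded = [False] * 5

first = download_missing(remote, local, downloaded, range(0, 3))
second = download_missing(remote, local, downloaded, range(1, 5))
forced = download_missing(remote, local, downloaded, range(0, 5),
                          force_download=True)
print(first, second, forced)  # 3 2 5
```

The second request overlaps the first on indices 1 and 2, so only indices 3 and 4 are fetched; the forced request ignores the flags and fetches all five cells again.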

geoenrich.enrichment.enrich_compute(geodf, var_id, geo_buff, time_buff, downsample, progress_callback)

Compute a calculated variable for the provided bounds and save it into a local netCDF file. Calculate and return indices of the data of interest in the netCDF file.

Parameters:

geodf (geopandas.GeoDataFrame) – Data to be enriched.
var_id (str) – ID of the variable to download.
geo_buff (int) – Geographic buffer for which to download data around occurrence point (kilometers).
time_buff (float list) – Time bounds for which to download data around occurrence day (days). For instance, time_buff = [-7, 0] will download data from 7 days before the occurrence to the occurrence date.
downsample (dict) – Number of points to skip between each downloaded point, for each dimension, using its standard name as a key.
progress_callback (class) – If provided, this class is used to create a custom tqdm progress bar, for instance in a web application. It should be a subclass of tqdm with the same signature.

Returns:

DataFrame with indices of relevant data in the netCDF file.

Return type:

pandas.DataFrame

geoenrich.enrichment.enrich_copernicus(geodf, varname, var_id, dataset_id, geo_buff, time_buff, depth_request, downsample, maxpoints, force_download, progress_callback)

Download Copernicus data for the requested occurrences and buffer into a local netCDF file. Calculate and return indices of the data of interest in the netCDF file.

Parameters:

geodf (geopandas.GeoDataFrame) – Data to be enriched.
varname (str) – Variable name in the dataset.
var_id (str) – ID of the variable to download.
dataset_id (str) – Copernicus dataset ID.
geo_buff (int) – Geographic buffer for which to download data around occurrence point (kilometers).
time_buff (float list) – Time bounds for which to download data around occurrence day (days). For instance, time_buff = [-7, 0] will download data from 7 days before the occurrence to the occurrence date.
depth_request (str) – For 4D data: ‘all’ -> data for all depths. ‘nearest’ -> closest available depth. ‘nearest_lower’ -> closest lower available depth. Anything else downloads surface data.
downsample (dict) – Number of points to skip between each downloaded point, for each dimension, using its standard name as a key.
maxpoints (int) – Maximum number of points to download.
force_download (bool) – If True, download data regardless of cache status.
progress_callback (class) – If provided, this class is used to create a custom tqdm progress bar, for instance in a web application. It should be a subclass of tqdm with the same signature.

Returns:

DataFrame with indices of relevant data in the netCDF file.

Return type:

pandas.DataFrame

geoenrich.enrichment.enrich_download(geodf, varname, var_id, url, geo_buff, time_buff, depth_request, downsample, maxpoints, force_download, progress_callback)

Download data for the requested occurrences and buffer into a local netCDF file. Calculate and return indices of the data of interest in the netCDF file.

Parameters:

geodf (geopandas.GeoDataFrame) – Data to be enriched.
varname (str) – Variable name in the dataset.
var_id (str) – ID of the variable to download.
url (str) – Dataset url.
geo_buff (int) – Geographic buffer for which to download data around occurrence point (kilometers).
time_buff (float list) – Time bounds for which to download data around occurrence day (days). For instance, time_buff = [-7, 0] will download data from 7 days before the occurrence to the occurrence date.
depth_request (str) – For 4D data: ‘all’ -> data for all depths. ‘nearest’ -> closest available depth. ‘nearest_lower’ -> closest lower available depth. Anything else downloads surface data.
downsample (dict) – Number of points to skip between each downloaded point, for each dimension, using its standard name as a key.
maxpoints (int) – Maximum number of points to download.
force_download (bool) – If True, download data regardless of cache status.
progress_callback (class) – If provided, this class is used to create a custom tqdm progress bar, for instance in a web application. It should be a subclass of tqdm with the same signature.

Returns:

DataFrame with indices of relevant data in the netCDF file.

Return type:

pandas.DataFrame

geoenrich.enrichment.get_enrichment_id(enrichments, var_id, geo_buff, time_buff, depth_request, downsample)

Return ID of the requested enrichment if it exists, -1 otherwise.

Parameters:

enrichments (dict) – Enrichments metadata as stored in the json config file.
var_id (str) – ID of the variable to download.
geo_buff (int) – Geographic buffer for which to download data around occurrence point (kilometers).
time_buff (float list) – Time bounds for which to download data around occurrence day (days). For instance, time_buff = [-7, 0] will download data from 7 days before the occurrence to the occurrence date.
depth_request (str) – For 4D data: ‘all’ -> data for all depths. ‘nearest’ -> closest available depth. ‘nearest_lower’ -> closest lower available depth. Anything else downloads surface data.
downsample (dict) – Number of points to skip between each downloaded point, for each dimension, using its standard name as a key.

Returns:

Enrichment ID.

Return type:

int
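
The lookup described here amounts to comparing the requested parameters with each stored enrichment and returning the matching ID, or -1 if none matches. The sketch below assumes a hypothetical metadata layout (a dict mapping each enrichment ID to its parameters); the actual layout of the json config file is not documented in this section, and find_enrichment_id and the variable ID 'sst' are illustrative names.

```python
def find_enrichment_id(enrichments, var_id, geo_buff, time_buff,
                       depth_request, downsample):
    """Return the ID of the stored enrichment with identical parameters, or -1.

    Assumption: `enrichments` maps an enrichment ID to a dict holding the
    parameters it was created with.
    """
    wanted = {
        "var_id": var_id,
        "geo_buff": geo_buff,
        "time_buff": time_buff,
        "depth_request": depth_request,
        "downsample": downsample,
    }
    for enrichment_id, params in enrichments.items():
        if all(params.get(key) == value for key, value in wanted.items()):
            return enrichment_id
    return -1


# Hypothetical stored metadata with a single enrichment.
enrichments = {
    0: {"var_id": "sst", "geo_buff": 115, "time_buff": [-7, 0],
        "depth_request": "surface", "downsample": {}},
}

print(find_enrichment_id(enrichments, "sst", 115, [-7, 0], "surface", {}))   # 0
print(find_enrichment_id(enrichments, "sst", 115, [-14, 0], "surface", {}))  # -1
```

Changing any single parameter, such as the time buffer in the second call, is treated as a different enrichment in this sketch.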

geoenrich.enrichment.load_enrichment_file(dataset_ref, mute=False)

Load enrichment file.

Parameters:

dataset_ref (str) – The enrichment file name (e.g. gbif_taxonKey).
mute (bool) – If True, do not print the load message.

Returns:

Data to enrich (including previously added columns), and the enrichment metadata.

Return type:

geopandas.GeoDataFrame or pandas.DataFrame, and dict

geoenrich.enrichment.parse_columns(df)

Return column indices sorted by variable and dimension.

Parameters: df (pandas.DataFrame) – Enrichment file as a DataFrame, as returned by geoenrich.enrichment.load_enrichment_file().
Returns: Dictionary of column indices, with enrichment ID as a primary key, dimension as a secondary key, and min/max as tertiary key.
Return type: dict
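
The three-level dictionary described above can be pictured with a small sketch that groups column positions by enrichment ID, dimension and bound. The column naming scheme used here ("<id>_<dimension>_<min|max>") is purely hypothetical and chosen for illustration; the real column names in an enrichment file may follow a different pattern, and group_columns is not a geoenrich function.

```python
def group_columns(columns):
    """Group column positions as {enrichment_id: {dimension: {bound: position}}}.

    Assumption: enrichment columns are named '<id>_<dimension>_<min|max>';
    any column that does not fit this pattern is ignored.
    """
    result = {}
    for position, name in enumerate(columns):
        parts = name.split("_")
        if len(parts) != 3 or not parts[0].isdigit():
            continue
        if parts[2] not in ("min", "max"):
            continue
        enrichment_id, dimension, bound = int(parts[0]), parts[1], parts[2]
        result.setdefault(enrichment_id, {}).setdefault(dimension, {})[bound] = position
    return result


columns = ["geometry", "eventDate",
           "0_time_min", "0_time_max",
           "0_latitude_min", "0_latitude_max"]
parsed = group_columns(columns)
print(parsed[0]["time"])      # {'min': 2, 'max': 3}
print(parsed[0]["latitude"])  # {'min': 4, 'max': 5}
```

With this shape, the position of a given bound is reached with three successive keys, for example parsed[0]["time"]["min"].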

geoenrich.enrichment.row_compute(row, local_ds, bool_ds, base_datasets, dimdict, var, downsample)

Calculate variable for the given row. Save netCDF data to disk and return their coordinates.

Parameters:

row (pandas.Series) – GeoDataFrame row to enrich.
local_ds (netCDF4.Dataset) – Local dataset.
bool_ds (netCDF4.Dataset) – Local dataset recording whether data has already been downloaded.
base_datasets (netCDF4.Dataset dict) – Required datasets for the computation.
dimdict (dict) – Dictionary of dimensions as returned by geoenrich.satellite.get_metadata.
var (dict) – Variable dictionary as returned by geoenrich.satellite.get_metadata.
downsample (dict) – Number of points to skip between each downloaded point, for each dimension, using its standard name as a key.

Returns:

Coordinates of the data of interest in the netCDF file.

Return type:

pandas.Series

geoenrich.enrichment.row_enrich(row, remote_ds, local_ds, bool_ds, dimdict, var, depth_request, downsample, force_download)

Query geospatial data for the given GeoDataFrame row. Save netCDF data to disk and return their coordinates.

Parameters:

row (pandas.Series) – GeoDataFrame row to enrich.
remote_ds (netCDF4.Dataset) – Remote dataset.
local_ds (netCDF4.Dataset) – Local dataset.
bool_ds (netCDF4.Dataset) – Local dataset recording whether data has already been downloaded.
dimdict (dict) – Dictionary of dimensions as returned by geoenrich.satellite.get_metadata().
var (dict) – Variable dictionary as returned by geoenrich.satellite.get_metadata().
depth_request (str) – For 4D data: ‘all’ -> data for all depths. ‘nearest’ -> closest available depth. ‘nearest_lower’ -> closest lower available depth. Anything else downloads surface data.
downsample (dict) – Number of points to skip between each downloaded point, for each dimension, using its standard name as a key.
force_download (bool) – If True, download data regardless of cache status.

Returns:

Coordinates of the data of interest in the netCDF file.

Return type:

pandas.Series

geoenrich.enrichment.save_enrichment_config(dataset_ref, enrichment_id, var_id, geo_buff, time_buff, depth_request, downsample, status='None')

Save enrichment metadata in the json config file.

Parameters:

dataset_ref (str) – The enrichment file name (e.g. gbif_taxonKey). Must be unique.
enrichment_id (int) – Enrichment ID.
var_id (str) – ID of the variable to download.
geo_buff (int) – Geographic buffer for which to download data around occurrence point (kilometers).
time_buff (float list) – Time bounds for which to download data around occurrence day (days). For instance, time_buff = [-7, 0] will download data from 7 days before the occurrence to the occurrence date.
depth_request (str) – For 4D data: ‘all’ -> data for all depths. ‘nearest’ -> closest available depth. ‘nearest_lower’ -> closest lower available depth. Anything else downloads surface data.
downsample (dict) – Number of points to skip between each downloaded point, for each dimension, using its standard name as a key.

Returns:

None

geoenrich.enrichment.update_enrichment_status(ds_ref, enrichment_id, status)

Update enrichment status in the config file.

Parameters:

ds_ref (str) – The enrichment file name (e.g. gbif_taxonKey).
enrichment_id (int) – Enrichment ID.
status (str) – Enrichment status. Can be ‘None’, ‘Enriched’, ‘Partially Enriched’, etc.

Returns:

None