API Docs¶
Documentation for the package’s Python API for usage as a library.
Individual files¶
The main function is:
geoextent.from_file(input, bbox, time)
- Parameters:
- input: a string value with the input file path
- bbox: a boolean value to extract the spatial extent (bounding box)
- time: a boolean value to extract the temporal extent (at "day" precision, '%Y-%m-%d')
The output of this function is the bbox and/or the tbox for the individual file (see: Supported file formats). The coordinate reference system (CRS) of the bounding box is the one resulting from the extraction (i.e. no transformation to another coordinate reference system is applied).
Examples¶
Extract bounding box from a single file¶
Code:
geoextent.from_file('muenster_ring_zeit.geojson', True, False)
Output:
Processing muenster_ring_zeit.geojson: 0%| | 0/1 [00:00<?, ?task/s]
Processing muenster_ring_zeit.geojson: 0%| | 0/1 [00:00<?, ?task/s, ../tests/testdata/geojson/muenster_ring_zeit.geojson]
Processing muenster_ring_zeit.geojson: 0%| | 0/1 [00:00<?, ?it/s]
Processing muenster_ring_zeit.geojson: 0%| | 0/1 [00:00<?, ?it/s, Spatial extent extracted]
{'format': 'geojson',
'geoextent_handler': 'handle_vector',
'bbox': [51.94881477206191,
7.6016807556152335,
51.974624029877454,
7.647256851196289],
'crs': '4326',
'file_size_bytes': 1695}
Extracting time interval from a single file¶
Code:
geoextent.from_file('muenster_ring_zeit.geojson', False, True)
Output:
Processing muenster_ring_zeit.geojson: 0%| | 0/1 [00:00<?, ?task/s]
Processing muenster_ring_zeit.geojson: 0%| | 0/1 [00:00<?, ?task/s, ../tests/testdata/geojson/muenster_ring_zeit.geojson]
Processing muenster_ring_zeit.geojson: 0%| | 0/1 [00:00<?, ?it/s]
Processing muenster_ring_zeit.geojson: 0%| | 0/1 [00:00<?, ?it/s, Temporal extent extracted]
{'format': 'geojson',
'geoextent_handler': 'handle_vector',
'tbox': ['2018-11-14', '2018-11-14'],
'file_size_bytes': 1695}
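Because tbox values are plain '%Y-%m-%d' strings, they parse directly with the standard library. A small sketch using the output above:

```python
from datetime import datetime

# tbox as returned above: [start, end] at day precision
tbox = ['2018-11-14', '2018-11-14']
start, end = (datetime.strptime(d, '%Y-%m-%d') for d in tbox)
duration_days = (end - start).days  # 0 here: a single-day extent
```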
Extracting both bounding box and time interval from a single file¶
Code:
geoextent.from_file('muenster_ring_zeit.geojson', True, True)
Output:
Processing muenster_ring_zeit.geojson: 0%| | 0/2 [00:00<?, ?task/s]
Processing muenster_ring_zeit.geojson: 0%| | 0/2 [00:00<?, ?task/s, ../tests/testdata/geojson/muenster_ring_zeit.geojson]
Processing muenster_ring_zeit.geojson: 0%| | 0/2 [00:00<?, ?it/s]
Processing muenster_ring_zeit.geojson: 0%| | 0/2 [00:00<?, ?it/s, Spatial extent extracted]
Processing muenster_ring_zeit.geojson: 0%| | 0/2 [00:00<?, ?it/s]
Processing muenster_ring_zeit.geojson: 0%| | 0/2 [00:00<?, ?it/s, Temporal extent extracted]
{'format': 'geojson',
'geoextent_handler': 'handle_vector',
'bbox': [51.94881477206191,
7.6016807556152335,
51.974624029877454,
7.647256851196289],
'crs': '4326',
'tbox': ['2018-11-14', '2018-11-14'],
'file_size_bytes': 1695}
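The return value is a plain dict, so downstream code can consume the extents directly. A minimal sketch using the result above (the bbox ordering appears to be [min lat, min lon, max lat, max lon], judging from the Münster coordinates):

```python
result = {
    'format': 'geojson',
    'bbox': [51.94881477206191, 7.6016807556152335,
             51.974624029877454, 7.647256851196289],
    'crs': '4326',
    'tbox': ['2018-11-14', '2018-11-14'],
}

# Unpack the bounding box and compute its centroid
min_lat, min_lon, max_lat, max_lon = result['bbox']
center = ((min_lat + max_lat) / 2, (min_lon + max_lon) / 2)
```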
Folders or ZIP file(s)¶
Geoextent also supports queries for multiple files inside folders or ZIP files.
geoextent.from_directory(input, bbox, time, details)
- Parameters:
- input: a string value with the directory or ZIP file path
- bbox: a boolean value to extract the spatial extent (bounding box)
- time: a boolean value to extract the temporal extent (at "day" precision, '%Y-%m-%d')
- details: a boolean value to return details (geoextent) of individual files (default False)
- workers: number of parallel workers for file extraction (default 1 = sequential, 0 = auto-detect CPU count). Parallel extraction uses threads and helps most with directories containing many files (tens or more), where per-file I/O latency adds up.
The output of this function is the combined bbox or tbox resulting from merging the results of all individual files (see: Supported file formats) inside the folder or ZIP file. The coordinate reference system (CRS) of the combined bbox is always EPSG:4326.
Extracting both bounding box and time interval from a folder (with details)¶
Code:
geoextent.from_directory('folder_one_file', True, True, True)
Output:
Processing directory: folder_one_file: 0%| | 0/1 [00:00<?, ?item/s]
Processing directory: folder_one_file: 0%| | 0/1 [00:00<?, ?item/s, Processing muenster_ring_zeit.geojson]
Merging results: 0it [00:00, ?it/s]
Merging results: 0it [00:00, ?it/s, folder_one_file]
{'format': 'folder',
'crs': '4326',
'bbox': {'type': 'Polygon',
'coordinates': [[[51.94881477206191, 7.608118057250977],
[51.953258408047034, 7.602796554565429],
[51.96537036973145, 7.6016807556152335],
[51.97361943924433, 7.606401443481445],
[51.974624029877454, 7.62125015258789],
[51.97240332571046, 7.636871337890624],
[51.96817310852836, 7.645368576049805],
[51.96780294552556, 7.645540237426757],
[51.96330786509095, 7.6471710205078125],
[51.95807185013927, 7.647256851196289],
[51.953258408047034, 7.643308639526367],
[51.94881477206191, 7.608118057250977]]]},
'convex_hull': True,
'tbox': ['2018-11-14', '2018-11-14']}
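When a convex-hull polygon is returned, as above, its rectangular envelope can be recovered with plain Python. A sketch using the ring coordinates from the output above (each point is [lat, lon] in this output):

```python
# Outer ring of the 'bbox' polygon above (closing duplicate point omitted)
ring = [
    [51.94881477206191, 7.608118057250977],
    [51.953258408047034, 7.602796554565429],
    [51.96537036973145, 7.6016807556152335],
    [51.97361943924433, 7.606401443481445],
    [51.974624029877454, 7.62125015258789],
    [51.97240332571046, 7.636871337890624],
    [51.96817310852836, 7.645368576049805],
    [51.96780294552556, 7.645540237426757],
    [51.96330786509095, 7.6471710205078125],
    [51.95807185013927, 7.647256851196289],
    [51.953258408047034, 7.643308639526367],
]
lats = [p[0] for p in ring]
lons = [p[1] for p in ring]
# Envelope: [min lat, min lon, max lat, max lon]
envelope = [min(lats), min(lons), max(lats), max(lons)]
```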
Remote repositories¶
Geoextent supports queries for multiple research data repositories including Zenodo, Figshare, Dryad, PANGAEA, OSF, Dataverse, GFZ Data Services, Pensoft, GBIF, SEANOE, DEIMS-SDR, HALO DB, GDI-DE, Arctic Data Center, and TU Dresden Opara.
Geoextent downloads files from the repository and extracts the temporal or geographical extent. The function supports both single identifiers (string) and multiple identifiers (list).
geoextent.from_remote(remote_identifier, bbox, time, details)
- Parameters:
- remote_identifier: a string value with a repository URL or DOI (e.g., https://zenodo.org/record/3528062, https://doi.org/10.5281/zenodo.3528062, 10.5281/zenodo.3528062), or a list of identifiers for multiple resource extraction
- bbox: a boolean value to extract the spatial extent (bounding box)
- time: a boolean value to extract the temporal extent (at "day" precision, '%Y-%m-%d')
- details: a boolean value to return details (geoextent) of individual files (default False)
The output of this function is the combined bbox or tbox resulting from merging the results of all individual files (see: Supported file formats) inside the repository. The coordinate reference system (CRS) of the combined bbox is always EPSG:4326.
Single repository extraction¶
Code:
geoextent.from_remote('https://zenodo.org/record/820562', True, True, False)
Output:
Downloading files: 0%| | 0.00/14.1M [00:00<?, ?B/s]
Downloading files: 32%|███▏ | 4.52M/14.1M [00:09<00:19, 497kB/s]
Downloading files: 32%|███▏ | 4.52M/14.1M [00:09<00:19, 497kB/s, files=1/6]
Downloading files: 32%|███▏ | 4.52M/14.1M [00:11<00:19, 497kB/s, files=2/6]
Downloading files: 64%|██████▍ | 9.04M/14.1M [00:12<00:06, 790kB/s, files=2/6]
Downloading files: 64%|██████▍ | 9.04M/14.1M [00:12<00:06, 790kB/s, files=3/6]
Downloading files: 81%|████████▏ | 11.4M/14.1M [00:14<00:02, 899kB/s, files=3/6]
Downloading files: 81%|████████▏ | 11.4M/14.1M [00:14<00:02, 899kB/s, files=4/6]
Downloading files: 81%|████████▏ | 11.4M/14.1M [00:19<00:02, 899kB/s, files=5/6]
Downloading files: 100%|██████████| 14.1M/14.1M [00:21<00:00, 610kB/s, files=5/6]
Downloading files: 100%|██████████| 14.1M/14.1M [00:21<00:00, 610kB/s, files=6/6]
Downloading files: 100%|██████████| 14.1M/14.1M [00:21<00:00, 655kB/s, files=6/6]
Processing directory: tmpv_b4mwak: 0%| | 0/6 [00:00<?, ?item/s]
Processing directory: tmpv_b4mwak: 0%| | 0/6 [00:00<?, ?item/s, Processing 20160100_Hpakan_20151123_PRE.tif]
Processing directory: tmpv_b4mwak: 17%|█▋ | 1/6 [00:00<00:00, 97.14item/s, Processing 20160100_Hpakan_20160322_POST.tif]
Processing directory: tmpv_b4mwak: 33%|███▎ | 2/6 [00:00<00:00, 154.61item/s, Processing 20160100_Hpakan_20151123_PRE.png]
Error for /tmp/tmpv_b4mwak/20160100_Hpakan_20151123_PRE.pngw extracting bbox: The csv file from /tmp/tmpv_b4mwak/20160100_Hpakan_20151123_PRE.pngw has no BoundingBox
Error extracting tbox, time format not found The csv file from /tmp/tmpv_b4mwak/20160100_Hpakan_20151123_PRE.pngw has no TemporalExtent:
Processing directory: tmpv_b4mwak: 50%|█████ | 3/6 [00:00<00:00, 157.19item/s, Processing 20160100_Hpakan_20151123_PRE.pngw]
Error for /tmp/tmpv_b4mwak/20160100_Hpakan_20160322_POST.pngw extracting bbox: The csv file from /tmp/tmpv_b4mwak/20160100_Hpakan_20160322_POST.pngw has no BoundingBox
Error extracting tbox, time format not found The csv file from /tmp/tmpv_b4mwak/20160100_Hpakan_20160322_POST.pngw has no TemporalExtent:
Processing directory: tmpv_b4mwak: 67%|██████▋ | 4/6 [00:00<00:00, 174.86item/s, Processing 20160100_Hpakan_20160322_POST.pngw]
Processing directory: tmpv_b4mwak: 83%|████████▎ | 5/6 [00:00<00:00, 193.15item/s, Processing 20160100_Hpakan_20160322_POST.png]
Merging results: 0it [00:00, ?it/s]
Merging results: 0it [00:00, ?it/s, tmpv_b4mwak]
{'format': 'remote',
'crs': '4326',
'bbox': [25.558346194400002,
96.21146318274846,
25.632931128800003,
96.35495081696702]}
Multiple repositories¶
Extract from multiple repositories in a single call:
identifiers = [
'10.5281/zenodo.4593540',
'10.25532/OPARA-581',
'https://osf.io/abc123/'
]
geoextent.from_remote(identifiers, True, True, True)
The function returns a merged bounding box covering all resources (similar to directory extraction), plus extraction metadata with success/failure tracking. Individual resource details are available in the details field for diagnostics.
See Advanced Features for detailed documentation on multiple resource extraction features and return structure.
Download size limits¶
Use the max_download_size parameter to limit how much data geoextent downloads from a remote repository. The value is a human-friendly size string parsed by filesizelib (e.g. '100MB', '2GB', '500KB', '10MiB', '0.5GiB'):
# Limit download to 20 MB
geoextent.from_remote('10.23728/b2share.26jnj-a4x24', bbox=True, tbox=True,
max_download_size='20MB')
# Limit GBIF DwC-A download to 500 MB
geoextent.from_remote('10.15468/6bleia', bbox=True, tbox=True,
max_download_size='500MB')
When the combined file sizes exceed the limit, the default API behavior is to silently select a subset of files using the max_download_method strategy ('ordered' by default, or 'random' with a reproducible seed via max_download_method_seed).
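The 'ordered' vs 'random' strategies can be illustrated with a hypothetical helper (for illustration only; this is not geoextent's internal code):

```python
import random

def select_files(files, sizes, max_bytes, method='ordered', seed=None):
    """Hypothetical illustration of the documented subsetting strategies.

    'ordered' keeps files in listing order until the size budget is spent;
    'random' shuffles the candidates first, reproducibly when a seed is given.
    """
    order = list(range(len(files)))
    if method == 'random':
        random.Random(seed).shuffle(order)
    picked, used = [], 0
    for i in order:
        if used + sizes[i] <= max_bytes:
            picked.append(files[i])
            used += sizes[i]
    return picked
```

With a fixed seed the 'random' selection is reproducible across runs, which mirrors the purpose of max_download_method_seed.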
Download size soft limit¶
Set download_size_soft_limit=True to raise a DownloadSizeExceeded exception instead of silently truncating the file list. This is what the CLI uses to prompt the user for confirmation, and is available for all providers whose APIs report file sizes:
from geoextent.lib.exceptions import DownloadSizeExceeded
try:
result = geoextent.from_remote('10.5281/zenodo.820562', bbox=True,
max_download_size='1MB',
download_size_soft_limit=True)
except DownloadSizeExceeded as exc:
print(f"Download is {exc.estimated_size:,} bytes "
f"(limit: {exc.max_size:,} bytes, provider: {exc.provider})")
# Retry with a larger limit
result = geoextent.from_remote('10.5281/zenodo.820562', bbox=True,
max_download_size=f'{exc.estimated_size + 1}B',
download_size_soft_limit=True)
The exception carries three attributes:
- exc.estimated_size — total available download size in bytes
- exc.max_size — the size limit that was exceeded, in bytes
- exc.provider — name of the provider (e.g. "Zenodo", "GBIF")
GBIF DwC-A soft limit. GBIF datasets with Darwin Core Archive downloads have an additional built-in 1 GB soft limit that is always active (regardless of download_size_soft_limit).
Note
The soft limit relies on providers reporting file sizes in their API metadata before download. Metadata-only providers (DEIMS-SDR, HALO DB, Wikidata, Pensoft) do not download data files, so the size limit does not apply. A warning is logged when max_download_size is configured but the provider cannot enforce it.
To avoid the size check entirely, use download_data=False for metadata-only extraction:
# Fast, no download — uses provider API metadata
result = geoextent.from_remote('10.15468/6bleia', bbox=True, tbox=True,
download_data=False)
Progress callbacks¶
All three public API functions (from_file, from_directory, from_remote)
accept a progress_callback parameter for structured progress reporting.
This is useful for web applications, Jupyter notebooks, and other programmatic
consumers that need to display progress without depending on tqdm.
The callback receives ProgressEvent instances
– frozen dataclasses describing what geoextent is doing at each step.
Quick start¶
from geoextent.lib.progress import CollectingProgressCallback
from geoextent.lib import extent
cb = CollectingProgressCallback()
result = extent.from_file(
'data.tif',
bbox=True,
tbox=True,
progress_callback=cb,
)
for event in cb.events:
print(f'{event.phase.value}: {event.message} [{event.current}/{event.total}]')
Output:
process_file: Processing data.tif [0/2]
spatial: Processing data.tif [1/2]
temporal: Processing data.tif [2/2]
ProgressEvent¶
Each event is a frozen (immutable) dataclass with these fields:
| Field | Type | Description |
|---|---|---|
| phase | ProgressPhase | Which processing phase emitted this event (see table below). |
| message | str | Human-readable description (e.g. 'Processing data.tif'). |
| current | int | Current step number (0 when phase starts). |
| total | int | Total number of steps (0 if unknown). |
| detail | str | Optional extra context (filename, provider name, etc.). |
| … | int | Bytes processed so far (download phase only). |
| … | int | Total bytes to download (download phase only). |
Two computed properties are available:
- event.fraction – progress as a float in [0.0, 1.0], or -1.0 if indeterminate (total <= 0).
- event.is_indeterminate – True when total is unknown.
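The fraction semantics can be mirrored by a tiny standalone helper (a sketch, not the library's code):

```python
def fraction(current: int, total: int) -> float:
    # Mirrors event.fraction: -1.0 signals an indeterminate total
    return current / total if total > 0 else -1.0
```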
ProgressPhase¶
Events are tagged with a phase indicating which part of the pipeline emitted them:
| Phase | Emitted by | Description |
|---|---|---|
| process_file | … | Starting to process a single file. |
| spatial | … | Spatial extent extraction completed for a file. |
| temporal | … | Temporal extent extraction completed for a file. |
| … | … | Processing the n-th item in a directory. |
| … | … | Merging individual file results into a combined extent. |
| … | … | A content provider has been identified for the remote identifier. |
| … | … | Downloading files from a remote repository. |
| … | – | Extracting an archive. |
| … | – | Reverse-geocoding coordinates to a placename. |
Built-in callbacks¶
Three callback implementations are provided in geoextent.lib.progress:
CollectingProgressCallback – appends every event to a list. Useful for testing and post-hoc analysis.
from geoextent.lib.progress import CollectingProgressCallback
cb = CollectingProgressCallback()
result = extent.from_directory('mydata/', bbox=True, progress_callback=cb)
print(f'{len(cb.events)} events captured')
LoggingProgressCallback – logs each event to the geoextent logger. The
log level is configurable (default INFO).
from geoextent.lib.progress import LoggingProgressCallback
cb = LoggingProgressCallback() # or LoggingProgressCallback(level=logging.DEBUG)
result = extent.from_file('data.shp', bbox=True, progress_callback=cb)
TqdmProgressCallback – renders tqdm progress bars, one per phase. This is
what geoextent uses internally when show_progress=True and no callback is
provided.
from geoextent.lib.progress import TqdmProgressCallback
cb = TqdmProgressCallback(leave=True) # leave=True keeps bars on screen
result = extent.from_directory('mydata/', bbox=True, progress_callback=cb)
cb.close() # close any open bars
Writing a custom callback¶
A callback is any callable that accepts a single ProgressEvent argument.
Here is an example that pushes progress to a web API:
import requests
from geoextent.lib.progress import ProgressEvent
def webhook_callback(event: ProgressEvent) -> None:
requests.post('https://example.com/progress', json={
'phase': event.phase.value,
'message': event.message,
'fraction': event.fraction,
'detail': event.detail,
}, timeout=5)
result = extent.from_remote(
'10.5281/zenodo.820562',
bbox=True,
tbox=True,
progress_callback=webhook_callback,
)
Here is an example that updates a Jupyter notebook widget:
import ipywidgets as widgets
from IPython.display import display
from geoextent.lib.progress import ProgressEvent
progress_bar = widgets.FloatProgress(min=0, max=1, description='Extracting...')
status_label = widgets.Label()
display(widgets.HBox([progress_bar, status_label]))
def jupyter_callback(event: ProgressEvent) -> None:
if not event.is_indeterminate:
progress_bar.value = event.fraction
status_label.value = event.message
result = extent.from_directory(
'mydata/',
bbox=True,
progress_callback=jupyter_callback,
)
progress_bar.value = 1.0
status_label.value = 'Done'
Interaction with show_progress¶
- When progress_callback is provided, geoextent automatically suppresses internal tqdm bars (equivalent to show_progress=False) to avoid duplicate output.
- When progress_callback is None and show_progress=True (the default), geoextent auto-creates a TqdmProgressCallback internally for backward compatibility. The CLI uses this path.
- To disable all progress output, pass show_progress=False and omit progress_callback.