API Docs

Documentation of the package’s Python API for use as a library.

Individual files

The main function is:

geoextent.from_file(input, bbox, time)
Parameters:
  • input: a string with the path to the input file

  • bbox: a boolean value to extract spatial extent (bounding box)

  • time: a boolean value to extract temporal extent (at “day” precision, ‘%Y-%m-%d’)

The function returns the bbox and/or the tbox of an individual file (see: Supported file formats). The bounding box is reported in the coordinate reference system (CRS) found during extraction, i.e., it is not transformed to another CRS.

Examples

Extracting a bounding box from a single file

Code:

geoextent.from_file('muenster_ring_zeit.geojson', True, False)

Output:

Processing muenster_ring_zeit.geojson:   0%|          | 0/1 [00:00<?, ?task/s, ../tests/testdata/geojson/muenster_ring_zeit.geojson]
Processing muenster_ring_zeit.geojson:   0%|          | 0/1 [00:00<?, ?it/s, Spatial extent extracted]

{'format': 'geojson',
 'geoextent_handler': 'handle_vector',
 'bbox': [51.94881477206191,
  7.6016807556152335,
  51.974624029877454,
  7.647256851196289],
 'crs': '4326',
 'file_size_bytes': 1695}

(source of file muenster_ring_zeit.geojson)

Extracting time interval from a single file

Code:

geoextent.from_file('muenster_ring_zeit.geojson', False, True)

Output:

Processing muenster_ring_zeit.geojson:   0%|          | 0/1 [00:00<?, ?task/s, ../tests/testdata/geojson/muenster_ring_zeit.geojson]
Processing muenster_ring_zeit.geojson:   0%|          | 0/1 [00:00<?, ?it/s, Temporal extent extracted]

{'format': 'geojson',
 'geoextent_handler': 'handle_vector',
 'tbox': ['2018-11-14', '2018-11-14'],
 'file_size_bytes': 1695}

(source of file muenster_ring_zeit.geojson)

Extracting both bounding box and time interval from a single file

Code:

geoextent.from_file('muenster_ring_zeit.geojson', True, True)

Output:

Processing muenster_ring_zeit.geojson:   0%|          | 0/2 [00:00<?, ?task/s, ../tests/testdata/geojson/muenster_ring_zeit.geojson]
Processing muenster_ring_zeit.geojson:   0%|          | 0/2 [00:00<?, ?it/s, Spatial extent extracted]
Processing muenster_ring_zeit.geojson:   0%|          | 0/2 [00:00<?, ?it/s, Temporal extent extracted]

{'format': 'geojson',
 'geoextent_handler': 'handle_vector',
 'bbox': [51.94881477206191,
  7.6016807556152335,
  51.974624029877454,
  7.647256851196289],
 'crs': '4326',
 'tbox': ['2018-11-14', '2018-11-14'],
 'file_size_bytes': 1695}

(source of file muenster_ring_zeit.geojson)
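
The returned dictionary can be processed like any other Python mapping. The following is a minimal sketch, not part of geoextent, that parses the tbox strings with the documented '%Y-%m-%d' format. The helper name tbox_to_dates is hypothetical, and the sample dictionary is copied from the output above.

```python
from datetime import datetime


def tbox_to_dates(result):
    """Return (start, end) as date objects, or None if no tbox was extracted."""
    tbox = result.get('tbox')
    if not tbox:
        return None
    start, end = (datetime.strptime(value, '%Y-%m-%d').date() for value in tbox)
    return start, end


# Sample result copied from the example output above.
sample = {
    'format': 'geojson',
    'geoextent_handler': 'handle_vector',
    'bbox': [51.94881477206191, 7.6016807556152335,
             51.974624029877454, 7.647256851196289],
    'crs': '4326',
    'tbox': ['2018-11-14', '2018-11-14'],
    'file_size_bytes': 1695,
}

start, end = tbox_to_dates(sample)
# Number of days covered by the interval (0 when start and end coincide).
print(start, end, (end - start).days)
```

A result without a 'tbox' key, for example one produced with time set to False, yields None instead of raising.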

Folders or ZIP file(s)

Geoextent also supports queries for multiple files inside folders or ZIP files.

geoextent.from_directory(input, bbox, time, details)
Parameters:
  • input: a string with the path to a directory or ZIP file

  • bbox: a boolean value to extract spatial extent (bounding box)

  • time: a boolean value to extract temporal extent (at “day” precision, ‘%Y-%m-%d’)

  • details: a boolean value to return details (geoextent) of individual files (default False)

  • workers: number of parallel workers for file extraction (default 1 = sequential, 0 = auto-detect CPU count). Parallel extraction uses threads and helps most with directories containing many files (tens or more), where per-file I/O latency adds up.

The function returns the combined bbox or tbox obtained by merging the results of all individual files (see: Supported file formats) inside the folder or ZIP file. The combined bbox is always reported in the EPSG:4326 coordinate reference system (CRS).
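
Merging can be pictured as taking the union of the per-file extents. The sketch below illustrates that idea under stated assumptions and is not geoextent's implementation: merge_bboxes and merge_tboxes are hypothetical helper names, all boxes are assumed to share one CRS and one axis order, and the input values are made up.

```python
def merge_bboxes(bboxes):
    """Union of boxes given as [min_a, min_b, max_a, max_b] in one shared CRS."""
    if not bboxes:
        return None
    return [
        min(box[0] for box in bboxes),
        min(box[1] for box in bboxes),
        max(box[2] for box in bboxes),
        max(box[3] for box in bboxes),
    ]


def merge_tboxes(tboxes):
    """Union of ['YYYY-MM-DD', 'YYYY-MM-DD'] intervals.

    ISO-formatted dates sort correctly as plain strings, so no parsing is needed.
    """
    if not tboxes:
        return None
    return [min(t[0] for t in tboxes), max(t[1] for t in tboxes)]


# Made-up values for illustration only.
merged_bbox = merge_bboxes([[51.0, 7.0, 52.0, 8.0], [50.5, 7.5, 51.5, 9.0]])
merged_tbox = merge_tboxes([['2018-11-14', '2018-11-14'],
                            ['2017-04-08', '2018-01-31']])
print(merged_bbox)
print(merged_tbox)
```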

Extracting both bounding box and time interval from a folder (with details)

Code:

geoextent.from_directory('folder_one_file', True, True, True)

Output:

Processing directory: folder_one_file:   0%|          | 0/1 [00:00<?, ?item/s, Processing muenster_ring_zeit.geojson]
Merging results: 0it [00:00, ?it/s, folder_one_file]

{'format': 'folder',
 'crs': '4326',
 'bbox': {'type': 'Polygon',
  'coordinates': [[[51.94881477206191, 7.608118057250977],
    [51.953258408047034, 7.602796554565429],
    [51.96537036973145, 7.6016807556152335],
    [51.97361943924433, 7.606401443481445],
    [51.974624029877454, 7.62125015258789],
    [51.97240332571046, 7.636871337890624],
    [51.96817310852836, 7.645368576049805],
    [51.96780294552556, 7.645540237426757],
    [51.96330786509095, 7.6471710205078125],
    [51.95807185013927, 7.647256851196289],
    [51.953258408047034, 7.643308639526367],
    [51.94881477206191, 7.608118057250977]]]},
 'convex_hull': True,
 'tbox': ['2018-11-14', '2018-11-14']}

folder_two_files
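
In the output above, bbox is a polygon (flagged with 'convex_hull': True) rather than a four-value list. If a rectangular envelope is needed, it can be derived from the polygon ring. The sketch below is an illustration only: polygon_envelope is a hypothetical helper, and it keeps the coordinate order exactly as it appears in the returned polygon.

```python
def polygon_envelope(polygon):
    """Return [min_first, min_second, max_first, max_second] over the outer ring."""
    ring = polygon['coordinates'][0]
    firsts = [point[0] for point in ring]
    seconds = [point[1] for point in ring]
    return [min(firsts), min(seconds), max(firsts), max(seconds)]


# Polygon copied from the example output above.
hull = {
    'type': 'Polygon',
    'coordinates': [[[51.94881477206191, 7.608118057250977],
                     [51.953258408047034, 7.602796554565429],
                     [51.96537036973145, 7.6016807556152335],
                     [51.97361943924433, 7.606401443481445],
                     [51.974624029877454, 7.62125015258789],
                     [51.97240332571046, 7.636871337890624],
                     [51.96817310852836, 7.645368576049805],
                     [51.96780294552556, 7.645540237426757],
                     [51.96330786509095, 7.6471710205078125],
                     [51.95807185013927, 7.647256851196289],
                     [51.953258408047034, 7.643308639526367],
                     [51.94881477206191, 7.608118057250977]]],
}

envelope = polygon_envelope(hull)
print(envelope)
```

For this sample, the envelope equals the bbox list returned by from_file for the same GeoJSON file in the single-file examples.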

Remote repositories

Geoextent supports queries for multiple research data repositories including Zenodo, Figshare, Dryad, PANGAEA, OSF, Dataverse, GFZ Data Services, Pensoft, GBIF, SEANOE, DEIMS-SDR, HALO DB, GDI-DE, Arctic Data Center, and TU Dresden Opara.

Geoextent downloads files from the repository and extracts the temporal or geographical extent. The function supports both single identifiers (string) and multiple identifiers (list).

geoextent.from_remote(remote_identifier, bbox, time, details)
Parameters:
  • remote_identifier: a string value with a repository URL or DOI (e.g., https://zenodo.org/record/3528062, https://doi.org/10.5281/zenodo.3528062, 10.5281/zenodo.3528062), or a list of identifiers for multiple resource extraction

  • bbox: a boolean value to extract spatial extent (bounding box)

  • time: a boolean value to extract temporal extent (at “day” precision ‘%Y-%m-%d’)

  • details: a boolean value to return details (geoextent) of individual files (default False)

The function returns the combined bbox or tbox obtained by merging the results of all individual files (see: Supported file formats) inside the repository. The combined bbox is always reported in the EPSG:4326 coordinate reference system (CRS).
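
The identifier forms listed above (repository URL, DOI URL, bare DOI) can be told apart before a call, for example to deduplicate a list of identifiers. The following sketch is an assumption-laden illustration and does not reflect how geoextent resolves identifiers internally; normalize_identifier and DOI_PATTERN are hypothetical names.

```python
import re

# Loose pattern for a bare DOI such as 10.5281/zenodo.3528062 (illustrative only).
DOI_PATTERN = re.compile(r'^10\.\d{4,9}/\S+$')


def normalize_identifier(identifier):
    """Classify an identifier as ('doi', bare_doi) or ('url', original_url)."""
    identifier = identifier.strip()
    for prefix in ('https://doi.org/', 'http://doi.org/', 'doi:'):
        if identifier.lower().startswith(prefix):
            # Strip the resolver prefix so DOI URLs and bare DOIs compare equal.
            identifier = identifier[len(prefix):]
            break
    kind = 'doi' if DOI_PATTERN.match(identifier) else 'url'
    return kind, identifier


for value in ('https://zenodo.org/record/3528062',
              'https://doi.org/10.5281/zenodo.3528062',
              '10.5281/zenodo.3528062'):
    print(normalize_identifier(value))
```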

Single repository extraction

Code:

geoextent.from_remote('https://zenodo.org/record/820562', True, True, False)

Output:

Downloading files:   0%|          | 0.00/14.1M [00:00<?, ?B/s]
Downloading files:  32%|███▏      | 4.52M/14.1M [00:09<00:19, 497kB/s]
Downloading files:  32%|███▏      | 4.52M/14.1M [00:09<00:19, 497kB/s, files=1/6]
Downloading files:  32%|███▏      | 4.52M/14.1M [00:11<00:19, 497kB/s, files=2/6]
Downloading files:  64%|██████▍   | 9.04M/14.1M [00:12<00:06, 790kB/s, files=2/6]
Downloading files:  64%|██████▍   | 9.04M/14.1M [00:12<00:06, 790kB/s, files=3/6]
Downloading files:  81%|████████▏ | 11.4M/14.1M [00:14<00:02, 899kB/s, files=3/6]
Downloading files:  81%|████████▏ | 11.4M/14.1M [00:14<00:02, 899kB/s, files=4/6]
Downloading files:  81%|████████▏ | 11.4M/14.1M [00:19<00:02, 899kB/s, files=5/6]
Downloading files: 100%|██████████| 14.1M/14.1M [00:21<00:00, 610kB/s, files=5/6]
Downloading files: 100%|██████████| 14.1M/14.1M [00:21<00:00, 610kB/s, files=6/6]
Downloading files: 100%|██████████| 14.1M/14.1M [00:21<00:00, 655kB/s, files=6/6]

Processing directory: tmpv_b4mwak:   0%|          | 0/6 [00:00<?, ?item/s]
Processing directory: tmpv_b4mwak:   0%|          | 0/6 [00:00<?, ?item/s, Processing 20160100_Hpakan_20151123_PRE.tif]
Processing directory: tmpv_b4mwak:  17%|█▋        | 1/6 [00:00<00:00, 97.14item/s, Processing 20160100_Hpakan_20160322_POST.tif]
Processing directory: tmpv_b4mwak:  33%|███▎      | 2/6 [00:00<00:00, 154.61item/s, Processing 20160100_Hpakan_20151123_PRE.png]
Error for /tmp/tmpv_b4mwak/20160100_Hpakan_20151123_PRE.pngw extracting bbox:
The csv file from /tmp/tmpv_b4mwak/20160100_Hpakan_20151123_PRE.pngw has no BoundingBox
Error extracting tbox, time format not found 
 The csv file from /tmp/tmpv_b4mwak/20160100_Hpakan_20151123_PRE.pngw has no TemporalExtent:
Processing directory: tmpv_b4mwak:  50%|█████     | 3/6 [00:00<00:00, 157.19item/s, Processing 20160100_Hpakan_20151123_PRE.pngw]
Error for /tmp/tmpv_b4mwak/20160100_Hpakan_20160322_POST.pngw extracting bbox:
The csv file from /tmp/tmpv_b4mwak/20160100_Hpakan_20160322_POST.pngw has no BoundingBox
Error extracting tbox, time format not found 
 The csv file from /tmp/tmpv_b4mwak/20160100_Hpakan_20160322_POST.pngw has no TemporalExtent:
Processing directory: tmpv_b4mwak:  67%|██████▋   | 4/6 [00:00<00:00, 174.86item/s, Processing 20160100_Hpakan_20160322_POST.pngw]
Processing directory: tmpv_b4mwak:  83%|████████▎ | 5/6 [00:00<00:00, 193.15item/s, Processing 20160100_Hpakan_20160322_POST.png]

Merging results: 0it [00:00, ?it/s, tmpv_b4mwak]

{'format': 'remote',
 'crs': '4326',
 'bbox': [25.558346194400002,
  96.21146318274846,
  25.632931128800003,
  96.35495081696702]}

Multiple repositories

Extract from multiple repositories in a single call:

identifiers = [
    '10.5281/zenodo.4593540',
    '10.25532/OPARA-581',
    'https://osf.io/abc123/'
]

geoextent.from_remote(identifiers, True, True, True)

The function returns a merged bounding box covering all resources (similar to directory extraction), plus extraction metadata with success/failure tracking. Individual resource details are available in the details field for diagnostics.

See Advanced Features for detailed documentation on multiple resource extraction features and return structure.

Download size limits

Use the max_download_size parameter to limit how much data geoextent downloads from a remote repository. The value is a human-friendly size string parsed by filesizelib (e.g. '100MB', '2GB', '500KB', '10MiB', '0.5GiB'):

# Limit download to 20 MB
geoextent.from_remote('10.23728/b2share.26jnj-a4x24', bbox=True, tbox=True,
                      max_download_size='20MB')

# Limit GBIF DwC-A download to 500 MB
geoextent.from_remote('10.15468/6bleia', bbox=True, tbox=True,
                      max_download_size='500MB')

When the combined file sizes exceed the limit, the default behavior (API) is to silently select a subset using the max_download_method strategy ('ordered' by default, or 'random' with a reproducible seed via max_download_method_seed).
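
To make the two strategies concrete, the sketch below shows one way a subset could be chosen under a byte limit. It is an illustration, not geoextent's implementation: select_within_limit is a hypothetical helper, the file names and sizes are made up, and stopping at the first file that would exceed the limit is an assumption about the behaviour.

```python
import random


def select_within_limit(files, max_bytes, method='ordered', seed=None):
    """Pick (name, size) pairs until the next one would exceed max_bytes."""
    candidates = list(files)
    if method == 'random':
        # A seeded Random instance makes the shuffled order reproducible.
        random.Random(seed).shuffle(candidates)
    elif method != 'ordered':
        raise ValueError(f'unknown method: {method!r}')

    selected, total = [], 0
    for name, size in candidates:
        if total + size > max_bytes:
            break  # assumption: stop at the first file that does not fit
        selected.append((name, size))
        total += size
    return selected


# Made-up file listing: (name, size in bytes).
listing = [('a.tif', 5), ('b.tif', 4), ('c.tif', 3)]
print(select_within_limit(listing, max_bytes=10))
print(select_within_limit(listing, max_bytes=10, method='random', seed=42))
```

With the same seed, the 'random' variant returns the same subset on every run, which is the point of a reproducible seed.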

Download size soft limit

Set download_size_soft_limit=True to raise a DownloadSizeExceeded exception instead of silently truncating the file list. This is what the CLI uses to prompt the user for confirmation, and is available for all providers whose APIs report file sizes:

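import geoextent.lib.extent as geoextent  # alias assumed from the geoextent.from_* calls on this page
from geoextent.lib.exceptions import DownloadSizeExceeded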

try:
    result = geoextent.from_remote('10.5281/zenodo.820562', bbox=True,
                                   max_download_size='1MB',
                                   download_size_soft_limit=True)
except DownloadSizeExceeded as exc:
    print(f"Download is {exc.estimated_size:,} bytes "
          f"(limit: {exc.max_size:,} bytes, provider: {exc.provider})")
    # Retry with a larger limit
    result = geoextent.from_remote('10.5281/zenodo.820562', bbox=True,
                                   max_download_size=f'{exc.estimated_size + 1}B',
                                   download_size_soft_limit=True)

The exception carries three attributes:

  • exc.estimated_size — total available download size in bytes

  • exc.max_size — the size limit that was exceeded, in bytes

  • exc.provider — name of the provider (e.g. "Zenodo", "GBIF")

GBIF DwC-A soft limit. GBIF datasets with Darwin Core Archive downloads have an additional built-in 1 GB soft limit that is always active (regardless of download_size_soft_limit).

Note

The soft limit relies on providers reporting file sizes in their API metadata before download. Metadata-only providers (DEIMS-SDR, HALO DB, Wikidata, Pensoft) do not download data files, so the size limit does not apply. A warning is logged when max_download_size is configured but the provider cannot enforce it.

To avoid the size check entirely, use download_data=False for metadata-only extraction:

# Fast, no download — uses provider API metadata
result = geoextent.from_remote('10.15468/6bleia', bbox=True, tbox=True,
                               download_data=False)

Progress callbacks

All three public API functions (from_file, from_directory, from_remote) accept a progress_callback parameter for structured progress reporting. This is useful for web applications, Jupyter notebooks, and other programmatic consumers that need to display progress without depending on tqdm.

The callback receives ProgressEvent instances – frozen dataclasses describing what geoextent is doing at each step.

Quick start

from geoextent.lib.progress import CollectingProgressCallback
from geoextent.lib import extent

cb = CollectingProgressCallback()
result = extent.from_file(
    'data.tif',
    bbox=True,
    tbox=True,
    progress_callback=cb,
)

for event in cb.events:
    print(f'{event.phase.value}: {event.message} [{event.current}/{event.total}]')

Output:

process_file: Processing data.tif [0/2]
spatial: Processing data.tif [1/2]
temporal: Processing data.tif [2/2]

ProgressEvent

Each event is a frozen (immutable) dataclass with these fields:

Field          Type           Description
-------------  -------------  -----------
phase          ProgressPhase  Which processing phase emitted this event (see table below).
message        str            Human-readable description (e.g. "Processing directory: mydata").
current        int            Current step number (0 when phase starts).
total          int            Total number of steps (0 if unknown).
detail         str | None     Optional extra context (filename, provider name, etc.).
bytes_current  int            Bytes processed so far (download phase only).
bytes_total    int            Total bytes to download (download phase only).

Two computed properties are available:

  • event.fraction – progress as a float in [0.0, 1.0], or -1.0 if indeterminate (total <= 0).

  • event.is_indeterminate – True when total is unknown.
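
The fields and computed properties described above can be modelled with a small frozen dataclass. This is a stand-in written for illustration, not the class from geoextent.lib.progress: the name SketchProgressEvent is hypothetical, phase is a plain string here instead of a ProgressPhase member, the default values are assumptions, and clamping the fraction to [0.0, 1.0] is an assumption about how the documented range is enforced.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SketchProgressEvent:
    """Illustrative stand-in for the documented ProgressEvent fields."""
    phase: str
    message: str
    current: int = 0
    total: int = 0
    detail: Optional[str] = None
    bytes_current: int = 0
    bytes_total: int = 0

    @property
    def is_indeterminate(self) -> bool:
        # The total is treated as unknown when it is zero or negative.
        return self.total <= 0

    @property
    def fraction(self) -> float:
        if self.is_indeterminate:
            return -1.0
        # Clamp so the value always stays within [0.0, 1.0].
        return min(max(self.current / self.total, 0.0), 1.0)


halfway = SketchProgressEvent('spatial', 'Processing data.tif', current=1, total=2)
unknown = SketchProgressEvent('merge', 'Merging results')
print(halfway.fraction, halfway.is_indeterminate)
print(unknown.fraction, unknown.is_indeterminate)
```

Because the dataclass is frozen, assigning to a field after construction raises dataclasses.FrozenInstanceError, which mirrors the immutability described for the real events.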

ProgressPhase

Events are tagged with a phase indicating which part of the pipeline emitted them:

Phase         Emitted by      Description
------------  --------------  -----------
PROCESS_FILE  from_file       Starting to process a single file.
SPATIAL       from_file       Spatial extent extraction completed for a file.
TEMPORAL      from_file       Temporal extent extraction completed for a file.
PROCESS_DIR   from_directory  Processing the n-th item in a directory. current/total track progress.
MERGE         from_directory  Merging individual file results into a combined extent.
RESOLVE       from_remote     A content provider has been identified for the remote identifier.
DOWNLOAD      from_remote     Downloading files from a remote repository. bytes_current/bytes_total track byte-level progress.
EXTRACT                       Extracting an archive.
PLACENAME                     Reverse-geocoding coordinates to a placename.

Built-in callbacks

Three callback implementations are provided in geoextent.lib.progress:

CollectingProgressCallback – appends every event to a list. Useful for testing and post-hoc analysis.

from geoextent.lib import extent
from geoextent.lib.progress import CollectingProgressCallback

cb = CollectingProgressCallback()
result = extent.from_directory('mydata/', bbox=True, progress_callback=cb)
print(f'{len(cb.events)} events captured')

LoggingProgressCallback – logs each event to the geoextent logger. The log level is configurable (default INFO).

from geoextent.lib import extent
from geoextent.lib.progress import LoggingProgressCallback

cb = LoggingProgressCallback()  # or LoggingProgressCallback(level=logging.DEBUG)
result = extent.from_file('data.shp', bbox=True, progress_callback=cb)

TqdmProgressCallback – renders tqdm progress bars, one per phase. This is what geoextent uses internally when show_progress=True and no callback is provided.

from geoextent.lib import extent
from geoextent.lib.progress import TqdmProgressCallback

cb = TqdmProgressCallback(leave=True)  # leave=True keeps bars on screen
result = extent.from_directory('mydata/', bbox=True, progress_callback=cb)
cb.close()  # close any open bars

Writing a custom callback

A callback is any callable that accepts a single ProgressEvent argument. Here is an example that pushes progress to a web API:

import requests
from geoextent.lib import extent
from geoextent.lib.progress import ProgressEvent

def webhook_callback(event: ProgressEvent) -> None:
    requests.post('https://example.com/progress', json={
        'phase': event.phase.value,
        'message': event.message,
        'fraction': event.fraction,
        'detail': event.detail,
    }, timeout=5)

result = extent.from_remote(
    '10.5281/zenodo.820562',
    bbox=True,
    tbox=True,
    progress_callback=webhook_callback,
)

Here is an example that updates a Jupyter notebook widget:

import ipywidgets as widgets
from IPython.display import display
from geoextent.lib import extent
from geoextent.lib.progress import ProgressEvent

progress_bar = widgets.FloatProgress(min=0, max=1, description='Extracting...')
status_label = widgets.Label()
display(widgets.HBox([progress_bar, status_label]))

def jupyter_callback(event: ProgressEvent) -> None:
    if not event.is_indeterminate:
        progress_bar.value = event.fraction
    status_label.value = event.message

result = extent.from_directory(
    'mydata/',
    bbox=True,
    progress_callback=jupyter_callback,
)
progress_bar.value = 1.0
status_label.value = 'Done'
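
Since any callable qualifies, a callback can also be an object that keeps state between events. The sketch below counts events per phase and remembers the latest message. PhaseSummaryCallback is a hypothetical name, and the events are simulated with stand-in objects (FakePhase, SimpleNamespace) so that the example runs without geoextent installed; only the phase.value and message attributes documented above are used.

```python
from collections import defaultdict
from enum import Enum
from types import SimpleNamespace


class PhaseSummaryCallback:
    """Callable that tallies events per phase and keeps the latest message."""

    def __init__(self):
        self.counts = defaultdict(int)
        self.last_message = {}

    def __call__(self, event):
        phase = event.phase.value
        self.counts[phase] += 1
        self.last_message[phase] = event.message


class FakePhase(Enum):
    """Stand-in for ProgressPhase, limited to two phases for the demo."""
    SPATIAL = 'spatial'
    TEMPORAL = 'temporal'


summary = PhaseSummaryCallback()
# Simulated events; in real use geoextent would call summary(event) itself.
summary(SimpleNamespace(phase=FakePhase.SPATIAL, message='Processing a.tif'))
summary(SimpleNamespace(phase=FakePhase.SPATIAL, message='Processing b.tif'))
summary(SimpleNamespace(phase=FakePhase.TEMPORAL, message='Processing a.tif'))

print(dict(summary.counts))
print(summary.last_message)
```

In real use, an instance would be passed as progress_callback=summary to one of the three API functions.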

Interaction with show_progress

  • When progress_callback is provided, geoextent automatically suppresses internal tqdm bars (equivalent to show_progress=False) to avoid duplicate output.

  • When progress_callback is None and show_progress=True (the default), geoextent auto-creates a TqdmProgressCallback internally for backward compatibility. The CLI uses this path.

  • To disable all progress output, pass show_progress=False and omit progress_callback.
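
The three rules above can be summarised as a small decision function. This is a sketch of the documented behaviour only, not geoextent's internal code; resolve_progress_mode and the returned labels are hypothetical.

```python
def resolve_progress_mode(progress_callback=None, show_progress=True):
    """Return which progress reporting path applies for the given arguments."""
    if progress_callback is not None:
        # A user-supplied callback wins; internal tqdm bars are suppressed.
        return 'callback'
    if show_progress:
        # No callback given: tqdm bars are created automatically.
        return 'tqdm'
    # Neither a callback nor progress bars: no progress output at all.
    return 'silent'


print(resolve_progress_mode())
print(resolve_progress_mode(progress_callback=print))
print(resolve_progress_mode(show_progress=False))
```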