Content Providers¶
Geoextent supports extracting geospatial extents from 33 research data repositories (including 10 Dataverse instances), Wikidata, any STAC catalog, any CKAN instance, GitHub and GitLab repositories (including self-hosted GitLab instances), and the Software Heritage archive. All providers support URL-based extraction and return merged geometries when processing multiple resources.
Overview¶
All content providers support:
DOI-based extraction - Use DOIs directly or via resolver URLs
URL-based extraction - Use direct repository URLs
Merged geometry output - Multiple resources combined into single extent
Download size limiting - Control bandwidth with --max-download-size
File filtering - Skip non-geospatial files with --download-skip-nogeo
Parallel downloads - Speed up multi-file downloads with --max-download-workers
Metadata-first strategy - Try metadata extraction first, fall back to data download with --metadata-first
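The merged-geometry behavior amounts to a component-wise min/max over per-resource bounding boxes. A minimal sketch (the `merge_bboxes` helper and the [minx, miny, maxx, maxy] ordering are assumptions for illustration, not geoextent's internal API):

```python
def merge_bboxes(bboxes):
    """Merge several [minx, miny, maxx, maxy] boxes into one covering extent."""
    minx = min(b[0] for b in bboxes)
    miny = min(b[1] for b in bboxes)
    maxx = max(b[2] for b in bboxes)
    maxy = max(b[3] for b in bboxes)
    return [minx, miny, maxx, maxy]

# Two per-resource extents combined into a single merged extent
print(merge_bboxes([[13.1, 52.3, 13.8, 52.7], [2.2, 48.8, 2.5, 48.9]]))
# [2.2, 48.8, 13.8, 52.7]
```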
Metadata-First Extraction¶
Some providers (Arctic Data Center, DataONE, Figshare, 4TU.ResearchData, Senckenberg, PANGAEA, BGR, SEANOE, GeoScienceWorld, UKCEH, GBIF, DEIMS-SDR, NFDI4Earth, HALO DB, GDI-DE, STAC, CKAN, Wikidata) can extract geospatial extents directly from repository metadata without downloading data files. The --metadata-first flag leverages this for a smart two-phase strategy:
Phase 1 (metadata): If the provider supports metadata extraction, try metadata-only extraction first (fast, no file downloads).
Phase 2 (fallback): If metadata didn’t yield the requested extents, or if the provider doesn’t support metadata, fall back to downloading and processing data files.
This is especially useful when processing multiple providers in batch:
# Senckenberg has metadata → uses metadata (fast); Zenodo has no metadata → downloads data
python -m geoextent -b --metadata-first 10.12761/sgn.2018.10225 10.5281/zenodo.4593540
import geoextent.lib.extent as geoextent

result = geoextent.from_remote(
    '10.12761/sgn.2018.10225',
    bbox=True, metadata_first=True
)
print(result['extraction_method'])  # 'metadata' or 'download'
The result includes an extraction_method field indicating which strategy was used: "metadata" (fast, from repository metadata) or "download" (full data download and extraction).
Note: --metadata-first and --no-download-data are mutually exclusive. Use --no-download-data if you want metadata-only extraction without any fallback.
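The two-phase logic described above can be sketched as follows. The provider interface shown here (`supports_metadata`, `extract_from_metadata`, `extract_from_download`) is hypothetical, for illustration only; only the `extraction_method` values come from the documentation:

```python
def metadata_first_extract(provider, identifier, want_bbox=True, want_tbox=False):
    """Sketch of the --metadata-first strategy (hypothetical provider interface)."""
    # Phase 1: metadata-only extraction, if the provider supports it
    if provider.supports_metadata:
        result = provider.extract_from_metadata(identifier)
        complete = (result is not None
                    and (not want_bbox or "bbox" in result)
                    and (not want_tbox or "tbox" in result))
        if complete:
            result["extraction_method"] = "metadata"
            return result
    # Phase 2: fall back to downloading and processing the data files
    result = provider.extract_from_download(identifier)
    result["extraction_method"] = "download"
    return result
```

A provider that cannot satisfy all requested extents from metadata falls through to phase 2, which matches the two `extraction_method` values described above.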
Automatic Metadata Fallback¶
When downloading data files from a provider, some repositories may have files disabled or unavailable (e.g., GEO Knowledge Hub packages with "files": {"enabled": false}). In these cases, the download succeeds but yields an empty folder, and no spatial extent can be extracted.
By default, geoextent automatically detects this situation and falls back to metadata-only extraction if the provider supports it. This happens transparently without any user action required.
# GKHub package with files disabled -- automatically uses metadata fallback
python -m geoextent -b https://gkhub.earthobservations.org/packages/msaw9-hzd25
import geoextent.lib.extent as geoextent

result = geoextent.from_remote(
    'https://gkhub.earthobservations.org/packages/msaw9-hzd25',
    bbox=True
)
print(result['extraction_method'])  # 'metadata_fallback'
The result includes extraction_method: "metadata_fallback" to indicate that the automatic fallback was used.
To disable this behavior, use --no-metadata-fallback on the CLI or metadata_fallback=False in the Python API:
python -m geoextent -b --no-metadata-fallback https://gkhub.earthobservations.org/packages/msaw9-hzd25
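The detection behind this fallback can be sketched as a check for an empty download directory (the helper name is hypothetical; geoextent's actual implementation may differ):

```python
import os

def needs_metadata_fallback(download_dir):
    """Return True when a download produced no files at all, e.g. a record
    published with "files": {"enabled": false}. The caller would then retry
    with metadata-only extraction if the provider supports it."""
    for _root, _dirs, files in os.walk(download_dir):
        if files:
            return False
    return True
```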
Quick Reference¶
Provider Details¶
Zenodo¶
Description: Free and open digital archive built by CERN and OpenAIRE for sharing research output in any format. Supports all research disciplines with unlimited storage and preservation guarantees.
Website: https://zenodo.org/
DOI Prefix: 10.5281/zenodo
Supported Identifier Formats:
DOI: 10.5281/zenodo.4593540
DOI URL: https://doi.org/10.5281/zenodo.4593540
Zenodo URL: https://zenodo.org/record/4593540
Example:
python -m geoextent -b -t 10.5281/zenodo.4593540
Special Notes:
Supports download size limiting and file filtering
Parallel downloads supported
Handles both individual files and complete record archives
Figshare¶
Description: Online open access repository for preserving and sharing research outputs with DOI assignment and altmetrics. Provides 20GB free private space and unlimited public sharing. Figshare also powers many institutional research data portals.
Website: https://figshare.com/
DOI Prefix: 10.6084/m9.figshare
Supported Identifier Formats:
DOI: 10.6084/m9.figshare.12345678
DOI URL: https://doi.org/10.6084/m9.figshare.12345678
Figshare URL: https://figshare.com/articles/dataset/title/12345678
Institutional portal URL: https://springernature.figshare.com/articles/dataset/title/12345678
Institutional portal URL: https://ices-library.figshare.com/articles/dataset/title/12345678
API URL: https://api.figshare.com/v2/articles/12345678
Example (Data Download):
# Download data files and extract spatial extent from their contents
python -m geoextent -b -t https://figshare.com/articles/dataset/London_boroughs/11373984
# Institutional portal (ICES Library - shapefiles archive)
python -m geoextent -b https://ices-library.figshare.com/articles/dataset/HELCOM_request_2022_for_spatial_data_layers_on_effort_fishing_intensity_and_fishing_footprint_for_the_years_2016-2021/20310255
Example (Metadata Only):
# Extract temporal extent from repository metadata without downloading data files
python -m geoextent -b -t --no-download-data https://figshare.com/articles/dataset/Country_centroids/5902369
# USDA Ag Data Commons - has geospatial metadata (GeoJSON in custom fields)
python -m geoextent -b --no-download-data https://api.figshare.com/v2/articles/30753383
Python API Examples:
import geoextent.lib.extent as geoextent

# Data download mode: downloads files and extracts extent from file contents
result = geoextent.from_remote(
    'https://figshare.com/articles/dataset/London_boroughs/11373984',
    bbox=True, tbox=True, download_data=True
)

# Metadata-only mode: uses published_date for temporal extent
result = geoextent.from_remote(
    'https://figshare.com/articles/dataset/Country_centroids/5902369',
    bbox=True, tbox=True, download_data=False
)

# Metadata-first strategy: tries metadata first, falls back to data download
result = geoextent.from_remote(
    'https://figshare.com/articles/dataset/Country_centroids/5902369',
    bbox=True, tbox=True, metadata_first=True
)
Special Notes:
Full support for size limiting and file filtering
API-based file metadata retrieval
Supports both private and public datasets (only public datasets are accessible without authentication)
Supports --no-download-data for metadata-only extraction (temporal extent from published_date; spatial extent available when portals provide geolocation metadata)
Supports --metadata-first strategy for smart metadata-then-download extraction
Recognizes institutional portal URLs (*.figshare.com), e.g. springernature.figshare.com, ices-library.figshare.com
Some institutional portals (e.g. USDA Ag Data Commons) provide rich geospatial metadata including GeoJSON coverage polygons in custom_fields
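Extracting a bounding box from such a GeoJSON coverage polygon can be sketched as follows (the helper and the sample payload are illustrative, not Figshare's actual field layout):

```python
import json

def bbox_from_geojson_polygon(geojson_str):
    """Compute [minx, miny, maxx, maxy] from a GeoJSON Polygon string,
    as a portal might embed in a custom metadata field."""
    geom = json.loads(geojson_str)
    coords = geom["coordinates"][0]  # exterior ring
    xs = [p[0] for p in coords]
    ys = [p[1] for p in coords]
    return [min(xs), min(ys), max(xs), max(ys)]

sample = '{"type": "Polygon", "coordinates": [[[-100, 30], [-90, 30], [-90, 40], [-100, 40], [-100, 30]]]}'
print(bbox_from_geojson_polygon(sample))  # [-100, 30, -90, 40]
```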
4TU.ResearchData¶
Description: Research data repository of the four Dutch Universities of Technology (TU Delft, TU Eindhoven, University of Twente, Wageningen University & Research). Based on the open-source Djehuty platform with a Figshare-compatible API. Supports both metadata-only and full data download extraction.
Website: https://data.4tu.nl/
DOI Prefix: 10.4121
Supported Identifier Formats:
DOI (legacy): 10.4121/uuid:8ce9d22a-9aa4-41ea-9299-f44efa9c8b75
DOI (new-style): 10.4121/19361018.v2
DOI URL: https://doi.org/10.4121/uuid:8ce9d22a-9aa4-41ea-9299-f44efa9c8b75
Dataset URL (new): https://data.4tu.nl/datasets/61e28011-f96d-4b01-900e-15145b77ee59/2
Article URL (legacy): https://data.4tu.nl/articles/_/12707150/1
Example (Data Download):
# Download data files and extract spatial extent from their contents
python -m geoextent -b -t https://data.4tu.nl/articles/_/12707150/1
python -m geoextent -b https://data.4tu.nl/datasets/3035126d-ee51-4dbd-a187-5f6b0be85e9f/1
Example (Metadata Only):
# Extract extent from repository metadata without downloading data files
python -m geoextent -b --no-download-data https://data.4tu.nl/articles/_/12707150/1
python -m geoextent -b --no-download-data https://data.4tu.nl/datasets/3035126d-ee51-4dbd-a187-5f6b0be85e9f/1
Python API Examples:
import geoextent.lib.extent as geoextent

# Data download mode: downloads files and extracts extent from file contents
result = geoextent.from_remote(
    'https://data.4tu.nl/articles/_/12707150/1',
    bbox=True, tbox=False, download_data=True
)

# Metadata-only mode: uses repository metadata (no file download)
result = geoextent.from_remote(
    'https://data.4tu.nl/articles/_/12707150/1',
    bbox=True, tbox=True, download_data=False
)
Special Notes:
Uses a Figshare-compatible API (Djehuty platform) but with its own domain and DOI prefix
Handles both new-style UUID identifiers and legacy numeric article IDs
Supports --no-download-data for metadata-only extraction (limited spatial information from repository metadata)
Full support for download size limiting (--max-download-size), geospatial file filtering (--download-skip-nogeo), and parallel downloads (--max-download-workers)
Dryad¶
Description: Nonprofit curated repository specializing in data underlying scientific publications with CC0 licensing. Focuses on data reusability and long-term preservation via the Merritt Repository.
Website: https://datadryad.org/
DOI Prefix: 10.5061/dryad
Supported Identifier Formats:
DOI: 10.5061/dryad.0k6djhb7x
DOI URL: https://doi.org/10.5061/dryad.0k6djhb7x
Dryad URL: https://datadryad.org/stash/dataset/doi:10.5061/dryad.0k6djhb7x
Example:
python -m geoextent -b -t 10.5061/dryad.0k6djhb7x
Special Notes:
Intelligent file vs. ZIP archive download selection
Full filtering and size limiting support
Handles nested ZIP files efficiently
PANGAEA¶
Description: Digital data library and publisher for earth system science with over 375,000 georeferenced datasets. Specialized in geosciences, environmental, and climate research with extensive metadata.
Website: https://www.pangaea.de/
DOI Prefix: 10.1594/PANGAEA
Supported Identifier Formats:
DOI: 10.1594/PANGAEA.734969
DOI URL: https://doi.org/10.1594/PANGAEA.734969
PANGAEA URL: https://pangaea.de/doi:10.1594/PANGAEA.734969
Example:
python -m geoextent -b -t 10.1594/PANGAEA.734969
Special Notes:
Often includes rich geospatial metadata in repository records
Supports --no-download-data for metadata-only extraction
Specialized in Earth science datasets
OSF¶
Description: Free open-source project management tool by the Center for Open Science for collaborative research workflows. Supports data storage, version control, and research lifecycle management.
Website: https://osf.io/
DOI Prefix: 10.17605/OSF.IO
Supported Identifier Formats:
DOI: 10.17605/OSF.IO/ABC123
DOI URL: https://doi.org/10.17605/OSF.IO/ABC123
OSF URL: https://osf.io/abc123/
Short ID: abc123
Example:
python -m geoextent -b https://osf.io/4xe6z/
Special Notes:
Full filtering and size limiting capabilities
Handles project storage and individual components
Supports file versioning
Dataverse¶
Description: Open-source web application from Harvard University for sharing and preserving research data across disciplines. Supports institutional repositories with customizable metadata schemas.
Website: https://dataverse.org/
DOI Prefix: Varies by Dataverse instance
Supported Dataverse Instances:
| Instance | Host | DOI Prefix |
|---|---|---|
| Harvard Dataverse | dataverse.harvard.edu | 10.7910/DVN |
| DataverseNL | dataverse.nl | 10.34894 |
| DataverseNO | dataverse.no | 10.18710 |
| UNC Dataverse | dataverse.unc.edu | 10.5064 |
| UVA Library Dataverse | data.library.virginia.edu | (varies) |
| Recherche Data Gouv | recherche.data.gouv.fr | (varies) |
| ioerDATA | data.fdz.ioer.de | 10.71830 |
| heiDATA | heidata.uni-heidelberg.de | 10.11588/DATA |
| Edmond | edmond.mpg.de | 10.17617 |
| Demo DataverseNL | demo.dataverse.nl | (varies) |
Supported Identifier Formats:
DOI: 10.7910/DVN/ABCDEF
DOI URL: https://doi.org/10.7910/DVN/ABCDEF
Dataverse URL: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/ABCDEF
Example:
python -m geoextent -b -t 10.7910/DVN/ABCDEF
Special Notes:
Supports 10 Dataverse instances (see table above)
Automatically skips restricted files that require authentication
Handles complex dataset structures
API-based metadata and file retrieval
ioerDATA¶
Description: Research data repository of the Leibniz Institute of Ecological Urban and Regional Development (IOER), hosted on Dataverse. Specializes in urban and regional development, land use monitoring, and spatial analysis data for Germany and Europe.
Website: https://data.fdz.ioer.de/
DOI Prefix: 10.71830
Supported Identifier Formats:
DOI: 10.71830/VDMUWW
DOI URL: https://doi.org/10.71830/VDMUWW
ioerDATA URL: https://data.fdz.ioer.de/dataset.xhtml?persistentId=doi:10.71830/VDMUWW
Example:
python -m geoextent -b 10.71830/VDMUWW
Special Notes:
Standard Dataverse instance (uses Dataverse provider internally)
Some datasets have restricted files requiring authentication; these are automatically skipped
Specializes in German urban/regional development and land use data
Uses the same Dataverse API as all other Dataverse instances
heiDATA¶
Description: Research data repository of Heidelberg University, hosted on Dataverse. Part of the NFDI4Earth initiative. Provides access to research data across multiple disciplines, with a focus on geosciences, environmental science, and digital humanities.
Website: https://heidata.uni-heidelberg.de/
DOI Prefix: 10.11588/DATA
Supported Identifier Formats:
DOI: 10.11588/DATA/TJNQZG
DOI URL: https://doi.org/10.11588/DATA/TJNQZG
heiDATA URL: https://heidata.uni-heidelberg.de/dataset.xhtml?persistentId=doi:10.11588/DATA/TJNQZG
Example:
python -m geoextent -b 10.11588/DATA/TJNQZG
Special Notes:
Standard Dataverse instance (uses Dataverse provider internally)
Has the NFDI4Earth Label for geoscience data
Supports both open access and restricted datasets
Uses the same Dataverse API as all other Dataverse instances
Edmond¶
Description: Research data repository of the Max Planck Society, hosted on Dataverse. Provides open access to research data from Max Planck Institutes across all scientific disciplines, including earth sciences, chemistry, and biogeochemistry.
Website: https://edmond.mpg.de/
DOI Prefix: 10.17617
Supported Identifier Formats:
DOI: 10.17617/3.QZGTDU
DOI URL: https://doi.org/10.17617/3.QZGTDU
Edmond URL: https://edmond.mpg.de/dataset.xhtml?persistentId=doi:10.17617/3.QZGTDU
Example:
python -m geoextent -b 10.17617/3.QZGTDU
Special Notes:
Standard Dataverse instance (uses Dataverse provider internally)
Hosts data from Max Planck Institutes worldwide
Uses the same Dataverse API as all other Dataverse instances
GFZ Data Services¶
Description: Curated repository for the geosciences domain hosted at the GFZ German Research Centre for Geosciences in Potsdam. Specialized in Earth observation, geophysics, and geoscience research data.
Website: https://dataservices.gfz-potsdam.de/
DOI Prefix: 10.5880/GFZ
Supported Identifier Formats:
DOI: 10.5880/GFZ.2.1.2020.001
DOI URL: https://doi.org/10.5880/GFZ.2.1.2020.001
GFZ URL: https://dataservices.gfz-potsdam.de/panmetaworks/showshort.php?id=...
Example:
python -m geoextent -b -t 10.5880/GFZ.2.1.2020.001
Special Notes:
Specialized in geoscience datasets
Comprehensive metadata for spatial datasets
Long-term preservation guarantees
Pensoft¶
Description: Scholarly publisher from Bulgaria specializing in biodiversity with 60+ open access journals. Integrates data publishing with manuscript publication for transparent research.
Website: https://pensoft.net/
DOI Prefix: 10.3897
Supported Identifier Formats:
DOI: 10.3897/BDJ.13.e159973
DOI URL: https://doi.org/10.3897/BDJ.13.e159973
Example:
python -m geoextent -b -t 10.3897/BDJ.13.e159973
Special Notes:
Specialized in biodiversity and ecological data
Links data directly to publications
Handles occurrence data and species distributions
OPARA¶
Description: Open Access Repository and Archive for research data of Saxon universities with a 10-year archiving guarantee. Built on DSpace 7.x with comprehensive metadata management.
Website: https://opara.zih.tu-dresden.de/
DOI Prefix: 10.25532/OPARA
Supported Identifier Formats:
DOI: 10.25532/OPARA-581
DOI URL: https://doi.org/10.25532/OPARA-581
Handle URL: https://opara.zih.tu-dresden.de/xmlui/handle/123456789/123
Item URL: https://opara.zih.tu-dresden.de/xmlui/handle/123456789/123
UUID: a1b2c3d4-e5f6-7890-abcd-ef1234567890
Example:
python -m geoextent -b -t 10.25532/OPARA-581
Special Notes:
Full DSpace 7.x REST API integration
Handles complex ZIP archives with nested directories
Supports multiple shapefiles in single archive
Size filtering and geospatial file filtering fully supported
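Handling nested archives like OPARA's can be sketched with the standard zipfile module. This is an illustrative helper, not geoextent's implementation, and the extension list is a small assumed subset:

```python
import io
import zipfile

GEO_EXTS = (".shp", ".geojson", ".gpkg", ".tif")  # assumed subset of geospatial extensions

def list_geo_members(zip_bytes, prefix=""):
    """Recursively list geospatial members of a ZIP, descending into nested ZIPs."""
    found = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            if name.lower().endswith(GEO_EXTS):
                found.append(prefix + name)
            elif name.lower().endswith(".zip"):
                # Read the inner archive into memory and recurse
                found.extend(list_geo_members(zf.read(name), prefix + name + "/"))
    return found
```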
Senckenberg¶
Description: CKAN-based data portal of the Senckenberg Biodiversity and Climate Research Centre providing access to biodiversity, climate, and geoscience research data. Primarily a metadata repository with rich geospatial and temporal metadata but limited or restricted data files.
Website: https://dataportal.senckenberg.de/
DOI Prefix: 10.12761/sgn
Supported Identifier Formats:
DOI: 10.12761/sgn.2018.10268
DOI URL: https://doi.org/10.12761/sgn.2018.10268
Dataset URL: https://dataportal.senckenberg.de/dataset/as-sahabi-1
Dataset ID (name slug): as-sahabi-1
Dataset ID (UUID): 00dda005-68c0-4e92-96e5-ceb68034f3ba
JSON-LD URL: https://dataportal.senckenberg.de/dataset/as-sahabi-1.jsonld
Example (Recommended - Metadata Only):
# Extract spatial and temporal extent from metadata
python -m geoextent -b -t --no-download-data 10.12761/sgn.2018.10268
Output: Bounding box for Ecuador region and temporal extent from 2014-05-01 to 2015-12-30
Special Notes:
Best Practice: Always use --no-download-data for metadata-only extraction
Built on CKAN (Comprehensive Knowledge Archive Network) platform
Extracts both spatial extent (bounding box) and temporal extent (date ranges) from metadata
Supports both open access and metadata-only restricted datasets
Rich taxonomic, spatial, and temporal coverage metadata
Metadata extraction is fast and does not require downloading data files
Full filtering and size limiting capabilities available when data files exist
Mendeley Data¶
Description: Elsevier-hosted generalist research data repository and part of the NIH Generalist Repository Ecosystem Initiative (GREI). Supports sharing, discovering, and citing research data across all disciplines with DOI assignment.
Website: https://data.mendeley.com/
DOI Prefix: 10.17632
Supported Identifier Formats:
DOI: 10.17632/ybx6zp2rfp.1
DOI URL: https://doi.org/10.17632/ybx6zp2rfp.1
Mendeley Data URL: https://data.mendeley.com/datasets/ybx6zp2rfp/1
Example:
python -m geoextent -b 10.17632/ybx6zp2rfp.1
Special Notes:
Uses unauthenticated public API (no OAuth tokens required)
No geospatial metadata available; requires downloading data files for extent extraction
Full support for download size limiting and geospatial file filtering
Parallel downloads supported
Wikidata¶
Description: Free, collaborative, multilingual knowledge base operated by the Wikimedia Foundation. Contains structured geographic data for millions of entities including countries, cities, parks, rivers, and other geographic features. Geoextent extracts bounding boxes from Wikidata’s coordinate properties via the SPARQL endpoint.
Website: https://www.wikidata.org/
Identifier Format: Q-numbers (e.g., Q64) or Wikidata URLs
Supported Identifier Formats:
Q-number: Q64
Wiki URL: https://www.wikidata.org/wiki/Q64
Entity URI: http://www.wikidata.org/entity/Q64
Coordinate Extraction:
Extreme coordinates (P1332-P1335): northernmost, southernmost, easternmost, westernmost points — used to construct a bounding box
Coordinate location (P625): single or multiple point locations — used as fallback when extreme coordinates are not available
Example:
# Extract bbox for Berlin
python -m geoextent -b Q64
# Using Wikidata URL
python -m geoextent -b https://www.wikidata.org/wiki/Q64
# Multiple Wikidata items (merged bbox)
python -m geoextent -b Q64 Q35 Q60786916
Special Notes:
Metadata-only provider: Extracts coordinates from Wikidata SPARQL endpoint, no data files are downloaded
The --no-download-data flag is accepted but has no effect (there are no data files)
Supports multiple Wikidata items in a single call, returning a merged bounding box
When only P625 point coordinates are available, the bounding box is computed from all available points
For entities with a single P625 point, a zero-extent bounding box (point) is returned
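The P625 fallback described above reduces to a min/max over the available points. A minimal sketch (the helper name and the [min_lon, min_lat, max_lon, max_lat] ordering are assumptions):

```python
def bbox_from_points(points):
    """Build a bounding box [min_lon, min_lat, max_lon, max_lat] from
    (lon, lat) tuples, e.g. P625 coordinate values. A single point
    yields a zero-extent box."""
    lons = [p[0] for p in points]
    lats = [p[1] for p in points]
    return [min(lons), min(lats), max(lons), max(lats)]

print(bbox_from_points([(13.4, 52.5)]))                 # [13.4, 52.5, 13.4, 52.5]
print(bbox_from_points([(13.4, 52.5), (2.35, 48.85)]))  # [2.35, 48.85, 13.4, 52.5]
```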
RADAR¶
Description: Cross-disciplinary research data repository operated by FIZ Karlsruhe for archiving and publishing German research data. Assigns DOIs via DataCite and delivers all datasets as .tar archives.
Website: https://www.radar-service.eu/
DOI Prefix: 10.35097
Supported Identifier Formats:
DOI: 10.35097/tvn5vujqfvf99f32
DOI URL: https://doi.org/10.35097/tvn5vujqfvf99f32
RADAR URL: https://www.radar-service.eu/radar/en/dataset/tvn5vujqfvf99f32
KIT URL: https://radar.kit.edu/radar/en/dataset/tvn5vujqfvf99f32
Example:
python -m geoextent -b -t 10.35097/tvn5vujqfvf99f32
Special Notes:
All datasets are delivered as a single .tar archive (no individual file downloads)
Backend API provides file listing before download for size estimation and geospatial file detection
Supports download size limiting and geospatial file filtering
Multiple hosting domains: www.radar-service.eu and radar.kit.edu
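Processing such a single-archive delivery can be sketched with the standard tarfile module (an illustrative helper, not geoextent's implementation):

```python
import io
import tarfile

def tar_member_names(tar_bytes):
    """List the regular-file members of a .tar archive held in memory,
    as one might inspect a RADAR dataset archive before extraction."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tf:
        return [m.name for m in tf.getmembers() if m.isfile()]
```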
Arctic Data Center¶
Description: The primary data and software repository for NSF-funded Arctic research, operated by the National Center for Ecological Analysis and Synthesis (NCEAS). Built on DataONE/Metacat infrastructure with rich structured geospatial and temporal metadata in its Solr index.
Website: https://arcticdata.io/
DOI Prefix: 10.18739
Supported Identifier Formats:
DOI: 10.18739/A2Z892H2J
DOI URL: https://doi.org/10.18739/A2Z892H2J
Catalog URL: https://arcticdata.io/catalog/view/doi%3A10.18739%2FA2Z892H2J
URN UUID: urn:uuid:054b4c9a-8be1-4d28-8724-5e2beb0ce4e6
Example:
python -m geoextent -b -t 10.18739/A2Z892H2J
Special Notes:
Supports metadata-only extraction (every dataset has bounding coordinates and temporal coverage in its Solr index)
Supports both DOI and URN UUID identifiers
Individual file downloads via DataONE object endpoint
Parallel downloads supported
DataONE¶
Description: DataONE (Data Observation Network for Earth) is a federated cyberinfrastructure for Earth observation data, aggregating metadata from ~38 member nodes (including KNB, PISCO, and others) into a unified Coordinating Node (CN) Solr index with ~1.2 million records. Geoextent queries the CN Solr API to extract pre-computed bounding boxes and temporal ranges from structured EML metadata.
Website: https://www.dataone.org/
DOI Prefixes: 10.5063/ (KNB), 10.6085/ (PISCO)
Supported Identifier Formats:
DOI: 10.5063/F1Z60M87
DOI URL: https://doi.org/10.5063/F1Z60M87
Search URL: https://search.dataone.org/view/doi%3A10.5063%2FF1Z60M87
Hash URL: https://search.dataone.org/#view/doi:10.5063/F1Z60M87
Datasets URL: https://dataone.org/datasets/doi%3A10.5063%2FF1Z60M87
CN object URL: https://cn.dataone.org/cn/v2/object/doi%3A10.5063%2FF1Z60M87
CN resolve URL: https://cn.dataone.org/cn/v2/resolve/doi%3A10.5063%2FF1Z60M87
Example (Metadata Only):
# KNB Alaska elevation — bbox and temporal extent from DataONE CN metadata
python -m geoextent -b -t --no-download-data 10.5063/F1Z60M87
# PISCO Kelp Forest Community Surveys — US West Coast
python -m geoextent -b -t --no-download-data 10.6085/AA/PISCO_kelpforest.1.11
# Using search.dataone.org URL
python -m geoextent -b --no-download-data https://search.dataone.org/view/doi%3A10.5063%2FF1Z60M87
# Open in geojson.io
python -m geoextent -b -t --geojsonio --no-download-data 10.5063/F1Z60M87
Python API Examples:
import geoextent.lib.extent as geoextent

# Metadata-only: uses DataONE CN Solr API for bbox and temporal extent
result = geoextent.from_remote(
    '10.5063/F1Z60M87',
    bbox=True, tbox=True, download_data=False
)
print(result['bbox'])  # Alaska region: [54.3, -166.4, 71.3, -130.1]
print(result['tbox'])  # ['2017-01-01', '2017-01-01']

# PISCO dataset: US West Coast kelp forest surveys
result = geoextent.from_remote(
    '10.6085/AA/PISCO_kelpforest.1.11',
    bbox=True, tbox=True, download_data=False
)
print(result['bbox'])  # West Coast: [33.0, -125.0, 45.0, -118.0]
print(result['tbox'])  # ['1999-09-07', '2024-12-07']
Special Notes:
Metadata-only provider: Extracts pre-computed bounding boxes and temporal ranges from the DataONE CN Solr index — no data files are downloaded
The --no-download-data flag is accepted but has no effect (there are no data files)
Supports both DOI-based and URL-based identifiers (7 URL patterns)
DOI prefixes 10.5063/ (KNB) and 10.6085/ (PISCO) are recognized automatically
Datasets from member nodes with dedicated providers (Arctic Data Center, PANGAEA, Dryad) are skipped to avoid duplicate handling
Temporal metadata is extracted from beginDate/endDate fields in the Solr index
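A query against the CN Solr index can be sketched as follows. The endpoint path and the coordinate field names (everything except beginDate/endDate, which are named above) follow the public DataONE Solr schema but should be treated as assumptions here, not as geoextent's exact request:

```python
from urllib.parse import urlencode

def dataone_solr_query_url(doi):
    """Build a CN Solr query URL for a dataset's spatial and temporal fields.
    Endpoint and bound-coordinate field names are assumptions for illustration."""
    params = {
        "q": f'id:"doi:{doi}"',
        "fl": "southBoundCoord,northBoundCoord,westBoundCoord,eastBoundCoord,beginDate,endDate",
        "wt": "json",
    }
    return "https://cn.dataone.org/cn/v2/query/solr/?" + urlencode(params)

print(dataone_solr_query_url("10.5063/F1Z60M87"))
```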
SEANOE¶
Description: SEANOE (SEA scieNtific Open data Edition) is a marine science data repository operated by Ifremer/SISMER (France). It publishes open-access oceanographic, marine biology, and geoscience datasets with DOI prefix 10.17882.
Website: https://www.seanoe.org/
DOI Prefix: 10.17882
Supported Identifier Formats:
DOI: 10.17882/105467
DOI URL: https://doi.org/10.17882/105467
SEANOE URL: https://www.seanoe.org/data/00943/105467/
Example (Metadata Only):
# French Mediterranean CTD data — bbox and temporal extent from SEANOE metadata
python -m geoextent -b -t --no-download-data 10.17882/105467
# Bowhead whale biologging — open in geojson.io
python -m geoextent -b -t --geojsonio --no-download-data 10.17882/112127
Example (Data Download):
# Ireland coastline REI — download data files and extract extent
python -m geoextent -b 10.17882/109463
Python API Examples:
import geoextent.lib.extent as geoextent

# Metadata-only: uses SEANOE REST API for bbox and temporal extent
result = geoextent.from_remote(
    '10.17882/105467',
    bbox=True, tbox=True, download_data=False
)

# Data download mode: downloads open-access files and extracts extent
result = geoextent.from_remote(
    '10.17882/109463',
    bbox=True, download_data=True
)
Special Notes:
Rich structured metadata via the https://www.seanoe.org/api/find-by-id/{id} REST API
Supports --no-download-data for metadata-only extraction (geographic bounding boxes and temporal ranges from the API)
Data files can be downloaded and processed for more precise extent extraction
Only open-access files are downloaded; restricted files are automatically skipped
Full support for download size limiting, geospatial file filtering, and parallel downloads
GeoScienceWorld¶
Description: GeoScienceWorld is a publishing platform hosting geoscience journals from multiple publishers (SEG, GSL, Mineralogical Society, etc.). Articles include GeoRef metadata with geographic coordinates embedded as WKT (POLYGON/POINT) in the article HTML.
Website: https://pubs.geoscienceworld.org/
DOI Prefix: Various publisher prefixes (10.1190, 10.1144, 10.1180, …)
Supported Identifier Formats:
Article URL: https://pubs.geoscienceworld.org/{pub}/{journal}/article-abstract/{vol}/{issue}/{page}/{id}/{slug}
Article URL: https://pubs.geoscienceworld.org/{journal}/article/{vol}/{issue}/{page}/{id}/{slug}
GeoRef record URL: https://pubs.geoscienceworld.org/georef/record/{type}/{id}/{slug}
DOI: 10.1190/tle44120952.1 (resolves to GSW)
DOI: 10.1144/petgeo2024-095 (resolves to GSW)
Example (Metadata Only):
# Mozambique Channel seismic article — bbox and date from GeoRef metadata
python -m geoextent -b -t --no-download-data \
"https://pubs.geoscienceworld.org/seg/tle/article-abstract/44/12/952/721805/Diagenesis-and-pore-pressure-induced-dim-spots-on"
# Via DOI
python -m geoextent -b -t --no-download-data 10.1190/tle44120952.1
Python API Examples:
import geoextent.lib.extent as geoextent

# Metadata-only: extracts coordinates from GeoRef metadata in article HTML
result = geoextent.from_remote(
    'https://pubs.geoscienceworld.org/seg/tle/article-abstract/44/12/952/721805/'
    'Diagenesis-and-pore-pressure-induced-dim-spots-on',
    bbox=True, tbox=True, download_data=False
)

# Convex hull from multiple articles across different journals
result = geoextent.from_remote(
    ['https://pubs.geoscienceworld.org/seg/tle/article-abstract/44/12/952/721805/'
     'Diagenesis-and-pore-pressure-induced-dim-spots-on',
     'https://pubs.geoscienceworld.org/gsl/pg/article/32/1/petgeo2024-095/722925/'
     'Combined-geophysical-and-tectonostratigraphic'],
    bbox=True, tbox=True, download_data=False, convex_hull=True
)
GeoRef Coordinate Structure:
GeoRef metadata embeds coordinates in <coordinates points='...'> HTML elements as a JSON
object containing WKT geometries. Two types of geographic metadata appear:
Bounding box articles — Regional studies have an axis-aligned rectangular POLYGON (the study area bounding box) plus a POINT at the exact centroid:
<coordinates points='{"Polygon":"POLYGON((43 -25.6667,50.5 -25.6667,
50.5 -11.8667,43 -11.8667,43 -25.6667))",
"Point":"POINT(46.75 -18.7667)"}'>
Point-only articles — Single-site studies (mineral localities, craters, mines) have only a POINT with no bounding polygon.
<coordinates points='{"Point":"POINT(-118.3547 34.0631)"}'>
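Parsing this structure can be sketched with json plus a WKT coordinate regex. This is an illustrative sketch, not geoextent's actual parser:

```python
import json
import re

_PAIR = re.compile(r"(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)")

def parse_georef_coordinates(points_attr):
    """Parse the JSON payload of a GeoRef <coordinates points='...'> attribute
    into a bbox ([min_lon, min_lat, max_lon, max_lat]) and/or a point."""
    payload = json.loads(points_attr)
    result = {}
    if "Polygon" in payload:
        pairs = _PAIR.findall(payload["Polygon"])
        lons = [float(x) for x, _ in pairs]
        lats = [float(y) for _, y in pairs]
        result["bbox"] = [min(lons), min(lats), max(lons), max(lats)]
    if "Point" in payload:
        x, y = _PAIR.findall(payload["Point"])[0]
        result["point"] = (float(x), float(y))
    return result

sample = ('{"Polygon":"POLYGON((43 -25.6667,50.5 -25.6667,50.5 -11.8667,'
          '43 -11.8667,43 -25.6667))","Point":"POINT(46.75 -18.7667)"}')
print(parse_georef_coordinates(sample)["bbox"])  # [43.0, -25.6667, 50.5, -11.8667]
```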
Illustrative Examples by Scale:
GeoRef bounding boxes span vastly different spatial scales depending on the study type. These real articles illustrate the range of polygon metadata:
| Article / Study Area | Journal | Width | Height | Area |
|---|---|---|---|---|
| Continental-scale seismic survey (Mozambique Channel) | SEG The Leading Edge | 791 km | 1,536 km | ~1.2M km² |
| Tectonic extension zone (Western California) | GSA Geology | 908 km | 1,058 km | ~970K km² |
| Cratonic mantle study (Eastern Tibet) | GSA Geology | 741 km | 297 km | ~221K km² |
| Porphyry copper district (N Greece, Chalkidiki) | Economic Geology | 135 km | 130 km | ~18K km² |
| Volcanic complex (Erongo, Namibia) | GSSA S. Afr. J. Geol. | 52 km | 46 km | ~2,400 km² |
| Single volcano (Torfajökull, Iceland) | GSA Geology | 12 km | 19 km | ~228 km² |
| Point-only: mineral locality (Monte Somma, Italy) | MinSoc Min. Mag. | — | — | point |
| Point-only: impact crater (Lonar, India) | J. Geol. Soc. India | — | — | point |
| Point-only: mine site (Sangdong, Korea) | Economic Geology | — | — | point |
The POINT coordinate in bounding-box articles is always the arithmetic centroid of the
POLYGON: POINT((W+E)/2, (S+N)/2). It carries no independent spatial information
beyond what the POLYGON already provides.
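This relation can be checked numerically for the Mozambique Channel example above:

```python
# Polygon corners (W, S, E, N) from the Mozambique Channel example
W, S, E, N = 43.0, -25.6667, 50.5, -11.8667
centroid_lon = (W + E) / 2
centroid_lat = (S + N) / 2
print(round(centroid_lon, 4), round(centroid_lat, 4))  # 46.75 -18.7667
```

This matches the article's POINT(46.75 -18.7667), confirming that the point adds no information beyond the polygon.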
Special Notes:
Metadata-only provider — coordinates are extracted from GeoRef metadata in article HTML; no data files are downloaded
The download_data parameter is accepted for API compatibility but has no effect
Uses curl_cffi with Chrome TLS fingerprint impersonation to bypass Cloudflare protection on pubs.geoscienceworld.org; works for most articles, but some older content may still be blocked (see the Cloudflare protection note below)
No single DOI prefix: GSW hosts journals from many publishers (SEG: 10.1190, GSL: 10.1144, etc.)
DOIs are supported via resolution: the DOI is resolved and the redirect URL is checked for pubs.geoscienceworld.org
Coordinates use WKT (lon lat) order, which is standard; no coordinate swap is needed internally
Temporal extent is the article publication date from <meta name="citation_publication_date">
Note
Cloudflare protection status
GeoScienceWorld uses Cloudflare’s “managed challenge” (Turnstile) protection.
geoextent uses curl_cffi with Chrome TLS fingerprint impersonation to bypass
this without requiring a real browser. This works for the majority of articles,
but some older content served from different backends may still return empty results.
See issue #109 for updates.
UKCEH¶
Description: UKCEH (UK Centre for Ecology & Hydrology) operates the Environmental Information Data Centre (EIDC), publishing environmental science datasets including water chemistry, land cover, biomass, and atmospheric data. The catalogue provides structured metadata via a JSON API with bounding boxes and temporal extents.
Website: https://catalogue.ceh.ac.uk/
DOI Prefix: 10.5285
Supported Identifier Formats:
DOI: 10.5285/dd35316a-cecc-4f6d-9a21-74a0f6599e9e
DOI URL: https://doi.org/10.5285/dd35316a-cecc-4f6d-9a21-74a0f6599e9e
Catalogue URL: https://catalogue.ceh.ac.uk/documents/dd35316a-cecc-4f6d-9a21-74a0f6599e9e
Example (Metadata Only):
# Blelham Tarn water chemistry — bbox and temporal extent from catalogue metadata
python -m geoextent -b -t --no-download-data 10.5285/dd35316a-cecc-4f6d-9a21-74a0f6599e9e
Example (Data Download):
# Blelham Tarn water chemistry — download CSV data and extract extent
python -m geoextent -b -t 10.5285/dd35316a-cecc-4f6d-9a21-74a0f6599e9e
Python API Examples:
import geoextent.lib.extent as geoextent
# Metadata-only: uses catalogue JSON API for bbox and temporal extent
result = geoextent.from_remote(
'10.5285/dd35316a-cecc-4f6d-9a21-74a0f6599e9e',
bbox=True, tbox=True, download_data=False
)
# Data download mode: downloads files and extracts extent
result = geoextent.from_remote(
'10.5285/dd35316a-cecc-4f6d-9a21-74a0f6599e9e',
bbox=True, tbox=True, download_data=True
)
Special Notes:
Dual download pattern: Apache datastore directory listing (selective file download) or data-package ZIP (all-or-nothing)
Datastore listing is tried first to enable selective file download and size filtering; falls back to the data-package ZIP
Supports --no-download-data for metadata-only extraction (bounding boxes and temporal ranges from the catalogue API)
Full support for download size limiting, geospatial file filtering, and parallel downloads
Dataset identifiers are UUIDs (e.g. dd35316a-cecc-4f6d-9a21-74a0f6599e9e)
Description: GDI-DE (Geodateninfrastruktur Deutschland / Spatial Data Infrastructure Germany) is the national spatial data infrastructure catalogue with 771,000+ records, aggregating metadata from German federal, state, and municipal agencies (BKG, DWD, DLR, etc.).
Website: https://www.geoportal.de/
Identifier Format: UUIDs or geoportal.de URLs (no DOIs)
Supported Identifier Formats:
Landing page URL: https://www.geoportal.de/Metadata/{uuid}
CSW URL: https://gdk.gdi-de.org/gdi-de/srv/eng/csw?...Id={uuid}
Bare UUID: 75987CE0-AA66-4445-AC44-068B98390E89
Example (Metadata Only):
# Heavy rain hazard map — bbox from GDI-DE catalogue metadata
python -m geoextent -b --no-download-data https://www.geoportal.de/Metadata/75987CE0-AA66-4445-AC44-068B98390E89
# Forest canopy cover loss — bbox and temporal extent from bare UUID
python -m geoextent -b -t --no-download-data cdb2c209-7e08-4f4c-b500-69de926e3023
Python API Examples:
import geoextent.lib.extent as geoextent
# Metadata-only: uses GDI-DE CSW 2.0.2 API for bbox and temporal extent
result = geoextent.from_remote(
'https://www.geoportal.de/Metadata/75987CE0-AA66-4445-AC44-068B98390E89',
bbox=True, tbox=True, download_data=False
)
Special Notes:
Metadata-only provider: GDI-DE is a catalogue pointing to external WMS/WFS/Atom services; no data files are downloaded
Uses the OGC CSW 2.0.2 endpoint with ISO 19115/19139 metadata (the same standard as BGR, BAW, MDI-DE)
The --no-download-data flag is accepted but has no effect (there are no data files)
Supports bare UUIDs, verified against the GDI-DE CSW catalogue
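A CSW 2.0.2 GetRecordById request of the kind described above can be built with standard KVP parameters; this is a sketch using the endpoint from the identifier formats, and the exact parameter set geoextent sends is an assumption:

```python
from urllib.parse import urlencode

def csw_get_record_url(uuid,
                       endpoint="https://gdk.gdi-de.org/gdi-de/srv/eng/csw"):
    """Build a CSW 2.0.2 GetRecordById KVP request for one metadata record."""
    params = {
        "service": "CSW",
        "version": "2.0.2",
        "request": "GetRecordById",
        "id": uuid,
        "outputSchema": "http://www.isotc211.org/2005/gmd",  # ISO 19139
        "elementSetName": "full",
    }
    return f"{endpoint}?{urlencode(params)}"

print(csw_get_record_url("75987CE0-AA66-4445-AC44-068B98390E89"))
```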
Description: NFDI4Earth (National Research Data Infrastructure for Earth System Sciences) operates the Knowledge Hub — a Cordra-based digital object repository with 1.3M+ datasets, 168 repositories, and 415K data services. The OneStop4All portal provides a unified search/discovery frontend. Geospatial metadata is extracted from the SPARQL endpoint (with Cordra REST API fallback). Only dcat:Dataset type objects are processed.
Website: https://onestop4all.nfdi4earth.de/
Identifier Format: OneStop4All or Cordra URLs (no DOIs)
Supported Identifier Formats:
OneStop4All URL: https://onestop4all.nfdi4earth.de/result/{id}
Cordra URL: https://cordra.knowledgehub.nfdi4earth.de/objects/n4e/{id}
Example (Metadata Only):
# Schiffsdichte 2013 — bbox from WKT geometry via SPARQL
python -m geoextent -b https://onestop4all.nfdi4earth.de/result/dthb-82b6552d-2b8e-4800-b955-ea495efc28af/
# ESA Antarctic Ice Sheet — bbox and temporal extent (1994–2021)
python -m geoextent -b -t https://onestop4all.nfdi4earth.de/result/dthb-7b3bddd5af4945c2ac508a6d25537f0a/
# FNP Berlin — Berlin area polygon
python -m geoextent -b https://onestop4all.nfdi4earth.de/result/dthb-92a8e490-3d32-46cc-853a-50c0d43a187f/
Example (Disable Follow):
# Use NFDI4Earth metadata only, do not follow the landingPage to another provider
python -m geoextent -b -t --no-follow https://onestop4all.nfdi4earth.de/result/dthb-82b6552d-2b8e-4800-b955-ea495efc28af/
Python API Examples:
import geoextent.lib.extent as geoextent
# Extract bbox and temporal extent from NFDI4Earth Knowledge Hub
result = geoextent.from_remote(
'https://onestop4all.nfdi4earth.de/result/dthb-7b3bddd5af4945c2ac508a6d25537f0a/',
bbox=True, tbox=True
)
print(result['bbox']) # Antarctic region bounding box
print(result['tbox']) # ['1994-01-28', '2021-01-19']
# Disable follow — use NFDI4Earth SPARQL metadata only
result = geoextent.from_remote(
'https://onestop4all.nfdi4earth.de/result/dthb-82b6552d-2b8e-4800-b955-ea495efc28af/',
bbox=True, follow=False
)
# Direct Cordra URL also works
result = geoextent.from_remote(
'https://cordra.knowledgehub.nfdi4earth.de/objects/n4e/dthb-82b6552d-2b8e-4800-b955-ea495efc28af',
bbox=True
)
Special Notes:
Metadata-only provider: Extracts WKT geometry and temporal ranges from the NFDI4Earth SPARQL endpoint — no data files are downloaded
Provider-jump (follow): When a dataset has a landingPage URL that matches another supported provider (e.g. GDI-DE), geoextent automatically follows it for data extent extraction. Disable with --no-follow or follow=False.
Uses SPARQL as the primary data access method, with the Cordra REST API as fallback when the SPARQL endpoint is unavailable
The --no-download-data flag is accepted but has no effect (there are no data files)
Both OneStop4All landing pages and direct Cordra object URLs are supported
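The SPARQL-based extraction can be sketched as a query for a dataset's WKT literal; the property path (dct:spatial/locn:geometry) and graph layout used here are assumptions for illustration, not geoextent's actual query:

```python
def build_geometry_query(object_id):
    """Illustrative SPARQL query for a dcat:Dataset's WKT geometry.
    The vocabulary and property path are assumed, not taken from
    geoextent's source."""
    return f"""
    PREFIX dct:  <http://purl.org/dc/terms/>
    PREFIX locn: <http://www.w3.org/ns/locn#>
    PREFIX geo:  <http://www.opengis.net/ont/geosparql#>
    SELECT ?wkt WHERE {{
      <https://cordra.knowledgehub.nfdi4earth.de/objects/n4e/{object_id}>
          dct:spatial/locn:geometry ?wkt .
      FILTER(datatype(?wkt) = geo:wktLiteral)
    }}
    """

q = build_geometry_query("dthb-82b6552d-2b8e-4800-b955-ea495efc28af")
print(q)
```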
Description: STAC (SpatioTemporal Asset Catalog) is an OGC Community Standard for describing geospatial information. STAC Collections contain pre-computed aggregate bounding boxes and temporal intervals, making them ideal for fast metadata-only extraction. Geoextent supports any STAC-compliant API.
Website: https://stacspec.org/
Identifier Format: STAC Collection URLs (no DOIs)
Supported Identifier Formats:
Collection URL: https://{host}/stac/v1/collections/{id}
Collection URL: https://{host}/collections/{id}
Known STAC API hosts are matched instantly (Element84, DLR, Terradue, WorldPop, Lantmateriet, etc.)
Unknown hosts with /stac/ in the URL path are also matched
Fallback: any URL returning JSON with a stac_version field
Example (Metadata Only):
# US National Agriculture Imagery (Element84 Earth Search)
python -m geoextent -b -t https://earth-search.aws.element84.com/v1/collections/naip
# German forest structure (DLR EOC STAC API)
python -m geoextent -b -t https://geoservice.dlr.de/eoc/ogc/stac/v1/collections/FOREST_STRUCTURE_DE_COVER_P1Y
# Switzerland population data (WorldPop)
python -m geoextent -b -t https://api.stac.worldpop.org/collections/CHE
# Swedish orthophoto (Lantmateriet)
python -m geoextent -b -t https://api.lantmateriet.se/stac-bild/v1/collections/orto-f2-2014
# San Andreas Fault SAR data (Terradue)
python -m geoextent -b -t https://gep-supersites-stac.terradue.com/collections/csk-san-andrea-supersite
Python API Examples:
import geoextent.lib.extent as geoextent
# Extract bbox and temporal extent from STAC Collection
result = geoextent.from_remote(
'https://earth-search.aws.element84.com/v1/collections/naip',
bbox=True, tbox=True
)
print(result['bbox']) # [17.0, -160.0, 50.0, -67.0] (NAIP US coverage)
print(result['tbox']) # ['2010-01-01', '2022-12-31']
# Open-ended temporal range (end date is null)
result = geoextent.from_remote(
'https://geoservice.dlr.de/eoc/ogc/stac/v1/collections/FOREST_STRUCTURE_DE_COVER_P1Y',
bbox=True, tbox=True
)
print(result['tbox']) # ['2017-01-01', None]
Special Notes:
Metadata-only provider: Extracts the pre-computed extent.spatial.bbox and extent.temporal.interval directly from the STAC Collection JSON — no data files are downloaded
The --no-download-data flag is accepted but has no effect (there are no data files)
Supports content negotiation: if a URL returns HTML (e.g. an OGC API with content negotiation), retries with ?f=application/json
Handles open-ended temporal ranges where the end date is null (ongoing data collection)
Supports STAC API v1.0 and v1.1
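The two pre-computed fields live at fixed paths in the Collection JSON (per the STAC spec, the first bbox and interval give the overall extents; later entries refine them), so extraction amounts to a pair of dictionary lookups. A minimal sketch on a spec-shaped fragment:

```python
# Minimal STAC Collection fragment in STAC's [W, S, E, N] bbox order
collection = {
    "stac_version": "1.0.0",
    "extent": {
        "spatial": {"bbox": [[-160.0, 17.0, -67.0, 50.0]]},
        "temporal": {"interval": [["2010-01-01T00:00:00Z", None]]},
    },
}

bbox = collection["extent"]["spatial"]["bbox"][0]
start, end = collection["extent"]["temporal"]["interval"][0]
print(bbox, start, end)  # end is None for ongoing collections
```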
Description: Generic provider for any CKAN (Comprehensive Knowledge Archive Network) instance. CKAN is the world’s most widely-used open-source data management system, powering government open data portals and research data repositories worldwide. The generic CKAN provider supports metadata-only extraction (spatial extent from GeoJSON geometries, temporal extent from various field naming conventions) and data file downloads.
Website: https://ckan.org/
Identifier Format: Dataset URLs (no DOIs)
Known CKAN Instances:
| Instance | Host |
|---|---|
| GeoKur (TU Dresden) | geokur-dmp.geo.tu-dresden.de |
| UK data.gov.uk | ckan.publishing.service.gov.uk |
| GovData.de | ckan.govdata.de |
| Canada Open Data | open.canada.ca |
| Australia Open Data | data.gov.au |
| US data.gov | catalog.data.gov |
| Ireland Open Data | data.gov.ie |
| Singapore Open Data | data.gov.sg |
Unknown CKAN hosts are automatically detected by probing the /api/3/action/status_show endpoint.
Supported Identifier Formats:
Dataset URL: https://{ckan-host}/dataset/{dataset_id_or_name}
Subpath URL: https://{host}/data/en/dataset/{id} (e.g. Canada)
Example (Metadata Only):
# GeoKur cropland extent — bbox and temporal from CKAN metadata (GeoJSON geometry + temporal_start/end)
python -m geoextent -b -t --no-download-data https://geokur-dmp.geo.tu-dresden.de/dataset/cropland-extent
# UK data.gov.uk — bbox from bbox-* extras pattern
python -m geoextent -b --no-download-data https://ckan.publishing.service.gov.uk/dataset/bishkek-spatial-data
# German GovData — spatial GeoJSON and temporal extent
python -m geoextent -b -t --no-download-data https://ckan.govdata.de/dataset/a-spatially-distributed-sampling-of-rhine-surface-water-for-non-target-screening
Example (Data Download):
# Ireland libraries — download Shapefile and extract bbox from file contents
python -m geoextent -b https://data.gov.ie/dataset/libraries-dlr
# Australia Gisborne — download GeoJSON and extract bbox from file contents
python -m geoextent -b https://data.gov.au/dataset/gisborne-neighbourhood-character-precincts
Python API Examples:
import geoextent.lib.extent as geoextent
# Metadata-only: uses CKAN API for bbox and temporal extent
result = geoextent.from_remote(
'https://geokur-dmp.geo.tu-dresden.de/dataset/cropland-extent',
bbox=True, tbox=True, download_data=False
)
# Data download: downloads files and extracts extent
result = geoextent.from_remote(
'https://data.gov.ie/dataset/libraries-dlr',
bbox=True, tbox=True, download_data=True
)
# Metadata-first strategy: tries metadata first, falls back to data download
result = geoextent.from_remote(
'https://ckan.govdata.de/dataset/a-spatially-distributed-sampling-of-rhine-surface-water-for-non-target-screening',
bbox=True, tbox=True, metadata_first=True
)
Special Notes:
Recommended: Use --metadata-first for CKAN datasets — many have rich catalogue metadata, but data files may not contain geospatial content
Spatial metadata supports: GeoJSON geometries (Polygon, MultiPolygon, Point), bbox-* extras (UK pattern), and west/south/east/north dict fields
Temporal metadata supports five naming conventions across instances: temporal_start/end, temporal-extent-begin/end, temporal_coverage-from/to, temporal_coverage_from/to, time_period_coverage_start/end
Complex GeoJSON geometries are preserved for convex hull calculations (not simplified to bounding-box rectangles)
Automatic metadata fallback: if downloaded data files have no geospatial content, geoextent automatically falls back to catalogue metadata
Senckenberg (dataportal.senckenberg.de) has a dedicated provider and is excluded from generic CKAN matching
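The GeoJSON-geometry case can be sketched as follows; storing the footprint as a GeoJSON string in a "spatial" extra is a common CKAN convention, though (as noted above) instances vary in their field names:

```python
import json

# CKAN package extras, with the dataset footprint as a GeoJSON string
# under the commonly used "spatial" key (illustrative values)
extras = [{"key": "spatial", "value": json.dumps({
    "type": "Polygon",
    "coordinates": [[[5.9, 47.3], [15.0, 47.3], [15.0, 55.1],
                     [5.9, 55.1], [5.9, 47.3]]],
})}]

spatial = next(json.loads(e["value"]) for e in extras if e["key"] == "spatial")
lons = [p[0] for ring in spatial["coordinates"] for p in ring]
lats = [p[1] for ring in spatial["coordinates"] for p in ring]
bbox = [min(lons), min(lats), max(lons), max(lats)]
print(bbox)  # [5.9, 47.3, 15.0, 55.1]
```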
Description: GitHub is the most widely used platform for hosting research code and data, including research compendia that bundle geospatial data alongside analysis scripts. This provider downloads geospatial files from public GitHub repositories and extracts their spatial and temporal extent. It uses the Git Trees API (2 API calls per repo) and raw file downloads, preserving directory structure for co-located files (e.g. shapefile components).
Website: https://github.com/
Identifier Format: Repository URLs (no DOIs)
Supported Identifier Formats:
Repository: https://github.com/{owner}/{repo}
Branch/tag: https://github.com/{owner}/{repo}/tree/{ref}
Subdirectory: https://github.com/{owner}/{repo}/tree/{ref}/{path}
Example (CLI):
# Extract bbox from entire repository (GeoJSON tectonic plates — global extent)
python -m geoextent -b https://github.com/fraxen/tectonicplates
# Extract from a specific subdirectory
python -m geoextent -b https://github.com/Nowosad/spDataLarge/tree/master/inst/raster
# Skip non-geospatial files
python -m geoextent -b --download-skip-nogeo https://github.com/fraxen/tectonicplates
Python API Examples:
import geoextent.lib.extent as geoextent
# Extract bbox from GitHub repository
result = geoextent.from_remote(
'https://github.com/fraxen/tectonicplates',
bbox=True, tbox=False, download_skip_nogeo=True
)
# Extract from a specific subdirectory
result = geoextent.from_remote(
'https://github.com/Nowosad/spDataLarge/tree/master/inst/raster',
bbox=True, tbox=True
)
Special Notes:
Data-download provider: Downloads actual files from the repository — no metadata-only extraction (git repositories don’t have structured spatial metadata)
Rate limits: Unauthenticated: 60 API requests/hour. Set the GITHUB_TOKEN environment variable for 5000 requests/hour.
Directory structure preservation: Files are downloaded preserving their path structure, which is essential for shapefile components (.shp + .shx + .dbf + .prj) and world files
Recommended: Use --download-skip-nogeo for repositories with many non-geospatial files
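The Git Trees API plus raw-download pattern can be sketched as plain URL construction (both endpoints are GitHub's documented APIs; the example file path is illustrative):

```python
def github_tree_url(owner, repo, ref):
    """Recursive Git Trees API call — one request lists every path
    in the repository at the given ref."""
    return (f"https://api.github.com/repos/{owner}/{repo}"
            f"/git/trees/{ref}?recursive=1")

def raw_file_url(owner, repo, ref, path):
    """Raw download URL for an individual file (no API quota cost)."""
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{path}"

print(github_tree_url("fraxen", "tectonicplates", "master"))
print(raw_file_url("fraxen", "tectonicplates", "master", "example.geojson"))
```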
Description: GitLab is a platform for hosting and collaborating on code and data. This provider downloads geospatial files from public GitLab repositories on gitlab.com and self-hosted instances, and extracts their spatial and temporal extent. It uses the paginated Repository Tree API and raw file API, preserving directory structure for co-located files (e.g. shapefile components).
Website: https://gitlab.com/
Identifier Format: Repository URLs (no DOIs)
Supported Identifier Formats:
Repository: https://gitlab.com/{namespace}/{project}
Branch/tag: https://gitlab.com/{namespace}/{project}/-/tree/{ref}
Subdirectory: https://gitlab.com/{namespace}/{project}/-/tree/{ref}/{path}
Nested namespace: https://gitlab.com/{group}/{subgroup}/{project}
Self-hosted: https://{gitlab-host}/{namespace}/{project}
Git suffix: https://gitlab.com/{namespace}/{project}.git
Known Self-Hosted Instances:
| Instance | Organization |
|---|---|
| git.rwth-aachen.de | RWTH Aachen University |
| zivgitlab.uni-muenster.de | University of Münster |
| git.gfz-potsdam.de | GFZ Helmholtz Potsdam |
| codebase.helmholtz.cloud | Helmholtz Association |
| gitlab.opencode.de | German Government |
| gitlab.ethz.ch | ETH Zurich |
| git.wur.nl | Wageningen University & Research |
| gitlab.eumetsat.int | EUMETSAT |
| forge.inrae.fr | INRAE France |
| framagit.org | Framasoft |
Unknown self-hosted instances are detected automatically if the hostname contains “gitlab” or via API probe fallback.
Example (CLI):
# European avalanche warning regions (GeoJSON files)
python -m geoextent -b https://gitlab.com/eaws/eaws-regions/-/tree/master/public/outline
# Upper Silesia seismicity data — CSV with coordinates and dates
python -m geoextent -b -t https://gitlab.com/bazylizon/seismicity
# DWD radar network — GeoPackage in EPSG:3035 (reprojected to WGS84)
python -m geoextent -b https://gitlab.com/Weatherman_/radolan2map/-/tree/master/example/shapes/RadarNetwork
# Self-hosted GitLab (RWTH Aachen) — NFDI4Earth datasets
python -m geoextent -b https://git.rwth-aachen.de/nfdi4earth/crosstopics/knowledgehub-maps/-/tree/main/maps/200_datasets/data
# Skip non-geospatial files
python -m geoextent -b --download-skip-nogeo https://gitlab.com/bazylizon/seismicity
Python API Examples:
import geoextent.lib.extent as geoextent
# Extract bbox from GitLab repository
result = geoextent.from_remote(
'https://gitlab.com/bazylizon/seismicity',
bbox=True, tbox=True, download_skip_nogeo=True
)
# Extract from a specific subdirectory
result = geoextent.from_remote(
'https://gitlab.com/eaws/eaws-regions/-/tree/master/public/outline',
bbox=True, tbox=False
)
# Self-hosted GitLab instance
result = geoextent.from_remote(
'https://git.rwth-aachen.de/nfdi4earth/crosstopics/knowledgehub-maps/-/tree/main/maps/200_datasets/data',
bbox=True, tbox=False, download_skip_nogeo=True
)
Special Notes:
Data-download provider: Downloads actual files from the repository — no metadata-only extraction (git repositories don’t have structured spatial metadata)
Rate limits: Unauthenticated on gitlab.com: ~400 API requests/10 min. Set the GITLAB_TOKEN environment variable for higher limits.
Self-hosted instances: Supports any GitLab instance — known hosts are matched instantly, unknown hosts with “gitlab” in the hostname are detected heuristically, and all other hosts are verified via API probe
Nested namespaces: Supports GitLab’s group/subgroup/project hierarchy (e.g. nfdi4earth/crosstopics/knowledgehub-maps)
Directory structure preservation: Files are downloaded preserving their path structure, which is essential for shapefile components (.shp + .shx + .dbf + .prj) and world files
Recommended: Use --download-skip-nogeo for repositories with many non-geospatial files
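The paginated tree listing requires the full namespace/project path to be URL-encoded (slashes included) as the project ID; a sketch of the request URL for GitLab API v4, using the documented per_page maximum of 100:

```python
from urllib.parse import quote

def gitlab_tree_url(host, project_path, ref, path="", page=1):
    """Paginated Repository Tree API call. The namespace/project path
    is percent-encoded in full to serve as the project ID."""
    project_id = quote(project_path, safe="")
    return (f"https://{host}/api/v4/projects/{project_id}/repository/tree"
            f"?ref={ref}&path={quote(path)}&recursive=true"
            f"&per_page=100&page={page}")

print(gitlab_tree_url("gitlab.com", "eaws/eaws-regions",
                      "master", "public/outline"))
```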
Description: Software Heritage is a non-profit archive (Inria + UNESCO) of all publicly available source code, assigning persistent identifiers (SWHIDs) to every software artifact. This provider downloads geospatial files from archived repositories and extracts their spatial and temporal extent. It resolves SWHIDs through the SWH API chain (origin/snapshot/revision/directory) and downloads files by content hash.
Website: https://www.softwareheritage.org/
Identifier Format: SWHIDs and browse URLs (no DOIs)
Supported Identifier Formats:
Bare SWHID: swh:1:dir:<40-hex>
Origin SWHID: swh:1:ori:<40-hex>
SWHID with qualifiers: swh:1:dir:<hash>;origin=<url>;path=/subdir
Browse origin URL: https://archive.softwareheritage.org/browse/origin/directory/?origin_url=<url>
Browse origin URL with path: https://archive.softwareheritage.org/browse/origin/directory/?origin_url=<url>&path=<path>
Browse directory URL: https://archive.softwareheritage.org/browse/directory/<sha>/
Browse revision URL: https://archive.softwareheritage.org/browse/revision/<sha>/
Example (CLI):
# Extract bbox from an archived repository subdirectory
python -m geoextent -b --download-skip-nogeo \
"https://archive.softwareheritage.org/browse/origin/directory/?origin_url=https://github.com/AWMC/geodata&path=Cultural-Data/political_shading/hasmonean"
# Extract from a directory SWHID
python -m geoextent -b --download-skip-nogeo swh:1:dir:92890dbe77bbe36ccba724673bc62c2764df4f5a
Python API Examples:
import geoextent.lib.extent as geoextent
# Extract bbox from Software Heritage archive
result = geoextent.from_remote(
'swh:1:dir:92890dbe77bbe36ccba724673bc62c2764df4f5a',
bbox=True, tbox=False, download_skip_nogeo=True
)
Special Notes:
Data-download provider: Downloads actual files from the archive — no metadata-only extraction
Rate limits: Anonymous: 120 API requests/hour. Set the SWH_TOKEN environment variable for 1200 requests/hour.
Sequential downloads: Downloads are sequential due to strict API rate limits
Subpath optimization: When a path is specified, only the targeted subdirectory is traversed
Recommended: Use --download-skip-nogeo to skip non-geospatial files and &path= to target specific subdirectories
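Qualified SWHIDs like the one in the format list split cleanly into a core identifier plus ;-separated qualifiers; a minimal parsing sketch:

```python
def parse_swhid(swhid):
    """Split a SWHID into its core identifier and qualifiers,
    e.g. swh:1:dir:<hash>;origin=<url>;path=/subdir."""
    core, *quals = swhid.split(";")
    scheme, version, obj_type, obj_hash = core.split(":", 3)
    qualifiers = dict(q.split("=", 1) for q in quals)
    return {"type": obj_type, "hash": obj_hash, **qualifiers}

swhid = ("swh:1:dir:92890dbe77bbe36ccba724673bc62c2764df4f5a"
         ";origin=https://github.com/AWMC/geodata;path=/Cultural-Data")
print(parse_swhid(swhid))
```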
Description: Direct HTTP(S) URLs to GeoTIFF/COG files. Reads raster headers via GDAL /vsicurl/ without downloading the full file. Works best with Cloud Optimized GeoTIFFs (COG) but supports any HTTP-accessible GeoTIFF.
Website: https://www.cogeo.org/
Identifier Format: Direct HTTP(S) URLs ending in .tif or .tiff
Examples:
https://raw.githubusercontent.com/GeoTIFF/test-data/main/files/gfw-azores.tif
https://zenodo.org/records/14711942/files/FSM_1-km_MED-epsg.4326_v01.tif
CLI:
geoextent -b https://raw.githubusercontent.com/GeoTIFF/test-data/main/files/gfw-azores.tif
Python:
import geoextent.lib.extent as geoextent
result = geoextent.from_remote(
'https://raw.githubusercontent.com/GeoTIFF/test-data/main/files/gfw-azores.tif',
bbox=True, tbox=True,
)
Special Notes:
Metadata-only provider: Only reads the raster header via HTTP range requests; no full file download
Catch-all: Positioned last in the provider list — URLs that match another provider (e.g. Zenodo) are handled by that provider instead
Performance: COGs are most efficient (~16 KB transferred); regular GeoTIFFs also work but may require more HTTP requests
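The /vsicurl/ access pattern amounts to prefixing the URL so GDAL fetches only the byte ranges it needs; a sketch (the gdal.Open usage in the comment assumes GDAL's Python bindings are installed):

```python
def vsicurl_path(url):
    """Prefix an HTTP(S) URL for GDAL's /vsicurl/ virtual file system,
    which reads the raster header via HTTP range requests."""
    if not url.lower().endswith((".tif", ".tiff")):
        raise ValueError("expected a GeoTIFF URL")
    return "/vsicurl/" + url

path = vsicurl_path(
    "https://raw.githubusercontent.com/GeoTIFF/test-data/main/files/gfw-azores.tif")
print(path)
# With GDAL's Python bindings installed, the header can then be read via:
#   from osgeo import gdal
#   ds = gdal.Open(path)       # fetches only the needed header ranges
#   gt = ds.GetGeoTransform()  # origin and pixel size -> bounding box
```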
Usage Examples¶
Extract from a Zenodo dataset:
python -m geoextent -b -t 10.5281/zenodo.4593540
Mix resources from different providers:
python -m geoextent -b -t \
10.5281/zenodo.4593540 \
10.25532/OPARA-581 \
https://osf.io/4xe6z/
Returns a merged bounding box covering all resources.
Limit download size and skip non-geospatial files:
python -m geoextent -b \
--max-download-size 100MB \
--download-skip-nogeo \
--max-download-workers 8 \
10.5281/zenodo.7080016
Provider Selection¶
Geoextent automatically detects the appropriate provider based on:
DOI prefix matching - Most reliable method
URL pattern matching - For direct repository URLs
Known host detection - For repository-specific domains
The first matching provider is used. If no provider matches, an error is returned.
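The first-match selection described above can be sketched as an ordered provider list; the class and method names here are illustrative, not geoextent's internal API:

```python
# Hypothetical sketch of first-match provider selection
class Provider:
    def matches(self, identifier: str) -> bool:
        raise NotImplementedError

class ZenodoLike(Provider):
    def matches(self, identifier):
        return identifier.startswith("10.5281/zenodo.")  # DOI prefix match

class UrlPattern(Provider):
    def __init__(self, host):
        self.host = host
    def matches(self, identifier):
        return self.host in identifier  # known-host match

def select_provider(identifier, providers):
    for p in providers:  # first match wins
        if p.matches(identifier):
            return p
    raise ValueError(f"no provider matches {identifier!r}")

providers = [ZenodoLike(), UrlPattern("github.com")]
print(type(select_provider("10.5281/zenodo.4593540", providers)).__name__)
```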
See Also¶
Quick Start Guide - Get started with repository extraction
Examples - Detailed repository extraction examples
advanced-features - Download control and filtering options
API Docs - Python API for repository extraction