# Changelog

## Unreleased

### New Content Providers
- Generalise Zenodo provider into InvenioRDM base provider supporting CaltechDATA, TU Wien, Frei-Data, GEO Knowledge Hub, TU Graz, Materials Cloud Archive, FDAT, DataPLANT ARChive, KTH, Prism, and NYU Ultraviolet (#81)
- Add Mendeley Data as content provider (#58)
- Add ioerDATA (Leibniz IOER) Dataverse instance support (#85)
- Add heiDATA (Heidelberg University) Dataverse instance support (#94)
- Add Edmond (Max Planck Society) Dataverse instance support (#93)
- Add Wikidata as content provider for extracting geographic extents from Wikidata items via SPARQL (#83)
- Add 4TU.ResearchData as content provider with support for both data download and metadata-only extraction (#92)
- Add RADAR (FIZ Karlsruhe) as content provider for cross-disciplinary German research data (#87)
- Add NSF Arctic Data Center as content provider with metadata-only and data download support (#90)
- Add DEIMS-SDR (Dynamic Ecological Information Management System) as metadata-only content provider for long-term ecological research sites and datasets, with support for dataset and site URLs, WKT geometry parsing (POINT, POLYGON, MULTIPOLYGON), and temporal ranges (#101)
- Add BAW (Bundesanstalt für Wasserbau) Datenrepository as content provider with CSW 2.0.2 metadata extraction via OWSLib, supporting DOIs (`10.48437/*`), landing page URLs, and bare UUIDs (#89)
- Add B2SHARE (EUDAT) as InvenioRDM instance, supporting DOIs (`10.23728/b2share.*`), record URLs, and old-style hex DOIs (#16)
- Add MDI-DE (Marine Daten-Infrastruktur Deutschland) as content provider with CSW 2.0.2 metadata extraction via OWSLib, WFS-based data download, and direct file download; supports NOKIS landing page URLs and bare UUIDs (#86)
- Add HALO DB (DLR) as metadata-only content provider for the HALO research aircraft database, extracting flight track geometry from the GeoJSON search API and temporal extent from dataset HTML pages (#88)
- Add GBIF (Global Biodiversity Information Facility) as content provider with metadata-only extraction from the Registry API and optional Darwin Core Archive (DwC-A) data download from institutional IPT servers; supports DOI prefixes 10.15468, 10.15470, 10.15472, 10.25607, 10.71819, 10.82144 and gbif.org dataset URLs (#62)
- Add GeoScienceWorld as metadata-only content provider for geoscience journal articles, extracting geographic coordinates from GeoRef metadata (WKT POLYGON/POINT) embedded in article landing pages; supports article URLs, GeoRef record URLs, and DOIs from multiple publisher prefixes (10.1190, 10.1144, etc.); uses `curl_cffi` with TLS fingerprint impersonation for Cloudflare-protected pages (#109)
- Add SEANOE (SEA scieNtific Open data Edition) as content provider for marine science data from Ifremer/SISMER, with metadata-only extraction from the REST API (geographic bounding boxes, temporal extents) and data download of open-access files; supports DOI prefix 10.17882 and seanoe.org landing page URLs (#104)
- Add UKCEH (UK Centre for Ecology & Hydrology) EIDC as content provider with metadata-only extraction from the catalogue JSON API (bounding boxes, temporal extents) and dual data download pattern (Apache datastore listing or data-package ZIP); supports DOI prefix 10.5285 and catalogue.ceh.ac.uk URLs (#103)
- Add GDI-DE (Geodateninfrastruktur Deutschland / geoportal.de) as metadata-only content provider for the national German spatial data infrastructure catalogue (771,000+ records), with CSW 2.0.2 metadata extraction via OWSLib; supports geoportal.de landing page URLs and bare UUIDs (#84)
- Add generic CKAN content provider supporting any CKAN instance via dataset URL matching, with known-host fast matching and dynamic API discovery; includes GeoJSON spatial metadata parsing with geometry preservation for convex hull, multi-format temporal field support, and UK `bbox-*` extras pattern; known hosts include GeoKur TU Dresden, data.gov.uk, GovData.de, data.gov.au, and catalog.data.gov (#98)
- Add STAC (SpatioTemporal Asset Catalog) as metadata-only content provider, extracting spatial bounding boxes and temporal intervals directly from STAC Collection JSON; supports known STAC API hosts (Element84, DLR, Terradue, WorldPop, Lantmateriet, etc.), `/stac/` URL path patterns, JSON content-inspection fallback, and content negotiation for HTML/JSON servers (#25)
- Add NFDI4Earth Knowledge Hub as metadata-only content provider for the Cordra-based digital object repository (1.3M+ datasets), with SPARQL endpoint extraction (spatial WKT, temporal ranges) and Cordra REST API fallback; supports OneStop4All landing page URLs and direct Cordra object URLs; follows `landingPage` to other supported providers (#100)
- Add GitHub as content provider for extracting geospatial extent from public GitHub repositories, with `GitHostProvider` abstract base class for git hosting platforms; uses the Git Trees API (2 API calls per repo) and raw file downloads; supports repository root, branch, and subdirectory URLs; preserves directory structure for co-located files (e.g. shapefile components); optional `GITHUB_TOKEN` for higher rate limits (#96)
- Add GitLab as content provider for extracting geospatial extent from public GitLab repositories on gitlab.com and self-hosted instances; uses the paginated Repository Tree API and raw file API; supports repository root, branch, and subdirectory URLs with nested namespace paths (`group/subgroup/project`); three-stage instance detection (known hosts, hostname heuristic, API probe); optional `GITLAB_TOKEN` for higher rate limits (#97)
- Add Forgejo/Gitea as content provider for extracting geospatial extent from public Forgejo and Gitea repositories (including Codeberg.org and Helmholtz DataHub); uses Gitea REST API v1 for paginated tree listing and raw file downloads; three-stage instance detection (known hosts, hostname heuristic containing "forgejo" or "gitea", `/api/v1/version` probe); optional `FORGEJO_TOKEN` for higher rate limits
- Add Software Heritage as content provider for extracting geospatial extent from the universal software archive (Inria + UNESCO); supports persistent SWHIDs (`swh:1:dir:...`, `swh:1:ori:...`, `swh:1:rev:...`, `swh:1:cnt:...`), browse URLs (archive.softwareheritage.org), and SWHID qualifiers (`origin=`, `path=`); subpath optimization for targeted subdirectory extraction; optional `SWH_TOKEN` for higher rate limits (1200 req/hr vs 120 anonymous)
- Add DataONE (Data Observation Network for Earth) as metadata-only content provider, extracting pre-computed spatial bounding boxes and temporal ranges from the Coordinating Node Solr API; covers ~20 member node repositories (KNB, PISCO, EDI/LTER, NEON, BCO-DMO, ESS-DIVE, etc.) through a single API; supports DOI prefixes `10.5063/` (KNB), `10.6085/` (PISCO), and search.dataone.org URLs; defers Arctic Data Center, PANGAEA, and Dryad datasets to their dedicated providers (#15)
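Several of the providers above (DEIMS-SDR, GeoScienceWorld, NFDI4Earth) derive bounding boxes from WKT geometries. As an illustration only — not the providers' actual code, which would typically use a proper geometry library — a minimal pure-Python sketch of turning simple WKT into a bounding box:

```python
import re


def wkt_bbox(wkt: str):
    """Derive a (minlon, minlat, maxlon, maxlat) bounding box from a
    simple WKT string (POINT, POLYGON, MULTIPOLYGON).

    Illustrative only: it scrapes every "x y" coordinate pair with a
    regex, which is fine for well-formed WKT but no substitute for a
    real parser such as shapely's.
    """
    pairs = re.findall(r"(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)", wkt)
    if not pairs:
        raise ValueError("no coordinates found in WKT")
    xs = [float(x) for x, _ in pairs]
    ys = [float(y) for _, y in pairs]
    return (min(xs), min(ys), max(xs), max(ys))


print(wkt_bbox("POINT (10.5 51.0)"))                       # degenerate box
print(wkt_bbox("POLYGON ((9 50, 11 50, 11 52, 9 52, 9 50))"))
```

For MULTIPOLYGON input the same coordinate scan covers all rings, since the bounding box of a multi-geometry is just the min/max over every vertex.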
### New Features
- Add Cloud Optimized GeoTIFF (COG) support: direct HTTP(S) URLs to GeoTIFF files are opened via GDAL `/vsicurl/` for efficient header-only metadata extraction without downloading the full file (#11)
- Add point cloud support for LAS/LAZ files via laspy, extracting bounding boxes from file headers and temporal extent from creation dates (#10)
- Add `--join` CLI flag and `join_files()` Python API to merge multiple export files (from `--output`) into a single file, dropping summary rows and keeping only individual-file features; supports cross-format joins (GPKG, GeoJSON, CSV)
- Enhanced `--output` export: support single-file input, auto-detect GeoJSON/CSV format from extension, `export_to_file()` Python API, proper date fields (`tbox_start`/`tbox_end`), convex hull geometry export, `--format` interaction for CSV (#21)
- Add metadata-only extraction for InvenioRDM instances (`metadata.locations` GeoJSON, `metadata.dates`)
- Implement metadata-only extraction for Figshare provider (`download_data=False` / `--no-data-download`), supporting the `--metadata-first` strategy (#68)
- Expand Figshare provider to support institutional portal URLs (`*.figshare.com`, e.g. `springernature.figshare.com`, `monash.figshare.com`)
- Add `--time-format` CLI option and `time_format` API parameter for configurable temporal extent output format: date-only (default), ISO 8601, or custom strftime strings (#39)
- Add `--metadata-first` CLI flag and `metadata_first` API parameter for a smart metadata-then-download extraction strategy: tries metadata-only extraction first if the provider supports it, falls back to data download if metadata yields no results
- Add automatic metadata fallback: when data download yields no files and the provider supports metadata extraction, automatically fall back to metadata-only extraction (enabled by default, disable with `--no-metadata-fallback` or `metadata_fallback=False`)
- DEIMS-SDR provider now follows external DOIs/URLs to other supported providers (e.g., Zenodo, PANGAEA) for actual data extent extraction; disable with `--no-follow` or `follow=False`
- Extract temporal extent from raster files: NetCDF CF time dimensions, GeoTIFF `TIFFTAG_DATETIME`, ACDD `time_coverage_start`/`time_coverage_end` global attributes, and band-level `ACQUISITIONDATETIME` (IMAGERY domain) (#22)
- Add `--assume-wgs84` CLI flag and `assume_wgs84` API parameter to explicitly enable WGS84 fallback for ungeoreferenced rasters (disabled by default)
- Add support for Esri File Geodatabase (`.gdb`) format via GDAL's OpenFileGDB driver
- Add support for Zarr format (`.zarr`) via GDAL's Zarr driver (V2 and V3), including directory-based dataset handling in CLI and directory extraction (#9)
- All content providers now support interactive download size confirmation via `--max-download-size`: when the total download exceeds the limit, the CLI prompts for confirmation instead of silently truncating; API: `download_size_soft_limit=True` in `from_remote()`
- Add `-p`/`--parallel` CLI flag and `workers` API parameter for parallel file extraction within directories using thread-based parallelism; `-p` auto-detects CPU count, `-p N` uses N workers; API: `from_directory(..., workers=N)` and `from_remote(..., workers=N)` (#34)
- Add `progress_callback` parameter to `from_file()`, `from_directory()`, and `from_remote()` for structured progress reporting; callbacks receive `ProgressEvent` dataclass instances with phase, message, current/total counters, and byte-level download progress; three built-in callbacks: `TqdmProgressCallback` (tqdm bars), `LoggingProgressCallback` (logger), `CollectingProgressCallback` (list collection for testing); see the API docs for usage (#80)
- Add `--map`, `--preview`, and `--map-dim` CLI flags for static map preview of extracted spatial extents on OpenStreetMap tiles (requires `pip install geoextent[preview]`); `--map` saves a PNG to a temporary file, `--map FILE` saves to a specific path, `--preview` displays in the terminal using `term-image` (auto-detects Kitty/iTerm2/Sixel, falls back to Unicode blocks) or external tools, and `--map-dim WxH` sets image dimensions; the saved path is printed to stderr unless `--quiet` is used (#35)
- Add AppImage packaging for portable Linux distribution: single-file executable bundling Python, GDAL, PROJ, and all dependencies via conda-forge + appimagetool with zstd compression; CI workflow builds on tag push and attaches to GitHub Releases (#40)
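The `progress_callback` entry above says callbacks receive `ProgressEvent` instances. The sketch below mimics that shape in plain Python so it runs standalone; the stand-in dataclass and its field names are assumptions based on the changelog wording (phase, message, counters, byte-level progress), not the library's actual definition:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ProgressEvent:
    """Stand-in for geoextent's ProgressEvent (field names assumed
    from the changelog description; the real class ships with the
    library)."""
    phase: str
    message: str
    current: int = 0
    total: int = 0
    bytes_done: Optional[int] = None
    bytes_total: Optional[int] = None


def make_collecting_callback(sink: List[ProgressEvent]) -> Callable[[ProgressEvent], None]:
    """Build a callback in the spirit of CollectingProgressCallback:
    it records every event into a list, handy for tests."""
    def callback(event: ProgressEvent) -> None:
        sink.append(event)
    return callback


# Simulated usage -- real events would be emitted by from_remote() etc.
events: List[ProgressEvent] = []
cb = make_collecting_callback(events)
cb(ProgressEvent("download", "fetching data.zip", current=1, total=3))
cb(ProgressEvent("extract", "reading layer", current=2, total=3))
print(len(events), events[0].phase)  # 2 download
```

A callback of this form would be passed as `progress_callback=cb`; the library's built-in `TqdmProgressCallback` and `LoggingProgressCallback` follow the same call pattern.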
### Breaking Changes

- Drop support for bare numeric Zenodo record IDs (e.g., `820562`); use the DOI (`10.5281/zenodo.820562`) or URL (`https://zenodo.org/records/820562`) instead
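Callers who previously stored bare record IDs can rebuild the accepted forms themselves. A hypothetical helper (not part of geoextent) following the pattern shown above:

```python
def zenodo_identifier(record_id: int, kind: str = "doi") -> str:
    """Hypothetical helper: turn a bare Zenodo record ID into the DOI
    or URL form that the provider still accepts after this change."""
    if kind == "doi":
        return f"10.5281/zenodo.{record_id}"
    if kind == "url":
        return f"https://zenodo.org/records/{record_id}"
    raise ValueError(f"unknown kind: {kind!r}")


print(zenodo_identifier(820562))          # 10.5281/zenodo.820562
print(zenodo_identifier(820562, "url"))   # https://zenodo.org/records/820562
```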
### Bug Fixes

- Skip raster files with pixel-based coordinates (outside WGS84 bounds) instead of merging them into geographic extents
- Validate bounding boxes against WGS84 coordinate ranges before including them in results
- Reject vector files with projected coordinates falsely reported as WGS84 (e.g., GeoJSON files without CRS declaration)
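The WGS84 range validation described in these fixes amounts to a simple coordinate check. A sketch of that kind of check (not the library's actual validator):

```python
def is_valid_wgs84_bbox(minlon: float, minlat: float,
                        maxlon: float, maxlat: float) -> bool:
    """Accept a bounding box only if every coordinate lies within
    WGS84 bounds and min <= max on both axes. Pixel- or metre-based
    coordinates (e.g. from projected CRS mislabelled as WGS84) fall
    far outside these ranges and are rejected."""
    return (
        -180.0 <= minlon <= maxlon <= 180.0
        and -90.0 <= minlat <= maxlat <= 90.0
    )


print(is_valid_wgs84_bbox(5.9, 47.3, 15.0, 55.1))            # plausible degrees
print(is_valid_wgs84_bbox(0.0, 0.0, 4500000.0, 5600000.0))   # projected metres
```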
### API Changes

- Rename all public API functions from camelCase to snake_case per PEP 8: `from_file()`, `from_directory()`, `from_remote()` (#31)
- Remove old camelCase aliases (`fromFile`, `fromDirectory`, `fromRemote`)
- Rename internal handler modules: `handleCSV` → `handle_csv`, `handleRaster` → `handle_raster`, `handleVector` → `handle_vector`
- Rename all internal handler functions to snake_case (e.g., `checkFileSupported` → `check_file_supported`, `getBoundingBox` → `get_bounding_box`)
### Improvements

- Document installation with uv, conda/mamba, Poetry, and pipx (#4, #5, #41)
- Fix and enable skipped multi-input CLI tests, add convex hull geometry tests for 2–5 file inputs, and add documentation for multiple input processing
- Use GDAL CSV driver open options for coordinate column detection, supporting GDAL column naming conventions (`X`/`Y`, `Easting`/`Northing`) and CSVT sidecar files (#53)
- Add GeoCSV format support: recognise `CoordX`/`CoordY` column names (giswiki.ch GeoCSV spec), `.prj` sidecar files for CRS information, WKT polygon geometry columns, and EarthScope GeoCSV `#`-prefixed metadata header lines (#52)
- Move content provider metadata into provider classes (`provider_info()` classmethod), eliminating duplication in `features.py`
- Verify bare UUIDs against the BGR CSW catalog and Opara DSpace API before accepting them, preventing misrouting between providers
- Correct 4TU.ResearchData platform description: uses Djehuty, not Figshare
- Automatically skip restricted files in Dataverse downloads
## 0.12.0

### Breaking Changes

- Default coordinate order for plain bounding boxes is now EPSG:4326 native axis order: (latitude, longitude)
- Bounding boxes are returned as `[minlat, minlon, maxlat, maxlon]` instead of `[minlon, minlat, maxlon, maxlat]`
- GeoJSON output always uses `[longitude, latitude]` coordinate order per RFC 7946
- Add `--legacy` CLI flag and `legacy=True` Python API parameter to restore the previous `[lon, lat]` order for plain bounding boxes (does not affect GeoJSON output)
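Code written against pre-0.12.0 releases can also convert the new axis order itself instead of passing `legacy=True`. A hypothetical one-liner helper mirroring what the flag does for plain bounding boxes:

```python
from typing import List


def to_legacy_order(bbox: List[float]) -> List[float]:
    """Convert a bbox from the new (lat, lon) axis order
    [minlat, minlon, maxlat, maxlon] back to the pre-0.12.0
    [minlon, minlat, maxlon, maxlat] order.

    Hypothetical helper -- the library's own legacy=True / --legacy
    option does this for you; GeoJSON output is unaffected either way.
    """
    minlat, minlon, maxlat, maxlon = bbox
    return [minlon, minlat, maxlon, maxlat]


print(to_legacy_order([47.3, 5.9, 55.1, 15.0]))  # [5.9, 47.3, 15.0, 55.1]
```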
### Bug Fixes

- Skip layers with degenerate extent `[0,0,0,0]` and emit a user-visible warning instead of silently including invalid coordinates
- Fix resource leak for GeoPackage/SQLite-backed files (unclosed database connections)
### Improvements

- Enable parallel test execution by default using `pytest-xdist` (#38)
- Refactor CRS extraction into a shared utility function
- Harden CSV handler: force the CSV GDAL driver to prevent misidentification, add extension-based pre-filtering, improve geometry column detection
- Optimize gazetteer queries to avoid duplicate API calls for closed polygon points
- Suppress noisy pandas date-parsing warnings during temporal extent detection
## 0.11.0

- Add `--ext-metadata` option to retrieve bibliographic metadata for DOIs from CrossRef and DataCite APIs
- Add `--ext-metadata-method` option to control the metadata source (`auto`, `all`, `crossref`, `datacite`)
- Add display names to file format handlers
- Fix unwanted coordinate flipping for GML bounding boxes with GDAL >= 3.2
## 0.10.0

- `fromRemote()` now accepts both single identifiers (string) and multiple identifiers (list)
- Add `--list-features` CLI option and `get_supported_features()` Python API for discovering supported file formats and content providers
- Add `validate_remote_identifier()` and `validate_file_format()` validation functions
## 0.9.0

### Content Providers

- Add TU Dresden Opara content provider supporting DSpace 7.x repositories (#77)
- Add GFZ Data Services as content provider (#17)
- Add Dataverse repository support (#57)
- Add support for OSF (Open Science Framework) (#19)
- Add support for Dryad and Figshare repositories
- Add support for Pensoft journals (#64)
- Enhance PANGAEA provider with non-tabular data support and ZIP archive handling
- Add download size limiting for repositories (#70)
- Add `--no-data-download` option for metadata-only extraction
- Add `--download-skip-nogeo` option to skip non-geospatial files
- Add `--max-download-size`, `--max-download-workers`, and `--max-download-method` options
- Restructure regex patterns for better repository candidate detection
### Output and Visualization

- Add extraction metadata to GeoJSON output (`geoextent_extraction` field)
- Add `--no-metadata` option to exclude extraction metadata from GeoJSON output
- Add geojson.io URL generation (`--geojsonio`)
- Add `--browse` option to open visualizations in the default web browser
- Add WKT and WKB output format support (#46)
- Add convex hull extraction (`--convex-hull`) (#37)
### CLI and Processing

- Add `--no-subdirs` option to control recursive directory processing (#55)
- Add support for processing multiple files with automatic extent merging
- Add progress bars with `--no-progress` option to disable (#32)
- Add `--quiet` option to suppress all console messages
- Add `--placename` option for geographic placename lookup via GeoNames, Nominatim, and Photon (#74)
- Add file filtering and parallel downloads (#75)
- Add FlatGeobuf format support (#43)
### Infrastructure

- Refactor CI workflows to use a custom GDAL installation script instead of pygdal
- Run code formatter (#54)
- Skip GDAL auxiliary files during directory processing
- Fix Figshare URL validation and download handling
## 0.8.0

- Move configuration from `setup.py` to `pyproject.toml`
## 0.7.1

## 0.6.0

- Add details option `--details` for folders and ZIP files (#116)

## 0.5.0

## 0.4.0

- Add support for ZIP files and folders (#79)
## 0.3.0

## 0.2.0

## 0.1.0

- Initial release with core functionality