Extracting extents from free text

geoextent can pull spatial and temporal extents out of unstructured English prose using spaCy named entity recognition together with a place-name gazetteer and a bundled time-period gazetteer (ICS GTS2020). This page is a tour of the feature: setup, place names, calendar dates, named time periods, the signed ISO 8601 output format, match highlighting, ambiguity policy, and how to turn the source off.

For the offset contract that underpins highlighting and tool integration, see Highlighting matches from the text/NER source. Issue #112 tracks the feature; follow-ups #113 (Wikidata period gazetteer) and #114 (HTML and Web Annotation export) are open.

One-time setup

Install the optional [nlp] extra and a spaCy English model:

pip install 'geoextent[nlp]'
python -m spacy download en_core_web_sm

The default forward-gazetteer is Nominatim (no API key required); the default model is en_core_web_sm (auto-downloaded on first use unless --no-auto-download is passed). Without the [nlp] extra the text handler silently declines, so a directory containing README.md is not suddenly NER-ed just because you upgraded geoextent.

Place names from text

The defaults make place-name extraction work out of the box:

geoextent -b --text "Field campaigns in Berlin and Paris"

That prints a GeoJSON FeatureCollection with each resolved place under place_names. Note that by default geoextent drops ambiguous mentions — “Paris” has many homonyms (Paris/France, Paris/Texas, Paris/Ontario, …), so Nominatim returns several candidates and the safe default refuses to guess. To keep the top-ranked match:

geoextent -b --ner-ambiguity top \
    --text "Field campaigns in Berlin and Paris"

Other input modes:

# Stdin
echo "Workshops in Tokyo and London" | geoextent -b -

# A single text file
geoextent -b tests/testdata/text/cities.txt

# A whole directory of text files
geoextent -b tests/testdata/text/

# Mixed text + geospatial in one call
geoextent -b \
    tests/testdata/text/mixed_dir/cities.txt \
    tests/testdata/text/mixed_dir/point.geojson

# Different gazetteer (Nominatim is already the default; GeoNames needs
# the GEONAMES_USERNAME env var or a .env file)
geoextent -b --ner-gazetteer photon \
    --text "Saxony, Bavaria, and Brandenburg"

Boundary geometries

Some gazetteers can return more than a centroid for areal features. By default, geoextent uses an administrative boundary or other areal polygon when one is available, and falls back to the centroid point otherwise. The classic case is a state name like Saxony:

geoextent -b --ner-ambiguity top --text "Field campaign in Saxony"

With Nominatim (the default gazetteer), Saxony resolves to OSM relation 62467 and the gazetteer returns the state’s MultiPolygon. The emitted bbox is the polygon’s envelope (roughly [11.87, 50.17, 15.04, 51.69]) and the place provenance carries the geometry under boundary:

"place_names": [{
  "name": "Saxony",
  "gazetteer_id": "osm:relation:62467",
  "gazetteer_url": "https://www.openstreetmap.org/relation/62467",
  "lat": 50.93, "lon": 13.46,         // centroid, still emitted
  "boundary": {"type": "MultiPolygon", "coordinates": [...]}
}]

Force a centroid (e.g. for sensors that expect single-point geometry) with --place-geometry point:

geoextent -b --place-geometry point \
    --ner-ambiguity top --text "Field campaign in Saxony"
# → bbox is the centroid (degenerate point envelope)

--place-geometry auto (default) uses the boundary when present and silently falls back to the point when absent. boundary is the same as auto today; a future release may make it stricter (warn / error on fallback). GeoNames and Photon return only centroid points for the geopy interface geoextent uses, so this knob has no effect with those backends — the spatial extent will always be the point envelope.

Convex hull on mixed geometries

With --convex-hull the spatial extent is the convex hull of all matched gazetteer hits — polygon hits contribute their boundary vertices, point hits contribute their centroid, and the union is hulled together. This makes --convex-hull useful for three distinct shapes of input:

Single polygon hit — polygon simplification. If only one place is matched and it has a boundary, the result is the convex hull of that polygon’s vertices. For an already-convex shape the result equals the boundary; for an irregular shape it acts as a simplification:

geoextent -b --convex-hull --ner-ambiguity top --text "Field campaign in Saxony"
# → "bbox" is a closed polygon ring covering the Saxony outline

Polygon + outside point — extended hull. A polygon plus a far-away point extends the hull to enclose both. The hull of “Saxony” (covering ~11.9–15.0°E, 50.2–51.7°N) and “Berlin” (~13.4°E, 52.5°N) reaches north beyond Saxony to include Berlin:

geoextent -b --convex-hull --ner-ambiguity top \
    --text "Field campaigns in Saxony and Berlin"
# → "bbox" is a closed polygon ring whose northern boundary touches Berlin

Multiple point hits. Two or three point hits (cities) produce a line segment or polygon hull — the same behaviour as before:

geoextent -b --convex-hull --ner-ambiguity top \
    --text "Field campaigns in Berlin, Paris, and Tokyo"

--place-geometry point forces the hull to be computed from centroids even when polygons are available, which can be useful if you want a city-to-city skeleton and not a country-wide hull.

Viewing the extent on geojson.io and the 150 KB payload limit

--geojsonio produces a clickable URL that opens the extracted extent on https://geojson.io. The URL embeds the GeoJSON directly in its #data=… fragment, so the only practical limit is how much GeoJSON fits in a URL:

geoextent -b --convex-hull --geojsonio \
    --ner-ambiguity top --text "Field campaigns in Berlin and Reykjavik"
# → http://geojson.io/#data=data:application/json,%7B%22type%22%…

The 150 KB threshold. geojson.io itself does not document a maximum payload size — see the upstream URL API reference. The limit comes from the geojsonio Python wrapper that geoextent uses to build the URL: it defines MAX_URL_LEN = 150e3 (150 000 bytes of GeoJSON content) and, for anything larger, falls back to uploading the GeoJSON as an anonymous GitHub Gist and embedding the gist ID in the URL instead.

Why the fallback fails today. GitHub no longer permits anonymous gist creation (the API returns 401 Requires authentication), so the fallback always fails for oversize payloads. geoextent surfaces this as:

geojson.io URL could not be generated — geojson.io service call
failed: geojsonio.make_url → GitHub Gist API (anonymous gist
fallback for GeoJSON > ~150 KB): 401 Requires authentication
(payload size 331222 bytes) — try --convex-hull to reduce geometry
complexity, or drop optional fields that bloat properties

What pushes a text-NER extent over 150 KB. The geometry itself is usually small (a convex hull or a 4-corner envelope is a few hundred bytes). The bloat comes from the place_names[*].boundary polygons that Nominatim returns for administrative areas — a single Berlin or Saxony boundary is 50–200 KB of coordinates. --convex-hull already strips boundaries from the provenance once it has consumed them for the hull, so the common fix is:

# Was 324 KB → 401 from the gist fallback
geoextent -b --geojsonio --placename --text "Workshops in Berlin"

# 2.5 KB → URL-fragment path, no gist, works
geoextent -b --convex-hull --geojsonio --placename \
    --text "Workshops in Berlin"

Other ways to shrink the payload below 150 KB:

  • --place-geometry point also drops boundaries from provenance.

  • Cap the number of mentions (--ner-ambiguity drop skips ambiguous ones; you can also write tighter input text).

  • Use --ner-gazetteer photon or --ner-gazetteer geonames — neither returns admin polygons, so the provenance stays small.

  • Save the GeoJSON to a file (geoextent -b > extent.geojson) and upload via the geojson.io GUI’s “Open → File” instead.

If you need to render a > 150 KB extent and don’t want to depend on external services, --map FILE writes a local PNG preview without involving geojson.io at all.

Combining a text input with a local geospatial file

The CLI accepts --text together with positional file or directory inputs in a single call. --text (and - stdin) is treated as one more source, peer to the positional inputs, and all sources are merged into a single envelope / convex hull by default — the same behaviour you get from multiple positional files. Use --details to inspect each source separately; the merged top-level extent is kept either way:

# Mix a free-text mention with a GeoJSON file. The bbox spans Berlin
# (from --text) and Tokyo (from the file).
geoextent -b --ner-ambiguity top \
    --text "Field campaigns in Berlin" \
    tests/testdata/text/mixed_dir/point.geojson

# Convex hull across text + file, with temporal extent merged too.
# Denmark + Belgium (from --text), Berlin + Reykjavik (from cities.txt);
# tbox spans 2021–2023.
geoextent -b -t --convex-hull \
    --text "Travelling from Denmark to Belgium in 2021 and 2023" \
    tests/testdata/text/mixed_dir/cities.txt

# Same as above, with --geojsonio appended to also print a clickable
# geojson.io URL covering the four-country hull. Stays well under the
# 150 KB URL-fragment limit because --convex-hull strips per-place
# boundary polygons from the provenance.
geoextent -b -t --convex-hull \
    --text "Travelling from Denmark to Belgium in 2021 and 2023" \
    --geojsonio \
    tests/testdata/text/mixed_dir/cities.txt

# Add --details to inspect the per-source extents under
# geoextent_extraction.details.
geoextent -b --details --ner-ambiguity top \
    --text "Field campaigns in Berlin" \
    tests/testdata/text/mixed_dir/point.geojson

The same shape works for any positional input: a Shapefile, a GeoTIFF, a directory of geospatial files, a DOI / repository URL, or stdin (-). Mixed runs through the Python API use the multi-input call:

from geoextent.lib import extent

# 1) Inline text alone
text_result = extent.from_text(
    "Field campaigns in Berlin",
    bbox=True, ner_ambiguity="top",
)

# 2) A local file alone
file_result = extent.from_file("path/to/point.geojson", bbox=True)

# 3) Merge in the application layer if you need a combined envelope.
from geoextent.lib import helpfunctions as hf
merged = hf.bbox_merge(
    {"text": text_result, "file": file_result},
    "multi-input",
)
print(merged["bbox"])

Tuning what spaCy picks up:

# Only keep GPE (geo-political entities) — countries, cities, regions
geoextent -b --ner-labels GPE \
    --text "Hiking in the Alps near Munich and along the Rhine"

# Use a larger model (must be installed separately):
#   python -m spacy download en_core_web_md
geoextent -b --ner-model en_core_web_md \
    --text "Berlin and Paris"

Calendar dates from text

The temporal pipeline understands four shapes of date expressions:

geoextent -t --text "Field measurements in May 2024"
# → "tbox": ["2024-05-01", "2024-05-31"]   (month envelope)

geoextent -t --text "Records from the 1990s"
# → "tbox": ["1990-01-01", "1999-12-31"]   (decade envelope)

geoextent -t --text "Records from the 19th century"
# → "tbox": ["1801-01-01", "1900-12-31"]   (century envelope)

geoextent -t --text "Monitoring ran between 2010 and 2015"
# → "tbox": ["2010-01-01", "2015-12-31"]   (range splitter)

Range detection handles between X and Y, from X to Y, en-dashes (X–Y), em-dashes (X—Y), plain ASCII hyphens (X-Y), and to/until/through/and connectors.

Two phrasings, two provenance paths, same envelope

A useful comparison — the same time window expressed two ways yields identical tbox envelopes but very different mention provenance:

geoextent -t \
  --text "Field campaigns in Berlin and Paris ending in March 2022 and beginning in June 2021"

Result (extract):

"tbox": ["2021-06-01", "2022-03-31"],
"date_entities": [
  {"text": "March 2022", "kind": "date",
   "start": "2022-03-01", "end": "2022-03-31"},
  {"text": "June 2021",  "kind": "date",
   "start": "2021-06-01", "end": "2021-06-30"}
]

spaCy emits two independent DATE entities; each is parsed independently (each into a month envelope), and tbox is the envelope of the envelopes.

Now the same window in a single phrase:

geoextent -t \
  --text "Field campaigns in Berlin and Paris from June 2021 to March 2022"

Result (extract):

"tbox": ["2021-06-01", "2022-03-31"],
"date_entities": [
  {"text": "June 2021 to March 2022", "kind": "date",
   "start": "2021-06-01", "end": "2022-03-31"}
]

spaCy emits one DATE span spanning both endpoints; geoextent’s range splitter recognises the to connector, parses each side, and returns the merged envelope as a single mention.

Both phrasings produce the same tbox because the envelope-of-envelopes and the explicit-range computations converge. The difference shows up in date_entities: the first phrasing carries two mentions, the second carries one. For downstream tools that highlight matches in the source, this distinction matters — the second phrasing yields a single span to underline ("June 2021 to March 2022"); the first yields two non-contiguous spans.

Named time periods

Beyond calendar dates, geoextent recognises geological time periods using the bundled International Chronostratigraphic Chart (ICS GTS2020, CC0, ~178 eons / eras / periods / epochs / ages). Period detection runs as a spaCy PhraseMatcher over the gazetteer’s label index — which means it catches mentions that en_core_web_sm mislabels (Holocene as ORG, Mesozoic Era as ORG, Bronze Age as PERSON) or misses entirely (Pleistocene, Late Cretaceous):

geoextent -t --text "Sediment cores from the Holocene"
# → "tbox": ["-9750-01-01", "1950-01-01"]

geoextent -t --text "Late Cretaceous fossils dominate the section"
# → "tbox": ["-100498050-01-01", "-65998050-01-01"]

geoextent -t --text "Pleistocene cores below the modern surface"
# → "tbox": ["-2578050-01-01", "-9750-01-01"]

Resolved periods carry the same provenance shape as places — a gazetteer_id (ics:Holocene) and gazetteer_url pointing to the canonical resource on resource.geosciml.org. Disable period matching with --no-period-resolution if you only want calendar-date parsing.

Combining periods and dates

geoextent -t \
  --text "Pleistocene cores near Berlin re-surveyed on 2024-05-12"
# → "tbox": ["-2578050-01-01", "2024-05-12"]

The deep-time start and the CE-date end coexist in the same envelope; the tbox merge falls back to numeric signed-year comparison when any mention is pre-CE.

Signed ISO 8601 dates for pre-CE / deep time

Python’s stdlib datetime cannot represent year 0 or negative years, so geological periods are emitted as signed ISO 8601 extended year strings:

  • Holocene start: -9750-01-01

  • Pleistocene start: -2578050-01-01

  • Mesozoic Era start: -251900050-01-01

The sign and the at-least-four-digit year width are fixed; larger years extend the width as needed. --time-format is not applied to deep-time mentions (those rely on the signed-ISO contract); CE-only output continues to honour the format flag exactly as before, byte-for-byte.

Highlighting matches

The CLI can render the source string with matched spans wrapped for display:

geoextent -b -t --annotate brackets \
  --text "Sediment cores in Berlin span the Holocene; resurvey on 2024-05-12"

Output (after the JSON):

---annotated source (brackets)---
Sediment cores in [[Berlin|place]] span the [[Holocene|period]]; resurvey on [[2024-05-12|date]]

Modes:

  • --annotate auto (default) — ANSI colour when stdout is a TTY, brackets otherwise, mirroring grep --color=auto.

  • --annotate ansi — force ANSI SGR colours (terminal preview).

  • --annotate brackets — force [[surface|kind]] markers (pipelines, log capture, chat clients).

  • --annotate off — suppress.

  • --annotate-classes "place=cyan,date=yellow,period=magenta" — override colours per kind.

Each mention also carries char_start / char_end offsets into the echoed source_text so consumers can render their own highlights. See Highlighting matches from the text/NER source for the contract details and a JavaScript / Java re-encoding recipe.

Ambiguity policy

Both gazetteers (place-name and time-period) have an --ner-ambiguity / --period-ambiguity knob with the same two values:

  • drop (default) — refuse to choose when more than one candidate is returned. Defensive: a “Paris” mention without disambiguating context is dropped rather than silently bound to the wrong city. The first time a name is dropped, geoextent logs a WARNING to stderr naming the place, the gazetteer candidates that triggered the drop, and the exact flag (--ner-ambiguity top) to flip the policy.

  • top — keep the highest-ranked candidate.

Repeat drops of the same name in one run only warn once, to keep the log quiet when a long directory mentions the same ambiguous town many times.

geoextent -b --ner-ambiguity top --text "We met in Paris and Berlin"
geoextent -t --period-ambiguity top --text "Iron Age burials"

The drop policy preserves provenance: dropped mentions still appear in place_names / date_entities with matched: false and the full candidate_count.

Turning text extraction off

If you process a directory of structured data and want to be sure no README.md (or similar) is fed to spaCy:

geoextent -b -t --text-method none path/to/data_dir

--text-method none disables the text handler entirely; .txt and .md files then fall back to other handlers (e.g. tab-delimited Darwin Core occurrence files via the CSV handler) or are skipped.

Python API

The same surface is available as geoextent.lib.extent.from_text() for in-memory strings and as the standard handler for from_file() / from_directory():

from geoextent.lib import extent

result = extent.from_text(
    "Sediment cores in Berlin span the Holocene; resurvey on 2024-05-12.",
    bbox=True, tbox=True,
    ner_ambiguity="top",        # keep top hit (Berlin is unambiguous, but
                                # "Paris" or "Springfield" would otherwise drop)
    period_ambiguity="top",     # same idea for the ICS gazetteer
)

print(result["bbox"])   # → [13.41, 52.52, 13.41, 52.52]
print(result["tbox"])   # → ["-9750-01-01", "2024-05-12"]

for rec in result["place_names"]:
    print("place ", rec["name"], "→", rec.get("gazetteer_url"))
for rec in result["date_entities"]:
    print(rec["kind"], rec["text"], "→", rec.get("start"), rec.get("end"))

Listing the bundled period gazetteer

Downstream tools (UIs, autocomplete widgets, reference docs) often need the full list of periods that geoextent recognises, together with the licence and provenance of the underlying data. Two paths are provided:

CLI--list-periods prints the bundled gazetteer to stdout:

# Full JSON output with metadata block + 178 period records
geoextent --list-periods

# Plain-text table for terminal scanning
geoextent --list-periods --list-periods-format text

# Filter by substring (case-insensitive, matches name and aliases)
geoextent --list-periods --list-periods-filter Holo
geoextent --list-periods --list-periods-format text --list-periods-filter Mesozoic

The header of the output carries the provenance: source repository, the exact upstream commit SHA, the build timestamp, the licence URL, and an attribution string suitable for embedding in a UI footer.

Pythongeoextent.lib.period_gazetteer.list_periods() returns the same data as a dict:

from geoextent.lib.period_gazetteer import list_periods

data = list_periods()
print(data["source"], data["source_revision"], data["built_at"])
for rec in data["periods"][:3]:
    print(rec["name"], rec["start"], "..", rec["end"], rec["url"])

# Narrow the list (substring on name or any alias)
holos = list_periods(name_filter="holocene")
assert holos["period_count"] == 1

# Drop the metadata block — useful when the consumer already knows the
# provenance and only needs the records themselves.
bare = list_periods(include_metadata=False)
assert set(bare.keys()) == {"periods", "period_count"}

The dict shape matches the on-disk geoextent/lib/data/periods.json; schema_version lets consumers detect a future layout shift. The file’s metadata block is reproduced here in full:

{
  "name": "geoextent bundled period gazetteer",
  "schema_version": "1.0",
  "source": "ICS International Chronostratigraphic Chart (GTS2020)",
  "source_url": "https://github.com/CGI-IUGS/timescale-data",
  "source_file": ".../rdf/isc2020.ttl",
  "source_revision": "<upstream commit SHA>",
  "source_revision_date": "<commit date, ISO 8601 UTC>",
  "license": "CC0-1.0",
  "license_url": "https://creativecommons.org/publicdomain/zero/1.0/",
  "attribution": "International Chronostratigraphic Chart ... CC0-1.0 ...",
  "built_at": "<build timestamp, ISO 8601 UTC>",
  "built_by": "geoextent tools/build_periods_data.py",
  "ma_bp_origin_year": 1950,
  "period_count": 178,
  "periods": [ ... ]
}

To refresh the bundled data from upstream, run python tools/build_periods_data.py and commit the regenerated periods.json.

Performance notes

  • spaCy + en_core_web_sm is loaded once per process and reused.

  • The forward gazetteer keeps an in-memory (service, query) cache for the run, so duplicate mentions within a directory only hit the network once.

  • Public Nominatim has a 1 req/s rate limit; large batches may benefit from Photon or a self-hosted Nominatim. Set NOMINATIM_USER_AGENT (env var) to identify your application.

Limitations and roadmap

  • English only out of the box (en_core_web_sm). Multi-language models exist on spaCy’s hub but are not exercised by geoextent’s tests.

  • Historical / archaeological periods (Bronze Age, Iron Age, Medieval, Roman, …) are not in the bundled ICS chart. Online Wikidata-backed resolution is tracked in #113.

  • HTML rendering for notebooks and Web Annotation Data Model JSON-LD export are tracked in #114.