==========================================
Extracting extents from free text
==========================================

geoextent can pull spatial and temporal extents out of unstructured English
prose using spaCy named entity recognition together with a place-name
gazetteer and a bundled time-period gazetteer (ICS GTS2020). This page is a
tour of the feature: setup, place names, calendar dates, named time periods,
the signed ISO 8601 output format, match highlighting, ambiguity policy, and
how to turn the source off. For the offset contract that underpins
highlighting and tool integration, see :doc:`highlighting`.

Issue :issue:`112` tracks the feature; follow-ups :issue:`113` (Wikidata
period gazetteer) and :issue:`114` (HTML and Web Annotation export) are open.

One-time setup
==============

Install the optional ``[nlp]`` extra and a spaCy English model:

.. code-block:: bash

   pip install 'geoextent[nlp]'
   python -m spacy download en_core_web_sm

The default forward gazetteer is **Nominatim** (no API key required); the
default model is ``en_core_web_sm`` (auto-downloaded on first use unless
``--no-auto-download`` is passed). Without the ``[nlp]`` extra the text
handler silently declines, so a directory containing ``README.md`` is not
suddenly NER-ed just because you upgraded geoextent.

Place names from text
=====================

The defaults make place-name extraction work out of the box:

.. code-block:: bash

   geoextent -b --text "Field campaigns in Berlin and Paris"

That prints a GeoJSON ``FeatureCollection`` with each resolved place under
``place_names``. Note that by default geoextent **drops** ambiguous mentions:
"Paris" has many homonyms (Paris/France, Paris/Texas, Paris/Ontario, …), so
Nominatim returns several candidates and the safe default refuses to guess.
To keep the top-ranked match:

.. code-block:: bash

   geoextent -b --ner-ambiguity top \
       --text "Field campaigns in Berlin and Paris"

Other input modes:

.. code-block:: bash

   # Stdin
   echo "Workshops in Tokyo and London" | geoextent -b -

   # A single text file
   geoextent -b tests/testdata/text/cities.txt

   # A whole directory of text files
   geoextent -b tests/testdata/text/

   # Mixed text + geospatial in one call
   geoextent -b \
       tests/testdata/text/mixed_dir/cities.txt \
       tests/testdata/text/mixed_dir/point.geojson

   # Different gazetteer (Nominatim is already the default; GeoNames needs
   # the GEONAMES_USERNAME env var or a .env file)
   geoextent -b --ner-gazetteer photon \
       --text "Saxony, Bavaria, and Brandenburg"

Boundary geometries
-------------------

Some gazetteers can return more than a centroid for areal features. By
default, geoextent uses an administrative boundary or other areal polygon
when one is available, and falls back to the centroid point otherwise. The
classic case is a state name like *Saxony*:

.. code-block:: bash

   geoextent -b --ner-ambiguity top --text "Field campaign in Saxony"

With Nominatim (the default gazetteer), Saxony resolves to OSM relation
``62467`` and the gazetteer returns the state's ``MultiPolygon``. The emitted
``bbox`` is the polygon's envelope (roughly ``[11.87, 50.17, 15.04, 51.69]``)
and the place provenance carries the geometry under ``boundary``:

.. code-block:: jsonc

   "place_names": [{
     "name": "Saxony",
     "gazetteer_id": "osm:relation:62467",
     "gazetteer_url": "https://www.openstreetmap.org/relation/62467",
     "lat": 50.93, "lon": 13.46,  // centroid, still emitted
     "boundary": {"type": "MultiPolygon", "coordinates": [...]}
   }]

Force a centroid (e.g. for sensors that expect single-point geometry) with
``--place-geometry point``:

.. code-block:: bash

   geoextent -b --place-geometry point \
       --ner-ambiguity top --text "Field campaign in Saxony"
   # → bbox is the centroid (degenerate point envelope)

``--place-geometry auto`` (default) uses the boundary when present and
silently falls back to the point when absent.
``boundary`` is the same as ``auto`` today; a future release may make it
stricter (warn / error on fallback).

**GeoNames and Photon return only centroid points** for the geopy interface
geoextent uses, so this knob has no effect with those backends: the spatial
extent will always be the point envelope.

Convex hull on mixed geometries
-------------------------------

With ``--convex-hull`` the spatial extent is the convex hull of all matched
gazetteer hits: polygon hits contribute their boundary vertices, point hits
contribute their centroid, and the union is hulled together. This makes
``--convex-hull`` useful for three distinct shapes of input:

**Single polygon hit (polygon simplification).** If only one place is matched
and it has a boundary, the result is the convex hull of that polygon's
vertices. For an already-convex shape the result equals the boundary; for an
irregular shape it acts as a simplification:

.. code-block:: bash

   geoextent -b --convex-hull --ner-ambiguity top --text "Field campaign in Saxony"
   # → "bbox" is a closed polygon ring covering the Saxony outline

**Polygon + outside point (extended hull).** A polygon plus a far-away point
extends the hull to enclose both. The hull of "Saxony" (covering
~11.9–15.0°E, 50.2–51.7°N) and "Berlin" (~13.4°E, 52.5°N) reaches north
beyond Saxony to include Berlin:

.. code-block:: bash

   geoextent -b --convex-hull --ner-ambiguity top \
       --text "Field campaigns in Saxony and Berlin"
   # → "bbox" is a closed polygon ring whose northern boundary touches Berlin

**Multiple point hits.** Two or three point hits (cities) produce a line
segment or polygon hull, the same behaviour as before:

.. code-block:: bash

   geoextent -b --convex-hull --ner-ambiguity top \
       --text "Field campaigns in Berlin, Paris, and Tokyo"

``--place-geometry point`` forces the hull to be computed from centroids even
when polygons are available, which can be useful if you want a city-to-city
skeleton and not a country-wide hull.
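
The way polygon vertices and point centroids end up in one hull can be
illustrated with a small, dependency-free sketch. This is not geoextent's
implementation; it uses Andrew's monotone chain on plain ``(lon, lat)``
tuples, treats coordinates as planar, and the four-corner "Saxony" outline is
a hypothetical stand-in built from the rounded envelope quoted above:

.. code-block:: python

   from typing import Iterable, List, Tuple

   Point = Tuple[float, float]  # (lon, lat), treated as planar coordinates


   def cross(o: Point, a: Point, b: Point) -> float:
       """Z-component of (a - o) x (b - o); > 0 means a counter-clockwise turn."""
       return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


   def convex_hull(points: Iterable[Point]) -> List[Point]:
       """Andrew's monotone chain; returns hull vertices counter-clockwise."""
       pts = sorted(set(points))
       if len(pts) <= 2:
           return pts  # a single point or a line segment is its own hull

       def half(seq):
           chain = []
           for p in seq:
               # Drop the previous vertex while it makes a non-left turn.
               while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                   chain.pop()
               chain.append(p)
           return chain[:-1]  # last vertex is the first of the other half

       return half(pts) + half(reversed(pts))


   # Hypothetical inputs: a coarse outline for a polygon hit and one centroid
   # for a point hit.
   saxony_vertices = [(11.87, 50.17), (15.04, 50.17), (15.04, 51.69), (11.87, 51.69)]
   berlin_centroid = [(13.41, 52.52)]

   hull = convex_hull(saxony_vertices + berlin_centroid)
   print(hull)

Because every vertex is pooled before hulling, a point that lies inside the
polygon changes nothing, while a point outside it (Berlin here) becomes a hull
vertex.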

Viewing the extent on geojson.io and the 150 KB payload limit
-------------------------------------------------------------

``--geojsonio`` produces a clickable URL that opens the extracted extent on
https://geojson.io. The URL embeds the GeoJSON directly in its ``#data=…``
fragment, so the only practical limit is how much GeoJSON fits in a URL:

.. code-block:: bash

   geoextent -b --convex-hull --geojsonio \
       --ner-ambiguity top --text "Field campaigns in Berlin and Reykjavik"
   # → http://geojson.io/#data=data:application/json,%7B%22type%22%…

**The 150 KB threshold.** geojson.io itself does not document a maximum
payload size; see the upstream URL API reference. The limit comes from the
geojsonio Python wrapper that geoextent uses to build the URL: it defines
``MAX_URL_LEN = 150e3`` (150 000 bytes of GeoJSON content) and, for anything
larger, falls back to *uploading the GeoJSON as an anonymous GitHub Gist* and
embedding the gist ID in the URL instead.

**Why the fallback fails today.** GitHub no longer permits anonymous gist
creation (the API returns ``401 Requires authentication``), so the fallback
always fails for oversize payloads. geoextent surfaces this as:

.. code-block:: text

   geojson.io URL could not be generated — geojson.io service call failed:
   geojsonio.make_url → GitHub Gist API (anonymous gist fallback for
   GeoJSON > ~150 KB): 401 Requires authentication (payload size 331222 bytes)
   — try --convex-hull to reduce geometry complexity, or drop optional
   fields that bloat properties

**What pushes a text-NER extent over 150 KB.** The geometry itself is usually
small (a convex hull or a 4-corner envelope is a few hundred bytes). The
bloat comes from the ``place_names[*].boundary`` polygons that Nominatim
returns for administrative areas: a single ``Berlin`` or ``Saxony`` boundary
is 50–200 KB of coordinates. ``--convex-hull`` already strips boundaries from
the provenance once it has consumed them for the hull, so the common fix is:

.. code-block:: bash

   # Was 324 KB → 401 from the gist fallback
   geoextent -b --geojsonio --placename --text "Workshops in Berlin"

   # 2.5 KB → URL-fragment path, no gist, works
   geoextent -b --convex-hull --geojsonio --placename \
       --text "Workshops in Berlin"

Other ways to shrink the payload below 150 KB:

* ``--place-geometry point`` also drops boundaries from provenance.
* Cap the number of mentions (``--ner-ambiguity drop`` skips ambiguous ones;
  you can also write tighter input text).
* Use ``--ner-gazetteer photon`` or ``--ner-gazetteer geonames``; neither
  returns admin polygons, so the provenance stays small.
* Save the GeoJSON to a file (``geoextent -b > extent.geojson``) and upload
  via the geojson.io GUI's "Open → File" instead.

If you need to render a > 150 KB extent and don't want to depend on external
services, ``--map FILE`` writes a local PNG preview without involving
geojson.io at all.

Combining a text input with a local geospatial file
---------------------------------------------------

The CLI accepts ``--text`` together with positional file or directory inputs
in a single call. ``--text`` (and ``-`` stdin) is treated as one more source,
peer to the positional inputs, and **all sources are merged into a single
envelope / convex hull by default**, the same behaviour you get from multiple
positional files. Use ``--details`` to inspect each source separately; the
merged top-level extent is kept either way:

.. code-block:: bash

   # Mix a free-text mention with a GeoJSON file. The bbox spans Berlin
   # (from --text) and Tokyo (from the file).
   geoextent -b --ner-ambiguity top \
       --text "Field campaigns in Berlin" \
       tests/testdata/text/mixed_dir/point.geojson

   # Convex hull across text + file, with temporal extent merged too.
   # Denmark + Belgium (from --text), Berlin + Reykjavik (from cities.txt);
   # tbox spans 2021–2023.
   geoextent -b -t --convex-hull \
       --text "Travelling from Denmark to Belgium in 2021 and 2023" \
       tests/testdata/text/mixed_dir/cities.txt

   # Same as above, with --geojsonio appended to also print a clickable
   # geojson.io URL covering the four-country hull. Stays well under the
   # 150 KB URL-fragment limit because --convex-hull strips per-place
   # boundary polygons from the provenance.
   geoextent -b -t --convex-hull \
       --text "Travelling from Denmark to Belgium in 2021 and 2023" \
       --geojsonio \
       tests/testdata/text/mixed_dir/cities.txt

   # Add --details to inspect the per-source extents under
   # geoextent_extraction.details.
   geoextent -b --details --ner-ambiguity top \
       --text "Field campaigns in Berlin" \
       tests/testdata/text/mixed_dir/point.geojson

The same shape works for any positional input: a Shapefile, a GeoTIFF, a
directory of geospatial files, a DOI / repository URL, or stdin (``-``).

Mixed runs through the Python API use the multi-input call:

.. code-block:: python

   from geoextent.lib import extent
   from geoextent.lib import helpfunctions as hf

   # 1) Inline text alone
   text_result = extent.from_text(
       "Field campaigns in Berlin",
       bbox=True,
       ner_ambiguity="top",
   )

   # 2) A local file alone
   file_result = extent.from_file("path/to/point.geojson", bbox=True)

   # 3) Merge in the application layer if you need a combined envelope.
   merged = hf.bbox_merge(
       {"text": text_result, "file": file_result},
       "multi-input",
   )
   print(merged["bbox"])

Tuning what spaCy picks up:

.. code-block:: bash

   # Only keep GPE (geo-political entities): countries, cities, regions
   geoextent -b --ner-labels GPE \
       --text "Hiking in the Alps near Munich and along the Rhine"

   # Use a larger model (must be installed separately):
   #   python -m spacy download en_core_web_md
   geoextent -b --ner-model en_core_web_md \
       --text "Berlin and Paris"

Calendar dates from text
========================

The temporal pipeline understands four shapes of date expressions:

.. code-block:: bash

   geoextent -t --text "Field measurements in May 2024"
   # → "tbox": ["2024-05-01", "2024-05-31"]   (month envelope)

   geoextent -t --text "Records from the 1990s"
   # → "tbox": ["1990-01-01", "1999-12-31"]   (decade envelope)

   geoextent -t --text "Records from the 19th century"
   # → "tbox": ["1801-01-01", "1900-12-31"]   (century envelope)

   geoextent -t --text "Monitoring ran between 2010 and 2015"
   # → "tbox": ["2010-01-01", "2015-12-31"]   (range splitter)

Range detection handles ``between X and Y``, ``from X to Y``, en-dashes
(``X–Y``), em-dashes (``X—Y``), plain ASCII hyphens (``X-Y``), and
``to``/``until``/``through``/``and`` connectors.

Two phrasings, two provenance paths, same envelope
--------------------------------------------------

A useful comparison: the same time window expressed two ways yields identical
``tbox`` envelopes but very different mention provenance:

.. code-block:: bash

   geoextent -t \
       --text "Field campaigns in Berlin and Paris ending in March 2022 and beginning in June 2021"

Result (extract):

.. code-block:: jsonc

   "tbox": ["2021-06-01", "2022-03-31"],
   "date_entities": [
     {"text": "March 2022", "kind": "date",
      "start": "2022-03-01", "end": "2022-03-31"},
     {"text": "June 2021", "kind": "date",
      "start": "2021-06-01", "end": "2021-06-30"}
   ]

spaCy emits **two** independent ``DATE`` entities; each is parsed
independently (each into a month envelope), and ``tbox`` is the envelope of
the envelopes.

Now the same window in a single phrase:

.. code-block:: bash

   geoextent -t \
       --text "Field campaigns in Berlin and Paris from June 2021 to March 2022"

Result (extract):

.. code-block:: jsonc

   "tbox": ["2021-06-01", "2022-03-31"],
   "date_entities": [
     {"text": "June 2021 to March 2022", "kind": "date",
      "start": "2021-06-01", "end": "2022-03-31"}
   ]

spaCy emits **one** ``DATE`` span covering both endpoints; geoextent's range
splitter recognises the ``to`` connector, parses each side, and returns the
merged envelope as a single mention.
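
The two paths can be mimicked with a short standard-library sketch. It is an
illustration only, not geoextent's parser: the function names and the
connector regex are assumptions, only ``Month YYYY`` operands are handled,
English month names from the default locale are assumed, and ASCII hyphens
are left out so the pattern does not collide with ISO dates:

.. code-block:: python

   import calendar
   import re
   from datetime import date
   from typing import Tuple

   Envelope = Tuple[date, date]

   # English month names (assumes the default C/English locale).
   MONTHS = {name.lower(): i for i, name in enumerate(calendar.month_name) if name}
   # Word connectors named in the text plus en/em dashes; illustrative only.
   CONNECTOR = re.compile(r"\s+(?:to|until|through|and)\s+|\s*[–—]\s*")


   def month_envelope(text: str) -> Envelope:
       """Expand 'June 2021' to the first and last day of that month."""
       month_name, year_text = text.split()
       month, year = MONTHS[month_name.lower()], int(year_text)
       last_day = calendar.monthrange(year, month)[1]
       return date(year, month, 1), date(year, month, last_day)


   def envelope_of(*envelopes: Envelope) -> Envelope:
       """Smallest envelope that contains every input envelope."""
       return min(e[0] for e in envelopes), max(e[1] for e in envelopes)


   def split_range(span: str) -> Envelope:
       """Parse 'from X to Y' / 'between X and Y' into one merged envelope."""
       cleaned = re.sub(r"^(?:from|between)\s+", "", span.strip(), flags=re.I)
       parts = [p for p in CONNECTOR.split(cleaned) if p]
       return envelope_of(*(month_envelope(p) for p in parts))


   # One DATE span handled by the range splitter ...
   one_span = split_range("June 2021 to March 2022")
   # ... versus two independent DATE mentions merged afterwards.
   two_spans = envelope_of(month_envelope("March 2022"), month_envelope("June 2021"))
   print(one_span == two_spans)

In this sketch both routes reduce to the same min/max over month envelopes;
only the number of mentions that feed the merge differs.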

Both phrasings produce the same ``tbox`` because the envelope-of-envelopes
and the explicit-range computations converge. The difference shows up in
``date_entities``: the first phrasing carries two mentions, the second
carries one. For downstream tools that highlight matches in the source, this
distinction matters: the second phrasing yields a single span to underline
(``"June 2021 to March 2022"``); the first yields two non-contiguous spans.

Named time periods
==================

Beyond calendar dates, geoextent recognises geological time periods using the
bundled International Chronostratigraphic Chart (ICS GTS2020, CC0, ~178
eons / eras / periods / epochs / ages). Period detection runs as a spaCy
``PhraseMatcher`` over the gazetteer's label index, which means it catches
mentions that ``en_core_web_sm`` mislabels (Holocene as ``ORG``, Mesozoic Era
as ``ORG``, Bronze Age as ``PERSON``) or misses entirely (Pleistocene, Late
Cretaceous):

.. code-block:: bash

   geoextent -t --text "Sediment cores from the Holocene"
   # → "tbox": ["-9750-01-01", "1950-01-01"]

   geoextent -t --text "Late Cretaceous fossils dominate the section"
   # → "tbox": ["-100498050-01-01", "-65998050-01-01"]

   geoextent -t --text "Pleistocene cores below the modern surface"
   # → "tbox": ["-2578050-01-01", "-9750-01-01"]

Resolved periods carry the same provenance shape as places: a
``gazetteer_id`` (``ics:Holocene``) and a ``gazetteer_url`` pointing to the
canonical resource on ``resource.geosciml.org``. Disable period matching with
``--no-period-resolution`` if you only want calendar-date parsing.

Combining periods and dates
---------------------------

.. code-block:: bash

   geoextent -t \
       --text "Pleistocene cores near Berlin re-surveyed on 2024-05-12"
   # → "tbox": ["-2578050-01-01", "2024-05-12"]

The deep-time start and the CE-date end coexist in the same envelope; the
``tbox`` merge falls back to numeric signed-year comparison when any mention
is pre-CE.
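
A numeric signed-year comparison of this kind can be sketched as follows. The
helper names ``sort_key`` and ``merge_tbox`` are hypothetical and the regex is
an assumption about the string shape; the point is only that ordering happens
on ``(year, month, day)`` integers, since ``datetime`` cannot hold the
negative years:

.. code-block:: python

   import re
   from typing import Iterable, List, Tuple

   # Optional sign, a year of at least four digits, then month and day.
   SIGNED_ISO = re.compile(r"^([+-]?)(\d{4,})-(\d{2})-(\d{2})$")


   def sort_key(stamp: str) -> Tuple[int, int, int]:
       """Turn '-2578050-01-01' into (-2578050, 1, 1) for numeric ordering."""
       match = SIGNED_ISO.match(stamp)
       if match is None:
           raise ValueError(f"not a signed ISO date: {stamp!r}")
       sign, year, month, day = match.groups()
       signed_year = -int(year) if sign == "-" else int(year)
       return signed_year, int(month), int(day)


   def merge_tbox(mentions: Iterable[Tuple[str, str]]) -> List[str]:
       """Envelope of (start, end) pairs, compared as signed integers."""
       starts, ends = zip(*mentions)
       return [min(starts, key=sort_key), max(ends, key=sort_key)]


   tbox = merge_tbox([
       ("-2578050-01-01", "-9750-01-01"),  # Pleistocene, values from above
       ("2024-05-12", "2024-05-12"),       # calendar date mention
   ])
   print(tbox)

The strings themselves are passed through unchanged, so whatever formatting
the mentions arrived with is what ends up in the merged ``tbox``.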

Signed ISO 8601 dates for pre-CE / deep time
============================================

Python's stdlib ``datetime`` cannot represent year 0 or negative years, so
geological periods are emitted as **signed ISO 8601 extended year** strings:

* Holocene start: ``-9750-01-01``
* Pleistocene start: ``-2578050-01-01``
* Mesozoic Era start: ``-251900050-01-01``

The sign and the minimum four-digit year width are fixed; larger years extend
the width as needed. ``--time-format`` is **not** applied to deep-time
mentions (those rely on the signed-ISO contract); CE-only output continues to
honour the format flag exactly as before, byte-for-byte.

Highlighting matches
====================

The CLI can render the source string with matched spans wrapped for display:

.. code-block:: bash

   geoextent -b -t --annotate brackets \
       --text "Sediment cores in Berlin span the Holocene; resurvey on 2024-05-12"

Output (after the JSON):

.. code-block:: text

   ---annotated source (brackets)---
   Sediment cores in [[Berlin|place]] span the [[Holocene|period]]; resurvey on [[2024-05-12|date]]

Modes:

* ``--annotate auto`` (default): ANSI colour when stdout is a TTY, brackets
  otherwise, mirroring ``grep --color=auto``.
* ``--annotate ansi``: force ANSI SGR colours (terminal preview).
* ``--annotate brackets``: force ``[[surface|kind]]`` markers (pipelines, log
  capture, chat clients).
* ``--annotate off``: suppress.
* ``--annotate-classes "place=cyan,date=yellow,period=magenta"``: override
  colours per kind.

Each mention also carries ``char_start`` / ``char_end`` offsets into the
echoed ``source_text`` so consumers can render their own highlights. See
:doc:`highlighting` for the contract details and a JavaScript / Java
re-encoding recipe.

Ambiguity policy
================

Both gazetteers (place-name and time-period) have an ambiguity knob,
``--ner-ambiguity`` and ``--period-ambiguity`` respectively, with the same
two values:

* ``drop`` (default): refuse to choose when more than one candidate is
  returned. Defensive: a "Paris" mention without disambiguating context is
  dropped rather than silently bound to the wrong city. The first time a name
  is dropped, geoextent logs a WARNING to ``stderr`` naming the place, the
  gazetteer candidates that triggered the drop, and the exact flag
  (``--ner-ambiguity top``) to flip the policy.
* ``top``: keep the highest-ranked candidate.

Repeat drops of the same name in one run only warn once, to keep the log
quiet when a long directory mentions the same ambiguous town many times.

.. code-block:: bash

   geoextent -b --ner-ambiguity top --text "We met in Paris and Berlin"
   geoextent -t --period-ambiguity top --text "Iron Age burials"

The drop policy preserves provenance: dropped mentions still appear in
``place_names`` / ``date_entities`` with ``matched: false`` and the full
``candidate_count``.

Turning text extraction off
===========================

If you process a directory of structured data and want to be sure no
``README.md`` (or similar) is fed to spaCy:

.. code-block:: bash

   geoextent -b -t --text-method none path/to/data_dir

``--text-method none`` disables the text handler entirely; ``.txt`` and
``.md`` files then fall back to other handlers (e.g. tab-delimited Darwin
Core occurrence files via the CSV handler) or are skipped.

Python API
==========

The same surface is available as :func:`geoextent.lib.extent.from_text` for
in-memory strings and as the standard handler for
:func:`~geoextent.lib.extent.from_file` /
:func:`~geoextent.lib.extent.from_directory`:

.. code-block:: python

   from geoextent.lib import extent

   result = extent.from_text(
       "Sediment cores in Berlin span the Holocene; resurvey on 2024-05-12.",
       bbox=True,
       tbox=True,
       ner_ambiguity="top",     # keep top hit (Berlin is unambiguous, but
                                # "Paris" or "Springfield" would otherwise drop)
       period_ambiguity="top",  # same idea for the ICS gazetteer
   )

   print(result["bbox"])   # → [13.41, 52.52, 13.41, 52.52]
   print(result["tbox"])   # → ["-9750-01-01", "2024-05-12"]
   for rec in result["place_names"]:
       print("place ", rec["name"], "→", rec.get("gazetteer_url"))
   for rec in result["date_entities"]:
       print(rec["kind"], rec["text"], "→", rec.get("start"), rec.get("end"))

Listing the bundled period gazetteer
====================================

Downstream tools (UIs, autocomplete widgets, reference docs) often need the
full list of periods that geoextent recognises, together with the licence and
provenance of the underlying data. Two paths are provided:

**CLI**: ``--list-periods`` prints the bundled gazetteer to stdout:

.. code-block:: bash

   # Full JSON output with metadata block + 178 period records
   geoextent --list-periods

   # Plain-text table for terminal scanning
   geoextent --list-periods --list-periods-format text

   # Filter by substring (case-insensitive, matches name and aliases)
   geoextent --list-periods --list-periods-filter Holo
   geoextent --list-periods --list-periods-format text --list-periods-filter Mesozoic

The header of the output carries the provenance: source repository, the exact
upstream commit SHA, the build timestamp, the licence URL, and an attribution
string suitable for embedding in a UI footer.

**Python**: :func:`geoextent.lib.period_gazetteer.list_periods` returns the
same data as a dict:

.. code-block:: python

   from geoextent.lib.period_gazetteer import list_periods

   data = list_periods()
   print(data["source"], data["source_revision"], data["built_at"])
   for rec in data["periods"][:3]:
       print(rec["name"], rec["start"], "..", rec["end"], rec["url"])

   # Narrow the list (substring on name or any alias)
   holos = list_periods(name_filter="holocene")
   assert holos["period_count"] == 1

   # Drop the metadata block: useful when the consumer already knows the
   # provenance and only needs the records themselves.
   bare = list_periods(include_metadata=False)
   assert set(bare.keys()) == {"periods", "period_count"}

The dict shape matches the on-disk ``geoextent/lib/data/periods.json``;
``schema_version`` lets consumers detect a future layout shift. The file's
metadata block is reproduced here in full:

.. code-block:: jsonc

   {
     "name": "geoextent bundled period gazetteer",
     "schema_version": "1.0",
     "source": "ICS International Chronostratigraphic Chart (GTS2020)",
     "source_url": "https://github.com/CGI-IUGS/timescale-data",
     "source_file": ".../rdf/isc2020.ttl",
     "source_revision": "",
     "source_revision_date": "",
     "license": "CC0-1.0",
     "license_url": "https://creativecommons.org/publicdomain/zero/1.0/",
     "attribution": "International Chronostratigraphic Chart ... CC0-1.0 ...",
     "built_at": "",
     "built_by": "geoextent tools/build_periods_data.py",
     "ma_bp_origin_year": 1950,
     "period_count": 178,
     "periods": [ ... ]
   }

To refresh the bundled data from upstream, run
``python tools/build_periods_data.py`` and commit the regenerated
``periods.json``.

Performance notes
=================

* spaCy + ``en_core_web_sm`` is loaded **once** per process and reused.
* The forward gazetteer keeps an in-memory ``(service, query)`` cache for the
  run, so duplicate mentions within a directory only hit the network once.
* Public Nominatim has a 1 req/s rate limit; large batches may benefit from
  Photon or a self-hosted Nominatim. Set ``NOMINATIM_USER_AGENT`` (env var)
  to identify your application.
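
The caching and rate-limit notes above can be pictured with a minimal sketch.
``CachedGazetteer`` and its parameters are hypothetical names chosen for
illustration, not geoextent's internals; the lookup callable stands in for
whatever performs the real network request:

.. code-block:: python

   import time
   from typing import Callable, Dict, Optional, Tuple


   class CachedGazetteer:
       """Memoise lookups per (service, query) and space out real requests."""

       def __init__(
           self,
           lookup: Callable[[str, str], object],
           min_interval: float = 1.0,  # seconds between uncached requests
       ) -> None:
           self._lookup = lookup
           self._min_interval = min_interval
           self._cache: Dict[Tuple[str, str], object] = {}
           self._last_call: Optional[float] = None

       def resolve(self, service: str, query: str) -> object:
           key = (service, query)
           if key in self._cache:
               return self._cache[key]  # duplicate mention: no network hit
           if self._last_call is not None:
               wait = self._min_interval - (time.monotonic() - self._last_call)
               if wait > 0:
                   time.sleep(wait)  # stay under the service's rate limit
           self._last_call = time.monotonic()
           self._cache[key] = self._lookup(service, query)
           return self._cache[key]


   # Stand-in lookup that records how often the "network" is touched.
   calls = []


   def fake_lookup(service: str, query: str) -> dict:
       calls.append((service, query))
       return {"name": query, "service": service}


   gazetteer = CachedGazetteer(fake_lookup, min_interval=0.0)
   gazetteer.resolve("nominatim", "Berlin")
   gazetteer.resolve("nominatim", "Berlin")  # served from the cache
   gazetteer.resolve("photon", "Berlin")     # different service, new request
   print(len(calls))

Keying on the service as well as the query keeps results from different
backends apart, so switching gazetteers mid-run never returns a stale hit.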

Limitations and roadmap
=======================

* English only out of the box (``en_core_web_sm``). Multi-language models
  exist on spaCy's hub but are not exercised by geoextent's tests.
* Historical / archaeological periods (Bronze Age, Iron Age, Medieval,
  Roman, …) are not in the bundled ICS chart. Online Wikidata-backed
  resolution is tracked in :issue:`113`.
* HTML rendering for notebooks and Web Annotation Data Model JSON-LD export
  are tracked in :issue:`114`.