Highlighting matches from the text/NER source
When geoextent extracts an extent from free text (--text-method ner /
geoextent.lib.extent.from_text(), see #112), it returns enough
information for another tool — or a human reading the JSON — to point at
the exact words that produced each result. This page is the reference for
that contract.
Three surfaces are available today:
Standoff offsets (always on): every place, date, and named-period mention carries char_start and char_end indices into the source string the extractor used. This is the machine-to-machine surface, and it matches the convention used by spaCy, Hugging Face NER pipelines, and the W3C Web Annotation Data Model.
Source-text echo (on by default; disable with --no-source-text): the result includes the NFC-normalised source under source_text, plus the source_offset_unit and source_normalisation fields describing the offset contract.
--annotate (CLI, opt-in human display): prints the source with matches highlighted in ANSI colour, or wrapped in [[Berlin|place]]-style brackets for non-terminal contexts.
Two further surfaces are tracked as follow-ups: HTML/Markdown rendering for notebooks and web UIs, and Web Annotation Data Model export (see the follow-up issue at #114).
The standoff contract
For every entry in place_names and date_entities you get:
{
    "name": "Berlin",          // place_names; date_entities use "text"
    "char_start": 18,
    "char_end": 24,
    "matched": true,
    "gazetteer_id": "geonames:2950159",
    "gazetteer_url": "https://www.geonames.org/2950159"
}
The slice source_text[char_start:char_end] is guaranteed to equal
name / text.
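The guarantee above is cheap to verify on the consumer side. The following is a minimal sketch, assuming a result dict shaped as described on this page; check_standoff is a hypothetical helper (not part of geoextent) and the sample dict is hand-written rather than real output.

```python
def check_standoff(result):
    """Return records whose offsets do not reproduce their surface text."""
    src = result["source_text"]
    mismatches = []
    # place_names carry the surface form under "name", date_entities under "text".
    for key, field in (("place_names", "name"), ("date_entities", "text")):
        for rec in result.get(key, []):
            if src[rec["char_start"]:rec["char_end"]] != rec[field]:
                mismatches.append((key, rec))
    return mismatches


# Hand-written sample for illustration, not real geoextent output.
sample = {
    "source_text": "Sediment cores in Berlin, resurvey on 2024-05-12.",
    "place_names": [{"name": "Berlin", "char_start": 18, "char_end": 24}],
    "date_entities": [{"text": "2024-05-12", "char_start": 38, "char_end": 48}],
}
print(check_standoff(sample))  # [] when every slice matches
```

A non-empty return value usually means the consumer is slicing a different string than the one the extractor used, for example an un-normalised local copy.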
Offset unit
source_offset_unit is always "python_codepoint": indices count
Unicode code points (Python len(str) semantics, post-PEP 393). This
matters when consuming the result from JavaScript or Java, which count
UTF-16 code units. A safe round-trip from Python offsets to UTF-16 offsets:
def to_utf16_offsets(text, start, end):
    # Each UTF-16 code unit is two bytes, so byte length // 2 counts code units.
    prefix = text[:start].encode("utf-16-le")
    slice_ = text[start:end].encode("utf-16-le")
    return len(prefix) // 2, (len(prefix) + len(slice_)) // 2
Or, in JavaScript, use the source text returned by geoextent directly:
[...text] iterates code points, so [...text].slice(start, end).join("") reproduces the slice.
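For the opposite direction (a JavaScript or Java client reports a UTF-16 selection and the Python side needs code-point offsets), a sketch along the same lines follows. from_utf16_offsets is a hypothetical helper name, and the sketch assumes the offsets do not fall in the middle of a surrogate pair; if they do, the decode step raises UnicodeDecodeError.

```python
def from_utf16_offsets(text, start16, end16):
    """Map UTF-16 code-unit offsets back to code-point offsets in text."""
    encoded = text.encode("utf-16-le")
    # Two bytes per code unit; decoding the prefix and taking its length
    # gives the number of code points before the offset.
    start = len(encoded[: start16 * 2].decode("utf-16-le"))
    end = len(encoded[: end16 * 2].decode("utf-16-le"))
    return start, end


# U+1F30D is one code point but two UTF-16 code units.
text = "\U0001F30D Berlin"
start, end = from_utf16_offsets(text, 3, 9)
print(start, end)        # 2 8
print(text[start:end])   # Berlin
```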
Normalisation
source_normalisation is always "nfc". The extractor normalises
the input to NFC before tokenising; source_text reflects that. If
the caller passed an NFD string (e.g. "Mu\u0308nchen", where ü is written as u plus the combining diaeresis U+0308), the echoed
source_text is the NFC form ("München", with ü as the single code point U+00FC) and the offsets index
into that form, not the original.
This eliminates the family of bugs where é (U+00E9, 1 code point) and
é (e + U+0301, 2 code points) produce different offsets for
the same visual character.
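The effect is easy to reproduce with the standard library. This short sketch uses only unicodedata; the sentence is an invented example, and it shows why offsets computed against the NFC form cannot be applied to an NFD copy of the same text.

```python
import unicodedata

# "u" followed by the combining diaeresis U+0308: the NFD spelling of "ü".
nfd = "Mu\u0308nchen liegt an der Isar"
nfc = unicodedata.normalize("NFC", nfd)

print(len(nfd), len(nfc))   # 26 25

start = nfc.index("Isar")   # offset in the NFC form, as geoextent reports it
end = start + len("Isar")
print(repr(nfc[start:end])) # 'Isar'
print(repr(nfd[start:end])) # ' Isa'  (shifted by the extra code point)
```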
Byte-order marks
A leading U+FEFF (BOM) is stripped before offsets are computed,
both for file inputs (already handled by the text reader) and for
--text/stdin inputs.
Consuming the offsets in Python
The simplest possible consumer that prints matched spans with their gazetteer URL:
from geoextent.lib import extent

result = extent.from_text(
    "Sediment cores in Berlin span the Holocene; resurvey on 2024-05-12.",
    bbox=True, tbox=True, text_method="ner",
    ner_gazetteer="nominatim",
    ner_ambiguity="top",
)
src = result["source_text"]

# Places: slice the echoed source with the standoff offsets.
for rec in result["place_names"]:
    surface = src[rec["char_start"]:rec["char_end"]]
    print(f"place {surface!r:20s} → {rec.get('gazetteer_url')}")

# Dates and named periods share date_entities; "kind" tells them apart.
for rec in result["date_entities"]:
    surface = src[rec["char_start"]:rec["char_end"]]
    kind = rec["kind"]
    resolved = (rec.get("start"), rec.get("end"))
    print(f"{kind:6s} {surface!r:20s} → {resolved}")
Sample output:
place 'Berlin'             → https://www.openstreetmap.org/relation/62422
period 'Holocene'           → ('-9750-01-01', '1950-01-01')
date   '2024-05-12'         → ('2024-05-12', '2024-05-12')
Opting out of the source-text echo
The echoed source_text can be sizable (for long inputs) or sensitive
(for inputs containing private text). Suppress it with the
--no-source-text CLI flag or include_source_text=False API
parameter:
geoextent -b -t --quiet --text-method ner --no-source-text \
--text "Berlin in 1990"
The offsets are still emitted; they just point into the source the caller
keeps locally. --annotate cannot render when source_text is
absent, so combine --no-source-text with --annotate off.
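When the echo is suppressed, the caller's local copy has to match the string the offsets index into. The sketch below assumes that stripping one leading BOM and then applying NFC (the two steps described on this page) is sufficient; canonical_source is a hypothetical helper, the record is hand-written, and any further preprocessing geoextent may perform is not covered.

```python
import unicodedata


def canonical_source(raw):
    """Approximate the string the offsets refer to (assumed steps only)."""
    # Assumption: one leading BOM is dropped first, then NFC is applied.
    if raw.startswith("\ufeff"):
        raw = raw[1:]
    return unicodedata.normalize("NFC", raw)


# Local input with a BOM and an NFD-encoded "ü".
local = canonical_source("\ufeffMu\u0308nchen in 1990")

# Hypothetical record, as it might appear in place_names without source_text.
rec = {"name": "M\u00fcnchen", "char_start": 0, "char_end": 7}
print(local[rec["char_start"]:rec["char_end"]] == rec["name"])  # True
```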
The --annotate flag
--annotate {auto,ansi,brackets,off} adds a human-readable rendering
of the source after the JSON result. Default: auto (ansi when
stdout is a TTY, brackets otherwise).
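The auto rule can be pictured as a TTY check on the output stream. This is an illustrative sketch only: resolve_annotate_mode is a hypothetical function, and how geoextent actually detects a terminal is not documented here.

```python
import io
import sys


def resolve_annotate_mode(mode, stream=None):
    """Resolve 'auto' to 'ansi' or 'brackets'; pass other modes through."""
    stream = sys.stdout if stream is None else stream
    if mode != "auto":
        return mode
    # isatty() is False for pipes, files, and in-memory buffers.
    return "ansi" if stream.isatty() else "brackets"


buffer = io.StringIO()  # stands in for a pipe or a log file
print(resolve_annotate_mode("auto", buffer))  # brackets
print(resolve_annotate_mode("ansi", buffer))  # ansi
```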
ANSI (terminal)
geoextent -b -t --quiet --text-method ner --ner-ambiguity top \
--annotate ansi \
--text "Sediment cores in Berlin span the Holocene; resurvey on 2024-05-12."
The annotated line follows the JSON and a header:
{... JSON ...}
---annotated source (ansi)---
Sediment cores in [cyan]Berlin[/] span the [magenta]Holocene[/]; resurvey on [yellow]2024-05-12[/].
Default colour assignment: places cyan, dates yellow, periods magenta.
Override with --annotate-classes:
geoextent ... --annotate ansi \
--annotate-classes "place=bright_red,date=green,period=blue"
Recognised colour names: black, red, green, yellow,
blue, magenta, cyan, white, each available as a
bright_* variant.
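For readers who want to see what such a colour spec amounts to on the wire, here is a rough sketch. It assumes the names map to the standard ANSI SGR foreground codes (30 to 37, and 90 to 97 for the bright variants); sgr_code, parse_class_spec, and colourise are hypothetical names and may not reflect how geoextent's own parse_classes works.

```python
BASE_COLOURS = ["black", "red", "green", "yellow",
                "blue", "magenta", "cyan", "white"]


def sgr_code(colour):
    """Return the ANSI SGR foreground code for a colour name."""
    bright = colour.startswith("bright_")
    name = colour[len("bright_"):] if bright else colour
    # Standard foregrounds are 30-37, bright ones 90-97.
    # list.index raises ValueError for an unknown name.
    return (90 if bright else 30) + BASE_COLOURS.index(name)


def parse_class_spec(spec):
    """Parse 'place=red,date=green' into {'place': 'red', 'date': 'green'}."""
    return dict(item.strip().split("=", 1) for item in spec.split(","))


def colourise(text, colour):
    """Wrap text in an SGR colour sequence followed by a reset."""
    return f"\x1b[{sgr_code(colour)}m{text}\x1b[0m"


classes = parse_class_spec("place=bright_red,date=green,period=blue")
print(repr(colourise("Berlin", classes["place"])))  # '\x1b[91mBerlin\x1b[0m'
```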
Brackets (pipelines, log capture, non-TTY contexts)
geoextent -b -t --quiet --text-method ner --ner-ambiguity top \
--annotate brackets \
--text "Sediment cores in Berlin span the Holocene; resurvey on 2024-05-12." \
| tee report.txt
---annotated source (brackets)---
Sediment cores in [[Berlin|place]] span the [[Holocene|period]]; resurvey on [[2024-05-12|date]].
The markers are designed to survive piping, log aggregation, and copy/paste into chat clients. They never collide with HTML or Markdown formatting because they are not interpreted as either.
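Because the markers are plain text, a downstream script can recover the matches from a captured log with a regular expression. The sketch below assumes the only labels are place, date, and period, and that the surface text itself contains no "]]"; the pattern is an illustration, not a format guarantee.

```python
import re

# [[surface|label]] with one of the three labels shown on this page.
MARKER = re.compile(r"\[\[(?P<surface>.+?)\|(?P<label>place|date|period)\]\]")

line = ("Sediment cores in [[Berlin|place]] span the [[Holocene|period]]; "
        "resurvey on [[2024-05-12|date]].")

matches = [(m["surface"], m["label"]) for m in MARKER.finditer(line)]
plain = MARKER.sub(lambda m: m["surface"], line)

print(matches)
# [('Berlin', 'place'), ('Holocene', 'period'), ('2024-05-12', 'date')]
print(plain)
# Sediment cores in Berlin span the Holocene; resurvey on 2024-05-12.
```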
Library API
The renderer is available as geoextent.lib.annotate.render_annotated_text()
for use in notebooks, services, or custom tooling:
from geoextent.lib import extent
from geoextent.lib.annotate import render_annotated_text, parse_classes
result = extent.from_text("Berlin in 1990.", bbox=True, tbox=True,
                          text_method="ner", ner_ambiguity="top")
print(render_annotated_text(result, mode="brackets"))
# → Berlin in [[1990|date]]. (well, Berlin too if the gazetteer resolves)
# Custom classes
classes = parse_classes("place=red,date=green,period=blue")
print(render_annotated_text(result, mode="ansi", classes=classes))
Overlap handling
Most inputs do not produce overlapping spans because the period
PhraseMatcher already wins over conflicting place spans before
provenance is emitted (see geoextent.lib.text_extraction.ner).
When overlaps do appear in custom result dicts, the renderer falls back
to greedy longest-wins: the longer match is kept, shorter overlapping
spans are dropped (and logged at debug level). This is the same rule
used by geoextent.lib.text_extraction.periods.extract_periods.
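A greedy longest-wins pass can be sketched in a few lines. This is an illustration of the rule as described, not the renderer's code: drop_overlaps is a hypothetical name, spans are plain (char_start, char_end, label) tuples, and breaking ties by earliest start is an assumption.

```python
def drop_overlaps(spans):
    """Keep the longest spans; drop any span overlapping one already kept.

    spans is a list of (char_start, char_end, label) tuples with
    half-open [char_start, char_end) ranges.
    """
    kept = []
    # Longest first; ties broken by earliest start (an assumption).
    for span in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        start, end = span[0], span[1]
        # Two half-open ranges are disjoint when one ends before the other starts.
        if all(end <= k[0] or start >= k[1] for k in kept):
            kept.append(span)
    return sorted(kept)  # back in document order


spans = [(0, 9, "place"), (4, 9, "period"), (13, 17, "date")]
print(drop_overlaps(spans))
# [(0, 9, 'place'), (13, 17, 'date')]
```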
Multi-input runs
When more than one positional input is given (or a directory contains several text files), the CLI prints one annotated block per source, each prefixed by the input label:
---annotated source (brackets) — <text>---
...
---annotated source (brackets) — tests/testdata/text/cities.txt---
...
Roadmap
Coming in a follow-up (#114):
--annotate html and a library helper that wraps matches in <mark class="geoextent-place" data-id="…">…</mark> elements, plus geoextent.display(result) for one-line Jupyter integration.
A --format webannotations export emitting JSON-LD compatible with the W3C Web Annotation Data Model, BRAT, INCEPTION, and Hypothes.is.