Docs API reference — text

Text API reference

POST /v1/ingest/text/{tenant}/{record} is the canonical endpoint. Algorithm is selected via query string or the JSON body.

Algorithm matrix

`algorithm`	Output kind	Use for
`minhash`	hash set (h slots, 64-bit each)	near-duplicate dedup at scale (Jaccard similarity)
`simhash-tf`	64-bit signature	near-duplicate dedup, simpler than minhash
`simhash-idf`	64-bit signature	same, but weights tokens by an IDF table
`lsh`	banded minhash	fast bucket lookup over millions of records
`tlsh`	35-byte locality digest	malware-style fuzzy matching, robust to small edits
`semantic-local`	dense embedding (FP32 vector)	semantic similarity, on-device model
`semantic-openai`	dense embedding	OpenAI `text-embedding-3-…` models
`semantic-voyage`	dense embedding	Voyage AI models
`semantic-cohere`	dense embedding	Cohere `embed-…` models

Common parameters

These apply to every text algorithm. Send as JSON body when the call needs structured params; for simple text you can POST raw text and rely on defaults.

{
  "text": "The quick brown fox.",
  "params": {
    "algorithm": "minhash",
    "k": 5,
    "h": 128,
    "tokenizer": "Word",
    "canonicalizer": {
      "normalization": "Nfkc",
      "case_fold": true,
      "strip_bidi": true,
      "strip_format": true,
      "apply_confusable": false
    },
    "preprocess": null,
    "security_mode": null
  }
}

`k`

Shingle width. Default 5 for word tokenizer, 9 for grapheme. Smaller k is more sensitive to local edits; larger k is stricter.

`h` (MinHash slot count)

64, 128, or 256. Higher h is slower and more bytes on the wire but yields a tighter Jaccard estimate. The dispatched route is fingerprint_minhash_with::<H> — passing any other value rejects with 422.

`tokenizer`: `TokenizerKind`

Value	Behaviour
`Word`	Unicode word-segment, default.
`Grapheme`	Grapheme-cluster shingles. Use for code, IDs, languages without word breaks.
`CjkJp`	MeCab-backed Japanese segmentation. Requires `text-cjk-japanese` feature.
`CjkKo`	Korean morpheme segmentation. Requires `text-cjk-korean` feature.

`canonicalizer`: `CanonicalizerDto`

Field	Default	Effect
`normalization`	`"Nfkc"`	Unicode normalization form. `"Nfc"`, `"Nfd"`, `"Nfkc"`, `"Nfkd"`, `"None"`.
`case_fold`	`true`	Apply Unicode case folding.
`strip_bidi`	`true`	Remove bidirectional control marks (U+202A–U+202E, U+2066–U+2069).
`strip_format`	`true`	Remove invisible format characters (zero-width joiner, soft hyphen, …).
`apply_confusable`	`false`	Replace confusable glyphs with their Latin equivalents (Cyrillic `а` → Latin `a`). Slower, useful for anti-spam.

`preprocess`: `PreprocessKind | null`

Value	Effect	Feature
`null`	Treat input as plain text.	always
`"Html"`	Strip tags, extract visible text.	`text-markup`
`"Markdown"`	Render to text, drop formatting.	`text-markup`
`"Pdf"`	Extract text from PDF bytes.	`text-pdf`

For "Html"/"Markdown"/"Pdf", the dedicated subroute POST /v1/ingest/text/{tid}/{rid}/preprocess/{html|markdown|pdf} is more efficient — the body is sent raw and parsed once, rather than wrapped in JSON.

`security_mode`: `UtsMode | null`

Hardens canonicalization against Unicode-based attacks (homoglyph spoofing, RTLO injection). Values: "Strict", "Lenient", null (off). Requires text-security feature.

Per-algorithm parameters

`minhash`

{ "algorithm": "minhash", "h": 128, "k": 5 }

h ∈ {64, 128, 256}. Output is h × 8 bytes plus a tiny header.

`simhash-tf`

{ "algorithm": "simhash-tf", "k": 5 }

64-bit signature. Term frequency–weighted by default.

`simhash-idf`

{
  "algorithm": "simhash-idf",
  "k": 5,
  "weighting": { "kind": "idf", "idf_table_ref": "english-wikipedia-2024" }
}

idf_table_ref resolves a server-side preloaded IDF table. Omitting it falls back to TF.

`lsh`

{ "algorithm": "lsh", "h": 128, "k": 5 }

Returns a banded MinHash suitable for bucket lookup. Internally bands × rows = h; default bands=16, rows=8 for h=128.

`tlsh`

{ "algorithm": "tlsh" }

Returns the 35-byte TLSH digest. Requires input ≥ 50 bytes; smaller inputs reject with 422. Feature text-tlsh.

`semantic-local`

{ "algorithm": "semantic-local", "model_id": "all-MiniLM-L6-v2" }

Runs a quantized sentence-transformer on the server. model_id must be one of the preloaded models (see /healthz for the list). Feature text-semantic-local.

`semantic-openai`

{ "algorithm": "semantic-openai", "model_id": "text-embedding-3-small" }

The server holds the OpenAI key. Latency dominated by the upstream call. Returns the FP32 vector inline. Feature text-semantic-openai.

`semantic-voyage`

{ "algorithm": "semantic-voyage", "model_id": "voyage-3" }

Feature text-semantic-voyage.

`semantic-cohere`

{ "algorithm": "semantic-cohere", "model_id": "embed-english-v3.0" }

Feature text-semantic-cohere.

Streaming

POST /v1/ingest/text/{tid}/{rid}/stream accepts NDJSON: one JSON object per line, each { "text": "…" }. The server returns one fingerprint per input. Useful for ingesting a corpus line-at-a-time without round-trips.

Feature text-streaming.

Response shape

Every text route returns the same envelope:

{
  "tenant_id": 17,
  "record_id": "01HZX…",
  "modality": "text",
  "algorithm": "txtfp-minhash-h128-v1",
  "format_version": 1,
  "config_hash": "0x9c1ab40f5fe2c7d3",
  "fingerprint_bytes": 1024,
  "has_embedding": false,
  "embedding_dim": null,
  "model_id": null,
  "metadata_bytes": 0
}

config_hash is the txtfp config_hash(canon, tokenizer_tag, algorithm_tag) — two records with the same config_hash are directly comparable; different config_hash means you must re-fingerprint to compare.

Text API reference

Algorithm matrix

Common parameters

k

h (MinHash slot count)

tokenizer: TokenizerKind

canonicalizer: CanonicalizerDto

preprocess: PreprocessKind | null

security_mode: UtsMode | null

Per-algorithm parameters

minhash

simhash-tf

simhash-idf

lsh

tlsh

semantic-local

semantic-openai

semantic-voyage

semantic-cohere