Methodology — How Indic Engine Measures Compression

Every number published on this site is derived from reproducible benchmark runs. This page documents the exact formulas, test coverage, scoring rules, and technical architecture. Run the benchmark yourself — it requires no API key.

Section 1 — How We Measure Compression

Token counting formula

All token estimates use character-length math, calibrated against production tokenizer output across all supported scripts.

Raw tokens = ceil(input.length / 3)
Compressed tokens = ceil(output.length / 4)
Savings % = round((1 − compressed / raw) × 100)
// Division by 4 for output: compressed JSON is denser than prose input

The asymmetry (÷3 input, ÷4 output) reflects a production observation: the model's compressed JSON output averages ~4 characters per token because it uses short keys and removes whitespace, while mixed-script input averages ~3 characters per token.
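The formula above can be sketched in a few lines of TypeScript. This is an illustrative reading of the published math, not Indic Engine's source; the `SavingsEstimate` and `estimateSavings` names and the empty-input guard are assumptions.

```typescript
// Sketch of the character-length estimate described above.
// Names are illustrative, not taken from the Indic Engine codebase.
interface SavingsEstimate {
  rawTokens: number;
  compressedTokens: number;
  savingsPercent: number;
}

function estimateSavings(input: string, output: string): SavingsEstimate {
  // Mixed-script input: roughly 3 characters per token.
  const rawTokens = Math.ceil(input.length / 3);
  // Compressed JSON output: roughly 4 characters per token.
  const compressedTokens = Math.ceil(output.length / 4);
  // Guard against empty input to avoid dividing by zero (an added assumption).
  const savingsPercent =
    rawTokens === 0 ? 0 : Math.round((1 - compressedTokens / rawTokens) * 100);
  return { rawTokens, compressedTokens, savingsPercent };
}
```

As a worked case, a 52-character message compressed to 24 characters of JSON gives ceil(52 / 3) = 18 raw tokens and ceil(24 / 4) = 6 compressed tokens, so savings are round((1 − 6/18) × 100) = 67%.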

Stress test coverage

Languages tested: 24
Verticals tested: 10
Combinations (24×10): 240
Passing (June 2026): 238/240

The 24 languages tested: English, Hindi, Marathi, Bangla, Gujarati, Hinglish, Nepali, Tamil, Telugu, Kannada, Malayalam, Urdu, Arabic, Persian, Russian, French, German, Spanish, Portuguese, Bahasa Indonesia, Malay, Thai, Vietnamese, Sinhala.

The 10 verticals tested: realestate, ecommerce, leadgen, bfsi, healthcare, education, travel, hr, logistics, default.

Note on 168/168: Earlier published results (168/168 · 100%) used 7 original verticals across 24 languages. The current 240-combination test adds travel, hr, and logistics. Both figures are correct for their respective scope.

Compression rates by language

Language	Script	Compression rate
Hindi / Marathi	Devanagari	72%
Tamil / Telugu	Tamil, Telugu scripts	72%
Hinglish	Mixed Latin+Devanagari	68%
Arabic / Urdu / Persian	Arabic script	70%
Bangla / Malayalam / Kannada / Gujarati	Indic	65%
Russian	Cyrillic	65%
French / German / Spanish	Latin	60%
Portuguese / Bahasa / Malay / Vietnamese	Latin	58%
English	Latin	55%

Scoring criteria

PASS — valid JSON returned AND output contains i (intent) field
FAIL — invalid JSON OR missing intent field i
FAIL — "Delhi bug": l field returns "delhi" when no location was stated in the input message
All other field values are checked for schema conformance via enforceSchema()
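The first three rules can be expressed as a small scoring function. The sketch below is one possible reading, not the benchmark's actual code: `scoreOutput` and the `locationStated` flag are hypothetical, and it assumes each test case records whether the input message mentions a location.

```typescript
type Verdict = "PASS" | "FAIL";

// `locationStated` is a hypothetical flag supplied by the test case,
// marking whether the input message mentions any location at all.
function scoreOutput(rawOutput: string, locationStated: boolean): Verdict {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawOutput);
  } catch {
    return "FAIL"; // invalid JSON
  }
  // Assumption: the payload must be a JSON object, not an array or scalar.
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return "FAIL";
  }
  const fields = parsed as Record<string, unknown>;
  if (!("i" in fields)) {
    return "FAIL"; // missing intent field
  }
  const location = fields["l"];
  // "Delhi bug": a location of "delhi" appears although none was stated.
  if (!locationStated && typeof location === "string" && location.toLowerCase() === "delhi") {
    return "FAIL";
  }
  return "PASS";
}
```

Schema conformance of the remaining fields is left out here, since the text assigns that check to enforceSchema().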

Reproduce it yourself

The benchmark endpoint requires no API key and runs against the live production stack.

curl -X POST "https://indic-engine.com/benchmark?vertical=realestate" \
  -H "Content-Type: application/json" \
  -d '{"input":"Bhaiya 3BHK chahiye Hinjewadi mein, budget 1.5 crore"}'

Every number on this page can be verified by running the benchmark endpoint with your own messages in any supported language and vertical combination.
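For scripted runs, the same request can be assembled in TypeScript. The endpoint, query parameter, and body shape below come from the curl example; the `buildBenchmarkRequest` helper and its types are illustrative, and the response format is not documented on this page, so the usage comment only prints it as text.

```typescript
// Request pieces mirror the curl example above; the helper itself is hypothetical.
interface BenchmarkRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

function buildBenchmarkRequest(vertical: string, input: string): BenchmarkRequest {
  return {
    url: "https://indic-engine.com/benchmark?vertical=" + encodeURIComponent(vertical),
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ input: input }),
    },
  };
}

// Usage (needs network access):
// const req = buildBenchmarkRequest("realestate", "Bhaiya 3BHK chahiye Hinjewadi mein, budget 1.5 crore");
// fetch(req.url, req.init).then((res) => res.text()).then((text) => console.log(text));
```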

Honest failures

June 2026, 240-combination test: two tests initially failed, Gujarati/travel and Nepali/hr. Both were caused by LLM schema drift: on a low-resource language paired with a newly added vertical, the model occasionally added unsupported fields or used non-English location names despite the system prompt. Both were fixed by adding explicit WRONG/RIGHT examples to the system prompt for those combinations. This is how we iterate: failures are diagnosed, fixed, and re-tested. They are never hidden.

Section 2 — How It Works Technically

Script detection

Language routing is determined by Unicode range matching — no external API, no model call. Runs in under 1ms.

Indic    → U+0900–U+0D7F (Devanagari, Bengali, Gujarati, Tamil, Telugu, Kannada, Malayalam)
Arabic   → U+0600–U+06FF (Arabic, Urdu, Persian, extended Arabic blocks)
Cyrillic → U+0400–U+04FF (Russian, Ukrainian, Bulgarian)
Latin    → everything else (English, French, German, Spanish, Hinglish, etc.)

Sinhala (U+0D80–U+0DFF) and Thai (U+0E00–U+0E7F) sit just above the U+0900–U+0D7F span; both are handled as part of the Indic routing path.
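Range matching of this kind can be sketched as a table lookup. The code below is an illustration under stated assumptions, not the production detector: `detectScript` and `SCRIPT_RANGES` are hypothetical names, and letting the first non-Latin character decide the route is an assumption about how mixed messages are treated.

```typescript
type ScriptFamily = "indic" | "arabic" | "cyrillic" | "latin";

// Inclusive code point ranges from the list above, plus Sinhala and Thai,
// which the text says follow the Indic path.
const SCRIPT_RANGES: Array<{ family: ScriptFamily; lo: number; hi: number }> = [
  { family: "indic", lo: 0x0900, hi: 0x0d7f },
  { family: "indic", lo: 0x0d80, hi: 0x0dff }, // Sinhala
  { family: "indic", lo: 0x0e00, hi: 0x0e7f }, // Thai
  { family: "arabic", lo: 0x0600, hi: 0x06ff },
  { family: "cyrillic", lo: 0x0400, hi: 0x04ff },
];

// Assumption: the first character that falls in a listed range decides the route.
function detectScript(text: string): ScriptFamily {
  for (let i = 0; i < text.length; i++) {
    // All listed ranges are in the Basic Multilingual Plane, so UTF-16 units suffice.
    const code = text.charCodeAt(i);
    for (const range of SCRIPT_RANGES) {
      if (code >= range.lo && code <= range.hi) {
        return range.family;
      }
    }
  }
  return "latin"; // everything else, including Hinglish written in Latin letters
}
```

A single pass over the characters with integer comparisons needs no network or model call, which is consistent with the sub-millisecond figure quoted above.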

Provider routing

Each script family has a dedicated fallback chain. If the primary provider fails or returns an error string, the next provider is tried automatically. The CF Llama fallback is the final safety net.

Indic / Latin     → Groq Llama 3.1 8B → Groq Qwen3-32B → CF Llama 3.1 8B
Arabic / Cyrillic → Groq Qwen3-32B → Groq Llama 3.1 8B → CF Llama 3.1 8B

Arabic and Cyrillic scripts route to Qwen3-32B first because it has stronger multilingual instruction-following on right-to-left and complex scripts. Latin and Indic route to Llama 3.1 8B first because it's faster and cheaper for structured extraction tasks.

Provider timeout: 800ms. If a provider does not respond within 800ms, the next provider in the chain is tried. This is enforced via a Promise.race() pattern at the edge.
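A fallback chain with a Promise.race() timeout might look like the sketch below. It is a minimal illustration, not the edge implementation: `Provider`, `withTimeout`, `runChain`, and the `isErrorString` predicate are all hypothetical, and real providers would be HTTP calls rather than the plain functions assumed here.

```typescript
type Provider = (input: string) => Promise<string>;

// Rejects after `ms` so Promise.race can abandon a slow provider.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new Error("timeout after " + ms + "ms")), ms);
  });
  return Promise.race([work, timeout]).then(
    (value) => {
      if (timer !== undefined) clearTimeout(timer);
      return value;
    },
    (err) => {
      if (timer !== undefined) clearTimeout(timer);
      throw err;
    }
  );
}

// Tries each provider in order; resolves to null when the whole chain fails.
// `isErrorString` is a hypothetical check for providers that report failure as text.
async function runChain(
  chain: Provider[],
  input: string,
  isErrorString: (out: string) => boolean,
  timeoutMs: number = 800
): Promise<string | null> {
  for (const provider of chain) {
    try {
      const out = await withTimeout(provider(input), timeoutMs);
      if (!isErrorString(out)) {
        return out;
      }
    } catch {
      // Timeout or thrown error: fall through to the next provider.
    }
  }
  return null;
}
```

Note that Promise.race() only stops waiting for the slow provider; it does not cancel the underlying request, so a production version would likely also abort it.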

Semantic cache

Repeated or semantically similar messages hit the cache instead of calling the LLM. Cache parameters:

Index       : Cloudflare Vectorize (384 dimensions, cosine similarity)
Threshold   : 0.85 cosine similarity
Cache key   : input + promptVersion (SHA-256 first 8 chars) + vertical
Hit rate    : 90.5% on warm cache
Latency     : ~15ms on cache hit (vs 150–400ms live)

The promptVersion component ensures cache is automatically invalidated when the system prompt changes — no manual version bumping required. Changing the system prompt changes its SHA-256 hash, which changes every cache key, causing a cold cache on the new version.
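The lookup logic can be illustrated with an in-memory stand-in for the vector index. This sketch does not use the Cloudflare Vectorize API; `CacheEntry`, `cosineSimilarity`, and `lookupCache` are hypothetical names, `promptVersion` is assumed to be the precomputed 8-character hash prefix, and treating exactly 0.85 as a hit is an assumption.

```typescript
// Cosine similarity between two equal-length embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("dimension mismatch");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    return 0; // a zero vector has no direction; treat as no match
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface CacheEntry {
  vector: number[];
  promptVersion: string; // assumed: first 8 chars of the system prompt's SHA-256
  vertical: string;
  output: string;
}

// Returns the best cached output at or above the threshold, or null on a miss.
function lookupCache(
  query: number[],
  promptVersion: string,
  vertical: string,
  entries: CacheEntry[],
  threshold: number = 0.85
): string | null {
  let best: CacheEntry | null = null;
  let bestScore = threshold;
  for (const entry of entries) {
    // Entries from another prompt version or vertical never match.
    if (entry.promptVersion !== promptVersion || entry.vertical !== vertical) continue;
    const score = cosineSimilarity(query, entry.vector);
    if (score >= bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return best === null ? null : best.output;
}
```

Because `promptVersion` takes part in every match, entries written under an old system prompt simply stop matching once the prompt changes.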

Schema enforcement

After LLM output is received, enforceSchema() strips any key not defined in the vertical's format spec. This is a hard guarantee: regardless of what the model outputs, the response contains only keys specified for the vertical, never extras. Fields with no extracted value are omitted from the response entirely; Indic Engine never pads output with placeholder values like "unknown" or 0. If a field is absent, the customer simply didn't mention it.

Input : raw LLM JSON (may contain extra fields)
Output : only keys from INPUT_FORMATS[vertical] that have real values
Missing: omitted entirely — never padded with "unknown" or 0
Extra : silently dropped
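The four rules above can be sketched as a whitelist filter. This is an illustration, not the real enforceSchema(): the contents of `INPUT_FORMATS` are invented for the example (only `i` and `l` are named on this page, `b` is hypothetical), and both the fallback to the default vertical and the definition of a "real value" are assumptions.

```typescript
// Hypothetical format spec; the real INPUT_FORMATS contents are not published here.
const INPUT_FORMATS: Record<string, string[]> = {
  realestate: ["i", "l", "b"],
  default: ["i"],
};

function enforceSchemaSketch(
  vertical: string,
  llmJson: Record<string, unknown>
): Record<string, unknown> {
  // Assumption: unknown verticals fall back to the "default" spec.
  const allowed = INPUT_FORMATS[vertical] || INPUT_FORMATS["default"];
  const clean: Record<string, unknown> = {};
  for (const key of allowed) {
    const value = llmJson[key];
    // Assumption: null, undefined, and empty strings count as "no real value".
    if (value === undefined || value === null || value === "") continue;
    clean[key] = value; // keep only whitelisted keys that carry a value
  }
  return clean; // extra keys were never copied, so they are silently dropped
}
```

Building a fresh object from the whitelist, rather than deleting keys from the model output, means an unexpected field can never leak through.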

Failsafe

If JSON validation fails after all providers are exhausted, the raw input is returned unchanged with a warning field explaining the failure. The bot never receives an error that breaks its flow: it gets either compressed JSON or the original message. From the bot's perspective, the middleware therefore always returns a usable response.
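A failsafe wrapper along these lines could look like the following sketch. Only the `warning` field is named on this page; the `ok` and `original` fields, the `MiddlewareResult` type, and the warning text are hypothetical, and `llmOutput` is assumed to be null when every provider has failed.

```typescript
type MiddlewareResult =
  | { ok: true; data: Record<string, unknown> }
  | { ok: false; original: string; warning: string };

// `llmOutput` is null when every provider in the chain failed.
function withFailsafe(rawInput: string, llmOutput: string | null): MiddlewareResult {
  if (llmOutput !== null) {
    try {
      const parsed: unknown = JSON.parse(llmOutput);
      if (typeof parsed === "object" && parsed !== null && !Array.isArray(parsed)) {
        return { ok: true, data: parsed as Record<string, unknown> };
      }
    } catch {
      // Invalid JSON: fall through to the failsafe branch below.
    }
  }
  // Never throw to the caller: hand back the untouched message plus a warning.
  return {
    ok: false,
    original: rawInput,
    warning: "compression failed; returning original message",
  };
}
```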
