Low-Latency Inference
The inference pipeline is designed for sub-second response times with adaptive context window orchestration.
Documentation / Neural Core V2
ConAI's runtime foundation for real-time inference, orchestration, and decisioning in business operational environments.
Requests are routed to the most suitable engine based on intent, policy, and tenant QoS.
Retry policies, layered fallbacks, and event-based observability keep the system stable.
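The retry and layered-fallback behavior described above can be sketched as follows. This is a minimal illustration, not Neural Core's actual implementation: `EngineError`, `infer_with_fallback`, the retry count, and the two demo engines are hypothetical names introduced here.

```python
from typing import Callable, List, Sequence, Tuple


class EngineError(Exception):
    """Raised by a (hypothetical) engine call that fails or times out."""


def infer_with_fallback(
    prompt: str,
    engines: Sequence[Tuple[str, Callable[[str], str]]],
    retries_per_engine: int = 1,
) -> Tuple[str, str]:
    """Try each engine in order; return (engine_name, output) of the first success."""
    errors: List[str] = []
    for name, call in engines:
        # One initial attempt plus `retries_per_engine` retries before moving on.
        for attempt in range(1 + retries_per_engine):
            try:
                return name, call(prompt)
            except EngineError as exc:
                errors.append(f"{name}#{attempt}: {exc}")
    # Every engine in the chain failed (the situation error code NC-005 names).
    raise EngineError("fallback exhausted: " + "; ".join(errors))


def flaky_primary(prompt: str) -> str:
    raise EngineError("timeout")


def local_fallback(prompt: str) -> str:
    return f"echo: {prompt}"


engine_used, output = infer_with_fallback(
    "hello", [("primary", flaky_primary), ("local", local_fallback)]
)
print(engine_used, output)
```

Keeping the chain as an ordered list makes the fallback order explicit and easy to reconfigure per tenant.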
Every request to Neural Core must include a JWT issued by the ConAI Auth Service. The token carries the tenant ID, role, and permission scopes.
# Request Header
Authorization: Bearer <jwt_token>
X-Tenant-ID: mitra_abc123
X-Request-ID: uuid-v4
# JWT Payload Structure
{
"sub": "user_id",
"tid": "tenant_id",
"role": "mitra|superadmin|investor",
"scopes": ["inference:read", "inference:write"],
"plan": "starter|growth|pro|enterprise",
"iat": 1716000000,
"exp": 1716086400
}
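A client might read the claims shown above to fail fast before sending a request. The sketch below decodes the payload without verifying the signature (verification stays with the server) and builds the three request headers. It assumes that `X-Tenant-ID` should match the `tid` claim, which the source does not state explicitly; `decode_claims`, `build_headers`, and the demo token are hypothetical.

```python
import base64
import json
import time
import uuid
from typing import Optional


def decode_claims(token: str) -> dict:
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    payload_b64 = token.split(".")[1]
    # JWTs use unpadded base64url; restore the padding before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def build_headers(token: str, scope: str, now: Optional[float] = None) -> dict:
    """Check expiry and scope locally, then build the request headers."""
    claims = decode_claims(token)
    now = time.time() if now is None else now
    if claims["exp"] <= now:
        raise ValueError("token expired")
    if scope not in claims["scopes"]:
        raise PermissionError(f"token lacks scope {scope!r}")
    return {
        "Authorization": f"Bearer {token}",
        "X-Tenant-ID": claims["tid"],  # assumption: header mirrors the tid claim
        "X-Request-ID": str(uuid.uuid4()),
    }


def _b64url(data: dict) -> str:
    raw = json.dumps(data).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


# Unsigned demo token for illustration only; real tokens come from the Auth Service.
demo_claims = {
    "sub": "user_id", "tid": "mitra_abc123", "role": "mitra",
    "scopes": ["inference:read", "inference:write"], "plan": "growth",
    "iat": 1716000000, "exp": 1716086400,
}
demo_token = ".".join([_b64url({"alg": "HS256", "typ": "JWT"}), _b64url(demo_claims), "sig"])
headers = build_headers(demo_token, "inference:write", now=1716000100)
print(headers["X-Tenant-ID"])
```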
| Scope | Description | Minimum Plan |
|---|---|---|
| inference:read | Query inference history & status | Starter |
| inference:write | Submit inference request | Starter |
| inference:stream | Real-time streaming response | Growth |
| policy:manage | CRUD guardrail policies | Pro |
| telemetry:read | Access observability metrics | Pro |
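The table above can be expressed as a lookup that answers whether a plan is entitled to a scope. The sketch assumes the plans are strictly ordered Starter < Growth < Pro < Enterprise, which the table implies but does not state; `plan_allows` is a hypothetical helper.

```python
# Assumed ordering of plans, lowest tier first.
PLAN_ORDER = ["starter", "growth", "pro", "enterprise"]

# Minimum plan per scope, copied from the table above.
MIN_PLAN = {
    "inference:read": "starter",
    "inference:write": "starter",
    "inference:stream": "growth",
    "policy:manage": "pro",
    "telemetry:read": "pro",
}


def plan_allows(plan: str, scope: str) -> bool:
    """Return True when `plan` is at or above the minimum plan for `scope`."""
    return PLAN_ORDER.index(plan.lower()) >= PLAN_ORDER.index(MIN_PLAN[scope])


print(plan_allows("growth", "inference:stream"))
print(plan_allows("starter", "telemetry:read"))
```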
/api/neural/infer
Submits an inference request to Neural Core. Supports synchronous and streaming modes.
# Request Body
{
"messages": [
{ "role": "system", "content": "You are a customer service assistant..." },
{ "role": "user", "content": "When is my bill due?" }
],
"mode": "sync" | "stream",
"context": {
"customer_id": "cust_xyz",
"channel": "whatsapp" | "telegram" | "web",
"session_id": "sess_abc"
},
"options": {
"max_tokens": 1024,
"temperature": 0.7,
"tools": ["billing_lookup", "ticket_create"],
"guardrail_policy": "default" | "strict" | "custom_id"
}
}
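A request body of this shape could be assembled and validated client-side before it is sent. The sketch below only checks the enumerated `mode` and `channel` values shown above; `build_infer_request` is a hypothetical helper, and the defaults simply mirror the sample values.

```python
import json

ALLOWED_MODES = {"sync", "stream"}
ALLOWED_CHANNELS = {"whatsapp", "telegram", "web"}


def build_infer_request(
    system_prompt: str,
    user_text: str,
    customer_id: str,
    channel: str,
    session_id: str,
    mode: str = "sync",
    max_tokens: int = 1024,
    temperature: float = 0.7,
    tools: tuple = (),
    guardrail_policy: str = "default",
) -> dict:
    """Assemble a request body, rejecting values outside the documented enums."""
    if mode not in ALLOWED_MODES:
        raise ValueError(f"mode must be one of {sorted(ALLOWED_MODES)}")
    if channel not in ALLOWED_CHANNELS:
        raise ValueError(f"channel must be one of {sorted(ALLOWED_CHANNELS)}")
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
        ],
        "mode": mode,
        "context": {
            "customer_id": customer_id,
            "channel": channel,
            "session_id": session_id,
        },
        "options": {
            "max_tokens": max_tokens,
            "temperature": temperature,
            "tools": list(tools),
            "guardrail_policy": guardrail_policy,
        },
    }


body = build_infer_request(
    "You are a customer service assistant...",
    "When is my bill due?",
    customer_id="cust_xyz",
    channel="whatsapp",
    session_id="sess_abc",
    tools=("billing_lookup", "ticket_create"),
)
print(json.dumps(body, indent=2))
```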
# Response Body (sync mode)
{
"id": "inf_abc123",
"status": "completed",
"output": {
"content": "Your bill is due on May 25...",
"tool_calls": [
{ "tool": "billing_lookup", "result": { "due_date": "2026-05-25" } }
]
},
"usage": {
"prompt_tokens": 142,
"completion_tokens": 67,
"total_tokens": 209
},
"telemetry": {
"latency_ms": 340,
"model_used": "hermes-3-70b",
"node_id": "node-sg-01",
"routing_decision_ms": 12,
"guardrail_pass": true
}
}
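A caller will typically pull a few fields out of a response like the one above. The following sketch parses the body into a small summary object and sanity-checks the token counts; `InferenceSummary` and `summarize` are hypothetical names, and the sample JSON repeats the fields shown above.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class InferenceSummary:
    inference_id: str
    content: str
    tools_called: List[str]
    total_tokens: int
    latency_ms: int
    guardrail_pass: bool


def summarize(raw: str) -> InferenceSummary:
    """Parse a sync response body and check that the usage counters add up."""
    body = json.loads(raw)
    usage = body["usage"]
    if usage["prompt_tokens"] + usage["completion_tokens"] != usage["total_tokens"]:
        raise ValueError("usage totals are inconsistent")
    return InferenceSummary(
        inference_id=body["id"],
        content=body["output"]["content"],
        tools_called=[call["tool"] for call in body["output"].get("tool_calls", [])],
        total_tokens=usage["total_tokens"],
        latency_ms=body["telemetry"]["latency_ms"],
        guardrail_pass=body["telemetry"]["guardrail_pass"],
    )


sample = """
{
  "id": "inf_abc123",
  "status": "completed",
  "output": {
    "content": "Your bill is due on May 25...",
    "tool_calls": [{"tool": "billing_lookup", "result": {"due_date": "2026-05-25"}}]
  },
  "usage": {"prompt_tokens": 142, "completion_tokens": 67, "total_tokens": 209},
  "telemetry": {"latency_ms": 340, "model_used": "hermes-3-70b", "node_id": "node-sg-01",
                "routing_decision_ms": 12, "guardrail_pass": true}
}
"""
summary = summarize(sample)
print(summary.tools_called, summary.total_tokens, summary.latency_ms)
```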
# SSE Stream Format
data: {"type":"token","content":"Your"}
data: {"type":"token","content":" bill"}
data: {"type":"token","content":" is"}
data: {"type":"tool_call","tool":"billing_lookup","status":"executing"}
data: {"type":"tool_result","tool":"billing_lookup","result":{"due_date":"2026-05-25"}}
data: {"type":"done","usage":{"total_tokens":209},"telemetry":{"latency_ms":340}}
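A stream in this format can be consumed line by line. The sketch below works on any iterable of text lines, so it is independent of the HTTP client used; `consume_stream` is a hypothetical helper, and it handles only the four event types shown above.

```python
import json
from typing import Iterable


def consume_stream(lines: Iterable[str]) -> dict:
    """Accumulate token events into text and collect tool results until 'done'."""
    text_parts = []
    tool_results = {}
    final_event = None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data field
        event = json.loads(line[len("data:"):])
        kind = event["type"]
        if kind == "token":
            text_parts.append(event["content"])
        elif kind == "tool_result":
            tool_results[event["tool"]] = event["result"]
        elif kind == "done":
            final_event = event
            break
        # "tool_call" events only signal progress, so they are ignored here.
    return {"text": "".join(text_parts), "tool_results": tool_results, "done": final_event}


sample_stream = [
    'data: {"type":"token","content":"Your"}',
    'data: {"type":"token","content":" bill"}',
    'data: {"type":"token","content":" is"}',
    'data: {"type":"tool_call","tool":"billing_lookup","status":"executing"}',
    'data: {"type":"tool_result","tool":"billing_lookup","result":{"due_date":"2026-05-25"}}',
    'data: {"type":"done","usage":{"total_tokens":209},"telemetry":{"latency_ms":340}}',
]
result = consume_stream(sample_stream)
print(result["text"], result["tool_results"])
```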
/api/neural/status/:inference_id
Checks the status of an inference request that is still running or has already finished.
# Response Body
{
"id": "inf_abc123",
"status": "completed" | "processing" | "queued" | "failed",
"created_at": "2026-05-18T10:00:00Z",
"completed_at": "2026-05-18T10:00:00.340Z",
"error": null | { "code": "NC-001", "message": "..." }
}
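The status endpoint lends itself to a simple polling loop. In the sketch below the HTTP call is injected as `fetch_status`, so no particular client library is assumed; the polling interval, the poll limit, and the function names are hypothetical choices rather than documented values.

```python
import time
from typing import Callable

TERMINAL_STATES = {"completed", "failed"}


def wait_for_inference(
    inference_id: str,
    fetch_status: Callable[[str], dict],
    interval_s: float = 0.5,
    max_polls: int = 20,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the request reaches a terminal state or the poll budget runs out."""
    for _ in range(max_polls):
        status = fetch_status(inference_id)
        if status["status"] in TERMINAL_STATES:
            return status
        sleep(interval_s)
    raise TimeoutError(f"{inference_id} still pending after {max_polls} polls")


# Stand-in for GET /api/neural/status/:inference_id, replaying three responses.
_replies = iter([
    {"id": "inf_abc123", "status": "queued", "error": None},
    {"id": "inf_abc123", "status": "processing", "error": None},
    {"id": "inf_abc123", "status": "completed", "error": None},
])


def fake_fetch(inference_id: str) -> dict:
    return next(_replies)


final = wait_for_inference("inf_abc123", fake_fetch, sleep=lambda _s: None)
print(final["status"])
```

For interactive use, the streaming mode is likely a better fit than polling; polling suits batch jobs or clients that cannot hold a connection open.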
Rate limits are applied per tenant according to plan. Response headers carry real-time quota information.
| Plan | Requests/minute | Tokens/day | Concurrent |
|---|---|---|---|
| Starter | 20 | 50,000 | 2 |
| Growth | 60 | 500,000 | 5 |
| Pro | 200 | 2,000,000 | 15 |
| Enterprise | Custom | Unlimited | Custom |
# Rate Limit Response Headers
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 42
X-RateLimit-Reset: 1716012060
X-TokenLimit-Daily-Remaining: 487231
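A client can use these headers to decide how long to wait before the next request. The sketch assumes `X-RateLimit-Reset` is a Unix timestamp in seconds, which the sample value suggests but the source does not confirm; `seconds_until_retry` is a hypothetical helper.

```python
from typing import Mapping


def seconds_until_retry(headers: Mapping[str, str], now: float) -> float:
    """Return 0 when quota remains, otherwise the time left until the window resets."""
    if int(headers.get("X-RateLimit-Remaining", "1")) > 0:
        return 0.0
    # Assumption: the reset header is epoch seconds.
    reset_at = int(headers["X-RateLimit-Reset"])
    return max(0.0, reset_at - now)


ok_headers = {
    "X-RateLimit-Limit": "60",
    "X-RateLimit-Remaining": "42",
    "X-RateLimit-Reset": "1716012060",
    "X-TokenLimit-Daily-Remaining": "487231",
}
exhausted_headers = dict(ok_headers, **{"X-RateLimit-Remaining": "0"})

print(seconds_until_retry(ok_headers, now=1716012000))
print(seconds_until_retry(exhausted_headers, now=1716012000))
```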
| Code | Meaning | Resolution |
|---|---|---|
| NC-001 | Context window overflow: input exceeds the active model's token limit | Shorten the input or enable the auto-truncation policy |
| NC-002 | Model routing timeout: no engine responded within the SLA | Check node health or raise the timeout threshold |
| NC-003 | Guardrail rejection: output violates the tenant policy | Review the policy rules in tenant settings |
| NC-004 | Rate limit exceeded: tenant exceeded its per-minute inference quota | Upgrade the plan or wait for the cooldown window |
| NC-005 | Fallback exhausted: every engine in the chain failed | Escalate to the NOC and check the status of all nodes |
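Client code can branch on these codes instead of parsing messages. The grouping below into "retry", "change the request", and "escalate" is an interpretation of the Resolution column, not something the source defines; `classify_error` and the action labels are hypothetical.

```python
# Interpretation of the Resolution column; adjust to your own operational policy.
ERROR_ACTIONS = {
    "NC-001": "change_request",  # shorten input or enable auto-truncation
    "NC-002": "retry",           # routing timeout, may succeed on a later attempt
    "NC-003": "change_request",  # policy review needed before resubmitting
    "NC-004": "retry",           # wait for the cooldown window
    "NC-005": "escalate",        # all engines failed, involve the NOC
}


def classify_error(error) -> str:
    """Map the `error` field of a status response to a client-side action."""
    if error is None:
        return "none"
    # Unknown codes are escalated rather than silently retried.
    return ERROR_ACTIONS.get(error["code"], "escalate")


print(classify_error({"code": "NC-004", "message": "..."}))
print(classify_error(None))
```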
All metrics are exported via a Prometheus-compatible endpoint and are available in the tenant monitoring dashboard.
| Metric | Unit | Description |
|---|---|---|
| nc.inference.latency_p50 | ms | Median end-to-end inference latency |
| nc.inference.latency_p99 | ms | 99th percentile latency |
| nc.routing.decision_time | ms | Time spent selecting an engine during routing |
| nc.guardrail.rejection_rate | % | Percentage of outputs rejected by guardrails |
| nc.fallback.trigger_count | count/min | Number of fallbacks triggered per minute |
| nc.token.usage_total | tokens | Total tokens consumed per tenant |
| nc.queue.depth | count | Number of requests in the inference queue |
| nc.model.availability | % | Model engine uptime per node |
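To make the percentile and rate metrics concrete, the sketch below computes p50 and p99 with the nearest-rank method and a rejection rate from raw counts. The source does not say which percentile method Neural Core uses, so nearest-rank is an assumption, and the latency samples are made up for illustration.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def rejection_rate(rejected: int, total: int) -> float:
    """Share of outputs rejected by guardrails, as a percentage."""
    return 0.0 if total == 0 else 100.0 * rejected / total


# Illustrative end-to-end latencies in milliseconds (not real measurements).
latencies_ms = [120, 340, 95, 410, 180, 2200, 150, 300, 260, 175]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(p50, p99, rejection_rate(3, 200))
```

The gap between the two percentiles in this sample shows why p99 is tracked separately: a single slow request barely moves the median but dominates the tail.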
┌──────────────────────────────────────────────────────────────────┐
│                          CLIENT REQUEST                          │
│                (WhatsApp / Telegram / Web / API)                 │
└──────────────────────────────┬───────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────────┐
│                          GATEWAY LAYER                           │
│        ┌──────────┐  ┌──────────────┐  ┌────────────────┐        │
│        │ Auth/JWT │  │ Rate Limiter │  │ Request Router │        │
│        └──────────┘  └──────────────┘  └────────────────┘        │
└──────────────────────────────┬───────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────────┐
│                          NEURAL CORE V2                          │
│   ┌────────────────┐  ┌───────────────┐  ┌──────────────────┐    │
│   │ Intent Parser  │  │ Model Router  │  │ Context Manager  │    │
│   └────────────────┘  └───────────────┘  └──────────────────┘    │
│   ┌────────────────┐  ┌───────────────┐  ┌──────────────────┐    │
│   │ Tool Executor  │  │ Guardrails    │  │ Telemetry Emit   │    │
│   └────────────────┘  └───────────────┘  └──────────────────┘    │
└──────────────────────────────┬───────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────────┐
│                           ENGINE POOL                            │
│      ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐      │
│      │ Hermes-3 │  │ GPT-4o   │  │ Claude   │  │ Fallback │      │
│      │ 70B      │  │ mini     │  │ Sonnet   │  │ Local    │      │
│      └──────────┘  └──────────┘  └──────────┘  └──────────┘      │
└──────────────────────────────────────────────────────────────────┘
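The Model Router step in the diagram can be pictured as picking the first healthy engine that fits the request. Everything in this sketch is hypothetical: the per-intent preference lists, the context-size figures, and the engine identifiers (derived from the labels in the Engine Pool box) are placeholders, since the source does not describe the actual routing rules.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Engine:
    name: str
    healthy: bool
    max_context_tokens: int  # placeholder capacity, not a published figure


# Hypothetical preference order per intent.
PREFERENCES = {
    "billing": ["hermes-3-70b", "gpt-4o-mini"],
    "default": ["claude-sonnet", "hermes-3-70b"],
}
LOCAL_FALLBACK = "fallback-local"


def route(intent: str, prompt_tokens: int, pool: Dict[str, Engine]) -> str:
    """Return the first preferred engine that is healthy and can hold the prompt."""
    for name in PREFERENCES.get(intent, PREFERENCES["default"]):
        engine = pool.get(name)
        if engine and engine.healthy and prompt_tokens <= engine.max_context_tokens:
            return name
    # Nothing suitable: fall through to the local engine at the end of the chain.
    return LOCAL_FALLBACK


pool = {
    "hermes-3-70b": Engine("hermes-3-70b", healthy=False, max_context_tokens=8000),
    "gpt-4o-mini": Engine("gpt-4o-mini", healthy=True, max_context_tokens=8000),
}
print(route("billing", 500, pool))
print(route("smalltalk", 500, pool))
```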