Ontul

Ontul: Distributed Unified Data Engine

Ontul Key Features

Discover the core features of the distributed data engine that unifies batch processing, stream processing, and interactive SQL in a single engine.

Unified Data Engine

Run batch processing, stream processing, and interactive SQL queries in a single cluster. Consolidate all data workloads without separate systems.

Arrow-Native Execution Engine

Process all data in Apache Arrow columnar format. Iceberg Parquet data files are decoded column-at-a-time directly into Arrow vectors, with Iceberg file pruning and Parquet row-group skipping to minimize bytes read. Columnar aggregation and bounded-heap Top-N then run on those vectors with zero-copy execution, which keeps analytical queries fast.
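To make "bounded-heap Top-N" concrete, here is a minimal Python sketch of the general technique. It is not Ontul's implementation; the function name `top_n` and the row shape are assumptions made for illustration. The point is that memory stays proportional to N rather than to the input size.

```python
import heapq
from typing import Iterable, List, Tuple

def top_n(rows: Iterable[Tuple[str, float]], n: int) -> List[Tuple[str, float]]:
    """Return the n rows with the largest value, holding at most n rows in memory."""
    if n <= 0:
        return []
    heap: List[Tuple[float, str]] = []  # min-heap: smallest retained value at heap[0]
    for key, value in rows:
        if len(heap) < n:
            heapq.heappush(heap, (value, key))
        elif value > heap[0][0]:
            # The new row beats the smallest retained row, so swap it in.
            heapq.heapreplace(heap, (value, key))
    # Sort the (at most n) survivors, largest first.
    return [(key, value) for value, key in sorted(heap, reverse=True)]

rows = [("a", 3.0), ("b", 9.0), ("c", 1.0), ("d", 7.0), ("e", 5.0)]
result = top_n(rows, 2)
print(result)  # [('b', 9.0), ('d', 7.0)]
```

An ORDER BY ... LIMIT N query handled this way never has to sort the full input.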

Interactive SQL

Connect over JDBC (DBeaver, DataGrip) via Arrow Flight SQL and run multi-catalog federation queries. Full standard SQL support, including JOINs, window functions, and CTEs, comes with a compiled-plan cache and a snapshot-keyed result cache. The result cache skips execution entirely for repeated reads over unchanged data, lifting interactive QPS for BI and AI-agent workloads. The MCP server services JSON-RPC batch requests concurrently, so an agent's multiple tool calls take one round-trip.
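The idea behind a snapshot-keyed result cache can be sketched in a few lines of Python. This is an illustration under assumptions, not Ontul's code: the class name, the `execute` callback, and the use of an integer snapshot id as part of the key are all hypothetical. What it shows is that a repeated query over the same snapshot never reaches the executor, while a new snapshot misses the cache without any explicit invalidation.

```python
from typing import Callable, Dict, List, Tuple

Rows = List[tuple]

class SnapshotResultCache:
    """Caches query results keyed by (SQL text, table snapshot id)."""

    def __init__(self) -> None:
        self._results: Dict[Tuple[str, int], Rows] = {}
        self.executions = 0  # counts real executions, for illustration only

    def query(self, sql: str, snapshot_id: int, execute: Callable[[str], Rows]) -> Rows:
        key = (sql.strip(), snapshot_id)
        if key not in self._results:
            # Cache miss: either a new query or the table has a new snapshot.
            self.executions += 1
            self._results[key] = execute(sql)
        return self._results[key]

cache = SnapshotResultCache()
run = lambda sql: [(42,)]  # stand-in for the real executor
first = cache.query("SELECT COUNT(*) FROM sales", 100, run)
second = cache.query("SELECT COUNT(*) FROM sales", 100, run)  # served from cache
third = cache.query("SELECT COUNT(*) FROM sales", 101, run)   # new snapshot, re-executed
print(cache.executions)  # 2
```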

Flink-style Streaming

Continuous processing — events are processed as they arrive, not in Spark-style micro-batches. Supports TUMBLING, SLIDING, and SESSION windows with multi-worker hash shuffle.
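As a rough illustration of how an event is assigned to TUMBLING and SLIDING windows, consider the sketch below. It follows the usual definitions of these window types and assumes non-negative millisecond timestamps; the function names are hypothetical and this is not Ontul's API. SESSION windows are omitted because they depend on gaps between events rather than on a single timestamp.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # [start, end) in milliseconds

def tumbling_window(ts_ms: int, size_ms: int) -> Window:
    """Each event belongs to exactly one fixed-size, non-overlapping window."""
    start = ts_ms - ts_ms % size_ms
    return (start, start + size_ms)

def sliding_windows(ts_ms: int, size_ms: int, slide_ms: int) -> List[Window]:
    """Each event belongs to every window of length size_ms that covers it."""
    windows: List[Window] = []
    start = ts_ms - ts_ms % slide_ms  # latest window start at or before the event
    while start > ts_ms - size_ms:
        windows.append((start, start + size_ms))
        start -= slide_ms
    return windows

print(tumbling_window(12_500, 5_000))          # (10000, 15000)
print(sliding_windows(12_500, 10_000, 5_000))  # [(10000, 20000), (5000, 15000)]
```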

Exchange Manager

Unified fault-tolerance infrastructure for Query, Batch, and Streaming. Handles data spill on memory pressure and streaming checkpoint state — all through a single system with KMS envelope encryption.

Exactly-Once Semantics

Master-coordinated barrier checkpoint guarantees exactly-once delivery for transactional sinks (Iceberg, JDBC, NeorunBase, Kafka Transactions). Sink commit before offset commit ensures data consistency.
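The ordering "sink commit before offset commit" can be illustrated with a small simulation. Everything here is a sketch under assumptions: the class and function names are hypothetical, and making the sink commit idempotent per checkpoint id is one common way to make a replay harmless, not something the text above states about Ontul. If the process fails after the sink commit but before the offset commit, the source offsets still point at the old position, so the same batch is replayed and the sink recognizes the checkpoint it has already committed.

```python
from typing import Dict, List

class TransactionalSink:
    def __init__(self) -> None:
        self.committed: Dict[int, List[str]] = {}  # checkpoint id -> rows

    def commit(self, checkpoint_id: int, rows: List[str]) -> None:
        # Assumption: committing the same checkpoint twice is a no-op.
        self.committed.setdefault(checkpoint_id, list(rows))

class OffsetStore:
    def __init__(self) -> None:
        self.offset = 0

    def commit(self, offset: int) -> None:
        self.offset = offset

def complete_checkpoint(checkpoint_id: int, rows: List[str], end_offset: int,
                        sink: TransactionalSink, offsets: OffsetStore,
                        crash_before_offset_commit: bool = False) -> None:
    sink.commit(checkpoint_id, rows)   # step 1: data becomes visible
    if crash_before_offset_commit:
        raise RuntimeError("simulated crash")
    offsets.commit(end_offset)         # step 2: source position advances

sink, offsets = TransactionalSink(), OffsetStore()
batch = ["e1", "e2"]
try:
    complete_checkpoint(7, batch, 2, sink, offsets, crash_before_offset_commit=True)
except RuntimeError:
    pass  # offsets.offset is still 0, so the same batch is replayed
complete_checkpoint(7, batch, 2, sink, offsets)
print(sink.committed, offsets.offset)  # {7: ['e1', 'e2']} 2
```

With the opposite order, a crash between the two steps would advance the offset without the data ever reaching the sink.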

Connector Architecture

Access diverse data sources through plugin-based connectors. Dynamically register and unregister Iceberg, NeorunBase, JDBC, Kafka, Elasticsearch, and more at runtime.

Federation Queries

Execute cross-catalog joins across multiple data sources in a single SQL query. Combine Iceberg, NeorunBase, and JDBC tables seamlessly.

Semantic Layer

Define metrics, dimensions, multilingual synonyms, governance, conformed-dimension joins, derived metrics, and multi-tenant mandatory filters once. Ontul rewrites SELECT revenue FROM sales into the full aggregation, JOIN, GROUP BY, and row filter server-side — clients never duplicate the formula.

Agentic AI Ready

Built-in MCP server gives LLM agents metric discovery, natural-language search (Korean 매출 ↔ revenue), and certification metadata. The semantic layer handles aggregation, JOINs, and RBAC server-side, so agents only need column names — multi-tenant policies follow the authenticated user automatically.

Native Apache Iceberg v2 & v3

Native support for both Iceberg v2 and v3 in a single engine: distributed INSERT/CTAS plus merge-on-read DELETE/UPDATE/MERGE, hidden partitioning, schema evolution, time travel, branches and tags. On v3, deletes are written and read as deletion vectors (Puffin) instead of position-delete files.

Write-Audit-Publish (WAP) is also supported: stage INSERT/UPDATE/DELETE/MERGE on a non-main branch via SET, audit in isolation, then publish to main with ALTER TABLE EXECUTE fast_forward / cherrypick.

Ontul also ships Spark-style table-maintenance procedures (optimize, expire_snapshots, rewrite_manifests, remove_orphan_files, rollback) via ALTER TABLE EXECUTE, with fine-grained parameters such as retain_last, min_input_files, dry_run, and window_hours. window_hours compacts incrementally, touching only the small files written in the last N hours, which suits streaming churn. From the Admin UI you can run per-table auto-maintenance with per-operation toggles, those parameters, and a CRON schedule.
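One way to picture how window_hours and min_input_files might interact is the file-selection sketch below. The semantics are assumptions for illustration: the `DataFile` type, the `small_file_bytes` threshold, and the rule "skip compaction when fewer than min_input_files candidates exist" are hypothetical and may differ from what Ontul actually does.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

@dataclass(frozen=True)
class DataFile:
    path: str
    size_bytes: int
    added_at: datetime

def select_for_compaction(files: List[DataFile], now: datetime, window_hours: int,
                          min_input_files: int,
                          small_file_bytes: int = 32 * 1024 * 1024) -> List[DataFile]:
    """Pick small files added within the last window_hours; skip if too few."""
    cutoff = now - timedelta(hours=window_hours)
    candidates = [f for f in files
                  if f.added_at >= cutoff and f.size_bytes < small_file_bytes]
    return candidates if len(candidates) >= min_input_files else []

MB = 1024 * 1024
now = datetime(2024, 1, 1, 12, 0)
files = [
    DataFile("a.parquet", 1 * MB, datetime(2024, 1, 1, 11, 30)),   # recent, small
    DataFile("b.parquet", 2 * MB, datetime(2024, 1, 1, 10, 45)),   # recent, small
    DataFile("c.parquet", 1 * MB, datetime(2023, 12, 31, 9, 0)),   # outside the window
    DataFile("d.parquet", 512 * MB, datetime(2024, 1, 1, 11, 0)),  # recent but large
]
picked = select_for_compaction(files, now, window_hours=2, min_input_files=2)
print([f.path for f in picked])  # ['a.parquet', 'b.parquet']
```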

Security (IAM & KMS)

AES-256-GCM envelope encryption, built-in KMS, Exchange Manager data encryption, catalog/table/column/row-level IAM policies, and STS temporary credentials.

BI Integration (Tableau · Power BI · Looker)

Tableau, Power BI, Looker, and DBeaver connect live via Arrow Flight SQL JDBC. Semantic views expose measures and dimensions with the right classification, and /api/v1/bi/connection-info returns driver coordinates plus per-tool setup hints in one call.

Agentic AI · Semantic Layer

Semantic Layer — Single Source of Truth for Agentic AI

One definition of truth per metric — enforced server-side.

Ontul's semantic layer gives agents two things. ① Numbers (metrics) — define an analytics measure like revenue or margin once, and LLM agents, Tableau, and analysts all see the same number. ② Relevant context (retrievers) — multi-modal search that finds related documents by text or image. If a metric answers "how much revenue?", a retriever answers "find the related documents." Agents get both through one interface.

Core Capabilities

Server-Side Query Rewriting

SELECT revenue, customer.region FROM sales becomes SUM(amount * (1-discount)) with LEFT JOIN customer ON ... and GROUP BY customer.region — automatically. Clients never have to memorize the formula.
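The core of this rewrite, replacing a metric name with its aggregation and grouping by the remaining columns, can be sketched as follows. This is a toy illustration and not Ontul's planner: it works on a list of already-parsed select items rather than SQL text, the `METRICS` registry and `rewrite` function are hypothetical, and join injection is left out.

```python
from typing import Dict, List

# Hypothetical metric registry; the formula is the one quoted above.
METRICS: Dict[str, str] = {"revenue": "SUM(amount * (1 - discount))"}

def rewrite(select_items: List[str], table: str) -> str:
    projections: List[str] = []
    dimensions: List[str] = []
    for item in select_items:
        if item in METRICS:
            projections.append(f"{METRICS[item]} AS {item}")
        else:
            projections.append(item)
            dimensions.append(item)
    sql = f"SELECT {', '.join(projections)} FROM {table}"
    has_metric = len(dimensions) < len(select_items)
    if has_metric and dimensions:
        # Every non-metric column becomes a grouping key.
        sql += f" GROUP BY {', '.join(dimensions)}"
    return sql

rewritten = rewrite(["revenue", "region"], "sales")
print(rewritten)
# SELECT SUM(amount * (1 - discount)) AS revenue, region FROM sales GROUP BY region
```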

MCP-Native Metric Discovery

LLM agents use ontul_search_metrics and ontul_describe_semantic_view to find metrics across multilingual synonyms (매출 · revenue · net_revenue · sales_amount) and read their definitions. One definition is shared by every agent.
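For orientation, the snippet below builds a JSON-RPC 2.0 batch containing two MCP `tools/call` requests, one per tool named above. The envelope fields (`jsonrpc`, `id`, `method`, `params.name`, `params.arguments`) follow the JSON-RPC and MCP conventions; the argument keys `query` and `view` and the view name are assumptions, since the tools' input schemas are not given here.

```python
import json
from typing import Any, Dict

def tool_call(request_id: int, tool: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Build one MCP tools/call request in JSON-RPC 2.0 form."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }

# A JSON-RPC batch is simply an array of request objects.
batch = [
    tool_call(1, "ontul_search_metrics", {"query": "매출"}),                    # hypothetical argument key
    tool_call(2, "ontul_describe_semantic_view", {"view": "saas.core.sales"}),  # hypothetical argument key
]
payload = json.dumps(batch, ensure_ascii=False)
print(payload)
```

Responses to a batch are matched back to requests by `id`, so the agent can issue both lookups without waiting on the first.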

Derived Metrics

profit = revenue - cost, profit_margin = (revenue - cost) / revenue — define metrics in terms of other metrics. Ontul resolves them recursively at plan time, with cycle detection.
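Recursive resolution with cycle detection can be sketched as below. This is an illustrative approach, not Ontul's resolver: the `resolve` function and the dictionaries are hypothetical, the base formulas are sample values, and profit_margin is written as profit / revenue to show a metric that depends on another derived metric.

```python
import re
from typing import Dict, Tuple

BASE: Dict[str, str] = {"revenue": "SUM(amount)", "cost": "SUM(unit_cost * quantity)"}
DERIVED: Dict[str, str] = {"profit": "revenue - cost", "profit_margin": "profit / revenue"}

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")

def resolve(name: str, base: Dict[str, str], derived: Dict[str, str],
            stack: Tuple[str, ...] = ()) -> str:
    """Expand a metric into base aggregations, failing on circular definitions."""
    if name in stack:
        raise ValueError("cycle: " + " -> ".join(stack + (name,)))
    if name in base:
        return base[name]

    def expand(match: "re.Match[str]") -> str:
        token = match.group(0)
        if token in base or token in derived:
            # Parenthesize so operator precedence survives the substitution.
            return "(" + resolve(token, base, derived, stack + (name,)) + ")"
        return token  # not a metric name: leave it untouched

    return IDENTIFIER.sub(expand, derived[name])

expanded = resolve("profit_margin", BASE, DERIVED)
print(expanded)
# ((SUM(amount)) - (SUM(unit_cost * quantity))) / (SUM(amount))
```

A definition such as a = b + 1 together with b = a + 1 raises ValueError with the path a -> b -> a instead of recursing forever.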

Conformed-Dimension Joins

Declare a JOIN once; Ontul injects it only when its columns are referenced. SELECT customer.region, revenue auto-adds LEFT JOIN customer ON ..., while unused joins stay out of the plan — declared joins cost nothing until used.
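A minimal sketch of "inject a join only when its columns are referenced" follows. The `JOINS` registry and `joins_for` function are hypothetical; the customer join condition is the one from the customer example later on this page, and the product join is invented purely to show an unused declaration staying out of the result.

```python
from typing import Dict, List

# Declared once, keyed by the alias used in column references.
JOINS: Dict[str, str] = {
    "customer": "LEFT JOIN customer ON sales.customer_id = customer.id",
    "product": "LEFT JOIN product ON sales.product_id = product.id",  # hypothetical
}

def joins_for(columns: List[str]) -> List[str]:
    """Return only the declared joins whose alias appears in a referenced column."""
    needed: List[str] = []
    for column in columns:
        alias = column.split(".", 1)[0] if "." in column else None
        if alias in JOINS and alias not in needed:
            needed.append(alias)
    return [JOINS[alias] for alias in needed]

print(joins_for(["customer.region", "revenue"]))
# ['LEFT JOIN customer ON sales.customer_id = customer.id']
print(joins_for(["revenue"]))  # []
```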

Multi-Tenant Mandatory Filters

Declare row-scoping predicates like tenant_id = ${user.attr.tenant_id} at the view or per-metric level. The placeholder is substituted from the authenticated user context, so the same RLS policy applies whether the caller is a BI dashboard or an LLM agent.
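The substitution step can be illustrated with the sketch below. It is an assumption-laden example, not Ontul's code: the function name is hypothetical, values are rendered as quoted SQL string literals for readability (a real engine would more likely bind them as parameters), and a missing attribute is treated as an error so that the filter can never be silently dropped.

```python
import re
from typing import Mapping

_PLACEHOLDER = re.compile(r"\$\{user\.attr\.([A-Za-z_][A-Za-z0-9_]*)\}")

def render_filter(template: str, user_attrs: Mapping[str, str]) -> str:
    """Replace ${user.attr.NAME} placeholders with the caller's attribute values."""
    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in user_attrs:
            # Fail closed: no attribute means no query, not an unfiltered query.
            raise PermissionError(f"missing user attribute: {name}")
        # Double single quotes so the value stays inside the string literal.
        return "'" + user_attrs[name].replace("'", "''") + "'"
    return _PLACEHOLDER.sub(substitute, template)

predicate = render_filter("tenant_id = ${user.attr.tenant_id}", {"tenant_id": "acme-co"})
print(predicate)  # tenant_id = 'acme-co'
```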

Governance & RBAC

Per-metric allowedRoles for access control, DRAFT → CERTIFIED → DEPRECATED lifecycle, certifier audit, free-form tags. Enforced at rewrite time — unauthorized users never see the formula in error messages.
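A role check at rewrite time might look like the sketch below. The `Metric` type, the `authorize` function, and the exception are hypothetical names used for illustration; the point is that the check happens before the formula is expanded, and the error message carries only the metric name.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable

@dataclass(frozen=True)
class Metric:
    name: str
    formula: str
    allowed_roles: FrozenSet[str]
    status: str = "CERTIFIED"  # DRAFT, CERTIFIED, or DEPRECATED

class MetricAccessDenied(Exception):
    pass

def authorize(metric: Metric, user_roles: Iterable[str]) -> str:
    """Return the formula only if the caller holds at least one allowed role."""
    if not metric.allowed_roles & set(user_roles):
        # The message names the metric but never includes its formula.
        raise MetricAccessDenied(f"access denied for metric '{metric.name}'")
    return metric.formula

margin = Metric("profit_margin", "(SUM(amount) - SUM(unit_cost * quantity)) / SUM(amount)",
                frozenset({"finance"}))
print(authorize(margin, ["finance", "analyst"]))
try:
    authorize(margin, ["support"])
except MetricAccessDenied as exc:
    print(exc)  # access denied for metric 'profit_margin'
```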

Retrievers — Multi-Modal Search in One Call

A retriever is the search object an agent uses to find "what's related" by text or image. It runs vector (meaning), keyword (BM25), and graph (relationships) search together on NeorunBase, protected by the same IAM and permissions as metrics. The agent writes no SQL; it only fills in the values, and that is all RAG requires. (HYBRID_SEARCH / GRAPH_NEIGHBORS are defined as governed retriever objects and pushed down through Ontul.)

One line is enough

What the user writes:

SELECT customer.region, profit_margin
FROM saas.core.sales
WHERE ship_date >= DATE '2024-01-01';

What Ontul actually runs:

SELECT customer.region,
       (SUM(amount) - SUM(unit_cost * quantity)) / SUM(amount)
         AS profit_margin
FROM saas.core.sales
LEFT JOIN saas.core.customer customer
  ON sales.customer_id = customer.id
WHERE ship_date >= DATE '2024-01-01'
  AND tenant_id = 'acme-co'           -- auto RLS
  AND status = 'COMPLETED'             -- per-metric filter
GROUP BY customer.region;

What this means for Agentic AI

No Hallucinated Metrics

Formulas live once, server-side. Even if an LLM guesses AVG instead of SUM, the correct aggregation runs every time, as long as the metric name is right.

IAM Auto-Propagation

The metrics and rows an agent can see are exactly what the user's IAM policy allows. No prompt-level permission logic, no bypass.

Multilingual by Default

"매출 어떻게 돼?" finds the revenue metric via synonym matching. Business terminology varies by team — the semantic layer bridges that gap.

BI · AI Consistency

The revenue Tableau shows and the revenue an LLM agent answers are computed by the same SQL. The two channels never disagree on the number.

Use Cases

Unified Data Processing

Handle all data workloads — batch, streaming, and SQL — with a single Ontul cluster instead of separate systems.

AI Agent Analytics

LLM agents discover metrics through MCP tools and translate natural-language questions into Ontul SQL. The semantic layer handles aggregation, joins, and IAM — so agents answer with certified business definitions, not hallucinated formulas.

Real-Time Data Pipelines

Ingest data from Kafka, process in Ontul, and load into Iceberg tables for real-time ETL pipelines.

Data Lake Analytics

Run federation queries across Iceberg, JDBC, and other sources for unified analytics.

Analytics + RAG in One Backend

Run metrics (analytics) and retrievers (multi-modal search) on one engine under one governance model, with no separate semantic-analytics tool or vector/graph search stack to operate. An agent pulls "the numbers" and "the supporting context" together in a single MCP session.

Considering Ontul for your data platform?

Unified. Arrow-Native. Agentic AI-Ready.

A distributed data engine that unifies batch, streaming, SQL, and a production semantic layer — so BI dashboards and AI agents answer with the same truth.