Chango is a true unified data lakehouse platform for building Iceberg-centric data lakehouses.

True Unified Data Lakehouse Platform

Chango provides popular open source engines such as Spark, Trino, and Kafka, Iceberg as the lakehouse table format, and Chango-specific components.

Chango is a true unified data lakehouse platform that supports most of the features necessary to build your data lakehouse.

Chango Architecture from the Perspective of Use Cases

Iceberg is the default lakehouse table format in Chango and is fully supported, with strong storage security provided by Chango Authorizer and Chango REST Catalog.

SQL is essential in modern data lakehouses, even for ETL jobs. Chango provides powerful SQL engines, Trino through Chango Trino Gateway and Spark through Chango Spark Thrift Server, with strong storage security for executing both interactive and ETL SQL queries.

In addition, streaming events can be ingested into Iceberg tables easily using just REST.

Data Exploration

Users can run Trino and Spark SQL queries, both ETL and interactive, through Superset, which connects to Chango Trino Gateway and Chango Spark Thrift Server.

ETL Query Jobs with Workflow Engine

All ETL query jobs are integrated and scheduled with Azkaban, which runs Trino and Spark SQL ETL query jobs periodically. ETL queries are sent to Chango Query Exec through REST and are executed by Trino through Chango Trino Gateway or by Spark through Chango Spark Thrift Server.

Realtime Analytics

  • CDC data, for example PostgreSQL CDC data, is captured by Chango CDC, which sends it to Chango Streaming Ingestion (Chango Data API + Kafka + Chango Spark Streaming) through REST. Incoming streaming events are inserted into Iceberg tables.
  • Log files are read by Chango Log, which sends them to Chango Streaming Ingestion through REST.
  • Streaming events generated by applications are sent to Chango Streaming Ingestion through REST.

Perfect Iceberg Support

Iceberg is the most popular lakehouse table format and is changing the paradigm of data lakes and data lakehouses. Iceberg is the default lakehouse table format in Chango and is fully supported. Chango provides Iceberg-capable engines such as Trino and Spark that connect to Chango REST Catalog, an Iceberg REST catalog that also maintains Iceberg tables automatically for you. So you can easily build Iceberg-centric data lakehouses with Chango.
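
For example, a Spark session can attach an Iceberg REST catalog using the standard Iceberg catalog properties. The PySpark sketch below is a minimal illustration; the catalog name, endpoint URL, and token setting are placeholders, not Chango's documented configuration.

from pyspark.sql import SparkSession

# Minimal sketch: attach an Iceberg REST catalog to Spark.
# Assumes the Iceberg Spark runtime jar is on the classpath.
spark = (
    SparkSession.builder
    .appName("chango-iceberg-example")
    .config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.iceberg.type", "rest")
    .config("spark.sql.catalog.iceberg.uri", "https://chango-rest-catalog.example.com")  # hypothetical endpoint
    .config("spark.sql.catalog.iceberg.token", "<your-token>")  # hypothetical credential
    .getOrCreate()
)

# Query an Iceberg table through the attached catalog.
spark.sql("SELECT * FROM iceberg.test_db.test LIMIT 10").show()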

Storage Security

Storage security is a first-class requirement in modern data lakehouses. Chango provides fine-grained data access control to Chango storage using RBAC. All data access is controlled at a fine-grained level: catalog, schema, and table.

Data Catalog

Chango REST Catalog is an Iceberg REST catalog used as the data catalog in Chango.

Security-first Data Catalog

Storage security is a first-class requirement in modern data lakehouses. Chango REST Catalog works tightly with Chango Authorizer, which controls all data access with strong storage security at the catalog, schema, and table level in Chango. That is, multiple Iceberg-capable engines such as Spark and Trino can work with Chango REST Catalog seamlessly, with strong storage security applied to Iceberg in Chango.

Automatic Iceberg Table Maintenance

Every time data is committed to Iceberg tables, many files are created, such as data files, snapshots, and metadata files, which would otherwise have to be maintained manually later. Chango REST Catalog maintains Iceberg tables automatically for you, handling the following tasks (the sketch after this list shows the manual equivalents):

  • Compacts small files.
  • Expires snapshots.
  • Removes old metadata files.
  • Removes orphan files.
  • Rewrites manifest files.
  • Rewrites position delete files.
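
For reference, most of these tasks correspond to the standard Iceberg Spark maintenance procedures that you would otherwise run by hand. A minimal PySpark sketch, assuming the Iceberg Spark runtime and SQL extensions are configured and using illustrative catalog and table names:

from pyspark.sql import SparkSession

# Minimal sketch of the manual Iceberg maintenance that Chango REST
# Catalog automates. Assumes a configured catalog named "iceberg".
spark = (
    SparkSession.builder
    .appName("iceberg-maintenance")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .getOrCreate()
)

# Compact small data files.
spark.sql("CALL iceberg.system.rewrite_data_files(table => 'test_db.test')")
# Expire old snapshots.
spark.sql("CALL iceberg.system.expire_snapshots(table => 'test_db.test')")
# Remove orphan files.
spark.sql("CALL iceberg.system.remove_orphan_files(table => 'test_db.test')")
# Rewrite manifest files.
spark.sql("CALL iceberg.system.rewrite_manifests(table => 'test_db.test')")
# Rewrite position delete files.
spark.sql("CALL iceberg.system.rewrite_position_delete_files(table => 'test_db.test')")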

Load Files to Iceberg Tables using Chango SQL Procedure

Chango SQL Procedure is an easy way to load external files such as CSV, JSON, Parquet, and ORC located in S3-compatible object storage into Iceberg tables in Chango, without developing additional Spark jobs.

-- Merge JSON files from S3-compatible object storage into an Iceberg
-- table, matching rows on the given id columns.
PROC iceberg.system.import (
    source => 's3a://any-bucket/any-path',
    s3_access_key => 'any access key',
    s3_secret_key => 'any secret key',
    s3_endpoint => 'any endpoint',
    s3_region => 'any region',
    file_format => 'json',
    id_columns => 'id_1, id_2',
    action => 'MERGE',
    target_table => 'iceberg.test_db.test'
)

Streaming Ingestion

If you want to insert streaming events such as user behavior events, logs, and IoT events into Iceberg tables, in most cases you need to build an event streaming platform like Kafka and write streaming jobs such as Spark streaming jobs. In Chango, you don't have to. Streaming applications can ingest streaming events into Iceberg tables directly through Chango's REST API, without an additional streaming platform or streaming jobs (see the sketch after the list below).

  • Just send streaming events through REST, and the rest of the work to ingest them into Iceberg tables is done automatically.
  • No need to build an event streaming platform or develop streaming jobs.
  • Small data files created every time streaming events are ingested into Iceberg tables are compacted automatically.
  • Iceberg table maintenance, such as snapshot expiration and old metadata file removal, is done automatically.
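
As a rough illustration, a producer could look like the Python sketch below. The endpoint path, auth header, and payload schema are assumptions for illustration only; consult the Chango Data API documentation for the actual interface.

import json
import requests

# Hypothetical sketch: send one streaming event to Chango's REST
# ingestion endpoint. URL, auth, and payload fields are placeholders.
event = {"event_type": "page_view", "user_id": "u-123", "ts": 1718000000}

resp = requests.post(
    "https://chango.example.com/v1/ingest/iceberg.test_db.events",  # hypothetical endpoint
    headers={"Authorization": "Bearer <your-token>"},               # hypothetical auth
    data=json.dumps(event),
)
resp.raise_for_status()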

Aggregate Logs

Chango Log is a log agent that reads local log files and sends logs to Iceberg tables in Chango for analysis. Using Chango Log, you can analyze all your distributed logs in Chango in real time, joining them with other data for richer analysis.

Change Data Capture

Chango CDC is a Change Data Capture application that captures CDC data from databases and sends it to Iceberg tables in Chango. You don't need a Kafka or Kafka Connect cluster to accomplish CDC.

Trino Gateway

Chango Trino Gateway is an implementation of the Trino gateway concept: it routes Trino queries dynamically to upstream backend Trino clusters. If one of the backend Trino clusters is exhausted, the gateway routes queries to the cluster that is executing fewer queries. Trino does not support high availability on its own because the Trino coordinator is a single point of failure; a Trino gateway is needed to provide HA. Chango Trino Gateway also supports Resource Groups to control the resources of the backend Trino clusters in which queries run.
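
Because the gateway behaves like a single Trino endpoint, existing Trino clients can connect to it unchanged. A minimal sketch with the standard trino Python client, using placeholder host, port, and user values:

import trino

# Minimal sketch: connect to the gateway as if it were one coordinator.
conn = trino.dbapi.connect(
    host="chango-trino-gateway.example.com",  # hypothetical gateway address
    port=443,
    user="analyst",
    http_scheme="https",
    catalog="iceberg",
    schema="test_db",
)
cur = conn.cursor()
cur.execute("SELECT count(*) FROM test")
print(cur.fetchone())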

Data Transformation

Chango Query Exec is a REST application that executes Trino and Spark SQL ETL queries to transform data in Chango. You simply send ETL queries to Chango Query Exec through REST, for example using curl. Chango Query Exec has several advantages (a sketch follows the list below).

  • Send ETL queries via REST without installing additional tools or libraries.
  • Use the same Trino and Spark SQL ETL queries that you already use to explore data in your BI tools.
  • Easy to integrate with workflow engines.
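
As an illustration, submitting a Trino ETL query to Chango Query Exec might look like the Python sketch below. The endpoint path and request fields are assumptions, not the documented Chango API.

import requests

# Hypothetical sketch: submit a Trino ETL query over REST.
etl_query = """
INSERT INTO iceberg.test_db.daily_summary
SELECT date_trunc('day', ts) AS day, count(*) AS events
FROM iceberg.test_db.events
GROUP BY 1
"""

resp = requests.post(
    "https://chango.example.com/v1/query-exec",        # hypothetical endpoint
    headers={"Authorization": "Bearer <your-token>"},  # hypothetical auth
    json={"engine": "trino", "query": etl_query},      # hypothetical fields
)
resp.raise_for_status()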

Query Audit

All queries executed by query engines in Chango, such as Chango Trino Gateway, Chango Spark Thrift Server, and Chango Spark SQL Runner, are logged so that the history of executed queries can be explored later. You can see query counts by query engine and role, and explore the details of executed queries.

© 2024 Cloud Chef Labs, Inc. All rights reserved.