Cloud Chef Labs



Cloud Chef Labs aims to provide a simple and efficient data platform that solves most of the problems you encounter when running a data lake.

DataRoaster

DataRoaster is an open source tool that provisions data platforms running on Kubernetes, making it easy to build a data lake and an AI-based analytics platform. Users can use DataRoaster as a cost-effective alternative to the serverless services offered by cloud providers.

To use DataRoaster, visit the GitHub repo: https://github.com/cloudcheflabs/dataroaster


DataRoaster Architecture
DataRoaster consists of the following components.
  • CLI: command line interface to the API Server.
  • API Server: handles requests from clients such as the CLI.
  • Authorizer: runs as an OAuth2 server.
  • Secret Manager: manages secrets such as kubeconfig files using Vault.
  • Resource Controller: manages remote Kubernetes resources with kubectl, Helm, and Kubernetes clients such as the fabric8 Kubernetes client.
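
As a rough illustration of the Resource Controller's role, the sketch below builds the kind of Helm command line such a component might issue against a remote cluster. It is not DataRoaster's actual implementation: the function name, release name, chart path, namespace, and kubeconfig location are all hypothetical, and the command is only constructed, never executed.

```python
from typing import Dict, List, Optional


def helm_install_command(
    release: str,
    chart: str,
    namespace: str,
    kubeconfig: str,
    values: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Build (but do not run) a `helm upgrade --install` command line."""
    cmd = [
        "helm", "upgrade", "--install", release, chart,
        "--namespace", namespace,
        "--create-namespace",          # create the namespace if it is missing
        "--kubeconfig", kubeconfig,    # target the remote cluster
    ]
    # Each key=value pair becomes one --set flag; sorted for a stable order.
    for key, value in sorted((values or {}).items()):
        cmd += ["--set", f"{key}={value}"]
    return cmd


if __name__ == "__main__":
    # Hypothetical release, chart path, namespace, and kubeconfig location.
    print(" ".join(helm_install_command(
        "trino", "./charts/trino", "dataroaster-trino",
        "/tmp/kubeconfig.yaml", {"worker.replicas": "3"},
    )))
```

In this sketch the kubeconfig path would come from wherever the Secret Manager materializes it; how DataRoaster actually passes secrets to the controller is not described here.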

The demo shows how easily DataRoaster creates a data platform on Kubernetes consisting of Hive Metastore, Spark Thrift Server, Trino, Redash, JupyterHub, and more.



Services provided by DataRoaster are as follows.

Data Catalog

Query Engine

  • Spark Thrift Server: used as a Hive Server (that is, Hive on Spark); provides an interface to query data in the data lake
  • Trino: fast interactive query engine for querying data in the data lake
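
To give a feel for querying the data lake through Trino, here is a minimal sketch that talks to Trino's HTTP client protocol using only the Python standard library. The coordinator URL, user name, and query are assumptions for illustration; a DataRoaster deployment may expose Trino at a different address and may require authentication that this sketch does not handle.

```python
import json
import urllib.request
from typing import Any, List


def build_trino_request(base_url: str, user: str, sql: str) -> urllib.request.Request:
    """Build the initial POST /v1/statement request; the body is the SQL text."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/statement",
        data=sql.encode("utf-8"),
        headers={"X-Trino-User": user},
        method="POST",
    )


def run_query(base_url: str, user: str, sql: str) -> List[Any]:
    """Submit a query and follow `nextUri` links until all rows are collected."""
    rows: List[Any] = []
    req = build_trino_request(base_url, user, sql)
    while True:
        with urllib.request.urlopen(req) as resp:
            page = json.load(resp)
        if "error" in page:
            raise RuntimeError(page["error"].get("message", "query failed"))
        rows.extend(page.get("data", []))
        next_uri = page.get("nextUri")
        if not next_uri:  # no further pages: the query has finished
            return rows
        req = urllib.request.Request(next_uri, headers={"X-Trino-User": user})


if __name__ == "__main__":
    # Hypothetical coordinator address and user.
    print(run_query("http://localhost:8080", "analyst", "SELECT 1"))
```

For real workloads a dedicated client library or a JDBC connection is usually a better fit; the raw protocol is shown only because it needs no extra dependencies.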

Streaming

  • Kafka: Popular Streaming Platform

Analytics

  • JupyterHub: multi-user hub that serves Jupyter Notebook, a popular web-based interactive analytics tool
  • Redash: SQL-based data visualization and analytics tool with connectors for many data sources

Workflow

  • Argo Workflows: workflow engine running on Kubernetes that schedules containerized long-running batch jobs, ETL jobs, ML jobs, and similar workloads
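
As a sketch of what a containerized batch job looks like to Argo, the snippet below builds a minimal single-step Workflow manifest as a Python dictionary and prints it as JSON. The function name, name prefix, image, and command are hypothetical placeholders, and this is not necessarily how DataRoaster itself submits jobs.

```python
import json
from typing import Any, Dict, List


def batch_workflow(name_prefix: str, image: str, command: List[str]) -> Dict[str, Any]:
    """Return a minimal Argo Workflow manifest with one container step."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        # generateName lets Kubernetes append a random suffix per run.
        "metadata": {"generateName": name_prefix},
        "spec": {
            "entrypoint": "main",
            "templates": [
                {
                    "name": "main",
                    "container": {"image": image, "command": command},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical ETL image and command.
    manifest = batch_workflow("etl-job-", "example/etl:latest", ["python", "etl.py"])
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and submitted with `kubectl create -f`, which accepts JSON as well as YAML manifests.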

CI / CD

  • Jenkins: Popular Continuous Integration server
  • Argo CD: Continuous Delivery tool for Kubernetes

Metrics Monitoring

Pod Log Monitoring

  • ELK: Elasticsearch, Logstash and Kibana
  • Filebeat: used to fetch log files

Distributed Tracing

  • Jaeger: Popular microservices distributed tracing platform

Backup

  • Velero: used to back up Kubernetes resources and persistent volumes

Private Registry

  • Harbor: used as a private registry to manage Docker images and Helm charts

Ingress Controller




© 2021 Cloud Chef Labs Inc. All rights reserved.