witboost


Designed by Agile Lab, Witboost is a versatile platform that addresses a wide range of sophisticated data engineering challenges. It enables businesses to discover, enhance, and productize their data, fostering the creation of automated data platforms that adhere to the highest standards of data governance. Want to know more about Witboost? Check it out here or contact us!

This repository is part of our Starter Kit (https://github.com/agile-lab-dev/witboost-starter-kit), meant to showcase Witboost's integration capabilities and provide a "batteries-included" product.

CDP Hive Tech Adapter

Overview

This project implements a Tech Adapter that deploys Output Ports and Storage Areas (as External Tables or Views) on Apache Hive hosted on a Cloudera Data Platform environment. It currently supports only CDP Private Cloud using Hive and HDFS. After deploying this microservice and configuring Witboost to use it, the platform can create Output Ports and Storage Areas on top of existing data files, leveraging an existing Hive instance.

What's a Tech Adapter?

A Tech Adapter (formerly known as a Specific Provisioner) is a microservice in charge of deploying components that use a specific technology. When the deployment of a Data Product is triggered, the platform generates its descriptor and orchestrates the deployment of every component contained in the Data Product. For every such component the platform knows which Tech Adapter is responsible for its deployment, and can thus send it a provisioning request with the descriptor so that the Tech Adapter can perform whatever operation is required to fulfill the request and report back the outcome to the platform.

You can learn more about how the Tech Adapters fit in the broader picture here.

Software stack

This microservice is written in Scala 2.13, using HTTP4s and Guardrail for the HTTP layer. The project is built with SBT and supports packaging as a JAR, a fat-JAR, and a Docker image, the latter being ideal for Kubernetes deployments (which is the preferred option).

This is a multi-module sbt project:

  • api: Contains the API layer of the service. The latter can be invoked synchronously in 3 different ways (see the request sketch after this list):
    1. POST /provision: provisions the Hive output port/storage area specified in the request payload. It synchronously calls the service logic to perform the provisioning.
    2. POST /validate: validates the request payload and returns a validation result. It should be invoked before provisioning a resource in order to understand whether the request is correct.
    3. POST /updateacl: updates user access to the provisioned resources; applies to Output Ports only.
  • core: Contains the model case classes and the logic shared among the modules
  • service: Contains the Provisioner Service logic. It is called from the API layer after some checks on the request and returns the deployed resource. This is the module where the output port/storage area is actually provisioned
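
As a minimal sketch of how the API can be exercised, the request below assumes the service is running locally on its default port (8093, see Running) and that provision-request.json wraps the component descriptor; the exact request body schema is the one defined by the OpenAPI specification in the api module, so treat the file name and payload shape as illustrative.

  # Sketch only: provision-request.json must follow the OpenAPI schema
  # shipped with the api module (it wraps the YAML component descriptor).
  curl -X POST http://localhost:8093/provision \
    -H 'Content-Type: application/json' \
    -d @provision-request.json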

In this project we are using the following sbt plugins:

  1. scalafmt: To keep the Scala style consistent across all collaborators
  2. wartRemover: To keep the code as functional as possible
  3. scoverage: To create a test coverage report
  4. k8tyGitlabPlugin: To publish the packages to the Gitlab Package Registry

Artifacts

We produce the following artifacts on the CI/CD for this repository:

  1. The scoverage report, which you can download from the CI/CD to check the test coverage
  2. A Docker image published in the Gitlab Container Registry
  3. A set of JARs, one for each module, published in the Gitlab Maven Package Registry

Building

Requirements:

  • Java >=11
  • sbt

This project also depends on the Witboost library scala-mesh-commons, published open source on Maven Central.

Generating sources: this project uses OpenAPI as the standard API specification and the sbt-guardrail plugin to generate server code from the specification.

The code generation is done automatically in the compile phase:

sbt compile

Test

Tests are handled by the standard task as well:

sbt test

CI/CD

Once you commit and push, the CI/CD is triggered; the test and build phases are executed at each push. The CI/CD uses the job token to push the dependency libraries. Dev Deploy is executed only for the master branch, while Prod Deploy is executed only for the release branch. You can double-check the artifacts that will be deployed by downloading artifacts.zip, cached during the test/build stages, from the CI/CD.

How to collaborate

We recommend using IntelliJ IDEA Community Edition for developing this project, but you are free to use your favorite IDE. Please remember to add IDE-specific files to the .gitignore.

If you fork this repository, please modify the project settings with the appropriate Gitlab project id to avoid pushing artifacts to the wrong repository.

Scala style

Leverage the scalafmt plugin to reformat the code while editing. This applies the Scala format specification defined in .scalafmt.conf and avoids spurious changes in merge requests.

We added additional compilation rules using the wartRemover library, so if any errors are raised at compile time, please fix them.

Running

This provisioner uses two sets of credentials to perform operations on Apache Ranger and Apache Hive respectively. Based on how authentication is configured for Apache Hive, please follow the appropriate strategy:

Basic authentication

The default configuration sets both sets of credentials equal to the environment variables CDP_DEPLOY_ROLE_USER and CDP_DEPLOY_ROLE_PASSWORD, so that only one user is initially necessary; the Ranger credentials can be overridden via configuration if they need to be different (see Configuring).
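
For example, for a local run the credentials can be provided as plain environment variables before starting the service (a sketch; the values below are placeholders):

  # Placeholder values: use your actual CDP Machine User credentials.
  export CDP_DEPLOY_ROLE_USER=my-machine-user
  export CDP_DEPLOY_ROLE_PASSWORD=my-secret-password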

The configured CDP user must be a Machine User and must exist in Ranger as well.

The deploy user needs to have admin privileges on Ranger, as well as the following permissions (e.g. through Ranger policies):

  • read, write and execute permissions on the HDFS directories to be used
  • all permissions on the Hive databases and tables to be used

Kerberos

If Hive is authenticated using Kerberos, as it is in most cases, the configured user credentials are used only to access Ranger; for Hive, a valid keytab is necessary, accompanied by the Kerberos configuration files needed to set it up (see Configuring).

After this, execute:

sbt compile run

By default, the server binds to port 8093 on localhost. Once it's up and running, you can send provisioning requests to this address.

Configuring

Most application configurations are handled with the Typesafe Config library. You can find the default settings in the reference.conf of each module. Customize them and use the config.file system property or the other options provided by Typesafe Config according to your needs. The provided docker image expects the config file mounted at path /config/application.conf.

A set of required configuration fields must be modified, like Ranger and HDFS base URLs. For more information on the configuration and to understand how to set up the provisioner, see Configuring the Hive Tech Adapter.
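
As an illustration, an application.conf override might look like the sketch below; the key names (ranger.base-url, hdfs.base-url) are hypothetical, so check the reference.conf of each module for the actual fields. The file can then be passed to the JVM with -Dconfig.file=/path/to/application.conf, per the standard Typesafe Config mechanism.

  # Hypothetical override file; the actual keys live in each module's reference.conf.
  ranger {
    base-url = "https://ranger.internal.example.com:6182"
  }
  hdfs {
    base-url = "https://hdfs.internal.example.com:9871"
  }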

Helm chart configuration

The kerberos.enabled configuration value sets the system properties needed for the provisioner to authenticate on a Kerberized system against services like Hive. For this, the provisioner expects a jaas.conf file and a krb5.conf file. For more information about these files, see Configuring the Hive Tech Adapter. You can provide override values for these files using the kerberos.krb5Override and kerberos.jaasOverride fields.
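
A minimal sketch of the corresponding Helm values, assuming the two overrides are supplied as plain multiline strings; the realm, keytab path, principal and the JAAS entry name (Client) are placeholders you must adapt to your environment:

  # Sketch only: realm, paths, principal and JAAS entry name are placeholders.
  kerberos:
    enabled: true
    krb5Override: |
      [libdefaults]
        default_realm = EXAMPLE.COM
    jaasOverride: |
      Client {
        com.sun.security.auth.module.Krb5LoginModule required
        useKeyTab=true
        keyTab="/etc/security/keytabs/deploy.keytab"
        principal="deploy-user@EXAMPLE.COM";
      };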

Custom Root CA

The chart provides the customCA.enabled option to add a custom Root Certification Authority to the JVM truststore. If this option is enabled, the chart loads the custom CA from a secret with key cdp-private-hive-custom-ca. The CA is expected to be in a format compatible with the keytool utility (PEM works fine).
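
For instance, the secret could be created from a PEM file as sketched below; the secret name hive-tech-adapter-ca is hypothetical (use whatever name your chart values reference), while the key cdp-private-hive-custom-ca is the one the chart looks up:

  # Hypothetical secret name; the key must be cdp-private-hive-custom-ca.
  kubectl create secret generic hive-tech-adapter-ca \
    --from-file=cdp-private-hive-custom-ca=./root-ca.pem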

Deploying

This microservice is meant to be deployed to a Kubernetes cluster.

How it works

  1. Parse the request body
  2. Retrieve the Hive Server 2 host and the Ranger host from the provisioner configuration
  3. Create the Hive resource (table or view); an illustrative DDL sketch follows this list
  4. Upsert the Ranger security zone for the specific Data Product version
  5. Upsert Ranger roles for the owners of the component and, for Output Ports, a role for users as well
  6. Upsert access policies for said roles, granting read/write access to the owner role and read-only access to the user role
  7. Return the deployed resource
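
As an illustration of step 3, the external table created for an Output Port would resemble the following HiveQL; this is a sketch with placeholder names, since the actual statement is derived from the descriptor fields described in the next section:

  -- Illustrative only: database, table, columns, format and location
  -- all come from the component descriptor.
  CREATE EXTERNAL TABLE IF NOT EXISTS finance_dp_v1.daily_trades (
    trade_id BIGINT,
    symbol   STRING,
    price    DECIMAL(18, 4)
  )
  PARTITIONED BY (trade_date STRING)
  STORED AS PARQUET
  LOCATION 'hdfs://namenode/data/finance/daily_trades'
  TBLPROPERTIES ('external.table.purge' = 'false');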

Descriptor Input

The Hive Tech Adapter receives a YAML descriptor containing a data contract schema and a specific field with the information of the table or view to be deployed. It allows defining:

  • Data contract schema: OpenMetadata Column schema defining the schema of the table or view to be created
  • Database name: Database to be created to hold the component tables
  • Table name: Name of the table to be created or, when provisioning a view, the name of the table exposed by the view
  • View name: Sent when provisioning a view to define its name
  • Format: Format of the data files an external table exposes. Only required for table creation
  • Location: HDFS location where the data files are stored
  • Partitions: List of columns used to partition the data
  • Table parameters: Extra table parameters to define TBLPROPERTIES, text file delimiter and header, etc.
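
Putting these fields together, a specific section might look like the sketch below; the exact key names belong to the schema documented in Descriptor Input, so treat the ones here as illustrative only:

  # Illustrative key names; see Descriptor Input for the authoritative schema.
  specific:
    databaseName: finance_dp_v1
    tableName: daily_trades
    format: PARQUET
    location: hdfs://namenode/data/finance/daily_trades
    partitions:
      - trade_date
    tableParams:
      tableProperties:
        external.table.purge: "false"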

For the specification of the schema of this object, check out Descriptor Input.

License

This project is available under the Apache License, Version 2.0; see LICENSE for full details.

About Witboost

Witboost is a cutting-edge Data Experience platform that streamlines complex data projects across various platforms, enabling seamless data production and consumption. This unified approach empowers you to fully utilize your data without platform-specific hurdles, fostering smoother collaboration across teams.

It seamlessly blends business-relevant information, data governance processes, and IT delivery, ensuring technically sound data projects aligned with strategic objectives. Witboost facilitates data-driven decision-making while maintaining data security, ethics, and regulatory compliance.

Moreover, Witboost maximizes data potential through automation, freeing resources for strategic initiatives. Apply your data for growth, innovation and competitive advantage.

Contact us or follow us on our social channels.
