Usage
Requirements
Apache Hive Metastores need a relational database to store their state. We currently support PostgreSQL and Apache Derby (embedded database, not recommended for production). Other databases might work if JDBC drivers are available. Please open an issue if you require support for another database.
S3 Support
Hive supports creating tables in S3 compatible object stores.
To use this feature you need to provide connection details for the object store using the S3Connection in the top level clusterConfig.
An example usage can look like this:
clusterConfig:
  s3:
    inline:
      host: minio
      port: 9000
      accessStyle: Path
      credentials:
        secretClass: simple-hive-s3-secret-class
Apache HDFS Support
As well as S3, Hive also supports creating tables in HDFS.
You can add the HDFS connection in the top level clusterConfig as follows:
clusterConfig:
  hdfs:
    configMap: my-hdfs-cluster # Name of the HdfsCluster
Monitoring
The managed Hive instances are automatically configured to export Prometheus metrics. See Monitoring for more details.
Log aggregation
The logs can be forwarded to a Vector log aggregator by providing a discovery ConfigMap for the aggregator and by enabling the log agent:
spec:
  clusterConfig:
    vectorAggregatorConfigMapName: vector-aggregator-discovery
  metastore:
    config:
      logging:
        enableVectorAgent: true
Further information on how to configure logging can be found in Logging.
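For orientation, a discovery ConfigMap could look roughly like the sketch below. It assumes the Stackable convention that the ConfigMap exposes the aggregator address under the key ADDRESS. The service name vector-aggregator and port 6000 are hypothetical values, not taken from this page.

```yaml
# Sketch of a discovery ConfigMap for the Vector aggregator.
# The ADDRESS value is a hypothetical host:port; adapt it to your aggregator.
apiVersion: v1
kind: ConfigMap
metadata:
  name: vector-aggregator-discovery
data:
  ADDRESS: vector-aggregator:6000
```

The ConfigMap name has to match the value given in vectorAggregatorConfigMapName.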
Configuration & Environment Overrides
The cluster definition also supports overriding configuration properties and environment variables, either per role or per role group, where the more specific override (role group) has precedence over the less specific one (role).
| Overriding properties that are set by the operator (such as the HTTP port) can interfere with the operator and lead to problems. |
Configuration Properties
For a role or role group, at the same level as config, you can specify configOverrides for hive-site.xml. For example, to set datanucleus.connectionPool.maxPoolSize for the metastore to 20, adapt the metastore section of the cluster resource like so:
metastore:
  roleGroups:
    default:
      config: [...]
      configOverrides:
        hive-site.xml:
          datanucleus.connectionPool.maxPoolSize: "20"
      replicas: 1
Just as for the config, it is possible to specify this at role level as well:
metastore:
  configOverrides:
    hive-site.xml:
      datanucleus.connectionPool.maxPoolSize: "20"
  roleGroups:
    default:
      config: [...]
      replicas: 1
All override property values must be strings. The properties will be formatted and escaped correctly into the XML file.
For a full list of configuration options we refer to the Hive Configuration Reference.
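To illustrate the precedence rule, the following sketch sets the same property at both levels. The second role group name (other) and the value "10" are hypothetical and only serve to show which value applies where.

```yaml
metastore:
  configOverrides:
    hive-site.xml:
      # Role-level value: applies to every role group that does not override it
      datanucleus.connectionPool.maxPoolSize: "10"
  roleGroups:
    default:
      configOverrides:
        hive-site.xml:
          # Role-group-level value: takes precedence for the default group
          datanucleus.connectionPool.maxPoolSize: "20"
      replicas: 1
    other: # hypothetical second role group, falls back to the role-level "10"
      replicas: 1
```

With this definition, metastores in the default group would use a pool size of 20, while those in the other group would use 10.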
Environment Variables
In a similar fashion, environment variables can be (over)written. For example per role group:
metastore:
  roleGroups:
    default:
      config: {}
      envOverrides:
        MY_ENV_VAR: "MY_VALUE"
      replicas: 1
or per role:
metastore:
  envOverrides:
    MY_ENV_VAR: "MY_VALUE"
  roleGroups:
    default:
      config: {}
      replicas: 1
Storage for data volumes
You can mount a volume where the Hive Metastore data is stored by specifying PersistentVolumeClaims for each individual role or role group:
metastore:
  config:
    resources:
      storage:
        data:
          capacity: 2Gi
  roleGroups:
    default:
      config:
        resources:
          storage:
            data:
              capacity: 3Gi
In the above example, all Hive Metastores in the default group store data on a 3Gi volume. Additional role groups that do not specify any resources inherit the configuration provided at the role level (a 2Gi volume). The same inheritance applies to memory and CPU requests.
By default, if nothing is configured in the custom resource for a role group, each Pod gets a 2Gi local volume mount for the data location, which mainly contains logs.
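The inheritance described above can be sketched with a second role group that sets no storage of its own. The role group name secondary is hypothetical.

```yaml
metastore:
  config:
    resources:
      storage:
        data:
          capacity: 2Gi # role-level default
  roleGroups:
    default:
      config:
        resources:
          storage:
            data:
              capacity: 3Gi # overrides the role-level value for this group
      replicas: 1
    secondary: # hypothetical group without its own resources, inherits the 2Gi volume
      replicas: 1
```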
Resource Requests
Stackable operators handle resource requests in a slightly different manner than Kubernetes. Resource requests are defined at the role or role group level. See Roles and role groups for details on these concepts. At the role level this means that, for example, all workers use the same resource requests and limits. This can be refined at the role group level (which takes priority over the role level) to apply different resources.
This is an example of how to specify CPU and memory resources using the Stackable Custom Resources:
---
apiVersion: example.stackable.tech/v1alpha1
kind: ExampleCluster
metadata:
  name: example
spec:
  workers: # role-level
    config:
      resources:
        cpu:
          min: 300m
          max: 600m
        memory:
          limit: 3Gi
    roleGroups: # role-group-level
      resources-from-role: # role-group 1
        replicas: 1
      resources-from-role-group: # role-group 2
        replicas: 1
        config:
          resources:
            cpu:
              min: 400m
              max: 800m
            memory:
              limit: 4Gi
In this case, the role group resources-from-role inherits the resources specified at the role level, resulting in a maximum of 3Gi memory and 600m CPU.
The role group resources-from-role-group has a maximum of 4Gi memory and 800m CPU, which overrides the role-level values.
| For Java products, the heap memory actually used is lower than the specified memory limit because other processes in the Container also require memory to run. Currently, 80% of the specified memory limit is passed to the JVM. |
For memory, only a limit can be specified; it is set as both the memory request and the memory limit in the Container. This guarantees the Container the full amount of memory during Kubernetes scheduling.
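As a rough sketch of what this means for the role group resources-from-role from the example above, the resulting Container resources would look like the following. This assumes that cpu.min becomes the CPU request and cpu.max the CPU limit; the heap figure is plain arithmetic based on the 80% rule.

```yaml
# Sketch of the Container resources derived from the role-level settings
# (cpu.min: 300m, cpu.max: 600m, memory.limit: 3Gi).
resources:
  requests:
    cpu: 300m   # assumed to come from cpu.min
    memory: 3Gi # memory limit is also used as the request
  limits:
    cpu: 600m   # assumed to come from cpu.max
    memory: 3Gi
# JVM heap: 80% of 3Gi (3072Mi) is roughly 2458Mi
```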
If no resource requests are configured explicitly, the Hive operator uses the following defaults:
metastore:
  roleGroups:
    default:
      config:
        resources:
          requests:
            cpu: 200m
            memory: 2Gi
          limits:
            cpu: "4"
            memory: 2Gi
          storage:
            data:
              capacity: 2Gi
| The default values are most likely not sufficient to run a proper cluster in production. Please adapt according to your requirements. |
For more details regarding Kubernetes CPU limits see: Assign CPU Resources to Containers and Pods.
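To set these values explicitly in the custom resource, they would presumably be written with the min/max/limit fields shown in the Resource Requests example. The following is a sketch under that assumption, with the numbers copied from the defaults above.

```yaml
metastore:
  roleGroups:
    default:
      config:
        resources:
          cpu:
            min: 200m # corresponds to the default CPU request
            max: "4"  # corresponds to the default CPU limit
          memory:
            limit: 2Gi # used as both request and limit
          storage:
            data:
              capacity: 2Gi
      replicas: 1
```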
Examples
Please note that the version you need to specify is not only the version of Apache Hive which you want to roll out, but has to be amended with a Stackable version as shown. This Stackable version is the version of the underlying container image which is used to execute the processes. For a list of available versions please check our image registry. It should generally be safe to simply use the latest image version that is available.
---
apiVersion: hive.stackable.tech/v1alpha1
kind: HiveCluster
metadata:
  name: simple-hive-derby
spec:
  image:
    productVersion: 3.1.3
    stackableVersion: 0.prerelease
  clusterConfig:
    database:
      connString: jdbc:derby:;databaseName=/tmp/metastore_db;create=true
      user: APP
      password: mine
      dbType: derby
  metastore:
    roleGroups:
      default:
        replicas: 1
| You should not use the Derby database with more than one replica or in production. Derby stores data locally, so the data is not shared between different metastore Pods and is lost after Pod restarts. |
To create a single node Apache Hive Metastore (v3.1.3) cluster with Derby and S3 access, deploy MinIO (or use any available S3 bucket):
helm install minio \
    minio \
    --repo https://charts.bitnami.com/bitnami \
    --set auth.rootUser=minio-access-key \
    --set auth.rootPassword=minio-secret-key
In order to upload data to MinIO we need a port-forward to access the web UI.
kubectl port-forward service/minio 9001
Then, connect to localhost:9001 and log in with the user minio-access-key and password minio-secret-key. Create a bucket and upload data.
Deploy the Hive cluster:
---
apiVersion: hive.stackable.tech/v1alpha1
kind: HiveCluster
metadata:
  name: simple-hive-derby
spec:
  image:
    productVersion: 3.1.3
    stackableVersion: 0.prerelease
  clusterConfig:
    database:
      connString: jdbc:derby:;databaseName=/stackable/metastore_db;create=true
      user: APP
      password: mine
      dbType: derby
    s3:
      inline:
        host: minio
        port: 9000
        accessStyle: Path
        credentials:
          secretClass: simple-hive-s3-secret-class
  metastore:
    roleGroups:
      default:
        replicas: 1
---
apiVersion: secrets.stackable.tech/v1alpha1
kind: SecretClass
metadata:
  name: simple-hive-s3-secret-class
spec:
  backend:
    k8sSearch:
      searchNamespace:
        pod: {}
---
apiVersion: v1
kind: Secret
metadata:
  name: simple-hive-s3-secret
  labels:
    secrets.stackable.tech/class: simple-hive-s3-secret-class
stringData:
  accessKey: minio-access-key
  secretKey: minio-secret-key
To create a single node Apache Hive Metastore using PostgreSQL, deploy a PostgreSQL instance via Helm.
This installs PostgreSQL in version 10 to work around the issue mentioned above:
helm install hive bitnami/postgresql --version=12.1.5 \
--set postgresqlUsername=hive \
--set postgresqlPassword=hive \
--set postgresqlDatabase=hive
apiVersion: hive.stackable.tech/v1alpha1
kind: HiveCluster
metadata:
  name: simple-hive-postgres
spec:
  image:
    productVersion: 3.1.3
    stackableVersion: 0.prerelease
  clusterConfig:
    database:
      connString: jdbc:postgresql://hive-postgresql.default.svc.cluster.local:5432/hive
      user: hive
      password: hive
      dbType: postgres
  metastore:
    roleGroups:
      default:
        replicas: 1
Pod Placement
You can configure Pod placement for Hive metastores as described in Pod Placement.
By default, the operator configures the following Pod placement constraints:
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - podAffinityTerm:
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: hive
            app.kubernetes.io/instance: cluster-name
            app.kubernetes.io/component: metastore
        topologyKey: kubernetes.io/hostname
      weight: 70
In the example above cluster-name is the name of the HiveCluster custom resource that owns this Pod.
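If the default soft anti-affinity is not strict enough, it could be replaced with a hard constraint. The sketch below assumes that the role config accepts an affinity field using the standard Kubernetes affinity structure, as described in Pod Placement. The instance name simple-hive-postgres is taken from the example above and the replica count is illustrative.

```yaml
metastore:
  config:
    affinity: # assumed field, see Pod Placement for the supported options
      podAntiAffinity:
        # Hard constraint: never schedule two metastore Pods on the same node
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/name: hive
                app.kubernetes.io/instance: simple-hive-postgres
                app.kubernetes.io/component: metastore
            topologyKey: kubernetes.io/hostname
  roleGroups:
    default:
      replicas: 2
```

Note that a required rule can leave Pods unschedulable if there are fewer nodes than replicas, which is why the default uses a preferred rule.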