Bug 1459345

Summary: OpenShift cannot create Hawkular Metrics pods because of missing keyspaces
Product: OpenShift Container Platform
Reporter: Guilherme Baufaker Rêgo <gbaufake>
Component: Hawkular
Assignee: Matt Wringe <mwringe>
Status: CLOSED DEFERRED
QA Contact: Liming Zhou <lizhou>
Severity: unspecified
Priority: unspecified
Version: 3.5.1
CC: aos-bugs, gbaufake, jsanda, snegrea
Target Milestone: ---
Target Release: 3.7.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2017-09-29 19:15:24 UTC
Type: Bug
Attachments:
Description             Flags
Hawkular-Metrics-Logs   none
Cassandra Pod 1         none
Cassandra Pod 2         none

Description Guilherme Baufaker Rêgo 2017-06-06 21:29:10 UTC
Created attachment 1285618
Hawkular-Metrics-Logs

Description of problem:

When I try to recreate the Hawkular Metrics pod in the openshift-infra project, Cassandra does not find alerts_keyspace or metrics_keyspace.

Version-Release number of selected component (if applicable):


How reproducible:

Steps to Reproduce:
1. Install OpenShift Container Platform (OCP) 3.5
2. Install Metrics
3. Delete the Hawkular Metrics pod in openshift-infra without deleting the Cassandra pods
4. Recreate the Hawkular Metrics pod
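
Steps 3 and 4 can be sketched as a small Python helper that builds the `oc` invocations. This is only an illustration and not the reporter's exact commands: the label selector `name=hawkular-metrics` is taken from the pod labels shown later in this report, and relying on the replication controller to recreate the deleted pod is an assumption. The function names are hypothetical.

```python
import subprocess

NAMESPACE = "openshift-infra"


def repro_commands(namespace=NAMESPACE):
    """Return the oc command lines for steps 3 and 4 as argv lists."""
    return [
        # Step 3: delete only the Hawkular Metrics pod; the Cassandra pods
        # carry a different "name" label and are left untouched.
        ["oc", "delete", "pod", "-l", "name=hawkular-metrics", "-n", namespace],
        # Step 4 (assumption): the replication controller recreates the pod,
        # so listing the pods is enough to observe the replacement.
        ["oc", "get", "pods", "-n", namespace],
    ]


def run_all(commands, runner=subprocess.check_call):
    """Run each command in order; the runner is injectable for dry runs."""
    for argv in commands:
        runner(argv)
```

Calling `run_all(repro_commands(), runner=print)` prints the commands without touching a cluster; omitting `runner` executes them with the current `oc` login.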

Actual results:

- OpenShift can't create the Hawkular Metrics pods.

Expected results:

- The Hawkular Metrics pods are recreated without errors.

Additional info:

Comment 1 Matt Wringe 2017-06-06 21:33:44 UTC
Can you please provide the logs for Cassandra as well as the output of:

- 'oc get pods -n openshift-infra'
- 'oc get pods -o yaml -n openshift-infra'
- 'oc describe pods -n openshift-infra'

Is this a fresh install of OCP 3.5, or is this an update from an older 3.4 release?
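
The three requested outputs could be gathered in one pass with a short script. The following is a minimal sketch, assuming `oc` is on the PATH and already logged in to the cluster; `collect_diagnostics` and its injectable `run` parameter are hypothetical names introduced for illustration.

```python
import subprocess

NAMESPACE = "openshift-infra"

# The three commands requested in the comment above.
DIAGNOSTIC_COMMANDS = [
    ["oc", "get", "pods", "-n", NAMESPACE],
    ["oc", "get", "pods", "-o", "yaml", "-n", NAMESPACE],
    ["oc", "describe", "pods", "-n", NAMESPACE],
]


def _run(argv):
    """Run one command and return its standard output as text."""
    return subprocess.check_output(argv, universal_newlines=True)


def collect_diagnostics(run=_run):
    """Run each command and return one report with a '$ command' header per section."""
    sections = []
    for argv in DIAGNOSTIC_COMMANDS:
        sections.append("$ " + " ".join(argv) + "\n" + run(argv))
    return "\n".join(sections)
```

The Cassandra logs are not covered by this sketch; they would still have to be fetched per pod, for example with `oc logs <pod-name> -n openshift-infra`.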

Comment 2 Guilherme Baufaker Rêgo 2017-06-07 15:03:59 UTC
It is a fresh install of OCP 3.5.

- 'oc get pods -n openshift-infra'

hawkular-cassandra-1-x94p0   1/1       Running   0          47d
hawkular-cassandra-1-z18wk   1/1       Running   0          47d
hawkular-metrics-zj777       1/1       Running   0          21h
heapster-tpc9q               1/1       Running   2          60d
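
To check a listing like this mechanically, the rows can be parsed into records and filtered for pods that are not fully ready. This is a minimal sketch, assuming the usual five-column layout (NAME, READY, STATUS, RESTARTS, AGE) even though the header row is not part of the paste; `PodRow`, `parse_pod_rows` and `not_ready` are hypothetical names.

```python
from collections import namedtuple

PodRow = namedtuple("PodRow", "name ready status restarts age")


def parse_pod_rows(text):
    """Parse `oc get pods` rows, skipping a header row and malformed lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5 or fields[0] == "NAME":
            continue  # not a five-column data row
        name, ready, status, restarts, age = fields
        rows.append(PodRow(name, ready, status, int(restarts), age))
    return rows


def not_ready(rows):
    """Return rows whose READY count is incomplete or whose STATUS is not Running."""
    bad = []
    for row in rows:
        have, want = row.ready.split("/")
        if have != want or row.status != "Running":
            bad.append(row)
    return bad


# The rows exactly as pasted in this comment.
LISTING = """\
hawkular-cassandra-1-x94p0   1/1       Running   0          47d
hawkular-cassandra-1-z18wk   1/1       Running   0          47d
hawkular-metrics-zj777       1/1       Running   0          21h
heapster-tpc9q               1/1       Running   2          60d
"""
```

For the listing in this comment, `not_ready(parse_pod_rows(LISTING))` returns an empty list: all four pods report 1/1 Running at the time the output was captured.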


- 'oc get pods -o yaml -n openshift-infra'

apiVersion: v1
items:
- apiVersion: v1
  kind: Pod
  metadata:
    annotations:
      kubernetes.io/created-by: |
        {"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"openshift-infra","name":"hawkular-cassandra-1","uid":"654c53d3-1713-11e7-b64b-001a4a10173a","apiVersion":"v1","resourceVersion":"569955"}}
      openshift.io/scc: restricted
    creationTimestamp: 2017-04-20T22:53:17Z
    generateName: hawkular-cassandra-1-
    labels:
      metrics-infra: hawkular-cassandra
      name: hawkular-cassandra-1
      type: hawkular-cassandra
    name: hawkular-cassandra-1-x94p0
    namespace: openshift-infra
    resourceVersion: "570065"
    selfLink: /api/v1/namespaces/openshift-infra/pods/hawkular-cassandra-1-x94p0
    uid: 249ef5f4-261c-11e7-9dc7-001a4a10173a
  spec:
    containers:
    - command:
      - /opt/apache-cassandra/bin/cassandra-docker.sh
      - --cluster_name=hawkular-metrics
      - --data_volume=/cassandra_data
      - --internode_encryption=all
      - --require_node_auth=true
      - --enable_client_encryption=true
      - --require_client_auth=true
      - --keystore_file=/secret/cassandra.keystore
      - --keystore_password_file=/secret/cassandra.keystore.password
      - --truststore_file=/secret/cassandra.truststore
      - --truststore_password_file=/secret/cassandra.truststore.password
      - --cassandra_pem_file=/secret/cassandra.pem
      env:
      - name: CASSANDRA_MASTER
        value: "true"
      - name: CASSANDRA_DATA_VOLUME
        value: /cassandra_data
      - name: JVM_OPTS
        value: -Dcassandra.commitlog.ignorereplayerrors=true
      - name: POD_NAMESPACE
        valueFrom:
          fieldRef:
            apiVersion: v1
            fieldPath: metadata.namespace
      - name: MEMORY_LIMIT
        valueFrom:
          resourceFieldRef:
            divisor: "0"
            resource: limits.memory
      - name: CPU_LIMIT
        valueFrom:
          resourceFieldRef:
            divisor: 1m
            resource: limits.cpu
      image: openshift3/metrics-cassandra:3.5.0
      imagePullPolicy: IfNotPresent
      lifecycle:
        postStart:
          exec:
            command:
            - /opt/apache-cassandra/bin/cassandra-poststart.sh
        preStop:
          exec:
            command:
            - /opt/apache-cassandra/bin/cassandra-prestop.sh
      name: hawkular-cassandra-1
      ports:
      - containerPort: 9042
        name: cql-port
        protocol: TCP
      - containerPort: 9160
        name: thift-port
        protocol: TCP
      - containerPort: 7000
        name: tcp-port
        protocol: TCP
      - containerPort: 7001
        name: ssl-port
        protocol: TCP
      readinessProbe:
        exec:
          command:
          - /opt/apache-cassandra/bin/cassandra-docker-ready.sh
        failureThreshold: 3
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 1
      resources:
        limits:
          memory: 2G
        requests:
          memory: 1G
      securityContext:
        capabilities:
          drop:
          - KILL
          - MKNOD
          - SETGID
          - SETUID
          - SYS_CHROOT
        privileged: false
        runAsUser: 1000000000
        seLinuxOptions:
          level: s0:c1,c0
      terminationMessagePath: /dev/termination-log
      volumeMounts:
      - mountPath: /cassandra_data
        name: cassandra-data
      - mountPath: /secret
        name: hawkular-cassandra-secrets
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: cassandra-token-5l9kw
        readOnly: true
    dnsPolicy: ClusterFirst
    imagePullSecrets:
    - name: cassandra-dockercfg-nnhxg
    nodeName: ose1.bc.jonqe.lab.eng.bos.redhat.com
    restartPolicy: Always
    securityContext:
      fsGroup: 1000000000
      seLinuxOptions:
        level: s0:c1,c0
      supplementalGroups:
      - 65534
    serviceAccount: cassandra
    serviceAccountName: cassandra
    terminationGracePeriodSeconds: 30
    volumes:
    - emptyDir: {}
      name: cassandra-data
    - name: hawkular-cassandra-secrets
      secret:
        defaultMode: 420
        secretName: hawkular-cassandra-secrets
    - name: cassandra-token-5l9kw
      secret:
        defaultMode: 420
        secretName: cassandra-token-5l9kw
  status:
    conditions:
    - lastProbeTime: null
      lastTransitionTime: 2017-04-20T22:53:17Z
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: 2017-04-20T22:57:10Z
      status: "True"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: 2017-04-20T22:53:17Z
      status: "True"
      type: PodScheduled
    containerStatuses:
    - containerID: docker://aa0ee98fd46cd1727bbd423053f4725877175f8558c71e656768a1b8c2e0b82e
      image: openshift3/metrics-cassandra:3.5.0
      imageID: docker-pullable://brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/metrics-cassandra@sha256:f195339f5bbcaf5de4a844fa1738f83ddb36c372c6cb03859199bb53bcf5e093
      lastState: {}
      name: hawkular-cassandra-1
      ready: true
      restartCount: 0
      state:
        running:
          startedAt: 2017-04-20T22:53:56Z
    hostIP: 10.16.23.148
    phase: Running
    podIP: 10.128.0.73
    startTime: 2017-04-20T22:53:17Z
- apiVersion: v1
  kind: Pod
  metadata:
    annotations:
      kubernetes.io/created-by: |
        {"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"openshift-infra","name":"hawkular-cassandra-1","uid":"654c53d3-1713-11e7-b64b-001a4a10173a","apiVersion":"v1","resourceVersion":"569963"}}
      openshift.io/scc: restricted
    creationTimestamp: 2017-04-20T22:53:18Z
    generateName: hawkular-cassandra-1-
    labels:
      metrics-infra: hawkular-cassandra
      name: hawkular-cassandra-1
      type: hawkular-cassandra
    name: hawkular-cassandra-1-z18wk
    namespace: openshift-infra
    resourceVersion: "570069"
    selfLink: /api/v1/namespaces/openshift-infra/pods/hawkular-cassandra-1-z18wk
    uid: 2508a753-261c-11e7-9dc7-001a4a10173a
  spec:
    containers:
    - command:
      - /opt/apache-cassandra/bin/cassandra-docker.sh
      - --cluster_name=hawkular-metrics
      - --data_volume=/cassandra_data
      - --internode_encryption=all
      - --require_node_auth=true
      - --enable_client_encryption=true
      - --require_client_auth=true
      - --keystore_file=/secret/cassandra.keystore
      - --keystore_password_file=/secret/cassandra.keystore.password
      - --truststore_file=/secret/cassandra.truststore
      - --truststore_password_file=/secret/cassandra.truststore.password
      - --cassandra_pem_file=/secret/cassandra.pem
      env:
      - name: CASSANDRA_MASTER
        value: "true"
      - name: CASSANDRA_DATA_VOLUME
        value: /cassandra_data
      - name: JVM_OPTS
        value: -Dcassandra.commitlog.ignorereplayerrors=true
      - name: POD_NAMESPACE
        valueFrom:
          fieldRef:
            apiVersion: v1
            fieldPath: metadata.namespace
      - name: MEMORY_LIMIT
        valueFrom:
          resourceFieldRef:
            divisor: "0"
            resource: limits.memory
      - name: CPU_LIMIT
        valueFrom:
          resourceFieldRef:
            divisor: 1m
            resource: limits.cpu
      image: openshift3/metrics-cassandra:3.5.0
      imagePullPolicy: IfNotPresent
      lifecycle:
        postStart:
          exec:
            command:
            - /opt/apache-cassandra/bin/cassandra-poststart.sh
        preStop:
          exec:
            command:
            - /opt/apache-cassandra/bin/cassandra-prestop.sh
      name: hawkular-cassandra-1
      ports:
      - containerPort: 9042
        name: cql-port
        protocol: TCP
      - containerPort: 9160
        name: thift-port
        protocol: TCP
      - containerPort: 7000
        name: tcp-port
        protocol: TCP
      - containerPort: 7001
        name: ssl-port
        protocol: TCP
      readinessProbe:
        exec:
          command:
          - /opt/apache-cassandra/bin/cassandra-docker-ready.sh
        failureThreshold: 3
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 1
      resources:
        limits:
          memory: 2G
        requests:
          memory: 1G
      securityContext:
        capabilities:
          drop:
          - KILL
          - MKNOD
          - SETGID
          - SETUID
          - SYS_CHROOT
        privileged: false
        runAsUser: 1000000000
        seLinuxOptions:
          level: s0:c1,c0
      terminationMessagePath: /dev/termination-log
      volumeMounts:
      - mountPath: /cassandra_data
        name: cassandra-data
      - mountPath: /secret
        name: hawkular-cassandra-secrets
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: cassandra-token-5l9kw
        readOnly: true
    dnsPolicy: ClusterFirst
    imagePullSecrets:
    - name: cassandra-dockercfg-nnhxg
    nodeName: ose1.bc.jonqe.lab.eng.bos.redhat.com
    restartPolicy: Always
    securityContext:
      fsGroup: 1000000000
      seLinuxOptions:
        level: s0:c1,c0
      supplementalGroups:
      - 65534
    serviceAccount: cassandra
    serviceAccountName: cassandra
    terminationGracePeriodSeconds: 30
    volumes:
    - emptyDir: {}
      name: cassandra-data
    - name: hawkular-cassandra-secrets
      secret:
        defaultMode: 420
        secretName: hawkular-cassandra-secrets
    - name: cassandra-token-5l9kw
      secret:
        defaultMode: 420
        secretName: cassandra-token-5l9kw
  status:
    conditions:
    - lastProbeTime: null
      lastTransitionTime: 2017-04-20T22:53:18Z
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: 2017-04-20T22:57:12Z
      status: "True"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: 2017-04-20T22:53:18Z
      status: "True"
      type: PodScheduled
    containerStatuses:
    - containerID: docker://5bdbbc04a1744d72de5debdb63e5502f42f2e7b53167ed3fdc5dde1fd764704c
      image: openshift3/metrics-cassandra:3.5.0
      imageID: docker-pullable://brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/metrics-cassandra@sha256:f195339f5bbcaf5de4a844fa1738f83ddb36c372c6cb03859199bb53bcf5e093
      lastState: {}
      name: hawkular-cassandra-1
      ready: true
      restartCount: 0
      state:
        running:
          startedAt: 2017-04-20T22:53:56Z
    hostIP: 10.16.23.148
    phase: Running
    podIP: 10.128.0.74
    startTime: 2017-04-20T22:53:18Z
- apiVersion: v1
  kind: Pod
  metadata:
    annotations:
      kubernetes.io/created-by: |
        {"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"openshift-infra","name":"hawkular-metrics","uid":"599fcafa-1713-11e7-b64b-001a4a10173a","apiVersion":"v1","resourceVersion":"288799"}}
      openshift.io/scc: restricted
    creationTimestamp: 2017-06-06T17:16:27Z
    generateName: hawkular-metrics-
    labels:
      metrics-infra: hawkular-metrics
      name: hawkular-metrics
    name: hawkular-metrics-zj777
    namespace: openshift-infra
    resourceVersion: "1776988"
    selfLink: /api/v1/namespaces/openshift-infra/pods/hawkular-metrics-zj777
    uid: dfebe954-4adb-11e7-982a-001a4a10173a
  spec:
    containers:
    - command:
      - /opt/hawkular/scripts/hawkular-metrics-wrapper.sh
      - -b
      - 0.0.0.0
      - -Dhawkular.metrics.cassandra.nodes=hawkular-cassandra
      - -Dhawkular.metrics.cassandra.use-ssl
      - -Dhawkular.metrics.openshift.auth-methods=openshift-oauth,htpasswd
      - -Dhawkular.metrics.openshift.htpasswd-file=/secrets/hawkular-metrics.htpasswd.file
      - -Dhawkular.metrics.allowed-cors-access-control-allow-headers=authorization
      - -Dhawkular.metrics.default-ttl=7
      - -Dhawkular.metrics.admin-tenant=_hawkular_admin
      - -Dhawkular-alerts.cassandra-nodes=hawkular-cassandra
      - -Dhawkular-alerts.cassandra-use-ssl
      - -Dhawkular.alerts.openshift.auth-methods=openshift-oauth,htpasswd
      - -Dhawkular.alerts.openshift.htpasswd-file=/secrets/hawkular-metrics.htpasswd.file
      - -Dhawkular.alerts.allowed-cors-access-control-allow-headers=authorization
      - -Dorg.apache.tomcat.util.buf.UDecoder.ALLOW_ENCODED_SLASH=true
      - -Dorg.apache.catalina.connector.CoyoteAdapter.ALLOW_BACKSLASH=true
      - -Dcom.datastax.driver.FORCE_NIO=true
      - -DKUBERNETES_MASTER_URL=https://kubernetes.default.svc.cluster.local
      - -DUSER_WRITE_ACCESS=False
      - --hmw.keystore=/secrets/hawkular-metrics.keystore
      - --hmw.truststore=/secrets/hawkular-metrics.truststore
      - --hmw.keystore_password_file=/secrets/hawkular-metrics.keystore.password
      - --hmw.truststore_password_file=/secrets/hawkular-metrics.truststore.password
      - --hmw.jgroups_keystore=/secrets/hawkular-metrics.jgroups.keystore
      - --hmw.jgroups_keystore_password_file=/secrets/hawkular-metrics.jgroups.keystore.password
      - --hmw.jgroups_alias_file=/secrets/hawkular-metrics.jgroups.alias
      env:
      - name: POD_NAMESPACE
        valueFrom:
          fieldRef:
            apiVersion: v1
            fieldPath: metadata.namespace
      - name: MASTER_URL
        value: https://kubernetes.default.svc.cluster.local
      - name: OPENSHIFT_KUBE_PING_NAMESPACE
        valueFrom:
          fieldRef:
            apiVersion: v1
            fieldPath: metadata.namespace
      - name: OPENSHIFT_KUBE_PING_LABELS
        value: metrics-infra=hawkular-metrics,name=hawkular-metrics
      - name: STARTUP_TIMEOUT
        value: "500"
      image: openshift3/metrics-hawkular-metrics:3.5.0
      imagePullPolicy: IfNotPresent
      livenessProbe:
        exec:
          command:
          - /opt/hawkular/scripts/hawkular-metrics-liveness.py
        failureThreshold: 3
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 1
      name: hawkular-metrics
      ports:
      - containerPort: 8080
        name: http-endpoint
        protocol: TCP
      - containerPort: 8443
        name: https-endpoint
        protocol: TCP
      - containerPort: 8888
        name: ping
        protocol: TCP
      readinessProbe:
        exec:
          command:
          - /opt/hawkular/scripts/hawkular-metrics-readiness.py
        failureThreshold: 3
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 1
      resources:
        limits:
          memory: 2500M
        requests:
          memory: 1500M
      securityContext:
        capabilities:
          drop:
          - KILL
          - MKNOD
          - SETGID
          - SETUID
          - SYS_CHROOT
        privileged: false
        runAsUser: 1000000000
        seLinuxOptions:
          level: s0:c1,c0
      terminationMessagePath: /dev/termination-log
      volumeMounts:
      - mountPath: /secrets
        name: hawkular-metrics-secrets
      - mountPath: /client-secrets
        name: hawkular-metrics-client-secrets
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: hawkular-token-xnc2c
        readOnly: true
    dnsPolicy: ClusterFirst
    imagePullSecrets:
    - name: hawkular-dockercfg-5cqrk
    nodeName: ose1.bc.jonqe.lab.eng.bos.redhat.com
    restartPolicy: Always
    securityContext:
      fsGroup: 1000000000
      seLinuxOptions:
        level: s0:c1,c0
    serviceAccount: hawkular
    serviceAccountName: hawkular
    terminationGracePeriodSeconds: 30
    volumes:
    - name: hawkular-metrics-secrets
      secret:
        defaultMode: 420
        secretName: hawkular-metrics-secrets
    - name: hawkular-metrics-client-secrets
      secret:
        defaultMode: 420
        secretName: hawkular-metrics-account
    - name: hawkular-token-xnc2c
      secret:
        defaultMode: 420
        secretName: hawkular-token-xnc2c
  status:
    conditions:
    - lastProbeTime: null
      lastTransitionTime: 2017-06-06T17:16:27Z
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: 2017-06-06T17:19:57Z
      status: "True"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: 2017-06-06T17:16:27Z
      status: "True"
      type: PodScheduled
    containerStatuses:
    - containerID: docker://6f8ffde05b78be196a885479dafdca04b33a4254fb53ef5075eacf4d88ce8d7c
      image: openshift3/metrics-hawkular-metrics:3.5.0
      imageID: docker-pullable://brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/metrics-hawkular-metrics@sha256:f84a9f9abd9d4407b1a0b8542392ce2e57674821946c2edae410e9c7f4e3e764
      lastState: {}
      name: hawkular-metrics
      ready: true
      restartCount: 0
      state:
        running:
          startedAt: 2017-06-06T17:16:38Z
    hostIP: 10.16.23.148
    phase: Running
    podIP: 10.128.0.91
    startTime: 2017-06-06T17:16:27Z
- apiVersion: v1
  kind: Pod
  metadata:
    annotations:
      kubernetes.io/created-by: |
        {"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"openshift-infra","name":"heapster","uid":"5d811534-1713-11e7-b64b-001a4a10173a","apiVersion":"v1","resourceVersion":"188998"}}
      openshift.io/scc: restricted
    creationTimestamp: 2017-04-07T20:06:30Z
    generateName: heapster-
    labels:
      metrics-infra: heapster
      name: heapster
    name: heapster-tpc9q
    namespace: openshift-infra
    resourceVersion: "1074853"
    selfLink: /api/v1/namespaces/openshift-infra/pods/heapster-tpc9q
    uid: b0d237b4-1bcd-11e7-9dc7-001a4a10173a
  spec:
    containers:
    - command:
      - heapster-wrapper.sh
      - --wrapper.allowed_users_file=/secrets/heapster.allowed-users
      - --source=kubernetes.summary_api:${MASTER_URL}?useServiceAccount=true&kubeletHttps=true&kubeletPort=10250
      - --tls_cert=/secrets/heapster.cert
      - --tls_key=/secrets/heapster.key
      - --tls_client_ca=/secrets/heapster.client-ca
      - --allowed_users=%allowed_users%
      - --metric_resolution=30s
      - --wrapper.username_file=/hawkular-account/hawkular-metrics.username
      - --wrapper.password_file=/hawkular-account/hawkular-metrics.password
      - --wrapper.endpoint_check=https://hawkular-metrics:443/hawkular/metrics/status
      - --sink=hawkular:https://hawkular-metrics:443?tenant=_system&labelToTenant=pod_namespace&labelNodeId=nodename&caCert=/hawkular-cert/hawkular-metrics-ca.certificate&user=%username%&pass=%password%&filter=label(container_name:^system.slice.*|^user.slice)
      env:
      - name: STARTUP_TIMEOUT
        value: "500"
      image: openshift3/metrics-heapster:3.5.0
      imagePullPolicy: IfNotPresent
      name: heapster
      ports:
      - containerPort: 8082
        name: http-endpoint
        protocol: TCP
      readinessProbe:
        exec:
          command:
          - /opt/heapster-readiness.sh
        failureThreshold: 3
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 1
      resources:
        limits:
          memory: 3750M
        requests:
          memory: 937500k
      securityContext:
        capabilities:
          drop:
          - KILL
          - MKNOD
          - SETGID
          - SETUID
          - SYS_CHROOT
        privileged: false
        runAsUser: 1000000000
        seLinuxOptions:
          level: s0:c1,c0
      terminationMessagePath: /dev/termination-log
      volumeMounts:
      - mountPath: /secrets
        name: heapster-secrets
      - mountPath: /hawkular-cert
        name: hawkular-metrics-certificate
      - mountPath: /hawkular-account
        name: hawkular-metrics-account
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: heapster-token-1ksp6
        readOnly: true
    dnsPolicy: ClusterFirst
    imagePullSecrets:
    - name: heapster-dockercfg-k22zx
    nodeName: ose2.bc.jonqe.lab.eng.bos.redhat.com
    restartPolicy: Always
    securityContext:
      fsGroup: 1000000000
      seLinuxOptions:
        level: s0:c1,c0
    serviceAccount: heapster
    serviceAccountName: heapster
    terminationGracePeriodSeconds: 30
    volumes:
    - name: heapster-secrets
      secret:
        defaultMode: 420
        secretName: heapster-secrets
    - name: hawkular-metrics-certificate
      secret:
        defaultMode: 420
        secretName: hawkular-metrics-certificate
    - name: hawkular-metrics-account
      secret:
        defaultMode: 420
        secretName: hawkular-metrics-account
    - name: heapster-token-1ksp6
      secret:
        defaultMode: 420
        secretName: heapster-token-1ksp6
  status:
    conditions:
    - lastProbeTime: null
      lastTransitionTime: 2017-04-07T20:06:30Z
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: 2017-05-03T03:55:47Z
      status: "True"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: 2017-04-07T20:06:30Z
      status: "True"
      type: PodScheduled
    containerStatuses:
    - containerID: docker://46f893d2183bd541e7648791c3f5cc6da31c70b5b19f7c7b2033494c097e7aad
      image: openshift3/metrics-heapster:3.5.0
      imageID: docker-pullable://brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/metrics-heapster@sha256:52ec40ac8235095951cf321d77c10bdc59194e620fc24d5eb49383d671eb6937
      lastState: {}
      name: heapster
      ready: true
      restartCount: 2
      state:
        running:
          startedAt: 2017-04-20T22:45:22Z
    hostIP: 10.16.23.195
    phase: Running
    podIP: 10.129.0.113
    startTime: 2017-04-07T20:06:30Z
kind: List
metadata: {}
resourceVersion: ""
selfLink: ""
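
The dump above shows both Cassandra pods mounting `cassandra-data` as an `emptyDir` volume and sharing the `hawkular-cassandra-1-` generateName. Whether either detail is related to the missing keyspaces is not established in this report; the sketch below only shows how such details could be pulled out of a pod list mechanically. It assumes the list was saved as JSON (for example with `oc get pods -o json -n openshift-infra`), and the function names and the trimmed `SAMPLE` structure are introduced here for illustration.

```python
import json


def load_pod_list(path):
    """Load a pod list previously saved as JSON (the path is up to the caller)."""
    with open(path) as handle:
        return json.load(handle)


def emptydir_volumes(pod_list):
    """Map each pod name to the names of its emptyDir volumes, if it has any."""
    found = {}
    for pod in pod_list.get("items", []):
        names = [
            volume["name"]
            for volume in pod["spec"].get("volumes", [])
            if "emptyDir" in volume
        ]
        if names:
            found[pod["metadata"]["name"]] = names
    return found


def pods_per_generate_name(pod_list):
    """Count pods per generateName prefix, i.e. per owning controller template."""
    counts = {}
    for pod in pod_list.get("items", []):
        meta = pod["metadata"]
        prefix = meta.get("generateName", meta["name"])
        counts[prefix] = counts.get(prefix, 0) + 1
    return counts


# Trimmed to the fields used above; values copied from the dump in this comment.
SAMPLE = {
    "kind": "List",
    "items": [
        {"metadata": {"name": "hawkular-cassandra-1-x94p0",
                      "generateName": "hawkular-cassandra-1-"},
         "spec": {"volumes": [{"name": "cassandra-data", "emptyDir": {}},
                              {"name": "hawkular-cassandra-secrets",
                               "secret": {"secretName": "hawkular-cassandra-secrets"}}]}},
        {"metadata": {"name": "hawkular-cassandra-1-z18wk",
                      "generateName": "hawkular-cassandra-1-"},
         "spec": {"volumes": [{"name": "cassandra-data", "emptyDir": {}}]}},
        {"metadata": {"name": "hawkular-metrics-zj777",
                      "generateName": "hawkular-metrics-"},
         "spec": {"volumes": [{"name": "hawkular-metrics-secrets",
                               "secret": {"secretName": "hawkular-metrics-secrets"}}]}},
    ],
}
```

With `SAMPLE`, `emptydir_volumes` reports `cassandra-data` for both Cassandra pods and `pods_per_generate_name` counts two pods under `hawkular-cassandra-1-`.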


- 'oc describe pods -n openshift-infra'

oc describe pods -n openshift-infra
Name:			hawkular-cassandra-1-x94p0
Namespace:		openshift-infra
Security Policy:	restricted
Node:			ose1.bc.jonqe.lab.eng.bos.redhat.com/10.16.23.148
Start Time:		Thu, 20 Apr 2017 19:53:17 -0300
Labels:			metrics-infra=hawkular-cassandra
			name=hawkular-cassandra-1
			type=hawkular-cassandra
Status:			Running
IP:			10.128.0.73
Controllers:		ReplicationController/hawkular-cassandra-1
Containers:
  hawkular-cassandra-1:
    Container ID:	docker://aa0ee98fd46cd1727bbd423053f4725877175f8558c71e656768a1b8c2e0b82e
    Image:		openshift3/metrics-cassandra:3.5.0
    Image ID:		docker-pullable://brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/metrics-cassandra@sha256:f195339f5bbcaf5de4a844fa1738f83ddb36c372c6cb03859199bb53bcf5e093
    Ports:		9042/TCP, 9160/TCP, 7000/TCP, 7001/TCP
    Command:
      /opt/apache-cassandra/bin/cassandra-docker.sh
      --cluster_name=hawkular-metrics
      --data_volume=/cassandra_data
      --internode_encryption=all
      --require_node_auth=true
      --enable_client_encryption=true
      --require_client_auth=true
      --keystore_file=/secret/cassandra.keystore
      --keystore_password_file=/secret/cassandra.keystore.password
      --truststore_file=/secret/cassandra.truststore
      --truststore_password_file=/secret/cassandra.truststore.password
      --cassandra_pem_file=/secret/cassandra.pem
    Limits:
      memory:	2G
    Requests:
      memory:		1G
    State:		Running
      Started:		Thu, 20 Apr 2017 19:53:56 -0300
    Ready:		True
    Restart Count:	0
    Readiness:		exec [/opt/apache-cassandra/bin/cassandra-docker-ready.sh] delay=0s timeout=1s period=10s #success=1 #failure=3
    Volume Mounts:
      /cassandra_data from cassandra-data (rw)
      /secret from hawkular-cassandra-secrets (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from cassandra-token-5l9kw (ro)
    Environment Variables:
      CASSANDRA_MASTER:		true
      CASSANDRA_DATA_VOLUME:	/cassandra_data
      JVM_OPTS:			-Dcassandra.commitlog.ignorereplayerrors=true
      POD_NAMESPACE:		openshift-infra (v1:metadata.namespace)
      MEMORY_LIMIT:		2000000000 (limits.memory)
      CPU_LIMIT:		node allocatable (limits.cpu)
Conditions:
  Type		Status
  Initialized 	True
  Ready 	True
  PodScheduled 	True
Volumes:
  cassandra-data:
    Type:	EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  hawkular-cassandra-secrets:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	hawkular-cassandra-secrets
  cassandra-token-5l9kw:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	cassandra-token-5l9kw
QoS Class:	Burstable
Tolerations:	<none>
No events.


Name:			hawkular-cassandra-1-z18wk
Namespace:		openshift-infra
Security Policy:	restricted
Node:			ose1.bc.jonqe.lab.eng.bos.redhat.com/10.16.23.148
Start Time:		Thu, 20 Apr 2017 19:53:18 -0300
Labels:			metrics-infra=hawkular-cassandra
			name=hawkular-cassandra-1
			type=hawkular-cassandra
Status:			Running
IP:			10.128.0.74
Controllers:		ReplicationController/hawkular-cassandra-1
Containers:
  hawkular-cassandra-1:
    Container ID:	docker://5bdbbc04a1744d72de5debdb63e5502f42f2e7b53167ed3fdc5dde1fd764704c
    Image:		openshift3/metrics-cassandra:3.5.0
    Image ID:		docker-pullable://brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/metrics-cassandra@sha256:f195339f5bbcaf5de4a844fa1738f83ddb36c372c6cb03859199bb53bcf5e093
    Ports:		9042/TCP, 9160/TCP, 7000/TCP, 7001/TCP
    Command:
      /opt/apache-cassandra/bin/cassandra-docker.sh
      --cluster_name=hawkular-metrics
      --data_volume=/cassandra_data
      --internode_encryption=all
      --require_node_auth=true
      --enable_client_encryption=true
      --require_client_auth=true
      --keystore_file=/secret/cassandra.keystore
      --keystore_password_file=/secret/cassandra.keystore.password
      --truststore_file=/secret/cassandra.truststore
      --truststore_password_file=/secret/cassandra.truststore.password
      --cassandra_pem_file=/secret/cassandra.pem
    Limits:
      memory:	2G
    Requests:
      memory:		1G
    State:		Running
      Started:		Thu, 20 Apr 2017 19:53:56 -0300
    Ready:		True
    Restart Count:	0
    Readiness:		exec [/opt/apache-cassandra/bin/cassandra-docker-ready.sh] delay=0s timeout=1s period=10s #success=1 #failure=3
    Volume Mounts:
      /cassandra_data from cassandra-data (rw)
      /secret from hawkular-cassandra-secrets (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from cassandra-token-5l9kw (ro)
    Environment Variables:
      CASSANDRA_MASTER:		true
      CASSANDRA_DATA_VOLUME:	/cassandra_data
      JVM_OPTS:			-Dcassandra.commitlog.ignorereplayerrors=true
      POD_NAMESPACE:		openshift-infra (v1:metadata.namespace)
      MEMORY_LIMIT:		2000000000 (limits.memory)
      CPU_LIMIT:		node allocatable (limits.cpu)
Conditions:
  Type		Status
  Initialized 	True
  Ready 	True
  PodScheduled 	True
Volumes:
  cassandra-data:
    Type:	EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  hawkular-cassandra-secrets:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	hawkular-cassandra-secrets
  cassandra-token-5l9kw:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	cassandra-token-5l9kw
QoS Class:	Burstable
Tolerations:	<none>
No events.


Name:			hawkular-metrics-zj777
Namespace:		openshift-infra
Security Policy:	restricted
Node:			ose1.bc.jonqe.lab.eng.bos.redhat.com/10.16.23.148
Start Time:		Tue, 06 Jun 2017 14:16:27 -0300
Labels:			metrics-infra=hawkular-metrics
			name=hawkular-metrics
Status:			Running
IP:			10.128.0.91
Controllers:		ReplicationController/hawkular-metrics
Containers:
  hawkular-metrics:
    Container ID:	docker://6f8ffde05b78be196a885479dafdca04b33a4254fb53ef5075eacf4d88ce8d7c
    Image:		openshift3/metrics-hawkular-metrics:3.5.0
    Image ID:		docker-pullable://brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/metrics-hawkular-metrics@sha256:f84a9f9abd9d4407b1a0b8542392ce2e57674821946c2edae410e9c7f4e3e764
    Ports:		8080/TCP, 8443/TCP, 8888/TCP
    Command:
      /opt/hawkular/scripts/hawkular-metrics-wrapper.sh
      -b
      0.0.0.0
      -Dhawkular.metrics.cassandra.nodes=hawkular-cassandra
      -Dhawkular.metrics.cassandra.use-ssl
      -Dhawkular.metrics.openshift.auth-methods=openshift-oauth,htpasswd
      -Dhawkular.metrics.openshift.htpasswd-file=/secrets/hawkular-metrics.htpasswd.file
      -Dhawkular.metrics.allowed-cors-access-control-allow-headers=authorization
      -Dhawkular.metrics.default-ttl=7
      -Dhawkular.metrics.admin-tenant=_hawkular_admin
      -Dhawkular-alerts.cassandra-nodes=hawkular-cassandra
      -Dhawkular-alerts.cassandra-use-ssl
      -Dhawkular.alerts.openshift.auth-methods=openshift-oauth,htpasswd
      -Dhawkular.alerts.openshift.htpasswd-file=/secrets/hawkular-metrics.htpasswd.file
      -Dhawkular.alerts.allowed-cors-access-control-allow-headers=authorization
      -Dorg.apache.tomcat.util.buf.UDecoder.ALLOW_ENCODED_SLASH=true
      -Dorg.apache.catalina.connector.CoyoteAdapter.ALLOW_BACKSLASH=true
      -Dcom.datastax.driver.FORCE_NIO=true
      -DKUBERNETES_MASTER_URL=https://kubernetes.default.svc.cluster.local
      -DUSER_WRITE_ACCESS=False
      --hmw.keystore=/secrets/hawkular-metrics.keystore
      --hmw.truststore=/secrets/hawkular-metrics.truststore
      --hmw.keystore_password_file=/secrets/hawkular-metrics.keystore.password
      --hmw.truststore_password_file=/secrets/hawkular-metrics.truststore.password
      --hmw.jgroups_keystore=/secrets/hawkular-metrics.jgroups.keystore
      --hmw.jgroups_keystore_password_file=/secrets/hawkular-metrics.jgroups.keystore.password
      --hmw.jgroups_alias_file=/secrets/hawkular-metrics.jgroups.alias
    Limits:
      memory:	2500M
    Requests:
      memory:		1500M
    State:		Running
      Started:		Tue, 06 Jun 2017 14:16:38 -0300
    Ready:		True
    Restart Count:	0
    Liveness:		exec [/opt/hawkular/scripts/hawkular-metrics-liveness.py] delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:		exec [/opt/hawkular/scripts/hawkular-metrics-readiness.py] delay=0s timeout=1s period=10s #success=1 #failure=3
    Volume Mounts:
      /client-secrets from hawkular-metrics-client-secrets (rw)
      /secrets from hawkular-metrics-secrets (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from hawkular-token-xnc2c (ro)
    Environment Variables:
      POD_NAMESPACE:			openshift-infra (v1:metadata.namespace)
      MASTER_URL:			https://kubernetes.default.svc.cluster.local
      OPENSHIFT_KUBE_PING_NAMESPACE:	openshift-infra (v1:metadata.namespace)
      OPENSHIFT_KUBE_PING_LABELS:	metrics-infra=hawkular-metrics,name=hawkular-metrics
      STARTUP_TIMEOUT:			500
Conditions:
  Type		Status
  Initialized 	True
  Ready 	True
  PodScheduled 	True
Volumes:
  hawkular-metrics-secrets:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	hawkular-metrics-secrets
  hawkular-metrics-client-secrets:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	hawkular-metrics-account
  hawkular-token-xnc2c:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	hawkular-token-xnc2c
QoS Class:	Burstable
Tolerations:	<none>
No events.


Name:			heapster-tpc9q
Namespace:		openshift-infra
Security Policy:	restricted
Node:			ose2.bc.jonqe.lab.eng.bos.redhat.com/10.16.23.195
Start Time:		Fri, 07 Apr 2017 17:06:30 -0300
Labels:			metrics-infra=heapster
			name=heapster
Status:			Running
IP:			10.129.0.113
Controllers:		ReplicationController/heapster
Containers:
  heapster:
    Container ID:	docker://46f893d2183bd541e7648791c3f5cc6da31c70b5b19f7c7b2033494c097e7aad
    Image:		openshift3/metrics-heapster:3.5.0
    Image ID:		docker-pullable://brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/metrics-heapster@sha256:52ec40ac8235095951cf321d77c10bdc59194e620fc24d5eb49383d671eb6937
    Port:		8082/TCP
    Command:
      heapster-wrapper.sh
      --wrapper.allowed_users_file=/secrets/heapster.allowed-users
      --source=kubernetes.summary_api:${MASTER_URL}?useServiceAccount=true&kubeletHttps=true&kubeletPort=10250
      --tls_cert=/secrets/heapster.cert
      --tls_key=/secrets/heapster.key
      --tls_client_ca=/secrets/heapster.client-ca
      --allowed_users=%allowed_users%
      --metric_resolution=30s
      --wrapper.username_file=/hawkular-account/hawkular-metrics.username
      --wrapper.password_file=/hawkular-account/hawkular-metrics.password
      --wrapper.endpoint_check=https://hawkular-metrics:443/hawkular/metrics/status
      --sink=hawkular:https://hawkular-metrics:443?tenant=_system&labelToTenant=pod_namespace&labelNodeId=nodename&caCert=/hawkular-cert/hawkular-metrics-ca.certificate&user=%username%&pass=%password%&filter=label(container_name:^system.slice.*|^user.slice)
    Limits:
      memory:	3750M
    Requests:
      memory:		937500k
    State:		Running
      Started:		Thu, 20 Apr 2017 19:45:22 -0300
    Ready:		True
    Restart Count:	2
    Readiness:		exec [/opt/heapster-readiness.sh] delay=0s timeout=1s period=10s #success=1 #failure=3
    Volume Mounts:
      /hawkular-account from hawkular-metrics-account (rw)
      /hawkular-cert from hawkular-metrics-certificate (rw)
      /secrets from heapster-secrets (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from heapster-token-1ksp6 (ro)
    Environment Variables:
      STARTUP_TIMEOUT:	500
Conditions:
  Type		Status
  Initialized 	True
  Ready 	True
  PodScheduled 	True
Volumes:
  heapster-secrets:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	heapster-secrets
  hawkular-metrics-certificate:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	hawkular-metrics-certificate
  hawkular-metrics-account:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	hawkular-metrics-account
  heapster-token-1ksp6:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	heapster-token-1ksp6
QoS Class:	Burstable
Tolerations:	<none>
No events.

Comment 3 Guilherme Baufaker Rêgo 2017-06-07 15:24:58 UTC
Created attachment 1285830 [details]
Cassandra Pod 1

Comment 4 Guilherme Baufaker Rêgo 2017-06-07 15:25:35 UTC
Created attachment 1285831 [details]
Cassandra Pod 2

Comment 5 Matt Wringe 2017-06-07 20:04:43 UTC
First, please attach output as files rather than pasting it directly into the Bugzilla.

The main issue here is that your installation is invalid. It appears that you have increased the Cassandra RC above 1, which is not how you scale the Cassandra components.

You will need to specify the number of Cassandra instances to deploy in your ansible inventory file. Otherwise you will have multiple Cassandra pods using the same filesystem which will cause problems.

In this case, it doesn't appear that you are using persistent volumes; you are using emptyDir volumes instead, so this is not what is causing your issue.

@gbaufake: it appears you are using the brew images, which are not supported. Why are you not using the supported images?

@jsanda: I don't see anything weird in the Cassandra logs, any idea about what is happening here?

Comment 6 John Sanda 2017-06-07 21:43:59 UTC
(In reply to Matt Wringe from comment #5)
> First, please attach output in files and not directly pasting them into the
> bugzilla.
> 
> The main issue here is that your installation is invalid. It appears that
> you have increased the Cassandra RC above 1, which is not how you scale the
> Cassandra components.
> 
> You will need to specify the number of Cassandra instances to deploy in your
> ansible inventory file. Otherwise you will have multiple Cassandra pods
> using the same filesystem which will cause problems.
> 
> In this case, it doesn't appear that you are using persistent volumes and
> are using the emptydir volumes instead. So this is not causing your issue.
> 
> @gbaufake: it appears you are using the brew image, those are not supported.
> Why are you not using the supported images?
> 
> @jsanda: I don't see anything weird in the Cassandra logs, any idea about
> what is happening here?

I can see from the logs that both the hawkular_alerts and hawkular_metrics keyspaces have been created. Cassandra must be under fairly heavy load because both logs are filled with tons of GC messages.

Comment 7 Matt Wringe 2017-06-07 21:53:12 UTC
@jsanda: The Hawkular Metrics pod is not running in this case, so there shouldn't be anything connecting to Cassandra to put it under heavy load.


@gbaufake: do you know how large a cluster you are trying to manage here? Were metrics running at some point and are now failing? Or is this a fresh installation?

Comment 8 John Sanda 2017-06-07 21:58:31 UTC
(In reply to Matt Wringe from comment #7)
> @jsanda: The Hawkular Metric pod is not running in this case, so there
> shouldn't be anything connecting to Cassandra to cause it to be under a
> heavy load.

Maybe the logs in comment 3 and in comment 4 are not current because they show both keyspaces exist.

> 
> 
> @gbaufake: do you know how large a cluster size you are trying to manage
> here? Were metrics running at some point and are now failing? Or is this a
> fresh installation?

Comment 9 Matt Wringe 2017-06-07 22:01:30 UTC
I asked gbaufake if restarting the pods made any difference, and he claimed he got the same results.

Comment 10 Guilherme Baufaker Rêgo 2017-06-07 22:08:21 UTC
@jsanda: the logs from Cassandra and hawkular-metrics are current.
Weird: sometimes the keyspace is created and sometimes it is not.


@mwringe: my cluster has two nodes

- 2 VMs with 8 GB of RAM each
- 40 GB of disk
- RHEL 7.3

I restarted the pods and got the same results

Comment 11 Matt Wringe 2017-06-07 22:21:39 UTC
are you still using the brew images?

Can you please use the supported access.redhat.com ones and then attach fresh logs when those start up?

Comment 12 John Sanda 2017-06-07 22:33:18 UTC
Can you log into both of the Cassandra pods and run `nodetool status` and share the output?
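
A minimal sketch of how the `nodetool status` output could be checked once it is collected, assuming the usual layout in which each node line starts with a two-letter code (status U/D followed by state N/L/J/M). The function name is hypothetical and the output itself still has to be gathered from each Cassandra pod.

```python
def nodes_not_up_normal(nodetool_output):
    """Return (code, address) pairs for nodes whose code is not 'UN'.

    Assumes node lines begin with a two-letter code: status (U=Up, D=Down)
    followed by state (N=Normal, L=Leaving, J=Joining, M=Moving).
    Header and separator lines are skipped.
    """
    valid_status = set("UD")
    valid_state = set("NLJM")
    problems = []
    for line in nodetool_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # blank lines and single-token header lines
        code = fields[0]
        is_node_line = (
            len(code) == 2
            and code[0] in valid_status
            and code[1] in valid_state
        )
        if is_node_line and code != "UN":
            problems.append((code, fields[1]))
    return problems
```

An empty result means every node listed reported Up/Normal; any other entry points at a node that is down, joining, leaving, or moving.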

Comment 13 Guilherme Baufaker Rêgo 2017-06-09 20:59:24 UTC
Sorry.
We have a problem on the blade center and some machines were affected.

I'll let you know when I find something.

Comment 14 Guilherme Baufaker Rêgo 2017-06-20 13:25:41 UTC
I think I was able to reproduce the situation in a Docker container:
https://issues.jboss.org/projects/HWKMETRICS/issues/HWKMETRICS-685.

Comment 15 Matt Wringe 2017-06-20 19:01:24 UTC
The issue described in https://issues.jboss.org/projects/HWKMETRICS/issues/HWKMETRICS-685 is invalid.

It should be expected that going into Cassandra while Hawkular Metrics is running and deleting keyspaces will cause problems.

I believe there is a possibility for this to occur in OpenShift if you are running Cassandra without persistent storage and you delete the Cassandra pod. When the Cassandra pod comes back up, Hawkular Metrics might be able to connect to it but the keyspace will not be available (the keyspace is created at start time).

We have a few options here:

1) detect this situation and have Hawkular Metrics restart itself (e.g. via a liveness probe). When Hawkular Metrics restarts, it will notice the keyspace does not exist and will create it.

2) have Hawkular Metrics be able to detect this situation better and automatically recreate the keyspaces if it detects they do not exist.
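
The detection step shared by both options can be sketched as follows. This is only an illustration in Python, not Hawkular Metrics code: the function names are hypothetical, the keyspace names are the two mentioned in comment 6, and the list of existing keyspaces is assumed to have been read from Cassandra already (on Cassandra 3.x, for example, from `system_schema.keyspaces`).

```python
# Keyspace names as reported in the Cassandra logs (comment 6).
REQUIRED_KEYSPACES = ("hawkular_metrics", "hawkular_alerts")


def missing_keyspaces(existing, required=REQUIRED_KEYSPACES):
    """Return the required keyspaces that are absent from `existing`."""
    existing_lower = {name.lower() for name in existing}
    return [name for name in required if name.lower() not in existing_lower]


def choose_action(existing, recreate_in_place):
    """Map the detection result onto the two options discussed above.

    Returns "ok" when nothing is missing, "restart" for option 1 and
    "recreate" for option 2.
    """
    if not missing_keyspaces(existing):
        return "ok"
    return "recreate" if recreate_in_place else "restart"
```

Either option needs the same check; they differ only in what happens once a keyspace is found to be missing.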

Comment 16 Matt Wringe 2017-06-20 19:04:15 UTC
@jsanda: any opinions here?

If we put this at the Hawkular level, it will work in all environments, not just OpenShift.

If we add a check and restart our Hawkular pod in this case, we will need to provide a mechanism to detect this scenario. But if we detect this scenario in the Hawkular code, it might be just as easy to just recreate the keyspace directly.

Thoughts?

Comment 17 John Sanda 2017-06-21 14:05:19 UTC
(In reply to Matt Wringe from comment #16)
> @jsanda: any opinions here?
> 
> If we put this at the Hawkular level, it will work in all environments. Not
> just OpenShift
> 
> If we add a check and restart our Hawkular pod in this case, we will need to
> provide a mechanism to detect this scenario. But if we detect this scenario
> in the Hawkular code, it might be just as easy to just recreate the keyspace
> directly.
> 
> Thoughts?

Detecting it might be tricky. We should get notifications from the driver on things like schema changes or schemas being added/dropped. Things get tricky though in the single C* node scenario. If we can rely on the driver notification, then this will be easy. If we cannot rely on the driver notifications, then we need to explore some other options.

When we do detect that the schema has been dropped, I think the best way of handling it is a restart.

Comment 18 Matt Wringe 2017-07-10 18:35:53 UTC
(In reply to John Sanda from comment #17)
> When we do detect that the schema has been dropped, I think the best way of
> handling it is a restart.

I am assuming it would be Hawkular Metrics which would detect that the schema has been dropped.

This means we can either:

1) have Hawkular Metrics terminate. This will cause the container within the pod to restart itself.

2) expose this error somewhere (perhaps the status endpoint). We can then use the liveness probe to check for this condition and restart the container based on it.

@jsanda: any thoughts on which is preferred?

Comment 19 John Sanda 2017-07-11 12:02:47 UTC
(In reply to Matt Wringe from comment #18)
> (In reply to John Sanda from comment #17)
> > When we do detect that the schema has been dropped, I think the best way of
> > handling it is a restart.
> 
> I am assuming it would be Hawkular Metrics which would detect that the
> schema has been dropped.
> 
> This means we can either:
> 
> 1) have Hawkular Metrics terminate. This will cause the container within the
> pod to restart itself.
> 
> 2) expose this error somewhere (perhaps the status endpoint). We can then
> use the liveness probe to check for this condition and restart the container
> based on it.
> 
> @jsanda: any thoughts on which is preferred?

My preference would be through the status endpoint and using the liveness probe.
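
A rough sketch of what such a liveness check might look like, assuming the status endpoint were extended to report the dropped schema. The `Schema` field and its `DROPPED` value are hypothetical (nothing like them exists in the status output discussed here), the function names are made up for illustration, and the URL is the one heapster already uses for its endpoint check.

```python
import json
import urllib.request

# URL taken from the heapster --wrapper.endpoint_check argument.
STATUS_URL = "https://hawkular-metrics:443/hawkular/metrics/status"


def schema_dropped(status):
    """Return True if the parsed status JSON reports a dropped schema.

    The "Schema" key and "DROPPED" value are assumptions for this sketch.
    """
    return status.get("Schema") == "DROPPED"


def probe(url=STATUS_URL, timeout=1.0, context=None):
    """Return the exit code an exec liveness probe would use.

    0 means healthy; 1 means the schema is reported as dropped, so the
    container should be restarted. Connection errors propagate to the
    caller as exceptions.
    """
    with urllib.request.urlopen(url, timeout=timeout, context=context) as resp:
        status = json.loads(resp.read().decode("utf-8"))
    return 1 if schema_dropped(status) else 0
```

A script wrapping this would end with `sys.exit(probe())`. The 1 second default timeout mirrors the `timeout=1s` on the existing liveness probe, and an `ssl.SSLContext` that trusts the Hawkular Metrics CA certificate can be passed as `context`.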