Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1721729

Summary: hawkular-cassandra readiness probe fails (0/1) when image tag version is incomplete
Product: OpenShift Container Platform
Component: Hawkular
Version: 3.11.0
Status: CLOSED NOTABUG
Severity: urgent
Priority: urgent
Reporter: Caden Marchese <cmarches>
Assignee: Ruben Vargas Palma <rvargasp>
QA Contact: Junqi Zhao <juzhao>
CC: akaiser, aos-bugs, jforrest, jmartisk, jolee, jpriddy, msweiker, nstielau, openshift-bugs-escalate, rvargasp, scuppett
Target Milestone: ---
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Last Closed: 2021-08-06 01:27:45 UTC
Type: Bug
Attachments:
  oc logs -p from a failing pod (flags: none)

Description Caden Marchese 2019-06-18 23:32:09 UTC
Description of problem:

Hawkular Cassandra pods stay unready (0/1) regardless of their reported status, on several different OpenShift clusters:

[root@server1 ~]# oc get pod
NAME                            READY     STATUS             RESTARTS   AGE
hawkular-cassandra-1-cwg6p      0/1       CrashLoopBackOff   33         2h
hawkular-metrics-k78t7          0/1       Running            865        5d
hawkular-metrics-schema-tw6kh   0/1       Completed          0          11d
heapster-tbkt9                  0/1       Running            867        5d

[root@server2 ~]# oc get pod
NAME                            READY     STATUS             RESTARTS   AGE
hawkular-cassandra-1-82gpn      0/1       CrashLoopBackOff   928        12d
hawkular-metrics-cqgr5          0/1       Running            526        12d
hawkular-metrics-schema-7kw4z   0/1       Completed          0          12d
heapster-lpsfm                  0/1       Running            523        12d

[root@server3 ~]# oc get pod
NAME                            READY     STATUS      RESTARTS   AGE
hawkular-cassandra-1-gtjzr      0/1       Error       927        14d
hawkular-metrics-schema-5zjs6   0/1       Completed   0          14d
hawkular-metrics-sprtv          0/1       Running     532        14d
heapster-vhh6g                  0/1       Running     523        14d
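
A listing like the ones above can be narrowed to the pods that are actually unready. This is a minimal sketch, assuming the default `oc get pod` column layout shown here (NAME, READY, STATUS, ...); the function name unready_pods is hypothetical and not part of any tool.

```shell
# unready_pods: read `oc get pod` output on stdin and print NAME and STATUS
# for every pod whose READY column is not n/n. Completed pods are skipped
# because a finished job is expected to show 0/1.
unready_pods() {
    awk 'NR > 1 && $3 != "Completed" {
        split($2, r, "/")
        if (r[1] != r[2]) print $1, $3
    }'
}

# Usage (assumes a logged-in oc session):
#   oc get pod -n openshift-infra | unready_pods
```

Skipping Completed pods keeps the hawkular-metrics-schema job out of the result, since it is not part of the problem being reported.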

A similar upstream issue is at https://github.com/openshift/origin/issues/15920. One of the customer's clusters was fixed by setting the image tag in the replication controller to v3.11.98 instead of v3.11; the other two clusters were not fixed by that change. The fix that did work is documented at https://access.redhat.com/solutions/4207931.

[root@saomap0004 ~]# oc get pod hawkular-cassandra-1-82gpn -o yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: 2019-05-23T23:46:15Z
  generateName: hawkular-cassandra-1-
  labels:
    metrics-infra: hawkular-cassandra
    name: hawkular-cassandra-1
    type: hawkular-cassandra
  name: hawkular-cassandra-1-82gpn
  namespace: openshift-infra
  ownerReferences:
  - apiVersion: v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicationController
    name: hawkular-cassandra-1
    uid: c6ab28fd-7db4-11e9-b0dd-005056875e3f
  resourceVersion: "7783503"
  selfLink: /api/v1/namespaces/openshift-infra/pods/hawkular-cassandra-1-82gpn
  uid: f4256f17-7db4-11e9-9762-00505687ded7
spec:
  containers:
  - command:
    - /opt/apache-cassandra/bin/cassandra-docker.sh
    - --cluster_name=hawkular-metrics
    - --data_volume=/cassandra_data
    - --internode_encryption=all
    - --require_node_auth=true
    - --enable_client_encryption=true
    - --require_client_auth=true
    env:
    - name: CASSANDRA_MASTER
      value: "true"
    - name: CASSANDRA_DATA_VOLUME
      value: /cassandra_data
    - name: JVM_OPTS
      value: -Dcassandra.commitlog.ignorereplayerrors=true
    - name: ENABLE_PROMETHEUS_ENDPOINT
      value: "True"
    - name: TRUSTSTORE_NODES_AUTHORITIES
      value: /hawkular-cassandra-certs/tls.peer.truststore.crt
    - name: TRUSTSTORE_CLIENT_AUTHORITIES
      value: /hawkular-cassandra-certs/tls.client.truststore.crt
    - name: TAKE_SNAPSHOT
      value: "False"
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: MEMORY_LIMIT
      valueFrom:
        resourceFieldRef:
          divisor: "0"
          resource: limits.memory
    - name: CPU_LIMIT
      valueFrom:
        resourceFieldRef:
          divisor: 1m
          resource: limits.cpu
    image: registry.access.redhat.com/openshift3/metrics-cassandra:v3.11
    imagePullPolicy: IfNotPresent
    lifecycle:
      postStart:
        exec:
          command:
          - /opt/apache-cassandra/bin/cassandra-poststart.sh
      preStop:
        exec:
          command:
          - /opt/apache-cassandra/bin/cassandra-prestop.sh
    name: hawkular-cassandra-1
    ports:
    - containerPort: 9042
      name: cql-port
      protocol: TCP
    - containerPort: 9160
      name: thrift-port
      protocol: TCP
    - containerPort: 7000
      name: tcp-port
      protocol: TCP
    - containerPort: 7001
      name: ssl-port
      protocol: TCP
    readinessProbe:
      exec:
        command:
        - /opt/apache-cassandra/bin/cassandra-docker-ready.sh
      failureThreshold: 3
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 10
    resources:
      limits:
        memory: 2G
      requests:
        memory: 1G
    securityContext:
      runAsUser: 1000040000
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /cassandra_data
      name: cassandra-data
    - mountPath: /hawkular-cassandra-certs
      name: hawkular-cassandra-certs
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: cassandra-token-j8wwx
      readOnly: true
  dnsPolicy: ClusterFirst
  imagePullSecrets:
  - name: cassandra-dockercfg-9xvnt
  nodeName: saoinp0006.dtcc.com
  nodeSelector:
    node-role.kubernetes.io/infra: "true"
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 1000040000
    seLinuxOptions:
      level: s0:c6,c5
    supplementalGroups:
    - 65534
  serviceAccount: cassandra
  serviceAccountName: cassandra
  terminationGracePeriodSeconds: 1800
  tolerations:
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  volumes:
  - name: cassandra-data
    persistentVolumeClaim:
      claimName: metrics-cassandra-1
  - name: hawkular-cassandra-certs
    secret:
      defaultMode: 420
      secretName: hawkular-cassandra-certs
  - name: cassandra-token-j8wwx
    secret:
      defaultMode: 420
      secretName: cassandra-token-j8wwx
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2019-05-23T23:46:15Z
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: 2019-06-02T08:20:43Z
    message: 'containers with unready status: [hawkular-cassandra-1]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: null
    message: 'containers with unready status: [hawkular-cassandra-1]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: 2019-05-23T23:46:15Z
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://d1cacc1c046a752c49a7626bbbde962efb25fefd80a0007102ba43976cabbd7f
    image: registry.access.redhat.com/openshift3/metrics-cassandra:v3.11
    imageID: docker-pullable://registry.access.redhat.com/openshift3/metrics-cassandra@sha256:67e49e23399a1a0df731c67a957759c703eda4318a164b666ac054d472f14281
    lastState:
      terminated:
        containerID: docker://d1cacc1c046a752c49a7626bbbde962efb25fefd80a0007102ba43976cabbd7f
        exitCode: 3
        finishedAt: 2019-06-18T22:06:08Z
        reason: Error
        startedAt: 2019-06-18T22:05:59Z
    name: hawkular-cassandra-1
    ready: false
    restartCount: 4539
    state:
      waiting:
        message: Back-off 5m0s restarting failed container=hawkular-cassandra-1 pod=hawkular-cassandra-1-82gpn_openshift-infra(f4256f17-7db4-11e9-9762-00505687ded7)
        reason: CrashLoopBackOff
  hostIP: 10.130.48.71
  phase: Running
  podIP: 10.16.4.22
  qosClass: Burstable
  startTime: 2019-05-23T23:46:15Z

Version-Release number of selected component (if applicable):

[root@server ~]# oc version
oc v3.11.69
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

How reproducible:


Steps to Reproduce:
1. Use image tag "image: registry.access.redhat.com/openshift3/metrics-cassandra:v3.11" rather than "image: registry.access.redhat.com/openshift3/metrics-cassandra:v3.11.98"
2. Wait for pods to fail readiness probe.
3. Set image tag to "image: registry.access.redhat.com/openshift3/metrics-cassandra:v3.11.98"
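
The difference between the two tags in the steps above can be checked mechanically. The following is a rough sketch, assuming that a "complete" tag means one with at least three numeric components, optionally followed by a build suffix (v3.11.98, v3.11.104-14); the function name tag_is_pinned is hypothetical.

```shell
# tag_is_pinned: succeed (exit 0) when the image tag has three numeric
# components, optionally followed by a -N build suffix, and fail for a
# floating tag such as v3.11.
tag_is_pinned() {
    tag=${1##*:}    # text after the last colon in the image reference
    printf '%s\n' "$tag" | grep -Eq '^v?[0-9]+\.[0-9]+\.[0-9]+(-[0-9]+)?$'
}

# Usage:
#   tag_is_pinned registry.access.redhat.com/openshift3/metrics-cassandra:v3.11 \
#       || echo "floating tag"
```

An image reference with no tag at all, or one given by digest, is also reported as not pinned by this check.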

Actual results:

The pod intermittently fails its readiness probe.

Expected results:

Pod should not fail readiness probe.

Additional info:

Events for unready probe:

  Type     Reason     Age                   From                          Message
  ----     ------     ----                  ----                          -------
  Warning  Unhealthy  37m (x2517 over 12d)  kubelet, saoinp0006.dtcc.com  Readiness probe failed: Could not get the Cassandra status. This may mean that the Cassandra instance is not up yet. Will try again
nodetool: Failed to connect to '127.0.0.1:7199' - ConnectException: 'Connection refused (Connection refused)'.
  Normal   Pulled   32m (x3357 over 12d)  kubelet, saoinp0006.dtcc.com  Container image "registry.access.redhat.com/openshift3/metrics-cassandra:v3.11" already present on machine
  Warning  BackOff  2m (x77225 over 12d)  kubelet, saoinp0006.dtcc.com  Back-off restarting failed container
[root@saomap0004 ~]#
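
The repeat counts in an event table like this one can be pulled out for a quick comparison between clusters. This is a sketch only, assuming the event lines keep the "(xN over AGE)" marker shown above; event_counts is a hypothetical helper name.

```shell
# event_counts: read event lines (as printed by `oc describe pod`) on stdin
# and print "<Reason> <count>" for each line with an "(xN over ...)" marker.
# Lines without the marker, such as wrapped message text, are ignored.
event_counts() {
    awk 'match($0, /\(x[0-9]+ over /) {
        # the match is "(x" + digits + " over ", so digits = RLENGTH - 8
        n = substr($0, RSTART + 2, RLENGTH - 8)
        print $2, n
    }'
}

# Usage (assumes a logged-in oc session):
#   oc describe pod hawkular-cassandra-1-82gpn -n openshift-infra | event_counts
```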

Comment 1 Caden Marchese 2019-06-18 23:35:51 UTC
Logs from the unready pod:

sed: cannot rename /opt/apache-cassandra/conf/sed6b3JJH: Operation not permitted
sed: cannot rename /opt/apache-cassandra/conf/sedIBVSPF: Operation not permitted
sed: cannot rename /opt/apache-cassandra/conf/sedQyRWxG: Operation not permitted
sed: cannot rename /opt/apache-cassandra/conf/sedtayTwJ: Operation not permitted
sed: cannot rename /opt/apache-cassandra/conf/sedWAzVmK: Operation not permitted
sed: cannot rename /opt/apache-cassandra/conf/sedun6iVJ: Operation not permitted
sed: cannot rename /opt/apache-cassandra/conf/sed0lEnoI: Operation not permitted
sed: cannot rename /opt/apache-cassandra/conf/sed4LogeJ: Operation not permitted
sed: cannot rename /opt/apache-cassandra/conf/sedfEYpVI: Operation not permitted

We have ruled out actual permissions errors and SELinux denials.
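
To compare how often this error appears across restarts or clusters, the matching log lines can be counted. A minimal sketch, assuming the message text stays exactly as shown above; count_sed_rename_errors is a hypothetical name.

```shell
# count_sed_rename_errors: count "sed: cannot rename" lines in a log read
# from stdin. grep -c exits non-zero when the count is 0, so `|| true`
# keeps the function's own exit status clean.
count_sed_rename_errors() {
    grep -c '^sed: cannot rename ' || true
}

# Usage (oc logs -p prints the log of the previous container instance):
#   oc logs -p hawkular-cassandra-1-82gpn -n openshift-infra | count_sed_rename_errors
```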

Comment 2 Jan Martiska 2019-06-19 06:23:02 UTC
So you suppose the root cause is that the v3.11 tag was pointing at a broken hawkular-cassandra image?
The problematic image 67e49e2339 from the output seems to be version v3.11.98-6 (2 months old). Currently the v3.11 tag points at v3.11.104-14 (image id d3e151cdb005). Can they try again, specifying just v3.11, to see whether it works correctly now? I don't have much insight into exactly what changed between these versions, but the old image could have been broken for some reason.
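
For checking which image a floating tag actually resolved to on a node, the digest can be read from the pod status. This is a sketch that assumes the imageID line has the "...@sha256:<hex>" form seen in the pod YAML in the description; image_digests is a hypothetical helper.

```shell
# image_digests: read `oc get pod <name> -o yaml` output on stdin and print
# the sha256 digest found on each imageID line.
image_digests() {
    sed -n 's/^ *imageID: .*@sha256:\([0-9a-f]*\).*$/\1/p'
}

# Usage (assumes a logged-in oc session); cut shortens the digest to the
# 10-character form quoted in this comment:
#   oc get pod hawkular-cassandra-1-82gpn -n openshift-infra -o yaml \
#       | image_digests | cut -c1-10
```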

Comment 3 Caden Marchese 2019-06-19 17:40:12 UTC
No luck changing the image tag back to v3.11 - same issue. We will try it with image tag v3.11.104.

Comment 6 Caden Marchese 2019-07-01 16:40:36 UTC
Customer was able to temporarily fix this by deleting and re-adding the schema job yaml with the fully suffixed image tag:

oc create -f the-below-text-block.yaml

apiVersion: batch/v1
kind: Job
metadata:
  annotations:
  labels:
    metrics-infra: hawkular-metrics
    name: hawkular-metrics-schema
  name: hawkular-metrics-schema
  selfLink: /apis/batch/v1/namespaces/openshift-infra/jobs/hawkular-metrics-schema
spec:
  backoffLimit: 6
  completions: 1
  parallelism: 1
  template:
    metadata:
      labels:
        job-name: hawkular-metrics-schema
    spec:
      containers:
      - env:
        - name: TRUSTSTORE_AUTHORITIES
          value: /hawkular-metrics-certs/tls.truststore.crt
        image: registry.access.redhat.com/openshift3/metrics-schema-installer:v3.11.104-14
        imagePullPolicy: IfNotPresent
        name: hawkular-metrics-schema
        volumeMounts:
        - mountPath: /hawkular-metrics-certs
          name: hawkular-metrics-certs
        - mountPath: /hawkular-account
          name: hawkular-metrics-account
      dnsPolicy: ClusterFirst
      restartPolicy: OnFailure
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30
      volumes:
      - name: hawkular-metrics-certs
        secret:
          defaultMode: 420
          secretName: hawkular-metrics-certs
      - name: hawkular-metrics-account
        secret:
          defaultMode: 420
          secretName: hawkular-metrics-account

This configuration has since broken - has our metrics image version/suffix been updated on our side recently?

Comment 8 Caden Marchese 2019-07-02 16:57:10 UTC
Customer's configuration broke when the image on our side was updated from 3.11.104-14 to 3.11.117-04. The workaround appears to be removing the suffix entirely, which defaults to the latest version. Adding the full and exact suffix works, but only until our registry changes the image version.

Comment 9 Caden Marchese 2019-07-11 17:49:27 UTC
Customer is continuing to experience this every few weeks. Each fix has yielded the same error eventually:

sed: cannot rename /opt/apache-cassandra/conf/sedU1J2iu: Operation not permitted
sed: cannot rename /opt/apache-cassandra/conf/sedBqYVXt: Operation not permitted
sed: cannot rename /opt/apache-cassandra/conf/sedEUDDFu: Operation not permitted
sed: cannot rename /opt/apache-cassandra/conf/sednQEs9s: Operation not permitted

Interestingly enough there was never an issue getting into this volume and writing to this storage. This issue only comes up when the pods are unready.

Here is what we have tried so far:

- Updated the replication controller to the most recent metrics image version suffix (this worked for a few days)
- Ran oc rsh into the pod and attempted to write to the directory showing permissions errors (this worked)
- Updated the schema job yaml with the up-to-date image version suffix and redeployed it (this worked for a few weeks, see the case for more details)
- Removed the version suffix entirely from the pod deployment config (this worked for a week or so)
- Removed the version suffix from the schema job entirely (did not work)
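
A plain write test may not exercise the step that fails. The sedXXXXXX names in the error suggest in-place editing (sed -i), which writes a temporary file next to the target and then renames it over the original, so a closer check is write-then-rename. The sketch below assumes that is what the startup script does; can_rename_in_dir and its scratch file names are hypothetical.

```shell
# can_rename_in_dir: create a scratch file in DIR, rename a second scratch
# file over it (the last step of an in-place edit), then clean up.
# Returns 0 if the rename succeeded, 1 otherwise.
can_rename_in_dir() {
    dir=$1
    target="$dir/.rename-check.$$"
    tmp="$dir/.rename-check-tmp.$$"
    echo old > "$target" || return 1
    echo new > "$tmp" || { rm -f "$target"; return 1; }
    if mv -f "$tmp" "$target"; then
        rm -f "$target"
        return 0
    fi
    rm -f "$tmp" "$target"
    return 1
}

# Usage, from a shell inside the pod (for example after oc rsh):
#   can_rename_in_dir /opt/apache-cassandra/conf && echo "rename ok"
```

The check uses its own scratch files, so it does not touch cassandra.yaml or any other existing file in the directory.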

Comment 12 Caden Marchese 2019-07-15 16:17:10 UTC
Created attachment 1590809
oc logs -p from a failing pod

Comment 13 jolee 2019-07-16 14:56:53 UTC
Created attachment 1591089
sosreport-swoapd0008-10.138.240.74-2019-07-15-snmvfkt

Comment 23 Caden Marchese 2019-07-30 16:58:35 UTC
On our call, customer confirmed that writing/moving around the directory /opt/apache-cassandra/conf/ works fine. I don't believe that the cassandra.yaml file has been specifically checked, but permissions in that filesystem seem fine. It seems to me that the issue may be just with the readiness probe, which is causing otherwise fine metrics pods to constantly restart.

    readinessProbe:
      exec:
        command:
        - /opt/apache-cassandra/bin/cassandra-docker-ready.sh
      failureThreshold: 3
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 10

I will wait for the customer to report whether the full remove/reinstall worked, and then I will suggest taking a look at cassandra-docker-ready.sh and the below variables. It is possible that their backend storage configuration does not meet the readiness probe's requirements.
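
One way to look at the probe in isolation is to run the readiness script by hand and read its exit status. This is a sketch under assumptions: it needs a logged-in oc session, it only works while the container is running (not during a back-off), and run_ready_probe is a hypothetical wrapper name. Note also that a failed readiness probe marks the pod unready but does not by itself restart the container; the pod YAML in the description shows the container exiting with exitCode 3.

```shell
# run_ready_probe: run the readiness script inside POD and report its exit
# status (0 means the probe would pass). The namespace defaults to
# openshift-infra, as in the pod YAML in the description.
run_ready_probe() {
    pod=$1
    ns=${2:-openshift-infra}
    oc exec -n "$ns" "$pod" -- /opt/apache-cassandra/bin/cassandra-docker-ready.sh
    rc=$?
    echo "probe exit status: $rc"
    return "$rc"
}

# Usage:
#   run_ready_probe hawkular-cassandra-1-82gpn
```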