Bug 1721729
| Summary: | hawkular-cassandra readiness probe fails (0/1) when image tag version is incomplete | ||||||
|---|---|---|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Caden Marchese <cmarches> | ||||
| Component: | Hawkular | Assignee: | Ruben Vargas Palma <rvargasp> | ||||
| Status: | CLOSED NOTABUG | QA Contact: | Junqi Zhao <juzhao> | ||||
| Severity: | urgent | Docs Contact: | |||||
| Priority: | urgent | ||||||
| Version: | 3.11.0 | CC: | akaiser, aos-bugs, jforrest, jmartisk, jolee, jpriddy, msweiker, nstielau, openshift-bugs-escalate, rvargasp, scuppett | ||||
| Target Milestone: | --- | ||||||
| Target Release: | 3.11.z | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2021-08-06 01:27:45 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Attachments: |
Description
Caden Marchese
2019-06-18 23:32:09 UTC
Logs from the unready pod:
sed: cannot rename /opt/apache-cassandra/conf/sed6b3JJH: Operation not permitted
sed: cannot rename /opt/apache-cassandra/conf/sedIBVSPF: Operation not permitted
sed: cannot rename /opt/apache-cassandra/conf/sedQyRWxG: Operation not permitted
sed: cannot rename /opt/apache-cassandra/conf/sedtayTwJ: Operation not permitted
sed: cannot rename /opt/apache-cassandra/conf/sedWAzVmK: Operation not permitted
sed: cannot rename /opt/apache-cassandra/conf/sedun6iVJ: Operation not permitted
sed: cannot rename /opt/apache-cassandra/conf/sed0lEnoI: Operation not permitted
sed: cannot rename /opt/apache-cassandra/conf/sed4LogeJ: Operation not permitted
sed: cannot rename /opt/apache-cassandra/conf/sedfEYpVI: Operation not permitted

We have ruled out actual permissions errors and SELinux denials.

So you suppose that basically the root cause is that the v3.11 tag was pointing at some broken hawkular-cassandra image? The problematic image 67e49e2339 from the output seems to be version v3.11.98-6 (2 months old). Currently the v3.11 tag is pointing at v3.11.104-14 (image id d3e151cdb005). Can they try again (specifying just v3.11) to see whether this works correctly now? I don't have much insight into what exactly changes between these versions, but the old image could have been broken for some reason.

No luck changing the image tag back to v3.11 - same issue. We will try it with image tag v3.11.104.

Customer was able to temporarily fix this by deleting and re-adding the schema job yaml with the fully suffixed image tag:
oc create -f the-below-text-block.yaml
apiVersion: batch/v1
kind: Job
metadata:
  annotations:
  labels:
    metrics-infra: hawkular-metrics
    name: hawkular-metrics-schema
  name: hawkular-metrics-schema
  selfLink: /apis/batch/v1/namespaces/openshift-infra/jobs/hawkular-metrics-schema
spec:
  backoffLimit: 6
  completions: 1
  parallelism: 1
  template:
    metadata:
      labels:
        job-name: hawkular-metrics-schema
    spec:
      containers:
      - env:
        - name: TRUSTSTORE_AUTHORITIES
          value: /hawkular-metrics-certs/tls.truststore.crt
        image: registry.access.redhat.com/openshift3/metrics-schema-installer:v3.11.104-14
        imagePullPolicy: IfNotPresent
        name: hawkular-metrics-schema
        volumeMounts:
        - mountPath: /hawkular-metrics-certs
          name: hawkular-metrics-certs
        - mountPath: /hawkular-account
          name: hawkular-metrics-account
      dnsPolicy: ClusterFirst
      restartPolicy: OnFailure
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30
      volumes:
      - name: hawkular-metrics-certs
        secret:
          defaultMode: 420
          secretName: hawkular-metrics-certs
      - name: hawkular-metrics-account
        secret:
          defaultMode: 420
          secretName: hawkular-metrics-account
This configuration has since broken - has our metrics image version/suffix been updated on our side recently?
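The tags quoted in this report come in three shapes: v3.11, v3.11.104, and v3.11.104-14. As a rough aid for auditing which image references in a cluster could move when the registry is updated, here is a minimal Python sketch that sorts a reference into one of three buckets. The function name and the bucket labels are hypothetical, and the assumption that only the fully suffixed form is fixed is drawn from the comments in this bug, not from any registry documentation.

```python
import re

# Assumption for illustration: tags look like vX.Y, vX.Y.Z, or vX.Y.Z-N,
# matching the shapes quoted in this report.
TAG_RE = re.compile(r"^v(\d+)\.(\d+)(?:\.(\d+)(?:-(\d+))?)?$")


def classify_tag(image_ref: str) -> str:
    """Return 'floating', 'z-stream', or 'pinned' for an image reference.

    Hypothetical helper; digests (name@sha256:...) are not handled.
    """
    # Only look at the last path component, so a registry port such as
    # host:5000/... is not mistaken for the tag.
    name = image_ref.rsplit("/", 1)[-1]
    if ":" not in name:
        return "floating"  # no tag at all, which resolves to 'latest'
    tag = name.split(":", 1)[1]
    match = TAG_RE.match(tag)
    if match is None:
        return "floating"  # unknown shape, treat it as movable
    _major, _minor, z_release, build = match.groups()
    if build is not None:
        return "pinned"    # e.g. v3.11.104-14
    if z_release is not None:
        return "z-stream"  # e.g. v3.11.104
    return "floating"      # e.g. v3.11


if __name__ == "__main__":
    ref = "registry.access.redhat.com/openshift3/metrics-schema-installer:v3.11.104-14"
    print(classify_tag(ref))
```

Running this over the image fields of the replication controllers and the schema job would show at a glance which of them use a tag that can be re-pointed.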
The customer's configuration broke when the image on our side was updated from 3.11.104-14 to 3.11.117-04. The workaround appears to be removing the suffix entirely, which defaults to the latest version. Adding the full and exact suffix works, but only until our registry changes the image version.

Customer is continuing to experience this every few weeks. Each fix has eventually yielded the same error:
sed: cannot rename /opt/apache-cassandra/conf/sedU1J2iu: Operation not permitted
sed: cannot rename /opt/apache-cassandra/conf/sedBqYVXt: Operation not permitted
sed: cannot rename /opt/apache-cassandra/conf/sedEUDDFu: Operation not permitted
sed: cannot rename /opt/apache-cassandra/conf/sednQEs9s: Operation not permitted

Interestingly enough, there was never an issue getting into this volume and writing to this storage. This issue only comes up when the pods are unready.

Here is what we have tried so far:
- Updated the replication controller to reflect the most recent metrics image version suffix (this worked for a few days)
- Used oc rsh to get into the pod and attempted to write to the directory showing permissions errors (this worked)
- Updated the schema job yaml with the up-to-date image version suffix and redeployed it (this worked for a few weeks, see the case for more details)
- Removed the version suffix entirely from the pod deployment config (this worked for a week or so)
- Removed the version suffix from the schema job entirely (did not work)

Created attachment 1590809 [details]
oc logs -p from a failing pod
Created attachment 1591089 [details]
sosreport-swoapd0008-10.138.240.74-2019-07-15-snmvfkt
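For context on the error text itself: "sed: cannot rename .../sedXXXXXX" is what GNU sed prints when the last step of an in-place edit (sed -i) fails. It writes the edited content to a temporary file in the same directory and then renames that file over the original. The sketch below mimics that two-step pattern in Python. The function name is hypothetical and this is not the container's actual startup script; it only illustrates why creating or writing files in the directory can succeed while the final rename is still refused.

```python
import os
import tempfile


def replace_in_place(path: str, old: str, new: str) -> None:
    """Edit a file in the style of `sed -i`: write a temp file, then rename it."""
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, "r") as src:
        text = src.read()

    # Step 1: create a temporary file in the SAME directory as the target.
    # This only requires permission to create files in that directory.
    fd, tmp_path = tempfile.mkstemp(prefix="sed", dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(text.replace(old, new))
        # Step 2: rename the temp file over the original. This is the step
        # that the log lines in this report show being rejected.
        os.rename(tmp_path, path)
    except OSError:
        # Remove the temp file if the write or the rename fails.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    target = os.path.join(workdir, "cassandra.yaml")
    with open(target, "w") as handle:
        handle.write("cluster_name: old\n")
    replace_in_place(target, "old", "new")
    with open(target) as handle:
        print(handle.read(), end="")
```

If this pattern is what fails in the pod, a manual test that only creates or appends to a file in /opt/apache-cassandra/conf/ would not exercise the failing step; renaming a new file over an existing one would be the closer check.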
On our call, customer confirmed that writing/moving around the directory /opt/apache-cassandra/conf/ works fine. I don't believe that the cassandra.yaml file has been specifically checked, but permissions in that filesystem seem fine. It seems to me that the issue may be just with the readiness probe, which is causing otherwise fine metrics pods to constantly restart.
readinessProbe:
  exec:
    command:
    - /opt/apache-cassandra/bin/cassandra-docker-ready.sh
  failureThreshold: 3
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 10
I will wait for the customer to report whether the full remove/reinstall worked, and then I will suggest taking a look at cassandra-docker-ready.sh and the readiness probe variables above. It is possible that their backend storage configuration does not satisfy the readiness probe's requirements.
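To make the probe settings concrete, here is a simplified Python model of how failureThreshold and successThreshold turn a sequence of probe results into a ready/unready state. It is a sketch of the documented threshold semantics, not the kubelet's code; the function names are hypothetical, and the timing figure ignores probe run time and the 10 second timeout.

```python
from typing import List


def readiness_timeline(results: List[bool],
                       failure_threshold: int = 3,
                       success_threshold: int = 1,
                       ready: bool = False) -> List[bool]:
    """Return the ready state after each probe result (True = probe passed)."""
    timeline = []
    ok_streak = 0
    fail_streak = 0
    for passed in results:
        if passed:
            ok_streak += 1
            fail_streak = 0
            if ok_streak >= success_threshold:
                ready = True
        else:
            fail_streak += 1
            ok_streak = 0
            if fail_streak >= failure_threshold:
                ready = False
        timeline.append(ready)
    return timeline


def seconds_until_unready(period_seconds: int = 10,
                          failure_threshold: int = 3) -> int:
    """Approximate time for a ready pod to flip to unready once probes start failing."""
    return period_seconds * failure_threshold


if __name__ == "__main__":
    # One pass, two failures, one pass, then three failures in a row.
    print(readiness_timeline([True, False, False, True, False, False, False]))
    print(seconds_until_unready())
```

With the values above, a pod that was ready flips to 0/1 after roughly 30 seconds of consecutive probe failures, and a single passing probe is enough to mark it ready again. A probe script that fails on every run would therefore keep the pod at 0/1 indefinitely.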