Bug 2027685
| Summary: | openshift-cluster-csi-drivers pods crashing on PSI | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | rlobillo |
| Component: | Storage | Assignee: | Emilien Macchi <emacchi> |
| Storage sub component: | OpenStack CSI Drivers | QA Contact: | Itay Matza <imatza> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | ||
| Priority: | high | CC: | aos-bugs, emacchi, jsafrane, m.andre, mbooth, ppitonak, pprinett, tsze |
| Version: | 4.9 | Keywords: | TestBlocker, Triaged |
| Target Milestone: | --- | ||
| Target Release: | 4.10.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Cause: The csi-driver pod's livenessProbe was too strict. Consequence: The probe would fail on slower clouds, causing the cluster to be degraded. Fix: Relax the livenessProbe to more realistic values to accommodate slower environments. Result: The cluster is no longer degraded on clouds with slow Cinder. | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-03-10 16:31:00 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 2037080 | ||
|
Description
rlobillo
2021-11-30 12:56:33 UTC
This is blocking our testing because it makes the cluster fail before we can run our tests. For us, it only happens on top of the rhos-d OpenStack PSI cloud. I checked the cluster; manually bumping the liveness probe log level shows:
I1202 10:44:05.042761 1 connection.go:183] GRPC call: /csi.v1.Identity/Probe
I1202 10:44:05.042765 1 connection.go:184] GRPC request: {}
I1202 10:44:08.042348 1 connection.go:186] GRPC response: {}
I1202 10:44:08.042398 1 connection.go:187] GRPC error: rpc error: code = Canceled desc = context canceled
That is, the CSI driver's Probe() call timed out after 3 seconds. This timeout causes the driver container to be declared unhealthy and restarted.
Probe() should ideally be cheap (it is called often to check driver health). There is a command-line option, -probe-timeout, that increases the probe timeout if the call is expected to take longer on a given OpenStack cloud:
https://github.com/openshift/openstack-cinder-csi-driver-operator/blob/286f1ba0f478a3443fa223721787f09f51531dc2/assets/controller.yaml#L265
I would then suggest increasing the liveness probe timeout to the same value, and increasing the period to something much larger than 10 seconds to give OpenStack some rest:
https://github.com/openshift/openstack-cinder-csi-driver-operator/blob/286f1ba0f478a3443fa223721787f09f51531dc2/assets/controller.yaml#L72-L73
(Note that the driver DaemonSet needs to be changed in the same way.)
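As a rough illustration of the suggestion above, the fragment below sketches what a relaxed probe configuration might look like. It is not the change that was merged: the container names, the socket path, the named port, and every numeric value are assumptions chosen only to show which fields are involved. The point it illustrates is that the sidecar's -probe-timeout and the kubelet's timeoutSeconds are kept at the same value, while periodSeconds is raised well above 10 seconds.

```yaml
# Illustrative sketch only. Names, paths, and all numbers are assumptions,
# not the values shipped in the fix.
containers:
  - name: csi-driver                 # hypothetical container name
    livenessProbe:
      httpGet:
        path: /healthz               # endpoint served by the liveness probe sidecar
        port: healthz                # hypothetical named port
      timeoutSeconds: 10             # keep equal to --probe-timeout below
      periodSeconds: 60              # much larger than the original 10 seconds
      failureThreshold: 5            # assumed value
  - name: csi-liveness-probe         # hypothetical sidecar name
    args:
      - --csi-address=/csi/csi.sock  # assumed socket path
      - --probe-timeout=10s          # the log above shows the call being cancelled after 3 seconds
```

With values like these, a single slow Probe() response would no longer restart the container; the kubelet would act only after several consecutive failures spread over a few minutes.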
Removing the Triaged keyword because:
* the QE automation assessment (flag qe_test_coverage) is missing

Verified with OCP 4.10.0-0.nightly-2022-01-10-144202 on top of OSP RHOS-16.1-RHEL-8-20211126.n.1:
- Confirmed that the values of the parameters were updated successfully.
- Confirmed on several different CI jobs: the current timeout values are working in the PSI cloud.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2022:0056