Bug 2278815 - [VolumeGroupSnapshot] csi-cephfsplugin-provisioner and csi-rbdplugin-provisioner pods stuck in CrashLoopBackOff state after enabling featuregate in OCP
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: build
Version: 4.16
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.16.0
Assignee: Boris Ranto
QA Contact: Sidhant Agrawal
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-05-03 07:28 UTC by Sidhant Agrawal
Modified: 2024-07-17 13:22 UTC
CC: 5 users

Fixed In Version: 4.16.0-96
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-07-17 13:22:05 UTC
Embargoed:




Links:
Red Hat Product Errata RHSA-2024:4591 (Last Updated: 2024-07-17 13:22:10 UTC)

Description Sidhant Agrawal 2024-05-03 07:28:19 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
In an OCP 4.16 + ODF 4.16 cluster, the csi-cephfsplugin-provisioner and csi-rbdplugin-provisioner pods are stuck in CrashLoopBackOff state after enabling the VolumeGroupSnapshot feature gate in OCP.

$ oc get pod -n openshift-storage | grep provisioner
csi-cephfsplugin-provisioner-54cd9c86f6-42bpx                     5/6     CrashLoopBackOff   15 (3m32s ago)   55m
csi-cephfsplugin-provisioner-54cd9c86f6-jtqtl                     5/6     CrashLoopBackOff   15 (3m21s ago)   55m
csi-rbdplugin-provisioner-d8b4c77cf-g7tj2                         5/6     CrashLoopBackOff   15 (3m23s ago)   55m
csi-rbdplugin-provisioner-d8b4c77cf-n4mlv                         5/6     CrashLoopBackOff   15 (3m39s ago)   55m

Logs from one of the provisioner pods (csi-snapshotter container):
---
flag provided but not defined: -enable-volume-group-snapshots
Usage of /usr/bin/csi-snapshotter:
  -add_dir_header
    	If true, adds the file directory to the header of the log messages
  -alsologtostderr
    	log to standard error as well as files (no effect when -logtostderr=true)
  -csi-address string
    	Address of the CSI driver socket. (default "/run/csi/socket")
  -extra-create-metadata
    	If set, add snapshot metadata to plugin snapshot requests as parameters.
  -groupsnapshot-name-prefix string
    	Prefix to apply to the name of a created group snapshot (default "groupsnapshot")
  -groupsnapshot-name-uuid-length int
    	Length in characters for the generated uuid of a created group snapshot. Defaults behavior is to NOT truncate. (default -1)
  -http-endpoint :8080
    	The TCP network address where the HTTP server for diagnostics, including metrics and leader election health check, will listen (example: :8080). The default is empty string, which means the server is disabled. Only one of `--metrics-address` and `--http-endpoint` can be set.
  -kube-api-burst int
    	Burst to use while communicating with the kubernetes apiserver. Defaults to 10. (default 10)
  -kube-api-qps float
    	QPS to use while communicating with the kubernetes apiserver. Defaults to 5.0. (default 5)
  -kubeconfig string
    	Absolute path to the kubeconfig file. Required only when running out of cluster.
  -leader-election
    	Enables leader election.
  -leader-election-lease-duration duration
    	Duration, in seconds, that non-leader candidates will wait to force acquire leadership. Defaults to 15 seconds. (default 15s)
  -leader-election-namespace string
    	The namespace where the leader election resource exists. Defaults to the pod namespace if not set.
  -leader-election-renew-deadline duration
    	Duration, in seconds, that the acting leader will retry refreshing leadership before giving up. Defaults to 10 seconds. (default 10s)
  -leader-election-retry-period duration
    	Duration, in seconds, the LeaderElector clients should wait between tries of actions. Defaults to 5 seconds. (default 5s)
  -log_backtrace_at value
    	when logging hits line file:N, emit a stack trace
  -log_dir string
    	If non-empty, write log files in this directory (no effect when -logtostderr=true)
  -log_file string
    	If non-empty, use this log file (no effect when -logtostderr=true)
  -log_file_max_size uint
    	Defines the maximum size a log file can grow to (no effect when -logtostderr=true). Unit is megabytes. If the value is 0, the maximum file size is unlimited. (default 1800)
  -logtostderr
    	log to standard error instead of files (default true)
  -metrics-address :8080
    	(deprecated) The TCP network address where the prometheus metrics endpoint will listen (example: :8080). The default is empty string, which means metrics endpoint is disabled. Only one of `--metrics-address` and `--http-endpoint` can be set.
  -metrics-path /metrics
    	The HTTP path where prometheus metrics will be exposed. Default is /metrics. (default "/metrics")
  -node-deployment
    	Enables deploying the sidecar controller together with a CSI driver on nodes to manage snapshots for node-local volumes.
  -one_output
    	If true, only write logs to their native severity level (vs also writing to each lower severity level; no effect when -logtostderr=true)
  -resync-period duration
    	Resync interval of the controller. Default is 15 minutes (default 15m0s)
  -retry-interval-max duration
    	Maximum retry interval of failed volume snapshot creation or deletion. Default is 5 minutes. (default 5m0s)
  -retry-interval-start duration
    	Initial retry interval of failed volume snapshot creation or deletion. It doubles with each failure, up to retry-interval-max. Default is 1 second. (default 1s)
  -skip_headers
    	If true, avoid header prefixes in the log messages
  -skip_log_headers
    	If true, avoid headers when opening log files (no effect when -logtostderr=true)
  -snapshot-name-prefix string
    	Prefix to apply to the name of a created snapshot (default "snapshot")
  -snapshot-name-uuid-length int
    	Length in characters for the generated uuid of a created snapshot. Defaults behavior is to NOT truncate. (default -1)
  -stderrthreshold value
    	logs at or above this threshold go to stderr when writing to files and stderr (no effect when -logtostderr=true or -alsologtostderr=false) (default 2)
  -timeout duration
    	The timeout for any RPCs to the CSI driver. Default is 1 minute. (default 1m0s)
  -v value
    	number for the log level verbosity
  -version
    	Show version.
  -vmodule value
    	comma-separated list of pattern=N settings for file-filtered logging
  -worker-threads int
    	Number of worker threads. (default 10)
---
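For reference: the usage output above shows that the deployed csi-snapshotter binary no longer recognizes `-enable-volume-group-snapshots`; upstream external-snapshotter v7 replaced that flag with a feature gate. A corrected sidecar spec would look roughly like the sketch below. This is an assumption based on the upstream flag change, not taken from this report; the container name, socket path, and gate name CSIVolumeGroupSnapshot are illustrative.

```yaml
# Hypothetical fragment of the provisioner Deployment spec.
# Assumption: csi-snapshotter v7+ expects --feature-gates instead of the
# removed --enable-volume-group-snapshots flag.
containers:
  - name: csi-snapshotter
    args:
      - "--csi-address=/csi/csi.sock"                  # illustrative socket path
      - "--feature-gates=CSIVolumeGroupSnapshot=true"  # replaces the removed flag
```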

Version of all relevant components (if applicable):
OCP: 4.16.0-0.nightly-2024-05-01-111315
ODF: 4.16.0-90.stable
Images from csi pods:
registry.redhat.io/odf4/cephcsi-rhel9@sha256:d851bc4896e3666ba4d965eac89010ed5eea6c59d55027a5f5a01f9b079aeafe
registry.redhat.io/odf4/odf-csi-addons-sidecar-rhel9@sha256:d0ca282694892d6caf025a35a593a3633785d2a40f4f8984e7f94a6906bb4236
registry.redhat.io/openshift4/ose-csi-external-attacher@sha256:bce20ed64dbee694666b75a96fd505223e8eed193d5cd40a607d871d0cc8b9c0
registry.redhat.io/openshift4/ose-csi-external-provisioner@sha256:2da32b524163a1e046bdde7750fe71a2f1175e509357db3cd1300ef849f4f0b6
registry.redhat.io/openshift4/ose-csi-external-resizer@sha256:927629fd0731988d52d5bb1094b650bc5def609bacb406dac5e60905e4c9ca26
registry.redhat.io/openshift4/ose-csi-external-snapshotter@sha256:965111171af569965e07b724eb93ea77077c6272023c02d0f1aa80ebcdef48fa
registry.redhat.io/openshift4/ose-csi-node-driver-registrar@sha256:b7eacc160fcce0881a00be2eb8d050a66b6cf68bcac2ef9da72d7c0297f77c0f

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Yes, the provisioner pods are in CrashLoopBackOff state.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:

Steps to Reproduce:

1. Deploy OCP 4.16 and ODF 4.16
2. Enable the VolumeGroupSnapshot feature gate in OCP
3. Observe the pod status
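Step 2 is typically done through the cluster-scoped FeatureGate CR, roughly as in the sketch below. The gate name VolumeGroupSnapshot is an assumption for this scenario; note that setting CustomNoUpgrade (or TechPreviewNoUpgrade) cannot be reverted on a cluster.

```yaml
# Hypothetical sketch of enabling the feature gate via the OCP FeatureGate CR.
# The gate name "VolumeGroupSnapshot" is an assumption, not taken from this report.
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: cluster
spec:
  featureSet: CustomNoUpgrade
  customNoUpgrade:
    enabled:
      - VolumeGroupSnapshot
```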


Actual results:
csi-cephfsplugin-provisioner and csi-rbdplugin-provisioner pods stuck in CrashLoopBackOff state

Expected results:
Pods should be in Running state

Additional info:

$ oc get csv -n openshift-storage
NAME                                        DISPLAY                            VERSION            REPLACES   PHASE
mcg-operator.v4.16.0-90.stable              NooBaa Operator                    4.16.0-90.stable              Succeeded
ocs-client-operator.v4.16.0-90.stable       OpenShift Data Foundation Client   4.16.0-90.stable              Succeeded
ocs-operator.v4.16.0-90.stable              OpenShift Container Storage        4.16.0-90.stable              Succeeded
odf-csi-addons-operator.v4.16.0-90.stable   CSI Addons                         4.16.0-90.stable              Succeeded
odf-operator.v4.16.0-90.stable              OpenShift Data Foundation          4.16.0-90.stable              Succeeded
odf-prometheus-operator.v4.16.0-90.stable   Prometheus Operator                4.16.0-90.stable              Succeeded
rook-ceph-operator.v4.16.0-90.stable        Rook-Ceph                          4.16.0-90.stable              Succeeded

Comment 12 errata-xmlrpc 2024-07-17 13:22:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591

