Bug 2278815 - [VolumeGroupSnapshot] csi-cephfsplugin-provisioner and csi-rbdplugin-provisioner pods stuck in CrashLoopBackOff state after enabling featuregate in OCP
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: build
Version: 4.16
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.16.0
Assignee: Boris Ranto
QA Contact: Sidhant Agrawal
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-05-03 07:28 UTC by Sidhant Agrawal
Modified: 2024-07-17 13:22 UTC
CC: 5 users

Fixed In Version: 4.16.0-96
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-07-17 13:22:05 UTC
Embargoed:




Links:
Red Hat Product Errata RHSA-2024:4591 (Last Updated: 2024-07-17 13:22:10 UTC)

Description Sidhant Agrawal 2024-05-03 07:28:19 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
In an OCP 4.16 + ODF 4.16 cluster, the csi-cephfsplugin-provisioner and csi-rbdplugin-provisioner pods are stuck in CrashLoopBackOff state after enabling the VolumeGroupSnapshot feature gate in OCP.

$ oc get pod -n openshift-storage | grep provisioner
csi-cephfsplugin-provisioner-54cd9c86f6-42bpx                     5/6     CrashLoopBackOff   15 (3m32s ago)   55m
csi-cephfsplugin-provisioner-54cd9c86f6-jtqtl                     5/6     CrashLoopBackOff   15 (3m21s ago)   55m
csi-rbdplugin-provisioner-d8b4c77cf-g7tj2                         5/6     CrashLoopBackOff   15 (3m23s ago)   55m
csi-rbdplugin-provisioner-d8b4c77cf-n4mlv                         5/6     CrashLoopBackOff   15 (3m39s ago)   55m

Logs from one of the provisioner pods (csi-snapshotter container):
---
flag provided but not defined: -enable-volume-group-snapshots
Usage of /usr/bin/csi-snapshotter:
  -add_dir_header
    	If true, adds the file directory to the header of the log messages
  -alsologtostderr
    	log to standard error as well as files (no effect when -logtostderr=true)
  -csi-address string
    	Address of the CSI driver socket. (default "/run/csi/socket")
  -extra-create-metadata
    	If set, add snapshot metadata to plugin snapshot requests as parameters.
  -groupsnapshot-name-prefix string
    	Prefix to apply to the name of a created group snapshot (default "groupsnapshot")
  -groupsnapshot-name-uuid-length int
    	Length in characters for the generated uuid of a created group snapshot. Defaults behavior is to NOT truncate. (default -1)
  -http-endpoint :8080
    	The TCP network address where the HTTP server for diagnostics, including metrics and leader election health check, will listen (example: :8080). The default is empty string, which means the server is disabled. Only one of `--metrics-address` and `--http-endpoint` can be set.
  -kube-api-burst int
    	Burst to use while communicating with the kubernetes apiserver. Defaults to 10. (default 10)
  -kube-api-qps float
    	QPS to use while communicating with the kubernetes apiserver. Defaults to 5.0. (default 5)
  -kubeconfig string
    	Absolute path to the kubeconfig file. Required only when running out of cluster.
  -leader-election
    	Enables leader election.
  -leader-election-lease-duration duration
    	Duration, in seconds, that non-leader candidates will wait to force acquire leadership. Defaults to 15 seconds. (default 15s)
  -leader-election-namespace string
    	The namespace where the leader election resource exists. Defaults to the pod namespace if not set.
  -leader-election-renew-deadline duration
    	Duration, in seconds, that the acting leader will retry refreshing leadership before giving up. Defaults to 10 seconds. (default 10s)
  -leader-election-retry-period duration
    	Duration, in seconds, the LeaderElector clients should wait between tries of actions. Defaults to 5 seconds. (default 5s)
  -log_backtrace_at value
    	when logging hits line file:N, emit a stack trace
  -log_dir string
    	If non-empty, write log files in this directory (no effect when -logtostderr=true)
  -log_file string
    	If non-empty, use this log file (no effect when -logtostderr=true)
  -log_file_max_size uint
    	Defines the maximum size a log file can grow to (no effect when -logtostderr=true). Unit is megabytes. If the value is 0, the maximum file size is unlimited. (default 1800)
  -logtostderr
    	log to standard error instead of files (default true)
  -metrics-address :8080
    	(deprecated) The TCP network address where the prometheus metrics endpoint will listen (example: :8080). The default is empty string, which means metrics endpoint is disabled. Only one of `--metrics-address` and `--http-endpoint` can be set.
  -metrics-path /metrics
    	The HTTP path where prometheus metrics will be exposed. Default is /metrics. (default "/metrics")
  -node-deployment
    	Enables deploying the sidecar controller together with a CSI driver on nodes to manage snapshots for node-local volumes.
  -one_output
    	If true, only write logs to their native severity level (vs also writing to each lower severity level; no effect when -logtostderr=true)
  -resync-period duration
    	Resync interval of the controller. Default is 15 minutes (default 15m0s)
  -retry-interval-max duration
    	Maximum retry interval of failed volume snapshot creation or deletion. Default is 5 minutes. (default 5m0s)
  -retry-interval-start duration
    	Initial retry interval of failed volume snapshot creation or deletion. It doubles with each failure, up to retry-interval-max. Default is 1 second. (default 1s)
  -skip_headers
    	If true, avoid header prefixes in the log messages
  -skip_log_headers
    	If true, avoid headers when opening log files (no effect when -logtostderr=true)
  -snapshot-name-prefix string
    	Prefix to apply to the name of a created snapshot (default "snapshot")
  -snapshot-name-uuid-length int
    	Length in characters for the generated uuid of a created snapshot. Defaults behavior is to NOT truncate. (default -1)
  -stderrthreshold value
    	logs at or above this threshold go to stderr when writing to files and stderr (no effect when -logtostderr=true or -alsologtostderr=false) (default 2)
  -timeout duration
    	The timeout for any RPCs to the CSI driver. Default is 1 minute. (default 1m0s)
  -v value
    	number for the log level verbosity
  -version
    	Show version.
  -vmodule value
    	comma-separated list of pattern=N settings for file-filtered logging
  -worker-threads int
    	Number of worker threads. (default 10)
---
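For reference: the usage output above shows that the deployed csi-snapshotter binary no longer recognizes `-enable-volume-group-snapshots`; upstream external-snapshotter v7 replaced that flag with a feature gate. A corrected sidecar spec would look roughly like the sketch below. This is an assumption based on the upstream flag change, not taken from this report; the container name, socket path, and gate name CSIVolumeGroupSnapshot are illustrative.

```yaml
# Hypothetical fragment of the provisioner Deployment spec.
# Assumption: csi-snapshotter v7+ expects --feature-gates instead of the
# removed --enable-volume-group-snapshots flag.
containers:
  - name: csi-snapshotter
    args:
      - "--csi-address=/csi/csi.sock"                  # illustrative socket path
      - "--feature-gates=CSIVolumeGroupSnapshot=true"  # replaces the removed flag
```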

Version of all relevant components (if applicable):
OCP: 4.16.0-0.nightly-2024-05-01-111315
ODF: 4.16.0-90.stable
Images from csi pods:
registry.redhat.io/odf4/cephcsi-rhel9@sha256:d851bc4896e3666ba4d965eac89010ed5eea6c59d55027a5f5a01f9b079aeafe
registry.redhat.io/odf4/odf-csi-addons-sidecar-rhel9@sha256:d0ca282694892d6caf025a35a593a3633785d2a40f4f8984e7f94a6906bb4236
registry.redhat.io/openshift4/ose-csi-external-attacher@sha256:bce20ed64dbee694666b75a96fd505223e8eed193d5cd40a607d871d0cc8b9c0
registry.redhat.io/openshift4/ose-csi-external-provisioner@sha256:2da32b524163a1e046bdde7750fe71a2f1175e509357db3cd1300ef849f4f0b6
registry.redhat.io/openshift4/ose-csi-external-resizer@sha256:927629fd0731988d52d5bb1094b650bc5def609bacb406dac5e60905e4c9ca26
registry.redhat.io/openshift4/ose-csi-external-snapshotter@sha256:965111171af569965e07b724eb93ea77077c6272023c02d0f1aa80ebcdef48fa
registry.redhat.io/openshift4/ose-csi-node-driver-registrar@sha256:b7eacc160fcce0881a00be2eb8d050a66b6cf68bcac2ef9da72d7c0297f77c0f

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Yes, the provisioner pods are in CrashLoopBackOff state.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:

Steps to Reproduce:

1. Deploy OCP 4.16 and ODF 4.16
2. Enable the VolumeGroupSnapshot feature gate in OCP
3. Observe the pod status
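Step 2 is typically done through the cluster-scoped FeatureGate CR, roughly as in the sketch below. The gate name VolumeGroupSnapshot is an assumption for this scenario; note that setting CustomNoUpgrade (or TechPreviewNoUpgrade) cannot be reverted on a cluster.

```yaml
# Hypothetical sketch of enabling the feature gate via the OCP FeatureGate CR.
# The gate name "VolumeGroupSnapshot" is an assumption, not taken from this report.
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: cluster
spec:
  featureSet: CustomNoUpgrade
  customNoUpgrade:
    enabled:
      - VolumeGroupSnapshot
```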


Actual results:
csi-cephfsplugin-provisioner and csi-rbdplugin-provisioner pods stuck in CrashLoopBackOff state

Expected results:
Pods should be in Running state

Additional info:

$ oc get csv -n openshift-storage
NAME                                        DISPLAY                            VERSION            REPLACES   PHASE
mcg-operator.v4.16.0-90.stable              NooBaa Operator                    4.16.0-90.stable              Succeeded
ocs-client-operator.v4.16.0-90.stable       OpenShift Data Foundation Client   4.16.0-90.stable              Succeeded
ocs-operator.v4.16.0-90.stable              OpenShift Container Storage        4.16.0-90.stable              Succeeded
odf-csi-addons-operator.v4.16.0-90.stable   CSI Addons                         4.16.0-90.stable              Succeeded
odf-operator.v4.16.0-90.stable              OpenShift Data Foundation          4.16.0-90.stable              Succeeded
odf-prometheus-operator.v4.16.0-90.stable   Prometheus Operator                4.16.0-90.stable              Succeeded
rook-ceph-operator.v4.16.0-90.stable        Rook-Ceph                          4.16.0-90.stable              Succeeded

Comment 12 errata-xmlrpc 2024-07-17 13:22:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591

