Bug 1923819 - csi-cephfsplugin pods CrashLoopBackoff in fresh 4.6 cluster due to conflict with kube-rbac-proxy
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.6
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.8.0
Assignee: Madhu Rajanna
QA Contact: akarsha
URL:
Whiteboard:
Depends On:
Blocks: 1937245 1937266
Reported: 2021-02-02 01:33 UTC by Dave Cain
Modified: 2023-09-15 01:00 UTC
CC List: 19 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
.gRPC metrics are disabled by default
Earlier, the `cephcsi` pods exposed gRPC (remote procedure call) metrics for debugging purposes. The `cephcsi` node plugin pods used host port 9091 for CephFS and host port 9090 for RBD on the node where they were running. If another process on the node was already using one of these ports, the `cephcsi` pods failed to come up. With this update, gRPC metrics are disabled by default, and the `cephcsi` pods no longer use ports 9091 and 9090 on the nodes where the node plugin pods are running.
Clone Of:
Clones: 1937245
Environment:
Last Closed: 2021-08-03 18:15:14 UTC
Embargoed:


Links
System                  ID              Private  Priority  Status  Summary  Last Updated
Red Hat Product Errata  RHBA-2021:3003  0        None      None    None     2021-08-03 18:15:58 UTC

Description Dave Cain 2021-02-02 01:33:23 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

The 'csi-cephfsplugin' pods are stuck in CrashLoopBackOff in a freshly deployed OCS 4.6 cluster with OVNKubernetes:

-------

# oc get pods -n openshift-storage
NAME                                            READY   STATUS             RESTARTS   AGE
csi-cephfsplugin-b7pzg                          2/3     CrashLoopBackOff   6          9m4s
csi-cephfsplugin-jwpk4                          2/3     CrashLoopBackOff   6          9m4s
csi-cephfsplugin-kpqg9                          2/3     CrashLoopBackOff   6          9m4s
csi-cephfsplugin-provisioner-77c87bd6b6-hcr8t   6/6     Running            0          9m4s
csi-cephfsplugin-provisioner-77c87bd6b6-mcnr5   6/6     Running            0          9m4s
csi-cephfsplugin-s9fmq                          2/3     CrashLoopBackOff   6          9m4s
csi-rbdplugin-7qxc4                             3/3     Running            0          9m5s
csi-rbdplugin-d9chb                             3/3     Running            0          9m5s
csi-rbdplugin-g7rm6                             3/3     Running            0          9m5s
csi-rbdplugin-provisioner-79cffcd6df-5hcfb      6/6     Running            0          9m4s
csi-rbdplugin-provisioner-79cffcd6df-887bw      6/6     Running            0          9m5s
csi-rbdplugin-v4mc9                             3/3     Running            0          9m5s
noobaa-operator-6d6b46745d-hpxxs                1/1     Running            0          110m
ocs-metrics-exporter-5557759dd8-xqkn6           1/1     Running            0          110m
ocs-operator-d67758886-xlkrs                    0/1     Running            0          115m
rook-ceph-detect-version-frdmk                  0/1     ImagePullBackOff   0          9m8s
rook-ceph-operator-7b8d579586-9lwm9             1/1     Running            0          110m

# oc logs csi-cephfsplugin-ktp8q
error: a container name must be specified for pod csi-cephfsplugin-ktp8q, choose one of: [driver-registrar csi-cephfsplugin liveness-prometheus]
[root@nuc-pod2 ocp4-offline-operator-mirror]# oc logs csi-cephfsplugin-ktp8q csi-cephfsplugin
I0202 01:14:03.722918       1 cephcsi.go:124] Driver version: release-4.6 and Git version: 49cf5efdd5663986c1c23c66163576bf77fdccb4
I0202 01:14:03.723103       1 cephcsi.go:142] Initial PID limit is set to 1024
I0202 01:14:03.723159       1 cephcsi.go:151] Reconfigured PID limit to -1 (max)
I0202 01:14:03.723170       1 cephcsi.go:170] Starting driver type: cephfs with name: openshift-storage.cephfs.csi.ceph.com
I0202 01:14:03.736617       1 volumemounter.go:87] loaded mounter: kernel
E0202 01:14:03.736640       1 volumemounter.go:96] failed to run ceph-fuse exec: "ceph-fuse": executable file not found in $PATH
W0202 01:14:03.736856       1 driver.go:157] EnableGRPCMetrics is deprecated
F0202 01:14:03.737092       1 httpserver.go:24] failed to listen on address 172.16.20.22:9091: listen tcp 172.16.20.22:9091: bind: address already in use
goroutine 9 [running]:
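
The fatal line in the log above is an ordinary listen-socket conflict: a second listener cannot bind an address and port that another process already holds. A minimal Python sketch of that failure mode follows. It is not the cephcsi code (which is written in Go); the helper names `try_listen` and `bind_conflicts` are hypothetical, and it uses a kernel-assigned loopback port rather than 9091 so that it can run on any machine.

```python
import errno
import socket


def try_listen(host: str, port: int) -> socket.socket:
    """Bind and listen on host:port, returning the open socket."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen()
    except OSError:
        sock.close()
        raise
    return sock


def bind_conflicts(host: str = "127.0.0.1") -> bool:
    """Return True if a second listener on an occupied port fails
    with EADDRINUSE ("address already in use")."""
    # Port 0 asks the kernel for any free port; this listener stands in
    # for whatever process already owns the port on the node.
    first = try_listen(host, 0)
    port = first.getsockname()[1]
    try:
        second = try_listen(host, port)
    except OSError as exc:
        return exc.errno == errno.EADDRINUSE
    else:
        second.close()
        return False
    finally:
        first.close()
```

Calling `bind_conflicts()` exercises the same kind of bind error that the plugin reports for 172.16.20.22:9091.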

Master0 system in my lab, where port 9091 is already in use:

# ps -ef | grep 165271         
nfsnobo+  165271  165258  0 00:33 ?        00:00:00 /usr/bin/kube-rbac-proxy --logtostderr --secure-listen-address=:8443 --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256 --upstream=http://127.0.0.1:9091/ --tls-private-key-file=/etc/metrics/tls.key --tls-cert-file=/etc/metrics/tls.crt


Version of all relevant components (if applicable):
OCS 4.6.2
Baremetal


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes


Is there any workaround available to the best of your knowledge?
The ConfigMap can be modified to change the gRPC metrics port to something other than 9091, which is already used by kube-rbac-proxy. To my knowledge, nothing in the documentation or KB articles addresses this.
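
Before settling on a replacement port for this workaround, it may help to confirm that the candidate is actually free on each node, since the node plugin binds a port on the host. A rough sketch, assuming Python is available on the node: `is_port_free` and `first_free_port` are hypothetical helper names, and the ConfigMap key that takes the chosen value is deliberately not shown because it depends on the Rook version in use.

```python
import socket
from typing import Iterable, Optional


def is_port_free(host: str, port: int) -> bool:
    """Return True if a TCP socket can bind host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            # Use the same address the plugin would bind; a listener on
            # the wildcard address also blocks a bind to a specific IP.
            sock.bind((host, port))
        except OSError:
            return False
    return True


def first_free_port(host: str, candidates: Iterable[int]) -> Optional[int]:
    """Return the first candidate port that can be bound, or None."""
    for port in candidates:
        if is_port_free(host, port):
            return port
    return None
```

For example, `first_free_port("0.0.0.0", [9091, 9191, 9291])` would skip 9091 on a node like the one above; the alternative ports in that list are placeholders only. The check is a point-in-time snapshot, so a port that is free now can still be claimed later by another component.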


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2


Is this issue reproducible?
Yes


Can this issue be reproduced from the UI?
Unsure


Steps to Reproduce:
1. Install an OpenShift 4.6.13 cluster with OVNKubernetes as the SDN
2. Install OCS 4.6.2 from the CLI
3. Observe failure messages


Actual results:


Expected results:
No CrashLoopBackOff error messages, all OCS pods deploy successfully.


Additional info:

Comment 2 Mudit Agarwal 2021-02-02 12:43:46 UTC
We cannot be sure which ports are already in use, so changing the default ports in Rook could lead to the same situation in a later deployment.
Moving it to documentation so that we can either create a KB article or document the list of reserved/default ports.

Adding needinfo on Madhu to provide the port list and on Bipin for the KB.

Comment 3 Dave Cain 2021-02-02 12:51:35 UTC
How do we plan on addressing this when OVNKubernetes becomes the default SDN installed in OpenShift clusters?

Comment 4 Mudit Agarwal 2021-02-02 14:00:55 UTC
(In reply to Dave Cain from comment #3)
> How do we plan on addressing this when OVNKubernetes becomes the default SDN
> installed in OpenShift clusters?

What is the timeline for that? We can plan a fix based on that.

Comment 7 Mudit Agarwal 2021-02-03 06:34:47 UTC
Sure, moving it back. Let's discuss a solution within the 4.8 time frame.

Comment 16 Travis Nielsen 2021-05-11 15:08:02 UTC
@Madhu This is already done, right?

Comment 18 Travis Nielsen 2021-05-17 15:53:06 UTC
This is already in the 4.8 resync to release-4.8.

Comment 24 Mudit Agarwal 2021-07-12 06:04:15 UTC
LGTM, thanks

Comment 26 errata-xmlrpc 2021-08-03 18:15:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.8.0 container images bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3003

Comment 27 Red Hat Bugzilla 2023-09-15 01:00:13 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

