Description of problem (please be as detailed as possible and provide log snippets):

The 'csi-cephfsplugin' pods are stuck in CrashLoopBackOff in a freshly deployed OCS 4.6 cluster with OVNKubernetes:
-------
# oc get pods -n openshift-storage
NAME                                            READY   STATUS             RESTARTS   AGE
csi-cephfsplugin-b7pzg                          2/3     CrashLoopBackOff   6          9m4s
csi-cephfsplugin-jwpk4                          2/3     CrashLoopBackOff   6          9m4s
csi-cephfsplugin-kpqg9                          2/3     CrashLoopBackOff   6          9m4s
csi-cephfsplugin-provisioner-77c87bd6b6-hcr8t   6/6     Running            0          9m4s
csi-cephfsplugin-provisioner-77c87bd6b6-mcnr5   6/6     Running            0          9m4s
csi-cephfsplugin-s9fmq                          2/3     CrashLoopBackOff   6          9m4s
csi-rbdplugin-7qxc4                             3/3     Running            0          9m5s
csi-rbdplugin-d9chb                             3/3     Running            0          9m5s
csi-rbdplugin-g7rm6                             3/3     Running            0          9m5s
csi-rbdplugin-provisioner-79cffcd6df-5hcfb      6/6     Running            0          9m4s
csi-rbdplugin-provisioner-79cffcd6df-887bw      6/6     Running            0          9m5s
csi-rbdplugin-v4mc9                             3/3     Running            0          9m5s
noobaa-operator-6d6b46745d-hpxxs                1/1     Running            0          110m
ocs-metrics-exporter-5557759dd8-xqkn6           1/1     Running            0          110m
ocs-operator-d67758886-xlkrs                    0/1     Running            0          115m
rook-ceph-detect-version-frdmk                  0/1     ImagePullBackOff   0          9m8s
rook-ceph-operator-7b8d579586-9lwm9             1/1     Running            0          110m

# oc logs csi-cephfsplugin-ktp8q
error: a container name must be specified for pod csi-cephfsplugin-ktp8q, choose one of: [driver-registrar csi-cephfsplugin liveness-prometheus]

[root@nuc-pod2 ocp4-offline-operator-mirror]# oc logs csi-cephfsplugin-ktp8q csi-cephfsplugin
I0202 01:14:03.722918       1 cephcsi.go:124] Driver version: release-4.6 and Git version: 49cf5efdd5663986c1c23c66163576bf77fdccb4
I0202 01:14:03.723103       1 cephcsi.go:142] Initial PID limit is set to 1024
I0202 01:14:03.723159       1 cephcsi.go:151] Reconfigured PID limit to -1 (max)
I0202 01:14:03.723170       1 cephcsi.go:170] Starting driver type: cephfs with name: openshift-storage.cephfs.csi.ceph.com
I0202 01:14:03.736617       1 volumemounter.go:87] loaded mounter: kernel
E0202 01:14:03.736640       1 volumemounter.go:96] failed to run ceph-fuse exec: "ceph-fuse": executable file not found in $PATH
W0202 01:14:03.736856       1 driver.go:157] EnableGRPCMetrics is deprecated
F0202 01:14:03.737092       1 httpserver.go:24] failed to listen on address 172.16.20.22:9091: listen tcp 172.16.20.22:9091: bind: address already in use
goroutine 9 [running]:
-------

On the master0 system in my lab, port 9091 is already in use by kube-rbac-proxy:

# ps -ef | grep 165271
nfsnobo+  165271  165258  0 00:33 ?  00:00:00 /usr/bin/kube-rbac-proxy --logtostderr --secure-listen-address=:8443 --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256 --upstream=http://127.0.0.1:9091/ --tls-private-key-file=/etc/metrics/tls.key --tls-cert-file=/etc/metrics/tls.crt

Version of all relevant components (if applicable):
OCS 4.6.2 (bare metal)

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?
Yes. The rook-ceph operator ConfigMap can be modified to change the CephFS GRPC metrics port to something other than 9091, which is already in use by kube-rbac-proxy (a sketch is included under Additional info below). To my knowledge, there is nothing in the documentation or KB articles addressing this.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Unsure

Steps to Reproduce:
1. Install an OpenShift 4.6.13 cluster with OVNKubernetes as the network SDN
2. Install OCS 4.6.2 from the CLI
3. Observe the failure messages

Actual results:
The csi-cephfsplugin pods go into CrashLoopBackOff because the CephFS plugin cannot bind its GRPC metrics endpoint on port 9091.

Expected results:
No CrashLoopBackOff error messages; all OCS pods deploy successfully.

Additional info:
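For completeness, here is a sketch of the workaround described above. It assumes the rook-ceph-operator-config ConfigMap in the openshift-storage namespace honors the upstream Rook CSI port overrides; the replacement port 9092 is just an arbitrary free port I picked, so verify both against the deployed version before relying on this:

# oc -n openshift-storage patch configmap rook-ceph-operator-config \
    --type merge -p '{"data":{"CSI_CEPHFS_GRPC_METRICS_PORT":"9092"}}'
# oc -n openshift-storage delete pod -l app=rook-ceph-operator

The first command moves the CephFS plugin's GRPC metrics endpoint off 9091 (the port kube-rbac-proxy already holds); the second restarts the operator so the CSI daemonsets are re-rendered with the new port and the csi-cephfsplugin pods can come up.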
We cannot be sure which ports are already in use on a given node, so changing the default ports in Rook would just reproduce the same conflict in a later deployment. Moving this to documentation so that we can either create a KB article or document the list of reserved/default ports. Adding needinfo on Madhu to confirm the port list (a provisional list from the upstream defaults is below) and on Bipin for the KB article.
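For the doc/KB, the Ceph-CSI metrics port defaults I'm aware of from the upstream Rook operator config are listed below. These are my notes rather than an authoritative list, so please verify them against the shipped rook-ceph-operator-config before publishing:

CSI_RBD_GRPC_METRICS_PORT=9090        # rbd plugin GRPC metrics
CSI_CEPHFS_GRPC_METRICS_PORT=9091     # cephfs plugin GRPC metrics (the conflict in this bug)
CSI_RBD_LIVENESS_METRICS_PORT=9080    # rbd plugin liveness metrics
CSI_CEPHFS_LIVENESS_METRICS_PORT=9081 # cephfs plugin liveness metrics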
How do we plan on addressing this when OVNKubernetes becomes the default SDN installed in OpenShift clusters?
(In reply to Dave Cain from comment #3)
> How do we plan on addressing this when OVNKubernetes becomes the default SDN
> installed in OpenShift clusters?

What is the timeline for that? We can plan a fix based on that.
Sure, moving it back. Let's discuss a solution within the 4.8 time frame.
@Madhu This is already done, right?
This is already in via the 4.8 resync to the release-4.8 branch.
LGTM, thanks
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.8.0 container images bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3003
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days