Description of problem (please be as detailed as possible and provide log snippets):

Version of all relevant components (if applicable): 4.11.6

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)? yes

Is there any workaround available to the best of your knowledge? no

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)? 1

Can this issue be reproduced? yes

Can this issue be reproduced from the UI? yes

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install odf-operator via the UI
2. Create a StorageSystem from Installed Operators -> ODF
3. Check the output of oc get pods -n openshift-storage

Actual results:

pod/rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6f44b94kqfx4   1/2   CrashLoopBackOff   49 (59s ago)   4h19m

$ oc describe pod/rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6f44b94kqfx4
...
Events:
  Type     Reason     Age                      From     Message
  ----     ------     ----                     ----     -------
  Normal   Pulled     32m (x45 over 4h22m)     kubelet  Container image "registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:7892e9da0a70b2d7e3efd98d2cb980e485f07eddff6a0dac6d6bd6c516914f3c" already present on machine
  Warning  Unhealthy  7m16s (x878 over 4h22m)  kubelet  Startup probe failed: dial tcp 10.128.2.34:8080: connect: connection refused
  Warning  BackOff    2m19s (x528 over 3h58m)  kubelet  Back-off restarting failed container

Expected results:

rgw pod should be in Running state.

Additional info:
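One thing that may help narrow this down is the exact startup probe the kubelet is running against the rgw container; a quick way to dump it (a sketch, assuming the container is named "rgw" as in the logs below and the deployment matches the failing pod):

$ oc -n openshift-storage get deployment rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a \
    -o jsonpath='{.spec.template.spec.containers[?(@.name=="rgw")].startupProbe}{"\n"}'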
I found a similar bug on vSphere: https://bugzilla.redhat.com/show_bug.cgi?id=2000133

I tried to apply what was suggested there:

oc edit cm rook-config-override -n openshift-storage
...
[global]
rbd_mirror_die_after_seconds = 3600
bdev_flock_retry = 20
mon_osd_full_ratio = .85
mon_osd_backfillfull_ratio = .8
mon_osd_nearfull_ratio = .75
mon_max_pg_per_osd = 600
mon_pg_warn_max_object_skew = 0
mon_data_avail_warn = 15
[osd]
osd_memory_target_cgroup_limit_ratio = 0.8
...

I tried adding, under the [osd] stanza:

[client.rgw.ocs.storagecluster.cephobjectstore.a]
debug rgw = 20/20

and then deleting the pod. But I see that the ConfigMap has been reverted to its original value...

What should I do to apply the debug settings?
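One route that may avoid fighting the operator's reconciliation of the ConfigMap is setting the option at runtime through the Ceph toolbox; a sketch, assuming the default OCSInitialization resource name ocsinit and the standard rook-ceph-tools deployment:

# Enable the rook-ceph toolbox pod
$ oc patch OCSInitialization ocsinit -n openshift-storage --type json \
    --patch '[{"op": "replace", "path": "/spec/enableCephTools", "value": true}]'

# From the toolbox, raise the RGW debug level in the mon config store
$ oc -n openshift-storage rsh deploy/rook-ceph-tools
sh-4.4$ ceph config set client.rgw.ocs.storagecluster.cephobjectstore.a debug_rgw 20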
Created the storage system from the web console. I pre-created a NAD to test Multus:

$ cat odf_nad.yaml
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: ocs-public-cluster
  namespace: openshift-storage
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "enp5s0f0",
      "mode": "bridge",
      "ipam": {
        "type": "whereabouts",
        "range": "172.26.0.0/24"
      }
    }'

$ oc create -f odf_nad.yaml
networkattachmentdefinition.k8s.cni.cncf.io/ocs-public-cluster created
$

In the wizard I created the local volume set odfvolumeset, choosing only 3 of the existing 4 nodes. I applied a filter from 3000 to 4000 GB for disk size, and disks only (not partitions), so that on each of the 3 nodes there are 2 NVMe disks of 3.7 TB: /dev/nvme0n1 and /dev/nvme2n1. All 3 nodes have the same device names. No encryption.

In the network step I selected Multus and the NAD created above for the public network interface, leaving the cluster network interface empty so that it uses the same NAD for both.
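As a side check of the NAD itself, independently of ODF, a throwaway pod can be attached to it via the Multus annotation; a minimal sketch (the pod name and image are placeholders):

$ cat nad_test_pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nad-test                        # placeholder name
  namespace: openshift-storage
  annotations:
    k8s.v1.cni.cncf.io/networks: ocs-public-cluster
spec:
  containers:
  - name: sleeper
    image: registry.access.redhat.com/ubi8/ubi-minimal   # any small image works
    command: ["sleep", "3600"]

$ oc create -f nad_test_pod.yaml
# Multus records the attached interfaces and the whereabouts-assigned IP here:
$ oc -n openshift-storage get pod nad-test \
    -o jsonpath='{.metadata.annotations.k8s\.v1\.cni\.cncf\.io/network-status}{"\n"}'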
$ oc get pv
NAME                CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS   REASON   AGE
local-pv-136455f6   3726Gi     RWO            Delete           Available           odfvolumeset            14m
local-pv-215abc3f   3726Gi     RWO            Delete           Available           odfvolumeset            14m
local-pv-98e76659   3726Gi     RWO            Delete           Available           odfvolumeset            14m
local-pv-cdbc060c   3726Gi     RWO            Delete           Available           odfvolumeset            14m
local-pv-e51a1253   3726Gi     RWO            Delete           Available           odfvolumeset            14m
local-pv-edd2590a   3726Gi     RWO            Delete           Available           odfvolumeset            14m

$ oc get pvc
NAME                                       STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ocs-deviceset-odfvolumeset-0-data-0ztfpl   Bound    local-pv-e51a1253   3726Gi     RWO            odfvolumeset   93s
ocs-deviceset-odfvolumeset-0-data-1lqlsl   Bound    local-pv-136455f6   3726Gi     RWO            odfvolumeset   93s
ocs-deviceset-odfvolumeset-0-data-2hl4r4   Bound    local-pv-98e76659   3726Gi     RWO            odfvolumeset   93s
ocs-deviceset-odfvolumeset-0-data-37v6x7   Bound    local-pv-215abc3f   3726Gi     RWO            odfvolumeset   93s
ocs-deviceset-odfvolumeset-0-data-4hxjrq   Bound    local-pv-cdbc060c   3726Gi     RWO            odfvolumeset   93s
ocs-deviceset-odfvolumeset-0-data-5h9tg9   Bound    local-pv-edd2590a   3726Gi     RWO            odfvolumeset   93s
$

$ oc get sc
NAME                          PROVISIONER                             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
ocs-storagecluster-ceph-rgw   openshift-storage.ceph.rook.io/bucket   Delete          Immediate              false                  4m15s
ocs-storagecluster-cephfs     openshift-storage.cephfs.csi.ceph.com   Delete          Immediate              true                   66s
odfvolumeset                  kubernetes.io/no-provisioner            Delete          WaitForFirstConsumer   false                  22m
$
After about 4 hours of waiting:

$ oc get all -n openshift-storage
NAME                                                                  READY   STATUS             RESTARTS       AGE
pod/csi-addons-controller-manager-787797589d-mwqwg                    2/2     Running            0              4h57m
pod/csi-cephfsplugin-holder-ocs-storagecluster-cephcluster-6dvln      1/1     Running            0              4h22m
pod/csi-cephfsplugin-holder-ocs-storagecluster-cephcluster-9gvfh      1/1     Running            0              4h22m
pod/csi-cephfsplugin-holder-ocs-storagecluster-cephcluster-b9gcm      1/1     Running            0              4h22m
pod/csi-cephfsplugin-holder-ocs-storagecluster-cephcluster-rhchw      1/1     Running            0              4h22m
pod/csi-cephfsplugin-lzgcf                                            3/3     Running            0              4h22m
pod/csi-cephfsplugin-provisioner-5ff844654c-p6cpn                     6/6     Running            0              4h22m
pod/csi-cephfsplugin-provisioner-5ff844654c-tzpdk                     6/6     Running            0              4h22m
pod/csi-cephfsplugin-txpwf                                            3/3     Running            0              4h22m
pod/csi-cephfsplugin-wkn7h                                            3/3     Running            0              4h22m
pod/csi-cephfsplugin-xqlqr                                            3/3     Running            0              4h22m
pod/csi-rbdplugin-5x5tc                                               4/4     Running            0              4h22m
pod/csi-rbdplugin-7t9zk                                               4/4     Running            0              4h22m
pod/csi-rbdplugin-9ncc7                                               4/4     Running            0              4h22m
pod/csi-rbdplugin-holder-ocs-storagecluster-cephcluster-59str         1/1     Running            0              4h22m
pod/csi-rbdplugin-holder-ocs-storagecluster-cephcluster-btxtr         1/1     Running            0              4h22m
pod/csi-rbdplugin-holder-ocs-storagecluster-cephcluster-c2cnc         1/1     Running            0              4h22m
pod/csi-rbdplugin-holder-ocs-storagecluster-cephcluster-fctnq         1/1     Running            0              4h22m
pod/csi-rbdplugin-provisioner-6bb79b864-vlphv                         7/7     Running            0              4h22m
pod/csi-rbdplugin-provisioner-6bb79b864-xdn5n                         7/7     Running            0              4h22m
pod/csi-rbdplugin-wjsqq                                               4/4     Running            0              4h22m
pod/noobaa-operator-7555d4c459-t78q8                                  1/1     Running            0              4h57m
pod/ocs-metrics-exporter-5564bc6f89-rz7db                             1/1     Running            0              4h56m
pod/ocs-operator-79d665749b-8ghgv                                     1/1     Running            0              4h57m
pod/odf-console-7c8f9bd66c-86fq4                                      1/1     Running            0              4h57m
pod/odf-operator-controller-manager-97c969b-w569r                     2/2     Running            0              4h57m
pod/rook-ceph-crashcollector-ocp-worker01.ocp.seeweb.local-7774lwsk   1/1     Running            0              4h20m
pod/rook-ceph-crashcollector-ocp-worker02.ocp.seeweb.local-86495vlw   1/1     Running            0              4h20m
pod/rook-ceph-crashcollector-ocp-worker03.ocp.seeweb.local-74ctt757   1/1     Running            0              4h21m
pod/rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6b549f57vl8l7   2/2     Running            0              4h20m
pod/rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-67c7c88cccckj   2/2     Running            0              4h20m
pod/rook-ceph-mgr-a-67889bc8c6-gtxgc                                  3/3     Running            0              4h21m
pod/rook-ceph-mon-a-8498f7978f-ztrwj                                  2/2     Running            0              4h22m
pod/rook-ceph-mon-b-757b65949-rn79v                                   2/2     Running            0              4h21m
pod/rook-ceph-mon-c-8556ccd9b6-rn84s                                  2/2     Running            0              4h21m
pod/rook-ceph-operator-84cb4b77b4-8p6t5                               1/1     Running            0              4h57m
pod/rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6f44b94kqfx4   1/2     CrashLoopBackOff   49 (59s ago)   4h19m

NAME                                                       TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
service/csi-addons-controller-manager-metrics-service      ClusterIP   172.30.74.249    <none>        8443/TCP            4h57m
service/csi-cephfsplugin-metrics                           ClusterIP   172.30.107.185   <none>        8080/TCP,8081/TCP   4h22m
service/csi-rbdplugin-metrics                              ClusterIP   172.30.108.77    <none>        8080/TCP,8081/TCP   4h22m
service/noobaa-operator-service                            ClusterIP   172.30.234.19    <none>        443/TCP             4h57m
service/odf-console-service                                ClusterIP   172.30.48.109    <none>        9001/TCP            4h57m
service/odf-operator-controller-manager-metrics-service    ClusterIP   172.30.140.58    <none>        8443/TCP            4h57m
service/rook-ceph-mgr                                      ClusterIP   172.30.48.13     <none>        9283/TCP            4h20m
service/rook-ceph-mon-a                                    ClusterIP   172.30.31.203    <none>        6789/TCP,3300/TCP   4h22m
service/rook-ceph-mon-b                                    ClusterIP   172.30.109.248   <none>        6789/TCP,3300/TCP   4h21m
service/rook-ceph-mon-c                                    ClusterIP   172.30.17.88     <none>        6789/TCP,3300/TCP   4h21m
service/rook-ceph-rgw-ocs-storagecluster-cephobjectstore   ClusterIP   172.30.125.199   <none>        80/TCP,443/TCP      4h20m

NAME                                                                     DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/csi-cephfsplugin                                          4         4         4       4            4           <none>          4h22m
daemonset.apps/csi-cephfsplugin-holder-ocs-storagecluster-cephcluster   4         4         4       4            4           <none>          4h22m
daemonset.apps/csi-rbdplugin                                             4         4         4       4            4           <none>          4h22m
daemonset.apps/csi-rbdplugin-holder-ocs-storagecluster-cephcluster       4         4         4       4            4           <none>          4h22m

NAME                                                                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/csi-addons-controller-manager                             1/1     1            1           4h57m
deployment.apps/csi-cephfsplugin-provisioner                              2/2     2            2           4h22m
deployment.apps/csi-rbdplugin-provisioner                                 2/2     2            2           4h22m
deployment.apps/noobaa-operator                                           1/1     1            1           4h57m
deployment.apps/ocs-metrics-exporter                                      1/1     1            1           4h57m
deployment.apps/ocs-operator                                              1/1     1            1           4h57m
deployment.apps/odf-console                                               1/1     1            1           4h57m
deployment.apps/odf-operator-controller-manager                           1/1     1            1           4h57m
deployment.apps/rook-ceph-crashcollector-ocp-worker01.ocp.seeweb.local    1/1     1            1           4h20m
deployment.apps/rook-ceph-crashcollector-ocp-worker02.ocp.seeweb.local    1/1     1            1           4h20m
deployment.apps/rook-ceph-crashcollector-ocp-worker03.ocp.seeweb.local    1/1     1            1           4h21m
deployment.apps/rook-ceph-mds-ocs-storagecluster-cephfilesystem-a         1/1     1            1           4h20m
deployment.apps/rook-ceph-mds-ocs-storagecluster-cephfilesystem-b         1/1     1            1           4h20m
deployment.apps/rook-ceph-mgr-a                                           1/1     1            1           4h21m
deployment.apps/rook-ceph-mon-a                                           1/1     1            1           4h22m
deployment.apps/rook-ceph-mon-b                                           1/1     1            1           4h21m
deployment.apps/rook-ceph-mon-c                                           1/1     1            1           4h21m
deployment.apps/rook-ceph-operator                                        1/1     1            1           4h57m
deployment.apps/rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a        0/1     1            0           4h19m

NAME                                                                                 DESIRED   CURRENT   READY   AGE
replicaset.apps/csi-addons-controller-manager-54fc64cd74                             0         0         0       4h57m
replicaset.apps/csi-addons-controller-manager-787797589d                             1         1         1       4h57m
replicaset.apps/csi-cephfsplugin-provisioner-5ff844654c                              2         2         2       4h22m
replicaset.apps/csi-rbdplugin-provisioner-6bb79b864                                  2         2         2       4h22m
replicaset.apps/noobaa-operator-7555d4c459                                           1         1         1       4h57m
replicaset.apps/ocs-metrics-exporter-5564bc6f89                                      1         1         1       4h57m
replicaset.apps/ocs-operator-79d665749b                                              1         1         1       4h57m
replicaset.apps/odf-console-7c8f9bd66c                                               1         1         1       4h57m
replicaset.apps/odf-operator-controller-manager-97c969b                              1         1         1       4h57m
replicaset.apps/rook-ceph-crashcollector-ocp-worker01.ocp.seeweb.local-7684d47fcf    0         0         0       4h20m
replicaset.apps/rook-ceph-crashcollector-ocp-worker01.ocp.seeweb.local-777c9f7665    1         1         1       4h20m
replicaset.apps/rook-ceph-crashcollector-ocp-worker02.ocp.seeweb.local-8647595664    1         1         1       4h20m
replicaset.apps/rook-ceph-crashcollector-ocp-worker03.ocp.seeweb.local-74c777d6c9    1         1         1       4h21m
replicaset.apps/rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6b549f5798         1         1         1       4h20m
replicaset.apps/rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-67c7c88cd9         1         1         1       4h20m
replicaset.apps/rook-ceph-mgr-a-67889bc8c6                                           1         1         1       4h21m
replicaset.apps/rook-ceph-mon-a-8498f7978f                                           1         1         1       4h22m
replicaset.apps/rook-ceph-mon-b-757b65949                                            1         1         1       4h21m
replicaset.apps/rook-ceph-mon-c-8556ccd9b6                                           1         1         1       4h21m
replicaset.apps/rook-ceph-operator-84cb4b77b4                                        1         1         1       4h57m
replicaset.apps/rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6f44b94c7f        1         1         0       4h19m

NAME                                                               COMPLETIONS   DURATION   AGE
job.batch/rook-ceph-osd-prepare-3a2b9850812d66522429328d24904be9   0/1           4h14m      4h14m
job.batch/rook-ceph-osd-prepare-4e3613b54a20eb44a5c01256fbd4c6fe   0/1           4h14m      4h14m
job.batch/rook-ceph-osd-prepare-5acab5b2369ef4cedd3603dc14a17a47   0/1           4h14m      4h14m
job.batch/rook-ceph-osd-prepare-696df5af89adbd60b61c1aca4fee4b06   0/1           4h13m      4h13m
job.batch/rook-ceph-osd-prepare-c795143f209f83cf7fb71fe1a40df448   0/1           4h14m      4h14m
job.batch/rook-ceph-osd-prepare-e9f6ccb372fa498d35f6ec3f0f223bc1   0/1           4h13m      4h13m

NAME                                                          HOST/PORT                                                                     PATH   SERVICES                                           PORT    TERMINATION   WILDCARD
route.route.openshift.io/ocs-storagecluster-cephobjectstore   ocs-storagecluster-cephobjectstore-openshift-storage.apps.ocp.seeweb.local          rook-ceph-rgw-ocs-storagecluster-cephobjectstore   <all>                 None
$
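All six rook-ceph-osd-prepare jobs above show 0/1 completions and no OSD pods appear in the pod list, so the prepare jobs' logs are a natural place to look; a sketch, assuming Rook's usual app label is present on those pods:

$ oc -n openshift-storage get pods -l app=rook-ceph-osd-prepare
$ oc -n openshift-storage logs -l app=rook-ceph-osd-prepare --tail=100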
The cluster was created on 4.11.12 and then updated to 4.11.16 before installing the Local Storage Operator and the ODF Operator.
For installation I used the any-platform approach with PXE. The 3 master nodes (not schedulable) are vSphere VMs; the 4 worker nodes are bare metal. The network type is OVNKubernetes:

$ oc get network.config/cluster -o jsonpath='{.status.networkType}{"\n"}'
OVNKubernetes
Inside the logs of the ever-restarting pod I only see this:

$ oc logs -f rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6f44b942f9sd
Defaulted container "rgw" out of: rgw, log-collector, chown-container-data-dir (init)
debug 2022-12-06T16:17:27.263+0000 7fba8b4f95c0  0 deferred set uid:gid to 167:167 (ceph:ceph)
debug 2022-12-06T16:17:27.263+0000 7fba8b4f95c0  0 ceph version 16.2.8-84.el8cp (c2980f2fd700e979d41b4bad2939bb90f0fe435c) pacific (stable), process radosgw, pid 613
debug 2022-12-06T16:17:27.263+0000 7fba8b4f95c0  0 framework: beast
debug 2022-12-06T16:17:27.263+0000 7fba8b4f95c0  0 framework conf key: port, val: 8080
debug 2022-12-06T16:17:27.263+0000 7fba8b4f95c0  0 framework conf key: ssl_port, val: 443
debug 2022-12-06T16:17:27.263+0000 7fba8b4f95c0  0 framework conf key: ssl_certificate, val: /etc/ceph/private/rgw-cert.pem
debug 2022-12-06T16:17:27.263+0000 7fba8b4f95c0  0 framework conf key: ssl_private_key, val: /etc/ceph/private/rgw-key.pem
debug 2022-12-06T16:17:27.263+0000 7fba8b4f95c0  1 radosgw_Main not setting numa affinity
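Given that the startup probe dials the default pod-network address (10.128.2.34:8080) while the public network is Multus-backed, it may be worth checking which interfaces and IPs Multus actually attached to the RGW pod; a sketch (pod name taken from the log above):

$ oc -n openshift-storage get pod rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6f44b942f9sd \
    -o jsonpath='{.metadata.annotations.k8s\.v1\.cni\.cncf\.io/network-status}{"\n"}'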
Gianluca, could you please collect and attach a must-gather to assist in debugging?
Collection done:

$ oc adm must-gather
...
Reprinting Cluster State:
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
ClusterID: 01ca8e48-a73b-4f0b-8c19-a6a3be35b808
ClusterVersion: Stable at "4.11.16"
ClusterOperators:
	All healthy and stable

I'm going to attach the tar.gz archive.
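In case the default gather misses the Ceph/Rook internals, there is also the ODF-specific must-gather image; a sketch (the version tag is my assumption, matched to ODF 4.11):

$ oc adm must-gather --image=registry.redhat.io/odf4/ocs-must-gather-rhel8:v4.11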
The archive is too big to be added as an attachment (about 80 MB). You can download it here: https://drive.google.com/file/d/19MxD5OuURAdcSHKjr_nvAIcoYg2qPMro/view?usp=sharing
Any comments on the must-gather output?

Questions:
- Is the whereabouts CNI plugin, used in the NAD definition, pre-configured in a standard OCP install, or am I supposed to pre-configure anything for it? (A quick check is sketched below.)
- Are there any Multus pre-configurations I need to do before creating the ODF StorageSystem, or is everything needed supposed to be already in place in a standard OCP install?
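For the first question: as far as I know, Multus and the whereabouts IPAM plugin are shipped and managed by the cluster network operator in a standard OCP install, so nothing should need pre-configuring; a sketch of a quick check (the IPPool resource location is my assumption):

$ oc -n openshift-multus get daemonsets
# whereabouts tracks its reservations in IPPool objects; after a pod
# attaches to the NAD there should be one covering 172.26.0.0/24
$ oc -n openshift-multus get ippools.whereabouts.cni.cncf.io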
An update: I recreated the same environment with the same hardware components. Differences:

- installed 4.11.12
- updated to 4.11.18
- updated to 4.11.20

Then, before installing ODF, I installed the nmstate operator.

Then I created the NetworkAttachmentDefinition in much the same way as before (only adding the "name" param in the config section and using the other 10 Gbit NIC on the systems):

apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: odf-enp5s0f21-27subnet-whereabouts
  namespace: openshift-storage
spec:
  config: '{
      "cniVersion": "0.3.1",
      "name": "macvlan-27-net",
      "type": "macvlan",
      "master": "enp5s0f1",
      "mode": "bridge",
      "ipam": {
        "type": "whereabouts",
        "range": "172.27.0.0/24"
      }
    }'

This time the installation from the web console, done with the same steps as before, worked OK, with the Ceph cluster components all in the correct state.

It would be interesting to know whether the OpenShift version made the difference (4.11.20 vs 4.11.16 before), or the installation of the nmstate operator before ODF, or anything I missed or did wrong in the first attempt... Any insight from the logs I sent?
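To compare the two attempts, it may also be useful to confirm how the wizard's Multus selection landed in the resulting CephCluster network spec; a sketch (field names per the Rook CephCluster API; the expected output is my assumption):

$ oc -n openshift-storage get cephcluster -o jsonpath='{.items[0].spec.network}{"\n"}'
# Expected output along the lines of:
# {"provider":"multus","selectors":{"public":"openshift-storage/odf-enp5s0f21-27subnet-whereabouts"}}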
I'm not seeing anything in the gathered logs that suggests why the startup probe for the RGW was failing. Since we can't reproduce this now, I'm not sure what else to look into. I'll close this for now, but please reopen if the issue reproduces.