Bug 2151295 - rgw pod stuck in CrashLoopBackOff while installing odf-operator via UI on baremetal cluster
Summary: rgw pod stuck in CrashLoopBackOff while installing odf-operator via UI on baremetal cluster
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.11
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Blaine Gardner
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-12-06 15:55 UTC by Gianluca Cecchi
Modified: 2023-08-09 17:03 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-01-17 15:31:20 UTC
Embargoed:



Description Gianluca Cecchi 2022-12-06 15:55:59 UTC
Description of problem (please be as detailed as possible and provide log snippets):


Version of all relevant components (if applicable):
4.11.6

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
yes

Is there any workaround available to the best of your knowledge?
no

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
yes

Can this issue be reproduced from the UI?
yes

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install odf-operator via UI
2. Create storagesystem from Installed Operators -> ODF
3. Check the output of oc get pods -n openshift-storage


Actual results:
pod/rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6f44b94kqfx4   1/2     CrashLoopBackOff   49 (59s ago)   4h19m

$ oc describe pod/rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6f44b94kqfx4
...
Events:
  Type     Reason     Age                      From     Message
  ----     ------     ----                     ----     -------
  Normal   Pulled     32m (x45 over 4h22m)     kubelet  Container image "registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:7892e9da0a70b2d7e3efd98d2cb980e485f07eddff6a0dac6d6bd6c516914f3c" already present on machine
  Warning  Unhealthy  7m16s (x878 over 4h22m)  kubelet  Startup probe failed: dial tcp 10.128.2.34:8080: connect: connection refused
  Warning  BackOff    2m19s (x528 over 3h58m)  kubelet  Back-off restarting failed container

Expected results:
rgw pod should be in Running state.

Additional info:

Comment 2 Gianluca Cecchi 2022-12-06 15:59:39 UTC
I found a similar bug on vSphere:
https://bugzilla.redhat.com/show_bug.cgi?id=2000133

I tried to apply what was suggested there:

oc edit cm rook-config-override -n openshift-storage

...

    [global]
    rbd_mirror_die_after_seconds = 3600
    bdev_flock_retry = 20
    mon_osd_full_ratio = .85
    mon_osd_backfillfull_ratio = .8
    mon_osd_nearfull_ratio = .75
    mon_max_pg_per_osd = 600
    mon_pg_warn_max_object_skew = 0
    mon_data_avail_warn = 15
    [osd]
    osd_memory_target_cgroup_limit_ratio = 0.8
...

I tried adding, under the [osd] stanza:

[client.rgw.ocs.storagecluster.cephobjectstore.a]
debug rgw = 20/20


and then deleting the pod.

But I see that the ConfigMap has been reverted to its original value.
What should I do to apply the debug settings?
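
One way to apply such a setting without editing the reconciled ConfigMap is to set it at runtime through the rook-ceph toolbox. A minimal sketch, assuming the toolbox has been enabled first; the OCSInitialization name "ocsinit" is the usual default and, like these exact steps, is an assumption not taken from this report. The daemon name is the one from the ConfigMap stanza above:

# enable the toolbox pod (assumption: default OCSInitialization object "ocsinit")
$ oc patch OCSInitialization ocsinit -n openshift-storage --type json \
    --patch '[{"op": "replace", "path": "/spec/enableCephTools", "value": true}]'
# raise RGW debug logging for the running daemon and verify it took effect
$ oc rsh -n openshift-storage deploy/rook-ceph-tools
sh-4.4$ ceph config set client.rgw.ocs.storagecluster.cephobjectstore.a debug_rgw 20/20
sh-4.4$ ceph config get client.rgw.ocs.storagecluster.cephobjectstore.a debug_rgw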

Comment 3 Gianluca Cecchi 2022-12-06 16:04:26 UTC
I created the storage system from the web console.

I pre-created a NetworkAttachmentDefinition (NAD) to test Multus:

$ cat odf_nad.yaml
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: ocs-public-cluster
  namespace: openshift-storage
spec:
  config: '{
        "cniVersion": "0.3.1",
        "type": "macvlan",
        "master": "enp5s0f0",
        "mode": "bridge",
        "ipam": {
            "type": "whereabouts",
            "range": "172.26.0.0/24"
        }
  }'
$ oc create -f odf_nad.yaml
networkattachmentdefinition.k8s.cni.cncf.io/ocs-public-cluster created
$

In the wizard:
created the local volume set odfvolumeset, choosing only 3 of the existing 4 nodes
applied a disk size filter from 3000 to 4000 GB and selected only disks, not partitions, so that each of the 3 nodes contributes 2 NVMe disks of 3.7 TB: /dev/nvme0n1 and /dev/nvme2n1
the device names are the same on all 3 nodes
no encryption
in the network step, selected Multus and the NAD created above for the public network interface, leaving the cluster network interface empty so that the same network is used for both (see the sketch below)
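
For reference, the wizard selections above should end up in the StorageCluster CR as a Multus network section roughly like the following; this is a sketch based on the Rook/ODF Multus settings, not taken from this cluster, and the selector format may need adjusting:

spec:
  network:
    provider: multus
    selectors:
      public: openshift-storage/ocs-public-cluster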

Comment 4 Gianluca Cecchi 2022-12-06 16:05:13 UTC
$ oc get pv
NAME                CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS   REASON   AGE
local-pv-136455f6   3726Gi     RWO            Delete           Available           odfvolumeset            14m
local-pv-215abc3f   3726Gi     RWO            Delete           Available           odfvolumeset            14m
local-pv-98e76659   3726Gi     RWO            Delete           Available           odfvolumeset            14m
local-pv-cdbc060c   3726Gi     RWO            Delete           Available           odfvolumeset            14m
local-pv-e51a1253   3726Gi     RWO            Delete           Available           odfvolumeset            14m
local-pv-edd2590a   3726Gi     RWO            Delete           Available           odfvolumeset            14m

$ oc get pvc
NAME                                       STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ocs-deviceset-odfvolumeset-0-data-0ztfpl   Bound    local-pv-e51a1253   3726Gi     RWO            odfvolumeset   93s
ocs-deviceset-odfvolumeset-0-data-1lqlsl   Bound    local-pv-136455f6   3726Gi     RWO            odfvolumeset   93s
ocs-deviceset-odfvolumeset-0-data-2hl4r4   Bound    local-pv-98e76659   3726Gi     RWO            odfvolumeset   93s
ocs-deviceset-odfvolumeset-0-data-37v6x7   Bound    local-pv-215abc3f   3726Gi     RWO            odfvolumeset   93s
ocs-deviceset-odfvolumeset-0-data-4hxjrq   Bound    local-pv-cdbc060c   3726Gi     RWO            odfvolumeset   93s
ocs-deviceset-odfvolumeset-0-data-5h9tg9   Bound    local-pv-edd2590a   3726Gi     RWO            odfvolumeset   93s
$


$ oc get sc
NAME                          PROVISIONER                             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
ocs-storagecluster-ceph-rgw   openshift-storage.ceph.rook.io/bucket   Delete          Immediate              false                  4m15s
ocs-storagecluster-cephfs     openshift-storage.cephfs.csi.ceph.com   Delete          Immediate              true                   66s
odfvolumeset                  kubernetes.io/no-provisioner            Delete          WaitForFirstConsumer   false                  22m
$

Comment 5 Gianluca Cecchi 2022-12-06 16:06:34 UTC
After about 4 hours waiting:

$ oc get all -n openshift-storage
NAME                                                                  READY   STATUS             RESTARTS       AGE
pod/csi-addons-controller-manager-787797589d-mwqwg                    2/2     Running            0              4h57m
pod/csi-cephfsplugin-holder-ocs-storagecluster-cephcluster-6dvln      1/1     Running            0              4h22m
pod/csi-cephfsplugin-holder-ocs-storagecluster-cephcluster-9gvfh      1/1     Running            0              4h22m
pod/csi-cephfsplugin-holder-ocs-storagecluster-cephcluster-b9gcm      1/1     Running            0              4h22m
pod/csi-cephfsplugin-holder-ocs-storagecluster-cephcluster-rhchw      1/1     Running            0              4h22m
pod/csi-cephfsplugin-lzgcf                                            3/3     Running            0              4h22m
pod/csi-cephfsplugin-provisioner-5ff844654c-p6cpn                     6/6     Running            0              4h22m
pod/csi-cephfsplugin-provisioner-5ff844654c-tzpdk                     6/6     Running            0              4h22m
pod/csi-cephfsplugin-txpwf                                            3/3     Running            0              4h22m
pod/csi-cephfsplugin-wkn7h                                            3/3     Running            0              4h22m
pod/csi-cephfsplugin-xqlqr                                            3/3     Running            0              4h22m
pod/csi-rbdplugin-5x5tc                                               4/4     Running            0              4h22m
pod/csi-rbdplugin-7t9zk                                               4/4     Running            0              4h22m
pod/csi-rbdplugin-9ncc7                                               4/4     Running            0              4h22m
pod/csi-rbdplugin-holder-ocs-storagecluster-cephcluster-59str         1/1     Running            0              4h22m
pod/csi-rbdplugin-holder-ocs-storagecluster-cephcluster-btxtr         1/1     Running            0              4h22m
pod/csi-rbdplugin-holder-ocs-storagecluster-cephcluster-c2cnc         1/1     Running            0              4h22m
pod/csi-rbdplugin-holder-ocs-storagecluster-cephcluster-fctnq         1/1     Running            0              4h22m
pod/csi-rbdplugin-provisioner-6bb79b864-vlphv                         7/7     Running            0              4h22m
pod/csi-rbdplugin-provisioner-6bb79b864-xdn5n                         7/7     Running            0              4h22m
pod/csi-rbdplugin-wjsqq                                               4/4     Running            0              4h22m
pod/noobaa-operator-7555d4c459-t78q8                                  1/1     Running            0              4h57m
pod/ocs-metrics-exporter-5564bc6f89-rz7db                             1/1     Running            0              4h56m
pod/ocs-operator-79d665749b-8ghgv                                     1/1     Running            0              4h57m
pod/odf-console-7c8f9bd66c-86fq4                                      1/1     Running            0              4h57m
pod/odf-operator-controller-manager-97c969b-w569r                     2/2     Running            0              4h57m
pod/rook-ceph-crashcollector-ocp-worker01.ocp.seeweb.local-7774lwsk   1/1     Running            0              4h20m
pod/rook-ceph-crashcollector-ocp-worker02.ocp.seeweb.local-86495vlw   1/1     Running            0              4h20m
pod/rook-ceph-crashcollector-ocp-worker03.ocp.seeweb.local-74ctt757   1/1     Running            0              4h21m
pod/rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6b549f57vl8l7   2/2     Running            0              4h20m
pod/rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-67c7c88cccckj   2/2     Running            0              4h20m
pod/rook-ceph-mgr-a-67889bc8c6-gtxgc                                  3/3     Running            0              4h21m
pod/rook-ceph-mon-a-8498f7978f-ztrwj                                  2/2     Running            0              4h22m
pod/rook-ceph-mon-b-757b65949-rn79v                                   2/2     Running            0              4h21m
pod/rook-ceph-mon-c-8556ccd9b6-rn84s                                  2/2     Running            0              4h21m
pod/rook-ceph-operator-84cb4b77b4-8p6t5                               1/1     Running            0              4h57m
pod/rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6f44b94kqfx4   1/2     CrashLoopBackOff   49 (59s ago)   4h19m

NAME                                                       TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
service/csi-addons-controller-manager-metrics-service      ClusterIP   172.30.74.249    <none>        8443/TCP            4h57m
service/csi-cephfsplugin-metrics                           ClusterIP   172.30.107.185   <none>        8080/TCP,8081/TCP   4h22m
service/csi-rbdplugin-metrics                              ClusterIP   172.30.108.77    <none>        8080/TCP,8081/TCP   4h22m
service/noobaa-operator-service                            ClusterIP   172.30.234.19    <none>        443/TCP             4h57m
service/odf-console-service                                ClusterIP   172.30.48.109    <none>        9001/TCP            4h57m
service/odf-operator-controller-manager-metrics-service    ClusterIP   172.30.140.58    <none>        8443/TCP            4h57m
service/rook-ceph-mgr                                      ClusterIP   172.30.48.13     <none>        9283/TCP            4h20m
service/rook-ceph-mon-a                                    ClusterIP   172.30.31.203    <none>        6789/TCP,3300/TCP   4h22m
service/rook-ceph-mon-b                                    ClusterIP   172.30.109.248   <none>        6789/TCP,3300/TCP   4h21m
service/rook-ceph-mon-c                                    ClusterIP   172.30.17.88     <none>        6789/TCP,3300/TCP   4h21m
service/rook-ceph-rgw-ocs-storagecluster-cephobjectstore   ClusterIP   172.30.125.199   <none>        80/TCP,443/TCP      4h20m

NAME                                                                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/csi-cephfsplugin                                         4         4         4       4            4           <none>          4h22m
daemonset.apps/csi-cephfsplugin-holder-ocs-storagecluster-cephcluster   4         4         4       4            4           <none>          4h22m
daemonset.apps/csi-rbdplugin                                            4         4         4       4            4           <none>          4h22m
daemonset.apps/csi-rbdplugin-holder-ocs-storagecluster-cephcluster      4         4         4       4            4           <none>          4h22m

NAME                                                                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/csi-addons-controller-manager                            1/1     1            1           4h57m
deployment.apps/csi-cephfsplugin-provisioner                             2/2     2            2           4h22m
deployment.apps/csi-rbdplugin-provisioner                                2/2     2            2           4h22m
deployment.apps/noobaa-operator                                          1/1     1            1           4h57m
deployment.apps/ocs-metrics-exporter                                     1/1     1            1           4h57m
deployment.apps/ocs-operator                                             1/1     1            1           4h57m
deployment.apps/odf-console                                              1/1     1            1           4h57m
deployment.apps/odf-operator-controller-manager                          1/1     1            1           4h57m
deployment.apps/rook-ceph-crashcollector-ocp-worker01.ocp.seeweb.local   1/1     1            1           4h20m
deployment.apps/rook-ceph-crashcollector-ocp-worker02.ocp.seeweb.local   1/1     1            1           4h20m
deployment.apps/rook-ceph-crashcollector-ocp-worker03.ocp.seeweb.local   1/1     1            1           4h21m
deployment.apps/rook-ceph-mds-ocs-storagecluster-cephfilesystem-a        1/1     1            1           4h20m
deployment.apps/rook-ceph-mds-ocs-storagecluster-cephfilesystem-b        1/1     1            1           4h20m
deployment.apps/rook-ceph-mgr-a                                          1/1     1            1           4h21m
deployment.apps/rook-ceph-mon-a                                          1/1     1            1           4h22m
deployment.apps/rook-ceph-mon-b                                          1/1     1            1           4h21m
deployment.apps/rook-ceph-mon-c                                          1/1     1            1           4h21m
deployment.apps/rook-ceph-operator                                       1/1     1            1           4h57m
deployment.apps/rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a       0/1     1            0           4h19m

NAME                                                                                DESIRED   CURRENT   READY   AGE
replicaset.apps/csi-addons-controller-manager-54fc64cd74                            0         0         0       4h57m
replicaset.apps/csi-addons-controller-manager-787797589d                            1         1         1       4h57m
replicaset.apps/csi-cephfsplugin-provisioner-5ff844654c                             2         2         2       4h22m
replicaset.apps/csi-rbdplugin-provisioner-6bb79b864                                 2         2         2       4h22m
replicaset.apps/noobaa-operator-7555d4c459                                          1         1         1       4h57m
replicaset.apps/ocs-metrics-exporter-5564bc6f89                                     1         1         1       4h57m
replicaset.apps/ocs-operator-79d665749b                                             1         1         1       4h57m
replicaset.apps/odf-console-7c8f9bd66c                                              1         1         1       4h57m
replicaset.apps/odf-operator-controller-manager-97c969b                             1         1         1       4h57m
replicaset.apps/rook-ceph-crashcollector-ocp-worker01.ocp.seeweb.local-7684d47fcf   0         0         0       4h20m
replicaset.apps/rook-ceph-crashcollector-ocp-worker01.ocp.seeweb.local-777c9f7665   1         1         1       4h20m
replicaset.apps/rook-ceph-crashcollector-ocp-worker02.ocp.seeweb.local-8647595664   1         1         1       4h20m
replicaset.apps/rook-ceph-crashcollector-ocp-worker03.ocp.seeweb.local-74c777d6c9   1         1         1       4h21m
replicaset.apps/rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6b549f5798        1         1         1       4h20m
replicaset.apps/rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-67c7c88cd9        1         1         1       4h20m
replicaset.apps/rook-ceph-mgr-a-67889bc8c6                                          1         1         1       4h21m
replicaset.apps/rook-ceph-mon-a-8498f7978f                                          1         1         1       4h22m
replicaset.apps/rook-ceph-mon-b-757b65949                                           1         1         1       4h21m
replicaset.apps/rook-ceph-mon-c-8556ccd9b6                                          1         1         1       4h21m
replicaset.apps/rook-ceph-operator-84cb4b77b4                                       1         1         1       4h57m
replicaset.apps/rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6f44b94c7f       1         1         0       4h19m

NAME                                                               COMPLETIONS   DURATION   AGE
job.batch/rook-ceph-osd-prepare-3a2b9850812d66522429328d24904be9   0/1           4h14m      4h14m
job.batch/rook-ceph-osd-prepare-4e3613b54a20eb44a5c01256fbd4c6fe   0/1           4h14m      4h14m
job.batch/rook-ceph-osd-prepare-5acab5b2369ef4cedd3603dc14a17a47   0/1           4h14m      4h14m
job.batch/rook-ceph-osd-prepare-696df5af89adbd60b61c1aca4fee4b06   0/1           4h13m      4h13m
job.batch/rook-ceph-osd-prepare-c795143f209f83cf7fb71fe1a40df448   0/1           4h14m      4h14m
job.batch/rook-ceph-osd-prepare-e9f6ccb372fa498d35f6ec3f0f223bc1   0/1           4h13m      4h13m

NAME                                                          HOST/PORT                                                                    PATH   SERVICES                                           PORT    TERMINATION   WILDCARD
route.route.openshift.io/ocs-storagecluster-cephobjectstore   ocs-storagecluster-cephobjectstore-openshift-storage.apps.ocp.seeweb.local          rook-ceph-rgw-ocs-storagecluster-cephobjectstore   <all>                 None
$
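
The rook-ceph-osd-prepare jobs above still show 0/1 completions after more than 4 hours, so their logs might be worth a look as well; a sketch using one of the job names from the output above:

$ oc logs -n openshift-storage job/rook-ceph-osd-prepare-3a2b9850812d66522429328d24904be9
$ oc describe job -n openshift-storage rook-ceph-osd-prepare-3a2b9850812d66522429328d24904be9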

Comment 6 Gianluca Cecchi 2022-12-06 16:07:45 UTC
The cluster was created on 4.11.12 and then updated to 4.11.16 before installing the Local Storage Operator and the ODF Operator.

Comment 7 Gianluca Cecchi 2022-12-06 16:13:07 UTC
For the installation I used the any-platform approach with PXE.
The 3 master nodes (not schedulable) are vSphere VMs; the 4 worker nodes are bare metal.
The network type is OVNKubernetes:
$ oc get network.config/cluster -o jsonpath='{.status.networkType}{"\n"}'
OVNKubernetes

Comment 8 Gianluca Cecchi 2022-12-06 16:19:17 UTC
Inside the logs of the constantly restarting pod I only see this:

$ oc logs -f rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6f44b942f9sd
Defaulted container "rgw" out of: rgw, log-collector, chown-container-data-dir (init)
debug 2022-12-06T16:17:27.263+0000 7fba8b4f95c0  0 deferred set uid:gid to 167:167 (ceph:ceph)
debug 2022-12-06T16:17:27.263+0000 7fba8b4f95c0  0 ceph version 16.2.8-84.el8cp (c2980f2fd700e979d41b4bad2939bb90f0fe435c) pacific (stable), process radosgw, pid 613
debug 2022-12-06T16:17:27.263+0000 7fba8b4f95c0  0 framework: beast
debug 2022-12-06T16:17:27.263+0000 7fba8b4f95c0  0 framework conf key: port, val: 8080
debug 2022-12-06T16:17:27.263+0000 7fba8b4f95c0  0 framework conf key: ssl_port, val: 443
debug 2022-12-06T16:17:27.263+0000 7fba8b4f95c0  0 framework conf key: ssl_certificate, val: /etc/ceph/private/rgw-cert.pem
debug 2022-12-06T16:17:27.263+0000 7fba8b4f95c0  0 framework conf key: ssl_private_key, val: /etc/ceph/private/rgw-key.pem
debug 2022-12-06T16:17:27.263+0000 7fba8b4f95c0  1 radosgw_Main not setting numa affinity
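
Given that the startup probe failures report "connection refused" on 10.128.2.34:8080 (see the events above), two quick checks that might narrow things down are inspecting the probe definition on the rgw container and the endpoints behind the RGW service; a sketch using the pod and service names from this report:

$ oc get pod rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6f44b942f9sd -n openshift-storage \
    -o jsonpath='{.spec.containers[?(@.name=="rgw")].startupProbe}{"\n"}'
$ oc get endpoints rook-ceph-rgw-ocs-storagecluster-cephobjectstore -n openshift-storage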

Comment 9 Blaine Gardner 2022-12-07 15:26:58 UTC
Gianluca, could you please collect and attach a must-gather to help with debugging?

Comment 10 Gianluca Cecchi 2022-12-07 22:54:53 UTC
Collection done.
$ oc adm must-gather 
...
Reprinting Cluster State:
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
ClusterID: 01ca8e48-a73b-4f0b-8c19-a6a3be35b808
ClusterVersion: Stable at "4.11.16"
ClusterOperators:
	All healthy and stable
I'm going to attach the tar gz archive
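
Note that the default must-gather image may not capture the ODF/Ceph-specific data; for ODF issues the ODF must-gather image is usually passed explicitly, along the lines of the command below (the exact image path is an assumption and should be checked against the ODF 4.11 documentation):

$ oc adm must-gather --image=registry.redhat.io/odf4/ocs-must-gather-rhel8:v4.11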

Comment 11 Gianluca Cecchi 2022-12-07 22:59:56 UTC
The archive is too big to be added as an attachment (about 80 MB). You can download it here:
https://drive.google.com/file/d/19MxD5OuURAdcSHKjr_nvAIcoYg2qPMro/view?usp=sharing

Comment 12 Gianluca Cecchi 2022-12-09 10:08:26 UTC
Any comments on the must-gather output?
Questions:
. Is the whereabouts CNI plugin, used in the NAD definition, pre-configured in a standard OCP install, or am I supposed to configure anything for it beforehand? (see the check sketched below)
. Are there any Multus pre-configurations I need to do before creating the ODF storagesystem, or is everything that's needed already in place in a standard OCP install?
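
On a standard OCP install the whereabouts IPAM plugin ships with Multus (managed by the Cluster Network Operator), so no extra pre-configuration should be needed for the NAD above. A quick way to confirm it is present is to look for its CRDs and the Multus pods; a sketch, not taken from this report:

$ oc get crd | grep whereabouts
$ oc get pods -n openshift-multus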

Comment 20 Gianluca Cecchi 2023-01-05 09:06:48 UTC
An update: I recreated the same environment with the same hardware components.
Differences:
installed 4.11.12
updated to 4.11.18
updated to 4.11.20

Then, before installing ODF, I installed the nmstate operator.

Then I created a NetworkAttachmentDefinition in much the same way as before (only adding the "name" param in the config section and using the other 10 Gbit NIC on the systems):

apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: odf-enp5s0f21-27subnet-whereabouts
  namespace: openshift-storage
spec:
  config: '{
  	"cniVersion": "0.3.1",
  	"name": "macvlan-27-net",
  	"type": "macvlan",
  	"master": "enp5s0f1",
  	"mode": "bridge",
  	"ipam": {
    	    "type": "whereabouts",
    	    "range": "172.27.0.0/24"
  	}
  }'

The installation from the web console, done with the same steps as before, worked fine, with the Ceph cluster components all in the correct state.
It would be interesting to know whether the OpenShift version made the difference (4.11.20 now vs 4.11.16 before), or whether installing the nmstate operator before ODF mattered, or whether I missed or did something wrong in the first attempt.
Any insight from the logs I sent?
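
One more check that might help compare the two attempts is whether the Ceph pods actually received the Multus interface, which shows up in the network-status annotation; a sketch using the OSD pods as an example:

$ oc get pods -n openshift-storage -l app=rook-ceph-osd \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.k8s\.v1\.cni\.cncf\.io/network-status}{"\n"}{end}'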

Comment 21 Blaine Gardner 2023-01-17 15:31:20 UTC
I'm not seeing anything in the logs gathered that suggests why the startup probe for the RGW was failing. Since we can't repro this now, I'm not sure what else to look into. I'll close this for now, but please reopen if the issue repros.

