Description of problem (please be as detailed as possible and provide log snippets):
rgw pod is stuck in CrashLoopBackOff after installing odf-operator via the UI on a VMware cluster.

Version of all relevant components (if applicable):
ODF 4.9.0-120.ci and OCP 4.9.0-0.nightly-2021-08-25-111423

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install odf-operator via the UI
2. Create a storagesystem from Installed Operators -> ODF
3. Check the output of oc get pods -n openshift-storage

Actual results:
rgw pod stuck in CrashLoopBackOff while installing odf-operator via UI on VMware cluster.

Output of: oc describe pod rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-86768dfggcxx-

  Normal   Pulled      31m (x2 over 31m)    kubelet  Container image "quay.io/rhceph-dev/rhceph@sha256:7ecf53369849d0141abe029d142751c755766078f15caac6ced4621cba1b7dcf" already present on machine
  Normal   Killing     31m                  kubelet  Container rgw failed liveness probe, will be restarted
  Warning  Unhealthy   30m (x6 over 31m)    kubelet  Liveness probe failed: Get "http://10.131.0.20:8080/swift/healthcheck": dial tcp 10.131.0.20:8080: connect: connection refused
  Warning  ProbeError  30m (x6 over 31m)    kubelet  Liveness probe error: Get "http://10.131.0.20:8080/swift/healthcheck": dial tcp 10.131.0.20:8080: connect: connection refused body:
  Warning  BackOff     93s (x125 over 29m)  kubelet  Back-off restarting failed container

Expected results:
rgw pod should be in Running state.

Additional info:
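For reference, a sketch of the commands used to inspect the failing pod (<suffix> stands for the generated pod-name suffix, which differs per cluster):

# List the pods and look for the CrashLoopBackOff state
oc get pods -n openshift-storage

# Show the events for the rgw pod
oc describe pod -n openshift-storage rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-<suffix>

# Fetch the rgw logs, including the previously crashed instance
oc logs -n openshift-storage rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-<suffix> --previous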
The rgw log shows an error with the cert or zonegroup configuration. @Jiffin Can you take a look?

2021-09-01T13:13:39.863734009Z debug 2021-09-01T13:13:39.861+0000 7f309e2ef480  0 deferred set uid:gid to 167:167 (ceph:ceph)
2021-09-01T13:13:39.863847681Z debug 2021-09-01T13:13:39.861+0000 7f309e2ef480  0 ceph version 16.2.0-81.el8cp (8908ce967004ed706acb5055c01030e6ecd06036) pacific (stable), process radosgw, pid 478
2021-09-01T13:13:39.863847681Z debug 2021-09-01T13:13:39.861+0000 7f309e2ef480  0 framework: beast
2021-09-01T13:13:39.863847681Z debug 2021-09-01T13:13:39.861+0000 7f309e2ef480  0 framework conf key: port, val: 8080
2021-09-01T13:13:39.863847681Z debug 2021-09-01T13:13:39.861+0000 7f309e2ef480  0 framework conf key: ssl_port, val: 443
2021-09-01T13:13:39.863847681Z debug 2021-09-01T13:13:39.861+0000 7f309e2ef480  0 framework conf key: ssl_certificate, val: /etc/ceph/private/rgw-cert.pem
2021-09-01T13:13:39.863847681Z debug 2021-09-01T13:13:39.861+0000 7f309e2ef480  0 framework conf key: ssl_private_key, val: /etc/ceph/private/rgw-key.pem
2021-09-01T13:13:39.863847681Z debug
2021-09-01T13:13:39.863860573Z 2021-09-01T13:13:39.861+0000 7f309e2ef480  1 radosgw_Main not setting numa affinity
2021-09-01T13:13:39.881728097Z debug 2021-09-01T13:13:39.879+0000 7f309e2ef480  0 failed reading zonegroup info: ret -2 (2) No such file or directory
2021-09-01T13:13:39.881728097Z debug 2021-09-01T13:13:39.879+0000 7f309e2ef480  0 ERROR: failed to start notify service ((2) No such file or directory
2021-09-01T13:13:39.881728097Z debug 2021-09-01T13:13:39.879+0000 7f309e2ef480  0 ERROR: failed to init services (ret=(2) No such file or directory)
2021-09-01T13:13:39.883725963Z debug 2021-09-01T13:13:39.881+0000 7f309e2ef480 -1 Couldn't init storage provider (RADOS)
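If it helps narrow down the zonegroup error, the RGW multisite metadata can be inspected from the toolbox (a sketch; this assumes the rook-ceph-tools deployment is enabled, which is not the default in ODF):

# Open a shell in the toolbox pod
oc rsh -n openshift-storage deploy/rook-ceph-tools

# Inside the toolbox: list the zonegroups known to the cluster
radosgw-admin zonegroup list

# Dump the current period, which embeds the zonegroup configuration
radosgw-admin period get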
(In reply to Travis Nielsen from comment #3)
> The rgw log shows an error with the cert or zonegroup configuration.
> @Jiffin Can you take a look?
>
> [...]

I was not able to find anything suspicious apart from the last few error messages. Can you please re-collect the logs at debug level 20? Add the following to rook-config-override and restart the rgw pod:

[client.rgw.ocs.storagecluster.cephobjectstore.a]
debug rgw = 20/20
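A sketch of the restart step (app=rook-ceph-rgw is the usual Rook label for rgw pods; please verify it on the cluster before use):

# Delete the rgw pod; its deployment recreates it with the new log level
oc delete pod -n openshift-storage -l app=rook-ceph-rgw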
As of now, the OCS must-gather command doesn't collect every log affected by the recent re-branding changes. Please follow bug 2000190 for more details. I am not sure how else I could help you here.

Hi Neha/Mudit/Jose/Nitin,
Could you please help us here?
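For reference, must-gather is typically invoked as below (a sketch; the image path and tag are assumptions that depend on the release and on the re-branding status tracked in bug 2000190):

# Collect the storage logs with a release-matched must-gather image
oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.9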
Hi Jiffin, Travis,

How should we change the RGW log level? Is it done using a ConfigMap?
(In reply to Elad from comment #10)
> Hi Jiffin, Travis,
>
> How should we change the RGW log level? Is it done using a ConfigMap?

In Rook, we can do this by adding the following to the "rook-config-override" ConfigMap and restarting the rgw pod (if the pod has already started):

[client.rgw.ocs.storagecluster.cephobjectstore.a]
debug rgw = 20/20
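For example, the whole override could be applied as below (a sketch; if the ConfigMap already carries content under the "config" key, merge the new section into it instead of overwriting):

cat <<EOF | oc apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-config-override
  namespace: openshift-storage
data:
  config: |
    [client.rgw.ocs.storagecluster.cephobjectstore.a]
    debug rgw = 20/20
EOF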
Removing needsinfo since Jiffin answered the logging question.
Aman, can you please repro this with the logging instructions provided by Jiffin in comment #11?
Setting needinfo back on Aman; we still need help with reproducing this issue (with the appropriate debug log level).
Thanks, Aman. I guess we can keep it open for some time; if there is no further instance of this, we can close it. I don't see it as a test blocker if it is not even reproducible, so I am removing the blocker flag. Please re-flag if required.
Please reopen if this is reproducible.
Please open a new bug if this can be reproduced with the increased logging.