Bug 2000133

Summary: rgw pod stuck in CrashLoopBackOff while installing odf-operator via UI on VMware cluster
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Aman Agrawal <amagrawa>
Component: rook
Assignee: Jiffin <jthottan>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Elad <ebenahar>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.8
CC: hnallurv, jrivera, jthottan, madam, muagarwa, nberry, nigoyal, ocs-bugs, odf-bz-bot, sbaldwin, srai, tnielsen
Target Milestone: ---
Keywords: Reopened
Target Release: ---
Flags: muagarwa: needinfo? (sbaldwin)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-01-24 16:21:26 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2000190    
Bug Blocks:    

Description Aman Agrawal 2021-09-01 13:13:08 UTC
Description of problem (please be as detailed as possible and provide log snippets):


Version of all relevant components (if applicable): ODF 4.9.0-120.ci and OCP 4.9.0-0.nightly-2021-08-25-111423


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)? Yes


Is there any workaround available to the best of your knowledge? No


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)? 1


Is this issue reproducible? Yes


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install odf-operator via UI
2. Create storagesystem from Installed Operators -> ODF
3. Check the output of oc get pods -n openshift-storage (see the sketch below)
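
For step 3, a minimal sketch of the commands involved (the pod name is a placeholder; exact names differ per cluster):

# List the rgw pods and check their status:
oc get pods -n openshift-storage | grep rgw
# Inspect the events of the failing pod:
oc describe pod -n openshift-storage <rgw-pod-name>
# Fetch the log of the previously crashed rgw container:
oc logs -n openshift-storage <rgw-pod-name> -c rgw --previous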


Actual results: 
rgw pod stuck in CrashLoopBackOff while installing odf-operator via UI on VMware cluster

Output of: oc describe pod rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-86768dfggcxx-

Normal   Pulled          31m (x2 over 31m)  kubelet            Container image "quay.io/rhceph-dev/rhceph@sha256:7ecf53369849d0141abe029d142751c755766078f15caac6ced4621cba1b7dcf" already present on machine
  Normal   Killing         31m                kubelet            Container rgw failed liveness probe, will be restarted
  Warning  Unhealthy       30m (x6 over 31m)  kubelet            Liveness probe failed: Get "http://10.131.0.20:8080/swift/healthcheck": dial tcp 10.131.0.20:8080: connect: connection refused
  Warning  ProbeError      30m (x6 over 31m)  kubelet            Liveness probe error: Get "http://10.131.0.20:8080/swift/healthcheck": dial tcp 10.131.0.20:8080: connect: connection refused
body:
  Warning  BackOff  93s (x125 over 29m)  kubelet  Back-off restarting failed container

Expected results: rgw pod should be in Running state.


Additional info:

Comment 3 Travis Nielsen 2021-09-02 20:54:23 UTC
The rgw log shows an error with the cert or zonegroup configuration.
@Jiffin Can you take a look?

2021-09-01T13:13:39.863734009Z debug 2021-09-01T13:13:39.861+0000 7f309e2ef480  0 deferred set uid:gid to 167:167 (ceph:ceph)
2021-09-01T13:13:39.863847681Z debug 2021-09-01T13:13:39.861+0000 7f309e2ef480  0 ceph version 16.2.0-81.el8cp (8908ce967004ed706acb5055c01030e6ecd06036) pacific (stable), process radosgw, pid 478
2021-09-01T13:13:39.863847681Z debug 2021-09-01T13:13:39.861+0000 7f309e2ef480  0 framework: beast
2021-09-01T13:13:39.863847681Z debug 2021-09-01T13:13:39.861+0000 7f309e2ef480  0 framework conf key: port, val: 8080
2021-09-01T13:13:39.863847681Z debug 2021-09-01T13:13:39.861+0000 7f309e2ef480  0 framework conf key: ssl_port, val: 443
2021-09-01T13:13:39.863847681Z debug 2021-09-01T13:13:39.861+0000 7f309e2ef480  0 framework conf key: ssl_certificate, val: /etc/ceph/private/rgw-cert.pem
2021-09-01T13:13:39.863847681Z debug 2021-09-01T13:13:39.861+0000 7f309e2ef480  0 framework conf key: ssl_private_key, val: /etc/ceph/private/rgw-key.pem
2021-09-01T13:13:39.863847681Z debug 2021-09-01T13:13:39.863860573Z 2021-09-01T13:13:39.861+0000 7f309e2ef480  1 radosgw_Main not setting numa affinity
2021-09-01T13:13:39.881728097Z debug 2021-09-01T13:13:39.879+0000 7f309e2ef480  0 failed reading zonegroup info: ret -2 (2) No such file or directory
2021-09-01T13:13:39.881728097Z debug 2021-09-01T13:13:39.879+0000 7f309e2ef480  0 ERROR: failed to start notify service ((2) No such file or directory
2021-09-01T13:13:39.881728097Z debug 2021-09-01T13:13:39.879+0000 7f309e2ef480  0 ERROR: failed to init services (ret=(2) No such file or directory)
2021-09-01T13:13:39.883725963Z debug 2021-09-01T13:13:39.881+0000 7f309e2ef480 -1 Couldn't init storage provider (RADOS)
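
For reference, a sketch of how the zonegroup/period state could be checked from the toolbox, assuming the rook-ceph-tools deployment is enabled in the cluster:

# List the zonegroups RGW knows about (should include the one created for the object store):
oc exec -n openshift-storage deploy/rook-ceph-tools -- radosgw-admin zonegroup list
# Dump the current period, which carries the zone/zonegroup configuration:
oc exec -n openshift-storage deploy/rook-ceph-tools -- radosgw-admin period get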

Comment 5 Jiffin 2021-09-08 05:51:57 UTC
(In reply to Travis Nielsen from comment #3)
> The rgw log shows an error with the cert or zonegroup configuration.
> @Jiffin Can you take a look?
> 
> [rgw log snippet from comment #3, quoted above]

I was not able to identify anything suspicious from the last few error messages alone.
Can you please recollect the logs at debug level 20? To do so, add the following to the rook-config-override ConfigMap and restart the rgw pod:

[client.rgw.ocs.storagecluster.cephobjectstore.a]
debug rgw = 20/20


Comment 9 Aman Agrawal 2021-09-14 07:37:45 UTC
As of now, the OCS must-gather command doesn't collect every log affected by the recent rebranding changes. Please follow bug 2000190 for more details.

I am not sure how else I could help you here.

Hi Neha/Mudit/Jose/Nitin, 

Could you please help us here?

Comment 10 Elad 2021-09-14 10:26:14 UTC
Hi Jiffin, Travis,

How should we change RGW log level? Is it done using a configMap?

Comment 11 Jiffin 2021-09-14 10:41:49 UTC
(In reply to Elad from comment #10)
> Hi Jiffin, Travis,
> 
> How should we change RGW log level? Is it done using a configMap?

In Rook, we can do this by adding the following to the "rook-config-override" ConfigMap and restarting the rgw pod (if the pod has already started):

[client.rgw.ocs.storagecluster.cephobjectstore.a]
debug rgw = 20/20
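
A sketch of how that could be applied, assuming the default openshift-storage namespace and Rook's usual rgw pod label:

# Add the [client.rgw.ocs.storagecluster.cephobjectstore.a] section above to the override ConfigMap:
oc edit configmap rook-config-override -n openshift-storage
# Restart the rgw pod so radosgw picks up the new debug level (label selector assumes Rook's default labeling):
oc delete pod -n openshift-storage -l app=rook-ceph-rgw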

Comment 13 Travis Nielsen 2021-09-14 18:38:31 UTC
Removing needinfo since Jiffin answered the logging question.

Comment 14 Mudit Agarwal 2021-09-16 06:25:45 UTC
Aman, can you please repro this with the logging instructions provided by Jiffin in comment #11?

Comment 17 Mudit Agarwal 2021-09-20 10:18:00 UTC
Setting needinfo back on Aman; we still need help reproducing this issue (with the appropriate debug log level).

Comment 19 Mudit Agarwal 2021-09-20 11:51:47 UTC
Thanks Aman. I guess we can keep it open for some time; if there is no instance of this in the future, then we might close it.

I don't see it as a test blocker if this is not even reproducible, so I am removing the blocker flag.
Please re-flag if required.

Comment 21 Mudit Agarwal 2021-10-11 15:16:24 UTC
Please reopen if this is reproducible.

Comment 26 Travis Nielsen 2022-01-24 16:21:26 UTC
Please open a new bug if this can be reproduced with the increased logging.