Bug 2000133 - rgw pod stuck in CrashLoopBackOff while installing odf-operator via UI on VMware cluster [NEEDINFO]
Summary: rgw pod stuck in CrashLoopBackOff while installing odf-operator via UI on VMware cluster
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Jiffin
QA Contact: Elad
URL:
Whiteboard:
Depends On: 2000190
Blocks:
 
Reported: 2021-09-01 13:13 UTC by Aman Agrawal
Modified: 2023-08-09 17:03 UTC (History)
CC: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-01-24 16:21:26 UTC
Embargoed:
muagarwa: needinfo? (sbaldwin)



Description Aman Agrawal 2021-09-01 13:13:08 UTC
Description of problem (please be as detailed as possible and provide log
snippets):


Version of all relevant components (if applicable): ODF 4.9.0-120.ci and OCP 4.9.0-0.nightly-2021-08-25-111423


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)? Yes


Is there any workaround available to the best of your knowledge? No


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)? 1


Is this issue reproducible? Yes


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install odf-operator via UI
2. Create storagesystem from Installed Operators -> ODF
3. Check the output of oc get pods -n openshift-storage
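For reference, a quick way to narrow the output to the rgw pod is a label selector (a sketch only; the app=rook-ceph-rgw label is the one Rook normally applies to rgw pods and is assumed here):

  # list only the rgw pod(s) and show their restart counts
  oc -n openshift-storage get pods -l app=rook-ceph-rgw
  # dump the events for the failing rgw pod
  oc -n openshift-storage describe pod -l app=rook-ceph-rgw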


Actual results: 
rgw pod stuck in CrashLoopBackOff while installing odf-operator via UI on VMware cluster

Output of: oc describe pod rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-86768dfggcxx-

Normal   Pulled          31m (x2 over 31m)  kubelet            Container image "quay.io/rhceph-dev/rhceph@sha256:7ecf53369849d0141abe029d142751c755766078f15caac6ced4621cba1b7dcf" already present on machine
  Normal   Killing         31m                kubelet            Container rgw failed liveness probe, will be restarted
  Warning  Unhealthy       30m (x6 over 31m)  kubelet            Liveness probe failed: Get "http://10.131.0.20:8080/swift/healthcheck": dial tcp 10.131.0.20:8080: connect: connection refused
  Warning  ProbeError      30m (x6 over 31m)  kubelet            Liveness probe error: Get "http://10.131.0.20:8080/swift/healthcheck": dial tcp 10.131.0.20:8080: connect: connection refused
body:
  Warning  BackOff  93s (x125 over 29m)  kubelet  Back-off restarting failed container
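For a quick manual check of the endpoint the liveness probe is hitting (pod IP and port taken from the events above; running curl from the rook-ceph toolbox pod is an assumption, any in-cluster pod with curl would do):

  # expect "connection refused" while rgw is crashing, HTTP 200 once it is healthy
  oc -n openshift-storage exec deploy/rook-ceph-tools -- curl -sv http://10.131.0.20:8080/swift/healthcheck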

Expected results: rgw pod should be in Running state.


Additional info:

Comment 3 Travis Nielsen 2021-09-02 20:54:23 UTC
The rgw log shows an error with the cert or zonegroup configuration.
@Jiffin Can you take a look?

2021-09-01T13:13:39.863734009Z debug 2021-09-01T13:13:39.861+0000 7f309e2ef480  0 deferred set uid:gid to 167:167 (ceph:ceph)
2021-09-01T13:13:39.863847681Z debug 2021-09-01T13:13:39.861+0000 7f309e2ef480  0 ceph version 16.2.0-81.el8cp (8908ce967004ed706acb5055c01030e6ecd06036) pacific (stable), process radosgw, pid 478
2021-09-01T13:13:39.863847681Z debug 2021-09-01T13:13:39.861+0000 7f309e2ef480  0 framework: beast
2021-09-01T13:13:39.863847681Z debug 2021-09-01T13:13:39.861+0000 7f309e2ef480  0 framework conf key: port, val: 8080
2021-09-01T13:13:39.863847681Z debug 2021-09-01T13:13:39.861+0000 7f309e2ef480  0 framework conf key: ssl_port, val: 443
2021-09-01T13:13:39.863847681Z debug 2021-09-01T13:13:39.861+0000 7f309e2ef480  0 framework conf key: ssl_certificate, val: /etc/ceph/private/rgw-cert.pem
2021-09-01T13:13:39.863847681Z debug 2021-09-01T13:13:39.861+0000 7f309e2ef480  0 framework conf key: ssl_private_key, val: /etc/ceph/private/rgw-key.pem
2021-09-01T13:13:39.863847681Z debug 2021-09-01T13:13:39.863860573Z 2021-09-01T13:13:39.861+0000 7f309e2ef480  1 radosgw_Main not setting numa affinity
2021-09-01T13:13:39.881728097Z debug 2021-09-01T13:13:39.879+0000 7f309e2ef480  0 failed reading zonegroup info: ret -2 (2) No such file or directory
2021-09-01T13:13:39.881728097Z debug 2021-09-01T13:13:39.879+0000 7f309e2ef480  0 ERROR: failed to start notify service ((2) No such file or directory
2021-09-01T13:13:39.881728097Z debug 2021-09-01T13:13:39.879+0000 7f309e2ef480  0 ERROR: failed to init services (ret=(2) No such file or directory)
2021-09-01T13:13:39.883725963Z debug 2021-09-01T13:13:39.881+0000 7f309e2ef480 -1 Couldn't init storage provider (RADOS)
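Since the failure is in reading the zonegroup metadata, one way to inspect what RGW realm/zonegroup/zone configuration actually exists is radosgw-admin from the rook-ceph toolbox (a sketch, assuming the toolbox pod is deployed; these commands only read state):

  # list the realms, zonegroups and zones known to the cluster
  oc -n openshift-storage exec deploy/rook-ceph-tools -- radosgw-admin realm list
  oc -n openshift-storage exec deploy/rook-ceph-tools -- radosgw-admin zonegroup list
  oc -n openshift-storage exec deploy/rook-ceph-tools -- radosgw-admin zone list
  # show the current period, which ties realm, zonegroup and zone together
  oc -n openshift-storage exec deploy/rook-ceph-tools -- radosgw-admin period get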

Comment 5 Jiffin 2021-09-08 05:51:57 UTC
(In reply to Travis Nielsen from comment #3)
> The rgw log shows an error with the cert or zonegroup configuration.
> @Jiffin Can you take a look?
> 
> 2021-09-01T13:13:39.863734009Z debug 2021-09-01T13:13:39.861+0000
> 7f309e2ef480  0 deferred set uid:gid to 167:167 (ceph:ceph)
> 2021-09-01T13:13:39.863847681Z debug 2021-09-01T13:13:39.861+0000
> 7f309e2ef480  0 ceph version 16.2.0-81.el8cp
> (8908ce967004ed706acb5055c01030e6ecd06036) pacific (stable), process
> radosgw, pid 478
> 2021-09-01T13:13:39.863847681Z debug 2021-09-01T13:13:39.861+0000
> 7f309e2ef480  0 framework: beast
> 2021-09-01T13:13:39.863847681Z debug 2021-09-01T13:13:39.861+0000
> 7f309e2ef480  0 framework conf key: port, val: 8080
> 2021-09-01T13:13:39.863847681Z debug 2021-09-01T13:13:39.861+0000
> 7f309e2ef480  0 framework conf key: ssl_port, val: 443
> 2021-09-01T13:13:39.863847681Z debug 2021-09-01T13:13:39.861+0000
> 7f309e2ef480  0 framework conf key: ssl_certificate, val:
> /etc/ceph/private/rgw-cert.pem
> 2021-09-01T13:13:39.863847681Z debug 2021-09-01T13:13:39.861+0000
> 7f309e2ef480  0 framework conf key: ssl_private_key, val:
> /etc/ceph/private/rgw-key.pem
> 2021-09-01T13:13:39.863847681Z debug 2021-09-01T13:13:39.863860573Z
> 2021-09-01T13:13:39.861+0000 7f309e2ef480  1 radosgw_Main not setting numa
> affinity
> 2021-09-01T13:13:39.881728097Z debug 2021-09-01T13:13:39.879+0000
> 7f309e2ef480  0 failed reading zonegroup info: ret -2 (2) No such file or
> directory
> 2021-09-01T13:13:39.881728097Z debug 2021-09-01T13:13:39.879+0000
> 7f309e2ef480  0 ERROR: failed to start notify service ((2) No such file or
> directory
> 2021-09-01T13:13:39.881728097Z debug 2021-09-01T13:13:39.879+0000
> 7f309e2ef480  0 ERROR: failed to init services (ret=(2) No such file or
> directory)
> 2021-09-01T13:13:39.883725963Z debug 2021-09-01T13:13:39.881+0000
> 7f309e2ef480 -1 Couldn't init storage provider (RADOS)

I was not able to find anything suspicious beyond the last few error messages.
Can you please re-collect the logs with debug level 20? Add the following to the rook-config-override ConfigMap and restart the rgw pod:

[client.rgw.ocs.storagecluster.cephobjectstore.a]
debug rgw = 20/20
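One possible way to apply this (a sketch only, assuming the default openshift-storage namespace and the standard rgw pod label):

  # append the section above to the "config" key of the override ConfigMap
  oc -n openshift-storage edit configmap rook-config-override
  # delete the rgw pod so its deployment recreates it with the new log level
  oc -n openshift-storage delete pod -l app=rook-ceph-rgw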

Comment 9 Aman Agrawal 2021-09-14 07:37:45 UTC
As of now, the OCS must-gather command doesn't collect every log affected by the recent re-branding changes. Please follow bug 2000190 for more details.

I am not sure how else I could help you here.

Hi Neha/Mudit/Jose/Nitin, 

Could you please help us here?

Comment 10 Elad 2021-09-14 10:26:14 UTC
Hi Jiffin, Travis,

How should we change the RGW log level? Is it done using a ConfigMap?

Comment 11 Jiffin 2021-09-14 10:41:49 UTC
(In reply to Elad from comment #10)
> Hi Jiffin, Travis,
> 
> How should we change RGW log level? Is it done using a configMap?

In Rook, this can be done by adding the following to the "rook-config-override" ConfigMap and restarting the rgw pod (if the pod has already started):

[client.rgw.ocs.storagecluster.cephobjectstore.a]
debug rgw = 20/20
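To make the ConfigMap part concrete, a sketch of what the resulting rook-config-override ConfigMap would look like (namespace and ceph client section name assumed from the default ODF install):

apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-config-override
  namespace: openshift-storage
data:
  config: |
    [client.rgw.ocs.storagecluster.cephobjectstore.a]
    debug rgw = 20/20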

Comment 13 Travis Nielsen 2021-09-14 18:38:31 UTC
Removing needinfo since Jiffin answered the logging question.

Comment 14 Mudit Agarwal 2021-09-16 06:25:45 UTC
Aman, can you please reproduce this with the logging instructions provided by Jiffin in comment #11?

Comment 17 Mudit Agarwal 2021-09-20 10:18:00 UTC
Setting needinfo back on Aman; we still need help with reproducing this issue (with the appropriate debug log level).

Comment 19 Mudit Agarwal 2021-09-20 11:51:47 UTC
Thanks Aman. I guess we can keep it open for some time; if there is no instance of this in the future, we might close it.

I don't see this as a test blocker if it is not even reproducible, so I am removing the blocker flag.
Please re-flag if required.

Comment 21 Mudit Agarwal 2021-10-11 15:16:24 UTC
Please reopen if this is reproducible.

Comment 26 Travis Nielsen 2022-01-24 16:21:26 UTC
Please open a new bug if you are able to reproduce this with the increased logging.

