Bug 2254475 - [4.14 clone] rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a does not come up after node reboot
Summary: [4.14 clone] rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a does not come up after node reboot
Keywords:
Status: CLOSED DUPLICATE of bug 2254547
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.14.5
Assignee: Parth Arora
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On: 2245004
Blocks:
 
Reported: 2023-12-14 06:33 UTC by Mudit Agarwal
Modified: 2024-01-29 07:09 UTC
CC: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2245004
Environment:
Last Closed: 2024-01-29 07:03:48 UTC
Embargoed:



Description Mudit Agarwal 2023-12-14 06:33:38 UTC
+++ This bug was initially created as a clone of Bug #2245004 +++

Description of problem (please be as detailed as possible and provide log snippets):

After one of the storage nodes rebooted, the rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a pod is stuck in CrashLoopBackOff.

ceph health detail reports HEALTH_WARN 1 filesystem is degraded; 1 MDSs report slow metadata IOs; Reduced data availability: 106 pgs inactive, 46 pgs peering; 258 slow ops, oldest one blocked for 141331 sec, daemons [osd.2,mon.a] have slow ops.
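For reference, one way to pull this output on ODF, assuming the rook-ceph-tools toolbox pod is deployed with its usual app=rook-ceph-tools label (these names are defaults, not taken from this report):

oc -n openshift-storage rsh $(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name) ceph health detail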

Version of all relevant components (if applicable):

ODF 4.14.0-139.stable, OCP 4.14.0-rc.4

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

This is a lab doing pre-GA testing.

Is there any workaround available to the best of your knowledge?

The problem sounds very similar to this: https://access.redhat.com/solutions/6972994. The workaround posted there is to delete the OSD pod; however, we want to identify the root cause so that no manual intervention is required.


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?

unsure


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info:

I will add must-gather and ceph logs in the comments

--- Additional comment from RHEL Program Management on 2023-10-19 08:48:35 UTC ---

This bug previously had no release flag set; the release flag 'odf-4.14.0' has now been set to '?', so the bug is proposed to be fixed in the ODF 4.14.0 release. Note that the 3 acks (pm_ack, devel_ack, qa_ack), if any were set while the release flag was missing, have now been reset, since acks are to be set against a release flag.

--- Additional comment from Logan McNaughton on 2023-10-19 08:50:07 UTC ---

This was reported to us by Nokia during pre-GA testing.

Here is the must-gather: https://drive.google.com/file/d/1c_qQ9war377OXV56xrBTACW09S4hQqvl/view?usp=share_link


Here is the ceph health output: https://docs.google.com/document/d/1csR02-GfMTrffCEcSghWvCvcQDYAEuCS2LuAKzZ3QmM/edit?usp=sharing

--- Additional comment from Jiffin on 2023-10-20 10:45:32 UTC ---

The must-gather does not have any ceph-related command output or CRDs; I am guessing it was not gathered properly. From the pod.yaml, the container startup itself failed. Can you please gather the requested info?

--- Additional comment from Logan McNaughton on 2023-10-20 13:20:07 UTC ---

Nokia attempted to gather an ODF must-gather (using this command: oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.13), but it got stuck here:

[must-gather-lb5pv] POD 2023-10-20T11:44:06.743797211Z Defaulted container "noobaa-operator" out of: noobaa-operator, objectstorage-provisioner-sidecar
[must-gather-lb5pv] POD 2023-10-20T11:44:06.750554674Z Collecting MCG database information...

They waited about 30 minutes but it didn't progress, so they just ran the regular must-gather, which is what is attached. The provided must-gather should have all the logs from the pods in the openshift-storage namespace, just not the CRDs, I suppose.

--- Additional comment from Subham Rai on 2023-10-25 06:55:20 UTC ---

I was also trying to look at the logs, and I don't see the files that collect `oc` command output in the openshift-storage namespace.

--- Additional comment from Parth Arora on 2023-10-25 09:44:44 UTC ---

Adding the findings:

The MDS is reporting that it is waiting for osdmap 87 (which blocklists the prior instance).

--- Additional comment from Subham Rai on 2023-10-25 13:10:41 UTC ---

In the operator logs:
```
2023-10-14T21:16:55.950608286Z 2023-10-14 21:16:55.950567 I | exec: exec timeout waiting for process radosgw-admin to return. Sending interrupt signal to the process
2023-10-14T21:16:55.959268610Z 2023-10-14 21:16:55.959238 E | ceph-object-store-user-controller: failed to reconcile CephObjectStoreUser "openshift-storage/noobaa-ceph-objectstore-user". failed to initialized rgw admin ops client api: failed to create or retrieve rgw admin ops user: failed to create object user "rgw-admin-ops-user". error code 1 for object store "ocs-storagecluster-cephobjectstore": failed to create s3 user. . : signal: interrupt
```

Can you share the `ceph status` output?

--- Additional comment from Parth Arora on 2023-10-25 13:34:04 UTC ---

Subham feels like this is the same issue we are fixing with https://github.com/rook/rook/pull/12818 , https://github.com/rook/rook/pull/12817 

Tracked here  https://bugzilla.redhat.com/show_bug.cgi?id=2235611

Jiffin, this is one more instance; surely we need to prioritize it.

--- Additional comment from Prasad Desala on 2023-10-25 13:50:10 UTC ---

Observing similar issue on 4.14.0-156.

After StorageCluster installation via the UI, the rook-ceph-rgw pod is stuck in the CrashLoopBackOff state.

rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6cfb9d6twtvx   1/2     CrashLoopBackOff   5 (45s ago)   7m6s

oc describe of the pod:
=======================
Warning  Unhealthy       12m (x16 over 14m)    kubelet            Startup probe failed: RGW health check failed with error code: 7. the RGW likely cannot be reached by clients
  Warning  BackOff         4m55s (x36 over 11m)  kubelet            Back-off restarting failed container rgw in pod rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6cfb9d6twtvx_openshift-storage(39280b40-f172-463b-b394-6d75fd946347)


rook-operator logs:
===================
2023-10-25 07:36:30.263476 I | ceph-spec: parsing mon endpoints: a=172.30.83.26:3300,b=172.30.118.189:3300,c=172.30.215.220:3300
2023-10-25 07:36:30.263536 I | ceph-object-store-user-controller: CephObjectStore "ocs-storagecluster-cephobjectstore" found
2023-10-25 07:36:30.263626 I | ceph-object-store-user-controller: CephObjectStore "ocs-storagecluster-cephobjectstore" found
2023-10-25 07:36:30.383117 E | ceph-object-store-user-controller: failed to reconcile CephObjectStoreUser "openshift-storage/noobaa-ceph-objectstore-user". failed to initialized rgw admin ops client api: failed to create or retrieve rgw admin ops user: failed to create object user "rgw-admin-ops-user". error code 1 for object store "ocs-storagecluster-cephobjectstore": failed to create s3 user. 2023-10-25T07:36:30.373+0000 7fac0740a780  0 ERROR: could not find zonegroup (ocs-storagecluster-cephobjectstore)
2023-10-25T07:36:30.373+0000 7fac0740a780  0 ERROR: failed to start notify service ((2) No such file or directory
2023-10-25T07:36:30.373+0000 7fac0740a780  0 ERROR: failed to init services (ret=(2) No such file or directory)
couldn't init storage provider. : exit status 5

ocs-mustgather@ http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/Prasad/ocs4/2245004/


Restarting the rook-operator pod made the rook-ceph-rgw pod stable and brought it to the Running state.

--- Additional comment from Subham Rai on 2023-10-25 14:48:42 UTC ---

(In reply to Parth Arora from comment #8)
> Subham feels like this is the same issue we are fixing with
> https://github.com/rook/rook/pull/12818 ,
> https://github.com/rook/rook/pull/12817 
> 
> Tracked here  https://bugzilla.redhat.com/show_bug.cgi?id=2235611
> 
> Jiffin, this is one more instance; surely we need to prioritize it.

Parth, this seems like an issue with the RGW readiness probe to me.

--- Additional comment from Jiffin on 2023-10-26 07:17:28 UTC ---

(In reply to Subham Rai from comment #10)
> (In reply to Parth Arora from comment #8)
> > Subham feels like this is the same issue we are fixing with
> > https://github.com/rook/rook/pull/12818 ,
> > https://github.com/rook/rook/pull/12817 
> > 
> > Tracked here  https://bugzilla.redhat.com/show_bug.cgi?id=2235611
> > 
> > Jiffin, this is one more instance; surely we need to prioritize it.
> 
> Parth, this seems like an issue with the RGW readiness probe to me.

From the logs, radosgw-admin timed out, so it looks like the issue Parth and Prasad are referring to. @Subham, the readiness probe will fail if the rgw pod is not up and running, so if it is crashing, the readiness probe is expected to fail.

If it is the radosgw-admin timeout issue, IMO the Rook operator ignores the timeout and sets the CephObjectStore status to success even though the rgw pod is not running. Can you run `oc describe cephobjectstore` and check the status?
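For example, assuming the default namespace and object store name seen in this report:

oc -n openshift-storage describe cephobjectstore ocs-storagecluster-cephobjectstore
oc -n openshift-storage get cephobjectstore ocs-storagecluster-cephobjectstore -o jsonpath='{.status.phase}'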

--- Additional comment from Sunil Kumar Acharya on 2023-10-26 08:18:40 UTC ---

Moving the non-blocker BZs out of ODF 4.14.0. If you think this is a blocker issue, feel free to propose it as a blocker for ODF 4.14.0 with a justification note.

--- Additional comment from Jiffin on 2023-10-30 10:00:53 UTC ---

Parth is on PTO this week; I will provide an update from the engineering side.

--- Additional comment from Jiffin on 2023-12-13 05:14:43 UTC ---

IMO https://github.com/rook/rook/pull/12817 should resolve this issue, and it got merged in Rook upstream. @paarora, do we need https://github.com/rook/rook/pull/12818 as well? Can we change the status of the bug or mark it as MODIFIED?

--- Additional comment from Parth Arora on 2023-12-13 07:53:03 UTC ---

> IMO https://github.com/rook/rook/pull/12817 should resolve this issue, and it got merged in Rook upstream.

This fix mainly helps us improve the error logging and track this issue better.

> Do we need https://github.com/rook/rook/pull/12818 as well?
Yes.
I found that the function ExecuteCommandWithTimeout (https://github.com/rook/rook/blob/master/pkg/util/exec/exec.go#L99) is not working as expected: the interrupt signal it sends to the process on timeout breaks the operator's intended behavior.
So to fix the operator overall we need to rework this function: https://github.com/rook/rook/pull/12818
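
For context, here is a minimal sketch of the run-a-command-with-a-timeout pattern under discussion. This is not Rook's actual implementation; the command and timeout are placeholders. Go's standard library kills the child when the deadline passes, whereas a helper that signals an interrupt instead is where errors like the `signal: interrupt` quoted earlier come from:

```
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

func main() {
	// Give the child command a hard deadline.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// exec.CommandContext kills the child ("signal: killed") once ctx
	// expires. A helper that instead sends SIGINT on timeout surfaces
	// errors like "signal: interrupt", as seen in the operator logs above.
	cmd := exec.CommandContext(ctx, "sleep", "10")
	if err := cmd.Run(); err != nil {
		fmt.Println("command failed:", err)
	}
}
```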

The root cause might be latency or timing issues introduced with newly added features, which cause this function to trigger signals; the signal handling is wrongly implemented, IMO.

I see this function is also called during RBD pool creation, and the impact is visible in the Rook logs; even the CephBlockPool cannot be created.

The workaround for this case would be to restart the Rook operator, for example as shown below.
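
A hedged example, assuming the operator pod carries its usual app=rook-ceph-operator label (the operator Deployment will recreate the pod):

oc -n openshift-storage delete pod -l app=rook-ceph-operator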

--- Additional comment from Elad on 2023-12-13 12:48:50 UTC ---

Since this also happens after a fresh deployment, per comment #9, I am setting the bug severity to high and proposing it as a blocker for 4.15.0.

--- Additional comment from Santosh Pillai on 2023-12-14 03:53:51 UTC ---

(In reply to Elad from comment #16)
> Since this also happens after a fresh deployment, per comment #9, I am
> setting the bug severity to high and proposing it as a blocker for 4.15.0.

Elad, this happens intermittently, and the workaround is simply to restart the Rook operator. The fix is merged upstream: https://github.com/rook/rook/pull/12817

Comment 10 krishnaram Karthick 2024-01-02 07:45:57 UTC
The fix is not ready for 4.14.4; moving the bug to 4.14.5.

Comment 13 Mudit Agarwal 2024-01-29 07:03:48 UTC

*** This bug has been marked as a duplicate of bug 2254547 ***

