Bug 2032404 - After a node restart, the RGW pod is stuck in a CrashLoopBackOff state
Summary: After a node restart, the RGW pod is stuck in a CrashLoopBackOff state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.10.0
Assignee: Blaine Gardner
QA Contact: aberner
URL:
Whiteboard:
Depends On:
Blocks: 2034359 2034361 2034976 2036949
 
Reported: 2021-12-14 12:47 UTC by Ben Eli
Modified: 2023-08-09 16:37 UTC
CC: 17 users

Fixed In Version: 4.10.0-113
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2034359 2034361 2034976 2036949
Environment:
Last Closed: 2022-04-13 18:50:40 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage rook pull 326 0 None Merged Sync from upstream release-1.8 to downstream release-4.10 2022-01-11 18:43:15 UTC
Red Hat Product Errata RHSA-2022:1372 0 None None None 2022-04-13 18:51:13 UTC

Comment 4 Blaine Gardner 2021-12-14 19:22:03 UTC
One thing I am seeing that is not the cause of the underlying issue but is a concern for observability is that the CephObjectStore resource still reports a `status.bucketStatus.health: Connected` when I see from the Rook operator logs that it has failed to check the health for quite some time.

For @

Comment 5 Blaine Gardner 2021-12-14 19:34:38 UTC
I'm sorry. I don't know how my previous draft comment got submitted. Allow me to continue here.

It's hard for me to determine why the RGW is failing to start. I suspect it will be hard to debug this issue without extra debug output from the RGW as well.

I think the first step of this is to determine whether this is reproducible. @

Comment 6 Blaine Gardner 2021-12-14 19:47:23 UTC
My previous comment was accidentally submitted again, which seems to happen when I try to `@` someone for needsinfo. :\ Apologies again.

---

Anyway, continuing... @belimele can you run the test again to see if it reproduces? I would suggest doing this with an increased RGW log level so that we can get better help from the RHCS/RGW team if the failure occurs again. Additionally, each time a node is going to be restarted, it would be good to list the pods running on that node before the restart, in case it only reproduces when some combination of apps goes down simultaneously (unlikely but could be important).

I'd also like to understand the conditions under which the nodes are restarted. Are they merely reset with a hard reset? Is it a system `reboot` command? Is Ceph set with `noout` or any other kind of maintenance mode? Is Kubernetes set with any maintenance commands like `cordon` or `drain`? Are there any other details about the commands run to prepare for or initiate the restarts that might be important?

If you do get an environment that reproduces the issue, please leave the environment active for us to connect to it. It will make debugging easier, and hopefully we can get this resolved much more quickly.

---

Even if we can't get a repro, can someone from the RGW team look at the RGW logs from the must-gather to see if this is a known issue? @sostapov was tagged on GChat related to this issue and may be able to help facilitate this part. Since this is an `urgent` bug, I want to make sure we are able to move this forward in parallel as much as possible.

Comment 7 Elad 2021-12-14 20:02:36 UTC
Hi Blaine, I believe this bug will be easily reproduced with a standalone RHCS. Can we first test with standalone RHCS?

Comment 8 Elad 2021-12-14 20:03:52 UTC
What I mean is: can the Ceph team test this on a standalone RHCS cluster?

Comment 9 Jiffin 2021-12-15 07:21:37 UTC
Adding to Blaine's request: while reproducing, please also set the following from the toolbox pod: "ceph config set client.admin debug_rgw 20". This should capture the output of the radosgw-admin commands executed from the operator pod. Currently all of those commands fail with a timeout error; with the above setting we may get a trace of each command.

Comment 10 Elad 2021-12-15 09:23:34 UTC
Hi Jiffin, as I mentioned, this BZ is seen only with the OCS build that consumes RHCS4.2z4 and there were no other changes in the OCS z-stream build's components that could have caused this. Therefore, I am quite certain that the bug will be seen with an RHCS cluster, where configuring RGW in debug mode would be much more convenient and won't require hacking through ceph toolbox or changing configmaps.

Comment 11 Matt Benjamin (redhat) 2021-12-15 12:16:26 UTC
(In reply to Elad from comment #10)
> Hi Jiffin, as I mentioned, this BZ is seen only with the OCS build that
> consumes RHCS4.2z4 and there were no other changes in the OCS z-stream
> build's components that could have caused this. Therefore, I am quite
> certain that the bug will be seen with an RHCS cluster, where configuring
> RGW in debug mode would be much more convenient and won't require hacking
> through ceph toolbox or changing configmaps.

Hi Elad,

While the issue may be "in" RHCS, rhcs-4.2z4 passed all quality testing and has been running successfully in many production configurations since release. So we are asking for debug output from the RGW while the issue is reproducing in the OCS environment, because we haven't seen it elsewhere.

thanks,

Matt

Comment 23 Travis Nielsen 2021-12-16 19:13:37 UTC
In addition to adjusting the liveness probe and picking up the readiness probes, we need to look at the startup probes, which are not yet exposed in Rook. This will allow setting a longer startup timeout while keeping shorter, more responsive timeouts for liveness and readiness.

Comment 24 Mudit Agarwal 2021-12-17 06:08:26 UTC
Hi Blaine,

Thanks for the detailed explanation. After reading your update I am more inclined to think of this as an upper-layer bug (Rook/Kubernetes). The only thing that still bothers me is why it is only seen with 4.2z4 and not with earlier Ceph versions, given that the Rook code is the same. We need to apply the fix in ODF 4.8 for sure, and Elad/QE can confirm whether they see the same issues in 4.7/4.6 or not.

Comment 25 Blaine Gardner 2021-12-17 18:38:37 UTC
Travis and I discussed today how to approach the fix for this. Our current plan is the following:

Create a targeted fix specifically for OCS 4.8 and 4.9 that adds a hardcoded startup probe to the RGW pod. Because the RGW took between 60 and 90 seconds to become ready, the probe will allow 180 seconds (3 minutes) for the RGW to start. Currently we don't plan for these to be present in upstream Rook v1.6 or v1.7 (the bases for OCS 4.8 and 4.9), since we need to deal with supported Kubernetes version considerations and more complex backports.

In upstream Rook v1.8, we will add user-configurable startup probes to all resources, following the current pattern we have for liveness and readiness probes. This will be part of ODF 4.10. The default startup probe will be the same as what we hard-code for 4.8 and 4.9, but it can be overridden or disabled if desired.

Comment 26 Blaine Gardner 2021-12-17 18:43:24 UTC
Addendum: 
All the above being said, and despite the work going into Rook/OCS to tolerate longer startup times, I think it is worth investigating on the RGW side why startup time after rescheduling increased 2-3x between 4.2z3 and 4.2z4. This could mean increased failover times, and I don't know how that might affect our customer agreements for object storage failover. Even if this is expected and within our startup-time goals, getting clarity on that would still be good.

Comment 27 Blaine Gardner 2021-12-17 18:45:17 UTC
@muagarwa given that this fix will be for 4.8, 4.9, and 4.10, should we create cloned BZs for those releases?

Comment 28 Mudit Agarwal 2021-12-20 13:53:06 UTC
Yes, for 4.8 and 4.9

Comment 29 Blaine Gardner 2022-01-11 18:43:15 UTC
Merged into 4.10 codebase here: https://github.com/red-hat-storage/rook/pull/326

Comment 32 aberner 2022-01-26 09:47:12 UTC
Verified over:
ocp 4.10.0-0.nightly-2022-01-25-023600
odf 4.10.0-118

Comment 37 errata-xmlrpc 2022-04-13 18:50:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.10.0 enhancement, security & bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1372

