One thing I am seeing that is not the cause of the underlying issue, but is a concern for observability, is that the CephObjectStore resource still reports `status.bucketStatus.health: Connected` even though I can see from the Rook operator logs that it has failed to check the health for quite some time. For @
I'm sorry. I don't know how my previous draft comment got submitted. Allow me to continue here. It's hard for me to determine why the RGW is failing to start. I suspect it will be hard to debug this issue without extra debug output from the RGW as well. I think the first step of this is to determine whether this is reproducible. @
My previous comment was accidentally submitted again, which seems to happen when I try to `@` someone for needsinfo. :\ Apologies again.

---

Anyway, continuing... @belimele can you run the test again to see if it reproduces? I would suggest doing this with an increased RGW log level so that we can get better help from the RHCS/RGW team if the failure occurs again.

Additionally, each time a node is going to be restarted, it would be good to list the pods that are running on the node before the restart (see the sketch below), in case the issue only reproduces when some combination of apps goes down simultaneously (unlikely, but it could be important).

I'd also like to understand the conditions under which the nodes are restarted. Are they merely reset with a hard reset? Is it a system `reboot` command? Is Ceph set with `noout` or any other kind of maintenance mode? Is Kubernetes set with any maintenance commands like `cordon` or `drain`? Are there any other details about the commands run to prepare for or initiate the restarts that might be important?

If you do get an environment that reproduces the issue, please leave the environment active for us to connect to it. It will make debugging easier, and hopefully we can get this resolved much more quickly.

---

Even if we can't get a repro, can someone from the RGW team look at the RGW logs from the must-gather to see if this is a known issue? @sostapov was tagged on GChat about this issue and may be able to help facilitate this part. Since this is an `urgent` bug, I want to make sure we are able to move this forward in parallel as much as possible.
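For the pod listing mentioned above, something like the following should be enough to capture what was running on each node right before its restart (`<node-name>` is a placeholder for the node about to be restarted):

```
# <node-name> is a placeholder; run once per node immediately before restarting it.
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>
```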
Hi Blaine, I believe this bug will be easily reproduced with a standalone RHCS. Can we first test with standalone RHCS?
What I mean is: can the Ceph team test this on a standalone RHCS cluster?
Adding to Blaine's request: while reproducing, please also run the following from the toolbox pod: `ceph config set client.admin debug_rgw 20`. This should capture the output of the radosgw-admin commands executed from the operator pod. Currently all of those commands hit a timeout error; with the above setting we should at least get a trace of each command.
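For example, from outside the toolbox this could be run as below (the namespace and toolbox deployment name are the usual Rook/OCS defaults and may differ in this cluster):

```
# Assumes the standard Rook toolbox deployment; adjust namespace/name as needed.
kubectl -n openshift-storage exec deploy/rook-ceph-tools -- \
  ceph config set client.admin debug_rgw 20
```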
Hi Jiffin, as I mentioned, this BZ is seen only with the OCS build that consumes RHCS 4.2z4, and there were no other changes in the OCS z-stream build's components that could have caused this. Therefore, I am quite certain that the bug will also be seen on an RHCS cluster, where configuring RGW in debug mode would be much more convenient and wouldn't require hacking through the ceph toolbox or changing configmaps.
(In reply to Elad from comment #10)
> Hi Jiffin, as I mentioned, this BZ is seen only with the OCS build that
> consumes RHCS4.2z4 and there were no other changes in the OCS z-stream
> build's components that could have caused this. Therefore, I am quite
> certain that the bug will be seen with an RHCS cluster, where configuring
> RGW in debug mode would be much more convenient and won't require hacking
> through ceph toolbox or changing configmaps.

Hi Elad,

While the issue may be "in" RHCS, rhcs-4.2z4 passed all quality testing and has been running successfully in many production configurations since release. So we are asking for debug output from the RGW while it reproduces in the OCS environment, because we haven't seen the issue elsewhere.

thanks,
Matt
In addition to adjusting the liveness probe and picking up the readiness probes, we need to look at startup probes, which are not yet exposed in Rook. A startup probe will allow a longer startup timeout while keeping a shorter, more robust timeout for liveness and readiness.
Hi Blaine, thanks for the detailed explanation. After reading your update I am more inclined to think of this as an upper-layer bug (Rook/Kubernetes). The only thing that still bothers me is why it is seen only with 4.2z4 and not with earlier Ceph versions, given that the Rook code is the same. We need to apply the fix in ODF 4.8 for sure, and Elad/QE can confirm whether or not they are seeing the same issues in 4.7/4.6.
Travis and I discussed today how to approach the fix for this. Our current plan is the following:

Create a targeted fix specifically for OCS 4.8 and 4.9 that adds a hardcoded startup probe to the RGW pod. Because the RGW took between 60 and 90 seconds to become ready, the probe will allow 180 seconds (3 mins) for the RGW to start. Currently we don't plan for these to be present in upstream Rook v1.6 or v1.7 (the bases for OCS 4.8 and 4.9), since we need to deal with supported k8s version considerations and more complex backports.

In upstream Rook v1.8, we will add user-configurable startup probes to all resources, following the current pattern we have for liveness and readiness probes. This will be part of ODF 4.10. The default startup probe will be the same as what we hardcode for 4.8 and 4.9, but it can be overridden or disabled if desired.
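As a rough sketch, the hardcoded probe would amount to something like the following on the RGW container spec (the handler, port, and exact timings here are assumptions for illustration, not the actual patch):

```yaml
startupProbe:
  tcpSocket:
    port: 8080          # assumed RGW port; the real probe may use an HTTP check instead
  periodSeconds: 10
  failureThreshold: 18  # 18 checks x 10s = 180s allowed for the RGW to start
livenessProbe:          # once startup succeeds, liveness keeps its usual tight window
  tcpSocket:
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
```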
Addendum: All the above being said, and with the work going into Rook/OCS to tolerate longer startup times, I think it is worth investigating in the RGW why its startup time after being rescheduled increased 2-3x between 4.2z3 and 4.2z4. This could mean increased failover times, and I don't know how that might affect our customer agreements for object storage failover. Even if this is expected and within our goals for startup time, I think it is still worth getting clarity on.
@muagarwa given that this fix will be for 4.8, 4.9, and 4.10, should we create cloned BZs for those releases?
Yes, for 4.8 and 4.9
Merged into 4.10 codebase here: https://github.com/red-hat-storage/rook/pull/326
Verified over:
ocp 4.10.0-0.nightly-2022-01-25-023600
odf 4.10.0-118
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.10.0 enhancement, security & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:1372