Bug 2254547 - [4.14 backport] [GSS] During installation of ODF 4.14 'rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a pod' stuck in CrashLoopBackOff state
Summary: [4.14 backport] [GSS] During installation of ODF 4.14 'rook-ceph-rgw-ocs-stor...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.14
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.14.5
Assignee: Parth Arora
QA Contact: Uday kurundwade
URL:
Whiteboard:
Duplicates: 2254475
Depends On: 2253185
Blocks:
TreeView+ depends on / blocked
 
Reported: 2023-12-14 13:11 UTC by Bipin Kunal
Modified: 2024-02-29 09:13 UTC
CC List: 16 users

Fixed In Version: 4.14.5-4
Doc Type: Rebase: Bug Fixes Only
Doc Text:
Previously, the `rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a` pod was stuck in the `CrashLoopBackOff` state because the RADOS Gateway (RGW) multisite zonegroup was not getting created and fetched, and the error handling reported the wrong text. With this release, the error handling bug in the multisite configuration is fixed, and fetching the zonegroup is improved by fetching it for the particular rgw-realm that was created earlier. As a result, the multisite configuration and the `rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a` pod are created successfully.
Clone Of: 2253185
Environment:
Last Closed: 2024-02-29 09:13:12 UTC
Embargoed:




Links
Github red-hat-storage/rook pull 559 (open): Bug 2254547: object: improve the error handling for multisite objs (last updated 2024-01-30 09:59:23 UTC)
Github rook/rook pull 12817 (Merged): object: improve the error handling for multisite objs (last updated 2024-01-30 09:09:12 UTC)
Red Hat Product Errata RHBA-2024:1043 (last updated 2024-02-29 09:13:16 UTC)

Description Bipin Kunal 2023-12-14 13:11:19 UTC
+++ This bug was initially created as a clone of Bug #2253185 +++

* Description of problem (please be as detailed as possible and provide log
snippets):

ODF 4.14 installation does not complete because the 'rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a' pod is stuck in a CrashLoopBackOff state.


* Version of all relevant components (if applicable):
ODF 4.14

--- Additional comment from Karun Josy on 2023-12-06 18:25:43 IST ---

RGW pod log:
--------------
debug 2023-11-27T11:32:11.918+0000 3ff9385f760  0 rgw main: ERROR: could not find zonegroup (ocs-storagecluster-cephobjectstore)
debug 2023-11-27T11:32:11.918+0000 3ff9385f760  0 rgw main: ERROR: failed to start notify service ((2) No such file or directory
debug 2023-11-27T11:32:11.918+0000 3ff9385f760  0 rgw main: ERROR: failed to init services (ret=(2) No such file or directory)
debug 2023-11-27T11:32:11.928+0000 3ff9385f760 -1 Couldn't init storage provider (RADOS)
--------------

rook-ceph-operator log:
---------------
2023-11-23T17:32:49.121973373Z 2023-11-23 17:32:49.121915 E | ceph-object-store-user-controller: failed to reconcile CephObjectStoreUser "openshift-storage/ocs-storagecluster-cephobjectstoreuser". initialized rgw admin ops client api: failed to create or retrieve rgw admin ops user: failed to create object user "rgw-admin-ops-user". error code 1 for object store "ocs-storagecluster-cephobjectstore": failed to create s3 user. 2023-11-23T17:32:49.109+0000 3ffaa95f760  0 ERROR: could not find zonegroup (ocs-storagecluster-cephobjectstore)
2023-11-23T17:32:49.121973373Z 2023-11-23T17:32:49.109+0000 3ffaa95f760  0 ERROR: failed to start notify service ((2) No such file or directory
2023-11-23T17:32:49.121973373Z 2023-11-23T17:32:49.109+0000 3ffaa95f760  0 ERROR: failed to init services (ret=(2) No such file or directory)
2023-11-23T17:32:49.121973373Z couldn't init storage provider. : exit status 5         
---------------                       

---------------
sh-5.1$ radosgw-admin zonegroup list
failed to list zonegroups: (2) No such file or directory
sh-5.1$
---------------
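
As a next diagnostic step, it can help to check which multisite objects Rook managed to create at all. A hedged sketch, run from the rook-ceph-operator pod or the toolbox, assuming the realm/zonegroup/zone share the object store's name as the errors above suggest:
---------------
# list whichever multisite objects exist
radosgw-admin realm list
radosgw-admin zonegroup list
radosgw-admin zone list
radosgw-admin period get

# fetch the zonegroup scoped to the realm (names assumed from the store name)
radosgw-admin zonegroup get \
  --rgw-realm=ocs-storagecluster-cephobjectstore \
  --rgw-zonegroup=ocs-storagecluster-cephobjectstore
---------------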


It looks like a race condition where the zonegroup creation was initiated before the OSDs were completely deployed.

Restarting the rook-ceph-operator helps to correct this.
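
A minimal sketch of that workaround, assuming the default openshift-storage namespace and the standard operator deployment name:
---------------
# restart the operator and watch the RGW pod come up
oc -n openshift-storage rollout restart deployment/rook-ceph-operator
oc -n openshift-storage get pods -l app=rook-ceph-rgw -w
---------------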

The architecture of this system is IBM s390x.


ODF must-gather available in supportshell
/cases/03674711/0020-03674711-odf-must-gather.tgz/

--- Additional comment from RHEL Program Management on 2023-12-06 18:25:57 IST ---

This bug, having had no release flag set previously, now has the release flag 'odf-4.15.0' set to '?', and so is proposed to be fixed in the ODF 4.15.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any were previously set while the release flag was missing, have now been reset, since the Acks must be set against a release flag.

--- Additional comment from Jiffin on 2023-12-07 12:13:32 IST ---

The operator will reconcile the CephObjectStore CRD until it succeeds. AFAIR it checks that pool creation has completed and only then goes on to create the zonegroup/zone/realm etc. I was wondering how we could end up in this scenario. Were the OSDs offline for some time after deployment, or something like that? I can see the following logs consistently throughout the rook-operator log:
ceph-object-controller: reconciling object store deployments
2023-11-23T16:58:30.252849875Z 2023-11-23 16:58:30.252833 I | clusterdisruption-controller: osd "rook-ceph-osd-0" is down but no node drain is detected
2023-11-23T16:58:30.252967721Z 2023-11-23 16:58:30.252958 I | clusterdisruption-controller: osd "rook-ceph-osd-2" is down but no node drain is detected
2023-11-23T16:58:30.253073880Z 2023-11-23 16:58:30.253065 I | clusterdisruption-controller: osd "rook-ceph-osd-1" is down but no node drain is detected
2023-11-23T16:58:30.277085154Z 2023-11-23 16:58:30.277061 I | ceph-object-controller: ceph object store gateway service running at 172.70.152.71
2023-11-23T16:58:30.277120427Z 2023-11-23 16:58:30.277112 I | ceph-object-controller: reconciling object store pools
2023-11-23T16:58:30.693110546Z 2023-11-23 16:58:30.693051 I | op-config: successfully set "mds.ocs-storagecluster-cephfilesystem-a"="mds_cache_memory_limit"="4294967296" option to the mon configuration database
2023-11-23T16:58:30.693185494Z 2023-11-23 16:58:30.693178 I | op-config: setting "mds.ocs-storagecluster-cephfilesystem-a"="mds_join_fs"="ocs-storagecluster-cephfilesystem" option to the mon configuration database
2023-11-23T16:58:31.014758110Z 2023-11-23 16:58:31.014697 I | clusterdisruption-controller: osd is down in failure domain "c02ls001-nodes-c2lwedi02-ocp-caas-banksvcs-net" and pgs are not active+clean. pg health: "cluster is not fully clean. PGs: [{StateName:unknown Count:1}]"
2023-11-23T16:58:31.014850600Z 2023-11-23 16:58:31.014843 I | op-k8sutil: format and nodeName longer than 63 chars, nodeName c02ls002-nodes-c2lwedi02-ocp-caas-banksvcs-net will be 896faaa75c0e5b10a19667a1e2d1a47b
2023-11-23T16:58:31.015943318Z 2023-11-23 16:58:31.015934 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-896faaa75c0e5b10a19667a1e2d1a47b" with maxUnavailable=0 for "host" failure domain "c02ls002-nodes-c2lwedi02-ocp-caas-banksvcs-net"
2023-11-23T16:58:31.053138539Z 2023-11-23 16:58:31.053108 I | op-k8sutil: format and nodeName longer than 63 chars, nodeName c02ls003-nodes-c2lwedi02-ocp-caas-banksvcs-net will be 47cdf625c9285d395e63df72dab22434
2023-11-23T16:58:31.054391472Z 2023-11-23 16:58:31.054378 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-47cdf625c9285d395e63df72dab22434" with maxUnavailable=0 for "host" failure domain "c02ls003-nodes-c2lwedi02-ocp-caas-banksvcs-net"
2023-11-23T16:58:31.075187305Z 2023-11-23 16:58:31.075170 I | clusterdisruption-controller: deleting the default pdb "rook-ceph-osd" with maxUnavailable=1 for all osd

Is this impacting RADOS?

All the pool creations and the multisite configuration also succeed between these logs, but I can see `radosgw-admin` hanging in the logs.
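
One hedged way to confirm that the pool creation really did complete while the multisite objects are missing (pool names assumed to follow Rook's usual prefixing of the CephObjectStore name):
---------------
# RGW pools created by Rook for the object store
ceph osd pool ls | grep rgw
# expected (assumed naming): .rgw.root plus ocs-storagecluster-cephobjectstore.rgw.* pools
---------------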

--- Additional comment from Karun Josy on 2023-12-07 18:16:39 IST ---

Hi Jiffin,

Thanks for looking into this and good catch! 
I went through the rook-ceph-operator logs and OSD logs as pointed out by you and I have some observations to share:
TL;DR: the OSDs are underperforming with huge latency, which I believe is resulting in this race condition.


rook-ceph-operator logs
=======================

+ OSD creation process starts here:
------------
    2023-11-23T16:58:04.929320993Z 2023-11-23 16:58:04.929300 I | op-osd: start running osds in namespace "openshift-storage"
    2023-11-23T16:58:04.948214822Z 2023-11-23 16:58:04.948194 I | op-osd: start provisioning the OSDs on PVCs, if needed
    2023-11-23T16:58:04.958696078Z 2023-11-23 16:58:04.958428 I | op-osd: creating 3 new PVCs for device set "ocs-deviceset-odf-cluster-storage-0"
    2023-11-23T16:58:04.958696078Z 2023-11-23 16:58:04.958479 I | op-osd: image "registry.redhat.io/rhceph/rhceph-6-rhel9@sha256:3ee8169c13d824d96c0494d5e58d6376f3fa8b947d81cf3e98f722e5d33028e5" contains invalid character, skipping adding label
------------

+ It gets completed
------------
2023-11-23T16:58:14.621264832Z 2023-11-23 16:58:14.621161 I | op-osd: OSD orchestration status for PVC ocs-deviceset-odf-cluster-storage-0-data-22xnh8 is "completed"
2023-11-23T16:58:14.621264832Z 2023-11-23 16:58:14.621201 I | op-osd: creating OSD 1 on PVC "ocs-deviceset-odf-cluster-storage-0-data-22xnh8"

2023-11-23T16:58:14.727423369Z 2023-11-23 16:58:14.727408 I | op-osd: OSD orchestration status for PVC ocs-deviceset-odf-cluster-storage-0-data-0z4czw is "completed"
2023-11-23T16:58:14.727473348Z 2023-11-23 16:58:14.727468 I | op-osd: creating OSD 0 on PVC "ocs-deviceset-odf-cluster-storage-0-data-0z4czw"

2023-11-23T16:58:14.928479666Z 2023-11-23 16:58:14.928085 I | op-osd: OSD orchestration status for PVC ocs-deviceset-odf-cluster-storage-0-data-1s8l77 is "completed"
2023-11-23T16:58:14.928479666Z 2023-11-23 16:58:14.928104 I | op-osd: creating OSD 2 on PVC "ocs-deviceset-odf-cluster-storage-0-data-1s8l77"
------------


+ But almost immediately (or even a fraction of a second before completion) we can see the OSDs getting marked down
------------
2023-11-23T16:58:14.730880633Z 2023-11-23 16:58:14.729541 I | clusterdisruption-controller: osd "rook-ceph-osd-1" is down but no node drain is detected
2023-11-23T16:58:15.111736332Z 2023-11-23 16:58:15.111720 I | clusterdisruption-controller: osd "rook-ceph-osd-1" is down but no node drain is detected
2023-11-23T16:58:15.111822382Z 2023-11-23 16:58:15.111812 I | clusterdisruption-controller: osd "rook-ceph-osd-0" is down but no node drain is detected
2023-11-23T16:58:15.532577836Z 2023-11-23 16:58:15.532554 I | clusterdisruption-controller: osd "rook-ceph-osd-1" is down but no node drain is detected
2023-11-23T16:58:15.532656461Z 2023-11-23 16:58:15.532647 I | clusterdisruption-controller: osd "rook-ceph-osd-2" is down but no node drain is detected
2023-11-23T16:58:15.532746314Z 2023-11-23 16:58:15.532737 I | clusterdisruption-controller: osd "rook-ceph-osd-0" is down but no node drain is detected
------------

+  ######  ---> This message is confusing and contradictory
------------
2023-11-23T16:58:20.335901990Z 2023-11-23 16:58:20.335846 I | clusterdisruption-controller: osd is down in the failure domain "c02ls001-nodes-c2lwedi02-ocp-caas-banksvcs-net", but pgs are active+clean. Requeuing in case pg status is not updated yet...
------------


+ Since the PG status is reported as active+clean, the operator moves ahead with the RGW reconciliation as observed here, but it fails because the OSDs are unresponsive (a quick check of the real OSD/PG state is sketched after this log)
------------
2023-11-23T16:58:26.016042456Z 2023-11-23 16:58:26.016035 I | ceph-object-store-user-controller: CephObjectStore "ocs-storagecluster-cephobjectstore" found
2023-11-23T16:58:26.024111982Z 2023-11-23 16:58:26.024094 E | ceph-object-store-user-controller: failed to reconcile CephObjectStoreUser "openshift-storage/ocs-storagecluster-cephobjectstoreuser". failed to detect if object store "ocs-storagecluster-cephobjectstore" is initialized: no rgw pod found
2023-11-23T16:58:26.028484333Z 2023-11-23 16:58:26.028466 I | ceph-spec: parsing mon endpoints: b=172.70.69.160:3300,c=172.70.33.209:3300,a=172.70.4.250:3300
2023-11-23T16:58:26.028543270Z 2023-11-23 16:58:26.028532 I | ceph-object-store-user-controller: CephObjectStore "ocs-storagecluster-cephobjectstore" found
2023-11-23T16:58:26.033825751Z 2023-11-23 16:58:26.033811 E | ceph-object-store-user-controller: failed to reconcile CephObjectStoreUser "openshift-storage/prometheus-user". failed to detect if object store "ocs-storagecluster-cephobjectstore" is initialized: no rgw pod found

2023-11-23T16:59:21.090499967Z 2023-11-23 16:59:21.090419 I | exec: exec timeout waiting for process radosgw-admin to return. Sending interrupt signal to the process
2023-11-23T16:59:21.102295110Z 2023-11-23 16:59:21.102279 E | ceph-object-store-user-controller: failed to reconcile CephObjectStoreUser "openshift-storage/ocs-storagecluster-cephobjectstoreuser". failed to initialized rgw admin ops client api: failed to create or retrieve rgw admin ops user: failed to create object user "rgw-admin-ops-user". error code 1 for object store "ocs-storagecluster-cephobjectstore": failed to create s3 user. . : signal: interrupt
------------
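
A quick, hedged way to see the real OSD/PG state at this point (from the toolbox or operator pod), rather than relying on the controller's requeue message:
------------
ceph -s
ceph osd stat     # expect "3 osds: 3 up, 3 in" on a healthy 3-OSD cluster
ceph pg stat      # all PGs should be active+clean
ceph osd tree
------------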



OSD.0 pod log
==============

Now if we check the osd pod log during the creation;

+ OSD booted at :
------------
2023-11-23T16:58:26.163129866Z debug 2023-11-23T16:58:26.156+0000 3ffb7ce8800  1 osd.0 0 start_boot
------------


+ After 30 seconds, the logs are getting spammed with latency errors
------------
2023-11-23T16:58:56.509439836Z debug 2023-11-23T16:58:56.496+0000 3ffa9dff800  0 bluestore(/var/lib/ceph/osd/ceph-0) _txc_state_proc slow aio_wait, txc = 0x2aa1b605800, latency = 30.338800430s
2023-11-23T16:58:56.510203100Z debug 2023-11-23T16:58:56.496+0000 3ffb7ce8800  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency slow operation observed for submit_transact, latency = 27.602117538s
2023-11-23T16:58:56.510203100Z debug 2023-11-23T16:58:56.496+0000 3ffb7ce8800  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency slow operation observed for throttle_transact, latency = 27.598726273s
2023-11-23T16:58:56.582973187Z debug 2023-11-23T16:58:56.576+0000 3ffa9dff800  0 bluestore(/var/lib/ceph/osd/ceph-0) _txc_state_proc slow aio_wait, txc = 0x2aa1de72900, latency = 30.410196304s
2023-11-23T16:58:56.739062901Z debug 2023-11-23T16:58:56.726+0000 3ffa9dff800  0 bluestore(/var/lib/ceph/osd/ceph-0) _txc_state_proc slow aio_wait, txc = 0x2aa1de72c00, latency = 30.563947678s
2023-11-23T16:58:56.874223835Z debug 2023-11-23T16:58:56.866+0000 3ffa9dff800  0 bluestore(/var/lib/ceph/osd/ceph-0) _txc_state_proc slow aio_wait, txc = 0x2aa1de72f00, latency = 30.697076797s
------------

+ This clears up after the OSD restarts and goes active again:
------------
2023-11-23T17:01:37.572786130Z debug 2023-11-23T17:01:37.557+0000 3ffa1c76800  1 osd.0 43 state: booting -> active
------------

+ only to return several minutes later:
------------
2023-11-23T17:14:36.016357615Z debug 2023-11-23T17:14:36.007+0000 3ffa9dff800  0 bluestore(/var/lib/ceph/osd/ceph-0) _txc_state_proc slow aio_wait, txc = 0x2aa2ef20000, latency = 14.868250847s
2023-11-23T17:14:36.017967711Z debug 2023-11-23T17:14:36.007+0000 3ffa2477800  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency slow operation observed for kv_commit, latency = 14.869242668s
2023-11-23T17:14:36.017967711Z debug 2023-11-23T17:14:36.007+0000 3ffa2477800  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency slow operation observed for kv_sync, latency = 14.869242668s
2023-11-23T17:14:36.017967711Z debug 2023-11-23T17:14:36.007+0000 3ffa2c78800  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency_fn slow operation observed for _txc_committed_kv, latency = 14.870983124s, txc = 0x2aa2f3b6000
2023-11-23T17:14:36.018576531Z debug 2023-11-23T17:14:36.007+0000 3ffa2c78800  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency_fn slow operation observed for _txc_committed_kv, latency = 14.870734215s, txc = 0x2aa325af200
2023-11-23T17:14:36.018598093Z debug 2023-11-23T17:14:36.007+0000 3ffa2c78800  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency_fn slow operation observed for _txc_committed_kv, latency = 14.885713577s, txc = 0x2aa2efa6900
2023-11-23T17:14:36.018644710Z debug 2023-11-23T17:14:36.007+0000 3ffa2c78800  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency_fn slow operation observed for _txc_committed_kv, latency = 14.548559189s, txc = 0x2aa1e32e300
2023-11-23T17:14:36.018715614Z debug 2023-11-23T17:14:36.007+0000 3ffa2c78800  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency_fn slow operation observed for _txc_committed_kv, latency = 14.170990944s, txc = 0x2aa2eee4c00
2023-11-23T17:14:36.018715614Z debug 2023-11-23T17:14:36.007+0000 3ffa2c78800  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency_fn slow operation observed for _txc_committed_kv, latency = 14.874240875s, txc = 0x2aa2ef20000
------------



So what I believe is happening is that once the OSD creation succeeds, the rook-operator moves ahead to reconcile the CephObjectStore CRD.
However, the OSDs become unresponsive almost immediately, which also generates a false-positive PG status:
-----
osd is down in the failure domain "c02ls001-nodes-c2lwedi02-ocp-caas-banksvcs-net", but pgs are active+clean. Requeuing in case pg status is not updated yet...
-----

The RGW pool creation succeeds, but the rest of the object store setup is only partially completed.


I will ask the customer regarding the hardware that they are using and see if it is reproducible in a more stable environment.
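
In the meantime, a quick, hedged way to gauge OSD-side latency from the Ceph side (osd.0 is just an example; `ceph tell osd.N bench` writes test data, so use it judiciously on a loaded cluster):
-----
ceph osd perf          # per-OSD commit/apply latency
ceph tell osd.0 bench  # synthetic write benchmark on one OSD
-----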

Regards,
Karun

--- Additional comment from Karun Josy on 2023-12-12 19:57:07 IST ---

Hello,

The customer reports that the hardware and disk setup is exactly the same as when they were using ODF 4.12, and they did not experience any latency issues then.
The hardware is IBM Z series, with the disks located on IBM SAN disk arrays.

Is there any other way to troubleshoot this further?

*Note*
I just picked up another case from Nokia, which has the same symptoms. I have requested further logs to verify this:
03686923

--- Additional comment from Karun Josy on 2023-12-12 22:37:28 IST ---

Looks like this bug was already reported here : https://bugzilla.redhat.com/show_bug.cgi?id=2245004#c9

--- Additional comment from  on 2023-12-13 04:29:52 IST ---

Hello,

Can we get an update on this? Is anything needed from the support side? This BZ/case hit Anat's inbox, and the customer and account team have raised escalations on it.

--- Additional comment from Santosh Pillai on 2023-12-13 09:37:18 IST ---

@karun A few questions:

- Does restarting the rook operator fix the issue completely, and is the cluster usable after that? (Trying to understand whether there is a race condition like you mentioned above.)
- In the logs that you have shared, the Ceph status seems to be `HEALTH_OK` and no OSDs are down. Was the must-gather taken after the operator was restarted and the StorageCluster was successful?

--- Additional comment from Santosh Pillai on 2023-12-13 09:56:13 IST ---

(In reply to Karun Josy from comment #4)
> Hi Jiffin,
> 
> Thanks for looking into this and good catch! 
> I went through the rook-ceph-operator logs and OSD logs as pointed out by
> you and I have some observations to share:
> TLDR, the OSDs are underperforming with huge latency which I believe is
> resulting in this race condition.
> 
> 
> rook-ceph-operator logs
> =======================
> 
> + OSD creation process starts here:
> ------------
>     2023-11-23T16:58:04.929320993Z 2023-11-23 16:58:04.929300 I | op-osd:
> start running osds in namespace "openshift-storage"
>     2023-11-23T16:58:04.948214822Z 2023-11-23 16:58:04.948194 I | op-osd:
> start provisioning the OSDs on PVCs, if needed
>     2023-11-23T16:58:04.958696078Z 2023-11-23 16:58:04.958428 I | op-osd:
> creating 3 new PVCs for device set "ocs-deviceset-odf-cluster-storage-0"
>     2023-11-23T16:58:04.958696078Z 2023-11-23 16:58:04.958479 I | op-osd:
> image
> "registry.redhat.io/rhceph/rhceph-6-rhel9@sha256:
> 3ee8169c13d824d96c0494d5e58d6376f3fa8b947d81cf3e98f722e5d33028e5" contains
> invalid character, skipping adding label
> ------------
> 
> + It gets completed
> ------------
> 2023-11-23T16:58:14.621264832Z 2023-11-23 16:58:14.621161 I | op-osd: OSD
> orchestration status for PVC ocs-deviceset-odf-cluster-storage-0-data-22xnh8
> is "completed"
> 2023-11-23T16:58:14.621264832Z 2023-11-23 16:58:14.621201 I | op-osd:
> creating OSD 1 on PVC "ocs-deviceset-odf-cluster-storage-0-data-22xnh8"
> 
> 2023-11-23T16:58:14.727423369Z 2023-11-23 16:58:14.727408 I | op-osd: OSD
> orchestration status for PVC ocs-deviceset-odf-cluster-storage-0-data-0z4czw
> is "completed"
> 2023-11-23T16:58:14.727473348Z 2023-11-23 16:58:14.727468 I | op-osd:
> creating OSD 0 on PVC "ocs-deviceset-odf-cluster-storage-0-data-0z4czw"
> 
> 2023-11-23T16:58:14.928479666Z 2023-11-23 16:58:14.928085 I | op-osd: OSD
> orchestration status for PVC ocs-deviceset-odf-cluster-storage-0-data-1s8l77
> is "completed"
> 2023-11-23T16:58:14.928479666Z 2023-11-23 16:58:14.928104 I | op-osd:
> creating OSD 2 on PVC "ocs-deviceset-odf-cluster-storage-0-data-1s8l77"
> ------------
> 
> 
> + But almost immediately or even a few microseconds before the completion we
> can see the OSDs are getting marked down
> ------------
> 2023-11-23T16:58:14.730880633Z 2023-11-23 16:58:14.729541 I |
> clusterdisruption-controller: osd "rook-ceph-osd-1" is down but no node
> drain is detected
> 2023-11-23T16:58:15.111736332Z 2023-11-23 16:58:15.111720 I |
> clusterdisruption-controller: osd "rook-ceph-osd-1" is down but no node
> drain is detected
> 2023-11-23T16:58:15.111822382Z 2023-11-23 16:58:15.111812 I |
> clusterdisruption-controller: osd "rook-ceph-osd-0" is down but no node
> drain is detected
> 2023-11-23T16:58:15.532577836Z 2023-11-23 16:58:15.532554 I |
> clusterdisruption-controller: osd "rook-ceph-osd-1" is down but no node
> drain is detected
> 2023-11-23T16:58:15.532656461Z 2023-11-23 16:58:15.532647 I |
> clusterdisruption-controller: osd "rook-ceph-osd-2" is down but no node
> drain is detected
> 2023-11-23T16:58:15.532746314Z 2023-11-23 16:58:15.532737 I |
> clusterdisruption-controller: osd "rook-ceph-osd-0" is down but no node
> drain is detected
> ------------
> 
> +  ######  ---> This message is confusing and contradicting
> ------------
> 2023-11-23T16:58:20.335901990Z 2023-11-23 16:58:20.335846 I |
> clusterdisruption-controller: osd is down in the failure domain
> "c02ls001-nodes-c2lwedi02-ocp-caas-banksvcs-net", but pgs are active+clean.
> Requeuing in case pg status is not updated yet...
> ------------

When OSDs are down, Rook checks the PG status and prints this log. It is expected that there can be a slight delay before Ceph reports an unhealthy PG status, so Rook requeues the controller to check the PG health again. So this message is harmless unless you keep hitting the same message again and again.

> 
> 
> + Since the PG status are active+clean we move ahead with the RGW
> reconciliation as observed here but it fails as the OSDs are unresponsive
> ------------
> 2023-11-23T16:58:26.016042456Z 2023-11-23 16:58:26.016035 I |
> ceph-object-store-user-controller: CephObjectStore
> "ocs-storagecluster-cephobjectstore" found
> 2023-11-23T16:58:26.024111982Z 2023-11-23 16:58:26.024094 E |
> ceph-object-store-user-controller: failed to reconcile CephObjectStoreUser
> "openshift-storage/ocs-storagecluster-cephobjectstoreuser". failed to detect
> if object store "ocs-storagecluster-cephobjectstore" is initialized: no rgw
> pod found

The RGW pod is not up yet at this point.


> 2023-11-23T16:58:26.028484333Z 2023-11-23 16:58:26.028466 I | ceph-spec:
> parsing mon endpoints:
> b=172.70.69.160:3300,c=172.70.33.209:3300,a=172.70.4.250:3300
> 2023-11-23T16:58:26.028543270Z 2023-11-23 16:58:26.028532 I |
> ceph-object-store-user-controller: CephObjectStore
> "ocs-storagecluster-cephobjectstore" found
> 2023-11-23T16:58:26.033825751Z 2023-11-23 16:58:26.033811 E |
> ceph-object-store-user-controller: failed to reconcile CephObjectStoreUser
> "openshift-storage/prometheus-user". failed to detect if object store
> "ocs-storagecluster-cephobjectstore" is initialized: no rgw pod found
> 
> 2023-11-23T16:59:21.090499967Z 2023-11-23 16:59:21.090419 I | exec: exec
> timeout waiting for process radosgw-admin to return. Sending interrupt
> signal to the process
> 2023-11-23T16:59:21.102295110Z 2023-11-23 16:59:21.102279 E |
> ceph-object-store-user-controller: failed to reconcile CephObjectStoreUser
> "openshift-storage/ocs-storagecluster-cephobjectstoreuser". failed to
> initialized rgw admin ops client api: failed to create or retrieve rgw admin
> ops user: failed to create object user "rgw-admin-ops-user". error code 1
> for object store "ocs-storagecluster-cephobjectstore": failed to create s3
> user. . : signal: interrupt
> ------------

This would usually happen if the OSDs are not up or the PGs are not healthy.

--- Additional comment from Santosh Pillai on 2023-12-13 10:01:32 IST ---

(In reply to Santosh Pillai from comment #8)
> @karun A few questions:
> 
> - Does restarting the rook operator fixes the issue completely and cluster
> is usable after that? (Trying to understand if there a race condition like
> you mentioned above)
> - In the logs that you have shared, the ceph status seems to be `Healh_ok`
> and no OSDs are down. Is must gather taken after the Operator was restarted
> and StorageCluster was successful?

- Is it possible to get the output of `radosgw-admin zonegroup list` on a cluster with this error vs. on a cluster that is working fine?

--- Additional comment from Jiffin on 2023-12-13 10:49:55 IST ---

If the issue is not related to OSD latency or to the OSDs being down, then it may be related to https://bugzilla.redhat.com/show_bug.cgi?id=2245004. Basically, the Rook operator ignores the error when radosgw-admin times out, and that is fixed by https://github.com/rook/rook/pull/12817. I can see that the status of the CephObjectStore CRD is Ready, so this is a possibility.
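
A hedged way to see that mismatch on an affected cluster (the CRD reports Ready while the RGW pod crash-loops); resource names assumed from the default ODF deployment:
---------------
oc -n openshift-storage get cephobjectstore ocs-storagecluster-cephobjectstore \
  -o jsonpath='{.status.phase}{"\n"}'
oc -n openshift-storage get pods -l app=rook-ceph-rgw
---------------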

--- Additional comment from Karun Josy on 2023-12-13 14:34:59 IST ---

Hello Santosh,

Please see the answers inline:

- Does restarting the rook operator fix the issue completely, and is the cluster usable after that? (Trying to understand whether there is a race condition like you mentioned above.)

Yes, restarting the rook-ceph-operator completely fixes the issue as observed in multiple clusters.


- In the logs that you have shared, the Ceph status seems to be `HEALTH_OK` and no OSDs are down. Was the must-gather taken after the operator was restarted and the StorageCluster was successful?

No, the must-gather was captured before the rook-ceph-operator was restarted.


- Is it possible to get the output of `radosgw-admin zonegroup list` on a cluster with this error vs. on a cluster that is working fine?

This output was captured from the cluster where the issue is observed:
---------------
sh-5.1$ radosgw-admin zonegroup list
failed to list zonegroups: (2) No such file or directory
sh-5.1$
---------------

Regards,
Karun

--- Additional comment from Santosh Pillai on 2023-12-14 09:18:43 IST ---

Thanks, Karun, for the details. This should be fixed by Parth's PR https://github.com/rook/rook/pull/12817, which correctly handles the timeout errors and increases the retry count.
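
The real fix lives in Rook's Go reconciler (the PR above); purely as an illustration of the idea of retrying the realm-scoped zonegroup fetch instead of treating a single radosgw-admin timeout as fatal, a hedged shell sketch (names assumed from the object store, timeouts arbitrary):
---------------
for attempt in 1 2 3 4 5; do
  if timeout 60 radosgw-admin zonegroup get \
       --rgw-realm=ocs-storagecluster-cephobjectstore \
       --rgw-zonegroup=ocs-storagecluster-cephobjectstore >/dev/null 2>&1; then
    echo "zonegroup found on attempt ${attempt}"
    break
  fi
  echo "attempt ${attempt}: zonegroup missing or radosgw-admin timed out; retrying"
  sleep 10
done
---------------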

Comment 3 Mudit Agarwal 2024-01-29 07:03:48 UTC
*** Bug 2254475 has been marked as a duplicate of this bug. ***

Comment 15 errata-xmlrpc 2024-02-29 09:13:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.14.5 Bug Fix Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:1043

