Bug 2292435

Summary: ODF ceph quorum lost during removal of extra mon
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Ales Nosek <anosek>
Component: rook    Assignee: Travis Nielsen <tnielsen>
Status: CLOSED ERRATA QA Contact: Nagendra Reddy <nagreddy>
Severity: unspecified Docs Contact:
Priority: low    
Version: 4.14    CC: brgardne, ebenahar, edonnell, hnallurv, odf-bz-bot, tnielsen
Target Milestone: ---   
Target Release: ODF 4.17.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 4.17.0-98 Doc Type: Bug Fix
Doc Text:
.Rook.io Operator no longer gets stuck when removing a mon from quorum
Previously, mon quorum could be lost when removing a mon from quorum because of a race condition: there might not have been enough mons in quorum to complete the removal of the mon from quorum. This issue has been fixed, and the Rook.io Operator no longer gets stuck when removing a mon from quorum.
Story Points: ---
Clone Of: Environment:
Last Closed: 2024-10-30 14:28:03 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2281703    
Attachments:
Rook operator pod logs

Description Ales Nosek 2024-06-14 19:22:47 UTC
Created attachment 2037356 [details]
Rook operator pod logs

We lost Ceph mon quorum while upgrading an OCP cluster from version 4.14.11 to 4.14.22. Persistent storage on this cluster is provided by ODF version 4.14.5. The OCP cluster consists of 18 nodes: 3 master nodes and 15 worker nodes, with the worker nodes also serving as ODF storage nodes. There are 3 Ceph monitors deployed across 4 Ceph failure domains. The cluster is deployed on bare metal, and its main purpose is to run virtual machines on OpenShift Virtualization.

No ODF upgrade was performed; we were upgrading the OCP cluster only. During this upgrade, we lost Ceph mon quorum in the following way:

During the OpenShift upgrade, the cluster nodes are drained one at a time (maxUnavailable = 1). Because the virtual machines must be live-migrated off each node, draining takes a long time; draining and rebooting a node takes more than twice the 10-minute mon failover timeout.
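For illustration, here is a minimal sketch of the failover timer behavior as I understand it. This is hypothetical code, not Rook's actual implementation; the timeout itself is configurable in Rook (I believe via the ROOK_MON_OUT_TIMEOUT operator setting, default 10 minutes, but please verify for your version):

package main

import (
	"fmt"
	"time"
)

// monOutTimeout mirrors the ~10 minute window mentioned above; in Rook this
// is configurable (assumption: via the operator config), exact knob may differ.
const monOutTimeout = 10 * time.Minute

type monHealth struct {
	name     string
	inQuorum bool
	outSince time.Time // zero value while the mon is healthy
}

// needsFailover returns true once a mon has been out of quorum longer than
// the timeout, which is roughly what triggers the failover seen in the logs.
func needsFailover(m monHealth, now time.Time) bool {
	return !m.inQuorum && !m.outSince.IsZero() && now.Sub(m.outSince) > monOutTimeout
}

func main() {
	// A node drain + reboot taking ~12 minutes exceeds the timeout.
	m := monHealth{name: "a", inQuorum: false, outSince: time.Now().Add(-12 * time.Minute)}
	fmt.Println(needsFailover(m, time.Now())) // true: a replacement mon gets scheduled
}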

The attached rook-ceph-operator-d5f8bccb8-tc84k.redacted.txt file includes the full rook-ceph-operator logs. In the logs, I replaced the customer-specific domain names with example.com. The log snippets below are taken from this file.

During the OpenShift upgrade, while a node hosting one of the Ceph mons was rebooting, the mon failed over to another node. Shortly after that, the node came back up, and so we ended up with 4 mons in the cluster:

...
2024-05-15 09:22:48.961421 I | op-mon: Monitors in quorum: [a e g h]
... 

The rook-ceph-operator noticed the 4 monitors, picked one monitor for removal, and deleted its deployment object from the cluster:

...
2024-05-15 09:22:50.302886 I | op-mon: removing an extra mon. currently 4 are in quorum and only 3 are desired
2024-05-15 09:22:50.302942 I | op-mon: removing arbitrary extra mon "a"
2024-05-15 09:22:50.302949 I | op-mon: ensuring removal of unhealthy monitor a
2024-05-15 09:22:55.014762 I | ceph-nodedaemon-controller: ceph-exporter labels not specified
2024-05-15 09:22:55.048975 I | ceph-spec: object "rook-ceph-mon-a" matched on delete, reconciling
... 

At the same time, another cluster node hosting a Ceph mon was draining. A second mon went down during this drain, moments after the rook-ceph-operator had initiated the deletion of the first mon. With only 2 of the 4 mons left in the cluster, Ceph quorum was lost. From this moment on, the rook-ceph-operator could no longer check the Ceph cluster health:

...
2024-05-15 09:24:04.343867 E | ceph-cluster-controller: failed to get ceph status. failed to get status. . timed out: exit status 1
... 
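To spell out the quorum arithmetic: a Ceph monitor quorum requires a strict majority of the deployed mons, i.e. floor(n/2) + 1. A small illustrative snippet (not Rook code):

package main

import "fmt"

// quorumSize returns the strict majority needed among n mons: n/2 + 1.
func quorumSize(n int) int {
	return n/2 + 1
}

func hasQuorum(up, total int) bool {
	return up >= quorumSize(total)
}

func main() {
	fmt.Println(quorumSize(4))   // 3: with 4 mons deployed, 3 must be up
	fmt.Println(hasQuorum(2, 4)) // false: the state we ended up in
	fmt.Println(hasQuorum(3, 3)) // true: the desired steady state
}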

Unfortunately, the last cluster node to drain hit a networking issue triggered after its reboot. Due to this issue, the node remained disconnected from the cluster. Because the node never became healthy, the OpenShift upgrade stalled and Ceph quorum remained lost.

Sometime after losing Ceph quorum, several virtual machines were restarted. These machines could not come back up because their persistent volumes could not be attached without Ceph quorum. This resulted in an outage of about 15 virtual machines.

Eventually, we corrected the network configuration and rebooted the cluster node. After the reboot, the mon on that node rejoined the Ceph cluster, which restored quorum. From then on, everything worked again, and the rook-ceph-operator was also able to finish scaling the mons back down to three:

 ...
2024-05-15 17:21:29.766080 I | cephclient: getting or creating ceph auth key "client.csi-rbd-node"
2024-05-15 17:21:29.796125 I | op-mon: removed monitor a
2024-05-15 17:21:29.816347 I | op-mon: mon pvc did not exist "rook-ceph-mon-a"
2024-05-15 17:21:29.835554 I | op-mon: monitor endpoints changed, updating the bootstrap peer token
2024-05-15 17:21:29.835598 I | op-mon: monitor endpoints changed, updating the bootstrap peer token
2024-05-15 17:21:29.835700 I | op-mon: saved mon endpoints to config map map[csi-cluster-config-json:[{"clusterID":"openshift-storage","monitors":["10.88.216.21:3300","10.88.226.129:3300","10.88.58.241:3300"],"namespace":""}] data:g=10.88.216.21:3300,e=10.88.226.129:3300,h=10.88.58.241:3300 mapping:{"node":{"e":{"Name":"sat1cdpcn701.example.com","Hostname":"sat1cdpcn701.example.com","Address":"10.47.70.252"},"g":{"Name":"sat1cdpcn005.example.com","Hostname":"sat1cdpcn005.example.com","Address":"10.47.70.30"},"h":{"Name":"sat1cdpcn002.example.com","Hostname":"sat1cdpcn002.example.com","Address":"10.47.70.19"}}} maxMonId:7 outOfQuorum:]
...

Based on what I could observe, the issue is caused by a race condition in the rook-ceph-operator. The operator initiates the deletion of a mon to reduce the number of mons back down to 3. Deleting the mon takes some time, and only after the mon has been deleted does the operator decrease the expected number of mons to 3. If another mon becomes unavailable before the expected number of mons is decreased, Ceph quorum is lost.
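Below is a minimal sketch of the kind of guard that would avoid this race, assuming the idea is to re-check quorum health immediately before removing the extra mon. This is my illustration, not the actual Rook fix, and the function name safeToRemoveExtraMon is made up:

package main

import "fmt"

// safeToRemoveExtraMon reports whether removing one extra mon is safe right
// now: every deployed mon must currently be in quorum, and the remaining
// mons must still form a strict majority of the current set.
func safeToRemoveExtraMon(monsInQuorum, monsTotal, desired int) bool {
	if monsTotal <= desired {
		return false // nothing to remove
	}
	if monsInQuorum < monsTotal {
		return false // some mon is already out of quorum; defer the removal
	}
	return monsTotal-1 >= monsTotal/2+1
}

func main() {
	// The situation in the logs: 4 mons, all in quorum, 3 desired.
	fmt.Println(safeToRemoveExtraMon(4, 4, 3)) // true: removal may start
	// Moments later a second mon goes down during the drain (3 in quorum of 4):
	fmt.Println(safeToRemoveExtraMon(3, 4, 3)) // false: removal should be deferred
}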

This bug was originally filed in Red Hat's Jira:
https://issues.redhat.com/browse/RHSTOR-5928

Comment 3 Santosh Pillai 2024-06-17 06:21:34 UTC
The issue is most likely a race condition caused by a corner-case scenario, so I am moving this out of 4.16 as it is not a blocker for 4.16.

Comment 4 Travis Nielsen 2024-08-30 18:22:38 UTC
Ales, thanks for the great writeup of how the loss of mon quorum occurred, related to the mon failover event. It greatly helped track down this race condition!

Comment 16 errata-xmlrpc 2024-10-30 14:28:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.17.0 Security, Enhancement, & Bug Fix Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:8676