1952571 – [GSS][ceph-ansible][RFE] Additional pre-check for mon quorum failures while running rolling_update.yml playbook

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Ceph Storage
Classification:	Red Hat Storage
Component:	Ceph-Ansible
Sub Component:
Version:	4.2
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.3
Assignee:	Guillaume Abrioux
QA Contact:	Ameena Suhani S H
Docs Contact:	Ranjini M N
URL:
Whiteboard:
Depends On:
Blocks:	2031070

Reported:	2021-04-22 14:55 UTC by Geo Jose
Modified:	2025-05-28 09:44 UTC
CC List:	12 users
Fixed In Version:	ceph-ansible-4.0.63-1.el8cp, ceph-ansible-4.0.63-1.el7cp
Doc Type:	Enhancement
Doc Text:	.`ceph-ansible` checks for the Ceph Monitor quorum before starting the upgrade. Previously, when the storage cluster was in a `HEALTH_ERR` or `HEALTH_WARN` state due to one of the Ceph Monitors being down, the `rolling_update.yml` playbook would run. However, the upgrade would fail and the quorum was lost, resulting in I/O down or a cluster failure. With this release, `ceph-ansible` performs an additional check of the Ceph Monitor quorum before starting the upgrade.
Clone Of:
Environment:
Last Closed:	2022-05-05 07:53:20 UTC
Embargoed:
Dependent Products:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	ceph ceph-ansible pull 6704	None	Merged	[skip ci] rolling_update: check quorum state before upgrade	2021-11-16 09:27:37 UTC
Red Hat Issue Tracker	RHCEPH-1335	None	None	None	2021-08-30 06:01:21 UTC
Red Hat Product Errata	RHSA-2022:1716	None	None	None	2022-05-05 07:53:39 UTC

Description Geo Jose 2021-04-22 14:55:41 UTC

Description of problem:
While running rolling_update.yml, the playbook fails if the cluster is not in an acceptable state (HEALTH_ERR). However, the playbook still runs if the cluster is in HEALTH_WARN (let's assume 1/3 mons down). If the upgrade then fails for one of the remaining mons while the playbook is running, we lose the quorum, resulting in IO down/cluster failure. To avoid this situation, it would be good to add the conditions below, or something similar:
 - Add another condition to check the running mons before starting the mon upgrade.
 - If we add the above condition, we should give an option to override it for the case where the system admin is okay to proceed with upgrading 2 mons (the minimum number needed for quorum).
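
A minimal sketch of what such a pre-check could look like, not the actual ceph-ansible implementation. It assumes that `ceph quorum_status --format json` is run from the first monitor and that its output contains a `quorum_names` list. The variable `allow_degraded_mon_upgrade` is hypothetical and only illustrates the requested override; `cluster` is assumed to hold the cluster name and falls back to `ceph` here.
---
    # Sketch only: task names and allow_degraded_mon_upgrade are illustrative.
    - name: get the monitor quorum status
      command: "ceph --cluster {{ cluster | default('ceph') }} quorum_status --format json"
      register: mon_quorum_status
      changed_when: false
      run_once: true
      delegate_to: "{{ groups[mon_group_name] | first }}"

    # Compare the monitors currently in quorum with the monitors in the inventory.
    - name: fail when not all monitors are in quorum
      fail:
        msg: >-
          Only {{ (mon_quorum_status.stdout | from_json).quorum_names | length }}
          of {{ groups[mon_group_name] | length }} monitors are in quorum.
          Set allow_degraded_mon_upgrade=true to proceed anyway.
      when:
        - (mon_quorum_status.stdout | from_json).quorum_names | length < groups[mon_group_name] | length
        - not (allow_degraded_mon_upgrade | default(false) | bool)
---
With this sketch, an admin who accepts the risk could pass `-e allow_degraded_mon_upgrade=true` on the ansible-playbook command line to skip the failure.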

Version-Release number of selected component (if applicable):
 * RHCS 4.2


Additional info:

 o The condition below only counts the monitors defined in the inventory, so it does not check whether all the monitors are up and running:
---
    - name: set mon_host_count
      set_fact:
        mon_host_count: "{{ groups[mon_group_name] | length }}"

    - name: fail when less than three monitors
      fail:
        msg: "Upgrade of cluster with less than three monitors is not supported."
      when: mon_host_count | int < 3
---

 o The condition below is skipped because the cluster is not in 'HEALTH_ERR' (with 1/3 mons down it is only in HEALTH_WARN):
---
          - name: fail if cluster isn't in an acceptable state
            fail:
              msg: "cluster is not in an acceptable state!"
            when: (check_cluster_health.stdout | from_json).status == 'HEALTH_ERR'
    when: inventory_hostname == groups[mon_group_name] | first
---
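
As an illustration of how the existing health check could catch this case, here is a hedged sketch. It assumes that `check_cluster_health` holds the JSON output of `ceph health` and that this JSON carries a `checks` map keyed by health check code, with `MON_DOWN` present when a monitor is down. Whether that holds depends on the Ceph release in use, so treat it as an assumption rather than the playbook's behaviour.
---
    # Sketch only: assumes a "checks" map with a MON_DOWN key in the health JSON.
    - name: fail if a monitor is reported down
      fail:
        msg: "at least one monitor is down, refusing to start the mon upgrade."
      when: "'MON_DOWN' in (check_cluster_health.stdout | from_json).checks | default({})"
---
Unlike the 'HEALTH_ERR' comparison, this condition would trigger while the overall status is still HEALTH_WARN.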

Comment 9 errata-xmlrpc 2022-05-05 07:53:20 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 4.3 Security and Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1716
