.`ceph-ansible` checks the Ceph Monitor quorum before starting the upgrade
Previously, when the storage cluster was in a HEALTH_ERR or HEALTH_WARN state because one of the Ceph Monitors was down, the `rolling_update.yml` playbook would still run. If the upgrade then failed on a monitor, quorum was lost, resulting in an I/O outage or cluster failure.
With this release, `ceph-ansible` performs an additional check: it verifies the Ceph Monitor quorum before starting the upgrade.
Description of problem:
While running rolling_update.yml, the playbook only fails if the cluster is in an unacceptable state (HEALTH_ERR). The playbook still runs in HEALTH_WARN (let's assume 1 of 3 mons is down). But while running the playbook in that state, if the upgrade fails for one of the remaining mons, we lose quorum, resulting in I/O down / cluster failure. So to avoid this situation, it would be good if we could add the below conditions, or something similar (a sketch follows the list):
- Add another condition that checks the running mons before starting the mon upgrade.
- If we add the above condition, we should also give an option to override it, for the case where the system admin is okay to proceed with upgrading 2 mons (the minimum quorum).
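A minimal sketch of what such a pre-flight check could look like, assuming it runs early in rolling_update.yml and can reach a monitor; `quorum_state` and the `mon_upgrade_quorum_override` variable are hypothetical names, not existing ceph-ansible code:
---
- name: collect monitor quorum state
  command: "ceph --cluster {{ cluster }} quorum_status"
  register: quorum_state
  run_once: true
  delegate_to: "{{ groups[mon_group_name] | first }}"
  changed_when: false

- name: fail when any monitor is out of quorum
  fail:
    msg: "not all monitors are in quorum; refusing to start the upgrade"
  run_once: true
  when:
    - not mon_upgrade_quorum_override | default(false) | bool
    - (quorum_state.stdout | from_json).quorum_names | length < groups[mon_group_name] | length
---
Setting `mon_upgrade_quorum_override: true` would cover the second point above, letting an admin knowingly proceed with 2 of 3 mons in quorum.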
Version-Release number of selected component (if applicable):
* RHCS 4.2
Additional info:
o Due to the below condition, the playbook only counts the monitors defined in the inventory; it does not check whether all of the monitors are actually up and running:
---
- name: set mon_host_count
  set_fact:
    mon_host_count: "{{ groups[mon_group_name] | length }}"

- name: fail when less than three monitors
  fail:
    msg: "Upgrade of cluster with less than three monitors is not supported."
  when: mon_host_count | int < 3
---
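For illustration, the count check could instead compare the monitors actually in quorum against the inventory, reusing the `mon_host_count` fact above and the `quorum_state` register from the earlier sketch (both assumed names, not shipped code):
---
- name: fail when a monitor is missing from quorum
  fail:
    msg: "only {{ (quorum_state.stdout | from_json).quorum_names | length }} of {{ mon_host_count }} monitors are in quorum"
  run_once: true
  when: (quorum_state.stdout | from_json).quorum_names | length < mon_host_count | int
---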
o The below condition is skipped because the cluster is not in 'HEALTH_ERR' (with 1/3 mons down it only reports HEALTH_WARN), so the upgrade proceeds anyway:
---
- name: fail if cluster isn't in an acceptable state
  fail:
    msg: "cluster is not in an acceptable state!"
  when:
    - inventory_hostname == groups[mon_group_name] | first
    - (check_cluster_health.stdout | from_json).status == 'HEALTH_ERR'
---
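Assuming `check_cluster_health` is registered from `ceph health --format json` (as the expression above implies), a stricter variant could also fail in HEALTH_WARN whenever the MON_DOWN health check is raised; this is a sketch, not the shipped fix:
---
- name: fail when monitors are down, even in HEALTH_WARN
  fail:
    msg: "one or more monitors are down; refusing to upgrade"
  when:
    - inventory_hostname == groups[mon_group_name] | first
    - "'MON_DOWN' in ((check_cluster_health.stdout | from_json).checks | default({}))"
---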
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Moderate: Red Hat Ceph Storage 4.3 Security and Bug Fix update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2022:1716