.`ceph-ansible` checks the Ceph Monitor quorum before starting the upgrade
Previously, when the storage cluster was in a HEALTH_ERR or HEALTH_WARN state because one of the Ceph Monitors was down, the `rolling_update.yml` playbook would still run. If the upgrade then failed on a monitor, quorum was lost, resulting in an I/O outage or cluster failure.
With this release, `ceph-ansible` performs an additional check: it verifies the Ceph Monitor quorum before starting the upgrade.
Description of problem:
While running rolling_update.yml, the playbook only fails if the cluster is in an unacceptable state (HEALTH_ERR). The playbook still runs in HEALTH_WARN (let's assume 1 of 3 mons is down). But while running the playbook in that state, if the upgrade fails for one of the remaining mons, we lose quorum, resulting in I/O down / cluster failure. So to avoid this situation, it would be good if we could add the below conditions, or something similar (a sketch follows the list):
- Add another condition that checks the running mons before starting the mon upgrade.
- If we add the above condition, we should also give an option to override it, for the case where the system admin is okay to proceed with upgrading 2 mons (the minimum quorum).
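A minimal sketch of what such a pre-flight check could look like, assuming it runs early in rolling_update.yml and can reach a monitor; `quorum_state` and the `mon_upgrade_quorum_override` variable are hypothetical names, not existing ceph-ansible code:
---
- name: collect monitor quorum state
  command: "ceph --cluster {{ cluster }} quorum_status"
  register: quorum_state
  run_once: true
  delegate_to: "{{ groups[mon_group_name] | first }}"
  changed_when: false

- name: fail when any monitor is out of quorum
  fail:
    msg: "not all monitors are in quorum; refusing to start the upgrade"
  run_once: true
  when:
    - not mon_upgrade_quorum_override | default(false) | bool
    - (quorum_state.stdout | from_json).quorum_names | length < groups[mon_group_name] | length
---
Setting `mon_upgrade_quorum_override: true` would cover the second point above, letting an admin knowingly proceed with 2 of 3 mons in quorum.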
Version-Release number of selected component (if applicable):
* RHCS 4.2
Additional info:
o Due to the below condition, the playbook only counts the monitors defined in the inventory; it does not check whether all of the monitors are actually up and running:
---
- name: set mon_host_count
  set_fact:
    mon_host_count: "{{ groups[mon_group_name] | length }}"

- name: fail when less than three monitors
  fail:
    msg: "Upgrade of cluster with less than three monitors is not supported."
  when: mon_host_count | int < 3
---
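For illustration, the count check could instead compare the monitors actually in quorum against the inventory, reusing the `mon_host_count` fact above and the `quorum_state` register from the earlier sketch (both assumed names, not shipped code):
---
- name: fail when a monitor is missing from quorum
  fail:
    msg: "only {{ (quorum_state.stdout | from_json).quorum_names | length }} of {{ mon_host_count }} monitors are in quorum"
  run_once: true
  when: (quorum_state.stdout | from_json).quorum_names | length < mon_host_count | int
---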
o The below condition is skipped because the cluster is not in 'HEALTH_ERR' (with 1/3 mons down it only reports HEALTH_WARN), so the upgrade proceeds anyway:
---
- name: fail if cluster isn't in an acceptable state
  fail:
    msg: "cluster is not in an acceptable state!"
  when:
    - inventory_hostname == groups[mon_group_name] | first
    - (check_cluster_health.stdout | from_json).status == 'HEALTH_ERR'
---
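Assuming `check_cluster_health` is registered from `ceph health --format json` (as the expression above implies), a stricter variant could also fail in HEALTH_WARN whenever the MON_DOWN health check is raised; this is a sketch, not the shipped fix:
---
- name: fail when monitors are down, even in HEALTH_WARN
  fail:
    msg: "one or more monitors are down; refusing to upgrade"
  when:
    - inventory_hostname == groups[mon_group_name] | first
    - "'MON_DOWN' in ((check_cluster_health.stdout | from_json).checks | default({}))"
---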
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Moderate: Red Hat Ceph Storage 4.3 Security and Bug Fix update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2022:1716