Bug 1952571 - [GSS][ceph-ansible][RFE] Additional pre-check for mon quorum failures while running rolling_update.yml playbook
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Ansible
Version: 4.2
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.3
Assignee: Guillaume Abrioux
QA Contact: Ameena Suhani S H
Docs Contact: Ranjini M N
URL:
Whiteboard:
Depends On:
Blocks: 2031070
 
Reported: 2021-04-22 14:55 UTC by Geo Jose
Modified: 2025-09-18 13:23 UTC

Fixed In Version: ceph-ansible-4.0.63-1.el8cp, ceph-ansible-4.0.63-1.el7cp
Doc Type: Enhancement
Doc Text:
.`ceph-ansible` checks for the Ceph Monitor quorum before starting the upgrade
Previously, when the storage cluster was in a `HEALTH_ERR` or `HEALTH_WARN` state because one of the Ceph Monitors was down, the `rolling_update.yml` playbook would still run. If the upgrade then failed, the quorum was lost, resulting in I/O down or a cluster failure. With this release, `ceph-ansible` performs an additional check of the Ceph Monitor quorum before starting the upgrade.
Clone Of:
Environment:
Last Closed: 2022-05-05 07:53:20 UTC
Embargoed:


Attachments


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | ceph ceph-ansible pull 6704 | 0 | None | Merged | [skip ci] rolling_update: check quorum state before upgrade | 2021-11-16 09:27:37 UTC
Red Hat Issue Tracker | RHCEPH-1335 | 0 | None | None | None | 2021-08-30 06:01:21 UTC
Red Hat Product Errata | RHSA-2022:1716 | 0 | None | None | None | 2022-05-05 07:53:39 UTC

Description Geo Jose 2021-04-22 14:55:41 UTC
Description of problem:
While running rolling_update.yml, the playbook fails if the cluster isn't in an acceptable state (HEALTH_ERR). It still runs if the cluster is in HEALTH_WARN (let's assume 1/3 mons down). But if the upgrade then fails for one of the remaining mons, we lose quorum, resulting in IO down/cluster failure. To avoid this situation, it would be good to add the conditions below, or something similar:
 - Add another condition to check the running mons before starting the mon upgrade.
 - If we add the above condition, we should also give an option to override it for cases where the system admin is okay to proceed with upgrading 2 mons (the minimum number needed for quorum).

Version-Release number of selected component (if applicable):
 * RHCS 4.2


Additional info:

 o The condition below only counts the monitor hosts in the inventory, so it does not check whether all the monitors are up and running:
---
    - name: set mon_host_count
      set_fact:
        mon_host_count: "{{ groups[mon_group_name] | length }}"

    - name: fail when less than three monitors
      fail:
        msg: "Upgrade of cluster with less than three monitors is not supported."
      when: mon_host_count | int < 3
---

 o The condition below is skipped because the cluster is not in 'HEALTH_ERR' (with 1/3 mons down it is in HEALTH_WARN):
---
        - name: fail if cluster isn't in an acceptable state
          fail:
            msg: "cluster is not in an acceptable state!"
          when: (check_cluster_health.stdout | from_json).status == 'HEALTH_ERR'
      when: inventory_hostname == groups[mon_group_name] | first
---
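
 o For illustration only, the requested pre-check might look roughly like the tasks below. This is a minimal sketch and not the merged fix. It assumes the play targets the monitor hosts, that the `ceph` CLI can be run directly on them (a containerized deployment would need a container exec prefix), and that `cluster` and `mon_group_name` are defined as elsewhere in the playbook. The task names and the `allow_degraded_mon_upgrade` override variable are hypothetical.

```yaml
# Sketch only: task names and allow_degraded_mon_upgrade are hypothetical,
# not taken from ceph-ansible.
- name: get the ceph quorum status
  command: "ceph --cluster {{ cluster | default('ceph') }} quorum_status --format json"
  register: check_quorum_status
  changed_when: false   # read-only query, never reports a change
  run_once: true

- name: fail if not every monitor is in quorum
  fail:
    msg: >-
      Only {{ mons_in_quorum }} of {{ groups[mon_group_name] | length }}
      monitors are in quorum.
  vars:
    # "quorum" in the quorum_status output lists the ranks of the monitors
    # currently in quorum
    mons_in_quorum: "{{ (check_quorum_status.stdout | from_json).quorum | length }}"
  run_once: true
  when:
    # both conditions must hold for the play to abort
    - mons_in_quorum | int != groups[mon_group_name] | length
    - not (allow_degraded_mon_upgrade | default(false) | bool)
```

With this sketch, an admin who accepts the risk of upgrading with only 2 mons in quorum could pass `-e allow_degraded_mon_upgrade=true` on the `ansible-playbook` command line to skip the failure.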

Comment 9 errata-xmlrpc 2022-05-05 07:53:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 4.3 Security and Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1716
