Bug 1952571

Summary: [GSS][ceph-ansible][RFE] Additional pre-check for mon quorum failures while running rolling_update.yml playbook
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Geo Jose <gjose>
Component: Ceph-Ansible Assignee: Guillaume Abrioux <gabrioux>
Status: CLOSED ERRATA QA Contact: Ameena Suhani S H <amsyedha>
Severity: medium Docs Contact: Ranjini M N <rmandyam>
Priority: medium    
Version: 4.2 CC: aschoen, ceph-eng-bugs, gabrioux, gmeno, kimiasalamo9881, mmuench, nthomas, rmandyam, tserlin, vereddy, ykaul
Target Milestone: --- Keywords: FutureFeature
Target Release: 4.3   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: ceph-ansible-4.0.63-1.el8cp, ceph-ansible-4.0.63-1.el7cp Doc Type: Enhancement
Doc Text:
.`ceph-ansible` checks for the Ceph Monitor quorum before starting the upgrade

Previously, when the storage cluster was in a `HEALTH_ERR` or `HEALTH_WARN` state due to one of the Ceph Monitors being down, the `rolling_update.yml` playbook would run. However, the upgrade would fail and the quorum would be lost, resulting in I/O being down or a cluster failure. With this release, `ceph-ansible` performs an additional check of the Ceph Monitor quorum before starting the upgrade.
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-05-05 07:53:20 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2031070    

Description Geo Jose 2021-04-22 14:55:41 UTC
Description of problem:
While running rolling_update.yml, the playbook fails if the cluster is not in an acceptable state (HEALTH_ERR). However, the playbook still runs when the cluster is in HEALTH_WARN (let's assume 1/3 mons down). If the upgrade then fails for one of the mons while the playbook is running, we lose quorum, resulting in IO down/cluster failure. To avoid this situation, it would be good to add the conditions below, or something similar:
 - Add another condition to check the running mons before starting the mon upgrade.
 - If we add the above condition, we should provide an option to override it for cases where the system admin is okay to proceed with upgrading 2 mons (the minimum number needed for quorum).
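
A minimal sketch of what the two conditions above could look like as playbook tasks. This is an illustration only, not the change that shipped in the fixed version. It assumes a non-containerized deployment where the `ceph` CLI can be run directly on the first monitor; the register name `quorum_status` and the override variable `allow_degraded_mon_upgrade` are hypothetical names introduced here.
---
    # Hypothetical pre-check: ask the cluster which monitors are in quorum.
    - name: get the monitor quorum status
      command: "ceph --cluster {{ cluster }} quorum_status --format json"
      register: quorum_status
      changed_when: false
      delegate_to: "{{ groups[mon_group_name] | first }}"
      run_once: true

    # Fail when fewer monitors are in quorum than are listed in the inventory,
    # unless the admin explicitly opts in with the hypothetical override
    # variable allow_degraded_mon_upgrade.
    - name: fail when a monitor is out of quorum before the upgrade
      fail:
        msg: >-
          Only {{ (quorum_status.stdout | from_json).quorum_names | length }} of
          {{ groups[mon_group_name] | length }} monitors are in quorum.
          Set allow_degraded_mon_upgrade=true to proceed anyway.
      when:
        - (quorum_status.stdout | from_json).quorum_names | length < groups[mon_group_name] | length
        - not (allow_degraded_mon_upgrade | default(false) | bool)
---
With such a check in place, an admin who accepts the risk could pass `-e allow_degraded_mon_upgrade=true` on the `ansible-playbook` command line to proceed with only 2 mons in quorum.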

Version-Release number of selected component (if applicable):
 * RHCS 4.2


Additional info:

 o The condition below only counts the monitors listed in the inventory; it does not check whether all the monitors are up and running:
---
    - name: set mon_host_count
      set_fact:
        mon_host_count: "{{ groups[mon_group_name] | length }}"

    - name: fail when less than three monitors
      fail:
        msg: "Upgrade of cluster with less than three monitors is not supported."
      when: mon_host_count | int < 3
---

 o The condition below is skipped because the cluster is not in 'HEALTH_ERR' (with 1/3 mons down it is only in 'HEALTH_WARN'):
---
          - name: fail if cluster isn't in an acceptable state
            fail:
              msg: "cluster is not in an acceptable state!"
            when: (check_cluster_health.stdout | from_json).status == 'HEALTH_ERR'
    # condition on the enclosing block (not shown in this excerpt)
    when: inventory_hostname == groups[mon_group_name] | first
---

Comment 9 errata-xmlrpc 2022-05-05 07:53:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 4.3 Security and Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1716
