Description of problem:
=======================
When we use rolling_update.yml to update/upgrade a cluster, it sets two flags: "noout" and "norebalance". IMHO, during rolling_update we should set the "nodeep-scrub" flag rather than "norebalance" (more on flags: https://docs.ceph.com/docs/mimic/rados/operations/health-checks/#osdmap-flags).

Issue with "norebalance":
After an OSD upgrade, the playbook waits for the "active+clean" state (the number of retries is defined by the user). When the amount of data and the retry count are both large, the upgrade can be stuck there for a long period. E.g. in one of our clusters the retry count was 10000 and the upgrade was stuck for 2 days due to the cluster status.

FAILED - RETRYING: waiting for clean pgs... (93858 retries left)
FAILED - RETRYING: waiting for clean pgs... (93857 retries left)
FAILED - RETRYING: waiting for clean pgs... (93856 retries left)
and so on

pg status for 2 days:

  pgs: 40437/247052355 objects misplaced (0.016%)
       4284 active+clean
       2    active+undersized+remapped+backfilling
       1    active+remapped+backfilling
       1    active+remapped+backfill_wait

Because "norebalance" was set, backfilling was suspended, so the PGs could never reach "active+clean".

Version-Release number of selected component (if applicable):
=============================================================
ceph-ansible-3.2.15-1.el7cp.noarch

How reproducible:
=================
always
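The interaction described above can be sketched in shell. The flag names ("noout", "norebalance", "nodeep-scrub") are real OSD map flags, but the loop, variable names, and values below are illustrative only, not ceph-ansible's actual tasks:

```shell
#!/bin/sh
# Flags rolling_update.yml sets today (per this report): noout + norebalance.
# The suggestion is nodeep-scrub instead of norebalance, because norebalance
# suspends backfill, so remapped PGs never return to active+clean.
#
#   ceph osd set noout          # don't mark stopped OSDs out during restarts
#   ceph osd set nodeep-scrub   # suggested: only skip deep scrubs
#   (instead of: ceph osd set norebalance)

# Illustrative retry loop, similar in spirit to the playbook's
# "waiting for clean pgs" task (names/values here are hypothetical).
retries=10000                              # a large user-set retry count
left=$retries
pg_state="active+remapped+backfill_wait"   # placeholder for parsed `ceph pg stat`

while [ "$left" -gt 0 ]; do
    if [ "$pg_state" = "active+clean" ]; then
        break
    fi
    left=$((left - 1))
    echo "FAILED - RETRYING: waiting for clean pgs... ($left retries left)"
    # With norebalance set, backfill stays suspended, so in the failing case
    # this loop never observes active+clean until the retries run out.
    pg_state="active+clean"   # pretend recovery completes so the sketch terminates
done
```

With "nodeep-scrub" instead of "norebalance", backfill proceeds normally during the wait, so the loop can terminate long before the retry budget is exhausted.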
Working fine with ceph-ansible-3.2.40-1. Moving to VERIFIED state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1320