Bug 1740463

Summary: rolling update should set "nodeep-scrub" flag instead of "norebalance"
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Rachana Patel <racpatel>
Component: Ceph-Ansible
Assignee: Guillaume Abrioux <gabrioux>
Status: CLOSED ERRATA
QA Contact: Vasishta <vashastr>
Severity: high
Docs Contact: Bara Ancincova <bancinco>
Priority: high
Version: 3.3
CC: aschoen, assingh, ceph-eng-bugs, dsavinea, gabrioux, gmeno, kdreyer, nthomas, tchandra, tserlin, vumrao
Target Milestone: z4   
Target Release: 3.3   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: RHEL: ceph-ansible-3.2.39-1.el7cp Ubuntu: ceph-ansible_3.2.39-2redhat1
Doc Type: Bug Fix
Doc Text:
.Upgrading OSDs no longer becomes unresponsive for a long period of time
Previously, when using the `rolling_update.yml` playbook to upgrade OSDs, the playbook waited for the placement groups to reach the `active+clean` state. Because the playbook set the `norebalance` flag in addition to `noout`, instead of the `nodeep-scrub` flag, backfilling was suspended, and when the amount of data and the retry count were large, the upgrade process became unresponsive for a long period of time. With this update, the playbook sets the `nodeep-scrub` flag instead of `norebalance`, and the upgrade process is no longer unresponsive for a long period of time.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-04-06 08:27:04 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1813905    
Bug Blocks: 1726135, 1727980    

Description Rachana Patel 2019-08-13 04:44:46 UTC
Description of problem:
=======================
When we use rolling_update.yml to update/upgrade the cluster, it sets two flags: "noout" and "norebalance".
IMHO, during rolling_update we should set the "nodeep-scrub" flag rather than "norebalance".
(more on flags - https://docs.ceph.com/docs/mimic/rados/operations/health-checks/#osdmap-flags)
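At the task level, the suggested change amounts to setting "nodeep-scrub" alongside "noout" before touching the OSDs and unsetting both afterwards. A minimal sketch, assuming a "mons" inventory group and a "cluster" variable (not the literal rolling_update.yml tasks):

# Illustrative only -- the flag names are the real Ceph osdmap flags, but the
# "mons" group name and the "cluster" variable are assumptions.
- name: set osd flags before upgrading the OSDs
  # "noout" keeps restarting OSDs from being marked out; "nodeep-scrub"
  # avoids deep-scrub load without blocking backfill the way "norebalance" does.
  command: "ceph --cluster {{ cluster }} osd set {{ item }}"
  with_items:
    - noout
    - nodeep-scrub
  delegate_to: "{{ groups['mons'][0] }}"
  run_once: true

- name: unset osd flags once the upgrade is finished
  command: "ceph --cluster {{ cluster }} osd unset {{ item }}"
  with_items:
    - noout
    - nodeep-scrub
  delegate_to: "{{ groups['mons'][0] }}"
  run_once: true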

Issue with "norebalance":
After an OSD upgrade, the playbook waits for the "active+clean" state (the number of retries is defined by the user).
When the amount of data and the retry count are large, the upgrade can be stuck there for a long period.
e.g.
In one of our clusters the retry count was 10000 and the upgrade was stuck for 2 days because of the cluster status.

FAILED - RETRYING: waiting for clean pgs... (93858 retries left)
FAILED - RETRYING: waiting for clean pgs... (93857 retries left)
FAILED - RETRYING: waiting for clean pgs... (93856 retries left)
and so on
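
For context, the wait that produces this output is essentially a status poll with a retry budget, comparing the number of active+clean PGs against the total. A rough sketch of a task of that shape (variable and group names are assumptions, not the playbook's exact code):

# Sketch only: polls the cluster status until every PG is active+clean,
# giving up after the configured number of retries.
- name: waiting for clean pgs...
  command: "ceph --cluster {{ cluster }} -s --format json"
  register: ceph_status
  delegate_to: "{{ groups['mons'][0] }}"
  run_once: true
  retries: "{{ health_osd_check_retries }}"
  delay: "{{ health_osd_check_delay }}"
  until: >
    ((ceph_status.stdout | from_json).pgmap.pgs_by_state
     | selectattr('state_name', 'equalto', 'active+clean')
     | map(attribute='count') | list | sum)
    == (ceph_status.stdout | from_json).pgmap.num_pgs

With "norebalance" set, the misplaced PGs never finish backfilling, so this condition is never met and the task keeps retrying until the budget runs out.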

PG status during those 2 days:

pgs: 40437/247052355 objects misplaced (0.016%)
4284 active+clean
2 active+undersized+remapped+backfilling
1 active+remapped+backfilling
1 active+remapped+backfill_wait

As "norebalance" was set backfilling was suspended.

Version-Release number of selected component (if applicable):
=============================================================
ceph-ansible-3.2.15-1.el7cp.noarch

How reproducible:
=================
always

Comment 12 Vasishta 2020-03-17 15:32:58 UTC
Working fine with ceph-ansible-3.2.40-1
Moving to VERIFIED state.

Comment 15 errata-xmlrpc 2020-04-06 08:27:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1320