Bug 1740463

Summary: rolling update should set "nodeep-scrub" flag instead of "norebalance"
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Rachana Patel <racpatel>
Component: Ceph-Ansible
Assignee: Guillaume Abrioux <gabrioux>
Status: CLOSED ERRATA
QA Contact: Vasishta <vashastr>
Severity: high
Docs Contact: Bara Ancincova <bancinco>
Priority: high
Version: 3.3
CC: aschoen, assingh, ceph-eng-bugs, dsavinea, gabrioux, gmeno, kdreyer, nthomas, tchandra, tserlin, vumrao
Target Milestone: z4   
Target Release: 3.3   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: RHEL: ceph-ansible-3.2.39-1.el7cp Ubuntu: ceph-ansible_3.2.39-2redhat1
Doc Type: Bug Fix
Doc Text:
.Upgrading OSDs no longer becomes unresponsive for a long period of time
Previously, when using the `rolling_update.yml` playbook to upgrade OSDs, the playbook waited for the placement groups to reach the `active+clean` state. Because the playbook set the `norebalance` flag in addition to `noout`, instead of the `nodeep-scrub` flag, backfilling was suspended, and when the amount of data and the retry count were large, the upgrade process became unresponsive for a long period of time. With this update, the playbook sets the `nodeep-scrub` flag instead of `norebalance`, and the upgrade process is no longer unresponsive for a long period of time.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-04-06 08:27:04 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1813905    
Bug Blocks: 1726135, 1727980    

Description Rachana Patel 2019-08-13 04:44:46 UTC
Description of problem:
=======================
When we use rolling_update.yml to update/upgrade the cluster, it sets two flags: "noout" and "norebalance".
IMHO, during rolling_update we should set the "nodeep-scrub" flag rather than "norebalance".
(more on flags - https://docs.ceph.com/docs/mimic/rados/operations/health-checks/#osdmap-flags)
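At the task level, the suggested change amounts to setting "nodeep-scrub" alongside "noout" before touching the OSDs and unsetting both afterwards. A minimal sketch, assuming a "mons" inventory group and a "cluster" variable (not the literal rolling_update.yml tasks):

# Illustrative only -- the flag names are the real Ceph osdmap flags, but the
# "mons" group name and the "cluster" variable are assumptions.
- name: set osd flags before upgrading the OSDs
  # "noout" keeps restarting OSDs from being marked out; "nodeep-scrub"
  # avoids deep-scrub load without blocking backfill the way "norebalance" does.
  command: "ceph --cluster {{ cluster }} osd set {{ item }}"
  with_items:
    - noout
    - nodeep-scrub
  delegate_to: "{{ groups['mons'][0] }}"
  run_once: true

- name: unset osd flags once the upgrade is finished
  command: "ceph --cluster {{ cluster }} osd unset {{ item }}"
  with_items:
    - noout
    - nodeep-scrub
  delegate_to: "{{ groups['mons'][0] }}"
  run_once: true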

Issue with "norebalance":
After an OSD upgrade, the playbook waits for the "active+clean" state (the number of retries is defined by the user).
When the amount of data and the retry count are large, the upgrade can be stuck there for a long period.
e.g.
In one of our clusters the retry count was 10000 and the upgrade was stuck for 2 days because of the cluster status.

FAILED - RETRYING: waiting for clean pgs... (93858 retries left)
FAILED - RETRYING: waiting for clean pgs... (93857 retries left)
FAILED - RETRYING: waiting for clean pgs... (93856 retries left)
and so on
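
For context, the wait that produces this output is essentially a status poll with a retry budget, comparing the number of active+clean PGs against the total. A rough sketch of a task of that shape (variable and group names are assumptions, not the playbook's exact code):

# Sketch only: polls the cluster status until every PG is active+clean,
# giving up after the configured number of retries.
- name: waiting for clean pgs...
  command: "ceph --cluster {{ cluster }} -s --format json"
  register: ceph_status
  delegate_to: "{{ groups['mons'][0] }}"
  run_once: true
  retries: "{{ health_osd_check_retries }}"
  delay: "{{ health_osd_check_delay }}"
  until: >
    ((ceph_status.stdout | from_json).pgmap.pgs_by_state
     | selectattr('state_name', 'equalto', 'active+clean')
     | map(attribute='count') | list | sum)
    == (ceph_status.stdout | from_json).pgmap.num_pgs

With "norebalance" set, the misplaced PGs never finish backfilling, so this condition is never met and the task keeps retrying until the budget runs out.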

PG status during those 2 days:

pgs: 40437/247052355 objects misplaced (0.016%)
4284 active+clean
2 active+undersized+remapped+backfilling
1 active+remapped+backfilling
1 active+remapped+backfill_wait

As "norebalance" was set backfilling was suspended.

Version-Release number of selected component (if applicable):
=============================================================
ceph-ansible-3.2.15-1.el7cp.noarch

How reproducible:
=================
always

Comment 12 Vasishta 2020-03-17 15:32:58 UTC
Working fine with ceph-ansible-3.2.40-1
Moving to VERIFIED state.

Comment 15 errata-xmlrpc 2020-04-06 08:27:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1320