Description of problem:
=======================
When we use rolling_update.yml to update/upgrade a cluster, it sets two flags: "noout" and "norebalance". IMHO, during rolling_update we should set the "nodeep-scrub" flag rather than "norebalance" (more on flags: https://docs.ceph.com/docs/mimic/rados/operations/health-checks/#osdmap-flags).

Issue with "norebalance":
After an OSD upgrade, the playbook waits for the "active+clean" state (the number of retries is defined by the user). When the amount of data and the retry count are both large, the upgrade can be stuck there for a long period. E.g. in one of our clusters the retry count was 10000 and the upgrade was stuck for 2 days due to the cluster status.

FAILED - RETRYING: waiting for clean pgs... (93858 retries left)
FAILED - RETRYING: waiting for clean pgs... (93857 retries left)
FAILED - RETRYING: waiting for clean pgs... (93856 retries left)
and so on

pg status for 2 days:

  pgs: 40437/247052355 objects misplaced (0.016%)
       4284 active+clean
       2    active+undersized+remapped+backfilling
       1    active+remapped+backfilling
       1    active+remapped+backfill_wait

Because "norebalance" was set, backfilling was suspended, so the PGs could never reach "active+clean".

Version-Release number of selected component (if applicable):
=============================================================
ceph-ansible-3.2.15-1.el7cp.noarch

How reproducible:
=================
always
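The interaction described above can be sketched in shell. The flag names ("noout", "norebalance", "nodeep-scrub") are real OSD map flags, but the loop, variable names, and values below are illustrative only, not ceph-ansible's actual tasks:

```shell
#!/bin/sh
# Flags rolling_update.yml sets today (per this report): noout + norebalance.
# The suggestion is nodeep-scrub instead of norebalance, because norebalance
# suspends backfill, so remapped PGs never return to active+clean.
#
#   ceph osd set noout          # don't mark stopped OSDs out during restarts
#   ceph osd set nodeep-scrub   # suggested: only skip deep scrubs
#   (instead of: ceph osd set norebalance)

# Illustrative retry loop, similar in spirit to the playbook's
# "waiting for clean pgs" task (names/values here are hypothetical).
retries=10000                              # a large user-set retry count
left=$retries
pg_state="active+remapped+backfill_wait"   # placeholder for parsed `ceph pg stat`

while [ "$left" -gt 0 ]; do
    if [ "$pg_state" = "active+clean" ]; then
        break
    fi
    left=$((left - 1))
    echo "FAILED - RETRYING: waiting for clean pgs... ($left retries left)"
    # With norebalance set, backfill stays suspended, so in the failing case
    # this loop never observes active+clean until the retries run out.
    pg_state="active+clean"   # pretend recovery completes so the sketch terminates
done
```

With "nodeep-scrub" instead of "norebalance", backfill proceeds normally during the wait, so the loop can terminate long before the retry budget is exhausted.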
Working fine with ceph-ansible-3.2.40-1. Moving to VERIFIED state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1320