Bug 1740463 - rolling update should set "nodeep-scrub" flag instead of "norebalance"
Summary: rolling update should set "nodeep-scrub" flag instead of "norebalance"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Ansible
Version: 3.3
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: z4
Target Release: 3.3
Assignee: Guillaume Abrioux
QA Contact: Vasishta
Docs Contact: Bara Ancincova
URL:
Whiteboard:
Depends On: 1813905
Blocks: 1726135 1727980
 
Reported: 2019-08-13 04:44 UTC by Rachana Patel
Modified: 2023-09-07 20:26 UTC
CC List: 11 users

Fixed In Version: RHEL: ceph-ansible-3.2.39-1.el7cp Ubuntu: ceph-ansible_3.2.39-2redhat1
Doc Type: Bug Fix
Doc Text:
.Upgrading OSDs no longer becomes unresponsive for a long period of time
Previously, when using the `rolling_update.yml` playbook to upgrade an OSD, the playbook waited for the `active+clean` state. When the amount of data and the retry count were large, the upgrade process became unresponsive for a long period of time because the playbook set the `noout` and `norebalance` flags instead of the `noout` and `nodeep-scrub` flags; the `norebalance` flag suspends backfilling, so the placement groups could not become clean. With this update, the playbook sets the correct flags, and the upgrade process is no longer unresponsive for a long period of time.
Clone Of:
Environment:
Last Closed: 2020-04-06 08:27:04 UTC
Embargoed:




Links
Github ceph ceph-ansible pull 4757 (closed): upgrade: use flags noout and nodeep-scrub only (bp #4750) - last updated 2021-01-20 09:42:07 UTC
Red Hat Issue Tracker RHCEPH-7355 - last updated 2023-09-07 20:26:03 UTC
Red Hat Product Errata RHBA-2020:1320 - last updated 2020-04-06 08:27:45 UTC

Description Rachana Patel 2019-08-13 04:44:46 UTC
Description of problem:
=======================
When we use rolling_update.yml to update/upgrade the cluster, it sets two flags: "noout" and "norebalance".
IMHO, during rolling_update we should set the "nodeep-scrub" flag rather than "norebalance".
(more on flags - https://docs.ceph.com/docs/mimic/rados/operations/health-checks/#osdmap-flags)
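
For reference, the flags in question can be set and cleared manually with the standard ceph CLI (this is only an illustration of the flags themselves, not the playbook's actual tasks):

    # flags this report suggests for a rolling update:
    # keep OSDs from being marked out and pause deep scrubs only
    ceph osd set noout
    ceph osd set nodeep-scrub

    # the flag the playbook currently sets; it also pauses backfill/rebalance
    ceph osd set norebalance

    # clear the flags again once the upgrade is done
    ceph osd unset noout
    ceph osd unset nodeep-scrub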

Issue with "norebalance":
After upgrading an OSD, the playbook waits for the "active+clean" state (the number of retries is defined by the user).
When the amount of data and the retry count are large, it can be stuck there for a long period.
e.g.
In one of our clusters the retry count was 10000 and the upgrade was stuck for 2 days due to the status of the cluster.

FAILED - RETRYING: waiting for clean pgs... (93858 retries left)
FAILED - RETRYING: waiting for clean pgs... (93857 retries left)
FAILED - RETRYING: waiting for clean pgs... (93856 retries left)
and so on

PG status during those 2 days:

pgs: 40437/247052355 objects misplaced (0.016%)
4284 active+clean
2 active+undersized+remapped+backfilling
1 active+remapped+backfilling
1 active+remapped+backfill_wait

As "norebalance" was set backfilling was suspended.

Version-Release number of selected component (if applicable):
=============================================================
ceph-ansible-3.2.15-1.el7cp.noarch

How reproducible:
=================
always

Comment 12 Vasishta 2020-03-17 15:32:58 UTC
Working fine with ceph-ansible-3.2.40-1
Moving to VERIFIED state.

Comment 15 errata-xmlrpc 2020-04-06 08:27:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1320

