Bug 1919471

Summary:	[cee/sd][ceph-mgr][balancer] After upgrade from 3.3 to 4.2 balancer fails with unhandled exceptions
Product:	[Red Hat Storage] Red Hat Ceph Storage	Reporter:	Steve Baldwin <sbaldwin>
Component:	Ceph-Mgr Plugins	Assignee:	Neha Ojha <nojha>
Status:	CLOSED ERRATA	QA Contact:	Pawan <pdhiran>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	4.2	CC:	ceph-eng-bugs, ceph-qe-bugs, jdurgin, lithomas, mmuench, nojha, pdhiran, tpetr, tserlin
Target Milestone:	---
Target Release:	4.2z2
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:	ceph-14.2.11-157.el8cp, ceph-14.2.11-157.el7cp	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-06-15 17:13:37 UTC	Type:	Bug

Description Steve Baldwin 2021-01-22 23:47:33 UTC

Description of problem:
After upgrading from 3.3z5 to 4.2, the balancer fails with the following errors:
2021-01-22 00:11:23.362 7f89aad3f700  0 mgr[py] `config set mgrmgr/rbd_support/cephmon1/mirror_snapshot_schedule --` failed: (22) Invalid argument <-- config set - missing space (mgrmgr)
2021-01-22 00:11:23.362 7f89aad3f700  0 mgr[py] mon returned -22: unrecognized config target ''
2021-01-22 00:11:23.412 7f89b6655700 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'balancer' while running on mgr.cephmon1: (44,)
2021-01-22 00:11:23.412 7f89b6655700 -1 balancer.serve:
2021-01-22 00:11:23.412 7f89b6655700 -1 Traceback (most recent call last):
  File "/usr/share/ceph/mgr/balancer/module.py", line 657, in serve
    r, detail = self.optimize(plan)
  File "/usr/share/ceph/mgr/balancer/module.py", line 928, in optimize
    return self.do_crush_compat(plan)
  File "/usr/share/ceph/mgr/balancer/module.py", line 1089, in do_crush_compat
    weight = best_ws[osd]
KeyError: 44


As a result, `ceph status` reports a cluster health error:
      health: HEALTH_ERR
            Module 'balancer' has failed: (44,)
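
The traceback above ends in a plain dictionary lookup, `best_ws[osd]`, raising `KeyError: 44`. The following is a minimal sketch of that failure mode, not the balancer's actual code or the eventual fix. The sample data and the helper `missing_weight_set_osds` are hypothetical; only the names `best_ws` and `osd` come from the traceback. It assumes the weight-set mapping simply has no entry for OSD 44.

```python
# Hypothetical sample data: a weight-set mapping that lacks an entry for OSD 44.
best_ws = {42: 1.0, 43: 0.95}
osd_ids = [42, 43, 44]


def missing_weight_set_osds(osd_ids, best_ws):
    """Return the OSD ids that have no entry in the weight-set mapping."""
    return sorted(osd for osd in osd_ids if osd not in best_ws)


def lookup_error_args(osd, best_ws):
    """Perform the same kind of lookup as the traceback and return the
    exception's args tuple, or None if the lookup succeeds."""
    try:
        best_ws[osd]
    except KeyError as exc:
        return exc.args
    return None


missing = missing_weight_set_osds(osd_ids, best_ws)
error_args = lookup_error_args(44, best_ws)
print("OSDs without a weight-set entry:", missing)
print("KeyError args for OSD 44:", error_args)
```

The args tuple of `KeyError(44)` is `(44,)`, which has the same form as the `(44,)` shown in the health message. Checking membership before indexing, as the helper does, is one way to surface such a mismatch without an unhandled exception.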


Version-Release number of selected component (if applicable):
14.2.11-95 (4.2)

How reproducible:
Unknown; encountered once after the upgrade.

Steps to Reproduce:
1. Upgrade from 3.3z5 to 4.2


Actual results:
The balancer fails post-upgrade with the errors shown in the problem description.

Expected results:
The balancer should not fail after an upgrade to 4.2.

Additional info:
This was a manual upgrade from 3.3z5 to 4.2.

Comment 2 Josh Durgin 2021-01-23 00:12:12 UTC

This is fixed in 4.2: https://tracker.ceph.com/issues/42721

Going forward, the upmap mode of the balancer is the preferred method. Switching to it requires removing the weight-sets and will cause some data movement.
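
A switch to upmap mode might look like the sketch below. It is an assumption-laden outline rather than a verified procedure for this cluster: the command order, the choice to pause the balancer first, and the helper `switch_to_upmap` are illustrative, and upmap requires that all clients be luminous or newer. By default the helper only prints the commands (dry run) so they can be reviewed before anything is executed.

```python
import shlex
import subprocess

# Assumed sequence of standard ceph CLI commands for moving from
# crush-compat to upmap; review against the cluster before running.
UPMAP_SWITCH_COMMANDS = [
    "ceph balancer off",                                # pause balancing during the change
    "ceph osd crush weight-set rm-compat",              # remove the compat weight-set
    "ceph osd set-require-min-compat-client luminous",  # upmap needs luminous+ clients
    "ceph balancer mode upmap",
    "ceph balancer on",
]


def switch_to_upmap(dry_run=True):
    """Print the commands (dry run) or execute them in order.

    Returns the list of argv lists that were printed or run.
    """
    argvs = [shlex.split(cmd) for cmd in UPMAP_SWITCH_COMMANDS]
    for argv in argvs:
        if dry_run:
            print(" ".join(argv))
        else:
            # check=True stops at the first failing command.
            subprocess.run(argv, check=True)
    return argvs


plan = switch_to_upmap(dry_run=True)
```

Removing the weight-set changes effective CRUSH weights, which is where the data movement mentioned above comes from, so it is worth scheduling this for a quiet period.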

Comment 3 Josh Durgin 2021-01-23 00:13:36 UTC

Never mind, I see they're running 4.2, so it may be the same symptom with a different root cause. Could we get their osdmap?

Comment 4 Steve Baldwin 2021-01-23 01:56:13 UTC

Thanks Josh, I have a copy of the osdmap now and will upload the file to the BZ shortly.

Comment 6 Yaniv Kaul 2021-01-24 09:54:26 UTC

Isn't there a missing space in 'config set mgrmgr/rbd_support'? Shouldn't it be 'config set mgr mgr/rbd_support'?

Comment 8 Josh Durgin 2021-01-28 01:08:07 UTC

David, can you take a look? I don't see what's so different between the luminous and nautilus crush-compat balancers here.

Comment 38 errata-xmlrpc 2021-06-15 17:13:37 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 4.2 Security and Bug Fix Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2445