1919471 – [cee/sd][ceph-mgr][balancer] After upgrade from 3.3 to 4.2 balancer fails with unhandled exceptions

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Ceph Storage
Classification:	Red Hat Storage
Component:	Ceph-Mgr Plugins
Sub Component:
Version:	4.2
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	4.2z2
Assignee:	Neha Ojha
QA Contact:	Pawan
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-01-22 23:47 UTC by Steve Baldwin
Modified:	2021-06-15 17:14 UTC
CC List:	9 users
Fixed In Version:	ceph-14.2.11-157.el8cp, ceph-14.2.11-157.el7cp
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-06-15 17:13:37 UTC
Embargoed:
Dependent Products:

Links
System	ID	Priority	Status	Summary	Last Updated
Ceph Project Bug Tracker	49576	None	None	None	2021-03-02 22:34:51 UTC
Github	ceph ceph pull 40128	None	closed	nautilus: pybind/mgr/balancer/module.py: assign weight-sets to all buckets before balancing	2021-05-06 23:02:57 UTC
Red Hat Product Errata	RHSA-2021:2445	None	None	None	2021-06-15 17:14:01 UTC

Description Steve Baldwin 2021-01-22 23:47:33 UTC

Description of problem:
After upgrading from 3.3z5 to 4.2, the balancer fails with the following errors:
2021-01-22 00:11:23.362 7f89aad3f700  0 mgr[py] `config set mgrmgr/rbd_support/cephmon1/mirror_snapshot_schedule --` failed: (22) Invalid argument <-- config set - missing space (mgrmgr)
2021-01-22 00:11:23.362 7f89aad3f700  0 mgr[py] mon returned -22: unrecognized config target ''
2021-01-22 00:11:23.412 7f89b6655700 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'balancer' while running on mgr.cephmon1: (44,)
2021-01-22 00:11:23.412 7f89b6655700 -1 balancer.serve:
2021-01-22 00:11:23.412 7f89b6655700 -1 Traceback (most recent call last):
  File "/usr/share/ceph/mgr/balancer/module.py", line 657, in serve
    r, detail = self.optimize(plan)
  File "/usr/share/ceph/mgr/balancer/module.py", line 928, in optimize
    return self.do_crush_compat(plan)
  File "/usr/share/ceph/mgr/balancer/module.py", line 1089, in do_crush_compat
    weight = best_ws[osd]
KeyError: 44
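
The KeyError above is consistent with the weight-set lookup in do_crush_compat finding no entry for OSD 44. The following is a minimal sketch of that failure mode, not the code in module.py: the data and the helper fill_missing_weights are hypothetical, and falling back to the CRUSH weight is an assumption made only for illustration. The linked upstream pull request describes its fix as assigning weight-sets to all buckets before balancing.

```python
# Hypothetical data: a compat weight-set that covers only some OSDs.
best_ws = {0: 1.0, 1: 0.95, 2: 1.05}                # OSD id -> weight-set weight
crush_weights = {0: 1.0, 1: 1.0, 2: 1.0, 44: 1.0}   # OSD id -> CRUSH weight


def fill_missing_weights(weight_set, crush_weights):
    """Return a copy of weight_set that has an entry for every OSD.

    OSDs missing from the weight-set fall back to their CRUSH weight.
    This default is an assumption for illustration only.
    """
    filled = dict(weight_set)
    for osd, weight in crush_weights.items():
        filled.setdefault(osd, weight)
    return filled


error_args = None
try:
    best_ws[44]                  # same lookup pattern as in the traceback
except KeyError as exc:
    error_args = exc.args        # the tuple (44,), as seen in the health message
    print("KeyError:", exc)

filled = fill_missing_weights(best_ws, crush_weights)
print(filled[44])                # the lookup succeeds once every OSD has an entry
```

The args tuple of the exception, (44,), matches the text shown in the cluster log and the health message.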


This results in ceph status reporting a cluster health error:
      health: HEALTH_ERR
            Module 'balancer' has failed: (44,)


Version-Release number of selected component (if applicable):
14.2.11-95 (4.2)

How reproducible:
Unknown; encountered once after the upgrade.

Steps to Reproduce:
1. Upgrade from 3.3z5 to 4.2


Actual results:
Balancer fails post-upgrade with the errors shown in the problem description.

Expected results:
Balancer should not fail after upgrade to 4.2.

Additional info:
This was a manual upgrade from 3.3z5 -> 4.2

Comment 2 Josh Durgin 2021-01-23 00:12:12 UTC

This is fixed in 4.2: https://tracker.ceph.com/issues/42721

Going forward, the upmap mode of the balancer is the preferred method. Switching to it requires removing the weight-sets and will cause some data movement.
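
As a rough sketch of what the switch to upmap mode could look like, the helper below assembles the upstream Ceph CLI commands for removing the compat weight-set and changing the balancer mode. The function names, the command order, and the "luminous" minimum client level are assumptions for illustration and were not given in this report; check them against the documentation for your release before running anything. By default the script only prints the commands.

```python
import subprocess


def upmap_switch_commands(min_compat_client="luminous"):
    """Build the command list for moving the balancer to upmap mode.

    The order below is an assumption, not a procedure taken from this bug.
    """
    return [
        ["ceph", "balancer", "off"],
        # Remove the compat weight-set left behind by crush-compat mode.
        ["ceph", "osd", "crush", "weight-set", "rm-compat"],
        # upmap needs clients that understand pg-upmap entries.
        ["ceph", "osd", "set-require-min-compat-client", min_compat_client],
        ["ceph", "balancer", "mode", "upmap"],
        ["ceph", "balancer", "on"],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


run(upmap_switch_commands())  # dry run: prints the commands without executing
```

Removing the weight-set changes the effective CRUSH weights, which is why some data movement is expected.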

Comment 3 Josh Durgin 2021-01-23 00:13:36 UTC

Never mind, I see they're running 4.2, so it may be the same symptom with a different root cause. Could we get their osdmap?

Comment 4 Steve Baldwin 2021-01-23 01:56:13 UTC

Thanks Josh, I have a copy of the osdmap now and will upload the file to the BZ shortly.

Comment 6 Yaniv Kaul 2021-01-24 09:54:26 UTC

Isn't there a missing space in 'config set mgrmgr/rbd_support'? Shouldn't it be 'config set mgr mgr/rbd_support'?

Comment 8 Josh Durgin 2021-01-28 01:08:07 UTC

David, can you take a look? I don't see what's so different about luminous vs nautilus crush-compat balancers here.

Comment 38 errata-xmlrpc 2021-06-15 17:13:37 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 4.2 Security and Bug Fix Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2445
