Description of problem:

After upgrading from 3.3z5 to 4.2, the balancer is failing with the following errors:

2021-01-22 00:11:23.362 7f89aad3f700  0 mgr[py] `config set mgrmgr/rbd_support/cephmon1/mirror_snapshot_schedule --` failed: (22) Invalid argument  <-- config set - missing space (mgrmgr)
2021-01-22 00:11:23.362 7f89aad3f700  0 mgr[py] mon returned -22: unrecognized config target ''
2021-01-22 00:11:23.412 7f89b6655700 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'balancer' while running on mgr.cephmon1: (44,)
2021-01-22 00:11:23.412 7f89b6655700 -1 balancer.serve:
2021-01-22 00:11:23.412 7f89b6655700 -1 Traceback (most recent call last):
  File "/usr/share/ceph/mgr/balancer/module.py", line 657, in serve
    r, detail = self.optimize(plan)
  File "/usr/share/ceph/mgr/balancer/module.py", line 928, in optimize
    return self.do_crush_compat(plan)
  File "/usr/share/ceph/mgr/balancer/module.py", line 1089, in do_crush_compat
    weight = best_ws[osd]
KeyError: 44

This results in ceph status reporting a cluster health error:

  health: HEALTH_ERR
          Module 'balancer' has failed: (44,)

Version-Release number of selected component (if applicable):
14.2.11-95 (4.2)

How reproducible:
Unknown; encountered once, after the upgrade.

Steps to Reproduce:
1. Upgrade from 3.3z5 to 4.2

Actual results:
Balancer fails post-upgrade with the errors shown in the problem description.

Expected results:
Balancer should not fail after the upgrade to 4.2.

Additional info:
This was a manual upgrade from 3.3z5 -> 4.2.
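For illustration, the failing line in do_crush_compat is a plain dict lookup, so any OSD id missing from the compat weight-set map raises exactly this exception. A minimal sketch (hypothetical values, not the actual balancer code):

  # best_ws maps OSD id -> weight taken from the compat weight-set.
  best_ws = {0: 1.0, 1: 1.0}    # hypothetical map; OSD 44 has no entry
  osd = 44
  weight = best_ws[osd]         # raises KeyError: 44, matching the traceback
  # A defensive lookup would be best_ws.get(osd, 0.0) instead.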
This is fixed in 4.2: https://tracker.ceph.com/issues/42721

Going forward, the upmap mode of the balancer is the preferred method. Switching to it requires removing the weight-sets, and will cause some data movement; see the sketch below.
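For reference, the switch typically looks like the following (a sketch only; verify the sequence against this cluster before running it, and note that removing the compat weight-set is what triggers the data movement):

  ceph osd set-require-min-compat-client luminous   # upmap requires luminous or newer clients
  ceph balancer off
  ceph osd crush weight-set rm-compat               # remove the compat weight-set
  ceph balancer mode upmap
  ceph balancer on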
Never mind, I see they're running 4.2, so it may be the same symptom with a different root cause. Could we get their osdmap?
Thanks Josh, I have a copy of the osdmap now and will upload the file to the BZ shortly.
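(For anyone following along: the osdmap can be pulled from the cluster and inspected with something like the below, assuming a local output path of ./osdmap.)

  ceph osd getmap -o ./osdmap    # dump the current osdmap to a file
  osdmaptool ./osdmap --print    # print epoch, pools, and OSD entries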
Isn't there a missing space in 'config set mgrmgr/rbd_support'? Shouldn't it be 'config set mgr mgr/rbd_support'?
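That would be consistent with the config target and key being concatenated without a separator somewhere; a purely illustrative sketch (not the actual module code):

  # Hypothetical: building the mon command with a missing space between
  # the target ('mgr') and the key fuses them into 'mgrmgr/...'.
  target = 'mgr'
  key = 'mgr/rbd_support/cephmon1/mirror_snapshot_schedule'
  cmd = 'config set ' + target + key            # -> 'config set mgrmgr/rbd_support/...'
  fixed = 'config set ' + target + ' ' + key    # -> 'config set mgr mgr/rbd_support/...'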
David, can you take a look? I don't see what's so different about luminous vs nautilus crush-compat balancers here.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat Ceph Storage 4.2 Security and Bug Fix Update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2445