Bug 1919471

Summary: [cee/sd][ceph-mgr][balancer] After upgrade from 3.3 to 4.2 balancer fails with unhandled exceptions
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Steve Baldwin <sbaldwin>
Component: Ceph-Mgr Plugins
Assignee: Neha Ojha <nojha>
Status: CLOSED ERRATA
QA Contact: Pawan <pdhiran>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.2
CC: ceph-eng-bugs, ceph-qe-bugs, jdurgin, lithomas, mmuench, nojha, pdhiran, tpetr, tserlin
Target Milestone: ---
Target Release: 4.2z2
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: ceph-14.2.11-157.el8cp, ceph-14.2.11-157.el7cp
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-06-15 17:13:37 UTC
Type: Bug

Description Steve Baldwin 2021-01-22 23:47:33 UTC
Description of problem:
After upgrading from 3.3z5 to 4.2, the balancer fails with the following errors:
2021-01-22 00:11:23.362 7f89aad3f700  0 mgr[py] `config set mgrmgr/rbd_support/cephmon1/mirror_snapshot_schedule --` failed: (22) Invalid argument <-- config set - missing space (mgrmgr)
2021-01-22 00:11:23.362 7f89aad3f700  0 mgr[py] mon returned -22: unrecognized config target ''
2021-01-22 00:11:23.412 7f89b6655700 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'balancer' while running on mgr.cephmon1: (44,)
2021-01-22 00:11:23.412 7f89b6655700 -1 balancer.serve:
2021-01-22 00:11:23.412 7f89b6655700 -1 Traceback (most recent call last):
  File "/usr/share/ceph/mgr/balancer/module.py", line 657, in serve
    r, detail = self.optimize(plan)
  File "/usr/share/ceph/mgr/balancer/module.py", line 928, in optimize
    return self.do_crush_compat(plan)
  File "/usr/share/ceph/mgr/balancer/module.py", line 1089, in do_crush_compat
    weight = best_ws[osd]
KeyError: 44


As a result, ceph status reports a cluster health error:
      health: HEALTH_ERR
            Module 'balancer' has failed: (44,)
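
The KeyError in the traceback above can be illustrated with a minimal sketch. This is not the code from balancer/module.py: the function names and the sample data are hypothetical, and the only thing taken from the report is the pattern `best_ws[osd]` failing for OSD 44. The sketch assumes the weight-set mapping simply has no entry for one OSD id that the loop visits.

```python
# Hypothetical illustration; not the actual balancer/module.py code.
from typing import Dict, Iterable, List, Tuple


def lookup_weights(best_ws: Dict[int, float], osds: Iterable[int]) -> Dict[int, float]:
    """Mimic the failing pattern: index the weight set directly by OSD id."""
    return {osd: best_ws[osd] for osd in osds}  # raises KeyError if an id is absent


def lookup_weights_checked(
    best_ws: Dict[int, float], osds: Iterable[int]
) -> Tuple[Dict[int, float], List[int]]:
    """Defensive variant: collect the ids that are missing instead of raising."""
    found: Dict[int, float] = {}
    missing: List[int] = []
    for osd in osds:
        if osd in best_ws:
            found[osd] = best_ws[osd]
        else:
            missing.append(osd)
    return found, missing


# Made-up data: OSD 44 is in the OSD list but has no weight-set entry.
best_ws = {42: 1.0, 43: 0.95}
osds = [42, 43, 44]

try:
    lookup_weights(best_ws, osds)
except KeyError as exc:
    # The exception's args tuple holds the missing key.
    print("KeyError args:", exc.args)

found, missing = lookup_weights_checked(best_ws, osds)
print("found:", found, "missing:", missing)
```

A KeyError raised for key 44 carries the args tuple (44,), which resembles the "(44,)" shown in the health message. Whether the real module should skip, default, or abort on a missing OSD is a design decision this sketch does not settle.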


Version-Release number of selected component (if applicable):
14.2.11-95 (4.2)

How reproducible:
Unknown. Encountered once, after the upgrade.

Steps to Reproduce:
1. Upgrade from 3.3z5 to 4.2


Actual results:
The balancer fails after the upgrade with the errors shown in the problem description.

Expected results:
The balancer should not fail after the upgrade to 4.2.

Additional info:
This was a manual upgrade from 3.3z5 to 4.2.

Comment 2 Josh Durgin 2021-01-23 00:12:12 UTC
This is fixed in 4.2: https://tracker.ceph.com/issues/42721

Going forward, the upmap mode of the balancer is the preferred method. Switching to it requires removing the weight sets and will cause some data movement.
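
A rough sketch of what switching to upmap mode might look like is shown below. The comment above only states that the weight sets must be removed. The exact command sequence, the ordering, and the helper names (upmap_switch_plan, run_plan) are assumptions for illustration. The sketch defaults to a dry run that prints the commands and executes nothing. Review it against the product documentation before running anything on a cluster, since removing the weight sets triggers data movement.

```python
# Assumed sequence for moving from crush-compat to upmap; verify before use.
import subprocess
from typing import List


def upmap_switch_plan() -> List[List[str]]:
    """Return the ceph CLI invocations, in the order this sketch assumes."""
    return [
        ["ceph", "balancer", "off"],
        # Remove the compat weight set used by crush-compat mode.
        ["ceph", "osd", "crush", "weight-set", "rm-compat"],
        # upmap needs clients that are luminous or newer.
        ["ceph", "osd", "set-require-min-compat-client", "luminous"],
        ["ceph", "balancer", "mode", "upmap"],
        ["ceph", "balancer", "on"],
    ]


def run_plan(plan: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in plan:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


run_plan(upmap_switch_plan())  # dry run: prints the commands only
```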

Comment 3 Josh Durgin 2021-01-23 00:13:36 UTC
Never mind, I see they're running 4.2, so this may be the same symptom with a different root cause. Could we get their osdmap?

Comment 4 Steve Baldwin 2021-01-23 01:56:13 UTC
Thanks Josh, I have a copy of the osdmap now and will upload the file to the BZ shortly.

Comment 6 Yaniv Kaul 2021-01-24 09:54:26 UTC
Isn't there a missing space in 'config set mgrmgr/rbd_support'? Shouldn't it be 'config set mgr mgr/rbd_support'?
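
To make the question concrete, the following sketch reproduces the fused string seen in the log and the spaced form proposed above. It is purely illustrative: how ceph-mgr actually assembles this message is not shown in this report, and build_command is a hypothetical helper. It is also an open question here whether the missing space affects only the log text or the command itself.

```python
# Hypothetical reconstruction of the string in the log; not ceph-mgr code.
def build_command(*tokens: str) -> str:
    """Join command tokens with single spaces so none can fuse together."""
    return " ".join(tokens)


key = "mgr/rbd_support/cephmon1/mirror_snapshot_schedule"

# Concatenating the target "mgr" and the key without a separator gives
# the "mgrmgr/..." form that appears in the log line.
fused = "config set " + "mgr" + key

# Joining tokens gives the form suggested in the comment above.
spaced = build_command("config", "set", "mgr", key)

print(fused)
print(spaced)
```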

Comment 8 Josh Durgin 2021-01-28 01:08:07 UTC
David, can you take a look? I don't see what's so different about luminous vs nautilus crush-compat balancers here.

Comment 38 errata-xmlrpc 2021-06-15 17:13:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 4.2 Security and Bug Fix Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2445