Bug 1919471 - [cee/sd][ceph-mgr][balancer] After upgrade from 3.3 to 4.2 balancer fails with unhandled exceptions
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Mgr Plugins
Version: 4.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.2z2
Assignee: Neha Ojha
QA Contact: Pawan
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-01-22 23:47 UTC by Steve Baldwin
Modified: 2021-06-15 17:14 UTC (History)

Fixed In Version: ceph-14.2.11-157.el8cp, ceph-14.2.11-157.el7cp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-15 17:13:37 UTC
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Ceph Project Bug Tracker | 49576 | 0 | None | None | None | 2021-03-02 22:34:51 UTC
Github | ceph ceph pull 40128 | 0 | None | closed | nautilus: pybind/mgr/balancer/module.py: assign weight-sets to all buckets before balancing | 2021-05-06 23:02:57 UTC
Red Hat Product Errata | RHSA-2021:2445 | 0 | None | None | None | 2021-06-15 17:14:01 UTC

Description Steve Baldwin 2021-01-22 23:47:33 UTC
Description of problem:
After upgrading from 3.3z5 to 4.2, the balancer fails with the following errors:
2021-01-22 00:11:23.362 7f89aad3f700  0 mgr[py] `config set mgrmgr/rbd_support/cephmon1/mirror_snapshot_schedule --` failed: (22) Invalid argument <-- config set - missing space (mgrmgr)
2021-01-22 00:11:23.362 7f89aad3f700  0 mgr[py] mon returned -22: unrecognized config target ''
2021-01-22 00:11:23.412 7f89b6655700 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'balancer' while running on mgr.cephmon1: (44,)
2021-01-22 00:11:23.412 7f89b6655700 -1 balancer.serve:
2021-01-22 00:11:23.412 7f89b6655700 -1 Traceback (most recent call last):
  File "/usr/share/ceph/mgr/balancer/module.py", line 657, in serve
    r, detail = self.optimize(plan)
  File "/usr/share/ceph/mgr/balancer/module.py", line 928, in optimize
    return self.do_crush_compat(plan)
  File "/usr/share/ceph/mgr/balancer/module.py", line 1089, in do_crush_compat
    weight = best_ws[osd]
KeyError: 44


This results in ceph status reporting a cluster health error:
      health: HEALTH_ERR
            Module 'balancer' has failed: (44,)
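
The "(44,)" in the health message has the same shape as the args tuple of a Python KeyError raised for key 44, which is consistent with the traceback above. The sketch below is not the balancer module's code. It only reproduces that kind of lookup failure with made-up data and shows one possible guard. The helper names (lookup_weight, fill_missing_weights), the sample weights, and the choice of falling back to the CRUSH weight are all assumptions for illustration.

```python
# Illustrative only: NOT the code in /usr/share/ceph/mgr/balancer/module.py.
# All names and data below are hypothetical.


def lookup_weight(best_ws, osd):
    """Mimic the failing line from the traceback: weight = best_ws[osd]."""
    return best_ws[osd]


def fill_missing_weights(best_ws, crush_weights):
    """Return a copy of best_ws with an entry for every OSD in crush_weights.

    Assumption: an OSD with no weight-set entry falls back to its CRUSH weight.
    """
    filled = dict(crush_weights)  # start from the CRUSH weights
    filled.update(best_ws)        # existing weight-set entries take precedence
    return filled


# Hypothetical state: OSD 44 is in the CRUSH map but has no weight-set entry.
crush_weights = {43: 1.0, 44: 1.0}
best_ws = {43: 0.95}

try:
    lookup_weight(best_ws, 44)
    failure = None
except KeyError as exc:
    failure = exc.args  # the exception's args tuple

print(failure)  # (44,)

safe_ws = fill_missing_weights(best_ws, crush_weights)
print(safe_ws[44])  # 1.0
```

Whether a fallback like this is appropriate depends on how the weight-sets became incomplete. The upstream pull request linked from this bug describes its change as assigning weight-sets to all buckets before balancing.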


Version-Release number of selected component (if applicable):
14.2.11-95 (4.2)

How reproducible:
Unknown; encountered once after the upgrade.

Steps to Reproduce:
1. Upgrade from 3.3z5 to 4.2


Actual results:
Balancer fails post-upgrade with the errors shown in the problem description.

Expected results:
Balancer should not fail after upgrade to 4.2 

Additional info:
This was a manual upgrade from 3.3z5 -> 4.2

Comment 2 Josh Durgin 2021-01-23 00:12:12 UTC
This is fixed in 4.2: https://tracker.ceph.com/issues/42721

Going forward, the upmap mode of the balancer is the preferred method. Switching to it requires removing the weight-sets and will cause some data movement.
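
For reference, a minimal sketch of what a switch to upmap mode might look like when scripted. The ceph subcommands listed are standard CLI commands, but the ordering, the "luminous" minimum client level, and the helper names (upmap_switch_commands, run_all) are assumptions, not steps prescribed in this bug. The script only prints the commands unless dry_run is set to False.

```python
import subprocess


def upmap_switch_commands(min_client="luminous"):
    """Build the argv lists for a possible crush-compat -> upmap switch.

    Assumption: this ordering is one reasonable sequence, not an official one.
    """
    return [
        ["ceph", "balancer", "off"],
        # Remove the compat weight-set used by crush-compat mode.
        ["ceph", "osd", "crush", "weight-set", "rm-compat"],
        # upmap requires clients that understand pg-upmap entries.
        ["ceph", "osd", "set-require-min-compat-client", min_client],
        ["ceph", "balancer", "mode", "upmap"],
        ["ceph", "balancer", "on"],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


# Dry run: prints the commands without touching the cluster.
run_all(upmap_switch_commands())
```

As noted in the comment above, removing the weight-sets causes data movement, so a dry run and a review of the cluster state before executing anything would be prudent.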

Comment 3 Josh Durgin 2021-01-23 00:13:36 UTC
Never mind, I see they're running 4.2, so it may be the same symptom with a different root cause. Could we get their osdmap?

Comment 4 Steve Baldwin 2021-01-23 01:56:13 UTC
Thanks Josh, I have a copy of the osdmap now and will upload the file to the BZ shortly.

Comment 6 Yaniv Kaul 2021-01-24 09:54:26 UTC
Isn't there a missing space in 'config set mgrmgr/rbd_support'? Shouldn't it be 'config set mgr mgr/rbd_support'?

Comment 8 Josh Durgin 2021-01-28 01:08:07 UTC
David, can you take a look? I don't see what's so different about luminous vs nautilus crush-compat balancers here.

Comment 38 errata-xmlrpc 2021-06-15 17:13:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 4.2 Security and Bug Fix Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2445

