Description of problem:
* Ceph MONs crashing in assert(pg_upmap_items.empty())
* Cluster had the balancer run in upmap mode, but kRBD clients were unable to map RBDs afterwards
* Backed out of upmap mode by changing to crush-compat mode, then removing the pg_upmap_items with:
  # ceph osd rm-pg-upmap-items $pgid
* This appears to have run successfully until the last upmap item was removed, then the MONs started crashing
* Cluster had 3 of 5 MONs crashed
* Removed 2 of the 3 crashing MONs using monmaptool and restarted the 2 surviving MONs
  - The 2 surviving MONs were up for a short time, then crashed with the same assert.

Version-Release number of selected component (if applicable):
RHCS, Ceph version 12.2.5-42.el7cp
RHEL, 7.5
Kernel 3.10.0-862.14.4.el7.x86_64

How reproducible:
Every attempt to start a MON results in a crash

Steps to Reproduce:
1. Enable ceph balancer in pg upmap mode
2. Switch balancer to crush-compat mode
3. Remove pg upmap items
4. Once all upmap items are removed, MONs crash

Actual results:
- MONs crash

Expected results:
- MONs do not crash

Additional info:
2018-10-14 10:28:07.584476 7fb186ef3700 -1 /builddir/build/BUILD/ceph-12.2.5/src/osd/OSDMap.cc: In function 'void OSDMap::encode(ceph::bufferlist&, uint64_t) const' thread 7fb186ef3700 time 2018-10-14 10:28:07.581772
/builddir/build/BUILD/ceph-12.2.5/src/osd/OSDMap.cc: 2551: FAILED assert(pg_upmap_items.empty())

 ceph version 12.2.5-42.el7cp (82d52d7efa6edec70f6a0fc306f40b89265535fb) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55eebe2bc6b0]
 2: (OSDMap::encode(ceph::buffer::list&, unsigned long) const+0xcb1) [0x55eebe3a58a1]
 3: (MOSDMap::encode_payload(unsigned long)+0x390) [0x55eebe22b950]
 4: (Message::encode(unsigned long, int)+0x26) [0x55eebe2f7486]
 5: (AsyncConnection::prepare_send_message(unsigned long, Message*, ceph::buffer::list&)+0x1da) [0x55eebe58ea2a]
 6: (AsyncConnection::handle_write()+0x580) [0x55eebe596970]
 7: (EventCenter::process_events(int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x8bc) [0x55eebe3744ac]
 8: (()+0x6ebb0e) [0x55eebe376b0e]
 9: (()+0xb5070) [0x7fb191107070]
 10: (()+0x7dd5) [0x7fb193460dd5]
 11: (clone()+0x6d) [0x7fb19086bb3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
This crash is being triggered because a client which doesn't support upmap is connecting while the upmap state is not completely gone. Presumably there's a bug in the upmap cleanup, but it can be temporarily worked around by disabling those clients.
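To make the failure mode above concrete, here is a toy Python model of a feature-gated map encoder. It is only a sketch under stated assumptions, not Ceph's actual OSDMap::encode: the class ToyOSDMap, the flag FEATURE_UPMAP, and the dictionary "encoding" are all hypothetical names invented for illustration. The only thing taken from the report is the invariant in the backtrace, that the encode path taken for this peer asserts pg_upmap_items is empty.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical feature bit for this sketch; NOT a real Ceph feature mask value.
FEATURE_UPMAP = 1 << 0


@dataclass
class ToyOSDMap:
    """Toy stand-in for an OSD map that carries per-PG upmap entries."""

    # pgid -> list of (from_osd, to_osd) remappings
    pg_upmap_items: Dict[str, List[Tuple[int, int]]] = field(default_factory=dict)

    def encode(self, peer_features: int) -> dict:
        """Encode the map in a format the connecting peer can decode."""
        if peer_features & FEATURE_UPMAP:
            # Newer format has a field for upmap entries.
            return {"format": "new", "pg_upmap_items": dict(self.pg_upmap_items)}
        # The legacy format has nowhere to put upmap entries, so the encoder
        # insists that none remain (mirrors the assert in the backtrace).
        assert not self.pg_upmap_items, "pg_upmap_items.empty()"
        return {"format": "legacy"}


def try_encode(osdmap: ToyOSDMap, peer_features: int) -> str:
    """Return the format used, or 'crash' if the invariant is violated."""
    try:
        return osdmap.encode(peer_features)["format"]
    except AssertionError:
        return "crash"
```

In this model, a peer without the upmap feature only "crashes" the encoder while at least one entry is left in pg_upmap_items; once the map is empty, the legacy path succeeds. That is consistent with the suggested workaround of keeping such clients disconnected until the upmap state is fully gone.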
Thanks Greg, will pass this along and report back if there are other issues.