Bug 1639415

Summary: MON crash on assert(pg_upmap_items.empty())
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Michael J. Kidd <linuxkidd>
Component: RADOS
Assignee: Josh Durgin <jdurgin>
Status: CLOSED WONTFIX
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 3.0
CC: ceph-eng-bugs, dzafman, gfarnum, kchai, linuxkidd, nojha
Target Milestone: rc
Target Release: 3.*
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-01-23 22:23:40 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Michael J. Kidd 2018-10-15 16:47:24 UTC
Description of problem:
* Ceph MONs are crashing on assert(pg_upmap_items.empty())
* The cluster had the balancer run in upmap mode, but kRBD clients were unable to map RBDs afterwards
* Backed out of upmap mode by changing to crush-compat mode, then removed the pg_upmap_items with:
  # ceph osd rm-pg-upmap-items $pgid
* This appears to have run successfully until the last upmap item was removed, at which point the MONs started crashing
* 3 of the cluster's 5 MONs crashed
* Removed 2 of the 3 crashing MONs using monmaptool and restarted the 2 surviving MONs
 - The 2 surviving MONs were up for a short time, then crashed with the same assert.
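
The removal step described above can be scripted roughly as follows. This is a minimal sketch, not the exact procedure used on the affected cluster: it assumes the plain-text `ceph osd dump` output lists each entry on its own line as `pg_upmap_items <pgid> [...]`, and the helper name `list_upmap_pgids` is hypothetical. In this report the MONs began crashing once the last entry was removed, so the loop is left commented out.

```shell
#!/bin/sh
# Hypothetical helper: read `ceph osd dump` text on stdin and print the
# pgid of every pg_upmap_items entry, one per line.
# Assumption: entries look like "pg_upmap_items <pgid> [<from>,<to>,...]".
list_upmap_pgids() {
    awk '$1 == "pg_upmap_items" { print $2 }'
}

# Usage sketch (commented out because it modifies cluster state):
# ceph osd dump | list_upmap_pgids | while read -r pgid; do
#     ceph osd rm-pg-upmap-items "$pgid"
# done
```

Keeping the parsing in a function that reads stdin makes it possible to check the list of PGs against a saved dump before removing anything.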

Version-Release number of selected component (if applicable):
  RHCS, Ceph version 12.2.5-42.el7cp
  RHEL, 7.5
  Kernel 3.10.0-862.14.4.el7.x86_64

How reproducible:
  Every attempt to start a MON results in a crash

Steps to Reproduce:
1. Enable ceph balancer in pg upmap mode
2. Switch balancer to crush-compat mode
3. Remove pg upmap items
4. Once all upmap items are removed, the MONs crash

Actual results:
- MON crash

Expected results:
- MONs do not crash

Additional info:

2018-10-14 10:28:07.584476 7fb186ef3700 -1 /builddir/build/BUILD/ceph-12.2.5/src/osd/OSDMap.cc: In function 'void OSDMap::encode(ceph::bufferlist&, uint64_t) const' thread 7fb186ef3700 time 2018-10-14 10:28:07.581772
/builddir/build/BUILD/ceph-12.2.5/src/osd/OSDMap.cc: 2551: FAILED assert(pg_upmap_items.empty())

 ceph version 12.2.5-42.el7cp (82d52d7efa6edec70f6a0fc306f40b89265535fb) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55eebe2bc6b0]
 2: (OSDMap::encode(ceph::buffer::list&, unsigned long) const+0xcb1) [0x55eebe3a58a1]
 3: (MOSDMap::encode_payload(unsigned long)+0x390) [0x55eebe22b950]
 4: (Message::encode(unsigned long, int)+0x26) [0x55eebe2f7486]
 5: (AsyncConnection::prepare_send_message(unsigned long, Message*, ceph::buffer::list&)+0x1da) [0x55eebe58ea2a]
 6: (AsyncConnection::handle_write()+0x580) [0x55eebe596970]
 7: (EventCenter::process_events(int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x8bc) [0x55eebe3744ac]
 8: (()+0x6ebb0e) [0x55eebe376b0e]
 9: (()+0xb5070) [0x7fb191107070]
 10: (()+0x7dd5) [0x7fb193460dd5]
 11: (clone()+0x6d) [0x7fb19086bb3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Comment 5 Greg Farnum 2018-10-15 17:40:47 UTC
This crash is being triggered because a client which doesn't support upmap is connecting while the upmap state is not completely gone. Presumably there's a bug in the upmap cleanup, but it can be temporarily worked around by disabling those clients.
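
One way to look for such clients is to inspect the releases reported by `ceph features`. The following is an illustrative sketch only: it assumes the pretty-printed JSON output has one key per line with a top-level `"client"` section, and the helper name `client_releases` is hypothetical. A release older than luminous in the result suggests a client that may not support upmap.

```shell
#!/bin/sh
# Hypothetical helper: read `ceph features` output on stdin and print the
# distinct "release" values found inside the "client" section.
# Assumption: pretty-printed JSON, one key per line.
client_releases() {
    awk '
        # Track which top-level daemon/client section we are in.
        /"(mon|mds|osd|mgr|client)" *:/ { in_client = ($0 ~ /"client"/) }
        # Within the client section, extract the quoted release name.
        in_client && /"release" *:/ {
            split($0, parts, "\"")
            print parts[4]
        }
    ' | sort -u
}

# Usage sketch:
# ceph features | client_releases
```

The upstream upmap documentation also describes `ceph osd set-require-min-compat-client luminous` as a prerequisite for using upmap, which is intended to keep older clients from connecting.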

Comment 6 Michael J. Kidd 2018-10-15 17:41:38 UTC
Thanks Greg, will pass this along and report back if there are other issues.