Bug 1639415 - MON crash on assert(pg_upmap_items.empty())
Summary: MON crash on assert(pg_upmap_items.empty())
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 3.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: 3.*
Assignee: Josh Durgin
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-10-15 16:47 UTC by Michael J. Kidd
Modified: 2021-12-10 17:57 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-23 22:23:40 UTC
Embargoed:


Links
System: Red Hat Issue Tracker
ID: RHCEPH-2726
Private: 0
Priority: None
Status: None
Summary: None
Last Updated: 2021-12-10 17:57:51 UTC

Description Michael J. Kidd 2018-10-15 16:47:24 UTC
Description of problem:
* Ceph MONs are crashing on assert(pg_upmap_items.empty())
* The balancer had been run in upmap mode on this cluster, but kRBD clients were unable to map RBDs afterward
* Upmap mode was backed out by switching the balancer to crush-compat mode and then removing the pg_upmap_items with:
  # ceph osd rm-pg-upmap-items $pgid
* This appears to have run successfully until the last upmap item was removed, at which point the MONs started crashing
* 3 of the cluster's 5 MONs crashed
* 2 of the 3 crashed MONs were removed using monmaptool, and the 2 surviving MONs were restarted
 - The 2 surviving MONs were up for a short time, then crashed with the same assert.

Version-Release number of selected component (if applicable):
  RHCS, Ceph version 12.2.5-42.el7cp
  RHEL, 7.5
  Kernel 3.10.0-862.14.4.el7.x86_64

How reproducible:
  Every attempt to start a MON results in a crash

Steps to Reproduce:
1. Enable ceph balancer in pg upmap mode
2. Switch balancer to crush-compat mode
3. Remove pg upmap items
4. Once all upmap items are removed, the MONs crash
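
Step 3 can be scripted. The following is a minimal Python sketch, not the exact procedure used on the affected cluster. It assumes that the plain-text output of `ceph osd dump` lists each remaining entry on a line of the form `pg_upmap_items <pgid> [...]`, and it only builds the `ceph osd rm-pg-upmap-items` command lines; it does not run them. The helper names and the sample dump are hypothetical.

```python
import re

# Assumed line shape in `ceph osd dump` plain-text output, for example:
#   pg_upmap_items 1.7 [0,3]
UPMAP_LINE = re.compile(r"^pg_upmap_items\s+(\d+\.[0-9a-f]+)\s")


def upmap_pgids(osd_dump_text):
    """Return the pgids that still carry a pg_upmap_items entry."""
    pgids = []
    for line in osd_dump_text.splitlines():
        match = UPMAP_LINE.match(line.strip())
        if match:
            pgids.append(match.group(1))
    return pgids


def removal_commands(osd_dump_text):
    """Build one removal command per remaining pg_upmap_items entry."""
    return ["ceph osd rm-pg-upmap-items " + pgid
            for pgid in upmap_pgids(osd_dump_text)]


# Illustrative input only; this is not output from the reported cluster.
SAMPLE_DUMP = """\
epoch 1234
pool 1 'rbd' replicated size 3 min_size 2
pg_upmap_items 1.7 [0,3]
pg_upmap_items 1.1f [2,5,4,6]
"""

if __name__ == "__main__":
    for command in removal_commands(SAMPLE_DUMP):
        print(command)
```

An empty result from `upmap_pgids` on a fresh dump would indicate that no pg_upmap_items entries remain in the map, under the same output-format assumption.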

Actual results:
- MON crash

Expected results:
- MONs do not crash

Additional info:

2018-10-14 10:28:07.584476 7fb186ef3700 -1 /builddir/build/BUILD/ceph-12.2.5/src/osd/OSDMap.cc: In function 'void OSDMap::encode(ceph::bufferlist&, uint64_t) const' thread 7fb186ef3700 time 2018-10-14 10:28:07.581772
/builddir/build/BUILD/ceph-12.2.5/src/osd/OSDMap.cc: 2551: FAILED assert(pg_upmap_items.empty())

 ceph version 12.2.5-42.el7cp (82d52d7efa6edec70f6a0fc306f40b89265535fb) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55eebe2bc6b0]
 2: (OSDMap::encode(ceph::buffer::list&, unsigned long) const+0xcb1) [0x55eebe3a58a1]
 3: (MOSDMap::encode_payload(unsigned long)+0x390) [0x55eebe22b950]
 4: (Message::encode(unsigned long, int)+0x26) [0x55eebe2f7486]
 5: (AsyncConnection::prepare_send_message(unsigned long, Message*, ceph::buffer::list&)+0x1da) [0x55eebe58ea2a]
 6: (AsyncConnection::handle_write()+0x580) [0x55eebe596970]
 7: (EventCenter::process_events(int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x8bc) [0x55eebe3744ac]
 8: (()+0x6ebb0e) [0x55eebe376b0e]
 9: (()+0xb5070) [0x7fb191107070]
 10: (()+0x7dd5) [0x7fb193460dd5]
 11: (clone()+0x6d) [0x7fb19086bb3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Comment 5 Greg Farnum 2018-10-15 17:40:47 UTC
This crash is being triggered because a client which doesn't support upmap is connecting while the upmap state is not completely gone. Presumably there's a bug in the upmap cleanup, but it can be temporarily worked around by disabling those clients.
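
The condition described in this comment can be illustrated with a toy model. This is a simplified Python sketch of the explanation above, not the Ceph source: the function, the exception, and the returned dictionaries are hypothetical stand-ins for the encode step that appears in the backtrace.

```python
class UpmapStateRemains(AssertionError):
    """Stands in for the failed assert(pg_upmap_items.empty())."""


def encode_osdmap(pg_upmap_items, peer_supports_upmap):
    """Toy model: encode a map for a peer that may or may not know upmap."""
    if peer_supports_upmap:
        # A peer that understands upmap can be sent the entries as-is.
        return {"format": "with-upmap", "pg_upmap_items": dict(pg_upmap_items)}
    # The older encoding has no place for upmap entries, so none may remain.
    if pg_upmap_items:
        raise UpmapStateRemains("assert(pg_upmap_items.empty())")
    return {"format": "legacy"}


if __name__ == "__main__":
    leftover = {"1.7": [(0, 3)]}
    encode_osdmap(leftover, peer_supports_upmap=True)   # succeeds
    encode_osdmap({}, peer_supports_upmap=False)        # succeeds
    try:
        encode_osdmap(leftover, peer_supports_upmap=False)
    except UpmapStateRemains as err:
        print("MON would abort here:", err)
```

In this model the failure needs both conditions at once: leftover upmap state and a peer without upmap support. That is why keeping such clients from connecting works as a temporary workaround.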

Comment 6 Michael J. Kidd 2018-10-15 17:41:38 UTC
Thanks Greg, will pass this along and report back if there are other issues.

