Bug 2302201 - Mon Pods are in CrashLoopBackOff with msg /builddir/build/BUILD/ceph-19.1.0-0-g9025b9024ba/src/osd/OSDMap.cc: 3286: FAILED ceph_assert(pg_upmap_primaries.empty())
Summary: Mon Pods are in CrashLoopBackOff with msg /builddir/build/BUILD/ceph-19.1.0-0...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.17
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ODF 4.17.0
Assignee: Radoslaw Zarzynski
QA Contact: Pratik Surve
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-08-01 05:22 UTC by Pratik Surve
Modified: 2025-02-28 04:25 UTC
CC: 9 users

Fixed In Version: 4.17.0-80
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-10-30 14:29:40 UTC
Embargoed:


Attachments: None


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OCSBZM-8787 0 None None None 2024-08-01 05:24:08 UTC
Red Hat Product Errata RHSA-2024:8676 0 None None None 2024-10-30 14:29:41 UTC

Description Pratik Surve 2024-08-01 05:22:41 UTC
Description of problem (please be as detailed as possible and provide log snippets):
On a freshly deployed cluster, the mon pods are in CrashLoopBackOff with the message: /builddir/build/BUILD/ceph-19.1.0-0-g9025b9024ba/src/osd/OSDMap.cc: 3286: FAILED ceph_assert(pg_upmap_primaries.empty())
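For reference, a minimal hypothetical C++ sketch of the guard this assert suggests (not the actual Ceph source; the type, member names, and feature bit are illustrative only): OSDMap::encode() refuses to emit an older-format map while pg_upmap_primaries (per-PG primary overrides) is non-empty, because that encoding has no field to carry the entries.

  // Hypothetical sketch only -- not the real OSDMap::encode() from Ceph.
  #include <cassert>
  #include <cstdint>
  #include <map>

  struct OSDMapSketch {
    // Per-PG primary overrides; pgid -> primary OSD id.
    std::map<uint64_t, int> pg_upmap_primaries;

    void encode(uint64_t features) const {
      const uint64_t FEATURE_UPMAP_PRIMARY = 1ULL << 40;  // placeholder bit
      if (!(features & FEATURE_UPMAP_PRIMARY)) {
        // The older wire format has no field for primary upmaps, so the map
        // must be empty here -- this is the shape of the assert in the log.
        assert(pg_upmap_primaries.empty());
        // ... emit legacy encoding ...
        return;
      }
      // ... emit current encoding, including pg_upmap_primaries ...
    }
  };

  int main() {
    OSDMapSketch m;
    m.pg_upmap_primaries[1] = 3;  // one primary override is enough
    m.encode(0);                  // peer without the feature -> assert aborts
  }

The backtrace below shows the mon hitting this check while re-encoding a full map for a subscriber (OSDMonitor::reencode_full_map -> OSDMap::encode), which matches that shape.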


Version of all relevant components (if applicable):

OCP version:- 4.17.0-0.nightly-2024-07-31-035751
ODF version:- 4.17.0-57
CEPH version:- ceph version 19.1.0-0-g9025b9024ba (9025b9024baf597d63005552b5ee004013630404) squid (rc)
ACM version:- 2.12.0-25
SUBMARINER version:- v0.18.0

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Yes

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy a cluster over VMware
2. Install ODF 4.17


Actual results:

mgrc handle_mgr_map Active mgr is now [v2:10.129.2.31:6800/76971175,v1:10.129.2.31:6801/76971175]
debug     -2> 2024-08-01T05:16:39.170+0000 7f41654cc640  5 mon.a@0(leader).paxos(paxos active c 1..435) is_readable = 1 - now=2024-08-01T05:16:39.172408+0000 lease_expire=2024-08-01T05:16:44.074475+0000 has v0 lc 435
debug     -1> 2024-08-01T05:16:39.173+0000 7f41654cc640 -1 /builddir/build/BUILD/ceph-19.1.0-0-g9025b9024ba/src/osd/OSDMap.cc: In function 'void OSDMap::encode(ceph::buffer::v15_2_0::list&, uint64_t) const' thread 7f41654cc640 time 2024-08-01T05:16:39.173718+0000
/builddir/build/BUILD/ceph-19.1.0-0-g9025b9024ba/src/osd/OSDMap.cc: 3286: FAILED ceph_assert(pg_upmap_primaries.empty())

 ceph version 19.1.0-0-g9025b9024ba (9025b9024baf597d63005552b5ee004013630404) squid (rc)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x12e) [0x7f416d9e9f62]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x182120) [0x7f416d9ea120]
 3: /usr/lib64/ceph/libceph-common.so.2(+0x1c244b) [0x7f416da2a44b]
 4: (OSDMonitor::reencode_full_map(ceph::buffer::v15_2_0::list&, unsigned long)+0xdf) [0x5589340e22df]
 5: (OSDMonitor::get_version_full(unsigned long, unsigned long, ceph::buffer::v15_2_0::list&)+0x1e6) [0x5589340e3e26]
 6: (OSDMonitor::build_latest_full(unsigned long)+0x133) [0x5589340e3fb3]
 7: (OSDMonitor::check_osdmap_sub(Subscription*)+0x73) [0x5589340e6a73]
 8: (Monitor::handle_subscribe(boost::intrusive_ptr<MonOpRequest>)+0x1190) [0x558933f9f260]
 9: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x796) [0x558933f98df6]
 10: (Monitor::_ms_dispatch(Message*)+0x42f) [0x558933f99f8f]
 11: ceph-mon(+0x260a6e) [0x558933f53a6e]
 12: (DispatchQueue::entry()+0x542) [0x7f416dbe4602]
 13: /usr/lib64/ceph/libceph-common.so.2(+0x410421) [0x7f416dc78421]
 14: /lib64/libc.so.6(+0x89c02) [0x7f416d171c02]
 15: /lib64/libc.so.6(+0x10ec40) [0x7f416d1f6c40]

debug      0> 2024-08-01T05:16:39.174+0000 7f41654cc640 -1 *** Caught signal (Aborted) **
 in thread 7f41654cc640 thread_name:ms_dispatch

 ceph version 19.1.0-0-g9025b9024ba (9025b9024baf597d63005552b5ee004013630404) squid (rc)
 1: /lib64/libc.so.6(+0x3e6f0) [0x7f416d1266f0]
 2: /lib64/libc.so.6(+0x8b94c) [0x7f416d17394c]
 3: raise()
 4: abort()
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0x7f416d9e9fbc]
 6: /usr/lib64/ceph/libceph-common.so.2(+0x182120) [0x7f416d9ea120]
 7: /usr/lib64/ceph/libceph-common.so.2(+0x1c244b) [0x7f416da2a44b]
 8: (OSDMonitor::reencode_full_map(ceph::buffer::v15_2_0::list&, unsigned long)+0xdf) [0x5589340e22df]
 9: (OSDMonitor::get_version_full(unsigned long, unsigned long, ceph::buffer::v15_2_0::list&)+0x1e6) [0x5589340e3e26]
 10: (OSDMonitor::build_latest_full(unsigned long)+0x133) [0x5589340e3fb3]
 11: (OSDMonitor::check_osdmap_sub(Subscription*)+0x73) [0x5589340e6a73]
 12: (Monitor::handle_subscribe(boost::intrusive_ptr<MonOpRequest>)+0x1190) [0x558933f9f260]
 13: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x796) [0x558933f98df6]
 14: (Monitor::_ms_dispatch(Message*)+0x42f) [0x558933f99f8f]
 15: ceph-mon(+0x260a6e) [0x558933f53a6e]
 16: (DispatchQueue::entry()+0x542) [0x7f416dbe4602]
 17: /usr/lib64/ceph/libceph-common.so.2(+0x410421) [0x7f416dc78421]
 18: /lib64/libc.so.6(+0x89c02) [0x7f416d171c02]
 19: /lib64/libc.so.6(+0x10ec40) [0x7f416d1f6c40]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   3/ 5 mds_quiesce
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 rbd_pwl
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 immutable_obj_cache
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 0 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 rgw_sync
   1/ 5 rgw_datacache
   1/ 5 rgw_access
   1/ 5 rgw_dbstore
   1/ 5 rgw_flight
   1/ 5 rgw_lifecycle
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   1/ 5 fuse
   2/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
   1/ 5 prioritycache
   0/ 5 test
   0/ 5 cephfs_mirror
   0/ 5 cephsqlite
   0/ 5 crimson_interrupt
   0/ 5 seastore
   0/ 5 seastore_onode
   0/ 5 seastore_odata
   0/ 5 seastore_omap
   0/ 5 seastore_tm
   0/ 5 seastore_t
   0/ 5 seastore_cleaner
   0/ 5 seastore_epm
   0/ 5 seastore_lba
   0/ 5 seastore_fixedkv_tree
   0/ 5 seastore_cache
   0/ 5 seastore_journal
   0/ 5 seastore_device
   0/ 5 seastore_backref
   0/ 5 alienstore
   1/ 5 mclock
   0/ 5 cyanstore
   1/ 5 ceph_exporter
   1/ 5 memstore
   1/ 5 trace
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
  7f4161cc5640 / ms_dispatch
  7f41624c6640 / ceph-mon
  7f4162cc7640 / fn_monstore
  7f41634c8640 / msgr-worker-0
  7f4163cc9640 / msgr-worker-1
  7f41654cc640 / ms_dispatch
  7f4167cd1640 / safe_timer
  7f41694d4640 / msgr-worker-2
  7f416b695640 / admin_socket
  7f416c6fbb00 / ceph-mon
  max_recent     10000
  max_new         1000
  log_file /var/lib/ceph/crash/2024-08-01T05:16:39.176055Z_488cc8b3-a9e8-4422-bccc-17c758fc1b3a/log
--- end dump of recent events ---

Expected results:
Mon pods should be in the Running state

Additional info:

Comment 11 Sunil Kumar Acharya 2024-08-26 11:12:44 UTC
Please update the RDT flag/text appropriately.

Comment 12 Sunil Kumar Acharya 2024-09-03 05:33:26 UTC
Please update the RDT flag/text appropriately.

Comment 15 errata-xmlrpc 2024-10-30 14:29:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.17.0 Security, Enhancement, & Bug Fix Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:8676

Comment 16 Red Hat Bugzilla 2025-02-28 04:25:23 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

