Bug 2302201

Summary: Mon Pods are in CrashLoopBackOff with msg /builddir/build/BUILD/ceph-19.1.0-0-g9025b9024ba/src/osd/OSDMap.cc: 3286: FAILED ceph_assert(pg_upmap_primaries.empty())
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Pratik Surve <prsurve>
Component: ceph
Sub component: RADOS
Assignee: Radoslaw Zarzynski <rzarzyns>
QA Contact: Pratik Surve <prsurve>
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
CC: bniver, dahorak, muagarwa, nberry, nojha, rzarzyns, sheggodu, sostapov, vavuthu
Version: 4.17
Keywords: Automation, Regression
Target Release: ODF 4.17.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: 4.17.0-80
Doc Type: No Doc Update
Type: Bug
Last Closed: 2024-10-30 14:29:40 UTC

Description Pratik Surve 2024-08-01 05:22:41 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
On a freshly deployed cluster, the mon pods are in CrashLoopBackOff with the message /builddir/build/BUILD/ceph-19.1.0-0-g9025b9024ba/src/osd/OSDMap.cc: 3286: FAILED ceph_assert(pg_upmap_primaries.empty())
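
The assert and backtrace can be pulled from the crashed container itself. A minimal sketch, assuming the default openshift-storage namespace and Rook naming (the pod name suffix is a placeholder):

    # dump the log of the previously crashed "mon" container; it ends with the assert and backtrace quoted below
    oc logs -n openshift-storage rook-ceph-mon-a-<suffix> -c mon --previous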


Version of all relevant components (if applicable):

OCP version:- 4.17.0-0.nightly-2024-07-31-035751
ODF version:- 4.17.0-57
CEPH version:- ceph version 19.1.0-0-g9025b9024ba (9025b9024baf597d63005552b5ee004013630404) squid (rc)
ACM version:- 2.12.0-25
SUBMARINER version:- v0.18.0

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy a cluster over VMware.
2. Install ODF 4.17.
3. Check the status of the mon pods (see the sketch below).
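
A minimal way to perform the check in step 3, assuming the default openshift-storage namespace and the standard Rook mon pod labels:

    # the mon pods show CrashLoopBackOff and a growing restart count
    oc get pods -n openshift-storage -l app=rook-ceph-mon
    # the last state and events show the container aborting on the failed assert
    oc describe pod -n openshift-storage -l app=rook-ceph-mon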


Actual results:

mgrc handle_mgr_map Active mgr is now [v2:10.129.2.31:6800/76971175,v1:10.129.2.31:6801/76971175]
debug     -2> 2024-08-01T05:16:39.170+0000 7f41654cc640  5 mon.a@0(leader).paxos(paxos active c 1..435) is_readable = 1 - now=2024-08-01T05:16:39.172408+0000 lease_expire=2024-08-01T05:16:44.074475+0000 has v0 lc 435
debug     -1> 2024-08-01T05:16:39.173+0000 7f41654cc640 -1 /builddir/build/BUILD/ceph-19.1.0-0-g9025b9024ba/src/osd/OSDMap.cc: In function 'void OSDMap::encode(ceph::buffer::v15_2_0::list&, uint64_t) const' thread 7f41654cc640 time 2024-08-01T05:16:39.173718+0000
/builddir/build/BUILD/ceph-19.1.0-0-g9025b9024ba/src/osd/OSDMap.cc: 3286: FAILED ceph_assert(pg_upmap_primaries.empty())

 ceph version 19.1.0-0-g9025b9024ba (9025b9024baf597d63005552b5ee004013630404) squid (rc)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x12e) [0x7f416d9e9f62]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x182120) [0x7f416d9ea120]
 3: /usr/lib64/ceph/libceph-common.so.2(+0x1c244b) [0x7f416da2a44b]
 4: (OSDMonitor::reencode_full_map(ceph::buffer::v15_2_0::list&, unsigned long)+0xdf) [0x5589340e22df]
 5: (OSDMonitor::get_version_full(unsigned long, unsigned long, ceph::buffer::v15_2_0::list&)+0x1e6) [0x5589340e3e26]
 6: (OSDMonitor::build_latest_full(unsigned long)+0x133) [0x5589340e3fb3]
 7: (OSDMonitor::check_osdmap_sub(Subscription*)+0x73) [0x5589340e6a73]
 8: (Monitor::handle_subscribe(boost::intrusive_ptr<MonOpRequest>)+0x1190) [0x558933f9f260]
 9: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x796) [0x558933f98df6]
 10: (Monitor::_ms_dispatch(Message*)+0x42f) [0x558933f99f8f]
 11: ceph-mon(+0x260a6e) [0x558933f53a6e]
 12: (DispatchQueue::entry()+0x542) [0x7f416dbe4602]
 13: /usr/lib64/ceph/libceph-common.so.2(+0x410421) [0x7f416dc78421]
 14: /lib64/libc.so.6(+0x89c02) [0x7f416d171c02]
 15: /lib64/libc.so.6(+0x10ec40) [0x7f416d1f6c40]

debug      0> 2024-08-01T05:16:39.174+0000 7f41654cc640 -1 *** Caught signal (Aborted) **
 in thread 7f41654cc640 thread_name:ms_dispatch

 ceph version 19.1.0-0-g9025b9024ba (9025b9024baf597d63005552b5ee004013630404) squid (rc)
 1: /lib64/libc.so.6(+0x3e6f0) [0x7f416d1266f0]
 2: /lib64/libc.so.6(+0x8b94c) [0x7f416d17394c]
 3: raise()
 4: abort()
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0x7f416d9e9fbc]
 6: /usr/lib64/ceph/libceph-common.so.2(+0x182120) [0x7f416d9ea120]
 7: /usr/lib64/ceph/libceph-common.so.2(+0x1c244b) [0x7f416da2a44b]
 8: (OSDMonitor::reencode_full_map(ceph::buffer::v15_2_0::list&, unsigned long)+0xdf) [0x5589340e22df]
 9: (OSDMonitor::get_version_full(unsigned long, unsigned long, ceph::buffer::v15_2_0::list&)+0x1e6) [0x5589340e3e26]
 10: (OSDMonitor::build_latest_full(unsigned long)+0x133) [0x5589340e3fb3]
 11: (OSDMonitor::check_osdmap_sub(Subscription*)+0x73) [0x5589340e6a73]
 12: (Monitor::handle_subscribe(boost::intrusive_ptr<MonOpRequest>)+0x1190) [0x558933f9f260]
 13: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x796) [0x558933f98df6]
 14: (Monitor::_ms_dispatch(Message*)+0x42f) [0x558933f99f8f]
 15: ceph-mon(+0x260a6e) [0x558933f53a6e]
 16: (DispatchQueue::entry()+0x542) [0x7f416dbe4602]
 17: /usr/lib64/ceph/libceph-common.so.2(+0x410421) [0x7f416dc78421]
 18: /lib64/libc.so.6(+0x89c02) [0x7f416d171c02]
 19: /lib64/libc.so.6(+0x10ec40) [0x7f416d1f6c40]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   3/ 5 mds_quiesce
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 rbd_pwl
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 immutable_obj_cache
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 0 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 rgw_sync
   1/ 5 rgw_datacache
   1/ 5 rgw_access
   1/ 5 rgw_dbstore
   1/ 5 rgw_flight
   1/ 5 rgw_lifecycle
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   1/ 5 fuse
   2/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
   1/ 5 prioritycache
   0/ 5 test
   0/ 5 cephfs_mirror
   0/ 5 cephsqlite
   0/ 5 crimson_interrupt
   0/ 5 seastore
   0/ 5 seastore_onode
   0/ 5 seastore_odata
   0/ 5 seastore_omap
   0/ 5 seastore_tm
   0/ 5 seastore_t
   0/ 5 seastore_cleaner
   0/ 5 seastore_epm
   0/ 5 seastore_lba
   0/ 5 seastore_fixedkv_tree
   0/ 5 seastore_cache
   0/ 5 seastore_journal
   0/ 5 seastore_device
   0/ 5 seastore_backref
   0/ 5 alienstore
   1/ 5 mclock
   0/ 5 cyanstore
   1/ 5 ceph_exporter
   1/ 5 memstore
   1/ 5 trace
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
  7f4161cc5640 / ms_dispatch
  7f41624c6640 / ceph-mon
  7f4162cc7640 / fn_monstore
  7f41634c8640 / msgr-worker-0
  7f4163cc9640 / msgr-worker-1
  7f41654cc640 / ms_dispatch
  7f4167cd1640 / safe_timer
  7f41694d4640 / msgr-worker-2
  7f416b695640 / admin_socket
  7f416c6fbb00 / ceph-mon
  max_recent     10000
  max_new         1000
  log_file /var/lib/ceph/crash/2024-08-01T05:16:39.176055Z_488cc8b3-a9e8-4422-bccc-17c758fc1b3a/log
--- end dump of recent events ---

Expected results:
The mon pods should be in Running state.

Additional info:
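
The backtrace above shows the monitor asserting while re-encoding a full OSDMap for a subscriber (Monitor::handle_subscribe -> OSDMonitor::build_latest_full -> OSDMonitor::reencode_full_map -> OSDMap::encode), i.e. the assert fires because pg_upmap_primaries is not empty for that encoding. As a triage aid only, and only if a mon quorum can still be reached, any pg-upmap-primary mappings recorded in the OSDMap can be listed and removed from the toolbox. A minimal sketch, assuming the rook-ceph-tools deployment is available and with <pgid> as a placeholder taken from the dump output:

    # list any pg-upmap-primary entries in the current OSDMap
    oc -n openshift-storage rsh deployment/rook-ceph-tools ceph osd dump | grep pg_upmap_prim
    # remove one mapping per reported pgid
    oc -n openshift-storage rsh deployment/rook-ceph-tools ceph osd rm-pg-upmap-primary <pgid>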

Comment 11 Sunil Kumar Acharya 2024-08-26 11:12:44 UTC
Please update the RDT flag/text appropriately.

Comment 12 Sunil Kumar Acharya 2024-09-03 05:33:26 UTC
Please update the RDT flag/text appropriately.

Comment 15 errata-xmlrpc 2024-10-30 14:29:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.17.0 Security, Enhancement, & Bug Fix Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:8676

Comment 16 Red Hat Bugzilla 2025-02-28 04:25:23 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days