Bug 2302201

Summary: Mon Pods are in CrashLoopBackOff with msg /builddir/build/BUILD/ceph-19.1.0-0-g9025b9024ba/src/osd/OSDMap.cc: 3286: FAILED ceph_assert(pg_upmap_primaries.empty())
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Pratik Surve <prsurve>
Component: ceph
Sub component: RADOS
Assignee: Radoslaw Zarzynski <rzarzyns>
QA Contact: Pratik Surve <prsurve>
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
CC: bniver, dahorak, muagarwa, nberry, nojha, rzarzyns, sheggodu, sostapov, vavuthu
Version: 4.17
Keywords: Automation, Regression
Target Release: ODF 4.17.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: 4.17.0-80
Doc Type: No Doc Update
Type: Bug
Last Closed: 2024-10-30 14:29:40 UTC

Description Pratik Surve 2024-08-01 05:22:41 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
On a freshly deployed cluster, the mon pods are in CrashLoopBackOff with the message /builddir/build/BUILD/ceph-19.1.0-0-g9025b9024ba/src/osd/OSDMap.cc: 3286: FAILED ceph_assert(pg_upmap_primaries.empty())
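
The assert and backtrace can be pulled from the crashed container itself. A minimal sketch, assuming the default openshift-storage namespace and Rook naming (the pod name suffix is a placeholder):

    # dump the log of the previously crashed "mon" container; it ends with the assert and backtrace quoted below
    oc logs -n openshift-storage rook-ceph-mon-a-<suffix> -c mon --previous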


Version of all relevant components (if applicable):

OCP version:- 4.17.0-0.nightly-2024-07-31-035751
ODF version:- 4.17.0-57
CEPH version:- ceph version 19.1.0-0-g9025b9024ba (9025b9024baf597d63005552b5ee004013630404) squid (rc)
ACM version:- 2.12.0-25
SUBMARINER version:- v0.18.0

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy a cluster over VMware.
2. Install ODF 4.17.
3. Check the status of the mon pods (see the sketch below).
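
A minimal way to perform the check in step 3, assuming the default openshift-storage namespace and the standard Rook mon pod labels:

    # the mon pods show CrashLoopBackOff and a growing restart count
    oc get pods -n openshift-storage -l app=rook-ceph-mon
    # the last state and events show the container aborting on the failed assert
    oc describe pod -n openshift-storage -l app=rook-ceph-mon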


Actual results:

mgrc handle_mgr_map Active mgr is now [v2:10.129.2.31:6800/76971175,v1:10.129.2.31:6801/76971175]
debug     -2> 2024-08-01T05:16:39.170+0000 7f41654cc640  5 mon.a@0(leader).paxos(paxos active c 1..435) is_readable = 1 - now=2024-08-01T05:16:39.172408+0000 lease_expire=2024-08-01T05:16:44.074475+0000 has v0 lc 435
debug     -1> 2024-08-01T05:16:39.173+0000 7f41654cc640 -1 /builddir/build/BUILD/ceph-19.1.0-0-g9025b9024ba/src/osd/OSDMap.cc: In function 'void OSDMap::encode(ceph::buffer::v15_2_0::list&, uint64_t) const' thread 7f41654cc640 time 2024-08-01T05:16:39.173718+0000
/builddir/build/BUILD/ceph-19.1.0-0-g9025b9024ba/src/osd/OSDMap.cc: 3286: FAILED ceph_assert(pg_upmap_primaries.empty())

 ceph version 19.1.0-0-g9025b9024ba (9025b9024baf597d63005552b5ee004013630404) squid (rc)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x12e) [0x7f416d9e9f62]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x182120) [0x7f416d9ea120]
 3: /usr/lib64/ceph/libceph-common.so.2(+0x1c244b) [0x7f416da2a44b]
 4: (OSDMonitor::reencode_full_map(ceph::buffer::v15_2_0::list&, unsigned long)+0xdf) [0x5589340e22df]
 5: (OSDMonitor::get_version_full(unsigned long, unsigned long, ceph::buffer::v15_2_0::list&)+0x1e6) [0x5589340e3e26]
 6: (OSDMonitor::build_latest_full(unsigned long)+0x133) [0x5589340e3fb3]
 7: (OSDMonitor::check_osdmap_sub(Subscription*)+0x73) [0x5589340e6a73]
 8: (Monitor::handle_subscribe(boost::intrusive_ptr<MonOpRequest>)+0x1190) [0x558933f9f260]
 9: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x796) [0x558933f98df6]
 10: (Monitor::_ms_dispatch(Message*)+0x42f) [0x558933f99f8f]
 11: ceph-mon(+0x260a6e) [0x558933f53a6e]
 12: (DispatchQueue::entry()+0x542) [0x7f416dbe4602]
 13: /usr/lib64/ceph/libceph-common.so.2(+0x410421) [0x7f416dc78421]
 14: /lib64/libc.so.6(+0x89c02) [0x7f416d171c02]
 15: /lib64/libc.so.6(+0x10ec40) [0x7f416d1f6c40]

debug      0> 2024-08-01T05:16:39.174+0000 7f41654cc640 -1 *** Caught signal (Aborted) **
 in thread 7f41654cc640 thread_name:ms_dispatch

 ceph version 19.1.0-0-g9025b9024ba (9025b9024baf597d63005552b5ee004013630404) squid (rc)
 1: /lib64/libc.so.6(+0x3e6f0) [0x7f416d1266f0]
 2: /lib64/libc.so.6(+0x8b94c) [0x7f416d17394c]
 3: raise()
 4: abort()
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0x7f416d9e9fbc]
 6: /usr/lib64/ceph/libceph-common.so.2(+0x182120) [0x7f416d9ea120]
 7: /usr/lib64/ceph/libceph-common.so.2(+0x1c244b) [0x7f416da2a44b]
 8: (OSDMonitor::reencode_full_map(ceph::buffer::v15_2_0::list&, unsigned long)+0xdf) [0x5589340e22df]
 9: (OSDMonitor::get_version_full(unsigned long, unsigned long, ceph::buffer::v15_2_0::list&)+0x1e6) [0x5589340e3e26]
 10: (OSDMonitor::build_latest_full(unsigned long)+0x133) [0x5589340e3fb3]
 11: (OSDMonitor::check_osdmap_sub(Subscription*)+0x73) [0x5589340e6a73]
 12: (Monitor::handle_subscribe(boost::intrusive_ptr<MonOpRequest>)+0x1190) [0x558933f9f260]
 13: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x796) [0x558933f98df6]
 14: (Monitor::_ms_dispatch(Message*)+0x42f) [0x558933f99f8f]
 15: ceph-mon(+0x260a6e) [0x558933f53a6e]
 16: (DispatchQueue::entry()+0x542) [0x7f416dbe4602]
 17: /usr/lib64/ceph/libceph-common.so.2(+0x410421) [0x7f416dc78421]
 18: /lib64/libc.so.6(+0x89c02) [0x7f416d171c02]
 19: /lib64/libc.so.6(+0x10ec40) [0x7f416d1f6c40]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   3/ 5 mds_quiesce
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 rbd_pwl
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 immutable_obj_cache
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 0 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 rgw_sync
   1/ 5 rgw_datacache
   1/ 5 rgw_access
   1/ 5 rgw_dbstore
   1/ 5 rgw_flight
   1/ 5 rgw_lifecycle
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   1/ 5 fuse
   2/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
   1/ 5 prioritycache
   0/ 5 test
   0/ 5 cephfs_mirror
   0/ 5 cephsqlite
   0/ 5 crimson_interrupt
   0/ 5 seastore
   0/ 5 seastore_onode
   0/ 5 seastore_odata
   0/ 5 seastore_omap
   0/ 5 seastore_tm
   0/ 5 seastore_t
   0/ 5 seastore_cleaner
   0/ 5 seastore_epm
   0/ 5 seastore_lba
   0/ 5 seastore_fixedkv_tree
   0/ 5 seastore_cache
   0/ 5 seastore_journal
   0/ 5 seastore_device
   0/ 5 seastore_backref
   0/ 5 alienstore
   1/ 5 mclock
   0/ 5 cyanstore
   1/ 5 ceph_exporter
   1/ 5 memstore
   1/ 5 trace
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
  7f4161cc5640 / ms_dispatch
  7f41624c6640 / ceph-mon
  7f4162cc7640 / fn_monstore
  7f41634c8640 / msgr-worker-0
  7f4163cc9640 / msgr-worker-1
  7f41654cc640 / ms_dispatch
  7f4167cd1640 / safe_timer
  7f41694d4640 / msgr-worker-2
  7f416b695640 / admin_socket
  7f416c6fbb00 / ceph-mon
  max_recent     10000
  max_new         1000
  log_file /var/lib/ceph/crash/2024-08-01T05:16:39.176055Z_488cc8b3-a9e8-4422-bccc-17c758fc1b3a/log
--- end dump of recent events ---

Expected results:
The mon pods should be in Running state.

Additional info:
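
The backtrace above shows the monitor asserting while re-encoding a full OSDMap for a subscriber (Monitor::handle_subscribe -> OSDMonitor::build_latest_full -> OSDMonitor::reencode_full_map -> OSDMap::encode), i.e. the assert fires because pg_upmap_primaries is not empty for that encoding. As a triage aid only, and only if a mon quorum can still be reached, any pg-upmap-primary mappings recorded in the OSDMap can be listed and removed from the toolbox. A minimal sketch, assuming the rook-ceph-tools deployment is available and with <pgid> as a placeholder taken from the dump output:

    # list any pg-upmap-primary entries in the current OSDMap
    oc -n openshift-storage rsh deployment/rook-ceph-tools ceph osd dump | grep pg_upmap_prim
    # remove one mapping per reported pgid
    oc -n openshift-storage rsh deployment/rook-ceph-tools ceph osd rm-pg-upmap-primary <pgid>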

Comment 11 Sunil Kumar Acharya 2024-08-26 11:12:44 UTC
Please update the RDT flag/text appropriately.

Comment 12 Sunil Kumar Acharya 2024-09-03 05:33:26 UTC
Please update the RDT flag/text appropriately.

Comment 15 errata-xmlrpc 2024-10-30 14:29:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.17.0 Security, Enhancement, & Bug Fix Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:8676

Comment 16 Red Hat Bugzilla 2025-02-28 04:25:23 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days