Description of problem (please be as detailed as possible and provide log snippets):
When OCS is deployed using arbiter mode, the mons go into CrashLoopBackOff (CLBO) state.

Version of all relevant components (if applicable):
OCP version: 4.7.0-0.nightly-2021-01-08-032110
OCS version: ocs-operator.v4.7.0-229.ci

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy OCP 4.7 over VMware using LSO
2. Label the worker nodes with zones
3. Create a StorageCluster by selecting the nodes and the arbiter zone

Actual results:
rook-ceph-mon-a-7d5c7797f5-znv9b   0/1   CrashLoopBackOff   14   80m
rook-ceph-mon-b-5fd68b7c95-f847j   0/1   CrashLoopBackOff   15   79m
rook-ceph-mon-c-6b5c8fcdb9-4bd8h   1/1   Running            15   79m
rook-ceph-mon-d-8bdf76849-jbdgg    1/1   Running            16   78m
rook-ceph-mon-e-ff5f79b88-f4jzm    1/1   Running            16   78m

Expected results:
All pods should be up and in Running state.

Additional info:
Snippet from the mon logs (the first captured line is truncated):

ding OSD to matched zone
debug    -3> 2021-01-08 08:36:43.494 7f39cbdaf700  5 mon.b@1(peon).paxos(paxos active c 2009..2713) is_readable = 1 - now=2021-01-08 08:36:43.495166 lease_expire=2021-01-08 08:36:48.260524 has v0 lc 2713
debug    -2> 2021-01-08 08:36:43.494 7f39cbdaf700  2 mon.b@1(peon) e12 send_reply 0x559af1f35a40 0x559af1942900 auth_reply(proto 2 0 (0) Success) v1
debug    -1> 2021-01-08 08:36:43.497 7f39cbdaf700 -1 /builddir/build/BUILD/ceph-14.2.11/src/osd/OSDMap.cc: In function 'void OSDMap::encode(ceph::bufferlist&, uint64_t) const' thread 7f39cbdaf700 time 2021-01-08 08:36:43.496068
/builddir/build/BUILD/ceph-14.2.11/src/osd/OSDMap.cc: 2959: FAILED ceph_assert(target_v >= 9)

 ceph version 14.2.11-95.el8cp (1d6087ae858e7c8e72fe7390c3522c7e0d951240) nautilus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x156) [0x7f39db88f358]
 2: (()+0x276572) [0x7f39db88f572]
 3: (OSDMap::encode(ceph::buffer::v14_2_0::list&, unsigned long) const+0x483) [0x7f39dbd09783]
 4: (OSDMonitor::reencode_full_map(ceph::buffer::v14_2_0::list&, unsigned long)+0x18e) [0x559aedcec27e]
 5: (OSDMonitor::get_version_full(unsigned long, unsigned long, ceph::buffer::v14_2_0::list&)+0x3fe) [0x559aedcedb9e]
 6: (OSDMonitor::build_latest_full(unsigned long)+0x22c) [0x559aedcede3c]
 7: (OSDMonitor::check_osdmap_sub(Subscription*)+0x1f0) [0x559aedcf69f0]
 8: (Monitor::handle_subscribe(boost::intrusive_ptr<MonOpRequest>)+0x1325) [0x559aedba0ed5]
 9: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x445) [0x559aedbc1b95]
 10: (Monitor::_ms_dispatch(Message*)+0x953) [0x559aedbc3603]
 11: (Monitor::ms_dispatch(Message*)+0x2a) [0x559aedbf618a]
 12: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x2a) [0x559aedbf255a]
 13: (DispatchQueue::entry()+0x134a) [0x7f39dbae506a]
 14: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f39dbb9a3f1]
 15: (()+0x814a) [0x7f39d859214a]
 16: (clone()+0x43) [0x7f39d72c9f23]

debug     0> 2021-01-08 08:36:43.499 7f39cbdaf700 -1 *** Caught signal (Aborted) ** in thread 7f39cbdaf700 thread_name:ms_dispatch

 ceph version 14.2.11-95.el8cp (1d6087ae858e7c8e72fe7390c3522c7e0d951240) nautilus (stable)
 1: (()+0x12b20) [0x7f39d859cb20]
 2: (gsignal()+0x10f) [0x7f39d72047ff]
 3: (abort()+0x127) [0x7f39d71eec35]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a7) [0x7f39db88f3a9]
 5: (()+0x276572) [0x7f39db88f572]
 6: (OSDMap::encode(ceph::buffer::v14_2_0::list&, unsigned long) const+0x483) [0x7f39dbd09783]
 7: (OSDMonitor::reencode_full_map(ceph::buffer::v14_2_0::list&, unsigned long)+0x18e) [0x559aedcec27e]
 8: (OSDMonitor::get_version_full(unsigned long, unsigned long, ceph::buffer::v14_2_0::list&)+0x3fe) [0x559aedcedb9e]
 9: (OSDMonitor::build_latest_full(unsigned long)+0x22c) [0x559aedcede3c]
 10: (OSDMonitor::check_osdmap_sub(Subscription*)+0x1f0) [0x559aedcf69f0]
 11: (Monitor::handle_subscribe(boost::intrusive_ptr<MonOpRequest>)+0x1325) [0x559aedba0ed5]
 12: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x445) [0x559aedbc1b95]
 13: (Monitor::_ms_dispatch(Message*)+0x953) [0x559aedbc3603]
 14: (Monitor::ms_dispatch(Message*)+0x2a) [0x559aedbf618a]
 15: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x2a) [0x559aedbf255a]
 16: (DispatchQueue::entry()+0x134a) [0x7f39dbae506a]
 17: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f39dbb9a3f1]
 18: (()+0x814a) [0x7f39d859214a]
 19: (clone()+0x43) [0x7f39d72c9f23]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none   0/ 1 lockdep   0/ 1 context   1/ 1 crush   1/ 5 mds   1/ 5 mds_balancer   1/ 5 mds_locker   1/ 5 mds_log
   1/ 5 mds_log_expire   1/ 5 mds_migrator   0/ 1 buffer   0/ 1 timer   0/ 1 filer   0/ 1 striper   0/ 1 objecter   0/ 5 rados
   0/ 5 rbd   0/ 5 rbd_mirror   0/ 5 rbd_replay   0/ 5 journaler   0/ 5 objectcacher   0/ 5 client   1/ 5 osd   0/ 5 optracker
   0/ 5 objclass   1/ 3 filestore   1/ 3 journal   0/ 0 ms   1/ 5 mon   0/10 monc   1/ 5 paxos   0/ 5 tp
   1/ 5 auth   1/ 5 crypto   1/ 1 finisher   1/ 1 reserver   1/ 5 heartbeatmap   1/ 5 perfcounter   1/ 5 rgw   1/ 5 rgw_sync
   1/10 civetweb   1/ 5 javaclient   1/ 5 asok   1/ 1 throttle   0/ 0 refs   1/ 5 xio   1/ 5 compressor   1/ 5 bluestore
   1/ 5 bluefs   1/ 3 bdev   1/ 5 kstore   4/ 5 rocksdb   4/ 5 leveldb   4/ 5 memdb   1/ 5 kinetic   1/ 5 fuse
   1/ 5 mgr   1/ 5 mgrc   1/ 5 dpdk   1/ 5 eventtrace   1/ 5 prioritycache   0/ 5 test
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
  max_recent 10000
  max_new 1000
  log_file /var/lib/ceph/crash/2021-01-08_08:36:43.499621Z_eb1d72cb-f983-4609-a82e-dd289dbeafe8/log
--- end dump of recent events ---
reraise_fatal: default handler for signal 6 didn't terminate the process?
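For context on the backtrace above: the abort comes from ceph_assert(target_v >= 9) inside OSDMap::encode(), hit while the monitor re-encodes the full OSDMap for a subscribing client in the ms_dispatch thread. The following is a minimal, hypothetical C++ sketch of that failure pattern only; the feature bits and version numbers are placeholders, not the actual Ceph source.

// Minimal sketch (assumed, simplified): the monitor derives the OSDMap
// encoding version from the subscriber's feature bits, but stretch/arbiter
// fields can only be encoded at version 9 or later, so an older negotiated
// version trips the assert and aborts the process.
#include <cassert>
#include <cstdint>

// Placeholder feature bits; the real layout lives in Ceph's include/ceph_features.h.
constexpr uint64_t FEATURE_NAUTILUS     = 1ULL << 0;
constexpr uint64_t FEATURE_STRETCH_MODE = 1ULL << 1;

// Pick an encoding version the client can understand.
uint8_t pick_target_version(uint64_t client_features) {
  return (client_features & FEATURE_STRETCH_MODE) ? 9 : 8;
}

// Stand-in for OSDMap::encode(): stretch-mode data needs encoding v9+.
void encode_full_map(uint64_t client_features, bool stretch_mode_enabled) {
  uint8_t target_v = pick_target_version(client_features);
  if (stretch_mode_enabled) {
    assert(target_v >= 9);  // aborts, mirroring the crash in this report
  }
  // ... write the map at target_v ...
}

int main() {
  // A subscriber that does not advertise the stretch-mode feature bit, while
  // the cluster runs in arbiter/stretch mode, reproduces the abort pattern.
  encode_full_map(FEATURE_NAUTILUS, /*stretch_mode_enabled=*/true);
}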
Looks like the stretch cluster was fully set up, but then the mons started crashing. Greg, can you take a look at the stack?
Hmm, I'd have thought the must-gather output would contain the actual ceph daemon logs from the ceph-mon crashes, but all I'm seeing is backtraces? Anyway, based on those, I think this is due to an inadvertent compatibility mismatch between the monitors and clients. The fix is posted upstream but got delayed going through QA while I tried to add another feature bit; I'm pushing it through now and will have the fix downstream later today. https://github.com/ceph/ceph/pull/38531
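To make the mismatch concrete, here is a rough sketch of the general guard such a fix needs. This is only an illustration with a placeholder feature bit and is not the content of the upstream PR: the sender should only emit the newer, stretch-aware OSDMap encoding when the peer advertises the corresponding feature bit, and an incapable peer should get a downgraded encoding or a handled error rather than an assert that kills the ms_dispatch thread.

// Hypothetical illustration only (not the actual upstream change).
#include <cstdint>
#include <stdexcept>

// Placeholder feature bit; real definitions live in include/ceph_features.h.
constexpr uint64_t FEATURE_STRETCH_MODE = 1ULL << 1;

bool peer_supports_stretch(uint64_t peer_features) {
  return (peer_features & FEATURE_STRETCH_MODE) != 0;
}

// Choose an encoding version the peer can decode; never assert on a
// well-formed but older peer.
uint8_t choose_osdmap_encoding(uint64_t peer_features, bool stretch_enabled) {
  if (!peer_supports_stretch(peer_features)) {
    if (stretch_enabled)
      throw std::runtime_error("peer cannot decode stretch-mode fields");  // handled, not a crash
    return 8;  // pre-stretch encoding
  }
  return 9;    // encoding that carries the stretch-mode fields
}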
Ah, this is the same client error? I overlooked that the base image for Rook wasn't updated. @Boris Is the base image for Rook not always the same RHCS image we use for OCS? If not, can you update the Rook base image as well with RHCS 4.2?
No it’s broken in all the codebases. Give it a few more hours for tests to run in the lab and then I’ll push patches around.
Pushed a patch to ceph-4.2-rhel-patches that should resolve this.
(In reply to Travis Nielsen from comment #9)
> Ah, this is the same client error? I overlooked that the base image for Rook
> wasn't updated.
>
> @Boris Is the base image for Rook not always the same RHCS image we use for
> OCS? If not, can you update the Rook base image as well with RHCS 4.2?

We use the same base image for Rook as we use in OCS; the build pipeline updates it whenever we switch the RHCS image in OCS.
Boris, are we picking up the latest RHCS 4.2 build with OCS 4.7?
Yes, we use the latest RHCS 4.2 (GA) image. The image hasn't changed in a while, though. @Greg Farnum: Did the patch make it into RHCS 4.2, or is it planned for 4.2z1?
I pushed a commit to ceph-4.2-rhel-patches on Sunday (https://gitlab.cee.redhat.com/ceph/ceph/-/commit/6b378160f6949011d7232a2102e15e668adf4d6b). I thought I saw an automated email suggesting it had been built, but I can't find it now, and the steps after I push to the patches repo are pretty much invisible to me. I'm not sure what happens after that in the pipeline, given all the build changes that have been going on, but that's usually all we have to do for things to appear.
fixing direction of BZ dependency
The latest container build (4.2 GA) is from Dec 18, so it doesn't contain that fix. We will have to include 4.2z1 in OCS 4.7 to fix this.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2041