Bug 1914159 - When OCS is deployed using arbiter mode, mons go into CLBO state, ceph version = 14.2.11-95
Summary: When OCS is deployed using arbiter mode, mons go into CLBO state, ceph version = 14.2.11-95
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ceph
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: OCS 4.7.0
Assignee: Greg Farnum
QA Contact: Pratik Surve
URL:
Whiteboard:
Depends On: 1917374
Blocks:
 
Reported: 2021-01-08 08:52 UTC by Pratik Surve
Modified: 2021-06-01 08:43 UTC
CC List: 11 users

Fixed In Version: ocs-registry:4.7.0-247.ci
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1917374 (view as bug list)
Environment:
Last Closed: 2021-05-19 09:17:47 UTC
Embargoed:




Links:
Ceph Project Bug Tracker 48489 (last updated 2021-01-08 18:33:12 UTC)
Red Hat Product Errata RHSA-2021:2041 (last updated 2021-05-19 09:18:12 UTC)

Description Pratik Surve 2021-01-08 08:52:21 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

When OCS is deployed using arbiter mode, the mons go into a CrashLoopBackOff (CLBO) state.

Version of all relevant components (if applicable):
OCP version:- 4.7.0-0.nightly-2021-01-08-032110
OCS version:- ocs-operator.v4.7.0-229.ci

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
yes

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
yes

Can this issue be reproduced from the UI?
yes

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy OCP 4.7 over VMware using LSO
2. Label the worker nodes with zones
3. Create a StorageCluster by selecting the nodes and an arbiter zone


Actual results:
rook-ceph-mon-a-7d5c7797f5-znv9b                                  0/1     CrashLoopBackOff   14         80m
rook-ceph-mon-b-5fd68b7c95-f847j                                  0/1     CrashLoopBackOff   15         79m
rook-ceph-mon-c-6b5c8fcdb9-4bd8h                                  1/1     Running            15         79m
rook-ceph-mon-d-8bdf76849-jbdgg                                   1/1     Running            16         78m
rook-ceph-mon-e-ff5f79b88-f4jzm                                   1/1     Running            16         78m



Expected results:
All pods should be up and in the Running state.

Additional info:
Snippet from the mon logs:

ding OSD to matched zone
debug     -3> 2021-01-08 08:36:43.494 7f39cbdaf700  5 mon.b@1(peon).paxos(paxos active c 2009..2713) is_readable = 1 - now=2021-01-08 08:36:43.495166 lease_expire=2021-01-08 08:36:48.260524 has v0 lc 2713
debug     -2> 2021-01-08 08:36:43.494 7f39cbdaf700  2 mon.b@1(peon) e12 send_reply 0x559af1f35a40 0x559af1942900 auth_reply(proto 2 0 (0) Success) v1
debug     -1> 2021-01-08 08:36:43.497 7f39cbdaf700 -1 /builddir/build/BUILD/ceph-14.2.11/src/osd/OSDMap.cc: In function 'void OSDMap::encode(ceph::bufferlist&, uint64_t) const' thread 7f39cbdaf700 time 2021-01-08 08:36:43.496068
/builddir/build/BUILD/ceph-14.2.11/src/osd/OSDMap.cc: 2959: FAILED ceph_assert(target_v >= 9)

 ceph version 14.2.11-95.el8cp (1d6087ae858e7c8e72fe7390c3522c7e0d951240) nautilus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x156) [0x7f39db88f358]
 2: (()+0x276572) [0x7f39db88f572]
 3: (OSDMap::encode(ceph::buffer::v14_2_0::list&, unsigned long) const+0x483) [0x7f39dbd09783]
 4: (OSDMonitor::reencode_full_map(ceph::buffer::v14_2_0::list&, unsigned long)+0x18e) [0x559aedcec27e]
 5: (OSDMonitor::get_version_full(unsigned long, unsigned long, ceph::buffer::v14_2_0::list&)+0x3fe) [0x559aedcedb9e]
 6: (OSDMonitor::build_latest_full(unsigned long)+0x22c) [0x559aedcede3c]
 7: (OSDMonitor::check_osdmap_sub(Subscription*)+0x1f0) [0x559aedcf69f0]
 8: (Monitor::handle_subscribe(boost::intrusive_ptr<MonOpRequest>)+0x1325) [0x559aedba0ed5]
 9: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x445) [0x559aedbc1b95]
 10: (Monitor::_ms_dispatch(Message*)+0x953) [0x559aedbc3603]
 11: (Monitor::ms_dispatch(Message*)+0x2a) [0x559aedbf618a]
 12: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x2a) [0x559aedbf255a]
 13: (DispatchQueue::entry()+0x134a) [0x7f39dbae506a]
 14: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f39dbb9a3f1]
 15: (()+0x814a) [0x7f39d859214a]
 16: (clone()+0x43) [0x7f39d72c9f23]

debug      0> 2021-01-08 08:36:43.499 7f39cbdaf700 -1 *** Caught signal (Aborted) **
 in thread 7f39cbdaf700 thread_name:ms_dispatch

 ceph version 14.2.11-95.el8cp (1d6087ae858e7c8e72fe7390c3522c7e0d951240) nautilus (stable)
 1: (()+0x12b20) [0x7f39d859cb20]
 2: (gsignal()+0x10f) [0x7f39d72047ff]
 3: (abort()+0x127) [0x7f39d71eec35]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a7) [0x7f39db88f3a9]
 5: (()+0x276572) [0x7f39db88f572]
 6: (OSDMap::encode(ceph::buffer::v14_2_0::list&, unsigned long) const+0x483) [0x7f39dbd09783]
 7: (OSDMonitor::reencode_full_map(ceph::buffer::v14_2_0::list&, unsigned long)+0x18e) [0x559aedcec27e]
 8: (OSDMonitor::get_version_full(unsigned long, unsigned long, ceph::buffer::v14_2_0::list&)+0x3fe) [0x559aedcedb9e]
 9: (OSDMonitor::build_latest_full(unsigned long)+0x22c) [0x559aedcede3c]
 10: (OSDMonitor::check_osdmap_sub(Subscription*)+0x1f0) [0x559aedcf69f0]
 11: (Monitor::handle_subscribe(boost::intrusive_ptr<MonOpRequest>)+0x1325) [0x559aedba0ed5]
 12: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x445) [0x559aedbc1b95]
 13: (Monitor::_ms_dispatch(Message*)+0x953) [0x559aedbc3603]
 14: (Monitor::ms_dispatch(Message*)+0x2a) [0x559aedbf618a]
 15: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x2a) [0x559aedbf255a]
 16: (DispatchQueue::entry()+0x134a) [0x7f39dbae506a]
 17: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f39dbb9a3f1]
 18: (()+0x814a) [0x7f39d859214a]
 19: (clone()+0x43) [0x7f39d72c9f23]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 0 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 rgw_sync
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
   1/ 5 prioritycache
   0/ 5 test
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/lib/ceph/crash/2021-01-08_08:36:43.499621Z_eb1d72cb-f983-4609-a82e-dd289dbeafe8/log
--- end dump of recent events ---
reraise_fatal: default handler for signal 6 didn't terminate the process?
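
For context, the "FAILED ceph_assert(target_v >= 9)" above fires inside OSDMap::encode() while the monitor re-encodes a full OSDMap for a subscriber (see the OSDMonitor::reencode_full_map frames in both backtraces). The following is a minimal illustrative C++ sketch of that failure mode, not Ceph's actual code; the feature bit, version numbers, and function names are all hypothetical:

#include <cassert>
#include <cstdint>

// Hypothetical feature bit; real Ceph uses CEPH_FEATURE_* / CEPH_FEATUREMASK_* masks.
constexpr uint64_t FEATURE_NEW_ENCODING = 1ULL << 42;

// The encoder derives the struct version to emit from the peer's feature bits.
uint8_t pick_target_version(uint64_t peer_features) {
  if (peer_features & FEATURE_NEW_ENCODING)
    return 10;  // the peer advertises support for the newer encoding
  return 8;     // otherwise fall back to a legacy encoding
}

// Loosely analogous to OSDMap::encode(bufferlist&, uint64_t features) const.
void encode_full_map(uint64_t peer_features) {
  uint8_t target_v = pick_target_version(peer_features);
  // If the client and monitor builds disagree about a feature bit, the
  // computed version can drop below the minimum this encoder supports,
  // and the daemon aborts -- which is what sent the mons into CrashLoopBackOff.
  assert(target_v >= 9);  // analogous to: FAILED ceph_assert(target_v >= 9)
  // ... serialize the map fields in format target_v ...
}

int main() {
  encode_full_map(FEATURE_NEW_ENCODING);  // fine: target_v = 10
  encode_full_map(0);                     // aborts: target_v = 8 < 9
  return 0;
}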

Comment 7 Travis Nielsen 2021-01-08 16:16:50 UTC
Looks like the stretch cluster was fully set up, but then the mons started crashing.
Greg, can you take a look at the stack?

Comment 8 Greg Farnum 2021-01-08 18:33:15 UTC
Hmm, I'd have thought the must-gather output would contain actual ceph daemon logs from the ceph-mon crashes, but all I'm seeing is backtraces?

Anyway, based on those I think this is due to an inadvertent compatibility mismatch between the monitors and clients. The fix is posted upstream but got delayed getting run through QA while I tried to add on another feature bit; I'm pushing it through now and will have the fix downstream later today.

https://github.com/ceph/ceph/pull/38531
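
To make "inadvertent compatibility mismatch" concrete, here is an illustrative C++ sketch (the bit positions and names are made up; the real masks are CEPH_FEATURE_* values, and the actual change is in the pull request above): if two builds disagree about which bit a feature occupies, the monitor never sees the client opt in and falls back to an encoding version below its supported floor.

#include <cstdint>
#include <iostream>

// Hypothetical bit assignments for the same logical feature in two builds.
constexpr uint64_t FEATURE_X_MON_BUILD    = 1ULL << 42;  // what the monitor tests for
constexpr uint64_t FEATURE_X_CLIENT_BUILD = 1ULL << 43;  // what the client advertises

int main() {
  uint64_t client_features = FEATURE_X_CLIENT_BUILD;  // the client thinks it opted in
  bool mon_sees_feature = client_features & FEATURE_X_MON_BUILD;
  // The monitor never sees bit 42 set, treats the client as legacy-only, and
  // computes an OSDMap encoding version below its floor, so ceph_assert fires.
  std::cout << "monitor sees feature: " << std::boolalpha << mon_sees_feature << "\n";
  return 0;
}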

Comment 9 Travis Nielsen 2021-01-08 18:52:03 UTC
Ah, this is the same client error? I overlooked that the base image for Rook wasn't updated.

@Boris Is the base image for Rook not always the same RHCS image we use for OCS? If not, can you update the Rook base image as well with RHCS 4.2?

Comment 10 Greg Farnum 2021-01-08 21:05:19 UTC
No, it's broken in all the codebases. Give it a few more hours for tests to run in the lab, and then I'll push patches around.

Comment 11 Greg Farnum 2021-01-11 06:50:36 UTC
Pushed a patch to ceph-4.2-rhel-patches that should resolve this

Comment 14 Boris Ranto 2021-01-11 11:41:37 UTC
(In reply to Travis Nielsen from comment #9)
> Ah, this is the same client error? I overlooked that the base image for Rook
> wasn't updated.
> 
> @Boris Is the base image for Rook not always the same RHCS image we use for
> OCS? If not, can you update the Rook base image as well with RHCS 4.2?

We use the same base image for Rook as we use in OCS; the build pipeline updates it whenever we switch the RHCS image in OCS.

Comment 16 Mudit Agarwal 2021-01-15 11:23:27 UTC
Boris, are we picking the latest RHCS 4.2 build with OCS 4.7?

Comment 17 Boris Ranto 2021-01-15 14:02:39 UTC
Yes, we use the latest RHCS 4.2 (GA) image. The image hasn't changed in a while, though.

@Greg Farnum: Did the patch make it into RHCS 4.2 or is it planned for 4.2z1?

Comment 18 Greg Farnum 2021-01-15 17:06:17 UTC
I pushed a commit to ceph-4.2-rhel-patches on Sunday (https://gitlab.cee.redhat.com/ceph/ceph/-/commit/6b378160f6949011d7232a2102e15e668adf4d6b). I thought I saw an automated email suggesting it had been built, but I can't find that now, and the steps after I push to the patches repo are pretty much invisible to me. Not sure what happens after that in the pipeline with all the build changes that have been going on, but that's usually all we have to do for things to appear.

Comment 19 Michael Adam 2021-01-18 14:33:25 UTC
fixing direction of BZ dependency

Comment 20 Boris Ranto 2021-01-18 17:06:30 UTC
The latest container build (4.2 GA) is from Dec 18, so it doesn't contain that fix. We will have to include 4.2z1 in OCS 4.7 to fix this.

Comment 29 errata-xmlrpc 2021-05-19 09:17:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041

