Bug 1939007
| Summary: | [Arbiter] [Tracker for BZ #1939766] After performing drain and undrain, 4 mons and 1 OSD go into CLBO and Ceph is not accessible | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Pratik Surve <prsurve> |
| Component: | ceph | Assignee: | Travis Nielsen <tnielsen> |
| Status: | CLOSED ERRATA | QA Contact: | Pratik Surve <prsurve> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.7 | CC: | bniver, ebenahar, gfarnum, madam, mbukatov, muagarwa, nberry, ocs-bugs, owasserm, shan, tnielsen |
| Target Milestone: | --- | Keywords: | AutomationBackLog, TestBlocker |
| Target Release: | OCS 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| Cloned As: | 1939617, 1939766 (view as bug list) | Environment: | |
| Last Closed: | 2021-08-03 18:15:14 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1939766 | | |
| Bug Blocks: | 1939617, 1941918 | | |
Description (Pratik Surve, 2021-03-15 12:54:41 UTC)
@Greg Could this be related to the other fix just merged, or could you take a look? From the mon-f log [1], I see a ceph crash:

```
debug    -8> 2021-03-15 12:14:43.429 7f462f257700  4 mgrc reconnect Starting new session with [v2:10.131.0.30:6800/49,v1:10.131.0.30:6801/49]
debug    -7> 2021-03-15 12:14:43.432 7f462ea56700  4 mgrc handle_mgr_configure stats_period=5
debug    -6> 2021-03-15 12:14:43.432 7f462ea56700  4 mgrc ms_handle_reset ms_handle_reset con 0x560f78a8d600
debug    -5> 2021-03-15 12:14:43.432 7f462ea56700  4 mgrc reconnect Terminating session with v2:10.131.0.30:6800/49
debug    -4> 2021-03-15 12:14:43.432 7f462ea56700  4 mgrc reconnect waiting to retry connect until 2021-03-15 12:14:44.429805
debug    -3> 2021-03-15 12:14:43.463 7f463225d700  5 mon.f@5(electing).paxos(paxos recovering c 10041..10581) is_readable = 0 - now=2021-03-15 12:14:43.463905 lease_expire=0.000000 has v0 lc 10581
debug    -2> 2021-03-15 12:14:43.463 7f463225d700  5 mon.f@5(electing).paxos(paxos recovering c 10041..10581) is_readable = 0 - now=2021-03-15 12:14:43.463918 lease_expire=0.000000 has v0 lc 10581
debug    -1> 2021-03-15 12:14:44.076 7f463225d700 -1 /builddir/build/BUILD/ceph-14.2.11/src/mon/Monitor.cc: In function 'bool Monitor::session_stretch_allowed(MonSession*, MonOpRequestRef&)' thread 7f463225d700 time 2021-03-15 12:14:44.075007
/builddir/build/BUILD/ceph-14.2.11/src/mon/Monitor.cc: 6771: FAILED ceph_assert(ci != mi->second.crush_loc.end())

 ceph version 14.2.11-133.el8cp (b35842cdf727a690afe60d0a32cdbca7da7171c8) nautilus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x156) [0x7f4641d62ce8]
 2: (()+0x276f02) [0x7f4641d62f02]
 3: (Monitor::session_stretch_allowed(MonSession*, boost::intrusive_ptr<MonOpRequest>&)+0x5e8) [0x560f7694c7f8]
 4: (Monitor::_ms_dispatch(Message*)+0x916) [0x560f7695b606]
 5: (Monitor::ms_dispatch(Message*)+0x2a) [0x560f7698e7fa]
 6: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x2a) [0x560f7698abca]
 7: (DispatchQueue::entry()+0x134a) [0x7f4641fb9fba]
 8: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f46420702e1]
 9: (()+0x815a) [0x7f463ea6215a]
 10: (clone()+0x43) [0x7f463d799f73]

debug     0> 2021-03-15 12:14:44.077 7f463225d700 -1 *** Caught signal (Aborted) **
 in thread 7f463225d700 thread_name:ms_dispatch

 ceph version 14.2.11-133.el8cp (b35842cdf727a690afe60d0a32cdbca7da7171c8) nautilus (stable)
 1: (()+0x12b30) [0x7f463ea6cb30]
 2: (gsignal()+0x10f) [0x7f463d6d484f]
 3: (abort()+0x127) [0x7f463d6bec45]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a7) [0x7f4641d62d39]
 5: (()+0x276f02) [0x7f4641d62f02]
 6: (Monitor::session_stretch_allowed(MonSession*, boost::intrusive_ptr<MonOpRequest>&)+0x5e8) [0x560f7694c7f8]
 7: (Monitor::_ms_dispatch(Message*)+0x916) [0x560f7695b606]
 8: (Monitor::ms_dispatch(Message*)+0x2a) [0x560f7698e7fa]
 9: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x2a) [0x560f7698abca]
 10: (DispatchQueue::entry()+0x134a) [0x7f4641fb9fba]
 11: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f46420702e1]
 12: (()+0x815a) [0x7f463ea6215a]
 13: (clone()+0x43) [0x7f463d799f73]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 0 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 rgw_sync
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
   1/ 5 prioritycache
   0/ 5 test
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/lib/ceph/crash/2021-03-15_12:14:44.077688Z_69a6f89d-bbcf-450f-88aa-716d65406319/log
--- end dump of recent events ---
```

[1] http://rhsqe-repo.lab.eng.blr.redhat.com/ocs4qe/pratik/bz/bz_1939007/mar_15/ocs_logs/quay-io-rhceph-dev-ocs-must-gather-sha256-1ee2f52263ef141c79b3f63b7b1699f04eec7ba370446ec5a033d69d28b98c91/namespaces/openshift-storage/pods/rook-ceph-mon-f-85bf67f477-tx49q/mon/mon/logs/current.log
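(Editorial note: the dump above is also written under `/var/lib/ceph/crash/`, per the `log_file` path at its end, so it can be listed via the standard ceph crash module once a mgr is reachable. A minimal sketch, assuming the rook-ceph toolbox or any pod with the admin keyring is available and that the crash module collected the report; the crash ID is taken from the directory name in the log, not invented:)

```
# List crashes collected by the mgr crash module; IDs encode the timestamp.
ceph crash ls

# Show metadata (assert condition, backtrace, entity name) for this crash.
ceph crash info 2021-03-15_12:14:44.077688Z_69a6f89d-bbcf-450f-88aa-716d65406319
```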
This is a different issue. The assert

> /builddir/build/BUILD/ceph-14.2.11/src/mon/Monitor.cc: 6771: FAILED ceph_assert(ci != mi->second.crush_loc.end())

is because the monitor is trying to compare its own crush location against the incoming OSD's, but the monitor doesn't have a location set (or at least, if there is a location, it doesn't match the stretch bucket divider).

I'm a bit confused because I thought we'd guarded all the ways to add a monitor without a location, but I'm also not clear on exactly what's going on with "drain" here or how the cluster state has been manipulated. Clearly things *were* working for a long time...
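(Editorial note: a minimal sketch of checking for the missing location. `ceph mon dump` is standard; the `ceph mon set_location` command shown is the upstream Pacific stretch-mode shape of "that command" the thread discusses below, and whether this Nautilus backport already exposes it is exactly what was still open at the time, so treat it as illustrative only. `zone=b` matches the zone the operator assigns to the replacement mon in the operator log quoted in the next comment:)

```
# Dump the monmap; in a stretch-mode build each mon entry carries a
# crush_location, and a mon with an empty location (the failed-over
# mon "f" here) is the condition that trips the ceph_assert above.
ceph mon dump --format json-pretty

# Upstream (Pacific) shape of the location command, shown for reference.
ceph mon set_location f zone=b
```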
I see now: the operator saw that a mon was down for too long (10 min) and decided to fail over mon.a [1]:

```
2021-03-15 11:13:11.447296 W | op-mon: mon "a" NOT found in quorum and timeout exceeded, mon will be failed over
2021-03-15 11:13:11.447337 I | op-mon: Failing over monitor "a"
2021-03-15 11:13:11.453246 I | op-mon: scaling the mon "a" deployment to replica 0
2021-03-15 11:13:11.467041 I | op-mon: starting new mon: &{ResourceName:rook-ceph-mon-f DaemonName:f PublicIP: Port:6789 Zone:b DataPathMap:0xc0020eef60}
```

The failed-over mon isn't getting a location set initially. Let's sync up on that command; I know we discussed it a while back.

[1] http://rhsqe-repo.lab.eng.blr.redhat.com/ocs4qe/pratik/bz/bz_1939007/mar_15/ocs_logs/quay-io-rhceph-dev-ocs-must-gather-sha256-1ee2f52263ef141c79b3f63b7b1699f04eec7ba370446ec5a033d69d28b98c91/namespaces/openshift-storage/pods/rook-ceph-operator-7b99bb79b4-8pbll/rook-ceph-operator/rook-ceph-operator/logs/current.log

I'll see if I can come up with a command for this today; we also discussed a workaround option.

When I tried to deploy a machine config for the worker machine config pool, which updates nodes one by one by draining and rebooting them, I hit a similar issue. Multiple pods are in CrashLoopBackOff state:

```
$ oc get pods -n openshift-storage | grep CrashLoopBackOff
rook-ceph-mon-d-6b844b9ccf-mtdsm                                  1/2   CrashLoopBackOff   75    4d
rook-ceph-mon-f-86558ff789-rhvw5                                  1/2   CrashLoopBackOff   159   13h
rook-ceph-osd-1-7655fdd46f-6wzz9                                  1/2   CrashLoopBackOff   196   13h
rook-ceph-osd-3-865bfc7cfb-ptjx5                                  1/2   CrashLoopBackOff   227   13h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-5447657z97bp   1/2   CrashLoopBackOff   271   13h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-67745b452tz6   1/2   CrashLoopBackOff   239   12h
```

And I see a similar traceback in the ceph crash logs:

```
/builddir/build/BUILD/ceph-14.2.11/src/mon/Monitor.cc: 6771: FAILED ceph_assert(ci != mi->second.crush_loc.end())

 ceph version 14.2.11-133.el8cp (b35842cdf727a690afe60d0a32cdbca7da7171c8) nautilus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x156) [0x7f26329e8ce8]
 2: (()+0x276f02) [0x7f26329e8f02]
 3: (Monitor::session_stretch_allowed(MonSession*, boost::intrusive_ptr<MonOpRequest>&)+0x5e8) [0x5652767437f8]
 4: (Monitor::_ms_dispatch(Message*)+0x916) [0x565276752606]
 5: (Monitor::ms_dispatch(Message*)+0x2a) [0x5652767857fa]
 6: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x2a) [0x565276781bca]
 7: (DispatchQueue::entry()+0x134a) [0x7f2632c3ffba]
 8: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f2632cf62e1]
 9: (()+0x815a) [0x7f262f6e815a]
 10: (clone()+0x43) [0x7f262e41ff73]

 0> 2021-03-18 21:55:10.884 7f2622ee3700 -1 *** Caught signal (Aborted) **
 in thread 7f2622ee3700 thread_name:ms_dispatch
```

This was observed on a cluster running:
- OCP 4.7.0-0.nightly-2021-03-12-142313
- LSO 4.8.0-202103130210.p0
- OCS 4.7.0-294.ci

I ran into various issues with this in the past, but when I tried to isolate them, they were all caused by some other problem in the Ceph cluster (e.g. invalid allocation of OSDs within the cluster), and with OCS 4.7.0-280.ci on 2021-03-04, for example, I was able to deploy the machineconfig without disrupting the whole cluster. So if my machineconfig problem and the use case of this bug share the same root cause, it is likely caused by a recent change.
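(Editorial note: to pull the assert out of a crash-looping mon pod like the ones listed above, one option is to read the log of the previous container attempt, since the current attempt may not have crashed yet. A minimal sketch; the pod name is copied from the listing above and the `mon` container name matches Rook's mon pod spec, but verify both against your deployment:)

```
# Grab the backtrace from the attempt that actually hit the assert.
oc -n openshift-storage logs rook-ceph-mon-d-6b844b9ccf-mtdsm -c mon --previous \
  | grep -B2 -A20 'FAILED ceph_assert'
```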
I was also able to create disruption just by isolating the network between zones, but because of other issues (like this one, or issues in OCS QE infra), I wasn't able to reproduce it again to confirm and report it.

As the decision was to take care of this in Ceph after RHCS 4.2z1, we will go ahead and fix https://bugzilla.redhat.com/show_bug.cgi?id=1939617 in Rook from the OCS side. Moving this out of 4.7; we will also create a clone for the documentation team to consume.

*** Bug 1942680 has been marked as a duplicate of this bug. ***

Moving it out based on the offline discussion. Summary: failover works as follows: when Rook detects that a monitor is out of quorum, it waits for 10 minutes; if the mon is still out after 10 minutes, Rook deletes it and schedules a new one. The reason we don't do this in the stretched scenario is that mons break (due to the missing RHCS patch/fix), so if a mon falls out of quorum Rook does nothing. This is "fine" since we have 5 mons in total, which means it's safe. So we have disabled failover in Rook; once we have the RHCS patch, we will enable it again. Please read here for details: https://chat.google.com/room/AAAAREGEba8/gX1sagCfNn4

https://bugzilla.redhat.com/show_bug.cgi?id=1939766 is in MODIFIED state now. As this is fixed in ceph-14.2.11-143.el8cp, I'm assigning the bug to Travis, since I think he wants to swap things around for 4.7. Re-enabling mon failover in stretch mode is being tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1939617.

Given that this RHCS issue is fixed in the latest 4.2z1, moving to ON_QA since the fix is in the 4.7 rc3 build. I believe Pratik already validated that this is fixed.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.8.0 container images bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3003
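(Editorial note: for anyone re-verifying on a live cluster, a minimal sketch using standard Ceph CLI commands, run from the rook-ceph toolbox or any pod with the admin keyring; the target build number is taken from the comments above:)

```
# Confirm every daemon runs a build carrying the fix (>= 14.2.11-143.el8cp).
ceph versions

# After a drain/undrain cycle, all five mons should return to quorum,
# and no new session_stretch_allowed crashes should be recorded.
ceph quorum_status --format json-pretty | grep -A7 quorum_names
ceph crash ls
```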