Bug 1939007
| Summary: | [Arbiter] [Tracker for BZ #1939766] After performing drain and undrain, 4 mons and 1 OSD go into CLBO and Ceph is not accessible | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Pratik Surve <prsurve> |
| Component: | ceph | Assignee: | Travis Nielsen <tnielsen> |
| Status: | CLOSED ERRATA | QA Contact: | Pratik Surve <prsurve> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.7 | CC: | bniver, ebenahar, gfarnum, madam, mbukatov, muagarwa, nberry, ocs-bugs, owasserm, shan, tnielsen |
| Target Milestone: | --- | Keywords: | AutomationBackLog, TestBlocker |
| Target Release: | OCS 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| Cloned As: | 1939617, 1939766 (view as bug list) | Environment: | |
| Last Closed: | 2021-08-03 18:15:14 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1939766 | | |
| Bug Blocks: | 1939617, 1941918 | | |
Description (Pratik Surve, 2021-03-15 12:54:41 UTC)
@Greg Could this be related to the other fix just merged, or could you take a look? From the mon-f log [1], I see a ceph crash:

```
debug    -8> 2021-03-15 12:14:43.429 7f462f257700  4 mgrc reconnect Starting new session with [v2:10.131.0.30:6800/49,v1:10.131.0.30:6801/49]
debug    -7> 2021-03-15 12:14:43.432 7f462ea56700  4 mgrc handle_mgr_configure stats_period=5
debug    -6> 2021-03-15 12:14:43.432 7f462ea56700  4 mgrc ms_handle_reset ms_handle_reset con 0x560f78a8d600
debug    -5> 2021-03-15 12:14:43.432 7f462ea56700  4 mgrc reconnect Terminating session with v2:10.131.0.30:6800/49
debug    -4> 2021-03-15 12:14:43.432 7f462ea56700  4 mgrc reconnect waiting to retry connect until 2021-03-15 12:14:44.429805
debug    -3> 2021-03-15 12:14:43.463 7f463225d700  5 mon.f@5(electing).paxos(paxos recovering c 10041..10581) is_readable = 0 - now=2021-03-15 12:14:43.463905 lease_expire=0.000000 has v0 lc 10581
debug    -2> 2021-03-15 12:14:43.463 7f463225d700  5 mon.f@5(electing).paxos(paxos recovering c 10041..10581) is_readable = 0 - now=2021-03-15 12:14:43.463918 lease_expire=0.000000 has v0 lc 10581
debug    -1> 2021-03-15 12:14:44.076 7f463225d700 -1 /builddir/build/BUILD/ceph-14.2.11/src/mon/Monitor.cc: In function 'bool Monitor::session_stretch_allowed(MonSession*, MonOpRequestRef&)' thread 7f463225d700 time 2021-03-15 12:14:44.075007
/builddir/build/BUILD/ceph-14.2.11/src/mon/Monitor.cc: 6771: FAILED ceph_assert(ci != mi->second.crush_loc.end())

 ceph version 14.2.11-133.el8cp (b35842cdf727a690afe60d0a32cdbca7da7171c8) nautilus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x156) [0x7f4641d62ce8]
 2: (()+0x276f02) [0x7f4641d62f02]
 3: (Monitor::session_stretch_allowed(MonSession*, boost::intrusive_ptr<MonOpRequest>&)+0x5e8) [0x560f7694c7f8]
 4: (Monitor::_ms_dispatch(Message*)+0x916) [0x560f7695b606]
 5: (Monitor::ms_dispatch(Message*)+0x2a) [0x560f7698e7fa]
 6: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x2a) [0x560f7698abca]
 7: (DispatchQueue::entry()+0x134a) [0x7f4641fb9fba]
 8: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f46420702e1]
 9: (()+0x815a) [0x7f463ea6215a]
 10: (clone()+0x43) [0x7f463d799f73]

debug     0> 2021-03-15 12:14:44.077 7f463225d700 -1 *** Caught signal (Aborted) **
 in thread 7f463225d700 thread_name:ms_dispatch

 ceph version 14.2.11-133.el8cp (b35842cdf727a690afe60d0a32cdbca7da7171c8) nautilus (stable)
 1: (()+0x12b30) [0x7f463ea6cb30]
 2: (gsignal()+0x10f) [0x7f463d6d484f]
 3: (abort()+0x127) [0x7f463d6bec45]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a7) [0x7f4641d62d39]
 5: (()+0x276f02) [0x7f4641d62f02]
 6: (Monitor::session_stretch_allowed(MonSession*, boost::intrusive_ptr<MonOpRequest>&)+0x5e8) [0x560f7694c7f8]
 7: (Monitor::_ms_dispatch(Message*)+0x916) [0x560f7695b606]
 8: (Monitor::ms_dispatch(Message*)+0x2a) [0x560f7698e7fa]
 9: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x2a) [0x560f7698abca]
 10: (DispatchQueue::entry()+0x134a) [0x7f4641fb9fba]
 11: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f46420702e1]
 12: (()+0x815a) [0x7f463ea6215a]
 13: (clone()+0x43) [0x7f463d799f73]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 0 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 rgw_sync
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
   1/ 5 prioritycache
   0/ 5 test
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/lib/ceph/crash/2021-03-15_12:14:44.077688Z_69a6f89d-bbcf-450f-88aa-716d65406319/log
--- end dump of recent events ---
```

[1] http://rhsqe-repo.lab.eng.blr.redhat.com/ocs4qe/pratik/bz/bz_1939007/mar_15/ocs_logs/quay-io-rhceph-dev-ocs-must-gather-sha256-1ee2f52263ef141c79b3f63b7b1699f04eec7ba370446ec5a033d69d28b98c91/namespaces/openshift-storage/pods/rook-ceph-mon-f-85bf67f477-tx49q/mon/mon/logs/current.log
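(Editorial note: the dump above is also written under `/var/lib/ceph/crash/`, per the `log_file` path at its end, so it can be listed via the standard ceph crash module once a mgr is reachable. A minimal sketch, assuming the rook-ceph toolbox or any pod with the admin keyring is available and that the crash module collected the report; the crash ID is taken from the directory name in the log, not invented:)

```
# List crashes collected by the mgr crash module; IDs encode the timestamp.
ceph crash ls

# Show metadata (assert condition, backtrace, entity name) for this crash.
ceph crash info 2021-03-15_12:14:44.077688Z_69a6f89d-bbcf-450f-88aa-716d65406319
```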
This is a different issue. The assert

> /builddir/build/BUILD/ceph-14.2.11/src/mon/Monitor.cc: 6771: FAILED ceph_assert(ci != mi->second.crush_loc.end())

is because the monitor is trying to compare its own crush location against the incoming OSD's, but the monitor doesn't have a location set (or at least, if there is a location, it doesn't match the stretch bucket divider).

I'm a bit confused because I thought we'd guarded all the ways to add a monitor without a location, but I'm also not clear on exactly what's going on with "drain" here or how the cluster state has been manipulated. Clearly things *were* working for a long time...
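(Editorial note: a minimal sketch of checking for the missing location. `ceph mon dump` is standard; the `ceph mon set_location` command shown is the upstream Pacific stretch-mode shape of "that command" the thread discusses below, and whether this Nautilus backport already exposes it is exactly what was still open at the time, so treat it as illustrative only. `zone=b` matches the zone the operator assigns to the replacement mon in the operator log quoted in the next comment:)

```
# Dump the monmap; in a stretch-mode build each mon entry carries a
# crush_location, and a mon with an empty location (the failed-over
# mon "f" here) is the condition that trips the ceph_assert above.
ceph mon dump --format json-pretty

# Upstream (Pacific) shape of the location command, shown for reference.
ceph mon set_location f zone=b
```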
I see now: the operator saw that a mon was down for too long (10 min) and decided to fail over mon.a [1]:

```
2021-03-15 11:13:11.447296 W | op-mon: mon "a" NOT found in quorum and timeout exceeded, mon will be failed over
2021-03-15 11:13:11.447337 I | op-mon: Failing over monitor "a"
2021-03-15 11:13:11.453246 I | op-mon: scaling the mon "a" deployment to replica 0
2021-03-15 11:13:11.467041 I | op-mon: starting new mon: &{ResourceName:rook-ceph-mon-f DaemonName:f PublicIP: Port:6789 Zone:b DataPathMap:0xc0020eef60}
```

The failed-over mon isn't getting a location set initially. Let's sync up on that command; I know we discussed it a while back.

[1] http://rhsqe-repo.lab.eng.blr.redhat.com/ocs4qe/pratik/bz/bz_1939007/mar_15/ocs_logs/quay-io-rhceph-dev-ocs-must-gather-sha256-1ee2f52263ef141c79b3f63b7b1699f04eec7ba370446ec5a033d69d28b98c91/namespaces/openshift-storage/pods/rook-ceph-operator-7b99bb79b4-8pbll/rook-ceph-operator/rook-ceph-operator/logs/current.log

I'll see if I can come up with a command for this today; we also discussed a workaround option.

When I tried to deploy a machine config for the worker machine config pool, which updates nodes one by one by draining and rebooting them, I hit a similar issue. Multiple pods are in CrashLoopBackOff state:

```
$ oc get pods -n openshift-storage | grep CrashLoopBackOff
rook-ceph-mon-d-6b844b9ccf-mtdsm                                  1/2   CrashLoopBackOff   75    4d
rook-ceph-mon-f-86558ff789-rhvw5                                  1/2   CrashLoopBackOff   159   13h
rook-ceph-osd-1-7655fdd46f-6wzz9                                  1/2   CrashLoopBackOff   196   13h
rook-ceph-osd-3-865bfc7cfb-ptjx5                                  1/2   CrashLoopBackOff   227   13h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-5447657z97bp   1/2   CrashLoopBackOff   271   13h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-67745b452tz6   1/2   CrashLoopBackOff   239   12h
```

And I see a similar traceback in the ceph crash logs:

```
/builddir/build/BUILD/ceph-14.2.11/src/mon/Monitor.cc: 6771: FAILED ceph_assert(ci != mi->second.crush_loc.end())

 ceph version 14.2.11-133.el8cp (b35842cdf727a690afe60d0a32cdbca7da7171c8) nautilus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x156) [0x7f26329e8ce8]
 2: (()+0x276f02) [0x7f26329e8f02]
 3: (Monitor::session_stretch_allowed(MonSession*, boost::intrusive_ptr<MonOpRequest>&)+0x5e8) [0x5652767437f8]
 4: (Monitor::_ms_dispatch(Message*)+0x916) [0x565276752606]
 5: (Monitor::ms_dispatch(Message*)+0x2a) [0x5652767857fa]
 6: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x2a) [0x565276781bca]
 7: (DispatchQueue::entry()+0x134a) [0x7f2632c3ffba]
 8: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f2632cf62e1]
 9: (()+0x815a) [0x7f262f6e815a]
 10: (clone()+0x43) [0x7f262e41ff73]

 0> 2021-03-18 21:55:10.884 7f2622ee3700 -1 *** Caught signal (Aborted) **
 in thread 7f2622ee3700 thread_name:ms_dispatch
```

This was observed on a cluster running:
- OCP 4.7.0-0.nightly-2021-03-12-142313
- LSO 4.8.0-202103130210.p0
- OCS 4.7.0-294.ci

I ran into various issues with this in the past, but when I tried to isolate them, they were all caused by some other problem in the Ceph cluster (e.g. invalid allocation of OSDs within the cluster), and with OCS 4.7.0-280.ci on 2021-03-04, for example, I was able to deploy the machineconfig without disrupting the whole cluster. So if my machineconfig problem and the use case of this bug share the same root cause, it is likely caused by a recent change.
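(Editorial note: to pull the assert out of a crash-looping mon pod like the ones listed above, one option is to read the log of the previous container attempt, since the current attempt may not have crashed yet. A minimal sketch; the pod name is copied from the listing above and the `mon` container name matches Rook's mon pod spec, but verify both against your deployment:)

```
# Grab the backtrace from the attempt that actually hit the assert.
oc -n openshift-storage logs rook-ceph-mon-d-6b844b9ccf-mtdsm -c mon --previous \
  | grep -B2 -A20 'FAILED ceph_assert'
```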
I was also able to create disruption just by isolating the network between zones, but because of other issues (like this one, or issues in OCS QE infra), I wasn't able to reproduce it again to confirm and report it.

As the decision was to take care of this in Ceph after RHCS 4.2z1, we will go ahead and fix https://bugzilla.redhat.com/show_bug.cgi?id=1939617 in Rook from the OCS side. Moving this out of 4.7; we will also create a clone for the documentation team to consume.

*** Bug 1942680 has been marked as a duplicate of this bug. ***

Moving it out based on the offline discussion. Summary: failover works as follows: when Rook detects that a monitor is out of quorum, it waits for 10 minutes; if the mon is still out after 10 minutes, Rook deletes it and schedules a new one. The reason we don't do this in the stretched scenario is that mons break (due to the missing RHCS patch/fix), so if a mon falls out of quorum Rook does nothing. This is "fine" since we have 5 mons in total, which means it's safe. So we have disabled failover in Rook; once we have the RHCS patch, we will enable it again. Please read here for details: https://chat.google.com/room/AAAAREGEba8/gX1sagCfNn4

https://bugzilla.redhat.com/show_bug.cgi?id=1939766 is in MODIFIED state now. As this is fixed in ceph-14.2.11-143.el8cp, I'm assigning the bug to Travis, since I think he wants to swap things around for 4.7. Re-enabling mon failover in stretch mode is being tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1939617.

Given that this RHCS issue is fixed in the latest 4.2z1, moving to ON_QA since the fix is in the 4.7 rc3 build. I believe Pratik already validated that this is fixed.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.8.0 container images bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3003
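(Editorial note: for anyone re-verifying on a live cluster, a minimal sketch using standard Ceph CLI commands, run from the rook-ceph toolbox or any pod with the admin keyring; the target build number is taken from the comments above:)

```
# Confirm every daemon runs a build carrying the fix (>= 14.2.11-143.el8cp).
ceph versions

# After a drain/undrain cycle, all five mons should return to quorum,
# and no new session_stretch_allowed crashes should be recorded.
ceph quorum_status --format json-pretty | grep -A7 quorum_names
ceph crash ls
```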