| Summary: | OSDs crashed with "common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")" while creating VMs on QEMU/KVM on an iSCSI LUN | | |
|---|---|---|---|
| Product: | Red Hat Ceph Storage | Reporter: | Hemanth Kumar <hyelloji> |
| Component: | RADOS | Assignee: | Samuel Just <sjust> |
| Status: | CLOSED NOTABUG | QA Contact: | ceph-qe-bugs <ceph-qe-bugs> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 2.1 | CC: | ceph-eng-bugs, dzafman, hnallurv, hyelloji, kchai |
| Target Milestone: | rc | | |
| Target Release: | 2.1 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-10-24 15:01:53 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Hemanth Kumar
2016-10-24 07:17:55 UTC
54 0 0) 0x7f3b4950b000 con 0x7f3b474d0000
-3> 2016-10-21 16:50:48.043737 7f3b2173f700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3b1972f700' had timed out after 15
-2> 2016-10-21 16:50:48.043739 7f3b2173f700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3b1972f700' had suicide timed out after 150
-1> 2016-10-21 16:50:48.043736 7f3b22f42700 1 -- 10.70.39.4:6826/1189524 <== osd.0 10.70.39.2:0/31949 51 ==== osd_ping(ping e187 stamp 2016-10-21 16:50:23.312632) v2 ==== 47+0+0 (77267054 0 0) 0x7f3b47bf7a00 con 0x7f3b474d0180
0> 2016-10-21 16:50:48.045943 7f3b2173f700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f3b2173f700 time 2016-10-21 16:50:48.043746
common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
ceph version 10.2.3-8.el7cp (e1d1eac13d734c63e896531428b565fc47d51874)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f3b3bade225]
2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2e1) [0x7f3b3ba1dc11]
3: (ceph::HeartbeatMap::is_healthy()+0xde) [0x7f3b3ba1e46e]
4: (OSD::handle_osd_ping(MOSDPing*)+0x93f) [0x7f3b3b453a8f]
5: (OSD::heartbeat_dispatch(Message*)+0x3cb) [0x7f3b3b454cbb]
6: (DispatchQueue::entry()+0x78a) [0x7f3b3bba020a]
7: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f3b3bab9a1d]
8: (()+0x7dc5) [0x7f3b39a0adc5]
9: (clone()+0x6d) [0x7f3b3809673d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
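For context, the 15 and 150 in the heartbeat_map messages above appear to be the OSD op-thread grace and suicide-grace periods (osd_op_thread_timeout and osd_op_thread_suicide_timeout; 15 s and 150 s are the Jewel defaults). A quick sketch for confirming the effective values on the crashing daemon (osd.10, per the log_file path further down), assuming the daemon is up and its admin socket is reachable:

    ceph daemon osd.10 config get osd_op_thread_timeout          # expect 15
    ceph daemon osd.10 config get osd_op_thread_suicide_timeout  # expect 150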
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 newstore
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
1/ 5 kinetic
1/ 5 fuse
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.10.log
--- end dump of recent events ---
2016-10-21 16:50:48.067726 7f3b2173f700 -1 *** Caught signal (Aborted) **
in thread 7f3b2173f700 thread_name:ms_dispatch
ceph version 10.2.3-8.el7cp (e1d1eac13d734c63e896531428b565fc47d51874)
1: (()+0x91944a) [0x7f3b3b9e144a]
2: (()+0xf370) [0x7f3b39a12370]
3: (gsignal()+0x37) [0x7f3b37fd41d7]
4: (abort()+0x148) [0x7f3b37fd58c8]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x267) [0x7f3b3bade407]
6: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2e1) [0x7f3b3ba1dc11]
7: (ceph::HeartbeatMap::is_healthy()+0xde) [0x7f3b3ba1e46e]
8: (OSD::handle_osd_ping(MOSDPing*)+0x93f) [0x7f3b3b453a8f]
9: (OSD::heartbeat_dispatch(Message*)+0x3cb) [0x7f3b3b454cbb]
10: (DispatchQueue::entry()+0x78a) [0x7f3b3bba020a]
11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f3b3bab9a1d]
12: (()+0x7dc5) [0x7f3b39a0adc5]
13: (clone()+0x6d) [0x7f3b3809673d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2016-10-21 16:54:48.216110 7fe7c6683800 0 set uid:gid to 167:167 (ceph:ceph)
2016-10-21 16:54:48.216123 7fe7c6683800 0 ceph version 10.2.3-8.el7cp (e1d1eac13d734c63e896531428b565fc47d51874), process ceph-osd, pid 191367
2016-10-21 16:54:48.216239 7fe7c6683800 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-10: (5) Input/output error
2016-10-21 16:54:48.713614 7f23a6acf800 0 set uid:gid to 167:167 (ceph:ceph)
2016-10-21 16:54:48.713626 7f23a6acf800 0 ceph version 10.2.3-8.el7cp (e1d1eac13d734c63e896531428b565fc47d51874), process ceph-osd, pid 191414
2016-10-21 16:54:48.713751 7f23a6acf800 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-10: (5) Input/output error
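The repeated superblock EIO points at the underlying device rather than at Ceph itself. A minimal disk-triage sketch, assuming (as the comment below states) that sdc1 backs /var/lib/ceph/osd/ceph-10:

    dmesg | grep -iE 'sdc|xfs'   # kernel I/O errors and XFS shutdown messages
    smartctl -a /dev/sdc         # drive health, from smartmontools
    xfs_repair -n /dev/sdc1      # read-only XFS check (filesystem must be unmounted)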
Re-assigning to RADOS since this is an OSD I/O issue.

From a quick glance, you may be having storage issues with your backing OSD disks:

    superblock on /var/lib/ceph/osd/ceph-10: (5) Input/output error

Look at dmesg on that node; it looks like the disk failed:

    [259674.726328] XFS (sdc1): xfs_log_force: error -5 returned.

sdc1 is mapped to one of the OSDs:

    /dev/sdc1  1.1T  467M  1.1T  1%  /var/lib/ceph/osd/ceph-10

I'm not sure why the other three OSDs hung, but I do see XFS backtraces in the dmesg log suggesting that XFS hung (probably because of the EIO on sdc1). I'm closing this as NOTABUG since sdc1 appears to be broken and this does not look like a Ceph/OSD issue.

If you want more information, please reproduce with:

    debug osd = 20
    debug filestore = 20
    debug ms = 1
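A sketch of applying those debug levels, using standard Ceph admin commands (the values are the ones requested above; osd.* targets all OSDs):

    # Inject at runtime into all running OSDs (no restart needed):
    ceph tell osd.* injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'

    # Or set persistently in /etc/ceph/ceph.conf and restart the OSDs:
    [osd]
        debug osd = 20
        debug filestore = 20
        debug ms = 1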