Bug 1387993

Summary: OSDs crashed with "common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")" while creating VMs on QEMU/KVM on an iSCSI LUN

Product: Red Hat Ceph Storage
Component: RADOS
Version: 2.1
Hardware: Unspecified
OS: Unspecified
Severity: medium
Priority: unspecified
Status: CLOSED NOTABUG
Target Milestone: rc
Target Release: 2.1
Reporter: Hemanth Kumar <hyelloji>
Assignee: Samuel Just <sjust>
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
CC: ceph-eng-bugs, dzafman, hnallurv, hyelloji, kchai
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2016-10-24 15:01:53 UTC

Description Hemanth Kumar 2016-10-24 07:17:55 UTC
Description of problem:
-----------------------
Four OSDs on the same host crashed while creating two VMs on the same iSCSI LUN mounted on a QEMU/KVM host.


Version-Release number of selected component (if applicable):
10.2.3-8.el7cp


Steps to Reproduce:
1. Created an RBD image of 100 GB and configured a LUN using ceph-iscsi-ansible (see the command sketch after this list).
2. Discovered the LUN on a Linux host and used it as a storage volume on the QEMU/KVM host.
   Note: the storage pool was created directly in KVM on top of the physical iSCSI LUN.
3. Created two qcow2 volumes on the storage pool backed by the iSCSI LUN.
4. Started OS installation on both volumes; after a while the installation stopped, and 4 OSDs on one of the OSD hosts were observed to be down.
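For reference, a minimal sketch of the RBD/iSCSI setup under stated assumptions: the pool name (rbd), image name (iscsi-lun01), and portal address below are placeholders, not the values used in this run. The KVM storage pool and qcow2 volumes were then created on top of the discovered device.

# create the 100 GB RBD image that backs the LUN (size is in MB; names are placeholders)
rbd create rbd/iscsi-lun01 --size 102400

# on the initiator host: discover and log in to the iSCSI target (portal IP is a placeholder)
iscsiadm -m discovery -t sendtargets -p 10.70.39.100
iscsiadm -m node --login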

Actual results:
IO stopped and the OSDs went down.

Expected results:
There should not be any IO errors.

Comment 3 Hemanth Kumar 2016-10-24 07:23:21 UTC
54 0 0) 0x7f3b4950b000 con 0x7f3b474d0000
    -3> 2016-10-21 16:50:48.043737 7f3b2173f700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3b1972f700' had timed out after 15
    -2> 2016-10-21 16:50:48.043739 7f3b2173f700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3b1972f700' had suicide timed out after 150
    -1> 2016-10-21 16:50:48.043736 7f3b22f42700  1 -- 10.70.39.4:6826/1189524 <== osd.0 10.70.39.2:0/31949 51 ==== osd_ping(ping e187 stamp 2016-10-21 16:50:23.312632) v2 ==== 47+0+0 (77267054 0 0) 0x7f3b47bf7a00 con 0x7f3b474d0180
     0> 2016-10-21 16:50:48.045943 7f3b2173f700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f3b2173f700 time 2016-10-21 16:50:48.043746
common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")

 ceph version 10.2.3-8.el7cp (e1d1eac13d734c63e896531428b565fc47d51874)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f3b3bade225]
 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2e1) [0x7f3b3ba1dc11]
 3: (ceph::HeartbeatMap::is_healthy()+0xde) [0x7f3b3ba1e46e]
 4: (OSD::handle_osd_ping(MOSDPing*)+0x93f) [0x7f3b3b453a8f]
 5: (OSD::heartbeat_dispatch(Message*)+0x3cb) [0x7f3b3b454cbb]
 6: (DispatchQueue::entry()+0x78a) [0x7f3b3bba020a]
 7: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f3b3bab9a1d]
 8: (()+0x7dc5) [0x7f3b39a0adc5]
 9: (clone()+0x6d) [0x7f3b3809673d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   1/ 5 kinetic
   1/ 5 fuse
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.10.log
--- end dump of recent events ---
2016-10-21 16:50:48.067726 7f3b2173f700 -1 *** Caught signal (Aborted) **
 in thread 7f3b2173f700 thread_name:ms_dispatch

 ceph version 10.2.3-8.el7cp (e1d1eac13d734c63e896531428b565fc47d51874)
 1: (()+0x91944a) [0x7f3b3b9e144a]
 2: (()+0xf370) [0x7f3b39a12370]
 3: (gsignal()+0x37) [0x7f3b37fd41d7]
 4: (abort()+0x148) [0x7f3b37fd58c8]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x267) [0x7f3b3bade407]
 6: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2e1) [0x7f3b3ba1dc11]
 7: (ceph::HeartbeatMap::is_healthy()+0xde) [0x7f3b3ba1e46e]
 8: (OSD::handle_osd_ping(MOSDPing*)+0x93f) [0x7f3b3b453a8f]
 9: (OSD::heartbeat_dispatch(Message*)+0x3cb) [0x7f3b3b454cbb]
 10: (DispatchQueue::entry()+0x78a) [0x7f3b3bba020a]
 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f3b3bab9a1d]
 12: (()+0x7dc5) [0x7f3b39a0adc5]
 13: (clone()+0x6d) [0x7f3b3809673d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
     0> 2016-10-21 16:50:48.067726 7f3b2173f700 -1 *** Caught signal (Aborted) **
 in thread 7f3b2173f700 thread_name:ms_dispatch

 ceph version 10.2.3-8.el7cp (e1d1eac13d734c63e896531428b565fc47d51874)
 1: (()+0x91944a) [0x7f3b3b9e144a]
 2: (()+0xf370) [0x7f3b39a12370]
 3: (gsignal()+0x37) [0x7f3b37fd41d7]
 4: (abort()+0x148) [0x7f3b37fd58c8]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x267) [0x7f3b3bade407]
 6: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2e1) [0x7f3b3ba1dc11]
 7: (ceph::HeartbeatMap::is_healthy()+0xde) [0x7f3b3ba1e46e]
 8: (OSD::handle_osd_ping(MOSDPing*)+0x93f) [0x7f3b3b453a8f]
 9: (OSD::heartbeat_dispatch(Message*)+0x3cb) [0x7f3b3b454cbb]
 10: (DispatchQueue::entry()+0x78a) [0x7f3b3bba020a]
 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f3b3bab9a1d]
 12: (()+0x7dc5) [0x7f3b39a0adc5]
 13: (clone()+0x6d) [0x7f3b3809673d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   1/ 5 kinetic
   1/ 5 fuse
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.10.log
--- end dump of recent events ---
2016-10-21 16:54:48.216110 7fe7c6683800  0 set uid:gid to 167:167 (ceph:ceph)
2016-10-21 16:54:48.216123 7fe7c6683800  0 ceph version 10.2.3-8.el7cp (e1d1eac13d734c63e896531428b565fc47d51874), process ceph-osd, pid 191367
2016-10-21 16:54:48.216239 7fe7c6683800 -1  ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-10: (5) Input/output error
2016-10-21 16:54:48.713614 7f23a6acf800  0 set uid:gid to 167:167 (ceph:ceph)
2016-10-21 16:54:48.713626 7f23a6acf800  0 ceph version 10.2.3-8.el7cp (e1d1eac13d734c63e896531428b565fc47d51874), process ceph-osd, pid 191414
2016-10-21 16:54:48.713751 7f23a6acf800 -1  ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-10: (5) Input/output error
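Note: the "timed out after 15" / "suicide timed out after 150" values in the heartbeat_map messages above are the default OSD op thread-pool timeouts. A sketch of the corresponding ceph.conf options, shown with their default values purely for orientation, not as a tuning suggestion:

[osd]
# an OSD::osd_op_tp worker thread is reported unhealthy after this many seconds
osd op thread timeout = 15
# if the thread is still stuck after this many seconds, the OSD hits the
# "hit suicide timeout" assert seen above and aborts
osd op thread suicide timeout = 150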

Comment 4 Jason Dillaman 2016-10-24 13:52:43 UTC
Re-assigning to RADOS since this is an OSD I/O issue. From a quick glance, perhaps you are having storage issues with your backing OSD disks:

superblock on /var/lib/ceph/osd/ceph-10: (5) Input/output error
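A few standard commands that could help confirm a failing backing disk (the device names below are examples; substitute the actual OSD data device):

# kernel-level I/O errors reported for the device
dmesg | grep -iE 'i/o error|xfs'

# SMART health of the underlying disk (requires smartmontools)
smartctl -a /dev/sdc

# read-only XFS consistency check of the OSD data partition (run while unmounted)
xfs_repair -n /dev/sdc1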

Comment 5 Samuel Just 2016-10-24 15:01:53 UTC
Looking at dmesg on that node, it looks like the disk failed:

[259674.726328] XFS (sdc1): xfs_log_force: error -5 returned.

sdc1 is mapped to one of the OSDs:

/dev/sdc1              1.1T  467M  1.1T   1% /var/lib/ceph/osd/ceph-10

I'm not really sure why the other 3 OSDs hung, but I do see XFS backtraces in the dmesg log suggesting that XFS hung (probably because of the EIO on sdc1).

I'm closing this as not a bug since sdc1 seems to be busted and it does not seem to be a Ceph/OSD issue. If you want more information, please reproduce with:

debug osd = 20
debug filestore = 20
debug ms = 1
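
These settings can go in the [osd] section of ceph.conf before restarting the OSDs, or be injected into the running OSDs without a restart, e.g. (assuming admin credentials on a node that can reach the cluster):

ceph tell osd.* injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'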