Description of problem:
-----------------------
4 OSDs from the same host crashed while creating 2 VMs on the same iSCSI LUN mounted on a QEMU/KVM host.

Version-Release number of selected component (if applicable):
10.2.3-8.el7cp

Steps to Reproduce:
1. Created an rbd image of 100G and configured a LUN using ceph-iscsi-ansible
2. Discovered the LUN on a Linux host and used it as a storage volume on the QEMU/KVM host.
   Note: the storage pool was created directly in KVM on top of the physical iSCSI LUN
3. Created 2 qcow2 volumes on the storage pool backed by the iSCSI LUN
4. Started OS installation on both volumes; after a while the installation stalled, and 4 OSDs on one of the OSD hosts were observed to be down.

Actual results:
IO stopped and the OSDs went down

Expected results:
There should not be any IO errors and the OSDs should stay up
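For anyone retrying this, a rough sketch of the reproduction commands is below. The pool, image, gateway address, IQN, and libvirt pool names are placeholders (the exact values used in this setup are not recorded in the report), and the libvirt pool type shown is only one possible way to put a pool directly on the LUN:

    # Create the 100G RBD image backing the iSCSI LUN
    # ("rbd" pool and "iscsi-vol" image name are assumptions)
    rbd create rbd/iscsi-vol --size 102400

    # On the QEMU/KVM host: discover and log in to the iSCSI target
    # (gateway IP and IQN are placeholders)
    iscsiadm -m discovery -t sendtargets -p 10.70.39.100
    iscsiadm -m node -T iqn.2016-10.com.example:iscsi-gw -p 10.70.39.100 --login

    # Define and start a libvirt storage pool directly on top of the iSCSI LUN
    virsh pool-define-as iscsi-pool iscsi --source-host 10.70.39.100 \
          --source-dev iqn.2016-10.com.example:iscsi-gw --target /dev/disk/by-path
    virsh pool-start iscsi-pool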
   54 0 0) 0x7f3b4950b000 con 0x7f3b474d0000
    -3> 2016-10-21 16:50:48.043737 7f3b2173f700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3b1972f700' had timed out after 15
    -2> 2016-10-21 16:50:48.043739 7f3b2173f700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3b1972f700' had suicide timed out after 150
    -1> 2016-10-21 16:50:48.043736 7f3b22f42700  1 -- 10.70.39.4:6826/1189524 <== osd.0 10.70.39.2:0/31949 51 ==== osd_ping(ping e187 stamp 2016-10-21 16:50:23.312632) v2 ==== 47+0+0 (77267054 0 0) 0x7f3b47bf7a00 con 0x7f3b474d0180
     0> 2016-10-21 16:50:48.045943 7f3b2173f700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f3b2173f700 time 2016-10-21 16:50:48.043746
common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")

 ceph version 10.2.3-8.el7cp (e1d1eac13d734c63e896531428b565fc47d51874)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f3b3bade225]
 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2e1) [0x7f3b3ba1dc11]
 3: (ceph::HeartbeatMap::is_healthy()+0xde) [0x7f3b3ba1e46e]
 4: (OSD::handle_osd_ping(MOSDPing*)+0x93f) [0x7f3b3b453a8f]
 5: (OSD::heartbeat_dispatch(Message*)+0x3cb) [0x7f3b3b454cbb]
 6: (DispatchQueue::entry()+0x78a) [0x7f3b3bba020a]
 7: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f3b3bab9a1d]
 8: (()+0x7dc5) [0x7f3b39a0adc5]
 9: (clone()+0x6d) [0x7f3b3809673d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   1/ 5 kinetic
   1/ 5 fuse
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.10.log
--- end dump of recent events ---

2016-10-21 16:50:48.067726 7f3b2173f700 -1 *** Caught signal (Aborted) **
 in thread 7f3b2173f700 thread_name:ms_dispatch

 ceph version 10.2.3-8.el7cp (e1d1eac13d734c63e896531428b565fc47d51874)
 1: (()+0x91944a) [0x7f3b3b9e144a]
 2: (()+0xf370) [0x7f3b39a12370]
 3: (gsignal()+0x37) [0x7f3b37fd41d7]
 4: (abort()+0x148) [0x7f3b37fd58c8]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x267) [0x7f3b3bade407]
 6: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2e1) [0x7f3b3ba1dc11]
 7: (ceph::HeartbeatMap::is_healthy()+0xde) [0x7f3b3ba1e46e]
 8: (OSD::handle_osd_ping(MOSDPing*)+0x93f) [0x7f3b3b453a8f]
 9: (OSD::heartbeat_dispatch(Message*)+0x3cb) [0x7f3b3b454cbb]
 10: (DispatchQueue::entry()+0x78a) [0x7f3b3bba020a]
 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f3b3bab9a1d]
 12: (()+0x7dc5) [0x7f3b39a0adc5]
 13: (clone()+0x6d) [0x7f3b3809673d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
The subsequent restart attempts then fail to read the OSD's superblock:

2016-10-21 16:54:48.216110 7fe7c6683800  0 set uid:gid to 167:167 (ceph:ceph)
2016-10-21 16:54:48.216123 7fe7c6683800  0 ceph version 10.2.3-8.el7cp (e1d1eac13d734c63e896531428b565fc47d51874), process ceph-osd, pid 191367
2016-10-21 16:54:48.216239 7fe7c6683800 -1  ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-10: (5) Input/output error
2016-10-21 16:54:48.713614 7f23a6acf800  0 set uid:gid to 167:167 (ceph:ceph)
2016-10-21 16:54:48.713626 7f23a6acf800  0 ceph version 10.2.3-8.el7cp (e1d1eac13d734c63e896531428b565fc47d51874), process ceph-osd, pid 191414
2016-10-21 16:54:48.713751 7f23a6acf800 -1  ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-10: (5) Input/output error
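The backtrace above is the OSD's internal heartbeat check firing: the osd_op_tp worker thread was stuck past the 15 s warning threshold and then the 150 s suicide threshold, which correspond to the osd_op_thread_timeout and osd_op_thread_suicide_timeout defaults, so the daemon aborted itself. One way to confirm those thresholds on a running OSD (the OSD id and admin socket path below are just examples):

    # Query the thread timeout settings via the OSD's admin socket
    ceph daemon osd.10 config get osd_op_thread_timeout           # default: 15
    ceph daemon osd.10 config get osd_op_thread_suicide_timeout   # default: 150

    # Equivalently, go through the admin socket file directly
    ceph --admin-daemon /var/run/ceph/ceph-osd.10.asok config show \
        | grep -E 'osd_op_thread(_suicide)?_timeout'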
Re-assigning to RADOS since this is an OSD I/O issue. From a quick glance, it looks like you may be having storage issues with your backing OSD disks:

  unable to open OSD superblock on /var/lib/ceph/osd/ceph-10: (5) Input/output error
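To confirm whether the backing disk is at fault, the kernel log and drive health on the affected OSD host are worth checking. A rough sketch, assuming the OSD data partition is /dev/sdc1:

    # Kernel-level I/O errors and XFS shutdown messages for the backing device
    dmesg | grep -iE 'sdc|xfs|i/o error'

    # Confirm which OSD mount lives on the suspect partition
    df -h /var/lib/ceph/osd/ceph-10

    # SMART health of the underlying drive (requires smartmontools)
    smartctl -H -a /dev/sdc

    # Read-only XFS consistency check, once the filesystem is unmounted
    xfs_repair -n /dev/sdc1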
Looking at dmesg on that node, it looks like the disk failed:

  [259674.726328] XFS (sdc1): xfs_log_force: error -5 returned.

sdc1 is mapped to one of the OSDs:

  /dev/sdc1  1.1T  467M  1.1T  1%  /var/lib/ceph/osd/ceph-10

I'm not really sure why the other 3 OSDs hung, but I do see XFS backtraces in the dmesg log suggesting that XFS hung (probably because of the EIO on sdc1).

I'm closing this as not a bug, since sdc1 seems to be busted and this does not appear to be a ceph/OSD issue. If you want more information, please reproduce with:

  debug osd = 20
  debug filestore = 20
  debug ms = 1
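If the test is re-run, one way to apply those debug levels at runtime is sketched below; osd.10 is used only as an example id, and the same settings can instead be placed in the [osd] section of /etc/ceph/ceph.conf followed by an OSD restart:

    # Apply the requested debug levels to all OSDs from an admin node (no restart needed)
    ceph tell osd.* injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'

    # Verify on one of the affected daemons
    ceph daemon osd.10 config show | grep -E '"debug_(osd|filestore|ms)"'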