Description of problem:

There are many warning messages in vmkernel.log when I use the iSCSI gateway (iscsi-gw) in Ceph 3.0. At the same time, the vSphere monitor tab shows "Lost connection" information for the volume, and vmkernel.log contains many 'iSCSI connection is being marked "OFFLINE" (Event:4)' messages. For example:

2018-03-30T11:15:52.455Z cpu9:65646)WARNING: HBX: 727: 'CEPH-LUN09': HB at offset 0 - Volume 5abc0760-13c36702-c307-a4c64f1dcfdb may be damaged on disk. Corrupt heartbeat detected:
2018-03-30T11:15:52.455Z cpu9:65646)WARNING: [HB state 0 offset 0 gen 0 stampUS 0 uuid 00000000-00000000-0000-000000000000 jrnl <FB 0> drv 0.0]
2018-03-30T11:15:55.067Z cpu22:66551)iscsi_vmk: iscsivmk_ConnNetRegister: socket 0x4309061150c0 network resource pool netsched.pools.persist.iscsi associated
2018-03-30T11:15:55.067Z cpu22:66551)iscsi_vmk: iscsivmk_ConnNetRegister: socket 0x4309061150c0 network tracker id 88966450 tracker.iSCSI.10.2.132.2 associated
2018-03-30T11:15:55.068Z cpu2:66521)WARNING: iscsi_vmk: iscsivmk_ConnReceiveAtomic: vmhba64:CH:2 T:0 CN:0: Failed to receive data: Connection closed by peer
2018-03-30T11:15:55.068Z cpu2:66521)WARNING: iscsi_vmk: iscsivmk_ConnReceiveAtomic: Sess [ISID: 00023d000003 TARGET: iqn.2017-12.com.redhat.iscsi-gw:ceph-igw TPGT: 3 TSIH: 0]
2018-03-30T11:15:55.068Z cpu2:66521)WARNING: iscsi_vmk: iscsivmk_ConnReceiveAtomic: Conn [CID: 0 L: 10.2.128.61:49656 R: 10.2.132.2:3260]
2018-03-30T11:15:55.068Z cpu2:66521)iscsi_vmk: iscsivmk_ConnRxNotifyFailure: vmhba64:CH:2 T:0 CN:0: Connection rx notifying failure: Failed to Receive. State=Bound
2018-03-30T11:15:55.068Z cpu2:66521)iscsi_vmk: iscsivmk_ConnRxNotifyFailure: Sess [ISID: 00023d000003 TARGET: iqn.2017-12.com.redhat.iscsi-gw:ceph-igw TPGT: 3 TSIH: 0]
2018-03-30T11:15:55.068Z cpu2:66521)iscsi_vmk: iscsivmk_ConnRxNotifyFailure: Conn [CID: 0 L: 10.2.128.61:49656 R: 10.2.132.2:3260]
2018-03-30T11:15:55.321Z cpu22:66551)WARNING: iscsi_vmk: iscsivmk_StopConnection: vmhba64:CH:2 T:0 CN:0: iSCSI connection is being marked "OFFLINE" (Event:4)
2018-03-30T11:15:55.321Z cpu22:66551)WARNING: iscsi_vmk: iscsivmk_StopConnection: Sess [ISID: 00023d000003 TARGET: iqn.2017-12.com.redhat.iscsi-gw:ceph-igw TPGT: 3 TSIH: 0]
2018-03-30T11:15:55.321Z cpu22:66551)WARNING: iscsi_vmk: iscsivmk_StopConnection: Conn [CID: 0 L: 10.2.128.61:49656 R: 10.2.132.2:3260]

Version-Release number of selected component (if applicable):

cat installed-rpms | grep ceph
ceph-base-12.2.1-40.el7cp.x86_64  Mon Dec 25 01:45:02 2017
ceph-common-12.2.1-40.el7cp.x86_64  Mon Dec 25 01:44:19 2017
ceph-iscsi-cli-2.5-9.el7cp.noarch  Mon Dec 25 22:31:05 2017
ceph-iscsi-config-2.3-12.el7cp.noarch  Mon Dec 25 08:49:28 2017
ceph-iscsi-tools-2.1-3.el7cp.noarch  Mon Dec 25 22:54:29 2017
ceph-osd-12.2.1-40.el7cp.x86_64  Mon Dec 25 01:45:23 2017
ceph-selinux-12.2.1-40.el7cp.x86_64  Mon Dec 25 01:45:02 2017
libcephfs2-12.2.1-40.el7cp.x86_64  Mon Dec 25 01:44:16 2017
python-cephfs-12.2.1-40.el7cp.x86_64  Mon Dec 25 01:44:16 2017

cat installed-rpms | grep tcmu
libtcmu-1.3.0-0.4.el7cp.x86_64  Mon Dec 25 08:49:31 2017
tcmu-runner-1.3.0-0.4.el7cp.x86_64  Mon Dec 25 08:49:31 2017

Kernel version:
cat uname
Linux osd1.zyc4.cmicnmg.com 3.10.0-858.el7.x86_64 #1 SMP Tue Feb 27 08:59:23 EST 2018 x86_64 x86_64 x86_64 GNU/Linux

[wliu@collab-shell osd1.zyc4.cmicnmg.com]$ cat etc/redhat-release
Red Hat Enterprise Linux Server release 7.5 Beta (Maipo)

VMware ESXi version: 6.5

How reproducible:
100% reproducible

Steps to Reproduce:
1.
2.
3.
Actual results:

Expected results:

Additional info:
When you report this info, could you also give the output of:

esxcli iscsi session list
esxcli iscsi session connection list
esxcli storage nmp device list
esxcli storage core device list -d iscsi_diskABC
esxcli storage nmp path list -d iscsi_diskABC

Could you also save us the hunt and describe what operations, if any, you were doing at the time, like adding disks, OSDs, exporting new LUNs, etc.
Dear Mike,

Thanks for your help with this ticket.

Regarding your question "Could you also save us the hunt and describe what operations if any you were doing at this time like adding disks, osds, exporting new luns, etc.": actually there was no I/O action on the RADOS side at all (no adding disks, OSDs, exporting new LUNs, etc.), and we still saw the "Lost Connection" issue on the vSphere monitor tab.

We will update the ticket with the output of those commands later.

Thanks again!

Liu Wei
For future use for the user and to simplify new bzs, could you disable the Linux initiator if you are not using it anymore? It is filling up the logs in all these bzs with garbage relogin errors:

Mar 25 03:49:18 osd1 kernel: iSCSI Initiator Node: iqn.1994-05.com.redhat:f14d7a65d5ca is not authorized to access iSCSI target portal group: 1.

It looks like you might have set it up as a test and then only partially stopped it.

On the Linux box, just do:

iscsiadm -m node -u

and then delete the record:

iscsiadm -m node -o delete
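A minimal sketch of the full cleanup, assuming the node uses the stock RHEL 7 iscsi-initiator-utils services and no other iSCSI sessions are needed on it:

  iscsiadm -m node -u                              # log out of every stored node record
  iscsiadm -m node -o delete                       # delete the node records so nothing relogs in
  systemctl stop iscsid.service iscsi.service      # stop the initiator daemons
  systemctl disable iscsid.service iscsi.service   # keep them from starting again at boot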
The esxcli session and connection output indicates you have 8 sessions. Were you doing iSCSI interface/iface/port binding on the initiator/ESX/VMware side, or did you manually try to log in more than one session per iSCSI target TPG per initiator?
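One way to check this from the ESXi side (a sketch; vmhba64 is the software iSCSI adapter name seen in the vmkernel.log excerpts above, adjust if yours differs):

  esxcli iscsi networkportal list -A vmhba64   # vmknics bound to the software iSCSI adapter
  esxcli iscsi session list -A vmhba64         # one entry per session, per target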
The vCenter client reported "connection lost" ERRORs on Mar 30, but I have not found logs from that day on the ESXi server; it is strange that ESXi keeps such limited logs. We saw lots of WARNING-level messages in ESXi before, but did not see ERROR-level entries matching what vCenter reported. I am not sure whether a "warning" on the ESXi server is surfaced as an ERROR in the vCenter client. Can a VMware expert be involved in this issue?
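In the meantime I will try grepping the hostd/vobd logs directly, since the connectivity events vCenter surfaces are usually recorded there rather than in vmkernel.log (a sketch; paths are the ESXi 6.5 defaults):

  grep -iE 'lost access|redundancy lost' /var/log/vobd.log /var/log/hostd.log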
(In reply to Mike Christie from comment #12)
> For future use for the user and to simplify new bzs, could you disable the
> linux initiator if you are not using it anymore? It is filling up the logs
> in all these bzs with garbage relogin errors:
> 
> Mar 25 03:49:18 osd1 kernel: iSCSI Initiator Node:
> iqn.1994-05.com.redhat:f14d7a65d5ca is not authorized to access iSCSI target
> portal group: 1.
> 
> It looks like you might have set it up as a test and then only partially
> stopped it.
> 
> On the linux box, just do:
> 
> iscsiadm -m node -u
> 
> and then just delete the record
> 
> iscsiadm -m node -o delete

OK, I have disabled and stopped the iscsi service across all the Ceph cluster nodes.
(In reply to William Lau from comment #15)
> The vcenter client reported "connection Lost ERRORs" on Mar 30, but I have
> not found the logs of that day at esxi server. it's very strange that the
> esxi keep so limited logs.

Check out 127_vmkernel.log in all_vmkernel_log.tar.gz in https://access.redhat.com/support/cases/02056772.
Not sure if it is related to the packet drops or is the cause, but there is a bug in the iscsi layer that we might be hitting, and it could be causing some of the problems.

We see this repeated for a long time:

Mar 28 11:03:53 osd10 kernel: iSCSI Login negotiation failed.
Mar 28 11:04:08 osd10 kernel: iSCSI Login timeout on Network Portal 10.2.132.10:3260

I think we just might be hitting this bug:

commit 1c130ae00b769a2e2df41bad3d6051ee8234b636
Author: Florian Westphal <fw>
Date:   Fri Jan 19 14:36:29 2018 +0100

    iscsi-target: make sure to wake up sleeping login worker

This could cause the connection offline errors.

I really have no idea if this bug can cause packets to just look like they are dropped.

On the other hand, if the NIC is going bad and dropping packets, then that would explain why we see the iscsi login timeout errors and slow commands again (in the logs I saw ABORTs being sent a couple of times on Mar 30 on osd10, osd1 and osd20, so commands were taking longer than normal).

The other possibility is that a command was lost in the target layer and that is causing the target to drop relogin attempts. We would see a similar login timeout error. However, from the esxcli output it looks like all the targets but osd2 are logged back in.

Note I still need the logs for osd2, because that seems to be hitting some hang in the session reinstatement code, which may be a lost-command type of bug causing relogins to fail.

If you guys have earlier vmkernel logs it would be helpful. The earliest I have starts at 2018-03-30T11:07:07.
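A quick way to check whether a given kernel build already carries that fix (a sketch; it assumes the RPM changelog quotes the upstream subject line, which may not be the case for every build):

  rpm -q --changelog kernel-3.10.0-858.el7.x86_64 | grep -i 'wake up sleeping login worker'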
(In reply to Mike Christie from comment #24)
> Not sure if related to the packet drops or the cause but there is a bug in
> the iscsi layer that we might be hitting and causing some of the problems.
> 
> We see this repeated for a long time:
> 
> Mar 28 11:03:53 osd10 kernel: iSCSI Login negotiation failed.
> Mar 28 11:04:08 osd10 kernel: iSCSI Login timeout on Network Portal
> 10.2.132.10:3260
> 
> I think we just might be hitting this bug:
> 
> commit 1c130ae00b769a2e2df41bad3d6051ee8234b636
> Author: Florian Westphal <fw>
> Date:   Fri Jan 19 14:36:29 2018 +0100
> 
>     iscsi-target: make sure to wake up sleeping login worker
> 
> This could cause the connection offline errors.
> 
> I really have no idea if this bug can cause packets to just look like they
> are dropped.
> 
> On the other hand, if the nic is going bad and dropping packets then that
> would explain why we see the iscsi login timeout errors and slow commands
> again (In the logs I saw ABORTs being sent a couple times on Mar 30 on osd
> 10 osd 1 and osd20 so commands were taking longer than normal)
> 
> The other possibility is that a command was lost in the target layer and
> that is causing the target to drop relogin attempts. We would see a similar
> login timeout error. However, from the esxcli output it looks like all the
> targets but the osd2 are logged back in.
> 
> Note I still need the logs for osd2 because that seems to be hitting some
> hang in the session reinstatement code which may be a lost command type of
> bug causing relogins to fail.
> 
> If you guys have an earlier vmkernel logs it would be helpful. The earliest
> I have starts at 2018-03-30T11:07:07.

I checked the atop logs from 30 Mar on osd10 and osd20. These two nodes dropped many more packets than osd1 and osd2, and the reason for the drops was that the NIC was receiving massive numbers of packets; I think it ran out of the NIC's ring buffer (see the ethtool sketch after the quoted Dell comment below). Seemingly, the packet drops have nothing to do with the "Lost connection" error from vCenter, since vCenter did not raise any alarm while the packets were being dropped.

Additionally, in 127_vmkernel.log I see many connection/session-related warnings, such as:

2018-03-30T11:08:02.192Z cpu6:66508)WARNING: iscsi_vmk: iscsivmk_ConnReceiveAtomic: vmhba64:CH:5 T:0 CN:0: Failed to receive data: Connection closed by peer
2018-03-30T11:08:02.192Z cpu6:66508)WARNING: iscsi_vmk: iscsivmk_ConnReceiveAtomic: Sess [ISID: 00023d000006 TARGET: iqn.2017-12.com.redhat.iscsi-gw:ceph-igw TPGT: 3 TSIH: 0]
2018-03-30T11:08:02.192Z cpu6:66508)WARNING: iscsi_vmk: iscsivmk_ConnReceiveAtomic: Conn [CID: 0 L: 10.2.128.54:54493 R: 10.2.132.2:3260]
2018-03-30T11:08:02.192Z cpu6:66508)iscsi_vmk: iscsivmk_ConnRxNotifyFailure: vmhba64:CH:5 T:0 CN:0: Connection rx notifying failure: Failed to Receive. State=Bound
2018-03-30T11:08:02.192Z cpu6:66508)iscsi_vmk: iscsivmk_ConnRxNotifyFailure: Sess [ISID: 00023d000006 TARGET: iqn.2017-12.com.redhat.iscsi-gw:ceph-igw TPGT: 3 TSIH: 0]
2018-03-30T11:08:02.193Z cpu6:66508)iscsi_vmk: iscsivmk_ConnRxNotifyFailure: Conn [CID: 0 L: 10.2.128.54:54493 R: 10.2.132.2:3260]

I am not sure whether the above messages correspond to the "connection lost" error on the vCenter side.

A similar case from Dell: https://www.dell.com/community/EqualLogic/iSCSI-connection-issues-or-not/m-p/4123283

One of the comments is worth noting:
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
06-03-2013 03:30 PM
Re: iSCSI connection issues or not?
Whenever a connection is moved, either by the array for balancing IO across available array interfaces, page movement when you have more than three members in a pool, or with MEM installed it will periodically move connections, you WILL see redundancy lost messages. As you noted, by the time you check the paths the connection has already logged back in as it should. For that briefest of periods, less than a second typically, that session is gone, and a new one is being created at the same time. The volume is still accessible via the other sessions. So it's an info message, not an indication of a problem.

When the array balances slices to a new member, in pools more than three, expect to see numerous, simultaneous messages from all the hosts connecting to that datastore. This flurry is completely normal.

Social Media and Community Professional #IWork4Dell
Get Support on Twitter - @dellcarespro
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
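For reference, the receive ring size can be checked and, if the driver allows it, enlarged with ethtool (a sketch; the interface name is illustrative since atop truncates it to "enp130s", and the maximum supported size depends on the NIC):

  ethtool -g enp130s0                  # show current and maximum RX/TX ring sizes
  ethtool -G enp130s0 rx 4096          # raise the RX ring toward the hardware maximum
  ethtool -S enp130s0 | grep -i drop   # per-driver drop counters (names vary by NIC)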
(In reply to Mike Christie from comment #24)
> We see this repeated for a long time:
> 
> Mar 28 11:03:53 osd10 kernel: iSCSI Login negotiation failed.
> Mar 28 11:04:08 osd10 kernel: iSCSI Login timeout on Network Portal
> 10.2.132.10:3260
> 
> I think we just might be hitting this bug:
> 
> commit 1c130ae00b769a2e2df41bad3d6051ee8234b636
> Author: Florian Westphal <fw>
> Date:   Fri Jan 19 14:36:29 2018 +0100
> 
>     iscsi-target: make sure to wake up sleeping login worker
> 
> This could cause the connection offline errors.

I filed this bz for the login timeout issue: https://bugzilla.redhat.com/show_bug.cgi?id=1562541
@Mike Christie

We are discussing the following patch:

commit f0cfe030bd7ffff980b8194c2a73a5e1fec14ead
Author: Zhang Zhuoyu <zhangzhuoyu.com>
Date:   Fri Jan 19 16:44:08 2018 +0800

    handler: do not perform implicit transition if command not supported

    If unsupported command was handled by tcmur_cmd_handler, alua implicit
    transition have a chance to cause image exclusive-lock threshing between
    runners, so do not perform implicit transition if command not supported

    Signed-off-by: Xiubo Li <lixiubo.com>
    Signed-off-by: Zhang Zhuoyu <zhangzhuoyu.com>

Do you remember this one?

The background is that while we were testing the Ceph iSCSI gateway products and putting them under pressure with fio or dd, the /dev/sdX or /dev/mapper/mpathX devices hit read-only issues. That read-only issue might be what leads to the current bug. Do you have any ideas or suggestions about this?
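For context, the kind of load that triggered the read-only flips was plain heavy random I/O against the multipath device; a sketch of an equivalent fio job (device path, runtime and queue depths are illustrative, not the exact parameters we used):

  fio --name=gw-stress --filename=/dev/mapper/mpatha --ioengine=libaio --direct=1 \
      --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 --time_based --runtime=600 --group_reporting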
On osd10 there are many kworker threads in D state now:

[root@osd10 log]# uptime
 12:19:08 up 3 days, 15:33, 4 users, load average: 21.00, 21.10, 21.20
[root@osd10 log]#
[root@osd10 log]# ps -elf | grep D
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
4 S root 1387 1 0 80 0 - 55291 hrtime Mar28 ? 00:00:16 /usr/bin/abrt-watch-log -F BUG: WARNING: at WARNING: CPU: INFO: possible recursive locking detected ernel BUG at list_del corruption list_add corruption do_IRQ: stack overflow: ear stack overflow (cur: eneral protection fault nable to handle kernel ouble fault: RTNL: assertion failed eek! page_mapcount(page) went negative! adness at NETDEV WATCHDOG ysctl table check failed : nobody cared IRQ handler type mismatch Kernel panic - not syncing: Machine Check Exception: Machine check events logged divide error: bounds: coprocessor segment overrun: invalid TSS: segment not present: invalid opcode: alignment check: stack segment: fpu exception: simd exception: iret exception: /var/log/messages -- /usr/bin/abrt-dump-oops -xtD
4 S root 2336 1 0 80 0 - 28198 poll_s Mar28 ? 00:00:00 /usr/sbin/sshd -D
1 D root 8056 2 0 80 0 - 0 transp Mar28 ? 00:23:47 [iscsi_ttx]
1 D root 794529 2 0 80 0 - 0 transp Mar30 ? 00:00:00 [kworker/u65:0]
1 D root 823768 2 0 80 0 - 0 transp Mar31 ? 00:00:00 [kworker/u66:0]
1 D root 843864 2 0 80 0 - 0 transp Mar31 ? 00:00:00 [kworker/u65:2]
1 D root 863065 2 0 80 0 - 0 transp Mar31 ? 00:00:00 [kworker/u66:1]
1 D root 868161 2 0 80 0 - 0 transp Mar31 ? 00:00:00 [kworker/u65:1]
1 D root 868236 2 0 80 0 - 0 transp Mar31 ? 00:00:00 [kworker/u66:2]
1 D root 868238 2 0 80 0 - 0 transp Mar31 ? 00:00:00 [kworker/u65:3]
1 D root 868239 2 0 80 0 - 0 transp Mar31 ? 00:00:00 [kworker/u65:4]
1 D root 868248 2 0 80 0 - 0 transp Mar31 ? 00:00:00 [kworker/u66:3]
1 D root 868253 2 0 80 0 - 0 transp Mar31 ? 00:00:00 [kworker/u66:4]
1 D root 868255 2 0 80 0 - 0 transp Mar31 ? 00:00:00 [kworker/u66:5]
1 D root 868257 2 0 80 0 - 0 transp Mar31 ? 00:00:00 [kworker/u66:6]
1 D root 868260 2 0 80 0 - 0 transp Mar31 ? 00:00:00 [kworker/u66:7]
1 D root 868262 2 0 80 0 - 0 transp Mar31 ? 00:00:00 [kworker/u65:5]
1 D root 868263 2 0 80 0 - 0 transp Mar31 ? 00:00:00 [kworker/u66:8]
1 D root 868265 2 0 80 0 - 0 transp Mar31 ? 00:00:00 [kworker/u65:6]
1 D root 868268 2 0 80 0 - 0 transp Mar31 ? 00:00:00 [kworker/u65:7]
1 D root 868269 2 0 80 0 - 0 transp Mar31 ? 00:00:00 [kworker/u66:9]
1 D root 868275 2 0 80 0 - 0 transp Mar31 ? 00:00:00 [kworker/u65:8]
0 S root 927509 925511 0 80 0 - 28177 pipe_w 12:19 pts/1 00:00:00 grep --color=auto D
[root@osd10 log]#
[root@osd10 log]# ps -elf | grep kworker | wc-l
-bash: wc-l: command not found
[root@osd10 log]# ps -elf | grep kworker | wc -l
161
[root@osd10 log]#
[root@osd10 log]# cat /proc/794529/stack
[<ffffffffc04f2bda>] __transport_wait_for_tasks+0xba/0x1b0 [target_core_mod]
[<ffffffffc04f2db3>] transport_wait_for_tasks+0x53/0x90 [target_core_mod]
[<ffffffffc04eff30>] core_tmr_abort_task+0x100/0x180 [target_core_mod]
[<ffffffffc04f2f4a>] target_tmr_work+0x15a/0x170 [target_core_mod]
[<ffffffff9e4b2eff>] process_one_work+0x17f/0x440
[<ffffffff9e4b3bc6>] worker_thread+0x126/0x3c0
[<ffffffff9e4baf31>] kthread+0xd1/0xe0
[<ffffffff9eb1f637>] ret_from_fork_nospec_end+0x0/0x39
[<ffffffffffffffff>] 0xffffffffffffffff
[root@osd10 log]#

According to atop, they all entered D state at 18:34 on Mar 31.
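To capture the same information for every blocked task in one go, something like this can be run on the gateway node (a sketch; it just walks the process list for tasks in D state and prints their kernel stacks from /proc):

  for pid in $(ps -eo pid=,stat= | awk '$2 ~ /^D/ {print $1}'); do
      echo "== PID $pid =="
      cat /proc/$pid/stack
  done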
some of the esxi server reported Lost access error,such as 10.2.226.129: [root@C05-NMGJD-PA-ZYC-SV029-RH2288H-XNH:/vmfs/volumes/5aa2a258-d4765fd3-a8fd-883fd3d0366e/log] grep Lost hostd.2 2018-03-30T18:27:58.237Z info hostd[C181B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 5148 : Lost access to volume 5abc06ed-6ff4cbab-5dd7-a4c64f1dcfdb (CEPH-LUN08) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly. 2018-03-30T18:27:58.238Z info hostd[AC53B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 5149 : Lost access to volume 5abbff72-9da349d5-9d51-a4c64f1dcfdb (CEPH-LUN00) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly. ... Note that, the time from esxi side is about 5 minutes ahead of ceph cluster nodes. Besides, at 2018/03/30 20:19 ,iscsi_ttx, and several kworker threads went to D state, they kept D state for about 2 minutes, but two iscsi_ttx thread stuck in D forever: Note: the "tslpu" means the number of process/thread in D state. ATOP - osd10 2018/03/30 20:19:47 -------------- 15s elapsed PRC | sys 24.54s | user 33.42s | | | #proc 518 | #trun 1 | #tslpi 1340 | | #tslpu 0 | #zombie 0 | clones 12 | | | #exit 1 | CPU | sys 144% | user 212% | | irq 29% | | idle 2814% | wait 2% | | | steal 0% | guest 0% | curf 2.10GHz | | curscal 99% | CPL | avg1 3.59 | | avg5 6.26 | avg15 6.81 | | | | csw 1769808 | | intr 1961559 | | | numcpu 32 | | MEM | tot 251.4G | free 1.0G | cache 226.9G | dirty 4.3M | buff 76.1M | slab 14.9G | slrec 9.2G | shmem 4.0G | shrss 0.0M | shswp 0.0M | vmbal 0.0M | | hptot 0.0M | hpuse 0.0M | SWP | tot 32.0G | free 32.0G | | | | | | | | | | vmcom 15.8G | | vmlim 157.7G | LVM | rhel-var | busy 0% | | read 0 | write 65 | | KiB/r 0 | KiB/w 242 | MBr/s 0.0 | | MBw/s 1.0 | avq 7.40 | | avio 0.08 ms | DSK | sdj | busy 4% | | read 0 | write 10214 | | KiB/r 0 | KiB/w 44 | MBr/s 0.0 | | MBw/s 29.4 | avq 1.63 | | avio 0.06 ms | DSK | sdk | busy 4% | | read 0 | write 8250 | | KiB/r 0 | KiB/w 45 | MBr/s 0.0 | | MBw/s 24.3 | avq 1.70 | | avio 0.07 ms | DSK | sde | busy 1% | | read 0 | write 5501 | | KiB/r 0 | KiB/w 38 | MBr/s 0.0 | | MBw/s 13.9 | avq 44.60 | | avio 0.03 ms | NET | transport | tcpi 957160 | tcpo 2352978 | udpi 0 | | udpo 0 | tcpao 0 | tcppo 0 | tcprs 80 | tcpie 0 | tcpor 0 | | udpnp 0 | udpie 0 | NET | network | ipi 957159 | | ipo 886333 | ipfrw 0 | | deliv 957159 | | | | | icmpi 0 | | icmpo 0 | NET | enp130s 8% | pcki 857529 | pcko 1306151 | | sp 10 Gbps | si 581 Mbps | so 819 Mbps | | coll 0 | mlti 14 | erri 0 | erro 0 | drpi 0 | drpo 0 | NET | bond0 5% | pcki 1674473 | pcko 1806426 | | sp 20 Gbps | si 1097 Mbps | so 1011 Mbps | | coll 0 | mlti 30 | erri 0 | erro 0 | drpi 0 | drpo 0 | ATOP - osd10 2018/03/30 20:20:02 -------------- 15s elapsed PRC | sys 0.90s | user 0.57s | | | #proc 527 | #trun 2 | #tslpi 1337 | | #tslpu 6 | #zombie 0 | clones 22 | | | #exit 6 | CPU | sys 5% | user 4% | | irq 0% | | idle 3192% | wait 0% | | | steal 0% | guest 0% | curf 2.10GHz | | curscal 99% | CPL | avg1 3.78 | | avg5 6.16 | avg15 6.77 | | | | csw 70548 | | intr 96360 | | | numcpu 32 | | MEM | tot 251.4G | free 1.0G | cache 226.9G | dirty 2.5M | buff 76.1M | slab 14.9G | slrec 9.2G | shmem 4.0G | shrss 0.0M | shswp 0.0M | vmbal 0.0M | | hptot 0.0M | hpuse 0.0M | SWP | tot 32.0G | free 32.0G | | | | | | | | | | vmcom 15.8G | | vmlim 157.7G | LVM | rhel-var | busy 0% | | read 0 | write 2 | | KiB/r 0 | KiB/w 6 | MBr/s 0.0 | | MBw/s 0.0 | avq 0.00 | | avio 0.00 ms | DSK | sde | busy 2% | | 
read 30 | write 40 | | KiB/r 4 | KiB/w 7 | MBr/s 0.0 | | MBw/s 0.0 | avq 1.02 | | avio 4.39 ms | DSK | sdd | busy 0% | | read 0 | write 533 | | KiB/r 0 | KiB/w 84 | MBr/s 0.0 | | MBw/s 2.9 | avq 128.20 | | avio 0.03 ms | DSK | sdf | busy 0% | | read 0 | write 460 | | KiB/r 0 | KiB/w 55 | MBr/s 0.0 | | MBw/s 1.7 | avq 77.50 | | avio 0.03 ms | NET | transport | tcpi 18305 | tcpo 33885 | udpi 0 | | udpo 0 | tcpao 0 | tcppo 0 | tcprs 101 | tcpie 0 | tcpor 0 | | udpnp 0 | udpie 0 | NET | network | ipi 18306 | | ipo 18476 | ipfrw 0 | | deliv 18306 | | | | | icmpi 0 | | icmpo 0 | NET | enp130s 0% | pcki 6803 | pcko 16582 | | sp 10 Gbps | si 3238 Kbps | so 8782 Kbps | | coll 0 | mlti 14 | erri 0 | erro 0 | drpi 0 | drpo 0 | NET | bond0 0% | pcki 17157 | pcko 22881 | | sp 20 Gbps | si 6902 Kbps | so 11 Mbps | | coll 0 | mlti 28 | erri 0 | erro 0 | drpi 0 | drpo 0 | ATOP - osd10 2018/03/30 20:20:17 -------------- 15s elapsed PRC | sys 0.60s | user 0.52s | | | #proc 522 | #trun 2 | #tslpi 1318 | | #tslpu 9 | #zombie 0 | clones 16 | | | #exit 19 | CPU | sys 4% | user 3% | | irq 0% | | idle 3193% | wait 0% | | | steal 0% | guest 0% | curf 2.10GHz | | curscal 99% | CPL | avg1 4.58 | | avg5 6.22 | avg15 6.78 | | | | csw 51769 | | intr 71668 | | | numcpu 32 | | MEM | tot 251.4G | free 1.0G | cache 226.9G | dirty 1.0M | buff 76.1M | slab 14.9G | slrec 9.2G | shmem 4.0G | shrss 0.0M | shswp 0.0M | vmbal 0.0M | | hptot 0.0M | hpuse 0.0M | SWP | tot 32.0G | free 32.0G | | | | | | | | | | vmcom 15.8G | | vmlim 157.7G | LVM | rhel-var | busy 0% | | read 0 | write 10 | | KiB/r 0 | KiB/w 245 | MBr/s 0.0 | | MBw/s 0.2 | avq 0.00 | | avio 0.00 ms | DSK | sde | busy 3% | | read 4 | write 357 | | KiB/r 64 | KiB/w 77 | MBr/s 0.0 | | MBw/s 1.8 | avq 1.87 | | avio 1.37 ms | DSK | sdh | busy 1% | | read 8 | write 284 | | KiB/r 64 | KiB/w 82 | MBr/s 0.0 | | MBw/s 1.5 | avq 7.68 | | avio 0.41 ms | DSK | sdb | busy 1% | | read 5 | write 18 | | KiB/r 64 | KiB/w 25 | MBr/s 0.0 | | MBw/s 0.0 | avq 1.00 | | avio 3.87 ms | NET | transport | tcpi 11654 | tcpo 16054 | udpi 0 | | udpo 0 | tcpao 9 | tcppo 13 | tcprs 18 | tcpie 0 | tcpor 0 | | udpnp 0 | udpie 0 | NET | network | ipi 11654 | | ipo 11668 | ipfrw 0 | | deliv 11654 | | | | | icmpi 0 | | icmpo 0 | NET | enp129s 0% | pcki 4598 | pcko 4256 | | sp 10 Gbps | si 1944 Kbps | so 1691 Kbps | | coll 0 | mlti 16 | erri 0 | erro 0 | drpi 0 | drpo 0 | NET | bond0 0% | pcki 8238 | pcko 8411 | | sp 20 Gbps | si 3331 Kbps | so 3257 Kbps | | coll 0 | mlti 32 | erri 0 | erro 0 | drpi 0 | drpo 0 | ATOP - osd10 2018/03/30 20:21:02 -------------- 15s elapsed PRC | sys 1.33s | user 0.96s | | | #proc 505 | #trun 1 | #tslpi 1329 | | #tslpu 2 | #zombie 0 | clones 15 | | | #exit 0 | CPU | sys 9% | user 5% | | irq 0% | | idle 3187% | wait 0% | | | steal 0% | guest 0% | curf 2.10GHz | | curscal 99% | CPL | avg1 3.42 | | avg5 5.70 | avg15 6.58 | | | | csw 180520 | | intr 236565 | | | numcpu 32 | | MEM | tot 251.4G | free 1.0G | cache 226.9G | dirty 0.4M | buff 76.1M | slab 14.9G | slrec 9.2G | shmem 4.0G | shrss 0.0M | shswp 0.0M | vmbal 0.0M | | hptot 0.0M | hpuse 0.0M | SWP | tot 32.0G | free 32.0G | | | | | | | | | | vmcom 15.8G | | vmlim 157.7G | LVM | rhel-root | busy 0% | | read 0 | write 1 | | KiB/r 0 | KiB/w 4 | MBr/s 0.0 | | MBw/s 0.0 | avq 0.00 | | avio 0.00 ms | LVM | rhel-var | busy 0% | | read 0 | write 4 | | KiB/r 0 | KiB/w 65 | MBr/s 0.0 | | MBw/s 0.0 | avq 0.00 | | avio 0.00 ms | DSK | sdd | busy 0% | | read 0 | write 358 | | KiB/r 0 | KiB/w 18 | MBr/s 0.0 | | MBw/s 0.4 | avq 80.80 | 
| avio 0.01 ms | DSK | sdb | busy 0% | | read 0 | write 251 | | KiB/r 0 | KiB/w 22 | MBr/s 0.0 | | MBw/s 0.4 | avq 74.33 | | avio 0.01 ms | DSK | sdf | busy 0% | | read 0 | write 249 | | KiB/r 0 | KiB/w 24 | MBr/s 0.0 | | MBw/s 0.4 | avq 70.67 | | avio 0.01 ms | NET | transport | tcpi 26696 | tcpo 31156 | udpi 0 | | udpo 0 | tcpao 0 | tcppo 45 | tcprs 1 | tcpie 0 | tcpor 0 | | udpnp 0 | udpie 0 | NET | network | ipi 26696 | | ipo 26650 | ipfrw 0 | | deliv 26696 | | | | | icmpi 0 | | icmpo 0 | NET | enp130s 0% | pcki 8407 | pcko 18843 | | sp 10 Gbps | si 1759 Kbps | so 2527 Kbps | | coll 0 | mlti 14 | erri 0 | erro 0 | drpi 0 | drpo 0 | NET | enp129s 0% | pcki 14457 | pcko 4399 | | sp 10 Gbps | si 2266 Kbps | so 1702 Kbps | | coll 0 | mlti 15 | erri 0 | erro 0 | drpi 0 | drpo 0 | during which, there are many esxi server reported connection issue, such as: 141: [root@D04-NMGJD-PA-ZYC-SV041-RH2288H-XNH:/vmfs/volumes/5aa25cfe-f7d265b4-56aa-2c55d3428853/log] grep -i Lost hostd.1 2018-03-30T20:23:39.755Z info hostd[B185B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 1560 : Lost access to volume 5abc008e-3b508996-3e93-a4c64f1dcfdb (CEPH-LUN02) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly. --> eventTypeId = "esx.problem.storage.redundancy.lost", 2018-03-30T20:24:03.220Z info hostd[76AF890] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 1594 : Lost path redundancy to storage device naa.60014056b8554b392dd4b9c83c1343c5. Path vmhba64:C8:T0:L10 is down. Affected datastores: CEPH-LUN10. --> eventTypeId = "esx.problem.storage.redundancy.lost", 2018-03-30T20:24:03.223Z info hostd[A981B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 1595 : Lost path redundancy to storage device naa.60014056b8554b392dd4b9c83c1343c5. Path vmhba64:C7:T0:L10 is down. Affected datastores: CEPH-LUN10. --> eventTypeId = "esx.problem.storage.redundancy.lost", 2018-03-30T20:24:03.227Z info hostd[B185B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 1596 : Lost path redundancy to storage device naa.60014056b8554b392dd4b9c83c1343c5. Path vmhba64:C5:T0:L10 is down. Affected datastores: CEPH-LUN10. 137: [root@D04-NMGJD-PA-ZYC-SV037-RH2288H-XNH:/vmfs/volumes/5aa25cb0-5a64d290-3279-2c55d3428865/log] grep -i Lost hostd.1 2018-03-30T20:02:28.305Z info hostd[F4C2B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 9879 : Lost access to volume 5abc06ed-6ff4cbab-5dd7-a4c64f1dcfdb (CEPH-LUN08) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly. 2018-03-30T20:19:28.305Z info hostd[FBC2B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 9882 : Lost access to volume 5abc06ed-6ff4cbab-5dd7-a4c64f1dcfdb (CEPH-LUN08) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly. --> eventTypeId = "esx.problem.storage.redundancy.lost", 2018-03-30T20:19:58.054Z info hostd[F440B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 9883 : Lost path redundancy to storage device naa.6001405fc0cdc8fc75e4cd0b9f6beec9. Path vmhba64:C7:T0:L8 is down. Affected datastores: CEPH-LUN08. --> eventTypeId = "esx.problem.storage.redundancy.lost", 2018-03-30T20:22:35.054Z info hostd[F840B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 9886 : Lost path redundancy to storage device naa.6001405d8dc678c19e34e27af2807e6c. Path vmhba64:C5:T0:L5 is down. Affected datastores: CEPH-LUN05. 
--> eventTypeId = "esx.problem.storage.redundancy.lost", 2018-03-30T20:22:35.058Z info hostd[F440B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 9891 : Lost path redundancy to storage device naa.60014056b8554b392dd4b9c83c1343c5. Path vmhba64:C5:T0:L10 is down. Affected datastores: CEPH-LUN10. --> eventTypeId = "esx.problem.storage.redundancy.lost", 2018-03-30T20:22:35.061Z info hostd[F4C2B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 9892 : Lost path redundancy to storage device naa.6001405b0d5435acb9e4a11b3fcab332. Path vmhba64:C5:T0:L2 is down. Affected datastores: CEPH-LUN02. --> eventTypeId = "esx.problem.storage.redundancy.lost", 2018-03-30T20:22:35.064Z info hostd[F840B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 9893 : Lost path redundancy to storage device naa.6001405652ddf4079a04283ac927709e. Path vmhba64:C5:T0:L6 is down. Affected datastores: CEPH-LUN06. --> eventTypeId = "esx.problem.storage.redundancy.lost", I hope this can help you
(In reply to Xiubo Li from comment #43)
> @Mike Christie
> 
> We are discussing the following patch:
> 
> commit f0cfe030bd7ffff980b8194c2a73a5e1fec14ead
> Author: Zhang Zhuoyu <zhangzhuoyu.com>
> Date:   Fri Jan 19 16:44:08 2018 +0800
> 
>     handler: do not perform implicit transition if command not supported
> 
>     If unsupported command was handled by tcmur_cmd_handler, alua implicit
>     transition have a chance to cause image exclusive-lock threshing between
>     runners, so do not perform implicit transition if command not supported
> 
>     Signed-off-by: Xiubo Li <lixiubo.com>
>     Signed-off-by: Zhang Zhuoyu <zhangzhuoyu.com>
> 
> Do you remember this one ?

Yes.

> This is because we were testing the ceph iscsi gateway products and when we
> are pressure it by fio or dd, the /dev/sdX or /dev/mapper/mpathX will hit
> read-only issues.
> 
> The readonly issue maybe could lead the current bug. Do you have any more
> idea or suggestion about this ?

We should add the patch. We are seeing REPORT_SUPPORTED_OPERATION_CODES in the logs, so it could definitely cause the thrashing you describe, and I think we do see some lock flips when we shouldn't. I could see it leading to I/O slowdowns that could get us into the state where we are trying to disable TPGs.

However, other than that, I am not sure if or how it is related to runner getting stuck in the disable-tpg code, or how it is causing commands to hang. I'm not saying it will not cause that; I just do not see any way it could, but there could be a bug.
William,

Could you upload the /var/log/messages for that time period, and could you give me the output of:

dmesg -c; echo w >/proc/sysrq-trigger; dmesg -c >waiting-tasks.txt
(In reply to Mike Christie from comment #46)
> William
> 
> Could you upload the the /var/log/message for that time period and could you
> give me the output of
> 
> dmesg -c; echo w >/proc/sysrq-trigger; dmesg -c >waiting-tasks.txt

Ignore that request. I am logged into it.

I think you might be hitting bugs right now because the tcmu-runner process is showing as defunct, so the tmr hangs are expected as nothing can complete commands.

Looks like runner crashed or someone tried to kill it and it cannot restart.
(In reply to Mike Christie from comment #47)
> (In reply to Mike Christie from comment #46)
> > William
> > 
> > Could you upload the the /var/log/message for that time period and could you
> > give me the output of
> > 
> > dmesg -c; echo w >/proc/sysrq-trigger; dmesg -c >waiting-tasks.txt
> 
> Ignore that request. I am logged into it.
> 
> I think you might be hitting bugs right now because the tcmu-runner process
> is showing as defunct, so the tmr hangs are expected as nothing can complete
> commands.
> 
> Looks like runner crashed or someone tried to kill it and it cannot restart.

Yes, right, someone was trying to update the tcmu-runner daemon and then restart it, and it got stuck.
(In reply to Xiubo Li from comment #48)
> (In reply to Mike Christie from comment #47)
> > (In reply to Mike Christie from comment #46)
> > > William
> > > 
> > > Could you upload the the /var/log/message for that time period and could you
> > > give me the output of
> > > 
> > > dmesg -c; echo w >/proc/sysrq-trigger; dmesg -c >waiting-tasks.txt
> > 
> > Ignore that request. I am logged into it.
> > 
> > I think you might be hitting bugs right now because the tcmu-runner process
> > is showing as defunct, so the tmr hangs are expected as nothing can complete
> > commands.
> > 
> > Looks like runner crashed or someone tried to kill it and it cannot restart.
> 
> Yes, right, someone is trying to update the tcmu-runner daemon and then
> restart it, it stucks.

Were you or William trying to update it or restart it at that time or did it just crash?

Just wanted to confirm because a crash is a new bug.

Was it
(In reply to Mike Christie from comment #49)
> (In reply to Xiubo Li from comment #48)
> > (In reply to Mike Christie from comment #47)
> > > (In reply to Mike Christie from comment #46)
> > > > William
> > > > 
> > > > Could you upload the the /var/log/message for that time period and could you
> > > > give me the output of
> > > > 
> > > > dmesg -c; echo w >/proc/sysrq-trigger; dmesg -c >waiting-tasks.txt
> > > 
> > > Ignore that request. I am logged into it.
> > > 
> > > I think you might be hitting bugs right now because the tcmu-runner process
> > > is showing as defunct, so the tmr hangs are expected as nothing can complete
> > > commands.
> > > 
> > > Looks like runner crashed or someone tried to kill it and it cannot restart.
> > 
> > Yes, right, someone is trying to update the tcmu-runner daemon and then
> > restart it, it stucks.
> 
> Were you or William trying to update it or restart it at that time or did it
> just crash?
> 
> Just wanted to confirm because a crash is a new bug.
> 
> Was it

I think William was doing that.
It looks like osd10 had timeout issues starting at Mar 30 20:20:14. We hit the osd_op_timeout, so commands were taking longer than 30 seconds. During this time we also see a lot of OSD/Ceph heartbeat errors in /var/log/messages.

runner tried to disable the tpg and we then hit this bug:

https://bugzilla.redhat.com/show_bug.cgi?id=1561851

where runner gets stuck trying to disable the tpg. Then it looks like runner was killed, so SCSI commands are timing out, and we already know the issue for that.
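For anyone reproducing this, the RADOS-side symptoms that feed into osd_op_timeout can usually be seen with the standard cluster health commands while the stall is happening (a sketch; run from any monitor/admin node):

  ceph -s               # overall health, look for slow requests and down OSDs
  ceph health detail    # names the OSDs with slow/blocked requests
  ceph osd perf         # per-OSD commit/apply latencies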
(In reply to Mike Christie from comment #47)
> (In reply to Mike Christie from comment #46)
> > William
> > 
> > Could you upload the the /var/log/message for that time period and could you
> > give me the output of
> > 
> > dmesg -c; echo w >/proc/sysrq-trigger; dmesg -c >waiting-tasks.txt
> 
> Ignore that request. I am logged into it.
> 
> I think you might be hitting bugs right now because the tcmu-runner process
> is showing as defunct, so the tmr hangs are expected as nothing can complete
> commands.
> 
> Looks like runner crashed or someone tried to kill it and it cannot restart.

Mike,

[root@osd10 log]# cat /proc/794529/stack
[<ffffffffc04f2bda>] __transport_wait_for_tasks+0xba/0x1b0 [target_core_mod]
[<ffffffffc04f2db3>] transport_wait_for_tasks+0x53/0x90 [target_core_mod]
[<ffffffffc04eff30>] core_tmr_abort_task+0x100/0x180 [target_core_mod]
[<ffffffffc04f2f4a>] target_tmr_work+0x15a/0x170 [target_core_mod]
[<ffffffff9e4b2eff>] process_one_work+0x17f/0x440
[<ffffffff9e4b3bc6>] worker_thread+0x126/0x3c0
[<ffffffff9e4baf31>] kthread+0xd1/0xe0
[<ffffffff9eb1f637>] ret_from_fork_nospec_end+0x0/0x39
[<ffffffffffffffff>] 0xffffffffffffffff
[root@osd10 log]#

(In reply to Mike Christie from comment #49)
> (In reply to Xiubo Li from comment #48)
> > (In reply to Mike Christie from comment #47)
> > > (In reply to Mike Christie from comment #46)
> > > > William
> > > > 
> > > > Could you upload the the /var/log/message for that time period and could you
> > > > give me the output of
> > > > 
> > > > dmesg -c; echo w >/proc/sysrq-trigger; dmesg -c >waiting-tasks.txt
> > > 
> > > Ignore that request. I am logged into it.
> > > 
> > > I think you might be hitting bugs right now because the tcmu-runner process
> > > is showing as defunct, so the tmr hangs are expected as nothing can complete
> > > commands.
> > > 
> > > Looks like runner crashed or someone tried to kill it and it cannot restart.
> > 
> > Yes, right, someone is trying to update the tcmu-runner daemon and then
> > restart it, it stucks.
> 
> Were you or William trying to update it or restart it at that time or did it
> just crash?
> 
> Just wanted to confirm because a crash is a new bug.
> 
> Was it

No, the system has been running continuously since 28 Mar; I did not reboot it:

[root@osd10 log]# uptime
 12:19:08 up 3 days, 15:33, 4 users, load average: 21.00, 21.10, 21.20

and no vmcore file was found under /var/crash/. I also just noticed that the tcmu-runner process became defunct after many kworker threads got stuck on Mar 31 at 18:34. You can check the atop history under /var/log/atop:

# atop -r /var/log/atop/atop_20180331
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1563