Description of problem:

There are many warning messages in vmkernel.log when I use the iSCSI gateway (iscsi-gw) in Ceph 3.0. At the same time, the vSphere monitor tab shows "Lost connection" information for the volume, and vmkernel.log contains many 'iSCSI connection is being marked "OFFLINE" (Event:4)' messages. For example:

2018-03-30T11:15:52.455Z cpu9:65646)WARNING: HBX: 727: 'CEPH-LUN09': HB at offset 0 - Volume 5abc0760-13c36702-c307-a4c64f1dcfdb may be damaged on disk. Corrupt heartbeat detected:
2018-03-30T11:15:52.455Z cpu9:65646)WARNING: [HB state 0 offset 0 gen 0 stampUS 0 uuid 00000000-00000000-0000-000000000000 jrnl <FB 0> drv 0.0]
2018-03-30T11:15:55.067Z cpu22:66551)iscsi_vmk: iscsivmk_ConnNetRegister: socket 0x4309061150c0 network resource pool netsched.pools.persist.iscsi associated
2018-03-30T11:15:55.067Z cpu22:66551)iscsi_vmk: iscsivmk_ConnNetRegister: socket 0x4309061150c0 network tracker id 88966450 tracker.iSCSI.10.2.132.2 associated
2018-03-30T11:15:55.068Z cpu2:66521)WARNING: iscsi_vmk: iscsivmk_ConnReceiveAtomic: vmhba64:CH:2 T:0 CN:0: Failed to receive data: Connection closed by peer
2018-03-30T11:15:55.068Z cpu2:66521)WARNING: iscsi_vmk: iscsivmk_ConnReceiveAtomic: Sess [ISID: 00023d000003 TARGET: iqn.2017-12.com.redhat.iscsi-gw:ceph-igw TPGT: 3 TSIH: 0]
2018-03-30T11:15:55.068Z cpu2:66521)WARNING: iscsi_vmk: iscsivmk_ConnReceiveAtomic: Conn [CID: 0 L: 10.2.128.61:49656 R: 10.2.132.2:3260]
2018-03-30T11:15:55.068Z cpu2:66521)iscsi_vmk: iscsivmk_ConnRxNotifyFailure: vmhba64:CH:2 T:0 CN:0: Connection rx notifying failure: Failed to Receive. State=Bound
2018-03-30T11:15:55.068Z cpu2:66521)iscsi_vmk: iscsivmk_ConnRxNotifyFailure: Sess [ISID: 00023d000003 TARGET: iqn.2017-12.com.redhat.iscsi-gw:ceph-igw TPGT: 3 TSIH: 0]
2018-03-30T11:15:55.068Z cpu2:66521)iscsi_vmk: iscsivmk_ConnRxNotifyFailure: Conn [CID: 0 L: 10.2.128.61:49656 R: 10.2.132.2:3260]
2018-03-30T11:15:55.321Z cpu22:66551)WARNING: iscsi_vmk: iscsivmk_StopConnection: vmhba64:CH:2 T:0 CN:0: iSCSI connection is being marked "OFFLINE" (Event:4)
2018-03-30T11:15:55.321Z cpu22:66551)WARNING: iscsi_vmk: iscsivmk_StopConnection: Sess [ISID: 00023d000003 TARGET: iqn.2017-12.com.redhat.iscsi-gw:ceph-igw TPGT: 3 TSIH: 0]
2018-03-30T11:15:55.321Z cpu22:66551)WARNING: iscsi_vmk: iscsivmk_StopConnection: Conn [CID: 0 L: 10.2.128.61:49656 R: 10.2.132.2:3260]

Version-Release number of selected component (if applicable):

cat installed-rpms | grep ceph
ceph-base-12.2.1-40.el7cp.x86_64  Mon Dec 25 01:45:02 2017
ceph-common-12.2.1-40.el7cp.x86_64  Mon Dec 25 01:44:19 2017
ceph-iscsi-cli-2.5-9.el7cp.noarch  Mon Dec 25 22:31:05 2017
ceph-iscsi-config-2.3-12.el7cp.noarch  Mon Dec 25 08:49:28 2017
ceph-iscsi-tools-2.1-3.el7cp.noarch  Mon Dec 25 22:54:29 2017
ceph-osd-12.2.1-40.el7cp.x86_64  Mon Dec 25 01:45:23 2017
ceph-selinux-12.2.1-40.el7cp.x86_64  Mon Dec 25 01:45:02 2017
libcephfs2-12.2.1-40.el7cp.x86_64  Mon Dec 25 01:44:16 2017
python-cephfs-12.2.1-40.el7cp.x86_64  Mon Dec 25 01:44:16 2017

cat installed-rpms | grep tcmu
libtcmu-1.3.0-0.4.el7cp.x86_64  Mon Dec 25 08:49:31 2017
tcmu-runner-1.3.0-0.4.el7cp.x86_64  Mon Dec 25 08:49:31 2017

Kernel version:
cat uname
Linux osd1.zyc4.cmicnmg.com 3.10.0-858.el7.x86_64 #1 SMP Tue Feb 27 08:59:23 EST 2018 x86_64 x86_64 x86_64 GNU/Linux

[wliu@collab-shell osd1.zyc4.cmicnmg.com]$ cat etc/redhat-release
Red Hat Enterprise Linux Server release 7.5 Beta (Maipo)

VMware ESXi version: 6.5

How reproducible:
100% reproducible

Steps to Reproduce:
1.
2.
3.
Actual results:

Expected results:

Additional info:
When you report this info, could you also give the output of:

esxcli iscsi session list
esxcli iscsi session connection list
esxcli storage nmp device list
esxcli storage core device list -d iscsi_diskABC
esxcli storage nmp path list -d iscsi_diskABC

Could you also save us the hunt and describe what operations, if any, you were doing at the time, like adding disks, OSDs, exporting new LUNs, etc.
Dear Mike,

Thanks for your help with this ticket.

Regarding your question "Could you also save us the hunt and describe what operations if any you were doing at this time like adding disks, osds, exporting new luns, etc.": actually there was no I/O action on the RADOS side at all (no adding disks, OSDs, exporting new LUNs, etc.), and we still saw the "Lost Connection" issue on the vSphere monitor tab.

We will update the ticket with the output of those commands later.

Thanks again!

Liu Wei
For future use for the user and to simplify new bzs, could you disable the Linux initiator if you are not using it anymore? It is filling up the logs in all these bzs with garbage relogin errors:

Mar 25 03:49:18 osd1 kernel: iSCSI Initiator Node: iqn.1994-05.com.redhat:f14d7a65d5ca is not authorized to access iSCSI target portal group: 1.

It looks like you might have set it up as a test and then only partially stopped it.

On the Linux box, just do:

iscsiadm -m node -u

and then delete the record:

iscsiadm -m node -o delete
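A minimal sketch of the full cleanup, assuming the node uses the stock RHEL 7 iscsi-initiator-utils services and no other iSCSI sessions are needed on it:

  iscsiadm -m node -u                              # log out of every stored node record
  iscsiadm -m node -o delete                       # delete the node records so nothing relogs in
  systemctl stop iscsid.service iscsi.service      # stop the initiator daemons
  systemctl disable iscsid.service iscsi.service   # keep them from starting again at boot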
The esxcli session and connection output indicates you have 8 sessions. Were you doing iSCSI interface/iface/port binding on the initiator/ESX/VMware side, or did you manually try to log in more than one session per iSCSI target TPG per initiator?
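One way to check this from the ESXi side (a sketch; vmhba64 is the software iSCSI adapter name seen in the vmkernel.log excerpts above, adjust if yours differs):

  esxcli iscsi networkportal list -A vmhba64   # vmknics bound to the software iSCSI adapter
  esxcli iscsi session list -A vmhba64         # one entry per session, per target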
The vCenter client reported "connection lost" ERRORs on Mar 30, but I have not found logs from that day on the ESXi server; it is strange that ESXi keeps such limited logs. We saw lots of WARNING-level messages in ESXi before, but did not see ERROR-level entries matching what vCenter reported. I am not sure whether a "warning" on the ESXi server is surfaced as an ERROR in the vCenter client. Can a VMware expert be involved in this issue?
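In the meantime I will try grepping the hostd/vobd logs directly, since the connectivity events vCenter surfaces are usually recorded there rather than in vmkernel.log (a sketch; paths are the ESXi 6.5 defaults):

  grep -iE 'lost access|redundancy lost' /var/log/vobd.log /var/log/hostd.log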
(In reply to Mike Christie from comment #12)
> For future use for the user and to simplify new bzs, could you disable the
> linux initiator if you are not using it anymore? It is filling up the logs
> in all these bzs with garbage relogin errors:
> 
> Mar 25 03:49:18 osd1 kernel: iSCSI Initiator Node:
> iqn.1994-05.com.redhat:f14d7a65d5ca is not authorized to access iSCSI target
> portal group: 1.
> 
> It looks like you might have set it up as a test and then only partially
> stopped it.
> 
> On the linux box, just do:
> 
> iscsiadm -m node -u
> 
> and then just delete the record
> 
> iscsiadm -m node -o delete

OK, I have disabled and stopped the iscsi service across all the Ceph cluster nodes.
(In reply to William Lau from comment #15)
> The vcenter client reported "connection Lost ERRORs" on Mar 30, but I have
> not found the logs of that day at esxi server. it's very strange that the
> esxi keep so limited logs.

Check out 127_vmkernel.log in all_vmkernel_log.tar.gz in https://access.redhat.com/support/cases/02056772.
Not sure if it is related to the packet drops or is the cause, but there is a bug in the iscsi layer that we might be hitting, and it could be causing some of the problems.

We see this repeated for a long time:

Mar 28 11:03:53 osd10 kernel: iSCSI Login negotiation failed.
Mar 28 11:04:08 osd10 kernel: iSCSI Login timeout on Network Portal 10.2.132.10:3260

I think we just might be hitting this bug:

commit 1c130ae00b769a2e2df41bad3d6051ee8234b636
Author: Florian Westphal <fw>
Date:   Fri Jan 19 14:36:29 2018 +0100

    iscsi-target: make sure to wake up sleeping login worker

This could cause the connection offline errors.

I really have no idea if this bug can cause packets to just look like they are dropped.

On the other hand, if the NIC is going bad and dropping packets, then that would explain why we see the iscsi login timeout errors and slow commands again (in the logs I saw ABORTs being sent a couple of times on Mar 30 on osd10, osd1 and osd20, so commands were taking longer than normal).

The other possibility is that a command was lost in the target layer and that is causing the target to drop relogin attempts. We would see a similar login timeout error. However, from the esxcli output it looks like all the targets but osd2 are logged back in.

Note I still need the logs for osd2, because that seems to be hitting some hang in the session reinstatement code, which may be a lost-command type of bug causing relogins to fail.

If you guys have earlier vmkernel logs it would be helpful. The earliest I have starts at 2018-03-30T11:07:07.
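A quick way to check whether a given kernel build already carries that fix (a sketch; it assumes the RPM changelog quotes the upstream subject line, which may not be the case for every build):

  rpm -q --changelog kernel-3.10.0-858.el7.x86_64 | grep -i 'wake up sleeping login worker'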
(In reply to Mike Christie from comment #24)
> Not sure if related to the packet drops or the cause but there is a bug in
> the iscsi layer that we might be hitting and causing some of the problems.
> 
> We see this repeated for a long time:
> 
> Mar 28 11:03:53 osd10 kernel: iSCSI Login negotiation failed.
> Mar 28 11:04:08 osd10 kernel: iSCSI Login timeout on Network Portal
> 10.2.132.10:3260
> 
> I think we just might be hitting this bug:
> 
> commit 1c130ae00b769a2e2df41bad3d6051ee8234b636
> Author: Florian Westphal <fw>
> Date:   Fri Jan 19 14:36:29 2018 +0100
> 
>     iscsi-target: make sure to wake up sleeping login worker
> 
> This could cause the connection offline errors.
> 
> I really have no idea if this bug can cause packets to just look like they
> are dropped.
> 
> On the other hand, if the nic is going bad and dropping packets then that
> would explain why we see the iscsi login timeout errors and slow commands
> again (In the logs I saw ABORTs being sent a couple times on Mar 30 on osd
> 10 osd 1 and osd20 so commands were taking longer than normal)
> 
> The other possibility is that a command was lost in the target layer and
> that is causing the target to drop relogin attempts. We would see a similar
> login timeout error. However, from the esxcli output it looks like all the
> targets but the osd2 are logged back in.
> 
> Note I still need the logs for osd2 because that seems to be hitting some
> hang in the session reinstatement code which may be a lost command type of
> bug causing relogins to fail.
> 
> If you guys have an earlier vmkernel logs it would be helpful. The earliest
> I have starts at 2018-03-30T11:07:07.

I checked the atop logs from 30 Mar on osd10 and osd20. These two nodes dropped many more packets than osd1 and osd2, and the reason for the drops was that the NIC was receiving massive numbers of packets; I think it ran out of the NIC's ring buffer (see the ethtool sketch after the quoted Dell comment below). Seemingly, the packet drops have nothing to do with the "Lost connection" error from vCenter, since vCenter did not raise any alarm while the packets were being dropped.

Additionally, in 127_vmkernel.log I see many connection/session-related warnings, such as:

2018-03-30T11:08:02.192Z cpu6:66508)WARNING: iscsi_vmk: iscsivmk_ConnReceiveAtomic: vmhba64:CH:5 T:0 CN:0: Failed to receive data: Connection closed by peer
2018-03-30T11:08:02.192Z cpu6:66508)WARNING: iscsi_vmk: iscsivmk_ConnReceiveAtomic: Sess [ISID: 00023d000006 TARGET: iqn.2017-12.com.redhat.iscsi-gw:ceph-igw TPGT: 3 TSIH: 0]
2018-03-30T11:08:02.192Z cpu6:66508)WARNING: iscsi_vmk: iscsivmk_ConnReceiveAtomic: Conn [CID: 0 L: 10.2.128.54:54493 R: 10.2.132.2:3260]
2018-03-30T11:08:02.192Z cpu6:66508)iscsi_vmk: iscsivmk_ConnRxNotifyFailure: vmhba64:CH:5 T:0 CN:0: Connection rx notifying failure: Failed to Receive. State=Bound
2018-03-30T11:08:02.192Z cpu6:66508)iscsi_vmk: iscsivmk_ConnRxNotifyFailure: Sess [ISID: 00023d000006 TARGET: iqn.2017-12.com.redhat.iscsi-gw:ceph-igw TPGT: 3 TSIH: 0]
2018-03-30T11:08:02.193Z cpu6:66508)iscsi_vmk: iscsivmk_ConnRxNotifyFailure: Conn [CID: 0 L: 10.2.128.54:54493 R: 10.2.132.2:3260]

I am not sure whether the above messages correspond to the "connection lost" error on the vCenter side.

A similar case from Dell: https://www.dell.com/community/EqualLogic/iSCSI-connection-issues-or-not/m-p/4123283

One of the comments is worth noting:
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
06-03-2013 03:30 PM
Re: iSCSI connection issues or not?
Whenever a connection is moved, either by the array for balancing IO across available array interfaces, page movement when you have more than three members in a pool, or with MEM installed it will periodically move connections, you WILL see redundancy lost messages. As you noted, by the time you check the paths the connection has already logged back in as it should. For that briefest of periods, less than a second typically, that session is gone, and a new one is being created at the same time. The volume is still accessible via the other sessions. So it's an info message, not an indication of a problem.

When the array balances slices to a new member, in pools more than three, expect to see numerous, simultaneous messages from all the hosts connecting to that datastore. This flurry is completely normal.

Social Media and Community Professional #IWork4Dell
Get Support on Twitter - @dellcarespro
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
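For reference, the receive ring size can be checked and, if the driver allows it, enlarged with ethtool (a sketch; the interface name is illustrative since atop truncates it to "enp130s", and the maximum supported size depends on the NIC):

  ethtool -g enp130s0                  # show current and maximum RX/TX ring sizes
  ethtool -G enp130s0 rx 4096          # raise the RX ring toward the hardware maximum
  ethtool -S enp130s0 | grep -i drop   # per-driver drop counters (names vary by NIC)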
(In reply to Mike Christie from comment #24)
> We see this repeated for a long time:
> 
> Mar 28 11:03:53 osd10 kernel: iSCSI Login negotiation failed.
> Mar 28 11:04:08 osd10 kernel: iSCSI Login timeout on Network Portal
> 10.2.132.10:3260
> 
> I think we just might be hitting this bug:
> 
> commit 1c130ae00b769a2e2df41bad3d6051ee8234b636
> Author: Florian Westphal <fw>
> Date:   Fri Jan 19 14:36:29 2018 +0100
> 
>     iscsi-target: make sure to wake up sleeping login worker
> 
> This could cause the connection offline errors.

I filed this bz for the login timeout issue: https://bugzilla.redhat.com/show_bug.cgi?id=1562541
@Mike Christie

We are discussing the following patch:

commit f0cfe030bd7ffff980b8194c2a73a5e1fec14ead
Author: Zhang Zhuoyu <zhangzhuoyu.com>
Date:   Fri Jan 19 16:44:08 2018 +0800

    handler: do not perform implicit transition if command not supported

    If unsupported command was handled by tcmur_cmd_handler, alua implicit
    transition have a chance to cause image exclusive-lock threshing between
    runners, so do not perform implicit transition if command not supported

    Signed-off-by: Xiubo Li <lixiubo.com>
    Signed-off-by: Zhang Zhuoyu <zhangzhuoyu.com>

Do you remember this one?

The background is that while we were testing the Ceph iSCSI gateway products and putting them under pressure with fio or dd, the /dev/sdX or /dev/mapper/mpathX devices hit read-only issues. That read-only issue might be what leads to the current bug. Do you have any ideas or suggestions about this?
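For context, the kind of load that triggered the read-only flips was plain heavy random I/O against the multipath device; a sketch of an equivalent fio job (device path, runtime and queue depths are illustrative, not the exact parameters we used):

  fio --name=gw-stress --filename=/dev/mapper/mpatha --ioengine=libaio --direct=1 \
      --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 --time_based --runtime=600 --group_reporting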
On osd10 there are many kworker threads in D state now:

[root@osd10 log]# uptime
 12:19:08 up 3 days, 15:33, 4 users, load average: 21.00, 21.10, 21.20
[root@osd10 log]#
[root@osd10 log]# ps -elf | grep D
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
4 S root 1387 1 0 80 0 - 55291 hrtime Mar28 ? 00:00:16 /usr/bin/abrt-watch-log -F BUG: WARNING: at WARNING: CPU: INFO: possible recursive locking detected ernel BUG at list_del corruption list_add corruption do_IRQ: stack overflow: ear stack overflow (cur: eneral protection fault nable to handle kernel ouble fault: RTNL: assertion failed eek! page_mapcount(page) went negative! adness at NETDEV WATCHDOG ysctl table check failed : nobody cared IRQ handler type mismatch Kernel panic - not syncing: Machine Check Exception: Machine check events logged divide error: bounds: coprocessor segment overrun: invalid TSS: segment not present: invalid opcode: alignment check: stack segment: fpu exception: simd exception: iret exception: /var/log/messages -- /usr/bin/abrt-dump-oops -xtD
4 S root 2336 1 0 80 0 - 28198 poll_s Mar28 ? 00:00:00 /usr/sbin/sshd -D
1 D root 8056 2 0 80 0 - 0 transp Mar28 ? 00:23:47 [iscsi_ttx]
1 D root 794529 2 0 80 0 - 0 transp Mar30 ? 00:00:00 [kworker/u65:0]
1 D root 823768 2 0 80 0 - 0 transp Mar31 ? 00:00:00 [kworker/u66:0]
1 D root 843864 2 0 80 0 - 0 transp Mar31 ? 00:00:00 [kworker/u65:2]
1 D root 863065 2 0 80 0 - 0 transp Mar31 ? 00:00:00 [kworker/u66:1]
1 D root 868161 2 0 80 0 - 0 transp Mar31 ? 00:00:00 [kworker/u65:1]
1 D root 868236 2 0 80 0 - 0 transp Mar31 ? 00:00:00 [kworker/u66:2]
1 D root 868238 2 0 80 0 - 0 transp Mar31 ? 00:00:00 [kworker/u65:3]
1 D root 868239 2 0 80 0 - 0 transp Mar31 ? 00:00:00 [kworker/u65:4]
1 D root 868248 2 0 80 0 - 0 transp Mar31 ? 00:00:00 [kworker/u66:3]
1 D root 868253 2 0 80 0 - 0 transp Mar31 ? 00:00:00 [kworker/u66:4]
1 D root 868255 2 0 80 0 - 0 transp Mar31 ? 00:00:00 [kworker/u66:5]
1 D root 868257 2 0 80 0 - 0 transp Mar31 ? 00:00:00 [kworker/u66:6]
1 D root 868260 2 0 80 0 - 0 transp Mar31 ? 00:00:00 [kworker/u66:7]
1 D root 868262 2 0 80 0 - 0 transp Mar31 ? 00:00:00 [kworker/u65:5]
1 D root 868263 2 0 80 0 - 0 transp Mar31 ? 00:00:00 [kworker/u66:8]
1 D root 868265 2 0 80 0 - 0 transp Mar31 ? 00:00:00 [kworker/u65:6]
1 D root 868268 2 0 80 0 - 0 transp Mar31 ? 00:00:00 [kworker/u65:7]
1 D root 868269 2 0 80 0 - 0 transp Mar31 ? 00:00:00 [kworker/u66:9]
1 D root 868275 2 0 80 0 - 0 transp Mar31 ? 00:00:00 [kworker/u65:8]
0 S root 927509 925511 0 80 0 - 28177 pipe_w 12:19 pts/1 00:00:00 grep --color=auto D
[root@osd10 log]#
[root@osd10 log]# ps -elf | grep kworker | wc-l
-bash: wc-l: command not found
[root@osd10 log]# ps -elf | grep kworker | wc -l
161
[root@osd10 log]#
[root@osd10 log]# cat /proc/794529/stack
[<ffffffffc04f2bda>] __transport_wait_for_tasks+0xba/0x1b0 [target_core_mod]
[<ffffffffc04f2db3>] transport_wait_for_tasks+0x53/0x90 [target_core_mod]
[<ffffffffc04eff30>] core_tmr_abort_task+0x100/0x180 [target_core_mod]
[<ffffffffc04f2f4a>] target_tmr_work+0x15a/0x170 [target_core_mod]
[<ffffffff9e4b2eff>] process_one_work+0x17f/0x440
[<ffffffff9e4b3bc6>] worker_thread+0x126/0x3c0
[<ffffffff9e4baf31>] kthread+0xd1/0xe0
[<ffffffff9eb1f637>] ret_from_fork_nospec_end+0x0/0x39
[<ffffffffffffffff>] 0xffffffffffffffff
[root@osd10 log]#

According to atop, they all entered D state at 18:34 on Mar 31.
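To capture the same information for every blocked task in one go, something like this can be run on the gateway node (a sketch; it just walks the process list for tasks in D state and prints their kernel stacks from /proc):

  for pid in $(ps -eo pid=,stat= | awk '$2 ~ /^D/ {print $1}'); do
      echo "== PID $pid =="
      cat /proc/$pid/stack
  done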
some of the esxi server reported Lost access error,such as 10.2.226.129: [root@C05-NMGJD-PA-ZYC-SV029-RH2288H-XNH:/vmfs/volumes/5aa2a258-d4765fd3-a8fd-883fd3d0366e/log] grep Lost hostd.2 2018-03-30T18:27:58.237Z info hostd[C181B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 5148 : Lost access to volume 5abc06ed-6ff4cbab-5dd7-a4c64f1dcfdb (CEPH-LUN08) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly. 2018-03-30T18:27:58.238Z info hostd[AC53B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 5149 : Lost access to volume 5abbff72-9da349d5-9d51-a4c64f1dcfdb (CEPH-LUN00) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly. ... Note that, the time from esxi side is about 5 minutes ahead of ceph cluster nodes. Besides, at 2018/03/30 20:19 ,iscsi_ttx, and several kworker threads went to D state, they kept D state for about 2 minutes, but two iscsi_ttx thread stuck in D forever: Note: the "tslpu" means the number of process/thread in D state. ATOP - osd10 2018/03/30 20:19:47 -------------- 15s elapsed PRC | sys 24.54s | user 33.42s | | | #proc 518 | #trun 1 | #tslpi 1340 | | #tslpu 0 | #zombie 0 | clones 12 | | | #exit 1 | CPU | sys 144% | user 212% | | irq 29% | | idle 2814% | wait 2% | | | steal 0% | guest 0% | curf 2.10GHz | | curscal 99% | CPL | avg1 3.59 | | avg5 6.26 | avg15 6.81 | | | | csw 1769808 | | intr 1961559 | | | numcpu 32 | | MEM | tot 251.4G | free 1.0G | cache 226.9G | dirty 4.3M | buff 76.1M | slab 14.9G | slrec 9.2G | shmem 4.0G | shrss 0.0M | shswp 0.0M | vmbal 0.0M | | hptot 0.0M | hpuse 0.0M | SWP | tot 32.0G | free 32.0G | | | | | | | | | | vmcom 15.8G | | vmlim 157.7G | LVM | rhel-var | busy 0% | | read 0 | write 65 | | KiB/r 0 | KiB/w 242 | MBr/s 0.0 | | MBw/s 1.0 | avq 7.40 | | avio 0.08 ms | DSK | sdj | busy 4% | | read 0 | write 10214 | | KiB/r 0 | KiB/w 44 | MBr/s 0.0 | | MBw/s 29.4 | avq 1.63 | | avio 0.06 ms | DSK | sdk | busy 4% | | read 0 | write 8250 | | KiB/r 0 | KiB/w 45 | MBr/s 0.0 | | MBw/s 24.3 | avq 1.70 | | avio 0.07 ms | DSK | sde | busy 1% | | read 0 | write 5501 | | KiB/r 0 | KiB/w 38 | MBr/s 0.0 | | MBw/s 13.9 | avq 44.60 | | avio 0.03 ms | NET | transport | tcpi 957160 | tcpo 2352978 | udpi 0 | | udpo 0 | tcpao 0 | tcppo 0 | tcprs 80 | tcpie 0 | tcpor 0 | | udpnp 0 | udpie 0 | NET | network | ipi 957159 | | ipo 886333 | ipfrw 0 | | deliv 957159 | | | | | icmpi 0 | | icmpo 0 | NET | enp130s 8% | pcki 857529 | pcko 1306151 | | sp 10 Gbps | si 581 Mbps | so 819 Mbps | | coll 0 | mlti 14 | erri 0 | erro 0 | drpi 0 | drpo 0 | NET | bond0 5% | pcki 1674473 | pcko 1806426 | | sp 20 Gbps | si 1097 Mbps | so 1011 Mbps | | coll 0 | mlti 30 | erri 0 | erro 0 | drpi 0 | drpo 0 | ATOP - osd10 2018/03/30 20:20:02 -------------- 15s elapsed PRC | sys 0.90s | user 0.57s | | | #proc 527 | #trun 2 | #tslpi 1337 | | #tslpu 6 | #zombie 0 | clones 22 | | | #exit 6 | CPU | sys 5% | user 4% | | irq 0% | | idle 3192% | wait 0% | | | steal 0% | guest 0% | curf 2.10GHz | | curscal 99% | CPL | avg1 3.78 | | avg5 6.16 | avg15 6.77 | | | | csw 70548 | | intr 96360 | | | numcpu 32 | | MEM | tot 251.4G | free 1.0G | cache 226.9G | dirty 2.5M | buff 76.1M | slab 14.9G | slrec 9.2G | shmem 4.0G | shrss 0.0M | shswp 0.0M | vmbal 0.0M | | hptot 0.0M | hpuse 0.0M | SWP | tot 32.0G | free 32.0G | | | | | | | | | | vmcom 15.8G | | vmlim 157.7G | LVM | rhel-var | busy 0% | | read 0 | write 2 | | KiB/r 0 | KiB/w 6 | MBr/s 0.0 | | MBw/s 0.0 | avq 0.00 | | avio 0.00 ms | DSK | sde | busy 2% | | 
read 30 | write 40 | | KiB/r 4 | KiB/w 7 | MBr/s 0.0 | | MBw/s 0.0 | avq 1.02 | | avio 4.39 ms | DSK | sdd | busy 0% | | read 0 | write 533 | | KiB/r 0 | KiB/w 84 | MBr/s 0.0 | | MBw/s 2.9 | avq 128.20 | | avio 0.03 ms | DSK | sdf | busy 0% | | read 0 | write 460 | | KiB/r 0 | KiB/w 55 | MBr/s 0.0 | | MBw/s 1.7 | avq 77.50 | | avio 0.03 ms | NET | transport | tcpi 18305 | tcpo 33885 | udpi 0 | | udpo 0 | tcpao 0 | tcppo 0 | tcprs 101 | tcpie 0 | tcpor 0 | | udpnp 0 | udpie 0 | NET | network | ipi 18306 | | ipo 18476 | ipfrw 0 | | deliv 18306 | | | | | icmpi 0 | | icmpo 0 | NET | enp130s 0% | pcki 6803 | pcko 16582 | | sp 10 Gbps | si 3238 Kbps | so 8782 Kbps | | coll 0 | mlti 14 | erri 0 | erro 0 | drpi 0 | drpo 0 | NET | bond0 0% | pcki 17157 | pcko 22881 | | sp 20 Gbps | si 6902 Kbps | so 11 Mbps | | coll 0 | mlti 28 | erri 0 | erro 0 | drpi 0 | drpo 0 | ATOP - osd10 2018/03/30 20:20:17 -------------- 15s elapsed PRC | sys 0.60s | user 0.52s | | | #proc 522 | #trun 2 | #tslpi 1318 | | #tslpu 9 | #zombie 0 | clones 16 | | | #exit 19 | CPU | sys 4% | user 3% | | irq 0% | | idle 3193% | wait 0% | | | steal 0% | guest 0% | curf 2.10GHz | | curscal 99% | CPL | avg1 4.58 | | avg5 6.22 | avg15 6.78 | | | | csw 51769 | | intr 71668 | | | numcpu 32 | | MEM | tot 251.4G | free 1.0G | cache 226.9G | dirty 1.0M | buff 76.1M | slab 14.9G | slrec 9.2G | shmem 4.0G | shrss 0.0M | shswp 0.0M | vmbal 0.0M | | hptot 0.0M | hpuse 0.0M | SWP | tot 32.0G | free 32.0G | | | | | | | | | | vmcom 15.8G | | vmlim 157.7G | LVM | rhel-var | busy 0% | | read 0 | write 10 | | KiB/r 0 | KiB/w 245 | MBr/s 0.0 | | MBw/s 0.2 | avq 0.00 | | avio 0.00 ms | DSK | sde | busy 3% | | read 4 | write 357 | | KiB/r 64 | KiB/w 77 | MBr/s 0.0 | | MBw/s 1.8 | avq 1.87 | | avio 1.37 ms | DSK | sdh | busy 1% | | read 8 | write 284 | | KiB/r 64 | KiB/w 82 | MBr/s 0.0 | | MBw/s 1.5 | avq 7.68 | | avio 0.41 ms | DSK | sdb | busy 1% | | read 5 | write 18 | | KiB/r 64 | KiB/w 25 | MBr/s 0.0 | | MBw/s 0.0 | avq 1.00 | | avio 3.87 ms | NET | transport | tcpi 11654 | tcpo 16054 | udpi 0 | | udpo 0 | tcpao 9 | tcppo 13 | tcprs 18 | tcpie 0 | tcpor 0 | | udpnp 0 | udpie 0 | NET | network | ipi 11654 | | ipo 11668 | ipfrw 0 | | deliv 11654 | | | | | icmpi 0 | | icmpo 0 | NET | enp129s 0% | pcki 4598 | pcko 4256 | | sp 10 Gbps | si 1944 Kbps | so 1691 Kbps | | coll 0 | mlti 16 | erri 0 | erro 0 | drpi 0 | drpo 0 | NET | bond0 0% | pcki 8238 | pcko 8411 | | sp 20 Gbps | si 3331 Kbps | so 3257 Kbps | | coll 0 | mlti 32 | erri 0 | erro 0 | drpi 0 | drpo 0 | ATOP - osd10 2018/03/30 20:21:02 -------------- 15s elapsed PRC | sys 1.33s | user 0.96s | | | #proc 505 | #trun 1 | #tslpi 1329 | | #tslpu 2 | #zombie 0 | clones 15 | | | #exit 0 | CPU | sys 9% | user 5% | | irq 0% | | idle 3187% | wait 0% | | | steal 0% | guest 0% | curf 2.10GHz | | curscal 99% | CPL | avg1 3.42 | | avg5 5.70 | avg15 6.58 | | | | csw 180520 | | intr 236565 | | | numcpu 32 | | MEM | tot 251.4G | free 1.0G | cache 226.9G | dirty 0.4M | buff 76.1M | slab 14.9G | slrec 9.2G | shmem 4.0G | shrss 0.0M | shswp 0.0M | vmbal 0.0M | | hptot 0.0M | hpuse 0.0M | SWP | tot 32.0G | free 32.0G | | | | | | | | | | vmcom 15.8G | | vmlim 157.7G | LVM | rhel-root | busy 0% | | read 0 | write 1 | | KiB/r 0 | KiB/w 4 | MBr/s 0.0 | | MBw/s 0.0 | avq 0.00 | | avio 0.00 ms | LVM | rhel-var | busy 0% | | read 0 | write 4 | | KiB/r 0 | KiB/w 65 | MBr/s 0.0 | | MBw/s 0.0 | avq 0.00 | | avio 0.00 ms | DSK | sdd | busy 0% | | read 0 | write 358 | | KiB/r 0 | KiB/w 18 | MBr/s 0.0 | | MBw/s 0.4 | avq 80.80 | 
| avio 0.01 ms | DSK | sdb | busy 0% | | read 0 | write 251 | | KiB/r 0 | KiB/w 22 | MBr/s 0.0 | | MBw/s 0.4 | avq 74.33 | | avio 0.01 ms | DSK | sdf | busy 0% | | read 0 | write 249 | | KiB/r 0 | KiB/w 24 | MBr/s 0.0 | | MBw/s 0.4 | avq 70.67 | | avio 0.01 ms | NET | transport | tcpi 26696 | tcpo 31156 | udpi 0 | | udpo 0 | tcpao 0 | tcppo 45 | tcprs 1 | tcpie 0 | tcpor 0 | | udpnp 0 | udpie 0 | NET | network | ipi 26696 | | ipo 26650 | ipfrw 0 | | deliv 26696 | | | | | icmpi 0 | | icmpo 0 | NET | enp130s 0% | pcki 8407 | pcko 18843 | | sp 10 Gbps | si 1759 Kbps | so 2527 Kbps | | coll 0 | mlti 14 | erri 0 | erro 0 | drpi 0 | drpo 0 | NET | enp129s 0% | pcki 14457 | pcko 4399 | | sp 10 Gbps | si 2266 Kbps | so 1702 Kbps | | coll 0 | mlti 15 | erri 0 | erro 0 | drpi 0 | drpo 0 | during which, there are many esxi server reported connection issue, such as: 141: [root@D04-NMGJD-PA-ZYC-SV041-RH2288H-XNH:/vmfs/volumes/5aa25cfe-f7d265b4-56aa-2c55d3428853/log] grep -i Lost hostd.1 2018-03-30T20:23:39.755Z info hostd[B185B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 1560 : Lost access to volume 5abc008e-3b508996-3e93-a4c64f1dcfdb (CEPH-LUN02) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly. --> eventTypeId = "esx.problem.storage.redundancy.lost", 2018-03-30T20:24:03.220Z info hostd[76AF890] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 1594 : Lost path redundancy to storage device naa.60014056b8554b392dd4b9c83c1343c5. Path vmhba64:C8:T0:L10 is down. Affected datastores: CEPH-LUN10. --> eventTypeId = "esx.problem.storage.redundancy.lost", 2018-03-30T20:24:03.223Z info hostd[A981B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 1595 : Lost path redundancy to storage device naa.60014056b8554b392dd4b9c83c1343c5. Path vmhba64:C7:T0:L10 is down. Affected datastores: CEPH-LUN10. --> eventTypeId = "esx.problem.storage.redundancy.lost", 2018-03-30T20:24:03.227Z info hostd[B185B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 1596 : Lost path redundancy to storage device naa.60014056b8554b392dd4b9c83c1343c5. Path vmhba64:C5:T0:L10 is down. Affected datastores: CEPH-LUN10. 137: [root@D04-NMGJD-PA-ZYC-SV037-RH2288H-XNH:/vmfs/volumes/5aa25cb0-5a64d290-3279-2c55d3428865/log] grep -i Lost hostd.1 2018-03-30T20:02:28.305Z info hostd[F4C2B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 9879 : Lost access to volume 5abc06ed-6ff4cbab-5dd7-a4c64f1dcfdb (CEPH-LUN08) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly. 2018-03-30T20:19:28.305Z info hostd[FBC2B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 9882 : Lost access to volume 5abc06ed-6ff4cbab-5dd7-a4c64f1dcfdb (CEPH-LUN08) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly. --> eventTypeId = "esx.problem.storage.redundancy.lost", 2018-03-30T20:19:58.054Z info hostd[F440B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 9883 : Lost path redundancy to storage device naa.6001405fc0cdc8fc75e4cd0b9f6beec9. Path vmhba64:C7:T0:L8 is down. Affected datastores: CEPH-LUN08. --> eventTypeId = "esx.problem.storage.redundancy.lost", 2018-03-30T20:22:35.054Z info hostd[F840B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 9886 : Lost path redundancy to storage device naa.6001405d8dc678c19e34e27af2807e6c. Path vmhba64:C5:T0:L5 is down. Affected datastores: CEPH-LUN05. 
--> eventTypeId = "esx.problem.storage.redundancy.lost", 2018-03-30T20:22:35.058Z info hostd[F440B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 9891 : Lost path redundancy to storage device naa.60014056b8554b392dd4b9c83c1343c5. Path vmhba64:C5:T0:L10 is down. Affected datastores: CEPH-LUN10. --> eventTypeId = "esx.problem.storage.redundancy.lost", 2018-03-30T20:22:35.061Z info hostd[F4C2B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 9892 : Lost path redundancy to storage device naa.6001405b0d5435acb9e4a11b3fcab332. Path vmhba64:C5:T0:L2 is down. Affected datastores: CEPH-LUN02. --> eventTypeId = "esx.problem.storage.redundancy.lost", 2018-03-30T20:22:35.064Z info hostd[F840B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 9893 : Lost path redundancy to storage device naa.6001405652ddf4079a04283ac927709e. Path vmhba64:C5:T0:L6 is down. Affected datastores: CEPH-LUN06. --> eventTypeId = "esx.problem.storage.redundancy.lost", I hope this can help you
(In reply to Xiubo Li from comment #43)
> @Mike Christie
> 
> We are discussing the following patch:
> 
> commit f0cfe030bd7ffff980b8194c2a73a5e1fec14ead
> Author: Zhang Zhuoyu <zhangzhuoyu.com>
> Date:   Fri Jan 19 16:44:08 2018 +0800
> 
>     handler: do not perform implicit transition if command not supported
> 
>     If unsupported command was handled by tcmur_cmd_handler, alua implicit
>     transition have a chance to cause image exclusive-lock threshing between
>     runners, so do not perform implicit transition if command not supported
> 
>     Signed-off-by: Xiubo Li <lixiubo.com>
>     Signed-off-by: Zhang Zhuoyu <zhangzhuoyu.com>
> 
> Do you remember this one ?

Yes.

> This is because we were testing the ceph iscsi gateway products and when we
> are pressure it by fio or dd, the /dev/sdX or /dev/mapper/mpathX will hit
> read-only issues.
> 
> The readonly issue maybe could lead the current bug. Do you have any more
> idea or suggestion about this ?

We should add the patch. We are seeing REPORT_SUPPORTED_OPERATION_CODES in the logs, so it could definitely cause the thrashing you describe, and I think we do see some lock flips when we shouldn't. I could see it leading to I/O slowdowns that could get us into the state where we are trying to disable TPGs.

However, other than that, I am not sure if or how it is related to runner getting stuck in the disable-tpg code, or how it is causing commands to hang. I'm not saying it will not cause that; I just do not see any way it could, but there could be a bug.
William,

Could you upload the /var/log/messages for that time period, and could you give me the output of:

dmesg -c; echo w >/proc/sysrq-trigger; dmesg -c >waiting-tasks.txt
(In reply to Mike Christie from comment #46)
> William
> 
> Could you upload the the /var/log/message for that time period and could you
> give me the output of
> 
> dmesg -c; echo w >/proc/sysrq-trigger; dmesg -c >waiting-tasks.txt

Ignore that request. I am logged into it.

I think you might be hitting bugs right now because the tcmu-runner process is showing as defunct, so the tmr hangs are expected as nothing can complete commands.

Looks like runner crashed or someone tried to kill it and it cannot restart.
(In reply to Mike Christie from comment #47)
> (In reply to Mike Christie from comment #46)
> > William
> > 
> > Could you upload the the /var/log/message for that time period and could you
> > give me the output of
> > 
> > dmesg -c; echo w >/proc/sysrq-trigger; dmesg -c >waiting-tasks.txt
> 
> Ignore that request. I am logged into it.
> 
> I think you might be hitting bugs right now because the tcmu-runner process
> is showing as defunct, so the tmr hangs are expected as nothing can complete
> commands.
> 
> Looks like runner crashed or someone tried to kill it and it cannot restart.

Yes, right, someone was trying to update the tcmu-runner daemon and then restart it, and it got stuck.
(In reply to Xiubo Li from comment #48)
> (In reply to Mike Christie from comment #47)
> > (In reply to Mike Christie from comment #46)
> > > William
> > > 
> > > Could you upload the the /var/log/message for that time period and could you
> > > give me the output of
> > > 
> > > dmesg -c; echo w >/proc/sysrq-trigger; dmesg -c >waiting-tasks.txt
> > 
> > Ignore that request. I am logged into it.
> > 
> > I think you might be hitting bugs right now because the tcmu-runner process
> > is showing as defunct, so the tmr hangs are expected as nothing can complete
> > commands.
> > 
> > Looks like runner crashed or someone tried to kill it and it cannot restart.
> 
> Yes, right, someone is trying to update the tcmu-runner daemon and then
> restart it, it stucks.

Were you or William trying to update it or restart it at that time or did it just crash?

Just wanted to confirm because a crash is a new bug.

Was it
(In reply to Mike Christie from comment #49)
> (In reply to Xiubo Li from comment #48)
> > (In reply to Mike Christie from comment #47)
> > > (In reply to Mike Christie from comment #46)
> > > > William
> > > > 
> > > > Could you upload the the /var/log/message for that time period and could you
> > > > give me the output of
> > > > 
> > > > dmesg -c; echo w >/proc/sysrq-trigger; dmesg -c >waiting-tasks.txt
> > > 
> > > Ignore that request. I am logged into it.
> > > 
> > > I think you might be hitting bugs right now because the tcmu-runner process
> > > is showing as defunct, so the tmr hangs are expected as nothing can complete
> > > commands.
> > > 
> > > Looks like runner crashed or someone tried to kill it and it cannot restart.
> > 
> > Yes, right, someone is trying to update the tcmu-runner daemon and then
> > restart it, it stucks.
> 
> Were you or William trying to update it or restart it at that time or did it
> just crash?
> 
> Just wanted to confirm because a crash is a new bug.
> 
> Was it

I think William was doing that.
It looks like osd10 had timeout issues starting at Mar 30 20:20:14. We hit the osd_op_timeout, so commands were taking longer than 30 seconds. During this time we also see a lot of OSD/Ceph heartbeat errors in /var/log/messages.

runner tried to disable the tpg and we then hit this bug:

https://bugzilla.redhat.com/show_bug.cgi?id=1561851

where runner gets stuck trying to disable the tpg. Then it looks like runner was killed, so SCSI commands are timing out, and we already know the issue for that.
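For anyone reproducing this, the RADOS-side symptoms that feed into osd_op_timeout can usually be seen with the standard cluster health commands while the stall is happening (a sketch; run from any monitor/admin node):

  ceph -s               # overall health, look for slow requests and down OSDs
  ceph health detail    # names the OSDs with slow/blocked requests
  ceph osd perf         # per-OSD commit/apply latencies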
(In reply to Mike Christie from comment #47)
> (In reply to Mike Christie from comment #46)
> > William
> > 
> > Could you upload the the /var/log/message for that time period and could you
> > give me the output of
> > 
> > dmesg -c; echo w >/proc/sysrq-trigger; dmesg -c >waiting-tasks.txt
> 
> Ignore that request. I am logged into it.
> 
> I think you might be hitting bugs right now because the tcmu-runner process
> is showing as defunct, so the tmr hangs are expected as nothing can complete
> commands.
> 
> Looks like runner crashed or someone tried to kill it and it cannot restart.

Mike,

[root@osd10 log]# cat /proc/794529/stack
[<ffffffffc04f2bda>] __transport_wait_for_tasks+0xba/0x1b0 [target_core_mod]
[<ffffffffc04f2db3>] transport_wait_for_tasks+0x53/0x90 [target_core_mod]
[<ffffffffc04eff30>] core_tmr_abort_task+0x100/0x180 [target_core_mod]
[<ffffffffc04f2f4a>] target_tmr_work+0x15a/0x170 [target_core_mod]
[<ffffffff9e4b2eff>] process_one_work+0x17f/0x440
[<ffffffff9e4b3bc6>] worker_thread+0x126/0x3c0
[<ffffffff9e4baf31>] kthread+0xd1/0xe0
[<ffffffff9eb1f637>] ret_from_fork_nospec_end+0x0/0x39
[<ffffffffffffffff>] 0xffffffffffffffff
[root@osd10 log]#

(In reply to Mike Christie from comment #49)
> (In reply to Xiubo Li from comment #48)
> > (In reply to Mike Christie from comment #47)
> > > (In reply to Mike Christie from comment #46)
> > > > William
> > > > 
> > > > Could you upload the the /var/log/message for that time period and could you
> > > > give me the output of
> > > > 
> > > > dmesg -c; echo w >/proc/sysrq-trigger; dmesg -c >waiting-tasks.txt
> > > 
> > > Ignore that request. I am logged into it.
> > > 
> > > I think you might be hitting bugs right now because the tcmu-runner process
> > > is showing as defunct, so the tmr hangs are expected as nothing can complete
> > > commands.
> > > 
> > > Looks like runner crashed or someone tried to kill it and it cannot restart.
> > 
> > Yes, right, someone is trying to update the tcmu-runner daemon and then
> > restart it, it stucks.
> 
> Were you or William trying to update it or restart it at that time or did it
> just crash?
> 
> Just wanted to confirm because a crash is a new bug.
> 
> Was it

No, the system has been running continuously since 28 Mar; I did not reboot it:

[root@osd10 log]# uptime
 12:19:08 up 3 days, 15:33, 4 users, load average: 21.00, 21.10, 21.20

and no vmcore file was found under /var/crash/. I also just noticed that the tcmu-runner process became defunct after many kworker threads got stuck on Mar 31 at 18:34. You can check the atop history under /var/log/atop:

# atop -r /var/log/atop/atop_20180331
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1563