This bug was hit while regression-testing https://bugzilla.redhat.com/show_bug.cgi?id=1596369, which is fixed in kernel-3.10.0-862.9.1.el7.

Description of problem:
========================
We have a 3 node CNS cluster. BZ#1596369 is fixed in kernel-3.10.0-862.9.1.el7. Hence, with the current kernel version being RHEL 7.5.3 (beta) = 3.10.0-862.11.2.el7.x86_64, we started executing the steps for the block negative TC - "Introduce tcmu-runner failure when block device creation and deletion is in progress".

With block device creation + deletion running in parallel, we killed the tcmu-runner process (which in turn killed gluster-blockd) on 2 pods - "glusterfs-storage-4z756" and "glusterfs-storage-5ml9q". All creates and deletes went into the queue as expected.

We tried restoring the tcmu-runner service on both nodes, but it failed to start with the following message:

**** Failed to start LIO Userspace-passthrough daemon. ****

sh-4.2# systemctl status tcmu-runner.service
● tcmu-runner.service - LIO Userspace-passthrough daemon
   Loaded: loaded (/usr/lib/systemd/system/tcmu-runner.service; static; vendor preset: disabled)
   Active: failed (Result: timeout) since Mon 2018-07-30 17:53:06 UTC; 13min ago
  Process: 2730 ExecStartPre=/usr/libexec/gluster-block/wait-for-bricks.sh 120 (code=exited, status=1/FAILURE)
 Main PID: 2734
   CGroup: /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod2fc19839_91b7_11e8_b856_005056a525c4.slice/docker-803808c27487956deddabf221dfc63145aeafbde5461550681d197bdb4d1490d.scope/system.slice/tcmu-runner.service
           └─2734 /usr/bin/tcmu-runner --tcmu-log-dir /var/log/glusterfs/gluster-block

Jul 30 17:40:06 dhcp46-159.lab.eng.blr.redhat.com systemd[1]: Starting LIO Userspace-passthrough daemon...
Jul 30 17:50:06 dhcp46-159.lab.eng.blr.redhat.com systemd[1]: tcmu-runner.service start operation timed out. Terminating.
Jul 30 17:51:36 dhcp46-159.lab.eng.blr.redhat.com systemd[1]: tcmu-runner.service stop-final-sigterm timed out. Killing.
Jul 30 17:53:06 dhcp46-159.lab.eng.blr.redhat.com systemd[1]: tcmu-runner.service still around after final SIGKILL. Entering failed mode.
Jul 30 17:53:06 dhcp46-159.lab.eng.blr.redhat.com systemd[1]: Failed to start LIO Userspace-passthrough daemon.
Jul 30 17:53:06 dhcp46-159.lab.eng.blr.redhat.com systemd[1]: Unit tcmu-runner.service entered failed state.
Jul 30 17:53:06 dhcp46-159.lab.eng.blr.redhat.com systemd[1]: tcmu-runner.service failed.

Also, we see the following process stacks in the 2 glusterfs pods:
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

glusterfs-storage-4z756
----------------------------
sh-4.2# ps -aux|grep Ds
root      2489  0.0  0.0  122668 11048 ?       Ds   17:32   0:00 /usr/bin/python /usr/bin/targetctl clear
root      2841  0.1  0.0 1259580 20052 ?       Dsl  17:40   0:00 /usr/bin/tcmu-runner --tcmu-log-dir /var/log/glusterfs/gluster-block
root      2968  0.0  0.0    9088   664 pts/0   S+   17:42   0:00 grep Ds

sh-4.2# cat /proc/2489/stack
[<ffffffffc025325e>] tcmu_netlink_event+0x2ae/0x3e0 [target_core_user]
[<ffffffffc025348e>] tcmu_destroy_device+0x4e/0x190 [target_core_user]
[<ffffffffc0205004>] target_free_device+0xb4/0x110 [target_core_mod]
[<ffffffffc0200135>] target_core_dev_release+0x15/0x20 [target_core_mod]
[<ffffffff842ad72a>] config_item_release+0x6a/0xf0
[<ffffffff842ad7dc>] config_item_put+0x2c/0x30
[<ffffffff842aba0b>] configfs_rmdir+0x1eb/0x310
[<ffffffff8422da4c>] vfs_rmdir+0xdc/0x150
[<ffffffff84230fd1>] do_rmdir+0x1f1/0x220
[<ffffffff84232206>] SyS_rmdir+0x16/0x20
[<ffffffff8472579b>] system_call_fastpath+0x22/0x27
[<ffffffffffffffff>] 0xffffffffffffffff

sh-4.2# cat /proc/2841/stack
[<ffffffff8471079e>] lookup_slow+0x33/0xa7
[<ffffffff8422e6e8>] path_lookupat+0x838/0x8b0
[<ffffffff8422e78b>] filename_lookup+0x2b/0xc0
[<ffffffff84231bc7>] user_path_at_empty+0x67/0xc0
[<ffffffff84231c31>] user_path_at+0x11/0x20
[<ffffffff8421d4b2>] SyS_faccessat+0xb2/0x230
[<ffffffff8421d648>] SyS_access+0x18/0x20
[<ffffffff8472579b>] system_call_fastpath+0x22/0x27
[<ffffffffffffffff>] 0xffffffffffffffff
sh-4.2#

glusterfs-storage-5ml9q
------------------------------
sh-4.2# ps -aux|grep Ds
root      2399  0.0  0.0  122668 11044 ?       Ds   17:32   0:00 /usr/bin/python /usr/bin/targetctl clear
root      2734  0.0  0.0 1259580 20284 ?       Dsl  17:40   0:00 /usr/bin/tcmu-runner --tcmu-log-dir /var/log/glusterfs/gluster-block
root      2907  0.0  0.0    9088   660 pts/1   S+   17:42   0:00 grep Ds

sh-4.2# cat /proc/2399/stack
[<ffffffffc015025e>] tcmu_netlink_event+0x2ae/0x3e0 [target_core_user]
[<ffffffffc015048e>] tcmu_destroy_device+0x4e/0x190 [target_core_user]
[<ffffffffc0102004>] target_free_device+0xb4/0x110 [target_core_mod]
[<ffffffffc00fd135>] target_core_dev_release+0x15/0x20 [target_core_mod]
[<ffffffffb92ad72a>] config_item_release+0x6a/0xf0
[<ffffffffb92ad7dc>] config_item_put+0x2c/0x30
[<ffffffffb92aba0b>] configfs_rmdir+0x1eb/0x310
[<ffffffffb922da4c>] vfs_rmdir+0xdc/0x150
[<ffffffffb9230fd1>] do_rmdir+0x1f1/0x220
[<ffffffffb9232206>] SyS_rmdir+0x16/0x20
[<ffffffffb972579b>] system_call_fastpath+0x22/0x27
[<ffffffffffffffff>] 0xffffffffffffffff

sh-4.2# cat /proc/2734/stack
[<ffffffffb971079e>] lookup_slow+0x33/0xa7
[<ffffffffb922e6e8>] path_lookupat+0x838/0x8b0
[<ffffffffb922e78b>] filename_lookup+0x2b/0xc0
[<ffffffffb9231bc7>] user_path_at_empty+0x67/0xc0
[<ffffffffb9231c31>] user_path_at+0x11/0x20
[<ffffffffb921d4b2>] SyS_faccessat+0xb2/0x230
[<ffffffffb921d648>] SyS_access+0x18/0x20
[<ffffffffb972579b>] system_call_fastpath+0x22/0x27
[<ffffffffffffffff>] 0xffffffffffffffff
sh-4.2#
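For reference, the hung processes above were found by grepping the ps STAT column for "Ds". A more direct way to enumerate every uninterruptible-sleep (D-state) task and dump its kernel stack in one pass is sketched below; this is a generic diagnostic snippet, not output captured from the affected pods:

# enumerate tasks in uninterruptible sleep (state D, e.g. Ds/Dsl) and
# dump the kernel stack of each one (reading /proc/<pid>/stack needs root)
for pid in $(ps -eo pid,stat | awk '$2 ~ /^D/ {print $1}'); do
    echo "=== PID $pid ($(cat /proc/$pid/comm)) ==="
    cat /proc/$pid/stack
done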
The block-related services are in inactive/failed state. All service statuses were copied into a file - service_status_after_tcmu_runner_failure
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

[root@dhcp46-146 ~]# for i in `oc get pods -o wide| grep glusterfs|cut -d " " -f1` ; do echo $i; echo +++++++++++++++++++++++; oc exec $i -- systemctl is-active gluster-blockd; done
glusterfs-storage-4z756
+++++++++++++++++++++++
inactive
command terminated with exit code 3
glusterfs-storage-5ml9q
+++++++++++++++++++++++
inactive
command terminated with exit code 3
glusterfs-storage-xdd86
+++++++++++++++++++++++
active
[root@dhcp46-146 ~]#

[root@dhcp46-146 ~]# for i in `oc get pods -o wide| grep glusterfs|cut -d " " -f1` ; do echo $i; echo +++++++++++++++++++++++; oc exec $i -- systemctl is-active gluster-block-target.service; done
glusterfs-storage-4z756
+++++++++++++++++++++++
failed
command terminated with exit code 3
glusterfs-storage-5ml9q
+++++++++++++++++++++++
failed
command terminated with exit code 3
glusterfs-storage-xdd86
+++++++++++++++++++++++
active
[root@dhcp46-146 ~]#

[root@dhcp46-146 ~]# for i in `oc get pods -o wide| grep glusterfs|cut -d " " -f1` ; do echo $i; echo +++++++++++++++++++++++; oc exec $i -- systemctl is-active tcmu-runner; done
glusterfs-storage-4z756
+++++++++++++++++++++++
failed
command terminated with exit code 3
glusterfs-storage-5ml9q
+++++++++++++++++++++++
failed
command terminated with exit code 3
glusterfs-storage-xdd86
+++++++++++++++++++++++
active
++++++++++++++++++++++++++++++++++++++++++++++++++

Version-Release number of selected component (if applicable):
==========================================

Kernel version
################################
# for i in `oc get pods -o wide| grep glusterfs|cut -d " " -f1` ; do echo $i; echo +++++++++++++++++++++++; oc exec $i -- uname -a; done
glusterfs-storage-4z756
+++++++++++++++++++++++
Linux dhcp47-150.lab.eng.blr.redhat.com 3.10.0-862.11.2.el7.x86_64 #1 SMP Thu Jul 26 13:42:22 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
glusterfs-storage-5ml9q
+++++++++++++++++++++++
Linux dhcp46-159.lab.eng.blr.redhat.com 3.10.0-862.11.2.el7.x86_64 #1 SMP Thu Jul 26 13:42:22 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
glusterfs-storage-xdd86
+++++++++++++++++++++++
Linux dhcp46-244.lab.eng.blr.redhat.com 3.10.0-862.11.2.el7.x86_64 #1 SMP Thu Jul 26 13:42:22 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux

Gluster-block related packages
################################
gluster-block-0.2.1-23.el7rhgs.x86_64
tcmu-runner-1.2.0-23.el7rhgs.x86_64
python-configshell-1.1.fb23-4.el7_5.noarch
targetcli-2.1.fb46-6.el7_5.noarch

Gluster versions
################################
[root@dhcp46-146 ~]# oc rsh glusterfs-storage-4z756 rpm -qa|grep gluster
glusterfs-client-xlators-3.8.4-54.15.el7rhgs.x86_64
glusterfs-fuse-3.8.4-54.15.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-54.15.el7rhgs.x86_64
glusterfs-libs-3.8.4-54.15.el7rhgs.x86_64
glusterfs-3.8.4-54.15.el7rhgs.x86_64
glusterfs-api-3.8.4-54.15.el7rhgs.x86_64
glusterfs-cli-3.8.4-54.15.el7rhgs.x86_64
glusterfs-server-3.8.4-54.15.el7rhgs.x86_64
gluster-block-0.2.1-23.el7rhgs.x86_64

heketi version
################################
# oc rsh heketi-storage-1-zmdkd rpm -qa|grep heketi
python-heketi-7.0.0-5.el7rhgs.x86_64
heketi-client-7.0.0-5.el7rhgs.x86_64
heketi-7.0.0-5.el7rhgs.x86_64

oc version
################################
[root@dhcp46-146 ~]# oc version
oc v3.10.14
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://dhcp46-146.lab.eng.blr.redhat.com:8443
openshift v3.10.14
kubernetes v1.10.0+b81c8f8
[root@dhcp46-146 ~]#
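As an aside, the three per-service "systemctl is-active" loops earlier in this report can be collapsed into a single pass over the pods; this is a minimal sketch reusing the same pod-selection pattern, not output captured from this cluster:

for pod in `oc get pods -o wide| grep glusterfs|cut -d " " -f1`; do
    echo "$pod"
    for svc in gluster-blockd gluster-block-target.service tcmu-runner; do
        # is-active prints the unit state and exits non-zero when not active,
        # which is why oc reports "command terminated with exit code 3"
        echo -n "  $svc: "
        oc exec "$pod" -- systemctl is-active "$svc"
    done
done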
How reproducible:
============================
1x1 - Attempted once on the 7.5.3 kernel (3.10.0-862.11.2.el7.x86_64)

Steps to Reproduce:
+++++++++++++++++++
1. Create a CNS setup with gluster nodes=3.
2. Create 50 PVCs and confirm they are in Bound state.
3. Start a loop to create (new PVCs) and delete (existing PVCs) in parallel (a sketch of such a loop is given after the expected results below).
4. rsh to 2 glusterfs pods and kill the tcmu-runner process one by one, e.g.

# oc rsh glusterfs-storage-4z756
sh-4.2# ps -ef|grep tcmu
root       471     1  0 17:07 ?        00:00:00 /usr/bin/tcmu-runner --tcmu-log-dir /var/log/glusterfs/gluster-block
sh-4.2# kill -9 471

# oc rsh glusterfs-storage-5ml9q
sh-4.2# ps -ef|grep tcmu
root       461     1  0 17:07 ?        00:00:00 /usr/bin/tcmu-runner --tcmu-log-dir /var/log/glusterfs/gluster-block
root      2182  2054  0 17:28 pts/0    00:00:00 grep tcmu
sh-4.2# kill -9 461

5. Start the tcmu-runner service again:

sh-4.2# systemctl restart tcmu-runner
Job for tcmu-runner.service failed because a timeout was exceeded.
See "systemctl status tcmu-runner.service" and "journalctl -xe" for details.
sh-4.2# systemctl restart tcmu-runner
Job for tcmu-runner.service failed because a timeout was exceeded.
See "systemctl status tcmu-runner.service" and "journalctl -xe" for details.

Actual results:
+++++++++++++++++++
The process stacks show that "/usr/bin/python /usr/bin/targetctl clear" is stuck in Ds state and "/usr/bin/tcmu-runner --tcmu-log-dir /var/log/glusterfs/gluster-block" in Dsl state. Hence, the tcmu-runner service fails to start.

Expected results:
++++++++++++++++++
With kernel version 3.10.0-862.11.2.el7.x86_64, tcmu-runner should not fail in this way when it is restarted.
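For reference, the parallel create/delete load in step 3 can be driven by a loop like the one below. This is a minimal sketch only: the manifest name pvc-template.yaml, its NAME placeholder, and the pvc-block-N naming scheme are illustrative assumptions, not the exact script used during the test.

# churn loop: keep creating new PVCs while deleting previously created ones
# (pvc-template.yaml is assumed to hold a PVC spec for the gluster-block
#  storage class, with NAME as a placeholder for the PVC name)
i=0
while true; do
    i=$((i + 1))
    # create a new PVC with a unique name, in the background
    sed "s/NAME/pvc-block-$i/" pvc-template.yaml | oc create -f - &
    # in parallel, delete one of the PVCs created earlier
    if [ "$i" -gt 25 ]; then
        oc delete pvc "pvc-block-$((i - 25))" &
    fi
    sleep 2
done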
Thanks for reporting the bug. We are already tracking this at BZ#1476730.

*** This bug has been marked as a duplicate of bug 1476730 ***