This bug was hit while regression-testing https://bugzilla.redhat.com/show_bug.cgi?id=1596369, which is fixed in kernel-3.10.0-862.9.1.el7.

Description of problem:
========================
We have a 3 node CNS cluster. BZ#1596369 is fixed in kernel-3.10.0-862.9.1.el7. Hence, with the current kernel version being RHEL 7.5.3 (beta) = 3.10.0-862.11.2.el7.x86_64, we started executing the steps for the block negative TC - "Introduce tcmu-runner failure when block device creation and deletion is in progress".

With block device creation + deletion running in parallel, we killed the tcmu-runner process (which in turn killed gluster-blockd) on 2 pods - "glusterfs-storage-4z756" and "glusterfs-storage-5ml9q". All creates and deletes went into the queue as expected.

We tried restoring the tcmu-runner service on both nodes, but it failed to start with the following message:

**** Failed to start LIO Userspace-passthrough daemon. ****

sh-4.2# systemctl status tcmu-runner.service
● tcmu-runner.service - LIO Userspace-passthrough daemon
   Loaded: loaded (/usr/lib/systemd/system/tcmu-runner.service; static; vendor preset: disabled)
   Active: failed (Result: timeout) since Mon 2018-07-30 17:53:06 UTC; 13min ago
  Process: 2730 ExecStartPre=/usr/libexec/gluster-block/wait-for-bricks.sh 120 (code=exited, status=1/FAILURE)
 Main PID: 2734
   CGroup: /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod2fc19839_91b7_11e8_b856_005056a525c4.slice/docker-803808c27487956deddabf221dfc63145aeafbde5461550681d197bdb4d1490d.scope/system.slice/tcmu-runner.service
           └─2734 /usr/bin/tcmu-runner --tcmu-log-dir /var/log/glusterfs/gluster-block

Jul 30 17:40:06 dhcp46-159.lab.eng.blr.redhat.com systemd[1]: Starting LIO Userspace-passthrough daemon...
Jul 30 17:50:06 dhcp46-159.lab.eng.blr.redhat.com systemd[1]: tcmu-runner.service start operation timed out. Terminating.
Jul 30 17:51:36 dhcp46-159.lab.eng.blr.redhat.com systemd[1]: tcmu-runner.service stop-final-sigterm timed out. Killing.
Jul 30 17:53:06 dhcp46-159.lab.eng.blr.redhat.com systemd[1]: tcmu-runner.service still around after final SIGKILL. Entering failed mode.
Jul 30 17:53:06 dhcp46-159.lab.eng.blr.redhat.com systemd[1]: Failed to start LIO Userspace-passthrough daemon.
Jul 30 17:53:06 dhcp46-159.lab.eng.blr.redhat.com systemd[1]: Unit tcmu-runner.service entered failed state.
Jul 30 17:53:06 dhcp46-159.lab.eng.blr.redhat.com systemd[1]: tcmu-runner.service failed.

Also, we see the following process stacks in the 2 glusterfs pods:
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

glusterfs-storage-4z756
----------------------------
sh-4.2# ps -aux|grep Ds
root      2489  0.0  0.0  122668 11048 ?       Ds   17:32   0:00 /usr/bin/python /usr/bin/targetctl clear
root      2841  0.1  0.0 1259580 20052 ?       Dsl  17:40   0:00 /usr/bin/tcmu-runner --tcmu-log-dir /var/log/glusterfs/gluster-block
root      2968  0.0  0.0    9088   664 pts/0   S+   17:42   0:00 grep Ds

sh-4.2# cat /proc/2489/stack
[<ffffffffc025325e>] tcmu_netlink_event+0x2ae/0x3e0 [target_core_user]
[<ffffffffc025348e>] tcmu_destroy_device+0x4e/0x190 [target_core_user]
[<ffffffffc0205004>] target_free_device+0xb4/0x110 [target_core_mod]
[<ffffffffc0200135>] target_core_dev_release+0x15/0x20 [target_core_mod]
[<ffffffff842ad72a>] config_item_release+0x6a/0xf0
[<ffffffff842ad7dc>] config_item_put+0x2c/0x30
[<ffffffff842aba0b>] configfs_rmdir+0x1eb/0x310
[<ffffffff8422da4c>] vfs_rmdir+0xdc/0x150
[<ffffffff84230fd1>] do_rmdir+0x1f1/0x220
[<ffffffff84232206>] SyS_rmdir+0x16/0x20
[<ffffffff8472579b>] system_call_fastpath+0x22/0x27
[<ffffffffffffffff>] 0xffffffffffffffff

sh-4.2# cat /proc/2841/stack
[<ffffffff8471079e>] lookup_slow+0x33/0xa7
[<ffffffff8422e6e8>] path_lookupat+0x838/0x8b0
[<ffffffff8422e78b>] filename_lookup+0x2b/0xc0
[<ffffffff84231bc7>] user_path_at_empty+0x67/0xc0
[<ffffffff84231c31>] user_path_at+0x11/0x20
[<ffffffff8421d4b2>] SyS_faccessat+0xb2/0x230
[<ffffffff8421d648>] SyS_access+0x18/0x20
[<ffffffff8472579b>] system_call_fastpath+0x22/0x27
[<ffffffffffffffff>] 0xffffffffffffffff
sh-4.2#

glusterfs-storage-5ml9q
------------------------------
sh-4.2# ps -aux|grep Ds
root      2399  0.0  0.0  122668 11044 ?       Ds   17:32   0:00 /usr/bin/python /usr/bin/targetctl clear
root      2734  0.0  0.0 1259580 20284 ?       Dsl  17:40   0:00 /usr/bin/tcmu-runner --tcmu-log-dir /var/log/glusterfs/gluster-block
root      2907  0.0  0.0    9088   660 pts/1   S+   17:42   0:00 grep Ds

sh-4.2# cat /proc/2399/stack
[<ffffffffc015025e>] tcmu_netlink_event+0x2ae/0x3e0 [target_core_user]
[<ffffffffc015048e>] tcmu_destroy_device+0x4e/0x190 [target_core_user]
[<ffffffffc0102004>] target_free_device+0xb4/0x110 [target_core_mod]
[<ffffffffc00fd135>] target_core_dev_release+0x15/0x20 [target_core_mod]
[<ffffffffb92ad72a>] config_item_release+0x6a/0xf0
[<ffffffffb92ad7dc>] config_item_put+0x2c/0x30
[<ffffffffb92aba0b>] configfs_rmdir+0x1eb/0x310
[<ffffffffb922da4c>] vfs_rmdir+0xdc/0x150
[<ffffffffb9230fd1>] do_rmdir+0x1f1/0x220
[<ffffffffb9232206>] SyS_rmdir+0x16/0x20
[<ffffffffb972579b>] system_call_fastpath+0x22/0x27
[<ffffffffffffffff>] 0xffffffffffffffff

sh-4.2# cat /proc/2734/stack
[<ffffffffb971079e>] lookup_slow+0x33/0xa7
[<ffffffffb922e6e8>] path_lookupat+0x838/0x8b0
[<ffffffffb922e78b>] filename_lookup+0x2b/0xc0
[<ffffffffb9231bc7>] user_path_at_empty+0x67/0xc0
[<ffffffffb9231c31>] user_path_at+0x11/0x20
[<ffffffffb921d4b2>] SyS_faccessat+0xb2/0x230
[<ffffffffb921d648>] SyS_access+0x18/0x20
[<ffffffffb972579b>] system_call_fastpath+0x22/0x27
[<ffffffffffffffff>] 0xffffffffffffffff
sh-4.2#
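For reference, the hung processes above were found by grepping the ps STAT column for "Ds". A more direct way to enumerate every uninterruptible-sleep (D-state) task and dump its kernel stack in one pass is sketched below; this is a generic diagnostic snippet, not output captured from the affected pods:

# enumerate tasks in uninterruptible sleep (state D, e.g. Ds/Dsl) and
# dump the kernel stack of each one (reading /proc/<pid>/stack needs root)
for pid in $(ps -eo pid,stat | awk '$2 ~ /^D/ {print $1}'); do
    echo "=== PID $pid ($(cat /proc/$pid/comm)) ==="
    cat /proc/$pid/stack
done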
The block-related services are in inactive/failed state. All service statuses were copied into a file - service_status_after_tcmu_runner_failure
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

[root@dhcp46-146 ~]# for i in `oc get pods -o wide| grep glusterfs|cut -d " " -f1` ; do echo $i; echo +++++++++++++++++++++++; oc exec $i -- systemctl is-active gluster-blockd; done
glusterfs-storage-4z756
+++++++++++++++++++++++
inactive
command terminated with exit code 3
glusterfs-storage-5ml9q
+++++++++++++++++++++++
inactive
command terminated with exit code 3
glusterfs-storage-xdd86
+++++++++++++++++++++++
active
[root@dhcp46-146 ~]#

[root@dhcp46-146 ~]# for i in `oc get pods -o wide| grep glusterfs|cut -d " " -f1` ; do echo $i; echo +++++++++++++++++++++++; oc exec $i -- systemctl is-active gluster-block-target.service; done
glusterfs-storage-4z756
+++++++++++++++++++++++
failed
command terminated with exit code 3
glusterfs-storage-5ml9q
+++++++++++++++++++++++
failed
command terminated with exit code 3
glusterfs-storage-xdd86
+++++++++++++++++++++++
active
[root@dhcp46-146 ~]#

[root@dhcp46-146 ~]# for i in `oc get pods -o wide| grep glusterfs|cut -d " " -f1` ; do echo $i; echo +++++++++++++++++++++++; oc exec $i -- systemctl is-active tcmu-runner; done
glusterfs-storage-4z756
+++++++++++++++++++++++
failed
command terminated with exit code 3
glusterfs-storage-5ml9q
+++++++++++++++++++++++
failed
command terminated with exit code 3
glusterfs-storage-xdd86
+++++++++++++++++++++++
active
++++++++++++++++++++++++++++++++++++++++++++++++++

Version-Release number of selected component (if applicable):
==========================================

Kernel version
################################
# for i in `oc get pods -o wide| grep glusterfs|cut -d " " -f1` ; do echo $i; echo +++++++++++++++++++++++; oc exec $i -- uname -a; done
glusterfs-storage-4z756
+++++++++++++++++++++++
Linux dhcp47-150.lab.eng.blr.redhat.com 3.10.0-862.11.2.el7.x86_64 #1 SMP Thu Jul 26 13:42:22 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
glusterfs-storage-5ml9q
+++++++++++++++++++++++
Linux dhcp46-159.lab.eng.blr.redhat.com 3.10.0-862.11.2.el7.x86_64 #1 SMP Thu Jul 26 13:42:22 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
glusterfs-storage-xdd86
+++++++++++++++++++++++
Linux dhcp46-244.lab.eng.blr.redhat.com 3.10.0-862.11.2.el7.x86_64 #1 SMP Thu Jul 26 13:42:22 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux

Gluster-block related packages
################################
gluster-block-0.2.1-23.el7rhgs.x86_64
tcmu-runner-1.2.0-23.el7rhgs.x86_64
python-configshell-1.1.fb23-4.el7_5.noarch
targetcli-2.1.fb46-6.el7_5.noarch

Gluster versions
################################
[root@dhcp46-146 ~]# oc rsh glusterfs-storage-4z756 rpm -qa|grep gluster
glusterfs-client-xlators-3.8.4-54.15.el7rhgs.x86_64
glusterfs-fuse-3.8.4-54.15.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-54.15.el7rhgs.x86_64
glusterfs-libs-3.8.4-54.15.el7rhgs.x86_64
glusterfs-3.8.4-54.15.el7rhgs.x86_64
glusterfs-api-3.8.4-54.15.el7rhgs.x86_64
glusterfs-cli-3.8.4-54.15.el7rhgs.x86_64
glusterfs-server-3.8.4-54.15.el7rhgs.x86_64
gluster-block-0.2.1-23.el7rhgs.x86_64

heketi version
################################
# oc rsh heketi-storage-1-zmdkd rpm -qa|grep heketi
python-heketi-7.0.0-5.el7rhgs.x86_64
heketi-client-7.0.0-5.el7rhgs.x86_64
heketi-7.0.0-5.el7rhgs.x86_64

oc version
################################
[root@dhcp46-146 ~]# oc version
oc v3.10.14
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://dhcp46-146.lab.eng.blr.redhat.com:8443
openshift v3.10.14
kubernetes v1.10.0+b81c8f8
[root@dhcp46-146 ~]#
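As an aside, the three per-service "systemctl is-active" loops earlier in this report can be collapsed into a single pass over the pods; this is a minimal sketch reusing the same pod-selection pattern, not output captured from this cluster:

for pod in `oc get pods -o wide| grep glusterfs|cut -d " " -f1`; do
    echo "$pod"
    for svc in gluster-blockd gluster-block-target.service tcmu-runner; do
        # is-active prints the unit state and exits non-zero when not active,
        # which is why oc reports "command terminated with exit code 3"
        echo -n "  $svc: "
        oc exec "$pod" -- systemctl is-active "$svc"
    done
done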
How reproducible:
============================
1x1 - Attempted once on the 7.5.3 kernel (3.10.0-862.11.2.el7.x86_64)

Steps to Reproduce:
+++++++++++++++++++
1. Create a CNS setup with gluster nodes=3.
2. Create 50 PVCs and confirm they are in Bound state.
3. Start a loop to create (new PVCs) and delete (existing PVCs) in parallel (a sketch of such a loop is given after the expected results below).
4. rsh to 2 glusterfs pods and kill the tcmu-runner process one by one, e.g.

# oc rsh glusterfs-storage-4z756
sh-4.2# ps -ef|grep tcmu
root       471     1  0 17:07 ?        00:00:00 /usr/bin/tcmu-runner --tcmu-log-dir /var/log/glusterfs/gluster-block
sh-4.2# kill -9 471

# oc rsh glusterfs-storage-5ml9q
sh-4.2# ps -ef|grep tcmu
root       461     1  0 17:07 ?        00:00:00 /usr/bin/tcmu-runner --tcmu-log-dir /var/log/glusterfs/gluster-block
root      2182  2054  0 17:28 pts/0    00:00:00 grep tcmu
sh-4.2# kill -9 461

5. Start the tcmu-runner service again:

sh-4.2# systemctl restart tcmu-runner
Job for tcmu-runner.service failed because a timeout was exceeded.
See "systemctl status tcmu-runner.service" and "journalctl -xe" for details.
sh-4.2# systemctl restart tcmu-runner
Job for tcmu-runner.service failed because a timeout was exceeded.
See "systemctl status tcmu-runner.service" and "journalctl -xe" for details.

Actual results:
+++++++++++++++++++
The process stacks show that "/usr/bin/python /usr/bin/targetctl clear" is stuck in Ds state and "/usr/bin/tcmu-runner --tcmu-log-dir /var/log/glusterfs/gluster-block" in Dsl state. Hence, the tcmu-runner service fails to start.

Expected results:
++++++++++++++++++
With kernel version 3.10.0-862.11.2.el7.x86_64, tcmu-runner should not fail in this way when it is restarted.
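For reference, the parallel create/delete load in step 3 can be driven by a loop like the one below. This is a minimal sketch only: the manifest name pvc-template.yaml, its NAME placeholder, and the pvc-block-N naming scheme are illustrative assumptions, not the exact script used during the test.

# churn loop: keep creating new PVCs while deleting previously created ones
# (pvc-template.yaml is assumed to hold a PVC spec for the gluster-block
#  storage class, with NAME as a placeholder for the PVC name)
i=0
while true; do
    i=$((i + 1))
    # create a new PVC with a unique name, in the background
    sed "s/NAME/pvc-block-$i/" pvc-template.yaml | oc create -f - &
    # in parallel, delete one of the PVCs created earlier
    if [ "$i" -gt 25 ]; then
        oc delete pvc "pvc-block-$((i - 25))" &
    fi
    sleep 2
done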
Thanks for reporting the bug. We are already tracking this at BZ#1476730.

*** This bug has been marked as a duplicate of bug 1476730 ***