Description of problem:
========================
We have a 3-node CNS cluster, and started executing the steps for the block negative TC - "Introduce Tcmu-runner failure when IOs are run on block devices".
The setup had 20 app pods with block volumes bind-mounted.
The tcmu-runner service was killed in one pod (which was the Active/Optimized (AO) target for some devices). The IO failed over to the Active/Non-Optimized (ANO) path, though there was a drop in the IO rate.
We then killed tcmu-runner on the 2nd node and afterwards started bringing the service up in both pods. The tcmu-runner service did start successfully, but in the next step, when we tried starting the gluster-block-target service, the command got stuck indefinitely, and the "gluster-block-target" service is still in the "activating" state.
Following are the process stacks from the 3 pods
--------------------------------------------------
oc rsh glusterfs-storage-tclzq
sh-4.2# ps aux| grep Ds
root 31528 0.0 0.0 123748 12176 ? Ds 11:33 0:00 /usr/bin/python /usr/bin/targetctl restore
root 31757 0.0 0.0 9088 664 pts/2 S+ 11:39 0:00 grep Ds
sh-4.2#
sh-4.2# cat /proc/31528/stack
[<ffffffffc048e557>] transport_clear_lun_ref+0x27/0x30 [target_core_mod]
[<ffffffffc04889d1>] core_tpg_remove_lun+0x31/0xd0 [target_core_mod]
[<ffffffffc047a3ac>] core_dev_del_lun+0x2c/0xa0 [target_core_mod]
[<ffffffffc047b8aa>] target_fabric_port_unlink+0x4a/0x60 [target_core_mod]
[<ffffffffabaa92e8>] configfs_unlink+0xf8/0x1c0
[<ffffffffaba2cafc>] vfs_unlink+0x10c/0x190
[<ffffffffaba2d5a5>] do_unlinkat+0x285/0x2d0
[<ffffffffaba2e586>] SyS_unlink+0x16/0x20
[<ffffffffabf20795>] system_call_fastpath+0x1c/0x21
[<ffffffffffffffff>] 0xffffffffffffffff
sh-4.2#
sh-4.2# ip a| grep 10.70
inet 10.70.42.223/22 brd 10.70.43.255 scope global noprefixroute dynamic ens192
sh-4.2#
+++++++++++++++++++++++++++
glusterfs-storage-c5dm8
=========================
sh-4.2# ps aux| grep Ds
root 31207 0.0 0.0 123748 12172 ? Ds 11:33 0:00 /usr/bin/python /usr/bin/targetctl restore
root 31494 0.0 0.0 9088 664 pts/4 R+ 11:41 0:00 grep Ds
sh-4.2#
sh-4.2#
sh-4.2# cat /proc/31207/stack
[<ffffffffc04dd557>] transport_clear_lun_ref+0x27/0x30 [target_core_mod]
[<ffffffffc04d79d1>] core_tpg_remove_lun+0x31/0xd0 [target_core_mod]
[<ffffffffc04c93ac>] core_dev_del_lun+0x2c/0xa0 [target_core_mod]
[<ffffffffc04ca8aa>] target_fabric_port_unlink+0x4a/0x60 [target_core_mod]
[<ffffffff87ea92e8>] configfs_unlink+0xf8/0x1c0
[<ffffffff87e2cafc>] vfs_unlink+0x10c/0x190
[<ffffffff87e2d5a5>] do_unlinkat+0x285/0x2d0
[<ffffffff87e2e586>] SyS_unlink+0x16/0x20
[<ffffffff88320795>] system_call_fastpath+0x1c/0x21
[<ffffffffffffffff>] 0xffffffffffffffff
sh-4.2#
+++++++++++++++++++++++++++++++
glusterfs-storage-r7szv - no issue as nothing was killed here
## oc rsh glusterfs-storage-r7szv
sh-4.2# ps aux|grep Ds
root 31757 0.0 0.0 9088 660 pts/3 S+ 11:42 0:00 grep Ds
sh-4.2#
-------------------------------------------------
Current Status of the gluster-block related services
---------------------------------
## for i in `oc get pods -o wide| grep glusterfs|cut -d " " -f1` ; do echo $i; echo +++++++++++++++++++++++; oc exec $i -- systemctl is-active gluster-block-target; done
glusterfs-storage-c5dm8
+++++++++++++++++++++++
activating
command terminated with exit code 3
glusterfs-storage-r7szv
+++++++++++++++++++++++
active
glusterfs-storage-tclzq
+++++++++++++++++++++++
activating
command terminated with exit code 3
## for i in `oc get pods -o wide| grep glusterfs|cut -d " " -f1` ; do echo $i; echo +++++++++++++++++++++++; oc exec $i -- systemctl is-active gluster-blockd; done
glusterfs-storage-c5dm8
+++++++++++++++++++++++
inactive
command terminated with exit code 3
glusterfs-storage-r7szv
+++++++++++++++++++++++
active
glusterfs-storage-tclzq
+++++++++++++++++++++++
inactive
command terminated with exit code 3
## for i in `oc get pods -o wide| grep glusterfs|cut -d " " -f1` ; do echo $i; echo +++++++++++++++++++++++; oc exec $i -- systemctl is-active tcmu-runner; done
glusterfs-storage-c5dm8
+++++++++++++++++++++++
active
glusterfs-storage-r7szv
+++++++++++++++++++++++
active
glusterfs-storage-tclzq
+++++++++++++++++++++++
active
Version-Release number of selected component (if applicable):
==========================================
kernel version in pods = Linux dhcp42-84.lab.eng.blr.redhat.com 3.10.0-862.6.3.el7.x86_64 #1 SMP Fri Jun 15 17:57:37 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
[root@dhcp47-178 HA_Count]# oc exec heketi-storage-1-v9nq5 -- rpm -qa| grep heketi
python-heketi-7.0.0-1.el7rhgs.x86_64
heketi-client-7.0.0-1.el7rhgs.x86_64
heketi-7.0.0-1.el7rhgs.x86_64
[root@dhcp47-178 HA_Count]#
Gluster versions
+++++
glusterfs-libs-3.8.4-54.12.el7rhgs.x86_64
glusterfs-3.8.4-54.12.el7rhgs.x86_64
glusterfs-api-3.8.4-54.12.el7rhgs.x86_64
glusterfs-cli-3.8.4-54.12.el7rhgs.x86_64
glusterfs-server-3.8.4-54.12.el7rhgs.x86_64
gluster-block-0.2.1-20.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-54.12.el7rhgs.x86_64
glusterfs-fuse-3.8.4-54.12.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-54.12.el7rhgs.x86_64
## for i in `oc get pods -o wide| grep glusterfs|cut -d " " -f1` ; do echo $i; echo +++++++++++++++++++++++; oc exec $i -- rpm -qa | grep tcmu-runner ; done
glusterfs-storage-c5dm8
+++++++++++++++++++++++
tcmu-runner-1.2.0-20.el7rhgs.x86_64
glusterfs-storage-r7szv
+++++++++++++++++++++++
tcmu-runner-1.2.0-20.el7rhgs.x86_64
glusterfs-storage-tclzq
+++++++++++++++++++++++
tcmu-runner-1.2.0-20.el7rhgs.x86_64
## for i in `oc get pods -o wide| grep glusterfs|cut -d " " -f1` ; do echo $i; echo +++++++++++++++++++++++; oc exec $i -- rpm -qa | grep targetcli ; done
glusterfs-storage-c5dm8
+++++++++++++++++++++++
targetcli-2.1.fb46-6.el7_5.noarch
glusterfs-storage-r7szv
+++++++++++++++++++++++
targetcli-2.1.fb46-6.el7_5.noarch
glusterfs-storage-tclzq
+++++++++++++++++++++++
targetcli-2.1.fb46-6.el7_5.noarch
##
oc version
===========
openshift v3.10.0-0.67.0
kubernetes v1.10.0+b81c8f8
[root@dhcp47-178 HA_Count]#
How reproducible:
============================
1x1
Steps to Reproduce:
+++++++++++++++++++
1. Create a CNS setup with gluster nodes=3
2. Create multiple app pods with block devices bind-mounted
3. Kill the tcmu-runner service on one pod, followed by the next pod after some time. The dependent services, gluster-block-target and gluster-blockd, also fail.
4. Bring up the tcmu-runner service in both pods.
5. Try bringing up the gluster-block-target service in both pods:
# systemctl start gluster-block-target
6. If the service start is stuck, check the process stacks of any processes in the Ds state in the glusterfs pods.
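For step 6, grepping `ps aux` output for the literal string "Ds" can also match unrelated lines (including the grep itself). A hedged sketch of a more targeted filter on the process state column:

```shell
# Print processes whose state code begins with "D" (uninterruptible
# sleep); skip the header row. This avoids false matches from grepping
# for the literal text "Ds" anywhere in the line.
ps -eo pid,stat,args | awk 'NR > 1 && $2 ~ /^D/'
```

Run it inside each glusterfs pod (e.g. via `oc rsh <pod>`); any PID it prints can then be inspected with `cat /proc/<pid>/stack` as shown in the transcripts.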
Actual results:
+++++++++++++++++++
In the process stacks, it is seen that "/usr/bin/python /usr/bin/targetctl restore" is stuck in the Ds (uninterruptible sleep) state. Hence, the gluster-block-target services are not coming up.
Expected results:
++++++++++++++++++
On successful startup of the tcmu-runner service, a manual start of the gluster-block-target service should be successful.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHEA-2018:2691