Description of problem:
========================
We have a 3-node CNS cluster. We started executing the steps for the block negative TC "Introduce tcmu-runner failure when IOs are run on block devices". The setup had 20 app pods with block volumes bind-mounted. The tcmu-runner service was killed in one pod (which was the Active/Optimized (AO) path target for some devices). The IO failed over to the Active/Non-Optimized (ANO) path, though there was a drop in the IO rate. We then killed tcmu-runner on the 2nd node and afterwards started bringing the service up in both pods. The tcmu-runner service did start successfully, but in the next step, when we tried starting the gluster-block-target service, the command got stuck indefinitely and the gluster-block-target service is still in the "activating" state.

Process stacks from the 3 pods
--------------------------------------------------
## oc rsh glusterfs-storage-tclzq
sh-4.2# ps aux | grep Ds
root     31528  0.0  0.0 123748 12176 ?        Ds   11:33   0:00 /usr/bin/python /usr/bin/targetctl restore
root     31757  0.0  0.0   9088   664 pts/2    S+   11:39   0:00 grep Ds
sh-4.2# cat /proc/31528/stack
[<ffffffffc048e557>] transport_clear_lun_ref+0x27/0x30 [target_core_mod]
[<ffffffffc04889d1>] core_tpg_remove_lun+0x31/0xd0 [target_core_mod]
[<ffffffffc047a3ac>] core_dev_del_lun+0x2c/0xa0 [target_core_mod]
[<ffffffffc047b8aa>] target_fabric_port_unlink+0x4a/0x60 [target_core_mod]
[<ffffffffabaa92e8>] configfs_unlink+0xf8/0x1c0
[<ffffffffaba2cafc>] vfs_unlink+0x10c/0x190
[<ffffffffaba2d5a5>] do_unlinkat+0x285/0x2d0
[<ffffffffaba2e586>] SyS_unlink+0x16/0x20
[<ffffffffabf20795>] system_call_fastpath+0x1c/0x21
[<ffffffffffffffff>] 0xffffffffffffffff
sh-4.2# ip a | grep 10.70
    inet 10.70.42.223/22 brd 10.70.43.255 scope global noprefixroute dynamic ens192

+++++++++++++++++++++++++++
glusterfs-storage-c5dm8
=========================
sh-4.2# ps aux | grep Ds
root     31207  0.0  0.0 123748 12172 ?        Ds   11:33   0:00 /usr/bin/python /usr/bin/targetctl restore
root     31494  0.0  0.0   9088   664 pts/4    R+   11:41   0:00 grep Ds
sh-4.2# cat /proc/31207/stack
[<ffffffffc04dd557>] transport_clear_lun_ref+0x27/0x30 [target_core_mod]
[<ffffffffc04d79d1>] core_tpg_remove_lun+0x31/0xd0 [target_core_mod]
[<ffffffffc04c93ac>] core_dev_del_lun+0x2c/0xa0 [target_core_mod]
[<ffffffffc04ca8aa>] target_fabric_port_unlink+0x4a/0x60 [target_core_mod]
[<ffffffff87ea92e8>] configfs_unlink+0xf8/0x1c0
[<ffffffff87e2cafc>] vfs_unlink+0x10c/0x190
[<ffffffff87e2d5a5>] do_unlinkat+0x285/0x2d0
[<ffffffff87e2e586>] SyS_unlink+0x16/0x20
[<ffffffff88320795>] system_call_fastpath+0x1c/0x21
[<ffffffffffffffff>] 0xffffffffffffffff

+++++++++++++++++++++++++++++++
glusterfs-storage-r7szv - no issue, as nothing was killed here
## oc rsh glusterfs-storage-r7szv
sh-4.2# ps aux | grep Ds
root     31757  0.0  0.0   9088   660 pts/3    S+   11:42   0:00 grep Ds

-------------------------------------------------
Current status of the gluster-block related services
---------------------------------
## for i in `oc get pods -o wide | grep glusterfs | cut -d " " -f1`; do echo $i; echo +++++++++++++++++++++++; oc exec $i -- systemctl is-active gluster-block-target; done
glusterfs-storage-c5dm8
+++++++++++++++++++++++
activating
command terminated with exit code 3
glusterfs-storage-r7szv
+++++++++++++++++++++++
active
glusterfs-storage-tclzq
+++++++++++++++++++++++
activating
command terminated with exit code 3

## for i in `oc get pods -o wide | grep glusterfs | cut -d " " -f1`; do echo $i; echo +++++++++++++++++++++++; oc exec $i -- systemctl is-active gluster-blockd; done
glusterfs-storage-c5dm8
+++++++++++++++++++++++
inactive
command terminated with exit code 3
glusterfs-storage-r7szv
+++++++++++++++++++++++
active
glusterfs-storage-tclzq
+++++++++++++++++++++++
inactive
command terminated with exit code 3

## for i in `oc get pods -o wide | grep glusterfs | cut -d " " -f1`; do echo $i; echo +++++++++++++++++++++++; oc exec $i -- systemctl is-active tcmu-runner; done
glusterfs-storage-c5dm8
+++++++++++++++++++++++
active
glusterfs-storage-r7szv
+++++++++++++++++++++++
active
glusterfs-storage-tclzq
+++++++++++++++++++++++
active

Version-Release number of selected component (if applicable):
==========================================
Kernel version in pods:
Linux dhcp42-84.lab.eng.blr.redhat.com 3.10.0-862.6.3.el7.x86_64 #1 SMP Fri Jun 15 17:57:37 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux

[root@dhcp47-178 HA_Count]# oc exec heketi-storage-1-v9nq5 -- rpm -qa | grep heketi
python-heketi-7.0.0-1.el7rhgs.x86_64
heketi-client-7.0.0-1.el7rhgs.x86_64
heketi-7.0.0-1.el7rhgs.x86_64

Gluster versions
+++++
glusterfs-libs-3.8.4-54.12.el7rhgs.x86_64
glusterfs-3.8.4-54.12.el7rhgs.x86_64
glusterfs-api-3.8.4-54.12.el7rhgs.x86_64
glusterfs-cli-3.8.4-54.12.el7rhgs.x86_64
glusterfs-server-3.8.4-54.12.el7rhgs.x86_64
gluster-block-0.2.1-20.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-54.12.el7rhgs.x86_64
glusterfs-fuse-3.8.4-54.12.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-54.12.el7rhgs.x86_64

## for i in `oc get pods -o wide | grep glusterfs | cut -d " " -f1`; do echo $i; echo +++++++++++++++++++++++; oc exec $i -- rpm -qa | grep tcmu-runner; done
glusterfs-storage-c5dm8
+++++++++++++++++++++++
tcmu-runner-1.2.0-20.el7rhgs.x86_64
glusterfs-storage-r7szv
+++++++++++++++++++++++
tcmu-runner-1.2.0-20.el7rhgs.x86_64
glusterfs-storage-tclzq
+++++++++++++++++++++++
tcmu-runner-1.2.0-20.el7rhgs.x86_64

## for i in `oc get pods -o wide | grep glusterfs | cut -d " " -f1`; do echo $i; echo +++++++++++++++++++++++; oc exec $i -- rpm -qa | grep targetcli; done
glusterfs-storage-c5dm8
+++++++++++++++++++++++
targetcli-2.1.fb46-6.el7_5.noarch
glusterfs-storage-r7szv
+++++++++++++++++++++++
targetcli-2.1.fb46-6.el7_5.noarch
glusterfs-storage-tclzq
+++++++++++++++++++++++
targetcli-2.1.fb46-6.el7_5.noarch

## oc version
===========
openshift v3.10.0-0.67.0
kubernetes v1.10.0+b81c8f8

How reproducible:
============================
1x1

Steps to Reproduce:
+++++++++++++++++++
1. Create a CNS setup with gluster nodes = 3.
2. Create multiple app pods with block devices bind-mounted.
3. Kill the tcmu-runner service on one pod, followed by the next pod after some time. The dependent services, gluster-block-target and gluster-blockd, also fail.
4. Bring up the tcmu-runner service in both pods.
5. Try bringing up the gluster-block-target service in both pods:
   # systemctl start gluster-block-target
6. If the service start is stuck, check for the uninterruptible-sleep (Ds) process stack in the glusterfs pods.

Actual results:
+++++++++++++++++++
In the process stack, "/usr/bin/python /usr/bin/targetctl restore" is seen in the Ds state. Hence, the gluster-block-target services are not coming up.

Expected results:
++++++++++++++++++
After the tcmu-runner service has started successfully, a manual start of the gluster-block-target service should succeed.
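The D-state check in step 6 can be scripted instead of eyeballing `ps aux | grep Ds` (which also matches the grep process itself). A minimal sketch, to be run as root inside each glusterfs pod; the function name is ours, and reading /proc/<pid>/stack requires root:

```shell
#!/bin/sh
# Print every process in uninterruptible sleep (STAT starting with "D"),
# together with its kernel stack from /proc/<pid>/stack.
dump_d_state_stacks() {
    # pid=,stat=,args= suppresses the header line so every row is a process
    ps -eo pid=,stat=,args= | while read -r pid stat args; do
        case "$stat" in
        D*)
            printf '=== PID %s (%s) state %s ===\n' "$pid" "$args" "$stat"
            cat "/proc/$pid/stack" 2>/dev/null || echo "(stack unreadable; run as root)"
            ;;
        esac
    done
}

dump_d_state_stacks
```

On a healthy pod this prints nothing; in the state reported above it would show the stuck "targetctl restore" process with the transport_clear_lun_ref stack.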
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2691