Description of problem:
========================
We have a 3-node CNS cluster, and started executing the steps for the block negative TC - "Introduce Tcmu-runner failure when IOs are run on block devices".
The setup had 20 app pods with block volumes bind-mounted.
The tcmu-runner service was killed in one pod (which was the Active/Optimized (AO) target for some devices). The IO failed over to the Active/Non-Optimized (ANO) path, though there was a drop in the IO rate.
We then killed tcmu-runner on the 2nd node and afterwards started bringing the service up in both pods. The tcmu-runner service did start successfully, but in the next step, when we tried starting the gluster-block-target service, the command got stuck indefinitely, and the "gluster-block-target" service is still in the "activating" state.
Following are the process stacks from the 3 pods
--------------------------------------------------
oc rsh glusterfs-storage-tclzq
sh-4.2# ps aux| grep Ds
root 31528 0.0 0.0 123748 12176 ? Ds 11:33 0:00 /usr/bin/python /usr/bin/targetctl restore
root 31757 0.0 0.0 9088 664 pts/2 S+ 11:39 0:00 grep Ds
sh-4.2#
sh-4.2# cat /proc/31528/stack
[<ffffffffc048e557>] transport_clear_lun_ref+0x27/0x30 [target_core_mod]
[<ffffffffc04889d1>] core_tpg_remove_lun+0x31/0xd0 [target_core_mod]
[<ffffffffc047a3ac>] core_dev_del_lun+0x2c/0xa0 [target_core_mod]
[<ffffffffc047b8aa>] target_fabric_port_unlink+0x4a/0x60 [target_core_mod]
[<ffffffffabaa92e8>] configfs_unlink+0xf8/0x1c0
[<ffffffffaba2cafc>] vfs_unlink+0x10c/0x190
[<ffffffffaba2d5a5>] do_unlinkat+0x285/0x2d0
[<ffffffffaba2e586>] SyS_unlink+0x16/0x20
[<ffffffffabf20795>] system_call_fastpath+0x1c/0x21
[<ffffffffffffffff>] 0xffffffffffffffff
sh-4.2#
sh-4.2# ip a| grep 10.70
inet 10.70.42.223/22 brd 10.70.43.255 scope global noprefixroute dynamic ens192
sh-4.2#
+++++++++++++++++++++++++++
glusterfs-storage-c5dm8
=========================
sh-4.2# ps aux| grep Ds
root 31207 0.0 0.0 123748 12172 ? Ds 11:33 0:00 /usr/bin/python /usr/bin/targetctl restore
root 31494 0.0 0.0 9088 664 pts/4 R+ 11:41 0:00 grep Ds
sh-4.2#
sh-4.2#
sh-4.2# cat /proc/31207/stack
[<ffffffffc04dd557>] transport_clear_lun_ref+0x27/0x30 [target_core_mod]
[<ffffffffc04d79d1>] core_tpg_remove_lun+0x31/0xd0 [target_core_mod]
[<ffffffffc04c93ac>] core_dev_del_lun+0x2c/0xa0 [target_core_mod]
[<ffffffffc04ca8aa>] target_fabric_port_unlink+0x4a/0x60 [target_core_mod]
[<ffffffff87ea92e8>] configfs_unlink+0xf8/0x1c0
[<ffffffff87e2cafc>] vfs_unlink+0x10c/0x190
[<ffffffff87e2d5a5>] do_unlinkat+0x285/0x2d0
[<ffffffff87e2e586>] SyS_unlink+0x16/0x20
[<ffffffff88320795>] system_call_fastpath+0x1c/0x21
[<ffffffffffffffff>] 0xffffffffffffffff
sh-4.2#
+++++++++++++++++++++++++++++++
glusterfs-storage-r7szv - no issue as nothing was killed here
## oc rsh glusterfs-storage-r7szv
sh-4.2# ps aux|grep Ds
root 31757 0.0 0.0 9088 660 pts/3 S+ 11:42 0:00 grep Ds
sh-4.2#
-------------------------------------------------
Current Status of the gluster-block related services
---------------------------------
## for i in `oc get pods -o wide| grep glusterfs|cut -d " " -f1` ; do echo $i; echo +++++++++++++++++++++++; oc exec $i -- systemctl is-active gluster-block-target; done
glusterfs-storage-c5dm8
+++++++++++++++++++++++
activating
command terminated with exit code 3
glusterfs-storage-r7szv
+++++++++++++++++++++++
active
glusterfs-storage-tclzq
+++++++++++++++++++++++
activating
command terminated with exit code 3
## for i in `oc get pods -o wide| grep glusterfs|cut -d " " -f1` ; do echo $i; echo +++++++++++++++++++++++; oc exec $i -- systemctl is-active gluster-blockd; done
glusterfs-storage-c5dm8
+++++++++++++++++++++++
inactive
command terminated with exit code 3
glusterfs-storage-r7szv
+++++++++++++++++++++++
active
glusterfs-storage-tclzq
+++++++++++++++++++++++
inactive
command terminated with exit code 3
## for i in `oc get pods -o wide| grep glusterfs|cut -d " " -f1` ; do echo $i; echo +++++++++++++++++++++++; oc exec $i -- systemctl is-active tcmu-runner; done
glusterfs-storage-c5dm8
+++++++++++++++++++++++
active
glusterfs-storage-r7szv
+++++++++++++++++++++++
active
glusterfs-storage-tclzq
+++++++++++++++++++++++
active
Version-Release number of selected component (if applicable):
==========================================
kernel version in pods = Linux dhcp42-84.lab.eng.blr.redhat.com 3.10.0-862.6.3.el7.x86_64 #1 SMP Fri Jun 15 17:57:37 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
[root@dhcp47-178 HA_Count]# oc exec heketi-storage-1-v9nq5 -- rpm -qa| grep heketi
python-heketi-7.0.0-1.el7rhgs.x86_64
heketi-client-7.0.0-1.el7rhgs.x86_64
heketi-7.0.0-1.el7rhgs.x86_64
[root@dhcp47-178 HA_Count]#
Gluster versions
+++++
glusterfs-libs-3.8.4-54.12.el7rhgs.x86_64
glusterfs-3.8.4-54.12.el7rhgs.x86_64
glusterfs-api-3.8.4-54.12.el7rhgs.x86_64
glusterfs-cli-3.8.4-54.12.el7rhgs.x86_64
glusterfs-server-3.8.4-54.12.el7rhgs.x86_64
gluster-block-0.2.1-20.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-54.12.el7rhgs.x86_64
glusterfs-fuse-3.8.4-54.12.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-54.12.el7rhgs.x86_64
## for i in `oc get pods -o wide| grep glusterfs|cut -d " " -f1` ; do echo $i; echo +++++++++++++++++++++++; oc exec $i -- rpm -qa | grep tcmu-runner ; done
glusterfs-storage-c5dm8
+++++++++++++++++++++++
tcmu-runner-1.2.0-20.el7rhgs.x86_64
glusterfs-storage-r7szv
+++++++++++++++++++++++
tcmu-runner-1.2.0-20.el7rhgs.x86_64
glusterfs-storage-tclzq
+++++++++++++++++++++++
tcmu-runner-1.2.0-20.el7rhgs.x86_64
## for i in `oc get pods -o wide| grep glusterfs|cut -d " " -f1` ; do echo $i; echo +++++++++++++++++++++++; oc exec $i -- rpm -qa | grep targetcli ; done
glusterfs-storage-c5dm8
+++++++++++++++++++++++
targetcli-2.1.fb46-6.el7_5.noarch
glusterfs-storage-r7szv
+++++++++++++++++++++++
targetcli-2.1.fb46-6.el7_5.noarch
glusterfs-storage-tclzq
+++++++++++++++++++++++
targetcli-2.1.fb46-6.el7_5.noarch
##
oc version
===========
openshift v3.10.0-0.67.0
kubernetes v1.10.0+b81c8f8
[root@dhcp47-178 HA_Count]#
How reproducible:
============================
1x1
Steps to Reproduce:
+++++++++++++++++++
1. Create a CNS setup with gluster nodes=3
2. Create multiple app pods with block devices bind-mounted
3. Kill the tcmu-runner service on one pod, followed by the next pod after some time. The dependent services, gluster-block-target and gluster-blockd, also fail.
4. Bring up the tcmu-runner service in both pods.
5. Try bringing up the gluster-block-target service in both pods:
# systemctl start gluster-block-target
6. If the service start is stuck, check the process stacks of any processes in the Ds state in the glusterfs pods.
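For step 6, grepping `ps aux` output for the literal string "Ds" can also match unrelated lines (including the grep itself). A hedged sketch of a more targeted filter on the process state column:

```shell
# Print processes whose state code begins with "D" (uninterruptible
# sleep); skip the header row. This avoids false matches from grepping
# for the literal text "Ds" anywhere in the line.
ps -eo pid,stat,args | awk 'NR > 1 && $2 ~ /^D/'
```

Run it inside each glusterfs pod (e.g. via `oc rsh <pod>`); any PID it prints can then be inspected with `cat /proc/<pid>/stack` as shown in the transcripts.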
Actual results:
+++++++++++++++++++
In the process stacks, it is seen that "/usr/bin/python /usr/bin/targetctl restore" is stuck in the Ds (uninterruptible sleep) state. Hence, the gluster-block-target services are not coming up.
Expected results:
++++++++++++++++++
On successful startup of the tcmu-runner service, a manual start of the gluster-block-target service should be successful.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHEA-2018:2691