Bug 1596689 - [Tracking] Gluster-block-target service stuck in "activating" state upon manual restart
Summary: [Tracking] Gluster-block-target service stuck in "activating" state upon manual restart
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: gluster-block
Version: cns-3.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: CNS 3.10
Assignee: Prasanna Kumar Kalever
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On: 1596684 1599656
Blocks: 1568862
 
Reported: 2018-06-29 13:00 UTC by Neha Berry
Modified: 2021-12-10 16:31 UTC

Fixed In Version: kernel-3.10.0-862.11.1.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-12 09:26:58 UTC
Embargoed:




Links
Red Hat Product Errata RHEA-2018:2691 - Last Updated: 2018-09-12 09:28:16 UTC

Description Neha Berry 2018-06-29 13:00:19 UTC
Description of problem:
========================
We have a 3-node CNS cluster; the three glusterfs-storage pods are listed below.

Started executing the steps for the block negative test case "Introduce tcmu-runner failure when IOs are run on block devices".

The setup had 20 app pods with block volumes bind-mounted.

The tcmu-runner service was killed in one pod (which was the AO/active-optimized target for some devices). IO failed over to the ANO/active-non-optimized path, though there was a drop in IO rate.
tcmu-runner was then killed on the 2nd node, and the service was brought back up in both pods. tcmu-runner did start successfully, but in the next step, when we tried to start the gluster-block-target service, the command got stuck indefinitely and the "gluster-block-target" service is still in the "activating" state.
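
For reference, a sketch of the sequence executed on the two affected pods (pod names are from this setup; the exact way tcmu-runner was killed may have differed, e.g. kill -9 on the tcmu-runner PID instead of systemctl kill):

# on glusterfs-storage-tclzq (1st node), while IO is running on the app pods
systemctl kill -s SIGKILL tcmu-runner

# on glusterfs-storage-c5dm8 (2nd node), some time later
systemctl kill -s SIGKILL tcmu-runner

# then, on both pods
systemctl start tcmu-runner             # comes up fine
systemctl start gluster-block-target    # hangs; unit stays in "activating"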


Following are the process stacks from the 3 pods:

--------------------------------------------------
 
oc rsh glusterfs-storage-tclzq
sh-4.2# ps aux| grep Ds
root     31528  0.0  0.0 123748 12176 ?        Ds   11:33   0:00 /usr/bin/python /usr/bin/targetctl restore
root     31757  0.0  0.0   9088   664 pts/2    S+   11:39   0:00 grep Ds
sh-4.2#
sh-4.2# cat /proc/31528/stack
[<ffffffffc048e557>] transport_clear_lun_ref+0x27/0x30 [target_core_mod]
[<ffffffffc04889d1>] core_tpg_remove_lun+0x31/0xd0 [target_core_mod]
[<ffffffffc047a3ac>] core_dev_del_lun+0x2c/0xa0 [target_core_mod]
[<ffffffffc047b8aa>] target_fabric_port_unlink+0x4a/0x60 [target_core_mod]
[<ffffffffabaa92e8>] configfs_unlink+0xf8/0x1c0
[<ffffffffaba2cafc>] vfs_unlink+0x10c/0x190
[<ffffffffaba2d5a5>] do_unlinkat+0x285/0x2d0
[<ffffffffaba2e586>] SyS_unlink+0x16/0x20
[<ffffffffabf20795>] system_call_fastpath+0x1c/0x21
[<ffffffffffffffff>] 0xffffffffffffffff
sh-4.2#
sh-4.2# ip a| grep 10.70
    inet 10.70.42.223/22 brd 10.70.43.255 scope global noprefixroute dynamic ens192
sh-4.2#
 
 
+++++++++++++++++++++++++++
glusterfs-storage-c5dm8
=========================
 
sh-4.2# ps aux| grep Ds
root     31207  0.0  0.0 123748 12172 ?        Ds   11:33   0:00 /usr/bin/python /usr/bin/targetctl restore
root     31494  0.0  0.0   9088   664 pts/4    R+   11:41   0:00 grep Ds
sh-4.2#
sh-4.2#
sh-4.2# cat /proc/31207/stack
[<ffffffffc04dd557>] transport_clear_lun_ref+0x27/0x30 [target_core_mod]
[<ffffffffc04d79d1>] core_tpg_remove_lun+0x31/0xd0 [target_core_mod]
[<ffffffffc04c93ac>] core_dev_del_lun+0x2c/0xa0 [target_core_mod]
[<ffffffffc04ca8aa>] target_fabric_port_unlink+0x4a/0x60 [target_core_mod]
[<ffffffff87ea92e8>] configfs_unlink+0xf8/0x1c0
[<ffffffff87e2cafc>] vfs_unlink+0x10c/0x190
[<ffffffff87e2d5a5>] do_unlinkat+0x285/0x2d0
[<ffffffff87e2e586>] SyS_unlink+0x16/0x20
[<ffffffff88320795>] system_call_fastpath+0x1c/0x21
[<ffffffffffffffff>] 0xffffffffffffffff
sh-4.2#
 
 
+++++++++++++++++++++++++++++++
 
glusterfs-storage-r7szv - no issue, as nothing was killed here
 
 
## oc rsh glusterfs-storage-r7szv
sh-4.2# ps aux|grep Ds
root     31757  0.0  0.0   9088   660 pts/3    S+   11:42   0:00 grep Ds
sh-4.2#
-------------------------------------------------


Current Status of the gluster-block related services
---------------------------------
## for i in `oc get pods -o wide| grep glusterfs|cut -d " " -f1` ; do echo $i; echo +++++++++++++++++++++++; oc exec $i --  systemctl is-active gluster-block-target; done 
glusterfs-storage-c5dm8
+++++++++++++++++++++++
activating
command terminated with exit code 3
glusterfs-storage-r7szv
+++++++++++++++++++++++
active
glusterfs-storage-tclzq
+++++++++++++++++++++++
activating
command terminated with exit code 3
## for i in `oc get pods -o wide| grep glusterfs|cut -d " " -f1` ; do echo $i; echo +++++++++++++++++++++++; oc exec $i --  systemctl is-active gluster-blockd; done 
glusterfs-storage-c5dm8
+++++++++++++++++++++++
inactive
command terminated with exit code 3
glusterfs-storage-r7szv
+++++++++++++++++++++++
active
glusterfs-storage-tclzq
+++++++++++++++++++++++
inactive
command terminated with exit code 3
## for i in `oc get pods -o wide| grep glusterfs|cut -d " " -f1` ; do echo $i; echo +++++++++++++++++++++++; oc exec $i --  systemctl is-active tcmu-runner; done 
glusterfs-storage-c5dm8
+++++++++++++++++++++++
active
glusterfs-storage-r7szv
+++++++++++++++++++++++
active
glusterfs-storage-tclzq
+++++++++++++++++++++++
active



Version-Release number of selected component (if applicable):
==========================================

kernel version in pods = Linux dhcp42-84.lab.eng.blr.redhat.com 3.10.0-862.6.3.el7.x86_64 #1 SMP Fri Jun 15 17:57:37 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux


[root@dhcp47-178 HA_Count]# oc exec heketi-storage-1-v9nq5 -- rpm -qa| grep heketi
python-heketi-7.0.0-1.el7rhgs.x86_64
heketi-client-7.0.0-1.el7rhgs.x86_64
heketi-7.0.0-1.el7rhgs.x86_64
[root@dhcp47-178 HA_Count]# 


Gluster versions
+++++
glusterfs-libs-3.8.4-54.12.el7rhgs.x86_64
glusterfs-3.8.4-54.12.el7rhgs.x86_64
glusterfs-api-3.8.4-54.12.el7rhgs.x86_64
glusterfs-cli-3.8.4-54.12.el7rhgs.x86_64
glusterfs-server-3.8.4-54.12.el7rhgs.x86_64
gluster-block-0.2.1-20.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-54.12.el7rhgs.x86_64
glusterfs-fuse-3.8.4-54.12.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-54.12.el7rhgs.x86_64

## for i in `oc get pods -o wide| grep glusterfs|cut -d " " -f1` ; do echo $i; echo +++++++++++++++++++++++; oc exec $i -- rpm -qa | grep tcmu-runner ; done
glusterfs-storage-c5dm8
+++++++++++++++++++++++
tcmu-runner-1.2.0-20.el7rhgs.x86_64
glusterfs-storage-r7szv
+++++++++++++++++++++++
tcmu-runner-1.2.0-20.el7rhgs.x86_64
glusterfs-storage-tclzq
+++++++++++++++++++++++
tcmu-runner-1.2.0-20.el7rhgs.x86_64


## for i in `oc get pods -o wide| grep glusterfs|cut -d " " -f1` ; do echo $i; echo +++++++++++++++++++++++; oc exec $i -- rpm -qa | grep targetcli ; done
glusterfs-storage-c5dm8
+++++++++++++++++++++++
targetcli-2.1.fb46-6.el7_5.noarch
glusterfs-storage-r7szv
+++++++++++++++++++++++
targetcli-2.1.fb46-6.el7_5.noarch
glusterfs-storage-tclzq
+++++++++++++++++++++++
targetcli-2.1.fb46-6.el7_5.noarch
## 

oc version  
=========== 

openshift v3.10.0-0.67.0

kubernetes v1.10.0+b81c8f8
[root@dhcp47-178 HA_Count]# 


How reproducible:
============================
1x1

Steps to Reproduce:
+++++++++++++++++++
1. Create a CNS setup with gluster nodes=3 
2. Create multiple app pods with block devices bind-mounted
3. Kill the tcmu-runner service on one pod, then on the next pod after some time. The dependent services gluster-block-target and gluster-blockd also fail.
4. Bring up the tcmu-runner service in both pods.
5. Try bringing up the gluster-block-target service in both pods:
#systemctl start gluster-block-target

6. If the service start is stuck, check for the D-state (Ds) process stack in the glusterfs pods (see the sketch below).
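
A sketch of step 6, run from the admin node (the "grep Ds" used in the outputs above is a loose pattern; matching on the STAT column is more precise; <pid> below is the PID reported by the first command):

## find processes stuck in uninterruptible sleep (state D) inside a gluster pod
oc exec glusterfs-storage-tclzq -- ps -eo pid,stat,cmd | awk '$2 ~ /^D/'

## dump the kernel stack of the hung "targetctl restore" process
oc exec glusterfs-storage-tclzq -- cat /proc/<pid>/stack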



Actual results:
+++++++++++++++++++
In the process stack, it can be seen that "/usr/bin/python /usr/bin/targetctl restore" is in the Ds (uninterruptible sleep) state. Hence, the gluster-block-target services are not coming up.
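
The hang can also be confirmed from the systemd side (a sketch, using the same pods as above): the start job for the unit never completes, because the "targetctl restore" process shown above is blocked in uninterruptible sleep in the kernel and cannot be killed.

## unit stays in "activating"; the start job remains queued/running
oc exec glusterfs-storage-tclzq -- systemctl status gluster-block-target
oc exec glusterfs-storage-tclzq -- systemctl list-jobs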


Expected results:
++++++++++++++++++

After the tcmu-runner service starts successfully, a manual start of the gluster-block-target service should also succeed.
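
I.e., after step 5 the same check used earlier should report the unit as active on all three pods (sketch):

## expected after a successful manual restart
for i in `oc get pods -o wide| grep glusterfs|cut -d " " -f1` ; do echo $i; oc exec $i -- systemctl is-active gluster-block-target; done
# expected output: "active" for each of the three glusterfs-storage pods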

Comment 15 errata-xmlrpc 2018-09-12 09:26:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2691

