Bug 1599158 - tcmu-runner service won't start after failure
Keywords:
Status: CLOSED DUPLICATE of bug 1476730
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: gluster-block
Version: cns-3.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Xiubo Li
QA Contact: Rachael
URL:
Whiteboard:
Depends On:
Blocks: 1568862
 
Reported: 2018-07-09 05:32 UTC by Rachael
Modified: 2018-07-10 07:34 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-07-10 07:34:56 UTC
Embargoed:



Description Rachael 2018-07-09 05:32:30 UTC
Description of problem:

On a CNS setup, a script was run to create block PVCs. While the block devices were being created, the tcmu-runner process was killed on two of the three gluster pods, one after the other. As expected, the creation of the block devices failed. However, when the tcmu-runner service was manually restarted, it failed with the following error:

sh-4.2# systemctl status tcmu-runner
● tcmu-runner.service - LIO Userspace-passthrough daemon
   Loaded: loaded (/usr/lib/systemd/system/tcmu-runner.service; static; vendor preset: disabled)
   Active: failed (Result: timeout) since Mon 2018-07-09 05:03:18 UTC; 22s ago
 Main PID: 29211
   CGroup: /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode95e4671_7f74_11e8_974c_005056a525c4.slice/docker-ccd4f307cd274a7d2bedf1ddc20a9c3115a2fc07e432074b1a7e0a5d3665c737.scope/system.slice/tcmu-runner.service
           └─29211 /usr/bin/tcmu-runner --tcmu-log-dir /var/log/glusterfs/gluster-block

Jul 09 04:58:50 dhcp46-244.lab.eng.blr.redhat.com tcmu-runner[29211]: 2018-07-09 04:58:50.726 29211 [ERROR] tcmu_glfs_open:529 test-vol_glusterfs_claim30_fcf9c007-80cd-11e8-a4e5-0a580a810203: glfs_open(vol=vol_37b50be8ac1...e or directory
Jul 09 04:58:50 dhcp46-244.lab.eng.blr.redhat.com tcmu-runner[29211]: tcmu_glfs_open:529 test-vol_glusterfs_claim30_fcf9c007-80cd-11e8-a4e5-0a580a810203: glfs_open(vol=vol_37b50be8ac1fb551ad7f1b2985d8b6a7, file=block-stor...e or directory
Jul 09 04:58:50 dhcp46-244.lab.eng.blr.redhat.com tcmu-runner[29211]: 2018-07-09 04:58:50.726 29211 [ERROR] add_device:486 : handler open failed for uio29
Jul 09 04:58:50 dhcp46-244.lab.eng.blr.redhat.com tcmu-runner[29211]: add_device:486 : handler open failed for uio29
Jul 09 05:00:17 dhcp46-244.lab.eng.blr.redhat.com systemd[1]: tcmu-runner.service start operation timed out. Terminating.
Jul 09 05:01:48 dhcp46-244.lab.eng.blr.redhat.com systemd[1]: tcmu-runner.service stop-final-sigterm timed out. Killing.
Jul 09 05:03:18 dhcp46-244.lab.eng.blr.redhat.com systemd[1]: tcmu-runner.service still around after final SIGKILL. Entering failed mode.
Jul 09 05:03:18 dhcp46-244.lab.eng.blr.redhat.com systemd[1]: Failed to start LIO Userspace-passthrough daemon.
Jul 09 05:03:18 dhcp46-244.lab.eng.blr.redhat.com systemd[1]: Unit tcmu-runner.service entered failed state.
Jul 09 05:03:18 dhcp46-244.lab.eng.blr.redhat.com systemd[1]: tcmu-runner.service failed.
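
The main PID never finishes starting and then survives both the stop-final SIGTERM and the final SIGKILL, which only happens when a process is stuck in uninterruptible sleep in the kernel. As a quick check (a sketch; PID 29211 is taken from the status output above), the process state and wait channel can be inspected from inside the pod:

sh-4.2# ps -o pid,stat,wchan:32,cmd -p 29211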

# for i in `oc get pods -o wide| grep glusterfs|cut -d " " -f1` ; do echo $i; echo +++++++++++++++++++++++; oc exec $i -- systemctl is-active tcmu-runner; done

glusterfs-storage-cctj8
+++++++++++++++++++++++
active

glusterfs-storage-qpk4g
+++++++++++++++++++++++
failed
command terminated with exit code 3

glusterfs-storage-w9jcs
+++++++++++++++++++++++
failed
command terminated with exit code 3

# for i in `oc get pods -o wide| grep glusterfs|cut -d " " -f1` ; do echo $i; echo +++++++++++++++++++++++; oc exec $i -- ps aux|grep Ds; done

glusterfs-storage-cctj8
+++++++++++++++++++++++

glusterfs-storage-qpk4g
+++++++++++++++++++++++
root      1423  0.0  0.0 122668 11028 ?        Ds   04:51   0:00 /usr/bin/python /usr/bin/targetctl clear
root      2437  0.0  0.0 527176 19532 ?        Ds   05:01   0:00 /usr/bin/tcmu-runner --tcmu-log-dir /var/log/glusterfs/gluster-block

glusterfs-storage-w9jcs
+++++++++++++++++++++++
root     25246  0.0  0.0 122668 11040 ?        Ds   04:51   0:00 /usr/bin/python /usr/bin/targetctl clear
root     29211  0.0  0.0 842380 19692 ?        Ds   04:58   0:00 /usr/bin/tcmu-runner --tcmu-log-dir /var/log/glusterfs/gluster-block
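
Both stuck processes are in state "Ds" (uninterruptible sleep), which is why the final SIGKILL in the systemd log above had no effect. As a diagnostic sketch (pod name and PIDs taken from the ps output above), the kernel stack of each hung PID shows where it is blocked:

# oc exec glusterfs-storage-w9jcs -- cat /proc/25246/stack
# oc exec glusterfs-storage-w9jcs -- cat /proc/29211/stack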



Version-Release number of selected component (if applicable):

# oc version
oc v3.10.0-0.67.0
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

# rpm -qa|grep gluster
glusterfs-client-xlators-3.8.4-54.12.el7rhgs.x86_64
glusterfs-fuse-3.8.4-54.12.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-54.12.el7rhgs.x86_64
glusterfs-libs-3.8.4-54.12.el7rhgs.x86_64
glusterfs-3.8.4-54.12.el7rhgs.x86_64
glusterfs-api-3.8.4-54.12.el7rhgs.x86_64
glusterfs-cli-3.8.4-54.12.el7rhgs.x86_64
glusterfs-server-3.8.4-54.12.el7rhgs.x86_64
gluster-block-0.2.1-20.el7rhgs.x86_64

heketi-7.0.0-2.el7rhgs.x86_64


How reproducible: 1/1


Steps to Reproduce:
1. Create block devices
2. While the block devices are being created, kill the tcmu-runner process (kill -9 <process_id>)
3. Restart the tcmu-runner service (systemctl start tcmu-runner); see the sketch after this list
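
A minimal end-to-end sketch of steps 2 and 3, assuming block PVC creation is already running in the background and using one of the pod names above as an example (pkill by name stands in for kill -9 <process_id>):

# POD=glusterfs-storage-w9jcs
# oc exec $POD -- pkill -9 tcmu-runner            # step 2: kill tcmu-runner mid-creation
# oc exec $POD -- systemctl start tcmu-runner     # step 3: restart; times out as shown above
# oc exec $POD -- systemctl is-active tcmu-runner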

Actual results:
tcmu-runner fails to start: the start operation times out and the process survives the final SIGKILL


Expected results:
tcmu-runner should start successfully


Additional info:
Logs will be attached soon

