Bug 1476730

Summary:	[Tracker-RHEL-BZ#1608677] tcmu-runner: allow reset netlink support
Product:	[Red Hat Storage] Red Hat Gluster Storage	Reporter:	Sweta Anandpara <sanandpa>
Component:	tcmu-runner	Assignee:	Xiubo Li <xiubli>
Status:	CLOSED ERRATA	QA Contact:	Rahul Hinduja <rhinduja>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	cns-3.9	CC:	akrishna, amukherj, ansverma, bgoyal, bkunal, bmohanra, ccalhoun, dvercill, dwojslaw, hchiramm, kramdoss, madam, nberry, nchilaka, olim, pkarampu, prasanna.kalever, rgeorge, rhs-bugs, sabose, sanandpa, sankarshan, storage-qa-internal, vbellur, xiubli, ykaul
Target Milestone:	---
Target Release:	OCS 3.11
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	tcmu-runner-1.2.0-25.el7rhgs	Doc Type:	Bug Fix
Doc Text:	Gluster-block operations (create/delete/modify) or gluster-block-target service restart, performed when tcmu-runner is in offline state, can trigger netlink hung issue, with targetcli process entering uninterruptible sleep (D state) state forever. To recover from this state, restart the tcmu-runner daemon.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2018-10-24 04:52:58 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1608677
Bug Blocks:	1559239, 1568868, 1622458, 1629577

Description Sweta Anandpara 2017-07-31 10:00:44 UTC

Description of problem:
=======================
Gluster-block daemon is dependent on rpcbind, target and tcmu-runner. When either of these services are affected, gluster-blockd fails to start. Twice it has happened that target.service gets stuck in 'activating' state for hours/days and any attempt to start gluster-blockd (or target.service) just hangs. 

Both the times, the events that took place on the setup were related to failed create (and hence, in the rollback delete path). This time, I was executing a positive scenario of creating a block, when the request timed out. A VM core was generated on one of the HA nodes, for which I have updated BZ 1473162. Post that, gluster-blockd refused to come up on that node. Error messages seen in tcmu-runner.log are pasted below.



2017-07-31 04:57:33.771 15933 [ERROR] add_device:484 : handler open failed for uio2
2017-07-31 04:57:33.789 15933 [ERROR] tcmu_create_glfs_object:445 : glfs_init failed: No such file or directory
2017-07-31 04:57:34.784 15933 [ERROR] glfs_check_config:493 : tcmu_create_glfs_object failed
2017-07-31 04:57:34.790 15933 [ERROR] tcmu_create_glfs_object:445 : glfs_init failed: No such file or directory
2017-07-31 04:57:35.788 15933 [ERROR] tcmu_glfs_open:561 : tcmu_create_glfs_object failed
2017-07-31 04:57:35.789 15933 [ERROR] add_device:484 : handler open failed for uio2
2017-07-31 04:57:35.807 15933 [ERROR] tcmu_create_glfs_object:445 : glfs_init failed: No such file or directory
2017-07-31 04:57:36.807 15933 [ERROR] glfs_check_config:493 : tcmu_create_glfs_object failed
2017-07-31 04:57:36.809 15933 [ERROR] tcmu_create_glfs_object:445 : glfs_init failed: No such file or directory


Version-Release number of selected component (if applicable):
==============================================================
glusterfs-3.8.4-35 and gluster-block-0.2.1-6


How reproducible:
================
Seen it twice


Additional info:
================


[root@dhcp47-117 ~]# gluster-block list vol2
Connection failed. Please check if gluster-block daemon is operational.
[root@dhcp47-117 ~]# systemctl status gluster-blockd
● gluster-blockd.service - Gluster block storage utility
   Loaded: loaded (/usr/lib/systemd/system/gluster-blockd.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
[root@dhcp47-117 ~]# systemctl start  gluster-blockd
^C
[root@dhcp47-117 ~]# systemctl status tcmu-runner
● tcmu-runner.service - LIO Userspace-passthrough daemon
   Loaded: loaded (/usr/lib/systemd/system/tcmu-runner.service; static; vendor preset: disabled)
   Active: active (running) since Mon 2017-07-31 04:57:09 EDT; 18s ago
 Main PID: 15933 (tcmu-runner)
   CGroup: /system.slice/tcmu-runner.service
           └─15933 /usr/bin/tcmu-runner --tcmu-log-dir=/var/log/gluster-block/

Jul 31 04:57:26 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: glfs_check_config:493 : tcmu_create_glfs_object failed
Jul 31 04:57:26 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: 2017-07-31 04:57:26.719 15933 [ERROR] glfs_check_config:493 : tcmu_create_glfs_object failed
Jul 31 04:57:26 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: tcmu_create_glfs_object:445 : glfs_init failed: No such file or directory
Jul 31 04:57:26 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: 2017-07-31 04:57:26.729 15933 [ERROR] tcmu_create_glfs_object:445 : glfs_init failed: No such file or directory
Jul 31 04:57:27 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: tcmu_glfs_open:561 : tcmu_create_glfs_object failed
Jul 31 04:57:27 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: 2017-07-31 04:57:27.722 15933 [ERROR] tcmu_glfs_open:561 : tcmu_create_glfs_object failed
Jul 31 04:57:27 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: 2017-07-31 04:57:27.722 15933 [ERROR] add_device:484 : handler open failed for uio2
Jul 31 04:57:27 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: add_device:484 : handler open failed for uio2
Jul 31 04:57:27 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: tcmu_create_glfs_object:445 : glfs_init failed: No such file or directory
Jul 31 04:57:27 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: 2017-07-31 04:57:27.739 15933 [ERROR] tcmu_create_glfs_object:445 : glfs_init failed: No such file or directory
[root@dhcp47-117 ~]# systemctl stop tcmu-runner
[root@dhcp47-117 ~]# 


[root@dhcp47-117 ~]# systemctl status target
● target.service - Restore LIO kernel target configuration
   Loaded: loaded (/usr/lib/systemd/system/target.service; disabled; vendor preset: disabled)
   Active: activating (start) since Mon 2017-07-31 04:57:09 EDT; 45min ago
 Main PID: 15934 (targetctl)
   CGroup: /system.slice/target.service
           └─15934 /usr/bin/python /usr/bin/targetctl restore

Jul 31 04:57:09 dhcp47-117.lab.eng.blr.redhat.com systemd[1]: Starting Restore LIO kernel target configuration...
[root@dhcp47-117 ~]# 
[root@dhcp47-117 ~]# 
[root@dhcp47-117 ~]# 
[root@dhcp47-117 ~]# systemctl status gluster-block
Unit gluster-block.service could not be found.
[root@dhcp47-117 ~]# systemctl status gluster-blockd
● gluster-blockd.service - Gluster block storage utility
   Loaded: loaded (/usr/lib/systemd/system/gluster-blockd.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
[root@dhcp47-117 ~]# 
[root@dhcp47-117 ~]# 
[root@dhcp47-117 ~]# systemctl daemon-reload
[root@dhcp47-117 ~]# systemctl start target
^C
[root@dhcp47-117 ~]# systemctl status target
● target.service - Restore LIO kernel target configuration
   Loaded: loaded (/usr/lib/systemd/system/target.service; disabled; vendor preset: disabled)
   Active: activating (start) since Mon 2017-07-31 04:57:09 EDT; 47min ago
 Main PID: 15934 (targetctl)
   CGroup: /system.slice/target.service
           └─15934 /usr/bin/python /usr/bin/targetctl restore

Jul 31 04:57:09 dhcp47-117.lab.eng.blr.redhat.com systemd[1]: Starting Restore LIO kernel target configuration...
[root@dhcp47-117 ~]#

Comment 2 Sweta Anandpara 2017-07-31 10:07:40 UTC

Unable to get gluster-block daemon UP affects the entire functionality. It is concerning as this is seen in positive scenario as well, the second time. Hence proposing it as a blocker.

Either we should not hit this at all, or a way to reset gluster-blockd if any of its dependent services take a hit.

Comment 3 Sweta Anandpara 2017-07-31 10:10:41 UTC

I understand tcmu-runner prints out the error messages as it is unable to open those devices. Looks like the files were made unavailable without tcmu-runner knowing about it.

Gluster-block logs can be found at this location: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/<bugnumber>/

Comment 6 Sahina Bose 2017-08-04 06:39:23 UTC

The tcmu-runner-glfs.log are filled with "Client-quorum is not met" and "no subvolumes up". This clearly indicates there's an issue with the gluster volume. Unfortunately, the sosreport provided in Comment 3 does not have the glusterd or brick logs. Will need these to understand

Comment 11 Sweta Anandpara 2017-08-09 05:59:52 UTC

Logs are now attached at: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1476730/dhcp47_117_glusterfs_logs/

Comment 13 Sweta Anandpara 2017-08-09 06:32:10 UTC

Pasting the snippet below from link. It has logs all the way from April 16th. Isn't this sufficient?!

[qe@rhsqe-repo dhcp47_117_glusterfs_logs]$ hostname
rhsqe-repo.lab.eng.blr.redhat.com
[qe@rhsqe-repo dhcp47_117_glusterfs_logs]$ pwd
/home/repo/sosreports/1476730/dhcp47_117_glusterfs_logs
[qe@rhsqe-repo dhcp47_117_glusterfs_logs]$ 
[qe@rhsqe-repo dhcp47_117_glusterfs_logs]$ cd glusterfs
[qe@rhsqe-repo glusterfs]$ 
[qe@rhsqe-repo glusterfs]$ 
[qe@rhsqe-repo glusterfs]$ ls -lrt glusterd*
-rwxr-xr-x. 1 qe qe  3499487 Aug  9 11:13 glusterd.log-20170430.gz
-rwxr-xr-x. 1 qe qe   159812 Aug  9 11:13 glusterd.log-20170703.gz
-rwxr-xr-x. 1 qe qe     2619 Aug  9 11:13 glusterd.log-20170416.gz
-rwxr-xr-x. 1 qe qe    76632 Aug  9 11:13 glusterd.log-20170424.gz
-rwxr-xr-x. 1 qe qe  5045262 Aug  9 11:13 glusterd.log-20170507.gz
-rwxr-xr-x. 1 qe qe  5675694 Aug  9 11:13 glusterd.log-20170515.gz
-rwxr-xr-x. 1 qe qe   261303 Aug  9 11:13 glusterd.log
-rwxr-xr-x. 1 qe qe    21472 Aug  9 11:13 glusterd.log-20170716.gz
-rwxr-xr-x. 1 qe qe   745438 Aug  9 11:13 glusterd.log-20170521.gz
-rwxr-xr-x. 1 qe qe   105599 Aug  9 11:13 glusterd.log-20170713.gz
-rwxr-xr-x. 1 qe qe      455 Aug  9 11:13 glusterd.log-20170529.gz
-rwxr-xr-x. 1 qe qe      389 Aug  9 11:13 glusterd.log-20170604.gz
-rwxr-xr-x. 1 qe qe   132407 Aug  9 11:13 glusterd.log-20170724.gz
-rwxr-xr-x. 1 qe qe    78701 Aug  9 11:13 glusterd.log-20170611.gz
-rwxr-xr-x. 1 qe qe   127661 Aug  9 11:13 glusterd.log-20170619.gz
-rwxr-xr-x. 1 qe qe   131188 Aug  9 11:13 glusterd.log-20170625.gz
-rwxr-xr-x. 1 qe qe   109866 Aug  9 11:13 glusterd.log-20170730.gz
-rwxr-xr-x. 1 qe qe 89385768 Aug  9 11:14 glusterd.log-20170806
[qe@rhsqe-repo glusterfs]$

Comment 22 Prasanna Kumar Kalever 2018-05-15 11:19:05 UTC

*** Bug 1476176 has been marked as a duplicate of this bug. ***

Comment 28 Yaniv Kaul 2018-07-07 05:09:25 UTC

Is this bug related to bug 1560418 ?

Comment 29 Pranith Kumar K 2018-07-09 07:57:33 UTC

(In reply to Yaniv Kaul from comment #28)
> Is this bug related to bug 1560418 ?

Yes

Comment 30 Prasanna Kumar Kalever 2018-07-10 07:34:56 UTC

*** Bug 1599158 has been marked as a duplicate of this bug. ***

Comment 31 Xiubo Li 2018-07-12 07:27:13 UTC

The upstream PR: https://github.com/open-iscsi/tcmu-runner/pull/402

Comment 34 Prasanna Kumar Kalever 2018-07-31 04:55:11 UTC

*** Bug 1610150 has been marked as a duplicate of this bug. ***

Comment 42 Prasanna Kumar Kalever 2018-09-10 08:12:51 UTC

*** Bug 1577791 has been marked as a duplicate of this bug. ***

Comment 44 Michael Adam 2018-09-20 13:59:12 UTC

using the correct tracker

Comment 50 Anjana KD 2018-10-11 07:20:11 UTC

Updated the Doc text kindly verify.

Comment 52 Anjana KD 2018-10-12 05:18:50 UTC

Updated the Doc text.

Comment 54 errata-xmlrpc 2018-10-24 04:52:58 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2987