1476730 – [Tracker-RHEL-BZ#1608677] tcmu-runner: allow reset netlink support

Bug 1476730 - [Tracker-RHEL-BZ#1608677] tcmu-runner: allow reset netlink support

Summary: [Tracker-RHEL-BZ#1608677] tcmu-runner: allow reset netlink support

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	tcmu-runner
Sub Component:
Version:	cns-3.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	OCS 3.11
Assignee:	Xiubo Li
QA Contact:	Rahul Hinduja
Docs Contact:
URL:
Whiteboard:
Duplicates (4):	1476176 1577791 1599158 1610150 (view as bug list)
Depends On:	1608677
Blocks:	1559239 1568868 1622458 1629577
TreeView+	depends on / blocked

Reported:	2017-07-31 10:00 UTC by Sweta Anandpara
Modified:	2021-12-10 15:10 UTC (History)
CC List:	26 users (show)
Fixed In Version:	tcmu-runner-1.2.0-25.el7rhgs
Doc Type:	Bug Fix
Doc Text:	Gluster-block operations (create/delete/modify) or gluster-block-target service restart, performed when tcmu-runner is in offline state, can trigger netlink hung issue, with targetcli process entering uninterruptible sleep (D state) state forever. To recover from this state, restart the tcmu-runner daemon.
Clone Of:
Environment:
Last Closed:	2018-10-24 04:52:58 UTC
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	open-iscsi tcmu-runner pull 402	0	None	closed	tcmu-runner: add reset netlink support	2020-10-16 10:02:57 UTC
Red Hat Product Errata	RHEA-2018:2987	0	None	None	None	2018-10-24 04:54:38 UTC

Description Sweta Anandpara 2017-07-31 10:00:44 UTC

Description of problem:
=======================
Gluster-block daemon is dependent on rpcbind, target and tcmu-runner. When either of these services are affected, gluster-blockd fails to start. Twice it has happened that target.service gets stuck in 'activating' state for hours/days and any attempt to start gluster-blockd (or target.service) just hangs. 

Both the times, the events that took place on the setup were related to failed create (and hence, in the rollback delete path). This time, I was executing a positive scenario of creating a block, when the request timed out. A VM core was generated on one of the HA nodes, for which I have updated BZ 1473162. Post that, gluster-blockd refused to come up on that node. Error messages seen in tcmu-runner.log are pasted below.



2017-07-31 04:57:33.771 15933 [ERROR] add_device:484 : handler open failed for uio2
2017-07-31 04:57:33.789 15933 [ERROR] tcmu_create_glfs_object:445 : glfs_init failed: No such file or directory
2017-07-31 04:57:34.784 15933 [ERROR] glfs_check_config:493 : tcmu_create_glfs_object failed
2017-07-31 04:57:34.790 15933 [ERROR] tcmu_create_glfs_object:445 : glfs_init failed: No such file or directory
2017-07-31 04:57:35.788 15933 [ERROR] tcmu_glfs_open:561 : tcmu_create_glfs_object failed
2017-07-31 04:57:35.789 15933 [ERROR] add_device:484 : handler open failed for uio2
2017-07-31 04:57:35.807 15933 [ERROR] tcmu_create_glfs_object:445 : glfs_init failed: No such file or directory
2017-07-31 04:57:36.807 15933 [ERROR] glfs_check_config:493 : tcmu_create_glfs_object failed
2017-07-31 04:57:36.809 15933 [ERROR] tcmu_create_glfs_object:445 : glfs_init failed: No such file or directory


Version-Release number of selected component (if applicable):
==============================================================
glusterfs-3.8.4-35 and gluster-block-0.2.1-6


How reproducible:
================
Seen it twice


Additional info:
================


[root@dhcp47-117 ~]# gluster-block list vol2
Connection failed. Please check if gluster-block daemon is operational.
[root@dhcp47-117 ~]# systemctl status gluster-blockd
● gluster-blockd.service - Gluster block storage utility
   Loaded: loaded (/usr/lib/systemd/system/gluster-blockd.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
[root@dhcp47-117 ~]# systemctl start  gluster-blockd
^C
[root@dhcp47-117 ~]# systemctl status tcmu-runner
● tcmu-runner.service - LIO Userspace-passthrough daemon
   Loaded: loaded (/usr/lib/systemd/system/tcmu-runner.service; static; vendor preset: disabled)
   Active: active (running) since Mon 2017-07-31 04:57:09 EDT; 18s ago
 Main PID: 15933 (tcmu-runner)
   CGroup: /system.slice/tcmu-runner.service
           └─15933 /usr/bin/tcmu-runner --tcmu-log-dir=/var/log/gluster-block/

Jul 31 04:57:26 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: glfs_check_config:493 : tcmu_create_glfs_object failed
Jul 31 04:57:26 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: 2017-07-31 04:57:26.719 15933 [ERROR] glfs_check_config:493 : tcmu_create_glfs_object failed
Jul 31 04:57:26 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: tcmu_create_glfs_object:445 : glfs_init failed: No such file or directory
Jul 31 04:57:26 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: 2017-07-31 04:57:26.729 15933 [ERROR] tcmu_create_glfs_object:445 : glfs_init failed: No such file or directory
Jul 31 04:57:27 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: tcmu_glfs_open:561 : tcmu_create_glfs_object failed
Jul 31 04:57:27 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: 2017-07-31 04:57:27.722 15933 [ERROR] tcmu_glfs_open:561 : tcmu_create_glfs_object failed
Jul 31 04:57:27 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: 2017-07-31 04:57:27.722 15933 [ERROR] add_device:484 : handler open failed for uio2
Jul 31 04:57:27 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: add_device:484 : handler open failed for uio2
Jul 31 04:57:27 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: tcmu_create_glfs_object:445 : glfs_init failed: No such file or directory
Jul 31 04:57:27 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: 2017-07-31 04:57:27.739 15933 [ERROR] tcmu_create_glfs_object:445 : glfs_init failed: No such file or directory
[root@dhcp47-117 ~]# systemctl stop tcmu-runner
[root@dhcp47-117 ~]# 


[root@dhcp47-117 ~]# systemctl status target
● target.service - Restore LIO kernel target configuration
   Loaded: loaded (/usr/lib/systemd/system/target.service; disabled; vendor preset: disabled)
   Active: activating (start) since Mon 2017-07-31 04:57:09 EDT; 45min ago
 Main PID: 15934 (targetctl)
   CGroup: /system.slice/target.service
           └─15934 /usr/bin/python /usr/bin/targetctl restore

Jul 31 04:57:09 dhcp47-117.lab.eng.blr.redhat.com systemd[1]: Starting Restore LIO kernel target configuration...
[root@dhcp47-117 ~]# 
[root@dhcp47-117 ~]# 
[root@dhcp47-117 ~]# 
[root@dhcp47-117 ~]# systemctl status gluster-block
Unit gluster-block.service could not be found.
[root@dhcp47-117 ~]# systemctl status gluster-blockd
● gluster-blockd.service - Gluster block storage utility
   Loaded: loaded (/usr/lib/systemd/system/gluster-blockd.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
[root@dhcp47-117 ~]# 
[root@dhcp47-117 ~]# 
[root@dhcp47-117 ~]# systemctl daemon-reload
[root@dhcp47-117 ~]# systemctl start target
^C
[root@dhcp47-117 ~]# systemctl status target
● target.service - Restore LIO kernel target configuration
   Loaded: loaded (/usr/lib/systemd/system/target.service; disabled; vendor preset: disabled)
   Active: activating (start) since Mon 2017-07-31 04:57:09 EDT; 47min ago
 Main PID: 15934 (targetctl)
   CGroup: /system.slice/target.service
           └─15934 /usr/bin/python /usr/bin/targetctl restore

Jul 31 04:57:09 dhcp47-117.lab.eng.blr.redhat.com systemd[1]: Starting Restore LIO kernel target configuration...
[root@dhcp47-117 ~]#

Comment 2 Sweta Anandpara 2017-07-31 10:07:40 UTC

Unable to get gluster-block daemon UP affects the entire functionality. It is concerning as this is seen in positive scenario as well, the second time. Hence proposing it as a blocker.

Either we should not hit this at all, or a way to reset gluster-blockd if any of its dependent services take a hit.

Comment 3 Sweta Anandpara 2017-07-31 10:10:41 UTC

I understand tcmu-runner prints out the error messages as it is unable to open those devices. Looks like the files were made unavailable without tcmu-runner knowing about it.

Gluster-block logs can be found at this location: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/<bugnumber>/

Comment 6 Sahina Bose 2017-08-04 06:39:23 UTC

The tcmu-runner-glfs.log are filled with "Client-quorum is not met" and "no subvolumes up". This clearly indicates there's an issue with the gluster volume. Unfortunately, the sosreport provided in Comment 3 does not have the glusterd or brick logs. Will need these to understand

Comment 11 Sweta Anandpara 2017-08-09 05:59:52 UTC

Logs are now attached at: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1476730/dhcp47_117_glusterfs_logs/

Comment 13 Sweta Anandpara 2017-08-09 06:32:10 UTC

Pasting the snippet below from link. It has logs all the way from April 16th. Isn't this sufficient?!

[qe@rhsqe-repo dhcp47_117_glusterfs_logs]$ hostname
rhsqe-repo.lab.eng.blr.redhat.com
[qe@rhsqe-repo dhcp47_117_glusterfs_logs]$ pwd
/home/repo/sosreports/1476730/dhcp47_117_glusterfs_logs
[qe@rhsqe-repo dhcp47_117_glusterfs_logs]$ 
[qe@rhsqe-repo dhcp47_117_glusterfs_logs]$ cd glusterfs
[qe@rhsqe-repo glusterfs]$ 
[qe@rhsqe-repo glusterfs]$ 
[qe@rhsqe-repo glusterfs]$ ls -lrt glusterd*
-rwxr-xr-x. 1 qe qe  3499487 Aug  9 11:13 glusterd.log-20170430.gz
-rwxr-xr-x. 1 qe qe   159812 Aug  9 11:13 glusterd.log-20170703.gz
-rwxr-xr-x. 1 qe qe     2619 Aug  9 11:13 glusterd.log-20170416.gz
-rwxr-xr-x. 1 qe qe    76632 Aug  9 11:13 glusterd.log-20170424.gz
-rwxr-xr-x. 1 qe qe  5045262 Aug  9 11:13 glusterd.log-20170507.gz
-rwxr-xr-x. 1 qe qe  5675694 Aug  9 11:13 glusterd.log-20170515.gz
-rwxr-xr-x. 1 qe qe   261303 Aug  9 11:13 glusterd.log
-rwxr-xr-x. 1 qe qe    21472 Aug  9 11:13 glusterd.log-20170716.gz
-rwxr-xr-x. 1 qe qe   745438 Aug  9 11:13 glusterd.log-20170521.gz
-rwxr-xr-x. 1 qe qe   105599 Aug  9 11:13 glusterd.log-20170713.gz
-rwxr-xr-x. 1 qe qe      455 Aug  9 11:13 glusterd.log-20170529.gz
-rwxr-xr-x. 1 qe qe      389 Aug  9 11:13 glusterd.log-20170604.gz
-rwxr-xr-x. 1 qe qe   132407 Aug  9 11:13 glusterd.log-20170724.gz
-rwxr-xr-x. 1 qe qe    78701 Aug  9 11:13 glusterd.log-20170611.gz
-rwxr-xr-x. 1 qe qe   127661 Aug  9 11:13 glusterd.log-20170619.gz
-rwxr-xr-x. 1 qe qe   131188 Aug  9 11:13 glusterd.log-20170625.gz
-rwxr-xr-x. 1 qe qe   109866 Aug  9 11:13 glusterd.log-20170730.gz
-rwxr-xr-x. 1 qe qe 89385768 Aug  9 11:14 glusterd.log-20170806
[qe@rhsqe-repo glusterfs]$

Comment 22 Prasanna Kumar Kalever 2018-05-15 11:19:05 UTC

*** Bug 1476176 has been marked as a duplicate of this bug. ***

Comment 28 Yaniv Kaul 2018-07-07 05:09:25 UTC

Is this bug related to bug 1560418 ?

Comment 29 Pranith Kumar K 2018-07-09 07:57:33 UTC

(In reply to Yaniv Kaul from comment #28)
> Is this bug related to bug 1560418 ?

Yes

Comment 30 Prasanna Kumar Kalever 2018-07-10 07:34:56 UTC

*** Bug 1599158 has been marked as a duplicate of this bug. ***

Comment 31 Xiubo Li 2018-07-12 07:27:13 UTC

The upstream PR: https://github.com/open-iscsi/tcmu-runner/pull/402

Comment 34 Prasanna Kumar Kalever 2018-07-31 04:55:11 UTC

*** Bug 1610150 has been marked as a duplicate of this bug. ***

Comment 42 Prasanna Kumar Kalever 2018-09-10 08:12:51 UTC

*** Bug 1577791 has been marked as a duplicate of this bug. ***

Comment 44 Michael Adam 2018-09-20 13:59:12 UTC

using the correct tracker

Comment 50 Anjana KD 2018-10-11 07:20:11 UTC

Updated the Doc text kindly verify.

Comment 52 Anjana KD 2018-10-12 05:18:50 UTC

Updated the Doc text.

Comment 54 errata-xmlrpc 2018-10-24 04:52:58 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2987

Note You need to log in before you can comment on or make changes to this bug.

akrishna
amukherj
ansverma
bgoyal
bkunal
bmohanra
ccalhoun
dvercill
dwojslaw
hchiramm
kramdoss
madam
nberry
nchilaka
olim
pkarampu
prasanna.kalever
rgeorge
rhs-bugs
sabose
sanandpa
sankarshan
storage-qa-internal
vbellur
xiubli
ykaul