Bugzilla will be upgraded to version 5.0. The upgrade date is tentatively scheduled for 2 December 2018, pending final testing and feedback.
Bug 1476730 - [Tracker-RHEL-BZ#1608677] tcmu-runner: allow reset netlink support
[Tracker-RHEL-BZ#1608677] tcmu-runner: allow reset netlink support
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: tcmu-runner (Show other bugs)
cns-3.9
Unspecified Unspecified
unspecified Severity high
: ---
: OCS 3.11
Assigned To: Xiubo Li
Rahul Hinduja
:
: 1476176 1577791 1599158 1610150 (view as bug list)
Depends On: 1608677
Blocks: 1622458 1559239 1568868 1629577
  Show dependency treegraph
 
Reported: 2017-07-31 06:00 EDT by Sweta Anandpara
Modified: 2018-10-24 00:54 EDT (History)
26 users (show)

See Also:
Fixed In Version: tcmu-runner-1.2.0-25.el7rhgs
Doc Type: Bug Fix
Doc Text:
Gluster-block operations (create/delete/modify) or gluster-block-target service restart, performed when tcmu-runner is in offline state, can trigger netlink hung issue, with targetcli process entering uninterruptible sleep (D state) state forever. To recover from this state, restart the tcmu-runner daemon.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-10-24 00:52:58 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)


External Trackers
Tracker ID Priority Status Summary Last Updated
Github open-iscsi/tcmu-runner/pull/402 None None None 2018-07-12 03:46 EDT
Red Hat Product Errata RHEA-2018:2987 None None None 2018-10-24 00:54 EDT

  None (edit)
Description Sweta Anandpara 2017-07-31 06:00:44 EDT
Description of problem:
=======================
Gluster-block daemon is dependent on rpcbind, target and tcmu-runner. When either of these services are affected, gluster-blockd fails to start. Twice it has happened that target.service gets stuck in 'activating' state for hours/days and any attempt to start gluster-blockd (or target.service) just hangs. 

Both the times, the events that took place on the setup were related to failed create (and hence, in the rollback delete path). This time, I was executing a positive scenario of creating a block, when the request timed out. A VM core was generated on one of the HA nodes, for which I have updated BZ 1473162. Post that, gluster-blockd refused to come up on that node. Error messages seen in tcmu-runner.log are pasted below.



2017-07-31 04:57:33.771 15933 [ERROR] add_device:484 : handler open failed for uio2
2017-07-31 04:57:33.789 15933 [ERROR] tcmu_create_glfs_object:445 : glfs_init failed: No such file or directory
2017-07-31 04:57:34.784 15933 [ERROR] glfs_check_config:493 : tcmu_create_glfs_object failed
2017-07-31 04:57:34.790 15933 [ERROR] tcmu_create_glfs_object:445 : glfs_init failed: No such file or directory
2017-07-31 04:57:35.788 15933 [ERROR] tcmu_glfs_open:561 : tcmu_create_glfs_object failed
2017-07-31 04:57:35.789 15933 [ERROR] add_device:484 : handler open failed for uio2
2017-07-31 04:57:35.807 15933 [ERROR] tcmu_create_glfs_object:445 : glfs_init failed: No such file or directory
2017-07-31 04:57:36.807 15933 [ERROR] glfs_check_config:493 : tcmu_create_glfs_object failed
2017-07-31 04:57:36.809 15933 [ERROR] tcmu_create_glfs_object:445 : glfs_init failed: No such file or directory


Version-Release number of selected component (if applicable):
==============================================================
glusterfs-3.8.4-35 and gluster-block-0.2.1-6


How reproducible:
================
Seen it twice


Additional info:
================


[root@dhcp47-117 ~]# gluster-block list vol2
Connection failed. Please check if gluster-block daemon is operational.
[root@dhcp47-117 ~]# systemctl status gluster-blockd
● gluster-blockd.service - Gluster block storage utility
   Loaded: loaded (/usr/lib/systemd/system/gluster-blockd.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
[root@dhcp47-117 ~]# systemctl start  gluster-blockd
^C
[root@dhcp47-117 ~]# systemctl status tcmu-runner
● tcmu-runner.service - LIO Userspace-passthrough daemon
   Loaded: loaded (/usr/lib/systemd/system/tcmu-runner.service; static; vendor preset: disabled)
   Active: active (running) since Mon 2017-07-31 04:57:09 EDT; 18s ago
 Main PID: 15933 (tcmu-runner)
   CGroup: /system.slice/tcmu-runner.service
           └─15933 /usr/bin/tcmu-runner --tcmu-log-dir=/var/log/gluster-block/

Jul 31 04:57:26 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: glfs_check_config:493 : tcmu_create_glfs_object failed
Jul 31 04:57:26 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: 2017-07-31 04:57:26.719 15933 [ERROR] glfs_check_config:493 : tcmu_create_glfs_object failed
Jul 31 04:57:26 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: tcmu_create_glfs_object:445 : glfs_init failed: No such file or directory
Jul 31 04:57:26 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: 2017-07-31 04:57:26.729 15933 [ERROR] tcmu_create_glfs_object:445 : glfs_init failed: No such file or directory
Jul 31 04:57:27 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: tcmu_glfs_open:561 : tcmu_create_glfs_object failed
Jul 31 04:57:27 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: 2017-07-31 04:57:27.722 15933 [ERROR] tcmu_glfs_open:561 : tcmu_create_glfs_object failed
Jul 31 04:57:27 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: 2017-07-31 04:57:27.722 15933 [ERROR] add_device:484 : handler open failed for uio2
Jul 31 04:57:27 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: add_device:484 : handler open failed for uio2
Jul 31 04:57:27 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: tcmu_create_glfs_object:445 : glfs_init failed: No such file or directory
Jul 31 04:57:27 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: 2017-07-31 04:57:27.739 15933 [ERROR] tcmu_create_glfs_object:445 : glfs_init failed: No such file or directory
[root@dhcp47-117 ~]# systemctl stop tcmu-runner
[root@dhcp47-117 ~]# 


[root@dhcp47-117 ~]# systemctl status target
● target.service - Restore LIO kernel target configuration
   Loaded: loaded (/usr/lib/systemd/system/target.service; disabled; vendor preset: disabled)
   Active: activating (start) since Mon 2017-07-31 04:57:09 EDT; 45min ago
 Main PID: 15934 (targetctl)
   CGroup: /system.slice/target.service
           └─15934 /usr/bin/python /usr/bin/targetctl restore

Jul 31 04:57:09 dhcp47-117.lab.eng.blr.redhat.com systemd[1]: Starting Restore LIO kernel target configuration...
[root@dhcp47-117 ~]# 
[root@dhcp47-117 ~]# 
[root@dhcp47-117 ~]# 
[root@dhcp47-117 ~]# systemctl status gluster-block
Unit gluster-block.service could not be found.
[root@dhcp47-117 ~]# systemctl status gluster-blockd
● gluster-blockd.service - Gluster block storage utility
   Loaded: loaded (/usr/lib/systemd/system/gluster-blockd.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
[root@dhcp47-117 ~]# 
[root@dhcp47-117 ~]# 
[root@dhcp47-117 ~]# systemctl daemon-reload
[root@dhcp47-117 ~]# systemctl start target
^C
[root@dhcp47-117 ~]# systemctl status target
● target.service - Restore LIO kernel target configuration
   Loaded: loaded (/usr/lib/systemd/system/target.service; disabled; vendor preset: disabled)
   Active: activating (start) since Mon 2017-07-31 04:57:09 EDT; 47min ago
 Main PID: 15934 (targetctl)
   CGroup: /system.slice/target.service
           └─15934 /usr/bin/python /usr/bin/targetctl restore

Jul 31 04:57:09 dhcp47-117.lab.eng.blr.redhat.com systemd[1]: Starting Restore LIO kernel target configuration...
[root@dhcp47-117 ~]#
Comment 2 Sweta Anandpara 2017-07-31 06:07:40 EDT
Unable to get gluster-block daemon UP affects the entire functionality. It is concerning as this is seen in positive scenario as well, the second time. Hence proposing it as a blocker.

Either we should not hit this at all, or a way to reset gluster-blockd if any of its dependent services take a hit.
Comment 3 Sweta Anandpara 2017-07-31 06:10:41 EDT
I understand tcmu-runner prints out the error messages as it is unable to open those devices. Looks like the files were made unavailable without tcmu-runner knowing about it.

Gluster-block logs can be found at this location: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/<bugnumber>/
Comment 6 Sahina Bose 2017-08-04 02:39:23 EDT
The tcmu-runner-glfs.log are filled with "Client-quorum is not met" and "no subvolumes up". This clearly indicates there's an issue with the gluster volume. Unfortunately, the sosreport provided in Comment 3 does not have the glusterd or brick logs. Will need these to understand
Comment 11 Sweta Anandpara 2017-08-09 01:59:52 EDT
Logs are now attached at: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1476730/dhcp47_117_glusterfs_logs/
Comment 13 Sweta Anandpara 2017-08-09 02:32:10 EDT
Pasting the snippet below from link. It has logs all the way from April 16th. Isn't this sufficient?!

[qe@rhsqe-repo dhcp47_117_glusterfs_logs]$ hostname
rhsqe-repo.lab.eng.blr.redhat.com
[qe@rhsqe-repo dhcp47_117_glusterfs_logs]$ pwd
/home/repo/sosreports/1476730/dhcp47_117_glusterfs_logs
[qe@rhsqe-repo dhcp47_117_glusterfs_logs]$ 
[qe@rhsqe-repo dhcp47_117_glusterfs_logs]$ cd glusterfs
[qe@rhsqe-repo glusterfs]$ 
[qe@rhsqe-repo glusterfs]$ 
[qe@rhsqe-repo glusterfs]$ ls -lrt glusterd*
-rwxr-xr-x. 1 qe qe  3499487 Aug  9 11:13 glusterd.log-20170430.gz
-rwxr-xr-x. 1 qe qe   159812 Aug  9 11:13 glusterd.log-20170703.gz
-rwxr-xr-x. 1 qe qe     2619 Aug  9 11:13 glusterd.log-20170416.gz
-rwxr-xr-x. 1 qe qe    76632 Aug  9 11:13 glusterd.log-20170424.gz
-rwxr-xr-x. 1 qe qe  5045262 Aug  9 11:13 glusterd.log-20170507.gz
-rwxr-xr-x. 1 qe qe  5675694 Aug  9 11:13 glusterd.log-20170515.gz
-rwxr-xr-x. 1 qe qe   261303 Aug  9 11:13 glusterd.log
-rwxr-xr-x. 1 qe qe    21472 Aug  9 11:13 glusterd.log-20170716.gz
-rwxr-xr-x. 1 qe qe   745438 Aug  9 11:13 glusterd.log-20170521.gz
-rwxr-xr-x. 1 qe qe   105599 Aug  9 11:13 glusterd.log-20170713.gz
-rwxr-xr-x. 1 qe qe      455 Aug  9 11:13 glusterd.log-20170529.gz
-rwxr-xr-x. 1 qe qe      389 Aug  9 11:13 glusterd.log-20170604.gz
-rwxr-xr-x. 1 qe qe   132407 Aug  9 11:13 glusterd.log-20170724.gz
-rwxr-xr-x. 1 qe qe    78701 Aug  9 11:13 glusterd.log-20170611.gz
-rwxr-xr-x. 1 qe qe   127661 Aug  9 11:13 glusterd.log-20170619.gz
-rwxr-xr-x. 1 qe qe   131188 Aug  9 11:13 glusterd.log-20170625.gz
-rwxr-xr-x. 1 qe qe   109866 Aug  9 11:13 glusterd.log-20170730.gz
-rwxr-xr-x. 1 qe qe 89385768 Aug  9 11:14 glusterd.log-20170806
[qe@rhsqe-repo glusterfs]$
Comment 22 Prasanna Kumar Kalever 2018-05-15 07:19:05 EDT
*** Bug 1476176 has been marked as a duplicate of this bug. ***
Comment 28 Yaniv Kaul 2018-07-07 01:09:25 EDT
Is this bug related to bug 1560418 ?
Comment 29 Pranith Kumar K 2018-07-09 03:57:33 EDT
(In reply to Yaniv Kaul from comment #28)
> Is this bug related to bug 1560418 ?

Yes
Comment 30 Prasanna Kumar Kalever 2018-07-10 03:34:56 EDT
*** Bug 1599158 has been marked as a duplicate of this bug. ***
Comment 31 Xiubo Li 2018-07-12 03:27:13 EDT
The upstream PR: https://github.com/open-iscsi/tcmu-runner/pull/402
Comment 34 Prasanna Kumar Kalever 2018-07-31 00:55:11 EDT
*** Bug 1610150 has been marked as a duplicate of this bug. ***
Comment 42 Prasanna Kumar Kalever 2018-09-10 04:12:51 EDT
*** Bug 1577791 has been marked as a duplicate of this bug. ***
Comment 44 Michael Adam 2018-09-20 09:59:12 EDT
using the correct tracker
Comment 50 Anjana 2018-10-11 03:20:11 EDT
Updated the Doc text kindly verify.
Comment 52 Anjana 2018-10-12 01:18:50 EDT
Updated the Doc text.
Comment 54 errata-xmlrpc 2018-10-24 00:52:58 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2987

Note You need to log in before you can comment on or make changes to this bug.