Description of problem: ======================= Gluster-block daemon is dependent on rpcbind, target and tcmu-runner. When either of these services are affected, gluster-blockd fails to start. Twice it has happened that target.service gets stuck in 'activating' state for hours/days and any attempt to start gluster-blockd (or target.service) just hangs. Both the times, the events that took place on the setup were related to failed create (and hence, in the rollback delete path). This time, I was executing a positive scenario of creating a block, when the request timed out. A VM core was generated on one of the HA nodes, for which I have updated BZ 1473162. Post that, gluster-blockd refused to come up on that node. Error messages seen in tcmu-runner.log are pasted below. 2017-07-31 04:57:33.771 15933 [ERROR] add_device:484 : handler open failed for uio2 2017-07-31 04:57:33.789 15933 [ERROR] tcmu_create_glfs_object:445 : glfs_init failed: No such file or directory 2017-07-31 04:57:34.784 15933 [ERROR] glfs_check_config:493 : tcmu_create_glfs_object failed 2017-07-31 04:57:34.790 15933 [ERROR] tcmu_create_glfs_object:445 : glfs_init failed: No such file or directory 2017-07-31 04:57:35.788 15933 [ERROR] tcmu_glfs_open:561 : tcmu_create_glfs_object failed 2017-07-31 04:57:35.789 15933 [ERROR] add_device:484 : handler open failed for uio2 2017-07-31 04:57:35.807 15933 [ERROR] tcmu_create_glfs_object:445 : glfs_init failed: No such file or directory 2017-07-31 04:57:36.807 15933 [ERROR] glfs_check_config:493 : tcmu_create_glfs_object failed 2017-07-31 04:57:36.809 15933 [ERROR] tcmu_create_glfs_object:445 : glfs_init failed: No such file or directory Version-Release number of selected component (if applicable): ============================================================== glusterfs-3.8.4-35 and gluster-block-0.2.1-6 How reproducible: ================ Seen it twice Additional info: ================ [root@dhcp47-117 ~]# gluster-block list vol2 Connection failed. Please check if gluster-block daemon is operational. [root@dhcp47-117 ~]# systemctl status gluster-blockd ● gluster-blockd.service - Gluster block storage utility Loaded: loaded (/usr/lib/systemd/system/gluster-blockd.service; disabled; vendor preset: disabled) Active: inactive (dead) [root@dhcp47-117 ~]# systemctl start gluster-blockd ^C [root@dhcp47-117 ~]# systemctl status tcmu-runner ● tcmu-runner.service - LIO Userspace-passthrough daemon Loaded: loaded (/usr/lib/systemd/system/tcmu-runner.service; static; vendor preset: disabled) Active: active (running) since Mon 2017-07-31 04:57:09 EDT; 18s ago Main PID: 15933 (tcmu-runner) CGroup: /system.slice/tcmu-runner.service └─15933 /usr/bin/tcmu-runner --tcmu-log-dir=/var/log/gluster-block/ Jul 31 04:57:26 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: glfs_check_config:493 : tcmu_create_glfs_object failed Jul 31 04:57:26 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: 2017-07-31 04:57:26.719 15933 [ERROR] glfs_check_config:493 : tcmu_create_glfs_object failed Jul 31 04:57:26 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: tcmu_create_glfs_object:445 : glfs_init failed: No such file or directory Jul 31 04:57:26 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: 2017-07-31 04:57:26.729 15933 [ERROR] tcmu_create_glfs_object:445 : glfs_init failed: No such file or directory Jul 31 04:57:27 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: tcmu_glfs_open:561 : tcmu_create_glfs_object failed Jul 31 04:57:27 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: 2017-07-31 04:57:27.722 15933 [ERROR] tcmu_glfs_open:561 : tcmu_create_glfs_object failed Jul 31 04:57:27 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: 2017-07-31 04:57:27.722 15933 [ERROR] add_device:484 : handler open failed for uio2 Jul 31 04:57:27 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: add_device:484 : handler open failed for uio2 Jul 31 04:57:27 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: tcmu_create_glfs_object:445 : glfs_init failed: No such file or directory Jul 31 04:57:27 dhcp47-117.lab.eng.blr.redhat.com tcmu-runner[15933]: 2017-07-31 04:57:27.739 15933 [ERROR] tcmu_create_glfs_object:445 : glfs_init failed: No such file or directory [root@dhcp47-117 ~]# systemctl stop tcmu-runner [root@dhcp47-117 ~]# [root@dhcp47-117 ~]# systemctl status target ● target.service - Restore LIO kernel target configuration Loaded: loaded (/usr/lib/systemd/system/target.service; disabled; vendor preset: disabled) Active: activating (start) since Mon 2017-07-31 04:57:09 EDT; 45min ago Main PID: 15934 (targetctl) CGroup: /system.slice/target.service └─15934 /usr/bin/python /usr/bin/targetctl restore Jul 31 04:57:09 dhcp47-117.lab.eng.blr.redhat.com systemd[1]: Starting Restore LIO kernel target configuration... [root@dhcp47-117 ~]# [root@dhcp47-117 ~]# [root@dhcp47-117 ~]# [root@dhcp47-117 ~]# systemctl status gluster-block Unit gluster-block.service could not be found. [root@dhcp47-117 ~]# systemctl status gluster-blockd ● gluster-blockd.service - Gluster block storage utility Loaded: loaded (/usr/lib/systemd/system/gluster-blockd.service; disabled; vendor preset: disabled) Active: inactive (dead) [root@dhcp47-117 ~]# [root@dhcp47-117 ~]# [root@dhcp47-117 ~]# systemctl daemon-reload [root@dhcp47-117 ~]# systemctl start target ^C [root@dhcp47-117 ~]# systemctl status target ● target.service - Restore LIO kernel target configuration Loaded: loaded (/usr/lib/systemd/system/target.service; disabled; vendor preset: disabled) Active: activating (start) since Mon 2017-07-31 04:57:09 EDT; 47min ago Main PID: 15934 (targetctl) CGroup: /system.slice/target.service └─15934 /usr/bin/python /usr/bin/targetctl restore Jul 31 04:57:09 dhcp47-117.lab.eng.blr.redhat.com systemd[1]: Starting Restore LIO kernel target configuration... [root@dhcp47-117 ~]#
Unable to get gluster-block daemon UP affects the entire functionality. It is concerning as this is seen in positive scenario as well, the second time. Hence proposing it as a blocker. Either we should not hit this at all, or a way to reset gluster-blockd if any of its dependent services take a hit.
I understand tcmu-runner prints out the error messages as it is unable to open those devices. Looks like the files were made unavailable without tcmu-runner knowing about it. Gluster-block logs can be found at this location: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/<bugnumber>/
The tcmu-runner-glfs.log are filled with "Client-quorum is not met" and "no subvolumes up". This clearly indicates there's an issue with the gluster volume. Unfortunately, the sosreport provided in Comment 3 does not have the glusterd or brick logs. Will need these to understand
Logs are now attached at: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1476730/dhcp47_117_glusterfs_logs/
Pasting the snippet below from link. It has logs all the way from April 16th. Isn't this sufficient?! [qe@rhsqe-repo dhcp47_117_glusterfs_logs]$ hostname rhsqe-repo.lab.eng.blr.redhat.com [qe@rhsqe-repo dhcp47_117_glusterfs_logs]$ pwd /home/repo/sosreports/1476730/dhcp47_117_glusterfs_logs [qe@rhsqe-repo dhcp47_117_glusterfs_logs]$ [qe@rhsqe-repo dhcp47_117_glusterfs_logs]$ cd glusterfs [qe@rhsqe-repo glusterfs]$ [qe@rhsqe-repo glusterfs]$ [qe@rhsqe-repo glusterfs]$ ls -lrt glusterd* -rwxr-xr-x. 1 qe qe 3499487 Aug 9 11:13 glusterd.log-20170430.gz -rwxr-xr-x. 1 qe qe 159812 Aug 9 11:13 glusterd.log-20170703.gz -rwxr-xr-x. 1 qe qe 2619 Aug 9 11:13 glusterd.log-20170416.gz -rwxr-xr-x. 1 qe qe 76632 Aug 9 11:13 glusterd.log-20170424.gz -rwxr-xr-x. 1 qe qe 5045262 Aug 9 11:13 glusterd.log-20170507.gz -rwxr-xr-x. 1 qe qe 5675694 Aug 9 11:13 glusterd.log-20170515.gz -rwxr-xr-x. 1 qe qe 261303 Aug 9 11:13 glusterd.log -rwxr-xr-x. 1 qe qe 21472 Aug 9 11:13 glusterd.log-20170716.gz -rwxr-xr-x. 1 qe qe 745438 Aug 9 11:13 glusterd.log-20170521.gz -rwxr-xr-x. 1 qe qe 105599 Aug 9 11:13 glusterd.log-20170713.gz -rwxr-xr-x. 1 qe qe 455 Aug 9 11:13 glusterd.log-20170529.gz -rwxr-xr-x. 1 qe qe 389 Aug 9 11:13 glusterd.log-20170604.gz -rwxr-xr-x. 1 qe qe 132407 Aug 9 11:13 glusterd.log-20170724.gz -rwxr-xr-x. 1 qe qe 78701 Aug 9 11:13 glusterd.log-20170611.gz -rwxr-xr-x. 1 qe qe 127661 Aug 9 11:13 glusterd.log-20170619.gz -rwxr-xr-x. 1 qe qe 131188 Aug 9 11:13 glusterd.log-20170625.gz -rwxr-xr-x. 1 qe qe 109866 Aug 9 11:13 glusterd.log-20170730.gz -rwxr-xr-x. 1 qe qe 89385768 Aug 9 11:14 glusterd.log-20170806 [qe@rhsqe-repo glusterfs]$
*** Bug 1476176 has been marked as a duplicate of this bug. ***
Is this bug related to bug 1560418 ?
(In reply to Yaniv Kaul from comment #28) > Is this bug related to bug 1560418 ? Yes
*** Bug 1599158 has been marked as a duplicate of this bug. ***
The upstream PR: https://github.com/open-iscsi/tcmu-runner/pull/402
*** Bug 1610150 has been marked as a duplicate of this bug. ***
*** Bug 1577791 has been marked as a duplicate of this bug. ***
using the correct tracker
Updated the Doc text kindly verify.
Updated the Doc text.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:2987