Bug 1598322
| Summary: | delay gluster-blockd start until all bricks come up |
|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage |
| Component: | gluster-block |
| Version: | cns-3.10 |
| Status: | CLOSED ERRATA |
| Severity: | high |
| Priority: | high |
| Reporter: | Prasanna Kumar Kalever <prasanna.kalever> |
| Assignee: | Pranith Kumar K <pkarampu> |
| QA Contact: | Nitin Goyal <nigoyal> |
| Docs Contact: | |
| CC: | aclewett, asriram, atumball, bgoyal, hbustam, kramdoss, nigoyal, pkarampu, pprakash, prasanna.kalever, rcyriac, rhs-bugs, sankarshan, vbellur, xiubli |
| Target Milestone: | --- |
| Target Release: | CNS 3.10 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Whiteboard: | |
| Fixed In Version: | gluster-block-0.2.1-24.el7rhgs |
| Doc Type: | Bug Fix |
| Doc Text: | Previously, the gluster-block daemon depended only on glusterd being running and did not verify that the block hosting volumes were online and ready to be consumed before beginning its operations. This resulted in failures when the gluster-block daemon attempted to load the target configuration. With this update, the gluster-block daemon waits for the bricks of the block hosting volumes to be available before attempting to load the target configuration. |
| Story Points: | --- |
| Clone Of: | |
| : | 1598353 (view as bug list) |
| Environment: | |
| Last Closed: | 2018-09-12 09:27:16 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
| Embargoed: | |
| Bug Depends On: | 1560418, 1610787 |
| Bug Blocks: | 1568862, 1570976 |
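The Doc Text above summarizes the fix: startup of the block services is gated on the bricks of the block hosting volumes being available. As a rough, hedged sketch of that gating idea only, and not the actual wait-for-bricks.sh shipped in gluster-block, such a pre-start check can poll `gluster volume status` until no brick reports offline, or until a timeout expires:

#!/bin/bash
# Illustrative sketch of a brick-availability gate (NOT the shipped wait-for-bricks.sh).
# Usage: wait-for-bricks-sketch.sh <timeout-seconds>
timeout="${1:-120}"
deadline=$(( $(date +%s) + timeout ))

while true; do
    # Count bricks whose "Online" column is "N"; the exact column layout of
    # `gluster volume status` can differ between glusterfs versions.
    offline=$(gluster volume status 2>/dev/null | awk '$1 == "Brick" && $(NF-1) == "N"' | wc -l)
    if [ "$offline" -eq 0 ]; then
        exit 0      # all bricks report online; allow the dependent service to start
    fi
    if [ "$(date +%s)" -ge "$deadline" ]; then
        echo "WARNING: timeout expired, $offline brick(s) are yet to come online" >&2
        exit 1      # a failing ExecStartPre makes the unit start fail
    fi
    sleep 2
done

In this report the real check is wired as ExecStartPre=/usr/libexec/gluster-block/wait-for-bricks.sh 120 on tcmu-runner.service, as visible in the tcmu-runner status output further below.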
Description
Prasanna Kumar Kalever
2018-07-05 06:45:03 UTC
Hi Pranith,
While working on this bug, I deleted one gluster pod out of 3 to verify that gluster-blockd comes up after glusterd and all bricks. After some time, when the pod came back up, everything (glusterd and all brick processes) was up except gluster-blockd on that gluster pod. gluster-blockd is not coming up, and fails with the error below.
sh-4.2# systemctl status gluster-blockd -l
● gluster-blockd.service - Gluster block storage utility
Loaded: loaded (/usr/lib/systemd/system/gluster-blockd.service; enabled; vendor preset: disabled)
Active: inactive (dead)
Jul 31 13:14:20 dhcp46-217.lab.eng.blr.redhat.com systemd[1]: Dependency failed for Gluster block storage utility.
Jul 31 13:14:20 dhcp46-217.lab.eng.blr.redhat.com systemd[1]: Job gluster-blockd.service/start failed with result 'dependency'.
sh-4.2# systemctl status glusterd -l
● glusterd.service - GlusterFS, a clustered file-system server
Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
Active: active (running) since Tue 2018-07-31 13:01:20 UTC; 58min ago
Process: 27534 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 27537 (glusterd)
CGroup: /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod4ccc22aa_94c1_11e8_8bae_005056a5f2d4.slice/docker-028937c3bdc69b3af7bd5a67bfa26254edd155a755d6aa386eff7aad98874365.scope/system.slice/glusterd.service
├─ 9723 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/07a4ac5d617c977cec7f2014901b113b.socket --xlator-option *replicate*.node-uuid=4c78819d-dc92-4898-b0db-fb3f6c17aeab
├─10085 /usr/sbin/glusterfsd -s 10.70.46.217 --volfile-id heketidbstorage.10.70.46.217.var-lib-heketi-mounts-vg_2dbaf4459b1591a754aa69f4b9c1ae41-brick_0d127fc5efe4a251b5534a2df36734b0-brick -p /var/run/gluster/vols/heketidbstorage/10.70.46.217-var-lib-heketi-mounts-vg_2dbaf4459b1591a754aa69f4b9c1ae41-brick_0d127fc5efe4a251b5534a2df36734b0-brick.pid -S /var/run/gluster/f4a121376ae22851db2fbbc5f84a18d7.socket --brick-name /var/lib/heketi/mounts/vg_2dbaf4459b1591a754aa69f4b9c1ae41/brick_0d127fc5efe4a251b5534a2df36734b0/brick -l /var/log/glusterfs/bricks/var-lib-heketi-mounts-vg_2dbaf4459b1591a754aa69f4b9c1ae41-brick_0d127fc5efe4a251b5534a2df36734b0-brick.log --xlator-option *-posix.glusterd-uuid=4c78819d-dc92-4898-b0db-fb3f6c17aeab --brick-port 49152 --xlator-option heketidbstorage-server.listen-port=49152
├─10289 /usr/sbin/glusterfsd -s 10.70.46.217 --volfile-id vol110.10.70.46.217.var-lib-heketi-mounts-vg_2dbaf4459b1591a754aa69f4b9c1ae41-brick_46538c9223275aaff1e0680461f04eab-brick -p /var/run/gluster/vols/vol110/10.70.46.217-var-lib-heketi-mounts-vg_2dbaf4459b1591a754aa69f4b9c1ae41-brick_46538c9223275aaff1e0680461f04eab-brick.pid -S /var/run/gluster/efa7e06610767d27a6aa0e8f32c8f4af.socket --brick-name /var/lib/heketi/mounts/vg_2dbaf4459b1591a754aa69f4b9c1ae41/brick_46538c9223275aaff1e0680461f04eab/brick -l /var/log/glusterfs/bricks/var-lib-heketi-mounts-vg_2dbaf4459b1591a754aa69f4b9c1ae41-brick_46538c9223275aaff1e0680461f04eab-brick.log --xlator-option *-posix.glusterd-uuid=4c78819d-dc92-4898-b0db-fb3f6c17aeab --brick-port 49153 --xlator-option vol110-server.listen-port=49153
└─27537 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
Jul 31 13:00:54 dhcp46-217.lab.eng.blr.redhat.com systemd[1]: Starting GlusterFS, a clustered file-system server...
Jul 31 13:01:20 dhcp46-217.lab.eng.blr.redhat.com systemd[1]: Started GlusterFS, a clustered file-system server.
Brick process checks:
sh-4.2# gluster v status | grep 10289 | wc -l (id from that pod where gluster-blockd is not coming up)
292
sh-4.2# gluster v status | grep 4835 | wc -l
292
sh-4.2# gluster v status | grep 4799 | wc -l
292
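As a side note (a hedged suggestion, not part of the original report), the same health check can be done without grepping for individual PIDs, by counting bricks whose Online column is not "Y" and by confirming that a given PID really is a brick process:

# Bricks that gluster reports as not online (expected result: 0).
# Column layout of `gluster volume status` may differ slightly across versions.
gluster volume status | awk '$1 == "Brick" && $(NF-1) != "Y"' | wc -l

# Confirm that a PID seen in the status output (e.g. 10289 above) is a glusterfsd process.
ps -p 10289 -o comm=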
sosreports ->
http://rhsqe-repo.lab.eng.blr.redhat.com/cns/bugs/BZ-1598322/
Pranith,
Can you please check whether this is because of the same bug, or whether I need to raise a new bug?
Further analysis shows that the tcmu-runner service failed to start after the gluster pod restart.
sh-4.2# ps -aux | grep Ds
root 1516 0.0 0.0 584520 19656 ? Ds Jul31 0:00 /usr/bin/tcmu-runner --tcmu-log-dir /var/log/glusterfs/gluster-block
root 27770 0.0 0.0 9088 660 pts/4 S+ 05:56 0:00 grep Ds
sh-4.2#
sh-4.2# systemctl status tcmu-runner -l
● tcmu-runner.service - LIO Userspace-passthrough daemon
Loaded: loaded (/usr/lib/systemd/system/tcmu-runner.service; static; vendor preset: disabled)
Active: failed (Result: timeout) since Tue 2018-07-31 13:14:20 UTC; 16h ago
Process: 1501 ExecStartPre=/usr/libexec/gluster-block/wait-for-bricks.sh 120 (code=exited, status=1/FAILURE)
Main PID: 1516
CGroup: /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod4ccc22aa_94c1_11e8_8bae_005056a5f2d4.slice/docker-028937c3bdc69b3af7bd5a67bfa26254edd155a755d6aa386eff7aad98874365.scope/system.slice/tcmu-runner.service
└─1516 /usr/bin/tcmu-runner --tcmu-log-dir /var/log/glusterfs/gluster-block
Jul 31 13:01:40 dhcp46-217.lab.eng.blr.redhat.com tcmu-runner[1516]: 2018-07-31 13:01:40.044 1516 [ERROR] add_device:516 : could not open /dev/uio1
Jul 31 13:01:40 dhcp46-217.lab.eng.blr.redhat.com tcmu-runner[1516]: add_device:516 : could not open /dev/uio1
Jul 31 13:01:40 dhcp46-217.lab.eng.blr.redhat.com tcmu-runner[1516]: 2018-07-31 13:01:40.044 1516 [ERROR] add_device:516 : could not open /dev/uio10
Jul 31 13:01:40 dhcp46-217.lab.eng.blr.redhat.com tcmu-runner[1516]: add_device:516 : could not open /dev/uio10
Jul 31 13:11:20 dhcp46-217.lab.eng.blr.redhat.com systemd[1]: tcmu-runner.service start operation timed out. Terminating.
Jul 31 13:12:50 dhcp46-217.lab.eng.blr.redhat.com systemd[1]: tcmu-runner.service stop-final-sigterm timed out. Killing.
Jul 31 13:14:20 dhcp46-217.lab.eng.blr.redhat.com systemd[1]: tcmu-runner.service still around after final SIGKILL. Entering failed mode.
Jul 31 13:14:20 dhcp46-217.lab.eng.blr.redhat.com systemd[1]: Failed to start LIO Userspace-passthrough daemon.
Jul 31 13:14:20 dhcp46-217.lab.eng.blr.redhat.com systemd[1]: Unit tcmu-runner.service entered failed state.
Jul 31 13:14:20 dhcp46-217.lab.eng.blr.redhat.com systemd[1]: tcmu-runner.service failed.
sh-4.2#
sh-4.2# systemctl status gluster-blockd -l
● gluster-blockd.service - Gluster block storage utility
Loaded: loaded (/usr/lib/systemd/system/gluster-blockd.service; enabled; vendor preset: disabled)
Active: inactive (dead)
Jul 31 13:14:20 dhcp46-217.lab.eng.blr.redhat.com systemd[1]: Dependency failed for Gluster block storage utility.
Jul 31 13:14:20 dhcp46-217.lab.eng.blr.redhat.com systemd[1]: Job gluster-blockd.service/start failed with result 'dependency'.
sh-4.2#
sh-4.2# systemctl status gluster-block-target -l
● gluster-block-target.service - Restore LIO kernel target configuration
Loaded: loaded (/usr/lib/systemd/system/gluster-block-target.service; disabled; vendor preset: disabled)
Active: inactive (dead)
Jul 31 13:14:20 dhcp46-217.lab.eng.blr.redhat.com systemd[1]: Dependency failed for Restore LIO kernel target configuration.
Jul 31 13:14:20 dhcp46-217.lab.eng.blr.redhat.com systemd[1]: Job gluster-block-target.service/start failed with result 'dependency'.
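The three statuses above are one failure chain: the ExecStartPre=/usr/libexec/gluster-block/wait-for-bricks.sh check on tcmu-runner.service failed and the tcmu-runner start timed out, so gluster-block-target and gluster-blockd, which the "Dependency failed" messages indicate depend on it directly or indirectly, never started. The chain can be inspected with standard systemd tooling, for example:

# Units that gluster-blockd pulls in before it can start
systemctl list-dependencies gluster-blockd.service

# Full unit file of the failing dependency, including its ExecStartPre= line
systemctl cat tcmu-runner.service

# Journal entries for the failed unit since the current boot
journalctl -u tcmu-runner.service -b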
All packages from gluster pod:

sh-4.2# uname -r
3.10.0-862.11.2.el7.x86_64
sh-4.2#
sh-4.2# rpm -qa | grep gluster
glusterfs-client-xlators-3.8.4-54.15.el7rhgs.x86_64
glusterfs-fuse-3.8.4-54.15.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-54.15.el7rhgs.x86_64
glusterfs-libs-3.8.4-54.15.el7rhgs.x86_64
glusterfs-3.8.4-54.15.el7rhgs.x86_64
glusterfs-api-3.8.4-54.15.el7rhgs.x86_64
glusterfs-cli-3.8.4-54.15.el7rhgs.x86_64
glusterfs-server-3.8.4-54.15.el7rhgs.x86_64
gluster-block-0.2.1-23.el7rhgs.x86_64
sh-4.2#
sh-4.2# rpm -qa | grep tcmu
libtcmu-1.2.0-23.el7rhgs.x86_64
tcmu-runner-1.2.0-23.el7rhgs.x86_64
sh-4.2#

(In reply to Nitin Goyal from comment #11)
> All packages from gluster pod
> [...]

If you execute "/usr/libexec/gluster-block/wait-for-bricks.sh 120" what is the behavior? The script should exit within 2 minutes. Is that happening?

> If you execute "/usr/libexec/gluster-block/wait-for-bricks.sh 120" what is
> the behavior? The script should exit within 2 minutes. Is that happening?
sh-4.2# time ./usr/libexec/gluster-block/wait-for-bricks.sh 120
real 0m0.008s
user 0m0.003s
sys 0m0.006s
sh-4.2# time /usr/libexec/gluster-block/wait-for-bricks.sh 120
real 0m0.010s
user 0m0.005s
sys 0m0.005s
I have the setup; please let me know if you need it.
(In reply to Nitin Goyal from comment #13)
> > If you execute "/usr/libexec/gluster-block/wait-for-bricks.sh 120" what is
> > the behavior? The script should exit within 2 minutes. Is that happening?
>
> sh-4.2# time ./usr/libexec/gluster-block/wait-for-bricks.sh 120
>
> real 0m0.008s
> user 0m0.003s
> sys 0m0.006s
> sh-4.2# time /usr/libexec/gluster-block/wait-for-bricks.sh 120
>
> real 0m0.010s
> user 0m0.005s
> sys 0m0.005s
>
> I have the setup; please let me know if you need it.

Based on this, it doesn't look like the issue we were trying to solve as part of this bug.

QE cannot verify this as of now because we are blocked on another bug, 1610787. When we get the fix for that, we will verify it.

We need the following changes to make it work inside the container:
sh-4.2# diff wait-for-bricks1.sh /usr/libexec/gluster-block/wait-for-bricks.sh
119c119
< if ! systemctl is-active --quiet glusterd.service > /dev/null 2>&1
---
> if ! pidof glusterd > /dev/null 2>&1
Worked as expected after this change:
[2018-08-03 12:55:24] WARNING: Timeout Expired, bricks of volumes:"vol11 (3/3), vol12 (3/3), vol13 (3/3), vol14 (3/3), vol15 (3/3), vol16 (3/3), vol17 (3/3), vol18 (3/3), vol19 (3/3), vol20 (3/3)" are yet to come online
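The one-line change in the diff above swaps the systemd-based glusterd liveness check for a pidof-based one so that the script also behaves correctly inside the gluster container. A hedged sketch of the more defensive variant discussed later in this bug (use whichever of the two CLIs is present, and bail out if neither exists) might look like the following; this is an illustration only, not the code that was shipped:

# Sketch of a glusterd liveness check that prefers pidof but can fall back to
# systemctl, and bails out if neither tool is available (illustration only).
glusterd_running() {
    if command -v pidof >/dev/null 2>&1; then
        pidof glusterd >/dev/null 2>&1
    elif command -v systemctl >/dev/null 2>&1; then
        systemctl is-active --quiet glusterd.service
    else
        echo "wait-for-bricks: need either pidof or systemctl to detect glusterd" >&2
        return 2
    fi
}

if ! glusterd_running; then
    echo "glusterd is not running yet" >&2
fi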
(In reply to Pranith Kumar K from comment #27)
> We need the following changes to make it work inside the container:
> sh-4.2# diff wait-for-bricks1.sh /usr/libexec/gluster-block/wait-for-bricks.sh
> 119c119
> < if ! systemctl is-active --quiet glusterd.service > /dev/null 2>&1
> ---
> > if ! pidof glusterd > /dev/null 2>&1

Just an intermediate opinion, as I'm not sure whether the suggested command is portable everywhere. How about using 'ps -aux | grep -w glusterd' or something ps-oriented?

Thanks!

> Worked as expected after this change:
> [2018-08-03 12:55:24] WARNING: Timeout Expired, bricks of volumes:"vol11 (3/3), vol12 (3/3), vol13 (3/3), vol14 (3/3), vol15 (3/3), vol16 (3/3), vol17 (3/3), vol18 (3/3), vol19 (3/3), vol20 (3/3)" are yet to come online

(In reply to Prasanna Kumar Kalever from comment #28)
> How about using 'ps -aux | grep -w glusterd' or something ps-oriented?

It matches quite a few processes including its own:

sh-4.2# ps -aux | grep -w glusterd
root       609  0.0  0.0 983172 21668 ?  Ssl  06:37  0:03 /usr/sbin/glusterfsd -s 10.70.47.165 --volfile-id heketidbstorage.10.70.47.165.var-lib-heketi-mounts-vg_7ce5bebe83af0d394eb711c2249ba339-brick_de0182db142b42a6eca1f7c18a328d27-brick -p /var/run/gluster/vols/heketidbstorage/10.70.47.165-var-lib-heketi-mounts-vg_7ce5bebe83af0d394eb711c2249ba339-brick_de0182db142b42a6eca1f7c18a328d27-brick.pid -S /var/run/gluster/11063b292f8e975e04045d4a4db135b0.socket --brick-name /var/lib/heketi/mounts/vg_7ce5bebe83af0d394eb711c2249ba339/brick_de0182db142b42a6eca1f7c18a328d27/brick -l /var/log/glusterfs/bricks/var-lib-heketi-mounts-vg_7ce5bebe83af0d394eb711c2249ba339-brick_de0182db142b42a6eca1f7c18a328d27-brick.log --xlator-option *-posix.glusterd-uuid=f62a850a-9b26-4f5f-9870-7a9d25f2a04d --brick-port 49152 --xlator-option heketidbstorage-server.listen-port=49152
root      1516  0.0  0.0   9088   664 pts/2  S+   13:09  0:00 grep -w glusterd
root     11151  0.2  0.0 504116 20284 ?  Ssl  10:43  0:25 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO

(I don't think these comments need to be private)

Based on comment 27, marking this as FailedQA.

(In reply to Pranith Kumar K from comment #29)
> It matches quite a few processes including its own.
> [...]

We can do one thing: check for the presence of the pidof and systemctl CLIs before using them to figure out whether glusterd is running. Either one of these commands must exist for the script to be useful; otherwise, bail out saying the script needs one of the two. Thoughts?

*** Bug 1613073 has been marked as a duplicate of this bug. ***

I verified this bug on the rpms below:

sh-4.2# rpm -qa | grep gluster
glusterfs-client-xlators-3.12.2-18.el7rhgs.x86_64
glusterfs-cli-3.12.2-18.el7rhgs.x86_64
glusterfs-fuse-3.12.2-18.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-18.el7rhgs.x86_64
gluster-block-0.2.1-25.el7rhgs.x86_64
glusterfs-libs-3.12.2-18.el7rhgs.x86_64
glusterfs-3.12.2-18.el7rhgs.x86_64
glusterfs-api-3.12.2-18.el7rhgs.x86_64
python2-gluster-3.12.2-18.el7rhgs.x86_64
glusterfs-server-3.12.2-18.el7rhgs.x86_64
sh-4.2# rpm -qa | grep tcmu-runner
tcmu-runner-1.2.0-24.el7rhgs.x86_64

When bricks were down on the same node where I stopped tcmu-runner, gluster-blockd took exactly 2 minutes to come up:

sh-4.2# systemctl stop tcmu-runner; systemctl is-active tcmu-runner; systemctl is-active gluster-blockd; echo `date`; systemctl start gluster-blockd; echo `date`
inactive
inactive
Tue Aug 28 19:16:57 UTC 2018
Tue Aug 28 19:18:59 UTC 2018

When bricks were down on another node, it again took about 2 minutes (hence it is monitoring all the brick processes):

sh-4.2# systemctl stop tcmu-runner; systemctl is-active tcmu-runner; systemctl is-active gluster-blockd; echo `date`; systemctl start gluster-blockd; echo `date`
inactive
inactive
Tue Aug 28 19:28:03 UTC 2018
Tue Aug 28 19:30:06 UTC 2018

When bricks were already up on all nodes, it took just 1 second to come up:

sh-4.2# systemctl stop tcmu-runner; systemctl is-active tcmu-runner; systemctl is-active gluster-blockd; echo `date`; systemctl start gluster-blockd; echo `date`
inactive
inactive
Tue Aug 28 19:22:56 UTC 2018
Tue Aug 28 19:22:57 UTC 2018

Hence marking this as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2691