Description of problem:
=======================
Because of the issue mentioned in bz 1276245, the tierd daemon is down on one of the nodes. "gluster volume status" on this node still shows the tier task as in progress. One known way to start the tier is "gluster volume rebal tiervolume tier start", but it failed to start the daemon. The tierd could be started on the local host by using "volume start force".

Initial status:
===============
[root@dhcp37-165 glusterfs]# gluster volume status
Status of volume: tiervolume
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
:
:
:
Task Status of Volume tiervolume
------------------------------------------------------------------------------
Task                 : Tier migration
ID                   : d4aec4e9-c4b9-4ef3-926b-af2b29c22096
Status               : in progress

[root@dhcp37-165 glusterfs]# ps -eaf | grep glusterfs | grep tier
root      6806     1  9 13:47 ?        00:09:25 /usr/sbin/glusterfsd -s 10.70.37.165 --volfile-id tiervolume.10.70.37.165.rhs-brick3-tiervolume_hot -p /var/lib/glusterd/vols/tiervolume/run/10.70.37.165-rhs-brick3-tiervolume_hot.pid -S /var/run/gluster/4f46770e383fab1ee7789ff7a656a342.socket --brick-name /rhs/brick3/tiervolume_hot -l /var/log/glusterfs/bricks/rhs-brick3-tiervolume_hot.log --xlator-option *-posix.glusterd-uuid=506b81d5-08d6-421a-9aff-94e57d3740bb --brick-port 49153 --xlator-option tiervolume-server.listen-port=49153
root      6824     1 21 13:47 ?        00:22:12 /usr/sbin/glusterfsd -s 10.70.37.165 --volfile-id tiervolume.10.70.37.165.rhs-brick1-tiervolume_ct-disp1 -p /var/lib/glusterd/vols/tiervolume/run/10.70.37.165-rhs-brick1-tiervolume_ct-disp1.pid -S /var/run/gluster/4cdd38c5ea86fe823baaa5dcde1b4b57.socket --brick-name /rhs/brick1/tiervolume_ct-disp1 -l /var/log/glusterfs/bricks/rhs-brick1-tiervolume_ct-disp1.log --xlator-option *-posix.glusterd-uuid=506b81d5-08d6-421a-9aff-94e57d3740bb --brick-port 49152 --xlator-option tiervolume-server.listen-port=49152
[root@dhcp37-165 glusterfs]#

^^^^^ Note: no tierd glusterfs process is running.

[root@dhcp37-165 glusterfs]# gluster volume tier tiervolume status
Node                 Promoted files       Demoted files        Status
---------            ---------            ---------            ---------
localhost            562                  0                    failed
10.70.37.133         0                    18824                in progress
10.70.37.160         0                    0                    in progress
10.70.37.158         0                    19867                in progress
10.70.37.110         0                    0                    in progress
10.70.37.155         0                    22756                in progress
10.70.37.99          41                   0                    in progress
10.70.37.88          0                    23585                in progress
10.70.37.112         0                    0                    in progress
10.70.37.199         0                    20903                in progress
10.70.37.162         0                    0                    in progress
10.70.37.87          0                    21816                in progress
volume rebalance: tiervolume: success:

[root@dhcp37-165 glusterfs]# gluster volume rebal tiervolume tier start
volume rebalance: tiervolume: success: Rebalance on tiervolume has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: d666a0ae-f03b-4862-8f54-9b2545cfcdc3

[root@dhcp37-165 glusterfs]# gluster volume rebal tiervolume tier status
Node                 Promoted files       Demoted files        Status
---------            ---------            ---------            ---------
localhost            562                  0                    failed
10.70.37.133         0                    18824                in progress
10.70.37.160         0                    0                    in progress
10.70.37.158         0                    19867                in progress
10.70.37.110         0                    0                    in progress
10.70.37.155         0                    22756                in progress
10.70.37.99          41                   0                    in progress
10.70.37.88          0                    23585                in progress
10.70.37.112         0                    0                    in progress
10.70.37.199         0                    20903                in progress
10.70.37.162         0                    0                    in progress
10.70.37.87          0                    21816                in progress
volume rebalance: tiervolume: success:

"rebal ... tier start" reports that rebalance was started successfully, but the status still shows the tierd on localhost as failed.

[root@dhcp37-165 glusterfs]# gluster volume start tiervolume force
volume start: tiervolume: success

[root@dhcp37-165 glusterfs]# gluster volume rebal tiervolume tier status
Node                 Promoted files       Demoted files        Status
---------            ---------            ---------            ---------
localhost            562                  0                    in progress
10.70.37.133         0                    18824                in progress
10.70.37.160         0                    0                    in progress
10.70.37.158         0                    19867                in progress
10.70.37.110         0                    0                    in progress
10.70.37.155         0                    22756                in progress
10.70.37.99          41                   0                    in progress
10.70.37.88          0                    23585                in progress
10.70.37.112         0                    0                    in progress
10.70.37.199         0                    20903                in progress
10.70.37.162         0                    0                    in progress
10.70.37.87          0                    21816                in progress
volume rebalance: tiervolume: success:

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.7.5-0.3.el7rhgs.x86_64
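The failed node(s) in the tier status output above can be picked out mechanically. A minimal sketch, assuming the column layout shown in the transcripts; the `failed_nodes` helper and the captured sample are illustrative, not part of the gluster CLI:

```shell
# Print the Node column for every row whose Status is "failed".
# "failed" is a single awk field, while "in progress" splits into two,
# so matching the last field is enough for this layout.
failed_nodes() {
    awk '$NF == "failed" { print $1 }'
}

# Captured sample in the same layout as "gluster volume tier tiervolume status".
status_sample='localhost            562                  0                    failed
10.70.37.133         0                    18824                in progress
10.70.37.160         0                    0                    in progress'

printf '%s\n' "$status_sample" | failed_nodes
```

In practice the input would be piped straight from the status command instead of a captured sample.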
"gluster volume rebalance volname tier start" will not start the failed process if the tier has already been started; it should instead throw an error saying "Tier process is already running". Apparently that is not happening because of bug 1285170. We also need a way to start the tier forcefully, overriding that check; an RFC is filed for this as bug 1284751. The workaround for this bug is to start the volume forcefully, which starts the tier daemon if it has failed or is not running. It won't restart any process that is already running.
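The workaround above can be sketched as a small shell check. This is illustrative only: the `tierd_check` function name is made up, and the pgrep pattern assumes the tier daemon runs as a glusterfs process with a rebalance-style volfile-id, which may differ across builds:

```shell
#!/bin/sh
# Sketch of the workaround: if no tierd process is found for the volume,
# suggest "gluster volume start <vol> force", which restarts only daemons
# that are down and leaves already-running processes untouched.
tierd_check() {
    vol=$1
    # Assumed process pattern for the tier daemon; verify on your build.
    if pgrep -f "glusterfs .*--volfile-id rebalance/${vol}" >/dev/null 2>&1; then
        echo "tierd for ${vol} is running; nothing to do"
    else
        echo "tierd for ${vol} is down; run: gluster volume start ${vol} force"
    fi
}

tierd_check tiervolume
```

The sketch only reports; the actual restart is the "volume start force" command described above, and it still needs the bug-1284751 RFC for a true per-daemon force start.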
*** Bug 1284751 has been marked as a duplicate of this bug. ***
The upstream patch is: http://review.gluster.org/#/c/12983/
The downstream patch is: https://code.engineering.redhat.com/gerrit/#/c/64383/
Verified with the build: glusterfs-3.7.5-14.el7rhgs.x86_64

Killed a few tierd glusterfs processes, which marked tierd as failed on those nodes. "tier <volume> tier start force" started tierd only on the nodes where it was marked failed, without restarting the tierd glusterfs processes on the remaining nodes. Moving this bug to verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0193.html