Bug 1276273

Summary: [Tier]: starting the tier daemon using "rebal tier start" does not start tierd if it has failed on any single node
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Rahul Hinduja <rhinduja>
Component: tier
Assignee: hari gowtham <hgowtham>
Status: CLOSED ERRATA
QA Contact: Rahul Hinduja <rhinduja>
Severity: high
Docs Contact:
Priority: high
Version: rhgs-3.1
CC: asrivast, dlambrig, rhinduja, rhs-bugs, rkavunga, sankarshan, storage-qa-internal
Target Milestone: ---
Keywords: ZStream
Target Release: RHGS 3.1.2
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.7.5-13
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 1292112 1293698 (view as bug list)
Environment:
Last Closed: 2016-03-01 05:48:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1260783, 1292112, 1293698

Description Rahul Hinduja 2015-10-29 10:20:34 UTC
Description of problem:
=======================

Because of the issue mentioned in bz 1276245, the tierd daemon is down on one of the nodes. "gluster volume status" on this node still shows the Tier task as in progress.

One known way to start the tier is "gluster volume rebal tiervolume tier start", but it failed to start tierd. The tierd on the local host could be started by using "volume start force".

Initial Status:
===============

[root@dhcp37-165 glusterfs]# gluster volume status 
Status of volume: tiervolume
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------

(brick status rows truncated)
Task Status of Volume tiervolume
------------------------------------------------------------------------------
Task                 : Tier migration      
ID                   : d4aec4e9-c4b9-4ef3-926b-af2b29c22096
Status               : in progress         
 
[root@dhcp37-165 glusterfs]# ps -eaf | grep glusterfs | grep tier
root      6806     1  9 13:47 ?        00:09:25 /usr/sbin/glusterfsd -s 10.70.37.165 --volfile-id tiervolume.10.70.37.165.rhs-brick3-tiervolume_hot -p /var/lib/glusterd/vols/tiervolume/run/10.70.37.165-rhs-brick3-tiervolume_hot.pid -S /var/run/gluster/4f46770e383fab1ee7789ff7a656a342.socket --brick-name /rhs/brick3/tiervolume_hot -l /var/log/glusterfs/bricks/rhs-brick3-tiervolume_hot.log --xlator-option *-posix.glusterd-uuid=506b81d5-08d6-421a-9aff-94e57d3740bb --brick-port 49153 --xlator-option tiervolume-server.listen-port=49153
root      6824     1 21 13:47 ?        00:22:12 /usr/sbin/glusterfsd -s 10.70.37.165 --volfile-id tiervolume.10.70.37.165.rhs-brick1-tiervolume_ct-disp1 -p /var/lib/glusterd/vols/tiervolume/run/10.70.37.165-rhs-brick1-tiervolume_ct-disp1.pid -S /var/run/gluster/4cdd38c5ea86fe823baaa5dcde1b4b57.socket --brick-name /rhs/brick1/tiervolume_ct-disp1 -l /var/log/glusterfs/bricks/rhs-brick1-tiervolume_ct-disp1.log --xlator-option *-posix.glusterd-uuid=506b81d5-08d6-421a-9aff-94e57d3740bb --brick-port 49152 --xlator-option tiervolume-server.listen-port=49152
[root@dhcp37-165 glusterfs]# 

^^^^^ Note: No tierd glusterfs process is running

[root@dhcp37-165 glusterfs]# gluster volume tier tiervolume status
Node                 Promoted files       Demoted files        Status              
---------            ---------            ---------            ---------           
localhost            562                  0                    failed              
10.70.37.133         0                    18824                in progress         
10.70.37.160         0                    0                    in progress         
10.70.37.158         0                    19867                in progress         
10.70.37.110         0                    0                    in progress         
10.70.37.155         0                    22756                in progress         
10.70.37.99          41                   0                    in progress         
10.70.37.88          0                    23585                in progress         
10.70.37.112         0                    0                    in progress         
10.70.37.199         0                    20903                in progress         
10.70.37.162         0                    0                    in progress         
10.70.37.87          0                    21816                in progress         
volume rebalance: tiervolume: success: 
[root@dhcp37-165 glusterfs]# 

[root@dhcp37-165 glusterfs]# gluster volume rebal tiervolume tier start
volume rebalance: tiervolume: success: Rebalance on tiervolume has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: d666a0ae-f03b-4862-8f54-9b2545cfcdc3

[root@dhcp37-165 glusterfs]# gluster volume rebal tiervolume tier status
Node                 Promoted files       Demoted files        Status              
---------            ---------            ---------            ---------           
localhost            562                  0                    failed              
10.70.37.133         0                    18824                in progress         
10.70.37.160         0                    0                    in progress         
10.70.37.158         0                    19867                in progress         
10.70.37.110         0                    0                    in progress         
10.70.37.155         0                    22756                in progress         
10.70.37.99          41                   0                    in progress         
10.70.37.88          0                    23585                in progress         
10.70.37.112         0                    0                    in progress         
10.70.37.199         0                    20903                in progress         
10.70.37.162         0                    0                    in progress         
10.70.37.87          0                    21816                in progress         
volume rebalance: tiervolume: success: 
[root@dhcp37-165 glusterfs]#

"rebal tier start" reports that the rebalance was started successfully, but the status still shows tierd as failed on localhost.

[root@dhcp37-165 glusterfs]# gluster volume start tiervolume force
volume start: tiervolume: success
[root@dhcp37-165 glusterfs]# 
[root@dhcp37-165 glusterfs]# gluster volume rebal tiervolume tier status
Node                 Promoted files       Demoted files        Status              
---------            ---------            ---------            ---------           
localhost            562                  0                    in progress         
10.70.37.133         0                    18824                in progress         
10.70.37.160         0                    0                    in progress         
10.70.37.158         0                    19867                in progress         
10.70.37.110         0                    0                    in progress         
10.70.37.155         0                    22756                in progress         
10.70.37.99          41                   0                    in progress         
10.70.37.88          0                    23585                in progress         
10.70.37.112         0                    0                    in progress         
10.70.37.199         0                    20903                in progress         
10.70.37.162         0                    0                    in progress         
10.70.37.87          0                    21816                in progress         
volume rebalance: tiervolume: success: 
[root@dhcp37-165 glusterfs]# 



Version-Release number of selected component (if applicable):
=============================================================

glusterfs-3.7.5-0.3.el7rhgs.x86_64

Comment 5 Mohammed Rafi KC 2015-11-25 07:10:24 UTC
"gluster volume rebalance <volname> tier start" will not start the failed process if it was already started; it should instead throw an error saying "Tier process is already running". Apparently that is not happening because of bug 1285170.

But we still need a way to start tierd forcefully, overriding that check. An RFC has been filed for this as bug 1284751.

The workaround for this bug is to start the volume forcefully ("gluster volume start <volname> force"), which starts the tier daemon if it has failed or is not running. It will not restart any process that is already running.
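As a minimal sketch of the workaround described above, assuming the "gluster volume tier <volname> status" output format shown earlier in this report, a script could detect nodes where tierd is marked failed and then apply the force-start. The canned STATUS_OUTPUT and the node list below are illustrative, not taken from a live cluster:

```shell
#!/bin/sh
# Sketch: parse `gluster volume tier <vol> status`-style output and, if any
# node reports "failed", apply the workaround `gluster volume start <vol> force`.
# STATUS_OUTPUT is canned sample output for illustration; on a real node it
# would be captured with: STATUS_OUTPUT=$(gluster volume tier tiervolume status)
STATUS_OUTPUT='localhost            562                  0                    failed
10.70.37.133         0                    18824                in progress'

# The last whitespace-separated field is the status keyword: "in progress"
# splits so its last field is "progress", hence a node has failed exactly
# when the last field equals "failed".
FAILED_NODES=$(printf '%s\n' "$STATUS_OUTPUT" | awk '$NF == "failed" { print $1 }')

if [ -n "$FAILED_NODES" ]; then
  echo "tierd failed on: $FAILED_NODES"
  # Workaround from this bug: force-start the volume to respawn tierd.
  # (Commented out so the sketch runs without a gluster installation.)
  # gluster volume start tiervolume force
fi
```

This only flags the failure and points at the fix; the actual force-start line is left commented since running it requires a gluster installation.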

Comment 10 Mohammed Rafi KC 2015-12-03 13:38:08 UTC
*** Bug 1284751 has been marked as a duplicate of this bug. ***

Comment 14 hari gowtham 2015-12-28 06:46:08 UTC
The upstream patch is : http://review.gluster.org/#/c/12983/

The downstream patch is : https://code.engineering.redhat.com/gerrit/#/c/64383/

Comment 17 Rahul Hinduja 2016-01-06 12:16:14 UTC
Verified with the build: glusterfs-3.7.5-14.el7rhgs.x86_64

Killed a few tierd glusterfs processes, which marked tierd as failed on those nodes.
"tier <volume> tier start force" started tierd only on the nodes that were marked failed, without restarting the tierd glusterfs processes on the remaining nodes. Moving this bug to verified state.

Comment 20 errata-xmlrpc 2016-03-01 05:48:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0193.html