Bug 1302968 - Tiering status and rebalance status stops getting updated [NEEDINFO]
Tiering status and rebalance status stops getting updated
Status: ON_QA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: glusterd (Show other bugs)
3.1
Unspecified Unspecified
unspecified Severity unspecified
: ---
: ---
Assigned To: Mohammed Rafi KC
Sweta Anandpara
: ZStream
Depends On: 1303125
Blocks: 1268895 1286100 1303028 1311041
  Show dependency treegraph
 
Reported: 2016-01-29 02:45 EST by nchilaka
Modified: 2018-05-24 17:32 EDT (History)
10 users (show)

See Also:
Fixed In Version: glusterfs-3.7.9-1
Doc Type: Known Issue
Doc Text:
The defrag variable is not being reinitialized during glusterd restart. This means that if glusterd goes down or needs to be restarted while the following processes are running, it does not reconnect to these processes after restarting: - rebalance - tier - remove-brick This results in these processes continuing to run without communicating with glusterd. Therefore, any operation that requires communication between these processes and glusterd fails. To work around this issue, stop or kill the rebalance, tier, or remove-brick process before restarting glusterd. This ensures that a new process is spawned when glusterd restarts.
Story Points: ---
Clone Of:
: 1303028 (view as bug list)
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
amukherj: needinfo? (rhinduja)


Attachments (Terms of Use)

  None (edit)
Description nchilaka 2016-01-29 02:45:40 EST
On my 16 node setup after about a day, 3 nodes in the rebalance status shows the lapsed time reset to "ZERO" and again after about 4-5 hours, all the nodes stopped ticking except only one node continued which is continually ticking.
Hence the promote/demote and scanned files stats have stopped getting updated


[root@dhcp37-202 ~]# gluster v rebal nagvol status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                2        0Bytes         35287             0             0          in progress           29986.00
                            10.70.37.195                0        0Bytes         35281             0             0          in progress           29986.00
                            10.70.35.155                0        0Bytes         35003             0             0          in progress           29986.00
                            10.70.35.222                0        0Bytes         35002             0             0          in progress           29986.00
                            10.70.35.108                0        0Bytes             0             0             0          in progress           29985.00
                             10.70.35.44                0        0Bytes             0             0             0          in progress           29986.00
                             10.70.35.89                0        0Bytes             0             0             0          in progress          146477.00
                            10.70.35.231                0        0Bytes             0             0             0          in progress           29986.00
                            10.70.35.176                0        0Bytes         35487             0             0          in progress           29986.00
                            10.70.35.232                0        0Bytes             0             0             0          in progress               0.00
                            10.70.35.173                0        0Bytes             0             0             0          in progress               0.00
                            10.70.35.163                0        0Bytes         35314             0             0          in progress           29986.00
                            10.70.37.101                0        0Bytes             0             0             0          in progress               0.00
                             10.70.37.69                0        0Bytes         35385             0             0          in progress           29986.00
                             10.70.37.60                0        0Bytes         35255             0             0          in progress           29986.00
                            10.70.37.120                0        0Bytes         35250             0             0          in progress           29986.00
volume rebalance: nagvol: success
[root@dhcp37-202 ~]# 
[root@dhcp37-202 ~]# 
[root@dhcp37-202 ~]# gluster v rebal nagvol status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                2        0Bytes         35287             0             0          in progress           29986.00
                            10.70.37.195                0        0Bytes         35281             0             0          in progress           29986.00
                            10.70.35.155                0        0Bytes         35003             0             0          in progress           29986.00
                            10.70.35.222                0        0Bytes         35002             0             0          in progress           29986.00
                            10.70.35.108                0        0Bytes             0             0             0          in progress           29985.00
                             10.70.35.44                0        0Bytes             0             0             0          in progress           29986.00
                             10.70.35.89                0        0Bytes             0             0             0          in progress          146488.00
                            10.70.35.231                0        0Bytes             0             0             0          in progress           29986.00
                            10.70.35.176                0        0Bytes         35487             0             0          in progress           29986.00
                            10.70.35.232                0        0Bytes             0             0             0          in progress               0.00
                            10.70.35.173                0        0Bytes             0             0             0          in progress               0.00
                            10.70.35.163                0        0Bytes         35314             0             0          in progress           29986.00
                            10.70.37.101                0        0Bytes             0             0             0          in progress               0.00
                             10.70.37.69                0        0Bytes         35385             0             0          in progress           29986.00
                             10.70.37.60                0        0Bytes         35255             0             0          in progress           29986.00
                            10.70.37.120                0        0Bytes         35250             0             0          in progress           29986.00





Also, the tier status shows as belo:
[root@dhcp37-202 ~]# gluster v  tier nagvol status
Node                 Promoted files       Demoted files        Status              
---------            ---------            ---------            ---------           
localhost            0                    0                    in progress         
10.70.37.195         0                    0                    in progress         
10.70.35.155         0                    0                    in progress         
10.70.35.222         0                    0                    in progress         
10.70.35.108         0                    0                    in progress         
10.70.35.44          0                    0                    in progress         
10.70.35.89          0                    0                    in progress         
10.70.35.231         0                    0                    in progress         
10.70.35.176         0                    0                    in progress         
10.70.35.232         0                    0                    in progress         
10.70.35.173         0                    0                    in progress         
10.70.35.163         0                    0                    in progress         
10.70.37.101         0                    0                    in progress         
10.70.37.69          0                    0                    in progress         
10.70.37.60          0                    0                    in progress         
10.70.37.120         0                    0                    in progress         
Tiering Migration Functionality: nagvol: success






-> I was running some IOs but not very heavy
-> Also, there was an nfs problem reported wrt music files, stopped palying with permission denied
-> I saw files promotes happening 
-> Also, the glusterd was restarted only on one of the nodes, in the last 2 days




glusterfs-client-xlators-3.7.5-17.el7rhgs.x86_64
glusterfs-server-3.7.5-17.el7rhgs.x86_64
gluster-nagios-addons-0.2.5-1.el7rhgs.x86_64
vdsm-gluster-4.16.30-1.3.el7rhgs.noarch
glusterfs-3.7.5-17.el7rhgs.x86_64
glusterfs-api-3.7.5-17.el7rhgs.x86_64
glusterfs-cli-3.7.5-17.el7rhgs.x86_64
glusterfs-geo-replication-3.7.5-17.el7rhgs.x86_64
glusterfs-debuginfo-3.7.5-17.el7rhgs.x86_64
gluster-nagios-common-0.2.3-1.el7rhgs.noarch
python-gluster-3.7.5-16.el7rhgs.noarch
glusterfs-libs-3.7.5-17.el7rhgs.x86_64
glusterfs-fuse-3.7.5-17.el7rhgs.x86_64
glusterfs-rdma-3.7.5-17.el7rhgs.x86_64




sosreports will be attached
Comment 2 Mohammed Rafi KC 2016-01-31 09:45:37 EST
RCA

after glusterd restart, connection between rebalance and glusterd was not re-established. It is a day 1 issue, and it is also true for rebalance/remove-brick process.

For tiering it will impact more severe, because if tier pause called after glusterd restart , glusterd won't be able to talk with rebalance process, and tier pause will mark as successful.
Comment 3 Mohammed Rafi KC 2016-01-31 09:46:05 EST
upstream patch : http://review.gluster.org/#/c/13319/
Comment 7 Mohammed Rafi KC 2016-02-05 02:41:03 EST
Can you please verify the Doc text .
Comment 11 Mohammed Rafi KC 2016-02-08 00:28:18 EST
Looks good to me.
Comment 12 Mike McCune 2016-03-28 19:28:22 EDT
This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune@redhat.com with any questions
Comment 16 Atin Mukherjee 2016-08-16 11:00:07 EDT
As per comment 14, moving it to ON_QA

Note You need to log in before you can comment on or make changes to this bug.