| Summary: | Tiering status and rebalance status stop getting updated | |||
|---|---|---|---|---|
| Product: | Red Hat Gluster Storage | Reporter: | Nag Pavan Chilakam <nchilaka> | |
| Component: | glusterd | Assignee: | Mohammed Rafi KC <rkavunga> | |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Sweta Anandpara <sanandpa> | |
| Severity: | high | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | rhgs-3.1 | CC: | amukherj, jbyers, nbalacha, rcyriac, rhinduja, rhs-bugs, rkavunga, sankarshan, sheggodu, storage-qa-internal, vbellur | |
| Target Milestone: | --- | Keywords: | ZStream | |
| Target Release: | --- | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | glusterfs-3.7.9-1 | Doc Type: | Bug Fix | |
| Doc Text: |
The defrag variable is not being reinitialized during glusterd restart. This means that if glusterd goes down or needs to be restarted while the following processes are running, it does not reconnect to these processes after restarting:
- rebalance
- tier
- remove-brick
This results in these processes continuing to run without communicating with glusterd. Therefore, any operation that requires communication between these processes and glusterd fails.
To work around this issue, stop or kill the rebalance, tier, or remove-brick process before restarting glusterd. This ensures that a new process is spawned when glusterd restarts.
|
Story Points: | --- | |
| Clone Of: | ||||
| : | 1303028 (view as bug list) | Environment: | ||
| Last Closed: | 2018-09-12 03:39:58 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Bug Depends On: | 1303125 | |||
| Bug Blocks: | 1268895, 1286100, 1303028, 1311041 | |||
RCA: After a glusterd restart, the connection between the rebalance process and glusterd was not re-established. This is a day-1 issue, and it also applies to the rebalance and remove-brick processes. The impact is most severe for tiering: if a tier pause is issued after a glusterd restart, glusterd cannot talk to the rebalance process, yet the tier pause is marked as successful.

upstream patch: http://review.gluster.org/#/c/13319/

Can you please verify the Doc Text?

Looks good to me.

This bug was accidentally moved from POST to MODIFIED via an error in automation; please see mmccune with any questions.

As per comment 14, moving it to ON_QA.
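The workaround documented in the Doc Text (stop or kill the rebalance/tier/remove-brick process before restarting glusterd, so a fresh process with a live connection is spawned on restart) can be sketched as below. This is a minimal illustration, not the fix itself: the volume name `nagvol` is taken from this report, and the `run` dry-run wrapper only prints the commands; drop the `echo` to execute them for real.

```shell
#!/bin/sh
# Dry-run sketch of the documented workaround ordering.
# Assumption: volume name from this report; adjust for your setup.
VOL=nagvol

# Illustrative wrapper: prints each command instead of executing it.
run() { echo "+ $*"; }

run gluster volume rebalance "$VOL" stop   # stop (or kill) the rebalance process first
run systemctl restart glusterd             # now restart glusterd
run gluster volume rebalance "$VOL" start  # a new process spawns and reconnects
```

The point of the ordering is that a rebalance process started *after* the glusterd restart gets a working RPC connection, whereas one left running across the restart does not.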
On my 16-node setup, after about a day, the rebalance status showed the elapsed time on 3 nodes reset to zero. After another 4-5 hours, the run time stopped ticking on all nodes except one, which is still continually ticking. As a result, the promote/demote and scanned-files stats have stopped getting updated.

```
[root@dhcp37-202 ~]# gluster v rebal nagvol status
Node           Rebalanced-files   size     scanned   failures   skipped   status        run time in secs
---------      ----------------   ------   -------   --------   -------   -----------   ----------------
localhost      2                  0Bytes   35287     0          0         in progress   29986.00
10.70.37.195   0                  0Bytes   35281     0          0         in progress   29986.00
10.70.35.155   0                  0Bytes   35003     0          0         in progress   29986.00
10.70.35.222   0                  0Bytes   35002     0          0         in progress   29986.00
10.70.35.108   0                  0Bytes   0         0          0         in progress   29985.00
10.70.35.44    0                  0Bytes   0         0          0         in progress   29986.00
10.70.35.89    0                  0Bytes   0         0          0         in progress   146477.00
10.70.35.231   0                  0Bytes   0         0          0         in progress   29986.00
10.70.35.176   0                  0Bytes   35487     0          0         in progress   29986.00
10.70.35.232   0                  0Bytes   0         0          0         in progress   0.00
10.70.35.173   0                  0Bytes   0         0          0         in progress   0.00
10.70.35.163   0                  0Bytes   35314     0          0         in progress   29986.00
10.70.37.101   0                  0Bytes   0         0          0         in progress   0.00
10.70.37.69    0                  0Bytes   35385     0          0         in progress   29986.00
10.70.37.60    0                  0Bytes   35255     0          0         in progress   29986.00
10.70.37.120   0                  0Bytes   35250     0          0         in progress   29986.00
volume rebalance: nagvol: success
```

Running the same command again later shows the same frozen counters; only 10.70.35.89 keeps ticking:

```
[root@dhcp37-202 ~]# gluster v rebal nagvol status
Node           Rebalanced-files   size     scanned   failures   skipped   status        run time in secs
---------      ----------------   ------   -------   --------   -------   -----------   ----------------
localhost      2                  0Bytes   35287     0          0         in progress   29986.00
10.70.37.195   0                  0Bytes   35281     0          0         in progress   29986.00
10.70.35.155   0                  0Bytes   35003     0          0         in progress   29986.00
10.70.35.222   0                  0Bytes   35002     0          0         in progress   29986.00
10.70.35.108   0                  0Bytes   0         0          0         in progress   29985.00
10.70.35.44    0                  0Bytes   0         0          0         in progress   29986.00
10.70.35.89    0                  0Bytes   0         0          0         in progress   146488.00
10.70.35.231   0                  0Bytes   0         0          0         in progress   29986.00
10.70.35.176   0                  0Bytes   35487     0          0         in progress   29986.00
10.70.35.232   0                  0Bytes   0         0          0         in progress   0.00
10.70.35.173   0                  0Bytes   0         0          0         in progress   0.00
10.70.35.163   0                  0Bytes   35314     0          0         in progress   29986.00
10.70.37.101   0                  0Bytes   0         0          0         in progress   0.00
10.70.37.69    0                  0Bytes   35385     0          0         in progress   29986.00
10.70.37.60    0                  0Bytes   35255     0          0         in progress   29986.00
10.70.37.120   0                  0Bytes   35250     0          0         in progress   29986.00
```

Also, the tier status shows as below:

```
[root@dhcp37-202 ~]# gluster v tier nagvol status
Node           Promoted files   Demoted files   Status
---------      --------------   -------------   -----------
localhost      0                0               in progress
10.70.37.195   0                0               in progress
10.70.35.155   0                0               in progress
10.70.35.222   0                0               in progress
10.70.35.108   0                0               in progress
10.70.35.44    0                0               in progress
10.70.35.89    0                0               in progress
10.70.35.231   0                0               in progress
10.70.35.176   0                0               in progress
10.70.35.232   0                0               in progress
10.70.35.173   0                0               in progress
10.70.35.163   0                0               in progress
10.70.37.101   0                0               in progress
10.70.37.69    0                0               in progress
10.70.37.60    0                0               in progress
10.70.37.120   0                0               in progress
Tiering Migration Functionality: nagvol: success
```

-> I was running some I/O, but not very heavy.
-> An NFS problem was also reported: music files stopped playing with "permission denied".
-> I saw file promotes happening.
-> glusterd was restarted on only one of the nodes in the last 2 days.

Installed packages:

```
glusterfs-client-xlators-3.7.5-17.el7rhgs.x86_64
glusterfs-server-3.7.5-17.el7rhgs.x86_64
gluster-nagios-addons-0.2.5-1.el7rhgs.x86_64
vdsm-gluster-4.16.30-1.3.el7rhgs.noarch
glusterfs-3.7.5-17.el7rhgs.x86_64
glusterfs-api-3.7.5-17.el7rhgs.x86_64
glusterfs-cli-3.7.5-17.el7rhgs.x86_64
glusterfs-geo-replication-3.7.5-17.el7rhgs.x86_64
glusterfs-debuginfo-3.7.5-17.el7rhgs.x86_64
gluster-nagios-common-0.2.3-1.el7rhgs.noarch
python-gluster-3.7.5-16.el7rhgs.noarch
glusterfs-libs-3.7.5-17.el7rhgs.x86_64
glusterfs-fuse-3.7.5-17.el7rhgs.x86_64
glusterfs-rdma-3.7.5-17.el7rhgs.x86_64
```

sosreports will be attached