| Summary: | Rebalance starts on a volume even if one of the participating node's glusterd is down | |||
|---|---|---|---|---|
| Product: | Red Hat Gluster Storage | Reporter: | Shubhendu Tripathi <shtripat> | |
| Component: | distribute | Assignee: | Nithya Balachandran <nbalacha> | |
| Status: | CLOSED DEFERRED | QA Contact: | shylesh <shmohan> | |
| Severity: | high | Docs Contact: | ||
| Priority: | high | |||
| Version: | 2.1 | CC: | amukherj, nsathyan, sdharane, spalai, vagarwal, vbellur | |
| Target Milestone: | --- | |||
| Target Release: | --- | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | Bug Fix | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1286157 1286159 (view as bug list) | Environment: | ||
| Last Closed: | 2015-11-27 12:10:25 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Bug Depends On: | ||||
| Bug Blocks: | 1035460, 1286157, 1286159 | |||
|
Description
Shubhendu Tripathi
2013-11-29 05:31:12 UTC
Per the bug triage discussion with Shanks and Dusmant, removing this from Corbett.

The issue here is the order of daemonizing and graph initialization in the glusterfsd process. Fetching the volfiles and initializing the graph is done after the process has daemonized. When glusterd starts the rebalance process, it returns after the process has daemonized, assuming (correctly) that the process started, and reports that starting rebalance was successful. But in this particular case, graph initialization of the rebalance process fails because it cannot connect to the brick on the downed peer (this failure occurs even if only glusterd is down, since the client xlator cannot obtain the brick port). Since rebalance requires all DHT subvolumes to be online, the process kills itself. This leads to the rebalance status showing as failed almost immediately. This is similar to rebalance ending up in failed status when a peer goes down during rebalance.

This could be fixed in two ways:

1. Make sure that the rebalance process has correctly initialized its graph and is connected to all the bricks before returning success. This is quite hard to do, and probably requires new tooling to support this approach.
2. Check that all the participating peers and bricks are online during the staging of rebalance start. This is comparatively easier to do, as we already have a mechanism to check volume quorum thanks to volume snapshots. But this is also a big change involving significant code work, as the volume quorum framework as it stands is somewhat tied to the snapshot feature.

Removing devel_ack and the Denali label, as solving this issue for Denali is not easy or straightforward. Cloning this to 3.1, to be fixed in a future release.