Bug 1132839
| Summary: | Stopping or restarting glusterd on another node when volume start is in progress gives error messages but volume is started | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | senaik |
| Component: | glusterd | Assignee: | Atin Mukherjee <amukherj> |
| Status: | CLOSED NOTABUG | QA Contact: | storage-qa-internal <storage-qa-internal> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | rhgs-3.0 | CC: | amukherj, kparthas, nlevinki, sasundar, senaik, vbellur |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1146902 (view as bug list) | Environment: | |
| Last Closed: | 2015-04-13 04:43:12 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1146902 | | |
Description
senaik
2014-08-22 07:38:46 UTC
RCA for Scenario 1
==================

In the glusterd commit-op phase, if any of the commit ops fails, op_ret is set to non-zero and a negative response is sent back to the CLI. However, in this case the local commit was successful, which is why the status changed to "Started". Will investigate further to determine the feasibility of returning a positive response to the CLI if the local commit succeeds even when one of the remote commit ops fails.

RCA for Scenario 2 will be shared soon.

Seema,
Could you please point out on which node (IP) the command was executed for scenario 2 and which node (IP) was restarted?
~Atin

The command was executed on 10.70.40.169 and glusterd was restarted on 10.70.40.170 initially. But I have also retried the glusterd restart on other nodes while trying to reproduce the issue.

For scenario 2, it seems the logs for snapshot14 are missing; when I untar snapshot14_sosreport-qaredhat.com-20140822130530-91fd.tar.xz I can see that snapshot13, 15 & 16 are present.

(In reply to Atin Mukherjee from comment #3)
> RCA for Scenario 1
> ==================
>
> In glusterd commit op phase, if any of the commit op fails the op_ret is set
> to non-zero and a negative response is sent back CLI. However in this case,
> the local commit was successful and i.e. why the status was changed to
> "started". Will investigate further to determine the feasibility of
> returning positive response to cli if the local commit succeeds even if any
> of the remote commit op fails.
>
> RCA for Scenario 2 will be shared soon.

Seema,

For scenario 1, this is expected as per the design. Currently we do not have any rollback mechanism for a failed transaction. In this case, after the local commit succeeded, one of the remote commits failed because the glusterd instance on that remote node was brought down. This resulted in the volume status moving to Started on some nodes but not on all of them. Even performing the local commit after completing all the remote commits could end up in a similar situation, as the local commit might fail. We would not be able to fix this problem given the design limitation.

For scenario 2, I still don't see the snapshot14 logs in http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/snapshots/1132839/ Could you double-check?

I checked all the logs now and there is no evidence of the volume start getting timed out. Do you remember the timestamp of this issue? I know it's quite difficult to recollect the information as this BZ is quite old, but without that it's pretty difficult to figure out, as I don't see anything abnormal in the logs.

(In reply to Atin Mukherjee from comment #9)
> I checked all the logs now and there is no evidence of volume start getting
> timed out. Do you remember the timestamp of this issue? I know its quite
> difficult to recollect the information as the BZ is quite old, but with out
> that its pretty difficult to figure out as I don't see any abnormal things
> in the log.

Closing this bug as I've not got any response from the reporter. Kindly re-open if the issue persists.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days
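To make the design limitation described in the scenario 1 RCA concrete, below is a minimal, self-contained C sketch. It is not glusterd source code: the struct, function names, and all peer IPs except 10.70.40.170 are made up for illustration. It models a commit-op phase that applies the commit locally first and has no rollback, so the local volume state moves to "Started" while the CLI still receives a failure because a remote commit fails.

```c
/*
 * Illustrative sketch only -- NOT actual glusterd code. It models the
 * behaviour from the scenario 1 RCA: the commit is applied locally,
 * then sent to each peer. If any peer commit fails (e.g. glusterd on
 * that peer was stopped), the aggregated op_ret becomes non-zero and
 * the CLI gets a negative reply, yet the local commit already moved
 * the volume to "Started" and nothing is rolled back.
 */
#include <stdio.h>
#include <string.h>

#define NPEERS 3

struct peer {
    const char *ip;
    int         glusterd_up;    /* 0 => glusterd stopped on this peer */
    char        volstate[16];   /* simplified per-node volume status  */
};

/* Local commit: succeeds and flips the local status to "Started". */
static int commit_local(char *local_volstate)
{
    strcpy(local_volstate, "Started");
    return 0;
}

/* Remote commit: fails if glusterd on the peer is not running. */
static int commit_remote(struct peer *p)
{
    if (!p->glusterd_up)
        return -1;              /* peer unreachable -> commit fails */
    strcpy(p->volstate, "Started");
    return 0;
}

int main(void)
{
    char        local_volstate[16] = "Created";
    struct peer peers[NPEERS] = {
        { "10.70.40.170", 0, "Created" },  /* glusterd restarted here     */
        { "10.70.40.171", 1, "Created" },  /* hypothetical healthy peers  */
        { "10.70.40.172", 1, "Created" },
    };
    int op_ret = 0;

    /* Commit locally first; this already changes the local status. */
    op_ret |= commit_local(local_volstate);

    /* Then commit on every peer; any failure makes the overall op fail. */
    for (int i = 0; i < NPEERS; i++) {
        if (commit_remote(&peers[i]) != 0) {
            printf("commit failed on %s\n", peers[i].ip);
            op_ret = -1;        /* no rollback of the earlier commits */
        }
    }

    /* CLI sees a failure, yet the cluster state is now inconsistent. */
    printf("CLI reply: volume start: %s\n", op_ret ? "failed" : "success");
    printf("local node: %s\n", local_volstate);
    for (int i = 0; i < NPEERS; i++)
        printf("%s: %s\n", peers[i].ip, peers[i].volstate);
    return op_ret ? 1 : 0;
}
```

Running the sketch prints a failed CLI reply while the local node and the reachable peers report "Started", which matches the mixed per-node volume status observed in the bug.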