Description of problem:
=======================

Scenario 1: While a volume start is in progress, immediately bring down glusterd on another node. The volume start fails with the error message "Commit failed", yet the volume ends up started.

Scenario 2: While a volume start is in progress, immediately restart glusterd on another node. The volume start produces no output (return code 146), yet the volume ends up started.

Version-Release number of selected component (if applicable):
============================================================
glusterfs 3.6.0.27

How reproducible:
=================
3/3

Steps to Reproduce:
===================

Scenario 1:
~~~~~~~~~~~
1. Create a 2x2 distributed-replicate volume and start it.
2. Stop the volume.
3. Start the volume; while the start is in progress, immediately bring down glusterd on one or more nodes.

Volume start fails with the error below:

# gluster v start vol2
volume start: vol2: failed: Commit failed on 00000000-0000-0000-0000-000000000000. Please check log file for details.
Commit failed on 00000000-0000-0000-0000-000000000000. Please check log file for details.

But volume info shows the volume as started:

# gluster v i vol2

Volume Name: vol2
Type: Distributed-Replicate
Volume ID: 98eb6a90-fbe5-4512-b560-d299579135d5
Status: Started
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: snapshot13.lab.eng.blr.redhat.com:/var/run/gluster/snaps/0bb578bc09154daab9c14afdc5e7f628/brick1/b3
Brick2: snapshot14.lab.eng.blr.redhat.com:/var/run/gluster/snaps/0bb578bc09154daab9c14afdc5e7f628/brick2/b3
Brick3: snapshot15.lab.eng.blr.redhat.com:/var/run/gluster/snaps/0bb578bc09154daab9c14afdc5e7f628/brick3/b3
Brick4: snapshot16.lab.eng.blr.redhat.com:/var/run/gluster/snaps/0bb578bc09154daab9c14afdc5e7f628/brick4/b3
Options Reconfigured:
performance.readdir-ahead: on
auto-delete: disable
snap-max-soft-limit: 90
snap-max-hard-limit: 256

Scenario 2:
~~~~~~~~~~~
1. Stop the volume.
2. Start the volume; while the start is in progress, restart glusterd on another node.

Volume start gives no output, the return code is 146, and the volume status is shown as 'Started':

[root@snapshot13 /]# gluster v start vol2
[root@snapshot13 /]# echo $?
146
[root@snapshot13 /]# gluster v i vol2

Volume Name: vol2
Type: Distributed-Replicate
Volume ID: 98eb6a90-fbe5-4512-b560-d299579135d5
Status: Started
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: snapshot13.lab.eng.blr.redhat.com:/var/run/gluster/snaps/0bb578bc09154daab9c14afdc5e7f628/brick1/b3
Brick2: snapshot14.lab.eng.blr.redhat.com:/var/run/gluster/snaps/0bb578bc09154daab9c14afdc5e7f628/brick2/b3
Brick3: snapshot15.lab.eng.blr.redhat.com:/var/run/gluster/snaps/0bb578bc09154daab9c14afdc5e7f628/brick3/b3
Brick4: snapshot16.lab.eng.blr.redhat.com:/var/run/gluster/snaps/0bb578bc09154daab9c14afdc5e7f628/brick4/b3
Options Reconfigured:
performance.readdir-ahead: on
auto-delete: disable
snap-max-soft-limit: 90
snap-max-hard-limit: 256

Actual results:
===============
Volume start reports an error even though the volume has been started.

Expected results:
=================
No error message should be seen.

Additional info:
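For completeness, a minimal shell sketch that tries to hit the same race as scenario 1. This is an illustration only: it assumes passwordless ssh to the peers, a SysV 'service' init script (use systemctl on systemd platforms), and the hostnames from the output above; the sleep interval is a guess and may need tuning so that glusterd dies inside the commit phase.

# Sketch: reproduce the scenario-1 race (assumptions noted above).
gluster --mode=script volume stop vol2    # --mode=script skips the confirmation prompt
gluster volume start vol2 &
START_PID=$!
sleep 0.1   # guessed delay; tune to land inside the commit phase
ssh snapshot14.lab.eng.blr.redhat.com 'service glusterd stop'
wait $START_PID
echo "volume start exit code: $?"
gluster volume info vol2 | grep '^Status'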
RCA for Scenario 1
==================

In the glusterd commit op phase, if any of the commit ops fails, op_ret is set to non-zero and a negative response is sent back to the CLI. However, in this case the local commit was successful, which is why the status was changed to "Started". Will investigate further to determine the feasibility of returning a positive response to the CLI if the local commit succeeds, even if any of the remote commit ops fails.

RCA for Scenario 2 will be shared soon.
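One way to observe the partial commit described above is to compare the volume's status on every peer right after the failed start. A sketch; the hostnames are taken from the report and passwordless ssh is assumed:

# Compare the post-failure status across all peers; with a failed remote
# commit the nodes can disagree (some Started, some Stopped). A node whose
# glusterd is still down will simply fail to answer.
for h in snapshot13 snapshot14 snapshot15 snapshot16; do
    echo "== $h =="
    ssh "$h.lab.eng.blr.redhat.com" "gluster volume info vol2 | grep '^Status'"
done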
Seema,

Could you please point out on which node (IP) the command was executed for scenario 2, and on which node (IP) glusterd was restarted?

~Atin
The command was executed on 10.70.40.169, and glusterd was initially restarted on 10.70.40.170. But I have also retried the glusterd restart on other nodes while trying to reproduce the issue.
For scenario 2, the logs for snapshot14 seem to be missing: when I untar snapshot14_sosreport-qaredhat.com-20140822130530-91fd.tar.xz, I can see that only snapshot13, 15 & 16 are present.
(In reply to Atin Mukherjee from comment #3)
> In the glusterd commit op phase, if any of the commit ops fails, op_ret is
> set to non-zero and a negative response is sent back to the CLI. However,
> in this case the local commit was successful, which is why the status was
> changed to "Started". Will investigate further to determine the feasibility
> of returning a positive response to the CLI if the local commit succeeds,
> even if any of the remote commit ops fails.
>
> RCA for Scenario 2 will be shared soon.

Seema,

For scenario 1, this is expected as per the design. Currently we do not have any rollback mechanism for a failed transaction. In this case, after the local commit succeeded, one of the remote commits failed because the glusterd instance on a remote node had been brought down. This resulted in the volume status moving to "Started" on a few nodes but not on all of them. Even moving the local commit to after all the remote commits complete could end up in a similar situation, as the local commit itself might fail. We would not be able to fix this problem given the design limitation.

For scenario 2, I still don't see the snapshot14 logs in http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/snapshots/1132839/ Could you double-check?
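Since there is no rollback, a practical way to reconcile the peers after such a partial commit is to restart glusterd on the node that missed it: on startup, glusterd handshakes with its peers and imports the newer volume configuration. A sketch, assuming snapshot14 is the out-of-sync node and a SysV init script:

# Bring the out-of-sync node back and let it re-sync volinfo from peers.
ssh snapshot14.lab.eng.blr.redhat.com 'service glusterd restart'
# Verify that its view of vol2 has converged to Started.
ssh snapshot14.lab.eng.blr.redhat.com "gluster volume info vol2 | grep '^Status'"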
I checked all the logs now and there is no evidence of the volume start getting timed out. Do you remember the timestamp of this issue? I know it's quite difficult to recollect the information as the BZ is quite old, but without that it's pretty difficult to figure out, as I don't see anything abnormal in the logs.
(In reply to Atin Mukherjee from comment #9)
> I checked all the logs now and there is no evidence of the volume start
> getting timed out. Do you remember the timestamp of this issue? I know it's
> quite difficult to recollect the information as the BZ is quite old, but
> without that it's pretty difficult to figure out, as I don't see anything
> abnormal in the logs.

Closing this bug as I have not received any response from the reporter. Kindly re-open if the issue persists.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days