Description of problem:
Did a volume set operation while the other peers in the cluster were down. Op-sm hung. Op-sm is stuck in an infinite state transition:

Old State: [Ack drain]
New State: [Ack drain]
Event    : [GD_OP_EVENT_START_UNLOCK]
timestamp: [2012-08-10 06:10:25]

Old State: [Ack drain]
New State: [Ack drain]
Event    : [GD_OP_EVENT_START_UNLOCK]
timestamp: [2012-08-10 06:10:28]

Old State: [Ack drain]
New State: [Ack drain]
Event    : [GD_OP_EVENT_START_UNLOCK]
timestamp: [2012-08-10 06:10:28]

Old State: [Ack drain]
New State: [Ack drain]
Event    : [GD_OP_EVENT_START_UNLOCK]
timestamp: [2012-08-10 06:10:31]

Old State: [Ack drain]
New State: [Ack drain]
Event    : [GD_OP_EVENT_START_UNLOCK]
timestamp: [2012-08-10 06:10:31]

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
This is the setup in which I got the problem, but I think it can be triggered even with 2 machines.
1. Have a cluster with 3 machines.
2. Bring two of the glusterds down.
3. Execute any glusterd operation command which uses op-sm; I used volume set.

Actual results:
The operation hangs after commit-op.

Expected results:
The volume set operation should have been successful.

Additional info:
The infinitely looping state transitions in the operation state machine were fixed in http://review.gluster.org/4043. This was happening because the notify function was queuing events into the operation state machine on every invocation (triggered on reconnect, once every 3 seconds). Concurrently, the operation state machine processes all the events in the queue, so it is possible for the state machine to keep dequeuing events ad infinitum. This is perceived as a hang, since the epoll thread could be executing glusterd_op_sm(), which processes all the events in the op_sm queue, at any point in time.

Moving it to ON_DEV, since this is fixed on both master and release-3.4.

master, release-3.4: http://review.gluster.org/4043 (fixed before release-3.4 was branched from master)
REVIEW: http://review.gluster.org/4869 (glusterd: Removed 'proactive' failing of volume op) posted (#1) for review on master by Krishnan Parthasarathi (kparthas)
COMMIT: http://review.gluster.org/4869 committed in master by Vijay Bellur (vbellur)
------
commit 3b1ecc6a7fd961c709e82862fd4760b223365863
Author: Krishnan Parthasarathi <kparthas>
Date:   Mon Apr 22 12:27:07 2013 +0530

    glusterd: Removed 'proactive' failing of volume op

    Volume operations were failed 'proactively' on the first disconnect
    of a peer that was participating in the transaction. The reason
    behind having this kludgey code in the first place was to 'abort'
    an ongoing volume operation as soon as we perceive the first
    disconnect. But the rpc callbacks themselves are capable of
    injecting appropriate state machine events, which would set things
    in motion for an eventual abort of the transaction.

    Change-Id: Iad7cb2bd076f22d89a793dfcd08c2d208b39c4be
    BUG: 847214
    Signed-off-by: Krishnan Parthasarathi <kparthas>
    Reviewed-on: http://review.gluster.org/4869
    Reviewed-by: Jeff Darcy <jdarcy>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Vijay Bellur <vbellur>