
Bug 847214

Summary: glusterd operations hang if the other peers are down
Product: [Community] GlusterFS Reporter: Pranith Kumar K <pkarampu>
Component: glusterd Assignee: krishnan parthasarathi <kparthas>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: unspecified Docs Contact:
Priority: medium    
Version: mainline CC: amarts, gluster-bugs, jdarcy, nsathyan
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: glusterfs-3.4.0 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 852147 (view as bug list) Environment:
Last Closed: 2013-07-24 13:43:19 EDT Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Bug Depends On:    
Bug Blocks: 852147, 918917    

Description Pranith Kumar K 2012-08-10 02:39:51 EDT
Description of problem:
Ran a volume set operation while the other peers in the cluster were down, and the operation state machine (op-sm) hung.
Op-sm is stuck repeating the same state transition indefinitely:
Old State: [Ack drain]
New State: [Ack drain]
Event    : [GD_OP_EVENT_START_UNLOCK]
timestamp: [2012-08-10 06:10:25]

Old State: [Ack drain]
New State: [Ack drain]
Event    : [GD_OP_EVENT_START_UNLOCK]
timestamp: [2012-08-10 06:10:28]

Old State: [Ack drain]
New State: [Ack drain]
Event    : [GD_OP_EVENT_START_UNLOCK]
timestamp: [2012-08-10 06:10:28]

Old State: [Ack drain]
New State: [Ack drain]
Event    : [GD_OP_EVENT_START_UNLOCK]
timestamp: [2012-08-10 06:10:31]

Old State: [Ack drain]
New State: [Ack drain]
Event    : [GD_OP_EVENT_START_UNLOCK]
timestamp: [2012-08-10 06:10:31]
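
A state dump like the one above can be checked mechanically for transitions that never leave their state. The following is only a minimal sketch: it assumes the dump is available as plain text in exactly the layout shown here, and `RECORD` and `self_loops` are hypothetical names, not part of glusterd.

```python
import re
from collections import Counter

# Matches one "Old State / New State / Event" record in the layout shown
# in this report (assumption: the dump is plain text in this exact form).
RECORD = re.compile(
    r"Old State: \[(?P<old>[^\]]*)\]\s*"
    r"New State: \[(?P<new>[^\]]*)\]\s*"
    r"Event\s*: \[(?P<event>[^\]]*)\]"
)


def self_loops(dump: str) -> Counter:
    """Count transitions whose old and new state are identical,
    keyed by (state, event)."""
    loops = Counter()
    for m in RECORD.finditer(dump):
        if m["old"] == m["new"]:
            loops[(m["old"], m["event"])] += 1
    return loops


SAMPLE = """\
Old State: [Ack drain]
New State: [Ack drain]
Event    : [GD_OP_EVENT_START_UNLOCK]
timestamp: [2012-08-10 06:10:25]

Old State: [Ack drain]
New State: [Ack drain]
Event    : [GD_OP_EVENT_START_UNLOCK]
timestamp: [2012-08-10 06:10:28]
"""

print(self_loops(SAMPLE))
```

A high count for a single (state, event) pair, as with `Ack drain` and `GD_OP_EVENT_START_UNLOCK` here, suggests the state machine is consuming events without making progress.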


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
This is the setup in which I hit the problem, but I think it can be triggered even with 2 machines.
1. Have a cluster with 3 machines.
2. Bring two of the glusterds down.
3. Execute any glusterd operation command that uses op-sm. I used volume set.
  
Actual results:
The operation hangs after commit-op.
Expected results:
The volume set operation should have succeeded.

Additional info:
Comment 1 krishnan parthasarathi 2013-04-15 02:43:20 EDT
The endlessly looping state transitions in the operation state machine were fixed in http://review.gluster.org/4043.

This happened because the notify function queued an event into the operation state machine on every invocation (triggered on reconnect, once every 3 seconds). Concurrently, the operation state machine processes all the events in the queue. So it is possible for the state machine to keep dequeuing events ad infinitum.

This is perceived as a hang because the epoll thread may be executing glusterd_op_sm(), which processes every event that is in the op_sm queue at that point in time.
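
The mechanism described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the glusterd code or the change in review 4043: `drain`, `simulate` and the `max_steps` cap are hypothetical, and the reconnect notification is modelled as firing once per processed event rather than on a 3-second timer. The "inject once" variant is just one hypothetical way to bound the queue.

```python
from collections import deque


def drain(queue, on_event, max_steps):
    """Process events until the queue is empty, as a drain-all loop would.

    max_steps is only a safety cap for this simulation. Returns the number
    of events processed and whether the queue was actually emptied.
    """
    steps = 0
    while queue and steps < max_steps:
        event = queue.popleft()
        on_event(event)
        steps += 1
    return steps, not queue


def simulate(requeue_on_every_notify, max_steps=1000):
    queue = deque(["GD_OP_EVENT_START_UNLOCK"])
    injected = {"done": False}

    def notify():
        # Stand-in for the reconnect notification of a peer that is still down.
        if requeue_on_every_notify or not injected["done"]:
            queue.append("GD_OP_EVENT_START_UNLOCK")
            injected["done"] = True

    def on_event(_event):
        # The state stays in "Ack drain"; meanwhile a notification fires.
        notify()

    return drain(queue, on_event, max_steps)


print(simulate(requeue_on_every_notify=True))   # hits the cap, queue never empties
print(simulate(requeue_on_every_notify=False))  # drains after a bounded number of events
```

With a producer that enqueues on every notification, the consumer only stops because of the artificial cap; with a producer that enqueues at most once, the drain loop terminates by itself.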

Moving it to ON_DEV, since this is fixed on both master and release-3.4

master, release-3.4: http://review.gluster.org/4043 - fixed before release-3.4 was branched from master.
Comment 2 Anand Avati 2013-04-22 03:37:48 EDT
REVIEW: http://review.gluster.org/4869 (glusterd: Removed 'proactive' failing of volume op) posted (#1) for review on master by Krishnan Parthasarathi (kparthas@redhat.com)
Comment 3 Anand Avati 2013-04-30 07:23:51 EDT
COMMIT: http://review.gluster.org/4869 committed in master by Vijay Bellur (vbellur@redhat.com) 
------
commit 3b1ecc6a7fd961c709e82862fd4760b223365863
Author: Krishnan Parthasarathi <kparthas@redhat.com>
Date:   Mon Apr 22 12:27:07 2013 +0530

    glusterd: Removed 'proactive' failing of volume op
    
    Volume operations were failed 'proactively', on the first disconnect of
    a peer that was participating in the transaction.
    
    The reason behind having this kludgey code in the first place was to
    'abort' an ongoing volume operation as soon as we perceive the first
    disconnect. But the rpc call backs themselves are capable of injecting
    appropriate state machine events, which would set things in motion for an
    eventual abort of the transaction.
    
    Change-Id: Iad7cb2bd076f22d89a793dfcd08c2d208b39c4be
    BUG: 847214
    Signed-off-by: Krishnan Parthasarathi <kparthas@redhat.com>
    Reviewed-on: http://review.gluster.org/4869
    Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
    Tested-by: Gluster Build System <jenkins@build.gluster.com>
    Reviewed-by: Vijay Bellur <vbellur@redhat.com>
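
The idea in the commit message, letting each RPC callback inject an event so that the transaction aborts once every request is accounted for, can be sketched as follows. This is a hypothetical model, not the glusterd implementation: the `Transaction` class, the peer names and the event strings (`RCVD_ACC`, `RCVD_RJT`, `ALL_ACC`, `START_UNLOCK`) are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Transaction:
    """Hypothetical model of one volume operation across several peers."""
    pending: set
    rejected: bool = False
    events: list = field(default_factory=list)  # events injected into the state machine

    def rpc_callback(self, peer, ok):
        """Called once per outstanding request, on a reply or on a failed call."""
        if peer not in self.pending:
            return  # duplicate or late callback; ignore it
        self.pending.discard(peer)
        if ok:
            self.events.append(("RCVD_ACC", peer))
        else:
            self.rejected = True
            self.events.append(("RCVD_RJT", peer))
        if not self.pending:
            # Every request is accounted for: decide the outcome exactly once.
            outcome = "START_UNLOCK" if self.rejected else "ALL_ACC"
            self.events.append((outcome, None))


txn = Transaction(pending={"peer-a", "peer-b"})
txn.rpc_callback("peer-a", ok=True)
txn.rpc_callback("peer-b", ok=False)  # peer-b disconnected; its callback reports failure
print(txn.events)
```

In this model a disconnect needs no special handling of its own: the failed callback records a rejection, and the abort is triggered only after the last outstanding callback has arrived, so each transaction injects exactly one terminal event.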