Bug 911290

Summary: [RHSC] After running "gluster peer detach force" on the gluster CLI, a volume that had bricks on the server that was removed fails to start
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: rhsc
Version: 2.1
Status: CLOSED NOTABUG
Severity: unspecified
Priority: medium
Reporter: Shruti Sampat <ssampat>
Assignee: Shireesh <shireesh>
QA Contact: Prasanth <pprakash>
Docs Contact:
CC: dtsang, mmahoney, pprakash, rhs-bugs, sdharane, shaines, shtripat, vbellur
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-03-01 07:20:27 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments: engine logs (flags: none)

Description Shruti Sampat 2013-02-14 16:41:18 UTC
Created attachment 697283 [details]
engine logs

Description of problem:
---------------------------------------
In a cluster of two nodes, one of the nodes is removed using "peer detach force".
For a volume that had a brick on each of the two servers, the Console now shows only one brick (the one on the detached server is no longer visible in the Console). Trying to start the volume then fails.

On the storage node, running "gluster volume info <vol-name>" still shows the brick that resided on the detached server. Starting the volume on the storage node fails, with the following seen in the gluster logs -
---------------------------------------

[2013-02-14 16:22:20.460327] E [glusterd-volume-ops.c:903:glusterd_op_stage_start_volume] 0-: Unable to resolve brick 10.70.35.71:/opt/gluster/volume1/b1
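
For reference, the inconsistent state can be confirmed on the remaining node with the standard gluster CLI (a minimal sketch; "volume1" and the brick path are the names from this report):

gluster peer status           # 10.70.35.71 is no longer listed as a peer
gluster volume info volume1   # still lists the brick 10.70.35.71:/opt/gluster/volume1/b1
gluster volume start volume1  # fails; the CLI reports only a generic failure (see the comments below)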

Version-Release number of selected component (if applicable):
Red Hat Storage Console Version: 2.1.0-0.qa5.el6rhs 

How reproducible:
Always

Steps to Reproduce:
1. In a two-node cluster (say peer1 and peer2), create a volume with one brick on each node (say brick1 on peer1 and brick2 on peer2).
2. Run "gluster peer detach <IP-of-peer1> force" on peer2.
3. Try to start the volume (see the command sketch below).
  
Actual results:
peer1 now disappears from the Servers tab in the Console, and brick1 also disappears.
Volume start fails with the following message in the Events log -

"Could not start Gluster Volume volume1."

Expected results:
The Console should display the cause for failure to start the volume.

Additional info:
Find engine logs attached.

Comment 2 Shireesh 2013-02-15 06:28:48 UTC
(In reply to comment #0)
> Created attachment 697283 [details]
> engine logs
> 
> Description of problem:
> ---------------------------------------
> In a cluster of two nodes, one of the nodes is removed using "peer detach
> force".
> For a volume that had a brick on each of the two servers, the Console now
> shows only one brick (the one on the detached server is no longer seen on
> the Console). On trying to start the volume, start fails.
> 
> On the storage node, running the command "gluster volume info <vol-name>"
> shows the brick that was residing on the detached server too. Volume start
> on the storage node fails with the following seen in the gluster logs - 
> ---------------------------------------
> 
> [2013-02-14 16:22:20.460327] E
> [glusterd-volume-ops.c:903:glusterd_op_stage_start_volume] 0-: Unable to
> resolve brick 10.70.35.71:/opt/gluster/volume1
> /b1

What do you see in the gluster CLI output when you try to start the volume?

> 
> Version-Release number of selected component (if applicable):
> Red Hat Storage Console Version: 2.1.0-0.qa5.el6rhs 
> 
> How reproducible:
> Always
> 
> Steps to Reproduce:
> 1. For a two-node cluster (say peer1 and peer2), create a volume having one
> brick (say brick1 on peer1 and brick2 on peer2) each on both the nodes.
> 2. Run "gluster peer detach <IP-of-peer1>" on peer2.
>   
> Actual results:
> peer1 now disappears from the Servers tab on Console and brick1 also
> disappears.

I think this behavior is fine, because the brick is not a valid one anymore - its server is not part of the cluster.

> Volume start fails with the following in the message in the Events log - 
> 
> "Could not start Gluster Volume volume1."
> 
> Expected results:
> The Console should display the cause for failure to start the volume.

This can be done if the gluster CLI provides the cause of the failure.

> 
> Additional info:
> Find engine logs attached.

Comment 3 Shruti Sampat 2013-02-15 06:52:35 UTC
(In reply to comment #2)
> (In reply to comment #0)
> > Created attachment 697283 [details]
> > engine logs
> > 
> > Description of problem:
> > ---------------------------------------
> > In a cluster of two nodes, one of the nodes is removed using "peer detach
> > force".
> > For a volume that had a brick on each of the two servers, the Console now
> > shows only one brick (the one on the detached server is no longer seen on
> > the Console). On trying to start the volume, start fails.
> > 
> > On the storage node, running the command "gluster volume info <vol-name>"
> > shows the brick that was residing on the detached server too. Volume start
> > on the storage node fails with the following seen in the gluster logs - 
> > ---------------------------------------
> > 
> > [2013-02-14 16:22:20.460327] E
> > [glusterd-volume-ops.c:903:glusterd_op_stage_start_volume] 0-: Unable to
> > resolve brick 10.70.35.71:/opt/gluster/volume1
> > /b1
> 
> What do you see on the gluster cli output when you try to start the volume?
It says "volume start: <vol-name>: failed".
> 
> > 
> > Version-Release number of selected component (if applicable):
> > Red Hat Storage Console Version: 2.1.0-0.qa5.el6rhs 
> > 
> > How reproducible:
> > Always
> > 
> > Steps to Reproduce:
> > 1. For a two-node cluster (say peer1 and peer2), create a volume having one
> > brick (say brick1 on peer1 and brick2 on peer2) each on both the nodes.
> > 2. Run "gluster peer detach <IP-of-peer1>" on peer2.
> >   
> > Actual results:
> > peer1 now disappears from the Servers tab on Console and brick1 also
> > disappears.
> 
> I think this behavior is fine, because the brick is not a valid one anymore
> - it's server is not part of the cluster.
But running "gluster volume info <vol-name>" on the gluster CLI lists the bricks that reside on the detached server too.
> 
> > Volume start fails with the following in the message in the Events log - 
> > 
> > "Could not start Gluster Volume volume1."
> > 
> > Expected results:
> > The Console should display the cause for failure to start the volume.
> 
> Can be done, if gluster cli provides the cause for failure.
> 
> > 
> > Additional info:
> > Find engine logs attached.

Comment 4 Shireesh 2013-02-18 09:35:16 UTC
(In reply to comment #3)
> (In reply to comment #2)
> > (In reply to comment #0)
> > > Created attachment 697283 [details]
> > > engine logs
> > > 
> > > Description of problem:
> > > ---------------------------------------
> > > In a cluster of two nodes, one of the nodes is removed using "peer detach
> > > force".
> > > For a volume that had a brick on each of the two servers, the Console now
> > > shows only one brick (the one on the detached server is no longer seen on
> > > the Console). On trying to start the volume, start fails.
> > > 
> > > On the storage node, running the command "gluster volume info <vol-name>"
> > > shows the brick that was residing on the detached server too. Volume start
> > > on the storage node fails with the following seen in the gluster logs - 
> > > ---------------------------------------
> > > 
> > > [2013-02-14 16:22:20.460327] E
> > > [glusterd-volume-ops.c:903:glusterd_op_stage_start_volume] 0-: Unable to
> > > resolve brick 10.70.35.71:/opt/gluster/volume1
> > > /b1
> > 
> > What do you see on the gluster cli output when you try to start the volume?
> It says "volume start: <vol-name>: failed".

I think this is the problem. GlusterFS should provide a more meaningful message explaining why it failed. I suggest you raise a bug against glusterfs for this.

> > 
> > > 
> > > Version-Release number of selected component (if applicable):
> > > Red Hat Storage Console Version: 2.1.0-0.qa5.el6rhs 
> > > 
> > > How reproducible:
> > > Always
> > > 
> > > Steps to Reproduce:
> > > 1. For a two-node cluster (say peer1 and peer2), create a volume having one
> > > brick (say brick1 on peer1 and brick2 on peer2) each on both the nodes.
> > > 2. Run "gluster peer detach <IP-of-peer1>" on peer2.
> > >   
> > > Actual results:
> > > peer1 now disappears from the Servers tab on Console and brick1 also
> > > disappears.
> > 
> > I think this behavior is fine, because the brick is not a valid one anymore
> > - it's server is not part of the cluster.
> But running "gluster volume info <vol-name>" on the gluster CLI lists the
> bricks that reside on the detached server too.

I agree that the behavior in the UI is a little different from the gluster CLI in this case, but showing the brick in the UI would also be misleading, as the server itself is not part of the cluster. I think it should be enough if the start-volume error in the UI shows the exact error given by the CLI. If you don't see that in the UI, you can raise a separate bug for it.

> > 
> > > Volume start fails with the following in the message in the Events log - 
> > > 
> > > "Could not start Gluster Volume volume1."
> > > 
> > > Expected results:
> > > The Console should display the cause for failure to start the volume.
> > 
> > Can be done, if gluster cli provides the cause for failure.
> > 
> > > 
> > > Additional info:
> > > Find engine logs attached.

Comment 5 Scott Haines 2013-02-26 22:42:01 UTC
Per Feb 20 bug triage meeting, targeting for 2.1.

Comment 6 Shireesh 2013-03-01 07:20:27 UTC
As discussed with Shruti, this behavior is fine. We should probably detect such a bad configuration (a volume that has bricks on servers that are no longer peers) and generate an alert for it. She is going to raise an RFE for this.
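
For illustration only, a rough command-line approximation of such a check (not the Console implementation; it assumes a gluster release that provides "gluster pool list", and it ignores hostname/IP aliasing, including the local node appearing as "localhost" in the pool list):

# Report brick hosts of a volume that are not in the trusted storage pool
vol=volume1
gluster volume info "$vol" | awk '/^Brick[0-9]+:/ {split($2, a, ":"); print a[1]}' | sort -u > /tmp/brick_hosts
gluster pool list | awk 'NR > 1 {print $2}' | sort -u > /tmp/pool_hosts
comm -23 /tmp/brick_hosts /tmp/pool_hosts    # any output means bricks on non-peer servers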