Bug 911290 - [RHSC] After running "gluster peer detach force" on gluster CLI, a volume that had bricks on the server that was removed, fails to start
Summary: [RHSC] After running "gluster peer detach force" on gluster CLI, a volume that had bricks on the server that was removed, fails to start
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: rhsc
Version: 2.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: unspecified
Target Milestone: ---
Assignee: Shireesh
QA Contact: Prasanth
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-02-14 16:41 UTC by Shruti Sampat
Modified: 2013-07-03 06:07 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-03-01 07:20:27 UTC
Embargoed:


Attachments
engine logs (5.28 MB, text/x-log)
2013-02-14 16:41 UTC, Shruti Sampat

Description Shruti Sampat 2013-02-14 16:41:18 UTC
Created attachment 697283 [details]
engine logs

Description of problem:
---------------------------------------
In a cluster of two nodes, one of the nodes is removed using "gluster peer detach force".
For a volume that had a brick on each of the two servers, the Console now shows only one brick (the brick on the detached server is no longer shown on the Console). Trying to start the volume fails.

On the storage node, running "gluster volume info <vol-name>" still lists the brick that resided on the detached server. Starting the volume from the storage node also fails, with the following seen in the gluster logs -
---------------------------------------

[2013-02-14 16:22:20.460327] E [glusterd-volume-ops.c:903:glusterd_op_stage_start_volume] 0-: Unable to resolve brick 10.70.35.71:/opt/gluster/volume1/b1
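
The inconsistency can be confirmed from the gluster CLI on the remaining peer. A minimal sketch, reusing the volume name and brick address reported above (illustrative only):

  gluster peer status           # the detached server is no longer listed as a peer
  gluster volume info volume1   # still lists brick 10.70.35.71:/opt/gluster/volume1/b1
  gluster volume start volume1  # fails; glusterd logs "Unable to resolve brick"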

Version-Release number of selected component (if applicable):
Red Hat Storage Console Version: 2.1.0-0.qa5.el6rhs 

How reproducible:
Always

Steps to Reproduce:
1. For a two-node cluster (say peer1 and peer2), create a volume with one brick on each node (say brick1 on peer1 and brick2 on peer2).
2. Run "gluster peer detach <IP-of-peer1> force" on peer2 (a full command sequence is sketched below).
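
A minimal end-to-end reproduction sketch, run from peer2's shell (the hostnames peer1/peer2 and the brick paths are placeholders; the volume name volume1 matches this report; adjust to your environment):

  gluster peer probe peer1                                         # form the two-node cluster
  gluster volume create volume1 peer2:/bricks/b2 peer1:/bricks/b1  # one brick on each node
  gluster peer detach peer1 force                                  # force-detach the peer that still hosts brick1
  gluster volume start volume1                                     # fails: "volume start: volume1: failed"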
  
Actual results:
peer1 now disappears from the Servers tab on the Console, and brick1 also disappears.
Volume start fails with the following message in the Events log -

"Could not start Gluster Volume volume1."

Expected results:
The Console should display the cause for failure to start the volume.

Additional info:
Find engine logs attached.

Comment 2 Shireesh 2013-02-15 06:28:48 UTC
(In reply to comment #0)
> Created attachment 697283 [details]
> engine logs
> 
> Description of problem:
> ---------------------------------------
> In a cluster of two nodes, one of the nodes is removed using "peer detach
> force".
> For a volume that had a brick on each of the two servers, the Console now
> shows only one brick (the one on the detached server is no longer seen on
> the Console). On trying to start the volume, start fails.
> 
> On the storage node, running the command "gluster volume info <vol-name>"
> shows the brick that was residing on the detached server too. Volume start
> on the storage node fails with the following seen in the gluster logs - 
> ---------------------------------------
> 
> [2013-02-14 16:22:20.460327] E
> [glusterd-volume-ops.c:903:glusterd_op_stage_start_volume] 0-: Unable to
> resolve brick 10.70.35.71:/opt/gluster/volume1
> /b1

What do you see on the gluster cli output when you try to start the volume?

> 
> Version-Release number of selected component (if applicable):
> Red Hat Storage Console Version: 2.1.0-0.qa5.el6rhs 
> 
> How reproducible:
> Always
> 
> Steps to Reproduce:
> 1. For a two-node cluster (say peer1 and peer2), create a volume having one
> brick (say brick1 on peer1 and brick2 on peer2) each on both the nodes.
> 2. Run "gluster peer detach <IP-of-peer1>" on peer2.
>   
> Actual results:
> peer1 now disappears from the Servers tab on Console and brick1 also
> disappears.

I think this behavior is fine, because the brick is no longer a valid one - its server is not part of the cluster.

> Volume start fails with the following in the message in the Events log - 
> 
> "Could not start Gluster Volume volume1."
> 
> Expected results:
> The Console should display the cause for failure to start the volume.

This can be done if the gluster CLI provides the cause of the failure.

> 
> Additional info:
> Find engine logs attached.

Comment 3 Shruti Sampat 2013-02-15 06:52:35 UTC
(In reply to comment #2)
> (In reply to comment #0)
> > Created attachment 697283 [details]
> > engine logs
> > 
> > Description of problem:
> > ---------------------------------------
> > In a cluster of two nodes, one of the nodes is removed using "peer detach
> > force".
> > For a volume that had a brick on each of the two servers, the Console now
> > shows only one brick (the one on the detached server is no longer seen on
> > the Console). On trying to start the volume, start fails.
> > 
> > On the storage node, running the command "gluster volume info <vol-name>"
> > shows the brick that was residing on the detached server too. Volume start
> > on the storage node fails with the following seen in the gluster logs - 
> > ---------------------------------------
> > 
> > [2013-02-14 16:22:20.460327] E
> > [glusterd-volume-ops.c:903:glusterd_op_stage_start_volume] 0-: Unable to
> > resolve brick 10.70.35.71:/opt/gluster/volume1
> > /b1
> 
> What do you see on the gluster cli output when you try to start the volume?
It says "volume start: <vol-name>: failed".
> 
> > 
> > Version-Release number of selected component (if applicable):
> > Red Hat Storage Console Version: 2.1.0-0.qa5.el6rhs 
> > 
> > How reproducible:
> > Always
> > 
> > Steps to Reproduce:
> > 1. For a two-node cluster (say peer1 and peer2), create a volume having one
> > brick (say brick1 on peer1 and brick2 on peer2) each on both the nodes.
> > 2. Run "gluster peer detach <IP-of-peer1>" on peer2.
> >   
> > Actual results:
> > peer1 now disappears from the Servers tab on Console and brick1 also
> > disappears.
> 
> I think this behavior is fine, because the brick is not a valid one anymore
> - it's server is not part of the cluster.
But running "gluster volume info <vol-name>" on the gluster CLI still lists the bricks that reside on the detached server.
> 
> > Volume start fails with the following in the message in the Events log - 
> > 
> > "Could not start Gluster Volume volume1."
> > 
> > Expected results:
> > The Console should display the cause for failure to start the volume.
> 
> Can be done, if gluster cli provides the cause for failure.
> 
> > 
> > Additional info:
> > Find engine logs attached.

Comment 4 Shireesh 2013-02-18 09:35:16 UTC
(In reply to comment #3)
> (In reply to comment #2)
> > (In reply to comment #0)
> > > Created attachment 697283 [details]
> > > engine logs
> > > 
> > > Description of problem:
> > > ---------------------------------------
> > > In a cluster of two nodes, one of the nodes is removed using "peer detach
> > > force".
> > > For a volume that had a brick on each of the two servers, the Console now
> > > shows only one brick (the one on the detached server is no longer seen on
> > > the Console). On trying to start the volume, start fails.
> > > 
> > > On the storage node, running the command "gluster volume info <vol-name>"
> > > shows the brick that was residing on the detached server too. Volume start
> > > on the storage node fails with the following seen in the gluster logs - 
> > > ---------------------------------------
> > > 
> > > [2013-02-14 16:22:20.460327] E
> > > [glusterd-volume-ops.c:903:glusterd_op_stage_start_volume] 0-: Unable to
> > > resolve brick 10.70.35.71:/opt/gluster/volume1
> > > /b1
> > 
> > What do you see on the gluster cli output when you try to start the volume?
> It says "volume start: <vol-name>: failed".

I think this is the problem. GlusterFS should provide a more meaningful message explaining why it failed. I suggest you raise a bug against glusterfs for this.

> > 
> > > 
> > > Version-Release number of selected component (if applicable):
> > > Red Hat Storage Console Version: 2.1.0-0.qa5.el6rhs 
> > > 
> > > How reproducible:
> > > Always
> > > 
> > > Steps to Reproduce:
> > > 1. For a two-node cluster (say peer1 and peer2), create a volume having one
> > > brick (say brick1 on peer1 and brick2 on peer2) each on both the nodes.
> > > 2. Run "gluster peer detach <IP-of-peer1>" on peer2.
> > >   
> > > Actual results:
> > > peer1 now disappears from the Servers tab on Console and brick1 also
> > > disappears.
> > 
> > I think this behavior is fine, because the brick is not a valid one anymore
> > - it's server is not part of the cluster.
> But running "gluster volume info <vol-name>" on the gluster CLI lists the
> bricks that reside on the detached server too.

I agree that the UI behavior differs a little from the gluster CLI in this case, but showing the brick in the UI would also be misleading, as the server itself is not part of the cluster. I think it should be enough if the start-volume error in the UI shows the exact error given by the CLI. If you don't see it in the UI, you can raise a separate bug for that.

> > 
> > > Volume start fails with the following in the message in the Events log - 
> > > 
> > > "Could not start Gluster Volume volume1."
> > > 
> > > Expected results:
> > > The Console should display the cause for failure to start the volume.
> > 
> > Can be done, if gluster cli provides the cause for failure.
> > 
> > > 
> > > Additional info:
> > > Find engine logs attached.

Comment 5 Scott Haines 2013-02-26 22:42:01 UTC
Per Feb 20 bug triage meeting, targeting for 2.1.

Comment 6 Shireesh 2013-03-01 07:20:27 UTC
As discussed with Shruti, this behavior is fine. We should probably detect such a bad configuration (a volume having bricks on servers that are no longer peers) and generate an alert for it. She is going to raise an RFE for this. A possible check is sketched below.
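
A minimal shell sketch of such a check (the volume name volume1 is taken from this report; hostname/IP matching is simplified, and "gluster peer status" does not list the local node, so a real check would need to account for that):

  # hosts of the current peers, as reported by glusterd
  peers=$(gluster peer status | awk '/^Hostname:/ {print $2}')

  # hosts appearing in the volume's brick list
  brick_hosts=$(gluster volume info volume1 | awk '/^Brick[0-9]+:/ {split($2, a, ":"); print a[1]}')

  # flag any brick whose host is not a known peer
  for h in $brick_hosts; do
      echo "$peers" | grep -qw -- "$h" || echo "ALERT: brick host $h is not part of the cluster"
  done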

