Bug 958076 - unable to start the volume once the brick is removed and created with the same name.
Summary: unable to start the volume once the brick is removed and created with the same name.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterd
Version: 2.1
Hardware: x86_64
OS: Linux
Priority: high
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Krutika Dhananjay
QA Contact: Rahul Hinduja
URL:
Whiteboard:
Depends On:
Blocks: 963665
 
Reported: 2013-04-30 10:41 UTC by Rahul Hinduja
Modified: 2013-09-23 22:43 UTC
CC: 7 users

Fixed In Version: glusterfs-3.4.0.9rhs-1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Cloned to: 963665
Environment:
Last Closed: 2013-09-23 22:39:35 UTC
Embargoed:



Description Rahul Hinduja 2013-04-30 10:41:24 UTC
Description of problem:
=======================

Killed glusterd, glusterfs, and glusterfsd on one of the storage servers. Removed the brick directory using "rm" and created a directory with the same name under the same path.

Started glusterd.

Tried to start the volume with force; it failed with the error:
"volume start: vol-dis-rep: failed: Failed to get extended attribute trusted.glusterfs.volume-id for brick dir /rhs/brick1/b1. Reason : No data available"

Version-Release number of selected component (if applicable):
=============================================================

[root@rhs-client11 ~]# rpm -qa | grep gluster
glusterfs-debuginfo-3.4.0.1rhs-1.el6rhs.x86_64
glusterfs-fuse-3.4.0.1rhs-1.el6rhs.x86_64
gluster-swift-container-1.4.8-4.el6.noarch
gluster-swift-1.4.8-4.el6.noarch
gluster-swift-doc-1.4.8-4.el6.noarch
vdsm-gluster-4.10.2-4.0.qa5.el6rhs.noarch
gluster-swift-plugin-1.0-5.noarch
gluster-swift-proxy-1.4.8-4.el6.noarch
gluster-swift-account-1.4.8-4.el6.noarch
glusterfs-geo-replication-3.4.0.1rhs-1.el6rhs.x86_64
org.apache.hadoop.fs.glusterfs-glusterfs-0.20.2_0.2-1.noarch
glusterfs-3.4.0.1rhs-1.el6rhs.x86_64
glusterfs-server-3.4.0.1rhs-1.el6rhs.x86_64
glusterfs-rdma-3.4.0.1rhs-1.el6rhs.x86_64
gluster-swift-object-1.4.8-4.el6.noarch
[root@rhs-client11 ~]# 



How reproducible:
=================

1/1


Steps to Reproduce:
1. Create a 6x2 distributed-replicate volume across 4 storage nodes (bricks b1 to b12); see the command sketch after this list.
2. Mount the volume on FUSE and NFS clients.
3. Run "killall glusterfs ; killall glusterfsd ; killall glusterd" on one of the storage nodes.
4. Remove the brick directories on the same storage node where the gluster processes were stopped in step 3.
5. Recreate the brick directories under the same paths as before.
6. Start glusterd.
7. Start the volume with force.
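
A sketch of steps 1 and 2 as commands, using the volume name, node addresses, and brick paths from the volume info under "Actual results" (the mount points and NFS options are placeholders, not taken from the original setup); steps 3-7 appear verbatim in the transcript below:

# Step 1 -- create and start a 6x2 distributed-replicate volume:
gluster volume create vol-dis-rep replica 2 \
    10.70.36.35:/rhs/brick1/b1  10.70.36.36:/rhs/brick1/b2 \
    10.70.36.35:/rhs/brick1/b3  10.70.36.36:/rhs/brick1/b4 \
    10.70.36.35:/rhs/brick1/b5  10.70.36.36:/rhs/brick1/b6 \
    10.70.36.37:/rhs/brick1/b7  10.70.36.38:/rhs/brick1/b8 \
    10.70.36.37:/rhs/brick1/b9  10.70.36.38:/rhs/brick1/b10 \
    10.70.36.37:/rhs/brick1/b11 10.70.36.38:/rhs/brick1/b12
gluster volume start vol-dis-rep

# Step 2 -- mount the volume on the FUSE and NFS clients:
mount -t glusterfs 10.70.36.35:/vol-dis-rep /mnt/fuse
mount -t nfs -o vers=3 10.70.36.35:/vol-dis-rep /mnt/nfs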
  
Actual results:
===============

[root@rhs-client11 ~]# gluster v i 
 
Volume Name: vol-dis-rep
Type: Distributed-Replicate
Volume ID: 5d6c5e6b-9ab5-450c-8fb1-9e33a16acb64
Status: Started
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.36.35:/rhs/brick1/b1
Brick2: 10.70.36.36:/rhs/brick1/b2
Brick3: 10.70.36.35:/rhs/brick1/b3
Brick4: 10.70.36.36:/rhs/brick1/b4
Brick5: 10.70.36.35:/rhs/brick1/b5
Brick6: 10.70.36.36:/rhs/brick1/b6
Brick7: 10.70.36.37:/rhs/brick1/b7
Brick8: 10.70.36.38:/rhs/brick1/b8
Brick9: 10.70.36.37:/rhs/brick1/b9
Brick10: 10.70.36.38:/rhs/brick1/b10
Brick11: 10.70.36.37:/rhs/brick1/b11
Brick12: 10.70.36.38:/rhs/brick1/b12
Options Reconfigured:
performance.io-cache: off
[root@rhs-client11 ~]# 


[root@rhs-client11 ~]# 
[root@rhs-client11 ~]# killall glusterfs ; killall glusterfsd ; killall glusterd
[root@rhs-client11 ~]# 

[root@rhs-client11 ~]# rm -rf /rhs/brick1/b*
[root@rhs-client11 ~]# mkdir /rhs/brick1/b1
[root@rhs-client11 ~]# mkdir /rhs/brick1/b3
[root@rhs-client11 ~]# mkdir /rhs/brick1/b5
[root@rhs-client11 ~]# ls /rhs/brick1/b
b1/ b3/ b5/ 
[root@rhs-client11 ~]# ls /rhs/brick1/b1
[root@rhs-client11 ~]# 
[root@rhs-client11 ~]# service glusterd start
Starting glusterd:                                         [  OK  ]
[root@rhs-client11 ~]# 
[root@rhs-client11 ~]# gluster volume start vol-dis-rep force
volume start: vol-dis-rep: failed: Failed to get extended attribute trusted.glusterfs.volume-id for brick dir /rhs/brick1/b1. Reason : No data available
[root@rhs-client11 ~]# 


Expected results:
=================

The gluster volume should start successfully when "force" is used.


Additional info:
================

The above case used to work on RHS 2.0.

Comment 3 Krutika Dhananjay 2013-04-30 11:08:12 UTC
Rahul,

That is expected behavior.

The only way you can start the volume is by manually setting the extended attribute 'trusted.glusterfs.volume-id' to the volume-id of the volume, on the newly created brick directory, before attempting to start the volume. And the volume-id of the volume can be found in the file /var/lib/glusterd/vols/<volname>/info.
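
As a sketch in shell form (using the volume name vol-dis-rep and the recreated brick path /rhs/brick1/b1 from this report; the attribute value is the volume ID from the info file with dashes stripped, written as hex):

# Volume ID as recorded by glusterd:
grep volume-id /var/lib/glusterd/vols/vol-dis-rep/info
# volume-id=5d6c5e6b-9ab5-450c-8fb1-9e33a16acb64

# Re-stamp the recreated brick directory with that ID, then retry the start:
VOLID=$(grep volume-id /var/lib/glusterd/vols/vol-dis-rep/info | cut -d= -f2 | tr -d '-')
setfattr -n trusted.glusterfs.volume-id -v 0x$VOLID /rhs/brick1/b1
gluster volume start vol-dis-rep force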

However, the log message could be changed to provide the workaround for starting the volume.

Comment 4 Sachidananda Urs 2013-05-02 04:49:11 UTC
Krutika, IMO a plain volume start should fail (to guard against accidental unmount cases); however, when force is used, it means the user knows what they are doing, and the volume should be started.

Comment 5 Rahul Hinduja 2013-05-10 06:29:21 UTC
(In reply to comment #3)
> Rahul,
> 
> That is expected behavior.
> 
> The only way you can start the volume is by manually setting the extended
> attribute 'trusted.glusterfs.volume-id' to the volume-id of the volume, on
> the newly created brick directory, before attempting to start the volume.
> And the volume-id of the volume can be found in the file
> /var/lib/glusterd/vols/<volname>/info.
> 
> However, the log message could be changed to provide the workaround for
> starting the volume.

This is a regression: this case used to work in RHS 2.0, and as mentioned in comment 4, "start force" should work here. We use this case to simulate a disk-replacement scenario.

Comment 8 Scott Haines 2013-09-23 22:39:35 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html

Comment 9 Scott Haines 2013-09-23 22:43:46 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html

