Bug 991084

Summary: No way to start a failed brick after its location has been replaced with an empty folder
Product: [Community] GlusterFS
Reporter: Simon Eisenmann <longsleep>
Component: posix
Assignee: bugs <bugs>
Status: CLOSED EOL
QA Contact:
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.4.0
CC: asmarre, bugs, dkelson, gluster-bugs, info, joe, johan.huysmans, nabber00, pierre.francois, ravishankar, vbellur
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-10-07 12:22:04 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Simon Eisenmann 2013-08-01 14:53:31 UTC
Description of problem:
There is no way to start a previously existing brick from a newly created empty folder (e.g. if you have replaced a failed disk). The brick will never start because the trusted.glusterfs.volume-id extended attribute is missing at the brick location.

E [posix.c:4288:init] 0-machines-posix: Extended attribute trusted.glusterfs.volume-id is absent
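
A quick way to check whether the attribute is actually present on a brick directory (the path here is just an example):

# Prints the attribute in hex if it is set; reports "No such attribute" otherwise.
getfattr -n trusted.glusterfs.volume-id -e hex /export/brick1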


Version-Release number of selected component (if applicable):
3.4.0

How reproducible:
stop glusterd
unmount $brick
rm -rf $brick
mkfs.xfs the device backing $brick
mkdir $brick
mount $brick
start glusterd
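
The same steps as a rough sketch, assuming the brick is mounted at /export/brick1 from /dev/mapper/ubuntu-brick1 (the device used later in this report):

service glusterfs-server stop           # stop glusterd; any still-running glusterfsd brick process needs killing too
umount /export/brick1
mkfs.xfs -f /dev/mapper/ubuntu-brick1   # recreate the filesystem; old brick data and xattrs are gone
mount /export/brick1
service glusterfs-server start          # the brick now fails to start because the volume-id xattr is absent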

Actual results:
Brick does not start and cannot be healed.

Expected results:
Brick should start empty and create the missing metadata so that it becomes a valid target for healing. Either this should happen automatically, or there needs to be a command line way to initialize a folder with the existing brick metadata.


Additional info:

The workaround is to manually recreate the trusted.glusterfs.volume-id extended attribute on the brick folder.

JoeJulian on IRC came up with the following command to do just this:

(vol=myvol; brick=/tmp/brick1; setfattr -n trusted.glusterfs.volume-id -v $(grep volume-id /var/lib/glusterd/vols/$vol/info | cut -d= -f2 | sed 's/-//g') $brick)


This works perfectly well, and the brick starts fine (empty) and is healed perfectly afterwards.

Comment 1 Simon Eisenmann 2013-08-02 09:24:58 UTC
For some reason the command got garbled; the correct command is:

(vol=myvol; brick=/tmp/brick1; setfattr -n  trusted.glusterfs.volume-id -v 0x$(grep volume-id /var/lib/glusterd/vols/$vol/info | cut -d= -f2 | sed 's/-//g') $brick)
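
Broken out for readability, this is what the one-liner does (volume name and brick path are just examples):

# Example values; substitute the real volume name and brick path.
vol=myvol
brick=/tmp/brick1

# Look up the volume's UUID in glusterd's info file and strip the dashes.
volid=$(grep volume-id /var/lib/glusterd/vols/$vol/info | cut -d= -f2 | sed 's/-//g')

# Write it back as the hex-encoded trusted.glusterfs.volume-id xattr on the new, empty brick.
setfattr -n trusted.glusterfs.volume-id -v 0x$volid $brick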

Comment 2 Ravishankar N 2013-08-19 05:29:44 UTC
The volume-id metadata is automatically created when one of the following commands is run:
  
1. gluster volume start <VOLNAME> force
2. gluster volume replace-brick <VOLNAME> <FAILED-BRICK> <NEW-BRICK> commit force

Thereafter, the self-heal can be triggered to copy the data to the new brick.
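
For example, the sequence described above would look roughly like this (using the volume name from this report):

# Force-start is expected to bring the replaced brick online and recreate the volume-id xattr.
gluster volume start machines force

# Then trigger a full self-heal and watch its progress.
gluster volume heal machines full
gluster volume heal machines info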

Comment 3 Simon Eisenmann 2013-08-19 08:58:17 UTC
Force starting a volume does not work:
root@srv2:~# gluster volume start machines force
volume start: machines: failed: Failed to get extended attribute trusted.glusterfs.volume-id for brick dir /export/brick1. Reason : No data available


Force replacing a brick does not work:
root@srv2:~# gluster volume replace-brick machines srv2:/export/brick1 srv2:/export/brick1 commit force
volume replace-brick: failed: Brick: srv2:/export/brick1 not available. Brick may be containing or be contained by an existing brick



Details:

on srv2 (out of 3 servers, all clean and fully running, with one brick each)

root@srv2:~# service glusterfs-server stop

Existing brick:
/dev/mapper/ubuntu-brick1 on /export/brick1 type xfs (rw,nosuid,noatime)

root@srv2:~# killall glusterfsd
-> brick goes offline

check existing attributes
root@srv2:~# getfattr -m- -d /export/brick1/
getfattr: Removing leading '/' from absolute path names
# file: export/brick1/
trusted.afr.machines-client-0=0sAAAAAAAAAAAAAAAA
trusted.afr.machines-client-1=0sAAAAAAAAAAAAAAAA
trusted.afr.machines-client-2=0sAAAAAAAAAAAAAAAA
trusted.gfid=0sAAAAAAAAAAAAAAAAAAAAAQ==
trusted.glusterfs.dht=0sAAAAAQAAAAAAAAAA/////w==
trusted.glusterfs.volume-id=0s4PICLVK6S5yX2v7X3dNtdg==

root@srv2:~# umount /export/brick1
root@srv2:~# mkfs.xfs -f /dev/mapper/ubuntu-brick1
-> New filesystem

root@srv2:~# mount /export/brick1
check existing attributes
root@srv2:~# getfattr -m- -d /export/brick1/
(there are none)

Start glusterfs server (brick does not start)
root@srv2:~# service glusterfs-server start

[2013-08-19 08:51:48.122318] E [posix.c:4288:init] 0-machines-posix: Extended attribute trusted.glusterfs.volume-id is absent
[2013-08-19 08:51:48.122450] E [xlator.c:390:xlator_init] 0-machines-posix: Initialization of volume 'machines-posix' failed, review your volfile again
[2013-08-19 08:51:48.122467] E [graph.c:292:glusterfs_graph_init] 0-machines-posix: initializing translator failed
[2013-08-19 08:51:48.122480] E [graph.c:479:glusterfs_graph_activate] 0-graph: init failed

Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick srv1:/export/brick1                               49152   Y       14621
Brick srv2:/export/brick1                               N/A     N       N/A
Brick srv3:/export/brick1                               49152   Y       13130

Comment 4 Simon Eisenmann 2013-08-19 09:04:01 UTC
Following my example from Comment #3:

root@srv2:~# (vol=machines; brick=/export/brick1; setfattr -n  trusted.glusterfs.volume-id -v 0x$(grep volume-id /var/lib/glusterd/vols/$vol/info | cut -d= -f2 | sed 's/-//g') $brick)
root@srv2:~# getfattr -m- -d /export/brick1/
getfattr: Removing leading '/' from absolute path names
# file: export/brick1/
trusted.glusterfs.volume-id=0s4PICLVK6S5yX2v7X3dNtdg==

root@srv2:~# gluster volume start machines force
volume start: machines: success


Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick srv1:/export/brick1				49152	Y	14621
Brick srv2:/export/brick1				49152	Y	22711
Brick srv3:/export/brick1				49152	Y	13130
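
A quick sanity check (paths as above) to confirm the applied value matches glusterd's record:

# Dump the xattr in hex instead of the base64 shown by getfattr -d.
getfattr -n trusted.glusterfs.volume-id -e hex /export/brick1

# Compare with the UUID recorded by glusterd (same value, with dashes removed and a 0x prefix).
grep volume-id /var/lib/glusterd/vols/machines/info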

Comment 5 Daniel Baker 2014-03-20 02:48:12 UTC
I tried the solution:

(vol=machines; brick=/export/sdb1/brick; setfattr -n trusted.glusterfs.volume-id -v 0x$(grep volume-id /var/lib/glusterd/vols/$vol/info | cut -d= -f2 | sed 's/-//g') $brick)
grep: /var/lib/glusterd/vols/machines/info: No such file or directory

And then this:

root@cluster1:/home/admincit# gluster volume start gv0 force
volume start: gv0: failed: Volume id mismatch for brick 192.168.100.170:/export/sdb1/brick. Expected volume id e9e31a9d-e194-4cbb-851c-a837ebc753d0, volume id 00000000-0000-0000-0000-000000000000 found


This has come about because I replaced the HD the brick was on.

Is there another solution I can try, or does anyone know what I have to tweak in the above commands to get this volume id sorted out?


Thanks for the help,

Dan
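
Judging from the paths in the commands and errors above, one likely cause (an assumption, not a confirmed diagnosis) is that the one-liner was run with the example volume name "machines" while the actual volume appears to be "gv0", so grep could not find the volume's info file. With the names from this comment it would look like:

# Volume name and brick path taken from the output above.
vol=gv0
brick=/export/sdb1/brick

volid=$(grep volume-id /var/lib/glusterd/vols/$vol/info | cut -d= -f2 | sed 's/-//g')
setfattr -n trusted.glusterfs.volume-id -v 0x$volid $brick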

Comment 6 nabber00 2014-08-22 03:54:06 UTC
I'm seeing this as well, Ubuntu 12.04, Gluster 3.4.2.  Comment 4 worked for me.  I did not need to use the "force" option.  I also needed to stop/start the volume if it was already started.
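
For reference, that stop/start cycle is just (volume name is a placeholder):

gluster volume stop <VOLNAME>    # prompts for confirmation before stopping
gluster volume start <VOLNAME>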

Comment 7 nabber00 2014-08-22 03:54:36 UTC
(In reply to nabber00 from comment #6)
> I'm seeing this as well, Ubuntu 12.04, Gluster 3.4.2.  Comment 4 worked for
> me.  I did not need to use the "force" option.  I also needed to stop/start
> the volume if it was already started.

Sorry that should have been 14.04.

Comment 8 Niels de Vos 2015-05-17 22:00:57 UTC
GlusterFS 3.7.0 has been released (http://www.gluster.org/pipermail/gluster-users/2015-May/021901.html), and the Gluster project maintains N-2 supported releases. The last two releases before 3.7 are still maintained; at the moment these are 3.6 and 3.5.

This bug has been filed against the 3.4 release, and will not get fixed in a 3.4 version any more. Please verify whether newer versions are affected by the reported problem. If that is the case, update the bug with a note, and update the version if you can. In case updating the version is not possible, leave a comment in this bug report with the version you tested, and set the "Need additional information the selected bugs from" field below the comment box to "bugs".

If there is no response by the end of the month, this bug will get automatically closed.

Comment 9 Kaleb KEITHLEY 2015-10-07 12:22:04 UTC
GlusterFS 3.4.x has reached end-of-life.

If this bug still exists in a later release, please reopen this bug and change the version, or open a new bug.