Bug 764602 (GLUSTER-2870)

Summary: Inconsistent xattr values when creating bricks
Product: [Community] GlusterFS
Reporter: mohitanchlia
Component: replicate
Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: high
Version: 3.1.3
CC: aavati, gluster-bugs, jdarcy, rabhat, vijay
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Verified Versions: master, release-3.2

Description mohitanchlia 2011-05-02 18:03:03 UTC
2 issues:

1) When I create a volume, the gluster mount point has inconsistent xattrs:

[2011-04-27 17:11:29.13142] E
[afr-self-heal-metadata.c:524:afr_sh_metadata_fix]
0-stress-volume-replicate-0: Unable to self-heal permissions/ownership
of '/' (possible split-brain). Please fix the file on all backend
volumes

Can someone please help me understand the reason for this problem?

 gluster volume info all

Volume Name: stress-volume
Type: Distributed-Replicate
Status: Started
Number of Bricks: 8 x 2 = 16
Transport-type: tcp
Bricks:
Brick1: dsdb1:/data/gluster
Brick2: dsdb2:/data/gluster
Brick3: dsdb3:/data/gluster
Brick4: dsdb4:/data/gluster
Brick5: dsdb5:/data/gluster
Brick6: dsdb6:/data/gluster
Brick7: dslg1:/data/gluster
Brick8: dslg2:/data/gluster
Brick9: dsdb1:/data1/gluster
Brick10: dsdb2:/data1/gluster
Brick11: dsdb3:/data1/gluster
Brick12: dsdb4:/data1/gluster
Brick13: dsdb5:/data1/gluster
Brick14: dsdb6:/data1/gluster
Brick15: dslg1:/data1/gluster
Brick16: dslg2:/data1/gluster


2) Some bricks are missing xattr info. For example:


[root@dsdb1 gluster]# getfattr -dm - /data2/gluster

getfattr: Removing leading '/' from absolute path names

# file: data2/gluster
trusted.afr.stress-volume-client-16=0sAAAAAAAAAAAAAAAA
trusted.afr.stress-volume-client-17=0sAAAAAAAAAAAAAAAA
trusted.gfid=0sAAAAAAAAAAAAAAAAAAAAAQ==
trusted.glusterfs.dht=0sAAAAAQAAAAB////+lVVVUg==

trusted.glusterfs.test="working\000"


[root@dsdb3 ~]# ls /data2/gluster/12657/372657
/data2/gluster/12657/372657
[root@dsdb3 ~]# getfattr -dm - /data2/gluster

getfattr: Removing leading '/' from absolute path names

# file: data2/gluster
trusted.gfid=0sAAAAAAAAAAAAAAAAAAAAAQ==
trusted.glusterfs.dht=0sAAAAAQAAAACVVVVTqqqqpw==

trusted.glusterfs.test="working\000"



[root@dsdb4 ~]# ls /data2/gluster/12657/372657
/data2/gluster/12657/372657
[root@dsdb4 ~]# getfattr -dm - /data2/gluster

getfattr: Removing leading '/' from absolute path names

# file: data2/gluster
trusted.gfid=0sAAAAAAAAAAAAAAAAAAAAAQ==
trusted.glusterfs.dht=0sAAAAAQAAAACVVVVTqqqqpw==

trusted.glusterfs.test="working\000"
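As an aside, the base64 values in the dumps above can be decoded to compare the layouts between bricks. A minimal sketch, assuming the common on-disk DHT layout format (four big-endian 32-bit fields: count, type, hash range start, hash range end) and the "0s" prefix getfattr puts on base64-encoded values:

```python
import base64
import struct

def decode_dht_layout(value):
    # getfattr prefixes base64-encoded values with "0s"
    raw = base64.b64decode(value[2:] if value.startswith("0s") else value)
    # assumed layout: count, type, hash range start, hash range end
    cnt, ltype, start, stop = struct.unpack(">IIII", raw[:16])
    return cnt, ltype, start, stop

# values from the getfattr dumps above
for host, val in [("dsdb1", "0sAAAAAQAAAAB////+lVVVUg=="),
                  ("dsdb3/4", "0sAAAAAQAAAACVVVVTqqqqpw==")]:
    cnt, ltype, start, stop = decode_dht_layout(val)
    print(f"{host}: hash range 0x{start:08x}..0x{stop:08x}")
```

Under that assumption, the two ranges come out adjacent (0x95555552 on dsdb1 is immediately followed by 0x95555553 on dsdb3/4), which is what a healthy DHT layout looks like; a gap or overlap between bricks would indicate a layout problem.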

Comment 1 mohitanchlia 2011-05-18 14:57:42 UTC
Is there any estimated time for this fix?

Comment 2 Anand Avati 2011-05-18 16:16:00 UTC
(In reply to comment #1)
> Is there any estimated time for this fix?

We have a highly probable theory for this bug and are working on the fix. The fix should be available soon. In the meantime, it is safe to ignore this error log, even though it is annoying.

Avati

Comment 3 mohitanchlia 2011-05-18 16:24:46 UTC
There are 2 issues I highlighted in the description. Can you please tell me the worst-case implication of these?

Thanks!

Comment 4 Anand Avati 2011-05-18 18:50:41 UTC
It's hard to tell why that pair does not have afr xattrs, but that in itself is not a cause for alarm. It may be that no files have been created in that replicate pair (yet). xattrs get created on demand whenever necessary.

Comment 5 mohitanchlia 2011-05-24 14:26:59 UTC
(In reply to comment #4)
> It's hard to tell why that pair does not have afr xattrs, but that itself is
> not a cause for alarm. It may be that no files are created in that replicate
> pair (yet). xattrs get created on demand whenever necessary.

I can tell you that if I don't create those xattrs manually, they never get created on demand. What I have seen is that if I have a lot of bricks listed, possibly with multiple bricks on the same machine, I see this issue. I list a lot of bricks and then create directories numbered 1-30000. It should be easy to reproduce.

But I still don't understand the implication. I am thinking gluster will just not work as expected.

Comment 6 Anand Avati 2011-05-25 02:53:49 UTC
> But I still don't understand the implication? I am thinking gluster will just
> not work as expected.

Just so that we are on the same page, I'm assuming you are talking about the missing *afr* attributes, right? If so, missing xattrs will get created on demand when necessary (which is not necessarily on the next access).

Avati

Comment 7 mohitanchlia 2011-05-25 13:51:37 UTC
There are 2 issues I noted:

1) The afr attributes are not consistent on the mount point of the volume. In the example above, /data/gluster doesn't have all-'A' (i.e. all-zero) values when the volume is created.

2) The afr xattrs are missing on some bricks. Why are they created for some but not for others?

Both of these were checked right after creating a volume.

Is there a planned date to fix this bug?

Thanks

Comment 8 Anand Avati 2011-05-25 14:42:18 UTC
(In reply to comment #7)
> There are 2 issues I noted:
> 
> 1) Where afr attributes are not consistent on the mount point of the volume. In
> below eg: /data/gluster doesn't have all A's when volume is created.

To get the exact answer to that, please get the output of getfattr with "-e hex". The values in the output are base64 encoded by default and tricky to interpret. Even then, I suspect it is associated only with the metadata split-brain, as the changes are seen in the last 4 bytes (the metadata changelog). For now you can safely delete the attributes from the backend (setfattr -x), as we know it is benign in this case.
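To illustrate, a small sketch of interpreting a trusted.afr.* value without "-e hex", assuming the usual 12-byte changelog format of three big-endian 32-bit counters (data, metadata, entry); a non-zero value in the last 4 bytes would be the metadata changelog referred to above:

```python
import base64
import struct

def decode_afr_changelog(value):
    # getfattr prints base64-encoded values with a leading "0s"
    raw = base64.b64decode(value[2:] if value.startswith("0s") else value)
    # assumed format: pending data, metadata and entry operation counters
    data, metadata, entry = struct.unpack(">III", raw[:12])
    return {"data": data, "metadata": metadata, "entry": entry}

# value from the dsdb1 dump in this report: all zeroes, nothing pending
print(decode_afr_changelog("0sAAAAAAAAAAAAAAAA"))
# → {'data': 0, 'metadata': 0, 'entry': 0}
```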

> 2) "afr" are missing. Why are there afr created for some but not for others? 

Again, this is not a reason to be alarmed. There are well-explained reasons why files/directories need not have extended attributes on them. For example, immediately after a mkdir the attributes will be empty; the same holds when you first create a file (unless the mkdir utility performs an extra chmod/chown syscall after the mkdir syscall). This is normal. Changelogs (xattrs) are written on demand when necessary.

> Both of these were checked right after creating a volume.
> 
> Is there a planned date to fix this bug?

The fix for the metadata split-brain issue is already underway. For volumes that were already created, you will have to manually remove the xattrs from the backend with setfattr -x. The second issue you describe (missing xattrs) is not a bug.
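For illustration, the manual cleanup could look like this on a backend brick, using the attribute names and path from the dumps earlier in this report (a sketch only; adapt the names to your own volume before running anything):

```shell
# Remove the stale afr changelog xattrs from the brick root (backend path,
# not the client mount point). Attribute names are the ones seen on dsdb1.
setfattr -x trusted.afr.stress-volume-client-16 /data2/gluster
setfattr -x trusted.afr.stress-volume-client-17 /data2/gluster

# Verify they are gone
getfattr -dm - /data2/gluster
```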

Comment 9 Anand Avati 2011-05-30 08:49:09 UTC
PATCH: http://patches.gluster.com/patch/7271 in master (cluster/dht: notify should succeed when waiting for all subvols first event)

Comment 10 Anand Avati 2011-05-30 11:23:14 UTC
PATCH: http://patches.gluster.com/patch/7270 in master (cluster/afr: Send the first child up/down after all its children notify)

Comment 11 Anand Avati 2011-05-31 13:10:35 UTC
PATCH: http://patches.gluster.com/patch/7330 in master (pump: init last_event array to be used in afr_notify)

Comment 12 Anand Avati 2011-05-31 13:10:56 UTC
PATCH: http://patches.gluster.com/patch/7324 in release-3.1 (cluster/afr: Send the first child up/down after all its children notify)

Comment 13 Anand Avati 2011-05-31 13:11:00 UTC
PATCH: http://patches.gluster.com/patch/7325 in release-3.1 (cluster/dht: notify should succeed when waiting for all subvols first event)

Comment 14 Anand Avati 2011-05-31 13:11:27 UTC
PATCH: http://patches.gluster.com/patch/7332 in release-3.1 (pump: init last_event array to be used in afr_notify)

Comment 15 Anand Avati 2011-05-31 13:12:13 UTC
PATCH: http://patches.gluster.com/patch/7326 in release-3.2 (cluster/afr: Send the first child up/down after all its children notify)

Comment 16 Anand Avati 2011-05-31 13:12:19 UTC
PATCH: http://patches.gluster.com/patch/7327 in release-3.2 (cluster/dht: notify should succeed when waiting for all subvols first event)

Comment 17 Anand Avati 2011-05-31 13:12:43 UTC
PATCH: http://patches.gluster.com/patch/7331 in release-3.2 (pump: init last_event array to be used in afr_notify)

Comment 18 Pranith Kumar K 2011-06-01 00:08:19 UTC
The bug is that the replicate translator notifies the CHILD_UP event (brick process coming up) as soon as any of its children comes up or goes down, instead of waiting for all of its children to notify at least one event the very first time. These events are percolated up the graph (in this case to dht). The very first time the bricks come up, dht needs to set up the necessary xattrs on the bricks.
    When the NFS server is started along with the volumes, dht inside the NFS server attempts a setxattr, which is received only by the afr children that are up, leaving pending metadata changelogs against the children that are not yet up. If we take a 2x2 distributed-replicate setup with brick1, brick3 on server1 and brick2, brick4 on server2 (bricks 1/3 in one pair and bricks 2/4 in the other), the following behaviour leads to a split-brain.
    Let's assume the notification of a brick coming up reaches the NFS server running on the local machine before the remote one. When the volume is started, NFS on server1 does a setxattr on brick1 and NFS on server2 does a setxattr on brick3, leading to conflicting pending attributes on the other brick of each replica pair. This is the cause of the metadata split-brain. Similar behaviour can be observed for bricks 2/4.

Once the pending attributes conflict, they persist even across unmount and re-mount, so the user keeps seeing them until they are manually fixed.

The fix is to notify dht about the bricks coming up only after all the children of afr have reported their first up/down event.
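The behaviour described above can be sketched as a toy model (illustrative Python, not GlusterFS code; all names are invented): afr aggregates its children's up/down events, and only the aggregate should be forwarded to the parent (dht).

```python
class AfrNotify:
    """Toy model of afr's child-event notification towards dht."""
    def __init__(self, n_children, wait_for_all=True):
        self.reported = [None] * n_children  # last event seen per child
        self.wait_for_all = wait_for_all     # True = fixed behaviour
        self.forwarded = []                  # events dht would receive

    def notify(self, child, event):          # event is "UP" or "DOWN"
        self.reported[child] = event
        if self.wait_for_all and None in self.reported:
            return                           # hold until every child reports once
        # forward the aggregate: the subvolume is UP if any child is up
        self.forwarded.append("UP" if "UP" in self.reported else "DOWN")

# buggy (pre-fix) behaviour: dht hears about the pair after a single child
buggy = AfrNotify(2, wait_for_all=False)
buggy.notify(0, "UP")
print(buggy.forwarded)   # dht already acted on a partial picture

# fixed behaviour: nothing is forwarded until both children have reported
fixed = AfrNotify(2, wait_for_all=True)
fixed.notify(0, "UP")
fixed.notify(1, "UP")
print(fixed.forwarded)
```

In the buggy case dht issues its layout setxattr while one replica is still unreported, which is exactly the window in which the conflicting pending xattrs are written.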

Comment 19 mohitanchlia 2011-06-01 13:47:57 UTC
(In reply to comment #18)

Thanks for the details and for fixing it! Does it matter even if I am not mounting the client using NFS?

Comment 20 Pranith Kumar K 2011-06-02 02:53:23 UTC
(In reply to comment #19)
> Thanks for the details and for fixing it! Does it matter even if I am not
> mounting the client using NFS?

By default gluster starts an NFS server process whenever a volume is started, and that process tries to do the setxattr that causes this issue. So unless you disable NFS, there is a chance of this happening even if you don't mount using NFS.
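As a workaround sketch, the built-in NFS server can be turned off per volume with the standard nfs.disable volume option (volume name taken from this report; verify the option against your gluster version):

```shell
# Turn off the gluster-internal NFS server for this volume so its dht
# instance never issues the racy setxattr at volume start.
gluster volume set stress-volume nfs.disable on
```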

Comment 21 Pranith Kumar K 2011-08-22 04:32:16 UTC
(In reply to comment #20)

This bug is a race and very difficult to reproduce; I unit-tested it by instrumenting the code to force the race. After this fix, no user has reported the bug again, so I am going ahead and marking it as verified. Feel free to re-open if someone hits it again.