| Summary: | Inconsistent xattr values when creating bricks | | |
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | mohitanchlia |
| Component: | replicate | Assignee: | Pranith Kumar K <pkarampu> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | |
| Severity: | medium | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.1.3 | CC: | aavati, gluster-bugs, jdarcy, rabhat, vijay |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | master, release-3.2 | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
mohitanchlia
2011-05-02 18:03:03 UTC
Is there any estimated time for this fix?

(In reply to comment #1)
> Is there any estimated time for this fix?

We have a highly probable theory for this bug and are working on the fix. The fix should be available soon. In the meantime, it is safe to ignore this error log, even though it is annoying.

Avati

There are 2 issues I highlighted below. Can you please tell me the worst-case implication of these? Thanks!

It's hard to tell why that pair does not have afr xattrs, but that in itself is not a cause for alarm. It may be that no files have been created in that replicate pair (yet). xattrs get created on demand whenever necessary.

(In reply to comment #4)
> It's hard to tell why that pair does not have afr xattrs, but that in itself
> is not a cause for alarm. It may be that no files have been created in that
> replicate pair (yet). xattrs get created on demand whenever necessary.

I can tell you that if I don't create those xattrs manually, they never get created on demand. What I have seen is that if I have a lot of bricks listed, and maybe multiple bricks on the same machine, I see this issue. I create a volume where a lot of bricks are listed and then create directories from 1 to 30000. It should be easy to reproduce.

But I still don't understand the implication? I am thinking gluster will just not work as expected.
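For readers trying to reproduce this, here is a minimal sketch of the setup the reporter describes; the volume name, server names, brick paths, and mount point are assumptions, not taken from the report:

```sh
# Hypothetical reproduction sketch: a replicated volume with several bricks,
# then a large number of directory creations through a native client mount.
gluster volume create testvol replica 2 \
    server1:/export/brick1 server2:/export/brick2 \
    server1:/export/brick3 server2:/export/brick4
gluster volume start testvol

mount -t glusterfs server1:/testvol /mnt/testvol

# Create many directories, as described above.
for i in $(seq 1 30000); do
    mkdir "/mnt/testvol/dir$i"
done
```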
> But I still don't understand the implication? I am thinking gluster will just
> not work as expected.
Just so that we are on the same page, I'm assuming you are talking about the missing *afr* attributes, right? If so, missing xattrs will get created on demand when necessary (which is not necessarily on the next access).
Avati
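To check whether the afr changelog xattrs actually exist, they can be read directly off a brick directory with getfattr (from the attr package), using the -e hex flag recommended later in this report. A sketch, assuming a hypothetical brick path:

```sh
# List afr changelog xattrs on the backend brick directory (not the mount).
# The brick path is an assumption; adjust to your layout. On replicated
# volumes these attributes are typically named trusted.afr.<volume>-client-<N>.
# -e hex prints raw hex values, which are easier to read than the default
# base64 encoding.
getfattr -d -m 'trusted.afr' -e hex /export/brick1

# An empty result is not necessarily a problem: as noted above, these xattrs
# are created on demand, so untouched directories may have none.
```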
There are 2 issues I noted:

1) The afr attributes are not consistent on the mount point of the volume. In the example below, /data/gluster doesn't have all A's when the volume is created.

2) "afr" xattrs are missing. Why are they created for some but not for others?

Both of these were checked right after creating a volume.

Is there a planned date to fix this bug? Thanks

(In reply to comment #7)
> There are 2 issues I noted:
>
> 1) The afr attributes are not consistent on the mount point of the volume. In
> the example below, /data/gluster doesn't have all A's when the volume is
> created.

To get the exact answer to that, please get the output of getfattr with "-e hex". The values in the output are base64-encoded by default and tricky to interpret. Even then, I suspect it is only associated with the metadata split-brain (the changes are seen in the last 4 bytes, i.e. the metadata changelog). For now you can safely delete the attributes from the backend (setfattr -x), as we know they are benign in this case.

> 2) "afr" xattrs are missing. Why are they created for some but not for others?

Again, this is not a reason to be alarmed. There are well-understood reasons why files/directories need not have extended attributes on them. For example, right after a mkdir the attributes will be empty, and the same is true when you first create a file (unless the mkdir utility performs an extra chmod/chown syscall after the mkdir syscall). This is normal. Changelogs (xattrs) are written on demand where necessary.

> Both of these were checked right after creating a volume.
>
> Is there a planned date to fix this bug?

The fix for the metadata split-brain issue is already underway. For volumes already created you will have to manually remove the xattrs from the backend with setfattr -x. The second issue you describe (missing xattrs) is not a bug.

PATCH: http://patches.gluster.com/patch/7271 in master (cluster/dht: notify should succeed when waiting for all subvols first event)
PATCH: http://patches.gluster.com/patch/7270 in master (cluster/afr: Send the first child up/down after all its children notify)
PATCH: http://patches.gluster.com/patch/7330 in master (pump: init last_event array to be used in afr_notify)
PATCH: http://patches.gluster.com/patch/7324 in release-3.1 (cluster/afr: Send the first child up/down after all its children notify)
PATCH: http://patches.gluster.com/patch/7325 in release-3.1 (cluster/dht: notify should succeed when waiting for all subvols first event)
PATCH: http://patches.gluster.com/patch/7332 in release-3.1 (pump: init last_event array to be used in afr_notify)
PATCH: http://patches.gluster.com/patch/7326 in release-3.2 (cluster/afr: Send the first child up/down after all its children notify)
PATCH: http://patches.gluster.com/patch/7327 in release-3.2 (cluster/dht: notify should succeed when waiting for all subvols first event)
PATCH: http://patches.gluster.com/patch/7331 in release-3.2 (pump: init last_event array to be used in afr_notify)

The bug is that the replicate translator notifies the CHILD_UP event (brick process coming up) as soon as any of its children comes up or goes down, instead of waiting until all of its children have notified at least one event for the very first time. These events are percolated up the graph (in this case to dht). The very first time the bricks come up, dht needs to set up the necessary xattrs on the bricks.
When the NFS server is started along with the volumes, dht inside the NFS server attempts a setxattr, which is received only by the afr children that are already up, leaving pending metadata changelogs aimed at the children that are not yet up. Take a 2x2 distribute-replicate setup with brick1 and brick2 on server1 and brick3 and brick4 on server2, where brick1/brick3 form one replica pair and brick2/brick4 the other; the following behaviour leads to a split-brain.
Let us assume that the notification of a brick coming up reaches the NFS server running on the local machine before it reaches the remote machine's NFS server. When the volume is started, NFS on server1 does the setxattr on brick1 while NFS on server2 does it on brick3, leaving conflicting pending attributes on the other brick of the replica pair. This is the cause of the metadata split-brain. The same behaviour can be observed for brick2/brick4.
Once the pending attributes conflict, they persist even if you unmount and re-mount, so the user will keep seeing the pending xattrs until they are fixed manually.
The fix is to notify dht that the bricks are up only after all the children of afr have reported their first up/down event.
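Following the setfattr -x advice given earlier in this report, here is a hedged cleanup sketch for bricks already affected by this race; the brick path and xattr names are assumptions, so list the attributes first and remove exactly what you see:

```sh
# Inspect the pending changelog xattrs on each brick of the affected pair.
getfattr -d -m 'trusted.afr' -e hex /export/brick1

# Remove the conflicting entries reported above. The names here are
# hypothetical examples of the usual trusted.afr.<volume>-client-<N> form.
setfattr -x trusted.afr.testvol-client-0 /export/brick1
setfattr -x trusted.afr.testvol-client-1 /export/brick1
```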
(In reply to comment #18)
> The bug is that the replicate translator notifies the CHILD_UP event (brick
> process coming up) as soon as any of its children comes up or goes down,
> instead of waiting until all of its children have notified at least one event
> for the very first time. These events are percolated up the graph (in this
> case to dht). The very first time the bricks come up, dht needs to set up the
> necessary xattrs on the bricks.
> When the NFS server is started along with the volumes, dht inside the NFS
> server attempts a setxattr, which is received only by the afr children that
> are already up, leaving pending metadata changelogs aimed at the children
> that are not yet up. Take a 2x2 distribute-replicate setup with brick1 and
> brick2 on server1 and brick3 and brick4 on server2, where brick1/brick3 form
> one replica pair and brick2/brick4 the other; the following behaviour leads
> to a split-brain.
> Let us assume that the notification of a brick coming up reaches the NFS
> server running on the local machine before it reaches the remote machine's
> NFS server. When the volume is started, NFS on server1 does the setxattr on
> brick1 while NFS on server2 does it on brick3, leaving conflicting pending
> attributes on the other brick of the replica pair. This is the cause of the
> metadata split-brain. The same behaviour can be observed for brick2/brick4.
> Once the pending attributes conflict, they persist even if you unmount and
> re-mount, so the user will keep seeing the pending xattrs until they are
> fixed manually.
> The fix is to notify dht that the bricks are up only after all the children
> of afr have reported their first up/down event.

Thanks for the details and for fixing it! Does it matter even if I am not mounting the client using NFS?
By default gluster starts an NFS server process whenever a volume is started, and that process performs the setxattr that causes this issue. So, unless you disable NFS, there is a chance of hitting this even if you don't mount the volume using NFS.

(In reply to comment #20)
> (In reply to comment #19)
> > (In reply to comment #18)
> > > The bug is that the replicate translator notifies the CHILD_UP event
> > > (brick process coming up) as soon as any of its children comes up or goes
> > > down, instead of waiting until all of its children have notified at least
> > > one event for the very first time. These events are percolated up the
> > > graph (in this case to dht). The very first time the bricks come up, dht
> > > needs to set up the necessary xattrs on the bricks.
> > > When the NFS server is started along with the volumes, dht inside the NFS
> > > server attempts a setxattr, which is received only by the afr children
> > > that are already up, leaving pending metadata changelogs aimed at the
> > > children that are not yet up. Take a 2x2 distribute-replicate setup with
> > > brick1 and brick2 on server1 and brick3 and brick4 on server2, where
> > > brick1/brick3 form one replica pair and brick2/brick4 the other; the
> > > following behaviour leads to a split-brain.
> > > Let us assume that the notification of a brick coming up reaches the NFS
> > > server running on the local machine before it reaches the remote
> > > machine's NFS server. When the volume is started, NFS on server1 does the
> > > setxattr on brick1 while NFS on server2 does it on brick3, leaving
> > > conflicting pending attributes on the other brick of the replica pair.
> > > This is the cause of the metadata split-brain. The same behaviour can be
> > > observed for brick2/brick4.
> > > Once the pending attributes conflict, they persist even if you unmount
> > > and re-mount, so the user will keep seeing the pending xattrs until they
> > > are fixed manually.
> > > The fix is to notify dht that the bricks are up only after all the
> > > children of afr have reported their first up/down event.
> >
> > Thanks for the details and for fixing it! Does it matter even if I am not
> > mounting the client using NFS?
>
> By default gluster starts an NFS server process whenever a volume is started,
> and that process performs the setxattr that causes this issue. So, unless you
> disable NFS, there is a chance of hitting this even if you don't mount the
> volume using NFS.

This bug is a race and is very difficult to reproduce; I unit-tested the fix by instrumenting the code to force the race. After this fix, no user has reported the bug again, so I am going ahead and marking it as verified. Feel free to re-open if someone hits it again.
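Since the setxattr that triggers the race comes from the NFS server process that gluster starts by default, one workaround on releases without the fix is to turn that process off when NFS access is not needed. A sketch, assuming the volume is named testvol and that the nfs.disable volume option is available on your release:

```sh
# Disable the built-in gluster NFS server for this volume (only if you do not
# need NFS mounts); this removes the extra dht instance that raced on the
# first setxattr.
gluster volume set testvol nfs.disable on

# Confirm the option is set.
gluster volume info testvol
```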