1125418 – Remove of replicate brick causes client errors

Keywords:
Status:	CLOSED EOL
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	fuse
Sub Component:
Version:	3.4.2
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	bugs@gluster.org
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2014-07-31 18:52 UTC by mark.dillon3
Modified:	2015-10-07 13:50 UTC (History)
CC List:	4 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2015-10-07 13:50:53 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
Logs for daemon, shd, brick and client (101.79 KB, application/octet-stream) 2014-07-31 18:52 UTC, mark.dillon3	no flags	Details

Description mark.dillon3 2014-07-31 18:52:33 UTC

Created attachment 923023 [details]
Logs for daemon, shd, brick and client

Description of problem:
Running "gluster volume remove-brick <vol> replica <n-1>" caused a fuse client that had the volume mounted during the removal to generate massive amounts of log data. After a remount, some files are still causing warnings about:

SETXATTR() 
GETXATTR()
ACCESS()

Version-Release number of selected component (if applicable):
3.4.2-1

How reproducible:
Not sure. I expect it is easy to reproduce, but I did not attempt it. We had a 3-brick replica volume and I wanted to make some disk changes under one of the bricks, so I removed one.

Steps to Reproduce:
1. Create a 3-brick replica volume and put some data on it
2. Mount the volume using the fuse client on Linux
3. Remove a brick from the volume and watch the fuse client logs
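
The steps above can be sketched as a small Python helper that only builds and prints the commands, without running them. This is an illustration under assumptions, not the reporter's exact procedure: the volume name, host names, brick paths and mount point are hypothetical, and the trailing "force" keyword is what later releases expect when reducing the replica count; the report does not confirm whether 3.4.2 needs it.

```python
import shlex


def reproduction_commands(volume, bricks, mount_server, mount_point):
    """Build the CLI commands for the three reproduction steps.

    All arguments are hypothetical placeholders; nothing is executed here.
    """
    replica = len(bricks)
    removed_brick = bricks[-1]  # assumption: the last brick is the one removed
    return [
        # Step 1: create and start a replicated volume.
        ["gluster", "volume", "create", volume,
         "replica", str(replica), *bricks],
        ["gluster", "volume", "start", volume],
        # Step 2: mount it with the fuse client.
        ["mount", "-t", "glusterfs", f"{mount_server}:/{volume}", mount_point],
        # Step 3: drop one brick and reduce the replica count by one.
        ["gluster", "volume", "remove-brick", volume,
         "replica", str(replica - 1), removed_brick, "force"],
    ]


if __name__ == "__main__":
    example = reproduction_commands(
        "testvol",
        ["host1:/bricks/testvol", "host2:/bricks/testvol", "host3:/bricks/testvol"],
        "host1",
        "/mnt/testvol",
    )
    for argv in example:
        print(shlex.join(argv))
```

Printing the commands instead of executing them keeps the sketch safe to run on a machine without Gluster installed; copying data onto the volume between steps 2 and 3 is left out.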

Actual results:


Expected results:


Additional info:
Gluster volume info datashare_volume 
Volume Name: datashare_volume
Type: Replicate
Volume ID: 56f56605-b6ca-41bd-bfa2-cebc0145c94a
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: balthasar-gluster:/zvol/gluster_datashare/datashare_volume
Brick2: melchior-gluster:/zvol/gluster_datashare/datashare_volume
Options Reconfigured:
server.statedump-path: /tmp/

getfattr -m . -d -e hex Thumbs.db (On Melchior)
# file: Thumbs.db
system.posix_acl_access=0x0200000001000600ffffffff04000000ffffffff080007009508e86308000500a108e86308000700ae08e86310000700ffffffff20000000ffffffff
trusted.afr.datashare_volume-client-0=0x000000000000000000000000
trusted.afr.datashare_volume-client-1=0x000000000000000000000000
trusted.afr.datashare_volume-client-2=0x000000000000000000000000
trusted.gfid=0x48bf86616b904332913fc1a6d7838ed2

getfattr -m . -d -e hex Thumbs.db (On Balthasar)
# file: Thumbs.db
system.posix_acl_access=0x0200000001000600ffffffff04000000ffffffff080007009508e86308000500a108e86308000700ae08e86310000700ffffffff20000000ffffffff
trusted.afr.datashare_volume-client-0=0x000000000000000000000000
trusted.afr.datashare_volume-client-1=0x000000000000000000000000
trusted.afr.datashare_volume-client-2=0x000000000000000000000000
trusted.gfid=0x48bf86616b904332913fc1a6d7838ed2
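
For anyone reading the dumps above, here is a minimal Python sketch that decodes the trusted.afr.* values. It assumes the usual AFR changelog layout of three big-endian 32-bit counters (pending data, metadata and entry operations). It also flags any trusted.afr.*-client-N key whose index is at or beyond the current brick count (2, per the volume info). Treating such a key as a leftover from the removed brick is an interpretation, not something the logs confirm. The function names are made up for this example.

```python
import struct

SAMPLE = """\
# file: Thumbs.db
trusted.afr.datashare_volume-client-0=0x000000000000000000000000
trusted.afr.datashare_volume-client-1=0x000000000000000000000000
trusted.afr.datashare_volume-client-2=0x000000000000000000000000
trusted.gfid=0x48bf86616b904332913fc1a6d7838ed2
"""


def parse_getfattr(text):
    """Turn `getfattr -d -e hex` output into a {name: hex_value} dict."""
    attrs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the "# file:" header
        name, _, value = line.partition("=")
        attrs[name] = value
    return attrs


def decode_afr_changelog(hex_value):
    """Split a 12-byte trusted.afr.* value into its three counters."""
    digits = hex_value[2:] if hex_value.startswith("0x") else hex_value
    raw = bytes.fromhex(digits)
    if len(raw) != 12:
        raise ValueError(f"expected 12 bytes, got {len(raw)}")
    data, metadata, entry = struct.unpack(">III", raw)
    return {"data": data, "metadata": metadata, "entry": entry}


def stale_afr_clients(attrs, brick_count):
    """List trusted.afr.* keys whose client index is >= brick_count."""
    stale = []
    for name in attrs:
        if name.startswith("trusted.afr."):
            index = int(name.rsplit("-", 1)[1])
            if index >= brick_count:
                stale.append(name)
    return stale


if __name__ == "__main__":
    attrs = parse_getfattr(SAMPLE)
    for name, value in attrs.items():
        if name.startswith("trusted.afr."):
            print(name, decode_afr_changelog(value))
    print("keys beyond brick count:", stale_afr_clients(attrs, brick_count=2))
```

All-zero counters mean no pending operations are recorded against that client, which matches the heal status reported later in this thread.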

Comment 1 mark.dillon3 2014-07-31 19:03:38 UTC

I just realized some of my logs go back further than I expected. The remove-brick would have taken place on 07/28/2014; anything earlier than that is likely related to past issues.

Comment 2 mark.dillon3 2014-07-31 19:22:34 UTC

I'd also like to note that Windows "Thumbs.db" files seem to be almost exclusively the ones affected for "SETATTR" (which accounts for the majority of the errors). Only one or two other directories have complained on "ACCESS" (but this could be unrelated?)

This info may not be significant but I figure more is always better.

Comment 3 Pranith Kumar K 2014-08-01 11:17:31 UTC

Hi Mark,
     Thanks for raising the bug. Are the applications erroring out because of these errors, or are these just warnings appearing in the logs?

Pranith

Comment 4 mark.dillon3 2014-08-01 18:18:01 UTC

Thanks Pranith, at first I thought it might simply be this
https://bugzilla.redhat.com/show_bug.cgi?id=1104861

With the help of JoeJulian I removed any 3rd AFR xattrs (assuming there were any) after stepping down from 3-way to 2-way replication online, but it appears the issue is still around.

The specific problem I'm seeing is in my client log it complains when trying to perform a SETXATTR. An example from today is 

On a temp file, probably from Excel or something:

[2014-08-01 17:25:30.162001] W [fuse-bridge.c:1172:fuse_err_cbk] 0-glusterfs-fuse: 3602757: SETXATTR() /datashare/accounting/Accounts Payable/AP REPORTS/2014 Reports/AP Totals/DBB2C7B6.tmp => -1 (Operation not permitted)
[2014-08-01 17:25:30.162124] W [fuse-bridge.c:993:fuse_setattr_cbk] 0-glusterfs-fuse: 3602758: SETATTR() /datashare/accounting/Accounts Payable/AP REPORTS/2014 Reports/AP Totals/DBB2C7B6.tmp => -1 (Operation not permitted)

And many Thumbs.db do this
[2014-08-01 17:43:19.244029] W [fuse-bridge.c:1172:fuse_err_cbk] 0-glusterfs-fuse: 3625342: SETXATTR() /datashare/engineering/Engineering-Operations/GoranB/Thumbs.db => -1 (Operation not permitted)
[2014-08-01 17:43:19.244226] W [fuse-bridge.c:993:fuse_setattr_cbk] 0-glusterfs-fuse: 3625343: SETATTR() /datashare/engineering/Engineering-Operations/GoranB/Thumbs.db => -1 (Operation not permitted)
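
To get a feel for how the warnings are distributed, lines like the ones above can be tallied with a short Python sketch. The regular expression is written against the four sample lines quoted in this comment only; other fuse-bridge messages may not match, and the function names are invented for this example.

```python
import posixpath
import re
from collections import Counter

# Pattern derived from the sample warnings in this report (an assumption,
# not a full grammar for GlusterFS client logs).
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<level>[A-Z]) "
    r"\[(?P<src>[^\]]+)\] \S+: (?P<req>\d+): "
    r"(?P<op>[A-Z]+)\(\) (?P<path>.+) => (?P<ret>-?\d+) \((?P<err>.+)\)$"
)

SAMPLE_LINES = [
    "[2014-08-01 17:25:30.162001] W [fuse-bridge.c:1172:fuse_err_cbk] 0-glusterfs-fuse: 3602757: SETXATTR() /datashare/accounting/Accounts Payable/AP REPORTS/2014 Reports/AP Totals/DBB2C7B6.tmp => -1 (Operation not permitted)",
    "[2014-08-01 17:25:30.162124] W [fuse-bridge.c:993:fuse_setattr_cbk] 0-glusterfs-fuse: 3602758: SETATTR() /datashare/accounting/Accounts Payable/AP REPORTS/2014 Reports/AP Totals/DBB2C7B6.tmp => -1 (Operation not permitted)",
    "[2014-08-01 17:43:19.244029] W [fuse-bridge.c:1172:fuse_err_cbk] 0-glusterfs-fuse: 3625342: SETXATTR() /datashare/engineering/Engineering-Operations/GoranB/Thumbs.db => -1 (Operation not permitted)",
    "[2014-08-01 17:43:19.244226] W [fuse-bridge.c:993:fuse_setattr_cbk] 0-glusterfs-fuse: 3625343: SETATTR() /datashare/engineering/Engineering-Operations/GoranB/Thumbs.db => -1 (Operation not permitted)",
]


def parse_line(line):
    """Return a dict of fields for a matching warning line, else None."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def tally_by_op(lines):
    """Count warnings per (operation, error message) pair."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record:
            counts[(record["op"], record["err"])] += 1
    return counts


def tally_by_file(lines):
    """Count warnings per file name (last path component)."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record:
            counts[posixpath.basename(record["path"])] += 1
    return counts


if __name__ == "__main__":
    for key, count in tally_by_op(SAMPLE_LINES).items():
        print(key, count)
    for name, count in tally_by_file(SAMPLE_LINES).most_common():
        print(name, count)
```

Run against the full client log, the per-file tally would show whether Thumbs.db really dominates the failures, as suggested in comment 2.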

I'm assuming Gluster is trying to set, add or remove an xattr on these files, but the logs don't indicate which one or why. So far, no files seem to be damaged.

Here are the xattrs for one of these files 
getfattr -m . -d -e hex ./Thumbs.db
# file: Thumbs.db
system.posix_acl_access=0x0200000001000600ffffffff04000000ffffffff080007009908e86308000500a508e86308000700b208e86310000700ffffffff20000000ffffffff
trusted.afr.datashare_volume-client-0=0x000000000000000000000000
trusted.afr.datashare_volume-client-1=0x000000000000000000000000
trusted.gfid=0x69edffaed25d49a9baa970a04e310441

None of the server logs seem to be complaining right now, not the glusterd, shd or brick logs just the client. 

I have not remounted the fuse mount since running the setfattr -x, but I'd be surprised if that were the cause.

The heal status says all is good: nothing healed recently, no split-brain, nothing healing.

Comment 5 Raghavendra Talur 2014-08-04 07:20:16 UTC

Hi Mark,

For the current error, can you remount the fuse mount with the acl option
(-o acl) and see if you still get the error messages? I have seen these
errors when "user.xyz" extended attributes are set over a fuse mount without the acl option.


Secondly,
Are you accessing the gluster volume through Samba over a fuse mount?
If yes, have you tried using the vfs plugin that we have? That is the recommended way from 3.4 onwards.

Get more details here
http://lalatendumohanty.wordpress.com/2014/02/11/using-glusterfs-with-samba-and-samba-vfs-plugin-for-glusterfs-on-fedora-20/

and packages here
http://download.gluster.org/pub/gluster/glusterfs/samba/

Comment 6 mark.dillon3 2014-08-05 15:59:37 UTC

Thanks for your response Raghavendra,

This is our fstab on the client (no, we are not using VFS presently). The client's job is basically to be a proxy SMB server for Windows clients while providing replicated storage over gluster.

# Mount Gluster Share
balthasar-gluster:/datashare_volume /mnt/datashare_gluster glusterfs defaults,acl,_netdev 0 0

This is one of our brick mounts (all are the same)
/dev/epoch/gluster_datashare-part1 /zvol/gluster_datashare ext4 acl,user_xattr,defaults
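
As a quick sanity check of the two entries above, a small Python sketch can split an fstab line into its fields and confirm that acl is among the mount options. It only inspects the text of the entry, not the options of the live mount, and the function names are hypothetical.

```python
CLIENT_ENTRY = (
    "balthasar-gluster:/datashare_volume /mnt/datashare_gluster "
    "glusterfs defaults,acl,_netdev 0 0"
)
BRICK_ENTRY = (
    "/dev/epoch/gluster_datashare-part1 /zvol/gluster_datashare "
    "ext4 acl,user_xattr,defaults"
)


def mount_options(fstab_line):
    """Return the option list (4th whitespace-separated field) of an fstab line."""
    fields = fstab_line.split()
    if len(fields) < 4:
        raise ValueError("fstab entry needs at least 4 fields")
    return fields[3].split(",")


def has_option(fstab_line, option):
    """True if the given option appears in the entry's option field."""
    return option in mount_options(fstab_line)


if __name__ == "__main__":
    for label, entry in (("client", CLIENT_ENTRY), ("brick", BRICK_ENTRY)):
        print(label, mount_options(entry), "acl:", has_option(entry, "acl"))
```

Both entries list acl, which is consistent with the statement that the volume was always mounted with it.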

While I can appreciate that VFS is recommended (and I will look into it, set it up, test it, etc.), this problem is new and only occurred after removing a replicate brick. At this point I understand that 3.4.x has issues with add/replace/remove brick while the volume is live (if I'm incorrect, please clarify).

However, we have remounted the volume more than once, it was always mounted with acl, and as far as I can tell (via getfattr) there is nothing unusual about the xattrs of the files that come up in our warning messages (warnings that never appeared before). So I'd have to guess that some damage has been done to metadata, but I'm not informed enough to find it.

At this point I'm looking at creating a new volume and migrating the data using rsync but I'd also really like to get to the bottom of these warnings.

Presently I have no knowledge of what fuse_setattr_cbk is trying to do that it either can't (or thinks it can't) do. Normally I would look elsewhere for a warning message like this (permission problems etc.), but in this case the warning has only occurred after the brick removal, and as far as I can tell there is no data damage or loss.

However, if xattrs are failing to update and they control replication or healing, data loss is definitely a fear, which is what prompts us to consider moving to a new volume (yes, we have backups).

Please let me know if there is any further data I can provide.

Comment 7 mark.dillon3 2014-09-30 14:52:37 UTC

This bug likely still exists in 3.4.2-1. However, I have migrated all of my servers/clients to 3.5.2. As far as I know it is not safe to add or remove bricks on a live volume in 3.4.2-1 if files are open.

Comment 8 Niels de Vos 2015-05-17 22:00:21 UTC

GlusterFS 3.7.0 has been released (http://www.gluster.org/pipermail/gluster-users/2015-May/021901.html), and the Gluster project maintains N-2 supported releases. The last two releases before 3.7 are still maintained, at the moment these are 3.6 and 3.5.

This bug has been filed against the 3.4 release, and will not get fixed in a 3.4 version any more. Please verify if newer versions are affected by the reported problem. If that is the case, update the bug with a note, and update the version if you can. In case updating the version is not possible, leave a comment in this bug report with the version you tested, and set the "Need additional information the selected bugs from" below the comment box to "bugs".

If there is no response by the end of the month, this bug will get automatically closed.

Comment 9 Kaleb KEITHLEY 2015-10-07 13:50:53 UTC

GlusterFS 3.4.x has reached end-of-life.

If this bug still exists in a later release please reopen this and change the version or open a new bug.
