Description of problem:
gluster hangs when stopping a volume, and eventually all gluster daemons crashed.
Had to restart glusterfs on all nodes at once.
[2014-07-23 20:55:44.856802] E [glusterd-utils.c:4124:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/60f2a337cd91647ecd6362967fcc955c.socket error: Permission denied
[2014-07-23 20:55:45.862979] E [glusterd-utils.c:4124:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/552ffa0e242965667b3dca46bd7dbda4.socket error: No such file or directory
[2014-07-23 20:55:46.868145] E [glusterd-utils.c:4124:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/e472f475caa7b04ff9d3cc9465979ed9.socket error: No such file or directory
[2014-07-23 20:55:47.381313] E [glusterd-op-sm.c:3886:glusterd_op_ac_stage_op] 0-management: Stage failed on operation 'Volume Heal', Status : -1
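For anyone triaging a longer glusterd log, a rough sketch of how the socket unlink failures above could be tallied by error text. This is only an illustration: the regular expression is an assumption based on the line layout seen in this report, and the helper names (parse_line, count_unlink_errors) are made up here, not part of glusterfs.

```python
import re
from collections import Counter

# Assumed layout, taken from the lines quoted in this report:
#   [timestamp] LEVEL [file:line:function] domain: message
# The leading "[" is optional so that truncated pastes still match.
LINE_RE = re.compile(
    r"^\[?(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\]\s+"
    r"(?P<level>[A-Z])\s+"
    r"\[(?P<src>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\]\s+"
    r"(?P<domain>\S+):\s+(?P<msg>.*)$"
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    return m.groupdict() if m else None


def count_unlink_errors(lines):
    """Count socket unlink failures, grouped by the text after 'error: '."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec is None or rec["func"] != "glusterd_nodesvc_unlink_socket_file":
            continue
        _, sep, err = rec["msg"].rpartition(" error: ")
        if sep:
            counts[err] += 1
    return counts


SAMPLE = [
    "[2014-07-23 20:55:44.856802] E [glusterd-utils.c:4124:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/60f2a337cd91647ecd6362967fcc955c.socket error: Permission denied",
    "[2014-07-23 20:55:45.862979] E [glusterd-utils.c:4124:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/552ffa0e242965667b3dca46bd7dbda4.socket error: No such file or directory",
    "[2014-07-23 20:55:46.868145] E [glusterd-utils.c:4124:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/e472f475caa7b04ff9d3cc9465979ed9.socket error: No such file or directory",
    "[2014-07-23 20:55:47.381313] E [glusterd-op-sm.c:3886:glusterd_op_ac_stage_op] 0-management: Stage failed on operation 'Volume Heal', Status : -1",
]

if __name__ == "__main__":
    for err, n in count_unlink_errors(SAMPLE).items():
        print(n, err)
```

Separating "Permission denied" from "No such file or directory" may help tell a real permissions problem on /var/run apart from sockets that were already gone.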
Version-Release number of selected component (if applicable): 3.5.1
Steps to Reproduce:
1. gluster volume stop $vol (force)
gluster hangs and all gluster daemons crash.
Had to restart glusterfs.
Can you attach the complete glusterd logs and, if possible, let us know the steps carried out before executing volume stop?
Also, please add the volume type, volume configuration, and backend information. It would help us recreate the issue.
Number of Bricks: 3 x 2 = 6
XFS on Ubuntu 12.04
Total volume size is 17TB, with 30% used.
We share the volume directories over NFS with directory quotas.
Today I tried to remove another volume, but this time I disabled quota first.
When I disabled quota, the quota crawl took forever.
While the crawl was still running, I ran gluster volume stop, and this time the stop returned immediately without hanging.
We have been experiencing quota usage misreporting over time once an NFS export has been in use for a while.
We are starting to see more and more xattr errors and unlink errors.
The log is too big to upload.
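If the full log cannot be attached, one option is to cut it down to the error and warning lines around the time of the hang. A minimal sketch, assuming every entry starts with "[YYYY-MM-DD HH:MM:SS.ffffff] L " as in the lines quoted in this report; the function name filter_log and the default level set are assumptions for illustration only. Continuation lines that do not start with a timestamp are dropped by this sketch.

```python
import re
from datetime import datetime

# Assumed prefix: "[2014-07-25 00:04:26.011239] W " (timestamp, then one-letter level).
PREFIX_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6})\] ([A-Z]) ")


def filter_log(lines, start, end, levels=("E", "W", "C")):
    """Yield lines inside [start, end] whose level letter is in `levels`."""
    for line in lines:
        m = PREFIX_RE.match(line)
        if m is None:
            continue  # not a timestamped entry
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
        if start <= ts <= end and m.group(2) in levels:
            yield line


if __name__ == "__main__":
    import sys

    # Usage: python filter_log.py < big.log > small.log
    # The window below is an example value; adjust it to the incident time.
    window_start = datetime(2014, 7, 23, 20, 50)
    window_end = datetime(2014, 7, 23, 21, 0)
    for kept in filter_log(sys.stdin, window_start, window_end):
        sys.stdout.write(kept)
```

The reduced file can then be compressed before uploading.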
After removing a volume, we are getting these in nfs.log:
[2014-07-25 00:04:26.011239] W [nfs3-helpers.c:3401:nfs3_log_common_res] 0-nfs-nfsv3: XID: 9d0f866, FSSTAT: NFS: 70(Invalid file handle), POSIX: 14(Bad address)
[2014-07-25 00:04:26.011336] E [nfs3.c:301:__nfs3_get_volume_id] (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.5.1/xlator/nfs/server.so(nfs3_fsstat+0x1be) [0x7f3a34534d7e] (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.5.1/xlator/nfs/server.so(nfs3_fsstat_reply+0x3b) [0x7f3a3453465b] (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.5.1/xlator/nfs/server.so(nfs3_request_xlator_deviceid+0x78) [0x7f3a34527078]))) 0-nfs-nfsv3: invalid argument: xl
[2014-07-25 00:04:26.012372] W [nfs3-helpers.c:3401:nfs3_log_common_res] 0-nfs-nfsv3: XID: ad0f866, FSSTAT: NFS: 70(Invalid file handle), POSIX: 14(Bad address)
Seems like a candidate to be analysed by the Quota team; assigning it to Varun Shastry, who looks after Quota.
Today I ran into a similar issue with quota, just by running a du against a replica 2 volume.
It is similar to what was running when we removed a volume with quota: the find/setfattr scan ran over the volume forever.
It was actually from disk_usage_sync.sh in extras.
The du -bc against the replica 2 volume crashed the bricks.
Attaching the gluster logs to this case.
Created attachment 922240
gluster log from when the bricks crashed while running du
Looking deeper into the script, it seems the setfattr is what crashes the bricks.
The setfattr action that crashes the bricks is very similar to what happens when we disable quota and the bricks crash, and also to the hang when removing a volume.
This bug is getting closed because 3.5 is marked End-Of-Life. There will be no further updates to this version. Please open a new bug against a version that still receives bugfixes if you are still facing this issue in a more current release.