Bug 1122732

Summary: remove volume hangs glusterfs
Product: [Community] GlusterFS
Reporter: Peter Auyeung <pauyeung>
Component: glusterd
Assignee: Atin Mukherjee <amukherj>
Status: CLOSED EOL
QA Contact:
Severity: high
Docs Contact:
Priority: high
Version: 3.5.1
CC: amukherj, bugs, hgowtham, pauyeung, sasundar
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-06-17 15:58:30 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Attachments:
gluster log during brick crashed when running du (Flags: none)

Description Peter Auyeung 2014-07-23 23:22:32 UTC
Description of problem:

gluster hung when stopping a volume, and eventually all gluster daemons crashed.
Had to restart glusterfs at once on all nodes.
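
For reference, recovery here meant restarting the gluster services on every node, along these lines (a sketch assuming the stock glusterfs-server init script on Ubuntu 12.04):

    # run on each node in the cluster
    sudo service glusterfs-server restart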

[2014-07-23 20:55:44.856802] E [glusterd-utils.c:4124:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/60f2a337cd91647ecd6362967fcc955c.socket error: Permission denied
[2014-07-23 20:55:45.862979] E [glusterd-utils.c:4124:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/552ffa0e242965667b3dca46bd7dbda4.socket error: No such file or directory
[2014-07-23 20:55:46.868145] E [glusterd-utils.c:4124:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/e472f475caa7b04ff9d3cc9465979ed9.socket error: No such file or directory
[2014-07-23 20:55:47.381313] E [glusterd-op-sm.c:3886:glusterd_op_ac_stage_op] 0-management: Stage failed on operation 'Volume Heal', Status : -1

Version-Release number of selected component (if applicable): 3.5.1


How reproducible:


Steps to Reproduce:
1. gluster volume stop $vol (force) 
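
A minimal sketch of the command in step 1, with and without force ($vol is a placeholder for the affected volume):

    vol=somevolume                  # hypothetical volume name
    gluster volume stop $vol        # hangs as described below
    gluster volume stop $vol force  # the (force) variant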

Actual results:
gluster hung and all gluster daemons crashed;
had to restart glusterfs.

Expected results:
The volume stops cleanly.

Additional info:

Comment 1 Atin Mukherjee 2014-07-24 04:30:12 UTC
Can you attach the complete glusterd logs and, if possible, let us know the steps carried out before executing the volume stop?

Comment 2 SATHEESARAN 2014-07-24 06:56:14 UTC
Peter,

Please also add the volume type, volume configuration, and backend information; it would help us recreate the issue.

Comment 3 Peter Auyeung 2014-07-25 00:21:22 UTC
Type: Distributed-Replicate
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp

XFS on ubuntu 12.04

Total volume size 17TB presented with 30% used.

Options Reconfigured:
features.quota: on
nfs.export-volumes: off
nfs.export-dirs: on
features.quota-deem-statfs: on
nfs.drc: off
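
For reference, a volume with this shape and these options would have been created and configured along these lines (node names and brick paths are hypothetical; features.quota: on corresponds to the quota enable command):

    gluster volume create $vol replica 2 \
        n1:/bricks/b0 n2:/bricks/b0 \
        n1:/bricks/b1 n2:/bricks/b1 \
        n1:/bricks/b2 n2:/bricks/b2    # 3 x 2 = 6 bricks
    gluster volume start $vol
    gluster volume quota $vol enable   # features.quota: on
    gluster volume set $vol nfs.export-volumes off
    gluster volume set $vol nfs.export-dirs on
    gluster volume set $vol features.quota-deem-statfs on
    gluster volume set $vol nfs.drc off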

We share the volume directories over NFS with directory quotas.

Today I tried to remove another volume, but this time I disabled the quota first.

When I disabled quota, the quota crawl took forever.

While the crawl was still running, I did a gluster volume stop, and this time the stop returned immediately without hanging.

We have been experiencing quota usage misreporting over time once an NFS export has been in use for a while.
We started seeing more and more xattr errors and unlink errors.
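
For reference, the sequence described above would look something like this ($vol is a placeholder; the quota disable is what kicks off the long cleanup crawl):

    gluster volume quota $vol disable   # starts the crawl that takes forever
    # while the crawl is still running:
    gluster volume stop $vol            # returned immediately this time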

Comment 4 Peter Auyeung 2014-07-25 00:22:51 UTC
The log is too big to upload.

After removing a volume, we are getting these in nfs.log:

[2014-07-25 00:04:26.011239] W [nfs3-helpers.c:3401:nfs3_log_common_res] 0-nfs-nfsv3: XID: 9d0f866, FSSTAT: NFS: 70(Invalid file handle), POSIX: 14(Bad address)
[2014-07-25 00:04:26.011336] E [nfs3.c:301:__nfs3_get_volume_id] (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.5.1/xlator/nfs/server.so(nfs3_fsstat+0x1be) [0x7f3a34534d7e] (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.5.1/xlator/nfs/server.so(nfs3_fsstat_reply+0x3b) [0x7f3a3453465b] (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.5.1/xlator/nfs/server.so(nfs3_request_xlator_deviceid+0x78) [0x7f3a34527078]))) 0-nfs-nfsv3: invalid argument: xl
[2014-07-25 00:04:26.012372] W [nfs3-helpers.c:3401:nfs3_log_common_res] 0-nfs-nfsv3: XID: ad0f866, FSSTAT: NFS: 70(Invalid file handle), POSIX: 14(Bad address)

Comment 5 Atin Mukherjee 2014-07-25 04:43:30 UTC
This seems like a candidate to be analysed by the Quota team; assigning it to Varun Shastry, who looks after Quota.

Comment 6 Peter Auyeung 2014-07-29 17:18:38 UTC
Today I ran into a similar quota issue on this setup just by doing a du against a replica 2 volume.

It's similar to what was running when removing a volume with quota: the find/setfattr pass that scans the volume forever.

It was actually from the disk_usage_sync.sh script in extras.

The du -bc against the replica 2 volume crashed the bricks.

Attaching the gluster logs to this case.

Comment 7 Peter Auyeung 2014-07-29 17:25:11 UTC
Created attachment 922240 [details]
gluster log during brick crashed when running du

Comment 8 Peter Auyeung 2014-07-29 17:34:35 UTC
Looking deeper at the script, it seems it's the setfattr that crashes the bricks.
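
For context, the pattern in question is roughly the following (a minimal sketch, not the actual extras script: the mount point, the directory walk, and the trusted.glusterfs.quota.size xattr name are assumptions based on the comments above):

    #!/bin/bash
    # Walk the exported directories of the mounted volume, measure
    # usage with du, and write the result back into a quota xattr.
    mnt=/mnt/gvol                                    # hypothetical mount point
    for dir in "$mnt"/*/; do
        usage=$(du -bc "$dir" | tail -1 | cut -f1)   # the du -bc step that crashed the bricks
        setfattr -n trusted.glusterfs.quota.size \
                 -v "$usage" "$dir"                  # the suspected setfattr step
    done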

Comment 9 Peter Auyeung 2014-07-29 17:39:24 UTC
The setfattr action that crashes the bricks is very similar to what happens when we disable quota and crash the bricks, and it is also the same as the hang when removing a volume.

Comment 11 Niels de Vos 2016-06-17 15:58:30 UTC
This bug is being closed because the 3.5 release is marked End-Of-Life. There will be no further updates to this version. Please open a new bug against a version that still receives bugfixes if you are still facing this issue in a more current release.