Bug 1122732 - Removing a volume hangs glusterfs
Summary: Removing a volume hangs glusterfs
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: 3.5.1
Hardware: x86_64
OS: Linux
Target Milestone: ---
Assignee: Atin Mukherjee
QA Contact:
Depends On:
Reported: 2014-07-23 23:22 UTC by Peter Auyeung
Modified: 2016-06-17 15:58 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Last Closed: 2016-06-17 15:58:30 UTC
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:

Attachments
gluster log during brick crashed when running du (917.98 KB, application/x-gzip)
2014-07-29 17:25 UTC, Peter Auyeung

Description Peter Auyeung 2014-07-23 23:22:32 UTC
Description of problem:

Gluster hangs when stopping a volume, and eventually all gluster daemons crashed.
We had to restart glusterfs on all nodes at once.

[2014-07-23 20:55:44.856802] E [glusterd-utils.c:4124:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/60f2a337cd91647ecd6362967fcc955c.socket error: Permission denied
[2014-07-23 20:55:45.862979] E [glusterd-utils.c:4124:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/552ffa0e242965667b3dca46bd7dbda4.socket error: No such file or directory
[2014-07-23 20:55:46.868145] E [glusterd-utils.c:4124:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/e472f475caa7b04ff9d3cc9465979ed9.socket error: No such file or directory
[2014-07-23 20:55:47.381313] E [glusterd-op-sm.c:3886:glusterd_op_ac_stage_op] 0-management: Stage failed on operation 'Volume Heal', Status : -1
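
For anyone triaging a larger glusterd log, the errors above can be summarised with a short script. This is only a sketch: the regular expression is an assumption inferred from the four lines quoted in this report (timestamp, level letter, `file:line:function`, domain, message), not an official description of the log format, and the helper names `parse_line` and `socket_unlink_errors` are made up for illustration.

```python
import re
from collections import Counter

# Assumed layout, inferred from the lines quoted in this report:
# [timestamp] LEVEL [file.c:line:function] domain: message
# The leading "[" is optional because it is missing in some pasted lines.
LOG_RE = re.compile(
    r"^\[?(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\]\s+"
    r"(?P<level>[A-Z])\s+"
    r"\[(?P<src>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\]\s+"
    r"(?P<domain>\S+):\s+(?P<msg>.*)$"
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    m = LOG_RE.match(line.strip())
    return m.groupdict() if m else None


def socket_unlink_errors(lines):
    """Count socket-unlink failures, grouped by the reason after 'error: '."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec is None or rec["func"] != "glusterd_nodesvc_unlink_socket_file":
            continue
        reason = rec["msg"].rsplit("error: ", 1)[-1]
        counts[reason] += 1
    return counts


# The four lines quoted in the description.
sample = [
    "[2014-07-23 20:55:44.856802] E [glusterd-utils.c:4124:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/60f2a337cd91647ecd6362967fcc955c.socket error: Permission denied",
    "[2014-07-23 20:55:45.862979] E [glusterd-utils.c:4124:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/552ffa0e242965667b3dca46bd7dbda4.socket error: No such file or directory",
    "[2014-07-23 20:55:46.868145] E [glusterd-utils.c:4124:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/e472f475caa7b04ff9d3cc9465979ed9.socket error: No such file or directory",
    "[2014-07-23 20:55:47.381313] E [glusterd-op-sm.c:3886:glusterd_op_ac_stage_op] 0-management: Stage failed on operation 'Volume Heal', Status : -1",
]

if __name__ == "__main__":
    for reason, count in socket_unlink_errors(sample).items():
        print(f"{count} x {reason}")
```

Grouping by reason separates the single "Permission denied" failure from the "No such file or directory" ones, which may point at different causes.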

Version-Release number of selected component (if applicable): 3.5.1

How reproducible:

Steps to Reproduce:
1. gluster volume stop $vol (force) 

Actual results:
Gluster hangs and all gluster daemons crashed.
We had to restart glusterfs.

Expected results:
The volume stops.

Additional info:

Comment 1 Atin Mukherjee 2014-07-24 04:30:12 UTC
Can you attach the complete glusterd logs and, if possible, let us know the steps carried out before executing volume stop?

Comment 2 SATHEESARAN 2014-07-24 06:56:14 UTC

Also, please add the type of volume, the volume configuration, and backend information. It would help us recreate the issue.

Comment 3 Peter Auyeung 2014-07-25 00:21:22 UTC
Type: Distributed-Replicate
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp

XFS on Ubuntu 12.04

Total volume size is 17 TB, with 30% used.

Options Reconfigured:
features.quota: on
nfs.export-volumes: off
nfs.export-dirs: on
features.quota-deem-statfs: on
nfs.drc: off

We share the volume directories over NFS with directory quotas.

Today I tried to remove another volume, but this time I disabled the quota first.

When I disabled quota, the quota crawl took forever.

While the crawl was still running, I did a gluster volume stop, and this time the stop returned immediately without hanging.
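
The order of operations described here (disable quota, then stop the volume) can be written down as a small script for anyone trying to reproduce it. This is a sketch under assumptions, not a recommended fix: the helper names `build_commands` and `run` and the volume name `demo-vol` are hypothetical, and only the `gluster volume quota <VOLNAME> disable` and `gluster volume stop <VOLNAME> [force]` commands come from the gluster CLI.

```python
import subprocess


def build_commands(volume, disable_quota_first=True, force=False):
    """Build the gluster CLI calls in the order used in this comment."""
    cmds = []
    if disable_quota_first:
        # Disable quota before stopping, as tried in this comment.
        cmds.append(["gluster", "volume", "quota", volume, "disable"])
    stop = ["gluster", "volume", "stop", volume]
    if force:
        stop.append("force")
    cmds.append(stop)
    return cmds


def run(volume, dry_run=True, **kwargs):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_commands(volume, **kwargs):
        print(" ".join(cmd))
        if not dry_run:
            # The CLI may ask for confirmation on stdin; check=True raises
            # if a command exits with a non-zero status.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "demo-vol" is a placeholder volume name; dry run only prints.
    run("demo-vol", dry_run=True)
```

With `dry_run=True` nothing is executed, so the sequence can be reviewed before it is run against a real cluster.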

We have been experiencing quota usage misreporting over time once an NFS export has been in use for a while.
We are starting to see more and more xattr errors and unlink errors.

Comment 4 Peter Auyeung 2014-07-25 00:22:51 UTC
The log is too big to upload.

After removing a volume, we are getting these in nfs.log:

[2014-07-25 00:04:26.011239] W [nfs3-helpers.c:3401:nfs3_log_common_res] 0-nfs-nfsv3: XID: 9d0f866, FSSTAT: NFS: 70(Invalid file handle), POSIX: 14(Bad address)
[2014-07-25 00:04:26.011336] E [nfs3.c:301:__nfs3_get_volume_id] (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.5.1/xlator/nfs/server.so(nfs3_fsstat+0x1be) [0x7f3a34534d7e] (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.5.1/xlator/nfs/server.so(nfs3_fsstat_reply+0x3b) [0x7f3a3453465b] (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.5.1/xlator/nfs/server.so(nfs3_request_xlator_deviceid+0x78) [0x7f3a34527078]))) 0-nfs-nfsv3: invalid argument: xl
[2014-07-25 00:04:26.012372] W [nfs3-helpers.c:3401:nfs3_log_common_res] 0-nfs-nfsv3: XID: ad0f866, FSSTAT: NFS: 70(Invalid file handle), POSIX: 14(Bad address)

Comment 5 Atin Mukherjee 2014-07-25 04:43:30 UTC
Seems like a candidate to be analysed by the Quota team; assigning it to Varun Shastry, who looks after Quota.

Comment 6 Peter Auyeung 2014-07-29 17:18:38 UTC
Today I ran into a similar quota issue just by running du against a replica 2 volume.

It is similar to what was running when removing a volume with quota: the find and setfattr scan runs over the volume forever.

It was actually triggered by disk_usage_sync.sh from extra.

The du -bc against the replica 2 volume crashed the bricks.

I am attaching the gluster logs to this case.

Comment 7 Peter Auyeung 2014-07-29 17:25:11 UTC
Created attachment 922240
gluster log during brick crashed when running du

Comment 8 Peter Auyeung 2014-07-29 17:34:35 UTC
Looking deeper into the scripts, it seems that the setfattr is what crashes the bricks.

Comment 9 Peter Auyeung 2014-07-29 17:39:24 UTC
The setfattr action that crashes the bricks is very similar to what happens when we disable quota and the bricks crash, and also to the hang when removing a volume.

Comment 11 Niels de Vos 2016-06-17 15:58:30 UTC
This bug is getting closed because the 3.5 is marked End-Of-Life. There will be no further updates to this version. Please open a new bug against a version that still receives bugfixes if you are still facing this issue in a more current release.
