1122732 – remove volume hang glusterfs

Bug 1122732 - remove volume hang glusterfs

Summary: remove volume hang glusterfs

Keywords:
Status:	CLOSED EOL
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	glusterd
Sub Component:
Version:	3.5.1
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Assignee:	Atin Mukherjee
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2014-07-23 23:22 UTC by Peter Auyeung
Modified:	2016-06-17 15:58 UTC
CC List:	5 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2016-06-17 15:58:30 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Attachments
gluster log during brick crashed when running du (917.98 KB, application/x-gzip) 2014-07-29 17:25 UTC, Peter Auyeung

Description Peter Auyeung 2014-07-23 23:22:32 UTC

Description of problem:

gluster hangs when stopping a volume, and eventually all gluster daemons crash.
We had to restart glusterfs on all nodes at once.

[2014-07-23 20:55:44.856802] E [glusterd-utils.c:4124:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/60f2a337cd91647ecd6362967fcc955c.socket error: Permission denied
[2014-07-23 20:55:45.862979] E [glusterd-utils.c:4124:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/552ffa0e242965667b3dca46bd7dbda4.socket error: No such file or directory
[2014-07-23 20:55:46.868145] E [glusterd-utils.c:4124:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/e472f475caa7b04ff9d3cc9465979ed9.socket error: No such file or directory
[2014-07-23 20:55:47.381313] E [glusterd-op-sm.c:3886:glusterd_op_ac_stage_op] 0-management: Stage failed on operation 'Volume Heal', Status : -1
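
For triage, the glusterd errors quoted above can be grouped by the function that logged them. The following is a minimal sketch, not a GlusterFS tool: the line layout (timestamp, level, file:line:function, domain, message) is inferred only from the lines quoted in this report, and the helper names `parse_line` and `count_errors_by_function` are hypothetical.

```python
import re
from collections import Counter

# Assumed layout, inferred only from the lines quoted in this report:
#   [timestamp] LEVEL [file:line:function] domain: message
# The leading "[" is treated as optional so partially copied lines still parse.
LOG_LINE = re.compile(
    r"^\[?(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\]\s+"
    r"(?P<level>[A-Z])\s+"
    r"\[(?P<file>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\]\s+"
    r"(?P<domain>\S+):\s+(?P<msg>.*)$"
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def count_errors_by_function(lines):
    """Count level-E lines per logging function; unparseable lines are skipped."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record and record["level"] == "E":
            counts[record["func"]] += 1
    return counts


# Sample input: the four glusterd lines from this report.
sample = [
    "[2014-07-23 20:55:44.856802] E [glusterd-utils.c:4124:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/60f2a337cd91647ecd6362967fcc955c.socket error: Permission denied",
    "[2014-07-23 20:55:45.862979] E [glusterd-utils.c:4124:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/552ffa0e242965667b3dca46bd7dbda4.socket error: No such file or directory",
    "[2014-07-23 20:55:46.868145] E [glusterd-utils.c:4124:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/e472f475caa7b04ff9d3cc9465979ed9.socket error: No such file or directory",
    "[2014-07-23 20:55:47.381313] E [glusterd-op-sm.c:3886:glusterd_op_ac_stage_op] 0-management: Stage failed on operation 'Volume Heal', Status : -1",
]

if __name__ == "__main__":
    for func, n in count_errors_by_function(sample).most_common():
        print(f"{n:3d}  {func}")
```

Lines that carry an inline backtrace instead of a `domain:` prefix do not fit this pattern and are skipped by the sketch.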

Version-Release number of selected component (if applicable): 3.5.1


How reproducible:


Steps to Reproduce:
1. gluster volume stop $vol (force) 

Actual results:
gluster hangs and all gluster daemons crash.
We had to restart glusterfs.

Expected results:
The volume stops.

Additional info:

Comment 1 Atin Mukherjee 2014-07-24 04:30:12 UTC

Can you attach the complete glusterd logs and, if possible, let us know the steps carried out before executing volume stop?

Comment 2 SATHEESARAN 2014-07-24 06:56:14 UTC

Peter,

Please also add the volume type, volume configuration, and backend information. It would help us recreate the issue.

Comment 3 Peter Auyeung 2014-07-25 00:21:22 UTC

Type: Distributed-Replicate
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp

XFS on Ubuntu 12.04

Total volume size is 17TB, with 30% used.

Options Reconfigured:
features.quota: on
nfs.export-volumes: off
nfs.export-dirs: on
features.quota-deem-statfs: on
nfs.drc: off

We share the volume directories over NFS with directory quotas.

Today I tried to remove another volume, but this time I disabled quota first.

When I disabled quota, the quota crawl took forever.

While the crawl was still running, I ran gluster volume stop, and this time the stop returned immediately without hanging.

We have been seeing quota usage misreported over time once an NFS export has been in use for a while.
We have started seeing more and more xattr and unlink errors.

Comment 4 Peter Auyeung 2014-07-25 00:22:51 UTC

The log is too big to upload.

After removing a volume, we are getting these in nfs.log:

[2014-07-25 00:04:26.011239] W [nfs3-helpers.c:3401:nfs3_log_common_res] 0-nfs-nfsv3: XID: 9d0f866, FSSTAT: NFS: 70(Invalid file handle), POSIX: 14(Bad address)
[2014-07-25 00:04:26.011336] E [nfs3.c:301:__nfs3_get_volume_id] (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.5.1/xlator/nfs/server.so(nfs3_fsstat+0x1be) [0x7f3a34534d7e] (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.5.1/xlator/nfs/server.so(nfs3_fsstat_reply+0x3b) [0x7f3a3453465b] (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.5.1/xlator/nfs/server.so(nfs3_request_xlator_deviceid+0x78) [0x7f3a34527078]))) 0-nfs-nfsv3: invalid argument: xl
[2014-07-25 00:04:26.012372] W [nfs3-helpers.c:3401:nfs3_log_common_res] 0-nfs-nfsv3: XID: ad0f866, FSSTAT: NFS: 70(Invalid file handle), POSIX: 14(Bad address)
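
To see how often each failure combination occurs in a large nfs.log, the `nfs3_log_common_res` warnings can be tallied by operation, NFS status, and POSIX errno. This is a rough sketch under the assumption that every such line carries the `XID: ..., OP: NFS: n(text), POSIX: n(text)` shape seen in the two warnings above; the function name `summarize_nfs_results` is hypothetical.

```python
import re
from collections import Counter

# Pattern assumed from the two warning lines quoted in this comment:
#   XID: <hex>, <OP>: NFS: <code>(<text>), POSIX: <errno>(<text>)
NFS_RESULT = re.compile(
    r"XID: (?P<xid>[0-9a-f]+), (?P<op>[A-Z0-9_]+): "
    r"NFS: (?P<nfs_code>\d+)\((?P<nfs_text>[^)]*)\), "
    r"POSIX: (?P<errno>\d+)\((?P<errno_text>[^)]*)\)"
)


def summarize_nfs_results(lines):
    """Count (operation, NFS status code, POSIX errno) triples across log lines."""
    counts = Counter()
    for line in lines:
        match = NFS_RESULT.search(line)
        if match:
            key = (match["op"], int(match["nfs_code"]), int(match["errno"]))
            counts[key] += 1
    return counts


# Sample input: the two warnings from this comment.
nfs_sample = [
    "[2014-07-25 00:04:26.011239] W [nfs3-helpers.c:3401:nfs3_log_common_res] 0-nfs-nfsv3: XID: 9d0f866, FSSTAT: NFS: 70(Invalid file handle), POSIX: 14(Bad address)",
    "[2014-07-25 00:04:26.012372] W [nfs3-helpers.c:3401:nfs3_log_common_res] 0-nfs-nfsv3: XID: ad0f866, FSSTAT: NFS: 70(Invalid file handle), POSIX: 14(Bad address)",
]

if __name__ == "__main__":
    for (op, nfs_code, errno_), n in summarize_nfs_results(nfs_sample).items():
        print(f"{n:3d}  {op}  NFS={nfs_code}  POSIX={errno_}")
```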

Comment 5 Atin Mukherjee 2014-07-25 04:43:30 UTC

Seems like a candidate to be analysed by Quota team, assigning it to Varun Shastry who looks after Quota.

Comment 6 Peter Auyeung 2014-07-29 17:18:38 UTC

Today I ran into a similar quota issue just by running du against a replica 2 volume.

It is similar to what was running when we removed a volume with quota enabled: the find/setfattr scan of the volume runs forever.

It actually came from disk_usage_sync.sh from extra.

The du -bc against the replica 2 volume crashed the bricks.

Attaching the gluster logs to this case.

Comment 7 Peter Auyeung 2014-07-29 17:25:11 UTC

Created attachment 922240
gluster log during brick crashed when running du

Comment 8 Peter Auyeung 2014-07-29 17:34:35 UTC

Looking deeper into the script, it seems the setfattr is what crashes the bricks.

Comment 9 Peter Auyeung 2014-07-29 17:39:24 UTC

The setfattr action that crashes the bricks is very similar to what happens when we disable quota and the bricks crash, and also to the hang when removing a volume.

Comment 11 Niels de Vos 2016-06-17 15:58:30 UTC

This bug is being closed because version 3.5 is marked End-Of-Life. There will be no further updates to this version. Please open a new bug against a version that still receives bugfixes if you are still facing this issue in a more current release.
