Description of problem: When a huge number of directories is removed with rm -rf, the command errors out with ENOENT.

Brick log snippet:
===============
[2012-06-22 09:34:12.588948] I [server3_1-fops.c:907:server_setxattr_cbk] 0-scalability-1-server: 47933778: SETXATTR (null) (--) ==> trusted.glusterfs.dht (No such file or directory)
[2012-06-22 09:42:23.265813] E [posix.c:223:posix_stat] 0-scalability-1-posix: lstat on /home/scalability-2/dir-2/.glusterfs/91/0d/910d3e55-6d23-472f-b95e-bcfc82d8b73d failed: No such file or directory
[2012-06-22 09:42:23.265848] I [server3_1-fops.c:1707:server_stat_cbk] 0-scalability-1-server: 48331226: STAT <gfid:910d3e55-6d23-472f-b95e-bcfc82d8b73d> (910d3e55-6d23-472f-b95e-bcfc82d8b73d) ==> -1 (No such file or directory)
[2012-06-22 09:42:25.814130] E [posix.c:223:posix_stat] 0-scalability-1-posix: lstat on /home/scalability-2/dir-2/.glusterfs/10/2e/102e40ff-4498-44ae-a5cb-8003bff283c8 failed: No such file or directory
[2012-06-22 09:42:25.814173] I [server3_1-fops.c:1707:server_stat_cbk] 0-scalability-1-server: 48334848: STAT <gfid:102e40ff-4498-44ae-a5cb-8003bff283c8> (102e40ff-4498-44ae-a5cb-8003bff283c8) ==> -1 (No such file or directory)
[2012-06-22 09:42:26.098540] E [posix.c:223:posix_stat] 0-scalability-1-posix: lstat on /home/scalability-2/dir-2/.glusterfs/f9/50/f9507285-9225-4170-a507-4eac03b5963c failed: No such file or directory
[2012-06-22 09:42:26.098567] I [server3_1-fops.c:1707:server_stat_cbk] 0-scalability-1-server: 48335279: STAT <gfid:f9507285-9225-4170-a507-4eac03b5963c> (f9507285-9225-4170-a507-4eac03b5963c) ==> -1 (No such file or directory)
[2012-06-22 09:42:26.648949] E [posix.c:223:posix_stat] 0-scalability-1-posix: lstat on /home/scalability-2/dir-2/.glusterfs/f0/a3/f0a3e720-44ba-469a-a6d3-f2994699d3bf failed: No such file or directory
[2012-06-22 09:42:26.648990] I [server3_1-fops.c:1707:server_stat_cbk] 0-scalability-1-server: 48335954: STAT <gfid:ede2772b-ba8c-4a46-90f9-d8265bee6851>/fileop_L1_83/fileop_L1_83_L2_46/fileop_dir_83_46_49 (f0a3e720-44ba-469a-a6d3-f2994699d3bf) ==> -1 (No such file or directory)
[2012-06-22 09:42:30.665662] E [posix.c:223:posix_stat] 0-scalability-1-posix: lstat on /home/scalability-2/dir-2/.glusterfs/5c/09/5c09d329-244d-4030-aa4f-02386b16963d failed: No such file or directory
[2012-06-22 09:42:30.665703] I [server3_1-fops.c:1707:server_stat_cbk] 0-scalability-1-server: 48340941: STAT <gfid:5c09d329-244d-4030-aa4f-02386b16963d> (5c09d329-244d-4030-aa4f-02386b16963d) ==> -1 (No such file or directory)
===========================

Client logs:
==============================
[2012-06-22 10:00:33.046009] E [nfs3-helpers.c:3603:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: <gfid:5d693efa-d7c3-4328-8d18-c0bd77b5401b>: Invalid argument
[2012-06-22 10:00:33.046041] E [nfs3.c:1513:nfs3_access_resume] 0-nfs-nfsv3: Unable to resolve FH: (10.16.157.39:901) scalability-1 : 5d693efa-d7c3-4328-8d18-c0bd77b5401b
[2012-06-22 10:00:33.046052] W [nfs3-helpers.c:3389:nfs3_log_common_res] 0-nfs-nfsv3: XID: 7c44ea5f, ACCESS: NFS: 22(Invalid argument for operation), POSIX: 14(Bad address)
[2012-06-22 10:00:33.505755] W [client3_1-fops.c:474:client3_1_stat_cbk] 0-scalability-1-client-1: remote operation failed: No such file or directory
[2012-06-22 10:00:35.913556] W [client3_1-fops.c:474:client3_1_stat_cbk] 0-scalability-1-client-2: remote operation failed: No such file or directory
[2012-06-22 10:00:43.086167] W [client3_1-fops.c:474:client3_1_stat_cbk] 0-scalability-1-client-0: remote operation failed: No such file or directory
[2012-06-22 10:00:43.086200] W [client3_1-fops.c:474:client3_1_stat_cbk] 0-scalability-1-client-2: remote operation failed: No such file or directory
=================================

Steps to Reproduce:
1. Run fileop such that it creates a million directories.
Command line:
# fileop -s 50K -b -w -d `pwd` -t -f 100
2. Give it a day or more, so that it creates a huge number of directories.
3. Do an rm -rf on the directories.

Attached sosreport from the server.
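For anyone without iozone's fileop handy, the directory load can be mimicked locally with a short sketch. This is only an illustrative stand-in, not the actual reproduction (the bug needs a GlusterFS mount); it assumes, based on "-f 100" producing about a million directories, that the force factor builds roughly an N x N x N tree, and it reuses the naming pattern visible in the brick log (fileop_L1_83/fileop_L1_83_L2_46/fileop_dir_83_46_49):

```python
# Local stand-in for the fileop workload (assumption: "-f N" yields roughly
# an N x N x N directory tree, consistent with "-f 100" -> ~a million dirs).
import os
import shutil
import tempfile

root = tempfile.mkdtemp()
n = 5  # keep it tiny here; the reported run effectively used ~100

for i in range(n):
    for j in range(n):
        for k in range(n):
            os.makedirs(os.path.join(root,
                                     f"fileop_L1_{i}",
                                     f"fileop_L1_{i}_L2_{j}",
                                     f"fileop_dir_{i}_{j}_{k}"))

shutil.rmtree(root)  # the "rm -rf" step of the reproduction
assert not os.path.exists(root)
```

On a plain local filesystem this always succeeds; the bug only shows up when the removal goes through the DHT translator on a mounted volume.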
Created attachment 593689 [details] SOS report
Only one client was doing the rm, so it is not possible that some other process deleted the entries.
From the logs it looks like dht detects holes in the layout and sends setxattr (layout) calls. But by then the rmdir has already succeeded, and hence the setxattr fails with ENOENT. fileop also does its own clean-up (rm), and another manual rm -rf had been triggered. I suspect this scenario:
1. readdir returns entries.
2. fileop's clean-up and the manual rm are both in progress.
3. One of them sends a lookup while the other removes the non-hashed directory; this is when a hole in the layout is detected.
4. A heal/setxattr of the layouts is triggered, which fails, as the rmdir has succeeded by then.
If rm -rf fails, a new rm -rf on the mount should clean up successfully. Please try to reproduce the bug.
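The suspected race can be simulated locally with a minimal sketch (again, an illustrative stand-in, not GlusterFS code): two removers walk the same tree, and a stat issued after the other remover's rmdir has already succeeded fails with ENOENT, mirroring the failed lstat/STAT calls in the brick log. A final check still finds the tree fully cleaned, matching the expectation that a retried rm -rf succeeds:

```python
# Two concurrent removers racing on one tree: a stat after the other
# remover's rmdir fails with ENOENT, yet the tree still ends up empty.
import errno
import os
import shutil
import tempfile
import threading

def build_tree(root, width=20):
    for i in range(width):
        os.makedirs(os.path.join(root, f"dir_{i}", "sub"))

def remover(root, enoents):
    for name in list(os.listdir(root)):
        path = os.path.join(root, name)
        try:
            os.stat(path)                       # analogous to the post-readdir lookup
            shutil.rmtree(path, ignore_errors=True)
        except OSError as e:
            if e.errno == errno.ENOENT:
                enoents.append(path)            # the other remover got there first
            else:
                raise

root = tempfile.mkdtemp()
build_tree(root)
enoents = []
threads = [threading.Thread(target=remover, args=(root, enoents))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Regardless of which remover observed ENOENT, everything is gone,
# so a retried "rm -rf" on the mount point would succeed cleanly.
assert os.listdir(root) == []
os.rmdir(root)
```

Whether any ENOENT is actually observed depends on thread timing, which is exactly why the bug is hard to hit reliably.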
Can you please try to reproduce the bug with the latest git repo?
Unable to reproduce this issue. Will re-open (with a sosreport) if I hit this issue again.