Bug 764138 (GLUSTER-2406) - AFR repair on directory with large count of small files interrupts entire cluster
Status: CLOSED WORKSFORME
Alias: GLUSTER-2406
Product: GlusterFS
Classification: Community
Component: replicate
Version: 3.0.5
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Pranith Kumar K
 
Reported: 2011-02-10 17:57 UTC by Joe Julian
Modified: 2011-04-06 03:56 UTC
CC: 3 users

Doc Type: Bug Fix
Regression: ---
Mount Type: fuse
Documentation: ---


Attachments (Terms of Use)
Server volfile (1.16 KB, application/octet-stream), 2011-02-10 14:57 UTC, Joe Julian
Server volfile (1.52 KB, application/octet-stream), 2011-02-10 14:58 UTC, Joe Julian
Server volfile (1.55 KB, application/octet-stream), 2011-02-10 15:01 UTC, Joe Julian
Client volfile (2.70 KB, application/octet-stream), 2011-02-10 15:02 UTC, Joe Julian

Description Joe Julian 2011-02-10 14:58:33 UTC
Created attachment 436

Comment 1 Joe Julian 2011-02-10 15:01:23 UTC
Created attachment 437

Comment 2 Joe Julian 2011-02-10 15:02:01 UTC
Created attachment 438
rpm -qai run of Red Hat v6.1 installed components (updates applied)

Comment 3 Anand Avati 2011-02-10 15:19:57 UTC
Any information on what directories the other clients were working in? Self-heal happens with a lock on the directory until it fixes all the entries; maybe they were waiting on the same lock?

Comment 4 Joe Julian 2011-02-10 15:54:48 UTC
The other clients were working on a variety of directories, all of which had already healed.

The user whose profile that is isn't in today, so the only access to that directory was my own.

Comment 5 Joe Julian 2011-02-10 17:57:33 UTC
Version is actually 3.0.7 but that option is not available.

After replacing a brick, I ran a find to trigger the AFR rebuild. When it reached a directory with 32,000 files, all under 1 KB, all access from all clients was blocked.

A debug log on the client performing the repair shows nothing unexpected, just a line like the following for each of the files being repaired:
[2011-02-10 08:37:46] D [afr-self-heal-entry.c:1397:afr_sh_entry_impunge_mknod] repl2: creating missing file /home/DEBBIEW/Cookies/debbiew on ewcs7_cluster1

The other clients are just blocked. There are no errors, disconnects, etc on any of the servers. The repair is taking place as expected.

Steps:
1. Create a 3-way distributed-replicated volume (see attached volfiles)
2. Mount the volume
3. Create a directory with 30,000 files under 1 KB each
4. Unmount the volume
5. Wipe the bricks on one of the servers
6. /usr/sbin/glusterfs --log-level=DEBUG --disable-direct-io-mode --volfile=/etc/glusterfs/glusterfs-client.vol /mnt/gluster
7. find /mnt/gluster
8. When the find reaches the directory with 30,000 files, all access from all clients is blocked.
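For anyone trying to reproduce this, the test-data setup in step 3 can be sketched as below. This is only a sketch: the path /tmp/afr-test is hypothetical (not from the report), and the file count is scaled down from the reporter's 30,000 to keep it quick.

```shell
#!/bin/sh
# Populate a directory with many small (<1 KB) files, as in step 3.
# The report used 30,000 files; 300 here keeps the sketch fast.
DIR=/tmp/afr-test          # hypothetical path, not the reporter's
mkdir -p "$DIR"
for i in $(seq 1 300); do
  head -c 512 /dev/urandom > "$DIR/file$i"   # each file is 512 bytes
done
ls "$DIR" | wc -l
```

On the real setup this directory would live on the mounted volume, so the files land on the bricks and become candidates for entry self-heal after one replica is wiped.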

Comment 6 M S Vishwanath Bhat 2011-03-21 14:05:05 UTC
Hi Joe,

I tried reproducing this bug with 3.1.3 and 3.0.7, but was unable to hit the issue. Here's what I did:
1. Created a 4 x 3 distributed-replicated volume and mounted it.
2. Created a directory with 35,000 files of around 800 B each. Also created some more directories with some files.
3. Unmounted the volume and wiped two of the replicated bricks on one of the servers.
4. Mounted the volume and executed the `find .` command over the mountpoint.
5. While the directory with 30,000 files was healing, I was able to access the files from other clients simultaneously. Running the `find` command did not block the other clients.

Comment 7 Pranith Kumar K 2011-04-06 00:56:54 UTC
We are unable to reproduce the bug in-house. Please feel free to re-open it when it can be reproduced.

Pranith

