Bug 764138 (GLUSTER-2406)

Summary: AFR repair on directory with large count of small files interrupts entire cluster
Product: [Community] GlusterFS
Component: replicate
Version: 3.0.5
Reporter: Joe Julian <joe>
Assignee: Pranith Kumar K <pkarampu>
CC: amarts, gluster-bugs, vbhat
Status: CLOSED WORKSFORME
Severity: medium
Priority: medium
Hardware: All
OS: Linux
Mount Type: fuse
Doc Type: Bug Fix
Attachments:
Server volfile
Server volfile
Server volfile
Client volfile

Description Joe Julian 2011-02-10 14:58:33 UTC
Created attachment 436

Comment 1 Joe Julian 2011-02-10 15:01:23 UTC
Created attachment 437

Comment 2 Joe Julian 2011-02-10 15:02:01 UTC
Created attachment 438
rpm -qai listing of the installed RedHat v6.1 components (updates applied)

Comment 3 Anand Avati 2011-02-10 15:19:57 UTC
Any information on what directories the other clients were working in? Self-heal happens with a lock on the directory until it fixes all the entries; maybe they were waiting on the same lock?
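
One way to check that hypothesis from a blocked client is to time an access inside the directory being healed against one in an already-healed directory; if only the former hangs, it points at the entry lock. A rough sketch, with placeholder paths (the mount point and the already-healed directory below are assumptions):

MNT=/mnt/gluster                        # placeholder mount point on the blocked client
HEALING_DIR=$MNT/home/DEBBIEW/Cookies   # the directory being healed (path from the debug log in comment 5)
OTHER_DIR=$MNT/home/otheruser           # placeholder: any directory that has already healed
# A held entry lock shows up as a timeout (exit status 124) on the healing directory only.
timeout 10 ls "$OTHER_DIR"   > /dev/null && echo "other dir OK"   || echo "other dir blocked"
timeout 10 ls "$HEALING_DIR" > /dev/null && echo "healing dir OK" || echo "healing dir blocked"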

Comment 4 Joe Julian 2011-02-10 15:54:48 UTC
The other clients were working on a variety of directories, all of which had already healed.

The user whose profile that is isn't in today, so the only access to that directory was my own.

Comment 5 Joe Julian 2011-02-10 17:57:33 UTC
Version is actually 3.0.7 but that option is not available.

After replacing a brick, I ran a find to perform the AFR rebuild. When it came to a directory with 32,000 files, all under 1 KB, all access from all clients was blocked.

A debug log on the client performing the repair shows nothing unexpected:
[2011-02-10 08:37:46] D [afr-self-heal-entry.c:1397:afr_sh_entry_impunge_mknod] repl2: creating missing file /home/DEBBIEW/Cookies/debbiew on ewcs7_cluster1
for each of the files that are being repaired.
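
The heal's progress can be followed by counting those mknod lines in the healing client's log; a minimal sketch, assuming a log path (the real path depends on how the client was started):

LOG=/var/log/glusterfs/mnt-gluster.log               # assumed log location
grep -c 'afr_sh_entry_impunge_mknod' $LOG            # missing files re-created so far
tail -f $LOG | grep 'afr_sh_entry_impunge_mknod'     # follow the heal live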

The other clients are just blocked. There are no errors, disconnects, etc on any of the servers. The repair is taking place as expected.

Steps:
1. Create a 3-way distributed replicated volume (see attached volfiles).
2. Mount the volume.
3. Create a directory with 30,000 files, each under 1 KB.
4. Unmount the volume.
5. Wipe the bricks on one of the servers.
6. /usr/sbin/glusterfs --log-level=DEBUG --disable-direct-io-mode --volfile=/etc/glusterfs/glusterfs-client.vol /mnt/gluster
7. find /mnt/gluster
8. When the find reaches the directory with 30k files, all access from all clients will be blocked (a consolidated shell sketch of these steps follows the list).
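
Condensed into a script, the steps above look roughly like the sketch below. The test directory name, file size, and brick path are assumptions; the mount command and volfile path are taken from step 6.

#!/bin/sh
MNT=/mnt/gluster
VOLFILE=/etc/glusterfs/glusterfs-client.vol

# Steps 1-3: with the volume built from the attached volfiles, mount it and
# populate a directory with 30,000 small (<1 KB) files.
/usr/sbin/glusterfs --disable-direct-io-mode --volfile=$VOLFILE $MNT
mkdir -p $MNT/manyfiles                       # placeholder directory name
i=1
while [ $i -le 30000 ]; do
    head -c 512 /dev/zero > $MNT/manyfiles/file$i
    i=$((i + 1))
done

# Step 4: unmount.
umount $MNT

# Step 5: on ONE of the servers, wipe its brick (placeholder path).
# ssh server3 'rm -rf /data/export/*'

# Steps 6-7: remount with debug logging and trigger self-heal with find.
/usr/sbin/glusterfs --log-level=DEBUG --disable-direct-io-mode --volfile=$VOLFILE $MNT
find $MNT > /dev/null

# Step 8: while the 30k-file directory is healing, access the volume from a
# second client; in the reported setup, that access hangs until the heal ends.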

Comment 6 M S Vishwanath Bhat 2011-03-21 14:05:05 UTC
Hi Joe,

I tried reproducing this bug with 3.1.3 and 3.0.7, but was unable to hit the issue. Here's what I did:
1. Created a 4 * 3 distributed-replicate volume and mounted it.
2. Created a directory with 35,000 files of around 800 B each. Also created some more directories with some files.
3. Unmounted the volume and wiped two of the replica bricks on one of the servers.
4. Mounted the volume and executed the `find .` command over the mountpoint.
5. When the directory with 30,000 files was healing, I was able to access the files from other clients simultaneously (see the sketch below); running the `find` command did not block the other clients.
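
The simultaneous-access check in item 5 can be made concrete with a loop like the one below, run from a second client while the `find` is still crawling. Paths are placeholders:

MNT=/mnt/gluster            # placeholder mount point on the second client
BIGDIR=$MNT/manyfiles       # placeholder name of the 35,000-file directory

# Prints a timestamped entry count every 5 seconds while the heal runs.
# If the second client were blocked, as in the original report, this loop
# would hang on the ls instead of printing.
while true; do
    echo "$(date +%H:%M:%S): $(ls $BIGDIR | wc -l) entries visible"
    sleep 5
done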

Comment 7 Pranith Kumar K 2011-04-06 00:56:54 UTC
We are unable to reproduce the bug in-house. Please feel free to re-open the bug if it can be reproduced again.

Pranith