Bug 764138 (GLUSTER-2406) - AFR repair on directory with large count of small files interrupts entire cluster
Status: CLOSED WORKSFORME
Alias: GLUSTER-2406
Product: GlusterFS
Classification: Community
Component: replicate
Version: 3.0.5
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Pranith Kumar K
 
Reported: 2011-02-10 17:57 UTC by Joe Julian
Modified: 2011-04-06 03:56 UTC
CC: 3 users

Doc Type: Bug Fix
Regression: ---
Mount Type: fuse
Documentation: ---


Attachments (Terms of Use)
Server volfile (1.16 KB, application/octet-stream), 2011-02-10 14:57 UTC, Joe Julian
Server volfile (1.52 KB, application/octet-stream), 2011-02-10 14:58 UTC, Joe Julian
Server volfile (1.55 KB, application/octet-stream), 2011-02-10 15:01 UTC, Joe Julian
Client volfile (2.70 KB, application/octet-stream), 2011-02-10 15:02 UTC, Joe Julian

Description Joe Julian 2011-02-10 14:58:33 UTC
Created attachment 436

Comment 1 Joe Julian 2011-02-10 15:01:23 UTC
Created attachment 437

Comment 2 Joe Julian 2011-02-10 15:02:01 UTC
Created attachment 438
rpm -qai run of Red Hat v6.1 installed components (updates applied)

Comment 3 Anand Avati 2011-02-10 15:19:57 UTC
Any information on what directories the other clients were working in? Self-heal happens with a lock on the directory until it fixes all the entries; maybe they were waiting on the same lock?

Comment 4 Joe Julian 2011-02-10 15:54:48 UTC
The other clients were working on a variety of directories, all of which had already healed.

The user whose profile that is isn't in today, so the only access to that directory was my own.

Comment 5 Joe Julian 2011-02-10 17:57:33 UTC
Version is actually 3.0.7 but that option is not available.

After replacing a brick, I ran a find to trigger the AFR rebuild. When it reached a directory with 32,000 files, all under 1 KB, all access from all clients was blocked.

A debug log on the client performing the repair shows nothing unexpected, just a line like the following for each of the files being repaired:
[2011-02-10 08:37:46] D [afr-self-heal-entry.c:1397:afr_sh_entry_impunge_mknod] repl2: creating missing file /home/DEBBIEW/Cookies/debbiew on ewcs7_cluster1

The other clients are just blocked. There are no errors, disconnects, etc on any of the servers. The repair is taking place as expected.

Steps:
1. Create a 3-way distributed-replicated volume (see attached volfiles)
2. Mount the volume
3. Create a directory with 30,000 files under 1 KB each
4. Unmount the volume
5. Wipe the bricks on one of the servers
6. /usr/sbin/glusterfs --log-level=DEBUG --disable-direct-io-mode --volfile=/etc/glusterfs/glusterfs-client.vol /mnt/gluster
7. find /mnt/gluster
8. When the find reaches the directory with 30,000 files, all access from all clients is blocked.
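For anyone trying to reproduce this, the test-data setup in step 3 can be sketched as below. This is only a sketch: the path /tmp/afr-test is hypothetical (not from the report), and the file count is scaled down from the reporter's 30,000 to keep it quick.

```shell
#!/bin/sh
# Populate a directory with many small (<1 KB) files, as in step 3.
# The report used 30,000 files; 300 here keeps the sketch fast.
DIR=/tmp/afr-test          # hypothetical path, not the reporter's
mkdir -p "$DIR"
for i in $(seq 1 300); do
  head -c 512 /dev/urandom > "$DIR/file$i"   # each file is 512 bytes
done
ls "$DIR" | wc -l
```

On the real setup this directory would live on the mounted volume, so the files land on the bricks and become candidates for entry self-heal after one replica is wiped.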

Comment 6 M S Vishwanath Bhat 2011-03-21 14:05:05 UTC
Hi Joe,

I tried reproducing this bug with 3.1.3 and 3.0.7, but was unable to hit the issue. Here's what I did:
1. Created a 4 x 3 distributed-replicated volume and mounted it.
2. Created a directory with 35,000 files of around 800 B each. Also created some more directories with some files.
3. Unmounted the volume and wiped two of the replicated bricks on one of the servers.
4. Mounted the volume and executed the `find .` command over the mountpoint.
5. While the directory with 30,000 files was healing, I was able to access the files from other clients simultaneously. Running the `find` command did not block the other clients.

Comment 7 Pranith Kumar K 2011-04-06 00:56:54 UTC
We are unable to reproduce the bug in-house. Please feel free to re-open it when it can be reproduced.

Pranith

