Bug 1028582 - GlusterFS files missing randomly - the miss triggers a self heal, then missing files appear.
Status: CLOSED EOL
Product: GlusterFS
Classification: Community
Component: core
Version: 3.4.3
Hardware: Unspecified Unspecified
Priority: unspecified
Severity: unspecified
Assigned To: bugs@gluster.org
Depends On: 971805
Reported: 2013-11-08 14:52 EST by Chad Feller
Modified: 2015-10-07 09:59 EDT (History)
Doc Type: Bug Fix
Last Closed: 2015-10-07 09:59:36 EDT
Type: Bug

Attachments
logs at moment of 404 not found error (3.31 KB, text/plain)
2013-11-09 12:28 EST, Chad Feller
404 error following debmirror completion in comment #5 (3.31 KB, text/plain)
2013-11-09 13:23 EST, Chad Feller
Fedora_mirror_error_logs (12.41 KB, text/plain)
2013-12-30 16:54 EST, Chad Feller
gluster mount log from 3.4.20140129.8eda793-1, containing errors during "apt-get update" (48.74 KB, text/x-log)
2014-02-01 15:50 EST, Chad Feller
Description Chad Feller 2013-11-08 14:52:11 EST
Description of problem:
Upon mirroring the Debian Wheezy repo on a GlusterFS mount point, some files are not properly found, even though the mirroring process completes successfully (according to the mirroring tool, "debmirror").

That is, "debmirror" completes without an error, but a subsequent "apt-get update" on a Debian machine returns a 404 not found on certain files.

Trying an "apt-get update" on a second Debian machine produces the same result.

Manually inspecting the directory structure shows the file(s) in question are actually there, at which point a subsequent "apt-get update" works successfully and Debian machines can be updated/installed/etc. using my local Debian mirror.

Looking at the glusterfs log, I notice the following:

[2013-11-08 17:56:00.886263] I [afr-self-heal-entry.c:2253:afr_sh_entry_fix] 0-gv0-replicate-0: /debian/dists/wheezy-updates/non-free/i18n: Performing conservative merge

So it almost appears as if the miss triggered a heal, which then caused things to work.  But the initial "404 not found" is problematic.
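
For anyone scanning a mount log for the same pattern, here is a minimal Python sketch that pulls out the "Performing conservative merge" entries. The regular expression is inferred only from the log lines quoted in this report (3.4.x), so other releases may need a different pattern; the function name is made up for illustration.

```python
import re
from datetime import datetime

# Pattern inferred from the log lines quoted in this report (GlusterFS 3.4.x).
# Other releases may format their logs differently.
MERGE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] "
    r"(?P<level>[A-Z]) "
    r"\[(?P<src>[^\]]+)\] "
    r"(?P<subvol>\S+): "
    r"(?P<path>/.*): Performing conservative merge$"
)


def conservative_merges(lines):
    """Yield (timestamp, subvolume, directory) for each conservative-merge entry."""
    for line in lines:
        m = MERGE_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
            yield ts, m.group("subvol"), m.group("path")
```

Feeding it the lines of the client mount log should list each directory that was healed and when, which can then be compared against the times of the 404 responses.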

Version-Release number of selected component (if applicable):
Client (Fedora 18):
$ rpm -qa | grep glusterfs
glusterfs-libs-3.4.1-1.fc18.x86_64
glusterfs-fuse-3.4.1-1.fc18.x86_64
glusterfs-3.4.1-1.fc18.x86_64

Server (RHEL 6):
# rpm -qa | grep glusterfs
glusterfs-3.4.1-3.el6.x86_64
glusterfs-cli-3.4.1-3.el6.x86_64
glusterfs-server-3.4.1-3.el6.x86_64
glusterfs-libs-3.4.1-3.el6.x86_64
glusterfs-fuse-3.4.1-3.el6.x86_64

bricks are using XFS

How reproducible:
Consistently

Steps to Reproduce:
1. Mount GlusterFS directory
2. mirror Debian repo using "debmirror"
3. try to update Debian client using apt.

Actual results:
Not all files are able to be fetched (initially).

Expected results:
Files should be able to be fetched.

Additional info:
I tested mirroring a Debian repo on my "test" GlusterFS cluster for a month with no issue.  During that time my "test" cluster was running 3.3.2.  Two days ago I updated my "test" cluster to 3.4.1, and now I see the exact same thing on that test cluster.

The GlusterFS client, running Fedora 18, is serving up the Debian repo using Apache (httpd-2.4.6-2.fc18.x86_64). The relevant directories are mounted via fstab as follows:
<gluster-server>:gv0    /mnt/gluster/gv0 glusterfs defaults,_netdev 0 0
/mnt/gluster/gv0/debian /var/www/mirror/debian  bind    defaults,bind   0 0

Gluster Volume Info on the servers:
# gluster volume info
 
Volume Name: gv0
Type: Distributed-Replicate
Volume ID: a86fbffd-408d-41f9-b2ed-a3816f09d924
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gluster0:/export/brick0
Brick2: gluster1:/export/brick0
Brick3: gluster2:/export/brick0
Brick4: gluster3:/export/brick0

SELinux is Enforcing on both the RHEL 6 servers and the Fedora 18 GlusterFS client. The only configuration change I made was setting:
setsebool -P httpd_use_fusefs 1
on the Fedora 18 box so that Apache played nice with the glusterfs directory.

Again, no problems on 3.3.2 with this same configuration.  My "test" cluster was upgraded from 3.3.2 to 3.4.1.  My main GlusterFS cluster was installed as 3.4.1 from scratch.
Comment 1 Chad Feller 2013-11-09 12:28:04 EST
Created attachment 821934 [details]
logs at moment of 404 not found error

This is what appears in the logs of the mount point the second that a 404 error is returned when attempting to access a file in question.
Comment 2 Chad Feller 2013-11-09 12:40:45 EST
Now on the above, apt will attempt to access something like:

http://<server>/debian/dists/wheezy-updates/main/i18n/Translation-en.bz2

I can reproduce the same thing with wget.  And this will continue indefinitely until I trigger the heal.  In order to trigger the heal, you have to navigate to the file.  Yesterday I navigated to the file from the command line (bash), and that triggered the heal.

Today I navigated to the file via a web browser, as the whole directory structure is served via Apache (as most Linux repos are).  In doing this, I noticed that I didn't have to navigate all of the way to the file in question.  The self heal was triggered once I got to:

http://<server>/debian/dists/wheezy-updates/main/

Once Apache changed into that directory, I saw the following pop up in the logs:

[2013-11-09 17:17:44.041625] I [afr-self-heal-entry.c:2253:afr_sh_entry_fix] 0-gv0-replicate-0: /debian/dists/wheezy-updates/main/i18n: Performing conservative merge

Obviously, i18n is a subdirectory of the folder I just navigated to.

Once that self-heal entry showed up in the log I was able to retrieve the file in question (Translation-en.bz2, referenced at the top of this comment) using both apt and wget.

This repo will now continue to work until the debmirror script runs again at 3:00am (local time) tomorrow, at which point certain files will likely be missing again until I manually navigate to them to trigger a self heal.
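
Until there is a real fix, the manual navigation could in principle be scripted. The sketch below simply calls lstat on every entry under a directory tree, which forces the client to look each one up. Whether that triggers the same self heal as browsing by hand is an assumption, not something confirmed in this report; `stat_tree` is a hypothetical helper name.

```python
import os


def stat_tree(root):
    """lstat every entry under root and return (entries_seen, errors).

    errors is a list of (path, errno) for entries that could not be stat'ed.
    """
    seen = 0
    errors = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                os.lstat(path)  # forces a lookup of this entry
                seen += 1
            except OSError as exc:
                errors.append((path, exc.errno))
    return seen, errors
```

It could be run against /var/www/mirror/debian/dists right after debmirror finishes; any entries that fail the lstat are returned with their errno so they can be compared with the mount log.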
Comment 3 Chad Feller 2013-11-09 12:47:48 EST
(In reply to Chad Feller from comment #2)
> Now on the above, apt will attempt to access something like:
> 
> http://<server>/debian/dists/wheezy-updates/main/i18n/Translation-en.bz2
> 

To clarify, at the beginning of comment #2, in this statement here, I was referring to the attachment I created in comment #1.  That is, accessing this URL is what triggered the attachment in Comment #1.
Comment 4 Chad Feller 2013-11-09 13:11:39 EST
In an attempt to do further debugging, I went to manually run debmirror again, only to have it exit prematurely with the following error:

.temp/dists/wheezy/contrib/i18n/Translation-en.bz2: Invalid argument at /bin/debmirror line 1533.
releasing 1 pending lock... at /usr/share/perl5/vendor_perl/LockFile/Simple.pm line 206.

At the second it failed, the following popped up in the glusterfs mount log:

[2013-11-09 17:18:26.797059] W [defaults.c:1291:default_release] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/distribute.so(dht_open+0x2af) [0x7f873e8fc1af] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/distribute.so(dht_local_wipe+0xa7) [0x7f873e8d2907] (-->/usr/lib64/libglusterfs.so.0(fd_unref+0x144) [0x7f8745456f34]))) 0-fuse: xlator does not implement release_cbk
[2013-11-09 17:55:49.815846] W [fuse-bridge.c:915:fuse_fd_cbk] 0-glusterfs-fuse: 1067417: OPEN() /debian/.temp/dists/wheezy/contrib/i18n/Translation-en.bz2 => -1 (Invalid argument)

Note that this is the same file that we just triggered a self heal on in the previous few comments.  

(Also note that there are two log lines here, one at 17:55 (just now) and one at 17:18, which was just after I was fetching files related to the previous comments.  However, the 17:18 statement didn't show up until just now, when the 17:55 statement did.  From my observations GlusterFS doesn't seem to buffer logs, so I'm not sure whether something was hung that caused the earlier log statement to appear along with this one, or whether it sheds light on anything else.)
Comment 5 Chad Feller 2013-11-09 13:19:20 EST
I just ran debmirror yet again, this time it completed successfully with the following output:

Parsing Packages and Sources files ...
Get Translation files ...
Files to download: 0 B
Download completed in 213s.
Everything OK. Moving meta files ...
cp: ‘dists/wheezy/contrib/i18n/Translation-en.bz2’ and ‘/var/www/mirror/debian/dists/wheezy/contrib/i18n/Translation-en.bz2’ are the same file
cp: ‘dists/wheezy-updates/main/i18n/Translation-en.bz2’ and ‘/var/www/mirror/debian/dists/wheezy-updates/main/i18n/Translation-en.bz2’ are the same file
cp: ‘dists/wheezy-updates/non-free/i18n/Translation-en.bz2’ and ‘/var/www/mirror/debian/dists/wheezy-updates/non-free/i18n/Translation-en.bz2’ are the same file
Cleanup mirror.
All done.

Those "cp: .... are the same file" statements are unusual; that is, I usually don't see them.

In the GlusterFS logs, I'm seeing the following at the same moment:

[2013-11-09 17:55:49.816102] W [defaults.c:1291:default_release] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/distribute.so(dht_open+0x2af) [0x7f873e8fc1af] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/distribute.so(dht_local_wipe+0xa7) [0x7f873e8d2907] (-->/usr/lib64/libglusterfs.so.0(fd_unref+0x144) [0x7f8745456f34]))) 0-fuse: xlator does not implement release_cbk
[2013-11-09 18:04:39.556953] W [fuse-bridge.c:1233:fuse_unlink_cbk] 0-glusterfs-fuse: 1195277: UNLINK() /debian/dists/wheezy/contrib/i18n/Translation-en.bz2 => -1 (No such file or directory)
[2013-11-09 18:04:39.692404] W [fuse-bridge.c:1233:fuse_unlink_cbk] 0-glusterfs-fuse: 1195392: UNLINK() /debian/dists/wheezy-updates/main/i18n/Translation-en.bz2 => -1 (No such file or directory)
[2013-11-09 18:04:39.803777] W [fuse-bridge.c:1233:fuse_unlink_cbk] 0-glusterfs-fuse: 1195476: UNLINK() /debian/dists/wheezy-updates/non-free/i18n/Translation-en.bz2 => -1 (No such file or directory)

Again, this is the same problematic file.

Also note the timestamp on the first line of log output is about the same time as the previous log statements in comment #4, yet again all four of these log statements appeared at the same time.  Not sure if this is a logging bug or what.
Comment 6 Chad Feller 2013-11-09 13:23:08 EST
Created attachment 821948 [details]
404 error following debmirror completion in comment #5

Following the debmirror completion in comment #5, I ran apt-get update again on one of my Debian machines and got the 404 error. This is what immediately appeared in the glusterfs mount log.
Comment 7 Chad Feller 2013-11-09 13:34:41 EST
Looking at comment #6, and the related attachment, we're back at square one.  

Again, I was just able to manually navigate to the directory in question, which triggered a self-heal (conservative merge), and we're back in business, yet again.  Unfortunately, this isn't an acceptable workaround, and this bug I'm seeing in debmirror may appear in other places too.

I think I've outlined the entire process of events, and have provided enough logs to provide insight as to what is going on.

Again, using debmirror on a GlusterFS mount is not an issue in 3.3.2, only upon switching to 3.4.1 did this bug manifest itself.
Comment 8 Chad Feller 2013-11-09 19:48:37 EST
Additionally, if anyone wants to try to reproduce this on their end, I'm using the debmirror rpm from Fedora's repos:

  debmirror-2.14-2.fc18.noarch

and am running the following command:

  debmirror -e rsync -h <remote.debian.mirror> -r :debian -a amd64 -s main,contrib,non-free -d wheezy,wheezy-updates /var/www/mirror/debian --postcleanup --ignore-small-errors
Comment 9 Chad Feller 2013-12-13 18:22:19 EST
I just made an additional observation which I thought should be noted here:

The self heal apparently isn't persistent!  

The self heal is happening on the gluster client, and is noted in the mount log, but if I reboot the client the changes revert.  That is, changes don't appear to be propagating out to the servers.  If I have to reboot the client (kernel update, or any other update that requires a reboot), I'm back to square one.  Files are missing, and manually navigating to them triggers a self heal.

In attempting to poke around this issue some more, I switched off selinux (switched from Enforcing to Permissive) on both the servers and the client.  Nothing changes.  I still get the same errors, and the self heal is still not persistent.
Comment 10 Chad Feller 2013-12-19 18:58:47 EST
An additional note:

Following up on comment #9, not only is the self heal not persistent between reboots, it isn't persistent after a lot of system IO.  If there is a lot of "other" Gluster traffic, it appears that the self heal gets purged.  Everything observed in comment #9 is the same, just replace "reboot" with tons of "other" IO over GlusterFS.  

So it appears that the self heal is just cached somewhere in the GlusterFS client, and purged next time it needs that memory for something else?
Comment 11 Chad Feller 2013-12-19 19:49:34 EST
Looking into this further, I've discovered the following:  The error is down to one file.  Initially there were other files/directories giving me issues, but somehow those eventually got healed - the self heal made it into persistent memory somehow?  This last file that seems to be stubborn in not healing is:

  /debian/dists/wheezy-updates/main/i18n/Translation-en.bz2

(The fact that this was happening in multiple directories initially can be seen in the initial post.  There you see the self heal operation happening on a different directory.  With regard to this bug, I haven't had to touch that other directory in a while.)

I decided to investigate a little further regarding this last file in question.  What happens if I try to delete the file?

# rm Translation-en.bz2
rm: remove regular file `Translation-en.bz2'? y
rm: cannot remove `Translation-en.bz2': No such file or directory

The gluster mount log, when this happens shows the following:

[2013-12-17 04:52:50.366825] W [fuse-bridge.c:1233:fuse_unlink_cbk] 0-glusterfs-fuse: 370: UNLINK() /debian/dists/wheezy-updates/main/i18n/Translation-en.bz2 => -1 (No such file or directory)

Looking more closely at the file, I noticed that it has hard links:

# ls -lah
total 2.5K
drwxr-xr-x. 2 root root   37 Dec 15 11:09 .
drwxr-xr-x. 5 root root   98 Nov  5 03:01 ..
-rw-r--r--. 2 root root 1.9K Nov  7 12:04 Translation-en.bz2

OK, let's find the links:

# ls -i Translation-en.bz2
10907466576973813014 Translation-en.bz2

# find /mnt/gluster/gv0/debian -inum 10907466576973813014
/mnt/gluster/gv0/debian/dists/wheezy-updates/main/i18n/Translation-en.bz2
/mnt/gluster/gv0/debian/.temp/dists/wheezy-updates/main/i18n/Translation-en.bz2
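
The same search can be done for a whole tree at once. This is a minimal sketch that groups regular files by (device, inode) and keeps only those with a link count above one; it relies on the inode numbers reported through the mount being stable during the walk, which is an assumption here. `hard_link_groups` is a hypothetical name.

```python
import os
import stat
from collections import defaultdict


def hard_link_groups(root):
    """Map (device, inode) -> paths under root, for regular files with nlink > 1."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # entry vanished or cannot be looked up; skip it
            if stat.S_ISREG(st.st_mode) and st.st_nlink > 1:
                groups[(st.st_dev, st.st_ino)].append(path)
    return dict(groups)
```

A group that lists fewer paths than the file's link count means the remaining links live outside the tree that was walked.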

OK, can I delete the other file?

# rm /mnt/gluster/gv0/debian/.temp/dists/wheezy-updates/main/i18n/Translation-en.bz2
rm: remove regular file `/mnt/gluster/gv0/debian/.temp/dists/wheezy-updates/main/i18n/Translation-en.bz2'? y

Sure can.

The file in question is now only showing 1 hard link, so it is acknowledging that I deleted the other file.

# ls -lah
total 2.5K
drwxr-xr-x. 2 root root   37 Dec 15 11:09 .
drwxr-xr-x. 5 root root   98 Nov  5 03:01 ..
-rw-r--r--. 1 root root 1.9K Nov  7 12:04 Translation-en.bz2 

So now can I delete the troublesome file?

# rm Translation-en.bz2
rm: remove regular file `Translation-en.bz2'? y
rm: cannot remove `Translation-en.bz2': No such file or directory

Nope, still cannot. (and I'm getting the same error in the gluster mount log.)

Can I rename the file?  Nope.  And I get this in the mount logs:

[2013-12-17 04:57:19.210358] W [dht-rename.c:334:dht_rename_unlink_cbk] 0-gv0-dht: /debian/dists/wheezy-updates/main/i18n/Translation-en.bz2: unlink on gv0-replicate-0 failed (No such file or directory)

OK, can I overwrite the file?

# echo "Hello" > Translation-en.bz2 

It appeared to let me.  Can I rename it now?

# mv Translation-en.bz2 t.txt

# ls -lah
total 1.0K
drwxr-xr-x. 2 root root 24 Dec 16 20:57 .
drwxr-xr-x. 5 root root 98 Nov  5 03:01 ..
-rw-r--r--. 1 root root  6 Dec 16 20:57 t.txt

It appeared to let me do that too. (Note that the file size is smaller too, as it should be.)

Now let's get rid of the file:

# mv t.txt /tmp/
# ls -a
.  ..

Beautiful.  

I accessed the same gluster volume in question from a different client to make sure the file really is gone, and that this client wasn't lying to me, and I was able to confirm that the file was really gone from the volume.

OK, now let's resync with the debmirror script.

Before the debmirror script completed, this popped up in the gluster mount log:

[2013-12-17 06:37:29.387250] E [client-rpc-fops.c:5179:client3_3_inodelk] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/replicate.so(afr_lock_rec+0x70) [0x7f6304f407b0] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/replicate.so(afr_nonblocking_inodelk+0x652) [0x7f6304f57b22] (-->/usr/lib64/glusterfs/3.4.1/xlator/protocol/client.so(client_inodelk+0x99) [0x7f630518e909]))) 0-: Assertion failed: 0
[2013-12-17 06:37:29.387334] E [client-rpc-fops.c:5179:client3_3_inodelk] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/replicate.so(afr_lock_rec+0x70) [0x7f6304f407b0] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/replicate.so(afr_nonblocking_inodelk+0x652) [0x7f6304f57b22] (-->/usr/lib64/glusterfs/3.4.1/xlator/protocol/client.so(client_inodelk+0x99) [0x7f630518e909]))) 0-: Assertion failed: 0
[2013-12-17 06:37:29.387405] E [client-rpc-fops.c:5179:client3_3_inodelk] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/replicate.so(afr_blocking_lock+0x74) [0x7f6304f590d4] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/replicate.so(afr_lock_blocking+0x844) [0x7f6304f58df4] (-->/usr/lib64/glusterfs/3.4.1/xlator/protocol/client.so(client_inodelk+0x99) [0x7f630518e909]))) 0-: Assertion failed: 0
[2013-12-17 06:37:29.387471] E [client-rpc-fops.c:5179:client3_3_inodelk] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/replicate.so(+0x451c8) [0x7f6304f591c8] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/replicate.so(afr_lock_blocking+0x844) [0x7f6304f58df4] (-->/usr/lib64/glusterfs/3.4.1/xlator/protocol/client.so(client_inodelk+0x99) [0x7f630518e909]))) 0-: Assertion failed: 0
[2013-12-17 06:37:29.387486] I [afr-lk-common.c:1075:afr_lock_blocking] 0-gv0-replicate-1: unable to lock on even one child
[2013-12-17 06:37:29.387580] I [afr-transaction.c:1063:afr_post_blocking_inodelk_cbk] 0-gv0-replicate-1: Blocking inodelks failed.
[2013-12-17 06:37:29.387616] E [dht-linkfile.c:213:dht_linkfile_setattr_cbk] 0-gv0-dht: setattr of uid/gid on /debian/dists/wheezy/contrib/i18n/Translation-en.bz2 :<gfid:00000000-0000-0000-0000-000000000000> failed (Invalid argument)
[2013-12-17 06:37:30.636628] E [client-rpc-fops.c:5179:client3_3_inodelk] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/replicate.so(afr_lock_rec+0x70) [0x7f6304f407b0] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/replicate.so(afr_nonblocking_inodelk+0x652) [0x7f6304f57b22] (-->/usr/lib64/glusterfs/3.4.1/xlator/protocol/client.so(client_inodelk+0x99) [0x7f630518e909]))) 0-: Assertion failed: 0
[2013-12-17 06:37:30.636717] E [client-rpc-fops.c:5179:client3_3_inodelk] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/replicate.so(afr_lock_rec+0x70) [0x7f6304f407b0] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/replicate.so(afr_nonblocking_inodelk+0x652) [0x7f6304f57b22] (-->/usr/lib64/glusterfs/3.4.1/xlator/protocol/client.so(client_inodelk+0x99) [0x7f630518e909]))) 0-: Assertion failed: 0
[2013-12-17 06:37:30.636789] E [client-rpc-fops.c:5179:client3_3_inodelk] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/replicate.so(afr_blocking_lock+0x74) [0x7f6304f590d4] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/replicate.so(afr_lock_blocking+0x844) [0x7f6304f58df4] (-->/usr/lib64/glusterfs/3.4.1/xlator/protocol/client.so(client_inodelk+0x99) [0x7f630518e909]))) 0-: Assertion failed: 0
[2013-12-17 06:37:30.636856] E [client-rpc-fops.c:5179:client3_3_inodelk] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/replicate.so(+0x451c8) [0x7f6304f591c8] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/replicate.so(afr_lock_blocking+0x844) [0x7f6304f58df4] (-->/usr/lib64/glusterfs/3.4.1/xlator/protocol/client.so(client_inodelk+0x99) [0x7f630518e909]))) 0-: Assertion failed: 0
[2013-12-17 06:37:30.636871] I [afr-lk-common.c:1075:afr_lock_blocking] 0-gv0-replicate-0: unable to lock on even one child
[2013-12-17 06:37:30.636962] I [afr-transaction.c:1063:afr_post_blocking_inodelk_cbk] 0-gv0-replicate-0: Blocking inodelks failed.
[2013-12-17 06:37:30.637018] E [dht-linkfile.c:213:dht_linkfile_setattr_cbk] 0-gv0-dht: setattr of uid/gid on /debian/dists/wheezy-updates/non-free/i18n/Translation-en.bz2 :<gfid:00000000-0000-0000-0000-000000000000> failed (Invalid argument)

And, now the problem is back to square one with the same file in question.  Hope that sheds light on things?
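
The two dht_linkfile_setattr_cbk errors above both carry an all-zero gfid. A small sketch for pulling the affected paths out of a mount log follows; the pattern is inferred from those two lines only, and `null_gfid_paths` is a hypothetical helper.

```python
import re

NULL_GFID = "00000000-0000-0000-0000-000000000000"

# Matches "<path> :<gfid:...>" as it appears in the setattr errors quoted above.
GFID_RE = re.compile(r"(?P<path>/\S+) :<gfid:(?P<gfid>[0-9a-f-]{36})>")


def null_gfid_paths(lines):
    """Return the distinct paths logged with an all-zero gfid, in first-seen order."""
    paths = []
    for line in lines:
        m = GFID_RE.search(line)
        if m and m.group("gfid") == NULL_GFID and m.group("path") not in paths:
            paths.append(m.group("path"))
    return paths
```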

P.S.  I realize the timestamps on some of the files above are before the date of this post.  I did this investigation a couple of days ago, and had my notes/results in a txt file waiting for a convenient time to update this bug report.
Comment 12 Pranith Kumar K 2013-12-23 02:30:34 EST
Chad,
      The logs suggest that some operations are arriving on files/directories that have no gfid (a gfid is similar to an inode number, but is unique per file/directory across the whole Gluster volume). It seems like you have a consistently reproducible case. Let's say I want to re-create this issue on one of my VMs with a Debian Wheezy guest OS. What is the procedure? I am a complete Debian newbie, so please give me the exact commands that need to be executed on the new VM to hit this.

Pranith
Comment 13 Chad Feller 2013-12-30 16:31:45 EST
Pranith,

To be clear, the Gluster hosts are Fedora and RHEL.  The Gluster cluster (4x2 distribute replicate) is RHEL 6, and the Gluster client that is accessing the volume is Fedora 18.  The role of this Fedora 18 box is an internal webserver, and we're slowly migrating its storage backend to Gluster, one volume at a time.  The Debian mirror was the first thing we migrated (we're mostly Fedora, RHEL, and CentOS, but do have a small number of Debian boxes too).  In my experience, Debian is best mirrored with the debmirror program, so from Fedora, just:

  yum install debmirror

I would move the rpm-provided /etc/debmirror.conf out of the way, as I recall variables in there conflicting with what I was specifying on the command line.  (I seem to recall that you'll also have to install the Debian-archive GPG key, as debmirror barks about that at first.)  And then run your debmirror script as I described in Comment #8.

Setup Apache on your Fedora box so that it serves up your mirror.  For me it is really simple:

    DocumentRoot /var/www/mirror

    <Directory /var/www/mirror>
        Options +Indexes -MultiViews
        AllowOverride all
        Order allow,deny
        allow from all
    </Directory>

(the Gluster backed Debian mirror is mounted under /var/www/mirror/debian)

On your Debian VM, edit /etc/apt/sources.list to use your mirror:

  deb http://<your.local.mirror>/debian/ wheezy main contrib non-free
  deb-src http://<your.local.mirror>/debian/ wheezy main contrib non-free

  deb http://<your.local.mirror>/debian/ wheezy-updates main contrib non-free
  deb-src http://<your.local.mirror>/debian/ wheezy-updates main contrib non-free

  # leave security.debian.org pointed at them
  # as the debmirror script above isn't mirroring this
  #
  deb http://security.debian.org/ wheezy/updates main contrib non-free
  deb-src http://security.debian.org/ wheezy/updates main contrib non-free


At this point, on your Debian box, from the command line (as root) you can run apt as follows:

  apt-get update && apt-get upgrade

After you do this, you should be able to replicate what I'm seeing.
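
If several Debian boxes need to be pointed at the mirror, the four local entries above can be generated rather than typed. A minimal sketch, assuming the mirror is served under /debian/ as described; `sources_list` and the host name in the usage note are placeholders.

```python
def sources_list(mirror_host,
                 suites=("wheezy", "wheezy-updates"),
                 components=("main", "contrib", "non-free")):
    """Return deb and deb-src lines for each suite on the given mirror host."""
    comps = " ".join(components)
    lines = []
    for suite in suites:
        for kind in ("deb", "deb-src"):
            lines.append(f"{kind} http://{mirror_host}/debian/ {suite} {comps}")
    return "\n".join(lines) + "\n"
```

Calling `sources_list("mirror.example")` produces the local entries only; the security.debian.org lines still have to be kept as they are, since debmirror is not mirroring that archive here.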

-Chad
Comment 14 Chad Feller 2013-12-30 16:54:34 EST
Created attachment 843555 [details]
Fedora_mirror_error_logs

Since filing this bug report, I've also created a Fedora mirror, backed by the same Gluster cluster.  

Interestingly enough, I'm seeing similar error messages in the Gluster mount log (attached) at sync time.  

Yet unlike with the errors in Debian repo, these errors haven't triggered any noticeable filesystem corruption.  That is, I haven't run into any issues with my Fedora clients using the mirror (at least yet).
Comment 15 Pranith Kumar K 2013-12-31 06:18:36 EST
Hey Chad,
       For data to be replicated, it needs to be written through the Gluster mount point. I could be wrong, but according to the example it seems like data is being added directly to the brick backend rather than through the Gluster mount point. Please confirm.

Pranith
Comment 16 Chad Feller 2013-12-31 09:46:11 EST
Pranith,

The 4 RHEL6 boxes are the bricks.  Each box provides 1 brick (which itself is 12 disks behind hardware RAID 6).

The Fedora 18 box is the mount point (the gluster client).  So the logs I've provided you are from the Fedora 18 box.

The client mounts on boot via the fstab.  The relevant lines from my fstab are as follows:

gluster0.my.local.domain:gv0    /mnt/gluster/gv0 glusterfs defaults,_netdev,fetch-attempts=3 0 0
/mnt/gluster/gv0/debian /var/www/mirror/debian  bind    defaults,bind   0 0

The Fedora 18 box is where I am running debmirror from. It is also where I am running Apache from, serving up the Debian repo (among other things).
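
One way to double-check that a path really goes through the Gluster mount rather than straight to a brick is to look it up in the kernel mount table. The sketch below takes the text of a mount table (for example the contents of /proc/mounts) and returns the longest matching mount point and its filesystem type. It assumes that a FUSE-mounted Gluster volume is listed with type fuse.glusterfs and does not handle escaped spaces in mount points; `filesystem_type` is a hypothetical name.

```python
def filesystem_type(path, mounts_text):
    """Return (mount_point, fs_type) for the longest mount point containing path.

    mounts_text uses the /proc/mounts layout: device, mount point, type, ...
    Returns None if no line matches.
    """
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            if best is None or len(mount_point) > len(best[0]):
                best = (mount_point, fs_type)
    return best
```

If the directory that debmirror writes to comes back with a local filesystem type such as xfs instead of the FUSE type, the writes would be bypassing Gluster.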

-Chad
Comment 17 Chad Feller 2014-01-03 22:11:34 EST
Updating version to 3.4.2.  Upgraded servers and clients to 3.4.2 this morning, and re-ran the mirroring process, etc.  The problem remains.
Comment 18 Pranith Kumar K 2014-01-04 02:08:12 EST
Chad,
   Will it be possible for you to come to the Gluster IRC channel? I am in India, so my timezone is GMT+5:30. I will be available from 11:00 AM till 7:00 PM, Monday through Friday. We can discuss this there and come to some conclusions about the problem.

Pranith
Comment 19 Chad Feller 2014-01-04 02:31:19 EST
Pranith,

Sure, what is your handle?

-Chad
Comment 20 Pranith Kumar K 2014-01-04 03:00:35 EST
It's pk, and I am online at the moment.
Comment 21 Chad Feller 2014-01-08 08:51:57 EST
Let me know when you'll be on #gluster.  I was online from 11:00am to 1:00pm (your time), and again from 6:30pm to 7:00pm (your time) - I actually still am online.
Comment 22 Chad Feller 2014-01-08 08:55:01 EST
Oh, and my handle is cfeller
Comment 23 Pranith Kumar K 2014-01-13 06:01:40 EST
After the IRC conversation we found that one root cause, based on the logs in https://bugzilla.redhat.com/show_bug.cgi?id=1028582#c11, is bug 971805. The fix has been backported to release-3.4 at http://review.gluster.com/6691

Vijay will be providing the build, and Chad Feller agreed to help in verifying this issue.

Pranith
Comment 24 Chad Feller 2014-01-30 13:52:11 EST
Hi Pranith, Vijay,

Just checking in on this.  Wondering if an official 3.4.3 build with that fix was coming soon, or if you were going to be providing a test build in advance of that.

Let me know.  Thanks,

-Chad
Comment 25 Vijay Bellur 2014-01-30 13:55:54 EST
(In reply to Chad Feller from comment #24)
> Hi Pranith, Vijay,
> 
> Just checking in on this.  Wondering if an official 3.4.3 build with that fix
> was coming soon, or if you were going to be providing a test build in
> advance of that.
> 
> Let me know.  Thanks,
> 


This nightly build includes the fix:

http://download.gluster.org/pub/gluster/glusterfs/nightly/glusterfs-3.4/epel-6-x86_64/glusterfs-3.4.20140129.8eda793-1.autobuild/

Please let us know how your testing goes.

Thanks,
Vijay
Comment 26 Chad Feller 2014-02-01 15:46:49 EST
The nightly build referenced in comment #25 failed to fix the issue.  I tested the build on my test gluster cluster, and had my test webserver mount the gluster volume.

First I mirrored debian on 3.4.2, like above to ensure that the problem still exists on the test cluster, which it did.  

I then destroyed the mirror and then upgraded the gluster servers to 3.4.20140129.8eda793-1:

  glusterfs-libs-3.4.20140129.8eda793-1.el6.x86_64
  glusterfs-server-3.4.20140129.8eda793-1.el6.x86_64
  glusterfs-3.4.20140129.8eda793-1.el6.x86_64
  glusterfs-cli-3.4.20140129.8eda793-1.el6.x86_64
  glusterfs-fuse-3.4.20140129.8eda793-1.el6.x86_64

I upgraded the gluster client as well to 3.4.20140129.8eda793-1:

  glusterfs-3.4.20140129.8eda793-1.fc20.x86_64
  glusterfs-fuse-3.4.20140129.8eda793-1.fc20.x86_64
  glusterfs-libs-3.4.20140129.8eda793-1.fc20.x86_64

I rebooted the servers and the client.  

After everything was back up I re-mirrored the debian repo.  The mirroring process completed with zero errors.

I then pointed one of my Debian clients at the gluster test setup and ran an "apt-get update".  At that point I got several errors:

...
(successful fetches in here)
...
Err http://<our.local.test.mirror> wheezy/non-free amd64 Packages               
  404  Not Found
Err http://<our.local.test.mirror> wheezy-updates/main Sources                  
  404  Not Found
Err http://<our.local.test.mirror> wheezy-updates/contrib Sources               
  404  Not Found
Get:20 http://security.debian.org wheezy/updates/non-free Translation-en [14 B]
Err http://<our.local.test.mirror> wheezy-updates/main amd64 Packages           
  404  Not Found
Err http://<our.local.test.mirror> wheezy-updates/main Translation-en           
  404  Not Found
... 
(more successful fetches in here)
...
Fetched 6,628 kB in 3s (1,766 kB/s)             
W: Failed to fetch http://<our.local.test.mirror>/debian/dists/wheezy/non-free/binary-amd64/Packages  404  Not Found

W: Failed to fetch http://<our.local.test.mirror>/debian/dists/wheezy-updates/main/source/Sources  404  Not Found

W: Failed to fetch http://<our.local.test.mirror>/debian/dists/wheezy-updates/contrib/source/Sources  404  Not Found

W: Failed to fetch http://<our.local.test.mirror>/debian/dists/wheezy-updates/main/binary-amd64/Packages  404  Not Found

W: Failed to fetch http://<our.local.test.mirror>/debian/dists/wheezy-updates/main/i18n/Translation-en  404  Not Found

E: Some index files failed to download. They have been ignored, or old ones used instead.


These errors coincide with exactly what I saw before...
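
To make it easier to compare these failures with what is on disk, the warnings can be turned into paths under the web server's document root. A minimal sketch, assuming the "W: Failed to fetch" line format shown above and the DocumentRoot /var/www/mirror from comment #13; the function names are hypothetical.

```python
import os
import re
from urllib.parse import urlsplit

# Line format taken from the apt-get output quoted above.
FAILED_RE = re.compile(
    r"^W: Failed to fetch (?P<url>\S+)\s+(?P<status>\d{3})\s+\S.*$"
)


def failed_fetches(apt_output):
    """Return [(url, http_status)] for every 'W: Failed to fetch' line."""
    failures = []
    for line in apt_output.splitlines():
        m = FAILED_RE.match(line.strip())
        if m:
            failures.append((m.group("url"), int(m.group("status"))))
    return failures


def local_path(url, document_root="/var/www/mirror"):
    """Map a mirror URL to the file the web server would read for it."""
    return os.path.join(document_root, urlsplit(url).path.lstrip("/"))
```

Each resulting path can then be checked on the Gluster client to see whether the file is visible there at the moment apt reports the 404.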

At this point, a ton of errors appeared in the gluster mount log.  I'm attaching the entire gluster mount log, if that helps.  

I mounted and ran the mirroring process (debmirror) yesterday (1-31), which is where you see zero errors.  I ran the apt-get today (2-1), which is where all of the errors pop up.  

Also, SELinux is set to 'permissive' on both the servers and the client.  'getenforce' on all boxes confirms this.
Comment 27 Chad Feller 2014-02-01 15:50:39 EST
Created attachment 858117 [details]
gluster mount log from 3.4.20140129.8eda793-1, containing errors during "apt-get update"

the corresponding log file for comment #26
Comment 28 Chad Feller 2014-03-02 14:19:08 EST
This problem remains in the glusterfs-3.4.3-0.1.alpha1 builds.  

For the record, all of the files in question in this bug report, the ones that gluster is having problems with, have hard links.  Did something significant change with regard to how files with hard links are handled between 3.3.2 and 3.4.x?
Comment 29 Chad Feller 2014-03-05 17:45:47 EST
OK, this is really interesting:

(I never considered using this before, as NFS had some serious problems in 3.3.x.  And 3.3.2 was working fine for me with the native FUSE client, so I had no reason to try anything else. Anyway...)

I decided to see if I could reproduce this problem using the native Gluster NFS.  I can't!  So the problem appears to be somewhere in the native FUSE client stack.  If I use the NFS client, the issue is not present.  

???

I don't get it, but regardless, as a workaround I can use NFS for that particular subdirectory, on that particular server, and the FUSE client everywhere else.
Comment 30 Chad Feller 2014-05-10 23:08:55 EDT
Following up from comment #29, I wanted to make a clarification, based on further observations:

Using NFS with 3.4.3 doesn't completely resolve the issue.  It merely mitigates some of the (numerous) symptoms.  Upon rebooting the cluster (the bricks), the files do appear to vanish again, but re-running "debmirror" on the NFS share transparently causes a (temporary) self heal, which appears to last until the next time the cluster (bricks) is rebooted.
Comment 31 Chad Feller 2014-05-10 23:20:51 EDT
I updated my test GlusterFS cluster to 3.5.0 (production setup is still running 3.4.3).  

On my test cluster, I am no longer able to produce this bug!!!  I didn't even have to re-mirror (via debmirror) the data on the NFS mount or over the FUSE mount.  Both immediately worked after upgrading to 3.5.0, interestingly enough.  (and I rebooted client and servers (the bricks) after upgrading to 3.5.0.)

I'm running further tests right now, but this bug may have vanished as strangely as it first appeared (recall that it wasn't present in 3.3.2).

servers:
glusterfs-3.5.0-2.el6.x86_64
glusterfs-libs-3.5.0-2.el6.x86_64
glusterfs-fuse-3.5.0-2.el6.x86_64
glusterfs-server-3.5.0-2.el6.x86_64
glusterfs-cli-3.5.0-2.el6.x86_64


client:
glusterfs-fuse-3.5.0-3.fc20.x86_64
glusterfs-libs-3.5.0-3.fc20.x86_64
glusterfs-3.5.0-3.fc20.x86_64
Comment 32 Niels de Vos 2015-05-17 17:57:40 EDT
GlusterFS 3.7.0 has been released (http://www.gluster.org/pipermail/gluster-users/2015-May/021901.html), and the Gluster project maintains N-2 supported releases. The last two releases before 3.7 are still maintained, at the moment these are 3.6 and 3.5.

This bug has been filed against the 3.4 release, and will not get fixed in a 3.4 version any more. Please verify if newer versions are affected with the reported problem. If that is the case, update the bug with a note, and update the version if you can. In case updating the version is not possible, leave a comment in this bug report with the version you tested, and set the "Need additional information the selected bugs from" below the comment box to "bugs@gluster.org".

If there is no response by the end of the month, this bug will get automatically closed.
Comment 33 Kaleb KEITHLEY 2015-10-07 09:59:36 EDT
GlusterFS 3.4.x has reached end-of-life.

If this bug still exists in a later release please reopen this and change the version or open a new bug.
