Description of problem:

Upon mirroring the Debian Wheezy repo on a GlusterFS mount point, some files are not properly found, even though the mirroring process completes successfully (according to the mirroring tool, "debmirror"). That is, "debmirror" completes without an error, but a subsequent "apt-get update" on a Debian machine returns a 404 Not Found on certain files. Trying an "apt-get update" on a second Debian machine produces the same result. Manually inspecting the directory structure shows the file(s) in question are actually there, at which point a subsequent "apt-get update" works successfully and Debian machines can be updated/installed/etc. using my local Debian mirror.

Looking at the glusterfs log, I notice the following:

[2013-11-08 17:56:00.886263] I [afr-self-heal-entry.c:2253:afr_sh_entry_fix] 0-gv0-replicate-0: /debian/dists/wheezy-updates/non-free/i18n: Performing conservative merge

So it almost appears as if the miss triggered a heal, which then caused things to work. But the initial "404 Not Found" is problematic.

Version-Release number of selected component (if applicable):

Client (Fedora 18):
$ rpm -qa | grep glusterfs
glusterfs-libs-3.4.1-1.fc18.x86_64
glusterfs-fuse-3.4.1-1.fc18.x86_64
glusterfs-3.4.1-1.fc18.x86_64

Server (RHEL 6):
# rpm -qa | grep glusterfs
glusterfs-3.4.1-3.el6.x86_64
glusterfs-cli-3.4.1-3.el6.x86_64
glusterfs-server-3.4.1-3.el6.x86_64
glusterfs-libs-3.4.1-3.el6.x86_64
glusterfs-fuse-3.4.1-3.el6.x86_64

Bricks are using XFS.

How reproducible:

Consistently

Steps to Reproduce:
1. Mount the GlusterFS volume.
2. Mirror the Debian repo using "debmirror".
3. Try to update a Debian client using apt.

Actual results:

Not all files are able to be fetched (initially).

Expected results:

All files should be able to be fetched.

Additional info:

I tested mirroring a Debian repo on my "test" GlusterFS cluster for a month with no issue. During that time my "test" cluster was running 3.3.2.
Two days ago I updated my "test" cluster to 3.4.1, and now I see the exact same thing on that test cluster.

The GlusterFS client, running Fedora 18, is serving up the Debian repo using Apache (httpd-2.4.6-2.fc18.x86_64). The relevant directories are mounted like so via fstab:

<gluster-server>:gv0 /mnt/gluster/gv0 glusterfs defaults,_netdev 0 0
/mnt/gluster/gv0/debian /var/www/mirror/debian bind defaults,bind 0 0

Gluster volume info on the servers:

# gluster volume info

Volume Name: gv0
Type: Distributed-Replicate
Volume ID: a86fbffd-408d-41f9-b2ed-a3816f09d924
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gluster0:/export/brick0
Brick2: gluster1:/export/brick0
Brick3: gluster2:/export/brick0
Brick4: gluster3:/export/brick0

SELinux is Enforcing on both the RHEL 6 servers and on the Fedora 18 GlusterFS client. The only configuration change I made was setting:

setsebool -P httpd_use_fusefs 1

on the Fedora 18 box so that Apache played nice with the glusterfs directory.

Again, no problems on 3.3.2 with this same configuration. My "test" cluster was upgraded from 3.3.2 to 3.4.1. My main GlusterFS cluster was installed as 3.4.1 from scratch.
Created attachment 821934 [details] logs at moment of 404 not found error This is what appears in the logs of the mount point the second that a 404 error is returned when attempting to access a file in question.
Now on the above, apt will attempt to access something like:

http://<server>/debian/dists/wheezy-updates/main/i18n/Translation-en.bz2

I can reproduce the same thing with wget. And this will continue indefinitely until I trigger the heal.

In order to trigger the heal, you have to navigate to the file. Yesterday I navigated to the file from the command line (bash), and that triggered the heal. Today I navigated to the file via a web browser, as the whole directory structure is served via Apache (as most Linux repos are). In doing this, I noticed that I didn't have to navigate all of the way to the file in question. The self heal was triggered once I got to:

http://<server>/debian/dists/wheezy-updates/main/

Once Apache changed into that directory, I saw the following pop up in the logs:

[2013-11-09 17:17:44.041625] I [afr-self-heal-entry.c:2253:afr_sh_entry_fix] 0-gv0-replicate-0: /debian/dists/wheezy-updates/main/i18n: Performing conservative merge

Obviously i18n is a subdirectory of the folder I just navigated to. Once that self-heal entry showed up in the log, I was able to retrieve the file in question (Translation-en.bz2, referenced at the top of this comment) using both apt and wget.

This repo will now continue to work until the debmirror script runs again at 3:00am (local time) tomorrow, at which point certain files will likely be missing again until I manually navigate to them to trigger a self heal.
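Since navigating into a directory is what kicks off the heal, a blunt workaround (my own sketch, not something from GlusterFS documentation) would be to walk the tree and stat() every directory after each mirror run, forcing a FUSE lookup on each one. Illustrated here against a throwaway tree rather than the real mount point:

```shell
#!/bin/sh
# Sketch of the "navigate to trigger self-heal" workaround: stat() every
# directory so the Gluster client issues a LOOKUP on each one. On a real
# deployment you would point MIRROR at the Gluster mount (e.g.
# /mnt/gluster/gv0/debian); here we build a stand-in tree so the script
# is self-contained.
MIRROR=$(mktemp -d)
mkdir -p "$MIRROR/dists/wheezy-updates/main/i18n"
touch "$MIRROR/dists/wheezy-updates/main/i18n/Translation-en.bz2"

# Each stat is a lookup; on the affected volume this is what appeared to
# start the "Performing conservative merge" entries in the mount log.
count=$(find "$MIRROR" -type d -exec stat --format='%n' {} + | wc -l)
echo "visited $count directories"

rm -rf "$MIRROR"
```

Cron'ing something like this to run after the nightly debmirror would at least mask the 404s, though it obviously does nothing about the underlying bug.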
(In reply to Chad Feller from comment #2) > Now on the above, apt will attempt to access something like: > > http://<server>/debian/dists/wheezy-updates/main/i18n/Translation-en.bz2 > To clarify, at the beginning of comment #2, in this statement here, I was referring to the attachment I created in comment #1. That is, accessing this URL is what triggered the attachment in Comment #1.
In an attempt to do further debugging, I went to manually run debmirror again, only to have it exit prematurely with the following error:

.temp/dists/wheezy/contrib/i18n/Translation-en.bz2: Invalid argument at /bin/debmirror line 1533.
releasing 1 pending lock... at /usr/share/perl5/vendor_perl/LockFile/Simple.pm line 206.

The second it failed, the following popped up in the glusterfs mount log:

[2013-11-09 17:18:26.797059] W [defaults.c:1291:default_release] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/distribute.so(dht_open+0x2af) [0x7f873e8fc1af] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/distribute.so(dht_local_wipe+0xa7) [0x7f873e8d2907] (-->/usr/lib64/libglusterfs.so.0(fd_unref+0x144) [0x7f8745456f34]))) 0-fuse: xlator does not implement release_cbk
[2013-11-09 17:55:49.815846] W [fuse-bridge.c:915:fuse_fd_cbk] 0-glusterfs-fuse: 1067417: OPEN() /debian/.temp/dists/wheezy/contrib/i18n/Translation-en.bz2 => -1 (Invalid argument)

Note that this is the same file that we just triggered a self heal on in the previous few comments. (Also note that there are two log lines here, one at 17:55 (just now) and one at 17:18, which was just after I was fetching files related to the previous comments. However, the 17:18 statement didn't show up until just now, when the statement at 17:55 did. GlusterFS doesn't seem to buffer logs (from my observations), so I'm not sure if something hung and caused that previous log statement to appear with this one, or if it sheds light on anything else.)
I just ran debmirror yet again; this time it completed successfully with the following output:

Parsing Packages and Sources files ...
Get Translation files ...
Files to download: 0 B
Download completed in 213s. Everything OK.
Moving meta files ...
cp: ‘dists/wheezy/contrib/i18n/Translation-en.bz2’ and ‘/var/www/mirror/debian/dists/wheezy/contrib/i18n/Translation-en.bz2’ are the same file
cp: ‘dists/wheezy-updates/main/i18n/Translation-en.bz2’ and ‘/var/www/mirror/debian/dists/wheezy-updates/main/i18n/Translation-en.bz2’ are the same file
cp: ‘dists/wheezy-updates/non-free/i18n/Translation-en.bz2’ and ‘/var/www/mirror/debian/dists/wheezy-updates/non-free/i18n/Translation-en.bz2’ are the same file
Cleanup mirror.
All done.

Those "cp: ... are the same file" statements are unusual; I don't usually see those.

In the GlusterFS logs, I'm seeing the following at the same moment:

[2013-11-09 17:55:49.816102] W [defaults.c:1291:default_release] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/distribute.so(dht_open+0x2af) [0x7f873e8fc1af] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/distribute.so(dht_local_wipe+0xa7) [0x7f873e8d2907] (-->/usr/lib64/libglusterfs.so.0(fd_unref+0x144) [0x7f8745456f34]))) 0-fuse: xlator does not implement release_cbk
[2013-11-09 18:04:39.556953] W [fuse-bridge.c:1233:fuse_unlink_cbk] 0-glusterfs-fuse: 1195277: UNLINK() /debian/dists/wheezy/contrib/i18n/Translation-en.bz2 => -1 (No such file or directory)
[2013-11-09 18:04:39.692404] W [fuse-bridge.c:1233:fuse_unlink_cbk] 0-glusterfs-fuse: 1195392: UNLINK() /debian/dists/wheezy-updates/main/i18n/Translation-en.bz2 => -1 (No such file or directory)
[2013-11-09 18:04:39.803777] W [fuse-bridge.c:1233:fuse_unlink_cbk] 0-glusterfs-fuse: 1195476: UNLINK() /debian/dists/wheezy-updates/non-free/i18n/Translation-en.bz2 => -1 (No such file or directory)

Again, this is the same problematic file.
Also note that the timestamp on the first line of log output is about the same time as the previous log statements in comment #4, yet all four of these log statements appeared at the same time. Not sure if this is a logging bug or what.
Created attachment 821948 [details] 404 error following debmirror completion in comment #5

Following debmirror completing in comment #5, I ran apt-get update again on one of my Debian machines, got the 404 error, and this is what immediately appeared in the glusterfs mount log.
Looking at comment #6 and the related attachment, we're back at square one. Again, I was just able to manually navigate to the directory in question, which triggered a self-heal (conservative merge), and we're back in business, yet again. Unfortunately, this isn't an acceptable workaround, and the behavior I'm seeing with debmirror may appear in other places too.

I think I've outlined the entire sequence of events, and have provided enough logs to give insight into what is going on. Again, using debmirror on a GlusterFS mount was not an issue in 3.3.2; only upon switching to 3.4.1 did this bug manifest itself.
Additionally, if anyone wants to try to reproduce this on their end, I'm using the debmirror rpm from Fedora's repos:

debmirror-2.14-2.fc18.noarch

and am running the following command:

debmirror -e rsync -h <remote.debian.mirror> -r :debian -a amd64 -s main,contrib,non-free -d wheezy,wheezy-updates /var/www/mirror/debian --postcleanup --ignore-small-errors
I just made an additional observation which I thought should be noted here: the self heal apparently isn't persistent!

The self heal is happening on the gluster client, and noted in the mount log, but if I reboot the client the changes revert. That is, changes don't appear to be propagating out to the servers. If I have to reboot the client (kernel update, or any other update that requires a reboot), I'm back to square one: files are missing, and manually navigating to them triggers a self heal.

In attempting to poke around this issue some more, I switched SELinux from Enforcing to Permissive on both the servers and the client. Nothing changes. I still get the same errors, and the self heal is still not persistent.
An additional note, following up on comment #9: not only is the self heal not persistent between reboots, it isn't persistent after a lot of system IO either. If there is a lot of "other" Gluster traffic, it appears that the self heal gets purged. Everything observed in comment #9 is the same; just replace "reboot" with tons of "other" IO over GlusterFS. So it appears that the self heal result is just cached somewhere in the GlusterFS client, and purged the next time that memory is needed for something else?
Looking into this further, I've discovered the following:

The error is down to one file. Initially there were other files/directories giving me issues, but somehow those eventually got healed - the self heal made it into persistent memory somehow? The last file that seems stubborn in not healing is:

/debian/dists/wheezy-updates/main/i18n/Translation-en.bz2

(The fact that this was happening in multiple directories initially can be noted in the initial post. There you see the self heal operation happening on a different directory. With regards to this bug, I haven't had to touch that other directory in a while.)

I decided to investigate a little further regarding this last file in question. What happens if I try to delete the file?

# rm Translation-en.bz2
rm: remove regular file `Translation-en.bz2'? y
rm: cannot remove `Translation-en.bz2': No such file or directory

The gluster mount log, when this happens, shows the following:

[2013-12-17 04:52:50.366825] W [fuse-bridge.c:1233:fuse_unlink_cbk] 0-glusterfs-fuse: 370: UNLINK() /debian/dists/wheezy-updates/main/i18n/Translation-en.bz2 => -1 (No such file or directory)

Looking more closely at the file, I noticed that it has hard links:

# ls -lah
total 2.5K
drwxr-xr-x. 2 root root   37 Dec 15 11:09 .
drwxr-xr-x. 5 root root   98 Nov  5 03:01 ..
-rw-r--r--. 2 root root 1.9K Nov  7 12:04 Translation-en.bz2

OK, let's find the links:

# ls -i Translation-en.bz2
10907466576973813014 Translation-en.bz2
# find /mnt/gluster/gv0/debian -inum 10907466576973813014
/mnt/gluster/gv0/debian/dists/wheezy-updates/main/i18n/Translation-en.bz2
/mnt/gluster/gv0/debian/.temp/dists/wheezy-updates/main/i18n/Translation-en.bz2

OK, can I delete the other file?

# rm /mnt/gluster/gv0/debian/.temp/dists/wheezy-updates/main/i18n/Translation-en.bz2
rm: remove regular file `/mnt/gluster/gv0/debian/.temp/dists/wheezy-updates/main/i18n/Translation-en.bz2'? y

Sure can.
Now the file in question is only showing 1 hard link, so it is acknowledging that I deleted the other file:

# ls -lah
total 2.5K
drwxr-xr-x. 2 root root   37 Dec 15 11:09 .
drwxr-xr-x. 5 root root   98 Nov  5 03:01 ..
-rw-r--r--. 1 root root 1.9K Nov  7 12:04 Translation-en.bz2

So now can I delete the troublesome file?

# rm Translation-en.bz2
rm: remove regular file `Translation-en.bz2'? y
rm: cannot remove `Translation-en.bz2': No such file or directory

Nope, still cannot (and I'm getting the same error in the gluster mount log).

Can I rename the file? Nope. And I get this in the mount logs:

[2013-12-17 04:57:19.210358] W [dht-rename.c:334:dht_rename_unlink_cbk] 0-gv0-dht: /debian/dists/wheezy-updates/main/i18n/Translation-en.bz2: unlink on gv0-replicate-0 failed (No such file or directory)

OK, can I overwrite the file?

# echo "Hello" > Translation-en.bz2

It appeared to let me. Can I rename it now?

# mv Translation-en.bz2 t.txt
# ls -lah
total 1.0K
drwxr-xr-x. 2 root root 24 Dec 16 20:57 .
drwxr-xr-x. 5 root root 98 Nov  5 03:01 ..
-rw-r--r--. 1 root root  6 Dec 16 20:57 t.txt

It appeared to let me do that too. (Note that the file size is smaller too, as it should be.) Now let's get rid of the file:

# mv t.txt /tmp/
# ls -a
.  ..

Beautiful. I accessed the same gluster volume from a different client to make sure the file really is gone, and that this client wasn't lying to me, and I was able to confirm that the file was really gone from the volume.

OK, now let's resync with the debmirror script.
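The inode-based link hunt above generalizes. Here is the same technique condensed into a self-contained demo on a scratch directory (the paths are made up for the demo), for anyone who wants to check whether their stuck files also have a twin under debmirror's .temp tree:

```shell
#!/bin/sh
# Demo of finding hard links by inode number, as done in the comments
# above: ls -i prints the inode, and find -inum lists every path that
# shares it. The directory layout mimics the mirror's dists/.temp split.
DIR=$(mktemp -d)
mkdir -p "$DIR/dists/i18n" "$DIR/.temp/dists/i18n"
echo data > "$DIR/dists/i18n/Translation-en.bz2"
ln "$DIR/dists/i18n/Translation-en.bz2" "$DIR/.temp/dists/i18n/Translation-en.bz2"

# Grab the inode, then count every path in the tree that shares it.
inum=$(ls -i "$DIR/dists/i18n/Translation-en.bz2" | awk '{print $1}')
links=$(find "$DIR" -inum "$inum" | wc -l)
echo "paths sharing inode $inum: $links"

rm -rf "$DIR"
```

On a healthy tree the count should match the link count shown by ls -l; a mismatch between the two is exactly the sort of inconsistency seen on the affected volume.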
Before the debmirror script completed, this popped up in the gluster mount log:

[2013-12-17 06:37:29.387250] E [client-rpc-fops.c:5179:client3_3_inodelk] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/replicate.so(afr_lock_rec+0x70) [0x7f6304f407b0] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/replicate.so(afr_nonblocking_inodelk+0x652) [0x7f6304f57b22] (-->/usr/lib64/glusterfs/3.4.1/xlator/protocol/client.so(client_inodelk+0x99) [0x7f630518e909]))) 0-: Assertion failed: 0
[2013-12-17 06:37:29.387334] E [client-rpc-fops.c:5179:client3_3_inodelk] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/replicate.so(afr_lock_rec+0x70) [0x7f6304f407b0] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/replicate.so(afr_nonblocking_inodelk+0x652) [0x7f6304f57b22] (-->/usr/lib64/glusterfs/3.4.1/xlator/protocol/client.so(client_inodelk+0x99) [0x7f630518e909]))) 0-: Assertion failed: 0
[2013-12-17 06:37:29.387405] E [client-rpc-fops.c:5179:client3_3_inodelk] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/replicate.so(afr_blocking_lock+0x74) [0x7f6304f590d4] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/replicate.so(afr_lock_blocking+0x844) [0x7f6304f58df4] (-->/usr/lib64/glusterfs/3.4.1/xlator/protocol/client.so(client_inodelk+0x99) [0x7f630518e909]))) 0-: Assertion failed: 0
[2013-12-17 06:37:29.387471] E [client-rpc-fops.c:5179:client3_3_inodelk] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/replicate.so(+0x451c8) [0x7f6304f591c8] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/replicate.so(afr_lock_blocking+0x844) [0x7f6304f58df4] (-->/usr/lib64/glusterfs/3.4.1/xlator/protocol/client.so(client_inodelk+0x99) [0x7f630518e909]))) 0-: Assertion failed: 0
[2013-12-17 06:37:29.387486] I [afr-lk-common.c:1075:afr_lock_blocking] 0-gv0-replicate-1: unable to lock on even one child
[2013-12-17 06:37:29.387580] I [afr-transaction.c:1063:afr_post_blocking_inodelk_cbk] 0-gv0-replicate-1: Blocking inodelks failed.
[2013-12-17 06:37:29.387616] E [dht-linkfile.c:213:dht_linkfile_setattr_cbk] 0-gv0-dht: setattr of uid/gid on /debian/dists/wheezy/contrib/i18n/Translation-en.bz2 :<gfid:00000000-0000-0000-0000-000000000000> failed (Invalid argument)
[2013-12-17 06:37:30.636628] E [client-rpc-fops.c:5179:client3_3_inodelk] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/replicate.so(afr_lock_rec+0x70) [0x7f6304f407b0] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/replicate.so(afr_nonblocking_inodelk+0x652) [0x7f6304f57b22] (-->/usr/lib64/glusterfs/3.4.1/xlator/protocol/client.so(client_inodelk+0x99) [0x7f630518e909]))) 0-: Assertion failed: 0
[2013-12-17 06:37:30.636717] E [client-rpc-fops.c:5179:client3_3_inodelk] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/replicate.so(afr_lock_rec+0x70) [0x7f6304f407b0] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/replicate.so(afr_nonblocking_inodelk+0x652) [0x7f6304f57b22] (-->/usr/lib64/glusterfs/3.4.1/xlator/protocol/client.so(client_inodelk+0x99) [0x7f630518e909]))) 0-: Assertion failed: 0
[2013-12-17 06:37:30.636789] E [client-rpc-fops.c:5179:client3_3_inodelk] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/replicate.so(afr_blocking_lock+0x74) [0x7f6304f590d4] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/replicate.so(afr_lock_blocking+0x844) [0x7f6304f58df4] (-->/usr/lib64/glusterfs/3.4.1/xlator/protocol/client.so(client_inodelk+0x99) [0x7f630518e909]))) 0-: Assertion failed: 0
[2013-12-17 06:37:30.636856] E [client-rpc-fops.c:5179:client3_3_inodelk] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/replicate.so(+0x451c8) [0x7f6304f591c8] (-->/usr/lib64/glusterfs/3.4.1/xlator/cluster/replicate.so(afr_lock_blocking+0x844) [0x7f6304f58df4] (-->/usr/lib64/glusterfs/3.4.1/xlator/protocol/client.so(client_inodelk+0x99) [0x7f630518e909]))) 0-: Assertion failed: 0
[2013-12-17 06:37:30.636871] I [afr-lk-common.c:1075:afr_lock_blocking] 0-gv0-replicate-0: unable to lock on even one child
[2013-12-17 06:37:30.636962] I [afr-transaction.c:1063:afr_post_blocking_inodelk_cbk] 0-gv0-replicate-0: Blocking inodelks failed.
[2013-12-17 06:37:30.637018] E [dht-linkfile.c:213:dht_linkfile_setattr_cbk] 0-gv0-dht: setattr of uid/gid on /debian/dists/wheezy-updates/non-free/i18n/Translation-en.bz2 :<gfid:00000000-0000-0000-0000-000000000000> failed (Invalid argument)

And now the problem is back to square one with the same file in question. Hope that sheds light on things?

P.S. I realize the timestamps on some of the files above are before the date of this post. I did this investigation a couple of days ago, and had my notes/results in a text file waiting for a convenient time to update this bug report.
Chad,

The logs suggest that some operations are coming in on files/directories without any gfid (a gfid is similar to an inode number, but it is unique across the gluster volume per file/dir). Seems like you have a consistent reproducer.

Let's say I want to re-create this issue on one of my VMs with a Debian Wheezy guest OS. What is the procedure? I am a complete Debian newbie; give me the exact commands that need to be executed on the new VM to hit this.

Pranith
Pranith,

To be clear, the Gluster hosts are Fedora and RHEL. The Gluster cluster (4x2 distribute replicate) is RHEL 6, and the Gluster client that is accessing the volume is Fedora 18. The role of this Fedora 18 box is an internal webserver, and we're slowly migrating its storage backend to Gluster, one volume at a time. The Debian mirror was the first thing we migrated (we're mostly Fedora, RHEL, and CentOS, but do have a small number of Debian boxes too).

In my experience, Debian is best mirrored with the debmirror program, so from Fedora, just:

yum install debmirror

I would move the rpm-provided /etc/debmirror.conf out of the way, as I recall variables in there conflicting with what I was specifying on the command line. (I seem to recall that you'll also have to install the Debian archive GPG key, as debmirror barks about that at first.) And then run your debmirror command as I described in comment #8.

Set up Apache on your Fedora box so that it serves up your mirror. For me it is really simple:

DocumentRoot /var/www/mirror
<Directory /var/www/mirror>
    Options +Indexes -MultiViews
    AllowOverride all
    Order allow,deny
    allow from all
</Directory>

(the Gluster-backed Debian mirror is mounted under /var/www/mirror/debian)

On your Debian VM, edit /etc/apt/sources.list to use your mirror:

deb http://<your.local.mirror>/debian/ wheezy main contrib non-free
deb-src http://<your.local.mirror>/debian/ wheezy main contrib non-free
deb http://<your.local.mirror>/debian/ wheezy-updates main contrib non-free
deb-src http://<your.local.mirror>/debian/ wheezy-updates main contrib non-free

# leave security.debian.org pointed at them,
# as the debmirror script above isn't mirroring this
deb http://security.debian.org/ wheezy/updates main contrib non-free
deb-src http://security.debian.org/ wheezy/updates main contrib non-free

At this point, on your Debian box, from the command line (as root), you can run apt as follows:

apt-get update && apt-get upgrade

After you do this, you should be able to replicate what I'm seeing.

-Chad
Created attachment 843555 [details] Fedora_mirror_error_logs

Since filing this bug report, I've also created a Fedora mirror, backed by the same Gluster cluster. Interestingly enough, I'm seeing similar error messages in the Gluster mount log (attached) at sync time. Yet unlike with the errors in the Debian repo, these errors haven't triggered any noticeable filesystem corruption. That is, I haven't run into any issues with my Fedora clients using the mirror (at least yet).
Hey Chad,

For data to be replicated, it needs to be written through the gluster mount point. I could be wrong, but from the example it seems like data is being added directly on the brick backend and not through the gluster mount point. Again, I could be wrong; please confirm.

Pranith
Pranith,

The 4 RHEL 6 boxes are the bricks. Each box provides 1 brick (which itself is 12 disks behind hardware RAID 6). The Fedora 18 box is the mount point (the gluster client). So the logs I've provided you are from the Fedora 18 box.

The client mounts on boot via fstab. The relevant lines from my fstab are as follows:

gluster0.my.local.domain:gv0 /mnt/gluster/gv0 glusterfs defaults,_netdev,fetch-attempts=3 0 0
/mnt/gluster/gv0/debian /var/www/mirror/debian bind defaults,bind 0 0

The Fedora 18 box is where I am running debmirror from. It is where I am running Apache from, serving up the Debian repo (among other things).

-Chad
Updating version to 3.4.2. Upgraded servers and clients to 3.4.2 this morning, and re-ran the mirroring process, etc. The problem remains.
Chad,

Will it be possible for you to come to the gluster IRC channel? I am from India, so my timezone is GMT+5:30. I will be available from 11:00AM till 7:00PM, Monday through Friday. We can discuss this and come to some conclusions about the problem.

Pranith
Pranith, Sure, what is your handle? -Chad
It's pk, and I am online at the moment.
Let me know when you'll be on #gluster. I was online from 11:00am to 1:00pm (your time), and again from 6:30pm to 7:00pm (your time) - I actually still am online.
Oh, and my handle is cfeller
After the IRC conversation we found one root cause, based on the logs in https://bugzilla.redhat.com/show_bug.cgi?id=1028582#c11, to be bug 971805. The fix is backported to release-3.4 at http://review.gluster.com/6691. Vijay will be providing the build, and Chad Feller agreed to help in verifying this issue.

Pranith
Hi Pranith, Vijay,

Just checking in on this. Wondering if an official 3.4.3 build with that fix is coming soon, or if you are going to be providing a test build in advance of that.

Let me know. Thanks,

-Chad
(In reply to Chad Feller from comment #24) > Hi Pranith, Vijay, > > Just checking in this. Wondering if an official 3.4.3 build with that fix > was coming soon, or if you were going to be providing a test build in > advance of that. > > Let me know. Thanks, > This nightly build includes the fix: http://download.gluster.org/pub/gluster/glusterfs/nightly/glusterfs-3.4/epel-6-x86_64/glusterfs-3.4.20140129.8eda793-1.autobuild/ Please let us know how your testing goes. Thanks, Vijay
The nightly build referenced in comment #25 failed to fix the issue.

I tested the build on my test gluster cluster, and had my test webserver mount the gluster volume. First I mirrored Debian on 3.4.2, like above, to ensure that the problem still existed on the test cluster, which it did. I then destroyed the mirror and upgraded the gluster servers to 3.4.20140129.8eda793-1:

glusterfs-libs-3.4.20140129.8eda793-1.el6.x86_64
glusterfs-server-3.4.20140129.8eda793-1.el6.x86_64
glusterfs-3.4.20140129.8eda793-1.el6.x86_64
glusterfs-cli-3.4.20140129.8eda793-1.el6.x86_64
glusterfs-fuse-3.4.20140129.8eda793-1.el6.x86_64

I upgraded the gluster client as well to 3.4.20140129.8eda793-1:

glusterfs-3.4.20140129.8eda793-1.fc20.x86_64
glusterfs-fuse-3.4.20140129.8eda793-1.fc20.x86_64
glusterfs-libs-3.4.20140129.8eda793-1.fc20.x86_64

I rebooted the servers and the client. After everything was back up, I re-mirrored the Debian repo. The mirroring process completed with zero errors. I then pointed one of my Debian clients at the gluster test setup and ran an "apt-get update". At that point I got several errors:

... (successful fetches in here) ...
Err http://<our.local.test.mirror> wheezy/non-free amd64 Packages
  404 Not Found
Err http://<our.local.test.mirror> wheezy-updates/main Sources
  404 Not Found
Err http://<our.local.test.mirror> wheezy-updates/contrib Sources
  404 Not Found
Get:20 http://security.debian.org wheezy/updates/non-free Translation-en [14 B]
Err http://<our.local.test.mirror> wheezy-updates/main amd64 Packages
  404 Not Found
Err http://<our.local.test.mirror> wheezy-updates/main Translation-en
  404 Not Found
... (more successful fetches in here) ...
Fetched 6,628 kB in 3s (1,766 kB/s)
W: Failed to fetch http://<our.local.test.mirror>/debian/dists/wheezy/non-free/binary-amd64/Packages  404 Not Found
W: Failed to fetch http://<our.local.test.mirror>/debian/dists/wheezy-updates/main/source/Sources  404 Not Found
W: Failed to fetch http://<our.local.test.mirror>/debian/dists/wheezy-updates/contrib/source/Sources  404 Not Found
W: Failed to fetch http://<our.local.test.mirror>/debian/dists/wheezy-updates/main/binary-amd64/Packages  404 Not Found
W: Failed to fetch http://<our.local.test.mirror>/debian/dists/wheezy-updates/main/i18n/Translation-en  404 Not Found
E: Some index files failed to download. They have been ignored, or old ones used instead.

These errors coincide with exactly what I saw before...

At this point, a ton of errors appeared in the gluster mount log. I'm attaching the entire gluster mount log, if that helps. I mounted and ran the mirroring process (debmirror) yesterday (1-31), which is where you see zero errors. I ran the apt-get today (2-1), which is where all of the errors pop up.

Also, SELinux is set to 'permissive' on both the servers and the client; 'getenforce' on all boxes confirms this.
Created attachment 858117 [details] gluster mount log from 3.4.20140129.8eda793-1, containing errors during "apt-get update" the corresponding log file for comment #26
This problem remains in the glusterfs-3.4.3-0.1.alpha1 builds.

For the record, all of the files in question in this bug report, that gluster is having problems with, have hard links. Did something significant change with regard to how files with hard links are handled between 3.3.2 and 3.4.x?
OK, this is really interesting: (I never considered using this before, as NFS had some serious problems in 3.3.x. And 3.3.2 was working fine for me with the native FUSE client, so I had no reason to try anything else. Anyway...) I decided to see if I could reproduce this problem using the native Gluster NFS. I can't! So the problem appears to be somewhere in the native FUSE client stack. If I use the NFS client, the issue is not present. ??? I don't get it, but regardless, as a workaround I can use NFS for that particular subdirectory, on that particular server, and the FUSE client everywhere else.
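For anyone else wanting to test the same workaround: Gluster's built-in NFS server speaks NFSv3 only, so the mount needs vers=3 forced. A hypothetical fstab line, reusing the hostname from my earlier comments (the mount point path here is my invention; adjust to taste), would look something like:

```
gluster0.my.local.domain:/gv0 /mnt/gluster-nfs/gv0 nfs defaults,_netdev,vers=3 0 0
```

The bind mount for Apache would then point at the NFS mount instead of the FUSE one.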
Following up from comment #29, I wanted to make a clarification based on further observations: using NFS with 3.4.3 doesn't completely resolve the issue. It merely mitigates some of the (numerous) symptoms. Upon rebooting the cluster (the bricks), the files do appear to vanish again, but re-running "debmirror" on the NFS share transparently causes a (temporary) self heal, which appears to last until the next time the cluster (bricks) are rebooted.
I updated my test GlusterFS cluster to 3.5.0 (the production setup is still running 3.4.3). On my test cluster, I am no longer able to reproduce this bug!!!

I didn't even have to re-mirror (via debmirror) the data on the NFS mount or over the FUSE mount. Both immediately worked after upgrading to 3.5.0, interestingly enough (and I rebooted the client and the servers (the bricks) after upgrading to 3.5.0).

I'm running further tests right now, but this bug may have vanished as strangely as it first appeared (recall that it wasn't present in 3.3.2).

servers:
glusterfs-3.5.0-2.el6.x86_64
glusterfs-libs-3.5.0-2.el6.x86_64
glusterfs-fuse-3.5.0-2.el6.x86_64
glusterfs-server-3.5.0-2.el6.x86_64
glusterfs-cli-3.5.0-2.el6.x86_64

client:
glusterfs-fuse-3.5.0-3.fc20.x86_64
glusterfs-libs-3.5.0-3.fc20.x86_64
glusterfs-3.5.0-3.fc20.x86_64
GlusterFS 3.7.0 has been released (http://www.gluster.org/pipermail/gluster-users/2015-May/021901.html), and the Gluster project maintains N-2 supported releases. The last two releases before 3.7 are still maintained; at the moment these are 3.6 and 3.5.

This bug has been filed against the 3.4 release, and will not get fixed in a 3.4 version any more. Please verify if newer versions are affected by the reported problem. If that is the case, update the bug with a note, and update the version if you can. In case updating the version is not possible, leave a comment in this bug report with the version you tested, and set the "Need additional information the selected bugs from" below the comment box to "bugs".

If there is no response by the end of the month, this bug will get automatically closed.
GlusterFS 3.4.x has reached end-of-life. If this bug still exists in a later release please reopen this and change the version or open a new bug.