| Summary: | parallel deletion of files mounted by different clients on the same back-end hangs and/or does not completely delete | | |
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Pavan Vilas Sondur <pavan> |
| Component: | protocol | Assignee: | Amar Tumballi <amarts> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | Version: | 2.0.2 |
| CC: | amarts, gluster-bugs, john, krishna, vijay, vikas, vinayak, vraman | | |
| Target Milestone: | --- | Target Release: | --- |
| Hardware: | All | OS: | Linux |
| Whiteboard: | | Fixed In Version: | |
| Doc Type: | Bug Fix | Doc Text: | |
| Story Points: | --- | Clone Of: | |
| Environment: | | Last Closed: | |
| Type: | --- | Regression: | RTNR |
| Mount Type: | --- | Documentation: | --- |
| CRM: | | Verified Versions: | |
| Category: | --- | oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | | Cloudforms Team: | --- |
| Target Upstream Version: | | Attachments: | |
Note this log message:

[2009-07-07 03:21:46] E [afr-self-heal-common.c:1088:sh_missing_entries_create] afr1: no source found for /lib/gconv/EUC-JP.so. all nodes down?. returning.

Avati and I debugged this issue a little further and noticed that when more than one client is deleting entries on the same backend, a client can miss issuing the final ENTRYLK unlock call, which means the blocked ENTRYLK lock request from another client never gets a chance to execute. This can be reproduced easily with more than 2 clients; we are still figuring out how to handle the issue.

Regards,

Patch from Amar: http://patches.gluster.com/patch/752/

PATCH: http://patches.gluster.com/patch/752 in master (server: don't check for 'loc->parent' in entrylk and inodelk calls)
PATCH: http://patches.gluster.com/patch/768 in release-2.0 (server: don't check for 'loc->parent' in entrylk and inodelk calls)

Patch 768 (git commit 215b62f1db24bc7ff55d224046e343401d78c38b) does *not* fix this for me.
I patched the 2.0.4 release with that commit and can still reproduce this.
Btw, this little Ruby one-liner creates a directory hierarchy (in the current working directory) suitable for testing (no files needed, just directories):
ruby -rfileutils -e 'def nd(i);Dir.chdir("..") and return if i>4;4.times{|n|FileUtils.mkdir_p(n.to_s);Dir.chdir(n.to_s);nd(i+1)};Dir.chdir("..");end;nd(1)'
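For readability, here is an equivalent multi-line version of that one-liner (a sketch only; the depth of 4 and the fan-out of 4 per level match the command above):

```ruby
require 'fileutils'

# Build a 4-level tree with 4 subdirectories per level, starting in the
# current working directory (same shape as the one-liner above).
def make_tree(depth)
  return if depth > 4
  4.times do |n|
    FileUtils.mkdir_p(n.to_s)
    Dir.chdir(n.to_s) do          # the block form restores the previous cwd
      make_tree(depth + 1)
    end
  end
end

make_tree(1)
```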
I'm using Capistrano to run the "rm -rf" command concurrently on 3 clients, which mount a mirrored (AFR) volume over 2 storage bricks.
One of the clients ends up with a hung "rm" process. This process is sleeping, but gdb hangs trying to attach to it. Looking at its open file descriptors, I can see that it has a directory open whose parent directory no longer exists (in this case, fd 3 for directory /srv/share/0/1/0).

The glusterfs client process has the same thread state as reported by Pavan (nanosleep, a bunch of pthread_cond_timedwait, a read and an epoll_wait), though the threads look like this in a normal, non-error state anyway, so I don't think it's relevant.

The gluster mount continues to work, apparently unaffected, but cannot be unmounted (device is busy) on the client with the hung "rm" process.
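For reference, a rough sketch of the kind of concurrent deletion driver described above (this is not the actual Capistrano setup; the client hostnames and mount path are assumptions):

```ruby
require 'open3'

# Hypothetical client hostnames and mount point; adjust to your environment.
CLIENTS = %w[client1 client2 client3]
MOUNT   = '/srv/share'

# Start "rm -rf" on all clients at roughly the same time and wait for them.
threads = CLIENTS.map do |host|
  Thread.new do
    output, status = Open3.capture2e('ssh', host, "rm -rf #{MOUNT}/*")
    puts "#{host}: exit=#{status.exitstatus}\n#{output}"
  end
end
threads.each(&:join)
```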
PATCH: http://patches.gluster.com/patch/827 in master (server-protocol: fix in server-loc-fill())
PATCH: http://patches.gluster.com/patch/828 in release-2.0 (server-protocol: fix in server-loc-fill())

With patches 827/828, behaviour is very erratic. The "rm" process still hangs, but much less often. The glusterfsd server processes now segfault randomly (but often) during the concurrent "rm -rf". Some of the directories randomly can't be removed at all, and I've also seen the client process jump to 100% CPU and completely hang the mount (a bug we saw a lot in the old 1.3.12 version).

I ran one of the gluster bricks under gdb and caught a segfault; backtraces attached. This is 2.0.4 patched with patches 828 and 768. I also have a segfault from a 100% CPU client process. I'll attach the gdb output after posting this ticket.

Created attachment 47 [details]
Kickstart config file
Created attachment 48 [details]
patch to fix elm Y2K bug
http://patches.gluster.com/patch/818/ fixes the crash described in comment #10

(In reply to comment #9)
> I ran one of the gluster bricks under gdb and caught a segfault; backtraces
> attached. This is 2.0.4 patched with patches 828 and 768.

Hi John,

This behaviour happened because only those two patches were applied on top of the 2.0.4 release. Both the 100% CPU and the server segfault issues are fixed now with http://ftp.gluster.com/pub/gluster/glusterfs/qa-releases/glusterfs-2.0.6rc1.tar.gz

Regards,
Amar

(In reply to comment #13)
> This behaviour happened because only those two patches were applied on top of
> the 2.0.4 release. Both the 100% CPU and the server segfault issues are fixed
> now with
> http://ftp.gluster.com/pub/gluster/glusterfs/qa-releases/glusterfs-2.0.6rc1.tar.gz

Argh, faulty testing on my part then, sorry. I'll retest ASAP. Thanks.

(In reply to comment #14)
> Argh, faulty testing on my part then, sorry. I'll retest ASAP. Thanks.

Oops, I just noticed that, due to some minor script errors, that release doesn't include the latest patch. Hold on for a minute; I will post the proper tarball link to you.

With 2.0.6-rc1 patched with patches 828 and 768 it is still very erratic. Sometimes the tree is deleted OK, but I'm still seeing "rm" process hangs, a glusterfs client process getting into a loop at 100% CPU, and glusterfsd crashes (the crashes and CPU loops happen very often). Actually, I'm not certain about this. Still testing.

This now looks to be fixed!! I wrote a little test script in Ruby to set up two bricks and three mounts, all on the local machine, and then create and remove directories in parallel. You can get it here if you want:

http://johnleach.co.uk/downloads/glusterfs/glusterfs-mirror-test-suite.tar.gz

I've confirmed that it reproduces the hangs and crashes, but with 2.0.6-rc1 patched with patch 828 it does not. I've also tested in my original multi-machine test environment in a loop for the last half hour without a single hang or crash. So, this looks to be fixed now! Thanks!

(In reply to comment #18)
> This now looks to be fixed!!
> You can get it here if you want:
>
> http://johnleach.co.uk/downloads/glusterfs/glusterfs-mirror-test-suite.tar.gz

Thanks for the link, we will use it in our QA process.

> So, this looks to be fixed now! Thanks!

Thanks for letting us know. We will wait another day for any hiccups with 2.0.6rc1 + patch 828, and call it 2.0.6 then.

Regards,
Amar

Testing longer term, this isn't quite right.
The mirror gets out of sync (some dirs end up on one brick but not the other), and making new directory hierarchies then continually errors with: "Device or resource busy - 0"

The client logs show stuff like:

[2009-07-29 22:26:07] E [afr-transaction.c:710:afr_changelog_pre_op_cbk] mirror: xattrop failed on child server1: No such file or directory
[2009-07-29 22:26:07] W [fuse-bridge.c:513:fuse_attr_cbk] glusterfs-fuse: 29038: STAT() /0/2/2/1 => -1 (No such file or directory)
[2009-07-29 22:26:07] W [fuse-bridge.c:513:fuse_attr_cbk] glusterfs-fuse: 29039: STAT() /0/2/2/1 => -1 (No such file or directory)
[2009-07-29 22:26:07] W [fuse-bridge.c:513:fuse_attr_cbk] glusterfs-fuse: 29040: STAT() /0/2/2/1 => -1 (No such file or directory)
[2009-07-29 22:26:07] W [fuse-bridge.c:513:fuse_attr_cbk] glusterfs-fuse: 29041: STAT() /0/2/2/1 => -1 (No such file or directory)
[2009-07-29 22:26:07] W [fuse-bridge.c:513:fuse_attr_cbk] glusterfs-fuse: 29043: STAT() /0/2/2/1 => -1 (No such file or directory)
[2009-07-29 22:26:07] W [fuse-bridge.c:639:fuse_fd_cbk] glusterfs-fuse: 29045: OPENDIR() /0/2/2 => -1 (No such file or directory)
[2009-07-29 22:26:11] W [fuse-bridge.c:513:fuse_attr_cbk] glusterfs-fuse: 29325: STAT() /0/0/0/2/2 => -1 (No such file or directory)
[2009-07-29 22:26:11] W [fuse-bridge.c:921:fuse_unlink_cbk] glusterfs-fuse: 29337: RMDIR() /0/0/0/2/2 => -1 (No such file or directory)
[2009-07-29 22:26:11] W [fuse-bridge.c:513:fuse_attr_cbk] glusterfs-fuse: 29390: STAT() /0/0/0/1/1 => -1 (No such file or directory)
[2009-07-29 22:26:12] W [fuse-bridge.c:921:fuse_unlink_cbk] glusterfs-fuse: 29400: RMDIR() /0/0/0/1/1 => -1 (No such file or directory)
[2009-07-29 22:26:12] E [afr-transaction.c:710:afr_changelog_pre_op_cbk] mirror: xattrop failed on child server0: No such file or directory
[2009-07-29 22:26:12] E [afr-transaction.c:710:afr_changelog_pre_op_cbk] mirror: xattrop failed on child server1: No such file or directory
[2009-07-29 22:26:15] C [inode.c:865:inode_path] inode: possible infinite loop detected, forcing break. name=((null))
[2009-07-29 22:26:15] W [fuse-bridge.c:575:fuse_getattr] glusterfs-fuse: 29512: GETATTR 1353384 (fuse_loc_fill() failed)
[2009-07-29 22:26:18] E [afr-transaction.c:710:afr_changelog_pre_op_cbk] mirror: xattrop failed on child server1: No such file or directory
[2009-07-29 22:26:18] E [afr-transaction.c:710:afr_changelog_pre_op_cbk] mirror: xattrop failed on child server1: No such file or directory
[2009-07-29 22:26:45] C [inode.c:865:inode_path] inode: possible infinite loop detected, forcing break. name=((null))
[2009-07-29 22:26:45] W [fuse-bridge.c:575:fuse_getattr] glusterfs-fuse: 29926: GETATTR 1353391 (fuse_loc_fill() failed)
[2009-07-29 22:26:46] C [inode.c:865:inode_path] inode: possible infinite loop detected, forcing break. name=((null))
[2009-07-29 22:26:46] W [fuse-bridge.c:1655:fuse_opendir] glusterfs-fuse: 29939: OPENDIR (null) (fuse_loc_fill() failed)
[2009-07-29 22:26:56] W [glusterfsd.c:842:cleanup_and_exit] glusterfs: shutting down

I've modified my test suite to loop making and removing concurrently, and this situation usually happens within a minute or two on my dual-core, single-disk laptop. The new version is:

http://johnleach.co.uk/downloads/glusterfs/glusterfs-mirror-test-suite-2.tar.gz

(the old version has been moved to glusterfs-mirror-test-suite-1.tar.gz, btw)

Hi, just a note to say this isn't fixed in 2.0.6 rc3.

Hi John,

I will get back with an update by tomorrow.
Krishna

(In reply to comment #21)
> Hi, just a note to say this isn't fixed in 2.0.6 rc3.

Hi John,

We have tracked down the "device or resource busy" error to FUSE; we will get more details. But we have not been able to reproduce replicate's subvolumes becoming inconsistent with each other (i.e. the directory structure). Is that also easily reproducible for you?

Thanks,
Krishna

(In reply to comment #22)
> Hi John,
> I will get back with an update by tomorrow.
> Krishna
>
> (In reply to comment #21)
> > Hi, just a note to say this isn't fixed in 2.0.6 rc3.

Hi Krishna,

Yes, the inconsistent subvolumes are reproducible with my script. I've modified it slightly to choose a random mount to do the create on each iteration (before, it only used client0).

http://johnleach.co.uk/downloads/glusterfs/glusterfs-mirror-test-suite-3.tar.gz

If you run this for about 20 minutes or so, it eventually stops being able to create the directories properly (either it can't create any, or there are some that it cannot create/remove). At this point, if you look at the two export directories, you can see two different trees of directories that are never removed.

Thanks,

John.

Hi John,

We have found the cause of the inconsistent backend bug. We will provide the fix by 9th Aug.

Thanks,
Krishna

(In reply to comment #24)
> Yes, the inconsistent subvolumes are reproducible with my script. I've
> modified it slightly to choose a random mount to do the create on each
> iteration (before, it only used client0).
>
> http://johnleach.co.uk/downloads/glusterfs/glusterfs-mirror-test-suite-3.tar.gz
>
> If you run this for about 20 minutes or so, it eventually stops being able to
> create the directories properly (either it can't create any, or there are some
> that it cannot create/remove). At this point, if you look at the two export
> directories, you can see two different trees of directories that are never
> removed.

I think I've found the cause. In any directory-write operation we hold an entry lock on parent/basename. Now consider two operations that are issued in parallel:

1) C on the path /A/b, where C is one of create, mkdir, link, or rename
2) rmdir on /A

(1) holds an entry lock on /A/b, and (2) holds a lock on /A, so (1) and (2) are not serialized against each other. Thus, the order in which (1) and (2) execute can differ between subvolumes:

1st subvol: C --> rmdir
2nd subvol: rmdir --> C

On the 1st subvolume, rmdir fails (with ENOTEMPTY) because C has already created an entry under /A. On the 2nd subvolume, C fails because rmdir has already removed the parent directory. Thus, subvolumes 1 and 2 are no longer in sync. (A small simulation of this interleaving is sketched below, after this exchange.)

The fix, I think, is for both C and rmdir to hold locks on *both* /A and /A/b.

Excellent news! I should be able to test it thoroughly as soon as a fix is available, thanks guys.

(In reply to comment #26)
> The fix, I think, is for both C and rmdir to hold locks on *both* /A and /A/b.

(In reply to comment #27)
> Excellent news! I should be able to test it thoroughly as soon as a fix is
> available, thanks guys.

Hi John,

Excuse us for the delayed response. The fix involves key changes to the replicate module and is hence taking more time than we expected; the fix will get into the 2.0.7 release, which is due in another 10 days. Will that work for you?

Thanks,
Krishna

(In reply to comment #28)
> the fix will get into the 2.0.7 release, which is due in another 10 days. Will
> that work for you?

Yes, thanks for the update Krishna.
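To make the interleaving described above concrete, here is a small toy simulation (illustrative Ruby only, not GlusterFS code): two replicas apply the same mkdir and rmdir in opposite orders and end up with different directory trees.

```ruby
require 'set'

# Toy model of one replica (subvolume): just a set of directory paths.
class Subvol
  def initialize
    @dirs = Set.new(['/A'])             # /A exists on both replicas initially
  end

  def mkdir(path)
    parent = File.dirname(path)
    return :enoent unless @dirs.include?(parent)  # parent already removed
    @dirs.add(path)
    :ok
  end

  def rmdir(path)
    # rmdir fails if any entry still exists under the directory
    return :enotempty if @dirs.any? { |d| d.start_with?("#{path}/") }
    @dirs.delete(path)
    :ok
  end

  def dirs
    @dirs.to_a.sort
  end
end

s1 = Subvol.new
s2 = Subvol.new

# Without a common lock the two operations can interleave differently:
# subvolume 1 sees the create first, subvolume 2 sees the rmdir first.
puts "s1: mkdir /A/b => #{s1.mkdir('/A/b')}, rmdir /A => #{s1.rmdir('/A')}"
puts "s2: rmdir /A => #{s2.rmdir('/A')}, mkdir /A/b => #{s2.mkdir('/A/b')}"

puts "s1 ends up with: #{s1.dirs.inspect}"  # ["/A", "/A/b"]  (rmdir failed: ENOTEMPTY)
puts "s2 ends up with: #{s2.dirs.inspect}"  # []              (mkdir failed: ENOENT)
```

Holding the entry locks on both /A and /A/b for both operations, as proposed above, forces a single total order and keeps the replicas consistent.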
PATCH: http://patches.gluster.com/patch/1849 in release-2.0 (fail revalidate properly)
PATCH: http://patches.gluster.com/patch/1862 in master (cluster/afr: Hold second lock after first lock has been granted for rename transactions.)
PATCH: http://patches.gluster.com/patch/1863 in release-2.0 (cluster/afr: Hold second lock after first lock has been granted for rename transactions.)
PATCH: http://patches.gluster.com/patch/1873 in master (afr transaction prevent spurious unlocks)
PATCH: http://patches.gluster.com/patch/1876 in master (prevent spurious unlocks from afr selfheal)
PATCH: http://patches.gluster.com/patch/1874 in release-2.0 (afr transaction prevent spurious unlocks)
PATCH: http://patches.gluster.com/patch/1878 in release-2.0 (prevent spurious unlocks from afr selfheal)
PATCH: http://patches.gluster.com/patch/1881 in master (afr transaction: fix op_ret check during locking)
PATCH: http://patches.gluster.com/patch/1880 in release-2.0 (afr transaction: fix op_ret check during locking)
PATCH: http://patches.gluster.com/patch/1923 in release-2.0 (Changes in posix unlink and rmdir to prevent inode num re-use immediately.)
PATCH: http://patches.gluster.com/patch/1924 in release-2.0 (cluster/afr: Hold lock on all names under "victim" in rmdir)
PATCH: http://patches.gluster.com/patch/1927 in release-2.0 (cluster/afr: Unlock only those paths which have been locked during rename.)
PATCH: http://patches.gluster.com/patch/1921 in release-2.0 (locks: keep ref on the inode while locks are held)

Hi, thanks for the new patches. I tested all the release-2.0 patches up to patch 1880 (against version 2.0.7) on Wednesday and noticed no improvement. That's 3 nodes: one creating directories and two removing. Within a minute, at least one of the nodes hangs.

I just tested patches 1923 and 1924 (on top of the previous patches) and still see no improvement. In fact, it's possibly worse: now all three nodes seem to hang together.

Thanks,

John.

Patches 1927 and 1921 seem to improve the situation a bit; it *seems* to take a bit longer before a mount gets hung. In one case I just ended up with a directory which gives me an error if I ls in it:

root@bbtest-002:/srv/share/t2# cd 0/0/
root@bbtest-002:/srv/share/t2/0/0# ls
ls: cannot open directory .: No such file or directory

I get a "possible infinite loop" message in the client logs when I ls in that dir. I can't delete the dir either.

root@bbtest-002:/srv/share/t2/0/0# tail /var/log/glusterfs/srv-share-.log
[2009-10-16 17:32:31] C [inode.c:865:inode_path] inode: possible infinite loop detected, forcing break. name=((null))
[2009-10-16 17:32:31] W [fuse-bridge.c:575:fuse_getattr] glusterfs-fuse: 50802: GETATTR 925084 (fuse_loc_fill() failed)

Remounting on that client fixed it. In another case, I end up with a dir that hangs any process (in D state) that tries to stat it; there are no associated events in the log in this case. Killing the gluster client process and remounting fixes this.

PATCH: http://patches.gluster.com/patch/2050 in release-2.0 (change open flag to O_RDWR as ftruncate fails for a read only fd)

With the inode generation number feature in the 3.0.0 branch, this bug is solved. We are not going to fix it in the release-2.0 branch, as that would need significant changes. Closing the bug, as it is fixed in the 3.0.0 branch.

PATCH: http://patches.gluster.com/patch/1901 in master (cluster/afr: Unlock only those paths which have been locked during rename.)
PATCH: http://patches.gluster.com/patch/2644 in release-2.0 (storage/posix: prevent double close of fds resulting in EBADF errors.)
The configuration uses just AFR: 2 servers and 3 clients. All clients mount the servers' export directories, and server1 is AFR'ed to server2 in each client (see the volfile sketch at the end of this report).

How I hit this bug: I copied /usr onto one client's mount point, and "rm -rf" was issued simultaneously on each client on their respective mount points. It resulted in:

Client1:

rm: cannot remove directory `lib64/xulrunner-1.9pre/res/fonts': No such file or directory
rm: cannot lstat `.' in `lib64/xulrunner-1.9pre': No such file or directory
[root@brick6 mount]# ls
include lib lib64 libexec local sbin share src tmp X11R6

It exited before deleting all the files.

Client2:

rm: cannot remove directory `lib/locale/fr_LU.utf8/LC_MESSAGES': No such file or directory
[root@client05 mount]# ls
include lib

Same case, but with fewer files left.

Client3: it is hung, waiting on:

(gdb) info threads
11 Thread 1115810112 (LWP 15794) 0x000000333da983b1 in nanosleep () from /lib64/libc.so.6
10 Thread 1116862784 (LWP 15795) 0x000000333e20a6a7 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
9 Thread 1117915456 (LWP 15796) 0x000000333e20a6a7 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
8 Thread 1118968128 (LWP 15797) 0x000000333e20a6a7 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
7 Thread 1120020800 (LWP 15798) 0x000000333e20a6a7 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
6 Thread 1121073472 (LWP 15799) 0x000000333e20a6a7 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
5 Thread 1122126144 (LWP 15800) 0x000000333e20a6a7 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
4 Thread 1123178816 (LWP 15801) 0x000000333e20a6a7 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
3 Thread 1124231488 (LWP 15802) 0x000000333e20a6a7 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
2 Thread 1102027072 (LWP 15806) 0x000000333e20cc9b in read () from /lib64/libpthread.so.0
1 Thread 47001897894240 (LWP 15793) 0x000000333dad21d8 in epoll_wait () from /lib64/libc.so.6

(gdb) t 2
[Switching to thread 2 (Thread 1102027072 (LWP 15806))]#0 0x000000333e20cc9b in read () from /lib64/libpthread.so.0
(gdb) bt
#0 0x000000333e20cc9b in read () from /lib64/libpthread.so.0
#1 0x00002abf7c9f4c9a in fuse_kern_chan_receive (chp=<value optimized out>, buf=0x2abf7beb6000 "0", size=135168) at /usr/include/bits/unistd.h:35
#2 0x00002abf7c9f8650 in fuse_chan_receive (ch=0x12d02ce0, buf=0x2abf7beb6000 "0", size=135168) at fuse_session.c:191
#3 0x00002abf7c7cf1b8 in fuse_thread_proc (data=0x12cf6b70) at ../../../../../xlators/mount/fuse/src/fuse-bridge.c:2457
#4 0x000000333e206307 in start_thread () from /lib64/libpthread.so.0
#5 0x000000333dad1ded in clone () from /lib64/libc.so.6

All relevant volfiles and logfiles are in /share/tickets/<Bug No>
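For context, a minimal sketch of what a client volfile for the layout described above might look like (legacy 2.0-style volfile syntax; the hostnames, volume names, and exported subvolume name are assumptions, not taken from the actual configuration referenced above):

```
# Hypothetical client volfile: two protocol/client volumes replicated with AFR.
volume remote1
  type protocol/client
  option transport-type tcp
  option remote-host server1          # assumed hostname of the first server
  option remote-subvolume brick       # assumed name of the exported volume
end-volume

volume remote2
  type protocol/client
  option transport-type tcp
  option remote-host server2          # assumed hostname of the second server
  option remote-subvolume brick
end-volume

volume mirror
  type cluster/replicate              # AFR: server1 mirrored with server2
  subvolumes remote1 remote2
end-volume
```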