It's not clearly a problem with unlink; rename is even more notoriously painful to implement, and there have even been issues with it recently. In any case, it wouldn't hurt to get state dumps by sending SIGUSR1 to the glusterfsd processes, or possibly even to attach to them with gdb and do "thread apply all bt full".

BTW, why are you mounting with nobarrier? That's generally a terrible idea with most kinds of storage, and it's rare to see anything like GlusterFS deployed on the exceptions.
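If you go the gdb route, something along these lines should capture what we need (a rough sketch to run on each brick server; the output file name is just a suggestion):

for pid in $(pidof glusterfsd); do
    # grab full backtraces of every thread in each brick process
    gdb -p "$pid" -batch -ex "thread apply all bt full" > /tmp/glusterfsd-bt.$pid.txt
done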
We'll work on getting you stack traces and more info, but if you try this scenario you will see the crash within a few minutes.

As for the XFS barriers, we have large battery-backed RAID controllers. Directly from the XFS FAQ at http://xfs.org/index.php/XFS_FAQ#Q._Should_barriers_be_enabled_with_storage_which_has_a_persistent_write_cache.3F :

Many hardware RAID have a persistent write cache which preserves it across power failure, interface resets, system crashes, etc. Using write barriers in this instance is not recommended and will in fact lower performance. Therefore, it is recommended to turn off the barrier support and mount the filesystem with "nobarrier". But take care about the hard disk write cache, which should be off.

We disable barriers on XFS, enable the write cache on the controller, and disable the write caches on the drives. Honestly, I'm surprised you don't see this more, especially considering that GlusterFS is meant to be run in large environments. Each of our bricks holds 80TB of usable space and can sustain ~8Gb/s of write traffic. These settings are all very important.
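In case it helps, this is roughly what we do on each brick before mounting (a sketch; the device name is a placeholder, and SAS drives behind a RAID controller may need the controller's own utility or sdparm instead of hdparm):

# turn off the on-disk write cache on each member drive (SATA example)
hdparm -W 0 /dev/sdX
# the controller's battery-backed write cache stays enabled via the vendor tool,
# and the filesystem is then mounted with nobarrier as shown in the reproduction steps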
- Create a replicated volume with two bricks.
- We are using XFS as the underlying filesystem.
- Format the volume on each brick as:
  mkfs.xfs -f -l internal,lazy-count=1,version=2 /dev/DEVICE
- Mount the volume on each brick as:
  mount -t xfs -o auto,noatime,nodiratime,nobarrier,osyncisdsync,logbsize=256k,logbufs=8 /dev/DEVICE /var/db/gluster
- Create the AFR volume with the following options (a sketch of equivalent gluster CLI commands is at the end of this comment):
  type=2
  count=2
  status=1
  sub_count=2
  version=5
  transport-type=0
  volume-id=XXXX
  performance.io-thread-count=8
  performance.stat-prefetch=off
  performance.quick-read=off
  performance.io-cache=off
  performance.read-ahead=off
  performance.write-behind=on
  performance.flush-behind=on
  performance.cache-priority=0
  performance.cache-refresh-timeout=0
  performance.cache-min-file-size=0
  performance.cache-max-file-size=0
  cluster.metadata-change-log=on
  cluster.data-change-log=on
  cluster.self-heal-window-size=32
  cluster.strict-readdir=off
  cluster.entry-self-heal=on
  cluster.data-self-heal=on
  cluster.metadata-self-heal=on
  cluster.background-self-heal-count=32
  cluster.entry-change-log=on
  transport.keepalive=on
  network.ping-timeout=10
  network.frame-timeout=10
  nfs.disable=on
  performance.cache-size=8388608
  performance.write-behind-window-size=8388608
  brick-0=XXXX:-var-db-gluster
  brick-1=XXXX:-var-db-gluster
- Mount the volume using the gluster FUSE client on 6 client nodes.
- On 3 nodes, run the following in the gluster mount point:
  while true; do touch test; done
- On the remaining 3 nodes, run the following in the gluster mount point:
  while true; do rm -f test; done
- Within 60 seconds the entire volume will crash. We see a number of errors that range from "Transport disconnected" to an eventual "RPC" error, and then the entire volume becomes unusable. Eventually you must kill the glusterd and glusterfsd processes on the server and restart the server. At that point, the test file is usually in a "split-brain" state because some clients experience RPC errors with brick-0, while others experience RPC errors with brick-1. Other errors look like:
  [2011-07-19 19:25:28.205989] I [server3_1-fops.c:259:server_inodelk_cbk] 0-sv3-server: 2081254: INODELK /test (1529166) ==> -1 (Invalid argument)
  [2011-07-19 19:26:03.813649] I [server3_1-fops.c:1340:server_open_cbk] 0-sv3-server: 330759: OPEN /test (0) ==> -1 (No such file or directory)
  [2011-07-19 19:26:52.736672] I [afr-common.c:662:afr_lookup_done] 7-sv3-replicate-0: entries are missing in lookup of /test.
  [2011-07-19 19:26:03.873918] I [afr-common.c:716:afr_lookup_done] 7-sv3-replicate-0: background meta-data data entry self-heal triggered. path: /test
  [2011-07-19 19:26:18.564192] E [rpc-clnt.c:199:call_bail] 7-sv3-client-0: bailing out frame type(GlusterFS 3.1) op(ENTRYLK(31)) xid = 0x1134992x sent = 2011-07-19 19:26:03.873953. timeout = 10
  [2011-07-19 19:25:32.136036] I [afr-self-heal-common.c:1527:afr_self_heal_completion_cbk] 7-sv3-replicate-0: background meta-data data self-heal completed on /.test
  [2011-07-19 19:25:38.552343] E [rpc-clnt.c:199:call_bail] 7-sv3-client-0: bailing out frame type(GlusterFS 3.1) op(ENTRYLK(31)) xid = 0x1033976x sent = 2011-07-19 19:25:26.706684. timeout = 10
- In effect, running replicate on any high-performance volume where multiple nodes might be creating, locking and/or removing the same files is impossible; the volume will crash in short order. Run the scenario above.
- If you run these same scenarios on a local XFS or ext3 filesystem, they work with no problem. It is clearly a problem with file locking/synchronization during an unlink.
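For completeness, the volume above can be created with commands roughly like the following (a sketch; the hostnames are placeholders, and the option values mirror the volume info listed above):

gluster volume create sv3 replica 2 transport tcp server1:/var/db/gluster server2:/var/db/gluster
gluster volume set sv3 performance.io-thread-count 8
gluster volume set sv3 performance.stat-prefetch off
gluster volume set sv3 performance.quick-read off
gluster volume set sv3 network.ping-timeout 10
gluster volume set sv3 network.frame-timeout 10
gluster volume set sv3 nfs.disable on
gluster volume start sv3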
Created attachment 566

This is a log from a server running glusterfsd during the crash scenario in this ticket. The file being touched and removed, which causes the crash, is "/crash-test". SIGUSR1 was sent to glusterfsd during this test. The log is compressed with bzip2 due to size constraints on your Bugzilla install.
Created attachment 567

This is a log from the glusterfs client during the crash. This client was touching the file "/crash-test". SIGUSR1 was sent to glusterfs during this crash.
Created attachment 568

This is a log from a glusterfs client which was performing an rm on the file "/crash-test" during the crash. glusterfs was sent SIGUSR1 during this crash.
(In reply to comment #5)
> Created an attachment (id=568) [details]
> logs from glusterfs client during rm
>
> This is a log from a glusterfs client which was performing an rm on the file
> "/crash-test" during the crash. glusterfs was sent SIGUSR1 during this crash.

hi,
According to the logs provided, there are a lot of disconnects between the FUSE clients and the bricks. The logs suggest that the network between the clients and the bricks was down for a while.

[2011-07-20 16:26:55.718768] I [client.c:1601:client_rpc_notify] 5-sv3-client-0: disconnected
[2011-07-20 16:26:55.718820] I [client.c:1601:client_rpc_notify] 7-sv3-client-0: disconnected
[2011-07-20 16:26:55.718977] I [client.c:1601:client_rpc_notify] 2-sv3-client-0: disconnected
[.........]

The above was from the first attachment. The second and the third attachments do not contain the SIGUSR1 output (called a process statedump); they are client logs. Could you let us know if there were any network outages during the operation?

Procedure to collect the process statedump:
ps aux | grep glusterfsd
kill -USR1 <pid> for each of the glusterfsd processes above, on all the machines
The file will be stored as /tmp/glusterdump.<pid> for each pid on the respective machine.

Pranith
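In other words, roughly the following, run on every brick server (the loop is just a convenience):

for pid in $(pgrep -x glusterfsd); do
    # ask each brick process to write its statedump
    kill -USR1 "$pid"
done
ls /tmp/glusterdump.*    # one dump per brick process; please attach these files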
Hi Pranith,

I can confirm that there were no network interruptions. The machines are virtual and on the same blade server, so the network should not have been an issue. I will get you the statedump ASAP; however, this will have to be in about a week's time, as we are preparing for a release at the moment.
Created attachment 644
We have come across the same problem in our production environment. We were able to replicate the problem using a similar setup and test as the ones described by Cullin Wible. In our case the problem is caused by two clients trying to create the same directory at the same time. The volume does not crash, but it does hang for the duration of the frame-timeout.

[2011-08-25 18:01:05.644088] I [afr-self-heal-common.c:1527:afr_self_heal_completion_cbk] 0-replicate-0: background meta-data self-heal completed on /test
[2011-08-25 18:01:05.644340] I [afr-common.c:662:afr_lookup_done] 0-replicate-0: entries are missing in lookup of /test.
[2011-08-25 18:01:05.645749] I [afr-common.c:662:afr_lookup_done] 0-replicate-0: entries are missing in lookup of /test.
[2011-08-25 18:01:06.651344] I [dht-common.c:418:dht_revalidate_cbk] 0-dht: mismatching layouts for /test
[2011-08-25 18:01:06.652091] I [dht-layout.c:588:dht_layout_normalize] 0-dht: found anomalies in /test. holes=1 overlaps=0
[2011-08-25 18:06:08.952775] E [rpc-clnt.c:199:call_bail] 0-client-0: bailing out frame type(GlusterFS 3.1) op(INODELK(29)) xid = 0x1268x sent = 2011-08-25 18:01:06.652407. timeout = 300
[2011-08-25 18:08:29.833753] W [fuse-bridge.c:413:fuse_attr_cbk] 0-glusterfs-fuse: 854: STAT() /test => -1 (No such file or directory)
[2011-08-25 18:08:55.718066] C [rpc-clnt.c:435:rpc_clnt_fill_request_info] 0-rpc-clnt: cannot lookup the saved frame corresponding to xid (1268) for msg arrived on transport client-0
[2011-08-25 18:08:55.718119] I [client.c:1601:client_rpc_notify] 0-client-0: disconnected
[2011-08-25 18:09:05.970297] I [client-handshake.c:1027:select_server_supported_programs] 0-client-0: Using Program GlusterFS-3.1.0, Num (1298437), Version (310)
[2011-08-25 18:09:05.970534] I [client-handshake.c:863:client_setvolume_cbk] 0-client-0: Connected to 10.8.3.55:10000, attached to remote volume 'brick0'.

There is also a disconnect message, although we know for certain there were no network interruptions at the time. We are using gluster v3.1.3. Attached you can find the process statedump.
(In reply to comment #8)
> Created an attachment (id=644) [details]
> gluster process dump

hi,
The statedump I was given is from the glusterfs mount process; I was requesting statedumps for the "glusterfsd" brick processes:
> ps aux | grep glusterfsd
> kill -USR1 <pid> of the glusterfsd processes above on all the machines

I need to confirm whether you are hitting one of the following bugs:
http://bugs.gluster.com/show_bug.cgi?id=2994
http://bugs.gluster.com/show_bug.cgi?id=2993

Please attach the statedumps from all the machines. Also, this bug is logged against the 3.2.2 release, while the previous comment says the 3.1.3 release is in production. Could you let us know which version is in use?

Pranith
Pranith, This was seen with gluster 3.1.3
I seem to have recreated the same issue with a lot of graph changes, and so I found a lot of client disconnects. The dmesg output shows a lot of hung glusterfsd processes and a lot of write errors.

INFO: task glusterfsd:7413 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
glusterfsd    D ffff8101b929d0c0     0  7413      1          7411 (NOTLB)
 ffff81025aacfb28 0000000000000082 ffff81019426e7b8 ffffffff800345fe
 ffff810200000000 0000000000000008 ffff81023ee020c0 ffff8101b929d0c0
 00005e76de7d6668 0000000000003848 ffff81023ee022a8 0000000000002000
Call Trace:
 [<ffffffff800345fe>] ip_queue_xmit+0x434/0x48e
 [<ffffffff8006e1db>] do_gettimeofday+0x40/0x90
 [<ffffffff8005a7d6>] getnstimeofday+0x10/0x28
 [<ffffffff800637ea>] io_schedule+0x3f/0x67
 [<ffffffff80028cbb>] get_request_wait+0xe7/0x130
 [<ffffffff800a09d8>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8000c0d0>] __make_request+0x350/0x414
 [<ffffffff80046edb>] try_to_wake_up+0x472/0x484
 [<ffffffff8001c250>] generic_make_request+0x211/0x228
 [<ffffffff8000b476>] __find_get_block+0x15c/0x16c
 [<ffffffff80033383>] submit_bio+0xe4/0xeb
 [<ffffffff8001a99a>] submit_bh+0xf1/0x111
 [<ffffffff8001c45d>] __block_write_full_page+0x1f6/0x301
 [<ffffffff8804eceb>] :ext3:ext3_get_block+0x0/0xf7
 [<ffffffff8805056a>] :ext3:ext3_ordered_writepage+0xf1/0x198
 [<ffffffff8001cfd0>] mpage_writepages+0x1bf/0x37d
 [<ffffffff88050479>] :ext3:ext3_ordered_writepage+0x0/0x198
 [<ffffffff8005ac6f>] do_writepages+0x29/0x2f
 [<ffffffff8004fa8a>] __filemap_fdatawrite_range+0x50/0x5b
 [<ffffffff80050193>] do_fsync+0x2f/0xa4
 [<ffffffff800e0f8a>] __do_fsync+0x23/0x36
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0
...
command: Write(10): 2a 00 02 4e f2 07 00 02 30 00
mptscsih: ioc0: target reset: SUCCESS (sc=ffff81025cef3680)
mptscsih: ioc0: attempting task abort! (sc=ffff8101b5306200)
sd 0:0:1:0:
        command: Write(10): 2a 00 01 77 44 87 00 00 08 00
mptscsih: ioc0: task abort: FAILED (sc=ffff8101b5306200)
mptscsih: ioc0: attempting task abort! (sc=ffff810251571980)
sd 0:0:1:0:
        command: Write(10): 2a 00 01 78 2f bf 00 00 10 00
mptscsih: ioc0: task abort: FAILED (sc=ffff810251571980)
mptscsih: ioc0: attempting target reset! (sc=ffff8101b5306200)
sd 0:0:1:0:

Did any of you find the same things happening on your machines?

P.S.: Found this on the 3.1 branch.
Created attachment 645

Here are the statedumps.

In our test we run, on 2 nodes:
while true; do mkdir test; done
On 1 node:
while true; do rm -r test; done

Our dmesg output shows a lot of blocked mkdir tasks:

[8013936.749285] INFO: task mkdir:26355 blocked for more than 120 seconds.
[8013936.749288] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[8013936.749291] mkdir D ffff8803130d10b8 0 26355 14916 0x00000004
[8013936.749296]  ffff880315a51ca8 0000000000000086 0000000000000000 0000000000000082
[8013936.749301]  ffff880315a51be8 ffffffff81051d8a ffff880315a51fd8 ffff8803130d0de0
[8013936.749306]  ffff880315a51fd8 ffff880315a51fd8 ffff8803130d10c0 ffff880315a51fd8
[8013936.749310] Call Trace:
[8013936.749318]  [<ffffffff81051d8a>] ? try_to_wake_up+0x1da/0x370
[8013936.749323]  [<ffffffff81051f2d>] ? default_wake_function+0xd/0x10
[8013936.749329]  [<ffffffff810419f8>] ? __wake_up_common+0x58/0x90
[8013936.749333]  [<ffffffff810434ee>] ? __wake_up+0x4e/0x70
[8013936.749337]  [<ffffffffa019c61d>] fuse_request_send+0xdd/0x2c0 [fuse]
[8013936.749341]  [<ffffffff8107a320>] ? autoremove_wake_function+0x0/0x40
[8013936.749345]  [<ffffffffa01a0114>] create_new_entry+0xa4/0x240 [fuse]
[8013936.749349]  [<ffffffffa01a0479>] fuse_mkdir+0xb9/0xe0 [fuse]
[8013936.749353]  [<ffffffff8114cefd>] vfs_mkdir+0x9d/0x100
[8013936.749356]  [<ffffffff8114f065>] sys_mkdirat+0x125/0x140
[8013936.749358]  [<ffffffff8114f093>] sys_mkdir+0x13/0x20
[8013936.749361]  [<ffffffff8100ae12>] system_call_fastpath+0x16/0x1b
The version that this bug has been reported against does not receive updates from the Gluster Community anymore. Please verify whether this report is still valid against a current (3.4, 3.5 or 3.6) release and update the version, or close this bug. If there has been no update before 9 December 2014, this bug will get automatically closed.