Bug 764927 (GLUSTER-3195) - Unstable Replication, Followed By Volume Crash
Summary: Unstable Replication, Followed By Volume Crash
Keywords:
Status: CLOSED DEFERRED
Alias: GLUSTER-3195
Product: GlusterFS
Classification: Community
Component: replicate
Version: 3.2.2
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Pranith Kumar K
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-07-19 19:29 UTC by Cullin Wible
Modified: 2014-12-14 19:40 UTC
CC: 6 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2014-12-14 19:40:32 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
Log from glusterfsd during crash scenario (965.04 KB, application/octet-stream)
2011-07-20 14:15 UTC, James Morelli
logs from glusterfs client (137.55 KB, application/octet-stream)
2011-07-20 14:18 UTC, James Morelli
logs from glusterfs client during rm (905.74 KB, application/octet-stream)
2011-07-20 14:19 UTC, James Morelli
gluster process dump (16.92 KB, application/octet-stream)
2011-08-25 11:29 UTC, kakogianni
brick statedumps during crash (1.60 KB, application/octet-stream)
2011-09-01 09:35 UTC, kakogianni

Description Jeff Darcy 2011-07-19 17:22:59 UTC
It's not clearly a problem with unlink; rename is even more notoriously painful to implement, and there have even been issues with it recently. In any case, it wouldn't hurt to get state dumps by sending SIGUSR1 to the glusterfsd processes, or possibly even attach to them with gdb and do "thread apply all bt full".
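
For the gdb route, something like this should do it (a sketch; substitute the actual pid, and the output path is just a suggestion):

    # dump full backtraces of every thread in one glusterfsd process
    gdb -p <PID> -batch -ex "thread apply all bt full" > /tmp/glusterfsd-bt.<PID>.txt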

BTW, why are you mounting with nobarrier? That's generally a terrible idea with most kinds of storage, and it's rare to see anything like GlusterFS deployed on the exceptions.

Comment 1 Cullin Wible 2011-07-19 17:33:07 UTC
We'll work on getting you stack traces and more info. But if you try this scenario you will see the crash within a few minutes.

As for the XFS barriers, we have large battery-backed RAID controllers. And directly from the XFS FAQ at: http://xfs.org/index.php/XFS_FAQ#Q._Should_barriers_be_enabled_with_storage_which_has_a_persistent_write_cache.3F

Many hardware RAID have a persistent write cache which preserves it across power failure, interface resets, system crashes, etc. Using write barriers in this instance is not recommended and will in fact lower performance. Therefore, it is recommended to turn off the barrier support and mount the filesystem with "nobarrier". But take care about the hard disk write cache, which should be off. 

We disable barriers on XFS, enable the write cache on the controller, and disable the write caches on the drives.
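
Roughly, the drive-side step looks like this (a sketch; the device names are placeholders, and it assumes disks that honor hdparm):

    # turn off the volatile on-disk write cache on each member drive
    for disk in /dev/sd[a-l]; do hdparm -W 0 "$disk"; done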

Honestly, I'm surprised you don't see this more, especially considering that GlusterFS is meant to be run in large environments. Each of our bricks holds 80TB of usable space and can sustain ~8 Gb/s of write traffic. These settings are all very important.

Comment 2 Cullin Wible 2011-07-19 19:29:36 UTC
- Create a replicated volume with two bricks.
- We are using XFS as the underlying filesystem.

- Format the volume on each brick as:
mkfs.xfs -f -l internal,lazy-count=1,version=2 /dev/DEVICE

- Mount the volume on each brick as:
mount -t xfs -o auto,noatime,nodiratime,nobarrier,osyncisdsync,logbsize=256k,logbufs=8 /dev/DEVICE /var/db/gluster

- Create the AFR volume with the following options (a rough gluster CLI equivalent is sketched after these steps):
	type=2
	count=2
	status=1
	sub_count=2
	version=5
	transport-type=0
	volume-id=XXXX
	performance.io-thread-count=8
	performance.stat-prefetch=off
	performance.quick-read=off
	performance.io-cache=off
	performance.read-ahead=off
	performance.write-behind=on
	performance.flush-behind=on
	performance.cache-priority=0
	performance.cache-refresh-timeout=0
	performance.cache-min-file-size=0
	performance.cache-max-file-size=0
	cluster.metadata-change-log=on
	cluster.data-change-log=on
	cluster.self-heal-window-size=32
	cluster.strict-readdir=off
	cluster.entry-self-heal=on
	cluster.data-self-heal=on
	cluster.metadata-self-heal=on
	cluster.background-self-heal-count=32
	cluster.entry-change-log=on
	transport.keepalive=on
	network.ping-timeout=10
	network.frame-timeout=10
	nfs.disable=on
	performance.cache-size=8388608
	performance.write-behind-window-size=8388608
	brick-0=XXXX:-var-db-gluster
	brick-1=XXXX:-var-db-gluster

- Mount the volume using the gluster FUSE client on 6 client nodes.

- On 3 nodes, run the following in the gluster mount point:
    while true; do touch test; done

- On the remaining 3 nodes, run the following in the gluster mount point:
    while true; do rm -f test; done

- Within 60 seconds the entire volume will crash. We see a number of errors that range from "Transport disconnected" to an eventual "RPC" error, and then the entire volume becomes unusable. Eventually you must kill the glusterd and glusterfsd processes on the server and restart the server. At that point, the test file is usually in a "split-brain" state, because some clients experience RPC errors with brick-0 while others experience RPC errors with brick-1. Other errors look like:

[2011-07-19 19:25:28.205989] I [server3_1-fops.c:259:server_inodelk_cbk] 0-sv3-server: 2081254: INODELK /test (1529166) ==> -1 (Invalid argument)

[2011-07-19 19:26:03.813649] I [server3_1-fops.c:1340:server_open_cbk] 0-sv3-server: 330759: OPEN /test (0) ==> -1 (No such file or directory)

[2011-07-19 19:26:52.736672] I [afr-common.c:662:afr_lookup_done] 7-sv3-replicate-0: entries are missing in lookup of /test.

[2011-07-19 19:26:03.873918] I [afr-common.c:716:afr_lookup_done] 7-sv3-replicate-0: background  meta-data data entry self-heal triggered. path: /test

[2011-07-19 19:26:18.564192] E [rpc-clnt.c:199:call_bail] 7-sv3-client-0: bailing out frame type(GlusterFS 3.1) op(ENTRYLK(31)) xid = 0x1134992x sent = 2011-07-19 19:26:03.873953. timeout = 10

[2011-07-19 19:25:32.136036] I [afr-self-heal-common.c:1527:afr_self_heal_completion_cbk] 7-sv3-replicate-0: background  meta-data data self-heal completed on /.test

[2011-07-19 19:25:38.552343] E [rpc-clnt.c:199:call_bail] 7-sv3-client-0: bailing out frame type(GlusterFS 3.1) op(ENTRYLK(31)) xid = 0x1033976x sent = 2011-07-19 19:25:26.706684. timeout = 10

- In effect, running replicate on any high-performance volume where multiple nodes might be creating, locking, and/or removing the same files is impossible. The volume will crash in short order. Run the scenario above.

- If you run these same scenarios on a local XFS or ext3 filesystem, they work with no problem. It is clearly a problem with file locking/synchronization during an unlink.
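
For reference, a rough gluster CLI equivalent of the volume above (a sketch; the hostnames and brick paths are placeholders, and only the tunables exposed through "volume set" are shown):

    gluster volume create sv3 replica 2 server1:/var/db/gluster server2:/var/db/gluster
    gluster volume set sv3 performance.io-thread-count 8
    gluster volume set sv3 performance.stat-prefetch off
    gluster volume set sv3 performance.quick-read off
    gluster volume set sv3 performance.io-cache off
    gluster volume set sv3 performance.read-ahead off
    gluster volume set sv3 network.ping-timeout 10
    gluster volume set sv3 network.frame-timeout 10
    gluster volume set sv3 nfs.disable on
    gluster volume start sv3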

Comment 3 James Morelli 2011-07-20 14:15:16 UTC
Created attachment 566
Log from glusterfsd during crash scenario

This is a log from a server running glusterfsd during the crash scenario in this ticket. The file being touched and removed, which causes the crash, is "/crash-test". SIGUSR1 was sent to glusterfsd during this test.

The log is compressed with bzip2 due to size constraints on your bugzilla install.

Comment 4 James Morelli 2011-07-20 14:18:33 UTC
Created attachment 567
logs from glusterfs client

This is a log from the glusterfs client during the crash. This client was touching the file "/crash-test". SIGUSR1 was sent to glusterfs during this crash.

Comment 5 James Morelli 2011-07-20 14:19:54 UTC
Created attachment 568
logs from glusterfs client during rm

This is a log from a glusterfs client which was performing an rm on the file "/crash-test" during the crash. glusterfs was sent SIGUSR1 during this crash.

Comment 6 Pranith Kumar K 2011-08-03 03:59:25 UTC
(In reply to comment #5)
> Created an attachment (id=568) [details]
> logs from glusterfs client during rm
> 
> This is a log from a glusterfs client which was performing an rm on the file
> "/crash-test" during the crash. glusterfs was sent SIGUSR1 during this crash.

hi,
    According to the logs provided, there are a lot of disconnects between the FUSE clients and the bricks. The logs suggest that the network between the clients and the bricks was down for a while.

[2011-07-20 16:26:55.718768] I [client.c:1601:client_rpc_notify] 5-sv3-client-0: disconnected
[2011-07-20 16:26:55.718820] I [client.c:1601:client_rpc_notify] 7-sv3-client-0: disconnected
[2011-07-20 16:26:55.718977] I [client.c:1601:client_rpc_notify] 2-sv3-client-0: disconnected
[.........]

The above was from the first attachment. The second and third attachments don't contain the SIGUSR1 output (the process statedump); they are client logs.

Could you let us know if there were any network outages during the operation?

Procedure to collect the process statedump:

ps aux | grep glusterfsd

kill -USR1 <pid>   (for each of the glusterfsd pids above, on all the machines)

The file will be stored in /tmp/glusterdump.<pid> for each pid on the respective machine.
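
Putting that together, a sketch that triggers and bundles the dumps on one machine (the tar step is just a convenience for attaching them here):

    # send SIGUSR1 to every brick process to trigger statedumps
    for pid in $(pgrep glusterfsd); do kill -USR1 "$pid"; done
    sleep 2  # give the processes a moment to write the dumps
    tar czf /tmp/glusterdumps-$(hostname).tar.gz /tmp/glusterdump.*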

Pranith

Comment 7 James Morelli 2011-08-04 16:48:53 UTC
Hi Pranith,

I can confirm that there were no network interruptions. The machines are virtual and on the same blade server, so the network should not have been an issue.

I will get you the statedump ASAP; however, this will have to be in about a week's time, as we are preparing for a release at the moment.

Comment 8 kakogianni 2011-08-25 11:29:45 UTC
Created attachment 644
gluster process dump

Comment 9 kakogianni 2011-08-25 11:29:58 UTC
We have come across the same problem in our production environment. We were able to replicate the problem using a setup and test similar to the ones described by Cullin Wible.

In our case the problem is caused by two clients trying to create the same directory at the same time. The volume does not crash, but it does hang for the duration of the frame-timeout.

[2011-08-25 18:01:05.644088] I [afr-self-heal-common.c:1527:afr_self_heal_completion_cbk] 0-replicate-0: background  meta-data self-heal completed on /test
[2011-08-25 18:01:05.644340] I [afr-common.c:662:afr_lookup_done] 0-replicate-0: entries are missing in lookup of /test.
[2011-08-25 18:01:05.645749] I [afr-common.c:662:afr_lookup_done] 0-replicate-0: entries are missing in lookup of /test.
[2011-08-25 18:01:06.651344] I [dht-common.c:418:dht_revalidate_cbk] 0-dht: mismatching layouts for /test
[2011-08-25 18:01:06.652091] I [dht-layout.c:588:dht_layout_normalize] 0-dht: found anomalies in /test. holes=1 overlaps=0
[2011-08-25 18:06:08.952775] E [rpc-clnt.c:199:call_bail] 0-client-0: bailing out frame type(GlusterFS 3.1) op(INODELK(29)) xid = 0x1268x sent = 2011-08-25 18:01:06.652407. timeout = 300
[2011-08-25 18:08:29.833753] W [fuse-bridge.c:413:fuse_attr_cbk] 0-glusterfs-fuse: 854: STAT() /test => -1 (No such file or directory)
[2011-08-25 18:08:55.718066] C [rpc-clnt.c:435:rpc_clnt_fill_request_info] 0-rpc-clnt: cannot lookup the saved frame corresponding to xid (1268) for msg arrived on transport client-0
[2011-08-25 18:08:55.718119] I [client.c:1601:client_rpc_notify] 0-client-0: disconnected
[2011-08-25 18:09:05.970297] I [client-handshake.c:1027:select_server_supported_programs] 0-client-0: Using Program GlusterFS-3.1.0, Num (1298437), Version (310)
[2011-08-25 18:09:05.970534] I [client-handshake.c:863:client_setvolume_cbk] 0-client-0: Connected to 10.8.3.55:10000, attached to remote volume 'brick0'.

There is also a disconnect message, although we know for certain there were no network interruptions at the time.

We are using gluster v3.1.3. Attached you can find the process statedump.

Comment 10 Pranith Kumar K 2011-08-29 05:41:19 UTC
(In reply to comment #8)
> Created an attachment (id=644) [details]
> gluster process dump

hi,
    I was given the statedump of the glusterfs mount process; I was requesting statedumps for the "glusterfsd" brick processes.

> ps aux | grep glusterfsd
> kill -USR1 <pid> of the glusterfsd processes above on all the machines

I need to confirm if you guys are hitting one of the following bugs:
http://bugs.gluster.com/show_bug.cgi?id=2994
http://bugs.gluster.com/show_bug.cgi?id=2993

Please attach the statedumps from all the machines.

This bug is logged against release 3.2.2, but the previous comment says the 3.1.3 release is in production. Could you let us know which version is in use?

Pranith

Comment 11 James Morelli 2011-08-29 11:50:29 UTC
Pranith,

This was seen with gluster 3.1.3.

Comment 12 Rahul C S 2011-08-30 08:03:49 UTC
I seem to have recreated the same issue with a lot of graph changes, and so I found a lot of client disconnects.

The dmesg output shows a lot of hung glusterfsd processes and a lot of write errors.
INFO: task glusterfsd:7413 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
glusterfsd    D ffff8101b929d0c0     0  7413      1                7411 (NOTLB)
 ffff81025aacfb28 0000000000000082 ffff81019426e7b8 ffffffff800345fe
 ffff810200000000 0000000000000008 ffff81023ee020c0 ffff8101b929d0c0
 00005e76de7d6668 0000000000003848 ffff81023ee022a8 0000000000002000
Call Trace:
 [<ffffffff800345fe>] ip_queue_xmit+0x434/0x48e
 [<ffffffff8006e1db>] do_gettimeofday+0x40/0x90
 [<ffffffff8005a7d6>] getnstimeofday+0x10/0x28
 [<ffffffff800637ea>] io_schedule+0x3f/0x67
 [<ffffffff80028cbb>] get_request_wait+0xe7/0x130
 [<ffffffff800a09d8>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8000c0d0>] __make_request+0x350/0x414
 [<ffffffff80046edb>] try_to_wake_up+0x472/0x484
 [<ffffffff8001c250>] generic_make_request+0x211/0x228
 [<ffffffff8000b476>] __find_get_block+0x15c/0x16c
 [<ffffffff80033383>] submit_bio+0xe4/0xeb
 [<ffffffff8001a99a>] submit_bh+0xf1/0x111
 [<ffffffff8001c45d>] __block_write_full_page+0x1f6/0x301
 [<ffffffff8804eceb>] :ext3:ext3_get_block+0x0/0xf7
 [<ffffffff8805056a>] :ext3:ext3_ordered_writepage+0xf1/0x198
 [<ffffffff8001cfd0>] mpage_writepages+0x1bf/0x37d
 [<ffffffff88050479>] :ext3:ext3_ordered_writepage+0x0/0x198
 [<ffffffff8005ac6f>] do_writepages+0x29/0x2f
 [<ffffffff8004fa8a>] __filemap_fdatawrite_range+0x50/0x5b
 [<ffffffff80050193>] do_fsync+0x2f/0xa4
 [<ffffffff800e0f8a>] __do_fsync+0x23/0x36
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0
...
       command: Write(10): 2a 00 02 4e f2 07 00 02 30 00
mptscsih: ioc0: target reset: SUCCESS (sc=ffff81025cef3680)
mptscsih: ioc0: attempting task abort! (sc=ffff8101b5306200)
sd 0:0:1:0: 
        command: Write(10): 2a 00 01 77 44 87 00 00 08 00
mptscsih: ioc0: task abort: FAILED (sc=ffff8101b5306200)
mptscsih: ioc0: attempting task abort! (sc=ffff810251571980)
sd 0:0:1:0: 
        command: Write(10): 2a 00 01 78 2f bf 00 00 10 00
mptscsih: ioc0: task abort: FAILED (sc=ffff810251571980)
mptscsih: ioc0: attempting target reset! (sc=ffff8101b5306200)
sd 0:0:1:0: 

Did any of you find the same things happening on your machines?

P.S.: Found this on the 3.1 branch.

Comment 13 kakogianni 2011-09-01 09:35:01 UTC
Created attachment 645
brick statedumps during crash

Here are the statedumps.

In our test we run the following on 2 nodes:
    while true; do mkdir test; done
On 1 node:
    while true; do rm -r test; done

Our dmesg output shows a lot of blocked mkdir tasks:

[8013936.749285] INFO: task mkdir:26355 blocked for more than 120 seconds.
[8013936.749288] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[8013936.749291] mkdir           D ffff8803130d10b8     0 26355  14916 0x00000004
[8013936.749296]  ffff880315a51ca8 0000000000000086 0000000000000000 0000000000000082
[8013936.749301]  ffff880315a51be8 ffffffff81051d8a ffff880315a51fd8 ffff8803130d0de0
[8013936.749306]  ffff880315a51fd8 ffff880315a51fd8 ffff8803130d10c0 ffff880315a51fd8
[8013936.749310] Call Trace:
[8013936.749318]  [<ffffffff81051d8a>] ? try_to_wake_up+0x1da/0x370
[8013936.749323]  [<ffffffff81051f2d>] ? default_wake_function+0xd/0x10
[8013936.749329]  [<ffffffff810419f8>] ? __wake_up_common+0x58/0x90
[8013936.749333]  [<ffffffff810434ee>] ? __wake_up+0x4e/0x70
[8013936.749337]  [<ffffffffa019c61d>] fuse_request_send+0xdd/0x2c0 [fuse]
[8013936.749341]  [<ffffffff8107a320>] ? autoremove_wake_function+0x0/0x40
[8013936.749345]  [<ffffffffa01a0114>] create_new_entry+0xa4/0x240 [fuse]
[8013936.749349]  [<ffffffffa01a0479>] fuse_mkdir+0xb9/0xe0 [fuse]
[8013936.749353]  [<ffffffff8114cefd>] vfs_mkdir+0x9d/0x100
[8013936.749356]  [<ffffffff8114f065>] sys_mkdirat+0x125/0x140
[8013936.749358]  [<ffffffff8114f093>] sys_mkdir+0x13/0x20
[8013936.749361]  [<ffffffff8100ae12>] system_call_fastpath+0x16/0x1b

Comment 14 Niels de Vos 2014-11-27 14:54:38 UTC
The version that this bug has been reported against does not get any updates from the Gluster Community anymore. Please verify whether this report is still valid against a current (3.4, 3.5 or 3.6) release and update the version, or close this bug.

If there has been no update before 9 December 2014, this bug will get automatically closed.

