Description of problem:
The glusterfsd process hangs (it does not respond to glusterfs requests but still appears to be running) when the underlying ext4 filesystem gets a corrupted xattr. IO to the affected brick is stuck (the glusterfsd process turns into a zombie when killed); only a reboot, an fsck, and a subsequent start of gluster-server resolve the issue.

This may be related to (a subset of?) https://bugzilla.redhat.com/show_bug.cgi?id=832609

Kernel messages look like this:

Oct 7 05:34:30 ghost9 kernel: [82029.008044] ------------[ cut here ]------------
Oct 7 05:34:30 ghost9 kernel: [82029.008063] WARNING: CPU: 4 PID: 2257 at /build/buildd/linux-lts-saucy-3.11.0/fs/ext4/ext4_jbd2.c:259 __ext4_handle_dirty_metadata+0x1a9/0x1c0()
Oct 7 05:34:30 ghost9 kernel: [82029.008065] Modules linked in: rpcsec_gss_krb5 nfsv4 snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep nouveau ttm snd_pcm mei_me snd_timer drm_kms_helper drm psmouse nfsd mei snd eeepc_wmi soundcore asus_wmi lpc_ich snd_page_alloc sparse_keymap i2c_algo_bit mxm_wmi video serio_raw mac_hid wmi lp nfs_acl auth_rpcgss parport nfs fscache lockd sunrpc ixgbe dca ahci libahci e1000e firewire_ohci firewire_core ptp mdio crc_itu_t pps_core
Oct 7 05:34:30 ghost9 kernel: [82029.008104] CPU: 4 PID: 2257 Comm: glusterfsd Not tainted 3.11.0-20-generic #34~precise1-Ubuntu
Oct 7 05:34:30 ghost9 kernel: [82029.008106] Hardware name: System manufacturer System Product Name/P9X79 WS, BIOS 4306 08/22/2013
Oct 7 05:34:30 ghost9 kernel: [82029.008108] 0000000000000103 ffff880fdd365998 ffffffff8173dd2d 0000000000000007
Oct 7 05:34:30 ghost9 kernel: [82029.008111] 0000000000000000 ffff880fdd3659d8 ffffffff8106540c ffff880fdde52180
Oct 7 05:34:30 ghost9 kernel: [82029.008112] ffff880eb9af5000 00000000ffffff8b ffff8800878b08b0 ffff880fdde52180
Oct 7 05:34:30 ghost9 kernel: [82029.008115] Call Trace:
Oct 7 05:34:30 ghost9 kernel: [82029.008123] [<ffffffff8173dd2d>] dump_stack+0x46/0x58
Oct 7 05:34:30 ghost9 kernel: [82029.008128] [<ffffffff8106540c>] warn_slowpath_common+0x8c/0xc0
Oct 7 05:34:30 ghost9 kernel: [82029.008130] [<ffffffff8106545a>] warn_slowpath_null+0x1a/0x20
Oct 7 05:34:30 ghost9 kernel: [82029.008132] [<ffffffff8127f7c9>] __ext4_handle_dirty_metadata+0x1a9/0x1c0
Oct 7 05:34:30 ghost9 kernel: [82029.008136] [<ffffffff81290f03>] ext4_xattr_release_block+0x103/0x1f0
Oct 7 05:34:30 ghost9 kernel: [82029.008138] [<ffffffff81291524>] ext4_xattr_block_set+0x204/0x710
Oct 7 05:34:30 ghost9 kernel: [82029.008140] [<ffffffff81292170>] ext4_xattr_set_handle+0x370/0x490
Oct 7 05:34:30 ghost9 kernel: [82029.008143] [<ffffffff81292329>] ? ext4_xattr_set+0x99/0x140
Oct 7 05:34:30 ghost9 kernel: [82029.008145] [<ffffffff81292355>] ext4_xattr_set+0xc5/0x140
Oct 7 05:34:30 ghost9 kernel: [82029.008147] [<ffffffff81292e8d>] ext4_xattr_trusted_set+0x2d/0x30
Oct 7 05:34:30 ghost9 kernel: [82029.008153] [<ffffffff811d8b6b>] generic_setxattr+0x6b/0x90
Oct 7 05:34:30 ghost9 kernel: [82029.008155] [<ffffffff811d949b>] __vfs_setxattr_noperm+0x7b/0x1c0
Oct 7 05:34:30 ghost9 kernel: [82029.008159] [<ffffffff81337d8e>] ? evm_inode_setxattr+0xe/0x10
Oct 7 05:34:30 ghost9 kernel: [82029.008162] [<ffffffff811d969c>] vfs_setxattr+0xbc/0xc0
Oct 7 05:34:30 ghost9 kernel: [82029.008164] [<ffffffff811d97de>] setxattr+0x13e/0x1e0
Oct 7 05:34:30 ghost9 kernel: [82029.008170] [<ffffffff817494fe>] ? _raw_spin_lock+0xe/0x20
Oct 7 05:34:30 ghost9 kernel: [82029.008178] [<ffffffff811b6ee3>] ? __sb_start_write+0x53/0x110
Oct 7 05:34:30 ghost9 kernel: [82029.008181] [<ffffffff811d3492>] ? mnt_clone_write+0x12/0x30
Oct 7 05:34:30 ghost9 kernel: [82029.008183] [<ffffffff811d9c7e>] SyS_fsetxattr+0xbe/0x100
Oct 7 05:34:30 ghost9 kernel: [82029.008187] [<ffffffff811d9e5d>] ? SyS_fgetxattr+0x7d/0xd0
Oct 7 05:34:30 ghost9 kernel: [82029.008193] [<ffffffff8175291d>] system_call_fastpath+0x1a/0x1f
Oct 7 05:34:30 ghost9 kernel: [82029.008195] ---[ end trace 655f8cd7683964af ]---
Oct 7 05:34:30 ghost9 kernel: [82029.008198] EXT4-fs: ext4_handle_dirty_xattr_block:167: aborting transaction: error 117 in __ext4_handle_dirty_metadata
Oct 7 05:34:30 ghost9 kernel: [82029.008388] EXT4-fs error (device sda1): ext4_handle_dirty_xattr_block:167: inode #15879459: block 63987149: comm glusterfsd: journal_dirty_metadata failed: handle type 10 started at line 1173, credits 24/24, errcode -117
Oct 7 05:34:30 ghost9 kernel: [82029.008415] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4841: Readonly filesystem
Oct 7 05:34:30 ghost9 kernel: [82029.008464] EXT4-fs error (device sda1) in ext4_dirty_inode:4960: error 117
Oct 7 05:34:30 ghost9 kernel: [82029.008505] EXT4-fs error (device sda1) in ext4_xattr_release_block:558: error 117
Oct 7 05:34:30 ghost9 kernel: [82029.008575] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
Oct 7 05:34:30 ghost9 kernel: [82029.008585] IP: [<ffffffff812708c1>] __ext4_error_inode+0x31/0x120
Oct 7 05:34:30 ghost9 kernel: [82029.008598] PGD 0
Oct 7 05:34:30 ghost9 kernel: [82029.008603] Oops: 0000 [#1] SMP
Oct 7 05:34:30 ghost9 kernel: [82029.008609] Modules linked in: rpcsec_gss_krb5 nfsv4 snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep nouveau ttm snd_pcm mei_me snd_timer drm_kms_helper drm psmouse nfsd mei snd eeepc_wmi soundcore asus_wmi lpc_ich snd_page_alloc sparse_keymap i2c_algo_bit mxm_wmi video serio_raw mac_hid wmi lp nfs_acl auth_rpcgss parport nfs fscache lockd sunrpc ixgbe dca ahci libahci e1000e firewire_ohci firewire_core ptp mdio crc_itu_t pps_core
Oct 7 05:34:30 ghost9 kernel: [82029.008698] CPU: 0 PID: 2257 Comm: glusterfsd Tainted: G W 3.11.0-20-generic #34~precise1-Ubuntu
Oct 7 05:34:30 ghost9 kernel: [82029.008705] Hardware name: System manufacturer System Product Name/P9X79 WS, BIOS 4306 08/22/2013
Oct 7 05:34:30 ghost9 kernel: [82029.008711] task: ffff880fd8219770 ti: ffff880fdd364000 task.ti: ffff880fdd364000
Oct 7 05:34:30 ghost9 kernel: [82029.008716] RIP: 0010:[<ffffffff812708c1>] [<ffffffff812708c1>] __ext4_error_inode+0x31/0x120
Oct 7 05:34:30 ghost9 kernel: [82029.008727] RSP: 0018:ffff880fdd365968 EFLAGS: 00010282
Oct 7 05:34:30 ghost9 kernel: [82029.008731] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000003c804f2
Oct 7 05:34:30 ghost9 kernel: [82029.008737] RDX: 0000000000001131 RSI: ffffffff81830eb0 RDI: 0000000000000000
Oct 7 05:34:30 ghost9 kernel: [82029.008745] RBP: ffff880fdd365a08 R08: ffffffff81b23460 R09: 000000000000000a
Oct 7 05:34:30 ghost9 kernel: [82029.008750] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000001131
Oct 7 05:34:30 ghost9 kernel: [82029.008755] R13: 0000000000000000 R14: ffff880fdde52180 R15: ffffffff81b23460
Oct 7 05:34:30 ghost9 kernel: [82029.008761] FS: 00007fcb17efe700(0000) GS:ffff88103fc00000(0000) knlGS:0000000000000000
Oct 7 05:34:30 ghost9 kernel: [82029.008766] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 7 05:34:30 ghost9 kernel: [82029.008770] CR2: 0000000000000028 CR3: 0000000fd6d29000 CR4: 00000000001407f0
Oct 7 05:34:30 ghost9 kernel: [82029.008776] Stack:
Oct 7 05:34:30 ghost9 kernel: [82029.008779] ffff880fdd365988 ffffffff811e8050 ffff880fe4d82000 ffff880fddd4cc98
Oct 7 05:34:30 ghost9 kernel: [82029.008790] ffff880fdd365998 ffffffff811e8093 ffff880fdde52180 ffffffff81838030
Oct 7 05:34:30 ghost9 kernel: [82029.008801] ffff880fdd365a08 ffffffff8127f28d ffff880fdd3659e8 ffff880fe4d82000
Oct 7 05:34:30 ghost9 kernel: [82029.008811] Call Trace:
Oct 7 05:34:30 ghost9 kernel: [82029.008821] [<ffffffff811e8050>] ? __sync_dirty_buffer+0xa0/0xd0
Oct 7 05:34:30 ghost9 kernel: [82029.008828] [<ffffffff811e8093>] ? sync_dirty_buffer+0x13/0x20
Oct 7 05:34:30 ghost9 kernel: [82029.008836] [<ffffffff8127f28d>] ? ext4_journal_abort_handle+0x4d/0xe0
Oct 7 05:34:30 ghost9 kernel: [82029.008843] [<ffffffff8127f737>] __ext4_handle_dirty_metadata+0x117/0x1c0
Oct 7 05:34:30 ghost9 kernel: [82029.008854] [<ffffffff812913f3>] ? ext4_xattr_block_set+0xd3/0x710
Oct 7 05:34:30 ghost9 kernel: [82029.008865] [<ffffffff8125444a>] ext4_do_update_inode+0x36a/0x560
Oct 7 05:34:30 ghost9 kernel: [82029.008873] [<ffffffff81255e47>] ext4_mark_iloc_dirty+0x67/0x90
Oct 7 05:34:30 ghost9 kernel: [82029.008879] [<ffffffff8129204f>] ext4_xattr_set_handle+0x24f/0x490
Oct 7 05:34:30 ghost9 kernel: [82029.008886] [<ffffffff81292355>] ext4_xattr_set+0xc5/0x140
Oct 7 05:34:30 ghost9 kernel: [82029.009104] [<ffffffff81292e8d>] ext4_xattr_trusted_set+0x2d/0x30
Oct 7 05:34:30 ghost9 kernel: [82029.009534] [<ffffffff811d8b6b>] generic_setxattr+0x6b/0x90
Oct 7 05:34:30 ghost9 kernel: [82029.010056] [<ffffffff811d949b>] __vfs_setxattr_noperm+0x7b/0x1c0
Oct 7 05:34:30 ghost9 kernel: [82029.010569] [<ffffffff81337d8e>] ? evm_inode_setxattr+0xe/0x10
Oct 7 05:34:30 ghost9 kernel: [82029.011084] [<ffffffff811d969c>] vfs_setxattr+0xbc/0xc0
Oct 7 05:34:30 ghost9 kernel: [82029.011604] [<ffffffff811d97de>] setxattr+0x13e/0x1e0
Oct 7 05:34:30 ghost9 kernel: [82029.012121] [<ffffffff817494fe>] ? _raw_spin_lock+0xe/0x20
Oct 7 05:34:30 ghost9 kernel: [82029.012648] [<ffffffff811b6ee3>] ? __sb_start_write+0x53/0x110
Oct 7 05:34:30 ghost9 kernel: [82029.013143] [<ffffffff811d3492>] ? mnt_clone_write+0x12/0x30
Oct 7 05:34:30 ghost9 kernel: [82029.013631] [<ffffffff811d9c7e>] SyS_fsetxattr+0xbe/0x100
Oct 7 05:34:30 ghost9 kernel: [82029.014109] [<ffffffff811d9e5d>] ? SyS_fgetxattr+0x7d/0xd0
Oct 7 05:34:30 ghost9 kernel: [82029.014578] [<ffffffff8175291d>] system_call_fastpath+0x1a/0x1f
Oct 7 05:34:30 ghost9 kernel: [82029.015037] Code: 48 89 e5 48 81 ec a0 00 00 00 48 89 5d d8 4c 89 65 e0 41 89 d4 4c 89 6d e8 4c 89 75 f0 48 89 fb 4c 89 7d f8 4c 89 4d c8 4d 89 c7 <48> 8b 47 28 48 8b 57 40 49 89 f5 49 89 ce 48 8b 80 50 03 00 00
Oct 7 05:34:30 ghost9 kernel: [82029.016080] RIP [<ffffffff812708c1>] __ext4_error_inode+0x31/0x120
Oct 7 05:34:30 ghost9 kernel: [82029.016559] RSP <ffff880fdd365968>
Oct 7 05:34:30 ghost9 kernel: [82029.017041] CR2: 0000000000000028
Oct 7 05:34:30 ghost9 kernel: [82029.019503] ---[ end trace 655f8cd7683964b0 ]---

Version-Release number of selected component (if applicable):
3.5.2-ubuntu1~precise1

How reproducible:
Unable to reproduce on demand, but this happens approximately once per week in a 10-node cluster with 20 compute clients.

Steps to Reproduce:
1. NA

Actual results:
Extended attributes are corrupted (not sure whether this is an ext4 issue or a gluster issue). The brick becomes unresponsive instead of crashing or failing gracefully.

Expected results:
No filesystem corruption. IO fails, or the brick goes down and the replica responds.

Additional info:
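For triage, here is a minimal Python sketch that pulls the device, reporting function, and inode number out of the "EXT4-fs error" lines in a kernel log like the one above, so the affected brick device and inode can be identified before running fsck. It is not part of GlusterFS or the kernel: the name find_ext4_errors and the regular expression are illustrative assumptions derived only from the message format seen in this report, and other kernel versions may format these lines differently.

```python
import re

# Assumed message shapes, taken from the log in this report:
#   EXT4-fs error (device sda1): ext4_handle_dirty_xattr_block:167: inode #15879459: ...
#   EXT4-fs error (device sda1) in ext4_dirty_inode:4960: error 117
EXT4_ERROR = re.compile(
    r"EXT4-fs error \(device (?P<device>[^)]+)\)"
    r"(?::| in) (?P<function>\w+):(?P<line>\d+):"
    r"(?: inode #(?P<inode>\d+):)?"
)


def find_ext4_errors(log_lines):
    """Return one dict per 'EXT4-fs error' record found in log_lines."""
    errors = []
    for text in log_lines:
        match = EXT4_ERROR.search(text)
        if match is None:
            continue  # not an EXT4-fs error record
        inode = match.group("inode")
        errors.append({
            "device": match.group("device"),
            "function": match.group("function"),
            "line": int(match.group("line")),
            # The inode number is only present in some messages.
            "inode": int(inode) if inode is not None else None,
        })
    return errors
```

It can be fed the lines of a saved log, for example find_ext4_errors(open(path)) where path is wherever the kernel log is stored on the brick host. Lines such as "EXT4-fs: ... aborting transaction" are deliberately skipped because they do not carry the "EXT4-fs error (device ...)" prefix.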
When the patch for bug 1130242 has been merged, we can include it in an upcoming 3.5.x release.
*** This bug has been marked as a duplicate of bug 1100204 ***