Bug 1108850 - GlusterFS hangs which causes zombie processes
Summary: GlusterFS hangs which causes zombie processes
Keywords:
Status: CLOSED DUPLICATE of bug 1089470
Alias: None
Product: GlusterFS
Classification: Community
Component: fuse
Version: 3.5.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-06-12 17:04 UTC by Amit
Modified: 2014-12-14 20:28 UTC
CC: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-12-14 20:28:09 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
tgz of entire /var/log/glusterfs folder. sorry if it's overkill (657.94 KB, application/octet-stream)
2014-06-12 17:04 UTC, Amit

Description Amit 2014-06-12 17:04:38 UTC
Created attachment 908223 [details]
tgz of entire /var/log/glusterfs folder. sorry if it's overkill

Description of problem:
httpd and some other processes hang and cannot be killed with kill -9. These processes access image files through a FUSE mount point.

Version-Release number of selected component (if applicable):
1) Centos 6.5 - 2.6.32-431.17.1.el6.x86_64 #1 SMP Wed May 7 23:32:49 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

2) Gluster 3.5.0-2.el6 ( from yum repos)


How reproducible:
Unknown how to reproduce; it has happened only once in the last month. However, I recently switched to a datacenter that has much higher usage of the gluster system, so maybe the increased load of reads and writes contributed to it.

Steps to Reproduce:
Unknown

Actual results:


Expected results:


Additional info:

### FROM /var/log/messages ###

Jun 11 20:54:19 server-relay kernel: INFO: task glusterfs:7769 blocked for more than 120 seconds.
Jun 11 20:54:19 server-relay kernel:      Not tainted 2.6.32-431.17.1.el6.x86_64 #1
Jun 11 20:54:19 server-relay kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 11 20:54:19 server-relay kernel: glusterfs     D 0000000000000000     0  7769      1 0x00000000
Jun 11 20:54:19 server-relay kernel: ffff880436ca1a78 0000000000000082 ffffffff81aaa5c0 0000000000000086
Jun 11 20:54:19 server-relay kernel: ffff880436ca1a08 0000000000000086 ffff880436ca1a58 ffffea0005319a80
Jun 11 20:54:19 server-relay kernel: ffff880436ccb058 ffff880436ca1fd8 000000000000fbc8 ffff880436ccb058
Jun 11 20:54:19 server-relay kernel: Call Trace:
Jun 11 20:54:19 server-relay kernel: [<ffffffff810a6d01>] ? ktime_get_ts+0xb1/0xf0
Jun 11 20:54:19 server-relay kernel: [<ffffffff8111f590>] ? sync_page+0x0/0x50
Jun 11 20:54:19 server-relay kernel: [<ffffffff81528383>] io_schedule+0x73/0xc0
Jun 11 20:54:19 server-relay kernel: [<ffffffff8111f5cd>] sync_page+0x3d/0x50
Jun 11 20:54:19 server-relay kernel: [<ffffffff81528c1a>] __wait_on_bit_lock+0x5a/0xc0
Jun 11 20:54:19 server-relay kernel: [<ffffffff8111f567>] __lock_page+0x67/0x70
Jun 11 20:54:19 server-relay kernel: [<ffffffff8109af80>] ? wake_bit_function+0x0/0x50
Jun 11 20:54:19 server-relay kernel: [<ffffffff8111f6a7>] ? unlock_page+0x27/0x30
Jun 11 20:54:19 server-relay kernel: [<ffffffff81137cad>] invalidate_inode_pages2_range+0x36d/0x3b0
Jun 11 20:54:19 server-relay kernel: [<ffffffffa0278145>] fuse_reverse_inval_inode+0x75/0x90 [fuse]
Jun 11 20:54:19 server-relay kernel: [<ffffffffa026d9ef>] fuse_notify+0x12f/0x2a0 [fuse]
Jun 11 20:54:19 server-relay kernel: [<ffffffffa026d593>] ? fuse_copy_one+0x53/0x70 [fuse]
Jun 11 20:54:19 server-relay kernel: [<ffffffffa026e6ea>] fuse_dev_write+0x16a/0x3e0 [fuse]
Jun 11 20:54:19 server-relay kernel: [<ffffffff8118893a>] do_sync_write+0xfa/0x140
Jun 11 20:54:19 server-relay kernel: [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
Jun 11 20:54:19 server-relay kernel: [<ffffffff81226056>] ? security_file_permission+0x16/0x20
Jun 11 20:54:19 server-relay kernel: [<ffffffff81188c38>] vfs_write+0xb8/0x1a0
Jun 11 20:54:19 server-relay kernel: [<ffffffff81189531>] sys_write+0x51/0x90
Jun 11 20:54:19 server-relay kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Jun 11 20:54:19 server-relay kernel: INFO: task mimedefang.pl:16185 blocked for more than 120 seconds.
Jun 11 20:54:19 server-relay kernel:      Not tainted 2.6.32-431.17.1.el6.x86_64 #1
Jun 11 20:54:19 server-relay kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 11 20:54:19 server-relay kernel: mimedefang.pl D 0000000000000002     0 16185  24786 0x00000000
Jun 11 20:54:19 server-relay kernel: ffff880287b6de38 0000000000000082 ffff880287b6ddd8 ffffffff810546b9
Jun 11 20:54:19 server-relay kernel: ffff88036641abe0 0000000300000000 00000000000000d0 ffff880434ed5840
Jun 11 20:54:19 server-relay kernel: ffff8804346d1098 ffff880287b6dfd8 000000000000fbc8 ffff8804346d1098
Jun 11 20:54:19 server-relay kernel: Call Trace:
Jun 11 20:54:19 server-relay kernel: [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
Jun 11 20:54:19 server-relay kernel: [<ffffffff8109b22e>] ? prepare_to_wait+0x4e/0x80
Jun 11 20:54:19 server-relay kernel: [<ffffffffa026e085>] fuse_request_send+0xe5/0x290 [fuse]
Jun 11 20:54:19 server-relay kernel: [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
Jun 11 20:54:19 server-relay kernel: [<ffffffffa0273ff6>] fuse_flush+0x106/0x140 [fuse]
Jun 11 20:54:19 server-relay kernel: [<ffffffff8118582c>] filp_close+0x3c/0x90
Jun 11 20:54:19 server-relay kernel: [<ffffffff81185925>] sys_close+0xa5/0x100
Jun 11 20:54:19 server-relay kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Jun 11 20:54:19 server-relay kernel: INFO: task rsync:6831 blocked for more than 120 seconds.
Jun 11 20:54:19 server-relay kernel:      Not tainted 2.6.32-431.17.1.el6.x86_64 #1
Jun 11 20:54:19 server-relay kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 11 20:54:19 server-relay kernel: rsync         D 0000000000000002     0  6831   7094 0x00000000
Jun 11 20:54:19 server-relay kernel: ffff8801f28a3bb8 0000000000000086 ffff8801f28a3be8 ffffffffa027278b
Jun 11 20:54:19 server-relay kernel: 0000000000000001 0000000000000000 0000000000069c66 0000000000000000
Jun 11 20:54:19 server-relay kernel: ffff88038065b058 ffff8801f28a3fd8 000000000000fbc8 ffff88038065b058
Jun 11 20:54:19 server-relay kernel: Call Trace:
Jun 11 20:54:19 server-relay kernel: [<ffffffffa027278b>] ? fuse_dentry_revalidate+0x2fb/0x340 [fuse]
Jun 11 20:54:19 server-relay kernel: [<ffffffff810a6d01>] ? ktime_get_ts+0xb1/0xf0
Jun 11 20:54:19 server-relay kernel: [<ffffffff8111f590>] ? sync_page+0x0/0x50
Jun 11 20:54:19 server-relay kernel: [<ffffffff81528383>] io_schedule+0x73/0xc0
Jun 11 20:54:19 server-relay kernel: [<ffffffff8111f5cd>] sync_page+0x3d/0x50
Jun 11 20:54:19 server-relay kernel: [<ffffffff81528c1a>] __wait_on_bit_lock+0x5a/0xc0
Jun 11 20:54:19 server-relay kernel: [<ffffffff8111f567>] __lock_page+0x67/0x70
Jun 11 20:54:19 server-relay kernel: [<ffffffff8109af80>] ? wake_bit_function+0x0/0x50
Jun 11 20:54:19 server-relay kernel: [<ffffffff811358b2>] ? pagevec_lookup+0x22/0x30
Jun 11 20:54:19 server-relay kernel: [<ffffffff81137783>] truncate_inode_pages_range+0x4e3/0x500
Jun 11 20:54:19 server-relay kernel: [<ffffffff81137835>] truncate_inode_pages+0x15/0x20
Jun 11 20:54:19 server-relay kernel: [<ffffffff8113788f>] truncate_pagecache+0x4f/0x70
Jun 11 20:54:19 server-relay kernel: [<ffffffffa0278342>] fuse_change_attributes+0xd2/0x100 [fuse]
Jun 11 20:54:19 server-relay kernel: [<ffffffffa02700dd>] fuse_readdir+0x76d/0x7b0 [fuse]
Jun 11 20:54:19 server-relay kernel: [<ffffffff8119ea30>] ? filldir+0x0/0xe0
Jun 11 20:54:19 server-relay kernel: [<ffffffff8119ea30>] ? filldir+0x0/0xe0
Jun 11 20:54:19 server-relay kernel: [<ffffffff8119ecb0>] vfs_readdir+0xc0/0xe0
Jun 11 20:54:19 server-relay kernel: [<ffffffff8118e7a6>] ? sys_newlstat+0x36/0x50
Jun 11 20:54:19 server-relay kernel: [<ffffffff8119ee39>] sys_getdents+0x89/0xf0
Jun 11 20:54:19 server-relay kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Jun 11 20:54:27 server-relay xinetd[29487]: EXIT: mysqlchk signal=13 pid=8414 duration=0(sec)
Jun 11 20:54:39 server-relay xinetd[29487]: EXIT: mysqlchk signal=13 pid=8432 duration=0(sec)
Jun 11 20:54:51 server-relay xinetd[29487]: EXIT: mysqlchk signal=13 pid=8477 duration=0(sec)
Jun 11 20:55:03 server-relay xinetd[29487]: EXIT: mysqlchk signal=13 pid=8696 duration=0(sec)
Jun 11 20:55:15 server-relay xinetd[29487]: EXIT: mysqlchk signal=13 pid=8831 duration=0(sec)
Jun 11 20:55:27 server-relay xinetd[29487]: EXIT: mysqlchk signal=13 pid=8956 duration=0(sec)
Jun 11 20:55:39 server-relay xinetd[29487]: EXIT: mysqlchk signal=13 pid=9079 duration=0(sec)
Jun 11 20:55:51 server-relay xinetd[29487]: EXIT: mysqlchk signal=13 pid=9221 duration=0(sec)
Jun 11 20:56:03 server-relay xinetd[29487]: EXIT: mysqlchk signal=13 pid=9374 duration=0(sec)
Jun 11 20:56:15 server-relay xinetd[29487]: EXIT: mysqlchk status=0 pid=9505 duration=0(sec)
Jun 11 20:56:19 server-relay kernel: INFO: task glusterfs:7769 blocked for more than 120 seconds.
Jun 11 20:56:19 server-relay kernel:      Not tainted 2.6.32-431.17.1.el6.x86_64 #1
Jun 11 20:56:19 server-relay kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 11 20:56:19 server-relay kernel: glusterfs     D 0000000000000000     0  7769      1 0x00000000
Jun 11 20:56:19 server-relay kernel: ffff880436ca1a78 0000000000000082 ffffffff81aaa5c0 0000000000000086
Jun 11 20:56:19 server-relay kernel: ffff880436ca1a08 0000000000000086 ffff880436ca1a58 ffffea0005319a80
Jun 11 20:56:19 server-relay kernel: ffff880436ccb058 ffff880436ca1fd8 000000000000fbc8 ffff880436ccb058
Jun 11 20:56:19 server-relay kernel: Call Trace:
Jun 11 20:56:19 server-relay kernel: [<ffffffff810a6d01>] ? ktime_get_ts+0xb1/0xf0
Jun 11 20:56:19 server-relay kernel: [<ffffffff8111f590>] ? sync_page+0x0/0x50
Jun 11 20:56:19 server-relay kernel: [<ffffffff81528c1a>] __wait_on_bit_lock+0x5a/0xc0
Jun 11 20:56:19 server-relay kernel: [<ffffffff8111f567>] __lock_page+0x67/0x70
Jun 11 20:56:19 server-relay kernel: [<ffffffff8109af80>] ? wake_bit_function+0x0/0x50
Jun 11 20:56:19 server-relay kernel: [<ffffffff8111f6a7>] ? unlock_page+0x27/0x30
Jun 11 20:56:19 server-relay kernel: [<ffffffff81137cad>] invalidate_inode_pages2_range+0x36d/0x3b0
Jun 11 20:56:19 server-relay kernel: [<ffffffffa0278145>] fuse_reverse_inval_inode+0x75/0x90 [fuse]
Jun 11 20:56:19 server-relay kernel: [<ffffffffa026d9ef>] fuse_notify+0x12f/0x2a0 [fuse]
Jun 11 20:56:19 server-relay kernel: [<ffffffffa026d593>] ? fuse_copy_one+0x53/0x70 [fuse]
Jun 11 20:56:19 server-relay kernel: [<ffffffffa026e6ea>] fuse_dev_write+0x16a/0x3e0 [fuse]
Jun 11 20:56:19 server-relay kernel: [<ffffffff8118893a>] do_sync_write+0xfa/0x140
Jun 11 20:56:19 server-relay kernel: [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
Jun 11 20:56:19 server-relay kernel: [<ffffffff81226056>] ? security_file_permission+0x16/0x20
Jun 11 20:56:19 server-relay kernel: [<ffffffff81188c38>] vfs_write+0xb8/0x1a0
Jun 11 20:56:19 server-relay kernel: [<ffffffff81189531>] sys_write+0x51/0x90
Jun 11 20:56:19 server-relay kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Jun 11 20:56:19 server-relay kernel: INFO: task mimedefang.pl:25148 blocked for more than 120 seconds.
Jun 11 20:56:19 server-relay kernel:      Not tainted 2.6.32-431.17.1.el6.x86_64 #1
Jun 11 20:56:19 server-relay kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 11 20:56:19 server-relay kernel: mimedefang.pl D 0000000000000001     0 25148  24786 0x00000000
Jun 11 20:56:19 server-relay kernel: ffff88026de95d18 0000000000000082 0000000000000000 ffff880383648034
Jun 11 20:56:19 server-relay kernel: ffff88026de95c88 ffffffff812271bf ffff88026de95d68 ffffffff811989c5
Jun 11 20:56:19 server-relay kernel: ffff88026b901098 ffff88026de95fd8 000000000000fbc8 ffff88026b901098
Jun 11 20:56:19 server-relay kernel: Call Trace:
Jun 11 20:56:19 server-relay kernel: [<ffffffff812271bf>] ? security_inode_permission+0x1f/0x30
Jun 11 20:56:19 server-relay kernel: [<ffffffff811989c5>] ? __link_path_walk+0x145/0xff0
Jun 11 20:56:19 server-relay kernel: [<ffffffff8152935e>] __mutex_lock_slowpath+0x13e/0x180
Jun 11 20:56:19 server-relay kernel: [<ffffffff815291fb>] mutex_lock+0x2b/0x50
Jun 11 20:56:19 server-relay kernel: [<ffffffff8119b3e6>] do_filp_open+0x2d6/0xd20
Jun 11 20:56:19 server-relay kernel: [<ffffffff812824f5>] ? _atomic_dec_and_lock+0x55/0x80
Jun 11 20:56:19 server-relay kernel: [<ffffffff811a7ea2>] ? alloc_fd+0x92/0x160
Jun 11 20:56:19 server-relay kernel: [<ffffffff811859e9>] do_sys_open+0x69/0x140
Jun 11 20:56:19 server-relay kernel: [<ffffffff81185b00>] sys_open+0x20/0x30
Jun 11 20:56:19 server-relay kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Jun 11 20:56:19 server-relay kernel: INFO: task mimedefang.pl:16185 blocked for more than 120 seconds.
Jun 11 20:56:19 server-relay kernel:      Not tainted 2.6.32-431.17.1.el6.x86_64 #1
Jun 11 20:56:19 server-relay kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 11 20:56:19 server-relay kernel: mimedefang.pl D 0000000000000002     0 16185  24786 0x00000000
Jun 11 20:56:19 server-relay kernel: ffff880287b6de38 0000000000000082 ffff880287b6ddd8 ffffffff810546b9
Jun 11 20:56:19 server-relay kernel: ffff88036641abe0 0000000300000000 00000000000000d0 ffff880434ed5840
Jun 11 20:56:19 server-relay kernel: ffff8804346d1098 ffff880287b6dfd8 000000000000fbc8 ffff8804346d1098
Jun 11 20:56:19 server-relay kernel: Call Trace:
Jun 11 20:56:19 server-relay kernel: [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
Jun 11 20:56:19 server-relay kernel: [<ffffffff8109b22e>] ? prepare_to_wait+0x4e/0x80
Jun 11 20:56:19 server-relay kernel: [<ffffffffa026e085>] fuse_request_send+0xe5/0x290 [fuse]
Jun 11 20:56:19 server-relay kernel: [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
Jun 11 20:56:19 server-relay kernel: [<ffffffffa0273ff6>] fuse_flush+0x106/0x140 [fuse]
Jun 11 20:56:19 server-relay kernel: [<ffffffff8118582c>] filp_close+0x3c/0x90
Jun 11 20:56:19 server-relay kernel: [<ffffffff81185925>] sys_close+0xa5/0x100
Jun 11 20:56:19 server-relay kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Jun 11 20:56:19 server-relay kernel: INFO: task mimedefang.pl:14409 blocked for more than 120 seconds.
Jun 11 20:56:19 server-relay kernel:      Not tainted 2.6.32-431.17.1.el6.x86_64 #1
Jun 11 20:56:19 server-relay kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 11 20:56:19 server-relay kernel: mimedefang.pl D 0000000000000000     0 14409  24786 0x00000000
Jun 11 20:56:19 server-relay kernel: ffff88026dc51d18 0000000000000086 0000000000000000 ffff8803667f0034
Jun 11 20:56:19 server-relay kernel: ffff88026dc51c88 ffffffff812271bf ffff88026dc51d68 ffffffff811989c5
Jun 11 20:56:19 server-relay kernel: ffff88038065bab8 ffff88026dc51fd8 000000000000fbc8 ffff88038065bab8
Jun 11 20:56:19 server-relay kernel: Call Trace:
Jun 11 20:56:19 server-relay kernel: [<ffffffff812271bf>] ? security_inode_permission+0x1f/0x30
Jun 11 20:56:19 server-relay kernel: [<ffffffff811989c5>] ? __link_path_walk+0x145/0xff0
Jun 11 20:56:19 server-relay kernel: [<ffffffff8152935e>] __mutex_lock_slowpath+0x13e/0x180
Jun 11 20:56:19 server-relay kernel: [<ffffffff815291fb>] mutex_lock+0x2b/0x50
Jun 11 20:56:19 server-relay kernel: [<ffffffff8119b3e6>] do_filp_open+0x2d6/0xd20
Jun 11 20:56:19 server-relay kernel: [<ffffffff812824f5>] ? _atomic_dec_and_lock+0x55/0x80
Jun 11 20:56:19 server-relay kernel: [<ffffffff811a7ea2>] ? alloc_fd+0x92/0x160
Jun 11 20:56:19 server-relay kernel: [<ffffffff811859e9>] do_sys_open+0x69/0x140
Jun 11 20:56:19 server-relay kernel: [<ffffffff81185b00>] sys_open+0x20/0x30
Jun 11 20:56:19 server-relay kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Jun 11 20:56:19 server-relay kernel: INFO: task mimedefang.pl:14415 blocked for more than 120 seconds.
Jun 11 20:56:19 server-relay kernel:      Not tainted 2.6.32-431.17.1.el6.x86_64 #1
Jun 11 20:56:19 server-relay kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 11 20:56:19 server-relay kernel: mimedefang.pl D 0000000000000000     0 14415  24786 0x00000000
Jun 11 20:56:19 server-relay kernel: ffff880187305d18 0000000000000082 0000000000000000 ffff88040fab9034
Jun 11 20:56:19 server-relay kernel: ffff880187305c88 ffffffff812271bf ffff880187305d68 ffffffff811989c5
Jun 11 20:56:19 server-relay kernel: ffff880259d4c638 ffff880187305fd8 000000000000fbc8 ffff880259d4c638
Jun 11 20:56:19 server-relay kernel: Call Trace:
Jun 11 20:56:19 server-relay kernel: [<ffffffff812271bf>] ? security_inode_permission+0x1f/0x30
Jun 11 20:56:19 server-relay kernel: [<ffffffff811989c5>] ? __link_path_walk+0x145/0xff0
Jun 11 20:56:19 server-relay kernel: [<ffffffff8152935e>] __mutex_lock_slowpath+0x13e/0x180
Jun 11 20:56:19 server-relay kernel: [<ffffffff815291fb>] mutex_lock+0x2b/0x50
Jun 11 20:56:19 server-relay kernel: [<ffffffff8119b3e6>] do_filp_open+0x2d6/0xd20
Jun 11 20:56:19 server-relay kernel: [<ffffffff812824f5>] ? _atomic_dec_and_lock+0x55/0x80
Jun 11 20:56:19 server-relay kernel: [<ffffffff811a7ea2>] ? alloc_fd+0x92/0x160
Jun 11 20:56:19 server-relay kernel: [<ffffffff811859e9>] do_sys_open+0x69/0x140
Jun 11 20:56:19 server-relay kernel: [<ffffffff81185b00>] sys_open+0x20/0x30
Jun 11 20:56:19 server-relay kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Jun 11 20:56:19 server-relay kernel: INFO: task rsync:6831 blocked for more than 120 seconds.
Jun 11 20:56:19 server-relay kernel:      Not tainted 2.6.32-431.17.1.el6.x86_64 #1
Jun 11 20:56:19 server-relay kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 11 20:56:19 server-relay kernel: rsync         D 0000000000000002     0  6831   7094 0x00000000
Jun 11 20:56:19 server-relay kernel: ffff8801f28a3bb8 0000000000000086 ffff8801f28a3be8 ffffffffa027278b
Jun 11 20:56:19 server-relay kernel: 0000000000000001 0000000000000000 0000000000069c66 0000000000000000
Jun 11 20:56:19 server-relay kernel: ffff88038065b058 ffff8801f28a3fd8 000000000000fbc8 ffff88038065b058
Jun 11 20:56:19 server-relay kernel: Call Trace:
Jun 11 20:56:19 server-relay kernel: [<ffffffffa027278b>] ? fuse_dentry_revalidate+0x2fb/0x340 [fuse]
Jun 11 20:56:19 server-relay kernel: [<ffffffff810a6d01>] ? ktime_get_ts+0xb1/0xf0
Jun 11 20:56:19 server-relay kernel: [<ffffffff8111f590>] ? sync_page+0x0/0x50
Jun 11 20:56:19 server-relay kernel: [<ffffffff81528383>] io_schedule+0x73/0xc0
Jun 11 20:56:19 server-relay kernel: [<ffffffff8111f5cd>] sync_page+0x3d/0x50
Jun 11 20:56:19 server-relay kernel: [<ffffffff81528c1a>] __wait_on_bit_lock+0x5a/0xc0
Jun 11 20:56:19 server-relay kernel: [<ffffffff8111f567>] __lock_page+0x67/0x70
Jun 11 20:56:19 server-relay kernel: [<ffffffff8109af80>] ? wake_bit_function+0x0/0x50
Jun 11 20:56:19 server-relay kernel: [<ffffffff811358b2>] ? pagevec_lookup+0x22/0x30
Jun 11 20:56:19 server-relay kernel: [<ffffffff81137783>] truncate_inode_pages_range+0x4e3/0x500
Jun 11 20:56:19 server-relay kernel: [<ffffffff81137835>] truncate_inode_pages+0x15/0x20
Jun 11 20:56:19 server-relay kernel: [<ffffffff8113788f>] truncate_pagecache+0x4f/0x70
Jun 11 20:56:19 server-relay kernel: [<ffffffffa0278342>] fuse_change_attributes+0xd2/0x100 [fuse]
Jun 11 20:56:19 server-relay kernel: [<ffffffffa02700dd>] fuse_readdir+0x76d/0x7b0 [fuse]
Jun 11 20:56:19 server-relay kernel: [<ffffffff8119ea30>] ? filldir+0x0/0xe0
Jun 11 20:56:19 server-relay kernel: [<ffffffff8119ea30>] ? filldir+0x0/0xe0
Jun 11 20:56:19 server-relay kernel: [<ffffffff8119ecb0>] vfs_readdir+0xc0/0xe0
Jun 11 20:56:19 server-relay kernel: [<ffffffff8118e7a6>] ? sys_newlstat+0x36/0x50
Jun 11 20:56:19 server-relay kernel: [<ffffffff8119ee39>] sys_getdents+0x89/0xf0
Jun 11 20:56:19 server-relay kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

Comment 1 Pranith Kumar K 2014-06-17 04:31:01 UTC
Amit,
    Do you happen to have logs of the setup? I fixed one issue which could lead to dead locks in 3.5.0. Wondering if it is the same issue. I need logs to confirm it.

Pranith

Comment 2 santosh pradhan 2014-06-17 04:36:28 UTC
(In reply to Pranith Kumar K from comment #1)
> Amit,
>     Do you happen to have logs of the setup? I fixed one issue which could
> lead to dead locks in 3.5.0. Wondering if it is the same issue. I need logs
> to confirm it.
> 
> Pranith

Logs are attached; see the first line of the description.

Comment 3 Pranith Kumar K 2014-06-17 04:47:48 UTC
(In reply to santosh pradhan from comment #2)
> (In reply to Pranith Kumar K from comment #1)
> > Amit,
> >     Do you happen to have logs of the setup? I fixed one issue which could
> > lead to dead locks in 3.5.0. Wondering if it is the same issue. I need logs
> > to confirm it.
> > 
> > Pranith
> 
> Logs are attached, mentioned first line of description.

Thanks for pointing that out, Santosh :-).

Amit,
     I went through the logs. I see the following logs on the mount logs:
pk1@pranithk-laptop - ~/Downloads/1108850/glusterfs 
10:11:04 :( ⚡ grep ping mnt-storage.log 
[2014-05-19 17:39:55.228936] C [client-handshake.c:127:rpc_client_ping_timer_expired] 0-storage-client-3: server 10.35.63.197:49152 has not responded in the last 42 seconds, disconnecting.
[2014-05-19 17:39:55.229337] W [client-handshake.c:276:client_ping_cbk] 0-storage-client-3: timer must have expired
[2014-06-12 00:49:47.379302] C [client-handshake.c:127:rpc_client_ping_timer_expired] 0-storage-client-0: server 10.35.63.196:49152 has not responded in the last 42 seconds, disconnecting.
[2014-06-12 00:49:47.398995] W [client-handshake.c:276:client_ping_cbk] 0-storage-client-0: timer must have expired
[2014-06-12 00:50:37.387927] C [client-handshake.c:127:rpc_client_ping_timer_expired] 0-storage-client-1: server 10.35.63.195:49152 has not responded in the last 42 seconds, disconnecting.
[2014-06-12 00:50:37.407045] W [client-handshake.c:276:client_ping_cbk] 0-storage-client-1: timer must have expired


Those times are UTC timestamps. Did you notice the timeouts during that time frame? This looks a LOT like the bug I fixed. Basically the bug appears if the number of overlapping writes is > 64 at an instant.

Pasting the commit information here for reference.

commit bc434b3ca961757ade8c6093f4ff0dbe4b3a4672
Author: Pranith Kumar K <pkarampu>
Date:   Wed Apr 23 14:05:10 2014 +0530

    rpcsvc: Ignore INODELK/ENTRYLK/LK for throttling
    
    Problem:
    When iozone is in progress, number of blocking inodelks sometimes becomes greater
    than the threshold number of rpc requests allowed for that client
    (RPCSVC_DEFAULT_OUTSTANDING_RPC_LIMIT). Subsequent requests from that client
    will not be read until all the outstanding requests are processed and replied
    to. But because no more requests are read from that client, unlocks on the
    already granted locks will never come thus the number of outstanding requests
    would never come down. This leads to a ping-timeout on the client.
    
    Fix:
    Do not account INODELK/ENTRYLK/LK for throttling
    
    Change-Id: I59c6b54e7ec24ed7375ff977e817a9cb73469806
    BUG: 1089470
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: http://review.gluster.org/7531
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Krutika Dhananjay <kdhananj>
    Reviewed-by: Anand Avati <avati>
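
To illustrate the mechanism described in that commit message, here is a minimal C sketch of the throttling behaviour. This is not the actual GlusterFS rpcsvc code; the names (outstanding, is_lock_fop, OUTSTANDING_RPC_LIMIT) are hypothetical, and the limit of 64 just models RPCSVC_DEFAULT_OUTSTANDING_RPC_LIMIT.

#include <stdbool.h>

#define OUTSTANDING_RPC_LIMIT 64   /* models RPCSVC_DEFAULT_OUTSTANDING_RPC_LIMIT */

enum fop_type { FOP_WRITE, FOP_INODELK, FOP_ENTRYLK, FOP_LK, FOP_OTHER };

struct client {
    int outstanding;   /* requests received from this client, not yet replied to */
};

static bool is_lock_fop(enum fop_type t)
{
    return t == FOP_INODELK || t == FOP_ENTRYLK || t == FOP_LK;
}

/* Old behaviour: every request counts. Once enough blocking INODELKs pile up,
 * this returns false, the server stops reading from the client's socket, the
 * unlock requests that would release those locks are never read, and the
 * client eventually hits the 42-second ping timeout seen in the mount log. */
static bool may_read_more_old(const struct client *c)
{
    return c->outstanding < OUTSTANDING_RPC_LIMIT;
}

/* Fixed behaviour: lock fops are not accounted, so a pile-up of blocking
 * locks can no longer stall the connection. */
static void account_request_fixed(struct client *c, enum fop_type t)
{
    if (!is_lock_fop(t))
        c->outstanding++;
}

As the commit message says, the fix is simply to stop counting INODELK/ENTRYLK/LK requests towards that limit.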

Pranith

Comment 4 Justin Clift 2014-06-17 14:57:23 UTC
Amit, the fix for this will be in GlusterFS 3.5.1.

Pranith's patch was added about a week after 3.5.0 was released.

If you hit this issue again before 3.5.1 is released (it should be out soon), you may want to consider upgrading to 3.5.1 beta 2 in the meantime. It includes this patch as well.

Hope that helps. :)

Comment 5 Niels de Vos 2014-12-14 20:28:09 UTC
$ git log -1 45a0322066513259e61c7a4b1b1ed1a0bd3a0827
commit 45a0322066513259e61c7a4b1b1ed1a0bd3a0827
Author: Pranith Kumar K <pkarampu>
Date:   Wed Apr 23 14:05:10 2014 +0530

    rpcsvc: Ignore INODELK/ENTRYLK/LK for throttling
    
    Problem:
    When iozone is in progress, number of blocking inodelks sometimes becomes greater
    than the threshold number of rpc requests allowed for that client
    (RPCSVC_DEFAULT_OUTSTANDING_RPC_LIMIT). Subsequent requests from that client
    will not be read until all the outstanding requests are processed and replied
    to. But because no more requests are read from that client, unlocks on the
    already granted locks will never come thus the number of outstanding requests
    would never come down. This leads to a ping-timeout on the client.
    
    Fix:
    Do not account INODELK/ENTRYLK/LK for throttling
    
    BUG: 1089470
    Change-Id: I9cc2c259d2462159cea913d95f98a565acb8e0c0
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: http://review.gluster.org/7570
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Krutika Dhananjay <kdhananj>
    Reviewed-by: Niels de Vos <ndevos>


$ git tag --contains 45a0322066513259e61c7a4b1b1ed1a0bd3a0827
v3.5.1
...

*** This bug has been marked as a duplicate of bug 1089470 ***

