Bug 708069 - kernel deadlock in heavy setattr/getattr operations
Summary: kernel deadlock in heavy setattr/getattr operations
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.1
Hardware: x86_64
OS: Linux
unspecified
medium
Target Milestone: rc
: ---
Assignee: Red Hat Kernel Manager
QA Contact: Filesystem QE
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2011-05-26 16:59 UTC by colyli
Modified: 2017-12-06 10:18 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-06 10:18:10 UTC
Target Upstream Version:


Attachments (Terms of Use)

Description colyli 2011-05-26 16:59:31 UTC
Description of problem:

We have a workload only set/get extended attribute to inode as inline, all file size is zero. After running the test workload for around 1 hour, we observe kernel oops.

Version-Release number of selected component (if applicable):
RHEL6.1 GA kernel: 2.6.32-131.0.15

How reproducible:
We reproduced the error in 2.6.32-71.29.1, 2.6.32-131.0.15, both in Ext3 and Ext4 with data=ordered journal. We suspect this may be a per-bdi writeback related issue.

Steps to Reproduce:
1. checkout the test code:
svn checkout http://code.taobao.org/svn/dirbench
2. compile the test code
cd dirbench/trunk/meta_test/press/set_vs_get
make
3. format Ext4 or Ext3 file system with default options
4. mount Ext4 or Ext3 with data=ordered mode journal.
5. run dirbench/trunk/meta_test/press/set_vs_get/run.sh
6, check dmesg, around 1 hour, oops can be observed.
  
Actual results:
The following oops info is observed:
May 25 18:50:48 core097002 kernel: [ 4680.778040] INFO: task kjournald:172 blocked for more than 120 seconds.
May 25 18:51:47 core097002 snmpd[2666]: Connection from UDP: [172.23.14.66]:33299
May 25 18:57:45 core097002 kernel: [ 4680.778181] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 25 18:57:45 core097002 kernel: [ 4680.778297] kjournald     D 0000000000000000     0   172      2 0x00000000
May 25 18:57:45 core097002 kernel: [ 4680.778304]  ffff880224efdd40 0000000000000046 0000000000000000 0000000000016980
May 25 18:57:45 core097002 kernel: [ 4680.778454]  ffff880224e84738 ffff880224e84240 ffff880224e84180 ffff880226de8240
May 25 18:57:45 core097002 snmpd[2666]: Received SNMP packet(s) from UDP: [172.23.14.66]:33299
May 25 18:57:45 core097002 kernel: [ 4680.778603]  ffff880224e84738 00000001003fda2c 00000000249100e0 0000000000000000
May 25 18:57:45 core097002 kernel: [ 4680.778753] Call Trace:
May 25 18:57:45 core097002 kernel: [ 4680.778818]  [<ffffffffa000a49c>] journal_commit_transaction+0x17c/0x1070 [jbd]
May 25 18:57:45 core097002 kernel: [ 4680.778938]  [<ffffffff81059742>] ? finish_task_switch+0x42/0xe0
May 25 18:57:46 core097002 kernel: [ 4680.779037]  [<ffffffff8108d520>] ? autoremove_wake_function+0x0/0x40
May 25 18:57:46 core097002 kernel: [ 4680.779145]  [<ffffffff810788b9>] ? try_to_del_timer_sync+0x49/0xe0
May 25 18:57:46 core097002 kernel: [ 4680.779255]  [<ffffffffa000e0b2>] kjournald+0xe2/0x250 [jbd]
May 25 18:57:46 core097002 kernel: [ 4680.779353]  [<ffffffff8108d520>] ? autoremove_wake_function+0x0/0x40
May 25 18:57:47 core097002 kernel: [ 4680.779451]  [<ffffffffa000dfd0>] ? kjournald+0x0/0x250 [jbd]
May 25 18:57:49 core097002 kernel: [ 4680.779535]  [<ffffffff8108d3a6>] kthread+0x96/0xa0
May 25 18:57:49 core097002 kernel: [ 4680.779621]  [<ffffffff8101408a>] child_rip+0xa/0x20
May 25 18:57:49 core097002 kernel: [ 4680.779710]  [<ffffffff8108d310>] ? kthread+0x0/0xa0
May 25 18:57:49 core097002 kernel: [ 4680.779797]  [<ffffffff81014080>] ? child_rip+0x0/0x20
May 25 18:57:49 core097002 kernel: [ 4680.779881] INFO: task syslogd:2430 blocked for more than 120 seconds.
May 25 18:57:49 core097002 kernel: [ 4680.779980] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 25 18:57:49 core097002 kernel: [ 4680.780105] syslogd       D 0000000000000000     0  2430      1 0x00000004
May 25 18:57:49 core097002 kernel: [ 4680.780110]  ffff880221ae98c8 0000000000000082 0000000000000000 0000000000016980
May 25 18:57:50 core097002 kernel: [ 4680.780260]  ffff88022505ec38 ffff88022505e740 ffff88022505e680 ffff880226e1e340
May 25 18:57:50 core097002 kernel: [ 4680.780408]  ffff88022505ec38 00000001003fd7ce 0000000023d91340 0000000000000000
May 25 18:57:50 core097002 kernel: [ 4680.780557] Call Trace:
May 25 18:57:50 core097002 kernel: [ 4680.780600]  [<ffffffff814ae453>] io_schedule+0x73/0xc0
May 25 18:57:50 core097002 kernel: [ 4680.780685]  [<ffffffff8118f925>] sync_buffer+0x45/0x50
May 25 18:57:50 core097002 kernel: [ 4680.780769]  [<ffffffff814aeb3a>] __wait_on_bit_lock+0x5a/0xb0
May 25 18:57:50 core097002 kernel: [ 4680.780872]  [<ffffffff8118f8e0>] ? sync_buffer+0x0/0x50
May 25 18:57:50 core097002 kernel: [ 4680.780960]  [<ffffffff8118f8e0>] ? sync_buffer+0x0/0x50
May 25 18:57:50 core097002 kernel: [ 4680.781046]  [<ffffffff814aec0d>] out_of_line_wait_on_bit_lock+0x7d/0x90
May 25 18:57:51 core097002 kernel: [ 4680.781161]  [<ffffffff8108d560>] ? wake_bit_function+0x0/0x40
May 25 18:57:51 core097002 kernel: [ 4680.781252]  [<ffffffff8118fb1c>] __lock_buffer+0x2c/0x30
May 25 18:57:51 core097002 kernel: [ 4680.781345]  [<ffffffffa0009c2f>] do_get_write_access+0x3bf/0x480 [jbd]
May 25 18:57:51 core097002 kernel: [ 4680.781449]  [<ffffffff811904ce>] ? __getblk+0x2e/0x240
May 25 18:57:51 core097002 kernel: [ 4680.781535]  [<ffffffffa0009e81>] journal_get_write_access+0x31/0x50 [jbd]
May 25 18:57:51 core097002 kernel: [ 4680.781656]  [<ffffffffa0031cd4>] __ext3_journal_get_write_access+0x34/0x70 [ext3]
May 25 18:57:51 core097002 kernel: [ 4680.781784]  [<ffffffffa0023ab3>] ext3_reserve_inode_write+0x83/0xb0 [ext3]
May 25 18:57:51 core097002 kernel: [ 4680.781904]  [<ffffffffa0023b16>] ext3_mark_inode_dirty+0x36/0x60 [ext3]
May 25 18:57:51 core097002 snmpd[2666]: Connection from UDP: [172.23.14.66]:33299
May 25 18:57:51 core097002 kernel: [ 4680.782014]  [<ffffffffa0023cf3>] ext3_dirty_inode+0x93/0xa0 [ext3]
May 25 18:57:52 core097002 kernel: [ 4680.782151]  [<ffffffff811893da>] __mark_inode_dirty+0x3a/0x170
May 25 18:57:52 core097002 kernel: [ 4680.782246]  [<ffffffff8117d492>] ? mnt_clone_write+0x12/0x40
May 25 18:57:52 core097002 snmpd[2666]: Connection from UDP: [172.23.14.66]:35344
May 25 18:57:52 core097002 kernel: [ 4680.782330]  [<ffffffff8117a39f>] file_update_time+0xef/0x190
May 25 18:57:52 core097002 snmpd[2666]: Received SNMP packet(s) from UDP: [172.23.14.66]:35344
May 25 18:57:52 core097002 kernel: [ 4680.782416]  [<ffffffff81105f3b>] __generic_file_aio_write+0x23b/0x420
May 25 18:57:52 core097002 snmpd[2666]: Connection from UDP: [172.23.14.66]:35344
May 25 18:57:52 core097002 kernel: [ 4680.782515]  [<ffffffff8110617f>] generic_file_aio_write+0x5f/0xc0

Expected results:
No oops info.

Comment 2 Eric Sandeen 2011-05-26 20:00:47 UTC
Not an oops, but a hung task.  kjournald is stuck... 

Coly, can you please retest with

http://patchwork.ozlabs.org/patch/94270/
and/or
http://patchwork.ozlabs.org/patch/94254/ ?

Thanks,
-Eric

Comment 3 colyli 2011-05-27 07:07:53 UTC
Hi Eric,

We have merged these two patches in our kernel package for days, current reported issue is reported with kernel merged these 2 patches.

The current workload is not common IMHO, all I/O are set/get extended attribute caused, the application stores key-value info in ext4 inline EA eare, with journal enabled.

Thanks for the reply :-)

Coly

Comment 4 RHEL Program Management 2011-10-07 15:36:22 UTC
Since RHEL 6.2 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 8 Jan Kurik 2017-12-06 10:18:10 UTC
Red Hat Enterprise Linux 6 is in the Production 3 Phase. During the Production 3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available.

The official life cycle policy can be reviewed here:

http://redhat.com/rhel/lifecycle

This issue does not meet the inclusion criteria for the Production 3 Phase and will be marked as CLOSED/WONTFIX. If this remains a critical requirement, please contact Red Hat Customer Support to request a re-evaluation of the issue, citing a clear business justification. Note that a strong business justification will be required for re-evaluation. Red Hat Customer Support can be contacted via the Red Hat Customer Portal at the following URL:

https://access.redhat.com/


Note You need to log in before you can comment on or make changes to this bug.