Bug 573106
Summary: [Stratus 5.6 bug] task md0_resync:18061 blocked for more than 120 seconds

Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.5
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Target Milestone: rc
Target Release: 5.6
Keywords: OtherQA, Regression
Reporter: nate.dailey
Assignee: Jim Paradis <jparadis>
QA Contact: Storage QE <storage-qe>
Doc Type: Bug Fix
CC: andriusb, bdonahue, bkv.astro, bugzilla, ck, coughlan, cward, daniel.brnak, dkerr, dkovalsk, drjones, eparis, ernest.beinrohr, fukawi2+fedora, gerrit.slomma, gholms, happe, irekpias, john.haxby, jonas.hagberg, mark, mr_w, ngaywood, pasik, pasteur, pep, peterm, ralph, rhel, scott, tom, wadehamptoniv, wnefal+redhatbugzilla, wt75
Last Closed: 2011-01-13 21:16:31 UTC
Attachments: stuck tasks on my box (427705), Patch to fix softlockup (451986)
Description
nate.dailey
2010-03-12 20:42:27 UTC
Just a "me too" posting. I am experiencing this also. I have a new installation of RHEL 5.5 x86_64 and am seeing this message in my logs. At the time, vmstat reports a growing number of processes in the blocked state and I/O performance drops through the floor. I am currently scp'ing a file from a server connected on a local LAN and am seeing transfer rates start at 10MB/s, drop to 2.5MB/s and, sometimes, much much lower.

May 7 21:50:25 camunxxen02 kernel: INFO: task auditd:2248 blocked for more than 120 seconds.
May 7 21:50:25 camunxxen02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 7 21:50:25 camunxxen02 kernel: auditd D 0000005343565863 0 2248 1 2271 2247 (NOTLB)
May 7 21:50:25 camunxxen02 kernel: ffff8800d8f1bd88 0000000000000286 ffff8800d8f1bd98 ffffffff80262dd3
May 7 21:50:25 camunxxen02 kernel: 0000000000000009 ffff8800db8cc7a0 ffff8800000037e0 000000000000075b
May 7 21:50:25 camunxxen02 kernel: ffff8800db8cc988 0000000000005f30
May 7 21:51:54 camunxxen02 kernel: Call Trace:
May 7 21:51:54 camunxxen02 kernel: [<ffffffff80262dd3>] thread_return+0x6c/0x113
May 7 21:51:54 camunxxen02 kernel: [<ffffffff88036d5a>] :jbd:log_wait_commit+0xa3/0xf5
May 7 21:51:54 camunxxen02 kernel: [<ffffffff8029c521>] autoremove_wake_function+0x0/0x2e
May 7 21:51:54 camunxxen02 kernel: [<ffffffff8803178a>] :jbd:journal_stop+0x1cf/0x1ff
May 7 21:51:54 camunxxen02 kernel: [<ffffffff80231056>] __writeback_single_inode+0x1e9/0x328
May 7 21:51:54 camunxxen02 kernel: [<ffffffff802676cd>] do_page_fault+0xfae/0x12e0
May 7 21:51:54 camunxxen02 kernel: [<ffffffff802e5775>] sync_inode+0x24/0x33
May 7 21:51:54 camunxxen02 kernel: [<ffffffff8804c36d>] :ext3:ext3_sync_file+0xc9/0xdc
May 7 21:51:54 camunxxen02 kernel: [<ffffffff80251f3f>] do_fsync+0x52/0xa4
May 7 21:51:54 camunxxen02 kernel: [<ffffffff802d370c>] __do_fsync+0x23/0x36
May 7 21:51:54 camunxxen02 kernel: [<ffffffff80260106>] system_call+0x86/0x8b
May 7 21:51:54 camunxxen02 kernel: [<ffffffff80260080>] system_call+0x0/0x8b
May 7 21:51:54 camunxxen02 kernel:
May 7 21:51:54 camunxxen02 kernel: INFO: task syslogd:2285 blocked for more than 120 seconds.
May 7 21:51:54 camunxxen02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 7 21:51:54 camunxxen02 kernel: syslogd D 0000004a4f132cba 0 2285 1 2288 2271 (NOTLB)
May 7 21:51:54 camunxxen02 kernel: ffff8800d81c3d88 0000000000000286 0000000000000000 0000000000000001
May 7 21:51:54 camunxxen02 kernel: 0000000000000009 ffff8800e11e6040 ffff8800000037e0 0000000000006cfb
May 7 21:51:54 camunxxen02 kernel: ffff8800e11e6228 0000000000000000
May 7 21:51:54 camunxxen02 kernel: Call Trace:
May 7 21:51:54 camunxxen02 kernel: [<ffffffff88036d5a>] :jbd:log_wait_commit+0xa3/0xf5
May 7 21:51:54 camunxxen02 kernel: [<ffffffff8029c521>] autoremove_wake_function+0x0/0x2e
May 7 21:51:54 camunxxen02 kernel: [<ffffffff8803178a>] :jbd:journal_stop+0x1cf/0x1ff
May 7 21:51:54 camunxxen02 kernel: [<ffffffff80231056>] __writeback_single_inode+0x1e9/0x328
May 7 21:51:54 camunxxen02 kernel: [<ffffffff802d2f08>] do_readv_writev+0x26e/0x291
May 7 21:51:54 camunxxen02 kernel: [<ffffffff802e5775>] sync_inode+0x24/0x33
May 7 21:51:54 camunxxen02 kernel: [<ffffffff8804c36d>] :ext3:ext3_sync_file+0xc9/0xdc
May 7 21:51:54 camunxxen02 kernel: [<ffffffff80251f3f>] do_fsync+0x52/0xa4
May 7 21:51:54 camunxxen02 kernel: [<ffffffff802d370c>] __do_fsync+0x23/0x36
May 7 21:51:54 camunxxen02 kernel: [<ffffffff802602f9>] tracesys+0xab/0xb6
May 7 21:51:54 camunxxen02 kernel:
May 7 21:51:54 camunxxen02 kernel: INFO: task tail:3629 blocked for more than 120 seconds.
May 7 21:51:54 camunxxen02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 7 21:51:54 camunxxen02 kernel: tail D 0000005363bcd82a 0 3629 3603 (NOTLB)
May 7 21:51:54 camunxxen02 kernel: ffff8800b1bbbea8 0000000000000286 00000000ffffffd8 ffff8800b1bbbea8
May 7 21:51:54 camunxxen02 kernel: 0000000000000008 ffff8800ce58f100 ffff8800000037e0 00000000001282c1
May 7 21:51:54 camunxxen02 kernel: ffff8800ce58f2e8 ffff8800e0d44880
May 7 21:51:55 camunxxen02 kernel: Call Trace:
May 7 21:51:55 camunxxen02 kernel: [<ffffffff8021b224>] cp_new_stat+0xe5/0xfd
May 7 21:51:55 camunxxen02 kernel: [<ffffffff80263a7e>] __mutex_lock_slowpath+0x60/0x9b
May 7 21:51:55 camunxxen02 kernel: [<ffffffff80263ac8>] .text.lock.mutex+0xf/0x14
May 7 21:51:55 camunxxen02 kernel: [<ffffffff80214682>] generic_file_llseek+0x
May 7 21:51:55 camunxxen02 kernel: [<ffffffff80225c59>] sys_lseek+0x40/0x60
May 7 21:51:55 camunxxen02 kernel: [<ffffffff80260295>] tracesys+0x47/0xb6
May 7 21:51:55 camunxxen02 kernel: [<ffffffff802602f9>] tracesys+0xab/0xb6

Please add me too to this post, even though I use CentOS 5.5. I have this behaviour on both of my machines: i386 and x86_64.

Another "me too". It seems to trigger the call traces every 120 seconds (as per the error message), but there seem to be no other ill effects on the running system apart from the ugly messages and the panic that it's all broken!

[root@longhorn ~]# cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 sdb5[1] sda5[0]
      955658496 blocks [2/2] [UU]
      resync=DELAYED
md0 : active raid1 sdd1[3] sdc1[2] sdb1[1] sda1[0]
      128384 blocks [4/4] [UUUU]
      resync=DELAYED
md2 : active raid1 sdd2[1] sdc2[0]
      16779776 blocks [2/2] [UU]
md4 : active raid1 sdd5[1] sdc5[0]
      955658496 blocks [2/2] [UU]
      [=>...................]  resync =  6.2% (59335104/955658496) finish=141.9min speed=105240K/sec
md1 : active raid1 sdb2[1] sda2[0]
      16779776 blocks [2/2] [UU]

unused devices: <none>

INFO: task md0_resync:536 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
md0_resync D ffff880001012460 0 536 19 539 535 (L-TLB)
 ffff8803d1251d70 0000000000000246 ffff8803d1251cb0 0000000000000000
 0000000000000009 ffff8803d12320c0 ffff880002e4e7a0 0000000000000dbf
 ffff8803d12322a8 0000000000000000
Call Trace:
 [<ffffffff8029c309>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8040b4b8>] md_do_sync+0x1d8/0x833
 [<ffffffff80288749>] dequeue_task+0x18/0x37
 [<ffffffff80288790>] deactivate_task+0x28/0x5f
 [<ffffffff8026ef31>] monotonic_clock+0x35/0x7b
 [<ffffffff80262dd3>] thread_return+0x6c/0x113
 [<ffffffff80248d8c>] try_to_wake_up+0x392/0x3a4
 [<ffffffff8029c521>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8029c309>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8040be8c>] md_thread+0xf8/0x10e
 [<ffffffff8029c309>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8040bd94>] md_thread+0x0/0x10e
 [<ffffffff80233b0f>] kthread+0xfe/0x132
 [<ffffffff80260b2c>] child_rip+0xa/0x12
 [<ffffffff8029c309>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80233a11>] kthread+0x0/0x132
 [<ffffffff80260b22>] child_rip+0x0/0x12

INFO: task md3_resync:540 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
md3_resync D ffff880001012460 0 540 19 543 539 (L-TLB)
 ffff8803d127bd70 0000000000000246 ffff8803d127bcb0 0000000000000000
 0000000000000009 ffff8803d1223040 ffff8803d12320c0 0000000000000f45
 ffff8803d1223228 0000000000000000
Call Trace:
 [<ffffffff8029c52a>] autoremove_wake_function+0x9/0x2e
 [<ffffffff80287795>] __wake_up_common+0x3e/0x68
 [<ffffffff8029c309>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8040b4b8>] md_do_sync+0x1d8/0x833
 [<ffffffff80288749>] dequeue_task+0x18/0x37
 [<ffffffff80288790>] deactivate_task+0x28/0x5f
 [<ffffffff8026ef31>] monotonic_clock+0x35/0x7b
 [<ffffffff80262dd3>] thread_return+0x6c/0x113
 [<ffffffff80248d8c>] try_to_wake_up+0x392/0x3a4
 [<ffffffff8029c521>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8029c309>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8040be8c>] md_thread+0xf8/0x10e
 [<ffffffff8029c309>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8040bd94>] md_thread+0x0/0x10e
 [<ffffffff80233b0f>] kthread+0xfe/0x132
 [<ffffffff80260b2c>] child_rip+0xa/0x12
 [<ffffffff8029c309>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80233a11>] kthread+0x0/0x132
 [<ffffffff80260b22>] child_rip+0x0/0x12

Hi, I've just encountered the same bug. Running 2.6.18-194.3.1.el5 on a Core i7 920. It did not occur in 5.4.

Jun 5 16:49:50 xxx kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 5 16:49:50 xxx kernel: md1_resync D ffff81033e63d7e0 0 10605 139 10606 6480 (L-TLB)
Jun 5 16:49:51 xxx kernel: ffff8103015ffd70 0000000000000046 0000000000000000 ffff81033fc4a80c
Jun 5 16:49:51 xxx kernel: ffff81033faeb80c 000000000000000a ffff81033ffdd100 ffff81033e63d7e0
Jun 5 16:49:51 xxx kernel: 00003db94dbef9a3 00000000000019b7 ffff81033ffdd2e8 0000000400000000
Jun 5 16:49:51 xxx kernel: Call Trace:
Jun 5 16:49:51 xxx kernel: [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
Jun 5 16:49:51 xxx kernel: [<ffffffff8021aa1f>] md_do_sync+0x1d8/0x833
Jun 5 16:49:51 xxx kernel: [<ffffffff8008ca21>] enqueue_task+0x41/0x56
Jun 5 16:49:51 xxx kernel: [<ffffffff8008ca8c>] __activate_task+0x56/0x6d
Jun 5 16:49:51 xxx kernel: [<ffffffff8008c871>] dequeue_task+0x18/0x37
Jun 5 16:49:51 xxx kernel: [<ffffffff80062ff8>] thread_return+0x62/0xfe
Jun 5 16:49:51 xxx kernel: [<ffffffff800a0abe>] autoremove_wake_function+0x0/0x2e
Jun 5 16:49:51 xxx kernel: [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
Jun 5 16:49:51 xxx kernel: [<ffffffff8021b3f3>] md_thread+0xf8/0x10e
Jun 5 16:49:51 xxx kernel: [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
Jun 5 16:49:51 xxx kernel: [<ffffffff8021b2fb>] md_thread+0x0/0x10e
Jun 5 16:49:51 xxx kernel: [<ffffffff80032894>] kthread+0xfe/0x132
Jun 5 16:49:51 xxx kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Jun 5 16:49:51 xxx kernel: [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
Jun 5 16:49:51 xxx kernel: [<ffffffff80032796>] kthread+0x0/0x132
Jun 5 16:49:51 xxx kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11

Same for me:

Jun 7 08:32:43 kernel: md0_resync D ffff810001015120 0 11682 75 11683 4791 (L-TLB)
Jun 7 08:32:43 kernel: ffff810062d59d70 0000000000000046 0000000000000000 ffff8101bfdd5a0c
Jun 7 08:32:43 kernel: ffff8101bea0bc0c 0000000000000009 ffff81006a165040 ffff810106323080
Jun 7 08:32:43 kernel: 000003ee09ede32a 0000000000001ad9 ffff81006a165228 0000000200000000
Jun 7 08:32:43 kernel: Call Trace:
Jun 7 08:32:43 kernel: [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
Jun 7 08:32:43 kernel: [<ffffffff8021aa1f>] md_do_sync+0x1d8/0x833
Jun 7 08:32:43 kernel: [<ffffffff8008ca21>] enqueue_task+0x41/0x56
Jun 7 08:32:43 kernel: [<ffffffff8008ca8c>] __activate_task+0x56/0x6d
Jun 7 08:32:43 kernel: [<ffffffff8008c871>] dequeue_task+0x18/0x37
Jun 7 08:32:43 kernel: [<ffffffff80062ff8>] thread_return+0x62/0xfe
Jun 7 08:32:43 kernel: [<ffffffff800a0abe>] autoremove_wake_function+0x0/0x2e
Jun 7 08:32:43 kernel: [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
Jun 7 08:32:43 kernel: [<ffffffff8021b3f3>] md_thread+0xf8/0x10e
Jun 7 08:32:43 kernel: [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
Jun 7 08:32:43 kernel: [<ffffffff8021b2fb>] md_thread+0x0/0x10e
Jun 7 08:32:43 kernel: [<ffffffff80032894>] kthread+0xfe/0x132
Jun 7 08:32:43 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Jun 7 08:32:43 kernel: [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
Jun 7 08:32:43 kernel: [<ffffffff80032796>] kthread+0x0/0x132
Jun 7 08:32:43 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Jun 7 08:32:43 kernel:
Jun 7 08:32:43 kernel: INFO: task md1_resync:11683 blocked for more than 120 seconds.
Jun 7 08:32:43 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 7 08:32:43 kernel: md1_resync D ffff810060c65e30 0 11683 75 11682 (L-TLB)
Jun 7 08:32:43 kernel: ffff810060c65d70 0000000000000046 0000000000000000 ffff8101bfdd5e0c
Jun 7 08:32:43 kernel: ffff8101bfdd5a0c 0000000000000008 ffff810063fab860 ffff81006a165040
Jun 7 08:32:43 kernel: 000003ee09edc851 00000000000050a0 ffff810063faba48 000000028008b4b1
Jun 7 08:32:43 kernel: Call Trace:
Jun 7 08:32:43 kernel: [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
Jun 7 08:32:43 kernel: [<ffffffff8021aa1f>] md_do_sync+0x1d8/0x833
Jun 7 08:32:43 kernel: [<ffffffff8008ca21>] enqueue_task+0x41/0x56
Jun 7 08:32:44 kernel: [<ffffffff8008ca8c>] __activate_task+0x56/0x6d
Jun 7 08:32:44 kernel: [<ffffffff8008c871>] dequeue_task+0x18/0x37
Jun 7 08:32:44 kernel: [<ffffffff80062ff8>] thread_return+0x62/0xfe
Jun 7 08:32:44 kernel: [<ffffffff800a0abe>] autoremove_wake_function+0x0/0x2e
Jun 7 08:32:44 kernel: [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
Jun 7 08:32:44 kernel: [<ffffffff8021b3f3>] md_thread+0xf8/0x10e
Jun 7 08:32:44 kernel: [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
Jun 7 08:32:44 kernel: [<ffffffff8021b2fb>] md_thread+0x0/0x10e
Jun 7 08:32:44 kernel: [<ffffffff80032894>] kthread+0xfe/0x132
Jun 7 08:32:44 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Jun 7 08:32:44 kernel: [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
Jun 7 08:32:44 kernel: [<ffffffff80032796>] kthread+0x0/0x132
Jun 7 08:32:44 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Jun 7 08:32:44 kernel:

Some system information: 6GB RAM; hardware: FSC CELSIUS M460.

It does not look like it only affects resyncs (yeah, that is a "me too", although on CentOS). There are no other messages besides that. The machine is an application server which had a bit to do at that time.

Jun 29 14:56:59 dryckjom kernel: INFO: task nfsd:3308 blocked for more than 120 seconds.
Jun 29 14:56:59 dryckjom kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 29 14:57:00 dryckjom kernel: nfsd D ffff810104636de0 0 3308 1 3309 3307 (L-TLB)
Jun 29 14:57:00 dryckjom kernel: ffff8100af991af0 0000000000000046 0000000000000000 ffff8100af991c60
Jun 29 14:57:00 dryckjom kernel: 0000000000000000 000000000000000a ffff8100afeb1860 ffff8100afe307a0
Jun 29 14:57:00 dryckjom kernel: 00032ecfc1662b57 000000000000ae4e ffff8100afeb1a48 00000002af88bc00
Jun 29 14:57:00 dryckjom kernel: Call Trace:
Jun 29 14:57:00 dryckjom kernel: [<ffffffff8003055b>] sock_recvmsg+0x107/0x15f
Jun 29 14:57:00 dryckjom kernel: [<ffffffff800ec2a2>] inode_wait+0x0/0xd
Jun 29 14:57:00 dryckjom kernel: [<ffffffff800ec2ab>] inode_wait+0x9/0xd
Jun 29 14:57:00 dryckjom kernel: [<ffffffff80063a16>] __wait_on_bit+0x40/0x6e
Jun 29 14:57:00 dryckjom kernel: [<ffffffff800ec2a2>] inode_wait+0x0/0xd
Jun 29 14:57:00 dryckjom kernel: [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
Jun 29 14:57:00 dryckjom kernel: [<ffffffff800a0aec>] wake_bit_function+0x0/0x23
Jun 29 14:57:00 dryckjom kernel: [<ffffffff8000c5fe>] _atomic_dec_and_lock+0x39/0x57
Jun 29 14:57:00 dryckjom kernel: [<ffffffff8003dbbf>] ifind_fast+0x6e/0x83
Jun 29 14:57:00 dryckjom kernel: [<ffffffff80023290>] iget_locked+0x59/0x149
Jun 29 14:57:00 dryckjom kernel: [<ffffffff880565bf>] :ext3:ext3_get_dentry+0x3d/0xbb
Jun 29 14:57:00 dryckjom kernel: [<ffffffff884ea36b>] :exportfs:find_exported_dentry+0x43/0x480
Jun 29 14:57:00 dryckjom kernel: [<ffffffff884f7753>] :nfsd:nfsd_acceptable+0x0/0xd8
Jun 29 14:57:00 dryckjom kernel: [<ffffffff884fb74f>] :nfsd:exp_get_by_name+0x5b/0x71
Jun 29 14:57:00 dryckjom kernel: [<ffffffff884fbd3e>] :nfsd:exp_find_key+0x89/0x9c
Jun 29 14:57:00 dryckjom kernel: [<ffffffff88384ba2>] :sunrpc:svc_tcp_recvfrom+0x719/0x79e
Jun 29 14:57:00 dryckjom kernel: [<ffffffff8009b1dd>] set_current_groups+0x159/0x164
Jun 29 14:57:00 dryckjom kernel: [<ffffffff884ea7f3>] :exportfs:export_decode_fh+0x4b/0x50
Jun 29 14:57:00 dryckjom kernel: [<ffffffff884f7ac5>] :nfsd:fh_verify+0x29a/0x4bd
Jun 29 14:57:00 dryckjom kernel: [<ffffffff884ff6cc>] :nfsd:nfsd3_proc_getattr+0x87/0xbb
Jun 29 14:57:00 dryckjom kernel: [<ffffffff884f51db>] :nfsd:nfsd_dispatch+0xd8/0x1d6
Jun 29 14:57:00 dryckjom kernel: [<ffffffff88382651>] :sunrpc:svc_process+0x454/0x71b
Jun 29 14:57:00 dryckjom kernel: [<ffffffff80064644>] __down_read+0x12/0x92
Jun 29 14:57:00 dryckjom kernel: [<ffffffff884f55a1>] :nfsd:nfsd+0x0/0x2cb
Jun 29 14:57:00 dryckjom kernel: [<ffffffff884f5746>] :nfsd:nfsd+0x1a5/0x2cb
Jun 29 14:57:00 dryckjom kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Jun 29 14:57:00 dryckjom kernel: [<ffffffff884f55a1>] :nfsd:nfsd+0x0/0x2cb
Jun 29 14:57:00 dryckjom kernel: [<ffffffff884f55a1>] :nfsd:nfsd+0x0/0x2cb
Jun 29 14:57:00 dryckjom kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Jun 29 14:57:00 dryckjom kernel:
Jun 29 14:57:00 dryckjom kernel: INFO: task vsftpd:3586 blocked for more than 120 seconds.
Jun 29 14:57:00 dryckjom kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 29 14:57:01 dryckjom kernel: vsftpd D ffff81013ea9f4c0 0 3586 3350 3587 3568 (NOTLB)
Jun 29 14:57:01 dryckjom kernel: ffff8100ae693c58 0000000000000086 0000000300000000 ffff8100ae693c38
Jun 29 14:57:01 dryckjom kernel: 0000000000000008 0000000000000007 ffff8100728bc7e0 ffff81013e8cf7a0
Jun 29 14:57:01 dryckjom kernel: 00032ec7d117be71 0000000000017718 ffff8100728bc9c8 00000000ae693ce8
Jun 29 14:57:01 dryckjom kernel: Call Trace:
Jun 29 14:57:01 dryckjom kernel: [<ffffffff80129e9b>] avc_has_perm+0x46/0x58
Jun 29 14:57:01 dryckjom kernel: [<ffffffff80063c6f>] __mutex_lock_slowpath+0x60/0x9b
Jun 29 14:57:01 dryckjom kernel: [<ffffffff80063cb9>] .text.lock.mutex+0xf/0x14
Jun 29 14:57:01 dryckjom kernel: [<ffffffff8000cec2>] do_lookup+0x90/0x1e6
Jun 29 14:57:01 dryckjom kernel: [<ffffffff8000a20d>] __link_path_walk+0xa01/0xf42
Jun 29 14:57:01 dryckjom kernel: [<ffffffff8000e9e2>] link_path_walk+0x42/0xb2
Jun 29 14:57:01 dryckjom kernel: [<ffffffff8000ccb2>] do_path_lookup+0x275/0x2f1
Jun 29 14:57:01 dryckjom kernel: [<ffffffff8012ac27>] selinux_file_alloc_security+0x2a/0x54
Jun 29 14:57:01 dryckjom kernel: [<ffffffff8002368a>] __path_lookup_intent_open+0x56/0x97
Jun 29 14:57:01 dryckjom kernel: [<ffffffff8001af23>] open_namei+0x73/0x6d5
Jun 29 14:57:01 dryckjom kernel: [<ffffffff800274a4>] do_filp_open+0x1c/0x38
Jun 29 14:57:01 dryckjom kernel: [<ffffffff80019dd1>] do_sys_open+0x44/0xbe
Jun 29 14:57:01 dryckjom kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Jun 29 14:57:01 dryckjom kernel:
Jun 29 14:57:01 dryckjom kernel: INFO: task nrpe:3597 blocked for more than 120 seconds.
Jun 29 14:57:01 dryckjom kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 29 14:57:01 dryckjom kernel: nrpe D ffff81013ea9f4c0 0 3597 3596 (NOTLB)
Jun 29 14:57:01 dryckjom kernel: ffff81013320fc58 0000000000000082 ffff810134bffe38 ffff81013320fc38
Jun 29 14:57:01 dryckjom kernel: 0000000000000000 0000000000000005 ffff81008c61a040 ffff81013e8cf7a0
Jun 29 14:57:01 dryckjom kernel: 00032ec854c2f5c2 00000000000056a0 ffff81008c61a228 000000004fc26840
Jun 29 14:57:01 dryckjom kernel: Call Trace:
Jun 29 14:57:01 dryckjom kernel: [<ffffffff80129e9b>] avc_has_perm+0x46/0x58
Jun 29 14:57:01 dryckjom kernel: [<ffffffff80063c6f>] __mutex_lock_slowpath+0x60/0x9b
Jun 29 14:57:01 dryckjom kernel: [<ffffffff80063cb9>] .text.lock.mutex+0xf/0x14
Jun 29 14:57:01 dryckjom kernel: [<ffffffff8000cec2>] do_lookup+0x90/0x1e6
Jun 29 14:57:01 dryckjom kernel: [<ffffffff8000a20d>] __link_path_walk+0xa01/0xf42
Jun 29 14:57:01 dryckjom kernel: [<ffffffff8000e9e2>] link_path_walk+0x42/0xb2
Jun 29 14:57:01 dryckjom kernel: [<ffffffff8000ccb2>] do_path_lookup+0x275/0x2f1
Jun 29 14:57:01 dryckjom kernel: [<ffffffff8012ac27>] selinux_file_alloc_security+0x2a/0x54
Jun 29 14:57:01 dryckjom kernel: [<ffffffff8002368a>] __path_lookup_intent_open+0x56/0x97
Jun 29 14:57:02 dryckjom kernel: [<ffffffff8001af23>] open_namei+0x73/0x6d5
Jun 29 14:57:02 dryckjom kernel: [<ffffffff800274a4>] do_filp_open+0x1c/0x38
Jun 29 14:57:02 dryckjom kernel: [<ffffffff8005c382>] cache_alloc_refill+0x106/0x186
Jun 29 14:57:43 dryckjom kernel: [<ffffffff80019dd1>] do_sys_open+0x44/0xbe
Jun 29 14:57:43 dryckjom kernel:
Jun 29 14:57:43 dryckjom kernel: INFO: task vsftpd:3587 blocked for more than 120 seconds.
Jun 29 14:57:43 dryckjom kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 29 14:57:43 dryckjom kernel: vsftpd D ffff81013ea9f4c0 0 3587 3350 3600 3586 (NOTLB)
Jun 29 14:57:43 dryckjom kernel: ffff8100add49c58 0000000000000082 ffff810101dff690 ffff8100add49c38
Jun 29 14:57:43 dryckjom kernel: 0000000000000008 0000000000000007 ffff81011d8ef820 ffff81011cf47080
Jun 29 14:57:43 dryckjom kernel: 00032ec7d49fcddf 0000000005dcd944 ffff81011d8efa08 00000007add49ce8
Jun 29 14:57:43 dryckjom kernel: Call Trace:
Jun 29 14:57:43 dryckjom kernel: [<ffffffff80129e9b>] avc_has_perm+0x46/0x58
Jun 29 14:57:43 dryckjom kernel: [<ffffffff80063c6f>] __mutex_lock_slowpath+0x60/0x9b
Jun 29 14:57:43 dryckjom kernel: [<ffffffff80063cb9>] .text.lock.mutex+0xf/0x14
Jun 29 14:57:44 dryckjom kernel: [<ffffffff8000cec2>] do_lookup+0x90/0x1e6
Jun 29 14:57:44 dryckjom kernel: [<ffffffff8000a20d>] __link_path_walk+0xa01/0xf42
Jun 29 14:57:44 dryckjom kernel: [<ffffffff8000e9e2>] link_path_walk+0x42/0xb2
Jun 29 14:57:44 dryckjom kernel: [<ffffffff8000ccb2>] do_path_lookup+0x275/0x2f1
Jun 29 14:57:44 dryckjom kernel: [<ffffffff8012ac27>] selinux_file_alloc_security+0x2a/0x54
Jun 29 14:57:44 dryckjom kernel: [<ffffffff8002368a>] __path_lookup_intent_open+0x56/0x97
Jun 29 14:57:44 dryckjom kernel: [<ffffffff8001af23>] open_namei+0x73/0x6d5
Jun 29 14:57:44 dryckjom kernel: [<ffffffff800274a4>] do_filp_open+0x1c/0x38
Jun 29 14:57:44 dryckjom kernel: [<ffffffff8005c382>] cache_alloc_refill+0x106/0x186
Jun 29 14:57:44 dryckjom kernel: [<ffffffff80019dd1>] do_sys_open+0x44/0xbe
Jun 29 14:57:44 dryckjom kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Jun 29 14:57:44 dryckjom kernel:
Jun 29 14:57:44 dryckjom kernel: INFO: task ps:3599 blocked for more than 120 seconds.
Jun 29 14:57:44 dryckjom kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 29 14:57:44 dryckjom kernel: ps D ffff81013ea9f4c0 0 3599 3660 (NOTLB)
Jun 29 14:57:44 dryckjom kernel: ffff8100ae675ce8 0000000000000086 ffff810000019c10 ffff8100ae675cc8
Jun 29 14:57:44 dryckjom kernel: 0000000000000000 0000000000000007 ffff810088c5c100 ffff8100afeb1100
Jun 29 14:57:44 dryckjom kernel: 00032ec835d40503 0000000000028e7b ffff810088c5c2e8 00000004ffffffff
Jun 29 14:57:44 dryckjom kernel: Call Trace:
Jun 29 14:57:44 dryckjom kernel: [<ffffffff80129e9b>] avc_has_perm+0x46/0x58
Jun 29 14:57:44 dryckjom kernel: [<ffffffff80063c6f>] __mutex_lock_slowpath+0x60/0x9b
Jun 29 14:57:44 dryckjom kernel: [<ffffffff80063cb9>] .text.lock.mutex+0xf/0x14
Jun 29 14:57:44 dryckjom kernel: [<ffffffff8000cec2>] do_lookup+0x90/0x1e6
Jun 29 14:57:44 dryckjom kernel: [<ffffffff8000a20d>] __link_path_walk+0xa01/0xf42
Jun 29 14:57:44 dryckjom kernel: [<ffffffff8000e9e2>] link_path_walk+0x42/0xb2
Jun 29 14:57:44 dryckjom kernel: [<ffffffff8000ccb2>] do_path_lookup+0x275/0x2f1
Jun 29 14:57:44 dryckjom kernel: [<ffffffff8001280e>] getname+0x15b/0x1c2
Jun 29 14:57:44 dryckjom kernel: [<ffffffff80023876>] __user_walk_fd+0x37/0x4c
Jun 29 14:57:44 dryckjom kernel: [<ffffffff80032583>] sys_faccessat+0xe4/0x18d
Jun 29 14:57:44 dryckjom kernel: [<ffffffff800b76a6>] audit_syscall_entry+0x180/0x1b3
Jun 29 14:57:44 dryckjom kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0

Created attachment 427705 [details]
stuck tasks on my box
Sorry, that looked rather bad in the bug report, so here is the message log excerpt showing the hung processes as an attachment.
Another "me too", however these tasks never seem to unblock, resulting in hung nfs client connections. Jul 8 10:44:06 host56 kernel: INFO: task nfsd:29930 blocked for more than 120 seconds. Jul 8 10:44:06 host56 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jul 8 10:44:06 host56 kernel: nfsd D 00011AF0 2100 29930 1 29931 29929 (L-TLB) Jul 8 10:44:06 host56 kernel: cf35fdcc 00000046 38e9f310 00011af0 00000040 d7ef7800 f8903460 0000000a Jul 8 10:44:06 host56 kernel: e5f7a000 38ea07dc 00011af0 000014cc 00000000 e5f7a10c c1806f00 f7880200 Jul 8 10:44:06 host56 kernel: e5f7a000 00000000 00000000 00000000 e5f7a000 c0436007 c18073bc d3e47aa0 Jul 8 10:44:06 host56 kernel: Call Trace: Jul 8 10:44:06 host56 kernel: [<c0436007>] autoremove_wake_function+0x0/0x2d Jul 8 10:44:06 host56 kernel: [<c041ee80>] enqueue_task+0x29/0x39 Jul 8 10:44:06 host56 kernel: [<c061cf25>] __mutex_lock_slowpath+0x4d/0x7c Jul 8 10:44:06 host56 kernel: [<c061cf63>] .text.lock.mutex+0xf/0x14 Jul 8 10:44:06 host56 kernel: [<f8d9819e>] expkey_match+0x49/0x58 [nfsd] Jul 8 10:44:06 host56 kernel: [<c045a989>] generic_file_writev+0x28/0x96 Jul 8 10:44:06 host56 kernel: [<c045a961>] generic_file_writev+0x0/0x96 Jul 8 10:44:06 host56 kernel: [<c0475e3d>] do_readv_writev+0x149/0x247 Jul 8 10:44:06 host56 kernel: [<c0475720>] do_sync_write+0x0/0xf1 Jul 8 10:44:06 host56 kernel: [<c0430dd7>] set_current_groups+0x15a/0x166 Jul 8 10:44:06 host56 kernel: [<c0475f72>] vfs_writev+0x37/0x43 Jul 8 10:44:06 host56 kernel: [<f8d950a9>] nfsd_vfs_write+0xca/0x28a [nfsd] Jul 8 10:44:06 host56 kernel: [<c047449a>] __dentry_open+0xea/0x1ab Jul 8 10:44:06 host56 kernel: [<c04745a0>] dentry_open+0x45/0x4b Jul 8 10:44:06 host56 kernel: [<f8d958cb>] nfsd_write+0x96/0xab [nfsd] Jul 8 10:44:06 host56 kernel: [<f8d9b5a9>] nfsd3_proc_write+0xd1/0xeb [nfsd] Jul 8 10:44:06 host56 kernel: [<f8d921a4>] nfsd_dispatch+0xbb/0x1a9 [nfsd] Jul 8 10:44:06 host56 kernel: [<f8d11689>] svc_process+0x3c8/0x633 [sunrpc] Jul 8 10:44:06 host56 kernel: [<f8d9268c>] nfsd+0x17e/0x286 [nfsd] Jul 8 10:44:06 host56 kernel: [<f8d9250e>] nfsd+0x0/0x286 [nfsd] Jul 8 10:44:06 host56 kernel: [<c0405c53>] kernel_thread_helper+0x7/0x10 Another "me too" Our cluster with 348 nodes we have random nodes and random task that is hung. Running SL Kernel 2.6.18-194.3.1.el5 INFO: task syslogd:2959 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. syslogd D ffff81000101d7a0 0 2959 1 2962 2936 (NOTLB) ffff81063f87fd88 0000000000000086 0000000400000296 0000000000000001 000000063f87fd18 000000000000000a ffff81063edb57e0 ffff81034aac1860 00079032058acd17 0000000000009b14 ffff81063edb59c8 0000000600000096 Call Trace: [<ffffffff88036d8a>] :jbd:log_wait_commit+0xa3/0xf5 [<ffffffff800a0abe>] autoremove_wake_function+0x0/0x2e [<ffffffff8803178a>] :jbd:journal_stop+0x1cf/0x1ff [<ffffffff8002fbf8>] __writeback_single_inode+0x1e9/0x328 [<ffffffff800e037b>] do_readv_writev+0x26e/0x291 [<ffffffff800f2fb7>] sync_inode+0x24/0x33 [<ffffffff8804c36d>] :ext3:ext3_sync_file+0xc9/0xdc [<ffffffff800503f3>] do_fsync+0x52/0xa4 [<ffffffff800e0c00>] __do_fsync+0x23/0x36 [<ffffffff8005d28d>] tracesys+0xd5/0xe0 See also bug 573106 Sorry, I mean see bug 590403 where I have proposed a fix. (In reply to comment #12) > Sorry, I mean see bug 590403 where I have proposed a fix. I see those timeouts without using md raid on the machines, so it has to be on a different layer. 
(In reply to comment #13)
> (In reply to comment #12)
> > Sorry, I mean see bug 590403 where I have proposed a fix.
>
> I see those timeouts without using md raid on the machines, so it has to be
> on a different layer.

Oh, sorry. I was looking at the original report, which clearly shows that the blocked task is blocked in md_do_sync. This is also the error in the subject, comment #4 and comment #6. The other comments show something blocked in a completely different place -- I haven't checked to see if they're related to each other. At first glance, though, the only thing that comment #2, comment #7, comment #9 and comment #10 have in common with each other and the original report is that something is triggering the hung task timeout (that is, nothing in common at all).

I'm getting the impression that we are drifting from the subject. Let's focus on understanding what the problem is:

a.) the md task getting blocked
or
b.) the message showing up

As far as I can see, the message pops up on other occasions too, so my best guess is that the message itself was there even before, just not shown by default. Please correct me if I'm wrong here. The message also tells you what to do in order to get rid of it. Personally, I don't see anything wrong with either of the two: the md task somehow needs to get blocked (whether it is correct to let it wait in D state, or whether it should really be sleeping, is another question), and the message is also great to have in case something else happens and a task gets blocked (I've got a similar problem in connection with Xen, but it is a different bug - a real one). The only thing I don't like about mdX getting blocked in D state is that it automatically bumps the CPU load by 1 (I guess that if there were more md devices waiting on the same 'queue', the load - at least numerically - would be higher).

(In reply to comment #15)
> I'm getting the impression that we are drifting from the subject. Let's focus
> on understanding what the problem is:
>
> a.) the md task getting blocked
> or
> b.) the message showing up
>
> As far as I can see, the message pops up on other occasions too, so my best
> guess is that the message itself was there even before, just not shown by
> default. Please correct me if I'm wrong here. The message also tells you what
> to do in order to get rid of it.

I have no problem with the message popping up; I have a problem with tasks being blocked for such a long time without any apparent reason, no matter whether it is an md task or another task which desperately needs I/O.

(In reply to comment #15)
> I'm getting the impression that we are drifting from the subject. Let's focus
> on understanding what the problem is:
>
> a.) the md task getting blocked
> or
> b.) the message showing up

The subject of this bug is the mdX_resync task being blocked. Other reports of the hung task timeout being triggered are not related to it. The hung task timeout is not a bug; it's a mechanism whereby the kernel can warn you about kernel tasks that have been blocked for a long time. You can disable it if you like.

See bug 590403 for a fairly detailed discussion of the mdX_resync hung task timeout issue. The executive summary, though, is that this is a false positive and that it's triggered by /etc/cron.weekly/99-raid-check running all the resync tasks at the same time.
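For anyone who just wants the warning quieter while the underlying cause is sorted out, here is a minimal sketch using the knob the message itself names (runtime-only; the setting is lost on reboot unless made permanent via /etc/sysctl.conf):

# Disable the hung-task warning entirely
echo 0 > /proc/sys/kernel/hung_task_timeout_secs

# Or raise the threshold instead, e.g. to 10 minutes
echo 600 > /proc/sys/kernel/hung_task_timeout_secs

Note that this only silences the report; it does not change how long the resync threads actually wait.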
(In reply to comment #16):
> I have no problem with the message popping up; I have a problem with tasks
> being blocked for such a long time without any apparent reason, no matter
> whether it is an md task or another task which desperately needs I/O.

If you are seeing a hung task timeout warning and you think it's a bug, then you need to submit a bug specific to that particular hung task: any kernel task can get hung, but, like processes dying of SIGSEGV, that doesn't make it the same problem. It doesn't even make it the same problem if it appears to be I/O related.

(In reply to comment #15)
> case something else happens and a task gets blocked (I've got a similar
> problem in connection with Xen, but it is a different bug - a real one).

I noted this comment about Xen. Do you have any bugs logged on this problem? Is it something like Bug 550724?

My comment #c9 relates to NFS, which hangs indefinitely and emits this message. This was new behavior somewhere between RHEL 5.4 and 5.5. I can submit a new ticket if it seems appropriate. This just looked to be related.

(In reply to comment #18)
> I noted this comment about Xen. Do you have any bugs logged on this problem?
> Is it something like Bug 550724?

Actually it was in dom0, but I cannot reproduce it at the moment. However, I do think that it was officially fixed in some recent kernel release.

(In reply to comment #19)
> My comment #c9 relates to NFS, which hangs indefinitely and emits this
> message. This was new behavior somewhere between RHEL 5.4 and 5.5. I can
> submit a new ticket if it seems appropriate. This just looked to be related.

It's not related to the md_sync problem at all. I think you should submit a new bug or log a ticket.

(In reply to comment #21)
> It's not related to the md_sync problem at all. I think you should submit a
> new bug or log a ticket.

Done. https://bugzilla.redhat.com/show_bug.cgi?id=616833

I have several soft RAID partitions and see this following installation (blocked md_do_sync). The O/S is CentOS 5.5 on kernel 2.6.18-194.17.1.el5 (SMP). On CentOS 5.4 and earlier, the same install and RAID configuration did not cause these messages to appear.

This bug has many backtraces posted, but several look unrelated (missing md_do_sync). As John says, those issues need new bugs. Comment 2 and comment 10 look like bug 605207.

Confirming on 5.5, kernel 2.6.18-194.8.1.el5 smp i386; there was no issue on 5.4.

md: delaying resync of md3 until md0 has finished resync (they share one or more physical units)
md: md0: sync done.
md: delaying resync of md3 until md4 has finished resync (they share one or more physical units)
md: delaying resync of md4 until md5 has finished resync (they share one or more physical units)
md: syncing RAID array md5
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
md: using 128k window, over a total of 273546688 blocks.
md: delaying resync of md1 until md3 has finished resync (they share one or more physical units)
md: delaying resync of md3 until md4 has finished resync (they share one or more physical units)
RAID1 conf printout:
 --- wd:2 rd:2
 disk 0, wo:0, o:1, dev:sda1
 disk 1, wo:0, o:1, dev:sdb1
INFO: task md3_resync:5649 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
md3_resync D 0000CA3E 3312 5649 11 5650 1703 (L-TLB)
 d9b47ec0 00000046 45ae56fc 0000ca3e 00000064 00000000 00000000 0000000a
 d6a9f000 45ae610f 0000ca3e 00000a13 00000001 d6a9f10c c2013544 f06dc740
 00000003 c06b5b98 c2296ecc c229a9c8 d9b47f80 c229a000 c2013588 c0667780
Call Trace:
 [<c05aba8b>] md_do_sync+0x1fe/0x966
 [<c041eef4>] enqueue_task+0x29/0x39
 [<c041ef4e>] __activate_task+0x4a/0x59
 [<c041f811>] try_to_wake_up+0x3e8/0x3f2
 [<c061caf0>] schedule+0x9cc/0xa55
 [<c0436067>] autoremove_wake_function+0x0/0x2d
 [<c05ac4dd>] md_thread+0xdf/0xf5
 [<c041ebb9>] complete+0x2b/0x3d
 [<c05ac3fe>] md_thread+0x0/0xf5
 [<c0435fa3>] kthread+0xc0/0xed
 [<c0435ee3>] kthread+0x0/0xed
 [<c0405c53>] kernel_thread_helper+0x7/0x10
=======================
INFO: task md4_resync:5650 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
md4_resync D 0000CA3E 3492 5650 11 5651 5649 (L-TLB)
 f3969ec0 00000046 45a714b6 0000ca3e 00000064 00000096 d9b47f80 0000000a
 c2118550 45a7244c 0000ca3e 00000f96 00000001 c211865c c2013544 f06dc740
 00000003 c06b5b98 c2280c8c c229adc8 f3969f80 c229a800 c0425f00 c0667780
Call Trace:
 [<c0425f00>] printk+0x18/0x8e
 [<c05aba8b>] md_do_sync+0x1fe/0x966
 [<c041eef4>] enqueue_task+0x29/0x39
 [<c041ef4e>] __activate_task+0x4a/0x59
 [<c041f811>] try_to_wake_up+0x3e8/0x3f2
 [<c061caf0>] schedule+0x9cc/0xa55
 [<c0436067>] autoremove_wake_function+0x0/0x2d
 [<c05ac4dd>] md_thread+0xdf/0xf5
 [<c041ebb9>] complete+0x2b/0x3d
 [<c05ac3fe>] md_thread+0x0/0xf5
 [<c0435fa3>] kthread+0xc0/0xed
 [<c0435ee3>] kthread+0x0/0xed
 [<c0405c53>] kernel_thread_helper+0x7/0x10
=======================

Just to be clear, and just in case anyone reading the previous comment is misled: this is not a kernel bug. The error message is caused by a weekly cron job (99-raid-check) that does this (I've copied this from my comment in bug 590404):

----------------
devnum=0
for dev in $dev_list; do
    echo "${check[$devnum]}" > /sys/block/$dev/md/sync_action
    let devnum++
done

[ -z "$check_list" ] && exit 0

checking=1
while [ $checking -ne 0 ]
do
    sleep 60
    checking=0
    for dev in $check_list; do
        sync_action=`cat /sys/block/$dev/md/sync_action`
        if [ "$sync_action" != "idle" ]; then
            checking=1
        fi
    done
done
---------------

There are two loops here: the first one kicks off a check for all the MD devices; the second one waits until they've all finished. The md_do_sync() code in the kernel knows it's a really bad idea to check two md arrays that share underlying storage, so it detects that and will wait until an outstanding check has completed. If this takes more than two minutes you'll see that hugely worrying "blocked for more than 120 seconds" message. As someone somewhere said, this isn't actually a bug; it's the kernel doing what it's supposed to be doing, and it's serialising the md_do_sync() threads rather nicely. Unfortunately, large numbers of people see the hugely worrying message and they, well, worry.
As this is a once-a-week cron job and it doesn't matter whether it completes in the shortest possible time, I suggested running one check at a time like this:

---------------
devnum=0
for dev in $dev_list; do
    echo "${check[$devnum]}" > /sys/block/$dev/md/sync_action
    while :; do
        sleep 60
        sync_action=`cat /sys/block/$dev/md/sync_action`
        if [ "$sync_action" == "idle" ]; then
            break
        fi
    done
    let devnum++
done

[ -z "$check_list" ] && exit 0
---------------

As I said in bug 590403, I don't have anything to test this on at the moment, so it would be nice if someone who does have the problem could update /etc/cron.weekly/99-raid-check to see if this does do the trick. If you would prefer a real diff (for patch) then let me know and I'll attach one.
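For what it's worth, a hedged sketch of the same serialisation using mdadm instead of the sleep loop, assuming the installed mdadm supports -W/--wait (untested here, just like the script above):

---------------
devnum=0
for dev in $dev_list; do
    echo "${check[$devnum]}" > /sys/block/$dev/md/sync_action
    # --wait blocks until any check/resync/recovery on the device finishes
    mdadm --wait /dev/$dev
    let devnum++
done

[ -z "$check_list" ] && exit 0
---------------

Either way the effect is the same: only one array is checking at any moment, so md_do_sync() never has to park a second resync thread for minutes at a time.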
This is not cron specific - it shows up during the sync right after OS install:

cat /var/log/cron
Oct 6 20:46:37 vz1 crond[3083]: (CRON) STARTUP (V5.0)
Oct 6 21:01:01 vz1 crond[3252]: (root) CMD (run-parts /etc/cron.hourly)
Oct 6 21:16:43 vz1 crontab[3260]: (root) LIST (root)
EOF

Linux vz1.clusterspan.net 2.6.18-194.11.1.el5 #1 SMP Tue Aug 10 19:05:06 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

cat /proc/mdstat
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb1[1] sda1[0]
      25171776 blocks [2/2] [UU]
      resync=DELAYED
md1 : active raid1 sdb2[1] sda2[0]
      264960 blocks [2/2] [UU]
      resync=DELAYED
md3 : active raid1 sdb4[1] sda4[0]
      1418724160 blocks [2/2] [UU]
      [=>...................]  resync =  6.3% (89806592/1418724160) finish=397.6min speed=55702K/sec
md2 : active raid1 sdb3[1] sda3[0]
      20972736 blocks [2/2] [UU]
      resync=DELAYED

top - 21:11:45 up 27 min, 1 user, load average: 4.07, 4.05, 3.34
Tasks: 132 total, 1 running, 131 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.4%sy, 0.0%ni, 98.6%id, 0.0%wa, 0.0%hi, 1.0%si, 0.0%st
Mem: 24677336k total, 124804k used, 24552532k free, 9840k buffers
Swap: 25171768k total, 0k used, 25171768k free, 35244k cached

 PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 961 root 10 -5 0 0 0 D 2.0 0.0 0:27.54 md3_resync
 965 root 11 -5 0 0 0 D 0.0 0.0 0:00.00 md2_resync
 969 root 11 -5 0 0 0 D 0.0 0.0 0:00.00 md1_resync
 973 root 12 -5 0 0 0 D 0.0 0.0 0:00.00 md0_resync

INFO: task md1_resync:969 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
md1_resync D ffff81063dc45e30 0 969 235 972 968 (L-TLB)
 ffff81063dc45d70 0000000000000046 0000000000000000 ffff81063dc1be0c
 ffff81063fc80c0c 0000000000000009 ffff81063ddf3080 ffff81063de077a0
 0000001108c05f4a 000000000000087e ffff81063ddf3268 0000000400000000
Call Trace:
 [<ffffffff800a07c0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8021addc>] md_do_sync+0x1d8/0x833
 [<ffffffff80062ff8>] thread_return+0x62/0xfe
 [<ffffffff800a09d8>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800a07c0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8021b7b0>] md_thread+0xf8/0x10e
 [<ffffffff8021b6b8>] md_thread+0x0/0x10e
 [<ffffffff8003287b>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a07c0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003277d>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

INFO: task md0_resync:973 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
md0_resync D ffffffff8015081f 0 973 235 974 972 (L-TLB)
 ffff81063dc4fd70 0000000000000046 0000000000000000 ffff81063dc1ba0c
 ffff81063dc1be0c 0000000000000008 ffff81063de077a0 ffff810115f7b080
 0000001108c06484 000000000000053a ffff81063de07988 000000048008b3cb
Call Trace:
 [<ffffffff800a07c0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8021addc>] md_do_sync+0x1d8/0x833
 [<ffffffff80062ff8>] thread_return+0x62/0xfe
 [<ffffffff800a09d8>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800a07c0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8021b7b0>] md_thread+0xf8/0x10e
 [<ffffffff8021b6b8>] md_thread+0x0/0x10e
 [<ffffffff8003287b>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a07c0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003277d>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

INFO: task md2_resync:965 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
md2_resync D ffffffff8015081f 0 965 235 968 964 (L-TLB)
 ffff81063dc17d70 0000000000000046 0000000000000000 ffff81063faebc0c
 ffff81063dc1ba0c 0000000000000009 ffff81063ddf47a0 ffffffff80308b60
 0000001108c04f86 000000000000081e ffff81063ddf4988 000000008008b3cb
Call Trace:
 [<ffffffff800a07c0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8021addc>] md_do_sync+0x1d8/0x833
 [<ffffffff80062ff8>] thread_return+0x62/0xfe
 [<ffffffff800a09d8>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800a07c0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8021b7b0>] md_thread+0xf8/0x10e
 [<ffffffff8021b6b8>] md_thread+0x0/0x10e
 [<ffffffff8003287b>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a07c0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8019d5a7>] acpi_processor_idle_simple+0x10d/0x30e
 [<ffffffff8003277d>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

(In reply to comment #27)
> This is not cron specific - it shows up during the sync right after OS
> install.

That's interesting. What kicks off the resync after install then?

Whatever it is, it's doing the same trick that the cron job does.

(In reply to comment #28)
> That's interesting. What kicks off the resync after install then?
>
> Whatever it is, it's doing the same trick that the cron job does.

Resync happens because mismatch_cnt in /sys/block/<your_md_device>/md/mismatch_cnt is greater than 0. It was explained somewhere how this is harmless with raid1 and raid10. However, I didn't like it, and found a way to fix it once and for good - of course Google was my friend back then, but I forgot to bookmark it. Also, be aware of 99-raid-check 'patches' that disable the check for raid1 arrays; they bring us little if any benefit at all.

I have an identical call trace to that in comment 25. CentOS 5.5, kernel 2.6.18-194.17.1.el5 smp i386. This is reported when the server is first loaded (systemimager clone of my master server). My install creates three RAID1 partitions at install time and these are synced after extracting the install image. These same error messages occur when running the 99-raid-check script.

Created attachment 451986 [details]
Patch to fix softlockup
This patch from upstream fixes the softlockup problem by changing the wait in md_do_sync from TASK_UNINTERRUPTIBLE to TASK_INTERRUPTIBLE. From the comments on the commit:
md: Don't wait UNINTERRUPTIBLE for other resync to finish
When two md arrays share some block device (e.g each uses different
partitions on the one device), a resync of one array will wait for
the resync on the other to finish.
This can be a long time and as it currently waits TASK_UNINTERRUPTIBLE,
the softlockup code notices and complains.
So use TASK_INTERRUPTIBLE instead and make sure to flush signals
before calling schedule.
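For testers, a small sketch of how to exercise the code path this patch changes (device names are examples only; the arrays must share physical disks, as in the reports above):

# Kick off simultaneous checks on arrays that live on the same two disks
for dev in md0 md1 md2; do
    echo check > /sys/block/$dev/md/sync_action
done

# All but one array should show "resync=DELAYED" while its thread waits;
# on an unpatched kernel the hung-task warning fires after two minutes.
cat /proc/mdstat

With the patch applied the waiting threads sleep interruptibly, so the warning should no longer appear.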
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

In kernel-2.6.18-228.el5. You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5 - detailed testing feedback is always welcomed.

I've just tried this test kernel on a new system with 35 2TB SATA disks connected to an HP P411 controller. It doesn't solve the problem. I have a crude workaround (see the last paragraph).

I have three RAID 6 arrays. They do not share disks; however, they do share the controller:

Personalities : [raid6] [raid5] [raid4]
md2 : active raid6 cciss/c1d34[10] cciss/c1d33[9] cciss/c1d32[8] cciss/c1d31[7] cciss/c1d30[6] cciss/c1d29[5] cciss/c1d28[4] cciss/c1d27[3] cciss/c1d26[2] cciss/c1d25[1] cciss/c1d24[0]
      17581335552 blocks level 6, 32k chunk, algorithm 2 [11/11] [UUUUUUUUUUU]
      [>....................]  resync =  1.2% (24033632/1953481728) finish=3001.1min speed=10712K/sec
md1 : active raid6 cciss/c1d23[11] cciss/c1d22[10] cciss/c1d21[9] cciss/c1d20[8] cciss/c1d19[7] cciss/c1d18[6] cciss/c1d17[5] cciss/c1d16[4] cciss/c1d15[3] cciss/c1d14[2] cciss/c1d13[1] cciss/c1d12[0]
      19534817280 blocks level 6, 32k chunk, algorithm 2 [12/12] [UUUUUUUUUUUU]
      [>....................]  resync =  1.2% (24120448/1953481728) finish=2678.9min speed=12003K/sec
md0 : active raid6 cciss/c1d11[11] cciss/c1d10[10] cciss/c1d9[9] cciss/c1d8[8] cciss/c1d7[7] cciss/c1d6[6] cciss/c1d5[5] cciss/c1d4[4] cciss/c1d3[3] cciss/c1d2[2] cciss/c1d1[1] cciss/c1d0[0]
      19534817280 blocks level 6, 32k chunk, algorithm 2 [12/12] [UUUUUUUUUUUU]
      [>....................]  resync =  0.4% (8389628/1953481728) finish=2617.0min speed=12386K/sec

unused devices: <none>

Using the arrays while they are building (e.g. mkfs.xfs, pvcreate, ls) gives the problem described above:

INFO: task md0_resync:4436 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
md0_resync D ffffffff80153644 0 4436 263 4445 4435 (L-TLB)
 ffff81061cff9c50 0000000000000046 ffff810037e1a400 ffff81031acd6708
 ffff81031ffc8000 000000000000000a ffff81031f7f4820 ffff81061febb820
 0000009927e6375c 000000000013b0d4 ffff81031f7f4a08 0000000937e1a434
Call Trace:
 [<ffffffff883dcdba>] :raid456:get_active_stripe+0x240/0x4b7
 [<ffffffff8008e414>] default_wake_function+0x0/0xe
 [<ffffffff883e1a08>] :raid456:sync_request+0x69e/0x735
 [<ffffffff880b7485>] :cciss:start_io+0x98/0xd2
 [<ffffffff80153644>] __next_cpu+0x19/0x28
 [<ffffffff8021f05a>] md_do_sync+0x464/0x859
 [<ffffffff800a267e>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8021ebe0>] md_thread+0xf8/0x10e
 [<ffffffff8021eae8>] md_thread+0x0/0x10e
 [<ffffffff80032968>] kthread+0xfe/0x132
 [<ffffffff8009efc6>] __call_usermodehelper+0x0/0x4f
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a267e>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003286a>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

INFO: task mkfs.xfs:4489 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mkfs.xfs D ffffffff80153644 0 4489 4474 (NOTLB)
 ffff81061e69db98 0000000000000086 ffff810037e002b4 ffffffff80146143
 ffff81061d80c928 0000000000000008 ffff81061e13b100 ffff81061fd097a0
 00000099271433a4 000000000001f551 ffff81061e13b2e8 0000000f00000002
Call Trace:
 [<ffffffff80146143>] elv_next_request+0x168/0x178
 [<ffffffff8006ec4e>] do_gettimeofday+0x40/0x90
 [<ffffffff800637ca>] io_schedule+0x3f/0x67
 [<ffffffff800f7388>] __blockdev_direct_IO+0x8da/0xa82
 [<ffffffff800e7f02>] blkdev_direct_IO+0x32/0x37
 [<ffffffff800e7e3a>] blkdev_get_blocks+0x0/0x96
 [<ffffffff800c8713>] generic_file_direct_IO+0x108/0x122
 [<ffffffff8001f02a>] generic_file_direct_write+0x60/0xf2
 [<ffffffff80016637>] __generic_file_aio_write_nolock+0x2b8/0x3b6
 [<ffffffff800c8898>] generic_file_aio_write_nolock+0x20/0x6c
 [<ffffffff800c8c6f>] generic_file_write_nolock+0x8f/0xa8
 [<ffffffff800c8bc7>] generic_file_read+0xac/0xc5
 [<ffffffff8006723e>] do_page_fault+0x4fe/0x874
 [<ffffffff800a2896>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800e6cb6>] blkdev_file_write+0x1a/0x1f
 [<ffffffff80016af0>] vfs_write+0xce/0x174
 [<ffffffff80043fe8>] sys_pwrite64+0x50/0x70
 [<ffffffff8005dde9>] error_exit+0x0/0x84
 [<ffffffff8005d116>] system_call+0x7e/0x83

A workaround is to set /proc/sys/dev/raid/speed_limit_max so the collective maximum sync bandwidth is just below the physical maximum of the disk system (see the sketch further below).

I tried the 228 kernel and it seems to have fixed the problem for me. I first confirmed that the problem still exists on 2.6.18-194.11.1.el5: I started 4 simultaneous full raid1 resyncs (4 partitions mirrored across 2 disks) and saw 4 "task md*_resync... blocked for more than 120 seconds" messages. I then rebooted onto the 2.6.18-228.el5 kernel and repeated the same test using the same 2 disks. I repeated the test twice (resyncing in each direction) and saw no "task blocked" messages.

(In reply to comment #29)
> Resync happens because mismatch_cnt in
> /sys/block/<your_md_device>/md/mismatch_cnt is greater than 0.
> It was explained somewhere how this is harmless with raid1 and raid10.
> However, I didn't like it, and found a way to fix it once and for good - of
> course Google was my friend back then, but I forgot to bookmark it.
> Also, be aware of 99-raid-check 'patches' that disable the check for raid1
> arrays; they bring us little if any benefit at all.

I don't know, but I would prefer not to resync the drives every week when using SSDs. An Intel X25-M is specified for 20 GB/day; a 160 GB version would easily exceed that with the weekly resync job. Okay, those are no server-type SSDs, but that won't help the users of those disks.

Datasheet of the Intel X25-M: http://download.intel.com/design/flash/nand/mainstream/Specification322296.pdf

I'm seeing this problem as well, using the 2.6.18-194.26.1.el5 kernel. It shows up after a fresh installation. I have multiple md-raid arrays (md0, md1, md2) on the same two partitioned disks.
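A minimal sketch of the speed_limit_max workaround mentioned above (the figure is an example only and must be tuned to the actual disk system; like the hung-task sysctl, it is runtime-only and reverts on reboot):

# Current limits, in KB/s per device
cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max

# Example: cap the resync rate below what the controller can sustain
echo 20000 > /proc/sys/dev/raid/speed_limit_max

Note this trades a longer resync for responsive foreground I/O; it does not address the blocked-task wait itself.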
INFO: task md0_resync:7449 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
md0_resync D ffff81022f07c040 0 7449 89 3175 (L-TLB)
 ffff8101f147bd70 0000000000000046 ffffffff8002b398 ffff81022fa52a0c
 ffff81022fa52e0c 0000000000000009 ffff810227a24820 ffff81022f07c040
 00000539224d8f56 0000000000001756 ffff810227a24a08 000000030000000d
Call Trace:
 [<ffffffff8002b398>] flush_tlb_page+0xac/0xda
 [<ffffffff80007691>] find_get_page+0x21/0x51
 [<ffffffff800a08fe>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8021af2b>] md_do_sync+0x1d8/0x833
 [<ffffffff8008ca47>] enqueue_task+0x41/0x56
 [<ffffffff8008cab2>] __activate_task+0x56/0x6d
 [<ffffffff8008c897>] dequeue_task+0x18/0x37
 [<ffffffff80062ff8>] thread_return+0x62/0xfe
 [<ffffffff800a0b16>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800a08fe>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8021b8ff>] md_thread+0xf8/0x10e
 [<ffffffff800a08fe>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8021b807>] md_thread+0x0/0x10e
 [<ffffffff8003290a>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a08fe>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003280c>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

(In reply to comment #39)
> I'm seeing this problem as well, using the 2.6.18-194.26.1.el5 kernel.
> It shows up after a fresh installation.
>
> I have multiple md-raid arrays (md0, md1, md2) on the same two partitioned
> disks.
> [stack trace snipped - identical to the one above]

Thank you for adding input! Please test the RHEL 5.6 Public Beta and see if this resolves the issue.

I have the same problem on two different servers. One is related to a software raid array. The other server does not have any software raid installed, but it is pulling the data from remote sites onto the NFS server running the above-mentioned software raid. Both are running CentOS kernel 2.6.18-194.26.1.el5. Here are the call traces:

INFO: task md13_resync:23754 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
md13_resync D ffff8104360a9aa0 0 23754 296 23756 19614 (L-TLB)
 ffff8107cce7bd70 0000000000000046 0000000000000000 ffff810427d7bc0c
 ffff81010e96420c 000000000000000a ffff8104f09d0820 ffff81043623c820
 0001a86c095399b9 00000000000091e3 ffff8104f09d0a08 000000058008b4d7
Call Trace:
 [<ffffffff800a08fe>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8021af2b>] md_do_sync+0x1d8/0x833
 [<ffffffff8008ca47>] enqueue_task+0x41/0x56
 [<ffffffff8008cab2>] __activate_task+0x56/0x6d
 [<ffffffff8008c897>] dequeue_task+0x18/0x37
 [<ffffffff80062ff8>] thread_return+0x62/0xfe
 [<ffffffff800a0b16>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800a08fe>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8021b8ff>] md_thread+0xf8/0x10e
 [<ffffffff800a08fe>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8021b807>] md_thread+0x0/0x10e
 [<ffffffff8003290a>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a08fe>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003280c>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

INFO: task globus-gridftp-:16523 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
globus-gridft D ffff810001004420 0 16523 7428 16570 (NOTLB)
 ffff81021862fd48 0000000000000086 00000001800c5ab0 0000000000000000
 0000000000000000 0000000000000009 ffff810214ae57e0 ffffffff80308b60
 00012acd660cd9f5 000000000001facb ffff810214ae59c8 000000001862fda0
Call Trace:
 [<ffffffff8006e1d7>] do_gettimeofday+0x40/0x90
 [<ffffffff8838886d>] :nfs:nfs_wait_bit_uninterruptible+0x0/0xd
 [<ffffffff800637ea>] io_schedule+0x3f/0x67
 [<ffffffff88388876>] :nfs:nfs_wait_bit_uninterruptible+0x9/0xd
 [<ffffffff80063a16>] __wait_on_bit+0x40/0x6e
 [<ffffffff8838886d>] :nfs:nfs_wait_bit_uninterruptible+0x0/0xd
 [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff800a0b44>] wake_bit_function+0x0/0x23
 [<ffffffff8838c2ce>] :nfs:nfs_wait_on_requests_locked+0x70/0xca
 [<ffffffff8838d322>] :nfs:nfs_sync_inode_wait+0x60/0x1db
 [<ffffffff88383167>] :nfs:nfs_do_fsync+0x1f/0x3f
 [<ffffffff88383a0d>] :nfs:nfs_file_flush+0x84/0xac
 [<ffffffff80023be3>] filp_close+0x36/0x64
 [<ffffffff8001e0d1>] sys_close+0x88/0xbd
 [<ffffffff8005d116>] system_call+0x7e/0x83

We had other problems locking up the second server, so I am not sure this constitutes the main issue for us, but it would be nice to resolve it.

(In reply to comment #41)
> I have the same problem on two different servers. One is related to a
> software raid array. The other server does not have any software raid
> installed, but it is pulling the data from remote sites onto the NFS server
> running the above-mentioned software raid.
> [snip]
> [globus-gridftp stack trace snipped - see above]

This is a completely separate issue: this bug is dealing with problems where md_do_sync shows up in the stack trace. What you have here is an NFS problem -- I'd suggest filing a separate bug for it (if one hasn't already been filed).

John: the first bug I posted is for md_do_sync (md13_resync).
I agree that the second bug is the NFS bug. My impression was that both bugs are in the kernel, though, even if the kernel may just be doing its job and informing me about them. I see the status has changed to "release pending". Hopefully the issue will get resolved.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html

*** Bug 597855 has been marked as a duplicate of this bug. ***

*** Bug 590403 has been marked as a duplicate of this bug. ***