Bug 573106

Summary: [Stratus 5.6 bug] task md0_resync:18061 blocked for more than 120 seconds
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.5
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Target Milestone: rc
Target Release: 5.6
Keywords: OtherQA, Regression
Reporter: nate.dailey
Assignee: Jim Paradis <jparadis>
QA Contact: Storage QE <storage-qe>
CC: andriusb, bdonahue, bkv.astro, bugzilla, ck, coughlan, cward, daniel.brnak, dkerr, dkovalsk, drjones, eparis, ernest.beinrohr, fukawi2+fedora, gerrit.slomma, gholms, happe, irekpias, john.haxby, jonas.hagberg, mark, mr_w, ngaywood, pasik, pasteur, pep, peterm, ralph, rhel, scott, tom, wadehamptoniv, wnefal+redhatbugzilla, wt75
Doc Type: Bug Fix
Last Closed: 2011-01-13 21:16:31 UTC
Attachments:
  stuck tasks on my box (flags: none)
  Patch to fix softlockup (flags: none)

Description nate.dailey 2010-03-12 20:42:27 UTC
Description of problem:

raid1 /dev/md0 lives on the same physical disks as several other raid1 md devices. If resyncs are kicked off on multiple mirrors at the same time, something like this may appear in the log:

Mar 12 12:03:46 wyvern kernel: INFO: task md0_resync:18061 blocked for more than 120 seconds.
Mar 12 12:03:46 wyvern kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 12 12:03:46 wyvern kernel: md0_resync    D ffff81000900caa0     0 18061    107               17890 (L-TLB)
Mar 12 12:03:46 wyvern kernel:  ffff810020f5bd70 0000000000000046 0000000000000000 ffff8101055ca00c
Mar 12 12:03:46 wyvern kernel:  ffff81017fa3840c 0000000000000008 ffff8101b13920c0 ffff8101054b3820
Mar 12 12:03:46 wyvern kernel:  0000015d68a8999c 0000000000001a91 ffff8101b13922a8 0000000100000000
Mar 12 12:03:46 wyvern kernel: Call Trace:
Mar 12 12:03:46 wyvern kernel:  [<ffffffff8015113e>] __next_cpu+0x19/0x28
Mar 12 12:03:46 wyvern kernel:  [<ffffffff800a187d>] keventd_create_kthread+0x0/0xc4
Mar 12 12:03:46 wyvern kernel:  [<ffffffff8021b7ba>] md_do_sync+0x1d8/0x833
Mar 12 12:03:46 wyvern kernel:  [<ffffffff80063ff8>] thread_return+0x62/0xfe
Mar 12 12:03:46 wyvern kernel:  [<ffffffff800a1a95>] autoremove_wake_function+0x0/0x2e
Mar 12 12:03:46 wyvern kernel:  [<ffffffff800a187d>] keventd_create_kthread+0x0/0xc4
Mar 12 12:03:46 wyvern kernel:  [<ffffffff8021c18e>] md_thread+0xf8/0x10e
Mar 12 12:03:47 wyvern kernel:  [<ffffffff8021c096>] md_thread+0x0/0x10e
Mar 12 12:03:47 wyvern kernel:  [<ffffffff80032bd3>] kthread+0xfe/0x132
Mar 12 12:03:47 wyvern kernel:  [<ffffffff8005efb1>] child_rip+0xa/0x11
Mar 12 12:03:47 wyvern kernel:  [<ffffffff800a187d>] keventd_create_kthread+0x0/0xc4
Mar 12 12:03:47 wyvern kernel:  [<ffffffff80032ad5>] kthread+0x0/0x132
Mar 12 12:03:47 wyvern kernel:  [<ffffffff8005efa7>] child_rip+0x0/0x11
Mar 12 12:03:48 wyvern kernel: 

Just looks ugly... the resyncs all complete okay.


Version-Release number of selected component (if applicable):

kernel 2.6.18-191.el5


How reproducible:

Most of the time.


Steps to Reproduce:
1. Start resyncs for multiple raid1 md devices on the same disks.
2. Wait.
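
For reference, one minimal way to kick off step 1 by hand (a sketch; md0 and md1 are example names assumed to share the same two underlying disks, and this is the same sysfs interface the weekly 99-raid-check cron job uses):

# Ask md to check/resync several arrays that share member disks. md serializes
# overlapping resyncs, and the waiting mdX_resync threads sit in the D state,
# which is what trips the 120-second hung-task warning.
for md in md0 md1; do
    echo check > /sys/block/$md/md/sync_action
done
cat /proc/mdstat    # the delayed arrays show "resync=DELAYED"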

  
Actual results:

Ugly message.


Expected results:

No ugly message.


Additional info:

This has been fixed upstream:

http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.33.y.git;a=commit;h=9744197c3d7b329590c2be33ad7b17409bd798fe

Comment 1 Eric Paris 2010-05-02 21:49:35 UTC
Just a 'me to' posting.

Comment 2 Mark Waterhouse 2010-05-07 21:17:26 UTC
I am experiencing this also.
I have a new installation of RHEL 5.5 x86_64 and am seeing this message in my logs.

At the time, vmstat reports a growing number of processes in the blocked state and I/O performance drops through the floor.
I am currently scp'ing a file from a server on the local LAN and am seeing transfer rates start at 10 MB/s, drop to 2.5 MB/s, and sometimes go much lower.


May  7 21:50:25 camunxxen02 kernel: INFO: task auditd:2248 blocked for more than 120 seconds.
May  7 21:50:25 camunxxen02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May  7 21:50:25 camunxxen02 kernel: auditd        D 0000005343565863     0  2248      1          2271  2247 (NOTLB)
May  7 21:50:25 camunxxen02 kernel:  ffff8800d8f1bd88  0000000000000286  ffff8800d8f1bd98  ffffffff80262dd3
May  7 21:50:25 camunxxen02 kernel:  0000000000000009  ffff8800db8cc7a0  ffff8800000037e0  000000000000075b
May  7 21:50:25 camunxxen02 kernel:  ffff8800db8cc988  0000000000005f30
May  7 21:51:54 camunxxen02 kernel: Call Trace:
May  7 21:51:54 camunxxen02 kernel:  [<ffffffff80262dd3>] thread_return+0x6c/0x113
May  7 21:51:54 camunxxen02 kernel:  [<ffffffff88036d5a>] :jbd:log_wait_commit+0xa3/0xf5
May  7 21:51:54 camunxxen02 kernel:  [<ffffffff8029c521>] autoremove_wake_function+0x0/0x2e
May  7 21:51:54 camunxxen02 kernel:  [<ffffffff8803178a>] :jbd:journal_stop+0x1cf/0x1ff
May  7 21:51:54 camunxxen02 kernel:  [<ffffffff80231056>] __writeback_single_inode+0x1e9/0x328
May  7 21:51:54 camunxxen02 kernel:  [<ffffffff802676cd>] do_page_fault+0xfae/0x12e0
May  7 21:51:54 camunxxen02 kernel:  [<ffffffff802e5775>] sync_inode+0x24/0x33
May  7 21:51:54 camunxxen02 kernel:  [<ffffffff8804c36d>] :ext3:ext3_sync_file+0xc9/0xdc
May  7 21:51:54 camunxxen02 kernel:  [<ffffffff80251f3f>] do_fsync+0x52/0xa4
May  7 21:51:54 camunxxen02 kernel:  [<ffffffff802d370c>] __do_fsync+0x23/0x36
May  7 21:51:54 camunxxen02 kernel:  [<ffffffff80260106>] system_call+0x86/0x8b
May  7 21:51:54 camunxxen02 kernel:  [<ffffffff80260080>] system_call+0x0/0x8b
May  7 21:51:54 camunxxen02 kernel:
May  7 21:51:54 camunxxen02 kernel: INFO: task syslogd:2285 blocked for more than 120 seconds.
May  7 21:51:54 camunxxen02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May  7 21:51:54 camunxxen02 kernel: syslogd       D 0000004a4f132cba     0  2285      1          2288  2271 (NOTLB)
May  7 21:51:54 camunxxen02 kernel:  ffff8800d81c3d88  0000000000000286  0000000000000000  0000000000000001
May  7 21:51:54 camunxxen02 kernel:  0000000000000009  ffff8800e11e6040  ffff8800000037e0  0000000000006cfb
May  7 21:51:54 camunxxen02 kernel:  ffff8800e11e6228  0000000000000000
May  7 21:51:54 camunxxen02 kernel: Call Trace:
May  7 21:51:54 camunxxen02 kernel:  [<ffffffff88036d5a>] :jbd:log_wait_commit+0xa3/0xf5
May  7 21:51:54 camunxxen02 kernel:  [<ffffffff8029c521>] autoremove_wake_function+0x0/0x2e
May  7 21:51:54 camunxxen02 kernel:  [<ffffffff8803178a>] :jbd:journal_stop+0x1cf/0x1ff
May  7 21:51:54 camunxxen02 kernel:  [<ffffffff80231056>] __writeback_single_inode+0x1e9/0x328
May  7 21:51:54 camunxxen02 kernel:  [<ffffffff802d2f08>] do_readv_writev+0x26e/0x291
May  7 21:51:54 camunxxen02 kernel:  [<ffffffff802e5775>] sync_inode+0x24/0x33
May  7 21:51:54 camunxxen02 kernel:  [<ffffffff8804c36d>] :ext3:ext3_sync_file+0xc9/0xdc
May  7 21:51:54 camunxxen02 kernel:  [<ffffffff80251f3f>] do_fsync+0x52/0xa4
May  7 21:51:54 camunxxen02 kernel:  [<ffffffff802d370c>] __do_fsync+0x23/0x36
May  7 21:51:54 camunxxen02 kernel:  [<ffffffff802602f9>] tracesys+0xab/0xb6
May  7 21:51:54 camunxxen02 kernel:
May  7 21:51:54 camunxxen02 kernel: INFO: task tail:3629 blocked for more than 120 seconds.
May  7 21:51:54 camunxxen02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May  7 21:51:54 camunxxen02 kernel: tail          D 0000005363bcd82a     0  3629   3603                     (NOTLB)
May  7 21:51:54 camunxxen02 kernel:  ffff8800b1bbbea8  0000000000000286  00000000ffffffd8  ffff8800b1bbbea8
May  7 21:51:54 camunxxen02 kernel:  0000000000000008  ffff8800ce58f100  ffff8800000037e0  00000000001282c1
May  7 21:51:54 camunxxen02 kernel:  ffff8800ce58f2e8  ffff8800e0d44880
May  7 21:51:55 camunxxen02 kernel: Call Trace:
May  7 21:51:55 camunxxen02 kernel:  [<ffffffff8021b224>] cp_new_stat+0xe5/0xfd
May  7 21:51:55 camunxxen02 kernel:  [<ffffffff80263a7e>] __mutex_lock_slowpath+0x60/0x9b
May  7 21:51:55 camunxxen02 kernel:  [<ffffffff80263ac8>] .text.lock.mutex+0xf/0x14
May  7 21:51:55 camunxxen02 kernel:  [<ffffffff80214682>] generic_file_llseek+0x
May  7 21:51:55 camunxxen02 kernel:  [<ffffffff80225c59>] sys_lseek+0x40/0x60
May  7 21:51:55 camunxxen02 kernel:  [<ffffffff80260295>] tracesys+0x47/0xb6
May  7 21:51:55 camunxxen02 kernel:  [<ffffffff802602f9>] tracesys+0xab/0xb6
May  7 21:51:55 camunxxen02 kernel:

Comment 3 I. Piasecki 2010-05-18 03:37:59 UTC
Please "me to" add to this post, even if I use centos 5.5. I have this behaviour on both of machines: i386 and x86_64.

Comment 4 Phillip Smith 2010-05-21 01:04:07 UTC
Another "me too". Seems to trigger the Call Trace's every 120 seconds (as per the error message) but seems to be no other ill effects on the running system apart from the ugly messages and panic that it's all broken!

[root@longhorn ~]# cat /proc/mdstat 
Personalities : [raid1] 
md3 : active raid1 sdb5[1] sda5[0]
      955658496 blocks [2/2] [UU]
      	resync=DELAYED
      
md0 : active raid1 sdd1[3] sdc1[2] sdb1[1] sda1[0]
      128384 blocks [4/4] [UUUU]
      	resync=DELAYED
      
md2 : active raid1 sdd2[1] sdc2[0]
      16779776 blocks [2/2] [UU]
      
md4 : active raid1 sdd5[1] sdc5[0]
      955658496 blocks [2/2] [UU]
      [=>...................]  resync =  6.2% (59335104/955658496) finish=141.9min speed=105240K/sec
      
md1 : active raid1 sdb2[1] sda2[0]
      16779776 blocks [2/2] [UU]
      
unused devices: <none>




INFO: task md0_resync:536 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
md0_resync    D ffff880001012460     0   536     19           539   535 (L-TLB)
 ffff8803d1251d70  0000000000000246  ffff8803d1251cb0  0000000000000000 
 0000000000000009  ffff8803d12320c0  ffff880002e4e7a0  0000000000000dbf 
 ffff8803d12322a8  0000000000000000 
Call Trace:
 [<ffffffff8029c309>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8040b4b8>] md_do_sync+0x1d8/0x833
 [<ffffffff80288749>] dequeue_task+0x18/0x37
 [<ffffffff80288790>] deactivate_task+0x28/0x5f
 [<ffffffff8026ef31>] monotonic_clock+0x35/0x7b
 [<ffffffff80262dd3>] thread_return+0x6c/0x113
 [<ffffffff80248d8c>] try_to_wake_up+0x392/0x3a4
 [<ffffffff8029c521>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8029c309>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8040be8c>] md_thread+0xf8/0x10e
 [<ffffffff8029c309>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8040bd94>] md_thread+0x0/0x10e
 [<ffffffff80233b0f>] kthread+0xfe/0x132
 [<ffffffff80260b2c>] child_rip+0xa/0x12
 [<ffffffff8029c309>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80233a11>] kthread+0x0/0x132
 [<ffffffff80260b22>] child_rip+0x0/0x12

INFO: task md3_resync:540 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
md3_resync    D ffff880001012460     0   540     19           543   539 (L-TLB)
 ffff8803d127bd70  0000000000000246  ffff8803d127bcb0  0000000000000000 
 0000000000000009  ffff8803d1223040  ffff8803d12320c0  0000000000000f45 
 ffff8803d1223228  0000000000000000 
Call Trace:
 [<ffffffff8029c52a>] autoremove_wake_function+0x9/0x2e
 [<ffffffff80287795>] __wake_up_common+0x3e/0x68
 [<ffffffff8029c309>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8040b4b8>] md_do_sync+0x1d8/0x833
 [<ffffffff80288749>] dequeue_task+0x18/0x37
 [<ffffffff80288790>] deactivate_task+0x28/0x5f
 [<ffffffff8026ef31>] monotonic_clock+0x35/0x7b
 [<ffffffff80262dd3>] thread_return+0x6c/0x113
 [<ffffffff80248d8c>] try_to_wake_up+0x392/0x3a4
 [<ffffffff8029c521>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8029c309>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8040be8c>] md_thread+0xf8/0x10e
 [<ffffffff8029c309>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8040bd94>] md_thread+0x0/0x10e
 [<ffffffff80233b0f>] kthread+0xfe/0x132
 [<ffffffff80260b2c>] child_rip+0xa/0x12
 [<ffffffff8029c309>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80233a11>] kthread+0x0/0x132
 [<ffffffff80260b22>] child_rip+0x0/0x12

Comment 5 Christian Kohlschuetter 2010-06-05 15:38:23 UTC
Hi,
I've just encountered the same bug, running 2.6.18-194.3.1.el5 on a Core i7 920.
It did not occur on 5.4.

Jun  5 16:49:50 xxx kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun  5 16:49:50 xxx kernel: md1_resync    D ffff81033e63d7e0     0 10605    139         10606  6480 (L-TLB)
Jun  5 16:49:51 xxx kernel:  ffff8103015ffd70 0000000000000046 0000000000000000 ffff81033fc4a80c
Jun  5 16:49:51 xxx kernel:  ffff81033faeb80c 000000000000000a ffff81033ffdd100 ffff81033e63d7e0
Jun  5 16:49:51 xxx kernel:  00003db94dbef9a3 00000000000019b7 ffff81033ffdd2e8 0000000400000000
Jun  5 16:49:51 xxx kernel: Call Trace:
Jun  5 16:49:51 xxx kernel:  [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
Jun  5 16:49:51 xxx kernel:  [<ffffffff8021aa1f>] md_do_sync+0x1d8/0x833
Jun  5 16:49:51 xxx kernel:  [<ffffffff8008ca21>] enqueue_task+0x41/0x56
Jun  5 16:49:51 xxx kernel:  [<ffffffff8008ca8c>] __activate_task+0x56/0x6d
Jun  5 16:49:51 xxx kernel:  [<ffffffff8008c871>] dequeue_task+0x18/0x37
Jun  5 16:49:51 xxx kernel:  [<ffffffff80062ff8>] thread_return+0x62/0xfe
Jun  5 16:49:51 xxx kernel:  [<ffffffff800a0abe>] autoremove_wake_function+0x0/0x2e
Jun  5 16:49:51 xxx kernel:  [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
Jun  5 16:49:51 xxx kernel:  [<ffffffff8021b3f3>] md_thread+0xf8/0x10e
Jun  5 16:49:51 xxx kernel:  [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
Jun  5 16:49:51 xxx kernel:  [<ffffffff8021b2fb>] md_thread+0x0/0x10e
Jun  5 16:49:51 xxx kernel:  [<ffffffff80032894>] kthread+0xfe/0x132
Jun  5 16:49:51 xxx kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Jun  5 16:49:51 xxx kernel:  [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
Jun  5 16:49:51 xxx kernel:  [<ffffffff80032796>] kthread+0x0/0x132
Jun  5 16:49:51 xxx kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11

Comment 6 Frank Büttner 2010-06-08 08:19:17 UTC
Same for me:
jun  7 08:32:43  kernel: md0_resync    D ffff810001015120     0 11682     75         11683  4791 (L-TLB)
Jun  7 08:32:43  kernel:  ffff810062d59d70 0000000000000046 0000000000000000 ffff8101bfdd5a0c
Jun  7 08:32:43  kernel:  ffff8101bea0bc0c 0000000000000009 ffff81006a165040 ffff810106323080
Jun  7 08:32:43 kernel:  000003ee09ede32a 0000000000001ad9 ffff81006a165228 0000000200000000
Jun  7 08:32:43 kernel: Call Trace:
Jun  7 08:32:43 kernel:  [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
Jun  7 08:32:43 kernel:  [<ffffffff8021aa1f>] md_do_sync+0x1d8/0x833
Jun  7 08:32:43 kernel:  [<ffffffff8008ca21>] enqueue_task+0x41/0x56
Jun  7 08:32:43 kernel:  [<ffffffff8008ca8c>] __activate_task+0x56/0x6d
Jun  7 08:32:43 kernel:  [<ffffffff8008c871>] dequeue_task+0x18/0x37
Jun  7 08:32:43 kernel:  [<ffffffff80062ff8>] thread_return+0x62/0xfe
Jun  7 08:32:43 kernel:  [<ffffffff800a0abe>] autoremove_wake_function+0x0/0x2e
Jun  7 08:32:43 kernel:  [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
Jun  7 08:32:43 kernel:  [<ffffffff8021b3f3>] md_thread+0xf8/0x10e
Jun  7 08:32:43 kernel:  [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
Jun  7 08:32:43 kernel:  [<ffffffff8021b2fb>] md_thread+0x0/0x10e
Jun  7 08:32:43 kernel:  [<ffffffff80032894>] kthread+0xfe/0x132
Jun  7 08:32:43 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Jun  7 08:32:43 kernel:  [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
Jun  7 08:32:43 kernel:  [<ffffffff80032796>] kthread+0x0/0x132
Jun  7 08:32:43 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Jun  7 08:32:43 kernel: 
Jun  7 08:32:43 kernel: INFO: task md1_resync:11683 blocked for more than 120 seconds.
Jun  7 08:32:43 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun  7 08:32:43 kernel: md1_resync    D ffff810060c65e30     0 11683     75              11682 (L-TLB)
Jun  7 08:32:43 kernel:  ffff810060c65d70 0000000000000046 0000000000000000 ffff8101bfdd5e0c
Jun  7 08:32:43 kernel:  ffff8101bfdd5a0c 0000000000000008 ffff810063fab860 ffff81006a165040
Jun  7 08:32:43 kernel:  000003ee09edc851 00000000000050a0 ffff810063faba48 000000028008b4b1
Jun  7 08:32:43 kernel: Call Trace:
Jun  7 08:32:43 kernel:  [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
Jun  7 08:32:43 kernel:  [<ffffffff8021aa1f>] md_do_sync+0x1d8/0x833
Jun  7 08:32:43 kernel:  [<ffffffff8008ca21>] enqueue_task+0x41/0x56
Jun  7 08:32:44 kernel:  [<ffffffff8008ca8c>] __activate_task+0x56/0x6d
Jun  7 08:32:44 kernel:  [<ffffffff8008c871>] dequeue_task+0x18/0x37
Jun  7 08:32:44 kernel:  [<ffffffff80062ff8>] thread_return+0x62/0xfe
Jun  7 08:32:44 kernel:  [<ffffffff800a0abe>] autoremove_wake_function+0x0/0x2e
Jun  7 08:32:44 kernel:  [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
Jun  7 08:32:44 kernel:  [<ffffffff8021b3f3>] md_thread+0xf8/0x10e
Jun  7 08:32:44 kernel:  [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
Jun  7 08:32:44 kernel:  [<ffffffff8021b2fb>] md_thread+0x0/0x10e
Jun  7 08:32:44 kernel:  [<ffffffff80032894>] kthread+0xfe/0x132
Jun  7 08:32:44 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Jun  7 08:32:44 kernel:  [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
Jun  7 08:32:44 kernel:  [<ffffffff80032796>] kthread+0x0/0x132
Jun  7 08:32:44 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Jun  7 08:32:44 kernel: 

Some system information:
6 GB RAM, Hardware: FSC CELSIUS M460

Comment 7 Ralph Angenendt 2010-06-29 14:49:28 UTC
Does not look like it only affects resyncs (Yeah, that is a "me too", although on CentOS):

There are no other messages besides that. The machine is an application server which had a bit to do at that time.

Jun 29 14:56:59 dryckjom kernel: INFO: task nfsd:3308 blocked for more than 120 seconds.
Jun 29 14:56:59 dryckjom kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 29 14:57:00 dryckjom kernel: nfsd          D ffff810104636de0     0  3308      1          3309  3307 (L-TLB)
Jun 29 14:57:00 dryckjom kernel:  ffff8100af991af0 0000000000000046 0000000000000000 ffff8100af991c60
Jun 29 14:57:00 dryckjom kernel:  0000000000000000 000000000000000a ffff8100afeb1860 ffff8100afe307a0
Jun 29 14:57:00 dryckjom kernel:  00032ecfc1662b57 000000000000ae4e ffff8100afeb1a48 00000002af88bc00
Jun 29 14:57:00 dryckjom kernel: Call Trace:
Jun 29 14:57:00 dryckjom kernel:  [<ffffffff8003055b>] sock_recvmsg+0x107/0x15f
Jun 29 14:57:00 dryckjom kernel:  [<ffffffff800ec2a2>] inode_wait+0x0/0xd
Jun 29 14:57:00 dryckjom kernel:  [<ffffffff800ec2ab>] inode_wait+0x9/0xd
Jun 29 14:57:00 dryckjom kernel:  [<ffffffff80063a16>] __wait_on_bit+0x40/0x6e
Jun 29 14:57:00 dryckjom kernel:  [<ffffffff800ec2a2>] inode_wait+0x0/0xd
Jun 29 14:57:00 dryckjom kernel:  [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
Jun 29 14:57:00 dryckjom kernel:  [<ffffffff800a0aec>] wake_bit_function+0x0/0x23
Jun 29 14:57:00 dryckjom kernel:  [<ffffffff8000c5fe>] _atomic_dec_and_lock+0x39/0x57
Jun 29 14:57:00 dryckjom kernel:  [<ffffffff8003dbbf>] ifind_fast+0x6e/0x83
Jun 29 14:57:00 dryckjom kernel:  [<ffffffff80023290>] iget_locked+0x59/0x149
Jun 29 14:57:00 dryckjom kernel:  [<ffffffff880565bf>] :ext3:ext3_get_dentry+0x3d/0xbb
Jun 29 14:57:00 dryckjom kernel:  [<ffffffff884ea36b>] :exportfs:find_exported_dentry+0x43/0x480
Jun 29 14:57:00 dryckjom kernel:  [<ffffffff884f7753>] :nfsd:nfsd_acceptable+0x0/0xd8
Jun 29 14:57:00 dryckjom kernel:  [<ffffffff884fb74f>] :nfsd:exp_get_by_name+0x5b/0x71
Jun 29 14:57:00 dryckjom kernel:  [<ffffffff884fbd3e>] :nfsd:exp_find_key+0x89/0x9c
Jun 29 14:57:00 dryckjom kernel:  [<ffffffff88384ba2>] :sunrpc:svc_tcp_recvfrom+0x719/0x79e
Jun 29 14:57:00 dryckjom kernel:  [<ffffffff8009b1dd>] set_current_groups+0x159/0x164
Jun 29 14:57:00 dryckjom kernel:  [<ffffffff884ea7f3>] :exportfs:export_decode_fh+0x4b/0x50
Jun 29 14:57:00 dryckjom kernel:  [<ffffffff884f7ac5>] :nfsd:fh_verify+0x29a/0x4bd
Jun 29 14:57:00 dryckjom kernel:  [<ffffffff884ff6cc>] :nfsd:nfsd3_proc_getattr+0x87/0xbb
Jun 29 14:57:00 dryckjom kernel:  [<ffffffff884f51db>] :nfsd:nfsd_dispatch+0xd8/0x1d6
Jun 29 14:57:00 dryckjom kernel:  [<ffffffff88382651>] :sunrpc:svc_process+0x454/0x71b
Jun 29 14:57:00 dryckjom kernel:  [<ffffffff80064644>] __down_read+0x12/0x92
Jun 29 14:57:00 dryckjom kernel:  [<ffffffff884f55a1>] :nfsd:nfsd+0x0/0x2cb
Jun 29 14:57:00 dryckjom kernel:  [<ffffffff884f5746>] :nfsd:nfsd+0x1a5/0x2cb
Jun 29 14:57:00 dryckjom kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Jun 29 14:57:00 dryckjom kernel:  [<ffffffff884f55a1>] :nfsd:nfsd+0x0/0x2cb
Jun 29 14:57:00 dryckjom kernel:  [<ffffffff884f55a1>] :nfsd:nfsd+0x0/0x2cb
Jun 29 14:57:00 dryckjom kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Jun 29 14:57:00 dryckjom kernel: 
Jun 29 14:57:00 dryckjom kernel: INFO: task vsftpd:3586 blocked for more than 120 seconds.
Jun 29 14:57:00 dryckjom kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 29 14:57:01 dryckjom kernel: vsftpd        D ffff81013ea9f4c0     0  3586   3350          3587  3568 (NOTLB)
Jun 29 14:57:01 dryckjom kernel:  ffff8100ae693c58 0000000000000086 0000000300000000 ffff8100ae693c38
Jun 29 14:57:01 dryckjom kernel:  0000000000000008 0000000000000007 ffff8100728bc7e0 ffff81013e8cf7a0
Jun 29 14:57:01 dryckjom kernel:  00032ec7d117be71 0000000000017718 ffff8100728bc9c8 00000000ae693ce8
Jun 29 14:57:01 dryckjom kernel: Call Trace:
Jun 29 14:57:01 dryckjom kernel:  [<ffffffff80129e9b>] avc_has_perm+0x46/0x58
Jun 29 14:57:01 dryckjom kernel:  [<ffffffff80063c6f>] __mutex_lock_slowpath+0x60/0x9b
Jun 29 14:57:01 dryckjom kernel:  [<ffffffff80063cb9>] .text.lock.mutex+0xf/0x14
Jun 29 14:57:01 dryckjom kernel:  [<ffffffff8000cec2>] do_lookup+0x90/0x1e6
Jun 29 14:57:01 dryckjom kernel:  [<ffffffff8000a20d>] __link_path_walk+0xa01/0xf42
Jun 29 14:57:01 dryckjom kernel:  [<ffffffff8000e9e2>] link_path_walk+0x42/0xb2
Jun 29 14:57:01 dryckjom kernel:  [<ffffffff8000ccb2>] do_path_lookup+0x275/0x2f1
Jun 29 14:57:01 dryckjom kernel:  [<ffffffff8012ac27>] selinux_file_alloc_security+0x2a/0x54
Jun 29 14:57:01 dryckjom kernel:  [<ffffffff8002368a>] __path_lookup_intent_open+0x56/0x97
Jun 29 14:57:01 dryckjom kernel:  [<ffffffff8001af23>] open_namei+0x73/0x6d5
Jun 29 14:57:01 dryckjom kernel:  [<ffffffff800274a4>] do_filp_open+0x1c/0x38
Jun 29 14:57:01 dryckjom kernel:  [<ffffffff80019dd1>] do_sys_open+0x44/0xbe
Jun 29 14:57:01 dryckjom kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Jun 29 14:57:01 dryckjom kernel: 
Jun 29 14:57:01 dryckjom kernel: INFO: task nrpe:3597 blocked for more than 120 seconds.
Jun 29 14:57:01 dryckjom kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 29 14:57:01 dryckjom kernel: nrpe          D ffff81013ea9f4c0     0  3597   3596                     (NOTLB)
Jun 29 14:57:01 dryckjom kernel:  ffff81013320fc58 0000000000000082 ffff810134bffe38 ffff81013320fc38
Jun 29 14:57:01 dryckjom kernel:  0000000000000000 0000000000000005 ffff81008c61a040 ffff81013e8cf7a0
Jun 29 14:57:01 dryckjom kernel:  00032ec854c2f5c2 00000000000056a0 ffff81008c61a228 000000004fc26840
Jun 29 14:57:01 dryckjom kernel: Call Trace:
Jun 29 14:57:01 dryckjom kernel:  [<ffffffff80129e9b>] avc_has_perm+0x46/0x58
Jun 29 14:57:01 dryckjom kernel:  [<ffffffff80063c6f>] __mutex_lock_slowpath+0x60/0x9b
Jun 29 14:57:01 dryckjom kernel:  [<ffffffff80063cb9>] .text.lock.mutex+0xf/0x14
Jun 29 14:57:01 dryckjom kernel:  [<ffffffff8000cec2>] do_lookup+0x90/0x1e6
Jun 29 14:57:01 dryckjom kernel:  [<ffffffff8000a20d>] __link_path_walk+0xa01/0xf42
Jun 29 14:57:01 dryckjom kernel:  [<ffffffff8000e9e2>] link_path_walk+0x42/0xb2
Jun 29 14:57:01 dryckjom kernel:  [<ffffffff8000ccb2>] do_path_lookup+0x275/0x2f1
Jun 29 14:57:01 dryckjom kernel:  [<ffffffff8012ac27>] selinux_file_alloc_security+0x2a/0x54
Jun 29 14:57:01 dryckjom kernel:  [<ffffffff8002368a>] __path_lookup_intent_open+0x56/0x97
Jun 29 14:57:02 dryckjom kernel:  [<ffffffff8001af23>] open_namei+0x73/0x6d5
Jun 29 14:57:02 dryckjom kernel:  [<ffffffff800274a4>] do_filp_open+0x1c/0x38
Jun 29 14:57:02 dryckjom kernel:  [<ffffffff8005c382>] cache_alloc_refill+0x106/0x186
Jun 29 14:57:43 dryckjom kernel:  [<ffffffff80019dd1>] do_sys_open+0x44/0xbe
Jun 29 14:57:43 dryckjom kernel: 
Jun 29 14:57:43 dryckjom kernel: INFO: task vsftpd:3587 blocked for more than 120 seconds.
Jun 29 14:57:43 dryckjom kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 29 14:57:43 dryckjom kernel: vsftpd        D ffff81013ea9f4c0     0  3587   3350          3600  3586 (NOTLB)
Jun 29 14:57:43 dryckjom kernel:  ffff8100add49c58 0000000000000082 ffff810101dff690 ffff8100add49c38
Jun 29 14:57:43 dryckjom kernel:  0000000000000008 0000000000000007 ffff81011d8ef820 ffff81011cf47080
Jun 29 14:57:43 dryckjom kernel:  00032ec7d49fcddf 0000000005dcd944 ffff81011d8efa08 00000007add49ce8
Jun 29 14:57:43 dryckjom kernel: Call Trace:
Jun 29 14:57:43 dryckjom kernel:  [<ffffffff80129e9b>] avc_has_perm+0x46/0x58
Jun 29 14:57:43 dryckjom kernel:  [<ffffffff80063c6f>] __mutex_lock_slowpath+0x60/0x9b
Jun 29 14:57:43 dryckjom kernel:  [<ffffffff80063cb9>] .text.lock.mutex+0xf/0x14
Jun 29 14:57:44 dryckjom kernel:  [<ffffffff8000cec2>] do_lookup+0x90/0x1e6
Jun 29 14:57:44 dryckjom kernel:  [<ffffffff8000a20d>] __link_path_walk+0xa01/0xf42
Jun 29 14:57:44 dryckjom kernel:  [<ffffffff8000e9e2>] link_path_walk+0x42/0xb2
Jun 29 14:57:44 dryckjom kernel:  [<ffffffff8000ccb2>] do_path_lookup+0x275/0x2f1
Jun 29 14:57:44 dryckjom kernel:  [<ffffffff8012ac27>] selinux_file_alloc_security+0x2a/0x54
Jun 29 14:57:44 dryckjom kernel:  [<ffffffff8002368a>] __path_lookup_intent_open+0x56/0x97
Jun 29 14:57:44 dryckjom kernel:  [<ffffffff8001af23>] open_namei+0x73/0x6d5
Jun 29 14:57:44 dryckjom kernel:  [<ffffffff800274a4>] do_filp_open+0x1c/0x38
Jun 29 14:57:44 dryckjom kernel:  [<ffffffff8005c382>] cache_alloc_refill+0x106/0x186
Jun 29 14:57:44 dryckjom kernel:  [<ffffffff80019dd1>] do_sys_open+0x44/0xbe
Jun 29 14:57:44 dryckjom kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Jun 29 14:57:44 dryckjom kernel: 
Jun 29 14:57:44 dryckjom kernel: INFO: task ps:3599 blocked for more than 120 seconds.
Jun 29 14:57:44 dryckjom kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 29 14:57:44 dryckjom kernel: ps            D ffff81013ea9f4c0     0  3599   3660                     (NOTLB)
Jun 29 14:57:44 dryckjom kernel:  ffff8100ae675ce8 0000000000000086 ffff810000019c10 ffff8100ae675cc8
Jun 29 14:57:44 dryckjom kernel:  0000000000000000 0000000000000007 ffff810088c5c100 ffff8100afeb1100
Jun 29 14:57:44 dryckjom kernel:  00032ec835d40503 0000000000028e7b ffff810088c5c2e8 00000004ffffffff
Jun 29 14:57:44 dryckjom kernel: Call Trace:
Jun 29 14:57:44 dryckjom kernel:  [<ffffffff80129e9b>] avc_has_perm+0x46/0x58
Jun 29 14:57:44 dryckjom kernel:  [<ffffffff80063c6f>] __mutex_lock_slowpath+0x60/0x9b
Jun 29 14:57:44 dryckjom kernel:  [<ffffffff80063cb9>] .text.lock.mutex+0xf/0x14
Jun 29 14:57:44 dryckjom kernel:  [<ffffffff8000cec2>] do_lookup+0x90/0x1e6
Jun 29 14:57:44 dryckjom kernel:  [<ffffffff8000a20d>] __link_path_walk+0xa01/0xf42
Jun 29 14:57:44 dryckjom kernel:  [<ffffffff8000e9e2>] link_path_walk+0x42/0xb2
Jun 29 14:57:44 dryckjom kernel:  [<ffffffff8000ccb2>] do_path_lookup+0x275/0x2f1
Jun 29 14:57:44 dryckjom kernel:  [<ffffffff8001280e>] getname+0x15b/0x1c2
Jun 29 14:57:44 dryckjom kernel:  [<ffffffff80023876>] __user_walk_fd+0x37/0x4c
Jun 29 14:57:44 dryckjom kernel:  [<ffffffff80032583>] sys_faccessat+0xe4/0x18d
Jun 29 14:57:44 dryckjom kernel:  [<ffffffff800b76a6>] audit_syscall_entry+0x180/0x1b3
Jun 29 14:57:44 dryckjom kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0

Comment 8 Ralph Angenendt 2010-06-29 14:52:57 UTC
Created attachment 427705 [details]
stuck tasks on my box

Sorry, that looked rather bad in the bug report, so here is the message log excerpt showing the hung processes as an attachment.

Comment 9 Scott Leerssen 2010-07-08 17:31:56 UTC
Another "me too", however these tasks never seem to unblock, resulting in hung nfs client connections.

Jul  8 10:44:06 host56 kernel: INFO: task nfsd:29930 blocked for more than 120 seconds.
Jul  8 10:44:06 host56 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul  8 10:44:06 host56 kernel: nfsd          D 00011AF0  2100 29930      1         29931 29929 (L-TLB)
Jul  8 10:44:06 host56 kernel:        cf35fdcc 00000046 38e9f310 00011af0 00000040 d7ef7800 f8903460 0000000a
Jul  8 10:44:06 host56 kernel:        e5f7a000 38ea07dc 00011af0 000014cc 00000000 e5f7a10c c1806f00 f7880200
Jul  8 10:44:06 host56 kernel:        e5f7a000 00000000 00000000 00000000 e5f7a000 c0436007 c18073bc d3e47aa0
Jul  8 10:44:06 host56 kernel: Call Trace:
Jul  8 10:44:06 host56 kernel:  [<c0436007>] autoremove_wake_function+0x0/0x2d
Jul  8 10:44:06 host56 kernel:  [<c041ee80>] enqueue_task+0x29/0x39
Jul  8 10:44:06 host56 kernel:  [<c061cf25>] __mutex_lock_slowpath+0x4d/0x7c
Jul  8 10:44:06 host56 kernel:  [<c061cf63>] .text.lock.mutex+0xf/0x14
Jul  8 10:44:06 host56 kernel:  [<f8d9819e>] expkey_match+0x49/0x58 [nfsd]
Jul  8 10:44:06 host56 kernel:  [<c045a989>] generic_file_writev+0x28/0x96
Jul  8 10:44:06 host56 kernel:  [<c045a961>] generic_file_writev+0x0/0x96
Jul  8 10:44:06 host56 kernel:  [<c0475e3d>] do_readv_writev+0x149/0x247
Jul  8 10:44:06 host56 kernel:  [<c0475720>] do_sync_write+0x0/0xf1
Jul  8 10:44:06 host56 kernel:  [<c0430dd7>] set_current_groups+0x15a/0x166
Jul  8 10:44:06 host56 kernel:  [<c0475f72>] vfs_writev+0x37/0x43
Jul  8 10:44:06 host56 kernel:  [<f8d950a9>] nfsd_vfs_write+0xca/0x28a [nfsd]
Jul  8 10:44:06 host56 kernel:  [<c047449a>] __dentry_open+0xea/0x1ab
Jul  8 10:44:06 host56 kernel:  [<c04745a0>] dentry_open+0x45/0x4b
Jul  8 10:44:06 host56 kernel:  [<f8d958cb>] nfsd_write+0x96/0xab [nfsd]
Jul  8 10:44:06 host56 kernel:  [<f8d9b5a9>] nfsd3_proc_write+0xd1/0xeb [nfsd]
Jul  8 10:44:06 host56 kernel:  [<f8d921a4>] nfsd_dispatch+0xbb/0x1a9 [nfsd]
Jul  8 10:44:06 host56 kernel:  [<f8d11689>] svc_process+0x3c8/0x633 [sunrpc]
Jul  8 10:44:06 host56 kernel:  [<f8d9268c>] nfsd+0x17e/0x286 [nfsd]
Jul  8 10:44:06 host56 kernel:  [<f8d9250e>] nfsd+0x0/0x286 [nfsd]
Jul  8 10:44:06 host56 kernel:  [<c0405c53>] kernel_thread_helper+0x7/0x10

Comment 10 Jonas Hagberg 2010-07-14 14:33:28 UTC
Another "me too"
On our cluster with 348 nodes, random nodes have random tasks that hang.
Running SL Kernel 2.6.18-194.3.1.el5

INFO: task syslogd:2959 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
syslogd       D ffff81000101d7a0     0  2959      1          2962  2936 (NOTLB)
 ffff81063f87fd88 0000000000000086 0000000400000296 0000000000000001
 000000063f87fd18 000000000000000a ffff81063edb57e0 ffff81034aac1860
 00079032058acd17 0000000000009b14 ffff81063edb59c8 0000000600000096
Call Trace:
 [<ffffffff88036d8a>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff800a0abe>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8803178a>] :jbd:journal_stop+0x1cf/0x1ff
 [<ffffffff8002fbf8>] __writeback_single_inode+0x1e9/0x328
 [<ffffffff800e037b>] do_readv_writev+0x26e/0x291
 [<ffffffff800f2fb7>] sync_inode+0x24/0x33
 [<ffffffff8804c36d>] :ext3:ext3_sync_file+0xc9/0xdc
 [<ffffffff800503f3>] do_fsync+0x52/0xa4
 [<ffffffff800e0c00>] __do_fsync+0x23/0x36
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

Comment 11 john.haxby@oracle.com 2010-07-20 09:15:27 UTC
See also bug 573106

Comment 12 john.haxby@oracle.com 2010-07-20 09:26:41 UTC
Sorry, I mean see bug 590403 where I have proposed a fix.

Comment 13 Ralph Angenendt 2010-07-20 09:59:08 UTC
(In reply to comment #12)
> Sorry, I mean see bug 590403 where I have proposed a fix.    

I see those timeouts without using md raid on the machines, so it has to be on a different layer.

Comment 14 john.haxby@oracle.com 2010-07-20 10:09:42 UTC
(In reply to comment #13)
> (In reply to comment #12)
> > Sorry, I mean see bug 590403 where I have proposed a fix.    
> 
> I see those timeouts without using md raid on the machines, so it has to be on
> a different layer.    

Oh.  Sorry.  I was looking at the original report which clearly shows that the blocked task is blocked in md_do_sync.  This is also the error in the subject, comment #4, comment #5 and comment #6.  The other comments show something blocked in a completely different place -- I haven't checked to see if they're related to each other.  At first glance, though, the only thing that comment #2, comment #7, comment #9 and comment #10 have in common with each other and the original report is that it's something that's triggering the hung task timeout (that is, nothing in common at all).

Comment 15 Vladimir Zidar 2010-07-20 10:13:32 UTC
I'm getting the impression that we are drifting from the subject. Let's focus on understanding what the problem is:

a.) the md task getting blocked
or
b.) message showing up


As far as I can see, the message pops up on other occasions as well, so my best guess is that the message itself was there even before, just not shown by default. Please correct me if I'm wrong here. The message also tells you what to do in order to get rid of it.

Personally, I don't see anything wrong with either of the two problems - the md task somehow needs to get blocked (whether it is correct to let it wait in D state or whether it should really be sleeping is another question), and the message is great to have in case something else happens and a task gets blocked (I've got a similar problem in correlation with Xen, but that is a different bug - a real one).

The only thing I don't like about mdX getting blocked in D state is that it automatically bumps the CPU load by 1 (I guess that if there were more md devices waiting on the same 'queue', the load - at least numerically - would be higher).

Comment 16 Ralph Angenendt 2010-07-20 10:21:46 UTC
(In reply to comment #15)
> I'm getting the impression that we are drifting from the subject. Let's focus
> on understanding what is the problem:
> 
> a.) the md task getting blocked
> or
> b.) message showing up
> 
> 
> As far as I can see, the message is popping up on other occasions, so my best
> guess about it is that the message itself was there even before, just not
> showed up as default. Please correct me if I'm wrong here. The message also
> suggests you what to do in order to get rid of it.

I have no problem with the message popping up, I have a problem with tasks being blocked for such a long time without any apparent reason, no matter if it is an md task or another task which desperately needs I/O.

Comment 17 john.haxby@oracle.com 2010-07-20 10:29:48 UTC
(In reply to comment #15)
> I'm getting the impression that we are drifting from the subject. Let's focus
> on understanding what is the problem:
> 
> a.) the md task getting blocked
> or
> b.) message showing up
> 

The subject of this bug is the mdX_resync task being blocked.   Other reports
of the hung task timeout being triggered are not related to it.  The hung task
timeout is not a bug, it's a mechanism whereby the kernel can warn you about
kernel tasks that have been blocked for a long time.  You can disable it if you
like.
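
For example, to quiet or tune it (a sketch; the sysctl name simply mirrors the /proc path printed in the message itself):

echo 0 > /proc/sys/kernel/hung_task_timeout_secs                # 0 disables the warning
sysctl -w kernel.hung_task_timeout_secs=0                       # same thing via sysctl
echo 'kernel.hung_task_timeout_secs = 0' >> /etc/sysctl.conf    # persist across reboots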

See bug 590403 for a fairly detailed discussion of the mdX_resync hung task
timeout issue.  The executive summary, though, is that this is a false positive
and that it's triggered by /etc/cron.weekly/99-raid-check running all the resync
tasks at the same time.

(In reply to comment #16):
> I have no problem with the message popping up, I have a problem with tasks
> being blocked for such a long time without any apparent reason, no matter if
> it is an md task or another task which desperately needs I/O. 

If you are seeing a hung task timeout warning and you think it's a bug, then you need to submit a bug specific to that particular hung task: any kernel task can get hung, but like processes dying of SIGSEGV, that doesn't make it the same problem.  It doesn't even make it the same problem if it appears to be I/O
related.

Comment 18 Norman Gaywood 2010-07-20 12:16:58 UTC
(In reply to comment #15)

> case something else happens and task get blocked (I've got similar problem in
> correlation with XEN, but it is a different bug - real one).

I noted this comment about Xen. Do you have any bugs logged on this problem? Is it something like Bug 550724?

Comment 19 Scott Leerssen 2010-07-20 12:32:39 UTC
My comment #c9 relates to NFS, which hangs indefinitely and emits this message.  This was new behavior somewhere between RHEL 5.4 and 5.5.  I can submit a new ticket if it seems appropriate.  This just looked to be related.

Comment 20 Vladimir Zidar 2010-07-20 12:41:16 UTC
(In reply to comment #18)
> (In reply to comment #15)
> 
> > case something else happens and task get blocked (I've got similar problem in
> > correlation with XEN, but it is a different bug - real one).
> 
> I noted this comment about Xen. Do you have any bugs logged on this problem? Is
> it something like Bug 550724?    

Actually it was in dom0, but I cannot reproduce it at the moment. However I do think that it was officially fixed in some recent kernel release.

Comment 21 john.haxby@oracle.com 2010-07-20 13:29:18 UTC
(In reply to comment #19)
> My comment #c9 relates to NFS, which hangs indefinitely and emits this message.
>  This was new behavior somewhere between RHEL 5.4 and 5.4.  I can submit a new
> ticket it it seems appropriate.  This just looked to be related.    

It's not related to the md_sync problem at all.  I think you should submit a new bug or log a ticket.

Comment 22 Scott Leerssen 2010-07-21 14:00:11 UTC
(In reply to comment #21)
> (In reply to comment #19)
> > My comment #c9 relates to NFS, which hangs indefinitely and emits this message.
> >  This was new behavior somewhere between RHEL 5.4 and 5.4.  I can submit a new
> > ticket it it seems appropriate.  This just looked to be related.    
> 
> It's not related to the md_sync problem at all.  I think you should submit a
> new bug or log a ticket.    

Done.  https://bugzilla.redhat.com/show_bug.cgi?id=616833

Comment 23 Wade Hampton 2010-10-04 17:54:46 UTC
I have several software RAID partitions and see this (blocked md_do_sync)
following installation. The OS is CentOS 5.5 on kernel 2.6.18-194.17.1.el5
(SMP). On CentOS 5.4 and earlier, the same install and RAID configuration
did not cause these messages to appear.

Comment 24 Andrew Jones 2010-10-06 06:49:02 UTC
This bug has many backtraces posted, but several look unrelated (missing md_do_sync). As John says, those issues need new bugs. Comment 2 and comment 10 look like bug 605207.

Comment 25 Daniel Brnak 2010-10-06 10:49:16 UTC
Confirming on 5.5, kernel 2.6.18-194.8.1.el5 SMP i386; there was no issue on 5.4.


md: delaying resync of md3 until md0 has finished resync (they share one or more physical units)
md: md0: sync done.
md: delaying resync of md3 until md4 has finished resync (they share one or more physical units)
md: delaying resync of md4 until md5 has finished resync (they share one or more physical units)
md: syncing RAID array md5
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
md: using 128k window, over a total of 273546688 blocks.
md: delaying resync of md1 until md3 has finished resync (they share one or more physical units)
md: delaying resync of md3 until md4 has finished resync (they share one or more physical units)
RAID1 conf printout:
 --- wd:2 rd:2
 disk 0, wo:0, o:1, dev:sda1
 disk 1, wo:0, o:1, dev:sdb1
INFO: task md3_resync:5649 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
md3_resync    D 0000CA3E  3312  5649     11          5650  1703 (L-TLB)
       d9b47ec0 00000046 45ae56fc 0000ca3e 00000064 00000000 00000000 0000000a 
       d6a9f000 45ae610f 0000ca3e 00000a13 00000001 d6a9f10c c2013544 f06dc740 
       00000003 c06b5b98 c2296ecc c229a9c8 d9b47f80 c229a000 c2013588 c0667780 
Call Trace:
 [<c05aba8b>] md_do_sync+0x1fe/0x966
 [<c041eef4>] enqueue_task+0x29/0x39
 [<c041ef4e>] __activate_task+0x4a/0x59
 [<c041f811>] try_to_wake_up+0x3e8/0x3f2
 [<c061caf0>] schedule+0x9cc/0xa55
 [<c0436067>] autoremove_wake_function+0x0/0x2d
 [<c05ac4dd>] md_thread+0xdf/0xf5
 [<c041ebb9>] complete+0x2b/0x3d
 [<c05ac3fe>] md_thread+0x0/0xf5
 [<c0435fa3>] kthread+0xc0/0xed
 [<c0435ee3>] kthread+0x0/0xed
 [<c0405c53>] kernel_thread_helper+0x7/0x10
 =======================
INFO: task md4_resync:5650 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
md4_resync    D 0000CA3E  3492  5650     11          5651  5649 (L-TLB)
       f3969ec0 00000046 45a714b6 0000ca3e 00000064 00000096 d9b47f80 0000000a 
       c2118550 45a7244c 0000ca3e 00000f96 00000001 c211865c c2013544 f06dc740 
       00000003 c06b5b98 c2280c8c c229adc8 f3969f80 c229a800 c0425f00 c0667780 
Call Trace:
 [<c0425f00>] printk+0x18/0x8e
 [<c05aba8b>] md_do_sync+0x1fe/0x966
 [<c041eef4>] enqueue_task+0x29/0x39
 [<c041ef4e>] __activate_task+0x4a/0x59
 [<c041f811>] try_to_wake_up+0x3e8/0x3f2
 [<c061caf0>] schedule+0x9cc/0xa55
 [<c0436067>] autoremove_wake_function+0x0/0x2d
 [<c05ac4dd>] md_thread+0xdf/0xf5
 [<c041ebb9>] complete+0x2b/0x3d
 [<c05ac3fe>] md_thread+0x0/0xf5
 [<c0435fa3>] kthread+0xc0/0xed
 [<c0435ee3>] kthread+0x0/0xed
 [<c0405c53>] kernel_thread_helper+0x7/0x10
 =======================

Comment 26 john.haxby@oracle.com 2010-10-06 11:11:24 UTC
Just to be clear, and just in case anyone reading the previous comment is misled: this is not a kernel bug.

The error message is caused by a weekly cron job (99-raid-check) that does this (I've copied this from my comment in bug 590404):

----------------
devnum=0
for dev in $dev_list; do
    echo "${check[$devnum]}" > /sys/block/$dev/md/sync_action
    let devnum++
done
[ -z "$check_list" ] && exit 0

checking=1
while [ $checking -ne 0 ]
do
        sleep 60
        checking=0
        for dev in $check_list; do
        sync_action=`cat /sys/block/$dev/md/sync_action`
                if [ "$sync_action" != "idle" ]; then
                        checking=1
                fi
        done
done
---------------

There are two loops here: the first one kicks off a check for all the MD devices, the second one waits until they've all finished.  The md_do_sync() code in the kernel knows it's a really bad idea to check two md arrays that share underlying storage, so it detects that and waits until an outstanding check has completed.  If this takes more than two minutes you'll see that hugely worrying "blocked for more than 120 seconds" message.

As someone somewhere said, this isn't actually a bug, it's the kernel doing what it's supposed to be doing, and it's serialising the md_do_sync() threads rather nicely.  Unfortunately, large numbers of people see the hugely worrying message and they, well, worry.

As this is a once-a-week cron job and it doesn't matter whether it completes in the shortest possible time, I suggested running one check at a time like this:

---------------
devnum=0
for dev in $dev_list; do
    echo "${check[$devnum]}" > /sys/block/$dev/md/sync_action
    while :; do
        sleep 60
        sync_action=`cat /sys/block/$dev/md/sync_action`
        if [ "$sync_action" == "idle" ]; then
            break
        fi
    done
    let devnum++
done
[ -z "$check_list" ] && exit 0
---------------

As I said in bug 590403, I don't have anything to test this on at the moment so it would be nice if someone who does have the problem could update /etc/cron.weekly/99-raid-check to see if this does do the trick.   If you would prefer a real diff (for patch) then let me know and I'll attach one.

Comment 27 Wes 2010-10-06 11:18:45 UTC
This is not cron-specific - it shows up during the sync right after OS install.

cat /var/log/cron 
Oct  6 20:46:37 vz1 crond[3083]: (CRON) STARTUP (V5.0)
Oct  6 21:01:01 vz1 crond[3252]: (root) CMD (run-parts /etc/cron.hourly)
Oct  6 21:16:43 vz1 crontab[3260]: (root) LIST (root)
EOF

Linux vz1.clusterspan.net 2.6.18-194.11.1.el5 #1 SMP Tue Aug 10 19:05:06 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

cat /proc/mdstat 
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid1 sdb1[1] sda1[0]
      25171776 blocks [2/2] [UU]
      	resync=DELAYED
      
md1 : active raid1 sdb2[1] sda2[0]
      264960 blocks [2/2] [UU]
      	resync=DELAYED
      
md3 : active raid1 sdb4[1] sda4[0]
      1418724160 blocks [2/2] [UU]
      [=>...................]  resync =  6.3% (89806592/1418724160) finish=397.6min speed=55702K/sec
      
md2 : active raid1 sdb3[1] sda3[0]
      20972736 blocks [2/2] [UU]
      	resync=DELAYED

top - 21:11:45 up 27 min,  1 user,  load average: 4.07, 4.05, 3.34
Tasks: 132 total,   1 running, 131 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.4%sy,  0.0%ni, 98.6%id,  0.0%wa,  0.0%hi,  1.0%si,  0.0%st
Mem:  24677336k total,   124804k used, 24552532k free,     9840k buffers
Swap: 25171768k total,        0k used, 25171768k free,    35244k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                 
  961 root      10  -5     0    0    0 D  2.0  0.0   0:27.54 md3_resync                                                                                              
  965 root      11  -5     0    0    0 D  0.0  0.0   0:00.00 md2_resync                                                                                              
  969 root      11  -5     0    0    0 D  0.0  0.0   0:00.00 md1_resync                                                                                              
  973 root      12  -5     0    0    0 D  0.0  0.0   0:00.00 md0_resync                                                                                              

INFO: task md1_resync:969 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
md1_resync    D ffff81063dc45e30     0   969    235           972   968 (L-TLB)
 ffff81063dc45d70 0000000000000046 0000000000000000 ffff81063dc1be0c
 ffff81063fc80c0c 0000000000000009 ffff81063ddf3080 ffff81063de077a0
 0000001108c05f4a 000000000000087e ffff81063ddf3268 0000000400000000
Call Trace:
 [<ffffffff800a07c0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8021addc>] md_do_sync+0x1d8/0x833
 [<ffffffff80062ff8>] thread_return+0x62/0xfe
 [<ffffffff800a09d8>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800a07c0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8021b7b0>] md_thread+0xf8/0x10e
 [<ffffffff8021b6b8>] md_thread+0x0/0x10e
 [<ffffffff8003287b>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a07c0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003277d>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

INFO: task md0_resync:973 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
md0_resync    D ffffffff8015081f     0   973    235           974   972 (L-TLB)
 ffff81063dc4fd70 0000000000000046 0000000000000000 ffff81063dc1ba0c
 ffff81063dc1be0c 0000000000000008 ffff81063de077a0 ffff810115f7b080
 0000001108c06484 000000000000053a ffff81063de07988 000000048008b3cb
Call Trace:
 [<ffffffff800a07c0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8021addc>] md_do_sync+0x1d8/0x833
 [<ffffffff80062ff8>] thread_return+0x62/0xfe
 [<ffffffff800a09d8>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800a07c0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8021b7b0>] md_thread+0xf8/0x10e
 [<ffffffff8021b6b8>] md_thread+0x0/0x10e
 [<ffffffff8003287b>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a07c0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003277d>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

INFO: task md2_resync:965 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
md2_resync    D ffffffff8015081f     0   965    235           968   964 (L-TLB)
 ffff81063dc17d70 0000000000000046 0000000000000000 ffff81063faebc0c
 ffff81063dc1ba0c 0000000000000009 ffff81063ddf47a0 ffffffff80308b60
 0000001108c04f86 000000000000081e ffff81063ddf4988 000000008008b3cb
Call Trace:
 [<ffffffff800a07c0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8021addc>] md_do_sync+0x1d8/0x833
 [<ffffffff80062ff8>] thread_return+0x62/0xfe
 [<ffffffff800a09d8>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800a07c0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8021b7b0>] md_thread+0xf8/0x10e
 [<ffffffff8021b6b8>] md_thread+0x0/0x10e
 [<ffffffff8003287b>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a07c0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8019d5a7>] acpi_processor_idle_simple+0x10d/0x30e
 [<ffffffff8003277d>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

Comment 28 john.haxby@oracle.com 2010-10-06 13:58:27 UTC
(In reply to comment #27)
> This is not cron specific - it is showing during sync right after OS install

That's interesting.  What kicks off the resync after install then?

Whatever it is, it's doing the same trick that the cron job does.

Comment 29 Vladimir Zidar 2010-10-06 14:50:46 UTC
(In reply to comment #28)
> That's interesting.  What kicks off the resync after install then?
> 
> Whatever it is, it's doing the same trick that the cron job does.

Resync happens because mismatch_cnt in /sys/block/<your_md_device>/md/mismatch_cnt is greater than 0. 
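
A quick way to look at that across all arrays (a sketch; assumes the standard md sysfs layout):

for md in /sys/block/md*; do
    echo "$md: mismatch_cnt=$(cat $md/md/mismatch_cnt)"
done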

It was explained somewhere how this is harmless with raid1 and raid10. However, I didn't like it and found a way to fix it once and for all - of course Google was my friend back then, but I forgot to bookmark it.

Also, be aware of 99-raid-check 'patches' that disable the check for raid1 arrays, which bring little if any benefit.

Comment 30 Wade Hampton 2010-10-06 17:46:42 UTC
I have an identical call trace to the one in comment 25.
CentOS 5.5, kernel 2.6.18-194.17.1.el5 SMP i386.

This is reported when the server is first loaded (systemimager 
clone of my master server).  My install creates three RAID1 
partitions at install time and these are sync'ed after 
extracting the install image.  These same error messages 
occur when running the 99-raid-check script.

Comment 31 Jim Paradis 2010-10-06 20:36:26 UTC
Created attachment 451986 [details]
Patch to fix softlockup

This patch from upstream fixes the softlockup problem by changing the wait in md_do_sync from TASK_UNINTERRUPTIBLE to TASK_INTERRUPTIBLE.  From the comments on the commit:

    md: Don't wait UNINTERRUPTIBLE for other resync to finish
    
    When two md arrays share some block device (e.g each uses different
    partitions on the one device), a resync of one array will wait for
    the resync on the other to finish.
    
    This can be a long time and as it currently waits TASK_UNINTERRUPTIBLE,
    the softlockup code notices and complains.
    
    So use TASK_INTERRUPTIBLE instead and make sure to flush signals
    before calling schedule.

Comment 32 RHEL Program Management 2010-10-07 12:40:46 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 34 Jarod Wilson 2010-10-19 18:53:41 UTC
in kernel-2.6.18-228.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 36 Hans Henrik Happe 2010-10-22 09:20:09 UTC
I've just tried this test kernel on a new system with 35 2TB SATA disks connected to an HP P411 controller. It doesn't solve the problem. I have a crude workaround (see the last paragraph).

I have three RAID 6 arrays. They do not share disks; however, they do share the controller:


Personalities : [raid6] [raid5] [raid4] 
md2 : active raid6 cciss/c1d34[10] cciss/c1d33[9] cciss/c1d32[8] cciss/c1d31[7] cciss/c1d30[6] cciss/c1d29[5] cciss/c1d28[4] cciss/c1d27[3] cciss/c1d26[2] cciss/c1d25[1] cciss/c1d24[0]
      17581335552 blocks level 6, 32k chunk, algorithm 2 [11/11] [UUUUUUUUUUU]
      [>....................]  resync =  1.2% (24033632/1953481728) finish=3001.1min speed=10712K/sec
      
md1 : active raid6 cciss/c1d23[11] cciss/c1d22[10] cciss/c1d21[9] cciss/c1d20[8] cciss/c1d19[7] cciss/c1d18[6] cciss/c1d17[5] cciss/c1d16[4] cciss/c1d15[3] cciss/c1d14[2] cciss/c1d13[1] cciss/c1d12[0]
      19534817280 blocks level 6, 32k chunk, algorithm 2 [12/12] [UUUUUUUUUUUU]
      [>....................]  resync =  1.2% (24120448/1953481728) finish=2678.9min speed=12003K/sec
      
md0 : active raid6 cciss/c1d11[11] cciss/c1d10[10] cciss/c1d9[9] cciss/c1d8[8] cciss/c1d7[7] cciss/c1d6[6] cciss/c1d5[5] cciss/c1d4[4] cciss/c1d3[3] cciss/c1d2[2] cciss/c1d1[1] cciss/c1d0[0]
      19534817280 blocks level 6, 32k chunk, algorithm 2 [12/12] [UUUUUUUUUUUU]
      [>....................]  resync =  0.4% (8389628/1953481728) finish=2617.0min speed=12386K/sec
      
unused devices: <none>


Using the arrays while they are building (e.g. mkfs.xfs, pvcreate, ls) gives the problem described above:


INFO: task md0_resync:4436 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
md0_resync    D ffffffff80153644     0  4436    263          4445  4435 (L-TLB)
 ffff81061cff9c50 0000000000000046 ffff810037e1a400 ffff81031acd6708
 ffff81031ffc8000 000000000000000a ffff81031f7f4820 ffff81061febb820
 0000009927e6375c 000000000013b0d4 ffff81031f7f4a08 0000000937e1a434
Call Trace:
 [<ffffffff883dcdba>] :raid456:get_active_stripe+0x240/0x4b7
 [<ffffffff8008e414>] default_wake_function+0x0/0xe
 [<ffffffff883e1a08>] :raid456:sync_request+0x69e/0x735
 [<ffffffff880b7485>] :cciss:start_io+0x98/0xd2
 [<ffffffff80153644>] __next_cpu+0x19/0x28
 [<ffffffff8021f05a>] md_do_sync+0x464/0x859
 [<ffffffff800a267e>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8021ebe0>] md_thread+0xf8/0x10e
 [<ffffffff8021eae8>] md_thread+0x0/0x10e
 [<ffffffff80032968>] kthread+0xfe/0x132
 [<ffffffff8009efc6>] __call_usermodehelper+0x0/0x4f
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a267e>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003286a>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

INFO: task mkfs.xfs:4489 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mkfs.xfs      D ffffffff80153644     0  4489   4474                     (NOTLB)
 ffff81061e69db98 0000000000000086 ffff810037e002b4 ffffffff80146143
 ffff81061d80c928 0000000000000008 ffff81061e13b100 ffff81061fd097a0
 00000099271433a4 000000000001f551 ffff81061e13b2e8 0000000f00000002
Call Trace:
 [<ffffffff80146143>] elv_next_request+0x168/0x178
 [<ffffffff8006ec4e>] do_gettimeofday+0x40/0x90
 [<ffffffff800637ca>] io_schedule+0x3f/0x67
 [<ffffffff800f7388>] __blockdev_direct_IO+0x8da/0xa82
 [<ffffffff800e7f02>] blkdev_direct_IO+0x32/0x37
 [<ffffffff800e7e3a>] blkdev_get_blocks+0x0/0x96
 [<ffffffff800c8713>] generic_file_direct_IO+0x108/0x122
 [<ffffffff8001f02a>] generic_file_direct_write+0x60/0xf2
 [<ffffffff80016637>] __generic_file_aio_write_nolock+0x2b8/0x3b6
 [<ffffffff800c8898>] generic_file_aio_write_nolock+0x20/0x6c
 [<ffffffff800c8c6f>] generic_file_write_nolock+0x8f/0xa8
 [<ffffffff800c8bc7>] generic_file_read+0xac/0xc5
 [<ffffffff8006723e>] do_page_fault+0x4fe/0x874
 [<ffffffff800a2896>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800e6cb6>] blkdev_file_write+0x1a/0x1f
 [<ffffffff80016af0>] vfs_write+0xce/0x174
 [<ffffffff80043fe8>] sys_pwrite64+0x50/0x70
 [<ffffffff8005dde9>] error_exit+0x0/0x84
 [<ffffffff8005d116>] system_call+0x7e/0x83


A workaround is to set /proc/sys/dev/raid/speed_limit_max so the collective maximum sync bandwidth is just below the physical maximum of the disk system.
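
For example (a sketch; 30000 KB/sec is only an illustration - the default per-array cap is 200000 KB/sec, so pick a value such that the sum over the arrays stays below what the controller can really sustain):

echo 30000 > /proc/sys/dev/raid/speed_limit_max
cat /proc/mdstat    # the per-array speed= figures should settle under the new cap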

Comment 37 nate.dailey 2010-10-22 15:55:17 UTC
I tried the 228 kernel and it seems to have fixed the problem for me.

I first confirmed that the problem still exists on 2.6.18-194.11.1.el5. I started 4 simultaneous full raid1 resyncs (4 partitions mirrored across 2 disks). I saw 4 "task md*_resync... blocked for more than 120 seconds" messages.

I then rebooted onto the 2.6.18-228.el5 kernel, and repeated the same test using the same 2 disks. I repeated the test twice (resyncing in each direction) and saw no "task blocked" messages.
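
(For anyone wanting to repeat this, one way to force a full raid1 resync is to fail, remove, and re-add a member; /dev/md0 and /dev/sdb1 below are just example names, and this rewrites the whole member, so only do it on a test box:)

mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
mdadm --zero-superblock /dev/sdb1
mdadm /dev/md0 --add /dev/sdb1    # re-adding kicks off a full rebuild
cat /proc/mdstat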

Comment 38 Gerrit Slomma 2010-11-16 10:40:18 UTC
(In reply to comment #29)
> (In reply to comment #28)
> > That's interesting.  What kicks off the resync after install then?
> > 
> > Whatever it is, it's doing the same trick that the cron job does.
> Resync happens because mismatch_cnt in
> /sys/block/<your_md_device>/md/mismatch_cnt is greater than 0. 
> It was explained somewhere how this is harmless with raid1 and raid10. However
> I didn't like it, and found a way to fix it once for good - of course google
> was my friend back then, but I forgot to bookmark it.
> Also, be aware of 99-raid-check 'patches' that disable check for raid1 arrays
> which bring us little if no benefit at all.

I don't know, but I would prefer not to resync the drives every week when using SSDs. An Intel X25-M is specified for 20 GB of writes per day; a 160 GB model would easily exceed this with the weekly resync job. Granted, these are not server-class SSDs, but that is no help to the users of those disks.

Datasheet for the Intel X25-M:
http://download.intel.com/design/flash/nand/mainstream/Specification322296.pdf
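(For reference, the weekly 99-raid-check job mentioned above effectively amounts to something like the following per array; md0 is just an example. A full pass over a 160 GB device once a week averages roughly 160/7 ≈ 23 GB of device traffic per day, which is how it runs past the 20 GB/day figure quoted above.)

  # Read-scrub the whole array (what the weekly cron job triggers):
  echo check > /sys/block/md0/md/sync_action

  # Afterwards, see whether any blocks differed between the mirror halves:
  cat /sys/block/md0/md/mismatch_cnt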

Comment 39 Pasi Karkkainen 2010-11-27 14:37:49 UTC
I'm seeing this problem as well with the 2.6.18-194.26.1.el5 kernel.
It shows up after a fresh installation.

I have multiple md RAID arrays (md0, md1, md2) on the same two partitioned disks.

INFO: task md0_resync:7449 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
md0_resync    D ffff81022f07c040     0  7449     89                3175 (L-TLB)
 ffff8101f147bd70 0000000000000046 ffffffff8002b398 ffff81022fa52a0c
 ffff81022fa52e0c 0000000000000009 ffff810227a24820 ffff81022f07c040
 00000539224d8f56 0000000000001756 ffff810227a24a08 000000030000000d
Call Trace:
 [<ffffffff8002b398>] flush_tlb_page+0xac/0xda
 [<ffffffff80007691>] find_get_page+0x21/0x51
 [<ffffffff800a08fe>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8021af2b>] md_do_sync+0x1d8/0x833
 [<ffffffff8008ca47>] enqueue_task+0x41/0x56
 [<ffffffff8008cab2>] __activate_task+0x56/0x6d
 [<ffffffff8008c897>] dequeue_task+0x18/0x37
 [<ffffffff80062ff8>] thread_return+0x62/0xfe
 [<ffffffff800a0b16>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800a08fe>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8021b8ff>] md_thread+0xf8/0x10e
 [<ffffffff800a08fe>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8021b807>] md_thread+0x0/0x10e
 [<ffffffff8003290a>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a08fe>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003280c>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

Comment 40 Andrius Benokraitis 2010-11-29 05:18:01 UTC
(In reply to comment #39)
> I'm seeing this problem as well with the 2.6.18-194.26.1.el5 kernel.
> It shows up after a fresh installation.
> 
> I have multiple md RAID arrays (md0, md1, md2) on the same two partitioned disks.
> [snip]

Thank you for adding input! Please test the RHEL 5.6 Public Beta and see if this resolves the issue.

Comment 41 Konstantin Belov 2011-01-13 00:41:32 UTC
I have the same problem on two different servers. One is related to a software RAID array. The other does not have any software RAID installed, but it pulls data from remote sites onto the NFS server that runs the software RAID mentioned above.

Both are running the CentOS kernel 2.6.18-194.26.1.el5.

Here are the call traces:

INFO: task md13_resync:23754 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
md13_resync   D ffff8104360a9aa0     0 23754    296         23756 19614 (L-TLB)
 ffff8107cce7bd70 0000000000000046 0000000000000000 ffff810427d7bc0c
 ffff81010e96420c 000000000000000a ffff8104f09d0820 ffff81043623c820
 0001a86c095399b9 00000000000091e3 ffff8104f09d0a08 000000058008b4d7
Call Trace:
 [<ffffffff800a08fe>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8021af2b>] md_do_sync+0x1d8/0x833
 [<ffffffff8008ca47>] enqueue_task+0x41/0x56
 [<ffffffff8008cab2>] __activate_task+0x56/0x6d
 [<ffffffff8008c897>] dequeue_task+0x18/0x37
 [<ffffffff80062ff8>] thread_return+0x62/0xfe
 [<ffffffff800a0b16>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800a08fe>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8021b8ff>] md_thread+0xf8/0x10e
 [<ffffffff800a08fe>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8021b807>] md_thread+0x0/0x10e
 [<ffffffff8003290a>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a08fe>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003280c>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11


INFO: task globus-gridftp-:16523 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
globus-gridft D ffff810001004420     0 16523   7428         16570       (NOTLB)
 ffff81021862fd48 0000000000000086 00000001800c5ab0 0000000000000000
 0000000000000000 0000000000000009 ffff810214ae57e0 ffffffff80308b60
 00012acd660cd9f5 000000000001facb ffff810214ae59c8 000000001862fda0
Call Trace:
 [<ffffffff8006e1d7>] do_gettimeofday+0x40/0x90
 [<ffffffff8838886d>] :nfs:nfs_wait_bit_uninterruptible+0x0/0xd
 [<ffffffff800637ea>] io_schedule+0x3f/0x67
 [<ffffffff88388876>] :nfs:nfs_wait_bit_uninterruptible+0x9/0xd
 [<ffffffff80063a16>] __wait_on_bit+0x40/0x6e
 [<ffffffff8838886d>] :nfs:nfs_wait_bit_uninterruptible+0x0/0xd
 [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff800a0b44>] wake_bit_function+0x0/0x23
 [<ffffffff8838c2ce>] :nfs:nfs_wait_on_requests_locked+0x70/0xca
 [<ffffffff8838d322>] :nfs:nfs_sync_inode_wait+0x60/0x1db
 [<ffffffff88383167>] :nfs:nfs_do_fsync+0x1f/0x3f
 [<ffffffff88383a0d>] :nfs:nfs_file_flush+0x84/0xac
 [<ffffffff80023be3>] filp_close+0x36/0x64
 [<ffffffff8001e0d1>] sys_close+0x88/0xbd
 [<ffffffff8005d116>] system_call+0x7e/0x83

We had other problems locking up the second server, so I am not sure this is the main issue for us, but it would be nice to see it resolved.

Comment 43 john.haxby@oracle.com 2011-01-13 09:46:28 UTC
(In reply to comment #41)
> I have the same problem on two different servers. One is related to a
> software RAID array. The other does not have any software RAID installed,
> but it pulls data from remote sites onto the NFS server that runs the
> software RAID mentioned above.
> [snip]
> INFO: task globus-gridftp-:16523 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> globus-gridft D ffff810001004420     0 16523   7428         16570       (NOTLB)
>  ffff81021862fd48 0000000000000086 00000001800c5ab0 0000000000000000
>  0000000000000000 0000000000000009 ffff810214ae57e0 ffffffff80308b60
>  00012acd660cd9f5 000000000001facb ffff810214ae59c8 000000001862fda0
> Call Trace:
>  [<ffffffff8006e1d7>] do_gettimeofday+0x40/0x90
>  [<ffffffff8838886d>] :nfs:nfs_wait_bit_uninterruptible+0x0/0xd
>  [<ffffffff800637ea>] io_schedule+0x3f/0x67

This is a completely separate issue: this bug deals with problems where md_do_sync shows up in the stack trace. What you have here is an NFS problem -- I'd suggest filing a separate bug for it (if one hasn't already been filed).
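(A quick, illustrative way to tell the two apart from the logs; the path and pattern are just examples:)

  # Show the hung-task reports and check whether md_do_sync appears in them;
  # traces belonging to this bug mention md_do_sync, NFS hangs will not.
  grep -A 20 'blocked for more than 120 seconds' /var/log/messages | grep md_do_sync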

Comment 44 Konstantin Belov 2011-01-13 19:36:16 UTC
John:

The first trace I posted is for md_do_sync (md13_resync). I agree that the second one is an NFS issue. My impression was that both problems are in the kernel, though the kernel may simply be doing its job and informing me about them.
I see the status has changed to "release pending"; hopefully the issue will be resolved.

Comment 45 errata-xmlrpc 2011-01-13 21:16:31 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html

Comment 46 Paolo Bonzini 2011-04-01 13:31:01 UTC
*** Bug 597855 has been marked as a duplicate of this bug. ***

Comment 47 Jarod Wilson 2011-06-21 15:05:42 UTC
*** Bug 590403 has been marked as a duplicate of this bug. ***