Description of problem:
Server (Fujitsu Primergy RX200 S5, 2x Intel Xeon E5520, 8GB RAM, RAID controller based on LSI MegaRaid 1078) hangs from time to time. I had one complete lockup today.

Version-Release number of selected component (if applicable):
Kernel version is 2.6.18-194.3.1.el5. Driver is megaraid_sas.

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

INFO: task kjournald:1141 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kjournald D ffffffff80150462 0 1141 139 1171 1072 (L-TLB)
ffff81013da4fdd0 0000000000000046 0000000000000100 0000000000000000
0000000000000000 000000000000000a ffff81023fc25040 ffff810143a027e0
00000e6850058566 0000000000001064 ffff81023fc25228 0000000400000000
Call Trace:
[<ffffffff880335cf>] :jbd:journal_commit_transaction+0x16d/0x1066
[<ffffffff800a0abe>] autoremove_wake_function+0x0/0x2e
[<ffffffff8004b36f>] try_to_del_timer_sync+0x7f/0x88
[<ffffffff880375d3>] :jbd:kjournald+0xc1/0x213
[<ffffffff800a0abe>] autoremove_wake_function+0x0/0x2e
[<ffffffff88037512>] :jbd:kjournald+0x0/0x213
[<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
[<ffffffff80032894>] kthread+0xfe/0x132
[<ffffffff8005dfb1>] child_rip+0xa/0x11
[<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
[<ffffffff80032796>] kthread+0x0/0x132
[<ffffffff8005dfa7>] child_rip+0x0/0x11

INFO: task snmpd:4645 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
snmpd D ffffffff80150462 0 4645 1 4660 4631 (NOTLB)
ffff81023d9f1b88 0000000000000086 0000000000000000 ffffffff8022c9c0
000000000000001c 000000000000000a ffff81023ec6d0c0 ffff810143b0e7e0
00000e696bb83c73 000000000006f3d3 ffff81023ec6d2a8 0000000880250085
Call Trace:
[<ffffffff8022c9c0>] put_cmsg+0x8c/0xc6
[<ffffffff88032002>] :jbd:start_this_handle+0x2e5/0x36c
[<ffffffff800a0abe>] autoremove_wake_function+0x0/0x2e
[<ffffffff88032152>] :jbd:journal_start+0xc9/0x100
[<ffffffff88050c72>] :ext3:ext3_dirty_inode+0x28/0x7b
[<ffffffff80013c64>] __mark_inode_dirty+0x29/0x16e
[<ffffffff8000c41b>] do_generic_mapping_read+0x342/0x354
[<ffffffff8000d0cc>] file_read_actor+0x0/0x159
[<ffffffff8000c579>] __generic_file_aio_read+0x14c/0x198
[<ffffffff80016da5>] generic_file_aio_read+0x34/0x39
[<ffffffff8000cdf5>] do_sync_read+0xc7/0x104
[<ffffffff800a0abe>] autoremove_wake_function+0x0/0x2e
[<ffffffff8000e14f>] do_mmap_pgoff+0x66c/0x7d7
[<ffffffff8000b681>] vfs_read+0xcb/0x171
[<ffffffff80011bd2>] sys_read+0x45/0x6e
[<ffffffff8005d28d>] tracesys+0xd5/0xe0
I also have this problem with my HP DL360 G2 using a SmartArray 5i. It began when I upgraded to kernel version 2.6.18-194.3.1.el5, using the cciss driver. I get similar output on the screen, and the server freezes completely, which unfortunately leaves me without any log entries in messages or dmesg to paste here. I'll report back if I manage to capture any logs the next time it occurs.
I've seen the same issue on an HP DL380 G5 with 32GB RAM and SAN-attached storage, using the cciss driver for the internal disk mirror.

uname -a
Linux hostname 2.6.18-194.3.1.el5 #1 SMP Sun May 2 04:17:42 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

In Cacti, I saw massive load on the server during this time, but why would that block those processes for such a long time? The error occurred for the first time last night, and it did not show up on the system with RHEL 5.4.

Jul 20 23:18:49 hostname kernel: INFO: task cmahostd:27230 blocked for more than 120 seconds.
Jul 20 23:18:49 hostname kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 20 23:18:49 hostname kernel: cmahostd D ffff81079ce4a080 0 27230 1 27232 27228 (NOTLB)
Jul 20 23:18:49 hostname kernel: ffff810770bcbdd8 0000000000000082 00000000ffffffff 0000010100000000
Jul 20 23:18:49 hostname kernel: 0000000700b118fa 000000000000000a ffff8107716547a0 ffff81079ce4a080
Jul 20 23:18:49 hostname kernel: 0009ab4d2d148ee8 00000000009216ca ffff810771654988 000000028002c9e4
Jul 20 23:18:49 hostname kernel: Call Trace:
Jul 20 23:18:49 hostname kernel: [<ffffffff8000ea46>] link_path_walk+0xa6/0xb2
Jul 20 23:18:49 hostname kernel: [<ffffffff800646ac>] __down_read+0x7a/0x92
Jul 20 23:18:49 hostname kernel: [<ffffffff800c30b1>] access_process_vm+0x47/0x18d
Jul 20 23:18:49 hostname kernel: [<ffffffff8000f2d0>] __alloc_pages+0x78/0x308
Jul 20 23:18:49 hostname kernel: [<ffffffff8010627f>] proc_pid_cmdline+0x69/0xf4
Jul 20 23:18:49 hostname kernel: [<ffffffff8010678b>] proc_info_read+0x5f/0xb9
Jul 20 23:18:49 hostname kernel: [<ffffffff8000b681>] vfs_read+0xcb/0x171
Jul 20 23:18:49 hostname kernel: [<ffffffff80011bd2>] sys_read+0x45/0x6e
Jul 20 23:18:49 hostname kernel: [<ffffffff8006149d>] sysenter_do_call+0x1e/0x76
Jul 20 23:18:49 hostname kernel:
Jul 20 23:18:49 hostname kernel: INFO: task inosrv:32294 blocked for more than 120 seconds.
Jul 20 23:18:49 hostname kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 20 23:18:49 hostname kernel: inosrv D ffff81025c0d47a0 0 32294 1 32435 32293 (NOTLB)
Jul 20 23:18:49 hostname kernel: ffff8105069a9e18 0000000000000086 0000000000007e26 0000000042876e00
Jul 20 23:18:49 hostname kernel: 00000000ffffffda 000000000000000a ffff8104dfebc080 ffff81025c0d47a0
Jul 20 23:18:49 hostname kernel: 0009ab4df506e05e 0000000000000f0e ffff8104dfebc268 0000000000000000
Jul 20 23:18:49 hostname kernel: Call Trace:
Jul 20 23:18:49 hostname kernel: [<ffffffff800646ac>] __down_read+0x7a/0x92
Jul 20 23:18:49 hostname kernel: [<ffffffff80066ad0>] do_page_fault+0x446/0x874
Jul 20 23:18:49 hostname kernel: [<ffffffff8005dde9>] error_exit+0x0/0x84
Jul 20 23:18:49 hostname kernel:
Jul 20 23:18:49 hostname kernel: INFO: task inosrv:32468 blocked for more than 120 seconds.
Jul 20 23:18:49 hostname kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 20 23:18:49 hostname kernel: inosrv D ffff81000100caa0 0 32468 1 32469 32467 (NOTLB)
Jul 20 23:18:49 hostname kernel: ffff810382003d28 0000000000000086 ffff810382003c98 ffffffff8008ca8c
Jul 20 23:18:49 hostname kernel: 0009ab4b17e3fd90 0000000000000009 ffff81056d4277e0 ffff81082ff18100
Jul 20 23:18:49 hostname kernel: 0009ab4b17e2e377 000000000000a462 ffff81056d4279c8 0000000100000003
Jul 20 23:18:49 hostname kernel: Call Trace:
Jul 20 23:18:49 hostname kernel: [<ffffffff8008ca8c>] __activate_task+0x56/0x6d
Jul 20 23:18:49 hostname kernel: [<ffffffff8859083d>] :vxfs:vx_svar_sleep_unlock+0x53/0x68
Jul 20 23:18:49 hostname kernel: [<ffffffff8008d087>] default_wake_function+0x0/0xe
Jul 20 23:18:49 hostname kernel: [<ffffffff8857c9a8>] :vxfs:vx_rwsleep_rec_lock+0x74/0xac
Jul 20 23:18:49 hostname kernel: [<ffffffff88558693>] :vxfs:vx_recsmp_rangelock+0xf/0x1d
Jul 20 23:18:49 hostname kernel: [<ffffffff885714ac>] :vxfs:vx_irwlock+0x37/0x41
Jul 20 23:18:49 hostname kernel: [<ffffffff885b520f>] :vxfs:vx_vop_read+0x101/0x1ca
Jul 20 23:18:49 hostname kernel: [<ffffffff885b7ab8>] :vxfs:vx_read+0x199/0x1e9
Jul 20 23:18:49 hostname kernel: [<ffffffff8000b681>] vfs_read+0xcb/0x171
Jul 20 23:18:49 hostname kernel: [<ffffffff800135f7>] sys_pread64+0x50/0x70
Jul 20 23:18:49 hostname kernel: [<ffffffff8005d229>] tracesys+0x71/0xe0
Jul 20 23:18:49 hostname kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Jul 20 23:18:49 hostname kernel:
Jul 20 23:18:49 hostname kernel: INFO: task inosrv:20591 blocked for more than 120 seconds.
Jul 20 23:18:49 hostname kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 20 23:18:49 hostname kernel: inosrv D ffff810001025e20 0 20591 1 20645 20477 (NOTLB)
Jul 20 23:18:49 hostname kernel: ffff8105f5bfff08 0000000000000086 0000000000000000 ffff81042e6927a0
Jul 20 23:18:49 hostname kernel: ffffffff8008d087 000000000000000a ffff81042e6927a0 ffff81082fe29080
Jul 20 23:18:49 hostname kernel: 0009ab4b5349ae90 0000000000001b95 ffff81042e692988 0000000400402040
Jul 20 23:18:49 hostname kernel: Call Trace:
Jul 20 23:18:50 hostname kernel: [<ffffffff8008d087>] default_wake_function+0x0/0xe
Jul 20 23:18:50 hostname kernel: [<ffffffff80064613>] __down_write_nested+0x7a/0x92
Jul 20 23:18:50 hostname kernel: [<ffffffff800161ee>] sys_munmap+0x32/0x59
Jul 20 23:18:50 hostname kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Jul 20 23:18:50 hostname kernel:
Hi, same problem here with RHEL 5.5. Server is IBM xServer 3850M2, 4x QuadXeon, 128GB RAM. I get those messages under heavy load (oracle in this case).

# uname -a
Linux hostname 2.6.18-194.el5 #1 SMP Tue Mar 16 21:52:39 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

Is downgrading to 2.6.164 the temporary solution for now?

INFO: task oracle:26522 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
oracle D ffff8100010738a0 0 26522 26503 (NOTLB)
ffff811b36cb3d08 0000000000000086 0000000000100000 ffff810001084ef8
ffff811b36cb3c80 0000000000000007 ffff811b2a25e820 ffff81202fe99100
000551fb9abc78d4 00000000000c315b ffff811b2a25ea08 0000000d9ab5c61e
Call Trace:
[<ffffffff80064c6f>] __mutex_lock_slowpath+0x60/0x9b
[<ffffffff80064cb9>] .text.lock.mutex+0xf/0x14
[<ffffffff80064c06>] __mutex_unlock_slowpath+0x2a/0x33
[<ffffffff8002174c>] generic_file_aio_write+0x4e/0xc1
[<ffffffff884617b1>] :nfs:nfs_file_write+0xd8/0x14f
[<ffffffff80018266>] do_sync_write+0xc7/0x104
[<ffffffff800a1ba4>] autoremove_wake_function+0x0/0x2e
[<ffffffff80016a49>] vfs_write+0xce/0x174
[<ffffffff80044209>] sys_pwrite64+0x50/0x70
[<ffffffff8005e229>] tracesys+0x71/0xe0
[<ffffffff8005e28d>] tracesys+0xd5/0xe0
I did a stress test with "fio", using the following job file:

-------------------------------------
[iometer-file-access-server]
bssplit=512/10:1k/5:2k/5:4k/60:8k/2:16k/4:32k/4:64k/10
rw=randrw
rwmixread=70
direct=1
size=10g
ioengine=libaio
iodepth=256
write_bw_log
write_lat_log
numjobs=6
-------------------------------------

When testing on a file system living on an HP Smart Array P410i (product revision C, firmware v. 3.50), I got several of these in syslog:

-------------------------------------
Sep 1 14:19:36 oslo kernel: INFO: task cmaidad:18825 blocked for more than 120 seconds.
Sep 1 14:19:36 oslo kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 1 14:19:36 oslo kernel: cmaidad D ffffffff80150839 0 18825 1 18859 18747 (NOTLB)
Sep 1 14:19:36 oslo kernel: ffff8100c9ed7978 0000000000003086 0000000000003086 ffffc20010092080
Sep 1 14:19:36 oslo kernel: ffff81006e62c610 0000000000000007 ffff8100d710b040 ffff810037ca6100
Sep 1 14:19:36 oslo kernel: 000007f916a489b7 000000000000ccd2 ffff8100d710b228 0000000180022205
Sep 1 14:19:36 oslo kernel: Call Trace:
Sep 1 14:19:36 oslo kernel: [<ffffffff8006e1db>] do_gettimeofday+0x40/0x90
Sep 1 14:19:36 oslo kernel: [<ffffffff8001552b>] sync_buffer+0x0/0x3f
Sep 1 14:19:36 oslo kernel: [<ffffffff800637ea>] io_schedule+0x3f/0x67
Sep 1 14:19:36 oslo kernel: [<ffffffff80015566>] sync_buffer+0x3b/0x3f
Sep 1 14:19:36 oslo kernel: [<ffffffff80063a16>] __wait_on_bit+0x40/0x6e
Sep 1 14:19:36 oslo kernel: [<ffffffff8001552b>] sync_buffer+0x0/0x3f
Sep 1 14:19:36 oslo kernel: [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
Sep 1 14:19:36 oslo kernel: [<ffffffff800a0a06>] wake_bit_function+0x0/0x23
Sep 1 14:19:36 oslo kernel: [<ffffffff80017549>] ll_rw_block+0x8c/0xab
Sep 1 14:19:36 oslo kernel: [<ffffffff8000e8dd>] __block_prepare_write+0x363/0x3a6
Sep 1 14:19:36 oslo kernel: [<ffffffff8804eceb>] :ext3:ext3_get_block+0x0/0xf7
Sep 1 14:19:36 oslo kernel: [<ffffffff800e15cf>] block_write_begin+0x80/0xcf
Sep 1 14:19:36 oslo kernel: [<ffffffff88050395>] :ext3:ext3_write_begin+0xe8/0x1cc
Sep 1 14:19:36 oslo kernel: [<ffffffff8804eceb>] :ext3:ext3_get_block+0x0/0xf7
Sep 1 14:19:36 oslo kernel: [<ffffffff8000fd7a>] generic_file_buffered_write+0x14b/0x675
Sep 1 14:19:36 oslo kernel: [<ffffffff880317ae>] :jbd:journal_stop+0x1f3/0x1ff
Sep 1 14:19:36 oslo kernel: [<ffffffff8001669e>] __generic_file_aio_write_nolock+0x369/0x3b6
Sep 1 14:19:36 oslo kernel: [<ffffffff80021841>] generic_file_aio_write+0x65/0xc1
Sep 1 14:19:36 oslo kernel: [<ffffffff8804c1b6>] :ext3:ext3_file_write+0x16/0x91
Sep 1 14:19:36 oslo kernel: [<ffffffff800182c3>] do_sync_write+0xc7/0x104
Sep 1 14:19:36 oslo kernel: [<ffffffff800a09d8>] autoremove_wake_function+0x0/0x2e
Sep 1 14:19:36 oslo kernel: [<ffffffff80016aa6>] vfs_write+0xce/0x174
Sep 1 14:19:36 oslo kernel: [<ffffffff80017373>] sys_write+0x45/0x6e
Sep 1 14:19:36 oslo kernel: [<ffffffff8006149d>] sysenter_do_call+0x1e/0x76
-------------------------------------

Meanwhile, the server was largely unresponsive: existing processes ran well, but attempts to log into the server, or to start new programs on it, timed out.

When running the exact same test on a file system living on a fibre-channel connected XIV storage system, there were no problems.

In my case, the RHEL was 5.5 x86_64 with latest updates, using in-box drivers only. Server: HP Proliant DL380G6.

I think that the priority of this problem needs to be increased.
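To get a feel for the I/O mix in the job file above, the bssplit value can be reduced to an average request size. The following is only a minimal sketch and was not part of the original test: the helper names are made up for illustration, and it assumes that the k suffix means 1024 bytes (which, as far as I know, is fio's default).

```python
def parse_size(token):
    """Convert a size such as '512', '4k' or '1m' to bytes (assumes 1k = 1024)."""
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    token = token.strip().lower()
    if token[-1] in units:
        return int(token[:-1]) * units[token[-1]]
    return int(token)


def parse_bssplit(spec):
    """Split 'size/percent:size/percent:...' into (bytes, percent) pairs."""
    pairs = []
    for entry in spec.split(":"):
        size, percent = entry.split("/")
        pairs.append((parse_size(size), float(percent)))
    return pairs


def mean_block_size(pairs):
    """Percentage-weighted average block size in bytes."""
    total_percent = sum(percent for _, percent in pairs)
    return sum(size * percent for size, percent in pairs) / total_percent


# The bssplit value from the job file above.
BSSPLIT = "512/10:1k/5:2k/5:4k/60:8k/2:16k/4:32k/4:64k/10"

if __name__ == "__main__":
    pairs = parse_bssplit(BSSPLIT)
    print("percent total:", sum(percent for _, percent in pairs))
    print("mean block size: %.2f bytes" % mean_block_size(pairs))
```

Under that assumption the percentages add up to 100 and the average request is roughly 11 KiB, with 60% of all requests at 4k, so the test is dominated by small random I/O.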
A few more observations:

1. It seems that the problem doesn't always result in "task ... blocked for more than 120 seconds"; sometimes, the system is "just" unresponsive except for some of the already running processes (e.g., "top" will keep working, if started before the stress tests).

2. Changing the I/O scheduler for the involved devices from "cfq" to "noop" seems to make the problem go away.
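When comparing results between machines, it helps to record which scheduler is actually active on each device. The kernel lists the available schedulers in /sys/block/<device>/queue/scheduler and marks the active one with square brackets. Below is a minimal sketch of parsing that file's contents; the function names and the sample string are illustrative only and were not captured from any system in this report.

```python
import re


def available_schedulers(contents):
    """Return all scheduler names listed in the sysfs file contents."""
    return [name.strip("[]") for name in contents.split()]


def active_scheduler(contents):
    """Return the name shown in square brackets, which marks the active scheduler."""
    match = re.search(r"\[([^\]\s]+)\]", contents)
    if match is None:
        raise ValueError("no active scheduler marked in: %r" % contents)
    return match.group(1)


# Illustrative contents only, not taken from a real machine.
SAMPLE = "noop anticipatory deadline [cfq]\n"

if __name__ == "__main__":
    print(active_scheduler(SAMPLE))
    print(available_schedulers(SAMPLE))
```

On a real system the string would come from reading the scheduler file of the device that holds the affected file system.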
Created attachment 442419: Look of "nmon" and "top" when the system is refusing new logins and new programs can't be started
Created attachment 442424: Look of "nmon" and "top" when the system is happily accepting new logins and new programs can be started
I am seeing this issue as well. System is a Dell PE2950, 2x 5410, 12GB, PERC 6, 4x 147GB SAS 15K RAID-10.

uname -a
Linux 2.6.18-194.17.4.el5 #1 SMP Mon Oct 25 15:50:53 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

The issue appears randomly every 12 - 36 hours. Once the system locks up, it eventually recovers after 10-20 minutes, but during this time the machine is completely unresponsive.

Nov 7 20:38:20 kernel: INFO: task kjournald:2520 blocked for more than 120 seconds.
Nov 7 20:38:20 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 7 20:38:20 kernel: kjournald D ffff81032e56abc0 0 2520 137 2522 2518 (L-TLB)
Nov 7 20:38:20 kernel: ffff81032b439dd0 0000000000000046 0000000000000200 0000000000000000
Nov 7 20:38:20 kernel: 0000000000000000 000000000000000a ffff81032f43f0c0 ffff8102ff802860
Nov 7 20:38:20 kernel: 0001e0273cd29582 0000000000000f5f ffff81032f43f2a8 0000000000000000
Nov 7 20:38:20 kernel: Call Trace:
Nov 7 20:38:20 kernel: [<ffffffff880335cf>] :jbd:journal_commit_transaction+0x16d/0x1066
Nov 7 20:38:20 kernel: [<ffffffff800a09d4>] autoremove_wake_function+0x0/0x2e
Nov 7 20:38:20 kernel: [<ffffffff8004b132>] try_to_del_timer_sync+0x7f/0x88
Nov 7 20:38:20 kernel: [<ffffffff880375d3>] :jbd:kjournald+0xc1/0x213
Nov 7 20:38:20 kernel: [<ffffffff800a09d4>] autoremove_wake_function+0x0/0x2e
Nov 7 20:38:40 kernel: [<ffffffff800a07bc>] keventd_create_kthread+0x0/0xc4
Nov 7 20:38:40 kernel: [<ffffffff88037512>] :jbd:kjournald+0x0/0x213
Nov 7 20:38:40 kernel: [<ffffffff800a07bc>] keventd_create_kthread+0x0/0xc4
Nov 7 20:38:40 kernel: [<ffffffff8003290a>] kthread+0xfe/0x132
Nov 7 20:38:40 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Nov 7 20:38:40 kernel: [<ffffffff800a07bc>] keventd_create_kthread+0x0/0xc4
Nov 7 20:38:40 kernel: [<ffffffff8003280c>] kthread+0x0/0x132
Nov 7 20:38:40 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11

I'm now testing whether changing the scheduler from cfq to noop, as a previous commenter mentioned, resolves the issue.
I have the same issue. My system has an AMD Athlon(tm) Dual Core Processor 4050e, 4 GB RAM, kernel 2.6.18-194.26.1.el5. The system is running very slowly.

Uptime:
18:30:03 up 3:25, 2 users, load average: 7.54, 7.55, 8.10

dmesg:
INFO: task mysqld:7384 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mysqld D ffff81011ccbb0f8 0 7384 2411 7397 7368 (NOTLB)
ffff8100b7cd5af8 0000000000000086 ffff8100b7cd5b98 000000001c827080
ffff810001004498 0000000000000009 ffff810108d1e080 ffff81011ccbb0c0
000004c5cbe33e0d 000000000001e762 ffff810108d1e268 000000008008cab2
Call Trace:
[<ffffffff80046edb>] try_to_wake_up+0x472/0x484
[<ffffffff80063c6f>] __mutex_lock_slowpath+0x60/0x9b
[<ffffffff80063cb9>] .text.lock.mutex+0xf/0x14
[<ffffffff800a0b1f>] autoremove_wake_function+0x9/0x2e
[<ffffffff88035883>] :jbd:__log_wait_for_space+0x51/0xaa
[<ffffffff88032040>] :jbd:start_this_handle+0x323/0x36c
[<ffffffff88032152>] :jbd:journal_start+0xc9/0x100
[<ffffffff88050c72>] :ext3:ext3_dirty_inode+0x28/0x7b
[<ffffffff80013cf0>] __mark_inode_dirty+0x29/0x16e
[<ffffffff8000c4db>] do_generic_mapping_read+0x347/0x359
[<ffffffff8000d18c>] file_read_actor+0x0/0x159
[<ffffffff8000c639>] __generic_file_aio_read+0x14c/0x198
[<ffffffff80016e31>] generic_file_aio_read+0x34/0x39
[<ffffffff8000ceb5>] do_sync_read+0xc7/0x104
[<ffffffff800a0b16>] autoremove_wake_function+0x0/0x2e
[<ffffffff800624b6>] __sched_text_start+0xf6/0xbd6
[<ffffffff8003265d>] sys_faccessat+0x148/0x18d
[<ffffffff8000b729>] vfs_read+0xcb/0x171
[<ffffffff80011c3b>] sys_read+0x45/0x6e
[<ffffffff8005d28d>] tracesys+0xd5/0xe0
I'm seeing this on an HP DL380 G5 with a Smart Array P400 RAID card. As in Brad's case, this seems to occur every 12-36 hours or so, always under heavy load. Here's the stack trace I got:

INFO: task syslogd:3528 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
syslogd D ffff8103c5743ca8 0 3528 1 3531 1070 (NOTLB)
ffff8109a982fd88 0000000000000082 0000000000000296 0000000000000003
ffff8109a982fd18 000000000000000a ffff8109ace21820 ffff8109af9e5080
0000f161cc9bd0a8 0000000000014389 ffff8109ace21a08 00000002c5743ca8
Call Trace:
[<ffffffff88036d8a>] :jbd:log_wait_commit+0xa3/0xf5
[<ffffffff800a0b16>] autoremove_wake_function+0x0/0x2e
[<ffffffff8803178a>] :jbd:journal_stop+0x1cf/0x1ff
[<ffffffff8002fc6e>] writeback_single_inode+0x1e9/0x328
[<ffffffff800e0898>] do_readv_writev+0x26e/0x291
[<ffffffff800f34e5>] sync_inode+0x24/0x33
[<ffffffff8804c36d>] :ext3:ext3_sync_file+0xc9/0xdc
[<ffffffff800501b6>] do_fsync+0x52/0xa4
[<ffffffff800e111d>] do_fsync+0x23/0x36
[<ffffffff8005d116>] system_call+0x7e/0x83
I am seeing this on an HP ProLiant DL380 G6 with an HP Smart Array P410i Controller and two HP Smart Array P812 Controllers.

kernel: INFO: task kjournald:3367 blocked for more than 120 seconds.
Nov 18 17:39:27 db-rdco1-e-r1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: kjournald D ffffffff80150462 0 3367 455 3369 3365 (L-TLB)
kernel: ffff8102aaf55dd0 0000000000000046 ffff81015919d8b0 ffff8100c785b4f0
kernel: 0000000000000000 000000000000000a ffff810c1c4370c0 ffff810252a60820
kernel: 0004a484de9a85e7 000000000000082c ffff810c1c4372a8 000000088008c871
kernel: Call Trace:
kernel: [<ffffffff880335cf>] :jbd:journal_commit_transaction+0x16d/0x1066
kernel: [<ffffffff800a0abe>] autoremove_wake_function+0x0/0x2e
kernel: [<ffffffff8004b36f>] try_to_del_timer_sync+0x7f/0x88
kernel: [<ffffffff880375d3>] :jbd:kjournald+0xc1/0x213
kernel: [<ffffffff800a0abe>] autoremove_wake_function+0x0/0x2e
kernel: [<ffffffff88037512>] :jbd:kjournald+0x0/0x213
kernel: [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
kernel: [<ffffffff80032894>] kthread+0xfe/0x132
kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
kernel: [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
kernel: [<ffffffff80032796>] kthread+0x0/0x132
kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Created attachment 462353: Crash Data Log
Created attachment 462354: Crash Log
I was able to recreate the panic, and I got a vmcore. I captured the following information using these commands in crash:

sys > crash_data.log
bt -a >> crash_data.log
mod >> crash_data.log
log > crash_log.log

I have attached the two files, crash_data.log and crash_log.log.
And yet another application with the same issue here:

Nov 23 18:02:37 HGALUX31 kernel: INFO: task dsmserv:10519 blocked for more than 120 seconds.
Nov 23 18:02:37 HGALUX31 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 23 18:02:37 HGALUX31 kernel: dsmserv D ffff81000100daa0 0 10519 1 10567 10489 (NOTLB)
Nov 23 18:02:37 HGALUX31 kernel: ffff810791a79b68 0000000000000086 0000000000000001 ffff810ee7c6e4e8
Nov 23 18:02:37 HGALUX31 kernel: ffff81102d5b5968 0000000000000008 ffff810f016f7100 ffff81102ff12100
Nov 23 18:02:37 HGALUX31 kernel: 0008f386bc0abe14 0000000000005d2e ffff810f016f72e8 000000018807aa5a
Nov 23 18:02:37 HGALUX31 kernel: Call Trace:
Nov 23 18:02:37 HGALUX31 kernel: [<ffffffff80064167>] wait_for_completion+0x79/0xa2
Nov 23 18:02:37 HGALUX31 kernel: [<ffffffff8008e16d>] default_wake_function+0x0/0xe
Nov 23 18:02:37 HGALUX31 kernel: [<ffffffff885ab5a0>] :lin_tape:lin_tape_execute_async+0xc5/0xfb
Nov 23 18:02:37 HGALUX31 kernel: [<ffffffff885ab64c>] :lin_tape:tape_execute_scsi_command+0x76/0xa4
Nov 23 18:02:37 HGALUX31 kernel: [<ffffffff885ae808>] :lin_tape:tape_send_scsi_io+0x192/0x1fb
Nov 23 18:02:37 HGALUX31 kernel: [<ffffffff885ae908>] :lin_tape:tape_send_scsi_cmd+0x97/0x220
Nov 23 18:02:37 HGALUX31 kernel: [<ffffffff885b17ba>] :lin_tape:tape_set_pos+0x286/0x38a
Nov 23 18:02:37 HGALUX31 kernel: [<ffffffff8859d754>] :lin_tape:drvioc_exe+0xb1/0x107
Nov 23 18:02:37 HGALUX31 kernel: [<ffffffff885a760f>] :lin_tape:lin_tape_drive_ioctl+0xe89/0x10af
Nov 23 18:02:37 HGALUX31 kernel: [<ffffffff8859edea>] :lin_tape:stiocsetpos+0x0/0xd
Nov 23 18:02:37 HGALUX31 kernel: [<ffffffff8859637b>] :lin_tape:lin_tape_ioctl_drive+0x1c4/0x1eb
Nov 23 18:02:37 HGALUX31 kernel: [<ffffffff8859a1d9>] :lin_tape:lin_tape_ioctl+0x7b/0xba
Nov 23 18:02:37 HGALUX31 kernel: [<ffffffff800424bd>] do_ioctl+0x55/0x6b
Nov 23 18:02:37 HGALUX31 kernel: [<ffffffff800304d6>] vfs_ioctl+0x457/0x4b9
Nov 23 18:02:37 HGALUX31 kernel: [<ffffffff8004cbb7>] sys_ioctl+0x59/0x78
Nov 23 18:02:37 HGALUX31 kernel: [<ffffffff8005e28d>] tracesys+0xd5/0xe0

It seems to come up under heavy disk I/O, as the examples so far show, e.g. cp, mysql, oracle, kjournald, jbd.
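To see at a glance which tasks get reported most often, the "blocked for more than N seconds" lines can be tallied per task name. This is only a rough sketch: the regular expression is written against the message format quoted in this report, the function names are made up, and the sample text simply reuses a few lines posted above.

```python
import re
from collections import Counter

# Matches e.g. "INFO: task kjournald:1141 blocked for more than 120 seconds."
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<name>.+?):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def hung_tasks(log_text):
    """Yield (task name, pid, timeout in seconds) for each hung-task report."""
    for match in HUNG_TASK_RE.finditer(log_text):
        yield match.group("name"), int(match.group("pid")), int(match.group("secs"))


def count_by_name(log_text):
    """Count hung-task reports per task name."""
    return Counter(name for name, _, _ in hung_tasks(log_text))


# A few report lines copied from earlier comments in this bug.
SAMPLE_LOG = """\
Jul 20 23:18:49 hostname kernel: INFO: task cmahostd:27230 blocked for more than 120 seconds.
Jul 20 23:18:49 hostname kernel: INFO: task inosrv:32294 blocked for more than 120 seconds.
Jul 20 23:18:49 hostname kernel: INFO: task inosrv:32468 blocked for more than 120 seconds.
INFO: task kjournald:1141 blocked for more than 120 seconds.
"""

if __name__ == "__main__":
    for name, count in count_by_name(SAMPLE_LOG).most_common():
        print(name, count)
```

Feeding it the contents of /var/log/messages instead of the sample would give a per-task count for a whole incident.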
An enterprise distribution should not allow a problem like this to exist for so long, especially when it has been seen in the wild by so many different users. Why is the priority of the issue still "low"? Can Red Hat's support system be used to push things? If so, I'll create a support ticket regarding this, although it seems silly as it is already so well documented.
Hello,

Unfortunately we can confirm the problem - we have seen it on more than 512 nodes running in our cluster, as well as on a few service nodes and virtual machines. The issue is present in the 2.6.18-194.26.1.el5 kernel and also in other versions.

I'm not sure when I saw these errors for the first time, but I suspect the problem was introduced in the 194 kernel line.

The problem was noticed for the first time during the RAID array rebuild process:

INFO: task md3_resync:8078 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
md3_resync D ffff810123d7d898 0 8078 48 8077 (L-TLB)
ffff810016d53d70 0000000000000046 0000000000000000 ffff810126ce920c
ffff810126efa00c 000000000000000a ffff810127965080 ffff810123d7d860
000345b561f23b3e 0000000000001b9f ffff810127965268 000000008008b4d7
Call Trace:
[<ffffffff800a08fe>] keventd_create_kthread+0x0/0xc4
[<ffffffff8021af2b>] md_do_sync+0x1d8/0x833
[<ffffffff8008ca47>] enqueue_task+0x41/0x56
[<ffffffff8008cab2>] __activate_task+0x56/0x6d
[<ffffffff8008c897>] dequeue_task+0x18/0x37
[<ffffffff80062ff8>] thread_return+0x62/0xfe
[<ffffffff800a0b16>] autoremove_wake_function+0x0/0x2e
[<ffffffff800a08fe>] keventd_create_kthread+0x0/0xc4
[<ffffffff8021b8ff>] md_thread+0xf8/0x10e
[<ffffffff800a08fe>] keventd_create_kthread+0x0/0xc4
[<ffffffff8021b807>] md_thread+0x0/0x10e
[<ffffffff8003290a>] kthread+0xfe/0x132
[<ffffffff8005dfb1>] child_rip+0xa/0x11
[<ffffffff800a08fe>] keventd_create_kthread+0x0/0xc4
[<ffffffff8003280c>] kthread+0x0/0x132
[<ffffffff8005dfa7>] child_rip+0x0/0x11

We managed to confirm the I/O starvation on all the schedulers:
* noop
* cfq
* deadline
* anticipatory

According to our logs, version 2.6.18_194.8.1.el5.x86_64 was the first one to show I/O timeouts.

Reports that can be found on the web suggest that the issue is not RHEL 5.5 specific.

Best Regards
--
Lukasz Flis
Hi, as Lukasz said, this one is _not_ RHEL specific: I have seen it reported in several different forums across the internet, concerning different distros, so I think it's basically a kernel problem. However, our problem occurred with kernel 2.6.18-194.el5.x86_64, somewhat earlier than Lukasz's.
Hi all,

we have also been heavily impacted by this bug. We were using a .164 kernel version until CVE-2010-3081 was discovered, when we upgraded to a .194 kernel. Since then, we have been getting errors similar to the ones reported here with any kernel in the .194 series or later, on all of our nodes (400+, both Xen virtual machines and regular nodes).

We managed to get rid of this error by rebuilding the kernel with the following two patches removed:

From 2.6.18-188.el5: [misc] khungtaskd: set PF_NOFREEZE flag to fix suspend (Amerigo Wang).
From 2.6.18-177.el5: [sched] enable CONFIG_DETECT_HUNG_TASK support (Amerigo Wang).

Since then we have been running the kernel without any kind of problems. However, we haven't had the time to investigate further and gather more information.

Regards,
Alvaro Lopez.
Hi all,

as far as I can see, this bug happens only on platforms with SAS HDDs under heavy load. We solved the problem by upgrading the firmware on the HDDs (that was the suggestion from Fujitsu's support team). We observed the problem only on RX200 S5 servers with Fujitsu's 10K 146GB HDDs - MBD2147RC. We upgraded their firmware from 5201 to 5203, and since then we have had no such problems.

Regards,
Bosko
I can't agree with Bosko here; the problem is also present on HP BL 2x220c (G5, G6) server blades with SATA disks. We can also see it on Tyan GX21 server boards with WD Raptor drives.
I have the same opinion as Lukasz. I cannot agree with Bosko, since we have seen the problems with both SAS and SATA disks, and also when accessing our GPFS filesystem.
I've checked the fixes that Alvaro mentioned.

The patch/issue:

From 2.6.18-177.el5: [sched] enable CONFIG_DETECT_HUNG_TASK support (Amerigo Wang).

is clearly the reason why a hung process now produces a message with a back trace (see the kernel documentation for more information: http://lxr.linux.no/#linux+v2.6.32.26/lib/Kconfig.debug#L197).

The patch/issue:

khungtaskd: set PF_NOFREEZE flag to fix suspend (Amerigo Wang).

I'm not really sure whether this one is related to this issue.

BZ: https://bugzilla.redhat.com/show_bug.cgi?id=550014

Description:
In RHEL5 kernel, kthread_run() will not set PF_NOFREEZE for us,
we have to set this flag by our own.

Fixes a suspend hang witnessed on some systems.

Upstream status:
Upstream doesn't need this.

Signed-off-by: WANG Cong <amwang>

diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 0d5a150..0fc6038 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -189,6 +189,10 @@ int proc_dohung_task_timeout_secs(struct ctl_table *table, int write,
 static int watchdog(void *dummy)
 {
 	set_user_nice(current, 0);
+	/*
+	 * kthread_run() doesn't help us here.
+	 */
+	current->flags |= PF_NOFREEZE;

 	for ( ; ; ) {
 		unsigned long timeout = sysctl_hung_task_timeout_secs;

I'm really not a kernel specialist, but to me it seems that Red Hat added PF_NOFREEZE to khungtaskd, which from my point of view applies only to khungtaskd and has no impact on other processes running on the system. This fix is also a little confusing to me, as I haven't seen it in the latest upstream kernel version (2.6.36).

Anyway, please note that khungtaskd only checks every 120 sec. for tasks on the system that have the TASK_UNINTERRUPTIBLE flag set. If khungtaskd finds a task that was not switched out by the scheduler at least once within the 120 sec., it considers this a hung task and displays the task's stack dump.
So from my point of view, all reported issues within this bug are related more to an overloaded server than to a kernel bug. To prove or disprove this, it would be helpful to have information about the CPU (run queue) and I/O load of a machine at the time the problem occurs.

As for my site, we had the same issue once with an HP DL 360 G5. After a short time, we replaced the server with a much more powerful one, an HP DL 360 G6 running two 6-core CPUs. The OS and the kernel remained on the same patch level, and since then the machine has not reported any problem again.

Cheers,
Simon
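For anyone who wants to supply the load figures asked for above, the numbers in /proc/loadavg can be put in relation to the CPU count. What follows is a hedged sketch only: the function names are made up, the load values in the sample are the ones from the uptime output posted earlier for the dual-core Athlon system, and the last two fields of the sample are invented for illustration. Keep in mind that the Linux load average also counts tasks in uninterruptible sleep, so a high value does not by itself distinguish CPU saturation from tasks stuck waiting on I/O.

```python
def parse_loadavg(contents):
    """Parse the contents of /proc/loadavg into a dictionary."""
    fields = contents.split()
    runnable, total = fields[3].split("/")
    return {
        "load1": float(fields[0]),    # 1-minute load average
        "load5": float(fields[1]),    # 5-minute load average
        "load15": float(fields[2]),   # 15-minute load average
        "runnable": int(runnable),    # currently runnable scheduling entities
        "total_tasks": int(total),    # scheduling entities that exist
    }


def load_per_cpu(load, cpu_count):
    """Load average divided by the number of CPUs."""
    return load / cpu_count


# Load values from the uptime output quoted earlier; "3/250 12345" is made up.
SAMPLE = "7.54 7.55 8.10 3/250 12345\n"

if __name__ == "__main__":
    info = parse_loadavg(SAMPLE)
    # cpu_count=2 matches the dual-core system that reported these load values.
    print("1-minute load per CPU: %.2f" % load_per_cpu(info["load1"], 2))
```

Collecting this together with the per-device I/O statistics at the moment of a hang would make it easier to tell the two explanations apart.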
Simon,

(In reply to comment #23)
> So from my point of view, all reported issues within this bug are related more
> to an overloaded server than to a kernel bug.

Please see my comments from September, including attached files. I do not think that this issue has to do with servers which are simply overloaded. As I see it, a high load brings the server into a pathological state, I/O-wise, and I believe that there must be a bug somewhere.
Hi Arvin,

Thanks for the feedback.

(In reply to comment #24)
> Please see my comments from September, including attached files. I do not think
> that this issue has to do with servers which are simply overloaded. As I see
> it, a high load brings the server into a pathological state, I/O-wise, and I
> believe that there must be a bug somewhere.

I've checked your attached information - but could you please provide some information about the hardware, the system is running on (CPU's, etc.)?

Anyhow, looking at your information, the system which is unresponsive has a current CPU load of 78 (which is usually high, except you have a multi-multi core CPU system). If I compare this with the second output, where the system is responsive, the CPU load is at 6.

It's definitely possible that a change related to disk I/O is also causing this problem, but it's definitely also related to load.
Simon,

(In reply to comment #25)
> I've checked your attached information - but could you please provide some
> information about the hardware, the system is running on (CPU's, etc.)?

CPUs: Two quad-core Intel Xeon X5570 2.93GHz.
RAM: 96GB ECC DDR3, consisting of 12 RAM units.
Further info: http://h20195.www2.hp.com/v2/GetDocument.aspx?docname=c01709598&doctype=quickspecs&doclang=EN_GB&searchquery=&cc=dk&lc=da

> Anyhow, looking at your information, the system which is unresponsive has a
> current CPU load of 78 (which is usually high, except you have a multi-multi
> core CPU system).

Yes, and this is the problem :-) Something triggers something which brings the system into a pathological state, with symptoms such as high load and extremely long process-spawning times.
Due to otherwise annoying events, I was lucky to get a window for re-testing the situation described in comment 4:

The running server's I/O scheduler was changed from noop to cfq, the relevant file system was unmounted and mounted again, and the fio test was run. I couldn't re-create the problem.

I then rebooted the server, making sure that the default cfq I/O scheduler would be active from the beginning. When I re-ran the tests, the problem re-appeared quickly. After killing fio, things returned to normal, and I got a chance to look into syslog:

Nov 26 15:06:45 oslo kernel: kjournald D ffff810c1f997860 0 1507 807 1532 1468 (L-TLB)
Nov 26 15:06:45 oslo kernel: ffff810c1f863cf0 0000000000000046 ffff8104b3a85000 ffff811818f26ac0
Nov 26 15:06:45 oslo kernel: ffff81181fb172c0 000000000000000a ffff81181fe70820 ffff810c1f997860
Nov 26 15:06:45 oslo kernel: 000000d84b35fc1e 0000000000000da6 ffff81181fe70a08 000000061f549838
Nov 26 15:06:45 oslo kernel: Call Trace:
Nov 26 15:06:45 oslo kernel: [<ffffffff8006e1d7>] do_gettimeofday+0x40/0x90
Nov 26 15:06:45 oslo kernel: [<ffffffff8005a7d6>] getnstimeofday+0x10/0x28
Nov 26 15:06:45 oslo kernel: [<ffffffff8001552b>] sync_buffer+0x0/0x3f
Nov 26 15:06:45 oslo kernel: [<ffffffff800637ea>] io_schedule+0x3f/0x67
Nov 26 15:06:45 oslo kernel: [<ffffffff80015566>] sync_buffer+0x3b/0x3f
Nov 26 15:06:45 oslo kernel: [<ffffffff80063a16>] __wait_on_bit+0x40/0x6e
Nov 26 15:06:45 oslo kernel: [<ffffffff8001552b>] sync_buffer+0x0/0x3f
Nov 26 15:06:45 oslo kernel: [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
Nov 26 15:06:45 oslo kernel: [<ffffffff800a0b44>] wake_bit_function+0x0/0x23
Nov 26 15:06:45 oslo kernel: [<ffffffff880339a5>] :jbd:journal_commit_transaction+0x543/0x1066
Nov 26 15:06:45 oslo kernel: [<ffffffff8003da83>] lock_timer_base+0x1b/0x3c
Nov 26 15:06:45 oslo kernel: [<ffffffff8004b132>] try_to_del_timer_sync+0x7f/0x88
Nov 26 15:06:45 oslo kernel: [<ffffffff880375d3>] :jbd:kjournald+0xc1/0x213
Nov 26 15:06:45 oslo kernel: [<ffffffff800a0b16>] autoremove_wake_function+0x0/0x2e
Nov 26 15:06:45 oslo kernel: [<ffffffff88037512>] :jbd:kjournald+0x0/0x213
Nov 26 15:06:45 oslo kernel: [<ffffffff800a08fe>] keventd_create_kthread+0x0/0xc4
Nov 26 15:06:45 oslo kernel: [<ffffffff8003290a>] kthread+0xfe/0x132
Nov 26 15:06:45 oslo kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Nov 26 15:06:45 oslo kernel: [<ffffffff800a08fe>] keventd_create_kthread+0x0/0xc4
Nov 26 15:06:45 oslo kernel: [<ffffffff8003280c>] kthread+0x0/0x132
Nov 26 15:06:45 oslo kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11

I then rebooted the server again, making sure that the noop scheduler would be chosen at boot time. This time (using the noop scheduler), I couldn't provoke the problem to re-appear.

So even though the server's firmware has been upgraded slightly and the kernel package is probably also newer, the situation is still as it was in September: if I use fio to generate heavy I/O on the local RAID system with the default cfq scheduler, I end up in a pathological state (where no new processes seem to get spawned and load is > 12), whereas if I use the noop scheduler, things are fine.
I have two questions: 1) When the I/O subsystem is stuck, does changing I/O schedulers make things work again? (What I mean by this is run the system with CFQ until it is hung, then echo noop > /sys/block/sdX/queue/scheduler, where sdX is the device that holds the hung file system). 2) If not, can someone get me a vmcore?
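For anyone repeating the test in question 1, here is a minimal sketch of reading and switching the scheduler through sysfs. The function names are hypothetical helpers, and the device name (sda below) is an assumption; on a cciss controller the sysfs directory name may look different.

```shell
#!/bin/sh
# Print the active I/O scheduler from a sysfs scheduler file.
# The kernel marks the active one with square brackets, e.g.
#   noop anticipatory deadline [cfq]
active_scheduler() {
    # $1: path to a scheduler file, e.g. /sys/block/sda/queue/scheduler
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$1"
}

# Switch the scheduler by writing its name to the same file.
# Writing to the real sysfs file requires root.
set_scheduler() {
    # $1: scheduler file, $2: scheduler name such as noop or cfq
    echo "$2" > "$1"
}
```

Usage would be along the lines of `active_scheduler /sys/block/sda/queue/scheduler` before and after `set_scheduler /sys/block/sda/queue/scheduler noop`.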
(In reply to comment #28) > 1) When the I/O subsystem is stuck, does changing I/O schedulers make things > work again? (What I mean by this is run the system with CFQ until it is hung, > then echo noop > /sys/block/sdX/queue/scheduler, where sdX is the device that > holds the hung file system). I tried: Ran fio for around 10 minutes and saw that load got above 10 and that logging in via SSH became impossible. I had an existing terminal open in order to be able to change the scheduler, but I couldn't even start "cat". As soon as I hit CTRL-c in fio, things got responsive again. I then changed the scheduler to noop (without a reboot), and this time, fio couldn't bring the system into the pathological state. > 2) If not, can someone get me a vmcore? How is that done?
(In reply to comment #29) > (In reply to comment #28) > > 1) When the I/O subsystem is stuck, does changing I/O schedulers make things > > work again? (What I mean by this is run the system with CFQ until it is hung, > > then echo noop > /sys/block/sdX/queue/scheduler, where sdX is the device that > > holds the hung file system). > > I tried: Ran fio for around 10 minutes and saw that load got above 10 and that > logging in via SSH became impossible. I had an existing terminal open in order > to be able to change the scheduler, but I couldn't even start "cat". As soon as > I hit CTRL-c in fio, things got responsive again. > > I then changed the scheduler to noop (without a reboot), and this time, fio > couldn't bring the system into the pathological state. OK, thanks for the quick testing turn-around. > > 2) If not, can someone get me a vmcore? > > How is that done? Well, let me try to reproduce this. You've provided a nice fio job file for me, so I'll do the leg work of getting the vmcore.
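On the "How is that done?" question: on RHEL 5 a vmcore is normally captured with kdump, which needs the kexec-tools package, a crashkernel= reservation on the kernel command line, and the kdump service running. The sketch below only checks the command-line part; the helper names are hypothetical, and the exact crashkernel value is system-dependent and deliberately not shown.

```shell
#!/bin/sh
# Return success if a kernel command line file reserves memory for a crash kernel.
has_crashkernel() {
    # $1: path to a command line file, normally /proc/cmdline
    grep -q 'crashkernel=' "$1"
}

# Hypothetical pre-flight check before trying to capture a vmcore.
kdump_preflight() {
    if has_crashkernel "$1"; then
        echo "crashkernel reserved"
    else
        echo "crashkernel missing"
    fi
}

# With kdump armed and the system in the bad state, a crash can be forced with:
#   echo 1 > /proc/sys/kernel/sysrq
#   echo c > /proc/sysrq-trigger
# This panics the machine immediately, so it belongs in a test window only.
```

Running `kdump_preflight /proc/cmdline` on the affected server shows whether a reboot with an added crashkernel= parameter is needed first.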
I was seeing the same behavior described in this thread. I bumped up to test kernel 2.6.18-233.el5 from http://people.redhat.com/jwilson/el5 and the errors have stopped. I'm not able to quantify impact outside of the error messages because I need to induce heavy disk i/o to do it but things "feel" better too.
I need to back down on that a bit: the problem seems to occur less frequently with the 2.6.18-233.el5 test kernel, but it is still evident.
I had this problem during a 5.5 NFS install. The server had a 10 GigE NIC and the nodes had 1 Gb NICs. On the 10 GigE card the MTU was set to 9000 by default. As soon as I restarted the interface with the MTU set to 1500, NFS started working. Check your MTU settings. Hope this helps.
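A quick way to compare MTU settings across interfaces is to read them from sysfs. This is a minimal sketch; `list_mtus` is a hypothetical helper name, and it assumes the usual /sys/class/net layout.

```shell
#!/bin/sh
# Print "<interface> <mtu>" for every interface directory under a sysfs-style tree.
list_mtus() {
    # $1: base directory, normally /sys/class/net
    for dev in "$1"/*; do
        # Skip entries that do not expose an mtu file
        [ -r "$dev/mtu" ] || continue
        echo "$(basename "$dev") $(cat "$dev/mtu")"
    done
}
```

Calling `list_mtus /sys/class/net` on both the server and a node makes a 9000 versus 1500 mismatch easy to spot.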
Has anyone tested with RHEL 5.6's kernel to see if this problem still happens?
I just tested with RHEL 5.6. The problem is still there :-( Messages from syslog will be shown below. I'm tempted to reformat with ext4 and see if the situation changes, but it will be a while until I get a window for doing that.

A strange phenomenon (which has probably been there all along): I can only systematically reproduce it if I use "fio" shortly after booting. When fio is around 10% done, things go wrong. But if I hit CTRL+c and then re-start fio (without a reboot), the problem is gone.

When the problem is seen, it develops like this (the percentages are fio's "% done"):
1% done: load=5
2% done: load=7
3% done: load=8
4% done: load=9
5% done: load=11
10% done: load=12
12% done: load=14
14% done: load=17
16% done: load=20
18% done: load=23
20% done: load=25
22% done: load=27
24% done: load=28
26% done: load=29
28% done: load=30
30% done: load=31

Around 3-4% done, the "pdflush" process starts showing up among the top processes in "top". All along, IO/s stays around 4200 (30MB/s read, 12MB/s write).

Now, from syslog:
Jan 18 00:25:28 oslo kernel: INFO: task kjournald:1506 blocked for more than 120 seconds.
Jan 18 00:25:28 oslo kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 18 00:25:28 oslo kernel: kjournald D ffff810c4a1aaaa0 0 1506 807 1531 1467 (L-TLB) Jan 18 00:25:28 oslo kernel: ffff81181c3ebcd0 0000000000000046 ffff81181fe88000 ffffffff880b97b1 Jan 18 00:25:28 oslo kernel: 0000000000000000 000000000000000a ffff810c1f417820 ffff810c20131100 Jan 18 00:25:28 oslo kernel: 0000004a2017d95d 0000000000000be8 ffff810c1f417a08 000000031fb68cf8 Jan 18 00:25:28 oslo kernel: Call Trace: Jan 18 00:25:28 oslo kernel: [<ffffffff880b97b1>] :cciss:do_cciss_request+0x32/0x4dd Jan 18 00:25:28 oslo kernel: [<ffffffff8006ec4e>] do_gettimeofday+0x40/0x90 Jan 18 00:25:28 oslo kernel: [<ffffffff800154b2>] sync_buffer+0x0/0x3f Jan 18 00:25:29 oslo kernel: [<ffffffff800637ca>] io_schedule+0x3f/0x67 Jan 18 00:25:29 oslo kernel: [<ffffffff800154ed>] sync_buffer+0x3b/0x3f Jan 18 00:25:29 oslo kernel: [<ffffffff800639f6>] __wait_on_bit+0x40/0x6e Jan 18 00:25:29 oslo kernel: [<ffffffff800154b2>] sync_buffer+0x0/0x3f Jan 18 00:25:29 oslo kernel: [<ffffffff80063a90>] out_of_line_wait_on_bit+0x6c/0x78 Jan 18 00:25:29 oslo kernel: [<ffffffff800a28e2>] wake_bit_function+0x0/0x23 Jan 18 00:25:29 oslo kernel: [<ffffffff8003aae0>] sync_dirty_buffer+0x8e/0xc3 Jan 18 00:25:29 oslo kernel: [<ffffffff8803401e>] :jbd:journal_commit_transaction+0xbbc/0x1066 Jan 18 00:25:29 oslo kernel: [<ffffffff8003dbe6>] lock_timer_base+0x1b/0x3c Jan 18 00:25:29 oslo kernel: [<ffffffff880375d3>] :jbd:kjournald+0xc1/0x213 Jan 18 00:25:29 oslo kernel: [<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e Jan 18 00:25:29 oslo kernel: [<ffffffff88037512>] :jbd:kjournald+0x0/0x213 Jan 18 00:25:29 oslo kernel: [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4 Jan 18 00:25:29 oslo kernel: [<ffffffff80032974>] kthread+0xfe/0x132 Jan 18 00:25:29 oslo kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11 Jan 18 00:25:29 oslo kernel: [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4 Jan 18 00:25:29 oslo kernel: [<ffffffff80032876>] kthread+0x0/0x132 Jan 18 00:25:29 oslo kernel: 
[<ffffffff8005dfa7>] child_rip+0x0/0x11 Jan 18 00:25:29 oslo kernel: Jan 18 00:25:29 oslo kernel: INFO: task master:25281 blocked for more than 120 seconds. Jan 18 00:25:29 oslo kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jan 18 00:25:29 oslo kernel: master D ffff81000900caa0 0 25281 18664 24791 (NOTLB) Jan 18 00:25:29 oslo kernel: ffff8110ba949b58 0000000000000086 0000000000000000 ffff810000075c10 Jan 18 00:25:29 oslo kernel: ffff810110a7adb0 0000000000000005 ffff81181fa56080 ffff810c4a24b0c0 Jan 18 00:25:29 oslo kernel: 00000049f62b812f 00000000000072f4 ffff81181fa56268 00000002000201d2 Jan 18 00:25:29 oslo kernel: Call Trace: Jan 18 00:25:29 oslo kernel: [<ffffffff8006ec4e>] do_gettimeofday+0x40/0x90 Jan 18 00:25:29 oslo kernel: [<ffffffff80028ae9>] sync_page+0x0/0x43 Jan 18 00:25:29 oslo kernel: [<ffffffff800637ca>] io_schedule+0x3f/0x67 Jan 18 00:25:29 oslo kernel: [<ffffffff80028b27>] sync_page+0x3e/0x43 Jan 18 00:25:29 oslo kernel: [<ffffffff8006390e>] __wait_on_bit_lock+0x36/0x66 Jan 18 00:25:29 oslo kernel: [<ffffffff8003fd9f>] __lock_page+0x5e/0x64 Jan 18 00:25:29 oslo kernel: [<ffffffff800a28e2>] wake_bit_function+0x0/0x23 Jan 18 00:25:29 oslo kernel: [<ffffffff8000c3d1>] do_generic_mapping_read+0x1df/0x359 Jan 18 00:25:29 oslo kernel: [<ffffffff8000d1bd>] file_read_actor+0x0/0x159 Jan 18 00:25:29 oslo kernel: [<ffffffff8000c697>] __generic_file_aio_read+0x14c/0x198 Jan 18 00:25:29 oslo kernel: [<ffffffff80016e0c>] generic_file_aio_read+0x34/0x39 Jan 18 00:25:29 oslo kernel: [<ffffffff8000cee6>] do_sync_read+0xc7/0x104 Jan 18 00:25:29 oslo kernel: [<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e Jan 18 00:25:29 oslo kernel: [<ffffffff8002a6c4>] __vma_link+0x42/0x4b Jan 18 00:25:29 oslo kernel: [<ffffffff8001ce41>] vma_link+0x70/0xfd Jan 18 00:25:29 oslo kernel: [<ffffffff800302f7>] __up_write+0x27/0xf2 Jan 18 00:25:29 oslo kernel: [<ffffffff8000b787>] vfs_read+0xcb/0x171 Jan 18 00:25:29 oslo kernel: 
[<ffffffff800454fb>] kernel_read+0x41/0x55 Jan 18 00:25:29 oslo kernel: [<ffffffff8003ef47>] do_execve+0xe1/0x1ed Jan 18 00:25:29 oslo kernel: [<ffffffff80055064>] sys_execve+0x36/0x4c Jan 18 00:25:29 oslo kernel: [<ffffffff8005d4d3>] stub_execve+0x67/0xb0 Jan 18 00:25:29 oslo kernel: Jan 18 00:25:29 oslo kernel: INFO: task sh:25305 blocked for more than 120 seconds. Jan 18 00:25:29 oslo kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jan 18 00:25:29 oslo kernel: sh D ffff81000900caa0 0 25305 1 27695 21904 (NOTLB) Jan 18 00:25:29 oslo kernel: ffff8104b59eba38 0000000000000086 ffff810110805440 ffffc20010097080 Jan 18 00:25:29 oslo kernel: ffff810110805440 0000000000000005 ffff810c1c38d100 ffff810c4a24b0c0 Jan 18 00:25:29 oslo kernel: 0000004f045c3cba 000000000004b279 ffff810c1c38d2e8 0000000280022214 Jan 18 00:25:29 oslo kernel: Call Trace: Jan 18 00:25:29 oslo kernel: [<ffffffff8006ec4e>] do_gettimeofday+0x40/0x90 Jan 18 00:25:29 oslo kernel: [<ffffffff800154b2>] sync_buffer+0x0/0x3f Jan 18 00:25:29 oslo kernel: [<ffffffff800637ca>] io_schedule+0x3f/0x67 Jan 18 00:25:29 oslo kernel: [<ffffffff800154ed>] sync_buffer+0x3b/0x3f Jan 18 00:25:29 oslo kernel: [<ffffffff800639f6>] __wait_on_bit+0x40/0x6e Jan 18 00:25:29 oslo kernel: [<ffffffff800154b2>] sync_buffer+0x0/0x3f Jan 18 00:25:29 oslo kernel: [<ffffffff80063a90>] out_of_line_wait_on_bit+0x6c/0x78 Jan 18 00:25:29 oslo kernel: [<ffffffff800a28e2>] wake_bit_function+0x0/0x23 Jan 18 00:25:29 oslo kernel: [<ffffffff8001750f>] ll_rw_block+0x8c/0xab Jan 18 00:25:29 oslo kernel: [<ffffffff8805314c>] :ext3:ext3_find_entry+0x3bf/0x575 Jan 18 00:25:29 oslo kernel: [<ffffffff80064604>] __down_read+0x12/0x92 Jan 18 00:25:29 oslo kernel: [<ffffffff80022214>] __up_read+0x19/0x7f Jan 18 00:25:29 oslo kernel: [<ffffffff8805bb0d>] :ext3:ext3_xattr_get+0x217/0x228 Jan 18 00:25:29 oslo kernel: [<ffffffff8804dac5>] :ext3:__ext3_get_inode_loc+0x12f/0x2f9 Jan 18 00:25:29 oslo kernel: 
[<ffffffff880549ba>] :ext3:ext3_lookup+0x33/0x162 Jan 18 00:25:29 oslo kernel: [<ffffffff8000d008>] do_lookup+0xe5/0x1e6 Jan 18 00:25:29 oslo kernel: [<ffffffff8000a2c5>] __link_path_walk+0xa2a/0xfb9 Jan 18 00:25:29 oslo kernel: [<ffffffff8000ea74>] link_path_walk+0x42/0xb2 Jan 18 00:25:29 oslo kernel: [<ffffffff8000cda3>] do_path_lookup+0x275/0x2f1 Jan 18 00:25:29 oslo kernel: [<ffffffff800237c4>] __path_lookup_intent_open+0x56/0x97 Jan 18 00:25:29 oslo kernel: [<ffffffff8003c1db>] open_exec+0x24/0xc0 Jan 18 00:25:29 oslo kernel: [<ffffffff8001cea1>] vma_link+0xd0/0xfd Jan 18 00:25:29 oslo kernel: [<ffffffff8003eeac>] do_execve+0x46/0x1ed Jan 18 00:25:29 oslo kernel: [<ffffffff80055064>] sys_execve+0x36/0x4c Jan 18 00:25:29 oslo kernel: [<ffffffff8005d4d3>] stub_execve+0x67/0xb0
The original problem within this BZ was on an LSI controller, which was fixed by upgrading the FJ drive firmware (comment #20). The >120s messages can be a symptom of many different issues, including simply a very busy system. For example, if the I/O timeout is set to 300s and the task stall detection time is set lower than that, any I/O timeout can produce these messages. For Smart Array configurations encountering this type of message or longer-term hang issues, I'd suggest using BZ 580818 instead. For configurations other than LSI or Smart Array, I'd suggest opening a new BZ with appropriate details.
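The relationship between the two timeouts can be checked with a small script. This is a sketch under assumptions: `compare_timeouts` is a hypothetical helper, and the per-device timeout path (/sys/block/sdX/device/timeout) applies to SCSI-style devices and may not exist for every controller driver.

```shell
#!/bin/sh
# Report whether the hung-task detector can fire before a device's I/O timeout.
compare_timeouts() {
    # $1: hung task timeout file, e.g. /proc/sys/kernel/hung_task_timeout_secs
    # $2: device timeout file, e.g. /sys/block/sdX/device/timeout (assumed path)
    hung=$(cat "$1")
    io=$(cat "$2")
    if [ "$hung" -eq 0 ]; then
        # A value of 0 disables the message, as the kernel log itself says
        echo "hung task detection disabled"
    elif [ "$hung" -lt "$io" ]; then
        echo "hung task timeout ($hung s) is shorter than I/O timeout ($io s)"
    else
        echo "hung task timeout ($hung s) is not shorter than I/O timeout ($io s)"
    fi
}
```

If the first case is reported, a single slow or timed-out I/O is enough to trigger the "blocked for more than 120 seconds" message without anything being permanently hung.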
"You are not authorized to access bug #580818"...
We are having an issue with a SAN where SCSI commands time out at the same time as we see all these hung task timeout errors. After that, the filesystem treats the problem as an I/O failure and remounts everything read-only, crashing our application. Is that something we should be seeing on a very busy system? We do have an HP system with a SmartArray card, but this problem came up when we attached the SAN.

Example below [Full dmesg]:
Linux version 2.6.18-238.el5 (mockbuild.bos.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-50)) #1 SMP Sun Dec 19 14:22:44 EST 2010
Command line: ro root=/dev/system/root
BIOS-provided physical RAM map:
BIOS-e820: 0000000000010000 - 000000000009f400 (usable)
BIOS-e820: 000000000009f400 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 00000000cfe3e000 (usable)
BIOS-e820: 00000000cfe3e000 - 00000000cfe46000 (ACPI data)
BIOS-e820: 00000000cfe46000 - 00000000cfe47000 (usable)
BIOS-e820: 00000000cfe47000 - 00000000e0000000 (reserved)
BIOS-e820: 00000000fec00000 - 00000000fee10000 (reserved)
BIOS-e820: 00000000ffc00000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 000000082ffff000 (usable)
DMI 2.4 present.
ACPI: RSDP (v002 HP ) @ 0x00000000000f4f00 ACPI: XSDT (v001 HP ProLiant 0x00000002 Ò^D 0x0000162e) @ 0x00000000cfe3eec0 ACPI: FADT (v003 HP ProLiant 0x00000002 Ò^D 0x0000162e) @ 0x00000000cfe3efc0 ACPI: SPCR (v001 HP SPCRRBSU 0x00000001 Ò^D 0x0000162e) @ 0x00000000cfe3e180 ACPI: MCFG (v001 HP ProLiant 0x00000001 0x00000000) @ 0x00000000cfe3e200 ACPI: HPET (v001 HP ProLiant 0x00000002 Ò^D 0x0000162e) @ 0x00000000cfe3e240 ACPI: SPMI (v005 HP ProLiant 0x00000001 Ò^D 0x0000162e) @ 0x00000000cfe3e280 ACPI: ERST (v001 HP ProLiant 0x00000001 Ò^D 0x0000162e) @ 0x00000000cfe3e2c0 ACPI: MADT (v001 HP ProLiant 0x00000002 0x00000000) @ 0x00000000cfe3e4c0 ACPI: SRAT (v001 AMD FAM_F_10 0x00000002 AMD 0x00000001) @ 0x00000000cfe3e6c0 ACPI: FFFF (v001 HP ProLiant 0x00000001 Ò^D 0x0000162e) @ 0x00000000cfe3eac0 ACPI: BERT (v001 HP ProLiant 0x00000001 Ò^D 0x0000162e) @ 0x00000000cfe3ec40 ACPI: HEST (v001 HP ProLiant 0x00000001 Ò^D 0x0000162e) @ 0x00000000cfe3ec80 ACPI: FFFF (v002 HP ProLiant 0x00000002 Ò^D 0x0000162e) @ 0x00000000cfe3ee40 ACPI: SSDT (v003 HP pci0pcie 0x00000001 INTL 0x20061109) @ 0x00000000cfe42900 ACPI: SSDT (v003 HP CRSPCI0 0x00000002 HP 0x00000001) @ 0x00000000cfe42c00 ACPI: SSDT (v003 HP CRSPCI1 0x00000002 HP 0x00000001) @ 0x00000000cfe42d40 ACPI: DSDT (v001 HP DSDT 0x00000001 INTL 0x20030228) @ 0x0000000000000000 SRAT: PXM 0 -> APIC 0 -> Node 0 SRAT: PXM 0 -> APIC 1 -> Node 0 SRAT: PXM 0 -> APIC 2 -> Node 0 SRAT: PXM 0 -> APIC 3 -> Node 0 SRAT: PXM 0 -> APIC 4 -> Node 0 SRAT: PXM 0 -> APIC 5 -> Node 0 SRAT: PXM 1 -> APIC 8 -> Node 1 SRAT: PXM 1 -> APIC 9 -> Node 1 SRAT: PXM 1 -> APIC 10 -> Node 1 SRAT: PXM 1 -> APIC 11 -> Node 1 SRAT: PXM 1 -> APIC 12 -> Node 1 SRAT: PXM 1 -> APIC 13 -> Node 1 SRAT: Node 0 PXM 0 0-a0000 SRAT: Node 0 PXM 0 0-d0000000 SRAT: Node 0 PXM 0 0-430000000 SRAT: Node 1 PXM 1 430000000-830000000 NUMA: Using 28 for the hash shift. 
Bootmem setup node 0 0000000000000000-0000000430000000 Bootmem setup node 1 0000000430000000-000000082ffff000 Memory for crash kernel (0x0 to 0x0) notwithin permissible range disabling kdump On node 0 totalpages: 4132192 DMA zone: 2409 pages, LIFO batch:0 DMA32 zone: 833143 pages, LIFO batch:31 Normal zone: 3296640 pages, LIFO batch:31 On node 1 totalpages: 4136960 Normal zone: 4136960 pages, LIFO batch:31 ACPI: PM-Timer IO Port: 0x920 ACPI: Local APIC address 0xfee00000 ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled) Processor #0 0:8 APIC version 16 ACPI: LAPIC (acpi_id[0x06] lapic_id[0x08] enabled) Processor #8 0:8 APIC version 16 ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled) Processor #1 0:8 APIC version 16 ACPI: LAPIC (acpi_id[0x07] lapic_id[0x09] enabled) Processor #9 0:8 APIC version 16 ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] enabled) Processor #2 0:8 APIC version 16 ACPI: LAPIC (acpi_id[0x08] lapic_id[0x0a] enabled) Processor #10 0:8 APIC version 16 ACPI: LAPIC (acpi_id[0x03] lapic_id[0x03] enabled) Processor #3 0:8 APIC version 16 ACPI: LAPIC (acpi_id[0x09] lapic_id[0x0b] enabled) Processor #11 0:8 APIC version 16 ACPI: LAPIC (acpi_id[0x04] lapic_id[0x04] enabled) Processor #4 0:8 APIC version 16 ACPI: LAPIC (acpi_id[0x0a] lapic_id[0x0c] enabled) Processor #12 0:8 APIC version 16 ACPI: LAPIC (acpi_id[0x05] lapic_id[0x05] enabled) Processor #5 0:8 APIC version 16 ACPI: LAPIC (acpi_id[0x0b] lapic_id[0x0d] enabled) Processor #13 0:8 APIC version 16 ACPI: LAPIC_NMI (acpi_id[0xff] dfl dfl lint[0x1]) ACPI: IOAPIC (id[0x08] address[0xfec00000] gsi_base[0]) IOAPIC[0]: apic_id 8, version 17, address 0xfec00000, GSI 0-15 ACPI: IOAPIC (id[0x09] address[0xfec01000] gsi_base[16]) IOAPIC[1]: apic_id 9, version 17, address 0xfec01000, GSI 16-31 ACPI: IOAPIC (id[0x0a] address[0xfec02000] gsi_base[32]) IOAPIC[2]: apic_id 10, version 17, address 0xfec02000, GSI 32-47 ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 high edge) ACPI: INT_SRC_OVR (bus 0 bus_irq 9 
global_irq 9 low level) ACPI: IRQ0 used by override. ACPI: IRQ2 used by override. ACPI: IRQ9 used by override. Setting APIC routing to physical flat ACPI: HPET id: 0x1166a201 base: 0xfed00000 Using ACPI (MADT) for SMP configuration information Nosave address range: 000000000009f000 - 00000000000a0000 Nosave address range: 00000000000a0000 - 00000000000f0000 Nosave address range: 00000000000f0000 - 0000000000100000 Nosave address range: 00000000cfe3e000 - 00000000cfe46000 Nosave address range: 00000000cfe47000 - 00000000e0000000 Nosave address range: 00000000e0000000 - 00000000fec00000 Nosave address range: 00000000fec00000 - 00000000fee10000 Nosave address range: 00000000fee10000 - 00000000ffc00000 Nosave address range: 00000000ffc00000 - 0000000100000000 Allocating PCI resources starting at e2000000 (gap: e0000000:1ec00000) SMP: Allowing 12 CPUs, 0 hotplug CPUs Built 2 zonelists. Total pages: 8269152 Kernel command line: ro root=/dev/system/root Initializing CPU#0 PID hash table entries: 4096 (order: 12, 32768 bytes) Console: colour VGA+ 80x25 Dentry cache hash table entries: 4194304 (order: 13, 33554432 bytes) Inode-cache hash table entries: 2097152 (order: 12, 16777216 bytes) Checking aperture... CPU 0: aperture @ 8000000 size 32 MB Aperture too small (32 MB) No AGP bridge found Your BIOS doesn't leave a aperture memory hole Please enable the IOMMU option in the BIOS setup This costs you 64 MB of RAM Mapping aperture over 65536 KB of RAM @ 8000000 Nosave address range: 0000000008000000 - 000000000c000000 ACPI: DMAR not present Memory: 32956968k/34340860k available (2592k kernel code, 595212k reserved, 1649k data, 224k init) Calibrating delay loop (skipped), value calculated using timer frequency.. 4800.18 BogoMIPS (lpj=2400092) Security Framework v1.0.0 initialized SELinux: Initializing. 
SELinux: Starting in permissive mode selinux_register_security: Registering secondary module capability Capability LSM initialized as secondary Mount-cache hash table entries: 256 CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line) CPU: L2 Cache: 512K (64 bytes/line) CPU 0/0 -> Node 0 CPU: Physical Processor ID: 0 CPU: Processor Core ID: 0 SMP alternatives: switching to UP code ACPI: Core revision 20060707 Using local APIC timer interrupts. Detected 12.500 MHz APIC timer. SMP alternatives: switching to SMP code Booting processor 1/12 APIC 0x8 Initializing CPU#1 Calibrating delay using timer specific routine.. 4800.27 BogoMIPS (lpj=2400139) CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line) CPU: L2 Cache: 512K (64 bytes/line) CPU 1/8 -> Node 1 CPU: Physical Processor ID: 1 CPU: Processor Core ID: 0 Six-Core AMD Opteron(tm) Processor 2431 stepping 00 SMP alternatives: switching to SMP code Booting processor 2/12 APIC 0x1 Initializing CPU#2 Calibrating delay using timer specific routine.. 4801.22 BogoMIPS (lpj=2400612) CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line) CPU: L2 Cache: 512K (64 bytes/line) CPU 2/1 -> Node 0 CPU: Physical Processor ID: 0 CPU: Processor Core ID: 1 Six-Core AMD Opteron(tm) Processor 2431 stepping 00 SMP alternatives: switching to SMP code Booting processor 3/12 APIC 0x9 Initializing CPU#3 Calibrating delay using timer specific routine.. 4804.38 BogoMIPS (lpj=2402193) CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line) CPU: L2 Cache: 512K (64 bytes/line) CPU 3/9 -> Node 1 CPU: Physical Processor ID: 1 CPU: Processor Core ID: 1 Six-Core AMD Opteron(tm) Processor 2431 stepping 00 SMP alternatives: switching to SMP code Booting processor 4/12 APIC 0x2 Initializing CPU#4 Calibrating delay using timer specific routine.. 
4803.89 BogoMIPS (lpj=2401946) CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line) CPU: L2 Cache: 512K (64 bytes/line) CPU 4/2 -> Node 0 CPU: Physical Processor ID: 0 CPU: Processor Core ID: 2 Six-Core AMD Opteron(tm) Processor 2431 stepping 00 SMP alternatives: switching to SMP code Booting processor 5/12 APIC 0xa Initializing CPU#5 Calibrating delay using timer specific routine.. 4803.24 BogoMIPS (lpj=2401623) CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line) CPU: L2 Cache: 512K (64 bytes/line) CPU 5/a -> Node 1 CPU: Physical Processor ID: 1 CPU: Processor Core ID: 2 Six-Core AMD Opteron(tm) Processor 2431 stepping 00 SMP alternatives: switching to SMP code Booting processor 6/12 APIC 0x3 Initializing CPU#6 Calibrating delay using timer specific routine.. 4803.14 BogoMIPS (lpj=2401574) CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line) CPU: L2 Cache: 512K (64 bytes/line) CPU 6/3 -> Node 0 CPU: Physical Processor ID: 0 CPU: Processor Core ID: 3 Six-Core AMD Opteron(tm) Processor 2431 stepping 00 SMP alternatives: switching to SMP code Booting processor 7/12 APIC 0xb Initializing CPU#7 Calibrating delay using timer specific routine.. 4804.12 BogoMIPS (lpj=2402064) CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line) CPU: L2 Cache: 512K (64 bytes/line) CPU 7/b -> Node 1 CPU: Physical Processor ID: 1 CPU: Processor Core ID: 3 Six-Core AMD Opteron(tm) Processor 2431 stepping 00 SMP alternatives: switching to SMP code Booting processor 8/12 APIC 0x4 Initializing CPU#8 Calibrating delay using timer specific routine.. 4802.49 BogoMIPS (lpj=2401249) CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line) CPU: L2 Cache: 512K (64 bytes/line) CPU 8/4 -> Node 0 CPU: Physical Processor ID: 0 CPU: Processor Core ID: 4 Six-Core AMD Opteron(tm) Processor 2431 stepping 00 SMP alternatives: switching to SMP code Booting processor 9/12 APIC 0xc Initializing CPU#9 Calibrating delay using timer specific routine.. 
4801.74 BogoMIPS (lpj=2400872) CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line) CPU: L2 Cache: 512K (64 bytes/line) CPU 9/c -> Node 1 CPU: Physical Processor ID: 1 CPU: Processor Core ID: 4 Six-Core AMD Opteron(tm) Processor 2431 stepping 00 SMP alternatives: switching to SMP code Booting processor 10/12 APIC 0x5 Initializing CPU#10 Calibrating delay using timer specific routine.. 4803.41 BogoMIPS (lpj=2401706) CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line) CPU: L2 Cache: 512K (64 bytes/line) CPU 10/5 -> Node 0 CPU: Physical Processor ID: 0 CPU: Processor Core ID: 5 Six-Core AMD Opteron(tm) Processor 2431 stepping 00 SMP alternatives: switching to SMP code Booting processor 11/12 APIC 0xd Initializing CPU#11 Calibrating delay using timer specific routine.. 4804.02 BogoMIPS (lpj=2402013) CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line) CPU: L2 Cache: 512K (64 bytes/line) CPU 11/d -> Node 1 CPU: Physical Processor ID: 1 CPU: Processor Core ID: 5 Six-Core AMD Opteron(tm) Processor 2431 stepping 00 Brought up 12 CPUs CPU#0: NMI watchdog performance counter calibration - 724->744 CPU#1: NMI watchdog performance counter calibration - 54->61 CPU#2: NMI watchdog performance counter calibration - 52->72 CPU#3: NMI watchdog performance counter calibration - 55->62 CPU#4: NMI watchdog performance counter calibration - 85->105 CPU#5: NMI watchdog performance counter calibration - 47->54 CPU#6: NMI watchdog performance counter calibration - 46->66 CPU#7: NMI watchdog performance counter calibration - 35->42 CPU#8: NMI watchdog performance counter calibration - 34->54 CPU#9: NMI watchdog performance counter calibration - 30->50 CPU#10: NMI watchdog performance counter calibration - 26->46 CPU#11: NMI watchdog performance counter calibration - 20->27 NMI watchdog testing PASSED. time.c: Using 14.318180 MHz WALL HPET GTOD HPET/TSC timer. time.c: Detected 2400.099 MHz processor. 
sizeof(vma)=176 bytes sizeof(page)=56 bytes sizeof(inode)=560 bytes sizeof(dentry)=216 bytes sizeof(ext3inode)=760 bytes sizeof(buffer_head)=96 bytes sizeof(skbuff)=248 bytes migration_cost=633,4994 checking if image is initramfs... it is Freeing initrd memory: 4002k freed NET: Registered protocol family 16 ACPI: bus type pci registered PCI: Using MMCONFIG at d0000000 ACPI: Interpreter enabled ACPI: Using IOAPIC for interrupt routing ACPI: No dock devices found. ACPI: PCI Root Bridge [PCI0] (0000:00) PCI: Enabling HT MSI Mapping on 0000:00:05.0 ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.EXB0._PRT] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.EXB1._PRT] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.EXB2._PRT] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.EXB3._PRT] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.EXB4._PRT] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.IPXB._PRT] ACPI: PCI Interrupt Link [IUSB] (IRQs *5) ACPI: PCI Interrupt Link [ISF0] (IRQs *14) ACPI: PCI Interrupt Link [IN00] (IRQs 7 10 11) *0, disabled. ACPI: PCI Interrupt Link [IN01] (IRQs 7 10 11) *0, disabled. ACPI: PCI Interrupt Link [IN02] (IRQs 7 10 11) *0, disabled. ACPI: PCI Interrupt Link [IN03] (IRQs 7 10 11) *0, disabled. ACPI: PCI Interrupt Link [IN04] (IRQs 7 10 11) *0, disabled. ACPI: PCI Interrupt Link [IN05] (IRQs 7 10 11) *0, disabled. ACPI: PCI Interrupt Link [IN06] (IRQs 7 10 11) *0, disabled. ACPI: PCI Interrupt Link [IN07] (IRQs 7 10 11) *0, disabled. ACPI: PCI Interrupt Link [IN08] (IRQs 7 10 11) *0, disabled. ACPI: PCI Interrupt Link [IN09] (IRQs 7 10 11) *0, disabled. ACPI: PCI Interrupt Link [IN10] (IRQs 7 10 11) *0, disabled. ACPI: PCI Interrupt Link [IN11] (IRQs 7 10 11) *0, disabled. ACPI: PCI Interrupt Link [IN12] (IRQs 7 10 11) *0, disabled. ACPI: PCI Interrupt Link [IN13] (IRQs 7 10 11) *0, disabled. ACPI: PCI Interrupt Link [IN14] (IRQs 7 10 11) *0, disabled. 
ACPI: PCI Interrupt Link [IN15] (IRQs 7 10 11) *0, disabled. ACPI: PCI Interrupt Link [IN16] (IRQs 7 10 *11) ACPI: PCI Interrupt Link [IN17] (IRQs 7 10 *11) ACPI: PCI Interrupt Link [IN18] (IRQs 7 10 *11) ACPI: PCI Interrupt Link [IN19] (IRQs 7 10 *11) ACPI: PCI Interrupt Link [IN20] (IRQs 7 10 *11) ACPI: PCI Interrupt Link [IN21] (IRQs 7 10 11) *0, disabled. ACPI: PCI Interrupt Link [IN22] (IRQs 7 *10 11) ACPI: PCI Interrupt Link [IN23] (IRQs *7 10 11) ACPI: PCI Interrupt Link [IN24] (IRQs 7 10 *11) ACPI: PCI Interrupt Link [IN25] (IRQs 7 10 *11) ACPI: PCI Interrupt Link [IN26] (IRQs 7 10 *11) ACPI: PCI Interrupt Link [IN27] (IRQs 7 10 11) *0, disabled. ACPI: PCI Interrupt Link [IN28] (IRQs *7 10 11) ACPI: PCI Interrupt Link [IN29] (IRQs 7 10 *11) ACPI: PCI Interrupt Link [IN30] (IRQs 7 *10 11) ACPI: PCI Interrupt Link [IN31] (IRQs 7 10 11) *0, disabled. ACPI: PCI Root Bridge [PCI1] (0000:40) ACPI: PCI Interrupt Routing Table [\_SB_.PCI1._PRT] ACPI: PCI Interrupt Routing Table [\_SB_.PCI1.EXB0._PRT] ACPI: PCI Interrupt Routing Table [\_SB_.PCI1.EXB1._PRT] ACPI: PCI Interrupt Routing Table [\_SB_.PCI1.EXB2._PRT] ACPI: PCI Interrupt Routing Table [\_SB_.PCI1.EXB3._PRT] ACPI: PCI Interrupt Routing Table [\_SB_.PCI1.EXB4._PRT] Linux Plug and Play Support v0.97 (c) Adam Belay pnp: PnP ACPI init pnp: PnP ACPI: found 12 devices usbcore: registered new driver usbfs usbcore: registered new driver hub PCI: Using ACPI for IRQ routing PCI: If a device doesn't work, try "pci=routeirq". If it helps, post a report NetLabel: Initializing NetLabel: domain hash size = 128 NetLabel: protocols = UNLABELED CIPSOv4 NetLabel: unlabeled traffic allowed by default hpet0: at MMIO 0xfed00000 (virtual 0xffffffffff5fe000), IRQs 2, 8, 0 hpet0: 3 64-bit timers, 14318180 Hz ACPI: DMAR not present PCI-DMA: Disabling AGP. PCI-DMA: aperture base @ 8000000 size 65536 KB PCI-DMA: using GART IOMMU. 
PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture pnp: 00:01: ioport range 0x408-0x40f has been reserved pnp: 00:01: ioport range 0x4d0-0x4d1 has been reserved pnp: 00:01: ioport range 0x4d6-0x4d6 has been reserved pnp: 00:01: iomem range 0xd0000000-0xdfffffff could not be reserved PCI: Bridge: 0000:01:0d.0 IO window: disabled. MEM window: disabled. PREFETCH window: disabled. PCI: Bridge: 0000:00:05.0 IO window: 4000-4fff MEM window: f3f00000-f3ffffff PREFETCH window 0x00000000e2000000-0x00000000e20fffff PCI: Bridge: 0000:00:0f.0 IO window: disabled. MEM window: disabled. PREFETCH window: disabled. PCI: Bridge: 0000:00:10.0 IO window: disabled. MEM window: disabled. PREFETCH window: disabled. PCI: Bridge: 0000:00:11.0 IO window: disabled. MEM window: disabled. PREFETCH window: disabled. PCI: Bridge: 0000:00:12.0 IO window: disabled. MEM window: f4000000-f7ffffff PREFETCH window 0x00000000e2100000-0x00000000e21fffff PCI: Bridge: 0000:00:13.0 IO window: disabled. MEM window: disabled. PREFETCH window: disabled. GSI 16 sharing vector 0xA9 and IRQ 16 ACPI: PCI Interrupt 0000:00:0f.0[A] -> GSI 42 (level, low) -> IRQ 169 PCI: Setting latency timer of device 0000:00:0f.0 to 64 GSI 17 sharing vector 0xB1 and IRQ 17 ACPI: PCI Interrupt 0000:00:10.0[A] -> GSI 38 (level, low) -> IRQ 177 PCI: Setting latency timer of device 0000:00:10.0 to 64 GSI 18 sharing vector 0xB9 and IRQ 18 ACPI: PCI Interrupt 0000:00:11.0[A] -> GSI 39 (level, low) -> IRQ 185 PCI: Setting latency timer of device 0000:00:11.0 to 64 GSI 19 sharing vector 0xC1 and IRQ 19 ACPI: PCI Interrupt 0000:00:12.0[A] -> GSI 40 (level, low) -> IRQ 193 PCI: Setting latency timer of device 0000:00:12.0 to 64 GSI 20 sharing vector 0xC9 and IRQ 20 ACPI: PCI Interrupt 0000:00:13.0[A] -> GSI 41 (level, low) -> IRQ 201 PCI: Setting latency timer of device 0000:00:13.0 to 64 PCI: Bridge: 0000:42:00.0 IO window: disabled. MEM window: fd200000-fd2fffff PREFETCH window: disabled. 
PCI: Bridge: 0000:40:0f.0 IO window: disabled. MEM window: fd200000-fd2fffff PREFETCH window: disabled. PCI: Bridge: 0000:40:10.0 IO window: 5000-5fff MEM window: fd300000-fdafffff PREFETCH window 0x00000000e2300000-0x00000000e23fffff PCI: Bridge: 0000:40:11.0 IO window: 6000-6fff MEM window: fdb00000-fdffffff PREFETCH window 0x00000000e2400000-0x00000000e24fffff PCI: Bridge: 0000:40:12.0 IO window: disabled. MEM window: f8000000-fbffffff PREFETCH window 0x00000000e2500000-0x00000000e25fffff PCI: Bridge: 0000:40:13.0 IO window: disabled. MEM window: disabled. PREFETCH window: disabled. GSI 21 sharing vector 0xD1 and IRQ 21 ACPI: PCI Interrupt 0000:40:0f.0[A] -> GSI 36 (level, low) -> IRQ 209 PCI: Setting latency timer of device 0000:40:0f.0 to 64 ACPI: PCI Interrupt 0000:42:00.0[A] -> GSI 36 (level, low) -> IRQ 209 PCI: Setting latency timer of device 0000:42:00.0 to 64 GSI 22 sharing vector 0xD9 and IRQ 22 ACPI: PCI Interrupt 0000:40:10.0[A] -> GSI 32 (level, low) -> IRQ 217 PCI: Setting latency timer of device 0000:40:10.0 to 64 GSI 23 sharing vector 0xE1 and IRQ 23 ACPI: PCI Interrupt 0000:40:11.0[A] -> GSI 33 (level, low) -> IRQ 225 PCI: Setting latency timer of device 0000:40:11.0 to 64 GSI 24 sharing vector 0xE9 and IRQ 24 ACPI: PCI Interrupt 0000:40:12.0[A] -> GSI 34 (level, low) -> IRQ 233 PCI: Setting latency timer of device 0000:40:12.0 to 64 GSI 25 sharing vector 0x32 and IRQ 25 ACPI: PCI Interrupt 0000:40:13.0[A] -> GSI 35 (level, low) -> IRQ 50 PCI: Setting latency timer of device 0000:40:13.0 to 64 NET: Registered protocol family 2 IP route cache hash table entries: 524288 (order: 10, 4194304 bytes) TCP established hash table entries: 262144 (order: 10, 4194304 bytes) TCP bind hash table entries: 65536 (order: 8, 1048576 bytes) TCP: Hash tables configured (established 262144 bind 65536) TCP reno registered audit: initializing netlink socket (disabled) type=2000 audit(1295321056.309:1): initialized Total HugeTLB memory allocated, 0 VFS: Disk quotas 
dquot_6.5.1 Dquot-cache hash table entries: 512 (order 0, 4096 bytes) SELinux: Registering netfilter hooks Initializing Cryptographic API alg: No test for crc32c (crc32c-generic) ksign: Installing public key data Loading keyring - Added public key F43E909AB54B946C - User ID: Red Hat, Inc. (Kernel Module GPG key) io scheduler noop registered io scheduler anticipatory registered io scheduler deadline registered io scheduler cfq registered (default) Boot video device is 0000:00:03.0 pci 0000:00:04.4: HCRESET not completed yet! PCI: Setting latency timer of device 0000:00:0f.0 to 64 PCI: Setting latency timer of device 0000:00:10.0 to 64 PCI: Setting latency timer of device 0000:00:11.0 to 64 PCI: Setting latency timer of device 0000:00:12.0 to 64 PCI: Setting latency timer of device 0000:00:13.0 to 64 PCI: Setting latency timer of device 0000:40:0f.0 to 64 PCI: Setting latency timer of device 0000:40:10.0 to 64 PCI: Setting latency timer of device 0000:40:11.0 to 64 PCI: Setting latency timer of device 0000:40:12.0 to 64 PCI: Setting latency timer of device 0000:40:13.0 to 64 pci_hotplug: PCI Hot Plug PCI Core version: 0.5 Real Time Clock Driver v1.12ac hpet_resources: 0xfed00000 is busy Non-volatile memory driver v1.2 Linux agpgart interface v0.101 (c) Dave Jones Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A serial8250: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A 00:09: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A brd: module loaded Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2 ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx Probing IDE interface ide0... Probing IDE interface ide1... 
ide-floppy driver 0.99.newide usbcore: registered new driver hiddev usbcore: registered new driver usbhid drivers/usb/input/hid-core.c: v2.6:USB HID core driver PNP: PS/2 Controller [PNP0303:KBD,PNP0f0e:PS2M] at 0x60,0x64 irq 1,12 serio: i8042 KBD port at 0x60,0x64 irq 1 serio: i8042 AUX port at 0x60,0x64 irq 12 mice: PS/2 mouse device common for all mice md: md driver 0.90.3 MAX_MD_DEVS=256, MD_SB_DISKS=27 md: bitmap version 4.39 TCP bic registered Initializing IPsec netlink socket NET: Registered protocol family 1 NET: Registered protocol family 17 ACPI: (supports S0 S4 S5) Initalizing network drop monitor service Freeing unused kernel memory: 224k freed Write protecting the kernel read-only data: 519k ACPI: PCI Interrupt Link [IUSB] enabled at IRQ 5 ACPI: PCI Interrupt 0000:00:07.2[A] -> Link [IUSB] -> GSI 5 (level, low) -> IRQ 5 ehci_hcd 0000:00:07.2: EHCI Host Controller ehci_hcd 0000:00:07.2: new USB bus registered, assigned bus number 1 ehci_hcd 0000:00:07.2: irq 5, io mem 0xf3dc0000 ehci_hcd 0000:00:07.2: USB 2.0 started, EHCI 1.00, driver 10 Dec 2004 usb usb1: configuration #1 chosen from 1 choice hub 1-0:1.0: USB hub found hub 1-0:1.0: 4 ports detected ohci_hcd: 2005 April 22 USB 1.1 'Open' Host Controller (OHCI) Driver (PCI) ACPI: PCI Interrupt 0000:00:07.0[A] -> Link [IUSB] -> GSI 5 (level, low) -> IRQ 5 ohci_hcd 0000:00:07.0: OHCI Host Controller ohci_hcd 0000:00:07.0: new USB bus registered, assigned bus number 2 ohci_hcd 0000:00:07.0: irq 5, io mem 0xf3de0000 usb usb2: configuration #1 chosen from 1 choice hub 2-0:1.0: USB hub found hub 2-0:1.0: 2 ports detected ACPI: PCI Interrupt 0000:00:07.1[A] -> Link [IUSB] -> GSI 5 (level, low) -> IRQ 5 ohci_hcd 0000:00:07.1: OHCI Host Controller ohci_hcd 0000:00:07.1: new USB bus registered, assigned bus number 3 ohci_hcd 0000:00:07.1: irq 5, io mem 0xf3dd0000 usb 1-3: new high speed USB device using ehci_hcd and address 2 usb usb3: configuration #1 chosen from 1 choice hub 3-0:1.0: USB hub found hub 3-0:1.0: 
2 ports detected usb 1-3: configuration #1 chosen from 1 choice hub 1-3:1.0: USB hub found hub 1-3:1.0: 4 ports detected USB Universal Host Controller Interface driver v3.0 GSI 26 sharing vector 0x92 and IRQ 26 ACPI: PCI Interrupt 0000:00:04.4[B] -> GSI 45 (level, low) -> IRQ 146 uhci_hcd 0000:00:04.4: UHCI Host Controller uhci_hcd 0000:00:04.4: new USB bus registered, assigned bus number 4 uhci_hcd 0000:00:04.4: port count misdetected? forcing to 2 ports uhci_hcd 0000:00:04.4: HCRESET not completed yet! uhci_hcd 0000:00:04.4: irq 146, io base 0x00001800 usb usb4: configuration #1 chosen from 1 choice hub 4-0:1.0: USB hub found hub 4-0:1.0: 2 ports detected SCSI subsystem initialized HP CISS Driver (v 3.6.22-RH1) ACPI: PCI Interrupt 0000:48:00.0[A] -> GSI 33 (level, low) -> IRQ 225 cciss0: <0x323a> at PCI 0000:48:00.0 IRQ 170 using DAC cciss/c0d0: p1 p2 p3 < p5 > cciss/c0d1: p1 cciss/c0d2: p1 cciss/c0d3: p1 libata version 3.00 loaded. sata_svw 0000:01:0e.0: version 2.3 ACPI: PCI Interrupt Link [ISF0] enabled at IRQ 14 ACPI: PCI Interrupt 0000:01:0e.0[A] -> Link [ISF0] -> GSI 14 (level, low) -> IRQ 14 scsi0 : sata_svw scsi1 : sata_svw scsi2 : sata_svw scsi3 : sata_svw ata1: SATA max UDMA/133 mmio m8192@0xf3ff0000 port 0xf3ff0000 irq 14 ata2: SATA max UDMA/133 mmio m8192@0xf3ff0000 port 0xf3ff0100 irq 14 ata3: SATA max UDMA/133 mmio m8192@0xf3ff0000 port 0xf3ff0200 irq 14 ata4: SATA max UDMA/133 mmio m8192@0xf3ff0000 port 0xf3ff0300 irq 14 usb 1-3.3: new full speed USB device using ehci_hcd and address 3 usb 1-3.3: configuration #1 chosen from 1 choice input: Raritan D2CIM-VUSB as /class/input/input0 input: USB HID v1.11 Keyboard [Raritan D2CIM-VUSB] on usb-0000:00:07.2-3.3 ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300) ata1.00: ATAPI: Optiarc DVD RW AD-7561S, AH51, max UDMA/100 usb 4-1: new full speed USB device using uhci_hcd and address 2 ata1.00: configured for UDMA/100 usb 4-1: configuration #1 chosen from 1 choice input: HP Virtual Keyboard as 
/class/input/input1 input: USB HID v1.01 Keyboard [HP Virtual Keyboard] on usb-0000:00:04.4-1 input: HP Virtual Keyboard as /class/input/input2 input: USB HID v1.01 Mouse [HP Virtual Keyboard] on usb-0000:00:04.4-1 ata2: SATA link down (SStatus 4 SControl 300) usb 4-2: new full speed USB device using uhci_hcd and address 3 usb 4-2: configuration #1 chosen from 1 choice hub 4-2:1.0: USB hub found hub 4-2:1.0: 7 ports detected ata3: SATA link down (SStatus 4 SControl 300) ata4: SATA link down (SStatus 4 SControl 300) Vendor: Optiarc Model: DVD RW AD-7561S Rev: AH51 Type: CD-ROM ANSI SCSI revision: 05 Initializing USB Mass Storage driver... usbcore: registered new driver usb-storage USB Mass Storage support registered. QLogic Fibre Channel HBA Driver ACPI: PCI Interrupt 0000:45:00.2[C] -> GSI 35 (level, low) -> IRQ 50 qla2xxx 0000:45:00.2: Found an ISP8001, irq 50, iobase 0xffffc20010086000 qla2xxx 0000:45:00.2: Configuring PCI space... PCI: Setting latency timer of device 0000:45:00.2 to 64 qla2xxx 0000:45:00.2: Configure NVRAM parameters... qla2xxx 0000:45:00.2: Verifying loaded RISC code... qla2xxx 0000:45:00.2: Allocated (64 KB) for EFT... qla2xxx 0000:45:00.2: Allocated (1414 KB) for firmware dump... scsi4 : qla2xxx qla2xxx 0000:45:00.2: QLogic Fibre Channel HBA Driver: 8.03.01.05.05.06-k QLogic QLE8152 - QLogic PCI-Express Dual Channel 10GbE CNA ISP8001: PCIe (2.5Gb/s x4) @ 0000:45:00.2 hdma+, host#=4, fw=5.02.01 (8d4) ACPI: PCI Interrupt 0000:45:00.3[D] -> GSI 34 (level, low) -> IRQ 233 qla2xxx 0000:45:00.3: Found an ISP8001, irq 233, iobase 0xffffc200101f2000 qla2xxx 0000:45:00.3: Configuring PCI space... PCI: Setting latency timer of device 0000:45:00.3 to 64 qla2xxx 0000:45:00.3: Configure NVRAM parameters... qla2xxx 0000:45:00.3: Verifying loaded RISC code... qla2xxx 0000:45:00.2: LOOP UP detected (10 Gbps). 
Vendor: Pillar Model: Axiom 300 Rev: 0000 Type: Direct-Access ANSI SCSI revision: 05 SCSI device sda: 632216064 512-byte hdwr sectors (323695 MB) sda: Write Protect is off sda: Mode Sense: 87 00 00 08 SCSI device sda: drive cache: write through SCSI device sda: 632216064 512-byte hdwr sectors (323695 MB) sda: Write Protect is off sda: Mode Sense: 87 00 00 08 SCSI device sda: drive cache: write through sda: sda1 sd 4:0:0:0: Attached scsi disk sda Vendor: Pillar Model: Axiom 300 Rev: 0000 Type: Direct-Access ANSI SCSI revision: 05 SCSI device sdb: 2098788864 512-byte hdwr sectors (1074580 MB) sdb: Write Protect is off sdb: Mode Sense: 7f 00 00 08 SCSI device sdb: drive cache: write through SCSI device sdb: 2098788864 512-byte hdwr sectors (1074580 MB) sdb: Write Protect is off sdb: Mode Sense: 7f 00 00 08 SCSI device sdb: drive cache: write through sdb: unknown partition table sd 4:0:0:1: Attached scsi disk sdb Vendor: Pillar Model: Axiom 300 Rev: 0000 Type: Direct-Access ANSI SCSI revision: 05 SCSI device sdc: 632216064 512-byte hdwr sectors (323695 MB) sdc: Write Protect is off sdc: Mode Sense: 87 00 00 08 SCSI device sdc: drive cache: write through SCSI device sdc: 632216064 512-byte hdwr sectors (323695 MB) sdc: Write Protect is off sdc: Mode Sense: 87 00 00 08 SCSI device sdc: drive cache: write through sdc: sdc1 sd 4:0:1:0: Attached scsi disk sdc Vendor: Pillar Model: Axiom 300 Rev: 0000 Type: Direct-Access ANSI SCSI revision: 05 SCSI device sdd: 2098788864 512-byte hdwr sectors (1074580 MB) sdd: Write Protect is off sdd: Mode Sense: 7f 00 00 08 SCSI device sdd: drive cache: write through SCSI device sdd: 2098788864 512-byte hdwr sectors (1074580 MB) sdd: Write Protect is off sdd: Mode Sense: 7f 00 00 08 SCSI device sdd: drive cache: write through sdd: unknown partition table sd 4:0:1:1: Attached scsi disk sdd qla2xxx 0000:45:00.3: Allocated (64 KB) for EFT... qla2xxx 0000:45:00.3: Allocated (1414 KB) for firmware dump... 
scsi5 : qla2xxx qla2xxx 0000:45:00.3: QLogic Fibre Channel HBA Driver: 8.03.01.05.05.06-k QLogic QLE8152 - QLogic PCI-Express Dual Channel 10GbE CNA ISP8001: PCIe (2.5Gb/s x4) @ 0000:45:00.3 hdma+, host#=5, fw=5.02.01 (8d4) device-mapper: uevent: version 1.0.3 device-mapper: ioctl: 4.11.5-ioctl (2007-12-12) initialised: dm-devel qla2xxx 0000:45:00.3: LOOP UP detected (10 Gbps). device-mapper: dm-raid45: initialized v0.2594l Vendor: Pillar Model: Axiom 300 Rev: 0000 Type: Direct-Access ANSI SCSI revision: 05 SCSI device sde: 632216064 512-byte hdwr sectors (323695 MB) sde: Write Protect is off sde: Mode Sense: 87 00 00 08 SCSI device sde: drive cache: write through SCSI device sde: 632216064 512-byte hdwr sectors (323695 MB) sde: Write Protect is off sde: Mode Sense: 87 00 00 08 SCSI device sde: drive cache: write through sde: sde1 sd 5:0:0:0: Attached scsi disk sde Vendor: Pillar Model: Axiom 300 Rev: 0000 Type: Direct-Access ANSI SCSI revision: 05 SCSI device sdf: 2098788864 512-byte hdwr sectors (1074580 MB) sdf: Write Protect is off sdf: Mode Sense: 7f 00 00 08 SCSI device sdf: drive cache: write through SCSI device sdf: 2098788864 512-byte hdwr sectors (1074580 MB) sdf: Write Protect is off sdf: Mode Sense: 7f 00 00 08 SCSI device sdf: drive cache: write through sdf: unknown partition table sd 5:0:0:1: Attached scsi disk sdf Vendor: Pillar Model: Axiom 300 Rev: 0000 Type: Direct-Access ANSI SCSI revision: 05 SCSI device sdg: 632216064 512-byte hdwr sectors (323695 MB) sdg: Write Protect is off sdg: Mode Sense: 87 00 00 08 SCSI device sdg: drive cache: write through SCSI device sdg: 632216064 512-byte hdwr sectors (323695 MB) sdg: Write Protect is off sdg: Mode Sense: 87 00 00 08 SCSI device sdg: drive cache: write through sdg: sdg1 sd 5:0:1:0: Attached scsi disk sdg Vendor: Pillar Model: Axiom 300 Rev: 0000 Type: Direct-Access ANSI SCSI revision: 05 SCSI device sdh: 2098788864 512-byte hdwr sectors (1074580 MB) sdh: Write Protect is off sdh: Mode Sense: 7f 00 
00 08 SCSI device sdh: drive cache: write through SCSI device sdh: 2098788864 512-byte hdwr sectors (1074580 MB) sdh: Write Protect is off sdh: Mode Sense: 7f 00 00 08 SCSI device sdh: drive cache: write through sdh: unknown partition table sd 5:0:1:1: Attached scsi disk sdh kjournald starting. Commit interval 5 seconds EXT3-fs: mounted filesystem with ordered data mode. SELinux: Disabled at runtime. SELinux: Unregistering netfilter hooks type=1404 audit(1295321081.970:2): selinux=0 auid=4294967295 ses=4294967295 Broadcom NetXtreme II Gigabit Ethernet Driver bnx2 v2.0.8-rh (Oct 11, 2010) ACPI: PCI Interrupt 0000:03:00.0[A] -> GSI 40 (level, low) -> IRQ 193 PCI: Setting latency timer of device 0000:03:00.0 to 64 eth0: Broadcom NetXtreme II BCM5709 1000Base-T (C0) PCI Express found at mem f6000000, IRQ 193, node addr 0025b321e0ca ACPI: PCI Interrupt 0000:03:00.1[B] -> GSI 39 (level, low) -> IRQ 185 PCI: Setting latency timer of device 0000:03:00.1 to 64 eth1: Broadcom NetXtreme II BCM5709 1000Base-T (C0) PCI Express found at mem f4000000, IRQ 185, node addr 0025b321e0cc ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 34 (level, low) -> IRQ 233 PCI: Setting latency timer of device 0000:41:00.0 to 64 eth2: Broadcom NetXtreme II BCM5709 1000Base-T (C0) PCI Express found at mem fa000000, IRQ 233, node addr 0025b321e0c6 ACPI: PCI Interrupt 0000:41:00.1[B] -> GSI 33 (level, low) -> IRQ 225 PCI: Setting latency timer of device 0000:41:00.1 to 64 eth3: Broadcom NetXtreme II BCM5709 1000Base-T (C0) PCI Express found at mem f8000000, IRQ 225, node addr 0025b321e0c8 k10temp 0000:00:18.3: unreliable CPU thermal sensor; monitoring disabled k10temp 0000:00:19.3: unreliable CPU thermal sensor; monitoring disabled ACPI: PCI Interrupt 0000:00:04.2[B] -> GSI 45 (level, low) -> IRQ 146 Floppy drive(s): fd0 is 1.44M scsi 0:0:0:0: Attached scsi generic sg0 type 5 sd 4:0:0:0: Attached scsi generic sg1 type 0 sd 4:0:0:1: Attached scsi generic sg2 type 0 sd 4:0:1:0: Attached scsi generic sg3 
type 0 sd 4:0:1:1: Attached scsi generic sg4 type 0 sd 5:0:0:0: Attached scsi generic sg5 type 0 sd 5:0:0:1: Attached scsi generic sg6 type 0 sd 5:0:1:0: Attached scsi generic sg7 type 0 sd 5:0:1:1: Attached scsi generic sg8 type 0 shpchp: Standard Hot Plug PCI Controller Driver version: 0.4 input: PC Speaker as /class/input/input3 802.1Q VLAN Support v1.8 Ben Greear <greearb> All bugs added by David S. Miller <davem> EDAC MC: Ver: 2.0.1 Dec 19 2010 piix4_smbus 0000:00:06.0: Found 0000:00:06.0 device ACPI: PCI Interrupt 0000:45:00.0[A] -> GSI 32 (level, low) -> IRQ 217 PCI: Setting latency timer of device 0000:45:00.0 to 64 EDAC amd64_edac: Ver: 3.2.0 Dec 19 2010 qlge 0000:45:00.0: QLogic 10 Gigabit PCI-E Ethernet Driver qlge 0000:45:00.0: Driver name: qlge, Version: 1.00.00.25. EDAC amd64: ECC is enabled by BIOS. qlge 0000:45:00.0: Patch version: 2.6.16-2.6.18-p25, Release date: 100706. EDAC amd64: ECC is enabled by BIOS. qlge 0000:45:00.0: ql_display_dev_info: Function #0, Port #0, Rev ID = 20001010. qlge 0000:45:00.0: ql_display_dev_info: MAC address 00:c0:dd:12:0e:4c ACPI: PCI Interrupt 0000:45:00.1[B] -> GSI 36 (level, low) -> IRQ 209 PCI: Setting latency timer of device 0000:45:00.1 to 64 qlge 0000:45:00.1: ql_display_dev_info: Function #1, Port #1, Rev ID = 20001010. 
qlge 0000:45:00.1: ql_display_dev_info: MAC address 00:c0:dd:12:0e:4e EDAC MC: F10h CPU detected EDAC MC0: Giving out device to amd64_edac Family 10h: DEV 0000:00:18.2 EDAC MC: F10h CPU detected EDAC MC1: Giving out device to amd64_edac Family 10h: DEV 0000:00:19.2 sr0: scsi3-mmc drive: 24x/24x writer dvd-ram cd/rw xa/form2 cdda tray Uniform CD-ROM driver Revision: 3.20 sr 0:0:0:0: Attached scsi CD-ROM sr0 floppy0: no floppy controllers found work still pending Floppy drive(s): fd0 is 1.44M floppy0: no floppy controllers found work still pending lp: driver loaded but no devices found ACPI: Power Button (FF) [PWRF] ACPI: Mapper loaded dell-wmi: No known WMI GUID found md: Autodetecting RAID arrays. md: autorun ... md: ... autorun DONE. device-mapper: multipath: version 1.0.5 loaded EXT3 FS on dm-0, internal journal kjournald starting. Commit interval 5 seconds EXT3 FS on dm-1, internal journal EXT3-fs: mounted filesystem with ordered data mode. kjournald starting. Commit interval 5 seconds EXT3 FS on dm-2, internal journal EXT3-fs: mounted filesystem with ordered data mode. kjournald starting. Commit interval 5 seconds EXT3 FS on dm-3, internal journal EXT3-fs: mounted filesystem with ordered data mode. kjournald starting. Commit interval 5 seconds EXT3 FS on dm-5, internal journal EXT3-fs: mounted filesystem with ordered data mode. kjournald starting. Commit interval 5 seconds EXT3 FS on cciss/c0d0p1, internal journal EXT3-fs: mounted filesystem with ordered data mode. Adding 16777208k swap on /dev/system/swap. Priority:-1 extents:1 across:16777208k powernow-k8: Pre-initialization of ACPI failed powernow-k8: Found 2 Six-Core AMD Opteron(tm) Processor 2431 processors (12 cpu cores) (version 2.20.00) powernow-k8: Your BIOS does not provide _PSS objects. PowerNow! does not work on SMP systems without _PSS objects. Complain to your BIOS vendor. powernow-k8: Your BIOS does not provide _PSS objects. PowerNow! does not work on SMP systems without _PSS objects. 
Complain to your BIOS vendor. powernow-k8: Your BIOS does not provide _PSS objects. PowerNow! does not work on SMP systems without _PSS objects. Complain to your BIOS vendor. powernow-k8: Your BIOS does not provide _PSS objects. PowerNow! does not work on SMP systems without _PSS objects. Complain to your BIOS vendor. powernow-k8: Your BIOS does not provide _PSS objects. PowerNow! does not work on SMP systems without _PSS objects. Complain to your BIOS vendor. powernow-k8: Your BIOS does not provide _PSS objects. PowerNow! does not work on SMP systems without _PSS objects. Complain to your BIOS vendor. powernow-k8: Your BIOS does not provide _PSS objects. PowerNow! does not work on SMP systems without _PSS objects. Complain to your BIOS vendor. powernow-k8: Your BIOS does not provide _PSS objects. PowerNow! does not work on SMP systems without _PSS objects. Complain to your BIOS vendor. powernow-k8: Your BIOS does not provide _PSS objects. PowerNow! does not work on SMP systems without _PSS objects. Complain to your BIOS vendor. powernow-k8: Your BIOS does not provide _PSS objects. PowerNow! does not work on SMP systems without _PSS objects. Complain to your BIOS vendor. powernow-k8: Your BIOS does not provide _PSS objects. PowerNow! does not work on SMP systems without _PSS objects. Complain to your BIOS vendor. powernow-k8: Your BIOS does not provide _PSS objects. PowerNow! does not work on SMP systems without _PSS objects. Complain to your BIOS vendor. Loading iSCSI transport class v2.0-871. cxgb3i: tag itt 0x1fff, 13 bits, age 0xf, 4 bits. 
iscsi: registered transport (cxgb3i) NET: Registered protocol family 10 lo: Disabled Privacy Extensions IPv6 over IPv4 tunneling driver Broadcom NetXtreme II CNIC Driver cnic v2.1.2 (May 26, 2010) cnic: Added CNIC device: eth2 cnic: Added CNIC device: __tmp1031196009 cnic: Added CNIC device: eth0 cnic: Added CNIC device: eth1 Broadcom NetXtreme II iSCSI Driver bnx2i v2.1.3 (Aug 10, 2010) iscsi: registered transport (bnx2i) scsi6 : Broadcom Offload iSCSI Initiator scsi7 : Broadcom Offload iSCSI Initiator scsi8 : Broadcom Offload iSCSI Initiator scsi9 : Broadcom Offload iSCSI Initiator iscsi: registered transport (tcp) device-mapper: multipath round-robin: version 1.0.0 loaded iscsi: registered transport (iser) iscsi: registered transport (be2iscsi) bnx2: eth0: using MSIX ADDRCONF(NETDEV_UP): eth0: link is not ready bnx2i [41:00.00]: ISCSI_INIT passed bnx2: eth0 NIC Copper Link is Up, 100 Mbps full duplex ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready Bluetooth: Core ver 2.10 NET: Registered protocol family 31 Bluetooth: HCI device and connection manager initialized Bluetooth: HCI socket layer initialized Bluetooth: L2CAP ver 2.8 Bluetooth: L2CAP socket layer initialized Bluetooth: RFCOMM socket layer initialized Bluetooth: RFCOMM TTY layer initialized Bluetooth: RFCOMM ver 1.8 eth0: no IPv6 routers present Bluetooth: HIDP (Human Interface Emulation) ver 1.1 Netfilter messages via NETLINK v0.30. ip_conntrack version 2.4 (8192 buckets, 65536 max) - 304 bytes per conntrack ip_tables: (C) 2000-2006 Netfilter Core Team Bridge firewalling registered Ebtables v2.0 registered ip6_tables: (C) 2000-2006 Netfilter Core Team virbr0: no IPv6 routers present kjournald starting. Commit interval 5 seconds EXT3 FS on dm-6, internal journal EXT3-fs: mounted filesystem with ordered data mode. INFO: task kswapd0:697 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
kswapd0 D ffff81000100caa0 0 697 313 698 694 (L-TLB) ffff81082f699b00 0000000000000046 0000000000000010 00000000267d61f8 0fd0000600000008 000000000000000a ffff81042f5cd0c0 ffff81082ffb90c0 000004b227b3805b 0000000000036843 ffff81042f5cd2a8 000000022ff8e6f0 Call Trace: [<ffffffff88036d8a>] :jbd:log_wait_commit+0xa3/0xf5 [<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e [<ffffffff800276a9>] try_to_free_buffers+0x60/0xb9 [<ffffffff880332da>] :jbd:journal_try_to_free_buffers+0x19d/0x1c0 [<ffffffff800cd32c>] shrink_inactive_list+0x511/0x8d8 [<ffffffff800131a5>] shrink_zone+0x127/0x18d [<ffffffff80057be8>] kswapd+0x33d/0x495 [<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4 [<ffffffff800578ab>] kswapd+0x0/0x495 [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4 [<ffffffff80032974>] kthread+0xfe/0x132 [<ffffffff8009f283>] request_module+0x0/0x14d [<ffffffff8005dfb1>] child_rip+0xa/0x11 [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4 [<ffffffff80032876>] kthread+0x0/0x132 [<ffffffff8005dfa7>] child_rip+0x0/0x11 INFO: task kswapd0:697 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
kswapd0 D ffff81000101d7a0 0 697 313 698 694 (L-TLB) ffff81082f699b00 0000000000000046 0000000000000002 0000000000000010 ffff81052e6b5000 000000000000000a ffff81042f5cd0c0 ffff81082fe830c0 0000092a5ff10c9e 000000000062bcda ffff81042f5cd2a8 0000000600000024 Call Trace: [<ffffffff88036d8a>] :jbd:log_wait_commit+0xa3/0xf5 [<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e [<ffffffff800276a9>] try_to_free_buffers+0x60/0xb9 [<ffffffff880332da>] :jbd:journal_try_to_free_buffers+0x19d/0x1c0 [<ffffffff800cd32c>] shrink_inactive_list+0x511/0x8d8 [<ffffffff80047ff2>] __pagevec_release+0x19/0x22 [<ffffffff800cccfa>] shrink_active_list+0x4b4/0x4c4 [<ffffffff800131a5>] shrink_zone+0x127/0x18d [<ffffffff80057be8>] kswapd+0x33d/0x495 [<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4 [<ffffffff800578ab>] kswapd+0x0/0x495 [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4 [<ffffffff80032974>] kthread+0xfe/0x132 [<ffffffff8009f283>] request_module+0x0/0x14d [<ffffffff8005dfb1>] child_rip+0xa/0x11 [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4 [<ffffffff80032876>] kthread+0x0/0x132 [<ffffffff8005dfa7>] child_rip+0x0/0x11 INFO: task kswapd0:697 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
kswapd0 D ffff810001015120 0 697 313 698 694 (L-TLB) ffff81082f699b00 0000000000000046 0000000000000002 0000000000000010 ffff810211fe0000 000000000000000a ffff81042f5cd0c0 ffff81082ff73040 000009af68cd2c1c 00000000008aaa91 ffff81042f5cd2a8 0000000400000006 Call Trace: [<ffffffff88036d8a>] :jbd:log_wait_commit+0xa3/0xf5 [<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e [<ffffffff800276a9>] try_to_free_buffers+0x60/0xb9 [<ffffffff880332da>] :jbd:journal_try_to_free_buffers+0x19d/0x1c0 [<ffffffff800cd32c>] shrink_inactive_list+0x511/0x8d8 [<ffffffff80047ff2>] __pagevec_release+0x19/0x22 [<ffffffff800cccfa>] shrink_active_list+0x4b4/0x4c4 [<ffffffff800131a5>] shrink_zone+0x127/0x18d [<ffffffff80057be8>] kswapd+0x33d/0x495 [<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4 [<ffffffff800578ab>] kswapd+0x0/0x495 [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4 [<ffffffff80032974>] kthread+0xfe/0x132 [<ffffffff8009f283>] request_module+0x0/0x14d [<ffffffff8005dfb1>] child_rip+0xa/0x11 [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4 [<ffffffff80032876>] kthread+0x0/0x132 [<ffffffff8005dfa7>] child_rip+0x0/0x11 INFO: task fio:30488 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
fio D ffff81043e0a9aa0 0 30488 27162 30489 (NOTLB) ffff8102272df768 0000000000000082 ffff81042de78c98 00000000798f4530 ffff810780ca8ee8 0000000000000009 ffff81017c983860 ffff81082ff5e100 0000294edeeb2129 00000000008bd33b ffff81017c983a48 0000000300000286 Call Trace: [<ffffffff8006ec4e>] do_gettimeofday+0x40/0x90 [<ffffffff800154b2>] sync_buffer+0x0/0x3f [<ffffffff800637ca>] io_schedule+0x3f/0x67 [<ffffffff800154ed>] sync_buffer+0x3b/0x3f [<ffffffff800639f6>] __wait_on_bit+0x40/0x6e [<ffffffff800154b2>] sync_buffer+0x0/0x3f [<ffffffff80063a90>] out_of_line_wait_on_bit+0x6c/0x78 [<ffffffff800a28e2>] wake_bit_function+0x0/0x23 [<ffffffff80025702>] __bread+0x6c/0x86 [<ffffffff8804df2a>] :ext3:ext3_get_branch+0x67/0xd2 [<ffffffff8804e1ad>] :ext3:ext3_get_blocks_handle+0xc7/0x9bc [<ffffffff8005c0fb>] cache_alloc_refill+0x106/0x186 [<ffffffff8804edb1>] :ext3:ext3_get_block+0xb6/0xf7 [<ffffffff8000e750>] __block_prepare_write+0x1a5/0x39e [<ffffffff8804ecfb>] :ext3:ext3_get_block+0x0/0xf7 [<ffffffff800e3a43>] block_write_begin+0x80/0xcf [<ffffffff880503b0>] :ext3:ext3_write_begin+0xe8/0x1cc [<ffffffff8804ecfb>] :ext3:ext3_get_block+0x0/0xf7 [<ffffffff8000fda3>] generic_file_buffered_write+0x14b/0x675 [<ffffffff880317ae>] :jbd:journal_stop+0x1f3/0x1ff [<ffffffff80016679>] __generic_file_aio_write_nolock+0x369/0x3b6 [<ffffffff80021850>] generic_file_aio_write+0x65/0xc1 [<ffffffff8804c1b6>] :ext3:ext3_file_write+0x16/0x91 [<ffffffff800182df>] do_sync_write+0xc7/0x104 [<ffffffff8006723e>] do_page_fault+0x4fe/0x874 [<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e [<ffffffff80016a81>] vfs_write+0xce/0x174 [<ffffffff80017339>] sys_write+0x45/0x6e [<ffffffff8005d28d>] tracesys+0xd5/0xe0 INFO: task fio:30488 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
fio D ffff81043e0a9aa0 0 30488 27162 30489 (NOTLB) ffff8102272df768 0000000000000082 ffff81042de78c98 00000000798f4530 ffff810780ca8ee8 0000000000000009 ffff81017c983860 ffff81082ff5e100 0000294edeeb2129 00000000008bd33b ffff81017c983a48 0000000300000286 Call Trace: [<ffffffff8006ec4e>] do_gettimeofday+0x40/0x90 [<ffffffff800154b2>] sync_buffer+0x0/0x3f [<ffffffff800637ca>] io_schedule+0x3f/0x67 [<ffffffff800154ed>] sync_buffer+0x3b/0x3f [<ffffffff800639f6>] __wait_on_bit+0x40/0x6e [<ffffffff800154b2>] sync_buffer+0x0/0x3f [<ffffffff80063a90>] out_of_line_wait_on_bit+0x6c/0x78 [<ffffffff800a28e2>] wake_bit_function+0x0/0x23 [<ffffffff80025702>] __bread+0x6c/0x86 [<ffffffff8804df2a>] :ext3:ext3_get_branch+0x67/0xd2 [<ffffffff8804e1ad>] :ext3:ext3_get_blocks_handle+0xc7/0x9bc [<ffffffff8005c0fb>] cache_alloc_refill+0x106/0x186 [<ffffffff8804edb1>] :ext3:ext3_get_block+0xb6/0xf7 [<ffffffff8000e750>] __block_prepare_write+0x1a5/0x39e [<ffffffff8804ecfb>] :ext3:ext3_get_block+0x0/0xf7 [<ffffffff800e3a43>] block_write_begin+0x80/0xcf [<ffffffff880503b0>] :ext3:ext3_write_begin+0xe8/0x1cc [<ffffffff8804ecfb>] :ext3:ext3_get_block+0x0/0xf7 [<ffffffff8000fda3>] generic_file_buffered_write+0x14b/0x675 [<ffffffff880317ae>] :jbd:journal_stop+0x1f3/0x1ff [<ffffffff80016679>] __generic_file_aio_write_nolock+0x369/0x3b6 [<ffffffff80021850>] generic_file_aio_write+0x65/0xc1 [<ffffffff8804c1b6>] :ext3:ext3_file_write+0x16/0x91 [<ffffffff800182df>] do_sync_write+0xc7/0x104 [<ffffffff8006723e>] do_page_fault+0x4fe/0x874 [<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e [<ffffffff80016a81>] vfs_write+0xce/0x174 [<ffffffff80017339>] sys_write+0x45/0x6e [<ffffffff8005d28d>] tracesys+0xd5/0xe0 sd 5:0:1:1: timing out command, waited 360s sd 5:0:1:1: SCSI error: return code = 0x06000028 end_request: I/O error, dev sdh, sector 3193936 INFO: task ls:4002 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
ls D ffff81043e0a9aa0 0 4002 3221 (NOTLB) ffff810473e7bd08 0000000000000082 ffff81042de78c98 0000000079bfdfe8 ffff8106b3c35cd8 0000000000000008 ffff81040cce60c0 ffff81082ff5e100 00002a500b6569a0 000000000015dfc9 ffff81040cce62a8 0000000300000286 Call Trace: [<ffffffff8006ec4e>] do_gettimeofday+0x40/0x90 [<ffffffff800154b2>] sync_buffer+0x0/0x3f [<ffffffff800637ca>] io_schedule+0x3f/0x67 [<ffffffff800154ed>] sync_buffer+0x3b/0x3f [<ffffffff800639f6>] __wait_on_bit+0x40/0x6e [<ffffffff800154b2>] sync_buffer+0x0/0x3f [<ffffffff80063a90>] out_of_line_wait_on_bit+0x6c/0x78 [<ffffffff800a28e2>] wake_bit_function+0x0/0x23 [<ffffffff8804dc3f>] :ext3:__ext3_get_inode_loc+0x2a9/0x2f9 [<ffffffff8804dcc3>] :ext3:ext3_reserve_inode_write+0x23/0x90 [<ffffffff8804dd51>] :ext3:ext3_mark_inode_dirty+0x21/0x3c [<ffffffff88050cae>] :ext3:ext3_dirty_inode+0x63/0x7b [<ffffffff80013c94>] __mark_inode_dirty+0x29/0x16e [<ffffffff800258ae>] filldir+0x0/0xb7 [<ffffffff800353a9>] vfs_readdir+0x8c/0xa9 [<ffffffff80038c2d>] sys_getdents+0x75/0xbd [<ffffffff8005d229>] tracesys+0x71/0xe0 [<ffffffff8005d28d>] tracesys+0xd5/0xe0 INFO: task ls:4002 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
ls D ffff81043e0a9aa0 0 4002 3221 (NOTLB) ffff810473e7bd08 0000000000000082 ffff81042de78c98 0000000079bfdfe8 ffff8106b3c35cd8 0000000000000008 ffff81040cce60c0 ffff81082ff5e100 00002a500b6569a0 000000000015dfc9 ffff81040cce62a8 0000000300000286 Call Trace: [<ffffffff8006ec4e>] do_gettimeofday+0x40/0x90 [<ffffffff800154b2>] sync_buffer+0x0/0x3f [<ffffffff800637ca>] io_schedule+0x3f/0x67 [<ffffffff800154ed>] sync_buffer+0x3b/0x3f [<ffffffff800639f6>] __wait_on_bit+0x40/0x6e [<ffffffff800154b2>] sync_buffer+0x0/0x3f [<ffffffff80063a90>] out_of_line_wait_on_bit+0x6c/0x78 [<ffffffff800a28e2>] wake_bit_function+0x0/0x23 [<ffffffff8804dc3f>] :ext3:__ext3_get_inode_loc+0x2a9/0x2f9 [<ffffffff8804dcc3>] :ext3:ext3_reserve_inode_write+0x23/0x90 [<ffffffff8804dd51>] :ext3:ext3_mark_inode_dirty+0x21/0x3c [<ffffffff88050cae>] :ext3:ext3_dirty_inode+0x63/0x7b [<ffffffff80013c94>] __mark_inode_dirty+0x29/0x16e [<ffffffff800258ae>] filldir+0x0/0xb7 [<ffffffff800353a9>] vfs_readdir+0x8c/0xa9 [<ffffffff80038c2d>] sys_getdents+0x75/0xbd [<ffffffff8005d229>] tracesys+0x71/0xe0 [<ffffffff8005d28d>] tracesys+0xd5/0xe0 INFO: task ls:6734 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
ls D ffff8108296650c0 0 6734 6647 (NOTLB) ffff81069ab37e88 0000000000000082 ffff81079a26c060 ffffffff8000da09 ffff81081e0a8448 0000000000000008 ffff81042f966820 ffff8108296650c0 00002a7750d6ae3b 00000000002fc400 ffff81042f966a08 00000003ffffff9c Call Trace: [<ffffffff8000da09>] permission+0x81/0xc8 [<ffffffff80022214>] __up_read+0x19/0x7f [<ffffffff800258ae>] filldir+0x0/0xb7 [<ffffffff80063c4f>] __mutex_lock_slowpath+0x60/0x9b [<ffffffff80063c99>] .text.lock.mutex+0xf/0x14 [<ffffffff80035379>] vfs_readdir+0x5c/0xa9 [<ffffffff80038c2d>] sys_getdents+0x75/0xbd [<ffffffff8005d229>] tracesys+0x71/0xe0 [<ffffffff8005d28d>] tracesys+0xd5/0xe0 sd 5:0:1:1: timing out command, waited 360s sd 5:0:1:1: SCSI error: return code = 0x06000028 end_request: I/O error, dev sdh, sector 8600 EXT3-fs error (device dm-6): ext3_get_inode_loc: unable to read inode block - inode=2, block=1027 Aborting journal on device dm-6. EXT3-fs error (device dm-6) in ext3_ordered_writepage: IO failure EXT3-fs error (device dm-6) in ext3_reserve_inode_write: IO failure EXT3-fs error (device dm-6) in ext3_dirty_inode: IO failure ext3_abort called. EXT3-fs error (device dm-6): ext3_journal_start_sb: Detected aborted journal Remounting filesystem read-only ext3_abort called. EXT3-fs error (device dm-6): ext3_put_super: Couldn't clean up the journal kjournald starting. Commit interval 5 seconds EXT3 FS on dm-6, internal journal EXT3-fs: mounted filesystem with ordered data mode. INFO: task fio:970 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
fio D ffff81043e0a9aa0 0 970 27921 971 (NOTLB) ffff81059079b768 0000000000000082 ffff81042de78c98 00000000798832a0 ffff81082bf6ce10 0000000000000009 ffff81082b967080 ffff81082ff5e100 000039676d87a549 000000000015101f ffff81082b967268 0000000300000286 Call Trace: [<ffffffff8006ec4e>] do_gettimeofday+0x40/0x90 [<ffffffff800154b2>] sync_buffer+0x0/0x3f [<ffffffff800637ca>] io_schedule+0x3f/0x67 [<ffffffff800154ed>] sync_buffer+0x3b/0x3f [<ffffffff800639f6>] __wait_on_bit+0x40/0x6e [<ffffffff800154b2>] sync_buffer+0x0/0x3f [<ffffffff80063a90>] out_of_line_wait_on_bit+0x6c/0x78 [<ffffffff800a28e2>] wake_bit_function+0x0/0x23 [<ffffffff80025702>] __bread+0x6c/0x86 [<ffffffff8804df2a>] :ext3:ext3_get_branch+0x67/0xd2 [<ffffffff8804e1ad>] :ext3:ext3_get_blocks_handle+0xc7/0x9bc [<ffffffff8006202a>] __memset+0x1e/0xc0 [<ffffffff8005c0fb>] cache_alloc_refill+0x106/0x186 [<ffffffff8804edb1>] :ext3:ext3_get_block+0xb6/0xf7 [<ffffffff8000e750>] __block_prepare_write+0x1a5/0x39e [<ffffffff8804ecfb>] :ext3:ext3_get_block+0x0/0xf7 [<ffffffff800e3a43>] block_write_begin+0x80/0xcf [<ffffffff880503b0>] :ext3:ext3_write_begin+0xe8/0x1cc [<ffffffff8804ecfb>] :ext3:ext3_get_block+0x0/0xf7 [<ffffffff8000fda3>] generic_file_buffered_write+0x14b/0x675 [<ffffffff880317ae>] :jbd:journal_stop+0x1f3/0x1ff [<ffffffff80016679>] __generic_file_aio_write_nolock+0x369/0x3b6 [<ffffffff80021850>] generic_file_aio_write+0x65/0xc1 [<ffffffff8804c1b6>] :ext3:ext3_file_write+0x16/0x91 [<ffffffff800182df>] do_sync_write+0xc7/0x104 [<ffffffff8006723e>] do_page_fault+0x4fe/0x874 [<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e [<ffffffff80016a81>] vfs_write+0xce/0x174 [<ffffffff80017339>] sys_write+0x45/0x6e [<ffffffff8005d28d>] tracesys+0xd5/0xe0 INFO: task kswapd0:697 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
kswapd0 D ffff810001015120 0 697 313 698 694 (L-TLB) ffff81082f699b00 0000000000000046 0000000000000010 0000000005043e18 0fd0000600000008 000000000000000a ffff81042f5cd0c0 ffff81082ff73040 00004824fdbdb689 00000000000cceb1 ffff81042f5cd2a8 0000000400000014 Call Trace: [<ffffffff88036d8a>] :jbd:log_wait_commit+0xa3/0xf5 [<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e [<ffffffff800276a9>] try_to_free_buffers+0x60/0xb9 [<ffffffff880332da>] :jbd:journal_try_to_free_buffers+0x19d/0x1c0 [<ffffffff800cd32c>] shrink_inactive_list+0x511/0x8d8 [<ffffffff80047ff2>] __pagevec_release+0x19/0x22 [<ffffffff800cccfa>] shrink_active_list+0x4b4/0x4c4 [<ffffffff800131a5>] shrink_zone+0x127/0x18d [<ffffffff80057be8>] kswapd+0x33d/0x495 [<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4 [<ffffffff800578ab>] kswapd+0x0/0x495 [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4 [<ffffffff80032974>] kthread+0xfe/0x132 [<ffffffff8009f283>] request_module+0x0/0x14d [<ffffffff8005dfb1>] child_rip+0xa/0x11 [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4 [<ffffffff80032876>] kthread+0x0/0x132 [<ffffffff8005dfa7>] child_rip+0x0/0x11
(In reply to comment #44) > "You are not authorized to access bug #580818"... Use BZ 615543, it should be available.
(In reply to comment #43) > The original problem within this BZ was on an LSI controller which was fixed by > upgrading the FJ drive firmware (comment #20). > > The >120s messages can be a symptom of many different issues, including just a > very very busy system (for example, set io timeout to 300s and anytime you > encounter an io timeout you can get these messages if the task stall detection > time is set less than the io timeout value). > > For Smart Array configurations encountering this type of message or longer term > hang issues, I'd suggest using BZ 580818 instead. > > For other configurations other than LSI or Smart Array I'd suggest opening a > new BZ with appropriate details. Thanks for the nice summary, Bud. I'm closing this bug. Please follow Bud's advice if you see these problems. As always, filing a ticket with support is the preferred method, as bugzilla is a bug tracking tool, not a support tool.
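The timeout relationship described in the quoted summary can be sketched as a small check. This is only an illustration: the function name is made up, and the two paths named in the comments are assumed to exist on the host (the hung-task sysctl appears in the log messages above; the per-device timeout file is the usual sysfs location for SCSI disks).

```shell
#!/bin/sh
# Sketch: compare the hung-task detection window with a disk's I/O timeout.
# Both values are passed as arguments so the logic can be tried anywhere.
# Usage: timeout_verdict HUNG_TASK_SECS IO_TIMEOUT_SECS
timeout_verdict() {
    hung=$1
    io=$2
    if [ "$hung" -eq 0 ]; then
        # 0 means the warning is switched off, not that the stall is fixed
        echo "detector-disabled"
    elif [ "$io" -gt "$hung" ]; then
        # a single I/O timeout can outlast the detection window
        echo "expect-warnings"
    else
        echo "ok"
    fi
}

# On a live host the inputs would typically come from (assumed paths):
#   cat /proc/sys/kernel/hung_task_timeout_secs
#   cat /sys/block/sda/device/timeout
timeout_verdict 120 300
```

With a 120 second detection window and a 300 second I/O timeout, as in the example from the summary, the check reports that warnings are to be expected even if nothing is permanently hung.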
I have experienced the same problem on some of my machines. INFO: task kswapd0:697 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Any process may appear in the error text, for example klogd instead of kswapd. The most interesting point is that it appeared after I updated to RHEL 5.6 and kernel-PAE-2.6.18-238.1.1.el5. So I booted kernel-PAE-2.6.18-194.32.1.el5 and the problem went away. This week I tried the new kernel-PAE-2.6.18-238.5.1.el5, and again got the same failure. I have this problem on most of my servers, which are HP DL360 G3 or G4, but not on all of them. The G5 worked without any problems. What can I do? Only wait for the next kernel release, hoping that the problem will be fixed?
(In reply to comment #48) > I have experienced the same problems with this issue on some of my machines. > > INFO: task kswapd0:697 blocked for more than 120 seconds. > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > And it may be any process in the error text, for example klogd instead of > kswapd. > > The most interesting point that it appeared after I updated to RHEL 5.6 and > kernel-PAE-2.6.18-238.1.1.el5. So I booted kernel-PAE-2.6.18-194.32.1.el5 and > problem has gone. This week I tried new kernel-PAE-2.6.18-238.5.1.el5, and > again the same failure. > I have this problem on most of my servers - that is HP DL360 G3 or G4, but not > on all. G5 worked without any problems. What can I do? Only wait for the next > kernel release hoping that the problem will be fixed? Hi, George, Can you provide some more information on your system? Are you using a Smart Array controller? Is there any other storage attached to the system?
Hi Jeffrey, I'll try :) Yes, I am using a Smart Array Controller: [root@XXXXX ~]# hpacucli controller slot=0 logicaldrive 1 show status logicaldrive 1 (279.4 GB, RAID RAID 1+0): OK [root@XXXXXX ~]# dmidecode | grep DL Product Name: ProLiant DL360 G4 No other storage is attached to the system. The system really is under high load; it's a mail server with a lot of traffic. But there are no such failures with the previous kernel: [root@XXXXX ~]# uname -r 2.6.18-194.32.1.el5PAE The 238.1 and 238.5 kernel releases hang the server: [root@XXXXXX ~]# rpm -aq| grep kernel | grep PAE kernel-PAE-2.6.18-238.1.1.el5 kernel-PAE-devel-2.6.18-238.1.1.el5 kernel-PAE-devel-2.6.18-238.5.1.el5 kernel-PAE-2.6.18-194.32.1.el5 kernel-PAE-2.6.18-238.5.1.el5 kernel-PAE-devel-2.6.18-194.32.1.el5 The situation is the same on most of my 12 servers. All of them are G3 and G4; only one is a G5.
Hi, same problems here. [root@XXXXX ~]# uname -r 2.6.18-194.32.1.el5 Raid-Controller: LSI 9750 8i, Seagate 2TB Constellation(SN11) I've tried the following things, but the situation is still the same. 1. changed the i/o scheduler to "noop" 2. set the i/o timeout to 300 3. changed raid/hardware Is the issue present only on LSI controllers? Has anybody tried to set "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"? Please publish a fix for this issue; it's very sad to see that it has existed for so long. :(
(In reply to comment #52) > Hi, > same problems here. > [root@XXXXX ~]# uname -r > 2.6.18-194.32.1.el5 > Raid-Controller: LSI 9750 8i, Seagate 2TB Constellation(SN11) > I've tried the following things but the situation is still the same. > 1. change the i/o scheduler to "noop" > 2. set i/o timeout to 300 > 3. changed raid/hardware > The issue is present only on LSI-Controllers? Anybody tried to set "echo 0 > > /proc/sys/kernel/hung_task_timeout_secs"? > Please publish a fix for this issue, it's very sad to see that this issue > exists so long. :( Hi Willi, A little correction: I didn't have this problem on the 2.6.18-194.32.1.el5 kernel. Just the opposite: I booted back into this kernel. The problem is with 2.6.18-238. I haven't tried changing the i/o scheduler or timeout yet, because I think that everything should work without any changes to the default values and Red Hat should fix this issue.
(In reply to comment #50) Hi, George, > Yes, I am using a Smart Array Controller: OK. It is likely that you are running into the firmware issue described by Bud Brown in comment #43. See bug #615543. Producing a vmcore and posting a link to it in that bug (615543) would be a good way to determine whether you're hitting that problem or something different. To be clear, if it is a hung I/O in the cciss firmware, then there's nothing we at Red Hat can do to fix it. We are working with HP to get a fix for the issue, but it is very much a firmware issue, not an O/S issue.
(In reply to comment #51) > Hi, > > same problems here. > > [root@XXXXX ~]# uname -r > 2.6.18-194.32.1.el5 > > Raid-Controller: LSI 9750 8i, Seagate 2TB Constellation(SN11) > > I've tried the following things but the situation is still the same. > > 1. change the i/o scheduler to "noop" > 2. set i/o timeout to 300 > 3. changed raid/hardware > > The issue is present only on LSI-Controllers? Anybody tried to set "echo 0 > > /proc/sys/kernel/hung_task_timeout_secs"? > > Please publish a fix for this issue, it's very sad to see that this issue > exists so long. :( Hi, Willi, Please file a ticket with support. The issue you are hitting is definitely different from what was reported in this bugzilla.
Hi Jeffrey, why is this different? I have the same error message as in this bugzilla and I am also using an LSI RAID controller. Can I fix this issue with "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"? Regards - Willi
The issue this bugzilla initially covered was corrected by updating firmware on the HDD. Echo-ing 0 to the hung_task_timeout_secs file will simply turn off the warning without addressing the problem. Please contact support and open a case so we can get to the bottom of your problem.
Hi Jeffrey, unfortunately we are using CentOS 5.5 x64 and we haven't got any support, so I think I'm not able to open a support case? :( But I know CentOS is just a rebuild of Red Hat Enterprise Linux. I will keep checking this case, because it looks like some people still have the same issue and are waiting for a fix. Could anybody who has updated to RHEL-5.6 and booted back into 2.6.18-194.32.1.el5 please confirm that the issue is gone? I think I have 3 options: open a support case with LSI, check this case for updates, or change my RAID controller to an Adaptec 5805Z. Any hints? Regards, Willi
(In reply to comment #54) > (In reply to comment #50) > Hi, George, > > Yes, I am using a Smart Array Controller: > OK. It is likely that you are running into the firmware issue described by Bud > Brown in comment #43. See bug #615543. Producing a vmcore and posting a link > to it in that bug (615543) would be a good way to determine whether you're > hitting that problem or something different. To be clear, if it is a hung I/O > in the cciss firmware, then there's nothing we at Red Hat can do to fix it. We > are working with HP to get a fix for the issue, but it is very much a firmware > issue, not an O/S issue. Hi Jeffrey, I upgraded the firmware from the official HP distribution (it was a DL360 G4) and installed a fresh HP SupportPack. Unluckily, it didn't help; the machine hung again. I didn't quite understand what you meant by: >"Producing a vmcore and posting a link > to it in that bug (615543) would be a good way to determine whether you're > hitting that problem or something different" What is a vmcore and how can I produce it? Now I have to boot back into the 2.6.18-194.32.1.el5 kernel :( Any other ideas? Should I perhaps stop the HP SupportPack services as a test?
(In reply to comment #58) > Hi Jeffrey, > unfortunately we are using CentOS 5.5 x64 and we haven't got any support, so I > think I'm not able to open a support case? :( > But I know CentOS is just a rebuild of RedHat Enterprise. I will check this > support case, because it looks like some people still have the same issue and > wait for a fix. > Could please anybody confirm, who has updated to RHEL-5.6 and booted back > 2.6.18-194.32.1.el5 that the issue is gone? Yes, 2.6.18-194.32.1.el5 is the last working kernel for me. > I think I have 3 options, open a support case on LSI, check this case for > updates or change my Raid Controller to an Adaptec 5805Z. > Any hints? > Regards, Willi
(In reply to comment #61) > (In reply to comment #58) > > Hi Jeffrey, > > unfortunately we are using CentOS 5.5 x64 and we haven't got any support, so I > > think I'm not able to open a support case? :( > > But I know CentOS is just a rebuild of RedHat Enterprise. I will check this > > support case, because it looks like some people still have the same issue and > > wait for a fix. > > Could please anybody confirm, who has updated to RHEL-5.6 and booted back > > 2.6.18-194.32.1.el5 that the issue is gone? > > Yes, 2.6.18-194.32.1.el5 is the last working kernel for me. And with RHEL-5.5, did you have the same issue as me? Let's see if CentOS-5.6 fixes my issue when booted back into 2.6.18-194.32.1.el5.
(In reply to comment #62) > (In reply to comment #61) > > (In reply to comment #58) > > > Hi Jeffrey, > > > unfortunately we are using CentOS 5.5 x64 and we haven't got any support, so I > > > think I'm not able to open a support case? :( > > > But I know CentOS is just a rebuild of RedHat Enterprise. I will check this > > > support case, because it looks like some people still have the same issue and > > > wait for a fix. > > > Could please anybody confirm, who has updated to RHEL-5.6 and booted back > > > 2.6.18-194.32.1.el5 that the issue is gone? > > > > Yes, 2.6.18-194.32.1.el5 is the last working kernel for me. > And with RHEL-5.5 you had the same issue like me? Let's see if CentOS-5.6 will > fix my issue if booted back 2.6.18-194.32.1.el5. I have 2 machines with CentOS, but I am afraid to update them now, although they are installed on ESX, so if it is a hardware issue (which is 99% likely) it shouldn't reproduce there. By the way, can anybody tell me how to do the following: I want to update with "yum update", but I don't want yum to remove the old kernel package 2.6.18-194.32.1.el5. Is that possible? It wants to remove it (as far as I understand, it keeps only 3 working kernels).
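For readers with the same question: yum takes the number of kernels to keep from the installonly_limit setting in /etc/yum.conf. On older yum releases this was handled by a separate plugin, so whether the option is honoured on a given RHEL/CentOS 5 minor release is an assumption to verify against "man yum.conf". A minimal sketch that reads the current value from a yum.conf-style file (the function name is made up for illustration):

```shell
#!/bin/sh
# Print the installonly_limit value found in a yum.conf-style file.
# Prints nothing if the option is not set in that file.
# Usage: installonly_limit_of /etc/yum.conf
installonly_limit_of() {
    sed -n 's/^[[:space:]]*installonly_limit[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*$/\1/p' "$1" | tail -n 1
}
```

Raising the value (for example to installonly_limit=5) before running "yum update" should leave the older kernel packages installed, assuming the running yum supports the option.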
Is there no chance of solving this problem and running the new kernel?
This is a cross-post to BZ 615543, as both seem related. Although we run CentOS, this is related: we're also experiencing this issue on two of our ~20 HP ProLiant DL380 G4 systems since upgrading from CentOS 5.5 to 5.6. We run 2.6.18-238.9.1.el5PAE and did not observe the bug before upgrading to 5.6. In server A we have configured the Smart Array to use 6 (original) disks in 3 x RAID 1 (with two disks each). Server B uses a Smart Array with three disks, two in RAID 1 and one in RAID 0. Crashes happen on a regular basis and can be provoked by putting high load on the servers, e.g. with bonnie++. The last bonnie++ run lasted only 3 hours before the server crashed. Chris
You can try this upstream patch - https://lkml.org/lkml/2011/4/13/228 it has reportedly helped in similar issues. I'm not sure, but maybe it is already applied here - http://people.redhat.com/jwilson/el5/
Tomas, thanks for the links and info. Last week I installed kernel 2.6.18-259.el5 from the website you gave me. Unfortunately, it took only two days until the next system hang with the same "blocked for more than 120 seconds" error message. If this helps, I can attach a stack trace from the system, just let me know. Is there any reasonably easy way to find out if the patch you mentioned was already in 2.6.18-259? I suppose that it was either not included in 2.6.18-259, or we are experiencing some other issue here. I saw that 2.6.18-262 came out and wondered whether it is worth a try. Thanks!
We just downloaded the src RPM of 2.6.18-262 and found that the patch provided at https://lkml.org/lkml/2011/4/13/228 (which may solve our problem) is not included in the kernel yet. What is the procedure to include the patch in an upcoming 2.6.18 kernel? Is there any reason it is not in the kernel yet? We would like to avoid compiling a kernel manually, if possible. Given a considerable number of server crashes recently, this problem has forced us to use old kernel versions (2.6.18-194) to remain operational.
Hi, a short update from my side. We were able to fix the issues. First, I disabled command queuing for all logical drives. LSI support told me command queuing was enabled. Second, I changed the schedule of a monitoring script which executed "repquota -avg" every minute. Now this script runs once a day. For 1 month everything has been working fine without any issues. I've tested the following kernels: CentOS 5.5 CentOS 5.6 2.6.18-194.x 2.6.18-238.x Regards - Willi
Hi Willi, Can you explain in more detail how you solved this big problem? I am still suffering from it :) (In reply to comment #69) > Hi, > > a short update from my side. We could fix the issues. > > First thing I've disabled command queueing for all logical drives. The LSI > support told me command queuing was enabled. How? > Second thing I've > changed the runtime of a monitoring script, which executes a "repquota -avg" > every minute. Now this script runs once a day. What kind of script? I don't quite understand you. > > Since 1 months everything is working fine without any issues. > I've tested the following kernels: > > CentOS 5.5 > CentOS 5.6 > > 2.6.18-194.x > 2.6.18-238.x > > Regards - Willi
IMHO taking off the load from servers as you suggest cannot be part of the solution here. I wonder why this ticket was closed, as we are apparently not the only ones facing this issue. In case there is more debug information required, I'd gladly help out.
I also wonder why the ticket is closed and what we should do to re-open it. Taking the load off really is not a solution for me.
Hi Willi, We have the same issue with our DL360s and DL380s since upgrading to CentOS 5.6. Any specifics on your fix would be helpful. Thanks! (In reply to comment #69) > Hi, > > a short update from my side. We could fix the issues. > > First thing I've disabled command queueing for all logical drives. The LSI > support told me command queuing was enabled. Second thing I've > changed the runtime of a monitoring script, which executes a "repquota -avg" > every minute. Now this script runs once a day. > > Since 1 months everything is working fine without any issues. > I've tested the following kernels: > > CentOS 5.5 > CentOS 5.6 > > 2.6.18-194.x > 2.6.18-238.x > > Regards - Willi
Hi, we have disabled command queuing on our servers (LSI 9750). You have to run "tw_cli" and then enter the following command: /c0/u0 set qpolicy=off This disables command queuing for controller 0 and unit 0. The second thing we changed was one of our monitoring scripts. We're running samba on our servers and we used a monitoring script which executes "repquota -avg". We need this script for our nagios monitoring. This script was running every minute. Now we run this script once a day, and since then we haven't had any issues. But I don't know which change fixed the issue; we've only changed command queuing and the schedule of this script. Regards - Willi
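The two changes described above can be sketched as follows. This is a dry-run illustration only: the helper name and the script path in the cron comments are hypothetical, and passing the command to tw_cli as arguments (instead of typing it at the tw_cli prompt as described) is an assumption to check against the controller's CLI guide.

```shell
#!/bin/sh
# Dry-run helper: print the tw_cli command that would disable command
# queuing for a given controller and unit, without executing it.
# Usage: qpolicy_off_cmd CONTROLLER UNIT
qpolicy_off_cmd() {
    printf 'tw_cli /c%s/u%s set qpolicy=off\n' "$1" "$2"
}

# Controller 0, unit 0, as in the comment above
qpolicy_off_cmd 0 0

# Cron schedule change for the quota check (script path is hypothetical):
#   before:  * * * * *  /usr/local/bin/check_quota.sh   # every minute
#   after:   0 3 * * *  /usr/local/bin/check_quota.sh   # once a day at 03:00
```

Printing the command first makes it easy to review the controller and unit numbers before running anything against a production array.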
This should be reopened. It's a crash under load with the latest kernel, fixed by going back to the 2.6.18-194.32.1.el5 version. This is on an HP ProLiant server with an HP 6i storage array, if it helps.
Dear Red Hat support! If you have read our last comments and are interested in this topic, please reopen this bug and fix it, or produce a new kernel branch, perhaps under Update 7. Don't you understand that a lot of people have problems with it!?
Anybody following this thread and still seeking a solution for it: have a look at the related thread in #615543. Although some people suggest these two bug reports might not be related, upgrading the SmartArray firmware might help to fix the server hangs, maybe also for you. Please read the other bug report carefully, particularly the most recent posts.
Hi guys and special thanks to chris and Richard Godbee from bug #615543! Finally, I solved this problem, the very new firmware helped me both with HP Smart Array 5i (6 servers) and HP Smart Array 6i (5 servers). Almost a whole week without any problems! And I have no problems with HP Smart Array P410i (but I have only one such server).
(In reply to comment #77) > Anybody following this thread and still seeking a solution for it: have a look > at the related thread in #615543. Although some people suggest these two bug > reports might not be related, upgrading the SmartArray firmware might help to > fix the server hangs, maybe also for you. Please read the other bug report > carefully, particularly the most recent posts. BZ #615543 is inaccessible. Still having this on 2.6.18-238.12.1.0.1.el5 #1 SMP Tue May 31 14:51:07 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux and would like to clear it up. thanks.
jbroome: Assuming you have an HP SmartArray storage controller of some sort, update the firmware. This seems to fix it on that hardware; see the comments above. Otherwise, did you try the other suggestions in this thread? What are your exact hardware details?
If this is CLOSED NOTABUG, what is it? IBM x3850 M2 here. LSI-based ServeRAID MR10 SAS controllers, RHEL 5.7, kernel 2.6.18-238.12.1.el5. Our firmware was upgraded to the latest and greatest on the RAID controller and other components back in February, when we had a series of occurrences. It did not seem to fix the issues, as we had 2 successive incidents in the last month...
The system hosts an Oracle DB (11gR2) with SAN connections; the OS is on the local SAS disks in RAID 1 on the LSI RAID controller. The system suddenly becomes CPU active, clocks, and hangs. System load goes over 200 (and this is just a 24-way server). Syslog always shows either kjournald or kswapd struggling and hitting hung_task_timeout; see below: INFO: task kjournald:21595 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. kjournald D ffff8108f9a24080 0 21595 1129 21601 21593 (L-TLB) ffff812004087dd0 0000000000000046 ffff81202a21adf0 ffff810926798730 0000000000000000 000000000000000a ffff81202519a820 ffff8108f9a24080 000799dedbffe81d 000000000000290b ffff81202519aa08 000000028008dc2c Call Trace: [<ffffffff880335cf>] :jbd:journal_commit_transaction+0x16d/0x1066 [<ffffffff800a28ec>] autoremove_wake_function+0x0/0x2e [<ffffffff8004b425>] try_to_del_timer_sync+0x7f/0x88 [<ffffffff880375d3>] :jbd:kjournald+0xc1/0x213 [<ffffffff800a28ec>] autoremove_wake_function+0x0/0x2e [<ffffffff800a26d4>] keventd_create_kthread+0x0/0xc4 [<ffffffff88037512>] :jbd:kjournald+0x0/0x213 [<ffffffff800a26d4>] keventd_create_kthread+0x0/0xc4 [<ffffffff80032b26>] kthread+0xfe/0x132 [<ffffffff8005dfb1>] child_rip+0xa/0x11 [<ffffffff800a26d4>] keventd_create_kthread+0x0/0xc4 [<ffffffff80032a28>] kthread+0x0/0x132 [<ffffffff8005dfa7>] child_rip+0x0/0x11 INFO: task orarootagent.bi:7219 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. orarootagent. 
D ffff8120191d9080 0 7219 1 7220 7217 (NOTLB) ffff8109da067be8 0000000000000086 0000000000000000 ffffffff880317ae 0000000000000000 000000000000000a ffff811fed204080 ffff8120191d9080 000799df8c0b52eb 000000000000d287 ffff811fed204268 0000000c00000000 Call Trace: [<ffffffff880317ae>] :jbd:journal_stop+0x1f3/0x1ff [<ffffffff8804ffb3>] :ext3:ext3_ordered_write_end+0xd7/0x116 [<ffffffff88032002>] :jbd:start_this_handle+0x2e5/0x36c [<ffffffff800a28ec>] autoremove_wake_function+0x0/0x2e [<ffffffff88032152>] :jbd:journal_start+0xc9/0x100 [<ffffffff88050c73>] :ext3:ext3_dirty_inode+0x28/0x7b [<ffffffff80013d6c>] __mark_inode_dirty+0x29/0x16e [<ffffffff8001668a>] __generic_file_aio_write_nolock+0x28a/0x3b6 [<ffffffff8002197e>] generic_file_aio_write+0x65/0xc1 [<ffffffff8804c1b6>] :ext3:ext3_file_write+0x16/0x91 [<ffffffff800183e4>] do_sync_write+0xc7/0x104 [<ffffffff800a28ec>] autoremove_wake_function+0x0/0x2e [<ffffffff80016b71>] vfs_write+0xce/0x174 [<ffffffff8001743e>] sys_write+0x45/0x6e [<ffffffff8005d28d>] tracesys+0xd5/0xe0 INFO: task diskmon.bin:7242 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. diskmon.bin D ffff811ffa917820 0 7242 1 7243 7240 (NOTLB) ffff8109d95ddbe8 0000000000000086 0000000000000000 ffffffff880317ae 0000000000000000 000000000000000a ffff8120260e2100 ffff811ffa917820 000799df8c103a91 0000000000008b00 ffff8120260e22e8 0000000c00000000 Call Trace: [<ffffffff880317ae>] :jbd:journal_stop+0x1f3/0x1ff [<ffffffff8804ffb3>] :ext3:ext3_ordered_write_end+0xd7/0x116 Should we just go to SAN Boot to avoid this malaise if the OS or the Firmware remains possibly faulty?
Two things: 1) Please re-read comment #43, comment #46 and comment #47. 2) Bugzilla is not a support tool. Please read the message on the bugzilla front page for more information on how to get your issues resolved (http://bugzilla.redhat.com). Cheers, Jeff
(In reply to comment #83) > Two things: > > 1) Please re-read comment #43, comment #46 and comment #47. > 2) Bugzilla is not a support tool. Please read the message on the bugzilla > front page for more information on how to get your issues resolved > (http://bugzilla.redhat.com). > > Cheers, > Jeff Could someone open Bug #615543 (or add me to the CC) so that I can comment in the proper bug? Of all of the bugs mentioned in this thread (and I agree that this bug should be closed) this is the only open bug, which may be why people keep hammering on it...
So what is really so secret about #615543? What is it all about? In our case, we've had 3 different x3850s with the LSI MegaRAID chip and all have exhibited the same issue. We last upgraded the firmware in February 2011, but the same issue persisted. We just upgraded to IBM's latest and greatest, which officially addresses VMware ESXi 4.1 "hangs"; IBM makes no mention of a similar Linux issue. If the servers hang again, we'd likely just move over to SAN boot, skipping the LSI-based RAID disks as the OS boot disk.
Please can you specify which version? Thanks Alex (In reply to comment #78) > Hi guys and special thanks to chris and Richard Godbee from bug #615543! > > Finally, I solved this problem, the very new firmware helped me both with HP > Smart > Array 5i (6 servers) and HP Smart Array 6i (5 servers). Almost a whole week > without any problems! > > And I have no problems with HP Smart Array P410i (but I have only one such > server).
HP DL380 G5 Smart Array P400 Firmware Version: 7.22 2.6.18-274.3.1.0.1.el5 I haven't been able to access the BZs noted in comment 46, so I don't know what they're about. I assume they say to upgrade the Smart Array firmware; however, as seen above, I'm already on the latest version. I've had this server hang three times in three weeks with the "blocked for more than 120 seconds" message on the console, requiring a cold boot. This server doesn't get hammered, so I doubt it's due to a busy system. I'll open a case with HP for this, but I wanted to say on the record here that this doesn't appear to be fixed.
I have both the latest IBM and latest/old HP hardware (G1, G5 and G7). HP uses Smart Array, IBM an LSI Logic array; the issue appears on all servers. I'm also running the latest firmware. Kernels confirmed to be affected in my tests: 2.6.18-274.17.1.el5 (RHEL5.7 - latest as of this writing) 2.6.18-274.7.1.el5 (RHEL5.7) 2.6.18-164.15.1.el5 (RHEL5.4 - last kernel released) The IO scheduler was the default "cfq". I believe I was able to reproduce this issue with noop, but since I did so many tests, I could be mistaken. I'm running the noop test now. I will update this BZ if "noop" also has the same problems. I submitted the vmcore to Red Hat. I'm surprised to see that this issue has existed for almost 2 years on "enterprise" systems. I will also try to recompile the kernel from source minus 2 patches by Amerigo Wang. But since I see this issue in 2.6.18-164.15.1.el5, I doubt it will help. If anyone has found the solution, please respond to this thread. Jeffrey, if you believe this is not a bug, please provide the details to the rest of the world. References to closed/private bugs don't help if you have no access.
I have the same issue on virtual and physical machines: 2.6.18-238.9.1.el5 (RH 5.6) 2.6.18-194.11.4.el5xen (RH 5.8) I've read that to fix this problem
You are not authorized to access bug #615543. The problem is the same here, with SATA disks. I would like to read the other ticket, which has the solution, but I have no access rights. Why is a critical/blocker ticket closed to the world?
BZ 615543, referenced above, was about a specific HP Smart Array controller bug that has been addressed via firmware updates and a driver workaround (commits 07d0c38e7d84f911c72058a124c7f17b3c779a6 and 1ddd5049545e0aa1a0ed19bca4d9c9c3ce1ac8a2). Both driver and firmware fixes were released in the June 2011 time frame in the then current RHEL releases. This BZ's original problem was fixed by upgrading the hard disk firmware. The "blocked for more than 120 seconds" message literally has 100s if not 1000s of causes, the above being just 2 of them. If the system only throws these occasionally, then it's very likely a temporary storage congestion issue outside of the host. If it's a hard hang, then to address that, or anything other than the above three causes, a support case should be opened.
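To judge whether a host sees these warnings only occasionally or constantly, a rough per-task count from the system log can help. This is a minimal sketch: the function name is made up, and the log path in the usage comment is an assumption about where kernel messages land on the host.

```shell
#!/bin/sh
# Count "blocked for more than N seconds" warnings per task name in a log.
# Output: one line per task, "COUNT TASKNAME", most frequent first.
# Usage: hung_task_summary /var/log/messages
hung_task_summary() {
    sed -n 's/.*INFO: task \(.*\):[0-9][0-9]* blocked for more than [0-9][0-9]* seconds.*/\1/p' "$1" \
        | sort | uniq -c | sort -rn
}
```

A handful of hits clustered around a single time window points more towards transient congestion, while the same tasks recurring until a reboot is the kind of hard hang that warrants a support case with a vmcore.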
We are seeing the same problem on 5.7 with NFS. INFO: task perl:3276 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. perl D ffff81000101d7a0 0 3276 3268 (NOTLB) ffff81015deb39f8 0000000000000082 ffff810199c44880 ffff810199c44880 0000000010000042 000000000000000a ffff81019ab987e0 ffff81019bc9a100 000181ba4306b3a6 000000000003139e ffff81019ab989c8 000000030000000a Call Trace: [<ffffffff8006ec8f>] do_gettimeofday+0x40/0x90 [<ffffffff891be381>] :nfs:nfs_wait_bit_uninterruptible+0x0/0xd [<ffffffff800637ce>] io_schedule+0x3f/0x67 [<ffffffff891be38a>] :nfs:nfs_wait_bit_uninterruptible+0x9/0xd [<ffffffff800639fa>] __wait_on_bit+0x40/0x6e [<ffffffff891be381>] :nfs:nfs_wait_bit_uninterruptible+0x0/0xd [<ffffffff80063a94>] out_of_line_wait_on_bit+0x6c/0x78 [<ffffffff800a2e80>] wake_bit_function+0x0/0x23 [<ffffffff891c28ea>] :nfs:nfs_update_request+0x90/0x340 [<ffffffff891c3655>] :nfs:nfs_updatepage+0x155/0x1ec [<ffffffff891b8f57>] :nfs:nfs_write_end+0x6b/0x92 [<ffffffff8000ff32>] generic_file_buffered_write+0x1cc/0x675 [<ffffffff8001678a>] __generic_file_aio_write_nolock+0x369/0x3b6 [<ffffffff8005519c>] sk_reset_timer+0xf/0x19 [<ffffffff8005458d>] tcp_connect+0x33f/0x348 [<ffffffff80234d94>] secure_tcp_sequence_number+0x38/0x3d [<ffffffff800219bf>] generic_file_aio_write+0x67/0xc3 [<ffffffff891b9639>] :nfs:nfs_file_write+0xd8/0x14f [<ffffffff80018415>] do_sync_write+0xc7/0x104 [<ffffffff800a2e52>] autoremove_wake_function+0x0/0x2e [<ffffffff8022dd8d>] sys_connect+0x7e/0xae [<ffffffff80016b92>] vfs_write+0xce/0x174 [<ffffffff8001745b>] sys_write+0x45/0x6e [<ffffffff8005d28d>] tracesys+0xd5/0xe0 INFO: task perl:3276 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
perl D ffff81000101d7a0 0 3276 3268 (NOTLB) ffff81015deb39f8 0000000000000082 ffff810199c44880 ffff810199c44880 0000000010000042 000000000000000a ffff81019ab987e0 ffff81019bc9a100 000181ba4306b3a6 000000000003139e ffff81019ab989c8 000000030000000a Call Trace: [<ffffffff8006ec8f>] do_gettimeofday+0x40/0x90 [<ffffffff891be381>] :nfs:nfs_wait_bit_uninterruptible+0x0/0xd [<ffffffff800637ce>] io_schedule+0x3f/0x67 [<ffffffff891be38a>] :nfs:nfs_wait_bit_uninterruptible+0x9/0xd [<ffffffff800639fa>] __wait_on_bit+0x40/0x6e [<ffffffff891be381>] :nfs:nfs_wait_bit_uninterruptible+0x0/0xd [<ffffffff80063a94>] out_of_line_wait_on_bit+0x6c/0x78 [<ffffffff800a2e80>] wake_bit_function+0x0/0x23 [<ffffffff891c28ea>] :nfs:nfs_update_request+0x90/0x340 [<ffffffff891c3655>] :nfs:nfs_updatepage+0x155/0x1ec [<ffffffff891b8f57>] :nfs:nfs_write_end+0x6b/0x92 [<ffffffff8000ff32>] generic_file_buffered_write+0x1cc/0x675 [<ffffffff8001678a>] __generic_file_aio_write_nolock+0x369/0x3b6 [<ffffffff8005519c>] sk_reset_timer+0xf/0x19 [<ffffffff8005458d>] tcp_connect+0x33f/0x348 [<ffffffff80234d94>] secure_tcp_sequence_number+0x38/0x3d [<ffffffff800219bf>] generic_file_aio_write+0x67/0xc3 [<ffffffff891b9639>] :nfs:nfs_file_write+0xd8/0x14f [<ffffffff80018415>] do_sync_write+0xc7/0x104 [<ffffffff800a2e52>] autoremove_wake_function+0x0/0x2e [<ffffffff8022dd8d>] sys_connect+0x7e/0xae [<ffffffff80016b92>] vfs_write+0xce/0x174 [<ffffffff8001745b>] sys_write+0x45/0x6e [<ffffffff8005d28d>] tracesys+0xd5/0xe0
(In reply to comment #92) > We are seeing the same problem on 5.7 with NFS. No, you aren't. Please file a support ticket for your problem so it can be categorized appropriately.
Hi, sorry, but I see a lot of people with the same problem and no answers from Red Hat. I'm a Red Hat fan, but reading this bug, and having a similar issue with a Fujitsu server running CentOS 6.3, I wonder: what about open and free software, and Software Libre? Doesn't Red Hat remember that a lot of folks report Fedora bugs on bugzilla again and again, for better quality in future RHEL releases, including reports about CentOS or Scientific Linux? @Jeffrey Moyer I know that bugzilla.redhat.com is not a support channel for non-Red Hat products like CentOS, but has Red Hat forgotten that RHEL is built from Fedora, a community project? Not answering this bug is a bad attitude on Red Hat's part.
And, sorry, but if this is "NOTABUG", what is it?
Guys, At least part of this is due to a widespread LSI SAS expander bug, NOT a controller bug (but make sure you have installed the latest MegaRAID firmware from LSI's website, NOT the vendor packages, as most of them are 2 versions out of date). Updated firmware for LSI's SAS switches (expanders) can be grabbed from their website. http://www.lsi.com/support/Pages/Download-Results.aspx?productcode=P00048&assettype=0&component=Storage%20Component&productfamily=SAS%20Switch&productname=LSI%20SAS6160%20Switch Make sure you are running the P12 switch firmware. It contains a specific fix for hangs caused by hard drives momentarily disconnecting themselves from the bus. You will need megacli to apply updates. This can be obtained from a number of locations. Applying the P12 firmware update stopped all hangs on our SAS-based boxes.