Bug 605444

Summary: Fujitsu Primergy RX200 S5 Server hang w/LSI Megaraid HBA and Fujitsu SAS drives
Product: Red Hat Enterprise Linux 5
Reporter: bosko.radivojevic
Component: kernel
Assignee: Jeff Moyer <jmoyer>
Status: CLOSED NOTABUG
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high
Priority: medium
Version: 5.5
CC: ajb2, aloga, andrew_mcedwards, anthony.cheung, anthony.cheung, bboegler, bubrown, captainmrgn, chricki, darrin, dev, dkerr, dsantana, francois, gao, georgi.georgiev, gil.ben-nun, goga78, igeorgex, imusayev, james, jbroome, jhunt, jmoyer, joshr-fedora, l.flis, mdisanzo, msw, ncap, nospam, paszczak000, pgervase, pveiga, richard.a.karhuse, samuel, scooter, simon.reber, sjhstorm, tonyellis, troels, willi.fehler, xset1980, yamato_yoji, zioalex
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2011-01-31 20:07:33 UTC

Attachments:

Description	Flags
Look of "nmon" and "top" when the system is refusing new logins and new programs can't be started	none
Look of "nmon" and "top" when the system is happily accepting new logins and new programs can be started	none
Crash Data Log	none
Crash Log	none

Description bosko.radivojevic 2010-06-17 23:35:42 UTC

Description of problem:

Server (Fujitsu Primergy RX200 S5, 2x Intel Xeon E5520, 8GB RAM, RAID controller based on LSI MegaRAID 1078) is hanging from time to time. I had one complete lockup today.

Version-Release number of selected component (if applicable):
Kernel version is 2.6.18-194.3.1.el5. Driver is megaraid_sas

Additional info:
INFO: task kjournald:1141 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kjournald     D ffffffff80150462     0  1141    139          1171  1072 (L-TLB)
 ffff81013da4fdd0 0000000000000046 0000000000000100 0000000000000000
 0000000000000000 000000000000000a ffff81023fc25040 ffff810143a027e0
 00000e6850058566 0000000000001064 ffff81023fc25228 0000000400000000
Call Trace:
 [<ffffffff880335cf>] :jbd:journal_commit_transaction+0x16d/0x1066
 [<ffffffff800a0abe>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8004b36f>] try_to_del_timer_sync+0x7f/0x88
 [<ffffffff880375d3>] :jbd:kjournald+0xc1/0x213
 [<ffffffff800a0abe>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88037512>] :jbd:kjournald+0x0/0x213
 [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032894>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032796>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

INFO: task snmpd:4645 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
snmpd         D ffffffff80150462     0  4645      1          4660  4631 (NOTLB)
 ffff81023d9f1b88 0000000000000086 0000000000000000 ffffffff8022c9c0
 000000000000001c 000000000000000a ffff81023ec6d0c0 ffff810143b0e7e0
 00000e696bb83c73 000000000006f3d3 ffff81023ec6d2a8 0000000880250085
Call Trace:
 [<ffffffff8022c9c0>] put_cmsg+0x8c/0xc6
 [<ffffffff88032002>] :jbd:start_this_handle+0x2e5/0x36c
 [<ffffffff800a0abe>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88032152>] :jbd:journal_start+0xc9/0x100
 [<ffffffff88050c72>] :ext3:ext3_dirty_inode+0x28/0x7b
 [<ffffffff80013c64>] __mark_inode_dirty+0x29/0x16e
 [<ffffffff8000c41b>] do_generic_mapping_read+0x342/0x354
 [<ffffffff8000d0cc>] file_read_actor+0x0/0x159
 [<ffffffff8000c579>] __generic_file_aio_read+0x14c/0x198
 [<ffffffff80016da5>] generic_file_aio_read+0x34/0x39
 [<ffffffff8000cdf5>] do_sync_read+0xc7/0x104
 [<ffffffff800a0abe>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8000e14f>] do_mmap_pgoff+0x66c/0x7d7
 [<ffffffff8000b681>] vfs_read+0xcb/0x171
 [<ffffffff80011bd2>] sys_read+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

Comment 1 Amelia Nilsson 2010-07-19 13:05:09 UTC

I also have this problem with my HP DL360 G2 using a SmartArray 5i. It began when I upgraded to kernel version 2.6.18-194.3.1.el5, using the cciss driver.

I have similar output on the screen, and the server freezes completely, which unfortunately leaves me without any log entries in messages or dmesg to paste here. I'll get back to you if I manage to get any logs the next time it occurs.

Comment 2 Simon Reber 2010-07-21 13:31:56 UTC

I've seen the same issue on an HP DL380 G5 with 32GB RAM and SAN-attached storage, using the cciss driver for the internal disk mirror.

uname -a
Linux hostname 2.6.18-194.3.1.el5 #1 SMP Sun May 2 04:17:42 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

In cacti, I have seen massive load on the server during this time, but why would it block those processes for such a long time?

The error occurred for the first time last night; it did not show up on the system with RHEL 5.4.

Jul 20 23:18:49 hostname kernel: INFO: task cmahostd:27230 blocked for more than 120 seconds.
Jul 20 23:18:49 hostname kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 20 23:18:49 hostname kernel: cmahostd      D ffff81079ce4a080     0 27230      1         27232 27228 (NOTLB)
Jul 20 23:18:49 hostname kernel:  ffff810770bcbdd8 0000000000000082 00000000ffffffff 0000010100000000
Jul 20 23:18:49 hostname kernel:  0000000700b118fa 000000000000000a ffff8107716547a0 ffff81079ce4a080
Jul 20 23:18:49 hostname kernel:  0009ab4d2d148ee8 00000000009216ca ffff810771654988 000000028002c9e4
Jul 20 23:18:49 hostname kernel: Call Trace:
Jul 20 23:18:49 hostname kernel:  [<ffffffff8000ea46>] link_path_walk+0xa6/0xb2
Jul 20 23:18:49 hostname kernel:  [<ffffffff800646ac>] __down_read+0x7a/0x92
Jul 20 23:18:49 hostname kernel:  [<ffffffff800c30b1>] access_process_vm+0x47/0x18d
Jul 20 23:18:49 hostname kernel:  [<ffffffff8000f2d0>] __alloc_pages+0x78/0x308
Jul 20 23:18:49 hostname kernel:  [<ffffffff8010627f>] proc_pid_cmdline+0x69/0xf4
Jul 20 23:18:49 hostname kernel:  [<ffffffff8010678b>] proc_info_read+0x5f/0xb9
Jul 20 23:18:49 hostname kernel:  [<ffffffff8000b681>] vfs_read+0xcb/0x171
Jul 20 23:18:49 hostname kernel:  [<ffffffff80011bd2>] sys_read+0x45/0x6e
Jul 20 23:18:49 hostname kernel:  [<ffffffff8006149d>] sysenter_do_call+0x1e/0x76
Jul 20 23:18:49 hostname kernel:
Jul 20 23:18:49 hostname kernel: INFO: task inosrv:32294 blocked for more than 120 seconds.
Jul 20 23:18:49 hostname kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 20 23:18:49 hostname kernel: inosrv        D ffff81025c0d47a0     0 32294      1         32435 32293 (NOTLB)
Jul 20 23:18:49 hostname kernel:  ffff8105069a9e18 0000000000000086 0000000000007e26 0000000042876e00
Jul 20 23:18:49 hostname kernel:  00000000ffffffda 000000000000000a ffff8104dfebc080 ffff81025c0d47a0
Jul 20 23:18:49 hostname kernel:  0009ab4df506e05e 0000000000000f0e ffff8104dfebc268 0000000000000000
Jul 20 23:18:49 hostname kernel: Call Trace:
Jul 20 23:18:49 hostname kernel:  [<ffffffff800646ac>] __down_read+0x7a/0x92
Jul 20 23:18:49 hostname kernel:  [<ffffffff80066ad0>] do_page_fault+0x446/0x874
Jul 20 23:18:49 hostname kernel:  [<ffffffff8005dde9>] error_exit+0x0/0x84
Jul 20 23:18:49 hostname kernel:
Jul 20 23:18:49 hostname kernel: INFO: task inosrv:32468 blocked for more than 120 seconds.
Jul 20 23:18:49 hostname kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 20 23:18:49 hostname kernel: inosrv        D ffff81000100caa0     0 32468      1         32469 32467 (NOTLB)
Jul 20 23:18:49 hostname kernel:  ffff810382003d28 0000000000000086 ffff810382003c98 ffffffff8008ca8c
Jul 20 23:18:49 hostname kernel:  0009ab4b17e3fd90 0000000000000009 ffff81056d4277e0 ffff81082ff18100
Jul 20 23:18:49 hostname kernel:  0009ab4b17e2e377 000000000000a462 ffff81056d4279c8 0000000100000003
Jul 20 23:18:49 hostname kernel: Call Trace:
Jul 20 23:18:49 hostname kernel:  [<ffffffff8008ca8c>] __activate_task+0x56/0x6d
Jul 20 23:18:49 hostname kernel:  [<ffffffff8859083d>] :vxfs:vx_svar_sleep_unlock+0x53/0x68
Jul 20 23:18:49 hostname kernel:  [<ffffffff8008d087>] default_wake_function+0x0/0xe
Jul 20 23:18:49 hostname kernel:  [<ffffffff8857c9a8>] :vxfs:vx_rwsleep_rec_lock+0x74/0xac
Jul 20 23:18:49 hostname kernel:  [<ffffffff88558693>] :vxfs:vx_recsmp_rangelock+0xf/0x1d
Jul 20 23:18:49 hostname kernel:  [<ffffffff885714ac>] :vxfs:vx_irwlock+0x37/0x41
Jul 20 23:18:49 hostname kernel:  [<ffffffff885b520f>] :vxfs:vx_vop_read+0x101/0x1ca
Jul 20 23:18:49 hostname kernel:  [<ffffffff885b7ab8>] :vxfs:vx_read+0x199/0x1e9
Jul 20 23:18:49 hostname kernel:  [<ffffffff8000b681>] vfs_read+0xcb/0x171
Jul 20 23:18:49 hostname kernel:  [<ffffffff800135f7>] sys_pread64+0x50/0x70
Jul 20 23:18:49 hostname kernel:  [<ffffffff8005d229>] tracesys+0x71/0xe0
Jul 20 23:18:49 hostname kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Jul 20 23:18:49 hostname kernel:
Jul 20 23:18:49 hostname kernel: INFO: task inosrv:20591 blocked for more than 120 seconds.
Jul 20 23:18:49 hostname kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 20 23:18:49 hostname kernel: inosrv        D ffff810001025e20     0 20591      1         20645 20477 (NOTLB)
Jul 20 23:18:49 hostname kernel:  ffff8105f5bfff08 0000000000000086 0000000000000000 ffff81042e6927a0
Jul 20 23:18:49 hostname kernel:  ffffffff8008d087 000000000000000a ffff81042e6927a0 ffff81082fe29080
Jul 20 23:18:49 hostname kernel:  0009ab4b5349ae90 0000000000001b95 ffff81042e692988 0000000400402040
Jul 20 23:18:49 hostname kernel: Call Trace:
Jul 20 23:18:50 hostname kernel:  [<ffffffff8008d087>] default_wake_function+0x0/0xe
Jul 20 23:18:50 hostname kernel:  [<ffffffff80064613>] __down_write_nested+0x7a/0x92
Jul 20 23:18:50 hostname kernel:  [<ffffffff800161ee>] sys_munmap+0x32/0x59
Jul 20 23:18:50 hostname kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Jul 20 23:18:50 hostname kernel:

Comment 3 Sidney Sedlak 2010-08-31 10:28:40 UTC

Hi,

same problem here with RHEL 5.5. Server is IBM xServer 3850M2, 4x QuadXeon, 128GB RAM. I get those messages under heavy load (oracle in this case).

# uname -a
Linux hostname 2.6.18-194.el5 #1 SMP Tue Mar 16 21:52:39 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

Is the temporary solution for now to downgrade to 2.6.18-164?

INFO: task oracle:26522 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
oracle        D ffff8100010738a0     0 26522  26503                     (NOTLB)
 ffff811b36cb3d08 0000000000000086 0000000000100000 ffff810001084ef8
 ffff811b36cb3c80 0000000000000007 ffff811b2a25e820 ffff81202fe99100
 000551fb9abc78d4 00000000000c315b ffff811b2a25ea08 0000000d9ab5c61e
Call Trace:
 [<ffffffff80064c6f>] __mutex_lock_slowpath+0x60/0x9b
 [<ffffffff80064cb9>] .text.lock.mutex+0xf/0x14
 [<ffffffff80064c06>] __mutex_unlock_slowpath+0x2a/0x33
 [<ffffffff8002174c>] generic_file_aio_write+0x4e/0xc1
 [<ffffffff884617b1>] :nfs:nfs_file_write+0xd8/0x14f
 [<ffffffff80018266>] do_sync_write+0xc7/0x104
 [<ffffffff800a1ba4>] autoremove_wake_function+0x0/0x2e
 [<ffffffff80016a49>] vfs_write+0xce/0x174
 [<ffffffff80044209>] sys_pwrite64+0x50/0x70
 [<ffffffff8005e229>] tracesys+0x71/0xe0
 [<ffffffff8005e28d>] tracesys+0xd5/0xe0

Comment 4 Troels Arvin 2010-09-01 12:55:47 UTC

Did a stress test with "fio", using the following job file:
-------------------------------------
[iometer-file-access-server] 
bssplit=512/10:1k/5:2k/5:4k/60:8k/2:16k/4:32k/4:64k/10 
rw=randrw 
rwmixread=70 
direct=1 
size=10g 
ioengine=libaio 
iodepth=256 
write_bw_log 
write_lat_log 
numjobs=6
-------------------------------------
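
To get a feel for the I/O mix this job generates, the bssplit line can be decoded with a few lines of Python. This is only a sketch and not part of the original test: the helper names are made up for illustration, and it assumes fio's default convention that "k" means 1024 bytes.

```python
def parse_size(text):
    """Convert a size such as '512', '4k' or '64k' to bytes (assuming k = 1024)."""
    text = text.strip().lower()
    if text.endswith("k"):
        return int(text[:-1]) * 1024
    return int(text)


def parse_bssplit(spec):
    """Split 'size/percent:size/percent:...' into (bytes, percent) pairs."""
    pairs = []
    for entry in spec.split(":"):
        size, percent = entry.split("/")
        pairs.append((parse_size(size), int(percent)))
    return pairs


def mean_block_size(pairs):
    """Percentage-weighted mean block size in bytes."""
    total_percent = sum(percent for _size, percent in pairs)
    return sum(size * percent for size, percent in pairs) / total_percent


# The bssplit value from the job file above.
SPEC = "512/10:1k/5:2k/5:4k/60:8k/2:16k/4:32k/4:64k/10"

if __name__ == "__main__":
    pairs = parse_bssplit(SPEC)
    print("percent total:", sum(p for _s, p in pairs))
    print("mean block size (bytes):", mean_block_size(pairs))
```

The percentages add up to 100 and the weighted mean comes out at roughly 11 KiB, so the job is dominated by small random requests (60% are 4k), at a queue depth of 256 per job.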

When testing on a file system living on a HP Smart Array P410i (product revision C, firmware v. 3.50), I got several of these in syslog:
-------------------------------------
Sep  1 14:19:36 oslo kernel: INFO: task cmaidad:18825 blocked for more than 120 seconds.
Sep  1 14:19:36 oslo kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep  1 14:19:36 oslo kernel: cmaidad       D ffffffff80150839     0 18825      1         18859 18747 (NOTLB)
Sep  1 14:19:36 oslo kernel:  ffff8100c9ed7978 0000000000003086 0000000000003086 ffffc20010092080
Sep  1 14:19:36 oslo kernel:  ffff81006e62c610 0000000000000007 ffff8100d710b040 ffff810037ca6100
Sep  1 14:19:36 oslo kernel:  000007f916a489b7 000000000000ccd2 ffff8100d710b228 0000000180022205
Sep  1 14:19:36 oslo kernel: Call Trace:
Sep  1 14:19:36 oslo kernel:  [<ffffffff8006e1db>] do_gettimeofday+0x40/0x90
Sep  1 14:19:36 oslo kernel:  [<ffffffff8001552b>] sync_buffer+0x0/0x3f
Sep  1 14:19:36 oslo kernel:  [<ffffffff800637ea>] io_schedule+0x3f/0x67
Sep  1 14:19:36 oslo kernel:  [<ffffffff80015566>] sync_buffer+0x3b/0x3f
Sep  1 14:19:36 oslo kernel:  [<ffffffff80063a16>] __wait_on_bit+0x40/0x6e
Sep  1 14:19:36 oslo kernel:  [<ffffffff8001552b>] sync_buffer+0x0/0x3f
Sep  1 14:19:36 oslo kernel:  [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
Sep  1 14:19:36 oslo kernel:  [<ffffffff800a0a06>] wake_bit_function+0x0/0x23
Sep  1 14:19:36 oslo kernel:  [<ffffffff80017549>] ll_rw_block+0x8c/0xab
Sep  1 14:19:36 oslo kernel:  [<ffffffff8000e8dd>] __block_prepare_write+0x363/0x3a6
Sep  1 14:19:36 oslo kernel:  [<ffffffff8804eceb>] :ext3:ext3_get_block+0x0/0xf7
Sep  1 14:19:36 oslo kernel:  [<ffffffff800e15cf>] block_write_begin+0x80/0xcf
Sep  1 14:19:36 oslo kernel:  [<ffffffff88050395>] :ext3:ext3_write_begin+0xe8/0x1cc
Sep  1 14:19:36 oslo kernel:  [<ffffffff8804eceb>] :ext3:ext3_get_block+0x0/0xf7
Sep  1 14:19:36 oslo kernel:  [<ffffffff8000fd7a>] generic_file_buffered_write+0x14b/0x675
Sep  1 14:19:36 oslo kernel:  [<ffffffff880317ae>] :jbd:journal_stop+0x1f3/0x1ff
Sep  1 14:19:36 oslo kernel:  [<ffffffff8001669e>] __generic_file_aio_write_nolock+0x369/0x3b6
Sep  1 14:19:36 oslo kernel:  [<ffffffff80021841>] generic_file_aio_write+0x65/0xc1
Sep  1 14:19:36 oslo kernel:  [<ffffffff8804c1b6>] :ext3:ext3_file_write+0x16/0x91
Sep  1 14:19:36 oslo kernel:  [<ffffffff800182c3>] do_sync_write+0xc7/0x104
Sep  1 14:19:36 oslo kernel:  [<ffffffff800a09d8>] autoremove_wake_function+0x0/0x2e
Sep  1 14:19:36 oslo kernel:  [<ffffffff80016aa6>] vfs_write+0xce/0x174
Sep  1 14:19:36 oslo kernel:  [<ffffffff80017373>] sys_write+0x45/0x6e
Sep  1 14:19:36 oslo kernel:  [<ffffffff8006149d>] sysenter_do_call+0x1e/0x76
-------------------------------------

Meanwhile, the server was largely unresponsive: Existing processes ran well, but attempts to log in to the server, or to start new programs on it, timed out.

When running the exact same test on a file system living on a fibre-channel connected XIV storage system, there were no problems.

In my case, the RHEL was 5.5 x86_64 with latest updates, using in-box drivers only. Server: HP Proliant DL380G6.

I think that the priority of this problem needs to be increased.

Comment 5 Troels Arvin 2010-09-01 14:09:32 UTC

A few more observations:

1. It seems that the problem doesn't always result in "task ... blocked for more than 120 seconds"; sometimes, the system is "just" unresponsive except for some of the already running processes (e.g., "top" will keep working, if started before the stress tests).

2. Changing the I/O scheduler for the involved devices from "cfq" to "noop" seems to make the problem go away.
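
For anyone who wants to try the scheduler change from observation 2, here is a minimal sketch of how the active scheduler can be read and switched through sysfs. It is not taken from my test setup: the function names and the device name "sda" are placeholders, and it assumes the usual layout where /sys/block/<device>/queue/scheduler lists the available schedulers with the active one in square brackets.

```python
from pathlib import Path


def parse_schedulers(text):
    """Parse a line such as 'noop anticipatory deadline [cfq]'.

    Returns (active, available); the active scheduler is the bracketed one.
    """
    active = None
    available = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    return active, available


def set_scheduler(device, name, sysfs_root="/sys/block"):
    """Switch the I/O scheduler of one block device; returns the previous one.

    Writing to sysfs needs root, and the change does not survive a reboot.
    """
    path = Path(sysfs_root) / device / "queue" / "scheduler"
    active, available = parse_schedulers(path.read_text())
    if name not in available:
        raise ValueError(f"{name!r} is not offered for {device}: {available}")
    if name != active:
        path.write_text(name)
    return active
```

A call such as set_scheduler("sda", "noop") would then switch the placeholder device "sda" from whatever is active to noop.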

Comment 6 Troels Arvin 2010-09-01 14:12:19 UTC

Created attachment 442419 [details]
Look of "nmon" and "top" when the system is refusing new logins and new programs can't be started

Comment 7 Troels Arvin 2010-09-01 14:24:00 UTC

Created attachment 442424 [details]
Look of "nmon" and "top" when the system is happily accepting new logins and new programs can be started

Comment 8 Brad Boegler 2010-11-08 02:57:10 UTC

I am seeing this issue as well. System is a Dell PE2950, 2x 5410, 12GB, perc 6, 4x 147GB SAS 15K raid-10. 

uname -a
Linux 2.6.18-194.17.4.el5 #1 SMP Mon Oct 25 15:50:53 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

The issue appears randomly every 12 - 36 hours. Once the system locks, it will eventually recover after 10-20 minutes, but during this time the machine is completely unresponsive. 

Nov  7 20:38:20 kernel: INFO: task kjournald:2520 blocked for more than 120 seconds.
Nov  7 20:38:20 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  7 20:38:20 kernel: kjournald     D ffff81032e56abc0     0  2520    137          2522  2518 (L-TLB)
Nov  7 20:38:20 kernel:  ffff81032b439dd0 0000000000000046 0000000000000200 0000000000000000
Nov  7 20:38:20 kernel:  0000000000000000 000000000000000a ffff81032f43f0c0 ffff8102ff802860
Nov  7 20:38:20 kernel:  0001e0273cd29582 0000000000000f5f ffff81032f43f2a8 0000000000000000
Nov  7 20:38:20 kernel: Call Trace:
Nov  7 20:38:20 kernel:  [<ffffffff880335cf>] :jbd:journal_commit_transaction+0x16d/0x1066
Nov  7 20:38:20 kernel:  [<ffffffff800a09d4>] autoremove_wake_function+0x0/0x2e
Nov  7 20:38:20 kernel:  [<ffffffff8004b132>] try_to_del_timer_sync+0x7f/0x88
Nov  7 20:38:20 kernel:  [<ffffffff880375d3>] :jbd:kjournald+0xc1/0x213
Nov  7 20:38:20 kernel:  [<ffffffff800a09d4>] autoremove_wake_function+0x0/0x2e
Nov  7 20:38:40 kernel:  [<ffffffff800a07bc>] keventd_create_kthread+0x0/0xc4
Nov  7 20:38:40 kernel:  [<ffffffff88037512>] :jbd:kjournald+0x0/0x213
Nov  7 20:38:40 kernel:  [<ffffffff800a07bc>] keventd_create_kthread+0x0/0xc4
Nov  7 20:38:40 kernel:  [<ffffffff8003290a>] kthread+0xfe/0x132
Nov  7 20:38:40 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Nov  7 20:38:40 kernel:  [<ffffffff800a07bc>] keventd_create_kthread+0x0/0xc4
Nov  7 20:38:40 kernel:  [<ffffffff8003280c>] kthread+0x0/0x132
Nov  7 20:38:40 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11

I'm testing whether the scheduler change from cfq to noop, as mentioned in comment 5, resolves the issue.

Comment 9 TiGeRNeT 2010-11-10 22:06:37 UTC

I have the same issue. My system has an AMD Athlon(tm) Dual Core Processor 4050e, 4 GB RAM, and kernel 2.6.18-194.26.1.el5.

My system is running very slowly.

Uptime: 18:30:03 up  3:25,  2 users,  load average: 7.54, 7.55, 8.10

dmesg:
INFO: task mysqld:7384 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mysqld        D ffff81011ccbb0f8     0  7384   2411          7397  7368 (NOTLB)
 ffff8100b7cd5af8 0000000000000086 ffff8100b7cd5b98 000000001c827080
 ffff810001004498 0000000000000009 ffff810108d1e080 ffff81011ccbb0c0
 000004c5cbe33e0d 000000000001e762 ffff810108d1e268 000000008008cab2
Call Trace:
 [<ffffffff80046edb>] try_to_wake_up+0x472/0x484
 [<ffffffff80063c6f>] __mutex_lock_slowpath+0x60/0x9b
 [<ffffffff80063cb9>] .text.lock.mutex+0xf/0x14
 [<ffffffff800a0b1f>] autoremove_wake_function+0x9/0x2e
 [<ffffffff88035883>] :jbd:__log_wait_for_space+0x51/0xaa
 [<ffffffff88032040>] :jbd:start_this_handle+0x323/0x36c
 [<ffffffff88032152>] :jbd:journal_start+0xc9/0x100
 [<ffffffff88050c72>] :ext3:ext3_dirty_inode+0x28/0x7b
 [<ffffffff80013cf0>] __mark_inode_dirty+0x29/0x16e
 [<ffffffff8000c4db>] do_generic_mapping_read+0x347/0x359
 [<ffffffff8000d18c>] file_read_actor+0x0/0x159
 [<ffffffff8000c639>] __generic_file_aio_read+0x14c/0x198
 [<ffffffff80016e31>] generic_file_aio_read+0x34/0x39
 [<ffffffff8000ceb5>] do_sync_read+0xc7/0x104
 [<ffffffff800a0b16>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800624b6>] __sched_text_start+0xf6/0xbd6
 [<ffffffff8003265d>] sys_faccessat+0x148/0x18d
 [<ffffffff8000b729>] vfs_read+0xcb/0x171
 [<ffffffff80011c3b>] sys_read+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

Comment 10 Josh R 2010-11-15 19:06:30 UTC

I'm seeing this on an HP DL380 G5 with a Smart Array P400 RAID card.

Like Brad, this seems to occur every 12-36 hours or so, always under heavy loads. Here's the stack trace I got:

INFO: task syslogd:3528 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
syslogd D ffff8103c5743ca8 0 3528 1 3531 1070 (NOTLB)

    ffff8109a982fd88 0000000000000082 0000000000000296 0000000000000003
    ffff8109a982fd18 000000000000000a ffff8109ace21820 ffff8109af9e5080
    0000f161cc9bd0a8 0000000000014389 ffff8109ace21a08 00000002c5743ca8

Call Trace:

    [<ffffffff88036d8a>] :jbd:log_wait_commit+0xa3/0xf5
    [<ffffffff800a0b16>] autoremove_wake_function+0x0/0x2e
    [<ffffffff8803178a>] :jbd:journal_stop+0x1cf/0x1ff
    [<ffffffff8002fc6e>] writeback_single_inode+0x1e9/0x328
    [<ffffffff800e0898>] do_readv_writev+0x26e/0x291
    [<ffffffff800f34e5>] sync_inode+0x24/0x33
    [<ffffffff8804c36d>] :ext3:ext3_sync_file+0xc9/0xdc
    [<ffffffff800501b6>] do_fsync+0x52/0xa4
    [<ffffffff800e111d>] do_fsync+0x23/0x36
    [<ffffffff8005d116>] system_call+0x7e/0x83

Comment 11 Jacob Hunt 2010-11-19 19:30:27 UTC

I am seeing this on an HP ProLiant DL380 G6 with an HP Smart Array P410i Controller and two HP Smart Array P812 Controllers.

kernel: INFO: task kjournald:3367 blocked for more than 120 seconds.
Nov 18 17:39:27 db-rdco1-e-r1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

kernel: kjournald D ffffffff80150462 0 3367 455 3369 3365 (L-TLB)
kernel: ffff8102aaf55dd0 0000000000000046 ffff81015919d8b0 ffff8100c785b4f0
kernel: 0000000000000000 000000000000000a ffff810c1c4370c0 ffff810252a60820
kernel: 0004a484de9a85e7 000000000000082c ffff810c1c4372a8 000000088008c871

kernel: Call Trace:
kernel: [<ffffffff880335cf>] :jbd:journal_commit_transaction+0x16d/0x1066
kernel: [<ffffffff800a0abe>] autoremove_wake_function+0x0/0x2e
kernel: [<ffffffff8004b36f>] try_to_del_timer_sync+0x7f/0x88
kernel: [<ffffffff880375d3>] :jbd:kjournald+0xc1/0x213
kernel: [<ffffffff800a0abe>] autoremove_wake_function+0x0/0x2e
kernel: [<ffffffff88037512>] :jbd:kjournald+0x0/0x213
kernel: [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
kernel: [<ffffffff80032894>] kthread+0xfe/0x132
kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
kernel: [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
kernel: [<ffffffff80032796>] kthread+0x0/0x132
kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11

Comment 12 Jacob Hunt 2010-11-23 15:16:46 UTC

Created attachment 462353 [details]
Crash Data Log

Comment 13 Jacob Hunt 2010-11-23 15:17:22 UTC

Created attachment 462354 [details]
Crash Log

Comment 14 Jacob Hunt 2010-11-23 15:17:52 UTC

I was able to recreate the panic, and I got a vmcore.  The following is the information I captured using these commands in crash.

sys > crash_data.log
bt -a >> crash_data.log
mod >> crash_data.log
log > crash_log.log

I have attached the two files crash_data.log and crash_log.log

Comment 15 Andreas Kumpf 2010-11-24 11:14:04 UTC

And yet another application with the same issue here:

Nov 23 18:02:37 HGALUX31 kernel: INFO: task dsmserv:10519 blocked for more than 120 seconds.
Nov 23 18:02:37 HGALUX31 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 23 18:02:37 HGALUX31 kernel: dsmserv       D ffff81000100daa0     0 10519      1         10567 10489 (NOTLB)
Nov 23 18:02:37 HGALUX31 kernel:  ffff810791a79b68 0000000000000086 0000000000000001 ffff810ee7c6e4e8
Nov 23 18:02:37 HGALUX31 kernel:  ffff81102d5b5968 0000000000000008 ffff810f016f7100 ffff81102ff12100
Nov 23 18:02:37 HGALUX31 kernel:  0008f386bc0abe14 0000000000005d2e ffff810f016f72e8 000000018807aa5a
Nov 23 18:02:37 HGALUX31 kernel: Call Trace:
Nov 23 18:02:37 HGALUX31 kernel:  [<ffffffff80064167>] wait_for_completion+0x79/0xa2
Nov 23 18:02:37 HGALUX31 kernel:  [<ffffffff8008e16d>] default_wake_function+0x0/0xe
Nov 23 18:02:37 HGALUX31 kernel:  [<ffffffff885ab5a0>] :lin_tape:lin_tape_execute_async+0xc5/0xfb
Nov 23 18:02:37 HGALUX31 kernel:  [<ffffffff885ab64c>] :lin_tape:tape_execute_scsi_command+0x76/0xa4
Nov 23 18:02:37 HGALUX31 kernel:  [<ffffffff885ae808>] :lin_tape:tape_send_scsi_io+0x192/0x1fb
Nov 23 18:02:37 HGALUX31 kernel:  [<ffffffff885ae908>] :lin_tape:tape_send_scsi_cmd+0x97/0x220
Nov 23 18:02:37 HGALUX31 kernel:  [<ffffffff885b17ba>] :lin_tape:tape_set_pos+0x286/0x38a
Nov 23 18:02:37 HGALUX31 kernel:  [<ffffffff8859d754>] :lin_tape:drvioc_exe+0xb1/0x107
Nov 23 18:02:37 HGALUX31 kernel:  [<ffffffff885a760f>] :lin_tape:lin_tape_drive_ioctl+0xe89/0x10af
Nov 23 18:02:37 HGALUX31 kernel:  [<ffffffff8859edea>] :lin_tape:stiocsetpos+0x0/0xd
Nov 23 18:02:37 HGALUX31 kernel:  [<ffffffff8859637b>] :lin_tape:lin_tape_ioctl_drive+0x1c4/0x1eb
Nov 23 18:02:37 HGALUX31 kernel:  [<ffffffff8859a1d9>] :lin_tape:lin_tape_ioctl+0x7b/0xba
Nov 23 18:02:37 HGALUX31 kernel:  [<ffffffff800424bd>] do_ioctl+0x55/0x6b
Nov 23 18:02:37 HGALUX31 kernel:  [<ffffffff800304d6>] vfs_ioctl+0x457/0x4b9
Nov 23 18:02:37 HGALUX31 kernel:  [<ffffffff8004cbb7>] sys_ioctl+0x59/0x78
Nov 23 18:02:37 HGALUX31 kernel:  [<ffffffff8005e28d>] tracesys+0xd5/0xe0

It seems to come up with heavy disk I/O, as the examples so far show, e.g. cp, mysql, oracle, kjournald, jbd.
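
To compare reports like these, it can help to tally which tasks the kernel flags as blocked. The following is a rough sketch, assuming the message format seen throughout this bug ("INFO: task NAME:PID blocked for more than N seconds."); the function names are only illustrative.

```python
import re
from collections import Counter

# Matches the hung-task line, with or without a syslog prefix in front of it.
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>.+?):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds\."
)


def hung_tasks(lines):
    """Yield (task name, pid, seconds) for every hung-task message found."""
    for line in lines:
        match = HUNG_TASK_RE.search(line)
        if match:
            yield match.group("comm"), int(match.group("pid")), int(match.group("secs"))


def tally(lines):
    """Count hung-task messages per task name."""
    return Counter(comm for comm, _pid, _secs in hung_tasks(lines))


if __name__ == "__main__":
    import sys

    # Usage: python tally_hung.py < /var/log/messages
    for comm, count in tally(sys.stdin).most_common():
        print(f"{count:6d}  {comm}")
```

Feeding it /var/log/messages from an affected machine gives a quick picture of whether the blocked tasks are mostly journaling threads or applications.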

Comment 16 Troels Arvin 2010-11-25 11:21:30 UTC

An enterprise distribution should not allow a problem like this to exist for so long, especially when it has been seen in the wild by so many different users.

Why is the priority of the issue still "low"? Can Red Hat's support system be used to push things? If so, I'll create a support ticket regarding this, although it seems silly as it is already so well documented.

Comment 17 Lukasz Flis 2010-11-25 13:47:37 UTC

Hello,

Unfortunately we can confirm the problem - we have seen it on more than **512 nodes** running in our cluster, as well as on a few service nodes and virtual machines.

The issue is present in the 2.6.18-194.26.1.el5 kernel and also in other versions.
I'm not sure when I saw these errors for the first time, but I suspect the problem was __introduced__ in the 194 kernel line.


Problem was noticed for the first time during the raid array rebuild process:

INFO: task md3_resync:8078 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
md3_resync    D ffff810123d7d898     0  8078     48                8077 (L-TLB)
 ffff810016d53d70 0000000000000046 0000000000000000 ffff810126ce920c
 ffff810126efa00c 000000000000000a ffff810127965080 ffff810123d7d860
 000345b561f23b3e 0000000000001b9f ffff810127965268 000000008008b4d7
Call Trace:
 [<ffffffff800a08fe>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8021af2b>] md_do_sync+0x1d8/0x833
 [<ffffffff8008ca47>] enqueue_task+0x41/0x56
 [<ffffffff8008cab2>] __activate_task+0x56/0x6d
 [<ffffffff8008c897>] dequeue_task+0x18/0x37
 [<ffffffff80062ff8>] thread_return+0x62/0xfe
 [<ffffffff800a0b16>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800a08fe>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8021b8ff>] md_thread+0xf8/0x10e
 [<ffffffff800a08fe>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8021b807>] md_thread+0x0/0x10e
 [<ffffffff8003290a>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a08fe>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003280c>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11


We managed to confirm the I/O starvation on all the schedulers:
 * noop
 * cfq
 * deadline
 * anticipatory

According to our logs, version 2.6.18_194.8.1.el5.x86_64 was the first one to show I/O timeouts.


Reports which can be seen on the web suggest that the issue is not RHEL 5.5 specific.

Best Regards
--
Lukasz Flis

Comment 18 Andreas Kumpf 2010-11-25 15:10:37 UTC

Hi,
as Lukasz said, this one is _not_ RHEL specific: I have seen it in several different forums across the net concerning different distros, so I think it's basically a kernel problem.
However, our problem occurred with kernel 2.6.18-194.el5.x86_64, somewhat earlier than Lukasz's.

Comment 19 Alvaro 2010-11-25 16:05:22 UTC

Hi all,

we have also been heavily impacted by this bug. We were using a .164 kernel version until CVE-2010-3081 was discovered, when we upgraded to a .194 kernel.

Since then, we have been getting errors similar to the reported ones with any > .194 kernel series on all of our nodes (400+; they are both Xen virtual machines and regular nodes).

We managed to get rid of this error by rebuilding the kernel with the following two patches removed:

From 2.6.18-188.el5: [misc] khungtaskd: set PF_NOFREEZE flag to fix suspend (Amerigo Wang).
From 2.6.18-177.el5: [sched] enable CONFIG_DETECT_HUNG_TASK support (Amerigo Wang).

Since then we have been running the kernel without any kind of problems. However, we haven't had the time to make further investigations to get more information.

Regards,
Alvaro Lopez.

Comment 20 bosko.radivojevic 2010-11-25 22:23:49 UTC

Hi all,

as far as I can see, this bug is happening only on platforms with SAS HDDs under heavy load. We solved the problem by upgrading the firmware on the HDDs (that was the suggestion from Fujitsu's support team).

We observed the problem only on RX200 S5 servers with Fujitsu's 10K 146GB HDDs - MBD2147RC. We upgraded the firmware on them from 5201 to 5203, and since then we have had no such problems.

Regards,
Bosko

Comment 21 Lukasz Flis 2010-11-25 23:51:29 UTC


I can't agree with Bosko here, the problem is also present on 
HP BL 2x220c (G5,G6) server blades with SATA disks.

We can also see it on Tyan GX21 serverboards with WD Raptor Drives.

Comment 22 Alvaro 2010-11-26 09:09:15 UTC

I have the same opinion as Lukasz. I cannot agree with Bosko, since we have seen the problems with both SAS and SATA disks, and also when accessing our GPFS filesystem.

Comment 23 Simon Reber 2010-11-26 11:10:29 UTC

I've checked the fixes that Alvaro mentioned.

The patch/issue: From 2.6.18-177.el5: [sched] enable CONFIG_DETECT_HUNG_TASK support (Amerigo Wang).
This is clearly the reason why a hung process now produces a message with a backtrace (see more information in the kernel documentation: http://lxr.linux.no/#linux+v2.6.32.26/lib/Kconfig.debug#L197)

The patch/issue: khungtaskd: set PF_NOFREEZE flag to fix suspend (Amerigo Wang).

I'm not really sure if this is related to this issue.

BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=550014

Description:
In RHEL5 kernel, kthread_run() will not set PF_NOFREEZE
for us, we have to set this flag by our own.

Fixes a suspend hang witnessed on some systems.

Upstream status:
Upstream doesn't need this.

Signed-off-by: WANG Cong <amwang>


diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 0d5a150..0fc6038 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -189,6 +189,10 @@ int proc_dohung_task_timeout_secs(struct ctl_table *table, int write,
 static int watchdog(void *dummy)
 {
        set_user_nice(current, 0);
+       /*
+        * kthread_run() doesn't help us here.
+        */
+       current->flags |= PF_NOFREEZE;

        for ( ; ; ) {
                unsigned long timeout = sysctl_hung_task_timeout_secs;


I'm really not a kernel specialist, but to me it seems that Red Hat added PF_NOFREEZE to khungtaskd, which from my point of view applies only to khungtaskd and has no impact on other processes running on the system.

For me this fix is also a little bit confusing, as I haven't seen it in the latest upstream kernel version (2.6.36).

Anyway, please note that khungtaskd only wakes up every 120 sec. and checks the tasks on the system that are in the TASK_UNINTERRUPTIBLE state. If khungtaskd finds such a task that was not switched by the scheduler even once within the 120 sec., it considers this a hung task and displays the task's stack dump.
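
The check described above can be modelled in a few lines. This is a simplified illustration in Python and not the kernel's code: the Task class and the scan function are invented for the sketch, and it assumes the detector compares each task's context-switch count between two consecutive scans.

```python
from dataclasses import dataclass


@dataclass
class Task:
    comm: str
    pid: int
    state: str                    # "D" stands for TASK_UNINTERRUPTIBLE
    switch_count: int             # context switches so far
    last_seen_switches: int = -1  # count recorded by the previous scan


def scan(tasks):
    """One detector pass; returns the tasks that would be reported as hung.

    A task is reported only if it is in state "D" and its switch count has
    not changed since the previous pass (one pass per timeout interval).
    """
    hung = []
    for task in tasks:
        if task.state != "D":
            continue
        if task.switch_count != task.last_seen_switches:
            # The task was scheduled at least once since the last pass.
            task.last_seen_switches = task.switch_count
            continue
        hung.append(task)
    return hung


if __name__ == "__main__":
    tasks = [Task("kjournald", 1141, "D", 10), Task("top", 2000, "R", 50)]
    print(scan(tasks))  # first pass only records the switch counts
    print(scan(tasks))  # kjournald has not moved since then, so it is reported
```

The point of the model is that the message says a task has been waiting in uninterruptible sleep for the whole interval; it does not say why the task is waiting.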

So from my point of view, are all reported issues within this bug more related to a overloaded server, than to a kernel bug.
To prove or disprove this, it would be helpful if we have information about the CPU (run queue) and I/O load of a machine, during the time the problem occurs.
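For collecting the run queue figures mentioned above, a small Python sketch could look like this. It assumes the usual five-field layout of /proc/loadavg (three load averages, runnable/total scheduling entities, last PID); the function names are made up for this example.

```python
def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into a dictionary."""
    fields = text.split()
    runnable, total = fields[3].split("/")
    return {
        "load1": float(fields[0]),   # 1-minute load average
        "load5": float(fields[1]),   # 5-minute load average
        "load15": float(fields[2]),  # 15-minute load average
        "runnable": int(runnable),   # currently runnable scheduling entities
        "total": int(total),         # all scheduling entities on the system
    }


def read_loadavg(path="/proc/loadavg"):
    """Read and parse the live values (Linux only)."""
    with open(path) as f:
        return parse_loadavg(f.read())
```

Note that on Linux the load average also counts tasks in uninterruptible sleep (state D), so a high load value by itself does not tell whether the CPUs or the disks are the bottleneck. It should be logged together with I/O statistics.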

On my side, we had the same issue once with an HP DL 360 G5. After a short time we replaced it with a much more powerful HP DL 360 G6 running two 6-core CPUs. The OS and the kernel remained on the same patch level, and since then the machine has not reported any problem again.

Cheers,
Simon

Comment 24 Troels Arvin 2010-11-26 11:19:26 UTC

Simon,

(In reply to comment #23)
> So from my point of view, are all reported issues within this bug more related
> to a overloaded server, than to a kernel bug.

Please see my comments from September, including attached files. I do not think that this issue has to do with servers which are simply overloaded. As I see it, a high load brings the server into a pathological state, I/O-wise, and I believe that there must be a bug somewhere.

Comment 25 Simon Reber 2010-11-26 11:38:46 UTC

Hi Arvin,
Thanks for the feedback.

(In reply to comment #24)
> Please see my comments from September, including attached files. I do not think
> that this issue has to do with servers which are simply overloaded. As I see
> it, a high load brings the server into a pathological state, I/O-wise, and I
> believe that there must be a bug somewhere.
I've checked your attached information - but could you please provide some information about the hardware, the system is running on (CPU's, etc.)?

Anyhow, looking at your information, the system which is unresponsive has a current CPU load of 78 (which is usually high, except you have a multi-multi core CPU system).
If I compare this with the second output, where the system is responsive, the CPU load is at 6.

It's definitely possible that a change related to disk I/O is also causing this problem, but it's definitely also related to load.

Comment 26 Troels Arvin 2010-11-26 12:00:50 UTC

Simon,

(In reply to comment #25)
> I've checked your attached information - but could you please provide some
> information about the hardware, the system is running on (CPU's, etc.)?

CPUs: Two quad-core Intel Xeon X5570 2.93GHz.
RAM: 96GB ECC DDR3, consisting of 12 RAM units.
Further info: http://h20195.www2.hp.com/v2/GetDocument.aspx?docname=c01709598&doctype=quickspecs&doclang=EN_GB&searchquery=&cc=dk&lc=da


> Anyhow, looking at your information, the system which is unresponsive has a
> current CPU load of 78 (which is usually high, except you have a multi-multi
> core CPU system).

Yes, and this is the problem :-)
Something triggers something which brings the system into a pathological state which includes symptoms like a high load and extremely long process-spawning times.

Comment 27 Troels Arvin 2010-11-26 14:25:45 UTC

Due to otherwise annoying events, I was lucky to get a window for re-testing the situation described in comment 4:

The running server's I/O scheduler was changed from noop to cfq and the relevant file system was un-mounted and mounted again, and the fio test was run. I couldn't re-create the problem.

I then rebooted the server, making sure that the default cfq I/O scheduler would be active from the beginning. When I re-ran the tests, the problem re-appeared quickly. After killing fio, things returned to normal, and I got a chance to look into syslog:

Nov 26 15:06:45 oslo kernel: kjournald     D ffff810c1f997860     0  1507    807          1532  1468 (L-TLB) 
Nov 26 15:06:45 oslo kernel:  ffff810c1f863cf0 0000000000000046 ffff8104b3a85000 ffff811818f26ac0 
Nov 26 15:06:45 oslo kernel:  ffff81181fb172c0 000000000000000a ffff81181fe70820 ffff810c1f997860 
Nov 26 15:06:45 oslo kernel:  000000d84b35fc1e 0000000000000da6 ffff81181fe70a08 000000061f549838 
Nov 26 15:06:45 oslo kernel: Call Trace: 
Nov 26 15:06:45 oslo kernel:  [<ffffffff8006e1d7>] do_gettimeofday+0x40/0x90 
Nov 26 15:06:45 oslo kernel:  [<ffffffff8005a7d6>] getnstimeofday+0x10/0x28 
Nov 26 15:06:45 oslo kernel:  [<ffffffff8001552b>] sync_buffer+0x0/0x3f 
Nov 26 15:06:45 oslo kernel:  [<ffffffff800637ea>] io_schedule+0x3f/0x67 
Nov 26 15:06:45 oslo kernel:  [<ffffffff80015566>] sync_buffer+0x3b/0x3f 
Nov 26 15:06:45 oslo kernel:  [<ffffffff80063a16>] __wait_on_bit+0x40/0x6e 
Nov 26 15:06:45 oslo kernel:  [<ffffffff8001552b>] sync_buffer+0x0/0x3f 
Nov 26 15:06:45 oslo kernel:  [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78 
Nov 26 15:06:45 oslo kernel:  [<ffffffff800a0b44>] wake_bit_function+0x0/0x23 
Nov 26 15:06:45 oslo kernel:  [<ffffffff880339a5>] :jbd:journal_commit_transaction+0x543/0x1066 
Nov 26 15:06:45 oslo kernel:  [<ffffffff8003da83>] lock_timer_base+0x1b/0x3c 
Nov 26 15:06:45 oslo kernel:  [<ffffffff8004b132>] try_to_del_timer_sync+0x7f/0x88 
Nov 26 15:06:45 oslo kernel:  [<ffffffff880375d3>] :jbd:kjournald+0xc1/0x213 
Nov 26 15:06:45 oslo kernel:  [<ffffffff800a0b16>] autoremove_wake_function+0x0/0x2e 
Nov 26 15:06:45 oslo kernel:  [<ffffffff88037512>] :jbd:kjournald+0x0/0x213 
Nov 26 15:06:45 oslo kernel:  [<ffffffff800a08fe>] keventd_create_kthread+0x0/0xc4 
Nov 26 15:06:45 oslo kernel:  [<ffffffff8003290a>] kthread+0xfe/0x132 
Nov 26 15:06:45 oslo kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11 
Nov 26 15:06:45 oslo kernel:  [<ffffffff800a08fe>] keventd_create_kthread+0x0/0xc4 
Nov 26 15:06:45 oslo kernel:  [<ffffffff8003280c>] kthread+0x0/0x132 
Nov 26 15:06:45 oslo kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11 

I then rebooted the server again, making sure that the noop scheduler would be chosen at boot-time. This time (using the noop scheduler), I couldn't provoke the problem to re-appear.

So even though the server's firmware has been upgraded slightly and the kernel package is probably also newer, the situation is still like in September: If I use fio to generate heavy I/O on the local RAID system with the cfq scheduler active since boot, I end up in a pathological state (where no new processes seem to get spawned and load is > 12); whereas if I use the noop scheduler, things are fine.

Comment 28 Jeff Moyer 2010-11-29 14:54:10 UTC

I have two questions:
1) When the I/O subsystem is stuck, does changing I/O schedulers make things work again? (What I mean by this is run the system with CFQ until it is hung, then echo noop > /sys/block/sdX/queue/scheduler, where sdX is the device that holds the hung file system).
2) If not, can someone get me a vmcore?
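When testing question 1, it helps to confirm which scheduler is actually active before and after the switch. The following Python sketch assumes the usual format of the sysfs scheduler file, where all available schedulers are listed and the active one is shown in square brackets; the function names and the device argument are placeholders.

```python
def active_scheduler(text):
    """Return the scheduler marked with [brackets] in a sysfs scheduler line."""
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            return name[1:-1]
    return None  # no bracketed entry found


def read_scheduler(device):
    """Read the active I/O scheduler for a block device, e.g. 'sda'."""
    with open(f"/sys/block/{device}/queue/scheduler") as f:
        return active_scheduler(f.read())
```

For example, a file content of "noop anticipatory deadline [cfq]" means that cfq is the active scheduler for that device.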

Comment 29 Troels Arvin 2010-11-29 20:17:34 UTC

(In reply to comment #28)
> 1) When the I/O subsystem is stuck, does changing I/O schedulers make things
> work again? (What I mean by this is run the system with CFQ until it is hung,
> then echo noop > /sys/block/sdX/queue/scheduler, where sdX is the device that
> holds the hung file system).

I tried: Ran fio for around 10 minutes and saw that load got above 10 and that logging in via SSH became impossible. I had an existing terminal open in order to be able to change the scheduler, but I couldn't even start "cat". As soon as I hit CTRL-c in fio, things got responsive again.

I then changed the scheduler to noop (without a reboot), and this time, fio couldn't bring the system into the pathological state.



> 2) If not, can someone get me a vmcore?

How is that done?

Comment 30 Jeff Moyer 2010-11-29 21:06:26 UTC

(In reply to comment #29)
> (In reply to comment #28)
> > 1) When the I/O subsystem is stuck, does changing I/O schedulers make things
> > work again? (What I mean by this is run the system with CFQ until it is hung,
> > then echo noop > /sys/block/sdX/queue/scheduler, where sdX is the device that
> > holds the hung file system).
> 
> I tried: Ran fio for around 10 minutes and saw that load got above 10 and that
> logging in via SSH became impossible. I had an existing terminal open in order
> to be able to change the scheduler, but I couldn't even start "cat". As soon as
> I hit CTRL-c in fio, things got responsive again.
> 
> I then changed the scheduler to noop (without a reboot), and this time, fio
> couldn't bring the system into the pathological state.

OK, thanks for the quick testing turn-around.

> > 2) If not, can someone get me a vmcore?
> 
> How is that done?

Well, let me try to reproduce this.  You've provided a nice fio job file for me, so I'll do the leg work of getting the vmcore.

Comment 31 Daniel 2010-12-02 22:34:15 UTC

I was seeing the same behavior described in this thread. I bumped up to test kernel 2.6.18-233.el5 from http://people.redhat.com/jwilson/el5 and the errors have stopped. I'm not able to quantify impact outside of the error messages because I need to induce heavy disk i/o to do it but things "feel" better too.

Comment 32 Daniel 2010-12-03 13:21:16 UTC

I need to back down on that a bit: the problem seems to occur less frequently with the 2.6.18-233.el5 test kernel but is still evident.

Comment 37 Steve Morgan 2010-12-12 03:19:15 UTC

I had this problem during a 5.5 NFS install. The server had a 10 GigE NIC and the nodes had 1 Gb NICs. On the 10 GigE card the MTU was set to 9000 by default. The minute I restarted the interface with MTU set to 1500, NFS immediately started working. Check your MTU settings. Hope this helps.

Comment 39 elventear 2011-01-17 20:10:44 UTC

Has anyone tested with RHEL 5.6's kernel to see if this problem still happens?

Comment 40 Troels Arvin 2011-01-17 23:51:37 UTC

I just tested with RHEL 5.6. The problem is still there :-( Messages from syslog will be shown below. I'm tempted to reformat with ext4 and see if the situation changes, but it will be a while until I get a window for doing that.

A strange phenomenon (which has probably been there all along): I can only systematically reproduce it if I use "fio" shortly after booting. When fio is around 10% done, things go wrong. But if I hit CTRL+c and then re-start fio (without a reboot), the problem is gone.

When the problem is seen, it develops like this (the percentages are fio's "% done"):
 1% done: load=5
 2% done: load=7
 3% done: load=8
 4% done: load=9
 5% done: load=11
10% done: load=12
12% done: load=14
14% done: load=17
16% done: load=20
18% done: load=23
20% done: load=25
22% done: load=27
24% done: load=28
26% done: load=29
28% done: load=30
30% done: load=31

Around 3-4% done, the "pdflush" process starts showing up among the top processes in "top".

All along, IO/s stays around 4200 (30MB/s read, 12MB/s write).

Now, from syslog:
Jan 18 00:25:28 oslo kernel: INFO: task kjournald:1506 blocked for more than 120 seconds. 
Jan 18 00:25:28 oslo kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
Jan 18 00:25:28 oslo kernel: kjournald     D ffff810c4a1aaaa0     0  1506    807          1531  1467 (L-TLB) 
Jan 18 00:25:28 oslo kernel:  ffff81181c3ebcd0 0000000000000046 ffff81181fe88000 ffffffff880b97b1 
Jan 18 00:25:28 oslo kernel:  0000000000000000 000000000000000a ffff810c1f417820 ffff810c20131100 
Jan 18 00:25:28 oslo kernel:  0000004a2017d95d 0000000000000be8 ffff810c1f417a08 000000031fb68cf8 
Jan 18 00:25:28 oslo kernel: Call Trace: 
Jan 18 00:25:28 oslo kernel:  [<ffffffff880b97b1>] :cciss:do_cciss_request+0x32/0x4dd 
Jan 18 00:25:28 oslo kernel:  [<ffffffff8006ec4e>] do_gettimeofday+0x40/0x90 
Jan 18 00:25:28 oslo kernel:  [<ffffffff800154b2>] sync_buffer+0x0/0x3f 
Jan 18 00:25:29 oslo kernel:  [<ffffffff800637ca>] io_schedule+0x3f/0x67 
Jan 18 00:25:29 oslo kernel:  [<ffffffff800154ed>] sync_buffer+0x3b/0x3f 
Jan 18 00:25:29 oslo kernel:  [<ffffffff800639f6>] __wait_on_bit+0x40/0x6e 
Jan 18 00:25:29 oslo kernel:  [<ffffffff800154b2>] sync_buffer+0x0/0x3f 
Jan 18 00:25:29 oslo kernel:  [<ffffffff80063a90>] out_of_line_wait_on_bit+0x6c/0x78 
Jan 18 00:25:29 oslo kernel:  [<ffffffff800a28e2>] wake_bit_function+0x0/0x23 
Jan 18 00:25:29 oslo kernel:  [<ffffffff8003aae0>] sync_dirty_buffer+0x8e/0xc3 
Jan 18 00:25:29 oslo kernel:  [<ffffffff8803401e>] :jbd:journal_commit_transaction+0xbbc/0x1066 
Jan 18 00:25:29 oslo kernel:  [<ffffffff8003dbe6>] lock_timer_base+0x1b/0x3c 
Jan 18 00:25:29 oslo kernel:  [<ffffffff880375d3>] :jbd:kjournald+0xc1/0x213 
Jan 18 00:25:29 oslo kernel:  [<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e 
Jan 18 00:25:29 oslo kernel:  [<ffffffff88037512>] :jbd:kjournald+0x0/0x213 
Jan 18 00:25:29 oslo kernel:  [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4 
Jan 18 00:25:29 oslo kernel:  [<ffffffff80032974>] kthread+0xfe/0x132 
Jan 18 00:25:29 oslo kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11 
Jan 18 00:25:29 oslo kernel:  [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4 
Jan 18 00:25:29 oslo kernel:  [<ffffffff80032876>] kthread+0x0/0x132 
Jan 18 00:25:29 oslo kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11 
Jan 18 00:25:29 oslo kernel:  
Jan 18 00:25:29 oslo kernel: INFO: task master:25281 blocked for more than 120 seconds. 
Jan 18 00:25:29 oslo kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
Jan 18 00:25:29 oslo kernel: master        D ffff81000900caa0     0 25281  18664               24791 (NOTLB) 
Jan 18 00:25:29 oslo kernel:  ffff8110ba949b58 0000000000000086 0000000000000000 ffff810000075c10 
Jan 18 00:25:29 oslo kernel:  ffff810110a7adb0 0000000000000005 ffff81181fa56080 ffff810c4a24b0c0 
Jan 18 00:25:29 oslo kernel:  00000049f62b812f 00000000000072f4 ffff81181fa56268 00000002000201d2 
Jan 18 00:25:29 oslo kernel: Call Trace: 
Jan 18 00:25:29 oslo kernel:  [<ffffffff8006ec4e>] do_gettimeofday+0x40/0x90 
Jan 18 00:25:29 oslo kernel:  [<ffffffff80028ae9>] sync_page+0x0/0x43 
Jan 18 00:25:29 oslo kernel:  [<ffffffff800637ca>] io_schedule+0x3f/0x67 
Jan 18 00:25:29 oslo kernel:  [<ffffffff80028b27>] sync_page+0x3e/0x43 
Jan 18 00:25:29 oslo kernel:  [<ffffffff8006390e>] __wait_on_bit_lock+0x36/0x66 
Jan 18 00:25:29 oslo kernel:  [<ffffffff8003fd9f>] __lock_page+0x5e/0x64 
Jan 18 00:25:29 oslo kernel:  [<ffffffff800a28e2>] wake_bit_function+0x0/0x23 
Jan 18 00:25:29 oslo kernel:  [<ffffffff8000c3d1>] do_generic_mapping_read+0x1df/0x359 
Jan 18 00:25:29 oslo kernel:  [<ffffffff8000d1bd>] file_read_actor+0x0/0x159 
Jan 18 00:25:29 oslo kernel:  [<ffffffff8000c697>] __generic_file_aio_read+0x14c/0x198 
Jan 18 00:25:29 oslo kernel:  [<ffffffff80016e0c>] generic_file_aio_read+0x34/0x39 
Jan 18 00:25:29 oslo kernel:  [<ffffffff8000cee6>] do_sync_read+0xc7/0x104 
Jan 18 00:25:29 oslo kernel:  [<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e 
Jan 18 00:25:29 oslo kernel:  [<ffffffff8002a6c4>] __vma_link+0x42/0x4b 
Jan 18 00:25:29 oslo kernel:  [<ffffffff8001ce41>] vma_link+0x70/0xfd 
Jan 18 00:25:29 oslo kernel:  [<ffffffff800302f7>] __up_write+0x27/0xf2 
Jan 18 00:25:29 oslo kernel:  [<ffffffff8000b787>] vfs_read+0xcb/0x171 
Jan 18 00:25:29 oslo kernel:  [<ffffffff800454fb>] kernel_read+0x41/0x55 
Jan 18 00:25:29 oslo kernel:  [<ffffffff8003ef47>] do_execve+0xe1/0x1ed 
Jan 18 00:25:29 oslo kernel:  [<ffffffff80055064>] sys_execve+0x36/0x4c 
Jan 18 00:25:29 oslo kernel:  [<ffffffff8005d4d3>] stub_execve+0x67/0xb0 
Jan 18 00:25:29 oslo kernel:  
Jan 18 00:25:29 oslo kernel: INFO: task sh:25305 blocked for more than 120 seconds. 
Jan 18 00:25:29 oslo kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
Jan 18 00:25:29 oslo kernel: sh            D ffff81000900caa0     0 25305      1         27695 21904 (NOTLB) 
Jan 18 00:25:29 oslo kernel:  ffff8104b59eba38 0000000000000086 ffff810110805440 ffffc20010097080 
Jan 18 00:25:29 oslo kernel:  ffff810110805440 0000000000000005 ffff810c1c38d100 ffff810c4a24b0c0 
Jan 18 00:25:29 oslo kernel:  0000004f045c3cba 000000000004b279 ffff810c1c38d2e8 0000000280022214 
Jan 18 00:25:29 oslo kernel: Call Trace: 
Jan 18 00:25:29 oslo kernel:  [<ffffffff8006ec4e>] do_gettimeofday+0x40/0x90 
Jan 18 00:25:29 oslo kernel:  [<ffffffff800154b2>] sync_buffer+0x0/0x3f 
Jan 18 00:25:29 oslo kernel:  [<ffffffff800637ca>] io_schedule+0x3f/0x67 
Jan 18 00:25:29 oslo kernel:  [<ffffffff800154ed>] sync_buffer+0x3b/0x3f 
Jan 18 00:25:29 oslo kernel:  [<ffffffff800639f6>] __wait_on_bit+0x40/0x6e 
Jan 18 00:25:29 oslo kernel:  [<ffffffff800154b2>] sync_buffer+0x0/0x3f 
Jan 18 00:25:29 oslo kernel:  [<ffffffff80063a90>] out_of_line_wait_on_bit+0x6c/0x78 
Jan 18 00:25:29 oslo kernel:  [<ffffffff800a28e2>] wake_bit_function+0x0/0x23 
Jan 18 00:25:29 oslo kernel:  [<ffffffff8001750f>] ll_rw_block+0x8c/0xab 
Jan 18 00:25:29 oslo kernel:  [<ffffffff8805314c>] :ext3:ext3_find_entry+0x3bf/0x575 
Jan 18 00:25:29 oslo kernel:  [<ffffffff80064604>] __down_read+0x12/0x92 
Jan 18 00:25:29 oslo kernel:  [<ffffffff80022214>] __up_read+0x19/0x7f 
Jan 18 00:25:29 oslo kernel:  [<ffffffff8805bb0d>] :ext3:ext3_xattr_get+0x217/0x228 
Jan 18 00:25:29 oslo kernel:  [<ffffffff8804dac5>] :ext3:__ext3_get_inode_loc+0x12f/0x2f9 
Jan 18 00:25:29 oslo kernel:  [<ffffffff880549ba>] :ext3:ext3_lookup+0x33/0x162 
Jan 18 00:25:29 oslo kernel:  [<ffffffff8000d008>] do_lookup+0xe5/0x1e6 
Jan 18 00:25:29 oslo kernel:  [<ffffffff8000a2c5>] __link_path_walk+0xa2a/0xfb9 
Jan 18 00:25:29 oslo kernel:  [<ffffffff8000ea74>] link_path_walk+0x42/0xb2 
Jan 18 00:25:29 oslo kernel:  [<ffffffff8000cda3>] do_path_lookup+0x275/0x2f1 
Jan 18 00:25:29 oslo kernel:  [<ffffffff800237c4>] __path_lookup_intent_open+0x56/0x97 
Jan 18 00:25:29 oslo kernel:  [<ffffffff8003c1db>] open_exec+0x24/0xc0 
Jan 18 00:25:29 oslo kernel:  [<ffffffff8001cea1>] vma_link+0xd0/0xfd 
Jan 18 00:25:29 oslo kernel:  [<ffffffff8003eeac>] do_execve+0x46/0x1ed 
Jan 18 00:25:29 oslo kernel:  [<ffffffff80055064>] sys_execve+0x36/0x4c 
Jan 18 00:25:29 oslo kernel:  [<ffffffff8005d4d3>] stub_execve+0x67/0xb0
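For anyone who wants to compare such logs across machines, the "blocked for more than" lines can be pulled out of syslog with a few lines of Python. This is only a sketch: the regular expression is modelled on the messages shown above, and the function name is made up.

```python
import re

# Matches e.g. "INFO: task kjournald:1506 blocked for more than 120 seconds."
HUNG_RE = re.compile(
    r"INFO: task (?P<name>.+?):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def hung_tasks(lines):
    """Return (task name, pid, seconds) for every hung-task report in lines."""
    found = []
    for line in lines:
        m = HUNG_RE.search(line)
        if m:
            found.append((m.group("name"), int(m.group("pid")), int(m.group("secs"))))
    return found
```

Run over the excerpt above, this would list kjournald (1506), master (25281) and sh (25305), each blocked for more than 120 seconds.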

Comment 43 Dwight (Bud) Brown 2011-01-19 20:24:58 UTC

The original problem within this BZ was on an LSI controller which was fixed by upgrading the FJ drive firmware (comment #20).

The >120s messages can be a symptom of many different issues, including just a very busy system. For example, if the I/O timeout is set to 300s and the task stall detection time is set lower than that, you can get these messages any time you encounter an I/O timeout.
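The relationship between the two timeouts can be written down as a small Python check. This is a sketch under assumptions: the hung-task value comes from /proc/sys/kernel/hung_task_timeout_secs (the path named in the kernel message), while the per-device path /sys/block/<dev>/device/timeout is assumed to exist for SCSI disks and may be absent for other drivers. The function names are hypothetical.

```python
def read_int(path):
    """Read a single integer from a procfs/sysfs file."""
    with open(path) as f:
        return int(f.read().strip())


def warning_precedes_io_timeout(hung_task_timeout_secs, io_timeout_secs):
    """True if a task stuck on one I/O can be reported before that I/O times out."""
    if hung_task_timeout_secs == 0:
        return False  # 0 disables the hung-task message
    return hung_task_timeout_secs < io_timeout_secs


def check(device):
    """Compare the live values for one device, e.g. 'sda' (Linux only)."""
    hung = read_int("/proc/sys/kernel/hung_task_timeout_secs")
    io = read_int(f"/sys/block/{device}/device/timeout")  # assumed SCSI path
    return warning_precedes_io_timeout(hung, io)
```

With the default 120s stall detection and a 300s I/O timeout the check returns True, which is the situation described above.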

For Smart Array configurations encountering this type of message or longer term hang issues, I'd suggest using BZ 580818 instead.

For configurations other than LSI or Smart Array, I'd suggest opening a new BZ with appropriate details.

Comment 44 Troels Arvin 2011-01-19 20:30:58 UTC

"You are not authorized to access bug #580818"...

Comment 45 elventear 2011-01-19 22:08:59 UTC

We are having an issue with a SAN where the SCSI commands time out at the same time as we are getting all these hung task timeout errors. After that, the filesystem deems the problem an I/O failure and remounts everything read-only, crashing our application. Is that something we should be seeing on a very busy system?

We do have an HP system with a SmartArray card, but this problem came up when we attached the SAN.

Example below [Full dmesg]:

Linux version 2.6.18-238.el5 (mockbuild.bos.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-50)) #1 SMP Sun Dec 19 14:22:44 EST 2010
Command line: ro root=/dev/system/root
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000010000 - 000000000009f400 (usable)
 BIOS-e820: 000000000009f400 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 00000000cfe3e000 (usable)
 BIOS-e820: 00000000cfe3e000 - 00000000cfe46000 (ACPI data)
 BIOS-e820: 00000000cfe46000 - 00000000cfe47000 (usable)
 BIOS-e820: 00000000cfe47000 - 00000000e0000000 (reserved)
 BIOS-e820: 00000000fec00000 - 00000000fee10000 (reserved)
 BIOS-e820: 00000000ffc00000 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - 000000082ffff000 (usable)
DMI 2.4 present.
ACPI: RSDP (v002 HP                                    ) @ 0x00000000000f4f00
ACPI: XSDT (v001 HP     ProLiant 0x00000002 Ò^D 0x0000162e) @ 0x00000000cfe3eec0
ACPI: FADT (v003 HP     ProLiant 0x00000002 Ò^D 0x0000162e) @ 0x00000000cfe3efc0
ACPI: SPCR (v001 HP     SPCRRBSU 0x00000001 Ò^D 0x0000162e) @ 0x00000000cfe3e180
ACPI: MCFG (v001 HP     ProLiant 0x00000001  0x00000000) @ 0x00000000cfe3e200
ACPI: HPET (v001 HP     ProLiant 0x00000002 Ò^D 0x0000162e) @ 0x00000000cfe3e240
ACPI: SPMI (v005 HP     ProLiant 0x00000001 Ò^D 0x0000162e) @ 0x00000000cfe3e280
ACPI: ERST (v001 HP     ProLiant 0x00000001 Ò^D 0x0000162e) @ 0x00000000cfe3e2c0
ACPI: MADT (v001 HP     ProLiant 0x00000002  0x00000000) @ 0x00000000cfe3e4c0
ACPI: SRAT (v001 AMD    FAM_F_10 0x00000002 AMD  0x00000001) @ 0x00000000cfe3e6c0
ACPI: FFFF (v001 HP     ProLiant 0x00000001 Ò^D 0x0000162e) @ 0x00000000cfe3eac0
ACPI: BERT (v001 HP     ProLiant 0x00000001 Ò^D 0x0000162e) @ 0x00000000cfe3ec40
ACPI: HEST (v001 HP     ProLiant 0x00000001 Ò^D 0x0000162e) @ 0x00000000cfe3ec80
ACPI: FFFF (v002 HP     ProLiant 0x00000002 Ò^D 0x0000162e) @ 0x00000000cfe3ee40
ACPI: SSDT (v003 HP     pci0pcie 0x00000001 INTL 0x20061109) @ 0x00000000cfe42900
ACPI: SSDT (v003 HP      CRSPCI0 0x00000002 HP   0x00000001) @ 0x00000000cfe42c00
ACPI: SSDT (v003 HP      CRSPCI1 0x00000002 HP   0x00000001) @ 0x00000000cfe42d40
ACPI: DSDT (v001 HP         DSDT 0x00000001 INTL 0x20030228) @ 0x0000000000000000
SRAT: PXM 0 -> APIC 0 -> Node 0
SRAT: PXM 0 -> APIC 1 -> Node 0
SRAT: PXM 0 -> APIC 2 -> Node 0
SRAT: PXM 0 -> APIC 3 -> Node 0
SRAT: PXM 0 -> APIC 4 -> Node 0
SRAT: PXM 0 -> APIC 5 -> Node 0
SRAT: PXM 1 -> APIC 8 -> Node 1
SRAT: PXM 1 -> APIC 9 -> Node 1
SRAT: PXM 1 -> APIC 10 -> Node 1
SRAT: PXM 1 -> APIC 11 -> Node 1
SRAT: PXM 1 -> APIC 12 -> Node 1
SRAT: PXM 1 -> APIC 13 -> Node 1
SRAT: Node 0 PXM 0 0-a0000
SRAT: Node 0 PXM 0 0-d0000000
SRAT: Node 0 PXM 0 0-430000000
SRAT: Node 1 PXM 1 430000000-830000000
NUMA: Using 28 for the hash shift.
Bootmem setup node 0 0000000000000000-0000000430000000
Bootmem setup node 1 0000000430000000-000000082ffff000
Memory for crash kernel (0x0 to 0x0) notwithin permissible range
disabling kdump
On node 0 totalpages: 4132192
  DMA zone: 2409 pages, LIFO batch:0
  DMA32 zone: 833143 pages, LIFO batch:31
  Normal zone: 3296640 pages, LIFO batch:31
On node 1 totalpages: 4136960
  Normal zone: 4136960 pages, LIFO batch:31
ACPI: PM-Timer IO Port: 0x920
ACPI: Local APIC address 0xfee00000
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 0:8 APIC version 16
ACPI: LAPIC (acpi_id[0x06] lapic_id[0x08] enabled)
Processor #8 0:8 APIC version 16
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1 0:8 APIC version 16
ACPI: LAPIC (acpi_id[0x07] lapic_id[0x09] enabled)
Processor #9 0:8 APIC version 16
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] enabled)
Processor #2 0:8 APIC version 16
ACPI: LAPIC (acpi_id[0x08] lapic_id[0x0a] enabled)
Processor #10 0:8 APIC version 16
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x03] enabled)
Processor #3 0:8 APIC version 16
ACPI: LAPIC (acpi_id[0x09] lapic_id[0x0b] enabled)
Processor #11 0:8 APIC version 16
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x04] enabled)
Processor #4 0:8 APIC version 16
ACPI: LAPIC (acpi_id[0x0a] lapic_id[0x0c] enabled)
Processor #12 0:8 APIC version 16
ACPI: LAPIC (acpi_id[0x05] lapic_id[0x05] enabled)
Processor #5 0:8 APIC version 16
ACPI: LAPIC (acpi_id[0x0b] lapic_id[0x0d] enabled)
Processor #13 0:8 APIC version 16
ACPI: LAPIC_NMI (acpi_id[0xff] dfl dfl lint[0x1])
ACPI: IOAPIC (id[0x08] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 8, version 17, address 0xfec00000, GSI 0-15
ACPI: IOAPIC (id[0x09] address[0xfec01000] gsi_base[16])
IOAPIC[1]: apic_id 9, version 17, address 0xfec01000, GSI 16-31
ACPI: IOAPIC (id[0x0a] address[0xfec02000] gsi_base[32])
IOAPIC[2]: apic_id 10, version 17, address 0xfec02000, GSI 32-47
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 high edge)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 low level)
ACPI: IRQ0 used by override.
ACPI: IRQ2 used by override.
ACPI: IRQ9 used by override.
Setting APIC routing to physical flat
ACPI: HPET id: 0x1166a201 base: 0xfed00000
Using ACPI (MADT) for SMP configuration information
Nosave address range: 000000000009f000 - 00000000000a0000
Nosave address range: 00000000000a0000 - 00000000000f0000
Nosave address range: 00000000000f0000 - 0000000000100000
Nosave address range: 00000000cfe3e000 - 00000000cfe46000
Nosave address range: 00000000cfe47000 - 00000000e0000000
Nosave address range: 00000000e0000000 - 00000000fec00000
Nosave address range: 00000000fec00000 - 00000000fee10000
Nosave address range: 00000000fee10000 - 00000000ffc00000
Nosave address range: 00000000ffc00000 - 0000000100000000
Allocating PCI resources starting at e2000000 (gap: e0000000:1ec00000)
SMP: Allowing 12 CPUs, 0 hotplug CPUs
Built 2 zonelists.  Total pages: 8269152
Kernel command line: ro root=/dev/system/root
Initializing CPU#0
PID hash table entries: 4096 (order: 12, 32768 bytes)
Console: colour VGA+ 80x25
Dentry cache hash table entries: 4194304 (order: 13, 33554432 bytes)
Inode-cache hash table entries: 2097152 (order: 12, 16777216 bytes)
Checking aperture...
CPU 0: aperture @ 8000000 size 32 MB
Aperture too small (32 MB)
No AGP bridge found
Your BIOS doesn't leave a aperture memory hole
Please enable the IOMMU option in the BIOS setup
This costs you 64 MB of RAM
Mapping aperture over 65536 KB of RAM @ 8000000
Nosave address range: 0000000008000000 - 000000000c000000
ACPI: DMAR not present
Memory: 32956968k/34340860k available (2592k kernel code, 595212k reserved, 1649k data, 224k init)
Calibrating delay loop (skipped), value calculated using timer frequency.. 4800.18 BogoMIPS (lpj=2400092)
Security Framework v1.0.0 initialized
SELinux:  Initializing.
SELinux:  Starting in permissive mode
selinux_register_security:  Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 256
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 512K (64 bytes/line)
CPU 0/0 -> Node 0
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 0
SMP alternatives: switching to UP code
ACPI: Core revision 20060707
Using local APIC timer interrupts.
Detected 12.500 MHz APIC timer.
SMP alternatives: switching to SMP code
Booting processor 1/12 APIC 0x8
Initializing CPU#1
Calibrating delay using timer specific routine.. 4800.27 BogoMIPS (lpj=2400139)
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 512K (64 bytes/line)
CPU 1/8 -> Node 1
CPU: Physical Processor ID: 1
CPU: Processor Core ID: 0
Six-Core AMD Opteron(tm) Processor 2431 stepping 00
SMP alternatives: switching to SMP code
Booting processor 2/12 APIC 0x1
Initializing CPU#2
Calibrating delay using timer specific routine.. 4801.22 BogoMIPS (lpj=2400612)
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 512K (64 bytes/line)
CPU 2/1 -> Node 0
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 1
Six-Core AMD Opteron(tm) Processor 2431 stepping 00
SMP alternatives: switching to SMP code
Booting processor 3/12 APIC 0x9
Initializing CPU#3
Calibrating delay using timer specific routine.. 4804.38 BogoMIPS (lpj=2402193)
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 512K (64 bytes/line)
CPU 3/9 -> Node 1
CPU: Physical Processor ID: 1
CPU: Processor Core ID: 1
Six-Core AMD Opteron(tm) Processor 2431 stepping 00
SMP alternatives: switching to SMP code
Booting processor 4/12 APIC 0x2
Initializing CPU#4
Calibrating delay using timer specific routine.. 4803.89 BogoMIPS (lpj=2401946)
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 512K (64 bytes/line)
CPU 4/2 -> Node 0
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 2
Six-Core AMD Opteron(tm) Processor 2431 stepping 00
SMP alternatives: switching to SMP code
Booting processor 5/12 APIC 0xa
Initializing CPU#5
Calibrating delay using timer specific routine.. 4803.24 BogoMIPS (lpj=2401623)
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 512K (64 bytes/line)
CPU 5/a -> Node 1
CPU: Physical Processor ID: 1
CPU: Processor Core ID: 2
Six-Core AMD Opteron(tm) Processor 2431 stepping 00
SMP alternatives: switching to SMP code
Booting processor 6/12 APIC 0x3
Initializing CPU#6
Calibrating delay using timer specific routine.. 4803.14 BogoMIPS (lpj=2401574)
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 512K (64 bytes/line)
CPU 6/3 -> Node 0
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 3
Six-Core AMD Opteron(tm) Processor 2431 stepping 00
SMP alternatives: switching to SMP code
Booting processor 7/12 APIC 0xb
Initializing CPU#7
Calibrating delay using timer specific routine.. 4804.12 BogoMIPS (lpj=2402064)
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 512K (64 bytes/line)
CPU 7/b -> Node 1
CPU: Physical Processor ID: 1
CPU: Processor Core ID: 3
Six-Core AMD Opteron(tm) Processor 2431 stepping 00
SMP alternatives: switching to SMP code
Booting processor 8/12 APIC 0x4
Initializing CPU#8
Calibrating delay using timer specific routine.. 4802.49 BogoMIPS (lpj=2401249)
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 512K (64 bytes/line)
CPU 8/4 -> Node 0
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 4
Six-Core AMD Opteron(tm) Processor 2431 stepping 00
SMP alternatives: switching to SMP code
Booting processor 9/12 APIC 0xc
Initializing CPU#9
Calibrating delay using timer specific routine.. 4801.74 BogoMIPS (lpj=2400872)
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 512K (64 bytes/line)
CPU 9/c -> Node 1
CPU: Physical Processor ID: 1
CPU: Processor Core ID: 4
Six-Core AMD Opteron(tm) Processor 2431 stepping 00
SMP alternatives: switching to SMP code
Booting processor 10/12 APIC 0x5
Initializing CPU#10
Calibrating delay using timer specific routine.. 4803.41 BogoMIPS (lpj=2401706)
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 512K (64 bytes/line)
CPU 10/5 -> Node 0
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 5
Six-Core AMD Opteron(tm) Processor 2431 stepping 00
SMP alternatives: switching to SMP code
Booting processor 11/12 APIC 0xd
Initializing CPU#11
Calibrating delay using timer specific routine.. 4804.02 BogoMIPS (lpj=2402013)
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 512K (64 bytes/line)
CPU 11/d -> Node 1
CPU: Physical Processor ID: 1
CPU: Processor Core ID: 5
Six-Core AMD Opteron(tm) Processor 2431 stepping 00
Brought up 12 CPUs
CPU#0: NMI watchdog performance counter calibration - 724->744
CPU#1: NMI watchdog performance counter calibration - 54->61
CPU#2: NMI watchdog performance counter calibration - 52->72
CPU#3: NMI watchdog performance counter calibration - 55->62
CPU#4: NMI watchdog performance counter calibration - 85->105
CPU#5: NMI watchdog performance counter calibration - 47->54
CPU#6: NMI watchdog performance counter calibration - 46->66
CPU#7: NMI watchdog performance counter calibration - 35->42
CPU#8: NMI watchdog performance counter calibration - 34->54
CPU#9: NMI watchdog performance counter calibration - 30->50
CPU#10: NMI watchdog performance counter calibration - 26->46
CPU#11: NMI watchdog performance counter calibration - 20->27
NMI watchdog testing PASSED.
time.c: Using 14.318180 MHz WALL HPET GTOD HPET/TSC timer.
time.c: Detected 2400.099 MHz processor.
sizeof(vma)=176 bytes
sizeof(page)=56 bytes
sizeof(inode)=560 bytes
sizeof(dentry)=216 bytes
sizeof(ext3inode)=760 bytes
sizeof(buffer_head)=96 bytes
sizeof(skbuff)=248 bytes
migration_cost=633,4994
checking if image is initramfs... it is
Freeing initrd memory: 4002k freed
NET: Registered protocol family 16
ACPI: bus type pci registered
PCI: Using MMCONFIG at d0000000
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
ACPI: No dock devices found.
ACPI: PCI Root Bridge [PCI0] (0000:00)
PCI: Enabling HT MSI Mapping on 0000:00:05.0
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.EXB0._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.EXB1._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.EXB2._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.EXB3._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.EXB4._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.IPXB._PRT]
ACPI: PCI Interrupt Link [IUSB] (IRQs *5)
ACPI: PCI Interrupt Link [ISF0] (IRQs *14)
ACPI: PCI Interrupt Link [IN00] (IRQs 7 10 11) *0, disabled.
ACPI: PCI Interrupt Link [IN01] (IRQs 7 10 11) *0, disabled.
ACPI: PCI Interrupt Link [IN02] (IRQs 7 10 11) *0, disabled.
ACPI: PCI Interrupt Link [IN03] (IRQs 7 10 11) *0, disabled.
ACPI: PCI Interrupt Link [IN04] (IRQs 7 10 11) *0, disabled.
ACPI: PCI Interrupt Link [IN05] (IRQs 7 10 11) *0, disabled.
ACPI: PCI Interrupt Link [IN06] (IRQs 7 10 11) *0, disabled.
ACPI: PCI Interrupt Link [IN07] (IRQs 7 10 11) *0, disabled.
ACPI: PCI Interrupt Link [IN08] (IRQs 7 10 11) *0, disabled.
ACPI: PCI Interrupt Link [IN09] (IRQs 7 10 11) *0, disabled.
ACPI: PCI Interrupt Link [IN10] (IRQs 7 10 11) *0, disabled.
ACPI: PCI Interrupt Link [IN11] (IRQs 7 10 11) *0, disabled.
ACPI: PCI Interrupt Link [IN12] (IRQs 7 10 11) *0, disabled.
ACPI: PCI Interrupt Link [IN13] (IRQs 7 10 11) *0, disabled.
ACPI: PCI Interrupt Link [IN14] (IRQs 7 10 11) *0, disabled.
ACPI: PCI Interrupt Link [IN15] (IRQs 7 10 11) *0, disabled.
ACPI: PCI Interrupt Link [IN16] (IRQs 7 10 *11)
ACPI: PCI Interrupt Link [IN17] (IRQs 7 10 *11)
ACPI: PCI Interrupt Link [IN18] (IRQs 7 10 *11)
ACPI: PCI Interrupt Link [IN19] (IRQs 7 10 *11)
ACPI: PCI Interrupt Link [IN20] (IRQs 7 10 *11)
ACPI: PCI Interrupt Link [IN21] (IRQs 7 10 11) *0, disabled.
ACPI: PCI Interrupt Link [IN22] (IRQs 7 *10 11)
ACPI: PCI Interrupt Link [IN23] (IRQs *7 10 11)
ACPI: PCI Interrupt Link [IN24] (IRQs 7 10 *11)
ACPI: PCI Interrupt Link [IN25] (IRQs 7 10 *11)
ACPI: PCI Interrupt Link [IN26] (IRQs 7 10 *11)
ACPI: PCI Interrupt Link [IN27] (IRQs 7 10 11) *0, disabled.
ACPI: PCI Interrupt Link [IN28] (IRQs *7 10 11)
ACPI: PCI Interrupt Link [IN29] (IRQs 7 10 *11)
ACPI: PCI Interrupt Link [IN30] (IRQs 7 *10 11)
ACPI: PCI Interrupt Link [IN31] (IRQs 7 10 11) *0, disabled.
ACPI: PCI Root Bridge [PCI1] (0000:40)
ACPI: PCI Interrupt Routing Table [\_SB_.PCI1._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI1.EXB0._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI1.EXB1._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI1.EXB2._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI1.EXB3._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI1.EXB4._PRT]
Linux Plug and Play Support v0.97 (c) Adam Belay
pnp: PnP ACPI init
pnp: PnP ACPI: found 12 devices
usbcore: registered new driver usbfs
usbcore: registered new driver hub
PCI: Using ACPI for IRQ routing
PCI: If a device doesn't work, try "pci=routeirq".  If it helps, post a report
NetLabel: Initializing
NetLabel:  domain hash size = 128
NetLabel:  protocols = UNLABELED CIPSOv4
NetLabel:  unlabeled traffic allowed by default
hpet0: at MMIO 0xfed00000 (virtual 0xffffffffff5fe000), IRQs 2, 8, 0
hpet0: 3 64-bit timers, 14318180 Hz
ACPI: DMAR not present
PCI-DMA: Disabling AGP.
PCI-DMA: aperture base @ 8000000 size 65536 KB
PCI-DMA: using GART IOMMU.
PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture
pnp: 00:01: ioport range 0x408-0x40f has been reserved
pnp: 00:01: ioport range 0x4d0-0x4d1 has been reserved
pnp: 00:01: ioport range 0x4d6-0x4d6 has been reserved
pnp: 00:01: iomem range 0xd0000000-0xdfffffff could not be reserved
PCI: Bridge: 0000:01:0d.0
  IO window: disabled.
  MEM window: disabled.
  PREFETCH window: disabled.
PCI: Bridge: 0000:00:05.0
  IO window: 4000-4fff
  MEM window: f3f00000-f3ffffff
  PREFETCH window 0x00000000e2000000-0x00000000e20fffff
PCI: Bridge: 0000:00:0f.0
  IO window: disabled.
  MEM window: disabled.
  PREFETCH window: disabled.
PCI: Bridge: 0000:00:10.0
  IO window: disabled.
  MEM window: disabled.
  PREFETCH window: disabled.
PCI: Bridge: 0000:00:11.0
  IO window: disabled.
  MEM window: disabled.
  PREFETCH window: disabled.
PCI: Bridge: 0000:00:12.0
  IO window: disabled.
  MEM window: f4000000-f7ffffff
  PREFETCH window 0x00000000e2100000-0x00000000e21fffff
PCI: Bridge: 0000:00:13.0
  IO window: disabled.
  MEM window: disabled.
  PREFETCH window: disabled.
GSI 16 sharing vector 0xA9 and IRQ 16
ACPI: PCI Interrupt 0000:00:0f.0[A] -> GSI 42 (level, low) -> IRQ 169
PCI: Setting latency timer of device 0000:00:0f.0 to 64
GSI 17 sharing vector 0xB1 and IRQ 17
ACPI: PCI Interrupt 0000:00:10.0[A] -> GSI 38 (level, low) -> IRQ 177
PCI: Setting latency timer of device 0000:00:10.0 to 64
GSI 18 sharing vector 0xB9 and IRQ 18
ACPI: PCI Interrupt 0000:00:11.0[A] -> GSI 39 (level, low) -> IRQ 185
PCI: Setting latency timer of device 0000:00:11.0 to 64
GSI 19 sharing vector 0xC1 and IRQ 19
ACPI: PCI Interrupt 0000:00:12.0[A] -> GSI 40 (level, low) -> IRQ 193
PCI: Setting latency timer of device 0000:00:12.0 to 64
GSI 20 sharing vector 0xC9 and IRQ 20
ACPI: PCI Interrupt 0000:00:13.0[A] -> GSI 41 (level, low) -> IRQ 201
PCI: Setting latency timer of device 0000:00:13.0 to 64
PCI: Bridge: 0000:42:00.0
  IO window: disabled.
  MEM window: fd200000-fd2fffff
  PREFETCH window: disabled.
PCI: Bridge: 0000:40:0f.0
  IO window: disabled.
  MEM window: fd200000-fd2fffff
  PREFETCH window: disabled.
PCI: Bridge: 0000:40:10.0
  IO window: 5000-5fff
  MEM window: fd300000-fdafffff
  PREFETCH window 0x00000000e2300000-0x00000000e23fffff
PCI: Bridge: 0000:40:11.0
  IO window: 6000-6fff
  MEM window: fdb00000-fdffffff
  PREFETCH window 0x00000000e2400000-0x00000000e24fffff
PCI: Bridge: 0000:40:12.0
  IO window: disabled.
  MEM window: f8000000-fbffffff
  PREFETCH window 0x00000000e2500000-0x00000000e25fffff
PCI: Bridge: 0000:40:13.0
  IO window: disabled.
  MEM window: disabled.
  PREFETCH window: disabled.
GSI 21 sharing vector 0xD1 and IRQ 21
ACPI: PCI Interrupt 0000:40:0f.0[A] -> GSI 36 (level, low) -> IRQ 209
PCI: Setting latency timer of device 0000:40:0f.0 to 64
ACPI: PCI Interrupt 0000:42:00.0[A] -> GSI 36 (level, low) -> IRQ 209
PCI: Setting latency timer of device 0000:42:00.0 to 64
GSI 22 sharing vector 0xD9 and IRQ 22
ACPI: PCI Interrupt 0000:40:10.0[A] -> GSI 32 (level, low) -> IRQ 217
PCI: Setting latency timer of device 0000:40:10.0 to 64
GSI 23 sharing vector 0xE1 and IRQ 23
ACPI: PCI Interrupt 0000:40:11.0[A] -> GSI 33 (level, low) -> IRQ 225
PCI: Setting latency timer of device 0000:40:11.0 to 64
GSI 24 sharing vector 0xE9 and IRQ 24
ACPI: PCI Interrupt 0000:40:12.0[A] -> GSI 34 (level, low) -> IRQ 233
PCI: Setting latency timer of device 0000:40:12.0 to 64
GSI 25 sharing vector 0x32 and IRQ 25
ACPI: PCI Interrupt 0000:40:13.0[A] -> GSI 35 (level, low) -> IRQ 50
PCI: Setting latency timer of device 0000:40:13.0 to 64
NET: Registered protocol family 2
IP route cache hash table entries: 524288 (order: 10, 4194304 bytes)
TCP established hash table entries: 262144 (order: 10, 4194304 bytes)
TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
TCP: Hash tables configured (established 262144 bind 65536)
TCP reno registered
audit: initializing netlink socket (disabled)
type=2000 audit(1295321056.309:1): initialized
Total HugeTLB memory allocated, 0
VFS: Disk quotas dquot_6.5.1
Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
SELinux:  Registering netfilter hooks
Initializing Cryptographic API
alg: No test for crc32c (crc32c-generic)
ksign: Installing public key data
Loading keyring
- Added public key F43E909AB54B946C
- User ID: Red Hat, Inc. (Kernel Module GPG key)
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered
io scheduler cfq registered (default)
Boot video device is 0000:00:03.0
pci 0000:00:04.4: HCRESET not completed yet!
PCI: Setting latency timer of device 0000:00:0f.0 to 64
PCI: Setting latency timer of device 0000:00:10.0 to 64
PCI: Setting latency timer of device 0000:00:11.0 to 64
PCI: Setting latency timer of device 0000:00:12.0 to 64
PCI: Setting latency timer of device 0000:00:13.0 to 64
PCI: Setting latency timer of device 0000:40:0f.0 to 64
PCI: Setting latency timer of device 0000:40:10.0 to 64
PCI: Setting latency timer of device 0000:40:11.0 to 64
PCI: Setting latency timer of device 0000:40:12.0 to 64
PCI: Setting latency timer of device 0000:40:13.0 to 64
pci_hotplug: PCI Hot Plug PCI Core version: 0.5
Real Time Clock Driver v1.12ac
hpet_resources: 0xfed00000 is busy
Non-volatile memory driver v1.2
Linux agpgart interface v0.101 (c) Dave Jones
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled
serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
serial8250: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
00:09: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
brd: module loaded
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
Probing IDE interface ide0...
Probing IDE interface ide1...
ide-floppy driver 0.99.newide
usbcore: registered new driver hiddev
usbcore: registered new driver usbhid
drivers/usb/input/hid-core.c: v2.6:USB HID core driver
PNP: PS/2 Controller [PNP0303:KBD,PNP0f0e:PS2M] at 0x60,0x64 irq 1,12
serio: i8042 KBD port at 0x60,0x64 irq 1
serio: i8042 AUX port at 0x60,0x64 irq 12
mice: PS/2 mouse device common for all mice
md: md driver 0.90.3 MAX_MD_DEVS=256, MD_SB_DISKS=27
md: bitmap version 4.39
TCP bic registered
Initializing IPsec netlink socket
NET: Registered protocol family 1
NET: Registered protocol family 17
ACPI: (supports S0 S4 S5)
Initalizing network drop monitor service
Freeing unused kernel memory: 224k freed
Write protecting the kernel read-only data: 519k
ACPI: PCI Interrupt Link [IUSB] enabled at IRQ 5
ACPI: PCI Interrupt 0000:00:07.2[A] -> Link [IUSB] -> GSI 5 (level, low) -> IRQ 5
ehci_hcd 0000:00:07.2: EHCI Host Controller
ehci_hcd 0000:00:07.2: new USB bus registered, assigned bus number 1
ehci_hcd 0000:00:07.2: irq 5, io mem 0xf3dc0000
ehci_hcd 0000:00:07.2: USB 2.0 started, EHCI 1.00, driver 10 Dec 2004
usb usb1: configuration #1 chosen from 1 choice
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 4 ports detected
ohci_hcd: 2005 April 22 USB 1.1 'Open' Host Controller (OHCI) Driver (PCI)
ACPI: PCI Interrupt 0000:00:07.0[A] -> Link [IUSB] -> GSI 5 (level, low) -> IRQ 5
ohci_hcd 0000:00:07.0: OHCI Host Controller
ohci_hcd 0000:00:07.0: new USB bus registered, assigned bus number 2
ohci_hcd 0000:00:07.0: irq 5, io mem 0xf3de0000
usb usb2: configuration #1 chosen from 1 choice
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 2 ports detected
ACPI: PCI Interrupt 0000:00:07.1[A] -> Link [IUSB] -> GSI 5 (level, low) -> IRQ 5
ohci_hcd 0000:00:07.1: OHCI Host Controller
ohci_hcd 0000:00:07.1: new USB bus registered, assigned bus number 3
ohci_hcd 0000:00:07.1: irq 5, io mem 0xf3dd0000
usb 1-3: new high speed USB device using ehci_hcd and address 2
usb usb3: configuration #1 chosen from 1 choice
hub 3-0:1.0: USB hub found
hub 3-0:1.0: 2 ports detected
usb 1-3: configuration #1 chosen from 1 choice
hub 1-3:1.0: USB hub found
hub 1-3:1.0: 4 ports detected
USB Universal Host Controller Interface driver v3.0
GSI 26 sharing vector 0x92 and IRQ 26
ACPI: PCI Interrupt 0000:00:04.4[B] -> GSI 45 (level, low) -> IRQ 146
uhci_hcd 0000:00:04.4: UHCI Host Controller
uhci_hcd 0000:00:04.4: new USB bus registered, assigned bus number 4
uhci_hcd 0000:00:04.4: port count misdetected? forcing to 2 ports
uhci_hcd 0000:00:04.4: HCRESET not completed yet!
uhci_hcd 0000:00:04.4: irq 146, io base 0x00001800
usb usb4: configuration #1 chosen from 1 choice
hub 4-0:1.0: USB hub found
hub 4-0:1.0: 2 ports detected
SCSI subsystem initialized
HP CISS Driver (v 3.6.22-RH1)
ACPI: PCI Interrupt 0000:48:00.0[A] -> GSI 33 (level, low) -> IRQ 225
cciss0: <0x323a> at PCI 0000:48:00.0 IRQ 170 using DAC
 cciss/c0d0: p1 p2 p3 < p5 >
 cciss/c0d1: p1
 cciss/c0d2: p1
 cciss/c0d3: p1
libata version 3.00 loaded.
sata_svw 0000:01:0e.0: version 2.3
ACPI: PCI Interrupt Link [ISF0] enabled at IRQ 14
ACPI: PCI Interrupt 0000:01:0e.0[A] -> Link [ISF0] -> GSI 14 (level, low) -> IRQ 14
scsi0 : sata_svw
scsi1 : sata_svw
scsi2 : sata_svw
scsi3 : sata_svw
ata1: SATA max UDMA/133 mmio m8192@0xf3ff0000 port 0xf3ff0000 irq 14
ata2: SATA max UDMA/133 mmio m8192@0xf3ff0000 port 0xf3ff0100 irq 14
ata3: SATA max UDMA/133 mmio m8192@0xf3ff0000 port 0xf3ff0200 irq 14
ata4: SATA max UDMA/133 mmio m8192@0xf3ff0000 port 0xf3ff0300 irq 14
usb 1-3.3: new full speed USB device using ehci_hcd and address 3
usb 1-3.3: configuration #1 chosen from 1 choice
input: Raritan D2CIM-VUSB as /class/input/input0
input: USB HID v1.11 Keyboard [Raritan D2CIM-VUSB] on usb-0000:00:07.2-3.3
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1.00: ATAPI: Optiarc DVD RW AD-7561S, AH51, max UDMA/100
usb 4-1: new full speed USB device using uhci_hcd and address 2
ata1.00: configured for UDMA/100
usb 4-1: configuration #1 chosen from 1 choice
input: HP Virtual Keyboard as /class/input/input1
input: USB HID v1.01 Keyboard [HP Virtual Keyboard] on usb-0000:00:04.4-1
input: HP Virtual Keyboard as /class/input/input2
input: USB HID v1.01 Mouse [HP Virtual Keyboard] on usb-0000:00:04.4-1
ata2: SATA link down (SStatus 4 SControl 300)
usb 4-2: new full speed USB device using uhci_hcd and address 3
usb 4-2: configuration #1 chosen from 1 choice
hub 4-2:1.0: USB hub found
hub 4-2:1.0: 7 ports detected
ata3: SATA link down (SStatus 4 SControl 300)
ata4: SATA link down (SStatus 4 SControl 300)
  Vendor: Optiarc   Model: DVD RW AD-7561S   Rev: AH51
  Type:   CD-ROM                             ANSI SCSI revision: 05
Initializing USB Mass Storage driver...
usbcore: registered new driver usb-storage
USB Mass Storage support registered.
QLogic Fibre Channel HBA Driver
ACPI: PCI Interrupt 0000:45:00.2[C] -> GSI 35 (level, low) -> IRQ 50
qla2xxx 0000:45:00.2: Found an ISP8001, irq 50, iobase 0xffffc20010086000
qla2xxx 0000:45:00.2: Configuring PCI space...
PCI: Setting latency timer of device 0000:45:00.2 to 64
qla2xxx 0000:45:00.2: Configure NVRAM parameters...
qla2xxx 0000:45:00.2: Verifying loaded RISC code...
qla2xxx 0000:45:00.2: Allocated (64 KB) for EFT...
qla2xxx 0000:45:00.2: Allocated (1414 KB) for firmware dump...
scsi4 : qla2xxx
qla2xxx 0000:45:00.2: 
 QLogic Fibre Channel HBA Driver: 8.03.01.05.05.06-k
  QLogic QLE8152 - QLogic PCI-Express Dual Channel 10GbE CNA
  ISP8001: PCIe (2.5Gb/s x4) @ 0000:45:00.2 hdma+, host#=4, fw=5.02.01 (8d4)
ACPI: PCI Interrupt 0000:45:00.3[D] -> GSI 34 (level, low) -> IRQ 233
qla2xxx 0000:45:00.3: Found an ISP8001, irq 233, iobase 0xffffc200101f2000
qla2xxx 0000:45:00.3: Configuring PCI space...
PCI: Setting latency timer of device 0000:45:00.3 to 64
qla2xxx 0000:45:00.3: Configure NVRAM parameters...
qla2xxx 0000:45:00.3: Verifying loaded RISC code...
qla2xxx 0000:45:00.2: LOOP UP detected (10 Gbps).
  Vendor: Pillar    Model: Axiom 300         Rev: 0000
  Type:   Direct-Access                      ANSI SCSI revision: 05
SCSI device sda: 632216064 512-byte hdwr sectors (323695 MB)
sda: Write Protect is off
sda: Mode Sense: 87 00 00 08
SCSI device sda: drive cache: write through
SCSI device sda: 632216064 512-byte hdwr sectors (323695 MB)
sda: Write Protect is off
sda: Mode Sense: 87 00 00 08
SCSI device sda: drive cache: write through
 sda: sda1
sd 4:0:0:0: Attached scsi disk sda
  Vendor: Pillar    Model: Axiom 300         Rev: 0000
  Type:   Direct-Access                      ANSI SCSI revision: 05
SCSI device sdb: 2098788864 512-byte hdwr sectors (1074580 MB)
sdb: Write Protect is off
sdb: Mode Sense: 7f 00 00 08
SCSI device sdb: drive cache: write through
SCSI device sdb: 2098788864 512-byte hdwr sectors (1074580 MB)
sdb: Write Protect is off
sdb: Mode Sense: 7f 00 00 08
SCSI device sdb: drive cache: write through
 sdb: unknown partition table
sd 4:0:0:1: Attached scsi disk sdb
  Vendor: Pillar    Model: Axiom 300         Rev: 0000
  Type:   Direct-Access                      ANSI SCSI revision: 05
SCSI device sdc: 632216064 512-byte hdwr sectors (323695 MB)
sdc: Write Protect is off
sdc: Mode Sense: 87 00 00 08
SCSI device sdc: drive cache: write through
SCSI device sdc: 632216064 512-byte hdwr sectors (323695 MB)
sdc: Write Protect is off
sdc: Mode Sense: 87 00 00 08
SCSI device sdc: drive cache: write through
 sdc: sdc1
sd 4:0:1:0: Attached scsi disk sdc
  Vendor: Pillar    Model: Axiom 300         Rev: 0000
  Type:   Direct-Access                      ANSI SCSI revision: 05
SCSI device sdd: 2098788864 512-byte hdwr sectors (1074580 MB)
sdd: Write Protect is off
sdd: Mode Sense: 7f 00 00 08
SCSI device sdd: drive cache: write through
SCSI device sdd: 2098788864 512-byte hdwr sectors (1074580 MB)
sdd: Write Protect is off
sdd: Mode Sense: 7f 00 00 08
SCSI device sdd: drive cache: write through
 sdd: unknown partition table
sd 4:0:1:1: Attached scsi disk sdd
qla2xxx 0000:45:00.3: Allocated (64 KB) for EFT...
qla2xxx 0000:45:00.3: Allocated (1414 KB) for firmware dump...
scsi5 : qla2xxx
qla2xxx 0000:45:00.3: 
 QLogic Fibre Channel HBA Driver: 8.03.01.05.05.06-k
  QLogic QLE8152 - QLogic PCI-Express Dual Channel 10GbE CNA
  ISP8001: PCIe (2.5Gb/s x4) @ 0000:45:00.3 hdma+, host#=5, fw=5.02.01 (8d4)
device-mapper: uevent: version 1.0.3
device-mapper: ioctl: 4.11.5-ioctl (2007-12-12) initialised: dm-devel
qla2xxx 0000:45:00.3: LOOP UP detected (10 Gbps).
device-mapper: dm-raid45: initialized v0.2594l
  Vendor: Pillar    Model: Axiom 300         Rev: 0000
  Type:   Direct-Access                      ANSI SCSI revision: 05
SCSI device sde: 632216064 512-byte hdwr sectors (323695 MB)
sde: Write Protect is off
sde: Mode Sense: 87 00 00 08
SCSI device sde: drive cache: write through
SCSI device sde: 632216064 512-byte hdwr sectors (323695 MB)
sde: Write Protect is off
sde: Mode Sense: 87 00 00 08
SCSI device sde: drive cache: write through
 sde: sde1
sd 5:0:0:0: Attached scsi disk sde
  Vendor: Pillar    Model: Axiom 300         Rev: 0000
  Type:   Direct-Access                      ANSI SCSI revision: 05
SCSI device sdf: 2098788864 512-byte hdwr sectors (1074580 MB)
sdf: Write Protect is off
sdf: Mode Sense: 7f 00 00 08
SCSI device sdf: drive cache: write through
SCSI device sdf: 2098788864 512-byte hdwr sectors (1074580 MB)
sdf: Write Protect is off
sdf: Mode Sense: 7f 00 00 08
SCSI device sdf: drive cache: write through
 sdf: unknown partition table
sd 5:0:0:1: Attached scsi disk sdf
  Vendor: Pillar    Model: Axiom 300         Rev: 0000
  Type:   Direct-Access                      ANSI SCSI revision: 05
SCSI device sdg: 632216064 512-byte hdwr sectors (323695 MB)
sdg: Write Protect is off
sdg: Mode Sense: 87 00 00 08
SCSI device sdg: drive cache: write through
SCSI device sdg: 632216064 512-byte hdwr sectors (323695 MB)
sdg: Write Protect is off
sdg: Mode Sense: 87 00 00 08
SCSI device sdg: drive cache: write through
 sdg: sdg1
sd 5:0:1:0: Attached scsi disk sdg
  Vendor: Pillar    Model: Axiom 300         Rev: 0000
  Type:   Direct-Access                      ANSI SCSI revision: 05
SCSI device sdh: 2098788864 512-byte hdwr sectors (1074580 MB)
sdh: Write Protect is off
sdh: Mode Sense: 7f 00 00 08
SCSI device sdh: drive cache: write through
SCSI device sdh: 2098788864 512-byte hdwr sectors (1074580 MB)
sdh: Write Protect is off
sdh: Mode Sense: 7f 00 00 08
SCSI device sdh: drive cache: write through
 sdh: unknown partition table
sd 5:0:1:1: Attached scsi disk sdh
kjournald starting.  Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
SELinux:  Disabled at runtime.
SELinux:  Unregistering netfilter hooks
type=1404 audit(1295321081.970:2): selinux=0 auid=4294967295 ses=4294967295
Broadcom NetXtreme II Gigabit Ethernet Driver bnx2 v2.0.8-rh (Oct 11, 2010)
ACPI: PCI Interrupt 0000:03:00.0[A] -> GSI 40 (level, low) -> IRQ 193
PCI: Setting latency timer of device 0000:03:00.0 to 64
eth0: Broadcom NetXtreme II BCM5709 1000Base-T (C0) PCI Express found at mem f6000000, IRQ 193, node addr 0025b321e0ca
ACPI: PCI Interrupt 0000:03:00.1[B] -> GSI 39 (level, low) -> IRQ 185
PCI: Setting latency timer of device 0000:03:00.1 to 64
eth1: Broadcom NetXtreme II BCM5709 1000Base-T (C0) PCI Express found at mem f4000000, IRQ 185, node addr 0025b321e0cc
ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 34 (level, low) -> IRQ 233
PCI: Setting latency timer of device 0000:41:00.0 to 64
eth2: Broadcom NetXtreme II BCM5709 1000Base-T (C0) PCI Express found at mem fa000000, IRQ 233, node addr 0025b321e0c6
ACPI: PCI Interrupt 0000:41:00.1[B] -> GSI 33 (level, low) -> IRQ 225
PCI: Setting latency timer of device 0000:41:00.1 to 64
eth3: Broadcom NetXtreme II BCM5709 1000Base-T (C0) PCI Express found at mem f8000000, IRQ 225, node addr 0025b321e0c8
k10temp 0000:00:18.3: unreliable CPU thermal sensor; monitoring disabled
k10temp 0000:00:19.3: unreliable CPU thermal sensor; monitoring disabled
ACPI: PCI Interrupt 0000:00:04.2[B] -> GSI 45 (level, low) -> IRQ 146
Floppy drive(s): fd0 is 1.44M
scsi 0:0:0:0: Attached scsi generic sg0 type 5
sd 4:0:0:0: Attached scsi generic sg1 type 0
sd 4:0:0:1: Attached scsi generic sg2 type 0
sd 4:0:1:0: Attached scsi generic sg3 type 0
sd 4:0:1:1: Attached scsi generic sg4 type 0
sd 5:0:0:0: Attached scsi generic sg5 type 0
sd 5:0:0:1: Attached scsi generic sg6 type 0
sd 5:0:1:0: Attached scsi generic sg7 type 0
sd 5:0:1:1: Attached scsi generic sg8 type 0
shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
input: PC Speaker as /class/input/input3
802.1Q VLAN Support v1.8 Ben Greear <greearb>
All bugs added by David S. Miller <davem>
EDAC MC: Ver: 2.0.1 Dec 19 2010
piix4_smbus 0000:00:06.0: Found 0000:00:06.0 device
ACPI: PCI Interrupt 0000:45:00.0[A] -> GSI 32 (level, low) -> IRQ 217
PCI: Setting latency timer of device 0000:45:00.0 to 64
EDAC amd64_edac:  Ver: 3.2.0 Dec 19 2010
qlge 0000:45:00.0: QLogic 10 Gigabit PCI-E Ethernet Driver 
qlge 0000:45:00.0: Driver name: qlge, Version: 1.00.00.25.
EDAC amd64: ECC is enabled by BIOS.
qlge 0000:45:00.0: Patch version: 2.6.16-2.6.18-p25, Release date: 100706.
EDAC amd64: ECC is enabled by BIOS.
qlge 0000:45:00.0: ql_display_dev_info: Function #0, Port #0, Rev ID = 20001010.
qlge 0000:45:00.0: ql_display_dev_info: MAC address 00:c0:dd:12:0e:4c
ACPI: PCI Interrupt 0000:45:00.1[B] -> GSI 36 (level, low) -> IRQ 209
PCI: Setting latency timer of device 0000:45:00.1 to 64
qlge 0000:45:00.1: ql_display_dev_info: Function #1, Port #1, Rev ID = 20001010.
qlge 0000:45:00.1: ql_display_dev_info: MAC address 00:c0:dd:12:0e:4e
EDAC MC: F10h CPU detected
EDAC MC0: Giving out device to amd64_edac Family 10h: DEV 0000:00:18.2
EDAC MC: F10h CPU detected
EDAC MC1: Giving out device to amd64_edac Family 10h: DEV 0000:00:19.2
sr0: scsi3-mmc drive: 24x/24x writer dvd-ram cd/rw xa/form2 cdda tray
Uniform CD-ROM driver Revision: 3.20
sr 0:0:0:0: Attached scsi CD-ROM sr0
floppy0: no floppy controllers found
work still pending
Floppy drive(s): fd0 is 1.44M
floppy0: no floppy controllers found
work still pending
lp: driver loaded but no devices found
ACPI: Power Button (FF) [PWRF]
ACPI: Mapper loaded
dell-wmi: No known WMI GUID found
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
device-mapper: multipath: version 1.0.5 loaded
EXT3 FS on dm-0, internal journal
kjournald starting.  Commit interval 5 seconds
EXT3 FS on dm-1, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
EXT3 FS on dm-2, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
EXT3 FS on dm-3, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
EXT3 FS on dm-5, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
EXT3 FS on cciss/c0d0p1, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
Adding 16777208k swap on /dev/system/swap.  Priority:-1 extents:1 across:16777208k
powernow-k8: Pre-initialization of ACPI failed
powernow-k8: Found 2 Six-Core AMD Opteron(tm) Processor 2431 processors (12 cpu cores) (version 2.20.00)
powernow-k8: Your BIOS does not provide _PSS objects.  PowerNow! does not work on SMP systems without _PSS objects.  Complain to your BIOS vendor.
powernow-k8: Your BIOS does not provide _PSS objects.  PowerNow! does not work on SMP systems without _PSS objects.  Complain to your BIOS vendor.
powernow-k8: Your BIOS does not provide _PSS objects.  PowerNow! does not work on SMP systems without _PSS objects.  Complain to your BIOS vendor.
powernow-k8: Your BIOS does not provide _PSS objects.  PowerNow! does not work on SMP systems without _PSS objects.  Complain to your BIOS vendor.
powernow-k8: Your BIOS does not provide _PSS objects.  PowerNow! does not work on SMP systems without _PSS objects.  Complain to your BIOS vendor.
powernow-k8: Your BIOS does not provide _PSS objects.  PowerNow! does not work on SMP systems without _PSS objects.  Complain to your BIOS vendor.
powernow-k8: Your BIOS does not provide _PSS objects.  PowerNow! does not work on SMP systems without _PSS objects.  Complain to your BIOS vendor.
powernow-k8: Your BIOS does not provide _PSS objects.  PowerNow! does not work on SMP systems without _PSS objects.  Complain to your BIOS vendor.
powernow-k8: Your BIOS does not provide _PSS objects.  PowerNow! does not work on SMP systems without _PSS objects.  Complain to your BIOS vendor.
powernow-k8: Your BIOS does not provide _PSS objects.  PowerNow! does not work on SMP systems without _PSS objects.  Complain to your BIOS vendor.
powernow-k8: Your BIOS does not provide _PSS objects.  PowerNow! does not work on SMP systems without _PSS objects.  Complain to your BIOS vendor.
powernow-k8: Your BIOS does not provide _PSS objects.  PowerNow! does not work on SMP systems without _PSS objects.  Complain to your BIOS vendor.
Loading iSCSI transport class v2.0-871.
cxgb3i: tag itt 0x1fff, 13 bits, age 0xf, 4 bits.
iscsi: registered transport (cxgb3i)
NET: Registered protocol family 10
lo: Disabled Privacy Extensions
IPv6 over IPv4 tunneling driver
Broadcom NetXtreme II CNIC Driver cnic v2.1.2 (May 26, 2010)
cnic: Added CNIC device: eth2
cnic: Added CNIC device: __tmp1031196009
cnic: Added CNIC device: eth0
cnic: Added CNIC device: eth1
Broadcom NetXtreme II iSCSI Driver bnx2i v2.1.3 (Aug 10, 2010)
iscsi: registered transport (bnx2i)
scsi6 : Broadcom Offload iSCSI Initiator
scsi7 : Broadcom Offload iSCSI Initiator
scsi8 : Broadcom Offload iSCSI Initiator
scsi9 : Broadcom Offload iSCSI Initiator
iscsi: registered transport (tcp)
device-mapper: multipath round-robin: version 1.0.0 loaded
iscsi: registered transport (iser)
iscsi: registered transport (be2iscsi)
bnx2: eth0: using MSIX
ADDRCONF(NETDEV_UP): eth0: link is not ready
bnx2i [41:00.00]: ISCSI_INIT passed
bnx2: eth0 NIC Copper Link is Up, 100 Mbps full duplex
ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Bluetooth: Core ver 2.10
NET: Registered protocol family 31
Bluetooth: HCI device and connection manager initialized
Bluetooth: HCI socket layer initialized
Bluetooth: L2CAP ver 2.8
Bluetooth: L2CAP socket layer initialized
Bluetooth: RFCOMM socket layer initialized
Bluetooth: RFCOMM TTY layer initialized
Bluetooth: RFCOMM ver 1.8
eth0: no IPv6 routers present
Bluetooth: HIDP (Human Interface Emulation) ver 1.1
Netfilter messages via NETLINK v0.30.
ip_conntrack version 2.4 (8192 buckets, 65536 max) - 304 bytes per conntrack
ip_tables: (C) 2000-2006 Netfilter Core Team
Bridge firewalling registered
Ebtables v2.0 registered
ip6_tables: (C) 2000-2006 Netfilter Core Team
virbr0: no IPv6 routers present
kjournald starting.  Commit interval 5 seconds
EXT3 FS on dm-6, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
INFO: task kswapd0:697 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kswapd0       D ffff81000100caa0     0   697    313           698   694 (L-TLB)
 ffff81082f699b00 0000000000000046 0000000000000010 00000000267d61f8
 0fd0000600000008 000000000000000a ffff81042f5cd0c0 ffff81082ffb90c0
 000004b227b3805b 0000000000036843 ffff81042f5cd2a8 000000022ff8e6f0
Call Trace:
 [<ffffffff88036d8a>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800276a9>] try_to_free_buffers+0x60/0xb9
 [<ffffffff880332da>] :jbd:journal_try_to_free_buffers+0x19d/0x1c0
 [<ffffffff800cd32c>] shrink_inactive_list+0x511/0x8d8
 [<ffffffff800131a5>] shrink_zone+0x127/0x18d
 [<ffffffff80057be8>] kswapd+0x33d/0x495
 [<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800578ab>] kswapd+0x0/0x495
 [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032974>] kthread+0xfe/0x132
 [<ffffffff8009f283>] request_module+0x0/0x14d
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032876>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

INFO: task kswapd0:697 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kswapd0       D ffff81000101d7a0     0   697    313           698   694 (L-TLB)
 ffff81082f699b00 0000000000000046 0000000000000002 0000000000000010
 ffff81052e6b5000 000000000000000a ffff81042f5cd0c0 ffff81082fe830c0
 0000092a5ff10c9e 000000000062bcda ffff81042f5cd2a8 0000000600000024
Call Trace:
 [<ffffffff88036d8a>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800276a9>] try_to_free_buffers+0x60/0xb9
 [<ffffffff880332da>] :jbd:journal_try_to_free_buffers+0x19d/0x1c0
 [<ffffffff800cd32c>] shrink_inactive_list+0x511/0x8d8
 [<ffffffff80047ff2>] __pagevec_release+0x19/0x22
 [<ffffffff800cccfa>] shrink_active_list+0x4b4/0x4c4
 [<ffffffff800131a5>] shrink_zone+0x127/0x18d
 [<ffffffff80057be8>] kswapd+0x33d/0x495
 [<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800578ab>] kswapd+0x0/0x495
 [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032974>] kthread+0xfe/0x132
 [<ffffffff8009f283>] request_module+0x0/0x14d
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032876>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

INFO: task kswapd0:697 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kswapd0       D ffff810001015120     0   697    313           698   694 (L-TLB)
 ffff81082f699b00 0000000000000046 0000000000000002 0000000000000010
 ffff810211fe0000 000000000000000a ffff81042f5cd0c0 ffff81082ff73040
 000009af68cd2c1c 00000000008aaa91 ffff81042f5cd2a8 0000000400000006
Call Trace:
 [<ffffffff88036d8a>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800276a9>] try_to_free_buffers+0x60/0xb9
 [<ffffffff880332da>] :jbd:journal_try_to_free_buffers+0x19d/0x1c0
 [<ffffffff800cd32c>] shrink_inactive_list+0x511/0x8d8
 [<ffffffff80047ff2>] __pagevec_release+0x19/0x22
 [<ffffffff800cccfa>] shrink_active_list+0x4b4/0x4c4
 [<ffffffff800131a5>] shrink_zone+0x127/0x18d
 [<ffffffff80057be8>] kswapd+0x33d/0x495
 [<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800578ab>] kswapd+0x0/0x495
 [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032974>] kthread+0xfe/0x132
 [<ffffffff8009f283>] request_module+0x0/0x14d
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032876>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

INFO: task fio:30488 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
fio           D ffff81043e0a9aa0     0 30488  27162         30489       (NOTLB)
 ffff8102272df768 0000000000000082 ffff81042de78c98 00000000798f4530
 ffff810780ca8ee8 0000000000000009 ffff81017c983860 ffff81082ff5e100
 0000294edeeb2129 00000000008bd33b ffff81017c983a48 0000000300000286
Call Trace:
 [<ffffffff8006ec4e>] do_gettimeofday+0x40/0x90
 [<ffffffff800154b2>] sync_buffer+0x0/0x3f
 [<ffffffff800637ca>] io_schedule+0x3f/0x67
 [<ffffffff800154ed>] sync_buffer+0x3b/0x3f
 [<ffffffff800639f6>] __wait_on_bit+0x40/0x6e
 [<ffffffff800154b2>] sync_buffer+0x0/0x3f
 [<ffffffff80063a90>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff800a28e2>] wake_bit_function+0x0/0x23
 [<ffffffff80025702>] __bread+0x6c/0x86
 [<ffffffff8804df2a>] :ext3:ext3_get_branch+0x67/0xd2
 [<ffffffff8804e1ad>] :ext3:ext3_get_blocks_handle+0xc7/0x9bc
 [<ffffffff8005c0fb>] cache_alloc_refill+0x106/0x186
 [<ffffffff8804edb1>] :ext3:ext3_get_block+0xb6/0xf7
 [<ffffffff8000e750>] __block_prepare_write+0x1a5/0x39e
 [<ffffffff8804ecfb>] :ext3:ext3_get_block+0x0/0xf7
 [<ffffffff800e3a43>] block_write_begin+0x80/0xcf
 [<ffffffff880503b0>] :ext3:ext3_write_begin+0xe8/0x1cc
 [<ffffffff8804ecfb>] :ext3:ext3_get_block+0x0/0xf7
 [<ffffffff8000fda3>] generic_file_buffered_write+0x14b/0x675
 [<ffffffff880317ae>] :jbd:journal_stop+0x1f3/0x1ff
 [<ffffffff80016679>] __generic_file_aio_write_nolock+0x369/0x3b6
 [<ffffffff80021850>] generic_file_aio_write+0x65/0xc1
 [<ffffffff8804c1b6>] :ext3:ext3_file_write+0x16/0x91
 [<ffffffff800182df>] do_sync_write+0xc7/0x104
 [<ffffffff8006723e>] do_page_fault+0x4fe/0x874
 [<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e
 [<ffffffff80016a81>] vfs_write+0xce/0x174
 [<ffffffff80017339>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

INFO: task fio:30488 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
fio           D ffff81043e0a9aa0     0 30488  27162         30489       (NOTLB)
 ffff8102272df768 0000000000000082 ffff81042de78c98 00000000798f4530
 ffff810780ca8ee8 0000000000000009 ffff81017c983860 ffff81082ff5e100
 0000294edeeb2129 00000000008bd33b ffff81017c983a48 0000000300000286
Call Trace:
 [<ffffffff8006ec4e>] do_gettimeofday+0x40/0x90
 [<ffffffff800154b2>] sync_buffer+0x0/0x3f
 [<ffffffff800637ca>] io_schedule+0x3f/0x67
 [<ffffffff800154ed>] sync_buffer+0x3b/0x3f
 [<ffffffff800639f6>] __wait_on_bit+0x40/0x6e
 [<ffffffff800154b2>] sync_buffer+0x0/0x3f
 [<ffffffff80063a90>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff800a28e2>] wake_bit_function+0x0/0x23
 [<ffffffff80025702>] __bread+0x6c/0x86
 [<ffffffff8804df2a>] :ext3:ext3_get_branch+0x67/0xd2
 [<ffffffff8804e1ad>] :ext3:ext3_get_blocks_handle+0xc7/0x9bc
 [<ffffffff8005c0fb>] cache_alloc_refill+0x106/0x186
 [<ffffffff8804edb1>] :ext3:ext3_get_block+0xb6/0xf7
 [<ffffffff8000e750>] __block_prepare_write+0x1a5/0x39e
 [<ffffffff8804ecfb>] :ext3:ext3_get_block+0x0/0xf7
 [<ffffffff800e3a43>] block_write_begin+0x80/0xcf
 [<ffffffff880503b0>] :ext3:ext3_write_begin+0xe8/0x1cc
 [<ffffffff8804ecfb>] :ext3:ext3_get_block+0x0/0xf7
 [<ffffffff8000fda3>] generic_file_buffered_write+0x14b/0x675
 [<ffffffff880317ae>] :jbd:journal_stop+0x1f3/0x1ff
 [<ffffffff80016679>] __generic_file_aio_write_nolock+0x369/0x3b6
 [<ffffffff80021850>] generic_file_aio_write+0x65/0xc1
 [<ffffffff8804c1b6>] :ext3:ext3_file_write+0x16/0x91
 [<ffffffff800182df>] do_sync_write+0xc7/0x104
 [<ffffffff8006723e>] do_page_fault+0x4fe/0x874
 [<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e
 [<ffffffff80016a81>] vfs_write+0xce/0x174
 [<ffffffff80017339>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

sd 5:0:1:1: timing out command, waited 360s
sd 5:0:1:1: SCSI error: return code = 0x06000028
end_request: I/O error, dev sdh, sector 3193936
INFO: task ls:4002 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ls            D ffff81043e0a9aa0     0  4002   3221                     (NOTLB)
 ffff810473e7bd08 0000000000000082 ffff81042de78c98 0000000079bfdfe8
 ffff8106b3c35cd8 0000000000000008 ffff81040cce60c0 ffff81082ff5e100
 00002a500b6569a0 000000000015dfc9 ffff81040cce62a8 0000000300000286
Call Trace:
 [<ffffffff8006ec4e>] do_gettimeofday+0x40/0x90
 [<ffffffff800154b2>] sync_buffer+0x0/0x3f
 [<ffffffff800637ca>] io_schedule+0x3f/0x67
 [<ffffffff800154ed>] sync_buffer+0x3b/0x3f
 [<ffffffff800639f6>] __wait_on_bit+0x40/0x6e
 [<ffffffff800154b2>] sync_buffer+0x0/0x3f
 [<ffffffff80063a90>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff800a28e2>] wake_bit_function+0x0/0x23
 [<ffffffff8804dc3f>] :ext3:__ext3_get_inode_loc+0x2a9/0x2f9
 [<ffffffff8804dcc3>] :ext3:ext3_reserve_inode_write+0x23/0x90
 [<ffffffff8804dd51>] :ext3:ext3_mark_inode_dirty+0x21/0x3c
 [<ffffffff88050cae>] :ext3:ext3_dirty_inode+0x63/0x7b
 [<ffffffff80013c94>] __mark_inode_dirty+0x29/0x16e
 [<ffffffff800258ae>] filldir+0x0/0xb7
 [<ffffffff800353a9>] vfs_readdir+0x8c/0xa9
 [<ffffffff80038c2d>] sys_getdents+0x75/0xbd
 [<ffffffff8005d229>] tracesys+0x71/0xe0
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

INFO: task ls:4002 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ls            D ffff81043e0a9aa0     0  4002   3221                     (NOTLB)
 ffff810473e7bd08 0000000000000082 ffff81042de78c98 0000000079bfdfe8
 ffff8106b3c35cd8 0000000000000008 ffff81040cce60c0 ffff81082ff5e100
 00002a500b6569a0 000000000015dfc9 ffff81040cce62a8 0000000300000286
Call Trace:
 [<ffffffff8006ec4e>] do_gettimeofday+0x40/0x90
 [<ffffffff800154b2>] sync_buffer+0x0/0x3f
 [<ffffffff800637ca>] io_schedule+0x3f/0x67
 [<ffffffff800154ed>] sync_buffer+0x3b/0x3f
 [<ffffffff800639f6>] __wait_on_bit+0x40/0x6e
 [<ffffffff800154b2>] sync_buffer+0x0/0x3f
 [<ffffffff80063a90>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff800a28e2>] wake_bit_function+0x0/0x23
 [<ffffffff8804dc3f>] :ext3:__ext3_get_inode_loc+0x2a9/0x2f9
 [<ffffffff8804dcc3>] :ext3:ext3_reserve_inode_write+0x23/0x90
 [<ffffffff8804dd51>] :ext3:ext3_mark_inode_dirty+0x21/0x3c
 [<ffffffff88050cae>] :ext3:ext3_dirty_inode+0x63/0x7b
 [<ffffffff80013c94>] __mark_inode_dirty+0x29/0x16e
 [<ffffffff800258ae>] filldir+0x0/0xb7
 [<ffffffff800353a9>] vfs_readdir+0x8c/0xa9
 [<ffffffff80038c2d>] sys_getdents+0x75/0xbd
 [<ffffffff8005d229>] tracesys+0x71/0xe0
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

INFO: task ls:6734 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ls            D ffff8108296650c0     0  6734   6647                     (NOTLB)
 ffff81069ab37e88 0000000000000082 ffff81079a26c060 ffffffff8000da09
 ffff81081e0a8448 0000000000000008 ffff81042f966820 ffff8108296650c0
 00002a7750d6ae3b 00000000002fc400 ffff81042f966a08 00000003ffffff9c
Call Trace:
 [<ffffffff8000da09>] permission+0x81/0xc8
 [<ffffffff80022214>] __up_read+0x19/0x7f
 [<ffffffff800258ae>] filldir+0x0/0xb7
 [<ffffffff80063c4f>] __mutex_lock_slowpath+0x60/0x9b
 [<ffffffff80063c99>] .text.lock.mutex+0xf/0x14
 [<ffffffff80035379>] vfs_readdir+0x5c/0xa9
 [<ffffffff80038c2d>] sys_getdents+0x75/0xbd
 [<ffffffff8005d229>] tracesys+0x71/0xe0
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

sd 5:0:1:1: timing out command, waited 360s
sd 5:0:1:1: SCSI error: return code = 0x06000028
end_request: I/O error, dev sdh, sector 8600
EXT3-fs error (device dm-6): ext3_get_inode_loc: unable to read inode block - inode=2, block=1027
Aborting journal on device dm-6.
EXT3-fs error (device dm-6) in ext3_ordered_writepage: IO failure
EXT3-fs error (device dm-6) in ext3_reserve_inode_write: IO failure
EXT3-fs error (device dm-6) in ext3_dirty_inode: IO failure
ext3_abort called.
EXT3-fs error (device dm-6): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
ext3_abort called.
EXT3-fs error (device dm-6): ext3_put_super: Couldn't clean up the journal
kjournald starting.  Commit interval 5 seconds
EXT3 FS on dm-6, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
INFO: task fio:970 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
fio           D ffff81043e0a9aa0     0   970  27921           971       (NOTLB)
 ffff81059079b768 0000000000000082 ffff81042de78c98 00000000798832a0
 ffff81082bf6ce10 0000000000000009 ffff81082b967080 ffff81082ff5e100
 000039676d87a549 000000000015101f ffff81082b967268 0000000300000286
Call Trace:
 [<ffffffff8006ec4e>] do_gettimeofday+0x40/0x90
 [<ffffffff800154b2>] sync_buffer+0x0/0x3f
 [<ffffffff800637ca>] io_schedule+0x3f/0x67
 [<ffffffff800154ed>] sync_buffer+0x3b/0x3f
 [<ffffffff800639f6>] __wait_on_bit+0x40/0x6e
 [<ffffffff800154b2>] sync_buffer+0x0/0x3f
 [<ffffffff80063a90>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff800a28e2>] wake_bit_function+0x0/0x23
 [<ffffffff80025702>] __bread+0x6c/0x86
 [<ffffffff8804df2a>] :ext3:ext3_get_branch+0x67/0xd2
 [<ffffffff8804e1ad>] :ext3:ext3_get_blocks_handle+0xc7/0x9bc
 [<ffffffff8006202a>] __memset+0x1e/0xc0
 [<ffffffff8005c0fb>] cache_alloc_refill+0x106/0x186
 [<ffffffff8804edb1>] :ext3:ext3_get_block+0xb6/0xf7
 [<ffffffff8000e750>] __block_prepare_write+0x1a5/0x39e
 [<ffffffff8804ecfb>] :ext3:ext3_get_block+0x0/0xf7
 [<ffffffff800e3a43>] block_write_begin+0x80/0xcf
 [<ffffffff880503b0>] :ext3:ext3_write_begin+0xe8/0x1cc
 [<ffffffff8804ecfb>] :ext3:ext3_get_block+0x0/0xf7
 [<ffffffff8000fda3>] generic_file_buffered_write+0x14b/0x675
 [<ffffffff880317ae>] :jbd:journal_stop+0x1f3/0x1ff
 [<ffffffff80016679>] __generic_file_aio_write_nolock+0x369/0x3b6
 [<ffffffff80021850>] generic_file_aio_write+0x65/0xc1
 [<ffffffff8804c1b6>] :ext3:ext3_file_write+0x16/0x91
 [<ffffffff800182df>] do_sync_write+0xc7/0x104
 [<ffffffff8006723e>] do_page_fault+0x4fe/0x874
 [<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e
 [<ffffffff80016a81>] vfs_write+0xce/0x174
 [<ffffffff80017339>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

INFO: task kswapd0:697 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kswapd0       D ffff810001015120     0   697    313           698   694 (L-TLB)
 ffff81082f699b00 0000000000000046 0000000000000010 0000000005043e18
 0fd0000600000008 000000000000000a ffff81042f5cd0c0 ffff81082ff73040
 00004824fdbdb689 00000000000cceb1 ffff81042f5cd2a8 0000000400000014
Call Trace:
 [<ffffffff88036d8a>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800276a9>] try_to_free_buffers+0x60/0xb9
 [<ffffffff880332da>] :jbd:journal_try_to_free_buffers+0x19d/0x1c0
 [<ffffffff800cd32c>] shrink_inactive_list+0x511/0x8d8
 [<ffffffff80047ff2>] __pagevec_release+0x19/0x22
 [<ffffffff800cccfa>] shrink_active_list+0x4b4/0x4c4
 [<ffffffff800131a5>] shrink_zone+0x127/0x18d
 [<ffffffff80057be8>] kswapd+0x33d/0x495
 [<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800578ab>] kswapd+0x0/0x495
 [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032974>] kthread+0xfe/0x132
 [<ffffffff8009f283>] request_module+0x0/0x14d
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032876>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

Comment 46 Dwight (Bud) Brown 2011-01-19 23:35:51 UTC

(In reply to comment #44)
> "You are not authorized to access bug #580818"...

Use BZ 615543, it should be available.

Comment 47 Jeff Moyer 2011-01-31 20:07:33 UTC

(In reply to comment #43)
> The original problem within this BZ was on an LSI controller which was fixed by
> upgrading the FJ drive firmware (comment #20).
> 
> The >120s messages can be a symptom of many different issues, including just a
> very very busy system  (for example, set io timeout to 300s and anytime you
> encounter an io timeout you can get these messages if the task stall detection
> time is set less than the io timeout value).  
> 
> For Smart Array configurations encountering this type of message or longer term
> hang issues, I'd suggest using BZ 580818 instead.
> 
> For other configurations other than LSI or Smart Array I'd suggest opening a
> new BZ with appropriate details.

Thanks for the nice summary, Bud.  I'm closing this bug.  Please follow Bud's advice if you see these problems.  As always, filing a ticket with support is the preferred method, as bugzilla is a bug tracking tool, not a support tool.
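
Note: the relationship Bud describes (an I/O timeout longer than the task stall detection time) can be checked on a running system. The following is a minimal sketch, assuming the usual sysfs path /sys/block/sd*/device/timeout for the SCSI command timeout and the procfs file named in the warning itself; the helper name compare_timeouts is made up for illustration. In the log above the command on sdh was only timed out after 360s, well beyond the 120s stall threshold.

```shell
#!/bin/sh
# Compare each SCSI disk's command timeout with the hung-task threshold.
# compare_timeouts is a hypothetical helper, not part of any distribution tool.

# compare_timeouts IO_TIMEOUT HUNG_TIMEOUT -> prints a one-word verdict
compare_timeouts() {
    io=$1
    hung=$2
    if [ "$hung" -eq 0 ]; then
        echo disabled          # 0 means the warning is switched off
    elif [ "$io" -ge "$hung" ]; then
        echo warn              # a single slow I/O can trigger the message
    else
        echo ok
    fi
}

hung_file=/proc/sys/kernel/hung_task_timeout_secs
if [ -r "$hung_file" ]; then
    hung=$(cat "$hung_file")
    for f in /sys/block/sd*/device/timeout; do
        [ -r "$f" ] || continue     # skip if the glob matched nothing
        io=$(cat "$f")
        echo "$f: io=${io}s hung_task=${hung}s -> $(compare_timeouts "$io" "$hung")"
    done
fi
```

A "warn" verdict does not by itself indicate a fault; it only means that the warning can appear whenever one command runs into its timeout.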

Comment 48 George Rafaelov 2011-03-11 13:49:27 UTC

I have experienced the same problem on some of my machines.

INFO: task kswapd0:697 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

And it may be any process in the error text, for example klogd instead of kswapd.

The most interesting point is that it appeared after I updated to RHEL 5.6 and kernel-PAE-2.6.18-238.1.1.el5. So I booted kernel-PAE-2.6.18-194.32.1.el5 and the problem was gone. This week I tried the new kernel-PAE-2.6.18-238.5.1.el5, and the same failure occurred again.
I have this problem on most of my servers, which are HP DL360 G3 or G4, but not on all of them. The G5 worked without any problems. What can I do? Only wait for the next kernel release, hoping that the problem will be fixed?

Comment 49 Jeff Moyer 2011-03-11 14:30:43 UTC

(In reply to comment #48)
>  I have experienced the same problems with this issue on some of my machines.
> 
> INFO: task kswapd0:697 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> 
> And it may be any process in the error text, for example klogd instead of
> kswapd.
> 
> The most interesting point that it appered after I updated to RHEL 5.6 and
> kernel-PAE-2.6.18-238.1.1.el5. So I bootrd kernel-PAE-2.6.18-194.32.1.el5 and
> problem has gone. This week I tried new kernel-PAE-2.6.18-238.5.1.el5, and
> again the same failure.
>  I have this problem on most of my servers - that is HP DL360 G3 or G4, but not
> on all. G5 worked without any problems. What can I do? Only wait for the next
> kernel release hoping that the problem will be fixed?

Hi, George,

Can you provide some more information on your system?  Are you using a Smart Array controller?  Is there any other storage attached to the system?

Comment 50 George Rafaelov 2011-03-11 16:59:04 UTC

 Hi Jeffrey,
I'll try :)

Yes, I am using a Smart Array Controller:

[root@XXXXX ~]# hpacucli controller slot=0 logicaldrive 1 show status
logicaldrive 1 (279.4 GB, RAID RAID 1+0):  OK

[root@XXXXXX ~]# dmidecode | grep DL
        Product Name: ProLiant DL360 G4

No other storage is attached to the system. The system is under really high load; it's a mail server with a lot of traffic. But there are no such failures with the last kernel:
[root@XXXXX ~]# uname -r
2.6.18-194.32.1.el5PAE

And the 238.1 and 238.5 kernel releases hang the server:
[root@XXXXXX ~]# rpm -aq| grep kernel | grep PAE
kernel-PAE-2.6.18-238.1.1.el5
kernel-PAE-devel-2.6.18-238.1.1.el5
kernel-PAE-devel-2.6.18-238.5.1.el5
kernel-PAE-2.6.18-194.32.1.el5
kernel-PAE-2.6.18-238.5.1.el5
kernel-PAE-devel-2.6.18-194.32.1.el5

The situation is the same on most of the 12 servers. All of them are G3 or G4; only one is a G5.

Comment 51 Willi Fehler 2011-03-13 17:09:37 UTC

Hi,

same problems here.

[root@XXXXX ~]# uname -r
2.6.18-194.32.1.el5

Raid-Controller: LSI 9750 8i, Seagate 2TB Constellation(SN11)

I've tried the following things but the situation is still the same.

1. changed the I/O scheduler to "noop"
2. set the I/O timeout to 300
3. changed the RAID hardware

Is the issue present only on LSI controllers? Has anybody tried to set "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"?

Please publish a fix for this issue; it's very sad to see that it has existed for so long. :(
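
Note: for anyone repeating the first two steps, here is a rough sketch of how the active I/O scheduler can be inspected. It assumes the standard sysfs layout, where /sys/block/<dev>/queue/scheduler lists the available schedulers with the active one in square brackets; the helper name active_scheduler is made up for illustration.

```shell
#!/bin/sh
# Show which I/O scheduler each sd* disk is currently using.

# active_scheduler "noop anticipatory deadline [cfq]" -> prints "cfq"
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\(.*\)\].*/\1/p'
}

for q in /sys/block/sd*/queue/scheduler; do
    [ -r "$q" ] || continue          # skip if the glob matched nothing
    dev=${q#/sys/block/}
    dev=${dev%%/*}
    echo "$dev: $(active_scheduler "$(cat "$q")")"
done

# To change the settings (as root, per device, lost at reboot):
#   echo noop > /sys/block/sdX/queue/scheduler
#   echo 300  > /sys/block/sdX/device/timeout
```

Both settings are per device and do not survive a reboot unless they are reapplied, for example from an init script.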

Comment 52 Willi Fehler 2011-03-13 17:12:22 UTC

Hi,

same problems here.

[root@XXXXX ~]# uname -r
2.6.18-194.32.1.el5

Raid-Controller: LSI 9750 8i, Seagate 2TB Constellation(SN11)

I've tried the following things but the situation is still the same.

1. changed the I/O scheduler to "noop"
2. set the I/O timeout to 300
3. changed the RAID hardware

Is the issue present only on LSI controllers? Has anybody tried to set "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"?

Please publish a fix for this issue; it's very sad to see that it has existed for so long. :(

Comment 53 George Rafaelov 2011-03-13 19:56:14 UTC

(In reply to comment #52)
> Hi,
> same problems here.
> [root@XXXXX ~]# uname -r
> 2.6.18-194.32.1.el5
> Raid-Controller: LSI 9750 8i, Seagate 2TB Constellation(SN11)
> I've tried the follwing things but the situation is still the same.
> 1. change the i/o scheduler to "noop"
> 2. set i/o timeout to 300
> 3. changed raid/hardware
> The issue is present only on LSI-Controllers? Anbody tried to set "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs"?
> Please publish a fix for this issue, it's very sad to see that this issue
> exists so long. :(

Hi Willi,
A little correction: I didn't have this problem on the 2.6.18-194.32.1.el5 kernel. Just the opposite, I booted back into this kernel.
The problem is with 2.6.18-238. I haven't tried changing the I/O scheduler or timeout yet, because I think that everything should work without any changes to the default values, and Red Hat should fix this issue.

Comment 54 Jeff Moyer 2011-03-14 18:28:29 UTC

(In reply to comment #50)

Hi, George,

> Yes, I am using a Smart Array Controller:

OK.  It is likely that you are running into the firmware issue described by Bud Brown in comment #43. See bug #615543.  Producing a vmcore and posting a link to it in that bug (615543) would be a good way to determine whether you're hitting that problem or something different.  To be clear, if it is a hung I/O in the cciss firmware, then there's nothing we at Red Hat can do to fix it.  We are working with HP to get a fix for the issue, but it is very much a firmware issue, not an O/S issue.

Comment 55 Jeff Moyer 2011-03-14 18:31:12 UTC

(In reply to comment #51)
> Hi,
> 
> same problems here.
> 
> [root@XXXXX ~]# uname -r
> 2.6.18-194.32.1.el5
> 
> Raid-Controller: LSI 9750 8i, Seagate 2TB Constellation(SN11)
> 
> I've tried the follwing things but the situation is still the same.
> 
> 1. change the i/o scheduler to "noop"
> 2. set i/o timeout to 300
> 3. changed raid/hardware
> 
> The issue is present only on LSI-Controllers? Anbody tried to set "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs"?
> 
> Please publish a fix for this issue, it's very sad to see that this issue
> exists so long. :(

Hi, Willi,

Please file a ticket with support.  The issue you are hitting is definitely different from what was reported in this bugzilla.

Comment 56 Willi Fehler 2011-03-14 18:41:29 UTC

Hi Jeffrey,

why is this different? I have the same error message as in this bugzilla
and I am also using an LSI RAID controller.

Can I fix this issue with "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"?

Regards - Willi

Comment 57 Jeff Moyer 2011-03-14 19:18:18 UTC

The issue this bugzilla initially covered was corrected by updating firmware on the HDD.  Echo-ing 0 to the hung_task_timeout_secs file will simply turn off the warning without addressing the problem.

Please contact support and open a case so we can get to the bottom of your problem.
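
Note: since silencing the warning hides the symptom rather than the cause, it can be more useful to summarise which tasks are being reported before opening a case. A minimal sketch, assuming the message format shown in the logs in this bug ("INFO: task NAME:PID blocked for more than ..."); the function name summarize_hung_tasks is made up for illustration.

```shell
#!/bin/sh
# summarize_hung_tasks: read kernel log text on stdin and print
# "COUNT NAME" for every task named in a hung-task warning,
# most frequently reported first.
summarize_hung_tasks() {
    sed -n 's/^.*INFO: task \(.*\):[0-9][0-9]* blocked for more than.*$/\1/p' \
        | sort | uniq -c | sort -rn | sed 's/^ *//'
}

# Typical use (the log path may differ on your system):
#   summarize_hung_tasks < /var/log/messages
#   dmesg | summarize_hung_tasks
```

A mix of kswapd, kjournald and ordinary user processes all waiting in the block layer, as in the traces above, points at storage rather than at any one program.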

Comment 58 Willi Fehler 2011-03-14 21:36:53 UTC

Hi Jeffrey,

unfortunately we are using CentOS 5.5 x64 and we haven't got any support, so I think I'm not able to open a support case? :(

But I know CentOS is just a rebuild of Red Hat Enterprise Linux. I will keep checking this case, because it looks like some people still have the same issue and are waiting for a fix.

Could anybody who has updated to RHEL 5.6 and booted back into 2.6.18-194.32.1.el5 please confirm that the issue is gone?

I think I have 3 options: open a support case with LSI, check this case for updates, or change my RAID controller to an Adaptec 5805Z.

Any hints?

Regards, Willi

Comment 60 George Rafaelov 2011-03-17 11:49:23 UTC

(In reply to comment #54)
> (In reply to comment #50)
> Hi, George,
> > Yes, I am using a Smart Array Controller:
> OK.  It is likely that you are running into the firmware issue described by Bud
> Brown in comment #43. See bug #615543.  Producing a vmcore and posting a link
> to it in that bug (615543) would be a good way to determine whether you're
> hitting that problem or something different.  To be clear, if it is a hung I/O
> in the cciss firmware, then there's nothing we at Red Hat can do to fix it.  We
> are working with HP to get a fix for the issue, but it is very much a firmware
> issue, not an O/S issue.


Hi Jeffrey,
I upgraded the firmware from the official HP distribution (it was a DL360 G4) and installed a fresh HP Support Pack. Unfortunately, it didn't help; the machine hung again.

I didn't quite understand what you meant by:

>"Producing a vmcore and posting a link
> to it in that bug (615543) would be a good way to determine whether you're
> hitting that problem or something different"

What is a vmcore and how can I produce it?
Now I have to boot back into the 2.6.18-194.32.1.el5 kernel. :(
Any other ideas? Perhaps stop the HP Support Pack services as a test?
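
Note on the vmcore question: a vmcore is an image of kernel memory saved when the kernel crashes. On RHEL 5 it is normally captured by kdump, which needs memory reserved at boot with a crashkernel= parameter. The sketch below only checks whether that reservation is present; the helper name has_crashkernel is made up for illustration, and the commented commands are the usual RHEL 5 steps as far as I know, so please verify them against the kdump documentation for your release.

```shell
#!/bin/sh
# Check whether the running kernel was booted with a crashkernel= reservation.

# has_crashkernel CMDLINE -> exit status 0 if crashkernel= is present
has_crashkernel() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

if [ -r /proc/cmdline ]; then
    if has_crashkernel "$(cat /proc/cmdline)"; then
        echo "crashkernel reservation present"
    else
        echo "no crashkernel= on the kernel command line"
    fi
fi

# Usual follow-up steps (run as root; the last one crashes the machine
# on purpose, so only use it on a system you can afford to reboot):
#   chkconfig kdump on && service kdump start
#   echo c > /proc/sysrq-trigger
# By default kdump writes the dump under /var/crash.
```

For a hang rather than a crash, the dump has to be forced while the system is stuck, for example from the console, which is why a reserved crash kernel must already be in place beforehand.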

Comment 61 George Rafaelov 2011-03-17 11:52:03 UTC

(In reply to comment #58)
> Hi Jeffrey,
> unfortunately we are using CentOS 5.5 x64 and we haven't got any support, so I
> think I'm not able to open a support case? :(
> But I know CentOS is just a rebuild of RedHat Enterprise. I will check this
> support case, because it looks like some people still have the same issue and
> wait for a fix.
> Could please anybody confirm, who has updated to RHEL-5.6 and booted back
> 2.6.18-194.32.1.el5 that the issue is gone?

Yes, 2.6.18-194.32.1.el5 is the last working kernel for me.

> I think a have 3 options, open a support case on LSI, check this case for
> updates or change my Raid Controller to an Apdatec 5805Z.
> Any hints?
> Regards, Willi

Comment 62 Willi Fehler 2011-03-21 22:51:15 UTC

(In reply to comment #61)
> (In reply to comment #58)
> > Hi Jeffrey,
> > unfortunately we are using CentOS 5.5 x64 and we haven't got any support, so I
> > think I'm not able to open a support case? :(
> > But I know CentOS is just a rebuild of RedHat Enterprise. I will check this
> > support case, because it looks like some people still have the same issue and
> > wait for a fix.
> > Could please anybody confirm, who has updated to RHEL-5.6 and booted back
> > 2.6.18-194.32.1.el5 that the issue is gone?
> 
> Yes, 2.6.18-194.32.1.el5 is the last working kernel for me.

And with RHEL 5.5 you had the same issue as me? Let's see whether CentOS 5.6 fixes my issue when booted back into 2.6.18-194.32.1.el5.

Comment 63 George Rafaelov 2011-03-22 07:46:42 UTC

(In reply to comment #62)
> (In reply to comment #61)
> > (In reply to comment #58)
> > > Hi Jeffrey,
> > > unfortunately we are using CentOS 5.5 x64 and we haven't got any support, so I
> > > think I'm not able to open a support case? :(
> > > But I know CentOS is just a rebuild of RedHat Enterprise. I will check this
> > > support case, because it looks like some people still have the same issue and
> > > wait for a fix.
> > > Could please anybody confirm, who has updated to RHEL-5.6 and booted back
> > > 2.6.18-194.32.1.el5 that the issue is gone?
> > 
> > Yes, 2.6.18-194.32.1.el5 is the last working kernel for me.
> And with RHEL-5.5 you had the same issue like me? Let's see if CentOS-5.6 will
> fix my issue if booted back 2.6.18-192.32.1.el5.

I have 2 machines with CentOS but I am afraid to update them now, though they are installed on ESX, so if it is a hardware issue (as it is, 99%) it shouldn't be reproducible there.

By the way, can anybody tell me how to do the following:
I want to update with "yum update", but I don't want yum to remove the old kernel package 2.6.18-194.32.1.el5. Is that possible? It wants to remove it (it keeps only 3 installed kernels, as far as I understand).
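
Note on keeping the old kernel: the number of kernels yum keeps is normally governed by the installonly_limit option in the [main] section of /etc/yum.conf. The following is a small sketch for reading the current value, assuming your yum version honours that option; the function name get_installonly_limit is made up for illustration.

```shell
#!/bin/sh
# get_installonly_limit FILE -> print the installonly_limit value set in a
# yum configuration file, or nothing if the option is not set there.
get_installonly_limit() {
    sed -n 's/^[[:space:]]*installonly_limit[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*$/\1/p' "$1" \
        | tail -n 1
}

if [ -r /etc/yum.conf ]; then
    limit=$(get_installonly_limit /etc/yum.conf)
    echo "installonly_limit in /etc/yum.conf: ${limit:-not set}"
fi
```

Raising the value before running "yum update" should leave more of the older kernels installed, so the known-good one stays available in the boot menu.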

Comment 64 George Rafaelov 2011-04-08 20:45:09 UTC

Is there no chance of solving this problem and running the new kernel?

Comment 65 chris 2011-05-09 13:57:57 UTC

This is a cross-post to BZ 615543, as both seem related:

Although we run CentOS rather than RHEL, this is related: we're also experiencing this issue on two of our ~20 HP ProLiant DL380 G4 systems since upgrading from CentOS 5.5 to 5.6. We run 2.6.18-238.9.1.el5PAE and did not observe the bug before upgrading to 5.6.

In server A we have configured SmartArray to use 6 (original) disks in 3 x RAID 1 (with two disks each). Server B uses SmartArray with three disks, two in RAID 1 and one in RAID 0.

Crashing happens on a regular basis and can be provoked by putting high load on the servers, e.g. with bonnie++. The last bonnie++ run lasted only 3 hours until the server crashed.

Chris

Comment 66 Tomas Henzl 2011-05-09 14:40:00 UTC

You can try this upstream patch - https://lkml.org/lkml/2011/4/13/228 it has
reportedly helped in similar issues.
I'm not sure, but maybe it is already applied here -
http://people.redhat.com/jwilson/el5/

Comment 67 chris 2011-05-23 08:26:18 UTC

Tomas, thanks for the links and info. Last week I installed kernel 2.6.18-259.el5 from the website you gave me. Unfortunately, it took only two days until the next system hang with the same "blocked for more than 120 seconds" error message. If it helps, I can attach a stack trace from the system; just let me know.

Is there any reasonably easy way to find out if the patch you mentioned was already in 2.6.18-259? I suppose that it was either not included in 2.6.18-259, or we are experiencing some other issue here. I saw that 2.6.18-262 came out and wondered whether it is worth a try. Thanks!
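
Note: one quick, though not conclusive, way to look for a patch is to search the package changelog, which rpm can print for an installed package or for a package file. A minimal sketch follows; the function name changelog_mentions and the search keyword are placeholders, and the keyword should be taken from the subject line of the patch you are looking for.

```shell
#!/bin/sh
# changelog_mentions PATTERN: read changelog text on stdin and print the
# lines that contain PATTERN (case-insensitive).
changelog_mentions() {
    grep -i -- "$1"
}

# Typical use (the keyword here is only an example):
#   rpm -q  --changelog kernel-PAE-2.6.18-238.9.1.el5 | changelog_mentions cciss
#   rpm -qp --changelog kernel-2.6.18-262.el5.src.rpm | changelog_mentions cciss
```

An empty result is not proof that a change is missing, since the changelog only records what the packager chose to describe; inspecting the patches in the source RPM remains the reliable check.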

Comment 68 chris 2011-05-23 11:25:03 UTC

We just downloaded the src RPM of 2.6.18-262 and found that the patch provided at https://lkml.org/lkml/2011/4/13/228 (which may solve our problem) is not included in the kernel yet. What is the procedure to include the patch in an upcoming 2.6.18 kernel? Is there any reason it is not in the kernel yet?

We would like to avoid compiling a kernel manually, if possible. Given the number of recent server crashes, this problem has forced us to stay on old kernel versions (2.6.18-194) to remain operational.

Comment 69 Willi Fehler 2011-05-23 12:28:16 UTC

Hi,

a short update from my side. We were able to fix the issues.

First, I disabled command queuing for all logical drives; LSI support told me command queuing was enabled. Second, I changed the schedule of a monitoring script that executed "repquota -avg" every minute. This script now runs once a day.

For 1 month now everything has been working fine without any issues.
I've tested the following kernels:

CentOS 5.5
CentOS 5.6

2.6.18-194.x
2.6.18-238.x

Regards - Willi

Comment 70 George Rafaelov 2011-05-23 14:16:58 UTC

Hi Willi,
Can you explain in more detail how you solved this problem? I am still suffering from it :)

(In reply to comment #69)
> Hi,
> 
> a short update from my side. We could fix the issues.
> 
> First thing I've disabled command queueing for all logical drives. The LSI
> support told me command queueuing was enabled. 

How?

> Second thing I've 
> changed the runtime of a monitoring script, which executes a "repquota -avg"
> every minute. Now this script runs once a day.

What kind of script? I don't quite understand.

> 
> Since 1 months everything is working fine without any issues.
> I've tested the following kernels:
> 
> CentOS 5.5
> CentoS 5.6
> 
> 2.6.18-194.x
> 2.6.18-238.x
> 
> Regards - Willi

Comment 71 chris 2011-05-23 15:01:08 UTC

IMHO taking off the load from servers as you suggest cannot be part of the solution here. I wonder why this ticket was closed, as we are apparently not the only ones facing this issue. In case there is more debug information required, I'd gladly help out.

Comment 72 George Rafaelov 2011-05-23 15:36:44 UTC

I also wonder why the ticket is closed and what we should do to reopen it.
Taking load off the servers is really not a solution for me.

Comment 73 Tony Ellis 2011-05-27 16:43:30 UTC

Hi Willi

We have the same issue with our DL360s and DL380s since upgrading to CentOS 5.6. Any specifics on your fix would be helpful. Thanks!

(In reply to comment #69)
> Hi,
> 
> a short update from my side. We could fix the issues.
> 
> First thing I've disabled command queueing for all logical drives. The LSI
> support told me command queueuing was enabled. Second thing I've 
> changed the runtime of a monitoring script, which executes a "repquota -avg"
> every minute. Now this script runs once a day.
> 
> Since 1 months everything is working fine without any issues.
> I've tested the following kernels:
> 
> CentOS 5.5
> CentoS 5.6
> 
> 2.6.18-194.x
> 2.6.18-238.x
> 
> Regards - Willi

Comment 74 Willi Fehler 2011-05-27 19:35:46 UTC

Hi,

we have disabled command queuing on our servers (LSI 9750). You have to run
"tw_cli" and then enter the following command:

/c0/u0 set qpolicy=off

This disables command queuing for controller 0, unit 0.

The second thing we changed was one of our monitoring scripts. We're running Samba on our servers and we used a monitoring script that executes "repquota -avg"; we need it for our Nagios monitoring. The script was running every minute. Now we run it once a day, and since then we haven't had any issues.

But I don't know which change fixed the issue; we only changed command queuing and
the schedule of this script.

Regards - Willi
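
Note: the second change (running the quota report once a day instead of every minute) can also be done by caching the report, so that the monitoring check may still poll every minute without hitting the disks each time. This is only a sketch under stated assumptions: the function names are made up for illustration, the cache file location is up to you, and it is not the script Willi describes.

```shell
#!/bin/sh
# Serve a cached "repquota -avg" report and regenerate it at most once a day.

# cache_is_fresh FILE MAX_AGE_MIN -> exit status 0 if FILE exists and was
# modified less than MAX_AGE_MIN minutes ago
cache_is_fresh() {
    [ -f "$1" ] && [ -n "$(find "$1" -mmin "-$2" 2>/dev/null)" ]
}

# quota_report CACHE_FILE -> print the report, refreshing it when stale
quota_report() {
    cache=$1
    if ! cache_is_fresh "$cache" 1440; then      # 1440 minutes = 1 day
        repquota -avg > "$cache.tmp" 2>/dev/null && mv "$cache.tmp" "$cache"
    fi
    [ -f "$cache" ] && cat "$cache"
}

# Typical use from the monitoring check (the path is an example):
#   quota_report /var/tmp/repquota.cache
```

The report is written to a temporary file first and then moved into place, so a check that runs during regeneration never reads a half-written report.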

Comment 75 tc 2011-06-16 08:17:50 UTC

This should be reopened. It's a crash under load with the latest kernel, fixed by going back to the 2.6.18-194.32.1.el5 version.

This is on an HP ProLiant server with an HP 6i storage array, if it helps.

Comment 76 George Rafaelov 2011-06-16 09:03:36 UTC

 Dear RedHat support!

If you have read our last comments and are interested in this topic, please reopen this bug and fix it, or produce a new kernel branch, maybe under Update 7.
Don't you understand that a lot of people have problems with it?

Comment 77 chris 2011-06-17 05:43:00 UTC

Anybody following this thread and still seeking a solution: have a look at the related thread in #615543. Although some people suggest these two bug reports might not be related, upgrading the SmartArray firmware might help to fix the server hangs for you as well. Please read the other bug report carefully, particularly the most recent posts.

Comment 78 George Rafaelov 2011-06-29 11:01:37 UTC

Hi guys and special thanks to chris and Richard Godbee from bug #615543!

Finally, I solved this problem: the newest firmware helped with both the HP Smart
Array 5i (6 servers) and the HP Smart Array 6i (5 servers). Almost a whole week without any problems!

And I have no problems with the HP Smart Array P410i (but I have only one such server).

Comment 79 john broome 2011-08-15 17:48:41 UTC

(In reply to comment #77)
> Anybody following this thread and still seeking a solution for it: have a look
> at the related thread in #615543. Although some people suggest these two bug
> reports might not be related, upgrading the SmartArray firmware might help to
> fix the server hangs, maybe also for you. Plaese read the other bug report
> carefully, particularly the most recent posts.

BZ #615543 is inaccessible. Still having this on 
2.6.18-238.12.1.0.1.el5 #1 SMP Tue May 31 14:51:07 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
 and would like to clear it up.

thanks.

Comment 80 tc 2011-08-16 08:14:01 UTC

jbroome: Assuming you have an HP SmartArray storage controller of some sort, update the firmware.

This seems to fix it on that hardware, see above comments.

Otherwise, did you try the other suggestions in this thread? What are your exact hardware details?

Comment 81 Alzhy Le'Cruel 2011-09-15 19:23:02 UTC

If this is CLOSED NOTABUG, then what is it?

IBM x3850 M2 here. LSI-based ServeRAID MR10 SAS controllers.
RHEL 5.7, kernel 2.6.18-238.12.1.el5

Our firmware was upgraded to the latest and greatest on the RAID controller and other components back in February, when we had a series of occurrences. It did not seem to fix the issue, as we have had 2 successive incidents in the last month...

Comment 82 Alzhy Le'Cruel 2011-09-15 19:27:49 UTC

The system hosts an Oracle DB (11gR2) with SAN connections; the OS is on the local SAS disks, RAID1 on the LSI RAID controller. All of a sudden the system becomes CPU active, clocks, and hangs. System load goes over 200 (and this is just a 24-way server). Syslog always shows either kjournald or kswapd struggling and hitting hung_task_timeout.

See below:

INFO: task kjournald:21595 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kjournald     D ffff8108f9a24080     0 21595   1129         21601 21593 (L-TLB)
 ffff812004087dd0 0000000000000046 ffff81202a21adf0 ffff810926798730
 0000000000000000 000000000000000a ffff81202519a820 ffff8108f9a24080
 000799dedbffe81d 000000000000290b ffff81202519aa08 000000028008dc2c
Call Trace:
 [<ffffffff880335cf>] :jbd:journal_commit_transaction+0x16d/0x1066
 [<ffffffff800a28ec>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8004b425>] try_to_del_timer_sync+0x7f/0x88
 [<ffffffff880375d3>] :jbd:kjournald+0xc1/0x213
 [<ffffffff800a28ec>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800a26d4>] keventd_create_kthread+0x0/0xc4
 [<ffffffff88037512>] :jbd:kjournald+0x0/0x213
 [<ffffffff800a26d4>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032b26>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a26d4>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032a28>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

INFO: task orarootagent.bi:7219 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
orarootagent. D ffff8120191d9080     0  7219      1          7220  7217 (NOTLB)
 ffff8109da067be8 0000000000000086 0000000000000000 ffffffff880317ae
 0000000000000000 000000000000000a ffff811fed204080 ffff8120191d9080
 000799df8c0b52eb 000000000000d287 ffff811fed204268 0000000c00000000
Call Trace:
 [<ffffffff880317ae>] :jbd:journal_stop+0x1f3/0x1ff
 [<ffffffff8804ffb3>] :ext3:ext3_ordered_write_end+0xd7/0x116
 [<ffffffff88032002>] :jbd:start_this_handle+0x2e5/0x36c
 [<ffffffff800a28ec>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88032152>] :jbd:journal_start+0xc9/0x100
 [<ffffffff88050c73>] :ext3:ext3_dirty_inode+0x28/0x7b
 [<ffffffff80013d6c>] __mark_inode_dirty+0x29/0x16e
 [<ffffffff8001668a>] __generic_file_aio_write_nolock+0x28a/0x3b6
 [<ffffffff8002197e>] generic_file_aio_write+0x65/0xc1
 [<ffffffff8804c1b6>] :ext3:ext3_file_write+0x16/0x91
 [<ffffffff800183e4>] do_sync_write+0xc7/0x104
 [<ffffffff800a28ec>] autoremove_wake_function+0x0/0x2e
 [<ffffffff80016b71>] vfs_write+0xce/0x174
 [<ffffffff8001743e>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

INFO: task diskmon.bin:7242 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
diskmon.bin   D ffff811ffa917820     0  7242      1          7243  7240 (NOTLB)
 ffff8109d95ddbe8 0000000000000086 0000000000000000 ffffffff880317ae
 0000000000000000 000000000000000a ffff8120260e2100 ffff811ffa917820
 000799df8c103a91 0000000000008b00 ffff8120260e22e8 0000000c00000000
Call Trace:
 [<ffffffff880317ae>] :jbd:journal_stop+0x1f3/0x1ff
 [<ffffffff8804ffb3>] :ext3:ext3_ordered_write_end+0xd7/0x116


Should we just go to SAN Boot to avoid this malaise if the OS or the Firmware remains possibly faulty?
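
For anyone comparing notes across machines, here is a minimal Python sketch for tallying hung-task reports in a syslog excerpt like the one above. It only assumes the "INFO: task NAME:PID blocked for more than N seconds." line format shown in this comment; the function name tally_hung_tasks is made up for illustration and is not part of any Red Hat tool.

```python
import re
from collections import Counter

# Matches lines such as:
#   INFO: task kjournald:21595 blocked for more than 120 seconds.
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<name>.+?):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)

def tally_hung_tasks(log_text):
    """Count hung-task reports per (task name, pid) in a block of log text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = HUNG_TASK_RE.search(line)
        if match:
            counts[(match.group("name"), int(match.group("pid")))] += 1
    return counts

if __name__ == "__main__":
    sample = (
        "INFO: task kjournald:21595 blocked for more than 120 seconds.\n"
        "INFO: task orarootagent.bi:7219 blocked for more than 120 seconds.\n"
        "INFO: task diskmon.bin:7242 blocked for more than 120 seconds.\n"
    )
    for (name, pid), n in sorted(tally_hung_tasks(sample).items()):
        print(name, pid, n)
```

A tally like this says nothing about the cause; it only shows which tasks are reported as blocked and how often.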

Comment 83 Jeff Moyer 2011-09-15 19:48:49 UTC

Two things:

1) Please re-read comment #43, comment #46 and comment #47.
2) Bugzilla is not a support tool.  Please read the message on the bugzilla front page for more information on how to get your issues resolved (http://bugzilla.redhat.com).

Cheers,
Jeff

Comment 84 Scooter Morris 2011-09-20 15:56:30 UTC

(In reply to comment #83)
> Two things:
> 
> 1) Please re-read comment #43, comment #46 and comment #47.
> 2) Bugzilla is not a support tool.  Please read the message on the bugzilla
> front page for more information on how to get your issues resolved
> (http://bugzilla.redhat.com).
> 
> Cheers,
> Jeff

Could someone open Bug #615543 (or add me to the CC) so that I can comment in the proper bug?  Of all of the bugs mentioned in this thread (and I agree that this bug should be closed) this is the only open bug, which may be why people keep hammering on it...

Comment 85 Alzhy Le'Cruel 2011-09-21 13:54:01 UTC

So what is really so secret about #615543? What is it all about?


In our case, we've had 3 different x3850s with the LSI MegaRAID chip and all have exhibited the same issue. We last upgraded the firmware in February 2011 but the same issue persisted. We just upgraded to IBM's latest and greatest, which officially addresses VMware ESXi 4.1 "hangs"; IBM makes no mention of a similar Linux issue.

If the servers hang again -- we'd likely just move over to SAN Boot - skipping the LSI based RAIDed disk as OS boot disk.

Comment 86 Alessandro Surace 2011-11-18 10:06:51 UTC

Can you please specify which version?

Thanks
Alex
(In reply to comment #78)
> Hi guys and special thanks to chris and Richard Godbee from bug #615543!
> 
> Finally, I solved this problem, the very new firmware helped me both with HP
> Smart
> Array 5i (6 servers) and HP Smart Array 6i (5 servers). Almost a whole week
> without any problems!
> 
> And I have no problems with HP Smart Array P410i (but I have only one such
> server).

Comment 87 Mark 2011-12-19 16:28:58 UTC

HP DL380 G5
Smart Array P400 
Firmware Version: 7.22
2.6.18-274.3.1.0.1.el5

I haven't been able to access the BZs noted in comment 46, so I don't know what they're about. I assume they say to upgrade the SmartArray firmware; however, as seen above, I'm on the latest version.
I've had this server hang three times in three weeks with the "blocked for more than 120 seconds" message on the console, requiring a cold boot.  This server doesn't get hammered, so I doubt it's due to a busy system. I'll open a case with HP for this, but I wanted to say on the record here that this doesn't appear to be fixed.

Comment 88 ilya m. 2012-02-09 03:05:10 UTC

I have both latest IBM and latest/old HP hardware (G1, G5 and G7). HP uses Smart Array, IBM uses an LSI Logic array - the issue appears on all servers. I'm also running the latest firmware.

Kernels confirmed to be affected in my tests:
2.6.18-274.17.1.el5 (RHEL5.7 - latest as of this writing)
2.6.18-274.7.1.el5  (RHEL5.7 )
2.6.18-164.15.1.el5 (RHEL5.4 - last kernel released)

The IO scheduler was the default "cfq". I was able to reproduce this issue with noop, but since I did so many tests, I could have been mistaken. I'm running the noop test now. I will update this BZ if "noop" also has the same problems.
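
To double-check which elevator is actually active during a test run, something like the following Python sketch could help. It relies on the sysfs file /sys/block/<device>/queue/scheduler listing the available schedulers with the active one in square brackets (for example "noop anticipatory deadline [cfq]"). The device name "sda" and both function names are placeholders of mine, not anything taken from this report.

```python
import re

def active_scheduler(sysfs_text):
    """Return the scheduler name shown in [brackets], or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", sysfs_text)
    return match.group(1) if match else None

def read_active_scheduler(device="sda"):
    """Read the active scheduler for one block device from sysfs."""
    # "sda" is only a placeholder device name; substitute your own.
    with open("/sys/block/%s/queue/scheduler" % device) as f:
        return active_scheduler(f.read())
```

Calling active_scheduler() on saved copies of that file from each test run would make it easier to be sure which runs really used noop.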

I submitted the vmcore to Red Hat. I'm surprised to see that this issue has existed for almost 2 years on "enterprise" systems.

I will also try to recompile the kernel from source minus the 2 patches by Amerigo Wang. But since I see this issue in 2.6.18-164.15.1.el5, I doubt it will help.

If anyone found the solution - please respond to this thread. 

Jeffrey, if you believe this is not a bug, please provide the details to the rest of the world. References to closed/private bugs don't help if you have no access.

Comment 89 Kamil Porembiński 2012-03-13 15:59:47 UTC

I have the same issue on virtual and physical machines:

2.6.18-238.9.1.el5 (RH 5.6)
2.6.18-194.11.4.el5xen (RH 5.8)

I've read that to fix this problem

Comment 90 Petros 2012-04-23 15:21:13 UTC

You are not authorized to access bug #615543.

The problem is the same here, with SATA disks. I would like to read the other ticket, which has the solution, but have no rights. Why is a critical/blocker ticket closed to the world?

Comment 91 Dwight (Bud) Brown 2012-06-12 18:40:10 UTC

BZ 615543, referenced above, was about a specific HP Smart Array controller bug that has been addressed via firmware updates and a driver workaround (commits 07d0c38e7d84f911c72058a124c7f17b3c779a6 and 1ddd5049545e0aa1a0ed19bca4d9c9c3ce1ac8a2).  Both driver and firmware fixes were released in June 2011 time frame in the then current RHEL releases.

This BZ's original problem was fixed by upgrading the hard disk firmware.

The "blocked for more than 120 seconds" message literally has 100s if not 1000s of causes, the above being just 2 of them.  If the system only throws these occasionally, then it's very likely a temporary storage congestion issue outside of the host.  If it's a hard hang case, then to address that, or anything other than the above three causes, a support case should be opened.
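
One rough way to see that two "blocked for more than 120 seconds" reports are not necessarily the same problem is to check which kernel modules show up in each call trace. Below is a minimal Python sketch, assuming the RHEL 5 trace format seen in the traces posted in this bug, where frames from modules are written as ":module:symbol+offset". The helper name modules_in_trace is hypothetical, and the result is only a hint for sorting reports, not a diagnosis.

```python
import re
from collections import Counter

# Frames from loadable modules look like:
#   [<ffffffff891be381>] :nfs:nfs_wait_bit_uninterruptible+0x0/0xd
# Core-kernel frames have no ":module:" prefix and are skipped.
MODULE_FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+:(?P<module>[\w-]+):")

def modules_in_trace(trace_text):
    """Count call-trace frames per kernel module in a block of trace text."""
    counts = Counter()
    for line in trace_text.splitlines():
        match = MODULE_FRAME_RE.search(line)
        if match:
            counts[match.group("module")] += 1
    return counts
```

A trace dominated by jbd and ext3 frames and one dominated by nfs frames share the same warning text but are waiting in different places.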

Comment 92 Simon Gao 2012-06-19 18:57:59 UTC

We are seeing the same problem on 5.7 with NFS.

INFO: task perl:3276 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
perl          D ffff81000101d7a0     0  3276   3268                     (NOTLB)
 ffff81015deb39f8 0000000000000082 ffff810199c44880 ffff810199c44880
 0000000010000042 000000000000000a ffff81019ab987e0 ffff81019bc9a100
 000181ba4306b3a6 000000000003139e ffff81019ab989c8 000000030000000a
Call Trace:
 [<ffffffff8006ec8f>] do_gettimeofday+0x40/0x90
 [<ffffffff891be381>] :nfs:nfs_wait_bit_uninterruptible+0x0/0xd
 [<ffffffff800637ce>] io_schedule+0x3f/0x67
 [<ffffffff891be38a>] :nfs:nfs_wait_bit_uninterruptible+0x9/0xd
 [<ffffffff800639fa>] __wait_on_bit+0x40/0x6e
 [<ffffffff891be381>] :nfs:nfs_wait_bit_uninterruptible+0x0/0xd
 [<ffffffff80063a94>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff800a2e80>] wake_bit_function+0x0/0x23
 [<ffffffff891c28ea>] :nfs:nfs_update_request+0x90/0x340
 [<ffffffff891c3655>] :nfs:nfs_updatepage+0x155/0x1ec
 [<ffffffff891b8f57>] :nfs:nfs_write_end+0x6b/0x92
 [<ffffffff8000ff32>] generic_file_buffered_write+0x1cc/0x675
 [<ffffffff8001678a>] __generic_file_aio_write_nolock+0x369/0x3b6
 [<ffffffff8005519c>] sk_reset_timer+0xf/0x19
 [<ffffffff8005458d>] tcp_connect+0x33f/0x348
 [<ffffffff80234d94>] secure_tcp_sequence_number+0x38/0x3d
 [<ffffffff800219bf>] generic_file_aio_write+0x67/0xc3
 [<ffffffff891b9639>] :nfs:nfs_file_write+0xd8/0x14f
 [<ffffffff80018415>] do_sync_write+0xc7/0x104
 [<ffffffff800a2e52>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8022dd8d>] sys_connect+0x7e/0xae
 [<ffffffff80016b92>] vfs_write+0xce/0x174
 [<ffffffff8001745b>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

Comment 93 Jeff Moyer 2012-06-19 19:08:12 UTC

(In reply to comment #92)
> We are seeing the same problem on 5.7 with NFS.

No, you aren't.

Please file a support ticket for your problem so it can be categorized appropriately.

Comment 94 xset1980 2012-10-29 07:00:57 UTC

Hi, sorry, but I see a lot of people with the same problem and no answers from Red Hat.
I'm a Red Hat fan, but reading this bug, and having a similar issue with a Fujitsu server running CentOS 6.3, I think:

What about open and free software, and the Software Libre stuff?

Doesn't Red Hat remember that a lot of folks keep reporting Fedora bugs on Bugzilla for better quality in future RHEL releases, including reports about CentOS or Scientific Linux?

@Jeffrey Moyer

I know that bugzilla.redhat.com is not a support channel for non-Red Hat products like CentOS, but has Red Hat forgotten that RHEL is built from Fedora, a community project?

It is a bad attitude from Red Hat not to answer this bug.

Comment 95 xset1980 2012-10-29 07:02:55 UTC

And, sorry, but, if "NOTABUG" what is it?

Comment 96 Alan Brown 2012-10-29 09:46:59 UTC

Guys,

At least part of this is due to a widespread LSI SAS expander bug - NOT a controller bug (but make sure you have installed the latest MegaRAID firmware from LSI's website - NOT the vendor stuff, as most of them are 2 versions out of date)

Updated firmware for LSI's SAS switches (expanders) can be grabbed from their website.

http://www.lsi.com/support/Pages/Download-Results.aspx?productcode=P00048&assettype=0&component=Storage%20Component&productfamily=SAS%20Switch&productname=LSI%20SAS6160%20Switch

Make sure you are running the P12 switch firmware. There is a specific fix for hangs caused by hard drives momentarily disconnecting themselves from the bus.

You will need megacli to apply updates. This can be obtained from a number of locations.

Applying the P12 firmware update stopped all hangs on our SAS-based boxes.