Description of problem:
I am using RHEL 5.5 with OCFS2 as a cluster file system between 2 nodes. I am not sure what causes this, but the server froze and stopped responding to all HTTP requests, and I could not log in either remotely (ssh) or through the console.

Version-Release number of selected component (if applicable):
Kernel version is 2.6.18-194.3.1.el5PAE, httpd 2.2.3-43, ocfs2-2.6.18-194.3.1.el5PAE-1.4.7-1

How reproducible:
Not sure how this can be reproduced.

Additional info:
This is what I see in the message log:

kernel: INFO: task httpd:9144 blocked for more than 120 seconds.
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: httpd D 00225265 1540 9144 6915 9157 15002 (NOTLB)
kernel: e3cc9e14 00200082 56e1bfcb 00225265 c54cc9f0 e157c080 c05eecfb 0000000a
kernel: e1813550 56e6771a 00225265 0004b74f 00000001 e181365c c4619bc4 f7853040
kernel: 00000000 00000000 dfe965c0 00000000 c061cf87 dfe965c0 c05b8528 ffffffff
kernel: Call Trace:
kernel: [<c05eecfb>] __tcp_push_pending_frames+0x474/0x752
kernel: [<c061cf87>] _spin_lock_bh+0x8/0x18
kernel: [<c05b8528>] release_sock+0xc/0x91
kernel: [<c061c265>] __mutex_lock_slowpath+0x4d/0x7c
kernel: [<c061c2a3>] .text.lock.mutex+0xf/0x14
kernel: [<f90a6ab3>] ocfs2_file_aio_write+0x1a4/0xb48 [ocfs2]
kernel: [<c05b627b>] kernel_sendpage+0x35/0x3c
kernel: [<c05b62c6>] sock_sendpage+0x44/0x81
kernel: [<c0456005>] file_send_actor+0x32/0x4b
kernel: [<c04572a7>] do_generic_mapping_read+0x373/0x37b
kernel: [<c04744ca>] do_sync_write+0xb6/0xf1
kernel: [<c05b863b>] lock_sock+0x8e/0x96
kernel: [<c04363ff>] autoremove_wake_function+0x0/0x2d
kernel: [<c05b6730>] sys_setsockopt+0x76/0x95
kernel: [<c0474414>] do_sync_write+0x0/0xf1
kernel: [<c0474d53>] vfs_write+0xa1/0x143
kernel: [<c0475345>] sys_write+0x3c/0x63
kernel: [<c0404ead>] sysenter_past_esp+0x56/0x79
kernel: =======================
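The hung-task report above has a fixed shape ("INFO: task NAME:PID blocked for more than N seconds."), so the affected processes can be pulled out of a log with a small filter. This is only a sketch, assuming GNU sed and messages in the format shown above; `hung_tasks` is a hypothetical helper name, not part of any package.

```shell
#!/bin/sh
# Hypothetical helper: print "NAME PID SECONDS" for every kernel hung-task
# report read on stdin, e.g.
#   "INFO: task httpd:9144 blocked for more than 120 seconds."
hung_tasks() {
    sed -n -e 's/.*INFO: task \(.*\):\([0-9][0-9]*\) blocked for more than \([0-9][0-9]*\) seconds\..*/\1 \2 \3/p'
}

# Example usage (assumes the default syslog location on RHEL 5):
#   hung_tasks < /var/log/messages
```

Lines that are not hung-task reports (the call trace, for example) produce no output.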
Do you see the message below in dmesg(8)?

megasas_register_aen[0]: already registered
megasas_register_aen[0]: already registered
megasas_register_aen[0]: already registered
[... the same line repeated many more times ...]

By any chance, is this happening on a server with a PERC5 (or PERC6) controller?
I am not the OP, but _YES_!!! My box has been rebooting spontaneously (usually some time in the morning), and my messages log is full of this:

Dec 13 21:31:24 thunderbolt kernel: megasas_register_aen[0]: already registered
Dec 13 22:16:23 thunderbolt last message repeated 3 times
Dec 13 23:01:25 thunderbolt last message repeated 3 times
Dec 13 23:46:30 thunderbolt last message repeated 3 times
Dec 14 00:31:32 thunderbolt last message repeated 3 times
Dec 14 01:16:35 thunderbolt last message repeated 3 times

Linux thunderbolt 2.6.18-274.12.1.el5 #1 SMP Tue Nov 8 21:37:35 EST 2011 x86_64 x86_64 x86_64 GNU/Linux
02:0e.0 RAID bus controller: Dell PowerEdge Expandable RAID controller 5

Help please! This was the only relevant thread I could find on Google! I am not sure whether this is related to my spontaneous reboots, and won't know until I sort this out. Thank you very much!
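Because syslogd folds identical consecutive lines into "last message repeated N times", a plain `grep -c` undercounts these messages. A rough way to get the real number is to add the N from every repeat line that follows an AEN line. This is a sketch assuming the sysklogd format shown above; `count_aen_messages` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: count "megasas_register_aen[N]: already registered"
# messages read on stdin, including the ones syslogd folded into
# "last message repeated N times" lines.
count_aen_messages() {
    awk '
        /megasas_register_aen\[[0-9]+\]: already registered/ { total++; in_aen = 1; next }
        /last message repeated [0-9]+ times/ {
            # Only count repeats that follow an AEN line.
            if (in_aen) {
                for (i = 1; i < NF; i++)
                    if ($i == "repeated") total += $(i + 1)
            }
            next
        }
        { in_aen = 0 }
        END { print total + 0 }
    '
}

# Example usage (assumes the default syslog location on RHEL 5):
#   count_aen_messages < /var/log/messages
```

For the six lines quoted above this gives 16 (one logged message plus five "repeated 3 times" lines).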
Can you install the MegaCLI package from lsi.com? Search for ``4.00.11_Linux_MegaCLI.zip'' on the Internet, as the LSI homepage is really not customer-friendly. Then please give me the output of:

megacli -AdpAllInfo -aALL -NoLog | \
    grep -e '^[A-Z]' | \
    sed -n -e '/^Adapter/,/^Ctrl/p' | \
    tr -cd '[\040-\176\t\n]'

This is what I have:

# uname -rm
2.6.18-274.12.1.el5PAE i686
# megacli -AdpAllInfo -aALL -NoLog | ...
Adapter #0
Product Name : PERC 5/i Integrated
Serial No : 12345
FW Package Build: 5.2.2-0072
Mfg. Date : 00/00/00
Rework Date : 00/00/00
Revision No : @A
Battery FRU : N/A
Boot Block Version : R.2.3.12
BIOS Version : MT28-9
MPT Version : MPTFW-00.10.62.00-IT
FW Version : 1.03.50-0461
WebBIOS Version : 1.03-04
Ctrl-R Version : 1.04-019A

I've seen these lockups and hangs on my systems in the past, but I think they went away after I upgraded the PERC5/i firmware to 5.2.2-0072. The firmware was downloaded from the Dell.com website for the PowerEdge 1950.
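To check the "FW Package Build" reported by the MegaCLI pipeline above against 5.2.2-0072 (the build that seemed to stop the freezes here), something like the following could be used. It is only a sketch: it assumes the "Key : value" output format shown above, and `fw_package_build` and `version_ge` are hypothetical helper names. The comparison is done numerically field by field in awk, so 5.2.2-0072 counts as newer than 5.1.1-0040.

```shell
#!/bin/sh
# Hypothetical helper: print the "FW Package Build" value from MegaCLI
# -AdpAllInfo output read on stdin (one line per adapter).
fw_package_build() {
    sed -n -e '/^FW Package Build/{s/^[^:]*:[[:space:]]*//;s/[[:space:]]*$//;p;}'
}

# Hypothetical helper: succeed if version $1 >= version $2, comparing the
# dot/dash-separated fields numerically (e.g. 5.2.2-0072 vs 5.1.1-0040).
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, /[.-]/)
        nb = split(b, y, /[.-]/)
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            if (x[i] + 0 > y[i] + 0) exit 0
            if (x[i] + 0 < y[i] + 0) exit 1
        }
        exit 0
    }'
}

# Example usage (assumes megacli is on PATH and a single adapter):
#   build=$(megacli -AdpAllInfo -aALL -NoLog | fw_package_build)
#   version_ge "$build" 5.2.2-0072 || echo "PERC firmware $build is older than 5.2.2-0072"
```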
Here is my output from megacli:

Adapter #0
Product Name : PERC 5/i Integrated
Serial No : 12345
FW Package Build: 5.1.1-0040
Mfg. Date : 00/00/00
Rework Date : 00/00/00
Revision No : @A
Battery FRU : N/A
Boot Block Version : R.2.3.12
BIOS Version : MT28
MPT Version : MPTFW-00.10.47.00-IT
FW Version : 1.03.10-0216
WebBIOS Version : 1.03-04
Ctrl-R Version : 1.04-017A

I am going to look for the PERC5 firmware update, in case you think this is the best way to go. Thanks for the quick response!
I went ahead and updated to the suggested version, and got the following. Could it be something else?

Dec 14 14:34:55 thunderbolt kernel: megasas_register_aen[0]: already registered
Dec 14 14:35:01 thunderbolt last message repeated 4 times

Adapter #0
Product Name : PERC 5/i Integrated
Serial No : 12345
FW Package Build: 5.2.2-0072
Mfg. Date : 00/00/00
Rework Date : 00/00/00
Revision No : @A
Battery FRU : N/A
Boot Block Version : R.2.3.12
BIOS Version : MT28-9
MPT Version : MPTFW-00.10.62.00-IT
FW Version : 1.03.50-0461
WebBIOS Version : 1.03-04
Ctrl-R Version : 1.04-019A

Linux thunderbolt 2.6.18-274.12.1.el5 #1 SMP Tue Nov 8 21:37:35 EST 2011 x86_64 x86_64 x86_64 GNU/Linux
My servers still have this in dmesg(8):

megasas_register_aen[0]: already registered

but they don't freeze anymore. However, the issue only happened a few times, so I don't know how to reproduce it. I think you will still see the above messages in dmesg(8), but (hopefully) your system will be stable now. I'm really curious, so if you don't mind, please let me know whether the firmware update helped you. Thanks.
This bug/component is not included in scope for RHEL-5.11.0, which is the last RHEL5 minor release. This Bugzilla will soon be CLOSED as WONTFIX (at the end of the RHEL 5.11 development phase, Apr 22, 2014). Please contact your account manager or support representative in case you need to escalate this bug.
Thank you for submitting this request for inclusion in Red Hat Enterprise Linux 5. We've carefully evaluated the request, but are unable to include it in RHEL5 stream. If the issue is critical for your business, please provide additional business justification through the appropriate support channels (https://access.redhat.com/site/support).