Bug 199900

Summary: 174990
Product: Red Hat Enterprise Linux 4
Reporter: Ricky <rickychan>
Component: acpid
Assignee: Zdenek Prikryl <zprikryl>
Status: CLOSED INSUFFICIENT_DATA
QA Contact:
Docs Contact:
Severity: medium
Priority: medium
Version: 4.0
CC: leonardye
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2008-07-15 08:01:11 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Ricky 2006-07-24 07:32:27 UTC
Description of problem:
My company has logged a bug report and was told it was a duplicate and asked to
refer to bug 174990. However, we cannot view that bug, so we need its problem
description and a description of how it is to be solved.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 Phil Knirsch 2006-08-01 13:20:10 UTC
Here is the relevant correspondence from bug 174990. That bug is limited to
internal view only, which is why you couldn't see it. Please let me know whether
the bug described there actually matches what you were seeing, so we can either
close this bug as a real duplicate or process it properly otherwise.

Read ya, Phil



Opened by Issue Tracker (tao) on 2005-12-05 10:30 EST

Escalated to Bugzilla from IssueTracker


Comment #1 From Issue Tracker (tao) on 2005-12-05 10:31 EST

From User-Agent: XML-RPC

*** 02-NOV-2005 17:35:45  Notes:  (UMRSERVER)  

     Problem Description: We are seeing random hard lockups on our two IPVS load
balancers. One machine is at the current 2.6.9-22 kernel; the other is still at
2.6.9-11. A previous ticket was closed because we thought the problem was related
to the dst cache overflow error, but we have not seen that message recently. The
-11 machine had been very stable, which is why it hadn't been upgraded, but it
has locked up twice in the past two days.

 

No indications of any problem on the box. 

 

See previous ticket for the support tarball that was generated on one of the
machines. 

 

Setup is two onboard GigE interfaces on a Dell 1850, each configured in a
bonding driver set up for failover only, each configured as a VLAN trunk line,
with three vlans in use - native, and two additional tagged vlans. IPVS
configuration is being managed with keepalived, and is set up with the two boxes
using VRRP and ipvs_syncd for dynamic balancer failover. 

 

The lockup is to the point where alt-sysrq S/U/B does not ever reboot the
machine, but alt-sysrq C to trigger a kernel crash does. There are no symptoms
on the serial console, or in syslogs. 

 

I have been asked to get more diagnostics and support from Red Hat on this issue
before we replace the kernel with a kernel.org build to see whether that has
any impact on the behavior.

*** 03-NOV-2005 05:29:44  Notes: Wulf, Joshua Jae (JWULF)  

     Problem Description: In order to get sufficient information for a diagnosis
we'll need to get some kind of crash dump. Can we try setting up Netdump:



http://kbase.redhat.com/faq/FAQ_43_2198.shtm

http://kbase.redhat.com/faq/FAQ_43_2467.shtm

http://kbase.redhat.com/faq/FAQ_80_3721.shtm



If there have been no software updates on the machine and it suddenly starts
locking up, it's usually due to one of a limited number of things: the audit
subsystem shutting the system down when it goes over 80% disk usage, or a
hardware error.



A hardware error on two separate machines simultaneously is less probable.



Trying a kernel.org build may help us to eliminate some variables. 



regards,

Joshua J Wulf

*** 03-NOV-2005 08:56:08  Notes:  (UMRSERVER)  

     Problem Description: Will try this... The crash.c test doesn't work right -
it doesn't build a valid module - but that's a separate issue. Testing with
sysrq-c worked fine, though. Will enable this functionality on the ipvs boxes. (I
tested it on a different, non-production box.)

 

Here's a problem - these two machines use the bonding driver for their eth
interfaces, and netdump/netconsole will not load because, according to a syslog
message, the bonding driver doesn't support polling.

 

How am I supposed to address this?

*** 03-NOV-2005 22:53:36  Notes: Wulf, Joshua Jae (JWULF)  

     Problem Description: We'd need to unbond the interfaces (which changes the
system configuration and probably disrupts your production environment) or
install another adapter for Netdump.



regards,

Joshua

*** 17-NOV-2005 23:50:09  Notes:  (UMRSERVER)  

     Problem Description: We will be unbonding the interfaces this weekend on
these two boxes. One is at -22; the other is still at -11. The one at -11 seems
to get the dst cache overflow error. We haven't seen that on the -22 kernel. 

*** 18-NOV-2005 18:42:49  Notes: Wulf, Joshua Jae (JWULF)  

     Problem Description: OK, we'll wait for the results.



regards,

Joshua

*** 28-NOV-2005 08:54:06  Notes:  (UMRSERVER)  

     Problem Description: Finally managed to get updated bios, new kernel, and
net devices reconfigured. 

 

Machine hard locked again. It never triggered a panic, as usual. When I forcibly
issued alt-sysrq-c to crash it, here's the log that was generated. 

 

Additionally, on the console, but not in the netconsole log, I saw this right
after the trace of the call to __call_console_drivers.  

 

<6>NETDEV WATCHDOG: eth0: transmit timed out 

 

Any suggestions on what to do next? At this point, the machines are running with
a single ethernet interface, both at -22, both with current dell bios, both with
netdump+netconsole enabled. 

*** 28-NOV-2005 08:54:15  Notes:  (UMRSERVER)  

     Problem Description: File log attached



*** 30-NOV-2005 09:12:51  Notes:  (UMRSERVER)  

     Problem Description: Additional info - with netconsole/netdump installed on
these boxes, I am no longer able to remotely reboot them after a panic (even
with alt-sysrq-c). They either lock up completely or, as in today's case, get
stuck in some sort of loop doing back traces.  

 



*** 30-NOV-2005 11:57:36  Notes: Kloiber, Christopher K (Chris)
(CKLOIBER)  

     Problem Description: Please provide a sysreport from each machine, thanks.

*** 30-NOV-2005 11:59:16  Notes:  (UMRSERVER)  

     Action: We want this issue moved to the Americas IT support team ASAP. Our
support contact indicated it was with the Australian IT support team.
His email follows: 

 

 

I raised your issue with my Sales Engineer, and he informed me that the issue is
currently located in our Australian IT location.  I would strongly recommend
going into the existing ticket and requesting that it be transferred back to the
Americas.  I have contacted the manager of the Americas IT location stating that
you would be doing this.  Please keep in mind that I am just a Sales Rep, but I
will do everything I can to get your issues resolved.  

 

Ben Freeman -  EDU West 

 



*** 30-NOV-2005 12:21:35  Notes:  (UMRSERVER)  

     Problem Description: sysreports added. The one from ipvs02 will likely not
be useful, as it is from changed hardware (we are trying to diagnose it
ourselves) and it hasn't been up long enough to have experienced the symptom.
ipvs01 did the crash/lockup again last night at about 2am. 

 

 

 

 

BTW, your system notifications from this support app are totally screwed up. I
am not getting updates regularly and have never seen a notification for the most
recent change to the ticket - it often shows me the previous message/update to
the ticket, even when the update is one I triggered myself.

*** 30-NOV-2005 12:21:36  Notes:  (UMRSERVER)  

     Problem Description: File nneulipvs01.tar.bz2 attached

File nneulipvs02.tar.bz2 attached



*** 30-NOV-2005 13:50:12  Notes: Kloiber, Christopher K (Chris)
(CKLOIBER)  

     Problem Description: I found a small problem (probably not the root cause)
in /etc/modprobe.conf. The options line for bond0 should read:



options bond0 miimon=100 mode=1



The sysreport labeled nneulipvs02 is still running the 2.6.9-11smp kernel and
has only one Intel e100 network adapter, so it's not capable of bonding.



The sysreport neglects to capture the load-balancing configuration. Can you
please attach a tar file containing the contents of the /etc/sysconfig/ha
directory (from both machines)? Thanks.

*** 30-NOV-2005 14:12:42  Notes:  (UMRSERVER)  

     Problem Description:  

Will address other issues in a moment...  

 

mode=1 comment: see the attached source from bond_main.c; the bonding driver
accepts either syntax. 

 

from Documentation/networking/bonding.txt: 

 

        Options with textual values will accept either the text name
or, for backwards compatibility, the option value.  E.g.,
"mode=802.3ad" and "mode=4" set the same mode.

 

 

 

 

/*
 * Convert string input module parms.  Accept either the
 * number of the mode or its string name.
 */
static inline int bond_parse_parm(char *mode_arg, struct bond_parm_tbl *tbl)
{
        int i;

        for (i = 0; tbl[i].modename; i++) {
                if ((isdigit(*mode_arg) &&
                     tbl[i].mode == simple_strtol(mode_arg, NULL, 0)) ||
                    (strncmp(mode_arg, tbl[i].modename,
                             strlen(tbl[i].modename)) == 0)) {
                        return tbl[i].mode;
                }
        }

        return -1;
}



*** 30-NOV-2005 14:17:23  Notes:  (UMRSERVER)  

     Problem Description: We are no longer using bonding. This should be clear
from previous ticket updates: you wanted us to enable netconsole/netdump, and
it's not compatible with bonding without adding another eth card, which would
have been more invasive. Turning bonding off had no effect; the problem still
occurs. 

 

ipvs02 was reinstalled on new hardware yesterday. Our base install does not
include -22 because it is incompatible with the vast majority of the systems we
run RHEL on. We have also not yet seen the problem with ipvs02 on the
replacement hardware, but have seen it twice since yesterday on ipvs01 on the
original Dell 1850. 

 

We are not using /etc/sysconfig/ha, but are using keepalived. 

 

on each machine, here is the startup stuff we run: 

 

ipvsadm --start-daemon master --mcast-interface eth0 --syncid 210
ipvsadm --start-daemon backup --mcast-interface eth0 --syncid 211

ipvsadm --set 7200 120 300

/usr/sbin/keepalived \
        --dump-conf \
        --vrrp \
        --use-file /local/server/build/keepalived.conf.`hostname --short` \
        --log-console \
        --log-detail

/usr/sbin/keepalived \
        --dump-conf \
        --check \
        --use-file /local/server/build/keepalived.conf.`hostname --short` \
        --log-console \
        --log-detail

 

I will attach the two keepalived config files momentarily. 

 

 



*** 30-NOV-2005 14:20:19  Notes:  (UMRSERVER)  

     Problem Description: Attached keepalived configs and output of "ipvsadm -L -n"

*** 30-NOV-2005 14:20:20  Notes:  (UMRSERVER)  

     Problem Description: File keepalived.conf.ipvs01 attached

File keepalived.conf.ipvs02 attached

File ipvsout attached



*** 30-NOV-2005 14:21:30  Notes:  (UMRSERVER)  

     Problem Description: Note, we do intend to turn bonding back on once this
is resolved, but we are leaving it out of the picture for now to eliminate
variables.

*** 30-NOV-2005 14:29:29  Notes: Whiter, Josef M (JWHITER)  

     Action: Dear Sir,



I am going to escalate this issue to engineering.  When one of the boxes crashes
again, use the sysrq-c keys to try to get a vmcore, because that will be most
helpful in determining the cause of this crash.  Thank you,



Josef Whiter

Red Hat


This event sent from IssueTracker by streeter
 issue 84050
              


Comment #7 From Issue Tracker (tao) on 2005-12-05 10:31 EST

Has this netdump_log file been edited? It looks different from others I've
seen before.



Why do you (or the customer) think that IPVS is implicated in the hang?


Internal Status set to 'Waiting on Support'

This event sent from IssueTracker by streeter
 issue 84050
              


Comment #8 From Issue Tracker (tao) on 2005-12-05 10:31 EST

It is hard to tell from this netdump log, but it looks like an attempt was
made to reboot the machine (sysrq-b) before trying to crash it. This may
have confused things a bit.

In attempting to get a dump from this hang, please try the crash (sysrq-c)
without trying to kill processes or reboot the machine first.




This event sent from IssueTracker by streeter
 issue 84050
              


Comment #9 From Issue Tracker (tao) on 2005-12-05 10:32 EST

Notes from the customer



=========================================================

Well, we got another one. Unfortunately, I accidentally hit sysrq-s prior
to sysrq-c, so I'm not certain this one will have anything useful. No
matter, it will do it again soon enough, I'm sure.  

 

BTW, we did rule out hardware: the ipvs02 machine experienced the same
symptom, but we were not able to get a log or dump from it because netdump
didn't get reconfigured after reinstalling on the new HW. It's back running
the old HW now. 

 

Here's the log from netconsole - it did not do a dump; in fact, it is still
spinning doing traces: 

 

SysRq : Emergency Sync 

SysRq : Crashing the kernel by request 

Unable to handle kernel NULL pointer dereference at virtual address
00000000 

 printing eip: 

c020abf0 

*pde = 10013001 

Oops: 0002 [#1] 

SMP  

Modules linked in: netconsole netdump ip_vs_rr ip_vs md5 ipv6 parport_pc lp
parport autofs4 i2c_dev i2c_core sunrpc 8021q button battery ac uhci_hcd
ehci_hcd hw_random e1000 floppy sg dm_snapshot dm_zero dm_mirror ext3 jbd
dm_mod megaraid_mbox megaraid_mm sd_mod scsi_mod 

CPU:    1 

EIP:    0060:[<c020abf0>]    Not tainted VLI 

EFLAGS: 00010246   (2.6.9-22.ELsmp)  

EIP is at sysrq_handle_crash+0x0/0x8 

eax: 00000063   ebx: c033a294   ecx: 00000000   edx: d078edf4 

esi: 00000063   edi: 00000000   ebp: d078edf4   esp: c03e8f74 

ds: 007b   es: 007b   ss: 0068 

Process sh (pid: 27841, threadinfo=c03e8000 task=db0927b0) 

Stack: c020ad7e c02f1dbc c02f31cd 00000006 c0458c20 dde21000 00000063
c03e8fc8  

       c021aa24 00000100 63000010 d078edf4 d078edf4 c0458c20 c0458cc8
c0458110  

       00000000 c021acbe 00000000 d078edf4 00000004 00000061 df98b840
00000001  

Call Trace: 

 [<c020ad7e>] __handle_sysrq+0x58/0xc6 

 [<c021aa24>] receive_chars+0x140/0x1f6 

 [<c021acbe>] serial8250_interrupt+0x64/0xcb 

 [<c010745e>] handle_IRQ_event+0x25/0x4f 

 [<c01079be>] do_IRQ+0x11c/0x1ae 

 ======================= 

 [<c02d1a8c>] common_interrupt+0x18/0x20 

 [<c02cfc73>] _spin_lock+0x2e/0x34 

 [<c0160493>] nr_blockdev_pages+0xd/0x3f 

 [<c01436a8>] si_meminfo+0x1f/0x3b 

 [<c0188cf4>] meminfo_read_proc+0x41/0x191 

 [<c0143050>] buffered_rmqueue+0x17d/0x1a5 

 [<c014314d>] __alloc_pages+0xd5/0x2f7 

 [<c0187143>] proc_file_read+0xd1/0x225 

 [<c0159c61>] vfs_read+0xb6/0xe2 

 [<c0159e74>] sys_read+0x3c/0x62 

 [<c02d10cf>] syscall_call+0x7/0xb 

Code: 11 c0 c7 05 10 7c 44 c0 00 00 00 00 c7 05 38 7c 44 c0 00 00 00 00 c7
05 2c 7c 44 c0 6e ad 87 4b 89 15 28 7c 44 c0 e9 d3 5d f2 ff <c6> 05 00 00
00 00 00 c3 e9 92 04 f5 ff e9 8b 4c f5 ff 85 d2 89  

 

 

 

 

 

 

 

 

 

 

and here is one of the traces it is spewing right now: 

 

<e097a653>] netpoll_netdump+0x7f/0x478 [netdump] 

 [<c020abf0>] sysrq_handle_crash+0x0/0x8           

 [<e097a5d4>] netpoll_netdump+0x0/0x478 [netdump] 

 [<e097a5cb>] netpoll_start_netdump+0xe9/0xf2 [netdump] 

 =======================                                

 [<c013403d>] try_crashdump+0x31/0x33 

 [<c010601a>] die+0xe2/0x16b          

 [<c0122459>] vprintk+0x136/0x14a 

 [<c011ac2d>] do_page_fault+0x0/0x5c6 

 [<c011b01d>] do_page_fault+0x3f0/0x5c6 

 [<c020abf0>] sysrq_handle_crash+0x0/0x8 

 [<e08ede1a>] e1000_xmit_frame+0x947/0x951 [e1000] 

 [<c010b377>] timer_interrupt+0xd6/0xde            

 [<c0278b71>] alloc_skb+0x33/0xc5       

 [<c0129715>] __mod_timer+0x101/0x10b 

 [<c020a2e5>] poke_blanked_console+0x8f/0x9a 

 [<c02096a5>] vt_console_print+0x294/0x2a5   

 [<c0209411>] vt_console_print+0x0/0x2a5   

 [<c0122103>] __call_console_drivers+0x36/0x40 

 [<c011ac2d>] do_page_fault+0x0/0x5c6          

 [<c02d1bab>] error_code+0x2f/0x38    

 [<c020abf0>] sysrq_handle_crash+0x0/0x8 

 [<c020ad7e>] __handle_sysrq+0x58/0xc6   

 [<c021aa24>] receive_chars+0x140/0x1f6 

 [<c021acbe>] serial8250_interrupt+0x64/0xcb 

 [<c010745e>] handle_IRQ_event+0x25/0x4f     

 [<c01079be>] do_IRQ+0x11c/0x1ae         

 =======================         

 [<c02d1a8c>] common_interrupt+0x18/0x20 

 [<c02cfc73>] _spin_lock+0x2e/0x34       

 [<c0160493>] nr_blockdev_pages+0xd/0x3f 

 [<c01436a8>] si_meminfo+0x1f/0x3b       

 [<c0188cf4>] meminfo_read_proc+0x41/0x191 

 [<c0143050>] buffered_rmqueue+0x17d/0x1a5 

 [<c014314d>] __alloc_pages+0xd5/0x2f7     

 [<c0187143>] proc_file_read+0xd1/0x225 

 [<c0159c61>] vfs_read+0xb6/0xe2        

 [<c0159e74>] sys_read+0x3c/0x62 

 [<c02d10cf>] syscall_call+0x7/0xb 

Badness in local_bh_enable at kernel/softirq.c:141 

 [<c01264bd>] local_bh_enable+0x34/0x57            

 [<c0290995>] rt_garbage_collect+0x196/0x276 

 [<c028189d>] dst_alloc+0x18/0x85            

 [<c0292493>] ip_route_input_slow+0x56f/0x849 

 [<c02886bb>] checksum_udp+0x6f/0x88          

 [<c0294613>] ip_rcv+0x1c9/0x3ff     

 [<c027e1c1>] netif_receive_skb+0x1f1/0x21f 

 [<e08ef029>] e1000_clean_rx_irq+0x388/0x3fa [e1000] 

 [<e08ee723>] e1000_clean+0x3a/0xcd [e1000]          

 [<c0288738>] poll_napi+0x64/0x84           

 [<c0288788>] netpoll_poll+0x30/0x35 

 [<e097a440>] netdump_startup_handshake+0x7f/0x10d [netdump] 

 [<c0205c5d>] scrup+0x63/0xce                                

 [<c0206247>] complement_pos+0x12/0x132 

 [<c02066dc>] set_cursor+0x62/0x6e      

 [<c0209697>] vt_console_print+0x286/0x2a5 

 [<c0209411>] vt_console_print+0x0/0x2a5   

 [<c01220c3>] crashdump_call_console_drivers+0x27/0x31 

 [<c0122467>] vprintk+0x144/0x14a                      

 [<e097a653>] netpoll_netdump+0x7f/0x478 [netdump] 

 [<c020abf0>] sysrq_handle_crash+0x0/0x8           

 [<e097a5d4>] netpoll_netdump+0x0/0x478 [netdump] 

 [<e097a5cb>] netpoll_start_netdump+0xe9/0xf2 [netdump] 

 =======================                                

 [<c013403d>] try_crashdump+0x31/0x33 

 [<c010601a>] die+0xe2/0x16b          

 [<c0122459>] vprintk+0x136/0x14a 

 [<c011ac2d>] do_page_fault+0x0/0x5c6 

 [<c011b01d>] do_page_fault+0x3f0/0x5c6 

 [<c020abf0>] sysrq_handle_crash+0x0/0x8 

 [<e08ede1a>] e1000_xmit_frame+0x947/0x951 [e1000] 

 [<c010b377>] timer_interrupt+0xd6/0xde            

 [<c0278b71>] alloc_skb+0x33/0xc5       



To answer your question about it being ip_vs (not ipvsadm; that's just the
config tool): we have 60+ RHEL ES4 boxes in service. Only two of them are
regularly and consistently seeing hard lockups, and they are our two load
balancer boxes. 

We had two boxes running a very similar setup, but with RH9, previously.
They were rock solid. Unfortunately, lots of variables changed when we
upgraded: new OS, new kernel version, new hardware, GigE with VLAN trunks
instead of e100 with dedicated ports, and channel bonding for eth failover.



=============================================================================




Internal Status set to 'Waiting on SEG'

This event sent from IssueTracker by streeter
 issue 84050
              


Comment #10 From Issue Tracker (tao) on 2005-12-05 10:32 EST

A vmcore would be very helpful.

I'm sending this up to the developers, because I can't make any sense of
it.


This event sent from IssueTracker by streeter
 issue 84050
              


Comment #11 From Guy Streeter (streeter) on 2005-12-05 10:41 EST

Sorry about including so much in the bug report, but the original customer issue
has it all in the issue description and our escalation tool doesn't let us
submit partial events into bugzilla.

The short version is: the customer sees random hard lockups (not crashes, in
spite of the summary line) on systems running an IPVS load balancer. Many other
"identical" systems not running IPVS don't hang.

The customer has forced a crash, and the netconsole shows a non-stop (looping?)
backtrace. The customer is attempting to get a net dump.


Comment #12 From Issue Tracker (tao) on 2005-12-06 14:49 EST

Did it again. This time it looked like it was going to give me a dump from
the console, but I got neither a dump nor a log. Here's the output from the
serial console: 

 

 

^M[BREAK]SysRq : Crashing the kernel by request 

^MUnable to handle kernel NULL pointer dereference at virtual address
00000000 

^M printing eip: 

^Mc020abf0 

^M*pde = 10013001 

^MOops: 0002 [#1] 

^MSMP  

^MModules linked in: netconsole netdump ip_vs_rr ip_vs md5 ipv6 parport_pc
lp parport autofs4 i2c_dev i2c_core sunrpc 8021q button battery ac uhci_hcd
ehci_hcd hw_random e1000 floppy sg dm_snapshot dm_zero dm_mirror ext3 jbd
dm_mod megaraid_mbox megaraid_mm sd_mod scsi_mod 

^MCPU:    1 

^MEIP:    0060:[<c020abf0>]    Not tainted VLI 

^MEFLAGS: 00010246   (2.6.9-22.ELsmp)  

^MEIP is at sysrq_handle_crash+0x0/0x8 

^Meax: 00000063   ebx: c033a294   ecx: 00000000   edx: d078edf4 

^Mesi: 00000063   edi: 00000000   ebp: d078edf4   esp: c03e8f74 

^Mds: 007b   es: 007b   ss: 0068 

^MProcess sh (pid: 27841, threadinfo=c03e8000 task=db0927b0) 

^MStack: c020ad7e c02f1dbc c02f31cd 00000006 c0458c20 dde21000 00000063
c03e8fc8  

^M       c021aa24 00000100 63000010 d078edf4 d078edf4 c0458c20 c0458cc8
c0458110  

^M       00000000 c021acbe 00000000 d078edf4 00000004 00000061 df98b840
00000001  

^MCall Trace: 

^M [<c020ad7e>] __handle_sysrq+0x58/0xc6 

^M [<c021aa24>] receive_chars+0x140/0x1f6 

^M [<c021acbe>] serial8250_interrupt+0x64/0xcb 

^M [<c010745e>] handle_IRQ_event+0x25/0x4f 

^M [<c01079be>] do_IRQ+0x11c/0x1ae 

^M ======================= 

^M [<c02d1a8c>] common_interrupt+0x18/0x20 

^M [<c02cfc73>] _spin_lock+0x2e/0x34 

^M [<c0160493>] nr_blockdev_pages+0xd/0x3f 

^M [<c01436a8>] si_meminfo+0x1f/0x3b 

^M [<c0188cf4>] meminfo_read_proc+0x41/0x191 

^M [<c0143050>] buffered_rmqueue+0x17d/0x1a5 

^M [<c014314d>] __alloc_pages+0xd5/0x2f7 

^M [<c0187143>] proc_file_read+0xd1/0x225 

^M [<c0159c61>] vfs_read+0xb6/0xe2 

^M [<c0159e74>] sys_read+0x3c/0x62 

^M [<c02d10cf>] syscall_call+0x7/0xb 

^MCode: 11 c0 c7 05 10 7c 44 c0 00 00 00 00 c7 05 38 7c 44 c0 00 00 00 00
c7 05 2c 7c 44 c0 6e ad 87 4b 89 15 28 7c 44 c0 e9 d3 5d f2 ff <c6> 05 00
00 00 00 00 c3 e9 92 04 f5 ff e9 8b 4c f5 ff 85 d2 89  

^MCPU#0 is frozen. 

^MCPU#1 is executing netdump. 

^M< netdump activated - performing handshake with the server. > 

^MBadness in local_bh_enable at kernel/softirq.c:141 

^M [<c01264bd>] local_bh_enable+0x34/0x57 

^M [<c0283175>] neigh_connected_output+0x6d/0xad 

^M [<c0296e9e>] ip_finish_output2+0x12e/0x16d 

^M [<e0a2c69e>] ip_vs_post_routing+0x14/0x1c [ip_vs] 

^M [<c028675b>] nf_iterate+0x40/0x81 

^M [<c0296d70>] ip_finish_output2+0x0/0x16d 

^M [<c0286a59>] nf_hook_slow+0x47/0xb4 

^M [<c0296d70>] ip_finish_output2+0x0/0x16d 

^M [<c0296d67>] ip_finish_output+0x1a5/0x1ae 

^M [<c0296d70>] ip_finish_output2+0x0/0x16d 

^M [<e0a30863>] ip_vs_dr_xmit+0x2d0/0x34a [ip_vs] 

^M [<e0a2b0ff>] ip_vs_conn_in_get+0x87/0x150 [ip_vs] 

^M [<e0a334ee>] tcp_state_transition+0x130/0x13d [ip_vs] 

^M [<e0a32e63>] tcp_conn_in_get+0x83/0x8b [ip_vs] 

^M [<e0a30593>] ip_vs_dr_xmit+0x0/0x34a [ip_vs] 

^M [<e0a2d155>] ip_vs_in+0x1a1/0x1f3 [ip_vs] 

^M [<c028675b>] nf_iterate+0x40/0x81 

^M [<c02942c2>] ip_local_deliver_finish+0x0/0x188 

^M [<c0286a59>] nf_hook_slow+0x47/0xb4 

^M [<c02942c2>] ip_local_deliver_finish+0x0/0x188 

^M [<c02942bb>] ip_local_deliver+0x1d9/0x1e0 

^M [<c02942c2>] ip_local_deliver_finish+0x0/0x188 

^M [<c02947a8>] ip_rcv+0x35e/0x3ff 

^M [<c027e1c1>] netif_receive_skb+0x1f1/0x21f 

^M [<e08ef029>] e1000_clean_rx_irq+0x388/0x3fa [e1000] 

^M [<e08ee723>] e1000_clean+0x3a/0xcd [e1000] 

^M [<c0288738>] poll_napi+0x64/0x84 

^M [<c0288788>] netpoll_poll+0x30/0x35 

^M [<e097a440>] netdump_startup_handshake+0x7f/0x10d [netdump] 

^M [<c0205c5d>] scrup+0x63/0xce 

^M [<c0206247>] complement_pos+0x12/0x132 

^M [<c02066dc>] set_cursor+0x62/0x6e 

^M [<c0209697>] vt_console_print+0x286/0x2a5 

^M [<c0209411>] vt_console_print+0x0/0x2a5 

^M [<c01220c3>] crashdump_call_console_drivers+0x27/0x31 

^M [<c0122467>] vprintk+0x144/0x14a 

^M [<e097a653>] netpoll_netdump+0x7f/0x478 [netdump] 

^M [<c020abf0>] sysrq_handle_crash+0x0/0x8 

^M [<e097a5d4>] netpoll_netdump+0x0/0x478 [netdump] 

^M [<e097a5cb>] netpoll_start_netdump+0xe9/0xf2 [netdump] 

^M ======================= 

^M [<c013403d>] try_crashdump+0x31/0x33 

^M [<c010601a>] die+0xe2/0x16b 

^M [<c0122459>] vprintk+0x136/0x14a 

^M [<c011ac2d>] do_page_fault+0x0/0x5c6 

^M [<c011b01d>] do_page_fault+0x3f0/0x5c6 

^M [<c020abf0>] sysrq_handle_crash+0x0/0x8 

^M [<e08ede1a>] e1000_xmit_frame+0x947/0x951 [e1000] 

^M [<c010b377>] timer_interrupt+0xd6/0xde 

^M [<c0278b71>] alloc_skb+0x33/0xc5 

^M [<c0129715>] __mod_timer+0x101/0x10b 

^M [<c020a2e5>] poke_blanked_console+0x8f/0x9a 

^M [<c02096a5>] vt_console_print+0x294/0x2a5 

^M [<c0209411>] vt_console_print+0x0/0x2a5 

^M [<c0122103>] __call_console_drivers+0x36/0x40 

^M [<c011ac2d>] do_page_fault+0x0/0x5c6 

^M [<c02d1bab>] error_code+0x2f/0x38 

^M [<c020abf0>] sysrq_handle_crash+0x0/0x8 

^M [<c020ad7e>] __handle_sysrq+0x58/0xc6 

^M [<c021aa24>] receive_chars+0x140/0x1f6 

^M [<c021acbe>] serial8250_interrupt+0x64/0xcb 

^M [<c010745e>] handle_IRQ_event+0x25/0x4f 

^M [<c01079be>] do_IRQ+0x11c/0x1ae 

^M ======================= 

^M [<c02d1a8c>] common_interrupt+0x18/0x20 

^M [<c02cfc73>] _spin_lock+0x2e/0x34 

^M [<c0160493>] nr_blockdev_pages+0xd/0x3f 

^M [<c01436a8>] si_meminfo+0x1f/0x3b 

^M [<c0188cf4>] meminfo_read_proc+0x41/0x191 

^M [<c0143050>] buffered_rmqueue+0x17d/0x1a5 

^M [<c014314d>] __alloc_pages+0xd5/0x2f7 

^M [<c0187143>] proc_file_read+0xd1/0x225 

^M [<c0159c61>] vfs_read+0xb6/0xe2 

^M [<c0159e74>] sys_read+0x3c/0x62 

^M [<c02d10cf>] syscall_call+0x7/0xb 

^M<4>Warning: kfree_skb on hard IRQ c0278dc3 

^M<4>Warning: kfree_skb on hard IRQ c0278dc3 

^M<4>Warning: kfree_skb on hard IRQ c0278dc3 

^MBadness in local_bh_enable at kernel/softirq.c:141 

^M [<c01264bd>] local_bh_enable+0x34/0x57 

^M [<c027dccb>] dev_queue_xmit+0x1ff/0x207 

^M [<e0919dea>] vlan_dev_hwaccel_hard_start_xmit+0x5b/0x62 [8021q] 

^M [<c027dc33>] dev_queue_xmit+0x167/0x207 

^M [<c0283187>] neigh_connected_output+0x7f/0xad 

^M [<c0296e9e>] ip_finish_output2+0x12e/0x16d 

^M [<e0a2c69e>] ip_vs_post_routing+0x14/0x1c [ip_vs] 

^M [<c028675b>] nf_iterate+0x40/0x81 

^M [<c0296d70>] ip_finish_output2+0x0/0x16d 

^M [<c0286a59>] nf_hook_slow+0x47/0xb4 

^M [<c0296d70>] ip_finish_output2+0x0/0x16d 

^M [<c0296d67>] ip_finish_output+0x1a5/0x1ae 

^M [<c0296d70>] ip_finish_output2+0x0/0x16d 

^M [<e0a30863>] ip_vs_dr_xmit+0x2d0/0x34a [ip_vs] 

^M [<e0a2b0ff>] ip_vs_conn_in_get+0x87/0x150 [ip_vs] 

^M [<e0a334ee>] tcp_state_transition+0x130/0x13d [ip_vs] 

^M [<e0a32e63>] tcp_conn_in_get+0x83/0x8b [ip_vs] 

^M [<e0a30593>] ip_vs_dr_xmit+0x0/0x34a [ip_vs] 

^M [<e0a2d155>] ip_vs_in+0x1a1/0x1f3 [ip_vs] 

^M [<c028675b>] nf_iterate+0x40/0x81 

^M [<c02942c2>] ip_local_deliver_finish+0x0/0x188 

^M [<c0286a59>] nf_hook_slow+0x47/0xb4 

^M [<c02942c2>] ip_local_deliver_finish+0x0/0x188 

^M [<c02942bb>] ip_local_deliver+0x1d9/0x1e0 

^M [<c02942c2>] ip_local_deliver_finish+0x0/0x188 

^M [<c02947a8>] ip_rcv+0x35e/0x3ff 

^M [<c027e1c1>] netif_receive_skb+0x1f1/0x21f 

^M [<e08ef029>] e1000_clean_rx_irq+0x388/0x3fa [e1000] 

^M [<e08ee723>] e1000_clean+0x3a/0xcd [e1000] 

^M [<c0288738>] poll_napi+0x64/0x84 

^M [<c0288788>] netpoll_poll+0x30/0x35 

^M [<e097a440>] netdump_startup_handshake+0x7f/0x10d [netdump] 

^M [<c0205c5d>] scrup+0x63/0xce 

^M [<c0206247>] complement_pos+0x12/0x132 

^M [<c02066dc>] set_cursor+0x62/0x6e 

^M [<c0209697>] vt_console_print+0x286/0x2a5 

^M [<c0209411>] vt_console_print+0x0/0x2a5 

^M [<c01220c3>] crashdump_call_console_drivers+0x27/0x31 

^M [<c0122467>] vprintk+0x144/0x14a 

^M [<e097a653>] netpoll_netdump+0x7f/0x478 [netdump] 

^M [<c020abf0>] sysrq_handle_crash+0x0/0x8 

^M [<e097a5d4>] netpoll_netdump+0x0/0x478 [netdump] 

^M [<e097a5cb>] netpoll_start_netdump+0xe9/0xf2 [netdump] 

 

 

and it goes on and on...


This event sent from IssueTracker by jwhiter
 issue 84050
              


Comment #18 From Issue Tracker (tao) on 2006-01-17 15:06 EST

The boxes have been up for 20 days without issue. I'm attaching the
patch I used. Lowering severity.

Severity set to: High

This event sent from IssueTracker by jwhiter
 issue 84050
              
it_file 53450
              


Comment #20 From Issue Tracker (tao) on 2006-01-20 16:52 EST

Ingo had a different suggestion in rhkernel-list

http://post-office.corp.redhat.com/archives/rhkernel-list/2006-January/msg00001.html


Internal Status set to 'Waiting on Engineering'

This event sent from IssueTracker by streeter
 issue 84050
              


Comment #23 From Jason Baron (jbaron) on 2006-03-06 11:37 EST

This looks like the right patch, also see:
http://bugs.centos.org/view.php?id=1201, where a similar patch apparently solved
the customer issue. thanks.


Comment #24 From Thomas Graf (tgraf) on 2006-03-06 12:15 EST

Yes, although the two referred patches alone are incomplete: they lack a
flush_scheduled_work() as proposed by Roland Dreier and a follow-up patch by
Julian to correctly reorder the locking. The patch attached to this BZ includes
all of these patches and is functionally equivalent to upstream.


Comment #25 From Jason Baron (jbaron) on 2006-03-19 13:47 EST

committed in stream u4 build 34.5. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/


Comment #26 From Linda Wang (lwang) on 2006-03-28 10:26 EST

Move to U4 CANFIX list.


Comment #27 From Jason Baron (jbaron) on 2006-03-29 13:32 EST

*** Bug 169600 has been marked as a duplicate of this bug. ***


Comment #28 From Bob Johnson (bjohnson) on 2006-04-11 12:06 EST

This issue is on Red Hat Engineering's list of planned work items 
for the upcoming Red Hat Enterprise Linux 4.4 release.  Engineering 
resources have been assigned and barring unforeseen circumstances, Red 
Hat intends to include this item in the 4.4 release.


Comment #29 From HOTFIX Tracker (Red Hat Internal App) (hotfix-tracker) on 2006-04-26 10:35 EST

HOTFIX Request has been released
http://seg.rdu.redhat.com/scripts/hotfix/edit.pl?id=982


Comment #30 From Jason Baron (jbaron) on 2006-05-05 06:50 EST

*** Bug 172696 has been marked as a duplicate of this bug. ***


Comment #31 From Issue Tracker (tao) on 2006-05-12 14:37 EST

Internal Status set to 'Resolved'
Status set to: Closed by Client
Resolution set to: 'RHEL 4 U4'

This event sent from IssueTracker by jwhiter 
 issue 84050
              


Comment #32 From Issue Tracker (tao) on 2006-05-12 14:48 EST

Internal Status set to 'Resolved'
Status set to: Closed by Client
Resolution set to: 'RHEL 4 U4'

This event sent from IssueTracker by jwhiter 
 issue 86360
              


Comment #33 From HOTFIX Tracker (Red Hat Internal App) (hotfix-tracker) on 2006-05-22 13:21 EST

HOTFIX Release has been rescinded
http://seg.rdu.redhat.com/scripts/hotfix/edit.pl?id=982


Comment #34 From Red Hat Bugzilla (bugzilla) on 2006-05-24 00:42 EST

Bug report changed to ON_QA status by Errata System.
A QE request has been submitted for advisory RHSA-2006:0497-10.
http://errata.devel.redhat.com/errata/showrequest.cgi?advisory=4020


Comment #35 From HOTFIX Tracker (Red Hat Internal App) (hotfix-tracker) on 2006-06-14 16:59 EST

HOTFIX Requested http://seg.rdu.redhat.com/scripts/hotfix/edit.pl?id=1107


Comment #36 From HOTFIX Tracker (Red Hat Internal App) (hotfix-tracker) on 2006-06-15 10:47 EST

HOTFIX Request has been released
http://seg.rdu.redhat.com/scripts/hotfix/edit.pl?id=1107


Comment #37 From Mike Gahagan (mgahagan) on 2006-06-16 11:51 EST

Looks like all the affected customers are happy with the hotfix, setting
customer_verified.


Comment #38 From Issue Tracker (tao) on 2006-06-19 15:29 EST

The .src.rpm file for the -37.EL kernel is too big to store in Issue
Tracker, so I've made it available on my People page. You can download the
file from here:

http://people.redhat.com/gcase/rhel4/

-Gary

Internal Status set to 'Waiting on Customer'
Status set to: Waiting on Client

This event sent from IssueTracker by gcase 
 issue 91336
              


Comment #39 From Jason Baron (jbaron) on 2006-07-13 15:48 EST

*** Bug 198321 has been marked as a duplicate of this bug. ***


Comment #40 From Jason Baron (jbaron) on 2006-07-14 14:28 EST

*** Bug 198892 has been marked as a duplicate of this bug. ***


Comment #41 From Red Hat Bugzilla (bugzilla) on 2006-07-20 14:37 EST

Bug report changed to RELEASE_PENDING status by Errata System.
Advisory RHSA-2006:0575-14 has been changed to HOLD status.
http://errata.devel.redhat.com/errata/showrequest.cgi?advisory=4178


Comment 2 Leonard Ye 2006-08-02 08:13:18 UTC
Thanks for the detailed information about bug 174990. The description is
different from what is being reported in bug 198321, which is marked as a
duplicate of 174990.

Bug 174990 occurs when IPVS load balancers are installed, but that module is
not installed in the system from bug 198321. Furthermore, the kernel messages
are different too. Lastly, it does not occur very often on our hardware, and we
have not found a way to reproduce it yet.

Comment 4 Phil Knirsch 2007-05-16 09:55:29 UTC
Has your problem been fixed with the latest update kernels for RHEL 4.5?

Thanks,

Read ya, Phil

Comment 5 Zdenek Prikryl 2008-07-15 08:01:11 UTC
Hello,
I'm closing this bug because the reporter has not replied for a very long time,
so it seems that the problem has been fixed.