Bug 199900
| Summary: | 174990 | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 4 | Reporter: | Ricky <rickychan> |
| Component: | acpid | Assignee: | Zdenek Prikryl <zprikryl> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.0 | CC: | leonardye |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2008-07-15 08:01:11 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Ricky
2006-07-24 07:32:27 UTC
Here is the relevant correspondence in bug 174990. The bug itself is limited to internal view only, which is why you couldn't see it. Please let me know whether the bug described there actually matches what you were seeing, so we can either close this bug as a duplicate or process it properly otherwise. Read ya, Phil

Opened by Issue Tracker (tao) on 2005-12-05 10:30 EST (Private)

Comment #1 From Issue Tracker (tao) on 2005-12-05 10:31 EST (Private)
From User-Agent: XML-RPC

*** 02-NOV-2005 17:35:45 Notes: (UMRSERVER)
Problem Description: We are seeing random hard lockups on our two IPVS load balancers. One machine is at the current 2.6.9-22, the other is still at 2.6.9-11. The previous ticket was closed; we thought it had something to do with the "dst cache overflow" error, but we have not seen that message recently. The -11 machine has been very stable, which is why it hadn't been upgraded, but it has locked up twice in the past two days. There are no indications of any problem on the box. See the previous ticket for the support tarball that was generated on one of the machines. The setup is two onboard GigE interfaces on a Dell 1850, each configured in a bonding driver set up for failover only, each configured as a VLAN trunk line, with three VLANs in use: native, plus two additional tagged VLANs. The IPVS configuration is managed with keepalived and is set up with the two boxes using VRRP and ipvs_syncd for dynamic balancer failover. The lockup is to the point where alt-sysrq S/U/B never reboots the machine, but alt-sysrq C to trigger a kernel crash does. There are no symptoms on the serial console or in the syslogs. I have been asked to get more diagnostics and support from Red Hat on this issue before we replace the kernel with a kernel.org build to see whether that has any impact on the behavior.
*** 03-NOV-2005 05:29:44 Notes: Wulf, Joshua Jae (JWULF)
Problem Description: In order to get sufficient information for a diagnosis we'll need some kind of crash dump. Can we try setting up Netdump:
http://kbase.redhat.com/faq/FAQ_43_2198.shtm
http://kbase.redhat.com/faq/FAQ_43_2467.shtm
http://kbase.redhat.com/faq/FAQ_80_3721.shtm
If there have been no software updates on the machine and it suddenly starts locking up, it's usually due to one of a limited number of things: the audit subsystem shutting the system down when it goes over 80% disk used, or a hardware error. A hardware error on two separate machines simultaneously is less probable. Trying a kernel.org build may help us eliminate some variables. regards, Joshua J Wulf

*** 03-NOV-2005 08:56:08 Notes: (UMRSERVER)
Problem Description: Will try this... the crash.c test doesn't work right (it doesn't build a valid module), but that's a separate issue. Testing with sysrq-c worked fine, though. Will enable this functionality on the ipvs boxes. (I tested it on a different, non-production box.) Here's a problem: these two machines use the bonding driver for the eth interfaces, and netdump/netconsole will not load because, according to a syslog message, the bonding driver doesn't support polling. How am I supposed to address this?

*** 03-NOV-2005 22:53:36 Notes: Wulf, Joshua Jae (JWULF)
Problem Description: We'd need to unbond the interfaces (which changes the system configuration and probably disrupts your production environment) or install another adapter for Netdump. regards, Joshua

*** 17-NOV-2005 23:50:09 Notes: (UMRSERVER)
Problem Description: We will be unbonding the interfaces this weekend on these two boxes. One is at -22, the other is still at -11. The one at -11 seems to get the "dst cache overflow" error. We haven't seen that on the -22 kernel.

*** 18-NOV-2005 18:42:49 Notes: Wulf, Joshua Jae (JWULF)
Problem Description: OK, we'll wait for the results.
regards, Joshua

*** 28-NOV-2005 08:54:06 Notes: (UMRSERVER)
Problem Description: Finally managed to get the updated BIOS, the new kernel, and the net devices reconfigured. The machine hard locked again and, as usual, never triggered a panic. When I forcibly issued alt-sysrq-c to crash it, here's the log that was generated. Additionally, on the console, but not in the netconsole log, I saw this right after the trace of the call to __call_console_drivers:
<6>NETDEV WATCHDOG: eth0: transmit timed out
Any suggestions on what to do next? At this point the machines are running with a single ethernet interface, both at -22, both with the current Dell BIOS, both with netdump+netconsole enabled.

*** 28-NOV-2005 08:54:15 Notes: (UMRSERVER)
Problem Description: File log attached

*** 30-NOV-2005 09:12:51 Notes: (UMRSERVER)
Problem Description: Additional info: with netconsole/netdump installed on these boxes, I am no longer able to remotely reboot them after a panic (even with A-S-C). They either lock up completely or, as in today's case, get stuck in some sort of loop doing back traces.

*** 30-NOV-2005 11:57:36 Notes: Kloiber, Christopher K (Chris) (CKLOIBER)
Problem Description: Please provide a sysreport from each machine, thanks.

*** 30-NOV-2005 11:59:16 Notes: (UMRSERVER)
Action: We want to have this issue moved to the Americas IT support team ASAP. Our support contact indicated it was in the Australian IT support team. His email follows: I dropped your issue to my Sales Engineer and he informed me that the issue is currently located in our Australian IT location. I would strongly recommend going into the existing ticket and requesting that it be transferred back to America. I have contacted the manager of the Americas IT location stating that you would be doing this. Please keep in mind that I am just a Sales Rep, but I will do everything I can to get your issues resolved. Ben Freeman - EDU West

*** 30-NOV-2005 12:21:35 Notes: (UMRSERVER)
Problem Description: sysreports added.
The one from ipvs02 will likely not be useful, as it is changed hardware (we are trying to diagnose this ourselves), and it hasn't been up long enough to have experienced the symptom. ipvs01 did the crash/lockup again last night at 2am or so. BTW, your system notifications from this support app are totally screwed up. I am not getting updates regularly, and have never seen a notification for the most recent change to the ticket; it often shows me the previous message/update, even when it is an update that I triggered myself.

*** 30-NOV-2005 12:21:36 Notes: (UMRSERVER)
Problem Description: File nneulipvs01.tar.bz2 attached
File nneulipvs02.tar.bz2 attached

*** 30-NOV-2005 13:50:12 Notes: Kloiber, Christopher K (Chris) (CKLOIBER)
Problem Description: I found a small problem (probably not the root cause) in /etc/modprobe.conf. The options line for bond0 should read:
options bond0 miimon=100 mode=1
The sysreport labeled nneulipvs02 is still running the 2.6.9-11smp kernel and has only one Intel e100 network adapter, so it's not capable of bonding. The sysreport neglects to capture the load balancing configurations. Can you please attach a tar file containing the contents of the /etc/sysconfig/ha directory (from both machines, please)? Thanks.

*** 30-NOV-2005 14:12:42 Notes: (UMRSERVER)
Problem Description: Will address the other issues in a moment... On the mode=1 comment: see the attached source from bond_main.c. The bonding driver accepts either syntax, per Documentation/networking/bonding.txt: "Options with textual values will accept either the text name or, for backwards compatibility, the option value. E.g., 'mode=802.3ad' and 'mode=4' set the same mode."

/*
 * Convert string input module parms.  Accept either the
 * number of the mode or its string name.
 */
static inline int bond_parse_parm(char *mode_arg, struct bond_parm_tbl *tbl)
{
	int i;

	for (i = 0; tbl[i].modename; i++) {
		if ((isdigit(*mode_arg) &&
		     tbl[i].mode == simple_strtol(mode_arg, NULL, 0)) ||
		    (strncmp(mode_arg, tbl[i].modename,
			     strlen(tbl[i].modename)) == 0)) {
			return tbl[i].mode;
		}
	}

	return -1;
}

*** 30-NOV-2005 14:17:23 Notes: (UMRSERVER)
Problem Description: We are no longer using bonding. This should be clear from the previous ticket updates: you wanted us to enable netconsole/netdump, and it's not compatible with bonding without adding another eth card, which would have been more invasive. Turning bonding off had no effect; the problem still occurs. ipvs02 was reinstalled on new hardware yesterday. Our base install does not put on -22 because it is incompatible with the vast majority of systems that we run RHEL on. We have also not yet seen the problem with ipvs02 on the replacement hardware, but have seen it twice since yesterday on ipvs01 on the original Dell 1850. We are not using /etc/sysconfig/ha, but are using keepalived. On each machine, here is the startup sequence we run:

ipvsadm --start-daemon master --mcast-interface eth0 --syncid 210
ipvsadm --start-daemon backup --mcast-interface eth0 --syncid 211
ipvsadm --set 7200 120 300
/usr/sbin/keepalived \
    --dump-conf \
    --vrrp \
    --use-file /local/server/build/keepalived.conf.`hostname --short` \
    --log-console \
    --log-detail
/usr/sbin/keepalived \
    --dump-conf \
    --check \
    --use-file /local/server/build/keepalived.conf.`hostname --short` \
    --log-console \
    --log-detail

I will attach the two keepalived config files momentarily.
*** 30-NOV-2005 14:20:19 Notes: (UMRSERVER)
Problem Description: Attached the keepalived configs and the output of "ipvsadm -L -n".

*** 30-NOV-2005 14:20:20 Notes: (UMRSERVER)
Problem Description: File keepalived.conf.ipvs01 attached
File keepalived.conf.ipvs02 attached
File ipvsout attached

*** 30-NOV-2005 14:21:30 Notes: (UMRSERVER)
Problem Description: Note, we do intend to turn bonding back on once this is resolved, but we are leaving it out of the picture for now to eliminate variables.

*** 30-NOV-2005 14:29:29 Notes: Whiter, Josef M (JWHITER)
Action: Dear Sir, I am going to escalate this issue to engineering. When one of the boxes crashes again, use the sysrq-c keys to try to get a vmcore, because that will be most helpful in determining the cause of this crash. Thank you, Josef Whiter, Red Hat
This event sent from IssueTracker by streeter issue 84050

Comment #7 From Issue Tracker (tao) on 2005-12-05 10:31 EST (Private)
From User-Agent: XML-RPC
Has this netdump_log file been edited? It looks different from others I've seen before. Why do you (or the customer) think that IPVS is implicated in the hang?
Internal Status set to 'Waiting on Support'
This event sent from IssueTracker by streeter issue 84050

Comment #8 From Issue Tracker (tao) on 2005-12-05 10:31 EST (Private)
From User-Agent: XML-RPC
It is hard to tell from this netdump log, but it looks like an attempt was made to reboot the machine (sysrq-b) before trying to crash it. This may have confused things a bit. In attempting to get a dump from this hang, please try the crash (sysrq-c) without trying to kill processes or reboot the machine first.
This event sent from IssueTracker by streeter issue 84050

Comment #9 From Issue Tracker (tao) on 2005-12-05 10:32 EST (Private)
From User-Agent: XML-RPC
Notes from the customer
=========================================================
Well, we got another one. Unfortunately, I accidentally hit sysrq-s prior to sysrq-c, so I'm not certain this one will have anything useful. No matter, it will do it again soon enough, I'm sure. BTW, we did rule out hardware: the ipvs02 machine experienced the same symptom, but we were not able to get a log or dump from it, because netdump didn't get reconfigured after reinstalling on new HW. It's back running the old HW now. Here's the log from netconsole. It did not do a dump; in fact, it is still spinning doing traces:

SysRq : Emergency Sync
SysRq : Crashing the kernel by request
Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
c020abf0
*pde = 10013001
Oops: 0002 [#1]
SMP
Modules linked in: netconsole netdump ip_vs_rr ip_vs md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc 8021q button battery ac uhci_hcd ehci_hcd hw_random e1000 floppy sg dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod megaraid_mbox megaraid_mm sd_mod scsi_mod
CPU: 1
EIP: 0060:[<c020abf0>] Not tainted VLI
EFLAGS: 00010246 (2.6.9-22.ELsmp)
EIP is at sysrq_handle_crash+0x0/0x8
eax: 00000063 ebx: c033a294 ecx: 00000000 edx: d078edf4
esi: 00000063 edi: 00000000 ebp: d078edf4 esp: c03e8f74
ds: 007b es: 007b ss: 0068
Process sh (pid: 27841, threadinfo=c03e8000 task=db0927b0)
Stack: c020ad7e c02f1dbc c02f31cd 00000006 c0458c20 dde21000 00000063 c03e8fc8
       c021aa24 00000100 63000010 d078edf4 d078edf4 c0458c20 c0458cc8 c0458110
       00000000 c021acbe 00000000 d078edf4 00000004 00000061 df98b840 00000001
Call Trace:
 [<c020ad7e>] __handle_sysrq+0x58/0xc6
 [<c021aa24>] receive_chars+0x140/0x1f6
 [<c021acbe>] serial8250_interrupt+0x64/0xcb
 [<c010745e>] handle_IRQ_event+0x25/0x4f
 [<c01079be>] do_IRQ+0x11c/0x1ae
 =======================
 [<c02d1a8c>] common_interrupt+0x18/0x20
 [<c02cfc73>] _spin_lock+0x2e/0x34
 [<c0160493>] nr_blockdev_pages+0xd/0x3f
 [<c01436a8>] si_meminfo+0x1f/0x3b
 [<c0188cf4>] meminfo_read_proc+0x41/0x191
 [<c0143050>] buffered_rmqueue+0x17d/0x1a5
 [<c014314d>] __alloc_pages+0xd5/0x2f7
 [<c0187143>] proc_file_read+0xd1/0x225
 [<c0159c61>] vfs_read+0xb6/0xe2
 [<c0159e74>] sys_read+0x3c/0x62
 [<c02d10cf>] syscall_call+0x7/0xb
Code: 11 c0 c7 05 10 7c 44 c0 00 00 00 00 c7 05 38 7c 44 c0 00 00 00 00 c7 05 2c 7c 44 c0 6e ad 87 4b 89 15 28 7c 44 c0 e9 d3 5d f2 ff <c6> 05 00 00 00 00 00 c3 e9 92 04 f5 ff e9 8b 4c f5 ff 85 d2 89

and here is one of the traces it is spewing right now:

 [<e097a653>] netpoll_netdump+0x7f/0x478 [netdump]
 [<c020abf0>] sysrq_handle_crash+0x0/0x8
 [<e097a5d4>] netpoll_netdump+0x0/0x478 [netdump]
 [<e097a5cb>] netpoll_start_netdump+0xe9/0xf2 [netdump]
 =======================
 [<c013403d>] try_crashdump+0x31/0x33
 [<c010601a>] die+0xe2/0x16b
 [<c0122459>] vprintk+0x136/0x14a
 [<c011ac2d>] do_page_fault+0x0/0x5c6
 [<c011b01d>] do_page_fault+0x3f0/0x5c6
 [<c020abf0>] sysrq_handle_crash+0x0/0x8
 [<e08ede1a>] e1000_xmit_frame+0x947/0x951 [e1000]
 [<c010b377>] timer_interrupt+0xd6/0xde
 [<c0278b71>] alloc_skb+0x33/0xc5
 [<c0129715>] __mod_timer+0x101/0x10b
 [<c020a2e5>] poke_blanked_console+0x8f/0x9a
 [<c02096a5>] vt_console_print+0x294/0x2a5
 [<c0209411>] vt_console_print+0x0/0x2a5
 [<c0122103>] __call_console_drivers+0x36/0x40
 [<c011ac2d>] do_page_fault+0x0/0x5c6
 [<c02d1bab>] error_code+0x2f/0x38
 [<c020abf0>] sysrq_handle_crash+0x0/0x8
 [<c020ad7e>] __handle_sysrq+0x58/0xc6
 [<c021aa24>] receive_chars+0x140/0x1f6
 [<c021acbe>] serial8250_interrupt+0x64/0xcb
 [<c010745e>] handle_IRQ_event+0x25/0x4f
 [<c01079be>] do_IRQ+0x11c/0x1ae
 =======================
 [<c02d1a8c>] common_interrupt+0x18/0x20
 [<c02cfc73>] _spin_lock+0x2e/0x34
 [<c0160493>] nr_blockdev_pages+0xd/0x3f
 [<c01436a8>] si_meminfo+0x1f/0x3b
 [<c0188cf4>] meminfo_read_proc+0x41/0x191
 [<c0143050>] buffered_rmqueue+0x17d/0x1a5
 [<c014314d>] __alloc_pages+0xd5/0x2f7
 [<c0187143>] proc_file_read+0xd1/0x225
 [<c0159c61>] vfs_read+0xb6/0xe2
 [<c0159e74>] sys_read+0x3c/0x62
 [<c02d10cf>] syscall_call+0x7/0xb

Badness in local_bh_enable at kernel/softirq.c:141
 [<c01264bd>] local_bh_enable+0x34/0x57
 [<c0290995>] rt_garbage_collect+0x196/0x276
 [<c028189d>] dst_alloc+0x18/0x85
 [<c0292493>] ip_route_input_slow+0x56f/0x849
 [<c02886bb>] checksum_udp+0x6f/0x88
 [<c0294613>] ip_rcv+0x1c9/0x3ff
 [<c027e1c1>] netif_receive_skb+0x1f1/0x21f
 [<e08ef029>] e1000_clean_rx_irq+0x388/0x3fa [e1000]
 [<e08ee723>] e1000_clean+0x3a/0xcd [e1000]
 [<c0288738>] poll_napi+0x64/0x84
 [<c0288788>] netpoll_poll+0x30/0x35
 [<e097a440>] netdump_startup_handshake+0x7f/0x10d [netdump]
 [<c0205c5d>] scrup+0x63/0xce
 [<c0206247>] complement_pos+0x12/0x132
 [<c02066dc>] set_cursor+0x62/0x6e
 [<c0209697>] vt_console_print+0x286/0x2a5
 [<c0209411>] vt_console_print+0x0/0x2a5
 [<c01220c3>] crashdump_call_console_drivers+0x27/0x31
 [<c0122467>] vprintk+0x144/0x14a
 [<e097a653>] netpoll_netdump+0x7f/0x478 [netdump]
 [<c020abf0>] sysrq_handle_crash+0x0/0x8
 [<e097a5d4>] netpoll_netdump+0x0/0x478 [netdump]
 [<e097a5cb>] netpoll_start_netdump+0xe9/0xf2 [netdump]
 =======================
 [<c013403d>] try_crashdump+0x31/0x33
 [<c010601a>] die+0xe2/0x16b
 [<c0122459>] vprintk+0x136/0x14a
 [<c011ac2d>] do_page_fault+0x0/0x5c6
 [<c011b01d>] do_page_fault+0x3f0/0x5c6
 [<c020abf0>] sysrq_handle_crash+0x0/0x8
 [<e08ede1a>] e1000_xmit_frame+0x947/0x951 [e1000]
 [<c010b377>] timer_interrupt+0xd6/0xde
 [<c0278b71>] alloc_skb+0x33/0xc5

To answer your question about it being ip_vs (not ipvsadm, that's just the config tool): we have 60+ RHEL ES4 boxes in service. Only two of them are regularly and consistently seeing hard lockups, and they are our two load balancer boxes. We had two boxes running a very similar setup, but with RH9, previously. They were rock solid.
Unfortunately, lots of variables changed when we upgraded: new OS, new kernel version, new hardware, GigE with VLAN trunks instead of e100 with dedicated ports, channel bonding for eth failover.
=============================================================================
Internal Status set to 'Waiting on SEG'
This event sent from IssueTracker by streeter issue 84050

Comment #10 From Issue Tracker (tao) on 2005-12-05 10:32 EST (Private)
From User-Agent: XML-RPC
A vmcore would be very helpful. I'm sending this up to the developers, because I can't make any sense of it.
This event sent from IssueTracker by streeter issue 84050

Comment #11 From Guy Streeter (streeter) on 2005-12-05 10:41 EST (Private)
Sorry about including so much in the bug report, but the original customer issue has it all in the issue description, and our escalation tool doesn't let us submit partial events into Bugzilla. The short version is: the customer sees random hard lockups (not crashes, in spite of the summary line) on systems running an ipvs load balancer. Many other "identical" systems not running ipvs don't hang. The customer has forced a crash, and the netconsole shows a non-stop (looping?) backtrace. The customer is attempting to get a net dump.

Comment #12 From Issue Tracker (tao) on 2005-12-06 14:49 EST (Private)
From User-Agent: XML-RPC
Did it again. This time it looked like it was going to give me a dump from the console, but I got neither a dump nor a log.
Here's stuff from the serial console:

[BREAK]SysRq : Crashing the kernel by request
Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
c020abf0
*pde = 10013001
Oops: 0002 [#1]
SMP
Modules linked in: netconsole netdump ip_vs_rr ip_vs md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc 8021q button battery ac uhci_hcd ehci_hcd hw_random e1000 floppy sg dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod megaraid_mbox megaraid_mm sd_mod scsi_mod
CPU: 1
EIP: 0060:[<c020abf0>] Not tainted VLI
EFLAGS: 00010246 (2.6.9-22.ELsmp)
EIP is at sysrq_handle_crash+0x0/0x8
eax: 00000063 ebx: c033a294 ecx: 00000000 edx: d078edf4
esi: 00000063 edi: 00000000 ebp: d078edf4 esp: c03e8f74
ds: 007b es: 007b ss: 0068
Process sh (pid: 27841, threadinfo=c03e8000 task=db0927b0)
Stack: c020ad7e c02f1dbc c02f31cd 00000006 c0458c20 dde21000 00000063 c03e8fc8
       c021aa24 00000100 63000010 d078edf4 d078edf4 c0458c20 c0458cc8 c0458110
       00000000 c021acbe 00000000 d078edf4 00000004 00000061 df98b840 00000001
Call Trace:
 [<c020ad7e>] __handle_sysrq+0x58/0xc6
 [<c021aa24>] receive_chars+0x140/0x1f6
 [<c021acbe>] serial8250_interrupt+0x64/0xcb
 [<c010745e>] handle_IRQ_event+0x25/0x4f
 [<c01079be>] do_IRQ+0x11c/0x1ae
 =======================
 [<c02d1a8c>] common_interrupt+0x18/0x20
 [<c02cfc73>] _spin_lock+0x2e/0x34
 [<c0160493>] nr_blockdev_pages+0xd/0x3f
 [<c01436a8>] si_meminfo+0x1f/0x3b
 [<c0188cf4>] meminfo_read_proc+0x41/0x191
 [<c0143050>] buffered_rmqueue+0x17d/0x1a5
 [<c014314d>] __alloc_pages+0xd5/0x2f7
 [<c0187143>] proc_file_read+0xd1/0x225
 [<c0159c61>] vfs_read+0xb6/0xe2
 [<c0159e74>] sys_read+0x3c/0x62
 [<c02d10cf>] syscall_call+0x7/0xb
Code: 11 c0 c7 05 10 7c 44 c0 00 00 00 00 c7 05 38 7c 44 c0 00 00 00 00 c7 05 2c 7c 44 c0 6e ad 87 4b 89 15 28 7c 44 c0 e9 d3 5d f2 ff <c6> 05 00 00 00 00 00 c3 e9 92 04 f5 ff e9 8b 4c f5 ff 85 d2 89
CPU#0 is frozen.
CPU#1 is executing netdump.
< netdump activated - performing handshake with the server. >
Badness in local_bh_enable at kernel/softirq.c:141
 [<c01264bd>] local_bh_enable+0x34/0x57
 [<c0283175>] neigh_connected_output+0x6d/0xad
 [<c0296e9e>] ip_finish_output2+0x12e/0x16d
 [<e0a2c69e>] ip_vs_post_routing+0x14/0x1c [ip_vs]
 [<c028675b>] nf_iterate+0x40/0x81
 [<c0296d70>] ip_finish_output2+0x0/0x16d
 [<c0286a59>] nf_hook_slow+0x47/0xb4
 [<c0296d70>] ip_finish_output2+0x0/0x16d
 [<c0296d67>] ip_finish_output+0x1a5/0x1ae
 [<c0296d70>] ip_finish_output2+0x0/0x16d
 [<e0a30863>] ip_vs_dr_xmit+0x2d0/0x34a [ip_vs]
 [<e0a2b0ff>] ip_vs_conn_in_get+0x87/0x150 [ip_vs]
 [<e0a334ee>] tcp_state_transition+0x130/0x13d [ip_vs]
 [<e0a32e63>] tcp_conn_in_get+0x83/0x8b [ip_vs]
 [<e0a30593>] ip_vs_dr_xmit+0x0/0x34a [ip_vs]
 [<e0a2d155>] ip_vs_in+0x1a1/0x1f3 [ip_vs]
 [<c028675b>] nf_iterate+0x40/0x81
 [<c02942c2>] ip_local_deliver_finish+0x0/0x188
 [<c0286a59>] nf_hook_slow+0x47/0xb4
 [<c02942c2>] ip_local_deliver_finish+0x0/0x188
 [<c02942bb>] ip_local_deliver+0x1d9/0x1e0
 [<c02942c2>] ip_local_deliver_finish+0x0/0x188
 [<c02947a8>] ip_rcv+0x35e/0x3ff
 [<c027e1c1>] netif_receive_skb+0x1f1/0x21f
 [<e08ef029>] e1000_clean_rx_irq+0x388/0x3fa [e1000]
 [<e08ee723>] e1000_clean+0x3a/0xcd [e1000]
 [<c0288738>] poll_napi+0x64/0x84
 [<c0288788>] netpoll_poll+0x30/0x35
 [<e097a440>] netdump_startup_handshake+0x7f/0x10d [netdump]
 [<c0205c5d>] scrup+0x63/0xce
 [<c0206247>] complement_pos+0x12/0x132
 [<c02066dc>] set_cursor+0x62/0x6e
 [<c0209697>] vt_console_print+0x286/0x2a5
 [<c0209411>] vt_console_print+0x0/0x2a5
 [<c01220c3>] crashdump_call_console_drivers+0x27/0x31
 [<c0122467>] vprintk+0x144/0x14a
 [<e097a653>] netpoll_netdump+0x7f/0x478 [netdump]
 [<c020abf0>] sysrq_handle_crash+0x0/0x8
 [<e097a5d4>] netpoll_netdump+0x0/0x478 [netdump]
 [<e097a5cb>] netpoll_start_netdump+0xe9/0xf2 [netdump]
 =======================
 [<c013403d>] try_crashdump+0x31/0x33
 [<c010601a>] die+0xe2/0x16b
 [<c0122459>] vprintk+0x136/0x14a
 [<c011ac2d>] do_page_fault+0x0/0x5c6
 [<c011b01d>] do_page_fault+0x3f0/0x5c6
 [<c020abf0>] sysrq_handle_crash+0x0/0x8
 [<e08ede1a>] e1000_xmit_frame+0x947/0x951 [e1000]
 [<c010b377>] timer_interrupt+0xd6/0xde
 [<c0278b71>] alloc_skb+0x33/0xc5
 [<c0129715>] __mod_timer+0x101/0x10b
 [<c020a2e5>] poke_blanked_console+0x8f/0x9a
 [<c02096a5>] vt_console_print+0x294/0x2a5
 [<c0209411>] vt_console_print+0x0/0x2a5
 [<c0122103>] __call_console_drivers+0x36/0x40
 [<c011ac2d>] do_page_fault+0x0/0x5c6
 [<c02d1bab>] error_code+0x2f/0x38
 [<c020abf0>] sysrq_handle_crash+0x0/0x8
 [<c020ad7e>] __handle_sysrq+0x58/0xc6
 [<c021aa24>] receive_chars+0x140/0x1f6
 [<c021acbe>] serial8250_interrupt+0x64/0xcb
 [<c010745e>] handle_IRQ_event+0x25/0x4f
 [<c01079be>] do_IRQ+0x11c/0x1ae
 =======================
 [<c02d1a8c>] common_interrupt+0x18/0x20
 [<c02cfc73>] _spin_lock+0x2e/0x34
 [<c0160493>] nr_blockdev_pages+0xd/0x3f
 [<c01436a8>] si_meminfo+0x1f/0x3b
 [<c0188cf4>] meminfo_read_proc+0x41/0x191
 [<c0143050>] buffered_rmqueue+0x17d/0x1a5
 [<c014314d>] __alloc_pages+0xd5/0x2f7
 [<c0187143>] proc_file_read+0xd1/0x225
 [<c0159c61>] vfs_read+0xb6/0xe2
 [<c0159e74>] sys_read+0x3c/0x62
 [<c02d10cf>] syscall_call+0x7/0xb
<4>Warning: kfree_skb on hard IRQ c0278dc3
<4>Warning: kfree_skb on hard IRQ c0278dc3
<4>Warning: kfree_skb on hard IRQ c0278dc3
Badness in local_bh_enable at kernel/softirq.c:141
 [<c01264bd>] local_bh_enable+0x34/0x57
 [<c027dccb>] dev_queue_xmit+0x1ff/0x207
 [<e0919dea>] vlan_dev_hwaccel_hard_start_xmit+0x5b/0x62 [8021q]
 [<c027dc33>] dev_queue_xmit+0x167/0x207
 [<c0283187>] neigh_connected_output+0x7f/0xad
 [<c0296e9e>] ip_finish_output2+0x12e/0x16d
 [<e0a2c69e>] ip_vs_post_routing+0x14/0x1c [ip_vs]
 [<c028675b>] nf_iterate+0x40/0x81
 [<c0296d70>] ip_finish_output2+0x0/0x16d
 [<c0286a59>] nf_hook_slow+0x47/0xb4
 [<c0296d70>] ip_finish_output2+0x0/0x16d
 [<c0296d67>] ip_finish_output+0x1a5/0x1ae
 [<c0296d70>] ip_finish_output2+0x0/0x16d
 [<e0a30863>] ip_vs_dr_xmit+0x2d0/0x34a [ip_vs]
 [<e0a2b0ff>] ip_vs_conn_in_get+0x87/0x150 [ip_vs]
 [<e0a334ee>] tcp_state_transition+0x130/0x13d [ip_vs]
 [<e0a32e63>] tcp_conn_in_get+0x83/0x8b [ip_vs]
 [<e0a30593>] ip_vs_dr_xmit+0x0/0x34a [ip_vs]
 [<e0a2d155>] ip_vs_in+0x1a1/0x1f3 [ip_vs]
 [<c028675b>] nf_iterate+0x40/0x81
 [<c02942c2>] ip_local_deliver_finish+0x0/0x188
 [<c0286a59>] nf_hook_slow+0x47/0xb4
 [<c02942c2>] ip_local_deliver_finish+0x0/0x188
 [<c02942bb>] ip_local_deliver+0x1d9/0x1e0
 [<c02942c2>] ip_local_deliver_finish+0x0/0x188
 [<c02947a8>] ip_rcv+0x35e/0x3ff
 [<c027e1c1>] netif_receive_skb+0x1f1/0x21f
 [<e08ef029>] e1000_clean_rx_irq+0x388/0x3fa [e1000]
 [<e08ee723>] e1000_clean+0x3a/0xcd [e1000]
 [<c0288738>] poll_napi+0x64/0x84
 [<c0288788>] netpoll_poll+0x30/0x35
 [<e097a440>] netdump_startup_handshake+0x7f/0x10d [netdump]
 [<c0205c5d>] scrup+0x63/0xce
 [<c0206247>] complement_pos+0x12/0x132
 [<c02066dc>] set_cursor+0x62/0x6e
 [<c0209697>] vt_console_print+0x286/0x2a5
 [<c0209411>] vt_console_print+0x0/0x2a5
 [<c01220c3>] crashdump_call_console_drivers+0x27/0x31
 [<c0122467>] vprintk+0x144/0x14a
 [<e097a653>] netpoll_netdump+0x7f/0x478 [netdump]
 [<c020abf0>] sysrq_handle_crash+0x0/0x8
 [<e097a5d4>] netpoll_netdump+0x0/0x478 [netdump]
 [<e097a5cb>] netpoll_start_netdump+0xe9/0xf2 [netdump]

and it goes on and on...
This event sent from IssueTracker by jwhiter issue 84050

Comment #18 From Issue Tracker (tao) on 2006-01-17 15:06 EST (Private)
From User-Agent: XML-RPC
The boxes have been up for 20 days without issue. I'm attaching the patch I used. Lowering severity.
Severity set to: High
This event sent from IssueTracker by jwhiter issue 84050
it_file 53450

Comment #20 From Issue Tracker (tao) on 2006-01-20 16:52 EST (Private)
From User-Agent: XML-RPC
Ingo had a different suggestion in rhkernel-list:
http://post-office.corp.redhat.com/archives/rhkernel-list/2006-January/msg00001.html
Internal Status set to 'Waiting on Engineering'
This event sent from IssueTracker by streeter issue 84050

Comment #23 From Jason Baron (jbaron) on 2006-03-06 11:37 EST (Private)
This looks like the right patch; also see http://bugs.centos.org/view.php?id=1201, where a similar patch apparently solved the customer issue. Thanks.

Comment #24 From Thomas Graf (tgraf) on 2006-03-06 12:15 EST (Private)
Yes, although the two referred patches alone are incomplete: they lack the flush_scheduled_work() proposed by Roland Dreier and Julian's follow-up patch to correctly reorder the locking. The patch attached to this BZ includes all of these patches and is functionally equivalent to upstream.

Comment #25 From Jason Baron (jbaron) on 2006-03-19 13:47 EST (Private)
Committed in stream u4, build 34.5. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/

Comment #26 From Linda Wang (lwang) on 2006-03-28 10:26 EST (Private)
Move to U4 CANFIX list.

Comment #27 From Jason Baron (jbaron) on 2006-03-29 13:32 EST (Private)
*** Bug 169600 has been marked as a duplicate of this bug. ***

Comment #28 From Bob Johnson (bjohnson) on 2006-04-11 12:06 EST (Private)
This issue is on Red Hat Engineering's list of planned work items for the upcoming Red Hat Enterprise Linux 4.4 release. Engineering resources have been assigned and, barring unforeseen circumstances, Red Hat intends to include this item in the 4.4 release.
Comment #29 From HOTFIX Tracker (Red Hat Internal App) (hotfix-tracker) on 2006-04-26 10:35 EST (Private)
HOTFIX Request has been released: http://seg.rdu.redhat.com/scripts/hotfix/edit.pl?id=982

Comment #30 From Jason Baron (jbaron) on 2006-05-05 06:50 EST (Private)
*** Bug 172696 has been marked as a duplicate of this bug. ***

Comment #31 From Issue Tracker (tao) on 2006-05-12 14:37 EST (Private)
Internal Status set to 'Resolved'
Status set to: Closed by Client
Resolution set to: 'RHEL 4 U4'
This event sent from IssueTracker by jwhiter issue 84050

Comment #32 From Issue Tracker (tao) on 2006-05-12 14:48 EST (Private)
Internal Status set to 'Resolved'
Status set to: Closed by Client
Resolution set to: 'RHEL 4 U4'
This event sent from IssueTracker by jwhiter issue 86360

Comment #33 From HOTFIX Tracker (Red Hat Internal App) (hotfix-tracker) on 2006-05-22 13:21 EST (Private)
HOTFIX Release has been rescinded: http://seg.rdu.redhat.com/scripts/hotfix/edit.pl?id=982

Comment #34 From Red Hat Bugzilla (bugzilla) on 2006-05-24 00:42 EST (Private)
Bug report changed to ON_QA status by Errata System. A QE request has been submitted for advisory RHSA-2006:0497-10.
http://errata.devel.redhat.com/errata/showrequest.cgi?advisory=4020

Comment #35 From HOTFIX Tracker (Red Hat Internal App) (hotfix-tracker) on 2006-06-14 16:59 EST (Private)
HOTFIX Requested: http://seg.rdu.redhat.com/scripts/hotfix/edit.pl?id=1107

Comment #36 From HOTFIX Tracker (Red Hat Internal App) (hotfix-tracker) on 2006-06-15 10:47 EST (Private)
HOTFIX Request has been released: http://seg.rdu.redhat.com/scripts/hotfix/edit.pl?id=1107

Comment #37 From Mike Gahagan (mgahagan) on 2006-06-16 11:51 EST (Private)
It looks like all the affected customers are happy with the hotfix; setting customer_verified.
Comment #38 From Issue Tracker (tao) on 2006-06-19 15:29 EST (Private)
The .src.rpm file for the -37.EL kernel is too big to store in Issue Tracker, so I've made it available on my People page. You can download the file from here: http://people.redhat.com/gcase/rhel4/ -Gary
Internal Status set to 'Waiting on Customer'
Status set to: Waiting on Client
This event sent from IssueTracker by gcase issue 91336

Comment #39 From Jason Baron (jbaron) on 2006-07-13 15:48 EST (Private)
*** Bug 198321 has been marked as a duplicate of this bug. ***

Comment #40 From Jason Baron (jbaron) on 2006-07-14 14:28 EST (Private)
*** Bug 198892 has been marked as a duplicate of this bug. ***

Comment #41 From Red Hat Bugzilla (bugzilla) on 2006-07-20 14:37 EST (Private)
Bug report changed to RELEASE_PENDING status by Errata System. Advisory RHSA-2006:0575-14 has been changed to HOLD status.
http://errata.devel.redhat.com/errata/showrequest.cgi?advisory=4178

Thanks for the detailed information about bug 174990. The description is different from what is being reported in bug 198321, which is marked as a duplicate of 174990. Bug 174990 occurs when ipvs load balancers are installed, but that module is not installed in bug 198321. Furthermore, the kernel messages are different too. Lastly, it does not occur very often on our hardware, and we have not found a way to reproduce it yet.

Has your problem been fixed with the latest update kernels for RHEL 4.5? Thanks, Read ya, Phil

Hello, I'm closing this bug because the reporter has not replied for a very long time, so it seems that the problem has been fixed.