Description of problem:

Netdump sends incomplete dumps with newer versions of RHEL4.

Version-Release number of selected component (if applicable):

It was found to be broken with 2.6.9-78.22.EL and 2.6.9-78.30.EL. It works fine with 2.6.9.68-23.

The machine has this Ethernet adapter:

08:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5754 Gigabit Eth)
        Subsystem: Dell: Unknown device 021e
        Flags: bus master, fast devsel, latency 0, IRQ 66
        Memory at f7ef0000 (64-bit, non-prefetchable) [size=64K]
        Expansion ROM at <ignored> [disabled]

These are the results with 2.6.9-78.30, just compiled with redhat/configs/kernel-2.6.9-i686-smp.config (with debug disabled).

Serial terminal output:

SysRq : Crashing the kernel by request
Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
c0213168
*pde = 36f46001
Oops: 0002 [#1]
SMP
Modules linked in: md5(U) ipv6(U) parport_pc(U) lp(U) parport(U) netconsole(U) netdump(U) autofs4(U) i2c_dev(U) i2c_core(U) sunrpc(U) ipt_REJECT(U) ipt_state(U) ip_conntrack(U) iptable_filter(U) ip_tables(U) cpufreq_powersave(U) joydev(U) dm_mirror(U) dm_mod(U) button(U) battery(U) ac(U) uhci_hcd(U) ehci_hcd(U) snd_hda_intel(U) snd_pcm_oss(U) snd_mixer_oss(U) snd_pcm(U) snd_timer(U) snd_page_alloc(U) snd_hwdep(U) snd(U) soundcore(U) tg3(U) sr_mod(U) ext3(U) jbd(U) ata_piix(U) ahci(U) libata(U) sd_mod(U) scsi_mod(U)
CPU:    4
EIP:    0060:[<c0213168>]    Not tainted VLI
EFLAGS: 00010016   (2.6.9-prep)
EIP is at sysrq_handle_crash+0x0/0x8
eax: 00000063   ebx: c0348854   ecx: f3957000   edx: f7f21f88
esi: 00000063   edi: f3957000   ebp: f7f21f88   esp: c03ffdf8
ds: 007b   es: 007b   ss: 0068
Process swapper (pid: 0, threadinfo=c03ff000 task=c3850030)
Stack: c0213300 c02fec45 c0300090 00000202 00000006 00000001 c3a10380 00000001
       0000002e c020dee0 f7f21f88 c0346be0 f677d80c c3a10380 0000002e c026fa75
       00000001 00000001 f66ee0ac 00000001 f6756000 f677d80c c026de64 00000001
Call Trace:
 [<c0213300>] __handle_sysrq+0x62/0xd9
 [<c020dee0>] kbd_event+0x83/0xb0
 [<c026fa75>] input_event+0x331/0x351
 [<c026de64>] hidinput_hid_event+0x1d0/0x208
 [<c026a431>] hid_process_event+0x28/0x52
 [<c026a707>] hid_input_field+0x2ac/0x2f9
 [<c026a7cd>] hid_input_report+0x79/0x98
 [<c026a882>] hid_irq_in+0x96/0xf2
 [<c02618ce>] usb_hcd_giveback_urb+0x14/0x3e
 [<f893f01e>] ehci_urb_done+0x70/0x7b [ehci_hcd]
 [<f893f252>] qh_completions+0x229/0x2ab [ehci_hcd]
 [<f8941ab9>] scan_periodic+0xff/0x246 [ehci_hcd]
 [<f8942423>] ehci_work+0x3b/0x8f [ehci_hcd]
 [<f89425b2>] ehci_irq+0x13b/0x152 [ehci_hcd]
 [<c026191e>] usb_hcd_irq+0x26/0x4b
 [<c01074f2>] handle_IRQ_event+0x25/0x4f
 [<c01074cd>] handle_IRQ_event+0x0/0x4f
 [<c0107a8d>] do_IRQ+0x157/0x1f2
 =======================
 [<c02dddfc>] common_interrupt+0x18/0x20
 [<c01040e8>] mwait_idle+0x33/0x42
 [<c01040a0>] cpu_idle+0x26/0x3b
Code: 11 c0 c7 05 10 e9 45 c0 00 00 00 00 c7 05 38 e9 45 c0 00 00 00 00 c7 05 2c e9 45 c0 6e ad 87 4b 89 15 28 e9 45 c0 e9 a4 f0 f1 ff <c6> 05 00 00 00 00 00 c3 e9 12 b6 f4 ff e9 fb 00 f5 ff 85 d2 89
CPU#0 is frozen.
CPU#1 is frozen.
CPU#2 is frozen.
CPU#3 is frozen.
CPU#4 is executing netdump.
CPU#5 is frozen.
CPU#6 is frozen.
CPU#7 is frozen.
< netdump activated - performing handshake with the server. >
NETDUMP START!
< handshake completed - listening for dump requests. >
0(4294806222)-

The netdump-server machine produces the following logs:

Jan 22 08:15:22 caramujo netdump[2730]: Got too many timeouts waiting for memory page for client 192.168.2.170, ignoring it
Jan 22 08:15:25 caramujo netdump[2730]: Got too many timeouts waiting for SHOW_STATUS for client 192.168.2.170, rebooting it

And the vmcore file at the server is incomplete:

$ du -h 192.168.2.170-2009-01-22-08:13/*
4,0K    192.168.2.170-2009-01-22-08:13/log
224K    192.168.2.170-2009-01-22-08:13/vmcore-incomplete

If I enable debug in netdump.c, it produces a similar report; in that case, however, vmcore-incomplete always has 4096 bytes.
Steps to Reproduce: the tests were done by enabling the SysRq keys and pressing SysRq-C.
Created attachment 329882 [details]
patch to trace tg3 events

Ok, so I've gone over the logs again, and since we manage to perform the handshake successfully, I can assert that we are able to send frames for a time during the netdump mode of operation (I'm making a technical distinction here, in that during the handshake the netdump_mode variable in the kernel is set to one). So there is a period of time during which the netpoll path through the tg3 poll controller and napi poll methods works. That's good news.

I need to look back at the older code and see how this changed, but my first guess as to what might be going on here is that during the napi poll, if the tg3 driver detects certain problems with the hardware, it will attempt to reset the chip, and will not receive any frames until that operation is complete. Unfortunately, the reset operation is scheduled via keventd, which will never run during a dump operation. So I wonder if the system in question here is encountering such a hardware error, with the result that it spins waiting for a recovery that will never happen.

I've prepared this debug patch for you (untested, sorry, but it should build). Once netdump starts, it will print tg3 information to the serial console about its progress through tg3_interrupt and tg3_poll. Please build a kernel with it, and send in the resulting output. It should help us confirm my theory above. Thanks!
Created attachment 329890 [details]
Captured logs for tg3 + netdump modules

The patch didn't compile cleanly, due to two small errors on lines 4000 and 4114. I fixed them and generated the logs. I'm enclosing the full captured log. Basically, these are the last messages of the log before the trouble:

<2>ENTERING TG3_INTERRUPT
<2>CHECKING IF TG3 HAS WORK
<2>TG3 HAS WORK :)!
<2>TG3_INTERRUPT DONE
<2>ENTERING TG3_POLL
<2>CALLING TG3_TX
<2>TG3 CHECKNIG RX QUEUE
<2>TG3 RX WORK DONE = 0
<2>TG3 DONE WITH NAPI POLL
<6>got new req, command 1.
<6>got MEM command.
<6> ... send_netdump_mem
<6> ... send_netdump_mem: sending message
<6> ... send_netdump_mem: sent message
<6> ... send_netdump_mem: sending message
<6> ... send_netdump_mem: sent message
<6> ... send_netdump_mem: sending message
<6> ... send_netdump_mem: sent message
<6> ... send_netdump_mem: sending message
<6> ... send_netdump_mem: sent message
<6> ... send_netdump_mem: returning
<6>main netdump loop: polling controller ...
<2>ENTERING TG3_INTERRUPT
<2>CHECKING IF TG3 HAS WORK
<2>TG3 HAS WORK :)!
<2>TG3_INTERRUPT DONE
<2>ENTERING TG3_POLL
<2>CALLING TG3_TX
<2>TG3 CHECKNIG RX QUEUE
<2>TG3 RX WORK DONE = 0
<2>TG3 DONE WITH NAPI POLL
<6>got new req, command 1.
<6>got MEM command.
<6> ... send_netdump_mem
<6> ... send_netdump_mem: sending message
<6> ... send_netdump_mem: sent message
<6> ... send_netdump_mem: sending message
<6> ... send_netdump_mem: sent message
<6> ... send_netdump_mem: sending message
<6> ... send_netdump_mem: sent message
<6> ... send_netdump_mem: sending message
<6> ... send_netdump_mem: sent message
<6> ... send_netdump_mem: returning
<6>main netdump loop: polling controller ...
<2>ENTERING TG3_INTERRUPT
<2>SD_STATUS IS UPDATED! LEAVING
<2>TG3_INTERRUPT DONE
<2>ENTERING TG3_POLL
<2>TG3 CHECKNIG RX QUEUE
<2>TG3 RX WORK DONE = 0
<2>TG3 DONE WITH NAPI POLL

(and the last set of messages keeps repeating)
Interesting, so it's not that we're getting out-of-order reads here; it's that tg3 is legitimately telling us that it didn't do any work and is exiting. I'll dig into this a bit deeper. Thanks!
Created attachment 329981 [details]
patch to ensure device quota is full in netpoll path

Actually, on inspection of the full log, it looks like we are able to receive 64 packets before we stop getting anything. That's interesting, given that 64 is the fixed per-poll budget that netpoll uses in its poll path.

On closer inspection, I think I see the problem. It appears that tg3 is being too kind for its own good here. In the tg3_poll function (the poll method for the driver), it checks the budget value that was passed in against its stored device quota, and limits it to the latter value (ostensibly to ensure that it doesn't get too big a timeslice on the processor). Unfortunately, the netpoll path assumes that drivers will just do up to the amount of work that gets passed in via the budget value, without question. The way tg3 is written, it will decrement dev->quota until it reaches zero, then limit its work to zero on each iteration. This works in the nominal case because net_rx_action always refills the dev->quota value, but netpoll never does, so if we never run softirqs (which is the case in 4.8 netdump mode), we have a problem.

This patch should fix the issue. Please test it out and confirm. Thanks!

gospo, I'd thought about removing the check from tg3_poll, so that it just does <budget> amount of work, but this felt like the broader, more correct fix. I'd appreciate your opinion on it. Thanks!
nhorman, I wondered the same thing, but saw that the quota should be reset in __netif_rx_schedule, so if this patch works, it seems apparent that we need to understand why netif_rx_schedule is not being called correctly to reset the quota. Maybe this is a side effect of the softirq handler now being disabled, plus just the right code in the tg3 driver?
Good point, I had missed that, although it makes me wonder if perhaps we're not calling netif_rx_complete, making this an infinite loop of sorts (we never remove ourselves from the napi poll list, so we never clear LINK_STATE_RX_SCHED, so we never re-schedule, so we never reset dev->quota). The logs seem to bear that out. If that's the case, this patch should still fix the problem. Once we get results to confirm, we can figure out the best approach to solve this.
The patch is in the right direction, but it seems there are still some issues. From time to time, the Ethernet traffic stalls for a long period, then it returns. I've kept the system up for about 3 hours and have got only 47864 KB of the vmcore-incomplete file. The logs seem normal:

<2>ENTERING TG3_POLL
<2>CALLING TG3_TX
<2>TG3 CHECKNIG RX QUEUE
<2>TG3 RX WORK DONE = 1
<2>TG3 COMPLETING NAPI POLL
<2>TG3 DONE WITH NAPI POLL
<2>ENTERING TG3_INTERRUPT
<2>CHECKING IF TG3 HAS WORK
<2>TG3 HAS WORK :)!
<2>TG3_INTERRUPT DONE
<2>ENTERING TG3_POLL
<2>CALLING TG3_TX
<2>TG3 CHECKNIG RX QUEUE
<2>TG3 COMPLETING NAPI POLL
<2>TG3 DONE WITH NAPI POLL
<2>ENTERING TG3_INTERRUPT
<2>CHECKING IF TG3 HAS WORK
<2>TG3 HAS WORK :)!
<2>TG3_INTERRUPT DONE
<2>ENTERING TG3_POLL
<2>CALLING TG3_TX

The first Ethernet hangup happens when vmcore-incomplete reaches about 4000 KB:

Time      Size of vmcore-incomplete
14:44:54  12K
14:44:55  40K
14:44:56  64K
14:44:57  80K
14:44:58  120K
14:44:59  160K
14:45:00  200K
14:45:01  240K
14:45:02  260K
14:45:03  280K
14:45:04  320K
14:45:05  360K
14:45:06  400K
14:45:07  440K
14:45:08  480K
14:45:09  512K
14:45:10  520K
14:45:11  560K
14:45:12  576K
14:45:13  580K
14:45:14  616K
14:45:15  628K
14:45:16  656K
14:45:17  696K
14:45:18  736K
14:45:19  776K
14:45:20  808K
14:45:21  836K
14:45:22  856K
14:45:23  896K
14:45:24  936K
14:45:25  976K
14:45:26  1016K
14:45:27  1052K
14:45:28  1080K
14:45:29  1100K
14:45:30  1136K
14:45:31  1176K
14:45:32  1216K
14:45:33  1256K
14:45:34  1296K
14:45:35  1316K
14:45:36  1336K
14:45:37  1376K
14:45:38  1416K
14:45:39  1456K
14:45:40  1496K
14:45:41  1516K
14:45:42  1536K
14:45:43  1576K
14:45:44  1616K
14:45:45  1656K
14:45:46  1696K
14:45:47  1724K
14:45:48  1736K
14:45:49  1776K
14:45:50  1816K
14:45:51  1856K
14:45:52  1896K
14:45:53  1924K
14:45:54  1948K
14:45:56  1976K
14:45:57  2016K
14:45:58  2056K
14:45:59  2096K
14:46:00  2128K
14:46:01  2136K
14:46:02  2176K
14:46:03  2216K
14:46:04  2256K
14:46:05  2296K
14:46:06  2320K
14:46:07  2336K
14:46:08  2376K
14:46:09  2416K
14:46:10  2456K
14:46:11  2496K
14:46:12  2508K
14:46:13  2536K
14:46:14  2576K
14:46:15  2616K
14:46:16  2656K
14:46:17  2692K
14:46:18  2700K
14:46:19  2736K
14:46:20  2776K
14:46:21  2812K
14:46:22  2852K
14:46:23  2892K
14:46:24  2932K
14:46:25  2944K
14:46:26  2972K
14:46:27  3012K
14:46:28  3052K
14:46:29  3092K
14:46:30  3132K
14:46:31  3172K
14:46:32  3200K
14:46:33  3228K
14:46:34  3252K
14:46:35  3292K
14:46:36  3332K
14:46:37  3372K
14:46:38  3404K
14:46:39  3412K
14:46:40  3444K
14:46:41  3476K
14:46:42  3516K
14:46:46  3544K
14:46:47  3548K
14:46:49  3568K
14:46:50  3584K
14:46:51  3608K
14:46:52  3624K
14:46:54  3632K
14:46:55  3672K
14:46:56  3696K
14:46:57  3712K
14:46:58  3728K
14:46:59  3732K
14:47:00  3772K
14:47:01  3812K
14:47:03  3832K
14:47:04  3872K
14:47:06  3912K
14:47:07  3948K
14:47:08  3984K
14:47:10  4004K

Then a long wait: almost 10 minutes.

14:52:58  4044K
14:52:59  4088K
14:53:00  4116K
14:53:01  4128K
14:53:02  4168K
14:53:03  4208K
14:53:04  4248K
14:53:05  4288K
14:53:06  4328K
14:53:08  4340K
14:53:09  4368K
14:53:10  4408K
14:53:11  4448K
14:53:12  4488K
14:53:13  4528K
14:53:14  4540K
14:53:15  4568K
14:53:16  4608K
14:53:17  4648K
14:53:18  4688K
14:53:19  4728K
14:53:20  4764K
14:53:21  4768K
14:53:22  4808K
14:53:23  4848K
14:53:24  4888K
14:53:25  4928K
14:53:26  4944K
14:53:27  4968K
14:53:28  5008K
14:53:29  5048K
14:53:30  5088K
14:53:31  5124K
14:53:32  5128K
14:53:33  5168K
14:53:34  5208K
14:53:35  5248K
14:53:36  5284K
14:53:37  5304K
14:53:38  5328K
14:53:39  5368K
14:53:40  5408K
14:53:41  5448K
14:53:42  5464K
14:53:43  5488K
14:53:44  5528K
14:53:45  5568K
14:53:46  5608K
14:53:47  5632K
14:53:48  5648K
14:53:49  5688K
14:53:50  5728K
14:53:51  5768K
14:53:52  5808K
14:53:53  5840K
14:53:54  5856K
14:53:55  5888K
14:53:56  5928K
14:53:57  5968K
14:53:58  6008K
14:53:59  6032K
14:54:00  6048K
14:54:01  6088K
14:54:02  6128K
14:54:03  6168K
14:54:04  6208K
14:54:05  6244K
14:54:06  6260K
14:54:07  6288K
14:54:08  6328K
14:54:09  6368K
14:54:10  6408K
14:54:11  6448K
14:54:12  6480K
14:54:13  6496K
14:54:14  6528K
14:54:15  6568K
14:54:17  6608K
14:54:18  6648K
14:54:19  6680K

I'll enclose the captured data up to now. I'll let it run until the end to see if we can get more info.
Created attachment 330005 [details]
Serial capture log after applying the patch

Ok, this is a partial log after letting it retrieve the dump for a while. We have about 23.8 MB captured by now. I'll keep it running until the task completes, but it should take a looong time.
I don't suppose you're taking a tcpdump of this, are you? A 10 minute pause is certainly over the threshold at which the netdump server would give up and try to reboot the box, so they must be exchanging some sort of message here. A tcpdump would tell us what that is. Logs from the netdump server would be great too, if you have them. Thanks!
Created attachment 330052 [details]
Captured packets with the server/client netdump conversation

Ok, this seems to be the second point where the capture stops:

15:15:43 47856K vmcore-incomplete
16:02:49 (still the same size)

There are no other messages in dmesg on the netdump-server. The communication didn't stop, as shown in the partial traffic captured with wireshark during the time when vmcore wasn't being updated. The data was saved in libpcap format. The dump is compressed with bzip2, since otherwise it would exceed the maximum connection time while sending the file to bugzilla.
During the course of this trace snippet, did one of those long pauses occur? I ask because I've been combing over it, and the communication in it seems both regular and advancing in both directions. I don't see any repeated sequence numbers in the trace, and no error replies.
I only started the wireshark capture after the vmcore-incomplete file reached 47856 KB. So all those packets happened during the communication pause. Could this be due to some sort of error in the RHEL 5 netdump-server?
It's possible; if you have debug logs from the server, it would be good to take a look. The definitive answer, I think, will be in the tcpdump, however. Can you recreate the problem and get a capture that has segments both during normal communication and during a pause? Thanks!
Created attachment 330271 [details]
Progress of vmcore captured file

Ok, I've re-generated the dump. First, this is the file size progress over time. I've generated it with this small shell script:

O=""
while :; do
    R="`du -s vmcore* 2>/dev/null|sed -r s/"\s+"/"K "/`"
    if [ "$R" != "$O" ]; then
        O=$R
        echo "`date +%T` $R"
    fi
    sleep 1
done | tee /tmp/vmcore.log
Created attachment 330278 [details]
Captured packets for the entire conversation with netdump-server

Ok, this is a complete dump of the interface since the beginning. You'll see that the first time I pressed SysRq-C, the OOPS wasn't generated. I had to press it twice.
I finally got around to building a kernel with the patch in comment #4 and netdump works without any issue. I say ship it!
Ok, so I see the problem here. After about 2062 seconds worth of transfer (about 65MB), the netdump client just stops responding to ARP frames. The trace doesn't go beyond that, so I assume that eventually it starts responding again? But loss of your L2 entry for a client would certainly cause what you're seeing. A good test would be to see if adding a static ARP entry for your client makes the pauses go away.

I'd also like you to please apply this patch on top of the others you have. It reverts part of the patch for bz 461014. I _think_ it's safe, because more recently we've started disabling softirqs in netpoll_poll_dev, which makes this call redundant (except for the hard irq part). At any rate, Andy and I will really have to go over that to be sure, but I wanted you to try it, to see if it improves your performance or makes the pause problem go away. Thanks!
Created attachment 330283 [details]
patch to selectively enable softirqs during netdump

This is the patch that reverts a small part of 461014. I don't know if it's universally safe to do yet, but for the purposes of this test I think it's worth getting the data point.
Created attachment 330284 [details]
netpoll-quota-fix.patch

Whether we decide to keep the patch in comment #18 or not, I think this patch should replace the patch in comment #4. There are perfectly valid cases where drivers may actually poll more than the quota of frames, so the quota may legitimately end up negative. In that case, adding the budget to the quota would leave the quota below the budget for a valid case. I would prefer we only modify the quota if it's less than or equal to zero, so I propose this patch.
Created attachment 330507 [details]
New capture logs

Ok, I let the capture run for about 3 hours and 24 minutes, applying the 3 patches (tg3 debug, enable softirqs, and netpoll quota). The results were the same as with just the tg3 debug + the old netpoll patch. So, enabling softirqs didn't cause any changes. Also, in the first libpcap capture, the ARP messages happened only after rebooting the crashed machine.

The enclosed tarball has 3 files:
vmcore.cap - console output;
vmcore.log - output of the script that shows the filesystem progression of the captured data;
captured_pkgs.libpcap - the libpcap capture at the ethernet interface, using a cross-over cable to link the crashed machine and the netdump server (about 2,5 GB of dump).
Hmm... It didn't really create an attachment. Anyway, the logs are in this tarball: http://new-people.redhat.com/~mchehab/.bz481207/captured_data.tar.bz2
Ok, well, that's good to know. Andy, I think that actually jives with your results regarding your bnx2 testing, in that capture wasn't significantly slowed down when we didn't have softirqs on. That makes me feel loads better. So now I'm a bit stumped as to why your transfer is so slow, if those ARP frames are from after you rebooted the system. I'll take a look at the new trace above to see what comes up.
I just found the remaining problem, although I'm not sure of its root cause, or how to solve it. It appears to be overly fast retransmits from the netdump-server. If you look at frames 26315, 26339, 26379, 26427 and 26502, you can see that the server is re-requesting the same frame over and over and over. It appears that this particular card is just getting far enough behind in the outstanding request count that the server is needlessly retransmitting frames, causing undue work and delay. Please apply this patch to the netdump-server code and see if it resolves your situation. It won't be the final patch (we'll probably make the timeout configurable), but this will be a good test. Thanks!
Created attachment 330631 [details]
patch to up netdump server retrans timeout
It took me some time to find the netdump-server source. I was expecting it to be in CVS or in RHEL, but the package was in EPEL instead... Anyway, the change didn't solve it. Since this is a timeout issue, I decided to remove the tg3 debug patch. Without it, I did two tests with just netpoll-quota-fix.patch (id=330284).

On the first test, everything worked fine; the dump was created quickly:

$ ls -la 192.168.2.170-2009-02-03-10:20
total 250628
drwx------  2 netdump netdump       4096 Fev  3 10:31 .
drwxr-xr-x 10 netdump netdump       4096 Fev  3 10:42 ..
-rw-------  1 netdump netdump      75789 Fev  3 10:31 log
-rw-------  1 netdump netdump 5100277760 Fev  3 10:31 vmcore

However, on a second test, where I just changed the netdump-server (removing the timeout patch, to see if it would have any effect), I got an OOPS on the RHEL4 machine. I'll attach the full dmesg log.
Created attachment 330726 [details]
dmesg of the OOPSes

dmesg of the crashed machine. I've merged the dmesg with the console output after pressing SysRq-C, in order to give a better picture of the kernel panic.
Ok, that's good news; I hadn't considered the slowdown from the debug patch. As for the oops, if that was the result of just changing the netdump server, let's table it for now. If the problem is reproducible on a stock build, we can pursue it in another bug. I'll build a kernel in brew with Andy's patch from comment 19, and if you could test/verify it with a stock netdump-server out of EPEL, I think we can call this issue solved.
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1671386

Ok, I'm building a kernel here with just Andy's patch. If you can confirm that it by itself solves your problem, I'll go ahead and post it. Thanks!
Created attachment 330776 [details]
Panic log

Neil, I tested crashing RHEL 4.7 several times, with the stock netdump server. From time to time, the OOPS still occurs. I'm enclosing the dump of the latest OOPS. Just to remind you, I'm using the RHEL 4 78.30.EL kernel + Andy's patch from comment 19 here. No other patch is applied on it. The server is running EPEL netdump-server-0.7.16-21.el5.
I don't see what OOPS you are talking about. All I see in that log is the sysrq oops (which is expected) and the resultant show_state that a netdump system provides to the console. What are you seeing that you think shouldn't be there?
I was talking about the sysrq oops. When it appears, netdump produces an incomplete vmcore (with 0 bytes).
Have you checked your netdump server, and verified that it has sufficient space to store the vmcore file? The behavior you are describing is what happens when the netdump-server determines there is insufficient space to store a vmcore.
Could be. I don't have much available space on my notebook, where the netdump server is running:

/dev/sda3              19G   15G  3,9G  79% /

I'll do some additional tests here, making sure that I have enough space before generating the dumps.
Mauro, I bet lack of space on your laptop is the issue. I ran into the same situation last week and was frustrated for about 30 minutes until nhorman told me to check and see if I had enough room to hold the vmcores. I wiped out all my old logs in /var/crash, found that I had enough space, and the dumps were working fine.
Andy,

This is not the case. Take a look:

$ du -s 1*
260220  192.168.2.170-2009-02-04-16:16
84      192.168.2.170-2009-02-04-16:24

I need 260 K for a dump. I have more than that free at my /:

$ df -h
Sist. Arq.            Tam  Usad  Disp Uso% Montado em
/dev/sda3              19G   15G  3,9G  79% /
/dev/sda7              42G  7,4G   32G  19% /d00
/dev/sda5             152G   86G   58G  60% /home
tmpfs                 1,9G     0  1,9G   0% /dev/shm

Well, RHEL4 is running on a machine with 4GB of RAM. Does netdump-server check that I have 4GB before actually starting to receive the vmcore?
You need way more than 260K for a dump; you need enough space to store the entire vmcore (which will be approximately the size of RAM on the client system). And yes, when a dump request is received, netdump-server gets the page size and the page count from the client. It then checks whether there are (page_size * page_count) bytes available on the configured filesystem to store that amount of data. If that's not available, you get an immediate reboot. Based on what you have listed above (I assume you went with the standard netdump-server config, which puts cores under /var/crash), your target filesystem only has 3.9GB of space available. So if it needs 4, you're very close to, if not over, the limit. Please free up some space and try again.
Yes, freeing some space solved the issue of the 0-byte vmcore-incomplete:

254M 192.168.2.170-2009-02-04-19:05
253M 192.168.2.170-2009-02-04-19:14
238M 192.168.2.170-2009-02-04-19:22
254M 192.168.2.170-2009-02-04-19:31
253M 192.168.2.170-2009-02-05-08:32
256M 192.168.2.170-2009-02-05-08:35
255M 192.168.2.170-2009-02-05-08:38
254M 192.168.2.170-2009-02-05-08:41
254M 192.168.2.170-2009-02-05-08:45
253M 192.168.2.170-2009-02-05-08:47
254M 192.168.2.170-2009-02-05-08:51
47M  192.168.2.170-2009-02-05-08:57

On the last core dump, it got a panic in i8042 in the middle of the netdump; not sure why. Notice: sometimes I need to press SysRq-C more than once to generate an OOPS and start netdump. This is the dump:

SysRq : Crashing the kernel by request
SysRq : Crashing the kernel by request
SysRq : Crashing the kernel by request
SysRq : Crashing the kernel by request
SysRq : Crashing the kernel by request
SysRq : Crashing the kernel by request
SysRq : Crashing the kernel by request
Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
c0213168
*pde = 363c8001
Oops: 0002 [#1]
SMP
Modules linked in: md5(U) ipv6(U) parport_pc(U) lp(U) parport(U) autofs4(U) i2c)
CPU:    5
EIP:    0060:[<c0213168>]    Not tainted VLI
EFLAGS: 00010016   (2.6.9-prep)
EIP is at sysrq_handle_crash+0x0/0x8
eax: 00000063   ebx: c0348854   ecx: f5473000   edx: f7f23f88
esi: 00000063   edi: f5473000   ebp: f7f23f88   esp: c0400df8
ds: 007b   es: 007b   ss: 0068
Process swapper (pid: 0, threadinfo=c0400000 task=c3863670)
Stack: c0213300 c02fec44 c030008f 00000202 00000006 00000001 c3a0e480 00000001
       0000002e c020dee0 f7f23f88 c0346be0 f591a80c c3a0e480 0000002e c026fa75
       00000001 00000001 f5afe0ac 00000001 f5afc000 f591a80c c026de64 00000001
Call Trace:
 [<c0213300>] __handle_sysrq+0x62/0xd9
 [<c020dee0>] kbd_event+0x83/0xb0
 [<c026fa75>] input_event+0x331/0x351
 [<c026de64>] hidinput_hid_event+0x1d0/0x208
 [<c026a431>] hid_process_event+0x28/0x52
 [<c026a707>] hid_input_field+0x2ac/0x2f9
 [<c026a7cd>] hid_input_report+0x79/0x98
 [<c026a882>] hid_irq_in+0x96/0xf2
 [<c02618ce>] usb_hcd_giveback_urb+0x14/0x3e
 [<f893f01e>] ehci_urb_done+0x70/0x7b [ehci_hcd]
 [<f893f252>] qh_completions+0x229/0x2ab [ehci_hcd]
 [<f8941ab9>] scan_periodic+0xff/0x246 [ehci_hcd]
 [<f8942423>] ehci_work+0x3b/0x8f [ehci_hcd]
 [<f89425b2>] ehci_irq+0x13b/0x152 [ehci_hcd]
 [<c026191e>] usb_hcd_irq+0x26/0x4b
 [<c01074f2>] handle_IRQ_event+0x25/0x4f
 [<c01074cd>] handle_IRQ_event+0x0/0x4f
 [<c0107a8d>] do_IRQ+0x157/0x1f2
 =======================
 [<c02dde10>] common_interrupt+0x18/0x20
 [<c01040e8>] mwait_idle+0x33/0x42
 [<c01040a0>] cpu_idle+0x26/0x3b
Code: 11 c0 c7 05 10 e9 45 c0 00 00 00 00 c7 05 38 e9 45 c0 00 00 00 00 c7 05 2
<0>Kernel panic - not syncing: Fatal exception in interrupt
Badness in panic at kernel/panic.c:121
 [<c0122e08>] panic+0x159/0x166
 [<c01060d0>] die+0x164/0x16b
 [<c0123649>] vprintk+0x136/0x14a
 [<c011b6e9>] do_page_fault+0x0/0x5c6
 [<c011bad9>] do_page_fault+0x3f0/0x5c6
 [<c0213168>] sysrq_handle_crash+0x0/0x8
 [<c011dc49>] try_to_wake_up+0x288/0x293
 [<c011dc49>] try_to_wake_up+0x288/0x293
 [<c0120f06>] autoremove_wake_function+0xd/0x2d
 [<c011f1d6>] __wake_up_common+0x36/0x51
 [<c011f21a>] __wake_up+0x29/0x3c
 [<c011b6e9>] do_page_fault+0x0/0x5c6
 [<c02ddf2f>] error_code+0x2f/0x38
 [<c0213168>] sysrq_handle_crash+0x0/0x8
 [<c0213300>] __handle_sysrq+0x62/0xd9
 [<c020dee0>] kbd_event+0x83/0xb0
 [<c026fa75>] input_event+0x331/0x351
 [<c026de64>] hidinput_hid_event+0x1d0/0x208
 [<c026a431>] hid_process_event+0x28/0x52
 [<c026a707>] hid_input_field+0x2ac/0x2f9
 [<c026a7cd>] hid_input_report+0x79/0x98
 [<c026a882>] hid_irq_in+0x96/0xf2
 [<c02618ce>] usb_hcd_giveback_urb+0x14/0x3e
 [<f893f01e>] ehci_urb_done+0x70/0x7b [ehci_hcd]
 [<f893f252>] qh_completions+0x229/0x2ab [ehci_hcd]
 [<f8941ab9>] scan_periodic+0xff/0x246 [ehci_hcd]
 [<f8942423>] ehci_work+0x3b/0x8f [ehci_hcd]
 [<f89425b2>] ehci_irq+0x13b/0x152 [ehci_hcd]
 [<c026191e>] usb_hcd_irq+0x26/0x4b
 [<c01074f2>] handle_IRQ_event+0x25/0x4f
 [<c01074cd>] handle_IRQ_event+0x0/0x4f
 [<c0107a8d>] do_IRQ+0x157/0x1f2
 =======================
 [<c02dde10>] common_interrupt+0x18/0x20
 [<c01040e8>] mwait_idle+0x33/0x42
 [<c01040a0>] cpu_idle+0x26/0x3b
Badness in i8042_panic_blink at drivers/input/serio/i8042.c:940
 [<c021f2c5>] i8042_panic_blink+0xc0/0x1a0
 [<c0122dca>] panic+0x11b/0x166
 [<c01060d0>] die+0x164/0x16b
 [<c0123649>] vprintk+0x136/0x14a
 [<c011b6e9>] do_page_fault+0x0/0x5c6
 [<c011bad9>] do_page_fault+0x3f0/0x5c6
 [<c0213168>] sysrq_handle_crash+0x0/0x8
 [<c011dc49>] try_to_wake_up+0x288/0x293
 [<c011dc49>] try_to_wake_up+0x288/0x293
 [<c0120f06>] autoremove_wake_function+0xd/0x2d
 [<c011f1d6>] __wake_up_common+0x36/0x51
 [<c011f21a>] __wake_up+0x29/0x3c
 [<c011b6e9>] do_page_fault+0x0/0x5c6
 [<c02ddf2f>] error_code+0x2f/0x38
 [<c0213168>] sysrq_handle_crash+0x0/0x8
 [<c0213300>] __handle_sysrq+0x62/0xd9
 [<c020dee0>] kbd_event+0x83/0xb0
 [<c026fa75>] input_event+0x331/0x351
 [<c026de64>] hidinput_hid_event+0x1d0/0x208
 [<c026a431>] hid_process_event+0x28/0x52
 [<c026a707>] hid_input_field+0x2ac/0x2f9
 [<c026a7cd>] hid_input_report+0x79/0x98
 [<c026a882>] hid_irq_in+0x96/0xf2
 [<c02618ce>] usb_hcd_giveback_urb+0x14/0x3e
 [<f893f01e>] ehci_urb_done+0x70/0x7b [ehci_hcd]
 [<f893f252>] qh_completions+0x229/0x2ab [ehci_hcd]
 [<f8941ab9>] scan_periodic+0xff/0x246 [ehci_hcd]
 [<f8942423>] ehci_work+0x3b/0x8f [ehci_hcd]
 [<f89425b2>] ehci_irq+0x13b/0x152 [ehci_hcd]
 [<c026191e>] usb_hcd_irq+0x26/0x4b
 [<c01074f2>] handle_IRQ_event+0x25/0x4f
 [<c01074cd>] handle_IRQ_event+0x0/0x4f
 [<c0107a8d>] do_IRQ+0x157/0x1f2
 =======================
 [<c02dde10>] common_interrupt+0x18/0x20
 [<c01040e8>] mwait_idle+0x33/0x42
 [<c01040a0>] cpu_idle+0x26/0x3b
Badness in i8042_panic_blink at drivers/input/serio/i8042.c:943
 [<c021f33a>] i8042_panic_blink+0x135/0x1a0
 [<c0122dca>] panic+0x11b/0x166
 [<c01060d0>] die+0x164/0x16b
 [<c0123649>] vprintk+0x136/0x14a
 [<c011b6e9>] do_page_fault+0x0/0x5c6
 [<c011bad9>] do_page_fault+0x3f0/0x5c6
 [<c0213168>] sysrq_handle_crash+0x0/0x8
 [<c011dc49>] try_to_wake_up+0x288/0x293
 [<c011dc49>] try_to_wake_up+0x288/0x293
 [<c0120f06>] autoremove_wake_function+0xd/0x2d
 [<c011f1d6>] __wake_up_common+0x36/0x51
 [<c011f21a>] __wake_up+0x29/0x3c
 [<c011b6e9>] do_page_fault+0x0/0x5c6
 [<c02ddf2f>] error_code+0x2f/0x38
 [<c0213168>] sysrq_handle_crash+0x0/0x8
 [<c0213300>] __handle_sysrq+0x62/0xd9
 [<c020dee0>] kbd_event+0x83/0xb0
 [<c026fa75>] input_event+0x331/0x351
 [<c026de64>] hidinput_hid_event+0x1d0/0x208
 [<c026a431>] hid_process_event+0x28/0x52
 [<c026a707>] hid_input_field+0x2ac/0x2f9
 [<c026a7cd>] hid_input_report+0x79/0x98
 [<c026a882>] hid_irq_in+0x96/0xf2
 [<c02618ce>] usb_hcd_giveback_urb+0x14/0x3e
 [<f893f01e>] ehci_urb_done+0x70/0x7b [ehci_hcd]
 [<f893f252>] qh_completions+0x229/0x2ab [ehci_hcd]
 [<f8941ab9>] scan_periodic+0xff/0x246 [ehci_hcd]
 [<f8942423>] ehci_work+0x3b/0x8f [ehci_hcd]
 [<f89425b2>] ehci_irq+0x13b/0x152 [ehci_hcd]
 [<c026191e>] usb_hcd_irq+0x26/0x4b
 [<c01074f2>] handle_IRQ_event+0x25/0x4f
 [<c01074cd>] handle_IRQ_event+0x0/0x4f
 [<c0107a8d>] do_IRQ+0x157/0x1f2
 =======================
 [<c02dde10>] common_interrupt+0x18/0x20
 [<c01040e8>] mwait_idle+0x33/0x42
 [<c01040a0>] cpu_idle+0x26/0x3b
Badness in i8042_panic_blink at drivers/input/serio/i8042.c:945
 [<c021f38d>] i8042_panic_blink+0x188/0x1a0
 [<c0122dca>] panic+0x11b/0x166
 [<c01060d0>] die+0x164/0x16b
 [<c0123649>] vprintk+0x136/0x14a
 [<c011b6e9>] do_page_fault+0x0/0x5c6
 [<c011bad9>] do_page_fault+0x3f0/0x5c6
 [<c0213168>] sysrq_handle_crash+0x0/0x8
 [<c011dc49>] try_to_wake_up+0x288/0x293
 [<c011dc49>] try_to_wake_up+0x288/0x293
 [<c0120f06>] autoremove_wake_function+0xd/0x2d
 [<c011f1d6>] __wake_up_common+0x36/0x51
 [<c011f21a>] __wake_up+0x29/0x3c
 [<c011b6e9>] do_page_fault+0x0/0x5c6
 [<c02ddf2f>] error_code+0x2f/0x38
 [<c0213168>] sysrq_handle_crash+0x0/0x8
 [<c0213300>] __handle_sysrq+0x62/0xd9
 [<c020dee0>] kbd_event+0x83/0xb0
 [<c026fa75>] input_event+0x331/0x351
 [<c026de64>] hidinput_hid_event+0x1d0/0x208
 [<c026a431>] hid_process_event+0x28/0x52
[<c026a707>] hid_input_field+0x2ac/0x2f9 [<c026a7cd>] hid_input_report+0x79/0x98 [<c026a882>] hid_irq_in+0x96/0xf2 [<c02618ce>] usb_hcd_giveback_urb+0x14/0x3e [<f893f01e>] ehci_urb_done+0x70/0x7b [ehci_hcd] [<f893f252>] qh_completions+0x229/0x2ab [ehci_hcd] [<f8941ab9>] scan_periodic+0xff/0x246 [ehci_hcd] [<f8942423>] ehci_work+0x3b/0x8f [ehci_hcd] [<f89425b2>] ehci_irq+0x13b/0x152 [ehci_hcd] [<c026191e>] usb_hcd_irq+0x26/0x4b [<c01074f2>] handle_IRQ_event+0x25/0x4f [<c01074cd>] handle_IRQ_event+0x0/0x4f [<c0107a8d>] do_IRQ+0x157/0x1f2 ======================= [<c02dde10>] common_interrupt+0x18/0x20 [<c01040e8>] mwait_idle+0x33/0x42 [<c01040a0>] cpu_idle+0x26/0x3b IMO, we should consider bz#481207 solved, since Andy patch properly fixed the issue with tg3 driver. The above Panic seems to be something else.
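For anyone else reproducing this, the client side of netdump is configured through /etc/sysconfig/netdump and run as a service. This is a minimal sketch, not the exact configuration used in this report: the NETDUMPADDR value is a placeholder, and the parameter names are from my recollection of the RHEL4 netdump package, so verify them against the commented file the package ships:

```shell
# /etc/sysconfig/netdump -- netdump client configuration (sketch; verify
# parameter names against the file installed by the netdump package)

# Interface the crashed kernel uses to transmit memory pages
# (the tg3 NIC in this report)
NETDEV=eth0

# Address of the host running netdump-server (placeholder value)
NETDUMPADDR=192.168.2.1
```

After editing, run `service netdump propagate` once to install the client's key on the server, then `service netdump start`; from that point, SysRq-C from the keyboard (or `echo c > /proc/sysrq-trigger` from a shell) should trigger the handshake and, when things work, a complete vmcore on the server.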
Committed in 83.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/
The results show no problem on the machine with the tg3 driver and kernel 2.6.9-89.EL (largesmp).

With 2.6.9-78.22.EL the dump is still incomplete:

[root@hp-rx8640-02 ~]# uname -a
Linux hp-rx8640-02.rhts.bos.redhat.com 2.6.9-78.22.EL #1 SMP Tue Dec 16 11:56:53 EST 2008 ia64 ia64 ia64 GNU/Linux

the log message shows:

Apr 23 22:06:16 ndnc-1 netdump[1522]: Got too many timeouts waiting for memory page for client 10.16.65.38, ignoring it
Apr 23 22:06:19 ndnc-1 netdump[1522]: Got too many timeouts waiting for SHOW_STATUS for client 10.16.65.38, rebooting it
[root@ndnc-1 10.16.65.38-2009-04-23-22:02]# du *
12      log
584     vmcore-incomplete

With 2.6.9-89.ELlargesmp the dump completes:

[root@hp-rx8640-02 ~]# uname -a
Linux hp-rx8640-02.rhts.bos.redhat.com 2.6.9-89.ELlargesmp #1 SMP Mon Apr 20 10:41:55 EDT 2009 ia64 ia64 ia64 GNU/Linux
[root@ndnc-1 10.16.65.38-2009-04-23-23:35]# ls
log  vmcore
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1024.html