Description of problem:
While running the RHEL4.8 kerneltier1 tests, the x86_64 UP kernel oopses on hp-dl785g5-01.rhts.bos.redhat.com.

Version-Release number of selected component (if applicable):
Kernel 2.6.9-79.EL

How reproducible:
Unknown

Steps to Reproduce:
1. Install RHEL4-U8-re20090115.nightly AS x86_64 on hp-dl785g5-01.rhts.bos.redhat.com
2. Then install Kernel 2.6.9-79.EL
3. Install and run RHTS test rh-tests-kernel-security-audit-audit-test

Actual results:
<1>Unable to handle kernel paging requestCPU#0 is executing netdump.
< netdump activated - performing handshake with the server. >
arch/x86_64/kernel/traps.c:334: spin_trylock(arch/x86_64/kernel/traps.c:ffffffff80437ca0) already locked by arch/x86_64/kernel/traps.c/334
<1>Unable to handle kernel paging request at 00000000fa012136 RIP:
<1><ffffffffa00ad6ce>{:bnx2:bnx2_interrupt+12}
PML4 1fee9ca067 PGD 0
<0>Oops: 0000 [2]
CPU 0
Modules linked in: lp(U) md5 ipv6 parport_pc parport netconsole netdump autofs4 i2c_dev i2c_core sunrpc ds yenta_socket pcmcia_core cpufreq_powersave joydev loop button battery ac uhci_hcd ohci_hcd ehci_hcd bnx2 dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod cciss sd_mod scsi_mod
Pid: 32680, comm: rhts-system-inf Not tainted 2.6.9-79.EL
RIP: 0010:[<ffffffffa00ad6ce>] <ffffffffa00ad6ce>{:bnx2:bnx2_interrupt+12}
RSP: 0018:000001202b2f5850 EFLAGS: 00010046
RAX: 000001202fda63c0 RBX: 000001202d840440 RCX: 0000000030687465
RDX: 0000000000000000 RSI: 000001202d840000 RDI: 00000000000000da
RBP: 000001202d840000 R08: 00000000fa012100 R09: 0000000000000086
R10: 0000000000000000 R11: 0000000000000000 R12: 000001202d840000
R13: 0000011fe09bd668 R14: 0000000000000000 R15: ffffffffa017eda0
FS: 0000002a955633e0(0000) GS:ffffffff8056e180(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000fa012136 CR3: 0000000000101000 CR4: 00000000000006e0
Process rhts-system-inf (pid: 32680, threadinfo 0000011fe09bc000, task 0000011fe9398920)
Stack: 000001202e70cad8 ffffffffa00b4c06 0000000000000000 000001202d840000
       000001202e70ca00 ffffffff8030da46 0000000000000055 0000000000000055
       000001202e70ca00 0000000000000018
Call Trace:<ffffffffa00b4c06>{:bnx2:poll_bnx2+50} <ffffffff8030da46>{netpoll_poll_dev+92}
       <ffffffff8030d9c8>{netpoll_send_skb+761} <ffffffffa017c6d2>{:netdump:netpoll_netdump+313}
       <ffffffffa00ad6ce>{:bnx2:bnx2_interrupt+12} <ffffffffa017c57b>{:netdump:netpoll_start_netdump+177}
Code: 41 0f b7 40 36 3b 46 28 75 10 48 8b 02 31 ff 8b 40 6c a8 01
<1>RIP <ffffffffa00ad6ce>{:bnx2:bnx2_interrupt+12} RSP <000001202b2f5850>
<0>CR2: 00000000fa012136
<3>netpoll_start_netdump: called recursively. rebooting.

Expected results:
Test should pass

Additional info:
This system was running the audit-test suite. It got to a point where it hung; see BZ:480330. When a system in RHTS hangs like this, the local watchdog kicks in. Local watchdog does the following:
Alt+SysRq+M
Alt+SysRq+T
Alt+SysRq+W
cat /proc/slabcache
Then kill the PID
Then reboot the system.
While it was doing the Alt+SysRq+T, it ran into the problems.
This should be fixed in 79.EL -- can you try testing with that?
Crap, that is what you tried -- I read it as 78.EL the first time. Jeff, did this test pass with 78.EL?
Jeff, do you happen to have the output from /proc/interrupts when running on this system? It looks like bnx2_interrupt expects us to pass a bnx2_napi struct and we are passing it a netdev. That's a recipe for disaster.
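To make the suspected mismatch concrete, here is a minimal userspace sketch of it. Everything in it is hypothetical: the `*_sketch` types and functions are simplified stand-ins invented for illustration, not the real bnx2 structures or the RHEL4 driver code. The point is only that the handler receives an untyped pointer and trusts it to be the per-vector context, so the poll path has to pass that context rather than the net device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-ins; not the real bnx2 layouts. */
struct status_block_sketch {
    uint16_t status_idx;              /* bumped by the "hardware" on each event */
};

struct napi_ctx_sketch {              /* plays the role of struct bnx2_napi */
    struct status_block_sketch *status_blk;
    uint16_t last_status_idx;         /* last index the driver has handled */
};

struct netdev_sketch {                /* plays the role of struct net_device */
    char name[16];
    struct napi_ctx_sketch *napi;     /* where the per-vector context lives */
};

/*
 * Interrupt-style handler. Like a real IRQ handler it gets an untyped
 * cookie and trusts it to be a napi_ctx_sketch.
 * Returns 1 if there is new work, 0 otherwise.
 */
int interrupt_sketch(int irq, void *dev_instance)
{
    struct napi_ctx_sketch *ctx = dev_instance;

    (void)irq;
    assert(ctx != NULL);
    /* If dev_instance were really a netdev_sketch, status_blk would be
     * read out of the name[] bytes and this dereference would be wild. */
    if (ctx->status_blk->status_idx == ctx->last_status_idx)
        return 0;
    ctx->last_status_idx = ctx->status_blk->status_idx;
    return 1;
}

/*
 * Poll-controller shape that matches the handler's expectation.
 * The buggy shape would be: interrupt_sketch(0, dev);
 */
int poll_controller_fixed(struct netdev_sketch *dev)
{
    return interrupt_sketch(0, dev->napi);
}
```

In this sketch the compiler cannot catch the buggy call because any object pointer converts silently to `void *`, which is consistent with the problem only showing up at run time as a bad dereference.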
Reply to Comment #2
Andy,
Short answer: No.
Verbose answer: Not sure.
This BZ has a couple of "special" things happening at once. I am not sure how each piece is actually relevant to the problem.
1. We are running a UP kernel
2. We have netdump configured to dump across the network
3. The audit-test hung while running
4. The system will run a bunch of Alt+SysRq commands before it kills the test PID (this issue looks like it happened while it was running Alt+SysRq+T)
Do all of those things need to happen to trigger this issue? I am not sure. If they do, then the same environment was not set up, so the problem may have been there and we just did not see it.
Thanks,
Jeff
Reply to Comment #3
Andy,
We don't capture that data normally on a test run. Here is a link to the data that is available: http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=5976958
The above link will need your BZ login to look at it. If you need the data we will have to reserve the system and check it out.
Thanks,
Jeff
Created attachment 329470 [details] bnx2-poll.patch Jeff, I think this patch would resolve it. I will add it to my rhel4 test kernels and see if I can make this test succeed with it in place.
Created attachment 329622 [details] bnx2-poll2.patch This is actually a better patch because it calls the correct handler.
It also seems that netdump (not netconsole) is broken on bnx2 with the latest RHEL4 kernels. I will group the netpoll and netdump fixes into one patch when I figure out what is wrong. It looks to be driver-specific, since I can back out only the driver patches and it works fine.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Created attachment 329839 [details] bnx2-netdump-and-netpoll-fix.patch This patch seems to be working well on my system. It resolves the panics that can occur when calling the poll_controller routine (poll_bnx2) and also adds a routine for poll (bnx2_poll_all). The call to bnx2_poll_all will only be made after a call to poll_bnx2, so it will not be used in the normal case. I still need to be sure that there is no chance for one CPU to execute bnx2_poll_all while another is executing one of the dummy_netdev poll routines, but I think we are safe.
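For readers without access to the attachment, the two routines described above can be pictured roughly as follows. This is a userspace sketch under stated assumptions, not the patch itself: all `*_sketch` names are hypothetical, and the `netpoll_active` flag is only one assumed way to express "poll-all is used only after the poll controller has run". The poll controller walks a per-vector table and calls each registered handler with the same cookie it was registered with; the poll-all routine services every ring within a budget.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_VECS_SKETCH 4

struct ring_sketch {
    int pending;                      /* work items waiting on this ring */
    int handled;                      /* work items already serviced */
};

typedef int (*irq_handler_sketch)(int vector, void *cookie);

struct irq_entry_sketch {
    int vector;
    irq_handler_sketch handler;       /* handler registered for this vector */
    void *cookie;                     /* exactly what was registered with it */
};

struct adapter_sketch {
    int nvecs;
    struct irq_entry_sketch irq_tbl[MAX_VECS_SKETCH];
    struct ring_sketch rings[MAX_VECS_SKETCH];
    int netpoll_active;               /* assumption: set by the poll controller */
};

/* Per-ring handler: reports whether its ring has work. */
int ring_interrupt_sketch(int vector, void *cookie)
{
    struct ring_sketch *ring = cookie;

    (void)vector;
    assert(ring != NULL);
    return ring->pending > 0;
}

/* Poll controller: call each vector's own handler with its own cookie. */
void poll_controller_sketch(struct adapter_sketch *ad)
{
    int i;

    ad->netpoll_active = 1;
    for (i = 0; i < ad->nvecs; i++) {
        struct irq_entry_sketch *e = &ad->irq_tbl[i];

        e->handler(e->vector, e->cookie);
    }
}

/* Poll-all: service every ring, but only once the poll controller has run. */
int poll_all_sketch(struct adapter_sketch *ad, int budget)
{
    int i, done = 0;

    if (!ad->netpoll_active)
        return 0;                     /* normal case: not used */
    for (i = 0; i < ad->nvecs; i++) {
        struct ring_sketch *r = &ad->rings[i];

        while (r->pending > 0 && done < budget) {
            r->pending--;
            r->handled++;
            done++;
        }
    }
    return done;
}
```

The sketch deliberately leaves out locking, so it says nothing about the remaining concern of one CPU running the poll-all path while another runs a regular poll routine.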
Created attachment 329895 [details] bnx2-netdump-and-netpoll-fix2.patch This patch should clear up the concerns I expressed in comment #11.
My test kernels have been updated to include a patch for this bugzilla. http://people.redhat.com/agospoda/#rhel4 Please test them and report back your results. Without immediate feedback there is a good chance this or any other fix for this driver will not be included in the upcoming update.