Bug 480693 - [RHEL4.8][Kernel] Unable to handle kernel paging request at 00000000fa012136 RIP:
Status: CLOSED DUPLICATE of bug 484667
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Hardware: All Linux
Priority: low, Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Andy Gospodarek
QA Contact: Martin Jenner
Reported: 2009-01-19 16:38 EST by Jeff Burke
Modified: 2014-06-29 19:00 EDT

Doc Type: Bug Fix
Last Closed: 2009-03-18 14:51:38 EDT

Attachments
bnx2-poll.patch (1.84 KB, patch)
2009-01-20 10:00 EST, Andy Gospodarek
bnx2-poll2.patch (929 bytes, patch)
2009-01-21 12:00 EST, Andy Gospodarek
bnx2-netdump-and-netpoll-fix.patch (1.88 KB, patch)
2009-01-23 09:47 EST, Andy Gospodarek
bnx2-netdump-and-netpoll-fix2.patch (2.59 KB, patch)
2009-01-23 21:31 EST, Andy Gospodarek

Description Jeff Burke 2009-01-19 16:38:42 EST
Description of problem:
 While running the RHEL4.8 kerneltier1 tests, the x86_64 UP kernel oopsed on hp-dl785g5-01.rhts.bos.redhat.com.

Version-Release number of selected component (if applicable):
Kernel 2.6.9-79.EL 

How reproducible:

Steps to Reproduce:
1. Install RHEL4-U8-re20090115.nightly AS x86_64 on hp-dl785g5-01.rhts.bos.redhat.com
2. Then install Kernel 2.6.9-79.EL
3. Install and run RHTS test rh-tests-kernel-security-audit-audit-test
Actual results:
<1>Unable to handle kernel paging requestCPU#0 is executing netdump.
< netdump activated - performing handshake with the server. >
arch/x86_64/kernel/traps.c:334: spin_trylock(arch/x86_64/kernel/traps.c:ffffffff80437ca0) already locked by arch/x86_64/kernel/traps.c/334

<1>Unable to handle kernel paging request at 00000000fa012136 RIP: 
PML4 1fee9ca067 PGD 0 
<0>Oops: 0000 [2] 
CPU 0 
Modules linked in: lp(U) md5 ipv6 parport_pc parport netconsole netdump autofs4 i2c_dev i2c_core sunrpc ds yenta_socket pcmcia_core cpufreq_powersave joydev loop button battery ac uhci_hcd ohci_hcd ehci_hcd bnx2 dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod cciss sd_mod scsi_mod
Pid: 32680, comm: rhts-system-inf Not tainted 2.6.9-79.EL
RIP: 0010:[<ffffffffa00ad6ce>] <ffffffffa00ad6ce>{:bnx2:bnx2_interrupt+12}
RSP: 0018:000001202b2f5850  EFLAGS: 00010046
RAX: 000001202fda63c0 RBX: 000001202d840440 RCX: 0000000030687465
RDX: 0000000000000000 RSI: 000001202d840000 RDI: 00000000000000da
RBP: 000001202d840000 R08: 00000000fa012100 R09: 0000000000000086
R10: 0000000000000000 R11: 0000000000000000 R12: 000001202d840000
R13: 0000011fe09bd668 R14: 0000000000000000 R15: ffffffffa017eda0
FS:  0000002a955633e0(0000) GS:ffffffff8056e180(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000fa012136 CR3: 0000000000101000 CR4: 00000000000006e0
Process rhts-system-inf (pid: 32680, threadinfo 0000011fe09bc000, task 0000011fe9398920)
Stack: 000001202e70cad8 ffffffffa00b4c06 0000000000000000 000001202d840000 
       000001202e70ca00 ffffffff8030da46 0000000000000055 0000000000000055 
       000001202e70ca00 0000000000000018 

Call Trace:<ffffffffa00b4c06>{:bnx2:poll_bnx2+50}
Code: 41 0f b7 40 36 3b 46 28 75 10 48 8b 02 31 ff 8b 40 6c a8 01 
<1>RIP <ffffffffa00ad6ce>{:bnx2:bnx2_interrupt+12} RSP <000001202b2f5850>
<0>CR2: 00000000fa012136
<3>netpoll_start_netdump: called recursively.  rebooting.
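
A side note on reading the oops above, offered as a sketch rather than a definitive analysis. The first bytes of the Code: line (41 0f b7 40 36) appear to decode as movzwl 0x36(%r8),%eax, so the faulting address should be R08 plus 0x36, which matches CR2. RCX also happens to hold the little-endian ASCII bytes for "eth0", which would be consistent with the handler reading fields out of a network device structure. The helper names below are made up for illustration; only the register values come from the dump.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values copied from the register dump in this report. */
#define OOPS_R08 0x00000000fa012100ULL
#define OOPS_CR2 0x00000000fa012136ULL
#define OOPS_RCX 0x0000000030687465ULL

/* Displacement taken from the leading Code: bytes 41 0f b7 40 36
 * (movzwl 0x36(%r8),%eax). */
#define FAULT_DISP 0x36ULL

/* Hypothetical helper: effective address of a base+displacement load. */
static uint64_t fault_address(uint64_t base, uint64_t disp)
{
    return base + disp;
}

/* Hypothetical helper: read the low four bytes of a register value
 * as little-endian ASCII and NUL-terminate the result. */
static void reg_to_ascii(uint64_t reg, char out[5])
{
    assert(out != NULL);
    for (int i = 0; i < 4; i++)
        out[i] = (char)((reg >> (8 * i)) & 0xff);
    out[4] = '\0';
}
```

Calling fault_address(OOPS_R08, FAULT_DISP) gives a value equal to OOPS_CR2, and reg_to_ascii(OOPS_RCX, buf) leaves "eth0" in buf.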

Expected results:
Test should pass

Additional info:
This system was running the audit-test suite. It got to a point where it hung (see BZ 480330). When a system in RHTS hangs like this, it will try to run the local watchdog.
Local watchdog does the following:
 cat /proc/slabcache
 Then kill the PID
 Then reboot the system.

While it was doing the Alt+SysRq+T it ran into the problems.
Comment 1 Andy Gospodarek 2009-01-20 09:13:29 EST
This should be fixed in 79.EL -- can you try testing with that?
Comment 2 Andy Gospodarek 2009-01-20 09:15:06 EST
Crap, that is what you tried -- I read it as 78.EL the first time.  Jeff, did this test pass with 78.EL?
Comment 3 Andy Gospodarek 2009-01-20 09:30:37 EST
Jeff, do you happen to have the output from /proc/interrupts when running on this system?

It looks like bnx2_interrupt expects us to pass a bnx2_napi struct and we are passing it a netdev.  That's your recipe for disaster.
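
To illustrate the mismatch described above, here is a minimal user-space sketch. It is not the bnx2 code: the demo_dev and demo_napi structs and every demo_* function are hypothetical stand-ins, and the real driver structures are laid out differently. The point is only that an interrupt handler casts its opaque dev_instance argument to whatever type it was registered with, so a poll routine has to hand it the same kind of pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_NUM_VECTORS 2

struct demo_dev;

/* Hypothetical per-vector context, standing in for a bnx2_napi-like struct. */
struct demo_napi {
    struct demo_dev *bp;    /* back-pointer to the owning device */
    int vector;
    unsigned int hits;      /* how many times the handler ran for this vector */
};

/* Hypothetical device, standing in for the netdev. */
struct demo_dev {
    char name[16];
    struct demo_napi napi[DEMO_NUM_VECTORS];
};

static void demo_dev_init(struct demo_dev *dev, const char *name)
{
    assert(dev != NULL && name != NULL);
    memset(dev, 0, sizeof(*dev));
    strncpy(dev->name, name, sizeof(dev->name) - 1);
    for (int i = 0; i < DEMO_NUM_VECTORS; i++) {
        dev->napi[i].bp = dev;
        dev->napi[i].vector = i;
    }
}

/* The handler trusts that dev_instance is a per-vector context. */
static int demo_interrupt(int irq, void *dev_instance)
{
    struct demo_napi *bnapi = dev_instance;
    (void)irq;
    bnapi->hits++;
    return bnapi->vector;
}

/* Buggy shape, left as a comment on purpose: passing the device itself
 * makes the handler read device fields (such as the name) as if they
 * were pointers.
 *
 *     demo_interrupt(0, dev);
 */

/* Fixed shape: pass each per-vector context the handler expects. */
static int demo_poll_fixed(struct demo_dev *dev)
{
    int handled = 0;
    for (int i = 0; i < DEMO_NUM_VECTORS; i++) {
        demo_interrupt(i, &dev->napi[i]);
        handled++;
    }
    return handled;
}
```

After demo_dev_init(&dev, "eth0"), a call to demo_poll_fixed(&dev) runs the handler once per vector. The buggy call compiles without complaint because the argument is a void pointer, which is why this kind of mistake only shows up at run time.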
Comment 4 Jeff Burke 2009-01-20 09:39:10 EST
Reply to Comment #2

  Short answer.

  Verbose answer.
   Not sure. This BZ has a couple of "special" things happening at once. I am
not sure how each piece is actually relevant to the problem.

1. We are running a UP kernel
2. We have netdump configured to dump across the network
3. The audit-test hung while running
4. The system will run a bunch of Alt+SysRq commands before it kills the test
(this issue looks like it happened while it was running Alt+SysRq+T)

   Do all of those things need to happen to trigger this issue? I am not sure.
If they do, then the same environment was not set up before, so the problem may
already have been there and we just did not see it.

Comment 5 Jeff Burke 2009-01-20 09:41:10 EST
Reply to Comment #3

   We don't capture that data normally on a test run. Here is a link to the data that is available:
The above link will need your BZ login to look at it.

   If you need the data we will have to reserve the system and check it out.

Comment 6 Andy Gospodarek 2009-01-20 10:00:56 EST
Created attachment 329470 (bnx2-poll.patch)

Jeff, I think this patch would resolve it.  I will add it to my rhel4 test kernels and see if I can make this test succeed with it in place.
Comment 7 Andy Gospodarek 2009-01-21 12:00:41 EST
Created attachment 329622 (bnx2-poll2.patch)

This is actually a better patch because it calls the correct handler.
Comment 8 Andy Gospodarek 2009-01-21 22:32:45 EST
It also seems that netdump (not netconsole) is broken on bnx2 with the latest RHEL4 kernels.  I will group the netpoll and netdump fixes into one patch when I figure out what is wrong.

It looks to be driver-specific, since I can back out only the driver patches and it works fine.
Comment 9 RHEL Product and Program Management 2009-01-21 23:01:33 EST
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
Comment 11 Andy Gospodarek 2009-01-23 09:47:10 EST
Created attachment 329839 (bnx2-netdump-and-netpoll-fix.patch)

This patch seems to be working well on my system.  It resolves the panics that can occur when calling the poll_controller routine (poll_bnx2) and also adds a routine for poll (bnx2_poll_all).

The call to bnx2_poll_all will only be made after a call to poll_bnx2, so it will not be used in the normal case.  I still need to be sure that there is no chance for one CPU to execute bnx2_poll_all while another is executing one of the dummy_netdev poll routines, but I think we are safe.
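
The ordering and exclusion concern above can be sketched in user space as follows. This is an assumption-laden illustration, not what the attached patch does: the demo_* names are hypothetical, and C11 atomics stand in for whatever locking the driver actually uses. The sketch only shows one way to make a poll-all routine a no-op until the poll_controller path has run, and to keep two contexts from polling at the same time.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical shared state for the sketch. */
struct demo_poll_state {
    atomic_flag busy;          /* set while some context is polling */
    atomic_bool netpoll_seen;  /* set once the poll_controller path has run */
    int rounds;                /* completed poll-all passes */
};

#define DEMO_POLL_STATE_INIT { ATOMIC_FLAG_INIT, false, 0 }

/* Stand-in for the poll_controller entry point: it only records that
 * netpoll is now driving the device. */
static void demo_poll_controller(struct demo_poll_state *st)
{
    assert(st != NULL);
    atomic_store(&st->netpoll_seen, true);
}

/* Stand-in for a poll-all routine. Returns true only if it did work. */
static bool demo_poll_all(struct demo_poll_state *st)
{
    assert(st != NULL);
    if (!atomic_load(&st->netpoll_seen))
        return false;                      /* normal case: never used */
    if (atomic_flag_test_and_set(&st->busy))
        return false;                      /* another context is polling */
    st->rounds++;                          /* placeholder for the real work */
    atomic_flag_clear(&st->busy);
    return true;
}
```

With a state initialized from DEMO_POLL_STATE_INIT, demo_poll_all() returns false until demo_poll_controller() has been called, and it also backs off while the busy flag is held by another context.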
Comment 12 Andy Gospodarek 2009-01-23 21:31:20 EST
Created attachment 329895 (bnx2-netdump-and-netpoll-fix2.patch)

This patch should clear up the concerns I expressed in comment #11.
Comment 13 Andy Gospodarek 2009-01-26 20:31:18 EST
My test kernels have been updated to include a patch for this bugzilla.


Please test them and report back your results.  Without immediate
feedback there is a good chance this or any other fix for this driver
will not be included in the upcoming update.
Comment 17 Andy Gospodarek 2009-02-04 12:06:09 EST
My test kernels have been updated to include a patch for this bugzilla.


Please test them and report back your results.  Without immediate
feedback there is a good chance this or any other fix for this driver
will not be included in the upcoming update.
