Bug 223505 - LSPP: tcpdump crashes kernel and system goes into debugger.
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Hardware: ppc64 Linux
Priority: medium, Severity: medium
Assigned To: Herbert Xu
QA Contact: Brian Brock
Depends On:
Blocks: RHEL5LSPPCertTracker
Reported: 2007-01-19 14:35 EST by Joy Latten
Modified: 2007-11-30 17:07 EST (History)

Fixed In Version: 5.0.0
Doc Type: Bug Fix
Last Closed: 2007-02-13 12:01:27 EST

Attachments
diff between 3013 and 3014 (4.95 KB, patch)
2007-01-23 12:12 EST, Eric Paris
[PACKET]: Fix skb->cb clobbering between aux and sockaddr (2.83 KB, patch)
2007-01-23 18:46 EST, Herbert Xu

Description Joy Latten 2007-01-19 14:35:19 EST
Description of problem:
Simply running "tcpdump" or "tcpdump -i eth0" on the lspp 63 kernel
causes the kernel to crash, and the system drops into the debugger.

Version-Release number of selected component (if applicable):

How reproducible:
Happens every time. 

Steps to Reproduce:
1. tcpdump -i eth0 OR tcpdump
Actual results:

tcpdump -i eth0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes

(a few packets are picked up)
Unable to handle kernel paging request for instruction fetch
Faulting instruction address: 0x002d1694
cpu 0x0: Vector: 400 (Instruction Access) at [c00000000277bb10]
    pc: 00000000002d1694
    lr: 00000000002d1694
    sp: c00000000277bd90
   msr: 8000000040009032
  current = 0xc0000000029122f0
  paca    = 0xc000000000464300
    pid   = 1701, comm = tcpdump
enter ? for help
0:mon> r
R00 = 00000000002d1694   R16 = 0000000000000000
R01 = c00000000277bd90   R17 = 0000000000000000
R02 = c000000000579640   R18 = 00000000ffffffff
R03 = 00000000000000c8   R19 = 0000000000000000
R04 = c00000000277bcd8   R20 = 000000001008dd64
R05 = 0000000000000004   R21 = 00000000100b0000
R06 = 0000000000000000   R22 = 00000000100b0000
R07 = 0000000000000001   R23 = 00000000fd31fe3b
R08 = 000000c80000000e   R24 = c00000000fd50688
R09 = c000000002778000   R25 = c00000000fd50750
R10 = 0000000000000000   R26 = c00000000fd50880
R11 = 0000000000000000   R27 = 0000000000000000
R12 = 000c00010000a8c0   R28 = 0000000000000000
R13 = c000000000464300   R29 = 0000000000000000
R14 = 0000000000000000   R30 = c00000000050bed0
R15 = 0000000000000000   R31 = c00000000f6fbb80
pc  = 00000000002d1694
lr  = 00000000002d1694
msr = 8000000040009032   cr  = 24022482
ctr = 0000000000000000   xer = 0000000000000000   trap =  400

Expected results:
Don't expect to see kernel debugger. :-)

Additional info:
uname -a
Linux XXXXXXXX 2.6.18-1.3015.2.1.el5.lspp.63 #1 SMP Mon Jan 15 16:51:12 EST 2007
ppc64 ppc64 ppc64 GNU/Linux

I think this may be a kernel issue. 
The same machine is installed with 2.6.18-1.3002.el5 kernel, and 
tcpdump works fine when using this kernel.
Comment 2 Linda Wang 2007-01-22 15:26:58 EST
Can someone verify that tcpdump works on another Ethernet adapter?
Also, what networking driver/adapter is eth0 attached to?
Comment 3 Joy Latten 2007-01-22 16:44:29 EST
This occurs on an lpar which is using ibmveth driver, that is it is a virtual
Comment 4 Tim Burke 2007-01-23 10:30:08 EST
Just so we understand this correctly.... is the original problem description
stating that this works fine on stock RHEL5RC, but fails on the LSPP specific
Comment 5 Linda Wang 2007-01-23 10:35:55 EST
The last ibmveth change went in on 1.2789.el5 for RHEL5. Did tcpdump work on
prior kernels, i.e. the beta2 kernel, etc.?
Comment 6 Eric Paris 2007-01-23 11:04:15 EST
tcpdump -i eth0 caused a panic on a Cell architecture blade after receiving
about 8 packets.  This was running 2.6.18-4.el5.  Will attempt to switch to
the kernel mentioned in comment #5 and look for a difference.
Comment 7 Eric Paris 2007-01-23 11:13:26 EST
2.6.18-1.2767.el5 appears to work correctly and without issue
Comment 8 Eric Paris 2007-01-23 11:22:09 EST
2.6.18-1.2789.el5 also worked fine.  Still working to isolate the problematic patch.
Comment 9 Eric Paris 2007-01-23 11:40:05 EST
panic was introduced somewhere between 1.3002.el5 and 1.3014.el5
Comment 10 Eric Paris 2007-01-23 11:47:10 EST
Even better, it appears to work fine on 1.3013.el5, so the problem must be
between 3013 and 3014.
Comment 11 Eric Paris 2007-01-23 12:10:46 EST
I'm going to go back and reverify my work that this patch is the problem, but
the differences between 3013 and 3014 seem to be a result of

Related: rhbz#219681 - xen dhcp patch has a new fix for a missing prototype,
round 2.

Adding Herbert to the CC since I believe it is his patch.  This appears to work
just fine on x86/x86_64; however, on ppc64 it goes boom.
Comment 12 Eric Paris 2007-01-23 12:12:07 EST
Created attachment 146320
diff between 3013 and 3014
Comment 14 James Morris 2007-01-23 15:53:32 EST
When you're in the debugger, can you get a backtrace of the crash?
Comment 15 Eric Paris 2007-01-23 16:00:11 EST
No.  Below is what I get.  You can easily access the machine I'm doing this on

[root@ibm-cell-01 ~]# tcpdump -i eth0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
15:33:13.867488 arp who-has frodo.lab.boston.redhat.com tell
15:33:13.870286 arp who-has frodo.lab.boston.redhat.com tell
15:33:13.872051 IP squad5-lp1.lab.boston.redhat.com >
ibm-cell-01.lab.boston.redhat.com: ICMP echo request, id 27000, seq 44259, length 64
15:33:13.872080 IP ibm-cell-01.lab.boston.redhat.com >
squad5-lp1.lab.boston.redhat.com: ICMP echo reply, id 27000, seq 44259, length 64
15:33:13.882778 arp reply frodo.lab.boston.redhat.com is-at 00:08:02:46:ea:e9
(oui Unknown)
15:33:13.882792 IP ibm-cell-01.lab.boston.redhat.com.cap >
frodo.lab.boston.redhat.com.domain:  22026+ PTR? 0x1:
Vector: 700 (Program Check) at [c00000001b023b10]
    pc: c000000000940004
    lr: c000000000940000
    sp: c00000001b023d90
   msr: 9000000000089032
  current = 0xc000000001f5cb40
  paca    = 0xc000000000464500
    pid   = 2625, comm = tcpdump
enter ? for help
1:mon> t
[c00000001b023d90] c000000000940000 (unreliable)
Comment 16 Herbert Xu 2007-01-23 18:46:40 EST
Created attachment 146377
[PACKET]: Fix skb->cb clobbering between aux and sockaddr

Both aux data and sockaddr try to use the same buffer which
obviously doesn't work.  We just happen to have 4 bytes free in
the skb->cb if you take away the maximum length of sockaddr_ll.
That's just enough to store the one piece of info from aux data
that we can't generate at recvmsg(2) time.

This is what the following patch does.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
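
Editorial sketch of the explanation in comment 16. This is a userspace illustration only, not the attached patch. The 48-byte size for skb->cb, the 32-byte maximum link-layer address length, and every demo_* name below are assumptions made for the example. It shows that a maximum-length address leaves 4 spare bytes, that storing aux data and the address at the same offset loses the aux value, and that giving the one needed aux field (here called origlen, a hypothetical name) its own slot preserves it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed sizes for this kernel generation; not taken from the patch. */
#define DEMO_CB_SIZE      48  /* assumed size of skb->cb */
#define DEMO_MAX_ADDR_LEN 32  /* assumed longest link-layer address */

/* Hypothetical mirror of struct sockaddr_ll, with the address field
 * widened to the maximum length that may be copied into it. */
struct demo_sockaddr_ll {
    uint16_t sll_family;
    uint16_t sll_protocol;
    int32_t  sll_ifindex;
    uint16_t sll_hatype;
    uint8_t  sll_pkttype;
    uint8_t  sll_halen;
    uint8_t  sll_addr[DEMO_MAX_ADDR_LEN];
};

/* Non-overlapping layout: one aux field, then the address. */
struct demo_packet_cb {
    uint32_t origlen;
    struct demo_sockaddr_ll ll;
};

/* Bytes left in cb once a maximum-length address is stored. */
size_t demo_spare_bytes(void)
{
    assert(sizeof(struct demo_packet_cb) <= DEMO_CB_SIZE);
    return DEMO_CB_SIZE - sizeof(struct demo_sockaddr_ll);
}

/* Overlapping layout: aux data and address both start at cb[0].
 * Returns the aux value read back after the address is written. */
uint32_t demo_shared_layout(uint32_t origlen)
{
    unsigned char cb[DEMO_CB_SIZE] = {0};
    struct demo_sockaddr_ll ll;
    uint32_t readback;

    memset(&ll, 0xAA, sizeof(ll));            /* stand-in address bytes */
    memcpy(cb, &origlen, sizeof(origlen));    /* aux data stored first */
    memcpy(cb, &ll, sizeof(ll));              /* address overwrites it */
    memcpy(&readback, cb, sizeof(readback));
    return readback;
}

/* Split layout: aux field and address occupy separate offsets. */
uint32_t demo_split_layout(uint32_t origlen)
{
    unsigned char cb[DEMO_CB_SIZE] = {0};
    struct demo_sockaddr_ll ll;
    uint32_t readback;

    memset(&ll, 0xAA, sizeof(ll));
    memcpy(cb + offsetof(struct demo_packet_cb, origlen),
           &origlen, sizeof(origlen));
    memcpy(cb + offsetof(struct demo_packet_cb, ll), &ll, sizeof(ll));
    memcpy(&readback, cb + offsetof(struct demo_packet_cb, origlen),
           sizeof(readback));
    return readback;
}
```

Calling demo_shared_layout(96) returns the stand-in address bytes rather than 96, while demo_split_layout(96) returns 96. The sketch uses memcpy with offsetof rather than casting the byte buffer to a struct pointer, to avoid alignment assumptions in userspace.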
Comment 20 Jay Turner 2007-01-24 07:56:07 EST
QE ack for RHEL5.
Comment 22 Don Zickus 2007-01-24 16:45:17 EST
in 2.6.18-6.el5
Comment 23 Jay Turner 2007-02-13 12:01:27 EST
Closing out.
