Bug 223505 - LSPP: tcpdump crashes kernel and system goes into debugger.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.0
Hardware: ppc64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Herbert Xu
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks: RHEL5LSPPCertTracker
 
Reported: 2007-01-19 19:35 UTC by Joy Latten
Modified: 2007-11-30 22:07 UTC (History)

Fixed In Version: 5.0.0
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-02-13 17:01:27 UTC
Target Upstream Version:
Embargoed:


Attachments
diff between 3013 and 3014 (4.95 KB, patch)
2007-01-23 17:12 UTC, Eric Paris
[PACKET]: Fix skb->cb clobbering between aux and sockaddr (2.83 KB, patch)
2007-01-23 23:46 UTC, Herbert Xu

Description Joy Latten 2007-01-19 19:35:19 UTC
Description of problem:
Just issuing "tcpdump" or "tcpdump -i eth0" on the lspp.63 kernel
causes the kernel to crash and the system to drop into the debugger.

Version-Release number of selected component (if applicable):
tcpdump-3.9.4-8.1

How reproducible:
Happens every time. 

Steps to Reproduce:
1. tcpdump -i eth0 OR tcpdump
  
Actual results:

tcpdump -i eth0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes

(a few packets are picked up)
...
...
Unable to handle kernel paging request for instruction fetch
Faulting instruction address: 0x002d1694
cpu 0x0: Vector: 400 (Instruction Access) at [c00000000277bb10]
    pc: 00000000002d1694
    lr: 00000000002d1694
    sp: c00000000277bd90
   msr: 8000000040009032
  current = 0xc0000000029122f0
  paca    = 0xc000000000464300
    pid   = 1701, comm = tcpdump
enter ? for help
0:mon>t
0:mon> r
R00 = 00000000002d1694   R16 = 0000000000000000
R01 = c00000000277bd90   R17 = 0000000000000000
R02 = c000000000579640   R18 = 00000000ffffffff
R03 = 00000000000000c8   R19 = 0000000000000000
R04 = c00000000277bcd8   R20 = 000000001008dd64
R05 = 0000000000000004   R21 = 00000000100b0000
R06 = 0000000000000000   R22 = 00000000100b0000
R07 = 0000000000000001   R23 = 00000000fd31fe3b
R08 = 000000c80000000e   R24 = c00000000fd50688
R09 = c000000002778000   R25 = c00000000fd50750
R10 = 0000000000000000   R26 = c00000000fd50880
R11 = 0000000000000000   R27 = 0000000000000000
R12 = 000c00010000a8c0   R28 = 0000000000000000
R13 = c000000000464300   R29 = 0000000000000000
R14 = 0000000000000000   R30 = c00000000050bed0
R15 = 0000000000000000   R31 = c00000000f6fbb80
pc  = 00000000002d1694
lr  = 00000000002d1694
msr = 8000000040009032   cr  = 24022482
ctr = 0000000000000000   xer = 0000000000000000   trap =  400
0:mon>

Expected results:
Don't expect to see kernel debugger. :-)

Additional info:
uname -a
Linux XXXXXXXX 2.6.18-1.3015.2.1.el5.lspp.63 #1 SMP Mon Jan 15 16:51:12 EST 2007
ppc64 ppc64 ppc64 GNU/Linux

I think this may be a kernel issue.
The same machine also has the 2.6.18-1.3002.el5 kernel installed, and
tcpdump works fine when booted into that kernel.

Comment 2 Linda Wang 2007-01-22 20:26:58 UTC
Can someone verify whether tcpdump works on other Ethernet adapters?
Also, what networking driver/adapter is eth0 attached to?


Comment 3 Joy Latten 2007-01-22 21:44:29 UTC
This occurs on an LPAR that uses the ibmveth driver; that is, eth0 is a
virtual Ethernet device.

Comment 4 Tim Burke 2007-01-23 15:30:08 UTC
Just so we understand this correctly.... is the original problem description
stating that this works fine on stock RHEL5RC, but fails on the LSPP specific
kernel?


Comment 5 Linda Wang 2007-01-23 15:35:55 UTC
The last ibmveth change went in on 1.2789.el5 for RHEL5. Did tcpdump work on
prior kernels, i.e. the beta2 kernel, etc.?

Comment 6 Eric Paris 2007-01-23 16:04:15 UTC
tcpdump -i eth0 caused a panic on a Cell architecture blade after receiving
about 8 packets.  This was running 2.6.18-4.el5.  Will attempt to switch to
the kernel mentioned in comment #5 and look for a difference.

Comment 7 Eric Paris 2007-01-23 16:13:26 UTC
2.6.18-1.2767.el5 appears to work correctly and without issue

Comment 8 Eric Paris 2007-01-23 16:22:09 UTC
2.6.18-1.2789.el5 also worked fine.  Still working to isolate the problematic patch.

Comment 9 Eric Paris 2007-01-23 16:40:05 UTC
panic was introduced somewhere between 1.3002.el5 and 1.3014.el5

Comment 10 Eric Paris 2007-01-23 16:47:10 UTC
Even better, it appears to work fine on 1.3013.el5, so the problem must be
between 3013 and 3014.

Comment 11 Eric Paris 2007-01-23 17:10:46 UTC
I'm going to go back and reverify that this patch is the problem, but the
differences between 3013 and 3014 seem to be the result of:

Related: rhbz#219681 - xen dhcp patch has a new fix for a missing prototype,
round 2.

Adding Herbert to the CC since I believe it is his patch.  This appears to work
just fine on x86/x86_64; on ppc64, however, it goes boom.

Comment 12 Eric Paris 2007-01-23 17:12:07 UTC
Created attachment 146320
diff between 3013 and 3014

Comment 14 James Morris 2007-01-23 20:53:32 UTC
When you're in the debugger, can you get a backtrace of the crash?

Comment 15 Eric Paris 2007-01-23 21:00:11 UTC
No.  Below is what I get.  You can easily access the machine I'm doing this on
internally.

[root@ibm-cell-01 ~]# tcpdump -i eth0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
15:33:13.867488 arp who-has frodo.lab.boston.redhat.com tell
i386-5as.lab.boston.redhat.com
15:33:13.870286 arp who-has frodo.lab.boston.redhat.com tell
ibm-cell-01.lab.boston.redhat.com
15:33:13.872051 IP squad5-lp1.lab.boston.redhat.com >
ibm-cell-01.lab.boston.redhat.com: ICMP echo request, id 27000, seq 44259, length 64
15:33:13.872080 IP ibm-cell-01.lab.boston.redhat.com >
squad5-lp1.lab.boston.redhat.com: ICMP echo reply, id 27000, seq 44259, length 64
15:33:13.882778 arp reply frodo.lab.boston.redhat.com is-at 00:08:02:46:ea:e9
(oui Unknown)
15:33:13.882792 IP ibm-cell-01.lab.boston.redhat.com.cap >
frodo.lab.boston.redhat.com.domain:  22026+ PTR? 10.76.168.192.in-acpu 0x1:
Vector: 700 (Program Check) at [c00000001b023b10]
    pc: c000000000940004
    lr: c000000000940000
    sp: c00000001b023d90
   msr: 9000000000089032
  current = 0xc000000001f5cb40
  paca    = 0xc000000000464500
    pid   = 2625, comm = tcpdump
enter ? for help
1:mon> t
[c00000001b023d90] c000000000940000 (unreliable)
1:mon>


Comment 16 Herbert Xu 2007-01-23 23:46:40 UTC
Created attachment 146377
[PACKET]: Fix skb->cb clobbering between aux and sockaddr

Both aux data and sockaddr try to use the same buffer, which
obviously doesn't work.  We just happen to have 4 bytes free in
the skb->cb if you take away the maximum length of sockaddr_ll.
That's just enough to store the one piece of info from aux data
that we can't generate at recvmsg(2) time.

This is what the following patch does.

Signed-off-by: Herbert Xu <herbert.org.au>
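
For readers following the arithmetic in the patch description, here is a minimal userspace sketch in C. It is not the attached patch. It assumes a 48-byte skb->cb and a 32-byte maximum link-layer address (CB_SIZE and MAX_ADDR_LEN below), which are the values I believe applied to 2.6.18-era kernels but which this report does not state. struct ll_addr is a local mirror of struct sockaddr_ll, and struct cb_layout, aux_value, and all function names are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed values for a 2.6.18-era kernel; not stated in this report. */
#define CB_SIZE      48  /* assumed sizeof(skb->cb) */
#define MAX_ADDR_LEN 32  /* assumed longest link-layer address */

/* Userspace mirror of struct sockaddr_ll, used for layout arithmetic only. */
struct ll_addr {
    unsigned short sll_family;
    unsigned short sll_protocol;
    int            sll_ifindex;
    unsigned short sll_hatype;
    unsigned char  sll_pkttype;
    unsigned char  sll_halen;
    unsigned char  sll_addr[8];
};

/* Hypothetical split layout: one saved aux value first, the address after it. */
struct cb_layout {
    unsigned int   aux_value;
    struct ll_addr ll;
};

/* Largest address record: fixed header plus a full-length hardware address. */
static size_t max_sockaddr_ll_len(void)
{
    return offsetof(struct ll_addr, sll_addr) + MAX_ADDR_LEN;
}

/* Bytes left in the control buffer once the largest address is reserved. */
static size_t free_cb_bytes(void)
{
    return CB_SIZE - max_sockaddr_ll_len();
}

/* Returns 1 if writing the address at offset 0 destroys a value stored there. */
static int shared_offset_clobbers(void)
{
    unsigned char cb[CB_SIZE];
    unsigned int aux = 1500, readback;
    struct ll_addr ll;

    memset(cb, 0, sizeof(cb));
    memset(&ll, 0xff, sizeof(ll));
    memcpy(cb, &aux, sizeof(aux));  /* aux data goes in first */
    memcpy(cb, &ll, sizeof(ll));    /* address lands on top of it */
    memcpy(&readback, cb, sizeof(readback));
    return readback != aux;
}

/* Returns 1 if the value survives when the address is stored after it. */
static int split_layout_preserves(void)
{
    unsigned char cb[CB_SIZE];
    unsigned int aux = 1500, readback;
    struct ll_addr ll;

    memset(cb, 0, sizeof(cb));
    memset(&ll, 0xff, sizeof(ll));
    memcpy(cb + offsetof(struct cb_layout, aux_value), &aux, sizeof(aux));
    memcpy(cb + offsetof(struct cb_layout, ll), &ll, sizeof(ll));
    memcpy(&readback, cb + offsetof(struct cb_layout, aux_value), sizeof(readback));
    return readback == aux;
}
```

Under those assumptions the address record needs at most 12 + 32 = 44 bytes, which leaves 48 - 44 = 4 bytes, exactly one unsigned int. Calling free_cb_bytes() from a small test program should confirm the figure on a given compiler and ABI.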

Comment 20 Jay Turner 2007-01-24 12:56:07 UTC
QE ack for RHEL5.

Comment 22 Don Zickus 2007-01-24 21:45:17 UTC
in 2.6.18-6.el5

Comment 23 Jay Turner 2007-02-13 17:01:27 UTC
Closing out.

