223505 – LSPP: tcpdump crashes kernel and system goes into debugger.

Summary: LSPP: tcpdump crashes kernel and system goes into debugger.

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.0
Hardware:	ppc64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Herbert Xu
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	RHEL5LSPPCertTracker

Reported:	2007-01-19 19:35 UTC by Joy Latten
Modified:	2007-11-30 22:07 UTC
CC List:	11 users
Fixed In Version:	5.0.0
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-02-13 17:01:27 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
diff between 3013 and 3014 (4.95 KB, patch) 2007-01-23 17:12 UTC, Eric Paris
[PACKET]: Fix skb->cb clobbering between aux and sockaddr (2.83 KB, patch) 2007-01-23 23:46 UTC, Herbert Xu

Description Joy Latten 2007-01-19 19:35:19 UTC

Description of problem:
Simply running "tcpdump" or "tcpdump -i eth0" on the lspp.63 kernel
crashes the kernel and drops the system into the debugger.

Version-Release number of selected component (if applicable):
tcpdump-3.9.4-8.1

How reproducible:
Happens every time. 

Steps to Reproduce:
1. tcpdump -i eth0 OR tcpdump
  
Actual results:

tcpdump -i eth0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes

(a few packets are picked up)
...
...
Unable to handle kernel paging request for instruction fetch
Faulting instruction address: 0x002d1694
cpu 0x0: Vector: 400 (Instruction Access) at [c00000000277bb10]
    pc: 00000000002d1694
    lr: 00000000002d1694
    sp: c00000000277bd90
   msr: 8000000040009032
  current = 0xc0000000029122f0
  paca    = 0xc000000000464300
    pid   = 1701, comm = tcpdump
enter ? for help
0:mon>t
0:mon> r
R00 = 00000000002d1694   R16 = 0000000000000000
R01 = c00000000277bd90   R17 = 0000000000000000
R02 = c000000000579640   R18 = 00000000ffffffff
R03 = 00000000000000c8   R19 = 0000000000000000
R04 = c00000000277bcd8   R20 = 000000001008dd64
R05 = 0000000000000004   R21 = 00000000100b0000
R06 = 0000000000000000   R22 = 00000000100b0000
R07 = 0000000000000001   R23 = 00000000fd31fe3b
R08 = 000000c80000000e   R24 = c00000000fd50688
R09 = c000000002778000   R25 = c00000000fd50750
R10 = 0000000000000000   R26 = c00000000fd50880
R11 = 0000000000000000   R27 = 0000000000000000
R12 = 000c00010000a8c0   R28 = 0000000000000000
R13 = c000000000464300   R29 = 0000000000000000
R14 = 0000000000000000   R30 = c00000000050bed0
R15 = 0000000000000000   R31 = c00000000f6fbb80
pc  = 00000000002d1694
lr  = 00000000002d1694
msr = 8000000040009032   cr  = 24022482
ctr = 0000000000000000   xer = 0000000000000000   trap =  400
0:mon>

Expected results:
Don't expect to see kernel debugger. :-)

Additional info:
uname -a
Linux XXXXXXXX 2.6.18-1.3015.2.1.el5.lspp.63 #1 SMP Mon Jan 15 16:51:12 EST 2007
ppc64 ppc64 ppc64 GNU/Linux

I think this may be a kernel issue.
The same machine also has the 2.6.18-1.3002.el5 kernel installed, and
tcpdump works fine under that kernel.

Comment 2 Linda Wang 2007-01-22 20:26:58 UTC

Can someone verify that tcpdump works on other Ethernet adapters?
Also, what networking driver/adapter is eth0 attached to?

Comment 3 Joy Latten 2007-01-22 21:44:29 UTC

This occurs on an LPAR that uses the ibmveth driver; that is, eth0 is a
virtual Ethernet adapter.

Comment 4 Tim Burke 2007-01-23 15:30:08 UTC

Just so we understand this correctly.... is the original problem description
stating that this works fine on stock RHEL5RC, but fails on the LSPP specific
kernel?

Comment 5 Linda Wang 2007-01-23 15:35:55 UTC

The last ibmveth change went into 1.2789.el5 for RHEL5. Did tcpdump work on
prior kernels, e.g. the beta2 kernel?

Comment 6 Eric Paris 2007-01-23 16:04:15 UTC

tcpdump -i eth0 caused a panic on a Cell architecture blade after receiving
about 8 packets.  This was running 2.6.18-4.el5.  Will attempt to switch to
the kernel mentioned in comment #5 and look for a difference.

Comment 7 Eric Paris 2007-01-23 16:13:26 UTC

2.6.18-1.2767.el5 appears to work correctly and without issue.

Comment 8 Eric Paris 2007-01-23 16:22:09 UTC

2.6.18-1.2789.el5 also worked fine.  Still working to isolate the problematic patch.

Comment 9 Eric Paris 2007-01-23 16:40:05 UTC

The panic was introduced somewhere between 1.3002.el5 and 1.3014.el5.

Comment 10 Eric Paris 2007-01-23 16:47:10 UTC

Even better: it appears to work fine on 1.3013.el5, so the problem must be
between 3013 and 3014.
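
The narrowing done in comments 7 through 10 amounts to a bisection over kernel builds. A minimal sketch of that search follows, assuming the builds are ordered oldest to newest and that once a build panics every later build does too. The `is_good` callback is hypothetical: here it is a stand-in lookup, whereas in practice it would mean booting the build and running tcpdump by hand. The build numbers are the ones mentioned in this report.

```python
def first_bad_build(builds, is_good):
    """Return the first build in `builds` for which is_good() is False.

    Assumes builds[0] is known good, builds[-1] is known bad, and the
    good/bad boundary is crossed exactly once.
    """
    lo, hi = 0, len(builds) - 1  # builds[lo] is good, builds[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(builds[mid]):
            lo = mid  # the regression is after mid
        else:
            hi = mid  # the regression is at or before mid
    return builds[hi]


# Build numbers from this report; the oracle is a stand-in for a manual test.
builds = [2767, 2789, 3002, 3013, 3014]
known_bad = {3014}
print(first_bad_build(builds, lambda b: b not in known_bad))  # prints 3014
```

With only a handful of builds the saving is small, but the same loop keeps the number of reboots logarithmic when the range between the last good and first bad build is large.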

Comment 11 Eric Paris 2007-01-23 17:10:46 UTC

I'm going to go back and re-verify that this patch is the problem, but the
differences between 3013 and 3014 seem to be the result of:

Related: rhbz#219681 - xen dhcp patch has a new fix for a missing prototype,
round 2.

Adding Herbert to the CC since I believe it is his patch.  This appears to work
just fine on x86/x86_64; on ppc64, however, it goes boom.

Comment 12 Eric Paris 2007-01-23 17:12:07 UTC

Created attachment 146320
diff between 3013 and 3014

Comment 14 James Morris 2007-01-23 20:53:32 UTC

When you're in the debugger, can you get a backtrace of the crash?

Comment 15 Eric Paris 2007-01-23 21:00:11 UTC

No.  Below is what I get.  You can easily access the machine I'm doing this on
internally.

[root@ibm-cell-01 ~]# tcpdump -i eth0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
15:33:13.867488 arp who-has frodo.lab.boston.redhat.com tell
i386-5as.lab.boston.redhat.com
15:33:13.870286 arp who-has frodo.lab.boston.redhat.com tell
ibm-cell-01.lab.boston.redhat.com
15:33:13.872051 IP squad5-lp1.lab.boston.redhat.com >
ibm-cell-01.lab.boston.redhat.com: ICMP echo request, id 27000, seq 44259, length 64
15:33:13.872080 IP ibm-cell-01.lab.boston.redhat.com >
squad5-lp1.lab.boston.redhat.com: ICMP echo reply, id 27000, seq 44259, length 64
15:33:13.882778 arp reply frodo.lab.boston.redhat.com is-at 00:08:02:46:ea:e9
(oui Unknown)
15:33:13.882792 IP ibm-cell-01.lab.boston.redhat.com.cap >
frodo.lab.boston.redhat.com.domain:  22026+ PTR? 10.76.168.192.in-acpu 0x1:
Vector: 700 (Program Check) at [c00000001b023b10]
    pc: c000000000940004
    lr: c000000000940000
    sp: c00000001b023d90
   msr: 9000000000089032
  current = 0xc000000001f5cb40
  paca    = 0xc000000000464500
    pid   = 2625, comm = tcpdump
enter ? for help
1:mon> t
[c00000001b023d90] c000000000940000 (unreliable)
1:mon>

Comment 16 Herbert Xu 2007-01-23 23:46:40 UTC

Created attachment 146377
[PACKET]: Fix skb->cb clobbering between aux and sockaddr

Both aux data and sockaddr try to use the same buffer, which
obviously doesn't work.  We just happen to have 4 bytes free in
the skb->cb if you take away the maximum length of sockaddr_ll.
That's just enough to store the one piece of info from aux data
that we can't generate at recvmsg(2) time.

This is what the following patch does.

Signed-off-by: Herbert Xu <herbert.org.au>
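
The "4 bytes free" arithmetic in the description above can be checked with a small userspace sketch. This is an illustration only, not the patch itself. It assumes skb->cb is 48 bytes and that the kernel's maximum hardware address length (MAX_ADDR_LEN) is 32; both constants are assumptions written into the code, not values stated in this report. `SockaddrLL` is a hypothetical ctypes mirror of struct sockaddr_ll whose field order follows linux/if_packet.h.

```python
import ctypes

SKB_CB_SIZE = 48   # assumption: size of the skb->cb scratch area
MAX_ADDR_LEN = 32  # assumption: kernel maximum hardware address length


class SockaddrLL(ctypes.Structure):
    """Userspace mirror of struct sockaddr_ll (illustrative only)."""
    _fields_ = [
        ("sll_family", ctypes.c_ushort),
        ("sll_protocol", ctypes.c_ushort),
        ("sll_ifindex", ctypes.c_int),
        ("sll_hatype", ctypes.c_ushort),
        ("sll_pkttype", ctypes.c_ubyte),
        ("sll_halen", ctypes.c_ubyte),
        ("sll_addr", ctypes.c_ubyte * 8),
    ]


def max_sockaddr_ll_len():
    """Header fields plus the longest address the kernel may store."""
    return SockaddrLL.sll_addr.offset + MAX_ADDR_LEN


def cb_free_bytes():
    """Bytes left in skb->cb once a maximum-length sockaddr_ll is reserved."""
    return SKB_CB_SIZE - max_sockaddr_ll_len()


print(max_sockaddr_ll_len(), cb_free_bytes())
```

Under these assumptions the leftover space is exactly one 32-bit field, which is why a single value from the aux data can sit alongside the sockaddr instead of overlapping it.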

Comment 20 Jay Turner 2007-01-24 12:56:07 UTC

QE ack for RHEL5.

Comment 22 Don Zickus 2007-01-24 21:45:17 UTC

in 2.6.18-6.el5

Comment 23 Jay Turner 2007-02-13 17:01:27 UTC

Closing out.
