Bug 665110

Summary:

System panic in pskb_expand_head when the arp_validate option is specified in bonding ARP monitor mode

Product:

Red Hat Enterprise Linux 6

Reporter:

Neal Kim <nkim>

Component:

kernel

Assignee:

Neil Horman <nhorman>

Status:

CLOSED ERRATA

QA Contact:

Jan Tluka <jtluka>

Severity:

urgent

Docs Contact:

Priority:

urgent

Version:

6.0

CC:

anton, dhoward, dtian, dwu, fhrbata, james.brown, jeder, jwest, jwilson, moshiro, myamazak, nhorman, nkim, peterm, plyons, sbest, skito, tao, tpnoonan

Target Milestone:

Keywords:

ZStream

Target Release:

---

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

kernel-2.6.32-112.el6

Doc Type:

Bug Fix

Doc Text:

Bonding, when operating in the ARP monitoring mode, made erroneous assumptions regarding the ownership of ARP frames when it received them for processing. Specifically, it was assumed that the bonding driver code was the only execution context which had access to the ARP frame's network buffer data. As a result, an operation was attempted on the said buffer (specifically, to modify the size of the data buffer) which was forbidden by the kernel when a buffer was shared among several execution contexts. The result of such an operation on a shared buffer could lead to data corruption. Consequently, trying to prevent the corruption, the kernel panicked. This shared state in the network buffer could be forced to occur, for example, when running the tcpdump utility to monitor traffic on the bonding interface. Every buffer the bond interface received would be shared between the driver and the tcpdump process, thus resulting in the aforementioned kernel panic. With this update, for the particular affected path in the bonding driver, each inbound frame is checked to determine whether it is in the shared state. In case a buffer is shared, a private copy is made for exclusive use by the bonding driver, thus preventing the kernel panic.

Story Points:

---

Clone Of:

607114

Environment:

Last Closed:

2011-05-19 12:49:43 UTC

Bug Depends On:

607114

Bug Blocks:

671342

Attachments:

Description	Flags
RHEL6.0-82599EB-packet-split-disable.patch	none
patch to preform skb sharing check in bond_arp_rcv	none

Comment 2 Andy Gospodarek 2010-12-22 19:38:00 UTC

I need to do some more checking, but it looks like attempts to trim after cloning this skb are causing the failure.

It seems odd as I would expect any ARP traffic that came into the ixgbe interface to cause the failure (it should not be a problem exclusive to bonding and arp monitoring).  After I test it a bit more I'm sure I will figure out why one works and the other does not.

Comment 10 Neil Horman 2011-01-12 21:30:57 UTC

I think maybe the best thing to do here is start instrumenting the kernel's rx path to check skb->users at various points between ixgbe and bond_rcv to see if we ever get a shared skb that we can trace the origin of.  Do we have this problem re-created anywhere in RH that we can use for debugging?

Comment 14 Neil Horman 2011-01-13 18:28:48 UTC

Regarding the workaround, I can't see any way the bringup order would have an effect on the state of the rx path in the ixgbe driver.  As such this workaround shouldn't have any effect on the problem.  The only way I could clearly see it having an effect is if multiple different network cards were used in the bond and the workaround documented in comment 12 caused a different NIC to be used as the active interface, resulting in the ixgbe rx path not getting used.

This workaround can be used if need be, but I wouldn't see it as a guarantee that the problem will not happen again.  This problem still needs to be investigated to its root cause and fixed properly, meaning if the workaround is implemented we need to get this reproduced internally.

Comment 15 Andy Gospodarek 2011-01-13 18:42:52 UTC

Can you attach a sysreport for this system?  If not, I would like at *least* the output of 'lspci' and 'lspci -vvv' 

Thanks!

Comment 16 Neal Kim 2011-01-13 19:03:45 UTC

Created attachment 473393 [details]
sosreport

Comment 17 Neal Kim 2011-01-13 19:07:46 UTC

Hello Andy,

sosreport attached. I am available to assist in any way if need be.

Comment 18 Neil Horman 2011-01-13 20:58:24 UTC

I stumbled on something while looking at this.  When this happens, are you by any chance running tcpdump, or another application that might bind an AF_PACKET protocol socket to the bonded interface or one of its slaves?  This line suggests that you might have been:
[<ffffffff8104fff9>] ? __wake_up_common+0x59/0x90
 [<ffffffff81407f7a>] __pskb_pull_tail+0x2aa/0x360
 [<ffffffffa0244530>] bond_arp_rcv+0x2c0/0x2e0 [bonding]
 [<ffffffff814a0857>] ? packet_rcv+0x377/0x440     <==============HERE
 [<ffffffff8140f21b>] netif_receive_skb+0x2db/0x670
 [<ffffffff8140f788>] napi_skb_finish+0x58/0x70

Comment 19 Andy Gospodarek 2011-01-13 21:00:42 UTC

(In reply to comment #16)
> Created attachment 473393 [details]
> sosreport

Neal, is this from your system (that cannot reproduce the issue) or from the customer (that can reproduce the issue)?

Comment 20 Neal Kim 2011-01-13 21:02:58 UTC

(In reply to comment #19)
> (In reply to comment #16)
> > Created attachment 473393 [details]
> > sosreport
> 
> Neal is this from your system (that cannot reproduce the issue) or from the
> customer (that can reproduce the issue)?

Andy,

That sosreport is from the customer (able to reproduce).

Comment 21 Andy Gospodarek 2011-01-13 21:25:34 UTC

(In reply to comment #20)
> (In reply to comment #19)
> > (In reply to comment #16)
> > > Created attachment 473393 [details]
> > > sosreport
> > 
> > Neal is this from your system (that cannot reproduce the issue) or from the
> > customer (that can reproduce the issue)?
> 
> Andy,
> 
> That sosreport is from the customer (able to reproduce).

[agospoda@gospo ~]$ tar -tzvf /tmp/sosreport-jharan.00395193-20101221172913-d56b.tar.xz | grep lspci
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error exit delayed from previous errors
[agospoda@gospo ~]$ tar -tjvf /tmp/sosreport-jharan.00395193-20101221172913-d56b.tar.xz | grep lspci
bzip2: (stdin) is not a bzip2 file.
tar: Child returned status 2
tar: Error exit delayed from previous errors

Comment 25 Timothy Noonan 2011-01-13 21:53:46 UTC

Hi Red Hat, do you have the hardware that you need to debug this defect? If not, what hardware do you need? Thanks.

Comment 26 Neil Horman 2011-01-14 02:19:40 UTC

Comment 18, I need an answer to that question.  It's important.

Comment 29 Andy Gospodarek 2011-01-14 02:41:03 UTC

Created attachment 473464 [details]
RHEL6.0-82599EB-packet-split-disable.patch

Neil's question from comment #18 needs to be addressed and after that is done I would like this patch to be tried.

Comment 30 Neil Horman 2011-01-14 14:44:53 UTC

Created attachment 473534 [details]
patch to preform skb sharing check in bond_arp_rcv

I've got a theory on what's going on as well.  If we assume that tcpdump (or some other app that attached an AF_PACKET socket to the bonded interface) was running, then it would seem that what happens is:

1) The AF_PACKET socket adds a packet reception hook to the ptype_all list via dev_add_pack (most applications default to using AF_PACKET with ETH_P_ALL).

2) The bonding interface, when used with arp monitoring, adds a packet hook to the ptype_base list (it only looks for ETH_P_ARP).

3) Since netif_receive_skb always interrogates the ptype_all list prior to the ptype_base list, the AF_PACKET packet hook (tpacket_rcv if the default mmap packet interface is used) gets called with the skb first.

4) Since this is a GRO frame, tpacket_rcv likely falls into this if clause:
     if (macoff + snaplen > po->rx_ring.frame_size) {
at which point it does an skb_shared check, which fails, so it keeps the skb exactly as it was, instead opting to perform an skb_get on the skb, which bumps its users reference count to 3 (1 from the alloc, +1 in deliver_skb, +1 from this skb_get).

5) After queueing the frame, tpacket_rcv calls kfree_skb, which reduces the skb->users count back down to 2.

6) Next, on return, netif_receive_skb interrogates the ptype_base list, which causes the bonding packet hook (bond_arp_rcv) to be called.

7) bond_arp_rcv attempts to call pskb_may_pull on the skb, which, because it doesn't have sufficient space to expand, calls pskb_expand_head. That triggers the observed BUG() panic, which is raised when skb_shared() finds a skb->users count greater than 1.
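
The accounting in steps 1-7 can be traced with a small user-space model. This is only a sketch under the assumptions stated in the list: struct fake_skb and every function name below are hypothetical stand-ins invented for illustration, not kernel code, and the per-step increments simply follow the counts given above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct sk_buff: only the reference count. */
struct fake_skb {
    int users;
};

/* Models skb_shared(): true when the count is not exactly 1. */
static bool fake_skb_shared(const struct fake_skb *skb)
{
    return skb->users != 1;
}

/* Steps 3-5: the ETH_P_ALL tap handler runs first. */
static void fake_tap_path(struct fake_skb *skb)
{
    skb->users++;   /* reference taken in deliver_skb: 1 -> 2 */
    skb->users++;   /* skb_get in the tap handler:     2 -> 3 */
    skb->users--;   /* kfree_skb on handler exit:      3 -> 2 */
}

/* Steps 6-7: the users count the ARP handler sees on entry. */
int users_seen_by_arp_handler(bool tap_attached)
{
    struct fake_skb skb = { .users = 1 };   /* 1 from the alloc */

    if (tap_attached)
        fake_tap_path(&skb);
    assert(skb.users >= 1);
    return skb.users;
}

/* True when expanding the head would hit the BUG() on a shared skb. */
bool expand_would_bug(bool tap_attached)
{
    struct fake_skb skb = { .users = users_seen_by_arp_handler(tap_attached) };

    return fake_skb_shared(&skb);
}
```

With a tap attached the model reports a count of 2 on entry to the ARP handler, so the expand would BUG(); without a tap the count stays at 1 and the expand is allowed.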

I think the solution to this problem is to effectively do what tpacket_rcv does: if any operations are to be performed on the skb that require exclusive access to it, we need to first be sure that we are the only user of the skb.  The above patch should be able to handle that.
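
A user-space sketch of that check-then-operate pattern follows. The names struct fake_skb and fake_share_check are hypothetical and only model the idea behind skb_share_check (hand back the buffer if we are the sole user, otherwise give up our reference and work on a private one). The real helper clones the skb, which this model reduces to copying one length field, so treat it as an illustration rather than the contents of the attached patch.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct sk_buff. */
struct fake_skb {
    int users;  /* reference count */
    int len;    /* payload length, the only "data" this model carries */
};

/*
 * Models the idea of skb_share_check(): return the original when we are
 * its only user; otherwise release our reference on the shared original
 * and return a private copy. Returns NULL if the copy cannot be allocated.
 */
struct fake_skb *fake_share_check(struct fake_skb *skb)
{
    struct fake_skb *copy;

    assert(skb->users >= 1);
    if (skb->users == 1)
        return skb;

    copy = malloc(sizeof(*copy));
    if (copy != NULL) {
        copy->len = skb->len;
        copy->users = 1;        /* the caller is the only user of the copy */
    }
    skb->users--;               /* our reference is dropped in either case */
    return copy;
}
```

A receive handler built this way would call the check first, bail out on NULL, and only then do anything that resizes the buffer, since at that point its users count is guaranteed to be 1.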

Here's a test build with that patch incorporated:
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3034571

Comment 31 Neal Kim 2011-01-14 19:52:28 UTC

(In reply to comment #18)
> I stumbled on something while looking at this.  When this happens, are you by
> any chance running tcpdump, or another application that might bind an AF_PACKET
> protocol socket to the bonded interface or one of its slaves.  This line
> suggests that you might have been:
> [<ffffffff8104fff9>] ? __wake_up_common+0x59/0x90
>  [<ffffffff81407f7a>] __pskb_pull_tail+0x2aa/0x360
>  [<ffffffffa0244530>] bond_arp_rcv+0x2c0/0x2e0 [bonding]
>  [<ffffffff814a0857>] ? packet_rcv+0x377/0x440     <==============HERE
>  [<ffffffff8140f21b>] netif_receive_skb+0x2db/0x670
>  [<ffffffff8140f788>] napi_skb_finish+0x58/0x70

Customer has answered back with the following answers/questions:

1.) We are not running any tcpdump. The system panics during boot up even before Linux gives us console access.

2.) None of our applications uses AF_PACKET. Is it possible that one of the RedHat applications does that?

3.) It might be relevant but we had seen when working with Broadcom NICs on RHEL 5.4 that the interfaces are put into promiscuous mode and are left there even though there is no tcpdump being run. This was seen occasionally and we were not sure what might have been causing that. ifconfig output showed that the interface was in promiscuous mode.

Comment 32 Neil Horman 2011-01-14 20:10:46 UTC

OK, I think you should have them test the build anyway, here's why:

For the problem, as I describe it in the bz, we need to have something listening for ETH_P_ALL packets.  That can be an AF_PACKET socket, or an AF_RAW socket.

I don't see anything that might hold such a socket open in the sosreport.  But just because I don't see it doesn't mean it's not there.

But whether it's there or not is somewhat moot, because if it were there, then this problem would be observed and my patch fixes it. If the problem stops, we can then go looking for the cause, not that it matters that much, because you should be able to run tcpdump (or some other app that uses AF_PACKET/RAW) without crashing your system.

Comment 33 Neal Kim 2011-01-14 21:38:32 UTC

I have made the test kernel available to the customer. I will keep you updated once I hear back from them.

Thanks again!

Comment 35 Neil Horman 2011-01-18 11:49:57 UTC

That system is going to have to be set up before we can work with it, as there is no link on any of the system's Ethernet interfaces other than eth2.  We can certainly do that, but given that the customer has a test kernel in hand at the moment, I'm more curious to know what the result of that kernel is in their environment.

I've requested that dhafeman connect a second link on that box so we can configure bonding properly.  Please update us with the result of the test kernel asap.

Comment 36 Neal Kim 2011-01-18 18:11:41 UTC

Customer is still working on testing out the new kernel. I will let you know the results once I hear back from them.

Comment 37 Neal Kim 2011-01-18 19:53:58 UTC

Customer is hitting a roadblock with a missing kernel-firmware package.

[root@qa-fusion-ch07-bl11 ~]# rpm -ivh /var/tmp/kernel 2.6.32-97.el6.test.x86_64.rpm 
error: Failed dependencies:
	kernel-firmware >= 2.6.32-97.el6.test is needed by kernel-2.6.32-97.el6.test.x86_64

From what I recall this was not part of the build in brew. Am I missing something?

Comment 38 Jarod Wilson 2011-01-18 20:14:25 UTC

The kernel-firmware package is built in the noarch build pass. But in most cases, it should be just fine for testing purposes to rpm -ivh --nodeps that kernel and use an older kernel-firmware package, as the actual firmware blobs tend to change very little.

Comment 39 Andy Gospodarek 2011-01-18 20:24:04 UTC

(In reply to comment #37)
> Customer is hitting a roadblock with a missing kernel-firmware package.
> 
> [root@qa-fusion-ch07-bl11 ~]# rpm -ivh /var/tmp/kernel
> 2.6.32-97.el6.test.x86_64.rpm 
> error: Failed dependencies:
>  kernel-firmware >= 2.6.32-97.el6.test is needed by
> kernel-2.6.32-97.el6.test.x86_64
> 
> From what I recall this was not part of the build in brew. Am I missing
> something?

Have the customer install without deps.  The kernel-firmware contents haven't changed between -97 and -71.

Comment 40 Neal Kim 2011-01-18 22:40:27 UTC

Customer has come back with their test results, and no change. They are still getting a crash on boot.

KERNEL: /usr/lib/debug/lib/modules/2.6.32-97.el6.test.x86_64/vmlinux
    DUMPFILE: /var/crash/127.0.0.1-2011-01-18-13:33:41/vmcore  [PARTIAL DUMP]
        CPUS: 16
        DATE: Tue Jan 18 13:33:24 2011
      UPTIME: 00:01:44
LOAD AVERAGE: 0.18, 0.11, 0.04
       TASKS: 341
    NODENAME: qa-fusion-ch07-bl11
     RELEASE: 2.6.32-97.el6.test.x86_64
     VERSION: #1 SMP Fri Jan 14 10:32:07 EST 2011
     MACHINE: x86_64  (2533 Mhz)
      MEMORY: 48 GB
       PANIC: "kernel BUG at net/core/skbuff.c:815!"
         PID: 0
     COMMAND: "swapper"
        TASK: ffff880c64f2aab0  (1 of 16)  [THREAD_INFO: ffff880664a42000]
         CPU: 9
       STATE: TASK_RUNNING (PANIC)

crash> where
No stack.
gdb: gdb request failed: where
crash> bt
PID: 0      TASK: ffff880c64f2aab0  CPU: 9   COMMAND: "swapper"
 #0 [ffff8800283437f0] machine_kexec at ffffffff8102edbb
 #1 [ffff880028343850] crash_kexec at ffffffff810b1078
 #2 [ffff880028343920] oops_end at ffffffff814caba0
 #3 [ffff880028343950] die at ffffffff8100f33b
 #4 [ffff880028343980] do_trap at ffffffff814ca474
 #5 [ffff8800283439e0] do_invalid_op at ffffffff8100cee5
 #6 [ffff880028343a80] invalid_op at ffffffff8100bf5b
    [exception RIP: pskb_expand_head+54]
    RIP: ffffffff81403416  RSP: ffff880028343b30  RFLAGS: 00010202
    RAX: 0000000000000002  RBX: ffff880661811080  RCX: 0000000000000020
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: ffff880661811080
    RBP: ffff880028343b80   R8: ffffffff81ba3f80   R9: ffff880661811164
    R10: ffff880c61d476c0  R11: 0000000000000400  R12: 0000000000000000
    R13: 0000000000000180  R14: ffff880c61d47000  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffff880028343b88] __pskb_pull_tail at ffffffff8140587a
 #8 [ffff880028343bd8] bond_arp_rcv at ffffffffa0237580
 #9 [ffff880028343c48] __netif_receive_skb at ffffffff8140d0fb
#10 [ffff880028343cb8] netif_receive_skb at ffffffff8140d5e8
#11 [ffff880028343cf8] napi_skb_finish at ffffffff8140d6e8
#12 [ffff880028343d18] napi_gro_receive at ffffffff8140db99
#13 [ffff880028343d38] ixgbe_clean_rx_irq at ffffffffa0128b5b
#14 [ffff880028343df8] ixgbe_clean_rxtx_many at ffffffffa0129566
#15 [ffff880028343e68] net_rx_action at ffffffff8140dd63
#16 [ffff880028343ec8] __do_softirq at ffffffff8106bcb7
#17 [ffff880028343f38] call_softirq at ffffffff8100c2cc
#18 [ffff880028343f50] do_softirq at ffffffff8100df35
#19 [ffff880028343f70] irq_exit at ffffffff8106bab5
#20 [ffff880028343f80] do_IRQ at ffffffff814ce8e5
--- <IRQ stack> ---
#21 [ffff880664a43dc8] ret_from_intr at ffffffff8100bad3
    [exception RIP: intel_idle+218]
    RIP: ffffffff812ad3fa  RSP: ffff880664a43e78  RFLAGS: 00000206
    RAX: 0000000000000000  RBX: ffff880664a43ed8  RCX: 0000000000000000
    RDX: 00000000000174f9  RSI: 0000000000000000  RDI: 0000000005b0efca
    RBP: ffffffff8100bace   R8: 0000000000000000   R9: 00000000000000c8
    R10: 0000001856a6fea9  R11: 00000000fffd037f  R12: ffffffff814cc775
    R13: ffff880664a43e18  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: ffffffffffffff52  CS: 0010  SS: 0018
#22 [ffff880664a43ee0] cpuidle_idle_call at ffffffff813dc1f7
#23 [ffff880664a43f00] cpu_idle at ffffffff81009e96

Comment 41 Neil Horman 2011-01-19 12:10:46 UTC

That makes absolutely no sense.  This kernel adds a call to skb_share_check immediately prior to the call to pskb_may_pull.  Both of those functions use skb_shared, which checks skb->users for equality to 1.  So the implication here is that skb->users changed value between the two calls to skb_shared, implying that the skb is being manipulated by 2 cpus in parallel, or that skb_share_check isn't doing what it's supposed to.  The trace above shows that you got a kdump out of this; could you upload the vmcore somewhere and point me to it please?

Comment 43 Neil Horman 2011-01-19 13:54:00 UTC

ok, I have good news and bad news:

The bad news is that I messed this up, it was completely my fault.  I have my build tree here, and see the patch in it, but somehow during the build process the patch got dropped out.  I'm investigating how that happened, but right now I want to focus first on getting a correct build out.  Regardless, this was a personal mistake and I apologize.  I'm going to restart the build, and double check that the patch gets included to ensure this doesn't happen again.

The good news is that this explains why the above test kernel failed in exactly the same way.  That which was supposed to prevent the problem wasn't in place.

I'll have another build in the works with evidence of the patch's inclusion shortly.

Comment 44 Neil Horman 2011-01-19 21:28:05 UTC

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3048245

New build, I've made certain that the patch is in place this time.

Comment 45 Neal Kim 2011-01-19 21:59:23 UTC

I have delivered the new kernel to the customer for testing. I will let you know the results asap.

I also have an update regarding the reproducer. I am currently working on an older generation IBM HS21 blade, but with the same Intel Corporation 82599EB 10-Gigabit KX4 dual-port NIC. I have configured bonding and arp monitoring mode with success on the same kernel version the customer is running, 2.6.32-71.el6.x86_64. My reproducer system comes up normally without a kernel panic. I will keep you updated as I continue testing.

I may try to see if I can get a hold of one of the newer HS22 blades.

Comment 46 Mark Wu 2011-01-20 01:15:49 UTC

Neal,
To reproduce the problem, you need to run an application which uses an AF_PACKET socket, like arping, with bonding and arp monitoring configured.
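
For reference, the kind of listener being discussed can be as small as the following. This is a sketch, not a reproducer taken from the customer or from arping: the function names are made up, SOCK_RAW with ETH_P_ALL is assumed as the tcpdump-style case, and the socket is left unbound, so it would see frames from every interface rather than only the bond. Opening it normally requires root (CAP_NET_RAW), so the call is expected to fail for unprivileged users.

```c
#include <arpa/inet.h>        /* htons */
#include <assert.h>           /* assert, for callers checking the helpers */
#include <linux/if_ether.h>   /* ETH_P_ALL */
#include <sys/socket.h>       /* socket, AF_PACKET, SOCK_RAW */
#include <unistd.h>           /* close */

/* Protocol argument for a receive-everything packet socket. */
unsigned short tap_protocol(void)
{
    return htons(ETH_P_ALL);  /* must be passed in network byte order */
}

/*
 * Tries to open (and immediately close) an AF_PACKET socket listening
 * for ETH_P_ALL. Returns 1 on success, 0 if the socket could not be
 * opened (for example, when the caller lacks CAP_NET_RAW).
 */
int can_open_tap_socket(void)
{
    int fd = socket(AF_PACKET, SOCK_RAW, tap_protocol());

    if (fd < 0)
        return 0;
    close(fd);
    return 1;
}
```

While such a socket is held open, every received frame is handed to its ptype_all hook before the protocol-specific handlers run, which is the condition the theory in comment 30 depends on.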

Comment 48 Neal Kim 2011-01-20 09:43:28 UTC

The customer has come back with their initial test results, and I am happy to report that the patched kernel fixes this issue. My customer is still planning to do some load testing, but all seems well so far. Thanks to everyone for your due diligence and hard work on fixing this bug. I believe our next step is to get a supported Hotfix package out to the customer?

I will keep everyone posted if any new developments arise.

Thanks again!

Comment 49 Neil Horman 2011-01-20 11:49:03 UTC

Mark, Yes, that looks like the exact same problem, and thank you, your comment 46 gives us the link to the AF_PACKET socket usage that we were speculating about in comment 30.  I think this also needs to go upstream.

Neal, as I understand the hotfix process currently:
https://docspace.corp.redhat.com/docs/DOC-47999
That's something you can just bless and move on with.  Andy and I will make any final adjustments to this patch and get it to the right places asap.

Comment 50 Neil Horman 2011-01-20 19:30:59 UTC

posted upstream and to rhkl

Comment 51 RHEL Program Management 2011-01-20 20:54:28 UTC

This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 55 Aristeu Rozanski 2011-02-03 16:09:34 UTC

Patch(es) available on kernel-2.6.32-112.el6

Comment 59 Martin Prpič 2011-02-23 15:12:53 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Bonding, when operating in the ARP monitoring mode, made erroneous assumptions regarding the ownership of ARP frames when it received them for processing. Specifically, it was assumed that the bonding driver code was the only execution context which had access to the ARP frame's network buffer data. As a result, an operation was attempted on the said buffer (specifically, to modify the size of the data buffer) which was forbidden by the kernel when a buffer was shared among several execution contexts. The result of such an operation on a shared buffer could lead to data corruption. Consequently, trying to prevent the corruption, the kernel panicked. This shared state in the network buffer could be forced to occur, for example, when running the tcpdump utility to monitor traffic on the bonding interface. Every buffer the bond interface received would be shared between the driver and the tcpdump process, thus resulting in the aforementioned kernel panic. With this update, for the particular affected path in the bonding driver, each inbound frame is checked to determine whether it is in the shared state. In case a buffer is shared, a private copy is made for exclusive use by the bonding driver, thus preventing the kernel panic.

Comment 60 errata-xmlrpc 2011-05-19 12:49:43 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html