Bug 607114 - System panic in pskb_expand_head When arp_validate option is specified in bonding ARP monitor mode
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.5
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Andy Gospodarek
QA Contact: Liang Zheng
URL:
Whiteboard:
Depends On:
Blocks: 665110
 
Reported: 2010-06-23 09:37 UTC by Mark Wu
Modified: 2018-11-14 19:04 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 665110 (view as bug list)
Environment:
Last Closed: 2011-07-21 10:23:28 UTC
Target Upstream Version:
Embargoed:


Attachments
lspci (69.93 KB, application/octet-stream), 2011-01-18 09:23 UTC, Mark Wu
dmidecode (41.77 KB, application/octet-stream), 2011-01-18 09:24 UTC, Mark Wu
backported upstream fix (1.49 KB, patch), 2011-01-25 16:04 UTC, Andy Gospodarek


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Knowledge Base (Solution) | 33923 | 0 | None | None | None | Never
Red Hat Product Errata | RHSA-2011:1065 | 0 | normal | SHIPPED_LIVE | Important: Red Hat Enterprise Linux 5.7 kernel security and bug fix update | 2011-07-21 09:21:37 UTC

Description Mark Wu 2010-06-23 09:37:00 UTC
Description of problem:
The system panics in some cases when the bonding driver is used.
When the arp_validate option is specified in bonding ARP monitor mode, the bonding driver registers an additional receive function, bond_arp_rcv, in ptype_base[] with
protocol = 0x0806 (ARP).

Now consider the situation where one more receive function is registered,
for example packet_rcv, which is registered by the arping command.

In this case three functions, packet_rcv, bond_arp_rcv, and arp_rcv (the original one),
are invoked in that order when an ARP packet is received.

During this delivery, the "users" member of the skb struct is incremented before
every call except the last one, which here is the call to arp_rcv.

See the source of netif_receive_skb and deliver_skb for details.
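
The reference counting described above can be modeled in user space. The following is a minimal sketch, not kernel code: `struct skb_model`, `handler`, and `deliver_all` are hypothetical names, and the loop only imitates the pattern this report attributes to netif_receive_skb and deliver_skb (an extra reference is taken for every handler except the last, and each handler drops one reference when it is done).

```c
#include <assert.h>

/* Hypothetical stand-in for struct sk_buff: only the reference count. */
struct skb_model {
    int users;
};

#define MAX_HANDLERS 8

/* Value of users observed by each handler on entry, in call order. */
static int users_seen[MAX_HANDLERS];

/* Each handler records what it sees, then drops its reference
 * (the model's equivalent of freeing the skb when done). */
static void handler(struct skb_model *skb, int idx)
{
    assert(skb->users > 0);
    users_seen[idx] = skb->users;
    skb->users--;
}

/* Model of the delivery loop: an extra reference is taken before
 * every call except the last, which consumes the caller's own. */
static void deliver_all(struct skb_model *skb, int nhandlers)
{
    assert(nhandlers > 0 && nhandlers <= MAX_HANDLERS);
    for (int i = 0; i < nhandlers; i++) {
        if (i < nhandlers - 1)
            skb->users++;   /* extra reference for a non-final handler */
        handler(skb, i);
    }
}
```

Starting from users = 1 and delivering to three handlers (standing in for packet_rcv, bond_arp_rcv, and arp_rcv), the model records users values of 2, 2, and 1. The middle handler, the position bond_arp_rcv occupies in this report, therefore sees a shared buffer.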

In packet_rcv, the received skb is cloned and
the cloned bit is set on the original skb.

Next comes the processing in bond_arp_rcv.

With some NIC drivers, such as ixgbe, bond_arp_rcv has to pull the ARP
packet data into the header area (pskb_may_pull), because ixgbe places only the MAC addresses in the header area;
all the other data is kept outside it and referenced from the
skb_shared_info area.

The actual pull is done in __pskb_pull_tail.
There, if the target skb is cloned, it tries to allocate a new header,
as shown below (pskb_expand_head):


        /* A cloned skb (or one needing more room) gets a new header. */
        if (eat > 0 || skb_cloned(skb)) {
                if (pskb_expand_head(skb, 0, eat > 0 ? eat + 128 : 0,
                                     GFP_ATOMIC))
                        return NULL;
        }
 
  
But in pskb_expand_head, if the skb is shared, which means its users member is greater than 1,
the system panics:

int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
		     gfp_t gfp_mask)
{
        ...
	if (skb_shared(skb))
		BUG();
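
Putting the two conditions together: the pull path reaches pskb_expand_head when the skb is cloned (or needs more room), and pskb_expand_head hits BUG() when users > 1. The following is a small user-space model of that decision, offered as a sketch only. `struct skb_state`, `needs_expand`, and `pull_would_bug` are hypothetical names, and the checks simply restate the conditions given in this report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of the skb state relevant to this bug. */
struct skb_state {
    int  users;   /* reference count; shared when greater than 1 */
    bool cloned;  /* cloned bit, set after packet_rcv clones the skb */
    int  eat;     /* extra room needed by the pull; <= 0 if none */
};

/* Mirrors the condition guarding pskb_expand_head in the pull path. */
static bool needs_expand(const struct skb_state *s)
{
    assert(s != NULL);
    return s->eat > 0 || s->cloned;
}

/* True when the pull would reach pskb_expand_head with a shared skb,
 * which is the BUG() case described in this report. */
static bool pull_would_bug(const struct skb_state *s)
{
    return needs_expand(s) && s->users > 1;
}
```

In this model, the state bond_arp_rcv sees after packet_rcv has run (users = 2, cloned set) makes `pull_would_bug` return true, while a shared but uncloned skb with eat <= 0, or a cloned skb with a single user, does not.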

Version-Release number of selected component (if applicable):
RHEL 5.5

How reproducible:
- Use a NIC driver that meets the condition described above (only the MAC
        header is placed in the skb header area). The ixgbe driver is known to meet it.
- Set up the bonding driver in ARP monitor mode.
- Specify the arp_validate option.
- Run the arping command.

Steps to Reproduce:
1.
2.
3.
  
Actual results:
System panic

Expected results:


Additional info:

Comment 3 Andy Gospodarek 2010-07-12 18:44:30 UTC
Mark, I had not looked into this before today as I have been out of the office.

I will take a look at this explanation and see if I agree with the assessment.

There are some systems in beaker with ixgbe cards (and you can search by driver), so you may want to check one out and try to reproduce this if you like.

Comment 5 Andy Gospodarek 2010-12-21 21:34:27 UTC
Can someone please post the full back-trace from the BUG halt?

Comment 12 Mark Wu 2011-01-18 09:23:58 UTC
Created attachment 474009 [details]
lspci

Comment 13 Mark Wu 2011-01-18 09:24:36 UTC
Created attachment 474010 [details]
dmidecode

Comment 16 Andy Gospodarek 2011-01-25 16:04:57 UTC
Created attachment 475197 [details]
backported upstream fix

Comment 17 Andy Gospodarek 2011-01-25 16:06:07 UTC
Comment on attachment 475197 [details]
backported upstream fix

backport of upstream fix:

commit b30532515f0a62bfe17207ab00883dd262497006
Author: Neil Horman <nhorman>
Date:   Thu Jan 20 09:02:31 2011 +0000

    bonding: Ensure that we unshare skbs prior to calling pskb_may_pull
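
The commit title says the skb is unshared before pskb_may_pull is called. The attached patch is not quoted in this report, so the following is only a user-space sketch of what unsharing means, under the assumption that it works like the kernel's skb_share_check helper: when other holders exist, the handler drops its reference on the original and continues with a private clone. `struct skb_model` and `model_share_check` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct sk_buff. */
struct skb_model {
    int  users;
    bool cloned;
};

/* Model of unsharing. If we are the only holder, keep the skb.
 * Otherwise hand back a private clone (users == 1) and drop our
 * reference on the original. Returns NULL if allocation fails. */
static struct skb_model *model_share_check(struct skb_model *skb)
{
    assert(skb != NULL && skb->users > 0);
    if (skb->users == 1)
        return skb;

    struct skb_model *nskb = malloc(sizeof(*nskb));
    if (nskb != NULL) {
        nskb->users = 1;
        nskb->cloned = true;   /* the clone shares data with the original */
        skb->cloned = true;
    }
    skb->users--;              /* our reference on the original is gone */
    return nskb;
}
```

The buffer returned by `model_share_check` always has users == 1, so a later pull on it may still need a new header because it is cloned, but it can no longer meet the users > 1 condition that triggers BUG(). The caller owns the returned clone and is responsible for freeing it.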

Comment 18 Andy Gospodarek 2011-01-28 14:30:15 UTC
This patch can probably make its way into RHEL5.7 if we get testing feedback.

Comment 19 RHEL Program Management 2011-02-01 17:01:59 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 20 Andy Gospodarek 2011-02-04 21:23:49 UTC
Updated test kernels that contain a patch for this issue are available here:

http://people.redhat.com/agospoda/#rhel5

Comment 31 Jarod Wilson 2011-05-13 22:18:37 UTC
Patch(es) available in kernel-2.6.18-261.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.

Comment 34 errata-xmlrpc 2011-07-21 10:23:28 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html

