Bug 607114

Summary: System panic in pskb_expand_head when the arp_validate option is specified in bonding ARP monitor mode
Product: Red Hat Enterprise Linux 5 Reporter: Mark Wu <dwu>
Component: kernel Assignee: Andy Gospodarek <agospoda>
Status: CLOSED ERRATA QA Contact: Liang Zheng <lzheng>
Severity: medium Docs Contact:
Priority: low    
Version: 5.5 CC: dtian, jeder, jwilson, kzhang, lzheng, moshiro, myamazak, nhorman, nkim, peterm, qcai, skito, tao
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 665110 Environment:
Last Closed: 2011-07-21 10:23:28 UTC Type: ---
Bug Depends On:    
Bug Blocks: 665110    
Attachments:
Description                 Flags
lspci                       none
dmidecode                   none
backported upstream fix     none

Description Mark Wu 2010-06-23 09:37:00 UTC
Description of problem:
The system panics in some cases when the bonding driver is used.
When the arp_validate option is specified in bonding ARP monitor mode, the bonding driver registers a new receive function, bond_arp_rcv, in ptype_base[] with
protocol = 0x0806 (ETH_P_ARP).

Consider the situation where one more receive function is registered.
For example, packet_rcv is registered by the arping command.

In this case, three functions (packet_rcv, bond_arp_rcv, and the original arp_rcv)
are invoked in that order when an ARP packet is received.

During this delivery, the "users" member of the skb struct is incremented before each call
except the last one, which in this case is the call to arp_rcv.

See the source of netif_receive_skb and deliver_skb for details.
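
The reference counting described above can be sketched as a small userspace model. This is only an illustration, not kernel code: toy_skb and toy_netif_receive are hypothetical names, and the model assumes each handler releases exactly one reference before it returns.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct sk_buff; only the reference count is modelled. */
struct toy_skb {
	int users;	/* models skb->users */
};

/*
 * Hypothetical model of the delivery loop: every handler except the last
 * is called through a deliver_skb()-like step that takes an extra
 * reference first. seen[i] records the users value that handler i observes.
 */
static void toy_netif_receive(struct toy_skb *skb, size_t nhandlers, int *seen)
{
	assert(skb != NULL && seen != NULL);

	for (size_t i = 0; i < nhandlers; i++) {
		int last = (i + 1 == nhandlers);

		if (!last)
			skb->users++;	/* extra reference, as in deliver_skb() */
		seen[i] = skb->users;	/* what the handler sees */
		skb->users--;		/* handler releases its reference */
	}
}
```

Starting from users = 1 with three handlers standing in for packet_rcv, bond_arp_rcv and arp_rcv, the model records 2, 2, 1 in seen[], so the middle handler (bond_arp_rcv) runs on a shared skb.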

In packet_rcv, the received skb is cloned and
the cloned bit on the original skb is set.

Next is the processing in bond_arp_rcv.

In bond_arp_rcv, with a NIC driver such as ixgbe, pskb_may_pull tries to
collect the ARP packet data into the header area, because ixgbe builds only the MAC addresses in the header area
and puts all the other data outside it, referenced from the
skb_shared_info area.

The actual collection is done in __pskb_pull_tail.
There, if the target skb is cloned, it tries to allocate a new header
through pskb_expand_head, as shown below:


        if (eat > 0 || skb_cloned(skb)) {
                if (pskb_expand_head(skb, 0, eat > 0 ? eat + 128 : 0,
                                       GFP_ATOMIC))
 
  
However, in pskb_expand_head, if the skb is shared (that is, users > 1),
the system panics:
int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
		     gfp_t gfp_mask)
{
        ...
	if (skb_shared(skb))
		BUG();
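
Putting the two checks together, the failing combination can be modelled in userspace as follows. This is a sketch under stated assumptions: toy_skb, toy_outcome and toy_pull_tail are hypothetical names, the cloned flag stands in for skb_cloned(skb) returning true, and users != 1 stands in for skb_shared(skb).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct sk_buff with only the fields that matter here. */
struct toy_skb {
	int users;	/* models skb->users */
	bool cloned;	/* models skb_cloned(skb) returning true */
};

enum toy_outcome {
	TOY_NO_EXPAND,	/* header left alone, no expand call */
	TOY_EXPANDED,	/* expand call would reallocate the header */
	TOY_BUG		/* expand call would hit BUG() */
};

/* Hypothetical model of the check chain quoted in this report. */
static enum toy_outcome toy_pull_tail(const struct toy_skb *skb, int eat)
{
	assert(skb != NULL);

	if (eat > 0 || skb->cloned) {
		if (skb->users != 1)	/* shared skb */
			return TOY_BUG;
		return TOY_EXPANDED;
	}
	return TOY_NO_EXPAND;
}
```

In this model, a cloned skb with users = 2 (the state seen by bond_arp_rcv after packet_rcv has run) yields TOY_BUG, while the same skb with users = 1 (the state seen by arp_rcv) yields TOY_EXPANDED.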

Version-Release number of selected component (if applicable):
RHEL 5.5

How reproducible:
- Use a NIC driver whose receive logic meets the conditions described above.
        The ixgbe driver is known to meet them.
- Configure the bonding driver.
- Specify the arp_validate option.
- Run the arping command.

Steps to Reproduce:
1.
2.
3.
  
Actual results:
System panic

Expected results:


Additional info:

Comment 3 Andy Gospodarek 2010-07-12 18:44:30 UTC
Mark, I have not looked into this before today as I have been out of the office.

I will take a look at this explanation and see if I agree with the assessment.

There are some systems in beaker with ixgbe cards (and you can search by driver), so you may want to check some out and try to reproduce this if you like.

Comment 5 Andy Gospodarek 2010-12-21 21:34:27 UTC
Can someone please post the full back-trace from the BUG halt?

Comment 12 Mark Wu 2011-01-18 09:23:58 UTC
Created attachment 474009
lspci

Comment 13 Mark Wu 2011-01-18 09:24:36 UTC
Created attachment 474010
dmidecode

Comment 16 Andy Gospodarek 2011-01-25 16:04:57 UTC
Created attachment 475197
backported upstream fix

Comment 17 Andy Gospodarek 2011-01-25 16:06:07 UTC
Comment on attachment 475197
backported upstream fix

backport of upstream fix:

commit b30532515f0a62bfe17207ab00883dd262497006
Author: Neil Horman <nhorman>
Date:   Thu Jan 20 09:02:31 2011 +0000

    bonding: Ensure that we unshare skbs prior to calling pskb_may_pull
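
The idea named in the commit title, unsharing the skb before pulling, can be sketched in userspace as below. This is not the attached patch: it assumes the unshare step behaves like the kernel's skb_share_check() (clone when shared, drop the reference on the original), and toy_skb, toy_share_check and toy_pull_is_safe are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in for struct sk_buff with only the fields that matter here. */
struct toy_skb {
	int users;	/* models skb->users */
	bool cloned;	/* models skb_cloned(skb) returning true */
};

/*
 * Hypothetical model of an skb_share_check()-style helper: if the buffer
 * is shared, release our reference on it and return a private clone.
 * Returns NULL if the clone cannot be allocated.
 */
static struct toy_skb *toy_share_check(struct toy_skb *skb)
{
	struct toy_skb *priv;

	assert(skb != NULL && skb->users >= 1);
	if (skb->users == 1)
		return skb;		/* not shared: use it as is */

	priv = malloc(sizeof(*priv));
	skb->users--;			/* drop our reference either way */
	skb->cloned = true;		/* original now shares its data */
	if (priv == NULL)
		return NULL;
	priv->users = 1;		/* the clone has a single owner */
	priv->cloned = true;
	return priv;
}

/* True when a pull on this skb would not reach the BUG() path. */
static bool toy_pull_is_safe(const struct toy_skb *skb, int eat)
{
	return !((eat > 0 || skb->cloned) && skb->users != 1);
}
```

In this model a handler that first calls toy_share_check always ends up with users == 1 on the buffer it goes on to pull from, so the shared check in the expand step can no longer fire. The caller owns the returned clone and must free it.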

Comment 18 Andy Gospodarek 2011-01-28 14:30:15 UTC
This patch can probably make its way into RHEL5.7 if we get testing feedback.

Comment 19 RHEL Program Management 2011-02-01 17:01:59 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 20 Andy Gospodarek 2011-02-04 21:23:49 UTC
Updated test kernels that contain a patch for this issue are available here:

http://people.redhat.com/agospoda/#rhel5

Comment 31 Jarod Wilson 2011-05-13 22:18:37 UTC
Patch(es) available in kernel-2.6.18-261.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.

Comment 34 errata-xmlrpc 2011-07-21 10:23:28 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html