Description of problem:
The system panics in some cases when the bonding driver is used.

When the arp_validate option is specified in bonding ARP monitor mode, the bonding driver registers a new receive function, bond_arp_rcv, in ptype_base[] with protocol = 0x0806.

Consider the situation where one more receive function is registered, for example packet_rcv, which is registered by the arping command. In this case three functions, packet_rcv, bond_arp_rcv and arp_rcv (the original one), are invoked in order when an ARP packet is received. During this delivery the "users" member of the skb struct is incremented for every call except the last one, which here is the call to arp_rcv. See the source of netif_receive_skb and deliver_skb for details.

In packet_rcv, the received skb is cloned and the cloned bit on the original skb is set.

Next is the processing in bond_arp_rcv. With some NIC drivers such as ixgbe, bond_arp_rcv tries to collect the data of the ARP packet into the header area (pskb_may_pull), because ixgbe builds only the MAC addresses in the header area and puts all the other data outside it, referenced from the skb_shared_info area. The actual collection is done in __pskb_pull_tail. There, if the target skb is cloned, it tries to make a new header (pskb_expand_head), as shown below:

    if (eat > 0 || skb_cloned(skb)) {
            if (pskb_expand_head(skb, 0, eat > 0 ? eat + 128 : 0,
                                 GFP_ATOMIC))

But in pskb_expand_head, if the skb is shared, which means the users member is > 1, the system panics:

    int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
                         gfp_t gfp_mask)
    {
    ...
            if (skb_shared(skb))
                    BUG();

Version-Release number of selected component (if applicable):
RHEL 5.5

How reproducible:
- prepare a NIC driver whose logic meets the conditions described above. We know the ixgbe driver meets them.
- set up the bonding driver
- specify the arp_validate option
- invoke the arping command

Steps to Reproduce:
1.
2.
3.

Actual results:
System panic

Expected results:

Additional info:
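The reference counting described above can be sketched in user space. This is only a toy model, not kernel code: every toy_* name is hypothetical, the struct keeps just the two fields the report discusses, and it assumes the ARP payload always needs a pull (the ixgbe case) with enough tailroom that only the cloned flag forces pskb_expand_head.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct sk_buff (hypothetical, two fields only). */
struct toy_skb {
    int users;    /* reference count, like skb->users */
    bool cloned;  /* set once a clone shares the data */
};

typedef void (*toy_handler)(struct toy_skb *skb, bool *would_bug);

/* Same test as skb_shared(): someone else also holds a reference. */
bool toy_skb_shared(const struct toy_skb *skb)
{
    return skb->users != 1;
}

/* Like packet_rcv: a shared skb is cloned, then our reference is dropped. */
void toy_packet_rcv(struct toy_skb *skb, bool *would_bug)
{
    (void)would_bug;
    if (toy_skb_shared(skb))
        skb->cloned = true;
    skb->users--;
}

/* Like bond_arp_rcv with arp_validate: pulling a cloned skb expands the
 * head, and expanding the head of a shared skb is where BUG() fires. */
void toy_bond_arp_rcv(struct toy_skb *skb, bool *would_bug)
{
    if (skb->cloned && toy_skb_shared(skb))
        *would_bug = true;
    skb->users--;
}

/* Like arp_rcv: the last handler simply consumes the final reference. */
void toy_arp_rcv(struct toy_skb *skb, bool *would_bug)
{
    (void)would_bug;
    skb->users--;
}

/* Delivery loop: a reference is taken for every handler but the last. */
bool toy_deliver_all(struct toy_skb *skb, toy_handler *handlers, size_t n)
{
    bool would_bug = false;

    assert(skb != NULL && handlers != NULL);
    for (size_t i = 0; i < n; i++) {
        if (i + 1 < n)
            skb->users++;
        handlers[i](skb, &would_bug);
    }
    return would_bug;
}
```

In this model, delivering to packet_rcv, bond_arp_rcv and arp_rcv in that order reaches the BUG condition, while delivering to bond_arp_rcv and arp_rcv alone does not, which matches the need for arping in the reproduction steps.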
Mark, I have not looked into this before today as I have been out of the office. I will take a look at this explanation and see if I agree with the assessment. There are some systems in beaker with ixgbe cards (and you can search by driver), so you may want to check some out and try to reproduce this if you like.
Can someone please post the full back-trace from the BUG halt?
Created attachment 474009 [details] lspci
Created attachment 474010 [details] dmidecode
Created attachment 475197 [details] backported upstream fix
Comment on attachment 475197 [details] backported upstream fix

backport of upstream fix:

commit b30532515f0a62bfe17207ab00883dd262497006
Author: Neil Horman <nhorman>
Date: Thu Jan 20 09:02:31 2011 +0000

    bonding: Ensure that we unshare skbs prior to calling pskb_may_pull
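The commit subject says the skb is unshared before pskb_may_pull is called. The usual kernel helper for unsharing is skb_share_check(); that the backport uses exactly this helper is an assumption here, not something taken from the attachment. A minimal user-space sketch of the idea, with hypothetical toy_* names and a two-field stand-in for struct sk_buff:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in for struct sk_buff (hypothetical, two fields only). */
struct toy_skb {
    int users;    /* reference count, like skb->users */
    bool cloned;  /* set once a clone shares the data */
};

bool toy_skb_shared(const struct toy_skb *skb)
{
    return skb->users != 1;
}

/* Modeled on skb_share_check(): if the skb is shared, hand back a private
 * clone and drop our reference on the original. Returns NULL if the clone
 * cannot be allocated; the reference is dropped in that case too. */
struct toy_skb *toy_skb_share_check(struct toy_skb *skb)
{
    struct toy_skb *nskb;

    assert(skb != NULL);
    if (!toy_skb_shared(skb))
        return skb;

    nskb = malloc(sizeof(*nskb));
    if (nskb) {
        nskb->users = 1;      /* the clone has a single owner */
        nskb->cloned = true;  /* both now share the same data */
        skb->cloned = true;
    }
    skb->users--;
    return nskb;
}

/* True when a pull on this skb would reach BUG(): it is cloned, so the
 * head must be expanded, and it is shared, which is not allowed there. */
bool toy_pull_would_bug(const struct toy_skb *skb)
{
    return skb->cloned && toy_skb_shared(skb);
}
```

A receive handler in this model would call toy_skb_share_check() first and pull only from the returned skb. Because that skb has users == 1, the shared check in the head expansion no longer trips, even though the data is still marked cloned.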
This patch can probably make its way into RHEL5.7 if we get testing feedback.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Updated test kernels that contain a patch for this issue are available here: http://people.redhat.com/agospoda/#rhel5
Patch(es) available in kernel-2.6.18-261.el5 You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5 Detailed testing feedback is always welcomed.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-1065.html