Description of problem:
Dom0 is running RHEL5.4 Xen. When trying to push TCP netperf data through the bridge device into the Dom0, the system immediately reboots due to a panic. If I disable GRO on peth0 (ixgbe), the panic goes away. If I remove the card from the bridge, treat it as a "normal" interface (eth0), and run netperf TCP_STREAM to the host from the external box with GRO enabled, I get line speed and no reboot.

Version-Release number of selected component (if applicable):
# uname -a
Linux 2.6.18-164.el5xen #1 SMP Tue Aug 18 15:59:52 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:
Every time

Steps to Reproduce:
1. Run Xen
2. Configure Xen to create a bridge for the ixgbe card
3. Try to run netperf TCP_STREAM from external box to Dom0

Actual results:
System panics if GRO is enabled

Expected results:
No panic, decent throughput

Additional info:
Thought this trace would be useful...

Modules linked in: netloop netbk blktap blkbk ipt_MASQUERADE iptable_nat ip_nat xt_state ip_conntrack nfnetlink ipt_REJEd
Pid: 0, comm: swapper Not tainted 2.6.18-164.el5xen #1
RIP: e030:[<ffffffff80416c9b>] [<ffffffff80416c9b>] skb_segment+0x10d/0x573
RSP: e02b:ffffffff8067cb40 EFLAGS: 00010212
RAX: 0000000000000000 RBX: ffff8805c06538c0 RCX: 0000000000000042
RDX: 00000000000005a8 RSI: 0000000000011023 RDI: ffff8805c06538c0
RBP: 0000000000001730 R08: 0000000000051000 R09: ffffffff883f6cea
R10: 0000000080000000 R11: ffffffff80446379 R12: ffff8805c06538c0
R13: 0000000000000020 R14: 0000000000000042 R15: ffff8805c0653cc0
FS: 00002acc86e177d0(0000) GS:ffffffff805ca000(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000
Process swapper (pid: 0, threadinfo ffffffff8063a000, task ffffffff804eeb00)
Stack: ffffffff806cd200 00ffffff8023f6bc ffff8805c8afdc60 0000000000000000
 0000000000000000 00000042000005a8 0000002200000042 0000000000000001
 0000000000000000 0000000000000042
Call Trace:
 <IRQ>
 [<ffffffff8043f7ad>] tcp_tso_segment+0x262/0x287
 [<ffffffff8022bce2>] local_bh_enable+0x9/0xa5
 [<ffffffff8044f069>] inet_gso_segment+0x10b/0x1a6
 [<ffffffff80419df2>] skb_gso_segment+0x11c/0x184
 [<ffffffff8041a831>] dev_hard_start_xmit+0x174/0x22f
 [<ffffffff80230e50>] dev_queue_xmit+0x2d3/0x394
 [<ffffffff883f6e8e>] :bridge:br_dev_queue_push_xmit+0x1a4/0x1cb
 [<ffffffff883f6f04>] :bridge:br_forward_finish+0x4f/0x51
 [<ffffffff883f6f70>] :bridge:__br_forward+0x6a/0x6d
 [<ffffffff883f79c2>] :bridge:br_handle_frame_finish+0xd4/0xf8
 [<ffffffff883f7b6b>] :bridge:br_handle_frame+0x185/0x1a2
 [<ffffffff8043f4cd>] tcp_gro_receive+0x1b5/0x233
 [<ffffffff8022101d>] netif_receive_skb+0x328/0x41a
 [<ffffffff8041a231>] dev_gro_receive+0xf0/0x209
 [<ffffffff8041a442>] napi_gro_receive+0x20/0x2f
 [<ffffffff88212dad>] :ixgbe:ixgbe_clean_rx_irq+0x400/0x78c
 [<ffffffff8022bce2>] local_bh_enable+0x9/0xa5
 [<ffffffff88216ef0>] :ixgbe:ixgbe_clean_rxonly+0x7e/0x140
 [<ffffffff8020ce03>] net_rx_action+0xb4/0x1f3
 [<ffffffff80212c99>] __do_softirq+0x8d/0x13b
 [<ffffffff80260da4>] call_softirq+0x1c/0x278
 [<ffffffff8026e0ab>] do_softirq+0x31/0x98
 [<ffffffff8026df37>] do_IRQ+0xec/0xf5
 [<ffffffff803aed9d>] evtchn_do_upcall+0x13b/0x1fb
 [<ffffffff802608d6>] do_hypervisor_callback+0x1e/0x2c
 <EOI>
 [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
 [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
 [<ffffffff8026f4d5>] raw_safe_halt+0x84/0xa8
 [<ffffffff8026ca50>] xen_idle+0x38/0x4a
 [<ffffffff8024afa1>] cpu_idle+0x97/0xba
 [<ffffffff80644b05>] start_kernel+0x21f/0x224
 [<ffffffff806441e5>] _sinittext+0x1e5/0x1eb

Code: 0f 0b 68 6c f1 4b 80 c2 ac 07 4c 89 ff be 20 02 00 00 e8 09
RIP [<ffffffff80416c9b>] skb_segment+0x10d/0x573 RSP <ffffffff8067cb40>
<0>Kernel panic - not syncing: Fatal exception
(XEN) Domain 0 crashed: rebooting machine in 5 seconds.
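For anyone who needs the workaround described in this report: GRO is normally switched off on the receiving interface with `ethtool -K <dev> gro off`. Below is a minimal Python sketch that builds that command and can optionally run it. The helper names `gro_command` and `set_gro` are hypothetical, and `peth0` is simply the device named in this report.

```python
import subprocess


def gro_command(dev, enable):
    """Return the ethtool argument list that switches GRO on or off for dev."""
    return ["ethtool", "-K", dev, "gro", "on" if enable else "off"]


def set_gro(dev, enable):
    """Run the command. Needs root and an ethtool build that knows 'gro'."""
    subprocess.run(gro_command(dev, enable), check=True)


if __name__ == "__main__":
    # Print the command instead of executing it, so the sketch is safe to run.
    print(" ".join(gro_command("peth0", False)))
```

This only avoids the panic; it does not fix the underlying bug, and it costs receive throughput on the affected interface.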
Created attachment 369884 [details]
gro: Fix illegal merging of trailer trash

gro: Fix illegal merging of trailer trash

When we've merged skb's with page frags, and subsequently receive a trailer skb (< MSS) that is not completely non-linear (this can occur on Intel NICs if the packet size falls below the threshold), GRO ends up producing an illegal GSO skb with a frag_list.

This is harmless unless the skb is then forwarded through an interface that requires software GSO, whereupon the GSO code will BUG.

This patch detects this case in GRO and avoids merging the trailer skb.

Reported-by: Mark Wagner <mwagner>
Signed-off-by: Herbert Xu <herbert.org.au>
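To make the condition in the patch description easier to follow, here is a toy Python model of the merge decision. It is only a sketch of the rule as worded above, not the kernel code: the `Skb` class, its fields, the `can_merge` function, and the MSS value of 1448 are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Skb:
    linear_len: int  # payload bytes held in the linear area (assumed field)
    frag_len: int    # payload bytes held in page frags (assumed field)

    @property
    def payload(self):
        return self.linear_len + self.frag_len


def can_merge(head_uses_frags, skb, mss):
    """Toy rule: refuse a short trailer that still has linear payload
    once the merged head has been built from page frags."""
    is_trailer = skb.payload < mss
    fully_nonlinear = skb.linear_len == 0
    if head_uses_frags and is_trailer and not fully_nonlinear:
        return False  # merging would need a frag_list on a frag-based skb
    return True


if __name__ == "__main__":
    mss = 1448  # illustrative value only
    print(can_merge(True, Skb(0, mss), mss))   # full-size, frag-only segment
    print(can_merge(True, Skb(100, 0), mss))   # short trailer with linear data
```

In this model the rejected trailer would simply be delivered as its own packet instead of being folded into the aggregate.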
in kernel-2.6.18-178.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5

Please update the appropriate value in the Verified field (cf_verified) to indicate this fix has been successfully verified. Include a comment with verification details.
FWIW I saw a very similar stack trace and subsequent panic on a RHEL5.4 x86_64 box with an Intel 10Gig NIC using the ixgbe module. I'm not, however, running a xen kernel. I'm running 2.6.18-164.6.1.el5 and can confirm that disabling GRO on the 10gig NIC prevents the panic. I do have a valid RHEL subscription.

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at net/core/skbuff.c:1964
invalid opcode: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
CPU 2
Modules linked in: nfs fscache nfs_acl autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc bonding ipv6 xfrm_nalgo crypto_api xt_tcpudp ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables x_tables dm_mirror dm_multipath scsi_dh video hwmon backlight sbs i2c_ec i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev i5000_edac sr_mod ide_cd ixgbe edac_mc bnx2 cdrom serio_raw 8021q pcspkr sg dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache usb_storage ata_piix libata shpchp megaraid_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 0, comm: swapper Not tainted 2.6.18-164.6.1.el5 #1
RIP: 0010:[<ffffffff802253a5>] [<ffffffff802253a5>] skb_segment+0x10d/0x573
RSP: 0018:ffff8101043ebb60 EFLAGS: 00010212
RAX: 0000000000000000 RBX: ffff8101269dd680 RCX: 0000000000000042
RDX: 00000000000005a8 RSI: 00000000000901a3 RDI: ffff8101269dd680
RBP: 0000000000002e1c R08: 00000000000d0100 R09: 0000000000002e30
R10: 0000000080000000 R11: ffffffff80254c39 R12: ffff8101269dd680
R13: 0000000000000020 R14: 0000000000000042 R15: ffff8101269ddd80
FS: 0000000000000000(0000) GS:ffff81010439ce40(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007fff82445ff8 CR3: 0000000124606000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffff8101043e6000, task ffff81010439d080)
Stack: 000000000000000c 00ffffff883a38a8 ffff81012b04dea8 0000000000000000
 0000000000000000 00000042000005a8 0000002200000042 0000000000000001
 0000000000000000 0000000000000042 000005a82e6a2044 ffff8101269dd680
Call Trace:
 <IRQ>
 [<ffffffff8024e066>] tcp_tso_segment+0x262/0x287
 [<ffffffff8025d92e>] inet_gso_segment+0x10b/0x1a6
 [<ffffffff8022876c>] skb_gso_segment+0x11c/0x184
 [<ffffffff8022915f>] dev_hard_start_xmit+0x174/0x22f
 [<ffffffff802481e6>] ip_finish_output+0x0/0x241
 [<ffffffff80239226>] __qdisc_run+0x136/0x1f9
 [<ffffffff8002f9bc>] dev_queue_xmit+0x150/0x271
 [<ffffffff80032089>] ip_output+0x29a/0x2dd
 [<ffffffff802462c5>] ip_forward+0x24f/0x2bd
 [<ffffffff800359a3>] ip_rcv+0x539/0x57c
 [<ffffffff800207fb>] netif_receive_skb+0x3c9/0x3f5
 [<ffffffff80228b86>] dev_gro_receive+0xf0/0x209
 [<ffffffff80228d97>] napi_gro_receive+0x20/0x2f
 [<ffffffff88232e0d>] :ixgbe:ixgbe_clean_rx_irq+0x3f5/0x781
 [<ffffffff88236faa>] :ixgbe:ixgbe_clean_rxonly+0x7e/0x12e
 [<ffffffff8001ca65>] __mod_timer+0xb0/0xbe
 [<ffffffff8000c845>] net_rx_action+0xac/0x1e0
 [<ffffffff88233258>] :ixgbe:ixgbe_msix_clean_rx+0xbf/0xc9
 [<ffffffff8001235a>] __do_softirq+0x89/0x133
 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006cb20>] do_softirq+0x2c/0x85
 [<ffffffff8006c9a8>] do_IRQ+0xec/0xf5
 [<ffffffff800571ae>] mwait_idle+0x0/0x4a
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 <EOI>
 [<ffffffff800571e4>] mwait_idle+0x36/0x4a
 [<ffffffff8004938f>] cpu_idle+0x95/0xb8
 [<ffffffff80076f7f>] start_secondary+0x498/0x4a7

Code: 0f 0b 68 1f 19 2d 80 c2 ac 07 4c 89 ff be 20 02 00 00 e8 24
RIP [<ffffffff802253a5>] skb_segment+0x10d/0x573 RSP <ffff8101043ebb60>
<0>Kernel panic - not syncing: Fatal exception
(In reply to comment #4)
> FWIW I saw a very similar stack trace and subsequent panic on a RHEL5.4 x86_64
> box with an Intel 10Gig NIC using the ixgbe module. I'm not, however, running a
> xen kernel. I'm running 2.6.18-164.6.1.el5 and can confirm that disabling gro on
> the 10gig nic prevents the panic. I do have a valid RHEL subscription.
>

This is interesting. This was typically only a problem when the ixgbe interface was in a bridge. Can you tell me a bit more about the network configuration? Both a description of your setup and sanitized config files are always appreciated.
The server is acting as a router between two networks. The ixgbe interface (eth2) and a secondary e1000 interface (eth1) are bonded (bond0) - mode 1 active/backup with the ixgbe being the preferred interface. These two physical interfaces are connected to Network A. There is a third e1000 interface (eth0) that connects to Network B.

I can reliably reproduce the panic when running netperf from a host on Network A to a receiving host on Network B, but cannot cause a panic if netperf is run in the opposite direction. If I disable GRO on the 10 gig NIC the panic does not happen. In addition, if I run netperf from a host on Network A to the router's bond0 interface with eth2 active, the panic does not occur regardless of whether GRO is enabled or disabled.

Relevant lines from modprobe.conf:
alias eth0 bnx2
alias eth1 bnx2
alias eth2 ixgbe
alias bond0 bonding
options bond0 mode=1 primary=eth2 miimon=10

/etc/sysconfig/network-scripts/ifcfg-eth0:
# Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet
DEVICE=eth0
BOOTPROTO=static
DHCPCLASS=
HWADDR=00:19:B9:CD:6B:F4
ONBOOT=yes
IPADDR=172.20.21.3
NETMASK=255.255.0.0

/etc/sysconfig/network-scripts/ifcfg-eth1:
DEVICE=eth1
HWADDR=00:19:B9:CD:6B:F6
ONBOOT=yes
MASTER=bond0
SLAVE=yes
USERCTL=no

/etc/sysconfig/network-scripts/ifcfg-eth2:
DEVICE=eth2
HWADDR=00:1B:21:41:87:A0
ONBOOT=yes
MASTER=bond0
SLAVE=yes
USERCTL=no

/etc/sysconfig/network-scripts/ifcfg-bond0:
# Bonded Interfaces
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes
NETMASK=255.255.255.0
IPADDR=192.168.3.3
USERCTL=no

Please let me know if you need any more information.
(In reply to comment #6)
> The server is acting as a router between two networks. The ixgbe interface
> (eth2) and a secondary e1000 interface (eth1) are bonded (bond0) - mode 1
> active/backup with the ixgbe being the preferred interface. These two physical
> interfaces are connected to Network A. There is a third e1000 interface (eth0)
> that connects to Network B.
>
> I can reliably reproduce the panic when running the netperf from a host on the
> Network A to a receiving host on the Network B, but cannot cause a panic if the
> netperf is run in the opposite direction. If I disable GRO on the 10 gig NIC
> the panic does not happen. In addition, if I run a netperf from a host on
> Network A to the router's bond0 interface with eth2 active, the panic does not
> occur regardless of whether GRO is enabled or disabled.
>

Aaron, it sounds like the fix for this bug will resolve your issue as well. Though this is most commonly seen where bridging is used, the issue can occur any time the host is forwarding traffic. I'm quite sure this is specifically related to a setup where GRO is enabled on the receiving interface and TSO is enabled on the interface transmitting out of the host. I suspect that if you disable TSO on eth0, you will not see the problem when running netperf from a host on Network A to a host on Network B.
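One way to check a forwarding host for the combination suspected above (GRO on for the receiving interface, TSO on for the transmitting one) is to parse the output of `ethtool -k <dev>` for both devices. The Python sketch below does that. The function names are hypothetical, the sample strings are hand-written stand-ins rather than captured output, and the exact feature labels vary between ethtool versions, so the parser normalizes spaces to hyphens.

```python
def parse_offloads(text):
    """Map normalized feature names to True/False from `ethtool -k` style text."""
    features = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if not sep or not value:
            continue  # skip the header line and blank lines
        state = value.split()[0]
        if state in ("on", "off"):
            key = name.strip().lower().replace(" ", "-")
            features[key] = (state == "on")
    return features


def risky_forwarding_pair(rx_text, tx_text):
    """True when the rx device has GRO on and the tx device has TSO on."""
    rx = parse_offloads(rx_text)
    tx = parse_offloads(tx_text)
    return (rx.get("generic-receive-offload", False)
            and tx.get("tcp-segmentation-offload", False))


if __name__ == "__main__":
    # Hand-written sample text (an assumption, not real output from this host).
    rx_sample = "Offload parameters for eth2:\ngeneric-receive-offload: on\n"
    tx_sample = "Offload parameters for eth0:\ntcp segmentation offload: on\n"
    print(risky_forwarding_pair(rx_sample, tx_sample))
```

Whether this combination is really the trigger is the suspicion voiced in this comment, so treat a positive result as a hint to test with the offloads toggled, not as a diagnosis.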
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html
There is an issue with GRO and netxen drivers in stock RHEL5.4. Has this issue been addressed for the netxen module or just ixgbe?
The fix described in this bug was generic and not driver specific.