Bug 1270892 - Debian Testing guest causes 'kernel: WARNING: at net/core/dev.c:1915 skb_warn_bad_offload+0x99/0xb0() (Not tainted)'
Keywords:
Status: CLOSED DUPLICATE of bug 1259008
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: qemu-kvm
Version: 6.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Vlad Yasevich
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-10-12 15:35 UTC by Madison Kelly
Modified: 2016-12-26 17:22 UTC (History)
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-02-01 02:41:51 UTC
Target Upstream Version:
Embargoed:


Attachments

Description Madison Kelly 2015-10-12 15:35:09 UTC
Description of problem:

I have a 2-node HA cluster used to host and protect guest VMs. I've noticed recently that some guests, in this case a Debian Testing guest using virtio network drivers, throw a lot of the errors below, and the network connection to the guest fails.

====
Oct 12 15:19:06 node1 kernel: ------------[ cut here ]------------
Oct 12 15:19:06 node1 kernel: WARNING: at net/core/dev.c:1915 skb_warn_bad_offload+0x99/0xb0() (Not tainted)
Oct 12 15:19:06 node1 kernel: Hardware name: PRIMERGY RX2540 M1
Oct 12 15:19:06 node1 kernel: : caps=(0x4000, 0x0) len=1514 data_len=1448 ip_summed=1
Oct 12 15:19:06 node1 kernel: Modules linked in: gfs2 drbd(U) dlm sctp libcrc32c configfs ip6table_filter ip6_tables ebtable_nat ebtables bridge stp llc bonding ipv6 ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_multiport ipt_addrtype xt_conntrack nf_conntrack iptable_filter ip_tables vhost_net macvtap macvlan tun kvm_intel kvm ipmi_devintf sg joydev ses enclosure power_meter acpi_ipmi ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support ixgbe dca ptp pps_core mdio sb_edac edac_core i2c_i801 i2c_core lpc_ich mfd_core shpchp tcp_htcp ext4 jbd2 mbcache sd_mod crc_t10dif be2net megaraid_sas xhci_hcd wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Oct 12 15:19:06 node1 kernel: Pid: 0, comm: swapper Not tainted 2.6.32-573.7.1.el6.x86_64 #1
Oct 12 15:19:06 node1 kernel: Call Trace:
Oct 12 15:19:06 node1 kernel: <IRQ>  [<ffffffff81077461>] ? warn_slowpath_common+0x91/0xe0
Oct 12 15:19:06 node1 kernel: [<ffffffff81077566>] ? warn_slowpath_fmt+0x46/0x60
Oct 12 15:19:06 node1 kernel: [<ffffffff81296a55>] ? __ratelimit+0xd5/0x120
Oct 12 15:19:06 node1 kernel: [<ffffffff8146b2e9>] ? skb_warn_bad_offload+0x99/0xb0
Oct 12 15:19:06 node1 kernel: [<ffffffff8146f9b1>] ? __skb_gso_segment+0x71/0xc0
Oct 12 15:19:06 node1 kernel: [<ffffffff8146fa13>] ? skb_gso_segment+0x13/0x20
Oct 12 15:19:06 node1 kernel: [<ffffffff8146fabb>] ? dev_hard_start_xmit+0x9b/0x490
Oct 12 15:19:06 node1 kernel: [<ffffffff8148d17a>] ? sch_direct_xmit+0x15a/0x1c0
Oct 12 15:19:06 node1 kernel: [<ffffffff81470158>] ? dev_queue_xmit+0x228/0x320
Oct 12 15:19:06 node1 kernel: [<ffffffffa0250818>] ? br_dev_queue_push_xmit+0x88/0xc0 [bridge]
Oct 12 15:19:06 node1 kernel: [<ffffffffa02508a8>] ? br_forward_finish+0x58/0x60 [bridge]
Oct 12 15:19:06 node1 kernel: [<ffffffffa025095a>] ? __br_forward+0xaa/0xd0 [bridge]
Oct 12 15:19:06 node1 kernel: [<ffffffff8149acf6>] ? nf_hook_slow+0x76/0x120
Oct 12 15:19:06 node1 kernel: [<ffffffffa02509dd>] ? br_forward+0x5d/0x70 [bridge]
Oct 12 15:19:06 node1 kernel: [<ffffffffa025187e>] ? br_handle_frame_finish+0x17e/0x330 [bridge]
Oct 12 15:19:06 node1 kernel: [<ffffffff8106f712>] ? enqueue_entity+0x112/0x440
Oct 12 15:19:06 node1 kernel: [<ffffffffa0251bf0>] ? br_handle_frame+0x1c0/0x270 [bridge]
Oct 12 15:19:06 node1 kernel: [<ffffffffa0251a30>] ? br_handle_frame+0x0/0x270 [bridge]
Oct 12 15:19:06 node1 kernel: [<ffffffff8146aaf7>] ? __netif_receive_skb+0x1c7/0x570
Oct 12 15:19:06 node1 kernel: [<ffffffff8146e3d8>] ? netif_receive_skb+0x58/0x60
Oct 12 15:19:06 node1 kernel: [<ffffffff8146e4e0>] ? napi_skb_finish+0x50/0x70
Oct 12 15:19:06 node1 kernel: [<ffffffff81470329>] ? napi_gro_receive+0x39/0x50
Oct 12 15:19:06 node1 kernel: [<ffffffffa01866fa>] ? ixgbe_clean_rx_irq+0x26a/0xc70 [ixgbe]
Oct 12 15:19:06 node1 kernel: [<ffffffffa018763a>] ? ixgbe_poll+0x40a/0x760 [ixgbe]
Oct 12 15:19:06 node1 kernel: [<ffffffff81091fc9>] ? send_sigqueue+0x109/0x1b0
Oct 12 15:19:06 node1 kernel: [<ffffffff810a05e0>] ? posix_timer_fn+0x0/0xe0
Oct 12 15:19:06 node1 kernel: [<ffffffff81470443>] ? net_rx_action+0x103/0x2f0
Oct 12 15:19:06 node1 kernel: [<ffffffff810ad4bd>] ? ktime_get+0x6d/0x100
Oct 12 15:19:06 node1 kernel: [<ffffffff8107ffa1>] ? __do_softirq+0xc1/0x1e0
Oct 12 15:19:06 node1 kernel: [<ffffffff810ed920>] ? handle_IRQ_event+0x60/0x170
Oct 12 15:19:06 node1 kernel: [<ffffffff8100c38c>] ? call_softirq+0x1c/0x30
Oct 12 15:19:06 node1 kernel: [<ffffffff8100fbd5>] ? do_softirq+0x65/0xa0
Oct 12 15:19:06 node1 kernel: [<ffffffff8107fe55>] ? irq_exit+0x85/0x90
Oct 12 15:19:06 node1 kernel: [<ffffffff815426d5>] ? do_IRQ+0x75/0xf0
Oct 12 15:19:06 node1 kernel: [<ffffffff8100ba53>] ? ret_from_intr+0x0/0x11
Oct 12 15:19:06 node1 kernel: <EOI>  [<ffffffff812f109e>] ? intel_idle+0xfe/0x1b0
Oct 12 15:19:06 node1 kernel: [<ffffffff812f1081>] ? intel_idle+0xe1/0x1b0
Oct 12 15:19:06 node1 kernel: [<ffffffff8143376a>] ? cpuidle_idle_call+0x7a/0xe0
Oct 12 15:19:06 node1 kernel: [<ffffffff81009fe6>] ? cpu_idle+0xb6/0x110
Oct 12 15:19:06 node1 kernel: [<ffffffff8151f21a>] ? rest_init+0x7a/0x80
Oct 12 15:19:06 node1 kernel: [<ffffffff81c38122>] ? start_kernel+0x424/0x431
Oct 12 15:19:06 node1 kernel: [<ffffffff81c3733a>] ? x86_64_start_reservations+0x125/0x129
Oct 12 15:19:06 node1 kernel: [<ffffffff81c37453>] ? x86_64_start_kernel+0x115/0x124
Oct 12 15:19:06 node1 kernel: ---[ end trace 77db6971af6e8705 ]---
====

Many further messages similar to the above get printed, at the rate of a few per second.

The host network uses a standard bridge connected to an active/passive bond.

====
[root@node1 ~]# ifconfig |grep ifn -A 7
ifn_bond1 Link encap:Ethernet  HWaddr 90:1B:0E:53:A6:AC  
          inet6 addr: fe80::921b:eff:fe53:a6ac/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:9000  Metric:1
          RX packets:7988836 errors:1446 dropped:0 overruns:0 frame:1446
          TX packets:4341716 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:4934755400 (4.5 GiB)  TX bytes:338720278 (323.0 MiB)

ifn_bridge1 Link encap:Ethernet  HWaddr 90:1B:0E:53:A6:AC  
          inet addr:10.250.199.10  Bcast:10.250.255.255  Mask:255.255.0.0
          inet6 addr: fe80::921b:eff:fe53:a6ac/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
          RX packets:6942770 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4248623 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:4509625121 (4.1 GiB)  TX bytes:331462619 (316.1 MiB)
--
ifn_link1 Link encap:Ethernet  HWaddr 90:1B:0E:53:A6:AC  
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:9000  Metric:1
          RX packets:7135799 errors:723 dropped:0 overruns:0 frame:723
          TX packets:4341716 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:4873347900 (4.5 GiB)  TX bytes:338720278 (323.0 MiB)

ifn_link2 Link encap:Ethernet  HWaddr 90:1B:0E:53:A6:AC  
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:9000  Metric:1
          RX packets:853037 errors:723 dropped:0 overruns:0 frame:723
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:61407500 (58.5 MiB)  TX bytes:0 (0.0 b)

[root@node1 ~]# ifconfig |grep vnet -A 7
vnet0     Link encap:Ethernet  HWaddr FE:54:00:EF:A2:06  
          inet6 addr: fe80::fc54:ff:feef:a206/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
          RX packets:92810 errors:0 dropped:0 overruns:0 frame:0
          TX packets:816524 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:500 
          RX bytes:7234793 (6.8 MiB)  TX bytes:332679914 (317.2 MiB)

vnet1     Link encap:Ethernet  HWaddr FE:54:00:36:B1:AA  
          inet6 addr: fe80::fc54:ff:fe36:b1aa/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
          RX packets:33 errors:0 dropped:0 overruns:0 frame:0
          TX packets:5948 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:500 
          RX bytes:5034 (4.9 KiB)  TX bytes:762212 (744.3 KiB)


[root@node1 ~]# brctl show
bridge name	bridge id		STP enabled	interfaces
ifn_bridge1		8000.901b0e53a6ac	no		ifn_bond1
							vnet0
							vnet1

[root@node1 ~]# cat /proc/net/bonding/ifn_bond1 
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: ifn_link1 (primary_reselect better)
Currently Active Slave: ifn_link1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 120000
Down Delay (ms): 0

Slave Interface: ifn_link1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 90:1b:0e:53:a6:ac
Slave queue ID: 0

Slave Interface: ifn_link2
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 90:1b:0e:53:a9:d3
Slave queue ID: 0


[root@node1 ~]# virsh dumpxml debian |grep vnet -B 3 -A 4
    <interface type='bridge'>
      <mac address='52:54:00:36:b1:aa'/>
      <source bridge='ifn_bridge1'/>
      <target dev='vnet1'/>
      <model type='e1000'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
====
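
For completeness, the bond and bridge are built with standard RHEL 6 ifcfg files, roughly like this (a trimmed sketch from memory; the real files carry a few more options):

====
# /etc/sysconfig/network-scripts/ifcfg-ifn_link1 (ifn_link2 is identical apart from the name)
DEVICE=ifn_link1
MASTER=ifn_bond1
SLAVE=yes
BOOTPROTO=none
ONBOOT=yes

# /etc/sysconfig/network-scripts/ifcfg-ifn_bond1
DEVICE=ifn_bond1
BONDING_OPTS="mode=1 miimon=100 updelay=120000 downdelay=0 primary=ifn_link1 primary_reselect=better"
BRIDGE=ifn_bridge1
MTU=9000
BOOTPROTO=none
ONBOOT=yes

# /etc/sysconfig/network-scripts/ifcfg-ifn_bridge1
DEVICE=ifn_bridge1
TYPE=Bridge
IPADDR=10.250.199.10
NETMASK=255.255.0.0
MTU=9000
STP=no
BOOTPROTO=none
ONBOOT=yes
====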

I'm also noticing, periodically, corosync retransmit messages on an entirely different set of NICs (also active/passive bonded) that may or may not be related...

====
Oct 12 13:48:44 node1 corosync[5045]:   [TOTEM ] Retransmit List: 6b6ac
Oct 12 13:48:44 node1 corosync[5045]:   [TOTEM ] Retransmit List: 6b6ac
Oct 12 13:48:44 node1 corosync[5045]:   [TOTEM ] Retransmit List: 6b6ae
Oct 12 13:48:44 node1 corosync[5045]:   [TOTEM ] Retransmit List: 6b6b0
Oct 12 13:48:44 node1 corosync[5045]:   [TOTEM ] Retransmit List: 6b6b0
Oct 12 13:48:44 node1 corosync[5045]:   [TOTEM ] Retransmit List: 6b6b0
Oct 12 13:48:44 node1 corosync[5045]:   [TOTEM ] Retransmit List: 6b6b2
Oct 12 13:48:44 node1 corosync[5045]:   [TOTEM ] Retransmit List: 6b6b2
Oct 12 13:48:44 node1 corosync[5045]:   [TOTEM ] Retransmit List: 6b6b6
Oct 12 13:48:44 node1 corosync[5045]:   [TOTEM ] Retransmit List: 6b6b6
Oct 12 13:48:44 node1 corosync[5045]:   [TOTEM ] Retransmit List: 6b6b8
Oct 12 13:48:44 node1 corosync[5045]:   [TOTEM ] Retransmit List: 6b6b8
Oct 12 14:24:01 node1 auditd[3728]: Audit daemon rotating log files
Oct 12 15:11:21 node1 kernel: device vnet1 entered promiscuous mode
Oct 12 15:11:21 node1 kernel: ifn_bridge1: port 3(vnet1) entering forwarding state
Oct 12 15:11:24 node1 ntpd[4094]: Listen normally on 15 vnet1 fe80::fc54:ff:fe36:b1aa UDP 123
Oct 12 15:11:25 node1 ricci[17116]: Executing '/usr/bin/virsh nodeinfo'
Oct 12 15:11:26 node1 ricci[17177]: Executing '/usr/bin/virsh nodeinfo'
Oct 12 15:11:26 node1 ricci[17192]: Executing '/usr/bin/virsh nodeinfo'
Oct 12 15:11:26 node1 ricci[17223]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/1495992487'
Oct 12 15:11:26 node1 ricci[17239]: Executing '/usr/bin/virsh nodeinfo'
Oct 12 15:11:26 node1 ricci[17258]: Executing '/usr/bin/virsh nodeinfo'
Oct 12 15:11:26 node1 ricci[17261]: Executing '/usr/bin/virsh nodeinfo'
Oct 12 15:11:26 node1 ricci[17263]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/490731727'
Oct 12 15:11:27 node1 ricci[17291]: Executing '/usr/bin/virsh nodeinfo'
Oct 12 15:11:27 node1 ricci[17296]: Executing '/usr/bin/virsh nodeinfo'
Oct 12 15:11:27 node1 ricci[17299]: Executing '/usr/bin/virsh nodeinfo'
Oct 12 15:11:27 node1 ricci[17301]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/1134119659'
Oct 12 15:11:27 node1 modcluster: Updating cluster.conf
Oct 12 15:11:27 node1 ricci[17305]: Executing '/usr/bin/virsh nodeinfo'
Oct 12 15:11:28 node1 ricci[17310]: Executing '/usr/bin/virsh nodeinfo'
Oct 12 15:11:28 node1 ricci[17313]: Executing '/usr/bin/virsh nodeinfo'
Oct 12 15:11:28 node1 ricci[17315]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/587458437'
Oct 12 15:11:28 node1 ricci[17321]: Executing '/usr/bin/virsh nodeinfo'
Oct 12 15:11:28 node1 ricci[17326]: Executing '/usr/bin/virsh nodeinfo'
Oct 12 15:11:28 node1 ricci[17329]: Executing '/usr/bin/virsh nodeinfo'
Oct 12 15:11:28 node1 ricci[17331]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/1026759435'
Oct 12 15:11:28 node1 modcluster: Updating cluster.conf
Oct 12 15:11:28 node1 corosync[5045]:   [QUORUM] Members[2]: 1 2
Oct 12 15:11:28 node1 rgmanager[5405]: Reconfiguring
Oct 12 15:11:30 node1 kernel: kvm: 16934: cpu0 disabled perfctr wrmsr: 0xc2 data 0xffff
Oct 12 15:11:30 node1 rgmanager[5405]: Initializing vm:debian
Oct 12 15:11:30 node1 rgmanager[5405]: vm:debian was added to the config, but I am not initializing it.
Oct 12 15:11:30 node1 kernel: kvm: 16934: cpu0 unhandled rdmsr: 0x570
Oct 12 15:11:30 node1 kernel: kvm: 16934: cpu1 unhandled rdmsr: 0x570
Oct 12 15:11:30 node1 kernel: kvm: 16934: cpu2 unhandled rdmsr: 0x570
Oct 12 15:11:30 node1 kernel: kvm: 16934: cpu3 unhandled rdmsr: 0x570
Oct 12 15:11:30 node1 kernel: kvm: 16934: cpu4 unhandled rdmsr: 0x570
Oct 12 15:11:30 node1 kernel: kvm: 16934: cpu5 unhandled rdmsr: 0x570
Oct 12 15:11:36 node1 kernel: ifn_bridge1: port 3(vnet1) entering forwarding state
Oct 12 15:11:39 node1 rgmanager[5405]: Starting disabled service vm:debian
Oct 12 15:11:39 node1 rgmanager[5405]: Service vm:debian started
Oct 12 15:18:19 node1 corosync[5045]:   [TOTEM ] Retransmit List: 7003c
Oct 12 15:18:19 node1 corosync[5045]:   [TOTEM ] Retransmit List: 7003e
Oct 12 15:18:19 node1 corosync[5045]:   [TOTEM ] Retransmit List: 70040
Oct 12 15:18:39 node1 corosync[5045]:   [TOTEM ] Retransmit List: 70077
Oct 12 15:18:39 node1 corosync[5045]:   [TOTEM ] Retransmit List: 7008f
Oct 12 15:18:39 node1 corosync[5045]:   [TOTEM ] Retransmit List: 7008f
Oct 12 15:18:59 node1 corosync[5045]:   [TOTEM ] Retransmit List: 700e0
Oct 12 15:18:59 node1 corosync[5045]:   [TOTEM ] Retransmit List: 700e2
Oct 12 15:18:59 node1 corosync[5045]:   [TOTEM ] Retransmit List: 700e2
Oct 12 15:18:59 node1 corosync[5045]:   [TOTEM ] Retransmit List: 700e4
Oct 12 15:18:59 node1 corosync[5045]:   [TOTEM ] Retransmit List: 700e4
Oct 12 15:19:06 node1 kernel: ------------[ cut here ]------------
Oct 12 15:19:06 node1 kernel: WARNING: at net/core/dev.c:1915 skb_warn_bad_offload+0x99/0xb0() (Not tainted)
Oct 12 15:19:06 node1 kernel: Hardware name: PRIMERGY RX2540 M1
....
====

The hardware consists of 2x dual-port Intel 10Gbps NICs and 1x dual-port Emulex NIC.

====
[root@node1 ~]# lspci |grep Ethernet
02:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
02:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
03:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
03:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
04:00.0 Ethernet controller: Emulex Corporation OneConnect NIC (Skyhawk) (rev 10)
04:00.1 Ethernet controller: Emulex Corporation OneConnect NIC (Skyhawk) (rev 10)
====

All links route through a pair of Brocade ICX 6610-48 switches and the links themselves use (copper) SFP+ cables.


Version-Release number of selected component (if applicable):

The hosts are identical, fully updated RHEL 6.7 machines:

====
[root@node1 ~]# uname -a
Linux node1.ccrs.bcn 2.6.32-573.7.1.el6.x86_64 #1 SMP Thu Sep 10 13:42:16 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux

[root@node1 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 6.7 (Santiago)

[root@node1 ~]# yum list installed | grep -e qemu -e kvm
gpxe-roms-qemu.noarch            0.9.7-6.14.el6                     @dashboard1 
qemu-img.x86_64                  2:0.12.1.2-2.479.el6_7.1           @dashboard1 
qemu-kvm.x86_64                  2:0.12.1.2-2.479.el6_7.1           @dashboard1 
qemu-kvm-tools.x86_64            2:0.12.1.2-2.479.el6_7.1           @dashboard1 
====


How reproducible:

Roughly 80% of the time.


Steps to Reproduce:
1. Set up a standard bridge
2. Set up a mode=1 bonded interface using MTU 9000
3. Provision a Debian Testing guest
4. tail /var/log/messages on the host
5. Observe errors when the guest tries to use apt to connect to the web (see the sketch below)
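
For steps 4 and 5, this is roughly what I do (a minimal sketch; the grep is just one way to spot the warning):

====
# On the host, watch for the offload warning:
[root@node1 ~]# tail -f /var/log/messages | grep skb_warn_bad_offload

# In the guest (Debian Testing), generate some traffic, e.g.:
root@debian:~# apt-get update
====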

Actual results:

Lots of errors in syslog, no network to guest.


Expected results:

Network to work, no errors.


Additional info:

Comment 1 Madison Kelly 2015-10-12 15:37:35 UTC
I should mention:

The IFN bond (used by the guests) spans the two Intel NICs. The corosync traffic spans one Emulex NIC (primary link) and one Intel NIC (backup link). If the corosync retransmit messages are related, then the problem is unlikely to be specific to the NIC brand.

Comment 3 Madison Kelly 2015-10-12 16:28:15 UTC
Correction: the guest was created emulating an e1000 NIC, as seen in the virsh dumpxml output.

Comment 4 Pavel Urban 2015-11-04 10:09:46 UTC
Hello,

We had a very similar problem and resolved it (hopefully) by turning off large receive offload (LRO).

[root@lxhvehv407 ~]# tail -2 /etc/rc.d/rc.local 
ethtool -K eth0 lro off; ethtool -K eth1 lro off

HTH

Comment 5 Vlad Yasevich 2015-12-22 03:38:08 UTC
The only way I can reproduce this is to make the ethX devices that are part of
the bond have LRO enabled.

Can you please provide the output of 
  # ethtool -k ifn_link1
  # ethtool -k ifn_link2

after the bridge has been created.

Thanks
-vlad

Comment 6 Ralf Aumueller 2016-01-19 11:37:32 UTC
Hello,

After updating my cluster (6.6 -> 6.7) I had the same problem (same network config as digimer). I guess the problem was introduced with kernel 2.6.32-515. From the changelog:

- [netdrv] revert "bonding: Allow Bonding driver to disable/enable LRO on slaves" (Nikolay Aleksandrov) [1159818]

If I understand correctly, the feature was removed. And in a test VM I found the following: adding a bond interface to a bridge automatically disables LRO on the bonding interface. So I guess in older kernels this change was also propagated to the slaves of the bond.
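
The state can be compared with ethtool, roughly like this (a sketch, using the interface names from this report):

====
# bond (member of the bridge) -- LRO gets forced off here
ethtool -k ifn_bond1 | grep large-receive-offload

# physical slaves -- on the affected kernel these can still report LRO as on
ethtool -k ifn_link1 | grep large-receive-offload
ethtool -k ifn_link2 | grep large-receive-offload
====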

Best regards

Comment 7 Vlad Yasevich 2016-02-01 02:41:51 UTC
This should happen in recent RHEL 6 kernels as well:

commit 0b044e4d4358a5ddcc335c31b39efd2abe95c173
Author: Jarod Wilson <jarod>
Date:   Fri Sep 25 16:59:32 2015 -0400

    [net] bonding: propagate LRO disable to slave devices

There is a z-stream bug 1287993 in progress already.

Closing this one as a duplicate of bug 1259008.

*** This bug has been marked as a duplicate of bug 1259008 ***

Comment 8 Madison Kelly 2016-12-26 17:22:57 UTC
Commenting to stop the "outstanding bugs" emails.

