Bug 1210086

Summary: skb_warn_bad_offload if a bond of Intel interfaces and VLANs are used
Product: [oVirt] ovirt-node Reporter: Sebastian Schrader <sebastian.schrader+bugzilla.redhat.com>
Component: General Assignee: Fabian Deutsch <fdeutsch>
Status: CLOSED DEFERRED QA Contact: bugs <bugs>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: --- CC: bugs, ecohen, fdeutsch, jbloemen, lsurette, mgoldboi, ovirt-bugs, pablo.iranzo, parsonsa, rbalakri, sebastian.schrader+bugzilla.redhat.com, yeylon
Target Milestone: --- Keywords: TestOnly
Target Release: --- Flags: ylavi: ovirt-3.6.0?
ylavi: planning_ack?
ylavi: devel_ack?
ylavi: testing_ack?
Hardware: x86_64   
OS: Linux   
Whiteboard: node
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-10-22 13:05:31 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Node RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Host Traceback if guest NIC is e1000 none
Guest Traceback if guest NIC is virtio_net none

Description Sebastian Schrader 2015-04-08 21:26:38 UTC
Created attachment 1012410
Host Traceback if guest NIC is e1000

Description of problem:
If you use VLANs on top of an 802.3ad bond of ixgbe network interfaces, packets are lost and skb_warn_bad_offload is reported by the host or guest kernel.

Version-Release number of selected component (if applicable):
3.5-0.201502231653.el7

How reproducible:
Use VLANs on an 802.3ad bond that consists of Intel network interfaces, e.g. ixgbe (e1000e may also be affected).

Steps to Reproduce:
1. Install oVirt Node
2. Create an 802.3ad bond with ixgbe slaves
3. Use VLANs

Actual results:
Loss of every packet with certain parameters (probably size), and warnings in the guest or host kernel log.


Expected results:
Proper transmission of the packets.

Additional info:
If virtio_net is used for the guests' NICs, the errors are reported by the guest kernel; if emulated NICs such as e1000 are used, the errors are reported by the host kernel.

The bug has been reported to the Intel network team here:
http://sourceforge.net/p/e1000/bugs/434/

Comment 1 Sebastian Schrader 2015-04-08 21:27:28 UTC
Created attachment 1012411
Guest Traceback if guest NIC is virtio_net

Comment 2 Fabian Deutsch 2015-04-13 11:20:44 UTC
oVirt Node always pulls in the latest kernel from CentOS, so a fix here depends on when CentOS fixes this issue.

Comment 3 Fabian Deutsch 2015-04-27 14:27:36 UTC
Have you checked whether one of the current builds fixes this issue?

Comment 4 Sebastian Schrader 2015-04-27 22:12:10 UTC
I couldn't try it yet; http://resources.ovirt.org/pub/ovirt-3.5/iso/ currently returns a 500 Internal Server Error.

Comment 5 Fabian Deutsch 2015-04-29 12:19:51 UTC
Right, there was a reorganization, please try the latest build from this CI job:

http://jenkins.ovirt.org/job/ovirt-node_ovirt-3.5_create-iso-el7_merged/

Comment 6 Jurriën Bloemen 2015-05-11 11:43:38 UTC
I have the same problems and I tried the latest build in Comment 5.
This has not resolved the problem for me.

Comment 7 Sebastian Schrader 2015-05-11 14:20:36 UTC
I got my hands back on a machine with the affected hardware. The problem is still there.

Comment 8 Fabian Deutsch 2015-05-28 07:04:53 UTC
Okay, thanks for testing. This may actually be a duplicate of bug 1217848. We need to see when it gets resolved there, or in the upstream kernel.

Comment 9 Sebastian Schrader 2015-05-29 13:12:44 UTC
I'm not authorized to view the bug you referenced.

Comment 10 Sebastian Schrader 2015-06-03 22:50:30 UTC
Yesterday, somebody reported something interesting in the referenced kernel bug:
https://bugzilla.kernel.org/show_bug.cgi?id=82471

The bug can be avoided by disabling Large Receive Offload (LRO). I could confirm this on our machines by disabling it manually. Unfortunately, I don't know how to make this change persistent across reboots with oVirt Node, or how to apply it to a fleet of machines.
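
For reference, the manual workaround can be sketched roughly as follows. This is only an illustration, not something oVirt Node provides: the function names and the bond name bond0 are made up for the example. It assumes the bonding driver lists its slaves in /sys/class/net/<bond>/bonding/slaves and that ethtool is installed. It does not make the setting persistent across reboots.

```python
import subprocess
from pathlib import Path


def bond_slaves(bond, sysfs_root="/sys/class/net"):
    """Return the slave interface names the bonding driver lists in sysfs."""
    slaves_file = Path(sysfs_root) / bond / "bonding" / "slaves"
    return slaves_file.read_text().split()


def lro_off_command(iface):
    """Build the ethtool command that turns LRO off on one interface."""
    return ["ethtool", "-K", iface, "lro", "off"]


def disable_lro_on_bond(bond, dry_run=True, sysfs_root="/sys/class/net"):
    """Disable LRO on every slave of `bond`.

    With dry_run=True the commands are only printed, not executed.
    Hypothetical helper for illustration; the setting is lost on reboot.
    """
    commands = [lro_off_command(iface) for iface in bond_slaves(bond, sysfs_root)]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)  # requires root privileges
    return commands
```

Calling disable_lro_on_bond("bond0") prints the ethtool commands that would be run; passing dry_run=False executes them and needs root.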

On a related note, the ixgbe README contains an important warning; maybe LRO should be disabled completely in the oVirt Node ixgbe module:

Important Note
--------------

WARNING: The ixgbe driver compiles by default with the Large Receive Offload
(LRO) feature enabled. This option offers the lowest CPU utilization for
receives but is completely incompatible with *routing/ip forwarding* and
*bridging*. If enabling ip forwarding or bridging is a requirement, it is
necessary to disable LRO using compile time options as noted in the LRO
section later in this document. The result of not disabling LRO when combined
with ip forwarding or bridging can be low throughput or even a kernel panic.
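
To see whether LRO is currently enabled on an interface that sits under a bridge, one option is to parse the output of `ethtool -k`. The sketch below is illustrative only: the helper names are hypothetical, and it assumes ethtool reports the feature on a line starting with "large-receive-offload:".

```python
import subprocess


def lro_enabled(ethtool_k_output):
    """Parse `ethtool -k` output; return True/False, or None if not listed."""
    for line in ethtool_k_output.splitlines():
        name, _, value = line.partition(":")
        if name.strip() == "large-receive-offload":
            state = value.split()  # e.g. ["on"] or ["off", "[fixed]"]
            return bool(state) and state[0] == "on"
    return None


def query_lro(iface):
    """Run `ethtool -k <iface>` and report the LRO state (hypothetical helper)."""
    result = subprocess.run(
        ["ethtool", "-k", iface], capture_output=True, text=True, check=True
    )
    return lro_enabled(result.stdout)
```

A result of True for a bond slave would indicate the combination the README warns about.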

Comment 11 Fabian Deutsch 2015-10-22 13:05:31 UTC
In the future it will be easier to make this kind of change; deferring this bug.