Bug 1152231

Summary: qdisc: address unexpected behavior when attaching qdisc to virtual device
Product: Red Hat Enterprise Linux 7
Reporter: Jesper Brouer <jbrouer>
Component: kernel
Assignee: Phil Sutter <psutter>
Kernel sub component: Networking
QA Contact: Li Shuang <shuali>
Status: CLOSED ERRATA
Docs Contact: Mirek Jahoda <mjahoda>
Severity: medium
Priority: medium
CC: bazulay, haliu, jbrouer, kzhang, mleitner, mzhan, psutter, rcyriac, rkhan, tgraf, tlavigne
Version: 7.2
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version: kernel-3.10.0-395.el7
Doc Type: Bug Fix
Doc Text:
Unexpected behavior when attaching a qdisc to a virtual device no longer occurs. Previously, attaching a qdisc to a virtual device could result in unexpected behavior, such as packets being dropped prematurely and reduced bandwidth. With this update, virtual devices have a default `tx_queue_len` of 1000 and are marked by a dedicated device flag (IFF_NO_QUEUE). Attaching a qdisc to a virtual device is now supported with default settings, and special handling of `tx_queue_len=0` is no longer needed.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-11-03 08:49:29 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1301628, 1313485, 1328874    
Attachments: qlen_reproducer.sh (flags: none)

Description Jesper Brouer 2014-10-13 16:02:53 UTC
Description of problem:

By design, virtual devices run without a qdisc attached (noqueue),
for obvious performance reasons: the underlying real device is the
only mechanism that can "push back" (NETDEV_TX_BUSY or
netif_xmit_frozen_or_stopped()).  Thus, all the qdisc locking can be
avoided, given that no queue can ever exist (see the code in
__dev_queue_xmit()).
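
For illustration, the noqueue default is visible from userspace
(loopback is used here only as a convenient example of a virtual
device):

 ip link show dev lo    # shows "qdisc noqueue" -- no qdisc locking on transmit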

There exist some valid use cases for attaching a qdisc to a virtual
device (e.g. rate-limiting a virtual machine), and there is nothing
stopping userspace from attaching a qdisc to a virtual device.

The problem is that attaching a qdisc to a virtual device results in
unexpected behavior depending on the leaf qdiscs used (especially for
the fall-through to the default qdisc, or for qdiscs inheriting
dev->tx_queue_len).  The unexpected behavior is that traffic seems to
flow normally until the configured qdisc limit is hit, at which point
packet drops occur prematurely (see the sketch below).
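
A minimal sketch of how to observe the inherited queue length (the
veth device name and the choice of pfifo as a simple inheriting qdisc
are assumptions, not part of the original report):

 # Virtual devices come up with txqueuelen 0 on unfixed kernels
 ip link add veth0 type veth peer name veth1
 ip link set veth0 up

 # pfifo attached without an explicit limit inherits dev->tx_queue_len
 # (modulo the per-qdisc workarounds for the zero case)
 tc qdisc add dev veth0 root pfifo

 # Inspect the effective limit and the drop counters
 tc -s qdisc show dev veth0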

The main problem is that this unexpected behavior and lack of
feedback amount to a broken interface between the kernel and userspace.

The workaround for userspace is to simply change the interface's
"txqueuelen" to something non-zero before attaching a qdisc.  This is
a subtle and unreasonable requirement.

The only way forward is to fix the kernel to act reasonably, and not
break these kinds of use cases in subtle ways.


The cause of this problem is the double meaning of dev->tx_queue_len
inside the kernel.  Before the device is brought up, while it still
has a "noop" qdisc, tx_queue_len=0 indicates that this is a (virtual)
device that supports the "noqueue" qdisc.  After the device is up,
this value is not really used.  When a qdisc is attached,
tx_queue_len comes into effect again, this time inherited and used as
the default queue length limit.  Needless to say, a qdisc with a
queue length limit of zero is not very useful as a queue.
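
The field behind this double meaning can be inspected and changed
from userspace; a short sketch (veth0 is an assumed device name):

 # 0 marks a "noqueue"-capable virtual device
 cat /sys/class/net/veth0/tx_queue_len

 # The same value later becomes the default qdisc limit, hence the
 # workaround of raising it before attaching a qdisc
 ip link set dev veth0 txqueuelen 1000
 cat /sys/class/net/veth0/tx_queue_len   # now 1000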

This issue was discovered with Docker containers:
 https://github.com/docker/libcontainer/pull/193

After a long discussion and a rejected kernel patch:
 http://thread.gmane.org/gmane.linux.network/333349/focus=333456

We agreed that this behavior is a kernel bug and that the kernel
needs to address this non-intuitive behavior.  Userspace should not
be forced to jump through hoops to get the expected behavior.

Version-Release number of selected component (if applicable):


How reproducible: 100%

Steps to Reproduce:

1. Create a VLAN device:

 export VLAN=100
 export DEV=eth1
 ip link add link ${DEV} name ${DEV}.${VLAN} type vlan id ${VLAN}
 ip addr add 192.168.123.1/24 dev ${DEV}.${VLAN}
 ip link set dev ${DEV}.${VLAN} up

2. Attach an HTB qdisc to the VLAN device and test throughput:

 tc qdisc del dev ${DEV}.${VLAN} root
 tc qdisc add dev ${DEV}.${VLAN} root handle 1: htb default 1
 tc class add dev ${DEV}.${VLAN} parent 1:0 classid 1:1 htb rate 500Mbit ceil 500Mbit
 netperf -H 192.168.123.2 -t TCP_STREAM

3. Make it work by adjusting dev->tx_queue_len:

 ifconfig ${DEV}.${VLAN} txqueuelen 1000
 tc qdisc del dev ${DEV}.${VLAN} root
 tc qdisc add dev ${DEV}.${VLAN} root handle 1: htb default 1
 tc class add dev ${DEV}.${VLAN} parent 1:0 classid 1:1 htb rate 500Mbit ceil 500Mbit
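
Re-running the throughput test from step 2 should now show shaping at
the configured rate:

 netperf -H 192.168.123.2 -t TCP_STREAM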

Actual results:
 Too low bandwidth throughput.

Expected results:
 Bandwidth shaping to the configured speed.

Comment 3 Phil Sutter 2015-08-18 16:28:01 UTC
The core framework for a solution I submitted earlier has been accepted:
http://www.spinics.net/lists/netdev/msg339268.html

I followed up on this with a patch series adjusting drivers; 802.1q is among them:
https://www.mail-archive.com/netdev@vger.kernel.org/msg74403.html

The latter series is currently under review.

Comment 4 Phil Sutter 2015-09-17 15:47:42 UTC
This is the list of upstream commits which should be backported:

fa8187c net: declare new net_device priv_flag IFF_NO_QUEUE
4b46995 net: sch_generic: react upon IFF_NO_QUEUE flag
2e659c0 net: 8021q: convert to using IFF_NO_QUEUE
ccecb2a net: bridge: convert to using IFF_NO_QUEUE
4afbc0d net: 6lowpan: convert to using IFF_NO_QUEUE
0a5f107 net: dsa: convert to using IFF_NO_QUEUE
cdf7370 net: batman-adv: convert to using IFF_NO_QUEUE
9ad09c5 net: hsr: convert to using IFF_NO_QUEUE
4676a15 net: caif: convert to using IFF_NO_QUEUE
906470c net: warn if drivers set tx_queue_len = 0
348e343 net: sched: drop all special handling of tx_queue_len == 0
f84bb1e net: fix IFF_NO_QUEUE for drivers using alloc_netdev
db4094b net: sched: ignore tx_queue_len when assigning default qdisc
d66d6c3 net: sched: register noqueue qdisc
3e692f2 net: sched: simplify attach_one_default_qdisc()
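
With these commits in place, a quick sanity check should show the new
defaults on fixed kernels (a sketch; veth0 is an assumed device name):

 ip link add veth0 type veth peer name veth1
 ip link set veth0 up
 ip link show veth0   # expect "noqueue" operation with a default qlen of 1000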

Comment 5 Phil Sutter 2015-09-17 15:55:53 UTC
Oops, missed a bunch:

(In reply to Phil Sutter from comment #4)
> This is the list of upstream commits which should be backported:
> 
> fa8187c net: declare new net_device priv_flag IFF_NO_QUEUE
> 4b46995 net: sch_generic: react upon IFF_NO_QUEUE flag

02f01ec net: veth: enable noqueue operation by default
ff42c02 net: dummy: convert to using IFF_NO_QUEUE
ed961ac net: geneve: convert to using IFF_NO_QUEUE
e65db2b net: loopback: convert to using IFF_NO_QUEUE
85773a6 net: nlmon: convert to using IFF_NO_QUEUE
22e380a net: team: convert to using IFF_NO_QUEUE
22dba39 net: vxlan: convert to using IFF_NO_QUEUE

> 2e659c0 net: 8021q: convert to using IFF_NO_QUEUE
> ccecb2a net: bridge: convert to using IFF_NO_QUEUE
> 4afbc0d net: 6lowpan: convert to using IFF_NO_QUEUE
> 0a5f107 net: dsa: convert to using IFF_NO_QUEUE
> cdf7370 net: batman-adv: convert to using IFF_NO_QUEUE
> 9ad09c5 net: hsr: convert to using IFF_NO_QUEUE
> 4676a15 net: caif: convert to using IFF_NO_QUEUE
> 906470c net: warn if drivers set tx_queue_len = 0
> 348e343 net: sched: drop all special handling of tx_queue_len == 0
> f84bb1e net: fix IFF_NO_QUEUE for drivers using alloc_netdev
> db4094b net: sched: ignore tx_queue_len when assigning default qdisc
> d66d6c3 net: sched: register noqueue qdisc
> 3e692f2 net: sched: simplify attach_one_default_qdisc()

Comment 7 Phil Sutter 2016-02-17 10:04:34 UTC
Coincidentally, yesterday evening someone reported a problem with the backported series: http://www.gossamer-threads.com/lists/linux/kernel/2372160

Therefore, blocking this BZ until a solution has been found.

Comment 10 Phil Sutter 2016-02-29 12:22:43 UTC
A fix has been accepted upstream; internal discussion about the right solution for RHEL7 is in progress[1].

[1] http://post-office.corp.redhat.com/archives/rhkernel-list/2016-February/msg01978.html

Comment 11 Phil Sutter 2016-03-04 12:50:59 UTC
Backported my solution from upstream; it seems OK for RHEL7.

Comment 13 Rafael Aquini 2016-05-11 01:36:39 UTC
Patch(es) available on kernel-3.10.0-395.el7

Comment 21 Phil Sutter 2016-09-20 08:40:27 UTC
Hi Shuang,

It is a bit strange that the throughput is higher in the error case, but I'm not sure how HTB's bandwidth calculation reacts to a queue holding only two packets, so maybe that's expected.

If you want to reproduce the actual issue (starvation), you have to use a qdisc with no zero-qlen workaround, like TBF (as mentioned above). Something like this might work (note that TBF also requires burst and latency/limit parameters):

# tc qdisc add dev eth1.90 root handle 1: tbf rate 500mbit burst 64kb latency 50ms

Cheers, Phil

Comment 23 Phil Sutter 2016-09-22 11:57:10 UTC
Created attachment 1203696 [details]
qlen_reproducer.sh

Comment 24 Phil Sutter 2016-09-22 12:01:13 UTC
Hi Shuang,

Sorry for all the confusion. Looks like my advice to use TBF wasn't helpful at all! Instead I went ahead and created a reproducer, which I tested successfully in my RHEL7 VM.

First with kernel-3.10.0-327.el7:


# bash /vmshare/reproducer/qlen_reproducer.sh
Starting netserver with host 'IN(6)ADDR_ANY' port '12865' and family AF_UNSPEC
--- good test ---
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.1 () port 0 AF_INET
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    MBytes/sec  

 87380  16384  16384    10.00    2694.50   
--- bad test ---
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.1 () port 0 AF_INET
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    MBytes/sec  

 87380  16384  16384    10.24       0.40   
--- fixed test ---
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.1 () port 0 AF_INET
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    MBytes/sec  

 87380  16384  16384    10.03      52.36   


Note the reduced bandwidth in "bad test" and the correct bandwidth
shaping in "fixed test". Next I tried with the current RHEL7.3 kernel,
namely kernel-3.10.0-505.el7:


# bash /vmshare/reproducer/qlen_reproducer.sh
Starting netserver with host 'IN(6)ADDR_ANY' port '12865' and family AF_UNSPEC
--- good test ---
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.1 () port 0 AF_INET
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    MBytes/sec  

 87380  16384  16384    10.00    2666.09   
--- bad test ---
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.1 () port 0 AF_INET
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    MBytes/sec  

 87380  16384  16384    10.03      52.47   
--- fixed test ---
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.1 () port 0 AF_INET
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    MBytes/sec  

 87380  16384  16384    10.03      52.33   


Note the correct traffic shaping in "bad test".

HTH, Phil

Comment 28 errata-xmlrpc 2016-11-03 08:49:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2574.html