Red Hat Bugzilla – Bug 1152231
qdisc: address unexpected behavior when attaching qdisc to virtual device
Last modified: 2016-11-04 07:08:59 EDT
Description of problem:

By design, virtual devices run without a qdisc attached ("noqueue"), for obvious performance reasons: the underlying real device is the only mechanism that can push back (NETDEV_TX_BUSY or netif_xmit_frozen_or_stopped()). Thus, all the qdisc locking can be avoided, given that no queue can ever exist (see __dev_queue_xmit).

There are valid use cases for attaching a qdisc to a virtual device (e.g. policy-limiting a virtual machine), and nothing stops userspace from doing so. The problem is that attaching a qdisc to a virtual device results in unexpected behavior depending on the leaf qdiscs used (especially for the fall-through to the default qdisc, or qdiscs inheriting dev->tx_queue_len). The unexpected behavior is that traffic seems to flow normally until the configured qdisc limit is hit, at which point packets are dropped prematurely.

The main problem: this unexpected behavior and lack of feedback is clearly a broken interface between the kernel and userspace. The workaround for userspace is simply to change the interface's "txqueuelen" to something non-zero before attaching a qdisc. This is a subtle and unreasonable requirement. The only way forward is to fix the kernel to act reasonably and not break these kinds of use cases in subtle ways.

The cause of this problem is the double meaning of dev->tx_queue_len inside the kernel. Before the device is brought up, while it still has a "noop" qdisc, tx_queue_len=0 indicates that this is a (virtual) device that supports the "noqueue" qdisc. After the device is up, this value is not really used. When a qdisc is attached, tx_queue_len takes effect again, this time inherited and used as the default queue length limit. Needless to say, a qdisc with a queue length limit of zero is not very useful as a queue.

This issue was discovered for Docker containers:
 https://github.com/docker/libcontainer/pull/193

After a long discussion and a rejected kernel patch:
 http://thread.gmane.org/gmane.linux.network/333349/focus=333456
we agreed that this behavior is a kernel bug, and the kernel needs to address this non-intuitive behavior. Userspace should not be forced to jump through hoops to get the expected behavior.

Version-Release number of selected component (if applicable):

How reproducible: 100%

Steps to Reproduce:

1. Create a VLAN device:

 export VLAN=100
 export DEV=eth1
 ip link add link ${DEV} name ${DEV}.${VLAN} type vlan id ${VLAN}
 ip addr add 192.168.123.1/24 dev ${DEV}.${VLAN}
 ip link set dev ${DEV}.${VLAN} up

2. Attach an HTB qdisc to the VLAN device and test throughput:

 tc qdisc del dev ${DEV}.${VLAN} root
 tc qdisc add dev ${DEV}.${VLAN} root handle 1: htb default 1
 tc class add dev ${DEV}.${VLAN} parent 1:0 classid 1:1 htb rate 500Mbit ceil 500Mbit
 netperf -H 192.168.123.2 -t TCP_STREAM

3. Make it work by adjusting dev->tx_queue_len:

 ifconfig ${DEV}.${VLAN} txqueuelen 1000
 tc qdisc del dev ${DEV}.${VLAN} root
 tc qdisc add dev ${DEV}.${VLAN} root handle 1: htb default 1
 tc class add dev ${DEV}.${VLAN} parent 1:0 classid 1:1 htb rate 500Mbit ceil 500Mbit

Actual results: Too low bandwidth throughput.

Expected results: Bandwidth shaping to the configured speed.
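To make the tx_queue_len double meaning concrete, here is a simplified sketch (an editor's illustration modeled on the classic fifo qdisc, not verbatim kernel code) of how a qdisc created without an explicit limit inherits the device's tx_queue_len:

#include <net/netlink.h>
#include <net/pkt_sched.h>
#include <net/sch_generic.h>

/* Simplified sketch, not verbatim kernel code: a classic fifo qdisc
 * created without an explicit "limit" parameter falls back to the
 * device's tx_queue_len. On a virtual device still carrying the
 * "noqueue" marker value of 0, the resulting queue can hold almost
 * nothing, so packets are dropped as soon as the qdisc is attached.
 */
static int fifo_init_sketch(struct Qdisc *sch, struct nlattr *opt)
{
	if (opt == NULL) {
		/* No limit given by userspace: inherit tx_queue_len.
		 * Some qdiscs papered over the zero case with a tiny
		 * minimum ("?: 1"); others did not and starved.
		 */
		sch->limit = qdisc_dev(sch)->tx_queue_len ? : 1;
	} else {
		struct tc_fifo_qopt *ctl = nla_data(opt);

		sch->limit = ctl->limit;
	}
	return 0;
}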
The core framework for a solution I submitted earlier has been accepted:
 http://www.spinics.net/lists/netdev/msg339268.html

I followed up on this with a patch series adjusting drivers; 802.1q is among them:
 https://www.mail-archive.com/netdev@vger.kernel.org/msg74403.html

The latter series is currently under review.
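For illustration, the driver-side conversion the series performs boils down to the following (a hedged sketch modeled on those conversions; virt_dev_setup is a placeholder name, not an actual driver function):

#include <linux/etherdevice.h>
#include <linux/netdevice.h>

/* Sketch of the per-driver change (simplified, placeholder function
 * name): the virtual device stops abusing tx_queue_len = 0 as a
 * "no queue" marker and sets the dedicated priv_flag instead, so the
 * scheduler can tell "wants no queue" apart from "has a zero queue
 * limit".
 */
static void virt_dev_setup(struct net_device *dev)
{
	ether_setup(dev);

	/* before the series: dev->tx_queue_len = 0; */
	dev->priv_flags |= IFF_NO_QUEUE;
}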
This is the list of upstream commits which should be backported:

fa8187c net: declare new net_device priv_flag IFF_NO_QUEUE
4b46995 net: sch_generic: react upon IFF_NO_QUEUE flag
2e659c0 net: 8021q: convert to using IFF_NO_QUEUE
ccecb2a net: bridge: convert to using IFF_NO_QUEUE
4afbc0d net: 6lowpan: convert to using IFF_NO_QUEUE
0a5f107 net: dsa: convert to using IFF_NO_QUEUE
cdf7370 net: batman-adv: convert to using IFF_NO_QUEUE
9ad09c5 net: hsr: convert to using IFF_NO_QUEUE
4676a15 net: caif: convert to using IFF_NO_QUEUE
906470c net: warn if drivers set tx_queue_len = 0
348e343 net: sched: drop all special handling of tx_queue_len == 0
f84bb1e net: fix IFF_NO_QUEUE for drivers using alloc_netdev
db4094b net: sched: ignore tx_queue_len when assigning default qdisc
d66d6c3 net: sched: register noqueue qdisc
3e692f2 net: sched: simplify attach_one_default_qdisc()
Oops, missed a bunch:

(In reply to Phil Sutter from comment #4)
> This is the list of upstream commits which should be backported:
>
> fa8187c net: declare new net_device priv_flag IFF_NO_QUEUE
> 4b46995 net: sch_generic: react upon IFF_NO_QUEUE flag

02f01ec net: veth: enable noqueue operation by default
ff42c02 net: dummy: convert to using IFF_NO_QUEUE
ed961ac net: geneve: convert to using IFF_NO_QUEUE
e65db2b net: loopback: convert to using IFF_NO_QUEUE
85773a6 net: nlmon: convert to using IFF_NO_QUEUE
22e380a net: team: convert to using IFF_NO_QUEUE
22dba39 net: vxlan: convert to using IFF_NO_QUEUE

> 2e659c0 net: 8021q: convert to using IFF_NO_QUEUE
> ccecb2a net: bridge: convert to using IFF_NO_QUEUE
> 4afbc0d net: 6lowpan: convert to using IFF_NO_QUEUE
> 0a5f107 net: dsa: convert to using IFF_NO_QUEUE
> cdf7370 net: batman-adv: convert to using IFF_NO_QUEUE
> 9ad09c5 net: hsr: convert to using IFF_NO_QUEUE
> 4676a15 net: caif: convert to using IFF_NO_QUEUE
> 906470c net: warn if drivers set tx_queue_len = 0
> 348e343 net: sched: drop all special handling of tx_queue_len == 0
> f84bb1e net: fix IFF_NO_QUEUE for drivers using alloc_netdev
> db4094b net: sched: ignore tx_queue_len when assigning default qdisc
> d66d6c3 net: sched: register noqueue qdisc
> 3e692f2 net: sched: simplify attach_one_default_qdisc()
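The scheduler-side half of the series (4b46995, d66d6c3 and 3e692f2 above) keys the default-qdisc decision off the new flag instead of tx_queue_len == 0. Roughly, as a simplified sketch rather than the verbatim upstream code:

#include <linux/netdevice.h>
#include <net/sch_generic.h>

/* Simplified sketch of the default-qdisc attach path after the series:
 * devices flagged IFF_NO_QUEUE get the registered noqueue qdisc, all
 * others get the configured default qdisc, independent of the value of
 * tx_queue_len.
 */
static void attach_one_default_qdisc(struct net_device *dev,
				     struct netdev_queue *dev_queue,
				     void *_unused)
{
	const struct Qdisc_ops *ops = default_qdisc_ops;
	struct Qdisc *qdisc;

	if (dev->priv_flags & IFF_NO_QUEUE)
		ops = &noqueue_qdisc_ops;

	qdisc = qdisc_create_dflt(dev_queue, ops, TC_H_ROOT);
	if (!qdisc) {
		netdev_info(dev, "activation failed\n");
		return;
	}
	dev_queue->qdisc_sleeping = qdisc;
}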
Coincidentally, yesterday evening someone reported a problem with the backported series:
 http://www.gossamer-threads.com/lists/linux/kernel/2372160

Therefore, blocking this BZ until a solution has been found.
A fix has been accepted upstream; internal discussion about the right solution for RHEL7 is in progress[1].

[1] http://post-office.corp.redhat.com/archives/rhkernel-list/2016-February/msg01978.html
Backported my solution from upstream, seems OK for RHEL7.
Patch(es) available on kernel-3.10.0-395.el7
Hi Shuang,

It is a bit strange that the throughput is higher in the error case, but I'm not sure how HTB's bandwidth calculation reacts to a queue holding only two packets, so maybe that's expected.

If you want to reproduce the actual issue (starvation), you have to use a qdisc with no zero-qlen workaround, like TBF (as mentioned above). Something like this might work:

# tc qdisc add dev eth1.90 root handle 1: tbf rate 500mbit

Cheers, Phil
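For what it's worth, the "queue holding only two packets" most likely refers to HTB's zero-qlen workaround, one of the special cases removed upstream by 348e343. Roughly (an editor's paraphrase of the pre-fix behavior, with a placeholder helper name, not the actual function layout):

#include <net/sch_generic.h>

/* Paraphrase of HTB's pre-348e343 zero-qlen workaround: the direct
 * queue length is inherited from tx_queue_len but clamped to at least
 * two packets, which is why HTB on a zero-qlen virtual device limps
 * along instead of starving outright like a qdisc without such a
 * workaround (e.g. TBF).
 */
static u32 htb_direct_qlen_sketch(struct Qdisc *sch)
{
	u32 direct_qlen = qdisc_dev(sch)->tx_queue_len;

	if (direct_qlen < 2)	/* some devices have zero tx_queue_len */
		direct_qlen = 2;

	return direct_qlen;
}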
Created attachment 1203696 [details]
qlen_reproducer.sh
Hi Shuang,

Sorry for all the confusion. Looks like my advice to use TBF wasn't helpful at all! Instead I went ahead and created a reproducer, which I tested successfully in my RHEL7 VM.

First with kernel-3.10.0-327.el7:

# bash /vmshare/reproducer/qlen_reproducer.sh
Starting netserver with host 'IN(6)ADDR_ANY' port '12865' and family AF_UNSPEC
--- good test ---
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.1 () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    MBytes/sec

 87380  16384  16384    10.00      2694.50
--- bad test ---
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.1 () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    MBytes/sec

 87380  16384  16384    10.24         0.40
--- fixed test ---
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.1 () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    MBytes/sec

 87380  16384  16384    10.03        52.36

Note the reduced bandwidth in "bad test" and the correct bandwidth shaping in "fixed test".

Next I tried with the current RHEL7.3 kernel, namely kernel-3.10.0-505.el7:

# bash /vmshare/reproducer/qlen_reproducer.sh
Starting netserver with host 'IN(6)ADDR_ANY' port '12865' and family AF_UNSPEC
--- good test ---
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.1 () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    MBytes/sec

 87380  16384  16384    10.00      2666.09
--- bad test ---
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.1 () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    MBytes/sec

 87380  16384  16384    10.03        52.47
--- fixed test ---
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.1 () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    MBytes/sec

 87380  16384  16384    10.03        52.33

Note the correct traffic shaping in "bad test".

HTH, Phil
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2016-2574.html