Red Hat Bugzilla – Bug 1152231
qdisc: address unexpected behavior when attaching qdisc to virtual device
Last modified: 2016-11-04 07:08:59 EDT
Description of problem:

By design, virtual devices run without a qdisc attached ("noqueue"), for obvious performance reasons: the underlying real device is the only mechanism that can push back (NETDEV_TX_BUSY or netif_xmit_frozen_or_stopped()). Thus, all the qdisc locking can be avoided, given that no queue can ever exist (see __dev_queue_xmit).

There are valid use cases for attaching a qdisc to a virtual device (e.g. policy-limiting a virtual machine), and nothing stops userspace from doing so. The problem is that attaching a qdisc to a virtual device results in unexpected behavior depending on the leaf qdiscs used (especially for the fall-through to the default qdisc, or qdiscs inheriting dev->tx_queue_len). The unexpected behavior is that traffic seems to flow normally until the configured qdisc limit is hit, at which point packets are dropped prematurely.

The main problem: this unexpected behavior and lack of feedback is clearly a broken interface between the kernel and userspace. The workaround for userspace is simply to change the interface's "txqueuelen" to something non-zero before attaching a qdisc. This is a subtle and unreasonable requirement. The only way forward is to fix the kernel to act reasonably and not break these kinds of use cases in subtle ways.

The cause of this problem is the double meaning of dev->tx_queue_len inside the kernel. Before the device is brought up, while it still has a "noop" qdisc, tx_queue_len=0 indicates that this is a (virtual) device that supports the "noqueue" qdisc. After the device is up, this value is not really used. When a qdisc is attached, tx_queue_len takes effect again, this time inherited and used as the default queue length limit. Needless to say, a qdisc with a queue length limit of zero is not very useful as a queue.

This issue was discovered for Docker containers:
 https://github.com/docker/libcontainer/pull/193

After a long discussion and a rejected kernel patch:
 http://thread.gmane.org/gmane.linux.network/333349/focus=333456
we agreed that this behavior is a kernel bug, and the kernel needs to address this non-intuitive behavior. Userspace should not be forced to jump through hoops to get the expected behavior.

Version-Release number of selected component (if applicable):

How reproducible: 100%

Steps to Reproduce:

1. Create a VLAN device:

 export VLAN=100
 export DEV=eth1
 ip link add link ${DEV} name ${DEV}.${VLAN} type vlan id ${VLAN}
 ip addr add 192.168.123.1/24 dev ${DEV}.${VLAN}
 ip link set dev ${DEV}.${VLAN} up

2. Attach an HTB qdisc to the VLAN device and test throughput:

 tc qdisc del dev ${DEV}.${VLAN} root
 tc qdisc add dev ${DEV}.${VLAN} root handle 1: htb default 1
 tc class add dev ${DEV}.${VLAN} parent 1:0 classid 1:1 htb rate 500Mbit ceil 500Mbit
 netperf -H 192.168.123.2 -t TCP_STREAM

3. Make it work by adjusting dev->tx_queue_len:

 ifconfig ${DEV}.${VLAN} txqueuelen 1000
 tc qdisc del dev ${DEV}.${VLAN} root
 tc qdisc add dev ${DEV}.${VLAN} root handle 1: htb default 1
 tc class add dev ${DEV}.${VLAN} parent 1:0 classid 1:1 htb rate 500Mbit ceil 500Mbit

Actual results: Too low bandwidth throughput.

Expected results: Bandwidth shaping to the configured speed.
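To make the tx_queue_len double meaning concrete, here is a simplified sketch (an editor's illustration modeled on the classic fifo qdisc, not verbatim kernel code) of how a qdisc created without an explicit limit inherits the device's tx_queue_len:

#include <net/netlink.h>
#include <net/pkt_sched.h>
#include <net/sch_generic.h>

/* Simplified sketch, not verbatim kernel code: a classic fifo qdisc
 * created without an explicit "limit" parameter falls back to the
 * device's tx_queue_len. On a virtual device still carrying the
 * "noqueue" marker value of 0, the resulting queue can hold almost
 * nothing, so packets are dropped as soon as the qdisc is attached.
 */
static int fifo_init_sketch(struct Qdisc *sch, struct nlattr *opt)
{
	if (opt == NULL) {
		/* No limit given by userspace: inherit tx_queue_len.
		 * Some qdiscs papered over the zero case with a tiny
		 * minimum ("?: 1"); others did not and starved.
		 */
		sch->limit = qdisc_dev(sch)->tx_queue_len ? : 1;
	} else {
		struct tc_fifo_qopt *ctl = nla_data(opt);

		sch->limit = ctl->limit;
	}
	return 0;
}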
The core framework for a solution I submitted earlier has been accepted:
 http://www.spinics.net/lists/netdev/msg339268.html

I followed up on this with a patch series adjusting drivers; 802.1q is among them:
 https://www.mail-archive.com/netdev@vger.kernel.org/msg74403.html

The latter series is currently under review.
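For illustration, the driver-side conversion the series performs boils down to the following (a hedged sketch modeled on those conversions; virt_dev_setup is a placeholder name, not an actual driver function):

#include <linux/etherdevice.h>
#include <linux/netdevice.h>

/* Sketch of the per-driver change (simplified, placeholder function
 * name): the virtual device stops abusing tx_queue_len = 0 as a
 * "no queue" marker and sets the dedicated priv_flag instead, so the
 * scheduler can tell "wants no queue" apart from "has a zero queue
 * limit".
 */
static void virt_dev_setup(struct net_device *dev)
{
	ether_setup(dev);

	/* before the series: dev->tx_queue_len = 0; */
	dev->priv_flags |= IFF_NO_QUEUE;
}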
This is the list of upstream commits which should be backported:

fa8187c net: declare new net_device priv_flag IFF_NO_QUEUE
4b46995 net: sch_generic: react upon IFF_NO_QUEUE flag
2e659c0 net: 8021q: convert to using IFF_NO_QUEUE
ccecb2a net: bridge: convert to using IFF_NO_QUEUE
4afbc0d net: 6lowpan: convert to using IFF_NO_QUEUE
0a5f107 net: dsa: convert to using IFF_NO_QUEUE
cdf7370 net: batman-adv: convert to using IFF_NO_QUEUE
9ad09c5 net: hsr: convert to using IFF_NO_QUEUE
4676a15 net: caif: convert to using IFF_NO_QUEUE
906470c net: warn if drivers set tx_queue_len = 0
348e343 net: sched: drop all special handling of tx_queue_len == 0
f84bb1e net: fix IFF_NO_QUEUE for drivers using alloc_netdev
db4094b net: sched: ignore tx_queue_len when assigning default qdisc
d66d6c3 net: sched: register noqueue qdisc
3e692f2 net: sched: simplify attach_one_default_qdisc()
Oops, missed a bunch:

(In reply to Phil Sutter from comment #4)
> This is the list of upstream commits which should be backported:
>
> fa8187c net: declare new net_device priv_flag IFF_NO_QUEUE
> 4b46995 net: sch_generic: react upon IFF_NO_QUEUE flag

02f01ec net: veth: enable noqueue operation by default
ff42c02 net: dummy: convert to using IFF_NO_QUEUE
ed961ac net: geneve: convert to using IFF_NO_QUEUE
e65db2b net: loopback: convert to using IFF_NO_QUEUE
85773a6 net: nlmon: convert to using IFF_NO_QUEUE
22e380a net: team: convert to using IFF_NO_QUEUE
22dba39 net: vxlan: convert to using IFF_NO_QUEUE

> 2e659c0 net: 8021q: convert to using IFF_NO_QUEUE
> ccecb2a net: bridge: convert to using IFF_NO_QUEUE
> 4afbc0d net: 6lowpan: convert to using IFF_NO_QUEUE
> 0a5f107 net: dsa: convert to using IFF_NO_QUEUE
> cdf7370 net: batman-adv: convert to using IFF_NO_QUEUE
> 9ad09c5 net: hsr: convert to using IFF_NO_QUEUE
> 4676a15 net: caif: convert to using IFF_NO_QUEUE
> 906470c net: warn if drivers set tx_queue_len = 0
> 348e343 net: sched: drop all special handling of tx_queue_len == 0
> f84bb1e net: fix IFF_NO_QUEUE for drivers using alloc_netdev
> db4094b net: sched: ignore tx_queue_len when assigning default qdisc
> d66d6c3 net: sched: register noqueue qdisc
> 3e692f2 net: sched: simplify attach_one_default_qdisc()
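The scheduler-side half of the series (4b46995, d66d6c3 and 3e692f2 above) keys the default-qdisc decision off the new flag instead of tx_queue_len == 0. Roughly, as a simplified sketch rather than the verbatim upstream code:

#include <linux/netdevice.h>
#include <net/sch_generic.h>

/* Simplified sketch of the default-qdisc attach path after the series:
 * devices flagged IFF_NO_QUEUE get the registered noqueue qdisc, all
 * others get the configured default qdisc, independent of the value of
 * tx_queue_len.
 */
static void attach_one_default_qdisc(struct net_device *dev,
				     struct netdev_queue *dev_queue,
				     void *_unused)
{
	const struct Qdisc_ops *ops = default_qdisc_ops;
	struct Qdisc *qdisc;

	if (dev->priv_flags & IFF_NO_QUEUE)
		ops = &noqueue_qdisc_ops;

	qdisc = qdisc_create_dflt(dev_queue, ops, TC_H_ROOT);
	if (!qdisc) {
		netdev_info(dev, "activation failed\n");
		return;
	}
	dev_queue->qdisc_sleeping = qdisc;
}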
Coincidentally, yesterday evening someone reported a problem with the backported series:
 http://www.gossamer-threads.com/lists/linux/kernel/2372160

Therefore, blocking this BZ until a solution has been found.
A fix has been accepted upstream; internal discussion about the right solution for RHEL7 is in progress[1].

[1] http://post-office.corp.redhat.com/archives/rhkernel-list/2016-February/msg01978.html
Backported my solution from upstream, seems OK for RHEL7.
Patch(es) available on kernel-3.10.0-395.el7
Hi Shuang,

It is a bit strange that the throughput is higher in the error case, but I'm not sure how HTB's bandwidth calculation reacts to a queue holding only two packets, so maybe that's expected.

If you want to reproduce the actual issue (starvation), you have to use a qdisc with no zero-qlen workaround, like TBF (as mentioned above). Something like this might work:

# tc qdisc add dev eth1.90 root handle 1: tbf rate 500mbit

Cheers, Phil
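For what it's worth, the "queue holding only two packets" most likely refers to HTB's zero-qlen workaround, one of the special cases removed upstream by 348e343. Roughly (an editor's paraphrase of the pre-fix behavior, with a placeholder helper name, not the actual function layout):

#include <net/sch_generic.h>

/* Paraphrase of HTB's pre-348e343 zero-qlen workaround: the direct
 * queue length is inherited from tx_queue_len but clamped to at least
 * two packets, which is why HTB on a zero-qlen virtual device limps
 * along instead of starving outright like a qdisc without such a
 * workaround (e.g. TBF).
 */
static u32 htb_direct_qlen_sketch(struct Qdisc *sch)
{
	u32 direct_qlen = qdisc_dev(sch)->tx_queue_len;

	if (direct_qlen < 2)	/* some devices have zero tx_queue_len */
		direct_qlen = 2;

	return direct_qlen;
}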
Created attachment 1203696 [details]
qlen_reproducer.sh
Hi Shuang,

Sorry for all the confusion. Looks like my advice to use TBF wasn't helpful at all! Instead I went ahead and created a reproducer, which I tested successfully in my RHEL7 VM.

First with kernel-3.10.0-327.el7:

# bash /vmshare/reproducer/qlen_reproducer.sh
Starting netserver with host 'IN(6)ADDR_ANY' port '12865' and family AF_UNSPEC
--- good test ---
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.1 () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    MBytes/sec

 87380  16384  16384    10.00      2694.50
--- bad test ---
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.1 () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    MBytes/sec

 87380  16384  16384    10.24         0.40
--- fixed test ---
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.1 () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    MBytes/sec

 87380  16384  16384    10.03        52.36

Note the reduced bandwidth in "bad test" and the correct bandwidth shaping in "fixed test".

Next I tried with the current RHEL7.3 kernel, namely kernel-3.10.0-505.el7:

# bash /vmshare/reproducer/qlen_reproducer.sh
Starting netserver with host 'IN(6)ADDR_ANY' port '12865' and family AF_UNSPEC
--- good test ---
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.1 () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    MBytes/sec

 87380  16384  16384    10.00      2666.09
--- bad test ---
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.1 () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    MBytes/sec

 87380  16384  16384    10.03        52.47
--- fixed test ---
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.1 () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    MBytes/sec

 87380  16384  16384    10.03        52.33

Note the correct traffic shaping in "bad test".

HTH, Phil
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2016-2574.html