Bug 668934
Summary: | UDP transmit under VLAN causes guest freeze | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Douglas Schilling Landgraf <dougsland> |
Component: | kernel | Assignee: | Paolo Bonzini <pbonzini> |
Status: | CLOSED ERRATA | QA Contact: | Liang Zheng <lzheng> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 5.5.z | CC: | agospoda, cye, dnelson, drjones, dtian, hjia, jarod, jentrena, jtluka, kzhang, leiwang, mrezanin, pbonzini, qcai, qwan, skito, tburke, xen-maint, yuzhang |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2011-07-21 09:22:06 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 514489 |
Description
Douglas Schilling Landgraf
2011-01-12 04:44:55 UTC
Assigning to myself for triaging.

I can reproduce it with a tg3.

On the machine running netperf:

xentop output for VLANs:

    Domain-0    -----r 78 91.4 1048764 12.5 no limit n/a
    rhel55-64pv --b--- 56 90.5 1048192 12.5 1048576 12.5

xentop output for no VLANs:

    Domain-0    -----r 87 36.0 1048612 12.5 no limit n/a
    rhel55-64pv --b--- 66 37.1 1048216 12.5 1048576 12.5

and CPU utilization is 3% without VLANs, 97% with.

On the machine running netserver:

VLAN:

    Domain-0    -----r 125 50.7 1048756 33.3 no limit n/a
    rhel55-64pv -----r 100 60.5 1048056 33.3 1048576 33.3

no VLAN:

    Domain-0    -----r 127 12.5 1048628 33.3 no limit n/a
    rhel55-64pv --b--- 102 13.9 1048136 33.3 1048576 33.3

but the CPU utilization here is 6% with VLANs and 1% without (so no substantial variation).

systemtap shows a vastly higher number of event channel notifications from dom0 to domU. The first number is the number of calls to force_evtchn_callback, the second is the number of calls to evtchn_do_upcall, the third is the number of packets sent:

with vlans:

    42493 77045 100000
    84276 152564 200000
    128419 229934 300000
    172263 307474 400000

without vlans:

    163 2223 100000
    199 4067 200000
    236 5735 300000
    282 7378 400000

systemtap script:

    global force_evtchn_callback, sys_sendto, evtchn_do_upcall
    probe kernel.function("evtchn_do_upcall").call { evtchn_do_upcall++ }
    probe kernel.function("force_evtchn_callback").call { force_evtchn_callback++ }
    probe kernel.function("sys_sendto").call {
      sys_sendto++
      if (sys_sendto % 100000 == 0)
        printf ("%d %d %d\n",force_evtchn_callback, evtchn_do_upcall, sys_sendto)
    }

Similar results on the host show that notify_remote_via_irq is called more often in the vlan case. The columns are the number of calls to notify_remote_via_irq, the number of calls to evtchn_do_upcall, and the number of calls to dev_queue_xmit:

vlans:

    8630 37606 100000
    17145 74748 200000
    25441 111629 300000
    33666 148509 400000

no vlans:

    932 50749 100000
    1850 101209 200000
    2766 151573 300000
    3679 202078 400000

script:

    global notify_remote_via_irq, dev_queue_xmit, evtchn_do_upcall
    probe kernel.function("notify_remote_via_irq").call { notify_remote_via_irq++ }
    probe kernel.function("evtchn_do_upcall").call { evtchn_do_upcall++ }
    probe kernel.function("dev_queue_xmit").call {
      dev_queue_xmit++
      if (dev_queue_xmit % 100000 == 0)
        printf ("%d %d %d\n",notify_remote_via_irq, evtchn_do_upcall, dev_queue_xmit)
    }

Profiling also shows skb_copy_bits relatively high in the profile, but it doesn't show in the no-vlan case, so we're hitting a different code path.

Even better results with s/evtchn_do_upcall/skb_copy_bits/g from the script in comment 9. The columns are now notify_remote_via_irq, skb_copy_bits, and dev_queue_xmit:

vlans:

    7148 49996 100000
    13937 99994 200000
    20715 149993 300000
    27562 199991 400000

no vlans:

    926 33 100000
    1823 40 200000
    2718 45 300000
    3604 50 400000

More systemtap...

    global skb_copy_bits
    probe kernel.function("skb_copy_bits").call {
      skb_copy_bits++
      if (skb_copy_bits % 100000 == 0)
        print_stack(backtrace())
    }

shows:

    skb_copy_bits
    __pskb_pull_tail
    dev_queue_xmit+0x1c2

This is the second call to __pskb_pull_tail in dev_queue_xmit...

    0xffffffff80230d3b <dev_queue_xmit+391>: mov 0x8c(%rbp),%esi
    0xffffffff80230d41 <dev_queue_xmit+397>: mov %rbp,%rdi
    0xffffffff80230d44 <dev_queue_xmit+400>: callq 0xffffffff8041f5a7 <__pskb_pull_tail>
    ...
    0xffffffff80230d68 <dev_queue_xmit+436>: mov 0x8c(%rbp),%esi
    0xffffffff80230d6e <dev_queue_xmit+442>: mov %rbp,%rdi
    0xffffffff80230d71 <dev_queue_xmit+445>: callq 0xffffffff8041f5a7 <__pskb_pull_tail>
    0xffffffff80230d76 <dev_queue_xmit+450>: ...

and comparison with the source shows that this __pskb_pull_tail call is really __skb_linearize:

    /* Fragmented skb is linearized if device does not support SG,
     * or if at least one of fragments is in highmem and device
     * does not support DMA from it.
     */
    if (skb_shinfo(skb)->nr_frags &&
        (!(dev->features & NETIF_F_SG) || illegal_highdma(dev, skb)) &&
        __skb_linearize(skb))
            goto out_kfree_skb;

Indeed, peth0.100 does not support scatter-gather:

    # ethtool -k peth0.100
    Offload parameters for peth0.100:
    Cannot get device tx csum settings: Operation not supported
    Cannot get device scatter-gather settings: Operation not supported
    Cannot get device udp large send offload settings: Operation not supported
    rx-checksumming: on
    tx-checksumming: off
    scatter-gather: off

Changing component, but keeping the bug.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

systemtap output with the patch looks much more like the output without VLANs, especially wrt skb_copy_bits:

on the guest (force_evtchn_callback, evtchn_do_upcall, packets):

    381 1645 100000
    724 3326 200000
    1245 5514 300000
    1637 7288 400000

on the host (notify_remote_via_irq, skb_copy_bits, packets):

    684 6 100000
    1444 10 200000
    2297 15 300000
    3027 19 400000
    3787 25 500000

Patch(es) available in kernel-2.6.18-254.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html
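
For readers following the analysis, the linearization test quoted from dev_queue_xmit can be modelled in userspace. The following is only a sketch under stated assumptions: the `model_*` structs, fields, and functions and the `MODEL_NETIF_F_SG` constant are hypothetical stand-ins invented for illustration, not kernel definitions, and the model ignores everything dev_queue_xmit does apart from this one condition. It shows why a fragmented skb sent through a device without scatter-gather (as ethtool reports for peth0.100) always takes the copying path, while the same skb on an SG-capable device does not.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's NETIF_F_SG feature bit. */
#define MODEL_NETIF_F_SG 0x1u

/* Hypothetical, heavily reduced views of a net device and an skb. */
struct model_dev {
    unsigned int features;   /* feature bits advertised by the device */
    bool can_dma_highmem;    /* assumption: device can DMA from highmem */
};

struct model_skb {
    unsigned int nr_frags;   /* number of page fragments */
    bool has_highmem_frag;   /* at least one fragment lives in highmem */
};

/* Mirrors the role of illegal_highdma() in the quoted condition. */
bool model_illegal_highdma(const struct model_dev *dev,
                           const struct model_skb *skb)
{
    return skb->has_highmem_frag && !dev->can_dma_highmem;
}

/* True when the quoted condition would go on to call __skb_linearize(). */
bool model_needs_linearize(const struct model_dev *dev,
                           const struct model_skb *skb)
{
    return skb->nr_frags != 0 &&
           (!(dev->features & MODEL_NETIF_F_SG) ||
            model_illegal_highdma(dev, skb));
}

/* Small self-check covering the cases discussed in this report. */
void model_self_check(void)
{
    struct model_dev no_sg = { 0u, true };               /* like peth0.100 */
    struct model_dev sg = { MODEL_NETIF_F_SG, true };
    struct model_skb fragmented = { 3u, false };
    struct model_skb linear = { 0u, false };

    assert(model_needs_linearize(&no_sg, &fragmented));  /* copy on every packet */
    assert(!model_needs_linearize(&sg, &fragmented));    /* fragments passed through */
    assert(!model_needs_linearize(&no_sg, &linear));     /* nothing to linearize */
}
```

Calling `model_self_check()` from a test harness exercises the three cases. In the model, the only difference between the first two is the scatter-gather bit, which matches the observation that skb_copy_bits is called for roughly one packet in two with VLANs and almost never without.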