Bug 554078
| Field | Value |
|---|---|
| Summary | Lost the network in a KVM VM on top of 5.4 |
| Product | Red Hat Enterprise Linux 5 |
| Component | kernel |
| Version | 5.4 |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | high |
| Reporter | Herbert Xu <herbert.xu> |
| Assignee | Herbert Xu <herbert.xu> |
| QA Contact | Red Hat Kernel QE team <kernel-qe> |
| CC | bruno.cornec, cward, david.jericho, herbert.xu, jean-marc.andre, khong, llim, markmc, mwagner, nsprei, orenault, riek, syeghiay, tburke, todayyang, virt-maint, ykaul |
| Target Milestone | rc |
| Hardware | All |
| OS | Linux |
| Doc Type | Bug Fix |
| Clone Of | 524651 |
| Clones | 589766, 589897 |
| Last Closed | 2010-03-30 07:15:57 UTC |
| Bug Depends On | 524651 |
| Bug Blocks | 528898, 589766, 589897 |

Description
Herbert Xu
2010-01-10 11:26:38 UTC
This bug will be used to deal with the RX component of the problem while the original will be for TX only.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Created attachment 386393 [details]
virtio: net refill on out-of-memory
This is a back-port of
virtio: net refill on out-of-memory
If we run out of memory, use keventd to fill the buffer. There's a
report of this happening: "Page allocation failures in guest",
Message-ID: <20090713115158.0a4892b0.eu>
Signed-off-by: Rusty Russell <rusty.au>
Signed-off-by: David S. Miller <davem>
Created attachment 388497 [details]
socket test programs (srv.c clt.c)
srv.c -- server end
clt.c -- client end
README -- read me file
srv, clt -- binary executable on x86_64
Summary: Guest still lost network with virtio net, while it worked fine with e1000 net.

Steps:
1. Boot a guest with the CLI listed below.
2. Check your guest network using ifconfig and make sure it works well (here we mark the guest IP as $guest_ip).
3. Run ./srv on the guest (comment 4 attachment).
4. Run multiple "./stress.sh $guest_ip" from elsewhere, i.e. on other hosts, until no more connections can be established. Note that ./stress.sh calls the clt program, trying to establish 500 connections to srv.
5. ping $guest_ip to see the network status.

Additionally:
6. Run ./clear_clt.sh on the client ends to kill all "./clt" processes.
7. ping $guest_ip again to see the result. (If possible, you may kill all ./srv processes inside the guest, and ping again.)

Expected results: after step 5 and step 7, the guest network stays alive.

CLI:
/usr/libexec/qemu-kvm -m 768M -smp 2 -drive file=RHEL5.4-64-4k.qcow2,if=virtio,cache=off,boot=on -net nic,model=virtio,vlan=1,macaddr=76:00:40:3F:20:10 -net tap,vlan=1,script=/etc/qemu-ifup -boot c -uuid 17644ecc-d3a1-4d3c-a386-12daf50015f1 -usbdevice tablet -no-hpet -rtc-td-hack -no-kvm-pit-reinjection -monitor stdio -notify all -cpu qemu64,+sse2 -balloon none -startdate now -vnc :1 -name 176-guest1

Actual results:

host: 2.6.18-164.10.1, kvm-83-105.el5_4.19

| guest | net model | connections | network status |
|---|---|---|---|
| 2.6.18-164.11.1.el5PAE | virtio | 2187 | lost |
| 2.6.18-185.el5 x86_64 | virtio | 1220 | lost |

Note: the lost network could be brought up again by ifdown, ifup.

host: 2.6.18-186 x86_64, kvm-83-155.el5

| guest | net model | connections | network status |
|---|---|---|---|
| 2.6.18-185.el5 x86_64 | virtio | 1178 | lost |
| 2.6.18-185.el5 x86_64 | e1000 | 3574 | ok |

Note: with e1000, even when no more connections could be made to the "srv" program running inside the guest, we could still ping the guest.

[root@dhcp-91-175 ~]# ping 10.66.91.51
PING 10.66.91.51 (10.66.91.51) 56(84) bytes of data.
64 bytes from 10.66.91.51: icmp_seq=1 ttl=64 time=58.1 ms
64 bytes from 10.66.91.51: icmp_seq=2 ttl=64 time=33.4 ms
64 bytes from 10.66.91.51: icmp_seq=3 ttl=64 time=21.0 ms
64 bytes from 10.66.91.51: icmp_seq=4 ttl=64 time=1.40 ms
64 bytes from 10.66.91.51: icmp_seq=5 ttl=64 time=91.5 ms
64 bytes from 10.66.91.51: icmp_seq=6 ttl=64 time=0.341 ms
64 bytes from 10.66.91.51: icmp_seq=7 ttl=64 time=0.807 ms
...

Thanks for testing. Please let me know whether this problem still exists after applying the patch in this bugzilla entry plus the patch in the bug from which this is cloned.

(In reply to comment #5)
> Summary:
> Guest still lost network with virtio net, while worked fine with e1000 net .

Keqin, were your tests run with Herbert's fix? Do you need a new rpm for the guest kernel?

Adjusting: the Needinfo flag had been set to the wrong person.
llim->Herbert: could you please provide us with a scratch build of the patch attached in Bugzilla?
llim->sly: the bug will be updated before Mon, 22 Feb, after the holiday in China, once the scratch build from Herbert is available.

Sorry, but I have no time to produce a scratch build. Someone else will need to take care of this. Thanks!

Here's a link to the brew build: https://brewweb.devel.redhat.com/taskinfo?taskID=2265105
Please let me know if there are any issues.

Tested with guest kernel-2.6.18-189.el5.x86_64, but the virtio-net network was still lost. The test methods and results were similar to comment #5.

(In reply to comment #5)
> Keqin, were your tests run with Herbert's fix? Do you need a new rpm for the guest kernel?

"Patch24891: linux-2.6-net-virtio_net-fix-tx-wakeup-race-condition.patch" is included as of kernel 2.6.18-184.el5, but I couldn't see the patch "virtio: net refill on out-of-memory" (see comment #3).

Keqin->Naphtali: has the patch "virtio: net refill on out-of-memory" (see comment #3) been applied?

Created attachment 399313 [details]
virtio: net refill on out-of-memory
As fixing cancel_rearming_delayed_work in RHEL5 is non-trivial, and in order to maintain the ability to unload the virtio_net module, I'm switching the refill work to a timer.
Created attachment 399317 [details]
virtio: net refill on out-of-memory
The last version was bogus as we can't sleep in timers. This one simply uses the normal poll path to do the refill.
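
The idea of refilling from the normal poll path can be illustrated with a toy user-space model. This is only a sketch under assumptions, not the attached patch or the virtio_net driver code: the names `RxRing`, `try_fill`, `poll` and the `need_repoll` flag are hypothetical, and memory pressure is simulated by a callable that reports whether an allocation succeeds.

```python
class RxRing:
    """Toy model of a receive ring whose buffer allocations may fail (hypothetical)."""

    def __init__(self, size, alloc):
        self.size = size      # number of buffers the ring holds when full
        self.posted = 0       # buffers currently available for incoming packets
        self.alloc = alloc    # callable: True if an allocation succeeds, False on OOM

    def try_fill(self):
        """Post buffers until the ring is full; return False if an allocation failed."""
        while self.posted < self.size:
            if not self.alloc():
                return False  # out of memory: ring is left partially filled
            self.posted += 1
        return True

    def poll(self, incoming):
        """Receive up to `incoming` packets, then retry the refill on the same path.

        Returns (received, need_repoll). need_repoll=True models the driver
        asking to be polled again so that the refill is retried later.
        """
        received = min(incoming, self.posted)  # packets need a posted buffer
        self.posted -= received
        filled = self.try_fill()
        return received, not filled


if __name__ == "__main__":
    memory = {"ok": True}
    ring = RxRing(4, lambda: memory["ok"])
    ring.try_fill()
    memory["ok"] = False          # simulate a temporary OOM condition
    print(ring.poll(4))           # buffers are consumed, refill fails
    print(ring.poll(10))          # nothing can be received while the ring is empty
    memory["ok"] = True           # memory becomes available again
    print(ring.poll(0))           # the next poll refills the ring
    print(ring.poll(3))           # reception resumes
```

In this model an empty ring receives nothing, which mirrors the "lost network" symptom, and reception resumes once a later poll manages to refill the ring, without any sleeping work item or timer.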
Tested on guest kernel 2.6.18-193.el5: a temporary OOM condition only brought the virtio network down briefly, and it was restored afterwards (steps are similar to comment 5).

in kernel-2.6.18-194.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5
Please update the appropriate value in the Verified field (cf_verified) to indicate this fix has been successfully verified. Include a comment with verification details.

I've been running 2.6.18-194.el5 on x86_64 for over 24 hours now with no repeat of the problems mentioned in this bug. Previously they'd appear within 10 minutes of the host starting service.

I'm not sure if it's related, as I can't see any obvious changes in the patch attached, but I'll report it anyway. Under 2.6.18-194.el5 on the guest, ethernet frames larger than 4096 bytes won't make it to the guest when using the e1000 interface type. After rebooting into the 2.6.18-164.el5 kernel, jumbo frames work correctly with the e1000. Jumbo frames under 2.6.18-194.el5 using the virtio interface for the guest work as expected. Watching traffic on the host bridge, the incoming packets are appearing, but the guest never sees them.

The 4096 byte frame limit was verified using ping: 4096 bytes - 20 for the IP header - 14 for the ethernet frame header - 8 for the ICMP header gives a 4054 byte maximum payload. ping -M do -s 4055 <jumbo set router interface> fails with the e1000 interface type.

We were using e1000 and virtio interface types for dual interfaced guests as it seemed to help delay the onset of this bug.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2010-0178.html

Hi guys,
1. I cannot get into bug 528898, so I am just updating here.
2. Maybe this issue is solved by http://lists.gnu.org/archive/html/qemu-devel/2012-04/msg03587.html, but I am not sure.
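
The ping payload arithmetic from the jumbo-frame report above (4096 - 20 - 14 - 8 = 4054) can be restated as a short sketch. The function names are hypothetical, and the header sizes are the ones quoted in that comment, which assumes an IPv4 header without options and an untagged Ethernet frame.

```python
ETH_HEADER = 14   # Ethernet frame header, bytes (as quoted in the report)
IP_HEADER = 20    # IPv4 header without options, bytes
ICMP_HEADER = 8   # ICMP echo header, bytes


def max_ping_payload(frame_limit):
    """Largest `ping -s` value whose frame still fits in frame_limit bytes."""
    return frame_limit - ETH_HEADER - IP_HEADER - ICMP_HEADER


def frame_size(payload):
    """Frame size on the wire (headers as above) for a given ping payload."""
    return payload + ETH_HEADER + IP_HEADER + ICMP_HEADER


if __name__ == "__main__":
    limit = 4096
    print(max_ping_payload(limit))    # payload that exactly fills the limit
    print(frame_size(4055) > limit)   # whether -s 4055 exceeds the limit
```

With these header sizes, a payload of 4055 bytes yields a frame one byte over the 4096 byte limit, which matches the reported failure of `ping -M do -s 4055`.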