Bug 653262
Summary: [5.6 Regression] network is lost after balloon-up fails
Product: Red Hat Enterprise Linux 5
Reporter: Yufang Zhang <yuzhang>
Component: kernel-xen
Assignee: Laszlo Ersek <lersek>
Status: CLOSED ERRATA
QA Contact: Virtualization Bugs <virt-bugs>
Severity: high
Priority: high
Version: 5.6
CC: asilva, drjones, dwu, gao, jpirko, jzheng, leiwang, pbonzini, tao, xen-maint
Target Milestone: rc
Keywords: Regression
Hardware: Unspecified
OS: Linux
Whiteboard: regression
Doc Type: Bug Fix
Clones: 653501 653505
Last Closed: 2011-01-13 22:00:58 UTC
Bug Depends On: 653501
Bug Blocks: 514489
Description
Yufang Zhang
2010-11-15 05:33:00 UTC
Created attachment 460474: config file to create the guest
Created attachment 460475: xm dmesg log
Created attachment 460476: xend.log
Most likely culprit:

7c14912 [virt] xen: don't give up ballooning under mem pressure

If a -221 kernel works (this patch is in -222), then that would give my accusation more weight.

I could reproduce the problem with host -231, guest -222: I was able to ssh into the guest and started to type "uname -r" to verify I was running -222. I didn't get past "una", and then "ping" stopped working too.

I checked with -221, and the problem is gone. I think Andrew is right. I'll try to revert the patch.

Unfortunately, I could reproduce the bug, though less reliably, also with -221.

I reverted 934a6bd ("[virt] xen: remove dead code") and 7c14912 ("[virt] xen: don't give up ballooning under mem pressure") out of -231. (See http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2894621 once brew finishes; but I built it and tested it locally in parallel.)

Alas, the network froze again after I typed some commands in the guest over ssh.

I reproduced the bug again, this time with -222, because 7c14912 seems to expose the bug more aggressively.

Yesterday I noticed that whenever the network froze and I started the reboot sequence from virt-manager, the broadcast message from root (warning about the imminent reboot) also appeared through the frozen ssh session. So I suspected that only the netback->netfront (host to guest) direction was frozen, and the reverse direction still worked for whatever reason. So now I wanted to test this by freezing the connection, starting ping from the console, and checking the echo requests with tcpdump in the host.

However, this time when the network went down in the guest (which happened almost immediately after ballooning up), I couldn't even connect to the guest's console from virt-manager! Here's how I created a memory scarcity: the server has 8G physical RAM, 6G of that is assigned to dom0, and I have several guests sharing the remaining 1.5-2 G. I started just enough guests (including the one which froze) so that "xm info" in the host reported 14 M of free memory. I ballooned up the victim guest (from 512M to 1023M), which made the console unconnectable too, as said above. I still wanted to continue with the ping test, so I stopped a *different* guest (releasing 512M), to free some memory and perhaps make the console responsive again.

Not only did the console become responsive, the frozen ssh connection resurrected.

... Now that I'm looking again, the ssh connection is working, but the console connection is stuck with "Connecting to console for guest". "xm info" reports 15M of free memory. Thus the symptoms have reversed (ssh works, console does not).

I'll try to get a core dump to see where the guest is blocked when it seems frozen.

Quick update: the console *did* resurrect. It was stuck in "Connecting to console for guest" only because I had already connected to the console from a different virt-manager (from a different virtual desktop). User error, sorry.

(In reply to comment #7)
> I reproduced the bug again, this time with -222, because 7c14912 seems to
> expose the bug more aggressively.
>
> Yesterday I noticed that whenever the network froze and I started the reboot
> sequence from virt-manager, the broadcast message from root (warning about the
> imminent reboot) appeared also through the frozen ssh session. So I suspected
> that only the netback->netfront (host to guest) direction was frozen, and the
> reverse direction still worked for whatever reason. So now I wanted to test
> this, by freezing the connection, starting ping from the console, and checking
> the echo requests with tcpdump in the host.
>
> However, this time when the network went down in the guest (which happened
> almost immediately after ballooning up), I couldn't even connect to the guest's
> console from virt-manager! Here's how I created a memory scarcity: the server
> has 8G physical RAM, 6G of that is assigned to dom0, and I have several guests
> sharing the remaining 1.5-2 G. I started just enough guests (including the one
> which froze) so that "xm info" in the host reported 14 M of free memory. I
> ballooned up the victim guest (from 512M to 1023M), which made the console
> unconnectable too, as said above. I still wanted to continue with the ping
> test, so I stopped a *different* guest (releasing 512M), to free some memory
> and perhaps make the console responsive again.

Maybe you don't have to start several guests (>=2) to reproduce this bug. Starting a single guest is enough, as long as you turn off auto-ballooning in xend-config.sxp and make sure there is no free memory by ballooning up domain0. If you need more free memory, just balloon down domain0. That may be more convenient than starting several guests.

> Not only did the console become responsive, the frozen ssh connection
> resurrected.
>
> ... Now that I'm looking again, the ssh connection is working, but the console
> connection is stuck with "Connecting to console for guest". "xm info" reports
> 15M of free memory. Thus the symptoms have reversed (ssh works, console does
> not).
>
> I'll try to get a core dump to see where the guest is blocked when it seems
> frozen.

This is not really a regression; however, it is exacerbated by the patch that Andrew pinpointed, so we may want to treat it as such. The fix is to change netfront from flipping to copying. The copying code is already well tested as it is used by HVM guests.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
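The memory-scarcity setup described in the comments above comes down to simple arithmetic: the balloon-up request is much larger than what the hypervisor has free. The Python sketch below illustrates that calculation under stated assumptions. The `free_memory : N` line format is assumed to match what `xm info` prints, the sample text is illustrative rather than captured from the reporter's host, and the function names are hypothetical.

```python
import re


def parse_free_memory_mib(xm_info_text):
    """Return the free_memory value (MiB) from `xm info`-style text.

    Assumption: the output contains a line of the form 'free_memory : <int>'.
    """
    match = re.search(r"^free_memory\s*:\s*(\d+)\s*$", xm_info_text, re.MULTILINE)
    if match is None:
        raise ValueError("no free_memory line found")
    return int(match.group(1))


def balloon_shortfall_mib(current_mib, target_mib, free_mib):
    """MiB that a balloon-up request asks for beyond the hypervisor's free memory."""
    requested = target_mib - current_mib
    return max(0, requested - free_mib)


# Illustrative sample only; the value 14 is the figure quoted in the thread.
SAMPLE_XM_INFO = "free_memory            : 14\n"

free = parse_free_memory_mib(SAMPLE_XM_INFO)
shortfall = balloon_shortfall_mib(current_mib=512, target_mib=1023, free_mib=free)
print(f"free={free} MiB, requested={1023 - 512} MiB, shortfall={shortfall} MiB")
```

With the figures quoted in the thread (512M to 1023M with 14M free), the request exceeds free memory by 497 MiB, which is consistent with the balloon-up being unable to complete.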
Created attachment 461065: make copying the default
upstream status: c/s 1054:26562626c866
http://xenbits.xensource.com/linux-2.6.18-xen.hg?rev/26562626c866

in kernel-2.6.18-233.el5

You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

*** Bug 660315 has been marked as a duplicate of this bug. ***

*** Bug 663143 has been marked as a duplicate of this bug. ***

I have replayed the simple reproducer in the bug description on a machine with 8G of RAM installed. First I created a guest that leaves the system with about 1.5G of free memory. Then I created a testing guest with memory=1024 and memmax=2048. Trying to mem-set the guest to 2048 would give it only about 1500.

With -231 (both host and guest), this bug could be easily reproduced after the mem-set. The ssh session into the guest was lost, and ping did not work either.

Upgrading both host and guest to -238 made the problem go away. The network was still active after mem-set.

As a result I'm putting this into VERIFIED.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html

*** Bug 629213 has been marked as a duplicate of this bug. ***

*** Bug 646649 has been marked as a duplicate of this bug. ***
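The verification run described above (mem-set the guest, then check whether the network is still alive) could be scripted roughly as follows. This is only a sketch, not the procedure QA actually used: it assumes the `xm` toolstack and an iputils-style `ping` are available on the host, and the function names, the guest name, and the address in the usage note are hypothetical.

```python
import subprocess


def mem_set_command(domain, target_mib):
    """argv for ballooning a domain to target_mib via `xm mem-set`."""
    return ["xm", "mem-set", str(domain), str(target_mib)]


def ping_command(address, count=3, timeout_s=2):
    """argv for a short reachability probe (iputils-style options assumed)."""
    return ["ping", "-c", str(count), "-W", str(timeout_s), address]


def network_survives_balloon(domain, target_mib, address, run=subprocess.call):
    """Balloon the guest up, then report whether it still answers ping.

    `run` takes an argv list and returns the exit status; it is injectable
    so the logic can be exercised without a Xen host.
    """
    run(mem_set_command(domain, target_mib))
    return run(ping_command(address)) == 0
```

On a host, a call such as `network_survives_balloon("testguest", 2048, "192.0.2.10")` would be expected to return `False` with the affected kernels and `True` with the fixed ones, going by the results reported in this bug.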