Description of problem:
For a RHEL5 PV guest (kernel 2.6.18-231.el5xen), I try to ssh to it after I balloon it up. The network is lost when I ssh to it: I can neither ssh into it nor ping it, although I can still get the console of the guest. This problem only happens when auto-ballooning is turned off (which is the default value) and there is not enough free memory. In other situations, such as when auto-ballooning is on or there is enough free memory for the guest to balloon up, the issue is not triggered.

Version-Release number of selected component (if applicable):
Host:
xen-devel-3.0.3-117.el5
xen-libs-3.0.3-117.el5
kernel-xen-devel-2.6.18-231.el5
xen-3.0.3-117.el5
xen-debuginfo-3.0.3-117.el5
kernel-xen-2.6.18-231.el5
Guest:
2.6.18-231.el5xen

How reproducible:
Always

Steps to Reproduce:
1. Make sure auto-ballooning is turned off in xend.
# grep "balloon-dom0" /etc/xen/xend-config.sxp
(auto-balloon-dom0 no)
2. Create a RHEL5 PV guest with memory=512 and maxmem=1024
3. Make sure free memory is not enough for ballooning:
# xm info | grep free
free_memory            : 1
# xm li vm1
Name    ID  Mem(MiB)  VCPUs  State   Time(s)
vm1      1       511      4  -b----      8.9
# xm li vm1 -l | grep mem
(memory 512)
(shadow_memory 0)
(maxmem 1024)
4. Try to balloon up the guest
# xm mem-set vm1 900
5. Ping the guest after ballooning
# ping 10.66.93.117
PING 10.66.93.117 (10.66.93.117) 56(84) bytes of data.
64 bytes from 10.66.93.117: icmp_seq=1 ttl=64 time=0.252 ms
64 bytes from 10.66.93.117: icmp_seq=2 ttl=64 time=0.050 ms
64 bytes from 10.66.93.117: icmp_seq=3 ttl=64 time=0.053 ms
--- 10.66.93.117 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.050/0.118/0.252/0.094 ms
6. ssh to the guest after that

Actual results:
At step 6, you cannot ssh to the guest, and after that the network of the guest is lost. You cannot ping it any more, though the guest is still alive and you can get its console.

Expected results:
Everything works well after ballooning.
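The precondition in step 3 (free memory too small for the requested target) can be checked with a small shell sketch. It only does arithmetic on the numbers that "xm info" and "xm li" already print; the helper names free_mib_from_xm_info and balloon_shortfall are hypothetical and not part of any Xen tool.

```shell
#!/bin/sh
# Sketch only: helper names are made up for illustration.

# Read `xm info` output on stdin and print the free_memory value (MiB).
free_mib_from_xm_info() {
    awk -F: '$1 ~ /^free_memory[ \t]*$/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# balloon_shortfall CURRENT_MIB TARGET_MIB FREE_MIB
# Print how many MiB of the request cannot be covered by free memory
# (0 means the balloon-up should fit).
balloon_shortfall() {
    need=$(( $2 - $1 ))
    short=$(( need - $3 ))
    if [ "$short" -gt 0 ]; then
        echo "$short"
    else
        echo 0
    fi
}

# Usage on the Xen host, with the values from the steps above:
#   free=$(xm info | free_mib_from_xm_info)
#   balloon_shortfall 511 900 "$free"
```

A non-zero result means the host is in the state this report is about: the guest is asked to grow by more than the hypervisor can hand out.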
Additional info: No such issues are triggered when I downgrade the PV guest to the -194 kernel-xen packages, so I would consider this bug a regression.
Created attachment 460474: config file to create the guest
Created attachment 460475: xm dmesg log
Created attachment 460476: xend.log
Most likely culprit: 7c14912 ("[virt] xen: don't give up ballooning under mem pressure"). If a -221 kernel works (this patch is in -222), then that would give my accusation more weight.
I could reproduce the problem with host -231, guest -222. I was able to ssh into the guest, and started to type "uname -r" to verify I'm running -222. I didn't get past "una", and then "ping" stopped working too. I checked with -221, and the problem is gone. I think Andrew is right. I'll try to revert the patch.
Unfortunately, I could reproduce the bug, though less reliably, also with -221. I reverted 934a6bd ("[virt] xen: remove dead code") and 7c14912 ("[virt] xen: don't give up ballooning under mem pressure") out of -231. (See http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2894621 once brew finishes; but I built it and tested it locally in parallel.) Alas, the network froze again after I typed some commands in the guest over ssh.
I reproduced the bug again, this time with -222, because 7c14912 seems to expose the bug more aggressively.

Yesterday I noticed that whenever the network froze and I started the reboot sequence from virt-manager, the broadcast message from root (warning about the imminent reboot) appeared also through the frozen ssh session. So I suspected that only the netback->netfront (host to guest) direction was frozen, and the reverse direction still worked for whatever reason. So now I wanted to test this, by freezing the connection, starting ping from the console, and checking the echo requests with tcpdump in the host.

However, this time when the network went down in the guest (which happened almost immediately after ballooning up), I couldn't even connect to the guest's console from virt-manager! Here's how I created a memory scarcity: the server has 8G physical RAM, 6G of that is assigned to dom0, and I have several guests sharing the remaining 1.5-2G. I started just enough guests (including the one which froze) so that "xm info" in the host reported 14M of free memory. I ballooned up the victim guest (from 512M to 1023M), which made the console unconnectable too, as said above. I still wanted to continue with the ping test, so I stopped a *different* guest (releasing 512M), to free some memory and perhaps make the console responsive again.

Not only did the console become responsive, the frozen ssh connection resurrected.

... Now that I'm looking again, the ssh connection is working, but the console connection is stuck with "Connecting to console for guest". "xm info" reports 15M of free memory. Thus the symptoms have reversed (ssh works, console does not).

I'll try to get a core dump to see where the guest is blocked when it seems frozen.
Quick update: the console *did* resurrect. It was stuck in "Connecting to console for guest" only because I had already connected to the console from a different virt-manager (from a different virtual desktop). User error, sorry.
(In reply to comment #7)
> I reproduced the bug again, this time with -222, because 7c14912 seems to
> expose the bug more aggressively.
>
> Yesterday I noticed that whenever the network froze and I started the reboot
> sequence from virt-manager, the broadcast message from root (warning about the
> imminent reboot) appeared also through the frozen ssh session. So I suspected
> that only the netback->netfront (host to guest) direction was frozen, and the
> reverse direction still worked for whatever reason. So now I wanted to test
> this, by freezing the connection, starting ping from the console, and checking
> the echo requests with tcpdump in the host.
>
> However, this time when the network went down in the guest (which happened
> almost immediately after ballooning up), I couldn't even connect to the guest's
> console from virt-manager! Here's how I created a memory scarcity: the server
> has 8G physical RAM, 6G of that is assigned to dom0, and I have several guests
> sharing the remaining 1.5-2G. I started just enough guests (including the one
> which froze) so that "xm info" in the host reported 14M of free memory. I
> ballooned up the victim guest (from 512M to 1023M), which made the console
> unconnectable too, as said above. I still wanted to continue with the ping
> test, so I stopped a *different* guest (releasing 512M), to free some memory
> and perhaps make the console responsive again.

Maybe you don't have to start several guests (>=2) to reproduce this bug. Starting a single guest is enough, as long as you turn off auto-ballooning in xend-config.sxp and make sure there is no free memory by ballooning up domain0. In case you need more free memory, just balloon down domain0. That may be more convenient than starting several guests.

> Not only did the console become responsive, the frozen ssh connection
> resurrected.
>
> ... Now that I'm looking again, the ssh connection is working, but the console
> connection is stuck with "Connecting to console for guest". "xm info" reports
> 15M of free memory. Thus the symptoms have reversed (ssh works, console does
> not).
>
> I'll try to get a core dump to see where the guest is blocked when it seems
> frozen.
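The single-guest setup suggested in this comment (use domain0 itself to soak up free memory) could be sketched roughly as below. This is an assumption-laden illustration: dom0_mib_from_xm_list and dom0_target are hypothetical helper names, the column layout is the one shown by "xm li" in the description, and whether domain0 can actually grow to the computed size depends on how it was booted.

```shell
#!/bin/sh
# Sketch only: helper names are made up for illustration.

# Read `xm list` output on stdin and print Domain-0's Mem(MiB) column.
dom0_mib_from_xm_list() {
    awk '$1 == "Domain-0" { print $3 }'
}

# dom0_target CURRENT_DOM0_MIB FREE_MIB LEAVE_FREE_MIB
# Print the mem-set target that leaves only LEAVE_FREE_MIB unallocated.
dom0_target() {
    echo $(( $1 + $2 - $3 ))
}

# Usage on the Xen host (free_mib taken from the free_memory line of `xm info`):
#   cur=$(xm list | dom0_mib_from_xm_list)
#   xm mem-set Domain-0 "$(dom0_target "$cur" "$free_mib" 1)"
# To give memory back afterwards, mem-set Domain-0 to a smaller value.
```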
This is not really a regression; however, it is exacerbated by the patch that Andrew pinpointed, so we may want to treat it as such. The fix is to change netfront from flipping to copying. The copying code is already well tested as it is used by HVM guests.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Created attachment 461065: make copying the default
upstream status: c/s 1054:26562626c866 http://xenbits.xensource.com/linux-2.6.18-xen.hg?rev/26562626c866
in kernel-2.6.18-233.el5

You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.
*** Bug 660315 has been marked as a duplicate of this bug. ***
*** Bug 663143 has been marked as a duplicate of this bug. ***
I have replayed the simple reproducer in the bug description, on a machine with 8G RAM installed.

First I created a guest that leaves the system with about 1.5G of free memory. Then I created a testing guest with memory=1024 and maxmem=2048. Trying to mem-set the guest to 2048 would give it only about 1500.

With -231 (both host and guest), this bug could be easily reproduced after the mem-set. The ssh session into the guest got lost, and ping did not work either.

Upgrading both host and guest to -238 has wiped the problem away. The network was still active after mem-set. As a result I'm putting this into VERIFIED.
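The "is the network still active after mem-set" check used in this verification can be scripted by reading the summary line that ping prints (the same line shown in step 5 of the description). A minimal sketch, with hypothetical helper names packet_loss_pct and network_alive:

```shell
#!/bin/sh
# Sketch only: helper names are made up for illustration.

# Read ping output on stdin and print the packet-loss percentage
# from the summary line (prints nothing if there is no such line).
packet_loss_pct() {
    sed -n 's/.* \([0-9][0-9.]*\)% packet loss.*/\1/p'
}

# Read ping output on stdin; succeed if at least one reply came back.
network_alive() {
    loss=$(packet_loss_pct)
    [ -n "$loss" ] && [ "${loss%%.*}" -lt 100 ]
}

# Usage after `xm mem-set` on the host (guest address from the description):
#   if ping -c 3 10.66.93.117 | network_alive; then echo "network ok"; else echo "network lost"; fi
```

Since the failure in this report only showed up after an ssh attempt, a single ping right after ballooning is not sufficient; the check would need to be repeated after some ssh traffic.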
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-0017.html
*** Bug 629213 has been marked as a duplicate of this bug. ***
*** Bug 646649 has been marked as a duplicate of this bug. ***