Bug 653262 - [5.6 Regression] network is lost after balloon-up fails
Summary: [5.6 Regression] network is lost after balloon-up fails
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.6
Hardware: Unspecified
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Laszlo Ersek
QA Contact: Virtualization Bugs
URL:
Whiteboard: regression
Duplicates: 629213 646649 660315 663143
Depends On: 653501
Blocks: 514489
Reported: 2010-11-15 05:33 UTC by Yufang Zhang
Modified: 2018-11-14 21:00 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 653501 653505
Environment:
Last Closed: 2011-01-13 22:00:58 UTC
Target Upstream Version:


Attachments
config file to create the guest (486 bytes, text/plain)
2010-11-15 05:37 UTC, Yufang Zhang
xm dmesg log (11.52 KB, text/plain)
2010-11-15 05:39 UTC, Yufang Zhang
xend.log (13.19 KB, text/plain)
2010-11-15 05:43 UTC, Yufang Zhang
make copying the default (432 bytes, patch)
2010-11-17 13:41 UTC, Laszlo Ersek


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2011:0017 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 5.6 kernel security and bug fix update 2011-01-13 10:37:42 UTC

Description Yufang Zhang 2010-11-15 05:33:00 UTC
Description of problem:
On a RHEL5 PV guest (kernel 2.6.18-231.el5xen), I try to ssh to the guest after ballooning it up. The network is lost when I ssh to it: I can neither ssh into the guest nor ping it, although I can still get the guest's console. The problem happens only when auto-ballooning is turned off (which is the default value) and there is not enough free memory. In other situations, such as auto-ballooning being on or enough free memory being available for the guest to balloon up, the issue is not triggered.

Version-Release number of selected component (if applicable):
Host:  
xen-devel-3.0.3-117.el5
xen-libs-3.0.3-117.el5
kernel-xen-devel-2.6.18-231.el5
xen-3.0.3-117.el5
xen-debuginfo-3.0.3-117.el5
kernel-xen-2.6.18-231.el5

Guest: 
2.6.18-231.el5xen

How reproducible:
Always

Steps to Reproduce:
1. Make sure auto-ballooning is turned off in xend.
# grep "balloon-dom0" /etc/xen/xend-config.sxp
 (auto-balloon-dom0 no)

2. Create a RHEL5 PV guest with memory=512 and maxmem=1024
3. Make sure free memory is not enough for ballooning:
# xm info | grep free
free_memory            : 1

# xm li vm1
Name                                      ID Mem(MiB) VCPUs State   Time(s)
vm1                                        1      511     4 -b----      8.9

# xm li vm1 -l | grep mem
    (memory 512)
    (shadow_memory 0)
    (maxmem 1024)

4. Try to balloon up guest
# xm mem-set vm1 900

5. ping to guest after ballooning
# ping 10.66.93.117
PING 10.66.93.117 (10.66.93.117) 56(84) bytes of data.
64 bytes from 10.66.93.117: icmp_seq=1 ttl=64 time=0.252 ms
64 bytes from 10.66.93.117: icmp_seq=2 ttl=64 time=0.050 ms
64 bytes from 10.66.93.117: icmp_seq=3 ttl=64 time=0.053 ms

--- 10.66.93.117 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.050/0.118/0.252/0.094 ms

6. ssh to the guest after that
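
The preconditions in steps 3 and 4 can be checked with a small sketch like the one below. The function names are hypothetical (they are not part of xm or of this report), and the helpers only parse text and compare numbers, so they can be tried without a Xen host.

```shell
#!/bin/sh
# Hypothetical helpers for checking the reproducer's preconditions.

# Print the free_memory value (MiB) from `xm info` output read on stdin.
xm_free_mib() {
    awk -F: '$1 ~ /^free_memory[ \t]*$/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# Print "short" when a balloon-up from CURRENT to TARGET MiB cannot be
# satisfied from FREE MiB of hypervisor memory, "ok" otherwise.
balloon_headroom() {
    current=$1 target=$2 free=$3
    if [ $((target - current)) -gt "$free" ]; then
        echo short
    else
        echo ok
    fi
}

# On a Xen host (assumed usage, requires root):
#   free=$(xm info | xm_free_mib)
#   balloon_headroom 512 900 "$free"    # the guest size and target from steps 2 and 4
```

With the values shown in the steps above (guest at 512 MiB, target 900 MiB, 1 MiB free), the check reports that the request cannot be satisfied, which is the condition under which the bug was seen.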

Actual results:
At step 6, ssh to the guest fails, and afterwards the guest's network is lost. The guest can no longer be pinged, though it is still alive and its console is reachable.

Expected results:
Everything works well after ballooning.


Additional info:
No such issues are triggered when I downgrade the PV guest to the -194 kernel-xen packages. So I would consider this bug a regression.

Comment 1 Yufang Zhang 2010-11-15 05:37:00 UTC
Created attachment 460474 [details]
config file to create the guest

Comment 2 Yufang Zhang 2010-11-15 05:39:02 UTC
Created attachment 460475 [details]
xm dmesg log

Comment 3 Yufang Zhang 2010-11-15 05:43:06 UTC
Created attachment 460476 [details]
xend.log

Comment 4 Andrew Jones 2010-11-15 11:58:20 UTC
Most likely culprit

7c14912 [virt] xen: don't give up ballooning under mem pressure

If a -221 kernel works (this patch is in -222), that would give my accusation more weight.

Comment 5 Laszlo Ersek 2010-11-15 15:30:39 UTC
I could reproduce the problem with host -231, guest -222 -- I was able to ssh into the guest, and started to type "uname -r" to verify I'm running -222. I didn't get past "una", and then "ping" stopped working too. I checked with -221, and the problem is gone. I think Andrew is right. I'll try to revert the patch.

Comment 6 Laszlo Ersek 2010-11-15 17:21:50 UTC
Unfortunately, I could reproduce the bug, though less reliably, also with -221.

I reverted 934a6bd ("[virt] xen: remove dead code") and 7c14912 ("[virt] xen: don't give up ballooning under mem pressure") out of -231. (See http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2894621 once brew finishes; but I built it and tested it locally in parallel.) Alas, the network froze again after I typed some commands in the guest over ssh.

Comment 7 Laszlo Ersek 2010-11-16 10:25:26 UTC
I reproduced the bug again, this time with -222, because 7c14912 seems to expose the bug more aggressively.

Yesterday I noticed that whenever the network froze and I started the reboot sequence from virt-manager, the broadcast message from root (warning about the imminent reboot) appeared also through the frozen ssh session. So I suspected that only the netback->netfront (host to guest) direction was frozen, and the reverse direction still worked for whatever reason. So now I wanted to test this, by freezing the connection, starting ping from the console, and checking the echo requests with tcpdump in the host.
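
The direction test described above could be set up roughly as follows. This is a sketch only: the helper names are hypothetical, and the domain ID and device index are examples that must be taken from the actual host (for instance from the ID column of "xm li"). It relies on the Xen convention of naming the dom0-side interface vif<domid>.<devid>.

```shell
#!/bin/sh
# Hypothetical helpers; they only build strings and do not capture anything.

# Name of the dom0-side (netback) interface for a guest NIC.
vif_name() {
    printf 'vif%s.%s\n' "$1" "$2"
}

# Print a tcpdump command line that shows only ICMP echo requests
# on that interface.
echo_request_capture() {
    printf 'tcpdump -n -i %s "icmp[icmptype] = icmp-echo"\n' "$(vif_name "$1" "$2")"
}

# In dom0 (assumed usage, requires root), while ping runs on the guest console:
#   eval "$(echo_request_capture 1 0)"
```

If echo requests sent from the guest console show up in the capture while the guest receives no replies, that would support the suspicion that only the host-to-guest direction is frozen.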

However, this time when the network went down in the guest (which happened almost immediately after ballooning up), I couldn't even connect to the guest's console from virt-manager! Here's how I created a memory scarcity: the server has 8G physical RAM, 6G of that is assigned to dom0, and I have several guests sharing the remaining 1.5-2 G. I started just enough guests (including the one which froze) so that "xm info" in the host reported 14 M of free memory. I ballooned up the victim guest (from 512M to 1023M), which made the console unconnectable too, as said above. I still wanted to continue with the ping test, so I stopped a *different* guest (releasing 512M), to free some memory and perhaps make the console responsive again.

Not only did the console become responsive, the frozen ssh connection resurrected.

... Now that I'm looking again, the ssh connection is working, but the console connection is stuck with "Connecting to console for guest". "xm info" reports 15M of free memory. Thus the symptoms have reversed (ssh works, console does not).

I'll try to get a core dump to see where the guest is blocked when it seems frozen.

Comment 8 Laszlo Ersek 2010-11-16 10:28:54 UTC
Quick update: the console *did* resurrect. It was stuck in "Connecting to console for guest" only because I had already connected to the console from a different virt-manager (from a different virtual desktop). User error, sorry.

Comment 9 Yufang Zhang 2010-11-16 11:58:48 UTC
(In reply to comment #7)
> I reproduced the bug again, this time with -222, because 7c14912 seems to
> expose the bug more aggressively.
> 
> Yesterday I noticed that whenever the network froze and I started the reboot
> sequence from virt-manager, the broadcast message from root (warning about the
> imminent reboot) appeared also through the frozen ssh session. So I suspected
> that only the netback->netfront (host to guest) direction was frozen, and the
> reverse direction still worked for whatever reason. So now I wanted to test
> this, by freezing the connection, starting ping from the console, and checking
> the echo requests with tcpdump in the host.
> 
> However, this time when the network went down in the guest (which happened
> almost immediately after ballooning up), I couldn't even connect to the guest's
> console from virt-manager! Here's how I created a memory scarcity: the server
> has 8G physical RAM, 6G of that is assigned to dom0, and I have several guests
> sharing the remaining 1.5-2 G. I started just enough guests (including the one
> which froze) so that "xm info" in the host reported 14 M of free memory. I
> ballooned up the victim guest (from 512M to 1023M), which made the console
> unconnectable too, as said above. I still wanted to continue with the ping
> test, so I stopped a *different* guest (releasing 512M), to free some memory
> and perhaps make the console responsive again.

You may not need to start several guests (>=2) to reproduce this bug. Starting a single guest is enough, as long as you turn off auto-ballooning in xend-config.sxp and make sure there is no free memory by ballooning up domain0. If you need more free memory, just balloon domain0 down. That may be more convenient than starting several guests.
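
A minimal sketch of that single-guest setup, assuming dom0 is allowed to grow to the computed size. The helper name and the numbers in the usage comment are illustrative and not taken from this report.

```shell
#!/bin/sh
# Hypothetical helper: print the dom0 target (MiB) that would leave only
# LEAVE MiB of free hypervisor memory.
dom0_target() {
    dom0=$1 free=$2 leave=$3
    echo $((dom0 + free - leave))
}

# In dom0 (assumed usage, requires root; 6144 and 1500 are example values
# for dom0's current size and the free memory reported by "xm info"):
#   xm mem-set Domain-0 "$(dom0_target 6144 1500 1)"
```

To free memory again, pass a smaller target to "xm mem-set Domain-0".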

> Not only did the console become responsive, the frozen ssh connection
> resurrected.
> 
> ... Now that I'm looking again, the ssh connection is working, but the console
> connection is stuck with "Connecting to console for guest". "xm info" reports
> 15M of free memory. Thus the symptoms have reversed (ssh works, console does
> not).
> 
> I'll try to get a core dump to see where the guest is blocked when it seems
> frozen.

Comment 10 Paolo Bonzini 2010-11-16 17:25:54 UTC
This is not really a regression; however, it is exacerbated by the patch that Andrew pinpointed, so we may want to treat it as such. The fix is to change netfront from flipping to copying. The copying code is already well tested, as it is used by HVM guests.
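
One way to see which receive mode a guest's netfront has requested is to look at the frontend's xenstore node. The sketch below assumes the frontend publishes its choice in a "request-rx-copy" key under the vif device directory, as the upstream Xen netfront driver does; whether the RHEL 5 kernels in this report behave the same way has not been checked here. The function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers; they build and interpret strings only.

# Xenstore path of the (assumed) request-rx-copy key for a guest NIC.
rx_copy_key() {
    printf '/local/domain/%s/device/vif/%s/request-rx-copy\n' "$1" "$2"
}

# Translate the key's value into the terms used in this bug.
rx_mode() {
    case $1 in
        1) echo copying ;;
        0) echo flipping ;;
        *) echo unknown ;;
    esac
}

# In dom0 (assumed usage; domain ID 1 and device 0 are examples):
#   rx_mode "$(xenstore-read "$(rx_copy_key 1 0)")"
```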

Comment 12 RHEL Program Management 2010-11-17 13:39:19 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 13 Laszlo Ersek 2010-11-17 13:41:25 UTC
Created attachment 461065 [details]
make copying the default

Comment 14 Laszlo Ersek 2010-11-21 10:05:08 UTC
upstream status: c/s 1054:26562626c866

http://xenbits.xensource.com/linux-2.6.18-xen.hg?rev/26562626c866

Comment 16 Jarod Wilson 2010-11-23 17:06:13 UTC
in kernel-2.6.18-233.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.
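
To check whether a given kernel is at or after the -233 build mentioned above, the release string printed by "uname -r" can be compared numerically. A small sketch with hypothetical function names, assuming release strings of the form seen in this report (for example 2.6.18-231.el5xen):

```shell
#!/bin/sh
# Print the build number from a RHEL 5 kernel release string,
# e.g. "2.6.18-231.el5xen" -> "231".
kernel_build() {
    rel=${1#*-}
    printf '%s\n' "${rel%%.*}"
}

# Print "yes" if the release is at or after -233, "no" if it is older,
# and "unknown" if no numeric build could be extracted.
has_fix() {
    build=$(kernel_build "$1")
    case $build in
        ''|*[!0-9]*) echo unknown; return 0 ;;
    esac
    if [ "$build" -ge 233 ]; then
        echo yes
    else
        echo no
    fi
}

# Assumed usage on the machine whose kernel is in question:
#   has_fix "$(uname -r)"
```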

Comment 18 Laszlo Ersek 2010-12-06 15:05:16 UTC
*** Bug 660315 has been marked as a duplicate of this bug. ***

Comment 19 Laszlo Ersek 2010-12-15 20:07:41 UTC
*** Bug 663143 has been marked as a duplicate of this bug. ***

Comment 20 Jinxin Zheng 2010-12-22 07:29:37 UTC
I have replayed the simple reproducer in the bug description, on a machine with 8G RAM installed.

First I created a guest that leaves the system about 1.5G of free memory.

Then I created a test guest with memory=1024 and maxmem=2048. Trying to mem-set the guest to 2048 would give it only about 1500.

With -231 (both host and guest), this bug could be easily reproduced after the mem-set. The ssh session into the guest was lost, and ping did not work either.

Upgrading both host and guest to -238 has wiped the problem away. Network was still active after mem-set.

As a result I'm putting this into VERIFIED.

Comment 22 errata-xmlrpc 2011-01-13 22:00:58 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html

Comment 27 Laszlo Ersek 2011-03-02 13:50:04 UTC
*** Bug 629213 has been marked as a duplicate of this bug. ***

Comment 28 Laszlo Ersek 2011-03-08 16:36:52 UTC
*** Bug 646649 has been marked as a duplicate of this bug. ***

