Bug 653505 - [4.9 Regression] network is lost after balloon-up fails
Summary: [4.9 Regression] network is lost after balloon-up fails
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.9
Hardware: Unspecified
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Laszlo Ersek
QA Contact: Virtualization Bugs
URL:
Whiteboard: regression
Depends On: 653501
Blocks: 458302
 
Reported: 2010-11-15 16:04 UTC by Paolo Bonzini
Modified: 2011-02-16 15:53 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 653262
Environment:
Last Closed: 2011-02-16 15:53:01 UTC
Target Upstream Version:
Embargoed:


Attachments
make copying the default (432 bytes, patch)
2010-11-17 13:42 UTC, Laszlo Ersek


Links
System: Red Hat Product Errata
ID: RHSA-2011:0263
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 4.9 kernel security and bug fix update
Last Updated: 2011-02-16 15:14:55 UTC

Description Paolo Bonzini 2010-11-15 16:04:00 UTC
+++ This bug was initially created as a clone of Bug #653262 +++

Description of problem:
On a RHEL5 PV guest (kernel 2.6.18-231.el5xen), the network is lost when I try to ssh to the guest after ballooning it up. I can neither ssh to it nor ping it, although I can still get the guest's console. The problem only occurs when auto-ballooning is turned off (which is the default value) and there is not enough free memory. In other situations, such as when auto-ballooning is on or there is enough free memory for the guest to balloon up, the issue is not triggered.

Version-Release number of selected component (if applicable):
Host:  
xen-devel-3.0.3-117.el5
xen-libs-3.0.3-117.el5
kernel-xen-devel-2.6.18-231.el5
xen-3.0.3-117.el5
xen-debuginfo-3.0.3-117.el5
kernel-xen-2.6.18-231.el5

Guest: 
2.6.18-231.el5xen

How reproducible:
Always

Steps to Reproduce:
1. Make sure auto-ballooning is turned off in xend.
# grep "balloon-dom0" /etc/xen/xend-config.sxp
 (auto-balloon-dom0 no)

2. Create a RHEL5 PV guest with memory=512 and maxmem=1024
3. Make sure there is not enough free memory for ballooning:
# xm info | grep free
free_memory            : 1

# xm li vm1
Name                                      ID Mem(MiB) VCPUs State   Time(s)
vm1                                        1      511     4 -b----      8.9

# xm li vm1 -l | grep mem
    (memory 512)
    (shadow_memory 0)
    (maxmem 1024)

4. Try to balloon up the guest
# xm mem-set vm1 900

5. Ping the guest after ballooning
# ping 10.66.93.117
PING 10.66.93.117 (10.66.93.117) 56(84) bytes of data.
64 bytes from 10.66.93.117: icmp_seq=1 ttl=64 time=0.252 ms
64 bytes from 10.66.93.117: icmp_seq=2 ttl=64 time=0.050 ms
64 bytes from 10.66.93.117: icmp_seq=3 ttl=64 time=0.053 ms

--- 10.66.93.117 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.050/0.118/0.252/0.094 ms

6. ssh to the guest after that
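
For anyone re-running the steps above, here is a rough shell sketch of the precondition checks in steps 1 and 3. The helper names (xm_free_mib, auto_balloon_off, not_enough_free) are made up for illustration and are not part of the xm tools, and the parsing assumes the output formats shown in the steps above.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers for checking the reproduction
# preconditions. They read text on stdin, so they do not call xm themselves.

# Print the free_memory value (MiB) from "xm info" output read on stdin.
xm_free_mib() {
    awk -F: '/^free_memory/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Succeed if xend-config.sxp text on stdin contains "(auto-balloon-dom0 no)".
auto_balloon_off() {
    grep -Eq '^[[:space:]]*\(auto-balloon-dom0[[:space:]]+no\)'
}

# Succeed if FREE MiB cannot cover growing the guest from CURRENT to TARGET.
# usage: not_enough_free CURRENT TARGET FREE
not_enough_free() {
    [ $(( $2 - $1 )) -gt "$3" ]
}

# Intended use on the host (values taken from steps 2 and 4):
#   free=$(xm info | xm_free_mib)
#   auto_balloon_off < /etc/xen/xend-config.sxp &&
#       not_enough_free 512 900 "$free" &&
#       echo "preconditions met, try: xm mem-set vm1 900"
```

With the values from the steps above (memory 512, target 900, free_memory 1), the last check succeeds, which is the situation in which the balloon-up fails.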

Actual results:
At step 6, you cannot ssh to the guest, and after that the guest's network is lost. You can no longer ping it, though the guest is still alive and you can get its console.

Expected results:
Everything works well after ballooning.


Additional info:
No such issues are triggered when I downgrade the PV guest to the -194 kernel-xen packages, so I consider this bug a regression.

--- Additional comment from yuzhang on 2010-11-15 00:37:00 EST ---

Created attachment 460474
config file to create the guest

--- Additional comment from yuzhang on 2010-11-15 00:39:02 EST ---

Created attachment 460475
xm dmesg log

--- Additional comment from yuzhang on 2010-11-15 00:43:06 EST ---

Created attachment 460476
xend.log

--- Additional comment from drjones on 2010-11-15 06:58:20 EST ---

Most likely culprit

7c14912 [virt] xen: don't give up ballooning under mem pressure

If a -221 kernel works (this patch is in -222), that would give my accusation more weight.

--- Additional comment from lersek on 2010-11-15 10:30:39 EST ---

I could reproduce the problem with host -231, guest -222: I was able to ssh into the guest and started to type "uname -r" to verify that I was running -222. I didn't get past "una", and then "ping" stopped working too. I checked with -221, and the problem is gone. I think Andrew is right. I'll try to revert the patch.

======================

Also reproducible in RHEL4.9.

Comment 1 RHEL Program Management 2010-11-15 20:49:01 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 2 Paolo Bonzini 2010-11-16 17:26:04 UTC
This is not really a regression; however, it is exacerbated by the patch that Andrew pinpointed, so we may want to treat it as such.  The fix is to change netfront from flipping to copying.  The copying code is already well tested, as it is used by HVM guests.

Comment 3 Laszlo Ersek 2010-11-17 13:42:55 UTC
Created attachment 461066
make copying the default

Comment 4 Vivek Goyal 2010-12-14 13:54:39 UTC
Committed in 93.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

Comment 6 Yufang Zhang 2010-12-16 08:01:25 UTC
QA verified this bug with 93.EL (guest) on a RHEL5 -237 host:

Using the same steps as in the Description, the guest network isn't lost even when ballooning fails.

Changing this bug to VERIFIED.

Comment 7 errata-xmlrpc 2011-02-16 15:53:01 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0263.html

