Bug 653505

Summary: [4.9 Regression] network is lost after balloon-up fails
Product: Red Hat Enterprise Linux 4
Reporter: Paolo Bonzini <pbonzini>
Component: kernel
Assignee: Laszlo Ersek <lersek>
Status: CLOSED ERRATA
QA Contact: Virtualization Bugs <virt-bugs>
Severity: high
Priority: high
Version: 4.9
CC: drjones, leiwang, mjenner, xen-maint, yuzhang
Target Milestone: rc
Keywords: Regression
Target Release: ---
Hardware: Unspecified
OS: Linux
Whiteboard: regression
Doc Type: Bug Fix
Clone Of: 653262
Last Closed: 2011-02-16 15:53:01 UTC
Bug Depends On: 653501
Bug Blocks: 458302
Attachments:
make copying the default (flags: none)

Description Paolo Bonzini 2010-11-15 16:04:00 UTC
+++ This bug was initially created as a clone of Bug #653262 +++

Description of problem:
On a RHEL5 PV guest (kernel 2.6.18-231.el5xen), the network is lost when I try to ssh into the guest after ballooning it up. I can neither ssh into it nor ping it, although I can still get the guest's console. The problem only happens when auto-ballooning is turned off (the default) and there is not enough free memory. In other situations, such as when auto-ballooning is on or there is enough free memory for the guest to balloon up, the issue is not triggered.

Version-Release number of selected component (if applicable):
Host:  
xen-devel-3.0.3-117.el5
xen-libs-3.0.3-117.el5
kernel-xen-devel-2.6.18-231.el5
xen-3.0.3-117.el5
xen-debuginfo-3.0.3-117.el5
kernel-xen-2.6.18-231.el5

Guest: 
2.6.18-231.el5xen

How reproducible:
Always

Steps to Reproduce:
1. Make sure auto-ballooning is turned off in xend.
# grep "balloon-dom0" /etc/xen/xend-config.sxp
 (auto-balloon-dom0 no)

2. Create a RHEL5 PV guest with memory=512 and maxmem=1024
3. Make sure free memory is not enough for ballooning:
# xm info | grep free
free_memory            : 1

# xm li vm1
Name                                      ID Mem(MiB) VCPUs State   Time(s)
vm1                                        1      511     4 -b----      8.9

# xm li vm1 -l | grep mem
    (memory 512)
    (shadow_memory 0)
    (maxmem 1024)

4. Try to balloon up the guest
# xm mem-set vm1 900

5. Ping the guest after ballooning
# ping 10.66.93.117
PING 10.66.93.117 (10.66.93.117) 56(84) bytes of data.
64 bytes from 10.66.93.117: icmp_seq=1 ttl=64 time=0.252 ms
64 bytes from 10.66.93.117: icmp_seq=2 ttl=64 time=0.050 ms
64 bytes from 10.66.93.117: icmp_seq=3 ttl=64 time=0.053 ms

--- 10.66.93.117 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.050/0.118/0.252/0.094 ms

6. ssh to the guest after that
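
The precondition in steps 1-3 can be checked with a small script. This is only a sketch: the function names are hypothetical, the helpers just parse text in the format of the `xm info` and `xm li` output shown above, and the "needed memory exceeds free memory" test is a rough approximation of when balloon-up cannot complete with auto-ballooning off.

```shell
#!/bin/sh
# Hypothetical helpers (not part of xen/xend); they only parse text in the
# format shown in the steps above.

# Print the value of the "free_memory" line from `xm info` output on stdin.
free_memory_mib() {
    awk -F: '$1 ~ /^free_memory[ \t]*$/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Print the Mem(MiB) column for the domain named $1 from `xm list` output
# on stdin.
domain_mem_mib() {
    awk -v name="$1" '$1 == name { print $3 }'
}

# Succeed when growing a guest from $1 MiB (current) to $2 MiB (target)
# needs more than the $3 MiB reported free. Assumption: with
# auto-balloon-dom0 off, the guest can only grow into free memory.
balloon_up_would_fail() {
    [ $(( $2 - $1 )) -gt "$3" ]
}

# Example use on the host (commented out because it needs a running xend):
#   free=$(xm info | free_memory_mib)
#   cur=$(xm list vm1 | domain_mem_mib vm1)
#   balloon_up_would_fail "$cur" 900 "$free" && echo "precondition met"
```

With the values from this report (511 MiB current, 900 MiB target, 1 MiB free), the check succeeds, which matches the condition under which the network loss is seen.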

Actual results:
At step 6, you cannot ssh to the guest, and after that attempt the guest's network is lost. You can no longer ping it, though the guest is still alive and you can get its console.

Expected results:
Everything works well after ballooning.


Additional info:
No such issues are triggered when I downgrade the PV guest to the -194 kernel-xen packages, so I would consider this bug a regression.

--- Additional comment from yuzhang on 2010-11-15 00:37:00 EST ---

Created attachment 460474 [details]
config file to create the guest

--- Additional comment from yuzhang on 2010-11-15 00:39:02 EST ---

Created attachment 460475 [details]
xm dmesg log

--- Additional comment from yuzhang on 2010-11-15 00:43:06 EST ---

Created attachment 460476 [details]
xend.log

--- Additional comment from drjones on 2010-11-15 06:58:20 EST ---

Most likely culprit

7c14912 [virt] xen: don't give up ballooning under mem pressure

If a -221 kernel works (this patch is in -222), then that would give my accusation more weight.

--- Additional comment from lersek on 2010-11-15 10:30:39 EST ---

I could reproduce the problem with host -231, guest -222: I was able to ssh into the guest and started typing "uname -r" to verify I was running -222. I didn't get past "una", and then "ping" stopped working too. I checked with -221, and the problem is gone. I think Andrew is right. I'll try to revert the patch.

======================

Also reproducible in RHEL4.9.

Comment 1 RHEL Program Management 2010-11-15 20:49:01 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 2 Paolo Bonzini 2010-11-16 17:26:04 UTC
This is not really a regression; however, it is exacerbated by the patch that Andrew pinpointed, so we may want to treat it as such. The fix is to change netfront from flipping to copying. The copying code is already well tested, as it is used by HVM guests.

Comment 3 Laszlo Ersek 2010-11-17 13:42:55 UTC
Created attachment 461066 [details]
make copying the default

Comment 4 Vivek Goyal 2010-12-14 13:54:39 UTC
Committed in 93.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

Comment 6 Yufang Zhang 2010-12-16 08:01:25 UTC
QA verified this bug with 93.EL(guest) on RHEL5 -237 host:

Using the same steps as in the Description, the guest network isn't lost even when ballooning fails.

Changing this bug to VERIFIED.

Comment 7 errata-xmlrpc 2011-02-16 15:53:01 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0263.html