Bug 653501

Summary: netback tries to balloon up even if front-end doesn't do flipping
Product: Red Hat Enterprise Linux 5 Reporter: Paolo Bonzini <pbonzini>
Component: kernel-xen Assignee: Laszlo Ersek <lersek>
Status: CLOSED ERRATA QA Contact: Virtualization Bugs <virt-bugs>
Severity: high Docs Contact:
Priority: high    
Version: 5.6 CC: drjones, jpirko, jzheng, leiwang, mjenner, nachandr, tao, xen-maint, yuzhang
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 653262 Environment:
Last Closed: 2011-01-13 22:01:09 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 514489, 653262, 653505    
Attachments:
Description Flags
no need to balloon up for copying receivers none

Description Paolo Bonzini 2010-11-15 16:00:09 UTC
+++ This bug was initially created as a clone of Bug #653262 +++

Description of problem:
For a RHEL5 PV guest (whose kernel is 2.6.18-231.el5xen), I try to ssh to it after I balloon it up. The network is lost when I ssh to it: I can neither ssh into it nor ping it, although I can still get the console of the guest. This problem only happens when auto-ballooning is turned off (which is the default value) and there is not enough free memory. In other situations, such as when auto-ballooning is on or there is enough free memory for the guest to balloon up, the issue is not triggered.

Version-Release number of selected component (if applicable):
Host:  
xen-devel-3.0.3-117.el5
xen-libs-3.0.3-117.el5
kernel-xen-devel-2.6.18-231.el5
xen-3.0.3-117.el5
xen-debuginfo-3.0.3-117.el5
kernel-xen-2.6.18-231.el5

Guest: 
2.6.18-231.el5xen

How reproducible:
Always

Steps to Reproduce:
1. Make sure auto-ballooning is turned off in xend.
# grep "balloon-dom0" /etc/xen/xend-config.sxp
 (auto-balloon-dom0 no)

2. Create a RHEL5 PV guest with memory=512 and maxmem=1024
3. Make sure free memory is not enough for ballooning:
# xm info | grep free
free_memory            : 1

# xm li vm1
Name                                      ID Mem(MiB) VCPUs State   Time(s)
vm1                                        1      511     4 -b----      8.9

# xm li vm1 -l | grep mem
    (memory 512)
    (shadow_memory 0)
    (maxmem 1024)

4. Try to balloon up the guest
# xm mem-set vm1 900

5. Ping the guest after ballooning
# ping 10.66.93.117
PING 10.66.93.117 (10.66.93.117) 56(84) bytes of data.
64 bytes from 10.66.93.117: icmp_seq=1 ttl=64 time=0.252 ms
64 bytes from 10.66.93.117: icmp_seq=2 ttl=64 time=0.050 ms
64 bytes from 10.66.93.117: icmp_seq=3 ttl=64 time=0.053 ms

--- 10.66.93.117 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.050/0.118/0.252/0.094 ms

6. ssh to the guest after that
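
The precondition in step 3 (the requested growth exceeds the free memory reported by the hypervisor) can be checked with a small script before running step 4. This is only a sketch: the function names `free_mem_mib` and `balloon_needs_more_than_free` are hypothetical helpers, not part of the xen tools, and the parser assumes the `free_memory : N` line format shown in the `xm info` output above.

```shell
#!/bin/sh
# Hypothetical helpers for checking the reproduction precondition.
# They are illustrative only and are not shipped with xen.

# Read `xm info` output on stdin and print the free_memory value (MiB).
free_mem_mib() {
    awk -F: '$1 ~ /^free_memory[ \t]*$/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Succeed if growing a guest from $1 MiB to $2 MiB needs more than
# the $3 MiB that are currently free, i.e. the balloon-up cannot be
# satisfied from free memory alone.
balloon_needs_more_than_free() {
    current=$1
    target=$2
    free=$3
    [ $((target - current)) -gt "$free" ]
}

# Example usage on dom0 (assumes `xm` is available; not executed here):
#   free=$(xm info | free_mem_mib)
#   balloon_needs_more_than_free 512 900 "$free" && echo "precondition met"
```

With the values from this report (512 MiB current, 900 MiB target, 1 MiB free) the check succeeds, which matches the situation in which the network loss was observed.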

Actual results:
At step 6, you cannot ssh to the guest, and afterwards the network of the guest is lost. You cannot ping it any more, though the guest is still alive and you can get its console.

Expected results:
Everything works well after ballooning.


Additional info:
No such issue is triggered when I downgrade the PV guest to the -194 kernel-xen packages, so I would consider this bug a regression.

--- Additional comment from yuzhang on 2010-11-15 00:37:00 EST ---

Created attachment 460474 [details]
config file to create the guest

--- Additional comment from yuzhang on 2010-11-15 00:39:02 EST ---

Created attachment 460475 [details]
xm dmesg log

--- Additional comment from yuzhang on 2010-11-15 00:43:06 EST ---

Created attachment 460476 [details]
xend.log

--- Additional comment from drjones on 2010-11-15 06:58:20 EST ---

Most likely culprit

7c14912 [virt] xen: don't give up ballooning under mem pressure

If a -221 kernel works (this patch is in -222), that would give my accusation more weight.

--- Additional comment from lersek on 2010-11-15 10:30:39 EST ---

I could reproduce the problem with host -231, guest -222 -- I was able to ssh into the guest, and started to type "uname -r" to verify I'm running -222. I didn't get past "una", and then "ping" stopped working too. I checked with -221, and the problem is gone. I think Andrew is right. I'll try to revert the patch.

======================

The bug is present on RHEL6 too, but is not a regression there.

Comment 1 RHEL Program Management 2010-11-15 20:49:39 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 2 Paolo Bonzini 2010-11-16 17:14:56 UTC
RHEL6 doesn't do flipping (see bug 653505 and bug 653262 for an explanation of how flipping causes the bug in RHEL4 and RHEL5).  However, dom0 tries to balloon up even for copying receivers and fails in the same way as explained in the above-mentioned bugs.

So, this is a backend bug.

Comment 3 Paolo Bonzini 2010-11-16 17:19:50 UTC
There is a patch in upstream c/s 14355.

Comment 5 Laszlo Ersek 2010-11-16 18:19:58 UTC
c/s 14355:

http://xenbits.xensource.com/xen-unstable.hg?rev/68282f4b3e0f

(Second hunk only.)

Comment 7 RHEL Program Management 2010-11-17 06:30:30 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 8 Laszlo Ersek 2010-11-17 13:45:13 UTC
Created attachment 461068 [details]
no need to balloon up for copying receivers

Comment 10 Jarod Wilson 2010-11-23 17:06:18 UTC
in kernel-2.6.18-233.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 12 Jinxin Zheng 2010-12-22 08:18:38 UTC
I'm putting this in VERIFIED according to comment 20 of bug 653262 and comment 6 of bug 653505.

Additionally: merely updating the host kernel was found not to work. The host and guest kernels must both be updated in order to solve this issue.

Comment 13 Paolo Bonzini 2010-12-22 09:47:57 UTC
Yes, that's correct.  HVM guests do copying by default so they do not require an upgrade.  For PV guests, if you do not upgrade the guest you need rx_copy=1 on the kernel command line of the guest.
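
For a PV guest whose kernel has not been upgraded, one way to confirm that the workaround is in place is to look for the option on the guest's kernel command line. The sketch below assumes the option appears as the literal token `rx_copy=1`, as written in this comment; `has_rx_copy` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the kernel command line passed as $1
# contains the exact token rx_copy=1 (assumed spelling, per the comment above).
has_rx_copy() {
    case " $1 " in
        *" rx_copy=1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example usage inside the guest (not executed here):
#   has_rx_copy "$(cat /proc/cmdline)" && echo "rx_copy=1 is set"
```

The check only inspects the command line text; it does not verify that the front-end driver actually negotiated copying with the back-end.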

Comment 15 errata-xmlrpc 2011-01-13 22:01:09 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html

Comment 16 Laszlo Ersek 2011-02-03 08:49:18 UTC
*** Bug 648763 has been marked as a duplicate of this bug. ***