Bug 1031534

Summary: [WIN RHEV-AGENT] Please expose swap information in the guest agent
Product: Red Hat Enterprise Virtualization Manager Reporter: Vinzenz Feenstra [evilissimo] <vfeenstr>
Component: ovirt-guest-agent Assignee: Lev Veyde <lveyde>
Status: CLOSED ERRATA QA Contact: Lukas Svaty <lsvaty>
Severity: high Docs Contact:
Priority: medium    
Version: 3.3.0 CC: acathrow, asegundo, bazulay, dfediuck, eedri, iheim, lpeer, mavital, michal.skrivanek, mkenneth, msivak, sherold, srevivo, vfeenstr, yeylon
Target Milestone: --- Keywords: Triaged
Target Release: 3.3.0   
Hardware: All   
OS: Linux   
Whiteboard: sla
Fixed In Version: Doc Type: Bug Fix
Doc Text:
When the hypervisor's memory pressure grew, the Memory Overcommit Manager reduced all guests' memory, so guests with high memory load had to use swap space. This issue is fixed with enhanced ballooning rules for computing the minimum available memory and reporting the swap usage of the guests.
Story Points: ---
Clone Of: 1030515 Environment:
Last Closed: 2014-01-21 15:56:10 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: SLA RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1030515    
Bug Blocks: 1020228, 1025845    

Description Vinzenz Feenstra [evilissimo] 2013-11-18 08:45:33 UTC
+++ This bug was initially created as a clone of Bug #1030515 +++

+++ This bug was initially created as a clone of Bug #1025845 +++

Version-Release:
vdsm-4.13.0-0.5.beta1.el6ev.x86_64
mom-0.3.2-6.el6ev.noarch

Description:
When the hypervisor's memory pressure grows, MOM is supposed to reduce guests' memory to make more memory available to the hypervisor. But instead of selecting only the guests with free memory available, MOM reduces the memory of all guests. As a consequence, guests with high memory load start using swap.

Additional info:

- Hypervisor: proliant
- Guests: rhel64-1, rhel64-2 and rhel64-3

- MOM ready:

2013-11-01 14:01:03,090 - mom.Monitor - INFO - GuestMonitor-rhel64_3 starting
2013-11-01 14:01:03,091 - mom.Monitor - INFO - GuestMonitor-rhel64_2 starting
2013-11-01 14:01:03,094 - mom.Monitor - INFO - GuestMonitor-rhel64_1 starting
2013-11-01 14:01:03,110 - mom.Monitor - INFO - GuestMonitor-rhel64_3 is ready
2013-11-01 14:01:03,111 - mom.Monitor - INFO - GuestMonitor-rhel64_2 is ready
2013-11-01 14:01:03,112 - mom.Monitor - INFO - GuestMonitor-rhel64_1 is ready

- Memory status:

[root@proliant mom.d]# date; free -m
Fri Nov  1 14:02:17 BRT 2013
             total       used       free     shared    buffers     cached
Mem:          3787        986       2801          0         11         71
-/+ buffers/cache:        903       2884
Swap:         4095        498       3597

[root@rhel64-1 ~]# date; free -m
Fri Nov  1 14:03:35 BRT 2013
             total       used       free     shared    buffers     cached
Mem:          1877        141       1736          0          6         28
-/+ buffers/cache:        106       1770
Swap:         4031         26       4005

[root@rhel64-2 ~]# date; free -m
Fri Nov  1 14:03:41 BRT 2013
             total       used       free     shared    buffers     cached
Mem:          1877        143       1734          0          6         30
-/+ buffers/cache:        106       1770
Swap:         4031         28       4003

[root@rhel64-3 ~]# date; free -m
Fri Nov  1 14:03:46 BRT 2013
             total       used       free     shared    buffers     cached
Mem:          1877        281       1595          0         26         98
-/+ buffers/cache:        156       1720
Swap:         4031          0       4031


- Starting memory pressure on rhel64-1 and rhel64-2:

[root@rhel64-1 ~]# ./bang.bin 1800 &
[1] 3838
[root@rhel64-1 ~]# Allocating 1800MB to work on.
[root@rhel64-1 ~]# date; free -m
Fri Nov  1 14:03:49 BRT 2013
             total       used       free     shared    buffers     cached
Mem:          1877       1817         60          0          0         13
-/+ buffers/cache:       1803         74
Swap:         4031        150       3881

[root@rhel64-2 ~]# ./bang.bin 1800 &
[1] 3890
[root@rhel64-2 ~]# Allocating 1800MB to work on.
[root@rhel64-2 ~]# date; free -m
Fri Nov  1 14:04:20 BRT 2013
             total       used       free     shared    buffers     cached
Mem:          1877       1736        140          0          6         30
-/+ buffers/cache:       1699        178
Swap:         4031         28       4003

- No pressure on rhel64-3:

[root@rhel64-3 ~]# date; free -m
Fri Nov  1 14:04:38 BRT 2013
             total       used       free     shared    buffers     cached
Mem:          1877        282       1595          0         26         98
-/+ buffers/cache:        157       1720
Swap:         4031          0       4031


- MOM working:

2013-11-01 14:04:50,426 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_3 from 2097152 to 1992294
2013-11-01 14:04:51,511 - mom.Collectors.GuestMemory - WARNING - getVmMemoryStats() error: The ovirt-guest-agent is not active
2013-11-01 14:04:52,769 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_2 from 2097152 to 1992294
2013-11-01 14:04:52,863 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_1 from 2097152 to 1992294
2013-11-01 14:04:52,983 - mom.Controllers.KSM - INFO - Updating KSM configuration: pages_to_scan:364 run:1 sleep_millisecs:43
2013-11-01 14:05:01,769 - mom.Monitor - INFO - GuestMonitor-rhel64_2 is ready
2013-11-01 14:05:04,451 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_3 from 1992296 to 1892681
2013-11-01 14:05:04,504 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_2 from 1992296 to 1892681
2013-11-01 14:05:04,517 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_1 from 1992296 to 1892681
2013-11-01 14:05:04,587 - mom.Controllers.KSM - INFO - Updating KSM configuration: pages_to_scan:664 run:1 sleep_millisecs:43
2013-11-01 14:05:14,633 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_3 from 1892684 to 1798049
2013-11-01 14:05:14,835 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_2 from 1892684 to 1798049
2013-11-01 14:05:14,885 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_1 from 1892684 to 1798049
2013-11-01 14:05:14,937 - mom.Controllers.KSM - INFO - Updating KSM configuration: pages_to_scan:964 run:1 sleep_millisecs:43
2013-11-01 14:05:24,983 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_3 from 1798052 to 1708149
2013-11-01 14:05:25,080 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_2 from 1798052 to 1708149
2013-11-01 14:05:25,160 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_1 from 1798052 to 1708149
2013-11-01 14:05:25,202 - mom.Controllers.KSM - INFO - Updating KSM configuration: pages_to_scan:1250 run:1 sleep_millisecs:43
2013-11-01 14:05:35,242 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_3 from 1708152 to 1793559
2013-11-01 14:05:35,281 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_2 from 1708152 to 1793559
2013-11-01 14:05:35,314 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_1 from 1708152 to 1793559
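
The step size in the log above is easier to see when each balloon target is divided by the previous size. The following is a minimal Python sketch, assuming the log line format shown above; the regular expression and the helper name balloon_steps are illustrative and not part of MOM.

```python
import re

# Matches the "Ballooning guest:<name> from <old> to <new>" lines shown in the log.
BALLOON_RE = re.compile(r"Ballooning guest:(\S+) from (\d+) to (\d+)")


def balloon_steps(log_text):
    """Return (guest, old, new, ratio) for every balloon line in log_text."""
    steps = []
    for match in BALLOON_RE.finditer(log_text):
        guest = match.group(1)
        old, new = int(match.group(2)), int(match.group(3))
        steps.append((guest, old, new, new / old))
    return steps


# Three lines copied from the log in this report.
sample = """\
2013-11-01 14:04:50,426 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_3 from 2097152 to 1992294
2013-11-01 14:05:25,160 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_1 from 1798052 to 1708149
2013-11-01 14:05:35,314 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_1 from 1708152 to 1793559
"""

for guest, old, new, ratio in balloon_steps(sample):
    # Print the relative change of each balloon operation.
    print(f"{guest}: {old} -> {new} ({ratio - 1:+.1%})")
```

For these lines the change is about 5% per step (down for the first two, up for the last), and the log shows the same step applied to all three guests regardless of their load.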


- After some MOM work, guests' memory status:

[root@rhel64-1 ~]# date; free -m
Fri Nov  1 14:05:16 BRT 2013
             total       used       free     shared    buffers     cached
Mem:          1585       1507         78          0          0          6
-/+ buffers/cache:       1500         85
Swap:         4031        487       3544

[root@rhel64-2 ~]# date; free -m
Fri Nov  1 14:05:18 BRT 2013
             total       used       free     shared    buffers     cached
Mem:          1585       1509         75          0          0         10
-/+ buffers/cache:       1499         86
Swap:         4031        491       3540

[root@rhel64-3 ~]# date; free -m
Fri Nov  1 14:05:20 BRT 2013
             total       used       free     shared    buffers     cached
Mem:          1585        281       1303          0         26         98
-/+ buffers/cache:        156       1428
Swap:         4031          0       4031

- Notice that MOM is reducing memory for all guests, even heavily loaded ones.
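
The before/after comparison can be reproduced from the captured output. This is a small Python sketch, assuming the older `free -m` layout shown in this report (with a separate "-/+ buffers/cache" line); the helper name parse_free is illustrative only.

```python
def parse_free(output):
    """Parse old-style `free -m` output into {'Mem': {...}, 'Swap': {...}}."""
    result = {}
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Mem:":
            keys = ("total", "used", "free", "shared", "buffers", "cached")
            result["Mem"] = dict(zip(keys, map(int, fields[1:])))
        elif fields[0] == "Swap:":
            keys = ("total", "used", "free")
            result["Swap"] = dict(zip(keys, map(int, fields[1:])))
        # The header and "-/+ buffers/cache:" lines are ignored on purpose.
    return result


# rhel64-1 at 14:03:35 (idle) and at 14:05:16 (after ballooning), from this report.
before = parse_free("""\
             total       used       free     shared    buffers     cached
Mem:          1877        141       1736          0          6         28
-/+ buffers/cache:        106       1770
Swap:         4031         26       4005
""")
after = parse_free("""\
             total       used       free     shared    buffers     cached
Mem:          1585       1507         78          0          0          6
-/+ buffers/cache:       1500         85
Swap:         4031        487       3544
""")

print("RAM taken by the balloon (MB):", before["Mem"]["total"] - after["Mem"]["total"])
print("Swap growth (MB):", after["Swap"]["used"] - before["Swap"]["used"])
```

For rhel64-1 this gives 292 MB of RAM removed and 461 MB of additional swap in use.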


Expected results:
MOM should not reduce guests' memory beyond the free memory limit.

--- Additional comment from Doron Fediuck on 2013-11-03 06:25:09 EST ---

Just a few things I'd like to clarify:
1. Once a VM starts swapping, MOM should detect it and stop inflating it / start deflating.

2. The floor limit MOM is using is based on the "Physical Memory Guaranteed" settings we define for each VM in the Resource Allocation sub-tab of the new/edit VM dialog. Can you please provide the numbers set for these VMs?

--- Additional comment from Martin Sivák on 2013-11-04 04:42:04 EST ---

We are not getting swap information from the guest agent, so this is definitely an issue: when we inflate the balloon, the VM will also put some of its data into swap, and MOM will then think that there is still enough reclaimable memory in the VM.

To fix this we would have to modify the guest agents (Linux and Windows), the MOM collectors, and the VDSM policy for MOM.

There is also an error in the policy and we have a fix for that - http://gerrit.ovirt.org/#/c/19416/1/doc/balloon.rules

--- Additional comment from Amador Pahim on 2013-11-04 06:54:39 EST ---

(In reply to Doron Fediuck from comment #1)
> Just a few things I'd like to clarify;
> 1. Once a VM starts swapping MOM should detect it and stop inflating it /
> start deflating.
> 
> 2. The floor limit MOM is using is based on the "Physical Memory Guaranteed"
> settings we define for each VM in the Resource Allocation sub-tab of the
> new/edit
> VM dialog. Can you please provide the numbers set for these VMs?

Defined Memory: 2048 MB
Physical Memory Guaranteed: 512 MB

--- Additional comment from Amador Pahim on 2013-11-04 08:33:28 EST ---

(In reply to Martin Sivák from comment #2)
> We are not getting swap information from the guest agent so this is
> definitely an issue. Because when we inflate the balloon, the VM will also
> put some of its data to swap and mom will then think that there is still
> enough reclaimable memory in the vm.

My concern is also about all VMs being inflated/deflated by the same amount of memory, regardless of their individual memory status. Is this an issue? Shouldn't MOM balloon VMs with different patterns, considering their individual memory availability?

> 
> To fix this we would have to modify the guest agent(s) - linux, win, mom
> collectors and vdsm policy for mom.
> 
> There is also an error in the policy and we have a fix for that -
> http://gerrit.ovirt.org/#/c/19416/1/doc/balloon.rules

--- Additional comment from Martin Sivák on 2013-11-04 08:41:54 EST ---

> My concern is also about all VMs being inflated/deflated with the same
> amount of memory, 

That is fixed in the referenced changeset.

--- Additional comment from Adam Litke on 2013-11-04 08:51:18 EST ---

The mom policy is designed to behave in one of two ways depending on the severity of host memory pressure.  Under moderate pressure (between 5% and 20% free host memory) we try to balloon away only unused memory in guests.  Under severe host pressure (< 5% free) we purposefully cause guest swapping in order to keep the host itself from entering a swap storm.  Since you are observing guest swapping, could you share the state of host memory during that behavior?  If the host has <5% free (counting Cached pages as free) then I would argue that the policy is behaving as designed.
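
The two modes described above can be sketched as a simple classification. This is an illustrative Python sketch, not the actual MOM policy: the function names are hypothetical, the 5% and 20% thresholds are taken from the comment, and the handling of values exactly at a threshold is an assumption.

```python
def free_fraction(total_mb, free_mb, cached_mb):
    """Fraction of host memory considered free, counting cached pages as free."""
    return (free_mb + cached_mb) / total_mb


def pressure_mode(fraction_free, moderate=0.20, severe=0.05):
    """Classify host pressure using the thresholds quoted in the comment."""
    if fraction_free < severe:
        return "severe"    # guest swapping is accepted to protect the host
    if fraction_free < moderate:
        return "moderate"  # only unused guest memory should be ballooned away
    return "none"          # above 20% free: no pressure in this sketch


# Host values from the first `free -m` output in this report (14:02:17).
fraction = free_fraction(total_mb=3787, free_mb=2801, cached_mb=71)
print(f"{fraction:.1%} free -> {pressure_mode(fraction)}")
```

The host snapshot in this report predates the load test, so it does not show which mode the host was in while the guests were swapping; that is the information requested above.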

--- Additional comment from Martin Sivák on 2013-11-04 09:28:50 EST ---

Adam: there are two issues there

- Your fix http://gerrit.ovirt.org/#/c/19416/1/doc/balloon.rules is not present in the version he uses and that causes all the VMs to disregard the computed minimum in favour of the hard minimum (which is lower and the same for all VMs).

- When you have most of the memory in swap, MoM will think RAM is (almost) free and inflate the balloon because we do not have any info about the swap usage in the policy.

But you are totally right about the two modes of ballooning we use.
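
A rough illustration of the two issues listed above, written as a Python sketch. It is not the content of the referenced changeset: the function names, the max() rule for the floor, and the idea of subtracting swap usage from apparently free memory are all assumptions made for illustration. The guest figures come from the `free -m` output in this report, and 512 MB is the "Physical Memory Guaranteed" value given earlier; the computed minimum of 1500 MB is a hypothetical number.

```python
def balloon_floor(computed_min_mb, guaranteed_min_mb):
    """Lower bound for the balloon target: never go below either minimum."""
    return max(computed_min_mb, guaranteed_min_mb)


def reclaimable_mb(free_mb, cached_mb, swap_used_mb):
    """Apparently free RAM minus what the guest already pushed to swap."""
    return max(0, free_mb + cached_mb - swap_used_mb)


# Guest values from the `free -m` output taken at about 14:05 in this report.
guests = {
    "rhel64-1": {"free": 78, "cached": 6, "swap_used": 487},
    "rhel64-3": {"free": 1303, "cached": 98, "swap_used": 0},
}

for name, stats in guests.items():
    estimate = reclaimable_mb(stats["free"], stats["cached"], stats["swap_used"])
    print(f"{name}: reclaimable estimate {estimate} MB")

# Hypothetical computed minimum of 1500 MB against the 512 MB guaranteed value.
print("floor:", balloon_floor(computed_min_mb=1500, guaranteed_min_mb=512), "MB")
```

With swap usage taken into account, the loaded guest rhel64-1 has nothing left to reclaim while the idle guest rhel64-3 still has about 1.4 GB, which is the distinction the policy cannot make without swap information from the guest agent.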

Comment 1 Vinzenz Feenstra [evilissimo] 2013-11-19 08:33:15 UTC
Fix merged downstream to rhevm-3.3 branch as https://gerrit.eng.lab.tlv.redhat.com/gitweb?p=rhevm-guest-agent.git;a=commit;h=7f98f2a5a8c27c579e53a53b241b607386aac637

Comment 3 errata-xmlrpc 2014-01-21 15:56:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0075.html