Bug 1025845 - MOM is ballooning all VMs at once, even VMs without free memory enough.
Summary: MOM is ballooning all VMs at once, even VMs without free memory enough.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: mom
Version: 3.3.0
Hardware: All
OS: Linux
medium
urgent
Target Milestone: ---
: 3.3.0
Assignee: Martin Sivák
QA Contact: Lukas Svaty
Cheryn Tan
URL:
Whiteboard: sla
Depends On: 1030515 1031532 1031534
Blocks: GSS_RHEV_33_BETA
Reported: 2013-11-01 17:52 UTC by Amador Pahim
Modified: 2016-02-10 20:16 UTC (History)

Fixed In Version: mom-0.3.2-8.el6ev
Doc Type: Bug Fix
Doc Text:
When the hypervisor's memory pressure grows, MOM is supposed to reduce guests' memory to make more memory available to the hypervisor. But instead of selecting only the guests with free memory available, MOM reduced all guests' memory, so guests with high memory load had to use swap space. This issue is fixed with enhanced ballooning rules for computing the minimum available memory and reporting the swap usage of the guests. Now, MOM does not reduce guests' memory beyond their free memory limit.
Clone Of:
: 1030515 (view as bug list)
Environment:
Last Closed: 2014-01-21 15:06:24 UTC
oVirt Team: SLA
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2014:0064 0 normal SHIPPED_LIVE New package: Memory Overcommitment Manager 2014-01-21 19:53:42 UTC
oVirt gerrit 20848 0 None None None Never
oVirt gerrit 20849 0 None None None Never
oVirt gerrit 20906 0 None None None Never
oVirt gerrit 20959 0 None None None Never
oVirt gerrit 21704 0 None None None Never

Description Amador Pahim 2013-11-01 17:52:49 UTC
Version-Release:
vdsm-4.13.0-0.5.beta1.el6ev.x86_64
mom-0.3.2-6.el6ev.noarch

Description:
When the hypervisor's memory pressure grows, MOM is supposed to reduce guests' memory to make more memory available to the hypervisor. But instead of selecting only the guests with free memory available, MOM is reducing the memory of all guests. As a consequence, guests with high memory load start using swap.

Additional info:

- Hypervisor: proliant
- Guests: rhel64-1, rhel64-2 and rhel64-3

- MOM ready:

2013-11-01 14:01:03,090 - mom.Monitor - INFO - GuestMonitor-rhel64_3 starting
2013-11-01 14:01:03,091 - mom.Monitor - INFO - GuestMonitor-rhel64_2 starting
2013-11-01 14:01:03,094 - mom.Monitor - INFO - GuestMonitor-rhel64_1 starting
2013-11-01 14:01:03,110 - mom.Monitor - INFO - GuestMonitor-rhel64_3 is ready
2013-11-01 14:01:03,111 - mom.Monitor - INFO - GuestMonitor-rhel64_2 is ready
2013-11-01 14:01:03,112 - mom.Monitor - INFO - GuestMonitor-rhel64_1 is ready

- Memory status:

[root@proliant mom.d]# date; free -m
Fri Nov  1 14:02:17 BRT 2013
             total       used       free     shared    buffers     cached
Mem:          3787        986       2801          0         11         71
-/+ buffers/cache:        903       2884
Swap:         4095        498       3597

[root@rhel64-1 ~]# date; free -m
Fri Nov  1 14:03:35 BRT 2013
             total       used       free     shared    buffers     cached
Mem:          1877        141       1736          0          6         28
-/+ buffers/cache:        106       1770
Swap:         4031         26       4005

[root@rhel64-2 ~]# date; free -m
Fri Nov  1 14:03:41 BRT 2013
             total       used       free     shared    buffers     cached
Mem:          1877        143       1734          0          6         30
-/+ buffers/cache:        106       1770
Swap:         4031         28       4003

[root@rhel64-3 ~]# date; free -m
Fri Nov  1 14:03:46 BRT 2013
             total       used       free     shared    buffers     cached
Mem:          1877        281       1595          0         26         98
-/+ buffers/cache:        156       1720
Swap:         4031          0       4031


- Starting memory pressure on rhel64-1 and rhel64-2:

[root@rhel64-1 ~]# ./bang.bin 1800 &
[1] 3838
[root@rhel64-1 ~]# Allocating 1800MB to work on.
[root@rhel64-1 ~]# date; free -m
Fri Nov  1 14:03:49 BRT 2013
             total       used       free     shared    buffers     cached
Mem:          1877       1817         60          0          0         13
-/+ buffers/cache:       1803         74
Swap:         4031        150       3881

[root@rhel64-2 ~]# ./bang.bin 1800 &
[1] 3890
[root@rhel64-2 ~]# Allocating 1800MB to work on.
[root@rhel64-2 ~]# date; free -m
Fri Nov  1 14:04:20 BRT 2013
             total       used       free     shared    buffers     cached
Mem:          1877       1736        140          0          6         30
-/+ buffers/cache:       1699        178
Swap:         4031         28       4003

- No pressure on rhel64-3:

[root@rhel64-3 ~]# date; free -m
Fri Nov  1 14:04:38 BRT 2013
             total       used       free     shared    buffers     cached
Mem:          1877        282       1595          0         26         98
-/+ buffers/cache:        157       1720
Swap:         4031          0       4031


- MOM working:

2013-11-01 14:04:50,426 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_3 from 2097152 to 1992294
2013-11-01 14:04:51,511 - mom.Collectors.GuestMemory - WARNING - getVmMemoryStats() error: The ovirt-guest-agent is not active
2013-11-01 14:04:52,769 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_2 from 2097152 to 1992294
2013-11-01 14:04:52,863 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_1 from 2097152 to 1992294
2013-11-01 14:04:52,983 - mom.Controllers.KSM - INFO - Updating KSM configuration: pages_to_scan:364 run:1 sleep_millisecs:43
2013-11-01 14:05:01,769 - mom.Monitor - INFO - GuestMonitor-rhel64_2 is ready
2013-11-01 14:05:04,451 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_3 from 1992296 to 1892681
2013-11-01 14:05:04,504 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_2 from 1992296 to 1892681
2013-11-01 14:05:04,517 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_1 from 1992296 to 1892681
2013-11-01 14:05:04,587 - mom.Controllers.KSM - INFO - Updating KSM configuration: pages_to_scan:664 run:1 sleep_millisecs:43
2013-11-01 14:05:14,633 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_3 from 1892684 to 1798049
2013-11-01 14:05:14,835 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_2 from 1892684 to 1798049
2013-11-01 14:05:14,885 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_1 from 1892684 to 1798049
2013-11-01 14:05:14,937 - mom.Controllers.KSM - INFO - Updating KSM configuration: pages_to_scan:964 run:1 sleep_millisecs:43
2013-11-01 14:05:24,983 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_3 from 1798052 to 1708149
2013-11-01 14:05:25,080 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_2 from 1798052 to 1708149
2013-11-01 14:05:25,160 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_1 from 1798052 to 1708149
2013-11-01 14:05:25,202 - mom.Controllers.KSM - INFO - Updating KSM configuration: pages_to_scan:1250 run:1 sleep_millisecs:43
2013-11-01 14:05:35,242 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_3 from 1708152 to 1793559
2013-11-01 14:05:35,281 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_2 from 1708152 to 1793559
2013-11-01 14:05:35,314 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_1 from 1708152 to 1793559
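
The step size in the log above can be checked with a short script. This is only a sketch for reading the log: the regular expression and the function name `balloon_steps` are my own, and only the rhel64_3 lines are included because the other two guests show identical values. The balloon figures are presumably in KiB, since 2097152 KiB equals 2048 MB.

```python
import re

# Balloon lines for rhel64_3, copied from the MOM log above.
LOG = """\
2013-11-01 14:04:50,426 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_3 from 2097152 to 1992294
2013-11-01 14:05:04,451 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_3 from 1992296 to 1892681
2013-11-01 14:05:14,633 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_3 from 1892684 to 1798049
2013-11-01 14:05:24,983 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_3 from 1798052 to 1708149
2013-11-01 14:05:35,242 - mom.Controllers.Balloon - INFO - Ballooning guest:rhel64_3 from 1708152 to 1793559
"""

# Hypothetical pattern matching the "Ballooning guest:NAME from OLD to NEW" messages.
PATTERN = re.compile(r"Ballooning guest:(\S+) from (\d+) to (\d+)")


def balloon_steps(log_text):
    """Return (guest, old, new, percent_change) for every balloon line."""
    steps = []
    for match in PATTERN.finditer(log_text):
        guest = match.group(1)
        old, new = int(match.group(2)), int(match.group(3))
        change = round((new - old) / old * 100, 2)
        steps.append((guest, old, new, change))
    return steps


for guest, old, new, change in balloon_steps(LOG):
    print(f"{guest}: {old} -> {new} ({change:+.2f}%)")
```

Every step works out to 5% of the current balloon size: four reductions followed by one increase, applied identically to all three guests regardless of their load.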


- After some MOM work, guests memory status:

[root@rhel64-1 ~]# date; free -m
Fri Nov  1 14:05:16 BRT 2013
             total       used       free     shared    buffers     cached
Mem:          1585       1507         78          0          0          6
-/+ buffers/cache:       1500         85
Swap:         4031        487       3544

[root@rhel64-2 ~]# date; free -m
Fri Nov  1 14:05:18 BRT 2013
             total       used       free     shared    buffers     cached
Mem:          1585       1509         75          0          0         10
-/+ buffers/cache:       1499         86
Swap:         4031        491       3540

[root@rhel64-3 ~]# date; free -m
Fri Nov  1 14:05:20 BRT 2013
             total       used       free     shared    buffers     cached
Mem:          1585        281       1303          0         26         98
-/+ buffers/cache:        156       1428
Swap:         4031          0       4031

- Notice that MOM is reducing memory for all guests, even the heavily loaded ones.


Expected results:
MOM should not reduce a guest's memory beyond its free memory limit.

Comment 1 Doron Fediuck 2013-11-03 11:25:09 UTC
Just a few things I'd like to clarify;
1. Once a VM starts swapping MOM should detect it and stop inflating it / start deflating.

2. The floor limit MOM is using is based on the "Physical Memory Guaranteed" settings we define for each VM in the Resource Allocation sub-tab of the new/edit
VM dialog. Can you please provide the numbers set for these VMs?

Comment 2 Martin Sivák 2013-11-04 09:42:04 UTC
We are not getting swap information from the guest agent, so this is definitely an issue: when we inflate the balloon, the VM will also move some of its data to swap, and MOM will then think that there is still enough reclaimable memory in the VM.

To fix this we would have to modify the guest agent(s) - linux, win, mom collectors and vdsm policy for mom.

There is also an error in the policy and we have a fix for that - http://gerrit.ovirt.org/#/c/19416/1/doc/balloon.rules
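
To illustrate why the missing swap data matters, here is a minimal sketch. It is not MOM's actual rule: the function `reclaimable_mb` and the idea of subtracting swap usage from free RAM are assumptions made only for this example.

```python
def reclaimable_mb(free_mb, swap_used_mb=None):
    """Estimate how much guest memory could be ballooned away (hypothetical).

    Without swap data (swap_used_mb is None) all free RAM looks reclaimable.
    With swap data, free RAM that would be needed to bring swapped-out
    pages back in is not counted as reclaimable.
    """
    if swap_used_mb is None:
        return free_mb
    return max(free_mb - swap_used_mb, 0)


# Figures for rhel64-1 at 14:05:16 from the report: 78 MB free, 487 MB swap used.
print(reclaimable_mb(78))        # swap-blind estimate
print(reclaimable_mb(78, 487))   # swap-aware estimate
```

With swap usage visible, a guest that has already pushed its working set to swap no longer appears to have spare memory.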

Comment 3 Amador Pahim 2013-11-04 11:54:39 UTC
(In reply to Doron Fediuck from comment #1)
> Just a few things I'd like to clarify;
> 1. Once a VM starts swapping MOM should detect it and stop inflating it /
> start deflating.
> 
> 2. The floor limit MOM is using is based on the "Physical Memory Guaranteed"
> settings we define for each VM in the Resource Allocation sub-tab of the
> new/edit
> VM dialog. Can you please provide the numbers set for these VMs?

Defined Memory: 2048 MB
Physical Memory Guaranteed: 512 MB

Comment 4 Amador Pahim 2013-11-04 13:33:28 UTC
(In reply to Martin Sivák from comment #2)
> We are not getting swap information from the guest agent so this is
> definitely an issue. Because when we inflate the balloon, the VM will also
> put some of its data to swap and mom will then think that there is still
> enough reclaimable memory in the vm.

My concern is also about all VMs being inflated/deflated by the same amount of memory, regardless of their individual memory status. Is this an issue? Shouldn't MOM balloon each VM differently, according to its individual memory availability?

> 
> To fix this we would have to modify the guest agent(s) - linux, win, mom
> collectors and vdsm policy for mom.
> 
> There is also an error in the policy and we have a fix for that -
> http://gerrit.ovirt.org/#/c/19416/1/doc/balloon.rules

Comment 5 Martin Sivák 2013-11-04 13:41:54 UTC
> My concern is also about all VMs being inflated/deflated with the same
> amount of memory, 

That is fixed in the referenced changeset.

Comment 6 Adam Litke 2013-11-04 13:51:18 UTC
The mom policy is designed to behave in one of two ways depending on the severity of host memory pressure.  Under moderate pressure (between 5% and 20% free host memory) we try to balloon away only unused memory in guests.  Under severe host pressure (< 5% free) we purposefully cause guest swapping in order to keep the host itself from entering a swap storm.  Since you are observing guest swapping, could you share the state of host memory during that behavior?  If the host has <5% free (counting Cached pages as free) then I would argue that the policy is behaving as designed.
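
The two modes described above can be sketched as a small classifier. The 5% and 20% thresholds and the choice to count cached pages as free come from this comment; the function name, the return labels, and the handling of the exact boundary values are assumptions.

```python
def host_pressure_mode(free_mb, cached_mb, total_mb,
                       moderate_threshold=0.20, severe_threshold=0.05):
    """Classify host memory pressure (hypothetical helper, not MOM code).

    Cached pages are counted as free, as described in the comment above.
    """
    free_fraction = (free_mb + cached_mb) / total_mb
    if free_fraction < severe_threshold:
        return "severe"    # guest swapping is accepted to protect the host
    if free_fraction < moderate_threshold:
        return "moderate"  # only unused guest memory should be ballooned away
    return "none"


# Host snapshot from the report at 14:02:17, taken before the guests were loaded.
print(host_pressure_mode(free_mb=2801, cached_mb=71, total_mb=3787))
```

Feeding in the host's `free -m` output captured while the guests are swapping would show which of the two modes applied at that moment.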

Comment 7 Martin Sivák 2013-11-04 14:28:50 UTC
Adam: there are two issues there

- Your fix http://gerrit.ovirt.org/#/c/19416/1/doc/balloon.rules is not present in the version he uses, and that causes all the VMs to disregard the computed minimum in favour of the hard minimum (which is lower and the same for all VMs).

- When you have most of the memory in swap, MoM will think RAM is (almost) free and inflate the balloon because we do not have any info about the swap usage in the policy.

But you are totally right about the two modes of ballooning we use.
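
The first issue can be shown with a minimal sketch. This is not the real balloon.rules policy: `balloon_target_kib`, its parameters, and the computed minimum of 2000000 KiB are hypothetical. The 5% step matches the log in the description, and 524288 KiB corresponds to the 512 MB "Physical Memory Guaranteed" setting.

```python
def balloon_target_kib(current_kib, computed_min_kib, hard_min_kib,
                       shrink=0.05, honor_computed_min=True):
    """Return the next balloon target for one guest (hypothetical sketch).

    With honor_computed_min=False the per-guest computed minimum is ignored
    and only the hard minimum limits the shrink, which mimics the bug.
    """
    if honor_computed_min:
        floor = max(computed_min_kib, hard_min_kib)
    else:
        floor = hard_min_kib
    return max(int(current_kib * (1 - shrink)), floor)


HARD_MIN = 512 * 1024   # 512 MB guaranteed memory, in KiB
BUSY_MIN = 2000000      # made-up computed minimum for a heavily loaded guest

# Buggy behaviour: the loaded guest shrinks by the full 5% step.
print(balloon_target_kib(2097152, BUSY_MIN, HARD_MIN, honor_computed_min=False))
# Intended behaviour: the computed minimum stops the shrink early.
print(balloon_target_kib(2097152, BUSY_MIN, HARD_MIN))
```

Because the hard minimum is the same for every VM, ignoring the computed minimum makes loaded and idle guests shrink in lockstep, which is what the log shows.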

Comment 8 Eyal Edri 2013-11-14 12:55:25 UTC
Should this bug be on MODIFIED?
If all patches are in, please move it to ON_QA and mark it as fixed in is23.

Comment 9 Martin Sivák 2013-11-14 14:12:34 UTC
Unfortunately not, there are some patches that are still missing.

Comment 11 Martin Sivák 2013-11-22 12:28:25 UTC
The mom part is ready and vdsm contains the fixed policy, so it should behave better now. There are still situations where this won't be enough (mostly related to swap usage), and the related bugs add support for dealing with them.

Comment 12 Lukas Svaty 2013-11-26 11:32:12 UTC
MOM stops changing the balloon size after the first change.

Discussed the change with msivak; a patch is in progress.

Moving back to ASSIGNED.

Comment 13 Martin Sivák 2013-11-26 11:47:13 UTC
MoM used the same variable stack for all policy runs. That caused old variable values to be used sometimes.
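
A minimal sketch of this class of bug, assuming nothing about MoM's real policy engine: both classes and the `free_mb` variable are hypothetical and only show how state kept across runs can leak stale values.

```python
class SharedStackPolicy:
    """Buggy pattern: one variable scope is reused by every policy run."""

    def __init__(self):
        self.variables = {}

    def run(self, guest_free_mb):
        # The variable is only defined if it does not exist yet,
        # so later runs silently see the value from the first run.
        self.variables.setdefault("free_mb", guest_free_mb)
        return self.variables["free_mb"]


class FreshStackPolicy:
    """Fixed pattern: every policy run starts with an empty scope."""

    def run(self, guest_free_mb):
        variables = {}
        variables.setdefault("free_mb", guest_free_mb)
        return variables["free_mb"]


shared, fresh = SharedStackPolicy(), FreshStackPolicy()
print(shared.run(1000), shared.run(50))  # second run still sees 1000
print(fresh.run(1000), fresh.run(50))    # second run sees the new value
```

A policy that keeps evaluating against first-run values would stop adjusting the balloon after its first decision, which is consistent with the behaviour reported in comment 12.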

Comment 14 Lukas Svaty 2013-11-26 19:04:59 UTC
mom-0.3.2-8.el6ev tested in

Comment 16 errata-xmlrpc 2014-01-21 15:06:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0064.html

