1135909 – [Event log] Wrong warning about available swap memory of host [1023MB], when actually host has [1024MB] memory size

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	oVirt
Classification:	Retired
Component:	ovirt-engine-core
Sub Component:
Version:	3.5
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	3.5.1
Assignee:	Eli Mesika
QA Contact:	Michael Burman
Docs Contact:
URL:
Whiteboard:	infra
Depends On:
Blocks:	1135966

Reported:	2014-09-01 07:27 UTC by Michael Burman
Modified:	2016-02-10 19:32 UTC
CC List:	14 users
Fixed In Version:
Clone Of:
Clones:	1135966
Environment:
Last Closed:	2015-01-21 16:13:21 UTC
oVirt Team:	Infra
Embargoed:
Dependent Products:

Attachments
Screenshoot (397.89 KB, application/octet-stream) 2014-09-01 07:27 UTC, Michael Burman	no flags

Links
System	ID	Priority	Status	Summary	Last Updated
oVirt gerrit	32911	None	ABANDONED	Rounding up swap usage	Never
oVirt gerrit	33164	None	None	None	Never
oVirt gerrit	33265	ovirt-engine-3.5	MERGED	core: allow 2% gap in host low swap	Never

Description Michael Burman 2014-09-01 07:27:43 UTC

Created attachment 933262
Screenshoot

Description of problem:
Wrong warning messages appear in the event log every 10-15 minutes about the available swap memory of the host [1023MB], when the actual swap memory size of the host is [1024MB].

cat /proc/meminfo

SwapTotal:       1048568 kB (1023.99219 MB)
SwapFree:        1048568 kB (1023.99219 MB)

These warnings started to appear in the last build.

Version-Release number of selected component (if applicable):
 oVirt Engine Version: 3.5.0-0.0.master.20140821064931.gitb794d66.el6 

How reproducible:
always

Steps to Reproduce:
1. Have a working setup with a host installed

Actual results:
Wrong warnings every 10-15 minutes in the event log about available swap memory of host.

Expected results:
No such warning messages in the event log.

Additional info:

Comment 1 Barak 2014-09-02 11:58:25 UTC

Eli - we should just round up values from VDSM before doing the comparison.

Comment 2 Oved Ourfali 2014-09-07 08:34:43 UTC

Seems like the truncation logic resides in VDSM.
It should round it up instead of truncating.
Moving it to VDSM.

Comment 3 Yaniv Bronhaim 2014-09-07 15:03:35 UTC

Can you check on the host side the output of

vdsClient -s 0 getVdsStats | grep -i swap

I'm quite sure this is an engine-side issue. Nothing has changed in VDSM in that area (utils.py: readMemInfo) for a really long time, and it seems to work fine.

On the engine side it was modified not so long ago: http://gerrit.ovirt.org/#/c/10865/6/backend/manager/modules/vdsbroker/src/main/java/org/ovirt/engine/core/vdsbroker/VdsUpdateRunTimeInfo.java,cm

Apart from that, you can see that the message in the audit log is:
VDS_LOW_SWAP=Available swap memory of host ${HostName} [${AvailableSwapMemory} MB] is under defined threshold [${Threshold} MB].

The values you attached are right: free 1023, available 1024.
So it must be wrong on the engine side.

Please also attach the values you have for LogPhysicalMemoryThresholdInMB and LogMaxPhysicalMemoryUsedThresholdInPercentage in your vdc_options table.

Do you still want me to handle it?

Comment 4 Eli Mesika 2014-09-07 15:45:34 UTC

I tried this on one of my hosts:

vdsClient -s 0 getVdsStats | grep -i swap
        swapFree = 15999
        swapTotal = 15999

[root@pluto-vdsa ~]# cat /proc/meminfo |grep -i swap

SwapTotal:      16383992 kB
SwapFree:       16383992 kB


The value is in kB, so let's convert it to MB:

16383992/1024 = 15999.9921875, so IMO VDSM should report 16000 if it rounds the value, but since it reports 15999 it is clear that the value is truncated by VDSM.
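
The difference between truncating and rounding can be checked with a short Python sketch. It uses the SwapTotal value from this comment; the function name kb_to_mb is hypothetical and this is not the VDSM code, only the arithmetic being discussed.

```python
import math


def kb_to_mb(kb):
    """Return (truncated, rounded, rounded_up) whole-MB values for a kB count."""
    truncated = kb // 1024           # drops the fractional part
    rounded = round(kb / 1024)       # nearest whole MB
    rounded_up = math.ceil(kb / 1024)  # next whole MB if there is any fraction
    return truncated, rounded, rounded_up


# SwapTotal from /proc/meminfo on the host in this comment, in kB.
truncated, rounded, rounded_up = kb_to_mb(16383992)
print(truncated, rounded, rounded_up)
```

With this input, truncation gives 15999, which matches what vdsClient reports, while rounding or rounding up gives 16000.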

Comment 5 Michael Burman 2014-09-08 05:43:36 UTC

vdsClient -s 0 getVdsStats | grep -i swap

swapFree = 1023
swapTotal = 1023


cat /proc/meminfo |grep -i swap

SwapCached:            0 kB
SwapTotal:       1048572 kB (1023.99609 MB)
SwapFree:        1048572 kB (1023.99609 MB)

It should report 1024MB if the value were rounded. It reports 1023, so the value is truncated.
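
A minimal Python sketch of how far this host is from a full 1024MB, using the /proc/meminfo lines above as sample text. The helper read_swap_kb is hypothetical and is not VDSM's readMemInfo; it only parses the "Name: value kB" layout shown here.

```python
import re

# The swap lines from the output above, without the added MB annotations.
SAMPLE = """\
SwapCached:            0 kB
SwapTotal:       1048572 kB
SwapFree:        1048572 kB
"""


def read_swap_kb(meminfo_text):
    """Collect the Swap* fields (in kB) from /proc/meminfo-style text."""
    fields = {}
    for match in re.finditer(r"^(Swap\w+):\s+(\d+) kB", meminfo_text, re.MULTILINE):
        fields[match.group(1)] = int(match.group(2))
    return fields


swap = read_swap_kb(SAMPLE)
free_mb_truncated = swap["SwapFree"] // 1024       # whole MB after truncation
shortfall_kb = 1024 * 1024 - swap["SwapFree"]      # kB missing from a full 1024MB
print(free_mb_truncated, shortfall_kb)
```

For this sample the truncated value is 1023MB even though the host is only 4 kB short of 1024MB.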

Comment 6 Yaniv Bronhaim 2014-09-08 16:19:41 UTC

Oh, that I can check and fix. But according to the description this is not the bug. The bug is that the engine's reports on low swap space are wrong, IIUC, no?

Comment 7 Michael Burman 2014-09-09 06:31:22 UTC

Hi,

I'm not sure where the value is truncated, whether on the engine side or in VDSM, but the bug is that the event log shows wrong warning messages about available swap memory every 10-15 minutes.

Comment 8 Yaniv Bronhaim 2014-09-14 16:41:03 UTC

I'm not sure that rounding the value up is the right behavior. I posted a patch anyway; let's see what others say.

Comment 9 Dan Kenigsberg 2014-09-14 20:27:14 UTC

Why is this a bug? The host has less than 1024MiB free memory, so you get a warning.

The only question is why this behavior is new - Vdsm did not change anything there.

Comment 10 Oved Ourfali 2014-09-15 06:14:16 UTC

(In reply to Dan Kenigsberg from comment #9)
> Why is this a bug? The host has less than 1024MiB free memory, so you get a
> warning.
> 
> The only question is why this behavior is new - Vdsm did not change anything
> there.

IMO we should be a bit more permissive in such cases. It is true that we can change the test in the engine side to follow some threshold, however I still think that rounding up sounds right in this case.
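
As a sketch of what a more permissive engine-side test could look like: the 2% figure below comes from the title of the merged gerrit change listed under Links ("core: allow 2% gap in host low swap"), but how the engine actually applies it is not shown in this report, so the formula and the name is_low_swap are assumptions.

```python
def is_low_swap(free_mb, threshold_mb, gap_percent=0):
    """Return True when free swap is below the threshold by more than gap_percent.

    Integer arithmetic is used to avoid float rounding:
    free < threshold * (100 - gap) / 100
    """
    return free_mb * 100 < threshold_mb * (100 - gap_percent)


# Values assumed from this report: 1023MB free against a 1024MB threshold.
strict = is_low_swap(1023, 1024)                    # plain "less than" comparison
tolerant = is_low_swap(1023, 1024, gap_percent=2)   # allow a 2% gap
print(strict, tolerant)
```

With a strict comparison the 1023MB host triggers the warning; with a 2% gap the warning only fires below roughly 1003MB, so a host that is 1MB short stays quiet.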

Comment 11 Michael Burman 2014-09-15 06:30:42 UTC

Hi Dan,

1. This is a new behavior from the last builds, upstream and downstream, so it means that something has changed.

2. The host has: 
 SwapTotal:       1048572 kB (1023.99609 MB)
 SwapFree:        1048572 kB (1023.99609 MB)

It is closer to 1024MB than to 1023MB.
The available swap memory on the hosts hasn't changed.

3. These warnings are displayed in the event log every 10-15 minutes, and I don't think that as an administrator you would love to see the event log filled with these warnings every 10 minutes.

Comment 12 Yaniv Bronhaim 2014-09-15 14:39:00 UTC

Barak, Dan - please give your final thoughts on whether it is a bug or not. If not, is the event logging behavior reasonable or not? (You can always set a different threshold.)

Comment 13 Dan Kenigsberg 2014-09-15 15:04:48 UTC

Michael, if you install an older Vdsm on your host, what would it report? I'm sure that just like in 3.5.0, it would report 1023MiB too. The new behavior is not on Vdsm side.

Could it be that the recent change is in the guest? Which kernel is running there? Did it change its swap accounting?

Oved, if you manage 3.4 hosts with rhevm-3.5.0, you would still see this annoying error.

Bottom line: you should either use hosts with bigger swap space, or lower the Engine threshold to 1023. It should be fixed on Engine or not fixed at all.

Comment 14 Michael Burman 2014-09-16 05:33:37 UTC

- With an older vdsm it's the same; it reports the same in the event log.
vdsm-4.14.13-2.el6ev.x86_64

- Like I said, it started in the last 2 builds (upstream + downstream)

- kernel 2.6.32-431.el6.x86_64

Comment 15 Michael Burman 2014-09-17 13:05:33 UTC

Started in the last 2 builds- moving back to regression

Comment 16 Dan Kenigsberg 2014-09-17 15:08:37 UTC

Could you see if an older guest kernel changes things?

Comment 17 Michael Burman 2014-09-22 06:46:16 UTC

2.6.32-431.el6.x86_64
2.6.32-431.23.3.el6.x86_64

Comment 18 Dan Kenigsberg 2014-09-22 23:35:34 UTC

I'll try to rephrase my question: do much older guest kernels report different swap usage? I'm trying to understand what change triggered the annoying reports that you see.

Comment 19 Michael Burman 2014-09-23 12:43:02 UTC

A much older kernel is kernel-2.6.32-358.el6.x86_64, and it's for rhel6.4.

Just because you asked, I did a test and installed rhel6.4 with this kernel in my setup (vt3.1).
As it seems for now, with this host I don't get these annoying messages.

But with kernel 2.6.32-431.el6.x86_64 and above (rhel6.5) I get these messages.

And also with kernel 3.10.0-123.el7 for rhel7 I get these messages.

Comment 20 Oved Ourfali 2014-09-23 12:48:22 UTC

We'll handle that on the engine side.

Comment 21 Sandro Bonazzola 2015-01-21 16:13:21 UTC

oVirt 3.5.1 has been released and since this bug is targeted 3.5.1 and in modified state, it should be included in this release.
Please re-target and move back to modified if this assumption is not valid for this bug.
