1145977 – [events] Host memory usage exceeded defined threshold chaos - VDS_HIGH_MEM_USE vs USER_SUSPEND_VM_FINISH_FAILURE_WILL_TRY_AGAIN

Bug 1145977 - [events] Host memory usage exceeded defined threshold chaos - VDS_HIGH_MEM_USE vs USER_SUSPEND_VM_FINISH_FAILURE_WILL_TRY_AGAIN

Summary: [events] Host memory usage exceeded defined threshold chaos - VDS_HIGH_MEM_US...

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	ovirt-engine
Sub Component:
Version:	3.5.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	medium
Target Milestone:	---
Target Release:	3.5.0
Assignee:	Mooli Tayer
QA Contact:	Jiri Belka
Docs Contact:
URL:
Whiteboard:	infra
Depends On:
Blocks:	rhev35betablocker oVirt_3.5.1_tracker rhev35rcblocker rhev35gablocker
TreeView+	depends on / blocked

Reported:	2014-09-24 08:39 UTC by Jiri Belka
Modified:	2016-02-10 19:41 UTC (History)
CC List:	16 users (show)
Fixed In Version:	org.ovirt.engine-root-3.5.0-18
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2015-02-17 17:11:22 UTC
oVirt Team:	Infra
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
sosreport-LogCollector-20140924103440.tar.xz (16.79 MB, application/x-xz) 2014-09-24 08:39 UTC, Jiri Belka	no flags	Details
binary for rhel6 (59.07 KB, application/x-executable) 2014-09-24 08:40 UTC, Jiri Belka	no flags	Details
View All

Links
System	ID	Priority	Status	Summary	Last Updated
oVirt gerrit	33805	ovirt-engine-3.5	MERGED	core: Change duplicate audit log values to unique ones	2020-06-08 04:30:21 UTC
oVirt gerrit	33806	ovirt-engine-3.5	MERGED	core: Add uniqueness test to AuditLogValue	2020-06-08 04:30:22 UTC
oVirt gerrit	33818	ovirt-engine-3.5	MERGED	core: Change duplicate action type values to unique ones	2020-06-08 04:30:21 UTC

Description Jiri Belka 2014-09-24 08:39:32 UTC

Created attachment 940704 [details]
sosreport-LogCollector-20140924103440.tar.xz

Description of problem:
while investigating BZ1059309 it was discovered that 'Host memory usage exceeded defined threshold' event which is available for notification is chaotic.

I believe there's something wrong with following event as I didn't receive any mail notification.

DB contains following subscriber:

-[ RECORD 2 ]--+-------------------------------------
subscriber_id  | fdfc627c-d875-11e0-90f0-83df133b58cc
event_up_name  | VDS_HIGH_MEM_USE
method_id      | 0
method_address | jbelka
tag_name       |

Then I changed value to get easy high memory use on a host.

# engine-config -g LogMaxPhysicalMemoryUsedThresholdInPercentage
LogMaxPhysicalMemoryUsedThresholdInPercentage: 10 version: general

I created memory load on the host:

# /tmp/lookbusy -c 0 -m 512mb -M 100
lookbusy (10455): mem stirrer started, PID 10456
mem_stir (10456): stirring 536870912 bytes...
^Cgot SIGCHLD for dead child 10456; exited with signal 2
killing mem spinner 10456
killing mem spinner 10456

[root@dell-r210ii-13 ~]# vdsClient -s 0 getVdsStats | grep memUs
        memUsed = '13'

Surprisingly I didn't see VDS_HIGH_MEM in notifier.log but following line:

2014-09-24 10:07:03,328 DEBUG [org.ovirt.engine.core.notifier.dao.EventsManager] event 0 => AuditLogEvent{id=1727, logTypeName='USER_SUSPEND_VM_FINISH_FAILURE_WILL_TRY_AGAIN', type=alertMessage, userId=00000000-0000-0000-0000-000000000000, userName='null', vmId=00000000-0000-0000-0000-000000000000, vmName='null', vmTemplateId=null, vmTemplateName='null', vdsId=4a196f70-c2a6-4b9a-887b-3751db6a4fd3, vdsName='dell-r210ii-13', storagePoolId=00000000-0000-0000-0000-000000000000, storagePoolName='', storageDomainId=00000000-0000-0000-0000-000000000000, storageDomainName='', logTime=2014-09-24 10:06:58.527, severity=WARNING, message='Used memory of host dell-r210ii-13 [13%] exceeded defined threshold [10%].'}

And of course I didn't receive any mail notification. Isn't there logTypeName chaos? VDS_HIGH_MEM vs USER_SUSPEND_VM_FINISH_FAILURE_WILL_TRY_AGAIN?

engine=# select * from audit_log where log_time > '2014-09-24' and message ilike 'Used memory%';
-[ RECORD 1 ]-------+---------------------------------------------------------------------------
audit_log_id        | 1727
user_id             | 00000000-0000-0000-0000-000000000000
user_name           | 
vm_id               | 00000000-0000-0000-0000-000000000000
vm_name             | 
vm_template_id      | 
vm_template_name    | 
vds_id              | 4a196f70-c2a6-4b9a-887b-3751db6a4fd3
vds_name            | dell-r210ii-13
log_time            | 2014-09-24 10:06:58.527+02
log_type_name       | USER_SUSPEND_VM_FINISH_FAILURE_WILL_TRY_AGAIN
log_type            | 532
severity            | 1
message             | Used memory of host dell-r210ii-13 [13%] exceeded defined threshold [10%].
processed           | t
storage_pool_id     | 
storage_pool_name   | 
storage_domain_id   | 
storage_domain_name | 
vds_group_id        | e89d69a9-c4fa-4f19-b99d-1de69eaa082f
vds_group_name      | 34
correlation_id      | 
job_id              | 
quota_id            | 
quota_name          | 
gluster_volume_id   | 00000000-0000-0000-0000-000000000000
gluster_volume_name | 
origin              | oVirt
custom_event_id     | -1
event_flood_in_sec  | 30
custom_data         | 
deleted             | f
call_stack          |

FYI when I was trying to get help with mperina@ and msivak@ they both raised some doubts about logic in the code how hosts' memory events are managed (a condition and results).

Version-Release number of selected component (if applicable):

vdsm-4.14.13-2.el6ev.x86_64
Red Hat Enterprise Virtualization Hypervisor release 6.5 (20140821.1.el6ev)

How reproducible:
100%

Steps to Reproduce:
1. assign Host memory usage exceeded defined threshold for notification (make notifier running and configured of course)
2. check memUsed by vdsClient and set LogMaxPhysicalMemoryUsedThresholdInPercentage little above it via engine-config
3. make memory usage on the host

Actual results:
no event notification, seems there's mismatch about logTypeName - VDS_HIGH_MEM vs USER_SUSPEND_VM_FINISH_FAILURE_WILL_TRY_AGAIN

Expected results:
should get event notification

Additional info:
http://www.devin.com/lookbusy/ to make mem usage, in attachment my compiled version from RHEL6

Comment 1 Jiri Belka 2014-09-24 08:40:50 UTC

Created attachment 940705 [details]
binary for rhel6

Comment 2 Jiri Belka 2014-09-24 08:42:52 UTC

rhevm-backend-3.5.0-0.12.beta.el6ev.noarch
rhevm-tools-3.5.0-0.12.beta.el6ev.noarch

Comment 3 Mooli Tayer 2014-09-28 15:00:28 UTC

Very strange indeed.

Trying to reproduce according to steps I got the expected value at audit log[1].

AFAIK the code producing both notifications looks fine.

Is it possible someone fiddled with the database?

Does this reproduce on the same host?




[1]
engine=> SELECT * FROM audit_log  WHERE message like 'Used memory of host%'
;
-[ RECORD 1 ]-------+---------------------------------------------------------------------
audit_log_id        | 124
user_id             | 00000000-0000-0000-0000-000000000000
user_name           | 
vm_id               | 00000000-0000-0000-0000-000000000000
vm_name             | 
vm_template_id      | 
vm_template_name    | 
vds_id              | 75636b15-9c4d-4df6-9753-8cec2ffd7ace
vds_name            | el6.4_01
log_time            | 2014-09-28 10:48:17.51-04
log_type_name       | VDS_HIGH_MEM_USE
log_type            | 532
severity            | 1
message             | Used memory of host el6.4_01 [18%] exceeded defined threshold [10%].
processed           | f
storage_pool_id     | 
storage_pool_name   | 
storage_domain_id   | 
storage_domain_name | 
vds_group_id        | 00000001-0001-0001-0001-00000000038b
vds_group_name      | Default
correlation_id      | 
job_id              | 
quota_id            | 
quota_name          | 
gluster_volume_id   | 00000000-0000-0000-0000-000000000000
gluster_volume_name | 
origin              | oVirt
custom_event_id     | -1
event_flood_in_sec  | 30
custom_data         | 
deleted             | f
call_stack          |

Comment 4 Jiri Belka 2014-09-30 13:12:09 UTC

Totally clean 3.5 (vt4) installation.

engine=# select * from audit_log where log_time > '2014-09-30' and message ilike '%memory%';
 audit_log_id |               user_id                | user_name |                vm_id                 | vm_name | vm_template_id | vm_template_
name |                vds_id                |    vds_name    |          log_time          |                 log_type_name                 | log_t
ype | severity |                                  message                                   | processed | storage_pool_id | storage_pool_name | s
torage_domain_id | storage_domain_name |             vds_group_id             | vds_group_name | correlation_id | job_id | quota_id | quota_name 
|          gluster_volume_id           | gluster_volume_name | origin | custom_event_id | event_flood_in_sec | custom_data | deleted | call_stack
 
--------------+--------------------------------------+-----------+--------------------------------------+---------+----------------+-------------
-----+--------------------------------------+----------------+----------------------------+-----------------------------------------------+------
----+----------+----------------------------------------------------------------------------+-----------+-----------------+-------------------+--
-----------------+---------------------+--------------------------------------+----------------+----------------+--------+----------+------------
+--------------------------------------+---------------------+--------+-----------------+--------------------+-------------+---------+-----------
-
          107 | 00000000-0000-0000-0000-000000000000 |           | 00000000-0000-0000-0000-000000000000 |         |                |             
     | e7ac3e7e-8cfa-4be7-98e8-9b8a8c81fe30 | dell-r210ii-03 | 2014-09-30 15:11:12.745+02 | USER_SUSPEND_VM_FINISH_FAILURE_WILL_TRY_AGAIN |      
532 |        1 | Used memory of host dell-r210ii-03 [11%] exceeded defined threshold [10%]. | t         |                 |                   |  
                 |                     | 00000001-0001-0001-0001-000000000195 | Default        |                |        |          |            
| 00000000-0000-0000-0000-000000000000 |                     | oVirt  |              -1 |                 30 |             | f       | 
(1 row)

Comment 5 Jiri Belka 2014-09-30 13:40:13 UTC

The issue is reproducible also on my colleague env.

Comment 6 Mooli Tayer 2014-09-30 16:19:30 UTC

Probably didn't reproduce for me since I used master and not 3.5.

Anyhow after digging with git rev-list I see:

the id for VDS_HIGH_MEM_USE is 532
the id for USER_SUSPEND_VM_FINISH_FAILURE_WILL_TRY_AGAIN is 130

So it must be related to this change[1]
changing the id of USER_SUSPEND_VM_FINISH_FAILURE_WILL_TRY_AGAIN from 532 to 130.

Will continue investigating tomorrow.

[1] http://gerrit.ovirt.org/#/c/29740/

Comment 7 Mooli Tayer 2014-10-05 07:53:23 UTC

Actually the problem is that change 29740, that fixed the problem of same id events, is missing in 3.5.

So both events (AuditLogType) had id 532.

Note: unfortunately this effects 34 other ALT with duplicate id's as well:

VDS_STORAGE_CONNECTION_FAILED_BUT_LAST_VDS, VDS_HIGH_NETWORK_USE, (id=533)
VDS_STORAGES_CONNECTION_FAILED, VDS_HIGH_SWAP_USE, (id=535)
VDS_STORAGE_VDS_STATS_FAILED, VDS_HIGH_CPU_USE, (id=534)
VDS_SET_NONOPERATIONAL, VDS_FAILED_TO_GET_HOST_HARDWARE_INFO, (id=517)
USER_SUSPEND_VM_FINISH_FAILURE_WILL_TRY_AGAIN, VDS_HIGH_MEM_USE, (id=532)
GLUSTER_VOLUME_REBALANCE_START_DETECTED_FROM_CLI, GLUSTER_HOST_UUID_ALREADY_EXISTS, (id=4089)
USER_VDS_START, VDS_STATUS_CHANGE_FAILED_DUE_TO_STOP_SPM_FAILURE, (id=20)
USER_VDS_STOP, VDS_PROVISION, (id=21)
USER_ADD_VM_TEMPLATE_SUCCESS, USER_ADD_VM_FINISHED_SUCCESS, (id=53)
USER_ADD_VM_TEMPLATE_FAILURE, USER_FAILED_RUN_VM, (id=54)
CPU_TYPE_UNSUPPORTED_IN_THIS_CLUSTER_VERSION, USER_INITIATED_RUN_VM_AND_PAUSE, (id=156)
VM_IMPORT_FROM_CONFIGURATION_ATTACH_DISKS_FAILED, USER_ADD, (id=149)
VM_MEMORY_UNDER_GUARANTEED_VALUE, VM_IMPORT_FROM_CONFIGURATION_EXECUTED_SUCCESSFULLY, (id=148)
VM_BALLOON_DRIVER_ERROR, USER_ADD, (id=149)
VM_BALLOON_DRIVER_UNCONTROLLED, USER_INITIATED_RUN_VM, (id=150)
VM_MEMORY_NOT_IN_RECOMMENDED_RANGE, USER_INITIATED_RUN_VM_FAILED, (id=151)
USER_LOGGED_IN_VM, USER_ATTACH_TAG_TO_TEMPLATE, (id=456)
USER_LOGGED_OUT_VM, USER_ATTACH_TAG_TO_TEMPLATE_FAILED, (id=457)
USER_LOCKED_VM, USER_DETACH_TEMPLATE_FROM_TAG, (id=458)
USER_UNLOCKED_VM, USER_DETACH_TEMPLATE_FROM_TAG_FAILED, (id=459)
IMPORTEXPORT_GET_VMS_INFO_FAILED, IMPORTEXPORT_STARTING_EXPORT_VM, (id=1162)
UPDATE_OVF_FOR_STORAGE_DOMAIN_FAILED, NETWORK_ACTIVATE_VM_INTERFACE_SUCCESS, (id=1012)
CREATE_OVF_STORE_FOR_STORAGE_DOMAIN_FAILED, NETWORK_ACTIVATE_VM_INTERFACE_FAILURE, (id=1013)
CREATE_OVF_STORE_FOR_STORAGE_DOMAIN_INITIATE_FAILED, NETWORK_DEACTIVATE_VM_INTERFACE_SUCCESS, (id=1014)
DELETE_OVF_STORE_FOR_STORAGE_DOMAIN_FAILED, NETWORK_DEACTIVATE_VM_INTERFACE_FAILURE, (id=1015)
USER_ACCOUNT_DISABLED_OR_LOCKED, NETWORK_UPDATE_DISPLAY_TO_VDS_GROUP, (id=1100)
PROVIDER_ADDED, IMPORTEXPORT_EXPORT_VM, (id=1150)
PROVIDER_ADDITION_FAILED, IMPORTEXPORT_EXPORT_VM_FAILED, (id=1151)
PROVIDER_UPDATED, IMPORTEXPORT_IMPORT_VM, (id=1152)
PROVIDER_UPDATE_FAILED, IMPORTEXPORT_IMPORT_VM_FAILED, (id=1153)
PROVIDER_REMOVED, IMPORTEXPORT_REMOVE_TEMPLATE, (id=1154)
PROVIDER_REMOVAL_FAILED, IMPORTEXPORT_REMOVE_TEMPLATE_FAILED, (id=1155)
PROVIDER_CERTIFICATE_CHAIN_IMPORTED, IMPORTEXPORT_EXPORT_TEMPLATE, (id=1156)
PROVIDER_CERTIFICATE_CHAIN_IMPORT_FAILED, IMPORTEXPORT_EXPORT_TEMPLATE_FAILED, (id=1157)

Note 2: I will also make sure that a test exists to that this does not happen again, and if not create one.

Comment 8 Mooli Tayer 2014-10-05 08:12:21 UTC

A test now exist (AuditLogTypeTest) and there are no duplicate values.

So only backport is left.

Comment 10 Mooli Tayer 2014-10-05 15:01:36 UTC

ack please, 

Also I think this is 3.5.1 materiel(not sure how to set flag). 

There is a problem to backport:
http://gerrit.ovirt.org/#/c/29779/

Since during 

private static Class<CommandBase<? extends VdcActionParametersBase>> loadClass(String className) {
        try {
            return (Class<CommandBase<? extends VdcActionParametersBase>>) Class.forName(className);
        } catch (ClassNotFoundException e) {
            return null;
        }
    }

we get ExceptionInInitializerError from InitVdsOnUpCommand:
private static int MAX_RETRIES_GLUSTER_PROBE_STATUS = Config.<Integer> getValue(ConfigValues.GlusterPeerStatusRetries);

Since Config isn't set or mocked. I understand on master there has been infrastructure work to resolve this, and don't want to backport too much.

As Tal noted, we are free from duplicate VdsActionTypes since master is a gate keeper for that.

Comment 11 Mooli Tayer 2014-10-05 15:07:14 UTC

Also, a title change might be in order, since we are talking about more than a single event. once release is set I think this should be a blocker.

Comment 12 Tal Nisan 2014-10-06 09:52:51 UTC

Exactly, as soon as master's unit tests are blocking the opportunity to introduce duplicate IDs since everything that is backported to 3.5 stable has to go through master, we are covered on that angle

Comment 13 Mooli Tayer 2014-10-22 00:55:19 UTC

removed patch, see comment 10

Comment 14 Jiri Belka 2014-10-31 10:03:26 UTC

ok, vt8 / rhevm-backend-3.5.0-0.18.beta.el6ev.noarch

engine=# select * from audit_log where log_time > '2014-10-31 10:00:00' and message ilike 'Used mem%';
 audit_log_id |               user_id                | user_name |                vm_id                 | vm_name | vm_template_id | vm_template_name |                vds_id                |    vds_name    |      
    log_time          |  log_type_name   | log_type | severity |                                  message                                   | processed | storage_pool_id | storage_pool_name | storage_domain_id | s
torage_domain_name |             vds_group_id             | vds_group_name | correlation_id | job_id | quota_id | quota_name |          gluster_volume_id           | gluster_volume_name | origin | custom_event_id 
| event_flood_in_sec | custom_data | deleted | call_stack 
--------------+--------------------------------------+-----------+--------------------------------------+---------+----------------+------------------+--------------------------------------+----------------+------
----------------------+------------------+----------+----------+----------------------------------------------------------------------------+-----------+-----------------+-------------------+-------------------+--
-------------------+--------------------------------------+----------------+----------------+--------+----------+------------+--------------------------------------+---------------------+--------+-----------------
+--------------------+-------------+---------+------------
          462 | 00000000-0000-0000-0000-000000000000 |           | 00000000-0000-0000-0000-000000000000 |         |                |                  | fe0ef6eb-b9cd-4557-b1d3-d844beed76ad | dell-r210ii-03 | 2014-
10-31 11:01:01.546+01 | VDS_HIGH_MEM_USE |      532 |        1 | Used memory of host dell-r210ii-03 [18%] exceeded defined threshold [10%]. | t         |                 |                   |                   |  
                   | 00000001-0001-0001-0001-0000000002ad | Default        |                |        |          |            | 00000000-0000-0000-0000-000000000000 |                     | oVirt  |              -1 
|                 30 |             | f       | 
(1 row)

From: ovirt
To: jbelka
Subject: alertMessage (jb-rhevm35.xxxxx.com), [Used memory of host dell-r210ii-03 [18%] exceeded defined threshold [10%].]
Date: Fri, 31 Oct 2014 11:01:23 +0100 (CET)

Time:2014-10-31 11:01:01.546
Message:Used memory of host dell-r210ii-03 [18%] exceeded defined threshold [10%].
Severity:WARNING
Host Name: dell-r210ii-03

Comment 15 Eyal Edri 2015-02-17 17:11:22 UTC

rhev 3.5.0 was released. closing.

Note You need to log in before you can comment on or make changes to this bug.