Bug 1145977

Summary: [events] Host memory usage exceeded defined threshold chaos - VDS_HIGH_MEM_USE vs USER_SUSPEND_VM_FINISH_FAILURE_WILL_TRY_AGAIN
Product: Red Hat Enterprise Virtualization Manager Reporter: Jiri Belka <jbelka>
Component: ovirt-engineAssignee: Mooli Tayer <mtayer>
Status: CLOSED CURRENTRELEASE QA Contact: Jiri Belka <jbelka>
Severity: medium Docs Contact:
Priority: urgent    
Version: 3.5.0CC: bazulay, ecohen, gklein, iheim, jbelka, lpeer, lsurette, oourfali, pstehlik, rbalakri, Rhev-m-bugs, sbonazzo, sherold, s.kieske, tnisan, yeylon
Target Milestone: ---   
Target Release: 3.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: infra
Fixed In Version: org.ovirt.engine-root-3.5.0-18 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-02-17 17:11:22 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1147536, 1155170, 1164308, 1164311    
Attachments:
Description Flags
sosreport-LogCollector-20140924103440.tar.xz
none
binary for rhel6 none

Description Jiri Belka 2014-09-24 08:39:32 UTC
Created attachment 940704 [details]
sosreport-LogCollector-20140924103440.tar.xz

Description of problem:
while investigating BZ1059309 it was discovered that 'Host memory usage exceeded defined threshold' event which is available for notification is chaotic.

I believe there's something wrong with following event as I didn't receive any mail notification.

DB contains following subscriber:

-[ RECORD 2 ]--+-------------------------------------
subscriber_id  | fdfc627c-d875-11e0-90f0-83df133b58cc
event_up_name  | VDS_HIGH_MEM_USE
method_id      | 0
method_address | jbelka
tag_name       |

Then I changed value to get easy high memory use on a host.

# engine-config -g LogMaxPhysicalMemoryUsedThresholdInPercentage
LogMaxPhysicalMemoryUsedThresholdInPercentage: 10 version: general

I created memory load on the host:

# /tmp/lookbusy -c 0 -m 512mb -M 100
lookbusy (10455): mem stirrer started, PID 10456
mem_stir (10456): stirring 536870912 bytes...
^Cgot SIGCHLD for dead child 10456; exited with signal 2
killing mem spinner 10456
killing mem spinner 10456

[root@dell-r210ii-13 ~]# vdsClient -s 0 getVdsStats | grep memUs
        memUsed = '13'

Surprisingly I didn't see VDS_HIGH_MEM in notifier.log but following line:

2014-09-24 10:07:03,328 DEBUG [org.ovirt.engine.core.notifier.dao.EventsManager] event 0 => AuditLogEvent{id=1727, logTypeName='USER_SUSPEND_VM_FINISH_FAILURE_WILL_TRY_AGAIN', type=alertMessage, userId=00000000-0000-0000-0000-000000000000, userName='null', vmId=00000000-0000-0000-0000-000000000000, vmName='null', vmTemplateId=null, vmTemplateName='null', vdsId=4a196f70-c2a6-4b9a-887b-3751db6a4fd3, vdsName='dell-r210ii-13', storagePoolId=00000000-0000-0000-0000-000000000000, storagePoolName='', storageDomainId=00000000-0000-0000-0000-000000000000, storageDomainName='', logTime=2014-09-24 10:06:58.527, severity=WARNING, message='Used memory of host dell-r210ii-13 [13%] exceeded defined threshold [10%].'}

And of course I didn't receive any mail notification. Isn't there logTypeName chaos? VDS_HIGH_MEM vs USER_SUSPEND_VM_FINISH_FAILURE_WILL_TRY_AGAIN?

engine=# select * from audit_log where log_time > '2014-09-24' and message ilike 'Used memory%';
-[ RECORD 1 ]-------+---------------------------------------------------------------------------
audit_log_id        | 1727
user_id             | 00000000-0000-0000-0000-000000000000
user_name           | 
vm_id               | 00000000-0000-0000-0000-000000000000
vm_name             | 
vm_template_id      | 
vm_template_name    | 
vds_id              | 4a196f70-c2a6-4b9a-887b-3751db6a4fd3
vds_name            | dell-r210ii-13
log_time            | 2014-09-24 10:06:58.527+02
log_type_name       | USER_SUSPEND_VM_FINISH_FAILURE_WILL_TRY_AGAIN
log_type            | 532
severity            | 1
message             | Used memory of host dell-r210ii-13 [13%] exceeded defined threshold [10%].
processed           | t
storage_pool_id     | 
storage_pool_name   | 
storage_domain_id   | 
storage_domain_name | 
vds_group_id        | e89d69a9-c4fa-4f19-b99d-1de69eaa082f
vds_group_name      | 34
correlation_id      | 
job_id              | 
quota_id            | 
quota_name          | 
gluster_volume_id   | 00000000-0000-0000-0000-000000000000
gluster_volume_name | 
origin              | oVirt
custom_event_id     | -1
event_flood_in_sec  | 30
custom_data         | 
deleted             | f
call_stack          |

FYI when I was trying to get help with mperina@ and msivak@ they both raised some doubts about logic in the code how hosts' memory events are managed (a condition and results).

Version-Release number of selected component (if applicable):

vdsm-4.14.13-2.el6ev.x86_64
Red Hat Enterprise Virtualization Hypervisor release 6.5 (20140821.1.el6ev)

How reproducible:
100%

Steps to Reproduce:
1. assign Host memory usage exceeded defined threshold for notification (make notifier running and configured of course)
2. check memUsed by vdsClient and set LogMaxPhysicalMemoryUsedThresholdInPercentage little above it via engine-config
3. make memory usage on the host

Actual results:
no event notification, seems there's mismatch about logTypeName - VDS_HIGH_MEM vs USER_SUSPEND_VM_FINISH_FAILURE_WILL_TRY_AGAIN

Expected results:
should get event notification

Additional info:
http://www.devin.com/lookbusy/ to make mem usage, in attachment my compiled version from RHEL6

Comment 1 Jiri Belka 2014-09-24 08:40:50 UTC
Created attachment 940705 [details]
binary for rhel6

Comment 2 Jiri Belka 2014-09-24 08:42:52 UTC
rhevm-backend-3.5.0-0.12.beta.el6ev.noarch
rhevm-tools-3.5.0-0.12.beta.el6ev.noarch

Comment 3 Mooli Tayer 2014-09-28 15:00:28 UTC
Very strange indeed.

Trying to reproduce according to steps I got the expected value at audit log[1].

AFAIK the code producing both notifications looks fine.

Is it possible someone fiddled with the database?

Does this reproduce on the same host?




[1]
engine=> SELECT * FROM audit_log  WHERE message like 'Used memory of host%'
;
-[ RECORD 1 ]-------+---------------------------------------------------------------------
audit_log_id        | 124
user_id             | 00000000-0000-0000-0000-000000000000
user_name           | 
vm_id               | 00000000-0000-0000-0000-000000000000
vm_name             | 
vm_template_id      | 
vm_template_name    | 
vds_id              | 75636b15-9c4d-4df6-9753-8cec2ffd7ace
vds_name            | el6.4_01
log_time            | 2014-09-28 10:48:17.51-04
log_type_name       | VDS_HIGH_MEM_USE
log_type            | 532
severity            | 1
message             | Used memory of host el6.4_01 [18%] exceeded defined threshold [10%].
processed           | f
storage_pool_id     | 
storage_pool_name   | 
storage_domain_id   | 
storage_domain_name | 
vds_group_id        | 00000001-0001-0001-0001-00000000038b
vds_group_name      | Default
correlation_id      | 
job_id              | 
quota_id            | 
quota_name          | 
gluster_volume_id   | 00000000-0000-0000-0000-000000000000
gluster_volume_name | 
origin              | oVirt
custom_event_id     | -1
event_flood_in_sec  | 30
custom_data         | 
deleted             | f
call_stack          |

Comment 4 Jiri Belka 2014-09-30 13:12:09 UTC
Totally clean 3.5 (vt4) installation.

engine=# select * from audit_log where log_time > '2014-09-30' and message ilike '%memory%';
 audit_log_id |               user_id                | user_name |                vm_id                 | vm_name | vm_template_id | vm_template_
name |                vds_id                |    vds_name    |          log_time          |                 log_type_name                 | log_t
ype | severity |                                  message                                   | processed | storage_pool_id | storage_pool_name | s
torage_domain_id | storage_domain_name |             vds_group_id             | vds_group_name | correlation_id | job_id | quota_id | quota_name 
|          gluster_volume_id           | gluster_volume_name | origin | custom_event_id | event_flood_in_sec | custom_data | deleted | call_stack
 
--------------+--------------------------------------+-----------+--------------------------------------+---------+----------------+-------------
-----+--------------------------------------+----------------+----------------------------+-----------------------------------------------+------
----+----------+----------------------------------------------------------------------------+-----------+-----------------+-------------------+--
-----------------+---------------------+--------------------------------------+----------------+----------------+--------+----------+------------
+--------------------------------------+---------------------+--------+-----------------+--------------------+-------------+---------+-----------
-
          107 | 00000000-0000-0000-0000-000000000000 |           | 00000000-0000-0000-0000-000000000000 |         |                |             
     | e7ac3e7e-8cfa-4be7-98e8-9b8a8c81fe30 | dell-r210ii-03 | 2014-09-30 15:11:12.745+02 | USER_SUSPEND_VM_FINISH_FAILURE_WILL_TRY_AGAIN |      
532 |        1 | Used memory of host dell-r210ii-03 [11%] exceeded defined threshold [10%]. | t         |                 |                   |  
                 |                     | 00000001-0001-0001-0001-000000000195 | Default        |                |        |          |            
| 00000000-0000-0000-0000-000000000000 |                     | oVirt  |              -1 |                 30 |             | f       | 
(1 row)

Comment 5 Jiri Belka 2014-09-30 13:40:13 UTC
The issue is reproducible also on my colleague env.

Comment 6 Mooli Tayer 2014-09-30 16:19:30 UTC
Probably didn't reproduce for me since I used master and not 3.5.

Anyhow after digging with git rev-list I see:

the id for VDS_HIGH_MEM_USE is 532
the id for USER_SUSPEND_VM_FINISH_FAILURE_WILL_TRY_AGAIN is 130

So it must be related to this change[1]
changing the id of USER_SUSPEND_VM_FINISH_FAILURE_WILL_TRY_AGAIN from 532 to 130.

Will continue investigating tomorrow.

[1] http://gerrit.ovirt.org/#/c/29740/

Comment 7 Mooli Tayer 2014-10-05 07:53:23 UTC
Actually the problem is that change 29740, that fixed the problem of same id events, is missing in 3.5.

So both events (AuditLogType) had id 532.

Note: unfortunately this effects 34 other ALT with duplicate id's as well:

VDS_STORAGE_CONNECTION_FAILED_BUT_LAST_VDS, VDS_HIGH_NETWORK_USE, (id=533)
VDS_STORAGES_CONNECTION_FAILED, VDS_HIGH_SWAP_USE, (id=535)
VDS_STORAGE_VDS_STATS_FAILED, VDS_HIGH_CPU_USE, (id=534)
VDS_SET_NONOPERATIONAL, VDS_FAILED_TO_GET_HOST_HARDWARE_INFO, (id=517)
USER_SUSPEND_VM_FINISH_FAILURE_WILL_TRY_AGAIN, VDS_HIGH_MEM_USE, (id=532)
GLUSTER_VOLUME_REBALANCE_START_DETECTED_FROM_CLI, GLUSTER_HOST_UUID_ALREADY_EXISTS, (id=4089)
USER_VDS_START, VDS_STATUS_CHANGE_FAILED_DUE_TO_STOP_SPM_FAILURE, (id=20)
USER_VDS_STOP, VDS_PROVISION, (id=21)
USER_ADD_VM_TEMPLATE_SUCCESS, USER_ADD_VM_FINISHED_SUCCESS, (id=53)
USER_ADD_VM_TEMPLATE_FAILURE, USER_FAILED_RUN_VM, (id=54)
CPU_TYPE_UNSUPPORTED_IN_THIS_CLUSTER_VERSION, USER_INITIATED_RUN_VM_AND_PAUSE, (id=156)
VM_IMPORT_FROM_CONFIGURATION_ATTACH_DISKS_FAILED, USER_ADD, (id=149)
VM_MEMORY_UNDER_GUARANTEED_VALUE, VM_IMPORT_FROM_CONFIGURATION_EXECUTED_SUCCESSFULLY, (id=148)
VM_BALLOON_DRIVER_ERROR, USER_ADD, (id=149)
VM_BALLOON_DRIVER_UNCONTROLLED, USER_INITIATED_RUN_VM, (id=150)
VM_MEMORY_NOT_IN_RECOMMENDED_RANGE, USER_INITIATED_RUN_VM_FAILED, (id=151)
USER_LOGGED_IN_VM, USER_ATTACH_TAG_TO_TEMPLATE, (id=456)
USER_LOGGED_OUT_VM, USER_ATTACH_TAG_TO_TEMPLATE_FAILED, (id=457)
USER_LOCKED_VM, USER_DETACH_TEMPLATE_FROM_TAG, (id=458)
USER_UNLOCKED_VM, USER_DETACH_TEMPLATE_FROM_TAG_FAILED, (id=459)
IMPORTEXPORT_GET_VMS_INFO_FAILED, IMPORTEXPORT_STARTING_EXPORT_VM, (id=1162)
UPDATE_OVF_FOR_STORAGE_DOMAIN_FAILED, NETWORK_ACTIVATE_VM_INTERFACE_SUCCESS, (id=1012)
CREATE_OVF_STORE_FOR_STORAGE_DOMAIN_FAILED, NETWORK_ACTIVATE_VM_INTERFACE_FAILURE, (id=1013)
CREATE_OVF_STORE_FOR_STORAGE_DOMAIN_INITIATE_FAILED, NETWORK_DEACTIVATE_VM_INTERFACE_SUCCESS, (id=1014)
DELETE_OVF_STORE_FOR_STORAGE_DOMAIN_FAILED, NETWORK_DEACTIVATE_VM_INTERFACE_FAILURE, (id=1015)
USER_ACCOUNT_DISABLED_OR_LOCKED, NETWORK_UPDATE_DISPLAY_TO_VDS_GROUP, (id=1100)
PROVIDER_ADDED, IMPORTEXPORT_EXPORT_VM, (id=1150)
PROVIDER_ADDITION_FAILED, IMPORTEXPORT_EXPORT_VM_FAILED, (id=1151)
PROVIDER_UPDATED, IMPORTEXPORT_IMPORT_VM, (id=1152)
PROVIDER_UPDATE_FAILED, IMPORTEXPORT_IMPORT_VM_FAILED, (id=1153)
PROVIDER_REMOVED, IMPORTEXPORT_REMOVE_TEMPLATE, (id=1154)
PROVIDER_REMOVAL_FAILED, IMPORTEXPORT_REMOVE_TEMPLATE_FAILED, (id=1155)
PROVIDER_CERTIFICATE_CHAIN_IMPORTED, IMPORTEXPORT_EXPORT_TEMPLATE, (id=1156)
PROVIDER_CERTIFICATE_CHAIN_IMPORT_FAILED, IMPORTEXPORT_EXPORT_TEMPLATE_FAILED, (id=1157)

Note 2: I will also make sure that a test exists to that this does not happen again, and if not create one.

Comment 8 Mooli Tayer 2014-10-05 08:12:21 UTC
A test now exist (AuditLogTypeTest) and there are no duplicate values.

So only backport is left.

Comment 10 Mooli Tayer 2014-10-05 15:01:36 UTC
ack please, 

Also I think this is 3.5.1 materiel(not sure how to set flag). 

There is a problem to backport:
http://gerrit.ovirt.org/#/c/29779/

Since during 

private static Class<CommandBase<? extends VdcActionParametersBase>> loadClass(String className) {
        try {
            return (Class<CommandBase<? extends VdcActionParametersBase>>) Class.forName(className);
        } catch (ClassNotFoundException e) {
            return null;
        }
    }

we get ExceptionInInitializerError from InitVdsOnUpCommand:
private static int MAX_RETRIES_GLUSTER_PROBE_STATUS = Config.<Integer> getValue(ConfigValues.GlusterPeerStatusRetries);

Since Config isn't set or mocked. I understand on master there has been infrastructure work to resolve this, and don't want to backport too much.

As Tal noted, we are free from duplicate VdsActionTypes since master is a gate keeper for that.

Comment 11 Mooli Tayer 2014-10-05 15:07:14 UTC
Also, a title change might be in order, since we are talking about more than a single event. once release is set I think this should be a blocker.

Comment 12 Tal Nisan 2014-10-06 09:52:51 UTC
Exactly, as soon as master's unit tests are blocking the opportunity to introduce duplicate IDs since everything that is backported to 3.5 stable has to go through master, we are covered on that angle

Comment 13 Mooli Tayer 2014-10-22 00:55:19 UTC
removed patch, see comment 10

Comment 14 Jiri Belka 2014-10-31 10:03:26 UTC
ok, vt8 / rhevm-backend-3.5.0-0.18.beta.el6ev.noarch

engine=# select * from audit_log where log_time > '2014-10-31 10:00:00' and message ilike 'Used mem%';
 audit_log_id |               user_id                | user_name |                vm_id                 | vm_name | vm_template_id | vm_template_name |                vds_id                |    vds_name    |      
    log_time          |  log_type_name   | log_type | severity |                                  message                                   | processed | storage_pool_id | storage_pool_name | storage_domain_id | s
torage_domain_name |             vds_group_id             | vds_group_name | correlation_id | job_id | quota_id | quota_name |          gluster_volume_id           | gluster_volume_name | origin | custom_event_id 
| event_flood_in_sec | custom_data | deleted | call_stack 
--------------+--------------------------------------+-----------+--------------------------------------+---------+----------------+------------------+--------------------------------------+----------------+------
----------------------+------------------+----------+----------+----------------------------------------------------------------------------+-----------+-----------------+-------------------+-------------------+--
-------------------+--------------------------------------+----------------+----------------+--------+----------+------------+--------------------------------------+---------------------+--------+-----------------
+--------------------+-------------+---------+------------
          462 | 00000000-0000-0000-0000-000000000000 |           | 00000000-0000-0000-0000-000000000000 |         |                |                  | fe0ef6eb-b9cd-4557-b1d3-d844beed76ad | dell-r210ii-03 | 2014-
10-31 11:01:01.546+01 | VDS_HIGH_MEM_USE |      532 |        1 | Used memory of host dell-r210ii-03 [18%] exceeded defined threshold [10%]. | t         |                 |                   |                   |  
                   | 00000001-0001-0001-0001-0000000002ad | Default        |                |        |          |            | 00000000-0000-0000-0000-000000000000 |                     | oVirt  |              -1 
|                 30 |             | f       | 
(1 row)

From: ovirt
To: jbelka
Subject: alertMessage (jb-rhevm35.xxxxx.com), [Used memory of host dell-r210ii-03 [18%] exceeded defined threshold [10%].]
Date: Fri, 31 Oct 2014 11:01:23 +0100 (CET)

Time:2014-10-31 11:01:01.546
Message:Used memory of host dell-r210ii-03 [18%] exceeded defined threshold [10%].
Severity:WARNING
Host Name: dell-r210ii-03

Comment 15 Eyal Edri 2015-02-17 17:11:22 UTC
rhev 3.5.0 was released. closing.