Bug 1145977
Summary: | [events] Host memory usage exceeded defined threshold chaos - VDS_HIGH_MEM_USE vs USER_SUSPEND_VM_FINISH_FAILURE_WILL_TRY_AGAIN | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Jiri Belka <jbelka> | ||||||
Component: | ovirt-engine | Assignee: | Mooli Tayer <mtayer> | ||||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | Jiri Belka <jbelka> | ||||||
Severity: | medium | Docs Contact: | |||||||
Priority: | urgent | ||||||||
Version: | 3.5.0 | CC: | bazulay, ecohen, gklein, iheim, jbelka, lpeer, lsurette, oourfali, pstehlik, rbalakri, Rhev-m-bugs, sbonazzo, sherold, s.kieske, tnisan, yeylon | ||||||
Target Milestone: | --- | ||||||||
Target Release: | 3.5.0 | ||||||||
Hardware: | Unspecified | ||||||||
OS: | Unspecified | ||||||||
Whiteboard: | infra | ||||||||
Fixed In Version: | org.ovirt.engine-root-3.5.0-18 | Doc Type: | Bug Fix | ||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | Environment: | ||||||||
Last Closed: | 2015-02-17 17:11:22 UTC | Type: | Bug | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | Infra | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Bug Depends On: | |||||||||
Bug Blocks: | 1147536, 1155170, 1164308, 1164311 | ||||||||
Attachments: |
|
Description
Jiri Belka
2014-09-24 08:39:32 UTC
Created attachment 940705 [details]
binary for rhel6
rhevm-backend-3.5.0-0.12.beta.el6ev.noarch rhevm-tools-3.5.0-0.12.beta.el6ev.noarch Very strange indeed. Trying to reproduce according to steps I got the expected value at audit log[1]. AFAIK the code producing both notifications looks fine. Is it possible someone fiddled with the database? Does this reproduce on the same host? [1] engine=> SELECT * FROM audit_log WHERE message like 'Used memory of host%' ; -[ RECORD 1 ]-------+--------------------------------------------------------------------- audit_log_id | 124 user_id | 00000000-0000-0000-0000-000000000000 user_name | vm_id | 00000000-0000-0000-0000-000000000000 vm_name | vm_template_id | vm_template_name | vds_id | 75636b15-9c4d-4df6-9753-8cec2ffd7ace vds_name | el6.4_01 log_time | 2014-09-28 10:48:17.51-04 log_type_name | VDS_HIGH_MEM_USE log_type | 532 severity | 1 message | Used memory of host el6.4_01 [18%] exceeded defined threshold [10%]. processed | f storage_pool_id | storage_pool_name | storage_domain_id | storage_domain_name | vds_group_id | 00000001-0001-0001-0001-00000000038b vds_group_name | Default correlation_id | job_id | quota_id | quota_name | gluster_volume_id | 00000000-0000-0000-0000-000000000000 gluster_volume_name | origin | oVirt custom_event_id | -1 event_flood_in_sec | 30 custom_data | deleted | f call_stack | Totally clean 3.5 (vt4) installation. engine=# select * from audit_log where log_time > '2014-09-30' and message ilike '%memory%'; audit_log_id | user_id | user_name | vm_id | vm_name | vm_template_id | vm_template_ name | vds_id | vds_name | log_time | log_type_name | log_t ype | severity | message | processed | storage_pool_id | storage_pool_name | s torage_domain_id | storage_domain_name | vds_group_id | vds_group_name | correlation_id | job_id | quota_id | quota_name | gluster_volume_id | gluster_volume_name | origin | custom_event_id | event_flood_in_sec | custom_data | deleted | call_stack --------------+--------------------------------------+-----------+--------------------------------------+---------+----------------+------------- -----+--------------------------------------+----------------+----------------------------+-----------------------------------------------+------ ----+----------+----------------------------------------------------------------------------+-----------+-----------------+-------------------+-- -----------------+---------------------+--------------------------------------+----------------+----------------+--------+----------+------------ +--------------------------------------+---------------------+--------+-----------------+--------------------+-------------+---------+----------- - 107 | 00000000-0000-0000-0000-000000000000 | | 00000000-0000-0000-0000-000000000000 | | | | e7ac3e7e-8cfa-4be7-98e8-9b8a8c81fe30 | dell-r210ii-03 | 2014-09-30 15:11:12.745+02 | USER_SUSPEND_VM_FINISH_FAILURE_WILL_TRY_AGAIN | 532 | 1 | Used memory of host dell-r210ii-03 [11%] exceeded defined threshold [10%]. | t | | | | | 00000001-0001-0001-0001-000000000195 | Default | | | | | 00000000-0000-0000-0000-000000000000 | | oVirt | -1 | 30 | | f | (1 row) The issue is reproducible also on my colleague env. Probably didn't reproduce for me since I used master and not 3.5. Anyhow after digging with git rev-list I see: the id for VDS_HIGH_MEM_USE is 532 the id for USER_SUSPEND_VM_FINISH_FAILURE_WILL_TRY_AGAIN is 130 So it must be related to this change[1] changing the id of USER_SUSPEND_VM_FINISH_FAILURE_WILL_TRY_AGAIN from 532 to 130. Will continue investigating tomorrow. [1] http://gerrit.ovirt.org/#/c/29740/ Actually the problem is that change 29740, that fixed the problem of same id events, is missing in 3.5. So both events (AuditLogType) had id 532. Note: unfortunately this effects 34 other ALT with duplicate id's as well: VDS_STORAGE_CONNECTION_FAILED_BUT_LAST_VDS, VDS_HIGH_NETWORK_USE, (id=533) VDS_STORAGES_CONNECTION_FAILED, VDS_HIGH_SWAP_USE, (id=535) VDS_STORAGE_VDS_STATS_FAILED, VDS_HIGH_CPU_USE, (id=534) VDS_SET_NONOPERATIONAL, VDS_FAILED_TO_GET_HOST_HARDWARE_INFO, (id=517) USER_SUSPEND_VM_FINISH_FAILURE_WILL_TRY_AGAIN, VDS_HIGH_MEM_USE, (id=532) GLUSTER_VOLUME_REBALANCE_START_DETECTED_FROM_CLI, GLUSTER_HOST_UUID_ALREADY_EXISTS, (id=4089) USER_VDS_START, VDS_STATUS_CHANGE_FAILED_DUE_TO_STOP_SPM_FAILURE, (id=20) USER_VDS_STOP, VDS_PROVISION, (id=21) USER_ADD_VM_TEMPLATE_SUCCESS, USER_ADD_VM_FINISHED_SUCCESS, (id=53) USER_ADD_VM_TEMPLATE_FAILURE, USER_FAILED_RUN_VM, (id=54) CPU_TYPE_UNSUPPORTED_IN_THIS_CLUSTER_VERSION, USER_INITIATED_RUN_VM_AND_PAUSE, (id=156) VM_IMPORT_FROM_CONFIGURATION_ATTACH_DISKS_FAILED, USER_ADD, (id=149) VM_MEMORY_UNDER_GUARANTEED_VALUE, VM_IMPORT_FROM_CONFIGURATION_EXECUTED_SUCCESSFULLY, (id=148) VM_BALLOON_DRIVER_ERROR, USER_ADD, (id=149) VM_BALLOON_DRIVER_UNCONTROLLED, USER_INITIATED_RUN_VM, (id=150) VM_MEMORY_NOT_IN_RECOMMENDED_RANGE, USER_INITIATED_RUN_VM_FAILED, (id=151) USER_LOGGED_IN_VM, USER_ATTACH_TAG_TO_TEMPLATE, (id=456) USER_LOGGED_OUT_VM, USER_ATTACH_TAG_TO_TEMPLATE_FAILED, (id=457) USER_LOCKED_VM, USER_DETACH_TEMPLATE_FROM_TAG, (id=458) USER_UNLOCKED_VM, USER_DETACH_TEMPLATE_FROM_TAG_FAILED, (id=459) IMPORTEXPORT_GET_VMS_INFO_FAILED, IMPORTEXPORT_STARTING_EXPORT_VM, (id=1162) UPDATE_OVF_FOR_STORAGE_DOMAIN_FAILED, NETWORK_ACTIVATE_VM_INTERFACE_SUCCESS, (id=1012) CREATE_OVF_STORE_FOR_STORAGE_DOMAIN_FAILED, NETWORK_ACTIVATE_VM_INTERFACE_FAILURE, (id=1013) CREATE_OVF_STORE_FOR_STORAGE_DOMAIN_INITIATE_FAILED, NETWORK_DEACTIVATE_VM_INTERFACE_SUCCESS, (id=1014) DELETE_OVF_STORE_FOR_STORAGE_DOMAIN_FAILED, NETWORK_DEACTIVATE_VM_INTERFACE_FAILURE, (id=1015) USER_ACCOUNT_DISABLED_OR_LOCKED, NETWORK_UPDATE_DISPLAY_TO_VDS_GROUP, (id=1100) PROVIDER_ADDED, IMPORTEXPORT_EXPORT_VM, (id=1150) PROVIDER_ADDITION_FAILED, IMPORTEXPORT_EXPORT_VM_FAILED, (id=1151) PROVIDER_UPDATED, IMPORTEXPORT_IMPORT_VM, (id=1152) PROVIDER_UPDATE_FAILED, IMPORTEXPORT_IMPORT_VM_FAILED, (id=1153) PROVIDER_REMOVED, IMPORTEXPORT_REMOVE_TEMPLATE, (id=1154) PROVIDER_REMOVAL_FAILED, IMPORTEXPORT_REMOVE_TEMPLATE_FAILED, (id=1155) PROVIDER_CERTIFICATE_CHAIN_IMPORTED, IMPORTEXPORT_EXPORT_TEMPLATE, (id=1156) PROVIDER_CERTIFICATE_CHAIN_IMPORT_FAILED, IMPORTEXPORT_EXPORT_TEMPLATE_FAILED, (id=1157) Note 2: I will also make sure that a test exists to that this does not happen again, and if not create one. A test now exist (AuditLogTypeTest) and there are no duplicate values. So only backport is left. ack please, Also I think this is 3.5.1 materiel(not sure how to set flag). There is a problem to backport: http://gerrit.ovirt.org/#/c/29779/ Since during private static Class<CommandBase<? extends VdcActionParametersBase>> loadClass(String className) { try { return (Class<CommandBase<? extends VdcActionParametersBase>>) Class.forName(className); } catch (ClassNotFoundException e) { return null; } } we get ExceptionInInitializerError from InitVdsOnUpCommand: private static int MAX_RETRIES_GLUSTER_PROBE_STATUS = Config.<Integer> getValue(ConfigValues.GlusterPeerStatusRetries); Since Config isn't set or mocked. I understand on master there has been infrastructure work to resolve this, and don't want to backport too much. As Tal noted, we are free from duplicate VdsActionTypes since master is a gate keeper for that. Also, a title change might be in order, since we are talking about more than a single event. once release is set I think this should be a blocker. Exactly, as soon as master's unit tests are blocking the opportunity to introduce duplicate IDs since everything that is backported to 3.5 stable has to go through master, we are covered on that angle removed patch, see comment 10 ok, vt8 / rhevm-backend-3.5.0-0.18.beta.el6ev.noarch engine=# select * from audit_log where log_time > '2014-10-31 10:00:00' and message ilike 'Used mem%'; audit_log_id | user_id | user_name | vm_id | vm_name | vm_template_id | vm_template_name | vds_id | vds_name | log_time | log_type_name | log_type | severity | message | processed | storage_pool_id | storage_pool_name | storage_domain_id | s torage_domain_name | vds_group_id | vds_group_name | correlation_id | job_id | quota_id | quota_name | gluster_volume_id | gluster_volume_name | origin | custom_event_id | event_flood_in_sec | custom_data | deleted | call_stack --------------+--------------------------------------+-----------+--------------------------------------+---------+----------------+------------------+--------------------------------------+----------------+------ ----------------------+------------------+----------+----------+----------------------------------------------------------------------------+-----------+-----------------+-------------------+-------------------+-- -------------------+--------------------------------------+----------------+----------------+--------+----------+------------+--------------------------------------+---------------------+--------+----------------- +--------------------+-------------+---------+------------ 462 | 00000000-0000-0000-0000-000000000000 | | 00000000-0000-0000-0000-000000000000 | | | | fe0ef6eb-b9cd-4557-b1d3-d844beed76ad | dell-r210ii-03 | 2014- 10-31 11:01:01.546+01 | VDS_HIGH_MEM_USE | 532 | 1 | Used memory of host dell-r210ii-03 [18%] exceeded defined threshold [10%]. | t | | | | | 00000001-0001-0001-0001-0000000002ad | Default | | | | | 00000000-0000-0000-0000-000000000000 | | oVirt | -1 | 30 | | f | (1 row) From: ovirt To: jbelka Subject: alertMessage (jb-rhevm35.xxxxx.com), [Used memory of host dell-r210ii-03 [18%] exceeded defined threshold [10%].] Date: Fri, 31 Oct 2014 11:01:23 +0100 (CET) Time:2014-10-31 11:01:01.546 Message:Used memory of host dell-r210ii-03 [18%] exceeded defined threshold [10%]. Severity:WARNING Host Name: dell-r210ii-03 rhev 3.5.0 was released. closing. |