Bug 1185320
| Summary: | [events] Failed PM health check status of secondary agent is not reported if primary fails | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Jiri Belka <jbelka> |
| Component: | ovirt-engine | Assignee: | Eli Mesika <emesika> |
| Status: | CLOSED ERRATA | QA Contact: | Jiri Belka <jbelka> |
| Severity: | low | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.5.0 | CC: | emesika, gklein, lpeer, lsurette, mperina, oourfali, rbalakri, Rhev-m-bugs, srevivo, ykaul |
| Target Milestone: | ovirt-4.0.0-rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-08-23 20:22:44 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Infra | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
engine=# select log_time,message from audit_log where message like 'Health check%';
          log_time          | message
----------------------------+---------
 2015-01-23 13:34:02.963+01 | Health check failed on Host dell-r210ii-03.rhev.lab.eng.brq.redhat.com primary sequential agent, future fence operations may fail if secondary agent is not defined properly.
(1 row)

The failure is not reported in concurrent mode either.

Power-Management related behavior was refactored, probably making this issue obsolete.

The code refactoring lost the original distinction between primary and secondary PM failure.

- original:

2015-01-23 13:54:05,242 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-62) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Power Management test failed for Host dell-r210ii-03.example.com.[Getting status of IPMI:10.34.63.243...Chassis power = Unknown, Failed]
2015-01-23 13:54:22,244 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-62) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Power Management test failed for Host dell-r210ii-03.example.com.[Getting status of IPMI:10.34.63.243...Chassis power = Unknown, Failed]
2015-01-23 13:54:22,308 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-62) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Health check failed on Host dell-r210ii-03.example.com primary sequential agent, future fence operations may fail if secondary agent is not defined properly.
- new:

2016-01-12 17:08:10,662 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Power Management test failed for Host dell-r210ii-04. No reason was returned for this operation failure. See logs for further details.
2016-01-12 17:08:10,662 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Power Management test failed for Host dell-r210ii-04. No reason was returned for this operation failure. See logs for further details.
2016-01-12 17:08:10,699 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Execution of power management status on Host dell-r210ii-04 using Proxy Host dell-r210ii-13.example.com and Fence Agent ipmilan:10.34.63.243 failed.
2016-01-12 17:08:10,699 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Execution of power management status on Host dell-r210ii-04 using Proxy Host dell-r210ii-13.example.com and Fence Agent ipmilan:10.34.63.243 failed.
2016-01-12 17:08:19,978 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Power Management test failed for Host dell-r210ii-04. No reason was returned for this operation failure. See logs for further details.
2016-01-12 17:08:19,978 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Power Management test failed for Host dell-r210ii-04. No reason was returned for this operation failure. See logs for further details.
2016-01-12 17:08:20,029 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Execution of power management status on Host dell-r210ii-04 using Proxy Host dell-r210ii-13.example.com and Fence Agent ipmilan:10.34.63.243 failed.
2016-01-12 17:08:20,029 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Execution of power management status on Host dell-r210ii-04 using Proxy Host dell-r210ii-13.example.com and Fence Agent ipmilan:10.34.63.243 failed.
2016-01-12 17:08:28,330 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Power Management test failed for Host dell-r210ii-04. No reason was returned for this operation failure. See logs for further details.
2016-01-12 17:08:28,330 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Power Management test failed for Host dell-r210ii-04. No reason was returned for this operation failure. See logs for further details.
2016-01-12 17:08:28,389 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Execution of power management status on Host dell-r210ii-04 using Proxy Host dell-r210ii-13.example.com and Fence Agent ipmilan:10.34.62.155 failed.
2016-01-12 17:08:28,389 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Execution of power management status on Host dell-r210ii-04 using Proxy Host dell-r210ii-13.example.com and Fence Agent ipmilan:10.34.62.155 failed.
2016-01-12 17:08:38,669 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Power Management test failed for Host dell-r210ii-04. No reason was returned for this operation failure. See logs for further details.
2016-01-12 17:08:38,669 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Power Management test failed for Host dell-r210ii-04. No reason was returned for this operation failure. See logs for further details.
2016-01-12 17:08:38,722 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Execution of power management status on Host dell-r210ii-04 using Proxy Host dell-r210ii-13.example.com and Fence Agent ipmilan:10.34.62.155 failed.
2016-01-12 17:08:38,722 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Execution of power management status on Host dell-r210ii-04 using Proxy Host dell-r210ii-13.example.com and Fence Agent ipmilan:10.34.62.155 failed.
2016-01-12 17:08:38,762 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Health check on Host dell-r210ii-04 indicates that future attempts to Start this host using Power-Management are expected to fail.
2016-01-12 17:08:38,762 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Health check on Host dell-r210ii-04 indicates that future attempts to Start this host using Power-Management are expected to fail.
2016-01-12 17:08:38,814 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Health check on Host dell-r210ii-04 indicates that future attempts to Stop this host using Power-Management are expected to fail.
2016-01-12 17:08:38,814 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Health check on Host dell-r210ii-04 indicates that future attempts to Stop this host using Power-Management are expected to fail.

So originally one could tell from the 'Health check.*' event messages whether the primary or the secondary PM agent was failing. And... the new output seems a little bit spammy.

oVirt 4.0 Alpha has been released, moving to oVirt 4.0 Beta target.

Eli, could you please try to investigate?

1. BZ1325664 has been fixed, so we should see the reason of failure for each agent.
2. In 3.6 we allowed "unlimited number" of ordered levels for fence agents and on each level we define either one fence agent or unlimited number of "concurrent" fence agents, so we are no longer able to distinguish between primary and secondary agents. But we should be able to resolve ability to execute start and/or stop operations and display proper results during PM Health Check.

(In reply to Martin Perina from comment #10)
> Eli, could you please try to investigate?
>
> 1. BZ1325664 has been fixed, so we should see the reason of failure for each
> agent.

This BZ was opened for 3.5, so the VDSM release that works with 3.5 does not have the refactoring in fenceNode that caused the regression reported in BZ1325664.

>
> 2. In 3.6 we allowed "unlimited number" of ordered levels for fence agents
> and on each level we define either one fence agent or unlimited number of
> "concurrent" fence agents, so we are no longer able to distinguish between
> primary and secondary agents. But we should be able to resolve ability to
> execute start and/or stop operations and display proper results during PM
> Health Check.

Looking at the master and 3.5 branch code for PmHealthCheckManager::pmHealthCheck(), I can see that there is a bug in the 3.5 code that caused the BZ reported here.
This can easily be fixed for 3.5.
The code for 3.6 and master is more general and will report all fencing agents.
Do we have to resolve that for 3.5?

Moving this BZ to ON_QA since this is already fixed for 4.0; approved by Oved.

ok,

2016-06-02 06:08:03.687+02 | Execution of power management status on Host dell-r210ii-03 using Proxy Host dell-r210ii-04 and Fence Agent ipmilan:10.34.63.243 failed.
2016-06-02 06:08:13.073+02 | Execution of power management status on Host dell-r210ii-03 using Proxy Host dell-r210ii-04 and Fence Agent ipmilan:10.34.63.242 failed.
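For illustration, the multi-level fence agent model discussed in the comments above (ordered levels, each holding one agent or several concurrent agents) can be sketched in Python. This is not the engine's code: the `AgentStatus` type and `evaluate` function are hypothetical, and the Start/Stop rules (levels tried in order; Start needs any agent of a level, Stop needs all agents of a level) are an assumption about the intended semantics rather than something confirmed in this report.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class AgentStatus:
    """Result of a status probe for one fence agent (hypothetical type)."""
    order: int   # level of the agent; agents sharing an order are concurrent
    agent: str   # e.g. "ipmilan:10.34.63.243"
    ok: bool     # True if the status probe succeeded


def evaluate(agents):
    """Return (can_start, can_stop, failed_agents) for one host.

    Assumed semantics, not taken from the engine source:
    - levels are tried in ascending order until one of them works;
    - a level can Start the host if at least one of its agents works;
    - a level can Stop the host only if all of its agents work.
    """
    ordered = sorted(agents, key=lambda a: a.order)
    can_start = can_stop = False
    for _, level in groupby(ordered, key=lambda a: a.order):
        level = list(level)
        can_start = can_start or any(a.ok for a in level)
        can_stop = can_stop or all(a.ok for a in level)
    # Every failing agent is listed, regardless of its level.
    failed = [a.agent for a in ordered if not a.ok]
    return can_start, can_stop, failed


if __name__ == "__main__":
    probes = [
        AgentStatus(1, "ipmilan:10.34.63.243", False),
        AgentStatus(2, "ipmilan:10.34.62.155", False),
    ]
    print(evaluate(probes))
```

With both agents failing, as in the 2016-01-12 log, the sketch reports that neither Start nor Stop is expected to work and names both agents, which is the per-agent detail the original report asked for.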
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1743.html
Created attachment 983339
engine.log

Description of problem:
If primary sequential agent PM health check status fails, failure of secondary is not reported at all.

Version-Release number of selected component (if applicable):
rhevm-backend-3.5.0-0.30.el6ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. engine-config -s PMHealthCheckEnabled=true
2. define primary/secondary PM settings for a host, both with invalid password
3. tail -f /var/log/ovirt-engine/engine.log | grep 'Health check failed'

Actual results:
failure is reported only for primary agent

Expected results:
should be reported for both imo

Additional info:
# tail -f /var/log/ovirt-engine/engine.log | grep -i health
2015-01-23 13:35:02,975 INFO [org.ovirt.engine.core.bll.pm.PmHealthCheckManager] (DefaultQuartzScheduler_Worker-27) Power Management Health Check started.
2015-01-23 13:35:36,707 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-27) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Health check failed on Host dell-r210ii-03.rhev.lab.eng.brq.redhat.com primary sequential agent, future fence operations may fail if secondary agent is not defined properly.
2015-01-23 13:35:36,708 INFO [org.ovirt.engine.core.bll.pm.PmHealthCheckManager] (DefaultQuartzScheduler_Worker-27) Power Management Health Check completed.
2015-01-23 13:36:36,708 INFO [org.ovirt.engine.core.bll.pm.PmHealthCheckManager] (DefaultQuartzScheduler_Worker-49) Power Management Health Check started.
2015-01-23 13:37:10,364 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-49) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Health check failed on Host dell-r210ii-03.rhev.lab.eng.brq.redhat.com primary sequential agent, future fence operations may fail if secondary agent is not defined properly.
2015-01-23 13:37:10,364 INFO [org.ovirt.engine.core.bll.pm.PmHealthCheckManager] (DefaultQuartzScheduler_Worker-49) Power Management Health Check completed.
2015-01-23 13:38:10,364 INFO [org.ovirt.engine.core.bll.pm.PmHealthCheckManager] (DefaultQuartzScheduler_Worker-78) Power Management Health Check started.
2015-01-23 13:38:44,133 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-78) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Health check failed on Host dell-r210ii-03.rhev.lab.eng.brq.redhat.com primary sequential agent, future fence operations may fail if secondary agent is not defined properly.
2015-01-23 13:38:44,133 INFO [org.ovirt.engine.core.bll.pm.PmHealthCheckManager] (DefaultQuartzScheduler_Worker-78) Power Management Health Check completed.
2015-01-23 13:39:44,134 INFO [org.ovirt.engine.core.bll.pm.PmHealthCheckManager] (DefaultQuartzScheduler_Worker-4) Power Management Health Check started.
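The manual check in step 3 of the reproduction steps (looking at which agents show up in 'Health check failed' messages) could also be scripted. The following is a minimal sketch, assuming the 3.5 message format quoted in this report; the `agents_reported` helper is hypothetical, and the "secondary" and "concurrent" alternatives in the pattern are assumptions, since no such message appears in the logs here.

```python
import re
from collections import defaultdict

# Pattern for the 3.5-era audit message quoted in this report.
# Assumption: a secondary-agent or concurrent-mode failure would use the
# same wording; the report only ever shows "primary sequential".
HEALTH_RE = re.compile(
    r"Message: Health check failed on Host (?P<host>\S+) "
    r"(?P<agent>primary|secondary) (?P<mode>sequential|concurrent) agent"
)


def agents_reported(lines):
    """Map each host to the set of agent roles named in health check failures."""
    seen = defaultdict(set)
    for line in lines:
        match = HEALTH_RE.search(line)
        if match:
            seen[match.group("host")].add(match.group("agent"))
    return dict(seen)


if __name__ == "__main__":
    # Path taken from the reproduction steps above.
    with open("/var/log/ovirt-engine/engine.log", encoding="utf-8") as log:
        for host, roles in agents_reported(log).items():
            print(host, sorted(roles))
```

Run against the log excerpt above, a helper like this would list only the primary role for dell-r210ii-03.rhev.lab.eng.brq.redhat.com, matching the "Actual results" of this report.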