Bug 1185320

Summary: [events] Failed PM health check status of secondary agent is not reported if primary fails
Product: Red Hat Enterprise Virtualization Manager Reporter: Jiri Belka <jbelka>
Component: ovirt-engine Assignee: Eli Mesika <emesika>
Status: CLOSED ERRATA QA Contact: Jiri Belka <jbelka>
Severity: low Docs Contact:
Priority: unspecified    
Version: 3.5.0 CC: emesika, gklein, lpeer, lsurette, mperina, oourfali, rbalakri, Rhev-m-bugs, srevivo, ykaul
Target Milestone: ovirt-4.0.0-rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-08-23 20:22:44 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
engine.log none

Description Jiri Belka 2015-01-23 12:55:16 UTC
Created attachment 983339 [details]
engine.log

Description of problem:
If primary sequential agent PM health check status fails, failure of secondary is not reported at all.

Version-Release number of selected component (if applicable):
rhevm-backend-3.5.0-0.30.el6ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. engine-config -s PMHealthCheckEnabled=true
2. define primary/secondary PM settings for a host, both with invalid password
3. tail -f /var/log/ovirt-engine/engine.log | grep 'Health check failed'

Actual results:
failure is reported only for primary agent

Expected results:
should be reported for both imo
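
A minimal sketch of how the check in step 3 could be automated, so that a missing report for one agent is flagged explicitly. The pattern is inferred from the primary-agent message quoted in this report; the wording of a secondary-agent message is an assumption (taken to mirror the primary one), and the function names are hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the primary-agent message in this report.
# Assumption: a secondary-agent message would use the same wording
# with "secondary" in place of "primary".
HEALTH_RE = re.compile(
    r"Health check failed on Host (?P<host>\S+) "
    r"(?P<role>\w+) (?P<mode>\w+) agent"
)


def count_failures(lines):
    """Count 'Health check failed' events per (host, agent role)."""
    counts = Counter()
    for line in lines:
        match = HEALTH_RE.search(line)
        if match:
            counts[(match.group("host"), match.group("role"))] += 1
    return counts


def missing_roles(counts, host, expected=("primary", "secondary")):
    """Return the agent roles for which no failure was reported."""
    return [role for role in expected if counts[(host, role)] == 0]
```

With both agents misconfigured, running `count_failures` over the lines of /var/log/ovirt-engine/engine.log and then `missing_roles` for the host would be expected to return an empty list; with the behavior reported here it would return `["secondary"]`.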

Additional info:
# tail -f /var/log/ovirt-engine/engine.log | grep -i health                                                               
2015-01-23 13:35:02,975 INFO  [org.ovirt.engine.core.bll.pm.PmHealthCheckManager] (DefaultQuartzScheduler_Worker-27) Power Management Health Check started.
2015-01-23 13:35:36,707 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-27) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Health check failed on Host dell-r210ii-03.rhev.lab.eng.brq.redhat.com primary sequential agent, future fence operations may fail if secondary agent is not defined properly.
2015-01-23 13:35:36,708 INFO  [org.ovirt.engine.core.bll.pm.PmHealthCheckManager] (DefaultQuartzScheduler_Worker-27) Power Management Health Check completed.
2015-01-23 13:36:36,708 INFO  [org.ovirt.engine.core.bll.pm.PmHealthCheckManager] (DefaultQuartzScheduler_Worker-49) Power Management Health Check started.
2015-01-23 13:37:10,364 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-49) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Health check failed on Host dell-r210ii-03.rhev.lab.eng.brq.redhat.com primary sequential agent, future fence operations may fail if secondary agent is not defined properly.
2015-01-23 13:37:10,364 INFO  [org.ovirt.engine.core.bll.pm.PmHealthCheckManager] (DefaultQuartzScheduler_Worker-49) Power Management Health Check completed.
2015-01-23 13:38:10,364 INFO  [org.ovirt.engine.core.bll.pm.PmHealthCheckManager] (DefaultQuartzScheduler_Worker-78) Power Management Health Check started.
2015-01-23 13:38:44,133 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-78) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Health check failed on Host dell-r210ii-03.rhev.lab.eng.brq.redhat.com primary sequential agent, future fence operations may fail if secondary agent is not defined properly.
2015-01-23 13:38:44,133 INFO  [org.ovirt.engine.core.bll.pm.PmHealthCheckManager] (DefaultQuartzScheduler_Worker-78) Power Management Health Check completed.
2015-01-23 13:39:44,134 INFO  [org.ovirt.engine.core.bll.pm.PmHealthCheckManager] (DefaultQuartzScheduler_Worker-4) Power Management Health Check started.

Comment 1 Jiri Belka 2015-01-23 12:56:43 UTC
engine=# select log_time,message from audit_log where message like 'Health check%';
          log_time          | message
----------------------------+---------
 2015-01-23 13:34:02.963+01 | Health check failed on Host dell-r210ii-03.rhev.lab.eng.brq.redhat.com primary sequential agent, future fence operations may fail if secondary agent is not defined properly.
(1 row)

Comment 2 Jiri Belka 2015-01-23 13:03:10 UTC
Failure is not reported even when in concurrent mode.

Comment 3 Ori Liel 2015-06-18 12:40:24 UTC
Power-Management related behavior was refactored, probably making this issue obsolete.

Comment 4 Jiri Belka 2016-01-12 16:29:13 UTC
The code refactoring caused the loss of the original distinction between primary and secondary PM failures.

- original:

2015-01-23 13:54:05,242 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-62) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Power Management test failed for Host dell-r210ii-03.example.com.[Getting status of IPMI:10.34.63.243...Chassis power = Unknown, Failed]
2015-01-23 13:54:22,244 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-62) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Power Management test failed for Host dell-r210ii-03.example.com.[Getting status of IPMI:10.34.63.243...Chassis power = Unknown, Failed]
2015-01-23 13:54:22,308 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-62) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Health check failed on Host dell-r210ii-03.example.com primary sequential agent, future fence operations may fail if secondary agent is not defined properly.

- new:

     1  2016-01-12 17:08:10,662 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Power Management test failed for Host dell-r210ii-04. No reason was returned for this operation failure. See logs for further details.
     2  2016-01-12 17:08:10,662 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Power Management test failed for Host dell-r210ii-04. No reason was returned for this operation failure. See logs for further details.
     3  2016-01-12 17:08:10,699 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Execution of power management status on Host dell-r210ii-04 using Proxy Host dell-r210ii-13.example.com and Fence Agent ipmilan:10.34.63.243 failed.
     4  2016-01-12 17:08:10,699 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Execution of power management status on Host dell-r210ii-04 using Proxy Host dell-r210ii-13.example.com and Fence Agent ipmilan:10.34.63.243 failed.
     5  2016-01-12 17:08:19,978 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Power Management test failed for Host dell-r210ii-04. No reason was returned for this operation failure. See logs for further details.
     6  2016-01-12 17:08:19,978 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Power Management test failed for Host dell-r210ii-04. No reason was returned for this operation failure. See logs for further details.
     7  2016-01-12 17:08:20,029 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Execution of power management status on Host dell-r210ii-04 using Proxy Host dell-r210ii-13.example.com and Fence Agent ipmilan:10.34.63.243 failed.
     8  2016-01-12 17:08:20,029 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Execution of power management status on Host dell-r210ii-04 using Proxy Host dell-r210ii-13.example.com and Fence Agent ipmilan:10.34.63.243 failed.
     9  2016-01-12 17:08:28,330 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Power Management test failed for Host dell-r210ii-04. No reason was returned for this operation failure. See logs for further details.
    10  2016-01-12 17:08:28,330 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Power Management test failed for Host dell-r210ii-04. No reason was returned for this operation failure. See logs for further details.
    11  2016-01-12 17:08:28,389 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Execution of power management status on Host dell-r210ii-04 using Proxy Host dell-r210ii-13.example.com and Fence Agent ipmilan:10.34.62.155 failed.
    12  2016-01-12 17:08:28,389 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Execution of power management status on Host dell-r210ii-04 using Proxy Host dell-r210ii-13.example.com and Fence Agent ipmilan:10.34.62.155 failed.
    13  2016-01-12 17:08:38,669 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Power Management test failed for Host dell-r210ii-04. No reason was returned for this operation failure. See logs for further details.
    14  2016-01-12 17:08:38,669 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Power Management test failed for Host dell-r210ii-04. No reason was returned for this operation failure. See logs for further details.
    15  2016-01-12 17:08:38,722 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Execution of power management status on Host dell-r210ii-04 using Proxy Host dell-r210ii-13.example.com and Fence Agent ipmilan:10.34.62.155 failed.
    16  2016-01-12 17:08:38,722 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Execution of power management status on Host dell-r210ii-04 using Proxy Host dell-r210ii-13.example.com and Fence Agent ipmilan:10.34.62.155 failed.
    17  2016-01-12 17:08:38,762 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Health check on Host dell-r210ii-04 indicates that future attempts to Start this host using Power-Management are expected to fail.
    18  2016-01-12 17:08:38,762 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Health check on Host dell-r210ii-04 indicates that future attempts to Start this host using Power-Management are expected to fail.
    19  2016-01-12 17:08:38,814 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Health check on Host dell-r210ii-04 indicates that future attempts to Stop this host using Power-Management are expected to fail.
    20  2016-01-12 17:08:38,814 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Health check on Host dell-r210ii-04 indicates that future attempts to Stop this host using Power-Management are expected to fail.


So originally one could tell from the 'Health check.*' event messages whether the primary or the secondary PM agent was failing. Also, the new output seems a little bit spammy.

Comment 5 Yaniv Lavi 2016-05-09 11:03:34 UTC
oVirt 4.0 Alpha has been released, moving to oVirt 4.0 Beta target.

Comment 10 Martin Perina 2016-05-26 14:56:13 UTC
Eli, could you please try to investigate? 

1. BZ1325664 has been fixed, so we should see the reason of failure for each agent.

2. In 3.6 we allowed "unlimited number" of ordered levels for fence agents and on each level we define either one fence agent or unlimited number of "concurrent" fence agents, so we are no longer able to distinguish between primary and secondary agents. But we should be able to resolve ability to execute start and/or stop operations and display proper results during PM Health Check.
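
A minimal sketch of the model described in point 2, not the engine's actual code: agents are grouped into ordered levels, and a health check derives per-agent failures plus the expected Start/Stop capability. The names `AgentStatus`, `levels` and `health_summary` are hypothetical, and the Start/Stop rules in the comments are assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class AgentStatus:
    address: str     # e.g. "ipmilan:10.34.63.243"
    order: int       # agents sharing an order value form one concurrent level
    status_ok: bool  # result of the power management status call


def levels(agents):
    """Group agents into ordered levels; each level is a list of agents."""
    ordered = sorted(agents, key=lambda a: a.order)
    return [list(group) for _, group in groupby(ordered, key=lambda a: a.order)]


def health_summary(agents):
    """Report every failing agent and the expected Start/Stop capability."""
    lvls = levels(agents)
    return {
        "failed_agents": [a.address for lvl in lvls for a in lvl if not a.status_ok],
        # Assumption: Start works if any agent on any level responds.
        "start_expected_to_work": any(a.status_ok for lvl in lvls for a in lvl),
        # Assumption: Stop needs every agent of at least one level to respond.
        "stop_expected_to_work": any(all(a.status_ok for a in lvl) for lvl in lvls),
    }
```

In this model there is no primary or secondary agent; each failing agent is identified by its address, and the Start/Stop verdict is computed once per host.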

Comment 11 Eli Mesika 2016-05-29 09:10:10 UTC
(In reply to Martin Perina from comment #10)
> Eli, could you please try to investigate? 
> 
> 1. BZ1325664 has been fixed, so we should see the reason of failure for each
> agent.

This BZ was opened for 3.5, so the VDSM release that works with 3.5 does not have the fenceNode refactoring that caused the regression reported in BZ1325664.

> 
> 2. In 3.6 we allowed "unlimited number" of ordered levels for fence agents
> and on each level we define either one fence agent or unlimited number of
> "concurrent" fence agents, so we are no longer able to distinguish between
> primary and secondary agents. But we should be able to resolve ability to
> execute start and/or stop operations and display proper results during PM
> Health Check.

Looking at the master and 3.5 branch code for PmHealthCheckManager::pmHealthCheck(), I can see that there is a bug in the 3.5 code that caused the BZ reported here. This can be easily fixed for 3.5.

The code for 3.6 and master is more general and will report all fencing agents.

Do we have to resolve that for 3.5?

Comment 12 Eli Mesika 2016-05-29 09:13:30 UTC
Moving this BZ to ON_QA since this is already fixed for 4.0 

approved by Oved

Comment 13 Jiri Belka 2016-06-02 07:00:55 UTC
ok,


 2016-06-02 06:08:03.687+02 | Execution of power management status on Host dell-r210ii-03 using Proxy Host dell-r210ii-04 and Fence Agent ipmilan:10.34.63.243 failed.
 2016-06-02 06:08:13.073+02 | Execution of power management status on Host dell-r210ii-03 using Proxy Host dell-r210ii-04 and Fence Agent ipmilan:10.34.63.242 failed.

Comment 15 errata-xmlrpc 2016-08-23 20:22:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1743.html