Bug 1185320 - [events] Failed PM health check status of secondary agent is not reported if primary fails
Status: CLOSED ERRATA
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ovirt-4.0.0-rc
Target Release: ---
Assigned To: Eli Mesika
QA Contact: Jiri Belka
Depends On:
Blocks:
Reported: 2015-01-23 07:55 EST by Jiri Belka
Modified: 2016-08-23 16:22 EDT

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-08-23 16:22:44 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
engine.log (1002.86 KB, application/x-gzip)
2015-01-23 07:55 EST, Jiri Belka


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2016:1743 normal SHIPPED_LIVE Red Hat Virtualization Manager 4.0 GA Enhancement (ovirt-engine) 2016-09-02 17:54:01 EDT

Description Jiri Belka 2015-01-23 07:55:16 EST
Created attachment 983339
engine.log

Description of problem:
If primary sequential agent PM health check status fails, failure of secondary is not reported at all.

Version-Release number of selected component (if applicable):
rhevm-backend-3.5.0-0.30.el6ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. engine-config -s PMHealthCheckEnabled=true
2. define primary/secondary PM settings for a host, both with invalid password
3. tail -f /var/log/ovirt-engine/engine.log | grep 'Health check failed'

Actual results:
failure is reported only for primary agent

Expected results:
should be reported for both imo

Additional info:
# tail -f /var/log/ovirt-engine/engine.log | grep -i health                                                               
2015-01-23 13:35:02,975 INFO  [org.ovirt.engine.core.bll.pm.PmHealthCheckManager] (DefaultQuartzScheduler_Worker-27) Power Management Health Check started.
2015-01-23 13:35:36,707 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-27) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Health check failed on Host dell-r210ii-03.rhev.lab.eng.brq.redhat.com primary sequential agent, future fence operations may fail if secondary agent is not defined properly.
2015-01-23 13:35:36,708 INFO  [org.ovirt.engine.core.bll.pm.PmHealthCheckManager] (DefaultQuartzScheduler_Worker-27) Power Management Health Check completed.
2015-01-23 13:36:36,708 INFO  [org.ovirt.engine.core.bll.pm.PmHealthCheckManager] (DefaultQuartzScheduler_Worker-49) Power Management Health Check started.
2015-01-23 13:37:10,364 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-49) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Health check failed on Host dell-r210ii-03.rhev.lab.eng.brq.redhat.com primary sequential agent, future fence operations may fail if secondary agent is not defined properly.
2015-01-23 13:37:10,364 INFO  [org.ovirt.engine.core.bll.pm.PmHealthCheckManager] (DefaultQuartzScheduler_Worker-49) Power Management Health Check completed.
2015-01-23 13:38:10,364 INFO  [org.ovirt.engine.core.bll.pm.PmHealthCheckManager] (DefaultQuartzScheduler_Worker-78) Power Management Health Check started.
2015-01-23 13:38:44,133 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-78) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Health check failed on Host dell-r210ii-03.rhev.lab.eng.brq.redhat.com primary sequential agent, future fence operations may fail if secondary agent is not defined properly.
2015-01-23 13:38:44,133 INFO  [org.ovirt.engine.core.bll.pm.PmHealthCheckManager] (DefaultQuartzScheduler_Worker-78) Power Management Health Check completed.
2015-01-23 13:39:44,134 INFO  [org.ovirt.engine.core.bll.pm.PmHealthCheckManager] (DefaultQuartzScheduler_Worker-4) Power Management Health Check started.
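
A minimal sketch, not part of the original report, for tallying which agent roles show up in 'Health check failed' events. It assumes each log record sits on a single line and that the message wording matches the excerpt in this description; the `secondary` and `concurrent` alternatives in the pattern are assumptions (only `primary sequential` appears in the log), and `count_health_check_failures` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern derived from the log excerpt in this report. The "secondary" and
# "concurrent" alternatives are assumed, since only "primary sequential"
# was actually observed.
HEALTH_CHECK_RE = re.compile(
    r"Health check failed on Host (?P<host>\S+) "
    r"(?P<role>primary|secondary) (?P<mode>sequential|concurrent) agent"
)


def count_health_check_failures(lines):
    """Count 'Health check failed' events per (host, agent role)."""
    counts = Counter()
    for line in lines:
        match = HEALTH_CHECK_RE.search(line)
        if match:
            counts[(match.group("host"), match.group("role"))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_failures.py < /var/log/ovirt-engine/engine.log
    for (host, role), n in sorted(count_health_check_failures(sys.stdin).items()):
        print(f"{host} {role}: {n}")
```

With the behavior reported here, such a tally would show entries for the primary agent only, even though both agents were configured with an invalid password.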
Comment 1 Jiri Belka 2015-01-23 07:56:43 EST
engine=# select log_time,message from audit_log where message like 'Health check%';
          log_time          | message
----------------------------+---------
 2015-01-23 13:34:02.963+01 | Health check failed on Host dell-r210ii-03.rhev.lab.eng.brq.redhat.com primary sequential agent, future fence operations may fail if secondary agent is not defined properly.
(1 row)
Comment 2 Jiri Belka 2015-01-23 08:03:10 EST
Failure is not reported even when in concurrent mode.
Comment 3 Ori Liel 2015-06-18 08:40:24 EDT
Power-Management related behavior was refactored, probably making this issue obsolete.
Comment 4 Jiri Belka 2016-01-12 11:29:13 EST
Code refactoring caused the loss of the original distinction between primary and secondary PM failures.

- original:

2015-01-23 13:54:05,242 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-62) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Power Management test failed for Host dell-r210ii-03.example.com.[Getting status of IPMI:10.34.63.243...Chassis power = Unknown, Failed]
2015-01-23 13:54:22,244 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-62) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Power Management test failed for Host dell-r210ii-03.example.com.[Getting status of IPMI:10.34.63.243...Chassis power = Unknown, Failed]
2015-01-23 13:54:22,308 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-62) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Health check failed on Host dell-r210ii-03.example.com primary sequential agent, future fence operations may fail if secondary agent is not defined properly.

- new:

2016-01-12 17:08:10,662 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Power Management test failed for Host dell-r210ii-04. No reason was returned for this operation failure. See logs for further details.
2016-01-12 17:08:10,662 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Power Management test failed for Host dell-r210ii-04. No reason was returned for this operation failure. See logs for further details.
2016-01-12 17:08:10,699 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Execution of power management status on Host dell-r210ii-04 using Proxy Host dell-r210ii-13.example.com and Fence Agent ipmilan:10.34.63.243 failed.
2016-01-12 17:08:10,699 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Execution of power management status on Host dell-r210ii-04 using Proxy Host dell-r210ii-13.example.com and Fence Agent ipmilan:10.34.63.243 failed.
2016-01-12 17:08:19,978 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Power Management test failed for Host dell-r210ii-04. No reason was returned for this operation failure. See logs for further details.
2016-01-12 17:08:19,978 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Power Management test failed for Host dell-r210ii-04. No reason was returned for this operation failure. See logs for further details.
2016-01-12 17:08:20,029 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Execution of power management status on Host dell-r210ii-04 using Proxy Host dell-r210ii-13.example.com and Fence Agent ipmilan:10.34.63.243 failed.
2016-01-12 17:08:20,029 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Execution of power management status on Host dell-r210ii-04 using Proxy Host dell-r210ii-13.example.com and Fence Agent ipmilan:10.34.63.243 failed.
2016-01-12 17:08:28,330 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Power Management test failed for Host dell-r210ii-04. No reason was returned for this operation failure. See logs for further details.
2016-01-12 17:08:28,330 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Power Management test failed for Host dell-r210ii-04. No reason was returned for this operation failure. See logs for further details.
2016-01-12 17:08:28,389 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Execution of power management status on Host dell-r210ii-04 using Proxy Host dell-r210ii-13.example.com and Fence Agent ipmilan:10.34.62.155 failed.
2016-01-12 17:08:28,389 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Execution of power management status on Host dell-r210ii-04 using Proxy Host dell-r210ii-13.example.com and Fence Agent ipmilan:10.34.62.155 failed.
2016-01-12 17:08:38,669 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Power Management test failed for Host dell-r210ii-04. No reason was returned for this operation failure. See logs for further details.
2016-01-12 17:08:38,669 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Power Management test failed for Host dell-r210ii-04. No reason was returned for this operation failure. See logs for further details.
2016-01-12 17:08:38,722 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Execution of power management status on Host dell-r210ii-04 using Proxy Host dell-r210ii-13.example.com and Fence Agent ipmilan:10.34.62.155 failed.
2016-01-12 17:08:38,722 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Execution of power management status on Host dell-r210ii-04 using Proxy Host dell-r210ii-13.example.com and Fence Agent ipmilan:10.34.62.155 failed.
2016-01-12 17:08:38,762 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Health check on Host dell-r210ii-04 indicates that future attempts to Start this host using Power-Management are expected to fail.
2016-01-12 17:08:38,762 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Health check on Host dell-r210ii-04 indicates that future attempts to Start this host using Power-Management are expected to fail.
2016-01-12 17:08:38,814 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Health check on Host dell-r210ii-04 indicates that future attempts to Stop this host using Power-Management are expected to fail.
2016-01-12 17:08:38,814 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Health check on Host dell-r210ii-04 indicates that future attempts to Stop this host using Power-Management are expected to fail.


So originally one could tell from the 'Health check.*' event messages whether the primary or the secondary PM agent was failing. Also, the new output seems a little bit spammy.
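
A rough sketch of how the repeated events could be collapsed into one entry per failing fence agent. It is only an illustration under the assumption that the message wording matches the "Execution of power management status ... failed." lines quoted in this comment; `failed_agents` is a hypothetical helper, not an engine API.

```python
import re

# Wording taken from the audit messages quoted in this comment; other
# engine versions may phrase it differently.
AGENT_FAILURE_RE = re.compile(
    r"Execution of power management status on Host (?P<host>\S+) "
    r"using Proxy Host (?P<proxy>\S+) and Fence Agent (?P<agent>\S+) failed\."
)


def failed_agents(lines):
    """Map each host to its failing fence agents, listed once, in first-seen order."""
    result = {}
    for line in lines:
        match = AGENT_FAILURE_RE.search(line)
        if not match:
            continue
        agents = result.setdefault(match.group("host"), [])
        if match.group("agent") not in agents:
            agents.append(match.group("agent"))
    return result
```

Applied to the "new" excerpt, this would reduce the eight "Execution of power management status" lines to the two distinct agents, `ipmilan:10.34.63.243` and `ipmilan:10.34.62.155`, although it still cannot say which one used to be "primary".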
Comment 5 Yaniv Lavi 2016-05-09 07:03:34 EDT
oVirt 4.0 Alpha has been released, moving to oVirt 4.0 Beta target.
Comment 10 Martin Perina 2016-05-26 10:56:13 EDT
Eli, could you please try to investigate? 

1. BZ1325664 has been fixed, so we should see the reason of failure for each agent.

2. In 3.6 we allowed "unlimited number" of ordered levels for fence agents and on each level we define either one fence agent or unlimited number of "concurrent" fence agents, so we are no longer able to distinguish between primary and secondary agents. But we should be able to resolve ability to execute start and/or stop operations and display proper results during PM Health Check.
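
The ordered-levels model from point 2 can be sketched roughly as below. This is an illustrative Python model, not the ovirt-engine (Java) code: `FenceAgent`, `health_check`, and the `status_ok` callback are hypothetical names, and the rule that fencing is considered usable when at least one level has all of its agents healthy is an assumption of the sketch.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple


class FenceAgent(NamedTuple):
    order: int     # level; agents sharing the same order are "concurrent"
    address: str   # e.g. "ipmilan:10.34.63.243"


def health_check(
    agents: List[FenceAgent],
    status_ok: Callable[[FenceAgent], bool],
) -> Tuple[List[FenceAgent], bool]:
    """Probe every agent on every level; never stop at the first failure.

    Returns (failures, usable). `failures` lists every failing agent ordered
    by level. `usable` is True if at least one level has all of its agents
    healthy (an assumption made for this sketch).
    """
    levels: Dict[int, List[FenceAgent]] = {}
    for agent in agents:
        levels.setdefault(agent.order, []).append(agent)

    failures: List[FenceAgent] = []
    usable = False
    for order in sorted(levels):
        failed_here = [a for a in levels[order] if not status_ok(a)]
        failures.extend(failed_here)
        if not failed_here:
            usable = True
    return failures, usable
```

Because every agent is probed regardless of earlier results, a failure on the second level is still reported when the first level has already failed, which is the behavior this bug asks for.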
Comment 11 Eli Mesika 2016-05-29 05:10:10 EDT
(In reply to Martin Perina from comment #10)
> Eli, could you please try to investigate? 
> 
> 1. BZ1325664 has been fixed, so we should see the reason of failure for each
> agent.

This BZ was opened for 3.5, so the VDSM release that works with 3.5 does not have the refactoring in fenceNode that caused the regression reported in BZ1325664.

> 
> 2. In 3.6 we allowed "unlimited number" of ordered levels for fence agents
> and on each level we define either one fence agent or unlimited number of
> "concurrent" fence agents, so we are no longer able to distinguish between
> primary and secondary agents. But we should be able to resolve ability to
> execute start and/or stop operations and display proper results during PM
> Health Check.

Looking at the master and 3.5 branch code for PmHealthCheckManager::pmHealthCheck(), I can see that there is a bug in the 3.5 code that caused the BZ reported here. This can be easily fixed for 3.5.

The code for 3.6 and master is more general and will report all fencing agents.

Do we have to resolve that for 3.5?
Comment 12 Eli Mesika 2016-05-29 05:13:30 EDT
Moving this BZ to ON_QA since this is already fixed for 4.0 

approved by Oved
Comment 13 Jiri Belka 2016-06-02 03:00:55 EDT
ok,


 2016-06-02 06:08:03.687+02 | Execution of power management status on Host dell-r210ii-03 using Proxy Host dell-r210ii-04 and Fence Agent ipmilan:10.34.63.243 failed.
 2016-06-02 06:08:13.073+02 | Execution of power management status on Host dell-r210ii-03 using Proxy Host dell-r210ii-04 and Fence Agent ipmilan:10.34.63.242 failed.
Comment 15 errata-xmlrpc 2016-08-23 16:22:44 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1743.html
