Bug 1312039

Summary: Fencing via power management failed on hosted-engine host with HE vm
Product: [oVirt] ovirt-engine Reporter: Artyom <alukiano>
Component: Backend.Core Assignee: Martin Perina <mperina>
Status: CLOSED WORKSFORME QA Contact: Artyom <alukiano>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.6.3.2 CC: bugs, dmoessne, mavital, mperina
Target Milestone: --- Flags: rule-engine: planning_ack?
rule-engine: devel_ack?
rule-engine: testing_ack?
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-03-17 11:36:36 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
engine and vdsm of hosted_engine_1 logs none

Description Artyom 2016-02-25 15:27:00 UTC
Created attachment 1130585
engine and vdsm of hosted_engine_1 logs

Description of problem:
Fencing via power management fails on the hosted-engine host that runs the HE VM.

Version-Release number of selected component (if applicable):
rhevm-backend-3.6.3.2-0.1.el6.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy HE on two hosts
2. Configure PM on the host running the HE VM
3. Stop the network on the host running the HE VM
4. Wait until the HE VM starts on the second host
5. Wait for the first host to be up

Actual results:
The first host stays in the non-responsive state indefinitely

Expected results:
The first host should be fenced via PM

Additional info:
The engine log shows:
2016-02-25 15:33:34,353 ERROR [org.ovirt.engine.core.bll.pm.VdsNotRespondingTreatmentCommand] (org.ovirt.thread.pool-6-thread-34) [] Failed to run Fence script on vds 'hosted_engine_2'.
2016-02-25 15:33:34,398 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host hosted_engine_2 became non responsive. It has no power management configured. Please check the host status, manually reboot it, and click "Confirm Host Has Been Rebooted"

but the host does have PM configured:
<power_management type="ipmilan">
  <enabled>true</enabled>
  <address>rose05-mgmt.qa.lab.tlv.redhat.com</address>
  <username>root</username>
  <options />
  <pm_proxies>
    <pm_proxy>
      <type>cluster</type>
    </pm_proxy>
    <pm_proxy>
      <type>dc</type>
    </pm_proxy>
  </pm_proxies>
  <agents>
    <agent type="ipmilan" id="afef3a9e-1f7a-405a-8e00-fd3f2bdb26d3">
      <address>rose05-mgmt.qa.lab.tlv.redhat.com</address>
      <username>root</username>
      <options />
      <order>1</order>
    </agent>
  </agents>
  <automatic_pm_enabled>true</automatic_pm_enabled>
  <kdump_detection>true</kdump_detection>
</power_management>
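
For anyone trying to reproduce this, the PM settings quoted above can be checked mechanically before running the fencing flow. The following is only a minimal sketch using Python's standard xml.etree.ElementTree; summarize_pm is a hypothetical helper name (not part of oVirt), and the XML literal is an abbreviated copy of the element pasted in this report, not a fresh API response.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the power_management element quoted in this report.
PM_XML = """
<power_management type="ipmilan">
  <enabled>true</enabled>
  <address>rose05-mgmt.qa.lab.tlv.redhat.com</address>
  <agents>
    <agent type="ipmilan" id="afef3a9e-1f7a-405a-8e00-fd3f2bdb26d3">
      <address>rose05-mgmt.qa.lab.tlv.redhat.com</address>
      <order>1</order>
    </agent>
  </agents>
  <automatic_pm_enabled>true</automatic_pm_enabled>
</power_management>
"""


def summarize_pm(xml_text):
    """Hypothetical helper: return (enabled, automatic, agent_types)."""
    root = ET.fromstring(xml_text)
    # Missing elements are treated as "false" rather than raising.
    enabled = root.findtext("enabled", default="false").strip() == "true"
    automatic = (
        root.findtext("automatic_pm_enabled", default="false").strip() == "true"
    )
    # Collect the type attribute of every configured fence agent.
    agent_types = [agent.get("type") for agent in root.findall("agents/agent")]
    return enabled, automatic, agent_types


if __name__ == "__main__":
    enabled, automatic, agent_types = summarize_pm(PM_XML)
    print("PM enabled:", enabled)
    print("Automatic PM enabled:", automatic)
    print("Fence agents:", agent_types)
```

If this reports PM as enabled with at least one agent, the "It has no power management configured" audit message in the engine log contradicts the host configuration, which is the discrepancy this bug describes.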

Comment 1 Roy Golan 2016-03-02 13:24:52 UTC
Why is this specific to hosted engine?

Comment 2 Martin Perina 2016-03-17 11:36:36 UTC
I haven't been able to reproduce it while testing the following fencing flows in a 2-node hosted engine cluster (in each flow the HE VM is running on host1):

  1. Stop networking on host1
  2. Block connection from engine to host1 using iptables
  3. Execute kdump on host1

In all of the above cases host1 was properly fenced and came up afterwards.

I tested on the latest stable oVirt 3.6:

ovirt-hosted-engine-ha-1.3.4.3-1
ovirt-engine-3.6.3.4-1

I still don't get how this bug could be opened when exactly the same case was tested and verified in BZ1266099.

Closing as WORKSFORME, feel free to reopen if you are able to reproduce it again.