Bug 1303897 - If Host running the HE crashes it does not get fenced correctly
Summary: If Host running the HE crashes it does not get fenced correctly
Keywords:
Status: CLOSED DUPLICATE of bug 1266099
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.6.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Nobody
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: RHEV_36_HTB
 
Reported: 2016-02-02 11:11 UTC by Martin Tessun
Modified: 2020-08-13 08:25 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-02-03 09:36:58 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:
lsvaty: testing_plan_complete-



Description Martin Tessun 2016-02-02 11:11:44 UTC
Description of problem:
The SPM host does not get fenced if it fails while the Hosted Engine (HE) is running on it, because the HE start-up prevents the fencing.


Scenario:

* Host 1: Hosted Engine, test VM and SPM
* Host 2: Empty

Action:
* Power off Host 1

Result:
* HE is started on Host 2
* Host 1 does not get fenced
* SPM stays on (powered off) Host 1

Event log shows the following:
	
Jan 30, 2016 8:42:14 PM Fencing failed on Storage Pool Manager ovirt1 for Data Center Default. Setting status to Non-Operational.
Jan 30, 2016 8:42:13 PM Host ovirt1 became non responsive. It has no power management configured. Please check the host status, manually reboot it, and click "Confirm Host Has Been Rebooted"

So the log at least shows that the engine could not fence the host. However, the host does have fencing configured, and it works. When trying to power on the host via Power Management in the Hosts tab at the same time, a pop-up box reports that the action cannot be taken for the following reasons:
* Fence is disabled due to the Engine Service start up sequence.
* Cannot start Host. Fence operation failed.

After the line shown below is logged in the event log, the host can be powered on by using Power Management in the Hosts tab:
Jan 30, 2016 8:45:14 PM Try to recover Data Center Default. Setting status to Non Responsive.

After that power-on action, the following is logged in the event log (read from bottom to top for the timeline):
	
Jan 30, 2016 8:55:22 PM Storage Pool Manager runs on Host ovirt2 (Address: ovirt2.satellite.local).
Jan 30, 2016 8:55:05 PM VDSM ovirt1 command failed: Not SPM
Jan 30, 2016 8:55:04 PM VM test was restarted on Host ovirt2
Jan 30, 2016 8:54:54 PM Host ovirt1 power management was verified successfully.
Jan 30, 2016 8:54:54 PM Status of host ovirt1 was set to Up.
Jan 30, 2016 8:54:50 PM Executing power management status on Host ovirt1 using Proxy Host ovirt2 and Fence Agent xvm:225.0.0.12.
Jan 30, 2016 8:54:21 PM VM HostedEngine configuration was updated by system.
Jan 30, 2016 8:54:19 PM Kdump integration is enabled for host ovirt1, but kdump is not configured properly on host.
Jan 30, 2016 8:53:53 PM VM test was restarted on Host ovirt2
Jan 30, 2016 8:53:48 PM Host ovirt1 was started by admin@internal.
Jan 30, 2016 8:53:48 PM Power management start of Host ovirt1 succeeded.
Jan 30, 2016 8:53:47 PM Vm test was shut down due to ovirt1 host reboot or manual fence
Jan 30, 2016 8:53:45 PM Executing power management status on Host ovirt1 using Proxy Host ovirt2 and Fence Agent xvm:225.0.0.12.
Jan 30, 2016 8:53:38 PM Executing power management start on Host ovirt1 using Proxy Host ovirt2 and Fence Agent xvm:225.0.0.12.
Jan 30, 2016 8:53:37 PM Power management start of Host ovirt1 initiated.
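Because the event log lists the newest entry first, it can help to reorder an excerpt before reading it. The following is only a minimal sketch, assuming each line starts with a timestamp in the form shown above and that month names parse under an English locale; parse_event and chronological are hypothetical helper names, not part of any oVirt tool.

```python
from datetime import datetime


def parse_event(line):
    """Split one event-log line into (timestamp, message)."""
    # Assumption: the timestamp is the first five whitespace-separated
    # tokens, e.g. "Jan 30, 2016 8:53:37 PM".
    parts = line.split(None, 5)
    stamp = datetime.strptime(" ".join(parts[:5]), "%b %d, %Y %I:%M:%S %p")
    return stamp, parts[5]


def chronological(lines):
    """Return the events oldest first."""
    events = [parse_event(line) for line in lines if line.strip()]
    # The log is newest first, so reverse it before the stable sort;
    # entries sharing a timestamp then keep their bottom-to-top order.
    events.reverse()
    return sorted(events, key=lambda event: event[0])
```

Feeding the pasted lines to chronological() yields the timeline in the order the events occurred.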

As said, this was my manual power-on after an additional wait time of 7 minutes.

One additional note: Running the same test with the SPM on a node other than the powered-off one gives a similar result:

Scenario for better understanding:

* Host 1: Hosted Engine and test VM
* Host 2: SPM

Action:
* Power off Host 1

Result:
* HE is started on Host 2
* Host 1 does not get fenced
* SPM stays (as expected) on Host 2

Event log shows the following (read from bottom to top for the timeline):

	
Jan 30, 2016 9:28:58 PM Power management start of Host ovirt1 initiated. ### That was me manually
Jan 30, 2016 9:22:46 PM User admin@internal logged in.
Jan 30, 2016 9:22:31 PM Storage Pool Manager runs on Host ovirt2 (Address: ovirt2.satellite.local).
Jan 30, 2016 9:22:31 PM Host ovirt1 failed to recover.
Jan 30, 2016 9:22:28 PM Host ovirt1 is non responsive.
Jan 30, 2016 9:22:28 PM VM test was set to the Unknown status.
Jan 30, 2016 9:22:28 PM VM HostedEngine was set to the Unknown status.
Jan 30, 2016 9:22:26 PM Invalid status on Data Center Default. Setting status to Non Responsive.
Jan 30, 2016 9:22:22 PM Host ovirt1 is not responding. It will stay in Connecting state for a grace period of 61 seconds and after that an attempt to fence the host will be issued.

What is interesting here is that no fencing attempt is logged at all.

Version-Release number of selected component (if applicable):
RHEV 3.6 beta3

How reproducible:
always

Steps to Reproduce:
See description above

Actual results:
The failed host does not get fenced by the engine, but goes to the "Down" state instead.

Expected results:
Hosts should get fenced as soon as the startup phase has finished and the host is still down.
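
The expected behaviour can be written down as a small decision rule. This is a rough sketch of the reporter's expectation only, not the ovirt-engine implementation; should_fence and all of its parameters are hypothetical names, and the length of the start-up window is left as an input because the report does not state it.

```python
def should_fence(now, engine_started_at, startup_window,
                 host_responsive, pm_configured):
    """Return True if a fence attempt is due (all times in seconds)."""
    # Nothing to do for a healthy host or one without power management.
    if host_responsive or not pm_configured:
        return False
    # Assumption: fencing stays disabled during the engine start-up
    # window and becomes due once that window has elapsed.
    return now - engine_started_at >= startup_window
```

Under this rule, a host that is still down when the start-up window ends would be fenced at that point rather than being left unfenced.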

Additional info:
See also BZ #1303064

Comment 1 Yaniv Lavi 2016-02-03 08:35:51 UTC
Can you please review this?

Comment 2 Oved Ourfali 2016-02-03 09:21:18 UTC
Please attach complete logs.

