Bug 1303064 - [RHEV 36beta] Hosted Engine failover does not work in some cases
Summary: [RHEV 36beta] Hosted Engine failover does not work in some cases
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-hosted-engine-ha
Version: 3.6.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-3.6.3
Target Release: ---
Assignee: Simone Tiraboschi
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks: RHEV_36_HTB
 
Reported: 2016-01-29 12:54 UTC by Martin Tessun
Modified: 2020-08-13 08:25 UTC
CC List: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-02-08 08:04:35 UTC
oVirt Team: Integration
Target Upstream Version:
Embargoed:



Description Martin Tessun 2016-01-29 12:54:01 UTC
Description of problem:
If a hypervisor that holds the SPM role and also runs the Hosted Engine crashes, the HE does not get restarted on another node.
hosted-engine --vm-status shows both nodes with stale data.
vdsm throws errors in that case as well.
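A minimal diagnostic sketch for the surviving host in this situation (an editor's illustration, not from the original report; it assumes the default RHEV 3.6 service names and log paths):

    # On the surviving HE host: check that the HA services are running
    systemctl status ovirt-ha-agent ovirt-ha-broker

    # Look for errors around score calculation and metadata updates
    tail -n 100 /var/log/ovirt-hosted-engine-ha/agent.log
    tail -n 100 /var/log/ovirt-hosted-engine-ha/broker.log

    # Re-check the shared metadata that "stale data" refers to
    hosted-engine --vm-status

If the output stays stale, restarting ovirt-ha-broker and then ovirt-ha-agent on the surviving host is a common next step.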

Version-Release number of selected component (if applicable):
3.6beta3

How reproducible:
Always at the customer site

Steps to Reproduce:
1. Install hosted engine on two hosts
2. Assign the SPM role to the host running the HE
3. Crash the hypervisor running the HE
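A minimal sketch of one way to drive these steps from a shell (the sysrq crash is only one way to simulate the hypervisor failure and is not taken from the original report; it requires console or out-of-band access to Host 1):

    # On Host 1 (runs the HE VM and holds SPM): confirm the starting state
    hosted-engine --vm-status

    # Simulate a hard crash via sysrq; the host panics immediately
    echo 1 > /proc/sys/kernel/sysrq
    echo c > /proc/sysrq-trigger

    # On Host 2: watch whether the HE VM is restarted and whether the status goes stale
    watch -n 10 hosted-engine --vm-status
    tail -f /var/log/ovirt-hosted-engine-ha/agent.log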

Actual results:
HE does not get restarted
hosted-engine --vm-status shows stale data for both systems 

Expected results:
After recognizing the failure, Hosted Engine should be restarted on the remaining node.


Additional info:
Lots of vdsm errors were also discovered in this case.

Comment 1 Martin Tessun 2016-01-30 19:49:26 UTC
Just an additional observation:

If the HE and the SPM are running on the same HV and that HV is powered off, the VMs running on that host do not recover, even if the HE is automatically started on another HV.

What happens is that the fencing action fails, as the HE is still in its startup phase.
Later on, no additional fencing attempt seems to be made.

So you need to fence the host manually via "Confirm Host Has Been Rebooted".
This is reproducible even when the HE takeover does work.

Cheers,
Martin

Comment 2 Martin Tessun 2016-01-30 20:31:37 UTC
For the behaviour described in comment #1, the following can be observed:

Scenario for better understanding:

* Host 1: Hosted Engine, test VM and SPM
* Host 2: Empty

Action:
* Power off Host 1

Result:
* HE is started on Host 2
* Host 1 does not get fenced
* SPM stays on (powered off) Host 1

Event log shows the following:
	
Jan 30, 2016 8:42:14 PM Fencing failed on Storage Pool Manager ovirt1 for Data Center Default. Setting status to Non-Operational.
Jan 30, 2016 8:42:13 PM Host ovirt1 became non responsive. It has no power management configured. Please check the host status, manually reboot it, and click "Confirm Host Has Been Rebooted"

So at least it shows that it couldn't fence the host. But the host does have fencing configured, and it works. If you try to power on the host via Power Management in the Hosts tab at the same time, a popup box states that the action cannot be taken due to the following reasons:
* Fence is disabled due to the Engine Service start up sequence.
* Cannot start Host. Fence operation failed.

After the line shown below is logged in the event log, the host can be powered on using Power Management in the Hosts tab:
Jan 30, 2016 8:45:14 PM Try to recover Data Center Default. Setting status to Non Responsive.

After doing that PowerOn Action, the following is logged in the Event log (read from bottom to top for the timeline):
	
Jan 30, 2016 8:55:22 PM Storage Pool Manager runs on Host ovirt2 (Address: ovirt2.satellite.local).
Jan 30, 2016 8:55:05 PM VDSM ovirt1 command failed: Not SPM
Jan 30, 2016 8:55:04 PM VM test was restarted on Host ovirt2
Jan 30, 2016 8:54:54 PM Host ovirt1 power management was verified successfully.
Jan 30, 2016 8:54:54 PM Status of host ovirt1 was set to Up.
Jan 30, 2016 8:54:50 PM Executing power management status on Host ovirt1 using Proxy Host ovirt2 and Fence Agent xvm:225.0.0.12.
Jan 30, 2016 8:54:21 PM VM HostedEngine configuration was updated by system.
Jan 30, 2016 8:54:19 PM Kdump integration is enabled for host ovirt1, but kdump is not configured properly on host.
Jan 30, 2016 8:53:53 PM VM test was restarted on Host ovirt2
Jan 30, 2016 8:53:48 PM Host ovirt1 was started by admin@internal.
Jan 30, 2016 8:53:48 PM Power management start of Host ovirt1 succeeded.
Jan 30, 2016 8:53:47 PM Vm test was shut down due to ovirt1 host reboot or manual fence
Jan 30, 2016 8:53:45 PM Executing power management status on Host ovirt1 using Proxy Host ovirt2 and Fence Agent xvm:225.0.0.12.
Jan 30, 2016 8:53:38 PM Executing power management start on Host ovirt1 using Proxy Host ovirt2 and Fence Agent xvm:225.0.0.12.
Jan 30, 2016 8:53:37 PM Power management start of Host ovirt1 initiated.

As mentioned, this was my manual power-on after an additional wait of 7 minutes.
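The window in which fencing is refused during engine startup corresponds to the engine's fence-at-startup grace period. A hedged sketch for inspecting or tuning it, assuming engine-config is available on the HE VM and the DisableFenceAtStartupInSec key applies to this version:

    # On the Hosted Engine VM
    engine-config -g DisableFenceAtStartupInSec        # show the current grace period in seconds
    engine-config -s DisableFenceAtStartupInSec=180    # illustrative value only
    systemctl restart ovirt-engine                     # engine-config changes need an engine restart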

One additional note: doing the same test with the SPM running on a node other than the powered-off one, there is a similar result:

Scenario for better understanding:

* Host 1: Hosted Engine and test VM
* Host 2: SPM

Action:
* Power off Host 1

Result:
* HE is started on Host 2
* Host 1 does not get fenced
* SPM stays (as expected) on Host 2

Event log shows the following (read from bottom to top for the timeline):

	
Jan 30, 2016 9:28:58 PM Power management start of Host ovirt1 initiated. ### That was me manually
Jan 30, 2016 9:22:46 PM User admin@internal logged in.
Jan 30, 2016 9:22:31 PM	Storage Pool Manager runs on Host ovirt2 (Address: ovirt2.satellite.local).
Jan 30, 2016 9:22:31 PM Host ovirt1 failed to recover.
Jan 30, 2016 9:22:28 PM Host ovirt1 is non responsive.
Jan 30, 2016 9:22:28 PM VM test was set to the Unknown status.
Jan 30, 2016 9:22:28 PM VM HostedEngine was set to the Unknown status.
Jan 30, 2016 9:22:26 PM Invalid status on Data Center Default. Setting status to Non Responsive.
Jan 30, 2016 9:22:22 PM Host ovirt1 is not responding. It will stay in Connecting state for a grace period of 61 seconds and after that an attempt to fence the host will be issued.

What is interesting here is that no fencing attempt is logged at all.
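To confirm from the logs whether the engine attempted any fencing at all, the engine log on the HE VM can be searched; a minimal sketch, assuming the default log location:

    # On the Hosted Engine VM
    grep -iE 'fence|power management' /var/log/ovirt-engine/engine.log | tail -n 50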

Cheers,
Martin

Comment 3 Simone Tiraboschi 2016-01-31 16:38:39 UTC
(In reply to Martin Tessun from comment #2) 

Here we are probably overlapping two distinct issues.

> Result:
> * HE is started on Host 2

Is the HE always automatically restarted on the other host after a few minutes, or did you face issues with that as in the first comment?

> * Host 1 does not get fenced

Fencing is a different matter; the ha-agent should not be affected by host fencing.

Comment 5 Martin Tessun 2016-02-01 15:14:39 UTC
(In reply to Simone Tiraboschi from comment #3)
> (In reply to Martin Tessun from comment #2) 
> 
> Here we are probably overlapping two distinct issues.
> 
> > Result:
> > * HE is started on Host 2
> 
> Is the HE always automatically restarted on the other host after a few
> minutes, or did you face issues with that as in the first comment?

I did not face that behaviour in my reproducer. The difference is that my reproducer runs on iSCSI and not FC.
On the systems where I experienced the issue first, both HVs went to stale data.
I think I could ask the customer for a sosreport from his system to check the logs further.

> 
> > * Host 1 does not get fenced
> 
> Fencing is a different matter; the ha-agent should not be affected by host
> fencing.

Indeed. This was more that the SPM takeover and fencing did not take place after the hosted_engine recovered. I will open a separate bug for this.

Comment 6 Simone Tiraboschi 2016-02-01 15:38:50 UTC
(In reply to Martin Tessun from comment #5)
> > Is the HE always automatically restarted on the other host after a few
> > minutes, or did you face issues with that as in the first comment?
> 
> I did not face that behaviour in my reproducer. The difference is that my
> reproducer runs on iSCSI and not FC.
> On the systems where I experienced the issue first, both HVs went to stale data.
> I think I could ask the customer for a sosreport from his system to check
> the logs further.

Thanks, really appreciated.
I tried with NFS but I wasn't able to reproduce it; maybe it's something FC-specific.

> > Fencing is a different matter; the ha-agent should not be affected by host
> > fencing.
> 
> Indeed. This was more that the SPM takeover and fencing did not take place
> after the hosted_engine recovered. I will open a separate bug for this.

+1

Comment 7 Martin Tessun 2016-02-02 11:15:16 UTC
So I created BZ #1303897 for the 2nd issue.

Besides this I am waiting for the data from the customer. I will update the BZ once the data is available to me.

Setting needinfo on me.

Comment 8 Yaniv Lavi 2016-02-03 08:34:15 UTC
Can you try to reproduce this on FC with the details in comment #2?

Comment 9 Aharon Canan 2016-02-03 08:37:15 UTC
Elad - 
Yours...

Comment 10 Elad 2016-02-03 12:07:56 UTC
Let me see if I got the scenario right:

- 2 hosts in hosted-engine setup, both HE hosts.
- Both hosts have power management configured.
- HE VM running on SPM with one more test VM.
- Power off SPM host

Martin, please confirm, thanks.

Comment 11 Martin Tessun 2016-02-04 09:26:55 UTC
Hi Elad,

Confirmed. That was exactly the scenario at the customer site.
The result was that hosted-engine --vm-status showed both HE hosts as "stale data".

Cheers,
Martin

Comment 12 Elad 2016-02-04 11:47:43 UTC
Currently we don't have 2 hosts with FC and power management connected. I opened a request for it.

Comment 13 Elad 2016-02-07 12:48:05 UTC
Tested the scenario described in comment #10 over FC. 

A few minutes after the SPM host, with the hosted engine and an additional guest running on it, crashes, the other host takes over SPM and the HE VM starts on that second host. The setup returns to normal operation as expected.

Therefore, with the following package versions, I could not reproduce the issue over FC:

ovirt-hosted-engine-ha-1.3.3.7-1.el7ev.noarch
ovirt-vmconsole-host-1.0.0-1.el7ev.noarch
ovirt-setup-lib-1.0.1-1.el7ev.noarch
ovirt-hosted-engine-setup-1.3.2.3-1.el7ev.noarch
ovirt-vmconsole-1.0.0-1.el7ev.noarch
ovirt-host-deploy-1.4.1-1.el7ev.noarch
libgovirt-0.3.3-1.el7_2.1.x86_64
vdsm-jsonrpc-4.17.19-0.el7ev.noarch
vdsm-hook-vmfex-dev-4.17.19-0.el7ev.noarch
vdsm-python-4.17.19-0.el7ev.noarch
vdsm-4.17.19-0.el7ev.noarch
vdsm-infra-4.17.19-0.el7ev.noarch
vdsm-cli-4.17.19-0.el7ev.noarch
vdsm-yajsonrpc-4.17.19-0.el7ev.noarch
vdsm-xmlrpc-4.17.19-0.el7ev.noarch
sanlock-3.2.4-2.el7_2.x86_64
libvirt-daemon-1.2.17-13.el7_2.3.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.7.x86_64

Comment 14 Sandro Bonazzola 2016-02-08 08:04:35 UTC
According to comments #13, #6, and #5, we can't reproduce this without an exact reproduction procedure, so I am closing it as insufficient data for now.
Please reopen if you can provide steps to reproduce.

