Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1148397

Summary: HA VM is not restarted on another host after host running VM is set to Nonresponsive on ppc64 arch
Product: Red Hat Enterprise Virtualization Manager
Reporter: Lukas Svaty <lsvaty>
Component: ovirt-engine
Assignee: Jiri Moskovcak <jmoskovc>
Status: CLOSED NOTABUG
QA Contact: Lukas Svaty <lsvaty>
Severity: urgent
Docs Contact:
Priority: medium
Version: 3.4.1-1
CC: dfediuck, ecohen, gklein, iheim, lpeer, lsurette, lsvaty, michal.skrivanek, ofrenkel, rbalakri, rgolan, Rhev-m-bugs, sherold, yeylon
Target Milestone: ---
Target Release: 3.4.3
Hardware: ppc64
OS: Linux
Whiteboard: sla
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-10-14 10:46:12 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: SLA
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1122979
Attachments:
Description Flags
engine, vdsm logs none

Description Lukas Svaty 2014-10-01 11:44:36 UTC
Created attachment 943019 [details]
engine, vdsm logs

Description of problem:
When the host running an HA VM is disconnected from the engine, the VM should be restarted on another working host after some time.

Version-Release number of selected component (if applicable):
av12_ppc

How reproducible:
100%

Steps to Reproduce:
1. Run HA VM on host
2. On host run `iptables -I INPUT 1 -s $engineIP -j DROP`
3. Wait for host to be Non Responsive
4. Wait for the HA timeout to run out and check whether the VM was restarted on another host

Actual results:
The VM stays in Unknown state and the host stays Non Responsive.

Expected results:
The VM should be restarted on another host.

Additional info:
Attaching logs

Comment 1 Omer Frenkel 2014-10-01 13:47:40 UTC
Is PM (power management) configured on the host?
Without PM, this is how it is supposed to work.
You need to configure PM to automatically recover HA VMs,
or manually confirm that the host has been rebooted.
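
The rule described in this comment can be sketched roughly as below. This is an illustration only, not the engine's actual code: `HostState`, its fields, and `may_restart_ha_vms` are hypothetical names made up for this sketch.

```python
from dataclasses import dataclass


@dataclass
class HostState:
    """Hypothetical view of a non-responsive host (illustration only)."""
    pm_configured: bool       # power management (a fencing agent) is set up
    fence_succeeded: bool     # a fencing attempt actually powered the host off or rebooted it
    manually_confirmed: bool  # an admin used "Confirm 'Host has been rebooted'"


def may_restart_ha_vms(host: HostState) -> bool:
    """Return True only when it is certain the host no longer runs its VMs."""
    fenced = host.pm_configured and host.fence_succeeded
    return fenced or host.manually_confirmed
```

With no PM and no manual confirmation, as in the reported scenario, `may_restart_ha_vms(HostState(False, False, False))` is `False`, so the VM is left where it is.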

Comment 3 Jiri Moskovcak 2014-10-10 11:00:46 UTC
Can you please try to make it run longer?

(In reply to Omer Frenkel from comment #1)
> Is PM (power management) configured on the host?
> Without PM, this is how it is supposed to work.
> You need to configure PM to automatically recover HA VMs,
> or manually confirm that the host has been rebooted.

The question here is what should happen when the fencing fails and, more importantly, *when* it should happen. I suspect that in this case Lukas just didn't wait long enough. The engine tried SSH fencing, which failed, then it set the VMs to "Unknown", and for the next 10 minutes the log is just full of failing GetCapabilitiesVDSCommand calls without any attempt to revive the VMs.

Comment 4 Roy Golan 2014-10-13 12:57:15 UTC
(In reply to Jiri Moskovcak from comment #3)
> Can you please try to make it run longer?
>
> (In reply to Omer Frenkel from comment #1)
> > Is PM (power management) configured on the host?
> > Without PM, this is how it is supposed to work.
> > You need to configure PM to automatically recover HA VMs,
> > or manually confirm that the host has been rebooted.
>
> The question here is what should happen when the fencing fails and, more
> importantly, *when* it should happen. I suspect that in this case Lukas just
> didn't wait long enough. The engine tried SSH fencing, which failed, then it
> set the VMs to "Unknown", and for the next 10 minutes the log is just full
> of failing GetCapabilitiesVDSCommand calls without any attempt to revive
> the VMs.

If PM is configured and the real fencing failed, then it should be called again in the next round.

Lukas, any update on the PM configuration?
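
As a rough sketch of "called again in the next round" (an assumption about the intended behaviour, not the engine's real monitoring loop; `fence_until_success`, `try_fence`, and `max_rounds` are hypothetical names):

```python
from typing import Callable, Optional


def fence_until_success(try_fence: Callable[[], bool], max_rounds: int) -> Optional[int]:
    """Retry fencing once per round; return the round that succeeded, or None."""
    for round_no in range(1, max_rounds + 1):
        if try_fence():
            # Fencing worked, so HA VMs could now be restarted elsewhere.
            return round_no
    # Every round failed: the host stays unfenced and its VMs stay untouched.
    return None
```

For example, if fencing fails twice and then succeeds, the function returns 3; if it never succeeds within `max_rounds`, it returns `None`.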

Comment 5 Roy Golan 2014-10-13 12:58:47 UTC
Note: there is no PPC-related issue here, unless we have hit a specific problem with the fencing agent, which doesn't seem to be the case.

Comment 6 Lukas Svaty 2014-10-14 09:49:13 UTC
Retested with this scenario:

There are 2 hosts in the engine, neither of which has PM configured:
p1 - SPM
p2 - non-SPM, running 1 HA VM

1. On host p2, run `iptables -I INPUT 1 -s $engineIP -j DROP`
2. Waited ~18 hours; the VM is still in status Unknown
    and host p2 is still in status 'Non responsive'


I think the VM should be restarted on another host even though fencing failed.

After selecting "Confirm 'Host has been rebooted'" on host p2, the VM is successfully restarted on the other host.

If fencing fails because the connection is blocked for a long time (18 hours), is this really the desired behaviour: leaving the HA VM in Unknown state rather than restarting it on another host?

Comment 7 Lukas Svaty 2014-10-14 10:46:12 UTC
After a conversation with Jiri, this is the desired behaviour: if the engine restarted the VM and the host's connectivity then came back, the same VM would be running on two hosts, which we do not want.

Closing as not a bug.
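
The reasoning behind closing this can be illustrated with a tiny model of the reproduction scenario. It is only a sketch: `vm_copies_once_host_reconnects` is a hypothetical name, and it assumes the VM on the blocked host keeps running the whole time, since the iptables rule only drops traffic coming from the engine.

```python
def vm_copies_once_host_reconnects(restarted_without_fencing: bool) -> int:
    """Count running copies of the HA VM after the blocked host is reachable again."""
    # The DROP rule only cuts the host off from the engine; the host itself
    # stays up, so the original VM is assumed to keep running there.
    copies = 1
    if restarted_without_fencing:
        # A second copy was started on another host while the first was still alive.
        copies += 1
    return copies
```

Restarting without fencing gives 2 copies of the same VM; waiting for fencing or a manual confirmation keeps it at 1.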