Bug 1148397
| Summary: | HA VM is not restarted on another host after host running VM is set to Nonresponsive on ppc64 arch | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Lukas Svaty <lsvaty> | ||||
| Component: | ovirt-engine | Assignee: | Jiri Moskovcak <jmoskovc> | ||||
| Status: | CLOSED NOTABUG | QA Contact: | Lukas Svaty <lsvaty> | ||||
| Severity: | urgent | Docs Contact: | |||||
| Priority: | medium | ||||||
| Version: | 3.4.1-1 | CC: | dfediuck, ecohen, gklein, iheim, lpeer, lsurette, lsvaty, michal.skrivanek, ofrenkel, rbalakri, rgolan, Rhev-m-bugs, sherold, yeylon | ||||
| Target Milestone: | --- | ||||||
| Target Release: | 3.4.3 | ||||||
| Hardware: | ppc64 | ||||||
| OS: | Linux | ||||||
| Whiteboard: | sla | ||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2014-10-14 10:46:12 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | SLA | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | |||||||
| Bug Blocks: | 1122979 | ||||||
| Attachments: |
|
||||||
Is PM configured on the host? Without PM this is how it is supposed to work: you need to configure PM to automatically recover HA VMs, or manually confirm that the host has been rebooted.

Can you please try to make it run longer?

(In reply to Omer Frenkel from comment #1)
> is PM configured on the host?
> without PM this is how it suppose to work..
> you need to configure PM for automatically recover HA vms,
> or approve the host has been rebooted manually.

The question here is what should happen when the fencing fails and, more importantly, *when* it should happen. I suspect that in this case Lukas just didn't wait long enough. The engine tried ssh fencing, which failed, then it set the VMs to "Unknown", and for the next 10 minutes the log is just full of failing GetCapabilitiesVDSCommand calls without any attempt to revive the VMs.

(In reply to Jiri Moskovcak from comment #3)
> The question here is what should happen when the fencing fails. And more
> importantly *when* it should happen. I suspect that in this case Lukas just
> didn't wait enough. The engine tried ssh fencing which fails, then it sets
> the VMs to "Uknown" and next 10minutes it's just full of failing
> GetCapabilitiesVDSCommand calls without any attempt to revive the VMs.

If PM is configured and the real fencing failed, then it should be called again in the next round. Lukas, any update on the PM configuration?

Note: there is no PPC-related issue here unless we have a specific problem with the fencing agent, which doesn't seem to be the case.

Retested with this scenario:
There are 2 hosts in the engine, neither of them with PM configured:
p1 - SPM
p2 - non-SPM, running 1 HA VM
1. On host p2, run `iptables -I INPUT 1 -s $engineIP -j DROP`
2. Waited ~18 hours; the VM is still in status Unknown
p2 host status: 'Non responsive'
I think the VM should be restarted on another host even though fencing failed.
After selecting "Confirm 'Host has been rebooted'" on host p2, the VM is successfully restarted on the other host.
When fencing fails because the connection is blocked for a long time (18 hours), is this really the desired behaviour: leaving the HA VM in Unknown state rather than restarting it on another host?
After a conversation with Jiri: this is the desired behaviour. If we restarted the VM and the host's connectivity then came back, the same VM would be running on two hosts, which we do not want. Closing as not a bug. |
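The rule the thread converges on can be sketched as a small decision function. This is a simplified illustrative model, not ovirt-engine code: the names `Host`, `HostStatus`, and `may_restart_ha_vm_elsewhere` are hypothetical, and the real engine logic involves more states and retries.

```python
from dataclasses import dataclass
from enum import Enum


class HostStatus(Enum):
    UP = "up"
    NON_RESPONSIVE = "non_responsive"


@dataclass
class Host:
    # All fields are hypothetical stand-ins for state the engine tracks.
    status: HostStatus
    pm_configured: bool = False                # power management set up for this host
    fence_succeeded: bool = False              # a PM fence attempt actually powered it off
    manually_confirmed_rebooted: bool = False  # admin chose "Confirm 'Host has been rebooted'"


def may_restart_ha_vm_elsewhere(host: Host) -> bool:
    """Return True only when the VM provably no longer runs on `host`."""
    if host.status is HostStatus.UP:
        return False  # the VM is still running where it is
    if host.pm_configured and host.fence_succeeded:
        return True   # host was powered off, so the VM cannot still be running there
    # Without a successful fence, only a manual confirmation is safe; the host
    # may merely be unreachable from the engine, as in the iptables scenario.
    return host.manually_confirmed_rebooted


# Scenario from this bug: engine traffic is dropped and no PM is configured.
blocked = Host(status=HostStatus.NON_RESPONSIVE)
print(may_restart_ha_vm_elsewhere(blocked))  # False: the VM stays in Unknown

blocked.manually_confirmed_rebooted = True
print(may_restart_ha_vm_elsewhere(blocked))  # True: the VM can start on another host
```

The point of the model is that elapsed time alone never makes the restart safe: an unreachable host may still be running the VM, so waiting 18 hours changes nothing.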
Created attachment 943019
engine, vdsm logs

Description of problem:
When a host running an HA VM is disconnected, the VM should be restarted on another working host after some time.

Version-Release number of selected component (if applicable):
av12_ppc

How reproducible:
100%

Steps to Reproduce:
1. Run an HA VM on a host
2. On the host, run `iptables -I INPUT 1 -s $engineIP -j DROP`
3. Wait for the host to become Non Responsive
4. Wait for the HA timeout to run out and check whether the VM was restarted on another host

Actual results:
The VM stays in Unknown state and the host stays Non Responsive.

Expected results:
The VM should be restarted on another host.

Additional info:
Attaching logs