Description of problem:
VMs get stuck in "Unknown" when the hypervisor is rebooted and fencing is not working, even if the hypervisor comes up and the engine detects that the VMs are down.

Version-Release number of selected component (if applicable):
rhevm-3.3.0-0.46.el6ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. Create a new DC with just one hypervisor (local storage)
2. Start a VM on it
3. Reboot it

Actual results:
The VM is set to the Unknown state forever.

Expected results:
The VM is temporarily marked as Unknown and later as Down when the hypervisor comes up.

Additional info:
It seems that this is caused by defunct fencing. When fencing fails three times, the VMs are marked as Unknown, but the rerun treatment has already happened because the hypervisor had already come up.

2014-02-12 13:42:46,359 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-63) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host dhcp-1-146.brq.redhat.com.
2014-02-12 13:42:46,531 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-63) vm TestVM running in db and not running in vds - add to rerun treatment. vds rhev-h.example.com
...
2014-02-12 13:43:16,669 INFO [org.ovirt.engine.core.bll.FenceExecutor] (pool-4-thread-47) Attempt 3 to find fence proxy host failed...
2014-02-12 13:43:46,670 ERROR [org.ovirt.engine.core.bll.FenceExecutor] (pool-4-thread-47) Failed to run Power Management command on Host rhev-h.example.com, no running proxy Host was found.
2014-02-12 13:43:46,684 INFO [org.ovirt.engine.core.vdsbroker.SetVmStatusVDSCommand] (pool-4-thread-47) START, SetVmStatusVDSCommand( vmId = 4785f791-c535-4f64-97ef-fbd6a11bf8fd, status = Unknown), log id: 21cd9e52
2014-02-12 13:43:46,687 INFO [org.ovirt.engine.core.vdsbroker.SetVmStatusVDSCommand] (pool-4-thread-47) FINISH, SetVmStatusVDSCommand, log id: 21cd9e52
2014-02-12 13:43:46,724 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-4-thread-47) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM TestVM was set to the Unknown status.
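
The ordering in the log excerpt can be checked mechanically. The following is a minimal Python sketch that uses only timestamps and message fragments already quoted in the log; the abbreviated `LOG` strings and the helper name `parse_ts` are illustrative and are not engine output or engine code.

```python
from datetime import datetime

# Abbreviated copies of three entries from the log excerpt; only the
# timestamp and the key message text are kept.
LOG = [
    "2014-02-12 13:42:46,359 State was set to Up for host",
    "2014-02-12 13:43:46,670 Failed to run Power Management command",
    "2014-02-12 13:43:46,724 VM TestVM was set to the Unknown status",
]


def parse_ts(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' timestamp of a log line."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")


host_up = parse_ts(LOG[0])
vm_unknown = parse_ts(LOG[2])

# Seconds between the host being reported Up and the VM being set to Unknown.
delay = (vm_unknown - host_up).total_seconds()
print(f"VM set to Unknown {delay:.3f}s after the host was reported Up")
```

The VM status is overwritten roughly a minute after the engine had already recorded the host as Up, which is consistent with the failed fencing flow finishing late.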
The fencing operation should be aborted when the host comes up in the meantime. Then the rerun treatment should work properly and not get overwritten by the failed fencing afterwards.
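
The idea of not letting a late fencing failure overwrite the VM status can be sketched as a simple guard. This is a hypothetical Python model, not the engine's code or the actual patch: `Host`, `HostStatus`, `VmStatus` and `on_fencing_failed` are names invented for illustration, and the real engine may track host and VM state quite differently.

```python
from dataclasses import dataclass, field
from enum import Enum


class HostStatus(Enum):
    UP = "Up"
    NON_RESPONSIVE = "NonResponsive"


class VmStatus(Enum):
    UP = "Up"
    DOWN = "Down"
    UNKNOWN = "Unknown"


@dataclass
class Host:
    status: HostStatus
    vms: dict = field(default_factory=dict)  # VM name -> VmStatus


def on_fencing_failed(host):
    """Hypothetical handler called when all fence attempts have failed.

    Returns True if the VMs were moved to Unknown, False if the update
    was skipped because the host is already back up.
    """
    if host.status is HostStatus.UP:
        # The host recovered while fencing was in progress, so the VM
        # statuses it reported (e.g. Down) are kept as they are.
        return False
    for name in host.vms:
        host.vms[name] = VmStatus.UNKNOWN
    return True


# Host came back before fencing gave up: the VM stays Down.
recovered = Host(HostStatus.UP, {"TestVM": VmStatus.DOWN})
print(on_fencing_failed(recovered), recovered.vms["TestVM"].value)

# Host is still unreachable: the VM is marked Unknown.
unreachable = Host(HostStatus.NON_RESPONSIVE, {"TestVM": VmStatus.UP})
print(on_fencing_failed(unreachable), unreachable.vms["TestVM"].value)
```

The sketch only checks the host status at the moment the failure is handled; aborting the fencing flow as soon as the host comes up, as suggested above, would be a stricter variant of the same idea.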
Roman, you should right-click on the host in the Hosts list in the web admin UI and select "Confirm Host has been rebooted". Please recheck with the above.
There's a user experience problem here: the user reboots a host with running VMs, so the VMs go to the Unknown state. There is no indication in the VMs tab (it is only hidden in the events section) that the user should go back to the host level and confirm that the host has been rebooted. Adding the User Experience keyword.
(In reply to Arthur Berezin from comment #3)
> There's a user experience problem here, the user reboot a host with running
> VMs thus VMs are going to unknown state. There's not indication in the VMs
> tab(only hidden in the events section) for the user that he should go back
> to host level and confirm that the has been rebooted.
> Adding User Experience keyword.

I do not think that this is the problem here. The problem is that the hypervisor where the VM was running is already up and the VM is still in the Unknown state. Why would I mark a hypervisor which is up as rebooted?
(In reply to Roman Hodain from comment #4)
> (In reply to Arthur Berezin from comment #3)
> > There's a user experience problem here, the user reboot a host with running
> > VMs thus VMs are going to unknown state. There's not indication in the VMs
> > tab(only hidden in the events section) for the user that he should go back
> > to host level and confirm that the has been rebooted.
> > Adding User Experience keyword.
>
> I do not thing that this is the problem here. The problem is that the
> hypervisor where the VM was running is already up and the VM is still in the
> unknown state. Why would I mark hypervisor which is up as rebooted?

I am just copying from your bug description:

Steps to Reproduce:
1. Create a new DC with just one hypervisor (local storage)
2. Start a VM on it
3. Reboot it

So, you rebooted the host manually, right? If so, please test again: after you reboot the host, also right-click on it and select "Confirm host has been rebooted".

BTW, there is no fencing issue here, since fencing cannot work when there is only one host in the DC (no proxy host available...).
(In reply to Eli Mesika from comment #5)
> (In reply to Roman Hodain from comment #4)
> > (In reply to Arthur Berezin from comment #3)
> > > There's a user experience problem here, the user reboot a host with running
> > > VMs thus VMs are going to unknown state. There's not indication in the VMs
> > > tab(only hidden in the events section) for the user that he should go back
> > > to host level and confirm that the has been rebooted.
> > > Adding User Experience keyword.
> >
> > I do not thing that this is the problem here. The problem is that the
> > hypervisor where the VM was running is already up and the VM is still in the
> > unknown state. Why would I mark hypervisor which is up as rebooted?
>
> I am just copy/past from your bug description :
>
> Steps to Reproduce:
>
> 1. Create a new DC with just one hyperviosr (local storage)
> 2. Start a VM on it
> 3. Reboot it
>
> So, you had rebooted the Host manually right? If so , please test again
> while after you reboot the host you also right click on it as "Confirm host
> has been rebooted"
>
> BTW there is no fencing issue here since fencing can not work when there is
> only one Host in the DC (no proxy host available...)

Hi,

I have tested your suggestion, but this is not possible. At the time when the VM is in the Unknown state, the hypervisor is already up:

Error while executing action: Cannot confirm 'Host has been rebooted' Host. Valid Host statuses are "Non operational", "Maintenance" or "Connecting".

Let me repeat what happens:
- VM is up
- Host is up
- Host is down
- Fencing is triggered
- Fencing in progress (not working)
- Hypervisor is up
- VM is marked as down
- Fencing failed
- VM is marked as in Unknown state
- Mark the hypervisor as rebooted (not possible)

I still think that this is a fencing issue. Fencing is triggered, and if it fails it marks the VMs as Unknown even if they have already been marked as down by the hypervisor, which is already up.
It is not related only to local storage, but also to any case where fencing is not working.

Roman
(In reply to Arthur Berezin from comment #3)
> There's a user experience problem here, the user reboot a host with running
> VMs thus VMs are going to unknown state. There's not indication in the VMs
> tab(only hidden in the events section) for the user that he should go back
> to host level and confirm that the has been rebooted.
> Adding User Experience keyword.

Is this what this bug is about? I see that this BZ is in POST, so the problem reported here was solved; what you are saying is that we have a user-experience problem that, if I understand correctly, should be tracked separately from this issue. If so, please open a separate RFE for that. For now I removed the UserExperience keyword from this BZ.
My hunch is that this should be solved via a notification center or something similar that we can plan for 4.0; it is definitely not 3.5 material.
Thanks.
(In reply to Einav Cohen from comment #9)
> (In reply to Arthur Berezin from comment #3)
> > There's a user experience problem here, the user reboot a host with running
> > VMs thus VMs are going to unknown state. There's not indication in the VMs
> > tab(only hidden in the events section) for the user that he should go back
> > to host level and confirm that the has been rebooted.
> > Adding User Experience keyword.
>
> is this what this bug is about? I see that this BZ is in POST, so the
> problem reported here was solved; what you are saying is that we have a
> user-experience problem that, if I understand correctly, should be tracked
> separately from this issue. if so - please open a separate RFE for that. For
> now I removed the UserExperience keyword from this BZ.
> My hunch is that this should be solved via a notification-center or
> something similar that we can plan for 4.0, definitely not 3.5 material.
> thanks.

There are two issues here. The first is fixed by Eli's patch: VMs are marked as Unknown after the host was rebooted and fencing failed. The other is that there is no "call for action" in the VMs tab when the user is expected to manually confirm that a host was rebooted. I'll open a separate RFE for the second issue.
Verified with:
ovirt-engine-3.5.0-0.0.master.20140821064931.gitb794d66.el6.noarch
vdsm-4.16.2-1.gite8cba75.el6.x86_64

1. Single host in datacenter is up (host has no power management configured).
2. Create VM.
3. VM is up.
4. Manually reboot the host.
5. Host state: connecting.
6. Fencing failed for SPM host in DC, setting DC to non-operational.
7. Host state: non-responsive.
8. VM state: unknown.
9. Host up.
10. VM down.
11. Host is contending for SPM.
12. DC up, host is SPM.
RHEV 3.5.0 was released. Closing.