Description of problem:
Guest VMs do not fail over if the host that goes down is the SPM. HA works as expected if the host is NOT the SPM. It seems that, as part of handling the host going down, the engine attempts to launch the VMs, but all attempts fail because there is no active SPM in the cluster.

Version-Release number of selected component (if applicable):
rhevm 4.1.5

How reproducible:
Always

Steps to Reproduce:
1. Create a 3-node cluster with 1 storage domain and 1 ISO domain.
2. Note which host is the SPM and have a guest VM running on that host.
3. Panic the SPM host by running 'echo c > /proc/sysrq-trigger'.
4. Wait for HA failover to finish.

Actual results:
- HA attempts to launch the VMs but fails; eventually it exhausts all VM launch attempts and the VMs remain in the Down state.
- No other host becomes SPM.

Expected results:
- Another host should become SPM.
- The VMs should fail over and start on another host.

Additional info:
Here is the log snippet from engine.log:

2017-09-12 15:07:16,650-04 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (DefaultQuartzScheduler4) [1cc4342c] SPM Init: could not find reported vds or not up - pool: 'Default' vds_spm_id: '2'
2017-09-12 15:07:16,667-04 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (DefaultQuartzScheduler4) [1cc4342c] SPM selection - vds seems as spm 'kvm153.int.maxta.com'
2017-09-12 15:07:16,684-04 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (DefaultQuartzScheduler4) [1cc4342c] START, SpmStopVDSCommand(HostName = kvm153.int.maxta.com, SpmStopVDSCommandParameters:{runAsync='true', hostId='5e261fa6-b718-4e03-b675-5291e1b3b67a', storagePoolId='1eab8081-43dc-40ba-8bc4-e4b3ade2ee41'}), log id: 1c9837c7
2017-09-12 15:07:16,684-04 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (DefaultQuartzScheduler4) [1cc4342c] SpmStopVDSCommand:: vds 'kvm153.int.maxta.com' is in 'Reboot' status - not performing spm stop, pool id '1eab8081-43dc-40ba-8bc4-e4b3ade2ee41'
2017-09-12 15:07:16,684-04 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (DefaultQuartzScheduler4) [1cc4342c] FINISH, SpmStopVDSCommand, log id: 1c9837c7
2017-09-12 15:07:16,684-04 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (DefaultQuartzScheduler4) [1cc4342c] spm stop on spm failed, stopping spm selection!
2017-09-12 15:07:22,909-04 WARN [org.ovirt.engine.core.bll.RunVmCommand] (DefaultQuartzScheduler2) [46211442] Validation of action 'RunVm' failed for user SYSTEM. Reasons: VAR__ACTION__RUN,VAR__TYPE__VM,VM_CANNOT_RUN_FROM_CD_WITHOUT_ACTIVE_STORAGE_DOMAIN_ISO
2017-09-12 15:07:22,910-04 INFO [org.ovirt.engine.core.bll.RunVmCommand] (DefaultQuartzScheduler2) [46211442] Lock freed to object 'EngineLock:{exclusiveLocks='[1660f092-4c67-4a00-b1b8-ff3e9cc854f9=VM]', sharedLocks=''}'
2017-09-12 15:07:22,920-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler2) [46211442] EVENT_ID: HA_VM_RESTART_FAILED(9,603), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Restart of the Highly Available VM ub153 failed.
2017-09-12 15:07:22,946-04 WARN [org.ovirt.engine.core.bll.RunVmCommand] (DefaultQuartzScheduler2) [2cfbc9b1] Validation of action 'RunVm' failed for user SYSTEM. Reasons: VAR__ACTION__RUN,VAR__TYPE__VM,VM_CANNOT_RUN_FROM_CD_WITHOUT_ACTIVE_STORAGE_DOMAIN_ISO
2017-09-12 15:07:22,947-04 INFO [org.ovirt.engine.core.bll.RunVmCommand] (DefaultQuartzScheduler2) [2cfbc9b1] Lock freed to object 'EngineLock:{exclusiveLocks='[42b272e9-bef1-4032-9649-24eaaa60da7e=VM]', sharedLocks=''}'
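
For triage, the validation failures in a full engine.log can be tallied with a short script. This is only a minimal sketch, assuming every entry follows the layout of the snippet in this report (timestamp, level, bracketed logger class, thread in parentheses, bracketed correlation ID, message). The function name `summarize_validation_failures` and both regular expressions are illustrative and not part of oVirt. Skipping the `VAR__` entries assumes they are message-formatting variables rather than actual failure reasons.

```python
import re
from collections import Counter

# Assumed line layout, inferred from the engine.log snippet in this report:
# <timestamp> <LEVEL> [<logger>] (<thread>) [<correlation id>] <message>
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}[+-]\d{2})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<logger>[^\]]+)\]\s+"
    r"\((?P<thread>[^)]+)\)\s+"
    r"\[(?P<corr>[^\]]*)\]\s+"
    r"(?P<msg>.*)$"
)

# Matches messages such as:
# Validation of action 'RunVm' failed for user SYSTEM. Reasons: A,B,C
REASONS_RE = re.compile(
    r"Validation of action '(?P<action>\w+)' failed.*Reasons: (?P<reasons>\S+)"
)


def summarize_validation_failures(lines):
    """Count (action, reason) pairs across validation-failure log lines.

    Lines that do not match the assumed layout are ignored. Entries that
    start with 'VAR__' are skipped on the assumption that they are
    message variables, not failure reasons.
    """
    counts = Counter()
    for line in lines:
        entry = LINE_RE.match(line.strip())
        if entry is None:
            continue
        failure = REASONS_RE.search(entry.group("msg"))
        if failure is None:
            continue
        for reason in failure.group("reasons").split(","):
            if not reason.startswith("VAR__"):
                counts[(failure.group("action"), reason)] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python summarize.py /path/to/engine.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for (action, reason), n in summarize_validation_failures(handle).most_common():
            print(f"{n:5d}  {action}  {reason}")
```

Run against the snippet in this report, the only reason left after filtering is VM_CANNOT_RUN_FROM_CD_WITHOUT_ACTIVE_STORAGE_DOMAIN_ISO, which points at the ISO domain being inactive when the HA restart is attempted.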
Could you please attach the full logs? Also, the bug is filed against hosted-engine HA, so is this a normal installation or a hosted engine?
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days