Bug 1807937 - [4.4] run-once with wrong configuration failed to start VM but then starts the VM with its persistent configuration
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Backend.Core
Version: 4.4.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.4.1
Target Release: 4.4.1.5
Assignee: Liran Rotenberg
QA Contact: Beni Pelled
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-02-27 14:17 UTC by Beni Pelled
Modified: 2020-08-17 06:26 UTC
CC: 8 users

Fixed In Version: ovirt-engine-4.4.1.5
Doc Type: Bug Fix
Doc Text:
Previously, if running a virtual machine with its Run Once configuration failed, the RHV Manager would try to run the virtual machine with its standard configuration on a different host. The current release fixes this issue. Now, if Run Once fails, the RHV Manager tries to run the virtual machine with its Run Once configuration on a different host.
Clone Of:
Environment:
Last Closed: 2020-07-08 08:26:32 UTC
oVirt Team: Virt
Embargoed:
pm-rhel: ovirt-4.4+
ahadas: devel_ack+


Attachments: none


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 109948 0 master MERGED core: persist VM configuration on rerun 2020-10-29 11:33:43 UTC

Description Beni Pelled 2020-02-27 14:17:47 UTC
Description of problem:
Run-once fails to start a VM with a wrong configuration (as expected) but then starts the VM with its persistent configuration instead of leaving the VM off.

Version-Release number of selected component (if applicable):
RHV 4.4.0-0.20.master.el7

How reproducible:
100%

Steps to Reproduce:
1. Create a VM.
2. Start the VM via run-once with 'wrong_cpu_type' as the CPU type (System > Custom CPU), or with a wrong 'Emulated Machine' (a hedged SDK sketch of this step follows below).
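
A hedged sketch of the same reproduction via the oVirt Java SDK (ovirt-engine-sdk-java, API v4); the engine URL, credentials, VM name, and the use of customCpuModel() are assumptions for illustration, not the reporter's actual steps:

import org.ovirt.engine.sdk4.Connection;
import org.ovirt.engine.sdk4.ConnectionBuilder;
import org.ovirt.engine.sdk4.services.VmService;
import org.ovirt.engine.sdk4.services.VmsService;
import org.ovirt.engine.sdk4.types.Vm;

import static org.ovirt.engine.sdk4.builders.Builders.vm;

public class RunOnceWithBadCpu {
    public static void main(String[] args) throws Exception {
        // Connect to a test engine; URL and credentials are placeholders.
        Connection conn = ConnectionBuilder.connection()
                .url("https://engine.example.com/ovirt-engine/api")
                .user("admin@internal")
                .password("password")
                .insecure(true) // lab setup only; use a CA file in real environments
                .build();

        VmsService vmsService = conn.systemService().vmsService();
        Vm vm = vmsService.list().search("name=test_vm_cpu_type").send().vms().get(0);
        VmService vmService = vmsService.vmService(vm.id());

        // Run-once equivalent: the start action carries a one-off VM
        // configuration that overrides the persistent one for this run only.
        // The custom CPU model here is deliberately invalid.
        vmService.start()
                .vm(vm().customCpuModel("wrong_cpu_type"))
                .send();

        conn.close();
    }
}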

Actual results:
The VM comes up and runs without the custom run-once configuration.

Expected results:
The VM should stay down and the start should fail.

Additional info:
engine log:

2020-02-27 15:14:07,139+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-13) [7273c1c0] VM '9a6e6501-7ac7-4362-a804-cfd0187e0c56' was reported as Down on VDS 'ddc917e6-2c1b-46d8-8535-c6941aae7c5c'(host_mixed_1)
2020-02-27 15:14:07,141+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-13) [7273c1c0] START, DestroyVDSCommand(HostName = host_mixed_1, DestroyVmVDSCommandParameters:{hostId='ddc917e6-2c1b-46d8-8535-c6941aae7c5c', vmId='9a6e6501-7ac7-4362-a804-cfd0187e0c56', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='true'}), log id: 397e3f
2020-02-27 15:14:07,457+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-13) [7273c1c0] FINISH, DestroyVDSCommand, return: , log id: 397e3f
2020-02-27 15:14:07,457+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-13) [7273c1c0] VM '9a6e6501-7ac7-4362-a804-cfd0187e0c56'(test_vm_cpu_type) moved from 'WaitForLaunch' --> 'Down'
2020-02-27 15:14:07,487+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [7273c1c0] EVENT_ID: VM_DOWN_ERROR(119), VM test_vm_cpu_type is down with error. Exit message: internal error: Unknown CPU model wrong_cpu_type.
2020-02-27 15:14:07,489+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-13) [7273c1c0] add VM '9a6e6501-7ac7-4362-a804-cfd0187e0c56'(test_vm_cpu_type) to rerun treatment
2020-02-27 15:14:07,505+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring] (ForkJoinPool-1-worker-13) [7273c1c0] Rerun VM '9a6e6501-7ac7-4362-a804-cfd0187e0c56'. Called from VDS 'host_mixed_1'
2020-02-27 15:14:07,539+02 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-32718) [7273c1c0] EVENT_ID: USER_INITIATED_RUN_VM_FAILED(151), Failed to run VM test_vm_cpu_type on Host host_mixed_1.
2020-02-27 15:14:07,556+02 INFO  [org.ovirt.engine.core.bll.RunVmOnceCommand] (EE-ManagedThreadFactory-engine-Thread-32718) [7273c1c0] Lock Acquired to object 'EngineLock:{exclusiveLocks='[9a6e6501-7ac7-4362-a804-cfd0187e0c56=VM]', sharedLocks=''}'
2020-02-27 15:14:07,598+02 INFO  [org.ovirt.engine.core.vdsbroker.IsVmDuringInitiatingVDSCommand] (EE-ManagedThreadFactory-engine-Thread-32718) [7273c1c0] START, IsVmDuringInitiatingVDSCommand( IsVmDuringInitiatingVDSCommandParameters:{vmId='9a6e6501-7ac7-4362-a804-cfd0187e0c56'}), log id: 95fd84b
2020-02-27 15:14:07,600+02 INFO  [org.ovirt.engine.core.vdsbroker.IsVmDuringInitiatingVDSCommand] (EE-ManagedThreadFactory-engine-Thread-32718) [7273c1c0] FINISH, IsVmDuringInitiatingVDSCommand, return: false, log id: 95fd84b
2020-02-27 15:14:07,653+02 INFO  [org.ovirt.engine.core.bll.RunVmOnceCommand] (EE-ManagedThreadFactory-engine-Thread-32718) [7273c1c0] Running command: RunVmOnceCommand internal: false. Entities affected :  ID: 9a6e6501-7ac7-4362-a804-cfd0187e0c56 Type: VMAction group RUN_VM with role type USER
2020-02-27 15:14:07,730+02 INFO  [org.ovirt.engine.core.vdsbroker.UpdateVmDynamicDataVDSCommand] (EE-ManagedThreadFactory-engine-Thread-32718) [7273c1c0] START, UpdateVmDynamicDataVDSCommand( UpdateVmDynamicDataVDSCommandParameters:{hostId='null', vmId='9a6e6501-7ac7-4362-a804-cfd0187e0c56', vmDynamic='org.ovirt.engine.core.common.businessentities.VmDynamic@220031bf'}), log id: 4d65a77d
2020-02-27 15:14:07,737+02 INFO  [org.ovirt.engine.core.vdsbroker.UpdateVmDynamicDataVDSCommand] (EE-ManagedThreadFactory-engine-Thread-32718) [7273c1c0] FINISH, UpdateVmDynamicDataVDSCommand, return: , log id: 4d65a77d
2020-02-27 15:14:07,740+02 INFO  [org.ovirt.engine.core.vdsbroker.CreateVDSCommand] (EE-ManagedThreadFactory-engine-Thread-32718) [7273c1c0] START, CreateVDSCommand( CreateVDSCommandParameters:{hostId='adf876e1-8f92-46c9-9191-ad49c6034f59', vmId='9a6e6501-7ac7-4362-a804-cfd0187e0c56', vm='VM [test_vm_cpu_type]'}), log id: 2ed250bf
2020-02-27 15:14:07,742+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateBrokerVDSCommand] (EE-ManagedThreadFactory-engine-Thread-32718) [7273c1c0] START, CreateBrokerVDSCommand(HostName = host_mixed_2, CreateVDSCommandParameters:{hostId='adf876e1-8f92-46c9-9191-ad49c6034f59', vmId='9a6e6501-7ac7-4362-a804-cfd0187e0c56', vm='VM [test_vm_cpu_type]'}), log id: 729392b0

Comment 1 Ryan Barry 2020-02-28 00:22:28 UTC
I suppose we shouldn't revert invalid configurations and should instead send an audit message

Comment 2 Steven Rosenberg 2020-03-04 16:53:48 UTC
The restart of the VM on run-once happens because the scheduler attempts to relaunch the VM on a different host.

One can see this in the logs pasted above: the host IDs differ between the failed run and the second launch of the VM.

It also seems to use the default configuration, on the assumption that the failure on the first host was caused by the run-once configuration.

Assuming the user does want to launch the VM somewhere, restarting it on another host after it fails on one host may make sense. So this is more of a design issue.

The more elegant solution would be to categorize the types of failures: if it is a configuration failure, do not attempt to restart the VM at all; whereas if it is a problem with, say, lack of resources or a faulty host, then attempt to run the VM on another host, but with the run-once configuration.

This, of course, would require new development work for categorizing errors (see the illustrative sketch below).
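
A purely illustrative sketch of that categorization idea, in the spirit of the engine's Java code; none of these names exist in ovirt-engine and this is not a proposed patch:

// Hypothetical failure categories for a failed VM start.
enum RunFailureKind {
    INVALID_CONFIGURATION, // e.g. unknown CPU model or bad emulated machine
    HOST_RESOURCES,        // not enough memory/CPU on the selected host
    HOST_FAULT             // host-side error unrelated to the VM definition
}

final class RerunPolicySketch {
    // Decide whether a failed run should be retried on another host.
    static boolean shouldRerun(RunFailureKind kind) {
        switch (kind) {
            case INVALID_CONFIGURATION:
                return false; // the same configuration will fail everywhere; leave the VM down
            case HOST_RESOURCES:
            case HOST_FAULT:
                return true;  // retry on another host, keeping the run-once configuration
            default:
                return false;
        }
    }
}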

The current implementation that this issue complains about seems to rely on simple assumptions, and it is not clear whether we want to prevent relaunching a VM that failed on one host just for run-once, or whether we want the VM to fail on every host in the cluster due to a faulty run-once configuration.

The question, then, is whether, as this request suggests, we should change the run-once behavior and simply leave the VM off.

The real issue seems to be that the CPU and emulated machine type controls are combo boxes in the Run Once view. This allows the user to enter invalid text into the text part of the drop-down combo box. Not only is this inconsistent with the CPU and emulated machine type controls in other views, which are plain drop-downs, but the user should not be able to enter any data other than what appears in the drop-down box.

Please advise on both the issue of run-once relaunching the VM and the control issues, though I am sure the latter at least should be changed.

Comment 3 Michal Skrivanek 2020-03-05 07:36:39 UTC
Normally this is caught by the scheduler and is not a problem.

This is a special case of the user entering an invalid configuration (which we allow; that's intentional), which then follows the "normal" Run Once flow: after a shutdown or crash the configuration reverts.

I guess it could be fixed to better handle the initial rerun attempts, but overall it's not very important/interesting.


Also, don't file this as an RHV bug; there's nothing RHV-specific here...

Comment 4 Arik 2020-06-05 10:11:27 UTC
So the problem here is a bit different - it's not that the run-once configuration is being reverted.

Assuming a VM is configured with console=X and emulated-machine=Y but initiated with console=X' and emulated-machine=Y' via run-once, the first execution of the command would try to run the VM with X' and Y', but rerun attempts would try to run it with console=*X'* and emulated-machine=*Y*.

This is caused by a conceptual flaw within the engine:
Generally, the command's 'init' method is called, then 'validate' and then 'execute'.
But on rerun attempts the 'init' method is not called (the command is already instantiated). Instead, we clear the cached VM (to trigger querying it again from the database) and then proceed to the 'validate' and 'execute' phases. So things that are set in the 'execute'/'validate' phases of run-vm, like the graphics, will be re-initialized on reruns, but things that are initialized in the 'init' method won't be re-initialized (and would default to their settings within the 'persistent' VM configuration). The custom emulated machine is initialized within the 'init' method [1].

This may also affect regular run-vm (not run-once) flows, e.g., payload & vm-init & boot sequence are initialized in the 'init' method of RunVm.

We can't move all the initialization to the 'execute' phase because we want those settings to be validated.
So I think the right way to fix it would be by calling the 'init' method explicitly on rerun flows[2].

[1] https://github.com/oVirt/ovirt-engine/blob/ovirt-engine-4.4.0/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/RunVmOnceCommand.java#L60-L79
[2] It wouldn't make any difference for MigrateVm, which doesn't do anything in its 'init' method, and rerun is disabled for MigrateVmToServer.
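
A minimal sketch of the fix direction described above, assuming a simplified command lifecycle; the class and method names below are illustrative and do not mirror the actual RunVmCommand/RunVmOnceCommand code:

// Simplified model of the engine's command lifecycle described above.
abstract class RunVmCommandSketch {

    // Populates command state from the requested (run-once) configuration,
    // e.g. the custom CPU model and emulated machine.
    protected abstract void init();

    // Checks that the requested configuration can actually be executed.
    protected abstract boolean validate();

    // Sends the create-VM request to the selected host.
    protected abstract void execute();

    // Re-reads the VM from the database, dropping the cached copy.
    protected abstract void clearCachedVm();

    // First execution: the full init -> validate -> execute chain.
    final void run() {
        init();
        if (validate()) {
            execute();
        }
    }

    // Rerun after a host reported the VM as Down. Before the fix, only the
    // cached VM was cleared and validate/execute were repeated, so fields set
    // in init() (such as the custom emulated machine) fell back to the
    // persistent VM configuration. Calling init() again keeps the run-once
    // values for the attempt on the next host.
    final void rerun() {
        clearCachedVm();
        init(); // re-apply the run-once parameters before retrying
        if (validate()) {
            execute();
        }
    }
}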

Comment 5 Beni Pelled 2020-07-05 11:18:27 UTC
Verified with:
- ovirt-engine-4.4.1.7-0.3.el8ev.noarch

Verification steps:
1. Create a VM.
2. Start the VM by run-once with 'wrong_cpu_type' as CPU type 'system > Custom CPU'.

Result:
- The engine tried to start the VM on each host and failed with "Exit message: internal error: Unknown CPU model wrong_cpu_type."

Comment 6 Sandro Bonazzola 2020-07-08 08:26:32 UTC
This bugzilla is included in the oVirt 4.4.1 release, published on July 8th 2020.

Since the problem described in this bug report should be resolved in the oVirt 4.4.1 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

