Description of problem:
After upgrading from 4.1 to 4.2.3 (on rhevm 3), VMs came up with their NICs unplugged and their disks inactive as well. This looks like the problem that was solved in BZ 1542117.

Version-Release number of selected component (if applicable):
Engine: 4.2.3.2-0.1.el7
Host:
OS Version: RHEL - 7.5 - 8.el7
Kernel Version: 3.10.0 - 862.el7.x86_64
KVM Version: 2.10.0 - 21.el7_5.2
LIBVIRT Version: libvirt-3.9.0-14.el7_5.3
VDSM Version: vdsm-4.20.26-1.el7ev

How reproducible:
VMs in QA production cluster - 100%

Steps to Reproduce:
1. Upgrade from 4.1 to 4.2

Will attach logs later (I don't have access to the resources at the moment).
logs from engine and hosts: https://drive.google.com/open?id=1tjRwiS5bpcV7FzzZFEnSIykFKtR_DOya
We see two errors on different VMs:

Vdsm log:
2018-04-22 11:21:24,465+0300 ERROR (periodic/5) [virt.periodic.Operation] <vdsm.virt.sampling.VMBulkstatsMonitor object at 0x7f646808a650> operation failed (periodic:222)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 220, in __call__
    self._func()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 522, in __call__
    self._send_metrics()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 538, in _send_metrics
    vm_sample.interval)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmstats.py", line 45, in produce
    networks(vm, stats, first_sample, last_sample, interval)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmstats.py", line 331, in networks
    if nic.name not in first_indexes or nic.name not in last_indexes:
AttributeError: name

NPE in the engine log:
2018-04-22 09:20:40,469+03 INFO [org.ovirt.engine.core.vdsbroker.CreateVDSCommand] (EE-ManagedThreadFactory-engine-Thread-26) [78cea825-3c6e-47fa-a54b-473438add222] START, CreateVDSCommand( CreateVDSCommandParameters:{hostId='0580a848-a460-4e91-b2b3-d6d98c54935e', vmId='99f42442-3851-4fdb-b79e-b025059deac1', vm='VM [compute_nested_host_setup]'}), log id: 1dcd8d55
2018-04-22 09:20:40,471+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateBrokerVDSCommand] (EE-ManagedThreadFactory-engine-Thread-26) [78cea825-3c6e-47fa-a54b-473438add222] START, CreateBrokerVDSCommand(HostName = tigris04, CreateVDSCommandParameters:{hostId='0580a848-a460-4e91-b2b3-d6d98c54935e', vmId='99f42442-3851-4fdb-b79e-b025059deac1', vm='VM [compute_nested_host_setup]'}), log id: 14bc8b6e
2018-04-22 09:20:40,479+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateBrokerVDSCommand] (EE-ManagedThreadFactory-engine-Thread-26) [78cea825-3c6e-47fa-a54b-473438add222] Failed in 'CreateBrokerVDS' method, for vds: 'tigris04'; host: 'tigris04.scl.lab.tlv.redhat.com': null
2018-04-22 09:20:40,479+03 ERROR
[org.ovirt.engine.core.vdsbroker.vdsbroker.CreateBrokerVDSCommand] (EE-ManagedThreadFactory-engine-Thread-26) [78cea825-3c6e-47fa-a54b-473438add222] Command 'CreateBrokerVDSCommand(HostName = tigris04, CreateVDSCommandParameters:{hostId='0580a848-a460-4e91-b2b3-d6d98c54935e', vmId='99f42442-3851-4fdb-b79e-b025059deac1', vm='VM [compute_nested_host_setup]'})' execution failed: null
2018-04-22 09:20:40,479+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateBrokerVDSCommand] (EE-ManagedThreadFactory-engine-Thread-26) [78cea825-3c6e-47fa-a54b-473438add222] FINISH, CreateBrokerVDSCommand, log id: 14bc8b6e
2018-04-22 09:20:40,479+03 ERROR [org.ovirt.engine.core.vdsbroker.CreateVDSCommand] (EE-ManagedThreadFactory-engine-Thread-26) [78cea825-3c6e-47fa-a54b-473438add222] Failed to create VM: java.lang.NullPointerException
    at org.ovirt.engine.core.vdsbroker.builder.vminfo.LibvirtVmXmlBuilder.writeInterface(LibvirtVmXmlBuilder.java:2045) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.builder.vminfo.LibvirtVmXmlBuilder.lambda$writeInterfaces$24(LibvirtVmXmlBuilder.java:1096) [vdsbroker.jar:]
    at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) [rt.jar:1.8.0_161]
    at java.util.stream.SortedOps$SizedRefSortingSink.end(SortedOps.java:352) [rt.jar:1.8.0_161]
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) [rt.jar:1.8.0_161]
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) [rt.jar:1.8.0_161]
    at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) [rt.jar:1.8.0_161]
    at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) [rt.jar:1.8.0_161]
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) [rt.jar:1.8.0_161]
    at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) [rt.jar:1.8.0_161]
    at org.ovirt.engine.core.vdsbroker.builder.vminfo.LibvirtVmXmlBuilder.writeInterfaces(LibvirtVmXmlBuilder.java:1096) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.builder.vminfo.LibvirtVmXmlBuilder.writeDevices(LibvirtVmXmlBuilder.java:967) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.builder.vminfo.LibvirtVmXmlBuilder.buildCreateVm(LibvirtVmXmlBuilder.java:236) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.vdsbroker.CreateBrokerVDSCommand.generateDomainXml(CreateBrokerVDSCommand.java:93) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.vdsbroker.CreateBrokerVDSCommand.createInfo(CreateBrokerVDSCommand.java:50) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.vdsbroker.CreateBrokerVDSCommand.executeVdsBrokerCommand(CreateBrokerVDSCommand.java:42) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:112) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:73) [vdsbroker.jar:]
    at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) [dal.jar:]
    at org.ovirt.engine.core.vdsbroker.vdsbroker.DefaultVdsCommandExecutor.execute(DefaultVdsCommandExecutor.java:14) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:398) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.ResourceManager$Proxy$_$$_WeldSubclass.runVdsCommand(Unknown Source) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.CreateVDSCommand.executeVmCommand(CreateVDSCommand.java:37) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.ManagingVmCommand.executeVDSCommand(ManagingVmCommand.java:17) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:73) [vdsbroker.jar:]
    at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) [dal.jar:]
    at org.ovirt.engine.core.vdsbroker.vdsbroker.DefaultVdsCommandExecutor.execute(DefaultVdsCommandExecutor.java:14) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:398) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.ResourceManager$Proxy$_$$_WeldSubclass.runVdsCommand$$super(Unknown Source) [vdsbroker.jar:]
    at sun.reflect.GeneratedMethodAccessor84.invoke(Unknown Source) [:1.8.0_161]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_161]
    at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_161]
    at org.jboss.weld.interceptor.proxy.TerminalAroundInvokeInvocationContext.proceedInternal(TerminalAroundInvokeInvocationContext.java:51) [weld-core-impl.jar:2.4.7.Final-redhat-1]
    at org.jboss.weld.interceptor.proxy.AroundInvokeInvocationContext.proceed(AroundInvokeInvocationContext.java:79) [weld-core-impl.jar:2.4.7.Final-redhat-1]
    at org.ovirt.engine.core.common.di.interceptor.LoggingInterceptor.apply(LoggingInterceptor.java:12) [common.jar:]
    at sun.reflect.GeneratedMethodAccessor76.invoke(Unknown Source) [:1.8.0_161]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_161]
    at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_161]
    at org.jboss.weld.interceptor.reader.SimpleInterceptorInvocation$SimpleMethodInvocation.invoke(SimpleInterceptorInvocation.java:73) [weld-core-impl.jar:2.4.7.Final-redhat-1]
    at org.jboss.weld.interceptor.proxy.InterceptorMethodHandler.executeAroundInvoke(InterceptorMethodHandler.java:85) [weld-core-impl.jar:2.4.7.Final-redhat-1]
    at org.jboss.weld.interceptor.proxy.InterceptorMethodHandler.executeInterception(InterceptorMethodHandler.java:73) [weld-core-impl.jar:2.4.7.Final-redhat-1]
    at org.jboss.weld.interceptor.proxy.InterceptorMethodHandler.invoke(InterceptorMethodHandler.java:57) [weld-core-impl.jar:2.4.7.Final-redhat-1]
    at org.jboss.weld.bean.proxy.CombinedInterceptorAndDecoratorStackMethodHandler.invoke(CombinedInterceptorAndDecoratorStackMethodHandler.java:79) [weld-core-impl.jar:2.4.7.Final-redhat-1]
    at org.jboss.weld.bean.proxy.CombinedInterceptorAndDecoratorStackMethodHandler.invoke(CombinedInterceptorAndDecoratorStackMethodHandler.java:68) [weld-core-impl.jar:2.4.7.Final-redhat-1]
    at org.ovirt.engine.core.vdsbroker.ResourceManager$Proxy$_$$_WeldSubclass.runVdsCommand(Unknown Source) [vdsbroker.jar:]
    at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.runVdsCommand(VDSBrokerFrontendImpl.java:33) [bll.jar:]
    at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.runAsyncVdsCommand(VDSBrokerFrontendImpl.java:39) [bll.jar:]
    at org.ovirt.engine.core.bll.RunVmCommand.createVm(RunVmCommand.java:574) [bll.jar:]
    at org.ovirt.engine.core.bll.RunVmCommand.runVm(RunVmCommand.java:268) [bll.jar:]
    at org.ovirt.engine.core.bll.RunVmCommand.perform(RunVmCommand.java:432) [bll.jar:]
    at org.ovirt.engine.core.bll.RunVmCommand.executeVmCommand(RunVmCommand.java:357) [bll.jar:]
    at org.ovirt.engine.core.bll.VmCommand.executeCommand(VmCommand.java:161) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1133) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1285) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1934) [bll.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:164) [utils.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:103) [utils.jar:]
    at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1345) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:400) [bll.jar:]
    at org.ovirt.engine.core.bll.PrevalidatingMultipleActionsRunner.executeValidatedCommand(PrevalidatingMultipleActionsRunner.java:204) [bll.jar:]
    at org.ovirt.engine.core.bll.PrevalidatingMultipleActionsRunner.runCommands(PrevalidatingMultipleActionsRunner.java:176) [bll.jar:]
    at org.ovirt.engine.core.bll.SortedMultipleActionsRunnerBase.runCommands(SortedMultipleActionsRunnerBase.java:20) [bll.jar:]
    at org.ovirt.engine.core.bll.PrevalidatingMultipleActionsRunner.lambda$invokeCommands$3(PrevalidatingMultipleActionsRunner.java:182) [bll.jar:]
    at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalWrapperRunnable.run(ThreadPoolUtil.java:96) [utils.jar:]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_161]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_161]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_161]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_161]
    at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_161]
    at org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:250) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
    at org.jboss.as.ee.concurrent.service.ElytronManagedThreadFactory$ElytronManagedThread.run(ElytronManagedThreadFactory.java:78)
2018-04-22 09:20:40,480+03 ERROR [org.ovirt.engine.core.vdsbroker.CreateVDSCommand] (EE-ManagedThreadFactory-engine-Thread-26) [78cea825-3c6e-47fa-a54b-473438add222] Command 'CreateVDSCommand( CreateVDSCommandParameters:{hostId='0580a848-a460-4e91-b2b3-d6d98c54935e', vmId='99f42442-3851-4fdb-b79e-b025059deac1', vm='VM [compute_nested_host_setup]'})' execution failed: java.lang.NullPointerException
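The vdsm-side failure above boils down to the stats code assuming every NIC device object carries a `name` attribute. The following standalone Python sketch (illustrative stand-in classes, not vdsm's actual ones) shows the failing pattern and a defensive variant that skips half-recovered devices instead of aborting the whole sampling pass:

```python
# Minimal sketch of the failing pattern in vmstats.networks() above.
# These are illustrative stand-ins, not vdsm's actual classes.

class Nic(object):
    """NIC device object whose 'name' may never get set after recovery."""
    def __init__(self, name=None):
        if name is not None:
            self.name = name


def networks_stats(nics, first_indexes, last_indexes):
    """Per-NIC stats, accessing nic.name unconditionally (as in the
    traceback): a single nameless device aborts the whole pass."""
    stats = {}
    for nic in nics:
        if nic.name not in first_indexes or nic.name not in last_indexes:
            continue
        stats[nic.name] = {"sampled": True}
    return stats


def networks_stats_guarded(nics, first_indexes, last_indexes):
    """Same loop, but skips devices without a usable name instead of
    raising AttributeError and killing the periodic operation."""
    stats = {}
    for nic in nics:
        name = getattr(nic, "name", None)
        if name is None or name not in first_indexes or name not in last_indexes:
            continue
        stats[name] = {"sampled": True}
    return stats
```

With one named and one nameless NIC, the first function raises AttributeError while the guarded one still returns stats for the healthy device.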
Created attachment 1425293 [details] vdsm_host_1
Created attachment 1425295 [details] vdsm_host_2
The VM was terminated on 2018-04-20 19:25:16,435+03:
VM '99f42442-3851-4fdb-b79e-b025059deac1'... moved from 'Up' --> 'Down'

The cluster was upgraded on 2018-04-20 19:42:03, which caused the VM to be updated:
Running command: UpdateVmCommand internal: true. Entities affected : ID: 99f42442-3851-4fdb-b79e-b025059deac1 Type: VMAction group EDIT_VM_PROPERTIES with role type USER

There was no operation on this VM until 2018-04-22 09:12:52,388+03:
[org.ovirt.engine.core.bll.RunVmCommand] (default task-16) [4999cd6f-21b0-48e7-a9b7-9689d2558355] Lock Acquired to object 'EngineLock:{exclusiveLocks='[99f42442-3851-4fdb-b79e-b025059deac1=VM]', sharedLocks=''}'

That run failed because the disk was inactive:
Validation of action 'RunVm' failed for user ipinto. Reasons: VAR__ACTION__RUN,VAR__TYPE__VM,VM_CANNOT_RUN_FROM_DISK_WITHOUT_DISK

Conclusion: the disk, and probably the NIC as well, were unplugged while the cluster version was still 4.1. The monitoring code in the engine remains the same for cluster versions lower than 4.2. Unfortunately, the engine logs do not cover the time when those devices got unplugged. It is most likely that the issue is on the VDSM side, though: the IDs of these devices changed, and the engine therefore marked them as unplugged.

Israel, can we reproduce the upgrade process that led to this issue?
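The conclusion above rests on how the engine reconciles devices: the rows in its database are matched against what the host reports by device ID, so if an upgrade changes the IDs, every managed device fails to match and is flagged unplugged. A simplified, hypothetical model of that matching (not the engine's actual code; the structures and IDs here are made up for illustration):

```python
# Illustrative model of ID-based device reconciliation in the engine.
# Structures and names here are assumptions, not the engine's real code.

def reconcile(db_devices, reported_ids):
    """Mark each device stored in the DB as plugged only if its stored
    device ID still shows up in the host's report."""
    return {
        dev_id: dict(dev, plugged=(dev_id in reported_ids))
        for dev_id, dev in db_devices.items()
    }


# Stored state before the upgrade: one NIC and one disk, both plugged.
db = {
    "nic-id-1": {"type": "interface", "plugged": True},
    "disk-id-1": {"type": "disk", "plugged": True},
}

# If the upgraded VDSM reports the same devices under *new* IDs,
# nothing matches, so both devices end up marked unplugged.
after = reconcile(db, reported_ids={"nic-id-2", "disk-id-2"})
```

Reporting the original IDs back would leave both devices plugged; reporting new IDs unplugs everything, which is exactly the symptom described in this bug.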
The system is rhevm3 (production), so I can't reproduce on this environment. I can do an upgrade from 4.1 to 4.2.3 on another environment. Would that help?
(In reply to Israel Pinto from comment #6)
> The system is rhevm3 (production) I can't do reproduce on this environment.
> I can do upgrade form 4.1 to 4.2.3 on other environment, Is this can help?

Yes, please. As we discussed offline, we need to upgrade to the same version and apply the same upgrade procedure, and it would be handy to have a dump of the database from before the upgrade as well.
Looking at both hosts in cluster 'RHEV-Production-AMD', I see that the yum repo file was edited using vim and then the host was updated with 'yum update'. If that's how VDSM was upgraded while there were VMs running on the host, it may explain why the devices lost their original IDs. Raz, I was told that you made the upgrade; can you confirm that this is how it was done?
Hi Arik,
On this cluster we have 2 hosts: tigris03 (T3) and tigris04 (T4). tigris04 was on the latest 4.1 and tigris03 was on the latest 4.2.3. I started migrating the running VMs from T4 to T3 by putting T4 into maintenance. In the middle of this process, I accidentally executed 'yum update' on it to the latest 4.2.3. At some point the migration got stuck, so I rebooted the T4 machine (after the yum update). Since that did not free the VMs, I used 'Confirm host has been rebooted'.
Hi Arik,
I did an upgrade from 4.1 to 4.2.3 and did not see the problem reproduced. I did the following steps:
1. On 4.1, create and run the VMs
2. Upgrade the engine to 4.2.3
3. Upgrade one host to 4.2.3, while the cluster is at 4.1
4. Migrate one VM manually from the 4.1 host to the 4.2.3 host
5. Set the source host to maintenance and run yum update

Results:
All VMs migrated to the 4.2.3 host; disks and NICs on all VMs are active.
- Also checked that after rebooting a VM, it starts with its disk and NIC active on the 4.2.3 host - PASS
- Restarted the vdsm service on the 4.2.3 host - all VMs are up, and disk and NIC status is active.

Version info:
4.1
Engine: 4.1.11.2-0.1.el7
Host:
OS Version: RHEL - 7.5 - 8.el7
Kernel Version: 3.10.0 - 862.el7.x86_64
KVM Version: 2.10.0 - 21.el7_5.1
LIBVIRT Version: libvirt-3.9.0-14.el7_5.3
VDSM Version: vdsm-4.19.51-1.el7ev

4.2
OS Version: RHEL - 7.5 - 8.el7
Kernel Version: 3.10.0 - 862.el7.x86_64
KVM Version: 2.10.0 - 21.el7_5.2
LIBVIRT Version: libvirt-3.9.0-14.el7_5.2
VDSM Version: vdsm-4.20.26-1.el7ev
(In reply to Israel Pinto from comment #10)
> Hi Arik,
> I did upgrade from 4.1 to 4.2.3
> I didn't see the problem reproduced, Did the following steps:
> 1. On 4.1 create and run the VMs
> 2. Upgrade the engine to 4.2.3
> 3. Upgrade one host to 4.2.3, in cluster 4.1
> 4. Migrate one VM manually from 4.1 host to 4.2.3
> 5. Set source host to maintenance and run yum update
> Results:
> All VM migrate to 4.2.3 host, disks and nic on all vms are active

Let me propose a simpler scenario to start with:
1. Reinstall one of the hosts in that 4.1 cluster with 4.1 VDSM (or alternatively add a host with 4.1 VDSM to that cluster)
2. Run a VM on that host
3. Ensure with virsh that we have no metadata saved for this VM in libvirt
4. Update VDSM on that host to 4.2
5. Restart VDSM
6. If the devices haven't changed, add a NIC to that VM

Then we should see that the previously managed NIC gets unplugged.
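For step 3, one way to check is to look for an oVirt metadata element in the domain XML that `virsh dumpxml <vm>` prints. A rough Python sketch of that check; the namespace URI and the sample XML snippets are assumptions based on what oVirt 4.2 writes, so verify them against your own dumpxml output:

```python
# Sketch of step 3: does the domain XML carry oVirt metadata?
# The namespace URI below is an assumption -- check your dumpxml output.
import xml.etree.ElementTree as ET

OVIRT_VM_NS = "http://ovirt.org/vm/1.0"  # assumed oVirt metadata namespace


def has_ovirt_metadata(domain_xml):
    """Return True if the domain XML contains an oVirt <vm> metadata block."""
    root = ET.fromstring(domain_xml)
    meta = root.find("metadata")
    if meta is None:
        return False
    return meta.find("{%s}vm" % OVIRT_VM_NS) is not None


# Hypothetical 4.1-style domain with no oVirt metadata saved:
xml_41 = "<domain type='kvm'><name>test</name></domain>"

# Hypothetical 4.2-style domain carrying engine metadata:
xml_42 = ("<domain type='kvm'><name>test</name><metadata>"
          "<ovirt-vm:vm xmlns:ovirt-vm='http://ovirt.org/vm/1.0'/>"
          "</metadata></domain>")
```

In practice you would feed the function the output of `virsh dumpxml <vm>`; an empty result for the metadata lookup corresponds to the pre-upgrade state the scenario asks for.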
Created attachment 1426013 [details] reproduce logs- engine, vdsm, vm xml (4.1, 4.2)
I managed to reproduce with the following steps:
Note: use only one host so the VM will not have the option to migrate.
1. Run a VM on a 4.1 host
2. Upgrade VDSM to 4.2 on the host: run 'yum update vdsm' with the 4.2 repo (while the VM is running)

Results:
1. The host became Non-Operational
2. The VM disk is inactive
3. The VM NIC is unplugged
(In reply to Israel Pinto from comment #14)
> I manage to reproduce with the following steps:
> Note: Use only one host so the vm will not have option of migrate
> 1. Run VM on 4.1 host
> 2. Upgrade VDSM to 4.2 on host, run yum update vdsm with 4.2 repo
> (while VM is running)
>
> Results:
> 1. Host become Non-Operational
> 2. VM disk is inactive
> 3. VM NIC is unplugged

Great. So we need to prevent upgrading while VMs are running. It's a really bad idea, since there are often (and in the case of 4.1 to 4.2, almost always) updates that break qemu: qemu itself, seabios, the kernel. We have seen in the past (e.g. in HE flows) that VMs just crash randomly right during the rpm transaction. We cannot easily protect against that, short of rolling back the whole transaction. But since it is an error state already (the host shouldn't be running anything; it should be in Maintenance), we can at least kill the remaining VMs sort-of-gracefully to prevent the situation in comment #14, where vdsm still picks up the information right before the VMs crash and confuses the engine even more.

Killing the remaining running VMs during the vdsm rpm update should be a quick fix, and good enough to protect during major upgrades.
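The guard proposed here amounts to a pre-upgrade check that refuses to proceed while guest processes are still alive. A hypothetical sketch of that logic in Python; the real fix would live in the vdsm RPM scriptlets, the process names are assumptions, and the process list is injected here so the logic stays testable:

```python
# Rough sketch of a "refuse to upgrade while VMs run" guard, as could be
# invoked from an RPM pre-install scriptlet. Process names are assumed.

QEMU_PROCESS_NAMES = ("qemu-kvm", "qemu-system-x86_64")


def check_upgrade_allowed(running_processes):
    """Raise if any qemu guest process is still running on the host."""
    guests = [p for p in running_processes if p in QEMU_PROCESS_NAMES]
    if guests:
        raise RuntimeError(
            "Running QEMU processes found, cannot upgrade Vdsm.")
    return True
```

In a real scriptlet the process scan would come from something like a pgrep call rather than an injected list; the error string matches the message the eventual fix prints during 'yum update' (see the verification comment below in this bug).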
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
(In reply to Israel Pinto from comment #14)
> I manage to reproduce with the following steps:
> Note: Use only one host so the vm will not have option of migrate
> 1. Run VM on 4.1 host
> 2. Upgrade VDSM to 4.2 on host, run yum update vdsm with 4.2 repo
> (while VM is running)

It's an interesting test, but AFAIK, not a supported scenario.

> Results:
> 1. Host become Non-Operational

Do you have fencing enabled? Did VDSM restart?

> 2. VM disk is inactive
> 3. VM NIC is unplugged
*** Bug 1571796 has been marked as a duplicate of this bug. ***
*** Bug 1566402 has been marked as a duplicate of this bug. ***
*** Bug 1575996 has been marked as a duplicate of this bug. ***
(In reply to Michal Skrivanek from comment #15)
> (In reply to Israel Pinto from comment #14)
> > I manage to reproduce with the following steps:
> > Note: Use only one host so the vm will not have option of migrate
> > 1. Run VM on 4.1 host
> > 2. Upgrade VDSM to 4.2 on host, run yum update vdsm with 4.2 repo
> > (while VM is running)
> >
> > Results:
> > 1. Host become Non-Operational
> > 2. VM disk is inactive
> > 3. VM NIC is unplugged
>
> great. So we need to prevent the case of upgrade while VMs are running. It's
> a really bad idea since there are often (and in case of 4.1 to 4.2 almost
> always) there are updates breaking qemu. Qemu itself, seabios, kernel. We
> have seen that in the past (e.g. in HE flows) that VMs just crash randomly
> right during the rpm transaction
> We can protect against that easily except for rolling back the whole
> transaction. But since it is an error state already (hosts shouldn't run
> anything, shoudl be in Maintenance) we can at least kill the remaining VMs
> sort-of-gracefully to prevent the situation in comment #14 where vdsm still
> picks up the information right before VMs crashes and confuse engine even
> more.
>
> Killing remaining running VMs in vdsm rpm update should be a quick fix, and
> good enough to protect during major upgrades

I am not sure this is a good idea. What if the VMs were not affected?
If vdsm can somehow detect this scenario (running yum update without maintenance), I would instead notify the engine, put a note in the log, or even push a message to the user via the shell.
If vdsm can't do that, let's at least implement the suggestion I gave in my previous update.
(In reply to Marina from comment #31)
> (In reply to Michal Skrivanek from comment #15)
> > (In reply to Israel Pinto from comment #14)
> > > I manage to reproduce with the following steps:
> > > Note: Use only one host so the vm will not have option of migrate
> > > 1. Run VM on 4.1 host
> > > 2. Upgrade VDSM to 4.2 on host, run yum update vdsm with 4.2 repo
> > > (while VM is running)
> > >
> > > Results:
> > > 1. Host become Non-Operational
> > > 2. VM disk is inactive
> > > 3. VM NIC is unplugged
> >
> > great. So we need to prevent the case of upgrade while VMs are running. It's
> > a really bad idea since there are often (and in case of 4.1 to 4.2 almost
> > always) there are updates breaking qemu. Qemu itself, seabios, kernel. We
> > have seen that in the past (e.g. in HE flows) that VMs just crash randomly
> > right during the rpm transaction
> > We can protect against that easily except for rolling back the whole
> > transaction. But since it is an error state already (hosts shouldn't run
> > anything, shoudl be in Maintenance) we can at least kill the remaining VMs
> > sort-of-gracefully to prevent the situation in comment #14 where vdsm still
> > picks up the information right before VMs crashes and confuse engine even
> > more.
> >
> > Killing remaining running VMs in vdsm rpm update should be a quick fix, and
> > good enough to protect during major upgrades
>
> I am not sure this is a good idea.
> And what if the VMs were not affected?

All running VMs are affected.

> If vdsm can anyhow detect this scenario (running yum update without
> maintenance), I would instead notifying the manager + put a note in the log
> or even push a message to the user via the shell.

It cannot easily detect it, and even when it does, it has no way to stop it.

> If vdsm can't do that - let's at least implement the suggestion I gave in my
> previous update.

It's OK closing it, but then be aware of this and know how to fix it in the field when someone doesn't follow the instructions.
It's not a trivial fix
Created attachment 1441692 [details]
logs-4.2.4-1

Encountering the issue after upgrade to 4.2.4-1 (from 4.1.11-3). 2 VMs got paused after the upgrade with their disks and NICs unplugged.

First attempt to resume from paused failed on:
2018-05-25 22:25:24,925+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-4189) [6ca9fee3-4a67-453a-bec4-f7298cfbc8fa] EVENT_ID: USER_FAILED_RUN_VM(54), Failed to run VM Gluster_Server02 due to a failed validation: [Cannot run VM. The Custom Compatibility Version of VM Gluster_Server02 (4.1) is not supported in Data Center compatibility version 4.2.] (User: admin@internal-authz).
2018-05-25 22:25:24,925+03 WARN [org.ovirt.engine.core.bll.RunVmCommand] (EE-ManagedThreadFactory-engine-Thread-4189) [6ca9fee3-4a67-453a-bec4-f7298cfbc8fa] Validation of action 'RunVm' failed for user admin@internal-authz. Reasons: VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VM_COMATIBILITY_VERSION_NOT_SUPPORTED,$VmName Gluster_Server02,$VmVersion 4.1,$DcVersion 4.2

Then, I powered them off, plugged the NICs and disks, and tried to start them.
Start VM fails with:
2018-05-25 22:28:43,007+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateBrokerVDSCommand] (EE-ManagedThreadFactory-engine-Thread-4286) [ec056e12-f0fd-4e72-aadd-f4bc8eca58f3] FINISH, CreateBrokerVDSCommand, log id: 4bc008da
2018-05-25 22:28:43,007+03 ERROR [org.ovirt.engine.core.vdsbroker.CreateVDSCommand] (EE-ManagedThreadFactory-engine-Thread-4286) [ec056e12-f0fd-4e72-aadd-f4bc8eca58f3] Failed to create VM: java.lang.NullPointerException
    at org.ovirt.engine.core.vdsbroker.builder.vminfo.LibvirtVmXmlBuilder.writeInterface(LibvirtVmXmlBuilder.java:2069) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.builder.vminfo.LibvirtVmXmlBuilder.lambda$writeInterfaces$25(LibvirtVmXmlBuilder.java:1120) [vdsbroker.jar:]
    at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) [rt.jar:1.8.0_171]
    at java.util.stream.SortedOps$SizedRefSortingSink.end(SortedOps.java:352) [rt.jar:1.8.0_171]
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) [rt.jar:1.8.0_171]
=======================
rhvm-4.2.4-0.1.el7.noarch
vdsm-4.20.28-1.el7ev.x86_64
Verified with:
Engine: 4.2.4-0.1
Host: vdsm-4.20.29-1

Steps:
Update the host while VMs are running on it: install the new RHV package repo and run the 'yum update' command.
Results:
The update failed with the message:
Running QEMU processes found, cannot upgrade Vdsm.

Stopped the running VMs and reran 'yum update'; the update succeeded.
This bugzilla is included in oVirt 4.2.4 release, published on June 26th 2018. Since the problem described in this bug report should be resolved in oVirt 4.2.4 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.
(In reply to Israel Pinto from comment #36)
> verify with:
> Engine: 4.2.4-0.1
> Host:vdsm-4.20.29-1
>
> Steps:
> Update host while running VMs on it.
> Install new RHV package and run 'yum update' command.
> Results:
> Update failed, while message:
> Running QEMU processes found, cannot upgrade Vdsm.
>
> Stop running VMs and rerun 'yum update'
> Update succeeded.

I just hit that trying to update my RHV box. I have an all-in-one setup, so in order to update the box I have to kill all the VMs? That is not really a great way to deal with anything. I guess at this point I probably need to export the VMs and migrate to libvirt on RHEL, as it is getting harder and harder to manage properly in my small setup.