Bug 1570349

Summary: After upgrade from 4.1 to 4.2.3, VM disk is inactive and VM NIC is unplugged
Product: [oVirt] vdsm
Component: Core
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: urgent
Reporter: Israel Pinto <ipinto>
Assignee: bugs <bugs>
QA Contact: Israel Pinto <ipinto>
CC: ahadas, bugs, dgilmore, ebenahar, ipinto, jbelka, mavital, mburman, michal.skrivanek, mkalinin, mtessun, ratamir, tnisan
Keywords: Regression
Target Milestone: ovirt-4.2.4
Flags: rule-engine: ovirt-4.2+, rule-engine: blocker+, mtessun: planning_ack+, rule-engine: devel_ack+, mavital: testing_ack+
Hardware: Unspecified
OS: Unspecified
Fixed In Version: vdsm-4.20.28-1
Type: Bug
oVirt Team: Virt
Last Closed: 2018-06-26 08:38:57 UTC

Attachments:
vdsm_host_1
vdsm_host_2
reproduce logs - engine, vdsm, vm xml (4.1, 4.2)
logs-4.2.4-1

Description Israel Pinto 2018-04-22 06:36:56 UTC
Description of problem:
After an upgrade from 4.1 to 4.2.3 (on rhevm 3),
VMs came up with the NIC unplugged and the disk inactive as well.

Looks like the problem that was solved in BZ 1542117.

Version-Release number of selected component (if applicable):
Engine: 4.2.3.2-0.1.el7
Host:
OS Version: RHEL - 7.5 - 8.el7
Kernel Version: 3.10.0 - 862.el7.x86_64
KVM Version: 2.10.0 - 21.el7_5.2
LIBVIRT Version: libvirt-3.9.0-14.el7_5.3
VDSM Version: vdsm-4.20.26-1.el7ev

How reproducible:
VMs in QA production cluster - 100%  

Steps to Reproduce:
Upgrade from 4.1 to 4.2

Will attach logs later (don't have access to resources)

Comment 1 Israel Pinto 2018-04-22 10:24:57 UTC
logs from engine and hosts:
https://drive.google.com/open?id=1tjRwiS5bpcV7FzzZFEnSIykFKtR_DOya

Comment 2 Israel Pinto 2018-04-22 10:55:01 UTC
We see two errors, on different VMs:
Vdsm log:
2018-04-22 11:21:24,465+0300 ERROR (periodic/5) [virt.periodic.Operation] <vdsm.virt.sampling.VMBulkstatsMonitor object at 0x7f646808a650> operation failed (periodic:222)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 220, in __call__
    self._func()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 522, in __call__
    self._send_metrics()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 538, in _send_metrics
    vm_sample.interval)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmstats.py", line 45, in produce
    networks(vm, stats, first_sample, last_sample, interval)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmstats.py", line 331, in networks
    if nic.name not in first_indexes or nic.name not in last_indexes:
AttributeError: name
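
The AttributeError above comes from the stats path iterating over NIC device objects that, after the upgrade, no longer carry a 'name' attribute. Below is a minimal illustrative sketch (not the actual vdsm code nor its eventual fix; the function and parameter names are simplified) of the kind of defensive guard that would skip such devices instead of aborting the whole bulk-stats sample:

# Illustrative only - the real vdsm stats code lives in
# lib/vdsm/virt/vmstats.py; names below are simplified.
def network_stats(nics, first_indexes, last_indexes):
    stats = {}
    for nic in nics:
        nic_name = getattr(nic, 'name', None)
        if nic_name is None:
            # The device object has no host-side name (e.g. it was rebuilt
            # during an in-place upgrade) - skip it instead of raising
            # AttributeError and killing the whole sampling operation.
            continue
        if nic_name not in first_indexes or nic_name not in last_indexes:
            continue
        # The real code computes rx/tx deltas between the two samples here;
        # a placeholder keeps this sketch self-contained.
        stats[nic_name] = {'name': nic_name}
    return stats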

NPE: 
Engine log: 
2018-04-22 09:20:40,469+03 INFO  [org.ovirt.engine.core.vdsbroker.CreateVDSCommand] (EE-ManagedThreadFactory-engine-Thread-26) [78cea825-3c6e-47fa-a54b-473438add222] START, CreateVDSCommand( CreateVDSCommandParameters:{hostId='0580a848-a460-4e91-b2b3-d6d98c54935e', vmId='99f42442-3851-4fdb-b79e-b025059deac1', vm='VM [compute_nested_host_setup]'}), log id: 1dcd8d55
2018-04-22 09:20:40,471+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateBrokerVDSCommand] (EE-ManagedThreadFactory-engine-Thread-26) [78cea825-3c6e-47fa-a54b-473438add222] START, CreateBrokerVDSCommand(HostName = tigris04, CreateVDSCommandParameters:{hostId='0580a848-a460-4e91-b2b3-d6d98c54935e', vmId='99f42442-3851-4fdb-b79e-b025059deac1', vm='VM [compute_nested_host_setup]'}), log id: 14bc8b6e
2018-04-22 09:20:40,479+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateBrokerVDSCommand] (EE-ManagedThreadFactory-engine-Thread-26) [78cea825-3c6e-47fa-a54b-473438add222] Failed in 'CreateBrokerVDS' method, for vds: 'tigris04'; host: 'tigris04.scl.lab.tlv.redhat.com': null
2018-04-22 09:20:40,479+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateBrokerVDSCommand] (EE-ManagedThreadFactory-engine-Thread-26) [78cea825-3c6e-47fa-a54b-473438add222] Command 'CreateBrokerVDSCommand(HostName = tigris04, CreateVDSCommandParameters:{hostId='0580a848-a460-4e91-b2b3-d6d98c54935e', vmId='99f42442-3851-4fdb-b79e-b025059deac1', vm='VM [compute_nested_host_setup]'})' execution failed: null
2018-04-22 09:20:40,479+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateBrokerVDSCommand] (EE-ManagedThreadFactory-engine-Thread-26) [78cea825-3c6e-47fa-a54b-473438add222] FINISH, CreateBrokerVDSCommand, log id: 14bc8b6e
2018-04-22 09:20:40,479+03 ERROR [org.ovirt.engine.core.vdsbroker.CreateVDSCommand] (EE-ManagedThreadFactory-engine-Thread-26) [78cea825-3c6e-47fa-a54b-473438add222] Failed to create VM: java.lang.NullPointerException
	at org.ovirt.engine.core.vdsbroker.builder.vminfo.LibvirtVmXmlBuilder.writeInterface(LibvirtVmXmlBuilder.java:2045) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.builder.vminfo.LibvirtVmXmlBuilder.lambda$writeInterfaces$24(LibvirtVmXmlBuilder.java:1096) [vdsbroker.jar:]
	at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) [rt.jar:1.8.0_161]
	at java.util.stream.SortedOps$SizedRefSortingSink.end(SortedOps.java:352) [rt.jar:1.8.0_161]
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) [rt.jar:1.8.0_161]
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) [rt.jar:1.8.0_161]
	at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) [rt.jar:1.8.0_161]
	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) [rt.jar:1.8.0_161]
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) [rt.jar:1.8.0_161]
	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) [rt.jar:1.8.0_161]
	at org.ovirt.engine.core.vdsbroker.builder.vminfo.LibvirtVmXmlBuilder.writeInterfaces(LibvirtVmXmlBuilder.java:1096) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.builder.vminfo.LibvirtVmXmlBuilder.writeDevices(LibvirtVmXmlBuilder.java:967) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.builder.vminfo.LibvirtVmXmlBuilder.buildCreateVm(LibvirtVmXmlBuilder.java:236) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.vdsbroker.CreateBrokerVDSCommand.generateDomainXml(CreateBrokerVDSCommand.java:93) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.vdsbroker.CreateBrokerVDSCommand.createInfo(CreateBrokerVDSCommand.java:50) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.vdsbroker.CreateBrokerVDSCommand.executeVdsBrokerCommand(CreateBrokerVDSCommand.java:42) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:112) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:73) [vdsbroker.jar:]
	at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) [dal.jar:]
	at org.ovirt.engine.core.vdsbroker.vdsbroker.DefaultVdsCommandExecutor.execute(DefaultVdsCommandExecutor.java:14) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:398) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.ResourceManager$Proxy$_$$_WeldSubclass.runVdsCommand(Unknown Source) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.CreateVDSCommand.executeVmCommand(CreateVDSCommand.java:37) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.ManagingVmCommand.executeVDSCommand(ManagingVmCommand.java:17) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:73) [vdsbroker.jar:]
	at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) [dal.jar:]
	at org.ovirt.engine.core.vdsbroker.vdsbroker.DefaultVdsCommandExecutor.execute(DefaultVdsCommandExecutor.java:14) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:398) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.ResourceManager$Proxy$_$$_WeldSubclass.runVdsCommand$$super(Unknown Source) [vdsbroker.jar:]
	at sun.reflect.GeneratedMethodAccessor84.invoke(Unknown Source) [:1.8.0_161]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_161]
	at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_161]
	at org.jboss.weld.interceptor.proxy.TerminalAroundInvokeInvocationContext.proceedInternal(TerminalAroundInvokeInvocationContext.java:51) [weld-core-impl.jar:2.4.7.Final-redhat-1]
	at org.jboss.weld.interceptor.proxy.AroundInvokeInvocationContext.proceed(AroundInvokeInvocationContext.java:79) [weld-core-impl.jar:2.4.7.Final-redhat-1]
	at org.ovirt.engine.core.common.di.interceptor.LoggingInterceptor.apply(LoggingInterceptor.java:12) [common.jar:]
	at sun.reflect.GeneratedMethodAccessor76.invoke(Unknown Source) [:1.8.0_161]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_161]
	at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_161]
	at org.jboss.weld.interceptor.reader.SimpleInterceptorInvocation$SimpleMethodInvocation.invoke(SimpleInterceptorInvocation.java:73) [weld-core-impl.jar:2.4.7.Final-redhat-1]
	at org.jboss.weld.interceptor.proxy.InterceptorMethodHandler.executeAroundInvoke(InterceptorMethodHandler.java:85) [weld-core-impl.jar:2.4.7.Final-redhat-1]
	at org.jboss.weld.interceptor.proxy.InterceptorMethodHandler.executeInterception(InterceptorMethodHandler.java:73) [weld-core-impl.jar:2.4.7.Final-redhat-1]
	at org.jboss.weld.interceptor.proxy.InterceptorMethodHandler.invoke(InterceptorMethodHandler.java:57) [weld-core-impl.jar:2.4.7.Final-redhat-1]
	at org.jboss.weld.bean.proxy.CombinedInterceptorAndDecoratorStackMethodHandler.invoke(CombinedInterceptorAndDecoratorStackMethodHandler.java:79) [weld-core-impl.jar:2.4.7.Final-redhat-1]
	at org.jboss.weld.bean.proxy.CombinedInterceptorAndDecoratorStackMethodHandler.invoke(CombinedInterceptorAndDecoratorStackMethodHandler.java:68) [weld-core-impl.jar:2.4.7.Final-redhat-1]
	at org.ovirt.engine.core.vdsbroker.ResourceManager$Proxy$_$$_WeldSubclass.runVdsCommand(Unknown Source) [vdsbroker.jar:]
	at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.runVdsCommand(VDSBrokerFrontendImpl.java:33) [bll.jar:]
	at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.runAsyncVdsCommand(VDSBrokerFrontendImpl.java:39) [bll.jar:]
	at org.ovirt.engine.core.bll.RunVmCommand.createVm(RunVmCommand.java:574) [bll.jar:]
	at org.ovirt.engine.core.bll.RunVmCommand.runVm(RunVmCommand.java:268) [bll.jar:]
	at org.ovirt.engine.core.bll.RunVmCommand.perform(RunVmCommand.java:432) [bll.jar:]
	at org.ovirt.engine.core.bll.RunVmCommand.executeVmCommand(RunVmCommand.java:357) [bll.jar:]
	at org.ovirt.engine.core.bll.VmCommand.executeCommand(VmCommand.java:161) [bll.jar:]
	at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1133) [bll.jar:]
	at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1285) [bll.jar:]
	at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1934) [bll.jar:]
	at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:164) [utils.jar:]
	at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:103) [utils.jar:]
	at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1345) [bll.jar:]
	at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:400) [bll.jar:]
	at org.ovirt.engine.core.bll.PrevalidatingMultipleActionsRunner.executeValidatedCommand(PrevalidatingMultipleActionsRunner.java:204) [bll.jar:]
	at org.ovirt.engine.core.bll.PrevalidatingMultipleActionsRunner.runCommands(PrevalidatingMultipleActionsRunner.java:176) [bll.jar:]
	at org.ovirt.engine.core.bll.SortedMultipleActionsRunnerBase.runCommands(SortedMultipleActionsRunnerBase.java:20) [bll.jar:]
	at org.ovirt.engine.core.bll.PrevalidatingMultipleActionsRunner.lambda$invokeCommands$3(PrevalidatingMultipleActionsRunner.java:182) [bll.jar:]
	at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalWrapperRunnable.run(ThreadPoolUtil.java:96) [utils.jar:]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_161]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_161]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_161]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_161]
	at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_161]
	at org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:250) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
	at org.jboss.as.ee.concurrent.service.ElytronManagedThreadFactory$ElytronManagedThread.run(ElytronManagedThreadFactory.java:78)

2018-04-22 09:20:40,480+03 ERROR [org.ovirt.engine.core.vdsbroker.CreateVDSCommand] (EE-ManagedThreadFactory-engine-Thread-26) [78cea825-3c6e-47fa-a54b-473438add222] Command 'CreateVDSCommand( CreateVDSCommandParameters:{hostId='0580a848-a460-4e91-b2b3-d6d98c54935e', vmId='99f42442-3851-4fdb-b79e-b025059deac1', vm='VM [compute_nested_host_setup]'})' execution failed: java.lang.NullPointerException

Comment 3 Israel Pinto 2018-04-22 10:55:45 UTC
Created attachment 1425293 [details]
vdsm_host_1

Comment 4 Israel Pinto 2018-04-22 10:56:22 UTC
Created attachment 1425295 [details]
vdsm_host_2

Comment 5 Arik 2018-04-23 08:08:39 UTC
The VM was terminated on 2018-04-20 19:25:16,435+03:
VM '99f42442-3851-4fdb-b79e-b025059deac1'... moved from 'Up' --> 'Down'

The cluster was upgraded on 2018-04-20 19:42:03 and caused the VM to be updated:
Running command: UpdateVmCommand internal: true. Entities affected :  ID: 99f42442-3851-4fdb-b79e-b025059deac1 Type: VMAction group EDIT_VM_PROPERTIES with role type USER

There was no operation on this VM until 2018-04-22 09:12:52,388+03:
[org.ovirt.engine.core.bll.RunVmCommand] (default task-16) [4999cd6f-21b0-48e7-a9b7-9689d2558355] Lock Acquired to object 'EngineLock:{exclusiveLocks='[99f42442-3851-4fdb-b79e-b025059deac1=VM]', sharedLocks=''}'

That failed because the disk was inactive:
Validation of action 'RunVm' failed for user ipinto. Reasons: VAR__ACTION__RUN,VAR__TYPE__VM,VM_CANNOT_RUN_FROM_DISK_WITHOUT_DISK

Conclusion:
The disk, and probably the NIC as well, were unplugged while the cluster version was 4.1. The monitoring code in the engine remains the same for cluster versions lower than 4.2. Unfortunately, the engine logs do not cover the time at which those devices got unplugged. It is most likely that the issue is on the VDSM side, though - that the IDs of these devices changed and the engine therefore marked them as unplugged.
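
To illustrate what "marked them as unplugged" means on the engine side, here is a hedged diagnostic sketch, not part of any fix: the database name, DSN and the vm_device/vm_static schema (is_managed, is_plugged columns) are assumptions based on a standard engine setup, so adjust them to your environment.

# Hypothetical diagnostic helper - assumes a local 'engine' PostgreSQL
# database and the vm_device / vm_static tables.
import psycopg2


def device_plug_state(vm_name,
                      dsn='dbname=engine user=engine host=localhost'):
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT d.device_id, d.type, d.device, d.is_plugged
                FROM vm_device d
                JOIN vm_static s ON s.vm_guid = d.vm_id
                WHERE s.vm_name = %s AND d.is_managed
                """, (vm_name,))
            return cur.fetchall()
    finally:
        conn.close()


# Example: print(device_plug_state('compute_nested_host_setup'))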

Israel, can we reproduce the upgrade process that led to this issue?

Comment 6 Israel Pinto 2018-04-23 08:15:25 UTC
The system is rhevm3 (production); I can't reproduce on this environment.
I can do an upgrade from 4.1 to 4.2.3 on another environment - would that help?

Comment 7 Arik 2018-04-23 08:52:44 UTC
(In reply to Israel Pinto from comment #6)
> The system is rhevm3 (production) I can't do reproduce on this environment.
> I can do upgrade form 4.1 to 4.2.3 on other environment, Is this can help?

Yes, please.
As we discussed offline, we need to upgrade to the same version, apply the same upgrade procedure and it would be handy to have a dump of the database before the upgrade as well.

Comment 8 Arik 2018-04-23 11:01:49 UTC
Looking at both hosts in cluster 'RHEV-Production-AMD', I see that the yum repo file was edited using vim and then the host was updated using 'yum update'. If that's how VDSM was upgraded, while there were VMs running on the host, it may explain why the devices lost their original IDs.

Raz, I was told that you've made the upgrade, can you confirm that this is how the upgrade was done?

Comment 9 Raz Tamir 2018-04-23 12:14:39 UTC
Hi Arik,

On this cluster we have two hosts: tigris03 (T3) and tigris04 (T4).
tigris04 was on the latest 4.1 and tigris03 on the latest 4.2.3.
I started migrating the running VMs from T4 to T3 by putting T4 into maintenance.
In the middle of this process, I accidentally executed 'yum update' on T4, taking it to the latest 4.2.3.

At some point the migration process got stuck, so I rebooted the T4 machine (after the yum update). Since that did not free the VMs, I used "confirm host has been rebooted".

Comment 10 Israel Pinto 2018-04-23 18:40:02 UTC
Hi Arik,
I did an upgrade from 4.1 to 4.2.3 and did not see the problem reproduced. I did the following steps:
1. On 4.1, create and run the VMs
2. Upgrade the engine to 4.2.3
3. Upgrade one host to 4.2.3, in a 4.1 cluster
4. Migrate one VM manually from the 4.1 host to the 4.2.3 host
5. Set the source host to maintenance and run yum update
Results:
All VMs migrated to the 4.2.3 host; disks and NICs on all VMs are active.

- Also checked that after a reboot the VM starts with disk and NIC active on the 4.2.3 host - PASS
- Restarted the vdsm service on the 4.2.3 host - all VMs are up and disk and NIC status is active.

Version info:
4.1 
Engine: 4.1.11.2-0.1.el7
Host: 
OS Version: RHEL - 7.5 - 8.el7
Kernel Version: 3.10.0 - 862.el7.x86_64
KVM Version: 2.10.0 - 21.el7_5.1
LIBVIRT Version: libvirt-3.9.0-14.el7_5.3
VDSM Version: vdsm-4.19.51-1.el7ev

4.2 
OS Version: RHEL - 7.5 - 8.el7
Kernel Version: 3.10.0 - 862.el7.x86_64
KVM Version: 2.10.0 - 21.el7_5.2
LIBVIRT Version: libvirt-3.9.0-14.el7_5.2
VDSM Version: vdsm-4.20.26-1.el7ev

Comment 12 Arik 2018-04-23 19:26:22 UTC
(In reply to Israel Pinto from comment #10)
> Hi Arik,
> I did upgrade from 4.1 to 4.2.3 
> I didn't see the problem reproduced, Did the following steps:
> 1. On 4.1 create and run the VMs
> 2. Upgrade the engine to 4.2.3
> 3. Upgrade one host to 4.2.3, in cluster 4.1
> 4. Migrate one VM manually from 4.1 host to 4.2.3 
> 5. Set source host to maintenance and run yum update
> Results:
> All VM migrate to 4.2.3 host, disks and nic on all vms are active 

Let me propose a simpler scenario to start with:
1. Reinstall one of the hosts in that 4.1 cluster with 4.1 VDSM (or alternatively add a host with 4.1 VDSM to that cluster)
2. Run a VM on that host
3. Ensure with virsh that we have no metadata saved for this VM in libvirt (sketch below)
4. Update VDSM on that host to 4.2
5. Restart VDSM
6. If the devices haven't been changed, add a NIC to that VM
Then we should see that the pre-existing managed NIC gets unplugged.
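
For step 3, here is a minimal sketch using the libvirt Python bindings rather than virsh; the oVirt metadata namespace URI below is an assumption based on what vdsm 4.2 writes, so adjust it to whatever your environment actually uses.

# Sketch only - checks whether libvirt has oVirt metadata stored for a
# defined domain. The namespace URI is an assumption.
import libvirt

OVIRT_VM_METADATA_URI = 'http://ovirt.org/vm/1.0'


def has_ovirt_metadata(vm_name):
    conn = libvirt.open('qemu:///system')
    try:
        # lookupByName raises libvirtError if the domain does not exist.
        dom = conn.lookupByName(vm_name)
        try:
            dom.metadata(libvirt.VIR_DOMAIN_METADATA_ELEMENT,
                         OVIRT_VM_METADATA_URI, 0)
            return True
        except libvirt.libvirtError:
            return False  # nothing stored under this namespace
    finally:
        conn.close()


# Example: print(has_ovirt_metadata('my_4_1_vm'))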

Comment 13 Israel Pinto 2018-04-24 12:41:15 UTC
Created attachment 1426013 [details]
reproduce logs- engine, vdsm, vm xml (4.1, 4.2)

Comment 14 Israel Pinto 2018-04-24 12:43:03 UTC
I managed to reproduce it with the following steps:
Note: use only one host, so the VM has no option to migrate.
1. Run a VM on a 4.1 host
2. Upgrade VDSM to 4.2 on the host: run 'yum update vdsm' with the 4.2 repo
   (while the VM is running)

Results:
1. Host becomes Non-Operational
2. VM disk is inactive
3. VM NIC is unplugged

Comment 15 Michal Skrivanek 2018-04-25 10:42:18 UTC
(In reply to Israel Pinto from comment #14)
> I manage to reproduce with the following steps:
> Note: Use only one host so the vm will not have option of migrate 
> 1. Run VM on 4.1 host
> 2. Upgrade VDSM to 4.2 on host, run yum update vdsm with 4.2 repo
>    (while VM is running)
> 
> Results:
> 1. Host become Non-Operational
> 2. VM disk is inactive
> 3. VM NIC is unplugged

Great. So we need to prevent upgrading while VMs are running. It's a really bad idea, since there are often (and in the case of 4.1 to 4.2 almost always) updates that break QEMU - QEMU itself, seabios, the kernel. We have seen in the past (e.g. in hosted-engine flows) that VMs just crash randomly right during the rpm transaction.
We can protect against that easily, except for rolling back the whole transaction. But since it is an error state already (the host shouldn't be running anything; it should be in Maintenance), we can at least kill the remaining VMs sort-of-gracefully to prevent the situation in comment #14, where vdsm still picks up the information right before the VMs crash and confuses the engine even more.

Killing the remaining running VMs during the vdsm rpm update should be a quick fix, and good enough to protect during major upgrades.
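
A minimal sketch of that kind of guard follows. This is not the actual vdsm packaging scriptlet, only an illustration of refusing to proceed while QEMU processes exist on the host; the /proc scan and the message wording are assumptions made for the example.

#!/usr/bin/env python
# Illustration only - refuse to continue an upgrade while QEMU processes
# are still running on the host.
import os
import sys


def running_qemu_pids():
    pids = []
    for entry in os.listdir('/proc'):
        if not entry.isdigit():
            continue
        try:
            with open('/proc/%s/comm' % entry) as comm_file:
                comm = comm_file.read().strip()
        except IOError:
            continue  # the process exited while we were scanning
        if comm.startswith('qemu'):
            pids.append(int(entry))
    return pids


if __name__ == '__main__':
    pids = running_qemu_pids()
    if pids:
        sys.stderr.write('Running QEMU processes found (%s), '
                         'cannot upgrade Vdsm.\n'
                         % ', '.join(str(p) for p in pids))
        sys.exit(1)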

Comment 17 Red Hat Bugzilla Rules Engine 2018-04-27 06:53:10 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 18 Yaniv Kaul 2018-04-29 09:01:19 UTC
(In reply to Israel Pinto from comment #14)
> I manage to reproduce with the following steps:
> Note: Use only one host so the vm will not have option of migrate 
> 1. Run VM on 4.1 host
> 2. Upgrade VDSM to 4.2 on host, run yum update vdsm with 4.2 repo
>    (while VM is running)

It's an interesting test, but AFAIK, not a supported scenario.

> 
> Results:
> 1. Host become Non-Operational

Do you have fencing enabled?
Did VDSM restart?

> 2. VM disk is inactive
> 3. VM NIC is unplugged

Comment 23 Michal Skrivanek 2018-05-03 10:21:13 UTC
*** Bug 1571796 has been marked as a duplicate of this bug. ***

Comment 27 Arik 2018-05-06 07:59:46 UTC
*** Bug 1566402 has been marked as a duplicate of this bug. ***

Comment 28 Michal Skrivanek 2018-05-09 08:41:34 UTC
*** Bug 1575996 has been marked as a duplicate of this bug. ***

Comment 31 Marina Kalinin 2018-05-09 20:37:33 UTC
(In reply to Michal Skrivanek from comment #15)
> (In reply to Israel Pinto from comment #14)
> > I manage to reproduce with the following steps:
> > Note: Use only one host so the vm will not have option of migrate 
> > 1. Run VM on 4.1 host
> > 2. Upgrade VDSM to 4.2 on host, run yum update vdsm with 4.2 repo
> >    (while VM is running)
> > 
> > Results:
> > 1. Host become Non-Operational
> > 2. VM disk is inactive
> > 3. VM NIC is unplugged
> 
> great. So we need to prevent the case of upgrade while VMs are running. It's
> a really bad idea since there are often (and in case of 4.1 to 4.2 almost
> always) there are updates breaking qemu. Qemu itself, seabios, kernel. We
> have seen that in the past (e.g. in HE flows) that VMs just crash randomly
> right during the rpm transaction
> We can protect against that easily except for rolling back the whole
> transaction. But since it is an error state already (hosts shouldn't run
> anything, shoudl be in Maintenance) we can at least kill the remaining VMs
> sort-of-gracefully to prevent the situation in comment #14 where vdsm still
> picks up the information right before VMs crashes and confuse engine even
> more.
> 
> Killing remaining running VMs in vdsm rpm update should be a quick fix, and
> good enough to protect during major upgrades

I am not sure this is a good idea.
What if the VMs were not affected?
If vdsm can somehow detect this scenario (running yum update without maintenance), I would instead notify the manager, put a note in the log, or even push a message to the user via the shell.
If vdsm can't do that, let's at least implement the suggestion I gave in my previous update.

Comment 32 Michal Skrivanek 2018-05-10 09:13:34 UTC
(In reply to Marina from comment #31)
> (In reply to Michal Skrivanek from comment #15)
> > (In reply to Israel Pinto from comment #14)
> > > I manage to reproduce with the following steps:
> > > Note: Use only one host so the vm will not have option of migrate 
> > > 1. Run VM on 4.1 host
> > > 2. Upgrade VDSM to 4.2 on host, run yum update vdsm with 4.2 repo
> > >    (while VM is running)
> > > 
> > > Results:
> > > 1. Host become Non-Operational
> > > 2. VM disk is inactive
> > > 3. VM NIC is unplugged
> > 
> > great. So we need to prevent the case of upgrade while VMs are running. It's
> > a really bad idea since there are often (and in case of 4.1 to 4.2 almost
> > always) there are updates breaking qemu. Qemu itself, seabios, kernel. We
> > have seen that in the past (e.g. in HE flows) that VMs just crash randomly
> > right during the rpm transaction
> > We can protect against that easily except for rolling back the whole
> > transaction. But since it is an error state already (hosts shouldn't run
> > anything, shoudl be in Maintenance) we can at least kill the remaining VMs
> > sort-of-gracefully to prevent the situation in comment #14 where vdsm still
> > picks up the information right before VMs crashes and confuse engine even
> > more.
> > 
> > Killing remaining running VMs in vdsm rpm update should be a quick fix, and
> > good enough to protect during major upgrades
> 
> I am not sure this is a good idea. 
> And what if the VMs were not affected?

all running VMs are affected

> If vdsm can anyhow detect this scenario (running yum update without
> maintenance), I would instead notifying the manager + put a note in the log
> or even push a message to the user via the shell. 

It cannot easily detect that, and even when it does, it has no way to stop it.

> If vdsm can't do that - let's at least implement the suggestion I gave in my
> previous update.

It's OK to close it, but then be aware of this and know how to fix it in the field when someone doesn't follow the instructions. It's not a trivial fix.

Comment 35 Elad 2018-05-25 19:58:26 UTC
Created attachment 1441692 [details]
logs-4.2.4-1

Encountering the issue after upgrade to 4.2.4-1 (from 4.1.11-3). 

Two VMs got paused after the upgrade, with their disks and NICs unplugged.
The first attempt to resume them from paused failed with:

2018-05-25 22:25:24,925+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-4189) [6ca9fee3-4a67-453a-bec4-f7298cfbc8fa] EVENT_ID: USER_FAILED_RUN_VM(54), Failed to run VM Gluster_Server02 due to a failed validation: [Cannot run VM. The Custom Compatibility Version of VM Gluster_Server02 (4.1) is not supported in Data Center compatibility version 4.2.] (User: admin@internal-authz).
2018-05-25 22:25:24,925+03 WARN  [org.ovirt.engine.core.bll.RunVmCommand] (EE-ManagedThreadFactory-engine-Thread-4189) [6ca9fee3-4a67-453a-bec4-f7298cfbc8fa] Validation of action 'RunVm' failed for user admin@internal-authz. Reasons: VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VM_COMATIBILITY_VERSION_NOT_SUPPORTED,$VmName Gluster_Server02,$VmVersion 4.1,$DcVersion 4.2


Then, I powered them off, plugged the NICs and disks and tried to start them. 

Start VM fails with:

2018-05-25 22:28:43,007+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateBrokerVDSCommand] (EE-ManagedThreadFactory-engine-Thread-4286) [ec056e12-f0fd-4e72-aadd-f4bc8eca58f3] FINISH, CreateBrokerVDSCommand, log id: 4bc008da
2018-05-25 22:28:43,007+03 ERROR [org.ovirt.engine.core.vdsbroker.CreateVDSCommand] (EE-ManagedThreadFactory-engine-Thread-4286) [ec056e12-f0fd-4e72-aadd-f4bc8eca58f3] Failed to create VM: java.lang.NullPointerException
        at org.ovirt.engine.core.vdsbroker.builder.vminfo.LibvirtVmXmlBuilder.writeInterface(LibvirtVmXmlBuilder.java:2069) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.builder.vminfo.LibvirtVmXmlBuilder.lambda$writeInterfaces$25(LibvirtVmXmlBuilder.java:1120) [vdsbroker.jar:]
        at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) [rt.jar:1.8.0_171]
        at java.util.stream.SortedOps$SizedRefSortingSink.end(SortedOps.java:352) [rt.jar:1.8.0_171]
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) [rt.jar:1.8.0_171]

=======================
rhvm-4.2.4-0.1.el7.noarch
vdsm-4.20.28-1.el7ev.x86_64

Comment 36 Israel Pinto 2018-06-02 17:16:17 UTC
Verified with:
Engine: 4.2.4-0.1
Host: vdsm-4.20.29-1

Steps:
Update the host while VMs are running on it:
install the new RHV packages and run the 'yum update' command.
Results:
The update failed with the message:
Running QEMU processes found, cannot upgrade Vdsm.

Stopped the running VMs and reran 'yum update':
the update succeeded.

Comment 37 Sandro Bonazzola 2018-06-26 08:38:57 UTC
This bugzilla is included in the oVirt 4.2.4 release, published on June 26th 2018.

Since the problem described in this bug report should be resolved in the oVirt 4.2.4 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

Comment 38 Dennis Gilmore 2018-07-22 10:07:58 UTC
(In reply to Israel Pinto from comment #36)
> verify with:
> Engine: 4.2.4-0.1
> Host:vdsm-4.20.29-1
> 
> 
> Steps:
> Update host while running VMs on it.
> Install new RHV package and run 'yum update' command.
> Results:
> Update failed, while message:
> Running QEMU processes found, cannot upgrade Vdsm.
> 
> Stop running VMs and rerun 'yum update' 
> Update succeeded.

I just hit that trying to update my RHV box. I have an all-in-one setup, so in order to update the box I have to kill all VMs? That is not really a great way to deal with anything. I guess at this point I probably need to export the VMs and migrate to libvirt on RHEL, as it is getting harder and harder to manage properly in my small setup.