Description of problem:
-----------------------
The setup is an HCI solution with 3 physical hosts installed with RHEL 7.2 and RHGS 3.1.3 (RC) (glusterfs-3.7.9-10). A gluster replica 3 sharded volume is used for the hosted engine; another replica 3 sharded volume is used for the data domain. 29 application VMs running RHEL 7.2 were created. When all the VMs are shut down (by selecting them together and choosing 'poweroff'), 2 of the VMs get stuck in the 'powering-down' state.

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
RHEV 3.6.7.4
RHGS 3.1.3 (RC)
glusterfs-3.7.9-10

How reproducible:
-----------------
Hit once; have not tried to reproduce.

Steps to Reproduce:
-------------------
1. In the HCI setup, create 29 application VMs running RHEL 7.2.
2. Select all the VMs and power them down.

Actual results:
---------------
2 of the VMs were stuck in the 'powering down' state.

Expected results:
-----------------
All the VMs should be powered down.
Not sure whether this is related information, but I found these exceptions in the engine.log:

<snip>
2016-06-20 04:38:56,610 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-60) [2ab9a462] Failed during vms monitoring on host host1.lab.eng.blr.redhat.com error is: java.lang.ClassCastException: java.lang.String cannot be cast to java.util.Map
2016-06-20 04:38:56,610 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-60) [2ab9a462] Exception:: java.lang.ClassCastException: java.lang.String cannot be cast to java.util.Map
        at org.ovirt.engine.core.vdsbroker.VmsMonitoring.processVmDevices(VmsMonitoring.java:612) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VmsMonitoring.updateVmDevices(VmsMonitoring.java:562) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VmsMonitoring.processVmsWithDevicesChange(VmsMonitoring.java:368) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VmsMonitoring.refreshVmStats(VmsMonitoring.java:242) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VmsMonitoring.perform(VmsMonitoring.java:165) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.PollVmStatsRefresher.poll(PollVmStatsRefresher.java:36) [vdsbroker.jar:]
        at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source) [:1.8.0_91]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_91]
        at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_91]
        at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) [scheduler.jar:]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]
</snip>
Other error messages look like "vms running in db but not in VDS":

<snip>
2016-06-20 04:59:38,195 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-1) [110b2584] VM '59d48f24-d6b8-4e9e-b020-b728ed239da3(appvm03) is running in db and not running in VDS 'host1.lab.eng.blr.redhat.com'
2016-06-20 04:59:38,195 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FullListVDSCommand] (DefaultQuartzScheduler_Worker-1) [110b2584] START, FullListVDSCommand(HostName = , FullListVDSCommandParameters:{runAsync='true', hostId='f4a25596-079b-44ed-970b-2142c67b9873', vds='Host[,f4a25596-079b-44ed-970b-2142c67b9873]', vmIds='[10c5e550-4b07-4a25-9a96-ae016864f2a8]'}), log id: 50edd038
2016-06-20 04:59:38,279 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FullListVDSCommand] (DefaultQuartzScheduler_Worker-1) [110b2584] FINISH, FullListVDSCommand, return: [{status=Up, nicModel=rtl8139,pv, emulatedMachine=rhel6.5.0, guestDiskMapping={e5051a6c-dc96-44eb-b={name=/dev/vda}, QEMU_DVD-ROM_QM00003={name=/dev/sr0}}, vmId=10c5e550-4b07-4a25-9a96-ae016864f2a8, pid=6368, devices=[Ljava.lang.Object;@5e702073, smp=2, vmType=kvm, displayIp=0, display=vnc, displaySecurePort=-1, memSize=4096, displayPort=5900, spiceSecureChannels=smain,sdisplay,sinputs,scursor,splayback,srecord,ssmartcard,susbredir, statusTime=4725612060, vmName=HostedEngine, clientIp=, pauseCode=NOERR}], log id: 50edd038
2016-06-20 04:59:38,280 INFO [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-1) [110b2584] Received a memballoon Device without an address when processing VM 10c5e550-4b07-4a25-9a96-ae016864f2a8 devices, skipping device: {device=memballoon, specParams={model=none}, deviceType=balloon, type=balloon, alias=balloon0}
2016-06-20 04:59:38,280 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-1) [110b2584] Failed during vms monitoring on host host1.lab.eng.blr.redhat.com error is: java.lang.ClassCastException
2016-06-20 04:59:38,280 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-1) [110b2584] Exception:: java.lang.ClassCastException
</snip>
Created attachment 1169794 [details] engine.log from hosted engine
Also seeing these error messages in the engine.log:

<snip>
2016-06-20 05:10:57,645 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-53) [7f967349] Failed during vms monitoring on host host1.lab.eng.blr.redhat.com error is: java.lang.ClassCastException
2016-06-20 05:10:57,645 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-53) [7f967349] Exception:: java.lang.ClassCastException
</snip>
appvm14 and appvm03 are the VMs stuck in the "powering down" state. I am not sure how to revive these VMs.
Created attachment 1169797 [details] screenshot showing that the vms are stuck in powering down state
I was able to hit this issue again in my testing. I observed that the VMs stuck in the powering-down state were on the host running the hosted engine. Once I migrated the hosted engine off that host, those VMs got powered down.
Raising the severity, as the outcome is unmonitored VMs without any indication to the user.

The issue seems unrelated to powering off the VMs - that just exposed a problem in the monitoring after a change in the devices of one of the VMs. While parsing the result of FullList we get a ClassCastException.

Until 2016-06-20 04:13:36,984 there is no stacktrace for the ClassCastException, so I don't know the root cause for those occurrences. From 2016-06-20 04:13:36,984 onwards, we start getting the ClassCastException with a stacktrace showing that the problem is that VDSM reports device addresses as a String while we expect a Map.

This won't happen in 4.0 and above, since we no longer rely on the type there. I'll post a 3.6-only fix for this. More investigation is needed to identify the cause of the exceptions we got before 2016-06-20 04:13:36,984.
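The 3.6-only fix described above essentially amounts to a defensive type check before casting the reported device address. A minimal sketch of the idea (the class and method names here are hypothetical, not the actual engine code):

```java
import java.util.HashMap;
import java.util.Map;

public class DeviceAddressParser {

    // VDSM may report a device address as a Map (e.g. {type=pci, slot=0x03, ...}),
    // but in the failing case it arrives as a String. Blindly casting the Object
    // to Map throws ClassCastException and aborts the whole monitoring cycle
    // for every VM on the host, as seen in the engine.log above.
    @SuppressWarnings("unchecked")
    static Map<String, String> parseAddress(Object address) {
        if (address instanceof Map) {
            return (Map<String, String>) address;
        }
        // Unexpected type: degrade gracefully to an empty address instead of
        // failing the monitoring of the entire host.
        return new HashMap<>();
    }

    public static void main(String[] args) {
        Map<String, String> addr = new HashMap<>();
        addr.put("type", "pci");
        System.out.println(parseAddress(addr).get("type"));        // pci
        System.out.println(parseAddress("{type=pci}").isEmpty());  // true
    }
}
```

The key point is that the parser treats an unexpected type as a recoverable condition for one device rather than letting the exception propagate and kill monitoring for all VMs, which is what left the "powering down" VMs unmonitored.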
Does it still happen? If so, could you please add the output of "vdsClient 0 list" on host1.lab.eng.blr.redhat.com?
(In reply to Arik from comment #9)
> Does it still happen? if so, could you please add the output of "vdsClient 0
> list" on host1.lab.eng.blr.redhat.com ?

Arik, I couldn't hit it again.
This is relevant only on 3.6 and it can no longer be reproduced. Feel free to reopen if it happens again, and provide all the logs and the output of "vdsClient 0 list" on the relevant host.