Description of problem:
-----------------------
The setup is an HCI solution with 3 physical hosts installed with RHEL 7.2 and RHGS 3.1.3 (RC) (glusterfs-3.7.9-10). A gluster replica 3 sharded volume is used for the hosted engine; another replica 3 sharded volume is used for the data domain. 29 application VMs running RHEL 7.2 were created. When all the VMs are shut down (by selecting them together and choosing 'poweroff'), 2 of the VMs get stuck in the 'powering-down' state.

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
RHEV 3.6.7.4
RHGS 3.1.3 (RC)
glusterfs-3.7.9-10

How reproducible:
-----------------
Hit once; have not tried to reproduce.

Steps to Reproduce:
-------------------
1. In the HCI setup, create 29 application VMs running RHEL 7.2.
2. Select all the VMs and power them down.

Actual results:
---------------
2 of the VMs were stuck in the 'powering down' state.

Expected results:
-----------------
All the VMs should be powered down.
Not sure whether this is related information, but I found these exceptions in the engine.log:

<snip>
2016-06-20 04:38:56,610 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-60) [2ab9a462] Failed during vms monitoring on host host1.lab.eng.blr.redhat.com error is: java.lang.ClassCastException: java.lang.String cannot be cast to java.util.Map
2016-06-20 04:38:56,610 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-60) [2ab9a462] Exception:: java.lang.ClassCastException: java.lang.String cannot be cast to java.util.Map
        at org.ovirt.engine.core.vdsbroker.VmsMonitoring.processVmDevices(VmsMonitoring.java:612) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VmsMonitoring.updateVmDevices(VmsMonitoring.java:562) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VmsMonitoring.processVmsWithDevicesChange(VmsMonitoring.java:368) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VmsMonitoring.refreshVmStats(VmsMonitoring.java:242) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VmsMonitoring.perform(VmsMonitoring.java:165) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.PollVmStatsRefresher.poll(PollVmStatsRefresher.java:36) [vdsbroker.jar:]
        at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source) [:1.8.0_91]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_91]
        at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_91]
        at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) [scheduler.jar:]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]
</snip>
Other error messages look like "vms running in db but not in VDS":

<snip>
2016-06-20 04:59:38,195 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-1) [110b2584] VM '59d48f24-d6b8-4e9e-b020-b728ed239da3(appvm03) is running in db and not running in VDS 'host1.lab.eng.blr.redhat.com'
2016-06-20 04:59:38,195 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FullListVDSCommand] (DefaultQuartzScheduler_Worker-1) [110b2584] START, FullListVDSCommand(HostName = , FullListVDSCommandParameters:{runAsync='true', hostId='f4a25596-079b-44ed-970b-2142c67b9873', vds='Host[,f4a25596-079b-44ed-970b-2142c67b9873]', vmIds='[10c5e550-4b07-4a25-9a96-ae016864f2a8]'}), log id: 50edd038
2016-06-20 04:59:38,279 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FullListVDSCommand] (DefaultQuartzScheduler_Worker-1) [110b2584] FINISH, FullListVDSCommand, return: [{status=Up, nicModel=rtl8139,pv, emulatedMachine=rhel6.5.0, guestDiskMapping={e5051a6c-dc96-44eb-b={name=/dev/vda}, QEMU_DVD-ROM_QM00003={name=/dev/sr0}}, vmId=10c5e550-4b07-4a25-9a96-ae016864f2a8, pid=6368, devices=[Ljava.lang.Object;@5e702073, smp=2, vmType=kvm, displayIp=0, display=vnc, displaySecurePort=-1, memSize=4096, displayPort=5900, spiceSecureChannels=smain,sdisplay,sinputs,scursor,splayback,srecord,ssmartcard,susbredir, statusTime=4725612060, vmName=HostedEngine, clientIp=, pauseCode=NOERR}], log id: 50edd038
2016-06-20 04:59:38,280 INFO [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-1) [110b2584] Received a memballoon Device without an address when processing VM 10c5e550-4b07-4a25-9a96-ae016864f2a8 devices, skipping device: {device=memballoon, specParams={model=none}, deviceType=balloon, type=balloon, alias=balloon0}
2016-06-20 04:59:38,280 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-1) [110b2584] Failed during vms monitoring on host host1.lab.eng.blr.redhat.com error is: java.lang.ClassCastException
2016-06-20 04:59:38,280 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-1) [110b2584] Exception:: java.lang.ClassCastException
</snip>
Created attachment 1169794 [details] engine.log from hosted engine
Also seeing these error messages in the engine.log:

<snip>
2016-06-20 05:10:57,645 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-53) [7f967349] Failed during vms monitoring on host host1.lab.eng.blr.redhat.com error is: java.lang.ClassCastException
2016-06-20 05:10:57,645 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-53) [7f967349] Exception:: java.lang.ClassCastException
</snip>
appvm14 and appvm03 are the VMs stuck in the "powering down" state. I am not sure how to revive these VMs.
Created attachment 1169797 [details] screenshot showing that the vms are stuck in powering down state
I was able to hit this issue again in my testing. I observed that the VMs stuck in the powering-down state were on the host running the hosted engine. Once I migrated the hosted engine off that host, those VMs got powered down.
Raising the severity, as the outcome is unmonitored VMs without any indication to the user.

The issue seems unrelated to powering off the VMs - that just exposed a problem in the monitoring after a change in the devices of one of the VMs. While parsing the result of FullList we get a ClassCastException.

Until 2016-06-20 04:13:36,984 there is no stacktrace for the ClassCastException, so I don't know the root cause for those occurrences. From 2016-06-20 04:13:36,984 onwards, we start getting the ClassCastException with a stacktrace showing that the problem is that VDSM reports device addresses as a String while we expect a Map.

This won't happen in 4.0 and above, since we no longer rely on the type there. I'll post a 3.6-only fix for this. More investigation is needed to identify the cause of the exceptions we got before 2016-06-20 04:13:36,984.
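The 3.6-only fix described above essentially amounts to a defensive type check before casting the reported device address. A minimal sketch of the idea (the class and method names here are hypothetical, not the actual engine code):

```java
import java.util.HashMap;
import java.util.Map;

public class DeviceAddressParser {

    // VDSM may report a device address as a Map (e.g. {type=pci, slot=0x03, ...}),
    // but in the failing case it arrives as a String. Blindly casting the Object
    // to Map throws ClassCastException and aborts the whole monitoring cycle
    // for every VM on the host, as seen in the engine.log above.
    @SuppressWarnings("unchecked")
    static Map<String, String> parseAddress(Object address) {
        if (address instanceof Map) {
            return (Map<String, String>) address;
        }
        // Unexpected type: degrade gracefully to an empty address instead of
        // failing the monitoring of the entire host.
        return new HashMap<>();
    }

    public static void main(String[] args) {
        Map<String, String> addr = new HashMap<>();
        addr.put("type", "pci");
        System.out.println(parseAddress(addr).get("type"));        // pci
        System.out.println(parseAddress("{type=pci}").isEmpty());  // true
    }
}
```

The key point is that the parser treats an unexpected type as a recoverable condition for one device rather than letting the exception propagate and kill monitoring for all VMs, which is what left the "powering down" VMs unmonitored.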
Does it still happen? If so, could you please add the output of "vdsClient 0 list" on host1.lab.eng.blr.redhat.com?
(In reply to Arik from comment #9)
> Does it still happen? if so, could you please add the output of "vdsClient 0
> list" on host1.lab.eng.blr.redhat.com ?

Arik, I couldn't hit it again.
This is relevant only on 3.6 and it can no longer be reproduced. Feel free to reopen if it happens again, and provide all the logs and the output of "vdsClient 0 list" on the relevant host.