Bug 1348129 - "VMs running in DB, but not in VDS" error while powering off all 29 VMs together
Summary: "VMs running in DB, but not in VDS" error while powering off all 29 VMs together
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Frontend.WebAdmin
Version: 3.6.7
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Arik
QA Contact: SATHEESARAN
URL:
Whiteboard:
Depends On:
Blocks: Gluster-HC-1
 
Reported: 2016-06-20 08:54 UTC by SATHEESARAN
Modified: 2016-06-29 12:18 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
HCI RHEV 3.6.7.4 and RHGS 3.1.3 (RC)
Last Closed: 2016-06-29 12:18:50 UTC
oVirt Team: Virt
Embargoed:
sasundar: planning_ack?
sasundar: devel_ack?
sasundar: testing_ack?


Attachments
engine.log from hosted engine (5.60 MB, text/plain), 2016-06-20 09:08 UTC, SATHEESARAN
screenshot showing that the vms are stuck in powering down state (15.31 KB, image/png), 2016-06-20 09:16 UTC, SATHEESARAN

Description SATHEESARAN 2016-06-20 08:54:09 UTC
Description of problem:
-----------------------
The setup is an HCI solution with 3 physical hosts installed with RHEL 7.2 and RHGS 3.1.3 (RC) (glusterfs-3.7.9-10).

A Gluster replica 3 sharded volume is used for the hosted engine; another replica 3 sharded volume is used for the data domain.

29 Application VMs running RHEL 7.2 were created. When all the VMs were shut down together (by selecting them all and choosing 'Power Off'), 2 of the VMs got stuck in the 'powering-down' state.

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
RHEV 3.6.7.4
RHGS 3.1.3 (RC) glusterfs-3.7.9-10

How reproducible:
-----------------
Hit once; haven't tried to reproduce it yet.

Steps to Reproduce:
-------------------
1. In an HCI setup, create 29 Application VMs running RHEL 7.2
2. Select all the VMs and power them down

Actual results:
---------------
2 of the VMs were stuck in the 'powering-down' state

Expected results:
-----------------
All the VMs should be powered down

Comment 1 SATHEESARAN 2016-06-20 08:56:28 UTC
Not sure whether this is related information, but I found these exceptions in engine.log:

<snip>
2016-06-20 04:38:56,610 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-60) [2ab9a462] Failed during vms monitoring on host host1.lab.eng.blr.redhat.com error is: java.lang.ClassCastException: java.lang.String cannot be cast to java.util.Map
2016-06-20 04:38:56,610 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-60) [2ab9a462] Exception:: java.lang.ClassCastException: java.lang.String cannot be cast to java.util.Map
        at org.ovirt.engine.core.vdsbroker.VmsMonitoring.processVmDevices(VmsMonitoring.java:612) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VmsMonitoring.updateVmDevices(VmsMonitoring.java:562) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VmsMonitoring.processVmsWithDevicesChange(VmsMonitoring.java:368) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VmsMonitoring.refreshVmStats(VmsMonitoring.java:242) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VmsMonitoring.perform(VmsMonitoring.java:165) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.PollVmStatsRefresher.poll(PollVmStatsRefresher.java:36) [vdsbroker.jar:]
        at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source) [:1.8.0_91]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_91]
        at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_91]
        at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) [scheduler.jar:]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]

</snip>
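
For illustration, here is a minimal standalone sketch (hypothetical structure and names, not the actual engine code) of the kind of unchecked cast in processVmDevices that would raise "java.lang.String cannot be cast to java.util.Map" when VDSM reports a device address as a flat string instead of a map:

<snip>
import java.util.HashMap;
import java.util.Map;

public class DeviceAddressCast {
    public static void main(String[] args) {
        // Device properties as they might arrive in a parsed VDSM response.
        Map<String, Object> device = new HashMap<>();
        // Assume VDSM serialized the address as a plain string rather than a map.
        device.put("address", "{slot=0x03, bus=0x00, type=pci, function=0x0}");

        // An unconditional cast like this throws ClassCastException at runtime:
        @SuppressWarnings("unchecked")
        Map<String, String> address = (Map<String, String>) device.get("address");
        System.out.println(address);
    }
}
</snip>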

Comment 2 SATHEESARAN 2016-06-20 09:06:53 UTC
Other error messages look like "vms running in db but not in VDS":

<snip>
2016-06-20 04:59:38,195 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-1) [110b2584] VM '59d48f24-d6b8-4e9e-b020-b728ed239da3(appvm03) is running in db and not running in VDS 'host1.lab.eng.blr.redhat.com'
2016-06-20 04:59:38,195 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FullListVDSCommand] (DefaultQuartzScheduler_Worker-1) [110b2584] START, FullListVDSCommand(HostName = , FullListVDSCommandParameters:{runAsync='true', hostId='f4a25596-079b-44ed-970b-2142c67b9873', vds='Host[,f4a25596-079b-44ed-970b-2142c67b9873]', vmIds='[10c5e550-4b07-4a25-9a96-ae016864f2a8]'}), log id: 50edd038
2016-06-20 04:59:38,279 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FullListVDSCommand] (DefaultQuartzScheduler_Worker-1) [110b2584] FINISH, FullListVDSCommand, return: [{status=Up, nicModel=rtl8139,pv, emulatedMachine=rhel6.5.0, guestDiskMapping={e5051a6c-dc96-44eb-b={name=/dev/vda}, QEMU_DVD-ROM_QM00003={name=/dev/sr0}}, vmId=10c5e550-4b07-4a25-9a96-ae016864f2a8, pid=6368, devices=[Ljava.lang.Object;@5e702073, smp=2, vmType=kvm, displayIp=0, display=vnc, displaySecurePort=-1, memSize=4096, displayPort=5900, spiceSecureChannels=smain,sdisplay,sinputs,scursor,splayback,srecord,ssmartcard,susbredir, statusTime=4725612060, vmName=HostedEngine, clientIp=, pauseCode=NOERR}], log id: 50edd038
2016-06-20 04:59:38,280 INFO  [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-1) [110b2584] Received a memballoon Device without an address when processing VM 10c5e550-4b07-4a25-9a96-ae016864f2a8 devices, skipping device: {device=memballoon, specParams={model=none}, deviceType=balloon, type=balloon, alias=balloon0}
2016-06-20 04:59:38,280 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-1) [110b2584] Failed during vms monitoring on host host1.lab.eng.blr.redhat.com error is: java.lang.ClassCastException
2016-06-20 04:59:38,280 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-1) [110b2584] Exception:: java.lang.ClassCastException
</snip>
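
For context, the "running in db and not running in VDS" message reflects the engine comparing the set of VMs its database believes are running on a host with the set VDSM actually reports. A minimal sketch of that reconciliation, with illustrative names only (this is not the actual VmAnalyzer code):

<snip>
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

public class VmDbVdsDiff {
    // Returns IDs of VMs the DB marks as running on the host but VDSM does not report.
    static Set<UUID> vmsInDbNotInVds(Set<UUID> runningInDb, Set<UUID> reportedByVds) {
        Set<UUID> missing = new HashSet<>(runningInDb);
        missing.removeAll(reportedByVds);
        return missing;
    }

    public static void main(String[] args) {
        // IDs taken from the log above: appvm03 and the HostedEngine VM.
        Set<UUID> db = new HashSet<>(Arrays.asList(
                UUID.fromString("59d48f24-d6b8-4e9e-b020-b728ed239da3"),
                UUID.fromString("10c5e550-4b07-4a25-9a96-ae016864f2a8")));
        Set<UUID> vds = Collections.singleton(
                UUID.fromString("10c5e550-4b07-4a25-9a96-ae016864f2a8"));
        // Prints the appvm03 ID: running in the DB but not reported by VDSM.
        System.out.println(vmsInDbNotInVds(db, vds));
    }
}
</snip>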

Comment 3 SATHEESARAN 2016-06-20 09:08:35 UTC
Created attachment 1169794 [details]
engine.log from hosted engine

Comment 4 SATHEESARAN 2016-06-20 09:12:10 UTC
Also seeing these error messages in engine.log

<snip>
2016-06-20 05:10:57,645 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-53) [7f967349] Failed during vms monitoring on host host1.lab.eng.blr.redhat.com error is: java.lang.ClassCastException
2016-06-20 05:10:57,645 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-53) [7f967349] Exception:: java.lang.ClassCastException
</snip>

Comment 5 SATHEESARAN 2016-06-20 09:14:48 UTC
appvm14 and appvm03 are the VMs stuck in the "powering down" state.
I am not sure how to revive these VMs.

Comment 6 SATHEESARAN 2016-06-20 09:16:25 UTC
Created attachment 1169797 [details]
screenshot showing that the vms are stuck in powering down state

Comment 7 SATHEESARAN 2016-06-21 01:38:36 UTC
I was able to hit this issue again in my testing.

I observed that a few of the VMs stuck in the powering-down state were on the host that runs the hosted engine.

Once I migrated the hosted engine off that host, those VMs powered down.

Comment 8 Arik 2016-06-21 06:56:03 UTC
Raising the severity as the outcome is unmonitored VMs without any indication to the user.

It seems the issue is unrelated to powering off the VMs; that just exposed a problem in the monitoring after a change in the devices of one of the VMs.

As part of parsing the result of FullList we get a ClassCastException.
Until 2016-06-20 04:13:36,984 there is no stack trace for the ClassCastException, so I don't know its root cause.

At 2016-06-20 04:13:36,984 we start getting the ClassCastException with a stack trace showing that the problem is that VDSM reports device addresses as a String while we expect a Map. This won't happen in 4.0 and above, since we no longer rely on the type. I'll post a 3.6-only fix for this.

More investigation is needed in order to identify the cause of the exceptions we got before 2016-06-20 04:13:36,984.
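
To illustrate the direction of the fix (a sketch under the assumption described above, with illustrative names, not the actual 3.6 patch): treat the reported address as a Map only when it actually is one, instead of casting unconditionally:

<snip>
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class SafeDeviceAddress {
    // Defensive variant: never assume the type VDSM reports for the address.
    @SuppressWarnings("unchecked")
    static Map<String, String> addressAsMap(Object address) {
        if (address instanceof Map) {
            return (Map<String, String>) address;
        }
        // VDSM reported a String (or something else): fall back instead of casting.
        return Collections.emptyMap();
    }

    public static void main(String[] args) {
        System.out.println(addressAsMap("{slot=0x03, type=pci}")); // {} - no crash
        Map<String, String> m = new HashMap<>();
        m.put("type", "pci");
        System.out.println(addressAsMap(m)); // {type=pci}
    }
}
</snip>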

Comment 9 Arik 2016-06-21 09:07:59 UTC
Does it still happen? If so, could you please add the output of "vdsClient 0 list" on host1.lab.eng.blr.redhat.com?

Comment 10 SATHEESARAN 2016-06-23 06:02:07 UTC
(In reply to Arik from comment #9)
> Does it still happen? If so, could you please add the output of "vdsClient 0
> list" on host1.lab.eng.blr.redhat.com?

Arik,

I couldn't hit it again.

Comment 11 Tomas Jelinek 2016-06-29 12:18:50 UTC
This is relevant only to 3.6 and it cannot be reproduced anymore. Feel free to reopen if it happens again, and provide all the logs and the result of "vdsClient 0 list" on the relevant host.

