Bug 1390296 - [scale] - NPE for vm monitoring - vm monitor failed
Summary: [scale] - NPE for vm monitoring - vm monitor failed
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Backend.Core
Version: 4.0.5
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Arik
QA Contact: eberman
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-10-31 15:58 UTC by Eldad Marciano
Modified: 2022-06-30 08:04 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-08-14 11:29:07 UTC
oVirt Team: Virt
sbonazzo: ovirt-4.2-


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-46763 0 None None None 2022-06-30 08:04:37 UTC

Description Eldad Marciano 2016-10-31 15:58:39 UTC
Description of problem:

An NPE was thrown during VM monitoring. It happens on some of the hosts: 5 out of 35.

lab conditions:
1 cluster, 12 NFS SDs, 35 bare metal hosts, 3K vms

Also, by looking at the code with rgolan, it looks like VmAnalyzer.java:760 failed to fetch the host from the DB,
even though that host currently exists in the Hosts tab.
Moreover, I verified manually that the host exists in vds_static (DB).

2016-10-31 15:27:28,881 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring] (DefaultQuartzScheduler7) [2ddc507] Failed during vms monitoring on host xyz error is: java.lang.NullPointerException
2016-10-31 15:27:28,881 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring] (DefaultQuartzScheduler7) [2ddc507] Exception:: java.lang.NullPointerException
        at org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer.isVdsNonResponsive(VmAnalyzer.java:760) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer.handOverVm(VmAnalyzer.java:739) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer.proceedDisappearedVm(VmAnalyzer.java:698) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer.analyze(VmAnalyzer.java:144) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring.lambda$analyzeVms$1(VmsMonitoring.java:161) [vdsbroker.jar:]
        at java.util.ArrayList.forEach(ArrayList.java:1249) [rt.jar:1.8.0_102]
        at org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring.analyzeVms(VmsMonitoring.java:156) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring.perform(VmsMonitoring.java:119) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher.poll(PollVmStatsRefresher.java:39) [vdsbroker.jar:]
        at sun.reflect.GeneratedMethodAccessor67.invoke(Unknown Source) [:1.8.0_102]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_102]
        at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_102]
        at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:77) [scheduler.jar:]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:51) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_102]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_102]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_102]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_102]
        at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_102]
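
To make the failure pattern concrete, here is a minimal hypothetical sketch (assumed names, not the actual VmAnalyzer code): a host lookup that can return null, followed by an unchecked dereference of the result, which matches the stack trace above.

// Hypothetical sketch only -- illustrates how a host lookup followed by an
// unchecked dereference throws an NPE when the lookup returns null, either
// because the row is missing or because the id used for the lookup is null.
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class HostLookupSketch {

    enum VdsStatus { UP, NON_RESPONSIVE }

    static class Vds {
        final VdsStatus status;
        Vds(VdsStatus status) { this.status = status; }
    }

    // Stand-in for the host DAO backed by vds_static/vds_dynamic.
    static final Map<UUID, Vds> VDS_DB = new HashMap<>();

    // Loosely analogous to VmAnalyzer.isVdsNonResponsive(): fetch the host
    // and read its status without checking for null first.
    static boolean isVdsNonResponsive(UUID vdsId) {
        Vds vds = VDS_DB.get(vdsId);                    // null if vdsId is null or unknown
        return vds.status == VdsStatus.NON_RESPONSIVE;  // NPE when vds == null
    }

    public static void main(String[] args) {
        UUID knownHost = UUID.randomUUID();
        VDS_DB.put(knownHost, new Vds(VdsStatus.UP));

        System.out.println(isVdsNonResponsive(knownHost)); // false
        System.out.println(isVdsNonResponsive(null));      // throws NullPointerException
    }
}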

Version-Release number of selected component (if applicable):
4.0.5.0-1

How reproducible:
not clear

Steps to Reproduce:
1. Not clear (some of the VMs were being migrated).


Actual results:
VM monitoring failed.

Expected results:
VM monitoring passes.

Additional info:

Comment 2 Michal Skrivanek 2016-11-01 08:33:34 UTC
please add host side logs from that occurrence. Should be b01-h11-r620

Comment 3 Eldad Marciano 2016-11-01 09:50:37 UTC
(In reply to Michal Skrivanek from comment #2)
> please add host side logs from that occurrence. Should be b01-h11-r620

rgolan and I debugged it briefly together.
In the engine we saw running VMs on a host, while the host had no running VMs.
We checked it via virsh and also via vdsClient.

Adding logs and outputs.

Comment 4 Michal Skrivanek 2016-11-01 10:25:31 UTC
(In reply to Eldad Marciano from comment #3)
> (In reply to Michal Skrivanek from comment #2)
> > please add host side logs from that occurrence. Should be b01-h11-r620
> 
> me and rgolan debug it shortyly together,
> in the engine we saw running vms on a host, when the host have no running
> vms.
> we checked it via virsh and and also by vdsClient
> 
> adding logs and outputs.

Yes, and that should never happen without VDSM reporting it as such, hence the need for vdsm.log :)

Comment 5 Eldad Marciano 2016-11-01 10:42:35 UTC
-=>>vdsClient -s 0 getVdsStats |grep vmCount
	vmCount = 0
(root@b01-h30-r620) - (09:52) - (~)
-=>>virsh -r list
setlocale: No such file or directory
 Id    Name                           State
----------------------------------------------------

(root@b01-h30-r620) - (09:52) - (~)

Comment 7 Arik 2016-11-01 21:09:58 UTC
Eldad,
Looking at VM 6a7d47c2-c5be-437e-b608-982a2947ba45 (vm_scale_real_121) - I see that it started to migrate at 09:18:07, failed to migrate to the first host, and then tried to migrate to another host, but this time the engine failed to communicate with the second host and thus the command failed at 9:19:13.
However, the engine detected a transition of this VM from Up to MigrationFrom at 9:19:18 with no prior migrate command.
So:
1. is there a chance that the VM has been migrated using virsh or vdsClient? that would explain a lot.
2. the attached VDSM log doesn't contain this VM at all, from which host did you take it? I need the log from b01-h11-r620.rhev.openstack.engineering.redhat.com that covers the period of 2016-10-31 9:18:00 - 9:21:00 (engine time)

Comment 8 Arik 2016-11-01 21:36:11 UTC
Please ignore everything but the request for the proper VDSM log in comment 7.
Now I understand what was going on:

So the second migration attempt was to b02-h13-r620.rhev.openstack.engineering.redhat.com at 9:19:13. The failed communication with VDSM resulted in:
VDSGenericException: VDSNetworkException: Heartbeat exeeded
In that case the engine doesn't change the VM (so it remained Up with migratingToVds == null), but apparently the migration was actually triggered by VDSM (I know that because the VM eventually appeared on that host at 09:20:46,741).

At 9:20:18 the VM switched to Down on the source as part of the migration process - but migratingToVds was null so we got the NPE later on. The rest of the NPEs seem like a consequence of this issue.
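
To make this failure mode concrete, here is a minimal hypothetical sketch (assumed names, not the actual VmAnalyzer/handOverVm code) of a defensive hand-over path that checks for a missing destination before dereferencing it.

// Hypothetical sketch only -- not the engine's actual hand-over logic.
import java.util.Optional;
import java.util.UUID;
import java.util.function.Function;

public class HandOverSketch {

    enum VdsStatus { UP, NON_RESPONSIVE }

    static class Vds {
        final VdsStatus status;
        Vds(VdsStatus status) { this.status = status; }
    }

    // If the destination host id was never recorded -- e.g. the migrate
    // command timed out before the engine stored it -- bail out with a log
    // message instead of dereferencing null.
    static void handOverVm(UUID migratingToVds, Function<UUID, Optional<Vds>> vdsLookup) {
        if (migratingToVds == null) {
            System.out.println("Destination host unknown, skipping hand-over");
            return;
        }
        boolean destinationNonResponsive = vdsLookup.apply(migratingToVds)
                .map(vds -> vds.status == VdsStatus.NON_RESPONSIVE)
                .orElse(true); // treat a missing DB row as non-responsive

        System.out.println(destinationNonResponsive
                ? "Destination not responding, marking VM state as Unknown"
                : "Handing VM over to destination " + migratingToVds);
    }

    public static void main(String[] args) {
        UUID dest = UUID.randomUUID();
        handOverVm(dest, id -> Optional.of(new Vds(VdsStatus.UP)));
        handOverVm(null, id -> Optional.empty()); // the null-destination case, handled safely
    }
}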

Comment 9 Eldad Marciano 2016-11-07 16:36:54 UTC
(In reply to Arik from comment #8)
> Please ignore everything but the request for the proper VDSM log in comment
> 7.
> Now I understand what was going on:
> 
> So the second migration attempt was to
> b02-h13-r620.rhev.openstack.engineering.redhat.com at 9:19:13. The failed
> communication with VDSM resulted in:
> VDSGenericException: VDSNetworkException: Heartbeat exeeded
> In that case the engine doesn't change the VM (so it remained as Up and
> migratingToVds==null), but apparently the migration was actually triggered
> by VDSM (I know that because the VM eventually appeared on that host at
> 09:20:46,741).
> 
> At 9:20:18 the VM switched to Down on the source as part of the migration
> process - but migratingToVds was null so we got the NPE later on. The rest
> of the NPEs seem like a consequence of this issue.

The hosts have since been re-provisioned.
Either way, we didn't migrate VMs manually; it seems the engine migrated them over time.

Is it possible to avoid migratingToVds being null?

Do you still need the logs?

Comment 10 Michal Skrivanek 2016-11-21 13:32:30 UTC
isn't it addressed by bug 1391933?

Comment 11 Arik 2016-11-22 07:35:59 UTC
(In reply to Michal Skrivanek from comment #10)
> isn't it addressed by bug 1391933?

The NPE - yes.
But the solution should be different in order to properly handle this case of getting a timeout exception while asking VDSM to migrate a VM: we currently produce an audit message that says the migration failed (and lose the destination host, hence the NPE), and then we see the VM in a migrating state and conclude that a migration is in progress. So this is not the right fix. Should we open another bug to improve the handling of this case and move this bug to ON_QA?
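
To illustrate the alternative handling described above, a minimal hypothetical sketch (assumed names, not the engine's actual API): on a network timeout, keep the destination host and report the outcome as unknown rather than as a failure, so later monitoring can correlate the VM if it appears on that host.

// Hypothetical sketch only -- assumed method and field names.
import java.util.UUID;

public class MigrationOutcomeSketch {

    static class VmRecord {
        UUID migratingToVds;
        String auditMessage;
    }

    // Distinguish a definite migration failure from a network timeout whose
    // outcome is unknown: on timeout, keep the destination host id instead of
    // dropping it and reporting a hard failure.
    static void onMigrateCommandFinished(VmRecord vm, UUID destination, boolean networkTimeout) {
        if (networkTimeout) {
            vm.migratingToVds = destination;
            vm.auditMessage = "Migration outcome unknown (destination host not responding); "
                    + "waiting for monitoring to confirm";
        } else {
            vm.migratingToVds = null;
            vm.auditMessage = "Migration failed";
        }
    }

    public static void main(String[] args) {
        VmRecord vm = new VmRecord();
        onMigrateCommandFinished(vm, UUID.randomUUID(), true);
        System.out.println(vm.auditMessage + "; destination=" + vm.migratingToVds);
    }
}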

Comment 12 Michal Skrivanek 2016-11-22 07:38:33 UTC
Indeed, I would use a different bug or change this one. It's of far lower priority, though.

Comment 13 Michal Skrivanek 2016-12-21 09:09:23 UTC
The bug was not addressed in time for 4.1. Postponing to 4.2

Comment 14 Tomas Jelinek 2017-08-14 11:29:07 UTC
The chance this happens is very low and the consequence is basically only an incorrect error message. The fix would be pretty hard to do. Consequently, closing due to capacity.

If this is something which is happening in your case and is important for you to fix, please feel free to reopen.

