Bug 1889394 - VM hosted by non-operational host fails in migration with NullPointerException
Summary: VM hosted by non-operational host fails in migration with NullPointerException
Keywords:
Status: VERIFIED
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Virt
Version: 4.4.3.6
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.4.4
Target Release: 4.4.4
Assignee: Liran Rotenberg
QA Contact: Polina
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-10-19 14:56 UTC by msheena
Modified: 2020-11-18 20:46 UTC

Fixed In Version: ovirt-engine-4.4.4
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
oVirt Team: Virt
pm-rhel: ovirt-4.4+
aoconnor: blocker-


Links
System ID Priority Status Summary Last Updated
oVirt gerrit 111900 master MERGED scheduling: fix NullPointerException on CPU usage 2020-11-27 15:54:30 UTC

Description msheena 2020-10-19 14:56:41 UTC
Description of problem
======================
Given there is a VM running on a host in a cluster
  And the host is non-operational
When a migration of the VM is initiated to another host in the cluster over a dedicated migration network
Then the migration fails with a NullPointerException


Version-Release number of selected component (if applicable)
============================================================
ovirt-engine-4.4.3.6-0.13.el8ev.noarch
vdsm-4.40.33-1.el8ev.x86_64

How reproducible
================
Reproduces in automation tier2 executions

Steps to Reproduce (requires at least 2 physical hosts and one VM) 
==================================================================
1. Create 2 VM networks (required in the cluster): 'net_1', 'net_2'.
2. Start a VM on host_1.
3. Attach net_1 and net_2 to both hosts where each network is bridged to a single NIC.
4. Update the role of net_1 to be the cluster's migration network.
5. Perform an IFDOWN command on host_1 on the NIC bridged with net_2 (this causes host_1 to become non-operational, since the NIC is DOWN and net_2 is a required network in the cluster).
6. Migrate the VM on host_1 to host_2.

Actual results
==============
Migration fails due to NullPointerException.

Expected results
================
Migration succeeds.

Additional info
===============
- The RHV instance is in stand-alone mode (not Hosted-Engine).
- Reproduction occurred in an environment where there was also a third host in the cluster, which was deliberately put into maintenance to exclude it from migration logic and to force migration to the one remaining host in the cluster.

Comment 2 Arik 2020-10-21 07:14:55 UTC
The NPE is not related to VM migration but to run once.
I actually don't see any migration in the log.

And what's the reason the host became non-operational? It's not a common case to have a non-operational host with running VMs on it.

Comment 3 Michael Burman 2020-10-21 07:24:36 UTC
Hi Arik.

1. We have reproduced this NullPointerException twice already. It reproduces during our migration tests, and yes, it fails on the run once VM command.
2. The host is made non-operational on purpose to cover a migration scenario, and that is OK. This test has been running for many years and tests a very specific flow.
3. I don't know what exactly triggers this exception, but a NullPointerException shouldn't happen. Both times we saw it on the run once VM command in these specific tests.

Comment 4 Michael Burman 2020-10-21 07:25:35 UTC
It looks like the host list contains a null in CpuOverloadPolicyUnit.filter(CpuOverloadPolicyUnit.java:68).
This seems like a bug.

Comment 5 Arik 2020-10-21 09:20:38 UTC
I don't think that 'vds' is null, because then we would have failed in SlaValidator#getEffectiveCpuCores (which is called from CpuOverloadPolicyUnit.java:56).
I suspect that the CPU usage is null, but it's not clear what can lead to that.
Liran, can you please take a look?

Comment 6 RHEL Program Management 2020-10-26 12:31:13 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 7 Liran Rotenberg 2020-10-26 14:56:38 UTC
This doesn't seem like a regression, because it doesn't look like we have code changes that cause it.
Is it really an automation blocker, or just one case?

As for the investigation:
It is possible there is a race on the host status: the host is already 'Up' in the engine, but InitVdsOnUpCommand has not finished yet.
We can tell that from:
2020-10-12 03:24:12,748+03 INFO  [org.ovirt.engine.core.bll.RunVmOnceCommand] (default task-8) [vms_syncAction_b7a19089-0a15-425a] Lock freed to object 'EngineLock:{exclusiveLocks='[6c8a82ef-3323-4c84-b770-6982d31be97b=VM]', sharedLocks=''}'
2020-10-12 03:24:12,753+03 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-8) [] Operation Failed: [General command validation failure.]
2020-10-12 03:24:13,040+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-3) [59fd53] EVENT_ID: VDS_DETECTED(13), Status of host host_mixed_2 was set to Up.

Here the run once command started while InitVdsOnUpCommand had not finished, and host_mixed_2 was probably in the list of hosts to schedule on.
We get them from SchedulingManager::fetchHosts, and each host should be in Up state.
When InitVdsOnUpCommand finishes, it populates the VDS with the right values using the getStats command to VDSM.

To verify it, can you check whether it's reproducible manually, or by waiting a few seconds after the host becomes Up in your automation case?
In any case, the easy fix would be to check for a null value and drop that host from the list.
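
A minimal sketch of the null check suggested above, not the actual engine code. It assumes a simplified, hypothetical `Host` class standing in for the engine's VDS object, with a nullable `cpuUsagePercent` field that stays null until host statistics have been reported. The class name, the `filterByCpuLoad` method and the threshold parameter are illustrative assumptions only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class NullCpuUsageFilter {

    /** Hypothetical stand-in for the engine's host object; not the real VDS class. */
    static final class Host {
        final String name;
        final Integer cpuUsagePercent; // null until host statistics have been reported

        Host(String name, Integer cpuUsagePercent) {
            this.name = name;
            this.cpuUsagePercent = cpuUsagePercent;
        }
    }

    /**
     * Keeps only hosts whose CPU usage is known and at or below the threshold.
     * Hosts with a null usage value are dropped instead of being unboxed,
     * which is what would otherwise raise a NullPointerException.
     */
    static List<Host> filterByCpuLoad(List<Host> hosts, int maxCpuPercent) {
        return hosts.stream()
                .filter(h -> h != null)                            // guard against null entries
                .filter(h -> h.cpuUsagePercent != null)            // stats not populated yet
                .filter(h -> h.cpuUsagePercent <= maxCpuPercent)   // safe to unbox here
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Host> candidates = Arrays.asList(
                new Host("host_1", 20),
                new Host("host_2", null),  // e.g. Up, but stats not yet populated
                new Host("host_3", 95));

        for (Host h : filterByCpuLoad(candidates, 80)) {
            System.out.println(h.name);
        }
    }
}
```

With this approach a host that is Up but has no statistics yet is simply not considered for scheduling in that round, rather than failing the whole command.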

Comment 8 msheena 2020-10-27 11:11:38 UTC
This bug blocks a test with a very specific scenario, in which we want to see that a VM is migrated to another host in the cluster in case its host becomes non-operational.

I tried to reproduce it manually several times and had no success. Also, this issue seems to reproduce in tier2 runs and not when running this test on its own, which makes it more likely that it is caused by a race.
I vote in favor of the easy fix.

