Bug 1514853

Summary: Vm run fails due to NPE when host statistics are missing.
Product: [oVirt] ovirt-engine Reporter: Leon Goldberg <lgoldber>
Component: Backend.CoreAssignee: Ravi Nori <rnori>
Status: CLOSED CURRENTRELEASE QA Contact: Israel Pinto <ipinto>
Severity: high Docs Contact:
Priority: high    
Version: 4.1.0.2CC: bugs, lveyde, mgoldboi, michal.skrivanek, mperina, msivak
Target Milestone: ovirt-4.2.2Flags: rule-engine: ovirt-4.2+
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ovirt-engine-4.2.2.4 Doc Type: No Doc Update
Doc Text:
undefined
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-03-29 11:03:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
NPE traceback none

Description Leon Goldberg 2017-11-18 23:00:29 UTC
Created attachment 1354986 [details]
NPE traceback

Description of problem:

Start a VM quick enough post host addition fails due to missing statistics (more specifically, CPU usage) with a NPE.

How reproducible:

I'm not sure if it's reproducible via the UI. A test introduced in https://gerrit.ovirt.org/#/c/83881/ fails at all times (test_live_vm_migration) without sleeping before running the VM.


Actual results:

Running the VM fails with a NPE.


Expected results:

Running the VM is blocked until possible (host status shouldn't be "UP"?)

Comment 1 Martin Sivák 2017-11-22 11:30:39 UTC
The scheduler uses getVdsDao()        .getAllForClusterWithStatus(cluster.getId(), VDSStatus.Up);

I think the host should not be UP until its db records are fully populated. Can someone from the Infra team please take a look?

Comment 2 Oved Ourfali 2017-11-27 13:06:03 UTC
(In reply to Martin Sivák from comment #1)
> The scheduler uses getVdsDao()       
> .getAllForClusterWithStatus(cluster.getId(), VDSStatus.Up);
> 
> I think the host should not be UP until its db records are fully populated.
> Can someone from the Infra team please take a look?

Why host shouldn't be up if it doesn't have statistics?
It can move to up and gather statistics as we go.

I agree it is a race, but I wouldn't tie the statistics with the host status. Imagine that in the future the statistics might come from a different external service. Would we leave the host down until statistics are gathered?

Comment 3 Martin Sivák 2017-11-27 13:11:33 UTC
The scheduler needs to be able to get a list of hosts with all the crucial info present (like free memory). It just can't work without the data.

So we either not put the hosts to UP before that is available (because you can't start any VMs there anyway) or we need API that checks whether stats were collected already. The scheduler is not the right place to put the logic to, it needs (best case) a bool returning filter function.

Comment 4 Oved Ourfali 2017-11-27 13:43:41 UTC
Martin S - Anyhow you should fail with an error rather than get an NPE.
Martin P - your thoughts?

Comment 5 Yaniv Kaul 2017-12-18 07:16:46 UTC
(In reply to Oved Ourfali from comment #2)
> (In reply to Martin Sivák from comment #1)
> > The scheduler uses getVdsDao()       
> > .getAllForClusterWithStatus(cluster.getId(), VDSStatus.Up);
> > 
> > I think the host should not be UP until its db records are fully populated.
> > Can someone from the Infra team please take a look?
> 
> Why host shouldn't be up if it doesn't have statistics?
> It can move to up and gather statistics as we go.
> 
> I agree it is a race, but I wouldn't tie the statistics with the host
> status. Imagine that in the future the statistics might come from a
> different external service. Would we leave the host down until statistics
> are gathered?

We will never require critical statistics from an external service (i.e., not VDSM). We may ask for additional, non-scheduling related stats from an external service, so I agree with Martin Sivak here that we should require the relevant stats prior to setting the host as UP.

Comment 6 Martin Perina 2018-01-17 14:49:04 UTC
Ravi, could you please take a look where exactly we have this issue?

Comment 7 Ravi Nori 2018-02-20 18:32:41 UTC
The status of the host is set to UP before the VdsStatistics for the host is updated. This can cause a NPE if the scheduler tries to run a VM on the host before the VdsStatistics for the host have been updated. 

Setting the status to UP only after the statistics have been updated, fixes the issue.

Comment 8 Israel Pinto 2018-03-20 15:22:58 UTC
Verify with:

Steps:
Case 1:
Set host to maintenance and activate host.
During the activation try to start VM via REST.
Case 2:
Add Host, during the installation try to start VM.


Results:
On both cases, VM failed to start. No NullPointerException found.

Comment 9 Sandro Bonazzola 2018-03-29 11:03:28 UTC
This bugzilla is included in oVirt 4.2.2 release, published on March 28th 2018.

Since the problem described in this bug report should be
resolved in oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.