Created attachment 1354986: NPE traceback

Description of problem:
Starting a VM shortly after adding a host fails with an NPE because the host's statistics (more specifically, CPU usage) are still missing.

How reproducible:
I'm not sure whether it is reproducible via the UI. A test introduced in https://gerrit.ovirt.org/#/c/83881/ (test_live_vm_migration) always fails unless it sleeps before running the VM.

Actual results:
Running the VM fails with an NPE.

Expected results:
Running the VM is blocked until it is possible (perhaps the host status shouldn't be "UP" yet?).
The scheduler uses

getVdsDao().getAllForClusterWithStatus(cluster.getId(), VDSStatus.Up);

I think the host should not be UP until its DB records are fully populated. Can someone from the Infra team please take a look?
(In reply to Martin Sivák from comment #1)
> The scheduler uses getVdsDao()
> .getAllForClusterWithStatus(cluster.getId(), VDSStatus.Up);
>
> I think the host should not be UP until its db records are fully populated.
> Can someone from the Infra team please take a look?

Why shouldn't the host be up if it doesn't have statistics? It can move to up and gather statistics as we go.

I agree it is a race, but I wouldn't tie the statistics to the host status. Imagine that in the future the statistics might come from a different, external service. Would we leave the host down until the statistics are gathered?
The scheduler needs to be able to get a list of hosts with all the crucial info present (like free memory). It just can't work without that data. So either we do not set hosts to UP before the data is available (because you can't start any VMs there anyway), or we need an API that checks whether the stats have already been collected. The scheduler is not the right place for that logic; in the best case it needs a filter function that returns a boolean.
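To make the "boolean filter function" idea concrete, here is a minimal sketch in Java. It is not engine code: `HostInfo`, `HostFilters`, and their fields are hypothetical stand-ins invented for illustration, and the sketch assumes that a statistic which has not been collected yet is represented as null.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical, simplified host record; NOT the engine's real host class.
final class HostInfo {
    final String name;
    final boolean up;
    final Integer usageCpuPercent; // assumed to be null until the first statistics update
    final Integer freeMemoryMb;    // assumed to be null until the first statistics update

    HostInfo(String name, boolean up, Integer usageCpuPercent, Integer freeMemoryMb) {
        this.name = name;
        this.up = up;
        this.usageCpuPercent = usageCpuPercent;
        this.freeMemoryMb = freeMemoryMb;
    }
}

// Hypothetical helper holding the filter the scheduler could call.
final class HostFilters {
    // True only when every statistic the scheduler relies on is present.
    static boolean hasSchedulingStats(HostInfo host) {
        return host.usageCpuPercent != null && host.freeMemoryMb != null;
    }

    // Keeps hosts that are UP and already have their statistics populated.
    static List<HostInfo> schedulable(List<HostInfo> hosts) {
        return hosts.stream()
                .filter(h -> h.up)
                .filter(HostFilters::hasSchedulingStats)
                .collect(Collectors.toList());
    }
}
```

With a filter like this, a freshly added host is simply skipped as a candidate until its statistics arrive, instead of being dereferenced while the values are still null.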
Martin S - Anyhow you should fail with an error rather than get an NPE. Martin P - your thoughts?
(In reply to Oved Ourfali from comment #2)
> (In reply to Martin Sivák from comment #1)
> > The scheduler uses getVdsDao()
> > .getAllForClusterWithStatus(cluster.getId(), VDSStatus.Up);
> >
> > I think the host should not be UP until its db records are fully populated.
> > Can someone from the Infra team please take a look?
>
> Why host shouldn't be up if it doesn't have statistics?
> It can move to up and gather statistics as we go.
>
> I agree it is a race, but I wouldn't tie the statistics with the host
> status. Imagine that in the future the statistics might come from a
> different external service. Would we leave the host down until statistics
> are gathered?

We will never require critical statistics from an external service (i.e., not VDSM). We may ask for additional, non-scheduling related stats from an external service, so I agree with Martin Sivak here that we should require the relevant stats prior to setting the host as UP.
Ravi, could you please check where exactly we have this issue?
The status of the host is set to UP before the VdsStatistics for the host are updated. This can cause an NPE if the scheduler tries to run a VM on the host before its VdsStatistics have been updated. Setting the status to UP only after the statistics have been updated fixes the issue.
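The ordering described above can be illustrated with a small, self-contained Java sketch. It is an assumption-laden model rather than the actual patch: `HostRegistry`, `HostStatus`, and all method names are hypothetical, and two in-memory maps stand in for the host and statistics tables.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical status enum; the real engine has more states.
enum HostStatus { INITIALIZING, UP }

// Hypothetical in-memory stand-in for the host and statistics tables.
final class HostRegistry {
    private final Map<String, HostStatus> status = new HashMap<>();
    private final Map<String, Integer> cpuUsage = new HashMap<>();

    synchronized void addHost(String id) {
        status.put(id, HostStatus.INITIALIZING);
    }

    // Store the statistics first and only then publish the UP status,
    // so no reader can see an UP host without statistics.
    synchronized void activate(String id, int cpuUsagePercent) {
        cpuUsage.put(id, cpuUsagePercent);
        status.put(id, HostStatus.UP);
    }

    synchronized boolean isUp(String id) {
        return status.get(id) == HostStatus.UP;
    }

    // Fails with a descriptive error instead of an NPE when statistics are missing.
    synchronized int cpuUsageOf(String id) {
        Integer usage = cpuUsage.get(id);
        if (usage == null) {
            throw new IllegalStateException("No statistics collected yet for host " + id);
        }
        return usage;
    }
}
```

In this model a caller that only considers hosts for which `isUp` returns true can never hit missing statistics, and a caller that asks too early gets an explicit `IllegalStateException` rather than a NullPointerException.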
Verify with:

Steps:
Case 1: Set the host to maintenance and then activate it. During the activation, try to start a VM via REST.
Case 2: Add a host. During the installation, try to start a VM.

Results:
In both cases, the VM failed to start. No NullPointerException was found.
This bugzilla is included in oVirt 4.2.2 release, published on March 28th 2018. Since the problem described in this bug report should be resolved in oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.