Description of problem
======================
Given there is a VM running on a host in a cluster
And the host is non-operational
When a migration of the VM is initiated to another host in the cluster over a dedicated migration network
Then the migration fails with a NullPointerException

Version-Release number of selected component (if applicable)
============================================================
ovirt-engine-4.4.3.6-0.13.el8ev.noarch
vdsm-4.40.33-1.el8ev.x86_64

How reproducible
================
Reproduces in automation tier2 executions

Steps to Reproduce (requires at least 2 physical hosts and one VM)
==================================================================
1. Create 2 VM networks (required in the cluster): 'net_1', 'net_2'.
2. Start a VM on host_1.
3. Attach net_1 and net_2 to both hosts, where each network is bridged to a single NIC.
4. Update the role of net_1 to be the cluster's migration network.
5. Perform an IFDOWN command on host_1 on the NIC bridged with net_2 (this causes host_1 to become non-operational, since the NIC is DOWN and net_2 is a required network in the cluster).
6. Migrate the VM on host_1 to host_2.

Actual results
==============
Migration fails due to a NullPointerException.

Expected results
================
Migration succeeds.

Additional info
===============
- The RHV instance is in stand-alone mode (not Hosted-Engine).
- The issue also reproduced in an environment where there was a third host in the cluster, which was deliberately put into maintenance to exclude it from the migration logic and to force migration to the other remaining host in the cluster.
The NPE is unrelated to VM migration; it occurs on RunOnce. I actually don't see any migration in the log.
Also, what's the reason the host became non-operational? It's not a common case to have a non-operational host with running VMs on it.
Hi Arik.
1. We have reproduced this NullPointerException twice already, during our migration tests, and yes, it fails on the RunOnce VM command.
2. The host is made non-operational on purpose, to cover a migration scenario, and that is OK. This test has been running for many years and tests a very specific flow.
3. I don't know what exactly triggers this exception, but we saw it twice on the RunOnce VM command in these specific tests. A NullPointerException shouldn't happen.
Looks like the host list contains a null in CpuOverloadPolicyUnit.filter(CpuOverloadPolicyUnit.java:68). Seems like a bug.
I don't think that 'vds' is null, because then we would have failed in SlaValidator#getEffectiveCpuCores (which is called from CpuOverloadPolicyUnit.java:56).
I suspect that the CPU usage is null, but it's not clear what can lead to that.
Liran, can you please take a look?
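To illustrate the suspected failure mode (a sketch with hypothetical names, not the engine's actual code): if the host's CPU usage statistic is a null Integer, auto-unboxing it in a comparison throws exactly this kind of NPE:

```java
public class UnboxingNpeDemo {
    // Hypothetical stand-in for the comparison done by the policy unit.
    static boolean exceedsThreshold(Integer cpuUsagePercent, int threshold) {
        // Auto-unboxing a null Integer throws NullPointerException here.
        return cpuUsagePercent > threshold;
    }

    public static void main(String[] args) {
        System.out.println(exceedsThreshold(42, 80)); // prints false
        try {
            exceedsThreshold(null, 80); // stats not yet populated for the host
        } catch (NullPointerException e) {
            System.out.println("NPE on unboxing null CPU usage");
        }
    }
}
```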
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
This doesn't seem like a regression, because there don't appear to be code changes that caused it. Is it really an automation blocker, or a single case?

As for the investigation: there may be a race on the host status, where the host is already 'Up' in the engine but InitVdsOnUpCommand hasn't finished yet. We can tell that because of:

2020-10-12 03:24:12,748+03 INFO  [org.ovirt.engine.core.bll.RunVmOnceCommand] (default task-8) [vms_syncAction_b7a19089-0a15-425a] Lock freed to object 'EngineLock:{exclusiveLocks='[6c8a82ef-3323-4c84-b770-6982d31be97b=VM]', sharedLocks=''}'
2020-10-12 03:24:12,753+03 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-8) [] Operation Failed: [General command validation failure.]
2020-10-12 03:24:13,040+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-3) [59fd53] EVENT_ID: VDS_DETECTED(13), Status of host host_mixed_2 was set to Up.

Here the RunOnce command started while InitVdsOnUpCommand hadn't finished, and host_mixed_2 was probably in the list of hosts to schedule on. We get them from SchedulingManager::fetchHosts, and each host should be in Up state. When InitVdsOnUpCommand finishes, it populates the VDS with the right values using a getStats command to VDSM.

To verify it, can you check if it's reproducible manually, or by waiting a few seconds after the host becomes Up in your automation case?

In any case, the easy fix would be checking for a null value and dropping that host from the list.
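A minimal sketch of that easy fix (class and field names here are simplified stand-ins for the engine's VDS object, not the actual implementation): before applying the CPU-overload comparison, drop any host whose usage statistic hasn't been populated yet:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class CpuOverloadFilterSketch {
    // Simplified stand-in for the engine's VDS host object (hypothetical fields).
    static class Host {
        final String name;
        final Integer cpuUsagePercent; // null until InitVdsOnUpCommand populates stats

        Host(String name, Integer cpuUsagePercent) {
            this.name = name;
            this.cpuUsagePercent = cpuUsagePercent;
        }
    }

    // Keep only hosts that have stats and are below the utilization threshold;
    // hosts with a null usage value (stats pending) are dropped instead of
    // causing an NPE on unboxing.
    static List<Host> filter(List<Host> hosts, int highUtilization) {
        return hosts.stream()
                .filter(h -> h.cpuUsagePercent != null)
                .filter(h -> h.cpuUsagePercent < highUtilization)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Host> candidates = new ArrayList<>();
        candidates.add(new Host("host_mixed_1", 20));
        candidates.add(new Host("host_mixed_2", null)); // just came Up, racing InitVdsOnUpCommand
        candidates.add(new Host("host_mixed_3", 95));

        for (Host h : filter(candidates, 80)) {
            System.out.println(h.name); // prints host_mixed_1
        }
    }
}
```

With this approach, a host racing InitVdsOnUpCommand is simply skipped for scheduling on that attempt rather than failing the whole RunOnce command.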
This bug blocks a test with a very specific scenario in which we want to see that a VM is migrated to another host in the cluster in case its host becomes non-operational.
I tried to reproduce it manually several times without success. Also, this issue seems to reproduce in tier2 and not when running this test on its own, which adds to the probability that this is caused by a race.
I vote in favor of the easy fix.
This bugzilla is included in oVirt 4.4.4 release, published on December 21st 2020. Since the problem described in this bug report should be resolved in oVirt 4.4.4 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.