Description of problem:
After an invalid CPU hot-unplug, the restart of the VM fails.

Version-Release number of selected component (if applicable):
http://bob-dr.lab.eng.brq.redhat.com/builds/4.4/rhv-4.4.0-14

How reproducible:
100%

Steps to Reproduce:
1. Configure a VM with 'Total Virtual CPUs' = 4 (Edit/System tab), pin it to a host and 2 NUMA nodes. Start it.
2. Update the CPUs to 8. Shut down the VM and start it again.
3. Try to hot-unplug CPUs from 8 to 2:

2020-01-20 11:28:40,188+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-1) [55d41ee9] EVENT_ID: FAILED_HOT_SET_NUMBER_OF_CPUS(2,034), Failed to hot set number of CPUS to VM numa_test. Underlying error message: Hot un-plugging a CPU is not supported for the guest OS Other OS and architecture x86_64.

4. Shut down the VM and start it again:

2020-01-20 11:53:46,521+02 WARN [org.ovirt.engine.core.bll.RunVmCommand] (EE-ManagedThreadFactory-engine-Thread-1174) [] Validation of action 'RunVm' failed for user admin@internal-authz. Reasons: VAR__ACTION__RUN,VAR__TYPE__VM,VAR__ACTION__RUN,VAR__TYPE__VM,VAR__ACTION__RUN,VAR__TYPE__VM,VAR__ACTION__RUN,VAR__TYPE__VM,SCHEDULING_ALL_HOSTS_FILTERED_OUT,VAR__FILTERTYPE__INTERNAL,$hostName host_mixed_3,$filterName CPU,VAR__DETAIL__NOT_ENOUGH_CORES,SCHEDULING_HOST_FILTERED_REASON_WITH_DETAIL
2020-01-20 11:53:46,522+02 INFO [org.ovirt.engine.core.bll.RunVmCommand] (EE-ManagedThreadFactory-engine-Thread-1174) [] Lock freed to object 'EngineLock:{exclusiveLocks='[0676b116-8588-4b6b-b9d0-f97816bc54a5=VM]', sharedLocks=''}'
2020-01-20 11:53:46,538+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-1174) [] EVENT_ID: USER_FAILED_RUN_VM(54), Failed to run VM nu

Actual results:
VM fails to start.

Expected results:
If hot-unplugging didn't work, nothing should affect the restart in this scenario, because nothing actually changed.

Additional info:
This is NOTABUG. We could provide better logging about the operation failing, but the inability to hot-unplug without an agent is well-documented. The VM shows a pending configuration change in the UI, which is applied on shutdown. After that, the scheduler (correctly) will not start it. Where is the bug here, other than improving the audit logs in RHVM?
If this is expected behavior, how is the user supposed to start such a VM?
We would expect them to read the logs for the scheduler error, and to not set the number of CPUs less than what their NUMA config specified. Does this work without a manual NUMA config? Passing scheduler messages back up to RHVM is something we can definitely improve on in 4.5, though
Also, this is not the case where the VM shows a pending configuration in the UI ("Desktop with newer configuration for the next run. Pending virtual machine changes"). The yellow UI icon does not appear in the case I describe, and there is no snapshot for the next configuration in the Snapshots table. The CPU number remains unchanged when the VM is shut down. To the user it looks as if nothing was changed in the VM, and yet the VM could not be restarted.
(In reply to Ryan Barry from comment #3)
> We would expect them to read the logs for the scheduler error, and to not
> set the number of CPUs less than what their NUMA config specified. Does this
> work without a manual NUMA config?
>
> Passing scheduler messages back up to RHVM is something we can definitely
> improve on in 4.5, though

This happens with a NUMA configuration only. Please see comment 4.
Also, we are not setting the NUMA config lower than the CPU number here; the NUMA node count is 2.
Ok, so there are 2 potential bugs (no pending configuration change and maybe scheduler). Andrej, want to take a look?
I think the pending configuration should only be involved when the 'Apply later' option is chosen, but that is not the case here.
The issue is probably the CPU hotplug, not the unplug. The validation of CPU hotplug does not consider the host where the VM is running, so it is possible to increase the number of CPU cores beyond what the current host has. When the VM is shut down and started again, the scheduler checks which host has enough cores to run the VM, and in this case it appears that the host has fewer CPU cores than the VM requires. Can you check?
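A minimal sketch of the capacity check described above (hypothetical names and logic; the real validation lives in the Java engine code, and the root cause later turned out to be the domain XML rather than raw host capacity):

```python
def validate_cpu_hotplug(requested_vcpus, host_online_cpus):
    """Reject a hot-plug request asking for more vCPUs than the current
    host has online CPUs.  Illustrative only: the real oVirt validation
    is more involved (NUMA pinning, threads per core, etc.)."""
    if requested_vcpus > host_online_cpus:
        return False, (f"Requested {requested_vcpus} vCPUs, but the host "
                       f"has only {host_online_cpus} online CPUs")
    return True, "OK"

# Example: a 24-CPU host (as in the lscpu output below) accepts 8 vCPUs
# but would reject 32.
print(validate_cpu_hotplug(8, 24))
print(validate_cpu_hotplug(32, 24))
```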
Yes, it could be the case.

[root@puma43 qemu]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    2
Core(s) per socket:    6
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 45
Model name:            Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
Stepping:              7
CPU MHz:               2317.255
CPU max MHz:           2800.0000
CPU min MHz:           1200.0000
BogoMIPS:              4600.03
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              15360K
NUMA node0 CPU(s):     0-5,12-17
NUMA node1 CPU(s):     6-11,18-23
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts flush_l1d
Created attachment 1654826 [details]
engine.log

Reproduction in the attached engine.log:

2020-01-23 13:37:13,577+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-67116) [6779f16c] EVENT_ID: USER_FAILED_RUN_VM(54), Failed to run VM reproduce_start_problem (User: admin@internal-authz)
The cause is an incorrect domain XML sent to libvirt. In the <numa> tag there are more CPUs specified than the VM has.

<domain>
  <vcpu current="8">16</vcpu>
  ...
  <cpu match="exact">
    <model>Westmere</model>
    <topology cores="1" threads="1" sockets="16"/>
    <numa>
      <cell id="0" cpus="0,8-14" memory="524288"/>
      <cell id="1" cpus="1,15-21" memory="524288"/>
    </numa>
  </cpu>
  ...
</domain>

It is probably caused by a patch from Bug 1437559.
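One way to see the inconsistency mechanically: every vCPU id referenced in a <numa> cell must be below the domain's maximum vCPU count (16 here), but the cells reference ids up to 21. A small illustrative helper (not engine code):

```python
def valid_numa_cells(cells, max_vcpus):
    """Return True if every vCPU id in the <numa> cells is < max_vcpus."""
    return all(cpu < max_vcpus for cell in cells for cpu in cell)

# The cells from the faulty XML: cpus="0,8-14" and cpus="1,15-21"
faulty = [[0] + list(range(8, 15)), [1] + list(range(15, 22))]
print(valid_numa_cells(faulty, 16))  # -> False: ids 16-21 are out of range
```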
The patches on Bug 1437559 have been merged and they should fix this bug too.
This bug is targeting 4.4.2 and is in MODIFIED state. Can we retarget to 4.4.0 and move to QE?
Verified on http://bob-dr.lab.eng.brq.redhat.com/builds/4.4/rhv-4.4.0-27
This bugzilla is included in oVirt 4.4.0 release, published on May 20th 2020. Since the problem described in this bug report should be resolved in oVirt 4.4.0 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.