Bug 1792944 - VM fails to start after a previous unsuccessful, invalid attempt at CPU unplugging
Summary: VM fails to start after a previous unsuccessful, invalid attempt at CPU unplugging
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Virt
Version: 4.4.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.4.0
Assignee: Andrej Krejcir
QA Contact: meital avital
URL:
Whiteboard:
Depends On: 1437559
Blocks:
 
Reported: 2020-01-20 12:37 UTC by Polina
Modified: 2020-05-20 19:59 UTC (History)
CC List: 3 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2020-05-20 19:59:43 UTC
oVirt Team: Virt
Embargoed:
pm-rhel: ovirt-4.4+


Attachments
engine.log (5.68 MB, text/plain)
2020-01-23 11:44 UTC, Polina

Description Polina 2020-01-20 12:37:08 UTC
Description of problem: after an invalid attempt to hot-unplug CPUs, restarting the VM fails.

Version-Release number of selected component (if applicable):
http://bob-dr.lab.eng.brq.redhat.com/builds/4.4/rhv-4.4.0-14

How reproducible: 100%


Steps to Reproduce:
1. Configure a VM with 'Total Virtual CPUs' = 4 (Edit / System tab), pin it to a host and 2 NUMA nodes, and start it.
2. Update the CPU count to 8. Shut down the VM and start it again.
3. Try to hot-unplug CPUs from 8 down to 2:

    2020-01-20 11:28:40,188+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-1) [55d41ee9] EVENT_ID: FAILED_HOT_SET_NUMBER_OF_CPUS(2,034), Failed to hot set number of CPUS to VM numa_test. Underlying error message: Hot un-plugging a CPU is not supported for the guest OS Other OS and architecture x86_64.

4. Shut down the VM and start it again:


2020-01-20 11:53:46,521+02 WARN  [org.ovirt.engine.core.bll.RunVmCommand] (EE-ManagedThreadFactory-engine-Thread-1174) [] Validation of action 'RunVm' failed for user admin@internal-authz. Reasons: VAR__ACTION__RUN,VAR__TYPE__VM,VAR__ACTION__RUN,VAR__TYPE__VM,VAR__ACTION__RUN,VAR__TYPE__VM,VAR__ACTION__RUN,VAR__TYPE__VM,SCHEDULING_ALL_HOSTS_FILTERED_OUT,VAR__FILTERTYPE__INTERNAL,$hostName host_mixed_3,$filterName CPU,VAR__DETAIL__NOT_ENOUGH_CORES,SCHEDULING_HOST_FILTERED_REASON_WITH_DETAIL
2020-01-20 11:53:46,522+02 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (EE-ManagedThreadFactory-engine-Thread-1174) [] Lock freed to object 'EngineLock:{exclusiveLocks='[0676b116-8588-4b6b-b9d0-f97816bc54a5=VM]', sharedLocks=''}'
2020-01-20 11:53:46,538+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-1174) [] EVENT_ID: USER_FAILED_RUN_VM(54), Failed to run VM nu

Actual results:
VM fails to start 

Expected results: 
If hot unplugging did not work, it should not affect the restart in this scenario, because nothing actually changed.


Additional info:

Comment 1 Ryan Barry 2020-01-21 01:06:39 UTC
This is NOTABUG. We could provide better logging about the operation failing, but the inability to hot unplug without an agent is well-documented. The VM shows a pending configuration change in the UI, which is applied on shutdown.

After this, the scheduler (correctly) will not start it.

Where is the bug here, other than the need for better audit logs in RHVM?

Comment 2 Polina 2020-01-21 14:14:15 UTC
If this is expected behavior, how is the user supposed to start such a VM?

Comment 3 Ryan Barry 2020-01-21 15:08:38 UTC
We would expect them to read the logs for the scheduler error, and to not set the number of CPUs less than what their NUMA config specified. Does this work without a manual NUMA config?

Passing scheduler messages back up to RHVM is something we can definitely improve on in 4.5, though.

Comment 4 Polina 2020-01-21 15:10:17 UTC
Also, this is not the case where the VM shows a pending configuration in the UI ("Desktop with newer configuration for the next run. Pending virtual machine changes"); there is no such yellow UI icon for the case I describe. There is also no snapshot for the next configuration in the snapshots table. The CPU number remains unchanged when the VM is shut down. To the user it looks as though nothing was changed in the VM, and yet the VM could not be restarted.

Comment 5 Polina 2020-01-21 15:12:59 UTC
(In reply to Ryan Barry from comment #3)
> We would expect them to read the logs for the scheduler error, and to not
> set the number of CPUs less than what their NUMA config specified. Does this
> work without a manual NUMA config?
> 
> Passing scheduler messages back up to RHVM is something we can definitely
> improve on in 4.5, though

This happens only with a NUMA configuration. Please see comment 4.

Comment 6 Polina 2020-01-21 15:14:53 UTC
Also, we do not set the NUMA config to less than the CPU number here; NUMA is 2.

Comment 7 Ryan Barry 2020-01-21 15:27:52 UTC
Ok, so there are 2 potential bugs (no pending configuration change and maybe scheduler). Andrej, want to take a look?

Comment 8 Polina 2020-01-22 09:37:19 UTC
I think that a pending configuration must be involved when the 'Apply later' option is chosen, but this is not the case here.

Comment 9 Andrej Krejcir 2020-01-22 11:10:24 UTC
The issue is probably the CPU hotplug, not the unplug. The validation of CPU hotplug does not consider the host where the VM is running, so it is possible to increase the number of CPU cores to more than the current host has. When the VM is shut down and started again, the scheduler checks which host has enough cores to run the VM.

In this case it appears that the host has fewer CPU cores than the VM requires. Can you check?
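
For illustration, a minimal sketch of the kind of host-aware check described above, in plain Java. The class and method names are hypothetical and this is not actual ovirt-engine code; it only shows the idea of validating a hot-plug request against the logical CPU count of the host the VM currently runs on.

public class CpuHotPlugCheck {

    /**
     * Hypothetical host-aware validation: returns an error message when the
     * requested vCPU count exceeds the logical CPU count of the host the VM
     * is currently running on, or null when the request is acceptable.
     */
    static String validateHotPlug(int requestedVcpus, int hostLogicalCpus) {
        if (requestedVcpus > hostLogicalCpus) {
            return String.format(
                    "Cannot hot plug to %d vCPUs: the current host has only %d logical CPUs",
                    requestedVcpus, hostLogicalCpus);
        }
        return null;
    }

    public static void main(String[] args) {
        // The host in comment 10 reports 24 logical CPUs, so 4 -> 8 passes (prints null)...
        System.out.println(validateHotPlug(8, 24));
        // ...while a request beyond the host capacity is rejected up front.
        System.out.println(validateHotPlug(32, 24));
    }
}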

Comment 10 Polina 2020-01-23 09:14:28 UTC
Yes, it could be the case.

[root@puma43 qemu]# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              24
On-line CPU(s) list: 0-23
Thread(s) per core:  2
Core(s) per socket:  6
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               45
Model name:          Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
Stepping:            7
CPU MHz:             2317.255
CPU max MHz:         2800.0000
CPU min MHz:         1200.0000
BogoMIPS:            4600.03
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            15360K
NUMA node0 CPU(s):   0-5,12-17
NUMA node1 CPU(s):   6-11,18-23
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts flush_l1d

Comment 11 Polina 2020-01-23 11:44:33 UTC
Created attachment 1654826 [details]
engine.log

Reproduced; see the attached engine.log:
2020-01-23 13:37:13,577+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-67116) [6779f16c] EVENT_ID: USER_FAILED_RUN_VM(54), Failed to run VM reproduce_start_problem  (User: admin@internal-authz)

Comment 12 Andrej Krejcir 2020-01-28 12:13:21 UTC
The cause is an incorrect domain XML sent to libvirt. In the <numa> tag there are more CPUs specified than the VM has.


<domain>
  <vcpu current="8">16</vcpu>
  ...
  <cpu match="exact">
    <model>Westmere</model>
    <topology cores="1" threads="1" sockets="16"/>
    <numa>
      <cell id="0" cpus="0,8-14" memory="524288"/>
      <cell id="1" cpus="1,15-21" memory="524288"/>
    </numa>
  </cpu>
...
</domain>


It is probably caused by a patch from Bug 1437559.
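
For illustration, a minimal sketch (plain Java with hypothetical types, not the actual fix from Bug 1437559) of one consistency check that the XML above fails: with <vcpu current="8">16</vcpu> the valid CPU ids are 0-15, yet the second <cell> references ids up to 21.

import java.util.List;

public class NumaCellCheck {

    // Hypothetical representation of one <cell> element from the <numa> section.
    record Cell(int id, List<Integer> cpuIds) {}

    /**
     * Checks that every CPU id referenced by a NUMA cell lies within
     * 0..maxVcpus-1; ids outside that range refer to vCPUs the VM does not have.
     */
    static boolean cellsAreConsistent(int maxVcpus, List<Cell> cells) {
        return cells.stream()
                .flatMap(cell -> cell.cpuIds().stream())
                .allMatch(id -> id >= 0 && id < maxVcpus);
    }

    public static void main(String[] args) {
        // The cells from the domain XML above: valid ids are 0..15, but the
        // second cell references ids 16-21, so this prints false.
        List<Cell> cells = List.of(
                new Cell(0, List.of(0, 8, 9, 10, 11, 12, 13, 14)),
                new Cell(1, List.of(1, 15, 16, 17, 18, 19, 20, 21)));
        System.out.println(cellsAreConsistent(16, cells));
    }
}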

Comment 13 Andrej Krejcir 2020-01-30 08:34:36 UTC
The patches on Bug 1437559 have been merged and they should fix this bug too.

Comment 14 Sandro Bonazzola 2020-03-13 10:22:07 UTC
This bug is targeting 4.4.2 and is in MODIFIED state. Can we retarget it to 4.4.0 and move it to QE?

Comment 16 Sandro Bonazzola 2020-05-20 19:59:43 UTC
This bugzilla is included in the oVirt 4.4.0 release, published on May 20th 2020.

Since the problem described in this bug report should be resolved in the oVirt 4.4.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

