This bug has been migrated to another issue tracking site. It has been closed here and may no longer be monitored.

If you would like to receive updates on this issue, or to participate in it, you may do so at the Red Hat Issue Tracker.
Bug 2228103 - virt-launcher does not start with isolateEmulator thread and an even CPU count
Keywords:
Status: CLOSED MIGRATED
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 4.14.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.15.0
Assignee: Ram Lavi
QA Contact: Kedar Bidarkar
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2023-08-01 12:12 UTC by Orel Misan
Modified: 2024-07-31 06:13 UTC (History)
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-12-14 16:14:50 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github kubevirt kubevirt pull 10593 0 None Merged isolateEmulatorThread: Add full-pcpu-only support 2024-01-04 06:38:03 UTC
Github kubevirt kubevirt pull 10757 0 None Merged [release 1.1] isolateEmulatorThread: Add full-pcpu-only support 2024-01-04 06:38:05 UTC
Github kubevirt kubevirt pull 10783 0 None Merged Housekeeping cgroup: Add full-pcpu-only support 2023-12-03 08:06:28 UTC
Github kubevirt kubevirt pull 10839 0 None Merged virt-launcher, vcpu: Fix EmulatorThreadPin assign strategy 2024-01-04 06:38:06 UTC
Github kubevirt kubevirt pull 10872 0 None Merged IsolateEmulatorThread: Add cluster-wide parity completion setting 2024-01-04 06:38:07 UTC
Red Hat Issue Tracker   CNV-31584 0 None None None 2023-12-14 16:14:49 UTC

Description Orel Misan 2023-08-01 12:12:00 UTC
Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
Create the following VMI:
```
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  annotations:
    cpu-load-balancing.crio.io: disable
    cpu-quota.crio.io: disable
    irq-load-balancing.crio.io: disable
  name: dpdk-vmi
spec:
  domain:
    cpu:
      cores: 8
      dedicatedCpuPlacement: true
      isolateEmulatorThread: true
      model: host-model
      sockets: 1
      threads: 1
    devices:
      disks:
        - disk:
            bus: virtio
          name: rootdisk
        - disk:
            bus: virtio
          name: cloudinitdisk
      networkInterfaceMultiqueue: true
      rng: {}
    memory:
      guest: 4Gi
      hugepages:
        pageSize: 1Gi
  terminationGracePeriodSeconds: 0
  volumes:
    - containerDisk:
        image: 'quay.io/kiagnose/kubevirt-dpdk-checkup-vm:main'
        imagePullPolicy: Always
      name: rootdisk
    - cloudInitNoCloud:
        userData: |
          #cloud-config
          user: cloud-user
          password: 0tli-pxem-xknu
          chpasswd:
            expire: false
      name: cloudinitdisk
```

Actual results:
virt-launcher pod does not start.
kubelet emits the following event:
```
SMT Alignment Error: requested 9 cpus not multiple cpus per core = 2
```
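The arithmetic behind the rejection is simple: 8 guest vCPUs plus the 1 CPU that isolateEmulatorThread adds makes 9, which is not a multiple of the node's 2 threads per core. A minimal sketch of the admission check (an illustration, not kubelet's actual code):

```python
def smt_alignment_ok(requested_cpus: int, cpus_per_core: int = 2) -> bool:
    """With the full-pcpus-only CPU manager option, a pod is admitted only
    if its CPU request is a whole number of physical cores."""
    return requested_cpus % cpus_per_core == 0

# 8 guest vCPUs + 1 isolated emulator thread = 9 requested CPUs
assert not smt_alignment_ok(8 + 1)  # rejected with an SMT Alignment Error
assert smt_alignment_ok(8)          # an even request would be admitted
```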


Expected results:
The VMI should start.

Additional info:

Comment 1 Ram Lavi 2023-09-06 11:34:11 UTC
Note that this problem occurs only on nodes where SMT is enabled (it is enabled by default on most Intel processors).

Comment 2 Itamar Holder 2023-09-07 10:45:51 UTC
I'm closing this as not-a-bug; let me explain why.

There are 3 levels of resource allocations:
1) The guest's virtual topology
2) The Pod's resources
3) The host's resources

As part of our VM manifests, we support the first two levels. For example, a VM creator can decide to define the guest with 1 socket, 4 cores and 2 threads, and the pod to be defined with 4 CPUs. However, we do not have any support for deciding how the host will provide these resources to the pod.

Actually, the error comes from Kubernetes' CPU Manager policies. I think it was added here [1][2][3]. In any case, components like the CPU Manager and OpenShift's Performance Addon Operator are responsible for configuring how pods are allocated host resources.

For this reason, this is outside of CNV's scope and needs to be fixed elsewhere. We could consider reaching out to upstream Kubernetes or OpenShift's PAO for a fix on their side.
I would personally love to help push this into Kubernetes if we think it's the right path forward. But in any case, this is IMO a feature request more than a bug.

In the meantime, the workaround is simple: allocate one more CPU. This is not a bad workaround IMO, as (with the right policy from the other components I mentioned) it would just mean that the second hyperthread of the same core stays empty, preserving the high performance needed for such VMs.
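Applied to the VMI from the description, the workaround amounts to requesting one more guest CPU, so that guest vCPUs plus the emulator-thread CPU come out even (9 + 1 = 10). An illustrative fragment (not a tested manifest):

```
spec:
  domain:
    cpu:
      cores: 9    # 9 guest vCPUs + 1 emulator-thread CPU = 10, a whole number of physical cores
      dedicatedCpuPlacement: true
      isolateEmulatorThread: true
```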

Another important note: there is no relation between the guest's virtual topology and the host's topology. These two are completely independent of one another.

[1] https://github.com/kubernetes/enhancements/issues/2625
[2] https://github.com/kubernetes/enhancements/pull/2626
[3] https://github.com/kubernetes/kubernetes/pull/101432

Comment 3 sgott 2023-10-02 13:45:29 UTC
Because both RT and DPDK workloads will allocate an odd number of pCPUs by default, this issue is unavoidable, and our out-of-the-box behavior is therefore problematic. Re-opening this BZ for that reason.

Comment 4 Orel Misan 2023-10-02 13:55:17 UTC
High-performance VMs, such as those running DPDK or real-time workloads, are sensitive to interference from neighboring workloads.

It is possible to set the kubelet's CPU manager policy to static with the `full-pcpus-only` option [1], so that it only allocates full physical cores; on nodes with SMT enabled, that means allocating both threads of each core.
The implication is that pods requesting an odd number of CPUs are rejected by the kubelet after being scheduled to the node, emitting the following event:

“SMT Alignment Error: requested 9 cpus not multiple cpus per core = 2”
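For context, the policy in question is enabled on the kubelet roughly as follows (a sketch of the upstream KubeletConfiguration fields; on OpenShift this is normally driven through a PerformanceProfile rather than set directly):

```
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
cpuManagerPolicyOptions:
  full-pcpus-only: "true"
```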

We have created an automated test (kubevirt-dpdk-checkup) that creates two VirtualMachineInstances and runs DPDK traffic between them.

The two VMIs the checkup creates have an even number of CPUs.
Before we started using KubeVirt's IsolateEmulatorThread [2] option, the checkup suffered from packet loss.

When the IsolateEmulatorThread option is set, KubeVirt adds an additional CPU to the virt-launcher pod's requests and limits, to be used by QEMU.
This, in combination with the full-pcpus-only CPU manager policy, causes the virt-launcher pod to be rejected.

As a workaround, we disabled SMT on our worker nodes.
This workaround is unacceptable to our stakeholders.

We wish for a solution in which KubeVirt adds extra CPUs to the virt-launcher pod so that the total number of CPUs is even.
This behavior is wanted only when the full-pcpus-only CPU manager policy is enabled.
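The requested behavior can be sketched as a small padding rule (an illustration of the idea, not KubeVirt's actual implementation):

```python
def emulator_thread_cpus(guest_cpus: int, full_pcpus_only: bool) -> int:
    """isolateEmulatorThread normally adds 1 CPU to the pod; when the
    full-pcpus-only policy is active, pad so the pod total is even."""
    extra = 1  # the isolated emulator thread
    if full_pcpus_only and (guest_cpus + extra) % 2 != 0:
        extra += 1  # complete the physical core
    return extra

assert emulator_thread_cpus(8, full_pcpus_only=False) == 1  # pod total: 9
assert emulator_thread_cpus(8, full_pcpus_only=True) == 2   # pod total: 10, even
```

With an odd guest CPU count (e.g. 7), the single emulator-thread CPU already makes the total even, so no padding is needed.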


We use the following MCP:
```
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-dpdk
  labels:
    machineconfiguration.openshift.io/role: worker-dpdk
spec:
  machineConfigSelector:
    matchExpressions:
      - key: machineconfiguration.openshift.io/role
        operator: In
        values:
          - worker
          - worker-dpdk
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-dpdk: ""
```

We use the following PerformanceProfile:
```
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: profile-1
spec:
  cpu:
    isolated: 4-39,44-79
    reserved: 0-3,40-43
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - count: 32
      node: 0
      size: 1G
  net:
    userLevelNetworking: true
  nodeSelector:
    node-role.kubernetes.io/worker-dpdk: ""
  numa:
    topologyPolicy: single-numa-node
```

[1] https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#static-policy-options

[2] https://kubevirt.io/user-guide/virtual_machines/dedicated_cpu_resources/#requesting-dedicated-cpu-for-qemu-emulator

Comment 5 Ram Lavi 2023-10-16 09:24:29 UTC
design draft: https://github.com/kubevirt/community/pull/247

Comment 6 Ram Lavi 2023-10-22 12:47:21 UTC
implementation PR draft: https://github.com/kubevirt/kubevirt/pull/10593

Comment 7 Ram Lavi 2023-11-21 09:24:27 UTC
backport PR to 4.15
https://github.com/kubevirt/kubevirt/pull/10757

