Description of problem:

For guests which require timely injection of PIT interrupts, the kvm-pit thread should be pinned to the same pCPU as vcpu0.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Start a guest which uses the PIT timer (RHEL-5 with clock=pit, for example).
2. Measure the rate of interrupt injection with tracing (a rough trace-parsing sketch is given at the end of this comment).
3.

Actual results:

kvm-pit runs on pCPU-0 and, due to additional load on CPU-0, might not execute for long periods of time:

          <idle>-0     [000] d.h. 102470.847586: hrtimer_expire_entry: hrtimer=00000000e3dea339 function=pit_timer_fn [kvm] now=102463903132805
          <idle>-0     [000] d.h. 102470.847587: sched_waking: comm=kvm-pit/67 pid=1212317 prio=120 target_cpu=001
          <idle>-0     [001] d... 102470.847590: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=kvm-pit/67 next_pid=1212317 next_prio=120
  kvm-pit/67-1212317   [001] d... 102470.847594: sched_switch: prev_comm=kvm-pit/67 prev_pid=1212317 prev_prio=120 prev_state=D ==> next_comm=swapper/1 next_pid=0 next_prio=120
  CPU 0/KVM-1212312    [006] d... 102470.847596: kvm_entry: vcpu 0, rip 0x35fd
  CPU 0/KVM-1212312    [006] d... 102470.847597: kvm_exit: vcpu 0 reason EOI_INDUCED rip 0x1bab info1 0x0000000000000022 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000
  CPU 0/KVM-1212312    [006] d... 102470.847598: kvm_entry: vcpu 0, rip 0x1bab
  ...
  CPU 0/KVM-1212312    [006] d... 102470.847622: kvm_exit: vcpu 0 reason HLT rip 0x35fc info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000
       <...>-64244     [000] d.h. 102470.857590: hrtimer_expire_entry: hrtimer=00000000e3dea339 function=pit_timer_fn [kvm] now=102463913135899
       <...>-64244     [000] d.h. 102470.857599: sched_waking: comm=kvm-pit/67 pid=1212317 prio=120 target_cpu=001
  kube-apiserver-64649 [004] d.h. 102470.860004: hrtimer_expire_entry: hrtimer=00000000f5450c52 function=pit_timer_fn [kvm] now=102463915550678
       <...>-64245     [003] d.h. 102470.863029: hrtimer_expire_entry: hrtimer=000000004736fe96 function=pit_timer_fn [kvm] now=102463918575518
       <...>-64244     [000] d.h. 102470.867587: hrtimer_expire_entry: hrtimer=00000000e3dea339 function=pit_timer_fn [kvm] now=102463923133350
       <...>-64244     [000] d.h. 102470.869780: hrtimer_expire_entry: hrtimer=000000005deb41ca function=pit_timer_fn [kvm] now=102463925326017
       <...>-64244     [000] d... 102470.877465: sched_switch: prev_comm=kube-apiserver prev_pid=64244 prev_prio=120 prev_state=S ==> next_comm=kvm-pit/67 next_pid=1212317 next_prio=120
  kvm-pit/67-1212317   [000] d... 102470.877483: sched_switch: prev_comm=kvm-pit/67 prev_pid=1212317 prev_prio=120 prev_state=D ==> next_comm=cluster-policy- next_pid=60163 next_prio=120
  CPU 0/KVM-1212312    [006] d... 102470.877486: kvm_entry: vcpu 0, rip 0x35fd
  CPU 0/KVM-1212312    [006] d... 102470.877490: kvm_exit: vcpu 0 reason EOI_INDUCED rip 0x1bab info1 0x0000000000000022 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000
  CPU 0/KVM-1212312    [006] d... 102470.877491: sched_waking: comm=kvm-pit/67 pid=1212317 prio=120 target_cpu=000
  CPU 0/KVM-1212312    [006] d... 102470.877494: kvm_entry: vcpu 0, rip 0x1bab

Expected results:

PIT timer interrupt injection should not deviate more than 500us from the expected (armed) time when the period is 10ms.

Additional info:

To fix this, kubevirt should pin the kvm-pit thread for that VM to the pCPU of vCPU-0.
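For step 2, a minimal sketch (not attached to this report, purely illustrative) of how a saved trace could be scanned for PIT injection deviations; the hrtimer address and the 10ms/500us figures are taken from the trace and expected results above:

package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"strconv"
)

func main() {
	// Match pit_timer_fn expirations of one PIT hrtimer; the address
	// 00000000e3dea339 comes from the trace in this report.
	re := regexp.MustCompile(`hrtimer=00000000e3dea339 function=pit_timer_fn \[kvm\] now=(\d+)`)
	const period = int64(10_000_000) // expected 10ms period, in ns
	const bound = int64(500_000)     // allowed 500us deviation, in ns
	var prev int64
	sc := bufio.NewScanner(os.Stdin) // e.g. fed from the saved trace file
	for sc.Scan() {
		m := re.FindStringSubmatch(sc.Text())
		if m == nil {
			continue
		}
		now, err := strconv.ParseInt(m[1], 10, 64) // "now=" is in ns
		if err != nil {
			continue
		}
		if prev != 0 {
			// Deviation of this expiration from one period after the
			// previous expiration.
			dev := now - prev - period
			if dev > bound || dev < -bound {
				fmt.Printf("deviation of %d us at now=%d\n", dev/1000, now)
			}
		}
		prev = now
	}
}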
Pseudo-algorithm:

1) Retrieve the qemu process PID.

2) Retrieve the list of processes on the host and search for the process named "kvm-pit/$pid-of-qemu", where $pid-of-qemu is an integer containing the qemu process PID. For example, for "kvm-pit/35771":

$ cat /proc/35771/comm
qemu-kvm

3) Add the TID of that kvm-pit process to the cgroup of the vcpus (cgroupManager.AttachTID API).

4) Assign the kvm-pit process the same cpumask as assigned to vCPU0 (sched_getaffinity/sched_setaffinity can be used). A rough Go sketch of these steps follows below.
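A hypothetical Go sketch of steps 1, 2 and 4 above; the function names and package layout are my own illustration, not the actual kubevirt change, and step 3 (cgroupManager.AttachTID) is only noted in a comment:

package pitpin

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"

	"golang.org/x/sys/unix"
)

// findKVMPitTID scans /proc for the kernel thread whose comm is
// "kvm-pit/<qemu-pid>" and returns its TID (steps 1 and 2).
func findKVMPitTID(qemuPID int) (int, error) {
	want := fmt.Sprintf("kvm-pit/%d", qemuPID)
	entries, err := os.ReadDir("/proc")
	if err != nil {
		return 0, err
	}
	for _, e := range entries {
		tid, err := strconv.Atoi(e.Name())
		if err != nil {
			continue // not a PID directory
		}
		comm, err := os.ReadFile(filepath.Join("/proc", e.Name(), "comm"))
		if err != nil {
			continue
		}
		if strings.TrimSpace(string(comm)) == want {
			return tid, nil
		}
	}
	return 0, fmt.Errorf("thread %q not found", want)
}

// pinPitToVCPU0 copies the affinity mask of the vcpu0 thread onto the
// kvm-pit thread (step 4). Step 3, attaching the kvm-pit TID to the
// vcpu cgroup via cgroupManager.AttachTID, is left out of this sketch.
func pinPitToVCPU0(pitTID, vcpu0TID int) error {
	var mask unix.CPUSet
	if err := unix.SchedGetaffinity(vcpu0TID, &mask); err != nil {
		return err
	}
	return unix.SchedSetaffinity(pitTID, &mask)
}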
Marcello, I've assigned this BZ to you since you did the work. Thanks for your help!
Issue fixed upstream: https://github.com/kubevirt/kubevirt/pull/9613

Backport for this problem is under way: https://bugzilla.redhat.com/show_bug.cgi?id=2208674
Verified with build: CNV-v4.14.0.rhel9-1553

Steps:

1. Enable the static CPU manager policy:

$ oc label node c01-zpeng-414-9xqx2-worker-0-9lgvv cpumanager=true

$ oc edit machineconfigpools.machineconfiguration.openshift.io worker

add:
...
  labels:
    custom-kubelet: cpumanager-enabled
...

Edit the KubeletConfig for the CPU Manager:

$ oc edit kubeletconfigs.machineconfiguration.openshift.io
....
  name: cpumanager-enabled
  resourceVersion: "157576"
  uid: 4b8a8bef-af63-4cf5-be8e-16e9cc96dd2d
...
spec:
  kubeletConfig:
    cpuManagerPolicy: static
    cpuManagerReconcilePeriod: 5s
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: cpumanager-enabled
      pools.operator.machineconfiguration.openshift.io/worker: ""
...

Check the worker for the updated kubelet.conf:

$ oc debug node/c01-zpeng-414-9xqx2-worker-0-9lgvv
sh-4.4# cat /etc/kubernetes/kubelet.conf | grep cpuManager
  "cpuManagerPolicy": "static",
  "cpuManagerReconcilePeriod": "5s",

2. Create a VM with dedicatedCpuPlacement: true and equal integer CPU requests/limits:

....
  domain:
    clock:
      timer:
        pit:
          tickPolicy: delay
    cpu:
      cores: 1
      dedicatedCpuPlacement: true
      isolateEmulatorThread: true
      model: host-passthrough
      sockets: 2
      threads: 1
......
    resources:
      limits:
        memory: 2Gi
      requests:
        memory: 2Gi

3. Before starting the VM, enable tracing on the host:

sh-5.1# echo 140000 > /sys/kernel/debug/tracing/buffer_size_kb

Enable the sched_switch and sched_waking tracepoints (note the append for the second event, so the first is not overwritten):

sh-5.1# echo sched_switch > /sys/kernel/debug/tracing/set_event
sh-5.1# echo sched_waking >> /sys/kernel/debug/tracing/set_event

Start collecting the trace and saving it to a file:

cat /sys/kernel/debug/tracing/trace_pipe > /root/saved-trace.txt

4. Start the VM and check the trace file:

.....
  kvm-pit/77-1016715  [001] d.h3. 11772.319302: sched_waking: comm=CPU 0/KVM pid=1016711 prio=120 target_cpu=001
....
  CPU 0/KVM-1016711   [001] d.h3. 11774.406453: sched_waking: comm=kvm-pit/77 pid=1016715 prio=120 target_cpu=001
....

sh-5.1# cat /proc/1016715/status | grep Cpus_allowed
Cpus_allowed:   02
Cpus_allowed_list:      1

The kvm-pit thread is pinned to the same pCPU (CPU 1) as CPU 0/KVM (an automated version of this check is sketched at the end of this comment). Moving to VERIFIED.
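For reference, a hypothetical Go helper automating the Cpus_allowed comparison from step 4; the TIDs 1016715 (kvm-pit/77) and 1016711 (CPU 0/KVM) are the ones from the trace above and would normally be discovered at runtime:

package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// TIDs taken from the trace in this comment; illustrative only.
	pitTID, vcpuTID := 1016715, 1016711
	var pitMask, vcpuMask unix.CPUSet
	if err := unix.SchedGetaffinity(pitTID, &pitMask); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if err := unix.SchedGetaffinity(vcpuTID, &vcpuMask); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// unix.CPUSet is a fixed-size array, so it compares with ==.
	if pitMask == vcpuMask {
		fmt.Println("kvm-pit shares vcpu0's affinity mask: OK")
	} else {
		fmt.Println("affinity masks differ: FAIL")
	}
}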
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.14.0 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6817