Bug 2185076

Summary: delays injecting PIT timer interrupt with OpenShift
Product: Container Native Virtualization (CNV) Reporter: Marcelo Tosatti <mtosatti>
Component: VirtualizationAssignee: lpivarc
Status: CLOSED ERRATA QA Contact: zhe peng <zpeng>
Severity: high Docs Contact:
Priority: high    
Version: 4.11.3CC: fbaudin, fdeutsch, gveitmic, jsuchane, lpivarc, mkletzan, pelauter, sgott, vromanso
Target Milestone: ---   
Target Release: 4.14.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-11-08 14:05:27 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Marcelo Tosatti 2023-04-06 19:12:09 UTC
Description of problem:

For using guests which require timely injection of PIT interrupts,
kvm-pit process should be pinned to same CPU as vcpu0. 

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. start guest which uses PIT timer (RHEL-5 with clock=pit for example).
2. measure rate of interrupt injection with tracing.
3. 

Actual results:

kvm-pit running on pCPU-0 might not execute, due to additional load on CPU-0,
for longer periods of time.

          <idle>-0     [000] d.h. 102470.847586: hrtimer_expire_entry: hrtimer=00000000e3dea339 function=pit_timer_fn [kvm] now=102463903132805
          <idle>-0     [000] d.h. 102470.847587: sched_waking: comm=kvm-pit/67 pid=1212317 prio=120 target_cpu=001
          <idle>-0     [001] d... 102470.847590: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=kvm-pit/67 next_pid=1212317 next_prio=120
      kvm-pit/67-1212317 [001] d... 102470.847594: sched_switch: prev_comm=kvm-pit/67 prev_pid=1212317 prev_prio=120 prev_state=D ==> next_comm=swapper/1 next_pid=0 next_prio=120
       CPU 0/KVM-1212312 [006] d... 102470.847596: kvm_entry: vcpu 0, rip 0x35fd
       CPU 0/KVM-1212312 [006] d... 102470.847597: kvm_exit: vcpu 0 reason EOI_INDUCED rip 0x1bab info1 0x0000000000000022 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000
       CPU 0/KVM-1212312 [006] d... 102470.847598: kvm_entry: vcpu 0, rip 0x1bab
...
       CPU 0/KVM-1212312 [006] d... 102470.847622: kvm_exit: vcpu 0 reason HLT rip 0x35fc info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000
           <...>-64244 [000] d.h. 102470.857590: hrtimer_expire_entry: hrtimer=00000000e3dea339 function=pit_timer_fn [kvm] now=102463913135899
           <...>-64244 [000] d.h. 102470.857599: sched_waking: comm=kvm-pit/67 pid=1212317 prio=120 target_cpu=001
  kube-apiserver-64649 [004] d.h. 102470.860004: hrtimer_expire_entry: hrtimer=00000000f5450c52 function=pit_timer_fn [kvm] now=102463915550678
           <...>-64245 [003] d.h. 102470.863029: hrtimer_expire_entry: hrtimer=000000004736fe96 function=pit_timer_fn [kvm] now=102463918575518
           <...>-64244 [000] d.h. 102470.867587: hrtimer_expire_entry: hrtimer=00000000e3dea339 function=pit_timer_fn [kvm] now=102463923133350
           <...>-64244 [000] d.h. 102470.869780: hrtimer_expire_entry: hrtimer=000000005deb41ca function=pit_timer_fn [kvm] now=102463925326017
           <...>-64244 [000] d... 102470.877465: sched_switch: prev_comm=kube-apiserver prev_pid=64244 prev_prio=120 prev_state=S ==> next_comm=kvm-pit/67 next_pid=1212317 next_prio=120
      kvm-pit/67-1212317 [000] d... 102470.877483: sched_switch: prev_comm=kvm-pit/67 prev_pid=1212317 prev_prio=120 prev_state=D ==> next_comm=cluster-policy- next_pid=60163 next_prio=120
       CPU 0/KVM-1212312 [006] d... 102470.877486: kvm_entry: vcpu 0, rip 0x35fd
       CPU 0/KVM-1212312 [006] d... 102470.877490: kvm_exit: vcpu 0 reason EOI_INDUCED rip 0x1bab info1 0x0000000000000022 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000
       CPU 0/KVM-1212312 [006] d... 102470.877491: sched_waking: comm=kvm-pit/67 pid=1212317 prio=120 target_cpu=000
       CPU 0/KVM-1212312 [006] d... 102470.877494: kvm_entry: vcpu 0, rip 0x1bab

Expected results:

PIT timer interrupt injection should not deviate more than 500us from expected (armed) time, when period is 10ms.


Additional info:

To fix this kubevirt should pin kvm-pit thread, for that VM, to pCPU of vCPU-0.

Comment 1 Marcelo Tosatti 2023-04-10 16:24:32 UTC
Pseudo-algorithm:

1) Retrieve the qemu process PID.
2) Retrieve the list of processes on the host. Search for the
process named "kvm-pit/$pid-of-qemu", where $pid-of-qemu 
is an integer containing the qemu process PID. For example: "kvm-pit/35771"
and

$ cat /proc/35771/comm
qemu-kvm

3) Add the TID of that kvm-pit process to the cgroup of the vcpus    
(cgroupManager.AttachTID API).

4) Assign the same cpumask as assigned to VCPU0 to kvm-pit process
(can use sched_getaffinity/sched_setaffinity).

Comment 11 sgott 2023-04-26 12:30:31 UTC
Marcello, assigned this BZ to you as you did the work. Thanks for your help!

Comment 13 Marcelo Tosatti 2023-05-24 11:59:24 UTC

Issue fixed upstream: https://github.com/kubevirt/kubevirt/pull/9613

Backport for this problem under way: https://bugzilla.redhat.com/show_bug.cgi?id=2208674

Comment 18 zhe peng 2023-08-09 06:22:55 UTC
verify with build:  CNV-v4.14.0.rhel9-1553

step:
1. enable static cpu manager policy
$ oc label node c01-zpeng-414-9xqx2-worker-0-9lgvv cpumanager=true

$ oc edit machineconfigpools.machineconfiguration.openshift.io worker
add:
...
labels:
    custom-kubelet: cpumanager-enabled
...

edit KubeletConfig for CPU Manager
$ oc edit kubeletconfigs.machineconfiguration.openshift.io
....
 name: cpumanager-enabled
  resourceVersion: "157576"
  uid: 4b8a8bef-af63-4cf5-be8e-16e9cc96dd2d
...
spec:
  kubeletConfig:
    cpuManagerPolicy: static
    cpuManagerReconcilePeriod: 5s
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: cpumanager-enabled
      pools.operator.machineconfiguration.openshift.io/worker: ""
...

Check the worker for the updated kubelet.conf:
$ oc debug node/c01-zpeng-414-9xqx2-worker-0-9lgvv
sh-4.4# cat /etc/kubernetes/kubelet.conf | grep cpuManager
  "cpuManagerPolicy": "static",
  "cpuManagerReconcilePeriod": "5s",

2: create a vm have dedicatedcpu: true flag, 
requests/limits with integer values and equal.
....
domain:
        clock:
          timer:
            pit:
              tickPolicy: delay
        cpu:
          cores: 1
          dedicatedCpuPlacement: true
          isolateEmulatorThread: true
          model: host-passthrough
          sockets: 2
          threads: 1
......
resources:
          limits:
            memory: 2Gi
          requests:
            memory: 2Gi
3: before start vm, enable trace in host

sh-5.1#echo 140000 > /sys/kernel/debug/tracing/buffer_size_kb

 enable sched_switch, sched_waking tracepoint:

sh-5.1#echo sched_switch > /sys/kernel/debug/tracing/set_event
sh-5.1#echo sched_waking > /sys/kernel/debug/tracing/set_event

start collecting the trace and saving it to a file:

cat /sys/kernel/debug/tracing/trace_pipe > /root/saved-trace.txt

4: start vm, check trace file:
.....
kvm-pit/77-1016715 [001] d.h3. 11772.319302: sched_waking: comm=CPU 0/KVM pid=1016711 prio=120 target_cpu=001
....
CPU 0/KVM-1016711 [001] d.h3. 11774.406453: sched_waking: comm=kvm-pit/77 pid=1016715 prio=120 target_cpu=001
....

sh-5.1# cat /proc/1016715/status | grep Cpus_allowed
Cpus_allowed:   02
Cpus_allowed_list:      1

move to verified.

Comment 20 errata-xmlrpc 2023-11-08 14:05:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.14.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6817