Description of problem:

For guests which require timely injection of PIT interrupts, the kvm-pit thread should be pinned to the same pCPU as vcpu0.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Start a guest which uses the PIT timer (RHEL-5 with clock=pit, for example).
2. Measure the rate of interrupt injection with tracing (a rough trace-parsing sketch is given at the end of this comment).
3.

Actual results:

kvm-pit runs on pCPU-0 and, due to additional load on CPU-0, might not execute for long periods of time:

          <idle>-0     [000] d.h. 102470.847586: hrtimer_expire_entry: hrtimer=00000000e3dea339 function=pit_timer_fn [kvm] now=102463903132805
          <idle>-0     [000] d.h. 102470.847587: sched_waking: comm=kvm-pit/67 pid=1212317 prio=120 target_cpu=001
          <idle>-0     [001] d... 102470.847590: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=kvm-pit/67 next_pid=1212317 next_prio=120
  kvm-pit/67-1212317   [001] d... 102470.847594: sched_switch: prev_comm=kvm-pit/67 prev_pid=1212317 prev_prio=120 prev_state=D ==> next_comm=swapper/1 next_pid=0 next_prio=120
  CPU 0/KVM-1212312    [006] d... 102470.847596: kvm_entry: vcpu 0, rip 0x35fd
  CPU 0/KVM-1212312    [006] d... 102470.847597: kvm_exit: vcpu 0 reason EOI_INDUCED rip 0x1bab info1 0x0000000000000022 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000
  CPU 0/KVM-1212312    [006] d... 102470.847598: kvm_entry: vcpu 0, rip 0x1bab
  ...
  CPU 0/KVM-1212312    [006] d... 102470.847622: kvm_exit: vcpu 0 reason HLT rip 0x35fc info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000
       <...>-64244     [000] d.h. 102470.857590: hrtimer_expire_entry: hrtimer=00000000e3dea339 function=pit_timer_fn [kvm] now=102463913135899
       <...>-64244     [000] d.h. 102470.857599: sched_waking: comm=kvm-pit/67 pid=1212317 prio=120 target_cpu=001
  kube-apiserver-64649 [004] d.h. 102470.860004: hrtimer_expire_entry: hrtimer=00000000f5450c52 function=pit_timer_fn [kvm] now=102463915550678
       <...>-64245     [003] d.h. 102470.863029: hrtimer_expire_entry: hrtimer=000000004736fe96 function=pit_timer_fn [kvm] now=102463918575518
       <...>-64244     [000] d.h. 102470.867587: hrtimer_expire_entry: hrtimer=00000000e3dea339 function=pit_timer_fn [kvm] now=102463923133350
       <...>-64244     [000] d.h. 102470.869780: hrtimer_expire_entry: hrtimer=000000005deb41ca function=pit_timer_fn [kvm] now=102463925326017
       <...>-64244     [000] d... 102470.877465: sched_switch: prev_comm=kube-apiserver prev_pid=64244 prev_prio=120 prev_state=S ==> next_comm=kvm-pit/67 next_pid=1212317 next_prio=120
  kvm-pit/67-1212317   [000] d... 102470.877483: sched_switch: prev_comm=kvm-pit/67 prev_pid=1212317 prev_prio=120 prev_state=D ==> next_comm=cluster-policy- next_pid=60163 next_prio=120
  CPU 0/KVM-1212312    [006] d... 102470.877486: kvm_entry: vcpu 0, rip 0x35fd
  CPU 0/KVM-1212312    [006] d... 102470.877490: kvm_exit: vcpu 0 reason EOI_INDUCED rip 0x1bab info1 0x0000000000000022 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000
  CPU 0/KVM-1212312    [006] d... 102470.877491: sched_waking: comm=kvm-pit/67 pid=1212317 prio=120 target_cpu=000
  CPU 0/KVM-1212312    [006] d... 102470.877494: kvm_entry: vcpu 0, rip 0x1bab

Expected results:

PIT timer interrupt injection should not deviate more than 500us from the expected (armed) time when the period is 10ms.

Additional info:

To fix this, kubevirt should pin the kvm-pit thread for that VM to the pCPU of vCPU-0.
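For step 2, a minimal sketch (not attached to this report, purely illustrative) of how a saved trace could be scanned for PIT injection deviations; the hrtimer address and the 10ms/500us figures are taken from the trace and expected results above:

package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"strconv"
)

func main() {
	// Match pit_timer_fn expirations of one PIT hrtimer; the address
	// 00000000e3dea339 comes from the trace in this report.
	re := regexp.MustCompile(`hrtimer=00000000e3dea339 function=pit_timer_fn \[kvm\] now=(\d+)`)
	const period = int64(10_000_000) // expected 10ms period, in ns
	const bound = int64(500_000)     // allowed 500us deviation, in ns
	var prev int64
	sc := bufio.NewScanner(os.Stdin) // e.g. fed from the saved trace file
	for sc.Scan() {
		m := re.FindStringSubmatch(sc.Text())
		if m == nil {
			continue
		}
		now, err := strconv.ParseInt(m[1], 10, 64) // "now=" is in ns
		if err != nil {
			continue
		}
		if prev != 0 {
			// Deviation of this expiration from one period after the
			// previous expiration.
			dev := now - prev - period
			if dev > bound || dev < -bound {
				fmt.Printf("deviation of %d us at now=%d\n", dev/1000, now)
			}
		}
		prev = now
	}
}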
Pseudo-algorithm:

1) Retrieve the qemu process PID.

2) Retrieve the list of processes on the host and search for the process named "kvm-pit/$pid-of-qemu", where $pid-of-qemu is an integer containing the qemu process PID. For example, for "kvm-pit/35771":

$ cat /proc/35771/comm
qemu-kvm

3) Add the TID of that kvm-pit process to the cgroup of the vcpus (cgroupManager.AttachTID API).

4) Assign the kvm-pit process the same cpumask as assigned to vCPU0 (sched_getaffinity/sched_setaffinity can be used). A rough Go sketch of these steps follows below.
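A hypothetical Go sketch of steps 1, 2 and 4 above; the function names and package layout are my own illustration, not the actual kubevirt change, and step 3 (cgroupManager.AttachTID) is only noted in a comment:

package pitpin

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"

	"golang.org/x/sys/unix"
)

// findKVMPitTID scans /proc for the kernel thread whose comm is
// "kvm-pit/<qemu-pid>" and returns its TID (steps 1 and 2).
func findKVMPitTID(qemuPID int) (int, error) {
	want := fmt.Sprintf("kvm-pit/%d", qemuPID)
	entries, err := os.ReadDir("/proc")
	if err != nil {
		return 0, err
	}
	for _, e := range entries {
		tid, err := strconv.Atoi(e.Name())
		if err != nil {
			continue // not a PID directory
		}
		comm, err := os.ReadFile(filepath.Join("/proc", e.Name(), "comm"))
		if err != nil {
			continue
		}
		if strings.TrimSpace(string(comm)) == want {
			return tid, nil
		}
	}
	return 0, fmt.Errorf("thread %q not found", want)
}

// pinPitToVCPU0 copies the affinity mask of the vcpu0 thread onto the
// kvm-pit thread (step 4). Step 3, attaching the kvm-pit TID to the
// vcpu cgroup via cgroupManager.AttachTID, is left out of this sketch.
func pinPitToVCPU0(pitTID, vcpu0TID int) error {
	var mask unix.CPUSet
	if err := unix.SchedGetaffinity(vcpu0TID, &mask); err != nil {
		return err
	}
	return unix.SchedSetaffinity(pitTID, &mask)
}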
Marcello, I've assigned this BZ to you since you did the work. Thanks for your help!
Issue fixed upstream: https://github.com/kubevirt/kubevirt/pull/9613

Backport for this problem is under way: https://bugzilla.redhat.com/show_bug.cgi?id=2208674
Verified with build: CNV-v4.14.0.rhel9-1553

Steps:

1. Enable the static CPU manager policy:

$ oc label node c01-zpeng-414-9xqx2-worker-0-9lgvv cpumanager=true

$ oc edit machineconfigpools.machineconfiguration.openshift.io worker

add:
...
  labels:
    custom-kubelet: cpumanager-enabled
...

Edit the KubeletConfig for the CPU Manager:

$ oc edit kubeletconfigs.machineconfiguration.openshift.io
....
  name: cpumanager-enabled
  resourceVersion: "157576"
  uid: 4b8a8bef-af63-4cf5-be8e-16e9cc96dd2d
...
spec:
  kubeletConfig:
    cpuManagerPolicy: static
    cpuManagerReconcilePeriod: 5s
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: cpumanager-enabled
      pools.operator.machineconfiguration.openshift.io/worker: ""
...

Check the worker for the updated kubelet.conf:

$ oc debug node/c01-zpeng-414-9xqx2-worker-0-9lgvv
sh-4.4# cat /etc/kubernetes/kubelet.conf | grep cpuManager
  "cpuManagerPolicy": "static",
  "cpuManagerReconcilePeriod": "5s",

2. Create a VM with dedicatedCpuPlacement: true and equal integer CPU requests/limits:

....
  domain:
    clock:
      timer:
        pit:
          tickPolicy: delay
    cpu:
      cores: 1
      dedicatedCpuPlacement: true
      isolateEmulatorThread: true
      model: host-passthrough
      sockets: 2
      threads: 1
......
    resources:
      limits:
        memory: 2Gi
      requests:
        memory: 2Gi

3. Before starting the VM, enable tracing on the host:

sh-5.1# echo 140000 > /sys/kernel/debug/tracing/buffer_size_kb

Enable the sched_switch and sched_waking tracepoints (note the append for the second event, so the first is not overwritten):

sh-5.1# echo sched_switch > /sys/kernel/debug/tracing/set_event
sh-5.1# echo sched_waking >> /sys/kernel/debug/tracing/set_event

Start collecting the trace and saving it to a file:

cat /sys/kernel/debug/tracing/trace_pipe > /root/saved-trace.txt

4. Start the VM and check the trace file:

.....
  kvm-pit/77-1016715  [001] d.h3. 11772.319302: sched_waking: comm=CPU 0/KVM pid=1016711 prio=120 target_cpu=001
....
  CPU 0/KVM-1016711   [001] d.h3. 11774.406453: sched_waking: comm=kvm-pit/77 pid=1016715 prio=120 target_cpu=001
....

sh-5.1# cat /proc/1016715/status | grep Cpus_allowed
Cpus_allowed:   02
Cpus_allowed_list:      1

The kvm-pit thread is pinned to the same pCPU (CPU 1) as CPU 0/KVM (an automated version of this check is sketched at the end of this comment). Moving to VERIFIED.
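For reference, a hypothetical Go helper automating the Cpus_allowed comparison from step 4; the TIDs 1016715 (kvm-pit/77) and 1016711 (CPU 0/KVM) are the ones from the trace above and would normally be discovered at runtime:

package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// TIDs taken from the trace in this comment; illustrative only.
	pitTID, vcpuTID := 1016715, 1016711
	var pitMask, vcpuMask unix.CPUSet
	if err := unix.SchedGetaffinity(pitTID, &pitMask); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if err := unix.SchedGetaffinity(vcpuTID, &vcpuMask); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// unix.CPUSet is a fixed-size array, so it compares with ==.
	if pitMask == vcpuMask {
		fmt.Println("kvm-pit shares vcpu0's affinity mask: OK")
	} else {
		fmt.Println("affinity masks differ: FAIL")
	}
}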
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.14.0 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6817