Bug 2293909 - L2 Guest-Aggressively entering CEDE results in low performance. Possible tuning opportunity.
Keywords:
Status: NEW
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: ppc64le
OS: All
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: PPCTracker
 
Reported: 2024-06-24 09:50 UTC by IBM Bug Proxy
Modified: 2024-07-04 11:18 UTC
17 users

Fixed In Version:
Clone Of:
Environment:
Last Closed:
Type: ---
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
IBM Linux Technology Center 205266 0 None None None 2024-06-24 09:51:10 UTC

Description IBM Bug Proxy 2024-06-24 09:50:44 UTC

Comment 1 IBM Bug Proxy 2024-06-24 09:50:58 UTC
== Comment: #0 - Vijay k. Puliyala <vpuliyal.com> - 2024-02-17 05:48:50 ==
KVM on PowerVM: L2 Guest-Aggressively entering CEDE results in low performance. Possible tuning opportunity.

---uname output---
Linux rhel86edb1 6.7.0-nested.1.1a946fcde971.up.ibm.el9.ppc64le #1 SMP Sun Jan 21 11:45:44 EST 2024 ppc64le ppc64le ppc64le GNU/Linux
 
---Steps to Reproduce---
Example: run a read-only test using the EDB-PGBENCH and DT7 workloads on
 1. L1-Host 
 2. L2-Guest CEDE ON
 3. L2-Guest CEDE OFF

A significant performance drop is observed in the L2-Guest CEDE-on case vs the L2-Guest CEDE-off case.

Note: The host and guest configurations used for the performance experiments are listed below.

Location of EDB-PGBENCH: 
#wget http://ci-http-results.aus.stglabs.ibm.com/perfTest/scripts/Bug_Scripts/pgbench_install.sh
#chmod 777 pgbench_install.sh
#./pgbench_install.sh -->> it will install EDB (pgbench) and run it on the target LPAR.

Location of DT7 workload: 

#wget http://ci-http-results.aus.stglabs.ibm.com/perfTest/scripts/Bug_Scripts/DT7-Install.sh
#chmod 777 DT7-Install.sh
#./DT7-Install.sh -->> It will install DT7.

Sample commands: once installation is successful, run the commands below on the target LPAR.

EDB-PGBENCH Commands : 

# su - enterprisedb
# vi t1.tc -->> copy the lines below into the t1.tc file.

##########t1.tc##########
runname=select
SCALE=100
runtime=300
thread="40"
smtlist="8"
mode=select
recreateinstance=yes
recreateduringrun=yes
warmup=no
perf_stat=yes
PGSQL=/usr/local/pgsql/bin
#PGSQL=/usr/edb/as14/bin
#PGPORT=5432
cores=5
##########t1.tc##########

#cp t1.tc tc/
#./auto-run-test.sh

DT7 Commands : 

After installation of DT7, run the command below:
#cd /root
#./DayTrader7_Run.sh -u 20 -l 900 -i 2  

######################################################################
Machine Type: Power 10  LPAR (RHEL9.3)
gcc 		: 11.4.1
Memory  	: 300GB
Test type	: pgbench-edb, DT7
######################################################################
KVM Host lscpu output : 

# lscpu
Architecture:            ppc64le
  Byte Order:            Little Endian
CPU(s):                  96
  On-line CPU(s) list:   0-39
  Off-line CPU(s) list:  40-95
Model name:              POWER10 (architected), altivec supported
  Model:                 2.0 (pvr 0080 0200)
  Thread(s) per core:    8
  Core(s) per socket:    5
  Socket(s):             1
  Physical sockets:      1
  Physical chips:        4
  Physical cores/chip:   12
Virtualization features:
  Hypervisor vendor:     pHyp
  Virtualization type:   para
Caches (sum of all):
  L1d:                   320 KiB (10 instances)
  L1i:                   480 KiB (10 instances)
  L2:                    10 MiB (10 instances)
  L3:                    40 MiB (10 instances)
NUMA:
  NUMA node(s):          1
  NUMA node2 CPU(s):     0-39
Vulnerabilities:
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec rstack overflow:  Not affected
  Spec store bypass:     Not affected
  Spectre v1:            Vulnerable, ori31 speculation barrier enabled
  Spectre v2:            Vulnerable
  Srbds:                 Not affected
  Tsx async abort:       Not affected


##############################################

KVM on PowerVM setup: 

KVM (Kernel Virtual Machine) is a virtualization module for Linux that provides virtualization capability to Linux, i.e. it allows the kernel to function as a hypervisor.

We used P10 2S4U system for this experiment. 

Workloads: DT7 and PGBENCH in details: 

DT7 is an open source benchmark application emulating an online stock trading system.
DT7 consists of three components:
1) Jmeter 
2) WAS (WebSphere Application Server)
3) DB2

The DayTrader benchmark application is installed/deployed on WAS, which acts as the middleware tier and uses DB2 as the backing database. JMeter generates the requests and interacts with WAS.

PGBENCH : 
pgbench is a simple program for running benchmark tests on PostgreSQL. It runs the same sequence of SQL commands over and over, possibly in multiple concurrent database sessions, and then calculates the average transaction rate (transactions per second).

Config of KVM Host and L2-Guest:

KVM Host Config : 
# uname -a
Linux rhel86edb1 6.7.0-nested.1.1a946fcde971.up.ibm.el9.ppc64le #1 SMP Sun Jan 21 11:45:44 EST 2024 ppc64le ppc64le ppc64le GNU/Linux
# numactl -H
available: 1 nodes (1)
node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
node 1 size: 292860 MB
node 1 free: 290979 MB
node distances:
node   1
  1:  10
# cat /proc/cmdline
BOOT_IMAGE=(ieee1275//pci@800000020000021/pci1014\\,683@0/namespace@1,msdos2)/vmlinuz-6.7.0-nested.1.1a946fcde971.up.ibm.el9.ppc64le root=/dev/mapper/rhel_rhel86edb-root ro crashkernel=2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G rd.lvm.lv=rhel_rhel86edb/root rd.lvm.lv=rhel_rhel86edb/swap biosdevname=0 mitigations=off doorbell=off
# ppc64_cpu --dscr
DSCR is 23
# cpupower idle-info
CPUidle driver: pseries_idle
CPUidle governor: menu
analyzing CPU 0:

Number of idle states: 2
Available idle states: snooze CEDE
snooze:
Flags/Description: snooze
Latency: 0
Usage: 2656
Duration: 297483
CEDE:
Flags/Description: CEDE
Latency: 12
Usage: 159981
Duration: 95235883853
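The report does not state how the CEDE on/off cases were toggled; one standard way is the per-state cpuidle sysfs interface. A minimal sketch, assuming state1 is CEDE as in the idle-info output above:

```shell
# Disable the CEDE idle state (state1) on all CPUs; snooze (state0) remains
# available. Write 0 instead of 1 to re-enable. Requires root.
for f in /sys/devices/system/cpu/cpu*/cpuidle/state1/disable; do
    echo 1 > "$f"
done
```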

# qemu-system-ppc64 --version
QEMU emulator version 7.1.0
Copyright (c) 2003-2022 Fabrice Bellard and the QEMU Project developers

#Libvirt version : libvirt-8.7.0

# cat /etc/redhat-release
Red Hat Enterprise Linux release 9.3 (Plow)
#


L2 GUEST CONFIG :  

CPUs: un-pinned

# cat /proc/cmdline
BOOT_IMAGE=(ieee1275/disk,msdos2)/vmlinuz-6.7.0-nested.1.1a946fcde971.up.ibm.el9.ppc64le root=/dev/mapper/rhel-root ro crashkernel=2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap mitigations=off doorbell=off
# ppc64_cpu --dscr
DSCR is 23
# cat /proc/cmdline
BOOT_IMAGE=(ieee1275/disk,msdos2)/vmlinuz-6.7.0-nested.1.1a946fcde971.up.ibm.el9.ppc64le root=/dev/mapper/rhel-root ro crashkernel=2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap mitigations=off doorbell=off
# numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
node 0 size: 106739 MB
node 0 free: 105211 MB
node distances:
node   0
  0:  10
# cat /etc/redhat-release
Red Hat Enterprise Linux release 9.3 (Plow)
#

We ran the DT7 and PGBENCH read-only tests on the L2-Guest with CEDE on vs off and observed degradation with CEDE on compared with CEDE off.

Here I'm adding the DT7 and EDB-PGBENCH results.

L2-GUEST 5Cores with CEDE on: 

1) EDB-PGBENCH Data : 
+ /usr/local/pgsql/bin/pgbench -n -S -T 120 -c 40 -j 40 pgbench
pgbench (14.5)
transaction type: <builtin: select only>
scaling factor: 100
query mode: simple
number of clients: 40
number of threads: 40
duration: 120 s
number of transactions actually processed: 21811958
latency average = 0.220 ms
initial connection time = 16.004 ms
tps = 181761.468180 (without initial connection time)


2) DT7 Data: 
DayTrader7 Report

 Run Group ID=0
 Run ID=40
 Run Description=Test Run
 Host=127.0.0.1                 Users=40           Run_time=900

 Total Instances                 2
 Total Throughputs               2340.6

L2-GUEST 5Cores with CEDE Off: 

1) EDB-PGBENCH  Data : 
+ /usr/local/pgsql/bin/pgbench -n -S -T 120 -c 40 -j 40 pgbench
pgbench (14.5)
transaction type: <builtin: select only>
scaling factor: 100
query mode: simple
number of clients: 40
number of threads: 40
duration: 120 s
number of transactions actually processed: 37804765
latency average = 0.127 ms
initial connection time = 5.910 ms
tps = 315015.313022 (without initial connection time)

2) DT7 Results: 
==================================================================================
 DayTrader7 Report

 Run Group ID=0
 Run ID=41
 Run Description=Test Run
 Host=127.0.0.1                 Users=40           Run_time=900

 Total Instances                 2

 Total Throughputs               3569.6
===================================================================================

EDB-PGBENCH Performance Summary:

CEDE ON  EDB-PGBENCH  Data : 181761.46818 tps 
CEDE OFF EDB-PGBENCH  Data : 315015.31302 tps  

Percentage drop: (315015.31 - 181761.47) * 100 / 315015.31 ≈ 42%
With CEDE turned ON, the guest under-performed by 42% compared to CEDE turned OFF.

DT7 Performance Summary:  

CEDE ON  DT7  Data : 2340.6 tps 
CEDE OFF DT7  Data : 3569.6 tps  

Percentage drop: (3569.6 - 2340.6) * 100 / 3569.6 ≈ 34%
With CEDE turned ON, the guest under-performed by 34% compared to CEDE turned OFF.
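The two drop figures can be sanity-checked with a small awk helper (illustrative only; `drop` is a hypothetical name, with the numbers taken from the summaries above):

```shell
# Percentage drop from "off" (baseline) to "on", rounded to the nearest integer.
drop() { awk -v on="$1" -v off="$2" 'BEGIN { printf "%.0f\n", (off - on) * 100 / off }'; }

drop 181761.47 315015.31   # EDB-PGBENCH: prints 42
drop 2340.6    3569.6      # DT7: prints 34
```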

From the above data we observed that performance drops when L2-Guest CEDE is ON compared to when it is OFF. It is well understood that shipping with shared-mode CEDE disabled is not an option; however, it would be ideal to make CEDE'ing less aggressive so that performance scales to an acceptable level.
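One standard knob for making idle entry less aggressive without disabling CEDE outright is the per-CPU resume-latency QoS constraint. A sketch under the assumption that capping tolerated wakeup latency below CEDE's 12 us exit latency (per the cpupower idle-info output above) keeps the governor on snooze; whether this helps in the nested KVM case is untested here, and the value is only an example:

```shell
# Constrain tolerated exit latency to 10 us, below CEDE's 12 us, so the
# cpuidle governor prefers snooze. Restore with "echo 0". Requires root.
for f in /sys/devices/system/cpu/cpu*/power/pm_qos_resume_latency_us; do
    echo 10 > "$f"
done
```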

Kindly reach out for any other data.

== Comment: #1 - Amit Machhiwal <Amit.Machhiwal> - 2024-03-01 05:47:03 ==
Hi Vijay,

Could you please mention the phyp lids version you have run the benchmarks on?

Thanks,
Amit

== Comment: #2 - Vijay k. Puliyala <vpuliyal.com> - 2024-03-01 06:24:07 ==
(In reply to comment #1)
> Hi Vijay,
> 
> Could you please mention the phyp lids version you have run the benchmarks
> on?
> 
> Thanks,
> Amit

Hi Amit,

We used b240118a.phyp1060 phyp lids version. 

Thanks
Vijay Kumar

== Comment: #3 - Amit Machhiwal <Amit.Machhiwal> - 2024-03-06 05:37:59 ==
Thanks, Vijay for the update.

I tried running the below case:

L1:
    No. of cores 5
    L1 Idle
    Run monitor for 20 seconds

L2:
    No. of cores 5
    Run stress-ng

Observation: 
L1 is NOT CEDE-ing at all on the latest and slightly older builds
with both 6.5 and 6.8 KOP kernel.

As per @gautam's observations, the 6.5 KOP kernel was working fine earlier, with L1
CEDE-ing in this case. He had tried that on the PSP team's system.

Lids builds tried: b240228a.phyp1060 (latest)  and b240206a.phyp1060


I further traced the VM exit reasons and am currently analyzing the log. The
trace log also shows that L2 is calling H_CEDE when it goes idle.

In the trace below, trap=0x0 decodes as RETURN_TO_HOST:

	{0x0,	"RETURN_TO_HOST"}, \

      CPU 38/KVM-7092  [012]  3454.639244: kvm_guest_exit:       VCPU 38: trap=0x0 pc=0xc0000000000f3d74 msr=0x800000000290a033, ceded=1
       CPU 0/KVM-7054  [028]  3454.639257: kvm_guest_exit:       VCPU 0: trap=0x0 pc=0xc0000000000f3d74 msr=0x800000000290a033, ceded=1
      CPU 38/KVM-7092  [012]  3454.639261: kvm_guest_exit:       VCPU 38: trap=0x0 pc=0xc0000000000f3d74 msr=0x800000000290a033, ceded=1
       CPU 0/KVM-7054  [028]  3454.639272: kvm_guest_exit:       VCPU 0: trap=0x0 pc=0xc0000000000f3d74 msr=0x800000000290a033, ceded=1
      CPU 38/KVM-7092  [012]  3454.639274: kvm_guest_exit:       VCPU 38: trap=0x0 pc=0xc0000000000f3d74 msr=0x800000000290a033, ceded=1
      CPU 37/KVM-7091  [026]  3454.639280: kvm_guest_exit:       VCPU 37: trap=0x0 pc=0xc0000000000f3d74 msr=0x800000000290a033, ceded=1

But, L1 is not calling the H_CEDE at all even when L2 is not running any
workload:

[root@ltcd89-lp1 log]# perf stat -e probe:cede_processor
^C
 Performance counter stats for 'system wide':

                 0      probe:cede_processor                                                  

      55.960454488 seconds time elapsed

I'll now be focusing on understanding why L1 is not calling H_CEDE at all!

Also, the latest version of the phyp lids available now is b240305a.phyp1060,
and observations are similar on this build as well. I'll be continuing the rest
of my debugging on this version of the lids.

~Amit

== Comment: #4 - Amit Machhiwal <Amit.Machhiwal> - 2024-03-07 03:17:58 ==
Hi Vijay,

I ran a couple of *pgbench* runs for the second part of the problem. Following
are the details of the environment:

PHYP Lids
=======================================================
b240305a.phyp1060

L1
=======================================================
    Kernel:
	[root@ltcd89-lp1 amachhiw]# uname -r
	6.8.0-rc1-kop-6c8898e76d7a+
    CEDE: ON

L2
=======================================================
    Kernel: 
	[root@localhost phoronix-test-suite]# uname -r
	6.8.0-nested.1.6c8898e76d7a.up.ibm.el9.ppc64le

    CEDE: ON
	Average: 3670 TPS
        Deviation: 2.45%
        Samples: 6

        Average: 3681 TPS
        Deviation: 1.25%

        Average: 3681 TPS
        Deviation: 1.32%


    CEDE: OFF 
	Average: 3308 TPS
        Deviation: 5.48%
        Samples: 12

        Average: 3338 TPS
        Deviation: 0.85%

        Average: 3287 TPS
        Deviation: 2.34%
        Samples: 5
	
Observations:
=======================================================
Results are the other way around. In my experiments, L2 performed better by more
than 9% every time with CEDE ON compared to when CEDE was OFF.


~Amit

== Comment: #5 - Amit Machhiwal <Amit.Machhiwal> - 2024-03-07 03:21:45 ==
(In reply to comment #4)

Adding to this comment, I had 5 cores (40 CPUs) assigned to both L1 and L2. There was no CPU pinning on L2.

L1
=======================================================

[root@ltcd89-lp1 amachhiw]# numactl -H
available: 4 nodes (0-3)
node 0 cpus:
node 0 size: 55111 MB
node 0 free: 46771 MB
node 1 cpus:
node 1 size: 59261 MB
node 1 free: 58475 MB
node 2 cpus:
node 2 size: 59261 MB
node 2 free: 57869 MB
node 3 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
node 3 size: 51999 MB
node 3 free: 24855 MB
node distances:
node   0   1   2   3 
  0:  10  20  20  20 
  1:  20  10  20  20 
  2:  20  20  10  20 
  3:  20  20  20  10 

L2
=======================================================
[root@localhost phoronix-test-suite]# numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
node 0 size: 43348 MB
node 0 free: 40252 MB
node distances:
node   0 
  0:  10

== Comment: #6 - Vijay k. Puliyala <vpuliyal.com> - 2024-03-07 08:29:09 ==
Hi Amit,

Thanks for sharing your LPAR. I created a vm2 guest on it and did pgbench runs. I could see degradation with CEDE ON vs CEDE OFF.

PGBENCH Results : 

CEDE On  : 41288.65 TPS
CEDE OFF : 50116.24 TPS 

Commands to run pgbench on the vm2 guest:

#su - enterprisedb
#cp t1.tc tc/
#./auto-run-test.sh

Guest Config : 

Un-pinned CPUs

# ppc64_cpu --dscr
DSCR is 23


# uname -a
Linux p10kvm.aus.stglabs.ibm.com 6.7.0-nested.1.1a946fcde971.up.ibm.el9.ppc64le #1 SMP Sun Jan 21 11:45:44 EST 2024 ppc64le ppc64le ppc64le GNU/Linux

# numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
node 0 size: 106739 MB
node 0 free: 102329 MB
node distances:
node   0
  0:  10
#


Thanks
Vijay Kumar

== Comment: #7 - Amit Machhiwal <Amit.Machhiwal> - 2024-03-26 07:39:26 ==
Hi Vijay,

Could you please share the numbers for the CEDE ON and OFF cases while keeping
shared_buffers = 8192MB in postgresql.conf?

The shared buffers specify the amount of memory that can be used for caching the
contents of tables etc.

The default size of shared_buffers (which is 128MB) is too small and is not used
in the customer environments.

shared_buffers = 128MB                  # min 128kB 
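For reference, a sketch of making that change (assuming `$PGDATA` points at the cluster's data directory; shared_buffers only takes effect after a server restart):

```shell
# Bump shared_buffers from the 128MB default to 8192MB, then restart.
sed -i 's/^#\?shared_buffers *=.*/shared_buffers = 8192MB/' "$PGDATA/postgresql.conf"
pg_ctl -D "$PGDATA" restart
```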

~Amit

== Comment: #8 - Amit Machhiwal <Amit.Machhiwal> - 2024-04-07 15:26:14 ==
Hi,

Sorry for the late update. Please find the analysis below:

We observed that while L2, having nothing to do, was CEDE-ing as expected, L1 was
not CEDE-ing at all. I debugged the problem further to understand why L1 was not
CEDE-ing.

Expected Behaviour
==================

1. The DEC bit is set during VCPU create in the following path
    kvmppc_book3s_vec2irqprio (	case 0x900: prio = BOOK3S_IRQPRIO_DECREMENTER;		break;)
    kvmppc_book3s_queue_irqprio()
    kvmppc_core_queue_dec()
    kvmppc_decrementer_func()
    kvmppc_decrementer_wakeup()
    kvm_arch_vcpu_create()
2. Now, when L2 doesn't have any workload to run, it traps into L1 and  calls
   `kvmppc_cede()` and exits to user in the following path:
    kvmppc_cede()
    kvmhv_p9_guest_entry()
    kvmhv_run_single_vcpu()
    kvmhv_enter_nested_guest()
    kvmppc_pseries_do_hcall (case H_ENTER_NESTED)
    kvmppc_vcpu_run_hv()

    static void kvmppc_cede(struct kvm_vcpu *vcpu)
    {
    	__kvmppc_set_msr_hv(vcpu, __kvmppc_get_msr_hv(vcpu) | MSR_EE);
    	vcpu->arch.ceded = 1;
    ...

3. Now, in the exit path when L2 has ceded and traps into L1, L1 should have
   been canceling any pending decrementer exception in the below path (clearing
   the decrementer bit set in step 1):
    kvmppc_core_dequeue_dec()
    kvmhv_run_single_vcpu()
    kvmhv_enter_nested_guest()
    kvmppc_pseries_do_hcall (case H_ENTER_NESTED)
    kvmppc_vcpu_run_hv()

4. L1 should set a timer and should prepare to CEDE afterwards in the below
   path:
    kvmppc_set_timer()
    kvmhv_run_single_vcpu()
    kvmhv_enter_nested_guest()
    kvmppc_pseries_do_hcall (case H_ENTER_NESTED)
    kvmppc_vcpu_run_hv()

    int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 time_limit,
    			  unsigned long lpcr)
    {
    ...
        trap = kvmhv_p9_guest_entry(vcpu, time_limit, lpcr, &tb);
    ...
    
        if (!kvmhv_is_nestedv2() && kvmppc_core_pending_dec(vcpu) &&
        		((tb < kvmppc_dec_expires_host_tb(vcpu)) ||
        		 (trap == BOOK3S_INTERRUPT_SYSCALL &&
        		  kvmppc_get_gpr(vcpu, 3) == H_ENTER_NESTED)))
        	kvmppc_core_dequeue_dec(vcpu);
    ...
        r = RESUME_GUEST;
    ...
        if (is_kvmppc_resume_guest(r) && !kvmppc_vcpu_check_block(vcpu)) {
        	kvmppc_set_timer(vcpu);


Current Behaviour
=================

Everything up to step 2 above was happening as expected, but not steps 3 and 4,
where L1 was supposed to dequeue the decrementer exception and set a timer.
We confirmed this with the probes below:

Looking at the `kvmppc_set_timer()` part in the below path:

    int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 time_limit,
    			  unsigned long lpcr)
    {
    ...
    	if (is_kvmppc_resume_guest(r) && !kvmppc_vcpu_check_block(vcpu)) {
    		kvmppc_set_timer(vcpu);

With the below probe, I could confirm that kvmppc_set_timer was not hitting at
all in this path.
    perf stat -a -e probe:kvmppc_set_timer_L7 -e probe:kvmppc_set_timer_L12 sleep 3


After looking further at `kvmppc_vcpu_check_block()`, which clearly seemed to be
the suspicious path, and attaching a probe, it could be confirmed that there was
still a pending decrementer exception.

    static bool kvmppc_vcpu_check_block(struct kvm_vcpu *vcpu)
    {
    	if (!vcpu->arch.ceded || kvmppc_vcpu_woken(vcpu))
    		return true;
    ...
    
    static bool kvmppc_vcpu_woken(struct kvm_vcpu *vcpu)
    {
    	if (vcpu->arch.pending_exceptions || vcpu->arch.prodded ||
    ...

    [root@ltcd89-lp1 amachhiw]# perf record -e probe:kvmppc_vcpu_check_block -aR sleep 3
           CPU 2/KVM   10384 [010]  2752.690156: probe:kvmppc_vcpu_check_block: (c000000000161990) ceded=0x1 pending_exceptions=0x8000 prodded=0x0 doorbell_request=0x0
           CPU 5/KVM   10387 [013]  2752.690159: probe:kvmppc_vcpu_check_block: (c000000000161990) ceded=0x1 pending_exceptions=0x8000 prodded=0x0 doorbell_request=0x0
           CPU 3/KVM   10385 [011]  2752.690161: probe:kvmppc_vcpu_check_block: (c000000000161990) ceded=0x1 pending_exceptions=0x8000 prodded=0x0 doorbell_request=0x0
           CPU 0/KVM   10382 [008]  2752.690165: probe:kvmppc_vcpu_check_block: (c000000000161990) ceded=0x1 pending_exceptions=0x8000 prodded=0x0 doorbell_request=0x0
           CPU 1/KVM   10383 [009]  2752.690166: probe:kvmppc_vcpu_check_block: (c000000000161990) ceded=0x1 pending_exceptions=0x8000 prodded=0x0 doorbell_request=0x0
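A probe printing those vcpu fields can typically be defined with perf probe's variable-fetch syntax. A sketch only: the exact probe definition used above is not shown in the report, and this form requires kernel debuginfo (DWARF):

```shell
# Attach a probe at kvmppc_vcpu_check_block that dumps the fields of interest,
# then record it system-wide for a few seconds.
perf probe --add 'kvmppc_vcpu_check_block ceded=vcpu->arch.ceded pending_exceptions=vcpu->arch.pending_exceptions prodded=vcpu->arch.prodded doorbell_request=vcpu->arch.doorbell_request'
perf record -e probe:kvmppc_vcpu_check_block -aR sleep 3
```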


At this point, it was clear that `kvmppc_core_dequeue_dec()`, which is supposed
to clear the decrementer exception bit, was not getting called:

    int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 time_limit,
    			  unsigned long lpcr)
    {
    ...
    	if (!kvmhv_is_nestedv2() && kvmppc_core_pending_dec(vcpu) &&
    			((tb < kvmppc_dec_expires_host_tb(vcpu)) ||
    			 (trap == BOOK3S_INTERRUPT_SYSCALL &&
    			  kvmppc_get_gpr(vcpu, 3) == H_ENTER_NESTED)))
    		kvmppc_core_dequeue_dec(vcpu);


The `!kvmhv_is_nestedv2()` part of the above condition was introduced in the
below patch by Jordan; the newly introduced condition skips the call to
`kvmppc_core_dequeue_dec()`, which is expected to clear the pending
decrementer exception bit.

180c6b072bf3 ("KVM: PPC: Book3S HV nestedv2: Do not cancel pending decrementer
exception")

The patch below, which reverts the above patch, fixes the L1 CEDE-ing problem.
But we are planning a few more optimisations around handling the decrementer
exception. We will be sending the next version of this patch once the work is
complete.

v1: https://lore.kernel.org/linuxppc-dev/20240313072625.76804-1-vaibhav@linux.ibm.com/

~Amit

The patch for this fix has been merged into the upstream kernel via commit

7be6ce7043b4cf293c8826a48fd9f56931cef2cf ("KVM: PPC: Book3S HV nestedv2: Cancel pending DEC exception")

