Bug 1468217 - KVM: paravirt raw_spinlock priority bump for housekeeping vcpus
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel-rt
Version: 7.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assigned To: Marcelo Tosatti
QA Contact: Pei Zhang
Keywords: FutureFeature
Depends On:
Blocks:
Reported: 2017-07-06 07:21 EDT by Marcelo Tosatti
Modified: 2017-09-25 17:36 EDT (History)
17 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-09-25 17:36:19 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
Call trace info when host is hang (165.19 KB, text/plain)
2017-07-28 07:03 EDT, Pei Zhang
kvm hypercalls to switch to/from FIFO prio around raw_spinlocks (8.68 KB, patch)
2017-08-21 21:00 EDT, Marcelo Tosatti

Description Marcelo Tosatti 2017-07-06 07:21:41 EDT
Description of problem:

Sharing a physical CPU between housekeeping vCPUs and other QEMU threads
suffers from the following problem:

 1. raw_spin_lock()
 2. protected section code
 3. raw_spin_unlock()

If a housekeeping vCPU is preempted inside step 2 above, and a realtime
vCPU then attempts to grab the same raw_spin_lock, there is a potential
latency violation.

Additional info:
Proposed fix:

Increase the housekeeping vCPU's priority at raw_spin_lock/raw_spin_unlock
time, using a region of memory shared between guest and host:

At raw_spin_lock time, the guest increments a counter in the shared
region. If the housekeeping vCPU exits while the counter is greater
than zero, the host raises the housekeeping vCPU's priority to the
realtime vCPU priority.

At raw_spin_unlock time, if the counter value is zero, the host
restores the original priority. This allows a pCPU to be shared
between multiple housekeeping threads and other QEMU threads.
Why is this safe? Because once a housekeeping thread grabs a
raw_spinlock and has its priority raised to FIFO:x, no other
housekeeping thread on that pCPU will attempt to run; that
housekeeping thread runs until the critical section is finished.

Code status: initial headers and documentation are done; the KVM side
should be finished today (July 4), then the raw_spin_lock side.
It should be tested and ready for submission by the end of the week.
Comment 2 Luiz Capitulino 2017-07-06 16:53:50 EDT
Marcelo asked me to describe the reproducer for this issue here. The reproducer is just to execute a KVM-RT test-case with vcpu0 pinned to a non-isolated core. However, it's been years since we last did that, so I'd like to go back and reproduce it again before giving further details. This may take a few days as I am very busy working on another problem.
Comment 3 Luiz Capitulino 2017-07-07 09:46:16 EDT
Btw, something that occurred to me is whether skew_tick=1 could fix this, since the tick will fire at different times for each vCPU, so there should be no contention on this raw spinlock (at least not on the particular spinlock that caused the problem; I can't tell whether there could be more).
Comment 26 Paolo Bonzini 2017-07-26 11:37:15 EDT
Sorry if I'm confused, but... shouldn't emulator threads at least theoretically have a _higher_ FIFO priority than the vCPUs?  Emulator threads can interrupt the vCPU, so if their priority is lower you deadlock as Luiz said in comment 14.
Comment 27 Pei Zhang 2017-07-27 07:48:20 EDT
update:

1. With only one emulator CPU, shared with vCPU0, the guest fails to boot.

  <vcpu placement='static'>2</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='19'/>
    <vcpupin vcpu='1' cpuset='18'/>
    <emulatorpin cpuset='19'/>
    <vcpusched vcpus='0' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='1' scheduler='fifo' priority='1'/>
  </cputune> 

2. With 2 emulator CPUs, the guest can boot up.
  <cputune>
    <vcpupin vcpu='0' cpuset='19'/>
    <vcpupin vcpu='1' cpuset='18'/>
    <emulatorpin cpuset='1,19'/>
    <vcpusched vcpus='0' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='1' scheduler='fifo' priority='1'/>
  </cputune>

With this configuration, running the latency test with rteval for 3 hours, both host and guest work well, and the latency values look good, as below:

Test started at Thu Jul 27 14:47:41 CST 2017

Test duration:    3h
Run rteval:       y
Run stress:       y
Isolated CPUs:    1
Kernel:           3.10.0-693.rt56.617.el7.x86_64
Kernel cmd-line:  BOOT_IMAGE=/vmlinuz-3.10.0-693.rt56.617.el7.x86_64 root=/dev/mapper/rhel_bootp--73--75--90-root ro console=tty0 console=ttyS0,115200n8 biosdevname=0 crashkernel=auto rd.lvm.lv=rhel_bootp-73-75-90/root rd.lvm.lv=rhel_bootp-73-75-90/swap rhgb quiet default_hugepagesz=1G iommu=pt intel_iommu=on isolcpus=1 intel_pstate=disable nosoftlockup skew_tick=1 nohz=on nohz_full=1 rcu_nocbs=1
Machine:          bootp-73-75-90.lab.eng.pek2.redhat.com
CPU:              Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
Results dir:      /home/log_nfv-virt-rt-kvm
running stress
   taskset -c 1 /home/nfv-virt-rt-kvm/tests/stress --cpu 1

running rteval
   rteval --onlyload --duration=3h --verbose
starting Thu Jul 27 14:47:46 CST 2017
   taskset -c 1 cyclictest -m -n -q -p95 -D 3h -h60 -i 200 -t 1 -a 1
ended Thu Jul 27 17:47:48 CST 2017

Test ended at Thu Jul 27 17:47:49 CST 2017

Latency testing results:
# Min Latencies: 00005
# Avg Latencies: 00006
# Max Latencies: 00011

3. I'm running the 12-hour test now, and will update the testing results once it finishes.
Comment 28 Luiz Capitulino 2017-07-27 11:02:27 EDT
Pei, are CPUs 19 and 1 isolated? They shouldn't be for this test.

Also, you should try the following test-case:

1. Shutdown the guest
2. Run a kernel build in the host and record how long it took

# cd linux-$ver; make mrproper; make allyesconfig; time make -jTWICE_AS_MANY_CPUS

3. When the kernel build finishes, just start it again. And in parallel perform steps 4 and 5 below
4. Start the guest
5. Run cyclictest with rteval in the guest for a duration which is some hours longer than the kernel build duration from item 2

If the kernel build is able to finish in about the same time as in item 2, then this test-case is a PASS. If the kernel build is only able to finish after rteval is killed in the guest, or if you get any other hangs in the system or in the guest, this test-case is a FAIL.

Btw, I don't know how serious item 1 from comment 27 is. We'd have to check whether it is at all possible for OpenStack to be set up in this manner (that is, having only a single pCPU for the emulator thread and vcpu0).
Comment 29 Pei Zhang 2017-07-28 07:03 EDT
Created attachment 1305893 [details]
Call trace info when host is hang

(In reply to Luiz Capitulino from comment #28)
> Pei, are CPUs 19 an 1 isolated? Because they shouldn't for this test.

Hi Luiz,

yes, CPU 19 is isolated. 

If I understand right, the testing below should use a non-isolated pCPU pinned to vCPU0. Please correct me if I'm wrong.

> Also, you should try the following test-case:
> 
> 1. Shutdown the guest
> 2. Run a kernel build in the host and record how long it took
> 
> # cd linux-$ver; make mrproper; make allyesconfig; time make
> -jTWICE_AS_MANY_CPUS
> 
> 3. When the kernel build finishes, just start it again. And in parallel
> perform steps 4 and 5 below
> 4. Start the guest
> 5. Run cyclictest with rteval in the guest for a duration which is some
> hours longer than the kernel build duration from item 2
> 
> If the kernel build is able to finish about at the same time from item 2,
> then this test-case is a PASS. If the kernel build is only able to finish
> after rteval is killed in the guest or if you get any another hangs in the
> system or in the guest, this test-case is a FAIL.


This test-case: FAIL.


4 issues were observed in the testing:

(1) The kernel build stops when the guest is booted.
# time make -j40

(2) The guest fails to boot.

(3) The host hangs after a few minutes (roughly 10), not immediately.

(4) A call trace like the one below shows up on the host.
[ 1616.284164] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=60002 jiffies, g=242521, c=242520, q=110856)
[ 1616.284165] All QSes seen, last rcu_preempt kthread activity 59999 (4296283308-4296223309), jiffies_till_next_fqs=3
[ 1616.284167] swapper/0       R  running task        0     0      0 0x00080000
[ 1616.284169]  ffffffff81a02480 bd007fd2bcff1381 ffff88085e603dd0 ffffffff810be946
[ 1616.284170]  ffff88085e612080 ffffffff81a48a00 ffff88085e603e38 ffffffff8113b29d
[ 1616.284171]  0000000000000000 ffff88085e612080 000000000001b108 0000000000000000
[ 1616.284171] Call Trace:
[ 1616.284178]  <IRQ>  [<ffffffff810be946>] sched_show_task+0xb6/0x120
[ 1616.284182]  [<ffffffff8113b29d>] rcu_check_callbacks+0x83d/0x860
[ 1616.284186]  [<ffffffff81091ed1>] update_process_times+0x41/0x70
[ 1616.284189]  [<ffffffff810ee720>] tick_sched_handle+0x30/0x70
[ 1616.284191]  [<ffffffff810eeb49>] tick_sched_timer+0x39/0x80
[ 1616.284193]  [<ffffffff810adef4>] __run_hrtimer+0xc4/0x2c0
[ 1616.284195]  [<ffffffff810eeb10>] ? tick_sched_do_timer+0x50/0x50
[ 1616.284196]  [<ffffffff810aee20>] hrtimer_interrupt+0x130/0x350
[ 1616.284200]  [<ffffffff81047405>] local_apic_timer_interrupt+0x35/0x60
[ 1616.284204]  [<ffffffff816bc61d>] smp_apic_timer_interrupt+0x3d/0x50
[ 1616.284205]  [<ffffffff816bad9d>] apic_timer_interrupt+0x6d/0x80
[ 1616.284209]  <EOI>  [<ffffffff81527cac>] ? cpuidle_enter_state+0x5c/0xd0
[ 1616.284211]  [<ffffffff81527c98>] ? cpuidle_enter_state+0x48/0xd0
[ 1616.284212]  [<ffffffff81527dff>] cpuidle_idle_call+0xdf/0x2b0
[ 1616.284215]  [<ffffffff810270be>] arch_cpu_idle+0xe/0x40
[ 1616.284217]  [<ffffffff810e2dcc>] cpu_startup_entry+0x14c/0x1d0
[ 1616.284220]  [<ffffffff81699e94>] rest_init+0x84/0x90
[ 1616.284223]  [<ffffffff81b80040>] start_kernel+0x427/0x448
[ 1616.284224]  [<ffffffff81b7fa22>] ? repair_env_string+0x5c/0x5c
[ 1616.284226]  [<ffffffff81b7f120>] ? early_idt_handler_array+0x120/0x120
[ 1616.284227]  [<ffffffff81b7f5e3>] x86_64_start_reservations+0x24/0x26
[ 1616.284228]  [<ffffffff81b7f732>] x86_64_start_kernel+0x14d/0x170
(This kind of call trace repeatedly shows up in dmesg.)

Full dmesg log is attached to this Comment.

Testing environment: 
Host kernel line:
# cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-3.10.0-693.rt56.617.el7.x86_64 root=/dev/mapper/rhel_dell--per430--09-root ro crashkernel=auto rd.lvm.lv=rhel_dell-per430-09/root rd.lvm.lv=rhel_dell-per430-09/swap console=ttyS0,115200n81 default_hugepagesz=1G iommu=pt intel_iommu=on isolcpus=2,4,6,8,10,12,14,16,18,19,17,15,13 intel_pstate=disable nosoftlockup skew_tick=1 nohz=on nohz_full=2,4,6,8,10,12,14,16,18,19,17,15,13 rcu_nocbs=2,4,6,8,10,12,14,16,18,19,17,15,13

# lscpu | grep NUMA
NUMA node(s):          2
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19

Configuration:
  <vcpu placement='static'>2</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='3'/>
    <vcpupin vcpu='1' cpuset='18'/>
    <emulatorpin cpuset='1,3,5'/>
    <vcpusched vcpus='0' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='1' scheduler='fifo' priority='1'/>
  </cputune>


Here vCPU0 uses pCPU3, which is not isolated. When vCPU0 is pinned to a non-isolated pCPU, at least 3 CPUs are needed for the emulator; otherwise the guest cannot boot.


> Btw, I don't know how serious item 1 from comment 27 is. We'd have to check
> if it is at all possible for OpenStack to be setup in this manner (that is,
> having only a single pCPU for the emulator thread and vcpu0).

I'd like to confirm the configuration in OpenStack next week.



Best Regards,
Pei
Comment 30 Luiz Capitulino 2017-07-28 09:25:18 EDT
(In reply to Pei Zhang from comment #29)
> Created attachment 1305893 [details]
> Call trace info when host is hang
> 
> (In reply to Luiz Capitulino from comment #28)
> > Pei, are CPUs 19 an 1 isolated? Because they shouldn't for this test.
> 
> Hi Luiz,
> 
> yes, CPU 19 is isolated. 
> 
> If I understand right, below testing should use an non-isolated pCPU pinned
> to vCPU0. Please correct me if I'm wrong.

You are correct. But non-isolated CPUs for vcpu0 and the emulator threads should also have been used for the testing done in comment 27.

In any case, I think that the testing you did in comment 29 shows that having vcpu0 running with fifo prio on non-isolated CPUs won't work, IMO.
Comment 31 Pei Zhang 2017-07-31 06:09:01 EDT
(In reply to Luiz Capitulino from comment #30)
> (In reply to Pei Zhang from comment #29)
> > Created attachment 1305893 [details]
> > Call trace info when host is hang
> > 
> > (In reply to Luiz Capitulino from comment #28)
> > > Pei, are CPUs 19 an 1 isolated? Because they shouldn't for this test.
> > 
> > Hi Luiz,
> > 
> > yes, CPU 19 is isolated. 
> > 
> > If I understand right, below testing should use an non-isolated pCPU pinned
> > to vCPU0. Please correct me if I'm wrong.
> 
> You are correct. But non-isolated CPUs for vcpu0 and the emulator threads
> should also have been used for the testing done in comment 27.
> 
> In any case, I think that the testing you did in comment 29 shows that
> having vcpu0 running with fifo prio on non-isolated CPUs won't work, IMO.

So the configuration should look like:

  <vcpu placement='static'>2</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='3'/>
    <vcpupin vcpu='1' cpuset='18'/>
    <emulatorpin cpuset='3'/>
    <vcpusched vcpus='1' scheduler='fifo' priority='1'/>
  </cputune>

(pCPU3 is not isolated and vCPU0 is not set to fifo:1)



I ran the test below with this configuration.

Testing results for the test-case from comment 28:

Step 1~2:
Shut down the guest and run a kernel build in the host; it takes about 15 minutes (3 runs):

run 1:
real	15m20.880s
user	86m6.193s
sys	13m51.351s

run 2:
real	15m22.770s
user	86m15.038s
sys	13m54.911s

run 3:
real	15m22.042s
user	86m5.641s
sys	13m56.440s


Step 3~5:
While the kernel build is running, start the guest and run cyclictest in the guest.

- The kernel build finishes whether or not rteval is killed.
- Both host and guest work well. No errors in dmesg.
- The kernel build takes about 17 minutes, roughly 2 minutes more than the 15 minutes above.
- I set the latency testing time to 15m. In one of the three runs, the max latency value is as high as 53us. For the full log, please refer to [1]:
# Min Latencies: 00005
# Avg Latencies: 00006
# Max Latencies: 00053



Note: the host environment is the same as in comment 29.

Reference:
[1] full log of step 3~5:
http://pastebin.test.redhat.com/504796
Comment 32 Pei Zhang 2017-08-01 03:00:02 EDT
12-hour latency testing results with the configuration below:

There was no stress load on the host; I only ran the latency test with rteval in the guest. Both host and guest work well, but the max latency value is very high.

Configuration:
  <vcpu placement='static'>2</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='3'/>
    <vcpupin vcpu='1' cpuset='18'/>
    <emulatorpin cpuset='3'/>
    <vcpusched vcpus='1' scheduler='fifo' priority='1'/>
  </cputune>

12 hours latency testing results:
# Min Latencies: 00004
# Avg Latencies: 00006
# Max Latencies: 05166

For the whole testing log, please refer to:
http://pastebin.test.redhat.com/504992

Host kernel line:
# cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-3.10.0-693.rt56.617.el7.x86_64 root=/dev/mapper/rhel_dell--per430--09-root ro crashkernel=auto rd.lvm.lv=rhel_dell-per430-09/root rd.lvm.lv=rhel_dell-per430-09/swap console=ttyS0,115200n81 default_hugepagesz=1G iommu=pt intel_iommu=on isolcpus=2,4,6,8,10,12,14,16,18,19,17,15,13 intel_pstate=disable nosoftlockup skew_tick=1 nohz=on nohz_full=2,4,6,8,10,12,14,16,18,19,17,15,13 rcu_nocbs=2,4,6,8,10,12,14,16,18,19,17,15,13

# lscpu | grep NUMA
NUMA node(s):          2
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19
Comment 33 Marcelo Tosatti 2017-08-09 11:41:36 EDT
(In reply to Pei Zhang from comment #29)
> [...]
> Here vCPU0 uses pCPU3 which is not isolated. When this vCPU0 is pinned to a
> pCPU which is non-isolated, then at least 3 CPUs are needed for emulator.
> Otherwise guest can not boot up.

Ok, thanks Pei, my configuration was incorrect. I'll reproduce your test.

We need a hypercall to change the priority to FIFO once boot is finished,
writing that now.
Comment 34 Marcelo Tosatti 2017-08-17 19:33:02 EDT
(In reply to Paolo Bonzini from comment #26)
> Sorry if I'm confused, but... shouldn't emulator threads at least
> theoretically have a _higher_ FIFO priority than the vCPUs?  Emulator
> threads can interrupt the vCPU, so if their priority is lower you deadlock
> as Luiz said in comment 14.

Damn, good point, the following can cause IO to never be processed:

* If you do:

1) submit IO.
2) busy spin on some non-important program on vcpu0.

The IO will only interrupt the CPU when the non-important
program HLT's, which might be never.

So better change the code. I'm using a hypercall to switch to 
SCHED_OTHER at:
        * BIOS initialization.
        * System shutdown.
and to switch to SCHED_FIFO at:
        * System startup.

But should instead:

        1) hypercall to switch to FIFO:1
        2) spin_lock_irqsave(spinlock_shared_with_vcpu1, flags)
        3) spin_unlock_irqrestore(spinlock_shared_with_vcpu1, flags)
        4) hypercall to switch to SCHED_OTHER
Comment 35 Marcelo Tosatti 2017-08-18 13:17:36 EDT
(In reply to Marcelo Tosatti from comment #34)
> [...]
> But should instead:
> 
>         1) hypercall to switch to FIFO:1
>         2) spin_lock_irqsave(spinlock_shared_with_vcpu1, flags)
>         3) spin_unlock_irqrestore(spinlock_shared_with_vcpu1, flags)
>         4) hypercall to switch to SCHED_OTHER

Paolo, Luiz,

Unfortunately there are about 250 functions that take raw_spinlocks,
so adding this to each site is not a reasonable tradeoff. I can't see any
better solution than

void raw_spin_lock(raw_spinlock_t *lock)
{
        if (cpu->isolated == false)
                hypercall1(ENABLE_FIFO_PRIO);

        __raw_spin_lock(lock);
}

void raw_spin_unlock(raw_spinlock_t *lock)
{
        __raw_spin_unlock(lock);

        if (cpu->isolated == false)
                hypercall1(DISABLE_FIFO_PRIO);
}

Which might incur some overhead, and therefore upstream might not 
accept it (however, for the NFV workloads it's fine, since the 
hotpath does not use spinlocks at all). 

Do you guys have any better ideas?
Comment 36 Luiz Capitulino 2017-08-18 14:02:50 EDT
That idea was the best one, I can't think of anything else :(
Comment 38 Marcelo Tosatti 2017-08-21 21:00 EDT
Created attachment 1316490 [details]
kvm hypercalls to switch to/from FIFO prio around raw_spinlocks
