Bug 1817045
Summary:          RT host hang with "INFO: rcu_preempt detected stalls on CPUs/tasks"
Product:          Red Hat Enterprise Linux 7
Component:        kernel-rt
Sub component:    KVM
Version:          7.9
Status:           CLOSED NOTABUG
Severity:         high
Priority:         medium
Keywords:         Reopened
Reporter:         Pei Zhang <pezhang>
Assignee:         Peter Xu <peterx>
QA Contact:       Pei Zhang <pezhang>
CC:               acme, bhu, chayang, crobinso, daolivei, jinzhao, jlelli, juri.lelli, juzhang, lcapitulino, lgoncalv, nilal, pauld, peterx, virt-maint, williams
Target Milestone: rc
Target Release:   ---
Hardware:         Unspecified
OS:               Unspecified
Type:             Bug
Last Closed:      2020-12-15 17:36:29 UTC
Bug Blocks:       1890703
Description (Pei Zhang, 2020-03-25 13:23:14 UTC)
We hit the same issue in rhel7.7.z testing.
Testing versions:
kernel-rt-3.10.0-1062.27.1.rt56.1050.el7.x86_64
qemu-kvm-rhev-2.12.0-33.el7_7.11.x86_64
tuned-2.11.0-5.el7_7.1.noarch
libvirt-4.5.0-23.el7_7.7.x86_64
microcode_ctl-2.1-53.7.el7_7.x86_64
rt-tests-1.0-16.el7.x86_64
The RT host shows the call trace below:
[92309.610498] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=60002 jiffies, g=11061801, c=11061800, q=1435)
[92309.610500] All QSes seen, last rcu_preempt kthread activity 2 (4386982538-4386982536), jiffies_till_next_fqs=3
[92309.610502] swapper/0 R running task 0 0 0 0x00080000
[92309.610503] Call Trace:
[92309.610515] <IRQ> [<ffffffff8b4cdca6>] sched_show_task+0xb6/0x120
[92309.610518] [<ffffffff8b552b6d>] rcu_check_callbacks+0x84d/0x880
[92309.610523] [<ffffffff8b5031c0>] ? tick_sched_do_timer+0x50/0x50
[92309.610528] [<ffffffff8b4a0581>] update_process_times+0x41/0x70
[92309.610530] [<ffffffff8b502e90>] tick_sched_handle+0x30/0x70
[92309.610532] [<ffffffff8b5031f9>] tick_sched_timer+0x39/0x80
[92309.610534] [<ffffffff8b4bca61>] __hrtimer_run_queues+0x121/0x3b0
[92309.610537] [<ffffffff8b4bd9b9>] hrtimer_interrupt+0xb9/0x270
[92309.610541] [<ffffffff8b44b62b>] local_apic_timer_interrupt+0x3b/0x60
[92309.610546] [<ffffffff8bb866b3>] smp_apic_timer_interrupt+0x43/0x60
[92309.610548] [<ffffffff8bb831ba>] apic_timer_interrupt+0x16a/0x170
[92309.610554] <EOI> [<ffffffff8bb77be0>] ? __cpuidle_text_start+0x8/0x8
[92309.610557] [<ffffffff8bb77eab>] ? native_safe_halt+0xb/0x20
[92309.610559] [<ffffffff8bb77bfe>] default_idle+0x1e/0x130
[92309.610564] [<ffffffff8b4265e0>] arch_cpu_idle+0x20/0xc0
[92309.610567] [<ffffffff8b4f40da>] cpu_startup_entry+0x14a/0x1d0
[92309.610571] [<ffffffff8bb60984>] rest_init+0x84/0x90
[92309.610575] [<ffffffff8c1bd19f>] start_kernel+0x448/0x469
[92309.610576] [<ffffffff8c1bcb57>] ? repair_env_string+0x5c/0x5c
[92309.610580] [<ffffffff8c1bc120>] ? early_idt_handler_array+0x120/0x120
[92309.610582] [<ffffffff8c1bc70b>] x86_64_start_reservations+0x24/0x26
[92309.610584] [<ffffffff8c1bc861>] x86_64_start_kernel+0x154/0x177
[92309.610587] [<ffffffff8b4000d5>] start_cpu+0x5/0x14
[92489.604442] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=240007 jiffies, g=11061801, c=11061800, q=2698)
[92489.604443] All QSes seen, last rcu_preempt kthread activity 1 (4387162543-4387162542), jiffies_till_next_fqs=3
[92489.604446] swapper/0 R running task 0 0 0 0x00080000
[92489.604447] Call Trace:
[92489.604458] <IRQ> [<ffffffff8b4cdca6>] sched_show_task+0xb6/0x120
[92489.604461] [<ffffffff8b552b6d>] rcu_check_callbacks+0x84d/0x880
[92489.604467] [<ffffffff8b5031c0>] ? tick_sched_do_timer+0x50/0x50
[92489.604471] [<ffffffff8b4a0581>] update_process_times+0x41/0x70
[92489.604473] [<ffffffff8b502e90>] tick_sched_handle+0x30/0x70
[92489.604475] [<ffffffff8b5031f9>] tick_sched_timer+0x39/0x80
[92489.604477] [<ffffffff8b4bca61>] __hrtimer_run_queues+0x121/0x3b0
[92489.604480] [<ffffffff8b4bd9b9>] hrtimer_interrupt+0xb9/0x270
[92489.604484] [<ffffffff8b44b62b>] local_apic_timer_interrupt+0x3b/0x60
[92489.604488] [<ffffffff8bb866b3>] smp_apic_timer_interrupt+0x43/0x60
[92489.604491] [<ffffffff8bb831ba>] apic_timer_interrupt+0x16a/0x170
[92489.604497] <EOI> [<ffffffff8bb77be0>] ? __cpuidle_text_start+0x8/0x8
[92489.604500] [<ffffffff8bb77eab>] ? native_safe_halt+0xb/0x20
[92489.604502] [<ffffffff8bb77bfe>] default_idle+0x1e/0x130
[92489.604506] [<ffffffff8b4265e0>] arch_cpu_idle+0x20/0xc0
[92489.604510] [<ffffffff8b4f40da>] cpu_startup_entry+0x14a/0x1d0
[92489.604513] [<ffffffff8bb60984>] rest_init+0x84/0x90
[92489.604517] [<ffffffff8c1bd19f>] start_kernel+0x448/0x469
[92489.604518] [<ffffffff8c1bcb57>] ? repair_env_string+0x5c/0x5c
[92489.604521] [<ffffffff8c1bc120>] ? early_idt_handler_array+0x120/0x120
[92489.604524] [<ffffffff8c1bc70b>] x86_64_start_reservations+0x24/0x26
[92489.604526] [<ffffffff8c1bc861>] x86_64_start_kernel+0x154/0x177
[92489.604529] [<ffffffff8b4000d5>] start_cpu+0x5/0x14
[92669.598385] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=420012 jiffies, g=11061801, c=11061800, q=3800)
[92669.598387] All QSes seen, last rcu_preempt kthread activity 1 (4387342548-4387342547), jiffies_till_next_fqs=3
[92669.598389] swapper/0 R running task 0 0 0 0x00080000
[92669.598390] Call Trace:
[92669.598399] <IRQ> [<ffffffff8b4cdca6>] sched_show_task+0xb6/0x120
[92669.598401] [<ffffffff8b552b6d>] rcu_check_callbacks+0x84d/0x880
[92669.598406] [<ffffffff8b5031c0>] ? tick_sched_do_timer+0x50/0x50
[92669.598410] [<ffffffff8b4a0581>] update_process_times+0x41/0x70
[92669.598412] [<ffffffff8b502e90>] tick_sched_handle+0x30/0x70
[92669.598413] [<ffffffff8b5031f9>] tick_sched_timer+0x39/0x80
[92669.598415] [<ffffffff8b4bca61>] __hrtimer_run_queues+0x121/0x3b0
[92669.598418] [<ffffffff8b4bd9b9>] hrtimer_interrupt+0xb9/0x270
[92669.598421] [<ffffffff8b44b62b>] local_apic_timer_interrupt+0x3b/0x60
[92669.598424] [<ffffffff8bb866b3>] smp_apic_timer_interrupt+0x43/0x60
[92669.598426] [<ffffffff8bb831ba>] apic_timer_interrupt+0x16a/0x170
[92669.598432] <EOI> [<ffffffff8bb77be0>] ? __cpuidle_text_start+0x8/0x8
[92669.598434] [<ffffffff8bb77eab>] ? native_safe_halt+0xb/0x20
[92669.598436] [<ffffffff8bb77bfe>] default_idle+0x1e/0x130
[92669.598440] [<ffffffff8b4265e0>] arch_cpu_idle+0x20/0xc0
[92669.598443] [<ffffffff8b4f40da>] cpu_startup_entry+0x14a/0x1d0
[92669.598447] [<ffffffff8bb60984>] rest_init+0x84/0x90
[92669.598450] [<ffffffff8c1bd19f>] start_kernel+0x448/0x469
[92669.598451] [<ffffffff8c1bcb57>] ? repair_env_string+0x5c/0x5c
[92669.598454] [<ffffffff8c1bc120>] ? early_idt_handler_array+0x120/0x120
[92669.598457] [<ffffffff8c1bc70b>] x86_64_start_reservations+0x24/0x26
[92669.598459] [<ffffffff8c1bc861>] x86_64_start_kernel+0x154/0x177
[92669.598461] [<ffffffff8b4000d5>] start_cpu+0x5/0x14
...
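For reference, t=60002 jiffies corresponds to roughly 60 seconds with HZ=1000, and the warning then repeats about every 180 seconds (92309 -> 92489 -> 92669), i.e. at three times the stall timeout. Assuming the parameter is exposed on this kernel, the configured timeout can be checked with:

    # RCU stall detector timeout in seconds (path may vary by kernel version)
    cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout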
Thanks Pei, this has to be debugged and fixed on 7.7.z since it's supported. However, since we have other priorities at the moment, we'll wait a bit to get to this one.

Pei,

A quick question: the second RT guest installation that you initiate, do you perform this after the 24h run or during the execution of this 24h test?

(In reply to Nitesh Narayan Lal from comment #5)
> A quick question: the second RT guest installation that you initiate, do you
> perform this after the 24h run or during the execution of this 24h test?

Hi Nitesh,

The guest 24h cyclictest testing and the guest installation are not executed at the same time. Let me share more details:

(1) First we install a fresh guest. After the guest installation finishes, we set up the RT environment. Then we do 24h or 12h testing in this guest. Finally we clear the setup, including shutting down and undefining the guest. This scenario is done.

(2) Moving to the next scenario, we repeat (1).

Best regards,
Pei

From the dmesg that Pei provided, the rcu grace period 10895418 never completed during the 8-vcpu VM installation. That should cause all the rcu callbacks to stall; eventually it can use up all the memory of the host, and in the end we won't even be able to ssh to it.

Today I logged into the host before it died. Listing all the vcpu threads [1]:

    PID   TID   S POL PSR PRI COMMAND
    106   106   D FF  10  44  rcub/2
    1820  1820  D TS  10  19  NetworkManager
    3028  13214 D TS  2   19  pool
    10416 10416 R TS  16  19  qemu-kvm
    10416 10473 R TS  16  19  CPU 0/KVM
    10416 10474 R TS  16  19  CPU 1/KVM
    10416 10475 R TS  16  19  CPU 2/KVM
    10416 10476 R TS  16  44  CPU 3/KVM
    10416 10477 R TS  16  19  CPU 4/KVM
    10416 10478 R TS  16  19  CPU 5/KVM
    10416 10479 R TS  16  19  CPU 6/KVM
    10416 10480 R TS  16  19  CPU 7/KVM
    10416 10481 R TS  16  19  CPU 8/KVM
    10416 10482 R TS  16  19  CPU 9/KVM
    11915 11915 D TS  8   19  sshd
    13212 13212 D TS  8   19  kworker/8:1
    13220 13220 D TS  2   19  kworker/2:1
    13538 13538 D TS  10  19  pickup
    13796 13796 D TS  0   19  sshd
    13799 13799 R TS  2   19  ps
    19616 19616 D TS  8   19  kworker/8:2
    30258 30258 D TS  4   19  kworker/4:0

It's strange that during VM installation the vcpus are not using the pinning information provided on the virt-install command line but are all running on cpu16.

Here's Pei's installation cmdline for the 8-vcpu VM (copied from the beaker log):

    14:39:46:[cmd]:virt-install --graphics type=spice,listen=0.0.0.0 --name=rhel7.7.z_rt_8vcpu --machine pc --memory=4096,hugepages=yes --memorybacking hugepages=yes,size=1,unit=G,locked=yes --cpu host-passthrough,cache.mode=passthrough --numatune=1 --vcpus=10,cpuset=16,18,1,3,5,7,9,11,13,15 --controller type=scsi,model=virtio-scsi --controller type=virtio-serial --rng type=/dev/random --memballoon virtio --disk path=/home/images_nfv-virt-rt-kvm/rhel7.7.z_rt_8vcpu.qcow2,size=40 -l http://download.eng.pek2.redhat.com/rel-eng/latest-RHEL-7.7/compose/Server/x86_64/os/ --noautoconsole --noreboot -x ks=http://10.73.72.41/kickstart-rhel7-7-z.cfg --network bridge=switch,model=virtio,mac=38:66:88:44:de:02

Which contains:

    --vcpus=10,cpuset=16,18,1,3,5,7,9,11,13,15

So it seems that virt-install does not apply the vcpu pinning correctly during the installation phase. I suspect that's why CPU16 got overloaded.

"CPU 3/KVM" is special here since it's the only thread that has priority=44 (an RT priority). I'm not sure how that is achieved with SCHED_OTHER (policy=TS)... maybe due to priority inheritance in the RT world somehow.

Anyway, ideally the vcpus should be pinned at least to different cores even during installation. Is it a virt-install bug that "--vcpus=10,cpuset=..." is not applied during VM install?

Pei, to verify this, we can try to manually pin each vcpu after the installation starts, to see whether we still see the same RCU warning.

Created attachment 1701821 [details]
virt-install.log provided by Peter
Created attachment 1701823 [details]
working vm XML provided by peter
I talked with Peter on IRC. From the virt-install.log he provided, virt-install seems to be working correctly: it is passing the requested cpuset to both the install-time and the post-install boot XML. Generating XML is really all virt-install does in this case; all the actual pinning behavior is handled elsewhere.
But what's weird to me is that the post-install XML he grabbed from the running VM, after he rebooted it, has different CPU pinning XML (see the attachment): there's no cpuset= value, but there is a complicated <cputune> section with more pinning info.
I don't think libvirt is filling in that data itself, but I'm not certain. Is something else in the test case, or on the host, changing the VM XML separately?
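For reference, the manual pinning suggested earlier (pinning each vcpu while the installation is still running) could look roughly like the sketch below; the domain name and cpu list come from the virt-install command above, and the exact 1:1 vcpu-to-core mapping is only an illustration, not necessarily the mapping the test automation applies later.

    # Pin each vcpu of the installing guest to its own host core (live only),
    # instead of letting every vcpu float over the whole cpuset.
    dom=rhel7.7.z_rt_8vcpu
    cpus=(16 18 1 3 5 7 9 11 13 15)
    for i in "${!cpus[@]}"; do
        virsh vcpupin "$dom" "$i" "${cpus[$i]}" --live
    done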
(In reply to Cole Robinson from comment #18)
> But what's weird to me is that the post-install XML he grabbed from the
> running VM, after he rebooted it, has different CPU pinning XML (see the
> attachment): there's no cpuset= value, but there is a complicated <cputune>
> section with more pinning info.

Hi Cole,

As I hit this issue during the beaker automation process, in order to keep the current setup for Peter to debug, I didn't quit the beaker job, so the automation job is still running on this debug server.

The "complicated <cputune> section which has more pinning info" is defined by extra testing steps, not by virt-install. When Peter rebooted (or shut down?) the VM, the VM went into "shutdown" status, which triggered the automation script to continue with the other testing steps. The automation is designed like this:

1. # virt-install ....
2. Once the virt-install installation finishes, the VM shuts down; this is defined in the kickstart file http://10.73.72.41/kickstart-rhel7-7-z.cfg.
3. The auto script then detects that the VM is in "shutdown" status, undefines it, and re-defines it with an extra XML file which has the exact cpu pin info:
   # virsh undefine rhel7.7.z_rt_8vcpu
   # virsh define --file /tmp/rhel7.7.z_rt_8vcpu.xml

This bug was triggered during the virt-install step; once we hit this issue, the installation never finishes and the host hangs.

Thanks.

Best regards,
Pei

Thanks for the info Pei, that helps.

So the question is, why does the plain <vcpus cpuset='...'> method not work, while the more advanced <cputune> config does work?

Libvirt is responsible for setting the CPU affinity here; I don't think qemu is involved. So either libvirt is screwing up or something is wrong at the kernel level. If this reproduces with the same virt packages on a non-rt system, then it's probably a libvirt issue.

(In reply to Cole Robinson from comment #20)
> So the question is, why does the plain <vcpus cpuset='...'> method not work,
> while the more advanced <cputune> config does work?

This reminded me to go back and look at the virt-install manual, which says:

    --vcpus OPTIONS
        Number of virtual cpus to configure for the guest. If 'maxvcpus' is
        specified, the guest will be able to hotplug up to MAX vcpus while the
        guest is running, but will startup with VCPUS.

        CPU topology can additionally be specified with sockets, cores, and
        threads. If values are omitted, the rest will be autofilled preferring
        sockets over cores over threads.

        'cpuset' sets which physical cpus the guest can use. "CPUSET" is a
        comma separated list of numbers, which can also be specified in ranges
        or cpus to exclude. Example:

            0,2,3,5     : Use processors 0,2,3 and 5
            1-5,^3,8    : Use processors 1,2,4,5 and 8

So IIUC "<vcpus cpuset='...'>" will not pin one vcpu thread to each core, but lets all vcpu threads run on all the listed cores. E.g. "--vcpus=10,cpuset=16,18,1,3,5,7,9,11,13,15" allows each vcpu thread to run on any physical core within the cpu list "16,18,1,3,5,7,9,11,13,15". Then it's strange why all the vcpu threads are queued on physical core 16 (in the "R" state) in comment 16.

> Libvirt is responsible for setting the CPU affinity here; I don't think qemu
> is involved. So either libvirt is screwing up or something is wrong at the
> kernel level. If this reproduces with the same virt packages on a non-rt
> system, then it's probably a libvirt issue.

Agreed. And if libvirt has correctly applied the cpu list affinity to all the vcpu threads, then it could be a kernel issue.

Thanks Pei, it's reproduced. Looks like libvirt has done the expected pinning:
[root@dell-per430-10 ~]# ps -eLo tid,state,psr,comm | grep KVM
10317 R 16 CPU 0/KVM
10318 R 16 CPU 1/KVM
10319 R 16 CPU 2/KVM
10320 R 16 CPU 3/KVM
10321 R 16 CPU 4/KVM
10322 R 16 CPU 5/KVM
10323 R 16 CPU 6/KVM
10324 R 16 CPU 7/KVM
10325 R 16 CPU 8/KVM
10326 R 16 CPU 9/KVM
[root@dell-per430-10 ~]# ps -eLo tid,comm | grep KVM | while read line; do pid=${line%% *}; echo "$line: $(taskset -pc $pid)"; done
10317 CPU 0/KVM: pid 10317's current affinity list: 1,3,5,7,9,11,13,15,16,18
10318 CPU 1/KVM: pid 10318's current affinity list: 1,3,5,7,9,11,13,15,16,18
10319 CPU 2/KVM: pid 10319's current affinity list: 1,3,5,7,9,11,13,15,16,18
10320 CPU 3/KVM: pid 10320's current affinity list: 1,3,5,7,9,11,13,15,16,18
10321 CPU 4/KVM: pid 10321's current affinity list: 1,3,5,7,9,11,13,15,16,18
10322 CPU 5/KVM: pid 10322's current affinity list: 1,3,5,7,9,11,13,15,16,18
10323 CPU 6/KVM: pid 10323's current affinity list: 1,3,5,7,9,11,13,15,16,18
10324 CPU 7/KVM: pid 10324's current affinity list: 1,3,5,7,9,11,13,15,16,18
10325 CPU 8/KVM: pid 10325's current affinity list: 1,3,5,7,9,11,13,15,16,18
10326 CPU 9/KVM: pid 10326's current affinity list: 1,3,5,7,9,11,13,15,16,18
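The per-cpu scheduling trace below was presumably collected with ftrace; the exact invocation isn't shown, but a minimal tracefs-based capture of sched_switch events limited to cpu16 would look something like this:

    # Restrict tracing to cpu16 (hex cpumask, bit 16 set) and record switches.
    cd /sys/kernel/debug/tracing
    echo 10000 > tracing_cpumask
    echo 1 > events/sched/sched_switch/enable
    echo 1 > tracing_on
    sleep 5
    echo 0 > tracing_on
    cat trace > /tmp/sched_switch_cpu16.txt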
Dumping the task switches on cpu16, we can indeed observe the vcpu threads switching frequently:
CPU-10325 [016] 11775.954419: sched_switch: CPU 8/KVM:10325 [120] R ==> CPU 6/KVM:10323 [120]
CPU-10323 [016] 11775.954443: sched_switch: CPU 6/KVM:10323 [120] S ==> CPU 9/KVM:10326 [120]
CPU-10326 [016] 11775.954466: sched_switch: CPU 9/KVM:10326 [120] S ==> CPU 5/KVM:10322 [120]
CPU-10322 [016] 11775.954488: sched_switch: CPU 5/KVM:10322 [120] S ==> CPU 3/KVM:10320 [120]
CPU-10320 [016] 11775.954511: sched_switch: CPU 3/KVM:10320 [120] S ==> CPU 4/KVM:10321 [120]
CPU-10321 [016] 11775.954533: sched_switch: CPU 4/KVM:10321 [120] S ==> CPU 7/KVM:10324 [120]
CPU-10324 [016] 11775.954555: sched_switch: CPU 7/KVM:10324 [120] S ==> CPU 1/KVM:10318 [120]
CPU-10318 [016] 11775.954578: sched_switch: CPU 1/KVM:10318 [120] S ==> CPU 2/KVM:10319 [120]
CPU-10319 [016] 11775.954600: sched_switch: CPU 2/KVM:10319 [120] S ==> CPU 8/KVM:10325 [120]
CPU-10325 [016] 11775.954928: sched_switch: CPU 8/KVM:10325 [120] R ==> ktimersoftd/16:159 [96]
ktimersoftd/16-159 [016] 11775.954933: sched_switch: ktimersoftd/16:159 [96] S ==> CPU 8/KVM:10325 [120]
CPU-10325 [016] 11775.955418: sched_switch: CPU 8/KVM:10325 [120] R ==> CPU 5/KVM:10322 [120]
CPU-10322 [016] 11775.955441: sched_switch: CPU 5/KVM:10322 [120] S ==> CPU 3/KVM:10320 [120]
CPU-10320 [016] 11775.955467: sched_switch: CPU 3/KVM:10320 [120] S ==> CPU 1/KVM:10318 [120]
CPU-10318 [016] 11775.955489: sched_switch: CPU 1/KVM:10318 [120] S ==> CPU 2/KVM:10319 [120]
CPU-10319 [016] 11775.955511: sched_switch: CPU 2/KVM:10319 [120] S ==> CPU 9/KVM:10326 [120]
CPU-10326 [016] 11775.955537: sched_switch: CPU 9/KVM:10326 [120] S ==> CPU 4/KVM:10321 [120]
CPU-10321 [016] 11775.955559: sched_switch: CPU 4/KVM:10321 [120] S ==> CPU 7/KVM:10324 [120]
CPU-10324 [016] 11775.955581: sched_switch: CPU 7/KVM:10324 [120] S ==> CPU 6/KVM:10323 [120]
CPU-10323 [016] 11775.955604: sched_switch: CPU 6/KVM:10323 [120] S ==> CPU 8/KVM:10325 [120]
CPU-10325 [016] 11775.955928: sched_switch: CPU 8/KVM:10325 [120] R ==> ktimersoftd/16:159 [96]
ktimersoftd/16-159 [016] 11775.955933: sched_switch: ktimersoftd/16:159 [96] S ==> CPU 8/KVM:10325 [120]
CPU-10325 [016] 11775.956418: sched_switch: CPU 8/KVM:10325 [120] R ==> CPU 7/KVM:10324 [120]
CPU-10324 [016] 11775.956439: sched_switch: CPU 7/KVM:10324 [120] S ==> CPU 2/KVM:10319 [120]
CPU-10319 [016] 11775.956461: sched_switch: CPU 2/KVM:10319 [120] S ==> CPU 9/KVM:10326 [120]
CPU-10326 [016] 11775.956483: sched_switch: CPU 9/KVM:10326 [120] S ==> CPU 4/KVM:10321 [120]
CPU-10321 [016] 11775.956505: sched_switch: CPU 4/KVM:10321 [120] S ==> CPU 6/KVM:10323 [120]
CPU-10323 [016] 11775.956527: sched_switch: CPU 6/KVM:10323 [120] S ==> CPU 3/KVM:10320 [120]
However, this still cannot explain why only core16 is used and not the rest (1,3,5,7,9,11,13,15,18).
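One host setting that becomes relevant later in this bug is CPU isolation: cores listed in isolcpus= are excluded from the scheduler's load balancing, so tasks only land on them through explicit affinity. A generic way to check the isolation configuration on a host (not a command taken from this report):

    # Show the isolcpus= kernel parameter; newer kernels also export the
    # effective isolated set in sysfs.
    tr ' ' '\n' < /proc/cmdline | grep '^isolcpus='
    cat /sys/devices/system/cpu/isolated 2>/dev/null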
Hello Peter, Luiz,

I'm back to work now :) Just to let you know, the testing with 7.9.z is still ongoing. During the past month I hit this issue 1 out of 5 times, with 24-hour runs. Next I'll switch to 1h testing and raise the priority of this testing. A vmcore will be provided once I reproduce it again. Thanks.

Best regards,
Pei

Created attachment 1734965 [details]
Some crash information for run queues and cpusets
Created attachment 1734988 [details]
Some crash information for run queues and cpusets, etc.
I've uploaded some material I dug out of the crash report. So far what I observed is quite similar to our previous analysis, but we got some more interesting information. I'll try to give a quick summary below.
Firstly, the host has 20 cores, mostly idle. Only core 16 is busy, and strangely all the vcpus are running on core 16:
crash> runq
CPU 16 RUNQUEUE: ffff94495fc1d940
CURRENT: PID: 12136 TASK: ffff944907abc580 COMMAND: "CPU 3/KVM"
RT PRIO_ARRAY: ffff94495fc1dad8
[ 95] PID: 12133 TASK: ffff9448d13e3420 COMMAND: "CPU 0/KVM"
CFS RB_ROOT: ffff94495fc1d9e0
[120] PID: 12137 TASK: ffff94490e2e22c0 COMMAND: "CPU 4/KVM"
[120] PID: 12138 TASK: ffff94490e2e1160 COMMAND: "CPU 5/KVM"
[120] PID: 12139 TASK: ffff9448d4f5d6e0 COMMAND: "CPU 6/KVM"
[120] PID: 12141 TASK: ffff9448d4f5c580 COMMAND: "CPU 8/KVM"
[120] PID: 12142 TASK: ffff9448d4f5a2c0 COMMAND: "CPU 9/KVM"
[120] PID: 12077 TASK: ffff9449031d0000 COMMAND: "qemu-kvm"
[120] PID: 12134 TASK: ffff9448d13e56e0 COMMAND: "CPU 1/KVM"
[120] PID: 12140 TASK: ffff9448d4f58000 COMMAND: "CPU 7/KVM"
[120] PID: 12135 TASK: ffff944907abe840 COMMAND: "CPU 2/KVM"
The vcpu threads have the correct cpu pinning (so it seems nothing is wrong with libvirt or the upper-layer software).
One example for vcpu0; the rest are the same:
crash> task -xR cpus_allowed 12133
PID: 12133 TASK: ffff9448d13e3420 CPU: 16 COMMAND: "CPU 0/KVM"
cpus_allowed = {
bits = {0x5aaaa, ...}
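As a quick sanity check, 0x5aaaa decodes to exactly the cpu list from the virt-install cpuset; a throwaway sketch:

    # Bit i of cpus_allowed set => cpu i allowed.
    # 0x5aaaa -> 1 3 5 7 9 11 13 15 16 18
    mask=$((0x5aaaa))
    for i in $(seq 0 19); do
        [ $(( (mask >> i) & 1 )) -eq 1 ] && printf '%d ' "$i"
    done; echo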
Note that the vcpu0 thread has prio 95, a lower prio value (i.e. a higher effective priority) than the other vcpus, which sit at the default 120.
vcpu0 is special here because its priority got boosted by rcub/2:
crash> task -R prio,policy 12133
PID: 12133 TASK: ffff9448d13e3420 CPU: 16 COMMAND: "CPU 0/KVM"
prio = 95,
policy = 0,
crash> ps | grep rcub
16 2 0 ffff94490e211160 IN 0.0 0 0 [rcub/0]
17 2 0 ffff94490e2122c0 IN 0.0 0 0 [rcub/1]
106 2 10 ffff94490e3e56e0 UN 0.0 0 0 [rcub/2]
crash> task -R pi_blocked_on 106
PID: 106 TASK: ffff94490e3e56e0 CPU: 10 COMMAND: "rcub/2"
pi_blocked_on = 0xffff94490dc03da0,
crash> p/x (((struct rt_mutex_waiter *)0xffff94490dc03da0)->lock->owner)
$6 = 0xffff9448d13e3421
crash> ps 0xffff9448d13e3420
PID PPID CPU TASK ST %MEM VSZ RSS COMM
12133 1 16 ffff9448d13e3420 RU 1.1 5150084 742916 CPU 0/KVM
crash> task -R prio,policy 106
PID: 106 TASK: ffff94490e3e56e0 CPU: 10 COMMAND: "rcub/2"
prio = 95,
policy = 1,
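One detail when reading that output: in the kernel's rt_mutex implementation the lowest bit of ->owner is a "has waiters" flag, not part of the pointer, so 0xffff9448d13e3421 with that bit masked off is 0xffff9448d13e3420, i.e. the vcpu0 task. That is why rcub/2 (prio 95) boosts vcpu0 to prio 95 via priority inheritance. For example:

    # Low bit of rt_mutex->owner is a flag; mask it to recover the task pointer.
    printf '0x%x\n' $(( 0x21 & ~1 ))   # ...21 -> ...20, the "CPU 0/KVM" task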
But what's more strange is the preemption between vcpu3 (priority 120) and vcpu0 (priority 95): it looks like vcpu3 got preempted by vcpu0. Backtraces for the vcpu3/vcpu0 threads:
crash> bt 12136
PID: 12136 TASK: ffff944907abc580 CPU: 16 COMMAND: "CPU 3/KVM"
#0 [ffff94495fc09e40] crash_nmi_callback at ffffffff99048467
#1 [ffff94495fc09e50] nmi_handle at ffffffff99781904
#2 [ffff94495fc09ea8] do_nmi at ffffffff99781b1d
#3 [ffff94495fc09ef0] end_repeat_nmi at ffffffff99780d4c
[exception RIP: vmx_vcpu_run+2655]
RIP: ffffffffc1250d4f RSP: ffff945053a8fcb0 RFLAGS: 00000046
RAX: 0000000080000202 RBX: ffff944f6062bfc0 RCX: 0000000000000048
RDX: 0000000000004404 RSI: 0000000000000000 RDI: ffff944f6062bfc0
RBP: ffff945053a8fd18 R8: 0000000000000000 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#4 [ffff945053a8fcb0] vmx_vcpu_run at ffffffffc1250d4f [kvm_intel]
#5 [ffff945053a8fd20] vcpu_enter_guest at ffffffffc0700d4b [kvm]
#6 [ffff945053a8fda0] kvm_arch_vcpu_ioctl_run at ffffffffc0709998 [kvm]
#7 [ffff945053a8fde8] kvm_vcpu_ioctl at ffffffffc06ea0b1 [kvm]
#8 [ffff945053a8fe80] do_vfs_ioctl at ffffffff99251b80
#9 [ffff945053a8ff00] sys_ioctl at ffffffff99251e31
#10 [ffff945053a8ff50] tracesys at ffffffff997894a8 (via system_call)
RIP: 00007fc0148ef307 RSP: 00007fc0090ffa98 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 00007fc02df24001 RCX: ffffffffffffffff
RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000019
RBP: 0000000000000001 R8: 0000557717ffa450 R9: 0000000000000000
R10: 0000000000000001 R11: 0000000000000246 R12: 0000557717fe01a0
R13: 0000000000000000 R14: 00007fc02df23000 R15: 000055771b14e000
ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b
crash> bt 12133
PID: 12133 TASK: ffff9448d13e3420 CPU: 16 COMMAND: "CPU 0/KVM"
#0 [ffff94504cde3860] __schedule at ffffffff9977c15e
#1 [ffff94504cde38f8] preempt_schedule at ffffffff9977c72c
#2 [ffff94504cde3918] preempt_schedule at ffffffff9977c77b
#3 [ffff94504cde3938] swake_up at ffffffff990e453a
#4 [ffff94504cde3958] kvm_vcpu_wake_up at ffffffffc06e6e7f [kvm]
#5 [ffff94504cde3970] kvm_vcpu_kick at ffffffffc06e6ea7 [kvm]
#6 [ffff94504cde3990] vmx_deliver_posted_interrupt at ffffffffc124b79e [kvm_intel]
#7 [ffff94504cde39a8] __apic_accept_irq at ffffffffc072a5d4 [kvm]
#8 [ffff94504cde3a10] kvm_apic_set_irq at ffffffffc072a81a [kvm]
#9 [ffff94504cde3a20] kvm_irq_delivery_to_apic_fast at ffffffffc072a9e3 [kvm]
#10 [ffff94504cde3a88] kvm_irq_delivery_to_apic at ffffffffc072fe15 [kvm]
#11 [ffff94504cde3b20] kvm_lapic_reg_write at ffffffffc0729c5f [kvm]
#12 [ffff94504cde3b70] kvm_x2apic_msr_write at ffffffffc072ca6d [kvm]
#13 [ffff94504cde3b98] kvm_set_msr_common at ffffffffc0701bed [kvm]
#14 [ffff94504cde3c08] vmx_set_msr at ffffffffc125499a [kvm_intel]
#15 [ffff94504cde3c38] kvm_set_msr at ffffffffc06f0ad7 [kvm]
#16 [ffff94504cde3c48] handle_wrmsr at ffffffffc124f7f8 [kvm_intel]
#17 [ffff94504cde3c90] vmx_handle_exit at ffffffffc12553ac [kvm_intel]
#18 [ffff94504cde3d20] vcpu_enter_guest at ffffffffc0700edd [kvm]
#19 [ffff94504cde3da0] kvm_arch_vcpu_ioctl_run at ffffffffc0709998 [kvm]
#20 [ffff94504cde3de8] kvm_vcpu_ioctl at ffffffffc06ea0b1 [kvm]
#21 [ffff94504cde3e80] do_vfs_ioctl at ffffffff99251b80
#22 [ffff94504cde3f00] sys_ioctl at ffffffff99251e31
#23 [ffff94504cde3f50] tracesys at ffffffff997894a8 (via system_call)
RIP: 00007fc0148ef307 RSP: 00007fc00a902a98 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 00007fc02e063001 RCX: ffffffffffffffff
RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000016
RBP: 0000000000000001 R8: 0000557717ffa450 R9: 00000000000000ff
R10: 0000000000000001 R11: 0000000000000246 R12: 0000557717fe01a0
R13: 0000000000000000 R14: 00007fc02e062000 R15: 000055771b0aa000
ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b
So, to summarize, the major questions are:
1. Why are all the vcpu threads queued on the core16 runqueue rather than spread across multiple host cores? Note that the cpu affinity should be set up correctly for each vcpu thread according to the above (0x5aaaa).
2. If vcpu0 has a higher priority (95, after being boosted by rcub/2) than vcpu3 (120, the default value, SCHED_OTHER with nice==0), why did vcpu3 get preempted by vcpu0?
I'm copying the rt-sched experts (Juri, Luis and Daniel) for help on the scheduler side.
For the full detail of the crash report, please refer to the (updated) attachment in comment 38.
Thanks,
Peter
(In reply to Peter Xu from comment #39)
> 1. Why are all the vcpu threads queued on the core16 runqueue rather than
>    spread across multiple host cores? Note that the cpu affinity should be
>    set up correctly for each vcpu thread according to the above (0x5aaaa).

Aren't these isolated cores? If so, I don't think any load balancing is going to happen between them. Rather, the vcpus have to be affined (moved) to each isolated cpu specifically, no?

Also, all vcpus except 0 and 3 seem to be running at OTHER priority, so it is eventually the CFS load balancer that needs to move them around, but again, I'm afraid it won't look at isolated cpus.

> 2. If vcpu0 has a higher priority (95, after being boosted by rcub/2) than
>    vcpu3 (120, the default value, SCHED_OTHER with nice==0), why did vcpu3
>    get preempted by vcpu0?

Not sure I understand. If vcpu0 has higher priority than vcpu3, don't we actually expect a preemption?

(In reply to Juri Lelli from comment #40)
> Aren't these isolated cores? If so, I don't think any load balancing is
> going to happen between them. Rather, the vcpus have to be affined (moved)
> to each isolated cpu specifically, no?

For this specific core dump, the isolated cores are:

    isolcpus=1,3,5,7,9,11,13,15,17,19,12,14,16,18

It's indeed kind of strange that isolcpus= does not match 0x5aaaa; however, just to mention, IIUC this core dump was generated during the VM installation phase. During that phase, our 1-on-1 vcpu pinning should not really take effect yet. That's also why all the vcpus (except the boosted one) do not have an RT priority but SCHED_OTHER.

> Also, all vcpus except 0 and 3 seem to be running at OTHER priority, so it
> is eventually the CFS load balancer that needs to move them around, but
> again, I'm afraid it won't look at isolated cpus.

But I thought CFS should look at any cpu specified by task_struct.cpus_allowed, right? No matter whether it contains isolated or housekeeping cpus.

> Not sure I understand. If vcpu0 has higher priority than vcpu3, don't we
> actually expect a preemption?

Sorry, my previous sentence "vcpu3 got preempted by vcpu0" was wrong... The fact is that vcpu3 preempted vcpu0:

    CPU 16 RUNQUEUE: ffff94495fc1d940
      CURRENT: PID: 12136  TASK: ffff944907abc580  COMMAND: "CPU 3/KVM"
      RT PRIO_ARRAY: ffff94495fc1dad8
         [ 95] PID: 12133  TASK: ffff9448d13e3420  COMMAND: "CPU 0/KVM"
      CFS RB_ROOT: ffff94495fc1d9e0
         [120] PID: 12137  TASK: ffff94490e2e22c0  COMMAND: "CPU 4/KVM"
         [120] PID: 12138  TASK: ffff94490e2e1160  COMMAND: "CPU 5/KVM"
         [120] PID: 12139  TASK: ffff9448d4f5d6e0  COMMAND: "CPU 6/KVM"
         [120] PID: 12141  TASK: ffff9448d4f5c580  COMMAND: "CPU 8/KVM"
         [120] PID: 12142  TASK: ffff9448d4f5a2c0  COMMAND: "CPU 9/KVM"
         [120] PID: 12077  TASK: ffff9449031d0000  COMMAND: "qemu-kvm"
         [120] PID: 12134  TASK: ffff9448d13e56e0  COMMAND: "CPU 1/KVM"
         [120] PID: 12140  TASK: ffff9448d4f58000  COMMAND: "CPU 7/KVM"
         [120] PID: 12135  TASK: ffff944907abe840  COMMAND: "CPU 2/KVM"

So a lower-priority task (seems to have) preempted a higher-priority task.