Bug 1817045
Summary:          RT host hang with "INFO: rcu_preempt detected stalls on CPUs/tasks"
Product:          Red Hat Enterprise Linux 7
Component:        kernel-rt
Sub component:    KVM
Version:          7.9
Status:           CLOSED NOTABUG
Severity:         high
Priority:         medium
Keywords:         Reopened
Reporter:         Pei Zhang <pezhang>
Assignee:         Peter Xu <peterx>
QA Contact:       Pei Zhang <pezhang>
CC:               acme, bhu, chayang, crobinso, daolivei, jinzhao, jlelli, juri.lelli, juzhang, lcapitulino, lgoncalv, nilal, pauld, peterx, virt-maint, williams
Target Milestone: rc
Target Release:   ---
Hardware:         Unspecified
OS:               Unspecified
Type:             Bug
Last Closed:      2020-12-15 17:36:29 UTC
Bug Blocks:       1890703
Description (Pei Zhang, 2020-03-25 13:23:14 UTC)
We hit the same issue in rhel7.7.z testing.
Testing versions:
kernel-rt-3.10.0-1062.27.1.rt56.1050.el7.x86_64
qemu-kvm-rhev-2.12.0-33.el7_7.11.x86_64
tuned-2.11.0-5.el7_7.1.noarch
libvirt-4.5.0-23.el7_7.7.x86_64
microcode_ctl-2.1-53.7.el7_7.x86_64
rt-tests-1.0-16.el7.x86_64
The RT host shows the call trace below:
[92309.610498] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=60002 jiffies, g=11061801, c=11061800, q=1435)
[92309.610500] All QSes seen, last rcu_preempt kthread activity 2 (4386982538-4386982536), jiffies_till_next_fqs=3
[92309.610502] swapper/0 R running task 0 0 0 0x00080000
[92309.610503] Call Trace:
[92309.610515] <IRQ> [<ffffffff8b4cdca6>] sched_show_task+0xb6/0x120
[92309.610518] [<ffffffff8b552b6d>] rcu_check_callbacks+0x84d/0x880
[92309.610523] [<ffffffff8b5031c0>] ? tick_sched_do_timer+0x50/0x50
[92309.610528] [<ffffffff8b4a0581>] update_process_times+0x41/0x70
[92309.610530] [<ffffffff8b502e90>] tick_sched_handle+0x30/0x70
[92309.610532] [<ffffffff8b5031f9>] tick_sched_timer+0x39/0x80
[92309.610534] [<ffffffff8b4bca61>] __hrtimer_run_queues+0x121/0x3b0
[92309.610537] [<ffffffff8b4bd9b9>] hrtimer_interrupt+0xb9/0x270
[92309.610541] [<ffffffff8b44b62b>] local_apic_timer_interrupt+0x3b/0x60
[92309.610546] [<ffffffff8bb866b3>] smp_apic_timer_interrupt+0x43/0x60
[92309.610548] [<ffffffff8bb831ba>] apic_timer_interrupt+0x16a/0x170
[92309.610554] <EOI> [<ffffffff8bb77be0>] ? __cpuidle_text_start+0x8/0x8
[92309.610557] [<ffffffff8bb77eab>] ? native_safe_halt+0xb/0x20
[92309.610559] [<ffffffff8bb77bfe>] default_idle+0x1e/0x130
[92309.610564] [<ffffffff8b4265e0>] arch_cpu_idle+0x20/0xc0
[92309.610567] [<ffffffff8b4f40da>] cpu_startup_entry+0x14a/0x1d0
[92309.610571] [<ffffffff8bb60984>] rest_init+0x84/0x90
[92309.610575] [<ffffffff8c1bd19f>] start_kernel+0x448/0x469
[92309.610576] [<ffffffff8c1bcb57>] ? repair_env_string+0x5c/0x5c
[92309.610580] [<ffffffff8c1bc120>] ? early_idt_handler_array+0x120/0x120
[92309.610582] [<ffffffff8c1bc70b>] x86_64_start_reservations+0x24/0x26
[92309.610584] [<ffffffff8c1bc861>] x86_64_start_kernel+0x154/0x177
[92309.610587] [<ffffffff8b4000d5>] start_cpu+0x5/0x14
[92489.604442] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=240007 jiffies, g=11061801, c=11061800, q=2698)
[92489.604443] All QSes seen, last rcu_preempt kthread activity 1 (4387162543-4387162542), jiffies_till_next_fqs=3
[92489.604446] swapper/0 R running task 0 0 0 0x00080000
[92489.604447] Call Trace:
[92489.604458] <IRQ> [<ffffffff8b4cdca6>] sched_show_task+0xb6/0x120
[92489.604461] [<ffffffff8b552b6d>] rcu_check_callbacks+0x84d/0x880
[92489.604467] [<ffffffff8b5031c0>] ? tick_sched_do_timer+0x50/0x50
[92489.604471] [<ffffffff8b4a0581>] update_process_times+0x41/0x70
[92489.604473] [<ffffffff8b502e90>] tick_sched_handle+0x30/0x70
[92489.604475] [<ffffffff8b5031f9>] tick_sched_timer+0x39/0x80
[92489.604477] [<ffffffff8b4bca61>] __hrtimer_run_queues+0x121/0x3b0
[92489.604480] [<ffffffff8b4bd9b9>] hrtimer_interrupt+0xb9/0x270
[92489.604484] [<ffffffff8b44b62b>] local_apic_timer_interrupt+0x3b/0x60
[92489.604488] [<ffffffff8bb866b3>] smp_apic_timer_interrupt+0x43/0x60
[92489.604491] [<ffffffff8bb831ba>] apic_timer_interrupt+0x16a/0x170
[92489.604497] <EOI> [<ffffffff8bb77be0>] ? __cpuidle_text_start+0x8/0x8
[92489.604500] [<ffffffff8bb77eab>] ? native_safe_halt+0xb/0x20
[92489.604502] [<ffffffff8bb77bfe>] default_idle+0x1e/0x130
[92489.604506] [<ffffffff8b4265e0>] arch_cpu_idle+0x20/0xc0
[92489.604510] [<ffffffff8b4f40da>] cpu_startup_entry+0x14a/0x1d0
[92489.604513] [<ffffffff8bb60984>] rest_init+0x84/0x90
[92489.604517] [<ffffffff8c1bd19f>] start_kernel+0x448/0x469
[92489.604518] [<ffffffff8c1bcb57>] ? repair_env_string+0x5c/0x5c
[92489.604521] [<ffffffff8c1bc120>] ? early_idt_handler_array+0x120/0x120
[92489.604524] [<ffffffff8c1bc70b>] x86_64_start_reservations+0x24/0x26
[92489.604526] [<ffffffff8c1bc861>] x86_64_start_kernel+0x154/0x177
[92489.604529] [<ffffffff8b4000d5>] start_cpu+0x5/0x14
[92669.598385] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=420012 jiffies, g=11061801, c=11061800, q=3800)
[92669.598387] All QSes seen, last rcu_preempt kthread activity 1 (4387342548-4387342547), jiffies_till_next_fqs=3
[92669.598389] swapper/0 R running task 0 0 0 0x00080000
[92669.598390] Call Trace:
[92669.598399] <IRQ> [<ffffffff8b4cdca6>] sched_show_task+0xb6/0x120
[92669.598401] [<ffffffff8b552b6d>] rcu_check_callbacks+0x84d/0x880
[92669.598406] [<ffffffff8b5031c0>] ? tick_sched_do_timer+0x50/0x50
[92669.598410] [<ffffffff8b4a0581>] update_process_times+0x41/0x70
[92669.598412] [<ffffffff8b502e90>] tick_sched_handle+0x30/0x70
[92669.598413] [<ffffffff8b5031f9>] tick_sched_timer+0x39/0x80
[92669.598415] [<ffffffff8b4bca61>] __hrtimer_run_queues+0x121/0x3b0
[92669.598418] [<ffffffff8b4bd9b9>] hrtimer_interrupt+0xb9/0x270
[92669.598421] [<ffffffff8b44b62b>] local_apic_timer_interrupt+0x3b/0x60
[92669.598424] [<ffffffff8bb866b3>] smp_apic_timer_interrupt+0x43/0x60
[92669.598426] [<ffffffff8bb831ba>] apic_timer_interrupt+0x16a/0x170
[92669.598432] <EOI> [<ffffffff8bb77be0>] ? __cpuidle_text_start+0x8/0x8
[92669.598434] [<ffffffff8bb77eab>] ? native_safe_halt+0xb/0x20
[92669.598436] [<ffffffff8bb77bfe>] default_idle+0x1e/0x130
[92669.598440] [<ffffffff8b4265e0>] arch_cpu_idle+0x20/0xc0
[92669.598443] [<ffffffff8b4f40da>] cpu_startup_entry+0x14a/0x1d0
[92669.598447] [<ffffffff8bb60984>] rest_init+0x84/0x90
[92669.598450] [<ffffffff8c1bd19f>] start_kernel+0x448/0x469
[92669.598451] [<ffffffff8c1bcb57>] ? repair_env_string+0x5c/0x5c
[92669.598454] [<ffffffff8c1bc120>] ? early_idt_handler_array+0x120/0x120
[92669.598457] [<ffffffff8c1bc70b>] x86_64_start_reservations+0x24/0x26
[92669.598459] [<ffffffff8c1bc861>] x86_64_start_kernel+0x154/0x177
[92669.598461] [<ffffffff8b4000d5>] start_cpu+0x5/0x14
...
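For reference, t=60002 jiffies corresponds to roughly 60 seconds with HZ=1000, and the warning then repeats about every 180 seconds (92309 -> 92489 -> 92669), i.e. at three times the stall timeout. Assuming the parameter is exposed on this kernel, the configured timeout can be checked with:

    # RCU stall detector timeout in seconds (path may vary by kernel version)
    cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout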
Thanks Pei, this has to be debugged and fixed on 7.7.z since it's supported. However, since we have other priorities at the moment, we'll wait a bit to get to this one.

Pei,

A quick question: the second RT guest installation that you initiate, do you perform this after the 24h run or during the execution of this 24h test?

(In reply to Nitesh Narayan Lal from comment #5)
> A quick question: the second RT guest installation that you initiate, do you
> perform this after the 24h run or during the execution of this 24h test?

Hi Nitesh,

The guest 24h cyclictest testing and the guest installation are not executed at the same time. Let me share more details:

(1) First we install a fresh guest. After the guest installation finishes, we set up the RT environment. Then we do 24h or 12h testing in this guest. Finally we clear the setup, including shutting down and undefining the guest. This scenario is done.

(2) Moving to the next scenario, we repeat (1).

Best regards,
Pei

From the dmesg that Pei provided, the rcu grace period 10895418 never completed during the 8-vcpu VM installation. That should cause all the rcu callbacks to stall; eventually it can use up all the memory of the host, and in the end we won't even be able to ssh to it.

Today I logged into the host before it died. Listing all the vcpu threads [1]:

    PID   TID   S POL PSR PRI COMMAND
    106   106   D FF  10  44  rcub/2
    1820  1820  D TS  10  19  NetworkManager
    3028  13214 D TS  2   19  pool
    10416 10416 R TS  16  19  qemu-kvm
    10416 10473 R TS  16  19  CPU 0/KVM
    10416 10474 R TS  16  19  CPU 1/KVM
    10416 10475 R TS  16  19  CPU 2/KVM
    10416 10476 R TS  16  44  CPU 3/KVM
    10416 10477 R TS  16  19  CPU 4/KVM
    10416 10478 R TS  16  19  CPU 5/KVM
    10416 10479 R TS  16  19  CPU 6/KVM
    10416 10480 R TS  16  19  CPU 7/KVM
    10416 10481 R TS  16  19  CPU 8/KVM
    10416 10482 R TS  16  19  CPU 9/KVM
    11915 11915 D TS  8   19  sshd
    13212 13212 D TS  8   19  kworker/8:1
    13220 13220 D TS  2   19  kworker/2:1
    13538 13538 D TS  10  19  pickup
    13796 13796 D TS  0   19  sshd
    13799 13799 R TS  2   19  ps
    19616 19616 D TS  8   19  kworker/8:2
    30258 30258 D TS  4   19  kworker/4:0

It's strange that during VM installation the vcpus are not using the pinning information provided on the virt-install command line but are all running on cpu16.

Here's Pei's installation cmdline for the 8-vcpu VM (copied from the beaker log):

    14:39:46:[cmd]:virt-install --graphics type=spice,listen=0.0.0.0 --name=rhel7.7.z_rt_8vcpu --machine pc --memory=4096,hugepages=yes --memorybacking hugepages=yes,size=1,unit=G,locked=yes --cpu host-passthrough,cache.mode=passthrough --numatune=1 --vcpus=10,cpuset=16,18,1,3,5,7,9,11,13,15 --controller type=scsi,model=virtio-scsi --controller type=virtio-serial --rng type=/dev/random --memballoon virtio --disk path=/home/images_nfv-virt-rt-kvm/rhel7.7.z_rt_8vcpu.qcow2,size=40 -l http://download.eng.pek2.redhat.com/rel-eng/latest-RHEL-7.7/compose/Server/x86_64/os/ --noautoconsole --noreboot -x ks=http://10.73.72.41/kickstart-rhel7-7-z.cfg --network bridge=switch,model=virtio,mac=38:66:88:44:de:02

Which contains:

    --vcpus=10,cpuset=16,18,1,3,5,7,9,11,13,15

So it seems that virt-install does not apply the vcpu pinning correctly during the installation phase. I suspect that's why CPU16 got overloaded.

"CPU 3/KVM" is special here since it's the only thread that has priority=44 (an RT priority). I'm not sure how that is achieved with SCHED_OTHER (policy=TS)... maybe due to priority inheritance in the RT world somehow.

Anyway, ideally the vcpus should be pinned at least to different cores even during installation. Is it a virt-install bug that "--vcpus=10,cpuset=..." is not applied during VM install?

Pei, to verify this, we can try to manually pin each vcpu after the installation starts, to see whether we still see the same RCU warning.

Created attachment 1701821 [details]
virt-install.log provided by Peter
Created attachment 1701823 [details]
working vm XML provided by peter
I talked with Peter on IRC. From the virt-install.log he provided, virt-install seems to be working correctly: it is passing the requested cpuset to both the install-time and the post-install boot XML. Generating XML is really all virt-install does in this case; all the actual pinning behavior is handled elsewhere.
But what's weird to me is that the post-install XML he grabbed from the running VM, after he rebooted it, has different CPU pinning XML (see the attachment): there's no cpuset= value, but there is a complicated <cputune> section with more pinning info.
I don't think libvirt is filling in that data itself, but I'm not certain. Is something else in the test case, or on the host, changing the VM XML separately?
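For reference, the manual pinning suggested earlier (pinning each vcpu while the installation is still running) could look roughly like the sketch below; the domain name and cpu list come from the virt-install command above, and the exact 1:1 vcpu-to-core mapping is only an illustration, not necessarily the mapping the test automation applies later.

    # Pin each vcpu of the installing guest to its own host core (live only),
    # instead of letting every vcpu float over the whole cpuset.
    dom=rhel7.7.z_rt_8vcpu
    cpus=(16 18 1 3 5 7 9 11 13 15)
    for i in "${!cpus[@]}"; do
        virsh vcpupin "$dom" "$i" "${cpus[$i]}" --live
    done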
(In reply to Cole Robinson from comment #18)
> But what's weird to me is that the post-install XML he grabbed from the
> running VM, after he rebooted it, has different CPU pinning XML (see the
> attachment): there's no cpuset= value, but there is a complicated <cputune>
> section with more pinning info.

Hi Cole,

As I hit this issue during the beaker automation process, in order to keep the current setup for Peter to debug, I didn't quit the beaker job, so the automation job is still running on this debug server.

The "complicated <cputune> section which has more pinning info" is defined by extra testing steps, not by virt-install. When Peter rebooted (or shut down?) the VM, the VM went into "shutdown" status, which triggered the automation script to continue with the other testing steps. The automation is designed like this:

1. # virt-install ....
2. Once the virt-install installation finishes, the VM shuts down; this is defined in the kickstart file http://10.73.72.41/kickstart-rhel7-7-z.cfg.
3. The auto script then detects that the VM is in "shutdown" status, undefines it, and re-defines it with an extra XML file which has the exact cpu pin info:
   # virsh undefine rhel7.7.z_rt_8vcpu
   # virsh define --file /tmp/rhel7.7.z_rt_8vcpu.xml

This bug was triggered during the virt-install step; once we hit this issue, the installation never finishes and the host hangs.

Thanks.

Best regards,
Pei

Thanks for the info Pei, that helps.

So the question is, why does the plain <vcpus cpuset='...'> method not work, while the more advanced <cputune> config does work?

Libvirt is responsible for setting the CPU affinity here; I don't think qemu is involved. So either libvirt is screwing up or something is wrong at the kernel level. If this reproduces with the same virt packages on a non-rt system, then it's probably a libvirt issue.

(In reply to Cole Robinson from comment #20)
> So the question is, why does the plain <vcpus cpuset='...'> method not work,
> while the more advanced <cputune> config does work?

This reminded me to go back and look at the virt-install manual, which says:

    --vcpus OPTIONS
        Number of virtual cpus to configure for the guest. If 'maxvcpus' is
        specified, the guest will be able to hotplug up to MAX vcpus while the
        guest is running, but will startup with VCPUS.

        CPU topology can additionally be specified with sockets, cores, and
        threads. If values are omitted, the rest will be autofilled preferring
        sockets over cores over threads.

        'cpuset' sets which physical cpus the guest can use. "CPUSET" is a
        comma separated list of numbers, which can also be specified in ranges
        or cpus to exclude. Example:

            0,2,3,5     : Use processors 0,2,3 and 5
            1-5,^3,8    : Use processors 1,2,4,5 and 8

So IIUC "<vcpus cpuset='...'>" will not pin one vcpu thread to each core, but lets all vcpu threads run on all the listed cores. E.g. "--vcpus=10,cpuset=16,18,1,3,5,7,9,11,13,15" allows each vcpu thread to run on any physical core within the cpu list "16,18,1,3,5,7,9,11,13,15". Then it's strange why all the vcpu threads are queued on physical core 16 (in the "R" state) in comment 16.

> Libvirt is responsible for setting the CPU affinity here; I don't think qemu
> is involved. So either libvirt is screwing up or something is wrong at the
> kernel level. If this reproduces with the same virt packages on a non-rt
> system, then it's probably a libvirt issue.

Agreed. And if libvirt has correctly applied the cpu list affinity to all the vcpu threads, then it could be a kernel issue.

Thanks Pei, it's reproduced. Looks like libvirt has done the expected pinning:
[root@dell-per430-10 ~]# ps -eLo tid,state,psr,comm | grep KVM
10317 R 16 CPU 0/KVM
10318 R 16 CPU 1/KVM
10319 R 16 CPU 2/KVM
10320 R 16 CPU 3/KVM
10321 R 16 CPU 4/KVM
10322 R 16 CPU 5/KVM
10323 R 16 CPU 6/KVM
10324 R 16 CPU 7/KVM
10325 R 16 CPU 8/KVM
10326 R 16 CPU 9/KVM
[root@dell-per430-10 ~]# ps -eLo tid,comm | grep KVM | while read line; do pid=${line%% *}; echo "$line: $(taskset -pc $pid)"; done
10317 CPU 0/KVM: pid 10317's current affinity list: 1,3,5,7,9,11,13,15,16,18
10318 CPU 1/KVM: pid 10318's current affinity list: 1,3,5,7,9,11,13,15,16,18
10319 CPU 2/KVM: pid 10319's current affinity list: 1,3,5,7,9,11,13,15,16,18
10320 CPU 3/KVM: pid 10320's current affinity list: 1,3,5,7,9,11,13,15,16,18
10321 CPU 4/KVM: pid 10321's current affinity list: 1,3,5,7,9,11,13,15,16,18
10322 CPU 5/KVM: pid 10322's current affinity list: 1,3,5,7,9,11,13,15,16,18
10323 CPU 6/KVM: pid 10323's current affinity list: 1,3,5,7,9,11,13,15,16,18
10324 CPU 7/KVM: pid 10324's current affinity list: 1,3,5,7,9,11,13,15,16,18
10325 CPU 8/KVM: pid 10325's current affinity list: 1,3,5,7,9,11,13,15,16,18
10326 CPU 9/KVM: pid 10326's current affinity list: 1,3,5,7,9,11,13,15,16,18
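The per-cpu scheduling trace below was presumably collected with ftrace; the exact invocation isn't shown, but a minimal tracefs-based capture of sched_switch events limited to cpu16 would look something like this:

    # Restrict tracing to cpu16 (hex cpumask, bit 16 set) and record switches.
    cd /sys/kernel/debug/tracing
    echo 10000 > tracing_cpumask
    echo 1 > events/sched/sched_switch/enable
    echo 1 > tracing_on
    sleep 5
    echo 0 > tracing_on
    cat trace > /tmp/sched_switch_cpu16.txt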
Dumping the task switches on cpu16, we can indeed observe the vcpu threads switching frequently:
CPU-10325 [016] 11775.954419: sched_switch: CPU 8/KVM:10325 [120] R ==> CPU 6/KVM:10323 [120]
CPU-10323 [016] 11775.954443: sched_switch: CPU 6/KVM:10323 [120] S ==> CPU 9/KVM:10326 [120]
CPU-10326 [016] 11775.954466: sched_switch: CPU 9/KVM:10326 [120] S ==> CPU 5/KVM:10322 [120]
CPU-10322 [016] 11775.954488: sched_switch: CPU 5/KVM:10322 [120] S ==> CPU 3/KVM:10320 [120]
CPU-10320 [016] 11775.954511: sched_switch: CPU 3/KVM:10320 [120] S ==> CPU 4/KVM:10321 [120]
CPU-10321 [016] 11775.954533: sched_switch: CPU 4/KVM:10321 [120] S ==> CPU 7/KVM:10324 [120]
CPU-10324 [016] 11775.954555: sched_switch: CPU 7/KVM:10324 [120] S ==> CPU 1/KVM:10318 [120]
CPU-10318 [016] 11775.954578: sched_switch: CPU 1/KVM:10318 [120] S ==> CPU 2/KVM:10319 [120]
CPU-10319 [016] 11775.954600: sched_switch: CPU 2/KVM:10319 [120] S ==> CPU 8/KVM:10325 [120]
CPU-10325 [016] 11775.954928: sched_switch: CPU 8/KVM:10325 [120] R ==> ktimersoftd/16:159 [96]
ktimersoftd/16-159 [016] 11775.954933: sched_switch: ktimersoftd/16:159 [96] S ==> CPU 8/KVM:10325 [120]
CPU-10325 [016] 11775.955418: sched_switch: CPU 8/KVM:10325 [120] R ==> CPU 5/KVM:10322 [120]
CPU-10322 [016] 11775.955441: sched_switch: CPU 5/KVM:10322 [120] S ==> CPU 3/KVM:10320 [120]
CPU-10320 [016] 11775.955467: sched_switch: CPU 3/KVM:10320 [120] S ==> CPU 1/KVM:10318 [120]
CPU-10318 [016] 11775.955489: sched_switch: CPU 1/KVM:10318 [120] S ==> CPU 2/KVM:10319 [120]
CPU-10319 [016] 11775.955511: sched_switch: CPU 2/KVM:10319 [120] S ==> CPU 9/KVM:10326 [120]
CPU-10326 [016] 11775.955537: sched_switch: CPU 9/KVM:10326 [120] S ==> CPU 4/KVM:10321 [120]
CPU-10321 [016] 11775.955559: sched_switch: CPU 4/KVM:10321 [120] S ==> CPU 7/KVM:10324 [120]
CPU-10324 [016] 11775.955581: sched_switch: CPU 7/KVM:10324 [120] S ==> CPU 6/KVM:10323 [120]
CPU-10323 [016] 11775.955604: sched_switch: CPU 6/KVM:10323 [120] S ==> CPU 8/KVM:10325 [120]
CPU-10325 [016] 11775.955928: sched_switch: CPU 8/KVM:10325 [120] R ==> ktimersoftd/16:159 [96]
ktimersoftd/16-159 [016] 11775.955933: sched_switch: ktimersoftd/16:159 [96] S ==> CPU 8/KVM:10325 [120]
CPU-10325 [016] 11775.956418: sched_switch: CPU 8/KVM:10325 [120] R ==> CPU 7/KVM:10324 [120]
CPU-10324 [016] 11775.956439: sched_switch: CPU 7/KVM:10324 [120] S ==> CPU 2/KVM:10319 [120]
CPU-10319 [016] 11775.956461: sched_switch: CPU 2/KVM:10319 [120] S ==> CPU 9/KVM:10326 [120]
CPU-10326 [016] 11775.956483: sched_switch: CPU 9/KVM:10326 [120] S ==> CPU 4/KVM:10321 [120]
CPU-10321 [016] 11775.956505: sched_switch: CPU 4/KVM:10321 [120] S ==> CPU 6/KVM:10323 [120]
CPU-10323 [016] 11775.956527: sched_switch: CPU 6/KVM:10323 [120] S ==> CPU 3/KVM:10320 [120]
However, this still cannot explain why only core16 is used and not the rest (1,3,5,7,9,11,13,15,18).
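One host setting that becomes relevant later in this bug is CPU isolation: cores listed in isolcpus= are excluded from the scheduler's load balancing, so tasks only land on them through explicit affinity. A generic way to check the isolation configuration on a host (not a command taken from this report):

    # Show the isolcpus= kernel parameter; newer kernels also export the
    # effective isolated set in sysfs.
    tr ' ' '\n' < /proc/cmdline | grep '^isolcpus='
    cat /sys/devices/system/cpu/isolated 2>/dev/null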
Hello Peter, Luiz,

I'm back to work now :) Just to let you know, the testing with 7.9.z is still ongoing. During the past month I hit this issue 1 out of 5 times, with 24-hour runs. Next I'll switch to 1h testing and raise the priority of this testing. A vmcore will be provided once I reproduce it again. Thanks.

Best regards,
Pei

Created attachment 1734965 [details]
Some crash information for run queues and cpusets
Created attachment 1734988 [details]
Some crash information for run queues and cpusets, etc.
I've uploaded some material I dug out of the crash report. So far what I observed is quite similar to our previous analysis, but we got some more interesting information. I'll try to give a quick summary below.
Firstly, the host has 20 cores, mostly idle. Only core 16 is busy, and strangely all the vcpus are running on core 16:
crash> runq
CPU 16 RUNQUEUE: ffff94495fc1d940
CURRENT: PID: 12136 TASK: ffff944907abc580 COMMAND: "CPU 3/KVM"
RT PRIO_ARRAY: ffff94495fc1dad8
[ 95] PID: 12133 TASK: ffff9448d13e3420 COMMAND: "CPU 0/KVM"
CFS RB_ROOT: ffff94495fc1d9e0
[120] PID: 12137 TASK: ffff94490e2e22c0 COMMAND: "CPU 4/KVM"
[120] PID: 12138 TASK: ffff94490e2e1160 COMMAND: "CPU 5/KVM"
[120] PID: 12139 TASK: ffff9448d4f5d6e0 COMMAND: "CPU 6/KVM"
[120] PID: 12141 TASK: ffff9448d4f5c580 COMMAND: "CPU 8/KVM"
[120] PID: 12142 TASK: ffff9448d4f5a2c0 COMMAND: "CPU 9/KVM"
[120] PID: 12077 TASK: ffff9449031d0000 COMMAND: "qemu-kvm"
[120] PID: 12134 TASK: ffff9448d13e56e0 COMMAND: "CPU 1/KVM"
[120] PID: 12140 TASK: ffff9448d4f58000 COMMAND: "CPU 7/KVM"
[120] PID: 12135 TASK: ffff944907abe840 COMMAND: "CPU 2/KVM"
The vcpu threads have the correct cpu pinning (so it seems nothing is wrong with libvirt or the upper-layer software).
One example for vcpu0; the rest are the same:
crash> task -xR cpus_allowed 12133
PID: 12133 TASK: ffff9448d13e3420 CPU: 16 COMMAND: "CPU 0/KVM"
cpus_allowed = {
bits = {0x5aaaa, ...}
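As a quick sanity check, 0x5aaaa decodes to exactly the cpu list from the virt-install cpuset; a throwaway sketch:

    # Bit i of cpus_allowed set => cpu i allowed.
    # 0x5aaaa -> 1 3 5 7 9 11 13 15 16 18
    mask=$((0x5aaaa))
    for i in $(seq 0 19); do
        [ $(( (mask >> i) & 1 )) -eq 1 ] && printf '%d ' "$i"
    done; echo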
Note that the vcpu0 thread has prio 95, a lower prio value (i.e. a higher effective priority) than the other vcpus, which sit at the default 120.
vcpu0 is special here because its priority got boosted by rcub/2:
crash> task -R prio,policy 12133
PID: 12133 TASK: ffff9448d13e3420 CPU: 16 COMMAND: "CPU 0/KVM"
prio = 95,
policy = 0,
crash> ps | grep rcub
16 2 0 ffff94490e211160 IN 0.0 0 0 [rcub/0]
17 2 0 ffff94490e2122c0 IN 0.0 0 0 [rcub/1]
106 2 10 ffff94490e3e56e0 UN 0.0 0 0 [rcub/2]
crash> task -R pi_blocked_on 106
PID: 106 TASK: ffff94490e3e56e0 CPU: 10 COMMAND: "rcub/2"
pi_blocked_on = 0xffff94490dc03da0,
crash> p/x (((struct rt_mutex_waiter *)0xffff94490dc03da0)->lock->owner)
$6 = 0xffff9448d13e3421
crash> ps 0xffff9448d13e3420
PID PPID CPU TASK ST %MEM VSZ RSS COMM
12133 1 16 ffff9448d13e3420 RU 1.1 5150084 742916 CPU 0/KVM
crash> task -R prio,policy 106
PID: 106 TASK: ffff94490e3e56e0 CPU: 10 COMMAND: "rcub/2"
prio = 95,
policy = 1,
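One detail when reading that output: in the kernel's rt_mutex implementation the lowest bit of ->owner is a "has waiters" flag, not part of the pointer, so 0xffff9448d13e3421 with that bit masked off is 0xffff9448d13e3420, i.e. the vcpu0 task. That is why rcub/2 (prio 95) boosts vcpu0 to prio 95 via priority inheritance. For example:

    # Low bit of rt_mutex->owner is a flag; mask it to recover the task pointer.
    printf '0x%x\n' $(( 0x21 & ~1 ))   # ...21 -> ...20, the "CPU 0/KVM" task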
But what's more strange is the preemption between vcpu3 (priority 120) and vcpu0 (priority 95): it looks like vcpu3 got preempted by vcpu0. Backtraces for the vcpu3/vcpu0 threads:
crash> bt 12136
PID: 12136 TASK: ffff944907abc580 CPU: 16 COMMAND: "CPU 3/KVM"
#0 [ffff94495fc09e40] crash_nmi_callback at ffffffff99048467
#1 [ffff94495fc09e50] nmi_handle at ffffffff99781904
#2 [ffff94495fc09ea8] do_nmi at ffffffff99781b1d
#3 [ffff94495fc09ef0] end_repeat_nmi at ffffffff99780d4c
[exception RIP: vmx_vcpu_run+2655]
RIP: ffffffffc1250d4f RSP: ffff945053a8fcb0 RFLAGS: 00000046
RAX: 0000000080000202 RBX: ffff944f6062bfc0 RCX: 0000000000000048
RDX: 0000000000004404 RSI: 0000000000000000 RDI: ffff944f6062bfc0
RBP: ffff945053a8fd18 R8: 0000000000000000 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#4 [ffff945053a8fcb0] vmx_vcpu_run at ffffffffc1250d4f [kvm_intel]
#5 [ffff945053a8fd20] vcpu_enter_guest at ffffffffc0700d4b [kvm]
#6 [ffff945053a8fda0] kvm_arch_vcpu_ioctl_run at ffffffffc0709998 [kvm]
#7 [ffff945053a8fde8] kvm_vcpu_ioctl at ffffffffc06ea0b1 [kvm]
#8 [ffff945053a8fe80] do_vfs_ioctl at ffffffff99251b80
#9 [ffff945053a8ff00] sys_ioctl at ffffffff99251e31
#10 [ffff945053a8ff50] tracesys at ffffffff997894a8 (via system_call)
RIP: 00007fc0148ef307 RSP: 00007fc0090ffa98 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 00007fc02df24001 RCX: ffffffffffffffff
RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000019
RBP: 0000000000000001 R8: 0000557717ffa450 R9: 0000000000000000
R10: 0000000000000001 R11: 0000000000000246 R12: 0000557717fe01a0
R13: 0000000000000000 R14: 00007fc02df23000 R15: 000055771b14e000
ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b
crash> bt 12133
PID: 12133 TASK: ffff9448d13e3420 CPU: 16 COMMAND: "CPU 0/KVM"
#0 [ffff94504cde3860] __schedule at ffffffff9977c15e
#1 [ffff94504cde38f8] preempt_schedule at ffffffff9977c72c
#2 [ffff94504cde3918] preempt_schedule at ffffffff9977c77b
#3 [ffff94504cde3938] swake_up at ffffffff990e453a
#4 [ffff94504cde3958] kvm_vcpu_wake_up at ffffffffc06e6e7f [kvm]
#5 [ffff94504cde3970] kvm_vcpu_kick at ffffffffc06e6ea7 [kvm]
#6 [ffff94504cde3990] vmx_deliver_posted_interrupt at ffffffffc124b79e [kvm_intel]
#7 [ffff94504cde39a8] __apic_accept_irq at ffffffffc072a5d4 [kvm]
#8 [ffff94504cde3a10] kvm_apic_set_irq at ffffffffc072a81a [kvm]
#9 [ffff94504cde3a20] kvm_irq_delivery_to_apic_fast at ffffffffc072a9e3 [kvm]
#10 [ffff94504cde3a88] kvm_irq_delivery_to_apic at ffffffffc072fe15 [kvm]
#11 [ffff94504cde3b20] kvm_lapic_reg_write at ffffffffc0729c5f [kvm]
#12 [ffff94504cde3b70] kvm_x2apic_msr_write at ffffffffc072ca6d [kvm]
#13 [ffff94504cde3b98] kvm_set_msr_common at ffffffffc0701bed [kvm]
#14 [ffff94504cde3c08] vmx_set_msr at ffffffffc125499a [kvm_intel]
#15 [ffff94504cde3c38] kvm_set_msr at ffffffffc06f0ad7 [kvm]
#16 [ffff94504cde3c48] handle_wrmsr at ffffffffc124f7f8 [kvm_intel]
#17 [ffff94504cde3c90] vmx_handle_exit at ffffffffc12553ac [kvm_intel]
#18 [ffff94504cde3d20] vcpu_enter_guest at ffffffffc0700edd [kvm]
#19 [ffff94504cde3da0] kvm_arch_vcpu_ioctl_run at ffffffffc0709998 [kvm]
#20 [ffff94504cde3de8] kvm_vcpu_ioctl at ffffffffc06ea0b1 [kvm]
#21 [ffff94504cde3e80] do_vfs_ioctl at ffffffff99251b80
#22 [ffff94504cde3f00] sys_ioctl at ffffffff99251e31
#23 [ffff94504cde3f50] tracesys at ffffffff997894a8 (via system_call)
RIP: 00007fc0148ef307 RSP: 00007fc00a902a98 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 00007fc02e063001 RCX: ffffffffffffffff
RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000016
RBP: 0000000000000001 R8: 0000557717ffa450 R9: 00000000000000ff
R10: 0000000000000001 R11: 0000000000000246 R12: 0000557717fe01a0
R13: 0000000000000000 R14: 00007fc02e062000 R15: 000055771b0aa000
ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b
So, to summarize, the major questions are:
1. Why are all the vcpu threads queued on the core16 runqueue rather than spread across multiple host cores? Note that the cpu affinity should be set up correctly for each vcpu thread according to the above (0x5aaaa).
2. If vcpu0 has a higher priority (95, after being boosted by rcub/2) than vcpu3 (120, the default value, SCHED_OTHER with nice==0), why did vcpu3 get preempted by vcpu0?
I'm copying the rt-sched experts (Juri, Luis and Daniel) for help on the scheduler side.
For the full detail of the crash report, please refer to the (updated) attachment in comment 38.
Thanks,
Peter
(In reply to Peter Xu from comment #39)
> 1. Why are all the vcpu threads queued on the core16 runqueue rather than
>    spread across multiple host cores? Note that the cpu affinity should be
>    set up correctly for each vcpu thread according to the above (0x5aaaa).

Aren't these isolated cores? If so, I don't think any load balancing is going to happen between them. Rather, the vcpus have to be affined (moved) to each isolated cpu specifically, no?

Also, all vcpus except 0 and 3 seem to be running at OTHER priority, so it is eventually the CFS load balancer that needs to move them around, but again, I'm afraid it won't look at isolated cpus.

> 2. If vcpu0 has a higher priority (95, after being boosted by rcub/2) than
>    vcpu3 (120, the default value, SCHED_OTHER with nice==0), why did vcpu3
>    get preempted by vcpu0?

Not sure I understand. If vcpu0 has higher priority than vcpu3, don't we actually expect a preemption?

(In reply to Juri Lelli from comment #40)
> Aren't these isolated cores? If so, I don't think any load balancing is
> going to happen between them. Rather, the vcpus have to be affined (moved)
> to each isolated cpu specifically, no?

For this specific core dump, the isolated cores are:

    isolcpus=1,3,5,7,9,11,13,15,17,19,12,14,16,18

It's indeed kind of strange that isolcpus= does not match 0x5aaaa; however, just to mention, IIUC this core dump was generated during the VM installation phase. During that phase, our 1-on-1 vcpu pinning should not really take effect yet. That's also why all the vcpus (except the boosted one) do not have an RT priority but SCHED_OTHER.

> Also, all vcpus except 0 and 3 seem to be running at OTHER priority, so it
> is eventually the CFS load balancer that needs to move them around, but
> again, I'm afraid it won't look at isolated cpus.

But I thought CFS should look at any cpu specified by task_struct.cpus_allowed, right? No matter whether it contains isolated or housekeeping cpus.

> Not sure I understand. If vcpu0 has higher priority than vcpu3, don't we
> actually expect a preemption?

Sorry, my previous sentence "vcpu3 got preempted by vcpu0" was wrong... The fact is that vcpu3 preempted vcpu0:

    CPU 16 RUNQUEUE: ffff94495fc1d940
      CURRENT: PID: 12136  TASK: ffff944907abc580  COMMAND: "CPU 3/KVM"
      RT PRIO_ARRAY: ffff94495fc1dad8
         [ 95] PID: 12133  TASK: ffff9448d13e3420  COMMAND: "CPU 0/KVM"
      CFS RB_ROOT: ffff94495fc1d9e0
         [120] PID: 12137  TASK: ffff94490e2e22c0  COMMAND: "CPU 4/KVM"
         [120] PID: 12138  TASK: ffff94490e2e1160  COMMAND: "CPU 5/KVM"
         [120] PID: 12139  TASK: ffff9448d4f5d6e0  COMMAND: "CPU 6/KVM"
         [120] PID: 12141  TASK: ffff9448d4f5c580  COMMAND: "CPU 8/KVM"
         [120] PID: 12142  TASK: ffff9448d4f5a2c0  COMMAND: "CPU 9/KVM"
         [120] PID: 12077  TASK: ffff9449031d0000  COMMAND: "qemu-kvm"
         [120] PID: 12134  TASK: ffff9448d13e56e0  COMMAND: "CPU 1/KVM"
         [120] PID: 12140  TASK: ffff9448d4f58000  COMMAND: "CPU 7/KVM"
         [120] PID: 12135  TASK: ffff944907abe840  COMMAND: "CPU 2/KVM"

So a lower-priority task (seems to have) preempted a higher-priority task.