Bug 1569910 - Call Trace shows in guest when running determine_maximum_mpps.sh

Summary:           Call Trace shows in guest when running determine_maximum_mpps.sh
Product:           Red Hat Enterprise Linux 7
Component:         kernel-rt (sub component: KVM)
Version:           7.6
Hardware:          Unspecified
OS:                Unspecified
Status:            CLOSED ERRATA
Severity:          high
Priority:          high
Reporter:          Pei Zhang <pezhang>
Assignee:          Marcelo Tosatti <mtosatti>
QA Contact:        Pei Zhang <pezhang>
CC:                bhu, chayang, daolivei, juzhang, knoel, lcapitulino, lgoncalv, michen, mtosatti, pezhang, siliu, virt-maint, williams
Target Milestone:  rc
Fixed In Version:  kernel-rt-3.10.0-924.rt56.871.el7
Doc Type:          No Doc Update
Type:              Bug
Last Closed:       2018-10-30 09:41:20 UTC
Bug Blocks:        1532680
== Update ==

Tried on another host following the same steps as in the Description: reproduced 3/3.

Hi,

As I mentioned in the meeting, it would be good to have a vmcore of the system when the RCU stall is detected. To do so, we need to enable the kdump service:

# systemctl enable kdump.service

set the following sysctls (in the /etc/sysctl.conf conf file):

kernel.panic_on_rcu_stall=1
kernel.panic=1

then reboot the system. When the RCU stall takes place again, kdump should generate a vmcore file in the /var/crash/ directory. Using the vmcore we can get a better idea of what caused the RCU stall.

-- Daniel

(In reply to Pei Zhang from comment #0)
> Created attachment 1424448 [details]
> Guest call trace log
>
> Description of problem:
> In an rt guest, when running determine_maximum_mpps.sh to get the mpps value,
> the script execution fails and the guest also shows a Call Trace.
>
> Version-Release number of selected component (if applicable):
> qemu-kvm-rhev-2.10.0-21.el7_5.2.x86_64
> tuned-2.9.0-1.el7.noarch
> libvirt-3.9.0-14.el7.x86_64
> kernel-rt-3.10.0-871.rt56.814.el7.x86_64
>
> How reproducible:
> 1/1
>
> Steps to Reproduce:
> 1. In the rt guest, clone queuelat and build it.
>
> 2. In the rt guest, run determine_maximum_mpps.sh. The script hangs partway
> through its run, as shown below, and just stops there.
> # cat determine_maximum_mpps.sh | grep PREAMBLE=
> PREAMBLE="taskset -c 1 chrt -f 1"
>
> # sh determine_maximum_mpps.sh
> Determining maximum mpps the machine can handle
> Will take a few minutes to determine mpps value
> And 10 minutes run to confirm the final mpps value is stable
> testing 3 Mpps
> success
> testing 6 Mpps
> success
> testing 9 Mpps
> success
> testing 12 Mpps
> success
> testing 15 Mpps
> success
> testing 18 Mpps
> success
> testing 21 Mpps
> success
> testing 24 Mpps
> success
> testing 27 Mpps
> success
> testing 30 Mpps
> success
> testing 33 Mpps
> success
> testing 36 Mpps
> success
> testing 39 Mpps
> success
> testing 42 Mpps
> success
> testing 45 Mpps
> success
> testing 48 Mpps
> success
> first loop mpps: 48
> testing 47 Mpps
> mpps success 47
> second loop mpps: 47
> third loop mpps: 47
> testing 46.9 Mpps
> mpps success 46.9
> Starting 10 runs of 30 seconds with 46.9 Mpps
> run 1 success
> run 2 success

Separately from the RCU bug below, the detected Mpps is too high.

Can you please try the attached determine_maximum_mpps.sh, then compress and attach here the directory /tmp/outdir/?

Thanks.

> 3. In the guest, check #dmesg; it repeatedly shows the Call Trace info
> below. The whole log is attached to this comment.
> [ 9164.417822] INFO: rcu_preempt self-detected stall on CPU { 1} (t=3660060 jiffies g=46470 c=46469 q=50248)
> [ 9164.417822] Task dump for CPU 1:
> [ 9164.417824] queuelat        R  running task     0  3191   2866  0x00080088
> [ 9164.417824] Call Trace:
> [ 9164.417827]  <IRQ>  [<ffffffffbd0c77c6>] sched_show_task+0xb6/0x120
> [ 9164.417829]  [<ffffffffbd0cbab9>] dump_cpu_task+0x39/0x70
> [ 9164.417831]  [<ffffffffbd1455a0>] rcu_dump_cpu_stacks+0x90/0xd0
> [ 9164.417833]  [<ffffffffbd14a386>] rcu_check_callbacks+0x476/0x860
> [ 9164.417835]  [<ffffffffbd09ac8b>] update_process_times+0x4b/0x80
> [ 9164.417837]  [<ffffffffbd0fb050>] tick_sched_handle+0x30/0x70
> [ 9164.417839]  [<ffffffffbd0fb479>] tick_sched_timer+0x39/0x80
> [ 9164.417841]  [<ffffffffbd0b6b47>] __run_hrtimer+0xc7/0x2e0
> [ 9164.417843]  [<ffffffffbd0fb440>] ? tick_sched_do_timer+0x50/0x50
> [ 9164.417845]  [<ffffffffbd0b7aa0>] hrtimer_interrupt+0x130/0x350
> [ 9164.417846]  [<ffffffffbd04a54b>] local_apic_timer_interrupt+0x3b/0x60
> [ 9164.417849]  [<ffffffffbd726363>] smp_apic_timer_interrupt+0x43/0x60
> [ 9164.417851]  [<ffffffffbd722eb2>] apic_timer_interrupt+0x162/0x170
> [ 9164.417853]  <EOI>  [<ffffffffbd7164b8>] ? preempt_schedule_irq+0x78/0x110
> [ 9164.417855]  [<ffffffffbd7195d7>] retint_kernel+0x37/0x40
> [ 9164.417857]  [<ffffffffbd7194f2>] ? retint_careful+0x10/0x34
>
> Actual results:
> The .sh script hangs, and a Call Trace shows in the guest #dmesg.
>
> Expected results:
> No Call Trace should show in the guest #dmesg.
>
> Additional info:
> 1. Guest kernel cmdline
> # cat /proc/cmdline
> BOOT_IMAGE=/vmlinuz-3.10.0-871.rt56.814.el7.x86_64
> root=/dev/mapper/rhel_bootp--73--75--185-root ro console=tty0
> console=ttyS0,115200n8 biosdevname=0 crashkernel=auto
> rd.lvm.lv=rhel_bootp-73-75-185/root rd.lvm.lv=rhel_bootp-73-75-185/swap rhgb
> quiet default_hugepagesz=1G iommu=pt intel_iommu=on skew_tick=1 isolcpus=1
> intel_pstate=disable nosoftlockup nohz=on nohz_full=1 rcu_nocbs=1
>
> 2.
> Host keeps working well, with no errors; #dmesg looks good.

(In reply to Marcelo Tosatti from comment #4)
> (In reply to Pei Zhang from comment #0)
> [...]
> > mpps success 46.9
> > Starting 10 runs of 30 seconds with 46.9 Mpps
> > run 1 success
> > run 2 success
>
> Separately from the RCU bug below, the detected Mpps is too high.

Hi Marcelo,

Mostly, the execution stops when testing 6 Mpps, like below:

# sh determine_maximum_mpps.sh
Determining maximum mpps the machine can handle
Will take a few minutes to determine mpps value
And 10 minutes run to confirm the final mpps value is stable
testing 3 Mpps
success
testing 6 Mpps

(stops here)

> Can you please try the attached determine_maximum_mpps.sh, then compress
> and attach here the directory /tmp/outdir/?

Do you mean a new determine_maximum_mpps.sh? Could you please attach it? Also, during the testing there seems to be no /tmp/outdir/ generated.

Thanks,
Pei

> Thanks.

Created attachment 1426103 [details]
determine_maximum_mpps.sh with debugging options
Runq shows tasks starving:

crash> runq
[...]
CPU 1 RUNQUEUE: ffff91367fd1af40
  CURRENT: PID: 1928   TASK: ffff9135f45b9050  COMMAND: "queuelat"
  RT PRIO_ARRAY: ffff91367fd1b0d8
     [ 95] PID: 18     TASK: ffff91367608c140  COMMAND: "rcuc/1"
     [ 96] PID: 20     TASK: ffff91367608e1e0  COMMAND: "ktimersoftd/1"
     [ 98] PID: 1928   TASK: ffff9135f45b9050  COMMAND: "queuelat"
  CFS RB_ROOT: ffff91367fd1afe0
     [120] PID: 1919   TASK: ffff9135768b4140  COMMAND: "kworker/1:1"

RCU has been waiting to run for more than 18 minutes!

crash> foreach RU ps -m
[0 00:00:00.999] [RU]  PID: 1928  TASK: ffff9135f45b9050  CPU: 1  COMMAND: "queuelat"
[0 00:00:01.000] [RU]  PID: 20    TASK: ffff91367608e1e0  CPU: 1  COMMAND: "ktimersoftd/1"
[0 00:00:05.013] [RU]  PID: 1919  TASK: ffff9135768b4140  CPU: 1  COMMAND: "kworker/1:1"
[0 00:18:06.717] [RU]  PID: 18    TASK: ffff91367608c140  CPU: 1  COMMAND: "rcuc/1"
[0 00:19:07.078] [RU]  PID: 0     TASK: ffffffff87c18480  CPU: 0  COMMAND: "swapper/0"
[0 00:18:06.889] [RU]  PID: 0     TASK: ffff91367608b0f0  CPU: 1  COMMAND: "swapper/1"

queuelat:
crash> task 1928 | grep prio
  prio = 98, static_prio = 120, normal_prio = 98, rt_priority = 1,

rcuc:
crash> task 18 | grep prio
  prio = 95, static_prio = 120, normal_prio = 95, rt_priority = 4,

ktimersoftd:
crash> task 20 | grep prio
  prio = 96, static_prio = 120, normal_prio = 96, rt_priority = 3,

As we can see, both rcuc and ktimersoftd have a priority higher than the queuelat tool, so they should be able to preempt the current task. We need to investigate this.

Pei,

Could you reproduce the problem with the irq_vectors, sched:sched_switch, sched:sched_wakeup and sched:sched_wakeup_new tracepoints enabled via the ftrace interface? e.g.:

# echo sched:sched_wakeup sched:sched_switch sched:sched_wakeup_new irq_vectors > /sys/kernel/debug/tracing/set_event

Thanks!

-- Daniel

Interesting finding. Another idea is to quickly check with an older kernel (say 7.4 or 7.5) to see if it's a recent regression.

(In reply to Luiz Capitulino from comment #10)
> Interesting finding.
> Another idea is to quickly check with an older kernel
> (say 7.4 or 7.5) to see if it's a recent regression.

Yep, I agree!

-- Daniel

Created attachment 1426617 [details]
File in /tmp/outdir

(In reply to Marcelo Tosatti from comment #8)
> Created attachment 1426103 [details]
> determine_maximum_mpps.sh with debugging options

Hi Marcelo,

The file in /tmp/outdir is attached to this comment. Please check.

Best Regards,
Pei

(In reply to Pei Zhang from comment #13)
> Created attachment 1426617 [details]
> File in /tmp/outdir
> [...]

# sh determine_maximum_mpps_new.sh
Determining maximum mpps the machine can handle
Will take a few minutes to determine mpps value
And 10 minutes run to confirm the final mpps value is stable
testing 3 Mpps
success
testing 6 Mpps

(stops here and the guest shows a Call Trace)

(In reply to Daniel Bristot de Oliveira from comment #11)
> (In reply to Luiz Capitulino from comment #10)
> > Interesting finding. Another idea is to quickly check with an older kernel
> > (say 7.4 or 7.5) to see if it's a recent regression.
>
> Yep, I agree!

Hi Luiz, Daniel,

This Call Trace issue also exists with the RHEL7.4 kernel-rt.

Versions: 3.10.0-693.rt56.617.el7.x86_64 (both host & guest)

And the Call Trace is a bit different:
[  256.553924] INFO: rcu_preempt self-detected stall on CPU { 1} (t=60000 jiffies g=9469 c=9468 q=553)
[  256.553925] Task dump for CPU 1:
[  256.553927] queuelat        R  running task     0  8961   8941  0x00080088
[  256.553930]  ffff8800b4c220a0 000000004be457e9 ffff88013fd03d88 ffffffff810be946
[  256.553931]  0000000000000001 ffffffff81a48a00 ffff88013fd03da0 ffffffff810c2a09
[  256.553932]  0000000000000002 ffff88013fd03dd0 ffffffff81136220 ffff88013fd12080
[  256.553933] Call Trace:
[  256.553943]  <IRQ>  [<ffffffff810be946>] sched_show_task+0xb6/0x120
[  256.553946]  [<ffffffff810c2a09>] dump_cpu_task+0x39/0x70
[  256.553950]  [<ffffffff81136220>] rcu_dump_cpu_stacks+0x90/0xd0
[  256.553953]  [<ffffffff8113aee6>] rcu_check_callbacks+0x486/0x860
[  256.553958]  [<ffffffff81087e81>] ? raise_softirq_irqoff+0x11/0x50
[  256.553961]  [<ffffffff81091ed1>] update_process_times+0x41/0x70
[  256.553966]  [<ffffffff810ee720>] tick_sched_handle+0x30/0x70
[  256.553968]  [<ffffffff810eeb49>] tick_sched_timer+0x39/0x80
[  256.553970]  [<ffffffff810adef4>] __run_hrtimer+0xc4/0x2c0
[  256.553972]  [<ffffffff810eeb10>] ? tick_sched_do_timer+0x50/0x50
[  256.553974]  [<ffffffff810aee20>] hrtimer_interrupt+0x130/0x350
[  256.553979]  [<ffffffff81047405>] local_apic_timer_interrupt+0x35/0x60
[  256.553984]  [<ffffffff816bc61d>] smp_apic_timer_interrupt+0x3d/0x50
[  256.553986]  [<ffffffff816bad9d>] apic_timer_interrupt+0x6d/0x80
[  256.553988]  <EOI>  [<ffffffff81136694>] ? rcu_eqs_exit_common.isra.29+0x24/0xe0
[  256.553992]  [<ffffffff816ae7d5>] ? __schedule+0x5/0x7d0
[  256.553995]  [<ffffffff816af5e1>] ? preempt_schedule_irq+0x81/0x110
[  256.553996]  [<ffffffff816b1df7>] retint_kernel+0x37/0x40
[  256.553998]  [<ffffffff816b1d12>] ? retint_careful+0x10/0x34

Best Regards,
Pei

> -- Daniel

(In reply to Pei Zhang from comment #14)
> [...]
> # sh determine_maximum_mpps_new.sh
> Determining maximum mpps the machine can handle
> Will take a few minutes to determine mpps value
> And 10 minutes run to confirm the final mpps value is stable
> testing 3 Mpps
> success
> testing 6 Mpps
>
> (Stop here and guest shows Call Trace)

It seems the calibration part didn't finish. Can you please do these two things:

1) Apply the attached patch (patch-debug-queue.patch) with:

cd rt-tests/src/queuelat/
patch < /tmp/patch-debug-queue.patch
make
cd ../../
make

Run determine_maximum_mpps.sh with this patch in, and save the data here.

2) Modify determine_maximum_mpps.sh:

# cat determine_maximum_mpps.sh | sed -e s/"-t 30"/"-t 60"/g > new_determine_maximum_mpps.sh
# cp new_determine_maximum_mpps.sh determine_maximum_mpps.sh

Run determine_maximum_mpps.sh with this new determine_maximum_mpps.sh (and the patch from item 1) and save the data here.

So we want two logs.

Thanks

Created attachment 1427282 [details]
queuelat.c debug patch
Hi,

This is a summary of my findings today:

This vmcore shows the same set of tasks waiting to run/running.

The trace is not showing any scheduling events on CPU1 (i.e., wakeup and switch), but it is showing them on CPU0. So the trace is working, but no events are happening on CPU1 - that means that no sched switch is happening and, probably, that the sched wakeup took place a long time ago. As an action, we need to increase the size of the trace buffer.

What I see in the traces are many lines like these:

queuelat-2051 1dNh10 798.909935: local_timer_entry: vector=239
queuelat-2051 1dNh10 798.909936: local_timer_exit: vector=239
queuelat-2051 1dNh10 798.909941: reschedule_entry: vector=253
queuelat-2051 1dNh10 798.909941: reschedule_exit: vector=253
queuelat-2051 1dNh10 798.909943: local_timer_entry: vector=239
queuelat-2051 1dNh10 798.909943: local_timer_exit: vector=239

So, IRQs are enabled, because IRQs are taking place. I see timer and *reschedule* IRQs happening.

The flags (e.g., 1dNh10) show that N is set. N means that need resched is set - I also see this flag in the thread_info of the current task (queuelat). So it is set - and has been for a loooong time (at least 30 seconds).

The need resched flag is used to tell the system that it needs to call the scheduler. The scheduler is called as soon as preemption gets enabled, or at the return of an interrupt handler. The interrupt handlers are running, yet the scheduler is not actually scheduling. Why?

One reason could be preemption being disabled. But that is not the case, because I see preemption enabled in the preempt counter of the curr task.

So, my current guesses are that we are either not checking need resched at the return of the interrupt handler (preempt enable no resched?), or we are calling the scheduler but it is not noticing that there is a higher prio task (in a fast read of the rq structure I see that the rtprio structure is pointing to the correct high prio...)
I need to think about how to get more information with trace - but I need to carefully select the tracepoints/functions to enable, because we see the crash some minutes after the events we want to observe take place, filling up the trace buffer.

I will continue working on it tomorrow.

-- Daniel

Created attachment 1427947 [details]
File in /tmp/outdir

(In reply to Marcelo Tosatti from comment #16)
> [...]
> It seems the calibration part didn't finish. Can you please do these two
> things:
> [...]
> So we want two logs.
>
> Thanks

Followed these steps; the /tmp/outdir zip files are attached to this comment.
# sh determine_maximum_mpps.sh
Determining maximum mpps the machine can handle
Will take a few minutes to determine mpps value
And 10 minutes run to confirm the final mpps value is stable
testing 3 Mpps
success
testing 6 Mpps
success
testing 9 Mpps
mpps failed: 9
first loop mpps: 9
testing 8 Mpps
failure
testing 7 Mpps
mpps success 7
second loop mpps: 7
testing 7.3 Mpps
success
testing 7.6 Mpps
success
testing 7.9 Mpps
mpps failure 7.9
third loop mpps: 7.9
testing 7.8 Mpps
failure
testing 7.7 Mpps
failure
testing 7.6 Mpps
failure
testing 7.5 Mpps
failure
testing 7.4 Mpps
mpps success 7.4
Starting 10 runs of 30 seconds with 7.4 Mpps
mpps failure (run 1) 7.4
Starting 10 runs of 30 seconds with 7.3 Mpps
run 1 success
run 2 success
run 3 success
run 4 success
run 5 success
mpps failure (run 6) 7.3
Starting 10 runs of 30 seconds with 7.2 Mpps
run 1 success
run 2 success
run 3 success
run 4 success
run 5 success
run 6 success
run 7 success
run 8 success
run 9 success
run 10 success
Starting 10 minutes run with 7.2 Mpps
mpps failure (run 10) 7.2
Starting 10 minutes run with 7.1 Mpps

(stops here)

Please let me know if you need QE's testing environment. Please mail me and I can provide it to you.

Best Regards,
Pei

# mount 10.66.8.172:/home/shared /mnt/mount/
mount.nfs: Connection timed out

Daniel mentioned an interrupt storm.
Seems to be delayed_work_timer_fn whose expires < now:

crash> timer
TVEC_BASES[0]: ffffffffba177100
JIFFIES
4295625094
   EXPIRES      TIMER_LIST        FUNCTION
4295625097  ffff9cc07607bdb0  ffffffffb9098560  <process_timeout>
4295625120  ffff9cbf75c2bd60  ffffffffb9098560  <process_timeout>
4295625768  ffffffffb9c57220  ffffffffb90a9e80  <delayed_work_timer_fn>
4295628608  ffffffffb9d3d450  ffffffffb90a9e80  <delayed_work_timer_fn>
4295630016  ffffffffb9cc2300  ffffffffb90a9e80  <delayed_work_timer_fn>
4295630016  ffff9cc07fc17ee0  ffffffffb90a9e80  <delayed_work_timer_fn>
4295635264  ffffffffb9d38a70  ffffffffb90a9e80  <delayed_work_timer_fn>
4295638400  ffff9cc0725da848  ffffffffb90a9e80  <delayed_work_timer_fn>
4295639552  ffff9cbf40092c48  ffffffffb90a9e80  <delayed_work_timer_fn>
4295643008  ffffffffc049af20  ffffffffb90a9e80  <delayed_work_timer_fn>
4295645056  ffffffffb9d319b8  ffffffffb96c9320  <fib6_gc_timer_cb>
4295656448  ffffffffb9d39440  ffffffffb90a9e80  <delayed_work_timer_fn>
4295667712  ffffffffb9cfd320  ffffffffb9358420  <__prandom_timer>
4295729152  ffffffffb9d3cb80  ffffffffb90a9e80  <delayed_work_timer_fn>
4295868416  ffffffffb9d327b0  ffffffffb960d4a0  <flow_cache_new_hashrnd>
4295868416  ffffffffba329038  ffffffffb96835b0  <inet_frag_secret_rebuild>
4295868416  ffffffffba33a578  ffffffffb96835b0  <inet_frag_secret_rebuild>
4295890560  ffff9cc076153da0  ffffffffb9098560  <process_timeout>
4295909376  ffff9cc07fc1a730  ffffffffb90a80f0  <idle_worker_timeout>
4301914112  ffff9cc072edbf80  ffffffffb9659670  <tcp_keepalive_timer>
4302192640  ffff9cc072eddd80  ffffffffb9659670  <tcp_keepalive_timer>
4380688384  ffff9cbf76a321c0  ffffffffb96b8580  <ipv6_regen_rndid>
TVEC_BASES[1]: ffff9cc07613c000
JIFFIES
4295625094
   EXPIRES      TIMER_LIST        FUNCTION
4295565072  ffff9cc0724be280  ffffffffb90a9e80  <delayed_work_timer_fn>
4295570016  ffff9cc07fd17ee0  ffffffffb90a9e80  <delayed_work_timer_fn>

Hi!
This is where I stopped in the vmcore/trace analysis.

This is the stack trace of queuelat:

crash> bt
PID: 2070   TASK: ffff9cbf765fe1e0  CPU: 1   COMMAND: "queuelat"
 #0 [ffff9cc07fd03ba0] trace_function at ffffffffb915f76d          <--- the IRQ stack
 #1 [ffff9cc07fd03bd0] function_trace_call at ffffffffb91675c3
 #2 [ffff9cc07fd03be8] ftrace_call at ffffffffb972c44b
 #3 [ffff9cc07fd03c08] __crash_kexec at ffffffffb910eec2
 #4 [ffff9cc07fd03d48] __crash_kexec at ffffffffb910ef5f
 #5 [ffff9cc07fd03e78] update_process_times at ffffffffb909ac4b
 #6 [ffff9cc07fd03ea0] tick_sched_handle at ffffffffb90fb020
 #7 [ffff9cc07fd03ec0] tick_sched_timer at ffffffffb90fb449
 #8 [ffff9cc07fd03ee8] __run_hrtimer at ffffffffb90b6b07
 #9 [ffff9cc07fd03f28] hrtimer_interrupt at ffffffffb90b7a60
#10 [ffff9cc07fd03fb8] local_apic_timer_interrupt at ffffffffb904a54b
#11 [ffff9cc07fd03fd0] smp_trace_apic_timer_interrupt at ffffffffb972d3ca
#12 [ffff9cc07fd03ff0] trace_apic_timer_interrupt at ffffffffb972a022
--- <IRQ stack> ---
#13 [ffff9cbfc9803e38] trace_apic_timer_interrupt at ffffffffb972a022
    [exception RIP: preempt_schedule_irq+120]                      <--- the regular (thread) stack is here
    RIP: ffffffffb971ca28  RSP: ffff9cbfc9803ee0  RFLAGS: 00000282
    RAX: 0000000000000082  RBX: ffffffffb971f0c5  RCX: ffff9cbfc9800000
    RDX: 0000000000000008  RSI: 0000000000000001  RDI: 0000000000000082
    RBP: ffff9cbfc9803ef8   R8: 00000000ccccd22b   R9: 0000000000000018
    R10: 000000d0f6f9b1df  R11: 0000000000000008  R12: ffff9cbf765fe7c8
    R13: 0000000000000001  R14: 0000000000000000  R15: ffff9cbfc9800000
    ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0000
#14 [ffff9cbfc9803f00] retint_kernel at ffffffffb97205d7
#15 [ffff9cbfc9803f58] retint_careful at ffffffffb97204f2
    RIP: 0000000000400dfb  RSP: 00007ffd3d0b2f50  RFLAGS: 00000202
    RAX: 0000000000000000  RBX: 000000000000000b  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 00000000017d66c0  RDI: 00000000017be010
    RBP: 0000000000000000   R8: 00000000017d66a0   R9: 0000000000000000
    R10: 00007ffd3d0b29a0  R11: 00007fd0e4a5b200  R12: ffff9cbfc9803f88
    R13: 0000000000000202  R14: 0000000000000010  R15: ffffffffb97204f2
    ORIG_RAX: ffffffffffffff10  CS: 0033  SS: 002b

The regular stack points to this file:

----------------------- %< ----------------------------------------------
crash> dis -l preempt_schedule_irq+120
/usr/src/debug/kernel-3.10.0/linux-3.10.0.x86_64/kernel/sched/core.c: 3939
0xffffffffb971ca28 <preempt_schedule_irq+120>: callq 0xffffffffb971ba80 <__schedule>
----------------------- >% ----------------------------------------------

Which has this C code:

----------------------- %< ----------------------------------------------
asmlinkage void __sched preempt_schedule_irq(void)
{
	struct thread_info *ti = current_thread_info();
	enum ctx_state prev_state;

	/* Catch callers which need to be fixed */
	BUG_ON(ti->preempt_count || !irqs_disabled());

	prev_state = exception_enter();

	do {
		add_preempt_count(PREEMPT_ACTIVE);
		local_irq_enable();
		__schedule();          <------------- Here
		local_irq_disable();
		sub_preempt_count(PREEMPT_ACTIVE);

		/*
		 * Check again in case we missed a preemption opportunity
		 * between schedule and now.
		 */
		barrier();
	} while (need_resched());

	exception_exit(prev_state);
}
----------------------- >% ----------------------------------------------

As we discussed many times, this code should call schedule. The interrupts are enabled because of the previous local_irq_enable(). The same goes for preemption, which was enabled at add_preempt_count(PREEMPT_ACTIVE).

So, the interrupts are enabled and the scheduler is ready to be called, as we expected...

So, the question was "why isn't the scheduler being called?"

The last trace from Pei showed why: we are facing a timer interrupt storm.
Here is the trace that shows the problem we are seeing, in the return of an interrupt handler (the command to read this trace is "trace-cmd report -i trace.dat --cpu 1 -l"):

queuelat-2070 1dNh10 955.890495: local_timer_exit: vector=239
---------------------^ the need resched flag is already set

(I cannot see the wakeup of the other threads because it has already been overwritten in the trace buffer, which is 100 MB large.)

Anyway, this means that there was a wakeup of a higher priority task... which is true, if we look at the runqueue:

------------------------- %< --------------------------------------------
CPU 1 RUNQUEUE: ffff9cc07fd1af40
  CURRENT: PID: 2070   TASK: ffff9cbf765fe1e0  COMMAND: "queuelat"
  RT PRIO_ARRAY: ffff9cc07fd1b0d8
     [ 95] PID: 18     TASK: ffff9cc07608c140  COMMAND: "rcuc/1"
     [ 96] PID: 20     TASK: ffff9cc07608e1e0  COMMAND: "ktimersoftd/1"
     [ 98] PID: 2070   TASK: ffff9cbf765fe1e0  COMMAND: "queuelat"
  CFS RB_ROOT: ffff9cc07fd1afe0
     [no tasks queued]
------------------------- >% --------------------------------------------

queuelat-2070 1dNh.0 955.890496: function: irq_exit
queuelat-2070 1dNh.0 955.890496: function: vtime_gen_account_irq_exit
queuelat-2070 1dNh.0 955.890496: function: _raw_qspin_lock
queuelat-2070 1dN..0 955.890496: function: wakeup_timer_softirqd
queuelat-2070 1dN..0 955.890496: function: idle_cpu
queuelat-2070 1dN..0 955.890496: function: tick_nohz_irq_exit
queuelat-2070 1dN..0 955.890496: function: can_stop_full_tick
queuelat-2070 1dN..0 955.890497: function: sched_can_stop_tick
queuelat-2070 1dN..0 955.890497: function: posix_cpu_timers_can_stop_tick
queuelat-2070 1dN..0 955.890497: function: ktime_get
queuelat-2070 1dN..0 955.890497: function: tick_nohz_stop_sched_tick
queuelat-2070 1dN..0 955.890497: function: rcu_needs_cpu
queuelat-2070 1dN..0 955.890497: function: get_next_timer_interrupt
queuelat-2070 1dN..0 955.890497: function: rt_spin_trylock
queuelat-2070 1dN..0 955.890497: function: rt_spin_unlock_after_trylock_in_irq
queuelat-2070 1dN..0 955.890498: function: hrtimer_get_next_event
queuelat-2070 1dN..0 955.890498: function: _raw_spin_lock_irqsave
queuelat-2070 1dN.10 955.890498: function: _raw_spin_unlock_irqrestore
queuelat-2070 1dN..0 955.890498: function: timekeeping_max_deferment
queuelat-2070 1dN..0 955.890498: function: scheduler_tick_max_deferment
queuelat-2070 1dN..0 955.890498: function: jiffies_to_usecs
queuelat-2070 1dN..0 955.890498: function: hrtimer_start_range_ns
queuelat-2070 1dN..0 955.890499: function: __hrtimer_start_range_ns
queuelat-2070 1dN..0 955.890499: function: lock_hrtimer_base
queuelat-2070 1dN..0 955.890499: function: _raw_spin_lock_irqsave
queuelat-2070 1dN.10 955.890499: function: enqueue_hrtimer
queuelat-2070 1dN.10 955.890499: function: hrtimer_reprogram
queuelat-2070 1dN.10 955.890499: function: tick_program_event
queuelat-2070 1dN.10 955.890499: function: clockevents_program_event
queuelat-2070 1dN.10 955.890499: function: ktime_get
queuelat-2070 1dN.10 955.890500: function: clockevents_program_min_delta
queuelat-2070 1dN.10 955.890500: function: ktime_get
queuelat-2070 1dN.10 955.890500: function: lapic_next_deadline
queuelat-2070 1dN.10 955.890503: function: _raw_spin_unlock_irqrestore

This is the last function the interrupt handler runs before returning & enabling the interrupts:

queuelat-2070 1dN..0 955.890503: function: rcu_irq_exit

But then, BOOM: another timer interrupt arrives:

queuelat-2070 1dN..0 955.890504: function: smp_trace_apic_timer_interrupt
queuelat-2070 1dN..0 955.890504: function: irq_enter
queuelat-2070 1dN..0 955.890504: function: rcu_irq_enter
queuelat-2070 1dN..0 955.890504: function: vtime_common_account_irq_enter
queuelat-2070 1dN..0 955.890504: function: vtime_account_system
queuelat-2070 1dNh.0 955.890505: function: exit_idle
queuelat-2070 1dNh.0 955.890505: function: kvm_guest_apic_eoi_write
queuelat-2070 1dNh10 955.890505: local_timer_entry: vector=239
queuelat-2070 1dNh.0 955.890505: function: local_apic_timer_interrupt

And the system keeps running this forever.

-- Daniel

(In reply to Daniel Bristot de Oliveira from comment #28)
> Hi! This is where I stopped in the vmcore/trace analysis:
> [...]
> So, the question was "why isn't the scheduler being called?"
>
> The last trace from Pei showed why: we are facing a timer interrupt storm.
Hi Daniel,

As soon as ktimersoftd/1 is able to run and execute run_timer_softirq -> __run_timers -> delayed_work_timer_fn, tick_nohz_stop_sched_tick will stop seeing the timers from comment 27 in the get_next_timer_interrupt() call, and the interrupt storm will stop.

So one fix would be: "If the TIMER_SOFTIRQ softirq is pending (and therefore ktimersoftd execution is pending), do not take old-style timers into account, since __run_timers will rearm any remaining timers in the future as necessary."

Unsure whether that is the right level at which to fix the bug, though; I'll prepare a fix and send it upstream.

Hi Marcelo,

Nice! Please Cc: me in the upstream discussion!

Btw, how about filing another BZ specifically for this timer problem?

-- Daniel

Created attachment 1449271 [details]
skip expired timers on get_next_timer_interrupt()
Untested kernel with fix at

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=16678250

(fixing a problem with storage where the guest tests are...
should be testing tomorrow).

(In reply to Marcelo Tosatti from comment #33)
> Untested kernel with fix at
>
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=16678250
>
> (fixing a problem with storage where the guest tests are...
> should be testing tomorrow).

Ok, it boots fine... now, whether it fixes the problem: not sure.

Created attachment 1450216 [details]
skip expired timers on get_next_timer_interrupt() (V2)
(In reply to Daniel Bristot de Oliveira from comment #30)
> Hi Marcelo,
>
> Nice! Please, Cc: me in the upstream discussion!
>
> Btw, how about filing another BZ, specifically for this timer problem?
>
> -- Daniel

Daniel,

I think this BZ is only about the kernel bug: I can't see any problem with queuelat (my guess is that the queuelat error is caused by the wrong scheduler behaviour).

(In reply to Marcelo Tosatti from comment #33)
> Untested kernel with fix at
>
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=16678250
>
> (fixing a problem with storage where the guest tests are...
> should be testing tomorrow).

This bug cannot be reproduced any more; the patch fixes this issue. 3/3 PASS.

== run1
# sh determine_maximum_mpps.sh
Determining maximum mpps the machine can handle
Will take a few minutes to determine mpps value
And 10 minutes run to confirm the final mpps value is stable
testing 3 Mpps
success
testing 6 Mpps
success
testing 9 Mpps
success
testing 12 Mpps
success
testing 15 Mpps
success
testing 18 Mpps
success
testing 21 Mpps
success
testing 24 Mpps
success
testing 27 Mpps
success
testing 30 Mpps
success
testing 33 Mpps
success
testing 36 Mpps
success
testing 39 Mpps
success
testing 42 Mpps
success
testing 45 Mpps
success
testing 48 Mpps
success
first loop mpps: 48
testing 47 Mpps
mpps success 47
second loop mpps: 47
third loop mpps: 47
testing 46.9 Mpps
mpps success 46.9
Starting 10 runs of 30 seconds with 46.9 Mpps
run 1 success
run 2 success
run 3 success
run 4 success
run 5 success
run 6 success
run 7 success
run 8 success
run 9 success
run 10 success
Starting 10 minutes run with 46.9 Mpps
run 10 success
Final mpps is: 46.9

== run2
# sh determine_maximum_mpps.sh
Determining maximum mpps the machine can handle
Will take a few minutes to determine mpps value
And 10 minutes run to confirm the final mpps value is stable
testing 3 Mpps
success
...
run 10 success
Starting 10 minutes run with 46.9 Mpps
run 10 success
Final mpps is: 46.9

== run3
# sh determine_maximum_mpps.sh
Determining maximum mpps the machine can handle
Will take a few minutes to determine mpps value
And 10 minutes run to confirm the final mpps value is stable
testing 3 Mpps
success
...
run 10 success
Starting 10 minutes run with 46.9 Mpps
run 10 success
Final mpps is: 46.9

For some reason I can't reproduce this with a custom kernel module that adds an already expired timer on an isolated CPU (which means the conditions on which the correction patch is based are not fully understood).

insmod-9090 [000] ....111 5329.878993: timer_start: timer=ffffffffc09ca260 function=test_timer_fn [timer_test] expires=4299996929 [timeout=-10] flags=0x0010000f
queuelat-9087 [015] d...... 5329.878995: tick_nohz_stop_sched_tick <-tick_nohz_irq_exit
queuelat-9087 [015] d...... 5329.878995: get_next_timer_interrupt <-tick_nohz_stop_sched_tick
queuelat-9087 [015] d...2.. 5329.878995: hrtimer_start: hrtimer=ffff96675e456fc0 function=tick_sched_timer expires=5329643000000 softexpires=5329643000000
queuelat-9087 [015] d...... 5329.878999: smp_apic_timer_interrupt <-apic_timer_interrupt
queuelat-9087 [015] d..h... 5329.878999: tick_sched_timer <-__run_hrtimer
queuelat-9087 [015] dN..... 5329.879001: tick_nohz_stop_sched_tick <-tick_nohz_irq_exit
queuelat-9087 [015] dN..... 5329.879002: get_next_timer_interrupt <-tick_nohz_stop_sched_tick
queuelat-9087 [015] dN..2.. 5329.879002: hrtimer_start: hrtimer=ffff96675e456fc0 function=tick_sched_timer expires=5329643000000 softexpires=5329643000000
queuelat-9087 [015] dN..... 5329.879003: tick_nohz_stop_sched_tick <-tick_nohz_irq_exit
queuelat-9087 [015] dN..... 5329.879003: get_next_timer_interrupt <-tick_nohz_stop_sched_tick

Here the kernel is able to switch to ktimersoftd: we should find out where exactly this switch happens, and why it fails to happen in the scenario Pei reported.
ktimersoftd/15-124 [015] d....12 5329.879005: smp_apic_timer_interrupt <-apic_timer_interrupt
ktimersoftd/15-124 [015] d..h.12 5329.879005: tick_sched_timer <-__run_hrtimer
ktimersoftd/15-124 [015] d....12 5329.879006: tick_nohz_stop_sched_tick <-tick_nohz_irq_exit
ktimersoftd/15-124 [015] d....12 5329.879006: get_next_timer_interrupt <-tick_nohz_stop_sched_tick

-RT upstream seems to execute timers from IRQ return via:

static inline void invoke_softirq(void)
{
	if (ksoftirqd_running())
		return;

	if (!force_irqthreads) {
#ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
		/*
		 * We can safely execute softirq on the current stack if
		 * it is the irq stack, because it should be near empty
		 * at this stage.
		 */
		__do_softirq();
#else
		/*
		 * Otherwise, irq_exit() is called on the task stack that can
		 * be potentially deep already. So call softirq in its own stack
		 * to prevent from any overrun.
		 */
		do_softirq_own_stack();
#endif
	} else {
		wakeup_softirqd();
	}
}

Then it won't hit the problem if force_irqthreads==false (which is the default).

Failure to reproduce, on a guest, without the custom timer being added:

[root@localhost queuelat]# sh determine_maximum_mpps.sh
Determining maximum mpps the machine can handle
Will take a few minutes to determine mpps value
And 10 minutes run to confirm the final mpps value is stable
testing 3 Mpps
success
testing 6 Mpps
...
run 8 success
run 9 success
run 10 success
Starting 10 minutes run with 7.1 Mpps
run 10 success
Final mpps is: 7.1
[root@localhost queuelat]# sh determine_maximum_mpps.sh
Determining maximum mpps the machine can handle
Will take a few minutes to determine mpps value
And 10 minutes run to confirm the final mpps value is stable
testing 3 Mpps
success
testing 6 Mpps
success
testing 9 Mpps
mpps failed: 9
first loop mpps: 9
testing 8 Mpps
mpps success 8
second loop mpps: 8
third loop mpps: 8
testing 7.9 Mpps
failure
testing 7.8 Mpps
mpps success 7.8

Pei, please make the machine available when possible. TIA.
Adding a timer to the isolated CPU, we see the following.

Summary: the apic timer interrupt is armed (to handle tick_sched_timer) multiple times, but eventually ktimersoftd is able to execute:

queuelat-2295 [001] d...... 1878.617133: tick_nohz_stop_sched_tick <-tick_nohz_irq_exit
queuelat-2295 [001] d...... 1878.617134: get_next_timer_interrupt <-tick_nohz_stop_sched_tick
insmod-2298 [000] ....111 1878.617338: timer_start: timer=ffffffffc06c0260 function=test_timer_fn [timer_test] expires=4296546745 [timeout=-10]
queuelat-2295 [001] d...... 1878.617344: tick_nohz_stop_sched_tick <-tick_nohz_irq_exit
queuelat-2295 [001] d...... 1878.617345: get_next_timer_interrupt <-tick_nohz_stop_sched_tick
queuelat-2295 [001] d...... 1878.617352: tick_nohz_stop_sched_tick <-tick_nohz_irq_exit
queuelat-2295 [001] d...... 1878.617352: get_next_timer_interrupt <-tick_nohz_stop_sched_tick
queuelat-2295 [001] d...... 1878.617354: smp_apic_timer_interrupt <-apic_timer_interrupt
queuelat-2295 [001] d..h1.. 1878.617355: hrtimer_expire_entry: hrtimer=ffff8acb3fd15f00 function=tick_sched_timer now=1879459798601
queuelat-2295 [001] d..h... 1878.617355: tick_sched_timer <-__run_hrtimer
queuelat-2295 [001] d..h1.. 1878.617357: hrtimer_expire_exit: hrtimer=ffff8acb3fd15f00
queuelat-2295 [001] dN..... 1878.617360: tick_nohz_stop_sched_tick <-tick_nohz_irq_exit
queuelat-2295 [001] dN..... 1878.617360: get_next_timer_interrupt <-tick_nohz_stop_sched_tick
queuelat-2295 [001] dN..... 1878.617363: tick_nohz_stop_sched_tick <-tick_nohz_irq_exit
queuelat-2295 [001] dN..... 1878.617363: get_next_timer_interrupt <-tick_nohz_stop_sched_tick
queuelat-2295 [001] dN..... 1878.617365: tick_nohz_stop_sched_tick <-tick_nohz_irq_exit
queuelat-2295 [001] dN..... 1878.617366: get_next_timer_interrupt <-tick_nohz_stop_sched_tick
queuelat-2295 [001] dN..... 1878.617366: smp_apic_timer_interrupt <-apic_timer_interrupt
queuelat-2295 [001] dN.h1.. 1878.617367: hrtimer_expire_entry: hrtimer=ffff8acb3fd15f00 function=tick_sched_timer now=1879459811009
queuelat-2295 [001] dN.h... 1878.617367: tick_sched_timer <-__run_hrtimer
queuelat-2295 [001] dN.h1.. 1878.617368: hrtimer_expire_exit: hrtimer=ffff8acb3fd15f00
queuelat-2295 [001] dN..... 1878.617368: tick_nohz_stop_sched_tick <-tick_nohz_irq_exit
queuelat-2295 [001] dN..... 1878.617369: get_next_timer_interrupt <-tick_nohz_stop_sched_tick
queuelat-2295 [001] dN..... 1878.617373: tick_nohz_stop_sched_tick <-tick_nohz_irq_exit
queuelat-2295 [001] dN..... 1878.617373: get_next_timer_interrupt <-tick_nohz_stop_sched_tick
queuelat-2295 [001] dN..... 1878.617374: smp_apic_timer_interrupt <-apic_timer_interrupt
queuelat-2295 [001] dN.h1.. 1878.617375: hrtimer_expire_entry: hrtimer=ffff8acb3fd15f00 function=tick_sched_timer now=1879459818539
queuelat-2295 [001] dN.h... 1878.617375: tick_sched_timer <-__run_hrtimer
queuelat-2295 [001] dN.h1.. 1878.617375: hrtimer_expire_exit: hrtimer=ffff8acb3fd15f00
queuelat-2295 [001] dN..... 1878.617376: tick_nohz_stop_sched_tick <-tick_nohz_irq_exit
queuelat-2295 [001] dN..... 1878.617376: get_next_timer_interrupt <-tick_nohz_stop_sched_tick
queuelat-2295 [001] dN..... 1878.617380: tick_nohz_stop_sched_tick <-tick_nohz_irq_exit
queuelat-2295 [001] dN..... 1878.617380: get_next_timer_interrupt <-tick_nohz_stop_sched_tick
queuelat-2295 [001] dN..... 1878.617381: smp_apic_timer_interrupt <-apic_timer_interrupt
queuelat-2295 [001] dN.h1.. 1878.617381: hrtimer_expire_entry: hrtimer=ffff8acb3fd15f00 function=tick_sched_timer now=1879459825500
queuelat-2295 [001] dN.h... 1878.617381: tick_sched_timer <-__run_hrtimer
queuelat-2295 [001] dN.h1.. 1878.617382: hrtimer_expire_exit: hrtimer=ffff8acb3fd15f00
queuelat-2295 [001] dN..... 1878.617382: tick_nohz_stop_sched_tick <-tick_nohz_irq_exit
queuelat-2295 [001] dN..... 1878.617382: get_next_timer_interrupt <-tick_nohz_stop_sched_tick
ktimersoftd/1-20 [001] d...2.. 1878.617386: smp_apic_timer_interrupt <-apic_timer_interrupt
ktimersoftd/1-20 [001] d..h3.. 1878.617386: hrtimer_expire_entry: hrtimer=ffff8acb3fd15f00 function=tick_sched_timer now=1879459830667
ktimersoftd/1-20 [001] d..h2.. 1878.617387: tick_sched_timer <-__run_hrtimer
ktimersoftd/1-20 [001] d..h3.. 1878.617387: hrtimer_expire_exit: hrtimer=ffff8acb3fd15f00

Pei, there is no need to access the machine yet; comment #42 is sufficient to justify inclusion of the patch downstream only.

(In reply to Marcelo Tosatti from comment #44)
> Adding a timer to the isolated CPU, we see the following.
>
> Summary: apic timer interrupt is armed (to handle tick_sched_timer)
> multiple times, but eventually ktimersoftd is able to execute:
> [...]
>
> Pei, there is no need to access the machine yet, comment #42 is sufficient
> to justify inclusion of the patch downstream
> only.

Marcelo, it seems my setup is not needed any more. Please mail me if it is still needed.

Best Regards,
Pei

Pei, can you give me access to your machine so I can reproduce it, please?

==Verification==

Versions:
kernel-rt-3.10.0-924.rt56.871.el7.x86_64
qemu-kvm-rhev-2.12.0-7.el7.x86_64
libvirt-4.5.0-3.el7.x86_64
tuned-2.9.0-1.el7.noarch

This bug cannot be reproduced any more, 10/10 PASS; it has been fixed. Moving status to 'VERIFIED'.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3096
Created attachment 1424448 [details]
Guest call trace log

Description of problem:
In the RT guest, when running determine_maximum_mpps.sh to get the mpps value, the script execution fails and the guest logs a Call Trace.

Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.10.0-21.el7_5.2.x86_64
tuned-2.9.0-1.el7.noarch
libvirt-3.9.0-14.el7.x86_64
kernel-rt-3.10.0-871.rt56.814.el7.x86_64

How reproducible:
1/1

Steps to Reproduce:
1. In the RT guest, clone queuelat and make it.

2. In the RT guest, run determine_maximum_mpps.sh. The script hangs partway through the run, as shown below, and just stops there.

# cat determine_maximum_mpps.sh | grep PREAMBLE=
PREAMBLE="taskset -c 1 chrt -f 1"

# sh determine_maximum_mpps.sh
Determining maximum mpps the machine can handle
Will take a few minutes to determine mpps value
And 10 minutes run to confirm the final mpps value is stable
testing 3 Mpps
success
testing 6 Mpps
success
testing 9 Mpps
success
testing 12 Mpps
success
testing 15 Mpps
success
testing 18 Mpps
success
testing 21 Mpps
success
testing 24 Mpps
success
testing 27 Mpps
success
testing 30 Mpps
success
testing 33 Mpps
success
testing 36 Mpps
success
testing 39 Mpps
success
testing 42 Mpps
success
testing 45 Mpps
success
testing 48 Mpps
success
first loop mpps: 48
testing 47 Mpps
mpps success 47
second loop mpps: 47
third loop mpps: 47
testing 46.9 Mpps
mpps success 46.9
Starting 10 runs of 30 seconds with 46.9 Mpps
run 1 success
run 2 success

3. In the guest, check dmesg; the Call Trace info below shows repeatedly. The whole log is attached to this comment.
[ 9164.417822] INFO: rcu_preempt self-detected stall on CPU { 1}  (t=3660060 jiffies g=46470 c=46469 q=50248)
[ 9164.417822] Task dump for CPU 1:
[ 9164.417824] queuelat        R  running task        0  3191   2866 0x00080088
[ 9164.417824] Call Trace:
[ 9164.417827]  <IRQ>  [<ffffffffbd0c77c6>] sched_show_task+0xb6/0x120
[ 9164.417829]  [<ffffffffbd0cbab9>] dump_cpu_task+0x39/0x70
[ 9164.417831]  [<ffffffffbd1455a0>] rcu_dump_cpu_stacks+0x90/0xd0
[ 9164.417833]  [<ffffffffbd14a386>] rcu_check_callbacks+0x476/0x860
[ 9164.417835]  [<ffffffffbd09ac8b>] update_process_times+0x4b/0x80
[ 9164.417837]  [<ffffffffbd0fb050>] tick_sched_handle+0x30/0x70
[ 9164.417839]  [<ffffffffbd0fb479>] tick_sched_timer+0x39/0x80
[ 9164.417841]  [<ffffffffbd0b6b47>] __run_hrtimer+0xc7/0x2e0
[ 9164.417843]  [<ffffffffbd0fb440>] ? tick_sched_do_timer+0x50/0x50
[ 9164.417845]  [<ffffffffbd0b7aa0>] hrtimer_interrupt+0x130/0x350
[ 9164.417846]  [<ffffffffbd04a54b>] local_apic_timer_interrupt+0x3b/0x60
[ 9164.417849]  [<ffffffffbd726363>] smp_apic_timer_interrupt+0x43/0x60
[ 9164.417851]  [<ffffffffbd722eb2>] apic_timer_interrupt+0x162/0x170
[ 9164.417853]  <EOI>  [<ffffffffbd7164b8>] ? preempt_schedule_irq+0x78/0x110
[ 9164.417855]  [<ffffffffbd7195d7>] retint_kernel+0x37/0x40
[ 9164.417857]  [<ffffffffbd7194f2>] ? retint_careful+0x10/0x34

Actual results:
The script hangs, and the Call Trace above shows in the guest dmesg.

Expected results:
No Call Trace should show in the guest dmesg.

Additional info:
1. Guest kernel cmdline:
# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-871.rt56.814.el7.x86_64 root=/dev/mapper/rhel_bootp--73--75--185-root ro console=tty0 console=ttyS0,115200n8 biosdevname=0 crashkernel=auto rd.lvm.lv=rhel_bootp-73-75-185/root rd.lvm.lv=rhel_bootp-73-75-185/swap rhgb quiet default_hugepagesz=1G iommu=pt intel_iommu=on skew_tick=1 isolcpus=1 intel_pstate=disable nosoftlockup nohz=on nohz_full=1 rcu_nocbs=1

2. The host keeps working well, with no errors; its dmesg looks good.