Bug 1569910

Summary: Call Trace shows in guest when running determine_maximum_mpps.sh
Product: Red Hat Enterprise Linux 7
Reporter: Pei Zhang <pezhang>
Component: kernel-rt
Assignee: Marcelo Tosatti <mtosatti>
kernel-rt sub component: KVM
QA Contact: Pei Zhang <pezhang>
Status: CLOSED ERRATA
Severity: high
Priority: high
CC: bhu, chayang, daolivei, juzhang, knoel, lcapitulino, lgoncalv, michen, mtosatti, pezhang, siliu, virt-maint, williams
Version: 7.6
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Fixed In Version: kernel-rt-3.10.0-924.rt56.871.el7
Doc Type: No Doc Update
Last Closed: 2018-10-30 09:41:20 UTC
Type: Bug
Bug Blocks: 1532680    
Attachments:
Guest call trace log
determine_maximum_mpps.sh with debugging options
File in /tmp/outdir
queuelat.c debug patch
File in /tmp/outdir
skip expired timers on get_next_timer_interrupt()
skip expired timers on get_next_timer_interrupt() (V2)

Description Pei Zhang 2018-04-20 08:54:45 UTC
Created attachment 1424448 [details]
Guest call trace log

Description of problem:
In the rt guest, when running determine_maximum_mpps.sh to get the mpps value, the script execution fails and the guest also shows a Call Trace.


Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.10.0-21.el7_5.2.x86_64
tuned-2.9.0-1.el7.noarch
libvirt-3.9.0-14.el7.x86_64
kernel-rt-3.10.0-871.rt56.814.el7.x86_64


How reproducible:
1/1

Steps to Reproduce:
1. In the rt guest, clone queuelat and build it with make.

2. In the rt guest, run determine_maximum_mpps.sh. The script hangs during execution, like below, and just stops there.

# cat determine_maximum_mpps.sh | grep PREAMBLE=
PREAMBLE="taskset -c 1 chrt -f 1"

# sh determine_maximum_mpps.sh 
Determining maximum mpps the machine can handle
Will take a few minutes to determine mpps value
And 10 minutes run to confirm the final mpps value is stable
testing 3 Mpps
success
testing 6 Mpps
success
testing 9 Mpps
success
testing 12 Mpps
success
testing 15 Mpps
success
testing 18 Mpps
success
testing 21 Mpps
success
testing 24 Mpps
success
testing 27 Mpps
success
testing 30 Mpps
success
testing 33 Mpps
success
testing 36 Mpps
success
testing 39 Mpps
success
testing 42 Mpps
success
testing 45 Mpps
success
testing 48 Mpps
success
first loop mpps: 48
testing 47 Mpps
mpps success 47
second loop mpps: 47
third loop mpps: 47
testing 46.9 Mpps
mpps success 46.9
Starting 10 runs of 30 seconds with 46.9 Mpps
run 1 success
run 2 success

3. In the guest, dmesg repeatedly shows the Call Trace info below. The whole log is attached to this comment.

[ 9164.417822] INFO: rcu_preempt self-detected stall on CPU { 1}  (t=3660060 jiffies g=46470 c=46469 q=50248)
[ 9164.417822] Task dump for CPU 1:
[ 9164.417824] queuelat        R  running task        0  3191   2866 0x00080088
[ 9164.417824] Call Trace:
[ 9164.417827]  <IRQ>  [<ffffffffbd0c77c6>] sched_show_task+0xb6/0x120
[ 9164.417829]  [<ffffffffbd0cbab9>] dump_cpu_task+0x39/0x70
[ 9164.417831]  [<ffffffffbd1455a0>] rcu_dump_cpu_stacks+0x90/0xd0
[ 9164.417833]  [<ffffffffbd14a386>] rcu_check_callbacks+0x476/0x860
[ 9164.417835]  [<ffffffffbd09ac8b>] update_process_times+0x4b/0x80
[ 9164.417837]  [<ffffffffbd0fb050>] tick_sched_handle+0x30/0x70
[ 9164.417839]  [<ffffffffbd0fb479>] tick_sched_timer+0x39/0x80
[ 9164.417841]  [<ffffffffbd0b6b47>] __run_hrtimer+0xc7/0x2e0
[ 9164.417843]  [<ffffffffbd0fb440>] ? tick_sched_do_timer+0x50/0x50
[ 9164.417845]  [<ffffffffbd0b7aa0>] hrtimer_interrupt+0x130/0x350
[ 9164.417846]  [<ffffffffbd04a54b>] local_apic_timer_interrupt+0x3b/0x60
[ 9164.417849]  [<ffffffffbd726363>] smp_apic_timer_interrupt+0x43/0x60
[ 9164.417851]  [<ffffffffbd722eb2>] apic_timer_interrupt+0x162/0x170
[ 9164.417853]  <EOI>  [<ffffffffbd7164b8>] ? preempt_schedule_irq+0x78/0x110
[ 9164.417855]  [<ffffffffbd7195d7>] retint_kernel+0x37/0x40
[ 9164.417857]  [<ffffffffbd7194f2>] ? retint_careful+0x10/0x34



Actual results:
The .sh script hangs, and a Call Trace shows in the guest's dmesg.

Expected results:
No Call Trace should show in the guest's dmesg.


Additional info:
1. Guest kernel cmdline
# cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-3.10.0-871.rt56.814.el7.x86_64 root=/dev/mapper/rhel_bootp--73--75--185-root ro console=tty0 console=ttyS0,115200n8 biosdevname=0 crashkernel=auto rd.lvm.lv=rhel_bootp-73-75-185/root rd.lvm.lv=rhel_bootp-73-75-185/swap rhgb quiet default_hugepagesz=1G iommu=pt intel_iommu=on skew_tick=1 isolcpus=1 intel_pstate=disable nosoftlockup nohz=on nohz_full=1 rcu_nocbs=1

2. The host keeps working well, with no errors; its dmesg looks good.

Comment 2 Pei Zhang 2018-04-23 03:03:55 UTC
== Update ==

Tried on another host following the same steps as in the Description: reproduced 3/3.

Comment 3 Daniel Bristot de Oliveira 2018-04-23 16:49:30 UTC
Hi

As I mentioned in the meeting, it would be good to have a vmcore of the system when the RCU stall is detected.

To do so, we need to enable the kdump service:

# systemctl enable kdump.service

set the following sysctls (in the /etc/sysctl.conf file)

kernel.panic_on_rcu_stall=1
kernel.panic=1

then, reboot the system.

when the RCU stall takes place again, kdump should generate a vmcore file in the /var/crash/ directory.

Using the vmcore we can have a better idea of what caused the RCU stall.

-- Daniel

Comment 4 Marcelo Tosatti 2018-04-23 17:18:16 UTC
(In reply to Pei Zhang from comment #0)
> Created attachment 1424448 [details]
> Guest call trace log
> 
> Description of problem:
> In rt guest, when running determine_maximum_mpps.sh to get the mpps value,
> the script execution fail and also guest Call Trace.
> 
> 
> Version-Release number of selected component (if applicable):
> qemu-kvm-rhev-2.10.0-21.el7_5.2.x86_64
> tuned-2.9.0-1.el7.noarch
> libvirt-3.9.0-14.el7.x86_64
> kernel-rt-3.10.0-871.rt56.814.el7.x86_64
> 
> 
> How reproducible:
> 1/1
> 
> Steps to Reproduce:
> 1. In rt guest, clone queuelat and make it.
> 
> 2. In rt guest, running determine_maximum_mpps.sh. The script execution is
> hang during running process like below and it just stops there.
> 
> # cat determine_maximum_mpps.sh | grep PREAMBLE=
> PREAMBLE="taskset -c 1 chrt -f 1"
> 
> # sh determine_maximum_mpps.sh 
> Determining maximum mpps the machine can handle
> Will take a few minutes to determine mpps value
> And 10 minutes run to confirm the final mpps value is stable
> testing 3 Mpps
> success
> ...
> testing 48 Mpps
> success
> first loop mpps: 48
> testing 47 Mpps
> mpps success 47
> second loop mpps: 47
> third loop mpps: 47
> testing 46.9 Mpps
> mpps success 46.9
> Starting 10 runs of 30 seconds with 46.9 Mpps
> run 1 success
> run 2 success

Separately from the RCU bug below, the detected Mpps is too high.

Can you please try the attached determine_maximum_mpps.sh, then compress
and attach here the directory /tmp/outdir/ ? 

Thanks.

> 
> 3. In guest, check #dmesg, repeatedly show below Call Trace info. And the
> whole log is attached to this Comment.
> 
> [ 9164.417822] INFO: rcu_preempt self-detected stall on CPU { 1}  (t=3660060
> jiffies g=46470 c=46469 q=50248)
> [...]

Comment 6 Pei Zhang 2018-04-24 08:09:07 UTC
(In reply to Marcelo Tosatti from comment #4)
> (In reply to Pei Zhang from comment #0)
...
> > mpps success 46.9
> > Starting 10 runs of 30 seconds with 46.9 Mpps
> > run 1 success
> > run 2 success
> 
> Separately from the RCU bug below, the detected Mpps is too high.

Hi Marcelo,

Most of the time, the execution stops when testing 6 Mpps, like below:

# sh determine_maximum_mpps.sh 
Determining maximum mpps the machine can handle
Will take a few minutes to determine mpps value
And 10 minutes run to confirm the final mpps value is stable
testing 3 Mpps
success
testing 6 Mpps
(stop here)

> 
> Can you please try the attached determine_maximum_mpps.sh, then compress
> and attach here the directory /tmp/outdir/ ? 

Do you mean a new determine_maximum_mpps.sh? Could you please attach it? 

And during the testing, it seems no /tmp/outdir/ was generated.

Thanks,
Pei
 
> Thanks.
>

Comment 8 Marcelo Tosatti 2018-04-24 14:39:52 UTC
Created attachment 1426103 [details]
determine_maximum_mpps.sh with debugging options

Comment 9 Daniel Bristot de Oliveira 2018-04-24 17:10:19 UTC
The runqueue shows tasks starving:

crash> runq
[...]
CPU 1 RUNQUEUE: ffff91367fd1af40
  CURRENT: PID: 1928   TASK: ffff9135f45b9050  COMMAND: "queuelat"
  RT PRIO_ARRAY: ffff91367fd1b0d8
     [ 95] PID: 18     TASK: ffff91367608c140  COMMAND: "rcuc/1"
     [ 96] PID: 20     TASK: ffff91367608e1e0  COMMAND: "ktimersoftd/1"
     [ 98] PID: 1928   TASK: ffff9135f45b9050  COMMAND: "queuelat"
  CFS RB_ROOT: ffff91367fd1afe0
     [120] PID: 1919   TASK: ffff9135768b4140  COMMAND: "kworker/1:1"


RCU is waiting to run for more than 18 minutes!

crash> foreach RU ps -m
[0 00:00:00.999] [RU]  PID: 1928   TASK: ffff9135f45b9050  CPU: 1   COMMAND: "queuelat"
[0 00:00:01.000] [RU]  PID: 20     TASK: ffff91367608e1e0  CPU: 1   COMMAND: "ktimersoftd/1"
[0 00:00:05.013] [RU]  PID: 1919   TASK: ffff9135768b4140  CPU: 1   COMMAND: "kworker/1:1"
[0 00:18:06.717] [RU]  PID: 18     TASK: ffff91367608c140  CPU: 1   COMMAND: "rcuc/1"
[0 00:19:07.078] [RU]  PID: 0      TASK: ffffffff87c18480  CPU: 0   COMMAND: "swapper/0"
[0 00:18:06.889] [RU]  PID: 0      TASK: ffff91367608b0f0  CPU: 1   COMMAND: "swapper/1"

queuelat:
	crash> task 1928 | grep prio 
	  prio = 98, 
	  static_prio = 120, 
	  normal_prio = 98, 
	  rt_priority = 1, 

rcuc:
	crash> task 18 | grep prio
	  prio = 95, 
	  static_prio = 120, 
	  normal_prio = 95, 
	  rt_priority = 4, 

ktimersoftd
	crash> task 20 | grep prio
	  prio = 96, 
	  static_prio = 120, 
	  normal_prio = 96, 
	  rt_priority = 3, 


As we can see, both rcuc and ktimersoftd have a higher priority than the
queuelat tool, so they should have been able to preempt the current task. We need to investigate this.
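
As a reference for the numbers above, here is a minimal userspace sketch (illustrative only, not part of queuelat) of how the tool lands at SCHED_FIFO priority 1 via "chrt -f 1". For RT tasks the kernel reports prio = 99 - rt_priority and lower prio values win, which is why rcuc/1 (rt_priority 4 -> prio 95) and ktimersoftd/1 (rt_priority 3 -> prio 96) outrank queuelat (rt_priority 1 -> prio 98):

----------------------- %< ----------------------------------------------
/* sketch: put the current task at SCHED_FIFO priority 1, like
 * "chrt -f 1" does for queuelat (needs CAP_SYS_NICE/root) */
#include <sched.h>
#include <stdio.h>

int main(void)
{
	struct sched_param sp = { .sched_priority = 1 };
	volatile unsigned long spin;

	if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
		perror("sched_setscheduler");
		return 1;
	}

	/* bounded busy loop standing in for queuelat's measurement
	 * loop; any FIFO task with rt_priority > 1 must preempt it */
	for (spin = 0; spin < 1000000000UL; spin++)
		;
	return 0;
}
----------------------- >% ----------------------------------------------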

Pei,

Could you reproduce the problem with the irq_vectors, sched:sched_switch,
sched:sched_wakeup and sched:sched_wakeup_new tracepoints enabled via the
ftrace interface?

e.g.,

# echo sched:sched_wakeup sched:sched_switch sched:sched_wakeup_new irq_vectors > /sys/kernel/debug/tracing/set_event

Thanks!

-- Daniel

Comment 10 Luiz Capitulino 2018-04-24 17:28:36 UTC
Interesting finding. Another idea is to do a quick check with an older kernel (say 7.4 or 7.5) to see if it's a recent regression.

Comment 11 Daniel Bristot de Oliveira 2018-04-24 17:40:05 UTC
(In reply to Luiz Capitulino from comment #10)
> Interesting finding. Another idea is to quick check with an older kernel
> (say 7.4 or 7.5) to see if it's a recent regression.

Yep, I agree!

-- Daniel

Comment 13 Pei Zhang 2018-04-25 12:41:35 UTC
Created attachment 1426617 [details]
File in /tmp/outdir

(In reply to Marcelo Tosatti from comment #8)
> Created attachment 1426103 [details]
> determine_maximum_mpps.sh with debugging options

Hi Marcelo,

The file in /tmp/outdir is attached to this comment. Please check.


Best Regards,
Pei

Comment 14 Pei Zhang 2018-04-25 12:55:15 UTC
(In reply to Pei Zhang from comment #13)
> Created attachment 1426617 [details]
> File in /tmp/outdir
> 
> (In reply to Marcelo Tosatti from comment #8)
> > Created attachment 1426103 [details]
> > determine_maximum_mpps.sh with debugging options
> 
> Hi Marcelo,
> 
> The file in /tmp/outdir is attached to this comment. Please check.
> 
> 

# sh determine_maximum_mpps_new.sh 
Determining maximum mpps the machine can handle
Will take a few minutes to determine mpps value
And 10 minutes run to confirm the final mpps value is stable
testing 3 Mpps
success
testing 6 Mpps

(Stop here and guest shows Call Trace)

Comment 15 Pei Zhang 2018-04-25 14:03:04 UTC
(In reply to Daniel Bristot de Oliveira from comment #11)
> (In reply to Luiz Capitulino from comment #10)
> > Interesting finding. Another idea is to quick check with an older kernel
> > (say 7.4 or 7.5) to see if it's a recent regression.
> 
> Yep, I agree!

Hi Luiz, Daniel,

This Call Trace issue also exists in the RHEL 7.4 kernel-rt.


Versions:
3.10.0-693.rt56.617.el7.x86_64 (both host&guest)

And it seems the Call Trace is a bit different.

[  256.553924] INFO: rcu_preempt self-detected stall on CPU { 1}  (t=60000 jiffies g=9469 c=9468 q=553)
[  256.553925] Task dump for CPU 1:
[  256.553927] queuelat        R  running task        0  8961   8941 0x00080088
[  256.553930]  ffff8800b4c220a0 000000004be457e9 ffff88013fd03d88 ffffffff810be946
[  256.553931]  0000000000000001 ffffffff81a48a00 ffff88013fd03da0 ffffffff810c2a09
[  256.553932]  0000000000000002 ffff88013fd03dd0 ffffffff81136220 ffff88013fd12080
[  256.553933] Call Trace:
[  256.553943]  <IRQ>  [<ffffffff810be946>] sched_show_task+0xb6/0x120
[  256.553946]  [<ffffffff810c2a09>] dump_cpu_task+0x39/0x70
[  256.553950]  [<ffffffff81136220>] rcu_dump_cpu_stacks+0x90/0xd0
[  256.553953]  [<ffffffff8113aee6>] rcu_check_callbacks+0x486/0x860
[  256.553958]  [<ffffffff81087e81>] ? raise_softirq_irqoff+0x11/0x50
[  256.553961]  [<ffffffff81091ed1>] update_process_times+0x41/0x70
[  256.553966]  [<ffffffff810ee720>] tick_sched_handle+0x30/0x70
[  256.553968]  [<ffffffff810eeb49>] tick_sched_timer+0x39/0x80
[  256.553970]  [<ffffffff810adef4>] __run_hrtimer+0xc4/0x2c0
[  256.553972]  [<ffffffff810eeb10>] ? tick_sched_do_timer+0x50/0x50
[  256.553974]  [<ffffffff810aee20>] hrtimer_interrupt+0x130/0x350
[  256.553979]  [<ffffffff81047405>] local_apic_timer_interrupt+0x35/0x60
[  256.553984]  [<ffffffff816bc61d>] smp_apic_timer_interrupt+0x3d/0x50
[  256.553986]  [<ffffffff816bad9d>] apic_timer_interrupt+0x6d/0x80
[  256.553988]  <EOI>  [<ffffffff81136694>] ? rcu_eqs_exit_common.isra.29+0x24/0xe0
[  256.553992]  [<ffffffff816ae7d5>] ? __schedule+0x5/0x7d0
[  256.553995]  [<ffffffff816af5e1>] ? preempt_schedule_irq+0x81/0x110
[  256.553996]  [<ffffffff816b1df7>] retint_kernel+0x37/0x40
[  256.553998]  [<ffffffff816b1d12>] ? retint_careful+0x10/0x34


Best Regards,
Pei

> -- Daniel

Comment 16 Marcelo Tosatti 2018-04-26 15:29:37 UTC
(In reply to Pei Zhang from comment #14)
> (In reply to Pei Zhang from comment #13)
> > Created attachment 1426617 [details]
> > File in /tmp/outdir
> > 
> > (In reply to Marcelo Tosatti from comment #8)
> > > Created attachment 1426103 [details]
> > > determine_maximum_mpps.sh with debugging options
> > 
> > Hi Marcelo,
> > 
> > The file in /tmp/outdir is attached to this comment. Please check.
> > 
> > 
> 
> # sh determine_maximum_mpps_new.sh 
> Determining maximum mpps the machine can handle
> Will take a few minutes to determine mpps value
> And 10 minutes run to confirm the final mpps value is stable
> testing 3 Mpps
> success
> testing 6 Mpps
> 
> (Stop here and guest shows Call Trace)

It seems the calibration part didn't finish. Can you please do these two things:

1) Apply the attached patch (patch-debug-queue.patch) with:

cd rt-tests/src/queuelat/
patch < /tmp/patch-debug-queue.patch
make 
cd ../../
make

Run determine_maximum_mpps.sh with this patch applied, and save
the data here.

2) Modify determine_maximum_mpps.sh 

# cat determine_maximum_mpps.sh | sed -e s/"-t 30"/"-t 60"/g  > new_determine_maximum_mpps.sh 
# cp new_determine_maximum_mpps.sh determine_maximum_mpps.sh

Run the new determine_maximum_mpps.sh (with the patch from item 1)
and save the data here.

So we want two logs.

Thanks

Comment 17 Marcelo Tosatti 2018-04-26 15:30:59 UTC
Created attachment 1427282 [details]
queuelat.c debug patch

Comment 18 Daniel Bristot de Oliveira 2018-04-26 16:01:03 UTC
Hi,

This is a summary of today's findings:

This vmcore shows the same set of tasks running/waiting to run.
The trace is not showing any scheduling events on CPU1 (i.e., wakeup
and switch), but it is showing them on CPU0.
So, tracing is working but no events are happening on CPU1 - that means
no sched switch is happening and, probably, that the sched wakeup
took place a long time ago. As an action, we need to increase the size
of the trace buffer.

What I see in the traces are many lines like these:
    queuelat-2051    1dNh10   798.909935: local_timer_entry:    vector=239
    queuelat-2051    1dNh10   798.909936: local_timer_exit:     vector=239
    queuelat-2051    1dNh10   798.909941: reschedule_entry:     vector=253
    queuelat-2051    1dNh10   798.909941: reschedule_exit:      vector=253
    queuelat-2051    1dNh10   798.909943: local_timer_entry:    vector=239
    queuelat-2051    1dNh10   798.909943: local_timer_exit:     vector=239

So, IRQs are enabled, because IRQs are taking place. I see timer and
*reschedule* IRQs happening.

The flags (e.g., 1dNh10) show that N is set. N means that need_resched
is set - I also see this flag in the thread_info of the current task
(queuelat). So it is set - and has been for a long time (at least 30 seconds).

The need resched flag is used to tell if the system needs to call the
scheduler. The scheduler is called as soon as preemptions get enabled, or
at the return of an interrupt handler.
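
As a toy model (my sketch, not kernel code) of that rule - the switch happens only when need_resched is set and nothing is blocking preemption, checked when preemption is re-enabled or on return from an interrupt:

----------------------- %< ----------------------------------------------
#include <stdbool.h>
#include <stdio.h>

static bool need_resched;	/* the "N" in the trace flags */
static int preempt_count;

static void schedule(void)
{
	puts("schedule(): switching to the higher-prio task");
	need_resched = false;
}

/* leaving a preempt-disabled section is a scheduling point */
static void preempt_enable(void)
{
	if (--preempt_count == 0 && need_resched)
		schedule();
}

/* return from interrupt is the other scheduling point
 * (the kernel does this via preempt_schedule_irq()) */
static void irq_return(void)
{
	if (preempt_count == 0 && need_resched)
		schedule();
}

int main(void)
{
	preempt_count = 1;	/* inside a critical section */
	need_resched = true;	/* a higher-prio task was woken */
	irq_return();		/* no switch: preemption disabled */
	preempt_enable();	/* the switch happens here */
	return 0;
}
----------------------- >% ----------------------------------------------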

So the interrupt handlers are running, but the scheduler is not actually
scheduling.

why?

One reason could be preemption being disabled. But that is not the
case, because I see preemption enabled in the preempt counter of the
current task.

So, my current guesses are that we are either not checking need_resched
on return from the interrupt handler (preempt enable no resched?), or
that we are calling the scheduler but it is not noticing that there is
a higher-prio task (in a fast read of the rq structure I see that the
rtprio structure is pointing to the correct high prio...).

I need to think about how to get more information with tracing - but I need
to carefully select the tracepoints/functions I want, because
we are seeing the crash some minutes after the events we want to observe
take place - filling the trace buffer.

I will continue working on it tomorrow.

-- Daniel

Comment 19 Pei Zhang 2018-04-28 05:25:38 UTC
Created attachment 1427947 [details]
File in /tmp/outdir

(In reply to Marcelo Tosatti from comment #16)
> (In reply to Pei Zhang from comment #14)
> > [...]
> 
> It seems the calibration part didn't finish. Can you please do these two
> things:
> 
> 1) Apply the attached patch (patch-debug-queue.patch) with:
> 
> cd rt-tests/src/queuelat/
> patch < /tmp/patch-debug-queue.patch
> make 
> cd ../../
> make
> 
> Run determine_maximum_mpps.sh with this patch in, save
> data here.
> 
> 2) Modify determine_maximum_mpps.sh 
> 
> # cat determine_maximum_mpps.sh | sed -e s/"-t 30"/"-t 60"/g  >
> new_determine_maximum_mpps.sh 
> # cp new_determine_maximum_mpps.sh determine_maximum_mpps.sh
> 
> Run determine_maximum_mpps.sh with this new determine_maximum_mpps.sh 
> (and patch from item 1) and save data here.
> 
> So we want two logs.
> 
> Thanks

Followed these steps; the /tmp/outdir zip files are attached to this comment.

# sh determine_maximum_mpps.sh 
Determining maximum mpps the machine can handle
Will take a few minutes to determine mpps value
And 10 minutes run to confirm the final mpps value is stable
testing 3 Mpps
success
testing 6 Mpps
success
testing 9 Mpps
mpps failed: 9
first loop mpps: 9
testing 8 Mpps
failure
testing 7 Mpps
mpps success 7
second loop mpps: 7
testing 7.3 Mpps
success
testing 7.6 Mpps
success
testing 7.9 Mpps
mpps failure 7.9
third loop mpps: 7.9
testing 7.8 Mpps
failure
testing 7.7 Mpps
failure
testing 7.6 Mpps
failure
testing 7.5 Mpps
failure
testing 7.4 Mpps
mpps success 7.4
Starting 10 runs of 30 seconds with 7.4 Mpps
mpps failure (run 1) 7.4
Starting 10 runs of 30 seconds with 7.3 Mpps
run 1 success
run 2 success
run 3 success
run 4 success
run 5 success
mpps failure (run 6) 7.3
Starting 10 runs of 30 seconds with 7.2 Mpps
run 1 success
run 2 success
run 3 success
run 4 success
run 5 success
run 6 success
run 7 success
run 8 success
run 9 success
run 10 success
Starting 10 minutes run with 7.2 Mpps
mpps failure (run 10) 7.2
Starting 10 minutes run with 7.1 Mpps
(stops here)


Please let me know if you need QE's testing environment; mail me and I can provide it to you.


Best Regards,
Pei

Comment 25 Marcelo Tosatti 2018-05-28 18:26:59 UTC
# mount  10.66.8.172:/home/shared /mnt/mount/  
mount.nfs: Connection timed out

Comment 27 Marcelo Tosatti 2018-05-30 19:14:12 UTC
Daniel mentioned an interrupt storm. 

It seems to be a delayed_work_timer_fn whose expires < now:

crash> timer
TVEC_BASES[0]: ffffffffba177100
  JIFFIES 
4295625094
  EXPIRES      TIMER_LIST         FUNCTION    
4295625097  ffff9cc07607bdb0  ffffffffb9098560  <process_timeout>
4295625120  ffff9cbf75c2bd60  ffffffffb9098560  <process_timeout>
4295625768  ffffffffb9c57220  ffffffffb90a9e80  <delayed_work_timer_fn>
4295628608  ffffffffb9d3d450  ffffffffb90a9e80  <delayed_work_timer_fn>
4295630016  ffffffffb9cc2300  ffffffffb90a9e80  <delayed_work_timer_fn>
4295630016  ffff9cc07fc17ee0  ffffffffb90a9e80  <delayed_work_timer_fn>
4295635264  ffffffffb9d38a70  ffffffffb90a9e80  <delayed_work_timer_fn>
4295638400  ffff9cc0725da848  ffffffffb90a9e80  <delayed_work_timer_fn>
4295639552  ffff9cbf40092c48  ffffffffb90a9e80  <delayed_work_timer_fn>
4295643008  ffffffffc049af20  ffffffffb90a9e80  <delayed_work_timer_fn>
4295645056  ffffffffb9d319b8  ffffffffb96c9320  <fib6_gc_timer_cb>
4295656448  ffffffffb9d39440  ffffffffb90a9e80  <delayed_work_timer_fn>
4295667712  ffffffffb9cfd320  ffffffffb9358420  <__prandom_timer>
4295729152  ffffffffb9d3cb80  ffffffffb90a9e80  <delayed_work_timer_fn>
4295868416  ffffffffb9d327b0  ffffffffb960d4a0  <flow_cache_new_hashrnd>
4295868416  ffffffffba329038  ffffffffb96835b0  <inet_frag_secret_rebuild>
4295868416  ffffffffba33a578  ffffffffb96835b0  <inet_frag_secret_rebuild>
4295890560  ffff9cc076153da0  ffffffffb9098560  <process_timeout>
4295909376  ffff9cc07fc1a730  ffffffffb90a80f0  <idle_worker_timeout>
4301914112  ffff9cc072edbf80  ffffffffb9659670  <tcp_keepalive_timer>
4302192640  ffff9cc072eddd80  ffffffffb9659670  <tcp_keepalive_timer>
4380688384  ffff9cbf76a321c0  ffffffffb96b8580  <ipv6_regen_rndid>
TVEC_BASES[1]: ffff9cc07613c000
  JIFFIES 
4295625094
  EXPIRES      TIMER_LIST         FUNCTION    
4295565072  ffff9cc0724be280  ffffffffb90a9e80  <delayed_work_timer_fn>
4295570016  ffff9cc07fd17ee0  ffffffffb90a9e80  <delayed_work_timer_fn>
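
To read the listing: an entry is expired when its EXPIRES is behind JIFFIES, as for the two TVEC_BASES[1] entries above. A tiny self-contained check (my sketch), using the same wrap-safe comparison idea as the kernel's time_before() macro:

----------------------- %< ----------------------------------------------
#include <stdio.h>

/* wrap-safe "expires is before now", the idea behind time_before() */
static int expired(unsigned long expires, unsigned long now)
{
	return (long)(expires - now) < 0;
}

int main(void)
{
	unsigned long jiffies = 4295625094UL;	/* from the dump above */

	printf("%d\n", expired(4295565072UL, jiffies));	/* 1: expired   */
	printf("%d\n", expired(4295625097UL, jiffies));	/* 0: in future */
	return 0;
}
----------------------- >% ----------------------------------------------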

Comment 28 Daniel Bristot de Oliveira 2018-05-31 07:59:03 UTC
Hi! This is where I stopped in the vmcore/trace analysis:

This is the stack trace of the queuelat:
crash> bt
PID: 2070   TASK: ffff9cbf765fe1e0  CPU: 1   COMMAND: "queuelat"
 #0 [ffff9cc07fd03ba0] trace_function at ffffffffb915f76d            <--- the IRQ stack
 #1 [ffff9cc07fd03bd0] function_trace_call at ffffffffb91675c3
 #2 [ffff9cc07fd03be8] ftrace_call at ffffffffb972c44b
 #3 [ffff9cc07fd03c08] __crash_kexec at ffffffffb910eec2
 #4 [ffff9cc07fd03d48] __crash_kexec at ffffffffb910ef5f
 #5 [ffff9cc07fd03e78] update_process_times at ffffffffb909ac4b
 #6 [ffff9cc07fd03ea0] tick_sched_handle at ffffffffb90fb020
 #7 [ffff9cc07fd03ec0] tick_sched_timer at ffffffffb90fb449
 #8 [ffff9cc07fd03ee8] __run_hrtimer at ffffffffb90b6b07
 #9 [ffff9cc07fd03f28] hrtimer_interrupt at ffffffffb90b7a60
#10 [ffff9cc07fd03fb8] local_apic_timer_interrupt at ffffffffb904a54b
#11 [ffff9cc07fd03fd0] smp_trace_apic_timer_interrupt at ffffffffb972d3ca
#12 [ffff9cc07fd03ff0] trace_apic_timer_interrupt at ffffffffb972a022
--- <IRQ stack> ---
#13 [ffff9cbfc9803e38] trace_apic_timer_interrupt at ffffffffb972a022  
    [exception RIP: preempt_schedule_irq+120]                         <--- the regular (thread) stack is here
    RIP: ffffffffb971ca28  RSP: ffff9cbfc9803ee0  RFLAGS: 00000282
    RAX: 0000000000000082  RBX: ffffffffb971f0c5  RCX: ffff9cbfc9800000
    RDX: 0000000000000008  RSI: 0000000000000001  RDI: 0000000000000082
    RBP: ffff9cbfc9803ef8   R8: 00000000ccccd22b   R9: 0000000000000018
    R10: 000000d0f6f9b1df  R11: 0000000000000008  R12: ffff9cbf765fe7c8
    R13: 0000000000000001  R14: 0000000000000000  R15: ffff9cbfc9800000
    ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0000
#14 [ffff9cbfc9803f00] retint_kernel at ffffffffb97205d7
#15 [ffff9cbfc9803f58] retint_careful at ffffffffb97204f2
    RIP: 0000000000400dfb  RSP: 00007ffd3d0b2f50  RFLAGS: 00000202
    RAX: 0000000000000000  RBX: 000000000000000b  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 00000000017d66c0  RDI: 00000000017be010
    RBP: 0000000000000000   R8: 00000000017d66a0   R9: 0000000000000000
    R10: 00007ffd3d0b29a0  R11: 00007fd0e4a5b200  R12: ffff9cbfc9803f88
    R13: 0000000000000202  R14: 0000000000000010  R15: ffffffffb97204f2
    ORIG_RAX: ffffffffffffff10  CS: 0033  SS: 002b


The regular stack points to this file:
----------------------- %< ----------------------------------------------
crash> dis -l preempt_schedule_irq+120
/usr/src/debug/kernel-3.10.0/linux-3.10.0.x86_64/kernel/sched/core.c: 3939
0xffffffffb971ca28 <preempt_schedule_irq+120>:  callq  0xffffffffb971ba80 <__schedule>
----------------------- >% ----------------------------------------------


Which has this C code:
----------------------- %< ----------------------------------------------
asmlinkage void __sched preempt_schedule_irq(void)
{
        struct thread_info *ti = current_thread_info();
        enum ctx_state prev_state;

        /* Catch callers which need to be fixed */
        BUG_ON(ti->preempt_count || !irqs_disabled());

        prev_state = exception_enter();

        do {
                add_preempt_count(PREEMPT_ACTIVE);
                local_irq_enable();
                __schedule();                  <------------- Here
                local_irq_disable();
                sub_preempt_count(PREEMPT_ACTIVE);

                /*
                 * Check again in case we missed a preemption opportunity
                 * between schedule and now.
                 */
                barrier();
        } while (need_resched());

        exception_exit(prev_state);
}
----------------------- >% ----------------------------------------------

As we discussed many times, the code should call schedule. The interrupts are
enabled because of the previous local_irq_enable(). The same for the
preemption, which was enabled at add_preempt_count(PREEMPT_ACTIVE).

So, the interrupts are enabled, and the scheduler is ready to be called,
as we expected...

So, the question was "why isn't the scheduler being called?"

The last trace from Pei showed why: we are facing a timer interrupt storm.

Here is the trace that shows the problem we are seeing:

In the return of an interrupt handler:

(the command to read this trace is "trace-cmd report -i trace.dat --cpu 1 -l")

queuelat-2070    1dNh10   955.890495: local_timer_exit:     vector=239
---------------------^
need_resched is already set (I cannot see the wakeup of the other
threads because it already filled the trace buffer, which is 100 MB
large). Anyway, this means that there was a wakeup of a higher priority
task...

Which is true, as we can see in the runqueue:

------------------------- %< --------------------------------------------
CPU 1 RUNQUEUE: ffff9cc07fd1af40
  CURRENT: PID: 2070   TASK: ffff9cbf765fe1e0  COMMAND: "queuelat"
  RT PRIO_ARRAY: ffff9cc07fd1b0d8
     [ 95] PID: 18     TASK: ffff9cc07608c140  COMMAND: "rcuc/1"
     [ 96] PID: 20     TASK: ffff9cc07608e1e0  COMMAND: "ktimersoftd/1"
     [ 98] PID: 2070   TASK: ffff9cbf765fe1e0  COMMAND: "queuelat"
  CFS RB_ROOT: ffff9cc07fd1afe0
     [no tasks queued]
------------------------- %< --------------------------------------------

queuelat-2070    1dNh.0   955.890496: function:                irq_exit
queuelat-2070    1dNh.0   955.890496: function:                   vtime_gen_account_irq_exit
queuelat-2070    1dNh.0   955.890496: function:                      _raw_qspin_lock
queuelat-2070    1dN..0   955.890496: function:                   wakeup_timer_softirqd
queuelat-2070    1dN..0   955.890496: function:                   idle_cpu
queuelat-2070    1dN..0   955.890496: function:                   tick_nohz_irq_exit
queuelat-2070    1dN..0   955.890496: function:                      can_stop_full_tick
queuelat-2070    1dN..0   955.890497: function:                         sched_can_stop_tick
queuelat-2070    1dN..0   955.890497: function:                         posix_cpu_timers_can_stop_tick
queuelat-2070    1dN..0   955.890497: function:                      ktime_get
queuelat-2070    1dN..0   955.890497: function:                      tick_nohz_stop_sched_tick
queuelat-2070    1dN..0   955.890497: function:                         rcu_needs_cpu
queuelat-2070    1dN..0   955.890497: function:                         get_next_timer_interrupt
queuelat-2070    1dN..0   955.890497: function:                            rt_spin_trylock
queuelat-2070    1dN..0   955.890497: function:                            rt_spin_unlock_after_trylock_in_irq
queuelat-2070    1dN..0   955.890498: function:                            hrtimer_get_next_event
queuelat-2070    1dN..0   955.890498: function:                               _raw_spin_lock_irqsave
queuelat-2070    1dN.10   955.890498: function:                               _raw_spin_unlock_irqrestore
queuelat-2070    1dN..0   955.890498: function:                         timekeeping_max_deferment
queuelat-2070    1dN..0   955.890498: function:                         scheduler_tick_max_deferment
queuelat-2070    1dN..0   955.890498: function:                            jiffies_to_usecs
queuelat-2070    1dN..0   955.890498: function:                         hrtimer_start_range_ns
queuelat-2070    1dN..0   955.890499: function:                            __hrtimer_start_range_ns
queuelat-2070    1dN..0   955.890499: function:                               lock_hrtimer_base
queuelat-2070    1dN..0   955.890499: function:                                  _raw_spin_lock_irqsave
queuelat-2070    1dN.10   955.890499: function:                               enqueue_hrtimer
queuelat-2070    1dN.10   955.890499: function:                               hrtimer_reprogram
queuelat-2070    1dN.10   955.890499: function:                                  tick_program_event
queuelat-2070    1dN.10   955.890499: function:                                     clockevents_program_event
queuelat-2070    1dN.10   955.890499: function:                                        ktime_get
queuelat-2070    1dN.10   955.890500: function:                                        clockevents_program_min_delta
queuelat-2070    1dN.10   955.890500: function:                                           ktime_get
queuelat-2070    1dN.10   955.890500: function:                                           lapic_next_deadline
queuelat-2070    1dN.10   955.890503: function:                               _raw_spin_unlock_irqrestore

This is the last function the interrupt handler runs before returning &
enabling the interrupts:
queuelat-2070    1dN..0   955.890503: function:                   rcu_irq_exit

But then: BOOM: another timer interrupt arrives:
queuelat-2070    1dN..0   955.890504: function:             smp_trace_apic_timer_interrupt
queuelat-2070    1dN..0   955.890504: function:                irq_enter
queuelat-2070    1dN..0   955.890504: function:                   rcu_irq_enter
queuelat-2070    1dN..0   955.890504: function:                   vtime_common_account_irq_enter
queuelat-2070    1dN..0   955.890504: function:                      vtime_account_system
queuelat-2070    1dNh.0   955.890505: function:                exit_idle
queuelat-2070    1dNh.0   955.890505: function:                kvm_guest_apic_eoi_write
queuelat-2070    1dNh10   955.890505: local_timer_entry:    vector=239
queuelat-2070    1dNh.0   955.890505: function:                local_apic_timer_interrupt

And the system keeps running this forever.

-- Daniel

Comment 29 Marcelo Tosatti 2018-06-06 21:54:58 UTC
(In reply to Daniel Bristot de Oliveira from comment #28)
> Hi! This is where I stopped in the vmcore/trace analysis:
> 
> This is the stack trace of the queuelat:
> crash> bt
> PID: 2070   TASK: ffff9cbf765fe1e0  CPU: 1   COMMAND: "queuelat"
>  #0 [ffff9cc07fd03ba0] trace_function at ffffffffb915f76d            <---
> the IRQ stack
>  #1 [ffff9cc07fd03bd0] function_trace_call at ffffffffb91675c3
>  #2 [ffff9cc07fd03be8] ftrace_call at ffffffffb972c44b
>  #3 [ffff9cc07fd03c08] __crash_kexec at ffffffffb910eec2
>  #4 [ffff9cc07fd03d48] __crash_kexec at ffffffffb910ef5f
>  #5 [ffff9cc07fd03e78] update_process_times at ffffffffb909ac4b
>  #6 [ffff9cc07fd03ea0] tick_sched_handle at ffffffffb90fb020
>  #7 [ffff9cc07fd03ec0] tick_sched_timer at ffffffffb90fb449
>  #8 [ffff9cc07fd03ee8] __run_hrtimer at ffffffffb90b6b07
>  #9 [ffff9cc07fd03f28] hrtimer_interrupt at ffffffffb90b7a60
> #10 [ffff9cc07fd03fb8] local_apic_timer_interrupt at ffffffffb904a54b
> #11 [ffff9cc07fd03fd0] smp_trace_apic_timer_interrupt at ffffffffb972d3ca
> #12 [ffff9cc07fd03ff0] trace_apic_timer_interrupt at ffffffffb972a022
> --- <IRQ stack> ---
> #13 [ffff9cbfc9803e38] trace_apic_timer_interrupt at ffffffffb972a022  
>     [exception RIP: preempt_schedule_irq+120]                         <---
> the regular (thread) stack is here
>     RIP: ffffffffb971ca28  RSP: ffff9cbfc9803ee0  RFLAGS: 00000282
>     RAX: 0000000000000082  RBX: ffffffffb971f0c5  RCX: ffff9cbfc9800000
>     RDX: 0000000000000008  RSI: 0000000000000001  RDI: 0000000000000082
>     RBP: ffff9cbfc9803ef8   R8: 00000000ccccd22b   R9: 0000000000000018
>     R10: 000000d0f6f9b1df  R11: 0000000000000008  R12: ffff9cbf765fe7c8
>     R13: 0000000000000001  R14: 0000000000000000  R15: ffff9cbfc9800000
>     ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0000
> #14 [ffff9cbfc9803f00] retint_kernel at ffffffffb97205d7
> #15 [ffff9cbfc9803f58] retint_careful at ffffffffb97204f2
>     RIP: 0000000000400dfb  RSP: 00007ffd3d0b2f50  RFLAGS: 00000202
>     RAX: 0000000000000000  RBX: 000000000000000b  RCX: 0000000000000000
>     RDX: 0000000000000000  RSI: 00000000017d66c0  RDI: 00000000017be010
>     RBP: 0000000000000000   R8: 00000000017d66a0   R9: 0000000000000000
>     R10: 00007ffd3d0b29a0  R11: 00007fd0e4a5b200  R12: ffff9cbfc9803f88
>     R13: 0000000000000202  R14: 0000000000000010  R15: ffffffffb97204f2
>     ORIG_RAX: ffffffffffffff10  CS: 0033  SS: 002b
> 
> 
> The regular stack points to this file:
> ----------------------- %< ----------------------------------------------
> crash> dis -l preempt_schedule_irq+120
> /usr/src/debug/kernel-3.10.0/linux-3.10.0.x86_64/kernel/sched/core.c: 3939
> 0xffffffffb971ca28 <preempt_schedule_irq+120>:  callq  0xffffffffb971ba80
> <__schedule>
> ----------------------- >% ----------------------------------------------
> 
> 
> Which have this C code:
> ----------------------- %< ----------------------------------------------
> asmlinkage void __sched preempt_schedule_irq(void)
> {
>         struct thread_info *ti = current_thread_info();
>         enum ctx_state prev_state;
> 
>         /* Catch callers which need to be fixed */
>         BUG_ON(ti->preempt_count || !irqs_disabled());
> 
>         prev_state = exception_enter();
> 
>         do {
>                 add_preempt_count(PREEMPT_ACTIVE);
>                 local_irq_enable();
>                 __schedule();                  <------------- Here
>                 local_irq_disable();
>                 sub_preempt_count(PREEMPT_ACTIVE);
> 
>                 /*
>                  * Check again in case we missed a preemption opportunity
>                  * between schedule and now.
>                  */
>                 barrier();
>         } while (need_resched());
> 
>         exception_exit(prev_state);
> }
> ----------------------- >% ----------------------------------------------
> 
> As we discussed many times, the code should call schedule. The interrupts are
> enabled because of the previous local_irq_enable(). The same for the
> preemption, which was enabled at add_preempt_count(PREEMPT_ACTIVE).
> 
> So, the interrupts are enabled, and the scheduler is ready to be called,
> as we expected...
> 
> So, the question was "why isn't the scheduler being called?"
> 
> The last trace from Pei showed why: we are facing a timer interrupt storm.

Hi Daniel,

As soon as ktimersoftd/1 is able to run and execute

run_timer_softirq -> __run_timers -> delayed_work_timer_fn,

tick_nohz_stop_sched_tick will stop picking up the timers from
comment 27 in the get_next_timer_interrupt() call, and the
interrupt storm will stop.

So one fix would be: 

"If TIMER_SOFTIRQ softirq is pending (therefore
ktimersoftd execution is pending), do not take
old-style timers into account, since the
end of execution of __run_timers will rearm
timers in the future as necessary."

Unsure if that is the right level to fix the bug, though; I'll
prepare a fix and send it upstream.
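
To make the mechanism concrete, here is a toy userspace model (a sketch using the numbers from comment 27, not the actual patch) of why an expired wheel timer that only ktimersoftd can clear keeps the computed next event in the past - so the tick is re-armed immediately, forever - and how skipping expired timers while TIMER_SOFTIRQ is pending breaks the loop:

----------------------- %< ----------------------------------------------
#include <stdbool.h>
#include <stdio.h>

static unsigned long jiffies = 4295625094UL;		/* now */
static unsigned long timer_expires = 4295565072UL;	/* expires < now */
static bool timer_softirq_pending = true;		/* ktimersoftd starved */

/* current behaviour: report the earliest pending timer, even if expired */
static unsigned long next_event_old(void)
{
	return timer_expires;
}

/* proposed behaviour: if TIMER_SOFTIRQ is pending, __run_timers() will
 * re-arm whatever is still needed, so expired timers are skipped */
static unsigned long next_event_new(void)
{
	if (timer_softirq_pending && timer_expires <= jiffies)
		return jiffies + 1;	/* nothing earlier left to program */
	return timer_expires;
}

int main(void)
{
	printf("old: next event %lu, now %lu -> fires immediately, again and again\n",
	       next_event_old(), jiffies);
	printf("new: next event %lu, now %lu -> the tick can make progress\n",
	       next_event_new(), jiffies);
	return 0;
}
----------------------- >% ----------------------------------------------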

Comment 30 Daniel Bristot de Oliveira 2018-06-07 06:32:14 UTC
Hi Marcelo,

Nice! Please, Cc: me in the upstream discussion!

Btw, how about filing another BZ, specifically for this timer problem?

-- Daniel

Comment 32 Marcelo Tosatti 2018-06-08 22:03:41 UTC
Created attachment 1449271 [details]
skip expired timers on get_next_timer_interrupt()

Comment 33 Marcelo Tosatti 2018-06-11 21:55:29 UTC
Untested kernel with fix at 

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=16678250

(fixing a problem with storage where the guest tests are... 
should be testing tomorrow).

Comment 34 Marcelo Tosatti 2018-06-11 22:41:42 UTC
(In reply to Marcelo Tosatti from comment #33)
> Untested kernel with fix at 
> 
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=16678250
> 
> (fixing a problem with storage where the guest tests are... 
> should be testing tomorrow).

Ok, it boots fine... now whether it fixes the problem: not sure.

Comment 35 Marcelo Tosatti 2018-06-11 22:43:15 UTC
Created attachment 1450216 [details]
skip expired timers on get_next_timer_interrupt() (V2)

Comment 36 Marcelo Tosatti 2018-06-11 22:46:17 UTC
(In reply to Daniel Bristot de Oliveira from comment #30)
> Hi Marcelo,
> 
> Nice! Please, Cc: me in the upstream discussion!
> 
> Btw, how about filing another BZ, specifically for this timer problem?
> 
> -- Daniel

Daniel, 

I think this BZ is only about the kernel bug: I can't see any
problem with queuelat (my guess is that the queuelat error is caused
by the wrong scheduler behaviour).

Comment 37 Pei Zhang 2018-06-12 10:46:05 UTC
(In reply to Marcelo Tosatti from comment #33)
> Untested kernel with fix at 
> 
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=16678250
> 
> (fixing a problem with storage where the guest tests are... 
> should be testing tomorrow).

This bug cannot be reproduced any more; the patch fixes the issue.

3/3 PASS


== run1
# sh determine_maximum_mpps.sh 
Determining maximum mpps the machine can handle
Will take a few minutes to determine mpps value
And 10 minutes run to confirm the final mpps value is stable
testing 3 Mpps
success
testing 6 Mpps
success
testing 9 Mpps
success
testing 12 Mpps
success
testing 15 Mpps
success
testing 18 Mpps
success
testing 21 Mpps
success
testing 24 Mpps
success
testing 27 Mpps
success
testing 30 Mpps
success
testing 33 Mpps
success
testing 36 Mpps
success
testing 39 Mpps
success
testing 42 Mpps
success
testing 45 Mpps
success
testing 48 Mpps
success
first loop mpps: 48
testing 47 Mpps
mpps success 47
second loop mpps: 47
third loop mpps: 47
testing 46.9 Mpps
mpps success 46.9
Starting 10 runs of 30 seconds with 46.9 Mpps
run 1 success
run 2 success
run 3 success
run 4 success
run 5 success
run 6 success
run 7 success
run 8 success
run 9 success
run 10 success
Starting 10 minutes run with 46.9 Mpps
run 10 success
Final mpps is: 46.9

== run2
# sh determine_maximum_mpps.sh 
Determining maximum mpps the machine can handle
Will take a few minutes to determine mpps value
And 10 minutes run to confirm the final mpps value is stable
testing 3 Mpps
success
...
run 10 success
Starting 10 minutes run with 46.9 Mpps
run 10 success
Final mpps is: 46.9


==run 3
# sh determine_maximum_mpps.sh 
Determining maximum mpps the machine can handle
Will take a few minutes to determine mpps value
And 10 minutes run to confirm the final mpps value is stable
testing 3 Mpps
success
...
run 10 success
Starting 10 minutes run with 46.9 Mpps
run 10 success
Final mpps is: 46.9

Comment 38 Marcelo Tosatti 2018-06-20 17:44:45 UTC
For some reason I can't reproduce this with a custom kernel module that
adds an already-expired timer on an isolated CPU (which means the conditions
the correction patch was based on are not fully understood).

          insmod-9090  [000] ....111  5329.878993: timer_start: timer=ffffffffc09ca260 function=test_timer_fn [timer_test] expires=4299996929 [timeout=-10] flags=0x0010000f
        queuelat-9087  [015] d......  5329.878995: tick_nohz_stop_sched_tick <-tick_nohz_irq_exit
        queuelat-9087  [015] d......  5329.878995: get_next_timer_interrupt <-tick_nohz_stop_sched_tick
        queuelat-9087  [015] d...2..  5329.878995: hrtimer_start: hrtimer=ffff96675e456fc0 function=tick_sched_timer expires=5329643000000 softexpires=5329643000000
        queuelat-9087  [015] d......  5329.878999: smp_apic_timer_interrupt <-apic_timer_interrupt
        queuelat-9087  [015] d..h...  5329.878999: tick_sched_timer <-__run_hrtimer
        queuelat-9087  [015] dN.....  5329.879001: tick_nohz_stop_sched_tick <-tick_nohz_irq_exit
        queuelat-9087  [015] dN.....  5329.879002: get_next_timer_interrupt <-tick_nohz_stop_sched_tick
        queuelat-9087  [015] dN..2..  5329.879002: hrtimer_start: hrtimer=ffff96675e456fc0 function=tick_sched_timer expires=5329643000000 softexpires=5329643000000
        queuelat-9087  [015] dN.....  5329.879003: tick_nohz_stop_sched_tick <-tick_nohz_irq_exit
        queuelat-9087  [015] dN.....  5329.879003: get_next_timer_interrupt <-tick_nohz_stop_sched_tick

Here the kernel is able to switch to ktimersoftd: we should find out where exactly
this switch happens, and why it fails to happen in the scenario Pei
reported.

  ktimersoftd/15-124   [015] d....12  5329.879005: smp_apic_timer_interrupt <-apic_timer_interrupt
  ktimersoftd/15-124   [015] d..h.12  5329.879005: tick_sched_timer <-__run_hrtimer
  ktimersoftd/15-124   [015] d....12  5329.879006: tick_nohz_stop_sched_tick <-tick_nohz_irq_exit
  ktimersoftd/15-124   [015] d....12  5329.879006: get_next_timer_interrupt <-tick_nohz_stop_sched_tick

Comment 42 Marcelo Tosatti 2018-06-21 22:03:27 UTC
-RT upstream seems to execute timers from IRQ return via:

static inline void invoke_softirq(void)
{
        if (ksoftirqd_running())
                return;

        if (!force_irqthreads) {
#ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
                /*
                 * We can safely execute softirq on the current stack if
                 * it is the irq stack, because it should be near empty
                 * at this stage.
                 */
                __do_softirq();
#else
                /*
                 * Otherwise, irq_exit() is called on the task stack that can
                 * be potentially deep already. So call softirq in its own stack
                 * to prevent from any overrun.
                 */
                do_softirq_own_stack();
#endif
        } else {
                wakeup_softirqd();
        }
}

So upstream -RT won't hit the problem if force_irqthreads==false (which is the default).

Comment 43 Marcelo Tosatti 2018-06-21 22:44:21 UTC
Failed to reproduce on a guest without the custom timer added:

[root@localhost queuelat]# sh determine_maximum_mpps.sh
Determining maximum mpps the machine can handle
Will take a few minutes to determine mpps value
And 10 minutes run to confirm the final mpps value is stable
testing 3 Mpps
success
testing 6 Mpps
...
run 8 success
run 9 success
run 10 success
Starting 10 minutes run with 7.1 Mpps
run 10 success
Final mpps is: 7.1

[root@localhost queuelat]# sh determine_maximum_mpps.sh
Determining maximum mpps the machine can handle
Will take a few minutes to determine mpps value
And 10 minutes run to confirm the final mpps value is stable
testing 3 Mpps
success
testing 6 Mpps
success
testing 9 Mpps
mpps failed: 9
first loop mpps: 9
testing 8 Mpps
mpps success 8
second loop mpps: 8
third loop mpps: 8
testing 7.9 Mpps
failure
testing 7.8 Mpps
mpps success 7.8

Pei, please make the machine available when possible.

TIA.

Comment 44 Marcelo Tosatti 2018-06-22 01:01:47 UTC
Adding a timer to the isolated CPU, we see the following.

Summary: apic timer interrupt is armed (to handle tick_sched_timer)
multiple times, but eventually ktimersoftd is able to execute:

        queuelat-2295  [001] d......  1878.617133: tick_nohz_stop_sched_tick <-tick_nohz_irq_exit
        queuelat-2295  [001] d......  1878.617134: get_next_timer_interrupt <-tick_nohz_stop_sched_tick
          insmod-2298  [000] ....111  1878.617338: timer_start: timer=ffffffffc06c0260 function=test_timer_fn [timer_test] expires=4296546745 [timeout=-10]
        queuelat-2295  [001] d......  1878.617344: tick_nohz_stop_sched_tick <-tick_nohz_irq_exit
        queuelat-2295  [001] d......  1878.617345: get_next_timer_interrupt <-tick_nohz_stop_sched_tick
        queuelat-2295  [001] d......  1878.617352: tick_nohz_stop_sched_tick <-tick_nohz_irq_exit
        queuelat-2295  [001] d......  1878.617352: get_next_timer_interrupt <-tick_nohz_stop_sched_tick
        queuelat-2295  [001] d......  1878.617354: smp_apic_timer_interrupt <-apic_timer_interrupt
        queuelat-2295  [001] d..h1..  1878.617355: hrtimer_expire_entry: hrtimer=ffff8acb3fd15f00 function=tick_sched_timer now=1879459798601
        queuelat-2295  [001] d..h...  1878.617355: tick_sched_timer <-__run_hrtimer
        queuelat-2295  [001] d..h1..  1878.617357: hrtimer_expire_exit: hrtimer=ffff8acb3fd15f00
        queuelat-2295  [001] dN.....  1878.617360: tick_nohz_stop_sched_tick <-tick_nohz_irq_exit
        queuelat-2295  [001] dN.....  1878.617360: get_next_timer_interrupt <-tick_nohz_stop_sched_tick
        queuelat-2295  [001] dN.....  1878.617363: tick_nohz_stop_sched_tick <-tick_nohz_irq_exit
        queuelat-2295  [001] dN.....  1878.617363: get_next_timer_interrupt <-tick_nohz_stop_sched_tick
        queuelat-2295  [001] dN.....  1878.617365: tick_nohz_stop_sched_tick <-tick_nohz_irq_exit
        queuelat-2295  [001] dN.....  1878.617366: get_next_timer_interrupt <-tick_nohz_stop_sched_tick
        queuelat-2295  [001] dN.....  1878.617366: smp_apic_timer_interrupt <-apic_timer_interrupt
        queuelat-2295  [001] dN.h1..  1878.617367: hrtimer_expire_entry: hrtimer=ffff8acb3fd15f00 function=tick_sched_timer now=1879459811009
        queuelat-2295  [001] dN.h...  1878.617367: tick_sched_timer <-__run_hrtimer
        queuelat-2295  [001] dN.h1..  1878.617368: hrtimer_expire_exit: hrtimer=ffff8acb3fd15f00
        queuelat-2295  [001] dN.....  1878.617368: tick_nohz_stop_sched_tick <-tick_nohz_irq_exit
        queuelat-2295  [001] dN.....  1878.617369: get_next_timer_interrupt <-tick_nohz_stop_sched_tick
        queuelat-2295  [001] dN.....  1878.617373: tick_nohz_stop_sched_tick <-tick_nohz_irq_exit
        queuelat-2295  [001] dN.....  1878.617373: get_next_timer_interrupt <-tick_nohz_stop_sched_tick
        queuelat-2295  [001] dN.....  1878.617374: smp_apic_timer_interrupt <-apic_timer_interrupt
        queuelat-2295  [001] dN.h1..  1878.617375: hrtimer_expire_entry: hrtimer=ffff8acb3fd15f00 function=tick_sched_timer now=1879459818539
        queuelat-2295  [001] dN.h...  1878.617375: tick_sched_timer <-__run_hrtimer
        queuelat-2295  [001] dN.h1..  1878.617375: hrtimer_expire_exit: hrtimer=ffff8acb3fd15f00
        queuelat-2295  [001] dN.....  1878.617376: tick_nohz_stop_sched_tick <-tick_nohz_irq_exit
        queuelat-2295  [001] dN.....  1878.617376: get_next_timer_interrupt <-tick_nohz_stop_sched_tick
        queuelat-2295  [001] dN.....  1878.617380: tick_nohz_stop_sched_tick <-tick_nohz_irq_exit
        queuelat-2295  [001] dN.....  1878.617380: get_next_timer_interrupt <-tick_nohz_stop_sched_tick
        queuelat-2295  [001] dN.....  1878.617381: smp_apic_timer_interrupt <-apic_timer_interrupt
        queuelat-2295  [001] dN.h1..  1878.617381: hrtimer_expire_entry: hrtimer=ffff8acb3fd15f00 function=tick_sched_timer now=1879459825500
        queuelat-2295  [001] dN.h...  1878.617381: tick_sched_timer <-__run_hrtimer
        queuelat-2295  [001] dN.h1..  1878.617382: hrtimer_expire_exit: hrtimer=ffff8acb3fd15f00
        queuelat-2295  [001] dN.....  1878.617382: tick_nohz_stop_sched_tick <-tick_nohz_irq_exit
        queuelat-2295  [001] dN.....  1878.617382: get_next_timer_interrupt <-tick_nohz_stop_sched_tick
   ktimersoftd/1-20    [001] d...2..  1878.617386: smp_apic_timer_interrupt <-apic_timer_interrupt
   ktimersoftd/1-20    [001] d..h3..  1878.617386: hrtimer_expire_entry: hrtimer=ffff8acb3fd15f00 function=tick_sched_timer now=1879459830667
   ktimersoftd/1-20    [001] d..h2..  1878.617387: tick_sched_timer <-__run_hrtimer
   ktimersoftd/1-20    [001] d..h3..  1878.617387: hrtimer_expire_exit: hrtimer=ffff8acb3fd15f00

Pei, there is no need to access the machine yet; comment #42 is sufficient to justify
including the patch downstream only.

Comment 46 Pei Zhang 2018-06-22 15:29:37 UTC
(In reply to Marcelo Tosatti from comment #44)
> Adding a timer to the isolated CPU, we see the following.
> 
> Summary: apic timer interrupt is armed (to handle tick_sched_timer)
> multiple times, but eventually ktimersoftd is able to execute:
> 
[...]
> 
> Pei, there is no need to access the machine yet, comment #42 is sufficient
> to justify inclusion of the patch downstream 
> only.

Marcelo, it seems my setup is not needed any more. Please mail me if it is still needed.

Best Regards,
Pei

Comment 48 Marcelo Tosatti 2018-06-25 15:58:10 UTC
Pei, can you give me access to your machine so I can reproduce it, please?

Comment 57 Pei Zhang 2018-07-18 05:19:44 UTC
==Verification==

Versions:
kernel-rt-3.10.0-924.rt56.871.el7.x86_64
qemu-kvm-rhev-2.12.0-7.el7.x86_64
libvirt-4.5.0-3.el7.x86_64
tuned-2.9.0-1.el7.noarch


This bug can no longer be reproduced: 10/10 PASS. It has been fixed.

Move status to 'VERIFIED'.

Comment 59 errata-xmlrpc 2018-10-30 09:41:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3096