1441552 – kernel BUG at kernel/sched/rt.c:2021!

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1441552 - kernel BUG at kernel/sched/rt.c:2021!

Summary: kernel BUG at kernel/sched/rt.c:2021!

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	kernel-rt
Sub Component:
Version:	7.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Target Release:	---
Assignee:	Daniel Bristot de Oliveira
QA Contact:	Pei Zhang
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1175461 1353018
TreeView+	depends on / blocked

Reported:	2017-04-12 08:33 UTC by Pei Zhang
Modified:	2017-08-02 00:26 UTC (History)
CC List:	13 users (show)
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:	undefined
Clone Of:
Environment:
Last Closed:	2017-08-01 19:05:29 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2017:2077	0	normal	SHIPPED_LIVE	Important: kernel-rt security, bug fix, and enhancement update	2017-08-01 18:13:37 UTC

Description Pei Zhang 2017-04-12 08:33:43 UTC

Description of problem:
Install rhel7.4 host, then prepare rt testing environment. After installing kernel-rt/tuned/rt-tests/qemu-kvm-rhev/libvirt/dpdk/openvswitch, and isolate cores, host kernel panic in one of my hosts.


Version-Release number of selected component (if applicable):
3.10.0-648.rt56.566.el7.x86_64


How reproducible:
Hit this issue 1 time.

Steps to Reproduce:
1.Install rhel7.4 host
2.Install kernel-rt and virt related packages, including:
tuned-2.8.0-1.el7.noarch
rt-tests-1.0-10.el7.x86_64
qemu-kvm-rhev-2.9.0-0.el7.patchwork201703291116.x86_64
libvirt-daemon-3.2.0-2.el7.x86_64

3. Reboot host

4. Install dpdk/openvswitch
dpdk-16.11-4.el7fdp.x86_64
openvswitch-2.6.1-15.git20161206.el7fdp.x86_64

5. Add hugepage/iommu to kernel line
# cat /etc/default/grub
...
GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=rhel_dell-per730-11/root rd.lvm.lv=rhel_dell-per730-11/swap console=ttyS0,115200n81 default_hugepagesz=1G iommu=pt intel_iommu=on"

6. Isolate cores
# cat /etc/tuned/realtime-virtual-host-variables.conf
isolated_cores=2,4,6,8,10,12,14,16,18,19,17,16,15

# tuned-adm profile realtime-virtual-host

7. Then host hang with kernel panic, please see[1]


8. Reboot host in beaker button, then host will works well.


Actual results:
[1]Kernel panic info:
dell-per730-11 login: [   47.460636] ------------[ cut here ]------------
[   47.460637] kernel BUG at kernel/sched/rt.c:2021!
[   47.460639] invalid opcode: 0000 [#1] PREEMPT SMP 
[   47.460662] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_d
[   47.460670] CPU: 11 PID: 3415 Comm: taskset Not tainted 3.10.0-648.rt56.566.1
[   47.460670] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 10/005
[   47.460672] task: ffff90546938b090 ti: ffff90545efc8000 task.ti: ffff90545ef0
[   47.460679] RIP: 0010:[<ffffffff840ce49d>]  [<ffffffff840ce49d>] set_cpus_al0
[   47.460679] RSP: 0018:ffff90545efcb6e8  EFLAGS: 00010046
[   47.460680] RAX: 0000000000000000 RBX: ffff90546938b090 RCX: 0000000000000000
[   47.460681] RDX: 000000000000000b RSI: 0000000000000090 RDI: 0000000000000000
[   47.460681] RBP: ffff90545efcb700 R08: 0000000000000001 R09: 0000000000000002
[   47.460682] R10: ffff90545efcba08 R11: 0000000000000246 R12: ffff90546e559080
[   47.460682] R13: ffff90545efcba08 R14: 00000000ffffffea R15: ffff90546938b090
[   47.460684] FS:  00007f4544506740(0000) GS:ffff90546e540000(0000) knlGS:00000
[   47.460684] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   47.460685] CR2: 00007f4543ff3030 CR3: 0000000865667000 CR4: 00000000001407e0
[   47.460685] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   47.460686] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   47.460686] Stack:
[   47.460688]  ffff90546938b090 ffff90545efcba08 ffff90545efcba08 ffff90545efc0
[   47.460689]  ffffffff840bec60 ffff90546e559080 ffff90546938b970 ffff90545efc8
[   47.460690]  ffffffff840bfebb 0000000000000282 ffff90545efcbcc0 ffff90545efc0
[   47.460690] Call Trace:
[   47.460695]  [<ffffffff840bec60>] do_set_cpus_allowed+0x30/0x70
[   47.460697]  [<ffffffff840bfebb>] set_cpus_allowed_ptr+0x13b/0x1f0
[   47.460699]  [<ffffffff840c0181>] sched_setaffinity+0x211/0x250
[   47.460701]  [<ffffffff840ba24f>] ? migrate_enable+0xdf/0x1a0
[   47.460706]  [<ffffffff8418e787>] ? get_page_from_freelist+0x647/0xb00
[   47.460707]  [<ffffffff840ba24f>] ? migrate_enable+0xdf/0x1a0
[   47.460711]  [<ffffffff841ae7ee>] ? pagefault_enable+0x1e/0x20
[   47.460712]  [<ffffffff8418e694>] ? get_page_from_freelist+0x554/0xb00
[   47.460714]  [<ffffffff8418edd4>] ? __alloc_pages_nodemask+0x194/0xab0
[   47.460716]  [<ffffffff84193a02>] ? __lru_cache_add+0xb2/0xf0
[   47.460717]  [<ffffffff8418e787>] ? get_page_from_freelist+0x647/0xb00
[   47.460719]  [<ffffffff840ba24f>] ? migrate_enable+0xdf/0x1a0
[   47.460721]  [<ffffffff841ae7ee>] ? pagefault_enable+0x1e/0x20
[   47.460722]  [<ffffffff8418e694>] ? get_page_from_freelist+0x554/0xb00
[   47.460725]  [<ffffffff840a59ce>] ? find_task_by_pid_ns+0xe/0x30
[   47.460727]  [<ffffffff840c021a>] SYSC_sched_setaffinity+0x5a/0xc0
[   47.460730]  [<ffffffff840c028e>] SyS_sched_setaffinity+0xe/0x10
[   47.460734]  [<ffffffff846b0cc9>] system_call_fastpath+0x16/0x1b
[   47.460745] Code: 00 00 49 8b 84 24 e8 07 00 00 48 85 c0 74 1c 48 83 e8 01 4 
[   47.460747] RIP  [<ffffffff840ce49d>] set_cpus_allowed_rt+0x10d/0x110
[   47.460747]  RSP <ffff90545efcb6e8>
[   47.873614] ---[ end trace 0000000000000002 ]---
[   47.928236] Kernel panic - not syncing: Fatal exception
[   47.928254] Kernel Offset: 0x3000000 from 0xffffffff81000000 (relocation ran)


Expected results:
Host should work well.


Additional info:
1. I did the same testing in 2 hosts. One hit this issue and the other not. So I am not sure if this issue can reproduce easily. I'll try more times to reproduce and will update my results.

Comment 4 Pei Zhang 2017-05-05 06:46:57 UTC

Update: 
Currently, QE didn't hit this issue any more. The versions we used in daily testing(With multiple hosts, multiple times):

3.10.0-653.rt56.571.el7.x86_64
3.10.0-663.rt56.581.el7.x86_64

Comment 5 Pei Zhang 2017-05-11 06:03:27 UTC

With same steps, QE hit rt host hang issue. And I tested below version about 3 times currently, hit this issue 1 time.

Versions: kernel-rt-3.10.0-664.rt56.583.el7.

But this time I didn't open the console before hang, so can not provide the detail error info.  I'll keep monitoring console in next testings.

Comment 6 Luiz Capitulino 2017-05-16 17:30:29 UTC

I have a few questions:

- Do you start ovs before activating the profile? If yes, do you assigned fifo priority to the PMD threads?

- Is this always reproduced when activating the profile or is it reproduced at different times?

- Are you able to reproduce this with tuned-2.8.0-2.el7 or later?

Comment 8 Pei Zhang 2017-05-17 10:57:21 UTC

(In reply to Luiz Capitulino from comment #6)
> I have a few questions:
> 
> - Do you start ovs before activating the profile? If yes, do you assigned
> fifo priority to the PMD threads?

No, I didn't start ovs before activating the tuned profiles. Only install the ovs/dpdk packages for next testing.

> - Is this always reproduced when activating the profile or is it reproduced
> at different times?

Currently, I hit this host kernel crash only one time. In my latest kvm-rt related testing, I didn't hit this issue anymore. Besides hit hang issue[Comment 5] one time. 

> - Are you able to reproduce this with tuned-2.8.0-2.el7 or later?

In my daily testing, I already tested with tuned-2.8.0-2.el7/tuned-2.8.0-3.el7.noarch, didn't hit this issue again.


Seems this issue is very occasionally. But seems the crash info is similar with https://bugzilla.redhat.com/show_bug.cgi?id=1438120#c53



Best Regards,
Pei

Comment 9 Luiz Capitulino 2017-05-17 19:51:46 UTC

Indeed! bug 1438120 comment 53 is precisely the same issue and it's nice you got a good vmcore! Would you mind to post that info here too?

So, what seems to be happening here is:

1. Something in user-space in changing testpmd pinning and this code path is executed:

sched_setaffinity()
  set_cpu_allowed_ptr()
    do_set_cpus_allowed()
      set_cpus_allowed_rt()

2. set_cpus_allowed_rt() checks weight <= 1 and goes to decrement rt_nr_migratory. However, rt_nr_migratory is already zero. So the BUG_ON() triggers

Now, some important points:

- rt_nr_migratory keeps track of how many RT tasks in the runqueue can migrate to other CPUs. If rt_nr_migratory=0 it means no RT task can migrate

- nr_cpus_allowed=5 for the testpmd thread

- observe in the trace that a few pagefaults are happening in parallel to set_cpus_allowed_rt(), and those code paths are calling migrate_enable()

This points to a race with migrate_enable()/disable(), but I can't see it causing rt_nr_migratory being underflowed.

Comment 11 Pei Zhang 2017-05-18 04:37:11 UTC

Update:

I hit the kernel panic issue again, at the first reboot after installing rt related packages. 


1.Install rhel7.4 host

2.Install kernel-rt and virt related packages, including:
3.10.0-668.rt56.587.el7.x86_64
tuned-2.8.0-3.el7.noarch
rt-tests-1.0-11.el7.x86_64
qemu-kvm-rhev-2.9.0-5.el7.x86_64
libvirt-3.2.0-5.el7.x86_64

3. Reboot host, then host hang and kernel panic, there was no crash dump file exists, below is the console output. 


[   61.983318] ------------[ cut here ]------------
[   61.983319] kernel BUG at kernel/sched/rt.c:2022!
[   61.983322] invalid opcode: 0000 [#1] PREEMPT SMP 
[   61.983343] Modules linked in: fuse btrfs raid6_pq xor vfat msdos fat ext4 mbcache jbd2 xt_CHECKSUM iptable_mangled
[   61.983353] CPU: 0 PID: 11844 Comm: taskset Not tainted 3.10.0-668.rt56.587.el7.x86_64 #1
[   61.983354] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.0.1 04/11/2016
[   61.983355] task: ffffa11817a650f0 ti: ffffa117eff80000 task.ti: ffffa117eff80000
[   61.983362] RIP: 0010:[<ffffffff950ce50d>]  [<ffffffff950ce50d>] set_cpus_allowed_rt+0x10d/0x110
[   61.983363] RSP: 0018:ffffa117eff836e8  EFLAGS: 00010046
[   61.983364] RAX: 0000000000000000 RBX: ffffa11817a650f0 RCX: 0000000000000000
[   61.983364] RDX: 0000000000000000 RSI: 00000000000000c0 RDI: 0000000000000000
[   61.983365] RBP: ffffa117eff83700 R08: 0000000000000001 R09: 0000000000000003
[   61.983365] R10: ffffa117eff83a08 R11: 0000000000000246 R12: ffffa1181e619080
[   61.983366] R13: ffffa117eff83a08 R14: 00000000ffffffea R15: ffffa11817a650f0
[   61.983367] FS:  00007f4998c3d740(0000) GS:ffffa1181e600000(0000) knlGS:0000000000000000
[   61.983368] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   61.983368] CR2: 00007f499872a030 CR3: 000000084d708000 CR4: 00000000001407f0
[   61.983369] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   61.983370] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   61.983370] Stack:
[   61.983371]  ffffa11817a650f0 ffffa117eff83a08 ffffa117eff83a08 ffffa117eff83720
[   61.983372]  ffffffff950bed00 ffffa1181e619080 ffffa11817a659d0 ffffa117eff83778
[   61.983373]  ffffffff950bff4b 0000000000000282 ffffa117eff83cc0 ffffa117eff83750
[   61.983373] Call Trace:
[   61.983378]  [<ffffffff950bed00>] do_set_cpus_allowed+0x30/0x70
[   61.983380]  [<ffffffff950bff4b>] set_cpus_allowed_ptr+0x13b/0x1f0
[   61.983382]  [<ffffffff950c0211>] sched_setaffinity+0x211/0x250
[   61.983384]  [<ffffffff950ba2ff>] ? migrate_enable+0xdf/0x1a0
[   61.983389]  [<ffffffff9518eda7>] ? get_page_from_freelist+0x647/0xb00
[   61.983391]  [<ffffffff950ba2ff>] ? migrate_enable+0xdf/0x1a0
[   61.983394]  [<ffffffff951aedfe>] ? pagefault_enable+0x1e/0x20
[   61.983396]  [<ffffffff9518ecb4>] ? get_page_from_freelist+0x554/0xb00
[   61.983398]  [<ffffffff9518f3f4>] ? __alloc_pages_nodemask+0x194/0xab0
[   61.983400]  [<ffffffff95194022>] ? __lru_cache_add+0xb2/0xf0
[   61.983402]  [<ffffffff9518eda7>] ? get_page_from_freelist+0x647/0xb00
[   61.983403]  [<ffffffff950ba2ff>] ? migrate_enable+0xdf/0x1a0
[   61.983405]  [<ffffffff951aedfe>] ? pagefault_enable+0x1e/0x20
[   61.983407]  [<ffffffff9518ecb4>] ? get_page_from_freelist+0x554/0xb00
[   61.983410]  [<ffffffff950a5a5e>] ? find_task_by_pid_ns+0xe/0x30
[   61.983412]  [<ffffffff950c02aa>] SYSC_sched_setaffinity+0x5a/0xc0
[   61.983415]  [<ffffffff950c031e>] SyS_sched_setaffinity+0xe/0x10
[   61.983420]  [<ffffffff956b9409>] system_call_fastpath+0x16/0x1b
[   61.983430] Code: 00 00 49 8b 84 24 e8 07 00 00 48 85 c0 74 1c 48 83 e8 01 49 89 84 24 e8 07 00 00 eb 92 41 c7 84  
[   61.983432] RIP  [<ffffffff950ce50d>] set_cpus_allowed_rt+0x10d/0x110
[   61.983433]  RSP <ffffa117eff836e8>
[   62.403440] ---[ end trace 0000000000000002 ]---
[   62.451894] Kernel panic - not syncing: Fatal exception
[   63.481661] Shutting down cpus with NMI
[   63.482072] Kernel Offset: 0x14000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfff)

Comment 14 Pei Zhang 2017-05-18 15:29:12 UTC

(In reply to Luiz Capitulino from comment #9)
> Indeed! bug 1438120 comment 53 is precisely the same issue and it's nice you
> got a good vmcore! Would you mind to post that info here too?
> 

OK. When doing ping-pong live migration in rt environment, setting ovs pmd threads fifo:1 in host, and boot dpdk's testpmd with using fifo:1 in guest. After several (around 5 times mostly) migrtions, guest will kernel panic which is same with this bug.

Key steps:
# chrt -fp 1 $ovs_pmd_threads
# chrt -f 1 /usr/bin/testpmd ...

In latest testing, I enabled tracepoints as Marcelo suggested in bug 1438120 comment 58. So below link includes the vmcore and related trace.txt. 

http://fileshare.englab.nay.redhat.com/pub/section2/coredump/var/crash/pezhang/bug1441552/may18/ 

Versions with this time testing:
3.10.0-668.rt56.587.el7.x86_64
tuned-2.8.0-3.el7.noarch
dpdk-16.11-4.el7fdp.x86_64
openvswitch-2.6.1-18.git20161206.el7fdp.x86_64
libvirt-3.2.0-5.el7.x86_64


Thanks,
Pei

Comment 16 Luiz Capitulino 2017-05-18 20:57:00 UTC

I didn't have much time to work on this today. I'll be back to this tomorrow. My plan is to try to understand better what's wrong and maybe try to reproduce it using the installation test.

Can you send me your /proc/cmdline and lscpu output?

I guess the number of CPUs you're isolating may be making the issue more likely to trig.

Comment 17 Pei Zhang 2017-05-19 06:00:21 UTC

(In reply to Luiz Capitulino from comment #16)
> I didn't have much time to work on this today. I'll be back to this
> tomorrow. My plan is to try to understand better what's wrong and maybe try
> to reproduce it using the installation test.
> Can you send me your /proc/cmdline and lscpu output?

# cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-3.10.0-669.rt56.588.el7.x86_64 root=/dev/mapper/rhel_dell--per430--09-root ro crashkernel=auto rd.lvm.lv=rhel_dell-per430-09/root rd.lvm.lv=rhel_dell-per430-09/swap console=ttyS0,115200n81 default_hugepagesz=1G iommu=pt intel_iommu=on isolcpus=2,4,6,8,10,12,14,16,18,19,17,16,15 nohz=on nohz_full=2,4,6,8,10,12,14,16,18,19,17,16,15 rcu_nocbs=2,4,6,8,10,12,14,16,18,19,17,16,15 intel_pstate=disable nosoftlockup

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                20
On-line CPU(s) list:   0-19
Thread(s) per core:    1
Core(s) per socket:    10
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
Stepping:              2
CPU MHz:               2297.299
BogoMIPS:              4594.59
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm arat pln pts


> I guess the number of CPUs you're isolating may be making the issue more
> likely to trig.

OK.


Thanks,
Pei

Comment 18 Pei Zhang 2017-05-23 10:37:12 UTC

Reproduce update:

Same step with Description.

Versions:
3.10.0-638.rt56.553.el7.x86_64

Result:  
Tried 3 times and reboot 3 times, didn't hit any issue.


I'll try kernel-rt-3.10.0-647.rt56.564.el7 next.

Comment 19 Luiz Capitulino 2017-05-23 20:34:22 UTC

I was able to create a reproducer for this issue!

By analyzing the vmcore, I was able to find out the sequence of events that leads to the panic. And you were right, it really happens at profile activation time.

Before getting to the details, here's the reproducer:

1. Confirm that you don't have the realtime-virtual-host profile applied. You can check /proc/cmdline for this. If you do have the profile activated, change to another profile and reboot:

# tuned-adm profile desktop
# reboot

2. Once you don't have realtime-virtual-host parameters in /proc/cmdline, run the following script:

#!/bin/bash

sync
cd /usr/lib/tuned/realtime-virtual-host

while true; do
tuned-adm profile desktop
rm -f lapic_timer_adv_ns lapic_timer_adv_ns.cpumodel
echo activating host profile
tuned-adm profile realtime-virtual-host
echo done
sleep 1s
done

This script trigs the issue for me most of the time in less than 10 minutes. However, once or twice I had to restart the machine and run the script again.

Now, the line from the tuned profile that's triggering the issue is this one:

chrt -f 1 taskset -c $1 $QEMU -enable-kvm -device pc-testdev \
-device isa-debug-exit,iobase=0xf4,iosize=0x4 \
-display none -serial stdio -device pci-testdev \
-kernel /usr/share/qemu-kvm/tscdeadline_latency.flat \
-cpu host | grep latency | cut -f 2 -d ":" > $dir/out

This is from the run-tscdeadline-latency.sh script executed by the realtime-virtual-host profile. That explains why you had to reprovision the machine to reproduce this and also why I wasn't being able to reproduce it before: run-tscdeadline-latency.sh implements caching and it should execute only once on a system (the reproducer above remove the caching files, forcing the script to execute on every run).

I still don't know what causes this. I'm building a debug kernel to gather more information.

Comment 20 Pei Zhang 2017-05-24 01:01:36 UTC

(In reply to Luiz Capitulino from comment #19)
> I was able to create a reproducer for this issue!
> 
> By analyzing the vmcore, I was able to find out the sequence of events that
> leads to the panic. And you were right, it really happens at profile
> activation time.
> 
> Before getting to the details, here's the reproducer:
> 
> 1. Confirm that you don't have the realtime-virtual-host profile applied.
> You can check /proc/cmdline for this. If you do have the profile activated,
> change to another profile and reboot:
> 
> # tuned-adm profile desktop
> # reboot
> 
> 2. Once you don't have realtime-virtual-host parameters in /proc/cmdline,
> run the following script:
> 
> #!/bin/bash
> 
> sync
> cd /usr/lib/tuned/realtime-virtual-host
> 
> while true; do
>         tuned-adm profile desktop
>         rm -f lapic_timer_adv_ns lapic_timer_adv_ns.cpumodel
>         echo activating host profile
>         tuned-adm profile realtime-virtual-host
>         echo done
>         sleep 1s
> done
> 
> This script trigs the issue for me most of the time in less than 10 minutes.
> However, once or twice I had to restart the machine and run the script again.

Thanks Luiz. I can also reproduce this issue with this script. Now it's much easier to reproduce and I'll continue test lower versions to check if it's regression.


Thanks,
Pei

Comment 21 Pei Zhang 2017-05-24 11:24:40 UTC

Hi Luiz, 

Seems this is a regression bug. 

Tested with below versions using Comment 19. kernel-rt-3.10.0-604.rt56.517.el7 is the first kernel to hit this issue. 


kernel-rt-3.10.0-647.rt56.564.el7  fail
kernel-rt-3.10.0-638.rt56.553.el7  fail
kernel-rt-3.10.0-620.rt56.535.el7  fail
kernel-rt-3.10.0-610.rt56.524.el7  fail
kernel-rt-3.10.0-605.rt56.519.el7  fail
kernel-rt-3.10.0-604.rt56.517.el7  fail
kernel-rt-3.10.0-603.rt56.515.el7  work (running about 15 minutes, works well)
kernel-rt-3.10.0-602.rt56.514.el7  work (running about 15 minutes, works well)
kernel-rt-3.10.0-600.rt56.512.el7  work (running 1h, works well)
kernel-rt-3.10.0-514.rt56.420.el7  work( running 1h, works well)


Thanks,
Pei

Comment 22 Luiz Capitulino 2017-05-24 13:35:44 UTC

Thanks Pei. I thought I had an hypothesis for this issue but if I was right it shouldn't be a regression. I'll take a look at -604.

Comment 23 Luiz Capitulino 2017-05-24 19:30:01 UTC

Actually, my hypothesis matches exactly with your findings!

This upstream commit:

  commit 55c041cf3750894df6f304ad0489c777e27402e7
  Author: Sebastian Andrzej Siewior <bigeasy>
  Date:   Thu Jan 21 15:58:56 2016 +0100

      sched: fixup migrate disable (all tasks were bound to CPU0)

Which was backported via bug 1423432, replaces all p->nr_cpus_allowed
usage to a function called tsk_nr_cpus_allowed(). This function
returns 1 when migration is disabled (meaning, "one CPU allowed").

Now, take a look at what is done in inc_rt_migration(): rt_nr_migratory
will only get updated when nr_cpus_allowed > 1. The following case
should be possible:

1. rt_nr_migratory=0

2. Migration is disabled in some code path (eg. pagefault)

3. inc_rt_migration() is called. rt_nr_migratory increment is skipped
   since tsk_nr_cpus_allowed() will return 1

4. Migration is enabled

5. set_cpus_allowed_rt() is called from sched_setaffinity() code path

6. set_cpus_allowed_rt() should get weight=1 and tsk_nr_cpus_allowed()
   should return > 1 since migration is now enabled

7. BUG_ON() triggers

Note that this problem should also be present on upstream, since the
code is the same. However, upstream doesn't have the BUG_ON() anymore
so the problem is less severe (probably rt_nr_migratory gets unbalanced).

I guess the solution is just to change set_cpus_allowed_rt() to
check p->nr_cpus_allowed directly instead of using tsk_nr_cpus_allowed().
But this is a discussion to have on upstream.

For downstream, I think the best action right now is to revert the
series for bug 1423432.

I'll find a way to show that the problem exists on upstream and post my
patch there. Then, once it's accepted or another solution recommended,
we can backport the series for bug 1423432 along with the fix for this
issue.

Comment 28 Luiz Capitulino 2017-05-26 15:32:39 UTC

Daniel is taking up this work, so I'm re-assining this to him.

The patch below is the solution I mentioned on comment 23. Although I haven't tested it yet. If this patch solves the issue, then the issue should also exist on upstream although it will be harder to show as it doesn't have the BUG_ON().

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 35d6eac..1a38d3f 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -300,7 +300,7 @@ static void inc_rt_migration(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
 	rt_rq = &rq_of_rt_rq(rt_rq)->rt;
 
 	rt_rq->rt_nr_total++;
-	if (tsk_nr_cpus_allowed(p) > 1)
+	if (p->nr_cpus_allowed > 1)
 		rt_rq->rt_nr_migratory++;
 
 	update_rt_migration(rt_rq);
@@ -317,7 +317,7 @@ static void dec_rt_migration(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
 	rt_rq = &rq_of_rt_rq(rt_rq)->rt;
 
 	rt_rq->rt_nr_total--;
-	if (tsk_nr_cpus_allowed(p) > 1)
+	if (p->nr_cpus_allowed > 1)
 		rt_rq->rt_nr_migratory--;
 
 	update_rt_migration(rt_rq);

Comment 29 Daniel Bristot de Oliveira 2017-05-31 10:18:12 UTC

Hi Pei, 

Could you please try this kernel build?

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=13311268

-- Daniel

Comment 30 Pei Zhang 2017-06-01 02:05:10 UTC

(In reply to Daniel Bristot de Oliveira from comment #29)
> Hi Pei, 
> 
> Could you please try this kernel build?
> 
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=13311268
> 
> -- Daniel

Hi Daniel,

This build still hit this panic issue. Please see related vmcore and vmcore-dmesg.txt from below link:

http://fileshare.englab.nay.redhat.com/pub/section2/coredump/var/crash/pezhang/bug1441552/jun1/


Best Regards,
Pei

Comment 31 Luiz Capitulino 2017-06-01 13:30:50 UTC

Daniel, just something I forgot to tell: I tried the patch from comment 28 a few days ago and it fixed the issue for me.

I think the problem with upstream commit 55c041c is that it's not all places that you can replace p->nr_cpus_allowed with tsk_nr_cpus_allowed(). inc_rt_migration() and dec_rt_migration() is one of those places, there could be more.

Comment 32 Daniel Bristot de Oliveira 2017-06-06 08:46:12 UTC

Hi Luiz,

By only using p->nr_cpus_allowed, your patch completely ignores that the migrate_disable/enable actions. Although in the past this was used, at that time, the migrated disable also used to change the cpumask of tasks. But that is not true anymore. So I do not believe that that is the best solution for this bug.

Comment 33 Daniel Bristot de Oliveira 2017-06-06 13:27:25 UTC

Hi Pei,

Could you please collect data using the following kernel:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=13351615

?

It does not aim to fix the problem but to collect data about the problem.

I added some trace_printks and a BUG_ON() in the  dec_rt_migration(), so we can get the problem earlier.

Before starting the test, please, run the following commands:

# cat /sys/kernel/debug/tracing/available_events | grep setaff  > /sys/kernel/debug/tracing/set_event 
# sysctl -w kernel.ftrace_dump_on_oops=1

This will generate a VMCORE, please ensure that I can download the vmcore... I couldn't download the last one (403 Forbidden).

Finally, I tried to reproduce the problem locally, but the steps to reproduce did not drive me to the crash. Is there anything else I should do?

Thanks in advance.

-- Daniel

Comment 34 Luiz Capitulino 2017-06-06 13:45:57 UTC

Daniel,

I'll explain my reasioning for the fix below. However, as this issue is a high severity regression and as we're late in the snapshot phase, I think the best action for this BZ right now is to revert the series for bug 1423432 and redo the backport (along with the fix for this issue) for 7.5 or 7.4.z. Otherwise, we risk shipping 7.4 with this issue.

Regarding the fix, my thinking is: the semantics for tsk_nr_cpus_allowed(p) is "if migration is enabled, the number of CPUs this task can migrate to is nr_cpus_allowed. If migrate is disabled this task can't migrate to another CPU". The problem here is that this semantics doesn't apply to inc_rt_migration() and dec_rt_migration(), since I _guess_ those functions do global accounting of migratable tasks and are not affected by individual tasks being unable to be migrated in a short period of time.

But of course, I could be wrong.

Comment 35 Daniel Bristot de Oliveira 2017-06-06 14:36:21 UTC

Luis,

I am considering revert the other BZ, as we talked yesterday. I am just trying not to just _guess_ that this is the best thing to do.

-- Daniel

Comment 36 Pei Zhang 2017-06-07 02:45:41 UTC

(In reply to Daniel Bristot de Oliveira from comment #33)
> Hi Pei,
> 
> Could you please collect data using the following kernel:
> 
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=13351615
> 
> ?
> 
> It does not aim to fix the problem but to collect data about the problem.
> 
> I added some trace_printks and a BUG_ON() in the  dec_rt_migration(), so we
> can get the problem earlier.
> 
> Before starting the test, please, run the following commands:
> 
> # cat /sys/kernel/debug/tracing/available_events | grep setaff  >
> /sys/kernel/debug/tracing/set_event 
> # sysctl -w kernel.ftrace_dump_on_oops=1
> 
> This will generate a VMCORE, please ensure that I can download the vmcore...
> I couldn't download the last one (403 Forbidden).

Hi Daniel, 

vmcore with this test build please refer to below link:
http://fileshare.englab.nay.redhat.com/pub/section2/coredump/var/crash/pezhang/bug1441552/Jun7/

Sorry, I forgot to set the permission when upload the vmcore in Comment 30, I have update the permission now, please check.

> Finally, I tried to reproduce the problem locally, but the steps to
> reproduce did not drive me to the crash. Is there anything else I should do?

In Comment 19, Luiz provided a reproducer, could you please try it? 

Best Regards,
Pei

Comment 37 Daniel Bristot de Oliveira 2017-06-08 13:37:58 UTC

Hi Pei,

Thanks for uploading the vmcore, now I could precisely define the root cause and I have a possible fix. Although it worked fine while reproducing the case as described in the comment #19, could you please verify if the following build works for you as well?

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=13379519

I am polishing the patch & description now.

Thanks in advance,
-- Daniel

Comment 38 Luiz Capitulino 2017-06-08 15:40:58 UTC

Daniel,

I just tried the reproducer on this kernel and I got a different panic. The trace is below, I have the vmcore. Let me know if you need it.

[  181.978335] ------------[ cut here ]------------
[  181.978336] kernel BUG at kernel/sched/core.c:3551!
[  181.978338] invalid opcode: 0000 [#1] PREEMPT SMP 
[  181.978356] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sb_edac edac_core coretemp intel_rapl ipmi_devintf iTCO_wdt iTCO_vendor_support iosf_mbi kvm_intel kvm dcdbas irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ipmi_ssif pcspkr sg ipmi_si ipmi_msghandler mei_me shpchp mei lpc_ich acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common crc32c_intel ixgbe mdio mgag200 i2c_algo_bit drm_kms_helper dca tg3 syscopyarea
[  181.978361]  sysfillrect sysimgblt ptp fb_sys_fops ttm drm i2c_core ahci libahci libata megaraid_sas pps_core dm_mirror dm_region_hash dm_log dm_mod
[  181.978362] CPU: 4 PID: 1327 Comm: irq/58-p1p1-TxR Not tainted 3.10.0-675.rt56.597.test.el7.x86_64 #1
[  181.978363] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.2.6 06/08/2015
[  181.978364] task: ffff880075549030 ti: ffff880075550000 task.ti: ffff880075550000
[  181.978369] RIP: 0010:[<ffffffff810ba3b4>]  [<ffffffff810ba3b4>] migrate_disable+0x214/0x220
[  181.978369] RSP: 0018:ffff880075553e18  EFLAGS: 00010246
[  181.978370] RAX: 0000000000019080 RBX: ffff880075549030 RCX: 0000000000000020
[  181.978370] RDX: ffff88046da80000 RSI: 0000000000000120 RDI: 0000000000000000
[  181.978371] RBP: ffff880075553e20 R08: 0000000000000004 R09: 0000000000000004
[  181.978371] R10: ffff8800755491b8 R11: ffff8802ebf8e758 R12: ffff88046973e000
[  181.978372] R13: ffff880075549030 R14: ffff88046973e000 R15: ffffffff8112fc70
[  181.978373] FS:  0000000000000000(0000) GS:ffff88046da80000(0000) knlGS:0000000000000000
[  181.978373] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  181.978374] CR2: 00007f3fc7bf3000 CR3: 0000000461eaf000 CR4: 00000000001427e0
[  181.978374] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  181.978375] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  181.978375] Stack:
[  181.978376]  ffff88046a07da80 ffff880075553e30 ffffffff81086881 ffff880075553e58
[  181.978377]  ffffffff8112fc89 ffff88046a07da80 ffff880075549030 ffff880075549030
[  181.978378]  ffff880075553ec0 ffffffff811300ef ffff880075549030 0000000000000000
[  181.978378] Call Trace:
[  181.978383]  [<ffffffff81086881>] __local_bh_disable+0x31/0x40
[  181.978386]  [<ffffffff8112fc89>] irq_forced_thread_fn+0x19/0x70
[  181.978388]  [<ffffffff811300ef>] irq_thread+0x12f/0x180
[  181.978389]  [<ffffffff8112fd20>] ? wake_threads_waitq+0x40/0x40
[  181.978391]  [<ffffffff8112ffc0>] ? irq_thread_check_affinity+0x30/0x30
[  181.978394]  [<ffffffff810a9bcf>] kthread+0xcf/0xe0
[  181.978396]  [<ffffffff810a9b00>] ? kthread_worker_fn+0x170/0x170
[  181.978399]  [<ffffffff816b9198>] ret_from_fork+0x58/0x90
[  181.978401]  [<ffffffff810a9b00>] ? kthread_worker_fn+0x170/0x170
[  181.978410] Code: 8b 53 10 8b 52 18 48 8b 14 d5 e0 42 b4 81 48 8b 8c 10 e8 07 00 00 48 8d 93 a0 06 00 00 31 c0 e8 23 d2 09 00 e9 c5 fe ff ff 0f 0b <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 65 48 8b 
[  181.978412] RIP  [<ffffffff810ba3b4>] migrate_disable+0x214/0x220
[  181.978412]  RSP <ffff880075553e18>

Comment 39 Pei Zhang 2017-06-08 15:56:18 UTC

Hi Daniel, 

I hit same problem with Luiz.

Also, there is another issue:
When reboot the host manually, kdump also generate crash vmcore during the reboot process. So every reboot, there will be a vmcore generated.

Please refer to below link to get the vmcore:
http://fileshare.englab.nay.redhat.com/pub/section2/coredump/var/crash/pezhang/bug1441552/Jun8/

Best Regards,
Pei

Comment 40 Daniel Bristot de Oliveira 2017-06-09 16:08:30 UTC

Hi Pei,

Here is a kernel without the changes from the BZ1423432.

http://file.rdu.redhat.com/~daolivei/BZ1441552/

Could you test it?

We decided to revert because the real fix will require some upstream discussion, and so we will leave for the next cycle.

-- Daniel

Comment 41 Pei Zhang 2017-06-10 09:57:38 UTC

(In reply to Daniel Bristot de Oliveira from comment #40)
> Hi Pei,
> 
> Here is a kernel without the changes from the BZ1423432.
> 
> http://file.rdu.redhat.com/~daolivei/BZ1441552/
> 
> Could you test it?

Daniel, this build works well. 

-  Running Reproducer in Comment 19 1 hours, works well.
-  Re-install host and prepare rt environment, also works well.

> We decided to revert because the real fix will require some upstream
> discussion, and so we will leave for the next cycle.
> -- Daniel

Thanks,
Pei

Comment 42 Daniel Bristot de Oliveira 2017-06-12 18:10:47 UTC

Hi,

I posted the explanation of this bug here:

https://bugzilla.redhat.com/show_bug.cgi?id=1423432#c12

As the problem exists upstream and is related to the migrate_disable/enabled(), which is the subject of the other BZ, I will continue working on that BZ.

Sounds like a plan?

-- Daniel

Comment 44 Luis Claudio R. Goncalves 2017-06-12 20:23:37 UTC

The suggested patches (commit reverts) were added to kernel-rt-3.10.0-680.rt56.604. Kernel build ongoing.

Comment 45 Luiz Capitulino 2017-06-13 14:08:54 UTC

(In reply to Daniel Bristot de Oliveira from comment #42)
> Hi,
> 
> I posted the explanation of this bug here:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1423432#c12
> 
> As the problem exists upstream and is related to the
> migrate_disable/enabled(), which is the subject of the other BZ, I will
> continue working on that BZ.
> 
> Sounds like a plan?

It absolutely does! Btw, I think it's a good idea to re-open bug 1423432 now that the revert is done.

Comment 47 Pei Zhang 2017-06-14 02:16:42 UTC

Update:

Versions:
3.10.0-680.rt56.604.el7.x86_64
tuned-2.8.0-5.el7.noarch

Install rt host 5 times, works well, no any error.

Comment 48 Jiri Kastner 2017-06-14 15:19:03 UTC

(In reply to Pei Zhang from comment #47)
> Update:
> 
> Versions:
> 3.10.0-680.rt56.604.el7.x86_64
> tuned-2.8.0-5.el7.noarch
> 
> Install rt host 5 times, works well, no any error.

so can you close that bz?

Comment 49 Luiz Capitulino 2017-06-14 17:22:00 UTC

Coming a bit late to this, I've also ran my test-case for a few hours and it worked fine.

Comment 50 Pei Zhang 2017-06-15 00:52:32 UTC

Base on Comment 47 and Comment 49, this bug has been fixed well. Thanks.

Move status of this bug to 'VERIFIED'.

Comment 51 errata-xmlrpc 2017-08-01 19:05:29 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:2077

Comment 52 errata-xmlrpc 2017-08-02 00:26:40 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:2077

Note You need to log in before you can comment on or make changes to this bug.