Bug 1507270

Summary: BUG: scheduling while atomic: irq/41-megasas/562/0x00000002
Product: Red Hat Enterprise Linux 7 Reporter: Pei Zhang <pezhang>
Component: kernel-rtAssignee: Luis Claudio R. Goncalves <lgoncalv>
kernel-rt sub component: Other QA Contact: Pei Zhang <pezhang>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: bhu, chayang, jshortt, juzhang, lcapitulino, lgoncalv, michen, virt-maint, williams
Version: 7.5Keywords: Regression
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 3.10.0-766.rt56.697 Doc Type: No Doc Update
Doc Text:
undefined
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-04-10 08:57:57 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1442258    
Attachments:
Description Flags
Call trace info with 3.10.0-758.rt56.689.el7.x86_64 none

Description Pei Zhang 2017-10-29 06:01:29 UTC
Created attachment 1344947 [details]
Call trace info with 3.10.0-758.rt56.689.el7.x86_64

Description of problem:
Host hit call trace info and "BUG" when preparing rt testing environment.


Version-Release number of selected component (if applicable):

3.10.0-758.rt56.689.el7.x86_64
qemu-kvm-rhev-2.10.0-3.el7.x86_64
libvirt-3.8.0-1.el7.x86_64
tuned-2.9.0-0.2.rc2.el7.noarch

How reproducible:
3/3

Steps to Reproduce:
1. Install a fresh host
2. Install rt-kernel-packages
3. Reboot host
4. Isolate cores
5. Reboot host fail, host call trace and "BUG"

Actual results:
Host hit call trace and "BUG"
 

Expected results:
Host should work well.

Additional info:

Comment 2 Pei Zhang 2017-10-29 06:19:33 UTC
Post call trace and "BUG" here:

[   33.334373] BUG: scheduling while atomic: irq/38-megasas/556/0x00000002
[   33.334404] Modules linked in: bridge stp llc iTCO_wdt iTCO_vendor_support dcdbas sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sg ipmi_ssif lpc_ich ipmi_si ipmi_devintf ipmi_msghandler mei_me acpi_power_meter mei shpchp nfsd auth_rpcgss nfs_acl lockd grace ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common crc32c_intel i40e tg3 ptp mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm i2c_core ahci libahci mxm_wmi libata megaraid_sas pps_core wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod
[   33.334406] CPU: 0 PID: 556 Comm: irq/38-megasas Not tainted 3.10.0-758.rt56.689.el7.x86_64 #1
[   33.334407] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.0.1 04/11/2016
[   33.334409]  ffff88774fde7fd8 000000006ae9a77f ffff88774fde7bd0 ffffffff896bf85f
[   33.334410]  ffff88774fde7be0 ffffffff896ba29f ffff88774fde7c40 ffffffff896c4686
[   33.334411]  ffff88774fde7fd8 ffff88774fde7fd8 ffff88774fde7fd8 ffff88774fde7fd8
[   33.334411] Call Trace:
[   33.334419]  [<ffffffff896bf85f>] dump_stack+0x19/0x1b
[   33.334422]  [<ffffffff896ba29f>] __schedule_bug+0x62/0x70
[   33.334425]  [<ffffffff896c4686>] __schedule+0x6c6/0x840
[   33.334427]  [<ffffffff896c4830>] schedule+0x30/0x96
[   33.334428]  [<ffffffff896c561d>] rt_spin_lock_slowlock+0x13d/0x350
[   33.334431]  [<ffffffff896c6b55>] rt_spin_lock+0x25/0x30
[   33.334434]  [<ffffffff8956eb47>] intel_unmap+0xe7/0x290
[   33.334436]  [<ffffffff8956ed67>] intel_unmap_sg+0x77/0x90
[   33.334441]  [<ffffffff8947a8bf>] scsi_dma_unmap+0x5f/0x80
[   33.334448]  [<ffffffffc0302c03>] complete_cmd_fusion+0x4f3/0x700 [megaraid_sas]
[   33.334453]  [<ffffffff891319c0>] ? irq_thread_fn+0x50/0x50
[   33.334458]  [<ffffffffc0302edc>] megasas_isr_fusion+0x3c/0x1a0 [megaraid_sas]
[   33.334460]  [<ffffffff891319e3>] irq_forced_thread_fn+0x23/0x70
[   33.334462]  [<ffffffff89131e3f>] irq_thread+0x12f/0x180
[   33.334464]  [<ffffffff89131a70>] ? wake_threads_waitq+0x40/0x40
[   33.334466]  [<ffffffff89131d10>] ? irq_thread_check_affinity+0x30/0x30
[   33.334470]  [<ffffffff890abf4f>] kthread+0xcf/0xe0
[   33.334472]  [<ffffffff890abe80>] ? kthread_worker_fn+0x170/0x170
[   33.334475]  [<ffffffff896cfa18>] ret_from_fork+0x58/0x90
[   33.334477]  [<ffffffff890abe80>] ? kthread_worker_fn+0x170/0x170
[   33.810584] NFSD: starting 90-second grace period (net ffffffff89b35bc0)
[   49.895225] BUG: scheduling while atomic: irq/42-megasas/560/0x00000002
[   49.895256] Modules linked in: bridge stp llc iTCO_wdt iTCO_vendor_support dcdbas sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sg ipmi_ssif lpc_ich ipmi_si ipmi_devintf ipmi_msghandler mei_me acpi_power_meter mei shpchp nfsd auth_rpcgss nfs_acl lockd grace ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common crc32c_intel i40e tg3 ptp mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm i2c_core ahci libahci mxm_wmi libata megaraid_sas pps_core wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod
[   49.895259] CPU: 0 PID: 560 Comm: irq/42-megasas Tainted: G        W      ------------   3.10.0-758.rt56.689.el7.x86_64 #1
[   49.895260] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.0.1 04/11/2016
[   49.895262]  ffff88774fdfffd8 00000000e464266c ffff88774fdffbd0 ffffffff896bf85f
[   49.895263]  ffff88774fdffbe0 ffffffff896ba29f ffff88774fdffc40 ffffffff896c4686
[   49.895264]  ffff88774fdfffd8 ffff88774fdfffd8 ffff88774fdfffd8 ffff88774fdfffd8
[   49.895264] Call Trace:
[   49.895273]  [<ffffffff896bf85f>] dump_stack+0x19/0x1b
[   49.895276]  [<ffffffff896ba29f>] __schedule_bug+0x62/0x70
[   49.895279]  [<ffffffff896c4686>] __schedule+0x6c6/0x840
[   49.895280]  [<ffffffff896c4830>] schedule+0x30/0x96
[   49.895282]  [<ffffffff896c561d>] rt_spin_lock_slowlock+0x13d/0x350
[   49.895284]  [<ffffffff896c6b55>] rt_spin_lock+0x25/0x30
[   49.895288]  [<ffffffff8956eb47>] intel_unmap+0xe7/0x290
[   49.895289]  [<ffffffff8956ed67>] intel_unmap_sg+0x77/0x90
[   49.895294]  [<ffffffff8947a8bf>] scsi_dma_unmap+0x5f/0x80
[   49.895301]  [<ffffffffc0302c03>] complete_cmd_fusion+0x4f3/0x700 [megaraid_sas]
[   49.895306]  [<ffffffff891319c0>] ? irq_thread_fn+0x50/0x50
[   49.895311]  [<ffffffffc0302edc>] megasas_isr_fusion+0x3c/0x1a0 [megaraid_sas]
[   49.895313]  [<ffffffff891319e3>] irq_forced_thread_fn+0x23/0x70
[   49.895315]  [<ffffffff89131e3f>] irq_thread+0x12f/0x180
[   49.895317]  [<ffffffff89131a70>] ? wake_threads_waitq+0x40/0x40
[   49.895319]  [<ffffffff89131d10>] ? irq_thread_check_affinity+0x30/0x30
[   49.895324]  [<ffffffff890abf4f>] kthread+0xcf/0xe0
[   49.895326]  [<ffffffff890abe80>] ? kthread_worker_fn+0x170/0x170
[   49.895329]  [<ffffffff896cfa18>] ret_from_fork+0x58/0x90
[   49.895331]  [<ffffffff890abe80>] ? kthread_worker_fn+0x170/0x170

Comment 3 Pei Zhang 2017-10-30 09:20:23 UTC
Additional info:
1. Disable tuned, then reboot host, everything works well. Enable tuned, then host hit this bug.

2. Host kernel command line while hit this issue.
# cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-3.10.0-758.rt56.689.el7.x86_64 root=/dev/mapper/rhel_dell--per730--12-root ro crashkernel=auto rd.lvm.lv=rhel_dell-per730-12/root rd.lvm.lv=rhel_dell-per730-12/swap console=ttyS0,115200n81 default_hugepagesz=1G iommu=pt intel_iommu=on skew_tick=1 isolcpus=2,4,6,8,10,12,14,16,18,19,17,15,13 intel_pstate=disable nosoftlockup nohz=on nohz_full=2,4,6,8,10,12,14,16,18,19,17,15,13 rcu_nocbs=2,4,6,8,10,12,14,16,18,19,17,15,13

3. This should be regression bug, as with[1] works well(other qemu, libvirt, tuned with all same versions). I'll continue to test to get the first version which has this issue.

[1]3.10.0-745.rt56.675.el7.x86_64

Comment 5 Luiz Capitulino 2017-10-30 13:49:37 UTC
Can you try with latest RHEL7.5 non-rt kernel? If it works, then can you pass to the kernel command-line the option "threadirqs" and try again?

Comment 6 Luiz Capitulino 2017-10-30 13:57:27 UTC
Btw, you can use tuned with the realtime-virtual-host profile with the non-rt kernel for this testing.

Comment 7 Pei Zhang 2017-10-30 21:28:50 UTC
(In reply to Luiz Capitulino from comment #5)
> Can you try with latest RHEL7.5 non-rt kernel? If it works, then can you
> pass to the kernel command-line the option "threadirqs" and try again?

Add threadirqs to kernel-rt or non-rt kernel lines, both still hit this issue.


(In reply to Luiz Capitulino from comment #6)
> Btw, you can use tuned with the realtime-virtual-host profile with the
> non-rt kernel for this testing.

With non-rt kernel and use tuned with realtime-virtual-host/or cpu-partitioning profile, didn't hit this issue, everything works well.

with non-rt kernel, add "threadirqs" to kernel line[1], also works well.
[1]# cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-3.10.0-756.el7.x86_64 root=/dev/mapper/rhel_dell--per730--27-root ro crashkernel=auto rd.lvm.lv=rhel_dell-per730-27/root rd.lvm.lv=rhel_dell-per730-27/swap console=ttyS0,115200n81 default_hugepagesz=1G iommu=pt intel_iommu=on skew_tick=1 isolcpus=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,30,28,26,24,22,20,18,16 intel_pstate=disable nosoftlockup nohz=on nohz_full=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,30,28,26,24,22,20,18,16 rcu_nocbs=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,30,28,26,24,22,20,18,16 threadirqs



This is my regression testing:
3.10.0-748.rt56.678.el7.x86_64  work
3.10.0-749.rt56.679.el7.x86_64  work
3.10.0-750.rt56.680.el7.x86_64  fail
3.10.0-751.rt56.681.el7.x86_64  fail


So 3.10.0-750.rt56.680.el7.x86_64 is the first version to hit this problem.

Comment 8 Pei Zhang 2017-10-30 21:59:10 UTC
(In reply to Pei Zhang from comment #7)
> (In reply to Luiz Capitulino from comment #5)
> > Can you try with latest RHEL7.5 non-rt kernel? If it works, then can you
> > pass to the kernel command-line the option "threadirqs" and try again?
> 
> Add threadirqs to kernel-rt or non-rt kernel lines, both still hit this
> issue.

Sorry for mistakes here. Add threadirqs to kernel-rt kernel lines[1], still hist this issue. non-rt kernel work well.

[1]# cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-3.10.0-750.rt56.680.el7.x86_64 root=/dev/mapper/rhel_dell--per730--11-root ro crashkernel=auto rd.lvm.lv=rhel_dell-per730-11/root rd.lvm.lv=rhel_dell-per730-11/swap console=ttyS0,115200n81 default_hugepagesz=1G iommu=pt intel_iommu=on LANG=en_US.UTF-8 skew_tick=1 isolcpus=2,4,6,8,10,12,14,16,18,19,17,15,13 intel_pstate=disable nosoftlockup nohz=on nohz_full=2,4,6,8,10,12,14,16,18,19,17,15,13 rcu_nocbs=2,4,6,8,10,12,14,16,18,19,17,15,13 threadirqs

Comment 9 Luiz Capitulino 2017-10-30 22:06:17 UTC
OK, so this is a RT kernel issue (as we suspect). Re-assining.

Comment 15 Pei Zhang 2017-11-06 03:22:45 UTC
This bug has been fixed well. Both host and guest work well, and kvm-rt latency results looks good.


Versions:
3.10.0-768.rt56.699.el7.x86_64
qemu-kvm-rhev-2.10.0-4.el7.x86_64
tuned-2.9.0-1.el7.noarch
libvirt-3.9.0-1.el7.x86_64

24 hours latency testing results:
(1)Single VM with 1 rt vCPU:
# Min Latencies: 00006
# Avg Latencies: 00006
# Max Latencies: 00014

(2)Single VM with 8 rt vCPUs:
# Min Latencies: 00004 00006 00005 00006 00005 00006 00006 00006
# Avg Latencies: 00006 00006 00006 00006 00006 00006 00006 00006
# Max Latencies: 00013 00013 00013 00013 00012 00013 00013 00013

(3)Multiple VMs each with 1 rt vCPU:
- VM1
# Min Latencies: 00006
# Avg Latencies: 00006
# Max Latencies: 00014

- VM2
# Min Latencies: 00005
# Avg Latencies: 00006
# Max Latencies: 00014

- VM3
# Min Latencies: 00006
# Avg Latencies: 00006
# Max Latencies: 00015

- VM4
# Min Latencies: 00005
# Avg Latencies: 00006
# Max Latencies: 00015

Comment 19 errata-xmlrpc 2018-04-10 08:57:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:0676