RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.
Bug 1507270 - BUG: scheduling while atomic: irq/41-megasas/562/0x00000002
Summary: BUG: scheduling while atomic: irq/41-megasas/562/0x00000002
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel-rt
Version: 7.5
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: rc
: ---
Assignee: Luis Claudio R. Goncalves
QA Contact: Pei Zhang
URL:
Whiteboard:
Depends On:
Blocks: 1442258
TreeView+ depends on / blocked
 
Reported: 2017-10-29 06:01 UTC by Pei Zhang
Modified: 2018-04-10 08:59 UTC (History)
9 users (show)

Fixed In Version: 3.10.0-766.rt56.697
Doc Type: No Doc Update
Doc Text:
undefined
Clone Of:
Environment:
Last Closed: 2018-04-10 08:57:57 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Call trace info with 3.10.0-758.rt56.689.el7.x86_64 (158.53 KB, text/plain)
2017-10-29 06:01 UTC, Pei Zhang
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2018:0676 0 None None None 2018-04-10 08:59:40 UTC

Description Pei Zhang 2017-10-29 06:01:29 UTC
Created attachment 1344947 [details]
Call trace info with 3.10.0-758.rt56.689.el7.x86_64

Description of problem:
Host hit call trace info and "BUG" when preparing rt testing environment.


Version-Release number of selected component (if applicable):

3.10.0-758.rt56.689.el7.x86_64
qemu-kvm-rhev-2.10.0-3.el7.x86_64
libvirt-3.8.0-1.el7.x86_64
tuned-2.9.0-0.2.rc2.el7.noarch

How reproducible:
3/3

Steps to Reproduce:
1. Install a fresh host
2. Install rt-kernel-packages
3. Reboot host
4. Isolate cores
5. Reboot host fail, host call trace and "BUG"

Actual results:
Host hit call trace and "BUG"
 

Expected results:
Host should work well.

Additional info:

Comment 2 Pei Zhang 2017-10-29 06:19:33 UTC
Post call trace and "BUG" here:

[   33.334373] BUG: scheduling while atomic: irq/38-megasas/556/0x00000002
[   33.334404] Modules linked in: bridge stp llc iTCO_wdt iTCO_vendor_support dcdbas sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sg ipmi_ssif lpc_ich ipmi_si ipmi_devintf ipmi_msghandler mei_me acpi_power_meter mei shpchp nfsd auth_rpcgss nfs_acl lockd grace ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common crc32c_intel i40e tg3 ptp mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm i2c_core ahci libahci mxm_wmi libata megaraid_sas pps_core wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod
[   33.334406] CPU: 0 PID: 556 Comm: irq/38-megasas Not tainted 3.10.0-758.rt56.689.el7.x86_64 #1
[   33.334407] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.0.1 04/11/2016
[   33.334409]  ffff88774fde7fd8 000000006ae9a77f ffff88774fde7bd0 ffffffff896bf85f
[   33.334410]  ffff88774fde7be0 ffffffff896ba29f ffff88774fde7c40 ffffffff896c4686
[   33.334411]  ffff88774fde7fd8 ffff88774fde7fd8 ffff88774fde7fd8 ffff88774fde7fd8
[   33.334411] Call Trace:
[   33.334419]  [<ffffffff896bf85f>] dump_stack+0x19/0x1b
[   33.334422]  [<ffffffff896ba29f>] __schedule_bug+0x62/0x70
[   33.334425]  [<ffffffff896c4686>] __schedule+0x6c6/0x840
[   33.334427]  [<ffffffff896c4830>] schedule+0x30/0x96
[   33.334428]  [<ffffffff896c561d>] rt_spin_lock_slowlock+0x13d/0x350
[   33.334431]  [<ffffffff896c6b55>] rt_spin_lock+0x25/0x30
[   33.334434]  [<ffffffff8956eb47>] intel_unmap+0xe7/0x290
[   33.334436]  [<ffffffff8956ed67>] intel_unmap_sg+0x77/0x90
[   33.334441]  [<ffffffff8947a8bf>] scsi_dma_unmap+0x5f/0x80
[   33.334448]  [<ffffffffc0302c03>] complete_cmd_fusion+0x4f3/0x700 [megaraid_sas]
[   33.334453]  [<ffffffff891319c0>] ? irq_thread_fn+0x50/0x50
[   33.334458]  [<ffffffffc0302edc>] megasas_isr_fusion+0x3c/0x1a0 [megaraid_sas]
[   33.334460]  [<ffffffff891319e3>] irq_forced_thread_fn+0x23/0x70
[   33.334462]  [<ffffffff89131e3f>] irq_thread+0x12f/0x180
[   33.334464]  [<ffffffff89131a70>] ? wake_threads_waitq+0x40/0x40
[   33.334466]  [<ffffffff89131d10>] ? irq_thread_check_affinity+0x30/0x30
[   33.334470]  [<ffffffff890abf4f>] kthread+0xcf/0xe0
[   33.334472]  [<ffffffff890abe80>] ? kthread_worker_fn+0x170/0x170
[   33.334475]  [<ffffffff896cfa18>] ret_from_fork+0x58/0x90
[   33.334477]  [<ffffffff890abe80>] ? kthread_worker_fn+0x170/0x170
[   33.810584] NFSD: starting 90-second grace period (net ffffffff89b35bc0)
[   49.895225] BUG: scheduling while atomic: irq/42-megasas/560/0x00000002
[   49.895256] Modules linked in: bridge stp llc iTCO_wdt iTCO_vendor_support dcdbas sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sg ipmi_ssif lpc_ich ipmi_si ipmi_devintf ipmi_msghandler mei_me acpi_power_meter mei shpchp nfsd auth_rpcgss nfs_acl lockd grace ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common crc32c_intel i40e tg3 ptp mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm i2c_core ahci libahci mxm_wmi libata megaraid_sas pps_core wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod
[   49.895259] CPU: 0 PID: 560 Comm: irq/42-megasas Tainted: G        W      ------------   3.10.0-758.rt56.689.el7.x86_64 #1
[   49.895260] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.0.1 04/11/2016
[   49.895262]  ffff88774fdfffd8 00000000e464266c ffff88774fdffbd0 ffffffff896bf85f
[   49.895263]  ffff88774fdffbe0 ffffffff896ba29f ffff88774fdffc40 ffffffff896c4686
[   49.895264]  ffff88774fdfffd8 ffff88774fdfffd8 ffff88774fdfffd8 ffff88774fdfffd8
[   49.895264] Call Trace:
[   49.895273]  [<ffffffff896bf85f>] dump_stack+0x19/0x1b
[   49.895276]  [<ffffffff896ba29f>] __schedule_bug+0x62/0x70
[   49.895279]  [<ffffffff896c4686>] __schedule+0x6c6/0x840
[   49.895280]  [<ffffffff896c4830>] schedule+0x30/0x96
[   49.895282]  [<ffffffff896c561d>] rt_spin_lock_slowlock+0x13d/0x350
[   49.895284]  [<ffffffff896c6b55>] rt_spin_lock+0x25/0x30
[   49.895288]  [<ffffffff8956eb47>] intel_unmap+0xe7/0x290
[   49.895289]  [<ffffffff8956ed67>] intel_unmap_sg+0x77/0x90
[   49.895294]  [<ffffffff8947a8bf>] scsi_dma_unmap+0x5f/0x80
[   49.895301]  [<ffffffffc0302c03>] complete_cmd_fusion+0x4f3/0x700 [megaraid_sas]
[   49.895306]  [<ffffffff891319c0>] ? irq_thread_fn+0x50/0x50
[   49.895311]  [<ffffffffc0302edc>] megasas_isr_fusion+0x3c/0x1a0 [megaraid_sas]
[   49.895313]  [<ffffffff891319e3>] irq_forced_thread_fn+0x23/0x70
[   49.895315]  [<ffffffff89131e3f>] irq_thread+0x12f/0x180
[   49.895317]  [<ffffffff89131a70>] ? wake_threads_waitq+0x40/0x40
[   49.895319]  [<ffffffff89131d10>] ? irq_thread_check_affinity+0x30/0x30
[   49.895324]  [<ffffffff890abf4f>] kthread+0xcf/0xe0
[   49.895326]  [<ffffffff890abe80>] ? kthread_worker_fn+0x170/0x170
[   49.895329]  [<ffffffff896cfa18>] ret_from_fork+0x58/0x90
[   49.895331]  [<ffffffff890abe80>] ? kthread_worker_fn+0x170/0x170

Comment 3 Pei Zhang 2017-10-30 09:20:23 UTC
Additional info:
1. Disable tuned, then reboot host, everything works well. Enable tuned, then host hit this bug.

2. Host kernel command line while hit this issue.
# cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-3.10.0-758.rt56.689.el7.x86_64 root=/dev/mapper/rhel_dell--per730--12-root ro crashkernel=auto rd.lvm.lv=rhel_dell-per730-12/root rd.lvm.lv=rhel_dell-per730-12/swap console=ttyS0,115200n81 default_hugepagesz=1G iommu=pt intel_iommu=on skew_tick=1 isolcpus=2,4,6,8,10,12,14,16,18,19,17,15,13 intel_pstate=disable nosoftlockup nohz=on nohz_full=2,4,6,8,10,12,14,16,18,19,17,15,13 rcu_nocbs=2,4,6,8,10,12,14,16,18,19,17,15,13

3. This should be regression bug, as with[1] works well(other qemu, libvirt, tuned with all same versions). I'll continue to test to get the first version which has this issue.

[1]3.10.0-745.rt56.675.el7.x86_64

Comment 5 Luiz Capitulino 2017-10-30 13:49:37 UTC
Can you try with latest RHEL7.5 non-rt kernel? If it works, then can you pass to the kernel command-line the option "threadirqs" and try again?

Comment 6 Luiz Capitulino 2017-10-30 13:57:27 UTC
Btw, you can use tuned with the realtime-virtual-host profile with the non-rt kernel for this testing.

Comment 7 Pei Zhang 2017-10-30 21:28:50 UTC
(In reply to Luiz Capitulino from comment #5)
> Can you try with latest RHEL7.5 non-rt kernel? If it works, then can you
> pass to the kernel command-line the option "threadirqs" and try again?

Add threadirqs to kernel-rt or non-rt kernel lines, both still hit this issue.


(In reply to Luiz Capitulino from comment #6)
> Btw, you can use tuned with the realtime-virtual-host profile with the
> non-rt kernel for this testing.

With non-rt kernel and use tuned with realtime-virtual-host/or cpu-partitioning profile, didn't hit this issue, everything works well.

with non-rt kernel, add "threadirqs" to kernel line[1], also works well.
[1]# cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-3.10.0-756.el7.x86_64 root=/dev/mapper/rhel_dell--per730--27-root ro crashkernel=auto rd.lvm.lv=rhel_dell-per730-27/root rd.lvm.lv=rhel_dell-per730-27/swap console=ttyS0,115200n81 default_hugepagesz=1G iommu=pt intel_iommu=on skew_tick=1 isolcpus=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,30,28,26,24,22,20,18,16 intel_pstate=disable nosoftlockup nohz=on nohz_full=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,30,28,26,24,22,20,18,16 rcu_nocbs=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,30,28,26,24,22,20,18,16 threadirqs



This is my regression testing:
3.10.0-748.rt56.678.el7.x86_64  work
3.10.0-749.rt56.679.el7.x86_64  work
3.10.0-750.rt56.680.el7.x86_64  fail
3.10.0-751.rt56.681.el7.x86_64  fail


So 3.10.0-750.rt56.680.el7.x86_64 is the first version to hit this problem.

Comment 8 Pei Zhang 2017-10-30 21:59:10 UTC
(In reply to Pei Zhang from comment #7)
> (In reply to Luiz Capitulino from comment #5)
> > Can you try with latest RHEL7.5 non-rt kernel? If it works, then can you
> > pass to the kernel command-line the option "threadirqs" and try again?
> 
> Add threadirqs to kernel-rt or non-rt kernel lines, both still hit this
> issue.

Sorry for mistakes here. Add threadirqs to kernel-rt kernel lines[1], still hist this issue. non-rt kernel work well.

[1]# cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-3.10.0-750.rt56.680.el7.x86_64 root=/dev/mapper/rhel_dell--per730--11-root ro crashkernel=auto rd.lvm.lv=rhel_dell-per730-11/root rd.lvm.lv=rhel_dell-per730-11/swap console=ttyS0,115200n81 default_hugepagesz=1G iommu=pt intel_iommu=on LANG=en_US.UTF-8 skew_tick=1 isolcpus=2,4,6,8,10,12,14,16,18,19,17,15,13 intel_pstate=disable nosoftlockup nohz=on nohz_full=2,4,6,8,10,12,14,16,18,19,17,15,13 rcu_nocbs=2,4,6,8,10,12,14,16,18,19,17,15,13 threadirqs

Comment 9 Luiz Capitulino 2017-10-30 22:06:17 UTC
OK, so this is a RT kernel issue (as we suspect). Re-assining.

Comment 15 Pei Zhang 2017-11-06 03:22:45 UTC
This bug has been fixed well. Both host and guest work well, and kvm-rt latency results looks good.


Versions:
3.10.0-768.rt56.699.el7.x86_64
qemu-kvm-rhev-2.10.0-4.el7.x86_64
tuned-2.9.0-1.el7.noarch
libvirt-3.9.0-1.el7.x86_64

24 hours latency testing results:
(1)Single VM with 1 rt vCPU:
# Min Latencies: 00006
# Avg Latencies: 00006
# Max Latencies: 00014

(2)Single VM with 8 rt vCPUs:
# Min Latencies: 00004 00006 00005 00006 00005 00006 00006 00006
# Avg Latencies: 00006 00006 00006 00006 00006 00006 00006 00006
# Max Latencies: 00013 00013 00013 00013 00012 00013 00013 00013

(3)Multiple VMs each with 1 rt vCPU:
- VM1
# Min Latencies: 00006
# Avg Latencies: 00006
# Max Latencies: 00014

- VM2
# Min Latencies: 00005
# Avg Latencies: 00006
# Max Latencies: 00014

- VM3
# Min Latencies: 00006
# Avg Latencies: 00006
# Max Latencies: 00015

- VM4
# Min Latencies: 00005
# Avg Latencies: 00006
# Max Latencies: 00015

Comment 19 errata-xmlrpc 2018-04-10 08:57:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:0676


Note You need to log in before you can comment on or make changes to this bug.