Bug 1507270
Summary: | BUG: scheduling while atomic: irq/41-megasas/562/0x00000002 | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Pei Zhang <pezhang> | ||||
Component: | kernel-rt | Assignee: | Luis Claudio R. Goncalves <lgoncalv> | ||||
kernel-rt sub component: | Other | QA Contact: | Pei Zhang <pezhang> | ||||
Status: | CLOSED ERRATA | Docs Contact: | |||||
Severity: | high | ||||||
Priority: | high | CC: | bhu, chayang, jshortt, juzhang, lcapitulino, lgoncalv, michen, virt-maint, williams | ||||
Version: | 7.5 | Keywords: | Regression | ||||
Target Milestone: | rc | ||||||
Target Release: | --- | ||||||
Hardware: | Unspecified | ||||||
OS: | Unspecified | ||||||
Whiteboard: | |||||||
Fixed In Version: | 3.10.0-766.rt56.697 | Doc Type: | No Doc Update | ||||
Doc Text: |
undefined
|
Story Points: | --- | ||||
Clone Of: | Environment: | ||||||
Last Closed: | 2018-04-10 08:57:57 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | |||||||
Bug Blocks: | 1442258 | ||||||
Attachments: |
|
Post call trace and "BUG" here: [ 33.334373] BUG: scheduling while atomic: irq/38-megasas/556/0x00000002 [ 33.334404] Modules linked in: bridge stp llc iTCO_wdt iTCO_vendor_support dcdbas sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sg ipmi_ssif lpc_ich ipmi_si ipmi_devintf ipmi_msghandler mei_me acpi_power_meter mei shpchp nfsd auth_rpcgss nfs_acl lockd grace ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common crc32c_intel i40e tg3 ptp mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm i2c_core ahci libahci mxm_wmi libata megaraid_sas pps_core wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod [ 33.334406] CPU: 0 PID: 556 Comm: irq/38-megasas Not tainted 3.10.0-758.rt56.689.el7.x86_64 #1 [ 33.334407] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.0.1 04/11/2016 [ 33.334409] ffff88774fde7fd8 000000006ae9a77f ffff88774fde7bd0 ffffffff896bf85f [ 33.334410] ffff88774fde7be0 ffffffff896ba29f ffff88774fde7c40 ffffffff896c4686 [ 33.334411] ffff88774fde7fd8 ffff88774fde7fd8 ffff88774fde7fd8 ffff88774fde7fd8 [ 33.334411] Call Trace: [ 33.334419] [<ffffffff896bf85f>] dump_stack+0x19/0x1b [ 33.334422] [<ffffffff896ba29f>] __schedule_bug+0x62/0x70 [ 33.334425] [<ffffffff896c4686>] __schedule+0x6c6/0x840 [ 33.334427] [<ffffffff896c4830>] schedule+0x30/0x96 [ 33.334428] [<ffffffff896c561d>] rt_spin_lock_slowlock+0x13d/0x350 [ 33.334431] [<ffffffff896c6b55>] rt_spin_lock+0x25/0x30 [ 33.334434] [<ffffffff8956eb47>] intel_unmap+0xe7/0x290 [ 33.334436] [<ffffffff8956ed67>] intel_unmap_sg+0x77/0x90 [ 33.334441] [<ffffffff8947a8bf>] scsi_dma_unmap+0x5f/0x80 [ 33.334448] [<ffffffffc0302c03>] complete_cmd_fusion+0x4f3/0x700 [megaraid_sas] [ 33.334453] [<ffffffff891319c0>] ? irq_thread_fn+0x50/0x50 [ 33.334458] [<ffffffffc0302edc>] megasas_isr_fusion+0x3c/0x1a0 [megaraid_sas] [ 33.334460] [<ffffffff891319e3>] irq_forced_thread_fn+0x23/0x70 [ 33.334462] [<ffffffff89131e3f>] irq_thread+0x12f/0x180 [ 33.334464] [<ffffffff89131a70>] ? wake_threads_waitq+0x40/0x40 [ 33.334466] [<ffffffff89131d10>] ? irq_thread_check_affinity+0x30/0x30 [ 33.334470] [<ffffffff890abf4f>] kthread+0xcf/0xe0 [ 33.334472] [<ffffffff890abe80>] ? kthread_worker_fn+0x170/0x170 [ 33.334475] [<ffffffff896cfa18>] ret_from_fork+0x58/0x90 [ 33.334477] [<ffffffff890abe80>] ? kthread_worker_fn+0x170/0x170 [ 33.810584] NFSD: starting 90-second grace period (net ffffffff89b35bc0) [ 49.895225] BUG: scheduling while atomic: irq/42-megasas/560/0x00000002 [ 49.895256] Modules linked in: bridge stp llc iTCO_wdt iTCO_vendor_support dcdbas sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sg ipmi_ssif lpc_ich ipmi_si ipmi_devintf ipmi_msghandler mei_me acpi_power_meter mei shpchp nfsd auth_rpcgss nfs_acl lockd grace ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common crc32c_intel i40e tg3 ptp mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm i2c_core ahci libahci mxm_wmi libata megaraid_sas pps_core wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod [ 49.895259] CPU: 0 PID: 560 Comm: irq/42-megasas Tainted: G W ------------ 3.10.0-758.rt56.689.el7.x86_64 #1 [ 49.895260] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.0.1 04/11/2016 [ 49.895262] ffff88774fdfffd8 00000000e464266c ffff88774fdffbd0 ffffffff896bf85f [ 49.895263] ffff88774fdffbe0 ffffffff896ba29f ffff88774fdffc40 ffffffff896c4686 [ 49.895264] ffff88774fdfffd8 ffff88774fdfffd8 ffff88774fdfffd8 ffff88774fdfffd8 [ 49.895264] Call Trace: [ 49.895273] [<ffffffff896bf85f>] dump_stack+0x19/0x1b [ 49.895276] [<ffffffff896ba29f>] __schedule_bug+0x62/0x70 [ 49.895279] [<ffffffff896c4686>] __schedule+0x6c6/0x840 [ 49.895280] [<ffffffff896c4830>] schedule+0x30/0x96 [ 49.895282] [<ffffffff896c561d>] rt_spin_lock_slowlock+0x13d/0x350 [ 49.895284] [<ffffffff896c6b55>] rt_spin_lock+0x25/0x30 [ 49.895288] [<ffffffff8956eb47>] intel_unmap+0xe7/0x290 [ 49.895289] [<ffffffff8956ed67>] intel_unmap_sg+0x77/0x90 [ 49.895294] [<ffffffff8947a8bf>] scsi_dma_unmap+0x5f/0x80 [ 49.895301] [<ffffffffc0302c03>] complete_cmd_fusion+0x4f3/0x700 [megaraid_sas] [ 49.895306] [<ffffffff891319c0>] ? irq_thread_fn+0x50/0x50 [ 49.895311] [<ffffffffc0302edc>] megasas_isr_fusion+0x3c/0x1a0 [megaraid_sas] [ 49.895313] [<ffffffff891319e3>] irq_forced_thread_fn+0x23/0x70 [ 49.895315] [<ffffffff89131e3f>] irq_thread+0x12f/0x180 [ 49.895317] [<ffffffff89131a70>] ? wake_threads_waitq+0x40/0x40 [ 49.895319] [<ffffffff89131d10>] ? irq_thread_check_affinity+0x30/0x30 [ 49.895324] [<ffffffff890abf4f>] kthread+0xcf/0xe0 [ 49.895326] [<ffffffff890abe80>] ? kthread_worker_fn+0x170/0x170 [ 49.895329] [<ffffffff896cfa18>] ret_from_fork+0x58/0x90 [ 49.895331] [<ffffffff890abe80>] ? kthread_worker_fn+0x170/0x170 Additional info: 1. Disable tuned, then reboot host, everything works well. Enable tuned, then host hit this bug. 2. Host kernel command line while hit this issue. # cat /proc/cmdline BOOT_IMAGE=/vmlinuz-3.10.0-758.rt56.689.el7.x86_64 root=/dev/mapper/rhel_dell--per730--12-root ro crashkernel=auto rd.lvm.lv=rhel_dell-per730-12/root rd.lvm.lv=rhel_dell-per730-12/swap console=ttyS0,115200n81 default_hugepagesz=1G iommu=pt intel_iommu=on skew_tick=1 isolcpus=2,4,6,8,10,12,14,16,18,19,17,15,13 intel_pstate=disable nosoftlockup nohz=on nohz_full=2,4,6,8,10,12,14,16,18,19,17,15,13 rcu_nocbs=2,4,6,8,10,12,14,16,18,19,17,15,13 3. This should be regression bug, as with[1] works well(other qemu, libvirt, tuned with all same versions). I'll continue to test to get the first version which has this issue. [1]3.10.0-745.rt56.675.el7.x86_64 Can you try with latest RHEL7.5 non-rt kernel? If it works, then can you pass to the kernel command-line the option "threadirqs" and try again? Btw, you can use tuned with the realtime-virtual-host profile with the non-rt kernel for this testing. (In reply to Luiz Capitulino from comment #5) > Can you try with latest RHEL7.5 non-rt kernel? If it works, then can you > pass to the kernel command-line the option "threadirqs" and try again? Add threadirqs to kernel-rt or non-rt kernel lines, both still hit this issue. (In reply to Luiz Capitulino from comment #6) > Btw, you can use tuned with the realtime-virtual-host profile with the > non-rt kernel for this testing. With non-rt kernel and use tuned with realtime-virtual-host/or cpu-partitioning profile, didn't hit this issue, everything works well. with non-rt kernel, add "threadirqs" to kernel line[1], also works well. [1]# cat /proc/cmdline BOOT_IMAGE=/vmlinuz-3.10.0-756.el7.x86_64 root=/dev/mapper/rhel_dell--per730--27-root ro crashkernel=auto rd.lvm.lv=rhel_dell-per730-27/root rd.lvm.lv=rhel_dell-per730-27/swap console=ttyS0,115200n81 default_hugepagesz=1G iommu=pt intel_iommu=on skew_tick=1 isolcpus=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,30,28,26,24,22,20,18,16 intel_pstate=disable nosoftlockup nohz=on nohz_full=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,30,28,26,24,22,20,18,16 rcu_nocbs=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,30,28,26,24,22,20,18,16 threadirqs This is my regression testing: 3.10.0-748.rt56.678.el7.x86_64 work 3.10.0-749.rt56.679.el7.x86_64 work 3.10.0-750.rt56.680.el7.x86_64 fail 3.10.0-751.rt56.681.el7.x86_64 fail So 3.10.0-750.rt56.680.el7.x86_64 is the first version to hit this problem. (In reply to Pei Zhang from comment #7) > (In reply to Luiz Capitulino from comment #5) > > Can you try with latest RHEL7.5 non-rt kernel? If it works, then can you > > pass to the kernel command-line the option "threadirqs" and try again? > > Add threadirqs to kernel-rt or non-rt kernel lines, both still hit this > issue. Sorry for mistakes here. Add threadirqs to kernel-rt kernel lines[1], still hist this issue. non-rt kernel work well. [1]# cat /proc/cmdline BOOT_IMAGE=/vmlinuz-3.10.0-750.rt56.680.el7.x86_64 root=/dev/mapper/rhel_dell--per730--11-root ro crashkernel=auto rd.lvm.lv=rhel_dell-per730-11/root rd.lvm.lv=rhel_dell-per730-11/swap console=ttyS0,115200n81 default_hugepagesz=1G iommu=pt intel_iommu=on LANG=en_US.UTF-8 skew_tick=1 isolcpus=2,4,6,8,10,12,14,16,18,19,17,15,13 intel_pstate=disable nosoftlockup nohz=on nohz_full=2,4,6,8,10,12,14,16,18,19,17,15,13 rcu_nocbs=2,4,6,8,10,12,14,16,18,19,17,15,13 threadirqs OK, so this is a RT kernel issue (as we suspect). Re-assining. This bug has been fixed well. Both host and guest work well, and kvm-rt latency results looks good. Versions: 3.10.0-768.rt56.699.el7.x86_64 qemu-kvm-rhev-2.10.0-4.el7.x86_64 tuned-2.9.0-1.el7.noarch libvirt-3.9.0-1.el7.x86_64 24 hours latency testing results: (1)Single VM with 1 rt vCPU: # Min Latencies: 00006 # Avg Latencies: 00006 # Max Latencies: 00014 (2)Single VM with 8 rt vCPUs: # Min Latencies: 00004 00006 00005 00006 00005 00006 00006 00006 # Avg Latencies: 00006 00006 00006 00006 00006 00006 00006 00006 # Max Latencies: 00013 00013 00013 00013 00012 00013 00013 00013 (3)Multiple VMs each with 1 rt vCPU: - VM1 # Min Latencies: 00006 # Avg Latencies: 00006 # Max Latencies: 00014 - VM2 # Min Latencies: 00005 # Avg Latencies: 00006 # Max Latencies: 00014 - VM3 # Min Latencies: 00006 # Avg Latencies: 00006 # Max Latencies: 00015 - VM4 # Min Latencies: 00005 # Avg Latencies: 00006 # Max Latencies: 00015 Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:0676 |
Created attachment 1344947 [details] Call trace info with 3.10.0-758.rt56.689.el7.x86_64 Description of problem: Host hit call trace info and "BUG" when preparing rt testing environment. Version-Release number of selected component (if applicable): 3.10.0-758.rt56.689.el7.x86_64 qemu-kvm-rhev-2.10.0-3.el7.x86_64 libvirt-3.8.0-1.el7.x86_64 tuned-2.9.0-0.2.rc2.el7.noarch How reproducible: 3/3 Steps to Reproduce: 1. Install a fresh host 2. Install rt-kernel-packages 3. Reboot host 4. Isolate cores 5. Reboot host fail, host call trace and "BUG" Actual results: Host hit call trace and "BUG" Expected results: Host should work well. Additional info: