Bug 1337667 - [abrt] NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
Summary: [abrt] NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 23
Hardware: x86_64
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL: https://retrace.fedoraproject.org/faf...
Whiteboard: abrt_hash:0f98d64ea3ace676b65a1e664d1...
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-05-19 18:50 UTC by Alan Jenkins
Modified: 2016-09-23 20:57 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-09-23 20:57:26 UTC
Type: ---
Embargoed:


Attachments (Terms of Use)
File: dmesg (74.65 KB, text/plain)
2016-05-19 18:50 UTC, Alan Jenkins
no flags Details
Test patch (3.57 KB, patch)
2016-05-22 16:14 UTC, Alan Jenkins
no flags Details | Diff
Log output from test patch (69.65 KB, text/plain)
2016-05-22 16:17 UTC, Alan Jenkins
no flags Details

Description Alan Jenkins 2016-05-19 18:50:26 UTC
Description of problem:
Hardware: NUC 5CPYH

A lockup seems to be happening with about 50% probability, when I'm using virt-manager to restore a snapshotted VM.  (Snapshot includes both disk and ram).

Additional info:
reporter:       libreport-2.6.4
NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [powertop:2691]
Modules linked in: bnep bluetooth cpufreq_stats fuse nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_filter ebtable_broute bridge ebtables ip6table_security ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_filter ip6_tables iptable_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw vfat fat btrfs xor raid6_pq intel_rapl coretemp kvm_intel arc4 kvm irqbypass crct10dif_pclmul iwlmvm crc32_pclmul mac80211 snd_hda_codec_hdmi crc32c_intel iTCO_wdt iTCO_vendor_support snd_hda_codec_realtek snd_intel_sst_acpi snd_intel_sst_core snd_soc_rt5670 snd_soc_sst_mfld_platform iwlwifi nxp_nci_i2c snd_hda_codec_generic
 nxp_nci snd_soc_rl6231 snd_hda_intel snd_soc_core snd_hda_codec nci cfg80211 i2c_i801 mei_txe mei lpc_ich ir_mce_kbd_decoder ir_sanyo_decoder ir_sharp_decoder ir_xmp_decoder ir_rc6_decoder ir_rc5_decoder ir_sony_decoder ir_lirc_codec lirc_dev ir_nec_decoder ir_jvc_decoder snd_hda_core snd_hwdep shpchp snd_compress rc_rc6_mce snd_pcm_dmaengine snd_seq nfc ite_cir ac97_bus snd_seq_device rc_core snd_pcm i2c_designware_platform snd_timer i2c_designware_core snd tpm_tis iosf_mbi soundcore rfkill_gpio rfkill tpm spi_pxa2xx_platform soc_button_array nfsd auth_rpcgss nfs_acl lockd grace sunrpc i915 i2c_algo_bit drm_kms_helper 8021q garp stp llc mrp drm r8169 serio_raw mii sdhci_acpi sdhci mmc_core video i2c_hid fjes
CPU: 0 PID: 2691 Comm: powertop Not tainted 4.4.9-300.fc23.x86_64 #1
Hardware name:                  /NUC5CPYB, BIOS PYBSWCEL.86A.0053.2016.0401.1154 04/01/2016
task: ffff880067b40000 ti: ffff880162128000 task.ti: ffff880162128000
RIP: 0010:[<ffffffff81124778>]  [<ffffffff81124778>] smp_call_function_single+0xd8/0x130
RSP: 0018:ffff88016212bd58  EFLAGS: 00000202
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff88016999fcd0
RDX: 0000000000000001 RSI: ffff88016212bd68 RDI: ffff88016212bd68
RBP: ffff88016212bda0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000660 R11: 0000000000000246 R12: ffffffff813e7d20
R13: 0000000000000008 R14: 00007ffdc7eb9438 R15: 0000000000000000
FS:  00007f2ee6cae740(0000) GS:ffff88017fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000556f308e0000 CR3: 000000014e868000 CR4: 00000000001026f0
Stack:
 ffff880054447700 000000240000bdb0 ffff88016999fcd0 ffffffff813e7d20
 ffff88016212bdb0 0000000000000003 0000000043994bb4 ffff88016212be0c
 ffff88016212be08 ffff88016212bde8 ffffffff813e7ffd 0000000000000660
Call Trace:
 [<ffffffff813e7d20>] ? __wrmsr_on_cpu+0x40/0x40
 [<ffffffff813e7ffd>] rdmsr_safe_on_cpu+0x5d/0x90
 [<ffffffff8104f1d7>] msr_read+0x97/0xe0
 [<ffffffff8122daa7>] __vfs_read+0x37/0x100
 [<ffffffff813382a0>] ? security_file_permission+0xa0/0xc0
 [<ffffffff8122e566>] vfs_read+0x86/0x130
 [<ffffffff8122f475>] SyS_pread64+0x95/0xb0
 [<ffffffff817a0f2e>] entry_SYSCALL_64_fastpath+0x12/0x71
Code: 25 28 00 00 00 75 70 48 83 c4 38 5b 41 5c 5d c3 48 8d 75 c8 48 89 d1 89 df 4c 89 e2 e8 12 fe ff ff 8b 55 e0 83 e2 01 74 cf f3 90 <8b> 55 e0 83 e2 01 75 f6 eb c3 8b 05 50 c8 e0 00 85 c0 75 85 80

Comment 1 Alan Jenkins 2016-05-19 18:50:35 UTC
Created attachment 1159604 [details]
File: dmesg

Comment 2 Alan Jenkins 2016-05-19 18:55:35 UTC
Sorry, ABRT picked the wrong backtrace.  The first one (shown in dmesg) is much more plausible as a root cause!

Praise be to the NMI watchdog.


[  534.539016] NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
[  534.539025] Modules linked in:
[  534.539030]  bnep bluetooth cpufreq_stats fuse nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_filter ebtable_broute bridge ebtables ip6table_security ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_filter ip6_tables iptable_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw vfat fat btrfs xor raid6_pq intel_rapl coretemp kvm_intel arc4 kvm irqbypass crct10dif_pclmul iwlmvm crc32_pclmul mac80211 snd_hda_codec_hdmi crc32c_intel iTCO_wdt iTCO_vendor_support snd_hda_codec_realtek snd_intel_sst_acpi snd_intel_sst_core snd_soc_rt5670 snd_soc_sst_mfld_platform iwlwifi nxp_nci_i2c snd_hda_codec_generic nxp_nci
[  534.539086]  snd_soc_rl6231 snd_hda_intel snd_soc_core snd_hda_codec nci cfg80211 i2c_i801 mei_txe mei lpc_ich ir_mce_kbd_decoder ir_sanyo_decoder ir_sharp_decoder ir_xmp_decoder ir_rc6_decoder ir_rc5_decoder ir_sony_decoder ir_lirc_codec lirc_dev ir_nec_decoder ir_jvc_decoder snd_hda_core snd_hwdep shpchp snd_compress rc_rc6_mce snd_pcm_dmaengine snd_seq nfc ite_cir ac97_bus snd_seq_device rc_core snd_pcm i2c_designware_platform snd_timer i2c_designware_core snd tpm_tis iosf_mbi soundcore rfkill_gpio rfkill tpm spi_pxa2xx_platform soc_button_array nfsd auth_rpcgss nfs_acl lockd grace sunrpc i915 i2c_algo_bit drm_kms_helper 8021q garp stp llc mrp drm r8169 serio_raw mii sdhci_acpi sdhci mmc_core video i2c_hid fjes
[  534.539151] CPU: 1 PID: 3071 Comm: qemu-system-x86 Not tainted 4.4.9-300.fc23.x86_64 #1
[  534.539154] Hardware name:                  /NUC5CPYB, BIOS PYBSWCEL.86A.0053.2016.0401.1154 04/01/2016
[  534.539157]  0000000000000086 00000000b08a09bb ffff88017fd05b90 ffffffff813b62ee
[  534.539161]  0000000000000000 0000000000000000 ffff88017fd05ba8 ffffffff8115aebd
[  534.539165]  ffff88017a588000 ffff88017fd05be0 ffffffff811a32ec 0000000000000001
[  534.539170] Call Trace:
[  534.539173]  <NMI>  [<ffffffff813b62ee>] dump_stack+0x63/0x85
[  534.539189]  [<ffffffff8115aebd>] watchdog_overflow_callback+0xbd/0xd0
[  534.539194]  [<ffffffff811a32ec>] __perf_event_overflow+0x8c/0x1d0
[  534.539198]  [<ffffffff811a3ec4>] perf_event_overflow+0x14/0x20
[  534.539204]  [<ffffffff81035fe1>] intel_pmu_handle_irq+0x1e1/0x4a0
[  534.539210]  [<ffffffff8102c758>] perf_event_nmi_handler+0x28/0x50
[  534.539215]  [<ffffffff8101b0c9>] nmi_handle+0x69/0x120
[  534.539218]  [<ffffffff8101b63d>] default_do_nmi+0xad/0xf0
[  534.539221]  [<ffffffff8101b762>] do_nmi+0xe2/0x130
[  534.539227]  [<ffffffff817a3471>] end_repeat_nmi+0x1a/0x1e
[  534.539238]  [<ffffffffa06fa3d0>] ? vmx_invpcid_supported+0x30/0x30 [kvm_intel]
[  534.539243]  [<ffffffff813c32a5>] ? delay_tsc+0x25/0x60
[  534.539246]  [<ffffffff813c32a5>] ? delay_tsc+0x25/0x60
[  534.539250]  [<ffffffff813c32a5>] ? delay_tsc+0x25/0x60
[  534.539252]  <<EOE>>  [<ffffffff813c31df>] __delay+0xf/0x20
[  534.539287]  [<ffffffffa07b48db>] wait_lapic_expire+0xcb/0x150 [kvm]
[  534.539311]  [<ffffffffa0797891>] kvm_arch_vcpu_ioctl_run+0x371/0x14e0 [kvm]
[  534.539316]  [<ffffffff81120dd0>] ? futex_wake+0x90/0x180
[  534.539338]  [<ffffffffa078005d>] kvm_vcpu_ioctl+0x33d/0x620 [kvm]
[  534.539344]  [<ffffffff810b3951>] ? do_sigtimedwait+0xd1/0x200
[  534.539350]  [<ffffffff81241648>] do_vfs_ioctl+0x298/0x480
[  534.539355]  [<ffffffff81338393>] ? security_file_ioctl+0x43/0x60
[  534.539359]  [<ffffffff812418a9>] SyS_ioctl+0x79/0x90
[  534.539364]  [<ffffffff817a0f2e>] entry_SYSCALL_64_fastpath+0x12/0x71
[  552.274471] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [powertop:2691]
[  552.274479] Modules linked in: bnep bluetooth cpufreq_stats fuse nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_filter ebtable_broute bridge ebtables ip6table_security ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_filter ip6_tables iptable_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw vfat fat btrfs xor raid6_pq intel_rapl coretemp kvm_intel arc4 kvm irqbypass crct10dif_pclmul iwlmvm crc32_pclmul mac80211 snd_hda_codec_hdmi crc32c_intel iTCO_wdt iTCO_vendor_support snd_hda_codec_realtek snd_intel_sst_acpi snd_intel_sst_core snd_soc_rt5670 snd_soc_sst_mfld_platform iwlwifi nxp_nci_i2c snd_hda_codec_generic
[  552.274537]  nxp_nci snd_soc_rl6231 snd_hda_intel snd_soc_core snd_hda_codec nci cfg80211 i2c_i801 mei_txe mei lpc_ich ir_mce_kbd_decoder ir_sanyo_decoder ir_sharp_decoder ir_xmp_decoder ir_rc6_decoder ir_rc5_decoder ir_sony_decoder ir_lirc_codec lirc_dev ir_nec_decoder ir_jvc_decoder snd_hda_core snd_hwdep shpchp snd_compress rc_rc6_mce snd_pcm_dmaengine snd_seq nfc ite_cir ac97_bus snd_seq_device rc_core snd_pcm i2c_designware_platform snd_timer i2c_designware_core snd tpm_tis iosf_mbi soundcore rfkill_gpio rfkill tpm spi_pxa2xx_platform soc_button_array nfsd auth_rpcgss nfs_acl lockd grace sunrpc i915 i2c_algo_bit drm_kms_helper 8021q garp stp llc mrp drm r8169 serio_raw mii sdhci_acpi sdhci mmc_core video i2c_hid fjes

Comment 3 Alan Jenkins 2016-05-20 10:42:40 UTC
wait_lapic_expire() is described as a busy-wait optimization, to increase timer wakeup accuracy.  So the question is how we get a value of

    apic->lapic_timer.expired_tscdeadline

which is far enough ahead of `guest_tsc`, to cause an extended busy-wait.  The expected busy-wait duration is `lapic_timer_advance_ns`, which defaults to 0 (and I haven't changed it).

I see several ways the guest TSC can get set / "adjusted", and none of them redo the TSC deadline timer :(.  If you set the TSC _backwards_...

Comment 4 Alan Jenkins 2016-05-22 16:14:57 UTC
Created attachment 1160347 [details]
Test patch

I have confirmed that the lockup can be avoided by sanity-checking the delay in wait_lapic_expire().  With the attached test patch, I get a warning message but no lockup.

I also instrumented for TSC going backwards.  The resulting log confirms that this was happening.

Comment 5 Alan Jenkins 2016-05-22 16:17:27 UTC
Created attachment 1160348 [details]
Log output from test patch

Here's the log output from the test patch above.

Also, upstream mailing list:
http://thread.gmane.org/gmane.comp.emulators.kvm.devel/152370/focus=152372

Comment 6 Laura Abbott 2016-09-23 19:43:03 UTC
*********** MASS BUG UPDATE **************
 
We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 23 kernel bugs.
 
Fedora 23 has now been rebased to 4.7.4-100.fc23.  Please test this kernel update (or newer) and let us know if you issue has been resolved or if it is still present with the newer kernel.
 
If you have moved on to Fedora 24 or 25, and are still experiencing this issue, please change the version to Fedora 24 or 25.
 
If you experience different issues, please open a new bug report for those.

Comment 7 Alan Jenkins 2016-09-23 20:57:26 UTC
Ah, thanks.  The lockups should be gone now.


Note You need to log in before you can comment on or make changes to this bug.