Bug 1337667
| Summary: | [abrt] NMI watchdog: Watchdog detected hard LOCKUP on cpu 1 | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Alan Jenkins <alan.christopher.jenkins> | ||||||||
| Component: | kernel | Assignee: | Kernel Maintainer List <kernel-maint> | ||||||||
| Status: | CLOSED NOTABUG | QA Contact: | Fedora Extras Quality Assurance <extras-qa> | ||||||||
| Severity: | unspecified | Docs Contact: | |||||||||
| Priority: | unspecified | ||||||||||
| Version: | 23 | CC: | alan.christopher.jenkins, gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda, mchehab | ||||||||
| Target Milestone: | --- | ||||||||||
| Target Release: | --- | ||||||||||
| Hardware: | x86_64 | ||||||||||
| OS: | Unspecified | ||||||||||
| URL: | https://retrace.fedoraproject.org/faf/reports/bthash/07e86c036aab6677a05f3d8997d99473c76b83a7 | ||||||||||
| Whiteboard: | abrt_hash:0f98d64ea3ace676b65a1e664d1999c5e56078e4;VARIANT_ID=workstation; | ||||||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||||||
| Doc Text: | Story Points: | --- | |||||||||
| Clone Of: | Environment: | ||||||||||
| Last Closed: | 2016-09-23 20:57:26 UTC | Type: | --- | ||||||||
| Regression: | --- | Mount Type: | --- | ||||||||
| Documentation: | --- | CRM: | |||||||||
| Verified Versions: | Category: | --- | |||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||
| Embargoed: | |||||||||||
| Attachments: |
|
||||||||||
|
Description
Alan Jenkins
2016-05-19 18:50:26 UTC
Created attachment 1159604 [details]
File: dmesg
Sorry, ABRT picked the wrong backtrace. The first one (shown in dmesg) is much more plausible as a root cause! Praise be to the NMI watchdog. [ 534.539016] NMI watchdog: Watchdog detected hard LOCKUP on cpu 1 [ 534.539025] Modules linked in: [ 534.539030] bnep bluetooth cpufreq_stats fuse nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_filter ebtable_broute bridge ebtables ip6table_security ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_filter ip6_tables iptable_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw vfat fat btrfs xor raid6_pq intel_rapl coretemp kvm_intel arc4 kvm irqbypass crct10dif_pclmul iwlmvm crc32_pclmul mac80211 snd_hda_codec_hdmi crc32c_intel iTCO_wdt iTCO_vendor_support snd_hda_codec_realtek snd_intel_sst_acpi snd_intel_sst_core snd_soc_rt5670 snd_soc_sst_mfld_platform iwlwifi nxp_nci_i2c snd_hda_codec_generic nxp_nci [ 534.539086] snd_soc_rl6231 snd_hda_intel snd_soc_core snd_hda_codec nci cfg80211 i2c_i801 mei_txe mei lpc_ich ir_mce_kbd_decoder ir_sanyo_decoder ir_sharp_decoder ir_xmp_decoder ir_rc6_decoder ir_rc5_decoder ir_sony_decoder ir_lirc_codec lirc_dev ir_nec_decoder ir_jvc_decoder snd_hda_core snd_hwdep shpchp snd_compress rc_rc6_mce snd_pcm_dmaengine snd_seq nfc ite_cir ac97_bus snd_seq_device rc_core snd_pcm i2c_designware_platform snd_timer i2c_designware_core snd tpm_tis iosf_mbi soundcore rfkill_gpio rfkill tpm spi_pxa2xx_platform soc_button_array nfsd auth_rpcgss nfs_acl lockd grace sunrpc i915 i2c_algo_bit drm_kms_helper 8021q garp stp llc mrp drm r8169 serio_raw mii sdhci_acpi sdhci mmc_core video i2c_hid fjes [ 534.539151] CPU: 1 PID: 3071 Comm: qemu-system-x86 Not tainted 4.4.9-300.fc23.x86_64 #1 [ 534.539154] Hardware name: /NUC5CPYB, BIOS PYBSWCEL.86A.0053.2016.0401.1154 04/01/2016 [ 534.539157] 0000000000000086 00000000b08a09bb ffff88017fd05b90 ffffffff813b62ee [ 534.539161] 0000000000000000 0000000000000000 ffff88017fd05ba8 ffffffff8115aebd [ 534.539165] ffff88017a588000 ffff88017fd05be0 ffffffff811a32ec 0000000000000001 [ 534.539170] Call Trace: [ 534.539173] <NMI> [<ffffffff813b62ee>] dump_stack+0x63/0x85 [ 534.539189] [<ffffffff8115aebd>] watchdog_overflow_callback+0xbd/0xd0 [ 534.539194] [<ffffffff811a32ec>] __perf_event_overflow+0x8c/0x1d0 [ 534.539198] [<ffffffff811a3ec4>] perf_event_overflow+0x14/0x20 [ 534.539204] [<ffffffff81035fe1>] intel_pmu_handle_irq+0x1e1/0x4a0 [ 534.539210] [<ffffffff8102c758>] perf_event_nmi_handler+0x28/0x50 [ 534.539215] [<ffffffff8101b0c9>] nmi_handle+0x69/0x120 [ 534.539218] [<ffffffff8101b63d>] default_do_nmi+0xad/0xf0 [ 534.539221] [<ffffffff8101b762>] do_nmi+0xe2/0x130 [ 534.539227] [<ffffffff817a3471>] end_repeat_nmi+0x1a/0x1e [ 534.539238] [<ffffffffa06fa3d0>] ? vmx_invpcid_supported+0x30/0x30 [kvm_intel] [ 534.539243] [<ffffffff813c32a5>] ? delay_tsc+0x25/0x60 [ 534.539246] [<ffffffff813c32a5>] ? delay_tsc+0x25/0x60 [ 534.539250] [<ffffffff813c32a5>] ? delay_tsc+0x25/0x60 [ 534.539252] <<EOE>> [<ffffffff813c31df>] __delay+0xf/0x20 [ 534.539287] [<ffffffffa07b48db>] wait_lapic_expire+0xcb/0x150 [kvm] [ 534.539311] [<ffffffffa0797891>] kvm_arch_vcpu_ioctl_run+0x371/0x14e0 [kvm] [ 534.539316] [<ffffffff81120dd0>] ? futex_wake+0x90/0x180 [ 534.539338] [<ffffffffa078005d>] kvm_vcpu_ioctl+0x33d/0x620 [kvm] [ 534.539344] [<ffffffff810b3951>] ? do_sigtimedwait+0xd1/0x200 [ 534.539350] [<ffffffff81241648>] do_vfs_ioctl+0x298/0x480 [ 534.539355] [<ffffffff81338393>] ? security_file_ioctl+0x43/0x60 [ 534.539359] [<ffffffff812418a9>] SyS_ioctl+0x79/0x90 [ 534.539364] [<ffffffff817a0f2e>] entry_SYSCALL_64_fastpath+0x12/0x71 [ 552.274471] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [powertop:2691] [ 552.274479] Modules linked in: bnep bluetooth cpufreq_stats fuse nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_filter ebtable_broute bridge ebtables ip6table_security ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_filter ip6_tables iptable_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw vfat fat btrfs xor raid6_pq intel_rapl coretemp kvm_intel arc4 kvm irqbypass crct10dif_pclmul iwlmvm crc32_pclmul mac80211 snd_hda_codec_hdmi crc32c_intel iTCO_wdt iTCO_vendor_support snd_hda_codec_realtek snd_intel_sst_acpi snd_intel_sst_core snd_soc_rt5670 snd_soc_sst_mfld_platform iwlwifi nxp_nci_i2c snd_hda_codec_generic [ 552.274537] nxp_nci snd_soc_rl6231 snd_hda_intel snd_soc_core snd_hda_codec nci cfg80211 i2c_i801 mei_txe mei lpc_ich ir_mce_kbd_decoder ir_sanyo_decoder ir_sharp_decoder ir_xmp_decoder ir_rc6_decoder ir_rc5_decoder ir_sony_decoder ir_lirc_codec lirc_dev ir_nec_decoder ir_jvc_decoder snd_hda_core snd_hwdep shpchp snd_compress rc_rc6_mce snd_pcm_dmaengine snd_seq nfc ite_cir ac97_bus snd_seq_device rc_core snd_pcm i2c_designware_platform snd_timer i2c_designware_core snd tpm_tis iosf_mbi soundcore rfkill_gpio rfkill tpm spi_pxa2xx_platform soc_button_array nfsd auth_rpcgss nfs_acl lockd grace sunrpc i915 i2c_algo_bit drm_kms_helper 8021q garp stp llc mrp drm r8169 serio_raw mii sdhci_acpi sdhci mmc_core video i2c_hid fjes wait_lapic_expire() is described as a busy-wait optimization, to increase timer wakeup accuracy. So the question is how we get a value of
apic->lapic_timer.expired_tscdeadline
which is far enough ahead of `guest_tsc`, to cause an extended busy-wait. The expected busy-wait duration is `lapic_timer_advance_ns`, which defaults to 0 (and I haven't changed it).
I see several ways the guest TSC can get set / "adjusted", and none of them redo the TSC deadline timer :(. If you set the TSC _backwards_...
Created attachment 1160347 [details]
Test patch
I have confirmed that the lockup can be avoided by sanity-checking the delay in wait_lapic_expire(). With the attached test patch, I get a warning message but no lockup.
I also instrumented for TSC going backwards. The resulting log confirms that this was happening.
Created attachment 1160348 [details] Log output from test patch Here's the log output from the test patch above. Also, upstream mailing list: http://thread.gmane.org/gmane.comp.emulators.kvm.devel/152370/focus=152372 *********** MASS BUG UPDATE ************** We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 23 kernel bugs. Fedora 23 has now been rebased to 4.7.4-100.fc23. Please test this kernel update (or newer) and let us know if you issue has been resolved or if it is still present with the newer kernel. If you have moved on to Fedora 24 or 25, and are still experiencing this issue, please change the version to Fedora 24 or 25. If you experience different issues, please open a new bug report for those. Ah, thanks. The lockups should be gone now. |