Bug 995708

Summary: hard LOCKUP on cpu perhaps in iptables, perhaps apic/acpi
Product: [Fedora] Fedora
Component: kernel
Version: 19
Hardware: i686
OS: Linux
Status: CLOSED NOTABUG
Severity: high
Priority: unspecified
Keywords: Reopened
Reporter: Trevor Cordes <trevor>
Assignee: Kernel Maintainer List <kernel-maint>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda, michele, trevor
Doc Type: Bug Fix
Type: Bug
Last Closed: 2013-11-13 14:43:30 UTC

Description Trevor Cordes 2013-08-10 10:06:41 UTC
Description of problem:
Kernel crash: "Watchdog detected hard LOCKUP on cpu 1" after about 5-20 mins running on kernels:
kernel-3.9.10-100.fc17.i686
kernel-3.9.11-200.fc18.i686

But: kernel-3.8.13-100.fc17.i686  does not crash

Booting with kernel options "acpi=off noapic nolapic pnpacpi=off pci=noacpi" works around the bug: the system remains stable with either of the buggy kernels (as well as kernel-3.10.5-201.fc19.i686).  It may be stable with a subset of these options; I will test that as I find time.  For now I went all out on the safety factor so that I have a stable system.  I am now on F19 with kernel-3.10.5-201.fc19.i686 and will do all testing around that kernel.  I have not seen it crash with F19 yet, but that's because I've had the acpi/apic kludges in place since F18.
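
A rough sketch of how the subset testing could be organized: drop one option at a time and boot with the rest. The helper name leave_one_out is made up for illustration; only the option strings come from this report, and how the command line is applied (bootloader config, etc.) is left out.

```python
# Hypothetical helper: list the boot command lines to try when narrowing
# down which of the workaround options is actually needed.
OPTIONS = ["acpi=off", "noapic", "nolapic", "pnpacpi=off", "pci=noacpi"]


def leave_one_out(options):
    """Yield (dropped_option, remaining_cmdline) for each option in turn."""
    for i, dropped in enumerate(options):
        remaining = options[:i] + options[i + 1:]
        yield dropped, " ".join(remaining)


if __name__ == "__main__":
    for dropped, cmdline in leave_one_out(OPTIONS):
        print(f"without {dropped:12s} -> {cmdline}")
```

If the lockup comes back only when one particular option is dropped, that option is the likely candidate. With a 5-20 minute time to crash, each boot needs to run well past 20 minutes before it counts as stable.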

Also, booting into the system and immediately turning off iptables (empty chains, ACCEPT all) seems to make the bug go away, even without any acpi/apic options.  Not the best solution, though!


Version-Release number of selected component (if applicable):
kernel-3.9.11-200.fc18.i686


How reproducible:
always, but I only captured a call trace from one crash (got lucky), all other times screen was black


Steps to Reproduce:
1. boot into affected kernels without acpi/apic protection
2. wait 5-20 mins

Actual results:
crash, call trace below


Expected results:
no crash


Additional info:

System is a quirky AMD with ECC that often has hardware compat issues (USB sucks, onboard LAN sucks), but once you resolve them the system runs 100% stable for years.
	Manufacturer: TYAN
	Product Name: S2466 TIGER MPX
	Version: A2


[  549.123989] ------------[ cut here ]------------
[  549.123989] WARNING: at kernel/watchdog.c:245 watchdog_overflow_callback+0xac/0xd0()
[  549.123989] Hardware name: Unknown
[  549.123989] Watchdog detected hard LOCKUP on cpu 1
[ 549.123989] Modules linked in: xfrm4_mode_tunnel authenc esp4 xfrm4_mode_transport deflate zlib_deflate twofish_generic twofish_i586
twofish_common camellia_generic serpent_generic glue_helper blowfish_generic blowfish_common cast5_generic cast_common des_generic xcbc rmd160
sha512_generic crypto_null af_key nf_conntrack_netbios_ns nf_conntrack_broadcast xt_TCPMSS xt_tcpmss xt_state xt_LOG xt_limit xt_nat
ipt_MASQUERADE xt_REDIRECT iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_mark ip6table_filter iptable_mangle ip6_tables ppdev
parport_pc parport r8169 e1000 3c59x mii i2c_amd756 amd_rng nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp nf_conntrack_proto_gre nf_nat
nf_conntrack raid456 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 ata_generic pata_acpi sata_sil24 pata_amd
floppy nouveau mxm_wmi wmi video i2c_algo_bit drm_kms_helper ttm drm i2c_core
[  549.123989] Pid: 0, comm: swapper/1 Not tainted 3.9.10-100.fc17.i686 #1
[  549.123989] Call Trace:
[  549.123989]  [<c043f6f9>] warn_slowpath_common+0x69/0x90
[  549.123989]  [<c04bdccc>] ? watchdog_overflow_callback+0xac/0xd0
[  549.123989]  [<c04bdccc>] ? watchdog_overflow_callback+0xac/0xd0
[  549.123989]  [<c04bdc20>] ? touch_nmi_watchdog+0x70/0x70
[  549.123989]  [<c043f7c3>] warn_slowpath_fmt+0x33/0x40
[  549.123989]  [<c04bdccc>] watchdog_overflow_callback+0xac/0xd0
[  549.123989]  [<c04f0db6>] __perf_event_overflow+0xa6/0x270
[  549.123989]  [<c0410b8f>] ? x86_perf_event_set_period+0x11f/0x230
[  549.123989]  [<c04f1925>] perf_event_overflow+0x15/0x20
[  549.123989]  [<c041120e>] x86_pmu_handle_irq+0x10e/0x150
[  549.123989]  [<c06d3cb2>] ? bit_cursor+0x512/0x550
[  549.123989]  [<c06cf72f>] ? fbcon_clear+0x1bf/0x1f0
[  549.123989]  [<c09856cb>] perf_event_nmi_handler+0x1b/0x20
[  549.123989]  [<c0984ea9>] nmi_handle.isra.0+0x39/0x60
[  549.123989]  [<c09850af>] do_nmi+0x1df/0x3f0
[  549.123989]  [<c09874c1>] ? __atomic_notifier_call_chain+0x21/0x30
[  549.123989]  [<c09845e7>] nmi_stack_correct+0x2f/0x34
[  549.123989]  [<c0429d51>] ? native_apic_mem_read+0x11/0x20
[  549.123989]  [<c0425743>] ? native_apic_wait_icr_idle+0x23/0x30
[  549.123989]  [<c0406ac4>] arch_irq_work_raise+0x34/0x40
[  549.123989]  [<c04e9e75>] irq_work_queue+0xa5/0xb0
[  549.123989]  [<c044149d>] wake_up_klogd+0x2d/0x30
[  549.123989]  [<c04417dd>] console_unlock+0x33d/0x450
[  549.123989]  [<c0441ce5>] vprintk_emit+0x1f5/0x4e0
[  549.123989]  [<c068859a>] ? vsnprintf+0x2ca/0x3d0
[  549.123989]  [<c097b56a>] printk+0x4d/0x4f
[  549.123989]  [<f7f910e8>] sb_close+0x28/0x50 [xt_LOG]
[  549.123989]  [<f7f91fb6>] ipt_log_packet+0x126/0x180 [xt_LOG]
[  549.123989]  [<f7f92a6c>] log_tg+0x7c/0xcc [xt_LOG]
[  549.123989]  [<c090ba60>] ipt_do_table+0x320/0x710
[  549.123989]  [<c08ff084>] ? fib_table_lookup+0x2a4/0x350
[  549.123989]  [<c090cb83>] iptable_filter_hook+0x43/0x80
[  549.123989]  [<c08bb51a>] nf_iterate+0x6a/0x90
[  549.123989]  [<c08c4640>] ? ip_frag_mem+0x50/0x50
[  549.123989]  [<c08bb59c>] nf_hook_slow+0x5c/0x100
[  549.123989]  [<c08c4640>] ? ip_frag_mem+0x50/0x50
[  549.123989]  [<c08c4a65>] ip_forward+0x385/0x3b0
[  549.123989]  [<c08c4640>] ? ip_frag_mem+0x50/0x50
[  549.123989]  [<c08c2ae0>] ip_rcv_finish+0x60/0x320
[  549.123989]  [<c08c33ac>] ip_rcv+0x24c/0x370
[  549.123989]  [<c08c2a80>] ? inet_add_protocol+0x50/0x50
[  549.123989]  [<c0893d3b>] __netif_receive_skb_core+0x55b/0x6d0
[  549.123989]  [<c0893ecd>] __netif_receive_skb+0x1d/0x60
[  549.123989]  [<c089407e>] netif_receive_skb+0x2e/0x90
[  549.123989]  [<c089491f>] napi_gro_receive+0x7f/0xb0
[  549.123989]  [<f7ef24b7>] e1000_receive_skb+0x57/0x70 [e1000]
[  549.123989]  [<f7ef40f4>] e1000_clean_rx_irq+0x214/0x430 [e1000]
[  549.123989]  [<c07781cf>] ? scsi_request_fn+0x9f/0x4b0
[  549.123989]  [<c065e247>] ? __freed_request+0x87/0x90
[  549.123989]  [<f7ef53c6>] e1000_clean+0x1d6/0x850 [e1000]
[  549.123989]  [<c0753484>] ? put_device+0x14/0x20
[  549.123989]  [<c077900d>] ? scsi_next_command+0x3d/0x50
[  549.123989]  [<c0779222>] ? scsi_io_completion+0x1b2/0x650
[  549.123989]  [<c0778d26>] ? scsi_device_unbusy+0x76/0xa0
[  549.123989]  [<c048a1fe>] ? ktime_get+0x5e/0x100
[  549.123989]  [<c089467d>] net_rx_action+0x11d/0x1f0
[  549.123989]  [<c0447763>] __do_softirq+0xc3/0x1f0
[  549.123989]  [<c047150d>] ? sched_clock_idle_wakeup_event+0x1d/0x20
[  549.123989]  [<c04479f5>] irq_exit+0x95/0xa0
[  549.123989]  [<c0404c2b>] do_IRQ+0x4b/0xc0
[  549.123989]  [<c098b333>] common_interrupt+0x33/0x38
[  549.123989]  [<c042fad5>] ? native_safe_halt+0x5/0x10
[  549.123989]  [<c040a027>] default_idle+0x37/0xe0
[  549.123989]  [<c040a9c6>] cpu_idle+0xb6/0xe0
[  549.123989]  [<c09757f8>] start_secondary+0x262/0x267
[  549.123989] ---[ end trace 6b5a14a0a996da62 ]---
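
For reading a trace like the one above, it can help to drop the frames marked "?" (addresses found on the stack that the unwinder could not confirm) and keep only the others. The following is a minimal sketch that assumes the line layout seen in this trace; the regular expression and the function name reliable_frames are illustrative, not part of any kernel tool.

```python
import re

# Matches e.g. "[  549.123989]  [<f7f92a6c>] ? log_tg+0x7c/0xcc [xt_LOG]"
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def reliable_frames(trace_text):
    """Return (symbol, module) pairs for frames not marked with '?'."""
    frames = []
    for line in trace_text.splitlines():
        m = FRAME_RE.search(line)
        if m and not m.group("unreliable"):
            frames.append((m.group("symbol"), m.group("module")))
    return frames


# A few lines copied from the trace above.
SAMPLE = """\
[  549.123989]  [<c0429d51>] ? native_apic_mem_read+0x11/0x20
[  549.123989]  [<c0406ac4>] arch_irq_work_raise+0x34/0x40
[  549.123989]  [<c097b56a>] printk+0x4d/0x4f
[  549.123989]  [<f7f92a6c>] log_tg+0x7c/0xcc [xt_LOG]
[  549.123989]  [<c090ba60>] ipt_do_table+0x320/0x710
"""

if __name__ == "__main__":
    for symbol, module in reliable_frames(SAMPLE):
        print(symbol, f"[{module}]" if module else "")
```

Applied to the full trace and read from the bottom up, the unmarked frames go from the e1000 receive path through ip_forward and ipt_do_table into xt_LOG, then printk, console_unlock and arch_irq_work_raise.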

Comment 1 Trevor Cordes 2013-09-26 11:03:53 UTC
I took out the "acpi=off noapic nolapic pnpacpi=off pci=noacpi" options one by one trying to see which was actually required to make the system not crash again.  I now have *all* of them out and the system is not crashing!  I think a kernel update in the interim has fixed whatever the issue was.

The kernel used now that is not crashing is: 3.10.9-200.fc19.i686

Closing this bug.

Comment 2 Trevor Cordes 2013-09-30 12:32:04 UTC
Darn, the box just crashed again yesterday.  I had recently rebooted into kernel-3.11.1-200.fc19.i686 for the first time in weeks (no acpi/apic options) and now it crashed.  Either the bug was put back in between 3.10.9-200.fc19.i686 and 3.11.1-200.fc19.i686, or it just took a long time to hit again.  In either case, somewhat strange.

I am working on getting a copy of this crash's stack trace to paste in the bz.

Comment 3 Justin M. Forbes 2013-10-18 21:06:38 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 18 kernel bugs.

Fedora 18 has now been rebased to 3.11.4-101.fc18.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 19, and are still experiencing this issue, please change the version to Fedora 19.

If you experience different issues, please open a new bug report for those.

Comment 4 Michele Baldessari 2013-11-11 21:38:14 UTC
Hi Trevor,

have you ever been able to get a stack trace with 3.11.x ?

thanks,
Michele

Comment 5 Trevor Cordes 2013-11-13 09:02:41 UTC
Here is a manual transcript of the oops trace:

? leave_mm+0x80/0x50
smp_call_function_single+0xa4/0x130
? leave_mm+0x50/0x50
smp_call_function_many+0x1ec/0x220
? leave_mm+0x50/0x50
native_flush_tlb_others+0x2b/0x30
flush_tlb_page+0x48/0x90
ptep_clear_flush+0x40/0x50
do_wp_page+0x223/0x7d0
handle_pte_fault+0x315/0x8d0
handle_mm_fault+0xb5/0x110
__do_page_fault+0x189/0x4d0
? task_stopped_code+0x50/0x50
? __do_page_fault+0x4d0/0x4d0
? do_page_fault+0xd/0x10
error_code+0x67/0x6c

Sorry for any typos.  I can provide more info (I have a pic of it on my phone) if required.
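
Since the trace was typed in by hand, one cheap check is that in each symbol+0xOFFSET/0xSIZE entry the offset does not exceed the function size. The sketch below is illustrative only (the function name suspicious_entries is made up) and flags entries that fail that check; it cannot tell what the correct value was.

```python
import re

# symbol+0xOFFSET/0xSIZE, as printed in kernel call traces
ENTRY_RE = re.compile(
    r"(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-fA-F]+)/0x(?P<size>[0-9a-fA-F]+)"
)


def suspicious_entries(lines):
    """Return entries whose offset is larger than the function size."""
    flagged = []
    for line in lines:
        m = ENTRY_RE.search(line)
        if m is None:
            continue
        offset = int(m.group("offset"), 16)
        size = int(m.group("size"), 16)
        if offset > size:
            flagged.append(m.group(0))
    return flagged


# A few lines copied from the transcript above.
TRANSCRIPT = [
    "? leave_mm+0x80/0x50",
    "smp_call_function_single+0xa4/0x130",
    "? leave_mm+0x50/0x50",
    "flush_tlb_page+0x48/0x90",
]

if __name__ == "__main__":
    for entry in suspicious_entries(TRANSCRIPT):
        print("check against the photo:", entry)
```

On these lines only the first leave_mm entry is flagged, so that is the one worth comparing against the photo.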

Since the last report the computer has died completely; it is either the mobo or the CPUs.  No visible bad caps or other obvious problems.  Probably just the ol' it's-AMD-and-it-got-too-old death.  This whole bug might therefore be nothing more than hardware dying, and I would be fine with closing it as such, especially if the traces don't seem related and/or useful.  I replaced the motherboard/CPU with a newer model and everything has worked fine for a month now.

Comment 6 Michele Baldessari 2013-11-13 10:59:14 UTC
Hi Trevor,

given the completely different traces I'd tend to point fingers to HW this time
around (ram, cpu,...). So I'd vote for closing it at this point.

regards,
Michele

Comment 7 Josh Boyer 2013-11-13 14:43:30 UTC
Ok.  Thanks for letting us know.  We appreciate the heads up that it was likely hardware related.