Bug 954252

Summary: Kernel soft lockup during reboot on Fedora 19 pre-alpha
Product: [Fedora] Fedora
Reporter: IBM Bug Proxy <bugproxy>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 19
CC: gansalmon, gustavold, itamar, jkachuck, jonathan, kernel-maint, madhu.chinakonda, wgomerin
Target Milestone: ---   
Target Release: ---   
Hardware: ppc64   
OS: All   
Whiteboard:
Fixed In Version: kernel-3.9.5-301.fc19
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-06-14 04:50:25 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 920770    
Attachments:
Description | Flags
reboot log | none
dmesg (after lpar restart) | none
/var/log/messages | none
nosmp | none
maxcpus=1 | none
irqpoll | none
1 of 2 patch: fix_config_restore_after_eeh (against 3.9-y stable kernel) | none
2 of 2: driver_lockdep_patch (against 3.9-y stable kernel) | none
series file | none
reboot log for 3.9.3-301 + patches | none

Description IBM Bug Proxy 2013-04-22 05:01:03 UTC
== Comment: #0 - Gustavo Luiz Duarte <gusld.com> - 2013-04-17 16:23:00 ==
Description of problem:
After executing the reboot command on Fedora 19 pre-alpha, the kernel prints several soft lockup messages and never completes the reboot. The full log is attached.
Example of soft lockup error:

[70751.348666] Disabling non-boot CPUs ...
[70776.624522] BUG: soft lockup - CPU#0 stuck for 22s! [migration/0:8]

[70776.624554] Modules linked in: binfmt_misc ipt_MASQUERADE nf_conntrack_netbios_ns nf_conntrack_broadcast ip6table_nat nf_nat_ipv6 ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_filter ebtables ip6table_filter ip6_tables tg3 ses enclosure ptp pps_core shpchp ipr
[70776.624854] irq event stamp: 922
[70776.624868] hardirqs last  enabled at (921): [<c0000000008d5f24>] ._raw_spin_unlock_irqrestore+0x54/0xd0
[70776.624909] hardirqs last disabled at (922): [<c0000000008d5c58>] ._raw_spin_lock_irq+0x38/0xd0
[70776.624938] softirqs last  enabled at (0): [<c000000000090070>] .copy_process.part.25+0x4c0/0x1320
[70776.624970] softirqs last disabled at (0): [<          (null)>]           (null)
[70776.624997] NIP: c00000000015c9a8 LR: c00000000015c7f4 CTR: c00000000015c8d0
[70776.625020] REGS: c0000003d8be37b0 TRAP: 0901   Not tainted  (3.9.0-0.rc4.git0.1.fc19.ppc64p7)
[70776.625040] MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI>  CR: 28002082  XER: 20000000
[70776.625109] SOFTE: 1
[70776.625121] CFAR: c00000000015c9c8
[70776.625138] TASK = c0000003d8b74fa0[8] 'migration/0' THREAD: c0000003d8be0000 CPU: 0
GPR00: c00000000015c7f4 c0000003d8be3a30 c0000000015c4370 c0000007a469ba20 
GPR04: 0000000000000001 c00000000015c7b8 c0000000014b4370 c000000001484370 
GPR08: c000000001634370 0000000000000001 0000000000000000 0000000000000000 
GPR12: 0000000024000028 c00000000ed90000 
[70776.625307] NIP [c00000000015c9a8] .stop_machine_cpu_stop+0xd8/0x250
[70776.625330] LR [c00000000015c7f4] .cpu_stopper_thread+0xb4/0x190
[70776.625346] Call Trace:
[70776.625370] [c0000003d8be3a30] [c0000003d8be3ad0] 0xc0000003d8be3ad0 (unreliable)
[70776.625403] [c0000003d8be3ad0] [c00000000015c7f4] .cpu_stopper_thread+0xb4/0x190
[70776.625431] [c0000003d8be3c00] [c0000000000de3e4] .smpboot_thread_fn+0x2a4/0x340
[70776.625458] [c0000003d8be3cb0] [c0000000000cfd00] .kthread+0xf0/0x100
[70776.625486] [c0000003d8be3e30] [c000000000009f70] .ret_from_kernel_thread+0x64/0x74
[70776.625507] Instruction dump:
[70776.625524] 40c2fff4 7c0004ac 2f890000 7fe9fb78 409e0020 813e0020 815e0010 39290001 
[70776.625584] 915d0000 7c2004ac 913e0020 7fe9fb78 <2b890004> 419e0054 7c210b78 7c421378 

Version-Release number of selected component (if applicable):
kernel-3.9.0-0.rc4.git0.1.fc19.ppc64

How reproducible:
Always

Steps to Reproduce:
1. Execute the reboot command on a Fedora 19 system
  
Actual results:
Kernel gets stuck printing soft lockup error messages
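
Note: when triaging a long console capture like the attached reboot log, it can help to pull out just the soft lockup events. The following is a minimal Python sketch, assuming the message format shown in the example above. The function name find_soft_lockups and the regular expression are illustrative and are not part of any kernel tooling.

```python
import re

# Matches lines such as:
#   [70776.624522] BUG: soft lockup - CPU#0 stuck for 22s! [migration/0:8]
# The pattern is an assumption based on the message quoted in this report.
SOFT_LOCKUP = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+BUG: soft lockup - CPU#(?P<cpu>\d+) "
    r"stuck for (?P<secs>\d+)s! \[(?P<task>.+):(?P<pid>\d+)\]"
)


def find_soft_lockups(log_text):
    """Return one dict per soft lockup line found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = SOFT_LOCKUP.search(line)
        if match:
            events.append({
                "timestamp": float(match.group("ts")),
                "cpu": int(match.group("cpu")),
                "seconds": int(match.group("secs")),
                "task": match.group("task"),
                "pid": int(match.group("pid")),
            })
    return events
```

Passing the text of a saved console log to find_soft_lockups() gives a list that can be grouped by CPU or task to see which threads are stuck.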

== Comment: #1 - Gustavo Luiz Duarte <gusld.com> - 2013-04-17 16:26:00 ==
Console output of a reboot attempt.

== Comment: #3 - Gustavo Luiz Duarte <gusld.com> - 2013-04-18 15:32:20 ==
This is the output of dmesg after restarting the lpar from HMC (as the reboot command crashed the system).

== Comment: #4 - Gustavo Luiz Duarte <gusld.com> - 2013-04-18 15:33:47 ==
This is the content of /var/log/messages after restarting the lpar from HMC (as the reboot command crashed the system)

== Comment: #6 - Gustavo Luiz Duarte <gusld.com> - 2013-04-18 15:39:27 ==
Note that dmesg and messages attachments are from a slightly newer kernel (kernel-3.9.0-0.rc6.git2.4.fc19.ppc64p7) that has the same issue.

Comment 1 IBM Bug Proxy 2013-04-22 05:01:21 UTC
Created attachment 738429 [details]
reboot log

Comment 2 IBM Bug Proxy 2013-04-22 05:01:31 UTC
Created attachment 738430 [details]
dmesg (after lpar restart)

Comment 3 IBM Bug Proxy 2013-04-22 05:01:41 UTC
Created attachment 738431 [details]
/var/log/messages

Comment 4 IBM Bug Proxy 2013-04-23 07:30:57 UTC
------- Comment From hbabu.com 2013-04-23 07:20 EDT-------
Rebooting.
[12658.386814] irq 18: nobody cared (try booting with the "irqpoll" option)
[12658.386824] Call Trace:
[12658.386831] [c0000007affbfb90] [c000000000015bf0] .show_stack+0x130/0x200 (unreliable)
[12658.386839] [c0000007affbfc60] [c0000000001671e8] .__report_bad_irq+0x58/0x150
[12658.386844] [c0000007affbfd00] [c0000000001678c8] .note_interrupt+0x218/0x300
[12658.386849] [c0000007affbfdb0] [c000000000163f24] .handle_irq_event_percpu+0x184/0x310
[12658.386854] [c0000007affbfe90] [c000000000164110] .handle_irq_event+0x60/0xb0
[12658.386859] [c0000007affbff10] [c000000000168e74] .handle_fasteoi_irq+0xd4/0x1f0
[12658.386865] [c0000007affbff90] [c000000000023a84] .call_handle_irq+0x1c/0x2c
[12658.386870] [c0000000012d3770] [c000000000010f44] .do_IRQ+0x244/0x2c0
[12658.386875] [c0000000012d3820] [c000000000002364] hardware_interrupt_common+0x164/0x180
[12658.386882] --- Exception: 501 at .plpar_hcall_norets+0x84/0xd4
[12658.386882]     LR = .check_and_cede_processor+0x2c/0x40
[12658.386888] [c0000000012d3b10] [c000000000082218] .check_and_cede_processor+0x18/0x40 (unreliable)
[12658.386894] [c0000000012d3b80] [c0000000000822c8] .dedicated_cede_loop+0x88/0x150
[12658.386901] [c0000000012d3c40] [c00000000069a30c] .cpuidle_enter+0x2c/0x40
[12658.386906] [c0000000012d3cb0] [c00000000069ad28] .cpuidle_idle_call+0xf8/0x310
[12658.386911] [c0000000012d3d60] [c000000000072358] .pSeries_idle+0x18/0x40
[12658.386916] [c0000000012d3dd0] [c000000000017b58] .cpu_idle+0x168/0x2b0
[12658.386921] [c0000000012d3e80] [c00000000000bbd8] .rest_init+0x98/0xb0
[12658.386926] [c0000000012d3ef0] [c000000000b6470c] .start_kernel+0x4b4/0x4d0
[12658.386931] [c0000000012d3f90] [c000000000009d20] .start_here_common+0x20/0x80
[12658.386935] handlers:
[12658.386939] [<c0000000012b2ca8>] .usb_hcd_irq
[12658.386943] Disabling IRQ #18

We are seeing IRQs being dropped for the USB HCD, and other CPUs are hitting soft lockups.

https://lkml.org/lkml/2013/2/9/167 shows a similar issue.

Brian, can someone on your team look at this issue?

Comment 5 IBM Bug Proxy 2013-04-23 14:53:42 UTC
------- Comment From thadeul.com 2013-04-23 13:32 EDT-------
This seems to be a bug in tg3 that is triggered by the same set of patches that affected USB.

To find out which one is really causing the issue, can you remove one card and retest, then repeat with the other card removed instead?

Thanks.
Cascardo.

Comment 6 IBM Bug Proxy 2013-04-23 19:50:55 UTC
------- Comment From gusld.com 2013-04-23 19:42 EDT-------
I removed the tg3 card from the profile using HMC but the reboot issue remains. I also tried removing the USB device and the reboot issue still remains.

Comment 7 IBM Bug Proxy 2013-05-07 13:41:10 UTC
------- Comment From gusld.com 2013-05-07 13:30 EDT-------
I'm still experiencing this issue with the latest kernel available on Fedora 19 (kernel-3.9.0-301.fc19). I tried both the ppc64 and ppc64p7 flavors and both have this issue.

Comment 8 IBM Bug Proxy 2013-05-07 13:51:38 UTC
Created attachment 744743 [details]
nosmp


------- Comment on attachment From gusld.com 2013-05-07 13:49 EDT-------


Attaching the results of passing nosmp on the boot command line, as requested by Thadeu.

Comment 9 IBM Bug Proxy 2013-05-07 14:02:26 UTC
Created attachment 744748 [details]
maxcpus=1


------- Comment on attachment From gusld.com 2013-05-07 13:51 EDT-------


Attaching the results passing maxcpus=1 to the boot command line, as requested by Thadeu.

Comment 10 IBM Bug Proxy 2013-05-07 14:11:53 UTC
------- Comment From thadeul.com 2013-05-07 14:02 EDT-------
I guess this will have no effect at all, but can you try booting with irqpoll?

Regards.
Thadeu Cascardo.

Comment 11 IBM Bug Proxy 2013-05-07 19:01:37 UTC
Created attachment 744868 [details]
irqpoll


------- Comment on attachment From gusld.com 2013-05-07 18:56 EDT-------


Attaching the results passing irqpoll to the boot command line, as requested by Thadeu.
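
Note: when comparing runs with different boot options (nosmp, maxcpus=1, irqpoll), it is worth confirming which options the running kernel actually received. The following is a minimal Python sketch that splits a kernel command line into parameters. The function name parse_cmdline is illustrative, and the sketch assumes no quoted values containing spaces.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a dict of parameters.

    Bare flags such as "nosmp" map to True; "key=value" tokens map to
    the value as a string. Quoted values with spaces are not handled.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else True
    return params
```

On a running Linux system the input can be read from /proc/cmdline, for example parse_cmdline(open("/proc/cmdline").read()).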

Comment 12 IBM Bug Proxy 2013-05-08 19:24:15 UTC
------- Comment From thadeul.com 2013-05-08 19:10 EDT-------
Hi, Gustavo.

Can you try mainline and, if it fails, do a bisect?

Regards.
Cascardo.

Comment 13 IBM Bug Proxy 2013-05-13 18:31:22 UTC
------- Comment From gusld.com 2013-05-13 18:23 EDT-------
I did a git bisect and there seem to be two different bugs. The irq messages seem to be unrelated to the soft-lockup issue. Old kernel versions (3.4.0) print the irq messages but don't trigger the soft-lockup.

My git bisect pointed to the following as the offending commit (the one that introduced the soft-lockup issue):
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=56d6aa33d3f68471466cb183d6e04b508dfb296f
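
Note: the idea behind the bisect above can be sketched as a binary search. The following Python example is a simplification, assuming a linear list of commits where the first is known good, the last is known bad, and every commit after the offending one is also bad. Real git bisect works on the commit graph rather than a flat list. The names first_bad and is_bad are illustrative.

```python
def first_bad(commits, is_bad):
    """Return the first commit for which is_bad(commit) is true.

    Assumes commits[0] is good, commits[-1] is bad, and that once a
    commit is bad every later commit is bad too.
    """
    lo, hi = 0, len(commits) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the offending commit is at mid or earlier
        else:
            lo = mid  # the offending commit is after mid
    return commits[hi]
```

Here is_bad stands in for the manual test at each step: build and boot that kernel, run reboot, and check whether the soft lockup appears. The number of kernels to test grows only with the logarithm of the number of commits in the range.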

Comment 14 IBM Bug Proxy 2013-05-14 15:21:32 UTC
------- Comment From gusld.com 2013-05-14 15:15 EDT-------
The following patch fixes this issue:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=96b04db9f2c16e77c31ef0e17e143da1e0cbfd78

Can we get it added to f19?

Comment 15 Gustavo Luiz Duarte 2013-05-14 15:38:02 UTC
Proposing as a Beta blocker per the release criterion "All release-blocking desktops' offered mechanisms (if any) for shutting down, logging out and rebooting must work"

Comment 16 Josh Boyer 2013-05-14 15:52:08 UTC
(In reply to comment #14)
> ------- Comment From gusld.com 2013-05-14 15:15 EDT-------
> The following patch fixes this issue:
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=96b04db9f2c16e77c31ef0e17e143da1e0cbfd78
> 
> Can we get it added to f19?

I would suggest you send it upstream and get it included in the 3.9.y stable kernel series.

Comment 17 IBM Bug Proxy 2013-05-15 16:03:09 UTC
------- Comment From bjking1.com 2013-05-15 15:55 EDT-------
I would have expected the following patch to make the warning go away on shutdown:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/scsi/ipr.c?id=bfae7820b87c61c5065338b55405b304d9890085

Comment 18 IBM Bug Proxy 2013-05-22 01:31:42 UTC
Created attachment 751455 [details]
1 of 2 patch: fix_config_restore_after_eeh (against 3.9-y stable kernel)


------- Comment (attachment only) From wenxiong.com 2013-05-22 01:27 EDT-------

Comment 19 IBM Bug Proxy 2013-05-22 01:31:58 UTC
Created attachment 751456 [details]
2 of 2: driver_lockdep_patch (against 3.9-y stable kernel)


------- Comment (attachment only) From wenxiong.com 2013-05-22 01:29 EDT-------

Comment 20 IBM Bug Proxy 2013-05-22 01:32:13 UTC
Created attachment 751457 [details]
series file


------- Comment (attachment only) From wenxiong.com 2013-05-22 01:30 EDT-------

Comment 21 IBM Bug Proxy 2013-05-22 16:41:00 UTC
Created attachment 751781 [details]
reboot log for 3.9.3-301 + patches


------- Comment on attachment From gusld.com 2013-05-22 15:52 EDT-------


Thanks Wendy!

I tested the latest F19 kernel (3.9.3-301.fc19.ppc64) with the 2 patches Wendy posted here. They fixed the reboot hang, though the kernel still prints irq messages during the reboot. I attached the full log of a reboot.

Comment 22 Josh Boyer 2013-06-10 14:16:18 UTC
The patches for this are all now included in the 3.9.5-300 release currently building.

Comment 23 Fedora Update System 2013-06-11 21:41:50 UTC
kernel-3.9.5-301.fc19 has been submitted as an update for Fedora 19.
https://admin.fedoraproject.org/updates/kernel-3.9.5-301.fc19

Comment 24 Fedora Update System 2013-06-12 19:13:03 UTC
Package kernel-3.9.5-301.fc19:
* should fix your issue,
* was pushed to the Fedora 19 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-3.9.5-301.fc19'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2013-10689/kernel-3.9.5-301.fc19
then log in and leave karma (feedback).

Comment 25 Fedora Update System 2013-06-14 04:50:25 UTC
kernel-3.9.5-301.fc19 has been pushed to the Fedora 19 stable repository.  If problems still persist, please make note of it in this bug report.
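
Note: to check whether a given machine is already running a kernel at or beyond the fixed version, the release string can be compared numerically. The following Python sketch is a rough approximation only: it compares the leading version and release numbers and is not a full RPM version comparison. The names kernel_key and has_fix are illustrative.

```python
import re


def kernel_key(release):
    """Extract (major, minor, patch, release) from a kernel release string.

    Only the leading numeric fields are used; suffixes such as
    ".fc19.ppc64" are ignored. This is a simplification, not RPM's
    version comparison algorithm.
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return tuple(int(part) for part in match.groups())


# Version listed in the "Fixed In Version" field of this bug.
FIXED = kernel_key("3.9.5-301.fc19")


def has_fix(release):
    """Return True if release is at or beyond the fixed version."""
    return kernel_key(release) >= FIXED
```

The running kernel's release string is available from platform.release() in the Python standard library, so has_fix(platform.release()) gives a quick check on a Fedora 19 system.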