Bug 586967

Summary: RHEL6: x86 32-bit, nmi_watchdog_default() is __init, but called on resume
Product: Red Hat Enterprise Linux 6
Component: kernel
Version: 6.0
Hardware: i686
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: low
Target Milestone: rc
Target Release: ---
Reporter: Prarit Bhargava <prarit>
Assignee: Prarit Bhargava <prarit>
QA Contact: Jan Tluka <jtluka>
CC: airlied, amarecek, azelinka, bnagendr, emcnabb, frank.arnold, jbroman, joshkayse, jturner, lee, lkundrak, mishu, notting, ptekwork, rvokal, shuang, ypu
Doc Type: Bug Fix
Last Closed: 2010-11-11 16:15:45 UTC
Attachments:
Description                   Flags
Initial RHEL6 fix from AMD    none
RHEL6 fix for this issue      none

Description Prarit Bhargava 2010-04-28 15:10:42 UTC
From BZ 567601,

Description of problem:
After updating to kernel-2.6.32-17.el6, the system appears to suspend correctly
(powers down and the power LED begins to slowly throb) and I'm able to trigger
a resume (power LED returns to green, some disk activity) but the system never
fully resumes.  I never get a display and it doesn't appear that networking
resumes.  

Version-Release number of selected component (if applicable):
2.6.32-17.el6

How reproducible:
Always

Steps to Reproduce:
1. Suspend the machine
2. Attempt to resume

Actual results:
Black screen, no networking, system appears to be frozen requiring a hard
power-cycle.

Expected results:


Additional info:
Booting 2.6.32-16.el6 instead allows suspend/resume to work.

Comment 1 Prarit Bhargava 2010-04-28 15:13:38 UTC
Bhavna alerted us to this -- good catch Bhavna!

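The issue is that in -19.el6 only the x86_64 case was changed to __cpuinit. All others are left as __init, so the function's memory is discarded once boot completes. That causes trouble during CPU hotplug on 32-bit, and resume brings the secondary CPUs back up through the hotplug path.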

I'll check the common code to make sure that there aren't any other __init/__cpuinit pitfalls before submitting to RHKL.

P.

Comment 2 Prarit Bhargava 2010-04-28 15:14:57 UTC
Created attachment 409879
Initial RHEL6 fix from AMD

Thanks for the patch, Frank.

P.

Comment 3 Prarit Bhargava 2010-04-28 15:31:40 UTC
Patch looks good, and a build with CONFIG_DEBUG_SECTION_MISMATCH=y didn't show anything else related to this code path.
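
For anyone repeating that check, here is a minimal sketch of filtering a saved build log for modpost's "Section mismatch" warnings. The helper name section_mismatches and the log file name are hypothetical, and the function only greps text; it does not run the build.

```shell
#!/bin/sh
# Hypothetical helper: list "Section mismatch" warnings found in a saved
# kernel build log, optionally narrowed to lines mentioning one symbol.
# Usage: section_mismatches BUILD_LOG [SYMBOL]
section_mismatches() {
    log=$1
    symbol=${2:-}
    if [ -n "$symbol" ]; then
        # -F treats the symbol as a fixed string, not a regular expression
        grep 'Section mismatch' "$log" | grep -F -- "$symbol"
    else
        grep 'Section mismatch' "$log"
    fi
}
```

A possible use, assuming the build output was captured with something like `make CONFIG_DEBUG_SECTION_MISMATCH=y 2>&1 | tee build.log`: run `section_mismatches build.log nmi_watchdog_default`. No output means the log has no mismatch warning that mentions that symbol.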

Will post to RHKL shortly.

P.

Comment 4 Prarit Bhargava 2010-04-28 17:43:48 UTC
Created attachment 409914
RHEL6 fix for this issue

Comment 5 Prarit Bhargava 2010-04-29 14:22:43 UTC
*** Bug 585003 has been marked as a duplicate of this bug. ***

Comment 6 Prarit Bhargava 2010-04-29 18:36:32 UTC
*** Bug 581749 has been marked as a duplicate of this bug. ***

Comment 7 Prarit Bhargava 2010-04-29 18:42:01 UTC
*** Bug 582129 has been marked as a duplicate of this bug. ***

Comment 8 Prarit Bhargava 2010-04-30 13:33:55 UTC
*** Bug 585766 has been marked as a duplicate of this bug. ***

Comment 9 Prarit Bhargava 2010-04-30 13:47:41 UTC
*** Bug 586164 has been marked as a duplicate of this bug. ***

Comment 10 Prarit Bhargava 2010-04-30 14:17:39 UTC
*** Bug 586776 has been marked as a duplicate of this bug. ***

Comment 11 Prarit Bhargava 2010-04-30 14:23:37 UTC
*** Bug 586830 has been marked as a duplicate of this bug. ***

Comment 12 Josh 2010-05-03 02:26:36 UTC
Would this prevent the brightness controller from working after a resume?

Comment 13 Frank Arnold 2010-05-03 11:29:16 UTC
(In reply to comment #12)
> Would this prevent the brightness controller from working after a resume?    

No brightness controller involved here.

With this bug, 32-bit systems shouldn't resume at all. The issue was introduced with 2.6.32-17.el6, x86_64 was fixed with 2.6.32-18.el6, and the fix for all other cases is still pending (the problem persists in 2.6.32-22.el6).

An easy way to trigger this issue (as root):

# echo 0 > /sys/devices/system/cpu/cpu1/online
# echo 1 > /sys/devices/system/cpu/cpu1/online    <-- box should hang here
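
The same offline/online cycle can be wrapped in a small function, sketched below. The function name cycle_cpu and the CPU_SYSFS_ROOT override are assumptions added here so the logic can be dry-run against a scratch directory; only the sysfs path itself comes from the commands above.

```shell
#!/bin/sh
# Hypothetical helper: take one CPU offline and bring it back online via sysfs.
# CPU_SYSFS_ROOT may be pointed at a scratch directory for a dry run.
# Usage: cycle_cpu CPU_NUMBER
cycle_cpu() {
    cpu=$1
    root=${CPU_SYSFS_ROOT:-/sys/devices/system/cpu}
    ctl="$root/cpu$cpu/online"
    if [ ! -w "$ctl" ]; then
        echo "cannot write $ctl (need root, or this CPU has no online control)" >&2
        return 1
    fi
    echo 0 > "$ctl" || return 1   # offline the CPU
    echo 1 > "$ctl" || return 1   # online again; an affected kernel fails here
    echo "cpu$cpu back online: $(cat "$ctl")"
}
```

On a real system this would be run as root with `cycle_cpu 1`. On an affected kernel the second write is where the hang, panic, or write error is expected, so it is worth saving any open work first.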

Comment 14 Josh 2010-05-03 12:53:21 UTC
I triggered the kernel panic by following the echo instructions, and while the kernel does panic, the system does not hang.  I am using 2.6.32-19.el6 and my bug was marked as a duplicate of this.  The brightness controller does continue to work despite the kernel panic.  Should I ask for my bug to be re-opened?

Thanks,
-josh

Comment 15 Josh 2010-05-03 12:57:12 UTC
(In reply to comment #14)
> I triggered the kernel panic by following the echo instructions, and while the
> kernel does panic, the system does not hang.  I am using 2.6.32-19.el6 and my
> bug was marked as a duplicate of this.  The brightness controller does continue
> to work despite the kernel panic.  Should I ask for my bug to be re-opened?
> 
> Thanks,
> -josh    

I forgot to mention that when I tested the echo commands I got:

# echo 1 > /sys/devices/system/cpu/cpu1/online 
-bash: echo: write error: Invalid argument

Comment 16 Frank Arnold 2010-05-03 13:28:21 UTC
(In reply to comment #14)
> Should I ask for my bug to be re-opened?

No. Your trace looks like the ones in the other duplicates.

Comment 17 Prarit Bhargava 2010-05-03 13:32:51 UTC
(In reply to comment #15)
> (In reply to comment #14)
> > I triggered the kernel panic by following the echo instructions, and while the
> > kernel does panic, the system does not hang.  I am using 2.6.32-19.el6 and my
> > bug was marked as a duplicate of this.  The brightness controller does continue
> > to work despite the kernel panic.  Should I ask for my bug to be re-opened?
> > 
> > Thanks,
> > -josh    
> 
> I forgot to mention that when I tested the echo commands I got:
> 
> # echo 1 > /sys/devices/system/cpu/cpu1/online 
> -bash: echo: write error: Invalid argument    

Josh, I see two distinct failures.  The first is an actual *oops*, which hangs the system, and the second is a BUG warning due to scheduling while atomic, which sometimes allows the system to continue executing.

I have traced both of these failures to this BZ.

The *critical* portion of either the BUG warning or the oops is these three lines:

[<c0a49479>] ? nmi_cpu_busy+0x0/0x17
[<c080203e>] ? end_local_APIC_setup+0xd3/0xea
[<c08018ca>] ? start_secondary+0x102/0x24e

end_local_APIC_setup() does NOT call nmi_cpu_busy().  That is the unwinder going a bit crazy trying to determine which function has been called.  end_local_APIC_setup() has actually called nmi_watchdog_default(), which is __init and is not in the function table.
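
To make the symbol+offset/size notation in those three lines easier to read, here is a rough sketch that splits a trace line into its parts and computes where the named symbol starts. The function name decode_trace_line is hypothetical, and the sed pattern only covers lines shaped like the ones quoted above.

```shell
#!/bin/sh
# Hypothetical helper: decode one call-trace line of the form
#   [<c080203e>] ? end_local_APIC_setup+0xd3/0xea
# and print the symbol, address, offset, size, and computed symbol start.
decode_trace_line() {
    echo "$1" |
    sed -n 's/^ *\[<\([0-9a-f]*\)>\] ?* *\([A-Za-z0-9_.]*\)+\(0x[0-9a-f]*\)\/\(0x[0-9a-f]*\).*$/\2 \1 \3 \4/p' |
    while read -r sym addr off size; do
        # symbol start = reported address minus offset into the symbol
        start=$(printf '%x' $((0x$addr - $off)))
        echo "$sym addr=$addr offset=$off size=$size start=$start"
    done
}
```

For the nmi_cpu_busy line the offset is 0x0, so the reported address is the very first byte of that symbol. That fits the explanation above that the name is just the nearest label the unwinder could attach to the address, not a function that end_local_APIC_setup() actually called.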

P.

Comment 18 Frank Arnold 2010-05-03 13:52:30 UTC
To add some testing data to Prarit's explanations:

I tried it the suspend/resume way on one of our boxes, which still had the needed bits installed anyway.

1. With a kernel based on 2.6.32-19.el6, including the attached patch
   * Did an `echo mem > /sys/power/state`
   * Let the box resume
   * Looked at the output of dmesg: No failures.

2. With a plain 2.6.32-19.el6
   * Did an `echo mem > /sys/power/state`
   * Let the box resume again
   * Resulted in a lot of trouble, including the following trace:

   Kernel panic - not syncing: Fatal exception
   Pid: 0, comm: swapper Tainted: G      D    2.6.32-19.el6.i686 #1
   Call Trace:
    [<c08055d5>] ? panic+0x42/0xed
    [<c0808bfc>] ? oops_end+0xbc/0xd0
    [<c080831e>] ? do_int3+0x6e/0x90
    [<c0808184>] ? int3+0x30/0x38
    [<c0a49479>] ? nmi_cpu_busy+0x0/0x17
    [<c080203e>] ? end_local_APIC_setup+0xd3/0xea
    [<c08018ca>] ? start_secondary+0x102/0x24e
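
The "look at dmesg" step in the test above can be partly automated. The sketch below scans a saved dmesg capture for a few failure signatures; the function name resume_log_clean and the pattern list are assumptions chosen to match the messages quoted in this report, not an exhaustive set.

```shell
#!/bin/sh
# Hypothetical helper: scan a saved dmesg capture for failure signatures.
# Prints any matching lines; returns 1 if something matched, 0 if clean.
# Usage: resume_log_clean DMESG_FILE
resume_log_clean() {
    log=$1
    if grep -E 'Kernel panic|BUG: |Oops' "$log"; then
        return 1
    fi
    return 0
}
```

After a successful resume, one could run `dmesg > after-resume.log` followed by `resume_log_clean after-resume.log && echo "no failures"`. On the unpatched kernel the box may never get far enough to capture the log, so this mainly helps confirm the patched case.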

Comment 19 Prarit Bhargava 2010-05-03 17:12:36 UTC
   Kernel panic - not syncing: Fatal exception
   Pid: 0, comm: swapper Tainted: G      D    2.6.32-19.el6.i686 #1
   Call Trace:
    [<c08055d5>] ? panic+0x42/0xed
    [<c0808bfc>] ? oops_end+0xbc/0xd0
    [<c080831e>] ? do_int3+0x6e/0x90
    [<c0808184>] ? int3+0x30/0x38
    [<c0a49479>] ? nmi_cpu_busy+0x0/0x17
    [<c080203e>] ? end_local_APIC_setup+0xd3/0xea
    [<c08018ca>] ? start_secondary+0x102/0x24e    

That's a nice panic :)  end_local_APIC_setup(), as mentioned, does not call nmi_cpu_busy(); it is actually calling nmi_watchdog_default(), which now resolves to int3 (0xcc), the byte that freed init memory is filled with.

P.

Comment 20 Aristeu Rozanski 2010-05-04 14:43:34 UTC
Patch(es) available on kernel-2.6.32-24.el6

Comment 22 Ales Zelinka 2010-05-05 09:12:15 UTC
*** Bug 588663 has been marked as a duplicate of this bug. ***

Comment 24 Eric Sandeen 2010-05-07 15:38:52 UTC
*** Bug 587509 has been marked as a duplicate of this bug. ***

Comment 25 Prarit Bhargava 2010-05-10 15:42:23 UTC
*** Bug 590408 has been marked as a duplicate of this bug. ***

Comment 26 Don Zickus 2010-05-17 18:15:11 UTC
*** Bug 591138 has been marked as a duplicate of this bug. ***

Comment 27 Don Zickus 2010-05-17 18:55:33 UTC
*** Bug 592348 has been marked as a duplicate of this bug. ***

Comment 33 releng-rhel@redhat.com 2010-11-11 16:15:45 UTC
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.