Bug 586967 - RHEL6: x86 32-bit, nmi_watchdog_default() is __init, but called on resume
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.0
Hardware: i686 Linux
Priority: low
Severity: medium
Target Milestone: rc
Assigned To: Prarit Bhargava
QA Contact: Jan Tluka
Duplicates: 581749 582129 585003 585766 586164 586776 586830 587509 588663 590408 591138 592348
Reported: 2010-04-28 11:10 EDT by Prarit Bhargava
Modified: 2010-11-11 11:15 EST
Doc Type: Bug Fix
Last Closed: 2010-11-11 11:15:45 EST

Attachments
Initial RHEL6 fix from AMD (333 bytes, patch), 2010-04-28 11:14 EDT, Prarit Bhargava
RHEL6 fix for this issue (1.08 KB, patch), 2010-04-28 13:43 EDT, Prarit Bhargava

Description Prarit Bhargava 2010-04-28 11:10:42 EDT
From BZ 567601,

Description of problem:
After updating to kernel-2.6.32-17.el6, the system appears to suspend correctly
(powers down and the power LED begins to slowly throb) and I'm able to trigger
a resume (power LED returns to green, some disk activity) but the system never
fully resumes.  I never get a display and it doesn't appear that networking
resumes.  

Version-Release number of selected component (if applicable):
2.6.32-17.el6

How reproducible:
Always

Steps to Reproduce:
1. Suspend the machine
2. Attempt to resume
3.

Actual results:
Black screen, no networking, system appears to be frozen requiring a hard
power-cycle.

Expected results:


Additional info:
Booting 2.6.32-16.el6 instead results in the ability to suspend/resume.
Comment 1 Prarit Bhargava 2010-04-28 11:13:38 EDT
Bhavna alerted us to this -- good catch Bhavna!

The issue is that in -19.el6 only the x86_64 case was changed to __cpuinit. All other cases are still marked __init, so the function's memory is freed after boot, which causes trouble during CPU hotplug (and therefore resume) on 32-bit.

I'll check the common code to make sure that there aren't any other __init/__cpuinit pitfalls before submitting to RHKL.

P.
Comment 2 Prarit Bhargava 2010-04-28 11:14:57 EDT
Created attachment 409879 [details]
Initial RHEL6 fix from AMD

Thanks for the patch Frank.

P.
Comment 3 Prarit Bhargava 2010-04-28 11:31:40 EDT
Patch looks good, and a build with CONFIG_DEBUG_SECTION_MISMATCH=y didn't show anything else related to this code path.

Will post to RHKL shortly.

P.
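
The section-mismatch check mentioned in this comment runs at kernel build time. As a rough aid, the sketch below filters a saved build log for mismatch reports that mention a given symbol. It assumes only that each report sits on one line containing the phrase "Section mismatch"; the function name and the log file name are hypothetical.

```shell
#!/bin/sh
# Print section-mismatch reports from a build log (stdin) that mention SYMBOL.
# Assumption: each report is a single line containing "Section mismatch".
# Exit status is non-zero when no matching report is found.
section_mismatches() {
    symbol="$1"
    grep 'Section mismatch' | grep -F -- "$symbol"
}

# Example (hypothetical log file name):
#   section_mismatches nmi_watchdog_default < build.log
```

An empty result only says that the build log contains no report for that symbol; it does not prove the code path is free of __init/__cpuinit problems.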
Comment 4 Prarit Bhargava 2010-04-28 13:43:48 EDT
Created attachment 409914 [details]
RHEL6 fix for this issue
Comment 5 Prarit Bhargava 2010-04-29 10:22:43 EDT
*** Bug 585003 has been marked as a duplicate of this bug. ***
Comment 6 Prarit Bhargava 2010-04-29 14:36:32 EDT
*** Bug 581749 has been marked as a duplicate of this bug. ***
Comment 7 Prarit Bhargava 2010-04-29 14:42:01 EDT
*** Bug 582129 has been marked as a duplicate of this bug. ***
Comment 8 Prarit Bhargava 2010-04-30 09:33:55 EDT
*** Bug 585766 has been marked as a duplicate of this bug. ***
Comment 9 Prarit Bhargava 2010-04-30 09:47:41 EDT
*** Bug 586164 has been marked as a duplicate of this bug. ***
Comment 10 Prarit Bhargava 2010-04-30 10:17:39 EDT
*** Bug 586776 has been marked as a duplicate of this bug. ***
Comment 11 Prarit Bhargava 2010-04-30 10:23:37 EDT
*** Bug 586830 has been marked as a duplicate of this bug. ***
Comment 12 Josh 2010-05-02 22:26:36 EDT
Would this prevent the brightness controller from working after a resume?
Comment 13 Frank Arnold 2010-05-03 07:29:16 EDT
(In reply to comment #12)
> Would this prevent the brightness controller from working after a resume?    

No brightness controller involved here.

With this bug, a 32-bit system shouldn't resume at all. The issue was introduced with 2.6.32-17.el6, x86_64 was fixed with 2.6.32-18.el6, and the fix for all other cases is still pending (the problem persists with 2.6.32-22.el6).

Easy way to trigger this issue:

# echo 0 > /sys/devices/system/cpu/cpu1/online
# echo 1 > /sys/devices/system/cpu/cpu1/online    # <-- box should hang here
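
The two commands above can be wrapped in a small helper for repeated testing. This is only a sketch: the function name cpu_cycle and the SYSFS_ROOT override are hypothetical, and the override exists solely so the helper can be dry-run against a scratch directory. It must run as root, and on an affected kernel the second write is expected to hang or oops the box.

```shell
#!/bin/sh
# Take one CPU offline and bring it back online through sysfs.
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.
cpu_cycle() {
    cpu="$1"
    root="${SYSFS_ROOT:-/sys}"
    ctl="$root/devices/system/cpu/cpu$cpu/online"

    # Bail out if the control file is missing or not writable
    # (for example, when not running as root).
    if [ ! -w "$ctl" ]; then
        echo "cpu$cpu: $ctl is not writable" >&2
        return 1
    fi

    echo 0 > "$ctl" || return 1    # offline the CPU
    echo 1 > "$ctl" || return 1    # online again; an affected kernel fails here
    echo "cpu$cpu: online=$(cat "$ctl")"
}

# Example: cpu_cycle 1
```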
Comment 14 Josh 2010-05-03 08:53:21 EDT
I triggered the kernel panic following the echo instructions and while it does kernel panic the system does not hang.  I am using 2.6.32-19.el6 and my bug was marked as a duplicate of this.  The brightness controller does continue to work despite the kernel panic.  Should I ask for my bug to be re-opened?

Thanks,
-josh
Comment 15 Josh 2010-05-03 08:57:12 EDT
(In reply to comment #14)
> I triggered the kernel panic following the echo instructions and while it does
> kernel panic the system does not hang.  I am using 2.6.32-19.el6 and my bug was
> marked as a duplicate of this.  The brightness controller does continue to work
> despite the kernel panic.  Should I ask for my bug to be re-opened?
> 
> Thanks,
> -josh    

I forgot to mention that when I tested the echo commands I got:

# echo 1 > /sys/devices/system/cpu/cpu1/online 
-bash: echo: write error: Invalid argument
Comment 16 Frank Arnold 2010-05-03 09:28:21 EDT
(In reply to comment #14)
> Should I ask for my bug to be re-opened?

No. Your trace looks like the ones in the other duplicates.
Comment 17 Prarit Bhargava 2010-05-03 09:32:51 EDT
(In reply to comment #15)
> (In reply to comment #14)
> > I triggered the kernel panic following the echo instructions and while it does
> > kernel panic the system does not hang.  I am using 2.6.32-19.el6 and my bug was
> > marked as a duplicate of this.  The brightness controller does continue to work
> > despite the kernel panic.  Should I ask for my bug to be re-opened?
> > 
> > Thanks,
> > -josh    
> 
> I forgot to mention that when I tested the echo commands I got:
> 
> # echo 1 > /sys/devices/system/cpu/cpu1/online 
> -bash: echo: write error: Invalid argument    

Josh, I see two distinct failures.  The first is an actual *oops* which hangs the system, and the second is a BUG warning due to scheduling while atomic, which sometimes allows the system to continue executing.

I have traced both of these failures to this BZ.

The *critical* portion of either the BUG warning or the oops is these three lines:

[<c0a49479>] ? nmi_cpu_busy+0x0/0x17
[<c080203e>] ? end_local_APIC_setup+0xd3/0xea
[<c08018ca>] ? start_secondary+0x102/0x24e

end_local_APIC_setup() does NOT call nmi_cpu_busy().  That is the unwinder going a bit crazy trying to determine what function has been called.  end_local_APIC_setup() has actually called nmi_watchdog_default(), which is __init and is not in the function table.

P.
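
For triaging possible duplicates, the two frames named in this comment can be searched for in a saved kernel log. The sketch below is a heuristic, not a definitive test: it reports a match when an end_local_APIC_setup frame is immediately followed by a start_secondary frame, and it deliberately ignores the misattributed symbol above them. The function name is hypothetical.

```shell
#!/bin/sh
# Read a kernel log on stdin. Succeed if an end_local_APIC_setup frame is
# directly followed by a start_secondary frame, as in the traces in this bug.
has_init_resume_signature() {
    awk '
        BEGIN { prev = -1 }
        /end_local_APIC_setup\+0x/ { prev = NR; next }
        /start_secondary\+0x/ && prev == NR - 1 { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Example: dmesg | has_init_resume_signature && echo "looks like bug 586967"
```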
Comment 18 Frank Arnold 2010-05-03 09:52:30 EDT
To add some testing data to Prarit's explanations:

I tried it the suspend/resume way on one of our boxes, which still had the needed bits installed anyway.

1. With a kernel based on 2.6.32-19.el6, including the attached patch
   * Did an `echo mem > /sys/power/state`
   * Let the box resume
   * Looked at the output of dmesg: No failures.

2. With a plain 2.6.32-19.el6
   * Did an `echo mem > /sys/power/state`
   * Let the box resume again
   * Resulted in a lot of trouble, including the following trace:

   Kernel panic - not syncing: Fatal exception
   Pid: 0, comm: swapper Tainted: G      D    2.6.32-19.el6.i686 #1
   Call Trace:
    [<c08055d5>] ? panic+0x42/0xed
    [<c0808bfc>] ? oops_end+0xbc/0xd0
    [<c080831e>] ? do_int3+0x6e/0x90
    [<c0808184>] ? int3+0x30/0x38
    [<c0a49479>] ? nmi_cpu_busy+0x0/0x17
    [<c080203e>] ? end_local_APIC_setup+0xd3/0xea
    [<c08018ca>] ? start_secondary+0x102/0x24e
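
The "look at dmesg after resume" step can be made repeatable by comparing two saved dmesg snapshots. The sketch below is an assumption-laden helper: the function name and file names are hypothetical, the match patterns are a guess at what counts as a failure, and comparing by line count breaks down if the kernel ring buffer wraps between snapshots. It is only useful when the box survives the resume; after a panic like the one above there is nothing left to run it on.

```shell
#!/bin/sh
# Print kernel messages that appear in AFTER but not in the first
# (line count of BEFORE) lines, and that look like failures.
# Exit status is non-zero when no such message is found.
new_failures() {
    before="$1"
    after="$2"
    skip=$(wc -l < "$before")
    tail -n "+$((skip + 1))" "$after" | grep -E 'BUG:|Oops|Kernel panic'
}

# Example (as root; snapshot file names are hypothetical):
#   dmesg > /tmp/before
#   echo mem > /sys/power/state
#   dmesg > /tmp/after
#   new_failures /tmp/before /tmp/after || echo "no new failures"
```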
Comment 19 Prarit Bhargava 2010-05-03 13:12:36 EDT
   Kernel panic - not syncing: Fatal exception
   Pid: 0, comm: swapper Tainted: G      D    2.6.32-19.el6.i686 #1
   Call Trace:
    [<c08055d5>] ? panic+0x42/0xed
    [<c0808bfc>] ? oops_end+0xbc/0xd0
    [<c080831e>] ? do_int3+0x6e/0x90
    [<c0808184>] ? int3+0x30/0x38
    [<c0a49479>] ? nmi_cpu_busy+0x0/0x17
    [<c080203e>] ? end_local_APIC_setup+0xd3/0xea
    [<c08018ca>] ? start_secondary+0x102/0x24e    

That's a nice panic :)  end_local_APIC_setup(), as mentioned, does not call nmi_cpu_busy() and is actually calling nmi_watchdog_default(), which, now that its __init memory has been freed, resolves to int3 (0xcc).

P.
Comment 20 Aristeu Rozanski 2010-05-04 10:43:34 EDT
Patch(es) available on kernel-2.6.32-24.el6
Comment 22 Ales Zelinka 2010-05-05 05:12:15 EDT
*** Bug 588663 has been marked as a duplicate of this bug. ***
Comment 24 Eric Sandeen 2010-05-07 11:38:52 EDT
*** Bug 587509 has been marked as a duplicate of this bug. ***
Comment 25 Prarit Bhargava 2010-05-10 11:42:23 EDT
*** Bug 590408 has been marked as a duplicate of this bug. ***
Comment 26 Don Zickus 2010-05-17 14:15:11 EDT
*** Bug 591138 has been marked as a duplicate of this bug. ***
Comment 27 Don Zickus 2010-05-17 14:55:33 EDT
*** Bug 592348 has been marked as a duplicate of this bug. ***
Comment 33 releng-rhel@redhat.com 2010-11-11 11:15:45 EST
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.
