586967 – RHEL6: x86 32-bit, nmi_watchdog_default() is __init, but called on resume

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	6.0
Hardware:	i686
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Prarit Bhargava
QA Contact:	Jan Tluka
Docs Contact:
URL:
Whiteboard:
Duplicates (12):	581749 582129 585003 585766 586164 586776 586830 587509 588663 590408 591138 592348
Depends On:
Blocks:

Reported:	2010-04-28 15:10 UTC by Prarit Bhargava
Modified:	2010-11-11 16:15 UTC
CC List:	17 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2010-11-11 16:15:45 UTC
Target Upstream Version:
Embargoed:

Attachments
Initial RHEL6 fix from AMD (333 bytes, patch) 2010-04-28 15:14 UTC, Prarit Bhargava
RHEL6 fix for this issue (1.08 KB, patch) 2010-04-28 17:43 UTC, Prarit Bhargava

Description Prarit Bhargava 2010-04-28 15:10:42 UTC

From BZ 567601,

Description of problem:
After updating to kernel-2.6.32-17.el6, the system appears to suspend correctly
(powers down and the power LED begins to slowly throb) and I'm able to trigger
a resume (power LED returns to green, some disk activity) but the system never
fully resumes.  I never get a display and it doesn't appear that networking
resumes.  

Version-Release number of selected component (if applicable):
2.6.32-17.el6

How reproducible:
Always

Steps to Reproduce:
1. Suspend the machine
2. Attempt to resume
3.

Actual results:
Black screen, no networking, system appears to be frozen requiring a hard
power-cycle.

Expected results:


Additional info:
Booting 2.6.32-16.el6 instead results in the ability to suspend/resume.

Comment 1 Prarit Bhargava 2010-04-28 15:13:38 UTC

Bhavna alerted us to this -- good catch Bhavna!

The issue is that in -19.el6 only the x86_64 case was changed to __cpuinit. All other cases were left as __init, which causes trouble during CPU hotplug on 32-bit.

I'll check the common code to make sure that there aren't any other __init/__cpuinit pitfalls before submitting to RHKL.

P.

Comment 2 Prarit Bhargava 2010-04-28 15:14:57 UTC

Created attachment 409879
Initial RHEL6 fix from AMD

Thanks for the patch Frank.

P.

Comment 3 Prarit Bhargava 2010-04-28 15:31:40 UTC

Patch looks good and CONFIG_DEBUG_SECTION_MISMATCH=y didn't show anything else related to this code path.

Will post to RHKL shortly.

P.

Comment 4 Prarit Bhargava 2010-04-28 17:43:48 UTC

Created attachment 409914
RHEL6 fix for this issue

Comment 5 Prarit Bhargava 2010-04-29 14:22:43 UTC

*** Bug 585003 has been marked as a duplicate of this bug. ***

Comment 6 Prarit Bhargava 2010-04-29 18:36:32 UTC

*** Bug 581749 has been marked as a duplicate of this bug. ***

Comment 7 Prarit Bhargava 2010-04-29 18:42:01 UTC

*** Bug 582129 has been marked as a duplicate of this bug. ***

Comment 8 Prarit Bhargava 2010-04-30 13:33:55 UTC

*** Bug 585766 has been marked as a duplicate of this bug. ***

Comment 9 Prarit Bhargava 2010-04-30 13:47:41 UTC

*** Bug 586164 has been marked as a duplicate of this bug. ***

Comment 10 Prarit Bhargava 2010-04-30 14:17:39 UTC

*** Bug 586776 has been marked as a duplicate of this bug. ***

Comment 11 Prarit Bhargava 2010-04-30 14:23:37 UTC

*** Bug 586830 has been marked as a duplicate of this bug. ***

Comment 12 Josh 2010-05-03 02:26:36 UTC

Would this prevent the brightness controller from working after a resume?

Comment 13 Frank Arnold 2010-05-03 11:29:16 UTC

(In reply to comment #12)
> Would this prevent the brightness controller from working after a resume?    

No brightness controller involved here.

With this bug, 32-bit systems shouldn't resume at all. The issue was introduced with 2.6.32-17.el6, x86_64 was fixed with 2.6.32-18.el6, and the fix for all other cases is still pending (the problem persists with 2.6.32-22.el6).

Easy way to trigger this issue:

# echo 0 > /sys/devices/system/cpu/cpu1/online
# echo 1 > /sys/devices/system/cpu/cpu1/online <-- box should hang here
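
A minimal sketch of the same offline/online cycle as a shell function, for anyone who wants to repeat it with basic error checks. The name cycle_cpu and its optional directory argument are assumptions made for illustration, not part of any existing tool. It has to run as root against the real sysfs path, and on an affected kernel the second write is where the hang or oops is expected.

```shell
#!/bin/sh
# cycle_cpu is a hypothetical helper name, not an existing tool.
# Usage: cycle_cpu [cpu-dir]   (default: /sys/devices/system/cpu/cpu1)
cycle_cpu() {
    cpu_dir="${1:-/sys/devices/system/cpu/cpu1}"
    ctl="$cpu_dir/online"

    # The control file is missing for CPUs that cannot be hotplugged,
    # and it is not writable without root.
    if [ ! -w "$ctl" ]; then
        echo "cannot write $ctl (not root, or CPU not hotpluggable)" >&2
        return 1
    fi

    echo 0 > "$ctl" || return 1   # take the CPU offline
    echo 1 > "$ctl" || return 1   # bring it back; affected kernels fail here
    echo "cycled $cpu_dir: online=$(cat "$ctl")"
}
```

Calling cycle_cpu with no argument targets cpu1, matching the two commands above.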

Comment 14 Josh 2010-05-03 12:53:21 UTC

I triggered the kernel panic following the echo instructions and while it does kernel panic the system does not hang.  I am using 2.6.32-19.el6 and my bug was marked as a duplicate of this.  The brightness controller does continue to work despite the kernel panic.  Should I ask for my bug to be re-opened?

Thanks,
-josh

Comment 15 Josh 2010-05-03 12:57:12 UTC

(In reply to comment #14)
> I triggered the kernel panic following the echo instructions and while it does
> kernel panic the system does not hang.  I am using 2.6.32-19.el6 and my bug was
> marked as a duplicate of this.  The brightness controller does continue to work
> despite the kernel panic.  Should I ask for my bug to be re-opened?
> 
> Thanks,
> -josh    

I forgot to mention that when I tested the echo commands I got:

# echo 1 > /sys/devices/system/cpu/cpu1/online 
-bash: echo: write error: Invalid argument

Comment 16 Frank Arnold 2010-05-03 13:28:21 UTC

(In reply to comment #14)
> Should I ask for my bug to be re-opened?

No. Your trace looks like the ones in the other duplicates.

Comment 17 Prarit Bhargava 2010-05-03 13:32:51 UTC

(In reply to comment #15)
> (In reply to comment #14)
> > I triggered the kernel panic following the echo instructions and while it does
> > kernel panic the system does not hang.  I am using 2.6.32-19.el6 and my bug was
> > marked as a duplicate of this.  The brightness controller does continue to work
> > despite the kernel panic.  Should I ask for my bug to be re-opened?
> > 
> > Thanks,
> > -josh    
> 
> I forgot to mention that when I tested the echo commands I got:
> 
> # echo 1 > /sys/devices/system/cpu/cpu1/online 
> -bash: echo: write error: Invalid argument    

Josh, I see two distinct failures.  The first is an actual *oops* which hangs the system, and the second is a BUG warning due to scheduling while atomic, which sometimes allows the system to continue executing.

I have traced both of these failures to this BZ.

The *critical* portion of both the BUG warning and the oops is these three lines:

[<c0a49479>] ? nmi_cpu_busy+0x0/0x17
[<c080203e>] ? end_local_APIC_setup+0xd3/0xea
[<c08018ca>] ? start_secondary+0x102/0x24e

end_local_APIC_setup() does NOT call nmi_cpu_busy().  That is the unwinder going a bit crazy trying to determine what function has been called.  end_local_APIC_setup() has actually called nmi_watchdog_default(), which is __init and is not in the function table.

P.
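
One possible way to check where a symbol was placed at link time is to compare its System.map address against the __init_begin and __init_end markers. The sketch below assumes the map file defines both markers and that addresses are fixed-width lowercase hex. in_init_section is a hypothetical helper name, and the default map path is only a common convention.

```shell
#!/bin/sh
# in_init_section is a hypothetical helper, not an existing tool.
# Usage: in_init_section SYMBOL [System.map]
# Prints "init", "not-init", or "unknown" (symbol or markers missing).
in_init_section() {
    sym="$1"
    map="${2:-/boot/System.map-$(uname -r)}"
    awk -v sym="$sym" '
        # Appending "" forces string values, so awk never treats an
        # address such as 1e000000 as a floating-point number.
        $3 == "__init_begin" { begin = $1 "" }
        $3 == "__init_end"   { end   = $1 "" }
        $3 == sym            { addr  = $1 ""; found = 1 }
        END {
            if (!found || begin == "" || end == "") { print "unknown"; exit 2 }
            # Equal-width lowercase hex strings sort in address order.
            if (addr >= begin && addr < end) print "init"
            else print "not-init"
        }' "$map"
}
```

For example, in_init_section nmi_watchdog_default reports whether that symbol was linked into the region that is released after boot. This only reflects link-time layout; it says nothing about what currently occupies that memory.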

Comment 18 Frank Arnold 2010-05-03 13:52:30 UTC

To add some testing data to Prarit's explanations:

I tried it the suspend/resume way on one of our boxes, which still had the needed bits installed anyway.

1. With a kernel based on 2.6.32-19.el6, including the attached patch
   * Did an `echo mem > /sys/power/state`
   * Let the box resume
   * Looked at the output of dmesg: No failures.

2. With a plain 2.6.32-19.el6
   * Did an `echo mem > /sys/power/state`
   * Let the box resume again
   * Resulted in a lot of trouble, including the following trace:

   Kernel panic - not syncing: Fatal exception
   Pid: 0, comm: swapper Tainted: G      D    2.6.32-19.el6.i686 #1
   Call Trace:
    [<c08055d5>] ? panic+0x42/0xed
    [<c0808bfc>] ? oops_end+0xbc/0xd0
    [<c080831e>] ? do_int3+0x6e/0x90
    [<c0808184>] ? int3+0x30/0x38
    [<c0a49479>] ? nmi_cpu_busy+0x0/0x17
    [<c080203e>] ? end_local_APIC_setup+0xd3/0xea
    [<c08018ca>] ? start_secondary+0x102/0x24e
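
The dmesg check in both runs could be scripted roughly as follows. This is a sketch only: scan_resume_log is a hypothetical helper, and the patterns are limited to strings mentioned in this report (the panic banner, the scheduling-while-atomic BUG warning, and the nmi_cpu_busy frame), so a clean result does not rule out other failures.

```shell
#!/bin/sh
# scan_resume_log is a hypothetical helper, not an existing tool.
# Reads kernel log text on stdin and prints any matching lines.
# Returns 0 if none of the patterns appear, 1 otherwise.
scan_resume_log() {
    if grep -E 'Kernel panic|BUG: scheduling while atomic|nmi_cpu_busy'; then
        return 1
    fi
    return 0
}
```

After a resume, something like `dmesg | scan_resume_log && echo "no known failure patterns"` would cover the "No failures" case from step 1.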

Comment 19 Prarit Bhargava 2010-05-03 17:12:36 UTC

   Kernel panic - not syncing: Fatal exception
   Pid: 0, comm: swapper Tainted: G      D    2.6.32-19.el6.i686 #1
   Call Trace:
    [<c08055d5>] ? panic+0x42/0xed
    [<c0808bfc>] ? oops_end+0xbc/0xd0
    [<c080831e>] ? do_int3+0x6e/0x90
    [<c0808184>] ? int3+0x30/0x38
    [<c0a49479>] ? nmi_cpu_busy+0x0/0x17
    [<c080203e>] ? end_local_APIC_setup+0xd3/0xea
    [<c08018ca>] ? start_secondary+0x102/0x24e    

That's a nice panic :)  end_local_APIC_setup(), as mentioned, does not call nmi_cpu_busy(); it is actually calling nmi_watchdog_default(), which resolves to int3 (0xcc).

P.

Comment 20 Aristeu Rozanski 2010-05-04 14:43:34 UTC

Patch(es) available on kernel-2.6.32-24.el6

Comment 22 Ales Zelinka 2010-05-05 09:12:15 UTC

*** Bug 588663 has been marked as a duplicate of this bug. ***

Comment 24 Eric Sandeen 2010-05-07 15:38:52 UTC

*** Bug 587509 has been marked as a duplicate of this bug. ***

Comment 25 Prarit Bhargava 2010-05-10 15:42:23 UTC

*** Bug 590408 has been marked as a duplicate of this bug. ***

Comment 26 Don Zickus 2010-05-17 18:15:11 UTC

*** Bug 591138 has been marked as a duplicate of this bug. ***

Comment 27 Don Zickus 2010-05-17 18:55:33 UTC

*** Bug 592348 has been marked as a duplicate of this bug. ***

Comment 33 releng-rhel@redhat.com 2010-11-11 16:15:45 UTC

Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.
