Bug 189052

Summary:

Kernel panic on shutdown or poweroff on SMP

Product:

Red Hat Enterprise Linux 3

Reporter:

Philip Pokorny <ppokorny>

Component:

kernel

Assignee:

Peter Martuccelli <peterm>

Status:

CLOSED ERRATA

QA Contact:

Brian Brock <bbrock>

Severity:

high

Priority:

medium

Version:

3.0

CC:

petrides

Hardware:

x86_64

OS:

Linux

Fixed In Version:

RHSA-2007-0436

Doc Type:

Bug Fix

Last Closed:

2007-06-11 17:54:11 UTC

Attachments:

Description	Flags
Patch to force acpi_power_off onto CPU0	none
dmesg from system	none

Description Philip Pokorny 2006-04-15 01:57:32 UTC

Description of problem:
When shutting down a system with "shutdown -h now" or "poweroff" so that the
system will turn off, the kernel sometimes panics (10-50% of attempts).
Sometimes the system completes the shutdown but fails to turn off.
Sometimes the system powers off correctly.

Version-Release number of selected component (if applicable):
2.4.21-37.EL, and 2.4.21-40.EL

How reproducible:
Any single shutdown may or may not fail, but the problem is 100% reproducible
on multiple hardware platforms and systems when the shutdown is repeated.

Current test hardware is a dual Xeon 3.0GHz with ia32e kernel.  Kernel panics
and failures to power off also happen on dual Opteron systems.  Kernel panic
details vary by system.

Steps to Reproduce:
1. poweroff
2. Watch for panic
3. Repeat, if no panic.
  
Actual results:

Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
 printing rip:
000001007ffe6000
PML4 8775067 PGD 8762067 PMD 0
Oops: 0002
CPU 0
Pid: 10, comm: kupdated Not tainted
RIP: 0010:[<000001007ffe6000>]
RSP: 0018:000001007ffe7f30  EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffffa0092050
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 000001007ecd2b10
RBP: 000001007ffe6000 R08: 0000000000000078 R09: 00000100086777c0
R10: ffffffff805f09e8 R11: 0000000000000001 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffffffff805e8140(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000101000 CR4: 00000000000006e0

Call Trace: [<ffffffff801664d0>]{kupdate+288} [<ffffffff80110d01>]{child_rip+8}
       [<ffffffff801663b0>]{kupdate+0} [<ffffffff80110cf9>]{child_rip+0}

Process kupdated (pid: 10, stackpage=1007ffe7000)
Stack: 000001007ffe7f30 0000000000000018 ffffffff801664d0 000001007ffe6000
       0000000000000000 0000000000000000 ffffffff80110d01 0000000000000000
       0000000000000000 0000000000000000 0000000000000000 0000000000000000
       0000000000000000 0000000000000000 ffffffff805eee08 ffffffff803f0ac0
       0000000000000001 0000000000000000 000001000856a000 0000000000000e00
       ffffffff806533e0 ffffffff801663b0 0000000000000000 ffffffff80110cf9
       0000000000000010 0000000000000200 000001007ffe7f58 0000000000000000
Call Trace: [<ffffffff801664d0>]{kupdate+288} [<ffffffff80110d01>]{child_rip+8}
       [<ffffffff801663b0>]{kupdate+0} [<ffffffff80110cf9>]{child_rip+0}

Code: 00 00 00 00 00 00 00 00 40 00 00 00 00 00 00 00 00 00 00 00

Kernel panic: Fatal exception

Expected results:

Halting system...
md: stopping all md devices.
flushing ide devices:
Power down.
[[ system powers off ]]

Additional info:

Through debugging and testing, I have confirmed that acpi_power_off is being
called and is running.

When the panics occur, acpi_power_off is always found to be running on CPU1.
No panics or failures to power off were observed when acpi_power_off is running
on CPU0.

When the system fails to power off, it is spinning in the do/while(!in_value)
loop at the end of acpi_enter_sleep_state.

Based on comments in the APM code, and the observation that failures only happen
when running on CPU1, we tried the attached patch.  It forces the task to run on
CPU0.  With this patch, we have seen the process move from CPU1 to CPU0.  No
panics have been seen after applying this patch.  Testing continues.
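
For readers without the attachment, the idea of "check the current CPU and move
to CPU0 if needed" can be sketched in userspace.  This is only an analogue and
not the attached kernel patch: it assumes Linux with glibc, uses
sched_getcpu() and sched_setaffinity() in place of the kernel-internal calls,
and the function name move_to_cpu0_if_needed is made up for illustration.

```c
#define _GNU_SOURCE
#include <sched.h>

/*
 * Userspace analogue (hypothetical helper, not the kernel patch):
 * if the caller is not already running on CPU 0, restrict its affinity
 * to CPU 0 so that the scheduler migrates it there.
 *
 * Returns 1 if the affinity was changed, 0 if the caller was already on
 * CPU 0 (affinity left untouched), and -1 on error, for example when
 * CPU 0 is not available to this process.
 */
int move_to_cpu0_if_needed(void)
{
    cpu_set_t only_cpu0;

    if (sched_getcpu() == 0)
        return 0;               /* already on CPU 0, nothing to do */

    CPU_ZERO(&only_cpu0);
    CPU_SET(0, &only_cpu0);

    /* A pid of 0 means "the calling thread". */
    if (sched_setaffinity(0, sizeof(only_cpu0), &only_cpu0) != 0)
        return -1;

    return 1;
}
```

Note that when the function returns 0 the caller is not pinned, so nothing in
this sketch stops the scheduler from moving it to another CPU afterwards.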

Comment 1 Philip Pokorny 2006-04-15 01:57:32 UTC

Created attachment 127769 [details]
Patch to force acpi_power_off onto CPU0

Comment 2 Philip Pokorny 2006-04-15 02:08:50 UTC

Created attachment 127770 [details]
dmesg from system

Here is a dmesg from the system for referencing the hardware config.

Comment 3 Philip Pokorny 2006-04-16 02:19:29 UTC

The previous patch got its whitespace munged in the process of posting it.
Sorry about that.  Either use patch -l to ignore whitespace, or modify the patch
to replace the leading spaces with tabs on each line.

Automated testing has completed 320 power-on/power-off cycles without failure.
The code path in the patch to switch to CPU0 has been exercised 131 times.

Comment 4 Jim Paradis 2006-05-30 23:47:31 UTC

I notice that upstream does not have the test that you have in the patch; it
simply does the set_cpus_allowed() regardless.  I will likely submit a RHEL
patch that mirrors what upstream does...
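
To illustrate the "regardless" variant, here is a rough userspace sketch, not
the RHEL or upstream kernel code.  It assumes Linux with glibc and uses
sched_setaffinity() as a stand-in for the kernel's set_cpus_allowed(); the
name pin_to_cpu0 is hypothetical.

```c
#define _GNU_SOURCE
#include <sched.h>

/*
 * Userspace stand-in (hypothetical helper): always restrict the calling
 * thread to CPU 0, without first checking which CPU it is running on.
 *
 * Returns 0 on success and -1 on error, for example when CPU 0 is not
 * available to this process.
 */
int pin_to_cpu0(void)
{
    cpu_set_t only_cpu0;

    CPU_ZERO(&only_cpu0);
    CPU_SET(0, &only_cpu0);

    /* A pid of 0 means "the calling thread". */
    return sched_setaffinity(0, sizeof(only_cpu0), &only_cpu0);
}
```

Setting the mask unconditionally means the caller stays restricted to CPU 0
even if it was already there, so there is no window in which it could be
moved to another CPU after a check.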

Comment 5 Ernie Petrides 2006-10-11 19:09:08 UTC

Jim's patch for this was posted for internal review on 31-May-2006.

Comment 6 RHEL Program Management 2006-10-25 03:09:06 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 7 Jay Turner 2006-10-25 14:10:51 UTC

QE ack for 3.9.

Comment 8 Ernie Petrides 2006-11-01 23:53:11 UTC

A fix for this problem has just been committed to the RHEL3 U9
patch pool this evening (in kernel version 2.4.21-47.3.EL).

Comment 11 Red Hat Bugzilla 2007-06-11 17:54:11 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2007-0436.html