Bug 191406 - Suspend and hibernate problems on Ferrari 4005
Status: CLOSED NOTABUG
Product: Fedora
Classification: Fedora
Component: kernel
Version: 5
Hardware: x86_64 Linux
Priority: medium
Severity: medium
Assigned To: Dave Jones
Brian Brock
Reported: 2006-05-11 13:53 EDT by Alex Tucker
Modified: 2015-01-04 17:27 EST (History)
Last Closed: 2007-01-04 12:50:14 EST
Description Alex Tucker 2006-05-11 13:53:51 EDT
Description of problem:

I've discovered that the only way to get either suspend to RAM or hibernate to
disk to work on my Ferrari 4005 laptop is to append "noapic irqpoll
acpi_irq_balance" to the kernel parameters at boot.  However, this renders my
system unstable and it will freeze on average once a day.
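
For anyone comparing setups, here is a minimal Python sketch that reports which
of the three workaround parameters are missing from a kernel command line.  It
assumes the running kernel's command line can be read from /proc/cmdline; the
helper name missing_params is made up for illustration.

```python
# Hypothetical helper: list the workaround parameters that are absent
# from a kernel command line string (such as the content of /proc/cmdline).
WORKAROUND = ("noapic", "irqpoll", "acpi_irq_balance")

def missing_params(cmdline, wanted=WORKAROUND):
    present = set(cmdline.split())
    return [p for p in wanted if p not in present]

if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            print(missing_params(f.read()))
    except OSError as err:
        # /proc/cmdline is Linux-specific; report rather than crash elsewhere.
        print("could not read /proc/cmdline:", err)
```

An empty list means all three parameters were passed at boot.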

Version-Release number of selected component (if applicable):

kernel-2.6.16-1.2111_FC5, although I discovered the above parameters around
2.6.14 and have had the same results with previous kernels.

How reproducible:

Lockups are fairly random, although they do seem to happen under heavy load. 
I've enabled the magic sysrq key and can sometimes use that to reboot, although
I've not been able to unmount, which suggests something's up with IDE access.
I've once been in runlevel 3 and seen the results of the panic, but haven't
managed to capture it yet.

Steps to Reproduce:
1.  Boot with the above parameters
2.  Work away.
  
Actual results:
System freezes.  If in X, then everything locks up and sometimes the LEDs will
blink.  Once in the console I saw a panic.

Expected results:
Suspend / hibernate working without rendering the system unstable.

Additional info:
Without the kernel parameters, neither suspend nor hibernate work.  Initiating
suspend to RAM will suspend the laptop, but on resume it seems to get no further
than accessing the CD.  A few months ago I attempted to trace whereabouts in the
wakeup code the kernel had gotten to by pasting some "beep" code -- I got as far
as tracing it into the code which iterates over each device and wakes it up, but
couldn't figure out how to get much further.

I've recently installed kdump/kexec.  Forcing the kernel to crash without using
the extra kernel parameters results in the kexec'ed kernel not getting past some
IRQ problem -- I'll switch off the quiet boot option and attach a screenshot. 
Adding the extra kernel parameters above allows the kexec'ed kernel to boot and
dump an image as it should, however I've not managed to get this to happen yet
when my system freezes.

Any help or pointers gratefully received!
Alex.
Comment 1 rizo83 2006-09-26 15:55:22 EDT
I can confirm this on a Ferrari 4005 with FC5 i686 using the 2.6.17-1.2187_FC5
kernel.  Suspend to disk wasn't working until I appended "noapic irqpoll
acpi_irq_balance" to the kernel parameters.  On resume there are a lot of
"unexpected IRQ trap at vector 79" and "unexpected IRQ trap at vector 89"
messages in dmesg.
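
To quantify "a lot", a small Python sketch can tally these messages per vector
from saved dmesg output.  The function name count_traps is hypothetical, and
the pattern only assumes the message wording quoted above; the vector is kept
as the string the kernel printed rather than converted to a number.

```python
import re
from collections import Counter

# Matches the message quoted above; the vector is captured as printed.
TRAP_RE = re.compile(r"unexpected IRQ trap at vector ([0-9a-fA-F]+)")

def count_traps(log_text):
    """Count 'unexpected IRQ trap' messages per vector in dmesg text."""
    return Counter(TRAP_RE.findall(log_text))
```

Feeding it the text of `dmesg` captured after a resume gives one count per
vector, which makes it easier to compare runs with and without the parameters.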
Comment 2 Alex Tucker 2006-10-02 12:42:22 EDT
Ok, I found where the kernel-debuginfo RPMs were hidden and have managed to get
kexec to write a crash dump and use "crash" to at least get a back trace of the
panic on the latest 2.6.17-1.2187_FC5 kernel.  The following panic happened
after booting with "noapic irqpoll acpi_irq_balance" after using the machine for
about 10 minutes:

      KERNEL: /usr/lib/debug/lib/modules/2.6.17-1.2187_FC5/vmlinux
    DUMPFILE: /var/crash/2006-10-02-16:42/vmcore
        CPUS: 1
        DATE: Mon Oct  2 16:42:02 2006
      UPTIME: 01:11:33
LOAD AVERAGE: 0.51, 0.60, 0.60
       TASKS: 154
    NODENAME: ferrari.floop.org.uk
     RELEASE: 2.6.17-1.2187_FC5
     VERSION: #1 SMP Mon Sep 11 01:16:59 EDT 2006
     MACHINE: x86_64  (795 Mhz)
      MEMORY: 2 GB
       PANIC: ""
         PID: 0
     COMMAND: "swapper"
        TASK: ffffffff80542dc0  [THREAD_INFO: ffffffff806c8000]
         CPU: 0
       STATE: TASK_RUNNING (PANIC)

PID: 0      TASK: ffffffff80542dc0  CPU: 0   COMMAND: "swapper"
 #0 [ffffffff80603a60] crash_kexec at ffffffff802ab31d
 #1 [ffffffff80603ae8] crash_kexec at ffffffff802ab335
 #2 [ffffffff80603b70] crash_kexec at ffffffff802ab31d
 #3 [ffffffff80603b98] bust_spinlocks at ffffffff80281d2d
 #4 [ffffffff80603ba8] panic at ffffffff8028fb7e
 #5 [ffffffff80603c08] module_text_address at ffffffff802a3b95
 #6 [ffffffff80603c28] kernel_text_address at ffffffff8029d837
 #7 [ffffffff80603c38] show_trace at ffffffff8027049c
 #8 [ffffffff80603c78] dump_stack at ffffffff80270505
 #9 [ffffffff80603c98] spin_bug at ffffffff80213c24
#10 [ffffffff80603cb8] _raw_spin_lock at ffffffff8020762f
#11 [ffffffff80603cc8] note_interrupt at ffffffff802b325a
#12 [ffffffff80603d18] __do_IRQ at ffffffff802b2c96
#13 [ffffffff80603d58] do_IRQ at ffffffff802713c6
#14 [ffffffff80603e08] ide_do_request at ffffffff8020ea4d
#15 [ffffffff80603e30] ide_do_request at ffffffff8020ea48
#16 [ffffffff80603e60] freed_request at ffffffff8032b084
#17 [ffffffff80603e80] ide_end_request at ffffffff8020aa18
#18 [ffffffff80603ed0] ide_intr at ffffffff8020d3f5
#19 [ffffffff80603f00] note_interrupt at ffffffff802b32a7
#20 [ffffffff80603f50] __do_IRQ at ffffffff802b2c96
#21 [ffffffff80603f90] do_IRQ at ffffffff802713c6
--- <IRQ stack> ---
#22 [ffffffff806c9eb8] ret_from_intr at ffffffff80262eea
    [exception RIP: acpi_processor_idle+502]
    RIP: ffffffff8037578c  RSP: ffffffff806c9f60  RFLAGS: 00000246
    RAX: ffffffff806c9fd8  RBX: ffffffff80375596  RCX: 00000000ca3cefe4
    RDX: 0000000000008008  RSI: 00000000ca3cfc67  RDI: 0000000000000000
    RBP: 000000000008e000   R8: ffffffff806c8000   R9: 00000065e363583e
    R10: 0000000000000000  R11: ffffffff802677b6  R12: ffff81007dd028f0
    R13: ffff81007dd02800  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: fffffffffffffff4  CS: 0010  SS: 0018
#23 [ffffffff806c9f68] notifier_call_chain at ffffffff8026bfab
#24 [ffffffff806c9fa8] cpu_idle at ffffffff8024c7b7
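
In the trace above, note_interrupt shows up twice (frames 11 and 19), with
_raw_spin_lock and spin_bug directly above the later entry.  A rough Python
sketch for spotting such repeats in "crash" backtraces follows.  The helper
name repeated_frames is made up, and a repeat is only a hint: stack listings
like this can include stale entries, so it does not prove re-entrancy.

```python
import re
from collections import defaultdict

# Matches frame lines such as:
#   "#11 [ffffffff80603cc8] note_interrupt at ffffffff802b325a"
FRAME_RE = re.compile(r"^\s*#(\d+)\s+\[[0-9a-f]+\]\s+(\S+)\s+at\s+[0-9a-f]+", re.M)

def repeated_frames(backtrace):
    """Map each function seen more than once to its frame numbers."""
    seen = defaultdict(list)
    for number, name in FRAME_RE.findall(backtrace):
        seen[name].append(int(number))
    return {name: nums for name, nums in seen.items() if len(nums) > 1}

# A few frames copied from the trace above.
sample = """\
 #9 [ffffffff80603c98] spin_bug at ffffffff80213c24
#10 [ffffffff80603cb8] _raw_spin_lock at ffffffff8020762f
#11 [ffffffff80603cc8] note_interrupt at ffffffff802b325a
#18 [ffffffff80603ed0] ide_intr at ffffffff8020d3f5
#19 [ffffffff80603f00] note_interrupt at ffffffff802b32a7
"""
print(repeated_frames(sample))  # {'note_interrupt': [11, 19]}
```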

Without the additional kernel parameters, neither suspend nor hibernate work,
nor can I get either process to produce a crash dump (magic sysrq on, but no
response even to alt-sysrq-c).

Please let me know if there's any further info that could help in the crash
dump, otherwise I'll remove it after a week or so.
Comment 3 Alex Tucker 2006-10-03 05:07:22 EDT
Since the last panic suggested a problem in swapper, I turned off swapping
(swapoff -a) and the machine stayed up for a bit longer, eventually giving the
following panic after a few hours:

PID: 2260   TASK: ffff8100750b9860  CPU: 0   COMMAND: "Xgl"
 #0 [ffffffff80603a20] crash_kexec at ffffffff802ab31d
 #1 [ffffffff80603aa8] crash_kexec at ffffffff802ab335
 #2 [ffffffff80603b30] crash_kexec at ffffffff802ab31d
 #3 [ffffffff80603b58] bust_spinlocks at ffffffff80281d2d
 #4 [ffffffff80603b68] panic at ffffffff8028fb7e
 #5 [ffffffff80603bc8] module_text_address at ffffffff802a3b95
 #6 [ffffffff80603be8] kernel_text_address at ffffffff8029d837
 #7 [ffffffff80603bf8] show_trace at ffffffff8027049c
 #8 [ffffffff80603c38] dump_stack at ffffffff80270505
 #9 [ffffffff80603c58] spin_bug at ffffffff80213c24
#10 [ffffffff80603c78] _raw_spin_lock at ffffffff8020762f
#11 [ffffffff80603c88] note_interrupt at ffffffff802b325a
#12 [ffffffff80603cd8] __do_IRQ at ffffffff802b2c96
#13 [ffffffff80603d18] do_IRQ at ffffffff802713c6
#14 [ffffffff80603d20] try_to_wake_up at ffffffff8024a2f9
#15 [ffffffff80603dc8] ide_outb at ffffffff802076e8
#16 [ffffffff80603df0] cdrom_start_packet_command at ffffffff8022f663
#17 [ffffffff80603e30] ide_do_request at ffffffff8020ed01
#18 [ffffffff80603e60] cdrom_decode_status at ffffffff80254e53
#19 [ffffffff80603e90] cdrom_pc_intr at ffffffff8025494c
#20 [ffffffff80603ed0] ide_intr at ffffffff8020d3f5
#21 [ffffffff80603f00] note_interrupt at ffffffff802b32a7
#22 [ffffffff80603f50] __do_IRQ at ffffffff802b2c96
#23 [ffffffff80603f90] do_IRQ at ffffffff802713c6
--- <IRQ stack> ---
#24 [ffff8100751dff58] ret_from_intr at ffffffff80262eea
    RIP: 000000000045170a  RSP: 00007fff44cdd250  RFLAGS: 00000246
    RAX: 00007fff44cdd260  RBX: 0000000000ca1510  RCX: 00000000007ac2d0
    RDX: 00007fff44cdd270  RSI: 0000000000000266  RDI: 00000000000000b8
    RBP: 00000000000000b8   R8: 00000000007ac450   R9: 0000000000000018
    R10: 0000000000000000  R11: 00002aaaaaeb4030  R12: 0000000000000266
    R13: 000000000097e7a0  R14: 0000000000680450  R15: 00000000009843e0
    ORIG_RAX: fffffffffffffff4  CS: 0033  SS: 002b

Alex.
Comment 4 Alex Tucker 2006-10-05 07:47:27 EDT
Googling around I spotted some patches Linus developed to help him debug the
suspend/resume cycle on his Mac mini.  It looks as though these may be rolled
into 2.6.18 at some point.  Is Red Hat going to include these patches?  They
would certainly help more than "beeps" in figuring out where things are going
wrong with resume on my laptop.
Comment 5 Dave Jones 2006-10-16 14:24:26 EDT
A new kernel update has been released (Version: 2.6.18-1.2200.fc5)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

In the last few updates, some users upgrading from FC4->FC5
have reported that installing a kernel update has left their
systems unbootable. If you have been affected by this problem
please check you only have one version of device-mapper & lvm2
installed.  See bug 207474 for further details.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

If this bug has been fixed, but you are now experiencing a different
problem, please file a separate bug for the new problem.

Thank you.
Comment 6 Alex Tucker 2006-10-17 10:59:38 EDT
Ok, I've updated to 2.6.18-1.2200.fc5.  It still exhibits the same behaviour as
the previous kernel (I tried suspend both with and without the extra
parameters), although it did take longer to crash (1 day).

Once I get kdump working again (it's complaining about "overlapping memory
segments"), I will try to capture the backtrace from the kernel panic.

Are the TRACE_DEVICE and TRACE_RESUME macros defined in the sources for this kernel?
Comment 7 Dave Jones 2006-10-19 14:56:58 EDT
No, that functionality will trash the real time clock each boot, which obviously
isn't viable in a production kernel.

If you're not averse to rebuilding kernels though, this would be a good method
of finding out exactly where things are dying.

Looking at the backtraces, I notice we're hitting spin_bug in the ide layer.
The only other times I've seen this happen in recent times were when people had
binary-only graphics drivers loaded. Is that the case here?
Comment 8 Alex Tucker 2006-10-19 17:16:22 EDT
(In reply to comment #7)
> No, that functionality will trash the real time clock each boot, which obviously
> isn't viable in a production kernel.

Granted.  I did see a further patch to add a /sys/power/something to enable the
trace.

> If you're not averse to rebuilding kernels though, this would be a good method
> of finding out exactly where things are dying.

Once upon a time it was simple to apply patches, but now there's git and I have
to learn something new before I can get there.  I'll give it a go when I get a
moment.

I'm also still trying to get kdump working again after the kernel upgrade from
2.6.17 to 18, so that I can capture the kernel panic.  Starting kdump complains
about "overlapping memory segments".  I tried fiddling with the boot time
parameters to crashkernel, to no avail.  Nor can I find much in the way of
documentation.  The current setting is crashkernel=128M@16M, which gives:
"Overlapping memory segments at 0x145b000, sort_segments failed".
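
As a quick arithmetic check, the Python sketch below works out the window
reserved by crashkernel=128M@16M and tests whether the address from the error
message falls inside it.  The helper names are made up, only the size@offset
form with K/M/G suffixes is handled, and the result does not explain why kexec
sees an overlap; it only locates the address relative to the reservation.

```python
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}

def parse_size(text):
    """Convert strings such as '128M' or '4096' to a byte count."""
    text = text.strip()
    suffix = text[-1].upper()
    if suffix in UNITS:
        return int(text[:-1]) * UNITS[suffix]
    return int(text)

def crashkernel_region(value):
    """Return (start, end) in bytes for a 'size@offset' value; end is exclusive."""
    size, sep, offset = value.partition("@")
    if not sep:
        raise ValueError("expected size@offset, got %r" % value)
    start = parse_size(offset)
    return start, start + parse_size(size)

start, end = crashkernel_region("128M@16M")
addr = 0x145b000  # address from the "Overlapping memory segments" message
print(hex(start), hex(end), start <= addr < end)  # 0x1000000 0x9000000 True
```

So the reported address sits roughly 4.4 MB past the start of the reserved
region.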

> Looking at the backtraces, I notice we're hitting spin_bug in the ide layer.
> The only other times I've seen this happen in recent times were when people had
> binary-only graphics drivers loaded. Is that the case here?

It is.  Although, when I've tested suspend/hibernate without the "noapic irqpoll
acpi_i..." gubbins, I've done it in single user mode after removing as many
modules as possible.  However, without the gubbins, I don't even get a kernel
panic on resume, just a locked up laptop.

I shall test with the open-source driver and let you know if things are stable,
and if so we can close this issue, although I'd obviously like to get a stable
laptop with fancy graphics :)

Comment 9 Alex Tucker 2006-10-21 08:48:23 EDT
(In reply to comment #7)
> Looking at the backtraces, I notice we're hitting spin_bug in the ide layer.
> The only other times I've seen this happen in recent times were when people had
> binary-only graphics drivers loaded. Is that the case here?

I've now tried with the open source radeon driver, rather than the ATI
binary-only one, but still the system is unstable and will panic after a while.
I still haven't managed to get kdump working either, so can't give a backtrace.

Some more notes, in case they're of any use:

*  Straight after booting, the kernel emits the following message, "..MP-BIOS
bug: 8254 timer not connected to IO-APIC", unless I append the "noapic
irqpoll..." gubbins as boot time parameters to the kernel.

*  I get the following warning in /var/log/messages, which I've not noticed before:

PCI: Transparent bridge - 0000:00:14.4
PCI: Bus #07 (-#0a) is hidden behind transparent bridge #06 (-#07) (try 'pci=assign-busses')
Please report the result to linux-kernel to fix this permanently

*  With the "gubbins" appended, after waking from a suspend, there are some
minor ACPI errors:

ACPI Exception (evregion-0424): AE_TIME, Returned by Handler for [EmbeddedControl] [20060707]
ACPI Exception (dswexec-0458): AE_TIME, While resolving operands for [OpcodeName unavailable] [20060707]
ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.ACAD._PSR] (Node ffff81007fe9f690), AE_TIME
ACPI Exception (acpi_ac-0096): AE_TIME, Error reading AC Adapter state [20060707]

*  With the gubbins added, I also get frequent errors reported from the ide driver:

hdc: cdrom_pc_intr: The drive appears confused (ireason = 0x01). Trying to recover by ending request.

*  Again with the gubbins added, there seems to be a minor panic (system
continues) after resume:

BUG: sleeping function called from invalid context at kernel/rwsem.c:20
in_atomic():0, irqs_disabled():1

Call Trace:
 [<ffffffff80269387>] show_trace+0x34/0x47
 [<ffffffff802693ac>] dump_stack+0x12/0x17
 [<ffffffff8029dcd2>] down_read+0x15/0x23
 [<ffffffff802962c0>] blocking_notifier_call_chain+0x13/0x36
 [<ffffffff803fcfc5>] cpufreq_resume+0x129/0x14c
 [<ffffffff803a347e>] __sysdev_resume+0x2a/0x66
 [<ffffffff803a362c>] sysdev_resume+0x1d/0x63
 [<ffffffff803a8025>] device_power_up+0x9/0xf
 [<ffffffff802a5f3c>] suspend_enter+0x3e/0x47
 [<ffffffff802a6088>] enter_state+0x143/0x19b
 [<ffffffff802a614f>] state_store+0x5e/0x79
 [<ffffffff802fb11c>] sysfs_write_file+0xca/0xf9
 [<ffffffff802162e7>] vfs_write+0xce/0x174
 [<ffffffff80216b6e>] sys_write+0x45/0x6e
 [<ffffffff8025c341>] tracesys+0xd1/0xdc
DWARF2 unwinder stuck at tracesys+0xd1/0xdc

*  Finally, I am running ndiswrapper (I take it this is just as evil as ATI's
binary driver) as the only way to get my wireless running.  The bcm43xx driver
used to work until I upped my memory to 2GB.
Comment 10 Alex Tucker 2006-10-22 14:40:48 EDT
After John Linville's suggestion to try his patched kernel for the bcm43xx
driver, I've now tried without ndiswrapper too, but still eventually the kernel
panics.
Comment 11 Philip Trickett 2006-10-31 16:47:59 EST
Ok, just for the record, similar problems exist on FC6.

[phil@nori ~]$ uname -a
Linux nori 2.6.18-1.2798.fc6 #1 SMP Mon Oct 16 14:39:22 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux

Using the free radeon driver and the bcm43xx driver as well, no binary only
modules loaded. I am going to try with the options mentioned on boot to see if
that changes anything and will report back.

Is it OK to attach to this bug for FC6 or shall I create a new one?
Comment 12 Alex Tucker 2006-12-18 04:54:04 EST
I moved to FC6, but still had the same issues.  Indeed, the laptop is only
stable if I run with the open source radeon driver without DRI and the closed
ndiswrapper as bcm43xx still panics on loading.  I've tried all RH kernels up to
2.6.18-1.2849 as well as John Linville's test kernels.

Now, since I had a few moments to burn, I've compiled the vanilla 2.6.19.1
kernel using the old config from the RH kernels as a start and defaulting much
of the rest of the config.  The laptop now suspends and resumes properly with
both suspend-to-ram and suspend-to-disk using pm-utils (although not with the
proprietary ATI drivers) and even bcm43xx seems to reluctantly work and at least
doesn't cause a panic.

If I get a few more moments to burn over Christmas, I could try to figure out
whether it's the move from 2.6.18 -> 2.6.19 that solved it, or whether it's a
custom RH patch that kills it.
Comment 13 Alex Tucker 2007-01-04 12:50:14 EST
Right, I just checked the 15th December release, kernel-2.6.18-1.2868.fc6, and
got the same results.

Interestingly, since with the vanilla 2.6.19.1 kernel I can get suspend to disk
to work, I can see that a major difference is that with the RH kernel the screen
just goes black when I run pm-hibernate, while on the 2.6.19.1 kernel I get
loads of messages about compressing stuff and writing the results to swap.

Also, the bcm43xx driver still panics my system with the RH kernel.  It works
(after a fashion) OK in the vanilla 2.6.19.1 kernel.

I'm going to wait until a 2.6.19 series RH kernel comes out before testing
again, rather than trying to wrangle with the diffs between 2.6.18 and 2.6.19.1
as well as the RH diffs.
Comment 14 Alex Tucker 2007-01-22 17:05:19 EST
The latest kernel 2.6.19-1.2895.fc6 has fixed this to the extent that
pm-hibernate and pm-suspend work most of the time, modulo some strange ACPI
timeout errors which I'll raise in another bug.  The bcm43xx issue with > 2GB is
gone too, so this is with all open source drivers, albeit with less wireless
range and no 3D acceleration.
