|Summary:||Suspend and hibernate problems on Ferrari 4005|
|Product:||[Fedora] Fedora||Reporter:||Alex Tucker <alex>|
|Component:||kernel||Assignee:||Dave Jones <davej>|
|Status:||CLOSED NOTABUG||QA Contact:||Brian Brock <bbrock>|
|Version:||5||CC:||ncunning, pfrields, rizo83, wtogami|
|Fixed In Version:||Doc Type:||Bug Fix|
|Doc Text:||Story Points:||---|
|Last Closed:||2007-01-04 17:50:14 UTC||Type:||---|
Description Alex Tucker 2006-05-11 17:53:51 UTC
Description of problem:
I've discovered that the only way to get either suspend to RAM or hibernate to disk to work on my Ferrari 4005 laptop is to append "noapic irqpoll acpi_irq_balance" to the kernel parameters at boot. However, this renders my system unstable and it will freeze on average once a day.

Version-Release number of selected component (if applicable):
kernel-2.6.16-1.2111_FC5, although I discovered the above parameters around 2.6.14 and have had the same results with previous kernels.

How reproducible:
Lockups are fairly random, although they do seem to happen under heavy load. I've enabled the magic sysrq key and can sometimes use that to reboot, although I've not been able to unmount, which suggests something's up with IDE access. I've once been in runlevel 3 and seen the results of the panic, but haven't managed to capture it yet.

Steps to Reproduce:
1. Boot with the above parameters
2. Work away.

Actual results:
System freezes. If in X, then everything locks up and sometimes the LEDs will blink. Once in the console I saw a panic.

Expected results:
Suspend / hibernate working without rendering the system unstable.

Additional info:
Without the kernel parameters, neither suspend nor hibernate work. Initiating suspend to RAM will suspend the laptop, but on resume it seems to get no further than accessing the CD. A few months ago I attempted to trace whereabouts in the wakeup code the kernel had gotten to by pasting some "beep" code -- I got as far as tracing it into the code which iterates over each device and wakes it up, but couldn't figure out how to get much further.

I've recently installed kdump/kexec. Forcing the kernel to crash without using the extra kernel parameters results in the kexec'ed kernel not getting past some IRQ problem -- I'll switch off the quiet boot option and attach a screenshot. Adding the extra kernel parameters above allows the kexec'ed kernel to boot and dump an image as it should, however I've not managed to get this to happen yet when my system freezes.

Any help or pointers gratefully received!

Alex.
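Since the behaviour described above depends entirely on which parameters the kernel was booted with, it can help to confirm them before each test. The following is a minimal sketch, not part of the original report: the function name `missing_boot_params` and the sample command line (including `root=LABEL=/`) are illustrative assumptions. Only the three parameters themselves come from the description.

```python
def missing_boot_params(cmdline, required=("noapic", "irqpoll", "acpi_irq_balance")):
    """Return the required parameters that do not appear on a kernel command line.

    `cmdline` is the whitespace-separated command line as a single string.
    The function name and default list are illustrative, not from the report.
    """
    present = set(cmdline.split())
    return [param for param in required if param not in present]


# Hypothetical command line with one of the three parameters left out.
sample = "ro root=LABEL=/ rhgb quiet noapic irqpoll"
print(missing_boot_params(sample))  # ['acpi_irq_balance']
```

On a running system the string to check would normally be the contents of /proc/cmdline.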
Comment 1 rizo83 2006-09-26 19:55:22 UTC
I can confirm this on a Ferrari 4005 with FC5 i686 using 2.6.17-1.2187_FC5 kernel. Suspend to disk wasn't working until I appended "noapic irqpoll acpi_irq_balance" to the kernel. On resume there are a lot of "unexpected IRQ trap at vector 79" and "unexpected IRQ trap at vector 89" in dmesg.
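To put a number on "a lot", the messages can be tallied per vector from saved dmesg output. This is a rough sketch under assumptions: the helper name `count_irq_traps` is made up for illustration, the sample text is invented apart from the message wording quoted in the comment above, and the vector is kept as the string the kernel printed rather than converted to a number.

```python
import re
from collections import Counter

# Matches the message quoted in the comment; the vector is captured as printed.
TRAP_RE = re.compile(r"unexpected IRQ trap at vector ([0-9a-fA-F]+)")


def count_irq_traps(log_text):
    """Count 'unexpected IRQ trap' messages per vector in a block of log text."""
    return Counter(TRAP_RE.findall(log_text))


# Invented sample; only the message format is taken from the report.
sample_log = """\
unexpected IRQ trap at vector 79
unexpected IRQ trap at vector 89
unexpected IRQ trap at vector 79
"""
print(count_irq_traps(sample_log))  # Counter({'79': 2, '89': 1})
```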
Comment 2 Alex Tucker 2006-10-02 16:42:22 UTC
Ok, I found where the kernel-debuginfo RPMs were hidden and have managed to get kexec to write a crash dump and use "crash" to at least get a back trace of the panic on the latest 2.6.17-1.2187_FC5 kernel. The following panic happened after booting with "noapic irqpoll acpi_irq_balance" after using the machine for about 10 minutes:

KERNEL: /usr/lib/debug/lib/modules/2.6.17-1.2187_FC5/vmlinux
DUMPFILE: /var/crash/2006-10-02-16:42/vmcore
CPUS: 1
DATE: Mon Oct 2 16:42:02 2006
UPTIME: 01:11:33
LOAD AVERAGE: 0.51, 0.60, 0.60
TASKS: 154
NODENAME: ferrari.floop.org.uk
RELEASE: 2.6.17-1.2187_FC5
VERSION: #1 SMP Mon Sep 11 01:16:59 EDT 2006
MACHINE: x86_64 (795 Mhz)
MEMORY: 2 GB
PANIC: ""
PID: 0
COMMAND: "swapper"
TASK: ffffffff80542dc0 [THREAD_INFO: ffffffff806c8000]
CPU: 0
STATE: TASK_RUNNING (PANIC)

PID: 0 TASK: ffffffff80542dc0 CPU: 0 COMMAND: "swapper"
#0 [ffffffff80603a60] crash_kexec at ffffffff802ab31d
#1 [ffffffff80603ae8] crash_kexec at ffffffff802ab335
#2 [ffffffff80603b70] crash_kexec at ffffffff802ab31d
#3 [ffffffff80603b98] bust_spinlocks at ffffffff80281d2d
#4 [ffffffff80603ba8] panic at ffffffff8028fb7e
#5 [ffffffff80603c08] module_text_address at ffffffff802a3b95
#6 [ffffffff80603c28] kernel_text_address at ffffffff8029d837
#7 [ffffffff80603c38] show_trace at ffffffff8027049c
#8 [ffffffff80603c78] dump_stack at ffffffff80270505
#9 [ffffffff80603c98] spin_bug at ffffffff80213c24
#10 [ffffffff80603cb8] _raw_spin_lock at ffffffff8020762f
#11 [ffffffff80603cc8] note_interrupt at ffffffff802b325a
#12 [ffffffff80603d18] __do_IRQ at ffffffff802b2c96
#13 [ffffffff80603d58] do_IRQ at ffffffff802713c6
#14 [ffffffff80603e08] ide_do_request at ffffffff8020ea4d
#15 [ffffffff80603e30] ide_do_request at ffffffff8020ea48
#16 [ffffffff80603e60] freed_request at ffffffff8032b084
#17 [ffffffff80603e80] ide_end_request at ffffffff8020aa18
#18 [ffffffff80603ed0] ide_intr at ffffffff8020d3f5
#19 [ffffffff80603f00] note_interrupt at ffffffff802b32a7
#20 [ffffffff80603f50] __do_IRQ at ffffffff802b2c96
#21 [ffffffff80603f90] do_IRQ at ffffffff802713c6
--- <IRQ stack> ---
#22 [ffffffff806c9eb8] ret_from_intr at ffffffff80262eea
[exception RIP: acpi_processor_idle+502]
RIP: ffffffff8037578c RSP: ffffffff806c9f60 RFLAGS: 00000246
RAX: ffffffff806c9fd8 RBX: ffffffff80375596 RCX: 00000000ca3cefe4
RDX: 0000000000008008 RSI: 00000000ca3cfc67 RDI: 0000000000000000
RBP: 000000000008e000 R8: ffffffff806c8000 R9: 00000065e363583e
R10: 0000000000000000 R11: ffffffff802677b6 R12: ffff81007dd028f0
R13: ffff81007dd02800 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: fffffffffffffff4 CS: 0010 SS: 0018
#23 [ffffffff806c9f68] notifier_call_chain at ffffffff8026bfab
#24 [ffffffff806c9fa8] cpu_idle at ffffffff8024c7b7

Without the additional kernel parameters, neither suspend nor hibernate work, nor can I get either process to produce a crash dump (magic sysrq on, but no response even to alt-sysrq-c).

Please let me know if there's any further info that could help in the crash dump, otherwise I'll remove it after a week or so.
Comment 3 Alex Tucker 2006-10-03 09:07:22 UTC
Since the last panic suggested a problem in swapper, I turned off swapping (swapoff -a) and the machine stayed up for a bit longer, eventually giving the following panic after a few hours:

PID: 2260 TASK: ffff8100750b9860 CPU: 0 COMMAND: "Xgl"
#0 [ffffffff80603a20] crash_kexec at ffffffff802ab31d
#1 [ffffffff80603aa8] crash_kexec at ffffffff802ab335
#2 [ffffffff80603b30] crash_kexec at ffffffff802ab31d
#3 [ffffffff80603b58] bust_spinlocks at ffffffff80281d2d
#4 [ffffffff80603b68] panic at ffffffff8028fb7e
#5 [ffffffff80603bc8] module_text_address at ffffffff802a3b95
#6 [ffffffff80603be8] kernel_text_address at ffffffff8029d837
#7 [ffffffff80603bf8] show_trace at ffffffff8027049c
#8 [ffffffff80603c38] dump_stack at ffffffff80270505
#9 [ffffffff80603c58] spin_bug at ffffffff80213c24
#10 [ffffffff80603c78] _raw_spin_lock at ffffffff8020762f
#11 [ffffffff80603c88] note_interrupt at ffffffff802b325a
#12 [ffffffff80603cd8] __do_IRQ at ffffffff802b2c96
#13 [ffffffff80603d18] do_IRQ at ffffffff802713c6
#14 [ffffffff80603d20] try_to_wake_up at ffffffff8024a2f9
#15 [ffffffff80603dc8] ide_outb at ffffffff802076e8
#16 [ffffffff80603df0] cdrom_start_packet_command at ffffffff8022f663
#17 [ffffffff80603e30] ide_do_request at ffffffff8020ed01
#18 [ffffffff80603e60] cdrom_decode_status at ffffffff80254e53
#19 [ffffffff80603e90] cdrom_pc_intr at ffffffff8025494c
#20 [ffffffff80603ed0] ide_intr at ffffffff8020d3f5
#21 [ffffffff80603f00] note_interrupt at ffffffff802b32a7
#22 [ffffffff80603f50] __do_IRQ at ffffffff802b2c96
#23 [ffffffff80603f90] do_IRQ at ffffffff802713c6
--- <IRQ stack> ---
#24 [ffff8100751dff58] ret_from_intr at ffffffff80262eea
RIP: 000000000045170a RSP: 00007fff44cdd250 RFLAGS: 00000246
RAX: 00007fff44cdd260 RBX: 0000000000ca1510 RCX: 00000000007ac2d0
RDX: 00007fff44cdd270 RSI: 0000000000000266 RDI: 00000000000000b8
RBP: 00000000000000b8 R8: 00000000007ac450 R9: 0000000000000018
R10: 0000000000000000 R11: 00002aaaaaeb4030 R12: 0000000000000266
R13: 000000000097e7a0 R14: 0000000000680450 R15: 00000000009843e0
ORIG_RAX: fffffffffffffff4 CS: 0033 SS: 002b

Alex.
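The panics in comments 2 and 3 come from different tasks ("swapper" and "Xgl"), so it is worth checking which frames they have in common. The sketch below is one possible way to do that. The regular expression assumes the "#N [stack] function at address" frame layout seen in those backtraces, the helper names `frame_functions` and `shared_functions` are illustrative, and the two sample strings are excerpts of the traces rather than the full output.

```python
import re

# One frame of crash-style backtrace output: "#N [stack-addr] function at text-addr".
FRAME_RE = re.compile(r"#(\d+)\s+\[([0-9a-f]+)\]\s+(\S+)\s+at\s+([0-9a-f]+)")


def frame_functions(backtrace):
    """Return the function names of all frames, in the order they appear."""
    return [match.group(3) for match in FRAME_RE.finditer(backtrace)]


def shared_functions(trace_a, trace_b):
    """Functions present in both traces, ordered as in trace_a, without repeats."""
    in_b = set(frame_functions(trace_b))
    seen, shared = set(), []
    for name in frame_functions(trace_a):
        if name in in_b and name not in seen:
            seen.add(name)
            shared.append(name)
    return shared


# Excerpt of the "swapper" panic (comment 2).
SWAPPER = """
#9 [ffffffff80603c98] spin_bug at ffffffff80213c24
#10 [ffffffff80603cb8] _raw_spin_lock at ffffffff8020762f
#11 [ffffffff80603cc8] note_interrupt at ffffffff802b325a
#12 [ffffffff80603d18] __do_IRQ at ffffffff802b2c96
#13 [ffffffff80603d58] do_IRQ at ffffffff802713c6
#14 [ffffffff80603e08] ide_do_request at ffffffff8020ea4d
#18 [ffffffff80603ed0] ide_intr at ffffffff8020d3f5
"""

# Excerpt of the "Xgl" panic (comment 3).
XGL = """
#9 [ffffffff80603c58] spin_bug at ffffffff80213c24
#10 [ffffffff80603c78] _raw_spin_lock at ffffffff8020762f
#11 [ffffffff80603c88] note_interrupt at ffffffff802b325a
#12 [ffffffff80603cd8] __do_IRQ at ffffffff802b2c96
#13 [ffffffff80603d18] do_IRQ at ffffffff802713c6
#16 [ffffffff80603df0] cdrom_start_packet_command at ffffffff8022f663
#17 [ffffffff80603e30] ide_do_request at ffffffff8020ed01
#20 [ffffffff80603ed0] ide_intr at ffffffff8020d3f5
"""

print(shared_functions(SWAPPER, XGL))
```

For these excerpts the shared frames are the interrupt path (do_IRQ, __do_IRQ, note_interrupt), the spinlock check (_raw_spin_lock, spin_bug) and the IDE handlers (ide_intr, ide_do_request), whichever task happened to be running. That is a pattern in the output, not a diagnosis.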
Comment 4 Alex Tucker 2006-10-05 11:47:27 UTC
Googling around I spotted some patches Linus developed to help him debug the suspend/resume cycle on his Mac mini. It looks as though these may be rolled into 2.6.18 at some point. Is Red Hat going to include these patches? They would certainly help more than "beeps" to figure out where things are going wrong with resume on my laptop.
Comment 5 Dave Jones 2006-10-16 18:24:26 UTC
A new kernel update has been released (Version: 2.6.18-1.2200.fc5) based upon a new upstream kernel release. Please retest against this new kernel, as a large number of patches go into each upstream release, possibly including changes that may address this problem. This bug has been placed in NEEDINFO state. Due to the large volume of inactive bugs in bugzilla, if this bug is still in this state in two weeks time, it will be closed. Should this bug still be relevant after this period, the reporter can reopen the bug at any time. Any other users on the Cc: list of this bug can request that the bug be reopened by adding a comment to the bug. In the last few updates, some users upgrading from FC4->FC5 have reported that installing a kernel update has left their systems unbootable. If you have been affected by this problem please check you only have one version of device-mapper & lvm2 installed. See bug 207474 for further details. If this bug is a problem preventing you from installing the release this version is filed against, please see bug 169613. If this bug has been fixed, but you are now experiencing a different problem, please file a separate bug for the new problem. Thank you.
Comment 6 Alex Tucker 2006-10-17 14:59:38 UTC
Ok, I've updated to 2.6.18-1.2200.fc5. It still exhibits the same behaviour as the previous kernel (tried suspend with both extra parameters and without), although it did take longer to crash (1 day). Once I get kdump working again (it's complaining about "overlapping memory segments"), I will try to capture the backtrace from the kernel panic. Are the TRACE_DEVICE and TRACE_RESUME macros defined in the sources for this kernel?
Comment 7 Dave Jones 2006-10-19 18:56:58 UTC
No, that functionality will trash the real time clock each boot, which obviously isn't viable in a production kernel. If you're not averse to rebuilding kernels though, this would be a good method of finding out exactly where things are dying.

Looking at the backtraces, I notice we're hitting spin_bug in the ide layer. The only other times I've seen this happen in recent times were when people had binary-only graphics drivers loaded. Is that the case here?
Comment 8 Alex Tucker 2006-10-19 21:16:22 UTC
(In reply to comment #7)
> No, that functionality will trash the real time clock each boot, which obviously
> isn't viable in a production kernel.

Granted. I did see a further patch to add a /sys/power/something to enable the trace.

> If you're not averse to rebuilding kernels though, this would be a good method
> of finding out exactly where things are dying.

Once upon a time it was simple to apply patches, but now there's git and I have to learn something new before I can get there. I'll give it a go when I get a moment.

I'm also still trying to get kdump working again after the kernel upgrade from 2.6.17 to 18, so that I can capture the kernel panic. Starting kdump complains about "overlapping memory segments". I tried fiddling with the boot time parameters to crashkernel, to no avail. Nor can I find much in the way of documentation. The current setting is crashkernel=128M@16M, which gives: "Overlapping memory segments at 0x145b000, sort_segments failed".

> Looking at the backtraces, I notice we're hitting spin_bug in the ide layer.
> The only other times I've seen this happen in recent times were when people had
> binary-only graphics drivers loaded. Is that the case here?

It is. Although, when I've tested suspend/hibernate without the "noapic irqpoll acpi_i..." gubbins, I've done it in single user mode after removing as many modules as possible. However, without the gubbins, I don't even get a kernel panic on resume, just a locked up laptop.

I shall test with the open-source driver and let you know if things are stable, and if so we can close this issue, although I'd obviously like to get a stable laptop with fancy graphics :)
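For reference, the crashkernel=128M@16M setting quoted above asks for 128 MiB of memory starting at the 16 MiB mark. The sketch below works out that range and checks where the address from the error message falls. The helper names `parse_size` and `crashkernel_range` are illustrative, and only the simple "size@offset" form used in the comment is handled.

```python
# Binary size suffixes as used in the "size@offset" value.
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text):
    """Convert a size such as '128M' into bytes; a bare number is taken as bytes."""
    suffix = text[-1].upper()
    if suffix in UNITS:
        return int(text[:-1]) * UNITS[suffix]
    return int(text)


def crashkernel_range(value):
    """Return (start, end) in bytes for a 'size@offset' value; end is exclusive."""
    size, offset = value.split("@")
    start = parse_size(offset)
    return start, start + parse_size(size)


start, end = crashkernel_range("128M@16M")
reported = 0x145B000  # address from the "Overlapping memory segments" message

print(hex(start), hex(end))     # 0x1000000 0x9000000
print(start <= reported < end)  # True
```

So the reported address lies inside the reserved region, about 4.4 MiB above its start. This shows only where the address sits. It does not say which segments overlap or why.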
Comment 9 Alex Tucker 2006-10-21 12:48:23 UTC
(In reply to comment #7)
> Looking at the backtraces, I notice we're hitting spin_bug in the ide layer.
> The only other times I've seen this happen in recent times were when people had
> binary-only graphics drivers loaded. Is that the case here?

I've now tried with the open source radeon driver, rather than the ATI binary-only one, but still the system is unstable and will panic after a while. I still haven't managed to get kdump working either, so can't give a backtrace.

Some more notes, in case they're of any use:

* Straight after booting, the kernel emits the following message, "..MP-BIOS bug: 8254 timer not connected to IO-APIC", unless I append the "noapic irqpoll..." gubbins as boot time parameters to the kernel.

* I get the following warning in /var/log/messages, which I've not noticed before:

PCI: Transparent bridge - 0000:00:14.4
PCI: Bus #07 (-#0a) is hidden behind transparent bridge #06 (-#07) (try 'pci=assign-busses')
Please report the result to linux-kernel to fix this permanently

* With the "gubbins" appended, after waking from a suspend, there are some minor ACPI errors:

ACPI Exception (evregion-0424): AE_TIME, Returned by Handler for [EmbeddedControl]
ACPI Exception (dswexec-0458): AE_TIME, While resolving operands for [OpcodeName unavailable]
ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.ACAD._PSR] (Node ffff81007fe9f690), AE_TIME
ACPI Exception (acpi_ac-0096): AE_TIME, Error reading AC Adapter state

* With the gubbins added, I also get frequent errors reported from the ide driver:

hdc: cdrom_pc_intr: The drive appears confused (ireason = 0x01). Trying to recover by ending request.

* Again with the gubbins added, there seems to be a minor panic (system continues) after resume:

BUG: sleeping function called from invalid context at kernel/rwsem.c:20
in_atomic():0, irqs_disabled():1
Call Trace:
[<ffffffff80269387>] show_trace+0x34/0x47
[<ffffffff802693ac>] dump_stack+0x12/0x17
[<ffffffff8029dcd2>] down_read+0x15/0x23
[<ffffffff802962c0>] blocking_notifier_call_chain+0x13/0x36
[<ffffffff803fcfc5>] cpufreq_resume+0x129/0x14c
[<ffffffff803a347e>] __sysdev_resume+0x2a/0x66
[<ffffffff803a362c>] sysdev_resume+0x1d/0x63
[<ffffffff803a8025>] device_power_up+0x9/0xf
[<ffffffff802a5f3c>] suspend_enter+0x3e/0x47
[<ffffffff802a6088>] enter_state+0x143/0x19b
[<ffffffff802a614f>] state_store+0x5e/0x79
[<ffffffff802fb11c>] sysfs_write_file+0xca/0xf9
[<ffffffff802162e7>] vfs_write+0xce/0x174
[<ffffffff80216b6e>] sys_write+0x45/0x6e
[<ffffffff8025c341>] tracesys+0xd1/0xdc
DWARF2 unwinder stuck at tracesys+0xd1/0xdc

* Finally, I am running ndiswrapper (I take it this is just as evil as ATI's binary driver) as the only way to get my wireless running. The bcm43xx driver used to work until I upped my memory to 2GB.
Comment 10 Alex Tucker 2006-10-22 18:40:48 UTC
After John Linville's suggestion to try his patched kernel for the bcm43xx driver, I've now tried without ndiswrapper too, but still eventually the kernel panics.
Comment 11 Philip Trickett 2006-10-31 21:47:59 UTC
Ok, just for the record, similar problems exist on FC6.

[phil@nori ~]$ uname -a
Linux nori 2.6.18-1.2798.fc6 #1 SMP Mon Oct 16 14:39:22 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux

Using the free radeon driver and the bcm43xx driver as well, no binary only modules loaded. I am going to try with the options mentioned on boot to see if that changes anything and will report back.

Is it OK to attach to this bug for FC6 or shall I create a new one?
Comment 12 Alex Tucker 2006-12-18 09:54:04 UTC
I moved to FC6, but still had the same issues. Indeed, the laptop is only stable if I run with the open source radeon driver without DRI and the closed ndiswrapper as bcm43xx still panics on loading. I've tried all RH kernels up to 2.6.18-1.2849 as well as John Linville's test kernels.

Now, since I had a few moments to burn, I've compiled the vanilla 2.6.19.x kernel using the old config from the RH kernels as a start and defaulting much of the rest of the config. The laptop now suspends and resumes properly with both suspend-to-ram and suspend-to-disk using pm-utils (although not with the proprietary ATI drivers) and even bcm43xx seems to reluctantly work and at least doesn't cause a panic.

If I get a few more moments to burn over Christmas, I could try to figure out whether it's the move from 2.6.18 -> 2.6.19 that solved it, or whether it's a custom RH patch that kills it.
Comment 13 Alex Tucker 2007-01-04 17:50:14 UTC
Right, I just checked the 15th December release, kernel-2.6.18-1.2868.fc6, and got the same results. Interestingly, since with the vanilla 2.6.19.x kernel I can get suspend to disk to work, I can see that a major difference is that with the RH kernel the screen just goes black when I run pm-hibernate, while on the 2.6.19.x kernel I get loads of messages about compressing stuff and writing the results to swap.

Also, the bcm43xx driver still panics my system with the RH kernel. It works (after a fashion) ok in the vanilla 2.6.19.x kernel.

I'm going to wait until a 2.6.19 series RH kernel comes out before testing again, rather than trying to wrangle with the diffs between 2.6.18 and 2.6.19.x as well as the RH diffs.
Comment 14 Alex Tucker 2007-01-22 22:05:19 UTC
The latest kernel 2.6.19-1.2895.fc6 has fixed this to the extent that pm-hibernate and pm-suspend work most of the time, modulo some strange ACPI timeout errors which I'll raise in another bug. The bcm43xx issue with > 2GB is gone too, so this is with all open source drivers, albeit with less wireless range and no 3D acceleration.