144415 – kernel-2.6.9-1.724_FC3 breaks APM suspend on Thinkpad

Keywords:
Status:	CLOSED CANTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	3
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Dave Jones
QA Contact:
Docs Contact:
URL:
Whiteboard:
Duplicates (2):	145203 146457
Depends On:
Blocks:

Reported:	2005-01-06 20:15 UTC by Matthew Saltzman
Modified:	2015-01-04 22:15 UTC
CC List:	20 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2005-09-30 10:34:44 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
Oops - init (4.92 KB, text/plain) 2005-01-12 18:02 UTC, Need Real Name
patch1 for 2.6.10 - see comment #6 (880 bytes, patch) 2005-01-13 16:15 UTC, Need Real Name
patch2 for 2.6.10 - see comment #6 (529 bytes, patch) 2005-01-13 16:16 UTC, Need Real Name
"Show PC" output after resume has crashed the machine (933 bytes, text/plain) 2005-01-20 13:03 UTC, Ian Collier
text of kernel panic on resume from APM suspend (842 bytes, text/plain) 2005-01-28 11:46 UTC, Ian Collier
patch that converts FC4 kernel specfile for FC3 recompile (683 bytes, patch) 2005-02-07 20:34 UTC, Barry K. Nathan
UNTRIED patch for sake of experimentation (938 bytes, patch) 2005-03-06 19:52 UTC, Jesse Glick
Show PC output from crash in 2.6.10-1.1126 (1.00 KB, text/plain) 2005-04-14 16:04 UTC, Ian Collier
Replacement spinlock-debug-panic patch (1.82 KB, patch) 2005-04-14 16:16 UTC, Ian Collier
spinlock trace from /var/log/messages (2.66 KB, text/plain) 2005-04-22 20:08 UTC, Satish Balay

Description Matthew Saltzman 2005-01-06 20:15:16 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5)
Gecko/20041111 Firefox/1.0

Description of problem:
Thinkpad T41 hangs on resume from APM suspend-to-ram (with caps-lock
light flashing).  Worked fine with kernel-2.6.9-1.681_FC3 and earlier.

Version-Release number of selected component (if applicable):
kernel-2.6.9-1.724_FC3

How reproducible:
Always (At least always so far for me.)

Steps to Reproduce:
1. Install kernel-2.6.9-1.724_FC3 on Thinkpad.  Set acpi=off kernel
parameter.
2. Suspend to RAM
3. Resume
    

Actual Results:  Machine locks hard, before display restores.  The
caps-lock light flashes.  Power-off is required to reset.

Expected Results:  Machine resumes normally

Additional info:

Another report of this bug suggested that it was intermittent and that
disabling lm_sensors appeared to fix the problem.  I have not found
either to be the case.  It fails every time for me, and disabling
lm_sensors had no effect.

Comment 1 Satish Balay 2005-01-06 20:33:41 UTC

I've encountered this issue with kernel-2.6.10-1.1063_FC4 from rawhide
(on FC3/T40)

Comment 2 Ed Hill 2005-01-06 20:53:30 UTC

I'm seeing *exactly* the same problem (including the flashing
caps-lock) on my ThinkPad A22p (2629-Y1U).  It happens frequently, but
not on every suspend-resume cycle.  I suspect that the bug was
introduced between 2.6.9-1.681_FC3 and 2.6.9-1.724_FC3 because the
latter is the first kernel that triggered the problem.

Comment 3 André Johansen 2005-01-07 20:30:47 UTC

I see the same effect on my workstation; .681 works fine, .724 hangs  
on resume, and the lights on the keyboard just blink.  
 
(HW: Athlon Thunderbird on an ABit KT7A motherboard.)

Comment 4 Satish Balay 2005-01-09 15:14:55 UTC

kernel-2.6.10-1.735_FC3 is still broken (on T40)

Comment 5 Sammy 2005-01-10 14:53:27 UTC

This bug is also a major problem for people using SWSUSP2. There, ACPI
suspend to disk has the same problem on resume. This is with vanilla
2.6.10 and the swsusp2 patch. It seems to hit predominantly ThinkPads (I
have a T42), but it is affecting others too.

Comment 6 Barry K. Nathan 2005-01-10 15:25:00 UTC

Re: comment #5

Try 2.6.10 with the following two patches:
http://www.ussg.iu.edu/hypermail/linux/kernel/0501.0/0284.html
http://www.ussg.iu.edu/hypermail/linux/kernel/0501.0/0548.html

(or you can run 2.6.10-mm2 instead of 2.6.10 + the two patches)

I have no idea if these patches are compatible with swsusp2, so you
may need to test without swsusp2 if that's possible.

Comment 7 Matthew Saltzman 2005-01-10 23:51:58 UTC

So far, things seem to be working with latest release
kernel-2.6.10-1.737_FC3.

Comment 8 Satish Balay 2005-01-11 00:39:51 UTC

All my crashes happened with overnight suspend. I've had 736 crash on
me - and it doesn't look very different than 737 (+ debugging)

Will try 737 tonight and see what happens..

Comment 9 Satish Balay 2005-01-11 14:09:40 UTC

Ok - 737 also crashes.. Not sure how 'overnight' suspend is different
from 'short' suspend of minutes/1hour.

There is a note about the bios setting  "Hibernate after suspend
expires -> Enabled" causing this problem - but I don't have this
enabled. So something else (which must be similar) is causing this
problem.

http://mailman.linux-thinkpad.org/pipermail/linux-thinkpad/2005-January/022832.html

Comment 10 Matthew Saltzman 2005-01-11 21:30:11 UTC

It's true, kernel-2.6.10-1.737_FC3 still hangs on resume (this time
after about 3 hours in suspension).  Interestingly, the caps lock LED
doesn't flash now.

Comment 11 Matthew Saltzman 2005-01-12 02:44:12 UTC

Probably to nobody's surprise, kernel-2.6.10-1.1076_FC4 also fails as
above.

Comment 12 Satish Balay 2005-01-12 06:59:39 UTC

I tried using 2.6.10-ac8 - and it survived a 6 hour suspend.

I guess I'll have to wait and see if it can survive a few more long suspends
before assuming this version is free of this bug..

Basically I grabbed
http://kernel.org/pub/linux/kernel/people/alan/linux-2.6/2.6.9/RPM/SRPMS.kernel/kernel-2.6.10-0.ac7.1.src.rpm
and rebuilt it with 'ac-8' patch.

Comment 13 Satish Balay 2005-01-12 14:44:24 UTC

ok - 2.6.10-ac8 survived another 6hour suspend.

Comment 14 Matthew Saltzman 2005-01-12 15:37:07 UTC

Just curious, is there anything in the comments for -ac8 that indicate they
fixed something related to this?  Or any other differences between -ac* and the
-bk* series they seem to use for FC?

Comment 15 Satish Balay 2005-01-12 16:28:43 UTC

I just realized I was using ac7 not ac8. (error building kernel rpm on my part)

And kernel-2.6.10-1.737_FC3 is supposed to be 2.6.10-ac8 + additional patches -
so I'm guessing the bug is in some of the fedora specific patches in 737. (I'm
assuming the differences between ac7 and ac8 don't affect this bug)

Comment 16 Need Real Name 2005-01-12 18:02:25 UTC

Created attachment 109680
Oops - init

Comment 17 Need Real Name 2005-01-12 18:04:55 UTC

Can confirm that kernel-2.6.10-1.737_FC3 with swsusp2 2.1.5.12 displays a
similar problem on my Thinkpad T40 - Oops on init, capslock flashing, system
still functional but init taking >90% cpu and unresponsive to telinit, reboot etc.

Have tried to apply the patches at comment #6, which I can get to apply cleanly
with some line number changes, but they make no difference - same symptoms.

I know swsusp2 isn't part of FC3 - but oops attached (above! oops indeed) if of
help to anyone.

Comment 18 Barry K. Nathan 2005-01-12 18:26:56 UTC

Would you mind trying again without swsusp2, with just the kernel's built-in
swsusp1 + the comment #6 patches? I'm not suggesting this as a permanent change,
just as another test to gather more information about the problem.

Comment 19 Need Real Name 2005-01-13 16:11:43 UTC

OK, I've done so - complete recompile from SRPM, without and with the patches
above. Note that I haven't tried swsusp1 with 2.6.10, as I've been happily using
swsusp2 - both with and without patches worked fine for me for suspend to ram
and disk on a Thinkpad T40.  Both messed up X on resume, which I can't recall
how to fix in swsusp1 - been using swsusp2 for too long.

I'll attach versions of the two patches above reformatted and line-numbered for
2.6.10 plain kernel, if they're helpful to someone here.

Still working on swsusp2 and my oops above - my init-in-a-spin problem may be
different (or only vaguely related) to the main problem here.

Comment 20 Need Real Name 2005-01-13 16:15:17 UTC

Created attachment 109723
patch1 for 2.6.10 - see comment #6

Comment 21 Need Real Name 2005-01-13 16:16:14 UTC

Created attachment 109724
patch2 for 2.6.10 - see comment #6

Comment 22 Matthew Saltzman 2005-01-14 14:17:40 UTC

Problem persists with kernel-2.6.10-1.741_FC3, even though that is supposed to
include -ac9 patches.  A one-hour suspend produced a hard freeze on resume.

Comment 23 Barry K. Nathan 2005-01-14 14:57:13 UTC

Re: comment #22

It's always possible that the problem is being introduced by one of the FC3
patches...

Comment 24 Thomas Roessler 2005-01-16 12:16:19 UTC

*** Bug 145203 has been marked as a duplicate of this bug. ***

Comment 25 Satish Balay 2005-01-16 16:33:26 UTC

Just an update..

I haven't seen this problem yet since moving to the 'ac' kernel builds.
Currently using ac9. (Earlier I've used ac7, ac8)

Comment 26 Enrique Gomezdelcampo 2005-01-17 16:12:42 UTC

Problem is not limited to the Thinkpad. I have a Dell Latitude C840 and I have
an identical problem. Kernel 2.6.10-1.741_FC3 still has the problem with this
laptop too.

Comment 27 Lance A. Brown 2005-01-17 17:45:05 UTC

Confirmed.  741_FC3 locks up my T41 Thinkpad.

Comment 28 Ian Collier 2005-01-20 13:01:40 UTC

I used to use 2.2.x on a ThinkPad 380 and APM worked perfectly.
Since I upgraded to a ThinkPad T22 with Fedora Core 2 I've had only
limited success (see for example bug 13095).  ACPI seemed a
non-starter when I first tried it, so I don't know whether it has
improved since then.

On upgrading from 2.6.9-1.6_FC2 to 2.6.10-1.9_FC2 it appeared that
most of the issues had been resolved as the machine did a number of
textbook suspend/resume cycles in situations that hadn't worked
before.  However, I then discovered that suspending for more than a
few minutes reliably kills the machine.  There are two possible
results, which seem to occur at random: in one the machine seems
completely dead although SysRq can be used to reboot it; in the other
the caps-lock and scroll-lock LEDs flash and if I jiggle the SysRq key 
enough I can get it to print stuff on the screen - though none of the
actions work except reboot.

I tried booting with serial console enabled and logging what came
through on another machine, but I didn't get anything at all after
the machine was suspended.  Anyway, to get to the actual point of this
comment, I did copy down the results of SysRq's "show PC" function in
the hope that it would be useful.  The appearance of RTC functions in
the traceback does seem to tally with the idea that it's dependent on
how long the machine was suspended for.  I hope there aren't too many
transcription errors in the output, which I'll attach in just a
moment.

Comment 29 Ian Collier 2005-01-20 13:03:41 UTC

Created attachment 110004
"Show PC" output after resume has crashed the machine

See comment 28.

Comment 30 Pete Toscano 2005-01-20 15:08:16 UTC

Just adding a "me too" for my T40 (2373-94U).  Short APM suspends recover fine,
but longer ones lead to a kernel panic.

Comment 31 Pete Toscano 2005-01-21 22:33:18 UTC

Just an extra data point.  If I switch to a VC before I do a long suspend, when
I resume, I get a kernel panic on the screen.  This is some of what I see:

====================
Warning: CPU frequency is 16000000, cpufreq assumed 600000 kHz.
Kernel panic - not syncing: arch/i386/kernel/time.c:178:
spin_lock(arch/i386/kernel/time.c:c0342be8) already locked by
arch/i386/kernel/time.c/310

 Badness in panic at kernel/panic.c:117
 [what looks like a stack trace]
====================

This is a 1.6GHz Pentium M with SpeedStep enabled, but it went from an AC
connection to another AC connection, so it shouldn't have down-shifted to
600MHz.  Then again, that warning might just be a red herring.

Comment 32 Ian Collier 2005-01-28 11:46:26 UTC

Created attachment 110348
text of kernel panic on resume from APM suspend

I'm duplicating comment 31 but without the initial cpufreq complaint.
The attached is the entire text of the kernel panic which appears on screen
after resuming the machine.  In this case the caps and scroll lock lights were
not flashing (and I have another failure almost exactly the same but with some
minor differences at the bottom of the trace).

Comment 33 Satish Balay 2005-01-28 16:24:07 UTC

Thanks to comment 31 I've been able to switch to VT-1 before suspend
(This is with 2.6.10-1.753_FC3 on a thinkpad 600E - 366MHz P-II). 
In my case - the caps & scroll-lock lights blink.

I get the same stack trace as the attachment in comment 32.. Something
like (written down manually):

[<c0112dc0>] suspend+0x3c6/0x513
             do_ioctl
             recalc_task_prio
             sys_ioctl
             syscall_call
Badness in i8042_panic_blink at drivers/input/serio/i8042.c:917
             i8042_panic_blink
             panic
             set_rtc_mmss
             timer_interrupt
             handle_IRQ_event
             __do_IRQ
             do_IRQ
=============================================
             common_interrupt
             get_cmos_time
             timer_resume
             sysdev_resume
             device_power
             [few more lines which I couldn't write down - as the
screen went blank]

Comment 34 Matthew Saltzman 2005-01-30 01:40:32 UTC

My T41 survived a 2-hour suspend using kernel-2.6.10-1.1115_FC4 from
Rawhide.  Unfortunately, that kernel's built with gcc-3.4 so I can't
build VMware modules against it.  And it has too many Rawhide
dependencies for me to rebuild myself.

Comment 35 Matthew Saltzman 2005-02-03 00:02:43 UTC

But it still does *not* work with kernel-2.6.10-1.760_FC3, even though
it is rebased to 2.6.10-ac11.

So still no completely functional kernel on my Thinkpad.

Comment 36 Barry K. Nathan 2005-02-03 09:18:12 UTC

Based on what I'm reading here, I suspect the problem is being
triggered by one of the kernel-2.6.10-1.xxx_FC3 patches, aside from
the -ac patch.

Unfortunately I haven't been able to reproduce this on any of my
hardware (I don't have an IBM ThinkPad), otherwise I would be able to
narrow things down more.

Comment 37 Matthew Saltzman 2005-02-03 13:08:38 UTC

Anything else we can do to help test, let us know.

BTW, the devel kernel-2.6.10-1.1115_FC4 seems to work fine WRT
suspend-to-RAM, although it's not a complete solution for other reasons.

Comment 38 Ian Collier 2005-02-04 11:00:30 UTC

Here's a strange thing... my 2.6.9-1.6_FC2 kernel is now printing
spin_lock messages in the syslog.

My message log begins on Jan 2.  I have seven successful overnight
suspends, then on Jan 12:

Jan 12 00:18:03 starbright apmd[1481]: System Suspend
Jan 12 08:06:34 starbright kernel: arch/i386/kernel/time.c:178:
spin_lock(arch/i386/kernel/time.c:0235b028) already locked by
arch/i386/kernel/time.c/310
Jan 12 08:06:34 starbright kernel: arch/i386/kernel/time.c:317:
spin_unlock(arch/i386/kernel/time.c:0235b028) not locked

This happens every time except three since then, with the latest one
this morning.  I've no idea what's changed (if anything) or whether
the messages were happening before then (as the logs have been rotated
out of existence), but I've been running this kernel since Nov 24,
except for mid-January when I tried 2.6.10-1.9_FC2.  The difference
with 2.6.10 is linux-2.6.9-spinlock-debug-panic.patch which means that
instead of this message we get a kernel panic, as documented at length
in this bug report.

What this means is the underlying bug isn't new in late-2.6.9 and
2.6.10 kernels; only the panic is new.

(And this morning for the first time since installing the system last
May, I was hit by bug 142329 - grr!)

I am not a kernel hacker, but I would think that if it's possible to
block the timer interrupt while executing timer_resume() then that
would fix the problem.
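
For anyone wanting to check their own logs for the same pair of messages, here is a minimal sketch of a parser. It assumes the message format is exactly the two shapes quoted above (other kernels may word their spinlock-debug output differently), and the names `SPIN_RE` and `find_spinlock_events` are made up for this illustration.

```python
import re

# Pattern modelled only on the two message shapes quoted in this bug;
# other kernels may format spinlock-debug output differently (assumption).
SPIN_RE = re.compile(
    r"(?P<site>\S+:\d+):\s+"
    r"spin_(?P<op>lock|unlock)\((?P<lock>[^)]+)\)\s+"
    r"(?:already locked by (?P<owner>\S+)|not locked)"
)


def find_spinlock_events(text):
    """Return one dict per spinlock-debug message found in text."""
    # The long syslog lines are wrapped in this report, so collapse all
    # whitespace before matching.
    flat = " ".join(text.split())
    return [m.groupdict() for m in SPIN_RE.finditer(flat)]


# The Jan 12 messages from this comment, wrapped as they appear here.
SAMPLE = """\
Jan 12 08:06:34 starbright kernel: arch/i386/kernel/time.c:178:
spin_lock(arch/i386/kernel/time.c:0235b028) already locked by
arch/i386/kernel/time.c/310
Jan 12 08:06:34 starbright kernel: arch/i386/kernel/time.c:317:
spin_unlock(arch/i386/kernel/time.c:0235b028) not locked
"""

for event in find_spinlock_events(SAMPLE):
    # "owner" is None for the "not locked" message
    print(event["site"], event["op"], event["owner"])
```

Running it over the contents of /var/log/messages instead of SAMPLE should list every call site that tripped the check, which may help tell this bug apart from unrelated spinlock complaints.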

Comment 39 Satish Balay 2005-02-04 14:59:49 UTC

[Now that my T40 is back from IBM-repair] I've tried kernel-2.6.10-1.760_FC3
(rebuilt with CONFIG_X86_HZ=100) & kernel-2.6.10-0.ac11

The experience is similar to comment 38.

kernel-2.6.10-0.ac11 gives the following on APM resume
Feb  4 09:02:07 asterix kernel: arch/i386/kernel/time.c:178:
spin_lock(arch/i386/kernel/time.c:c03bebe8) already locked by
arch/i386/kernel/time.c/310
Feb  4 09:02:07 asterix kernel: arch/i386/kernel/time.c:317:
spin_unlock(arch/i386/kernel/time.c:c03bebe8) not locked

[root@asterix ~]# grep spin_lock /var/log/messages* | wc -l
42

kernel-2.6.10-1.760_FC3 gives:

        recalc_task_prio
        sys_ioctl
        syscall_call
Badness in i8042_panic_blink
        i8042_panic_blink
        panic
        set_rtc_mmss
        timer_interrupt
        handle_IRQ_event
        __do_IRQ
===================================
        common_interrupt
        get_cmos_time
        cpufreq_cpu_put
        cpufreq_resume
        timer_resume
        sysdev_resume
        device_power_up
        suspend
        do_ioctl
        recalc_task_prio
        sys_ioctl
        syscall_call

Comment 40 Matthew Saltzman 2005-02-07 15:49:26 UTC

Re: Comment #36: Note that it's not just Thinkpads.  #146457 looks
like a dup of this, and I've seen at least two reports of problems
with Dell Latitude C840s (one in comment #26 above).

Comment 41 David Eriksson 2005-02-07 16:31:02 UTC

DaveJ: please remove linux-2.6.9-spinlock-debug-panic.patch from
future kernels!

If not, maybe someone could update the patch so it is possible to turn
off this feature with a kernel parameter?

At http://people.redhat.com/davej/patchlist-fc3.txt the patch is
described like this:

"panic() instantly instead of printing a warning when spinlock
debugging is triggered. This reduces the possibility of silent data
corruption."

What "silent data corruption" is this? Is it related to the annoying
bug 142329?

And -- maybe most important -- why is this "spinlock debugging"
triggered at all?

Comment 42 Barry K. Nathan 2005-02-07 20:16:25 UTC

Re: comment #41

According to comment #37, a recent rawhide kernel isn't showing this
problem -- and I looked at that kernel's specfile a few days ago; it
still seems to be applying the spinlock debug panic patch.

In my next comment I'll post quick instructions for recompiling a
rawhide/FC4 kernel for FC3. Right now I don't have a convenient place
for posting compiled kernel binaries, so my instructions will have to do.

> What "silent data corruption" is this?

Depends on what causes the panic...

Comment 43 Barry K. Nathan 2005-02-07 20:34:20 UTC

Created attachment 110750
patch that converts FC4 kernel specfile for FC3 recompile

I decided it would be easier to do it as a patch than to write out instructions
for changing the specfile. Basically:

rpm -ivh kernel-2.6.XX-1.YYYY_FC4.src.rpm
cd /usr/src/redhat/SPECS (or wherever, if you've changed your RPM
configuration)
patch -p0 -i /path/to/kernel-spec-fc4-to-fc3.patch (i.e. this attachment)
rpmbuild -ba --target i686 kernel-2.6.spec

If/when bug 147281 is fixed, this patch will no longer be necessary (and will
probably no longer apply to the specfile either).


Anyway, this should let people recompile the FC4 kernels for FC3, without
dependency or gcc version problems. That way, other people can test and see if
the FC4 kernels really fix this bug.

Comment 44 Satish Balay 2005-02-08 16:39:55 UTC

I've tried 2.6.10-1.1126 on the T40 [with a couple of mods: 1000Hz -> 100Hz,
2.6.11-rc3-bk2 -> 2.6.11-rc3-bk4]

After an overnight suspend - I get a crash [capslock blink]. On VT-1 - The stack
scrolls by - and I see repeated prints of the form:

atkbd.c: Spurious %s on %s. Some program,like XFree86, might be trying access
hardware directly.

I had to powercycle the machine.

Comment 45 Ian Collier 2005-02-09 16:05:17 UTC

Kernel 2.6.10-1.1126 compiled for FC3 i686 as per comment 43:
http://users.comlab.ox.ac.uk/ian.collier/linuxkernel/
Obviously, since I haven't signed it, you use it at your own risk.

I installed it on my ThinkPad T22 last night [it's an FC2 system so
the post-install script failed - easily fixed by running
new-kernel-pkg manually].  Then suspended for 8 hours and got a
successful wakeup this morning with no untoward messages at all.  The
machine was on a text console at the time of the suspend (though a gdm
login screen was present on another VT) and my boot command line
contains "atkbd.reset" in case it matters.

Comment 46 Ranjan Maitra 2005-02-12 23:03:55 UTC

I installed Ian Collier's 2.6.10-1.1126 RPM on a Dell Latitude C840
and still apmsleep does not work. I gave it apmsleep +1:00 and it went
to sleep all right. However, it did not wake up. On pressing the power
button, the screen went on to its dull mode, and nothing else
happened. I did not set atkbd.reset=1, though I will try that.

Comment 47 Ranjan Maitra 2005-02-13 00:56:08 UTC

atkbd.reset=1 makes no difference....

Comment 48 Ranjan Maitra 2005-02-14 00:21:15 UTC

OK, here is something more. If I go to a text console (Ctrl-Alt-F1-F2)
the system does wake up, but with the comment that: CPU frequency is
2400000Hz, but cpufreq is assumed/set at 1200000Hz. However, using
(Ctrl-Alt-F1-F7) to get back to the X screen reverts it back to the
dull state, so pretty useless IMO.

Comment 49 Matthew Saltzman 2005-02-15 12:47:42 UTC

I built kernel-2.6.10-1.1137_FC3 using Barry's spec file.  (What are
kernel-xen0 and kernel-xenU?)  It survived an overnight suspend with
my T41.

I have two (unrelated) issues with it, though.  

(1) rhgb hangs occasionally (radeon driver, 7500 Mobility, only change
from default config is I've turned the DynamicClocks option on).

(2) vmware modules won't build for this kernel, even in its FC3 form.
 This makes it not a solution for me, as I need vmware.  (I haven't
investigated vmware patches yet, though.)

Re: Comment #48: I see that sort of message when resuming from ACPI
suspends.  It seems not to be a problem there.  It's related to
SpeedStep.  Also, does it help to do the suspend/resume from a VC and
then switch to X instead of suspending directly from X?

Comment 50 Satish Balay 2005-02-15 14:56:37 UTC

Re Comment 49:
kernel-xen0, kernel-xenU - they are related to the xen virtual machine
[similar to vmware] - an upcoming feature for FC4. I just disable
these two variables when building kernel for FC3

vmware modules: did you install 'kernel-devel' package?

-------------------

I've built 1141 kernel [aka 2.6.11-rc4-bk1] with the following changes:
- 1000Hz -> 100Hz
- Disable linux-2.6.9-spinlock-debug-panic.patch 

It survived an overnight suspend - without any spinlock messages.

Comment 51 Matthew Saltzman 2005-02-15 16:12:00 UTC

Re: Comment #50:  Interesting.  I'll be interested to see how xen
works.  Meanwhile, disabling should cut kernel build time a bit 8^).

Yes, I did install kernel-devel.  (Oh, dear.  Yet another new model
for getting kernel build environments.  That will just thrill all the
fedora-list denizens who are finally just getting used to the FC3
model...)  The issue is some undefined symbols and failure of the
module to insmod.  I suppose I ought to file a separate bug for that,
though.  It's a bit off-topic here.

Comment 52 Peter Dalgaard 2005-02-15 23:45:17 UTC

Just another data point:

2.6.10-1.1126_FC3 (cf comment #45) has been running OK for me on a
Toshiba Portege 3440CT for a couple of days now, surviving several
suspend/resume events. This is considerably better than any of the
update kernels since 2.6.9-1.681_FC3. 

I did see the effect of comment #45 when attempting a shutdown at one
point, though.

FWIW:

kernel /vmlinuz-2.6.10-1.1126_FC3 ro root=/dev/VolGroup00/LogVol00
apic=no acpi=off rhgb quiet

(without acpi=off, the resume problem disappears - the system refuses
to enter suspend mode ;-) )

Comment 53 Matthew Saltzman 2005-02-16 01:36:09 UTC

Re: Comment #49: vmware-any-any-update89 fixes the vmware-config.pl
issues.  So far, no further hangs (fingers crossed).

Comment 54 Ranjan Maitra 2005-02-16 03:57:28 UTC

Re: Comment #49, I tried apmsleep from VC and then it went to sleep
and wake up all right, but when I switched to X, it never went to X,
but the dull screen I mentioned.

Maybe I should kill X completely?

Comment 55 Mihai Lazarescu 2005-02-20 11:09:59 UTC

I experience pretty much the same problems on Toshiba Tecra8100,
including caps lock blinking and that a longer suspend is necessary to
trigger the bug, so it's not limited to ThinkPads.

Comment 56 Satish Balay 2005-02-21 15:16:29 UTC

I think I've hit the problem described in comment 46 with my rebuilt rawhide
kernels [2.6.11-rc4-bk6, bk8].

On recovery the screen is blank - and nothing works [except Fn-F3 - which makes
the screen completely dark]

There were no blinking caps-lock or num-lock - but none of the following worked:
Fn-F4, Alt-Ctl-Backspace, Alt-Ctl-F1/Alt-Ctl-Del

I had to powercycle to reboot. I'll start suspending in VT-1 to see if there is
a trace. Currently there is none - in /var/log/messages. [but then - I disabled
the flags DEBUG_SLAB, DEBUG_BUGVERBOSE, DEBUG_PAGEALLOC, DEBUG_HIGHMEM]

Feb 21 00:11:56 asterix apmd[3263]: System Suspend
< I guess reboot at this point>
Feb 21 08:51:42 asterix syslogd 1.4.1: restart.
Feb 21 08:51:42 asterix syslog: syslogd startup succeeded
Feb 21 08:51:42 asterix syslog: klogd startup succeeded

Comment 57 Satish Balay 2005-02-22 14:49:00 UTC

Tried again [ref comment 56] - this time suspending in VT-1. It's now the same
problem as comment 44.

The stack-trace scrolls by so fast that I don't know if it's the same spinlock
issue or some new problem [the atkbd.c messages - which cause the scroll - are new]

I'll disable the linux-2.6.9-spinlock-debug-panic.patch and try again.

Comment 58 Satish Balay 2005-03-03 06:31:45 UTC

Just a followup to comment 57 - I had an uptime of a week with the kernel I
rebuilt without linux-2.6.9-spinlock-debug-panic.patch

There were 3 variables I changed here -
- disable linux-2.6.9-spinlock-debug-panic.patch
- update to 2.6.11-rc4-bk9 [from bk8]
- unload uhci_hcd/ehci_hcd after a couple of days.

I strongly suspect linux-2.6.9-spinlock-debug-panic.patch causing the initial
problem - but I'm not sure..

Today I've rebuilt by upping to 2.6.11 [with 1154 kernel from rawhide] and all
the mods mentioned in comment 56 & 57

Hoping the good uptime won't be affected.

Comment 59 Dave Jones 2005-03-03 06:42:49 UTC

all that removing the spinlock-debug-panic should do is make the spinlock bug
non-fatal. Ie, you should still find a message in dmesg output/logs saying
something bad happened. Continuing to run after such a situation, is very risky.

Comment 60 Satish Balay 2005-03-03 06:57:14 UTC

I didn't see any spin_lock messages after disabling the
linux-2.6.9-spinlock-debug-panic.patch [hence my hesitation about assuming this
to be the problem]

And the diff from bk8 to bk9 shows:
[jantu@asterix tmp]$ diff patch-2.6.11-rc4-bk8 patch-2.6.11-rc4-bk9 |grep diff
> diff -Nru a/arch/i386/kernel/setup.c b/arch/i386/kernel/setup.c
> diff -Nru a/drivers/ide/Kconfig b/drivers/ide/Kconfig
> diff -Nru a/drivers/ide/ide-io.c b/drivers/ide/ide-io.c
> diff -Nru a/drivers/ide/ide.c b/drivers/ide/ide.c
> diff -Nru a/drivers/net/natsemi.c b/drivers/net/natsemi.c
> diff -Nru a/drivers/net/s2io.c b/drivers/net/s2io.c
> diff -Nru a/drivers/net/wireless/strip.c b/drivers/net/wireless/strip.c

Maybe it's the ide stuff?

Since the previous formula worked - I'm a bit hesitant to change the variables -
and try them all - one at a time [and see when things break]
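
The same comparison can be done on file names alone, which avoids noise from changed hunks. This is only a sketch: it assumes both patches use the "diff -Nru a/... b/..." header style shown above, the helper names are made up, and the "common/file.c" entry in the sample is a placeholder rather than a real path from either patch. Like the grep above, it will not notice a file that both patches touch but with different hunks.

```python
import re

# Matches the "diff -Nru a/<path> b/<path>" header lines shown above.
DIFF_HEADER = re.compile(r"^diff -Nru a/(\S+) b/\S+$", re.MULTILINE)


def touched_files(patch_text):
    """Set of file paths that a patch has a diff header for."""
    return set(DIFF_HEADER.findall(patch_text))


def newly_touched(old_patch, new_patch):
    """Paths that appear in new_patch but not in old_patch, sorted."""
    return sorted(touched_files(new_patch) - touched_files(old_patch))


# Tiny stand-ins for patch-2.6.11-rc4-bk8 and -bk9; "common/file.c" is
# a hypothetical placeholder for whatever both patches share.
OLD_PATCH = "diff -Nru a/common/file.c b/common/file.c\n--- a/common/file.c\n"
NEW_PATCH = (
    "diff -Nru a/common/file.c b/common/file.c\n"
    "diff -Nru a/drivers/ide/ide.c b/drivers/ide/ide.c\n"
    "diff -Nru a/drivers/net/natsemi.c b/drivers/net/natsemi.c\n"
)

for path in newly_touched(OLD_PATCH, NEW_PATCH):
    print(path)
```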

Comment 61 Jesse Glick 2005-03-06 16:53:57 UTC

*** Bug 146457 has been marked as a duplicate of this bug. ***

Comment 62 Jesse Glick 2005-03-06 19:02:06 UTC

Interesting note: for me at least, disabling NTP (system-config-date -> Network
Time Protocol) appears to prevent my laptop (Tecra M2) from freezing after APM
resume, at least the first time I tried it. If this can be confirmed by someone
else, consider it one workaround for the bug.

I suspect that this is because do_timer_interrupt() (in arch/i386/kernel/time.c)
calls the apparently problematic set_rtc_mmss() only in case STA_UNSYNC is off
(i.e. sync on, acc. to timex.h), which I am guessing is true only for NTP users (?).

See bug #146457 for a few other notes, possibly of interest to someone debugging
this. I have symptoms similar to comments #29-33 in this bug.

Comment 63 Jesse Glick 2005-03-06 19:31:33 UTC

More info: the stack traces seem to say that get_cmos_time() is getting an
interrupt somewhere inside its body while holding a lock, presumably really in
the inline mach_get_cmos_time() (why? is this predictable?), and
timer_interrupt() is handling that by calling set_rtc_mmss(). But both
get_cmos_time() and set_rtc_mmss() use the same spinlock, so of course it
results in an error. See Linus' explanation (ss. 3) in Documentation/spinlocks.txt:

... IFF you know that the spinlocks are never used in interrupt handlers, you
can use the non-irq versions:

	spin_lock(&lock);
	...
	spin_unlock(&lock);

Note the caveat. But this spinlock *is* being used in an interrupt handler. An
oversight? Would using spin_lock_irqsave() + spin_unlock_irqrestore() in
get_cmos_time() help?

Or should get_cmos_time() just temporarily set some static flag which disables
the attempted call to set_rtc_mmss() in do_timer_interrupt() until it is done
with the lock - since the synching from software clock to CMOS in this case is
happening only after a resume, in which case it is presumably useless because we
just recently loaded the SW clock from CMOS?
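
To make the scenario concrete, here is a toy single-CPU model in Python (not kernel code). It is a sketch under assumptions: the lock is called rtc_lock here purely for illustration, the `Cpu` and `DebugSpinlock` classes are invented, and "interrupt delivery" is reduced to a pending flag that is checked at fixed points. It only shows why a plain lock/unlock pair trips the debug check while an irqsave-style pair defers the handler until the lock is free.

```python
class DebugSpinlock:
    """Non-recursive lock that remembers its owner, like spinlock debugging."""

    def __init__(self, name):
        self.name = name
        self.owner = None

    def lock(self, who):
        if self.owner is not None:
            raise RuntimeError(
                f"spin_lock({self.name}) already locked by {self.owner}")
        self.owner = who

    def unlock(self):
        self.owner = None


class Cpu:
    """One CPU with an IRQ-enable flag and at most one pending timer tick."""

    def __init__(self):
        self.irqs_enabled = True
        self.pending = False
        self.handled = 0
        self.rtc_lock = DebugSpinlock("rtc_lock")  # name is an assumption

    def timer_interrupt(self):
        # Stands in for timer_interrupt() calling set_rtc_mmss()
        self.rtc_lock.lock("set_rtc_mmss")
        self.rtc_lock.unlock()
        self.handled += 1

    def deliver_pending(self):
        # A pending tick only runs while interrupts are enabled
        if self.pending and self.irqs_enabled:
            self.pending = False
            self.timer_interrupt()

    def get_cmos_time(self, irqsave):
        if irqsave:
            saved = self.irqs_enabled
            self.irqs_enabled = False      # models spin_lock_irqsave()
        self.rtc_lock.lock("get_cmos_time")
        self.pending = True                # timer fires while lock is held
        self.deliver_pending()
        self.rtc_lock.unlock()
        if irqsave:
            self.irqs_enabled = saved      # models spin_unlock_irqrestore()
        self.deliver_pending()             # deferred tick runs here


def run(irqsave):
    cpu = Cpu()
    try:
        cpu.get_cmos_time(irqsave)
        return "ok"
    except RuntimeError as err:
        return str(err)


print("plain spin_lock:  ", run(irqsave=False))
print("irqsave variant:  ", run(irqsave=True))
```

In the model the plain variant reports the same kind of "already locked" complaint seen in the logs, and the irqsave variant completes with the tick handled afterwards. Whether the real fix is that simple depends on the actual resume path, which this toy does not capture.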

Comment 64 Jesse Glick 2005-03-06 19:52:16 UTC

Created attachment 111719
*UNTRIED* patch for sake of experimentation

My thinking is along the lines of the attached patch (have not tried it yet).
It might succeed in preventing unwanted NTP-triggered synch to CMOS during an
APM resume. Note that this use of spinlocks is not 100% safe since it is still
possible for a timer interrupt to come between the call to spin_lock() and
setting the flag, or between unsetting the flag and spin_unlock(); but I am
guessing that would be far less likely than hitting a timer interrupt while
inside mach_get_cmos_time(). Not sure what a completely safe version would look
like but I guess it would have to use IRQ blocking of some sort.
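
The two remaining windows can be enumerated with a small Python model (again a toy, not the patch itself). It assumes the ordering described above: take the lock, set the flag, read the CMOS, clear the flag, release the lock. The step names and the `skip_sync` flag are invented for this sketch. One timer interrupt is injected just before each step in turn, and the model records whether the handler would find the lock held with the flag clear.

```python
STEPS = ["spin_lock", "set flag", "mach_get_cmos_time",
         "clear flag", "spin_unlock"]


def collides(inject_at):
    """Fire one timer interrupt just before step inject_at.

    inject_at == len(STEPS) means "after the last step".
    Returns True if the handler would hit the held lock.
    """
    locked = False
    skip_sync = False      # hypothetical flag checked by the handler
    collision = False

    def timer_interrupt():
        nonlocal collision
        if skip_sync:
            return         # handler skips set_rtc_mmss() entirely
        if locked:
            collision = True   # would be "already locked"

    for i, step in enumerate(STEPS):
        if i == inject_at:
            timer_interrupt()
        if step == "spin_lock":
            locked = True
        elif step == "set flag":
            skip_sync = True
        elif step == "clear flag":
            skip_sync = False
        elif step == "spin_unlock":
            locked = False
        # "mach_get_cmos_time" changes nothing in this model
    if inject_at == len(STEPS):
        timer_interrupt()
    return collision


def unsafe_points():
    """Indices of injection points where the handler still collides."""
    return [i for i in range(len(STEPS) + 1) if collides(i)]


for i in unsafe_points():
    print("interrupt just before:", STEPS[i])
```

The model flags only the two short windows on either side of the flag, which matches the reasoning above; blocking interrupts for the whole locked region would close both.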

Comment 65 Ed Hill 2005-03-07 17:40:30 UTC

Based on comments 62--64 which suggest ntpd as a culprit, I've run:

  /etc/init.d/ntpd stop ; /etc/init.d/pcmcia stop
  apm -s

and have gone through three successful suspend/resume cycles with
2.6.10-1.770_FC3.  All suspends lasted less than two hours, but this is still an
improvement!  Thanks!  I'll try a longer suspend tonight.

Comment 66 Ed Hill 2005-03-08 21:16:33 UTC

With ntpd off, a 3+-hr suspend resulted in yet another lock-up (w/ blinking
shift-lock light) on a TP A22p running 2.6.10-1.770_FC3.

Comment 67 Thomas Fischer 2005-03-10 06:48:15 UTC

I have been experiencing the problem with a Dell Latitude c610 also for a while
now (Sorry, been too busy to participate in this discussion). I tried Ed's
recommendation of turning off ntpd but it does not work for me (#65).

So I guess my next question is should I try the Fc4 kernel or should I look at 
something else (like the spinlock-debug-panic.patch)?

Comment 68 Peter Dalgaard 2005-03-14 23:11:53 UTC

Had a blinking-caps-lock incident with the 2.6.10-1.1126 kernel the day before
yesterday. This was after several suspend/resume cycles per day for almost four
weeks. So it seems that it hasn't cured the problem but certainly reduced the
frequency with which it appears.

Comment 69 Valentine Kouznetsov 2005-03-29 17:46:25 UTC

Hi,
I have experienced the same problem since switching from FC2 to FC3 on a Compaq
Evo N600c notebook. My solution was to install vanilla 2.6.10, after which APM
started working like a charm again. I'm 100% sure that the source of the APM
problem lies somewhere in the FC3 patches to the kernel. It would be nice if
people could confirm that the vanilla kernel doesn't have the APM problem.

Comment 70 Satish Balay 2005-04-09 19:19:05 UTC

My experience so far:

 - With Fedora kernels [currently using a modified 2.6.11-1.14_FC3], disabling
spinlock-debug-panic.patch avoids the crash-and-burn scenario [it is replaced with
friendly messages in /var/log/messages]. If this happens, one can reboot at a
convenient time.

 - Disabling ntpd appears to get rid of the messages in /var/log/messages. I'm
guessing my earlier success report was perhaps because ntpd couldn't start [due
to a disabled network at boot] and so remained disabled.

 - I've briefly tried kernel 2.6.11.6 without Fedora patches; it survived one
overnight suspend.

I've had shutdown problems with 2.6.11.6: a hang at shutting down 'iptables'. So
I switched to 2.6.11*FC3 kernels. Then I had this shutdown issue with one of the
modified 2.6.11*FC3 kernels as well. [This is not always reproducible. When I
manually try 'service iptables restart' it always works.]

Comment 71 Ian Collier 2005-04-14 16:04:12 UTC

Created attachment 113161 [details]
Show PC output from crash in 2.6.10-1.1126

I echo comment 68.  I installed 2.6.10-1.1126 towards the beginning of February
and had several successful overnight suspend/resume cycles until suddenly it
crashed on the 27th.  Attachment shows the text which appeared when I pressed
SysRq+P, though it was difficult to catch because it was spewing the atkbd
message (at the bottom) about once every second.  Then it didn't crash again
until March 27th (with the same traceback).

Comment 72 Ian Collier 2005-04-14 16:16:35 UTC

Created attachment 113163 [details]
Replacement spinlock-debug-panic patch

...However, I seem to have an obscure hardware fault (memory problem?) which
makes my machine suddenly crash for no apparent reason maybe about once a month
(it's getting worse, though :-().  That's only relevant because it made me
reboot my machine last week - but since then, 2.6.10-1.1126 has panicked every
time I've tried an overnight suspend.

Which is why I now intend to run the standard (currently 2.6.10-11_FC2) kernel
with the spinlock-debug patch replaced by the attached.  It's a horrible hack
which makes the spinlock error cause a panic *except* when it happens during
i386/kernel/time.c.  So now I do sometimes get a spew of messages in the syslog
when I resume, but it doesn't crash any more.  (On the other hand, it does
sometimes set the clock to a stupid value.)

Comment 73 Satish Balay 2005-04-22 20:00:11 UTC

<with my modified 2.6.11-1.14_FC3 kernel comment #70> I've noticed the following
in my /var/log/messages [happened during an APM recovery-from-suspend]

Apr 18 08:20:45 asterix kernel: drivers/block/cfq-iosched.c:1065: spin_is_locked
on uninitialized spinlock f7bb481c.

[and a stack trace with a taint flag for madwifi]

Perhaps this one is a completely unrelated issue...
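
To check whether messages like the one above show up after a resume, the log can be filtered for them. This is a minimal sketch: the function name `spinlock_messages` is hypothetical, the default path is the /var/log/messages file mentioned above, and the pattern only covers kernel lines containing the word "spinlock", so other spinlock-debug wording may be missed.

```shell
#!/bin/sh
# Hypothetical helper: print kernel log lines that mention a spinlock.
# Usage: spinlock_messages [logfile]   (defaults to /var/log/messages)

spinlock_messages() {
    logfile="${1:-/var/log/messages}"
    # Match only lines logged by the kernel that contain "spinlock".
    grep 'kernel:.*spinlock' "$logfile"
}
```

Piping the result through `wc -l` gives a quick count to compare before and after a suspend/resume cycle.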

Comment 74 Satish Balay 2005-04-22 20:08:00 UTC

Created attachment 113573 [details]
spinlock trace from /var/log/messages

Comment 75 Ian Collier 2005-04-23 03:03:04 UTC

I added the patch from comment 64 to my kernel (2.6.10-11_FC2 with the patch
from comment 72) and for some reason it made my system clock run at double speed!

Comment 76 Dave Jones 2005-07-15 19:41:47 UTC

An update has been released for Fedora Core 3 (kernel-2.6.12-1.1372_FC3) which
may contain a fix for your problem.   Please update to this new kernel, and
report whether or not it fixes your problem.

If you have updated to Fedora Core 4 since this bug was opened, and the problem
still occurs with the latest updates for that release, please change the version
field of this bug to 'fc4'.

Thank you.

Comment 77 Satish Balay 2005-07-16 16:39:04 UTC

Since disabling 'ntpd', I don't remember having a single crash.

Now I've enabled ntpd with 2.6.12-1.1372_FC3, and it survived an 8-hour APM
suspend [which is a good sign].

Will keep monitoring and report if I see a crash.

Comment 78 André Johansen 2005-07-18 07:44:58 UTC

Seems to be working better on my workstation as well; it has survived two 
overnight suspends so far.

Comment 79 Enrique Gomezdelcampo 2005-09-18 16:41:05 UTC

kernel-2.6.12-1.1372_FC3 solved the problem on my Dell laptop, see comment #26
above. However, the latest kernel,  kernel-2.6.12-1.1376_FC3 breaks it again.
The laptop does not recover from sleep. It didn't even survive a 2-minute APM
suspend. Did it happen on the Thinkpad too?

Comment 82 Ed Hill 2005-09-18 17:16:38 UTC

Just a couple of data/no-data points:
 - my ThinkPad A22p was sold so I no longer have it for testing purposes
 - the replacement ThinkPad T42p (2373-HTU) has suspended very reliably 
     (using ACPI with the kernel options "pci=noacpi acpi_sleep=s3_bios") 
     with all FC4 kernel updates including 2.6.12-1.1447_FC4

Comment 83 Matthew Saltzman 2005-09-18 17:34:06 UTC

I'll admit that, once I had a solution to the Radeon hot-suspend (bug #142928),
I switched to ACPI.  I've had no problems since on my T41 (2373-JHU)--don't even
need the kernel options in comment #82.  Note that kernel-2.6.12-1.1447_FC4
doesn't do ACPI suspend properly in my case (bug #165819), but 2.6.12-1398_FC4
works fine.

Comment 84 Satish Balay 2005-09-18 21:12:48 UTC

I'm still using APM with FC3 on my T40. Currently using 2.6.12-1.1378_FC3 - no
crashes yet.

Comment 85 Ian Collier 2005-09-19 10:32:28 UTC

I've been using 2.6.12-1.1376 backported to FC2 for just under a week with no
crashes on a ThinkPad T22.  I'm still using the patch from comment 72, but haven't
seen any spinlock messages since August 23, when I was running 2.6.11-1.14.

(I never said, but comment 75 was a false alarm - apparently my clock always
goes at double speed when I boot on battery power and then connect the
AC adaptor.)

Comment 86 Dave Jones 2005-09-30 10:34:44 UTC

There are a number of different problems reported here, across a variety of
kernels, including a bunch of self-compiled ones with add-ons, involving various
features that have nothing to do with APM.

If you're still having APM problems with the current FC3 errata, please file a
new bug, as this one has become far too cluttered for any coherent analysis.
