Bug 237547 - resume from RAM on intel driver hangs after a few seconds if X is onscreen
Summary: resume from RAM on intel driver hangs after a few seconds if X is onscreen
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: All
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks: 196349
 
Reported: 2007-04-23 18:42 UTC by Zack Cerza
Modified: 2007-11-30 22:12 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-06-20 15:52:43 UTC
Type: ---
Embargoed:


Attachments
tail -f /var/log/{messages,pm-suspend.log,Xorg.0.log} (18.68 KB, text/plain)
2007-04-24 14:05 UTC, Zack Cerza
dmesg output of boot sequence with apic=debug (27.87 KB, application/octet-stream)
2007-04-25 20:46 UTC, Zack Cerza
dmesg containing noapic suspend/resume and soft lockup (39.31 KB, text/plain)
2007-04-26 17:25 UTC, Zack Cerza
xorg log from X crash (42.14 KB, text/plain)
2007-04-26 17:27 UTC, Zack Cerza
/proc/interrupts on a 10 second interval with nohz=off highres=off apic=verbose (1.50 KB, text/plain)
2007-04-27 15:39 UTC, Zack Cerza

Description Zack Cerza 2007-04-23 18:42:50 UTC
Description of problem:
In FC5+6, both suspend to RAM and suspend to disk worked fine on my X41, with
its i915GM. Now in F7, the laptop will suspend, but on resume it hangs after a
few seconds of X activity. The display freezes, I can't change to a VT, the
network appears to be dead... etc.

If X is not running at all when I suspend, I can sometimes resume correctly.

I haven't seen anything in /var/log/messages when this happens.

Version-Release number of selected component (if applicable):
kernel-2.6.20-1.3088.fc7.i686
pm-utils-0.99.3-1.fc7.i386
xorg-x11-drv-i810-1.6.5-19.fc7.i386


How reproducible:
Always

Additional info:
When I said "the laptop will suspend", this includes tapping various keys on the
keyboard to get the new tickless kernel to wake up. Should I file another bug on
this?

Hibernate sometimes works; it will hang while going down, but if I start tapping
keys sometimes it will finish hibernating. Other times it won't, and I have to
cut power manually.

Comment 1 Zack Cerza 2007-04-24 14:05:19 UTC
Created attachment 153351
tail -f /var/log/{messages,pm-suspend.log,Xorg.0.log}

Comment 2 Zack Cerza 2007-04-25 20:46:25 UTC
Created attachment 153455
dmesg output of boot sequence with apic=debug

When running kernel-2.6.20-1.3110.fc7.i686, booting with 'noapic' seems to be a
workaround. Attaching apic=debug dmesg output from boot.

Comment 3 Zack Cerza 2007-04-26 17:18:51 UTC
Note: everything above is the result of testing with either no quirks, or just
s3_bios. This is different from the default in hal-info, which is to use both
vbe_post and vbestate_restore (I believe the default is wrong, and so does
thinkwiki.org, but that's another issue).



Comment 4 Zack Cerza 2007-04-26 17:25:22 UTC
Created attachment 153534
dmesg containing noapic suspend/resume and soft lockup

I tested suspend/resume with kernel-2.6.20-1.3111.fc7.i686 and noapic, using
the vbe quirks which are default. During general use there were periods where
my load average was over 6 with an 85% idle CPU and no disk or network
activity. I don't know if this is a known side-effect of noapic, but it was
weird.

Sometimes these slowdowns even locked up the entire system temporarily.
Attached is the dmesg output from that testing, which includes two "BUG: soft
lockup detected on CPU#0!" traces at the bottom. 

A few minutes after the "soft lockup" messages showed up in dmesg, the entire
system hung, much like it does a few seconds after resume without noapic, only
this was a good hour or so after resume.

Further up in that log you can see some "<device> LATE suspend" and "<device>
EARLY resume" messages; I don't know the significance of those.

Comment 5 Zack Cerza 2007-04-26 17:27:26 UTC
Created attachment 153535
xorg log from X crash

During the same boot cycle described in the previous comment, X crashed maybe
15 minutes after resuming. Here is the log from that session. I doubt if it's
useful, but take a look if you want.

Comment 6 Zack Cerza 2007-04-26 17:33:09 UTC
I realize the last two comments might be confusing. Here is what happened in
chronological order:

1. boot 3111 with noapic
2. login and suspend with VBE quirks (and not S3 BIOS)
3. resume
4. notice periodic, unexplained slowdowns and load jumps (may have started
before suspend/resume, not sure)
5. maybe 15 minutes later, X crashes. gdm pops up.
6. log back in
7. about an hour (and lots of the strange slowdowns) later, soft lockup occurs.
8. machine recovers, and I save dmesg output
9. immediately go to bugzilla to add this information
10. machine hardlocks (similar to previous lockups on resume)
11. reboot, login, return to bugzilla :)

Comment 7 Zack Cerza 2007-04-26 20:39:34 UTC
Ajax thought this might be a drm problem, so I disabled it in xorg.conf. With
drm disabled, even booting into runlevel 1, suspending/resuming, *then* starting
X causes the system hang.

Comment 8 Zack Cerza 2007-04-26 21:34:24 UTC
Well, I think I've got it now. Booting without noapic but with nohz=off seems to
*really* work around the problem.

Comment 9 Thomas Gleixner 2007-04-26 22:48:10 UTC
Hmm, some of the dmesg outputs are really confusing

Time: tsc clocksource has been installed.
pnp: 00:00: iomem range 0x0-0x9ffff could not be reserved
<SNIP>
checking if image is initramfs... it is
Switched to high resolution mode on CPU 0

Usually the switch happens right after the clocksource install.

Can you please boot with 

nohz=off highres=off apic=verbose

on the commandline and do

# cat /proc/interrupts; sleep 10; cat /proc/interrupts

and provide the output ?
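
For anyone repeating this test, the two snapshots can be easier to read as per-IRQ deltas. The following is only a rough sketch, not part of the original request: the function name irq_delta and the snapshot file paths are made up for illustration, and it looks only at the first CPU column (the X41 in this report has a single CPU).

```shell
#!/bin/sh
# Sketch: print how much each IRQ count (first CPU column only) grew
# between two saved /proc/interrupts snapshots.
# irq_delta is a hypothetical helper name, not an existing tool.
# Usage: irq_delta BEFORE_FILE AFTER_FILE
irq_delta() {
    awk '
        # Data lines start with a label ending in ":" (e.g. "0:", "NMI:")
        # followed by a numeric count; the "CPU0" header line is skipped.
        $1 ~ /:$/ && $2 ~ /^[0-9]+$/ {
            if (NR == FNR) { before[$1] = $2; next }    # first file: remember count
            if ($1 in before) print $1, $2 - before[$1] # second file: print delta
        }
    ' "$1" "$2"
}

# On a live system (paths are illustrative):
#   cat /proc/interrupts > /tmp/irq.before; sleep 10
#   cat /proc/interrupts > /tmp/irq.after
#   irq_delta /tmp/irq.before /tmp/irq.after
```

With nohz=off the timer line should keep growing at a steady rate over the 10 seconds; a delta near zero there would point at a stalled tick.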


Comment 10 Zack Cerza 2007-04-27 15:39:27 UTC
Created attachment 153631 [details]
/proc/interrupts on a 10 second interval with nohz=off highres=off apic=verbose

Comment 11 Thomas Gleixner 2007-04-27 15:53:24 UTC
Looks sane.

When the hang happens, is the machine still responding to SysRq ?

If yes, the output of sysrq-t and sysrq-q would be probably helpful.


Comment 12 Zack Cerza 2007-04-27 15:56:09 UTC
It's hard to say if it responds; this only happens when I'm on X's VT. How would
I get that output saved? My laptop has no serial port. (and I've never done the
serial console thing before)

Comment 13 Thomas Gleixner 2007-04-29 19:36:09 UTC
Dave,

how close is kernel-2.6.20-1.3088.fc7.i686 to the hrtimer/dyntick code in 2.6.21 ?

Comment 14 Zack Cerza 2007-04-30 16:07:34 UTC
Thomas,

this is still a problem with kernel-2.6.21-1.3116.fc7.i686 which is 2.6.21 final
according to its changelog. I guess there could theoretically be patches to that
code in our packages, though.

Comment 15 Thomas Gleixner 2007-05-02 16:49:51 UTC
Zack,

are you still tapping keys to get the box out of suspend / hibernate or did this
change at least after the update to 2.6.21 ?


Comment 16 Zack Cerza 2007-05-02 17:01:15 UTC
From RAM, no. I'll test disk later today since I can't remember.

Comment 17 Thomas Gleixner 2007-05-02 17:25:51 UTC
Ok, can you please add "nolapic_timer" to the command line and try again ?

If my suspicion is correct, then your box should freeze right after resume.


Comment 18 Zack Cerza 2007-05-02 18:14:02 UTC
OK, so this seems a little odd. I'd been booting with nohz=off to work around
this problem, and it's been working fine. I rebooted with nolapic_timer (and
without nohz=off) and this appears to work around the problem also. I'm so far
not seeing the strangeness that I'd seen with noapic either.

No freezes yet, and it's been over 20 minutes since I resumed.

Comment 19 Thomas Gleixner 2007-05-02 18:51:59 UTC
Hmm, I'm getting more confused. That's quite the contrary to the behaviour which
I expected.

Before you switched to 2.6.20 (+hres/dyntick) was it necessary to have noapic on
the command line ? If yes, then it looks more like an apic problem, but the
confusing thing is why does this only hurt after resume.



Comment 20 Zack Cerza 2007-05-02 19:11:04 UTC
I don't recall ever having to use noapic before. The last kernels I used before
2.6.20 were the FC6 kernels. Also, to be clear, noapic didn't seem to work
around this problem properly, whereas nolapic_timer does.
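
Since several boot parameters have been tried in this report (noapic, nohz=off, highres=off, nolapic_timer), it helps to record which ones were actually in effect for a given test. A minimal sketch, assuming a POSIX shell; the function name active_workarounds is invented here and the parameter list is just the set mentioned in this bug:

```shell
#!/bin/sh
# Sketch: list which of the timer/APIC boot parameters discussed in this
# report appear in a kernel command line string.
# active_workarounds is a hypothetical helper name.
# Usage: active_workarounds "CMDLINE STRING"
active_workarounds() {
    (
        set -f                      # no glob expansion while word-splitting
        for param in $1; do         # intentionally unquoted: split on whitespace
            case "$param" in
                noapic|nolapic_timer|nohz=off|highres=off)
                    printf '%s\n' "$param" ;;
            esac
        done
    )
}

# On a live system:
#   active_workarounds "$(cat /proc/cmdline)"
```

Pasting that output alongside each dmesg attachment would make it clear which workaround a given log corresponds to.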

Comment 21 Zack Cerza 2007-06-20 15:52:43 UTC
Looks like this got fixed somehow, and replaced with a different regression.

