Bug 212777 - frequent spontaneous lockups
Status: CLOSED NOTABUG
Product: Fedora
Classification: Fedora
Component: kernel (Show other bugs)
Version: 6
Hardware: x86_64 Linux
Priority: medium
Severity: urgent
Assigned To: Kernel Maintainer List
QA Contact: Brian Brock
Reported: 2006-10-28 20:13 EDT by gdelx001
Modified: 2008-01-30 22:51 EST (History)
Last Closed: 2008-01-30 22:51:14 EST

Attachments
Output of lspci (1.94 KB, text/plain)
2006-10-31 10:29 EST, Andrew Hamilton
Output of lspci (2.34 KB, application/octet-stream)
2006-10-31 11:28 EST, gdelx001
My Xorg.0.log file (46.10 KB, text/plain)
2006-10-31 18:55 EST, Andrew Hamilton
dmesg output (17.37 KB, text/plain)
2006-10-31 18:58 EST, Andrew Hamilton
2.6.17-2187_FC5 dmesg (25.45 KB, text/plain)
2006-11-01 10:12 EST, gdelx001
Xorg.0.log under 2.6.17-2187_FC5 (31.35 KB, text/plain)
2006-11-01 10:14 EST, gdelx001
My updated .config file (53.84 KB, application/octet-stream)
2006-11-04 13:14 EST, Andrew Hamilton
.config for Kubuntu 6.10 that locks up like fc6 (72.90 KB, text/plain)
2006-11-24 13:13 EST, Andrew Hamilton
.config for Kubuntu 6.10 that doesn't lock up (73.46 KB, text/plain)
2006-11-24 13:15 EST, Andrew Hamilton

Description gdelx001 2006-10-28 20:13:25 EDT
Description of problem:
After upgrading to FC6, the system locks up completely, frequently and
spontaneously, both when the desktop is idle and while using applications
(browser, etc.).  No information about the lockup is ever written to the
system log.


Version-Release number of selected component (if applicable):
2.6.18-1.2798


How reproducible:
Extremely

Steps to Reproduce:
1. Wait 0.5-20 hours and it will happen
Actual results:
Complete lockup.  Screen state freezes and no response to keyboard, mouse, or
network.

Expected results:
... no lockup...

Additional info:
Tyan S2895 board recently upgraded to two dual-core Opterons; memory upgraded as
well, to two 2GB DIMMs (4GB total).  Appeared to work well after the hardware
upgrade with FC5 2.6.18-1.2200, although it wasn't active that long.

Memory has been well tested with Memtest86+, without problems.  Tried Nvidia
graphics driver and Xorg nv driver with same lockup results.

Not much to go on, unfortunately, other than OS and hardware changes.
Comment 1 Ed Friedman 2006-10-28 22:05:40 EDT
I've observed lockups on a 2.4 GHz Intel Core 2 Duo machine, also with an Nvidia
graphics card (I have only used the Xorg driver).  There seems to be no problem
when idle (I've gone a couple of days with no programs running without a freeze),
but with activity, e.g., trying to add RPMs with pirut, it rarely goes as much
as one hour without freezing up.  What is interesting is that I've seen it
freeze up while being connected via ssh.  When that happens, a number of
ordinary commands, e.g., top or less, give you "Input/output error" or  "Bus
error" messages instead of any info.  Trying to ssh in when it is frozen gives
the message: "ssh_exchange_identification: Connection closed by remote host". 
It did respond properly to ping requests, even though it was frozen.  When I
went to the console following the freezeup, no keystrokes seemed to be
recognized, i.e., no letters produced any output in the login window, and trying
for a text window with CTRL-ALT-F2 did nothing.  Only a hardware reset enabled
me to reboot.

I've run memtest86 to verify that there is no problem with the CPU,
motherboard or RAM.
Comment 2 Andrew Hamilton 2006-10-29 01:40:40 EST
I have observed this behavior as well using 2.6.18-1.2798.fc6.i686. I also had
it happen when the 2.6.18-1.2798.fc6.i586 kernel was installed. My hardware is
an Athlon-XP 2500+, Nvidia graphics card, and Asus A7N8X-deluxe nforce2
motherboard.   Thinking that this was an ACPI problem, I have tried booting with
acpi=off with no improvement in the situation. The only hint that I have seen
occurred the other morning when I woke up my display to check email, and for
some reason httpd had eaten all the memory on the machine and other processes
were being killed.  I have httpd running on a local home LAN with only me
accessing it occasionally.  This never occurred with FC5.
Comment 3 gdelx001 2006-10-29 13:21:59 EST
Went to the trouble of recovering FC5's 2.6.18-1.2200, and achieved lockup there
as well.  Installed 2.6.17-1.2187 from FC5 and have not had any lockups/freezes
so far.  But it _has_ taken longer to freeze before.  We shall see.

Is this due to the problem 2.6.18 had over some part of September where the
backtrace search (for x86_64 only?) upon a kernel bug was itself buggy?  I.e. is
the lockup really due to the reporting system, but precipitated by another
kernel bug I can't see?  Or is 2.6.18-1.2200/1.2187 already patched for that?

Comment 4 Andrew Hamilton 2006-10-29 14:13:29 EST
I wouldn't think this is x86_64 specific since I am seeing it with an Athlon-XP
2500+.  Then again, there could be two separate bugs with similar symptoms. 
It's hard to tell with no evidence besides the outcome to go on :( .
Comment 5 Ed Friedman 2006-10-30 12:51:05 EST
Is there any way to force the kernel to run in uniprocessor mode instead of SMP
mode?  Under FC5, I had a few machines that were hyper-threaded and wanted to
run under SMP, but crashed every few days when I used SMP.  When I forced them
to boot in uniprocessor mode, then they never crashed.
Comment 6 gdelx001 2006-10-30 16:06:33 EST
(In reply to comment #3)
> Went to the trouble of recovering FC5's 2.6.18-1.2200, and achieved lockup there
> as well.  Installed 2.6.17-1.2187 from FC5 and have not had any lockups/freezes
> so far.  But it _has_ taken longer to freeze before.  We shall see.
> 

2.6.17-1.2187 is still running.  I feel pretty confident now in concluding that
the 2.6.18 kernel is the lockup culprit, and not my hardware or other software.  

It looks like the kernel shipped with FC6 is _not stable_ and a new release is
required for me to be able to use it, and for FC5 as well, since it has
transitioned to 2.6.18.


Comment 7 Andrew Hamilton 2006-10-30 18:29:54 EST
I'm going to try installing the latest kernel from kernel.org, which is
2.6.19-rc3-git8 at the time of this writing, to see if it still locks up on me.
Comment 8 Andrew Hamilton 2006-10-31 10:29:51 EST
Created attachment 139863 [details]
Output of lspci
Comment 9 Andrew Hamilton 2006-10-31 10:31:01 EST
I had 2.6.19-rc3-git8 lock up on me this morning, so the problem seems to still
exist in the latest kernel.  I will now slowly try to determine where this bug
entered the kernel by testing various versions.  Could some people who are
having this problem post the output of lspci so we can determine whether we have
common hardware?  Mine is attached.
Comment 10 gdelx001 2006-10-31 11:28:04 EST
Created attachment 139871 [details]
Output of lspci
Comment 11 gdelx001 2006-10-31 11:30:16 EST
(In reply to comment #9)
> I had 2.6.19-rc3-git8 lock up on me this morning, so the problem seems to still
> exist in the latest kernel.  I will now slowly try to determine where this bug
> entered the kernel by testing various versions.  Could some people who are
> having this problem post the output of lspci so we can determine whether we have
> common hardware?  Mine is attached.

Other than the fact that the motherboard chipset is _made_ by NVIDIA there is
very little in common, as one would expect given the different processor
families that are supported.  I guess the TYAN 2895 has a TI firewire chip
instead of NVIDIA...
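
For anyone who wants to do the lspci comparison mechanically rather than by eye, here is a minimal Python sketch. It assumes the default (non-verbose) lspci output format of "slot class: device (rev xx)"; the function names are hypothetical, and it treats two devices as the same only if the class and device strings match exactly, ignoring the revision.

```python
import re

# Matches default `lspci` lines of the form
#   <slot> <device class>: <device name> (rev xx)
# The "(rev xx)" suffix is optional and is dropped from the device name.
LINE_RE = re.compile(
    r"^(?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]\s+"
    r"([^:]+):\s+(.*?)(?:\s+\(rev [0-9a-f]+\))?$"
)


def parse_lspci(text):
    """Return a set of (device class, device name) pairs from lspci output."""
    devices = set()
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            devices.add((match.group(1).strip(), match.group(2).strip()))
    return devices


def common_devices(text_a, text_b):
    """Return the sorted devices that appear in both lspci outputs."""
    return sorted(parse_lspci(text_a) & parse_lspci(text_b))
```

Calling common_devices() on the text of two saved lspci outputs lists the devices both machines share. An exact string match is deliberately strict; two different chips from the same vendor will not be reported as common.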
Comment 12 Andrew Hamilton 2006-10-31 15:37:41 EST
I have now installed 2.6.18-rc5. Maybe this kernel isn't haunted like the
others.  We'll see...
Comment 13 Dave Jones 2006-10-31 17:33:31 EST
Andrew/Delamart, can you attach your dmesg outputs and /var/log/Xorg.0.log files?
Comment 14 Andrew Hamilton 2006-10-31 18:55:37 EST
Created attachment 139923 [details]
My Xorg.0.log file
Comment 15 Andrew Hamilton 2006-10-31 18:58:05 EST
Created attachment 139925 [details]
dmesg output

Here is my dmesg output; however, this is from booting a 2.6.18-rc5 kernel
that I compiled (using the .config from 2.6.18-1.2798). So far (3.5hrs), I
haven't had a lockup with this kernel. If you need me to reboot into
2.6.18-1.2798 and get that dmesg, let me know.
Comment 16 gdelx001 2006-11-01 10:12:39 EST
Created attachment 139988 [details]
2.6.17-2187_FC5 dmesg

2.6.17-2187_FC5 dmesg
Comment 17 gdelx001 2006-11-01 10:14:37 EST
Created attachment 139990 [details]
Xorg.0.log under 2.6.17-2187_FC5

Xorg.0.log under 2.6.17-2187_FC5 (some bad devices)
Comment 18 gdelx001 2006-11-01 10:35:22 EST
(In reply to comment #6)
> (In reply to comment #3)
> > Went to the trouble of recovering FC5's 2.6.18-1.2200, and achieved lockup there
> > as well.  Installed 2.6.17-1.2187 from FC5 and have not had any lockups/freezes
> > so far.  But it _has_ taken longer to freeze before.  We shall see.
> > 
> 
> 2.6.17-1.2187 is still running.  I feel pretty confident now in concluding that
> the 2.6.18 kernel is the lockup culprit, and not my hardware or other software.  
> 

Looks like I must eat those words, since 2.6.17-1.2187 finally locked after
approximately 48 hours of running without trouble or reported errors.  Then
again, and again.  It appears less likely to freeze in idle, as I am usually
doing something on the desktop in a web browser or terminal when it happens.

Now I have found that my particular board had problems about a year ago (i.e.
August 2005) surrounding the first ethernet port, possibly some problems
involving BIOS APIC settings.  

See:
http://www.linuxelectrons.com/phpBB2/viewtopic.php?t=171&view=next

It is _hardly_ conclusive from spotty information but those who had more trouble
seem to have had faster procs (>1.8GHz) which is a transition I made in the
upgrade.  

Various complaints about the "forcedeth" ethernet driver causing lockups/freezes
float about on the net, and some successes after bug fixes, but it is difficult
to pin down versions and dates.  

I put a _little_ stress on the network port, but am unable with my current setup
to induce a lockup of the system.  I've also forced the clock on the procs high
(performance governor), but that doesn't reliably lock things, although the
system locked _once_ while I was setting the governor.

Comment 19 Andrew Hamilton 2006-11-01 13:14:01 EST
It is interesting that you mention forcedeth may cause problems since I have
that as well.  Although I don't have it connected (I use a wireless rt61 pci
card), the module for it is loaded.
Comment 20 Andrew Hamilton 2006-11-01 22:32:08 EST
Well, I rebooted into 2.6.18-1.2798, and removed the forcedeth module.  I now
have an uptime of 9 hours, which is a pretty good sign, especially since I've
had it under high loads at some points today.
Comment 21 Andrew Hamilton 2006-11-01 23:03:37 EST
Well, of course, not more than 10 minutes after I wrote the previous reply, I had
a lockup. I'll go back to trying to find the most recent kernel that doesn't
lock up on me.
Comment 22 Andrew Hamilton 2006-11-02 16:09:18 EST
I just had a 2.6.17 kernel lock up on me as well. That is odd, because I'm quite
sure I had a 2.6.17 kernel running in FC5 with no problems.
Comment 23 Andrew Hamilton 2006-11-04 13:14:23 EST
Created attachment 140366 [details]
My updated .config file

Well, I recompiled 2.6.19-rc3-git8 (which had locked up using the default fc6
.config) after removing a lot of config options I don't need. I now have an
uptime of 1 day 20 hours, so maybe the problem is now solved.  I have attached
my .config. I plan on slowly adding things back to the way they were default to
figure out what is causing the problems.
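
Adding options back one group at a time is easier with a list of exactly which options differ between the config that locks up and the one that does not. The following is a minimal Python sketch of such a comparison, not the method actually used in this report. The function names are hypothetical, and it makes one simplifying assumption: an option that is absent from a file is treated the same as one marked "is not set".

```python
import re

SET_RE = re.compile(r"^(CONFIG_\w+)=(.*)$")
UNSET_RE = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Map each option in a kernel .config to its value ('n' if not set)."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        match = SET_RE.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = UNSET_RE.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def diff_configs(bad_text, good_text):
    """Return {option: (value in bad config, value in good config)} for differences."""
    bad = parse_config(bad_text)
    good = parse_config(good_text)
    changed = {}
    for name in sorted(set(bad) | set(good)):
        # Assumption: a missing option is equivalent to "is not set".
        bad_value = bad.get(name, "n")
        good_value = good.get(name, "n")
        if bad_value != good_value:
            changed[name] = (bad_value, good_value)
    return changed
```

Feeding it the text of the two .config files gives the candidate options to re-enable in batches, halving the batch each time a lockup returns. Because the lockups here can take a day or more to appear, each round still needs a long soak before a config can be called good.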
Comment 24 Andrew Hamilton 2006-11-07 11:19:46 EST
I installed Kubuntu 6.10 (Edgy), and I have the exact same problem. Maybe this
is xorg related? Both FC6 and Kubuntu Edgy use xorg 7.1.
Comment 25 Jarod Wilson 2006-11-07 12:22:53 EST
Might I suggest configuring kdump to see if you can get a vmcore when these
problems hit? Might better help to illustrate where the problem really is...

http://fedoraproject.org/wiki/FC6KdumpKexecHowTo

Comment 26 Ed Friedman 2006-11-07 17:24:15 EST
I've tried booting with the "noapic" flag and so far this seems to have solved
the problem for me.
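
When testing boot flags such as noapic or acpi=off, it is worth confirming that the running kernel was actually booted with them, since an edit to the wrong grub.conf entry is easy to miss. A minimal Python sketch follows; the function name and the default list of flags are illustrative choices, and it only checks for exact token matches on the command line.

```python
def boot_flags(cmdline, wanted=("noapic", "nolapic", "acpi=off")):
    """Report which of the wanted flags appear on a kernel command line."""
    tokens = cmdline.split()
    return {flag: flag in tokens for flag in wanted}


if __name__ == "__main__":
    # /proc/cmdline holds the command line the running kernel was booted with.
    try:
        with open("/proc/cmdline") as handle:
            print(boot_flags(handle.read()))
    except OSError as error:
        print("could not read /proc/cmdline:", error)
```

Recording this output alongside each uptime figure makes it clear which flag combination a given lockup-free period was measured under.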
Comment 27 gdelx001 2006-11-11 15:18:44 EST
(In reply to comment #25)
> Might I suggest configuring kdump to see if you can get a vmcore when these
> problems hit? Might better help to illustrate where the problem really is...
> 
> http://fedoraproject.org/wiki/FC6KdumpKexecHowTo
> 
> 

This sounds promising, but unfortunately after following the instructions I
can't get it to work.  I'm using the FC6 kdump kernel.  Forcing a crash produces
a syslog message that "Kexec: Warning: crash image not loaded" and when I go on
to manually run kexec to load the crash image (kexec -p ...) I get:

Invalid memory segment 0x1000000 - 0x1324fff

This seems to be a recent problem that many posters are having (maybe with
x86-64 only?).

Please tell me if there's presently a way to get around this difficulty!
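
One thing that may be worth checking, offered as a guess rather than a diagnosis: kexec -p loads the crash image into memory reserved at boot by the crashkernel= parameter, and that reservation normally shows up as a "Crash kernel" entry in /proc/iomem. If there is no such entry, or the segment kexec reports lies outside it, the load could plausibly fail with an error like the one above. The Python sketch below tests that under those assumptions; the function names are hypothetical.

```python
import re

# /proc/iomem lines look like "01000000-04ffffff : Crash kernel",
# possibly indented when the region is nested inside another one.
IOMEM_RE = re.compile(r"^\s*([0-9a-f]+)-([0-9a-f]+) : (.+)$")


def crash_kernel_region(iomem_text):
    """Return (start, end) of the 'Crash kernel' region, or None if absent."""
    for line in iomem_text.splitlines():
        match = IOMEM_RE.match(line)
        if match and match.group(3).strip() == "Crash kernel":
            return int(match.group(1), 16), int(match.group(2), 16)
    return None


def segment_fits(segment, region):
    """True if the (start, end) segment lies entirely inside the region."""
    if region is None:
        return False
    return region[0] <= segment[0] and segment[1] <= region[1]
```

To apply it to the error above, read /proc/iomem as root, pass the text to crash_kernel_region(), and test the reported segment with segment_fits((0x1000000, 0x1324fff), region). A result of None from crash_kernel_region() would suggest the crashkernel= reservation never took effect at boot.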


Comment 28 Michael Hurley 2006-11-11 21:45:37 EST
Adding the noapic flag in grub.conf *seemed* to solve the problem for me for a
while; I had this machine up for over 24 hrs w/o problem. Now, however, the
lockups are back, sometimes within less than an hour of each other.
Comment 29 Andrew Hamilton 2006-11-24 13:13:57 EST
Created attachment 142082 [details]
.config for Kubuntu 6.10 that locks up like fc6
Comment 30 Andrew Hamilton 2006-11-24 13:15:37 EST
Created attachment 142083 [details]
.config for Kubuntu 6.10 that doesn't lock up
Comment 31 Ryan Ackley 2007-01-05 20:09:26 EST
I am also getting frequent lockups.  I have a Core 2 Duo e6300 @ 2.8 GHz,
Gigabyte 965P-DS3 motherboard, and an nVidia 7900GS.  I get the lockup with the
CPU running at default speeds and overclocked, and with both the nv and nvidia
drivers.
Comment 32 gdelx001 2008-01-30 22:51:14 EST
The issue has gone away after replacing the RAM with a completely new,
uniform set of DIMMs.  Memtest86+ repeatedly passed the bad memory, so it was
an expensive guess to come to this successful conclusion.

