Bug 161153 - sluggish/nearly hung system
Status: CLOSED INSUFFICIENT_DATA
Product: Fedora
Classification: Fedora
Component: kernel
Version: 4
Hardware: All Linux
Priority: medium
Severity: medium
Assigned To: Dave Jones
QA Contact: Brian Brock
Reported: 2005-06-20 16:48 EDT by Matt Domsch
Modified: 2015-01-04 17:20 EST

Doc Type: Bug Fix
Last Closed: 2006-05-04 09:55:21 EDT


Attachments
profile-2.6.12-1.1390_FC4smp-7 (64.15 KB, text/plain), 2005-07-10 00:57 EDT, Matt Domsch
profile-2.6.12-1.1390_FC4smp-8 (1.41 KB, text/plain), 2005-07-10 00:58 EDT, Matt Domsch
dmesg-2.6.12-1.1390_FC4smp (20.60 KB, text/plain), 2005-07-10 00:58 EDT, Matt Domsch


External Trackers
Tracker: Linux Kernel, ID: 4319 (no priority, status, or summary recorded)

Description Matt Domsch 2005-06-20 16:48:44 EDT
Description of problem:
FC4, fresh install, though same seen after FC3->FC4 upgrade too.
Dell PowerEdge 2400, 2x933MHz, 1GB RAM, built-in e100 network, several disks on
onboard aic7890 controller, using LVM on one disk for boot, md raid 1 + lvm1 on
two disks for /home.

The system is initially fine, but after a few minutes it becomes sluggish.
The GNOME System Monitor stops refreshing every second and instead refreshes
every few minutes.  top hangs.  I can no longer sudo or log in, but can start new
gnome-terminals with Ctrl-T.  The ethernet device can no longer be pinged from outside.

Switching from VT7 to VT1 succeeds, but cannot log in.  SysRQ works there
though.  Nothing unusual on the task lists.   SysRQ-M shows plenty of free
memory, -P shows both CPUs in idle loop.  Outgoing network connections
occasionally OK, though mostly hung.  Rebooting via sysrq-b works.  Emergency
sync claims to work, but the data appears not to be committed to journal or
disk, as it's not present after reboot.  If I edit files while in this state,
those changes do not persist after reboot, even after sysrq-s.

Tried with acpi=off, selinux=0, audit=0 in various combinations, no effect.

This is the strangest thing I've seen in a while.


Version-Release number of selected component (if applicable):
FC4 gold release SMP i686 kernel

How reproducible:
on every boot
Comment 1 Matt Domsch 2005-06-20 18:00:45 EDT
I'll add that this system has been rock-solid for years, running everything from
RHL7.2 through FC3 most recently.  This problem is only seen on FC4.
Comment 2 Matt Domsch 2005-06-21 09:40:57 EDT
Running the UP kernel instead of SMP for 12+ hours now, no failures.  Reducing
to normal severity, as I can live with UP for a while.
Comment 3 Dave Jones 2005-06-24 23:06:54 EDT
if you boot with profile=1, and run readprofile, maybe we can find out where
it's spending its time.
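
A minimal sketch of how the readprofile output could be ranked once it has been
captured to a file, assuming the usual three-column layout (tick count, symbol
name, normalized load) with a final "total" line.  The function name
top_symbols and the sample numbers are made up for illustration; they are not
taken from the attachments on this bug.

```python
def top_symbols(profile_text, n=10):
    """Return the n (ticks, symbol) pairs with the highest tick counts."""
    rows = []
    for line in profile_text.splitlines():
        fields = line.split()
        # Expect "<ticks> <symbol> <normalized load>"; skip anything else.
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        # The summary row is not a kernel symbol.
        if fields[1] == "total":
            continue
        rows.append((int(fields[0]), fields[1]))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:n]


# Hypothetical sample, NOT data from this bug's attachments.
SAMPLE = """\
   120 poll_idle                  2.0000
    45 spin_unlock_irqrestore     1.5000
     3 schedule                   0.0100
   168 total                      0.0001
"""

for ticks, symbol in top_symbols(SAMPLE, n=2):
    print(f"{ticks:>8}  {symbol}")
```

Sorting by raw ticks rather than normalized load keeps the idle routine visible
at the top, which is useful here because the question is where the CPUs sit
while the system appears hung.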
Comment 4 Matt Domsch 2005-07-10 00:56:36 EDT
I've reproduced this again with 2.6.12-1.1390_FC4smp.

I'll attach the results of readprofile, but it's not showing much.
Several things to note:
1) I'm using the ppp_mppe kernel module from http://pptpclient.sf.net.  It's
possible this is related, though I'm not certain, and I haven't been able to
rule it out yet.

2) /proc/interrupts shows that the timer tick, generally 1000/sec, has stopped.

3) One of the highest count items in the profile is spin_unlock_irqrestore.
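
The stopped timer tick in item 2 can be checked by sampling /proc/interrupts
twice and comparing the IRQ 0 counts.  This is only a sketch: the helper names
are hypothetical, and it assumes the timer is on the row labelled "0" with one
count column per CPU.  Note that in the hung state time.sleep() itself may not
return, in which case comparing two manual reads of the file is the fallback.

```python
import time


def timer_count(interrupts_text, irq="0"):
    """Sum the per-CPU counts on the /proc/interrupts row for one IRQ."""
    for line in interrupts_text.splitlines():
        label, sep, rest = line.partition(":")
        # The CPU header row has no colon and is skipped here.
        if not sep or label.strip() != irq:
            continue
        total = 0
        for field in rest.split():
            if not field.isdigit():
                break  # counts end where the controller/device names begin
            total += int(field)
        return total
    raise ValueError(f"IRQ {irq} not found")


def ticks_per_second(path="/proc/interrupts", interval=5.0):
    """Estimate the timer interrupt rate over a short interval."""
    with open(path) as f:
        before = timer_count(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = timer_count(f.read())
    return (after - before) / interval
```

On a healthy kernel with a 1000/sec tick, ticks_per_second() would be expected
to return a value near 1000; a value of 0 matches the symptom described above.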

Attaching 2 profiles.  profile-${kernelver}-7 was taken after it had been hung
for a bit and no more disk I/O was occurring.  profile-${kernelver}-8 was also
taken while in the hung state, though I reset the profile counters and waited
several seconds before grabbing this data.

Comment 5 Matt Domsch 2005-07-10 00:57:37 EDT
Created attachment 116557
profile-2.6.12-1.1390_FC4smp-7
Comment 6 Matt Domsch 2005-07-10 00:58:08 EDT
Created attachment 116558
profile-2.6.12-1.1390_FC4smp-8
Comment 7 Matt Domsch 2005-07-10 00:58:47 EDT
Created attachment 116559
dmesg-2.6.12-1.1390_FC4smp
Comment 8 Matt Domsch 2005-07-11 16:59:13 EDT
I can confirm now that the ppp_mppe module is not at fault.  I was able to
reproduce the failure without that module loaded.
Comment 9 Dave Jones 2005-07-11 17:08:04 EDT
Curious: you seem to be using the idle=poll idling routine instead of the
default idle.  Are you specifying this as a boot param?  Does the problem go
away if you use the defaults?
Comment 10 Matt Domsch 2005-07-11 18:33:44 EDT
I have no idle=anything options specified.  I tried with idle=halt, and saw
similar behavior (though saw the message at startup saying it's using halt for
idle, when otherwise I see no message at all).  But yes, it appears to be using
poll_idle (even on the non-SMP kernel).  That's strange...
Comment 11 Matt Domsch 2005-07-12 08:43:39 EDT
I built 2.6.13-rc2-git3 using the FC4 i686 SMP config file.  There, the idle
routine (as shown by readprofile) is in fact default_idle.  Same failure was
observed several minutes after boot though.  So, it's not a Fedora-specific
patch that's causing my problem.

As it happens about the same point every time, several minutes after boot, one
thought I had was that the intended jiffies wraparound at 5 minutes after boot
could be part of the problem.
Comment 12 Dave Jones 2005-07-15 17:35:07 EDT
[This comment has been added as a mass update for all FC4 kernel bugs.
 If you have migrated this bug from an FC3 bug today, ignore this comment.]

Please retest your problem with today's 2.6.12-1.1398_FC4 update.

If your problem involved being unable to boot, or some hardware not being
detected correctly, please make sure your /etc/modprobe.conf is correct *BEFORE*
installing any kernel updates.
If in doubt, you can recreate this file using:

mv /etc/sysconfig/hwconf /etc/sysconfig/hwconf.bak
mv /etc/modprobe.conf /etc/modprobe.conf.bak
kudzu


Thank you.
Comment 13 Matt Domsch 2005-07-16 09:29:40 EDT
Problem remains after 2.6.12-1.1398_FC4smp installation, no change in behavior.

With no kernel parameters passed, default_idle was once again the idle routine
in use, which is what was expected.

The timer interrupt stopped ticking after ~230k ticks, well ahead of the jiffies
wraparound, so I don't believe this aspect to be related.  In other tests with
earlier kernels, it had stopped both before and after the wraparound point.
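
For reference, the arithmetic behind "well ahead of the wraparound" can be
sketched as follows, assuming HZ=1000 (the 1000/sec tick noted in comment 4)
and the 5-minute wraparound mentioned in comment 11.  The 230,000 figure is the
approximate count quoted above, not an exact reading.

```python
HZ = 1000                 # timer ticks per second (assumed, per comment 4)
WRAP_SECONDS = 5 * 60     # jiffies wraparound point after boot (per comment 11)


def seconds_since_boot(ticks, hz=HZ):
    """Convert a timer tick count into seconds of uptime."""
    return ticks / hz


def margin_before_wrap(ticks, hz=HZ, wrap_seconds=WRAP_SECONDS):
    """Seconds remaining before the wraparound; negative means after it."""
    return wrap_seconds - seconds_since_boot(ticks, hz)


ticks_at_stop = 230_000   # approximate value from this comment
print(seconds_since_boot(ticks_at_stop), margin_before_wrap(ticks_at_stop))
```

Under those assumptions the tick stopped roughly 230 seconds after boot, about
70 seconds short of the 300,000-tick wraparound point.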

I'm going to look into the possibility that it's ACPI-related again (this system
has a very early ACPI BIOS), though IIRC using acpi=off initially didn't resolve
the problem.
Comment 14 Matt Domsch 2005-07-28 09:42:27 EDT
Bug in upstream bugzilla:
http://bugzilla.kernel.org/show_bug.cgi?id=4319
appears to match, I've added my details from this bug there.
Comment 15 Matt Domsch 2005-08-27 15:53:33 EDT
Progress.
Booting with 'clock=tsc' seems to work as a workaround.  This isn't root cause,
but it's a start.  I tried kernels 2.6.13-rc{2,4,6,7}: same behavior, and the
same workaround helps, as does disabling CONFIG_X86_PM_TIMER.
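
When testing a workaround like this, it helps to confirm the parameter really
reached the kernel.  A small sketch, assuming the running kernel exposes its
command line in /proc/cmdline; the helper names are hypothetical.

```python
def boot_param(cmdline, name):
    """Return the value of name=value on a kernel command line, or None."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key == name:
            return value
    return None


def current_boot_param(name, path="/proc/cmdline"):
    """Look up a boot parameter on the running kernel's command line."""
    with open(path) as f:
        return boot_param(f.read(), name)
```

For example, current_boot_param("clock") would return "tsc" on a boot where
the workaround was applied, and None when no clock= option was given.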
Comment 16 Matt Domsch 2005-08-28 23:28:04 EDT
Well, it *was* working fine for >24 hours with 'clock=tsc' on Fedora kernel
2.6.12-1.1398_FC4smp, then it hung the same way again, with timer interrupts
no longer being received.  So it's much better, but 'clock=tsc' doesn't
completely solve it.
Comment 17 Dave Jones 2005-09-30 02:49:00 EDT
Mass update to all FC4 bugs:

An update has been released (2.6.13-1.1526_FC4) which rebases to a new upstream
kernel (2.6.13.2). As there were ~3500 changes upstream between this and the
previous kernel, it's possible your bug has been fixed already.

Please retest with this update, and update this bug if necessary.

Thanks.
Comment 18 Matt Domsch 2005-10-01 12:59:20 EDT
2.6.13-1.1526_FC4smp fails similarly.
Comment 19 Dave Jones 2005-11-10 14:55:29 EST
2.6.14-1.1637_FC4 has been released as an update for FC4.
Please retest with this update, as a large amount of code has been changed in
this release, which may have fixed your problem.

Thank you.
Comment 20 Dave Jones 2006-02-03 00:51:01 EST
This is a mass-update to all currently open kernel bugs.

A new kernel update has been released (Version: 2.6.15-1.1830_FC4)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO_REPORTER state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks' time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

Thank you.
Comment 21 John Thacker 2006-05-04 09:55:21 EDT
Closing per previous comment.
