Bug 78541

Summary:

Weird partial SMP kernel 2.4.18-17.7.x hang on HP netserver lp2000r

Product:

[Retired] Red Hat Linux

Reporter:

ville.sulko

Component:

kernel

Assignee:

Arjan van de Ven <arjanv>

Status:

CLOSED CURRENTRELEASE

QA Contact:

Brian Brock <bbrock>

Severity:

high

Priority:

medium

Version:

7.3

CC:

chpruc, erik.bennett, jason.piszcyk, johnsom, p.dania

Hardware:

i686

OS:

Linux

Doc Type:

Bug Fix

Last Closed:

2004-09-30 15:40:14 UTC

Attachments:

Description	Flags
Bootup messages (dmesg)	none

Description ville.sulko 2002-11-25 13:42:08 UTC

Description of problem:
On a dual-CPU HP lp2000r with kernel 2.4.18-17.7.xsmp, we have twice experienced a 
mysterious partial hang (see below) of the kernel. Network and console logins are 
usually still possible, but result in mostly unusable sessions.

Version-Release number of selected component (if applicable):


How reproducible:
Didn't try

Steps to Reproduce:
I don't know of any systematic way to reproduce this, except for leaving the system 
on and waiting... It has, however, happened twice, about two weeks apart.

Additional info:

These hangs are almost too weird to describe :) On the machine there is an 
apache and tomcat process running. Both crashes were noticed by the fact that 
the web service was not responding anymore. The machine still responded to 
ping, and I was actually able to ssh into it too. The connection (ssh login) 
was a bit sluggish, and when I was logged in, typing something was a bit hard; 
it seemed that the display always lagged one character (maybe network packet) 
behind. So, when I entered 'ls', only 'l' was displayed, then pressed 
enter, 'ls' was displayed, then pressed 'enter' again, and ls listing was 
shown, etc etc. The same thing happened on the console too, so this wasn't a 
network error.
Another thing I noticed was that the system clock seemed to be hung. That is, 
date always returned the same time. From the logs we saw that cron had ceased 
to execute tasks between 00:40 and 00:50 (sar was no longer run), and when we 
gave 'shutdown -r 0' to the system, the logs showed that the reboot time was 
about 00:43, even though it was actually already the following day. And no, the 
shutdown also hung, and we were forced to use the main switch...
The whole system was very sluggish overall, and I couldn't run top, for 
example. The ps listing completed OK, but showed nothing too unusual (sorry, no 
saved data this time). free showed that the machine wasn't swapping, so that's 
not the problem. According to 'w', the machine was also idle, so no processes were 
running wildly. There were also no panics or other kernel messages in dmesg 
or the messages file, so there is no further info I can provide at this time.

So, overall this seemed like a partial kernel crash, leaving the
kernel unable to function properly. I know the description above most likely 
isn't enough to isolate the problem, so is there something that I should 
especially try or write down the next time it happens?

BTW, the system had been running kernel 2.4.18-10 for a couple of months, and 
there were no crashes as far as I can recall. There seems to be a newer errata 
kernel, 2.4.18-18.7.x. I might upgrade to that too, even if the bugs fixed don't 
seem to match this problem.

About the system :
HP netserver lp2000r
Dual PIII 1GHz
HP NetRAID 2M with 6 disks (raid1)
RedHat 7.3, with the latest patches up to 11-Nov-2002.

Comment 1 ville.sulko 2003-01-02 14:16:44 UTC

Ok, more similar problems on two different HP lp2000r machines. The same 
symptoms, sluggish machine etc. One machine was running 2.4.18-18.7.x and 
the other was running 2.4.18-19.7.x. 

This time, however, I noticed one very interesting thing. On both machines, 
grepping from /proc/interrupts, the timer interrupts (#0) seemed to be happening 
way too infrequently. I rebooted one machine (after which it seems to run 
just fine), calculated the timer rate from a 10 second (externally timed :) 
sample, and got about 512 Hz. On the other machine (the misbehaving one), using 
the same method I got about 9 Hz... That is, a total of 92 interrupts (46 + 46) in 
ten seconds... I guess this explains why "sleep 1" lasts over 10 seconds and
"vmstat 1" and "top" seem to freeze.

And one more note: on the other machine, there was the following in dmesg.
It may or may not be relevant...

Jan  1 14:12:10 extra1 kernel: set_rtc_mmss: can't update from 59 to 12
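
The rate estimate in this comment is just the change in the IRQ 0 counter divided by the sampling interval. A minimal Python sketch of that arithmetic, using the 92 interrupts in ten seconds quoted above. The function names are made up for illustration, and the slowdown estimate assumes that kernel time advances by exactly one tick per timer interrupt at a configured rate of 512 Hz, which this report does not confirm:

```python
def tick_rate_hz(count_start, count_end, interval_s):
    """Average interrupt rate over an externally timed sampling interval."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return (count_end - count_start) / interval_s


def wall_seconds_per_kernel_second(configured_hz, observed_hz):
    """Wall-clock time needed for the kernel to count one second of ticks.

    Assumption: the kernel treats configured_hz ticks as one second, so if
    ticks arrive at only observed_hz, kernel time stretches by the ratio.
    """
    if observed_hz <= 0:
        return float("inf")  # no ticks at all: kernel time never advances
    return configured_hz / observed_hz


# 46 + 46 timer interrupts (both CPUs) counted over ten seconds.
rate = tick_rate_hz(0, 46 + 46, 10.0)
stretch = wall_seconds_per_kernel_second(512, rate)
print(f"timer rate: {rate:.1f} Hz")
print(f"one kernel second takes roughly {stretch:.0f} s of wall time")
```

Under those assumptions a one-second sleep stretches to somewhere near a minute, which is at least consistent with "sleep 1" lasting over 10 seconds.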

Comment 2 Arjan van de Ven 2003-01-02 14:20:10 UTC

this phenomenon seems to happen with a lot of kernels (not just RH's) and only
on hp lp2000r's.... hardware/bios bug?

Comment 3 ville.sulko 2003-01-02 21:00:44 UTC

Created attachment 89071 [details]
Bootup messages (dmesg)

Just for the record, here's the dmesg listing of the kernel bootup sequence.

Comment 4 ville.sulko 2003-01-03 10:52:41 UTC

One more observation, the timer interrupt rate seems to be decreasing all the 
time. Yesterday it was at about 9 Hz, six hours later at 8 Hz and now (the 
following day) it seems to be around 6 Hz...

Is there anything to check to determine the reason for the timer interrupt 
slowdown? Is it possible to verify timer circuit settings and/or reset the
timer frequency? Or could there be some other reason for the interrupts not
to be delivered, in case the timer works ok?

BTW, I checked the release notes for the latest BIOS update, and there was 
nothing related to this kind of problem. As for the question of whether this could 
be a HW problem: it could of course, but it would have to be a generic lp2000r HW 
bug. Furthermore, we had no problems with some earlier kernels (<= 2.4.18-10 ?). 
Were those (RH) kernels running with CONFIG_HZ=100 or 512 ? Could this be a 
result of a timer wraparound or something like that?
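
The wraparound question can at least be bounded with arithmetic. A rough Python sketch, assuming the kernel's tick counter is a 32-bit value that starts at zero on boot and advances once per tick (an assumption about these kernels, not something confirmed in this report); `wrap_days` is a hypothetical helper:

```python
SECONDS_PER_DAY = 86400


def wrap_days(hz, bits=32):
    """Days of uptime until a counter of the given width overflows at hz ticks/s."""
    return (2 ** bits) / hz / SECONDS_PER_DAY


for hz in (100, 512):
    print(f"HZ={hz}: 32-bit wrap after {wrap_days(hz):.1f} days, "
          f"31-bit sign flip after {wrap_days(hz, bits=31):.1f} days")
```

Comparing these figures with the uptime observed before a hang would show whether a counter overflow is a plausible trigger.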

Comment 5 Michael Johnson 2003-01-04 00:46:40 UTC

I can confirm this issue.  I have had four HP lp2000r servers get stuck looping
the clock.  I am also unable to shut down the boxes without a force flag.
All of the servers were running the 2.4.18-18.7xsmp Red Hat errata kernel.  I
think we had a similar case to this on a previous kernel release as well.  A
reboot of the box does restore the correct clock operation.

Comment 6 Erik Bennett 2003-01-23 22:00:52 UTC

We've also seen this behavior on a dozen lp2000r machines as well as one lh3r. 
All of them had dual procs.  This happened on kernels from 2.4.18-17.7xsmp
through 2.4.18-19.7xsmp.  Also, these machines won't boot if you install an SMP
kernel on a machine with only one CPU.  It hangs right after the line:
Configuring 256 Unix98 ttys

This wasn't the case with the 2.4.9 series.

Comment 7 Erik Bennett 2003-01-23 23:40:13 UTC

I too did some timings, and the timer interrupts on this machine have gone to 0
(zero) in a ten second period.  Also, this machine has a relatively high number of them.

It takes 29 bits to hold this number (#0), for what that's worth.

Also, this clock is in a 6 second loop.  This is one of two failure modes we're
seeing.  The other is the slowdown.


rssv05 ~ 7# cat /proc/interrupts
           CPU0       CPU1       
  0:  309234823  309250674    IO-APIC-edge  timer
  1:          2          2    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  8:          1          0    IO-APIC-edge  rtc
 12:         11          9    IO-APIC-edge  PS/2 Mouse
 14:          0          2    IO-APIC-edge  ide0
 17:     140090     140329   IO-APIC-level  eth0
 18:     680127     686096   IO-APIC-level  sym53c8xx
 19:     535511     535621   IO-APIC-level  eth1
NMI:          0          0 
LOC:  625276847  625276870 
ERR:          0
MIS:          0
rssv05 ~ 8# cat /proc/interrupts
           CPU0       CPU1       
  0:  309234823  309250674    IO-APIC-edge  timer
  1:          2          2    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  8:          1          0    IO-APIC-edge  rtc
 12:         11          9    IO-APIC-edge  PS/2 Mouse
 14:          0          2    IO-APIC-edge  ide0
 17:     140090     140342   IO-APIC-level  eth0
 18:     680127     686096   IO-APIC-level  sym53c8xx
 19:     535521     535621   IO-APIC-level  eth1
NMI:          0          0 
LOC:  625282370  625282393 
ERR:          0
MIS:          0
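
Comparing two such listings by eye is error-prone, so here is a small Python sketch that diffs them. It is only an illustration: the helper names are invented, the parser assumes the column layout shown above, and the sample data is an excerpt of three rows from the two listings in this comment:

```python
def parse_interrupts(text):
    """Map each IRQ label (e.g. '0', 'LOC') to its list of per-CPU counts."""
    counts = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep:
            continue  # header or prompt line without a counter row
        per_cpu = []
        for field in rest.split():
            if not field.isdigit():
                break  # reached the controller/device description
            per_cpu.append(int(field))
        counts[label.strip()] = per_cpu
    return counts


def interrupt_deltas(before, after):
    """Per-CPU counter increase between two /proc/interrupts snapshots."""
    old, new = parse_interrupts(before), parse_interrupts(after)
    return {irq: [b - a for a, b in zip(old[irq], new[irq])]
            for irq in new if irq in old}


BEFORE = """\
           CPU0       CPU1
  0:  309234823  309250674    IO-APIC-edge  timer
 17:     140090     140329   IO-APIC-level  eth0
LOC:  625276847  625276870
"""
AFTER = """\
           CPU0       CPU1
  0:  309234823  309250674    IO-APIC-edge  timer
 17:     140090     140342   IO-APIC-level  eth0
LOC:  625282370  625282393
"""

for irq, delta in interrupt_deltas(BEFORE, AFTER).items():
    print(irq, delta)
```

On this excerpt the IRQ 0 timer row does not move at all, while eth0 gains 13 interrupts on CPU1 and the LOC row advances by 5523 on each CPU.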

Comment 8 Chad 2003-02-19 19:45:15 UTC

I have reproduced this problem consistently now 5 times in addition to watching 
several servers fail with this problem in the wild.

It affects only SMP boxes with any i386 kernel over 2.4.10 and does not affect 
RHAS.  It seems particularly problematic with HP servers.

To reproduce the problem, the server need only be installed (we install over 
the network using kickstart) with 7.x and left alone while still on the network.

Idleness is the common denominator.  Also, going from high demand to low demand 
or disuse speeds up the appearance of this issue.  I can get a hang with these 
symptoms in 1.5 to 4 days.

I can find no evidence in my research that indicates Asus CUR-DLS or CUR-DLSR 
servers (which are identical to HP LP2000r and LP1000r servers in nearly every 
way right down to the case design) are affected in the same way as HP 
hardware.  The primary difference between the two boxes is Asus uses Award BIOS 
while HP uses a modified PhoenixBIOS.

Comment 9 Jason Piszcyk 2003-05-01 03:57:08 UTC

I have also experienced the same problem.  The server is an HP LH4R with 
4 CPUs.

I am running kernel 2.4.18-24.7.x.

The server was previously running Red Hat 6.2 for a year and a half with no 
problems.  It was recently upgraded to 7.3, and ran fine for a couple of 
months with no issues.  It has now experienced this problem twice in the last 
fortnight.

I have disabled NTP, and configured the system to fire up in run-level 3, and 
the server has been running fine for about a week.  I am waiting to see how 
this goes.

Comment 10 ville.sulko 2003-05-02 04:24:52 UTC

Just for the record, as a workaround, I have recompiled the kernel with 
CONFIG_HZ=100, and haven't had any problems since. The problem most likely is 
still there, but at least it seems to be much less frequent.

Comment 11 Jason Piszcyk 2003-06-19 02:37:27 UTC

I have rebuilt the Kernel with 'CONFIG_HZ=100', and the server has been up for 
almost 3 weeks with no problems, although I am experiencing some slight pain 
from keeping my fingers and toes crossed.

The server has about 50 users on it during the day, so it is getting a decent 
workout.

Thanks Ville (I hope that's right) for the suggestion, it has definitely 
helped.  Any ideas if this is the long term solution, or am I likely to see 
the 'hang' again?

Comment 12 Arjan van de Ven 2003-06-19 07:40:11 UTC

newer errata kernels have HZ=100 again for other reasons (basically on some
machines it caused time skew when burning cds)

Comment 13 Pietro Dania 2003-07-14 12:32:40 UTC

Same problem on 3 out of about 20 lp2000r machines. 
Machines are 1.4 and 1.133 GHz SMP / 1 GB RAM / NetRAID 1M 
BIOS versions are spread from 4.6.06 to 4.6.16. 
Kernel is 2.4.20-18.7 and all errata installed. 
I can reproduce the problem by starting a stress test procedure, running 1000 
processes of 3 threads each. I kill all of them and soon the problem arises. 
In addition, I get a continuous sound; maybe the machine has flatlined: 
"BIIIIIIIIIIIIIIII" :-) 
 
?!? 
 
Since the machine is RH certified on 7.2, I'll stay with that distro.
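
The stress procedure itself is not included in this comment. As a rough idea of the load shape it describes (many multi-threaded processes, then an abrupt kill), a Python sketch might look like the following. Everything here is an assumption: `burst_then_kill` and the child program are invented for illustration, the default counts are far smaller than the 1000 processes mentioned above, and Python threads share one interpreter lock, so each worker loads roughly one CPU rather than three:

```python
import subprocess
import sys
import time

# Child program: start N busy-looping threads, then idle until killed.
CHILD = """
import sys, threading, time
def spin():
    while True:
        pass
for _ in range(int(sys.argv[1])):
    threading.Thread(target=spin, daemon=True).start()
time.sleep(3600)
"""


def burst_then_kill(n_procs, n_threads, run_seconds):
    """Run n_procs busy workers for run_seconds, then kill them all at once."""
    procs = [subprocess.Popen([sys.executable, "-c", CHILD, str(n_threads)])
             for _ in range(n_procs)]
    time.sleep(run_seconds)
    for p in procs:
        p.kill()  # abrupt stop, mimicking "I kill all of them"
    return [p.wait() for p in procs]  # exit codes of the killed workers


if __name__ == "__main__":
    codes = burst_then_kill(n_procs=4, n_threads=3, run_seconds=1.0)
    print(f"killed {len(codes)} workers")
```

Scaling `n_procs` toward the figure in this comment should be done with care on a production box.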

Comment 14 Chad 2003-09-18 21:43:44 UTC

We are calling this phenomenon "TimeWarp."

It is not fully understood but I have spent a good while exploring and
experimenting with affected servers.

Here's what my group knows:

1) The problem is indeed a skewing problem between the two CPUs.
2) CONFIG_HZ = 100 is just a delaying tactic.  TimeWarp still occurs at about
330 days on AS 2.1 with CONFIG_HZ = 100 kernels. 
3) Typical failure time is 30 days.
4) The "trigger" is heavy or sustained activity followed by an abrupt cessation
of activity.  Within three days of idleness, TimeWarp occurs.  The above
prescribed method for reproducing failure is correct.
5) BIOS alterations using the F11 method are not fruitful.
6) Building/installing a custom kernel that turns off ALL elements of power
management (APM and ACPI) and other superfluous functionality results in at
least 180 days (and counting) of uptime even with CONFIG_HZ = 512.

Comment 15 Bugzilla owner 2004-09-30 15:40:14 UTC

Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/