Red Hat Bugzilla – Bug 57178
System crash after a week running Distributed.net
Last modified: 2008-08-01 12:22:51 EDT
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; Cairhien)
Description of problem:
After a little over a week of uptime and running the Distributed.net
client the whole time, the system locked and I was forced to use my reset
button and then the kernel would spit out a panic due to the ext3fs
headers being corrupted.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Standard install of RH7.2
2. Install and run Distributed.Net linux client entire time
3. Leave box running for a little over a week
Actual Results: Major hard core death.
As in description:
The boxes both locked up after about 8 or 9 days uptime and I was unable
to get them to boot. The kernel would panic and not go anywhere. The
rescue CD couldn't find the O/S it said as well when I tried using it.
Ended up formatting /,/boot and swap and reinstalling
Expected Results: System should have not crashed.
But, in event it did - should not have gotten corrupted and prolly shoulda
said - whoa, you wanna run the disk checking utility?
I had 2 different systems running (both AMD systems running the i586
architecture) and both systems were running Distributed.net's client.
After a little over a week, both systems locked up on me and I was forced
to use the reset button to reboot the system. On reboot I would get a
kernel panic because the ext3fs headers were corrupted on the / partition.
In rescue mode, I could not mount the fs as ext3, only ext2. Other
partitions remained undamaged (or so it appears - I haven't been able to
properly examine them yet). The system that crashed first was an AMD K6
233 w/ 128 megs of RAM and a 6 gig hard drive split into 3 partitions
(/boot, / and swap), henceforth "marath".
The second system (rhuidean) is an Athlon 1.4 ghz w/ 1.5 gigs of RAM, a 60
gig hard drive, a 20 gig hard drive, a 13 gig hard drive and a 2.5 gig
hard drive. The 2.5 gig hard drive carried /, /boot, and swap.
Both boxes have generic ethernet cards and nothing special for a video
card. Marath has a matrox pci card w/ 4 megs of ram, rhuidean has an agp
card and I don't recall the make.
I was using the "off the shelf" version of 7.2 but customized each install.
On Rhuidean, I didn't install a web server or php or mysql as I wanted to
compile myself (of which, I got apache and mysql compiled before the
crash). Marath had most packages installed including the webserver and
all that other good stuff.
Both had KDE and Gnome installed as well.
KDE was actively running on rhuidean most of the time whereas marath was
sitting at the CLI.
So, on reset, the boxes wouldn't boot because the kernel panicked, saying
the ext3fs headers were corrupted. When my first
system crashed, I got worried about rhuidean as it carries many files
that, if lost, would be very difficult to recover.
Well, after rhuidean passed roughly 8 days uptime, I thought that it was
just marath that died for some reason. Well, I woke up the next day -
looked at kde and things didn't seem right. It locked. No inputs were
accepted to the system via the keyboard - total and complete system
failure (ala windows). I knew what was going to happen on reboot - but
tried anyhow. Same situation. Kernel panic. Ext3fs headers corrupted.
Tried linux rescue again off the boot disc - nothing. Was able to mount
my 3 other hard drives and the kernel though. They were all ext3 - so, it
didn't appear as if I'd lost any information off of them. Being able to
mount the / partition as ext2, I fsck'd it, just to see - it went ape w/
all sorts of stuff and still no boot (even telling kernel and all that I
had changed it from ext3 to ext2). But, I was able to go back into linux
rescue and because of multiple drives in the system, mount / as ext2 and
save a file or two. I have NOT checked any files on the system yet due to
the experiment described below - so, I'm unsure if I have any corrupted
files on any drives (including backed up off of / since I formatted it)
Marath is still off and hasn't been reformatted and reinstalled yet. I'm
running a test on rhuidean at the moment. I did the same install as
before for rhuidean except that this time I have not installed the
Distributed.net client. I have almost a 9 day uptime so far and am
aiming to try for 2 weeks. Once I reach the 2 week mark, I'll halt the
system and plug up the other hard drives (disconnected power supply
cables) and assume that the system is "stable".
Anyhow, the DNET client causes the proc to run at roughly 99.9% the entire
time. That shouldn't have caused the box to fail either, so I don't
know if it just overstrained the system or what - but it's been pretty
fatal so far.
Do these machines have VIA chipsets? (Give us the output of the lspci
program if you do not know -- "su -" and then "lspci")
If so, booting with the "noathlon" argument will work around a bug
in the BIOS from VIA.
What brand and model are your hard drives?
"cat /proc/ide/hd*/model" will tell us what we need to know.
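For anyone who would rather collect this in one pass, here is a minimal Python sketch that reads the same /proc/ide/hd*/model files the cat command above prints. The function name ide_models and its root parameter are made up for illustration, and the sketch assumes a kernel that still exposes the legacy /proc/ide tree.

```python
import glob
import os


def ide_models(root="/proc/ide"):
    """Return {device: model string} for each hd* entry under root.

    `root` is a hypothetical parameter so the function can be pointed at
    a saved copy of the tree; the default matches the path used in the
    cat command.
    """
    models = {}
    for path in sorted(glob.glob(os.path.join(root, "hd*", "model"))):
        # The directory name (hda, hdb, ...) identifies the drive.
        device = os.path.basename(os.path.dirname(path))
        with open(path) as fh:
            models[device] = fh.read().strip()
    return models


if __name__ == "__main__":
    for device, model in ide_models().items():
        print(f"{device}: {model}")
```

As with the cat command, this only sees drives that are plugged in and detected, so it will not report anything for drives that have been disconnected.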
Actually, the hard drives that corrupted data are the ones we care about,
and that, I guess, would be the ones you unplugged, so with them unplugged
the "cat /proc/ide/hd*/model" won't be very helpful.
Could you instead post the HD brand/model as read off the label for the
drives you unplugged?
I know for sure that rhuidean has a via chipset. It is booted w/ the noathlon
statement as trying to boot without it caused kernel panics and oopses et al.
It was booted with noathlon as well before the crashing episode.
Rhuidean's HDD brands are as follows:
60 gig = maxtor 5t060h6 I believe. Since it isn't plugged up and I can't read
the label and I'm running my uptime test - it'll be 5 more days before I can
have that exact information. (same follows for 2 more drives - but, as far as I
know, they weren't damaged)
20 gig = quantum bigfoot ts (not plugged up - can't read label)
13 gig = maxtor (not plugged up - can't read label)
labels are hidden in a running system.
2.5 gig = ST52520A (seagate) - had / corrupted (ran the cat cmd listed above
for model number and then google'd the model number to get seagate)
I will verify the above after my 2 week uptime test is complete. Roughly on
On marath (which is currently not even booted):
1 6 gig drive = Samsung SV0643A (system not running, pulled out drive and got
info off drive)
The samsung and the seagate drives are the ones that I could immediately tell
had corruption on them due to non-mounting as ext3 whereas all others would
mount as ext3 (and it was only / on both that wouldn't mount. /boot would
Both boxes have the VIA chipset. (can see chip on mb)
marath has a PA-2011 mb from First Mainboard
rhuidean uses a tyan S2390B
By the way, the dnet client that I used is version number:
[x86/ELF] v2.8015.469 2001-05-30
and can be found at
One other question: what wattage is your power supply in each machine?
Marath uses a 200 watt power supply (at)
Rhuidean is a 250 watt power supply (atx)
Yikes! That's almost certainly your problem. Athlons use lots of
juice, and 250 watts is definitely less than AMD recommends for 4 hard
drives plus an Athlon plus the other random components. A hard drive that is underpowered
can write garbage; in fact, that is one of the reasons that a power loss
can damage file systems, as documented in the Red Hat whitepaper on ext3 at
When you are using 100% of your CPU time, you are drawing a bit more power
(Linux idles the CPU when it is not in use, reducing power draw) and that
means that running the dnet client might have been the straw that broke
the camel's back. Likewise, ext3 writes some things to disk twice (the
journal) and writes more often (synced every 5 seconds instead of every 30
seconds) and so can cause slightly more power draw from the disks. (This
part is conjecture; I have not measured this.)
The fact that you have not seen corruption after unplugging some hard
drives would indicate that it is likely that you have reduced your power
draw a bit. I would strongly suggest upgrading to power supplies recommended
by AMD (www.amd.com has this information on it); I tend to run 350 or 400
watt power supplies in Athlon equipment.
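To make the power argument concrete, here is a rough Python sketch of the budget arithmetic. Only the 250 watt supply rating comes from this report. Every per-component wattage below is a placeholder assumption, not a measurement and not an AMD figure, and should be replaced with numbers from the component datasheets. The function name and the 20% safety margin are likewise made up for illustration.

```python
def power_headroom(supply_watts, loads, margin_percent=20):
    """Return watts left over after reserving a safety margin.

    supply_watts   -- rated output of the power supply
    loads          -- {component name: estimated draw in watts}
    margin_percent -- share of the rating held back (hypothetical default)

    A negative result means the estimated load exceeds the usable budget.
    """
    usable = supply_watts * (100 - margin_percent) // 100
    return usable - sum(loads.values())


# PLACEHOLDER figures for illustration only; none of these were measured
# on the machines in this report.
rhuidean_loads = {
    "cpu_full_load": 70,
    "four_hard_drives": 4 * 15,
    "motherboard_ram_cards": 60,
}

if __name__ == "__main__":
    for rating in (250, 350, 400):
        print(f"{rating} W supply: {power_headroom(rating, rhuidean_loads)} W headroom")
```

The placeholder numbers are chosen only to exercise the function and say nothing about whether either machine was actually underpowered. Drive spin-up and CPU load peaks can be well above steady-state draw, which is one reason to hold back a margin rather than budget right up to the rating.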
Well, that could possibly explain the athlon based system (rhuidean), which is
now also consuming less power by not using 100% proc and because it only has 1
hdd plugged up along w/ 1 cd (cd not usually in system - needed for install and
haven't removed it).
But, the plain old 233 mhz amd k6 with a single hard drive - that one would
have been using FAR less power than the athlon system as the chip doesn't draw
as much and it only had 1 hdd spun up.
As a benchmark of sorts w/ marath (the 233 system). Marath ran suse 6.3 quite
happily for about a year, year and a half - somewhere around there with the 200
watt power supply, a 20 gig drive, a 13 gig drive, a 6 gig drive, and a 2.5 gig
drive. And ran the dnet client happily (I upgraded the client as new releases
came out) as well.
So, that system changed in that it now has just a 6 gig drive and had red hat 7.2
on it. Based on that - with that particular system, I wouldn't say power
consumption caused an error.
With the Athlon based system. My <shudder>windows 2k</shudder> system is an
Athlon 1.2 ghz w/ 256 megs of RAM, 2 hdd, 1cdrw/dvd, 1cd 250 watt power supply
running windows dnet 24/7 - it can maintain uptimes of roughly 2 weeks before
it just runs into a standard windows must now reboot mode. It has only fully
crashed hardcore on me one time and that was before I had dnet running.
Once again... I haven't physically been able to confirm that the data on my
other hard drives isn't corrupt (I'm hoping it is not as I was able to mount
them properly in linux rescue) - but, since I could mount them properly, I'm
assuming that they didn't get written to poorly due to being underpowered (and
I had transferred gigs between different boxes) - it was only the / partition.
I don't recall if it was in those whitepapers or not, but I had read somewhere
that sometimes the memory is dumped on the ext3 drive. I am not sure if that
But, basically - it happening on both systems, completely different - and one,
even at 100% proc usage (marath - 233 system) where it had all been running
fine before for many many many happy days.
That's why when I originally wrote (not via bugzilla) - I think I posed the
question if it had something to do w/ foreshadowing due to overuse of the proc
as most boxes prolly aren't using 100% proc 100% of the time. And fired off my
findings to ya'll.
I'll have to look into upgrading the power supply on my athlon based system(s).
But, once again - the first box that crashed isn't even close to being
athlon based and doesn't have anything in it that sucks huge amounts of power
and was able to handle more of a load before I put rh 7.2 on it. Of course,
between suse 6.3 and rh 7.2 were kernel changes. Both were detected as i386
Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases,
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/