Bug 57178

Summary:	System crash after a week running Distributed.net
Product:	[Retired] Red Hat Linux	Reporter:	Brian <bcoloney>
Component:	kernel	Assignee:	Arjan van de Ven <arjanv>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Brock Organ <borgan>
Severity:	high	Docs Contact:
Priority:	medium
Version:	7.2
Target Milestone:	---
Target Release:	---
Hardware:	i386
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2004-09-30 15:39:18 UTC	Type:	---

Description Brian 2001-12-06 15:13:55 UTC

From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; Cairhien)

Description of problem:
After a little over a week of uptime and running the Distributed.net 
client the whole time, the system locked and I was forced to use my reset 
button and then the kernel would spit out a panic due to the ext3fs 
headers being corrupted.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Standard install of RH7.2
2. Install and run the Distributed.net Linux client the entire time
3. Leave box running for a little over a week
	

Actual Results:  Major hard core death.
As in description:
The boxes both locked up after about 8 or 9 days uptime and I was unable 
to get them to boot.  The kernel would panic and not go anywhere.  The 
rescue CD also said it couldn't find the O/S when I tried using it.
Ended up formatting /, /boot and swap and reinstalling.

Expected Results:  System should have not crashed.
But, in event it did - should not have gotten corrupted and prolly shoulda 
said - whoa, you wanna run the disk checking utility?

Additional info:

I had 2 different systems running (both AMD systems running the i586 
architecture) and both systems were running Distributed.net's client.  
After a little over a week, both systems locked up on me and I was forced 
to use the reset button to reboot the system.  On reboot I would get a 
kernel panic because the ext3fs headers were corrupted on the / partition.  
In rescue mode, I could not mount the fs as ext3, only ext2.  Other 
partitions remained undamaged (or so it appears - I haven't been able to 
properly examine them yet).  The system that crashed first was an AMD K6 
233 w/ 128 megs of RAM and a 6 gig hard drive strung into 3 partitions -
 /boot, / and swap henceforth (marath).
The second system (rhuidean) is an Athlon 1.4 ghz w/ 1.5 gigs of RAM, a 60 
gig hard drive, a 20 gig hard drive, a 13 gig hard drive and a 2.5 gig 
hard drive.  The 2.5 gig hard drive carried /, /boot, and swap.
Both boxes have generic ethernet cards and nothing special for a video 
card.  Marath has a matrox pci card w/ 4 megs of ram, rhuidean has an agp 
card and I don't recall the make.
I was using the "off the shelf" version of 7.2 but customized each install 
slightly differently.
On Rhuidean, I didn't install a web server or php or mysql as I wanted to 
compile myself (of which, I got apache and mysql compiled before the 
crash).  Marath had most packages installed including the webserver and 
all that other good stuff.
Both had KDE and Gnome installed as well.
KDE was actively running on rhuidean most of the time whereas marath was 
sitting at the CLI.
So, on reset, the boxes wouldn't boot because the kernel panicked, saying it 
thought the ext3fs headers were corrupted.  When my first 
system crashed, I got worried about rhuidean as it carries many files 
that, if lost, would be very difficult to recover.
Well, after rhuidean passed roughly 8 days uptime, I thought that it was 
just marath that died for some reason.  Well, I woke up the next day - 
looked at kde and things didn't seem right.  It locked.  No inputs were 
accepted to the system via the keyboard - total and complete system 
failure (ala windows).  I knew what was going to happen on reboot - but 
tried anyhow.  Same situation.  Kernel panic.  Ext3fs headers corrupted.
Great.
Tried linux rescue again off the boot disc - nothing.  Was able to mount 
my 3 other hard drives and the kernel though.  They were all ext3 - so, it 
didn't appear as if I'd lost any information off of them.  Being able to 
mount the / partition as ext2, I fsck'd it, just to see - it went ape w/ 
all sorts of stuff and still no boot (even telling kernel and all that I 
had changed it from ext3 to ext2).  But, I was able to go back into linux 
rescue and because of multiple drives in the system, mount / as ext2 and 
save a file or two.  I have NOT checked any files on the system yet due to 
the experiment described below - so, I'm unsure if I have any corrupted 
files on any drives (including backed up off of / since I formatted it)
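
For anyone repeating the rescue-mode check described above, a minimal shell sketch follows. It assumes a shell with awk available; the helper name mounted_fs_type and its optional file argument are illustrative and not part of any Red Hat tool. It reports which filesystem type a mount point was actually mounted with, which tells an ext3 mount apart from an ext2 fallback.

```shell
# Print the filesystem type a mount point is currently mounted with, read
# from a /proc/mounts-style file (fields: device mountpoint type options ...).
# The second argument is an illustrative addition; it defaults to /proc/mounts.
mounted_fs_type() {
    mount_point="$1"
    mounts_file="${2:-/proc/mounts}"
    # Keep the last matching entry, since a later mount hides an earlier one.
    awk -v mp="$mount_point" \
        '$2 == mp { fstype = $3 } END { if (fstype != "") print fstype }' \
        "$mounts_file"
}
```

For example, after something like `mount -t ext2 -o ro /dev/hdXN /mnt/rescue` (device and mount point are placeholders), `mounted_fs_type /mnt/rescue` would be expected to print ext2.
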
Marath is still off and hasn't been reformatted and reinstalled yet.  I'm 
running a test on rhuidean at the moment.  I did the same install as 
before for rhuidean except that this time I have not installed the 
Distributed.net client.  I have almost a 9 day uptime so far and am 
aiming to try for 2 weeks.  Once I reach the 2 week mark, I'll halt the 
system and plug up the other hard drives (disconnected power supply 
cables) and assume that the system is "stable".
Anyhow, the DNET client causes the proc to run at roughly 99.9% the entire 
time.  That shouldn't have caused the box to fail, so I don't know if it 
just overstrained the system or what - but it's been pretty fatal so far.

Comment 1 Michael K. Johnson 2001-12-06 15:27:57 UTC

Do these machines have VIA chipsets?  (Give us the output of the lspci
program if you do not know -- "su -" and then "lspci")

If so, booting with the "noathlon" argument will work around a bug
in the BIOS from VIA.

What brand and model are your hard drives?
"cat /proc/ide/hd*/model" will tell us what we need to know.

Comment 2 Michael K. Johnson 2001-12-06 15:41:42 UTC

Actually, the hard drives that corrupted data are the ones we care about,
and that, I guess, would be the ones you unplugged, so with them unplugged
the "cat /proc/ide/hd*/model" won't be very helpful.

Could you instead post the HD brand/model as read off the label for the
drives you unplugged?

Thanks,
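
For drives that are still plugged in, the two queries asked for in Comments 1 and 2 (lspci and the /proc/ide model files) can be wrapped as below. This is only a sketch: the function names and the optional directory argument are illustrative additions, and lspci may need to be run as root as noted in Comment 1.

```shell
# Print the model string of every IDE drive found under a /proc/ide-style
# directory. The directory argument is an illustrative addition so the
# function can be pointed elsewhere; it defaults to /proc/ide.
list_ide_models() {
    ide_dir="${1:-/proc/ide}"
    for model_file in "$ide_dir"/hd*/model; do
        # If the glob matched nothing, the literal pattern comes back; skip it.
        [ -f "$model_file" ] || continue
        drive=$(basename "$(dirname "$model_file")")
        printf '%s: %s\n' "$drive" "$(cat "$model_file")"
    done
}

# Print the PCI device list if lspci is installed, otherwise a short note.
list_pci_devices() {
    if command -v lspci >/dev/null 2>&1; then
        lspci
    else
        echo "lspci not found"
    fi
}
```

Running `list_pci_devices` and then `list_ide_models` with no arguments prints the chipset and drive information in one go.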

Comment 3 Brian 2001-12-06 15:55:16 UTC

I know for sure that rhuidean has a via chipset.  It is booted w/ the noathlon 
statement as trying to boot without it caused kernel panics and oopses et al.  
It was booted with noathlon as well before the crashing episode.
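
Whether a running system was really booted with noathlon can be confirmed from the kernel command line. A minimal sketch, assuming /proc/cmdline is readable; the helper name has_boot_arg and its optional file argument are illustrative:

```shell
# Succeed if the given argument appears as a whole word on the kernel
# command line. The file argument defaults to /proc/cmdline.
has_boot_arg() {
    arg="$1"
    cmdline_file="${2:-/proc/cmdline}"
    # Put each whitespace-separated word on its own line, then look for
    # an exact, fixed-string, whole-line match.
    tr -s ' \t' '\n\n' < "$cmdline_file" | grep -Fqx -- "$arg"
}
```

Usage: `has_boot_arg noathlon && echo "booted with noathlon"`.
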
Rhuidean's HDD brands are as follows:
60 gig = maxtor 5t060h6 I believe.  Since it isn't plugged up and I can't read 
the label and I'm running my uptime test - it'll be 5 more days before I can 
have that exact information. (same follows for 2 more drives - but, as far as I 
know, they weren't damaged)
20 gig = quantum bigfoot ts (not plugged up - can't read label)
13 gig = maxtor (not plugged up - can't read label)
labels are hidden in a running system.
2.5 gig = ST52520A (seagate) - had / corrupted (ran the cat cmd listed above 
for model number and then google'd the model number to get seagate)
I will verify the above after my 2 week uptime test is complete.  Roughly on 
tues 12/11.
On marath (which is currently not even booted):
1 6 gig drive = Samsung SV0643A (system not running, pulled out drive and got 
info off drive)

The samsung and the seagate drives are the ones that I could immediately tell 
had corruption on them due to non-mounting as ext3 whereas all others would 
mount as ext3 (and it was only / on both that wouldn't mount.  /boot would 
mount)

Both boxes have the VIA chipset. (can see chip on mb)

marath has a PA-2011 mb from First Mainboard
rhuidean uses a tyan S2390B

Comment 4 Brian 2001-12-06 16:02:56 UTC

By the way, the dnet client that I used is version number:
LINUX:
[x86/ELF]  v2.8015.469  2001-05-30 

and can be found at

http://www.distributed.net/download/clients.html

Comment 5 Michael K. Johnson 2001-12-06 20:07:39 UTC

One other question:  what wattage is your power supply in each machine?

Comment 6 Brian 2001-12-06 23:08:30 UTC

Marath uses a 200 watt power supply (at)
Rhuidean is a 250 watt power supply (atx)

Comment 7 Michael K. Johnson 2001-12-06 23:39:50 UTC

Yikes!  That's almost certainly your problem.  Athlons use lots of
juice, and a 250 watt supply for 4 hard drives plus an athlon plus the other
random components is definitely less than AMD recommends.  A hard drive that
is underpowered
can write garbage; in fact, that is one of the reasons that a power loss
can damage file systems, as documented in the Red Hat whitepaper on ext3 at
http://www.redhat.com/support/wpapers/redhat/ext3/
When you are using 100% of your CPU time, you are drawing a bit more power
(Linux idles the CPU when it is not in use, reducing power draw) and that
means that running the dnet client might have been the straw that broke
the camel's back.  Likewise, ext3 writes some things to disk twice (the
journal) and writes more often (synced every 5 seconds instead of every 30
seconds) and so can cause slightly more power draw from the disks.  (This
part is conjecture; I have not measured this.)

The fact that you have not seen corruption after unplugging some hard
drives would indicate that it is likely that you have reduced your power
draw a bit.  I would strongly suggest upgrading to power supplies recommended
by AMD (www.amd.com has this information on it); I tend to run 350 or 400
watt power supplies in Athlon equipment.
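
The power-budget reasoning above comes down to simple arithmetic: the supply's rated wattage minus the summed draw of the CPU, each drive, and the remaining components. The sketch below only does that subtraction. The function name power_headroom is hypothetical, and no wattage figures are supplied here; they would have to come from component datasheets or AMD's recommendations.

```shell
# Print the rated supply wattage minus the sum of the listed component
# draws (all in whole watts). A negative result means the listed draws
# exceed the supply's rating.
power_headroom() {
    supply_watts="$1"
    shift
    total=0
    for draw in "$@"; do
        total=$((total + draw))
    done
    echo $((supply_watts - total))
}
```

Calling `power_headroom` with the supply rating followed by one estimated draw per component prints the remaining margin. It is a rough check only, not a substitute for the vendor's recommendation.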

Comment 8 Brian 2001-12-07 01:01:12 UTC

Well, that could possibly explain the athlon based system (rhuidean). Which is 
also consuming less power by not using 100% proc and also because it only has 1 
hdd plugged up along w/ 1 cd (cd not usually in system - needed for install and 
haven't removed it)

But, the plain old 233 mhz amd k6 with a single hard drive - that one would 
have been using FAR less power than the athlon system as the chip doesn't draw 
as much and it only had 1 hdd spun up.

As a benchmark of sorts w/ marath (the 233 system).  Marath ran suse 6.3 quite 
happily for about a year, year and a half - somewhere around there with the 200 
watt power supply, a 20 gig drive, a 13 gig drive, a 6 gig drive, and a 2.5 gig 
drive.  And ran the dnet client happily (I upgraded the client as new releases 
came out) as well.

So, that system changed in that it now has just a 6 gig drive and it had red hat 7.2 
on it.  Based on that - with that particular system, I wouldn't say power 
consumption caused an error.

With the Athlon based system.  My <shudder>windows 2k</shudder> system is an 
Athlon 1.2 ghz w/ 256 megs of RAM, 2 hdd, 1 cdrw/dvd, 1 cd, and a 250 watt power supply 
running windows dnet 24/7 - it can maintain uptimes of roughly 2 weeks before 
it just runs into a standard windows must now reboot mode.  It has only fully 
crashed hardcore on me one time and that was before I had dnet running.

Once again... I haven't physically been able to confirm that the data on my 
other hard drives isn't corrupt (I'm hoping it is not as I was able to mount 
them properly in linux rescue) - but, since I could mount them properly, I'm 
assuming that they didn't get written to poorly due to being underpowered (and 
I had transferred gigs between different boxes) - it was only the / partition.

I don't recall if it was in those whitepapers or not, but I had read somewhere 
that sometimes the memory is dumped on the ext3 drive.  I am not sure if that 
happened.

But, basically - it happened on both systems, which are completely different - 
and one of them (marath - 233 system) had been running fine at 100% proc usage 
for many many many happy days before.

That's why when I originally wrote (not via bugzilla) - I think I posed the 
question if it had something to do w/ foreshadowing due to overuse of the proc 
as most boxes prolly aren't using 100% proc 100% of the time.  And fired off my 
findings to ya'll.

I'll have to look into upgrading the power supply on my athlon based system(s).  
But, once again - the first box that crashed isn't even close to being 
athlon based and doesn't have anything in it that sucks huge amounts of power 
and was able to handle more of a load before I put rh 7.2 on it.  Of course, 
between suse 6.3 and rh 7.2 were kernel changes.  Both were detected as i386 
though.

Comment 9 Bugzilla owner 2004-09-30 15:39:18 UTC

Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/