From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; Cairhien)

Description of problem:
After a little over a week of uptime, running the Distributed.net client the whole time, the system locked up and I was forced to use my reset button. On reboot the kernel would spit out a panic due to the ext3fs headers being corrupted.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Standard install of RH7.2
2. Install and run Distributed.Net linux client entire time
3. Leave box running for a little over a week

Actual Results:
Major hard core death. As in description: The boxes both locked up after about 8 or 9 days uptime and I was unable to get them to boot. The kernel would panic and not go anywhere. The rescue CD also said it couldn't find the O/S when I tried using it. Ended up formatting /, /boot and swap and reinstalling.

Expected Results:
System should not have crashed. But, in the event it did - it should not have gotten corrupted and prolly shoulda said - whoa, you wanna run the disk checking utility?

Additional info:
I had 2 different systems running (both AMD systems running the i586 architecture) and both systems were running Distributed.net's client. After a little over a week, both systems locked up on me and I was forced to use the reset button to reboot the system. On reboot I would get a kernel panic because the ext3fs headers were corrupted on the / partition. In rescue mode, I could not mount the fs as ext3, only ext2. Other partitions remained undamaged (or so it appears - I haven't been able to properly examine them yet).

The system that crashed first was an AMD K6 233 w/ 128 megs of RAM and a 6 gig hard drive split into 3 partitions - /boot, / and swap (henceforth marath). The second system (rhuidean) is an Athlon 1.4 ghz w/ 1.5 gigs of RAM, a 60 gig hard drive, a 20 gig hard drive, a 13 gig hard drive and a 2.5 gig hard drive.
The 2.5 gig hard drive carried /, /boot, and swap. Both boxes have generic ethernet cards and nothing special for a video card. Marath has a matrox pci card w/ 4 megs of ram, rhuidean has an agp card and I don't recall the make.

I was using the "off the shelf" version of 7.2 but customized each install slightly differently. On rhuidean, I didn't install a web server or php or mysql as I wanted to compile them myself (of which, I got apache and mysql compiled before the crash). Marath had most packages installed, including the webserver and all that other good stuff. Both had KDE and Gnome installed as well. KDE was actively running on rhuidean most of the time whereas marath was sitting at the CLI.

So, on reset, the boxes wouldn't boot because the kernel panic said it thought the headers were corrupted for the ext3fs. When my first system crashed, I got worried about rhuidean as it carries many files that, if lost, would be very difficult to recover. Well, after rhuidean passed roughly 8 days uptime, I thought that it was just marath that died for some reason. Well, I woke up the next day - looked at kde and things didn't seem right. It had locked. No inputs were accepted to the system via the keyboard - total and complete system failure (a la windows). I knew what was going to happen on reboot - but tried anyhow. Same situation. Kernel panic. Ext3fs headers corrupted. Great.

Tried linux rescue again off the boot disc - nothing. Was able to mount my 3 other hard drives and the kernel though. They were all ext3 - so, it didn't appear as if I'd lost any information off of them. Being able to mount the / partition as ext2, I fsck'd it, just to see - it went ape w/ all sorts of stuff and still no boot (even after telling the kernel and all that I had changed it from ext3 to ext2). But, I was able to go back into linux rescue and, because of the multiple drives in the system, mount / as ext2 and save a file or two.
I have NOT checked any files on the system yet due to the experiment described below - so, I'm unsure if I have any corrupted files on any drives (including those backed up off of / since I formatted it). Marath is still off and hasn't been reformatted and reinstalled yet.

I'm running a test on rhuidean at the moment. I did the same install as before for rhuidean except that this time I have not installed the Distributed.net client. I have almost a 9 day uptime so far and am aiming to try for 2 weeks. Once I reach the 2 week mark, I'll halt the system, plug the other hard drives back up (their power supply cables are disconnected) and assume that the system is "stable".

Anyhow, the DNET client causes the proc to run at roughly 99.9% the entire time. That shouldn't have caused the box to fail either - so I don't know if it just overstrained the system or what - but, it's been pretty fatal so far.
Do these machines have VIA chipsets? (Give us the output of the lspci program if you do not know -- "su -" and then "lspci") If so, booting with the "noathlon" argument will work around a bug in the BIOS from VIA. What brand and model are your hard drives? "cat /proc/ide/hd*/model" will tell us what we need to know.
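For convenience, the two requests above can be gathered in one pass. This is only a sketch: collect_hw_info is a made-up helper name, and it assumes the kernel exposes /proc/ide as described in this comment. It prints a note instead of failing when lspci or the drive entries are missing.

```shell
#!/bin/sh
# Sketch: gather the chipset and drive-model details requested above.
# collect_hw_info is a hypothetical helper name, not part of any package.
collect_hw_info() {
    echo "== PCI devices (chipset) =="
    if command -v lspci >/dev/null 2>&1; then
        lspci || echo "lspci exited with an error"
    else
        echo "lspci not found in PATH (try again after 'su -')"
    fi

    echo "== IDE drive models =="
    found=0
    for f in /proc/ide/hd*/model; do
        # The glob stays literal when nothing matches, so check readability.
        [ -r "$f" ] || continue
        found=1
        printf '%s: %s\n' "$f" "$(cat "$f")"
    done
    [ "$found" -eq 1 ] || echo "no /proc/ide/hd*/model entries found"
    return 0
}

collect_hw_info
```

Run it as root (after "su -") so that lspci is in the PATH, and paste the whole output into the bug.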
Actually, the hard drives that corrupted data are the ones we care about, and that, I guess, would be the ones you unplugged, so with them unplugged the "cat /proc/ide/hd*/model" won't be very helpful. Could you instead post the HD brand/model as read off the label for the drives you unplugged? Thanks,
I know for sure that rhuidean has a VIA chipset. It is booted w/ the noathlon statement, as trying to boot without it caused kernel panics, oopses et al. It was booted with noathlon before the crashing episode as well.

Rhuidean's HDD brands are as follows:
60 gig = maxtor 5t060h6, I believe. It isn't plugged up, labels are hidden in a running system, and I'm running my uptime test - so it'll be 5 more days before I can have that exact information. (Same goes for 2 more drives - but, as far as I know, they weren't damaged.)
20 gig = quantum bigfoot ts (not plugged up - can't read label)
13 gig = maxtor (not plugged up - can't read label)
2.5 gig = ST52520A (seagate) - had / corrupted (ran the cat cmd listed above for the model number and then google'd the model number to get seagate)

I will verify the above after my 2 week uptime test is complete, roughly on tues 12/11.

On marath (which is currently not even booted):
6 gig = Samsung SV0643A (system not running, pulled out drive and got info off drive)

The samsung and the seagate drives are the ones that I could immediately tell had corruption on them, since they would not mount as ext3 whereas all the others would (and it was only / on both that wouldn't mount; /boot would mount).

Both boxes have the VIA chipset (can see the chip on the mb).
marath has a PA-2011 mb from First Mainboard
rhuidean uses a tyan S2390B
By the way, the dnet client that I used is version number: LINUX: [x86/ELF] v2.8015.469 2001-05-30 and can be found at http://www.distributed.net/download/clients.html
One other question: what wattage is your power supply in each machine?
Marath uses a 200 watt power supply (AT).
Rhuidean uses a 250 watt power supply (ATX).
Yikes! That's almost certainly your problem. Athlons use lots of juice, and a 250 watt supply for 4 hard drives plus an Athlon plus the other random components is definitely less than AMD recommends. A hard drive that is underpowered can write garbage; in fact, that is one of the reasons that a power loss can damage file systems, as documented in the Red Hat whitepaper on ext3 at http://www.redhat.com/support/wpapers/redhat/ext3/

When you are using 100% of your CPU time, you are drawing a bit more power (Linux idles the CPU when it is not in use, reducing power draw) and that means that running the dnet client might have been the straw that broke the camel's back. Likewise, ext3 writes some things to disk twice (the journal) and writes more often (synced every 5 seconds instead of every 30 seconds) and so can cause slightly more power draw from the disks. (This part is conjecture; I have not measured this.)

The fact that you have not seen corruption after unplugging some hard drives would indicate that it is likely that you have reduced your power draw a bit. I would strongly suggest upgrading to power supplies recommended by AMD (www.amd.com has this information on it); I tend to run 350 or 400 watt power supplies in Athlon equipment.
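The headroom argument above is plain arithmetic, which could be sketched as below. power_headroom is a made-up helper, and every load figure in the example call is a placeholder, not a measurement or a vendor number; only the 250 watt rating comes from this thread. A simple sum also ignores per-rail limits and spin-up surges, so treat the result as a rough indication at best.

```shell
#!/bin/sh
# Sketch: subtract estimated component loads from the supply rating.
# Usage: power_headroom SUPPLY_WATTS LOAD_WATTS...
# power_headroom is a hypothetical helper; all values are whole watts.
power_headroom() {
    supply=$1
    shift
    total=0
    for w in "$@"; do
        total=$((total + w))
    done
    # A small or negative result means the supply is marginal on paper.
    echo $((supply - total))
}

# Placeholder loads only (assumed, not measured):
# CPU, board and other parts, then four hard drives.
power_headroom 250 90 60 15 15 15 15
```

Substitute the figures from the component data sheets and the AMD recommendation before drawing any conclusion from the number it prints.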
Well, that could possibly explain the athlon based system (rhuidean), which is also consuming less power now by not using 100% proc and because it only has 1 hdd plugged up along w/ 1 cd (the cd is not usually in the system - I needed it for the install and haven't removed it).

But the plain old 233 mhz amd k6 with a single hard drive - that one would have been using FAR less power than the athlon system, as the chip doesn't draw as much and it only had 1 hdd spun up. As a benchmark of sorts w/ marath (the 233 system): marath ran suse 6.3 quite happily for about a year, year and a half - somewhere around there - with the 200 watt power supply, a 20 gig drive, a 13 gig drive, a 6 gig drive, and a 2.5 gig drive. And it ran the dnet client happily as well (I upgraded the client as new releases came out). So, that system changed in that now it has just a 6 gig drive and it had red hat 7.2 on it. Based on that - with that particular system, I wouldn't say power consumption caused an error.

As for the Athlon based system: my <shudder>windows 2k</shudder> system is an Athlon 1.2 ghz w/ 256 megs of RAM, 2 hdd, 1 cdrw/dvd, 1 cd and a 250 watt power supply, running windows dnet 24/7 - it can maintain uptimes of roughly 2 weeks before it just runs into a standard windows must now reboot mode. It has only fully crashed hardcore on me one time and that was before I had dnet running.

Once again... I haven't physically been able to confirm that the data on my other hard drives isn't corrupt (I'm hoping it is not, as I was able to mount them properly in linux rescue) - but, since I could mount them properly, I'm assuming that they didn't get written to poorly due to being underpowered (and I had transferred gigs between different boxes) - it was only the / partition. I don't recall if it was in those whitepapers or not, but I had read somewhere that sometimes the memory is dumped on the ext3 drive. I am not sure if that happened.
But, basically - it happened on both systems, which are completely different - and on one (marath - the 233 system) everything had been running fine at 100% proc usage for many many many happy days before. That's why, when I originally wrote (not via bugzilla), I think I posed the question of whether it had something to do w/ foreshadowing due to overuse of the proc, as most boxes prolly aren't using 100% proc 100% of the time. And I fired off my findings to y'all.

I'll have to look into upgrading the power supply on my athlon based system(s). But, once again - the first box that crashed isn't even close to being athlon based, doesn't have anything in it that sucks huge amounts of power, and was able to handle more of a load before I put rh 7.2 on it. Of course, between suse 6.3 and rh 7.2 there were kernel changes. Both were detected as i386 though.
Thanks for the bug report. However, Red Hat no longer maintains this version of the product. Please upgrade to the latest version and open a new bug if the problem persists. The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, and if you believe this bug is interesting to them, please report the problem in the bug tracker at: http://bugzilla.fedora.us/