Bug 57178
Summary: | System crash after a week running Distributed.net | ||
---|---|---|---|
Product: | [Retired] Red Hat Linux | Reporter: | Brian <bcoloney> |
Component: | kernel | Assignee: | Arjan van de Ven <arjanv> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Brock Organ <borgan> |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 7.2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Doc Type: | Bug Fix | ||
Last Closed: | 2004-09-30 15:39:18 UTC | Type: | --- |
Description
Brian
2001-12-06 15:13:55 UTC
Do these machines have VIA chipsets? (Give us the output of the lspci program if you do not know: "su -" and then "lspci".) If so, booting with the "noathlon" argument will work around a bug in the BIOS from VIA.

What brand and model are your hard drives? "cat /proc/ide/hd*/model" will tell us what we need to know. Actually, the hard drives that corrupted data are the ones we care about, and those, I guess, are the ones you unplugged, so with them unplugged the "cat /proc/ide/hd*/model" won't be very helpful. Could you instead post the HD brand/model as read off the label for the drives you unplugged? Thanks,

I know for sure that rhuidean has a VIA chipset. It is booted with the noathlon argument, as trying to boot without it caused kernel panics, oopses, et al. It was booted with noathlon before the crashing episode as well.

Rhuidean's HDD brands are as follows:

60 gig = Maxtor 5T060H6, I believe. It isn't plugged up, I can't read the label, and I'm running my uptime test, so it'll be 5 more days before I can have that exact information. (The same goes for 2 more drives, but as far as I know they weren't damaged.)
20 gig = Quantum Bigfoot TS (not plugged up - can't read label)
13 gig = Maxtor (not plugged up - can't read label)
Labels are hidden in a running system.
2.5 gig = ST52520A (Seagate) - had / corrupted (ran the cat cmd listed above for the model number and then googled the model number to get Seagate)

I will verify the above after my 2 week uptime test is complete, roughly on Tues 12/11.

On marath (which is currently not even booted):
One 6 gig drive = Samsung SV0643A (system not running; pulled out the drive and got the info off it)

The Samsung and the Seagate drives are the ones that I could immediately tell had corruption on them, since they would not mount as ext3 whereas all the others would (and it was only / on both that wouldn't mount; /boot would mount). Both boxes have the VIA chipset.
(can see chip on mb) marath has a PA-2011 mb from First Mainboard; rhuidean uses a Tyan S2390B.

By the way, the dnet client that I used is version number: LINUX: [x86/ELF] v2.8015.469 2001-05-30 and can be found at http://www.distributed.net/download/clients.html

One other question: what wattage is your power supply in each machine?

Marath uses a 200 watt power supply (AT). Rhuidean uses a 250 watt power supply (ATX).

Yikes! That's almost certainly your problem. Athlons use lots of juice, and a 250 watt supply feeding 4 hard drives plus an Athlon plus the other random components is definitely less than AMD recommends. A hard drive that is underpowered can write garbage; in fact, that is one of the reasons that a power loss can damage file systems, as documented in the Red Hat whitepaper on ext3 at http://www.redhat.com/support/wpapers/redhat/ext3/

When you are using 100% of your CPU time, you are drawing a bit more power (Linux idles the CPU when it is not in use, reducing power draw), and that means that running the dnet client might have been the straw that broke the camel's back. Likewise, ext3 writes some things to disk twice (the journal) and writes more often (synced every 5 seconds instead of every 30 seconds) and so can cause slightly more power draw from the disks. (This part is conjecture; I have not measured this.)

The fact that you have not seen corruption after unplugging some hard drives would indicate that it is likely that you have reduced your power draw a bit. I would strongly suggest upgrading to power supplies recommended by AMD (www.amd.com has this information on it); I tend to run 350 or 400 watt power supplies in Athlon equipment.

Well, that could possibly explain the Athlon based system (rhuidean).
It is also consuming less power now, since it is not using 100% proc and has only 1 hdd plugged up along with 1 CD drive (the CD drive is not usually in the system; it was needed for the install and I haven't removed it).

But the plain old 233 MHz AMD K6 with a single hard drive would have been using FAR less power than the Athlon system, as the chip doesn't draw as much and it only had 1 hdd spun up.

As a benchmark of sorts with marath (the 233 system): marath ran SuSE 6.3 quite happily for about a year, maybe a year and a half, with the 200 watt power supply, a 20 gig drive, a 13 gig drive, a 6 gig drive, and a 2.5 gig drive. It ran the dnet client happily as well (I upgraded the client as new releases came out). What changed on that system is that it now has just a 6 gig drive and it had Red Hat 7.2 on it. Based on that, I wouldn't say power consumption caused the error on that particular system.

As for the Athlon based system: my <shudder>windows 2k</shudder> system is an Athlon 1.2 GHz with 256 megs of RAM, 2 hdd, 1 cdrw/dvd, 1 cd, and a 250 watt power supply, running the Windows dnet client 24/7. It can maintain uptimes of roughly 2 weeks before it just runs into a standard Windows "must now reboot" mode. It has only fully crashed hardcore on me one time, and that was before I had dnet running.

Once again, I haven't physically been able to confirm that the data on my other hard drives isn't corrupt (I'm hoping it is not, as I was able to mount them properly in linux rescue). But since I could mount them properly, I'm assuming that they didn't get written to poorly due to being underpowered (and I had transferred gigs between different boxes); it was only the / partition.

I don't recall if it was in those whitepapers or not, but I had read somewhere that sometimes the memory is dumped on the ext3 drive. I am not sure if that happened.
But basically, it happened on both systems, which are completely different, and one of them (marath, the 233 system) had been running fine at 100% proc usage for many, many, many happy days. That's why, when I originally wrote (not via bugzilla), I think I posed the question of whether it had something to do with foreshadowing due to overuse of the proc, as most boxes prolly aren't using 100% proc 100% of the time. And I fired off my findings to y'all.

I'll have to look into upgrading the power supply on my Athlon based system(s). But once again, the first box that crashed isn't even close to being Athlon based, doesn't have anything in it that sucks huge amounts of power, and was able to handle more of a load before I put RH 7.2 on it. Of course, between SuSE 6.3 and RH 7.2 there were kernel changes. Both were detected as i386 though.

Thanks for the bug report. However, Red Hat no longer maintains this version of the product. Please upgrade to the latest version and open a new bug if the problem persists. The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, and if you believe this bug is interesting to them, please report the problem in the bug tracker at: http://bugzilla.fedora.us/