From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.2) Gecko/20030716
Description of problem:
I have been battling a boot-time file system error that looks just like a
problem with /etc/rc.d/init.d/random. Hah... at the present state of my tests I
think I have a two-bit error at one memory address.
So here is the story.
During bootup, after setting hostname, system stops with "/" root partition file
Running e2fsck /dev/hdb8 (my / partition) has returned several kinds of errors.
The failure last night was:
/ contains a file system with errors
/ duplicate blocks found
/: File /var/lib/rpm/Basenames (inode #254766, mod time Wed Aug 13 10:31:25 2003)
has 1 duplicate block shared with 1 file:
/: <filesystem metadata>
The mod time above is about the time I was running the Red Hat up2date rpm
A previous boot time / filesystem error was reported as
"Holes" in various files, holes with a pattern of byte offsets that
look a lot like alternate superblock location... ~16k, 32768, 98304 etc.
E2fsck would report a bunch of holes in one file, then it would give an
error report, and then report a bunch of holes in another file.
Specific files with damage repaired manually at boot time were:
Each file had holes reported at ~16k byte intervals.
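For comparison with the offsets quoted above, here is a minimal Python sketch of where ext2/ext3 keeps its backup superblocks. It assumes the file system was created with the sparse_super feature (backups in group 1 and in groups that are powers of 3, 5 and 7) and the default of 8 * block_size blocks per group. The function names are hypothetical, and the report does not state the block size of /dev/hdb8 or whether the quoted offsets are bytes or blocks, so treat this only as a way to check the "looks like alternate superblock locations" observation.

```python
def is_power_of(n, base):
    """Return True if n == base**k for some integer k >= 0."""
    if n < 1:
        return False
    while n % base == 0:
        n //= base
    return n == 1


def backup_superblock_blocks(block_size=4096, group_count=12):
    """List block numbers of backup superblocks (sparse_super layout).

    Assumptions (not taken from the report): default blocks-per-group of
    8 * block_size, and first data block 1 for 1 KiB blocks, else 0.
    """
    blocks_per_group = 8 * block_size
    first_data_block = 1 if block_size == 1024 else 0
    blocks = []
    for group in range(1, group_count):
        # Group 1 plus powers of 3, 5 and 7 carry a backup copy.
        if group == 1 or any(is_power_of(group, b) for b in (3, 5, 7)):
            blocks.append(group * blocks_per_group + first_data_block)
    return blocks


if __name__ == "__main__":
    print(backup_superblock_blocks(4096, 12))
    print(backup_superblock_blocks(1024, 12))
```

With a 4 KiB block size the first two values are 32768 and 98304, which match two of the numbers quoted above if they are read as block numbers rather than byte offsets.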
I started searching for a malicious program running "dd ".
Nothing has changed in the dd entries of /var/log/messages over the past 3 months.
Using a find & grep against the entire hard disk, I can't find any plain text
malicious invocations of dd.
Pretty much, the only plaintext shell script running dd on my system is
My running copy of random passed the rpm signature verification and file
verification test provided for the initscripts package.
Manual runs of /etc/rc.d/init.d/random are all OK.
The "stop" block invocation of "dd " seems to have the wrong of= and if= sense
so I edited the dd statement in the stop block.
I am still getting root "/" filesystem damage.
Using the memory test on the lnx-bbc.org bootable business card, I detected a
consistent memory error of 2 bits at 590.6 MB. Two months ago I added 512 Meg of
memory to my existing 256 Meg.
The memory error is in the 256 Meg SIMM.
I pulled the SIMM with the error out.
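The arithmetic behind blaming the 256 Meg module can be sketched as follows. This assumes the modules are mapped linearly and in slot order with no interleaving, that the 512 Meg module is mapped first, and that the memory tester's "mb" means MiB. None of that is confirmed by the report, and module_for_address is a hypothetical helper.

```python
def module_for_address(addr_mib, module_sizes_mib):
    """Return (module_index, offset_within_module_in_MiB) for an address.

    Assumes modules are mapped back to back in the order given,
    which real chipsets do not always do.
    """
    start = 0.0
    for index, size in enumerate(module_sizes_mib):
        if start <= addr_mib < start + size:
            return index, addr_mib - start
        start += size
    raise ValueError("address lies beyond the installed memory")


if __name__ == "__main__":
    # 512 Meg module assumed first, 256 Meg module second.
    index, offset = module_for_address(590.6, [512, 256])
    print(index, round(offset, 1))
```

Under those assumptions 590.6 falls about 78.6 MiB into the second (256 Meg) module. If the modules were mapped in the opposite order, the same address would land in the 512 Meg module, so physically swapping or pulling modules and retesting is the more reliable check.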
If the memory error is the source of the disk errors, then I propose that Red Hat
add a memtest program to the distribution. Red Hat should also add a memory test
to the "troubleshooting flow chart".
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Execute /etc/rc.d/init.d/random stop ... then same with start
2. Shut down system and run e2fsck on the / partition. /dev/hdb8 on my system.
Actual Results: The random shell script works just fine.
It is hard to tell if the "/" partition is unmounted before random actually
writes to it at shutdown time.
How could two bad bits at the 590.6 MB point actually cause damage to the root
I removed the bad SIMM to see if that removes the root file system damage.
The "holes" I saw at approx. 16 KB intervals might be chunks of swapfile? Bits
10 and 9 set to zero in a 16-bit address?
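To make the "bits 10 and 9 set to zero" idea concrete, here is a small sketch of what such a fault would do to an address. It is only the bit arithmetic; clear_bits is a hypothetical helper, and nothing here shows that the hardware actually fails this way (the memory tester reported two bad bits, which may be data bits rather than address bits).

```python
def clear_bits(address, bits=(9, 10)):
    """Return address with the given bit positions forced to zero."""
    mask = 0
    for bit in bits:
        mask |= 1 << bit
    return address & ~mask


if __name__ == "__main__":
    for addr in (0x0600, 0x4000, 0x47FF):
        print(hex(addr), "->", hex(clear_bits(addr)))
```

Clearing bits 9 and 10 folds offsets 512 to 2047 of every 2 KB window onto the first 512 bytes of that window, so on its own this pattern repeats every 2 KB rather than every 16 KB.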
How could I load dummy stuff into memory to put a known file right on top of the
bad bits in question?
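A user-space program cannot choose which physical page its data lands on, since the kernel decides that, so a known file cannot simply be placed on top of the bad bits. The closest user-space approximation is to fill a large buffer with known patterns and read it back, as in the rough sketch below. pattern_check is a hypothetical helper, the default buffer size is arbitrary, and a clean result proves nothing about the faulty address because the buffer may never touch it. A standalone tester such as memtest remains the proper tool.

```python
def pattern_check(size_bytes=1 << 20, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Fill a buffer with each pattern and report bytes that read back wrong.

    Returns a list of (offset, expected, actual) tuples. Offsets are
    relative to the buffer, not physical addresses.
    """
    buf = bytearray(size_bytes)
    mismatches = []
    for pattern in patterns:
        expected = bytes([pattern]) * size_bytes
        buf[:] = expected
        if buf != expected:
            # Only walk the buffer when a difference exists.
            for offset, value in enumerate(buf):
                if value != pattern:
                    mismatches.append((offset, pattern, value))
    return mismatches


if __name__ == "__main__":
    bad = pattern_check()
    print("mismatches:", len(bad))
```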
The host computer is a 700 MHz Athlon, with 512 Meg memory, 3 hard disks and 1
CD-R/RW, 1 floppy. The installation is a "workstation" with a /home partition
that has been used for about 5 years now.
I regularly use the up2date utility.
uname -a is:
Linux familybox 2.4.20-19.8 #1 Tue Jul 15 14:59:09 EDT 2003 i686 athlon i386
Memtest is on current development rescue images.
There's nothing you can do in general to deal with bad memory; there are some
patches that allow you to map around some bad bits, but it's not 100%
guaranteed. This currently does appear to be a hardware problem for you, though.