Bug 102422

Summary: Memory holes in root partition, found at boot up
Product: [Retired] Red Hat Linux
Reporter: Lee McKusick <lmckusic>
Component: initscripts
Assignee: Bill Nottingham <notting>
Status: CLOSED NOTABUG
QA Contact: Brock Organ <borgan>
Severity: medium
Priority: medium
Version: 8.0
CC: rvokal
Hardware: athlon   
OS: Linux   
Last Closed: 2003-08-14 23:01:33 UTC

Description Lee McKusick 2003-08-14 22:35:59 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.2) Gecko/20030716

Description of problem:
I have been battling a boot-time file system error that looks just like a
problem with /etc/rc.d/init.d/random. Hah... at the present state of my tests I
think I have a two-bit error at one memory address.

So here is the story.

During bootup, after the hostname is set, the system stops with file system
errors on the "/" (root) partition.

Running e2fsck /dev/hdb8 (my / partition) has returned several kinds of errors. 

The failure last night was:

/ contains a file system with errors
/ duplicate blocks found
...
/: File /var/lib/rpm/Basenames (inode #254766, mod time Wed Aug 13 10:31:25 2003)
has 1 duplicate block shared with 1 file:
/:    <filesystem metadata>
/:

The mod time above is about the time I was running the Red Hat up2date RPM
update tool.

A previous boot-time "/" filesystem error was reported as

"Holes" in various files, with a pattern of byte offsets that looks a lot like
alternate superblock locations: ~16k, 32768, 98304, etc.

E2fsck would report a bunch of holes in one file, then it would give an
error report, and then report a bunch of holes in another file.
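
For comparison, here is a minimal Python sketch of where ext2 normally keeps
its backup superblocks. It assumes the sparse_super layout (backups in group 1
and in groups that are powers of 3, 5, or 7) and the mke2fs default of
8 * block_size blocks per group; the function names are illustrative only. The
results are block numbers, not byte offsets, and nothing here establishes that
the reported holes are actually related to them.

```python
def is_power_of(n, base):
    """Return True if n equals base**k for some k >= 1."""
    if n < base:
        return False
    while n % base == 0:
        n //= base
    return n == 1


def backup_superblock_blocks(block_size, group_count):
    """List the block numbers of backup superblocks under sparse_super.

    Assumptions: 8 * block_size blocks per group, and a first data block
    of 1 for 1 KiB blocks (0 for larger block sizes).
    """
    blocks_per_group = 8 * block_size
    first_data_block = 1 if block_size == 1024 else 0
    groups = [
        g for g in range(1, group_count)
        if g == 1 or any(is_power_of(g, b) for b in (3, 5, 7))
    ]
    return [g * blocks_per_group + first_data_block for g in groups]


print(backup_superblock_blocks(4096, 10))
# -> [32768, 98304, 163840, 229376, 294912]
```

With 4 KiB blocks the first two values are 32768 and 98304, the same numbers
quoted above.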

Specific files with damage repaired manually at boot time were:

/usr/lib/dia/libchronogram_objects.so
/usr/share/doc/minicom-2.00.0/doc
/usr/java/j2sdk1.4.1_01/demo/plugin/applets/Animator/Beans/T8.gif
/usr/share/doc/qt-devel-3.0.5/examples/sql/overview/form1/.moc

Each file had holes reported at ~16k byte intervals.

I started searching for a malicious program running "dd ".

Nothing has changed in the dd entries of /var/log/messages over the past 3 months.

Using a find & grep against the entire hard disk, I can't find any plain text
malicious invocations of dd.

Pretty much, the only plaintext shell script running dd on my system is
/etc/rc.d/init.d/random 
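
The find & grep search described above can be sketched in Python roughly as
follows. This is an illustration under assumptions, not the commands that were
actually run: the pattern, the 1 MiB read cap, and the function name are all
hypothetical choices, and the scan only sees plain-text files.

```python
import os
import re

# "dd" as a command word (at line start, or after whitespace or a shell
# separator), followed on the same line by an if= or of= operand.
DD_PATTERN = re.compile(r"(?:^|[\s;&|(`])dd\s+[^\n#]*\b(?:if|of)=")


def find_dd_invocations(root):
    """Yield (path, line_number, line) for text lines that appear to run dd."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks, devices, sockets, and so on
            try:
                with open(path, "rb") as handle:
                    data = handle.read(1 << 20)  # read at most 1 MiB per file
            except OSError:
                continue  # unreadable file
            if b"\0" in data:
                continue  # contains NUL bytes, so treat it as binary
            text = data.decode("utf-8", errors="replace")
            for number, line in enumerate(text.splitlines(), start=1):
                if DD_PATTERN.search(line):
                    yield path, number, line.strip()
```

A call such as `list(find_dd_invocations("/etc"))` would list candidate lines.
Pointing it at specific trees is safer than walking "/" as a whole, since
pseudo-filesystems such as /proc contain files that are awkward to read.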

My running copy of random passed the rpm signature verification and file
verification test provided for the initscripts package.

Manual runs of /etc/rc.d/init.d/random are all OK. 

The "stop" block invocation of "dd " seems to have the wrong of= and if= sense
so I edited the dd statement in the stop block.
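
For reference, seed-saving scripts conventionally read from the kernel pool
and write to a seed file at shutdown, then feed that file back to the pool at
boot. The sketch below assumes that convention; the initscripts code itself is
not quoted in this report, so the exact dd line, the 512-byte size, and the
function names are assumptions.

```python
import os


def save_random_seed(seed_path, size=512):
    """Shutdown step: write `size` bytes of kernel randomness to seed_path.

    Mirrors an assumed command of the form
    `dd if=/dev/urandom of=SEED bs=512 count=1`:
    the pool is the input, the seed file is the output.
    """
    # os.urandom draws from the same kernel source as /dev/urandom.
    seed = os.urandom(size)
    fd = os.open(seed_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as handle:
        handle.write(seed)
    return len(seed)


def restore_random_seed(seed_path, pool_path="/dev/urandom"):
    """Boot step: mix a previously saved seed back into the pool."""
    with open(seed_path, "rb") as src, open(pool_path, "wb") as dst:
        dst.write(src.read())
```

Under that convention a shutdown-time dd with if=/dev/urandom and of= pointing
at the seed file is the expected direction, and it only ever writes one small
file.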

I am still getting root "/" filesystem damage. 

Using the memory test on the lnx-bbc.org bootable business card, I detected a
consistent memory error of 2 bits at 590.6 MB. Two months ago I added 512 Meg of
memory to my existing 256 Meg.

The memory error is in the 256 Meg SIMM.

I pulled out the SIMM memory module with the error.
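
The mapping from a failing address to a module can be sketched as simple
arithmetic. This assumes the modules are mapped back to back in slot order,
with no interleaving and no memory hole, which a given chipset may not do; the
slot order used in the example is also an assumption, since the report does
not state it.

```python
def locate_module(address_mib, module_sizes_mib):
    """Return (slot_index, offset_mib) for a physical address.

    Assumes a contiguous, non-interleaved mapping in slot order.
    """
    base = 0.0
    for index, size in enumerate(module_sizes_mib):
        if address_mib < base + size:
            return index, address_mib - base
        base += size
    raise ValueError("address lies beyond installed memory")


# Assumed order: the 512 Meg module first, the 256 Meg module second.
slot, offset = locate_module(590.6, [512, 256])
print(slot, round(offset, 1))
```

With that ordering, 590.6 MB lands in the second slot (index 1), about 78.6 MB
into the 256 Meg module, which is consistent with the finding above. With the
modules in the opposite order the same address would fall in the 512 Meg
module, so the slot order matters.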

If the memory error is the source of the disk errors then I propose that Red Hat
add a memtest program to the Distribution. Red Hat should add a memory test to
the "troubleshooting flow chart". 

Version-Release number of selected component (if applicable):
initscripts-6.95-1

How reproducible:
Couldn't Reproduce

Steps to Reproduce:
1. Execute /etc/rc.d/init.d/random stop, then the same with start.
2. Shut down the system and run e2fsck on the / partition (/dev/hdb8 on my system).

Actual Results: The random shell script works just fine.
It is hard to tell if the "/" partition is unmounted before random actually
writes to it at shutdown time.

Additional info:

How could two bad bits at the 590.6 MB point actually cause damage to the root
file system?
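
One possible mechanism, offered as an assumption rather than a diagnosis:
filesystem metadata sits in RAM before it is written to disk, so if a word
holding a block number lands on the bad address, the altered number points at
a different block and two owners end up claiming it. The sketch below shows
the arithmetic only. The block number is made up, and bit positions 9 and 10
are borrowed from the guess later in this report, not from the memory test
output.

```python
def flip_bits(value, bit_positions):
    """Return value with the given bit positions inverted."""
    mask = 0
    for bit in bit_positions:
        mask |= 1 << bit
    return value ^ mask


# Hypothetical block pointer that passes through the faulty memory word.
intended_block = 100000
corrupted_block = flip_bits(intended_block, [9, 10])
print(intended_block, "->", corrupted_block)
```

A write aimed at the intended block would then go to the corrupted one, which
is the kind of damage e2fsck reports as a duplicate or shared block.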

I removed the bad SIMM, to see if that removes the root file system damage
phenomenon.

The "holes" I saw at approx 16K byte intervals might be chunks of swapfile? Bits
10 and 9 set to zero in a 16 bit address?

How could I load dummy stuff into memory to put a known file right on top of the
bad bits in question?
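
A user-space program cannot choose which physical addresses it gets, so the
closest approximation is to allocate a large buffer, fill it with known
patterns, and read them back. The sketch below is a minimal illustration of
that idea with hypothetical names and patterns. It needs roughly twice
size_bytes of memory, CPU caches may hide faults in small buffers, and it
offers no guarantee of covering the bad address, so a boot-time memory tester
remains the more reliable tool.

```python
def pattern_test(size_bytes, patterns=(0x55, 0xAA, 0x00, 0xFF)):
    """Fill a buffer with each pattern; return offsets that read back wrong."""
    buffer = bytearray(size_bytes)
    bad_offsets = set()
    for pattern in patterns:
        expected = bytes([pattern]) * size_bytes
        buffer[:] = expected  # write the pattern over the whole buffer
        if buffer != expected:  # fast whole-buffer comparison first
            bad_offsets.update(
                offset for offset in range(size_bytes)
                if buffer[offset] != pattern
            )
    return sorted(bad_offsets)


# Example: exercise 64 MiB; an empty list means no mismatch was seen.
# print(pattern_test(64 * 1024 * 1024))
```

Any offsets it returns are positions inside the buffer, not physical
addresses.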

The host computer is a 700 MHz Athlon with 512 Meg of memory, 3 hard disks, 1
CD-R/RW, and 1 floppy. The installation is a "workstation" install with a /home
partition that has been used for about 5 years now.

I regularly use the up2date utility.
Output of uname -a:

Linux familybox 2.4.20-19.8 #1 Tue Jul 15 14:59:09 EDT 2003 i686 athlon i386
GNU/Linux

Comment 1 Bill Nottingham 2003-08-14 23:01:33 UTC
Memtest is on current development rescue images.

There's nothing you can do in general to deal with bad memory; there are some
patches that allow you to map around some bad bits, but it's not 100%
guaranteed. This currently does appear to be a hardware problem for you, though.