Bug 75412 - Data corruption on ext3?
Summary: Data corruption on ext3?
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 8.0
Hardware: i586
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Stephen Tweedie
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2002-10-08 06:19 UTC by Thomas M Steenholdt
Modified: 2007-04-18 16:47 UTC (History)
0 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2002-10-15 21:29:19 UTC
Embargoed:


Attachments
apache error log (the bad filename comes from copy-pasting the log contents into an email to report while at work - pay no attention to that) (16.89 KB, text/plain)
2002-10-08 06:24 UTC, Thomas M Steenholdt
no flags Details
apache access log (the bad filename comes from copy-pasting the log contents into an email to report while at work - pay no attention to that) (35.87 KB, text/plain)
2002-10-08 06:25 UTC, Thomas M Steenholdt
no flags Details
libperl.so from sick system - any idea what happened to this file??? (1.16 MB, application/octet-stream)
2002-10-09 10:20 UTC, Thomas M Steenholdt
no flags Details
this is the output of lspci -vvv on the system - anything else I can provide you with? (4.00 KB, text/plain)
2002-10-11 18:29 UTC, Thomas M Steenholdt
no flags Details

Description Thomas M Steenholdt 2002-10-08 06:19:13 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.1) Gecko/20020826

Description of problem:
I recently installed Red Hat 8.0 at home for use as a small web/mail server -
nothing too serious atm... But after 3 days of running (grand total - I have
restarted a few times in that time) the script kiddie attacks seem to have
worked? Or is this a totally different problem... When I got up this morning at
5:50 the webserver was not running... no httpd processes, and service httpd
status said the server is dead but the pid file exists (something like that)... I tried
to do a service httpd stop (which of course failed) and a service httpd start,
which let the server start with absolutely no problems... I'm attaching the
access log and error log for you to see - there were no problems reported in
/var/log/messages so I'm not really sure what else I can provide you with...
This COULD be a serious problem for people who actually need their server running...

Version-Release number of selected component (if applicable):


How reproducible:
Didn't try


Actual Results:  all httpd processes died

Expected Results:  httpd processes should not die

Additional info:

Comment 1 Thomas M Steenholdt 2002-10-08 06:24:47 UTC
Created attachment 79306 [details]
apache error log (the bad filename comes from copy-pasting the log contents into an email to report while at work - pay no attention to that)

Comment 2 Thomas M Steenholdt 2002-10-08 06:25:45 UTC
Created attachment 79307 [details]
apache access log (the bad filename comes from copy-pasting the log contents into an email to report while at work - pay no attention to that)

Comment 3 Joe Orton 2002-10-08 09:08:52 UTC
The reason the server died appears to be:

[Tue Oct 08 04:02:25 2002] [notice] SIGHUP received.  Attempting to restart
Syntax error on line 10 of /etc/httpd/conf.d/perl.conf:
Cannot load /etc/httpd/modules/mod_perl.so into server:
/usr/lib/perl5/5.8.0/i386-linux-thread-multi/CORE/libperl.so: symbol
Perl_Iutf8_lower_ptr, version GLIBC_2.0 not defined in file
libpthread.so.0 with link time reference

had you changed anything on the machine last night?  Was this a fresh 8.0
install, or an upgrade?


Comment 4 Joe Orton 2002-10-08 09:13:04 UTC
Is this reproducible, or does a "service httpd reload" work correctly for you now?

CC'ing Chip in case he's seen this error before.

Comment 5 Thomas M Steenholdt 2002-10-08 09:18:36 UTC
This is a freshly installed machine and i didn't change anything that has
anything to do with httpd yesterday... i might have restarted the service on
purpose a couple of times, can't remember... But even if I changed something - I
always try to restart the daemon to see if it comes up alright. Also note that
the server came up without a glitch just by running service httpd start a few
hours later - i did not reconfigure anything to have it start again... and I was
not at the machine at 4.02

Comment 6 Thomas M Steenholdt 2002-10-08 18:49:09 UTC
Actually i'm in a situation right now, where the problem can be reproduced each
time I try to start httpd... Although the error message is similar, it's not the
same...

[root@server root]# service httpd start
Starting httpd: Syntax error on line 10 of /etc/httpd/conf.d/perl.conf:
Cannot load /etc/httpd/modules/mod_perl.so into server:
/usr/lib/perl5/5.8.0/i386-linux-thread-multi/CORE/libperl.so: undefined symbol:
Pern_cv_undef
                                                           [FAILED]

Based on the first error, which mentioned GLIBC i thought i'd mention that the
server is an i586 - 200MHz MMX based machine

and a wondering question to soothe my own curiosity - why does httpd SIGHUP in
the first place???

I'm really not sure where to look right now, but I'll leave the server running
(without rebooting it) if you'll want me to try something out.

Comment 7 Joe Orton 2002-10-09 09:07:55 UTC
I presume that is a cut'n'paste error, and the error is really that
"Perl_sv_undef" is undefined? (not "Pern_" ...)?

Does /usr/bin/perl still work on this system?

Apache is sent the SIGHUP signal by logrotate every night, to rotate the server
logs: see /etc/logrotate.d/httpd


Comment 8 Thomas M Steenholdt 2002-10-09 10:18:29 UTC
The cut'n'paste error is actually what is written on my screen... "Pern_cv_undef",
not Perl as you suggest - And no! /usr/bin/perl doesn't work correctly either.
Running perl followed by the classic print("Hello world!\n"); and ^D actually
caused Hello world! to be printed to the screen. But it was followed by a
variation of the previous error, saying "relocation error:
/usr/lib/perl5/5.8.0/i386-linux-thread-multi/CORE/libperl.so: undefined symbol:
Pern_cv_undef"

As it seems we have now narrowed things down a little, I tried running an "rpm
-V perl", which came back with the result ..5.....
/usr/lib/perl5/5.8.0/i386-linux-thread-multi/CORE/libperl.so - this could of
course be due to some disk problems or... but I'm attaching the file anyway for
you to see - I did not do anything on the system that should have changed that
file?!?
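
For readers unfamiliar with the "..5....." string: each position in the rpm -V output stands for one verification test, and a dot means that test passed. The sketch below decodes such a string. The helper name `decode_verify_flags` is made up for this illustration, and the table assumes the eight-column layout (S, M, 5, D, L, U, G, T) printed by rpm 4.x of this era; newer rpm releases print an extra column.

```python
# Assumed column layout of the rpm -V flag string on rpm 4.x of this era.
VERIFY_FLAGS = [
    ("S", "file size differs"),
    ("M", "mode (permissions/type) differs"),
    ("5", "MD5 digest differs"),
    ("D", "device major/minor number mismatch"),
    ("L", "symlink target mismatch"),
    ("U", "owner differs"),
    ("G", "group differs"),
    ("T", "modification time differs"),
]


def decode_verify_flags(flags: str):
    """Return a description for every test that did not pass.

    Hypothetical helper for illustration only; it is not part of rpm.
    """
    failures = []
    for char, (letter, meaning) in zip(flags, VERIFY_FLAGS):
        if char == letter:
            failures.append(meaning)
        elif char == "?":
            failures.append(f"{letter}: test could not be performed")
    return failures


if __name__ == "__main__":
    for line in decode_verify_flags("..5....."):
        print(line)
```

Under that assumption, only the digest test failed here while size and modification time still matched, so the content read back differed without the file looking rewritten.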



Comment 9 Thomas M Steenholdt 2002-10-09 10:20:40 UTC
Created attachment 79585 [details]
libperl.so from sick system - any idea what happened to this file???

Comment 10 Joe Orton 2002-10-09 10:57:55 UTC
Ah, okay. Then I guess you have disk/RAM/... corruption; not much more we can do
to help, I think.

Comment 11 Thomas M Steenholdt 2002-10-09 11:33:04 UTC
Okay, sorry about the false alarm! - i'm reinstalling pronto! (after a few
serious system checks)

Comment 12 Thomas M Steenholdt 2002-10-11 18:26:06 UTC
OK - Having run HD surface scans and complex memory tests continuously for the last
48 hours with nothing to indicate even a minor error in the machine - I rebooted
into the original failing installation and there is no problem. So now I think
what looked at first to be a problem with httpd, then perl, then disk/memory might
actually be a problem with the kernel (ide/ext3/cache/?). So how do I proceed with
this?

Should i install and run the debug kernel and see if the problem arises again?
If so, then what? I have little experience in gathering information from the
debug kernel so I don't really know how to approach it?

I'm guessing that I CAN reproduce the error, but that I'll have to wait for
maybe 24hours, maybe less, for it to happen?!?


Comment 13 Thomas M Steenholdt 2002-10-11 18:29:51 UTC
Created attachment 80025 [details]
this is the output of lspci -vvv on the system - anything else I can provide you with?

Comment 14 Stephen Tweedie 2002-10-12 12:09:23 UTC
What "complex memory tests" and "HD surface scans" did you try?  Try memtest86
on the system if you haven't already done so.  

This really does look like a hardware problem more than anything else, and
memory is the most likely suspect right now.  The fact that you're getting
single-bit-flip errors ("perl" and "pern" are just one bit apart), and that the
problem clears itself automatically after a while (the web server restart worked
OK a couple of hours after it failed) point in this direction --- it looks like
there was bad data in the kernel's filesystem cache, but after that data was
removed from cache and then reloaded later on, we got a correct copy.  That
indicates that either the system was given bad data from the disk (ie. a cabling
or controller problem) or it became corrupt once it was already in memory (ie. a
memory or CPU problem.)  It's pretty definitely not a filesystem problem (that
would corrupt a whole block of memory at once, not just a single bit.)
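
The single-bit claim can be checked by XORing the two strings byte by byte. This is only an illustrative sketch: the helper name `bit_differences` is made up for the example, and the inputs are the "Perl"/"Pern" prefixes taken from the error messages in this report.

```python
def bit_differences(good: bytes, bad: bytes):
    """Return (offset, xor_mask, flipped_bit_count) for each differing byte.

    Hypothetical helper for illustration; assumes both inputs have equal length.
    """
    if len(good) != len(bad):
        raise ValueError("inputs must have the same length")
    diffs = []
    for offset, (g, b) in enumerate(zip(good, bad)):
        mask = g ^ b  # bits set in the mask are the bits that differ
        if mask:
            diffs.append((offset, mask, bin(mask).count("1")))
    return diffs


if __name__ == "__main__":
    for offset, mask, count in bit_differences(b"Perl", b"Pern"):
        print(f"byte {offset}: mask 0x{mask:02x}, {count} bit(s) flipped")
```

The only differing byte is the last one ('l' is 0x6c, 'n' is 0x6e), and the XOR mask 0x02 has exactly one bit set.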


 ---

Comment 15 Thomas M Steenholdt 2002-10-13 06:16:38 UTC
The previously used tests were those found on the "IBM Enhanced diagnostics
diskette" for that machine - the HD Surface scan took 48 hours to complete on a
4GB disk (indicating, that if it's using the right algorithms - it's probably
pretty thorough)

I ran a memtest86 3.0 (All tests) 3 consecutive times - All passed(no comments)

What can I try next ? Of course I realize that with this kind of failure - the
next failure might not be in the libperl.so module so I'm watching out for all
strange errors.

Comment 16 Thomas M Steenholdt 2002-10-13 06:45:40 UTC
Btw, I've been running RH7.3 the last maybe 6 months (upgraded from 7.3 when 7.3
was released) and I have never had any problems until i did a fresh install of
8.0 on the machine when 8.0 was released

Comment 17 Stephen Tweedie 2002-10-14 19:42:22 UTC
How long did you run the memtest for?  3 runs may not be enough, depending on
how much memory you have --- you really need an overnight run to have any
confidence in the results.

The ext3 code in 8.0 is almost identical to that in 7.3, so I'd be looking at
that next.  Also, the IBM diagnostics disk is not necessarily going to be
helpful here --- it's not the disk itself that we want to test, it's the whole
pipeline from driver to controller to cable to disk, so "badblocks -w" is
actually a better test here --- it tests that whole pipeline under the live
Linux environment rather than the rather more forgiving environment of a
stand-alone test boot.

Can you try those tests?  The pattern of single-bit errors here strongly
indicates a hardware fault.

Comment 18 Thomas M Steenholdt 2002-10-15 21:29:12 UTC
memtest86 ran for about 12 hours (3 iterations of all tests - 128MB).
I ran the badblocks -w test (booting from the installation media in rescue mode,
as this is my "/" filesystem): no errors reported. A clean and virtually
identical system has been installed again and is running (btw badblocks -w does
NOT disallow checking mounted media, or the check is primitive - hda1 is my root
fs and badblocks -w /dev/hda just wiped everything out right there)

Anyway still no errors were found - should i check the running system with
badblocks -n or something or e2fsck?


Comment 19 Stephen Tweedie 2002-10-16 13:28:11 UTC
Sorry to have to close this, but it really does look like a hardware, not
software, error occurred.  If it happens again, you might check to see if there
are loose cables or whether any fans have failed in the system.

Comment 20 Thomas M Steenholdt 2002-10-16 15:14:48 UTC
Well, OK I guess - I think the hardware is okay though...

is there ANY way that this could be a "bad media" install problem???

I know for a fact that the media is good, but i didn't test them on that
particular machine and its cdrom drive is not the latest model... I did an ftp
install from my other machine this time and I haven't seen any problems yet...
It has not been running for a full 24 hours yet, though...

Thanks for your time anyway :o)

Comment 21 Stephen Tweedie 2002-10-16 15:32:10 UTC
No.  There is fairly convincing evidence that the copy of one of the shared
libraries was corrupt in memory, and then was fine later on once it had been
re-fetched from disk, so that indicates that the install itself is probably OK.
 You can check most packages with rpm -Va, of course.

Comment 22 Thomas M Steenholdt 2002-10-18 06:18:46 UTC
Just wanted you to know that you have been right onto the problem from the
start... I decided that I would start memtest86 and let it run for a week and
just hope that it would find something to explain the whole problem.
it bothers me that i didn't think of actually comparing the corrupted file with
a valid one from the start, but i guess my first fear was that someone had
somehow managed to change the file on my machine... Well I did the compare and
saw that a single byte was corrupted to 156 instead of 154, which is a two bit
change... at the same time I realize that a change from Perl to Pern is a single
bit error, so how is this possible? is the cmp -l tool confused over something?

well after 31 hours of testing in memtest86 and after 5 total passes - it
finally found an error in test #11
Failing address 0645ac0c Good 00000008 Bad 0000000a Err-Bits 00000002 Count 1

and even though that error is not completely identical with the one experienced
in libperl.so(judging by the cmp tool) i'm getting pretty confident that it
could actually be my error right there ...

Comment 23 Thomas M Steenholdt 2002-10-18 06:22:01 UTC
Argh - I just realized that cmp -l returns OCTAL values of the bytes and so
everything is right back on track, i'm guessing that the error I found is
actually the one that has been teasing me!
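
The octal mix-up can be worked through with the numbers quoted in this thread. The sketch below is an illustration only; the variable names are invented for the example, and the values are the two bytes reported by cmp -l and the Good/Bad words from the memtest86 line.

```python
# Byte values as printed by cmp -l (octal) in the comparison of libperl.so.
expected_byte = int("154", 8)  # byte in a good copy
corrupt_byte = int("156", 8)   # byte in the corrupted copy

# XOR shows which bits differ between the good and the corrupted value.
file_mask = expected_byte ^ corrupt_byte

# Good and Bad words from the memtest86 failure line (hexadecimal).
memtest_mask = 0x00000008 ^ 0x0000000A

# What the difference looks like if 154 and 156 are misread as decimal.
decimal_mask = 154 ^ 156


def bits_set(value: int) -> int:
    """Count the bits set in value."""
    return bin(value).count("1")


if __name__ == "__main__":
    print(f"good byte {chr(expected_byte)!r}, corrupt byte {chr(corrupt_byte)!r}")
    print(f"file mask 0x{file_mask:02x} ({bits_set(file_mask)} bit)")
    print(f"memtest mask 0x{memtest_mask:08x} ({bits_set(memtest_mask)} bit)")
    print(f"decimal misreading: {bits_set(decimal_mask)} bits")
```

Read as octal, the bytes are 'l' and 'n' and differ by the single bit 0x02, the same mask memtest86 reported as Err-Bits. Read as decimal, they appear to differ by two bits, which is where the confusion came from. The numbers alone show the same bit position within a byte, not that the same memory cell was involved.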

You ROCK!

and I've learned a thing or two about memory corruption and what it can do :)

Thanks again!

