From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.1) Gecko/20020826

Description of problem:
I recently installed RedHat 8.0 at home for use as a small web/mail server - nothing too serious atm... But after 3 days of running (grand total - I have restarted a few times in that time) the script kiddie attacks seem to have worked? Or is this a totally different problem... When I got up this morning at 5:50 the webserver was not running... no httpd processes, and service httpd status said the server is dead but the pid file exists (something like that)... I tried to do a service httpd stop (which of course failed) and a service httpd start, which let the server start with absolutely no problems... I'm attaching the access log and error log for you to see - there were no problems reported in /var/log/messages so I'm not really sure what else I can provide you with... This COULD be a serious problem for people who actually need their server running...

Version-Release number of selected component (if applicable):

How reproducible:
Didn't try

Actual Results: all httpd processes died

Expected Results: httpd processes should not die

Additional info:
Created attachment 79306 [details] apache error log (the bad filename comes from copy-pasting the log contents into an email to report while at work - pay no attention to that)
Created attachment 79307 [details] apache access log (the bad filename comes from copy-pasting the log contents into an email to report while at work - pay no attention to that)
The reason the server died appears to be:

[Tue Oct 08 04:02:25 2002] [notice] SIGHUP received. Attempting to restart
Syntax error on line 10 of /etc/httpd/conf.d/perl.conf:
Cannot load /etc/httpd/modules/mod_perl.so into server: /usr/lib/perl5/5.8.0/i386-linux-thread-multi/CORE/libperl.so: symbol Perl_Iutf8_lower_ptr, version GLIBC_2.0 not defined in file libpthread.so.0 with link time reference

Had you changed anything on the machine last night? Was this a fresh 8.0 install, or an upgrade?
Is this reproducible, or does a "service httpd reload" work correctly for you now? CC'ing Chip in case he's seen this error before.
This is a freshly installed machine and i didn't change anything that has anything to do with httpd yesterday... i might have restarted the service on purpose a couple of times, can't remember... But even if I changed something - I always try to restart the daemon to see if it comes up alright. Also note that the server came up without a glitch just by running service httpd start a few hours later - i did not reconfigure anything to have it start again... and I was not at the machine at 4.02
Actually I'm in a situation right now where the problem can be reproduced each time I try to start httpd... Although the error message is similar, it's not the same...

[root@server root]# service httpd start
Starting httpd: Syntax error on line 10 of /etc/httpd/conf.d/perl.conf:
Cannot load /etc/httpd/modules/mod_perl.so into server: /usr/lib/perl5/5.8.0/i386-linux-thread-multi/CORE/libperl.so: undefined symbol: Pern_cv_undef
[FAILED]

Based on the first error, which mentioned GLIBC, I thought I'd mention that the server is an i586 - 200MHz MMX based machine. And a question to soothe my own curiosity - why does httpd get a SIGHUP in the first place??? I'm really not sure where to look right now, but I'll leave the server running (without rebooting it) in case you want me to try something out.
I presume that is a cut'n'paste error, and the error is really that "Perl_sv_undef" is undefined? (not "Pern_" ...)? Does /usr/bin/perl still work on this system? Apache is sent the SIGHUP signal by logrotate every night, to rotate the server logs: see /etc/logrotate.d/httpd
The cut'n'paste error is actually what is written on my screen... "Pern_cv_undef", not Perl as you suggest - And no! /usr/bin/perl doesn't work correctly either. Running perl followed by the classic print("Hello world!\n"); and ^D actually caused Hello world! to be printed to the screen. But it was followed by a variation of the previous error, saying "relocation error: /usr/lib/perl5/5.8.0/i386-linux-thread-multi/CORE/libperl.so: undefined symbol: Pern_cv_undef". As it seems we have now narrowed things down a little, I tried running an "rpm -V perl", which came back with the result

..5..... /usr/lib/perl5/5.8.0/i386-linux-thread-multi/CORE/libperl.so

This could of course be due to some disk problems or... but I'm attaching the file anyway for you to see - I did not do anything on the system that should have changed that file?!?
Created attachment 79585 [details] libperl.so from sick system - any idea what happened to this file???
Ah, okay. Then I guess you have disk/RAM/... corruption; not much more we can do to help, I think.
Okay, sorry about the false alarm! - i'm reinstalling pronto! (after a few serious system checks)
OK - Having run HD surface scans and complex memory tests continuously for the last 48 hours with nothing to indicate even a minor error in the machine - I rebooted into the original failing installation and there is no problem. So now I think that what looked at first to be a problem with httpd, then perl, then disk/memory might actually be a problem with the kernel (ide/ext3/cache/?). So how do I proceed with this? Should I install and run the debug kernel and see if the problem arises again? If so, then what? I have little experience in gathering information from the debug kernel so I don't really know how to approach it. I'm guessing that I CAN reproduce the error, but that I'll have to wait for maybe 24 hours, maybe less, for it to happen?!?
Created attachment 80025 [details] this is the output of lspci -vvv on the system - anything else I can provide you with?
What "complex memory tests" and "HD surface scans" did you try? Try memtest86 on the system if you haven't already done so. This really does look like a hardware problem more than anything else, and memory is the most likely suspect right now. The fact that you're getting single-bit-flip errors ("perl" and "pern" are just one bit apart), and that the problem clears itself automatically after a while (the web server restart worked OK a couple of hours after it failed) point in this direction --- it looks like there was bad data in the kernel's filesystem cache, but after that data was removed from cache and then reloaded later on, we got a correct copy. That indicates that either the system was given bad data from the disk (i.e. a cabling or controller problem) or it became corrupt once it was already in memory (i.e. a memory or CPU problem.) It's pretty definitely not a filesystem problem (that would corrupt a whole block of memory at once, not just a single bit.)
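To make the "one bit apart" observation concrete, here is a minimal Python sketch that XORs two byte strings and reports which bits differ. The helper name bit_diff is made up for illustration and is not part of any tool mentioned in this report.

```python
def bit_diff(a: bytes, b: bytes):
    """Return (offset, xor_mask, flipped_bit_count) for every differing byte."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return [
        (i, x ^ y, bin(x ^ y).count("1"))
        for i, (x, y) in enumerate(zip(a, b))
        if x != y
    ]


# 'l' is 0x6c and 'n' is 0x6e, so only bit 1 (mask 0x02) differs.
print(bit_diff(b"Perl", b"Pern"))  # [(3, 2, 1)]
```

A mask with exactly one bit set is what a single flipped memory cell would produce; corruption of a whole block would normally show many differing bytes.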
The previously used tests were those found on the "IBM Enhanced diagnostics diskette" for that machine - the HD Surface scan took 48 hours to complete on a 4GB disk (indicating, that if it's using the right algorithms - it's probably pretty thorough) I ran a memtest86 3.0 (All tests) 3 consecutive times - All passed(no comments) What can I try next ? Of course I realize that with this kind of failure - the next failure might not be in the libperl.so module so I'm watching out for all strange errors.
Btw, I've been running RH7.3 the last maybe 6 months (upgraded from 7.3 when 7.3 was released) and I have never had any problems until i did a fresh install of 8.0 on the machine when 8.0 was released
How long did you run the memtest for? 3 runs may not be enough, depending on how much memory you have --- you really need an overnight run to have any confidence in the results. The ext3 code in 8.0 is almost identical to that in 7.3, so I'd be looking at that next. Also, the IBM diagnostics disk is not necessarily going to be helpful here --- it's not the disk itself that we want to test, it's the whole pipeline from driver to controller to cable to disk, so "badblocks -w" is actually a better test here --- it tests that whole pipeline under the live Linux environment rather than the rather more forgiving environment of a stand-alone test boot. Can you try those tests? The pattern of single-bit errors here strongly indicates a hardware fault.
memtest86 ran for about 12 hours (3 iterations of all tests - 128MB) i made the badblocks -w test (booting on the installation media in rescue mode as this is my "/" filesystem) no errors reported - a clean and virtually identical system has been installed again and is running (btw badblocks -w does NOT disallow checking mounted media or the check is primitive - hda1 is my root fs and badblocks -w /dev/hda just wiped everything out right there) Anyway still no errors were found - should i check the running system with badblocks -n or something or e2fsck?
Sorry to have to close this, but it really does look like a hardware, not software, error occurred. If it happens again, you might check to see if there are loose cables or whether any fans have failed in the system.
Well, OK I guess - I think the hardware is okay though... is there ANY way that this could be a "bad media" install problem??? I know for a fact that the media is good, but I didn't test them on that particular machine and its cdrom drive is not the latest model... I did an ftp install from my other machine this time and I haven't seen any problems yet... It has not been running for a full 24 hours yet, though... Thanks for your time anyway :o)
No. There is fairly convincing evidence that the copy of one of the shared libraries was corrupt in memory, and then was fine later on once it had been re-fetched from disk, so that indicates that the install itself is probably OK. You can check most packages with rpm -Va, of course.
Just wanted you to know that you have been right onto the problem from the start... I decided that I would start memtest86 and let it run for a week and just hope that it would find something to explain the whole problem. It bothers me that I didn't think of actually comparing the corrupted file with a valid one from the start, but I guess my first fear was that someone had somehow managed to change the file on my machine... Well, I did the compare and saw that a single byte was corrupted to 156 instead of 154, which is a two bit change... At the same time I realize that a change from Perl to Pern is a single bit error, so how is this possible? Is the cmp -l tool confused over something? Well, after 31 hours of testing in memtest86 and after 5 total passes - it finally found an error in test #11:

Failing address 0645ac0c   Good 00000008   Bad 0000000a   Err-Bits 00000002   Count 1

And even though that error is not completely identical with the one experienced in libperl.so (judging by the cmp tool), I'm getting pretty confident that it could actually be my error right there ...
Argh - I just realized that cmp -l returns OCTAL values of the bytes and so everything is right back on track, i'm guessing that the error I found is actually the one that has been teasing me! You ROCK! and I've learned a thing or two about memory corruption and what it can do :) Thanks again!
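For anyone retracing the arithmetic, the small Python sketch below reads the two cmp -l values first as decimal and then as octal, and compares the result with the memtest86 Good/Bad words quoted above. The variable names are illustrative only.

```python
# Byte values as printed by `cmp -l` for the differing byte.
good_str, bad_str = "154", "156"

# Misreading the values as decimal suggests two flipped bits.
decimal_mask = int(good_str, 10) ^ int(bad_str, 10)

# Reading them as octal (what cmp -l actually prints) gives one flipped bit.
good, bad = int(good_str, 8), int(bad_str, 8)
octal_mask = good ^ bad

# memtest86 reported Good 00000008 and Bad 0000000a.
memtest_mask = 0x00000008 ^ 0x0000000A

print(f"decimal reading: mask {decimal_mask:#04x}")
print(f"octal reading:   mask {octal_mask:#04x} ({chr(good)!r} -> {chr(bad)!r})")
print(f"memtest86:       mask {memtest_mask:#04x}")
```

Under the octal reading the file corruption and the memtest86 failure both flip the same bit position (mask 0x02), and the two bytes decode to 'l' and 'n'.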