Description of problem:
The latest kernel upgrade causes errors in loading md-personality-3. The errors started on the first boot after the upgrade, the day after I installed it. I have syslogs going back to the original install date, and there is no mention of this error until the 22nd of December.

Additionally, but maybe unrelated (I am still looking into it), I am having lockups on the system (also starting the day after the kernel upgrade) that do not produce any log entries, core dumps, etc. All that happens is that the Caps Lock and Scroll Lock lights blink in unison and the display is on but dark. The system is pingable, but no other services respond and the console is unresponsive. No errors (or anything else, for that matter) are on screen. This is happening, on average, about once a day in no discernible pattern.

From syslog:
"Dec 22 17:55:51 nova kernel: kmod: failed to exec /sbin/modprobe -s -k md-personality-3, errno = 2"

Version-Release number of selected component (if applicable):
2.4.18-19.8.0

How reproducible:
N/A

Steps to Reproduce:
N/A

Actual results:
See syslog entry above

Expected results:
No error

Additional info:
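For reference, the errno value in the syslog entry can be decoded mechanically. On Linux, errno 2 is ENOENT ("No such file or directory"), which means the kernel could not find the file it tried to exec, as opposed to modprobe running and then failing. The following is a minimal sketch only: the helper name `decode_kmod_failure` and the regular expression are assumptions built from the single log line quoted in this report, not a general kmod log parser.

```python
import errno
import os
import re

# Assumption: the pattern is derived from the one syslog line quoted in this
# report; other kernels may word the kmod message differently.
KMOD_RE = re.compile(
    r"kmod: failed to exec (?P<cmd>\S+) (?P<args>.*), errno = (?P<errno>\d+)"
)


def decode_kmod_failure(line):
    """Return details of a kmod exec failure, or None if the line does not match."""
    match = KMOD_RE.search(line)
    if match is None:
        return None
    code = int(match.group("errno"))
    return {
        "command": match.group("cmd"),
        # The module name is the last argument passed to modprobe.
        "module": match.group("args").split()[-1],
        "errno": code,
        # Symbolic name such as ENOENT; falls back to UNKNOWN for odd codes.
        "symbol": errno.errorcode.get(code, "UNKNOWN"),
        # Human-readable text as reported by the local C library.
        "message": os.strerror(code),
    }


if __name__ == "__main__":
    sample = ("Dec 22 17:55:51 nova kernel: kmod: failed to exec "
              "/sbin/modprobe -s -k md-personality-3, errno = 2")
    print(decode_kmod_failure(sample))
```

Running the same function over a whole syslog file, one line at a time, would show whether every failure carries the same errno or whether the code varies between boots.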
The blinking lights mean you got a kernel oops. Can you paste or attach the output of lsmod so that I can see which modules are loaded (e.g. to see if there are any usual suspects)? Is there any way to capture the oops somewhere (does it appear in /var/log/messages)?
See below for lsmod output.

As far as capturing the panic goes, I doubt it. As I stated in the original post, it is not leaving a single trace behind. I do not even have a good idea of exactly what time it is crashing, so I cannot tell whether a cron job is causing it. I was literally piecing together a timeline by cross-referencing multiple logs. I have set up a cron job that adds an entry to a log every minute, so that I get an exact time the next time it occurs. I have unloaded rhnsd and setiathome and am left with very little running (see ps output below).

I have some possibly bad news to add. I have confirmed that another machine (i686, same arch/different mfg from the orig system in question) running software RAID is having the same md-personality-3 loading errors as my system, yet it is not crashing. Systems (all i586) without software RAID did not develop this loading error upon upgrade to the latest kernel. It appears that the modprobe loading issue is a separate bug from the crashes on this one system. I will leave it up to you to concur and fork the bugs as you see fit.

If you know of anything that can get an error written to disk prior to freezing, let me know and I will do my best to implement it. If you need me to load debugging kernels and/or experiment, I may be able to do this, since there is some leeway on this machine as it is a low-load machine.

Anything else you need, let me know,
Tom

Output of lsmod:

[root@nova root]# lsmod
Module                  Size  Used by    Not tainted
ide-cd                 33608   0  (autoclean)
cdrom                  33696   0  (autoclean) [ide-cd]
soundcore               6532   0  (autoclean)
mousedev                5524   0  (autoclean)
input                   5920   0  (autoclean) [mousedev]
autofs                 13348   0  (autoclean) (unused)
e1000                  55948   1
ipt_REJECT              3736   2  (autoclean)
iptable_filter          2412   1  (autoclean)
ip_tables              14936   2  [ipt_REJECT iptable_filter]
microcode               4668   0  (autoclean)
ext3                   70368   2
jbd                    52212   2  [ext3]
raid1                  15244   3

Output of ps -ef (NB: xinetd loads pop3s and imapds and that's it):

UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 15:21 ?        00:00:03 init
root         2     1  0 15:21 ?        00:00:00 [keventd]
root         3     1  0 15:21 ?        00:00:00 [ksoftirqd_CPU0]
root         4     1  0 15:21 ?        00:00:00 [kswapd]
root         5     1  0 15:21 ?        00:00:00 [bdflush]
root         6     1  0 15:21 ?        00:00:00 [kupdated]
root         7     1  0 15:21 ?        00:00:00 [mdrecoveryd]
root        15     1  0 15:21 ?        00:00:00 [raid1d]
root        16     1  0 15:21 ?        00:00:00 [raid1d]
root        17     1  0 15:21 ?        00:00:00 [raid1d]
root        18     1  0 15:21 ?        00:00:00 [kjournald]
root       122     1  0 15:21 ?        00:00:00 [kjournald]
root       444     1  0 15:21 ?        00:00:00 syslogd -m 0
root       448     1  0 15:21 ?        00:00:00 klogd -x
rpc        465     1  0 15:21 ?        00:00:00 portmap
rpcuser    484     1  0 15:21 ?        00:00:00 rpc.statd
root       576     1  0 15:21 ?        00:00:00 /usr/sbin/sshd
root       591     1  0 15:21 ?        00:00:00 xinetd -stayalive -reuse -pidfil
root       603     1  0 15:21 ?        00:00:00 /bin/sh /usr/bin/safe_mysqld --d
mysql      641   603  0 15:21 ?        00:00:00 /usr/libexec/mysqld --defaults-f
root       653     1  0 15:21 ?        00:00:00 sendmail: accepting connections
smmsp      665     1  0 15:21 ?        00:00:00 sendmail: Queue runner@01:00:00
root       678     1  0 15:21 ?        00:00:00 /usr/sbin/httpd
root       687     1  0 15:21 ?        00:00:00 crond
apache     701   678  0 15:21 ?        00:00:00 /usr/sbin/httpd
apache     702   678  0 15:21 ?        00:00:00 /usr/sbin/httpd
apache     703   678  0 15:21 ?        00:00:00 /usr/sbin/httpd
apache     704   678  0 15:21 ?        00:00:00 /usr/sbin/httpd
apache     705   678  0 15:21 ?        00:00:00 /usr/sbin/httpd
apache     706   678  0 15:21 ?        00:00:00 /usr/sbin/httpd
apache     707   678  0 15:21 ?        00:00:00 /usr/sbin/httpd
apache     708   678  0 15:21 ?        00:00:00 /usr/sbin/httpd
xfs        724     1  0 15:21 ?        00:00:00 xfs -droppriv -daemon
root       748     1  0 15:21 tty2     00:00:00 /sbin/mingetty tty2
root       749     1  0 15:21 tty3     00:00:00 /sbin/mingetty tty3
root       750     1  0 15:21 tty4     00:00:00 /sbin/mingetty tty4
root       751     1  0 15:21 tty5     00:00:00 /sbin/mingetty tty5
root       752     1  0 15:21 tty6     00:00:00 /sbin/mingetty tty6
root       979     1  0 15:30 tty1     00:00:00 /sbin/mingetty tty1
502       2205   591  0 20:54 ?        00:00:00 imapd
root      2297   576  0 21:05 ?        00:00:00 /usr/sbin/sshd
root      2299  2297  0 21:05 pts/0    00:00:00 -bash
root      2404  2299  0 21:19 pts/0    00:00:00 ps -ef
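The minute-by-minute cron logging described in this comment can be turned into a crash time estimate by looking for holes in the heartbeat log. The sketch below is illustrative only: the log path, the timestamp format, and the function name `find_gaps` are assumptions, since the actual cron entry is not shown in this report. Note also that the last entries before a freeze may never reach the disk, so the last heartbeat seen is a lower bound on the crash time rather than an exact value.

```python
from datetime import datetime, timedelta

# Assumed crontab entry (hypothetical path and format; '%' must be escaped
# as '\%' inside a crontab line):
#   * * * * * date '+\%Y-\%m-\%d \%H:\%M:\%S' >> /var/log/heartbeat.log

STAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def find_gaps(lines, expected=timedelta(minutes=1), slack=timedelta(seconds=30)):
    """Return (last_seen, next_seen) pairs where heartbeats went missing.

    A gap is reported when two consecutive heartbeats are further apart
    than the expected interval plus some slack for cron jitter.
    """
    stamps = sorted(
        datetime.strptime(line.strip(), STAMP_FORMAT)
        for line in lines
        if line.strip()  # ignore blank lines
    )
    gaps = []
    for previous, current in zip(stamps, stamps[1:]):
        if current - previous > expected + slack:
            gaps.append((previous, current))
    return gaps


if __name__ == "__main__":
    # Hypothetical path, matching the assumed crontab entry above.
    with open("/var/log/heartbeat.log") as handle:
        for last_seen, next_seen in find_gaps(handle):
            print("down between", last_seen, "and", next_seen)
```

Each reported pair brackets one outage: the first timestamp is the last heartbeat before the freeze and the second is the first heartbeat after the reboot, which narrows down which cron jobs or log entries to compare against.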
hmm.. it might be worth it to try a run of the memtest86 program to check for bad ram...
The first thing I looked at was hardware (not knowing that the blinking lights are a kernel trap). I ran all of Dell's diagnostic utilities and they came up just fine, everything from RAM to HDD. Anyhow, I will run the test w/ memtest86 and get back to you. I am in MA while the server is in NY, so I need to get someone there to do it. FYI: I am on an older kernel right now and it has stayed up about 18 hrs so far. Not a record yet, but definitely top quartile for the week. Also, those errors for md-personality-3 are gone.
hmm ok. it's not too likely memtest86 will find anything if the dell tools say stuff is ok... can you say what the exact version of the last known OK kernel is? (that way I can check all changes more exactly)
2.4.18-19.8.0 (Causing problems)
2.4.18-18.8.0 (Never caused problems and what I am booted to now)

PS: I just remembered that when I did the lsmod output, which I gave you, the machine was already booted back into 2.4.18-18.8.0. Would two different kernels have loaded different modules? If so I will get you that output again.
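If a second lsmod listing is captured under 2.4.18-19.8.0, the two outputs can be compared by module name rather than by eye. This is a rough sketch under stated assumptions: the function names `module_names` and `diff_modules` are made up for illustration, and the parser only assumes the column layout seen in the lsmod output posted in this report (module name first, numeric size second).

```python
def module_names(lsmod_text):
    """Extract the set of module names from lsmod output."""
    names = set()
    for line in lsmod_text.splitlines():
        fields = line.split()
        # Skip the header, shell prompts and anything else that does not
        # look like "<name> <numeric size> ...".
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        names.add(fields[0])
    return names


def diff_modules(old_text, new_text):
    """Report modules present in only one of the two lsmod listings."""
    old = module_names(old_text)
    new = module_names(new_text)
    return {
        "only_old": sorted(old - new),
        "only_new": sorted(new - old),
    }
```

Calling `diff_modules` with the text of the -18.8.0 listing and the text of a -19.8.0 listing returns two sorted lists; if both are empty, the same set of modules was loaded under both kernels.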
there won't be different modules; the changes between -18.8.0 and -19.8.0 are very small...
question: are you using any special ext3 options? (since ext3 is the biggest thing that changed between -18 and -19)
Not to my knowledge. I set it up using defaults. /etc/sysconfig/harddisks has nothing turned on in it and no extra params.
Is there any word on this?
kernel 2.4.18-24.8.0 seems to have fixed the problem. Someone w/ access needs to close the bug out totally.