Bug 80457

Summary: Latest Kernel upgrade causes errors in loading md-personality-3
Product: [Retired] Red Hat Linux
Reporter: Thomas Bolioli <terraformer>
Component: kernel
Assignee: Arjan van de Ven <arjanv>
Status: CLOSED ERRATA
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: medium    
Version: 8.0   
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2003-06-05 13:25:07 UTC

Description Thomas Bolioli 2002-12-26 19:53:25 UTC
Description of problem:
The latest kernel upgrade causes errors in loading md-personality-3. The errors
started on the first boot after the upgrade, the day after I installed it. I
have syslogs going back to the original install date, and there is no mention of
this error before the 22nd of December.
Additionally, but maybe unrelated (I am still looking into it), I am having
system lockups (also starting the day after the kernel upgrade) that do not
produce any log entries, core dumps, etc. All that happens is that the Caps Lock
and Scroll Lock lights blink in unison and the display is on but dark. The
system is pingable, but no other services respond and the console is
unresponsive. No errors (or anything else, for that matter) are on screen.
This is happening about once a day on average, in no discernible pattern.

From syslog: "Dec 22 17:55:51 nova kernel: kmod: failed to exec /sbin/modprobe -s -k md-personality-3, errno = 2"
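
For reference, errno 2 on Linux is ENOENT ("No such file or directory"), which
suggests the exec of /sbin/modprobe itself failed because a file could not be
found at that moment. A minimal sketch for pulling such kmod failures out of
syslog lines and decoding the errno follows. The function name and the regular
expression are illustrative assumptions based on the single log line quoted
above; they are not part of any Red Hat tool.

```python
import errno
import re

# Pattern for the kmod failure message quoted above (an assumption based on
# that single log line; other kernels may word the message differently).
KMOD_FAIL = re.compile(
    r"kmod: failed to exec (?P<cmd>\S+) (?P<args>.*?), errno = (?P<errno>\d+)"
)


def find_kmod_failures(lines):
    """Return (command, module, errno_name) for each kmod exec failure."""
    hits = []
    for line in lines:
        m = KMOD_FAIL.search(line)
        if m is None:
            continue
        code = int(m.group("errno"))
        # The last argument handed to modprobe is the requested module/alias.
        module = m.group("args").split()[-1]
        # Fall back to the raw number if the code has no symbolic name.
        hits.append((m.group("cmd"), module, errno.errorcode.get(code, str(code))))
    return hits


sample = [
    "Dec 22 17:55:51 nova kernel: kmod: failed to exec /sbin/modprobe "
    "-s -k md-personality-3, errno = 2",
]
print(find_kmod_failures(sample))
```

Feeding it the lines of /var/log/messages instead of the sample list would list
every such failure along with the module alias that was requested.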

Version-Release number of selected component (if applicable):
2.4.18-19.8.0

How reproducible:
N/A

Steps to Reproduce:
N/A    

Actual results:

See syslog entry above

Expected results:
No error

Additional info:

Comment 1 Arjan van de Ven 2002-12-26 19:57:08 UTC
The blinking lights mean you got a kernel oops.
Can you paste or attach the output of lsmod so that I can see which modules are
loaded (e.g. to see if there are any of the usual suspects)?
Is there any way to capture the oops somewhere (does it appear in
/var/log/messages)?


Comment 2 Thomas Bolioli 2002-12-27 02:22:11 UTC
See below for lsmod output. As far as capturing the panic goes, I doubt it. As I
stated in the original post, it is not leaving a single trace behind. I do not
even have a good idea of exactly what time it is crashing, so I cannot tell
whether a cron job is causing it. I was literally piecing together a timeline by
cross-referencing multiple logs. I have set up a minute-by-minute cron job that
adds an entry to a log, to get an exact time the next time it occurs. I have
unloaded rhnsd and setiathome and am left with very little remaining (see ps
output below).
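
The heartbeat approach can be sketched as follows, assuming the cron job appends
one timestamp per line to a log file. The timestamp format, the function name,
and the sample entries are all hypothetical, since the actual cron entry is not
shown in this report. The last entry before a gap brackets the time of the crash
to within about a minute.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"  # assumed timestamp format written by the cron job


def find_gaps(stamps, max_gap=timedelta(minutes=2)):
    """Return (last_seen, next_seen) pairs where heartbeats went missing."""
    times = [datetime.strptime(s.strip(), FMT) for s in stamps if s.strip()]
    gaps = []
    # Compare each heartbeat with the one that follows it.
    for prev, cur in zip(times, times[1:]):
        if cur - prev > max_gap:
            gaps.append((prev, cur))
    return gaps


# Made-up sample data, for illustration only.
heartbeats = [
    "2002-12-27 03:14",
    "2002-12-27 03:15",
    "2002-12-27 03:16",
    "2002-12-27 09:02",  # first entry after the machine came back
]
for last_seen, next_seen in find_gaps(heartbeats):
    print("stopped logging after", last_seen, "and resumed at", next_seen)
```

The two-minute threshold is a design choice: it tolerates one late cron run
without reporting a false gap.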

I have some possibly bad news to add. I have confirmed that another machine
(i686, same architecture but a different manufacturer from the original system
in question) running software RAID is having the same md-personality-3 loading
errors as my system, yet it is not crashing. Systems (all i586) without software
RAID did not develop this loading error upon upgrade to the latest kernel. It
appears that the modprobe loading issue is a separate bug from the crashes on
this one system. I will leave it up to you to concur and fork the bugs as you
see fit.

If you know of anything that can get an error written to disk prior to
freezing, let me know and I will do my best to implement it. If you need me to
load debugging kernels and/or experiment, I may be able to do so, since there is
some leeway on this low-load machine.

Anything else you need, let me know,
Tom

Output of lsmod:
[root@nova root]# lsmod
Module                  Size  Used by    Not tainted
ide-cd                 33608   0  (autoclean)
cdrom                  33696   0  (autoclean) [ide-cd]
soundcore               6532   0  (autoclean)
mousedev                5524   0  (autoclean)
input                   5920   0  (autoclean) [mousedev]
autofs                 13348   0  (autoclean) (unused)
e1000                  55948   1
ipt_REJECT              3736   2  (autoclean)
iptable_filter          2412   1  (autoclean)
ip_tables              14936   2  [ipt_REJECT iptable_filter]
microcode               4668   0  (autoclean)
ext3                   70368   2
jbd                    52212   2  [ext3]
raid1                  15244   3

Output of ps -ef: (NB: xinetd loads pop3s and imapds, and that's it)
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 15:21 ?        00:00:03 init
root         2     1  0 15:21 ?        00:00:00 [keventd]
root         3     1  0 15:21 ?        00:00:00 [ksoftirqd_CPU0]
root         4     1  0 15:21 ?        00:00:00 [kswapd]
root         5     1  0 15:21 ?        00:00:00 [bdflush]
root         6     1  0 15:21 ?        00:00:00 [kupdated]
root         7     1  0 15:21 ?        00:00:00 [mdrecoveryd]
root        15     1  0 15:21 ?        00:00:00 [raid1d]
root        16     1  0 15:21 ?        00:00:00 [raid1d]
root        17     1  0 15:21 ?        00:00:00 [raid1d]
root        18     1  0 15:21 ?        00:00:00 [kjournald]
root       122     1  0 15:21 ?        00:00:00 [kjournald]
root       444     1  0 15:21 ?        00:00:00 syslogd -m 0
root       448     1  0 15:21 ?        00:00:00 klogd -x
rpc        465     1  0 15:21 ?        00:00:00 portmap
rpcuser    484     1  0 15:21 ?        00:00:00 rpc.statd
root       576     1  0 15:21 ?        00:00:00 /usr/sbin/sshd
root       591     1  0 15:21 ?        00:00:00 xinetd -stayalive -reuse -pidfil
root       603     1  0 15:21 ?        00:00:00 /bin/sh /usr/bin/safe_mysqld --d
mysql      641   603  0 15:21 ?        00:00:00 /usr/libexec/mysqld --defaults-f
root       653     1  0 15:21 ?        00:00:00 sendmail: accepting connections
smmsp      665     1  0 15:21 ?        00:00:00 sendmail: Queue runner@01:00:00
root       678     1  0 15:21 ?        00:00:00 /usr/sbin/httpd
root       687     1  0 15:21 ?        00:00:00 crond
apache     701   678  0 15:21 ?        00:00:00 /usr/sbin/httpd
apache     702   678  0 15:21 ?        00:00:00 /usr/sbin/httpd
apache     703   678  0 15:21 ?        00:00:00 /usr/sbin/httpd
apache     704   678  0 15:21 ?        00:00:00 /usr/sbin/httpd
apache     705   678  0 15:21 ?        00:00:00 /usr/sbin/httpd
apache     706   678  0 15:21 ?        00:00:00 /usr/sbin/httpd
apache     707   678  0 15:21 ?        00:00:00 /usr/sbin/httpd
apache     708   678  0 15:21 ?        00:00:00 /usr/sbin/httpd
xfs        724     1  0 15:21 ?        00:00:00 xfs -droppriv -daemon
root       748     1  0 15:21 tty2     00:00:00 /sbin/mingetty tty2
root       749     1  0 15:21 tty3     00:00:00 /sbin/mingetty tty3
root       750     1  0 15:21 tty4     00:00:00 /sbin/mingetty tty4
root       751     1  0 15:21 tty5     00:00:00 /sbin/mingetty tty5
root       752     1  0 15:21 tty6     00:00:00 /sbin/mingetty tty6
root       979     1  0 15:30 tty1     00:00:00 /sbin/mingetty tty1
502       2205   591  0 20:54 ?        00:00:00 imapd
root      2297   576  0 21:05 ?        00:00:00 /usr/sbin/sshd
root      2299  2297  0 21:05 pts/0    00:00:00 -bash
root      2404  2299  0 21:19 pts/0    00:00:00 ps -ef

Comment 3 Arjan van de Ven 2002-12-27 11:11:31 UTC
Hmm, it might be worth running the memtest86 program to check for bad RAM.

Comment 4 Thomas Bolioli 2002-12-27 13:57:12 UTC
The first thing I looked at was hardware (not knowing that the blinking lights
indicate a kernel trap). I ran all of Dell's diagnostic utilities and they came
up just fine, everything from RAM to HDD. Anyhow, I will run the test with
memtest86 and get back to you. I am in MA while the server is in NY, so I need
to get someone there to do it.
FYI: I am on an older kernel right now and it has stayed up about 18 hours so
far. Not a record yet, but definitely top quartile for the week. Also, those
errors for md-personality-3 are gone.


Comment 5 Arjan van de Ven 2002-12-27 14:00:57 UTC
Hmm, OK. It's not too likely that memtest86 will find anything if the Dell tools
say everything is OK.

Can you say what the exact version of the last known-good kernel is?
(That way I can check all the changes more precisely.)

Comment 6 Thomas Bolioli 2002-12-27 14:11:01 UTC
2.4.18-19.8.0 (Causing problems)
2.4.18-18.8.0 (Never caused problems and what I am booted to now)
PS: I just remembered that when I captured the lsmod output I gave you, the
machine was already booted back into 2.4.18-18.8.0. Would two different kernels
have loaded different modules? If so, I will get you that output again.
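
One way to check this without eyeballing two listings is to diff the module
names from lsmod output captured under each kernel. A minimal sketch, assuming
the lsmod layout shown in Comment 2 (a header line, then the module name in the
first column). The function names are hypothetical, and the second sample
listing is invented for illustration only.

```python
def module_names(lsmod_text):
    """Extract module names from lsmod output, skipping the header line."""
    names = set()
    for line in lsmod_text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "Module":
            continue
        names.add(fields[0])
    return names


def diff_modules(old_text, new_text):
    """Return (only_in_old, only_in_new) as sorted lists of module names."""
    old, new = module_names(old_text), module_names(new_text)
    return sorted(old - new), sorted(new - old)


# Excerpt of the listing from Comment 2.
old_kernel = """Module                  Size  Used by    Not tainted
ext3                   70368   2
jbd                    52212   2  [ext3]
raid1                  15244   3
"""
# Hypothetical second capture, NOT taken from a real system.
new_kernel = """Module                  Size  Used by    Not tainted
ext3                   70368   2
jbd                    52212   2  [ext3]
"""
print(diff_modules(old_kernel, new_kernel))
```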

Comment 7 Arjan van de Ven 2002-12-27 14:13:15 UTC
There won't be different modules; the changes between -18.8.0 and -19.8.0 are
very small.


Comment 8 Arjan van de Ven 2002-12-27 14:46:14 UTC
Question: are you using any special ext3 options?
(ext3 is the biggest thing that changed between -18 and -19.)

Comment 9 Thomas Bolioli 2002-12-27 15:04:09 UTC
Not to my knowledge. I set up using the defaults.
/etc/sysconfig/harddisks has nothing turned on in it and no extra parameters.


Comment 10 Thomas Bolioli 2003-01-16 23:52:33 UTC
Is there any word on this? 

Comment 11 Thomas Bolioli 2003-03-04 23:19:53 UTC
Kernel 2.4.18-24.8.0 seems to have fixed the problem. Someone with access needs
to close the bug out.