Dell PE 6650, 4 Xeon 1.5 GHz, 4 GB RAM, hyperthreading enabled, 2 Gb SAN environment, two QLA2340 HBAs with failover enabled (BIOS 1.43, driver 7.05.00-fo), central storage HP EVA5000. Red Hat Enterprise Linux AS release 3 (Taroon Update 5), kernel 2.4.21-32.0.1.ELsmp.

Randomly, but generally within 7 days after startup, one node of a two-node cluster (simple failover, identical hardware and software configuration) freezes up or sometimes reboots after a kernel oops has been captured with Netdump / Netdump Server. Because it is the backup node, it is mostly idle.

All hardware on this machine was fully tested (there is an open Dell Support Services incident) and no problems were detected. We completely cloned the first (stable) node to the backup node last week, in order to guarantee that there are really no differences in the setup of the two machines (originally both nodes had been installed from CD). The lockups still occurred.

When the server begins to lock up, all commands for viewing running processes (ps, w, top, ...) lead to a session hangup. We have also noticed that there is at least one /proc/<pid> directory which is no longer updated; an "ls" or "cd" on /proc/<pid> also leads to a session hangup, while "ls /proc/<pid>/fd" still works. With the last lockup the pid belonged to a process started by crond, but it does not appear to be crond-related only. The crash logs (full logs attached) also show other programs involved:

Pid/TGid: 14588/14588, comm: ypserv
Pid/TGid: 23549/23549, comm: crond
Pid/TGid: 6683/6683, comm: df
Pid/TGid: 7006/7006, comm: crond

I hope these descriptions help to solve this problem.
Created attachment 124378 [details] crashlogs
Created attachment 124379 [details] dmesg
Created attachment 124380 [details] lsmod_output
Can this problem be reproduced on an untainted kernel? Also, what external modules and/or drivers are being used? Thanks.
I really don't know why the kernel is tainted. All modules except the qla modules and the modules of the Dell Server Administrator (dcdipm and dcdbas) are as shipped by redhat/up2date. I booted the server without those modules, but the kernel was still tainted (3 in /proc/sys/kernel/tainted). The qla driver is the version supported by HP (HP EVA 5000), and this is the only custom compilation.

There are no warnings in the logs/dmesg about a module that will taint the kernel. There are only some insmod errors in the messages (the server does not have a parallel port):

[root@Server2 root]# cat /var/log/messages | grep -i insmod
Feb  6 08:13:57 Server2 insmod: /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/parport/parport_pc.o: init_module: No such device
Feb  6 08:13:57 Server2 insmod: Hint: insmod errors can be caused by incorrect module parameters, including invalid IO or IRQ parameters. You may find more information in syslog or the output from dmesg
Feb  6 08:13:57 Server2 insmod: /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/parport/parport_pc.o: insmod parport_lowlevel failed

We have installed the package "kernel-smp-unsupported-2.4.21-32.0.1.EL" because we need AppleTalk, but the appletalk module is only loaded on the productive node.
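For reference, /proc/sys/kernel/tainted is a bitmask; on 2.4 kernels a value of 3 combines the "P" (proprietary module loaded) and "F" (module force-loaded) flags, which matches the "Tainted: PF" shown in the oopses. A minimal sketch to decode the value observed above:

```shell
# Decode /proc/sys/kernel/tainted (2.4 kernels: bit 0 = P, proprietary
# module; bit 1 = F, module force-loaded; bit 2 = S, unsafe SMP).
tainted=3   # value observed on this system
if [ $(( tainted & 1 )) -ne 0 ]; then echo "P: a non-GPL (proprietary) module was loaded"; fi
if [ $(( tainted & 2 )) -ne 0 ]; then echo "F: a module was force-loaded (insmod -f)"; fi
if [ $(( tainted & 4 )) -ne 0 ]; then echo "S: SMP with CPUs not designed for SMP"; fi
```

A value of 3 therefore points at a proprietary module that was also force-loaded, which fits the out-of-tree Dell and qla drivers.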
And none of the currently loaded modules is in the "unsupported" tree of /lib/modules:

[root@Server2 root]# for I in `lsmod | grep -v Tainted | cut -f1 -d " "`; do modinfo $I | grep filename; done
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/scsi/st.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/scsi/sr_mod.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/ide/ide-cd.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/cdrom/cdrom.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/audit/audit.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/fs/nfsd/nfsd.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/fs/lockd/lockd.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/net/sunrpc/sunrpc.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/usb/serial/usbserial.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/char/lp.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/parport/parport.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/net/netconsole.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/misc/dcdipm.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/misc/dcdbas.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/fs/autofs4/autofs4.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/net/3c59x.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/net/tg3.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/block/floppy.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/arch/i386/kernel/microcode.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/block/loop.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/md/lvm-mod.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/input/keybdev.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/input/mousedev.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/usb/hid.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/input/input.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/usb/host/usb-ohci.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/usb/usbcore.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/fs/ext3/ext3.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/fs/jbd/jbd.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/scsi/sg.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/addon/qla2200/qla2300.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/scsi/qla2300_conf.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/scsi/megaraid2.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/scsi/aic7xxx/aic7xxx.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/block/diskdumplib.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/scsi/sd_mod.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/scsi/scsi_mod.o
[root@Server2 root]#

[root@Server2 root]# cat /etc/modules.conf
alias eth0 tg3
alias eth1 tg3
alias eth2 3c59x
alias scsi_hostadapter aic7xxx
alias scsi_hostadapter2 megaraid2
alias usb-controller usb-ohci
options scsi_mod max_scsi_luns=128
alias st off
post-remove qla2300 rmmod qla2300_conf
alias scsi_hostadapter3 qla2300_conf
alias scsi_hostadapter4 qla2300
alias scsi_hostadapter5 sg
options qla2300 ConfigRequired=1 ql2xuseextopts=1 ql2xmaxqdepth=16 qlport_down_retry=30 qlogin_retry_count=16 ql2xfailover=1 ql2xlbType=1 ql2xexcludemodel=0x0
[root@Server2 root]#

The productive machine has exactly the same hardware and software and also a tainted flag of 3, but is absolutely stable!? I know, though, that oops reports marked as tainted are of no use to you. So, how can I find the cause of the tainted kernel?
Sorry, the kernel is now "clean". I had forgotten to deactivate one service of the "Dell Server Administrator". After a "chkconfig .. off" of all Dell services, mkinitrd, and a reboot, the kernel is untainted:

[root@Server2 root]# lsmod
Module                  Size  Used by    Not tainted
audit                  90808   2  (autoclean)
nfsd                   86160   8  (autoclean)
lockd                  59600   1  (autoclean) [nfsd]
sunrpc                 89244   1  (autoclean) [nfsd lockd]
usbserial              23868   0  (autoclean) (unused)
lp                      9156   0  (autoclean)
parport                38848   0  (autoclean) [lp]
netconsole             18020   0  (unused)
autofs4                16888   1  (autoclean)
3c59x                  30416   1
tg3                    69768   1
floppy                 57552   0  (autoclean)
microcode               6912   0  (autoclean)
loop                   12728   0  (autoclean)
keybdev                 2976   0  (unused)
mousedev                5688   0  (unused)
hid                    22532   0  (unused)
input                   6176   0  [keybdev mousedev hid]
usb-ohci               23208   0  (unused)
usbcore                81152   1  [usbserial hid usb-ohci]
ext3                   89960   6
jbd                    55156   6  [ext3]
lvm-mod                65568   4
sg                     37324   0
qla2300               590844   7
qla2300_conf          301560   0
megaraid2              38376   7
aic7xxx               163120   0  (unused)
diskdumplib             5260   0  [megaraid2 aic7xxx]
sd_mod                 14128  22
scsi_mod              115496   5  [sg qla2300 megaraid2 aic7xxx sd_mod]
[root@Server2 root]#
There is one major issue that I cannot explain: the virtual addresses reported as the panicking EIPs in each of the 4 crashes. Your dmesg output shows:

Linux version 2.4.21-32.0.1.ELsmp (bhcompile.redhat.com) (gcc version 3.2.3 20030502 (Red Hat Linux 3.2.3-52)) #1 SMP Tue May 17 17:52:23 EDT 2005

which verifies it's a Red Hat built kernel, compiled on Tuesday, May 17th at 17:52:23. Accordingly, in order to find out exactly where (which instruction) the 4 crashes are occurring, I have booted that same exact kernel:

crash> sys
      KERNEL: /boot/vmlinux-2.4.21-32.0.1.ELsmp
   DEBUGINFO: /usr/lib/debug/boot/vmlinux-2.4.21-32.0.1.ELsmp.debug
    DUMPFILE: /dev/mem
        CPUS: 2
        DATE: Thu Feb 9 10:56:35 2006
      UPTIME: 00:12:44
LOAD AVERAGE: 0.02, 0.13, 0.09
       TASKS: 63
    NODENAME: crash.boston.redhat.com
     RELEASE: 2.4.21-32.0.1.ELsmp
     VERSION: #1 SMP Tue May 17 17:52:23 EDT 2005
     MACHINE: i686  (1993 Mhz)
      MEMORY: 511.5 MB
crash> !strings /boot/vmlinux-2.4.21-32.0.1.ELsmp | grep "Linux version"
Linux version 2.4.21-32.0.1.ELsmp (bhcompile.redhat.com) (gcc version 3.2.3 20030502 (Red Hat Linux 3.2.3-52)) #1 SMP Tue May 17 17:52:23 EDT 2005
crash>

The four reported crashes occurred at these locations:

EIP:    0060:[<c0122b60>]    Tainted: PF
EIP is at wake_up_cpu [kernel] 0x170 (2.4.21-32.0.1.ELsmp/i686)

EIP:    0060:[<c0134220>]    Tainted: PF
EIP is at __mod_timer [kernel] 0xc0 (2.4.21-32.0.1.ELsmp/i686)

EIP:    0060:[<c017eb35>]    Tainted: PF
EIP is at d_lookup [kernel] 0x75 (2.4.21-32.0.1.ELsmp/i686)

EIP:    0060:[<c013f10a>]    Tainted: PF
EIP is at vm_account [kernel] 0x7a (2.4.21-32.0.1.ELsmp/i686)

Now, upon disassembling the 4 functions above -- in every case -- the panic EIP address is not a legitimate instruction address. I have never seen this behaviour; text addresses are "fixed" in the vmlinux file, and by definition they have to be the same on any machine that boots that particular kernel.
Taking the first crash, looking for c0122b60 (wake_up_cpu + 0x170), note that it doesn't exist as an instruction:

crash> dis wake_up_cpu
...
0xc0122b57 <wake_up_cpu+0x167>: cmp    0xffffffdc(%ebp),%ebx
0xc0122b5a <wake_up_cpu+0x16a>: jl     0xc0122b30 <wake_up_cpu+0x140>
0xc0122b5c <wake_up_cpu+0x16c>: jmp    0xc0122a63 <wake_up_cpu+0x73>
0xc0122b61 <wake_up_cpu+0x171>: mov    0xffffffd8(%ebp),%ecx
0xc0122b64 <wake_up_cpu+0x174>: mov    0xffffffec(%ebp),%eax
...
crash>

And in the second crash, c0134220 (__mod_timer + 0xc0) is invalid:

crash> dis __mod_timer
...
0xc0134218 <__mod_timer+0xb8>:  xchg   %al,(%esi)
0xc013421a <__mod_timer+0xba>:  xor    %eax,%eax
0xc013421c <__mod_timer+0xbc>:  lock btr %eax,0x18(%edi)
0xc0134221 <__mod_timer+0xc1>:  sbb    %eax,%eax
0xc0134223 <__mod_timer+0xc3>:  test   %eax,%eax
...
crash>

In the third crash, c017eb35 (d_lookup + 0x75) is bogus:

crash> dis d_lookup
...
0xc017eb2b <d_lookup+0x6b>:     je     0xc017ebe0 <d_lookup+0x120>
0xc017eb31 <d_lookup+0x71>:     cmp    %ebp,0x44(%esi)
0xc017eb34 <d_lookup+0x74>:     mov    (%ebx),%ebx
0xc017eb36 <d_lookup+0x76>:     jne    0xc017eb20 <d_lookup+0x60>
0xc017eb38 <d_lookup+0x78>:     mov    0x34(%esp),%edi
...
crash>

And lastly, c013f10a (vm_account + 0x7a) is bogus:

crash> dis vm_account
...
0xc013f0fb <vm_account+0x6b>:   mov    0x8(%esp),%edi
0xc013f0ff <vm_account+0x6f>:   mov    0xc(%esp),%ebp
0xc013f103 <vm_account+0x73>:   add    $0x10,%esp
0xc013f106 <vm_account+0x76>:   ret
0xc013f107 <vm_account+0x77>:   mov    %esi,%eax
0xc013f109 <vm_account+0x79>:   test   $0x81,%al
0xc013f10b <vm_account+0x7b>:   je     0xc013f1d8 <vm_account+0x148>
0xc013f111 <vm_account+0x81>:   mov    %esi,%eax
...
crash>

However, the crashing systems are in fact attempting to execute those bogus EIP addresses. For example, the first one crashed while executing an EIP of c0122b60, which as shown above, is within wake_up_cpu():

crash> dis wake_up_cpu
...
0xc0122b57 <wake_up_cpu+0x167>: cmp    0xffffffdc(%ebp),%ebx
0xc0122b5a <wake_up_cpu+0x16a>: jl     0xc0122b30 <wake_up_cpu+0x140>
0xc0122b5c <wake_up_cpu+0x16c>: jmp    0xc0122a63 <wake_up_cpu+0x73>
0xc0122b61 <wake_up_cpu+0x171>: mov    0xffffffd8(%ebp),%ecx
0xc0122b64 <wake_up_cpu+0x174>: mov    0xffffffec(%ebp),%eax
...
crash>

Now, if I disassemble the (bogus) instruction at c0122b60, it evaluates to this:

crash> dis c0122b60
0xc0122b60 <wake_up_cpu+0x170>: decl   0x458bd84d(%ebx)
crash>

So it would take the contents of %ebx, add 0x458bd84d to it, and then reference that address location. Note below that %ebx contains a value of 2:

EIP is at wake_up_cpu [kernel] 0x170 (2.4.21-32.0.1.ELsmp/i686)
eax: 00000074   ebx: 00000002   ecx: e3ff2000   edx: c0441ab8
esi: 00000079   edi: ffffffff   ebp: e3ff3adc   esp: e3ff3aac
ds: 0068   es: 0068   ss: 0068

so the resultant address would be 0x458bd84f, which is the bogus virtual address that caused the crash:

Unable to handle kernel paging request at virtual address 458bd84f

There's no way that I can even begin to speculate how this could happen. It almost appears to be a hardware issue, especially if it can only be reproduced on one particular machine. However, I hate to point fingers prematurely.

There *may* be some other clues in the vmcore files for each crash, but to upload those, you will have to file a Red Hat support ticket; they will create an Issue Tracker for the bug and attach it to this bugzilla. They will also give you directions on how to upload the 4 vmcore files.

Again, for official Red Hat Enterprise Linux support, please log into the Red Hat support website at http://www.redhat.com/support and file a support ticket, or alternatively contact Red Hat Global Support Services at 1-888-RED-HAT1 to speak directly with a support associate and escalate an issue. Tell them that this bugzilla (180476) has already been filed so that they won't create a new one.

Thanks,
Dave Anderson
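The address arithmetic in the analysis above can be cross-checked with plain shell arithmetic: the wake_up_cpu base falls out of any listed instruction (here the <wake_up_cpu+0x167> line), the oops EIP is base + 0x170, and the faulting address is the decl displacement plus the %ebx value from the register dump:

```shell
# Recover the wake_up_cpu base from a known instruction in the listing,
# then confirm the oops EIP (base + 0x170) lands between the +0x16c and
# +0x171 instructions, and that displacement + %ebx gives the fault address.
base=$(( 0xc0122b57 - 0x167 ))                 # from <wake_up_cpu+0x167>
printf 'EIP   = %x\n' $(( base + 0x170 ))      # c0122b60, mid-instruction
printf 'fault = %x\n' $(( 0x458bd84d + 0x2 ))  # 458bd84f, as in the oops
```

Both results match the crash logs, confirming that the reported EIP really does sit one byte inside a legitimate instruction.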
In answer to Dave's comment #7: one hypothetical scenario that could produce such a bogus text address is an external module incorrectly using the kernel's timer service. In particular, if it allowed a pending timer to remain armed after the module was unloaded, the memory formerly holding the module's code would be executed when the timeout expired. I suppose this is just a long shot, though.
All of the 4 backtraces leading up to the crashes look normal, i.e., they all appear to lead into the crashing function. So I don't see any connections with module code, which would be vmalloc text addresses. My best guess is that the text segment gets randomly corrupted, and then in the act of executing the kernel text there, the EIP gets "bumped" erroneously due to a newly-malformed "instruction", until it actually does something that causes a crash, like referencing a bogusly-calculated address. But that can only be verified by looking at the vmcore contents.
Sorry for my late response. The system has been stable for more than one week without the "Dell Server Administrator" loaded (and an untainted kernel).

I think there is something going on with the ramdisk and the Dell services: if I recreate the ramdisk while the Dell modules are loaded and then disable the Dell services, the kernel is still tainted after a reboot, although the modules are not loaded any more. After a mkinitrd without the Dell modules loaded, the kernel is untainted as expected. I thought that mkinitrd only processes modules listed in modules.conf!?

However, I reinstalled the "Dell Server Administrator" last weekend (no further mkinitrd!) and the server has now been stable for 3 days, with the Dell modules loaded (and a tainted kernel). I'm sure that the ramdisk of the (stable) productive machine was created before the installation of the Dell services. Do you consider that this could be the reason for the lockups?
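One way to test the ramdisk theory is to look inside the initrd image itself: on RHEL 3, mkinitrd produces a gzip-compressed ext2 filesystem image, so the modules it baked in can be listed via a loopback mount. A sketch, assuming the kernel version from this report (paths are examples; run as root and adjust as needed):

```shell
# Unpack and mount the initrd to see which module .o files mkinitrd
# actually included; dcdipm.o/dcdbas.o appearing here would explain why
# the kernel stays tainted even after the Dell services are disabled.
zcat /boot/initrd-2.4.21-32.0.1.ELsmp.img > /tmp/initrd.img
mkdir -p /mnt/initrd
mount -o loop /tmp/initrd.img /mnt/initrd   # RHEL 3 initrds are ext2 images
ls /mnt/initrd/lib                          # module .o files live here
umount /mnt/initrd
```

Comparing this listing between the stable node's initrd and the crashing node's initrd would show whether the two ramdisks really differ.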
I'm sorry, but I don't know anything about the "Dell Server Administrator", how it interacts with mkinitrd, or how that affects tainting. And your guess is as good as mine as to whether it has anything to do with the "EIP shift".
This bug is filed against RHEL 3, which is in maintenance phase. During the maintenance phase, only security errata and select mission critical bug fixes will be released for enterprise products. Since this bug does not meet those criteria, it is now being closed.

For more information on the RHEL errata support policy, please visit: http://www.redhat.com/security/updates/errata/

If you feel this bug is indeed mission critical, please contact your support representative. You may be asked to provide detailed information on how this bug is affecting you.