180476 – Hanging processes with access to procfs

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 3
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	3.0
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Dave Anderson
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2006-02-08 14:03 UTC by Christian Schnuerer
Modified:	2007-11-30 22:07 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-10-19 18:47:44 UTC
Target Upstream Version:
Embargoed:

Attachments
crashlogs (31.29 KB, text/plain) 2006-02-08 14:03 UTC, Christian Schnuerer
dmesg (50.38 KB, text/plain) 2006-02-08 14:06 UTC, Christian Schnuerer
lsmod_output (1.86 KB, text/plain) 2006-02-08 14:07 UTC, Christian Schnuerer

Description Christian Schnuerer 2006-02-08 14:03:26 UTC

DELL PE 6650, 4 XEON 1.5 GHz, 4 GB RAM, hyperthreading enabled,
2Gb SAN environment, two QLA2340 HBAs with failover enabled
(BIOS 1.43, driver 7.05.00-fo), central storage HP EVA5000

Red Hat Enterprise Linux AS release 3 (Taroon Update 5), kernel
2.4.21-32.0.1.ELsmp

Randomly, but generally within 7 days after startup, one node of a two-node
cluster (simple failover, identical hardware and software configuration)
freezes or sometimes reboots after a kernel Oops has been captured with
Netdump / Netdump Server.
Because it is the backup node, it is mostly idle.

All hardware on this machine was fully tested (there is an open Dell Support 
Services Incident) and no problems were detected.
Last week we completely cloned the first (stable) node to the backup node 
to guarantee that there are really no differences in the setup of the 
two machines (originally both nodes had been installed from CD).
The lockups still occurred.


When the server begins to lock up, all commands for viewing running processes 
(ps, w, top, ...) hang the session. We have also noticed that there is at 
least one /proc/pid directory that is no longer updated; an "ls" or "cd" 
on that /proc/pid also hangs the session, while "ls /proc/pid/fd" still 
works.
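
A minimal sketch of how the unresponsive /proc/pid directories could be located without losing the login session: each "ls" runs in a child process that is abandoned after a timeout. It assumes that Python is available on the affected host and that listing /proc itself still returns; the helper names and the two-second timeout are hypothetical and not part of the report.

```python
import os
import subprocess
import time


def finishes_within(cmd, timeout_s):
    """Return True if cmd exits within timeout_s seconds, else False."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            return True
        time.sleep(0.05)
    # Best effort only: a child stuck in uninterruptible sleep ignores the
    # signal, so we deliberately do not wait for it afterwards.
    proc.kill()
    return False


def hanging_proc_dirs(timeout_s=2.0):
    """List /proc/<pid> directories for which 'ls' does not return in time."""
    pids = [name for name in os.listdir("/proc") if name.isdigit()]
    return [
        "/proc/" + pid
        for pid in pids
        if not finishes_within(["ls", "/proc/" + pid], timeout_s)
    ]
```

Calling hanging_proc_dirs() and printing the result would give the pids to compare against the crash logs. A process that exits during the scan simply makes "ls" fail quickly, which counts as a response.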

In the most recent lockup, the pid belonged to a process started by crond.

But it does not appear to be related only to crond. 
The crash logs (full logs attached) also show other programs involved:
Pid/TGid: 14588/14588, comm:               ypserv
Pid/TGid: 23549/23549, comm:                crond
Pid/TGid: 6683/6683, comm:                   df
Pid/TGid: 7006/7006, comm:                crond


I hope these descriptions help to solve this problem.

Comment 1 Christian Schnuerer 2006-02-08 14:03:26 UTC

Created attachment 124378
crashlogs

Comment 2 Christian Schnuerer 2006-02-08 14:06:45 UTC

Created attachment 124379
dmesg

Comment 3 Christian Schnuerer 2006-02-08 14:07:14 UTC

Created attachment 124380
lsmod_output

Comment 4 Ernie Petrides 2006-02-08 23:47:43 UTC

Can this problem be reproduced on an untainted kernel?  Also,
what external modules and/or drivers are being used?  Thanks.

Comment 5 Christian Schnuerer 2006-02-09 11:17:24 UTC

I really don't know why the kernel is tainted. All modules except the qla 
modules and the modules of the Dell Server Administrator (dcdipm and dcdbas) 
are as shipped by Red Hat/up2date. I booted the server without those modules, 
but the kernel was still tainted (3 in /proc/sys/kernel/tainted).
The qla driver is the version supported by HP (HP EVA 5000), and it is the 
only custom compilation.
There are no warnings in the logs/dmesg about a module that would taint 
the kernel.
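
A minimal sketch of how the value in /proc/sys/kernel/tainted can be read as a bitmask. The bit assignments below are an assumption based on the usual 2.4-era definitions (bit 0 = "P", proprietary module; bit 1 = "F", force-loaded module) and should be checked against the kernel headers of the running kernel; the function name is hypothetical.

```python
# Assumed 2.4-era taint bits; verify against the running kernel's headers.
TAINT_FLAGS = [
    (1 << 0, "P"),  # a proprietary (non-GPL) module was loaded
    (1 << 1, "F"),  # a module was force-loaded
]


def decode_taint(value):
    """Return (letters, leftover) for a taint bitmask.

    letters  -- flag letters for every known bit that is set
    leftover -- bits that are set but not listed in TAINT_FLAGS
    """
    letters = ""
    leftover = value
    for bit, letter in TAINT_FLAGS:
        if value & bit:
            letters += letter
            leftover &= ~bit
    return letters, leftover


print(decode_taint(3))
```

Under that assumption a value of 3 has both bits set, which is consistent with the "Tainted: PF" string in the oops output.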

There are only some insmod errors in /var/log/messages (the server does not 
have a parallel port):
[root@Server2 root]# cat /var/log/messages | grep -i insmod
Feb  6 08:13:57 Server2 insmod: /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/parport/parport_pc.o: init_module: No such device
Feb  6 08:13:57 Server2 insmod: Hint: insmod errors can be caused by incorrect module parameters, including invalid IO or IRQ parameters.       You may find more information in syslog or the output from dmesg
Feb  6 08:13:57 Server2 insmod: /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/parport/parport_pc.o: insmod parport_lowlevel failed

We have installed the package "kernel-smp-unsupported-2.4.21-32.0.1.EL" because 
we need appletalk, but the appletalk module is loaded only on the production 
node. None of the modules currently loaded is in the "unsupported" tree 
of /lib/modules:

[root@Server2 root]# for I in `lsmod | grep -v Tainted | cut -f1 -d " "`; do modinfo $I | grep filename; done
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/scsi/st.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/scsi/sr_mod.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/ide/ide-cd.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/cdrom/cdrom.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/audit/audit.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/fs/nfsd/nfsd.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/fs/lockd/lockd.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/net/sunrpc/sunrpc.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/usb/serial/usbserial.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/char/lp.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/parport/parport.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/net/netconsole.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/misc/dcdipm.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/misc/dcdbas.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/fs/autofs4/autofs4.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/net/3c59x.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/net/tg3.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/block/floppy.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/arch/i386/kernel/microcode.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/block/loop.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/md/lvm-mod.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/input/keybdev.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/input/mousedev.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/usb/hid.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/input/input.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/usb/host/usb-ohci.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/usb/usbcore.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/fs/ext3/ext3.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/fs/jbd/jbd.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/scsi/sg.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/addon/qla2200/qla2300.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/scsi/qla2300_conf.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/scsi/megaraid2.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/scsi/aic7xxx/aic7xxx.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/block/diskdumplib.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/scsi/sd_mod.o
filename:    /lib/modules/2.4.21-32.0.1.ELsmp/kernel/drivers/scsi/scsi_mod.o
[root@Server2 root]# 

[root@Server2 root]# cat /etc/modules.conf
alias eth0 tg3
alias eth1 tg3
alias eth2 3c59x
alias scsi_hostadapter aic7xxx
alias scsi_hostadapter2 megaraid2
alias usb-controller usb-ohci

options scsi_mod max_scsi_luns=128 

alias st off

post-remove qla2300 rmmod qla2300_conf
alias scsi_hostadapter3 qla2300_conf
alias scsi_hostadapter4 qla2300
alias scsi_hostadapter5 sg
options qla2300 ConfigRequired=1 ql2xuseextopts=1 ql2xmaxqdepth=16 qlport_down_retry=30 qlogin_retry_count=16 ql2xfailover=1 ql2xlbType=1 ql2xexcludemodel=0x0

[root@Server2 root]# 


The production machine has exactly the same hardware and software, and also 
a "tainted flag" of 3, but is absolutely stable!?

But I know that "oops" reports marked as tainted are of no use to you.
So, how can I find the cause of the tainted kernel?

Comment 6 Christian Schnuerer 2006-02-09 14:01:05 UTC

Sorry, the kernel is now "clean". I had forgotten to deactivate one service of 
the "Dell Server Administrator". After a chkconfig .. off of all Dell services,
a mkinitrd and a reboot, the kernel is untainted:

[root@Server2 root]# lsmod
Module                  Size  Used by    Not tainted
audit                  90808   2  (autoclean)
nfsd                   86160   8  (autoclean)
lockd                  59600   1  (autoclean) [nfsd]
sunrpc                 89244   1  (autoclean) [nfsd lockd]
usbserial              23868   0  (autoclean) (unused)
lp                      9156   0  (autoclean)
parport                38848   0  (autoclean) [lp]
netconsole             18020   0  (unused)
autofs4                16888   1  (autoclean)
3c59x                  30416   1 
tg3                    69768   1 
floppy                 57552   0  (autoclean)
microcode               6912   0  (autoclean)
loop                   12728   0  (autoclean)
keybdev                 2976   0  (unused)
mousedev                5688   0  (unused)
hid                    22532   0  (unused)
input                   6176   0  [keybdev mousedev hid]
usb-ohci               23208   0  (unused)
usbcore                81152   1  [usbserial hid usb-ohci]
ext3                   89960   6 
jbd                    55156   6  [ext3]
lvm-mod                65568   4 
sg                     37324   0 
qla2300               590844   7 
qla2300_conf          301560   0 
megaraid2              38376   7 
aic7xxx               163120   0  (unused)
diskdumplib             5260   0  [megaraid2 aic7xxx]
sd_mod                 14128  22 
scsi_mod              115496   5  [sg qla2300 megaraid2 aic7xxx sd_mod]
[root@Server2 root]#

Comment 7 Dave Anderson 2006-02-09 16:41:05 UTC

There is one major issue that I cannot explain, namely the virtual
addresses reported as the panicking EIPs in each of the 4 crashes.  Your 
dmesg output shows:

Linux version 2.4.21-32.0.1.ELsmp (bhcompile.redhat.com) (gcc version
3.2.3 20030502 (Red Hat Linux 3.2.3-52)) #1 SMP Tue May 17 17:52:23 EDT 2005

which verifies it's a Red Hat built kernel, compiled on Tuesday, May 17th
at 17:52:23.  Accordingly, in order to find out exactly where (which
instruction) the 4 crashes are occurring, I have booted that same exact
kernel:

  crash> sys
        KERNEL: /boot/vmlinux-2.4.21-32.0.1.ELsmp
     DEBUGINFO: /usr/lib/debug/boot/vmlinux-2.4.21-32.0.1.ELsmp.debug
      DUMPFILE: /dev/mem
          CPUS: 2
          DATE: Thu Feb  9 10:56:35 2006
        UPTIME: 00:12:44
  LOAD AVERAGE: 0.02, 0.13, 0.09
         TASKS: 63
      NODENAME: crash.boston.redhat.com
       RELEASE: 2.4.21-32.0.1.ELsmp
       VERSION: #1 SMP Tue May 17 17:52:23 EDT 2005
       MACHINE: i686  (1993 Mhz)
        MEMORY: 511.5 MB
  crash> !strings /boot/vmlinux-2.4.21-32.0.1.ELsmp | grep "Linux version"
  Linux version 2.4.21-32.0.1.ELsmp (bhcompile.redhat.com) (gcc
  version 3.2.3 20030502 (Red Hat Linux 3.2.3-52)) #1 SMP Tue May 17 17:52:23
  EDT 2005
  crash>

The four crashes reported occurred at these locations:

  EIP:    0060:[<c0122b60>]    Tainted: PF
  EIP is at wake_up_cpu [kernel] 0x170 (2.4.21-32.0.1.ELsmp/i686)
  
  EIP:    0060:[<c0134220>]    Tainted: PF
  EIP is at __mod_timer [kernel] 0xc0 (2.4.21-32.0.1.ELsmp/i686)
  
  EIP:    0060:[<c017eb35>]    Tainted: PF
  EIP is at d_lookup [kernel] 0x75 (2.4.21-32.0.1.ELsmp/i686)
  
  EIP:    0060:[<c013f10a>]    Tainted: PF
  EIP is at vm_account [kernel] 0x7a (2.4.21-32.0.1.ELsmp/i686)

Now, upon disassembling the 4 functions above -- in every case -- the
panic EIP address is not a legitimate instruction address.  I have
never seen this behaviour; text addresses are "fixed" in the vmlinux
file, and by definition they have to be the same on any machine that 
boots that particular kernel.

Taking the first crash, looking for c0122b60 (wake_up_cpu + 0x170), 
note that it doesn't exist as an instruction:

  crash> dis wake_up_cpu
  ...
  0xc0122b57 <wake_up_cpu+0x167>: cmp    0xffffffdc(%ebp),%ebx
  0xc0122b5a <wake_up_cpu+0x16a>: jl     0xc0122b30 <wake_up_cpu+0x140>
  0xc0122b5c <wake_up_cpu+0x16c>: jmp    0xc0122a63 <wake_up_cpu+0x73>
  0xc0122b61 <wake_up_cpu+0x171>: mov    0xffffffd8(%ebp),%ecx
  0xc0122b64 <wake_up_cpu+0x174>: mov    0xffffffec(%ebp),%eax
  ...
  crash>

And in the second crash, c0134220 (__mod_timer + 0xc0) is invalid:

  crash> dis __mod_timer
  ...
  0xc0134218 <__mod_timer+0xb8>:  xchg   %al,(%esi)
  0xc013421a <__mod_timer+0xba>:  xor    %eax,%eax
  0xc013421c <__mod_timer+0xbc>:  lock btr %eax,0x18(%edi)
  0xc0134221 <__mod_timer+0xc1>:  sbb    %eax,%eax
  0xc0134223 <__mod_timer+0xc3>:  test   %eax,%eax
  ...
  crash>

In the third crash, c017eb35 (d_lookup + 0x75) is bogus:

  crash> dis d_lookup
  ...
  0xc017eb2b <d_lookup+0x6b>:     je     0xc017ebe0 <d_lookup+0x120>
  0xc017eb31 <d_lookup+0x71>:     cmp    %ebp,0x44(%esi)
  0xc017eb34 <d_lookup+0x74>:     mov    (%ebx),%ebx
  0xc017eb36 <d_lookup+0x76>:     jne    0xc017eb20 <d_lookup+0x60>
  0xc017eb38 <d_lookup+0x78>:     mov    0x34(%esp),%edi
  ...
  crash>

And lastly, c013f10a (vm_account + 0x7a) is bogus:

  crash> dis vm_account
  ...
  0xc013f0fb <vm_account+0x6b>:   mov    0x8(%esp),%edi
  0xc013f0ff <vm_account+0x6f>:   mov    0xc(%esp),%ebp
  0xc013f103 <vm_account+0x73>:   add    $0x10,%esp
  0xc013f106 <vm_account+0x76>:   ret
  0xc013f107 <vm_account+0x77>:   mov    %esi,%eax
  0xc013f109 <vm_account+0x79>:   test   $0x81,%al
  0xc013f10b <vm_account+0x7b>:   je     0xc013f1d8 <vm_account+0x148>
  0xc013f111 <vm_account+0x81>:   mov    %esi,%eax
  ...
  crash>
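
The boundary check applied to the four listings above can be sketched as follows. The instruction start addresses and panic EIPs are copied from the quoted "dis" output and oops headers; the function name is hypothetical, and the check only covers the lines actually shown (the parts elided with "..." are not considered).

```python
import bisect

# Instruction start addresses as shown in the four 'dis' listings.
LISTED_STARTS = {
    "wake_up_cpu": [0xc0122b57, 0xc0122b5a, 0xc0122b5c, 0xc0122b61, 0xc0122b64],
    "__mod_timer": [0xc0134218, 0xc013421a, 0xc013421c, 0xc0134221, 0xc0134223],
    "d_lookup": [0xc017eb2b, 0xc017eb31, 0xc017eb34, 0xc017eb36, 0xc017eb38],
    "vm_account": [0xc013f0fb, 0xc013f0ff, 0xc013f103, 0xc013f106,
                   0xc013f107, 0xc013f109, 0xc013f10b, 0xc013f111],
}

# Panic EIPs from the four oops reports.
PANIC_EIPS = {
    "wake_up_cpu": 0xc0122b60,
    "__mod_timer": 0xc0134220,
    "d_lookup": 0xc017eb35,
    "vm_account": 0xc013f10a,
}


def locate(eip, starts):
    """Return (on_boundary, bytes_into_instruction, bytes_to_next_start)."""
    starts = sorted(starts)
    i = bisect.bisect_right(starts, eip) - 1
    if i < 0 or i + 1 >= len(starts):
        raise ValueError("EIP is outside the listed range")
    return eip == starts[i], eip - starts[i], starts[i + 1] - eip


for func, eip in PANIC_EIPS.items():
    on_boundary, into, to_next = locate(eip, LISTED_STARTS[func])
    print(func, hex(eip), on_boundary, into, to_next)
```

If the listings are transcribed correctly, none of the four EIPs is an instruction start, and each one lies exactly one byte before the next listed instruction.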

However, the crashing systems are in fact attempting to execute those
bogus EIP addresses.  

For example, the first one crashed while executing an EIP of c0122b60,
which as shown above, is within wake_up_cpu():

  crash> dis wake_up_cpu
  ...
  0xc0122b57 <wake_up_cpu+0x167>: cmp    0xffffffdc(%ebp),%ebx
  0xc0122b5a <wake_up_cpu+0x16a>: jl     0xc0122b30 <wake_up_cpu+0x140>
  0xc0122b5c <wake_up_cpu+0x16c>: jmp    0xc0122a63 <wake_up_cpu+0x73>
  0xc0122b61 <wake_up_cpu+0x171>: mov    0xffffffd8(%ebp),%ecx
  0xc0122b64 <wake_up_cpu+0x174>: mov    0xffffffec(%ebp),%eax
  ...
  crash>

Now, if I disassemble the (bogus) instruction at c0122b60, it evaluates
to this:

  crash> dis c0122b60
  0xc0122b60 <wake_up_cpu+0x170>: decl   0x458bd84d(%ebx)
  crash>

So, it would take the contents of %ebx, add 0x458bd84d to it, and
then reference that address location.  Note below that %ebx contains
a value of 2: 

  EIP is at wake_up_cpu [kernel] 0x170 (2.4.21-32.0.1.ELsmp/i686)
  eax: 00000074   ebx: 00000002   ecx: e3ff2000   edx: c0441ab8
  esi: 00000079   edi: ffffffff   ebp: e3ff3adc   esp: e3ff3aac
  ds: 0068   es: 0068   ss: 0068

so the resultant address would be 0x458bd84f, which is the bogus virtual 
address that caused the crash:

  Unable to handle kernel paging request at virtual address 458bd84f
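
The arithmetic above can be reproduced from the listing alone. The sketch below rebuilds the bytes around c0122b60 from the three listed instructions, assuming the standard IA-32 encodings (e9 + rel32 for the 5-byte jmp, 8b /r with an 8-bit displacement for the two 3-byte movs); the bytes are reconstructed, not read from a vmcore, and the variable names are hypothetical.

```python
import struct

JMP_AT = 0xc0122b5c      # jmp 0xc0122a63 <wake_up_cpu+0x73>
JMP_TARGET = 0xc0122a63
PANIC_EIP = 0xc0122b60
EBX = 0x00000002         # from the register dump in the oops

# jmp rel32: opcode e9, displacement relative to the next instruction.
rel32 = (JMP_TARGET - (JMP_AT + 5)) & 0xffffffff
jmp_bytes = b"\xe9" + struct.pack("<I", rel32)
mov_ecx = bytes([0x8b, 0x4d, 0xd8])  # mov 0xffffffd8(%ebp),%ecx
mov_eax = bytes([0x8b, 0x45, 0xec])  # mov 0xffffffec(%ebp),%eax

# Kernel text starting at JMP_AT, then the 6 bytes seen from the bogus EIP.
text = jmp_bytes + mov_ecx + mov_eax
start = PANIC_EIP - JMP_AT
shifted = text[start:start + 6]

# Decode as opcode, ModRM, 32-bit displacement.
opcode, modrm = shifted[0], shifted[1]
mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
disp32 = struct.unpack("<I", shifted[2:6])[0]
fault_address = (EBX + disp32) & 0xffffffff

print(shifted.hex(), hex(disp32), hex(fault_address))
```

Under these assumptions the shifted bytes are ff 8b 4d d8 8b 45: opcode ff with reg field 1 is a decrement, mod 2 with rm 3 selects %ebx plus a 32-bit displacement, and that displacement is 0x458bd84d. Adding the %ebx value of 2 gives 0x458bd84f, the address in the paging request message.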

There's no way that I can even begin to speculate how this could happen.
It almost would appear to be a hardware issue, especially if it can
only be reproduced on one particular machine.  However I hate to
point fingers prematurely.

There *may* be some other clues in the vmcore files for each crash,
but to upload those, you will have to file a Red Hat support ticket;
they will create an Issue Tracker for the bug and attach it to 
this bugzilla.  They will also give you directions on how to upload
the 4 vmcore files.

Again, for official Red Hat Enterprise Linux support, please log into
the Red Hat support website at http://www.redhat.com/support and file
a support ticket, or alternatively contact Red Hat Global Support 
Services at 1-888-RED-HAT1 to speak directly with a support associate
and escalate an issue.  Tell them that this bugzilla (180476) has already
been filed so that they won't create a new one.

Thanks,
  Dave Anderson

Comment 8 Ernie Petrides 2006-02-09 22:05:43 UTC

In answer to Dave's comment #7, one hypothetical scenario that could cause
such a bogus text address is an external module that incorrectly used
the kernel's timer service.  In particular, if it allowed a pending timer to
remain after the module was unloaded, then the memory formerly used for the
module code would be used to execute instructions when the timeout expired.

I suppose this is just a long shot, though.

Comment 9 Dave Anderson 2006-02-09 22:22:50 UTC

All of the 4 backtraces leading up to the crashes look normal,
i.e., they all appear to lead into the crashing function.  So
I don't see any connections with module code, which would be
vmalloc text addresses.

My best guess is that the text segment gets randomly corrupted,
and then in the act of executing the kernel text there, the EIP
gets "bumped" erroneously due to a newly-malformed "instruction",
until it actually does something that causes a crash, like 
referencing a bogusly-calculated address.

But that can only be verified by looking at the vmcore contents.

Comment 10 Christian Schnuerer 2006-02-21 08:31:49 UTC

Sorry for my late response.
The system has been stable for more than 1 week without "Dell Server 
Administrator" loaded (and with an untainted kernel).
I think there is some interaction between the ramdisk and the Dell services: 
if I recreate the ramdisk while the Dell modules are loaded and then disable 
the Dell services, the kernel is still tainted after a reboot, although the 
modules are no longer loaded. 
After a mkinitrd without the Dell modules loaded, the kernel is untainted as 
expected. I thought that mkinitrd only processes modules listed in 
modules.conf!?
However, I reinstalled the "Dell Server Administrator" last weekend (no 
further mkinitrd!) and the server has now been stable for 3 days with the 
Dell modules loaded (and a tainted kernel).
I'm sure that the ramdisk of the (stable) production machine was created 
before the installation of the Dell services.
Do you think this could be the reason for the lockups?

Comment 11 Dave Anderson 2006-02-21 13:37:24 UTC

I'm sorry, but I don't know anything about the "Dell Server Administrator",
nor how it plays with mkinitrd, and how that affects tainting.  And your
guess is as good as mine as to whether it has anything to do with the
"EIP shift".

Comment 12 RHEL Program Management 2007-10-19 18:47:44 UTC

This bug is filed against RHEL 3, which is in maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products. Since
this bug does not meet those criteria, it is now being closed.
 
For more information on the RHEL errata support policy, please visit:
http://www.redhat.com/security/updates/errata/
 
If you feel this bug is indeed mission critical, please contact your
support representative. You may be asked to provide detailed
information on how this bug is affecting you.
