118326 – kernel hangs under nfs/apache access

Summary: kernel hangs under nfs/apache access

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Red Hat Enterprise Linux 3
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	3.0
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Ernie Petrides
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2004-03-15 16:38 UTC by John Sopko
Modified:	2007-11-30 22:07 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2004-03-19 16:24:46 UTC
Target Upstream Version:
Embargoed:

Description John Sopko 2004-03-15 16:38:55 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.5)
Gecko/20031007

Description of problem:
System locks up when serving web pages out of NFS or heavily accessing NFS.

Version-Release number of selected component (if applicable):
2.4.21-9.0.1.ELsmp

How reproducible:
Sometimes

Steps to Reproduce:
1. Run httpd-2.0.46-26.ent
2. Access NFS files via httpd
3. Alternatively, export files and access them heavily via NFS
    

Actual Results:  System locks up. Tested on 2 different machines; one
machine reported a kernel panic.

Expected Results:  System should not hang.

Additional info:

I have been trying to migrate our web server from Red Hat 7.3/Apache
1.2.27 to Red Hat Enterprise Linux 3 AS/Apache 2.0.46. It has not gone
well: I have run into problems where the CPU locks up on 2 different
systems. The problem appears to be NFS related. Of course the problem is
intermittent, and I wish I had more data. I am wondering if anyone else
has seen these types of issues.

These systems are registered in rhn.redhat.com under the account name
"unccs"; the machines are named lark.cs.unc.edu and dove.cs.unc.edu.

The web server that has been running Red Hat 7.3 is a Dell 2650 with a
single P4 processor, 1GB memory, and SCSI disks. We have been running
the SMP kernel, that is, hyperthreading is turned on. It has run well
for over a year. We serve web files out of the AFS and NFS file systems,
and we automount the NFS filesystems. The server has never crashed.

I loaded RHEL-AS3 on an IBM ThinkCentre model 8187 with a single P4
processor, 512MB memory, and an IDE disk. All patches are loaded on this
system, which is running kernel 2.4.21-9.0.1.ELsmp. I tested the system,
then moved our www.cs.unc.edu alias to the IBM system. The plan was to
then upgrade the original Dell 2650 system and move our www.cs.unc.edu
alias back to the Dell 2650 system.

After the IBM system started serving up web pages it locked up after
about 4 hours. There was nothing on virtual console 1 and there were no
error messages in /var/log/messages. I rebooted and kept an eye on the
system: the CPU load was very light, with uptime load averages below 1
and CPU usage less than 5 percent on average. The memory usage goes up
to the physical memory amount; I think apache/httpd is smart about not
going into swap. The system then locked up again after a couple of
hours. The load was pretty light before the lockup, so it does not
appear to be resource related.

I noticed each time that it locked up the last /var/log/messages
entry was the automount daemon mounting or unmounting an NFS mount. As
I mentioned, we serve web pages off several NFS servers. For example:

Mar 14 02:41:49 dove amd[9169]: mount_nfs_fh: NFS version 3
Mar 14 02:41:49 dove amd[9169]: mount_nfs_fh: using NFS transport tcp
Mar 14 02:41:49 dove amd[1596]: thrush:/ mounted fstype host on /.automount/thrush/root
Mar 14 02:42:19 dove amd[1596]: recompute_portmap: NFS version 3
Mar 14 02:42:19 dove amd[1596]: Using MOUNT version: 3

From the /var/log/httpd/access_log I can see the web server stops
serving up pages a few minutes after the automount messages, so I don't
think the problem happens immediately after the automount mount or
unmount. Unfortunately there are no other clues; the IBM system did
not log any messages on the console.
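
To line up the two logs after a lockup, a helper along the following
lines could print the last amd entry from a syslog file. This is only a
sketch: the function name last_amd_entry is hypothetical, and the match
pattern is an assumption based on the amd[PID] lines quoted above.

```shell
#!/bin/sh
# last_amd_entry LOGFILE
# Prints the last line in LOGFILE that carries an "amd[PID]: " tag.
# Hypothetical helper; the pattern is based on the log lines quoted above.
last_amd_entry() {
    grep 'amd\[[0-9][0-9]*\]: ' "$1" | tail -n 1
}

# Example (paths as named in this report):
#   last_amd_entry /var/log/messages
#   tail -n 1 /var/log/httpd/access_log
```

The two timestamps use different formats (syslog versus the httpd access
log), so the comparison still has to be done by eye.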

I am also able to get the IBM to intermittently lock up if I export a
filesystem to another system and then heavily access the NFS filesystem,
again with no error messages. This happens very intermittently, but I
was able to get it to happen a couple of times. It is difficult to make
it repeatable.

I loaded the latest RHEL-AS3 on the Dell 2650 (kernel
2.4.21-9.0.1.ELsmp), hoping the problem would go away with the
different hardware. The system had been running fine under Red Hat 7.3.

I moved our www.cs.unc.edu alias back to the Dell system running
RHEL-AS3 and Apache 2.0.46. The system ran for about 60 hours and
then crashed with a kernel panic on the console; note that
I did not have time to copy all the error codes. I am now running
this system with the single processor kernel. Again the last log entry
was an automount entry.

If I run the single processor kernel on my IBM I cannot get the system
to crash by doing heavy NFS accesses.

Here are the kernel panic messages from the Dell 2650.


cpu:    0
EIP:    0060:[<c017afd8.] tainted: PF
EFlags: 00010246

EIP is adestroy_inode[kernel] 0x28 (2.4.21-9.0.1.ELsmp/i686)

eax 00000000 ebx: ec04a480 ecx: 00000000 edx: ec04a480 esi: f313480
edi: ec014a480 ebp: 00000002 esp; c1f39f78
ds: 0068 es: 0068 ss:0068
process kswapd (pid:7 stackpage=c1f39000)
stack:


I did not copy the hex codes after each call here:

call trace: prune_dcache shrink_dcache_memory do_try_to_free_pages_kswapd
            kswapd kswapd kernel_thread_helper
 
code: 8b 48 04 85 ca 74 11 8a 1c 24ff 50 0f 8b 5c 28 08 83 c4 0c

kernel panic: fatal exception

Comment 1 Ernie Petrides 2004-03-15 19:30:10 UTC

Can this crash be reproduced in a kernel that is not tainted?

The oops was caused inside adestroy_inode(), which isn't even
part of the Red Hat Enterprise Linux kernel as released.

Comment 2 John Sopko 2004-03-15 19:51:05 UTC

As I mentioned, we run OpenAFS, and this machine has the OpenAFS client
installed. Our web server root dir is in AFS. It would be difficult
to simulate this without AFS. This is a production server. I will
continue to run the single processor kernel since we cannot run
an untainted kernel. (I wish OpenAFS came with the Red Hat release.)

The kernel panic that occurred on the Dell 2650 is a bit different than
the way the IBM system was locking up. I just got my IBM system to
lock up again (no panic messages). You can still ping the system and
I could connect to the ssh and httpd ports, but they do not respond, so
the network interface is partially working. I got it to hang by
accessing another NFS server from the IBM. I have also gotten it to
hang by accessing the IBM as an NFS server using another system as a
client. I run this to get it to hang:

From the IBM:

cd /net/sunhost/large_data_dir
find . -type f -exec cat {} \; > /dev/null

The CPU load reported by uptime stays around 1 and the system is
responsive; eventually it just stops responding.
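
Since the hang leaves no messages, a variant of the find loop above that
records each file name before reading it might at least show how far the
run got. The following is a minimal sketch under stated assumptions: the
function name read_tree and the log path in the example are
hypothetical, and file names containing newlines are not handled.

```shell
#!/bin/sh
# read_tree DIR LOG
# Reads every regular file under DIR, appending a timestamped line to LOG
# before each read. Hypothetical helper, not part of the original test.
read_tree() {
    dir=$1
    log=$2
    find "$dir" -type f | while IFS= read -r file; do
        # Record the file name first, so the log shows what was in flight.
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$file" >> "$log"
        cat "$file" > /dev/null
    done
}

# Example (the log path is an assumption; keep it off the NFS mount):
#   read_tree /net/sunhost/large_data_dir /var/tmp/read_tree.log
```

If the machine locks up hard, the last few log lines may never reach the
disk, so the log is only a rough indication.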

I know this will be difficult if not impossible to fix without more data.

This system is also running an OpenAFS/tainted kernel. Give me some
time and I will test on an untainted kernel. Thanks for your help.

Comment 3 Ernie Petrides 2004-03-16 00:11:47 UTC

Thanks for the update, John.  We will assume that the oops and
the lock-ups that you have encountered are due to an AFS bug or
an incompatibility between the AFS code and the RHEL 3 kernel.

If you can reproduce a problem on an untainted RHEL 3 kernel,
please update this Bugzilla entry.

Thanks.  -ernie

Comment 4 John Sopko 2004-03-19 16:24:46 UTC

Since this is a tainted kernel I am going to close this bug.

FYI, I did find one issue that may help, and our server has not
crashed since. The OpenAFS /etc/init.d/afs startup script was loading
the wrong OpenAFS module for the latest kernel:

lsmod|grep afs
libafs-2.4.21-9.EL-i686.mp  566192   2

and the system is running:

uname -r
2.4.21-9.0.1.ELsmp

I fixed this and am running the smp kernel and the system
has been stable so far.
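
The comparison above could be scripted roughly as follows. This is a
sketch, not the check used by the afs startup script: the function name
afs_module_matches is hypothetical, and it assumes the module name
embeds the kernel release with any trailing "smp" removed, which the
lsmod output above suggests but does not confirm.

```shell
#!/bin/sh
# afs_module_matches MODULE_NAME KERNEL_RELEASE
# Prints "match" or "mismatch"; returns 0 on match, 1 otherwise.
# Assumption: the module is named libafs-<release without "smp">-<arch>...
afs_module_matches() {
    module=$1
    release=${2%smp}    # drop a trailing "smp", if present
    case $module in
        *"-$release-"*) echo "match"; return 0 ;;
        *)              echo "mismatch"; return 1 ;;
    esac
}

# Values taken from the lsmod and uname output in this comment.
afs_module_matches "libafs-2.4.21-9.EL-i686.mp" "2.4.21-9.0.1.ELsmp" || true
```

On a live system the two arguments would come from the first column of
lsmod | grep afs and from uname -r.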
