From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.5) Gecko/20031007

Description of problem:
System locks up when serving web pages out of NFS or heavily accessing NFS.

Version-Release number of selected component (if applicable):
2.4.21-9.0.1.ELsmp

How reproducible:
Sometimes

Steps to Reproduce:
1. Run httpd-2.0.46-26.ent
2. Access NFS files via httpd
3. Or export files and access them via NFS

Actual Results: System locks up. Tested on 2 different machines; one machine reported a kernel panic.

Expected Results: System should not hang.

Additional info:

I have been trying to migrate our web server from redhat 7.3/apache 1.2.27 to redhat enterprise 3 as/apache 2.0.46. It has not gone well: I have run into problems where the cpu locks up on 2 different systems. The problem appears to be NFS related. Of course the problem is intermittent, and I wish I had more data. I am wondering if anyone else has seen these types of issues.

These systems are registered in rhn.redhat.com under the account name "unccs"; the machines are named lark.cs.unc.edu and dove.cs.unc.edu.

The web server that has been running redhat 7.3 is a Dell 2650 with a single P4 processor, 1GB memory and scsi disks. We have been running the SMP kernel, that is, hyperthreading is turned on. It has run well for over a year. We serve web files out of the AFS and NFS file systems, and we automount the NFS filesystems. The server has never crashed.

I loaded RHEL-AS3 on an IBM Think Centre model 8187 with a single P4 processor, 512MB memory and an ide disk. All patches are loaded on this system, which is running kernel 2.4.21-9.0.1.ELsmp. I tested the system, then moved our www.cs.unc.edu alias to the IBM system. The plan was to then upgrade the original Dell 2650 system and move our www.cs.unc.edu alias back to it.

After the IBM system started serving up web pages it locked up after about 4 hours. There was nothing on virtual console 1 and there were no error messages in /var/log/messages.
I rebooted and kept an eye on the system. The cpu load was very light: uptime load averages below 1 and cpu usage less than 5% on average. The memory usage goes up to the physical memory amount; I think apache/httpd is smart about not going into swap. The system then locked up again after a couple of hours. The load was pretty light before the lockup, so it does not appear to be resource related.

I noticed that each time it locked up, the last /var/log/messages entry was the automount daemon mounting or unmounting an NFS mount. As I mentioned, we serve web pages off several NFS servers. For example:

    Mar 14 02:41:49 dove amd[9169]: mount_nfs_fh: NFS version 3
    Mar 14 02:41:49 dove amd[9169]: mount_nfs_fh: using NFS transport tcp
    Mar 14 02:41:49 dove amd[1596]: thrush:/ mounted fstype host on /.automount/thrush/root
    Mar 14 02:42:19 dove amd[1596]: recompute_portmap: NFS version 3
    Mar 14 02:42:19 dove amd[1596]: Using MOUNT version: 3

From /var/log/httpd/access_log I can see that the web server stops serving up pages a few minutes after the automount messages, so I don't think the problem happens immediately after the automount mount or unmount. Unfortunately there are no other clues; the IBM system did not log any messages on the console.

I am also able to get the IBM to lock up intermittently if I export a filesystem to another system and then heavily access the NFS filesystem, again with no error messages. This happens very intermittently. I was able to get it to happen a couple of times, but it is difficult to make it repeatable.

I loaded the latest RHEL-AS3 on the Dell 2650 (kernel 2.4.21-9.0.1.ELsmp), hoping the problem would go away with the different hardware. The system had been running fine under redhat 7.3. I moved our www.cs.unc.edu alias back to the Dell system running rhel-as3 and apache 2.0.46. The system ran for about 60 hours, then crashed. The messages below were on the console; note that I did not have time to copy all the error codes.
I am now running this system with the single processor kernel. Again, the last log entry was an automount entry. If I run the single processor kernel on my IBM, I cannot get the system to crash by doing heavy NFS accesses.

Here are the kernel panic messages from the Dell 2650:

    cpu: 0
    EIP: 0060:[<c017afd8>] tainted: PF
    EFlags: 00010246
    EIP is adestroy_inode[kernel] 0x28 (2.4.21-9.0.1.ELsmp/i686)
    eax: 00000000 ebx: ec04a480 ecx: 00000000 edx: ec04a480
    esi: f313480 edi: ec014a480 ebp: 00000002 esp: c1f39f78
    ds: 0068 es: 0068 ss: 0068
    process kswapd (pid:7 stackpage=c1f39000)
    stack:
    (I did not copy the hex codes after each call here)
    call trace:
    prune_dcache
    shrink_dcache_memory
    do_try_to_free_pages_kswapd
    kswapd
    kswapd
    kernel_thread_helper
    code: 8b 48 04 85 ca 74 11 8a 1c 24ff 50 0f 8b 5c 28 08 83 c4 0c
    kernel panic: fatal exception
Can this crash be reproduced in a kernel that is not tainted? The oops was caused inside adestroy_inode(), which isn't even part of the Red Hat Enterprise Linux kernel as released.
As I mentioned, we run Open AFS, and this machine has the Open AFS client installed. Our web server root dir is in AFS, so it would be difficult to simulate this without AFS. This is a production server. I will continue to run the single processor kernel since we cannot run an untainted kernel. (Wish openafs came with the RedHat release.)

The kernel panic that occurred on the Dell 2650 is a bit different from the way the IBM system was locking up. I just got my IBM system to lock up again (no panic messages). You can still ping the system, and I could connect to the ssh and httpd ports, but they do not respond, so the network interface is partially working.

I got it to hang by accessing another NFS server from the IBM. I have also gotten it to hang by accessing the IBM as an NFS server, using another system as a client. I run this from the IBM to get it to hang:

    cd /net/sunhost/large_data_dir
    find . -type f -exec cat {} \; > /dev/null

The cpu load reported by uptime stays around 1 and the system is responsive; eventually it just stops responding. I know this will be difficult if not impossible to fix without more data. This system is also running an openafs/tainted kernel. Give me some time and I will test on an untainted kernel.

Thanks for your help
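Since the hang leaves nothing on the console, a variant of the find/cat loop above that records its own progress might help narrow down when reads stop. The following is only a sketch under assumptions: the function name nfs_read_stress and the log format are hypothetical and not part of the original test, and on a hard lockup the last few log lines may never reach the disk.

```shell
#!/bin/sh
# Hypothetical helper (not from the original report): read every regular
# file under DIR, discard the data, and append one syslog-style timestamped
# line per file to LOGFILE so the last read before a hang can be compared
# with the amd entries in /var/log/messages.
# Usage: nfs_read_stress DIR LOGFILE
nfs_read_stress() {
    stress_dir=$1
    stress_log=$2
    stress_count=0
    # Sketch limitation: file names containing newlines are not handled.
    find "$stress_dir" -type f | while IFS= read -r stress_file; do
        if cat "$stress_file" > /dev/null 2>&1; then
            stress_status=ok
        else
            stress_status=FAIL
        fi
        stress_count=$((stress_count + 1))
        echo "$(date '+%b %d %H:%M:%S') $stress_count $stress_status $stress_file" >> "$stress_log"
    done
}
```

It could be invoked as `nfs_read_stress /net/sunhost/large_data_dir /var/tmp/nfs_stress.log`, with the log kept on a local disk rather than on the NFS mount being tested; the log path here is just an example.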
Thanks for the update, John. We will assume that the oops and the lock-ups that you have encountered are due to an AFS bug or an incompatibility between the AFS code and the RHEL 3 kernel. If you can reproduce a problem on an untainted RHEL 3 kernel, please update this Bugzilla entry. Thanks. -ernie
Since this is a tainted kernel I am going to close this bug. FYI, I did find one issue that may help, and our server has not crashed since. The openafs /etc/init.d/afs startup script was loading the wrong openafs module for the latest kernel:

    lsmod|grep afs
    libafs-2.4.21-9.EL-i686.mp 566192 2

and the system is running:

    uname -r
    2.4.21-9.0.1.ELsmp

I fixed this and am running the smp kernel again, and the system has been stable so far.
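The mismatch above can be checked mechanically. This is a minimal sketch, assuming the module is named libafs-<kernel version>-<arch> with an optional ".mp" suffix (inferred only from the lsmod output in this report) and that the matching kernel release is the same version with an optional "smp" suffix. The function names are hypothetical and are not part of OpenAFS or its init script.

```shell
#!/bin/sh
# Hypothetical helpers; the naming scheme is an assumption taken from the
# single lsmod line quoted in this report.

# Print the kernel version embedded in a libafs module name,
# e.g. "libafs-2.4.21-9.EL-i686.mp" -> "2.4.21-9.EL".
afs_module_kver() {
    afs_name=${1#libafs-}       # drop the "libafs-" prefix
    afs_name=${afs_name%.mp}    # drop the optional ".mp" suffix
    echo "${afs_name%-*}"       # drop the trailing "-<arch>" component
}

# Succeed (exit status 0) only if the module was built for the given
# kernel release, ignoring a trailing "smp" on the release string.
# Usage: afs_module_matches KERNEL_RELEASE MODULE_NAME
afs_module_matches() {
    afs_release=${1%smp}
    [ "$(afs_module_kver "$2")" = "$afs_release" ]
}
```

With the values from this report, `afs_module_matches 2.4.21-9.0.1.ELsmp libafs-2.4.21-9.EL-i686.mp` fails, because the module was built for 2.4.21-9.EL while the running kernel is 2.4.21-9.0.1.EL. On a live system the arguments could come from `uname -r` and the first column of `lsmod | grep afs`.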