Please note that we are running 7.1sbe from Dell, not just 7.1.

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.4) Gecko/20011126 Netscape6/6.2.1

Description of problem:
We have had continual problems, averaging about 1/2 times a week, with our main NFS server.

Version-Release number of selected component (if applicable):

How reproducible:
Couldn't Reproduce

Additional info:
We're running the latest kernel, 2.4.9-31. Here is some output from one of the system logs:

Mar 17 04:22:34 sgp-nfs kernel: <1>Unable to handle kernel NULL pointer dereference at virtual address 00000010
Mar 17 04:22:34 sgp-nfs kernel: printing eip:
Mar 17 04:22:34 sgp-nfs kernel: c0145d93
Mar 17 04:22:34 sgp-nfs kernel: *pde = 00000000
Mar 17 04:22:34 sgp-nfs kernel: Oops: 0000
Mar 17 04:22:34 sgp-nfs kernel: Kernel 2.4.9-31
Mar 17 04:22:34 sgp-nfs kernel: CPU: 0
Mar 17 04:22:34 sgp-nfs kernel: EIP: 0010:[posix_locks_deadlock+67/96] Not tainted
Mar 17 04:22:34 sgp-nfs kernel: EIP: 0010:[<c0145d93>] Not tainted
Mar 17 04:22:34 sgp-nfs kernel: EFLAGS: 00010207
Mar 17 04:22:34 sgp-nfs kernel: EIP is at posix_locks_deadlock [kernel] 0x43
Mar 17 04:22:34 sgp-nfs kernel: eax: fffffffc ebx: d4ca7240 ecx: 000003f9 edx: 00000000
Mar 17 04:22:34 sgp-nfs kernel: esi: 000003f0 edi: c5180d80 ebp: 00000001 esp: cdd7feec
Mar 17 04:22:34 sgp-nfs kernel: ds: 0018 es: 0018 ss: 0018
Mar 17 04:22:34 sgp-nfs kernel: Process smbd (pid: 1008, stackpage=cdd7f000)
Mar 17 04:22:34 sgp-nfs kernel: Stack: d41f8344 ffffffdd d4429e40 c0146174 d41f8680 d41f8344 0000000b 2a1f06dd
Mar 17 04:22:34 sgp-nfs kernel: c79361d0 cf316000 00000000 ffffffeb 00000000 00000000 d41f83fc d41f828c
Mar 17 04:22:34 sgp-nfs kernel: 00000000 00000006 00000001 00000000 c19ebd20 bfffda8c cdd7ff88 d41f8680
Mar 17 04:22:34 sgp-nfs kernel: Call Trace: [posix_lock_file+180/1376] posix_lock_file [kernel] 0xb4
Mar 17 04:22:34 sgp-nfs kernel: Call Trace: [<c0146174>] posix_lock_file [kernel] 0xb4
Mar 17 04:22:34 sgp-nfs kernel: [fcntl_setlk64+324/464] fcntl_setlk64 [kernel] 0x144
Mar 17 04:22:34 sgp-nfs kernel: [<c0147304>] fcntl_setlk64 [kernel] 0x144
Mar 17 04:22:34 sgp-nfs kernel: [filp_open+77/96] filp_open [kernel] 0x4d
Mar 17 04:22:34 sgp-nfs kernel: [<c0135fcd>] filp_open [kernel] 0x4d
Mar 17 04:22:34 sgp-nfs kernel: [getname+94/160] getname [kernel] 0x5e
Mar 17 04:22:34 sgp-nfs kernel: [<c013f9ce>] getname [kernel] 0x5e
Mar 17 04:22:34 sgp-nfs kernel: [sys_fcntl64+109/160] sys_fcntl64 [kernel] 0x6d
Mar 17 04:22:34 sgp-nfs kernel: [<c01435fd>] sys_fcntl64 [kernel] 0x6d
Mar 17 04:22:34 sgp-nfs kernel: [system_call+51/56] system_call [kernel] 0x33
Mar 17 04:22:34 sgp-nfs kernel: [<c0106f3b>] system_call [kernel] 0x33
Mar 17 04:22:34 sgp-nfs kernel:
Mar 17 04:22:34 sgp-nfs kernel:
Mar 17 04:22:34 sgp-nfs kernel: Code: 39 58 14 75 05 39 48 18 74 d9 8b 12 81 fa b0 3b 2b c0 75 e9
We are wondering if this has anything to do with the fact that we have Apache files on the NFS server.

03/25/2002 Another occurrence. Here is another copy of /var/log/messages:

Mar 25 02:08:00 sgp-nfs kernel: Unable to handle kernel paging request at virtual address 652e3642
Mar 25 02:08:00 sgp-nfs kernel: printing eip:
Mar 25 02:08:00 sgp-nfs kernel: c0145d93
Mar 25 02:08:00 sgp-nfs kernel: *pde = 00000000
Mar 25 02:08:00 sgp-nfs kernel: Oops: 0000
Mar 25 02:08:00 sgp-nfs kernel: Kernel 2.4.9-31
Mar 25 02:08:00 sgp-nfs kernel: CPU: 0
Mar 25 02:08:00 sgp-nfs kernel: EIP: 0010:[posix_locks_deadlock+67/96] Not tainted
Mar 25 02:08:00 sgp-nfs kernel: EIP: 0010:[<c0145d93>] Not tainted
Mar 25 02:08:00 sgp-nfs kernel: EFLAGS: 00010a87
Mar 25 02:08:00 sgp-nfs kernel: EIP is at posix_locks_deadlock [kernel] 0x43
Mar 25 02:08:00 sgp-nfs kernel: eax: 652e362e ebx: d1adfe40 ecx: 00003d0e edx: 652e3632
Mar 25 02:08:00 sgp-nfs kernel: esi: 00003d0f edi: d1adfe40 ebp: d21e91d4 esp: d2883f30
Mar 25 02:08:00 sgp-nfs kernel: ds: 0018 es: 0018 ss: 0018
Mar 25 02:08:00 sgp-nfs kernel: Process lockd (pid: 839, stackpage=d2883000)
Mar 25 02:08:00 sgp-nfs kernel: Stack: d3d9d600 d3d9dee8 d3d9dee8 e095f487 d3d9d600 d21e91d4 c3f63000 e0963b7a
Mar 25 02:08:00 sgp-nfs kernel: dfa37e00 d2883f5c d3d9d5b4 d3d9dea0 d3d9d2a0 dfa37e00 e096960c e0963d47
Mar 25 02:08:00 sgp-nfs kernel: dfa37e00 d3d9dea0 d3d9d5ac 00000001 d3d9d5a0 d1adfe40 d3d9dea0 dfa37f38
Mar 25 02:08:00 sgp-nfs kernel: Call Trace: [eepro100:__insmod_eepro100_S.bss_L16+405383/72048169] lockd_down_Ra7b91a7b [lockd] 0x767
Mar 25 02:08:00 sgp-nfs kernel: Call Trace: [<e095f487>] lockd_down_Ra7b91a7b [lockd] 0x767
Mar 25 02:08:00 sgp-nfs kernel: [eepro100:__insmod_eepro100_S.bss_L16+423546/72030006] nlmsvc_invalidate_client_Rb1c3f825 [lockd] 0x277a
Mar 25 02:08:00 sgp-nfs kernel: [<e0963b7a>] nlmsvc_invalidate_client_Rb1c3f825 [lockd] 0x277a
Mar 25 02:08:00 sgp-nfs kernel: [eepro100:__insmod_eepro100_S.bss_L16+446732/72006820] __insmod_lockd_S.data_L2956 [lockd] 0x8cc
Mar 25 02:08:00 sgp-nfs kernel: [<e096960c>] __insmod_lockd_S.data_L2956 [lockd] 0x8cc
Mar 25 02:08:00 sgp-nfs kernel: [eepro100:__insmod_eepro100_S.bss_L16+424007/72029545] nlmsvc_invalidate_client_Rb1c3f825 [lockd] 0x2947
Mar 25 02:08:00 sgp-nfs kernel: [<e0963d47>] nlmsvc_invalidate_client_Rb1c3f825 [lockd] 0x2947
Mar 25 02:08:00 sgp-nfs kernel: [eepro100:__insmod_eepro100_S.bss_L16+282487/72171065] svc_process_R7eb1336f [sunrpc] 0x2d7
Mar 25 02:08:00 sgp-nfs kernel: [<e0941477>] svc_process_R7eb1336f [sunrpc] 0x2d7
Mar 25 02:08:00 sgp-nfs kernel: [eepro100:__insmod_eepro100_S.bss_L16+444600/72008952] __insmod_lockd_S.data_L2956 [lockd] 0x78
Mar 25 02:08:00 sgp-nfs kernel: [<e0968db8>] __insmod_lockd_S.data_L2956 [lockd] 0x78
Mar 25 02:08:00 sgp-nfs kernel: [eepro100:__insmod_eepro100_S.bss_L16+444636/72008916] __insmod_lockd_S.data_L2956 [lockd] 0x9c
Mar 25 02:08:00 sgp-nfs kernel: [<e0968ddc>] __insmod_lockd_S.data_L2956 [lockd] 0x9c
Mar 25 02:08:00 sgp-nfs kernel: [eepro100:__insmod_eepro100_S.bss_L16+403030/72050522] nlmclnt_proc_Rd9df9c43 [lockd] 0x16d6
Mar 25 02:08:00 sgp-nfs kernel: [<e095eb56>] nlmclnt_proc_Rd9df9c43 [lockd] 0x16d6
Mar 25 02:08:00 sgp-nfs kernel: [kernel_thread+38/48] kernel_thread [kernel] 0x26
Mar 25 02:08:00 sgp-nfs kernel: [<c0105726>] kernel_thread [kernel] 0x26
Mar 25 02:08:00 sgp-nfs kernel: [eepro100:__insmod_eepro100_S.bss_L16+402592/72050960] nlmclnt_proc_Rd9df9c43 [lockd] 0x1520
Mar 25 02:08:00 sgp-nfs kernel: [<e095e9a0>] nlmclnt_proc_Rd9df9c43 [lockd] 0x1520
Mar 25 02:08:00 sgp-nfs kernel:
Mar 25 02:08:00 sgp-nfs kernel:
Mar 25 02:08:00 sgp-nfs kernel: Code: 39 58 14 75 05 39 48 18 74 d9 8b 12 81 fa b0 3b 2b c0 75 e9
Weird question: the filesystem in question isn't vfat, is it?
03/26/2002 One contributing factor may be that this server is our NFS server and shares the config files for Apache version 1.3.14 or 1.3.19 (I've checked with a sys. programmer and he wasn't sure either) with a Sun Ultra 10 running Solaris 7. The system does continue sharing files to other clients, however. But for some reason it doesn't appear to re-validate the share for the config files for the Ultra 10.
03/26/2002 No, it is ext2, I just checked /etc/fstab. Trent
OK, this is the 2.4.3-6 kernel? If so, then upgrading to 2.4.9-31 might be worth a shot. Quite a few NFS problems have been fixed since 2.4.3-6.
We have the latest kernel (2.4.9-31) installed on the system. That was one of the other fixes that we've been given, along with changing the kernel module from eepro100 to epro100 (all that did was lock the system up even faster than before, so we went back to eepro100). So far, none of the supposed "fixes" have fixed the problem.
2002/04/08 Just checking status. Nothing new. So, does Red Hat have any other suggestions? Trent Doyle
Please add the line

    insmod_opt=-S

to /etc/modules.conf. That will cause things like __insmod_lockd_S.data_L2956 to get reasonable names. Then, in /etc/sysconfig/syslog, also add a "-x" to KLOGD_OPTIONS so that klogd doesn't confuse the log messages. That should help us get better debugging output.
Are you running samba? What about fam?

    chkconfig --list samba
    chkconfig --list sgi_fam
Samba does run on this server, but does not appear to be a factor. The samba traffic is very light and only once per hour. Other NFS machines that we have will also show this crash, and they do not run samba.
Created attachment 52744: Ksymoops Output from a crash
You might want to try the following patch against fs/locks.c. This might be the source of the problem; I've sent a couple of questions to the maintainer of the code about this.

Index: locks.c
===================================================================
RCS file: /bcrl/cvs/CVSROOT/net-aio/linux/fs/locks.c,v
retrieving revision 1.1.1.1
diff -u -u -r1.1.1.1 locks.c
--- locks.c 2 Apr 2002 23:47:24 -0000 1.1.1.1
+++ locks.c 8 Apr 2002 20:46:00 -0000
@@ -440,7 +440,7 @@
 	while (!list_empty(&blocker->fl_block)) {
 		struct file_lock *waiter = list_entry(blocker->fl_block.next, struct file_lock, fl_block);
-		if (wait) {
+		if (0) {
 			locks_notify_blocked(waiter);
 			/* Let the blocked process remove waiter from the
 			 * block list when it gets scheduled.
I have made the change to the locks.c file, but how do I recompile it into the nfs module? The 2.4.9-31 kernel was installed via rpm on this machine, I have the src rpm installed, but the kernel has never been compiled on it.
Created attachment 54244: Latest oops output processed thru Ksymoops
Just added another oops output. Still waiting for instructions on how to compile the suggested patch into a module.
Who is your TAO?
I am not sure what you mean by TAO. Technical Contact would be me.
TAO would be your service contract contact at Red Hat. Do you have a service contract?
We do not have a service contract. We did pay for 1 problem resolution, but since they could not fix it they told us to put the problem in here. They offered to give us the money back on the resolution, but we chose to just keep the resolution for any later problems. Do you have the standard .config file that was used for the 2.4.9-31 kernel? That is all I need to recompile.
The .config is included with the kernel source rpm.
To be more specific, the .config files are kept in the configs subdirectory below the kernel source, one for each kernel that is built.
I feel stupid. I found them right after I posted that message. I now have a newly compiled kernel, and I will be rebooting the server into it shortly. Any ideas on how to stress test it for this recurring error?
Created attachment 56164: Ksymoops processed oops from 5/1
Created attachment 56165: Ksymoops processed oops output, 5/1 - crash right after reboot.
Has a way been found to cause this error for testing purposes? Is there anything else I can provide to help solve this error? This is a production file server with over 400 GB of storage that cannot be down for any extended period. I will try to compile bcrl's suggestion again and see if I can get the machine to boot with this change.
We don't have a way of testing it here, no; it's not happening to us. Without a reproducer like that, we don't have much choice.
Had another lockd process go south. Preliminary bug info as follows from /var/log/messages:

May 13 16:40:09 sgp-nfs rpc.mountd: authenticated mount request from jester1.sgp.arm.gov:739 for /files0/SunOS5.7/apps/web (/files0)
May 13 16:40:22 sgp-nfs rpc.mountd: authenticated unmount request from r1.sgp.arm.gov:750 for /files0/res/apps/res (/files0)
May 13 16:40:22 sgp-nfs rpc.mountd: authenticated unmount request from r1.sgp.arm.gov:750 for /files0/res/apps/cse (/files0)
May 13 16:40:22 sgp-nfs rpc.mountd: authenticated unmount request from r1.sgp.arm.gov:750 for /files0/res/home/sds (/files0)
May 13 16:40:22 sgp-nfs rpc.mountd: authenticated unmount request from r1.sgp.arm.gov:750 for /files0/res/home/sgpdq (/files0)
May 13 16:40:23 sgp-nfs rpc.mountd: authenticated mount request from jester1.sgp.arm.gov:739 for /files0/SunOS5.7/data/collection (/files0)
May 13 16:40:27 sgp-nfs kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000010
May 13 16:40:27 sgp-nfs kernel: printing eip:
May 13 16:40:27 sgp-nfs kernel: c0145d93
May 13 16:40:27 sgp-nfs kernel: *pde = 00000000
May 13 16:40:27 sgp-nfs kernel: Oops: 0000
May 13 16:40:27 sgp-nfs kernel: Kernel 2.4.9-31
May 13 16:40:27 sgp-nfs kernel: CPU: 0
May 13 16:40:27 sgp-nfs kernel: EIP: 0010:[<c0145d93>] Not tainted
May 13 16:40:27 sgp-nfs kernel: EFLAGS: 00010207
May 13 16:40:27 sgp-nfs kernel: EIP is at posix_locks_deadlock [kernel] 0x43
May 13 16:40:27 sgp-nfs kernel: eax: fffffffc ebx: c37c28c0 ecx: 00006afe edx: 00000000
May 13 16:40:27 sgp-nfs kernel: esi: 00004ad7 edi: c37c28c0 ebp: df525230 esp: d1e57f30
May 13 16:40:27 sgp-nfs kernel: ds: 0018 es: 0018 ss: 0018
May 13 16:40:27 sgp-nfs kernel: Process lockd (pid: 832, stackpage=d1e57000)
May 13 16:40:27 sgp-nfs kernel: Stack: c6167f00 d4225bc8 d4225bc8 e0942487 c6167f00 df525230 c2373c00 e0946b7a
May 13 16:40:27 sgp-nfs kernel: dfa37e00 d1e57f5c c6167eb4 d4225b80 c61673a0 dfa37e00 e094e67c e0946d47
May 13 16:40:27 sgp-nfs kernel: dfa37e00 d4225b80 c6167eac 00000001 c6167ea0 c37c28c0 d4225b80 dfa37f38
May 13 16:40:27 sgp-nfs kernel: Call Trace: [<e0942487>] nlmsvc_lock [lockd] 0x1d7
May 13 16:40:27 sgp-nfs kernel: [<e0946b7a>] nlm4svc_retrieve_args [lockd] 0xaa
May 13 16:40:27 sgp-nfs kernel: [<e094e67c>] nlmsvc_procedures4 [lockd] 0x40
May 13 16:40:27 sgp-nfs kernel: [<e0946d47>] nlm4svc_proc_lock [lockd] 0x97
May 13 16:40:27 sgp-nfs kernel: [<e0968477>] svc_process_R7eb1336f [sunrpc] 0x2d7
May 13 16:40:27 sgp-nfs kernel: [<e094de28>] nlmsvc_version4 [lockd] 0x0
May 13 16:40:27 sgp-nfs kernel: [<e094de4c>] nlmsvc_program [lockd] 0x0
May 13 16:40:27 sgp-nfs kernel: [<e0941b56>] lockd [lockd] 0x1b6
May 13 16:40:27 sgp-nfs kernel: [<c0105726>] kernel_thread [kernel] 0x26
May 13 16:40:27 sgp-nfs kernel: [<e09419a0>] lockd [lockd] 0x0
May 13 16:40:27 sgp-nfs kernel:
May 13 16:40:27 sgp-nfs kernel:
May 13 16:40:27 sgp-nfs kernel: Code: 39 58 14 75 05 39 48 18 74 d9 8b 12 81 fa b0 3b 2b c0 75 e9
Can you tell from the Oops outputs what is causing this problem? If we can figure out what is causing the lockd to crash, maybe a program can be written that will test for the problem, then we can start working on a fix.
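One possible starting point for such a test program, offered only as a sketch: every oops in this report is in posix_locks_deadlock, reached once through fcntl_setlk64 (smbd) and otherwise through lockd, so a test that has several processes take blocking, overlapping byte-range locks on one file at least exercises the same lock path. Nothing here is known to trigger the crash. The names hammer_locks and make_scratch_file, the offsets, and the worker counts are all made up for illustration.

```python
import fcntl
import os
import random
import tempfile
import time


def make_scratch_file(directory=None, size=8192):
    """Create a zero-filled scratch file to lock; returns its path.

    Pass a directory on an NFS mount to send the lock requests through
    lockd (assumption: that is the path of interest on a test system).
    """
    fd, path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as fh:
        fh.write(b"\0" * size)
    return path


def _worker(path, iterations, seed):
    """Repeatedly take and release a blocking exclusive byte-range lock."""
    rng = random.Random(seed)
    with open(path, "r+b") as fh:
        for _ in range(iterations):
            start = rng.randrange(0, 4096)
            length = rng.randrange(1, 512)
            # Blocking request: if it conflicts with another process's
            # lock, the caller sleeps until the range is free.
            fcntl.lockf(fh, fcntl.LOCK_EX, length, start, os.SEEK_SET)
            time.sleep(0.001)  # hold briefly so ranges actually collide
            fcntl.lockf(fh, fcntl.LOCK_UN, length, start, os.SEEK_SET)


def hammer_locks(path, workers=4, iterations=200):
    """Fork `workers` children that contend for locks on `path`.

    Returns the number of children that exited with an error.
    """
    pids = []
    for seed in range(workers):
        pid = os.fork()
        if pid == 0:  # child process
            try:
                _worker(path, iterations, seed)
            except BaseException:
                os._exit(1)
            os._exit(0)
        pids.append(pid)
    failures = 0
    for pid in pids:
        _, status = os.waitpid(pid, 0)
        if status != 0:
            failures += 1
    return failures
```

On a test machine this would be run as something like hammer_locks(make_scratch_file("/some/nfs/mount")), ideally from several clients at once, with the mount point being whatever the test setup uses. Each worker holds only one lock at a time, so the test itself should never deadlock; a nonzero return value only means a worker hit an error, not that the kernel bug was reproduced.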
Please test either the 2.4.9 kernel included with AS 2.1 or the 2.4.18-4 errata kernel for 7.3.
This is a production NFS server. I can't upgrade the kernel without testing it first. I have a test system, but I can't replicate the error easily. That is why I am asking about a test program. I can find the 2.4.18-3 kernel on Red Hat's ftp server; do I need to get the 2.4.9 or 2.4.18-4 from elsewhere?
Is http://rhn.redhat.com/errata/RHBA-2002-085.html not an accurate description of where to obtain the 2.4.18-4 update?
I am getting the newer kernel now, but this still does me no good without a way to force the crash. This server currently runs the 2.4.9-31 kernel, but not the one from AS 2.1. I will search out that kernel too and get it downloaded. But all of this is a moot point if I can't find a way of duplicating the crash.
Well, your setup is the only one hitting this problem, and you've been unable to provide a reproducer, nor have you tested the suggested patch, so there's not terribly much that I can do other than point at new kernels that may have fixed the problem.
I have submitted all these processed oops reports in the hope of getting help in creating a reproducer. I do not understand what the oops is saying, other than that it is a problem with lockd. As I have said before, this is a PRODUCTION NFS server, and I cannot just install a patch to see if it works. When this crash happens, it takes the server down for more than 30 minutes at a time. Can anyone there look at the oops outputs and even suggest what could be causing this, or suggest something that can be used to cause this problem? You asked for changes to the modules.conf and the syslog.conf for better debugging messages, and I have given them to you. Surely someone around there can give a better answer than "It is your problem, you solve it".
We think we fixed it, based on the output you provided. We have offered you tested kernels with the candidate fix in them. Since you can't provide us with a reproducer, we can't tell for sure unless you are willing to test. If you aren't willing to test a candidate fix, then there's not much we can do. The fact that you cannot, by your own institutional rules, deploy our candidate fix without writing a test program does not make it our job to write the test program. I don't know that a test program can be written. Since you aren't willing to test, we'll just close this one as fixed in the current release, since we have made a change that we expect has fixed the problem. When you have upgraded to the current release, or at least the kernel from the current release, if this still occurs, you can feel free to re-open the bug report.
I have installed the 2.4.18-4 kernel on a test NFS machine. But since no one there is willing to even give me a glimpse of what might have been causing the error, I have no way of writing a reproducer either. I have sent all those oops outputs so that someone there can at least help me narrow down what is causing this. I understand that we are the only ones experiencing this problem, and I do not expect you to write a test program. But what I have been asking for is for someone to look at the output and see if there are any clues as to what could be causing this crash. Does the oops output help you narrow it in any way? I have my suspicions as to what is causing this crash, but I am not sure. If anyone has any clues from the oops outputs that they want to share, I am listening.
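One clue that can be read straight off the posted oopses, offered as a hedged reading rather than a diagnosis: in all three, the faulting address equals eax plus 0x14. That matches the leading "39 58 14" bytes of the Code line if they are decoded as a compare against memory at offset 0x14 from eax (that decoding is an assumption made here, not something stated in this report). The sketch below only checks the arithmetic using values copied from the logs above; the helper names are made up.

```python
import struct

MASK32 = 0xFFFFFFFF  # registers on this i686 kernel are 32 bits wide

# Assumption: displacement taken from the leading "39 58 14" code bytes.
DISPLACEMENT = 0x14

# (eax, reported faulting virtual address), copied from the three oopses.
OOPSES = {
    "Mar 17": (0xFFFFFFFC, 0x00000010),
    "Mar 25": (0x652E362E, 0x652E3642),
    "May 13": (0xFFFFFFFC, 0x00000010),
}


def faulting_address(eax, disp=DISPLACEMENT):
    """Address touched by a memory operand at disp(%eax), with 32-bit wrap."""
    return (eax + disp) & MASK32


def check_all():
    """Map each oops date to whether eax + 0x14 matches the logged address."""
    return {
        when: faulting_address(eax) == fault
        for when, (eax, fault) in OOPSES.items()
    }


def as_ascii(value):
    """Render a 32-bit value as the 4 bytes it occupies in x86 memory."""
    return struct.pack("<I", value).decode("ascii", errors="replace")


if __name__ == "__main__":
    print(check_all())
    print(repr(as_ascii(0x652E362E)))  # prints '.6.e'
```

All three checks come out true. If the decoding assumption holds, the two NULL-pointer cases (eax = fffffffc, edx = 0) and the Mar 25 case (eax = 652e362e, which is printable text when viewed as bytes) would both be consistent with posix_locks_deadlock following a bad pointer in one of its lock lists, but confirming that needs someone who can check it against the 2.4.9-31 source.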