Bug 180856 - kernel errors hanging nfsd
Summary: kernel errors hanging nfsd
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 5
Hardware: i386
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Steve Dickson
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2006-02-10 09:55 UTC by Daniele Branchini
Modified: 2007-11-30 22:11 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-09-18 08:25:59 UTC
Type: ---
Embargoed:



Description Daniele Branchini 2006-02-10 09:55:58 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; it; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1

Description of problem:
I'm running FC4 on a dual-processor server (Intel(R) Xeon(TM) CPU 3.06GHz) with two SCSI subsystems attached.
About once a week I get kernel errors such as this one:

----------- begin paste from /var/log/messages -----------

Feb  9 05:07:32 adone kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000000
Feb  9 05:07:32 adone kernel:  printing eip:
Feb  9 05:07:32 adone kernel: c0157614
Feb  9 05:07:32 adone kernel: *pde = 01e39001
Feb  9 05:07:32 adone kernel: Oops: 0000 [#1]
Feb  9 05:07:32 adone kernel: SMP
Feb  9 05:07:32 adone kernel: last sysfs file: /class/vc/vcsa2/dev
Feb  9 05:07:32 adone kernel: Modules linked in: nfs nfsd exportfs lockd nfs_acl ipv6 parport_pc lp parport autofs4 sunrpc dm_mod video button battery ac uhci_hcd i2c_i801 i2c_core e1000 floppy ext3 jbd aic7xxx scsi_transport_spi sd_mod scsi_mod
Feb  9 05:07:32 adone kernel: CPU:    0
Feb  9 05:07:32 adone kernel: EIP:    0060:[<c0157614>]    Not tainted VLI
Feb  9 05:07:32 adone kernel: EFLAGS: 00010246   (2.6.15-1.1830_FC4smp)
Feb  9 05:07:32 adone kernel: EIP is at page_address+0x6/0x91
Feb  9 05:07:32 adone kernel: eax: 00000000   ebx: 00000000   ecx: 00000008   edx: 00000008
Feb  9 05:07:32 adone kernel: esi: f594b038   edi: 00000007   ebp: f7d67600   esp: f7f71f20
Feb  9 05:07:32 adone kernel: ds: 007b   es: 007b   ss: 0068
Feb  9 05:07:32 adone kernel: Process nfsd (pid: 2290, threadinfo=f7f71000 task=f7fd5000)
Feb  9 05:07:32 adone kernel: Stack: 00001000 f594b038 00000007 f7d67600 f8c70e4f f594b000 f5958070 f7d67600
Feb  9 05:07:32 adone kernel:        f8c91198 f8c70d46 d94c2014 f8c63666 f7f71f00 0bfb14ac f8bb2f27 f7d67600
Feb  9 05:07:32 adone kernel:        f8c91198 f7d67600 f8c913d8 f7d67664 f8bb06c4 f7f71fd0 00000000 0000003d
Feb  9 05:07:32 adone kernel: Call Trace:
Feb  9 05:07:32 adone kernel:  [<f8c70e4f>] nfs3svc_decode_readargs+0x109/0x16d [nfsd]     [<f8c70d46>] nfs3svc_decode_readargs+0x0/0x16d [nfsd]
Feb  9 05:07:32 adone kernel:  [<f8c63666>] nfsd_dispatch+0x4d/0x1c7 [nfsd]     [<f8bb2f27>] svc_authenticate+0x97/0xae [sunrpc]
Feb  9 05:07:32 adone kernel:  [<f8bb06c4>] svc_process+0x3a1/0x65d [sunrpc]     [<f8c63458>] nfsd+0x184/0x345 [nfsd]
Feb  9 05:07:32 adone kernel:  [<c01040a2>] work_resched+0x5/0x16     [<f8c632d4>] nfsd+0x0/0x345 [nfsd]
Feb  9 05:07:32 adone kernel:  [<c010243d>] kernel_thread_helper+0x5/0xb
Feb  9 05:07:32 adone kernel: Code: 0d 08 f1 4a c0 85 c9 75 ec 0f 0b e2 01 12 70 34 c0 eb e2 69 c0 01 00 37 9e c1 e8 19 c1 e0 07 05 80 f9 4a c0 c3 55 57 56 53 89 c3 <8b> 00 c1 e8 1e 8b 14 85 1c f9 3f c0 8b 82 0c 12 00 00 05 80 37
Feb  9 05:07:32 adone kernel: Continuing in 120 seconds.

------------- end paste from /var/log/messages -------------

The last line is new as of kernel 2.6.15-1.1830. With previous versions (kernel-2.6.12-1.1447_FC4, kernel-2.6.13-1.1532_FC4, kernel-2.6.14-1.1637_FC4), after four errors like the one above I had to reboot the server manually to get nfsd running again. Now it seems to automatically recover in some way, since I see the errors but nfs is actually working.
I have not managed to relate these errors to any particular task or process on the server.

I apologize in advance for my lack of knowledge; I'm sorry I can't be more specific.
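
One way to start relating the errors to server activity is to pull the timestamp, the faulting symbol and the call-trace symbols out of every oops in the log, so they can be compared with cron jobs or client activity. The following is only a minimal sketch: the function name `summarize_oopses` and the regular expressions are illustrative assumptions based on the line layout of the paste above, and the sample text is a shortened copy of that paste rather than a real read of /var/log/messages.

```python
import re

# Shortened copy of the paste above; a real run would read /var/log/messages.
SAMPLE = """\
Feb  9 05:07:32 adone kernel: Oops: 0000 [#1]
Feb  9 05:07:32 adone kernel: EIP is at page_address+0x6/0x91
Feb  9 05:07:32 adone kernel: Call Trace:
Feb  9 05:07:32 adone kernel:  [<f8c70e4f>] nfs3svc_decode_readargs+0x109/0x16d [nfsd]     [<f8c70d46>] nfs3svc_decode_readargs+0x0/0x16d [nfsd]
Feb  9 05:07:32 adone kernel:  [<f8c63666>] nfsd_dispatch+0x4d/0x1c7 [nfsd]     [<f8bb2f27>] svc_authenticate+0x97/0xae [sunrpc]
"""

# Patterns are assumptions derived from the syslog format shown in the paste.
OOPS_RE = re.compile(r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: Oops:")
EIP_RE = re.compile(r"kernel: EIP is at (\S+)")
FRAME_RE = re.compile(r"\[<[0-9a-f]{8}>\] (\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def summarize_oopses(text):
    """Return one dict per oops: timestamp, faulting symbol, trace symbols."""
    oopses = []
    for line in text.splitlines():
        match = OOPS_RE.match(line)
        if match:
            # A new "Oops:" line starts a new record.
            oopses.append({"time": match.group(1), "eip": None, "trace": []})
            continue
        if not oopses:
            continue  # ignore anything before the first oops
        match = EIP_RE.search(line)
        if match:
            oopses[-1]["eip"] = match.group(1)
            continue
        # Collect every "[<addr>] symbol+0xoff/0xlen" frame on the line.
        oopses[-1]["trace"].extend(FRAME_RE.findall(line))
    return oopses


result = summarize_oopses(SAMPLE)
for oops in result:
    print(oops["time"], oops["eip"], " <- ".join(oops["trace"]))
```

Running this over several weeks of logs would show whether every oops goes through the same nfsd code path and at what times of day they occur.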


Version-Release number of selected component (if applicable):
kernel-2.6.15-1.1830_FC4smp

How reproducible:
Sometimes

Steps to Reproduce:
1. Install FC4 on a dual-processor machine (Intel(R) Xeon(TM) CPU 3.06GHz) with two SCSI subsystems
2. Export a filesystem over NFS
3. Wait a couple of days
  

Additional info:

I don't know if it's useful...
[root@adone ~]# cat /etc/exports
/terabox svradar.metarpa(ro,async,no_subtree_check) 
/ambox sibilla.metarpa(ro,async,no_subtree_check) 
[root@adone ~]# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda2             18930940   8979636   8974152  51% /
/dev/sda1               101086     35410     60457  37% /boot
/dev/shm                515604         0    515604   0% /dev/shm
/dev/sda4            1707407264 1106936560 513739384  69% /systera
/dev/sdb1            1730598456 831203484 811485688  51% /terabox
[root@adone ~]# lsmod
Module                  Size  Used by
nfs                   217001  0
nfsd                  231377  15
exportfs               10305  1 nfsd
lockd                  64585  3 nfs,nfsd
nfs_acl                 7873  2 nfs,nfsd
ipv6                  273825  86
parport_pc             31877  1
lp                     16905  0
parport                39561  2 parport_pc,lp
autofs4                23621  2
sunrpc                150397  18 nfs,nfsd,lockd,nfs_acl
dm_mod                 61273  0
video                  20165  0
button                 10705  0
battery                13509  0
ac                      8901  0
uhci_hcd               37073  0
i2c_i801               13005  0
i2c_core               25793  1 i2c_i801
e1000                 111917  0
floppy                 66181  0
ext3                  135241  4
jbd                    62037  1 ext3
aic7xxx               154229  5
scsi_transport_spi     25153  1 aic7xxx
sd_mod                 23105  7
scsi_mod              139497  3 aic7xxx,scsi_transport_spi,sd_mod

Comment 1 Daniele Branchini 2006-02-13 10:35:50 UTC
>Now it seems to automatically recover in some way,
>since I see the errors but nfs is actually working.

This was untrue. After two more errors like the one above, I have this situation:

[root@adone ~]# service nfs status
Shutting down NFS mountd: rpc.mountd (pid 31571) is running...
nfsd is stopped
rpc.rquotad (pid 31568) is running...
[root@adone ~]# service nfs restart
Shutting down NFS mountd:                                  [  OK  ]
Shutting down NFS daemon:                                  [FAILED]
Shutting down NFS quotas:                                  [  OK  ]
Shutting down NFS services:                                [  OK  ]
Starting NFS services:                                     [  OK  ]
Starting NFS quotas:                                       [  OK  ]
Starting NFS daemon:                                       [  OK  ]
Starting NFS mountd:                                       [  OK  ]
[root@adone ~]# tail /var/log/messages
Feb 13 11:21:19 adone rpc.mountd: Caught signal 15, un-registering and exiting.
Feb 13 11:21:20 adone nfsd[31569]: nfssvc: Setting version failed: errno 16 (Device or resource busy)
Feb 13 11:21:20 adone rpc.idmapd: nfsdreopen: Opening '' failed: errno 2 (No such file or directory)
[root@adone ~]# service nfs status
Shutting down NFS mountd: rpc.mountd (pid 31571) is running...
nfsd is stopped
rpc.rquotad (pid 31568) is running...
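
For readers matching the numeric codes in the log lines above to their symbolic names, here is a small sketch using Python's standard `errno` and `os` modules. The helper name `describe_errno` is an illustrative assumption. The snippet only names the codes as the local platform defines them; it does not explain why nfsd and rpc.idmapd returned them.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno on this platform."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# The two codes reported in the log above.
for code in (16, 2):
    name, message = describe_errno(code)
    print(f"errno {code}: {name} ({message})")
```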


Comment 2 Daniele Branchini 2006-05-23 08:27:05 UTC
- Still having problems under 2.6.16-1.2069
- The errors happen when copying a relatively large amount of data (about 500 MB)
- The partition I'm exporting is 1.6 TB, ext3. Everybody told me it is a
dumb thing to use ext3 for such a huge partition, but I can't change the
filesystem until I get another subsystem for backup. Could my problems be
related to this?


Comment 4 Stephen Tweedie 2006-07-25 09:53:49 UTC
ext3 on FC4 should be safe up to 8TB, and has been tested on such systems, so I
have no reason to think that any NFS errors are related to the large ext3 fs.
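
To put the sizes in perspective, the arithmetic can be sketched from the `df` output in the description. This is only a rough check under stated assumptions: the "8TB" above is read as 8 TiB, the block counts are the 1K-block figures reported by `df`, and the names `FILESYSTEMS_KIB`, `ASSUMED_LIMIT_BYTES` and `size_report` are illustrative.

```python
KIB = 1024
TIB = 1024 ** 4

# 1K-block counts taken from the df output in the description.
FILESYSTEMS_KIB = {"/systera": 1707407264, "/terabox": 1730598456}

# Assumption: the "8TB" figure is interpreted as 8 TiB.
ASSUMED_LIMIT_BYTES = 8 * TIB


def size_report(blocks_kib, limit_bytes=ASSUMED_LIMIT_BYTES):
    """Return (size in TiB, fraction of the assumed limit) for a filesystem."""
    size_bytes = blocks_kib * KIB
    return size_bytes / TIB, size_bytes / limit_bytes


for mount, blocks in FILESYSTEMS_KIB.items():
    tib, fraction = size_report(blocks)
    print(f"{mount}: {tib:.2f} TiB, {fraction:.0%} of the assumed limit")
```

Under these assumptions both filesystems come out well below the stated limit, which is consistent with the comment above.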

Comment 6 Dave Jones 2006-09-17 02:26:53 UTC
[This comment added as part of a mass-update to all open FC4 kernel bugs]

FC4 has now transitioned to the Fedora legacy project, which will continue to
release security related updates for the kernel.  As this bug is not security
related, it is unlikely to be fixed in an update for FC4, and has been migrated
to FC5.

Please retest with Fedora Core 5.

Thank you.

Comment 7 Daniele Branchini 2006-09-18 08:25:59 UTC
Eventually I found out that the problem could be traced to an old Alpha Tru64
machine that was mounting the NFS partition.
I badly needed to keep the server going, so I just removed that client from
/etc/exports.

At this point I can't tell whether the problem was due to the architecture of
that particular NFS client, to its network setup (a little bit messy), or
whether the whole thing was actually a kernel bug.

I'm changing this bug to "WORKSFORME", since I'm unable to do further
investigations...

Thank you very much Stephen for the information about large ext3 fs. Some dumb
colleague told me that it was the main cause of my problems.

