+++ This bug was initially created as a clone of Bug #225515 +++

Description of problem:
If /proc/sys/sunrpc/nfs_debug is set to 65535, closing many open files on NFSv4 usually triggers a kernel panic (kernel NULL pointer dereference).

Kernel in RHEL5Beta2.

How reproducible:
Usually

Steps to Reproduce:
1. Mount: mount -t nfs4 192.168.0.21:/ /mnt
2. A process creates and opens 1024 files on NFSv4.
3. The process writes a test message to each file.
4. The process closes all the open files and exits.
5. Immediately after the process exits, the test unmounts the NFSv4 file system.

Actual results:
In the normal case I have not seen the problem in this test, but when /proc/sys/sunrpc/nfs_debug is set to debug mode (65535), the panic usually happens.

Additional info:
The kernel panic log is as follows:

NFS: nfs_update_inode(0:18/2049980 ct=1 info=0xe)
BUG: unable to handle kernel NULL pointer dereference at virtual address 0000000c
 printing eip:
deb82605
*pde = 001f3067
Oops: 0000 [#1]
SMP
last sysfs file: /block/hda/removable
Modules linked in: nfs fscache nfsd ...
CPU: 0
EIP: 0060:[<deb82605>] Not tainted VLI
EFLAGS: 00010246 (2.6.18-1.2747.el5 #1)
EIP is at nfs_update_inode+0xb0/0x692 [nfs]
...skip...
Process rpciod/0 (pid: 1865, ti=dd3ac000 task=ddd47550 task.ti=dd3ac000)
Stack: deba0609 ...
Call Trace:
 [<deb82c1f>] nfs_refresh_inode+0x38/0x1b0 [nfs]
 [<dea9f602>] rpc_exit_task+0x1e/0x6c [sunrpc]
 [<dea9f314>] __rpc_execute+0x82/0x1b3 [sunrpc]
 [<c0433899>] run_workqueue+0x83/0xc5
 [<c0434171>] worker_thread+0xd9/0x10c
 [<c0436620>] kthread+0xc0/0xec
 [<c0404d63>] kernel_thread_helper+0x7/0x10
DWARF2 unwinder stuck at kernel_thread_helper+0x7/0x10

I investigated the problem and found that the cause was a NULL i_sb->s_root pointer in nfs_update_inode() at the time of the panic.

In the kernel, nfs_file_release() is invoked at the end of the file close operation.
I found that the kernel's call sequence is as follows:

nfs_file_release()
 |-- NFS_PROTO(inode)->file_release()
 |     nfs_file_clear_open_context()
 |       put_nfs_open_context()
 |         |-- nfs4_close_state
 |         |     |-- nfs4_close_ops
 |         |     |     nfs4_do_close()
 |         |     |       nfs_update_inode()
 |         |     |         |-- inode == inode->i_sb->s_root->d_inode
 |         |-- mntput(ctx_vfsmnt)
 |               atomic_dec_and_lock(&mnt->mnt_count, &vfsmount_lock)

After put_nfs_open_context() starts the asynchronous RPC call "nfs4_close_ops", the kernel invokes mntput(), which decrements mnt->mnt_count.

In my test, sys_umount() was executed immediately after the file close operation. In the normal case the asynchronous RPC call "nfs4_close_ops" completes quickly, and sys_umount() is rarely invoked before nfs_update_inode() has finished. But when sunrpc/nfs_debug is set to debug mode, NFS issues a large number of printk calls. These printk calls easily delay the RPC call "nfs4_close_ops", so sys_umount() can be invoked before nfs_update_inode() has finished.

Because mnt->mnt_count has already been decremented, the umount succeeds. In do_umount(), sb->s_root is set to NULL by shrink_dcache_for_umount(), which is invoked from nfs_kill_super(). The kernel therefore panics on a NULL pointer access when nfs_update_inode() dereferences inode->i_sb->s_root.

Because umount can set the sb->s_root of a super_block to NULL while nfs4_close_ops has not yet finished, nfs_update_inode() needs to check inode->i_sb->s_root for NULL before using it.

To resolve this problem, I have attached a patch for the kernel. With the patch applied, the problem no longer occurs in my test.
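The check described in the analysis above can be sketched in a small userspace model. This is only an illustration, not the attached patch or the upstream fix: the struct definitions are cut-down stand-ins for the kernel types, and the helper name inode_is_sb_root() is hypothetical. It also does not model locking, so it shows the NULL guard only and says nothing about whether that guard alone closes the race with umount.

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel structures. Only the fields
 * touched by the comparison are modelled (assumption for illustration). */
struct inode;

struct dentry {
    struct inode *d_inode;
};

struct super_block {
    struct dentry *s_root;   /* NULL once umount has torn down the root dentry */
};

struct inode {
    struct super_block *i_sb;
};

/* Hypothetical helper: returns true only if the superblock still has
 * a root dentry and that dentry refers to this inode. The unguarded
 * form, inode == inode->i_sb->s_root->d_inode, dereferences s_root
 * without checking it first. */
bool inode_is_sb_root(const struct inode *inode)
{
    const struct dentry *root = inode->i_sb->s_root;

    if (root == NULL)        /* s_root already cleared by umount */
        return false;
    return root->d_inode == inode;
}
```

With s_root set to NULL the helper returns false instead of faulting; with s_root pointing at a dentry, it behaves like the original comparison.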
-- Additional comment from shic on 2007-01-30 19:36 EST --
Created an attachment (id=146986)
nfs_update_inode_panic.patch

-- Additional comment from pm-rhel on 2007-02-13 06:23 EST --
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

-- Additional comment from steved on 2007-03-21 10:13 EST --
Upstream thread: http://www.gossamer-threads.com/lists/linux/kernel/727123

-- Additional comment from mjenner on 2007-05-02 11:39 EST --
QE ack for RHEL 5.1

-- Additional comment from steved on 2007-06-06 15:38 EST --
Created an attachment (id=156385)
Proposed Upstream patch.

Would it be possible to test this upstream patch? Unfortunately I'm having no success in reproducing this oops, so if you could verify that this patch fixes the race condition that would be very helpful.

-- Additional comment from steved on 2007-06-19 12:40 EST --
ping....

-- Additional comment from torriem.edu on 2007-06-19 13:39 EST --
I am interested in testing this patch, but it will be some time before I can do so, since I'm in the process of some major server work. Will this patch also apply to the Fedora Core 6 kernel sources? I had this problem also on FC6. (not sure about FC7, haven't had a chance to try nfsv4 on it).

-- Additional comment from steved on 2007-06-19 14:02 EST --
Yes... it should apply to an FC6 kernel... Please let me know if there is a problem...
in 2.6.18-44.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0959.html