441730 – [rhts] connectathon nfsidem test failing

Bug 441730 - [rhts] connectathon nfsidem test failing

Summary: [rhts] connectathon nfsidem test failing

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.2
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Jeff Layton
QA Contact:	yanfu,wang
Docs Contact:
URL:	http://rhts.redhat.com/cgi-bin/rhts/t...
Whiteboard:
Duplicates (1):	663068 (view as bug list)
Depends On:
Blocks:	KernelPrio5.3 483701
TreeView+	depends on / blocked

Reported:	2008-04-09 18:15 UTC by Don Zickus
Modified:	2011-07-21 10:08 UTC (History)
CC List:	12 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2011-07-21 10:08:37 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
attempted q&d patch (3.86 KB, patch) 2010-08-10 20:12 UTC, Jeff Layton	no flags	Details \| Diff
patch -- wait for COMMIT rpc_release call on FLUSH_SYNC commits (11.00 KB, patch) 2010-08-16 12:59 UTC, Jeff Layton	no flags	Details \| Diff
patch -- wait for COMMIT rpc_release call on FLUSH_SYNC commits (58.54 KB, patch) 2010-08-16 13:24 UTC, Jeff Layton	no flags	Details \| Diff
patch -- wait for COMMIT rpc_release call on FLUSH_SYNC commits (11.10 KB, patch) 2010-08-16 13:31 UTC, Jeff Layton	no flags	Details \| Diff
Show Obsolete (2) View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2011:1065	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 5.7 kernel security and bug fix update	2011-07-21 09:21:37 UTC

Description Don Zickus 2008-04-09 18:15:41 UTC

Description of problem:

The above url is a link to the failed test.  

From an email exchange with Peter:

Don Zickus wrote:
> A quirk popped that Jeff and I haven't seen before.  Was
> wondering if you guys could take a quick look.
>
> test failure
> http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=2589554

The actual failure is that the application tried to rmdir a
directory and this failed because the directory was not empty.
Clearly, the application thought that the directory should be
empty.

The directory wasn't empty because the kernel thought that one
of the files inside of the directory, which was being unlink'd,
was still open, and thus, renamed the file to a .nfs file.

So, the question becomes, why did the kernel think that the
file was still open when it was unlink'd?

Version-Release number of selected component (if applicable):
kernel-2.6.18-89.el5
Snapshot-2

How reproducible:
Not really

Comment 1 Peter Staubach 2008-04-09 19:11:52 UTC

At this point, since it only happened on one system and we've never seen
it before, this does not seem like a reason to block the entire release.

Comment 2 Steve Dickson 2008-04-09 20:23:52 UTC

+1 to this not being a blocker...

Comment 10 RHEL Program Management 2009-02-16 15:24:23 UTC

Updating PM score.

Comment 13 Jeff Layton 2010-07-08 14:54:04 UTC

We've had some more of these problems show up recently, mostly on s390x arch. Reopening this bug to have a closer look.

Comment 15 Jeff Layton 2010-07-19 17:50:40 UTC

The captures show the REMOVE call for the sillyrenamed file occurring after the RMDIR call went out onto the wire. It seems probable that the problem is a race where the sillyrenamed files aren't being cleaned up before things like an rmdir are allowed to proceed on the containing directory.

The problem here looks like the same one that the patches for this bug were supposed to fix:

    http://bugzilla.linux-nfs.org/show_bug.cgi?id=150

...it looks like we have the patches for that bug, but there were some follow-on patches that fix some unspecified races.

I'll also note that while we added the nfsiod workqueue, the silly-renames are still queued to rpciod in RHEL5. That probably ought to be looked at...

Still wrapping my brain around how the block/unblock sillyrename stuff works so maybe the solution will jump out at me after that...

Comment 16 Jeff Layton 2010-07-28 13:18:52 UTC

Actually, looking at the test -- it does this:

------------------[snip]-----------------
                fd = open(FOO, O_RDWR|O_CREAT, 0666);
                if (fd < 0) {
                        perror("creat");
                        break;
                }
                if (write(fd, str, slen) != slen) {
                        perror("write");
                        (void) close(fd);
                        break;
                }
                if (close(fd)) {
                        perror("close");
                        break;
                }
------------------[snip]-----------------

...and that's the only real open call that it does. There should be no need for any silly-renaming with this test. So I think I'll need to focus on why that's occurring at all.

Comment 17 Jeff Layton 2010-07-28 14:36:34 UTC

Adding Ian since he did a large patchset for NFS in RHEL5.5.

Comment 18 Jeff Layton 2010-08-06 17:00:04 UTC

I've tried in vain to reproduce this my self on a test rig (x86_64 kvm guest). I've actually seen it happen on a couple of occasions but it's very fleeting.

The most likely scenario is that an unlink or rename is occasionally racing with a dput. sillyrenames are basically triggered when the d_count is higher than we expect.

My suspicion is that we have a workqueue job or something that is holding a dentry reference and eventually puts it. In most cases, the dput is done before we get to this point in the program, but sometimes that doesn't happen.

I'll go over the capture and code again and see whether I can determine where this is most likely occurring.

Comment 19 Jeff Layton 2010-08-06 18:13:48 UTC

After looking over the code, my suspicion at this point is that the rpc_release callback for the write call isn't occurring quickly enough.

The rpc_release call for the NFS write tasks are set to nfs_writedata_release. That function does a put_nfs_open_context, which in turn does a dput on the dentry.

What I may do is try and add a small sleep (1-2s) to nfs_writedata_release and see whether it makes this more reproducible...

Comment 20 Jeff Layton 2010-08-06 18:55:51 UTC

Yep. I added a "msleep(1000);" to nfs_writedata_release just before putting the open context, and the nfsidem test failed every time:

# ./nfsidem 50 testdir
testing 50 idempotencies in directory "testdir"
rmdir 1: Directory not empty

...backing out that patch allowed it to routinely pass. I really think this is the most likely source of the race.

As to how to fix it...I'm not certain. I'll need to look closer at how (and whether) this was fixed upstream.

Ian, any thoughts?

Comment 21 Jeff Layton 2010-08-06 20:01:15 UTC

Tried adding a similar patch to a rawhide kernel and wasn't able to reproduce the problem. So assuming that this is the issue, it seems to have been fixed upstream. The way it was fixed however is not clear to me.

Comment 22 Jeff Layton 2010-08-06 20:08:01 UTC

One thing that might help some is commit ccfeb506231348a3c60ab0fdb5753a574653e3c0, which starts inode writeout during sillyrename. I think though that it's probably not enough to fix this however.

Comment 23 Ian Kent 2010-08-09 05:33:10 UTC

(In reply to comment #20)
> Yep. I added a "msleep(1000);" to nfs_writedata_release just before putting the
> open context, and the nfsidem test failed every time:
> 
> # ./nfsidem 50 testdir
> testing 50 idempotencies in directory "testdir"
> rmdir 1: Directory not empty
> 
> ...backing out that patch allowed it to routinely pass. I really think this is
> the most likely source of the race.
> 
> As to how to fix it...I'm not certain. I'll need to look closer at how (and
> whether) this was fixed upstream.
> 
> Ian, any thoughts?    

I haven't looked at the code yet but it I think it's probably
worth checking in the rmdir() code path since there is no
mention of the addition of sillyrename blocking in the
patches for the kernel bug of comment #15.

I'll have a look around.

Comment 24 Jeff Layton 2010-08-09 11:07:33 UTC

The difference seems to be that rawhide kernels wait for the work to complete before proceeding with the close. With the msleep(1000); call in nfs_writedata_release:

rawhide:
07:01:04.035667 write(3, "Idempotency test 18978 running\n", 31) = 31 <0.000068>
07:01:04.035849 close(3)                = 0 <1.001571>

RHEL5:
07:03:04.916390 write(3, "Idempotency test 32541 running\n", 31) = 31 <0.000069>
07:03:04.916813 close(3)                = 0 <0.025833>

It seems like RHEL5 ought to be waiting for the rpc_release before allowing the close() to complete.

Comment 25 Jeff Layton 2010-08-09 11:31:44 UTC

One major difference -- RHEL5 does a sync write here and doesn't need to issue a COMMIT on the wire.

Comment 26 Ian Kent 2010-08-09 14:20:05 UTC

That block/unblock sillyrename stuff is odd!

Comment 27 Jeff Layton 2010-08-09 20:28:16 UTC

Ok, I think I sort of see why rawhide isn't susceptible to this:

As stated before, I think the basic problem is that the rpc_release callback for the write isn't completing before a close() is allowed to return.

The rpc_release callback in this case is queued to nfsiod. In mainline kernels, COMMIT is always handled as an asynchronous RPC that is also queued to nfsiod. The difference here is that the callers wait for the COMMIT call to complete in the FLUSH_SYNC case. Because the COMMIT is usually queued after the rpc_release for the write, those end up serialized and by the time the close is complete, the write is also complete.

In RHEL5, a COMMIT call for a FLUSH_SYNC flush operation is a synchronous RPC, so nfsiod isn't involved. We could change that and make COMMIT calls an async RPC and just have the caller wait on it to complete. The patch won't look very close to upstream code, but we can ape the approach.

Comment 29 Jeff Layton 2010-08-10 19:52:07 UTC

I knocked together a quick and dirty patch as a proof of concept but it didn't fix the problem. I suspect that the problem is that a lot of the write/commit completion has moved from the rpc_call_done callback to rpc_release. The actual mechanism isn't quite clear to me however...

Comment 30 Jeff Layton 2010-08-10 20:12:40 UTC

Created attachment 437990 [details]
attempted q&d  patch

Doesn't work, but I may want to go back to it at some point...

Another possible approach is to only queue rpc_release to the workqueue for async RPC tasks. That would probably work, but it's a divergence from upstream, and we'd have to carefully review it to make sure it won't deadlock.

I'll play around with that some soon...

Comment 31 Jeff Layton 2010-08-16 12:59:54 UTC

Created attachment 438938 [details]
patch -- wait for COMMIT rpc_release call on FLUSH_SYNC commits

This patchset seems to fix the problem for me in the artificial reproducer I have. It depends on the patchset for bug 516490. The basic idea is to make all WRITE and COMMIT calls asynchronous, and simply to make the callers wait on them to complete before proceeding. There are also some other changes -- NFS_FILE_SYNC is now only used for high priority direct memory reclaim.

The patchset follows the upstream approach but the patches themselves are quite different from what went upstream as I didn't want to pull in the large amounts of prerequisite patches in order to make them apply.

Comment 32 Jeff Layton 2010-08-16 13:24:23 UTC

Created attachment 438944 [details]
patch -- wait for COMMIT rpc_release call on FLUSH_SYNC commits

Bah, just noticed a bug in nfs_sync_inode_wait -- it was calling nfs_commit_set_lock while holding the nfsi->req_lock. This patch fixes that, though it does mean a bit more req_lock churn in that function than I'd really prefer. It might do to reorganize that function instead. I think though that this is good enough for some initial proof-of-concept testing.

Comment 33 Jeff Layton 2010-08-16 13:31:48 UTC

Created attachment 438946 [details]
patch -- wait for COMMIT rpc_release call on FLUSH_SYNC commits

Hmm...not sure what happened with that last patch upload, but this one should be OK.

Comment 34 Jeff Layton 2010-08-23 11:29:52 UTC

Ran cthon tests in RHTS on the latest kernel late last week and got some more failures, mostly on s390x again. I think there are a few other problems that need to be fixed here. There were 3 cases of failures:

NFSv2: here we don't have unstable pages, so we have no COMMIT call. Waiting on a COMMIT to complete doesn't really help things.

NFSv4 to netapp: netapps always return writes with FILE_SYNC, which tells the client "no need to send a commit for this". So again, by the time all of the pages are written out, we don't have to do a COMMIT call. Waiting for it doesn't help.

NFSv3: to sun box, I think ... this one I'm still not certain what happened. It may be that the rpc_release call for the write got queued after that for the COMMIT. It could also be that the we cleared the NFS_INO_COMMIT flag and the other calls happened before the open context could be put. That seems unlikely though...

In all 3 cases, I think we need to ensure that the rpc_release callback for a write is complete (or near complete) before we allow the close to complete.

Upstream code does most of the post-call completion in the rpc_release routine rather than in rpc_call_done. That may be the crucial difference here.

Comment 36 RHEL Program Management 2011-02-01 16:57:47 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 37 Jeff Layton 2011-02-02 19:54:27 UTC

Jarod has committed the patches for this, but we still see failures on s390...

Even with all of the changes to this code, we still don't have any guarantee that the rpc_release callback will complete before the close returns and an rmdir is attempted.

The best approach to fix this is really unclear to me. I suspect that this is also an upstream problem, but recent changes the the workqueue model there may be helping to paper over it.

I'll plan to talk it over with Trond when I'm at connectathon in a few weeks. Maybe he'll have some ideas. In the meantime, I still think these changes do help to some degree, even if it doesn't seem to fix the test failures in all cases.

Comment 40 Jeff Layton 2011-03-02 14:43:11 UTC

Moving this back to ASSIGNED for now. I think the patches I've posted so far help, but there's still a race window. Trond has given me a patch for upstream code and I'm working on backporting it for RHEL5. Testing my initial backport however with cthon tests caused an oops. I'm still looking at the cause:

Unable to handle kernel NULL pointer dereference at 0000000000000048 RIP: 
 [<ffffffff88568b2f>] :sunrpc:rpc_clnt_sigmask+0x3/0x15
PGD 3964c067 PUD 3451067 PMD 0 
Oops: 0000 [1] SMP 
last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
CPU 0 
Modules linked in: nfs nfs_acl nls_utf8 cifs autofs4 hidp rfcomm l2cap bluetooth rpcsec_gss_krb5 auth_rpcgss testmgr_cipher testmgr aead crypto_blkcipher crypto_algapi des lockd sunrpc ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy serio_raw ata_piix libata virtio_net scsi_mod i2c_piix4 i2c_core virtio_balloon pcspkr tpm_tis tpm tpm_bios dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod virtio_blk virtio_pci virtio_ring virtio ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 22176, comm: test6 Not tainted 2.6.18-245.el5.jtltest.133debug #1
RIP: 0010:[<ffffffff88568b2f>]  [<ffffffff88568b2f>] :sunrpc:rpc_clnt_sigmask+0x3/0x15
RSP: 0018:ffff810039afdc30  EFLAGS: 00010282
RAX: ffff810039afdc38 RBX: ffff810003b706a0 RCX: ffff810037108dc0
RDX: 0000000000000080 RSI: ffff810039afdc38 RDI: 0000000000000000
RBP: 0000000003b706a0 R08: ffff810002573f18 R09: 0000000000000001
R10: ffff810002573f18 R11: ffffffff8856e1e4 R12: ffff810003b706a0
R13: ffff81003a448e80 R14: 0000000000000000 R15: ffff81003f88d330
FS:  00002b54eaf9baf0(0000) GS:ffffffff8044b000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000048 CR3: 0000000003ddd000 CR4: 00000000000006e0
Process test6 (pid: 22176, threadinfo ffff810039afc000, task ffff810037108dc0)
Stack:  ffffffff887196fe 0000000000000000 ffffffff8856e2a8 ffff81003a448e20
 ffff810003b706a0 ffff81003a448e80 ffffffff8871b5f0 ffff8100390fa9f8
 00000000fffffff4 ffff81003a448e20 ffff810035dcd720 ffff81002c1e8508
Call Trace:
 [<ffffffff887196fe>] :nfs:nfs4_wait_for_completion_rpc_task+0x18/0x3d
 [<ffffffff8856e2a8>] :sunrpc:rpc_run_task_wq+0x37/0x3f
 [<ffffffff8871b5f0>] :nfs:_nfs4_proc_open+0x148/0x1bc
 [<ffffffff8871c410>] :nfs:nfs4_do_open+0xc2/0x1dd
 [<ffffffff8871e43c>] :nfs:nfs4_atomic_open+0xe0/0x1a4
 [<ffffffff800ad489>] debug_check_no_locks_freed+0x11d/0x129
 [<ffffffff800aa2e8>] lock_release_holdtime+0x27/0x48
 [<ffffffff887087a2>] :nfs:nfs_atomic_lookup+0xae/0xfd
 [<ffffffff800392a8>] __lookup_hash+0x10b/0x130
 [<ffffffff8001c2b3>] open_namei+0xf4/0x720
 [<ffffffff80028cf0>] do_filp_open+0x1c/0x38
 [<ffffffff800683df>] _spin_unlock+0x17/0x20
 [<ffffffff80016e71>] get_unused_fd+0xf9/0x107
 [<ffffffff8001afeb>] do_sys_open+0x44/0xbe
 [<ffffffff800602a6>] tracesys+0xd5/0xdf


Code: 40 8a 77 48 48 89 c7 40 d0 ee 83 e6 01 e9 bd fe ff ff 41 56 
RIP  [<ffffffff88568b2f>] :sunrpc:rpc_clnt_sigmask+0x3/0x15
 RSP <ffff810039afdc30>
CR2: 0000000000000048
 <0>Kernel panic - not syncing: Fatal exception

Comment 41 Jeff Layton 2011-03-02 16:18:17 UTC

Ok, I understand the problem. The rpc_task->tk_client is released early now, by rpc_release_resources_task. Unfortunately however, the RHEL5 code accesses the task->tk_client in nfs4_wait_for_completion_rpc_task (to set the signal masks).

Comment 42 Jeff Layton 2011-03-03 15:56:15 UTC

I fixed the problem in the RHEL5 backport of Trond's patch and I can still reproduce the issue. I think there are really two problems: 

1) this test shouldn't be causing a silly-rename in the first place, so we have a race where sometimes one occurs when it shouldn't

2) if we do a silly-rename we ought to wait until the delete is done before returning

I've narrowed down the testcase a bit and I set up some debugging to pop a printk and do a rpc_show_tasks when a silly rename occurs:

NFS: silly-rename(DIR/FILE, ct=2)
-pid- proc flgs status -client- -prog- --rqstp- -timeout -rpcwait -action- ---ops--
10194 0001 0081 000000 ffff810029a496e0 100003 ffff810025c70000 00000000     none ffffffff8856c900 ffffffff886d9480

So the dentry count was still 2, and the ops struct above seems to be:

crash> kmem ffffffff886d9480
ffffffff886d9480 (r) nfs4_close_ops  

...so the close was still not complete when the unlink was running. How this happens with Trond's patch, I'm still not sure. Unfortunately, rpc_debugging seems to make the problem go away which makes this hard to track down.

Comment 43 Jeff Layton 2011-03-03 20:30:56 UTC

I altered rpc_show_tasks to show the tk_runstate too, and got a value of 0x0011. That corresponds to:

#define RPC_TASK_RUNNING        0
#define RPC_TASK_ACTIVE         4

...unfortunately I had another bug in there that caused an oops, so I couldn't backtrack to see what the ops were. Assuming though that it was a CLOSE call, I'm a bit confused -- nfs4_do_close should be blocking until RPC_TASK_ACTIVE is clear. I'll see if I can add some debugging to determine the state of that flag after rpc_wait_for_completion_task exits.

Comment 44 Jeff Layton 2011-03-04 16:20:19 UTC

Ok, I think I might have found the problem. nfs_wb_all basically does this:

        int error = nfs_sync_inode_wait(inode, 0, 0, 0);

...the notable absence there being that FLUSH_SYNC isn't set, so the writes and commit are not waited upon. The times when I see this in v4 closes are the times that the COMMIT call is the last one to put the open file context. That gets queued to the workqueue after the close() call has already called its rpc_put_task.

I'm still looking at possible solutions here -- we need to wait until the COMMIT is really complete before we can proceed with a close() call, so it may just be that.

Comment 45 Jeff Layton 2011-03-07 12:13:16 UTC

Ok, we need several different changes to fix this on top of what has already been proposed on RHKL:

- Trond's patch to fix the races with rpc_wait_for_completion_task

- a patch to make nfs_wb_all use FLUSH_SYNC. I think that's appropriate as all of the callers of it are doing flushes for data integrity and should be waiting for them to complete anyway

- a patch to make COMMIT calls wait for the task to complete before putting it when FLUSH_SYNC is set

I just pushed a version of the last patch upstream, so I'm waiting for results on that. Meanwhile, I have patches for all of these in my test kernels and they seem to fix the problem even when I introduce artificial delays in the CLOSE and COMMIT codepaths.

Comment 56 Jarod Wilson 2011-03-23 21:42:51 UTC

Patch(es) available in kernel-2.6.18-250.el5
Detailed testing feedback is always welcomed.

Comment 58 Jeff Layton 2011-05-26 17:17:56 UTC

*** Bug 663068 has been marked as a duplicate of this bug. ***

Comment 59 yanfu,wang 2011-06-01 08:50:28 UTC

run connectathon test suite when do nfs first round testing and the test passed now, pls refer to below job links:
https://beaker.engineering.redhat.com/jobs/83020
https://beaker.engineering.redhat.com/jobs/83021

Comment 60 errata-xmlrpc 2011-07-21 10:08:37 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html

Note You need to log in before you can comment on or make changes to this bug.