Bug 736578 - writes to single file from multiple processes clobber each other
Summary: writes to single file from multiple processes clobber each other
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.6
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 5.9
Assignee: Steve Dickson
QA Contact: Chao Ye
URL:
Whiteboard:
Depends On: 1007035 1007039
Blocks: 743405 807974 928283 1049884
 
Reported: 2011-09-08 07:14 UTC by Max Matveev
Modified: 2018-12-09 16:44 UTC
CC List: 11 users

Fixed In Version: kernel-2.6.18-380.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1007035
Environment:
Last Closed: 2014-09-16 00:32:48 UTC
Target Upstream Version:
Embargoed:


Attachments
patch -- skip lockowner related checks on v2/3 mounts (1.96 KB, patch)
2012-07-31 13:36 UTC, Jeff Layton


Links
  System: Red Hat Product Errata
  ID: RHBA-2014:1196
  Private: no
  Priority: normal
  Status: SHIPPED_LIVE
  Summary: Red Hat Enterprise Linux 5 kernel update
  Last Updated: 2014-09-16 04:27:19 UTC

Description Max Matveev 2011-09-08 07:14:48 UTC
Several reports of dump(8) running "slow" when dumping into a file
on an NFS-mounted filesystem prompted a bit of a chase to find
the root cause.

Eventually I was able to reproduce dump's behaviour by using
the following program:

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(int ac, char *av[])
{
	int f = open(av[1], O_WRONLY|O_TRUNC|O_CREAT, 0644);
	char buf[10240];
	int i;
	int pid;
	int status;

	if (f < 0) {
		perror(av[1]);
		return 1;
	}

	/* after the fork both processes write through the same
	   open file description - no locks are taken */
	pid = fork();

	for (i = 0; i < 1000; i++) {
		memset(buf, i, sizeof(buf));
		if (write(f, buf, sizeof(buf)) != sizeof(buf)) {
			perror("write failed");
			return 1;
		}
	}
	if (pid)
		waitpid(pid, &status, 0);

	close(f);
	return 0;
}

Without forking I see a few UNSTABLE writes followed by a commit or
two; with forking I see lots of FILE_SYNC writes. Similar behaviour
was observed when examining a tcpdump capture from a dump(8) run.
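
For reference, these are the stable_how values from the NFSv3 WRITE
call (RFC 1813, mirrored in the kernel's include/linux/nfs.h) that
show up in such captures:

enum nfs3_stable_how {
	NFS_UNSTABLE  = 0,	/* server may reply before data is on disk;
				   client must send a COMMIT later */
	NFS_DATA_SYNC = 1,	/* file data on stable storage before reply */
	NFS_FILE_SYNC = 2	/* data and metadata both committed before
				   replying - the slow case seen here */
};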

Using SystemTap I found that nfs_flush_incompatible() was
calling nfs_wb_all(), which in turn forces the data out via
a synchronous write.

I think it was e992fe54551ca2427ca532de6ef88b1e1d2a0487 which
introduced the problem - in this case there is no locking involved
but the test in nfs_flush_incompatible

                if (req->wb_page != page || ctx != req->wb_context ||
                        req->wb_lock_context->lockowner != current->files ||
                        req->wb_lock_context->pid != current->tgid)
                        status = nfs_wb_page(inode, page);

evaluates to true because we're indeed dealing with a different set of
files and a different process ID.

I think we need to check if there are any locks in the lock context
before comparing the pointers and pids.
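
Something along these lines, purely as an illustration of the shape -
assuming the inode's i_flock list is a usable proxy for "this context
actually holds locks" (untested):

                /* hypothetical: only apply the lockowner/pid test when
                   the inode actually has posix locks */
                if (req->wb_page != page || ctx != req->wb_context ||
                    (inode->i_flock != NULL &&
                     (req->wb_lock_context->lockowner != current->files ||
                      req->wb_lock_context->pid != current->tgid)))
                        status = nfs_wb_page(inode, page);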

Version-Release number of selected component (if applicable): 2.6.18-238.el5

How reproducible: always


Steps to Reproduce:
1. compile the program above
   cc -o writes writes.c
2. run with argument of a file on an NFS filesystem
   ./writes /mnt/nfs/test/1
3. capture the NFS traffic with tcpdump
4. check the capture for any writes with stable set to FILE_SYNC
  
Actual results:
multiple write calls with stable set to FILE_SYNC

Expected results:
few large UNSTABLE writes and no writes with stable set to FILE_SYNC

Additional info:

The same problem has been observed on 5.5.z (2.6.18-194.26.1.el5),
and it is likely to affect RHEL6 as well.

Comment 1 RHEL Program Management 2012-01-09 14:40:04 UTC
This request was evaluated by Red Hat Product Management for inclusion in Red Hat Enterprise Linux 5.8, and Red Hat does not plan to fix this issue in the currently developed update.

Contact your manager or support representative in case you need to escalate this bug.

Comment 4 Steve Dickson 2012-04-17 15:53:16 UTC
Hey Max,

I'm not seeing this type of behaviour with a later kernel (2.6.18-310.el5). Would you mind retesting with a later kernel?

tia...

Comment 5 Max Matveev 2012-07-04 08:29:55 UTC
Apologies for tardiness - was away for a while...

FWIW, on a VM with 2 virtual CPUs

[root@rebecca ~]# nfsstat -m
/mnt/tmp from 192.168.122.46:/tmp
 Flags:	rw,vers=3,rsize=131072,wsize=131072,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=192.168.122.46

[root@rebecca ~]# uname -a
Linux rebecca 2.6.18-308.8.2.el5 #1 SMP Tue May 29 11:54:17 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux

I'm still seeing very slow writes and I've got tcpdump full of FILE_SYNCs
to prove it.

Comment 6 Jeff Layton 2012-07-30 18:19:19 UTC
Ok, I can reproduce this too, and it's also a problem in current mainline kernels. We'll have to pursue a fix there first and then we can discuss backporting it for RHEL5 and 6.

Max, didn't you start a discussion thread on this upstream? If so, then reviving that might be the best first step.

Comment 7 Jeff Layton 2012-07-30 18:50:25 UTC
(In reply to comment #0)

> I think it was e992fe54551ca2427ca532de6ef88b1e1d2a0487 which
> introduced the problem - in this case there is no locking involved
> but the test in nfs_flush_incompatible
> 

Yes, that's almost assuredly when the problem was introduced.

>                 if (req->wb_page != page || ctx != req->wb_context ||
>                         req->wb_lock_context->lockowner != current->files ||
>                         req->wb_lock_context->pid != current->tgid)
>                         status = nfs_wb_page(inode, page);
> 
> evaluates to true because we're indeed dealing with a different set of
> files and a different process ID.
> 

Yes. There's also a similar problem in nfs_can_coalesce_requests()...
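
The comparison there has the same shape - paraphrasing the mainline
code of that era from memory, not quoting it exactly:

	if (req->wb_context != prev->wb_context)
		return false;
	/* unconditional lockowner comparison: requests from the two
	   forked writers never coalesce, even with no locks held */
	if (req->wb_lock_context->lockowner !=
	    prev->wb_lock_context->lockowner)
		return false;

...so writes from different processes also end up in separate RPCs.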

> I think we need to check if there are any locks in the lock context
> before comparing the pointers and pids.
> 

A couple of problems with that idea:

1) we don't have a list of locks associated with a particular lockowner. Getting that info is not as direct as one would hope. It'll probably involve taking locks and walking lists, etc. 

2) Even if we did, will we have problems if someone sets a lock after the page has been dirtied and the lockowner set? I'll need to ponder this part.

Comment 8 Max Matveev 2012-07-31 07:43:17 UTC
(In reply to comment #6)
> Max, didn't you start a discussion thread on this upstream? If so, then
> reviving that might be the best first step.

No - I was planning to spend a bit more time researching it before
going to Trond, but it never quite happened.

(In reply to comment #7)
> > I think we need to check if there are any locks in the lock context
> > before comparing the pointers and pids.
>
> A couple of problems with that idea:
> 
> 1) we don't have a list of locks associated with a particular lockowner.
> Getting that info is not as direct as one would hope. It'll probably involve
> taking locks and walking lists, etc. 

But in the NFSv3 case there should be no lockowner at all, or there
should be a 'joker' lockowner which matches any request. What's more,
given that one cannot have the same file/inode/page on an NFSv3 and an
NFSv4 mount at the same time, NFSv3 should not concern itself with this
concept.

> 2) Even if we did, will we have problems if someone sets a lock after the
> page has been dirtied and the lockowner set? I'll need to ponder this part.

This is an interesting point - arguably the client should flush all
pages which have been dirtied before granting the lock, and not allow
any new pages to be mucked up until the lock is granted, in order to
obey the intent of locking (as someone once put it to me - NFS locking
is only advisory, and one doesn't have to follow bad advice).

In this case multiple processes are writing to the same file, but the
coordination is not done using locking; instead it's done in the
application itself.

Comment 9 Jeff Layton 2012-07-31 13:36:28 UTC
Created attachment 601525 [details]
patch -- skip lockowner related checks on v2/3 mounts

Here's an initial proof-of-concept patch for mainline. I'll probably send this upstream once I write up a description of the problem. Trond will hate it, but it might help spur discussion... ;)

Basically the idea here is to just skip these new checks when we're on a non-stateful mount (vers < 4). This seems to work with your reproducer on v3, but the final fix will probably look different.
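
Condensed, the idea is something like this (a sketch of the shape, not
the attachment verbatim - NFS_PROTO(inode)->version being the obvious
way to test for a stateful mount):

                /* skip the lockowner/pid checks entirely on v2/v3 */
                if (req->wb_page != page || ctx != req->wb_context ||
                    (NFS_PROTO(inode)->version >= 4 &&
                     (req->wb_lock_context->lockowner != current->files ||
                      req->wb_lock_context->pid != current->tgid)))
                        status = nfs_wb_page(inode, page);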

Comment 10 Jeff Layton 2012-07-31 18:37:08 UTC
Ok, strawman patch and description of the problem sent to Trond and cc'ed to linux-nfs:

    http://article.gmane.org/gmane.linux.nfs/51240

...maybe we'll get some traction.

Comment 11 Jeff Layton 2012-09-14 14:21:56 UTC
Ok, Trond replied back on that thread saying that he'd rather we move to a scheme where we track what open/lock stateid each request was done under. That makes a lot of sense, but will be a bit of an overhaul. He said he's working out a draft patchset for this, so at this point I'm just waiting to see what he comes up with.
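
If I'm reading the proposal right, each request would remember the
open/lock context it was dirtied under, and nfs_flush_incompatible()
would compare stored contexts rather than current->files/current->tgid.
Hypothetical shape only (l_ctx being the lock context the current write
is done under); his patchset may look nothing like this:

	/* hypothetical: compare the context recorded at dirty time
	   with the one this write is being done under */
	if (req->wb_lock_context != l_ctx)
		status = nfs_wb_page(inode, page);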

Comment 23 RHEL Program Management 2013-04-23 00:35:33 UTC
Development Management has reviewed and declined this request.
You may appeal this decision by reopening this request.

Comment 24 Jeff Layton 2013-04-26 16:48:33 UTC
Reopening for consideration in later releases...

This problem is still unresolved upstream, but I think Trond still has a patchset in progress for it. If and when we get a resolution on mainline kernels then we can look at what can be done for existing RHEL releases.

Comment 29 RHEL Program Management 2014-01-20 23:23:36 UTC
This request was evaluated by Red Hat Product Management for
inclusion in a Red Hat Enterprise Linux release.  Product
Management has requested further review of this request by
Red Hat Engineering, for potential inclusion in a Red Hat
Enterprise Linux release for currently deployed products.
This request is not yet committed for inclusion in a release.

Comment 35 errata-xmlrpc 2014-09-16 00:32:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1196.html

Comment 36 Alexander Gordeev 2014-09-16 08:59:51 UTC
Patch(es) available in kernel-2.6.18-398.el5
You can download this test kernel (or newer) from http://people.redhat.com/agordeev/el5/
Detailed testing feedback is always welcomed.
If you require guidance regarding testing, please ask the bug assignee.

