Bug 736578

Summary: writes to single file from multiple processes clobber each other
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.6
Reporter: Max Matveev <makc>
Assignee: Steve Dickson <steved>
QA Contact: Chao Ye <cye>
Status: CLOSED ERRATA
Severity: high
Priority: high
CC: bfields, borgan, ccui, eguan, harshula, nfs-maint, nmurray, rwheeler, smayhew, sprabhu, steved
Target Milestone: rc
Target Release: 5.9
Keywords: Regression, Reopened
Hardware: All
OS: Linux
Fixed In Version: kernel-2.6.18-380.el5
Doc Type: Bug Fix
Last Closed: 2014-09-16 00:32:48 UTC
Bug Depends On: 1007035, 1007039
Bug Blocks: 743405, 807974, 928283, 1049884

Attachments:
patch -- skip lockowner related checks on v2/3 mounts

Description Max Matveev 2011-09-08 07:14:48 UTC
Several reports of dump(8) running "slow" when dumping into a file
on an NFS-mounted filesystem prompted a bit of a chase to find
the root cause.

Eventually I was able to reproduce dump's behaviour by using
the following program:

#include <stdio.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>     /* fork, write, close */
#include <sys/wait.h>   /* waitpid */

int main(int ac, char *av[])
{
	int f = open(av[1], O_WRONLY|O_TRUNC|O_CREAT, 0644); /* mode must be octal */
	char buf[10240];
	int i;
	int pid;
	int status;

	if (f < 0) {
		perror(av[1]);
		return 1;
	}

	pid = fork();

	for (i=0; i < 1000; i++) {
		memset(buf, i, sizeof(buf));
		if (write(f, buf, sizeof(buf)) != sizeof(buf)) {
			perror("write failed");
			return 1;
		}
	}
	if (pid) 
		waitpid(pid, &status, 0);

	close(f);
	return 0;
}

Without forking I see a few UNSTABLE writes followed by a commit or two;
with forking I see lots of FILE_SYNC writes. Similar behaviour
was observed when examining a tcpdump capture from a dump(8) run.

Using SystemTap I found that nfs_flush_incompatible() was
calling nfs_wb_all(), which in turn forces data out via
a synchronous write.

I think it was e992fe54551ca2427ca532de6ef88b1e1d2a0487 which
introduced the problem - in this case there is no locking involved
but the test in nfs_flush_incompatible

                if (req->wb_page != page || ctx != req->wb_context ||
                        req->wb_lock_context->lockowner != current->files ||
                        req->wb_lock_context->pid != current->tgid)
                        status = nfs_wb_page(inode, page);

evaluates to true because we're indeed dealing with a different set of files
and a different process ID.

I think we need to check if there are any locks in the lock context
before comparing the pointers and pids.

Version-Release number of selected component (if applicable): 2.6.18-238.el5

How reproducible: always


Steps to Reproduce:
1. compile the program above
   cc -o writes writes.c
2. run with argument of a file on an NFS filesystem
   ./writes /mnt/nfs/test/1
3. capture tcpdump
4. check tcpdump for any writes with stable set to FILE_SYNC
  
Actual results:
multiple write calls with stable set to FILE_SYNC

Expected results:
a few large UNSTABLE writes and no writes with stable set to FILE_SYNC

Additional info:

The same problem has been observed on 5.5.z - 2.6.18-194.26.1.el5
and it is likely that the same problem will affect RHEL6.

Comment 1 RHEL Program Management 2012-01-09 14:40:04 UTC
This request was evaluated by Red Hat Product Management for inclusion in Red Hat Enterprise Linux 5.8 and Red Hat does not plan to fix this issue in the currently developed update.

Contact your manager or support representative in case you need to escalate this bug.

Comment 4 Steve Dickson 2012-04-17 15:53:16 UTC
Hey Max,

I'm not seeing this type of behaviour with a later kernel (2.6.18-310.el5). Would you mind retesting with a later kernel?

tia...

Comment 5 Max Matveev 2012-07-04 08:29:55 UTC
Apologies for tardiness - was away for a while...

FWIW, on a VM with 2 virtual CPUs

[root@rebecca ~]# nfsstat -m
/mnt/tmp from 192.168.122.46:/tmp
 Flags:	rw,vers=3,rsize=131072,wsize=131072,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=192.168.122.46

[root@rebecca ~]# uname -a
Linux rebecca 2.6.18-308.8.2.el5 #1 SMP Tue May 29 11:54:17 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux

I'm still seeing very slow writes and I've got tcpdump full of FILE_SYNCs
to prove it.

Comment 6 Jeff Layton 2012-07-30 18:19:19 UTC
Ok, I can reproduce this too, and it's also a problem in current mainline kernels. We'll have to pursue a fix there first and then we can discuss backporting it for RHEL5 and 6.

Max, didn't you start a discussion thread on this upstream? If so, then reviving that might be the best first step.

Comment 7 Jeff Layton 2012-07-30 18:50:25 UTC
(In reply to comment #0)

> I think it was e992fe54551ca2427ca532de6ef88b1e1d2a0487 which
> introduced the problem - in this case there is no locking involved
> but the test in nfs_flush_incompatible
> 

Yes, that's almost assuredly when the problem was introduced.

>                 if (req->wb_page != page || ctx != req->wb_context ||
>                         req->wb_lock_context->lockowner != current->files ||
>                         req->wb_lock_context->pid != current->tgid)
>                         status = nfs_wb_page(inode, page);
> 
> evaluates to true because we're indeed dealing with differnt set of files
> and different process Id.
> 

Yes. There's also a similar problem in nfs_can_coalesce_requests()...

> I think we need to check if there are any locks in the lock context
> before comparing the pointers and pids.
> 

A couple of problems with that idea:

1) we don't have a list of locks associated with a particular lockowner. Getting that info is not as direct as one would hope. It'll probably involve taking locks and walking lists, etc. 

2) Even if we did, will we have problems if someone sets a lock after the page has been dirtied and the lockowner set? I'll need to ponder this part.

Comment 8 Max Matveev 2012-07-31 07:43:17 UTC
(In reply to comment #6)
> Max, didn't you start a discussion thread on this upstream? If so, then
> reviving that might be the best first step.

No - I was planning to spend a bit more time researching it before
going to Trond, but it never quite happened.

(In reply to comment #7)
> > I think we need to check if there are any locks in the lock context
> > before comparing the pointers and pids.
>
> A couple of problems with that idea:
> 
> 1) we don't have a list of locks associated with a particular lockowner.
> Getting that info is not as direct as one would hope. It'll probably involve
> taking locks and walking lists, etc. 

But in the NFSv3 case there should be no lockowner at all, or there should be
a 'joker' lockowner which matches any request. What's more, given that
one cannot have the same file/inode/page on NFSv3 and NFSv4 mounts at
the same time, NFSv3 should not concern itself with this concept.

> 2) Even if we did, will we have problems if someone sets a lock after the
> page has been dirtied and the lockowner set? I'll need to ponder this part.

This is an interesting point - arguably the client should flush all pages which
have been dirtied before granting the lock, and not allow any new pages to be
dirtied until the lock is granted, in order to obey the intent of locking
(as someone once put it to me - NFS locking is only advisory, and one
doesn't have to follow bad advice).

In this case multiple processes are writing to the same file, but the
coordination is not done using locking; instead it is done in the application
itself.

Comment 9 Jeff Layton 2012-07-31 13:36:28 UTC
Created attachment 601525 [details]
patch -- skip lockowner related checks on v2/3 mounts

Here's an initial proof-of-concept patch for mainline. I'll probably send this upstream once I write up a description of the problem. Trond will hate it, but it might help spur discussion... ;)

Basically the idea here is to just skip these new checks when we're on a non-stateful mount (vers < 4). This seems to work with your reproducer on v3, but the final fix will probably look different.

Comment 10 Jeff Layton 2012-07-31 18:37:08 UTC
Ok, strawman patch and description of the problem sent to Trond and cc'ed to linux-nfs:

    http://article.gmane.org/gmane.linux.nfs/51240

...maybe we'll get some traction.

Comment 11 Jeff Layton 2012-09-14 14:21:56 UTC
Ok, Trond replied back on that thread saying that he'd rather we move to a scheme where we track what open/lock stateid each request was done under. That makes a lot of sense, but will be a bit of an overhaul. He said he's working out a draft patchset for this, so at this point I'm just waiting to see what he comes up with.

Comment 23 RHEL Program Management 2013-04-23 00:35:33 UTC
Development Management has reviewed and declined this request.
You may appeal this decision by reopening this request.

Comment 24 Jeff Layton 2013-04-26 16:48:33 UTC
Reopening for consideration in later releases...

This problem is still unresolved upstream, but I think Trond still has a patchset in progress for it. If and when we get a resolution on mainline kernels then we can look at what can be done for existing RHEL releases.

Comment 29 RHEL Program Management 2014-01-20 23:23:36 UTC
This request was evaluated by Red Hat Product Management for
inclusion in a Red Hat Enterprise Linux release.  Product
Management has requested further review of this request by
Red Hat Engineering, for potential inclusion in a Red Hat
Enterprise Linux release for currently deployed products.
This request is not yet committed for inclusion in a release.

Comment 35 errata-xmlrpc 2014-09-16 00:32:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1196.html

Comment 36 Alexander Gordeev 2014-09-16 08:59:51 UTC
Patch(es) available in kernel-2.6.18-398.el5
You can download this test kernel (or newer) from http://people.redhat.com/agordeev/el5/
Detailed testing feedback is always welcomed.
If you require guidance regarding testing, please ask the bug assignee.