Bug 761984 (GLUSTER-252) - client coherence problem with locks and truncate
Summary: client coherence problem with locks and truncate
Keywords:
Status: CLOSED WONTFIX
Alias: GLUSTER-252
Product: GlusterFS
Classification: Community
Component: locks
Version: mainline
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Pavan Vilas Sondur
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: GLUSTER-386
 
Reported: 2009-09-09 01:37 UTC by Vikas Gorur
Modified: 2009-12-03 19:49 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
Python script to trigger bug (1.21 KB, application/octet-stream)
2009-09-09 19:31 UTC, Robert Millner

Description Vikas Gorur 2009-09-09 01:37:08 UTC
Reported by Robert Millner: 

We're observing a coherence issue with GlusterFS 2.0.6.  One client
opens a file, locks, truncates and writes.  Another client waiting on a
read lock may see a zero length file after the read lock is granted.

If both nodes read and write in a loop, this tends to happen within a few
hundred tries. The same code runs for 10,000 loops without a problem if
both programs run on the same node, either on GlusterFS or on a local
ext3 file system.

Node1 does the following (strace): 

2206  1252031615.509555 open("testfile", O_RDWR|O_CREAT|O_LARGEFILE, 0644) = 3
2206  1252031615.514886 fcntl64(3, F_SETLKW64, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}, 0xbfcaee78) = 0
2206  1252031615.517742 select(0, NULL, NULL, NULL, {0, 0}) = 0 (Timeout)
2206  1252031615.517788 _llseek(3, 0, [0], SEEK_SET) = 0
2206  1252031615.517829 ftruncate64(3, 0) = 0
2206  1252031615.520632 write(3, "01234567890123456789012345678901"..., 900) = 900
2206  1252031615.599782 close(3)        = 0

2206  1252031615.604731 open("testfile", O_RDONLY|O_CREAT|O_LARGEFILE, 0644) = 3
2206  1252031615.615158 fcntl64(3, F_SETLKW64, {type=F_RDLCK, whence=SEEK_SET, start=0, len=0}, 0xbfcaee78) = 0
2206  1252031615.624680 fstat64(3, {st_dev=makedev(0, 13), st_ino=182932, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=16, st_size=900, st_atime=2009/09/03-19:33:35, st_mtime=2009/09/03-19:33:35, st_ctime=2009/09/03-19:33:35}) = 0
2206  1252031615.624787 _llseek(3, 0, [0], SEEK_SET) = 0
2206  1252031615.624851 read(3, "01234567890123456789012345678901"..., 4096) = 900
2206  1252031615.625126 close(3)        = 0


Node2 does the following (strace):

2126  1252031615.504350 open("testfile", O_RDONLY|O_CREAT|O_LARGEFILE, 0644) = 3
2126  1252031615.509004 fcntl64(3, F_SETLKW64, {type=F_RDLCK, whence=SEEK_SET, start=0, len=0}, 0xbfc05dc8) = 0
2126  1252031615.587697 fstat64(3, {st_dev=makedev(0, 13), st_ino=182932, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=0, st_atime=2009/09/03-19:33:35, st_mtime=2009/09/03-19:33:35, st_ctime=2009/09/03-19:33:35}) = 0
2126  1252031615.588027 _llseek(3, 0, [0], SEEK_SET) = 0
2126  1252031615.588089 read(3, "", 4096) = 0
2126  1252031615.588228 close(3)        = 0
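
For reference, a minimal sketch of the per-node loop that produces the syscall sequence above. This is not the attached reproducer: the file name, payload, and the use of fcntl.lockf instead of a hand-packed struct flock are this sketch's own choices.

#!/usr/bin/env python
# Rough sketch of one node's loop (assumed names and sizes).  Run one
# copy per client node against the same file on the GlusterFS mount.
import fcntl
import os
import time

PATH = "testfile"            # file on the GlusterFS mount
DATA = b"0123456789" * 90    # 900 bytes, as in the strace

def write_phase():
    fd = os.open(PATH, os.O_RDWR | os.O_CREAT, 0o644)
    fcntl.lockf(fd, fcntl.LOCK_EX)      # fcntl F_SETLKW with F_WRLCK
    time.sleep(0)                       # the select(0, ...) pause in the strace
    os.lseek(fd, 0, os.SEEK_SET)
    os.ftruncate(fd, 0)
    os.write(fd, DATA)
    os.close(fd)                        # close releases the POSIX lock

def read_phase():
    fd = os.open(PATH, os.O_RDONLY | os.O_CREAT, 0o644)
    fcntl.lockf(fd, fcntl.LOCK_SH)      # fcntl F_SETLKW with F_RDLCK
    size = os.fstat(fd).st_size
    data = os.read(fd, 4096)
    os.close(fd)
    if size == 0 or not data:
        raise RuntimeError("zero-length file seen under read lock")

if __name__ == "__main__":
    for _ in range(10000):
        write_phase()
        read_phase()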



Both node clocks are NTP disciplined. As these are virtual machines there
is higher clock dispersion, but I believe the timestamps can be correlated
to the nearest 0.1 s.

Node2 waits for the write lock to clear before getting its read lock.
Node1 also tries to read the file and agrees with node2 on every stat
field except st_size.  Node2 tries to read the file and gets no data.

This is on 32 bit CentOS5 with a 2.6.27 kernel, fuse 2.7.4 on VMware.
Also observed on Amazon EC2 with their 2.6.21 fc8xen kernel.

I can make the problem unrepeatable in 10,000 tries by changing the
select on Node1 to time out after 0.1 seconds. The problem repeats in
under 5,000 tries if the select timeout is set to 0.01 seconds.

This happens whether or not gluster is run with
--disable-direct-io-mode.

The volume is mirrored between four servers.  Below is the server
configuration.  The export directory is on ext3.

volume posix
  type storage/posix
  option directory /var/data/export
end-volume

volume locks
  type features/locks
  option mandatory-locks on
  subvolumes posix
end-volume

volume brick
  type performance/io-threads
  option thread-count 8
  subvolumes locks
end-volume

volume server
  type protocol/server
  option transport-type tcp
  option auth.addr.brick.allow *
  subvolumes brick
end-volume


And the client configuration:

volume remote1
  type protocol/client
  option transport-type tcp
  option remote-host 10.10.10.145
  option remote-subvolume brick
end-volume

volume remote2
  type protocol/client
  option transport-type tcp
  option remote-host 10.10.10.130
  option remote-subvolume brick
end-volume

volume remote3
  type protocol/client
  option transport-type tcp
  option remote-host 10.10.10.221
  option remote-subvolume brick
end-volume

volume remote4
  type protocol/client
  option transport-type tcp
  option remote-host 10.10.10.104
  option remote-subvolume brick
end-volume

volume replicated
  type cluster/replicate
  subvolumes remote1 remote2 remote3 remote4
end-volume

volume writebehind
    type performance/write-behind
    subvolumes replicated
end-volume

volume cache
    type performance/io-cache
    subvolumes writebehind
end-volume



The problem persists with these configurations, and also if any or all of
the following tweaks are made:

1. Remove the replicated volume and just use remote1.
2. Get rid of threads on the server.
3. Get rid of io-cache and writebehind on the clients.
4. Use mandatory locking on the test file.

Comment 1 Robert Millner 2009-09-09 19:31:33 UTC
Created attachment 65 [details]
Python script to trigger bug

On a 64-bit system, the struct.pack call may have to be changed, since the struct flock layout differs between architectures.
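
For illustration, here is a hand-packed F_SETLKW request next to the portable fcntl.lockf equivalent; the 'hhllhh' format below is only one possible 32-bit layout and is not taken from the attached script.

import fcntl
import os
import struct

fd = os.open("testfile", os.O_RDWR)

# Hand-packed struct flock: (l_type, l_whence, l_start, l_len, l_pid, pad).
# This 'hhllhh' layout is only valid on some 32-bit Linux systems.
lockdata = struct.pack('hhllhh', fcntl.F_WRLCK, os.SEEK_SET, 0, 0, 0, 0)
fcntl.fcntl(fd, fcntl.F_SETLKW, lockdata)

# Portable alternative: let the fcntl module build struct flock itself.
fcntl.lockf(fd, fcntl.LOCK_EX)          # same F_SETLKW/F_WRLCK request

os.close(fd)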

Comment 2 Vikas Gorur 2009-09-10 08:47:46 UTC
Thanks for the tool. We have been able to reproduce the problem and are working on it.

Comment 3 Pavan Vilas Sondur 2009-11-26 05:03:12 UTC
Avati has been able to debug this issue and has found the root cause as well. The problem appears to be:
The script has two sections of code under a lock: truncate+write and fstat+read. The open call is outside the 'locked' sections. When at least two instances of the script are run with the clients on different machines, one instance's open call, which triggers a lookup, can fall after the truncate and before the write in the other instance.

This is perfectly normal; however, the later fstat is then served from the FUSE attribute cache with the stbuf received from that lookup. This was confirmed by a lookup_cbk returning an stbuf with st_size == 0 and op_ret == 0 while the scripts were running. Setting the fuse option --attr-timeout=0 should have solved this problem.

However, it turns out there is a bug in the fuse kernel module, and it does not honor an attr-timeout of 0.

Since the bug is not in the glusterfs code per se, I'm marking it as 'wontfix'. Related patches and further discussion of the fix will also be posted on this bug.

Comment 4 Anand Avati 2009-12-03 16:49:09 UTC
PATCH: http://patches.gluster.com/patch/2521 in master (glusterfsd: Handle argument attribute-timeout when set to 0)

Comment 5 Anand Avati 2009-12-03 16:49:13 UTC
PATCH: http://patches.gluster.com/patch/2522 in master (mount/fuse: Handle the case when attribute_timeout is set to 0 in dictionary.)

