GlusterFS does not work well with MS Office 2010 when Samba is configured with "posix locking = yes". Despite the documented instructions to the contrary (Gluster 3.2: Adding the CIFS Protocol, http://gluster.org/community/documentation/index.php/Gluster_3.2:_Adding_the_CIFS_Protocol), the Samba "posix locking" option cannot be set to "yes" when GlusterFS is used.

The problem can be seen when attempting to open a large (e.g. 25 MB) .doc file stored on a GlusterFS NAS volume through a Samba CIFS share. The file cannot be opened with MS Office Word 2010 from Windows 2008 R2 or Windows 7: Word hangs for a long time at 0% downloaded and then fails with error code 0x700468. The failure occurs with multiple client machines.

By contrast, all of the following work fine:

- Opening the same .doc file from a Windows 2000 or Windows XP client with MS Office Word 2003.
- Opening the same .doc file from a Windows 7 client with MS Office Word 2010 when the file is on a Windows 2000 file server.
- Opening the same .doc file from a Windows 7 client with MS Office Word 2010 when Samba bypasses GlusterFS and exports the brick storage directly.

Only when GlusterFS is in the loop does the problem occur.

A Gluster user has documented Samba settings that do work for this case ([Gluster-users] Using Samba and MSOffice files, http://gluster.org/pipermail/gluster-users/2011-November/009127.html):

    posix locking = no
    directory mask = 2770
    force directory mode = 2770
    create mask = 2770
    force create mode = 2770
    force security mode = 2770
    force directory security mode = 2770
    force group = <your group>
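For reference, those settings go into the share definition in smb.conf. A minimal sketch (the share name and path are placeholders, and "force group" keeps the <your group> placeholder from the post above):

    [glustervol]
        path = /mnt/glustervol
        read only = no
        posix locking = no
        directory mask = 2770
        force directory mode = 2770
        create mask = 2770
        force create mode = 2770
        force security mode = 2770
        force directory security mode = 2770
        force group = <your group>

The key line is "posix locking = no": it stops smbd from mapping Windows byte-range locks onto fcntl() locks on the underlying (GlusterFS-backed) filesystem, which is the code path where the failures described below occur.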
Can you please check with the 3.3.0 release and see if it's fixed?
Can you point me to the change list, or to where in the source the fix was implemented, so that when I have time I can add logging to make sure that the code path with the fix was executed? Thanks.
Jeff, this bug has not been reproduced locally, and we do not have the logs from when you faced the issue, so we have not yet root-caused it to any specific 'bug'. We would like to know if you continue to see the problem with version 3.3.0; if so, we could spend effort on understanding what is causing it now (given that a few fixes in the locks xlator have made it into the code since the 3.2.0 release). Thanks
I use RHEL 6.4 x86_64 with:

    glusterfs-fuse-3.3.0-1.el6.x86_64
    glusterfs-server-3.3.0-1.el6.x86_64
    glusterfs-3.3.0-1.el6.x86_64
    samba-common-3.5.10-125.el6.x86_64
    samba-3.5.10-125.el6.x86_64
    samba-winbind-clients-3.5.10-125.el6.x86_64

and ext4 filesystems as bricks in a distributed-replicated volume. I see the same problem accessing Office 2010 files through Samba on a Gluster volume. Here are some log extracts.

First, the Samba log:

[2012/08/30 16:15:31.969035, 10] locking/locking.c:260(do_lock)
  do_lock: lock flavour WINDOWS_LOCK lock type WRITE start=2147483539 len=1 blocking_lock=true requested for fnum 20081 file lpar_report_20120809.xls
[2012/08/30 16:15:31.969072, 10] lib/dbwrap_tdb.c:100(db_tdb_fetch_locked)
  Locking key 13000000000000000145
[2012/08/30 16:15:31.969117, 10] lib/dbwrap_tdb.c:129(db_tdb_fetch_locked)
  Allocated locked data 0x0x7fe8b46e29a0
[2012/08/30 16:15:31.969133, 10] locking/brlock.c:1896(brl_get_locks_internal)
  brl_get_locks_internal: 1 current locks on file_id 13:8f042d8895574501:0
[2012/08/30 16:15:31.969165, 10] locking/brlock.c:48(print_lock_struct)
  [0]: smbpid = 65279, tid = 1, pid = 2803, start = 2147483539, size = 1, fnum = 20081, PENDING_WRITE WINDOWS_LOCK
[2012/08/30 16:15:31.969197, 5] locking/posix.c:968(set_posix_lock_windows_flavour)
  set_posix_lock_windows_flavour: File lpar_report_20120809.xls, offset = 2147483539, count = 1, type = WRITE
[2012/08/30 16:15:31.969224, 10] locking/posix.c:172(posix_lock_in_range)
  posix_lock_in_range: offset_out = 2147483539, count_out = 1
[2012/08/30 16:15:31.969245, 10] locking/posix.c:739(posix_lock_list)
  posix_lock_list: curr: start=2147483539,size=1
[2012/08/30 16:15:31.969261, 5] locking/posix.c:1046(set_posix_lock_windows_flavour)
  set_posix_lock_windows_flavour: Real lock: Type = WRITE: offset = 2147483539, count = 1
[2012/08/30 16:15:31.969279, 8] locking/posix.c:197(posix_fcntl_lock)
  posix_fcntl_lock 26 6 2147483539 1 1
[2012/08/30 16:15:31.969296, 8] ../lib/util/util.c:217(fcntl_lock)
  fcntl_lock 26 6 2147483539 1 1
[2012/08/30 16:15:31.970195, 3] ../lib/util/util.c:228(fcntl_lock)
  fcntl_lock: fcntl lock gave errno 11 (Resource temporarily unavailable)
[2012/08/30 16:15:31.970235, 3] ../lib/util/util.c:247(fcntl_lock)
  fcntl_lock: lock failed at offset 2147483539 count 1 op 6 type 1 (Resource temporarily unavailable)
[2012/08/30 16:15:31.970259, 8] locking/posix.c:227(posix_fcntl_lock)
  posix_fcntl_lock: Lock call failed
[2012/08/30 16:15:31.970275, 5] locking/posix.c:1051(set_posix_lock_windows_flavour)
  set_posix_lock_windows_flavour: Lock fail !: Type = WRITE: offset = 2147483539, count = 1. Errno = Resource temporarily unavailable
[2012/08/30 16:15:31.970298, 10] locking/locking.c:280(do_lock)
  do_lock: returning status=NT_STATUS_FILE_LOCK_CONFLICT
[2012/08/30 16:15:31.970315, 10] lib/dbwrap_tdb.c:42(db_tdb_record_destr)
  Unlocking key 13000000000000000145
[2012/08/30 16:15:31.970370, 10] smbd/blocking.c:476(process_lockingX)
  process_lockingX: only got 0 locks of 3 needed for file lpar_report_20120809.xls, fnum = 20081. Waiting....
[2012/08/30 16:15:31.970394, 10] smbd/blocking.c:123(recalc_brl_timeout)
[2012/08/30 16:15:31.970404, 10] smbd/blocking.c:129(recalc_brl_timeout)
  Next timeout = 4.999991 seconds from now.
[2012/08/30 16:15:36.970532, 10] lib/events.c:131(run_events)
  Running timed event "brl_timeout_fn" 0x7fe8b46679d0
[2012/08/30 16:15:36.970604, 3] smbd/sec_ctx.c:310(set_sec_ctx)
  setting sec ctx (0, 0) - sec_ctx_stack_ndx = 0
[2012/08/30 16:15:36.970623, 5] auth/token_util.c:525(debug_nt_user_token)
  NT user token: (NULL)
[2012/08/30 16:15:36.970637, 5] auth/token_util.c:551(debug_unix_user_token)
  UNIX token of user 0
  Primary group is 0 and contains 0 supplementary groups
[2012/08/30 16:15:36.970681, 5] smbd/uid.c:369(change_to_root_user)
  change_to_root_user: now uid=(0,0) gid=(0,0)
[2012/08/30 16:15:36.970703, 10] smbd/blocking.c:690(process_blocking_lock_queue)
  Processing BLR = 0x7fe8b46bc190
[2012/08/30 16:15:36.970719, 10] locking/locking.c:260(do_lock)
  do_lock: lock flavour WINDOWS_LOCK lock type WRITE start=2147483539 len=1 blocking_lock=true requested for fnum 20081 file lpar_report_20120809.xls
[2012/08/30 16:15:36.970756, 10] lib/dbwrap_tdb.c:100(db_tdb_fetch_locked)
  Locking key 13000000000000000145
[2012/08/30 16:15:36.970789, 10] lib/dbwrap_tdb.c:129(db_tdb_fetch_locked)
  Allocated locked data 0x0x7fe8b46e29a0

Second, the trace output from the locks translator of the Gluster volume (/var/log/glusterfs/bricks/export1.log):

[2012-08-30 16:15:11.926498] I [common.c:323:pl_trace_out] 0-clvolume1-locks: [GRANTED] Locker = {Pid=2803, lk-owner=7c258ce5198bf1ee, Transport=0x1460760, Frame=139} Lockee = {gfid=f093990d-e7c9-4a95-8f04-2d8895574501, fd=0x1444afc, path=/test/lpar_report_20120809.xls} Lock = {lock=FCNTL, cmd=SETLKW, type=READ, start=23596, len=0, pid=2803, lk-owner=7c258ce5198bf1ee}
[2012-08-30 16:15:11.944204] I [common.c:271:pl_trace_in] 0-clvolume1-locks: [REQUEST] Locker = {Pid=2803, lk-owner=9c1b8784017e12dd, Transport=0x1460760, Frame=140} Lockee = {gfid=f093990d-e7c9-4a95-8f04-2d8895574501, fd=0x1444afc, path=/test/lpar_report_20120809.xls} Lock = {lock=FCNTL, cmd=SETLK, type=WRITE, start=2147483539, len=1, pid=2803, lk-owner=9c1b8784017e12dd}
[2012-08-30 16:15:11.944268] I [common.c:323:pl_trace_out] 0-clvolume1-locks: [TRYAGAIN] Locker = {Pid=2803, lk-owner=9c1b8784017e12dd, Transport=0x1460760, Frame=140} Lockee = {gfid=f093990d-e7c9-4a95-8f04-2d8895574501, fd=0x1444afc, path=/test/lpar_report_20120809.xls} Lock = {lock=FCNTL, cmd=SETLK, type=WRITE, start=2147483539, len=1, pid=2803, lk-owner=9c1b8784017e12dd}
[2012-08-30 16:15:16.950534] I [common.c:271:pl_trace_in] 0-clvolume1-locks: [REQUEST] Locker = {Pid=2803, lk-owner=9c1b8784017e12dd, Transport=0x1460760, Frame=141} Lockee = {gfid=f093990d-e7c9-4a95-8f04-2d8895574501, fd=0x1444afc, path=/test/lpar_report_20120809.xls} Lock = {lock=FCNTL, cmd=SETLK, type=WRITE, start=2147483539, len=1, pid=2803, lk-owner=9c1b8784017e12dd}
[2012-08-30 16:15:16.950633] I [common.c:323:pl_trace_out] 0-clvolume1-locks: [TRYAGAIN] Locker = {Pid=2803, lk-owner=9c1b8784017e12dd, Transport=0x1460760, Frame=141} Lockee = {gfid=f093990d-e7c9-4a95-8f04-2d8895574501, fd=0x1444afc, path=/test/lpar_report_20120809.xls} Lock = {lock=FCNTL, cmd=SETLK, type=WRITE, start=2147483539, len=1, pid=2803, lk-owner=9c1b8784017e12dd}
[2012-08-30 16:15:21.956933] I [common.c:271:pl_trace_in] 0-clvolume1-locks: [REQUEST] Locker = {Pid=2803, lk-owner=9c1b8784017e12dd, Transport=0x1460760, Frame=142} Lockee = {gfid=f093990d-e7c9-4a95-8f04-2d8895574501, fd=0x1444afc, path=/test/lpar_report_20120809.xls} Lock = {lock=FCNTL, cmd=SETLK, type=WRITE, start=2147483539, len=1, pid=2803, lk-owner=9c1b8784017e12dd}
[2012-08-30 16:15:21.957299] I [common.c:323:pl_trace_out] 0-clvolume1-locks: [TRYAGAIN] Locker = {Pid=2803, lk-owner=9c1b8784017e12dd, Transport=0x1460760, Frame=142} Lockee = {gfid=f093990d-e7c9-4a95-8f04-2d8895574501, fd=0x1444afc, path=/test/lpar_report_20120809.xls} Lock = {lock=FCNTL, cmd=SETLK, type=WRITE, start=2147483539, len=1, pid=2803, lk-owner=9c1b8784017e12dd}
[2012-08-30 16:15:26.963536] I [common.c:271:pl_trace_in] 0-clvolume1-locks: [REQUEST] Locker = {Pid=2803, lk-owner=9c1b8784017e12dd, Transport=0x1460760, Frame=143} Lockee = {gfid=f093990d-e7c9-4a95-8f04-2d8895574501, fd=0x1444afc, path=/test/lpar_report_20120809.xls} Lock = {lock=FCNTL, cmd=SETLK, type=WRITE, start=2147483539, len=1, pid=2803, lk-owner=9c1b8784017e12dd}
[2012-08-30 16:15:26.963631] I [common.c:323:pl_trace_out] 0-clvolume1-locks: [TRYAGAIN] Locker = {Pid=2803, lk-owner=9c1b8784017e12dd, Transport=0x1460760, Frame=143} Lockee = {gfid=f093990d-e7c9-4a95-8f04-2d8895574501, fd=0x1444afc, path=/test/lpar_report_20120809.xls} Lock = {lock=FCNTL, cmd=SETLK, type=WRITE, start=2147483539, len=1, pid=2803, lk-owner=9c1b8784017e12dd}
[2012-08-30 16:15:31.969699] I [common.c:271:pl_trace_in] 0-clvolume1-locks: [REQUEST] Locker = {Pid=2803, lk-owner=9c1b8784017e12dd, Transport=0x1460760, Frame=144} Lockee = {gfid=f093990d-e7c9-4a95-8f04-2d8895574501, fd=0x1444afc, path=/test/lpar_report_20120809.xls} Lock = {lock=FCNTL, cmd=SETLK, type=WRITE, start=2147483539, len=1, pid=2803, lk-owner=9c1b8784017e12dd}
[2012-08-30 16:15:31.970044] I [common.c:323:pl_trace_out] 0-clvolume1-locks: [TRYAGAIN] Locker = {Pid=2803, lk-owner=9c1b8784017e12dd, Transport=0x1460760, Frame=144} Lockee = {gfid=f093990d-e7c9-4a95-8f04-2d8895574501, fd=0x1444afc, path=/test/lpar_report_20120809.xls} Lock = {lock=FCNTL, cmd=SETLK, type=WRITE, start=2147483539, len=1, pid=2803, lk-owner=9c1b8784017e12dd}
[2012-08-30 16:15:36.971559] I [common.c:271:pl_trace_in] 0-clvolume1-locks: [REQUEST] Locker = {Pid=2803, lk-owner=9c1b8784017e12dd, Transport=0x1460760, Frame=145} Lockee = {gfid=f093990d-e7c9-4a95-8f04-2d8895574501, fd=0x1444afc, path=/test/lpar_report_20120809.xls} Lock = {lock=FCNTL, cmd=SETLK, type=WRITE, start=2147483539, len=1, pid=2803, lk-owner=9c1b8784017e12dd}
[2012-08-30 16:15:36.971689] I [common.c:323:pl_trace_out] 0-clvolume1-locks: [TRYAGAIN] Locker = {Pid=2803, lk-owner=9c1b8784017e12dd, Transport=0x1460760, Frame=145} Lockee = {gfid=f093990d-e7c9-4a95-8f04-2d8895574501, fd=0x1444afc, path=/test/lpar_report_20120809.xls} Lock = {lock=FCNTL, cmd=SETLK, type=WRITE, start=2147483539, len=1, pid=2803, lk-owner=9c1b8784017e12dd}

With kind regards,
Andreas Schmechtig
System Developer
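For concreteness, the operation that keeps failing in both logs corresponds to an fcntl() byte-range lock like the sketch below. This is an illustration reconstructed from the log values, not code from Samba or GlusterFS; the file path is the one from the trace. On the Gluster-backed share this call fails with errno 11 (EAGAIN), which smbd then reports as NT_STATUS_FILE_LOCK_CONFLICT:

/* Sketch: the one-byte POSIX write lock that fails in the logs above.
 * Offset 2147483539 sits just below the 2 GiB boundary, in the range
 * MS Office uses for its advisory sharing-lock bytes. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Path taken from the trace; adjust to wherever the share is mounted. */
    int fd = open("/test/lpar_report_20120809.xls", O_RDWR);
    if (fd == -1) { perror("open"); return 1; }

    struct flock fl = {
        .l_type   = F_WRLCK,        /* type = WRITE in the log       */
        .l_whence = SEEK_SET,
        .l_start  = 2147483539,     /* start = 2147483539 in the log */
        .l_len    = 1,              /* len = 1 in the log            */
    };

    if (fcntl(fd, F_SETLK, &fl) == -1)      /* cmd = SETLK in the log */
        fprintf(stderr, "lock failed: %s (errno %d)\n",
                strerror(errno), errno);    /* errno 11 = EAGAIN here */
    else
        puts("lock granted");
    return 0;
}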
Oh, it is RHEL 6.3 and not 6.4. If you need further logs or any test activity, do not hesitate to contact me. This bug forces many of our employees to copy Excel files from the Samba share on Gluster to their local desktop, edit the files there, and copy them back to the Samba share.

With kind regards,
Andreas Schmechtig
System Developer
Note: a possible theory. The two locks in play are:

    READ LOCK:  pid = 2803, lk-owner = 7c258ce5198bf1ee, fd = 0x1444afc
    WRITE LOCK: pid = 2803, lk-owner = 9c1b8784017e12dd, fd = 0x1444afc

One reason why this (same pid, different lk-owners) can happen is that the read lock was acquired via flock() [a BSD lock] while the write lock is attempted via fcntl() [a POSIX lock]. On Linux the two lock spaces are orthogonal, and Samba is probably expecting the write lock to be granted in the "posix space" without contending with the "bsd space" read lock. However, I notice that in our fuse-bridge.c we are ignoring the ff->flock field in setlk/getlk, merging both styles of lock into the same lock space (hence they end up contending because of the mismatching lk-owners). This needs to be confirmed.

Possible fixes:

1. Pass the flock flag through to the locks xlator and create a new 'locking domain' for flock.

2. Add an extra condition in xlators/features/locks/src/common.c:same_owner() - treat the owner as the same if either the lk-owners match or the pids match, given a common transport - and only insert, but not merge, the two locks from the different lk-owners. This lets us remain agnostic to the bsd/posix locking spaces (until we stumble upon that improbable app which holds two fds - one inherited from its parent and one opened by itself - and expects flock() on them to contend).

I personally prefer option 1. :-)
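To make the orthogonality concrete, here is a minimal sketch (an illustration, not code from the report; the test file name is hypothetical). On a local Linux filesystem both calls succeed, because flock() and fcntl() locks never contend with each other; under the theory above, the same program run against a FUSE-mounted GlusterFS volume would see the fcntl() call fail with EAGAIN, because both locks land in one lock space with mismatching lk-owners:

/* Sketch: take a BSD-style shared lock and then a POSIX write lock on
 * the same open file. On a local Linux filesystem both succeed; if the
 * filesystem merges the two lock spaces, the second call fails. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>

int main(void)
{
    int fd = open("locktest.dat", O_RDWR | O_CREAT, 0644); /* hypothetical file */
    if (fd == -1) { perror("open"); return 1; }

    if (flock(fd, LOCK_SH) == -1) {          /* "bsd space" read lock */
        perror("flock");
        return 1;
    }

    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                        .l_start = 0, .l_len = 1 };
    if (fcntl(fd, F_SETLK, &fl) == -1)       /* "posix space" write lock */
        perror("fcntl (EAGAIN if the two spaces were merged)");
    else
        puts("fcntl lock granted: the two lock spaces are orthogonal");
    return 0;
}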
Just noticed - the thing which probably refutes the theory in the previous comment is that the read lock has an offset (23596), whereas flock() is a full-file lock. Given this, I cannot think of a situation where different lk-owners could be set for two lock requests at different offsets (which eliminates flock) on the same fd from the same PID (which eliminates fcntl). I suggest we start by getting an 'strace -f' of smbd for the session.
I will try to generate an 'strace -f' of smbd for the session (direct access of an Excel file) tomorrow.
Created attachment 608548 [details] output of strace smbd
Created attachment 608549 [details] Debug Level 10 Samba Log
Created attachment 608550 [details] Trace output of the lock translator from the gluster volume clvolume1
The output provided seems to have been taken with 'strace -p <pid>' without the -f flag (either that, or the wrong smbd was traced). There are no traces of any of the file access.
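For example, something along these lines will follow the forked children as well (the <pid> placeholder is the smbd process serving the client; -ttt adds timestamps and -o writes the trace to a file):

    strace -f -ttt -o /tmp/smbd.trace -p <pid>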
Hello Anand, excuse me, you are right. I forgot to use the -f parameter when tracing smbd. I will provide a new set of log files from today (2012/09/03).

With kind regards,
Andreas Schmechtig
Created attachment 609273 [details] Gluster volume log
Created attachment 609275 [details] Samba Log
Created attachment 609277 [details] Gluster volume log
Created attachment 609279 [details] Samba strace
Hello Anand, could you check the logs from the last 3 months? Please give us an update on this issue and on when it will be solved. We need a solution by the end of the year. Regards, Hermann
Hermann, we suspect this was due to a locking-related issue inside glusterfs, which should be fixed by http://review.gluster.org/4164. That patch is only in the upstream codebase (3.4.0qa6 contains it). If you can confirm that it fixes the issue, we can backport it to the 3.3.x branch too.
Hi, this fix hasn't helped with fcntl locking when running CTDB. There is a tool shipped with the package called "ping_pong" which, when run on two nodes sharing the same GlusterFS file, is supposed to alternately lock the shared file: https://wiki.samba.org/index.php/Ping_pong http://junkcode.samba.org/ftp/unpacked/junkcode/ping_pong.c This fcntl locking does not work on a GlusterFS volume mounted via FUSE or NFS. I have tried the latest 3.4.2 version along with 3.5.0beta2 and it still doesn't work. Thanks, Rich
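For anyone unfamiliar with the tool, the core of what ping_pong exercises is roughly the following (a simplified sketch, not the actual ping_pong.c; run one instance per node against the same file on the shared mount):

/* Simplified sketch of the ping_pong fcntl test. Each instance loops,
 * taking a blocking one-byte write lock on the "next" byte and then
 * releasing the previous one. With two instances on two nodes, a
 * correctly behaving cluster filesystem makes them alternate and the
 * counter climbs steadily; on a broken one, an instance blocks forever. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void lock_byte(int fd, short type, off_t off)
{
    struct flock fl = { .l_type = type, .l_whence = SEEK_SET,
                        .l_start = off, .l_len = 1 };
    if (fcntl(fd, F_SETLKW, &fl) == -1) {   /* block until granted */
        perror("fcntl");
        exit(1);
    }
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <shared file>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDWR | O_CREAT, 0644);
    if (fd == -1) { perror("open"); return 1; }

    unsigned long count = 0;
    int i = 0;
    for (;;) {
        lock_byte(fd, F_WRLCK, (i + 1) % 3);   /* grab the next byte...    */
        lock_byte(fd, F_UNLCK, i);             /* ...then drop the old one */
        i = (i + 1) % 3;
        if (++count % 100000 == 0)
            printf("%lu lock/unlock cycles\n", count);
    }
}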
Could someone confirm that this fix is in the 3.5.0 beta releases? If so, it still doesn't work ;-(
Ah, the problem I have had with the 3.5.0beta releases is that I'm unable to create a distributed setup: adding the 2nd server to the volume fails at the "peer probe" stage.
The version that this bug was reported against no longer receives updates from the Gluster Community. Please verify whether this report is still valid against a current (3.4, 3.5 or 3.6) release and update the version, or close this bug. If there is no update before 9 December 2014, this bug will be closed automatically.