Bug 1312833

Summary: [RHEL-7] gluster-swift: Fix exceptions raised when there are mutliple instances
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Prashanth Pai <ppai>
Component: gluster-swiftAssignee: Prashanth Pai <ppai>
Status: CLOSED ERRATA QA Contact: surabhi <sbhaloth>
Severity: medium Docs Contact:
Priority: medium    
Version: rhgs-3.1CC: nlevinki, rcyriac, rhinduja, rhs-bugs, thiago
Target Milestone: ---Keywords: ZStream
Target Release: RHGS 3.1.3   
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: swiftonfile-2.3.0-3 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1328445 (view as bug list) Environment:
Last Closed: 2016-06-23 05:31:19 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1299184, 1328445    

Description Prashanth Pai 2016-02-29 10:50:38 UTC
Description of problem:
When there are multiple instances of gluster-swift instances running, all of them doing I/O on the same volume from different FUSE mounts, some operations like rename() can raise exceptions like EIO/EAGAIN which are not handled by gluster-swift code. This will result in PUT being failed.

The fix would be to identify the exceptions that can be raised in multi-node setup and make sure gluster-swift code handles them.

Example: In older versions of glusterfs rename() was a blocking operation and in recent versions it isn't i.e it will throw EAGAIN to tell client to try after some time.

Comment 2 Prashanth Pai 2016-03-10 09:00:24 UTC
RCA: 

The behaviour of glusterfs for renames has changed in recent versions. rename() used to be a blocking operation and now it isn't. More details can be found here: https://www.spinics.net/lists/gluster-devel/msg18250.html

W.r.t rename(), the scope of fixing this BZ is limited to making sure gluster-swift code takes cognizance of this change and handles rename() as non-blocking operation. gluster-swift does not expect rename() to throw a EBUSY/EAGAIN/ESTALE today. We need to handle that.

Ideally, we want the old behavior in glusterfs i.e rename() be blocking but fixing that is beyond the scope of this BZ.

This BZ also tracks further effort (if any) and work required to make gluster-swift code compatible with glusterfs behaviour when multiple FUSE mounts are used. rename() is one such case that I've seen in one of the customer issues. There could be others.

Comment 3 Prashanth Pai 2016-03-10 09:32:40 UTC
The blocking rename() issue has been tracked in this upstream glusterfs BZ here: https://bugzilla.redhat.com/show_bug.cgi?id=1304966

Comment 4 Prashanth Pai 2016-04-14 09:39:44 UTC
The non-blocking rename issue seems to be fixed in GlusterFS. In a multi-node test with haproxy as load balancer and 3 other node with PACO server running, following exceptions were seen in logs which are related to situation when space in volume is full.

Apr 14 12:55:03 rhs312el7-2 object-server[18055]: ERROR __call__ error with PUT /test/0/AUTH_test/c1/37/ad/123a178f-2e9b-4759-9ac7-6629c62733a9/26b555302e5fee68c2a9ca1aa11cbb45.dat : #012Traceback (most recent call last):#012  File "/usr/lib/python2.7/site-packages/swift/obj/server.py", line 904, in __call__#012    res = method(req)#012  File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 2648, in wrapped#012    return func(*a, **kw)#012  File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 1205, in _timing_stats#012    resp = func(ctrl, *args, **kwargs)#012  File "/usr/lib/python2.7/site-packages/gluster/swift/obj/server.py", line 83, in PUT#012    return server.ObjectController.PUT(self, request)#012  File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 2648, in wrapped#012    return func(*a, **kw)#012  File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 1205, in _timing_stats#012    resp = func(ctrl, *args, **kwargs)#012  File "/usr/lib/python2.7/site-packages/swift/obj/server.py", line 533, in PUT#012    with disk_file.create(size=fsize) as writer:#012  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__#012    return self.gen.next()#012  File "/usr/lib/python2.7/site-packages/gluster/swift/obj/diskfile.py", line 896, in create#012    self._create_dir_object(self._obj_path)#012  File "/usr/lib/python2.7/site-packages/gluster/swift/obj/diskfile.py", line 798, in _create_dir_object#012    ret, newmd = make_directory(cur_path, self._uid, self._gid, md)#012  File "/usr/lib/python2.7/site-packages/gluster/swift/obj/diskfile.py", line 156, in make_directory#012    " path %s (%s)" % (full_path, str(err)))#012DiskFileError: make_directory: mkdir failed on path /mnt/gluster-object/test/c1/37/ad/123a178f-2e9b-4759-9ac7-6629c62733a9 ([Errno 28] No space left on device: '/mnt/gluster-object/test/c1/37/ad/123a178f-2e9b-4759-9ac7-6629c62733a9') (txn: tx77d90959c13e45f79dba4-00570f45c9)

Apr 14 12:54:59 rhs312el7-2 object-server[18054]: ERROR __call__ error with PUT /test/0/AUTH_test/c1/a7/02/e5f557c5-87aa-450d-b852-2fc2b3866f0e/9ee80cd8afeeabb65fe38eeba3ab40b7.dat : #012Traceback (most recent call last):#012  File "/usr/lib/python2.7/site-packages/swift/obj/server.py", line 904, in __call__#012    res = method(req)#012  File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 2648, in wrapped#012    return func(*a, **kw)#012  File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 1205, in _timing_stats#012    resp = func(ctrl, *args, **kwargs)#012  File "/usr/lib/python2.7/site-packages/gluster/swift/obj/server.py", line 83, in PUT#012    return server.ObjectController.PUT(self, request)#012  File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 2648, in wrapped#012    return func(*a, **kw)#012  File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 1205, in _timing_stats#012    resp = func(ctrl, *args, **kwargs)#012  File "/usr/lib/python2.7/site-packages/swift/obj/server.py", line 533, in PUT#012    with disk_file.create(size=fsize) as writer:#012  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__#012    return self.gen.next()#012  File "/usr/lib/python2.7/site-packages/gluster/swift/obj/diskfile.py", line 909, in create#012    dw.close()#012AttributeError: 'NoneType' object has no attribute 'close' (txn: txcca05d5ffdf04632a042f-00570f45ca)

Apr 14 12:55:04 rhs312el7-2 object-server[18055]: ERROR __call__ error with PUT /test/0/AUTH_test/c1/eb/e8/bf21d1e5-2001-4ea6-a8c9-2926d8cace34/77e14470a18907d356869f96c8925932.dat : #012Traceback (most recent call last):#012  File "/usr/lib/python2.7/site-packages/swift/obj/server.py", line 904, in __call__#012    res = method(req)#012  File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 2648, in wrapped#012    return func(*a, **kw)#012  File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 1205, in _timing_stats#012    resp = func(ctrl, *args, **kwargs)#012  File "/usr/lib/python2.7/site-packages/gluster/swift/obj/server.py", line 83, in PUT#012    return server.ObjectController.PUT(self, request)#012  File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 2648, in wrapped#012    return func(*a, **kw)#012  File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 1205, in _timing_stats#012    resp = func(ctrl, *args, **kwargs)#012  File "/usr/lib/python2.7/site-packages/swift/obj/server.py", line 533, in PUT#012    with disk_file.create(size=fsize) as writer:#012  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__#012    return self.gen.next()#012  File "/usr/lib/python2.7/site-packages/gluster/swift/obj/diskfile.py", line 841, in create#012    os.O_WRONLY | os.O_CREAT | os.O_EXCL | O_CLOEXEC)#012  File "/usr/lib/python2.7/site-packages/gluster/swift/common/fs_utils.py", line 245, in do_open#012    err.strerror, path, flags, kwargs))#012GlusterFileSystemOSError: [Errno 116] Stale file handle, os.open("/mnt/gluster-object/test/c1/eb/e8/bf21d1e5-2001-4ea6-a8c9-2926d8cace34/.77e14470a18907d356869f96c8925932.dat.17e98fafc90344448ff719a79a9a3ae4", 800c1, {}) (txn: tx2a8c3ac889ce49eca0412-00570f45cd)

Apr 14 12:55:00 rhs312el7-4 object-server[18035]: ERROR __call__ error with PUT /test/0/AUTH_test/c1/89/d5/7964ad1a-5ff9-4b8a-bde5-001f11367511/e838c0a8a7ae0077755cea490e2a7eba.dat : #012Traceback (most recent call last):#012  File "/usr/lib/python2.7/site-packages/swift/obj/server.py", line 904, in __call__#012    res = method(req)#012  File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 2648, in wrapped#012    return func(*a, **kw)#012  File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 1205, in _timing_stats#012    resp = func(ctrl, *args, **kwargs)#012  File "/usr/lib/python2.7/site-packages/gluster/swift/obj/server.py", line 83, in PUT#012    return server.ObjectController.PUT(self, request)#012  File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 2648, in wrapped#012    return func(*a, **kw)#012  File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 1205, in _timing_stats#012    resp = func(ctrl, *args, **kwargs)#012  File "/usr/lib/python2.7/site-packages/swift/obj/server.py", line 533, in PUT#012    with disk_file.create(size=fsize) as writer:#012  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__#012    return self.gen.next()#012  File "/usr/lib/python2.7/site-packages/gluster/swift/obj/diskfile.py", line 841, in create#012    os.O_WRONLY | os.O_CREAT | os.O_EXCL | O_CLOEXEC)#012  File "/usr/lib/python2.7/site-packages/gluster/swift/common/fs_utils.py", line 245, in do_open#012    err.strerror, path, flags, kwargs))#012GlusterFileSystemOSError: [Errno 116] Stale file handle, os.open("/mnt/gluster-object/test/c1/89/d5/7964ad1a-5ff9-4b8a-bde5-001f11367511/.e838c0a8a7ae0077755cea490e2a7eba.dat.10e5a0ff2e344f4598c531bc34e24cbd", 800c1, {}) (txn: txb3d41d5925d54565816eb-00570f45c9)

Apr 14 12:55:00 rhs312el7-4 object-server[18035]: ERROR __call__ error with PUT /test/0/AUTH_test/c1/af/e2/786e3c0f-7048-4707-8536-d50a295f1784/51a49aaf9944d941581c2561181f802b.dat : [Errno 28] No space left on device, os.rename("/mnt/gluster-object/test/c1/af/e2/786e3c0f-7048-4707-8536-d50a295f1784/.51a49aaf9944d941581c2561181f802b.dat.612bdc5e472441be9db253a8b4410717", "/mnt/gluster-object/test/c1/af/e2/786e3c0f-7048-4707-8536-d50a295f1784/51a49aaf9944d941581c2561181f802b.dat"), rename('/mnt/gluster-object/test/c1/af/e2/786e3c0f-7048-4707-8536-d50a295f1784/.51a49aaf9944d941581c2561181f802b.dat.612bdc5e472441be9db253a8b4410717', '/mnt/gluster-object/test/c1/af/e2/786e3c0f-7048-4707-8536-d50a295f1784/51a49aaf9944d941581c2561181f802b.dat') (txn: txbd0f65b39dbf4f479b344-00570f45c9)

Comment 5 Prashanth Pai 2016-04-14 09:45:06 UTC
On an unrelated note, cleaning the volume from GlusterFS FUSE mount fails as follows.

[root@rhs312el7-3 ~]# rm -rf /mnt/gluster-object/test/*
rm: cannot remove ‘/mnt/gluster-object/test/c1/8a/33/859c81e9-24f3-465a-bea5-c4cdd383c53f’: Directory not empty
rm: cannot remove ‘/mnt/gluster-object/test/c1/a7/02/e5f557c5-87aa-450d-b852-2fc2b3866f0e’: Directory not empty
rm: cannot remove ‘/mnt/gluster-object/test/c1/6e/bc’: Directory not empty
rm: cannot remove ‘/mnt/gluster-object/test/c1/d5/27/f18587f0-79d7-49da-9f08-4be1c1a418d4’: Directory not empty
rm: cannot remove ‘/mnt/gluster-object/test/c1/3b/17/da648fb9-b829-45bc-a5ae-d276112884f9’: Directory not empty


I'm being told that this is a known issue in DHT and the fix is slated to get into RHGS 3.1.3

Comment 6 Prashanth Pai 2016-04-14 09:53:53 UTC
swift-object-server seems to be hung to these files:


[root@rhs312el7-2 ~]# ls -l /proc/18054/fd
total 0
lrwx------. 1 root root 64 Apr 14 15:13 0 -> /dev/null
lrwx------. 1 root root 64 Apr 14 15:13 1 -> /dev/null
lrwx------. 1 root root 64 Apr 14 15:13 12 -> socket:[273343]
lrwx------. 1 root root 64 Apr 14 15:13 13 -> socket:[273344]
lrwx------. 1 root root 64 Apr 14 15:13 2 -> /dev/null
l-wx------. 1 root root 64 Apr 14 15:13 20 -> /mnt/gluster-object/test/c1/a7/02/e5f557c5-87aa-450d-b852-2fc2b3866f0e/.9ee80cd8afeeabb65fe38eeba3ab40b7.dat.0c316d00e9884d07b75e158daf17f2d0
lrwx------. 1 root root 64 Apr 14 15:13 3 -> socket:[260495]
lrwx------. 1 root root 64 Apr 14 15:13 4 -> socket:[260496]
lrwx------. 1 root root 64 Apr 14 15:13 7 -> socket:[260505]
lr-x------. 1 root root 64 Apr 14 15:13 8 -> /dev/urandom
lr-x------. 1 root root 64 Apr 14 15:13 9 -> /dev/urandom


The above fd leak arises from error in following exception:
Apr 14 12:54:59 rhs312el7-2 object-server[18054]: ERROR __call__ error with PUT /test/0/AUTH_test/c1/a7/02/e5f557c5-87aa-450d-b852-2fc2b3866f0e/9ee80cd8afeeabb65fe38eeba3ab40b7.dat : #012Traceback (most recent call last):#012  File "/usr/lib/python2.7/site-packages/swift/obj/server.py", line 904, in __call__#012    res = method(req)#012  File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 2648, in wrapped#012    return func(*a, **kw)#012  File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 1205, in _timing_stats#012    resp = func(ctrl, *args, **kwargs)#012  File "/usr/lib/python2.7/site-packages/gluster/swift/obj/server.py", line 83, in PUT#012    return server.ObjectController.PUT(self, request)#012  File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 2648, in wrapped#012    return func(*a, **kw)#012  File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 1205, in _timing_stats#012    resp = func(ctrl, *args, **kwargs)#012  File "/usr/lib/python2.7/site-packages/swift/obj/server.py", line 533, in PUT#012    with disk_file.create(size=fsize) as writer:#012  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__#012    return self.gen.next()#012  File "/usr/lib/python2.7/site-packages/gluster/swift/obj/diskfile.py", line 909, in create#012    dw.close()#012AttributeError: 'NoneType' object has no attribute 'close' (txn: txcca05d5ffdf04632a042f-00570f45ca)

Comment 7 Prashanth Pai 2016-04-29 14:48:59 UTC
Upstream change: http://review.gluster.org/#/c/14118

Comment 8 Prashanth Pai 2016-05-04 07:14:04 UTC
Changes reviewed and merged.

Upstream: http://review.gluster.org/#/c/14118
Downstream: https://code.engineering.redhat.com/gerrit/#/c/73583/

Comment 9 surabhi 2016-06-03 07:03:12 UTC
Verified on latest build as follows:

1.setup 4 node swift setup. Start swift services on all nodes.
2. Configure haproxy node
3. start large number(3000 from node1)(3000 from node 2)(2000 from node 3) of PUT from multiple nodes and verify from ha proxy log that the requests have been forwarded to all the nodes which have swift services running and are configured in ha proxy .cfg.

Also verified that there are no exceptions raised.
Marking the BZ verified.

Comment 11 errata-xmlrpc 2016-06-23 05:31:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2016:1289