Description of problem: When there are multiple instances of gluster-swift instances running, all of them doing I/O on the same volume from different FUSE mounts, some operations like rename() can raise exceptions like EIO/EAGAIN which are not handled by gluster-swift code. This will result in PUT being failed. The fix would be to identify the exceptions that can be raised in multi-node setup and make sure gluster-swift code handles them. Example: In older versions of glusterfs rename() was a blocking operation and in recent versions it isn't i.e it will throw EAGAIN to tell client to try after some time.
RCA: The behaviour of glusterfs for renames has changed in recent versions. rename() used to be a blocking operation and now it isn't. More details can be found here: https://www.spinics.net/lists/gluster-devel/msg18250.html W.r.t rename(), the scope of fixing this BZ is limited to making sure gluster-swift code takes cognizance of this change and handles rename() as non-blocking operation. gluster-swift does not expect rename() to throw a EBUSY/EAGAIN/ESTALE today. We need to handle that. Ideally, we want the old behavior in glusterfs i.e rename() be blocking but fixing that is beyond the scope of this BZ. This BZ also tracks further effort (if any) and work required to make gluster-swift code compatible with glusterfs behaviour when multiple FUSE mounts are used. rename() is one such case that I've seen in one of the customer issues. There could be others.
The blocking rename() issue has been tracked in this upstream glusterfs BZ here: https://bugzilla.redhat.com/show_bug.cgi?id=1304966
The non-blocking rename issue seems to be fixed in GlusterFS. In a multi-node test with haproxy as load balancer and 3 other node with PACO server running, following exceptions were seen in logs which are related to situation when space in volume is full. Apr 14 12:55:03 rhs312el7-2 object-server[18055]: ERROR __call__ error with PUT /test/0/AUTH_test/c1/37/ad/123a178f-2e9b-4759-9ac7-6629c62733a9/26b555302e5fee68c2a9ca1aa11cbb45.dat : #012Traceback (most recent call last):#012 File "/usr/lib/python2.7/site-packages/swift/obj/server.py", line 904, in __call__#012 res = method(req)#012 File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 2648, in wrapped#012 return func(*a, **kw)#012 File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 1205, in _timing_stats#012 resp = func(ctrl, *args, **kwargs)#012 File "/usr/lib/python2.7/site-packages/gluster/swift/obj/server.py", line 83, in PUT#012 return server.ObjectController.PUT(self, request)#012 File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 2648, in wrapped#012 return func(*a, **kw)#012 File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 1205, in _timing_stats#012 resp = func(ctrl, *args, **kwargs)#012 File "/usr/lib/python2.7/site-packages/swift/obj/server.py", line 533, in PUT#012 with disk_file.create(size=fsize) as writer:#012 File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__#012 return self.gen.next()#012 File "/usr/lib/python2.7/site-packages/gluster/swift/obj/diskfile.py", line 896, in create#012 self._create_dir_object(self._obj_path)#012 File "/usr/lib/python2.7/site-packages/gluster/swift/obj/diskfile.py", line 798, in _create_dir_object#012 ret, newmd = make_directory(cur_path, self._uid, self._gid, md)#012 File "/usr/lib/python2.7/site-packages/gluster/swift/obj/diskfile.py", line 156, in make_directory#012 " path %s (%s)" % (full_path, str(err)))#012DiskFileError: make_directory: mkdir failed on path /mnt/gluster-object/test/c1/37/ad/123a178f-2e9b-4759-9ac7-6629c62733a9 ([Errno 28] No space left on device: '/mnt/gluster-object/test/c1/37/ad/123a178f-2e9b-4759-9ac7-6629c62733a9') (txn: tx77d90959c13e45f79dba4-00570f45c9) Apr 14 12:54:59 rhs312el7-2 object-server[18054]: ERROR __call__ error with PUT /test/0/AUTH_test/c1/a7/02/e5f557c5-87aa-450d-b852-2fc2b3866f0e/9ee80cd8afeeabb65fe38eeba3ab40b7.dat : #012Traceback (most recent call last):#012 File "/usr/lib/python2.7/site-packages/swift/obj/server.py", line 904, in __call__#012 res = method(req)#012 File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 2648, in wrapped#012 return func(*a, **kw)#012 File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 1205, in _timing_stats#012 resp = func(ctrl, *args, **kwargs)#012 File "/usr/lib/python2.7/site-packages/gluster/swift/obj/server.py", line 83, in PUT#012 return server.ObjectController.PUT(self, request)#012 File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 2648, in wrapped#012 return func(*a, **kw)#012 File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 1205, in _timing_stats#012 resp = func(ctrl, *args, **kwargs)#012 File "/usr/lib/python2.7/site-packages/swift/obj/server.py", line 533, in PUT#012 with disk_file.create(size=fsize) as writer:#012 File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__#012 return self.gen.next()#012 File "/usr/lib/python2.7/site-packages/gluster/swift/obj/diskfile.py", line 909, in create#012 dw.close()#012AttributeError: 'NoneType' object has no attribute 'close' (txn: txcca05d5ffdf04632a042f-00570f45ca) Apr 14 12:55:04 rhs312el7-2 object-server[18055]: ERROR __call__ error with PUT /test/0/AUTH_test/c1/eb/e8/bf21d1e5-2001-4ea6-a8c9-2926d8cace34/77e14470a18907d356869f96c8925932.dat : #012Traceback (most recent call last):#012 File "/usr/lib/python2.7/site-packages/swift/obj/server.py", line 904, in __call__#012 res = method(req)#012 File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 2648, in wrapped#012 return func(*a, **kw)#012 File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 1205, in _timing_stats#012 resp = func(ctrl, *args, **kwargs)#012 File "/usr/lib/python2.7/site-packages/gluster/swift/obj/server.py", line 83, in PUT#012 return server.ObjectController.PUT(self, request)#012 File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 2648, in wrapped#012 return func(*a, **kw)#012 File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 1205, in _timing_stats#012 resp = func(ctrl, *args, **kwargs)#012 File "/usr/lib/python2.7/site-packages/swift/obj/server.py", line 533, in PUT#012 with disk_file.create(size=fsize) as writer:#012 File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__#012 return self.gen.next()#012 File "/usr/lib/python2.7/site-packages/gluster/swift/obj/diskfile.py", line 841, in create#012 os.O_WRONLY | os.O_CREAT | os.O_EXCL | O_CLOEXEC)#012 File "/usr/lib/python2.7/site-packages/gluster/swift/common/fs_utils.py", line 245, in do_open#012 err.strerror, path, flags, kwargs))#012GlusterFileSystemOSError: [Errno 116] Stale file handle, os.open("/mnt/gluster-object/test/c1/eb/e8/bf21d1e5-2001-4ea6-a8c9-2926d8cace34/.77e14470a18907d356869f96c8925932.dat.17e98fafc90344448ff719a79a9a3ae4", 800c1, {}) (txn: tx2a8c3ac889ce49eca0412-00570f45cd) Apr 14 12:55:00 rhs312el7-4 object-server[18035]: ERROR __call__ error with PUT /test/0/AUTH_test/c1/89/d5/7964ad1a-5ff9-4b8a-bde5-001f11367511/e838c0a8a7ae0077755cea490e2a7eba.dat : #012Traceback (most recent call last):#012 File "/usr/lib/python2.7/site-packages/swift/obj/server.py", line 904, in __call__#012 res = method(req)#012 File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 2648, in wrapped#012 return func(*a, **kw)#012 File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 1205, in _timing_stats#012 resp = func(ctrl, *args, **kwargs)#012 File "/usr/lib/python2.7/site-packages/gluster/swift/obj/server.py", line 83, in PUT#012 return server.ObjectController.PUT(self, request)#012 File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 2648, in wrapped#012 return func(*a, **kw)#012 File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 1205, in _timing_stats#012 resp = func(ctrl, *args, **kwargs)#012 File "/usr/lib/python2.7/site-packages/swift/obj/server.py", line 533, in PUT#012 with disk_file.create(size=fsize) as writer:#012 File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__#012 return self.gen.next()#012 File "/usr/lib/python2.7/site-packages/gluster/swift/obj/diskfile.py", line 841, in create#012 os.O_WRONLY | os.O_CREAT | os.O_EXCL | O_CLOEXEC)#012 File "/usr/lib/python2.7/site-packages/gluster/swift/common/fs_utils.py", line 245, in do_open#012 err.strerror, path, flags, kwargs))#012GlusterFileSystemOSError: [Errno 116] Stale file handle, os.open("/mnt/gluster-object/test/c1/89/d5/7964ad1a-5ff9-4b8a-bde5-001f11367511/.e838c0a8a7ae0077755cea490e2a7eba.dat.10e5a0ff2e344f4598c531bc34e24cbd", 800c1, {}) (txn: txb3d41d5925d54565816eb-00570f45c9) Apr 14 12:55:00 rhs312el7-4 object-server[18035]: ERROR __call__ error with PUT /test/0/AUTH_test/c1/af/e2/786e3c0f-7048-4707-8536-d50a295f1784/51a49aaf9944d941581c2561181f802b.dat : [Errno 28] No space left on device, os.rename("/mnt/gluster-object/test/c1/af/e2/786e3c0f-7048-4707-8536-d50a295f1784/.51a49aaf9944d941581c2561181f802b.dat.612bdc5e472441be9db253a8b4410717", "/mnt/gluster-object/test/c1/af/e2/786e3c0f-7048-4707-8536-d50a295f1784/51a49aaf9944d941581c2561181f802b.dat"), rename('/mnt/gluster-object/test/c1/af/e2/786e3c0f-7048-4707-8536-d50a295f1784/.51a49aaf9944d941581c2561181f802b.dat.612bdc5e472441be9db253a8b4410717', '/mnt/gluster-object/test/c1/af/e2/786e3c0f-7048-4707-8536-d50a295f1784/51a49aaf9944d941581c2561181f802b.dat') (txn: txbd0f65b39dbf4f479b344-00570f45c9)
On an unrelated note, cleaning the volume from GlusterFS FUSE mount fails as follows. [root@rhs312el7-3 ~]# rm -rf /mnt/gluster-object/test/* rm: cannot remove ‘/mnt/gluster-object/test/c1/8a/33/859c81e9-24f3-465a-bea5-c4cdd383c53f’: Directory not empty rm: cannot remove ‘/mnt/gluster-object/test/c1/a7/02/e5f557c5-87aa-450d-b852-2fc2b3866f0e’: Directory not empty rm: cannot remove ‘/mnt/gluster-object/test/c1/6e/bc’: Directory not empty rm: cannot remove ‘/mnt/gluster-object/test/c1/d5/27/f18587f0-79d7-49da-9f08-4be1c1a418d4’: Directory not empty rm: cannot remove ‘/mnt/gluster-object/test/c1/3b/17/da648fb9-b829-45bc-a5ae-d276112884f9’: Directory not empty I'm being told that this is a known issue in DHT and the fix is slated to get into RHGS 3.1.3
swift-object-server seems to be hung to these files: [root@rhs312el7-2 ~]# ls -l /proc/18054/fd total 0 lrwx------. 1 root root 64 Apr 14 15:13 0 -> /dev/null lrwx------. 1 root root 64 Apr 14 15:13 1 -> /dev/null lrwx------. 1 root root 64 Apr 14 15:13 12 -> socket:[273343] lrwx------. 1 root root 64 Apr 14 15:13 13 -> socket:[273344] lrwx------. 1 root root 64 Apr 14 15:13 2 -> /dev/null l-wx------. 1 root root 64 Apr 14 15:13 20 -> /mnt/gluster-object/test/c1/a7/02/e5f557c5-87aa-450d-b852-2fc2b3866f0e/.9ee80cd8afeeabb65fe38eeba3ab40b7.dat.0c316d00e9884d07b75e158daf17f2d0 lrwx------. 1 root root 64 Apr 14 15:13 3 -> socket:[260495] lrwx------. 1 root root 64 Apr 14 15:13 4 -> socket:[260496] lrwx------. 1 root root 64 Apr 14 15:13 7 -> socket:[260505] lr-x------. 1 root root 64 Apr 14 15:13 8 -> /dev/urandom lr-x------. 1 root root 64 Apr 14 15:13 9 -> /dev/urandom The above fd leak arises from error in following exception: Apr 14 12:54:59 rhs312el7-2 object-server[18054]: ERROR __call__ error with PUT /test/0/AUTH_test/c1/a7/02/e5f557c5-87aa-450d-b852-2fc2b3866f0e/9ee80cd8afeeabb65fe38eeba3ab40b7.dat : #012Traceback (most recent call last):#012 File "/usr/lib/python2.7/site-packages/swift/obj/server.py", line 904, in __call__#012 res = method(req)#012 File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 2648, in wrapped#012 return func(*a, **kw)#012 File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 1205, in _timing_stats#012 resp = func(ctrl, *args, **kwargs)#012 File "/usr/lib/python2.7/site-packages/gluster/swift/obj/server.py", line 83, in PUT#012 return server.ObjectController.PUT(self, request)#012 File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 2648, in wrapped#012 return func(*a, **kw)#012 File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 1205, in _timing_stats#012 resp = func(ctrl, *args, **kwargs)#012 File "/usr/lib/python2.7/site-packages/swift/obj/server.py", line 533, in PUT#012 with disk_file.create(size=fsize) as writer:#012 File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__#012 return self.gen.next()#012 File "/usr/lib/python2.7/site-packages/gluster/swift/obj/diskfile.py", line 909, in create#012 dw.close()#012AttributeError: 'NoneType' object has no attribute 'close' (txn: txcca05d5ffdf04632a042f-00570f45ca)
Upstream change: http://review.gluster.org/#/c/14118
Changes reviewed and merged. Upstream: http://review.gluster.org/#/c/14118 Downstream: https://code.engineering.redhat.com/gerrit/#/c/73583/
Verified on latest build as follows: 1.setup 4 node swift setup. Start swift services on all nodes. 2. Configure haproxy node 3. start large number(3000 from node1)(3000 from node 2)(2000 from node 3) of PUT from multiple nodes and verify from ha proxy log that the requests have been forwarded to all the nodes which have swift services running and are configured in ha proxy .cfg. Also verified that there are no exceptions raised. Marking the BZ verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2016:1289