+++ This bug was initially created as a clone of Bug #1144428 +++

Description of problem:
The session goes into faulty state with an OSError: [Errno 12] Cannot allocate memory backtrace in the logs. The operations I performed were: sync existing data -> pause session -> rename all the files -> resume the session.

Version-Release number of selected component (if applicable):
mainline

How reproducible:
Hit only once. Not sure I will be able to reproduce it again.

Steps to Reproduce:
1. Create and start a geo-rep session between a 2*2 dist-rep master volume and a 2*2 dist-rep slave volume.
2. Create and sync some 5k files in some directory structure.
3. Now pause the session.
4. Rename all the files.
5. Resume the session.

Actual results:
The session went to faulty.

MASTER NODE                 MASTER VOL    MASTER BRICK      SLAVE               STATUS     CHECKPOINT STATUS    CRAWL STATUS
-----------------------------------------------------------------------------------------------------------------------------
ccr.blr.redhat.com          master        /bricks/brick0    nirvana::slave      faulty     N/A                  N/A
metallica.blr.redhat.com    master        /bricks/brick1    acdc::slave         Passive    N/A                  N/A
beatles.blr.redhat.com      master        /bricks/brick3    rammstein::slave    Passive    N/A                  N/A
pinkfloyd.blr.redhat.com    master        /bricks/brick2    led::slave          faulty     N/A                  N/A

The backtrace in the master logs:
[2014-09-19 16:19:53.933645] I [master(/bricks/brick2):1225:crawl] _GMaster: slave's time: (1411061833, 0)
[2014-09-19 16:20:33.653033] E [repce(/bricks/brick2):207:__call__] RepceClient: call 18787:139727562630912:1411123833.64 (entry_ops) failed on peer with OSError
[2014-09-19 16:20:33.653924] E [syncdutils(/bricks/brick2):270:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 164, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 643, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1324, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 524, in crawlwrap
    self.crawl(no_stime_update=no_stime_update)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1236, in crawl
    self.process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 927, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 891, in process_change
    self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 12] Cannot allocate memory
[2014-09-19 16:20:33.657620] I [syncdutils(/bricks/brick2):214:finalize] <top>: exiting.
[2014-09-19 16:20:33.663028] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2014-09-19 16:20:33.663907] I [syncdutils(agent):214:finalize] <top>: exiting.
[2014-09-19 16:20:33.795839] I [monitor(monitor):222:monitor] Monitor: worker(/bricks/brick2) died in startup phase

This is a remote backtrace propagated to master via RPC.
The actual backtrace in the slave logs is:

[2014-09-19 16:27:45.780600] E [repce(slave):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 662, in entry_ops
    [ENOENT, ESTALE, EINVAL])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 470, in errno_wrap
    return call(*arg)
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 78, in lsetxattr
    cls.raise_oserr()
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 37, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 12] Cannot allocate memory
[2014-09-19 16:27:45.794786] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.

Expected results:
There should be no backtraces and no faulty sessions.

Additional info:
The slave volume had Cluster.hash-range-gfid on
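
For readers following the two backtraces: the slave-side frame raises OSError(errn, os.strerror(errn)), and the master-side frame in repce.py re-raises whatever the peer returned (raise res). The sketch below illustrates that general pattern only; it is not gsyncd's code. The names failing_lsetxattr, serve and call_on_peer are hypothetical, and using pickle as the wire format is an assumption made for the sake of a self-contained example.

```python
import errno
import os
import pickle


def failing_lsetxattr():
    """Hypothetical stand-in for the slave-side call that failed with ENOMEM."""
    errn = errno.ENOMEM
    raise OSError(errn, os.strerror(errn))


def serve(func):
    """Slave side: run the call and ship back either its result or the exception."""
    try:
        res = func()
    except Exception as exc:
        res = exc
    # Assumption: pickle is used here only to simulate crossing a process boundary.
    return pickle.dumps(res)


def call_on_peer(func):
    """Master side: decode the reply and re-raise if the peer returned an exception."""
    res = pickle.loads(serve(func))
    if isinstance(res, Exception):
        raise res
    return res


try:
    call_on_peer(failing_lsetxattr)
except OSError as e:
    caught = e
    print("failed on peer with", type(e).__name__, "errno", e.errno)
```

The errno survives the round trip, which is why the master log reports the same Errno 12 that the slave raised, even though the master itself never ran out of memory.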
REVIEW: http://review.gluster.org/8865 (geo-rep: Fix rename of directory syncing.) posted (#1) for review on master by Kotresh HR (khiremat)
COMMIT: http://review.gluster.org/8865 committed in master by Venky Shankar (vshankar)
------
commit 7113d873af1f129effd8c6da21b49e797de8eab0
Author: Kotresh HR <khiremat>
Date:   Thu Sep 25 17:34:43 2014 +0530

    geo-rep: Fix rename of directory syncing.

    The rename of a directory is captured in the changelogs of all
    distributed bricks. gsyncd processes these changelogs on each
    brick in parallel. The first changelog to get processed will be
    successful. All subsequent ones will stat the 'src' and, if it is
    not present, try to create it freshly on the slave. This should be
    done only for files and not for directories. Hence, when this code
    path was hit, a regular file's blob was sent as a directory's blob,
    and the gfid-access translator errored out with 'Invalid blob
    length' and errno 'ENOMEM'.

    Change-Id: I50545b02b98846464876795159d2446340155c82
    BUG: 1146823
    Signed-off-by: Kotresh HR <khiremat>
    Reviewed-on: http://review.gluster.org/8865
    Reviewed-by: Aravinda VK <avishwan>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Venky Shankar <vshankar>
    Tested-by: Venky Shankar <vshankar>
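
The behaviour described in the commit message can be sketched roughly as follows. This is a minimal illustration run against a local temporary directory, not the actual patch: apply_rename and its src_mode parameter are hypothetical, and the real code works on gfid blobs on the slave mount rather than plain paths.

```python
import os
import stat
import tempfile


def apply_rename(src, dst, src_mode):
    """Replay one RENAME entry against a (here: local) slave tree.

    src_mode is the mode recorded for the entry on the master; it is a
    hypothetical parameter for this sketch. Returns the action taken.
    """
    if os.path.lexists(src):
        os.rename(src, dst)
        return "renamed"
    if stat.S_ISDIR(src_mode):
        # Another brick's worker already replayed this directory rename,
        # so do not fall through to the create-as-file path.
        return "skipped"
    # Regular file whose source is gone: create it freshly at the destination.
    open(dst, "a").close()
    return "created"


with tempfile.TemporaryDirectory() as root:
    old_dir = os.path.join(root, "d_old")
    new_dir = os.path.join(root, "d_new")
    os.mkdir(old_dir)
    dir_mode = stat.S_IFDIR | 0o755

    # Two workers replay the same directory rename from their own changelogs.
    first = apply_rename(old_dir, new_dir, dir_mode)
    second = apply_rename(old_dir, new_dir, dir_mode)

    # A file rename whose source is missing still takes the create path.
    new_file = os.path.join(root, "f_new")
    file_result = apply_rename(os.path.join(root, "f_old"), new_file,
                               stat.S_IFREG | 0o644)
    file_exists = os.path.isfile(new_file)
    print(first, second, file_result)
```

The point of the directory check is that the second replay becomes a no-op instead of sending file-style creation data for a directory, which is the path that produced the ENOMEM above.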
REVIEW: http://review.gluster.org/8880 (geo-rep: Fix rename of directory syncing.) posted (#1) for review on release-3.6 by Aravinda VK (avishwan)
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.0, please open a new bug report.

glusterfs-3.7.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/10939
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user