Description of problem:
The geo-replication session goes faulty with an "OSError: [Errno 12] Cannot allocate memory" backtrace in the logs. The operation performed was: sync existing data -> pause session -> rename all the files -> resume the session.

Version-Release number of selected component (if applicable):
glusterfs-3.6.0.28-1.el6rhs.x86_64

How reproducible:
Hit only once. Not sure I will be able to reproduce it again.

Steps to Reproduce:
1. Create and start a geo-rep session between a 2x2 dist-rep master and a 2x2 dist-rep slave volume.
2. Create and sync some 5k files in some directory structure.
3. Pause the session.
4. Rename all the files.
5. Resume the session.

Actual results:
The session went faulty.

MASTER NODE               MASTER VOL    MASTER BRICK      SLAVE             STATUS     CHECKPOINT STATUS    CRAWL STATUS
-------------------------------------------------------------------------------------------------------------------------
ccr.blr.redhat.com        master        /bricks/brick0    nirvana::slave    faulty     N/A                  N/A
metallica.blr.redhat.com  master        /bricks/brick1    acdc::slave       Passive    N/A                  N/A
beatles.blr.redhat.com    master        /bricks/brick3    rammstein::slave  Passive    N/A                  N/A
pinkfloyd.blr.redhat.com  master        /bricks/brick2    led::slave        faulty     N/A                  N/A

The backtrace in the master logs:

[2014-09-19 16:19:53.933645] I [master(/bricks/brick2):1225:crawl] _GMaster: slave's time: (1411061833, 0)
[2014-09-19 16:20:33.653033] E [repce(/bricks/brick2):207:__call__] RepceClient: call 18787:139727562630912:1411123833.64 (entry_ops) failed on peer with OSError
[2014-09-19 16:20:33.653924] E [syncdutils(/bricks/brick2):270:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 164, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 643, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1324, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 524, in crawlwrap
    self.crawl(no_stime_update=no_stime_update)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1236, in crawl
    self.process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 927, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 891, in process_change
    self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 12] Cannot allocate memory
[2014-09-19 16:20:33.657620] I [syncdutils(/bricks/brick2):214:finalize] <top>: exiting.
[2014-09-19 16:20:33.663028] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2014-09-19 16:20:33.663907] I [syncdutils(agent):214:finalize] <top>: exiting.
[2014-09-19 16:20:33.795839] I [monitor(monitor):222:monitor] Monitor: worker(/bricks/brick2) died in startup phase

This is a remote backtrace propagated to the master via RPC.
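For context on the "raise res" frame above: the slave-side worker serializes the exception it hit and sends it back over gsyncd's repce RPC channel, and the master deserializes and re-raises it locally, which is why a slave-side failure surfaces as a master-side traceback. A minimal sketch of that pattern, with illustrative names only (not the actual repce code):

# Sketch of repce-style remote exception propagation; illustrative only.
import os
import pickle

def remote_entry_ops():
    # Stand-in for the slave-side handler that fails with ENOMEM.
    raise OSError(12, os.strerror(12))  # [Errno 12] Cannot allocate memory

def slave_serve(request):
    # Slave side: run the handler, return a pickled (ok_flag, result-or-exception).
    try:
        return pickle.dumps((True, remote_entry_ops()))
    except OSError as exc:
        return pickle.dumps((False, exc))

def master_call(wire_bytes):
    # Master side: unpickle the reply and re-raise the remote exception locally.
    ok, res = pickle.loads(wire_bytes)
    if not ok:
        raise res  # analogous to 'raise res' in repce.py's __call__
    return res

if __name__ == "__main__":
    try:
        master_call(slave_serve(b"entry_ops"))
    except OSError as e:
        # Logged on the master as "(entry_ops) failed on peer with OSError".
        print("call failed on peer with", repr(e))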
The actual backtrace in the slave logs:

[2014-09-19 16:27:45.780600] E [repce(slave):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 662, in entry_ops
    [ENOENT, ESTALE, EINVAL])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 470, in errno_wrap
    return call(*arg)
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 78, in lsetxattr
    cls.raise_oserr()
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 37, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 12] Cannot allocate memory
[2014-09-19 16:27:45.794786] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.

Expected results:
There should be no backtraces and no faulty sessions.

Additional info:
The slave volume had cluster.hash-range-gfid on. I will create a sosreport for further analysis.
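Note that the ENOMEM in the slave traceback above is simply whatever errno the lsetxattr() call returned: libcxattr wraps the raw libc call and converts a failure into OSError(errno, strerror(errno)), so the "Cannot allocate memory" text reflects the errno returned by the server side rather than actual memory pressure in the worker. A minimal sketch of that wrapping pattern (simplified, Linux-only, not the gsyncd libcxattr code):

# Sketch of a libcxattr-style lsetxattr wrapper; illustrative only.
import ctypes
import ctypes.util
import os

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

def lsetxattr(path, key, value):
    ret = libc.lsetxattr(path.encode(), key.encode(), value, len(value), 0)
    if ret == -1:
        errn = ctypes.get_errno()
        # Same shape as libcxattr.raise_oserr(): whatever errno the brick or
        # translator stack returned is surfaced here as an OSError.
        raise OSError(errn, os.strerror(errn))

if __name__ == "__main__":
    try:
        lsetxattr("/tmp", "user.example", b"value")
    except OSError as e:
        print("lsetxattr failed:", e)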
Not for 3.0. Dev needs to do RCA and provide inputs.
This happens both with pause/rename/resume and with plain renames while the session is in progress. One thing to note is that the renames include directory renames as well.
Reproduced the issue. Renamed a directory while the geo-rep session was active; the session went into a faulty state, but the slave bricks were updated with the rename anyway. Newly added files also continued to sync even though the geo-rep status showed faulty. However, on a subsequent directory rename, the old directory name remained on the slave bricks alongside the new one.

[root@lightning mastervol]# mkdir test
[root@lightning mastervol]# mkdir test21t1
[root@lightning mastervol]# ls
test  test21t1
[root@lightning mastervol]# mv test notest

[root@Tim mastervol]# gluster v geo mastervol 10.70.42.197::slavevol status detail

MASTER NODE            MASTER VOL    MASTER BRICK                    SLAVE                     STATUS     CHECKPOINT STATUS    CRAWL STATUS    FILES SYNCD    FILES PENDING    BYTES PENDING    DELETES PENDING    FILES SKIPPED
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Tim.blr.redhat.com     mastervol     /bricks/brick1/mastervol_b1     10.70.42.254::slavevol    faulty     N/A                  N/A             9272           1                0                0                  0
Tim.blr.redhat.com     mastervol     /bricks/brick2/mastervol_b5     10.70.42.254::slavevol    faulty     N/A                  N/A             12763          1                0                0                  0
Tim.blr.redhat.com     mastervol     /bricks/brick3/mastervol_b9     10.70.42.254::slavevol    faulty     N/A                  N/A             12709          1                0                0                  0
Tim.blr.redhat.com     mastervol     /bricks/brick3/mastervol_b11    10.70.42.197::slavevol    faulty     N/A                  N/A             7696           1                0                0                  0
purple                 mastervol     /bricks/brick1/mastervol_b3     10.70.42.197::slavevol    faulty     N/A                  N/A             8              0                0                0                  0
purple                 mastervol     /bricks/brick2/mastervol_b7     10.70.42.197::slavevol    faulty     N/A                  N/A             2              1                0                0                  0
green                  mastervol     /bricks/brick1/mastervol_b4     10.70.42.151::slavevol    faulty     N/A                  N/A             0              0                0                0                  0
green                  mastervol     /bricks/brick2/mastervol_b8     10.70.42.151::slavevol    faulty     N/A                  N/A             0              0                0                0                  0
Javier.blr.redhat.com  mastervol     /bricks/brick1/mastervol_b2     10.70.42.97::slavevol     Passive    N/A                  N/A             0              0                0                0                  0
Javier.blr.redhat.com  mastervol     /bricks/brick2/mastervol_b6     10.70.42.97::slavevol     Passive    N/A                  N/A             0              0                0                0                  0
Javier.blr.redhat.com  mastervol     /bricks/brick3/mastervol_b10    10.70.42.97::slavevol     Passive    N/A                  N/A             0              0                0                0                  0
Javier.blr.redhat.com  mastervol     /bricks/brick3/mastervol_b12    10.70.42.151::slavevol    Passive    N/A                  N/A

[root@lightning mastervol]# mv test21t1 test21t2
[root@lightning mastervol]# ls -l
total 1048576
-rw-r--r-- 1 root root 1073741824 Sep 25 00:15 BB
drwxr-xr-x 2 root root         36 Sep 25 01:03 notest
drwxr-xr-x 2 root root         36 Sep 25 01:03 test21t2

The rename was successfully synced; however, the old name was still retained on the slave bricks.
Slave brick:

/bricks/brick3/slavevol_b10:
total 0
drwxr-xr-x 2 root root 6 Sep 25 01:03 notest
drwxr-xr-x 2 root root 6 Sep 25 01:03 test21t1
drwxr-xr-x 2 root root 6 Sep 25 17:49 test21t2

Master brick:

/bricks/brick2/mastervol_b8:
total 0
drwxr-xr-x 2 root root 6 Sep 25 01:03 notest
drwxr-xr-x 2 root root 6 Sep 25 01:03 test21t2

Logs from slavevol.log:

[2014-09-25 12:00:33.436995] E [repce(/bricks/brick2/mastervol_b5):207:__call__] RepceClient: call 32568:140035883935488:1411626633.43 (entry_ops) failed on peer with OSError
[2014-09-25 12:00:33.437428] E [syncdutils(/bricks/brick2/mastervol_b5):270:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 164, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 643, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1343, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 524, in crawlwrap
    self.crawl(no_stime_update=no_stime_update)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1174, in crawl
    self.process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 927, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 891, in process_change
    self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 12] Cannot allocate memory
Root Cause and Fix: Directory renames are recorded in the changelogs of all distributed bricks, and gsyncd processes these changelogs on each brick in parallel. The first changelog to be processed succeeds; every subsequent one stats the 'src' on the slave and, if it is not present, tries to create the entry freshly on the slave. That fresh create should be done only for files, not for directories. When this code path was hit for a directory, a regular file's blob was sent as the directory's blob, and the gfid-access translator errored out with 'Invalid blob length', returning errno ENOMEM. The fix is to skip the fresh create when the entry is a directory. The fix will be sent upstream once http://review.gluster.org/#/c/8761/ is merged, as it depends on that change.
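A minimal sketch of the corrected slave-side RENAME handling described above (illustrative only: the real change is in gsyncd's entry_ops() in resource.py per the patches linked below, and create_fresh() here is a hypothetical stand-in for re-creating the destination entry from its blob):

# Sketch of the fix: skip the fresh create when the renamed entry is a directory.
import errno
import os
import stat

def handle_rename(src, dst, src_mode, create_fresh):
    # Apply a RENAME entry replayed from a brick changelog.
    try:
        os.rename(src, dst)
    except OSError as e:
        if e.errno not in (errno.ENOENT, errno.ESTALE):
            raise
        # 'src' is already gone: another brick's worker processed this rename
        # first. Re-creating the destination from a regular-file blob is only
        # valid for non-directories; doing it for a directory is what made
        # gfid-access fail with "Invalid blob length" (surfaced as ENOMEM).
        if stat.S_ISDIR(src_mode):
            return  # directory was already renamed by the first worker; skip
        create_fresh(dst, src_mode)

if __name__ == "__main__":
    import tempfile
    base = tempfile.mkdtemp()
    d = os.path.join(base, "test")
    os.mkdir(d)
    mkdir_fresh = lambda p, m: os.mkdir(p)
    # First worker renames the directory...
    handle_rename(d, os.path.join(base, "notest"), stat.S_IFDIR | 0o755, mkdir_fresh)
    # ...a second worker replaying the same changelog entry finds no source
    # and, because it is a directory, skips instead of re-creating it.
    handle_rename(d, os.path.join(base, "notest"), stat.S_IFDIR | 0o755, mkdir_fresh)
    print(sorted(os.listdir(base)))  # ['notest']

With that check in place, workers that replay an already-processed directory rename simply skip it instead of sending a regular-file blob to the gfid-access translator.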
Patch sent upstream: http://review.gluster.org/#/c/8865/
Upstream Patch: http://review.gluster.org/#/c/8865/ Downstream Patch: https://code.engineering.redhat.com/gerrit/#/c/34634/
Verified on 3.0.3 with glusterfs-3.6.0.33-1.el6rhs.x86_64. The file renames synced successfully after pause/resume and no errors were found. Geo-rep status looked good.

Slave files before and after pause/resume:

Before:
# ls /mnt/slave/
file    file18  file27  file36  file45  file9
file1   file19  file28  file37  file46  glusterfs-3.5.2.tar.gz
file10  file2   file29  file38  file47  linux-3.12.tar.bz2
file11  file20  file3   file39  file48  test.txt
file12  file21  file30  file4   file49  text.txt
file13  file22  file31  file40  file5
file14  file23  file32  file41  file50
file15  file24  file33  file42  file6
file16  file25  file34  file43  file7
file17  file26  file35  file44  file8

After:
# ls /mnt/slave/
file     file-18  file-27  file-36  file-45  file-9
file-1   file-19  file-28  file-37  file-46  glusterfs-3.5.2.tar.gz
file-10  file-2   file-29  file-38  file-47  linux-3.12.tar.bz2
file-11  file-20  file-3   file-39  file-48  test.txt
file-12  file-21  file-30  file-4   file-49  text.txt
file-13  file-22  file-31  file-40  file-5
file-14  file-23  file-32  file-41  file-50
file-15  file-24  file-33  file-42  file-6
file-16  file-25  file-34  file-43  file-7
file-17  file-26  file-35  file-44  file-8
(In reply to shilpa from comment #11)
> Verified on 3.0.3 glusterfs-3.6.0.33-1.el6rhs.x86_64
> The file renames successfully synced after pause/resume and no errors were
> found. Geo-rep status looked good.

This bug fix addresses directory renames; file renames alone do not trigger the bug. Were directory renames tested?
In response to comment #12: tested both directory and file renames. Both work; no issues found.
Please review and sign off on the edited doc text.
Doc text looks fine to me.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0038.html