Bug 1147422 - dist-geo-rep: Session going into faulty with "Cannot allocate memory" backtrace when pause, rename and resume are performed
Summary: dist-geo-rep: Session going into faulty with "Cannot allocate memory" backtrace when pause, rename and resume are performed
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: geo-replication
Version: 3.6.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kotresh HR
QA Contact:
URL:
Whiteboard:
Duplicates: 1159190
Depends On: 1144428 1146823 1159190
Blocks:
 
Reported: 2014-09-29 08:48 UTC by Aravinda VK
Modified: 2015-01-15 10:00 UTC
CC List: 12 users

Fixed In Version: 3.6 beta3
Clone Of: 1146823
Environment:
Last Closed: 2014-11-11 08:40:12 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Aravinda VK 2014-09-29 08:48:58 UTC
+++ This bug was initially created as a clone of Bug #1146823 +++

+++ This bug was initially created as a clone of Bug #1144428 +++

Description of problem:
The session goes into faulty with an "OSError: [Errno 12] Cannot allocate memory" backtrace in the logs. The sequence of operations performed was: sync existing data -> pause the session -> rename all the files -> resume the session.

Version-Release number of selected component (if applicable):
mainline

How reproducible:
Hit only once. Not sure I will be able to reproduce again.

Steps to Reproduce:
1. Create and start a geo-rep session between a 2x2 dist-rep master and a 2x2 dist-rep slave volume.
2. Create and sync about 5k files in some directory structure.
3. Pause the session.
4. Rename all the files.
5. Resume the session. (A rough scripted version of steps 3-5 follows below.)
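
The following is a rough Python driver for steps 3-5, assuming the session is already created and started, the geo-rep pause/resume subcommands are available, and the master volume is mounted on a client. The volume, slave and mount names are illustrative, not taken from this setup.

#!/usr/bin/env python
# Rough reproduction driver for steps 3-5: pause the session, rename
# every file on the master mount, then resume.  Names are illustrative.
import os
import subprocess

MASTER_VOL = "master"
SLAVE = "slavehost::slave"     # <slave-host>::<slave-volume>
MOUNT = "/mnt/master"          # FUSE client mount of the master volume

def georep(action):
    # gluster volume geo-replication <MASTER> <SLAVE> {pause|resume|status}
    subprocess.check_call(["gluster", "volume", "geo-replication",
                           MASTER_VOL, SLAVE, action])

georep("pause")

# Rename all the previously synced files under the mount.
for root, dirs, files in os.walk(MOUNT):
    for name in files:
        path = os.path.join(root, name)
        os.rename(path, path + ".renamed")

georep("resume")
georep("status")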

Actual results:
The session went into faulty:

MASTER NODE                 MASTER VOL    MASTER BRICK      SLAVE               STATUS     CHECKPOINT STATUS    CRAWL STATUS        
-----------------------------------------------------------------------------------------------------------------------------
ccr.blr.redhat.com          master        /bricks/brick0    nirvana::slave      faulty     N/A                  N/A                 
metallica.blr.redhat.com    master        /bricks/brick1    acdc::slave         Passive    N/A                  N/A                 
beatles.blr.redhat.com      master        /bricks/brick3    rammstein::slave    Passive    N/A                  N/A                 
pinkfloyd.blr.redhat.com    master        /bricks/brick2    led::slave          faulty     N/A                  N/A                 


The backtrace in the master logs:

[2014-09-19 16:19:53.933645] I [master(/bricks/brick2):1225:crawl] _GMaster: slave's time: (1411061833, 0)
[2014-09-19 16:20:33.653033] E [repce(/bricks/brick2):207:__call__] RepceClient: call 18787:139727562630912:1411123833.64 (entry_ops) failed on peer with OSError
[2014-09-19 16:20:33.653924] E [syncdutils(/bricks/brick2):270:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 164, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 643, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1324, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 524, in crawlwrap
    self.crawl(no_stime_update=no_stime_update)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1236, in crawl
    self.process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 927, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 891, in process_change
    self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 12] Cannot allocate memory
[2014-09-19 16:20:33.657620] I [syncdutils(/bricks/brick2):214:finalize] <top>: exiting.
[2014-09-19 16:20:33.663028] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2014-09-19 16:20:33.663907] I [syncdutils(agent):214:finalize] <top>: exiting.
[2014-09-19 16:20:33.795839] I [monitor(monitor):222:monitor] Monitor: worker(/bricks/brick2) died in startup phase
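
The "raise res" at the bottom of this traceback is the repce client re-raising an exception object shipped back from the slave. A minimal sketch of that propagation pattern follows; server_dispatch, client_call and FakeSlave are simplified stand-ins, not repce's actual API (the real repce.py multiplexes pickled messages over the connection between the master and slave gsyncd processes).

# Minimal sketch of exception propagation in an RPC layer like repce:
# the server side catches the failure and returns the exception object,
# and the client re-raises it locally (compare "raise res" in repce.py).
import pickle

def server_dispatch(obj, method, args):
    """Run the requested method; reply with (ok, result-or-exception)."""
    try:
        return pickle.dumps((True, getattr(obj, method)(*args)))
    except OSError as exc:
        return pickle.dumps((False, exc))

def client_call(reply_bytes):
    """Unpack the reply and re-raise a remote failure on the caller."""
    ok, res = pickle.loads(reply_bytes)
    if not ok:
        raise res        # the slave's OSError surfaces in the master log
    return res

class FakeSlave(object):
    def entry_ops(self, entries):
        raise OSError(12, "Cannot allocate memory")

reply = server_dispatch(FakeSlave(), "entry_ops", ([],))
try:
    client_call(reply)
except OSError as exc:
    print("propagated from slave: %s" % exc)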


This master-side OSError is thus the remote failure propagated via RPC; the actual backtrace in the slave logs is:

[2014-09-19 16:27:45.780600] E [repce(slave):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 662, in entry_ops
    [ENOENT, ESTALE, EINVAL])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 470, in errno_wrap
    return call(*arg)
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 78, in lsetxattr
    cls.raise_oserr()
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 37, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 12] Cannot allocate memory
[2014-09-19 16:27:45.794786] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
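
Note that the ENOMEM is not a genuine allocation failure on the slave: the setxattr call fails inside the gfid-access translator (see the commit message below), and libcxattr.py simply converts the failed syscall's errno into an OSError. A minimal sketch of that conversion, assuming a ctypes binding along the lines of libcxattr.py; the xattr key and path in the trailing comment are illustrative.

# Sketch of the errno-to-OSError conversion done in libcxattr.py: a
# failing lsetxattr(2) becomes OSError(errno, strerror(errno)), so an
# ENOMEM returned by a translator reads as "Cannot allocate memory".
import ctypes
import os

libc = ctypes.CDLL("libc.so.6", use_errno=True)

def lsetxattr(path, attr, val):
    ret = libc.lsetxattr(path.encode(), attr.encode(), val, len(val), 0)
    if ret == -1:
        errn = ctypes.get_errno()
        raise OSError(errn, os.strerror(errn))   # e.g. [Errno 12]

# Illustrative call site: an entry-creation xattr set on the slave mount,
# e.g. lsetxattr("/slave/gfid-mount/dir", "glusterfs.gfid.newfile", blob),
# raises OSError(12, 'Cannot allocate memory') if the translator rejects
# the blob with ENOMEM.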


Expected results:
There should be no backtraces and no faulty sessions.

Additional info:
The slave volume had cluster.hash-range-gfid enabled.

--- Additional comment from Anand Avati on 2014-09-26 03:40:46 EDT ---

REVIEW: http://review.gluster.org/8865 (geo-rep: Fix rename of directory syncing.) posted (#1) for review on master by Kotresh HR (khiremat)

--- Additional comment from Anand Avati on 2014-09-29 02:32:50 EDT ---

COMMIT: http://review.gluster.org/8865 committed in master by Venky Shankar (vshankar) 
------
commit 7113d873af1f129effd8c6da21b49e797de8eab0
Author: Kotresh HR <khiremat>
Date:   Thu Sep 25 17:34:43 2014 +0530

    geo-rep: Fix rename of directory syncing.
    
    The rename of a directory is captured in the changelogs of all
    the distributed bricks, and gsyncd processes these changelogs on
    each brick in parallel. The first changelog to be processed
    succeeds; all subsequent ones stat the 'src' and, if it is absent,
    fall back to creating the entry freshly on the slave. That
    fallback should be done only for files, not for directories.
    When this code path was hit for a directory, a regular file's
    blob was sent as the directory's blob, and the gfid-access
    translator errored out with 'Invalid blob length' and errno
    'ENOMEM'.
    
    Change-Id: I50545b02b98846464876795159d2446340155c82
    BUG: 1146823
    Signed-off-by: Kotresh HR <khiremat>
    Reviewed-on: http://review.gluster.org/8865
    Reviewed-by: Aravinda VK <avishwan>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Venky Shankar <vshankar>
    Tested-by: Venky Shankar <vshankar>
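
In code terms, the fix described above amounts to guarding the fresh-create fallback for renames with a file-type check: only regular files whose source has already disappeared are recreated on the slave, while directories are left to whichever worker processed the rename first. Below is a simplified sketch of that guard; the entry dict layout and the create_fresh_file helper are illustrative placeholders, not the actual resource.py code.

# Simplified sketch of the guarded RENAME handling: when 'src' is already
# gone (another brick's worker won the race), fall back to a fresh create
# only for regular files.  Sending a regular file's blob for a directory
# is what made the gfid-access translator fail with ENOMEM.
# The entry dict and create_fresh_file() are illustrative placeholders.
import errno
import os
import stat

def handle_rename(entry, slave_root):
    src = os.path.join(slave_root, entry["src"])
    dst = os.path.join(slave_root, entry["dst"])
    try:
        os.rename(src, dst)
    except OSError as exc:
        if exc.errno not in (errno.ENOENT, errno.ESTALE):
            raise
        if stat.S_ISREG(entry["mode"]):
            create_fresh_file(dst, entry)
        # Directories: nothing to do here; the rename already happened
        # on the worker that processed this changelog first.

def create_fresh_file(path, entry):
    # Placeholder for the slave-side entry creation that the real code
    # performs through the gfid-access interface.
    open(path, "a").close()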

--- Additional comment from Anand Avati on 2014-09-29 04:30:03 EDT ---

REVIEW: http://review.gluster.org/8880 (geo-rep: Fix rename of directory syncing.) posted (#1) for review on release-3.6 by Aravinda VK (avishwan)

Comment 1 Anand Avati 2014-09-29 08:49:39 UTC
REVIEW: http://review.gluster.org/8880 (geo-rep: Fix rename of directory syncing.) posted (#2) for review on release-3.6 by Aravinda VK (avishwan)

Comment 2 Anand Avati 2014-09-30 06:33:25 UTC
COMMIT: http://review.gluster.org/8880 committed in release-3.6 by Vijay Bellur (vbellur) 
------
commit 19b2923fd56f19dadf2d81a76a0008784a4f684f
Author: Kotresh HR <khiremat>
Date:   Thu Sep 25 17:34:43 2014 +0530

    geo-rep: Fix rename of directory syncing.
    
    The rename of a directory is captured in the changelogs of all
    the distributed bricks, and gsyncd processes these changelogs on
    each brick in parallel. The first changelog to be processed
    succeeds; all subsequent ones stat the 'src' and, if it is absent,
    fall back to creating the entry freshly on the slave. That
    fallback should be done only for files, not for directories.
    When this code path was hit for a directory, a regular file's
    blob was sent as the directory's blob, and the gfid-access
    translator errored out with 'Invalid blob length' and errno
    'ENOMEM'.
    
    Change-Id: I50545b02b98846464876795159d2446340155c82
    BUG: 1147422
    Signed-off-by: Kotresh HR <khiremat>
    Reviewed-on: http://review.gluster.org/8865
    Reviewed-by: Aravinda VK <avishwan>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Venky Shankar <vshankar>
    Tested-by: Venky Shankar <vshankar>
    Reviewed-on: http://review.gluster.org/8880
    Reviewed-by: Vijay Bellur <vbellur>

Comment 3 Niels de Vos 2014-11-11 08:40:12 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.6.1, please reopen this bug report.

glusterfs-3.6.1 has been announced [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and on the update infrastructure for your distribution.

[1] http://supercolony.gluster.org/pipermail/gluster-users/2014-November/019410.html
[2] http://supercolony.gluster.org/mailman/listinfo/gluster-users

Comment 4 Aravinda VK 2015-01-15 10:00:09 UTC
*** Bug 1159190 has been marked as a duplicate of this bug. ***

