Bug 1159190 - dist-geo-rep: Session going into faulty with "Cannot allocate memory" backtrace when pause, rename and resume are performed
Keywords:
Status: CLOSED DUPLICATE of bug 1147422
Alias: None
Product: GlusterFS
Classification: Community
Component: geo-replication
Version: 3.6.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Kotresh HR
QA Contact:
URL:
Whiteboard:
Depends On: 1144428 1146823
Blocks: 1147422
 
Reported: 2014-10-31 07:52 UTC by Kotresh HR
Modified: 2015-01-15 10:00 UTC (History)
13 users

Fixed In Version:
Clone Of: 1146823
Environment:
Last Closed: 2015-01-15 10:00:09 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Kotresh HR 2014-10-31 07:52:34 UTC
+++ This bug was initially created as a clone of Bug #1146823 +++

+++ This bug was initially created as a clone of Bug #1144428 +++

Description of problem:
The session goes into faulty with an "OSError: [Errno 12] Cannot allocate memory" backtrace in the logs. The sequence of operations performed was: sync existing data -> pause the session -> rename all the files -> resume the session.

Version-Release number of selected component (if applicable):
mainline

How reproducible:
Hit only once. Not sure I will be able to reproduce again.

Steps to Reproduce:
1. Create and start a geo-rep session between a 2*2 dist-rep master and a 2*2 dist-rep slave volume.
2. Create and sync some 5k files in some directory structure.
3. Pause the session.
4. Rename all the files.
5. Resume the session.
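Assuming the standard gluster CLI and the volume/host names from the status output below, the steps above can be sketched roughly as follows (the /mnt/master mount point and file layout are illustrative, not from the original report):

```shell
# 1. Create and start a geo-rep session (volumes "master" and "slave" assumed
#    to exist already as 2*2 distributed-replicate volumes)
gluster volume geo-replication master nirvana::slave create push-pem
gluster volume geo-replication master nirvana::slave start

# 2. Create and sync ~5k files in a directory structure on the master mount
mkdir -p /mnt/master/dir{1..10}
for d in /mnt/master/dir*; do touch "$d"/file{1..500}; done

# 3-5. Pause, rename everything, then resume
gluster volume geo-replication master nirvana::slave pause
for f in /mnt/master/dir*/file*; do mv "$f" "$f".renamed; done
gluster volume geo-replication master nirvana::slave resume
```

This is an operations sketch against a live cluster, not a self-contained script.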

Actual results:
The session went faulty:

MASTER NODE                 MASTER VOL    MASTER BRICK      SLAVE               STATUS     CHECKPOINT STATUS    CRAWL STATUS        
-----------------------------------------------------------------------------------------------------------------------------
ccr.blr.redhat.com          master        /bricks/brick0    nirvana::slave      faulty     N/A                  N/A                 
metallica.blr.redhat.com    master        /bricks/brick1    acdc::slave         Passive    N/A                  N/A                 
beatles.blr.redhat.com      master        /bricks/brick3    rammstein::slave    Passive    N/A                  N/A                 
pinkfloyd.blr.redhat.com    master        /bricks/brick2    led::slave          faulty     N/A                  N/A                 


The backtrace in the master logs:

[2014-09-19 16:19:53.933645] I [master(/bricks/brick2):1225:crawl] _GMaster: slave's time: (1411061833, 0)
[2014-09-19 16:20:33.653033] E [repce(/bricks/brick2):207:__call__] RepceClient: call 18787:139727562630912:1411123833.64 (entry_ops) failed on peer with OSError
[2014-09-19 16:20:33.653924] E [syncdutils(/bricks/brick2):270:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 164, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 643, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1324, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 524, in crawlwrap
    self.crawl(no_stime_update=no_stime_update)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1236, in crawl
    self.process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 927, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 891, in process_change
    self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 12] Cannot allocate memory
[2014-09-19 16:20:33.657620] I [syncdutils(/bricks/brick2):214:finalize] <top>: exiting.
[2014-09-19 16:20:33.663028] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2014-09-19 16:20:33.663907] I [syncdutils(agent):214:finalize] <top>: exiting.
[2014-09-19 16:20:33.795839] I [monitor(monitor):222:monitor] Monitor: worker(/bricks/brick2) died in startup phase


This is a remote backtrace propagated to the master via RPC. The actual backtrace in the slave logs is:

[2014-09-19 16:27:45.780600] E [repce(slave):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 662, in entry_ops
    [ENOENT, ESTALE, EINVAL])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 470, in errno_wrap
    return call(*arg)
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 78, in lsetxattr
    cls.raise_oserr()
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 37, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 12] Cannot allocate memory
[2014-09-19 16:27:45.794786] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
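The slave-side frames show entry_ops calling errno_wrap with a tolerated-errno list of [ENOENT, ESTALE, EINVAL]; ENOMEM is not in that list, so the lsetxattr failure propagates up and, via RPC, faults the master worker. A minimal sketch of that whitelisting behaviour follows (simplified from syncdaemon/syncdutils.py; the failing lsetxattr is a hypothetical stand-in, not the real libcxattr call):

```python
import errno
import os

def errno_wrap(call, args=[], tolerated=[]):
    """Simplified sketch of errno_wrap from syncdaemon/syncdutils.py:
    OSErrors whose errno is in `tolerated` are swallowed as benign
    (e.g. the entry already vanished); anything else, such as ENOMEM,
    propagates and takes the worker down."""
    try:
        return call(*args)
    except OSError as ex:
        if ex.errno in tolerated:
            return None  # tolerated error, treated as a no-op
        raise            # e.g. OSError: [Errno 12] Cannot allocate memory

def failing_lsetxattr(path, attr, val):
    # Hypothetical stand-in for an lsetxattr that fails with ENOMEM
    raise OSError(errno.ENOMEM, os.strerror(errno.ENOMEM))

# ENOMEM is not in the tolerated list, so the OSError escapes,
# matching the slave traceback above:
try:
    errno_wrap(failing_lsetxattr, ["/f", "trusted.gfid", b"\x00" * 16],
               [errno.ENOENT, errno.ESTALE, errno.EINVAL])
except OSError as ex:
    print(ex.errno == errno.ENOMEM)  # -> True
```

The design choice here is an errno whitelist per call site: transient "entry already gone" races are expected during crawls and silently ignored, while unexpected errors like ENOMEM are deliberately allowed to crash the worker so the monitor can restart it.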


Expected results:
There should be no backtraces and no faulty sessions.

Additional info:
The slave volume had cluster.hash-range-gfid set to on.

Comment 1 Aravinda VK 2015-01-15 10:00:09 UTC

*** This bug has been marked as a duplicate of bug 1147422 ***

