Bug 983385 - Dist-geo-rep : Lots of data creation and deletion resulted in too many failed to sync logs in geo-rep log file, consequently one of the session stopped syncing.
Summary: Dist-geo-rep : Lots of data creation and deletion resulted in too many failed...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: geo-replication
Version: 2.1
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Venky Shankar
QA Contact: Vijaykumar Koppad
URL:
Whiteboard:
Duplicates: 983572 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2013-07-11 06:19 UTC by Vijaykumar Koppad
Modified: 2014-08-25 00:50 UTC (History)
8 users (show)

Fixed In Version: glusterfs-3.4.0.14rhs-1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-09-23 22:38:43 UTC
Embargoed:


Attachments (Terms of Use)

Description Vijaykumar Koppad 2013-07-11 06:19:37 UTC
Description of problem: If there is a lot of data creation and deletion happening, there will be many "failed to sync" messages in the geo-rep log file, like this:

[2013-07-11 11:02:15.113795] W [master(/bricks/brick3):837:regjob] _GMaster: failed to sync .gfid/bf06b56b-94ae-4617-9d9e-1d8618ee246e
[2013-07-11 11:02:15.116051] W [master(/bricks/brick3):837:regjob] _GMaster: failed to sync .gfid/cd342722-2e99-4372-9257-2a2e80a241f1
[2013-07-11 11:02:15.118213] W [master(/bricks/brick3):837:regjob] _GMaster: failed to sync .gfid/b4581b84-e9d9-419a-9b56-b77903526505

There will be a few tracebacks like:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[2013-07-11 11:02:16.35012] E [repce(/bricks/brick3):188:__call__] RepceClient: call 3272:140181893072640:1373520735.2 (entry_ops) failed on peer with OSError
[2013-07-11 11:02:16.35907] E [syncdutils(/bricks/brick3):206:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 133, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 510, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1060, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 525, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 928, in crawl
    self.process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 908, in process
    self.process_change(change)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 899, in process_change
    self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 204, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 189, in __call__
    raise res
OSError: [Errno 11] Resource temporarily unavailable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


This results in that particular session going Faulty for some time.
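The `raise res` frame at the bottom of the traceback is where RepceClient re-raises an exception that actually occurred on the slave. A minimal sketch of that pattern (illustrative only, not the actual repce code; `slave_entry_ops`, `rpc_roundtrip`, and `master_call` are hypothetical names):

```python
import errno
import pickle

def slave_entry_ops():
    # Simulate the slave side failing with EAGAIN ("Resource
    # temporarily unavailable") while applying entry operations.
    raise OSError(errno.EAGAIN, "Resource temporarily unavailable")

def rpc_roundtrip(func):
    # Run the "remote" call; on failure, ship the exception back
    # to the caller as serialized data instead of raising locally.
    try:
        return False, func()
    except Exception as exc:
        return True, pickle.dumps(exc)

def master_call(func):
    failed, res = rpc_roundtrip(func)
    if failed:
        # Same shape as the repce.py frame in the traceback above:
        # the deserialized slave-side exception is raised in the
        # master process, so the master's log shows the OSError.
        raise pickle.loads(res)
    return res
```

This is why an EAGAIN raised inside the slave's entry_ops shows up in the master's geo-rep log as if it happened locally.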


Version-Release number of selected component (if applicable): 3.4.0.12rhs.beta3-1.el6rhs.x86_64


How reproducible: Observed it once.


Steps to Reproduce:
1. Create and start a geo-rep session between master and slave.
2. On the master, create and remove files in a loop overnight.
We can use: while :; do ./crefi -n 100 --multi -b 10 -d 10 --random --max=500K --min=10 <MNT_PNT>; sleep 500; rm -rf <MNT_PNT>/*; done
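For reference, a hedged Python stand-in for the crefi-based churn above (assumed workload shape only; `churn` and its parameters are hypothetical, and `mnt` stands in for <MNT_PNT>):

```python
import os
import shutil
import tempfile

def churn(mnt, rounds=3, files_per_round=100):
    # Burst-create a directory of small files, then wipe the mount,
    # mimicking the "crefi ...; sleep; rm -rf <MNT_PNT>/*" loop that
    # generates heavy create/unlink changelog traffic on the master.
    for r in range(rounds):
        d = os.path.join(mnt, "level%02d" % r)
        os.makedirs(d, exist_ok=True)
        for i in range(files_per_round):
            with open(os.path.join(d, "f%04d" % i), "wb") as f:
                f.write(os.urandom(512))  # small random payload
        # Delete everything, like 'rm -rf <MNT_PNT>/*'.
        for entry in os.listdir(mnt):
            shutil.rmtree(os.path.join(mnt, entry))

if __name__ == "__main__":
    scratch = tempfile.mkdtemp()  # substitute a real geo-rep master mount
    churn(scratch, rounds=2, files_per_round=10)
```

Pointing this at a geo-replicated master mount (instead of a temp directory) reproduces the create-then-delete pattern from the original steps.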


Actual results: The logs have a lot of "failed to sync" messages, and one of the sessions stops syncing.


Expected results: Even if there are some failures, the session should recover quickly and resume syncing.
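The recovery behaviour asked for here amounts to retrying operations that fail with a transient EAGAIN instead of leaving the session Faulty. A generic retry-with-backoff wrapper sketching that idea (not gsyncd code; `retry_transient` is a hypothetical name):

```python
import errno
import time

def retry_transient(op, attempts=5, delay=0.01):
    # Retry 'op' on EAGAIN with exponential backoff; re-raise
    # immediately on any other errno, or once retries run out.
    for attempt in range(attempts):
        try:
            return op()
        except OSError as exc:
            if exc.errno != errno.EAGAIN or attempt == attempts - 1:
                raise  # permanent error, or out of retries
            time.sleep(delay * (2 ** attempt))
```

With this shape, a burst of "Resource temporarily unavailable" errors would cost a few retries rather than a Faulty session.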


Additional info:


The slave logs had entries like this:

[2013-07-11 06:02:00.031158] W [fuse-bridge.c:2334:fuse_create_cbk] 0-glusterfs-fuse: 139561: <gfid:00000000-0000-0000-0000-00000000000d>/c29369ac-db3a-4a33-8ade-973820d01f15 => -1 (No such file or directory)
[2013-07-11 06:02:00.031363] W [defaults.c:1291:default_release] (-->/usr/lib64/glusterfs/3.4.0.12rhs.beta3/xlator/cluster/distribute.so(dht_create+0x390) [0x7f7071f4e740] (-->/usr/lib64/glusterfs/3.4.0.12rhs.beta3/xlator/cluster/distribute.so(dht_local_wipe+0xa7) [0x7f7071f38f67] (-->/usr/lib64/libglusterfs.so.0(fd_unref+0x13b) [0x3956a3928b]))) 0-fuse: xlator does not implement release_cbk
[2013-07-11 06:02:00.074124] W [fuse-bridge.c:2334:fuse_create_cbk] 0-glusterfs-fuse: 139564: <gfid:00000000-0000-0000-0000-00000000000d>/c3953efa-9dc5-44dc-ad07-506a6355acbb => -1 (No such file or directory)
[2013-07-11 06:02:00.074335] W [defaults.c:1291:default_release] (-->/usr/lib64/glusterfs/3.4.0.12rhs.beta3/xlator/cluster/distribute.so(dht_create+0x390) [0x7f7071f4e740] (-->/usr/lib64/glusterfs/3.4.0.12rhs.beta3/xlator/cluster/distribute.so(dht_local_wipe+0xa7) [0x7f7071f38f67] (-->/usr/lib64/libglusterfs.so.0(fd_unref+0x13b) [0x3956a3928b]))) 0-fuse: xlator does not implement release_cbk
[2013-07-11 06:02:00.080759] W [fuse-bridge.c:2334:fuse_create_cbk] 0-glusterfs-fuse: 139567: <gfid:00000000-0000-0000-0000-00000000000d>/c3f97a3c-856a-43bc-8ca2-012a4d82a258 => -1 (No such file or directory)
[2013-07-11 06:02:00.080970] W [defaults.c:1291:default_release] (-->/usr/lib64/glusterfs/3.4.0.12rhs.beta3/xlator/cluster/distribute.so(dht_create+0x390) [0x7f7071f4e740] (-->/usr/lib64/glusterfs/3.4.0.12rhs.beta3/xlator/cluster/distribute.so(dht_local_wipe+0xa7) [0x7f7071f38f67] (-->/usr/lib64/libglusterfs.so.0(fd_unref+0x13b) [0x3956a3928b]))) 0-fuse: xlator does not implement release_cbk

Comment 2 Venky Shankar 2013-07-12 12:27:11 UTC
There are many of these entries in the gsyncd auxiliary mount client logs:

583aaa20-e2e9-4e78-ac0f-83cf5ee31d75:gluster%3A%2F%2F127.0.0.1%3Aslave.gluster.log:[2013-07-12 12:12:24.444442] W [fuse-bridge.c:2334:fuse_create_cbk] 0-glusterfs-fuse: 659827: <gfid:00000000-0000-0000-0000-00000000000d>/aebb1cad-ec43-4532-a9d2-de24671c65b5 => -1 (No such file or directory)
583aaa20-e2e9-4e78-ac0f-83cf5ee31d75:gluster%3A%2F%2F127.0.0.1%3Aslave.gluster.log:[2013-07-12 12:12:32.621215] W [fuse-bridge.c:2334:fuse_create_cbk] 0-glusterfs-fuse: 660472: <gfid:00000000-0000-0000-0000-00000000000d>/aebb1cad-ec43-4532-a9d2-de24671c65b5 => -1 (No such file or directory)
583aaa20-e2e9-4e78-ac0f-83cf5ee31d75:gluster%3A%2F%2F127.0.0.1%3Aslave.gluster.log:[2013-07-12 12:12:46.993507] W [fuse-bridge.c:2334:fuse_create_cbk] 0-glusterfs-fuse: 662023: <gfid:00000000-0000-0000-0000-00000000000d>/aebb1cad-ec43-4532-a9d2-de24671c65b5 => -1 (No such file or directory)
583aaa20-e2e9-4e78-ac0f-83cf5ee31d75:gluster%3A%2F%2F127.0.0.1%3Aslave.gluster.log:[2013-07-12 12:13:08.500603] W [fuse-bridge.c:2334:fuse_create_cbk] 0-glusterfs-fuse: 664403: <gfid:00000000-0000-0000-0000-00000000000d>/aebb1cad-ec43-4532-a9d2-de24671c65b5 => -1 (No such file or directory)
583aaa20-e2e9-4e78-ac0f-83cf5ee31d75:gluster%3A%2F%2F127.0.0.1%3Aslave.gluster.log:[2013-07-12 12:13:33.706942] W [fuse-bridge.c:2334:fuse_create_cbk] 0-glusterfs-fuse: 667261: <gfid:00000000-0000-0000-0000-00000000000d>/aebb1cad-ec43-4532-a9d2-de24671c65b5 => -1 (No such file or directory)

-------------------------------------------------------------------------

These come from fuse_create_cbk(), indicating that a create failed because of the missing parent gfid '00000000-0000-0000-0000-00000000000d'.

Shouldn't this be the root gfid (0x1) instead of the virtual gfid (0xd)?
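For clarity, the two gfids being contrasted are just UUIDs differing in their last byte; a small sketch of both values (constant names are illustrative):

```python
import uuid

# The root of a gluster volume is gfid 0x1; the gfid reported as the
# missing parent in the slave log is the aux-mount virtual gfid 0xd.
ROOT_GFID = uuid.UUID(int=0x1)     # ...-000000000001
VIRTUAL_GFID = uuid.UUID(int=0xd)  # ...-00000000000d, as seen in the log
```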

Comment 3 Venky Shankar 2013-07-12 13:29:03 UTC
*** Bug 983572 has been marked as a duplicate of this bug. ***

Comment 5 Vijaykumar Koppad 2013-08-10 05:52:54 UTC
Verified on glusterfs-3.4.0.17rhs-1.el6rhs.x86_64.

Comment 6 Scott Haines 2013-09-23 22:38:43 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html

