Bug 983385 - Dist-geo-rep : Lots of data creation and deletion resulted in too many failed to sync logs in geo-rep log file, consequently one of the session stopped syncing.
Summary: Dist-geo-rep : Lots of data creation and deletion resulted in too many failed...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: geo-replication
Version: 2.1
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Venky Shankar
QA Contact: Vijaykumar Koppad
URL:
Whiteboard:
Duplicates: 983572 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2013-07-11 06:19 UTC by Vijaykumar Koppad
Modified: 2014-08-25 00:50 UTC (History)
8 users (show)

Fixed In Version: glusterfs-3.4.0.14rhs-1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-09-23 22:38:43 UTC
Embargoed:


Attachments (Terms of Use)

Description Vijaykumar Koppad 2013-07-11 06:19:37 UTC
Description of problem: If there is a lot of data creation and deletion happening, there will be many "failed to sync" messages in the geo-rep log file, like this:

[2013-07-11 11:02:15.113795] W [master(/bricks/brick3):837:regjob] _GMaster: failed to sync .gfid/bf06b56b-94ae-4617-9d9e-1d8618ee246e
[2013-07-11 11:02:15.116051] W [master(/bricks/brick3):837:regjob] _GMaster: failed to sync .gfid/cd342722-2e99-4372-9257-2a2e80a241f1
[2013-07-11 11:02:15.118213] W [master(/bricks/brick3):837:regjob] _GMaster: failed to sync .gfid/b4581b84-e9d9-419a-9b56-b77903526505

There will be a few tracebacks like:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[2013-07-11 11:02:16.35012] E [repce(/bricks/brick3):188:__call__] RepceClient: call 3272:140181893072640:1373520735.2 (entry_ops) failed on peer with OSError
[2013-07-11 11:02:16.35907] E [syncdutils(/bricks/brick3):206:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 133, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 510, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1060, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 525, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 928, in crawl
    self.process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 908, in process
    self.process_change(change)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 899, in process_change
    self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 204, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 189, in __call__
    raise res
OSError: [Errno 11] Resource temporarily unavailable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


This results in that particular session going Faulty for some time.
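The `raise res` frame at the bottom of the traceback is where RepceClient re-raises an exception that actually occurred on the slave. A minimal sketch of that pattern (illustrative only, not the actual repce code; `slave_entry_ops`, `rpc_roundtrip`, and `master_call` are hypothetical names):

```python
import errno
import pickle

def slave_entry_ops():
    # Simulate the slave side failing with EAGAIN ("Resource
    # temporarily unavailable") while applying entry operations.
    raise OSError(errno.EAGAIN, "Resource temporarily unavailable")

def rpc_roundtrip(func):
    # Run the "remote" call; on failure, ship the exception back
    # to the caller as serialized data instead of raising locally.
    try:
        return False, func()
    except Exception as exc:
        return True, pickle.dumps(exc)

def master_call(func):
    failed, res = rpc_roundtrip(func)
    if failed:
        # Same shape as the repce.py frame in the traceback above:
        # the deserialized slave-side exception is raised in the
        # master process, so the master's log shows the OSError.
        raise pickle.loads(res)
    return res
```

This is why an EAGAIN raised inside the slave's entry_ops shows up in the master's geo-rep log as if it happened locally.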


Version-Release number of selected component (if applicable): 3.4.0.12rhs.beta3-1.el6rhs.x86_64


How reproducible: Observed it once.


Steps to Reproduce:
1. Create and start a geo-rep session between master and slave.
2. On the master, create and remove files in a loop overnight.
We can use: while :; do ./crefi -n 100 --multi -b 10 -d 10 --random --max=500K --min=10 <MNT_PNT>; sleep 500; rm -rf <MNT_PNT>/*; done
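For reference, a hedged Python stand-in for the crefi-based churn above (assumed workload shape only; `churn` and its parameters are hypothetical, and `mnt` stands in for <MNT_PNT>):

```python
import os
import shutil
import tempfile

def churn(mnt, rounds=3, files_per_round=100):
    # Burst-create a directory of small files, then wipe the mount,
    # mimicking the "crefi ...; sleep; rm -rf <MNT_PNT>/*" loop that
    # generates heavy create/unlink changelog traffic on the master.
    for r in range(rounds):
        d = os.path.join(mnt, "level%02d" % r)
        os.makedirs(d, exist_ok=True)
        for i in range(files_per_round):
            with open(os.path.join(d, "f%04d" % i), "wb") as f:
                f.write(os.urandom(512))  # small random payload
        # Delete everything, like 'rm -rf <MNT_PNT>/*'.
        for entry in os.listdir(mnt):
            shutil.rmtree(os.path.join(mnt, entry))

if __name__ == "__main__":
    scratch = tempfile.mkdtemp()  # substitute a real geo-rep master mount
    churn(scratch, rounds=2, files_per_round=10)
```

Pointing this at a geo-replicated master mount (instead of a temp directory) reproduces the create-then-delete pattern from the original steps.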


Actual results: The logs have a lot of "failed to sync" messages, and one of the sessions stops syncing.


Expected results: Even if there are some failures, the session should recover quickly and resume syncing.
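The recovery behaviour asked for here amounts to retrying operations that fail with a transient EAGAIN instead of leaving the session Faulty. A generic retry-with-backoff wrapper sketching that idea (not gsyncd code; `retry_transient` is a hypothetical name):

```python
import errno
import time

def retry_transient(op, attempts=5, delay=0.01):
    # Retry 'op' on EAGAIN with exponential backoff; re-raise
    # immediately on any other errno, or once retries run out.
    for attempt in range(attempts):
        try:
            return op()
        except OSError as exc:
            if exc.errno != errno.EAGAIN or attempt == attempts - 1:
                raise  # permanent error, or out of retries
            time.sleep(delay * (2 ** attempt))
```

With this shape, a burst of "Resource temporarily unavailable" errors would cost a few retries rather than a Faulty session.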


Additional info:


The slave logs had entries like this:

[2013-07-11 06:02:00.031158] W [fuse-bridge.c:2334:fuse_create_cbk] 0-glusterfs-fuse: 139561: <gfid:00000000-0000-0000-0000-00000000000d>/c29369ac-db3a-4a33-8ade-973820d01f15 => -1 (No such file or directory)
[2013-07-11 06:02:00.031363] W [defaults.c:1291:default_release] (-->/usr/lib64/glusterfs/3.4.0.12rhs.beta3/xlator/cluster/distribute.so(dht_create+0x390) [0x7f7071f4e740] (-->/usr/lib64/glusterfs/3.4.0.12rhs.beta3/xlator/cluster/distribute.so(dht_local_wipe+0xa7) [0x7f7071f38f67] (-->/usr/lib64/libglusterfs.so.0(fd_unref+0x13b) [0x3956a3928b]))) 0-fuse: xlator does not implement release_cbk
[2013-07-11 06:02:00.074124] W [fuse-bridge.c:2334:fuse_create_cbk] 0-glusterfs-fuse: 139564: <gfid:00000000-0000-0000-0000-00000000000d>/c3953efa-9dc5-44dc-ad07-506a6355acbb => -1 (No such file or directory)
[2013-07-11 06:02:00.074335] W [defaults.c:1291:default_release] (-->/usr/lib64/glusterfs/3.4.0.12rhs.beta3/xlator/cluster/distribute.so(dht_create+0x390) [0x7f7071f4e740] (-->/usr/lib64/glusterfs/3.4.0.12rhs.beta3/xlator/cluster/distribute.so(dht_local_wipe+0xa7) [0x7f7071f38f67] (-->/usr/lib64/libglusterfs.so.0(fd_unref+0x13b) [0x3956a3928b]))) 0-fuse: xlator does not implement release_cbk
[2013-07-11 06:02:00.080759] W [fuse-bridge.c:2334:fuse_create_cbk] 0-glusterfs-fuse: 139567: <gfid:00000000-0000-0000-0000-00000000000d>/c3f97a3c-856a-43bc-8ca2-012a4d82a258 => -1 (No such file or directory)
[2013-07-11 06:02:00.080970] W [defaults.c:1291:default_release] (-->/usr/lib64/glusterfs/3.4.0.12rhs.beta3/xlator/cluster/distribute.so(dht_create+0x390) [0x7f7071f4e740] (-->/usr/lib64/glusterfs/3.4.0.12rhs.beta3/xlator/cluster/distribute.so(dht_local_wipe+0xa7) [0x7f7071f38f67] (-->/usr/lib64/libglusterfs.so.0(fd_unref+0x13b) [0x3956a3928b]))) 0-fuse: xlator does not implement release_cbk

Comment 2 Venky Shankar 2013-07-12 12:27:11 UTC
There are many of these entries in the gsyncd auxiliary mount client logs:

583aaa20-e2e9-4e78-ac0f-83cf5ee31d75:gluster%3A%2F%2F127.0.0.1%3Aslave.gluster.log:[2013-07-12 12:12:24.444442] W [fuse-bridge.c:2334:fuse_create_cbk] 0-glusterfs-fuse: 659827: <gfid:00000000-0000-0000-0000-00000000000d>/aebb1cad-ec43-4532-a9d2-de24671c65b5 => -1 (No such file or directory)
583aaa20-e2e9-4e78-ac0f-83cf5ee31d75:gluster%3A%2F%2F127.0.0.1%3Aslave.gluster.log:[2013-07-12 12:12:32.621215] W [fuse-bridge.c:2334:fuse_create_cbk] 0-glusterfs-fuse: 660472: <gfid:00000000-0000-0000-0000-00000000000d>/aebb1cad-ec43-4532-a9d2-de24671c65b5 => -1 (No such file or directory)
583aaa20-e2e9-4e78-ac0f-83cf5ee31d75:gluster%3A%2F%2F127.0.0.1%3Aslave.gluster.log:[2013-07-12 12:12:46.993507] W [fuse-bridge.c:2334:fuse_create_cbk] 0-glusterfs-fuse: 662023: <gfid:00000000-0000-0000-0000-00000000000d>/aebb1cad-ec43-4532-a9d2-de24671c65b5 => -1 (No such file or directory)
583aaa20-e2e9-4e78-ac0f-83cf5ee31d75:gluster%3A%2F%2F127.0.0.1%3Aslave.gluster.log:[2013-07-12 12:13:08.500603] W [fuse-bridge.c:2334:fuse_create_cbk] 0-glusterfs-fuse: 664403: <gfid:00000000-0000-0000-0000-00000000000d>/aebb1cad-ec43-4532-a9d2-de24671c65b5 => -1 (No such file or directory)
583aaa20-e2e9-4e78-ac0f-83cf5ee31d75:gluster%3A%2F%2F127.0.0.1%3Aslave.gluster.log:[2013-07-12 12:13:33.706942] W [fuse-bridge.c:2334:fuse_create_cbk] 0-glusterfs-fuse: 667261: <gfid:00000000-0000-0000-0000-00000000000d>/aebb1cad-ec43-4532-a9d2-de24671c65b5 => -1 (No such file or directory)

-------------------------------------------------------------------------

These come from fuse_create_cbk(), indicating that a create failed because of the missing parent gfid '00000000-0000-0000-0000-00000000000d'.

Shouldn't this be the root gfid (0x1) instead of the virtual gfid (0xd)?
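For clarity, the two gfids being contrasted are just UUIDs differing in their last byte; a small sketch of both values (constant names are illustrative):

```python
import uuid

# The root of a gluster volume is gfid 0x1; the gfid reported as the
# missing parent in the slave log is the aux-mount virtual gfid 0xd.
ROOT_GFID = uuid.UUID(int=0x1)     # ...-000000000001
VIRTUAL_GFID = uuid.UUID(int=0xd)  # ...-00000000000d, as seen in the log
```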

Comment 3 Venky Shankar 2013-07-12 13:29:03 UTC
*** Bug 983572 has been marked as a duplicate of this bug. ***

Comment 5 Vijaykumar Koppad 2013-08-10 05:52:54 UTC
Verified on glusterfs-3.4.0.17rhs-1.el6rhs.x86_64.

Comment 6 Scott Haines 2013-09-23 22:38:43 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html

