1025231 – Dist-geo-rep : geo-rep status goes to faulty with backtrace "failed on peer with KeyError 'stat'"

Bug 1025231 - Dist-geo-rep : geo-rep status goes to faulty with backtrace "failed on peer with KeyError 'stat'"

Summary: Dist-geo-rep : geo-rep status goes to faulty with backtrace "failed on peer w...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	geo-replication
Sub Component:
Version:	2.1
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Venky Shankar
QA Contact:	Vijaykumar Koppad
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2013-10-31 09:51 UTC by Vijaykumar Koppad
Modified:	2014-08-25 00:50 UTC (History)
CC List:	6 users (show)
Fixed In Version:	glusterfs-3.4.0.37rhs-1
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2013-11-27 15:45:01 UTC
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2013:1769	0	normal	SHIPPED_LIVE	Red Hat Storage 2.1 enhancement and bug fix update #1	2013-11-27 20:17:39 UTC

Description Vijaykumar Koppad 2013-10-31 09:51:37 UTC

Description of problem: Just starting files sync to slave with changelog mode of syncing, got a traceback in the master, and geo-rep status goes to faulty, and gets stuck in the same state,


>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-10-31 13:47:08.476759] I [master(/bricks/brick3):370:crawlwrap] _GMaster: 20 crawls, 0 turns
[2013-10-31 13:48:08.555270] I [master(/bricks/brick3):370:crawlwrap] _GMaster: 20 crawls, 0 turns
[2013-10-31 13:48:17.601533] E [repce(/bricks/brick3):188:__call__] RepceClient: call 1458:139835433391872:1383207497.59 (entry_ops) failed on peer with KeyError
[2013-10-31 13:48:17.602133] E [syncdutils(/bricks/brick3):207:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 150, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 530, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1077, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 381, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 818, in crawl
    self.process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 775, in process
    if self.process_change(change, done, retry):
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 744, in process_change
    self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 204, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 189, in __call__
    raise res
KeyError: 'stat'
[2013-10-31 13:48:17.604809] I [syncdutils(/bricks/brick3):159:finalize] <top>: exiting.
[2013-10-31 13:48:17.613370] I [monitor(monitor):81:set_state] Monitor: new state: faulty
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>.

And on the slave side this is the traceback, 

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-10-31 14:52:32.894243] I [resource(slave):631:service_loop] GLUSTER: slave listening
[2013-10-31 14:52:36.283513] E [repce(slave):103:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 99, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 515, in entry_ops
    blob = entry_pack_mkdir(gfid, bname, e['stat'])
KeyError: 'stat'
[2013-10-31 14:52:36.295047] I [repce(slave):78:service_loop] RepceServer: terminating on reaching EOF.
[2013-10-31 14:52:36.295408] I [syncdutils(slave):159:finalize] <top>: exiting.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Version-Release number of selected component (if applicable):glusterfs-3.4.0.37rhs-1.el6rhs.x86_64


How reproducible: Doesn't happen everytime. 


Steps to Reproduce:
1.Create and start a geo-rep relationship between master and slave.
2.start creating files on the master, 
3.Check the status of the geo-rep 

Actual results: The geo-rep status goes to faulty.


Expected results: Geo-rep should never go to faulty .


Additional info:

Comment 2 Venky Shankar 2013-11-02 09:05:43 UTC

Vijaykumar,

was the slave cluster not updated with the new build?

With the new build, the stat structure is not passed for create/mknod/mkdir calls. I see in the backtrace that the slave gsyncd accepting a stat structure.

Comment 3 Amar Tumballi 2013-11-02 09:49:22 UTC

fixed as part of performance enhancement done by Venky (https://code.engineering.redhat.com/gerrit/14774)

Comment 4 Vijaykumar Koppad 2013-11-07 11:43:24 UTC

Not able to reproduce it in the build glusterfs-3.4.0.39rhs-1, marking it as verified.

Comment 6 errata-xmlrpc 2013-11-27 15:45:01 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1769.html

Note You need to log in before you can comment on or make changes to this bug.