Bug 1025231 - Dist-geo-rep : geo-rep status goes to faulty with backtrace "failed on peer with KeyError 'stat'"
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: geo-replication
Version: 2.1
Hardware: x86_64 Linux
Priority: high   Severity: high
Assigned To: Venky Shankar
QA Contact: Vijaykumar Koppad
Keywords: ZStream
Reported: 2013-10-31 05:51 EDT by Vijaykumar Koppad
Modified: 2014-08-24 20:50 EDT
CC: 6 users
Fixed In Version: glusterfs-3.4.0.37rhs-1
Doc Type: Bug Fix
Last Closed: 2013-11-27 10:45:01 EST
Type: Bug

Attachments: None
Description Vijaykumar Koppad 2013-10-31 05:51:37 EDT
Description of problem: Shortly after file syncing to the slave started in changelog mode, a traceback appeared in the master log; geo-rep status went to faulty and stayed stuck in that state.


>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-10-31 13:47:08.476759] I [master(/bricks/brick3):370:crawlwrap] _GMaster: 20 crawls, 0 turns
[2013-10-31 13:48:08.555270] I [master(/bricks/brick3):370:crawlwrap] _GMaster: 20 crawls, 0 turns
[2013-10-31 13:48:17.601533] E [repce(/bricks/brick3):188:__call__] RepceClient: call 1458:139835433391872:1383207497.59 (entry_ops) failed on peer with KeyError
[2013-10-31 13:48:17.602133] E [syncdutils(/bricks/brick3):207:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 150, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 530, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1077, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 381, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 818, in crawl
    self.process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 775, in process
    if self.process_change(change, done, retry):
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 744, in process_change
    self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 204, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 189, in __call__
    raise res
KeyError: 'stat'
[2013-10-31 13:48:17.604809] I [syncdutils(/bricks/brick3):159:finalize] <top>: exiting.
[2013-10-31 13:48:17.613370] I [monitor(monitor):81:set_state] Monitor: new state: faulty
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

And on the slave side, this is the traceback:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-10-31 14:52:32.894243] I [resource(slave):631:service_loop] GLUSTER: slave listening
[2013-10-31 14:52:36.283513] E [repce(slave):103:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 99, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 515, in entry_ops
    blob = entry_pack_mkdir(gfid, bname, e['stat'])
KeyError: 'stat'
[2013-10-31 14:52:36.295047] I [repce(slave):78:service_loop] RepceServer: terminating on reaching EOF.
[2013-10-31 14:52:36.295408] I [syncdutils(slave):159:finalize] <top>: exiting.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
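
For context, repce is gsyncd's minimal RPC layer: the slave-side worker invokes the requested method, and if it raises, the exception object itself is shipped back to the master, whose __call__ re-raises it. That is why the identical KeyError: 'stat' appears in both logs. Below is a much-simplified, single-process sketch of that round trip (the real repce exchanges pickled messages between two processes over an ssh pipe; FakeSlave and the entry fields are illustrative, not the actual code):

    # Simplified sketch of repce-style exception forwarding (illustrative).
    def worker(obj, rmeth, *args):
        # Slave side: run the requested method; on failure, hand back the
        # exception object instead of a result.
        try:
            return getattr(obj, rmeth)(*args)
        except Exception as exc:
            return exc

    def call(obj, meth, *args):
        # Master side: re-raise whatever exception the peer returned, which
        # is why the slave's KeyError('stat') surfaces in the master log too.
        res = worker(obj, meth, *args)
        if isinstance(res, Exception):
            raise res
        return res

    class FakeSlave:
        def entry_ops(self, entries):
            for e in entries:
                e['stat']  # old slave code path: KeyError when 'stat' is absent

    call(FakeSlave(), 'entry_ops', [{'gfid': '1234-abcd', 'entry': 'dir1'}])
    # -> raises KeyError: 'stat' on the "master" side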

Version-Release number of selected component (if applicable): glusterfs-3.4.0.37rhs-1.el6rhs.x86_64


How reproducible: Doesn't happen every time.


Steps to Reproduce:
1. Create and start a geo-rep relationship between the master and the slave.
2. Start creating files on the master.
3. Check the status of the geo-rep session (a sample command sequence follows below).
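
For reference, a distributed geo-rep session of this kind is typically set up and checked with the gluster CLI roughly as follows (mastervol, slavehost, and slavevol are placeholders, not names taken from this report):

    # run on a master node; volume and host names are placeholders
    gluster volume geo-replication mastervol slavehost::slavevol create push-pem
    gluster volume geo-replication mastervol slavehost::slavevol start
    gluster volume geo-replication mastervol slavehost::slavevol status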

Actual results: The geo-rep status goes to faulty.


Expected results: Geo-rep status should never go to faulty.


Additional info:
Comment 2 Venky Shankar 2013-11-02 05:05:43 EDT
Vijaykumar,

Was the slave cluster not updated with the new build?

With the new build, the stat structure is no longer passed for create/mknod/mkdir calls, but I see in the backtrace that the slave gsyncd is still expecting a stat structure.
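
In other words, this looks like a mixed-build protocol mismatch: the updated master omits the 'stat' member from the entry records it sends, while an old slave still indexes it unconditionally in entry_pack_mkdir(). A minimal sketch of the mismatch (only the 'stat' key and the resulting KeyError come from the traceback; the record layout and function bodies here are illustrative):

    # Illustrative sketch of the mixed-build mismatch; field names other
    # than 'stat' are made up for the example.
    def entry_ops_old_slave(entries):
        for e in entries:
            st = e['stat']       # old slave: unconditional lookup -> KeyError
            # ... pack and replay the mkdir using st ...

    def entry_ops_new_slave(entries):
        for e in entries:
            st = e.get('stat')   # tolerant lookup: None when omitted
            # ... create/mknod/mkdir proceed without the stat structure ...

    # An updated master sends entry records without 'stat':
    entries = [{'op': 'MKDIR', 'gfid': '1234-abcd', 'entry': 'dir1'}]
    entry_ops_new_slave(entries)  # fine
    entry_ops_old_slave(entries)  # KeyError: 'stat' -> worker dies, status faulty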
Comment 3 Amar Tumballi 2013-11-02 05:49:22 EDT
Fixed as part of a performance enhancement done by Venky (https://code.engineering.redhat.com/gerrit/14774).
Comment 4 Vijaykumar Koppad 2013-11-07 06:43:24 EST
Not able to reproduce it with build glusterfs-3.4.0.39rhs-1; marking it as verified.
Comment 6 errata-xmlrpc 2013-11-27 10:45:01 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1769.html
