Bug 1344826 - [geo-rep]: Worker crashed with "KeyError: "
Summary: [geo-rep]: Worker crashed with "KeyError: "
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: geo-replication
Version: rhgs-3.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.2.0
Assignee: Aravinda VK
QA Contact: Rahul Hinduja
URL:
Whiteboard:
Duplicates: 1400765
Depends On:
Blocks: 1345744 1348085 1348086 1351515 1351530
Reported: 2016-06-11 09:54 UTC by Rahul Hinduja
Modified: 2017-03-23 05:35 UTC (History)

Fixed In Version: glusterfs-3.8.4-1
Doc Type: Bug Fix
Doc Text:
When an rsync operation is retried, the geo-replication process attempted to clean up GFIDs from the rsync queue that were already unlinked during the previous sync attempt. This resulted in a KeyError. The geo-replication process now checks for the existence of a GFID before attempting to unlink a file and remove it from the rsync queue, preventing this failure.
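The failure mode described in the Doc Text can be illustrated with a minimal sketch. This is not the upstream patch: `datas_in_batch` and `unlinked_gfid` are names taken from the traceback in this report, while `remove_unlinked_gfid` is a hypothetical helper introduced only for illustration. It relies on the fact that Python's `set.remove` raises `KeyError` for a missing element, so a retry that removes the same GFID twice fails unless membership is checked first.

```python
# Illustrative sketch only, not the actual gsyncd change.
# datas_in_batch / unlinked_gfid mirror names from the traceback;
# remove_unlinked_gfid is a hypothetical helper.

def remove_unlinked_gfid(datas_in_batch, unlinked_gfid):
    """Drop a GFID from the pending rsync set only if it is still queued.

    Returns True if the entry was removed, False if it was already gone.
    """
    if unlinked_gfid in datas_in_batch:
        datas_in_batch.remove(unlinked_gfid)
        return True
    return False


gfid = ".gfid/757b0ad8-b6f5-44da-b71a-1b1c25a72988"
datas_in_batch = {gfid}

first = remove_unlinked_gfid(datas_in_batch, gfid)   # first attempt removes the entry
second = remove_unlinked_gfid(datas_in_batch, gfid)  # retry is a harmless no-op

# An unguarded removal on the retry reproduces the reported KeyError.
try:
    datas_in_batch.remove(gfid)
    crashed = False
except KeyError:
    crashed = True
```

`set.discard` would give the same no-error behaviour in a single call; the explicit membership check is shown here because it matches the wording of the Doc Text.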
Clone Of:
Clones: 1345744
Environment:
Last Closed: 2017-03-23 05:35:53 UTC




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:0486 normal SHIPPED_LIVE Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update 2017-03-23 09:18:45 UTC

Description Rahul Hinduja 2016-06-11 09:54:03 UTC
Description of problem:
=======================

While performing rm -rf on a cascaded setup, a worker crashed on both the primary master and the intermediate master volume with the following traceback:

Master Volume:
==============

[2016-06-11 09:41:17.359086] E [syncdutils(/rhs/brick1/b1):276:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 201, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 720, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1497, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 571, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1201, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1107, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 984, in process
    self.datas_in_batch.remove(unlinked_gfid)
KeyError: '.gfid/757b0ad8-b6f5-44da-b71a-1b1c25a72988'


Intermediate Master:
====================

[2016-06-11 09:41:51.681622] E [syncdutils(/rhs/brick1/b1):276:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 201, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 720, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1497, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 571, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1201, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1107, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 984, in process
    self.datas_in_batch.remove(unlinked_gfid)
KeyError: '.gfid/757b0ad8-b6f5-44da-b71a-1b1c25a72988'
[2016-06-11 09:41:51.684969] I [syncdutils(/rhs/brick1/b1):220:finalize] <top>: exiting.



Version-Release number of selected component (if applicable):
=============================================================

glusterfs-3.7.9-10


How reproducible:
=================

Always, on a cascaded setup, upon removal (rm -rf)


Steps to Reproduce:
===================
1. Create a cascaded geo-rep setup with three volumes (vol0, vol1, vol2) such that vol0 => vol1 and vol1 => vol2.
2. Mount the vol0 volume and perform fops such as cp, create, chmod, chown, chgrp, symlink, hardlink, and truncate on vol0.
3. Let it sync to the slaves (vol1 and vol2).
4. Calculate the arequal checksum after every fop. It should match.
5. Perform rm -rf on vol0.

Actual results:
===============

Workers crashed on vol0 and vol1 with a KeyError.


Expected results:
=================

Worker shouldn't crash


Additional info:
================

Performed rm -rf on a non-cascaded setup and did not see the crash. Also, the files are eventually removed from all masters and slaves.

Comment 3 Aravinda VK 2016-06-20 06:37:45 UTC
Upstream Patch posted.
http://review.gluster.org/#/c/14706/

Comment 8 Oonkwee Lim_ 2016-07-08 22:28:30 UTC
Hello Aravinda,

The customer is still saying that the files are still not renamed.

From them:

It looks like whatever rename process should have taken place, did not. 

The files are still in the limbo state. What are some next steps I can take?

If I mount the slave bricks RW and rename the files to match the master, will I create an inconsistent state that cannot be recovered from?

Thanks & Regards

Oonkwee
Emerging Technologies
RedHat Global Support

Comment 10 Aravinda VK 2016-07-11 05:56:47 UTC
(In reply to Oonkwee Lim_ from comment #8)
> Hello Aravinda,
> 
> The customer is still saying that the files are still not renamed.
> 
> From them:
> 
> It looks like whatever rename process should have taken place, did not. 
> 
> The files are still in the limbo state. What are some next steps I can take?
> 
> If I mount the slave bricks RW and rename the files to match the master,
> will I create an inconsistent state that cannot be recovered from?
> 
> Thanks & Regards
> 
> Oonkwee
> Emerging Technologies
> RedHat Global Support

It looks like the files in limbo state are the result of errors that occurred previously (before the upgrade).

A safe workaround is:
- Delete the problematic file on the Slave.
- Trigger a resync for the file using a virtual setxattr on the Master mount:
  cd $MASTER_MOUNT/
  setfattr -n glusterfs.geo-rep.trigger-sync -v "1" <file-path-in-master-mount>

Comment 12 Aravinda VK 2016-07-13 06:02:33 UTC
The virtual setxattr (glusterfs.geo-rep.trigger-sync) is similar to a touch command that Geo-replication can understand. It should be set on each file or directory that needs a resync.

If the problematic files are not deleted from the Slave Volume, resyncing may run into errors (with both options).
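
Since the xattr has to be set on each file or directory individually, the per-path step can be scripted. The following is a rough sketch, assuming it is run from inside the Master mount and that the paths listed are the ones needing a resync; the function names and the example paths are hypothetical, and only the setfattr invocation itself comes from this report. It defaults to a dry run that only builds the commands.

```python
import subprocess

# xattr name as given in the workaround in this report
XATTR_NAME = "glusterfs.geo-rep.trigger-sync"


def trigger_sync_command(path):
    """Build the setfattr argv for one path (relative to the Master mount)."""
    return ["setfattr", "-n", XATTR_NAME, "-v", "1", path]


def trigger_resync(paths, dry_run=True):
    """Build, and optionally run, one setfattr command per path.

    With dry_run=True nothing is executed; the commands are only returned
    so they can be reviewed first.
    """
    commands = [trigger_sync_command(p) for p in paths]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError if setfattr fails
            subprocess.run(cmd, check=True)
    return commands


# Hypothetical example paths; replace with the files that need a resync.
cmds = trigger_resync(["dir1/file1", "dir1"])
for cmd in cmds:
    print(" ".join(cmd))
```

Reviewing the dry-run output before passing `dry_run=False` is a cautious default, because the matching files should already have been deleted on the Slave before the resync is triggered.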

Comment 13 Oonkwee Lim_ 2016-07-18 15:53:26 UTC
Post glusterfs.geo-rep.trigger-sync update:

The geo-repl status since performing this operation has been in a Crawl Status of 'History Crawl' and I can see that LAST_SYNCED is advancing, albeit at a snail's pace.

Is there any way to gauge where in the process it might be?

Comment 14 Aravinda VK 2016-07-19 16:54:14 UTC
(In reply to Oonkwee Lim_ from comment #13)
> Post glusterfs.geo-rep.trigger-sync update:
> 
> The geo-repl status since performing this operation has been in a Crawl
> Status of 'History Crawl' and I can see that LAST_SYNCED is advancing,
> albeit at a snail's pace.
> 
> Is there any way to gauge where in the process it might be?

History Crawl processes historical changelogs until it reaches the worker start time (the worker register time can be found in the respective worker's log). Once it crosses the register time, it starts consuming live changelogs. We do not have a way to estimate the pending sync time, since Geo-rep has to reprocess all the changelogs up to the current time.

Comment 16 Atin Mukherjee 2016-09-17 11:57:01 UTC
Upstream mainline : http://review.gluster.org/14706
Upstream 3.8 : http://review.gluster.org/14767

And the fix is available in rhgs-3.2.0 as part of rebase to GlusterFS 3.8.4.

Comment 19 Atin Mukherjee 2016-12-06 05:59:42 UTC
*** Bug 1400765 has been marked as a duplicate of this bug. ***

Comment 48 errata-xmlrpc 2017-03-23 05:35:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html

