Bug 1348085

Summary: [geo-rep]: Worker crashed with "KeyError: "
Product: [Community] GlusterFS
Component: geo-replication
Version: 3.7.11
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Hardware: x86_64
OS: Linux
Reporter: Aravinda VK <avishwan>
Assignee: Aravinda VK <avishwan>
CC: bugs, csaba, rhinduja, rhs-bugs, storage-qa-internal
Fixed In Version: glusterfs-3.7.13
Doc Type: If docs needed, set a value
Clone Of: 1345744
Last Closed: 2016-07-20 13:55:32 UTC
Type: Bug
Bug Depends On: 1344826, 1345744

Description Aravinda VK 2016-06-20 06:38:18 UTC
+++ This bug was initially created as a clone of Bug #1345744 +++

+++ This bug was initially created as a clone of Bug #1344826 +++

Description of problem:
=======================

While performing rm -rf on a cascaded setup, a worker crashed on both the primary master and the intermediate master volume with the following traceback:

Master Volume:
==============

[2016-06-11 09:41:17.359086] E [syncdutils(/rhs/brick1/b1):276:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 201, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 720, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1497, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 571, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1201, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1107, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 984, in process
    self.datas_in_batch.remove(unlinked_gfid)
KeyError: '.gfid/757b0ad8-b6f5-44da-b71a-1b1c25a72988'


Intermediate Master:
====================

[2016-06-11 09:41:51.681622] E [syncdutils(/rhs/brick1/b1):276:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 201, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 720, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1497, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 571, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1201, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1107, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 984, in process
    self.datas_in_batch.remove(unlinked_gfid)
KeyError: '.gfid/757b0ad8-b6f5-44da-b71a-1b1c25a72988'
[2016-06-11 09:41:51.684969] I [syncdutils(/rhs/brick1/b1):220:finalize] <top>: exiting.



Version-Release number of selected component (if applicable):
=============================================================

glusterfs-3.7.9-10


How reproducible:
=================

Always, on a cascaded setup, upon removal (rm -rf)


Steps to Reproduce:
===================
1. Create a cascaded geo-rep setup with three volumes (vol0, vol1, vol2) such that vol0 => vol1 and vol1 => vol2.
2. Mount vol0 and perform fops such as cp, create, chmod, chown, chgrp, symlink, hardlink, and truncate on it.
3. Let the changes sync to the slaves (vol1 and vol2).
4. Calculate the arequal checksum after every fop; it should match.
5. Perform rm -rf on vol0.

Actual results:
===============

Worker crashed on vol0 and vol1 with KeyError.


Expected results:
=================

The worker should not crash.


Additional info:
================

Performed rm -rf on a non-cascaded setup and didn't see the crash. Also, the files are eventually removed from all masters and slaves.

--- Additional comment from Vijay Bellur on 2016-06-13 02:33:20 EDT ---

REVIEW: http://review.gluster.org/14706 (geo-rep: Safely handle if unliked GFID not present in data list) posted (#1) for review on master by Aravinda VK (avishwan)

--- Additional comment from Vijay Bellur on 2016-06-20 02:37:06 EDT ---

COMMIT: http://review.gluster.org/14706 committed in master by Aravinda VK (avishwan) 
------
commit 4797ca3778d82a671716d4913c14f285591ae959
Author: Aravinda VK <avishwan>
Date:   Mon Jun 13 12:00:40 2016 +0530

    geo-rep: Safely handle if unliked GFID not present in data list
    
    If unlinked GFID is not present in data list to be synced then
    Geo-rep worker was crashing with KeyError. Handled KeyError with
    this patch.
    
    BUG: 1345744
    Change-Id: I5a1c9ca4473e32606df2e5c7e26c95faf55d44c0
    Signed-off-by: Aravinda VK <avishwan>
    Reviewed-on: http://review.gluster.org/14706
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Kotresh HR <khiremat>
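
The patch itself is not reproduced in this report, so the following is only a minimal sketch of the kind of handling the commit message describes. It assumes datas_in_batch behaves like a Python set (consistent with remove() raising KeyError in the traceback); the helper name drop_unlinked and the sample GFIDs are hypothetical and are not taken from master.py.

```python
def drop_unlinked(datas_in_batch, unlinked_gfid):
    """Remove unlinked_gfid from the pending data set, tolerating absence.

    Hypothetical helper: returns True if the GFID was queued and has been
    removed, False if it was never in the set.
    """
    try:
        datas_in_batch.remove(unlinked_gfid)
        return True
    except KeyError:
        # The GFID was not queued for data sync in this batch, so there is
        # nothing to drop. Previously this case crashed the worker.
        return False


# Illustrative data only; the GFID values below are made up.
datas = {".gfid/aaaa", ".gfid/bbbb"}
print(drop_unlinked(datas, ".gfid/aaaa"))   # GFID present in the set
print(drop_unlinked(datas, ".gfid/cccc"))   # GFID absent: no exception raised
```

If the caller does not need to know whether the GFID was present, set.discard() gives the same tolerance without a try/except block.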

Comment 1 Vijay Bellur 2016-06-20 06:39:34 UTC
REVIEW: http://review.gluster.org/14766 (geo-rep: Safely handle if unliked GFID not present in data list) posted (#1) for review on release-3.7 by Aravinda VK (avishwan)

Comment 2 Vijay Bellur 2016-06-28 05:31:10 UTC
COMMIT: http://review.gluster.org/14766 committed in release-3.7 by Aravinda VK (avishwan) 
------
commit d22305998f99bb9a5c89b5639ca95b3689881510
Author: Aravinda VK <avishwan>
Date:   Mon Jun 13 12:00:40 2016 +0530

    geo-rep: Safely handle if unliked GFID not present in data list
    
    If unlinked GFID is not present in data list to be synced then
    Geo-rep worker was crashing with KeyError. Handled KeyError with
    this patch.
    
    BUG: 1348085
    Change-Id: I5a1c9ca4473e32606df2e5c7e26c95faf55d44c0
    Signed-off-by: Aravinda VK <avishwan>
    Reviewed-on: http://review.gluster.org/14706
    (cherry picked from commit 4797ca3778d82a671716d4913c14f285591ae959)
    Reviewed-on: http://review.gluster.org/14766
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Kotresh HR <khiremat>
    CentOS-regression: Gluster Build System <jenkins.org>

Comment 3 Kaushal 2016-07-20 13:55:32 UTC
This bug is being closed because a release that should address the reported issue has been made available. If the problem is still not fixed with glusterfs-3.7.13, please open a new bug report.

glusterfs-3.7.13 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://www.gluster.org/pipermail/gluster-users/2016-July/027604.html
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user