Bug 1345744 - [geo-rep]: Worker crashed with "KeyError: "
Summary: [geo-rep]: Worker crashed with "KeyError: "
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: geo-replication
Version: mainline
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Aravinda VK
QA Contact:
URL:
Whiteboard:
Depends On: 1344826
Blocks: 1348085 1348086
Reported: 2016-06-13 06:28 UTC by Aravinda VK
Modified: 2017-03-27 18:13 UTC

Fixed In Version: glusterfs-3.9.0
Clone Of: 1344826
: 1348085 1348086
Environment:
Last Closed: 2017-03-27 18:13:07 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Description Aravinda VK 2016-06-13 06:28:14 UTC
+++ This bug was initially created as a clone of Bug #1344826 +++

Description of problem:
=======================

While performing rm -rf on a cascaded setup, a worker crashed on both the primary master volume and the intermediate master volume with the following traceback:

Master Volume:
==============

[2016-06-11 09:41:17.359086] E [syncdutils(/rhs/brick1/b1):276:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 201, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 720, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1497, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 571, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1201, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1107, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 984, in process
    self.datas_in_batch.remove(unlinked_gfid)
KeyError: '.gfid/757b0ad8-b6f5-44da-b71a-1b1c25a72988'


Intermediate Master:
====================

[2016-06-11 09:41:51.681622] E [syncdutils(/rhs/brick1/b1):276:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 201, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 720, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1497, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 571, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1201, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1107, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 984, in process
    self.datas_in_batch.remove(unlinked_gfid)
KeyError: '.gfid/757b0ad8-b6f5-44da-b71a-1b1c25a72988'
[2016-06-11 09:41:51.684969] I [syncdutils(/rhs/brick1/b1):220:finalize] <top>: exiting.
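
Both tracebacks end in the same call, self.datas_in_batch.remove(unlinked_gfid). A KeyError from remove() is consistent with datas_in_batch being a Python set, since set.remove() raises KeyError when the element is absent. The following is a minimal standalone sketch of that behavior, not the gsyncd code: the set contents are made up for illustration, and the helper name remove_strict is hypothetical.

```python
# Illustrative only: a plain set standing in for self.datas_in_batch.
datas_in_batch = {".gfid/00000000-0000-0000-0000-000000000001"}  # made-up entry
unlinked_gfid = ".gfid/757b0ad8-b6f5-44da-b71a-1b1c25a72988"     # GFID from the log


def remove_strict(batch, gfid):
    """Mimic the failing call: set.remove() raises KeyError if gfid is absent."""
    batch.remove(gfid)


try:
    remove_strict(datas_in_batch, unlinked_gfid)
    crashed = False
except KeyError as exc:
    # In the worker this exception was not caught, so the process exited.
    crashed = True
    print("KeyError:", exc)
```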



Version-Release number of selected component (if applicable):
=============================================================

glusterfs-3.7.9-10


How reproducible:
=================

Always, on a cascaded setup, when removing files (rm -rf).


Steps to Reproduce:
===================
1. Create a cascaded geo-rep setup with three volumes (vol0, vol1, vol2) such that vol0 => vol1 and vol1 => vol2.
2. Mount vol0 and perform fops (cp, create, chmod, chown, chgrp, symlink, hardlink, truncate) on it.
3. Let the changes sync to the slaves (vol1 and vol2).
4. Calculate the arequal checksum after every fop; it should match.
5. Perform rm -rf on vol0.

Actual results:
===============

Worker crashed on vol0 and vol1 with KeyError.


Expected results:
=================

Worker should not crash.


Additional info:
================

Performed rm -rf on a non-cascaded setup and did not see the crash. Also, files are eventually removed from all masters and slaves.

Comment 1 Vijay Bellur 2016-06-13 06:33:20 UTC
REVIEW: http://review.gluster.org/14706 (geo-rep: Safely handle if unliked GFID not present in data list) posted (#1) for review on master by Aravinda VK (avishwan)

Comment 2 Vijay Bellur 2016-06-20 06:37:06 UTC
COMMIT: http://review.gluster.org/14706 committed in master by Aravinda VK (avishwan) 
------
commit 4797ca3778d82a671716d4913c14f285591ae959
Author: Aravinda VK <avishwan>
Date:   Mon Jun 13 12:00:40 2016 +0530

    geo-rep: Safely handle if unliked GFID not present in data list
    
    If unlinked GFID is not present in data list to be synced then
    Geo-rep worker was crashing with KeyError. Handled KeyError with
    this patch.
    
    BUG: 1345744
    Change-Id: I5a1c9ca4473e32606df2e5c7e26c95faf55d44c0
    Signed-off-by: Aravinda VK <avishwan>
    Reviewed-on: http://review.gluster.org/14706
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Kotresh HR <khiremat>
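
A minimal sketch of the kind of handling the commit message describes, assuming datas_in_batch is a set. The helper name forget_unlinked and the sample GFID strings are hypothetical; the actual change is in review 14706 and may differ in detail.

```python
def forget_unlinked(datas_in_batch, unlinked_gfid):
    """Drop unlinked_gfid from the pending data set, tolerating its absence.

    Returns True if the GFID was pending and has been removed, else False.
    """
    try:
        datas_in_batch.remove(unlinked_gfid)
        return True
    except KeyError:
        # The GFID was not queued for data sync in this batch, so there is
        # nothing to drop and processing can continue.
        return False


pending = {".gfid/aaaa", ".gfid/bbbb"}  # made-up entries
print(forget_unlinked(pending, ".gfid/aaaa"))  # GFID was pending
print(forget_unlinked(pending, ".gfid/cccc"))  # GFID was never queued
print(sorted(pending))
```

When the return value is not needed, set.discard() has the same effect without the try/except.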

Comment 3 Shyamsundar 2017-03-27 18:13:07 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.9.0, please open a new bug report.

glusterfs-3.9.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2016-November/029281.html
[2] https://www.gluster.org/pipermail/gluster-users/

