1345744 – [geo-rep]: Worker crashed with "KeyError: "

Bug 1345744 - [geo-rep]: Worker crashed with "KeyError: "

Summary: [geo-rep]: Worker crashed with "KeyError: "

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	geo-replication
Sub Component:
Version:	mainline
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Aravinda VK
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:	1344826
Blocks:	1348085 1348086
TreeView+	depends on / blocked

Reported:	2016-06-13 06:28 UTC by Aravinda VK
Modified:	2017-03-27 18:13 UTC (History)
CC List:	5 users (show)
Fixed In Version:	glusterfs-3.9.0
Clone Of:	1344826
Clones:	1348085 1348086 (view as bug list)
Environment:
Last Closed:	2017-03-27 18:13:07 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Description Aravinda VK 2016-06-13 06:28:14 UTC

+++ This bug was initially created as a clone of Bug #1344826 +++

Description of problem:
=======================

While performing rm -rf on cascaded setup, found a worker crash on the primary master and intermittent master volume with traceback as: 

Master Volume:
==============

[2016-06-11 09:41:17.359086] E [syncdutils(/rhs/brick1/b1):276:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 201, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 720, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1497, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 571, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1201, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1107, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 984, in process
    self.datas_in_batch.remove(unlinked_gfid)
KeyError: '.gfid/757b0ad8-b6f5-44da-b71a-1b1c25a72988'


Intermittent Master:
====================

[2016-06-11 09:41:51.681622] E [syncdutils(/rhs/brick1/b1):276:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 201, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 720, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1497, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 571, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1201, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1107, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 984, in process
    self.datas_in_batch.remove(unlinked_gfid)
KeyError: '.gfid/757b0ad8-b6f5-44da-b71a-1b1c25a72988'
[2016-06-11 09:41:51.684969] I [syncdutils(/rhs/brick1/b1):220:finalize] <top>: exiting.



Version-Release number of selected component (if applicable):
=============================================================

glusterfs-3.7.9-10


How reproducible:
=================

Always, on cascaded setup upon remove (rm -rf)


Steps to Reproduce:
===================
1. Create geo-rep cascaded setup with (vol0,vol1,vol2). Such that vol0=>vol1, vol1=>vol2
2. Mount the vol0 volume and perform fops like (cp,create,chmod,chown,chgrp,symlink,hardlink,truncate) on vol0
3. Let it sync to slave (vol1) and (vol2)
4. Calculate arequal checksum after every fop. It should match.
5. perform rm -rf on vol0

Actual results:
===============

Worker crashed on vol1 and vol0 with keyerror.


Expected results:
=================

Worker shouldn't crash


Additional info:
================

Performed rm -rf on non cascaded setup and didn't see the crash. Also, eventually files are removed from all Master and slaves.

Comment 1 Vijay Bellur 2016-06-13 06:33:20 UTC

REVIEW: http://review.gluster.org/14706 (geo-rep: Safely handle if unliked GFID not present in data list) posted (#1) for review on master by Aravinda VK (avishwan)

Comment 2 Vijay Bellur 2016-06-20 06:37:06 UTC

COMMIT: http://review.gluster.org/14706 committed in master by Aravinda VK (avishwan) 
------
commit 4797ca3778d82a671716d4913c14f285591ae959
Author: Aravinda VK <avishwan>
Date:   Mon Jun 13 12:00:40 2016 +0530

    geo-rep: Safely handle if unliked GFID not present in data list
    
    If unlinked GFID is not present in data list to be synced then
    Geo-rep worker was crashing with KeyError. Handled KeyError with
    this patch.
    
    BUG: 1345744
    Change-Id: I5a1c9ca4473e32606df2e5c7e26c95faf55d44c0
    Signed-off-by: Aravinda VK <avishwan>
    Reviewed-on: http://review.gluster.org/14706
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Kotresh HR <khiremat>

Comment 3 Shyamsundar 2017-03-27 18:13:07 UTC

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.9.0, please open a new bug report.

glusterfs-3.9.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2016-November/029281.html
[2] https://www.gluster.org/pipermail/gluster-users/

Note You need to log in before you can comment on or make changes to this bug.