1476556 – [geo-rep]: Worker crash during rmdir with "NameError: global name 'lf' is not defined"

Bug 1476556 - [geo-rep]: Worker crash during rmdir with "NameError: global name 'lf' is not defined"

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	geo-replication
Sub Component:
Version:	rhgs-3.3
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	RHGS 3.3.0
Assignee:	Kotresh HR
QA Contact:	Rahul Hinduja
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1417151

Reported:	2017-07-30 09:56 UTC by Rahul Hinduja
Modified:	2017-09-21 05:04 UTC
CC List:	6 users
Fixed In Version:	glusterfs-3.8.4-37
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-09-21 05:04:21 UTC
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2017:2774	0	normal	SHIPPED_LIVE	glusterfs bug fix and enhancement update	2017-09-21 08:16:29 UTC

Description Rahul Hinduja 2017-07-30 09:56:10 UTC

Description of problem:
=======================

While running the automation sanity check on build (glusterfs-geo-replication-3.8.4-36.el7rhgs.x86_64), one of the workers crashed with the following traceback:

[2017-07-29 16:26:24.323477] I [master(/bricks/brick1/master_brick9):1132:crawl] _GMaster: slave's time: (1501345566, 0)
[2017-07-29 16:26:29.850406] E [syncdutils(/bricks/brick1/master_brick9):296:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 204, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 780, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1582, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 570, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1143, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1118, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1001, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 829, in process_change
    logging.info(lf('Ignoring rmdir. Directory present in '
NameError: global name 'lf' is not defined
[2017-07-29 16:26:29.854490] I [syncdutils(/bricks/brick1/master_brick9):237:finalize] <top>: exiting.

Judging from the trace, the crash happened while processing an rmdir. The worker crashed and became passive. Syncs still succeed via the other active worker. This looks like a race.


Version-Release number of selected component (if applicable):
=============================================================

glusterfs-geo-replication-3.8.4-36.el7rhgs.x86_64


How reproducible:
=================

Seen once so far. Will retry and update the bug if it occurs a second time.


Steps Carried:
==============
1. Run the automation sanity suite which does {create,chmod,chown,chgrp,hardlink,symlink,truncate,rename,remove} on Master and Slave (Both EC volumes)

Actual results:
===============

Worker crashed


Expected results:
=================

Workers should not crash

Additional info:

Comment 3 Kotresh HR 2017-07-31 04:26:24 UTC

Structured logging support [1] was introduced upstream as part of the debugging and logging improvements, and it has been merged there. It was not taken into downstream 3.3. However, a patch that uses the structured logging support has somehow made its way into downstream 3.3, so the worker will always crash when it hits that code path.
Hence this is a blocker candidate and should be fixed. The fix is easy.



[1] https://review.gluster.org/#/c/17551/
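
The failure mode described above can be sketched in a few lines of Python. This is only an illustration, not the gsyncd code: `process_rmdir_broken`, `process_rmdir_safe`, and `lf_sketch` are hypothetical names, and `lf_sketch` is an assumed stand-in for a structured-log formatter, not the upstream helper. It shows why the module loads and runs normally until the rarely taken branch executes, at which point the undefined name raises NameError.

```python
import logging


def lf_sketch(event, **kwargs):
    """Hypothetical formatter: event text followed by tab-separated key=value pairs."""
    parts = [event] + ["{0}={1}".format(k, v) for k, v in sorted(kwargs.items())]
    return "\t".join(parts)


def process_rmdir_broken(gfid, still_present):
    """Mimics the bug: `lf` is used here but is never defined or imported."""
    if still_present:
        # Python resolves the name `lf` only when this line executes, so the
        # module imports cleanly and every other code path works as usual.
        logging.info(lf("Ignoring rmdir", gfid=gfid))
        return "ignored"
    return "removed"


def process_rmdir_safe(gfid, still_present):
    """Same branch, but using a helper that actually exists in this module."""
    if still_present:
        logging.info(lf_sketch("Ignoring rmdir", gfid=gfid))
        return "ignored"
    return "removed"


if __name__ == "__main__":
    print(process_rmdir_broken("gfid-1", False))  # common path, no crash
    print(process_rmdir_safe("gfid-1", True))     # rare path, helper defined
    try:
        process_rmdir_broken("gfid-1", True)      # rare path, helper missing
    except NameError as exc:
        print("crash:", exc)
```

Because the name is looked up at call time, only a run that actually takes the rare branch (or a static check such as a linter) would catch a missing helper like this.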

Comment 4 Kotresh HR 2017-07-31 04:50:00 UTC

It's a downstream-only patch.

Patch link:
https://code.engineering.redhat.com/gerrit/113848

Comment 5 Sweta Anandpara 2017-07-31 04:51:14 UTC

Based on comment 3, and on having discussed it with Rahul, marking blocker flag to '?'.

Comment 9 Rahul Hinduja 2017-08-05 09:52:52 UTC

How to reproduce this issue: 

1. touch dir1 => This is to find which subvolume the file hashes to
2. rm dir1
3. mkdir dir1
4. Let it sync to slave
5. Stop the geo-replication
6. Attach gdb to mount pid and breakpoint at dht_rmdir_lock_cbk 
7. continue
8. rmdir dir1 
9. Kill the complete Hashed subvolume (captured from step 1)
10. continue
11. Start volume with force (bring back bricks)
12. ls /mnt/dir1
13. Wait for dht heal
14. Start the geo-replication

Was able to reproduce this issue on build 3.8.4-33

Verified with build: glusterfs-geo-replication-3.8.4-37.el7rhgs.x86_64

=> No crash is seen with the above mentioned steps. Moving the bug to verified state.

Comment 11 errata-xmlrpc 2017-09-21 05:04:21 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774
