Bug 1038994 - Dist-geo-rep : geo-rep worker of the passive node crashed when it became active.
Summary: Dist-geo-rep : geo-rep worker of the passive node crashed when it became active.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: geo-replication
Version: 2.1
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: RHGS 3.0.0
Assignee: Venky Shankar
QA Contact: Bhaskar Bandari
URL:
Whiteboard:
Duplicates: 1049572 (view as bug list)
Depends On:
Blocks: 1126356
 
Reported: 2013-12-06 10:21 UTC by Vijaykumar Koppad
Modified: 2018-12-06 15:32 UTC (History)
15 users (show)

Fixed In Version: 3.6.0.13-1
Doc Type: Bug Fix
Doc Text:
Cause: When a passive node becomes active, it collects the old changelogs to process; geo-rep identifies and removes a changelog file from the list if it has already been processed. If the resulting list was empty, the geo-rep worker crashed because it could not handle an empty list. Consequence: Geo-rep worker crash. Fix: Geo-rep now handles an empty list of changelog files. Result: The geo-rep worker no longer crashes.
Clone Of:
: 1126356 (view as bug list)
Environment:
Last Closed: 2014-09-22 19:30:05 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2014:1278 0 normal SHIPPED_LIVE Red Hat Storage Server 3.0 bug fix and enhancement update 2014-09-22 23:26:55 UTC

Description Vijaykumar Koppad 2013-12-06 10:21:16 UTC
Description of problem: geo-rep worker of the passive node crashed when it became active, after the actual active node came down.

backtrace

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-12-05 22:15:54.961704] I [master(/bricks/master_brick4):1068:crawl] _GMaster: skipping already processed change: CHANGELOG.1386261787...
[2013-12-05 22:15:54.962709] I [master(/bricks/master_brick4):1068:crawl] _GMaster: skipping already processed change: CHANGELOG.1386261937...
[2013-12-05 22:15:54.964027] I [master(/bricks/master_brick4):1068:crawl] _GMaster: skipping already processed change: CHANGELOG.1386261952...
[2013-12-05 22:15:54.965122] E [syncdutils(/bricks/master_brick4):207:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 150, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 540, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1157, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 473, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1072, in crawl
    self.process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 855, in process
    xtl = (int(change.split('.')[-1]) - 1, 0)
UnboundLocalError: local variable 'change' referenced before assignment
[2013-12-05 22:15:54.972412] I [syncdutils(/bricks/master_brick4):159:finalize] <top>: exiting.
[2013-12-05 22:15:54.991750] I [monitor(monitor):81:set_state] Monitor: new state: faulty

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
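The UnboundLocalError is consistent with 'change' being a per-iteration variable that only gets bound when the changelog list is non-empty; with every changelog skipped as already processed, process() ends up running against an empty list. A minimal, self-contained sketch of that failure pattern (hypothetical names, not the actual gsyncd code):

# Hypothetical sketch of the failure mode behind the traceback above;
# illustrative only, not the real master.py:process() implementation.
def process(changes):
    for change in changes:
        pass  # each changelog would be synced to the slave here
    # With an empty 'changes' list the loop never binds 'change', so the
    # next line raises:
    # UnboundLocalError: local variable 'change' referenced before assignment
    xtl = (int(change.split('.')[-1]) - 1, 0)
    return xtl

process([])  # an empty changelog list reproduces the crash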


Version-Release number of selected component (if applicable): glusterfs-server-3.4.0.46geo-1


How reproducible: didn't try to reproduce


Steps to Reproduce:
1. Create and start a geo-rep relationship between master and slave (see the CLI sketch after these steps).
2. Run this script:
for i in {1..5}
do
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K    /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K  --fop=chmod  /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K   --fop=chown  /mnt/master/
    sleep 100 
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K   --fop=chgrp  /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K  --fop=symlink  /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K   --fop=hardlink  /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K   --fop=truncate /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K   --fop=rename /mnt/master/
    sleep 500
done

3. After syncing some 100K files (you can check this from status detail), bring down one of the active nodes.
4. Check for the backtrace in the geo-rep logs on the corresponding passive node.
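For step 1 and the status check in step 3, the standard geo-replication CLI can be used; a hedged sketch, assuming master and slave volumes named 'master' and 'slave' and a slave host 'slavehost' (adjust names for the actual setup):

# Create and start the geo-rep session (step 1)
gluster volume geo-replication master slavehost::slave create push-pem
gluster volume geo-replication master slavehost::slave start

# Monitor how much has been synced (step 3 refers to this output)
gluster volume geo-replication master slavehost::slave status detail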


Actual results: The geo-rep worker on the passive node crashes.


Expected results: Geo-rep worker shouldn't crash.


Additional info:

Comment 2 M S Vishwanath Bhat 2014-01-08 10:57:03 UTC
The steps to reproduce this issue are:

1. Start a geo-rep session between master and slave.
2. Once there are some changelogs that need to be synced (in .processing), stop geo-rep and stop I/O from the mountpoint.
3. Create some more files and wait for 2-3 minutes.
4. Now start geo-rep again.
5. Wait for the backtrace in the logs.


[2014-01-08 16:19:15.328288] I [master(/rhs/bricks/brick0):1065:crawl] _GMaster: slave's time: (1389177856, 115295)
[2014-01-08 16:19:15.328862] I [master(/rhs/bricks/brick0):1068:crawl] _GMaster: skipping already processed change: CHANGELOG.1389177635...
[2014-01-08 16:19:15.329555] I [master(/rhs/bricks/brick0):1068:crawl] _GMaster: skipping already processed change: CHANGELOG.1389177650...
[2014-01-08 16:19:15.330394] E [syncdutils(/rhs/bricks/brick0):207:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 150, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 540, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1157, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 473, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1072, in crawl
    self.process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 855, in process
    xtl = (int(change.split('.')[-1]) - 1, 0)
UnboundLocalError: local variable 'change' referenced before assignment
[2014-01-08 16:19:15.334372] I [syncdutils(/rhs/bricks/brick0):159:finalize] <top>: exiting.
[2014-01-08 16:19:15.349729] I [monitor(monitor):81:set_state] Monitor: new state: faulty
[2014-01-08 16:19:25.365159] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------
[2014-01-08 16:19:25.365744] I [monitor(monitor):130:monitor] Monitor: starting gsyncd worker
[2014-01-08 16:19:25.724272] I [gsyncd(/rhs/bricks/brick0):530:main_i] <top>: syncing: gluster://localhost:master -> ssh://root@falcon:gluster://localhost:slave
[2014-01-08 16:19:28.933765] I [master(/rhs/bricks/brick0):58:gmaster_builder] <top>: setting up xsync change detection mode

Comment 5 Venky Shankar 2014-03-13 08:57:12 UTC
This issue arises mostly after a switch from the initial xsync crawl to the changelog-based crawl, which normally happens on geo-rep startup or the very first time a passive replica node becomes active (the case where this bug is most prevalent).

After the initial crawl, the slave's time (stime) is updated with the master's time up to the point where it got synced. This (stime) is maintained at a per-directory level. Therefore, after a successful first crawl and sync, the stime of root ('/') is updated with the xtime of the master for root. This time is used by the changelog crawl to "skip" changelogs that accumulated while the filesystem crawl was ongoing (as those entities would already have been synced). In the event that "all" changelogs were skipped (which is quite possible if the initial crawl "wraps up" all pending changes), this backtrace is seen because the changelog handling routine is invoked with no changelogs.

The fix for this bug is to check for a non-empty changelog list before invoking the processing routine.
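A minimal sketch of what such a guard could look like, with hypothetical helper names; the actual patch to master.py may differ:

# Hypothetical guard; illustrative only, not the exact upstream change.
def crawl(collect_changelogs, already_processed, process):
    # Drop changelogs that the xsync crawl already covered (per slave stime).
    changes = [c for c in collect_changelogs() if not already_processed(c)]
    if not changes:
        # Everything was already synced: skip process() rather than call it
        # with an empty list, which previously caused the UnboundLocalError.
        return
    process(changes)

# Stub usage: every changelog already processed, so process() is never called.
crawl(lambda: ["CHANGELOG.1386261787"], lambda c: True,
      lambda changes: None)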

As mentioned in my previous comment: This fix is available in build glusterfs-3.4.0.59geo-1.el6rhs

Comment 8 Vivek Agarwal 2014-05-26 06:15:09 UTC
Fixed as a part of the rebase, marking the bz for Denali

Comment 9 Vijaykumar Koppad 2014-06-12 05:43:00 UTC
*** Bug 1049572 has been marked as a duplicate of this bug. ***

Comment 10 Vijaykumar Koppad 2014-06-12 06:01:13 UTC
Verified on the build glusterfs-3.6.0.15-1.el6rhs.

Steps for verification:

1. Start a geo-rep session between master and slave.
2. Once there are some changelogs that need to be synced (in .processing), stop geo-rep and stop I/O from the mountpoint.
3. Create some more files and wait for 2-3 minutes.
4. Now start geo-rep again.

Comment 14 errata-xmlrpc 2014-09-22 19:30:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html

