Bug 1038994

Summary: Dist-geo-rep : geo-rep worker of the passive node crashed when it became active.
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Vijaykumar Koppad <vkoppad>
Component: geo-replication Assignee: Venky Shankar <vshankar>
Status: CLOSED ERRATA QA Contact: Bhaskar Bandari <bbandari>
Severity: high Docs Contact:
Priority: high    
Version: 2.1CC: aavati, avishwan, bbandari, csaba, david.macdonald, fharshav, jrankin, khiremat, nsathyan, ssamanta, vagarwal, vbhat, vgaikwad, vshankar, vumrao
Target Milestone: ---   
Target Release: RHGS 3.0.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: 3.6.0.13-1 Doc Type: Bug Fix
Doc Text:
Cause: When a passive node becomes active, it collects the old changelogs to process, and geo-rep removes each changelog file from the list if it has already been processed. If the resulting list was empty, the geo-rep worker crashed because it could not handle an empty list. Consequence: Geo-rep worker crash. Fix: Geo-rep now handles an empty list of changelog files. Result: No geo-rep worker crash.
Story Points: ---
Clone Of:
Clones: 1126356 (view as bug list) Environment:
Last Closed: 2014-09-22 19:30:05 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
Bug Depends On:    
Bug Blocks: 1126356    

Description Vijaykumar Koppad 2013-12-06 10:21:16 UTC
Description of problem: The geo-rep worker of the passive node crashed when it became active, after the original active node came down.

backtrace

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-12-05 22:15:54.961704] I [master(/bricks/master_brick4):1068:crawl] _GMaster: skipping already processed change: CHANGELOG.1386261787...
[2013-12-05 22:15:54.962709] I [master(/bricks/master_brick4):1068:crawl] _GMaster: skipping already processed change: CHANGELOG.1386261937...
[2013-12-05 22:15:54.964027] I [master(/bricks/master_brick4):1068:crawl] _GMaster: skipping already processed change: CHANGELOG.1386261952...
[2013-12-05 22:15:54.965122] E [syncdutils(/bricks/master_brick4):207:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 150, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 540, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1157, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 473, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1072, in crawl
    self.process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 855, in process
    xtl = (int(change.split('.')[-1]) - 1, 0)
UnboundLocalError: local variable 'change' referenced before assignment
[2013-12-05 22:15:54.972412] I [syncdutils(/bricks/master_brick4):159:finalize] <top>: exiting.
[2013-12-05 22:15:54.991750] I [monitor(monitor):81:set_state] Monitor: new state: faulty

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
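The UnboundLocalError above comes from reading the loop variable after the loop has run. A minimal, self-contained sketch of that pattern (illustrative only, not the actual gsyncd code) shows why an empty changelog list triggers the crash:

def process(changes):
    for change in changes:
        pass  # each changelog file would be replayed against the slave here
    # 'change' is meant to name the last changelog handled; if 'changes' is
    # empty the loop body never runs and the name is never bound
    return (int(change.split('.')[-1]) - 1, 0)

try:
    process([])
except UnboundLocalError as err:
    print(err)  # local variable 'change' referenced before assignment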


Version-Release number of selected component (if applicable): glusterfs-server-3.4.0.46geo-1


How reproducible: didn't try to reproduce


Steps to Reproduce:
1. Create and start a geo-rep relationship between master and slave.
2. Run this script:
for i in {1..5}
do
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K    /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K  --fop=chmod  /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K   --fop=chown  /mnt/master/
    sleep 100 
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K   --fop=chgrp  /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K  --fop=symlink  /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K   --fop=hardlink  /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K   --fop=truncate /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K   --fop=rename /mnt/master/
    sleep 500
done

3. After syncing some 100K files (you can check this in status detail), bring down one of the active nodes.
4. Check for the backtrace in the geo-rep logs on the corresponding passive node.


Actual results: The geo-rep worker on the passive node crashes.


Expected results: Geo-rep worker shouldn't crash.


Additional info:

Comment 2 M S Vishwanath Bhat 2014-01-08 10:57:03 UTC
The steps to reproduce this issue are:

1. Start a geo-rep session between master and slave.
2. Once there are some changelogs that need to be synced (in .processing), stop geo-rep and stop I/O from the mountpoint.
3. Create some more files and wait for 2-3 minutes.
4. Now start geo-rep again.
5. Wait for the backtrace in the logs.


[2014-01-08 16:19:15.328288] I [master(/rhs/bricks/brick0):1065:crawl] _GMaster: slave's time: (1389177856, 115295)
[2014-01-08 16:19:15.328862] I [master(/rhs/bricks/brick0):1068:crawl] _GMaster: skipping already processed change: CHANGELOG.1389177635...
[2014-01-08 16:19:15.329555] I [master(/rhs/bricks/brick0):1068:crawl] _GMaster: skipping already processed change: CHANGELOG.1389177650...
[2014-01-08 16:19:15.330394] E [syncdutils(/rhs/bricks/brick0):207:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 150, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 540, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1157, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 473, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1072, in crawl
    self.process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 855, in process
    xtl = (int(change.split('.')[-1]) - 1, 0)
UnboundLocalError: local variable 'change' referenced before assignment
[2014-01-08 16:19:15.334372] I [syncdutils(/rhs/bricks/brick0):159:finalize] <top>: exiting.
[2014-01-08 16:19:15.349729] I [monitor(monitor):81:set_state] Monitor: new state: faulty
[2014-01-08 16:19:25.365159] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------
[2014-01-08 16:19:25.365744] I [monitor(monitor):130:monitor] Monitor: starting gsyncd worker
[2014-01-08 16:19:25.724272] I [gsyncd(/rhs/bricks/brick0):530:main_i] <top>: syncing: gluster://localhost:master -> ssh://root@falcon:gluster://localhost:slave
[2014-01-08 16:19:28.933765] I [master(/rhs/bricks/brick0):58:gmaster_builder] <top>: setting up xsync change detection mode

Comment 5 Venky Shankar 2014-03-13 08:57:12 UTC
This issue arises mostly after a switch from the initial xsync crawl to the changelog-based crawl, which normally happens on geo-rep startup or the very first time a passive replica node becomes active (the case where this bug is most prevalent).

After the initial crawl, the slave's time (stime) is updated with the master's time up to the point where the data got synced. This stime is maintained at a per-directory level. Therefore, after a successful first crawl and sync, the stime of root ('/') is updated with the xtime of the master for root. This time is used by the changelog crawl to "skip" changelogs that accumulated while the filesystem crawl was ongoing (as those entries would already have been synced). In the event that *all* changelogs are skipped (which is quite possible if the initial crawl "wraps up" all pending changes), this backtrace is seen because the changelog handling routine is invoked with no changelogs.
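A rough sketch of that skip logic (names and structure are illustrative, not the actual gsyncd implementation; changelog files are assumed to be named CHANGELOG.<unix-timestamp>):

def prune_processed(changelogs, stime):
    # drop changelogs whose timestamp is at or before the slave's stime,
    # since the initial crawl has already synced those changes
    remaining = []
    for changelog in changelogs:
        ts = int(changelog.split('.')[-1])
        if ts <= stime:
            print("skipping already processed change: %s..." % changelog)
        else:
            remaining.append(changelog)
    return remaining  # may well be empty right after the xsync -> changelog switch

pending = prune_processed(
    ["CHANGELOG.1386261787", "CHANGELOG.1386261937", "CHANGELOG.1386261952"],
    stime=1386262000)
print(pending)  # [] -- exactly the batch that crashed process()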

The fix for this bug is to check for a non-empty changelog list before invoking the processing routine.
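A minimal sketch of that guard (illustrative, not the actual patch):

def crawl_once(changes, process):
    if not changes:
        # everything was already covered by the earlier crawl; nothing to
        # replay, so return quietly instead of handing an empty batch to
        # the processing routine
        return None
    return process(changes)

print(crawl_once([], process=len))                        # None -- no crash
print(crawl_once(["CHANGELOG.1386261952"], process=len))  # 1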

As mentioned in my previous comment: This fix is available in build glusterfs-3.4.0.59geo-1.el6rhs

Comment 8 Vivek Agarwal 2014-05-26 06:15:09 UTC
Fixed as part of the rebase; marking the BZ for Denali.

Comment 9 Vijaykumar Koppad 2014-06-12 05:43:00 UTC
*** Bug 1049572 has been marked as a duplicate of this bug. ***

Comment 10 Vijaykumar Koppad 2014-06-12 06:01:13 UTC
Verified on the build glusterfs-3.6.0.15-1.el6rhs.

Steps for verification:

1. Start a geo-rep session between master and slave.
2. Once there are some changelogs that need to be synced (in .processing), stop geo-rep and stop I/O from the mountpoint.
3. Create some more files and wait for 2-3 minutes.
4. Now start geo-rep again.

Comment 14 errata-xmlrpc 2014-09-22 19:30:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html