Description of problem:
geo-rep worker of the passive node crashed when it became active, after the actual active node came down.

backtrace
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-12-05 22:15:54.961704] I [master(/bricks/master_brick4):1068:crawl] _GMaster: skipping already processed change: CHANGELOG.1386261787...
[2013-12-05 22:15:54.962709] I [master(/bricks/master_brick4):1068:crawl] _GMaster: skipping already processed change: CHANGELOG.1386261937...
[2013-12-05 22:15:54.964027] I [master(/bricks/master_brick4):1068:crawl] _GMaster: skipping already processed change: CHANGELOG.1386261952...
[2013-12-05 22:15:54.965122] E [syncdutils(/bricks/master_brick4):207:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 150, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 540, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1157, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 473, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1072, in crawl
    self.process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 855, in process
    xtl = (int(change.split('.')[-1]) - 1, 0)
UnboundLocalError: local variable 'change' referenced before assignment
[2013-12-05 22:15:54.972412] I [syncdutils(/bricks/master_brick4):159:finalize] <top>: exiting.
[2013-12-05 22:15:54.991750] I [monitor(monitor):81:set_state] Monitor: new state: faulty
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Version-Release number of selected component (if applicable):
glusterfs-server-3.4.0.46geo-1

How reproducible:
didn't try to reproduce

Steps to Reproduce:
1. Create and start a geo-rep relationship between master and slave.
2. Run this script:

for i in {1..5}
do
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K --fop=chmod /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K --fop=chown /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K --fop=chgrp /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K --fop=symlink /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K --fop=hardlink /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K --fop=truncate /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K --fop=rename /mnt/master/
    sleep 500
done

3. After syncing some 100K files (you can check it from status detail), bring down one of the active nodes.
4. Check for the backtrace in the geo-rep logs on the corresponding passive node.

Actual results:
The passive geo-rep worker crashes.

Expected results:
The geo-rep worker shouldn't crash.

Additional info:
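For context, the UnboundLocalError in the traceback is the classic symptom of a for-loop that never runs: the loop variable is only bound when the iterable is non-empty, so referencing it afterwards fails. A minimal, self-contained sketch (not the actual master.py code) that reproduces the same failure mode:

def process(changes):
    # Mirrors the shape of the failing code: 'change' is only bound
    # inside the loop, so an empty 'changes' list leaves it undefined.
    for change in changes:
        print("processing", change)
    # Referencing the loop variable after an empty loop raises
    # UnboundLocalError, exactly as in the traceback above.
    xtl = (int(change.split('.')[-1]) - 1, 0)
    return xtl

process([])  # UnboundLocalError: local variable 'change' referenced before assignment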
The steps to reproduce this issue are:

1. Start a geo-rep session between master and slave.
2. Once there are some changelogs that need to be synced (in .processing), stop geo-rep and stop I/O from the mountpoint.
3. Create some more files and wait for 2-3 minutes.
4. Now start geo-rep again.
5. Wait for the backtrace in the logs.

[2014-01-08 16:19:15.328288] I [master(/rhs/bricks/brick0):1065:crawl] _GMaster: slave's time: (1389177856, 115295)
[2014-01-08 16:19:15.328862] I [master(/rhs/bricks/brick0):1068:crawl] _GMaster: skipping already processed change: CHANGELOG.1389177635...
[2014-01-08 16:19:15.329555] I [master(/rhs/bricks/brick0):1068:crawl] _GMaster: skipping already processed change: CHANGELOG.1389177650...
[2014-01-08 16:19:15.330394] E [syncdutils(/rhs/bricks/brick0):207:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 150, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 540, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1157, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 473, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1072, in crawl
    self.process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 855, in process
    xtl = (int(change.split('.')[-1]) - 1, 0)
UnboundLocalError: local variable 'change' referenced before assignment
[2014-01-08 16:19:15.334372] I [syncdutils(/rhs/bricks/brick0):159:finalize] <top>: exiting.
[2014-01-08 16:19:15.349729] I [monitor(monitor):81:set_state] Monitor: new state: faulty
[2014-01-08 16:19:25.365159] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------
[2014-01-08 16:19:25.365744] I [monitor(monitor):130:monitor] Monitor: starting gsyncd worker
[2014-01-08 16:19:25.724272] I [gsyncd(/rhs/bricks/brick0):530:main_i] <top>: syncing: gluster://localhost:master -> ssh://root@falcon:gluster://localhost:slave
[2014-01-08 16:19:28.933765] I [master(/rhs/bricks/brick0):58:gmaster_builder] <top>: setting up xsync change detection mode
This issue arises mostly after a switch from the initial xsync crawl to the changelog-based crawl, which normally happens on geo-rep startup or the very first time a passive replica node becomes active (the case in which this bug is more prevalent).

After the initial crawl, the slave's time (stime) is updated with the master's time up to the point where it got synced. This stime is maintained at a per-directory level. Therefore, after a successful first crawl and sync, the stime of root ('/') is updated with the master's xtime for root. This time is used by the changelog crawl to "skip" changelogs that accumulated while the filesystem crawl was ongoing (as those entities would already have been synced). In the event that "all" changelogs were skipped (which is quite possible if the initial crawl "wraps up" all pending changes), this backtrace is seen because the changelog handling routine is invoked with no changelogs.

The fix for this bug is to check for a non-empty changelog list before invoking the processing routine.

As mentioned in my previous comment: this fix is available in build glusterfs-3.4.0.59geo-1.el6rhs
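A minimal sketch of what such a guard could look like in the crawl path. This is not the actual upstream patch; the helper names (crawl_once, pending_changelogs, already_processed, process) are hypothetical and stand in for the corresponding pieces of master.py:

import logging

def crawl_once(pending_changelogs, already_processed, process):
    """Hypothetical guarded crawl step.

    pending_changelogs: list of changelog names, e.g. 'CHANGELOG.1386261787'
    already_processed:  predicate saying whether a changelog predates the slave's stime
    process:            routine that replays a non-empty batch of changelogs
    """
    changes = [c for c in pending_changelogs if not already_processed(c)]
    for skipped in set(pending_changelogs) - set(changes):
        logging.info("skipping already processed change: %s", skipped)
    # The fix: only invoke the processing routine when something is left to do.
    # With an empty batch, process() would otherwise hit the UnboundLocalError
    # shown in the traceback.
    if changes:
        process(changes)

The same pattern (guarding the process() call with "if changes:") is what the comment above describes as the fix shipped in glusterfs-3.4.0.59geo-1.el6rhs.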
Fixed as a part of the rebase, marking the bz for Denali
*** Bug 1049572 has been marked as a duplicate of this bug. ***
Verified on the build glusterfs-3.6.0.15-1.el6rhs. Steps for verification:

1. Start a geo-rep session between master and slave.
2. Once there are some changelogs that need to be synced (in .processing), stop geo-rep and stop I/O from the mountpoint.
3. Create some more files and wait for 2-3 minutes.
4. Now start geo-rep again.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2014-1278.html