Bug 1038994
Summary: | Dist-geo-rep : geo-rep worker of the passive node crashed when it became active. | |||
---|---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Vijaykumar Koppad <vkoppad> | |
Component: | geo-replication | Assignee: | Venky Shankar <vshankar> | |
Status: | CLOSED ERRATA | QA Contact: | Bhaskar Bandari <bbandari> | |
Severity: | high | Docs Contact: | ||
Priority: | high | |||
Version: | 2.1 | CC: | aavati, avishwan, bbandari, csaba, david.macdonald, fharshav, jrankin, khiremat, nsathyan, ssamanta, vagarwal, vbhat, vgaikwad, vshankar, vumrao | |
Target Milestone: | --- | |||
Target Release: | RHGS 3.0.0 | |||
Hardware: | x86_64 | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | 3.6.0.13-1 | Doc Type: | Bug Fix | |
Doc Text: |
Cause: When a passive node becomes active, it collects the old changelogs to process; geo-rep identifies and removes each changelog file from the list if it has already been processed. If the resulting list was empty, the geo-rep worker crashed because it could not handle an empty list.
Consequence:
Geo-rep worker crash.
Fix:
Geo-rep now handles an empty list of changelog files.
Result:
No geo-rep worker crash.
|
Story Points: | --- | |
Clone Of: | ||||
: | 1126356 (view as bug list) | Environment: | ||
Last Closed: | 2014-09-22 19:30:05 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1126356 |
Description
Vijaykumar Koppad
2013-12-06 10:21:16 UTC
The steps to reproduce this issue:

1. Start a geo-rep session between master and slave.
2. Once there are some changelogs that need to be synced (in .processing), stop geo-rep and stop I/O from the mountpoint.
3. Create some more files and wait for 2-3 minutes.
4. Now start geo-rep again.
5. Wait for the backtrace in the logs:

```
[2014-01-08 16:19:15.328288] I [master(/rhs/bricks/brick0):1065:crawl] _GMaster: slave's time: (1389177856, 115295)
[2014-01-08 16:19:15.328862] I [master(/rhs/bricks/brick0):1068:crawl] _GMaster: skipping already processed change: CHANGELOG.1389177635...
[2014-01-08 16:19:15.329555] I [master(/rhs/bricks/brick0):1068:crawl] _GMaster: skipping already processed change: CHANGELOG.1389177650...
[2014-01-08 16:19:15.330394] E [syncdutils(/rhs/bricks/brick0):207:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 150, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 540, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1157, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 473, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1072, in crawl
    self.process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 855, in process
    xtl = (int(change.split('.')[-1]) - 1, 0)
UnboundLocalError: local variable 'change' referenced before assignment
[2014-01-08 16:19:15.334372] I [syncdutils(/rhs/bricks/brick0):159:finalize] <top>: exiting.
[2014-01-08 16:19:15.349729] I [monitor(monitor):81:set_state] Monitor: new state: faulty
[2014-01-08 16:19:25.365159] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------
[2014-01-08 16:19:25.365744] I [monitor(monitor):130:monitor] Monitor: starting gsyncd worker
[2014-01-08 16:19:25.724272] I [gsyncd(/rhs/bricks/brick0):530:main_i] <top>: syncing: gluster://localhost:master -> ssh://root@falcon:gluster://localhost:slave
[2014-01-08 16:19:28.933765] I [master(/rhs/bricks/brick0):58:gmaster_builder] <top>: setting up xsync change detection mode
```

This issue arises mostly after a switch from the initial xsync crawl to the changelog-based crawl, which normally happens on geo-rep startup or the very first time a passive replica node becomes active (for which this bug is more prevalent).

After the initial crawl, the slave's time (stime) is updated with the master's time up to the point where it got synced. This stime is maintained at a per-directory level. Therefore, after a successful first crawl and sync, the stime of root ('/') is updated with the master's xtime for root. This time is used by the changelog crawl to "skip" changelogs that accumulated while the file system crawl was ongoing (as those entries would already have been synced). In the event that "all" changelogs were skipped (which is quite possible if the initial crawl "wraps up" all pending changes), this backtrace is seen because the changelog handling routine is invoked with no changelogs.

The fix for this bug is to check for a non-empty changelog list before invoking the processing routine.

As mentioned in my previous comment: this fix is available in build glusterfs-3.4.0.59geo-1.el6rhs.

Fixed as a part of the rebase, marking the bz for Denali.

*** Bug 1049572 has been marked as a duplicate of this bug. ***

Verified on the build glusterfs-3.6.0.15-1.el6rhs. Steps for verification:

1. Start a geo-rep session between master and slave.
2. Once there are some changelogs that need to be synced (in .processing), stop geo-rep and stop I/O from the mountpoint.
3. Create some more files and wait for 2-3 minutes.
4. Now start geo-rep again.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html
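The failure mode in the traceback, and the guard that fixes it, can be sketched in a few lines of Python. This is an illustrative simplification, not the actual gsyncd code: the names `changelog_ts`, `process_changes`, `crawl`, and `stime` are hypothetical stand-ins for the general shape of `master.py`'s `crawl()` and `process()` interaction.

```python
# Hypothetical sketch of the crash described in this bug. Only the shape
# of the logic mirrors gsyncd's master.py; names are illustrative.

def changelog_ts(path):
    """Extract the integer timestamp suffix from a CHANGELOG.<ts> name."""
    return int(path.split('.')[-1])

def process_changes(changes):
    """Buggy shape: if 'changes' is empty, the loop body never runs, so
    'change' is never bound, and the final line raises UnboundLocalError
    -- the exact exception seen in the traceback above."""
    for change in changes:
        pass  # ... sync the entries/data recorded in this changelog ...
    # 'change' here refers to the last changelog processed by the loop.
    return (changelog_ts(change) - 1, 0)

def crawl(changelogs, stime):
    # Skip changelogs that predate the slave's stime: those entries were
    # already synced by the initial xsync crawl.
    pending = [c for c in changelogs if changelog_ts(c) > stime]
    # The fix: invoke the processing routine only when the list of
    # pending changelogs is non-empty.
    if pending:
        return process_changes(pending)
    return None
```

If the initial crawl "wrapped up" all pending changes, `pending` is empty; without the `if pending` guard, `process_changes([])` would crash with `UnboundLocalError`, which is what the passive-turned-active worker hit.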