| Summary: | dist-geo-rep: After checkpoint if a node reboots, the status goes to faulty after it becomes online. | | |
|---|---|---|---|
| Product: | Red Hat Gluster Storage | Reporter: | M S Vishwanath Bhat <vbhat> |
| Component: | geo-replication | Assignee: | Aravinda VK <avishwan> |
| Status: | CLOSED ERRATA | QA Contact: | M S Vishwanath Bhat <vbhat> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 2.1 | CC: | aavati, amarts, csaba, grajaiya, mzywusko, nsathyan, vkoppad |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | glusterfs-3.4.0.41rhs-1 | Doc Type: | Bug Fix |
| Doc Text: | Previously, if a node was rebooted while a checkpoint was set on it, geo-replication entered the 'faulty' state when the node came back online. With this update, geo-replication handles these failure cases properly and no longer goes to the faulty state. | Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2013-11-27 15:46:10 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Amar, is the following patch available in this build? https://code.engineering.redhat.com/gerrit/#/c/15289/

In that patch, gconf.confdata.delete('checkpoint-completed') is changed to gconf.configinterface.delete('checkpoint-completed'), and that should fix the issue.

Verified on the build glusterfs-3.4.0.43rhs-1:

```
MASTER NODE              MASTER VOL  MASTER BRICK    SLAVE                STATUS   CHECKPOINT STATUS                                                          CRAWL STATUS
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
redcloak.blr.redhat.com  master      /bricks/brick2  10.70.43.76::slave   Passive  N/A                                                                        N/A
redwood.blr.redhat.com   master      /bricks/brick4  10.70.42.151::slave  Passive  N/A                                                                        N/A
redlake.blr.redhat.com   master      /bricks/brick3  10.70.43.135::slave  Active   checkpoint as of 2013-11-13 18:14:30 is completed at 2013-11-13 18:22:25   Changelog Crawl
redcell.blr.redhat.com   master      /bricks/brick1  10.70.43.174::slave  Active   checkpoint as of 2013-11-13 18:14:30 is completed at 2013-11-13 18:22:25   Changelog Crawl
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1769.html
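The one-line change in that patch can be illustrated with a minimal, hypothetical sketch. The ConfigInterface and GConf classes below are stand-ins, not the real syncdaemon objects; only the attribute names confdata and configinterface and the 'checkpoint-completed' key come from the patch and the traceback.

```python
class ConfigInterface:
    """Hypothetical stand-in for the geo-rep config accessor."""
    def __init__(self):
        self._data = {'checkpoint-completed': '1'}

    def delete(self, key):
        self._data.pop(key, None)

class GConf:
    """Stand-in: like the GConf in the traceback, it has a
    'configinterface' attribute but no 'confdata' attribute."""
    def __init__(self):
        self.configinterface = ConfigInterface()

gconf = GConf()

# Pre-patch call: 'confdata' does not exist on GConf, so this raises
# AttributeError, which is what drove the worker to the faulty state.
try:
    gconf.confdata.delete('checkpoint-completed')
except AttributeError as exc:
    print("pre-patch:", exc)

# Post-patch call: uses the attribute that actually exists.
gconf.configinterface.delete('checkpoint-completed')
print('checkpoint-completed' in gconf.configinterface._data)  # prints: False
```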
Description of problem:
I set a checkpoint, and the status reported the checkpoint as completed even before it had actually completed (that is tracked in a separate bug, 1025358). At this stage, if the node reboots, the status of that particular node goes to faulty after it comes back up, with a Python backtrace in the log file:

```
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 233, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 546, in checkpt_service
    gconf.confdata.delete('checkpoint-completed')
AttributeError: 'GConf' object has no attribute 'confdata'
```

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.39rhs-1.el6rhs.x86_64

How reproducible:
Consistently

Steps to Reproduce:
1. Create and start a geo-rep session between a 2x2 dist-rep master and a 2x2 dist-rep slave.
2. Start creating data on the mountpoint; you can either untar the Linux kernel or use smallfiles_cli.py.
3. Set the checkpoint.
4. Once geo-rep status says the checkpoint is completed, reboot a node.
5. Run geo-rep status.

Actual results:
Status goes faulty.
```
NODE                     MASTER  SLAVE          HEALTH                                                                             UPTIME
--------------------------------------------------------------------------------------------------------------------------------------------
harrier.blr.redhat.com   master  falcon::slave  faulty                                                                             N/A
typhoon.blr.redhat.com   master  falcon::slave  Stable | checkpoint as of 2013-11-06 14:08:00 is completed at 2013-11-06 14:08:18  19:43:55
spitfire.blr.redhat.com  master  falcon::slave  Stable | checkpoint as of 2013-11-06 14:08:00 is completed at 2013-11-06 14:08:17  19:44:00
mustang.blr.redhat.com   master  falcon::slave  Stable | checkpoint as of 2013-11-06 14:08:00 is completed at 2013-11-06 14:08:18  19:43:55
```

And the log file has the following backtrace:

```
[2013-11-06 14:15:39.476916] E [syncdutils(/rhs/bricks/brick2):207:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 233, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 546, in checkpt_service
    gconf.confdata.delete('checkpoint-completed')
AttributeError: 'GConf' object has no attribute 'confdata'
[2013-11-06 14:15:39.478607] I [syncdutils(/rhs/bricks/brick2):159:finalize] <top>: exiting.
[2013-11-06 14:15:39.482624] I [monitor(monitor):81:set_state] Monitor: new state: faulty
```

Expected results:
Status should not go faulty and there should be no Python exception.

Additional info:
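The log lines above show the failure path: the exception raised inside checkpt_service escapes through the twrap thread wrapper, is logged as FAIL, and the monitor then marks the worker faulty. The following is a hedged sketch of that pattern only; apart from the names visible in the traceback (twrap, checkpt_service, GConf), everything here is illustrative, not the real syncdaemon API.

```python
import threading

# Illustrative stand-in for the monitor's state; not the real Monitor class.
state = {'health': 'Stable'}

def twrap(target, *args):
    """Run target in a thread; any uncaught exception is logged and the
    worker is marked faulty (mimicking log_raise_exception/set_state)."""
    def run():
        try:
            target(*args)
        except Exception as exc:
            print("FAIL:", exc)         # stands in for log_raise_exception
            state['health'] = 'faulty'  # stands in for Monitor.set_state
    t = threading.Thread(target=run)
    t.start()
    t.join()

class GConf:
    """Deliberately lacks a 'confdata' attribute, as in the bug."""

def checkpt_service(gconf):
    # The buggy call: GConf has no 'confdata', so this raises AttributeError.
    gconf.confdata.delete('checkpoint-completed')

twrap(checkpt_service, GConf())
print(state['health'])  # prints: faulty
```

This is why a single misnamed attribute in the checkpoint service was enough to flip the whole node's health to faulty: the wrapper treats any uncaught exception as a fatal worker failure.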