Bug 1027137 - dist-geo-rep: After checkpoint if a node reboots, the status goes to faulty after it becomes online.
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: geo-replication
2.1
x86_64 Linux
high Severity high
Assigned To: Aravinda VK
M S Vishwanath Bhat
Keywords: ZStream
Depends On:
Blocks:
Reported: 2013-11-06 03:57 EST by M S Vishwanath Bhat
Modified: 2016-05-31 21:56 EDT (History)
7 users

See Also:
Fixed In Version: glusterfs-3.4.0.41rhs-1
Doc Type: Bug Fix
Doc Text:
Previously, if a node was rebooted while a checkpoint was set on it, geo-replication would enter the 'faulty' state when the node came back online. With this update, geo-replication handles these failure cases properly and no longer goes to the faulty state.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-11-27 10:46:10 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description M S Vishwanath Bhat 2013-11-06 03:57:12 EST
Description of problem:
I set a checkpoint, and the status reported the checkpoint as completed even before it had actually completed (that is tracked separately as bug 1025358). If the node reboots at this stage, then after it comes back up the status of that particular node goes to faulty, with a Python backtrace in the log file.

Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 233, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 546, in checkpt_service
    gconf.confdata.delete('checkpoint-completed')
AttributeError: 'GConf' object has no attribute 'confdata'



Version-Release number of selected component (if applicable):
glusterfs-3.4.0.39rhs-1.el6rhs.x86_64


How reproducible:
Consistently

Steps to Reproduce:
1. Create and start a geo-rep session between a 2*2 dist-rep master and a 2*2 dist-rep slave.
2. Start creating data on the mountpoint. You can either untar the Linux kernel or use smallfiles_cli.py.
3. Set the checkpoint.
4. The geo-rep status says checkpoint completed. At this point, reboot a node.
5. Run geo-rep status.

Actual results:
Status goes faulty.

NODE                       MASTER    SLAVE            HEALTH                                                                                 UPTIME
--------------------------------------------------------------------------------------------------------------------------------------------------------
harrier.blr.redhat.com     master    falcon::slave    faulty                                                                                 N/A
typhoon.blr.redhat.com     master    falcon::slave    Stable  | checkpoint as of 2013-11-06 14:08:00 is completed at 2013-11-06 14:08:18     19:43:55
spitfire.blr.redhat.com    master    falcon::slave    Stable  | checkpoint as of 2013-11-06 14:08:00 is completed at 2013-11-06 14:08:17     19:44:00
mustang.blr.redhat.com     master    falcon::slave    Stable  | checkpoint as of 2013-11-06 14:08:00 is completed at 2013-11-06 14:08:18     19:43:55


And the log file has following back trace
[2013-11-06 14:15:39.476916] E [syncdutils(/rhs/bricks/brick2):207:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 233, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 546, in checkpt_service
    gconf.confdata.delete('checkpoint-completed')
AttributeError: 'GConf' object has no attribute 'confdata'
[2013-11-06 14:15:39.478607] I [syncdutils(/rhs/bricks/brick2):159:finalize] <top>: exiting.
[2013-11-06 14:15:39.482624] I [monitor(monitor):81:set_state] Monitor: new state: faulty


Expected results:
Status should not go faulty and there should be no Python exception.

Additional info:
Comment 2 Aravinda VK 2013-11-11 05:19:56 EST
Amar, is the following patch available in this build?
https://code.engineering.redhat.com/gerrit/#/c/15289/

In that patch, gconf.confdata.delete('checkpoint-completed') is changed to
gconf.configinterface.delete('checkpoint-completed'), and that should fix the issue.
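The failure mode can be illustrated with a minimal, self-contained sketch. Note this is not the actual glusterfs syncdaemon code: the class and attribute names below (ConfigInterface, the in-memory dict) are simplified stand-ins; only the attribute names 'confdata' and 'configinterface' and the AttributeError come from the traceback and patch above.

```python
class ConfigInterface:
    """Simplified stand-in for the geo-rep config wrapper."""
    def __init__(self):
        self._data = {'checkpoint-completed': '2013-11-06 14:08:18'}

    def delete(self, key):
        # Deleting an absent key is a no-op, so repeated cleanup is safe.
        self._data.pop(key, None)

class GConf:
    """Simplified stand-in: exposes 'configinterface' but no 'confdata'."""
    def __init__(self):
        self.configinterface = ConfigInterface()

gconf = GConf()

# Pre-fix call path: 'confdata' does not exist on the object, so this
# raises AttributeError, which checkpt_service did not handle and which
# drove the worker to the 'faulty' state.
try:
    gconf.confdata.delete('checkpoint-completed')
except AttributeError as e:
    print(e)  # 'GConf' object has no attribute 'confdata'

# Patched call path: goes through the attribute that actually exists.
gconf.configinterface.delete('checkpoint-completed')
```

The fix is a one-line attribute rename at the call site; no behavior of the config store itself changes.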
Comment 3 Vijaykumar Koppad 2013-11-13 08:35:35 EST
Verified on build glusterfs-3.4.0.43rhs-1.

MASTER NODE                MASTER VOL    MASTER BRICK      SLAVE                  STATUS     CHECKPOINT STATUS                                                           CRAWL STATUS           
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
redcloak.blr.redhat.com    master        /bricks/brick2    10.70.43.76::slave     Passive    N/A                                                                         N/A                    
redwood.blr.redhat.com     master        /bricks/brick4    10.70.42.151::slave    Passive    N/A                                                                         N/A                    
redlake.blr.redhat.com     master        /bricks/brick3    10.70.43.135::slave    Active     checkpoint as of 2013-11-13 18:14:30 is completed at 2013-11-13 18:22:25    Changelog Crawl        
redcell.blr.redhat.com     master        /bricks/brick1    10.70.43.174::slave    Active     checkpoint as of 2013-11-13 18:14:30 is completed at 2013-11-13 18:22:25    Changelog Crawl
Comment 4 errata-xmlrpc 2013-11-27 10:46:10 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1769.html
