Bug 1027137 - dist-geo-rep: After checkpoint if a node reboots, the status goes to faulty after it becomes online.
Summary: dist-geo-rep: After checkpoint if a node reboots, the status goes to faulty after it becomes online.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: geo-replication
Version: 2.1
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Aravinda VK
QA Contact: M S Vishwanath Bhat
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-11-06 08:57 UTC by M S Vishwanath Bhat
Modified: 2016-06-01 01:56 UTC
CC List: 7 users

Fixed In Version: glusterfs-3.4.0.41rhs-1
Doc Type: Bug Fix
Doc Text:
Previously, if a node was rebooted while a checkpoint was set on it, geo-replication entered the 'faulty' state when the node came back online. With this update, the geo-replication process handles these failure cases properly and no longer goes to the faulty state.
Clone Of:
Environment:
Last Closed: 2013-11-27 15:46:10 UTC
Embargoed:




Links
System: Red Hat Product Errata    ID: RHBA-2013:1769    Private: 0    Priority: normal    Status: SHIPPED_LIVE    Summary: Red Hat Storage 2.1 enhancement and bug fix update #1    Last Updated: 2013-11-27 20:17:39 UTC

Description M S Vishwanath Bhat 2013-11-06 08:57:12 UTC
Description of problem:
I set the checkpoint and the status reports the checkpoint as completed even before the actual checkpoint completes; there is a separate bug for that (1025358). At this stage, if the node reboots, then after it comes back up the status of that particular node goes to faulty, with a Python backtrace in the log file.

Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 233, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 546, in checkpt_service
    gconf.confdata.delete('checkpoint-completed')
AttributeError: 'GConf' object has no attribute 'confdata'
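For illustration, here is a minimal, self-contained sketch of how an unhandled exception in the checkpoint thread ends up as a faulty worker. This is not the actual syncdaemon code: the GConf class and the thread wrapper below are hypothetical stand-ins that only mirror the twrap/checkpt_service frames and the "new state: faulty" monitor message seen in this report.

# Illustrative sketch only -- mock classes, not the real gsyncd code.
import threading

class GConf(object):
    pass          # deliberately has no 'confdata' attribute, as in the traceback

gconf = GConf()

def checkpt_service():
    # Mirrors the failing call shown at master.py line 546 in the traceback.
    gconf.confdata.delete('checkpoint-completed')

def twrap(target):
    """Run target in a thread; if it dies with an exception, report it and
    mark the worker faulty (roughly the behaviour the logs above show)."""
    state = {'health': 'Stable'}

    def runner():
        try:
            target()
        except Exception as exc:
            print('FAIL: %r' % exc)        # cf. log_raise_exception
            state['health'] = 'faulty'     # cf. "Monitor: new state: faulty"

    t = threading.Thread(target=runner)
    t.start()
    t.join()
    return state['health']

print(twrap(checkpt_service))              # prints FAIL: ... and then 'faulty'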



Version-Release number of selected component (if applicable):
glusterfs-3.4.0.39rhs-1.el6rhs.x86_64


How reproducible:
Consistently

Steps to Reproduce:
1. Create and start a geo-rep session between a 2*2 dist-rep master and a 2*2 dist-rep slave volume.
2. Start creating data on the mountpoint. You can either untar the Linux kernel or use smallfiles_cli.py.
3. Set the checkpoint.
4. The geo-rep status now says the checkpoint is completed. At this point, reboot a node.
5. Run geo-rep status (a rough reproduction sketch of steps 3-5 follows this list).
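The following is a rough reproduction sketch of steps 3-5, assuming the master volume 'master' and the slave 'falcon::slave' from the status output below, a session that is already created and started, and the checkpoint-by-config CLI syntax of this release; the reboot in step 4 is still done by hand.

# Reproduction sketch under the assumptions stated above; adjust MASTER/SLAVE
# to your own session before running.
import subprocess

MASTER = 'master'
SLAVE = 'falcon::slave'       # slave-host::slave-volume

def georep(*args):
    cmd = ['gluster', 'volume', 'geo-replication', MASTER, SLAVE] + list(args)
    print('+ ' + ' '.join(cmd))
    return subprocess.check_output(cmd).decode()

# Step 3: set a checkpoint on the running session.
georep('config', 'checkpoint', 'now')

# Step 4: reboot one of the master nodes out of band and wait for it to return.

# Step 5: check the session status; the rebooted node should report Stable,
# not faulty.
print(georep('status'))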

Actual results:
Status goes to faulty.

NODE                       MASTER    SLAVE            HEALTH                                                                                 UPTIME
--------------------------------------------------------------------------------------------------------------------------------------------------------
harrier.blr.redhat.com     master    falcon::slave    faulty                                                                                 N/A
typhoon.blr.redhat.com     master    falcon::slave    Stable  | checkpoint as of 2013-11-06 14:08:00 is completed at 2013-11-06 14:08:18     19:43:55
spitfire.blr.redhat.com    master    falcon::slave    Stable  | checkpoint as of 2013-11-06 14:08:00 is completed at 2013-11-06 14:08:17     19:44:00
mustang.blr.redhat.com     master    falcon::slave    Stable  | checkpoint as of 2013-11-06 14:08:00 is completed at 2013-11-06 14:08:18     19:43:55


And the log file has following back trace
[2013-11-06 14:15:39.476916] E [syncdutils(/rhs/bricks/brick2):207:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 233, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 546, in checkpt_service
    gconf.confdata.delete('checkpoint-completed')
AttributeError: 'GConf' object has no attribute 'confdata'
[2013-11-06 14:15:39.478607] I [syncdutils(/rhs/bricks/brick2):159:finalize] <top>: exiting.
[2013-11-06 14:15:39.482624] I [monitor(monitor):81:set_state] Monitor: new state: faulty


Expected results:
Status should not go to faulty and there should be no Python exception.

Additional info:

Comment 2 Aravinda VK 2013-11-11 10:19:56 UTC
Amar, is the following patch available in this build?
https://code.engineering.redhat.com/gerrit/#/c/15289/

In that patch, gconf.confdata.delete('checkpoint-completed') is changed to
gconf.configinterface.delete('checkpoint-completed'), which should fix the issue.
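In other words, the cleanup has to go through the attribute that GConf actually exposes. Below is a minimal mock sketch of that change, not the actual patch from the Gerrit link above: only the attribute names confdata/configinterface come from this report, and the ConfigInterface class is a hypothetical stand-in.

# Mock sketch of the change described above.

class ConfigInterface(object):
    """Hypothetical stand-in for the geo-rep config store."""
    def __init__(self):
        self.data = {'checkpoint-completed': '2013-11-06 14:08:18'}

    def delete(self, key):
        self.data.pop(key, None)

class GConf(object):
    configinterface = ConfigInterface()   # this attribute exists
    # no 'confdata' attribute             # hence the AttributeError above

gconf = GConf()

# Before: gconf.confdata.delete('checkpoint-completed')  -> AttributeError,
#         the worker thread dies and the node shows up as faulty.
# After:  go through the interface that is actually there.
gconf.configinterface.delete('checkpoint-completed')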

Comment 3 Vijaykumar Koppad 2013-11-13 13:35:35 UTC
Verified on the build glusterfs-3.4.0.43rhs-1.

MASTER NODE                MASTER VOL    MASTER BRICK      SLAVE                  STATUS     CHECKPOINT STATUS                                                           CRAWL STATUS           
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
redcloak.blr.redhat.com    master        /bricks/brick2    10.70.43.76::slave     Passive    N/A                                                                         N/A                    
redwood.blr.redhat.com     master        /bricks/brick4    10.70.42.151::slave    Passive    N/A                                                                         N/A                    
redlake.blr.redhat.com     master        /bricks/brick3    10.70.43.135::slave    Active     checkpoint as of 2013-11-13 18:14:30 is completed at 2013-11-13 18:22:25    Changelog Crawl        
redcell.blr.redhat.com     master        /bricks/brick1    10.70.43.174::slave    Active     checkpoint as of 2013-11-13 18:14:30 is completed at 2013-11-13 18:22:25    Changelog Crawl

Comment 4 errata-xmlrpc 2013-11-27 15:46:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1769.html

