Bug 1186286

Summary: Geo-Replication Faulty state
Product: [Community] GlusterFS Reporter: Pierre-Marie JANVRE <pierre-marie.janvre>
Component: geo-replication Assignee: Kotresh HR <khiremat>
Status: CLOSED WORKSFORME QA Contact:
Severity: unspecified Docs Contact:
Priority: high    
Version: mainline CC: bugs, pierre-marie.janvre, vbellur
Target Milestone: --- Keywords: Triaged
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-05-21 03:55:15 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Strace output as requested none

Description Pierre-Marie JANVRE 2015-01-27 11:05:02 UTC
Description of problem: Geo-replication goes into a Faulty state when starting it


Version-Release number of selected component (if applicable): 3.6.1


How reproducible: Every time I have tried


Steps to Reproduce:
1. gluster volume geo-replication master_volume root@slave_node1::slave_volume create push-pem
2. gluster volume geo-replication master_volume root@slave_node1::slave_volume start
3. gluster volume geo-replication master_volume root@slave_node1::slave_volume status

Actual results:
Faulty

Expected results:
OK

Additional info:
Here is the setup:

Datacenter A
2 nodes:
-master_node1
-master_node2
1 brick per node (replica)

Datacenter B
2 nodes:
-slave_node1
-slave_node2
1 brick per node (replica)

OS: CentOS 6.6
Gluster: glusterfs 3.6.1 built on Nov  7 2014 15:15:48

Bricks were set up properly without any errors.
Passwordless authentication between node 1 of datacenter 1 and node 1 of datacenter 2 was set up successfully.
Geo-replication was set up as below:
gluster system:: execute gsec_create
gluster volume geo-replication master_volume root@slave_node1::slave_volume create push-pem
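
For completeness, the passwordless SSH setup that precedes the commands above looked roughly like this (a minimal sketch; the exact ssh-keygen options used are not recorded here, only the fact that passwordless root SSH was configured):

# On master_node1: generate a root SSH key and copy it to the slave node
ssh-keygen -t rsa
ssh-copy-id root@slave_node1
# Confirm login works without a password prompt before running create push-pem
ssh root@slave_node1 true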

I can start the geo-replication successfully:
gluster volume geo-replication master_volume root@slave_node1::slave_volume start

But when checking the status, I get the following:
gluster volume geo-replication master_volume root@slave_node1::slave_volume status

MASTER NODE    MASTER VOL    MASTER BRICK    SLAVE                          STATUS    CHECKPOINT STATUS    CRAWL STATUS
------------------------------------------------------------------------------------------------------------------------
master_node1    master_volume     /master_brick1     root@slave_node1::slave_volume    faulty    N/A                  N/A
master_node2    master_volume     /master_brick2     root@slave_node1::slave_volume    faulty    N/A                  N/A

From master node 1, I ran geo-replication with logging in debug mode and found the following:
[2015-01-26 15:33:29.247694] D [monitor(monitor):280:distribute] <top>: master bricks: [{'host': 'master_node1', 'dir': '/master_brick1'}, {'host': 'master_node2', 'dir': '/master_brick2'}]
[2015-01-26 15:33:29.248047] D [monitor(monitor):286:distribute] <top>: slave SSH gateway: root@slave_node1
[2015-01-26 15:33:29.721532] I [monitor(monitor):296:distribute] <top>: slave bricks: [{'host': 'slave_node1', 'dir': '/slave_brick1'}, {'host': 'slave_node2', 'dir': '/ slave_brick2'}]
[2015-01-26 15:33:29.729722] I [monitor(monitor):316:distribute] <top>: worker specs: [('/master_brick1', 'ssh://root@slave_node2:gluster://localhost:slave_volume')]
[2015-01-26 15:33:29.730287] I [monitor(monitor):109:set_state] Monitor: new state: Initializing...
[2015-01-26 15:33:29.731513] I [monitor(monitor):163:monitor] Monitor: ------------------------------------------------------------
[2015-01-26 15:33:29.731647] I [monitor(monitor):164:monitor] Monitor: starting gsyncd worker
[2015-01-26 15:33:29.830656] D [gsyncd(agent):627:main_i] <top>: rpc_fd: '7,10,9,8'
[2015-01-26 15:33:29.831882] I [monitor(monitor):214:monitor] Monitor: worker(/master_brick1) died before establishing connection
[2015-01-26 15:33:29.831476] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-01-26 15:33:29.832392] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-01-26 15:33:29.832693] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-01-26 15:33:29.834060] I [monitor(monitor):109:set_state] Monitor: new state: faulty
[2015-01-26 15:33:39.846858] I [monitor(monitor):163:monitor] Monitor: ------------------------------------------------------------
[2015-01-26 15:33:39.847105] I [monitor(monitor):164:monitor] Monitor: starting gsyncd worker
[2015-01-26 15:33:39.941967] D [gsyncd(agent):627:main_i] <top>: rpc_fd: '7,10,9,8'
[2015-01-26 15:33:39.942630] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-01-26 15:33:39.945791] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-01-26 15:33:39.945941] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-01-26 15:33:39.945904] I [monitor(monitor):214:monitor] Monitor: worker(/master_brick1) died before establishing connection
[2015-01-26 15:33:49.959361] I [monitor(monitor):163:monitor] Monitor: ------------------------------------------------------------
[2015-01-26 15:33:49.959599] I [monitor(monitor):164:monitor] Monitor: starting gsyncd worker
[2015-01-26 15:33:50.56200] D [gsyncd(agent):627:main_i] <top>: rpc_fd: '7,10,9,8'
[2015-01-26 15:33:50.56809] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-01-26 15:33:50.58903] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-01-26 15:33:50.59078] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-01-26 15:33:50.59039] I [monitor(monitor):214:monitor] Monitor: worker(/master_brick1) died before establishing connection
[2015-01-26 15:34:00.72674] I [monitor(monitor):163:monitor] Monitor: ------------------------------------------------------------
[2015-01-26 15:34:00.72926] I [monitor(monitor):164:monitor] Monitor: starting gsyncd worker
[2015-01-26 15:34:00.169071] D [gsyncd(agent):627:main_i] <top>: rpc_fd: '7,10,9,8'
[2015-01-26 15:34:00.169931] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-01-26 15:34:00.170466] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-01-26 15:34:00.170526] I [monitor(monitor):214:monitor] Monitor: worker(/master_brick1) died before establishing connection
[2015-01-26 15:34:00.170938] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-01-26 15:34:10.183361] I [monitor(monitor):163:monitor] Monitor: ------------------------------------------------------------
[2015-01-26 15:34:10.183614] I [monitor(monitor):164:monitor] Monitor: starting gsyncd worker
[2015-01-26 15:34:10.278914] D [monitor(monitor):217:monitor] Monitor: worker(/master_brick1) connected
[2015-01-26 15:34:10.279994] I [monitor(monitor):222:monitor] Monitor: worker(/master_brick1) died in startup phase
[2015-01-26 15:34:10.282217] D [gsyncd(agent):627:main_i] <top>: rpc_fd: '7,10,9,8'
[2015-01-26 15:34:10.282943] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-01-26 15:34:10.283098] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-01-26 15:34:10.283303] I [syncdutils(agent):214:finalize] <top>: exiting.
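
Since every worker dies before establishing a connection, one check worth recording here (a sketch only; the key path below is the usual default for push-pem sessions and has not been verified on this setup) is whether gsyncd's own SSH key can actually reach the slave:

# Run on master_node1; should log in to slave_node1 without a password prompt
ssh -i /var/lib/glusterd/geo-replication/secret.pem root@slave_node1 gluster --version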

Comment 1 Pierre-Marie JANVRE 2015-01-27 11:23:00 UTC
I just upgraded all nodes to version 3.6.2. The same issue still occurs.

Comment 2 Aravinda VK 2015-03-26 09:02:42 UTC
Please upload strace output using the following command:

strace -s 500 -f -p <MONITOR_PID> -o /tmp/strace_output.txt

To get the monitor PID: `ps -ax | grep gsyncd | grep monitor`

Run strace for some time, at least until the log records the workers exiting 2-3 times.
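
Putting the two steps together (the awk-based PID extraction below is just one way to do it, shown for illustration):

MONITOR_PID=$(ps -ax | grep gsyncd | grep monitor | grep -v grep | awk '{print $1}')
strace -s 500 -f -p "$MONITOR_PID" -o /tmp/strace_output.txt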

Comment 3 Pierre-Marie JANVRE 2015-03-30 07:14:57 UTC
Created attachment 1008243 [details]
Strace output as requested

As requested, attached is the output of the strace command.

Comment 4 Aravinda VK 2016-08-19 11:32:19 UTC
GlusterFS 3.6 is nearing its end of life; only important security bugs still have a chance of getting fixed. Moving this to the mainline 'version'. If this needs to be fixed in 3.7 or 3.8, this bug should be cloned.

Comment 5 Kotresh HR 2019-05-21 03:55:15 UTC
The issue is no longer seen in the latest releases, hence closing it. Please re-open if it happens again and upload the geo-rep logs.