Description of problem:
Geo-replication goes into the "faulty" state when starting it.

Version-Release number of selected component (if applicable):
3.6.1

How reproducible:
Each time I tried.

Steps to Reproduce:
1. gluster volume geo-replication master_volume root@slave_node1::slave_volume create push-pem
2. gluster volume geo-replication master_volume root@slave_node1::slave_volume start
3. gluster volume geo-replication master_volume root@slave_node1::slave_volume status

Actual results:
Faulty

Expected results:
OK

Additional info:

Here is the setup:

Datacenter A, 2 nodes:
- master_node1
- master_node2
1 brick per node (replica)

Datacenter B, 2 nodes:
- slave_node1
- slave_node2
1 brick per node (replica)

OS: CentOS 6.6
Gluster: glusterfs 3.6.1 built on Nov 7 2014 15:15:48

The bricks were set up properly without any error. Passwordless SSH authentication between node 1 of datacenter A and node 1 of datacenter B was set up successfully.

Geo-replication was set up as follows:

gluster system:: execute gsec_create
gluster volume geo-replication master_volume root@slave_node1::slave_volume create push-pem

I can start geo-replication successfully:

gluster volume geo-replication master_volume root@slave_node1::slave_volume start

But when checking the status, I get the following:

gluster volume geo-replication master_volume root@slave_node1::slave_volume status

MASTER NODE     MASTER VOL       MASTER BRICK      SLAVE                             STATUS    CHECKPOINT STATUS    CRAWL STATUS
------------------------------------------------------------------------------------------------------------------------
master_node1    master_volume    /master_brick1    root@slave_node1::slave_volume    faulty    N/A                  N/A
master_node2    master_volume    /master_brick2    root@slave_node1::slave_volume    faulty    N/A                  N/A

On master_node1, I ran the geo-replication logs in debug mode and found the following:

[2015-01-26 15:33:29.247694] D [monitor(monitor):280:distribute] <top>: master bricks: [{'host': 'master_node1', 'dir': '/master_brick1'}, {'host': 'master_node2', 'dir': '/master_brick2'}]
[2015-01-26 15:33:29.248047] D [monitor(monitor):286:distribute] <top>: slave SSH gateway: root@slave_node1
[2015-01-26 15:33:29.721532] I [monitor(monitor):296:distribute] <top>: slave bricks: [{'host': 'slave_node1', 'dir': '/slave_brick1'}, {'host': 'slave_node2', 'dir': '/slave_brick2'}]
[2015-01-26 15:33:29.729722] I [monitor(monitor):316:distribute] <top>: worker specs: [('/master_brick1', 'ssh://root@slave_node2:gluster://localhost:slave_volume')]
[2015-01-26 15:33:29.730287] I [monitor(monitor):109:set_state] Monitor: new state: Initializing...
[2015-01-26 15:33:29.731513] I [monitor(monitor):163:monitor] Monitor: ------------------------------------------------------------
[2015-01-26 15:33:29.731647] I [monitor(monitor):164:monitor] Monitor: starting gsyncd worker
[2015-01-26 15:33:29.830656] D [gsyncd(agent):627:main_i] <top>: rpc_fd: '7,10,9,8'
[2015-01-26 15:33:29.831882] I [monitor(monitor):214:monitor] Monitor: worker(/master_brick1) died before establishing connection
[2015-01-26 15:33:29.831476] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-01-26 15:33:29.832392] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-01-26 15:33:29.832693] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-01-26 15:33:29.834060] I [monitor(monitor):109:set_state] Monitor: new state: faulty
[2015-01-26 15:33:39.846858] I [monitor(monitor):163:monitor] Monitor: ------------------------------------------------------------
[2015-01-26 15:33:39.847105] I [monitor(monitor):164:monitor] Monitor: starting gsyncd worker
[2015-01-26 15:33:39.941967] D [gsyncd(agent):627:main_i] <top>: rpc_fd: '7,10,9,8'
[2015-01-26 15:33:39.942630] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-01-26 15:33:39.945791] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-01-26 15:33:39.945941] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-01-26 15:33:39.945904] I [monitor(monitor):214:monitor] Monitor: worker(/master_brick1) died before establishing connection
[2015-01-26 15:33:49.959361] I [monitor(monitor):163:monitor] Monitor: ------------------------------------------------------------
[2015-01-26 15:33:49.959599] I [monitor(monitor):164:monitor] Monitor: starting gsyncd worker
[2015-01-26 15:33:50.56200] D [gsyncd(agent):627:main_i] <top>: rpc_fd: '7,10,9,8'
[2015-01-26 15:33:50.56809] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-01-26 15:33:50.58903] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-01-26 15:33:50.59078] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-01-26 15:33:50.59039] I [monitor(monitor):214:monitor] Monitor: worker(/master_brick1) died before establishing connection
[2015-01-26 15:34:00.72674] I [monitor(monitor):163:monitor] Monitor: ------------------------------------------------------------
[2015-01-26 15:34:00.72926] I [monitor(monitor):164:monitor] Monitor: starting gsyncd worker
[2015-01-26 15:34:00.169071] D [gsyncd(agent):627:main_i] <top>: rpc_fd: '7,10,9,8'
[2015-01-26 15:34:00.169931] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-01-26 15:34:00.170466] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-01-26 15:34:00.170526] I [monitor(monitor):214:monitor] Monitor: worker(/master_brick1) died before establishing connection
[2015-01-26 15:34:00.170938] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-01-26 15:34:10.183361] I [monitor(monitor):163:monitor] Monitor: ------------------------------------------------------------
[2015-01-26 15:34:10.183614] I [monitor(monitor):164:monitor] Monitor: starting gsyncd worker
[2015-01-26 15:34:10.278914] D [monitor(monitor):217:monitor] Monitor: worker(/master_brick1) connected
[2015-01-26 15:34:10.279994] I [monitor(monitor):222:monitor] Monitor: worker(/master_brick1) died in startup phase
[2015-01-26 15:34:10.282217] D [gsyncd(agent):627:main_i] <top>: rpc_fd: '7,10,9,8'
[2015-01-26 15:34:10.282943] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-01-26 15:34:10.283098] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-01-26 15:34:10.283303] I [syncdutils(agent):214:finalize] <top>: exiting.
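One detail stands out in the debug output: the monitor builds a worker spec that points the /master_brick1 worker at slave_node2 (ssh://root@slave_node2:...), while only passwordless SSH between master_node1 and slave_node1 was set up by hand. A first check is whether every master node can reach every slave node non-interactively with the key gsyncd actually uses. A minimal sketch, assuming the default secret.pem location populated by gsec_create / push-pem (an assumption; adjust if your installation differs):

# Run from each master node: test passwordless SSH to both slave nodes
# using the geo-replication key (default path assumed here).
for slave in slave_node1 slave_node2; do
    ssh -o PasswordAuthentication=no \
        -i /var/lib/glusterd/geo-replication/secret.pem \
        root@"$slave" true && echo "$slave: OK" || echo "$slave: FAILED"
done

A failure against slave_node2 would be consistent with the worker dying before establishing a connection, though that is only one possible cause.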
I just upgraded all nodes to version 3.6.2. The same issue still occurs.
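To rule out a partial upgrade, it may be worth confirming that every node in both datacenters reports the same version. A quick check (package names assume the CentOS RPM layout):

# Run on each node
gluster --version | head -1
rpm -q glusterfs glusterfs-geo-replication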
Please upload strace output using the following command:

strace -s 500 -f -p <MONITOR_PID> -o /tmp/strace_output.txt

To get the monitor PID: `ps -ax | grep gsyncd | grep monitor`

Run strace for some time, at least until the log records the workers exiting 2-3 times.
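Putting the above together, a minimal sketch of the capture (the [g]syncd pattern keeps grep from matching itself; the awk field assumes standard ps -ax output and may need adjusting, and this assumes a single monitor process):

# Locate the gsyncd monitor PID and attach strace to it
MONITOR_PID=$(ps -ax | grep '[g]syncd' | grep monitor | awk '{print $1}')
strace -s 500 -f -p "$MONITOR_PID" -o /tmp/strace_output.txt
# Let it run until the geo-rep log records the worker dying 2-3 more times,
# then stop strace with Ctrl-C and attach /tmp/strace_output.txt.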
Created attachment 1008243 [details]
Strace output as requested

As requested, attached is the output of the strace command.
GlusterFS-3.6 is nearing its End-Of-Life; only important security bugs still have a chance of getting fixed. Moving this to the mainline 'version'. If this needs to get fixed in 3.7 or 3.8, this bug should get cloned.
The issue is no longer seen in the latest releases, hence closing it. Please re-open the issue if it happens again and upload the geo-replication logs.
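For anyone who does hit this again, a sketch of how to bundle the geo-replication logs for upload, assuming the default glusterfs log locations (adjust if your log directory was changed):

# On the master nodes:
tar czf /tmp/geo-rep-logs-master.tar.gz /var/log/glusterfs/geo-replication/
# On the slave nodes:
tar czf /tmp/geo-rep-logs-slave.tar.gz /var/log/glusterfs/geo-replication-slaves/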