Bug 1186286 - Geo-Replication Faulty state
Summary: Geo-Replication Faulty state
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: GlusterFS
Classification: Community
Component: geo-replication
Version: mainline
Hardware: x86_64
OS: Linux
Priority: high
Severity: unspecified
Target Milestone: ---
Assignee: Kotresh HR
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-01-27 11:05 UTC by Pierre-Marie JANVRE
Modified: 2021-09-09 11:37 UTC
CC: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-05-21 03:55:15 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
Strace output as requested (116.37 KB, text/plain)
2015-03-30 07:14 UTC, Pierre-Marie JANVRE

Description Pierre-Marie JANVRE 2015-01-27 11:05:02 UTC
Description of problem: Geo-replication goes into a Faulty state when starting it


Version-Release number of selected component (if applicable): 3.6.1


How reproducible: Each time I tried


Steps to Reproduce:
1. gluster volume geo-replication master_volume root@slave_node1::slave_volume create push-pem
2. gluster volume geo-replication master_volume root@slave_node1::slave_volume start
3. gluster volume geo-replication master_volume root@slave_node1::slave_volume status

Actual results:
Faulty

Expected results:
OK

Additional info:
Here is the setup:

Datacenter A
2 nodes:
-master_node1
-master_node2
1 brick per node (replica)

Datacenter B
2 nodes:
-slave_node1
-slave_node2
1 brick per node (replica)

OS: CentOS 6.6
Gluster: glusterfs 3.6.1 built on Nov  7 2014 15:15:48

Bricks were set up properly without any error.
Passwordless SSH authentication between node 1 of datacenter A and node 1 of datacenter B was set up successfully.
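For reference, the passwordless root SSH was configured roughly as follows (key type and exact options are just an example):

# on master_node1
ssh-keygen -t rsa
ssh-copy-id root@slave_node1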
Geo-replication was then set up as below:
gluster system:: execute gsec_create
gluster volume geo-replication master_volume root@slave_node1::slave_volume create push-pem

I can successfully start geo-replication:
gluster volume geo-replication master_volume root@slave_node1::slave_volume start

But when checking the status, I have the following:
gluster volume geo-replication master_volume root@slave_node1::slave_volume status

MASTER NODE    MASTER VOL    MASTER BRICK    SLAVE                          STATUS    CHECKPOINT STATUS    CRAWL STATUS
------------------------------------------------------------------------------------------------------------------------
master_node1    master_volume     /master_brick1     root@slave_node1::slave_volume    faulty    N/A                  N/A
master_node2    master_volume     /master_brick2     root@slave_node1::slave_volume    faulty    N/A                  N/A

From master node 1, I ran the geo-replication logs in debug mode and found the following:
[2015-01-26 15:33:29.247694] D [monitor(monitor):280:distribute] <top>: master bricks: [{'host': 'master_node1', 'dir': '/master_brick1'}, {'host': 'master_node2', 'dir': '/master_brick2'}]
[2015-01-26 15:33:29.248047] D [monitor(monitor):286:distribute] <top>: slave SSH gateway: root@slave_node1
[2015-01-26 15:33:29.721532] I [monitor(monitor):296:distribute] <top>: slave bricks: [{'host': 'slave_node1', 'dir': '/slave_brick1'}, {'host': 'slave_node2', 'dir': '/ slave_brick2'}]
[2015-01-26 15:33:29.729722] I [monitor(monitor):316:distribute] <top>: worker specs: [('/master_brick1', 'ssh://root@slave_node2:gluster://localhost:slave_volume')]
[2015-01-26 15:33:29.730287] I [monitor(monitor):109:set_state] Monitor: new state: Initializing...
[2015-01-26 15:33:29.731513] I [monitor(monitor):163:monitor] Monitor: ------------------------------------------------------------
[2015-01-26 15:33:29.731647] I [monitor(monitor):164:monitor] Monitor: starting gsyncd worker
[2015-01-26 15:33:29.830656] D [gsyncd(agent):627:main_i] <top>: rpc_fd: '7,10,9,8'
[2015-01-26 15:33:29.831882] I [monitor(monitor):214:monitor] Monitor: worker(/master_brick1) died before establishing connection
[2015-01-26 15:33:29.831476] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-01-26 15:33:29.832392] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-01-26 15:33:29.832693] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-01-26 15:33:29.834060] I [monitor(monitor):109:set_state] Monitor: new state: faulty
[2015-01-26 15:33:39.846858] I [monitor(monitor):163:monitor] Monitor: ------------------------------------------------------------
[2015-01-26 15:33:39.847105] I [monitor(monitor):164:monitor] Monitor: starting gsyncd worker
[2015-01-26 15:33:39.941967] D [gsyncd(agent):627:main_i] <top>: rpc_fd: '7,10,9,8'
[2015-01-26 15:33:39.942630] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-01-26 15:33:39.945791] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-01-26 15:33:39.945941] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-01-26 15:33:39.945904] I [monitor(monitor):214:monitor] Monitor: worker(/master_brick1) died before establishing connection
[2015-01-26 15:33:49.959361] I [monitor(monitor):163:monitor] Monitor: ------------------------------------------------------------
[2015-01-26 15:33:49.959599] I [monitor(monitor):164:monitor] Monitor: starting gsyncd worker
[2015-01-26 15:33:50.56200] D [gsyncd(agent):627:main_i] <top>: rpc_fd: '7,10,9,8'
[2015-01-26 15:33:50.56809] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-01-26 15:33:50.58903] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-01-26 15:33:50.59078] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-01-26 15:33:50.59039] I [monitor(monitor):214:monitor] Monitor: worker(/master_brick1) died before establishing connection
[2015-01-26 15:34:00.72674] I [monitor(monitor):163:monitor] Monitor: ------------------------------------------------------------
[2015-01-26 15:34:00.72926] I [monitor(monitor):164:monitor] Monitor: starting gsyncd worker
[2015-01-26 15:34:00.169071] D [gsyncd(agent):627:main_i] <top>: rpc_fd: '7,10,9,8'
[2015-01-26 15:34:00.169931] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-01-26 15:34:00.170466] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-01-26 15:34:00.170526] I [monitor(monitor):214:monitor] Monitor: worker(/master_brick1) died before establishing connection
[2015-01-26 15:34:00.170938] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-01-26 15:34:10.183361] I [monitor(monitor):163:monitor] Monitor: ------------------------------------------------------------
[2015-01-26 15:34:10.183614] I [monitor(monitor):164:monitor] Monitor: starting gsyncd worker
[2015-01-26 15:34:10.278914] D [monitor(monitor):217:monitor] Monitor: worker(/master_brick1) connected
[2015-01-26 15:34:10.279994] I [monitor(monitor):222:monitor] Monitor: worker(/master_brick1) died in startup phase
[2015-01-26 15:34:10.282217] D [gsyncd(agent):627:main_i] <top>: rpc_fd: '7,10,9,8'
[2015-01-26 15:34:10.282943] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-01-26 15:34:10.283098] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-01-26 15:34:10.283303] I [syncdutils(agent):214:finalize] <top>: exiting.

Comment 1 Pierre-Marie JANVRE 2015-01-27 11:23:00 UTC
Just upgraded all nodes to version 3.6.2. The same issue still occurs.

Comment 2 Aravinda VK 2015-03-26 09:02:42 UTC
Please upload the strace output obtained with the following command:

strace -s 500 -f -p <MONITOR_PID> -o /tmp/strace_output.txt

To get the monitor PID: `ps -ax | grep gsyncd | grep monitor`

Run strace for some time, at least until the log records the worker exiting 2-3 times.
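For example, the two steps can be combined roughly like this (the PID extraction shown is just one possible way):

MONITOR_PID=$(ps -ax | grep gsyncd | grep monitor | grep -v grep | awk '{print $1}')
strace -s 500 -f -p "$MONITOR_PID" -o /tmp/strace_output.txt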

Comment 3 Pierre-Marie JANVRE 2015-03-30 07:14:57 UTC
Created attachment 1008243 [details]
Strace output as requested

As requested, the output of the strace command is attached.

Comment 4 Aravinda VK 2016-08-19 11:32:19 UTC
GlusterFS-3.6 is nearing its End-Of-Life; only important security bugs still have a chance of getting fixed. Moving this to the mainline 'version'. If this needs to be fixed in 3.7 or 3.8, this bug should be cloned.

Comment 5 Kotresh HR 2019-05-21 03:55:15 UTC
The issue is no longer seen in the latest releases, hence closing it. Please re-open the issue if it happens again and upload the geo-rep logs.
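For reference, the geo-rep logs are typically found under /var/log/glusterfs/geo-replication/ on the master nodes (the exact path may vary by version and distribution).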

