Bug 1261689 - geo-replication faulty
Summary: geo-replication faulty
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: geo-replication
Version: mainline
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kotresh HR
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1435587
 
Reported: 2015-09-10 00:45 UTC by ikhan.kim
Modified: 2017-05-30 18:32 UTC (History)
4 users

Fixed In Version: glusterfs-3.11.0
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1435587 (view as bug list)
Environment:
Last Closed: 2017-05-30 18:32:08 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments

Description ikhan.kim 2015-09-10 00:45:29 UTC
Description of problem:
The geo-replication passive session goes into the faulty state.

Version-Release number of selected component (if applicable):
glusterfs-3.6.3

How reproducible:
Did not try to reproduce.

Steps to Reproduce:
1. Create a GlusterFS distributed-replicated volume.
2. Set up GlusterFS geo-replication.
3.

Actual results:


Expected results:


Additional info:
ssh://georepuser1@52.74.184.17:gluster://127.0.0.1:cnprddrnas.log
[2015-08-31 00:31:18.726161] E [syncdutils(/estore_disk02):246:log_raise_exception] <top>: connection to peer is broken
[2015-08-31 00:31:18.726675] I [syncdutils(/estore_disk02):214:finalize] <top>: exiting.
[2015-08-31 00:31:18.728502] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-08-31 00:31:18.729090] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-08-31 00:31:18.801343] I [monitor(monitor):141:set_state] Monitor: new state: faulty
[2015-08-31 00:31:48.324974] I [monitor(monitor):215:monitor] Monitor: ------------------------------------------------------------
[2015-08-31 00:31:48.325562] I [monitor(monitor):216:monitor] Monitor: starting gsyncd worker
[2015-08-31 00:31:48.540290] I [gsyncd(/estore_disk02):633:main_i] <top>: syncing: gluster://localhost:cnprdnas -> ssh://georepuser1@sg1-cndr-fst01:gluster://localhost:cnprddrnas
[2015-08-31 00:31:48.542256] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-08-31 00:32:17.621978] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:32:48.387411] I [monitor(monitor):281:monitor] Monitor: worker(/estore_disk02) not confirmed in 60 sec, aborting it
[2015-08-31 00:32:48.389297] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-08-31 00:32:48.389467] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-08-31 00:32:59.851869] I [monitor(monitor):215:monitor] Monitor: ------------------------------------------------------------
[2015-08-31 00:32:59.852358] I [monitor(monitor):216:monitor] Monitor: starting gsyncd worker
[2015-08-31 00:32:59.914304] I [gsyncd(/estore_disk02):633:main_i] <top>: syncing: gluster://localhost:cnprdnas -> ssh://georepuser1@sg1-cndr-fst01:gluster://localhost:cnprddrnas
[2015-08-31 00:32:59.915914] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-08-31 00:33:17.995424] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:33:59.911894] I [monitor(monitor):281:monitor] Monitor: worker(/estore_disk02) not confirmed in 60 sec, aborting it
[2015-08-31 00:33:59.914036] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-08-31 00:33:59.914247] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-08-31 00:34:18.636343] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:34:57.938266] I [monitor(monitor):215:monitor] Monitor: ------------------------------------------------------------
[2015-08-31 00:34:57.938764] I [monitor(monitor):216:monitor] Monitor: starting gsyncd worker
[2015-08-31 00:34:58.2572] I [gsyncd(/estore_disk02):633:main_i] <top>: syncing: gluster://localhost:cnprdnas -> ssh://georepuser1@sg1-cndr-fst01:gluster://localhost:cnprddrnas
[2015-08-31 00:34:58.3433] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-08-31 00:35:19.155519] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:35:58.533] I [monitor(monitor):281:monitor] Monitor: worker(/estore_disk02) not confirmed in 60 sec, aborting it
[2015-08-31 00:35:58.2600] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-08-31 00:35:58.2794] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-08-31 00:36:08.497905] I [monitor(monitor):215:monitor] Monitor: ------------------------------------------------------------
[2015-08-31 00:36:08.498428] I [monitor(monitor):216:monitor] Monitor: starting gsyncd worker
[2015-08-31 00:36:08.559051] I [gsyncd(/estore_disk02):633:main_i] <top>: syncing: gluster://localhost:cnprdnas -> ssh://georepuser1@sg1-cndr-fst01:gluster://localhost:cnprddrnas
[2015-08-31 00:36:08.559923] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-08-31 00:36:19.614520] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:37:08.559808] I [monitor(monitor):281:monitor] Monitor: worker(/estore_disk02) not confirmed in 60 sec, aborting it
[2015-08-31 00:37:08.561731] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-08-31 00:37:08.561908] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-08-31 00:37:20.161235] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:37:34.803252] I [monitor(monitor):215:monitor] Monitor: ------------------------------------------------------------
[2015-08-31 00:37:34.803802] I [monitor(monitor):216:monitor] Monitor: starting gsyncd worker
[2015-08-31 00:37:34.866637] I [gsyncd(/estore_disk02):633:main_i] <top>: syncing: gluster://localhost:cnprdnas -> ssh://georepuser1@sg1-cndr-fst01:gluster://localhost:cnprddrnas
[2015-08-31 00:37:34.867227] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-08-31 00:38:20.776204] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:38:34.865513] I [monitor(monitor):281:monitor] Monitor: worker(/estore_disk02) not confirmed in 60 sec, aborting it
[2015-08-31 00:38:34.867564] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-08-31 00:38:34.867752] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-08-31 00:39:21.417596] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:39:37.110416] I [monitor(monitor):215:monitor] Monitor: ------------------------------------------------------------
[2015-08-31 00:39:37.110945] I [monitor(monitor):216:monitor] Monitor: starting gsyncd worker
[2015-08-31 00:39:37.173197] I [gsyncd(/estore_disk02):633:main_i] <top>: syncing: gluster://localhost:cnprdnas -> ssh://georepuser1@sg1-cndr-fst01:gluster://localhost:cnprddrnas
[2015-08-31 00:39:37.173186] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-08-31 00:40:21.983525] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:40:37.118439] I [monitor(monitor):281:monitor] Monitor: worker(/estore_disk02) not confirmed in 60 sec, aborting it
[2015-08-31 00:40:37.120224] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-08-31 00:40:37.120392] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-08-31 00:40:48.605340] I [monitor(monitor):215:monitor] Monitor: ------------------------------------------------------------
[2015-08-31 00:40:48.605844] I [monitor(monitor):216:monitor] Monitor: starting gsyncd worker
[2015-08-31 00:40:48.668005] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-08-31 00:40:48.668332] I [gsyncd(/estore_disk02):633:main_i] <top>: syncing: gluster://localhost:cnprdnas -> ssh://georepuser1@sg1-cndr-fst01:gluster://localhost:cnprddrnas
[2015-08-31 00:41:22.499874] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:41:48.667587] I [monitor(monitor):281:monitor] Monitor: worker(/estore_disk02) not confirmed in 60 sec, aborting it
[2015-08-31 00:41:48.669570] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-08-31 00:41:48.669785] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-08-31 00:42:13.974661] I [monitor(monitor):215:monitor] Monitor: ------------------------------------------------------------
[2015-08-31 00:42:13.975204] I [monitor(monitor):216:monitor] Monitor: starting gsyncd worker
[2015-08-31 00:42:14.37354] I [gsyncd(/estore_disk02):633:main_i] <top>: syncing: gluster://localhost:cnprdnas -> ssh://georepuser1@sg1-cndr-fst01:gluster://localhost:cnprddrnas
[2015-08-31 00:42:14.37564] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-08-31 00:42:22.983730] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:43:14.36992] I [monitor(monitor):281:monitor] Monitor: worker(/estore_disk02) not confirmed in 60 sec, aborting it
[2015-08-31 00:43:14.38670] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-08-31 00:43:14.38840] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-08-31 00:43:23.795301] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:44:24.395556] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:45:24.291826] E [resource(monitor):221:errlog] Popen: command "gluster --xml --remote-host=sg1-cndr-fst01 volume status cnprddrnas detail" returned with 146
[2015-08-31 00:45:24.292504] I [syncdutils(monitor):214:finalize] <top>: exiting.

ssh://georepuser1@52.74.184.17:gluster://127.0.0.1:cnprddrnas./estore_disk02.gluster.log
[2015-08-31 00:31:18.728293] I [fuse-bridge.c:4921:fuse_thread_proc] 0-fuse: unmounting /tmp/gsyncd-aux-mount-BehNtW
[2015-08-31 00:31:18.797937] W [glusterfsd.c:1194:cleanup_and_exit] (--> 0-: received signum (15), shutting down
[2015-08-31 00:31:18.797957] I [fuse-bridge.c:5599:fini] 0-fuse: Unmounting '/tmp/gsyncd-aux-mount-BehNtW'.

Attached: sosreport and glusterfs logs.

Comment 1 hojin kim 2015-09-16 02:22:36 UTC
Hi, Venky. The configuration is as below:

--------------------------------------------------------------
mount volume

server1: CN1-PRD-FS01    CN2-PRD-FS01  ==> replicated volume0
               |               |
          distributed     distributed
               |               |
server2: CN1-PRD-FS02    CN2-PRD-FS02  ==> replicated volume1
               |
         geo-replicate
               |
             cndr

We restarted geo-replication and the problem was resolved,
but we want to know what caused it to fall into the faulty state.
Thanks in advance.

Comment 2 hojin kim 2015-09-17 07:08:52 UTC
We hit the same error again. It fell into the faulty state, and we restarted it with the force option.
Please let us know the cause as soon as possible. Thanks.

Comment 3 Aravinda VK 2016-08-19 11:30:43 UTC
GlusterFS-3.6 is nearing its End-Of-Life; only important security bugs still have a chance of getting fixed. Moving this to the mainline 'version'. If this needs to be fixed in 3.7 or 3.8, this bug should be cloned.

Comment 4 Worker Ant 2017-04-05 06:42:00 UTC
REVIEW: https://review.gluster.org/16997 (geo-rep: Improve worker log messages) posted (#2) for review on master by Kotresh HR (khiremat)

Comment 5 Worker Ant 2017-04-05 06:51:01 UTC
REVIEW: https://review.gluster.org/16997 (geo-rep: Improve worker log messages) posted (#3) for review on master by Kotresh HR (khiremat)

Comment 6 Worker Ant 2017-04-05 08:54:27 UTC
REVIEW: https://review.gluster.org/16997 (geo-rep: Improve worker log messages) posted (#4) for review on master by Kotresh HR (khiremat)

Comment 7 Worker Ant 2017-04-07 06:09:37 UTC
COMMIT: https://review.gluster.org/16997 committed in master by Aravinda VK (avishwan) 
------
commit e01025973c73e2bd0eda8cfed22b75617305d740
Author: Kotresh HR <khiremat>
Date:   Tue Apr 4 15:39:46 2017 -0400

    geo-rep: Improve worker log messages
    
    The monitor process expects the worker to establish an SSH tunnel to
    the slave node and mount the master volume locally within 60 secs,
    and to acknowledge the monitor process by closing the feedback fd.
    If something goes wrong and the worker does not close the feedback
    fd within 60 secs, the monitor kills the worker. But there was no
    clue in the log message about the actual issue. This patch adds logs
    indicating whether the worker hung during SSH setup or during the
    master mount.
    
    Change-Id: Id08a12fa6f3bba1d4fe8036728dbc290e6c14c8c
    BUG: 1261689
    Signed-off-by: Kotresh HR <khiremat>
    Reviewed-on: https://review.gluster.org/16997
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Aravinda VK <avishwan>
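The handshake the commit message describes — the worker acknowledges the monitor by closing a feedback fd within a timeout, and the monitor aborts the worker otherwise — can be sketched in Python. This is a simplified, hypothetical illustration of the pattern, not the actual gsyncd code; `monitor_worker`, `startup`, and the return values are made-up names:

```python
import os
import select
import time

def monitor_worker(startup, timeout=60):
    """Spawn a worker and wait for it to acknowledge readiness by
    closing the write end of a feedback pipe (hypothetical sketch of
    the monitor/worker handshake; not the real gsyncd implementation)."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:                 # worker process
        os.close(r)
        try:
            startup()            # e.g. set up SSH tunnel, mount master volume
            os.close(w)          # closing the feedback fd means "confirmed"
            time.sleep(0.2)      # keep running briefly after confirming
        finally:
            os._exit(0)
    os.close(w)                  # monitor keeps only the read end
    # The read end becomes readable (EOF) once the worker closes its
    # copy of the write end; select() times out if that never happens.
    ready, _, _ = select.select([r], [], [], timeout)
    os.close(r)
    if not ready:
        os.kill(pid, 15)         # "worker not confirmed in N sec, aborting it"
        os.waitpid(pid, 0)
        return "faulty"
    os.waitpid(pid, 0)
    return "confirmed"
```

A worker whose startup hangs (as in the logs above) never closes the feedback fd, so the monitor's select() times out and the worker is killed, putting the session in the faulty state.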

Comment 8 Shyamsundar 2017-05-30 18:32:08 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.11.0, please open a new bug report.

glusterfs-3.11.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-May/000073.html
[2] https://www.gluster.org/pipermail/gluster-users/

