Description of problem:
geo-replication passive faulty

Version-Release number of selected component (if applicable):
glusterfs-3.6.3

How reproducible:
didn't try to reproduce

Steps to Reproduce:
1. create glusterfs distributed-replication
2. create glusterfs geo-replication
3.

Actual results:

Expected results:

Additional info:

ssh%3A%2F%2Fgeorepuser1%4052.74.184.17%3Agluster%3A%2F%2F127.0.0.1%3Acnprddrnas.log
[2015-08-31 00:31:18.726161] E [syncdutils(/estore_disk02):246:log_raise_exception] <top>: connection to peer is broken
[2015-08-31 00:31:18.726675] I [syncdutils(/estore_disk02):214:finalize] <top>: exiting.
[2015-08-31 00:31:18.728502] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-08-31 00:31:18.729090] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-08-31 00:31:18.801343] I [monitor(monitor):141:set_state] Monitor: new state: faulty
[2015-08-31 00:31:48.324974] I [monitor(monitor):215:monitor] Monitor: ------------------------------------------------------------
[2015-08-31 00:31:48.325562] I [monitor(monitor):216:monitor] Monitor: starting gsyncd worker
[2015-08-31 00:31:48.540290] I [gsyncd(/estore_disk02):633:main_i] <top>: syncing: gluster://localhost:cnprdnas -> ssh://georepuser1@sg1-cndr-fst01:gluster://localhost:cnprddrnas
[2015-08-31 00:31:48.542256] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-08-31 00:32:17.621978] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:32:48.387411] I [monitor(monitor):281:monitor] Monitor: worker(/estore_disk02) not confirmed in 60 sec, aborting it
[2015-08-31 00:32:48.389297] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-08-31 00:32:48.389467] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-08-31 00:32:59.851869] I [monitor(monitor):215:monitor] Monitor: ------------------------------------------------------------
[2015-08-31 00:32:59.852358] I [monitor(monitor):216:monitor] Monitor: starting gsyncd worker
[2015-08-31 00:32:59.914304] I [gsyncd(/estore_disk02):633:main_i] <top>: syncing: gluster://localhost:cnprdnas -> ssh://georepuser1@sg1-cndr-fst01:gluster://localhost:cnprddrnas
[2015-08-31 00:32:59.915914] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-08-31 00:33:17.995424] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:33:59.911894] I [monitor(monitor):281:monitor] Monitor: worker(/estore_disk02) not confirmed in 60 sec, aborting it
[2015-08-31 00:33:59.914036] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-08-31 00:33:59.914247] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-08-31 00:34:18.636343] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:34:57.938266] I [monitor(monitor):215:monitor] Monitor: ------------------------------------------------------------
[2015-08-31 00:34:57.938764] I [monitor(monitor):216:monitor] Monitor: starting gsyncd worker
[2015-08-31 00:34:58.2572] I [gsyncd(/estore_disk02):633:main_i] <top>: syncing: gluster://localhost:cnprdnas -> ssh://georepuser1@sg1-cndr-fst01:gluster://localhost:cnprddrnas
[2015-08-31 00:34:58.3433] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-08-31 00:35:19.155519] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:35:58.533] I [monitor(monitor):281:monitor] Monitor: worker(/estore_disk02) not confirmed in 60 sec, aborting it
[2015-08-31 00:35:58.2600] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-08-31 00:35:58.2794] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-08-31 00:36:08.497905] I [monitor(monitor):215:monitor] Monitor: ------------------------------------------------------------
[2015-08-31 00:36:08.498428] I [monitor(monitor):216:monitor] Monitor: starting gsyncd worker
[2015-08-31 00:36:08.559051] I [gsyncd(/estore_disk02):633:main_i] <top>: syncing: gluster://localhost:cnprdnas -> ssh://georepuser1@sg1-cndr-fst01:gluster://localhost:cnprddrnas
[2015-08-31 00:36:08.559923] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-08-31 00:36:19.614520] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:37:08.559808] I [monitor(monitor):281:monitor] Monitor: worker(/estore_disk02) not confirmed in 60 sec, aborting it
[2015-08-31 00:37:08.561731] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-08-31 00:37:08.561908] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-08-31 00:37:20.161235] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:37:34.803252] I [monitor(monitor):215:monitor] Monitor: ------------------------------------------------------------
[2015-08-31 00:37:34.803802] I [monitor(monitor):216:monitor] Monitor: starting gsyncd worker
[2015-08-31 00:37:34.866637] I [gsyncd(/estore_disk02):633:main_i] <top>: syncing: gluster://localhost:cnprdnas -> ssh://georepuser1@sg1-cndr-fst01:gluster://localhost:cnprddrnas
[2015-08-31 00:37:34.867227] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-08-31 00:38:20.776204] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:38:34.865513] I [monitor(monitor):281:monitor] Monitor: worker(/estore_disk02) not confirmed in 60 sec, aborting it
[2015-08-31 00:38:34.867564] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-08-31 00:38:34.867752] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-08-31 00:39:21.417596] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:39:37.110416] I [monitor(monitor):215:monitor] Monitor: ------------------------------------------------------------
[2015-08-31 00:39:37.110945] I [monitor(monitor):216:monitor] Monitor: starting gsyncd worker
[2015-08-31 00:39:37.173197] I [gsyncd(/estore_disk02):633:main_i] <top>: syncing: gluster://localhost:cnprdnas -> ssh://georepuser1@sg1-cndr-fst01:gluster://localhost:cnprddrnas
[2015-08-31 00:39:37.173186] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-08-31 00:40:21.983525] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:40:37.118439] I [monitor(monitor):281:monitor] Monitor: worker(/estore_disk02) not confirmed in 60 sec, aborting it
[2015-08-31 00:40:37.120224] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-08-31 00:40:37.120392] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-08-31 00:40:48.605340] I [monitor(monitor):215:monitor] Monitor: ------------------------------------------------------------
[2015-08-31 00:40:48.605844] I [monitor(monitor):216:monitor] Monitor: starting gsyncd worker
[2015-08-31 00:40:48.668005] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-08-31 00:40:48.668332] I [gsyncd(/estore_disk02):633:main_i] <top>: syncing: gluster://localhost:cnprdnas -> ssh://georepuser1@sg1-cndr-fst01:gluster://localhost:cnprddrnas
[2015-08-31 00:41:22.499874] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:41:48.667587] I [monitor(monitor):281:monitor] Monitor: worker(/estore_disk02) not confirmed in 60 sec, aborting it
[2015-08-31 00:41:48.669570] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-08-31 00:41:48.669785] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-08-31 00:42:13.974661] I [monitor(monitor):215:monitor] Monitor: ------------------------------------------------------------
[2015-08-31 00:42:13.975204] I [monitor(monitor):216:monitor] Monitor: starting gsyncd worker
[2015-08-31 00:42:14.37354] I [gsyncd(/estore_disk02):633:main_i] <top>: syncing: gluster://localhost:cnprdnas -> ssh://georepuser1@sg1-cndr-fst01:gluster://localhost:cnprddrnas
[2015-08-31 00:42:14.37564] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-08-31 00:42:22.983730] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:43:14.36992] I [monitor(monitor):281:monitor] Monitor: worker(/estore_disk02) not confirmed in 60 sec, aborting it
[2015-08-31 00:43:14.38670] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-08-31 00:43:14.38840] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-08-31 00:43:23.795301] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:44:24.395556] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:45:24.291826] E [resource(monitor):221:errlog] Popen: command "gluster --xml --remote-host=sg1-cndr-fst01 volume status cnprddrnas detail" returned with 146
[2015-08-31 00:45:24.292504] I [syncdutils(monitor):214:finalize] <top>: exiting.

ssh%3A%2F%2Fgeorepuser1%4052.74.184.17%3Agluster%3A%2F%2F127.0.0.1%3Acnprddrnas.%2Festore_disk02.gluster.log
[2015-08-31 00:31:18.728293] I [fuse-bridge.c:4921:fuse_thread_proc] 0-fuse: unmounting /tmp/gsyncd-aux-mount-BehNtW
[2015-08-31 00:31:18.797937] W [glusterfsd.c:1194:cleanup_and_exit] (--> 0-: received signum (15), shutting down
[2015-08-31 00:31:18.797957] I [fuse-bridge.c:5599:fini] 0-fuse: Unmounting '/tmp/gsyncd-aux-mount-BehNtW'.

file attach sosreport, glusterfs log
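For anyone triaging a log like the one above, a small script can count how often the monitor aborted each worker. This is only a rough sketch: `ABORT_RE` and `count_worker_aborts` are hypothetical names made up for this comment, and the pattern assumes the message format seen in this report (glusterfs-3.6.3), which may differ in other releases.

```python
import re
from collections import Counter

# Matches monitor lines such as:
# [2015-08-31 00:32:48.387411] I [monitor(monitor):281:monitor] Monitor: worker(/estore_disk02) not confirmed in 60 sec, aborting it
# Assumption: this wording is taken from the log in this report only.
ABORT_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+\] "
    r"I \[monitor\(monitor\):\d+:monitor\] Monitor: "
    r"worker\((?P<brick>[^)]+)\) not confirmed in (?P<secs>\d+) sec"
)


def count_worker_aborts(log_text):
    """Return a Counter mapping brick path to the number of monitor aborts."""
    return Counter(m.group("brick") for m in ABORT_RE.finditer(log_text))


# Three lines copied from the log in this report.
SAMPLE = (
    "[2015-08-31 00:32:48.387411] I [monitor(monitor):281:monitor] Monitor: "
    "worker(/estore_disk02) not confirmed in 60 sec, aborting it\n"
    "[2015-08-31 00:33:17.995424] I [master(/estore_disk01):514:crawlwrap] "
    "_GMaster: 0 crawls, 0 turns\n"
    "[2015-08-31 00:33:59.911894] I [monitor(monitor):281:monitor] Monitor: "
    "worker(/estore_disk02) not confirmed in 60 sec, aborting it\n"
)

if __name__ == "__main__":
    for brick, aborts in count_worker_aborts(SAMPLE).items():
        print(brick, aborts)
```

In the excerpt above only the /estore_disk02 worker is ever aborted, while the /estore_disk01 worker keeps logging crawl status, so a per-brick count helps narrow the problem to one brick.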
Hi, Venky.

The configuration is as below.

--------------------------------------------------------------
mount volume
server1: CN1-PRD-FS01    CN2-PRD-FS01  ==> replicated volume0
              |               |
         distributed     distributed
              |               |
server2: CN1-PRD-FS02    CN2-PRD-FS02  ==> replicated volume1
              |               |
              |               |
         georeplicate cndr

We restarted geo-replication and the problem went away, but we want to know what caused the session to fall into the faulty state. Thanks in advance.
We hit the same error again. The session fell into faulty status, and we restarted it with the force option. Please let us know the cause as soon as possible. Thanks.
GlusterFS-3.6 is nearing its End-Of-Life; only important security bugs still stand a chance of getting fixed. Moving this to the mainline 'version'. If this needs to get fixed in 3.7 or 3.8, this bug should get cloned.
REVIEW: https://review.gluster.org/16997 (geo-rep: Improve worker log messages) posted (#2) for review on master by Kotresh HR (khiremat)
REVIEW: https://review.gluster.org/16997 (geo-rep: Improve worker log messages) posted (#3) for review on master by Kotresh HR (khiremat)
REVIEW: https://review.gluster.org/16997 (geo-rep: Improve worker log messages) posted (#4) for review on master by Kotresh HR (khiremat)
COMMIT: https://review.gluster.org/16997 committed in master by Aravinda VK (avishwan)
------
commit e01025973c73e2bd0eda8cfed22b75617305d740
Author: Kotresh HR <khiremat>
Date: Tue Apr 4 15:39:46 2017 -0400

    geo-rep: Improve worker log messages

    Monitor process expects worker to establish SSH Tunnel
    to slave node and mount master volume locally with in
    60 secs and acknowledge monitor process by closing feedback
    fd. If something goes wrong and worker does not close
    feedback fd with in 60 secs, monitor kills the worker.
    But there was no clue in log message about the actual
    issue. This patch adds log and indicates whether the
    worker is hung during SSH or master mount.

    Change-Id: Id08a12fa6f3bba1d4fe8036728dbc290e6c14c8c
    BUG: 1261689
    Signed-off-by: Kotresh HR <khiremat>
    Reviewed-on: https://review.gluster.org/16997
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Aravinda VK <avishwan>
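The feedback-fd handshake described in the commit message can be illustrated with a short sketch. This is not the gsyncd code: `run_worker_with_timeout` and the two example workers are hypothetical names, and the sketch only assumes the general pattern of a pipe whose write end the worker closes once its setup (SSH tunnel, master mount) is done.

```python
import os
import select
import signal
import time


def run_worker_with_timeout(worker_fn, timeout=60):
    """Fork a worker and wait for it to close the feedback fd.

    Returns True if the worker closed the fd within `timeout` seconds,
    otherwise kills the worker and returns False.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child (worker): keep only the write end, the "feedback fd".
        os.close(read_fd)
        try:
            worker_fn(write_fd)
        finally:
            os._exit(0)

    # Parent (monitor): drop its copy of the write end so that the read end
    # reaches EOF as soon as the worker closes the feedback fd.
    os.close(write_fd)
    try:
        ready, _, _ = select.select([read_fd], [], [], timeout)
        confirmed = bool(ready)
    finally:
        os.close(read_fd)

    if not confirmed:
        # Worker did not confirm in time: abort it.
        os.kill(pid, signal.SIGKILL)
    # The sketch reaps the child here; a real monitor would keep a
    # confirmed worker running instead of waiting for it to exit.
    os.waitpid(pid, 0)
    return confirmed


def healthy_worker(feedback_fd):
    # Stands in for "SSH tunnel and mount succeeded".
    os.close(feedback_fd)


def hung_worker(feedback_fd):
    # Stands in for a worker stuck during SSH or mount.
    time.sleep(5)


if __name__ == "__main__":
    print(run_worker_with_timeout(healthy_worker, timeout=2))
    print(run_worker_with_timeout(hung_worker, timeout=0.2))
```

One caveat of this simplified version: a worker that exits without closing the fd also produces EOF on the pipe, so a monitor built this way still needs to check whether the worker process is alive after the wait returns.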
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.11.0, please open a new bug report.

glusterfs-3.11.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-May/000073.html
[2] https://www.gluster.org/pipermail/gluster-users/