+++ This bug was initially created as a clone of Bug #1261689 +++

Description of problem:
geo-replication passive node goes faulty

Version-Release number of selected component (if applicable):
glusterfs-3.6.3

How reproducible:
didn't try to reproduce

Steps to Reproduce:
1. Create a glusterfs distributed-replicated volume.
2. Create a glusterfs geo-replication session.
3.

Actual results:

Expected results:

Additional info:

ssh%3A%2F%2Fgeorepuser1%4052.74.184.17%3Agluster%3A%2F%2F127.0.0.1%3Acnprddrnas.log:

[2015-08-31 00:31:18.726161] E [syncdutils(/estore_disk02):246:log_raise_exception] <top>: connection to peer is broken
[2015-08-31 00:31:18.726675] I [syncdutils(/estore_disk02):214:finalize] <top>: exiting.
[2015-08-31 00:31:18.728502] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-08-31 00:31:18.729090] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-08-31 00:31:18.801343] I [monitor(monitor):141:set_state] Monitor: new state: faulty
[2015-08-31 00:31:48.324974] I [monitor(monitor):215:monitor] Monitor: ------------------------------------------------------------
[2015-08-31 00:31:48.325562] I [monitor(monitor):216:monitor] Monitor: starting gsyncd worker
[2015-08-31 00:31:48.540290] I [gsyncd(/estore_disk02):633:main_i] <top>: syncing: gluster://localhost:cnprdnas -> ssh://georepuser1@sg1-cndr-fst01:gluster://localhost:cnprddrnas
[2015-08-31 00:31:48.542256] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-08-31 00:32:17.621978] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:32:48.387411] I [monitor(monitor):281:monitor] Monitor: worker(/estore_disk02) not confirmed in 60 sec, aborting it
[2015-08-31 00:32:48.389297] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-08-31 00:32:48.389467] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-08-31 00:32:59.851869] I [monitor(monitor):215:monitor] Monitor: ------------------------------------------------------------
[2015-08-31 00:32:59.852358] I [monitor(monitor):216:monitor] Monitor: starting gsyncd worker
[2015-08-31 00:32:59.914304] I [gsyncd(/estore_disk02):633:main_i] <top>: syncing: gluster://localhost:cnprdnas -> ssh://georepuser1@sg1-cndr-fst01:gluster://localhost:cnprddrnas
[2015-08-31 00:32:59.915914] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-08-31 00:33:17.995424] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:33:59.911894] I [monitor(monitor):281:monitor] Monitor: worker(/estore_disk02) not confirmed in 60 sec, aborting it
[2015-08-31 00:33:59.914036] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-08-31 00:33:59.914247] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-08-31 00:34:18.636343] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:34:57.938266] I [monitor(monitor):215:monitor] Monitor: ------------------------------------------------------------
[2015-08-31 00:34:57.938764] I [monitor(monitor):216:monitor] Monitor: starting gsyncd worker
[2015-08-31 00:34:58.2572] I [gsyncd(/estore_disk02):633:main_i] <top>: syncing: gluster://localhost:cnprdnas -> ssh://georepuser1@sg1-cndr-fst01:gluster://localhost:cnprddrnas
[2015-08-31 00:34:58.3433] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-08-31 00:35:19.155519] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:35:58.533] I [monitor(monitor):281:monitor] Monitor: worker(/estore_disk02) not confirmed in 60 sec, aborting it
[2015-08-31 00:35:58.2600] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-08-31 00:35:58.2794] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-08-31 00:36:08.497905] I [monitor(monitor):215:monitor] Monitor: ------------------------------------------------------------
[2015-08-31 00:36:08.498428] I [monitor(monitor):216:monitor] Monitor: starting gsyncd worker
[2015-08-31 00:36:08.559051] I [gsyncd(/estore_disk02):633:main_i] <top>: syncing: gluster://localhost:cnprdnas -> ssh://georepuser1@sg1-cndr-fst01:gluster://localhost:cnprddrnas
[2015-08-31 00:36:08.559923] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-08-31 00:36:19.614520] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:37:08.559808] I [monitor(monitor):281:monitor] Monitor: worker(/estore_disk02) not confirmed in 60 sec, aborting it
[2015-08-31 00:37:08.561731] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-08-31 00:37:08.561908] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-08-31 00:37:20.161235] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:37:34.803252] I [monitor(monitor):215:monitor] Monitor: ------------------------------------------------------------
[2015-08-31 00:37:34.803802] I [monitor(monitor):216:monitor] Monitor: starting gsyncd worker
[2015-08-31 00:37:34.866637] I [gsyncd(/estore_disk02):633:main_i] <top>: syncing: gluster://localhost:cnprdnas -> ssh://georepuser1@sg1-cndr-fst01:gluster://localhost:cnprddrnas
[2015-08-31 00:37:34.867227] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-08-31 00:38:20.776204] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:38:34.865513] I [monitor(monitor):281:monitor] Monitor: worker(/estore_disk02) not confirmed in 60 sec, aborting it
[2015-08-31 00:38:34.867564] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-08-31 00:38:34.867752] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-08-31 00:39:21.417596] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:39:37.110416] I [monitor(monitor):215:monitor] Monitor: ------------------------------------------------------------
[2015-08-31 00:39:37.110945] I [monitor(monitor):216:monitor] Monitor: starting gsyncd worker
[2015-08-31 00:39:37.173197] I [gsyncd(/estore_disk02):633:main_i] <top>: syncing: gluster://localhost:cnprdnas -> ssh://georepuser1@sg1-cndr-fst01:gluster://localhost:cnprddrnas
[2015-08-31 00:39:37.173186] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-08-31 00:40:21.983525] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:40:37.118439] I [monitor(monitor):281:monitor] Monitor: worker(/estore_disk02) not confirmed in 60 sec, aborting it
[2015-08-31 00:40:37.120224] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-08-31 00:40:37.120392] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-08-31 00:40:48.605340] I [monitor(monitor):215:monitor] Monitor: ------------------------------------------------------------
[2015-08-31 00:40:48.605844] I [monitor(monitor):216:monitor] Monitor: starting gsyncd worker
[2015-08-31 00:40:48.668005] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-08-31 00:40:48.668332] I [gsyncd(/estore_disk02):633:main_i] <top>: syncing: gluster://localhost:cnprdnas -> ssh://georepuser1@sg1-cndr-fst01:gluster://localhost:cnprddrnas
[2015-08-31 00:41:22.499874] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:41:48.667587] I [monitor(monitor):281:monitor] Monitor: worker(/estore_disk02) not confirmed in 60 sec, aborting it
[2015-08-31 00:41:48.669570] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-08-31 00:41:48.669785] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-08-31 00:42:13.974661] I [monitor(monitor):215:monitor] Monitor: ------------------------------------------------------------
[2015-08-31 00:42:13.975204] I [monitor(monitor):216:monitor] Monitor: starting gsyncd worker
[2015-08-31 00:42:14.37354] I [gsyncd(/estore_disk02):633:main_i] <top>: syncing: gluster://localhost:cnprdnas -> ssh://georepuser1@sg1-cndr-fst01:gluster://localhost:cnprddrnas
[2015-08-31 00:42:14.37564] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-08-31 00:42:22.983730] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:43:14.36992] I [monitor(monitor):281:monitor] Monitor: worker(/estore_disk02) not confirmed in 60 sec, aborting it
[2015-08-31 00:43:14.38670] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-08-31 00:43:14.38840] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-08-31 00:43:23.795301] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:44:24.395556] I [master(/estore_disk01):514:crawlwrap] _GMaster: 0 crawls, 0 turns
[2015-08-31 00:45:24.291826] E [resource(monitor):221:errlog] Popen: command "gluster --xml --remote-host=sg1-cndr-fst01 volume status cnprddrnas detail" returned with 146
[2015-08-31 00:45:24.292504] I [syncdutils(monitor):214:finalize] <top>: exiting.

ssh%3A%2F%2Fgeorepuser1%4052.74.184.17%3Agluster%3A%2F%2F127.0.0.1%3Acnprddrnas.%2Festore_disk02.gluster.log:

[2015-08-31 00:31:18.728293] I [fuse-bridge.c:4921:fuse_thread_proc] 0-fuse: unmounting /tmp/gsyncd-aux-mount-BehNtW
[2015-08-31 00:31:18.797937] W [glusterfsd.c:1194:cleanup_and_exit] (--> 0-: received signum (15), shutting down
[2015-08-31 00:31:18.797957] I [fuse-bridge.c:5599:fini] 0-fuse: Unmounting '/tmp/gsyncd-aux-mount-BehNtW'.

Attached: sosreport and glusterfs logs.

--- Additional comment from hojin kim on 2015-09-15 22:22:36 EDT ---

Hi, Venky. The configuration is as below.
--------------------------------------------------------------
mount volume

server1:   CN1-PRD-FS01      CN2-PRD-FS01    ==> replicated volume0
                |                 |
           distributed       distributed
                |                 |
server2:   CN1-PRD-FS02      CN2-PRD-FS02    ==> replicated volume1
                |                 |
                |                 |
              georeplicate cndr

We restarted geo-replication and the problem was solved, but we would like to know what caused the session to fall into the faulty state. Thanks in advance.

--- Additional comment from hojin kim on 2015-09-17 03:08:52 EDT ---

We met the same error again. It fell into faulty status, and we restarted it with the force option. Please let us know the reason as soon as possible. Thanks.

--- Additional comment from Aravinda VK on 2016-08-19 07:30:43 EDT ---

GlusterFS-3.6 is nearing its End-Of-Life; only important security bugs still have a chance of getting fixed. Moving this to the mainline 'version'. If this needs to be fixed in 3.7 or 3.8, this bug should be cloned.
Created attachment 1266026 [details] Strace of Monitor process.
Hi Riyas,

As discussed, did you get the strace output with the following options? Please provide it to us.

strace -s 500 -T -ttt -f -p <MONITOR_PID> -o /tmp/strace_output.txt

Thanks,
Kotresh
Hi Prashanth,

What was the issue with the slave volume? This issue is a side effect of it. As I shared earlier, the geo-rep master tries to get slave info. If restarting the slave fixes the issue, then for some reason either the glusterd query or the ssh connection from the geo-rep worker is taking more than 60 secs, and that is why the worker was being aborted.
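As a rough illustration of that check, the monitor's slave probe can be modeled as a command run against a deadline. This is a sketch, not gsyncd's actual code: the helper name `slave_reachable` and its return convention are hypothetical, while the `gluster --xml --remote-host=... volume status ... detail` command and the 60-second window come from the logs above.

```python
import subprocess

def slave_reachable(slave_host, slave_vol, deadline=60):
    """Hypothetical model of the geo-rep monitor's slave health probe."""
    cmd = ["gluster", "--xml", "--remote-host=%s" % slave_host,
           "volume", "status", slave_vol, "detail"]
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=deadline)
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # A slow glusterd/ssh path (or no gluster CLI at all) looks the
        # same to the monitor as an unreachable slave.
        return False
    # A non-zero exit, such as the 146 seen in the log above, also means
    # the probe failed.
    return result.returncode == 0
```

If this probe, or the worker's own SSH-and-mount handshake, takes longer than the window, the monitor gives up, which matches the repeated "worker not confirmed in 60 sec, aborting it" entries.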
Upstream Patch: https://review.gluster.org/#/c/16997/
Upstream Patches: https://review.gluster.org/#/c/16997/ (master) It is included in 3.11 as part of the rebase. Downstream Patch: https://code.engineering.redhat.com/gerrit/#/c/106655/
Verified with the build: glusterfs-geo-replication-3.8.4-35.el7rhgs.x86_64

1. The monitor starts the worker and the agent.
2. The worker establishes the SSH connection between master and slave; gsyncd on the slave mounts the slave volume on the local slave host.
3. The worker mounts the master volume locally.
4. Upon successful completion of 2 and 3, it acknowledges back to the monitor.

[2017-06-29 11:33:33.563153] I [monitor(monitor):275:monitor] Monitor: starting gsyncd worker(/rhs/brick2/b4). Slave node: ssh://root.43.150:gluster://localhost:slave
[2017-06-29 11:33:33.588865] I [monitor(monitor):275:monitor] Monitor: starting gsyncd worker(/rhs/brick1/b1). Slave node: ssh://root.43.52:gluster://localhost:slave
[2017-06-29 11:33:33.815712] I [resource(/rhs/brick2/b4):1676:connect_remote] SSH: Initializing SSH connection between master and slave...
[2017-06-29 11:33:33.818607] I [changelogagent(/rhs/brick2/b4):73:__init__] ChangelogAgent: Agent listining...
[2017-06-29 11:33:33.824904] I [resource(/rhs/brick1/b1):1676:connect_remote] SSH: Initializing SSH connection between master and slave...
[2017-06-29 11:33:33.840003] I [changelogagent(/rhs/brick1/b1):73:__init__] ChangelogAgent: Agent listining...
[2017-06-29 11:33:40.65862] I [resource(/rhs/brick2/b4):1683:connect_remote] SSH: SSH connection between master and slave established. Time taken: 6.2496 secs
[2017-06-29 11:33:40.66544] I [resource(/rhs/brick2/b4):1416:connect] GLUSTER: Mounting gluster volume locally...
[2017-06-29 11:33:40.139708] I [resource(/rhs/brick1/b1):1683:connect_remote] SSH: SSH connection between master and slave established. Time taken: 6.3142 secs
[2017-06-29 11:33:40.140361] I [resource(/rhs/brick1/b1):1416:connect] GLUSTER: Mounting gluster volume locally...
[2017-06-29 11:33:45.215692] I [resource(/rhs/brick2/b4):1429:connect] GLUSTER: Mounted gluster volume. Time taken: 5.1487 secs
[2017-06-29 11:33:45.216334] I [gsyncd(/rhs/brick2/b4):778:main_i] <top>: Closing feedback fd, waking up the monitor
[2017-06-29 11:33:45.280860] I [resource(/rhs/brick1/b1):1429:connect] GLUSTER: Mounted gluster volume. Time taken: 5.1400 secs
[2017-06-29 11:33:45.281521] I [gsyncd(/rhs/brick1/b1):778:main_i] <top>: Closing feedback fd, waking up the monitor

In case the failure is in the SSH connection, it fails and logs the errors:

[2017-07-24 11:56:07.717888] I [gsyncd(conf):688:main_i] <top>: Config Set: session-owner = a7039831-0bb5-40ec-a6aa-1e568b0ba516
[2017-07-24 11:56:08.607244] I [gsyncdstatus(monitor):240:set_worker_status] GeorepStatus: Worker Status: Initializing...
[2017-07-24 11:56:08.611022] I [gsyncdstatus(monitor):240:set_worker_status] GeorepStatus: Worker Status: Initializing...
[2017-07-24 11:56:08.625727] I [gsyncdstatus(monitor):240:set_worker_status] GeorepStatus: Worker Status: Initializing...
[2017-07-24 11:56:08.897754] I [monitor(monitor):74:get_slave_bricks_status] <top>: Unable to get list of up nodes of slave, returning empty list: Another transaction is in progress for slave. Please try again after sometime.
[2017-07-24 11:56:08.898608] I [monitor(monitor):275:monitor] Monitor: starting gsyncd worker(/rhs/brick2/b5). Slave node: ssh://root.41.209:gluster://localhost:slave
[2017-07-24 11:56:08.923662] I [monitor(monitor):275:monitor] Monitor: starting gsyncd worker(/rhs/brick1/b1). Slave node: ssh://root.41.209:gluster://localhost:slave
[2017-07-24 11:56:08.939045] I [monitor(monitor):275:monitor] Monitor: starting gsyncd worker(/rhs/brick3/b9). Slave node: ssh://root.41.209:gluster://localhost:slave
[2017-07-24 11:56:09.225071] I [changelogagent(/rhs/brick1/b1):73:__init__] ChangelogAgent: Agent listining...
[2017-07-24 11:56:09.231733] I [resource(/rhs/brick2/b5):1676:connect_remote] SSH: Initializing SSH connection between master and slave...
[2017-07-24 11:56:09.241522] I [changelogagent(/rhs/brick2/b5):73:__init__] ChangelogAgent: Agent listining...
[2017-07-24 11:56:09.261339] I [changelogagent(/rhs/brick3/b9):73:__init__] ChangelogAgent: Agent listining...
[2017-07-24 11:56:09.262459] I [resource(/rhs/brick3/b9):1676:connect_remote] SSH: Initializing SSH connection between master and slave...
[2017-07-24 11:56:09.274967] I [resource(/rhs/brick1/b1):1676:connect_remote] SSH: Initializing SSH connection between master and slave...
[2017-07-24 11:56:14.779021] E [syncdutils(/rhs/brick3/b9):269:log_raise_exception] <top>: connection to peer is broken
[2017-07-24 11:56:14.778943] E [syncdutils(/rhs/brick2/b5):269:log_raise_exception] <top>: connection to peer is broken
[2017-07-24 11:56:14.779907] E [resource(/rhs/brick3/b9):234:errlog] Popen: command "ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -p 22 -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-OPaCBo/aa544c64f039a6d7349da94e0501bc58.sock root.41.209 /nonexistent/gsyncd --session-owner a7039831-0bb5-40ec-a6aa-1e568b0ba516 --local-id .%2Frhs%2Fbrick3%2Fb9 --local-node 10.70.42.79 -N --listen --timeout 120 gluster://localhost:slave" returned with 1, saying:
[2017-07-24 11:56:14.780435] E [resource(/rhs/brick3/b9):238:logerr] Popen: ssh> [2017-07-24 11:56:10.085599] I [cli.c:762:main] 0-cli: Started running /usr/sbin/gluster with version 3.8.4
[2017-07-24 11:56:14.780502] E [resource(/rhs/brick2/b5):234:errlog] Popen: command "ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -p 22 -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-UJB2EP/aa544c64f039a6d7349da94e0501bc58.sock root.41.209 /nonexistent/gsyncd --session-owner a7039831-0bb5-40ec-a6aa-1e568b0ba516 --local-id .%2Frhs%2Fbrick2%2Fb5 --local-node 10.70.42.79 -N --listen --timeout 120 gluster://localhost:slave" returned with 1, saying:
[2017-07-24 11:56:14.780728] E [resource(/rhs/brick3/b9):238:logerr] Popen: ssh> [2017-07-24 11:56:10.085653] I [cli.c:645:cli_rpc_init] 0-cli: Connecting to remote glusterd at localhost
[2017-07-24 11:56:14.780967] E [resource(/rhs/brick3/b9):238:logerr] Popen: ssh> [2017-07-24 11:56:10.298145] I [MSGID: 101190] [event-epoll.c:602:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2017-07-24 11:56:14.780844] E [resource(/rhs/brick2/b5):238:logerr] Popen: ssh> [2017-07-24 11:56:10.082411] I [cli.c:762:main] 0-cli: Started running /usr/sbin/gluster with version 3.8.4
[2017-07-24 11:56:14.781389] E [resource(/rhs/brick2/b5):238:logerr] Popen: ssh> [2017-07-24 11:56:10.082532] I [cli.c:645:cli_rpc_init] 0-cli: Connecting to remote glusterd at localhost
[2017-07-24 11:56:14.781356] E [resource(/rhs/brick3/b9):238:logerr] Popen: ssh> [2017-07-24 11:56:10.300407] I [socket.c:2465:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2017-07-24 11:56:14.781673] E [resource(/rhs/brick2/b5):238:logerr] Popen: ssh> [2017-07-24 11:56:10.300755] I [MSGID: 101190] [event-epoll.c:602:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2017-07-24 11:56:14.781827] E [resource(/rhs/brick3/b9):238:logerr] Popen: ssh> [2017-07-24 11:56:10.302962] I [cli-rpc-ops.c:7020:gf_cli_getwd_cbk] 0-cli: Received resp to getwd
[2017-07-24 11:56:14.782166] E [resource(/rhs/brick2/b5):238:logerr] Popen: ssh> [2017-07-24 11:56:10.300919] I [socket.c:2465:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2017-07-24 11:56:14.782174] E [resource(/rhs/brick3/b9):238:logerr] Popen: ssh> [2017-07-24 11:56:10.303072] I [input.c:31:cli_batch] 0-: Exiting with: 0
[2017-07-24 11:56:14.782412] E [resource(/rhs/brick2/b5):238:logerr] Popen: ssh> [2017-07-24 11:56:10.302254] I [cli-rpc-ops.c:7020:gf_cli_getwd_cbk] 0-cli: Received resp to getwd
[2017-07-24 11:56:14.782664] E [resource(/rhs/brick2/b5):238:logerr] Popen: ssh> [2017-07-24 11:56:10.302529] I [input.c:31:cli_batch] 0-: Exiting with: 0
[2017-07-24 11:56:14.782872] I [syncdutils(/rhs/brick3/b9):237:finalize] <top>: exiting.
[2017-07-24 11:56:14.783211] I [syncdutils(/rhs/brick2/b5):237:finalize] <top>: exiting.

Since the fix consists only of logging improvements, moving the bug to the verified state based on the above logs.
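The worker-confirmation mechanism those verification steps exercise can be sketched in a few lines of Python. This is a simplified illustration, not gsyncd's actual implementation: the name `wait_for_worker` and the toy pipe handshake are assumptions, while the timeout window corresponds to the 60 seconds seen in the logs.

```python
import os
import select

def wait_for_worker(ack_fd, window=60):
    """Simplified model of the monitor waiting on the worker's feedback fd.

    If nothing arrives on the pipe within the window, the monitor gives
    up, matching "worker not confirmed in 60 sec, aborting it".
    """
    readable, _, _ = select.select([ack_fd], [], [], window)
    return bool(readable)  # True: confirmed; False: abort the worker

# Toy run: a "worker" that acknowledges right after its setup, like
# "Closing feedback fd, waking up the monitor" in the logs above.
r, w = os.pipe()
os.write(w, b"ok")
confirmed = wait_for_worker(r, window=1)
os.close(r)
os.close(w)
```

The same structure explains both log patterns: a timely write on the feedback fd yields a healthy worker, while a stalled SSH connection or mount leaves the pipe silent and the monitor aborts and restarts the worker, producing the faulty loop.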
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774