Bug 1435587
| Summary: | [GSS] geo-replication faulty | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Riyas Abdulrasak <rnalakka> |
| Component: | geo-replication | Assignee: | Kotresh HR <khiremat> |
| Status: | CLOSED ERRATA | QA Contact: | Rahul Hinduja <rhinduja> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | rhgs-3.0 | CC: | amukherj, avishwan, bugs, csaba, ihkim, khiremat, khoj, mmuir, pdhange, rallan, rhs-bugs, sarumuga, storage-qa-internal |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | RHGS 3.3.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-3.8.4-26 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1261689 | Environment: | |
| Last Closed: | 2017-09-21 04:35:56 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1261689 | | |
| Bug Blocks: | 1417151 | | |
| Attachments: | Strace of Monitor process (attachment 1266026) | | |
Description Riyas Abdulrasak 2017-03-24 10:10:35 UTC
Created attachment 1266026 [details]
Strace of Monitor process.
Hi Riyas,

As discussed, did you get the strace output with the following options? Please provide us the same.

strace -s 500 -T -ttt -f -p <MONITOR_PID> -o /tmp/strace_output.txt

Thanks,
Kotresh

Hi Prashanth,

What was the issue with the slave volume? This issue is a side effect of that. As I had shared earlier, the geo-rep master tries to get the slave info. If restarting the slave fixes the issue, then for some reason the glusterd and SSH connection from the geo-rep worker is taking more than 60 secs, and that is why the worker was crashing.

Upstream patch: https://review.gluster.org/#/c/16997/ (master). It is in 3.11 as part of the rebase.

Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/106655/
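To make that failure mode concrete, here is a minimal monitor-side sketch of the behaviour described above. It is an illustration only, not the gsyncd implementation; the 60-second window, the FEEDBACK_FD variable, and the helper names are assumptions made for the example.

```python
# Hypothetical sketch: the monitor gives the worker a fixed window (assumed
# 60 seconds here) to establish its connections and acknowledge readiness.
# If no acknowledgement arrives, the worker is killed and respawned, which
# shows up externally as a Faulty geo-replication session.
import os
import select
import subprocess

CONNECTION_TIMEOUT = 60  # assumed readiness window, in seconds


def spawn_worker(cmd):
    """Start a worker that closes the feedback pipe once it is ready."""
    read_fd, write_fd = os.pipe()
    proc = subprocess.Popen(
        cmd,
        pass_fds=(write_fd,),
        env=dict(os.environ, FEEDBACK_FD=str(write_fd)),
    )
    os.close(write_fd)  # the monitor keeps only the read end
    return proc, read_fd


def monitor_once(cmd):
    proc, read_fd = spawn_worker(cmd)
    ready, _, _ = select.select([read_fd], [], [], CONNECTION_TIMEOUT)
    os.close(read_fd)
    if not ready:
        # The worker never acknowledged within the window: kill it and let
        # the caller respawn it; the session then appears Faulty.
        proc.kill()
        return "Faulty"
    return "Active"
```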
Verified with the build: glusterfs-geo-replication-3.8.4-35.el7rhgs.x86_64
1. The monitor starts the worker and the agent.
2. The worker establishes the SSH connection between master and slave; gsyncd on the slave mounts the slave volume on the local slave host.
3. The worker mounts the master volume locally.
4. Upon successful completion of steps 2 and 3, the worker acknowledges back to the monitor (a sketch of this handshake follows the log excerpt below). The successful case is logged as follows:
[2017-06-29 11:33:33.563153] I [monitor(monitor):275:monitor] Monitor: starting gsyncd worker(/rhs/brick2/b4). Slave node: ssh://root.43.150:gluster://localhost:slave
[2017-06-29 11:33:33.588865] I [monitor(monitor):275:monitor] Monitor: starting gsyncd worker(/rhs/brick1/b1). Slave node: ssh://root.43.52:gluster://localhost:slave
[2017-06-29 11:33:33.815712] I [resource(/rhs/brick2/b4):1676:connect_remote] SSH: Initializing SSH connection between master and slave...
[2017-06-29 11:33:33.818607] I [changelogagent(/rhs/brick2/b4):73:__init__] ChangelogAgent: Agent listining...
[2017-06-29 11:33:33.824904] I [resource(/rhs/brick1/b1):1676:connect_remote] SSH: Initializing SSH connection between master and slave...
[2017-06-29 11:33:33.840003] I [changelogagent(/rhs/brick1/b1):73:__init__] ChangelogAgent: Agent listining...
[2017-06-29 11:33:40.65862] I [resource(/rhs/brick2/b4):1683:connect_remote] SSH: SSH connection between master and slave established. Time taken: 6.2496 secs
[2017-06-29 11:33:40.66544] I [resource(/rhs/brick2/b4):1416:connect] GLUSTER: Mounting gluster volume locally...
[2017-06-29 11:33:40.139708] I [resource(/rhs/brick1/b1):1683:connect_remote] SSH: SSH connection between master and slave established. Time taken: 6.3142 secs
[2017-06-29 11:33:40.140361] I [resource(/rhs/brick1/b1):1416:connect] GLUSTER: Mounting gluster volume locally...
[2017-06-29 11:33:45.215692] I [resource(/rhs/brick2/b4):1429:connect] GLUSTER: Mounted gluster volume. Time taken: 5.1487 secs
[2017-06-29 11:33:45.216334] I [gsyncd(/rhs/brick2/b4):778:main_i] <top>: Closing feedback fd, waking up the monitor
[2017-06-29 11:33:45.280860] I [resource(/rhs/brick1/b1):1429:connect] GLUSTER: Mounted gluster volume. Time taken: 5.1400 secs
[2017-06-29 11:33:45.281521] I [gsyncd(/rhs/brick1/b1):778:main_i] <top>: Closing feedback fd, waking up the monitor
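The "Closing feedback fd, waking up the monitor" lines correspond to step 4 above. A worker-side sketch of that handshake is shown below; FEEDBACK_FD, connect_remote(), and mount_master() are hypothetical placeholders, not the actual gsyncd code.

```python
# Hypothetical worker-side sketch of step 4: acknowledge the monitor once the
# SSH connection (step 2) and the local master mount (step 3) have succeeded.
import os


def connect_remote():
    """Placeholder for establishing the SSH connection to the slave."""


def mount_master():
    """Placeholder for mounting the master volume locally."""


def worker_main():
    feedback_fd = int(os.environ["FEEDBACK_FD"])  # inherited from the monitor
    connect_remote()  # step 2
    mount_master()    # step 3
    # Step 4: closing the feedback fd makes the monitor's select() return,
    # which is what "waking up the monitor" refers to in the log above.
    os.close(feedback_fd)
```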
In case the failure is in the SSH connection, the worker fails and logs the errors (a sketch of the error-relay pattern follows this log excerpt):
[2017-07-24 11:56:07.717888] I [gsyncd(conf):688:main_i] <top>: Config Set: session-owner = a7039831-0bb5-40ec-a6aa-1e568b0ba516
[2017-07-24 11:56:08.607244] I [gsyncdstatus(monitor):240:set_worker_status] GeorepStatus: Worker Status: Initializing...
[2017-07-24 11:56:08.611022] I [gsyncdstatus(monitor):240:set_worker_status] GeorepStatus: Worker Status: Initializing...
[2017-07-24 11:56:08.625727] I [gsyncdstatus(monitor):240:set_worker_status] GeorepStatus: Worker Status: Initializing...
[2017-07-24 11:56:08.897754] I [monitor(monitor):74:get_slave_bricks_status] <top>: Unable to get list of up nodes of slave, returning empty list: Another transaction is in progress for slave. Please try again after sometime.
[2017-07-24 11:56:08.898608] I [monitor(monitor):275:monitor] Monitor: starting gsyncd worker(/rhs/brick2/b5). Slave node: ssh://root.41.209:gluster://localhost:slave
[2017-07-24 11:56:08.923662] I [monitor(monitor):275:monitor] Monitor: starting gsyncd worker(/rhs/brick1/b1). Slave node: ssh://root.41.209:gluster://localhost:slave
[2017-07-24 11:56:08.939045] I [monitor(monitor):275:monitor] Monitor: starting gsyncd worker(/rhs/brick3/b9). Slave node: ssh://root.41.209:gluster://localhost:slave
[2017-07-24 11:56:09.225071] I [changelogagent(/rhs/brick1/b1):73:__init__] ChangelogAgent: Agent listining...
[2017-07-24 11:56:09.231733] I [resource(/rhs/brick2/b5):1676:connect_remote] SSH: Initializing SSH connection between master and slave...
[2017-07-24 11:56:09.241522] I [changelogagent(/rhs/brick2/b5):73:__init__] ChangelogAgent: Agent listining...
[2017-07-24 11:56:09.261339] I [changelogagent(/rhs/brick3/b9):73:__init__] ChangelogAgent: Agent listining...
[2017-07-24 11:56:09.262459] I [resource(/rhs/brick3/b9):1676:connect_remote] SSH: Initializing SSH connection between master and slave...
[2017-07-24 11:56:09.274967] I [resource(/rhs/brick1/b1):1676:connect_remote] SSH: Initializing SSH connection between master and slave...
[2017-07-24 11:56:14.779021] E [syncdutils(/rhs/brick3/b9):269:log_raise_exception] <top>: connection to peer is broken
[2017-07-24 11:56:14.778943] E [syncdutils(/rhs/brick2/b5):269:log_raise_exception] <top>: connection to peer is broken
[2017-07-24 11:56:14.779907] E [resource(/rhs/brick3/b9):234:errlog] Popen: command "ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -p 22 -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-OPaCBo/aa544c64f039a6d7349da94e0501bc58.sock root.41.209 /nonexistent/gsyncd --session-owner a7039831-0bb5-40ec-a6aa-1e568b0ba516 --local-id .%2Frhs%2Fbrick3%2Fb9 --local-node 10.70.42.79 -N --listen --timeout 120 gluster://localhost:slave" returned with 1, saying:
[2017-07-24 11:56:14.780435] E [resource(/rhs/brick3/b9):238:logerr] Popen: ssh> [2017-07-24 11:56:10.085599] I [cli.c:762:main] 0-cli: Started running /usr/sbin/gluster with version 3.8.4
[2017-07-24 11:56:14.780502] E [resource(/rhs/brick2/b5):234:errlog] Popen: command "ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -p 22 -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-UJB2EP/aa544c64f039a6d7349da94e0501bc58.sock root.41.209 /nonexistent/gsyncd --session-owner a7039831-0bb5-40ec-a6aa-1e568b0ba516 --local-id .%2Frhs%2Fbrick2%2Fb5 --local-node 10.70.42.79 -N --listen --timeout 120 gluster://localhost:slave" returned with 1, saying:
[2017-07-24 11:56:14.780728] E [resource(/rhs/brick3/b9):238:logerr] Popen: ssh> [2017-07-24 11:56:10.085653] I [cli.c:645:cli_rpc_init] 0-cli: Connecting to remote glusterd at localhost
[2017-07-24 11:56:14.780967] E [resource(/rhs/brick3/b9):238:logerr] Popen: ssh> [2017-07-24 11:56:10.298145] I [MSGID: 101190] [event-epoll.c:602:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2017-07-24 11:56:14.780844] E [resource(/rhs/brick2/b5):238:logerr] Popen: ssh> [2017-07-24 11:56:10.082411] I [cli.c:762:main] 0-cli: Started running /usr/sbin/gluster with version 3.8.4
[2017-07-24 11:56:14.781389] E [resource(/rhs/brick2/b5):238:logerr] Popen: ssh> [2017-07-24 11:56:10.082532] I [cli.c:645:cli_rpc_init] 0-cli: Connecting to remote glusterd at localhost
[2017-07-24 11:56:14.781356] E [resource(/rhs/brick3/b9):238:logerr] Popen: ssh> [2017-07-24 11:56:10.300407] I [socket.c:2465:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2017-07-24 11:56:14.781673] E [resource(/rhs/brick2/b5):238:logerr] Popen: ssh> [2017-07-24 11:56:10.300755] I [MSGID: 101190] [event-epoll.c:602:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2017-07-24 11:56:14.781827] E [resource(/rhs/brick3/b9):238:logerr] Popen: ssh> [2017-07-24 11:56:10.302962] I [cli-rpc-ops.c:7020:gf_cli_getwd_cbk] 0-cli: Received resp to getwd
[2017-07-24 11:56:14.782166] E [resource(/rhs/brick2/b5):238:logerr] Popen: ssh> [2017-07-24 11:56:10.300919] I [socket.c:2465:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2017-07-24 11:56:14.782174] E [resource(/rhs/brick3/b9):238:logerr] Popen: ssh> [2017-07-24 11:56:10.303072] I [input.c:31:cli_batch] 0-: Exiting with: 0
[2017-07-24 11:56:14.782412] E [resource(/rhs/brick2/b5):238:logerr] Popen: ssh> [2017-07-24 11:56:10.302254] I [cli-rpc-ops.c:7020:gf_cli_getwd_cbk] 0-cli: Received resp to getwd
[2017-07-24 11:56:14.782664] E [resource(/rhs/brick2/b5):238:logerr] Popen: ssh> [2017-07-24 11:56:10.302529] I [input.c:31:cli_batch] 0-: Exiting with: 0
[2017-07-24 11:56:14.782872] I [syncdutils(/rhs/brick3/b9):237:finalize] <top>: exiting.
[2017-07-24 11:56:14.783211] I [syncdutils(/rhs/brick2/b5):237:finalize] <top>: exiting.
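The "Popen: ssh> ..." entries above show the slave-side command output relayed into the master-side geo-replication log, which is the logging improvement being verified here. A rough sketch of that relay pattern follows, with hypothetical names, and assuming the real gsyncd code differs in detail:

```python
# Hypothetical sketch: run the ssh command and, on failure, relay its output
# into the master-side log, one "ssh>"-prefixed line per remote output line,
# similar to the errlog/logerr entries shown above.
import logging
import subprocess

log = logging.getLogger("resource")


def run_ssh(cmd):
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    out, _ = proc.communicate()
    if proc.returncode != 0:
        # One summary line with the full command and exit status...
        log.error('command "%s" returned with %d, saying:',
                  " ".join(cmd), proc.returncode)
        # ...then each line of remote output, prefixed with "ssh>".
        for line in out.splitlines():
            log.error("ssh> %s", line)
    return proc.returncode
```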
Since the fix consists only of logging improvements, moving the bug to the verified state based on the above logs.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774