Bug 1654118

Summary: [geo-rep]: Failover / Failback shows fault status in a non-root setup
Product: [Community] GlusterFS
Reporter: Kotresh HR <khiremat>
Component: geo-replication
Assignee: Kotresh HR <khiremat>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: low
Docs Contact:
Priority: low
Version: 4.1
CC: bugs, csaba, rallan, rhinduja, rhs-bugs, sankarshan, storage-qa-internal, ygoitom
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-4.1.7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1651498
Environment:
Last Closed: 2019-01-22 14:09:20 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1510752, 1651498
Bug Blocks: 1654117

Description Kotresh HR 2018-11-28 05:00:57 UTC
+++ This bug was initially created as a clone of Bug #1651498 +++

+++ This bug was initially created as a clone of Bug #1510752 +++

Description of problem:
=======================

During a failover/failback scenario on a non-root geo-rep setup, restarting the original non-root session between the master and the slave leaves the session in a Faulty status.

The logs show the following:


[2017-11-08 06:52:08.899] E [resource(/rhs/brick1/b1):234:errlog] Popen: command "ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -p 22 -oControlMaster=
auto -S /tmp/gsyncd-aux-ssh-ozvxWN/ab5534f3bb3f74602da3c8c3068a4aa5.sock geoaccount.43.175 /nonexistent/gsyncd --session-owner b4645ef5-836f-4605-98b3-207abd550fc0 --local-id .%2Frhs%2Fbrick1%2Fb1 --local-
node 10.70.43.14 -N --listen --timeout 120 gluster://localhost:slave" returned with 255, saying:
[2017-11-08 06:52:08.1159] E [resource(/rhs/brick1/b1):238:logerr] Popen: ssh> @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[2017-11-08 06:52:08.1418] E [resource(/rhs/brick1/b1):238:logerr] Popen: ssh> @         WARNING: UNPROTECTED PRIVATE KEY FILE!          @
[2017-11-08 06:52:08.1662] E [resource(/rhs/brick1/b1):238:logerr] Popen: ssh> @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[2017-11-08 06:52:08.1856] E [resource(/rhs/brick1/b1):238:logerr] Popen: ssh> Permissions 0770 for '/var/lib/glusterd/geo-replication/secret.pem' are too open.
[2017-11-08 06:52:08.2038] E [resource(/rhs/brick1/b1):238:logerr] Popen: ssh> It is required that your private key files are NOT accessible by others.
[2017-11-08 06:52:08.2216] E [resource(/rhs/brick1/b1):238:logerr] Popen: ssh> This private key will be ignored.
[2017-11-08 06:52:08.2465] E [resource(/rhs/brick1/b1):238:logerr] Popen: ssh> Load key "/var/lib/glusterd/geo-replication/secret.pem": bad permissions
[2017-11-08 06:52:08.2824] E [resource(/rhs/brick1/b1):238:logerr] Popen: ssh> Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
[2017-11-08 06:52:08.3571] I [syncdutils(/rhs/brick1/b1):237:finalize] <top>: exiting.
[2017-11-08 06:52:08.7929] I [monitor(monitor):347:monitor] Monitor: worker(/rhs/brick1/b1) died before establishing connection
[2017-11-08 06:52:08.8866] I [repce(/rhs/brick1/b1):92:service_loop] RepceServer: terminating on reaching EOF.
[2017-11-08 06:52:08.9479] I [syncdutils(/rhs/brick1/b1):237:finalize] <top>: exiting.
[2017-11-08 06:52:17.353275] I [monitor(monitor):275:monitor] Monitor: starting gsyncd worker(/rhs/brick2/b4). Slave node: ssh://geoaccount.43.175:gluster://localhost:slave
[2017-11-08 06:52:17.565080] I [resource(/rhs/brick2/b4):1684:connect_remote] SSH: Initializing SSH connection between master and slave...
[2017-11-08 06:52:17.567466] I [changelogagent(/rhs/brick2/b4):73:__init__] ChangelogAgent: Agent listining...
[2017-11-08 06:52:17.714301] E [syncdutils(/rhs/brick2/b4):269:log_raise_exception] <top>: connection to peer is broken
[2017-11-08 06:52:17.715086] E [resource(/rhs/brick2/b4):234:errlog] Popen: command "ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -p 22 -oControlMast
er=auto -S /tmp/gsyncd-aux-ssh-S6S9iP/ab5534f3bb3f74602da3c8c3068a4aa5.sock geoaccount.43.175 /nonexistent/gsyncd --session-owner b4645ef5-836f-4605-98b3-207abd550fc0 --local-id .%2Frhs%2Fbrick2%2Fb4 --loc
al-node 10.70.43.14 -N --listen --timeout 120 gluster://localhost:slave" returned with 255, saying:
[2017-11-08 06:52:17.715459] E [resource(/rhs/brick2/b4):238:logerr] Popen: ssh> @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[2017-11-08 06:52:17.715709] E [resource(/rhs/brick2/b4):238:logerr] Popen: ssh> @         WARNING: UNPROTECTED PRIVATE KEY FILE!          @
[2017-11-08 06:52:17.715914] E [resource(/rhs/brick2/b4):238:logerr] Popen: ssh> @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[2017-11-08 06:52:17.716105] E [resource(/rhs/brick2/b4):238:logerr] Popen: ssh> Permissions 0770 for '/var/lib/glusterd/geo-replication/secret.pem' are too open.
[2017-11-08 06:52:17.716289] E [resource(/rhs/brick2/b4):238:logerr] Popen: ssh> It is required that your private key files are NOT accessible by others.
[2017-11-08 06:52:17.716600] E [resource(/rhs/brick2/b4):238:logerr] Popen: ssh> This private key will be ignored.
[2017-11-08 06:52:17.716799] E [resource(/rhs/brick2/b4):238:logerr] Popen: ssh> Load key "/var/lib/glusterd/geo-replication/secret.pem": bad permissions
[2017-11-08 06:52:17.717060] E [resource(/rhs/brick2/b4):238:logerr] Popen: ssh> Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
[2017-11-08 06:52:17.717834] I [syncdutils(/rhs/brick2/b4):237:finalize] <top>: exiting.
[2017-11-08 06:52:17.721502] I [repce(/rhs/brick2/b4):92:service_loop] RepceServer: terminating on reaching EOF.
[2017-11-08 06:52:17.722084] I [syncdutils(/rhs/brick2/b4):237:finalize] <top>: exiting.
[2017-11-08 06:52:17.721748] I [monitor(monitor):347:monitor] Monitor: worker(/rhs/brick2/b4) died before establishing connection


Version-Release number of selected component (if applicable):
=============================================================
mainline


Steps to Reproduce:
===================
1. Created a non-root geo-rep session between the master and the slave
2. Stopped the master volume with the force option
3. Promoted the slave volume to master
4. Brought the original master back online and stopped the original geo-rep session between the original master and the slave
5. Set up a non-root session from the original slave to the original master and wrote some data
6. Stopped I/O and set a checkpoint
7. Waited for the checkpoint to complete
8. Stopped and deleted the geo-rep session between the original slave and the original master
9. Reset the options that promoted the slave volume to master
10. Resumed the original session between the original master and the original slave
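
For reference, the steps above correspond roughly to the following CLI sequence. This is an illustrative sketch only: the volume names ('master', 'slave'), the host names, and the 'geoaccount' user are placeholders, and the exact commands from the original run are not recorded in this report.

    # Step 1: create and start the non-root session (the slave side is
    # assumed to be prepared beforehand with gluster-mountbroker setup/add)
    gluster volume geo-replication master geoaccount@slavehost::slave create push-pem
    gluster volume geo-replication master geoaccount@slavehost::slave start
    # Step 2: stop the master volume forcefully
    gluster volume stop master force
    # Step 3: promote the slave volume to master
    gluster volume set slave geo-replication.indexing on
    gluster volume set slave changelog on
    # Step 6: set a checkpoint on the reversed (slave -> original master) session
    gluster volume geo-replication slave geoaccount@masterhost::master config checkpoint now
    # Step 8: stop and delete the reversed session
    gluster volume geo-replication slave geoaccount@masterhost::master stop
    gluster volume geo-replication slave geoaccount@masterhost::master delete
    # Step 9: reset the options that promoted the slave volume
    gluster volume reset slave geo-replication.indexing
    gluster volume reset slave changelog
    # Step 10: restart the original master -> slave session
    gluster volume geo-replication master geoaccount@slavehost::slave start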


Actual results:
===============
Geo-rep status was faulty

Expected results:
================
Geo-rep status should be ACTIVE / PASSIVE
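
The session state can be verified with the status command (volume and host names are placeholders, as above):

    gluster volume geo-replication master geoaccount@slavehost::slave status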


A simple non-root session was set up between the master and the slave, and the following was observed:

While executing 'gluster-mountbroker setup /var/mountbroker-root geogroup' on the slave, the permissions of the geo-replication directory under /var/lib/glusterd/ changed from drwxr-xr-x to drwxrwx---.
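
One way to observe this permission change (an illustrative check using GNU coreutils stat, not part of the original report):

    # before running gluster-mountbroker setup
    stat -c '%a %U:%G %n' /var/lib/glusterd/geo-replication
    # expected: 755 root:root ... (drwxr-xr-x)
    # after: gluster-mountbroker setup /var/mountbroker-root geogroup
    stat -c '%a %U:%G %n' /var/lib/glusterd/geo-replication
    # observed: 770 root:geogroup ... (drwxrwx---)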

--- Additional comment from Worker Ant on 2018-11-20 03:45:02 EST ---

REVIEW: https://review.gluster.org/21689 (geo-rep: Fix permissions with non-root setup) posted (#1) for review on master by Kotresh HR

--- Additional comment from Kotresh HR on 2018-11-20 03:47:14 EST ---

Summary:

geo-rep: Fix permissions with non-root setup
    
Problem:
    In a non-root fail-over/fail-back (FO/FB) setup, when the slave
    is promoted as master, the session goes to 'Faulty'.

Cause:
    The command 'gluster-mountbroker setup <mountbroker-root> <group>'
    is run as a prerequisite on the slave in a non-root setup.
    It recursively modifies the permissions and group of the
    following required directories and files:

      [1] /var/lib/glusterd/geo-replication
      [2] /var/log/glusterfs/geo-replication-slaves

    In a normal setup this is executed on the slave node, so doing
    it recursively is not an issue for [1]. But when the original
    master becomes the slave during non-root FO/FB, [1] contains the
    SSH keys (including the private key secret.pem), and recursively
    modifying their permissions causes geo-rep to fail with 'bad
    permissions' errors.

Fix:
    Don't change permissions recursively; fix permissions only for
    the required files.
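
To illustrate what the fix restores, a manual workaround on the affected node (the original master acting as slave) would be to put back the permissions SSH requires. The modes below are an assumption derived from the 'bad permissions' errors in the log above, not commands taken from the patch:

    # the private key must not be accessible by group/others
    chmod 600 /var/lib/glusterd/geo-replication/secret.pem
    # keep the directory traversable but not group-writable
    chmod 755 /var/lib/glusterd/geo-replication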

--- Additional comment from Worker Ant on 2018-11-25 23:21:28 EST ---

REVIEW: https://review.gluster.org/21689 (geo-rep: Fix permissions with non-root setup) posted (#3) for review on master by Amar Tumballi

Comment 1 Worker Ant 2018-11-28 05:05:47 UTC
REVIEW: https://review.gluster.org/21732 (geo-rep: Fix permissions with non-root setup) posted (#1) for review on release-4.1 by Kotresh HR

Comment 2 Worker Ant 2018-11-29 15:29:27 UTC
REVIEW: https://review.gluster.org/21732 (geo-rep: Fix permissions with non-root setup) posted (#1) for review on release-4.1 by Kotresh HR

Comment 3 Shyamsundar 2019-01-22 14:09:20 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-4.1.7, please open a new bug report.

glusterfs-4.1.7 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2019-January/000118.html
[2] https://www.gluster.org/pipermail/gluster-users/