Bug 1654118 - [geo-rep]: Failover / Failback shows fault status in a non-root setup
Summary: [geo-rep]: Failover / Failback shows fault status in a non-root setup
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: geo-replication
Version: 4.1
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Assignee: Kotresh HR
QA Contact:
URL:
Whiteboard:
Depends On: 1510752 1651498
Blocks: 1654117
 
Reported: 2018-11-28 05:00 UTC by Kotresh HR
Modified: 2019-01-22 14:09 UTC
CC List: 8 users

Fixed In Version: glusterfs-4.1.7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1651498
Environment:
Last Closed: 2019-01-22 14:09:20 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Gluster.org Gerrit 21689 0 None None None 2018-11-28 05:00:57 UTC
Gluster.org Gerrit 21732 0 None Merged geo-rep: Fix permissions with non-root setup 2018-11-29 15:29:30 UTC

Description Kotresh HR 2018-11-28 05:00:57 UTC
+++ This bug was initially created as a clone of Bug #1651498 +++

+++ This bug was initially created as a clone of Bug #1510752 +++

Description of problem:
=======================

While executing a failover / failback scenario on a non-root geo-rep setup, restarting the original non-root session between the master and the slave leaves the session in Faulty status.

The logs show the following:


[2017-11-08 06:52:08.899] E [resource(/rhs/brick1/b1):234:errlog] Popen: command "ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -p 22 -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-ozvxWN/ab5534f3bb3f74602da3c8c3068a4aa5.sock geoaccount.43.175 /nonexistent/gsyncd --session-owner b4645ef5-836f-4605-98b3-207abd550fc0 --local-id .%2Frhs%2Fbrick1%2Fb1 --local-node 10.70.43.14 -N --listen --timeout 120 gluster://localhost:slave" returned with 255, saying:
[2017-11-08 06:52:08.1159] E [resource(/rhs/brick1/b1):238:logerr] Popen: ssh> @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[2017-11-08 06:52:08.1418] E [resource(/rhs/brick1/b1):238:logerr] Popen: ssh> @         WARNING: UNPROTECTED PRIVATE KEY FILE!          @
[2017-11-08 06:52:08.1662] E [resource(/rhs/brick1/b1):238:logerr] Popen: ssh> @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[2017-11-08 06:52:08.1856] E [resource(/rhs/brick1/b1):238:logerr] Popen: ssh> Permissions 0770 for '/var/lib/glusterd/geo-replication/secret.pem' are too open.
[2017-11-08 06:52:08.2038] E [resource(/rhs/brick1/b1):238:logerr] Popen: ssh> It is required that your private key files are NOT accessible by others.
[2017-11-08 06:52:08.2216] E [resource(/rhs/brick1/b1):238:logerr] Popen: ssh> This private key will be ignored.
[2017-11-08 06:52:08.2465] E [resource(/rhs/brick1/b1):238:logerr] Popen: ssh> Load key "/var/lib/glusterd/geo-replication/secret.pem": bad permissions
[2017-11-08 06:52:08.2824] E [resource(/rhs/brick1/b1):238:logerr] Popen: ssh> Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
[2017-11-08 06:52:08.3571] I [syncdutils(/rhs/brick1/b1):237:finalize] <top>: exiting.
[2017-11-08 06:52:08.7929] I [monitor(monitor):347:monitor] Monitor: worker(/rhs/brick1/b1) died before establishing connection
[2017-11-08 06:52:08.8866] I [repce(/rhs/brick1/b1):92:service_loop] RepceServer: terminating on reaching EOF.
[2017-11-08 06:52:08.9479] I [syncdutils(/rhs/brick1/b1):237:finalize] <top>: exiting.
[2017-11-08 06:52:17.353275] I [monitor(monitor):275:monitor] Monitor: starting gsyncd worker(/rhs/brick2/b4). Slave node: ssh://geoaccount.43.175:gluster://localhost:slave
[2017-11-08 06:52:17.565080] I [resource(/rhs/brick2/b4):1684:connect_remote] SSH: Initializing SSH connection between master and slave...
[2017-11-08 06:52:17.567466] I [changelogagent(/rhs/brick2/b4):73:__init__] ChangelogAgent: Agent listining...
[2017-11-08 06:52:17.714301] E [syncdutils(/rhs/brick2/b4):269:log_raise_exception] <top>: connection to peer is broken
[2017-11-08 06:52:17.715086] E [resource(/rhs/brick2/b4):234:errlog] Popen: command "ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -p 22 -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-S6S9iP/ab5534f3bb3f74602da3c8c3068a4aa5.sock geoaccount.43.175 /nonexistent/gsyncd --session-owner b4645ef5-836f-4605-98b3-207abd550fc0 --local-id .%2Frhs%2Fbrick2%2Fb4 --local-node 10.70.43.14 -N --listen --timeout 120 gluster://localhost:slave" returned with 255, saying:
[2017-11-08 06:52:17.715459] E [resource(/rhs/brick2/b4):238:logerr] Popen: ssh> @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[2017-11-08 06:52:17.715709] E [resource(/rhs/brick2/b4):238:logerr] Popen: ssh> @         WARNING: UNPROTECTED PRIVATE KEY FILE!          @
[2017-11-08 06:52:17.715914] E [resource(/rhs/brick2/b4):238:logerr] Popen: ssh> @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[2017-11-08 06:52:17.716105] E [resource(/rhs/brick2/b4):238:logerr] Popen: ssh> Permissions 0770 for '/var/lib/glusterd/geo-replication/secret.pem' are too open.
[2017-11-08 06:52:17.716289] E [resource(/rhs/brick2/b4):238:logerr] Popen: ssh> It is required that your private key files are NOT accessible by others.
[2017-11-08 06:52:17.716600] E [resource(/rhs/brick2/b4):238:logerr] Popen: ssh> This private key will be ignored.
[2017-11-08 06:52:17.716799] E [resource(/rhs/brick2/b4):238:logerr] Popen: ssh> Load key "/var/lib/glusterd/geo-replication/secret.pem": bad permissions
[2017-11-08 06:52:17.717060] E [resource(/rhs/brick2/b4):238:logerr] Popen: ssh> Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
[2017-11-08 06:52:17.717834] I [syncdutils(/rhs/brick2/b4):237:finalize] <top>: exiting.
[2017-11-08 06:52:17.721502] I [repce(/rhs/brick2/b4):92:service_loop] RepceServer: terminating on reaching EOF.
[2017-11-08 06:52:17.722084] I [syncdutils(/rhs/brick2/b4):237:finalize] <top>: exiting.
[2017-11-08 06:52:17.721748] I [monitor(monitor):347:monitor] Monitor: worker(/rhs/brick2/b4) died before establishing connection
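
The SSH errors above are OpenSSH's standard refusal to use a private key file that is group- or world-accessible. A quick check and interim workaround on an affected node (a sketch, assuming the default key path shown in the logs):

    # inspect the private key used by the geo-rep ssh connection
    stat -c '%a %n' /var/lib/glusterd/geo-replication/secret.pem
    # affected nodes report 770; tighten it so only root can read the key
    chmod 600 /var/lib/glusterd/geo-replication/secret.pem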


Version-Release number of selected component (if applicable):
=============================================================
mainline


Steps to Reproduce:
===================
1. Create a non-root geo-rep session between the master and the slave.
2. Stop the master volume with the force option.
3. Promote the slave volume to master.
4. Bring the original master back online and stop the original geo-rep session between the original master and slave.
5. Set up a non-root session from the original slave to the original master and write some data.
6. Stop I/O and set a checkpoint.
7. Wait for the checkpoint to complete.
8. Stop and delete the geo-rep session between the original slave and the original master.
9. Reset the options that promoted the slave volume to master.
10. Resume the original session between the original master and the original slave (see the command sketch below).
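
For reference, a rough sketch of the commands behind steps 6-10; the volume and host names (mastervol, slavevol, masterhost, slavehost, geoaccount) are placeholders, not the exact names used in this reproduction:

    # steps 6-7: set a checkpoint on the reversed (original slave -> original
    # master) session and wait for it to complete
    gluster volume geo-replication slavevol geoaccount@masterhost::mastervol config checkpoint now
    gluster volume geo-replication slavevol geoaccount@masterhost::mastervol status detail

    # step 8: stop and delete the reversed session
    gluster volume geo-replication slavevol geoaccount@masterhost::mastervol stop
    gluster volume geo-replication slavevol geoaccount@masterhost::mastervol delete

    # step 10: resume the original master -> slave session; its status then
    # turns Faulty instead of Active/Passive
    gluster volume geo-replication mastervol geoaccount@slavehost::slavevol start
    gluster volume geo-replication mastervol geoaccount@slavehost::slavevol status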


Actual results:
===============
Geo-rep status was faulty

Expected results:
================
Geo-rep status should be ACTIVE / PASSIVE


A simple non-root session was set up between the master and slave, and the following was observed: while executing

    gluster-mountbroker setup /var/mountbroker-root geogroup

on the slave, the permissions of the geo-replication directory under /var/lib/glusterd/ change from drwxr-xr-x to drwxrwx---.
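
This is straightforward to verify; a minimal check, assuming default paths (the geogroup group ownership is an assumption based on the setup command above):

    # compare the mode of the geo-replication directory before and after
    # running gluster-mountbroker setup
    stat -c '%A %a %U:%G %n' /var/lib/glusterd/geo-replication
    # before: drwxr-xr-x 755 root:root
    # after:  drwxrwx--- 770 root:geogroup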

--- Additional comment from Worker Ant on 2018-11-20 03:45:02 EST ---

REVIEW: https://review.gluster.org/21689 (geo-rep: Fix permissions with non-root setup) posted (#1) for review on master by Kotresh HR

--- Additional comment from Kotresh HR on 2018-11-20 03:47:14 EST ---

Summary:

geo-rep: Fix permissions with non-root setup
    
Problem:
    In a non-root fail-over/fail-back (FO/FB) scenario, when the
    slave is promoted to master, the session goes to 'Faulty'.

Cause:
    The command 'gluster-mountbroker setup <mountbroker-root> <group>'
    is run as a prerequisite on the slave in a non-root setup.
    It recursively modifies the permissions and group of the
    following required directories and files:

      [1] /var/lib/glusterd/geo-replication
      [2] /var/log/glusterfs/geo-replication-slaves

    In a normal setup this command is executed on a slave node, so
    recursing into [1] is not an issue. But when the original master
    becomes the slave during non-root FO/FB, [1] contains SSH key
    files (such as secret.pem), and loosening their permissions
    causes geo-rep to fail with the bad-permissions errors shown
    above.

Fix:
    Don't change permissions recursively; fix the permissions only
    on the specific files and directories that require it.
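
As an illustration of the fix (a minimal sketch of the idea, not the actual patch; the exact modes applied by gluster-mountbroker are assumptions here):

    # recursive chmod (the old, broken behaviour) also loosens secret.pem,
    # which OpenSSH then rejects:
    #   chmod -R 770 /var/lib/glusterd/geo-replication

    # non-recursive (the fixed behaviour): adjust only the entries that
    # mountbroker actually needs, leaving the key files' strict modes intact
    chgrp geogroup /var/lib/glusterd/geo-replication
    chmod 770 /var/lib/glusterd/geo-replication
    chmod 600 /var/lib/glusterd/geo-replication/secret.pem   # private key must stay 0600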

--- Additional comment from Worker Ant on 2018-11-25 23:21:28 EST ---

REVIEW: https://review.gluster.org/21689 (geo-rep: Fix permissions with non-root setup) posted (#3) for review on master by Amar Tumballi

Comment 1 Worker Ant 2018-11-28 05:05:47 UTC
REVIEW: https://review.gluster.org/21732 (geo-rep: Fix permissions with non-root setup) posted (#1) for review on release-4.1 by Kotresh HR

Comment 2 Worker Ant 2018-11-29 15:29:27 UTC
REVIEW: https://review.gluster.org/21732 (geo-rep: Fix permissions with non-root setup) posted (#1) for review on release-4.1 by Kotresh HR

Comment 3 Shyamsundar 2019-01-22 14:09:20 UTC
This bug is being closed because a release that should address the reported issue is now available. If the problem is still not fixed in glusterfs-4.1.7, please open a new bug report.

glusterfs-4.1.7 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and on the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2019-January/000118.html
[2] https://www.gluster.org/pipermail/gluster-users/

