Bug 1615911

Summary: [geo-rep]: No such file or directory when a node is shut down and brought back
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Rochelle <rallan>
Component: geo-replication    Assignee: Kotresh HR <khiremat>
Status: CLOSED ERRATA QA Contact: Rochelle <rallan>
Severity: high Docs Contact:
Priority: unspecified    
Version: rhgs-3.4    CC: amukherj, apaladug, avishwan, csaba, rallan, rhs-bugs, sankarshan, storage-qa-internal, vdas
Target Milestone: ---    Keywords: Regression
Target Release: RHGS 3.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: glusterfs-3.12.2-16 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-09-04 06:52:17 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1503137    

Description Rochelle 2018-08-14 14:08:04 UTC
Description of problem:
========================
A "No such file or directory" error is present in the master logs when a node is shut down and brought back.
Once the node that was shut down is brought back up, one of the two workers on that node shows FAULTY status.

[root@dhcp41-253 ~]# gluster volume geo-replication master 10.70.43.210::slave status
 
MASTER NODE     MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                  SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED                  
-----------------------------------------------------------------------------------------------------------------------------------------------------
10.70.41.253    master        /rhs/brick1/b3    root          10.70.43.210::slave    N/A             Faulty     N/A                N/A                          
10.70.41.253    master        /rhs/brick2/b6    root          10.70.43.210::slave    10.70.43.59     Passive    N/A                N/A                          
10.70.43.163    master        /rhs/brick1/b1    root          10.70.43.210::slave    10.70.43.210    Active     Changelog Crawl    2018-08-14 13:48:45          
10.70.43.163    master        /rhs/brick2/b4    root          10.70.43.210::slave    10.70.43.210    Active     Changelog Crawl    2018-08-14 13:48:45          
10.70.42.60     master        /rhs/brick1/b2    root          10.70.43.210::slave    10.70.42.156    Passive    N/A                N/A                          
10.70.42.60     master        /rhs/brick2/b5    root          10.70.43.210::slave    10.70.42.156    Passive    N/A                N/A     


[2018-08-14 13:54:24.721455] E [repce(/rhs/brick1/b3):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line 41, in register
    return Changes.cl_register(cl_brick, cl_dir, cl_log, cl_level, retries)
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 45, in cl_register
    cls.raise_changelog_err()
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 27, in raise_changelog_err
    raise ChangelogException(errn, os.strerror(errn))
ChangelogException: [Errno 2] No such file or directory
[2018-08-14 13:54:24.725533] E [repce(/rhs/brick1/b3):209
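
The ChangelogException above is raised when the worker tries to register with the changelog translator on its brick. A quick way to spot these registration failures on the master nodes (a sketch; /var/log/glusterfs/geo-replication/ is the usual geo-rep worker log location and may vary by version):

# grep the geo-rep worker logs on the master for changelog registration failures
grep -R "ChangelogException" /var/log/glusterfs/geo-replication/ | tail -n 5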


Version-Release number of selected component (if applicable):
==============================================================

[root@dhcp41-253 master]# rpm -qa | grep gluster
glusterfs-client-xlators-3.12.2-15.el7rhgs.x86_64
glusterfs-rdma-3.12.2-15.el7rhgs.x86_64
glusterfs-server-3.12.2-15.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-libs-3.12.2-15.el7rhgs.x86_64
glusterfs-api-3.12.2-15.el7rhgs.x86_64
glusterfs-events-3.12.2-15.el7rhgs.x86_64
glusterfs-3.12.2-15.el7rhgs.x86_64
glusterfs-fuse-3.12.2-15.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-15.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.6.x86_64
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch
glusterfs-cli-3.12.2-15.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
python2-gluster-3.12.2-15.el7rhgs.x86_64

How reproducible:
=================
2/2


Steps to Reproduce:
1. Create and start a geo-rep session with 3x3 master and slave volumes (a CLI sketch for steps 1 and 4 follows this list).
2. Create I/O on the master mount: for i in {create,chmod,chown,chgrp,symlink,truncate,rename,hardlink}; do crefi --multi -n 5 -b 5 -d 5 --max=1K --min=50 --random -T 2 -t text --fop=$i /node/master/ ; sleep 10 ; done

3. While I/O is happening, shut down the node with ACTIVE workers. The PASSIVE workers that obtain the lock become ACTIVE and participate in syncing.

4. The checksums of the master and slave match as expected.

5. After verifying that the sync is complete, bring back the node that was shut down.
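
A minimal CLI sketch for steps 1 and 4, reusing the volume and host names from this report (assumes password-less SSH to the slave; the client mount paths /mnt/master and /mnt/slave are hypothetical, and arequal-checksum is just one common way to compare the trees if it is installed):

# Step 1: create and start the geo-rep session between the 3x3 master and slave volumes
gluster system:: execute gsec_create
gluster volume geo-replication master 10.70.43.210::slave create push-pem
gluster volume geo-replication master 10.70.43.210::slave start

# Step 4: once syncing settles, compare checksums of the master and slave mounts
arequal-checksum -p /mnt/master
arequal-checksum -p /mnt/slave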

Actual results:
===============
One of the two workers on the node that was brought back shows FAULTY status.

Expected results:
=================
Once the node is brought back, both workers should ideally be in the PASSIVE state.


Additional info:

Comment 5 Kotresh HR 2018-08-16 11:02:30 UTC
Analysis:

When the node is brought back up, the geo-rep worker starts but cannot connect to the brick, hence the Faulty state.

Looking at the libgfchangelog logs, the RPC connection failed for some reason.

======================================================
[2018-08-14 13:32:53.146652] I [MSGID: 132019] [gf-history-changelog.c:935:gf_history_changelog] 0-gfchangelog: FINAL: from: 1534253526, to: 1534253571, changes: 4
[2018-08-14 13:33:08.859846] I [MSGID: 132019] [gf-history-changelog.c:767:gf_changelog_extract_min_max] 0-gfchangelog: MIN: 1534252536, MAX: 1534253586, TOTAL CHANGELOGS: 70
[2018-08-14 13:33:08.860060] I [MSGID: 132019] [gf-history-changelog.c:935:gf_history_changelog] 0-gfchangelog: FINAL: from: 1534253526, to: 1534253586, changes: 5
[2018-08-14 13:54:03.635946] I [MSGID: 132028] [gf-changelog.c:560:gf_changelog_register_generic] 0-gfchangelog: Registering brick: /rhs/brick1/b3 [notify filter: 1]
[2018-08-14 13:54:03.637715] I [MSGID: 101190] [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2018-08-14 13:54:03.637800] I [MSGID: 101190] [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with index 3
[2018-08-14 13:54:03.637966] I [MSGID: 101190] [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2018-08-14 13:54:03.639023] I [MSGID: 101190] [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with index 4
[2018-08-14 13:54:03.653980] I [socket.c:2474:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2018-08-14 13:54:05.654436] I [socket.c:3599:socket_submit_request] 0-gfchangelog: not connected (priv->connected = 0)
[2018-08-14 13:54:05.654469] W [rpc-clnt.c:1679:rpc_clnt_submit] 0-gfchangelog: failed to submit rpc-request (XID: 0x2 Program: LIBGFCHANGELOG, ProgVers: 1, Proc: 1) to rpc-transport (gfchangelog)
[2018-08-14 13:54:05.654492] E [MSGID: 132023] [gf-changelog.c:261:gf_changelog_setup_rpc] 0-gfchangelog: Could not initiate probe RPC, bailing out!!!
[2018-08-14 13:54:05.654560] I [socket.c:2474:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2018-08-14 13:54:05.654891] E [MSGID: 132022] [gf-changelog.c:566:gf_changelog_register_generic] 0-gfchangelog: Error registering with changelog xlator
[2018-08-14 13:54:22.711301] I [MSGID: 132028] [gf-changelog.c:560:gf_changelog_register_generic] 0-gfchangelog: Registering brick: /rhs/brick1/b3 [notify filter: 1]
[2018-08-14 13:54:22.718923] I [socket.c:2474:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2018-08-14 13:54:24.719451] I [socket.c:3599:socket_submit_request] 0-gfchangelog: not connected (priv->connected = 0)
[2018-08-14 13:54:24.719543] W [rpc-clnt.c:1679:rpc_clnt_submit] 0-gfchangelog: failed to submit rpc-request (XID: 0x2 Program: LIBGFCHANGELOG, ProgVers: 1, Proc: 1) to rpc-transport (gfchangelog)
=========================================================

From the brick logs, it looks like the brick didn't come up after the node came back.

The brick went down around 2018-08-14 13:32.

The Faulty errors started appearing around 2018-08-14 13:54.

So, assuming the node came back around 13:54, there is no brick log around that time showing the graph being loaded, which means the brick process didn't come up.
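
One way to confirm this from the rebooted node would be to check the brick's own log for graph-load entries around 13:54 (a sketch; the path follows the usual /var/log/glusterfs/bricks/<brick-path>.log convention):

# no graph-load or client-accept entries around 13:54 would confirm the brick process never started
grep -E "Final graph|server_setvolume|received signum" /var/log/glusterfs/bricks/rhs-brick1-b3.log | tail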

========================
master-client-2-0-0
[2018-08-14 13:32:49.923031] I [addr.c:55:compare_addr_and_update] 0-/rhs/brick1/b3: allowed = "*", received addr = "10.70.41.253"
[2018-08-14 13:32:49.923099] I [login.c:111:gf_auth] 0-auth/login: allowed user names: 5710b280-df15-4de3-a609-c93c418b2f5c
[2018-08-14 13:32:49.923140] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-master-server: accepted client from dhcp41-253.lab.eng.blr.redhat.com-30249-2018/08/14-13:32:49:818533-master-client-2-0-0 (version: 3.12.2)
[2018-08-14 13:32:50.000634] I [addr.c:55:compare_addr_and_update] 0-/rhs/brick1/b3: allowed = "*", received addr = "10.70.41.253"
[2018-08-14 13:32:50.000703] I [login.c:111:gf_auth] 0-auth/login: allowed user names: 5710b280-df15-4de3-a609-c93c418b2f5c
[2018-08-14 13:32:50.000743] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-master-server: accepted client from dhcp41-253.lab.eng.blr.redhat.com-30252-2018/08/14-13:32:49:880584-master-client-2-0-0 (version: 3.12.2)
[2018-08-14 13:32:50.520496] I [addr.c:55:compare_addr_and_update] 0-/rhs/brick1/b3: allowed = "*", received addr = "10.70.43.163"
[2018-08-14 13:32:50.520636] I [login.c:111:gf_auth] 0-auth/login: allowed user names: 5710b280-df15-4de3-a609-c93c418b2f5c
[2018-08-14 13:32:50.520677] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-master-server: accepted client from dhcp43-163.lab.eng.blr.redhat.com-31883-2018/08/14-13:32:50:413587-master-client-2-0-0 (version: 3.12.2)
[2018-08-14 13:32:50.639809] I [addr.c:55:compare_addr_and_update] 0-/rhs/brick1/b3: allowed = "*", received addr = "10.70.42.60"
[2018-08-14 13:32:50.639878] I [login.c:111:gf_auth] 0-auth/login: allowed user names: 5710b280-df15-4de3-a609-c93c418b2f5c
[2018-08-14 13:32:50.639915] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-master-server: accepted client from dhcp42-60.lab.eng.blr.redhat.com-20854-2018/08/14-13:32:50:531903-master-client-2-0-0 (version: 3.12.2)
[2018-08-14 13:32:50.705107] I [addr.c:55:compare_addr_and_update] 0-/rhs/brick1/b3: allowed = "*", received addr = "10.70.43.163"
[2018-08-14 13:32:50.705235] I [login.c:111:gf_auth] 0-auth/login: allowed user names: 5710b280-df15-4de3-a609-c93c418b2f5c
[2018-08-14 13:32:50.705291] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-master-server: accepted client from dhcp43-163.lab.eng.blr.redhat.com-31895-2018/08/14-13:32:50:595040-master-client-2-0-0 (version: 3.12.2)
[2018-08-14 13:32:50.764690] I [addr.c:55:compare_addr_and_update] 0-/rhs/brick1/b3: allowed = "*", received addr = "10.70.42.60"
[2018-08-14 13:32:50.764778] I [login.c:111:gf_auth] 0-auth/login: allowed user names: 5710b280-df15-4de3-a609-c93c418b2f5c
[2018-08-14 13:32:50.764839] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-master-server: accepted client from dhcp42-60.lab.eng.blr.redhat.com-20857-2018/08/14-13:32:50:640529-master-client-2-0-0 (version: 3.12.2)
[2018-08-14 13:44:58.612896] I [addr.c:55:compare_addr_and_update] 0-/rhs/brick1/b3: allowed = "*", received addr = "10.70.42.41"
[2018-08-14 13:44:58.613025] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-master-server: accepted client from dhcp42-41.lab.eng.blr.redhat.com-8160-2018/08/14-13:44:58:502057-master-client-2-0-0 (version: 3.12.2)
[2018-08-14 13:45:55.989063] W [socket.c:593:__socket_rwv] 0-glusterfs: readv on 10.70.41.253:24007 failed (No data available)
[2018-08-14 13:45:55.989163] I [glusterfsd-mgmt.c:2349:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: 10.70.41.253
[2018-08-14 13:45:55.989193] I [glusterfsd-mgmt.c:2370:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
[2018-08-14 13:46:00.003174] W [glusterfsd.c:1367:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dd5) [0x7fd657e6fdd5] -->/usr/sbin/glusterfsd(glusterfs_sigwaiter+0xe5) [0x557d5e50c575] -->/usr/sbin/glusterfsd(cleanup_and_exit+0x6b) [0x557d5e50c3eb] ) 0-: received signum (15), shutting down
[2018-08-14 13:46:00.055398] W [socket.c:593:__socket_rwv] 0-master-changelog: readv on /var/run/gluster/.2bb646d408433506236fffd6edfc4b4830209.sock failed (No data available)
[2018-08-14 13:46:00.055612] I [timer.c:89:gf_timer_call_cancel] (-->/usr/lib64/glusterfs/3.12.2/xlator/features/changelog.so(+0x191fb) [0x7fd64a5e91fb] -->/lib64/libgfrpc.so.0(rpc_clnt_disable+0x4b) [0x7fd658dd943b] -->/lib64/libglusterfs.so.0(gf_timer_call_cancel+0x179) [0x7fd65901efc9] ) 0-timer: ctx cleanup started
[2018-08-14 13:46:00.805317] E [socket.c:3289:socket_connect] 0-glusterfs: connection attempt on 10.70.41.253:24007 failed, (Network is unreachable)
[2018-08-14 13:46:00.805471] I [socket.c:3599:socket_submit_request] 0-glusterfs: not connected (priv->connected = -1)
[2018-08-14 13:46:00.805489] W [rpc-clnt.c:1679:rpc_clnt_submit] 0-glusterfs: failed to submit rpc-request (XID: 0x6 Program: Gluster Portmap, ProgVers: 1, Proc: 5) to rpc-transport (glusterfs)
=========================================================================

So geo-rep is expected to remain Faulty until the brick is completely up.
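
Since the worker stays Faulty only while the brick is down, a quick check/recovery sketch on the rebooted node (standard gluster CLI; not taken from this report):

# confirm whether the brick processes are online after the reboot
gluster volume status master
# if /rhs/brick1/b3 shows Online: N, restart the missing brick processes
gluster volume start master force
# the geo-rep worker should then move out of Faulty to Passive/Active on its own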

Comment 12 errata-xmlrpc 2018-09-04 06:52:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607