Description of problem:
========================
"No such file or directory" errors appear in the master logs when a node is shut down and brought back. Once the node that was shut down is brought back up, one of the two workers on that node shows FAULTY status.

[root@dhcp41-253 ~]# gluster volume geo-replication master 10.70.43.210::slave status

MASTER NODE     MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                  SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED
-----------------------------------------------------------------------------------------------------------------------------------------------------
10.70.41.253    master        /rhs/brick1/b3    root          10.70.43.210::slave    N/A             Faulty     N/A                N/A
10.70.41.253    master        /rhs/brick2/b6    root          10.70.43.210::slave    10.70.43.59     Passive    N/A                N/A
10.70.43.163    master        /rhs/brick1/b1    root          10.70.43.210::slave    10.70.43.210    Active     Changelog Crawl    2018-08-14 13:48:45
10.70.43.163    master        /rhs/brick2/b4    root          10.70.43.210::slave    10.70.43.210    Active     Changelog Crawl    2018-08-14 13:48:45
10.70.42.60     master        /rhs/brick1/b2    root          10.70.43.210::slave    10.70.42.156    Passive    N/A                N/A
10.70.42.60     master        /rhs/brick2/b5    root          10.70.43.210::slave    10.70.42.156    Passive    N/A                N/A

[2018-08-14 13:54:24.721455] E [repce(/rhs/brick1/b3):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line 41, in register
    return Changes.cl_register(cl_brick, cl_dir, cl_log, cl_level, retries)
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 45, in cl_register
    cls.raise_changelog_err()
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 27, in raise_changelog_err
    raise ChangelogException(errn, os.strerror(errn))
ChangelogException: [Errno 2] No such file or directory
[2018-08-14 13:54:24.725533] E [repce(/rhs/brick1/b3):209

Version-Release number of selected component (if applicable):
==============================================================
[root@dhcp41-253 master]# rpm -qa | grep gluster
glusterfs-client-xlators-3.12.2-15.el7rhgs.x86_64
glusterfs-rdma-3.12.2-15.el7rhgs.x86_64
glusterfs-server-3.12.2-15.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-libs-3.12.2-15.el7rhgs.x86_64
glusterfs-api-3.12.2-15.el7rhgs.x86_64
glusterfs-events-3.12.2-15.el7rhgs.x86_64
glusterfs-3.12.2-15.el7rhgs.x86_64
glusterfs-fuse-3.12.2-15.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-15.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.6.x86_64
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch
glusterfs-cli-3.12.2-15.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
python2-gluster-3.12.2-15.el7rhgs.x86_64

How reproducible:
=================
2/2

Steps to Reproduce:
1. Create and start a geo-rep session with 3x3 master and slave volumes.
2. Generate I/O on the master mount:
   for i in {create,chmod,chown,chgrp,symlink,truncate,rename,hardlink}; do crefi --multi -n 5 -b 5 -d 5 --max=1K --min=50 --random -T 2 -t text --fop=$i /node/master/ ; sleep 10 ; done
3. While I/O is happening, shut down the node with the ACTIVE workers. The PASSIVE workers that obtain the lock become ACTIVE and take over syncing.
4. Verify that the checksums of the master and slave match, as expected.
5. After confirming the sync is done, bring back the node that was shut down.

Actual results:
===============
One of the two workers on the node that was brought back shows FAULTY status.

Expected results:
=================
Once the node is brought back, both workers should ideally be in PASSIVE state.

Additional info:
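Step 4's master/slave checksum comparison is normally done with a tree-checksum tool; as a minimal stand-in sketch, a content checksum of two mounted trees can be compared like this (the /tmp/master-mnt and /tmp/slave-mnt paths are placeholders for illustration, not mount points from this report):

```shell
# Hedged sketch: compare content checksums of two directory trees.
# /tmp/master-mnt and /tmp/slave-mnt are hypothetical mount points.
mkdir -p /tmp/master-mnt /tmp/slave-mnt
echo "data" > /tmp/master-mnt/f1
echo "data" > /tmp/slave-mnt/f1

checksum_tree() {
    # Checksum every regular file, then checksum the sorted list so the
    # result does not depend on directory read order.
    (cd "$1" && find . -type f -exec md5sum {} + | sort | md5sum | awk '{print $1}')
}

if [ "$(checksum_tree /tmp/master-mnt)" = "$(checksum_tree /tmp/slave-mnt)" ]; then
    echo "IN SYNC"
else
    echo "OUT OF SYNC"
fi
```

A mismatch here after the failover in step 3 would indicate a sync problem rather than the worker-status problem this bug is about.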
Analysis:
When the node is brought back up, the geo-rep worker is started but cannot connect to the brick, hence the Faulty state. Looking at the libgfchangelog logs, the RPC connection fails for some reason.
======================================================
[2018-08-14 13:32:53.146652] I [MSGID: 132019] [gf-history-changelog.c:935:gf_history_changelog] 0-gfchangelog: FINAL: from: 1534253526, to: 1534253571, changes: 4
[2018-08-14 13:33:08.859846] I [MSGID: 132019] [gf-history-changelog.c:767:gf_changelog_extract_min_max] 0-gfchangelog: MIN: 1534252536, MAX: 1534253586, TOTAL CHANGELOGS: 70
[2018-08-14 13:33:08.860060] I [MSGID: 132019] [gf-history-changelog.c:935:gf_history_changelog] 0-gfchangelog: FINAL: from: 1534253526, to: 1534253586, changes: 5
[2018-08-14 13:54:03.635946] I [MSGID: 132028] [gf-changelog.c:560:gf_changelog_register_generic] 0-gfchangelog: Registering brick: /rhs/brick1/b3 [notify filter: 1]
[2018-08-14 13:54:03.637715] I [MSGID: 101190] [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2018-08-14 13:54:03.637800] I [MSGID: 101190] [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with index 3
[2018-08-14 13:54:03.637966] I [MSGID: 101190] [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2018-08-14 13:54:03.639023] I [MSGID: 101190] [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with index 4
[2018-08-14 13:54:03.653980] I [socket.c:2474:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2018-08-14 13:54:05.654436] I [socket.c:3599:socket_submit_request] 0-gfchangelog: not connected (priv->connected = 0)
[2018-08-14 13:54:05.654469] W [rpc-clnt.c:1679:rpc_clnt_submit] 0-gfchangelog: failed to submit rpc-request (XID: 0x2 Program: LIBGFCHANGELOG, ProgVers: 1, Proc: 1) to rpc-transport (gfchangelog)
[2018-08-14 13:54:05.654492] E [MSGID: 132023] [gf-changelog.c:261:gf_changelog_setup_rpc] 0-gfchangelog: Could not initiate probe RPC, bailing out!!!
[2018-08-14 13:54:05.654560] I [socket.c:2474:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2018-08-14 13:54:05.654891] E [MSGID: 132022] [gf-changelog.c:566:gf_changelog_register_generic] 0-gfchangelog: Error registering with changelog xlator
[2018-08-14 13:54:22.711301] I [MSGID: 132028] [gf-changelog.c:560:gf_changelog_register_generic] 0-gfchangelog: Registering brick: /rhs/brick1/b3 [notify filter: 1]
[2018-08-14 13:54:22.718923] I [socket.c:2474:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2018-08-14 13:54:24.719451] I [socket.c:3599:socket_submit_request] 0-gfchangelog: not connected (priv->connected = 0)
[2018-08-14 13:54:24.719543] W [rpc-clnt.c:1679:rpc_clnt_submit] 0-gfchangelog: failed to submit rpc-request (XID: 0x2 Program: LIBGFCHANGELOG, ProgVers: 1, Proc: 1) to rpc-transport (gfchangelog)
=========================================================
From the brick logs, it looks like the brick didn't come up after the node came back:
- The brick went down around 2018-08-14 13:32.
- The Faulty errors started around 2018-08-14 13:54.
So, assuming the node came back around 13:54, there is no brick log around that time showing the graph being loaded.
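The RPC failure above can be pulled out of a worker's libgfchangelog log mechanically. A minimal sketch follows; to keep it self-contained, the sample log lines are copied from the excerpt above into a temporary file (/tmp/gfchangelog.log is a placeholder, not the real log path):

```shell
# Hedged sketch: count changelog RPC submit failures in a log excerpt.
# The sample lines are taken from this report; a real run would point at
# the geo-rep worker's changelog log file instead of /tmp/gfchangelog.log.
cat > /tmp/gfchangelog.log <<'EOF'
[2018-08-14 13:54:05.654436] I [socket.c:3599:socket_submit_request] 0-gfchangelog: not connected (priv->connected = 0)
[2018-08-14 13:54:05.654469] W [rpc-clnt.c:1679:rpc_clnt_submit] 0-gfchangelog: failed to submit rpc-request (XID: 0x2 Program: LIBGFCHANGELOG, ProgVers: 1, Proc: 1) to rpc-transport (gfchangelog)
[2018-08-14 13:54:05.654492] E [MSGID: 132023] [gf-changelog.c:261:gf_changelog_setup_rpc] 0-gfchangelog: Could not initiate probe RPC, bailing out!!!
EOF

# A non-zero count here corresponds to the worker going Faulty.
failures=$(grep -c 'failed to submit rpc-request' /tmp/gfchangelog.log)
echo "rpc submit failures: $failures"   # prints "rpc submit failures: 1"
```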
So the brick didn't come up
===========================
master-client-2-0-0
[2018-08-14 13:32:49.923031] I [addr.c:55:compare_addr_and_update] 0-/rhs/brick1/b3: allowed = "*", received addr = "10.70.41.253"
[2018-08-14 13:32:49.923099] I [login.c:111:gf_auth] 0-auth/login: allowed user names: 5710b280-df15-4de3-a609-c93c418b2f5c
[2018-08-14 13:32:49.923140] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-master-server: accepted client from dhcp41-253.lab.eng.blr.redhat.com-30249-2018/08/14-13:32:49:818533-master-client-2-0-0 (version: 3.12.2)
[2018-08-14 13:32:50.000634] I [addr.c:55:compare_addr_and_update] 0-/rhs/brick1/b3: allowed = "*", received addr = "10.70.41.253"
[2018-08-14 13:32:50.000703] I [login.c:111:gf_auth] 0-auth/login: allowed user names: 5710b280-df15-4de3-a609-c93c418b2f5c
[2018-08-14 13:32:50.000743] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-master-server: accepted client from dhcp41-253.lab.eng.blr.redhat.com-30252-2018/08/14-13:32:49:880584-master-client-2-0-0 (version: 3.12.2)
[2018-08-14 13:32:50.520496] I [addr.c:55:compare_addr_and_update] 0-/rhs/brick1/b3: allowed = "*", received addr = "10.70.43.163"
[2018-08-14 13:32:50.520636] I [login.c:111:gf_auth] 0-auth/login: allowed user names: 5710b280-df15-4de3-a609-c93c418b2f5c
[2018-08-14 13:32:50.520677] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-master-server: accepted client from dhcp43-163.lab.eng.blr.redhat.com-31883-2018/08/14-13:32:50:413587-master-client-2-0-0 (version: 3.12.2)
[2018-08-14 13:32:50.639809] I [addr.c:55:compare_addr_and_update] 0-/rhs/brick1/b3: allowed = "*", received addr = "10.70.42.60"
[2018-08-14 13:32:50.639878] I [login.c:111:gf_auth] 0-auth/login: allowed user names: 5710b280-df15-4de3-a609-c93c418b2f5c
[2018-08-14 13:32:50.639915] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-master-server: accepted client from dhcp42-60.lab.eng.blr.redhat.com-20854-2018/08/14-13:32:50:531903-master-client-2-0-0 (version: 3.12.2)
[2018-08-14 13:32:50.705107] I [addr.c:55:compare_addr_and_update] 0-/rhs/brick1/b3: allowed = "*", received addr = "10.70.43.163"
[2018-08-14 13:32:50.705235] I [login.c:111:gf_auth] 0-auth/login: allowed user names: 5710b280-df15-4de3-a609-c93c418b2f5c
[2018-08-14 13:32:50.705291] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-master-server: accepted client from dhcp43-163.lab.eng.blr.redhat.com-31895-2018/08/14-13:32:50:595040-master-client-2-0-0 (version: 3.12.2)
[2018-08-14 13:32:50.764690] I [addr.c:55:compare_addr_and_update] 0-/rhs/brick1/b3: allowed = "*", received addr = "10.70.42.60"
[2018-08-14 13:32:50.764778] I [login.c:111:gf_auth] 0-auth/login: allowed user names: 5710b280-df15-4de3-a609-c93c418b2f5c
[2018-08-14 13:32:50.764839] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-master-server: accepted client from dhcp42-60.lab.eng.blr.redhat.com-20857-2018/08/14-13:32:50:640529-master-client-2-0-0 (version: 3.12.2)
[2018-08-14 13:44:58.612896] I [addr.c:55:compare_addr_and_update] 0-/rhs/brick1/b3: allowed = "*", received addr = "10.70.42.41"
[2018-08-14 13:44:58.613025] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-master-server: accepted client from dhcp42-41.lab.eng.blr.redhat.com-8160-2018/08/14-13:44:58:502057-master-client-2-0-0 (version: 3.12.2)
[2018-08-14 13:45:55.989063] W [socket.c:593:__socket_rwv] 0-glusterfs: readv on 10.70.41.253:24007 failed (No data available)
[2018-08-14 13:45:55.989163] I [glusterfsd-mgmt.c:2349:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: 10.70.41.253
[2018-08-14 13:45:55.989193] I [glusterfsd-mgmt.c:2370:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
[2018-08-14 13:46:00.003174] W [glusterfsd.c:1367:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dd5) [0x7fd657e6fdd5] -->/usr/sbin/glusterfsd(glusterfs_sigwaiter+0xe5) [0x557d5e50c575] -->/usr/sbin/glusterfsd(cleanup_and_exit+0x6b) [0x557d5e50c3eb] ) 0-: received signum (15), shutting down
[2018-08-14 13:46:00.055398] W [socket.c:593:__socket_rwv] 0-master-changelog: readv on /var/run/gluster/.2bb646d408433506236fffd6edfc4b4830209.sock failed (No data available)
[2018-08-14 13:46:00.055612] I [timer.c:89:gf_timer_call_cancel] (-->/usr/lib64/glusterfs/3.12.2/xlator/features/changelog.so(+0x191fb) [0x7fd64a5e91fb] -->/lib64/libgfrpc.so.0(rpc_clnt_disable+0x4b) [0x7fd658dd943b] -->/lib64/libglusterfs.so.0(gf_timer_call_cancel+0x179) [0x7fd65901efc9] ) 0-timer: ctx cleanup started
[2018-08-14 13:46:00.805317] E [socket.c:3289:socket_connect] 0-glusterfs: connection attempt on 10.70.41.253:24007 failed, (Network is unreachable)
[2018-08-14 13:46:00.805471] I [socket.c:3599:socket_submit_request] 0-glusterfs: not connected (priv->connected = -1)
[2018-08-14 13:46:00.805489] W [rpc-clnt.c:1679:rpc_clnt_submit] 0-glusterfs: failed to submit rpc-request (XID: 0x6 Program: Gluster Portmap, ProgVers: 1, Proc: 5) to rpc-transport (glusterfs)
=========================================================================
So geo-rep is expected to remain Faulty until the brick is completely up.
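Whether the worker has recovered can be checked by parsing the status output shown at the top of this report. A minimal sketch, using an abbreviated sample of that table embedded inline (a real check would capture the live output of `gluster volume geo-replication master 10.70.43.210::slave status` instead):

```shell
# Hedged sketch: count Faulty workers in geo-rep status output.
# The sample rows are abbreviated from the status table in this report;
# the 7th whitespace-separated field is the worker STATUS column.
status_output='10.70.41.253 master /rhs/brick1/b3 root 10.70.43.210::slave N/A Faulty N/A N/A
10.70.41.253 master /rhs/brick2/b6 root 10.70.43.210::slave 10.70.43.59 Passive N/A N/A
10.70.43.163 master /rhs/brick1/b1 root 10.70.43.210::slave 10.70.43.210 Active Changelog Crawl 2018-08-14 13:48:45'

faulty=$(printf '%s\n' "$status_output" | awk '$7 == "Faulty" {n++} END {print n+0}')
echo "faulty workers: $faulty"   # prints "faulty workers: 1"
```

Once the brick on the returning node is fully up, this count should drop to zero and the worker should settle into PASSIVE, matching the expected results above.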
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2607