Description of problem:
========================
"No such file or directory" errors appear in the master logs when a node is shut down and brought back. Once the node that was shut down is brought back up, one of the two workers on that node shows FAULTY status.

[root@dhcp41-253 ~]# gluster volume geo-replication master 10.70.43.210::slave status

MASTER NODE     MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                  SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED
-----------------------------------------------------------------------------------------------------------------------------------------------------
10.70.41.253    master        /rhs/brick1/b3    root          10.70.43.210::slave    N/A             Faulty     N/A                N/A
10.70.41.253    master        /rhs/brick2/b6    root          10.70.43.210::slave    10.70.43.59     Passive    N/A                N/A
10.70.43.163    master        /rhs/brick1/b1    root          10.70.43.210::slave    10.70.43.210    Active     Changelog Crawl    2018-08-14 13:48:45
10.70.43.163    master        /rhs/brick2/b4    root          10.70.43.210::slave    10.70.43.210    Active     Changelog Crawl    2018-08-14 13:48:45
10.70.42.60     master        /rhs/brick1/b2    root          10.70.43.210::slave    10.70.42.156    Passive    N/A                N/A
10.70.42.60     master        /rhs/brick2/b5    root          10.70.43.210::slave    10.70.42.156    Passive    N/A                N/A

[2018-08-14 13:54:24.721455] E [repce(/rhs/brick1/b3):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line 41, in register
    return Changes.cl_register(cl_brick, cl_dir, cl_log, cl_level, retries)
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 45, in cl_register
    cls.raise_changelog_err()
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 27, in raise_changelog_err
    raise ChangelogException(errn, os.strerror(errn))
ChangelogException: [Errno 2] No such file or directory
[2018-08-14 13:54:24.725533] E [repce(/rhs/brick1/b3):209

Version-Release number of selected component (if applicable):
==============================================================
[root@dhcp41-253 master]# rpm -qa | grep gluster
glusterfs-client-xlators-3.12.2-15.el7rhgs.x86_64
glusterfs-rdma-3.12.2-15.el7rhgs.x86_64
glusterfs-server-3.12.2-15.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-libs-3.12.2-15.el7rhgs.x86_64
glusterfs-api-3.12.2-15.el7rhgs.x86_64
glusterfs-events-3.12.2-15.el7rhgs.x86_64
glusterfs-3.12.2-15.el7rhgs.x86_64
glusterfs-fuse-3.12.2-15.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-15.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.6.x86_64
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch
glusterfs-cli-3.12.2-15.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
python2-gluster-3.12.2-15.el7rhgs.x86_64

How reproducible:
=================
2/2

Steps to Reproduce:
1. Create and start a geo-rep session with 3x3 master and slave volumes.
2. Generate I/O on the master mount:
   for i in {create,chmod,chown,chgrp,symlink,truncate,rename,hardlink}; do crefi --multi -n 5 -b 5 -d 5 --max=1K --min=50 --random -T 2 -t text --fop=$i /node/master/ ; sleep 10 ; done
3. While I/O is happening, shut down the node with the ACTIVE workers. The PASSIVE workers that obtain the lock become ACTIVE and take over syncing.
4. Verify that the checksums of the master and slave match, as expected.
5. After confirming the sync is done, bring back the node that was shut down.

Actual results:
===============
One of the two workers on the node that was brought back shows FAULTY status.

Expected results:
=================
Once the node is brought back, both workers should ideally be in PASSIVE state.

Additional info:
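Step 4's master/slave checksum comparison is normally done with a tree-checksum tool; as a minimal stand-in sketch, a content checksum of two mounted trees can be compared like this (the /tmp/master-mnt and /tmp/slave-mnt paths are placeholders for illustration, not mount points from this report):

```shell
# Hedged sketch: compare content checksums of two directory trees.
# /tmp/master-mnt and /tmp/slave-mnt are hypothetical mount points.
mkdir -p /tmp/master-mnt /tmp/slave-mnt
echo "data" > /tmp/master-mnt/f1
echo "data" > /tmp/slave-mnt/f1

checksum_tree() {
    # Checksum every regular file, then checksum the sorted list so the
    # result does not depend on directory read order.
    (cd "$1" && find . -type f -exec md5sum {} + | sort | md5sum | awk '{print $1}')
}

if [ "$(checksum_tree /tmp/master-mnt)" = "$(checksum_tree /tmp/slave-mnt)" ]; then
    echo "IN SYNC"
else
    echo "OUT OF SYNC"
fi
```

A mismatch here after the failover in step 3 would indicate a sync problem rather than the worker-status problem this bug is about.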
Analysis:
When the node is brought back up, the geo-rep worker is started but cannot connect to the brick, hence the Faulty state. Looking at the libgfchangelog logs, the RPC connection fails for some reason.
======================================================
[2018-08-14 13:32:53.146652] I [MSGID: 132019] [gf-history-changelog.c:935:gf_history_changelog] 0-gfchangelog: FINAL: from: 1534253526, to: 1534253571, changes: 4
[2018-08-14 13:33:08.859846] I [MSGID: 132019] [gf-history-changelog.c:767:gf_changelog_extract_min_max] 0-gfchangelog: MIN: 1534252536, MAX: 1534253586, TOTAL CHANGELOGS: 70
[2018-08-14 13:33:08.860060] I [MSGID: 132019] [gf-history-changelog.c:935:gf_history_changelog] 0-gfchangelog: FINAL: from: 1534253526, to: 1534253586, changes: 5
[2018-08-14 13:54:03.635946] I [MSGID: 132028] [gf-changelog.c:560:gf_changelog_register_generic] 0-gfchangelog: Registering brick: /rhs/brick1/b3 [notify filter: 1]
[2018-08-14 13:54:03.637715] I [MSGID: 101190] [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2018-08-14 13:54:03.637800] I [MSGID: 101190] [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with index 3
[2018-08-14 13:54:03.637966] I [MSGID: 101190] [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2018-08-14 13:54:03.639023] I [MSGID: 101190] [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with index 4
[2018-08-14 13:54:03.653980] I [socket.c:2474:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2018-08-14 13:54:05.654436] I [socket.c:3599:socket_submit_request] 0-gfchangelog: not connected (priv->connected = 0)
[2018-08-14 13:54:05.654469] W [rpc-clnt.c:1679:rpc_clnt_submit] 0-gfchangelog: failed to submit rpc-request (XID: 0x2 Program: LIBGFCHANGELOG, ProgVers: 1, Proc: 1) to rpc-transport (gfchangelog)
[2018-08-14 13:54:05.654492] E [MSGID: 132023] [gf-changelog.c:261:gf_changelog_setup_rpc] 0-gfchangelog: Could not initiate probe RPC, bailing out!!!
[2018-08-14 13:54:05.654560] I [socket.c:2474:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2018-08-14 13:54:05.654891] E [MSGID: 132022] [gf-changelog.c:566:gf_changelog_register_generic] 0-gfchangelog: Error registering with changelog xlator
[2018-08-14 13:54:22.711301] I [MSGID: 132028] [gf-changelog.c:560:gf_changelog_register_generic] 0-gfchangelog: Registering brick: /rhs/brick1/b3 [notify filter: 1]
[2018-08-14 13:54:22.718923] I [socket.c:2474:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2018-08-14 13:54:24.719451] I [socket.c:3599:socket_submit_request] 0-gfchangelog: not connected (priv->connected = 0)
[2018-08-14 13:54:24.719543] W [rpc-clnt.c:1679:rpc_clnt_submit] 0-gfchangelog: failed to submit rpc-request (XID: 0x2 Program: LIBGFCHANGELOG, ProgVers: 1, Proc: 1) to rpc-transport (gfchangelog)
=========================================================
From the brick logs, it looks like the brick didn't come up after the node came back:
- The brick went down around 2018-08-14 13:32.
- The Faulty errors started around 2018-08-14 13:54.
So, assuming the node came back around 13:54, there is no brick log around that time showing the graph being loaded.
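The RPC failure above can be pulled out of a worker's libgfchangelog log mechanically. A minimal sketch follows; to keep it self-contained, the sample log lines are copied from the excerpt above into a temporary file (/tmp/gfchangelog.log is a placeholder, not the real log path):

```shell
# Hedged sketch: count changelog RPC submit failures in a log excerpt.
# The sample lines are taken from this report; a real run would point at
# the geo-rep worker's changelog log file instead of /tmp/gfchangelog.log.
cat > /tmp/gfchangelog.log <<'EOF'
[2018-08-14 13:54:05.654436] I [socket.c:3599:socket_submit_request] 0-gfchangelog: not connected (priv->connected = 0)
[2018-08-14 13:54:05.654469] W [rpc-clnt.c:1679:rpc_clnt_submit] 0-gfchangelog: failed to submit rpc-request (XID: 0x2 Program: LIBGFCHANGELOG, ProgVers: 1, Proc: 1) to rpc-transport (gfchangelog)
[2018-08-14 13:54:05.654492] E [MSGID: 132023] [gf-changelog.c:261:gf_changelog_setup_rpc] 0-gfchangelog: Could not initiate probe RPC, bailing out!!!
EOF

# A non-zero count here corresponds to the worker going Faulty.
failures=$(grep -c 'failed to submit rpc-request' /tmp/gfchangelog.log)
echo "rpc submit failures: $failures"   # prints "rpc submit failures: 1"
```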
So the brick didn't come up
===========================
master-client-2-0-0
[2018-08-14 13:32:49.923031] I [addr.c:55:compare_addr_and_update] 0-/rhs/brick1/b3: allowed = "*", received addr = "10.70.41.253"
[2018-08-14 13:32:49.923099] I [login.c:111:gf_auth] 0-auth/login: allowed user names: 5710b280-df15-4de3-a609-c93c418b2f5c
[2018-08-14 13:32:49.923140] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-master-server: accepted client from dhcp41-253.lab.eng.blr.redhat.com-30249-2018/08/14-13:32:49:818533-master-client-2-0-0 (version: 3.12.2)
[2018-08-14 13:32:50.000634] I [addr.c:55:compare_addr_and_update] 0-/rhs/brick1/b3: allowed = "*", received addr = "10.70.41.253"
[2018-08-14 13:32:50.000703] I [login.c:111:gf_auth] 0-auth/login: allowed user names: 5710b280-df15-4de3-a609-c93c418b2f5c
[2018-08-14 13:32:50.000743] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-master-server: accepted client from dhcp41-253.lab.eng.blr.redhat.com-30252-2018/08/14-13:32:49:880584-master-client-2-0-0 (version: 3.12.2)
[2018-08-14 13:32:50.520496] I [addr.c:55:compare_addr_and_update] 0-/rhs/brick1/b3: allowed = "*", received addr = "10.70.43.163"
[2018-08-14 13:32:50.520636] I [login.c:111:gf_auth] 0-auth/login: allowed user names: 5710b280-df15-4de3-a609-c93c418b2f5c
[2018-08-14 13:32:50.520677] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-master-server: accepted client from dhcp43-163.lab.eng.blr.redhat.com-31883-2018/08/14-13:32:50:413587-master-client-2-0-0 (version: 3.12.2)
[2018-08-14 13:32:50.639809] I [addr.c:55:compare_addr_and_update] 0-/rhs/brick1/b3: allowed = "*", received addr = "10.70.42.60"
[2018-08-14 13:32:50.639878] I [login.c:111:gf_auth] 0-auth/login: allowed user names: 5710b280-df15-4de3-a609-c93c418b2f5c
[2018-08-14 13:32:50.639915] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-master-server: accepted client from dhcp42-60.lab.eng.blr.redhat.com-20854-2018/08/14-13:32:50:531903-master-client-2-0-0 (version: 3.12.2)
[2018-08-14 13:32:50.705107] I [addr.c:55:compare_addr_and_update] 0-/rhs/brick1/b3: allowed = "*", received addr = "10.70.43.163"
[2018-08-14 13:32:50.705235] I [login.c:111:gf_auth] 0-auth/login: allowed user names: 5710b280-df15-4de3-a609-c93c418b2f5c
[2018-08-14 13:32:50.705291] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-master-server: accepted client from dhcp43-163.lab.eng.blr.redhat.com-31895-2018/08/14-13:32:50:595040-master-client-2-0-0 (version: 3.12.2)
[2018-08-14 13:32:50.764690] I [addr.c:55:compare_addr_and_update] 0-/rhs/brick1/b3: allowed = "*", received addr = "10.70.42.60"
[2018-08-14 13:32:50.764778] I [login.c:111:gf_auth] 0-auth/login: allowed user names: 5710b280-df15-4de3-a609-c93c418b2f5c
[2018-08-14 13:32:50.764839] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-master-server: accepted client from dhcp42-60.lab.eng.blr.redhat.com-20857-2018/08/14-13:32:50:640529-master-client-2-0-0 (version: 3.12.2)
[2018-08-14 13:44:58.612896] I [addr.c:55:compare_addr_and_update] 0-/rhs/brick1/b3: allowed = "*", received addr = "10.70.42.41"
[2018-08-14 13:44:58.613025] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-master-server: accepted client from dhcp42-41.lab.eng.blr.redhat.com-8160-2018/08/14-13:44:58:502057-master-client-2-0-0 (version: 3.12.2)
[2018-08-14 13:45:55.989063] W [socket.c:593:__socket_rwv] 0-glusterfs: readv on 10.70.41.253:24007 failed (No data available)
[2018-08-14 13:45:55.989163] I [glusterfsd-mgmt.c:2349:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: 10.70.41.253
[2018-08-14 13:45:55.989193] I [glusterfsd-mgmt.c:2370:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
[2018-08-14 13:46:00.003174] W [glusterfsd.c:1367:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dd5) [0x7fd657e6fdd5] -->/usr/sbin/glusterfsd(glusterfs_sigwaiter+0xe5) [0x557d5e50c575] -->/usr/sbin/glusterfsd(cleanup_and_exit+0x6b) [0x557d5e50c3eb] ) 0-: received signum (15), shutting down
[2018-08-14 13:46:00.055398] W [socket.c:593:__socket_rwv] 0-master-changelog: readv on /var/run/gluster/.2bb646d408433506236fffd6edfc4b4830209.sock failed (No data available)
[2018-08-14 13:46:00.055612] I [timer.c:89:gf_timer_call_cancel] (-->/usr/lib64/glusterfs/3.12.2/xlator/features/changelog.so(+0x191fb) [0x7fd64a5e91fb] -->/lib64/libgfrpc.so.0(rpc_clnt_disable+0x4b) [0x7fd658dd943b] -->/lib64/libglusterfs.so.0(gf_timer_call_cancel+0x179) [0x7fd65901efc9] ) 0-timer: ctx cleanup started
[2018-08-14 13:46:00.805317] E [socket.c:3289:socket_connect] 0-glusterfs: connection attempt on 10.70.41.253:24007 failed, (Network is unreachable)
[2018-08-14 13:46:00.805471] I [socket.c:3599:socket_submit_request] 0-glusterfs: not connected (priv->connected = -1)
[2018-08-14 13:46:00.805489] W [rpc-clnt.c:1679:rpc_clnt_submit] 0-glusterfs: failed to submit rpc-request (XID: 0x6 Program: Gluster Portmap, ProgVers: 1, Proc: 5) to rpc-transport (glusterfs)
=========================================================================
So geo-rep is expected to remain Faulty until the brick is completely up.
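Whether the worker has recovered can be checked by parsing the status output shown at the top of this report. A minimal sketch, using an abbreviated sample of that table embedded inline (a real check would capture the live output of `gluster volume geo-replication master 10.70.43.210::slave status` instead):

```shell
# Hedged sketch: count Faulty workers in geo-rep status output.
# The sample rows are abbreviated from the status table in this report;
# the 7th whitespace-separated field is the worker STATUS column.
status_output='10.70.41.253 master /rhs/brick1/b3 root 10.70.43.210::slave N/A Faulty N/A N/A
10.70.41.253 master /rhs/brick2/b6 root 10.70.43.210::slave 10.70.43.59 Passive N/A N/A
10.70.43.163 master /rhs/brick1/b1 root 10.70.43.210::slave 10.70.43.210 Active Changelog Crawl 2018-08-14 13:48:45'

faulty=$(printf '%s\n' "$status_output" | awk '$7 == "Faulty" {n++} END {print n+0}')
echo "faulty workers: $faulty"   # prints "faulty workers: 1"
```

Once the brick on the returning node is fully up, this count should drop to zero and the worker should settle into PASSIVE, matching the expected results above.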
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2607