Description of problem:
=======================
Had a 6-node cluster with a couple of distribute/dist-replicate volumes. Enabled brick-multiplexing and created a 1 x (4+2) disperse volume. Attached a plain distribute 4 x 1 volume as hot tier. Created a few more distribute/dist-rep volumes even after that, and continued with testing. At the end of the day, disabled brick-multiplexing and did not do anything further.

When the tier volume was assessed again after a few days, it was noticed that all the tier daemons were in failed state. The logs (the older ones, pointing to the day the testing was done on that setup) show a failed socket connection eventually resulting in failure of the daemon.

Sosreports will be copied to http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/<bugnumber>/

Version-Release number of selected component (if applicable):
=============================================================
3.8.4-23

How reproducible:
=================
1:1

Steps to Reproduce:
===================
Nothing more than what is already mentioned in the description. However, I'll be trying out a few scenarios with tier volumes and brick-multiplexing to see if there are any sure-shot steps to reproduce this.
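For reference, the sequence described above can be sketched with the gluster CLI. This is a sketch only: hostnames and brick paths are placeholders (the actual setup used the 10.70.47.x nodes and /bricks/brick3, /bricks/brick4 paths shown in the volume info below), and it requires a running 6-node trusted storage pool.

```shell
# Sketch of the setup from the description -- placeholder hosts/paths.

# 1. Enable brick multiplexing cluster-wide (global option, hence "all")
gluster volume set all cluster.brick-multiplex on

# 2. Create and start a 1 x (4+2) disperse volume
gluster volume create disp disperse-data 4 redundancy 2 \
    host1:/bricks/brick3/disp_0 host2:/bricks/brick3/disp_1 \
    host3:/bricks/brick3/disp_2 host4:/bricks/brick3/disp_3 \
    host5:/bricks/brick3/disp_4 host6:/bricks/brick3/disp_5
gluster volume start disp

# 3. Attach a plain distribute 4 x 1 hot tier
gluster volume tier disp attach \
    host1:/bricks/brick4/disp_tier0 host2:/bricks/brick4/disp_tier1 \
    host3:/bricks/brick4/disp_tier2 host4:/bricks/brick4/disp_tier3

# 4. At the end of the day: disable brick multiplexing again
gluster volume set all cluster.brick-multiplex off
```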
Additional info:
================
tier-logs
---------
[2017-04-25 07:36:33.543874] I [glusterfsd-mgmt.c:2150:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
[2017-04-25 07:36:36.736800] W [glusterfsd.c:1288:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dc5) [0x7f5487964dc5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x7f5488ffbf05] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x7f5488ffbd6b] ) 0-: received signum (15), shutting down
[2017-04-25 08:23:36.195081] I [MSGID: 100030] [glusterfsd.c:2417:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.8.4 (args: /usr/sbin/glusterfs -s localhost --volfile-id rebalance/disp --xlator-option *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off --xlator-option *dht.readdir-optimize=on --xlator-option *tier-dht.xattr-name=trusted.tier.tier-dht --xlator-option *dht.rebalance-cmd=6 --xlator-option *dht.node-uuid=49610061-1788-4cbc-9205-0e59fe91d842 --xlator-option *dht.commit-hash=0 --socket-file /var/run/gluster/gluster-tier-ca8ba15e-1c0e-463c-b041-76bca48b0330.sock --pid-file /var/lib/glusterd/vols/disp/tier/49610061-1788-4cbc-9205-0e59fe91d842.pid -l /var/log/glusterfs/disp-tier.log)
[2017-04-25 08:23:36.217403] I [MSGID: 101190] [event-epoll.c:629:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2017-04-25 08:23:36.217511] E [socket.c:2318:socket_connect_finish] 0-glusterfs: connection to ::1:24007 failed (Connection refused); disconnecting socket
[2017-04-25 08:23:36.217543] I [glusterfsd-mgmt.c:2129:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: localhost
...
[2017-04-25 08:23:46.628316] I [MSGID: 114020] [client.c:2356:notify] 0-disp-client-3: parent translators are ready, attempting connect on transport
[2017-04-25 08:23:46.641409] I [MSGID: 101190] [event-epoll.c:629:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2017-04-25 08:23:46.643834] E [MSGID: 114058] [client-handshake.c:1537:client_query_portmap_cbk] 0-disp-client-0: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2017-04-25 08:23:46.643915] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-disp-client-0: disconnected from disp-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2017-04-25 08:23:46.644158] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-disp-client-1: changing port to 49153 (from 0)
[2017-04-25 08:23:46.644268] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-disp-client-2: changing port to 49153 (from 0)
[2017-04-25 08:23:46.648628] I [MSGID: 114020] [client.c:2356:notify] 0-disp-client-4: parent translators are ready, attempting connect on transport
...
[2017-04-25 08:23:50.265413] I [MSGID: 0] [dht-rebalance.c:3730:gf_defrag_total_file_cnt] 0-disp-tier-dht: Total number of files = 75
[2017-04-25 08:23:50.265437] E [MSGID: 0] [dht-rebalance.c:3893:gf_defrag_start_crawl] 0-disp-tier-dht: Failed to get the total number of files. Unable to estimate time to complete rebalance.
[2017-04-25 08:23:50.265844] I [dht-rebalance.c:3938:gf_defrag_start_crawl] 0-DHT: Thread[0] creation successful
[2017-04-25 08:23:50.265918] I [dht-rebalance.c:3938:gf_defrag_start_crawl] 0-DHT: Thread[1] creation successful

[root@dhcp47-121 ~]# gluster pool list
UUID                                    Hostname                                State
a0557927-4e5e-4ff7-8dce-94873f867707    dhcp47-113.lab.eng.blr.redhat.com       Connected
c0dac197-5a4d-4db7-b709-dbf8b8eb0896    dhcp47-114.lab.eng.blr.redhat.com       Connected
f828fdfa-e08f-4d12-85d8-2121cafcf9d0    dhcp47-115.lab.eng.blr.redhat.com       Connected
a96e0244-b5ce-4518-895c-8eb453c71ded    dhcp47-116.lab.eng.blr.redhat.com       Connected
17eb3cef-17e7-4249-954b-fc19ec608304    dhcp47-117.lab.eng.blr.redhat.com       Connected
49610061-1788-4cbc-9205-0e59fe91d842    localhost                               Connected

[root@dhcp47-121 ~]# gluster v list
disp
dist
distrep
distrep2
distrep3

[root@dhcp47-121 ~]# gluster v info disp

Volume Name: disp
Type: Tier
Volume ID: ca8ba15e-1c0e-463c-b041-76bca48b0330
Status: Started
Snapshot Count: 0
Number of Bricks: 10
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distribute
Number of Bricks: 4
Brick1: 10.70.47.115:/bricks/brick4/disp_tier3
Brick2: 10.70.47.114:/bricks/brick4/disp_tier2
Brick3: 10.70.47.113:/bricks/brick4/disp_tier1
Brick4: 10.70.47.121:/bricks/brick4/disp_tier0
Cold Tier:
Cold Tier Type : Disperse
Number of Bricks: 1 x (4 + 2) = 6
Brick5: 10.70.47.121:/bricks/brick3/disp_0
Brick6: 10.70.47.113:/bricks/brick3/disp_1
Brick7: 10.70.47.114:/bricks/brick3/disp_2
Brick8: 10.70.47.115:/bricks/brick3/disp_3
Brick9: 10.70.47.116:/bricks/brick3/disp_4
Brick10: 10.70.47.117:/bricks/brick3/disp_5
Options Reconfigured:
nfs.disable: on
transport.address-family: inet
features.bitrot: on
features.scrub: Active
features.quota: on
features.inode-quota: on
features.quota-deem-statfs: on
features.scrub-freq: hourly
performance.stat-prefetch: on
features.ctr-enabled: on
cluster.tier-mode: cache
cluster.brick-multiplex: disable

[root@dhcp47-121 ~]# gluster v status disp
Status of volume: disp
Gluster process                                         TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Hot Bricks:
Brick 10.70.47.115:/bricks/brick4/disp_tier3            49152     0          Y       14509
Brick 10.70.47.114:/bricks/brick4/disp_tier2            49152     0          Y       16405
Brick 10.70.47.113:/bricks/brick4/disp_tier1            49152     0          Y       16834
Brick 10.70.47.121:/bricks/brick4/disp_tier0            49152     0          Y       23606
Cold Bricks:
Brick 10.70.47.121:/bricks/brick3/disp_0                49153     0          Y       23612
Brick 10.70.47.113:/bricks/brick3/disp_1                49153     0          Y       16841
Brick 10.70.47.114:/bricks/brick3/disp_2                49153     0          Y       16406
Brick 10.70.47.115:/bricks/brick3/disp_3                49153     0          Y       14515
Brick 10.70.47.116:/bricks/brick3/disp_4                49152     0          Y       30238
Brick 10.70.47.117:/bricks/brick3/disp_5                49152     0          Y       25198
Self-heal Daemon on localhost                           N/A       N/A        Y       23558
Quota Daemon on localhost                               N/A       N/A        Y       23567
Bitrot Daemon on localhost                              N/A       N/A        Y       13552
Scrubber Daemon on localhost                            N/A       N/A        Y       13566
Self-heal Daemon on dhcp47-113.lab.eng.blr.redhat.com   N/A       N/A        Y       16790
Quota Daemon on dhcp47-113.lab.eng.blr.redhat.com       N/A       N/A        Y       16804
Bitrot Daemon on dhcp47-113.lab.eng.blr.redhat.com      N/A       N/A        Y       27284
Scrubber Daemon on dhcp47-113.lab.eng.blr.redhat.com    N/A       N/A        Y       27298
Self-heal Daemon on dhcp47-114.lab.eng.blr.redhat.com   N/A       N/A        Y       16365
Quota Daemon on dhcp47-114.lab.eng.blr.redhat.com       N/A       N/A        Y       16374
Bitrot Daemon on dhcp47-114.lab.eng.blr.redhat.com      N/A       N/A        Y       27085
Scrubber Daemon on dhcp47-114.lab.eng.blr.redhat.com    N/A       N/A        Y       27099
Self-heal Daemon on dhcp47-115.lab.eng.blr.redhat.com   N/A       N/A        Y       14469
Quota Daemon on dhcp47-115.lab.eng.blr.redhat.com       N/A       N/A        Y       14478
Bitrot Daemon on dhcp47-115.lab.eng.blr.redhat.com      N/A       N/A        Y       24984
Scrubber Daemon on dhcp47-115.lab.eng.blr.redhat.com    N/A       N/A        Y       24998
Self-heal Daemon on dhcp47-117.lab.eng.blr.redhat.com   N/A       N/A        Y       25143
Quota Daemon on dhcp47-117.lab.eng.blr.redhat.com       N/A       N/A        Y       25152
Bitrot Daemon on dhcp47-117.lab.eng.blr.redhat.com      N/A       N/A        Y       2447
Scrubber Daemon on dhcp47-117.lab.eng.blr.redhat.com    N/A       N/A        Y       2460
Self-heal Daemon on dhcp47-116.lab.eng.blr.redhat.com   N/A       N/A        Y       30198
Quota Daemon on dhcp47-116.lab.eng.blr.redhat.com       N/A       N/A        Y       30207
Bitrot Daemon on dhcp47-116.lab.eng.blr.redhat.com      N/A       N/A        Y       8011
Scrubber Daemon on dhcp47-116.lab.eng.blr.redhat.com    N/A       N/A        Y       8024

Task Status of Volume disp
------------------------------------------------------------------------------
Task                 : Tier migration
ID                   : 31a36238-7edc-46d5-8eea-3bf8d25a2599
Status               : in progress

[root@dhcp47-121 ~]# gluster v tier disp status
Node                                Promoted files   Demoted files   Status
---------                           ---------        ---------       ---------
localhost                           0                0               in progress
dhcp47-113.lab.eng.blr.redhat.com   0                0               failed
dhcp47-114.lab.eng.blr.redhat.com   0                0               failed
dhcp47-115.lab.eng.blr.redhat.com   0                0               failed
dhcp47-116.lab.eng.blr.redhat.com   0                0               failed
dhcp47-117.lab.eng.blr.redhat.com   0                0               failed
Tiering Migration Functionality: disp: success

[root@dhcp47-121 ~]# rpm -qa | grep gluster
glusterfs-api-3.8.4-23.el7rhgs.x86_64
glusterfs-3.8.4-23.el7rhgs.x86_64
glusterfs-cli-3.8.4-23.el7rhgs.x86_64
glusterfs-fuse-3.8.4-23.el7rhgs.x86_64
glusterfs-server-3.8.4-23.el7rhgs.x86_64
glusterfs-events-3.8.4-23.el7rhgs.x86_64
vdsm-gluster-4.17.33-1.1.el7rhgs.noarch
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-client-xlators-3.8.4-23.el7rhgs.x86_64
python-gluster-3.8.4-23.el7rhgs.noarch
gluster-nagios-addons-0.2.8-1.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-23.el7rhgs.x86_64
glusterfs-rdma-3.8.4-23.el7rhgs.x86_64
glusterfs-libs-3.8.4-23.el7rhgs.x86_64
[root@dhcp47-121 ~]#
[qe@rhsqe-repo 1447920]$ hostname
rhsqe-repo.lab.eng.blr.redhat.com
[qe@rhsqe-repo 1447920]$ pwd
/home/repo/sosreports/1447920
[qe@rhsqe-repo 1447920]$ ll
total 279940
-rwxr-xr-x. 1 qe qe 47589384 May  4 15:48 sosreport-sysreg-prod_dhcp47-113-20170504044728.tar.xz
-rwxr-xr-x. 1 qe qe 48654484 May  4 15:48 sosreport-sysreg-prod_dhcp47-114-20170504044728.tar.xz
-rwxr-xr-x. 1 qe qe 48254232 May  4 15:48 sosreport-sysreg-prod_dhcp47-115-20170504044728.tar.xz
-rwxr-xr-x. 1 qe qe 45685132 May  4 15:48 sosreport-sysreg-prod_dhcp47-116-20170504044728.tar.xz
-rwxr-xr-x. 1 qe qe 46089020 May  4 15:48 sosreport-sysreg-prod_dhcp47-117-20170504044728.tar.xz
-rwxr-xr-x. 1 qe qe 50375272 May  4 15:48 sosreport-sysreg-prod_dhcp47-121-20170504044728.tar.xz
[qe@rhsqe-repo 1447920]$
RCA:
From the logs, it can be seen that brick multiplexing was enabled, a volume was created and converted into a tiered volume, and multiplexing was then disabled, after which the upgrade was done. After the upgrade, tierd did not come up on the node, as it was unable to connect to its subvolumes:

[2017-04-25 08:23:46.587380] E [MSGID: 114058] [client-handshake.c:1537:client_query_portmap_cbk] 0-disp-client-6: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.

The daemon kept retrying the connection to glusterd for a while and, as it was unable to connect, it got a child-down event:

[2017-04-25 08:23:25.458734] E [socket.c:2318:socket_connect_finish] 0-glusterfs: connection to ::1:24007 failed (Connection refused); disconnecting socket

This patch fixes the socket issue: https://review.gluster.org/#/c/17101/
Upstream patches:
https://review.gluster.org/#/q/topic:bug-1444596

Downstream patches:
https://code.engineering.redhat.com/gerrit/#/c/105595/
https://code.engineering.redhat.com/gerrit/#/c/105596/
Build: 3.8.4-36

Enabled brick mux, created a volume, and attached a tier. Disabled brick mux and rebooted one node. The tier daemons are coming up after that.

@rahul, is switching brick-mux on and off while volumes are present a recommended operation?
BUILD: 3.8.4-38

1. Enabled brick-mux
2. Created a volume and made it a tiered volume
3. Disabled brick-mux (tried with and without this step)
4. Restarted glusterd
5. Tier daemons are coming up (visible in status too: gluster vol tier <vol> status)

Hence marking it as verified.
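Steps 3-5 above, as a command sketch. This assumes a tiered volume (here called "disp" as a placeholder) already exists on a pool running the build under test; it is illustrative, not a verbatim transcript of the verification run.

```shell
# Verification sketch for steps 3-5 -- placeholder volume name "disp".

gluster volume set all cluster.brick-multiplex off   # step 3 (also verified with this step skipped)
systemctl restart glusterd                           # step 4, on each node
gluster volume tier disp status                      # step 5: nodes should report "in progress", not "failed"
```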
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774