Description of problem:
=======================
glustershd fails to start on one of the nodes when we do a volume force start to bring a brick online. The chance of hitting this problem is about 60%.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-server-3.8.4-52.3.el7rhgs.x86_64

How reproducible:
=================
3/5

Steps to Reproduce:
1. Create a brick mux setup.
2. Create about 30 1x3 volumes.
3. Start the volumes.
4. Pump I/O to the base volume and another volume (I created an extra EC volume for this).
5. Now kill a brick, say b1.
6. Use volume force start on any volume (some volume higher in ascending order, say vol15 or vol20, and not the base volume). A scripted sketch of these steps follows this list.
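For convenience, a rough scripted sketch of the reproduction steps. The node names and volume names below are placeholders, not the exact ones from this setup; the brick path and pid-file pattern are taken from the ps output later in this report.

#!/bin/bash
# Repro sketch; adjust NODES, paths and volume names for the cluster
# under test. Assumes a 3-node trusted storage pool.
NODES=(node1.example.com node2.example.com node3.example.com)

# Steps 1-3: enable brick multiplexing, create ~30 1x3 volumes, start them.
gluster volume set all cluster.brick-multiplex on
for i in $(seq 1 30); do
    vol="threeway_${i}"
    gluster volume create "$vol" replica 3 \
        "${NODES[0]}:/rhs/brick1/${vol}" \
        "${NODES[1]}:/rhs/brick1/${vol}" \
        "${NODES[2]}:/rhs/brick1/${vol}"
    gluster volume start "$vol"
done

# Step 4: mount the base volume (plus one extra volume) on a client and
# keep I/O running there while the remaining steps execute.

# Step 5: kill one brick via its pid file (under brick mux, one PID
# hosts many bricks).
kill -9 "$(cat /var/run/gluster/vols/threeway_1/*.pid | head -n1)"

# Step 6: force-start a higher-numbered volume, not the base volume.
gluster volume start threeway_15 force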
Actual results:
===============
shd fails to start on one of the nodes.

[root@dhcp35-192 logs]# gluster v start threeway_4-1 force
volume start: threeway_4-1: success
[root@dhcp35-192 logs]# gluster v status threeway_4-1
Status of volume: threeway_4-1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick dhcp35-192.lab.eng.blr.redhat.com:/rh
s/brick1/threeway_4-1                       49154     0          Y       7620
Brick dhcp35-214.lab.eng.blr.redhat.com:/rh
s/brick1/threeway_4-1                       49152     0          Y       2139
Brick dhcp35-215.lab.eng.blr.redhat.com:/rh
s/brick1/threeway_4-1                       49152     0          Y       2188
Self-heal Daemon on localhost               N/A       N/A        Y       7641
Self-heal Daemon on 10.70.35.214            N/A       N/A        N       N/A
Self-heal Daemon on 10.70.35.215            N/A       N/A        Y       3593

Task Status of Volume threeway_4-1
------------------------------------------------------------------------------
There are no active volume tasks

Two further runs of `gluster v status threeway_4-1` showed the same result: the Self-heal Daemon on 10.70.35.214 stayed offline with no PID.

shd log on the failed node
==========================
[2018-01-02 09:26:05.254393] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-threeway_8-3-client-0: changing port to 49153 (from 0)
[2018-01-02 09:26:05.256812] E [socket.c:2360:socket_connect_finish] 0-threeway_8-2-client-0: connection to 10.70.35.192:49153 failed (Connection refused); disconnecting socket
[2018-01-02 09:26:05.259138] E [socket.c:2360:socket_connect_finish] 0-threeway_8-3-client-0: connection to 10.70.35.192:49153 failed (Connection refused); disconnecting socket
[2018-01-02 09:26:06.836822] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-threeway_9-2-client-0: changing port to 49153 (from 0)
[2018-01-02 09:26:06.836888] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-threeway_9-3-client-0: changing port to 49153 (from 0)
[2018-01-02 09:26:06.842680] E [socket.c:2360:socket_connect_finish] 0-threeway_9-3-client-0: connection to 10.70.35.192:49153 failed (Connection refused); disconnecting socket
[2018-01-02 09:26:06.845448] E [socket.c:2360:socket_connect_finish] 0-threeway_9-2-client-0: connection to 10.70.35.192:49153 failed (Connection refused); disconnecting socket
[2018-01-02 09:25:35.000272] W [MSGID: 114031] [client-rpc-fops.c:1102:client3_3_getxattr_cbk] 0-ecvol_1-client-0: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001). Key: trusted.glusterfs.pathinfo [Transport endpoint is not connected]
[2018-01-02 09:25:35.000272] W [MSGID: 114031] [client-rpc-fops.c:1102:client3_3_getxattr_cbk] 0-ecvol_1-client-3: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001). Key: trusted.glusterfs.pathinfo [Transport endpoint is not connected]
[2018-01-02 09:26:06.871241] W [glusterfsd.c:1300:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7e25) [0x7f82b62b8e25] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x559fe33aa115] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x559fe33a9f3b] ) 0-: received signum (15), shutting down

glusterd log on failed node:
============================
[2018-01-02 09:22:32.598639] I [MSGID: 106568] [glusterd-svc-mgmt.c:229:glusterd_svc_stop] 0-management: scrub service is stopped
[2018-01-02 09:22:31.425126] W [MSGID: 106603] [glusterd-utils.c:5509:find_compat_brick_in_vol] 0-management: cluster.max-bricks-per-process options isn't set. Continuing with no limit set for brick multiplexing.
[2018-01-02 09:26:06.836244] I [MSGID: 106062] [glusterd-volume-ops.c:2625:glusterd_op_start_volume] 0-management: Global dict not present.
[2018-01-02 09:26:06.859478] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs already stopped
[2018-01-02 09:26:06.859541] I [MSGID: 106568] [glusterd-svc-mgmt.c:229:glusterd_svc_stop] 0-management: nfs service is stopped
[2018-01-02 09:26:06.870280] I [MSGID: 106568] [glusterd-proc-mgmt.c:87:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 3431
[2018-01-02 09:26:07.871845] I [MSGID: 106568] [glusterd-svc-mgmt.c:229:glusterd_svc_stop] 0-management: glustershd service is stopped
[2018-01-02 09:26:07.872907] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already stopped
[2018-01-02 09:26:07.872932] I [MSGID: 106568] [glusterd-svc-mgmt.c:229:glusterd_svc_stop] 0-management: bitd service is stopped
[2018-01-02 09:26:07.873365] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: scrub already stopped
[2018-01-02 09:26:07.873378] I [MSGID: 106568] [glusterd-svc-mgmt.c:229:glusterd_svc_stop] 0-management: scrub service is stopped
[2018-01-02 09:26:15.474219] W [socket.c:3252:socket_connect] 0-glustershd: Ignore failed connection attempt on /var/run/gluster/6cadc430f1126cdfc26934cc40d501c8.socket, (No such file or directory)
[2018-01-02 09:26:22.475025] W [socket.c:3252:socket_connect] 0-glustershd: Ignore failed connection attempt on /var/run/gluster/6cadc430f1126cdfc26934cc40d501c8.socket, (No such file or directory)
(the same socket_connect warning repeats every ~7 seconds)
[2018-01-02 09:28:35.493312] W [socket.c:3252:socket_connect] 0-glustershd: Ignore failed connection attempt on /var/run/gluster/6cadc430f1126cdfc26934cc40d501c8.socket, (No such file or directory)

[root@dhcp35-214 glusterfs]# ps -ef|grep gluster
root      1843     1  0 13:19 ?        00:00:08 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
root      2139     1 48 14:35 ?        00:18:32 /usr/sbin/glusterfsd -s dhcp35-214.lab.eng.blr.redhat.com --volfile-id threeway_1-1.dhcp35-214.lab.eng.blr.redhat.com.rhs-brick1-threeway_1-1 -p /var/run/gluster/vols/threeway_1-1/dhcp35-214.lab.eng.blr.redhat.com-rhs-brick1-threeway_1-1.pid -S /var/run/gluster/67a38a9c501ff0734c31814c65b3c067.socket --brick-name /rhs/brick1/threeway_1-1 -l /var/log/glusterfs/bricks/rhs-brick1-threeway_1-1.log --xlator-option *-posix.glusterd-uuid=7b404e5c-b227-4575-8816-75c33688e0be --brick-port 49152 --xlator-option threeway_1-1-server.listen-port=49152
root      3656  1791  0 15:13 pts/0    00:00:00 grep --color=auto gluster
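The ps output above shows no glustershd process on the node, which lines up with glusterd's repeated attempts to reach the shd socket. A minimal sketch of how one might confirm this on the affected node (the socket path is the one from the log above):

# Confirm shd is absent: no process, and the unix socket glusterd
# keeps retrying does not exist.
pgrep -af glustershd || echo "no glustershd process running"
ls -l /var/run/gluster/6cadc430f1126cdfc26934cc40d501c8.socket \
    || echo "shd socket missing, matching the socket_connect warnings"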
Also seen this problem in the below scenario:

Brick mux: have many volumes on a brick mux setup and reboot a node. A couple of times, shd did not start on one of the nodes at the end, e.g. on 10.70.47.44:

[root@dhcp47-44 glusterfs]# gluster v status vol_1-1
Status of volume: vol_1-1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick dhcp46-8.lab.eng.blr.redhat.com:/glus
ter/brick1/vol_1-1                          49152     0          Y       3025
Brick dhcp46-44.lab.eng.blr.redhat.com:/glu
ster/brick1/vol_1-1                         49152     0          Y       8737
Brick dhcp46-53.lab.eng.blr.redhat.com:/glu
ster/brick1/vol_1-1                         49152     0          Y       9846
Brick dhcp46-117.lab.eng.blr.redhat.com:/gl
uster/brick1/vol_1-1                        49152     0          Y       8731
Brick dhcp47-44.lab.eng.blr.redhat.com:/glu
ster/brick1/vol_1-1                         49152     0          Y       2930
Brick dhcp47-181.lab.eng.blr.redhat.com:/gl
uster/brick1/vol_1-1                        49152     0          Y       9521
Self-heal Daemon on localhost               N/A       N/A        N       N/A
Self-heal Daemon on dhcp46-8.lab.eng.blr.re
dhat.com                                    N/A       N/A        Y       3016
Self-heal Daemon on 10.70.46.53             N/A       N/A        Y       26092
Self-heal Daemon on 10.70.46.44             N/A       N/A        Y       24993
Self-heal Daemon on 10.70.47.181            N/A       N/A        Y       25826
Self-heal Daemon on 10.70.46.117            N/A       N/A        Y       24138

Task Status of Volume vol_1-1
------------------------------------------------------------------------------
There are no active volume tasks
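A quick way to watch just the self-heal daemons after a reboot, as a sketch: the `shd` filter to `gluster volume status` limits output to the shd rows, and the awk field math assumes an unwrapped row like the "localhost" one above.

# Print any self-heal daemon rows whose Online column is "N";
# re-run (or loop) until empty or the failure reproduces.
gluster volume status vol_1-1 shd | awk '/Self-heal Daemon/ && $(NF-1) == "N"'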
sosreport: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1530217/
Logs are not
Also refer to bug 1530320 - Brick Multiplexing: brick still down in heal info context (glfs) even though brick is online.
Giving qa_ack, as this would be hit by customers if not fixed, and it is critical.
On a fresh setup, I killed a brick and took a statedump; the pmap entries are cleaned up as expected. Also, I have tried the steps in the description 5 times and didn't hit this issue, hence moving to verified on 3.8.4-52.4.

Before brick kill:
glusterd.pmap_port=49152
glusterd.pmap[49152].type=4
glusterd.pmap[49152].brickname=/gluster/brick1/ecvol_1-1 /gluster/brick2/ecvol_1-2 /gluster/brick3/ecvol_1-3 /gluster/brick1/ecvol_2-1 /gluster/brick2/ecvol_2-2 /gluster/brick3/ecvol_2-3 /gluster/brick1/ecvol_3-1 /gluster/brick2/ecvol_3-2 /gluster/brick3/ecvol_3-3
glusterd.client1.identifier=10.70.46.44:1023

Post brick kill:
glusterd.pmap_port=49152
glusterd.pmap[49152].type=0
glusterd.pmap[49152].brickname=(null)

However, there were stale entries and a few inconsistent behaviors with port mapping; will raise new bugs accordingly, e.g.:
1531452 - Brick Multiplexing: Stale glusterd portmapping entries to glusterfsd getting created sometimes on glusterd restart

Brick mux breaks by spawning 2 glusterfsd processes instead of one, which I feel we can live with for now (and there is a bug reported already).
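For reference, the pmap entries above come from a glusterd statedump; a minimal way to pull them, assuming the usual statedump behavior (glusterfs daemons dump state on SIGUSR1; the dump file name pattern can vary by version):

# Trigger a glusterd statedump and grep the portmap section.
kill -USR1 "$(pidof glusterd)"
sleep 1
dump=$(ls -t /var/run/gluster/glusterdump.* 2>/dev/null | head -n1)
grep 'glusterd.pmap' "$dump"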
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0083