Description of problem:
=========================
With brick multiplexing enabled, if the volume whose name and brick paths are used as the base name for the shared brick processes is deleted and recreated, the bricks of all the volumes associated with that base process go down. We see posix health-check failures (which normally appear only when a brick of a volume is deleted from the backend):

rhs-brick1-dr1[19704]: [2017-04-24 14:31:00.182013] M [MSGID: 113075] [posix-helpers.c:1838:posix_health_check_thread_proc] 0-dr1-posix: health-check failed, going down

Message from syslogd@dhcp35-122 at Apr 24 20:01:00 ...
rhs-brick1-dr1[19704]:[2017-04-24 14:31:00.182013] M [MSGID: 113075] [posix-helpers.c:1838:posix_health_check_thread_proc] 0-dr1-posix: health-check failed, going down

Broadcast message from systemd-journald.eng.blr.redhat.com (Mon 2017-04-24 20:01:30 IST):
rhs-brick1-dr1[19704]: [2017-04-24 14:31:30.182744] M [MSGID: 113075] [posix-helpers.c:1845:posix_health_check_thread_proc] 0-dr1-posix: still alive! -> SIGTERM

Message from syslogd@dhcp35-122 at Apr 24 20:01:30 ...
rhs-brick1-dr1[19704]:[2017-04-24 14:31:30.182744] M [MSGID: 113075] [posix-helpers.c:1845:posix_health_check_thread_proc] 0-dr1-posix: still alive! -> SIGTERM

Version-Release number of selected component (if applicable):
==================
3.8.4-23

How reproducible:
=========
2/2

Steps to Reproduce:
====================
1. Enable brick multiplexing.
2. Create a 2x2 volume, say dr1, as below and start it:
   gluster v create dr1 rep 2 10.70.35.122:/rhs/brick1/dr1 10.70.35.23:/rhs/brick1/dr1 10.70.35.112:/rhs/brick1/dr1 10.70.35.138:/rhs/brick1/dr1
3. Create dr2 as below and start it; keep the volume settings identical to dr1 so its bricks get attached to the same brick process as dr1:
   gluster v create dr2 rep 2 10.70.35.122:/rhs/brick2/dr2 10.70.35.23:/rhs/brick2/dr2 10.70.35.112:/rhs/brick2/dr2 10.70.35.138:/rhs/brick2/dr2
4. The same way, create dr3 and start it:
   gluster v create dr3 rep 2 10.70.35.122:/rhs/brick3/dr3 10.70.35.23:/rhs/brick3/dr3 10.70.35.112:/rhs/brick3/dr3 10.70.35.138:/rhs/brick3/dr3
5. Delete dr1.
6. Remove the bricks of dr1 from the backend, i.e. rm -rf /rhs/brick1/dr1 on all nodes.
7. Recreate dr1 as below:
   gluster v create dr1 rep 2 10.70.35.122:/rhs/brick1/dr1 10.70.35.23:/rhs/brick1/dr1 10.70.35.112:/rhs/brick1/dr1 10.70.35.138:/rhs/brick1/dr1
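For reference, the steps above can be driven from one node with a small script along the following lines. This is only a sketch: the node list and the ssh loop are illustrative assumptions, cluster.brick-multiplex is assumed to be how multiplexing was enabled in step 1 (the report does not name the exact command), and --mode=script is used only to skip the interactive stop/delete prompts.

# Sketch of the reproduction; NODES and the ssh loop are assumptions for illustration.
NODES="10.70.35.122 10.70.35.23 10.70.35.112 10.70.35.138"

gluster volume set all cluster.brick-multiplex on      # step 1 (assumed command)

for v in dr1 dr2 dr3; do                               # steps 2-4
    n=${v#dr}
    gluster v create $v rep 2 \
        10.70.35.122:/rhs/brick$n/$v 10.70.35.23:/rhs/brick$n/$v \
        10.70.35.112:/rhs/brick$n/$v 10.70.35.138:/rhs/brick$n/$v
    gluster v start $v
done

gluster --mode=script v stop dr1                       # step 5 (stop is required before delete)
gluster --mode=script v delete dr1
for node in $NODES; do ssh $node "rm -rf /rhs/brick1/dr1"; done    # step 6

gluster v create dr1 rep 2 \
    10.70.35.122:/rhs/brick1/dr1 10.70.35.23:/rhs/brick1/dr1 \
    10.70.35.112:/rhs/brick1/dr1 10.70.35.138:/rhs/brick1/dr1      # step 7
gluster v start dr1                                    # started again, as in the second reproduction below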
Actual results:
===============
The bricks on all nodes go down, as below:

[root@dhcp35-45 ~]# gluster v status
Status of volume: dr1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.122:/rhs/brick1/dr1          N/A       N/A        N       N/A
Brick 10.70.35.23:/rhs/brick1/dr1           N/A       N/A        N       N/A
Brick 10.70.35.112:/rhs/brick1/dr1          N/A       N/A        N       N/A
Brick 10.70.35.138:/rhs/brick1/dr1          N/A       N/A        N       N/A
Self-heal Daemon on localhost               N/A       N/A        Y       32440
Self-heal Daemon on 10.70.35.130            N/A       N/A        Y       18292
Self-heal Daemon on 10.70.35.122            N/A       N/A        Y       19956
Self-heal Daemon on 10.70.35.112            N/A       N/A        Y       6254
Self-heal Daemon on 10.70.35.23             N/A       N/A        Y       6425
Self-heal Daemon on 10.70.35.138            N/A       N/A        Y       17621

Task Status of Volume dr1
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: dr2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.122:/rhs/brick2/dr2          N/A       N/A        N       N/A
Brick 10.70.35.23:/rhs/brick2/dr2           N/A       N/A        N       N/A
Brick 10.70.35.112:/rhs/brick2/dr2          N/A       N/A        N       N/A
Brick 10.70.35.138:/rhs/brick2/dr2          N/A       N/A        N       N/A
Self-heal Daemon on localhost               N/A       N/A        Y       32440
Self-heal Daemon on 10.70.35.130            N/A       N/A        Y       18292
Self-heal Daemon on 10.70.35.122            N/A       N/A        Y       19956
Self-heal Daemon on 10.70.35.112            N/A       N/A        Y       6254
Self-heal Daemon on 10.70.35.138            N/A       N/A        Y       17621
Self-heal Daemon on 10.70.35.23             N/A       N/A        Y       6425

Task Status of Volume dr2
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: dr3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.122:/rhs/brick3/dr3          N/A       N/A        N       N/A
Brick 10.70.35.23:/rhs/brick3/dr3           N/A       N/A        N       N/A
Brick 10.70.35.112:/rhs/brick3/dr3          N/A       N/A        N       N/A
Brick 10.70.35.138:/rhs/brick3/dr3          N/A       N/A        N       N/A
Self-heal Daemon on localhost               N/A       N/A        Y       32440
Self-heal Daemon on 10.70.35.112            N/A       N/A        Y       6254
Self-heal Daemon on 10.70.35.130            N/A       N/A        Y       18292
Self-heal Daemon on 10.70.35.23             N/A       N/A        Y       6425
Self-heal Daemon on 10.70.35.122            N/A       N/A        Y       19956
Self-heal Daemon on 10.70.35.138            N/A       N/A        Y       17621

Task Status of Volume dr3
------------------------------------------------------------------------------
There are no active volume tasks
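The failure pattern is consistent with the bricks of all three volumes being multiplexed into one glusterfsd per node, and the health-check messages above coming from the dr1 posix translator inside that shared process. A quick way to check this on any node is sketched below; it is an illustrative diagnostic, not part of the original report, and the exact glusterfsd arguments may vary by version.

# The multiplexed brick process is started for one "base" brick; its
# --brick-name argument shows which path that is (expected here to be
# /rhs/brick1/dr1, the brick of the deleted and recreated volume).
ps -ef | grep '[g]lusterfsd'

# Before the failure, bricks of dr1, dr2 and dr3 on a node should all report
# the same PID; grouping the status output by PID makes the shared process visible.
gluster v status | awk '/^Brick/ {print $NF, $2}' | sort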
Please attach the logs, as for me the issue looks different. Below are the details of the issue observed on my setup, across multiple iterations.

1. 4-node cluster
2. Create DR1
3. Start DR1
4. Create DR2
5. Start DR2
6. Create DR3
7. Start DR3
8. Delete DR1
9. Delete the dr1 bricks from all nodes
10. Create DR1
11. Start DR1

Output:

[root@localhost ~]# gluster v status
Status of volume: dr1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 192.168.122.4:/rhs/brick1/dr1         49152     0          Y       4444
Brick 192.168.122.6:/rhs/brick1/dr1         49157     0          Y       3323
Brick 192.168.122.79:/rhs/brick1/dr1        49152     0          Y       3916
Brick 192.168.122.109:/rhs/brick1/dr1       49152     0          Y       4745
Self-heal Daemon on localhost               N/A       N/A        Y       3343
Self-heal Daemon on 192.168.122.4           N/A       N/A        Y       4464
Self-heal Daemon on 192.168.122.109         N/A       N/A        Y       4765
Self-heal Daemon on 192.168.122.79          N/A       N/A        Y       3936

Task Status of Volume dr1
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: dr2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 192.168.122.4:/rhs/brick1/dr2         N/A       N/A        N       N/A
Brick 192.168.122.6:/rhs/brick1/dr2         N/A       N/A        N       N/A
Brick 192.168.122.79:/rhs/brick1/dr2        N/A       N/A        N       N/A
Brick 192.168.122.109:/rhs/brick1/dr2       N/A       N/A        N       N/A
Self-heal Daemon on localhost               N/A       N/A        Y       3343
Self-heal Daemon on 192.168.122.109         N/A       N/A        Y       4765
Self-heal Daemon on 192.168.122.79          N/A       N/A        Y       3936
Self-heal Daemon on 192.168.122.4           N/A       N/A        Y       4464

Task Status of Volume dr2
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: dr3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 192.168.122.4:/rhs/brick1/dr3         N/A       N/A        N       N/A
Brick 192.168.122.6:/rhs/brick1/dr3         N/A       N/A        N       N/A
Brick 192.168.122.79:/rhs/brick1/dr3        N/A       N/A        N       N/A
Brick 192.168.122.109:/rhs/brick1/dr3       N/A       N/A        N       N/A
Self-heal Daemon on localhost               N/A       N/A        Y       3343
Self-heal Daemon on 192.168.122.4           N/A       N/A        Y       4464
Self-heal Daemon on 192.168.122.79          N/A       N/A        Y       3936
Self-heal Daemon on 192.168.122.109         N/A       N/A        Y       4765

Task Status of Volume dr3
------------------------------------------------------------------------------
There are no active volume tasks

For me the recreated dr1 shows its details correctly, but the other volumes do not. An initial level of debugging reveals that it is a path issue.
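One way to narrow down a path problem like this is to check which backing path the surviving brick process is complaining about: the brick log is named after the base brick path, so the health-check failure for the removed /rhs/brick1/dr1 directory shows up there even though dr2 and dr3 are the volumes reported offline. A hedged sketch, assuming the default log location:

# On each node, look for health-check failures in the brick logs
# (default location assumed; file names are derived from the brick paths).
grep -H "health-check failed" /var/log/glusterfs/bricks/*.log

# Compare the backing directory of the recreated base brick with the
# directories of the bricks that are reported offline.
ls -ld /rhs/brick1/dr1 /rhs/brick1/dr2 /rhs/brick1/dr3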
Also note that I cannot start or delete the volumes, due to the errors below:

[root@dhcp35-45 ~]# gluster v list
dr1
dr2
dr3

[root@dhcp35-45 ~]# for i in $(gluster v list);do gluster v stop $i;done
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
volume stop: dr1: failed: Commit failed on 10.70.35.112. Error: error
Commit failed on 10.70.35.23. Error: error
Commit failed on 10.70.35.138. Error: error
Commit failed on 10.70.35.122. Error: error
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
volume stop: dr2: failed: Commit failed on 10.70.35.23. Error: error
Commit failed on 10.70.35.122. Error: error
Commit failed on 10.70.35.112. Error: error
Commit failed on 10.70.35.138. Error: error
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
volume stop: dr3: failed: Commit failed on 10.70.35.23. Error: error
Commit failed on 10.70.35.122. Error: error
Commit failed on 10.70.35.138. Error: error
Commit failed on 10.70.35.112. Error: error
[root@dhcp35-45 ~]#

[root@dhcp35-45 ~]# gluster v status
Volume dr1 is not started
Volume dr2 is not started
Volume dr3 is not started

[root@dhcp35-45 ~]# gluster v start dr1
volume start: dr1: failed: Pre Validation failed on 10.70.35.122. Volume dr1 already started
Pre Validation failed on 10.70.35.23. Volume dr1 already started
Pre Validation failed on 10.70.35.112. Volume dr1 already started
Pre Validation failed on 10.70.35.138. Volume dr1 already started

[root@dhcp35-45 ~]# gluster v start dr2
volume start: dr2: failed: Pre Validation failed on 10.70.35.122. Volume dr2 already started
Pre Validation failed on 10.70.35.112. Volume dr2 already started
Pre Validation failed on 10.70.35.138. Volume dr2 already started
Pre Validation failed on 10.70.35.23. Volume dr2 already started

[root@dhcp35-45 ~]# gluster v start dr3
volume start: dr3: failed: Pre Validation failed on 10.70.35.23. Volume dr3 already started
Pre Validation failed on 10.70.35.112. Volume dr3 already started
Pre Validation failed on 10.70.35.122. Volume dr3 already started
Pre Validation failed on 10.70.35.138. Volume dr3 already started

[root@dhcp35-45 ~]# gluster v status
Volume dr1 is not started
Volume dr2 is not started
Volume dr3 is not started
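The stop/start failures suggest the peers disagree about the volume state: the local glusterd reports the volumes as not started, while pre-validation on the other nodes claims they are already started. A quick, hedged way to check for such a mismatch is to compare what each peer's glusterd has recorded on disk; this sketch assumes the default glusterd working directory and that the per-volume info file carries a status field, which may differ between versions.

# Compare the recorded volume status on every peer; differing values would
# confirm that the cluster's view of dr1/dr2/dr3 is inconsistent.
for node in 10.70.35.122 10.70.35.23 10.70.35.112 10.70.35.138; do
    echo "== $node =="
    ssh $node 'grep -H "^status=" /var/lib/glusterd/vols/*/info'
done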
sosreports @ http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1444926
Upstream patch: https://review.gluster.org/#/c/17101/
Upstream patches: https://review.gluster.org/#/q/topic:bug-1444596

Downstream patches:
https://code.engineering.redhat.com/gerrit/#/c/105595/
https://code.engineering.redhat.com/gerrit/#/c/105596/
Even on 3.8.4-25 the issue exists; below are the steps. I will have to move this to FailedQA.

1) Created three 1x3 volumes v1, v2, v3 with brick mux enabled; all bricks get the same PID.
2) Started IOs on v2 and v3.
3) Stopped v1 ---> IOs still going on.
4) Deleted v1 ---> still good.
5) Deleted the bricks of v1 ---> the bricks were just directories under the LV (and each volume had a separate LV). This makes the posix health-check failure message pop up as below (a sketch of watching this on one node follows the mount log excerpt):

Broadcast message from systemd-journald.eng.blr.redhat.com (Tue 2017-05-16 12:40:33 IST):
rhs-brick1-myr-1[28967]: [2017-05-16 07:10:33.029490] M [MSGID: 113075] [posix-helpers.c:1893:posix_health_check_thread_proc] 0-myr-1-posix: health-check failed, going down

Message from syslogd@dhcp35-45 at May 16 12:40:33 ...
rhs-brick1-myr-1[28967]:[2017-05-16 07:10:33.029490] M [MSGID: 113075] [posix-helpers.c:1893:posix_health_check_thread_proc] 0-myr-1-posix: health-check failed, going down

IOs stop and the volume mount becomes inaccessible with a transport-endpoint-not-connected error. Tried to mount v2 on a new directory, which is also failing:

[2017-05-16 07:22:12.931554] I [MSGID: 114064] [client-handshake.c:148:client_notify_parents_child_up] 0-myr-3-client-1: Defering sending CHILD_UP message as the client translators are not yet ready to serve.
[2017-05-16 07:22:12.931596] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-myr-3-client-2: Connected to myr-3-client-2, attached to remote volume '/rhs/brick3/myr-3'.
[2017-05-16 07:22:12.931632] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-myr-3-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2017-05-16 07:22:12.931775] I [MSGID: 114064] [client-handshake.c:148:client_notify_parents_child_up] 0-myr-3-client-2: Defering sending CHILD_UP message as the client translators are not yet ready to serve.
[2017-05-16 07:22:12.931803] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-myr-3-client-1: Server lk version = 1
[2017-05-16 07:22:12.931862] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-myr-3-client-2: Server lk version = 1
[2017-05-16 07:22:12.932443] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-myr-3-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-05-16 07:22:12.933664] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-myr-3-client-0: Connected to myr-3-client-0, attached to remote volume '/rhs/brick3/myr-3'.
[2017-05-16 07:22:12.933687] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-myr-3-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2017-05-16 07:22:12.933767] I [MSGID: 114064] [client-handshake.c:148:client_notify_parents_child_up] 0-myr-3-client-0: Defering sending CHILD_UP message as the client translators are not yet ready to serve.
[2017-05-16 07:22:12.933915] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-myr-3-client-0: Server lk version = 1
[2017-05-16 07:22:23.900464] I [fuse-bridge.c:5251:fuse_graph_setup] 0-fuse: switched to graph 0
[2017-05-16 07:22:23.902241] I [fuse-bridge.c:4153:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.22
[2017-05-16 07:22:23.902504] I [MSGID: 108006] [afr-common.c:4827:afr_local_init] 0-myr-3-replicate-0: no subvolumes up
[2017-05-16 07:22:23.902904] W [fuse-bridge.c:767:fuse_attr_cbk] 0-glusterfs-fuse: 2: LOOKUP() / => -1 (Transport endpoint is not connected)
[2017-05-16 07:22:23.907736] I [fuse-bridge.c:5092:fuse_thread_proc] 0-fuse: unmounting /mnt/test4
The message "I [MSGID: 108006] [afr-common.c:4827:afr_local_init] 0-myr-3-replicate-0: no subvolumes up" repeated 2 times between [2017-05-16 07:22:23.902504] and [2017-05-16 07:22:23.906642]
[2017-05-16 07:22:23.908145] W [glusterfsd.c:1291:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dc5) [0x7f9fbfee1dc5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x7f9fc1577f45] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x7f9fc1577d6b] ) 0-: received signum (15), shutting down
[2017-05-16 07:22:23.908179] I [fuse-bridge.c:5803:fini] 0-fuse: Unmounting '/mnt/test4'.
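For reference, this is how step 5 above can be watched on a single node. It is only a sketch: the brick path and log name are taken from the syslog messages above (the volumes are named myr-1/myr-3 in the logs, v1/v3 in the steps), the log location is assumed to be the default, and the health-check interval is assumed to be around its default of 30 seconds.

# All bricks are multiplexed into one glusterfsd; note its PID first.
pgrep -fa glusterfsd

# Step 5: remove the backing directory of the deleted volume's brick
# (the base brick of the shared process).
rm -rf /rhs/brick1/myr-1

# The posix health-check failure shows up in the base brick's log; once the
# shared process is torn down, the bricks of the other multiplexed volumes
# go offline with it.
tail -n 20 /var/log/glusterfs/bricks/rhs-brick1-myr-1.log
gluster v status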
Validation: not seeing the issue anymore on 3.8.4-27, hence moving to Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774