Description of problem:
After a gluster pod was deleted, a new gluster pod was created but the brick process does not come up. This issue is seen with 100 volumes running and mounted on 100 different application pods. It is not seen with a single volume mapped to a single pod.

sh-4.2# gluster vol status vol_2bd7e27c196eddac0127b182c7749be9
Status of volume: vol_2bd7e27c196eddac0127b182c7749be9
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.46.190:/var/lib/heketi/mounts/v
g_91bf878e839259d5983c1b8116da8911/brick_34
ec5d5f91f442aecf5be89875517595/brick        49152     0          Y       1590
Brick 10.70.47.59:/var/lib/heketi/mounts/vg
_a5597e6aa7af855a39d96b64a47787ca/brick_a22
1df1e052866d65e30204d7a682ecf/brick         N/A       N/A        N       N/A
Brick 10.70.46.65:/var/lib/heketi/mounts/vg
_cb53b766a8f3325469c78bf558f2ae4e/brick_0e3
86e7bb5235cc7e7989e66eca431d1/brick         49152     0          Y       1559
Self-heal Daemon on localhost               N/A       N/A        Y       1478
Self-heal Daemon on 10.70.46.65             N/A       N/A        Y       40193
Self-heal Daemon on 10.70.46.190            N/A       N/A        Y       7260

Task Status of Volume vol_2bd7e27c196eddac0127b182c7749be9
------------------------------------------------------------------------------
There are no active volume tasks

Version-Release number of selected component (if applicable):
glusterfs-server-3.8.4-28.el7rhgs.x86_64

How reproducible:
1/1

Steps to Reproduce:
1. On a 3-node CNS cluster, create 100 application pods and mount 100 CNS volumes.
2. Run some IO.
3. Delete a gluster pod.
4. A new gluster pod gets created as expected and glusterd starts.
5. Check the volume status.

Actual results:
The brick process for the rebooted node (deleted pod) did not start. As a result, self-heal does not kick in.

Expected results:
The brick process should start automatically and self-heal should complete.

Additional info:
Logs shall be attached shortly.
Looking at the setup and the logs, here is my theory of what has happened:

1. Brick processes were running with some pid (say 1000) before the pod was deleted.

2. When the new pod came up, a non-gluster process picked up the same pid 1000 after the very first brick was brought up with pid 1001. The subsequent brick start attempts then tried to connect to the brick, assuming it was still running with pid 1000. The following log sequence confirms that glusterd went for a connect instead of an attach:

[2017-06-20 11:14:51.444692] I [glusterd-utils.c:5443:glusterd_brick_start] 0-management: discovered already-running brick /var/lib/heketi/mounts/vg_a5597e6aa7af855a39d96b64a47787ca/brick_15ef1d866be79abfea2446a7c887d192/brick
[2017-06-20 11:14:51.444731] I [MSGID: 106143] [glusterd-pmap.c:277:pmap_registry_bind] 0-pmap: adding brick /var/lib/heketi/mounts/vg_a5597e6aa7af855a39d96b64a47787ca/brick_15ef1d866be79abfea2446a7c887d192/brick on port 49152
[2017-06-20 11:14:51.444775] I [rpc-clnt.c:1060:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2017-06-20 11:14:51.444877] I [glusterd-utils.c:5443:glusterd_brick_start] 0-management: discovered already-running brick /var/lib/heketi/mounts/vg_a5597e6aa7af855a39d96b64a47787ca/brick_d523daa447130512618fd72da50cad72/brick
[2017-06-20 11:14:51.444901] I [MSGID: 106143] [glusterd-pmap.c:277:pmap_registry_bind] 0-pmap: adding brick /var/lib/heketi/mounts/vg_a5597e6aa7af855a39d96b64a47787ca/brick_d523daa447130512618fd72da50cad72/brick on port 49152
The theory becomes more concrete when we see that glusterd gets disconnect notifications because of this:

[2017-06-20 11:14:51.507841] I [socket.c:2465:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2017-06-20 11:14:51.509668] I [MSGID: 106005] [glusterd-handler.c:5687:__glusterd_brick_rpc_notify] 0-management: Brick 10.70.47.59:/var/lib/heketi/mounts/vg_61f1e5206bf507bc54aef6984e91157b/brick_08a4ad42cdf24e5e3b56bd940c6b243b/brick has disconnected from glusterd.
[2017-06-20 11:14:51.511046] I [socket.c:2465:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2017-06-20 11:14:51.512293] I [MSGID: 106005] [glusterd-handler.c:5687:__glusterd_brick_rpc_notify] 0-management: Brick 10.70.47.59:/var/lib/heketi/mounts/vg_e863cbc857b5d64d58d31afd42383234/brick_1146fd5fda435fedf6241a0f3c84a3e6/brick has disconnected from glusterd.
[2017-06-20 11:14:51.513633] I [socket.c:2465:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2017-06-20 11:14:51.516157] I [MSGID: 106005] [glusterd-handler.c:5687:__glusterd_brick_rpc_notify] 0-management: Brick 10.70.47.59:/var/lib/heketi/mounts/vg_e863cbc857b5d64d58d31afd42383234/brick_7e1cfecf225612a67464e891b391ebc5/brick has disconnected from glusterd.

There are really two problems here:
1. The pidfiles shouldn't be in /var/lib/glusterd, as these files shouldn't be persisted across a node reboot.
2. is_gf_pid_running should be enhanced with a stricter check of whether the pid actually belongs to a gluster process.

Doing 2 would solve the problem for now. However, we should also try to finish the ongoing work at https://review.gluster.org/#/c/13580/ .
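The stricter check suggested in point 2 can be sketched as follows: before trusting a pid read from a (possibly stale) pidfile, look up the name of the process actually behind that pid. This is a minimal, Linux-only illustration using /proc/<pid>/comm with a hypothetical helper proc_name_matches; it is not the actual glusterd code:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Return 1 if /proc/<pid>/comm matches `name`, 0 otherwise
 * (including when no process with that pid exists).
 * Hypothetical helper sketching the stricter pidfile check. */
static int proc_name_matches(pid_t pid, const char *name)
{
        char path[64];
        char comm[256] = {0};
        FILE *fp;

        snprintf(path, sizeof(path), "/proc/%d/comm", (int)pid);
        fp = fopen(path, "r");
        if (!fp)
                return 0;                /* pid not running at all */
        if (!fgets(comm, sizeof(comm), fp)) {
                fclose(fp);
                return 0;
        }
        fclose(fp);
        comm[strcspn(comm, "\n")] = '\0';  /* strip trailing newline */
        return strcmp(comm, name) == 0;
}
```

In glusterd terms, the pid from a brick pidfile would only count as "already running" if its comm is the brick daemon's name (glusterfsd); a reused pid belonging to some unrelated process would fail this check and the brick would be started normally.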
I think we can resolve this without changing is_gf_pid_running. Below is the code path in glusterd_brick_start in which we try to connect to the brick if it is running.

I think before calling glusterd_brick_connect here we can validate socketpath; if the pid (glusterfs) is not running, the socketpath value will not be valid and we can move out (goto) of the scope instead of calling glusterd_brick_connect.

>>>>>>>>>>>>>

                        if (is_brick_mx_enabled ())
                                glusterd_get_sock_from_brick_pid (pid, socketpath,
                                                                  sizeof (socketpath));
                        else
                                glusterd_set_brick_socket_filepath (volinfo, brickinfo,
                                                                    socketpath,
                                                                    sizeof (socketpath));
                        gf_log (this->name, GF_LOG_DEBUG,
                                "Using %s as sockfile for brick %s of volume %s ",
                                socketpath, brickinfo->path, volinfo->volname);
                        (void) glusterd_brick_connect (volinfo, brickinfo, socketpath);
                }
                return 0;
        }

>>>>>>>>>>>>>>>>>

Regards
Mohit Agrawal
(In reply to Mohit Agrawal from comment #10)
> I think we can resolve the same without change is_gf_pid_running.
> Below is the code path in glusterd_brick_start in which we are trying to
> connect with brick if it is running.
>
> I think before call glusterd_brick_connect here we can validate socketpath,
> if pid(glusterfs) is not running the socketpath value will not be valid we
> can move out(goto) from the scope instead of calling glusterd_brick_connect.
>
> >>>>>>>>>>>>>
>
>                         if (is_brick_mx_enabled ())
>                                 glusterd_get_sock_from_brick_pid (pid, socketpath,
>                                                                   sizeof (socketpath));
>                         else
>                                 glusterd_set_brick_socket_filepath (volinfo, brickinfo,
>                                                                     socketpath,
>                                                                     sizeof (socketpath));
>                         gf_log (this->name, GF_LOG_DEBUG,
>                                 "Using %s as sockfile for brick %s of volume %s ",
>                                 socketpath, brickinfo->path, volinfo->volname);
>                         (void) glusterd_brick_connect (volinfo, brickinfo, socketpath);
>                 }
>                 return 0;
>         }
>
> >>>>>>>>>>>>>>>>>

We can't move to "out", as we are actually supposed to start the brick; you'd basically need to come out of the if block and proceed further. Otherwise I like the idea.
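The socketpath validation being discussed can be sketched as a connect probe: if nothing accepts a connection on the brick's unix-domain socket, the recorded pid is stale and glusterd should fall through to the normal start path rather than calling glusterd_brick_connect. A minimal sketch with a hypothetical helper brick_socket_alive (not the actual glusterd implementation):

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Return 1 if a unix-domain socket at `path` accepts a connection,
 * 0 otherwise. Hypothetical helper for validating a brick's
 * socketpath before attempting an RPC connect. */
static int brick_socket_alive(const char *path)
{
        struct sockaddr_un addr;
        int fd, ret;

        if (strlen(path) >= sizeof(addr.sun_path))
                return 0;                      /* path too long for sun_path */
        fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd < 0)
                return 0;
        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        strcpy(addr.sun_path, path);
        ret = connect(fd, (struct sockaddr *)&addr, sizeof(addr));
        close(fd);
        return ret == 0;                       /* alive only if connect succeeded */
}
```

A brick socket left behind by a dead process (or never created in the new pod) would fail this probe, so glusterd would take the start path instead of connecting to a pid that now belongs to something else.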
What I meant is that control will come out of the gf_is_service_running scope and start executing the code from the point below:

        ret = _mk_rundir_p (volinfo);
        if (ret)
                goto out;

        other_brick = find_compatible_brick (conf, volinfo, brickinfo,
                                             &other_vol);
        if (other_brick) {
                ret = attach_brick (this, brickinfo, other_brick,
                                    volinfo, other_vol);
                if (ret == 0) {
                        goto out;
                }
        }

Regards
Mohit Agrawal
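Putting the two comments together, the intended control flow can be modeled as a small decision: connect only when the recorded pid is alive AND its socketpath checks out; in every other case fall through to the normal start path (_mk_rundir_p, find_compatible_brick, then attach or spawn). All names below are illustrative stubs, not glusterd code:

```c
#include <assert.h>

/* Toy model of the agreed glusterd_brick_start behavior. */
enum brick_action {
        BRICK_CONNECT,   /* brick genuinely running: connect to it */
        BRICK_START      /* stale pid or dead socket: start afresh */
};

static enum brick_action decide_brick_action(int pid_alive,
                                             int socket_valid)
{
        if (pid_alive && socket_valid)
                return BRICK_CONNECT;
        /* Covers the bug in this report: the pid was reused by a
         * non-gluster process, so the socket check fails and we
         * take the start path instead of connecting. */
        return BRICK_START;
}
```

The key point from comment #11 is captured by the fall-through: a failed validation must not `goto out`, because the brick still needs to be started.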
upstream patch : https://review.gluster.org/#/c/17601
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/110184/
Verified this bug in CNS build cns-deploy-5.0.0-15.el7rhgs.x86_64 (glusterfs-3.8.4-40.el7rhgs.x86_64).

Deleting a gluster pod with 100 volumes brings the gluster pod back up with all bricks online, and self-heal completes.

Moving the bug to Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774