Bug 1463221

Summary: cns-brick-multiplexing: brick process fails to restart after gluster pod failure
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: krishnaram Karthick <kramdoss>
Component: core Assignee: Mohit Agrawal <moagrawa>
Status: CLOSED ERRATA QA Contact: krishnaram Karthick <kramdoss>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: cns-3.6 CC: amukherj, hchiramm, kramdoss, nchilaka, pprakash, rcyriac, rhs-bugs, rtalur, storage-qa-internal, vinug
Target Milestone: --- Flags: hchiramm: needinfo-
Target Release: RHGS 3.3.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: 3.3.0-devel-freeze-exception, brick-multiplexing
Fixed In Version: glusterfs-3.8.4-32 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1463613 1464072 (view as bug list) Environment:
Last Closed: 2017-09-21 04:59:42 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1464072    
Bug Blocks: 1417151, 1463613    

Description krishnaram Karthick 2017-06-20 11:57:47 UTC
Description of problem:

After a gluster pod was deleted, a new gluster pod got created, but the brick process does not seem to come up.

This issue is seen with 100 volumes running and mounted on 100 different application pods. It is not seen with a single volume mapped to a single pod.

sh-4.2# gluster vol status vol_2bd7e27c196eddac0127b182c7749be9 
Status of volume: vol_2bd7e27c196eddac0127b182c7749be9
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.46.190:/var/lib/heketi/mounts/v
g_91bf878e839259d5983c1b8116da8911/brick_34
ec5d5f91f442aecf5be89875517595/brick        49152     0          Y       1590 
Brick 10.70.47.59:/var/lib/heketi/mounts/vg
_a5597e6aa7af855a39d96b64a47787ca/brick_a22
1df1e052866d65e30204d7a682ecf/brick         N/A       N/A        N       N/A  
Brick 10.70.46.65:/var/lib/heketi/mounts/vg
_cb53b766a8f3325469c78bf558f2ae4e/brick_0e3
86e7bb5235cc7e7989e66eca431d1/brick         49152     0          Y       1559 
Self-heal Daemon on localhost               N/A       N/A        Y       1478 
Self-heal Daemon on 10.70.46.65             N/A       N/A        Y       40193
Self-heal Daemon on 10.70.46.190            N/A       N/A        Y       7260 
 
Task Status of Volume vol_2bd7e27c196eddac0127b182c7749be9
------------------------------------------------------------------------------
There are no active volume tasks


Version-Release number of selected component (if applicable):
glusterfs-server-3.8.4-28.el7rhgs.x86_64

How reproducible:
1/1

Steps to Reproduce:
1. On a 3-node cns cluster, create 100 application pods and mount 100 cns volumes.
2. Run some IO.
3. Delete a gluster pod.
4. A new gluster pod gets created as expected and glusterd starts.
5. Check the volume status.
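
A rough outline of steps 3-5 from the master node, for reference (the namespace and pod names are placeholders and depend on the CNS deployment):

# delete one gluster pod; openshift recreates it automatically
oc get pods -n <gluster-namespace> -o wide
oc delete pod <gluster-pod-name> -n <gluster-namespace>

# once the replacement pod is Running, check the bricks from inside it
oc get pods -n <gluster-namespace> -w
oc rsh <new-gluster-pod-name> gluster volume status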

Actual results:
The brick process on the node that was rebooted (pod deleted) did not start. As a result, self-heal doesn't kick in.

Expected results:
Brick processes should start automatically and self-heal should complete.

Additional info:
Logs will be attached shortly.

Comment 6 Atin Mukherjee 2017-06-21 08:30:02 UTC
Looking at the setup and the logs, here is my theory of what has happened here:

1. The brick processes were running with a certain pid (say 1000) before the pod was deleted.
2. When the new pod came up, a non-gluster process picked up the same pid 1000, after the very first brick was brought up with pid 1001. The subsequent brick start attempts then tried to connect to the brick, assuming it was still running with pid 1000. The following log sequence confirms that the brick start went for a connect instead of an attach:

[2017-06-20 11:14:51.444692] I [glusterd-utils.c:5443:glusterd_brick_start] 0-management: discovered already-running brick /var/lib/heketi/mounts/vg_a5597e6aa7af855a39d96b64a47787ca/brick_15ef1d866be79abfea2446a7c887d192/brick
[2017-06-20 11:14:51.444731] I [MSGID: 106143] [glusterd-pmap.c:277:pmap_registry_bind] 0-pmap: adding brick /var/lib/heketi/mounts/vg_a5597e6aa7af855a39d96b64a47787ca/brick_15ef1d866be79abfea2446a7c887d192/brick on port 49152
[2017-06-20 11:14:51.444775] I [rpc-clnt.c:1060:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2017-06-20 11:14:51.444877] I [glusterd-utils.c:5443:glusterd_brick_start] 0-management: discovered already-running brick /var/lib/heketi/mounts/vg_a5597e6aa7af855a39d96b64a47787ca/brick_d523daa447130512618fd72da50cad72/brick
[2017-06-20 11:14:51.444901] I [MSGID: 106143] [glusterd-pmap.c:277:pmap_registry_bind] 0-pmap: adding brick /var/lib/heketi/mounts/vg_a5597e6aa7af855a39d96b64a47787ca/brick_d523daa447130512618fd72da50cad72/brick on port 49152


3. The theory becomes more concrete when we see that glusterd then gets disconnect notifications because of this:
[2017-06-20 11:14:51.507841] I [socket.c:2465:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2017-06-20 11:14:51.509668] I [MSGID: 106005] [glusterd-handler.c:5687:__glusterd_brick_rpc_notify] 0-management: Brick 10.70.47.59:/var/lib/heketi/mounts/vg_61f1e5206bf507bc54aef6984e91157b/brick_08a4ad42cdf24e5e3b56bd940c6b243b/brick has disconnected from glusterd.
[2017-06-20 11:14:51.511046] I [socket.c:2465:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2017-06-20 11:14:51.512293] I [MSGID: 106005] [glusterd-handler.c:5687:__glusterd_brick_rpc_notify] 0-management: Brick 10.70.47.59:/var/lib/heketi/mounts/vg_e863cbc857b5d64d58d31afd42383234/brick_1146fd5fda435fedf6241a0f3c84a3e6/brick has disconnected from glusterd.
[2017-06-20 11:14:51.513633] I [socket.c:2465:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2017-06-20 11:14:51.516157] I [MSGID: 106005] [glusterd-handler.c:5687:__glusterd_brick_rpc_notify] 0-management: Brick 10.70.47.59:/var/lib/heketi/mounts/vg_e863cbc857b5d64d58d31afd42383234/brick_7e1cfecf225612a67464e891b391ebc5/brick has disconnected from glusterd.


There are essentially two problems here:

1. The pidfiles shouldn't be in /var/lib/glusterd, as these files shouldn't be persisted across a node reboot.

2. is_gf_pid_running should be enhanced to perform a stricter check of whether the pid actually belongs to a gluster process.

Doing (2) would actually solve the problem for now. However, we should try to finish the ongoing work at https://review.gluster.org/#/c/13580/.
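
For illustration, a minimal sketch of the kind of stricter check meant in (2). This is not the actual glusterd code; the helper name and the "glusterfsd" substring match are assumptions. The point is that a bare existence check on /proc/<pid> is fooled by pid reuse, whereas reading /proc/<pid>/cmdline is not:

#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: return 1 only if pid is alive AND its command
 * line looks like a gluster brick process.  A plain existence check on
 * /proc/<pid> succeeds even when an unrelated process has reused the
 * pid recorded in a stale pidfile. */
static int
is_gluster_brick_pid (pid_t pid)
{
        char   path[64]  = {0};
        char   cmd[4096] = {0};
        FILE  *fp        = NULL;
        size_t n         = 0;

        snprintf (path, sizeof (path), "/proc/%d/cmdline", (int) pid);
        fp = fopen (path, "r");
        if (!fp)
                return 0;        /* no such pid at all */

        n = fread (cmd, 1, sizeof (cmd) - 1, fp);
        fclose (fp);
        if (n == 0)
                return 0;

        /* cmdline is NUL-separated; the executable name comes first */
        return (strstr (cmd, "glusterfsd") != NULL);
}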

Comment 10 Mohit Agrawal 2017-06-22 10:28:22 UTC
I think we can resolve this without changing is_gf_pid_running.
Below is the code path in glusterd_brick_start where we try to connect to the brick if it appears to be running.

I think that before calling glusterd_brick_connect here we can validate socketpath; if the pid (glusterfs) is not running, the socketpath value will not be valid, and we can move out (goto) of this scope instead of calling glusterd_brick_connect.

>>>>>>>>>>>>>

                        if (is_brick_mx_enabled ())
                                glusterd_get_sock_from_brick_pid (pid, socketpath,
                                                                  sizeof(socketpath));
                        else
                                glusterd_set_brick_socket_filepath (volinfo, brickinfo,
                                                                    socketpath,
                                                                    sizeof (socketpath));
                        gf_log (this->name, GF_LOG_DEBUG,
                                "Using %s as sockfile for brick %s of volume %s ",
                                socketpath, brickinfo->path, volinfo->volname);
                        (void) glusterd_brick_connect (volinfo, brickinfo,
                                        socketpath);
                }
                return 0;
        }

>>>>>>>>>>>>>>>>>
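
As a rough sketch of that validation (hypothetical helper, not the actual patch; the name brick_sock_is_usable is made up), the check in front of glusterd_brick_connect could look like this:

#include <unistd.h>

/* Hypothetical helper (not the actual fix): a socket path recovered
 * via glusterd_get_sock_from_brick_pid is only worth connecting to if
 * it is non-empty and the socket file actually exists.  If the pid
 * from the pidfile was reused by a non-gluster process, no such socket
 * exists, so the caller should skip glusterd_brick_connect and fall
 * through to a normal brick start / attach instead. */
static int
brick_sock_is_usable (const char *socketpath)
{
        if (!socketpath || socketpath[0] == '\0')
                return 0;

        return (access (socketpath, F_OK) == 0);
}

glusterd_brick_start would call this right before glusterd_brick_connect and, when it returns 0, skip the connect and the early "return 0" so that control falls through to the normal _mk_rundir_p / find_compatible_brick / attach_brick path.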

  

Regards
Mohit Agrawal

Comment 11 Atin Mukherjee 2017-06-22 10:33:21 UTC
(In reply to Mohit Agrawal from comment #10)
> I think we can resolve this without changing is_gf_pid_running.
> Below is the code path in glusterd_brick_start where we try to connect
> to the brick if it appears to be running.
>  
> I think that before calling glusterd_brick_connect here we can validate
> socketpath; if the pid (glusterfs) is not running, the socketpath value
> will not be valid, and we can move out (goto) of this scope instead of
> calling glusterd_brick_connect.
> 
> >>>>>>>>>>>>>
> 
>                         if (is_brick_mx_enabled ())
>                                 glusterd_get_sock_from_brick_pid (pid, socketpath,
>                                                                   sizeof(socketpath));
>                         else
>                                 glusterd_set_brick_socket_filepath (volinfo, brickinfo,
>                                                                     socketpath,
>                                                                     sizeof (socketpath));
>                         gf_log (this->name, GF_LOG_DEBUG,
>                                 "Using %s as sockfile for brick %s of volume %s ",
>                                 socketpath, brickinfo->path, volinfo->volname);
>                         (void) glusterd_brick_connect (volinfo, brickinfo,
>                                         socketpath);
>                 }
>                 return 0;
>         }
> 
> >>>>>>>>>>>>>>>>>

We can't jump to out, as we are actually supposed to start the brick; you'd basically need to come out of the if block and proceed further. Other than that, I like the idea.


Comment 12 Mohit Agrawal 2017-06-22 10:39:17 UTC
I meant that control will come out of the gf_is_service_running scope and start executing the code from the point below:

        ret = _mk_rundir_p (volinfo);
        if (ret)
                goto out;

        other_brick = find_compatible_brick (conf, volinfo, brickinfo,
                                             &other_vol);
        if (other_brick) {
                ret = attach_brick (this, brickinfo, other_brick,
                                    volinfo, other_vol);
                if (ret == 0) {
                        goto out;
                }
        }

Regards
Mohit Agrawal

Comment 13 Atin Mukherjee 2017-06-22 12:01:41 UTC
upstream patch : https://review.gluster.org/#/c/17601

Comment 14 Atin Mukherjee 2017-06-27 14:03:42 UTC
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/110184/

Comment 17 krishnaram Karthick 2017-08-22 05:40:20 UTC
Verified this bug in the cns build cns-deploy-5.0.0-15.el7rhgs.x86_64 (glusterfs-3.8.4-40.el7rhgs.x86_64).

Deleting a gluster pod with 100 volumes brings the gluster pod back up, all the bricks come up, and self-heal completes.

Moving the bug to verified.

Comment 19 errata-xmlrpc 2017-09-21 04:59:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774