Bug 1698131
| Field | Value |
|---|---|
| Summary | multiple glusterfsd processes being launched for the same brick, causing transport endpoint not connected |
| Product | [Community] GlusterFS |
| Component | glusterd |
| Version | 6 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED CURRENTRELEASE |
| Severity | high |
| Priority | unspecified |
| Reporter | Darrell <budic> |
| Assignee | Atin Mukherjee <amukherj> |
| QA Contact | Bala Konda Reddy M <bmekala> |
| CC | bugs, pasik, rhs-bugs, sankarshan, storage-qa-internal, vbellur |
| Target Milestone | --- |
| Target Release | --- |
| Doc Type | If docs needed, set a value |
| Story Points | --- |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Bug Blocks | 1692394 |
| Last Closed | 2019-04-29 03:28:32 UTC |
Description (Darrell, 2019-04-09 16:23:21 UTC)
Two requests I have from you:

1. Could you pass back the output of 'gluster peer status' and 'gluster volume status'?
2. Could you share a tar of /var/log/glusterfs/*.log?

Please note that we did fix a similar problem in glusterfs-6.0 with the following commit, but if you're still able to reproduce it we need to investigate. On a test setup, running a volume start and ps aux | grep glusterfsd only shows me the required brick processes, but the details asked for might give us more insight.

commit 36c75523c1f0545f32db4b807623a8f94df98ca7
Author: Mohit Agrawal <moagrawal>
Date:   Fri Mar 1 13:41:24 2019 +0530

    glusterfsd: Multiple shd processes are spawned on brick_mux environment

    Problem: Multiple shd processes are spawned while starting volumes in a loop in a brick_mux environment. glusterd spawns a process based on a pidfile, and the shd daemon takes some time to update its pid in the pidfile; because of that, glusterd is not able to get the shd pid.

    Solution: Commit cd249f4cb783f8d79e79468c455732669e835a4f changed the code to update the pidfile in the parent for any gluster daemon after getting the status of the forked child. To resolve this, correct the condition so that the pidfile is updated in the parent only for glusterd; for the rest of the daemons the pidfile is updated in the child.

    > Change-Id: Ifd14797fa949562594a285ec82d58384ad717e81
    > fixes: bz#1684404
    > (Cherry pick from commit 66986594a9023c49e61b32769b7e6b260b600626)
    > (Reviewed on upstream link https://review.gluster.org/#/c/glusterfs/+/22290/)

    Change-Id: I9a68064d2da1acd0ec54b4071a9995ece0c3320c
    fixes: bz#1683880
    Signed-off-by: Mohit Agrawal <moagrawal>

While things were in the state I described above, peer status was normal, as it is now:

[root@boneyard telsin]# gluster peer status
Number of Peers: 2

Hostname: ossuary-san
Uuid: 0ecbf953-681b-448f-9746-d1c1fe7a0978
State: Peer in Cluster (Connected)
Other names: 10.50.3.12

Hostname: necropolis-san
Uuid: 5d082bda-bb00-48d4-9f51-ea0995066c6f
State: Peer in Cluster (Connected)
Other names: 10.50.3.10

There's a 'gluster vol status gvOvirt' from the time there were multiple glusterfsd processes running in the original ticket. At the moment everything is normal, so I can't get you another while unusual things are happening. Right now it looks like:

[root@boneyard telsin]# gluster vol status
Status of volume: gv0
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick necropolis-san:/v0/bricks/gv0         49154     0          Y       10425
Brick boneyard-san:/v0/bricks/gv0           49152     0          Y       8504
Brick ossuary-san:/v0/bricks/gv0            49152     0          Y       13563
Self-heal Daemon on localhost               N/A       N/A        Y       22864
Self-heal Daemon on ossuary-san             N/A       N/A        Y       5815
Self-heal Daemon on necropolis-san          N/A       N/A        Y       13859

Task Status of Volume gv0
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: gvOvirt
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick boneyard-san:/v0/gbOvirt/b0           49153     0          Y       9108
Brick necropolis-san:/v0/gbOvirt/b0         49155     0          Y       10510
Brick ossuary-san:/v0/gbOvirt/b0            49153     0          Y       13577
Self-heal Daemon on localhost               N/A       N/A        Y       22864
Self-heal Daemon on ossuary-san             N/A       N/A        Y       5815
Self-heal Daemon on necropolis-san          N/A       N/A        Y       13859

Task Status of Volume gvOvirt
------------------------------------------------------------------------------
There are no active volume tasks

Also of note, it appears to have corrupted my oVirt Hosted Engine VM.
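The race described in the commit message above can be illustrated with a minimal sketch. This is not GlusterFS source; spawn_daemon(), read_pidfile(), write_pidfile() and the pidfile path are hypothetical names assumed for the example. The idea is that if the parent records the daemon's pid only after inspecting the forked child, a second start request arriving in that window finds no usable pidfile and forks a duplicate daemon; the fix keeps the parent-side update only for glusterd and has every other daemon record its own pid in the child.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define PIDFILE "/tmp/example-daemon.pid"   /* hypothetical path */

/* Return the pid recorded in the pidfile, or -1 if it is missing/empty. */
static pid_t read_pidfile(const char *path)
{
    FILE *fp = fopen(path, "r");
    long pid = -1;

    if (!fp)
        return -1;
    if (fscanf(fp, "%ld", &pid) != 1)
        pid = -1;
    fclose(fp);
    return (pid_t)pid;
}

static void write_pidfile(const char *path, pid_t pid)
{
    FILE *fp = fopen(path, "w");

    if (!fp)
        return;
    fprintf(fp, "%ld\n", (long)pid);
    fclose(fp);
}

/* Models the management path: only fork a new daemon if the pidfile
 * does not already name one. */
static void spawn_daemon(void)
{
    if (read_pidfile(PIDFILE) > 0)
        return;                          /* a daemon is already recorded */

    pid_t child = fork();
    if (child == 0) {
        /* Fixed behaviour per the commit message: the child records its
         * own pid first, so a later request sees a populated pidfile. */
        write_pidfile(PIDFILE, getpid());
        execlp("sleep", "sleep", "60", (char *)NULL);  /* stand-in daemon */
        _exit(1);
    }

    /* Buggy behaviour being fixed: the parent wrote the pidfile only
     * after inspecting the forked child, leaving a window in which the
     * pidfile is empty and a second spawn_daemon() call forks a
     * duplicate daemon for the same service:
     *
     *     write_pidfile(PIDFILE, child);
     */
}

int main(void)
{
    spawn_daemon();
    sleep(1);          /* give the child time to record its pid */
    spawn_daemon();    /* now sees the pidfile and does not fork again */
    return 0;
}
```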
Full logs are attached, hope it helps! Sorry about some of the large files; for some reason this system wasn't rotating them properly until I did some cleanup. I can take this cluster to 6.1 as soon as it appears in testing, or leave it a bit longer and try restarting some volumes or rebooting to see if I can recreate it, if that would help?

Logs were too big to attach; find them here: https://tower.ohgnetworks.com/index.php/s/UCj5amzjQdQsE5C

From glusterfs/glusterd.log-20190407 I can see the following:

[2019-04-02 22:03:45.520037] I [glusterd-utils.c:6301:glusterd_brick_start] 0-management: starting a fresh brick process for brick /v0/bricks/gv0
[2019-04-02 22:03:45.522039] I [rpc-clnt.c:1000:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2019-04-02 22:03:45.586328] C [MSGID: 106003] [glusterd-server-quorum.c:348:glusterd_do_volume_quorum_action] 0-management: Server quorum regained for volume gvOvirt. Starting local bricks.
[2019-04-02 22:03:45.586480] I [glusterd-utils.c:6214:glusterd_brick_start] 0-management: discovered already-running brick /v0/gbOvirt/b0
[2019-04-02 22:03:45.586495] I [MSGID: 106142] [glusterd-pmap.c:290:pmap_registry_bind] 0-pmap: adding brick /v0/gbOvirt/b0 on port 49157
[2019-04-02 22:03:45.586519] I [rpc-clnt.c:1000:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2019-04-02 22:03:45.662116] E [MSGID: 101012] [common-utils.c:4075:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/gv0/boneyard-san-v0-bricks-gv0.pid
[2019-04-02 22:03:45.662164] I [glusterd-utils.c:6301:glusterd_brick_start] 0-management: starting a fresh brick process for brick /v0/bricks/gv0

This indicates that we attempted to start two processes for the same brick, but that was with glusterfs-5.5, which doesn't have the fix mentioned in comment 2. Since this cluster has been upgraded to 6.0, I don't see any such event. So this is already fixed and I am closing the bug.
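The decision visible in those log lines can be sketched roughly as follows. This is not the actual gf_is_service_running() or glusterd_brick_start() source; the helper below is a simplified stand-in that only mirrors the behaviour the log shows: when the brick's pidfile cannot be read (or the recorded pid is not alive), the brick is treated as not running and a fresh brick process is started.

```c
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical stand-in for the pidfile check: read the recorded pid and
 * probe it with signal 0, which tests for existence without signalling. */
static bool is_service_running(const char *pidfile, pid_t *pid_out)
{
    FILE *fp = fopen(pidfile, "r");
    long pid = -1;

    if (!fp) {
        /* corresponds to the "Unable to read pidfile" error in the log */
        fprintf(stderr, "Unable to read pidfile: %s\n", pidfile);
        return false;
    }
    if (fscanf(fp, "%ld", &pid) != 1 || pid <= 0) {
        fclose(fp);
        return false;
    }
    fclose(fp);

    if (pid_out)
        *pid_out = (pid_t)pid;
    return kill((pid_t)pid, 0) == 0 || errno == EPERM;
}

int main(void)
{
    const char *pidfile =
        "/var/run/gluster/vols/gv0/boneyard-san-v0-bricks-gv0.pid";
    pid_t pid = -1;

    if (is_service_running(pidfile, &pid))
        printf("discovered already-running brick (pid %ld)\n", (long)pid);
    else
        printf("starting a fresh brick process for brick /v0/bricks/gv0\n");
    return 0;
}
```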