Created attachment 1553896 [details]
glusterd and brick logs

Description of problem:
Under gluster 5.3, 5.5, & 6.0, I've encountered multiple glusterfsd processes being launched for the same brick. This eventually leads to bricks disconnecting and access problems to the volumes under oVirt.

Version-Release number of selected component (if applicable):
gluster 5.3, gluster 5.5, gluster 6
currently:
glusterfs-libs-6.0-1.el7.x86_64
glusterfs-server-6.0-1.el7.x86_64
glusterfs-6.0-1.el7.x86_64

How reproducible:
100% on two clusters.

Steps to Reproduce:
1. Install/upgrade to gluster 6
2. Wait
3.

Actual results:
# ps ax | grep glusterfsd
 6227 ?  Ssl  1:10 /usr/sbin/glusterfsd -s boneyard-san --volfile-id gv0.boneyard-san.v0-bricks-gv0 -p /var/run/gluster/vols/gv0/boneyard-san-v0-bricks-gv0.pid -S /var/run/gluster/13f7371d90de07c8.socket --brick-name /v0/bricks/gv0 -l /var/log/glusterfs/bricks/v0-bricks-gv0.log --xlator-option *-posix.glusterd-uuid=7643ebf9-7661-4b9b-935b-93fa4c85ab6f --process-name brick --brick-port 49158 --xlator-option gv0-server.listen-port=49158
 6237 ?  Ssl 67:22 /usr/sbin/glusterfsd -s boneyard-san --volfile-id gvOvirt.boneyard-san.v0-gbOvirt-b0 -p /var/run/gluster/vols/gvOvirt/boneyard-san-v0-gbOvirt-b0.pid -S /var/run/gluster/4789c93bdacffcce.socket --brick-name /v0/gbOvirt/b0 -l /var/log/glusterfs/bricks/v0-gbOvirt-b0.log --xlator-option *-posix.glusterd-uuid=7643ebf9-7661-4b9b-935b-93fa4c85ab6f --process-name brick --brick-port 49159 --xlator-option gvOvirt-server.listen-port=49159
13320 ?  Ssl 34:52 /usr/sbin/glusterfsd -s boneyard-san --volfile-id gv0.boneyard-san.v0-bricks-gv0 -p /var/run/gluster/vols/gv0/boneyard-san-v0-bricks-gv0.pid -S /var/run/gluster/13f7371d90de07c8.socket --brick-name /v0/bricks/gv0 -l /var/log/glusterfs/bricks/v0-bricks-gv0.log --xlator-option *-posix.glusterd-uuid=7643ebf9-7661-4b9b-935b-93fa4c85ab6f --process-name brick --brick-port 49157 --xlator-option gv0-server.listen-port=49157
13332 ?  Ssl 14:08 /usr/sbin/glusterfsd -s boneyard-san --volfile-id gvOvirt.boneyard-san.v0-gbOvirt-b0 -p /var/run/gluster/vols/gvOvirt/boneyard-san-v0-gbOvirt-b0.pid -S /var/run/gluster/4789c93bdacffcce.socket --brick-name /v0/gbOvirt/b0 -l /var/log/glusterfs/bricks/v0-gbOvirt-b0.log --xlator-option *-posix.glusterd-uuid=7643ebf9-7661-4b9b-935b-93fa4c85ab6f --process-name brick --brick-port 49160 --xlator-option gvOvirt-server.listen-port=49160
30170 ?  Ssl  1:15 /usr/sbin/glusterfsd -s boneyard-san --volfile-id gv0.boneyard-san.v0-bricks-gv0 -p /var/run/gluster/vols/gv0/boneyard-san-v0-bricks-gv0.pid -S /var/run/gluster/13f7371d90de07c8.socket --brick-name /v0/bricks/gv0 -l /var/log/glusterfs/bricks/v0-bricks-gv0.log --xlator-option *-posix.glusterd-uuid=7643ebf9-7661-4b9b-935b-93fa4c85ab6f --process-name brick --brick-port 49153 --xlator-option gv0-server.listen-port=49153

Status of volume: gvOvirt
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick boneyard-san:/v0/gbOvirt/b0           49160     0          Y       13332
Brick necropolis-san:/v0/gbOvirt/b0         49153     0          Y       22346
Brick ossuary-san:/v0/gbOvirt/b0            49152     0          Y       16920
Self-heal Daemon on localhost               N/A       N/A        Y       603
Self-heal Daemon on necropolis-san          N/A       N/A        Y       15592
Self-heal Daemon on boneyard-san            N/A       N/A        Y       18792

Task Status of Volume gvOvirt
------------------------------------------------------------------------------
There are no active volume tasks

[root@ossuary telsin]# gluster vol heal gvOvirt info
Brick boneyard-san:/v0/gbOvirt/b0
Status: Transport endpoint is not connected
Number of entries: -

Brick necropolis-san:/v0/gbOvirt/b0
Status: Connected
Number of entries: 0

Brick ossuary-san:/v0/gbOvirt/b0
Status: Connected
Number of entries: 0

Expected results:
No more than one glusterfsd process per brick, and no transport errors.
Additional info:
I can restore normal operation by killing the extra brick processes, possibly also restarting glusterd to re-establish cluster communication.

Opening a new bug because I have brick-multiplex disabled, so this is possibly different from https://bugzilla.redhat.com/show_bug.cgi?id=1683602, or maybe they share the same root cause?

# gluster vol get all all | grep brick
cluster.brick-multiplex                 disable
cluster.max-bricks-per-process          250

Hari and Atkin have some additional log files from these systems, possibly filed under https://bugzilla.redhat.com/show_bug.cgi?id=1694010.
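To make the workaround less error-prone, the duplicate processes can be found mechanically by grouping glusterfsd command lines by their --brick-name argument. A minimal sketch, assuming input in the shape of ps ax output (the sample lines below are abbreviated illustrations, not a promise about exact ps formatting):

```python
import re
from collections import defaultdict

def find_duplicate_bricks(ps_lines):
    """Group glusterfsd command lines by --brick-name; return bricks served by >1 PID."""
    bricks = defaultdict(list)
    for line in ps_lines:
        if "/usr/sbin/glusterfsd" not in line:
            continue
        pid = int(line.split()[0])  # ps prints the PID first
        m = re.search(r"--brick-name (\S+)", line)
        if m:
            bricks[m.group(1)].append(pid)
    return {brick: pids for brick, pids in bricks.items() if len(pids) > 1}

# Illustrative sample: two processes serving the same brick path on different ports.
sample = [
    "13320 ? Ssl 34:52 /usr/sbin/glusterfsd -s boneyard-san --brick-name /v0/bricks/gv0 --brick-port 49157",
    "30170 ? Ssl 1:15 /usr/sbin/glusterfsd -s boneyard-san --brick-name /v0/bricks/gv0 --brick-port 49153",
    " 6237 ? Ssl 67:22 /usr/sbin/glusterfsd -s boneyard-san --brick-name /v0/gbOvirt/b0 --brick-port 49159",
]
print(find_duplicate_bricks(sample))  # {'/v0/bricks/gv0': [13320, 30170]}
```

All but the process recorded in the brick's pidfile would then be candidates for killing before restarting glusterd.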
Two requests I have from you:

1. Could you pass back the output of 'gluster peer status' and 'gluster volume status'?
2. Could you share a tar of /var/log/glusterfs/*.log?

Please note that we did fix a similar problem in glusterfs-6.0 with the following commit, but if you're still able to reproduce it we need to investigate. On a test setup, running a volume start and ps aux | grep glusterfsd only shows me the required brick processes, but the details asked for might give us more insight.

commit 36c75523c1f0545f32db4b807623a8f94df98ca7
Author: Mohit Agrawal <moagrawal>
Date:   Fri Mar 1 13:41:24 2019 +0530

    glusterfsd: Multiple shd processes are spawned on brick_mux environment

    Problem: Multiple shd processes are spawned while starting volumes in a
    loop in a brick_mux environment. glusterd spawns a process based on a
    pidfile, and the shd daemon takes some time to update its pid in the
    pidfile; because of that, glusterd is not able to get the shd pid.

    Solution: Commit cd249f4cb783f8d79e79468c455732669e835a4f changed the
    code to update the pidfile in the parent for any gluster daemon after
    getting the status of the forked child. To resolve this, correct the
    condition: update the pidfile in the parent only for glusterd; for the
    rest of the daemons the pidfile is updated in the child.

    > Change-Id: Ifd14797fa949562594a285ec82d58384ad717e81
    > fixes: bz#1684404
    > (Cherry pick from commit 66986594a9023c49e61b32769b7e6b260b600626)
    > (Reviewed on upstream link https://review.gluster.org/#/c/glusterfs/+/22290/)

    Change-Id: I9a68064d2da1acd0ec54b4071a9995ece0c3320c
    fixes: bz#1683880
    Signed-off-by: Mohit Agrawal <moagrawal>
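The race that commit describes can be sketched abstractly: glusterd decides whether a daemon is alive by reading its pidfile, so in the window between forking a daemon and the pid landing in the pidfile, a concurrent check sees "not running" and spawns a second copy. A hedged, self-contained illustration of that liveness check (a rough analogue of gf_is_service_running, not the actual glusterd code; is_service_running is a name invented here):

```python
import os
import tempfile

def is_service_running(pidfile):
    """A service counts as running only if its pidfile holds a pid we can
    signal; signal 0 performs an existence check without delivering anything."""
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
        os.kill(pid, 0)
        return True
    except (OSError, ValueError):
        return False

pidfile = tempfile.NamedTemporaryFile("w", delete=False, suffix=".pid").name

# The race window: the daemon has been forked but has not yet written its
# pid, so a concurrent check concludes nothing is running and a second
# process gets started ...
open(pidfile, "w").close()              # pidfile exists but is still empty
assert not is_service_running(pidfile)

# ... whereas once the pid is recorded, the same check succeeds.
with open(pidfile, "w") as f:
    f.write(str(os.getpid()))
assert is_service_running(pidfile)
os.unlink(pidfile)
```

Having the child write its own pidfile before serving requests (the fix above) closes that window for the daemons, since the pid is in place before anyone can observe the daemon as "started".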
While things were in the state I described above, peer status was normal, as it is now:

[root@boneyard telsin]# gluster peer status
Number of Peers: 2

Hostname: ossuary-san
Uuid: 0ecbf953-681b-448f-9746-d1c1fe7a0978
State: Peer in Cluster (Connected)
Other names:
10.50.3.12

Hostname: necropolis-san
Uuid: 5d082bda-bb00-48d4-9f51-ea0995066c6f
State: Peer in Cluster (Connected)
Other names:
10.50.3.10

There's a 'gluster vol status gvOvirt' from the time there were multiple fsd processes running in the original ticket. At the moment everything is normal, so I can't get you another while unusual things are happening. Right now it looks like:

[root@boneyard telsin]# gluster vol status
Status of volume: gv0
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick necropolis-san:/v0/bricks/gv0         49154     0          Y       10425
Brick boneyard-san:/v0/bricks/gv0           49152     0          Y       8504
Brick ossuary-san:/v0/bricks/gv0            49152     0          Y       13563
Self-heal Daemon on localhost               N/A       N/A        Y       22864
Self-heal Daemon on ossuary-san             N/A       N/A        Y       5815
Self-heal Daemon on necropolis-san          N/A       N/A        Y       13859

Task Status of Volume gv0
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: gvOvirt
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick boneyard-san:/v0/gbOvirt/b0           49153     0          Y       9108
Brick necropolis-san:/v0/gbOvirt/b0         49155     0          Y       10510
Brick ossuary-san:/v0/gbOvirt/b0            49153     0          Y       13577
Self-heal Daemon on localhost               N/A       N/A        Y       22864
Self-heal Daemon on ossuary-san             N/A       N/A        Y       5815
Self-heal Daemon on necropolis-san          N/A       N/A        Y       13859

Task Status of Volume gvOvirt
------------------------------------------------------------------------------
There are no active volume tasks

Also of note: it appears to have corrupted my oVirt Hosted Engine VM. Full logs are attached, hope it helps!
Sorry about some of the large files; for some reason this system wasn't rotating them properly until I did some cleanup. I can take this cluster to 6.1 as soon as it appears in testing, or leave it a bit longer and try restarting some volumes or rebooting to see if I can recreate the problem, if that would help?
Logs were too big to attach; find them here: https://tower.ohgnetworks.com/index.php/s/UCj5amzjQdQsE5C
From glusterfs/glusterd.log-20190407 I can see the following:

[2019-04-02 22:03:45.520037] I [glusterd-utils.c:6301:glusterd_brick_start] 0-management: starting a fresh brick process for brick /v0/bricks/gv0
[2019-04-02 22:03:45.522039] I [rpc-clnt.c:1000:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2019-04-02 22:03:45.586328] C [MSGID: 106003] [glusterd-server-quorum.c:348:glusterd_do_volume_quorum_action] 0-management: Server quorum regained for volume gvOvirt. Starting local bricks.
[2019-04-02 22:03:45.586480] I [glusterd-utils.c:6214:glusterd_brick_start] 0-management: discovered already-running brick /v0/gbOvirt/b0
[2019-04-02 22:03:45.586495] I [MSGID: 106142] [glusterd-pmap.c:290:pmap_registry_bind] 0-pmap: adding brick /v0/gbOvirt/b0 on port 49157
[2019-04-02 22:03:45.586519] I [rpc-clnt.c:1000:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2019-04-02 22:03:45.662116] E [MSGID: 101012] [common-utils.c:4075:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/gv0/boneyard-san-v0-bricks-gv0.pid
[2019-04-02 22:03:45.662164] I [glusterd-utils.c:6301:glusterd_brick_start] 0-management: starting a fresh brick process for brick /v0/bricks/gv0

This indicates that we attempted to start two processes for the same brick, but this was with the glusterfs-5.5 version, which doesn't have the fix mentioned in comment 2. Since this cluster has been upgraded to 6.0, I don't see such an event. So this is already fixed, and I am closing the bug.
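For anyone auditing their own glusterd.log for the same symptom, the telltale pattern in the excerpt above (the same brick path getting "starting a fresh brick process" more than once in a run) can be flagged mechanically. A small sketch, assuming log lines in the format shown above (repeated_fresh_starts is a name invented here):

```python
import re
from collections import Counter

FRESH_START = re.compile(r"starting a fresh brick process for brick (\S+)")

def repeated_fresh_starts(log_lines):
    """Count 'fresh brick process' starts per brick path; return the paths
    that were started more than once, which suggests a duplicate process."""
    counts = Counter()
    for line in log_lines:
        m = FRESH_START.search(line)
        if m:
            counts[m.group(1)] += 1
    return [brick for brick, n in counts.items() if n > 1]

# Sample mirroring the log excerpt: /v0/bricks/gv0 is started fresh twice.
sample = [
    "[2019-04-02 22:03:45.520037] I [glusterd-utils.c:6301:glusterd_brick_start] 0-management: starting a fresh brick process for brick /v0/bricks/gv0",
    "[2019-04-02 22:03:45.586480] I [glusterd-utils.c:6214:glusterd_brick_start] 0-management: discovered already-running brick /v0/gbOvirt/b0",
    "[2019-04-02 22:03:45.662164] I [glusterd-utils.c:6301:glusterd_brick_start] 0-management: starting a fresh brick process for brick /v0/bricks/gv0",
]
print(repeated_fresh_starts(sample))  # ['/v0/bricks/gv0']
```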