Bug 1698131
| Field | Value |
|---|---|
| Summary | multiple glusterfsd processes being launched for the same brick, causing transport endpoint not connected |
| Product | [Community] GlusterFS |
| Component | glusterd |
| Version | 6 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED CURRENTRELEASE |
| Severity | high |
| Priority | unspecified |
| Reporter | Darrell <budic> |
| Assignee | Atin Mukherjee <amukherj> |
| QA Contact | Bala Konda Reddy M <bmekala> |
| CC | bugs, pasik, rhs-bugs, sankarshan, storage-qa-internal, vbellur |
| Target Milestone | --- |
| Target Release | --- |
| Doc Type | If docs needed, set a value |
| Story Points | --- |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Bug Blocks | 1692394 |
| Last Closed | 2019-04-29 03:28:32 UTC |
Description (Darrell, 2019-04-09 16:23:21 UTC)
Two requests I have from you:

1. Could you pass back the output of 'gluster peer status' and 'gluster volume status'?
2. Could you share a tar of /var/log/glusterfs/*.log?

Please note that we did fix a similar problem in glusterfs-6.0 with the following commit, but if you're still able to reproduce it we need to investigate. On a test setup, running a volume start and ps aux | grep glusterfsd only shows me the required brick processes, but the details asked for might give us more insight.

commit 36c75523c1f0545f32db4b807623a8f94df98ca7
Author: Mohit Agrawal <moagrawal>
Date:   Fri Mar 1 13:41:24 2019 +0530

    glusterfsd: Multiple shd processes are spawned on brick_mux environment

    Problem: Multiple shd processes are spawned while starting volumes in a loop in a brick_mux environment. glusterd spawns a process based on a pidfile, and the shd daemon takes some time to update its pid in the pidfile; because of that, glusterd is not able to get the shd pid.

    Solution: Commit cd249f4cb783f8d79e79468c455732669e835a4f changed the code to update the pidfile in the parent for any gluster daemon after getting the status of the forked child. To resolve this, correct the condition so that the pidfile is updated in the parent only for glusterd; for the rest of the daemons the pidfile is updated in the child.

    > Change-Id: Ifd14797fa949562594a285ec82d58384ad717e81
    > fixes: bz#1684404
    > (Cherry pick from commit 66986594a9023c49e61b32769b7e6b260b600626)
    > (Reviewed on upstream link https://review.gluster.org/#/c/glusterfs/+/22290/)

    Change-Id: I9a68064d2da1acd0ec54b4071a9995ece0c3320c
    fixes: bz#1683880
    Signed-off-by: Mohit Agrawal <moagrawal>

While things were in the state I described above, peer status was normal, as it is now:

[root@boneyard telsin]# gluster peer status
Number of Peers: 2

Hostname: ossuary-san
Uuid: 0ecbf953-681b-448f-9746-d1c1fe7a0978
State: Peer in Cluster (Connected)
Other names: 10.50.3.12

Hostname: necropolis-san
Uuid: 5d082bda-bb00-48d4-9f51-ea0995066c6f
State: Peer in Cluster (Connected)
Other names: 10.50.3.10

There's a 'gluster vol status gvOvirt' from the time there were multiple glusterfsd processes running in the original ticket. At the moment everything is normal, so I can't get you another while unusual things are happening. Right now it looks like:

[root@boneyard telsin]# gluster vol status
Status of volume: gv0
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick necropolis-san:/v0/bricks/gv0         49154     0          Y       10425
Brick boneyard-san:/v0/bricks/gv0           49152     0          Y       8504
Brick ossuary-san:/v0/bricks/gv0            49152     0          Y       13563
Self-heal Daemon on localhost               N/A       N/A        Y       22864
Self-heal Daemon on ossuary-san             N/A       N/A        Y       5815
Self-heal Daemon on necropolis-san          N/A       N/A        Y       13859

Task Status of Volume gv0
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: gvOvirt
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick boneyard-san:/v0/gbOvirt/b0           49153     0          Y       9108
Brick necropolis-san:/v0/gbOvirt/b0         49155     0          Y       10510
Brick ossuary-san:/v0/gbOvirt/b0            49153     0          Y       13577
Self-heal Daemon on localhost               N/A       N/A        Y       22864
Self-heal Daemon on ossuary-san             N/A       N/A        Y       5815
Self-heal Daemon on necropolis-san          N/A       N/A        Y       13859

Task Status of Volume gvOvirt
------------------------------------------------------------------------------
There are no active volume tasks

Also of note, it appears to have corrupted my oVirt Hosted Engine VM.
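The race described in the commit message above can be illustrated with a minimal sketch. This is not GlusterFS source; spawn_daemon(), read_pidfile(), write_pidfile() and the pidfile path are hypothetical names assumed for the example. The idea is that if the parent records the daemon's pid only after inspecting the forked child, a second start request arriving in that window finds no usable pidfile and forks a duplicate daemon; the fix keeps the parent-side update only for glusterd and has every other daemon record its own pid in the child.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define PIDFILE "/tmp/example-daemon.pid"   /* hypothetical path */

/* Return the pid recorded in the pidfile, or -1 if it is missing/empty. */
static pid_t read_pidfile(const char *path)
{
    FILE *fp = fopen(path, "r");
    long pid = -1;

    if (!fp)
        return -1;
    if (fscanf(fp, "%ld", &pid) != 1)
        pid = -1;
    fclose(fp);
    return (pid_t)pid;
}

static void write_pidfile(const char *path, pid_t pid)
{
    FILE *fp = fopen(path, "w");

    if (!fp)
        return;
    fprintf(fp, "%ld\n", (long)pid);
    fclose(fp);
}

/* Models the management path: only fork a new daemon if the pidfile
 * does not already name one. */
static void spawn_daemon(void)
{
    if (read_pidfile(PIDFILE) > 0)
        return;                          /* a daemon is already recorded */

    pid_t child = fork();
    if (child == 0) {
        /* Fixed behaviour per the commit message: the child records its
         * own pid first, so a later request sees a populated pidfile. */
        write_pidfile(PIDFILE, getpid());
        execlp("sleep", "sleep", "60", (char *)NULL);  /* stand-in daemon */
        _exit(1);
    }

    /* Buggy behaviour being fixed: the parent wrote the pidfile only
     * after inspecting the forked child, leaving a window in which the
     * pidfile is empty and a second spawn_daemon() call forks a
     * duplicate daemon for the same service:
     *
     *     write_pidfile(PIDFILE, child);
     */
}

int main(void)
{
    spawn_daemon();
    sleep(1);          /* give the child time to record its pid */
    spawn_daemon();    /* now sees the pidfile and does not fork again */
    return 0;
}
```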
Full logs are attached, hope it helps! Sorry about some of the large files; for some reason this system wasn't rotating them properly until I did some cleanup. I can take this cluster to 6.1 as soon as it appears in testing, or leave it a bit longer and try restarting some volumes or rebooting to see if I can recreate it, if that would help?

Logs were too big to attach; find them here: https://tower.ohgnetworks.com/index.php/s/UCj5amzjQdQsE5C

From glusterfs/glusterd.log-20190407 I can see the following:

[2019-04-02 22:03:45.520037] I [glusterd-utils.c:6301:glusterd_brick_start] 0-management: starting a fresh brick process for brick /v0/bricks/gv0
[2019-04-02 22:03:45.522039] I [rpc-clnt.c:1000:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2019-04-02 22:03:45.586328] C [MSGID: 106003] [glusterd-server-quorum.c:348:glusterd_do_volume_quorum_action] 0-management: Server quorum regained for volume gvOvirt. Starting local bricks.
[2019-04-02 22:03:45.586480] I [glusterd-utils.c:6214:glusterd_brick_start] 0-management: discovered already-running brick /v0/gbOvirt/b0
[2019-04-02 22:03:45.586495] I [MSGID: 106142] [glusterd-pmap.c:290:pmap_registry_bind] 0-pmap: adding brick /v0/gbOvirt/b0 on port 49157
[2019-04-02 22:03:45.586519] I [rpc-clnt.c:1000:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2019-04-02 22:03:45.662116] E [MSGID: 101012] [common-utils.c:4075:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/gv0/boneyard-san-v0-bricks-gv0.pid
[2019-04-02 22:03:45.662164] I [glusterd-utils.c:6301:glusterd_brick_start] 0-management: starting a fresh brick process for brick /v0/bricks/gv0

This indicates that we attempted to start two processes for the same brick, but that was with glusterfs-5.5, which doesn't have the fix mentioned in comment 2. Since this cluster has been upgraded to 6.0, I don't see any such event. So this is already fixed and I am closing the bug.
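The decision visible in those log lines can be sketched roughly as follows. This is not the actual gf_is_service_running() or glusterd_brick_start() source; the helper below is a simplified stand-in that only mirrors the behaviour the log shows: when the brick's pidfile cannot be read (or the recorded pid is not alive), the brick is treated as not running and a fresh brick process is started.

```c
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical stand-in for the pidfile check: read the recorded pid and
 * probe it with signal 0, which tests for existence without signalling. */
static bool is_service_running(const char *pidfile, pid_t *pid_out)
{
    FILE *fp = fopen(pidfile, "r");
    long pid = -1;

    if (!fp) {
        /* corresponds to the "Unable to read pidfile" error in the log */
        fprintf(stderr, "Unable to read pidfile: %s\n", pidfile);
        return false;
    }
    if (fscanf(fp, "%ld", &pid) != 1 || pid <= 0) {
        fclose(fp);
        return false;
    }
    fclose(fp);

    if (pid_out)
        *pid_out = (pid_t)pid;
    return kill((pid_t)pid, 0) == 0 || errno == EPERM;
}

int main(void)
{
    const char *pidfile =
        "/var/run/gluster/vols/gv0/boneyard-san-v0-bricks-gv0.pid";
    pid_t pid = -1;

    if (is_service_running(pidfile, &pid))
        printf("discovered already-running brick (pid %ld)\n", (long)pid);
    else
        printf("starting a fresh brick process for brick /v0/bricks/gv0\n");
    return 0;
}
```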