Bug 1480516

Summary: Gluster Bricks are not coming up after pod restart when bmux is ON
Product: [Community] GlusterFS
Reporter: Mohit Agrawal <moagrawa>
Component: glusterd
Assignee: Mohit Agrawal <moagrawa>
Status: CLOSED WORKSFORME
Severity: medium
Priority: medium
Version: mainline
CC: akhakhar, amukherj, annair, atumball, bmekala, bugs, hchiramm, jarrpa, madam, pprakash, rcyriac, rhs-bugs, rreddy, rtalur, storage-qa-internal, vbellur
Keywords: Triaged
Fixed In Version: glusterfs-6.x
Clone Of: 1480423
Last Closed: 2019-05-11 11:41:42 UTC
Type: Bug
Bug Depends On: 1258561, 1480423, 1480459
Bug Blocks: 1480332

Description Mohit Agrawal 2017-08-11 09:52:20 UTC
+++ This bug was initially created as a clone of Bug #1480423 +++

+++ This bug was initially created as a clone of Bug #1480332 +++

Description of problem:

One of the volume bricks does not come up after restarting the pod.

sh-4.2# rpm -qa |grep glusterfs
glusterfs-fuse-3.8.4-39.el7rhgs.x86_64
glusterfs-server-3.8.4-39.el7rhgs.x86_64
glusterfs-libs-3.8.4-39.el7rhgs.x86_64
glusterfs-3.8.4-39.el7rhgs.x86_64
glusterfs-api-3.8.4-39.el7rhgs.x86_64
glusterfs-cli-3.8.4-39.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-39.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-39.el7rhgs.x86_64
sh-4.2#


sh-4.2# gluster v status
Status of volume: heketidbstorage
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 192.168.35.5:/var/lib/heketi/mounts/v
g_e2c07dc7b49f9f42ca327a3073966364/brick_7a
c173229b1e69e8730be424956846e4/brick        49152     0          Y       524 
Brick 192.168.35.2:/var/lib/heketi/mounts/v
g_11f9761be704eb359ffd71c4d76afdf1/brick_04
0ff4a25411b19d2b24d0340bf3e9f8/brick        49152     0          Y       403 
Brick 192.168.35.6:/var/lib/heketi/mounts/v
g_1438320d4e7bde3c3d2f232671afd699/brick_1f
0d216ac0f28f1f5b0fbbc4adcdc69b/brick        49152     0          Y       410 
Self-heal Daemon on localhost               N/A       N/A        Y       399 
Self-heal Daemon on 192.168.35.5            N/A       N/A        Y       514 
Self-heal Daemon on 192.168.35.6            N/A       N/A        Y       435 
Self-heal Daemon on 192.168.35.4            N/A       N/A        Y       412 
Self-heal Daemon on 192.168.35.2            N/A       N/A        Y       412 
 
Task Status of Volume heketidbstorage
------------------------------------------------------------------------------
There are no active volume tasks
 
Status of volume: vol_96d8648ebd2ff7dcb87be3a2587c6246
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 192.168.35.6:/var/lib/heketi/mounts/v
g_9ed2ce0c55973e60d0ffc25d0028157f/brick_90
862eaeb677759f02e0809786e8ec70/brick        49153     0          Y       416 
Brick 192.168.35.3:/var/lib/heketi/mounts/v
g_803bb877441ad72c82b0ee6113a79511/brick_9a
7f3517851a71e13bc3ff1f3b96f4bc/brick        49152     0          Y       408 
Brick 192.168.35.4:/var/lib/heketi/mounts/v
g_799ba6da52ca29cf5404bd96b53fa957/brick_3f
ed005da1f6746d2d299b1e971fcc39/brick        49152     0          Y       403 
Brick 192.168.35.6:/var/lib/heketi/mounts/v
g_1438320d4e7bde3c3d2f232671afd699/brick_61
5a892b042f50617e4d228e0971c6ce/brick        49153     0          Y       416 
Brick 192.168.35.4:/var/lib/heketi/mounts/v
g_3ae4bb7cd25740f3cd84a15dbb985f48/brick_bb
9eb06eb20f6d570d4d2f6378b14742/brick        49152     0          Y       403 
Brick 192.168.35.5:/var/lib/heketi/mounts/v
g_96d6ff00fd2a38ead4c9ef9cd6db935f/brick_1a
6dcabc43a70402dc595ed89d76126f/brick        49153     0          Y       531 
Self-heal Daemon on localhost               N/A       N/A        Y       399 
Self-heal Daemon on 192.168.35.4            N/A       N/A        Y       412 
Self-heal Daemon on 192.168.35.2            N/A       N/A        Y       412 
Self-heal Daemon on 192.168.35.5            N/A       N/A        Y       514 
Self-heal Daemon on 192.168.35.6            N/A       N/A        Y       435 
 
Task Status of Volume vol_96d8648ebd2ff7dcb87be3a2587c6246
------------------------------------------------------------------------------
There are no active volume tasks
 
Status of volume: vol_bd43b5b5f048c469ee16f80376d73ec3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 192.168.35.3:/var/lib/heketi/mounts/v
g_803bb877441ad72c82b0ee6113a79511/brick_80
9331b5ef0e23c3327cf4483258dd25/brick        N/A       N/A        N       N/A 
Brick 192.168.35.4:/var/lib/heketi/mounts/v
g_799ba6da52ca29cf5404bd96b53fa957/brick_7d
5d8d5c3074f965d456c6d1fbfbc54d/brick        49153     0          Y       407 
Brick 192.168.35.2:/var/lib/heketi/mounts/v
g_11f9761be704eb359ffd71c4d76afdf1/brick_54
dd82e9703966757b5c8f9f332c998b/brick        49152     0          Y       403 
Self-heal Daemon on localhost               N/A       N/A        Y       399 
Self-heal Daemon on 192.168.35.5            N/A       N/A        Y       514 
Self-heal Daemon on 192.168.35.6            N/A       N/A        Y       435 
Self-heal Daemon on 192.168.35.2            N/A       N/A        Y       412 
Self-heal Daemon on 192.168.35.4            N/A       N/A        Y       412 
 
Task Status of Volume vol_bd43b5b5f048c469ee16f80376d73ec3
------------------------------------------------------------------------------
There are no active volume tasks
 
sh-4.2# 


sh-4.2# cat glusterd.log|grep -i disconnec
[2017-08-08 11:45:35.612285] I [MSGID: 106004] [glusterd-handler.c:5879:__glusterd_peer_rpc_notify] 0-management: Peer <192.168.35.6> (<e74482f2-b77f-4920-817e-5181cb748cce>), in state <Peer in Cluster>, has disconnected from glusterd.
[2017-08-08 11:45:45.700948] E [socket.c:2360:socket_connect_finish] 0-management: connection to 192.168.35.6:24007 failed (Connection refused); disconnecting socket
[2017-08-08 15:21:15.512361] I [socket.c:2465:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2017-08-08 15:21:15.513083] I [MSGID: 106005] [glusterd-handler.c:5687:__glusterd_brick_rpc_notify] 0-management: Brick 192.168.35.3:/var/lib/heketi/mounts/vg_803bb877441ad72c82b0ee6113a79511/brick_809331b5ef0e23c3327cf4483258dd25/brick has disconnected from glusterd.
[2017-08-09 17:43:29.010548] I [MSGID: 106004] [glusterd-handler.c:5879:__glusterd_peer_rpc_notify] 0-management: Peer <192.168.35.4> (<29382453-610e-40f6-b2b3-9ae4edcfde51>), in state <Peer in Cluster>, has disconnected from glusterd.
[2017-08-09 17:43:39.504984] E [socket.c:2360:socket_connect_finish] 0-management: connection to 192.168.35.4:24007 failed (Connection refused); disconnecting socket
[2017-08-09 17:45:59.307446] I [MSGID: 106004] [glusterd-handler.c:5879:__glusterd_peer_rpc_notify] 0-management: Peer <192.168.35.2> (<8dbffc32-38b5-460c-a857-cd2303a35c35>), in state <Peer in Cluster>, has disconnected from glusterd.
[2017-08-09 17:46:09.793636] E [socket.c:2360:socket_connect_finish] 0-management: connection to 192.168.35.2:24007 failed (Connection refused); disconnecting socket
[2017-08-09 17:48:29.725173] I [MSGID: 106004] [glusterd-handler.c:5879:__glusterd_peer_rpc_notify] 0-management: Peer <192.168.35.6> (<e74482f2-b77f-4920-817e-5181cb748cce>), in state <Peer in Cluster>, has disconnected from glusterd.
[2017-08-09 17:48:40.112820] E [socket.c:2360:socket_connect_finish] 0-management: connection to 192.168.35.6:24007 failed (Connection refused); disconnecting socket
[2017-08-09 17:53:41.528247] E [socket.c:2360:socket_connect_finish] 0-management: connection to 192.168.35.5:24007 failed (Connection refused); disconnecting socket
[2017-08-09 17:53:41.528302] I [MSGID: 106004] [glusterd-handler.c:5879:__glusterd_peer_rpc_notify] 0-management: Peer <192.168.35.5> (<849ac512-5d53-47b3-bbe2-7afff39627e7>), in state <Peer in Cluster>, has disconnected from glusterd.
[2017-08-09 17:53:43.192552] I [socket.c:2465:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2017-08-09 17:53:43.193180] I [MSGID: 106005] [glusterd-handler.c:5687:__glusterd_brick_rpc_notify] 0-management: Brick 192.168.35.3:/var/lib/heketi/mounts/vg_803bb877441ad72c82b0ee6113a79511/brick_9a7f3517851a71e13bc3ff1f3b96f4bc/brick has disconnected from glusterd.
[2017-08-09 17:53:43.193974] I [socket.c:2465:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2017-08-09 17:53:43.194563] I [MSGID: 106005] [glusterd-handler.c:5687:__glusterd_brick_rpc_notify] 0-management: Brick 192.168.35.3:/var/lib/heketi/mounts/vg_803bb877441ad72c82b0ee6113a79511/brick_809331b5ef0e23c3327cf4483258dd25/brick has disconnected from glusterd.
sh-4.2# 


--- Additional comment from Red Hat Bugzilla Rules Engine on 2017-08-10 13:46:47 EDT ---

This bug is automatically being proposed for the current release of Container-Native Storage under active development, by setting the release flag 'cns-3.6.0' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Humble Chirammal on 2017-08-10 13:47:59 EDT ---

Atin and Mohit have analysed this issue and found the root cause: a PID clash. More updates will follow soon.

--- Additional comment from Red Hat Bugzilla Rules Engine on 2017-08-11 00:57:30 EDT ---

This bug is automatically being proposed for the current release of Red Hat Gluster Storage 3 under active development, by setting the release flag 'rhgs-3.3.0' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Atin Mukherjee on 2017-08-11 00:59:44 EDT ---

Rejy - Can you please raise a blocker flag on this bug? This issue comes from CNS testing, and we need to discuss it in the rhgs-3.3.0 blocker triage meeting.

--- Additional comment from Atin Mukherjee on 2017-08-11 01:14:22 EDT ---

RCA:

When the gluster pod in a CNS environment is restarted, all the gluster daemon pidfiles persist, because they are stored under /var/lib/glusterd/. On pod restart, when glusterd tries to start the bricks, it first reads the PID from each brick's pidfile and checks for a matching entry in /proc/<pid>/cmdline; if it finds one, it assumes that brick is already running.
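
Conceptually, that liveness check looks something like the following shell sketch. This is illustrative only, not glusterd's actual source logic; the pidfile path is a hypothetical example.

# Sketch of the "is the brick already running?" test described above.
# A recycled PID makes this report "running" even for a dead brick.
PIDFILE=/var/lib/glusterd/vols/demo/run/brick.pid   # hypothetical path
pid=$(cat "$PIDFILE" 2>/dev/null)
if [ -n "$pid" ] && [ -e "/proc/$pid/cmdline" ]; then
    echo "brick assumed running (pid $pid)"   # possible false positive
else
    echo "brick not running; glusterd would (re)start it"
fi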

In this particular case glusterd had to start two bricks on the pod. Because bricks are started asynchronously in a glusterd restart scenario, glusterd sent the trigger to start the first brick and then went ahead and tried to start the second. For every daemon, daemonization spawns two processes: the parent and the forked child. The parent exits once the child has been forked, and during this transition both PIDs briefly have entries under /proc.

Suppose the first brick's parent PID was 101 and its child PID was 102, and the second brick's pidfile still contained the stale entry 101 from before the pod restart. When glusterd tries to restart the second brick, it finds PID 101 running (although that PID belongs to the first brick's process, not the second brick), assumes the second brick is already running, and never triggers a start or an attach for it.
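
The same false positive can be reproduced in isolation with a small shell sketch (the pidfile path and the stand-in process are invented for illustration):

# A live process whose PID lands in another brick's stale pidfile
sleep 300 &                          # stand-in for brick 1's short-lived parent
echo "$!" > /tmp/brick2.pid          # pretend brick 2's stale pidfile holds that PID
pid=$(cat /tmp/brick2.pid)
[ -e "/proc/$pid" ] && echo "brick 2 wrongly reported as running (pid $pid)"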

In a container environment PID numbers stay small (three digits here), so the probability of this clash is much higher than in CRS deployments, where the gluster storage nodes run outside of containers.

One easy way to mitigate this problem in CNS is to clean up the pidfiles during start-up. In CRS, although the probability of this clash is *very* low, there is no guarantee that we can never hit it.

The impact of this issue is that the brick will not come up; however, there is no impact on I/O, since the other replica bricks keep running. Doing a forced volume start attaches the brick and brings its status online.
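
That workaround, applied to the volume whose brick is shown offline in the status output above:

# Force-start leaves running bricks alone and (re)starts the offline one
gluster volume start vol_bd43b5b5f048c469ee16f80376d73ec3 force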

We have a patch, https://review.gluster.org/#/c/13580/, which changes the pidfile location from /var/lib/glusterd to /var/run/gluster, so that all the pidfiles are cleaned up on node reboot.

Upstream patch: https://review.gluster.org/#/c/13580/

--- Additional comment from Humble Chirammal on 2017-08-11 01:47:24 EDT ---

Thanks Atin. I would like to propose this as a blocker for the CNS 3.6 release.

Comment 1 Worker Ant 2017-08-11 09:59:45 UTC
REVIEW: https://review.gluster.org/18025 (glusterd: Introduce run-directory option in glusterd.vol file) posted (#1) for review on master by MOHIT AGRAWAL (moagrawa)
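
Assuming that change lands as titled, /etc/glusterfs/glusterd.vol would gain the new option alongside the existing working-directory setting. A sketch, with the exact option syntax assumed from the patch title:

volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option run-directory /var/run/gluster   # assumed name per patch title; /var/run is typically tmpfs, cleared at reboot
end-volume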

Comment 2 Atin Mukherjee 2018-04-02 02:50:41 UTC
The bug itself is fixed. Patch 18025 can be tracked separately if required.

Comment 3 Amar Tumballi 2019-05-11 11:41:42 UTC
As per Comment #2.