RCA: When the gluster pod in a CNS environment is restarted, all the gluster daemon pidfiles persist because they are stored under /var/lib/glusterd/. On pod restart, when glusterd tries to start the bricks, it first reads the pid from each brick's pidfile and looks for a matching entry under /proc/<pid>/cmdline; if it finds one, it assumes that brick is already running.

In this particular case glusterd had to start two bricks on the pod. Because bricks are started asynchronously on a glusterd restart, glusterd sent the trigger to start the first brick and immediately went ahead and tried to start the second one. For every daemon, daemonization spawns two processes: the parent and the child. When the child is forked, the parent goes away, and during this transition there are two pid entries under /proc. Say the first brick's parent pid was 101 and its child pid was 102, and the second brick's pidfile happened to contain 101 from before the pod was restarted. When glusterd then tried to restart the second brick, it found pid 101 running (although that pid belonged to the first brick's process), assumed the second brick was already running, and never triggered a start or an attach for it.

In a container environment the pid numbers are small (3 digits), so the probability of this clash is much higher than in CRS deployments, where the gluster storage nodes run outside containers. One easy way to mitigate this problem in CNS is to clean up the pidfiles during startup. In CRS the probability of the clash is *very* low, but there is no guarantee that we can never hit it.

The impact of this issue is that the affected brick will not come up; however, there is no impact to I/O because the other replica bricks keep running. Doing a volume start force attaches the brick and its status shows online again. We have a patch https://review.gluster.org/#/c/13580/ which changes the pidfile location from /var/lib/glusterd to /var/run/gluster so that all the pidfiles are cleaned up on node reboot.

upstream patch : https://review.gluster.org/#/c/13580/
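For illustration, here is a minimal C sketch of the check described above (this is not the actual glusterd source; the function name and pidfile path are hypothetical). It shows how treating the mere existence of /proc/<pid>/cmdline as proof that the brick is running turns into a false positive when the pid has been reused by another process:

    #include <stdio.h>

    /* Hypothetical helper, sketching the stale-pidfile check described
     * in the RCA above; the real glusterd code differs. */
    static int brick_seems_running(const char *pidfile)
    {
        FILE *fp = fopen(pidfile, "r");
        long  pid = 0;
        char  path[64];

        if (!fp)
            return 0;                     /* no pidfile -> brick not running */
        if (fscanf(fp, "%ld", &pid) != 1)
            pid = 0;
        fclose(fp);
        if (pid <= 0)
            return 0;

        /* Existence check only: a readable /proc/<pid>/cmdline proves that
         * *some* process owns this pid, not that it is our brick. If the
         * pid was reused (e.g. by another brick's parent process during
         * daemonization), this returns a false positive. */
        snprintf(path, sizeof(path), "/proc/%ld/cmdline", pid);
        fp = fopen(path, "r");
        if (!fp)
            return 0;
        fclose(fp);
        return 1;
    }

    int main(void)
    {
        /* hypothetical pidfile path for the 2nd brick */
        const char *pidfile = "/var/lib/glusterd/vols/testvol/run/brick2.pid";

        if (brick_seems_running(pidfile))
            printf("skipping start: pid in %s appears to be alive\n", pidfile);
        else
            printf("starting brick\n");
        return 0;
    }

With the patch, the same check runs against pidfiles under /var/run/gluster, which do not survive a node reboot, so a stale pid cannot linger there.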
Thanks Atin, I would like to propose this as a blocker for the CNS 3.6 release.
upstream 3.12 patch : https://review.gluster.org/#/c/18023/
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/115068
BUILD : 3.8.4-41

Tests performed:
1. On all volume types, enabled the volume options related to the respective component (Snapshot, Quota, NFS, Bitrot, Rebalance, NFS-Ganesha and Samba, Geo-Rep).
2. Checked that all the daemons (snapd, tierd, quotad, nfsd, bitd) are online after enabling the respective volume option.
3. Tested the node reboot scenarios covered in regression.
4. Tested the glusterd restart scenarios.
5. Tested the volume restart scenarios.

Result: After node reboot, glusterd restart and volume restart, the respective daemons and bricks are online and running as expected; no stale pids and no crashes were seen. Hence marking it as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774