RCA: When the gluster pod in a CNS environment is restarted, all the gluster daemon pidfiles persist because they are stored under /var/lib/glusterd/. On pod restart, when glusterd tries to start the bricks, it first reads the pid from each brick's pidfile and looks for a matching entry under /proc/<pid>/cmdline; if it finds one, it assumes that brick is already running.

In this particular case glusterd had to start two bricks on the pod. Because bricks are started asynchronously on a glusterd restart, glusterd sent the trigger to start the first brick and immediately went ahead and tried to start the second one. For every daemon, daemonization spawns two processes: the parent and the child. When the child is forked, the parent goes away, and during this transition there are two pid entries under /proc. Say the first brick's parent pid was 101 and its child pid was 102, and the second brick's pidfile happened to contain 101 from before the pod was restarted. When glusterd then tried to restart the second brick, it found pid 101 running (although that pid belonged to the first brick's process), assumed the second brick was already running, and never triggered a start or an attach for it.

In a container environment the pid numbers are small (3 digits), so the probability of this clash is much higher than in CRS deployments, where the gluster storage nodes run outside containers. One easy way to mitigate this problem in CNS is to clean up the pidfiles during startup. In CRS the probability of the clash is *very* low, but there is no guarantee that we can never hit it.

The impact of this issue is that the affected brick will not come up; however, there is no impact to I/O because the other replica bricks keep running. Doing a volume start force attaches the brick and its status shows online again. We have a patch https://review.gluster.org/#/c/13580/ which changes the pidfile location from /var/lib/glusterd to /var/run/gluster so that all the pidfiles are cleaned up on node reboot.

upstream patch : https://review.gluster.org/#/c/13580/
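For illustration, here is a minimal C sketch of the check described above (this is not the actual glusterd source; the function name and pidfile path are hypothetical). It shows how treating the mere existence of /proc/<pid>/cmdline as proof that the brick is running turns into a false positive when the pid has been reused by another process:

    #include <stdio.h>

    /* Hypothetical helper, sketching the stale-pidfile check described
     * in the RCA above; the real glusterd code differs. */
    static int brick_seems_running(const char *pidfile)
    {
        FILE *fp = fopen(pidfile, "r");
        long  pid = 0;
        char  path[64];

        if (!fp)
            return 0;                     /* no pidfile -> brick not running */
        if (fscanf(fp, "%ld", &pid) != 1)
            pid = 0;
        fclose(fp);
        if (pid <= 0)
            return 0;

        /* Existence check only: a readable /proc/<pid>/cmdline proves that
         * *some* process owns this pid, not that it is our brick. If the
         * pid was reused (e.g. by another brick's parent process during
         * daemonization), this returns a false positive. */
        snprintf(path, sizeof(path), "/proc/%ld/cmdline", pid);
        fp = fopen(path, "r");
        if (!fp)
            return 0;
        fclose(fp);
        return 1;
    }

    int main(void)
    {
        /* hypothetical pidfile path for the 2nd brick */
        const char *pidfile = "/var/lib/glusterd/vols/testvol/run/brick2.pid";

        if (brick_seems_running(pidfile))
            printf("skipping start: pid in %s appears to be alive\n", pidfile);
        else
            printf("starting brick\n");
        return 0;
    }

With the patch, the same check runs against pidfiles under /var/run/gluster, which do not survive a node reboot, so a stale pid cannot linger there.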
Thanks Atin, I would like to propose this as a blocker for the CNS 3.6 release.
upstream 3.12 patch : https://review.gluster.org/#/c/18023/
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/115068
BUILD : 3.8.4-41

Tests performed:
1. On all volume types, enabled the volume options related to the respective component (Snapshot, Quota, NFS, Bitrot, Rebalance, NFS-Ganesha and Samba, Geo-Rep).
2. Checked that all the daemons (snapd, tierd, quotad, nfsd, bitd) are online after enabling the respective volume option.
3. Tested the node reboot scenarios covered in regression.
4. Tested the glusterd restart scenarios.
5. Tested the volume restart scenarios.

Result: After node reboot, glusterd restart and volume restart, the respective daemons and bricks are online and running as expected; no stale pids and no crashes were seen. Hence marking it as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774