Description of problem:
======================
I have a 3-node cluster with brick multiplexing enabled.
Three volumes exist, as below:
1) one 12x(6+2) EC volume named cvlt-ecv
2) two 1x3 AFR volumes, namely testvol and logvol

I/O is being run on the cvlt-ecv volume (just dd writes and appends).

Two of the nodes were upgraded over the past few days.
As part of upgrading the last node of the 3-node cluster to 6.0.5 (including the kernel), I rebooted the node.
After that, the bricks did not come up because of some bad entries in fstab. On resolving them, I also noticed that the cluster had gone into the peer rejected state.
When I checked the cksums of the cvlt-ecv volume, I noticed that the cksum value on n3 (the node being upgraded) differed from the value on n1 and n2.
To fix that, we deleted the entire cvlt-ecv directory under /var/lib/glusterd so that glusterd would heal it.
After a restart of glusterd, the peer rejected issue was fixed.

However, we noticed that shd was not showing as online for the two AFR volumes.
We tried restarting glusterd (including killing the glusterfsd, shd, and fs procs), but shd still did not come up for the two AFR volumes.
Based on the logs, we noticed that stale pid entries for testvol and logvol still existed under /var/run/gluster and were blocking shd from starting on these volumes.
I went ahead and deleted the old stale pid files, and shd came up on all the volumes.

While I thought it was a one-off, I now see the same behavior on another node too, which is quite concerning, as we see the problems below:
1) manual index heal command is timing out
2) heal daemon is not running on the other volumes (as a stale pidfile exists in /var/run/gluster)

Version-Release number of selected component (if applicable):
===================
6.0.5

How reproducible:
============
Consistent on my cluster

Steps to Reproduce:
================
Explained in the description; more details on the cluster are available at https://docs.google.com/spreadsheets/d/1_jmnDAcs1TqXbWjw-r4iCYo4zGKheSAzP1lfMSxVS6w/edit#gid=0
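
For anyone trying to confirm the same condition on their nodes, the stale pid file check described above could be scripted roughly as follows. This is only a sketch: the function names are hypothetical, and it assumes the pid files under /var/run/gluster end in ".pid" and contain a single numeric pid, which may not match the actual layout on every build. It only lists candidates and does not delete anything.

```python
import os
from pathlib import Path
from typing import List


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this pid currently exists."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only performs the existence/permission check
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def find_stale_pidfiles(run_dir: str) -> List[Path]:
    """List *.pid files under run_dir whose recorded pid is invalid or dead.

    Assumption: pid files use a ".pid" suffix and hold one numeric pid.
    """
    if not os.path.isdir(run_dir):
        return []
    stale = []
    for pidfile in sorted(Path(run_dir).rglob("*.pid")):
        try:
            pid = int(pidfile.read_text().strip())
        except ValueError:
            stale.append(pidfile)  # empty or non-numeric content
            continue
        if not pid_is_alive(pid):
            stale.append(pidfile)
    return stale


if __name__ == "__main__":
    # /var/run/gluster is the directory named in this report.
    for path in find_stale_pidfiles("/var/run/gluster"):
        print(path)
```

Note that a pid reported as alive may have been reused by an unrelated process after the reboot, so a pid file that passes this check is not necessarily valid; comparing against the process command line would be needed to rule that out.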