GlusterFS lost track of the parent directory paths for about 7,800+ files over a three-day period during which it was unable to track the progress of replicating them to one member of the cluster. The data was all present on two other nodes, so once we rediscovered the paths to those files, GlusterFS was able to restore proper operation. We suspect there were (and may very well still be) networking problems with one or more nodes, given frame errors on the NICs, and we have seen evidence of links being dropped and restored at the physical layer.

The self-heal process was only able to list GFIDs and did not know the file paths, so we spent about 24 hours scouring the volume, using the dates on the GFID files, to find their respective paths. Once we were able to "stat" a file via a gluster mount point, all of the files were healed.

Notes taken by Ravi regarding this pbench volume heal issue:
----------------------
August 4th/5th, 2016
Written by Ravi:

List of T-files that need data and metadata heal on pbench-replicate-5:
<gfid:3a1d56d0-9945-450f-858c-84314dff9f6a>
<gfid:cc6c1b79-92f9-4728-a7b6-d62c5de50084>
<gfid:cc2dd7f2-532b-4d15-9dbd-0561cffd50bf>
<gfid:973cb698-d3d0-45a3-874b-94134c4afd5a>
<gfid:89ae4bf6-941e-430e-b786-964815e1d70f>

Subvolumes of pbench-replicate-5 are:
pbench-client-15
pbench-client-16
pbench-client-17

Heal is not happening because pbench-client-16 does not contain the entry itself. I have run the find command on the following nodes in a screen session to help identify each file path:

3a1d56d0-9945-450f-858c-84314dff9f6a ==> gprfs002
cc6c1b79-92f9-4728-a7b6-d62c5de50084 ==> gprfs012
cc2dd7f2-532b-4d15-9dbd-0561cffd50bf ==> gprfs001
973cb698-d3d0-45a3-874b-94134c4afd5a ==> gprfs009
89ae4bf6-941e-430e-b786-964815e1d70f ==> gprfs011

Once I have the paths, we should be able to do a temporary mount and stat the files so that the entries get created. The data/metadata heal should then be able to complete.
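The GFID-to-path search above relies on a property of the brick layout: for a regular file, the `.glusterfs/<aa>/<bb>/<gfid>` entry is a hard link to the real file, so GNU find's `-samefile` on the brick reveals the path. A minimal sketch using a mock brick layout in a temp directory (the brick path and directory names are illustrative, not the real ones from this cluster):

```shell
# Build a mock brick: a data file plus its .glusterfs hard link,
# mimicking how GlusterFS stores one of the regular-file GFIDs above.
BRICK=$(mktemp -d)
mkdir -p "$BRICK/.glusterfs/cc/2d" "$BRICK/archive/fs-version-001/host1"
echo data > "$BRICK/archive/fs-version-001/host1/result.tar.xz"
ln "$BRICK/archive/fs-version-001/host1/result.tar.xz" \
   "$BRICK/.glusterfs/cc/2d/cc2dd7f2-532b-4d15-9dbd-0561cffd50bf"

# Resolve GFID -> path by hard-link identity, skipping the .glusterfs
# tree itself so the GFID entry does not match its own query.
RESULT=$(find "$BRICK" -path "$BRICK/.glusterfs" -prune -o \
              -samefile "$BRICK/.glusterfs/cc/2d/cc2dd7f2-532b-4d15-9dbd-0561cffd50bf" -print)
echo "$RESULT"
```

Once the path is known, a `stat` of that path through a mount of the volume creates the missing entry on the bad brick so that self-heal can proceed, which matches what we observed.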
Out of the 5 entries, 3 happen to be symbolic links. I'll update once I get the file paths for those entries.

Notes: The cached subvols for the 5 T-files are pbench-replicate-4 and pbench-replicate-11. The result of ls -l on the bricks of those cached subvols:

1) ls -l .glusterfs/3a/1d/3a1d56d0-9945-450f-858c-84314dff9f6a ==> symlink file
lrwxrwxrwx 2 17932 17932 78 Aug  1 16:13 .glusterfs/3a/1d/3a1d56d0-9945-450f-858c-84314dff9f6a -> /pbench/archive/fs-version-001/overcloud-controller-0/20160801-1341_cbt.tar.xz

2) ls -l .glusterfs/cc/6c/cc6c1b79-92f9-4728-a7b6-d62c5de50084 ==> symlink file
lrwxrwxrwx 2 17932 17932 97 Aug  2 03:23 .glusterfs/cc/6c/cc6c1b79-92f9-4728-a7b6-d62c5de50084 -> /pbench/archive/fs-version-001/dhcp31-124/fio_sdb-sdc-1-job-iodepth-32_2016-08-01_18:59:09.tar.xz

3) ls -l .glusterfs/cc/2d/cc2dd7f2-532b-4d15-9dbd-0561cffd50bf ==> regular file.

4) ls -l .glusterfs/97/3c/973cb698-d3d0-45a3-874b-94134c4afd5a ==> regular file.

5) ls -l .glusterfs/89/ae/89ae4bf6-941e-430e-b786-964815e1d70f ==> symlink file
lrwxrwxrwx 2 17932 17932 78 Aug  2 23:42 .glusterfs/89/ae/89ae4bf6-941e-430e-b786-964815e1d70f -> /pbench/archive/fs-version-001/overcloud-controller-0/20160802-2300_cbt.tar.xz

From Peter: To speed things up, we leveraged the knowledge that the top-level directories would have timestamps in the same time range as the GFID files. So instead of searching the entire volume, in each of the five cases we looked only for the most recent top-level directories and then performed the find from there.
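The symlink entries are the easy case: for a symlink, the `.glusterfs` GFID entry is itself a symlink whose target is the original path, so `readlink` recovers it directly with no volume-wide find. A sketch with a mock brick layout (the GFID and target are taken from the ls -l output above):

```shell
# Mock brick holding one of the symlink GFID entries listed above.
BRICK=$(mktemp -d)
mkdir -p "$BRICK/.glusterfs/cc/6c"
ln -s /pbench/archive/fs-version-001/dhcp31-124/fio_sdb-sdc-1-job-iodepth-32_2016-08-01_18:59:09.tar.xz \
      "$BRICK/.glusterfs/cc/6c/cc6c1b79-92f9-4728-a7b6-d62c5de50084"

# For a symlink GFID, readlink yields the original path directly.
TARGET=$(readlink "$BRICK/.glusterfs/cc/6c/cc6c1b79-92f9-4728-a7b6-d62c5de50084")
echo "$TARGET"

# Peter's shortcut for the regular files: restrict the search to
# top-level directories modified in the same window as the GFID file,
# e.g. (hypothetical mount point, GNU find's -newermt):
#   find /mnt/pbench -maxdepth 1 -type d -newermt '2016-08-01'
```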
We were running RHGS 3.1.2 with a set of patches from Vijay, and then upgraded to 3.1.3 while the self-heal process was under way.
hi Peter,

Based on the info so far, this seems to have happened because of a feature called "optimistic changelog" for directory operations: entries are marked bad only after a failure happens, i.e. there is no pre-operation marking. So if the failure occurs in such a way that we lose network connectivity to both bricks before the marking is done, then we lose track of which directory needs healing. We would need sosreports from the machines and sample GFIDs that went into this state to confirm the theory. If we confirm it, we will provide a volume set option to turn the optimistic changelog off. There is also a way to turn this off via a mount option, which we can use until the patch is merged.

Pranith
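If the theory holds, the mitigation Pranith describes would look something like the following sketch. The option name `cluster.optimistic-change-log` exists in upstream GlusterFS, but whether it is exposed in the RHGS build in use here, and the exact volume name (assumed to be "pbench" from the subvolume names above), should both be verified before running:

```shell
# Sketch only: disable the optimistic changelog so directory operations
# get a pre-operation mark, at the cost of extra changelog traffic.
gluster volume set pbench cluster.optimistic-change-log off

# Confirm the setting took effect.
gluster volume get pbench cluster.optimistic-change-log
```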
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html