Red Hat Bugzilla – Bug 872601
split-brain caused by %preun% script if server rpm is upgraded during self-heal
Last modified: 2014-12-14 14:40:29 EST
Description of problem:
During an rpm upgrade of glusterfs-server, preun runs "/sbin/service glusterfsd condrestart". This kills all glusterfsd instances: status reports that glusterfsd is running, so stop kills them all, and start does nothing since this isn't a legacy configuration. Normally this is the desired effect, as it allows the brick instances to load the new version when glusterd is next restarted.
If, however, we have just upgraded one server in a replica set and the self-heal hasn't completed, upgrading the next server will cause a split-brain: the stale files on the first server become the only files available to afr and are updated, so the files on the second server are then also considered stale.
Reproducible always, as long as there is disk activity on the same file during both brick restarts and that file is large enough that the self-heal does not complete in time.
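The failure sequence described above can be illustrated with a toy model. This is only a sketch: the Brick class, the pending counters and the function names are hypothetical stand-ins for afr's per-brick bookkeeping, not the actual glusterfs implementation.

```python
class Brick:
    """Toy replica brick (hypothetical model, not real afr code)."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = ""
        # peer name -> number of writes that peer missed while it was down
        self.pending = {}


def write(bricks, payload):
    """Apply a write to every live brick and blame every down brick."""
    live = [b for b in bricks if b.up]
    down = [b for b in bricks if not b.up]
    for brick in live:
        brick.data += payload
        for missed in down:
            brick.pending[missed.name] = brick.pending.get(missed.name, 0) + 1


def is_split_brain(a, b):
    """Both bricks accuse each other, so neither copy can be the heal source."""
    return a.pending.get(b.name, 0) > 0 and b.pending.get(a.name, 0) > 0


def rolling_upgrade_without_heal():
    """Upgrade server1, then server2, with writes in between and no heal."""
    s1, s2 = Brick("server1"), Brick("server2")
    s1.up = False            # server1 upgraded: its brick is killed
    write([s1, s2], "x")     # only server2 sees this write
    s1.up = True             # server1 is back, but self-heal has not finished
    s2.up = False            # server2 upgraded before the heal completes
    write([s1, s2], "y")     # only the stale copy on server1 sees this write
    s2.up = True
    return s1, s2


if __name__ == "__main__":
    s1, s2 = rolling_upgrade_without_heal()
    print("split-brain:", is_split_brain(s1, s2))
```

In this model each brick ends up holding a write the other one missed, which is the condition the report calls split-brain.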
Perhaps a check should be done to ensure a clean self-heal state before doing the condrestart in preun or in the init/systemctl scripts.
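A minimal sketch of such a check follows. It assumes a release where "gluster volume heal <VOLNAME> info" is available and prints a "Number of entries: N" line per brick; the helper names pending_heal_entries and safe_to_restart are hypothetical. Anything that cannot be parsed is treated as "not safe".

```python
import re
import subprocess


def pending_heal_entries(heal_info_output):
    """Sum the per-brick 'Number of entries:' counts from heal info output.

    Returns None if no count line is found or if any count is not a plain
    number, so the caller can treat the heal state as unknown.
    """
    total = 0
    seen = False
    for line in heal_info_output.splitlines():
        match = re.match(r"\s*Number of entries:\s*(\S+)", line)
        if not match:
            continue
        seen = True
        if not match.group(1).isdigit():
            return None
        total += int(match.group(1))
    return total if seen else None


def safe_to_restart(volume):
    """Return True only if heal info reports zero pending entries."""
    result = subprocess.run(
        ["gluster", "volume", "heal", volume, "info"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return False  # the command failed, so assume a heal may be pending
    return pending_heal_entries(result.stdout) == 0
```

A wrapper around the condrestart could call safe_to_restart() for each replicated volume on the server and skip or postpone the restart when it returns False.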
This needs some action from the replicate part too.
We need to implement a command which can tell if any of the files need self-heal or not.
REVIEW: http://review.gluster.org/6145 (cluster/afr: Provide setxattr interface for triggering heal) posted (#1) for review on master by Pranith Kumar Karampuri (email@example.com)
REVIEW: http://review.gluster.org/6145 (cluster/afr: Provide setxattr interface for triggering heal) posted (#2) for review on master by Pranith Kumar Karampuri (firstname.lastname@example.org)
REVIEW: http://review.gluster.org/6195 (extras/scripts: Script to self-heal in a synchronous way) posted (#1) for review on master by Pranith Kumar Karampuri (email@example.com)
The version that this bug has been reported against does not get any updates from the Gluster Community anymore. Please verify whether this report is still valid against a current (3.4, 3.5 or 3.6) release and update the version, or close this bug.
If there has been no update before 9 December 2014, this bug will be closed automatically.