Description of problem:

Glusterd fails to restart after replacing one cluster member when a volume has snapshots. The volume is 3 replica, striped and distributed, and runs on a 6 node cluster. Take the new host, then stop and start gluster. Gluster will be unable to start because it expects to be able to mount an LVM snapshot that doesn't exist locally.

Version-Release number of selected component (if applicable):
3.7.8

How reproducible:
1:1

Steps to Reproduce:
1. gluster volume replace-brick $vol $failed_peer $new_peer:$new_brick commit force
2. stop the gluster daemons on a host
3.

Actual results:
Gluster startup fails:

The message "I [MSGID: 106498] [glusterd-handler.c:3640:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0" repeated 4 times between [2016-03-02 21:54:25.326406] and [2016-03-02 21:54:25.369229]
[2016-03-02 21:54:28.226630] I [MSGID: 106544] [glusterd.c:159:glusterd_uuid_init] 0-management: retrieved UUID: 7a79537b-4389-4e04-93f9-275fc438268b
[2016-03-02 21:54:28.227741] E [MSGID: 106187] [glusterd-store.c:3310:glusterd_resolve_snap_bricks] 0-management: resolve brick failed in restore
[2016-03-02 21:54:28.227770] E [MSGID: 106186] [glusterd-store.c:4297:glusterd_resolve_all_bricks] 0-management: resolving the snap bricks failed for snap: apcfs-default_GMT-2016.02.26-15.42.14
[2016-03-02 21:54:28.227853] E [MSGID: 101019] [xlator.c:433:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2016-03-02 21:54:28.227877] E [graph.c:322:glusterfs_graph_init] 0-management: initializing translator failed
[2016-03-02 21:54:28.227895] E [graph.c:661:glusterfs_graph_activate] 0-graph: init failed
[2016-03-02 21:54:28.233362] W [glusterfsd.c:1236:cleanup_and_exit] (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xcd) [0x7f056f2da1fd] -->/usr/sbin/glusterd(glusterfs_process_volfp+0x126) [0x7f056f2da0d6] -->/usr/sbin/glusterd(cleanup_and_exit+0x69) [0x7f056f2d9709] ) 0-: received signum (0), shutting down

Expected results:
Glusterd should start.

Additional info:
(In reply to Ben Werthmann from comment #0)
> Description of problem:
>
> Glusterd fails to restart after replacing one cluster member and a volume
> has snapshots. Volume is 3 replica, striped and distributed and runs on a 6
> node cluster.

When you say replacing a cluster member, do you mean a brick or a peer? Was the replace-brick command successful? Could you attach the complete glusterd log file along with cmd_history.log?
In this case, both. We're testing recovery from a complete server failure where the storage (brick) and compute (peer) have failed. We first 'gluster peer probe $new_peer_ip'. Later we remove the dead peer via 'gluster peer detach $failed_peer force', then 'gluster volume replace-brick $vol $failed_peer $new_peer_ip:$new_brick commit force'. The 'gluster volume replace-brick' operation exited with a non-zero exit status. I'll build a test environment to gather the complete glusterd log file along with cmd_history.log.
Correction: The 'gluster volume replace-brick' operation exited with a _zero_ exit status.
Created attachment 1142223 [details]
incumbent node

All control actions are performed from this node.
Created attachment 1142224 [details]
new node logs

Logs from the replacement node.
Another correction: we remove the dead peer via 'gluster peer detach $failed_peer force' AFTER 'gluster volume replace-brick $vol $failed_peer $new_peer_ip:$new_brick commit force'.

Found this thread: https://www.gluster.org/pipermail/gluster-users/2015-June/022264.html

Current IPs are 172.27.20.205, 172.27.18.105, and 172.27.21.136. 172.27.20.50 was the IP of the node which was replaced.

ubuntu@bwerthmann-743d2d98:~$ sudo grep -iR 172 /var/lib/glusterd/* | grep -v 20\.205 | grep -v 18\.105 | grep -v 21\.136
/var/lib/glusterd/snaps/s1_GMT-2016.03.30-23.30.58/7c2ef5a4775648a1adf20bc1f7ae764b/info:brick-1=172.27.20.50:-run-gluster-snaps-7c2ef5a4775648a1adf20bc1f7ae764b-brick2-apcfs
/var/lib/glusterd/snaps/s1_GMT-2016.03.30-23.30.58/7c2ef5a4775648a1adf20bc1f7ae764b/bricks/172.27.20.50:-run-gluster-snaps-7c2ef5a4775648a1adf20bc1f7ae764b-brick2-apcfs:hostname=172.27.20.50
/var/lib/glusterd/snaps/s1_GMT-2016.03.30-23.30.58/7c2ef5a4775648a1adf20bc1f7ae764b/trusted-7c2ef5a4775648a1adf20bc1f7ae764b.tcp-fuse.vol: option remote-host 172.27.20.50
/var/lib/glusterd/snaps/s1_GMT-2016.03.30-23.30.58/7c2ef5a4775648a1adf20bc1f7ae764b/7c2ef5a4775648a1adf20bc1f7ae764b.tcp-fuse.vol: option remote-host 172.27.20.50
/var/lib/glusterd/snaps/s0_GMT-2016.03.30-23.18.02/2996d3515da24ddd859f32c294587ca8/2996d3515da24ddd859f32c294587ca8.tcp-fuse.vol: option remote-host 172.27.20.50
/var/lib/glusterd/snaps/s0_GMT-2016.03.30-23.18.02/2996d3515da24ddd859f32c294587ca8/info:brick-1=172.27.20.50:-run-gluster-snaps-2996d3515da24ddd859f32c294587ca8-brick2-apcfs
/var/lib/glusterd/snaps/s0_GMT-2016.03.30-23.18.02/2996d3515da24ddd859f32c294587ca8/bricks/172.27.20.50:-run-gluster-snaps-2996d3515da24ddd859f32c294587ca8-brick2-apcfs:hostname=172.27.20.50
/var/lib/glusterd/snaps/s0_GMT-2016.03.30-23.18.02/2996d3515da24ddd859f32c294587ca8/trusted-2996d3515da24ddd859f32c294587ca8.tcp-fuse.vol: option remote-host 172.27.20.50
/var/lib/glusterd/snaps/s2_GMT-2016.03.30-23.31.03/86baab41a3174d6bba13325a85d9a46b/info:brick-1=172.27.20.50:-run-gluster-snaps-86baab41a3174d6bba13325a85d9a46b-brick2-apcfs
/var/lib/glusterd/snaps/s2_GMT-2016.03.30-23.31.03/86baab41a3174d6bba13325a85d9a46b/bricks/172.27.20.50:-run-gluster-snaps-86baab41a3174d6bba13325a85d9a46b-brick2-apcfs:hostname=172.27.20.50
/var/lib/glusterd/snaps/s2_GMT-2016.03.30-23.31.03/86baab41a3174d6bba13325a85d9a46b/86baab41a3174d6bba13325a85d9a46b.tcp-fuse.vol: option remote-host 172.27.20.50
/var/lib/glusterd/snaps/s2_GMT-2016.03.30-23.31.03/86baab41a3174d6bba13325a85d9a46b/trusted-86baab41a3174d6bba13325a85d9a46b.tcp-fuse.vol: option remote-host 172.27.20.50

ubuntu@bwerthmann-743d2d98:~$ find /var/lib/glusterd/ -name *\.20\.50*
/var/lib/glusterd/snaps/s1_GMT-2016.03.30-23.30.58/7c2ef5a4775648a1adf20bc1f7ae764b/7c2ef5a4775648a1adf20bc1f7ae764b.172.27.20.50.run-gluster-snaps-7c2ef5a4775648a1adf20bc1f7ae764b-brick2-apcfs.vol
/var/lib/glusterd/snaps/s1_GMT-2016.03.30-23.30.58/7c2ef5a4775648a1adf20bc1f7ae764b/bricks/172.27.20.50:-run-gluster-snaps-7c2ef5a4775648a1adf20bc1f7ae764b-brick2-apcfs
/var/lib/glusterd/snaps/s0_GMT-2016.03.30-23.18.02/2996d3515da24ddd859f32c294587ca8/2996d3515da24ddd859f32c294587ca8.172.27.20.50.run-gluster-snaps-2996d3515da24ddd859f32c294587ca8-brick2-apcfs.vol
/var/lib/glusterd/snaps/s0_GMT-2016.03.30-23.18.02/2996d3515da24ddd859f32c294587ca8/bricks/172.27.20.50:-run-gluster-snaps-2996d3515da24ddd859f32c294587ca8-brick2-apcfs
/var/lib/glusterd/snaps/s2_GMT-2016.03.30-23.31.03/86baab41a3174d6bba13325a85d9a46b/86baab41a3174d6bba13325a85d9a46b.172.27.20.50.run-gluster-snaps-86baab41a3174d6bba13325a85d9a46b-brick2-apcfs.vol
/var/lib/glusterd/snaps/s2_GMT-2016.03.30-23.31.03/86baab41a3174d6bba13325a85d9a46b/bricks/172.27.20.50:-run-gluster-snaps-86baab41a3174d6bba13325a85d9a46b-brick2-apcfs

ubuntu@bwerthmann-743d2d98:~$ sudo find /var/lib/glusterd/ -type f -exec sed -i "s/172.27.20.50/172.27.20.205/g" {} \;
ubuntu@bwerthmann-743d2d98:~$ sudo grep -iR 172 /var/lib/glusterd/* | grep -v 20\.205 | grep -v 18\.105 | grep -v 21\.136
ubuntu@bwerthmann-743d2d98:~$
ubuntu@bwerthmann-743d2d98:~$ sudo rm -rf /var/lib/glusterd/snaps/s*

Now glusterd starts:

ubuntu@bwerthmann-743d2d98:~$ sudo service glusterfs-server start
glusterfs-server start/running, process 6637
ubuntu@bwerthmann-743d2d98:~$ sudo service glusterfs-server status
glusterfs-server start/running, process 6637
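For reference, the manual workaround above (rewriting the stale peer IP across /var/lib/glusterd) can be sketched as a small script. This is only an illustration of what the sed/find one-liners do, not an official gluster tool; note that the sed one-liner alone misses files whose *names* embed the old IP (the bricks/ and .vol files listed by find), which this sketch also renames. Paths and IPs are placeholders.

```python
import os

def rewrite_peer_ip(state_dir, old_ip, new_ip):
    """Replace every reference to old_ip with new_ip under state_dir.

    Mirrors: find <state_dir> -type f -exec sed -i "s/old/new/g" {} \\;
    and additionally renames files whose names contain old_ip.
    Returns the list of files whose contents were rewritten.
    """
    changed = []
    # topdown=False so renames happen after a directory is fully visited
    for root, _dirs, files in os.walk(state_dir, topdown=False):
        for name in files:
            path = os.path.join(root, name)
            with open(path, "r", errors="surrogateescape") as fh:
                text = fh.read()
            if old_ip in text:
                with open(path, "w", errors="surrogateescape") as fh:
                    fh.write(text.replace(old_ip, new_ip))
                changed.append(path)
            if old_ip in name:
                os.rename(path, os.path.join(root, name.replace(old_ip, new_ip)))
    return changed
```

Run it (as root, with glusterd stopped) before deciding whether the snap directories also need to be removed, as was ultimately necessary above.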
So this means that an IP change had happened? If so, GlusterD is not capable of handling this scenario unless some manual intervention is done (the workaround you executed). If you had used hostnames while probing the nodes, this situation could have been avoided.
Yes, the replacement peer (1 of 3) and new brick have a new IP. In our case, DNS is not available.

When the failed brick is removed via 'gluster volume replace-brick $vol $failed_peer $new_peer_ip:$new_brick commit force', why are there lingering references to the old peer/brick? Is there a reason that 'replace-brick' does not fix all references to the old peer/brick? If 'replace-brick' has been issued, is it safe to drop the snapshot references to the old peer/brick?

I suspect the suggested fix of using the hostname will fail if any snapshots exist, because in the new peer/new brick case the new peer will not have the LVM snapshots needed to resolve the snapshot references.

Put another way: why is the following set of operations not valid?
- deploy gluster with three servers, one brick each, one volume replicated across all 3
- create a snapshot
- lose one server
- add a replacement peer and new brick with a new IP address
- replace-brick the missing brick onto the new server (wait for replication to finish)
- force remove the old server
- verify everything is working as expected
- restart _any_ server in the cluster, without failure
Is there an update on this issue?
(In reply to Ben Werthmann from comment #9)
> Is there an update on this issue?

We will test the steps and get back.
OK, so this is happening as per the current functionality. When a replace-brick is issued, the operation is restricted to that same volume; no other references get changed. Since a snapshot works just like a volume, if the snapshot refers to a failed peer which has already been replaced, glusterd will fail to restore the snap. I don't think we have any other option but to change the IP in the volfile (rename) of the snap and make that work. Rajesh, what's your thought here?
Hello, is there an update on this issue?
Pinging Rajesh
Or, maybe Avra wants to have a look at this?
This bug is getting closed because GlusterFS-3.7 has reached its end-of-life. Note: This bug is being closed using a script. No verification has been performed to check if it still exists on newer releases of GlusterFS. If this bug still exists in newer GlusterFS releases, please reopen this bug against the newer release.
I agree with Comment #11 from Atin. A snapshot volume is exactly like any other gluster volume when it comes to peer handshake. The fact that it cannot find the IP from its volfile in the cluster will make it do a peer reject. As Atin suggested, the only option we have here is to change the IP in the volfile of the snap. But even with that option, as gluster snapshots are COW LVM-based snapshots, you end up losing your snapshot if it had any bricks on the cluster member which was forcibly removed.
Avra - I was wondering, if we have snapshot bricks hosted by peer N, should we allow detaching that same peer? Isn't this a bug in the first place?
Well, that is the most likely scenario to happen. If we disallow it, we force them to stay put with that peer for good, or to delete all those snapshots (which would be unusable anyway if we did the peer detach). As you suggested, we can tell the user that the peer is hosting snapshot bricks and therefore cannot be detached. It is similar to how we do not allow deletion of a volume if it still has snapshots in it. This solution is still not a holistic one, as it forces the user to delete all his snapshots. But it is still the best one we have yet.
REVIEW: https://review.gluster.org/16907 (glusterd : Glusterd fails to restart after replace brick is done for snap volume) posted (#1) for review on master by Gaurav Yadav (gyadav)
REVIEW: https://review.gluster.org/16907 (glusterd : Glusterd fails to restart after replace brick is done for snap volume) posted (#2) for review on master by Gaurav Yadav (gyadav)
The reported case differs slightly from the problem statement in 16907.

Problem:
- Deploy gluster on 3 nodes, one brick each, one volume replicated, add peers by IP address
- Create a snapshot
- Lose one server
- Add a replacement peer and new brick with a new IP address
- replace-brick the missing brick onto the new server (wait for replication to finish)
- peer detach the old server
- after doing the above steps, glusterd fails to restart.

Our expectations:
- Glusterd starts and remains started when in the above state, even if there are issues with a volume and/or related snapshots. [1]
- The non-snapshot volume would start and a heal would kick off (to catch up from where 'replace-brick <volumename> <oldbrick> <newbrick> commit force' left off)
- A procedure exists which allows for replacing a failed server/brick while allowing us to maintain snapshots. [2]
- Limitations of snapshots are documented, such as having to delete all of your snapshots to replace a failed node. Is this true in all cases, or is it something specific that we are doing?

While recovering snapshots should be possible, our priority is [1]. Having gluster enter a state where *none* of the glusterd processes can start is a significant risk. A functional glusterd should be able to service/start the primary volume, even if the snapshot volumes enter an unusable state.

[1] Should issues with one volume prevent other volumes from starting due to glusterd crashing? Is this by design? Please elaborate on this behavior. If a well-meaning individual restarts the gluster services or reboots gluster servers for troubleshooting, there would be a cluster-wide outage of glusterd, which implies no new client connections.

[2] In theory, it should be possible to recover Gluster snapshots based on lvm-thin. I think we'd just need to "replay the snapshot history" on a new thin-LV. The process could be something like:
1. Create a new thin-LV
2. replace-brick the oldest snapshot, create an LVM snapshot, update gluster references for the snapshot volume to the new snapshot
3. goto next snapshot unless head

This is probably a gross oversimplification of the problem, but it seems that recovering snapshots should be possible.

Aside comment on 16907:
- What's the recovery path when a peer has failed and something is preventing the removal of snapshots [3] before replacing the brick? Generally, when we need to perform the "recover a failed gluster server" task, something else has gone wrong.

[3] Different reasons where snapshot remove failed:
- Bugs with dm-thin/lvmtools
- Gluster is not responding to "tpool_metadata is at low water mark" events, leading to a thinpool wedged in a read-only state
- Poor interaction with netfilter's conntrack
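The "replay the snapshot history" idea relies on ordering the snaps chronologically, which is recoverable from the GMT-stamped names visible in the state dumps above (s0_GMT-2016.03.30-23.18.02 and friends). A hedged sketch of just that ordering step follows; the actual replace-brick/LVM calls per snapshot are deliberately left out, and the name format is assumed from this bug's examples:

```python
from datetime import datetime

def snap_replay_order(snap_names):
    """Sort snapshot names like 's1_GMT-2016.03.30-23.30.58' oldest-first,
    so the history could be replayed onto a fresh thin-LV one snap at a time.
    """
    def stamp(name):
        # the suffix after 'GMT-' is a %Y.%m.%d-%H.%M.%S timestamp
        return datetime.strptime(name.split("GMT-")[1], "%Y.%m.%d-%H.%M.%S")
    return sorted(snap_names, key=stamp)
```

Sorting by the embedded timestamp rather than the s0/s1/s2 prefix avoids assuming the numeric prefixes are monotonic with creation time.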
Additionally, I acknowledge that patch 16907 prevents this specific bug as reported. The outstanding concern with patch 16907 is the recovery path if snapshots cannot be removed due to other failures.
Hi Gaurav, care to reply to comment #12 and comment #22?
As Avra already mentioned in comment #16 and comment #18, the solution is not a holistic one and it forces the user to delete all his snapshots, but it is still the best one we have yet. Could you please mention the other failures where you have concerns?
REVIEW: https://review.gluster.org/16907 (glusterd : Glusterd fails to restart after replace brick is done for snap volume) posted (#3) for review on master by Gaurav Yadav (gyadav)
REVIEW: https://review.gluster.org/16907 (glusterd : Peer is hosting snapshot bricks therefore can't be detatched) posted (#4) for review on master by Gaurav Yadav (gyadav)
REVIEW: https://review.gluster.org/16907 (glusterd : Disallow peer detach if snapshot bricks exist on it) posted (#5) for review on master by Gaurav Yadav (gyadav)
(In reply to Gaurav Yadav from comment #24)
> Could you please mention the other failure where you have concerns

Performing replace-brick operations resolves entering the following states:
1. Total system failure - server/instance is "terminated"
2. A server running Gluster enters an unrecoverable error state and must be replaced to recover the cluster from a degraded state (case: replica 3 volumes).

In the case of 2, generally, an LVM thin-pool (thin data LV and snapshot LVs) enters a read-only state because the thinpool's metadata LV has been exhausted and fails to extend [1]. Gluster ignores "tpool_metadata is at low water mark" events and continues to create snapshots.

[1] I discussed this issue with Zdenek Kabelac. The issue is due to older kernel dm-thin support and/or older versions of the userspace lvmtools.
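One way to avoid wedging the thinpool described above is to check metadata fullness before each snapshot creation instead of ignoring the low-water-mark events. A minimal sketch, assuming the percentage comes from `lvs --noheadings -o metadata_percent <vg>/<thinpool>`; the 75% threshold here is an arbitrary placeholder, not a gluster or LVM default:

```python
def parse_metadata_percent(lvs_output):
    """Parse the single-column output of
    `lvs --noheadings -o metadata_percent <vg>/<thinpool>`."""
    return float(lvs_output.strip())

def safe_to_snapshot(metadata_percent, low_water_mark=75.0):
    """Return True if creating another thin snapshot is reasonably safe.

    Refusing to snapshot past the threshold leaves headroom for the
    metadata LV to grow instead of flipping the pool read-only.
    """
    return metadata_percent < low_water_mark
```

A wrapper around snapshot creation could call this guard and fail the snapshot early with a clear error, rather than letting the pool exhaust its metadata.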
The fix is made as per #comment 8:

> deploy gluster with three servers, one brick each, one volume replicated across all 3 - create a snapshot - lose one server - add a replacement peer and new brick with a new IP address - replace-brick the missing brick onto the new server (wait for replication to finish) - force remove the old server - verify everything is working as expected - restart _any_ server in the cluster, without failure

Explanation for the case mentioned in #comment 28: glusterd does not start after executing the above test case because, when the replace-brick command is executed, glusterd updates the brick paths in the vol files, but it does not change them in the snap files, the reason being that the snapshot was created at a "point in time". Now, while restarting the service, glusterd sees the snap volume, but while restoring it tries to get the brick info from the node which has already been detached from the cluster; that info is not present, hence glusterd fails to load.

With the fix, glusterd iterates through all snap volumes' bricks, and if it finds any brick hosted on the peer, it disallows detaching that peer, which is the best possible solution.
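The check described above can be modeled in a few lines. This is a simplified Python sketch of the validation, not the actual C implementation in glusterd; the snapshot mapping and helper names are invented for illustration, while the error text matches the message introduced by the fix:

```python
def peer_hosts_snap_bricks(snapshots, peer_hostname):
    """Return the snapshots that have at least one brick on peer_hostname.

    snapshots maps snap name -> list of 'host:/brick/path' strings,
    mirroring glusterd walking every snap volume's brick list.
    """
    blocking = []
    for snap, bricks in snapshots.items():
        if any(b.split(":", 1)[0] == peer_hostname for b in bricks):
            blocking.append(snap)
    return blocking

def validate_peer_detach(snapshots, peer_hostname):
    """Refuse the detach if the peer still hosts snapshot bricks."""
    if peer_hosts_snap_bricks(snapshots, peer_hostname):
        raise RuntimeError(
            "%s is part of existing snapshots. Remove those snapshots "
            "before proceeding" % peer_hostname)
```

The design choice matches how volume deletion is refused while snapshots exist: the user must explicitly delete the snapshots before the peer can go.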
Thanks everyone!
REVIEW: https://review.gluster.org/16907 (glusterd : Disallow peer detach if snapshot bricks exist on it) posted (#6) for review on master by Gaurav Yadav (gyadav)
REVIEW: https://review.gluster.org/16907 (glusterd : Disallow peer detach if snapshot bricks exist on it) posted (#7) for review on master by Gaurav Yadav (gyadav)
COMMIT: https://review.gluster.org/16907 committed in master by Atin Mukherjee (amukherj)
------
commit 1c92f83ec041176ad7c42ef83525cda7d3eda3c5
Author: Gaurav Yadav <gyadav>
Date: Thu Mar 16 14:56:39 2017 +0530

    glusterd : Disallow peer detach if snapshot bricks exist on it

    Problem :
    - Deploy gluster on 2 nodes, one brick each, one volume replicated
    - Create a snapshot
    - Lose one server
    - Add a replacement peer and new brick with a new IP address
    - replace-brick the missing brick onto the new server (wait for replication to finish)
    - peer detach the old server
    - after doing above steps, glusterd fails to restart.

    Solution: With the fix detach peer will populate an error : "N2 is part of existing snapshots. Remove those snapshots before proceeding". While doing so we force user to stay with that peer or to delete all snapshots.

    Change-Id: I3699afb9b2a5f915768b77f885e783bd9b51818c
    BUG: 1322145
    Signed-off-by: Gaurav Yadav <gyadav>
    Reviewed-on: https://review.gluster.org/16907
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Atin Mukherjee <amukherj>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.11.0, please open a new bug report. glusterfs-3.11.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution. [1] http://lists.gluster.org/pipermail/announce/2017-May/000073.html [2] https://www.gluster.org/pipermail/gluster-users/