Bug 1322145 - Glusterd fails to restart after replacing a failed GlusterFS node and a volume has a snapshot
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: snapshot
Version: mainline
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Gaurav Yadav
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-03-29 21:36 UTC by Ben Werthmann
Modified: 2017-05-30 18:34 UTC (History)
6 users

Fixed In Version: glusterfs-3.11.0
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-05-30 18:34:38 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
incumbent node (228.71 KB, application/x-gzip)
2016-03-31 13:08 UTC, Ben Werthmann
new node logs (9.87 KB, application/x-gzip)
2016-03-31 13:10 UTC, Ben Werthmann

Description Ben Werthmann 2016-03-29 21:36:27 UTC
Description of problem:

Glusterd fails to restart after replacing one cluster member when a volume has snapshots. The volume is 3-way replicated, striped, and distributed, and runs on a 6-node cluster.

On the new host, stop and start gluster. Gluster will be unable to start because it expects to be able to mount an LVM snapshot that doesn't exist locally.

Version-Release number of selected component (if applicable):
3.7.8

How reproducible:
1:1

Steps to Reproduce:
1. gluster volume replace-brick $vol $failed_peer $new_peer:$new_brick commit force
2. stop the gluster daemons on a host 
3.
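
Concretely, a minimal sketch of the reproduction (the $failed_brick placeholder and the service commands are assumptions, based on the Ubuntu environment shown later in this report):

gluster volume replace-brick $vol $failed_peer:$failed_brick $new_peer:$new_brick commit force
# then, on the host under test:
service glusterfs-server stop
service glusterfs-server start    # fails with the snap brick resolve errors shown below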

Actual results:
Gluster startup fails:
The message "I [MSGID: 106498] [glusterd-handler.c:3640:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0" repeated 4 times between [2016-03-02 21:54:25.326406] and [2016-03-02 21:54:25.369229]
[2016-03-02 21:54:28.226630] I [MSGID: 106544] [glusterd.c:159:glusterd_uuid_init] 0-management: retrieved UUID: 7a79537b-4389-4e04-93f9-275fc438268b
[2016-03-02 21:54:28.227741] E [MSGID: 106187] [glusterd-store.c:3310:glusterd_resolve_snap_bricks] 0-management: resolve brick failed in restore
[2016-03-02 21:54:28.227770] E [MSGID: 106186] [glusterd-store.c:4297:glusterd_resolve_all_bricks] 0-management: resolving the snap bricks failed for snap: apcfs-default_GMT-2016.02.26-15.42.14
[2016-03-02 21:54:28.227853] E [MSGID: 101019] [xlator.c:433:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2016-03-02 21:54:28.227877] E [graph.c:322:glusterfs_graph_init] 0-management: initializing translator failed
[2016-03-02 21:54:28.227895] E [graph.c:661:glusterfs_graph_activate] 0-graph: init failed
[2016-03-02 21:54:28.233362] W [glusterfsd.c:1236:cleanup_and_exit] (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xcd) [0x7f056f2da1fd] -->/usr/sbin/glusterd(glusterfs_process_volfp+0x126) [0x7f056f2da0d6] -->/usr/sbin/glusterd(cleanup_and_exit+0x69) [0x7f056f2d9709] ) 0-: received signum (0), shutting down

Expected results:
Glusterd should start. 

Additional info:

Comment 1 Atin Mukherjee 2016-03-30 04:04:40 UTC
(In reply to Ben Werthmann from comment #0)
> Description of problem:
> 
> Glusterd fails to restart after replacing one cluster member and a volume
> has snapshots. Volume is 3 replica, striped and distributed and runs on a 6
> node cluster. 
When you say replacing a cluster member, do you mean a brick or a peer? Was the replace-brick command successful? Could you attach the complete glusterd log file along with cmd_history.log?
> 
> Take the new host, stop and start gluster.  Gluster will be unable to start
> because it expects to be able to mount a LVM snapshot that doesn't exist
> locally.
> 
> Version-Release number of selected component (if applicable):
> 3.7.8
> 
> How reproducible:
> 1:1
> 
> Steps to Reproduce:
> 1. gluster volume replace-brick $vol $failed_peer $new_peer:$new_brick
> commit force
> 2. stop the gluster daemons on a host 
> 3.
> 
> Actual results:
> Gluster startup fails:
> The message "I [MSGID: 106498]
> [glusterd-handler.c:3640:glusterd_friend_add_from_peerinfo] 0-management:
> connect returned 0" repeated 4 times between [2016-03-02 21:54:25.326406]
> and [2016-03-02 21:54:25.369229]
> [2016-03-02 21:54:28.226630] I [MSGID: 106544]
> [glusterd.c:159:glusterd_uuid_init] 0-management: retrieved UUID:
> 7a79537b-4389-4e04-93f9-275fc438268b
> [2016-03-02 21:54:28.227741] E [MSGID: 106187]
> [glusterd-store.c:3310:glusterd_resolve_snap_bricks] 0-management: resolve
> brick failed in restore
> [2016-03-02 21:54:28.227770] E [MSGID: 106186]
> [glusterd-store.c:4297:glusterd_resolve_all_bricks] 0-management: resolving
> the snap bricks failed for snap: apcfs-default_GMT-2016.02.26-15.42.14
> [2016-03-02 21:54:28.227853] E [MSGID: 101019] [xlator.c:433:xlator_init]
> 0-management: Initialization of volume 'management' failed, review your
> volfile again
> [2016-03-02 21:54:28.227877] E [graph.c:322:glusterfs_graph_init]
> 0-management: initializing translator failed
> [2016-03-02 21:54:28.227895] E [graph.c:661:glusterfs_graph_activate]
> 0-graph: init failed
> [2016-03-02 21:54:28.233362] W [glusterfsd.c:1236:cleanup_and_exit]
> (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xcd) [0x7f056f2da1fd]
> -->/usr/sbin/glusterd(glusterfs_process_volfp+0x126) [0x7f056f2da0d6]
> -->/usr/sbin/glusterd(cleanup_and_exit+0x69) [0x7f056f2d9709] ) 0-: received
> signum (0), shutting down
> 
> Expected results:
> Glusterd should start. 
> 
> Additional info:

Comment 2 Ben Werthmann 2016-03-30 19:31:56 UTC
In this case, both. We're testing recovery from a complete server failure where the storage (brick) and compute (peer) have failed. We first 'gluster peer probe $new_peer_ip'. Later we remove the dead peer via 'gluster peer detach $failed_peer force', then 'gluster volume replace-brick $vol $failed_peer $new_peer_ip:$new_brick commit force'. The 'gluster volume replace-brick' operation exited with a non-zero exit status.

I'll build a test environment to gather the complete glusterd log file along with cmd_history.log.

Comment 3 Ben Werthmann 2016-03-31 12:29:45 UTC
Correction: The 'gluster volume replace-brick' operation exited with a _zero_ exit status.

Comment 4 Ben Werthmann 2016-03-31 13:08:56 UTC
Created attachment 1142223 [details]
incumbent node

All control actions are performed from this node.

Comment 5 Ben Werthmann 2016-03-31 13:10:00 UTC
Created attachment 1142224 [details]
new node logs

Logs from the replacement node.

Comment 6 Ben Werthmann 2016-03-31 15:50:06 UTC
Another Correction: We remove the dead peer via 'gluster peer detach $failed_peer force', AFTER 'gluster volume replace-brick $vol $failed_peer $new_peer_ip:$new_brick commit force'. 

Found this thread: https://www.gluster.org/pipermail/gluster-users/2015-June/022264.html

Current IPs are 172.27.20.205, 172.27.18.105, and 172.27.21.136. 172.27.20.50 was the IP of the node which was replaced.


ubuntu@bwerthmann-743d2d98:~$ sudo grep -iR 172 /var/lib/glusterd/* | grep -v 20\.205 | grep -v 18\.105 | grep -v 21\.136
/var/lib/glusterd/snaps/s1_GMT-2016.03.30-23.30.58/7c2ef5a4775648a1adf20bc1f7ae764b/info:brick-1=172.27.20.50:-run-gluster-snaps-7c2ef5a4775648a1adf20bc1f7ae764b-brick2-apcfs
/var/lib/glusterd/snaps/s1_GMT-2016.03.30-23.30.58/7c2ef5a4775648a1adf20bc1f7ae764b/bricks/172.27.20.50:-run-gluster-snaps-7c2ef5a4775648a1adf20bc1f7ae764b-brick2-apcfs:hostname=172.27.20.50
/var/lib/glusterd/snaps/s1_GMT-2016.03.30-23.30.58/7c2ef5a4775648a1adf20bc1f7ae764b/trusted-7c2ef5a4775648a1adf20bc1f7ae764b.tcp-fuse.vol:    option remote-host 172.27.20.50
/var/lib/glusterd/snaps/s1_GMT-2016.03.30-23.30.58/7c2ef5a4775648a1adf20bc1f7ae764b/7c2ef5a4775648a1adf20bc1f7ae764b.tcp-fuse.vol:    option remote-host 172.27.20.50
/var/lib/glusterd/snaps/s0_GMT-2016.03.30-23.18.02/2996d3515da24ddd859f32c294587ca8/2996d3515da24ddd859f32c294587ca8.tcp-fuse.vol:    option remote-host 172.27.20.50
/var/lib/glusterd/snaps/s0_GMT-2016.03.30-23.18.02/2996d3515da24ddd859f32c294587ca8/info:brick-1=172.27.20.50:-run-gluster-snaps-2996d3515da24ddd859f32c294587ca8-brick2-apcfs
/var/lib/glusterd/snaps/s0_GMT-2016.03.30-23.18.02/2996d3515da24ddd859f32c294587ca8/bricks/172.27.20.50:-run-gluster-snaps-2996d3515da24ddd859f32c294587ca8-brick2-apcfs:hostname=172.27.20.50
/var/lib/glusterd/snaps/s0_GMT-2016.03.30-23.18.02/2996d3515da24ddd859f32c294587ca8/trusted-2996d3515da24ddd859f32c294587ca8.tcp-fuse.vol:    option remote-host 172.27.20.50
/var/lib/glusterd/snaps/s2_GMT-2016.03.30-23.31.03/86baab41a3174d6bba13325a85d9a46b/info:brick-1=172.27.20.50:-run-gluster-snaps-86baab41a3174d6bba13325a85d9a46b-brick2-apcfs
/var/lib/glusterd/snaps/s2_GMT-2016.03.30-23.31.03/86baab41a3174d6bba13325a85d9a46b/bricks/172.27.20.50:-run-gluster-snaps-86baab41a3174d6bba13325a85d9a46b-brick2-apcfs:hostname=172.27.20.50
/var/lib/glusterd/snaps/s2_GMT-2016.03.30-23.31.03/86baab41a3174d6bba13325a85d9a46b/86baab41a3174d6bba13325a85d9a46b.tcp-fuse.vol:    option remote-host 172.27.20.50
/var/lib/glusterd/snaps/s2_GMT-2016.03.30-23.31.03/86baab41a3174d6bba13325a85d9a46b/trusted-86baab41a3174d6bba13325a85d9a46b.tcp-fuse.vol:    option remote-host 172.27.20.50

ubuntu@bwerthmann-743d2d98:~$ find /var/lib/glusterd/ -name *\.20\.50*
/var/lib/glusterd/snaps/s1_GMT-2016.03.30-23.30.58/7c2ef5a4775648a1adf20bc1f7ae764b/7c2ef5a4775648a1adf20bc1f7ae764b.172.27.20.50.run-gluster-snaps-7c2ef5a4775648a1adf20bc1f7ae764b-brick2-apcfs.vol
/var/lib/glusterd/snaps/s1_GMT-2016.03.30-23.30.58/7c2ef5a4775648a1adf20bc1f7ae764b/bricks/172.27.20.50:-run-gluster-snaps-7c2ef5a4775648a1adf20bc1f7ae764b-brick2-apcfs
/var/lib/glusterd/snaps/s0_GMT-2016.03.30-23.18.02/2996d3515da24ddd859f32c294587ca8/2996d3515da24ddd859f32c294587ca8.172.27.20.50.run-gluster-snaps-2996d3515da24ddd859f32c294587ca8-brick2-apcfs.vol
/var/lib/glusterd/snaps/s0_GMT-2016.03.30-23.18.02/2996d3515da24ddd859f32c294587ca8/bricks/172.27.20.50:-run-gluster-snaps-2996d3515da24ddd859f32c294587ca8-brick2-apcfs
/var/lib/glusterd/snaps/s2_GMT-2016.03.30-23.31.03/86baab41a3174d6bba13325a85d9a46b/86baab41a3174d6bba13325a85d9a46b.172.27.20.50.run-gluster-snaps-86baab41a3174d6bba13325a85d9a46b-brick2-apcfs.vol
/var/lib/glusterd/snaps/s2_GMT-2016.03.30-23.31.03/86baab41a3174d6bba13325a85d9a46b/bricks/172.27.20.50:-run-gluster-snaps-86baab41a3174d6bba13325a85d9a46b-brick2-apcfs

ubuntu@bwerthmann-743d2d98:~$ sudo find /var/lib/glusterd/ -type f -exec sed -i "s/172.27.20.50/172.27.20.205/g" {} \;
ubuntu@bwerthmann-743d2d98:~$ sudo grep -iR 172 /var/lib/glusterd/* | grep -v 20\.205 | grep -v 18\.105 | grep -v 21\.136
ubuntu@bwerthmann-743d2d98:~$
ubuntu@bwerthmann-743d2d98:~$ sudo rm -rf /var/lib/glusterd/snaps/s*

Now glusterd starts

ubuntu@bwerthmann-743d2d98:~$ sudo service glusterfs-server start
glusterfs-server start/running, process 6637
ubuntu@bwerthmann-743d2d98:~$ sudo service glusterfs-server status
glusterfs-server start/running, process 6637

Comment 7 Atin Mukherjee 2016-06-21 13:59:00 UTC
So this means that an IP change had happened? If so, GlusterD is not capable of handling this scenario without manual intervention (the workaround you executed). If you had used hostnames while probing the nodes, this situation could have been avoided.

Comment 8 Ben Werthmann 2016-06-21 17:50:45 UTC
Yes, the replacement peer (1 of 3) and new brick have a new IP. In our case, DNS is not available. When the failed brick is removed via 'gluster volume replace-brick $vol $failed_peer $new_peer_ip:$new_brick commit force', why are there lingering references to the old peer/brick? Is there a reason that 'replace-brick' does not fix all of the references to the old peer/brick? If 'replace-brick' has been issued, is it safe to drop the snapshot references to the old peer/brick?

I suspect the suggested fix of using the hostname will fail if any snapshots exist, because in the new peer/new brick case the new peer will not have the LVM snapshots needed to resolve the snapshot references.

Put another way: why is the following set of operations not valid? (A rough command sketch follows the list.)
- deploy gluster with three servers, one brick each, one volume replicated across all 3
- create a snapshot
- lose one server
- add a replacement peer and new brick with a new IP address
- replace-brick the missing brick onto the new server (wait for replication to finish)
- force remove the old server
- verify everything is working as expected
- restart _any_ server in the cluster, without failure
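
A rough shell sketch of the above sequence; the volume name, hostnames, and brick paths (vol0, node3, node4, /bricks/b1) are placeholders, not taken from this report:

gluster snapshot create snap0 vol0                  # snapshot exists before the failure
# ... node3 fails ...
gluster peer probe node4                            # replacement peer with a new IP
gluster volume replace-brick vol0 node3:/bricks/b1 node4:/bricks/b1 commit force
gluster volume heal vol0 info                       # wait until the heal completes
gluster peer detach node3 force                     # force-remove the dead peer
service glusterfs-server restart                    # on any node; expected to come back up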

Comment 9 Ben Werthmann 2016-07-20 21:17:08 UTC
Is there an update on this issue?

Comment 10 Atin Mukherjee 2016-07-21 11:11:29 UTC
(In reply to Ben Werthmann from comment #9)
> Is there an update on this issue?

We will test the steps and get back.

Comment 11 Atin Mukherjee 2016-07-25 06:19:57 UTC
OK, so this is happening as per the current functionality. When a replace-brick is issued, the operation is restricted to that volume; no other references get changed. As a snapshot works just like a volume, if the snapshot refers to a failed peer which has already been replaced, glusterd will fail to restore the snap. I don't think we have any option other than to change the IP in the snap's volfile (rename) and make that work.

Rajesh,

What's your thought here?

Comment 12 Ben Werthmann 2017-01-18 16:54:25 UTC
Hello, is there an update on this issue?

Comment 13 Ben Werthmann 2017-03-03 19:50:03 UTC
Pinging Rajesh

Comment 14 Niels de Vos 2017-03-07 22:16:34 UTC
Or, maybe Avra wants to have a look at this?

Comment 15 Kaushal 2017-03-08 10:56:10 UTC
This bug is getting closed because GlusterFS-3.7 has reached its end-of-life.

Note: This bug is being closed using a script. No verification has been performed to check if it still exists on newer releases of GlusterFS.
If this bug still exists in newer GlusterFS releases, please reopen this bug against the newer release.

Comment 16 Avra Sengupta 2017-03-13 10:38:20 UTC
I agree with comment #11 from Atin. A snapshot volume is exactly like any other gluster volume when it comes to the peer handshake. The fact that the IP in its volfile is not found in the cluster will make it do a peer reject.

As Atin suggested, the only option we have here is to change the IP in the snap's volfile. But even with that option, as gluster snapshots are copy-on-write LVM-based snapshots, you end up losing a snapshot if it had any bricks on the cluster member which was forcibly removed.

Comment 17 Atin Mukherjee 2017-03-13 12:25:15 UTC
Avra - I was wondering: if we have snapshot bricks hosted by peer N, should we allow detaching that peer? Isn't this a bug in the first place?

Comment 18 Avra Sengupta 2017-03-14 05:48:55 UTC
Well, that is the most likely scenario. If we disallow it, we force users either to keep that peer for good, or to delete all those snapshots (which would be unusable anyway if the peer were detached).

As you suggested, we can tell the user that the peer is hosting snapshot bricks and therefore cannot be detached. It is similar to how we do not allow deletion of a volume if it still has snapshots.

This solution is still not a holistic one, as it forces the user to delete all of their snapshots. But it is still the best one we have so far.

Comment 19 Worker Ant 2017-03-16 09:49:42 UTC
REVIEW: https://review.gluster.org/16907 (glusterd : Glusterd fails to restart after replace brick is done for snap volume) posted (#1) for review on master by Gaurav Yadav (gyadav)

Comment 20 Worker Ant 2017-03-16 16:18:01 UTC
REVIEW: https://review.gluster.org/16907 (glusterd : Glusterd fails to restart after replace brick is done for snap volume) posted (#2) for review on master by Gaurav Yadav (gyadav)

Comment 21 Ben Werthmann 2017-03-16 20:44:15 UTC
The reported case differs slightly from the problem statement in 16907.


Problem:

 - Deploy gluster on 3 nodes, one brick each, one volume replicated, add peers by IP address
 - Create a snapshot
 - Lose one server
 - Add a replacement peer and new brick with a new IP address
 - replace-brick the missing brick onto the new server (wait for replication to finish)
 - peer detach the old server
 - after doing above steps, glusterd fails to restart.

Our expectations:

 - Glusterd starts and remains started when in the above state, even if there are issues with a volume and/or related snapshots. [1]
 - The non-snapshot volume would start and a heal would kick off (to catch up from where 'replace-brick <volumename> <oldbrick> <newbrick> commit force' left off)
 - A procedure exists which allows for replacing a failed server / brick and allows us to maintain snapshots. [2]
 - Limitations of snapshots are documented, such as having to delete all of your snapshots to replace a failed node. Is this true in all cases, or is it something specific that we are doing?

While recovering snapshots should be possible, our priority is [1]. Having gluster enter a state where *none* of the glusterd processes can start is a significant risk. Functional glusterd should be able to service/start the primary volume, even if the snapshot volumes enter an unusable state. 

 [1] Should issues with one volume prevent other volumes from starting due to glusterd crashing? Is this by design? Please elaborate on this behavior. If a well-meaning individual restarts the gluster services or reboots gluster servers for troubleshooting, there would be a cluster-wide outage of glusterd, which implies no new client connections.

 [2] In theory, it should be possible to recover Gluster snapshots based on lvm-thin. I think we'd just need to "replay the snapshot history" on a new thin LV. The process could be something like:
   1. Create a new thin LV
   2. replace-brick the oldest snapshot, create an LVM snapshot, update gluster references for the snapshot volume to the new snapshot
   3. go to the next snapshot unless at head

This is probably a gross oversimplification of the problem, but it seems that recovering snapshots should be possible; a rough sketch of the idea follows.
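
For illustration only, an untested sketch of that idea; the LVM names (vg0, thinpool, newbrick, snap0_brick) are placeholders and none of this is an established procedure:

# create a new thin LV on the replacement node to act as the brick origin
lvcreate --thin --virtualsize 100G --name newbrick vg0/thinpool
mkfs.xfs /dev/vg0/newbrick
# oldest snapshot first: once data has been healed onto the new brick, take a
# local thin snapshot to stand in for the lost LVM snapshot brick
lvcreate --snapshot --name snap0_brick vg0/newbrick
# then update that snapshot volume's references under /var/lib/glusterd/snaps/
# (similar in spirit to the sed workaround in comment 6) and repeat for the
# next snapshot until the newest one is reached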


Aside comment on 16907:

 - What's the recovery path when a peer has failed and something is preventing the removal of snapshots [3] before replacing the brick? Generally, when we need to perform the "recover a failed gluster server" task, something else has gone wrong.
 

[3] Different reasons why snapshot remove has failed:
- bugs with dm-thin/lvm tools
- Gluster not responding to "tpool_metadata is at low water mark" events, leading to a thin pool wedged in a read-only state
- poor interaction with netfilter's conntrack

Comment 22 Ben Werthmann 2017-03-16 21:20:43 UTC
Additionally, I acknowledge that patch 16907 prevents this specific bug as reported. The outstanding concern with patch 16907 is the recovery path if snapshots cannot be removed due to other failures.

Comment 23 Niels de Vos 2017-03-17 08:25:19 UTC
Hi Gaurav, care to reply to comment #12 and comment #22?

Comment 24 Gaurav Yadav 2017-03-20 09:30:30 UTC
As Avra already mentioned in comment #16 and comment #18, the solution is not a holistic one and it forces the user to delete all of their snapshots. But it is still the best one we have so far.

Could you please mention the other failures where you have concerns?

Comment 25 Worker Ant 2017-03-20 16:51:20 UTC
REVIEW: https://review.gluster.org/16907 (glusterd : Glusterd fails to restart after replace brick is done for snap volume) posted (#3) for review on master by Gaurav Yadav (gyadav)

Comment 26 Worker Ant 2017-03-22 11:32:04 UTC
REVIEW: https://review.gluster.org/16907 (glusterd : Peer is hosting snapshot bricks therefore can't be detatched) posted (#4) for review on master by Gaurav Yadav (gyadav)

Comment 27 Worker Ant 2017-03-22 11:57:33 UTC
REVIEW: https://review.gluster.org/16907 (glusterd : Disallow peer detach if snapshot bricks exist on it) posted (#5) for review on master by Gaurav Yadav (gyadav)

Comment 28 Ben Werthmann 2017-03-22 17:48:47 UTC
(In reply to Gaurav Yadav from comment #24)
> 
> Could you please mention the other failure where you have concerns

We perform replace-brick operations to recover after entering one of the following states:

1. Total system failure - server/instance is "terminated"
2. A server running Gluster enters an unrecoverable error state and must be replaced to recover the cluster from a degraded state (case: replica 3 volumes).

In case 2, generally, an LVM thin pool (thin data LV and snapshot LVs) enters a read-only state because the thin pool's metadata LV has been exhausted and fails to extend [1]. Gluster ignores "tpool_metadata is at low water mark" events and continues to create snapshots.


[1] I discussed this issue with Zdenek Kabelac. The issue is due to older kernel dm-thin support and/or older versions of the userspace LVM tools.
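
For reference, standard LVM tooling can monitor and relieve this condition (vg0/thinpool is a placeholder):

lvs -a -o lv_name,data_percent,metadata_percent vg0     # watch thin pool data/metadata usage
lvextend --poolmetadatasize +1G vg0/thinpool            # grow the metadata LV before it fills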

Comment 29 Gaurav Yadav 2017-03-22 18:42:13 UTC
The fix is made as per comment #8:
- deploy gluster with three servers, one brick each, one volume replicated across all 3
- create a snapshot
- lose one server
- add a replacement peer and new brick with a new IP address
- replace-brick the missing brick onto the new server (wait for replication to finish)
- force remove the old server
- verify everything is working as expected
- restart _any_ server in the cluster, without failure


Explanation for the case mentioned in comment #28:

glusterd fails to start after executing the above test case because, when the replace-brick command is executed, glusterd updates the brick paths in the volume files but does not change them in the snapshot files, the reason being that a snapshot was created at a point in time.
Now, while restarting the service, glusterd sees the snap volume, but while restoring it tries to get the brick info from the node which has already been detached from the cluster; that info is not present, hence glusterd fails to load.

With the fix, glusterd iterates through all snapshot volumes' bricks and, if it finds any such brick, it disallows detaching that peer, which is the best possible solution.
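
Illustratively (the exact CLI output format here is an assumption), a detach attempt against a peer that still hosts snapshot bricks is now rejected with the error text quoted in the commit message below:

$ gluster peer detach 172.27.20.50
peer detach: failed: 172.27.20.50 is part of existing snapshots. Remove those snapshots before proceeding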

Comment 30 Ben Werthmann 2017-03-27 15:00:37 UTC
Thanks everyone!

Comment 31 Worker Ant 2017-03-30 10:51:25 UTC
REVIEW: https://review.gluster.org/16907 (glusterd : Disallow peer detach if snapshot bricks exist on it) posted (#6) for review on master by Gaurav Yadav (gyadav)

Comment 32 Worker Ant 2017-03-31 07:03:28 UTC
REVIEW: https://review.gluster.org/16907 (glusterd : Disallow peer detach if snapshot bricks exist on it) posted (#7) for review on master by Gaurav Yadav (gyadav)

Comment 33 Worker Ant 2017-04-01 01:53:13 UTC
COMMIT: https://review.gluster.org/16907 committed in master by Atin Mukherjee (amukherj) 
------
commit 1c92f83ec041176ad7c42ef83525cda7d3eda3c5
Author: Gaurav Yadav <gyadav>
Date:   Thu Mar 16 14:56:39 2017 +0530

    glusterd : Disallow peer detach if snapshot bricks exist on it
    
    Problem :
    - Deploy gluster on 2 nodes, one brick each, one volume replicated
    - Create a snapshot
    - Lose one server
    - Add a replacement peer and new brick with a new IP address
    - replace-brick the missing brick onto the new server
      (wait for replication to finish)
    - peer detach the old server
    - after doing above steps, glusterd fails to restart.
    
    Solution:
      With the fix detach peer will populate an error : "N2 is part of
      existing snapshots. Remove those snapshots before proceeding".
      While doing so we force user to stay with that peer or to delete
      all snapshots.
    
    Change-Id: I3699afb9b2a5f915768b77f885e783bd9b51818c
    BUG: 1322145
    Signed-off-by: Gaurav Yadav <gyadav>
    Reviewed-on: https://review.gluster.org/16907
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Atin Mukherjee <amukherj>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>

Comment 34 Shyamsundar 2017-05-30 18:34:38 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.11.0, please open a new bug report.

glusterfs-3.11.0 has been announced on the Gluster mailing lists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-May/000073.html
[2] https://www.gluster.org/pipermail/gluster-users/

