Bug 1274485 - Unable to mount volume after conversion from distributed to replicated.
Summary: Unable to mount volume after conversion from distributed to replicated.
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: 3.7.5
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-10-22 19:26 UTC by Richard
Modified: 2015-10-28 08:40 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-10-28 08:40:43 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Richard 2015-10-22 19:26:22 UTC
Description of problem:

I have set up a distributed volume with a single node/brick. After adding a new brick to the volume and converting it into a 2-brick replicated volume, the volume can no longer be mounted after a reboot.

Version-Release number of selected component (if applicable): 

This happens on all releases after v3.4.7; that is the last release where it works.

Steps to Reproduce:

1. Create a new server and set up a single-brick distributed volume.
2. Build a 2nd server and add it to the distributed volume as such:

gluster volume add-brick myvol replica 2

This converts the volume from distributed to replicated.

3. Mount the volume; all is OK.
4. Power off both servers.
5. Boot the 1st server only.
6. Check that glusterd is started.
7. Try to mount the volume on any server; it won't work.

(A full command sketch of these steps is given below, after the expected results.)

Actual results: the volume is not mountable unless all bricks in the replica set have glusterd running.

Expected results: the replicated volume should be mountable with only one node online.
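
For reference, a minimal command sketch of the steps above (the volume name myvol is taken from step 2; the hostnames server1/server2 and the brick path /data/brick are assumptions, not taken from this report):

# on server1: create and start a single-brick (distributed) volume
gluster volume create myvol server1:/data/brick
gluster volume start myvol

# on server1: add the 2nd server to the pool, then convert to a 2-brick replica
gluster peer probe server2
gluster volume add-brick myvol replica 2 server2:/data/brick

# mounting works at this point; power off both servers, boot server1 only, then:
mount -t glusterfs server1:/myvol /mnt/gluster   # this mount fails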

Comment 1 Atin Mukherjee 2015-10-26 05:10:57 UTC
This is expected. In a given 2-node setup, if both glusterd instances go down and one of them comes back online, it does not restart the bricks until it receives the first peer-connect event; this maintains consistency and ensures that this instance is not stale. In this case, since the other server was down, glusterd never started the brick process, which is why the mount failed. We recommend having a third dummy node (it need not host any bricks) in this setup to avoid this situation.
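
(For illustration, adding such a dummy node is a single peer probe from one of the existing servers; the hostname server3 below is an assumption, and the node does not need to host any bricks.)

gluster peer probe server3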

Comment 2 Richard 2015-10-26 09:23:54 UTC
That's not true... this is ONLY a problem when you convert a distributed volume to a replicated volume.

If you start with a replicated volume in the first place, you can start any node at any time and it will mount the volume.

Thanks,

Rich

Comment 3 Richard 2015-10-26 09:25:44 UTC
From what I can see, the volume still thinks that it is distributed, when in fact it is now replicated. The internal metadata for this is not updated when adding the 2nd brick and converting it to replicated.
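
(For reference, the type the volume currently reports can be checked on either node with the command below; after a successful add-brick the output is expected to show "Type: Replicate". The volume name myvol is the one used in the description above.)

gluster volume info myvol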

Comment 4 Atin Mukherjee 2015-10-26 10:31:07 UTC
(In reply to Richard from comment #2)
> That's not true... this is ONLY a problem when you convert a distributed
> volume to a replicated volume.
> 
> if you start with a replicated volume in the first place you can start any
> node at any time and it will mount the volume.
> 
> Thanks,
> 
> Rich

Do you claim that you are able to mount a replicated volume in a 2-node setup when both nodes were powered off and only one of them was turned back on with the glusterd service running? The code doesn't say that, though. Just reconfirming the reproducer:

1. Create a distributed volume on 1 node
2. Add a peer
3. Add a brick, with the new brick being hosted on the newly added peer
4. Both nodes are powered off
5. One of the nodes comes back, and the glusterd service is up as well
6. Mount fails

Comment 5 Richard 2015-10-26 12:09:26 UTC
Yes. If the volume is originally created as a 2-brick replicated volume, then it remounts just fine with only one brick online after a power-off.

However, if I create a volume with a single brick and then add a 2nd brick to make the volume replicated, I have issues remounting after a reboot. I can mount the replicated volume during this time, but after a reboot it is not mountable again.

Yes, the steps you describe are pretty much what I use to create my volume.


1. Create a distributed volume on 1 node
1.1 Mounting the volume works
2. Add a peer
3. Add a brick, with the new brick being hosted on the newly added peer
3.1 Mounting the volume from the new node works OK too.
4. Both nodes are powered off
5. One of the nodes comes back, and the glusterd service is up as well
6. Mount fails

Step #1) gluster volume create data brick2:folder
Step #3) gluster volume add-brick data replica 2 brick2:folder

My mount command is the standard FUSE one:

mount -t glusterfs brick1:data /mnt/gluster

Thanks

Rich

Comment 6 Richard 2015-10-26 17:18:16 UTC
ok, small update... this all used to work, but after a migration from EL6 to EL7 it doesn't work anymore :-(

Why disable this feature and not document the fact that it is now broken? I can't remember seeing it stated anywhere that you can't mount half a replicated volume. It doesn't make sense to me to put in such a pointless limitation.

This means I can't perform ANY maintenance on a replicated volume without fear of the remaining brick in service going offline and stopping me from restarting my volume.

I understand the 3-brick witness thing, but you should have made that optional... after all, isn't that what the arbiter option is for?
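
(For reference, the arbiter option referred to here creates a replica-3 volume whose third brick stores only metadata, not file data. A minimal sketch, with hostnames and brick paths assumed rather than taken from this report:

gluster volume create myvol replica 3 arbiter 1 server1:/data/brick server2:/data/brick server3:/data/arbiter)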

Comment 7 Atin Mukherjee 2015-10-27 05:25:53 UTC
(In reply to Richard from comment #6)
> ok, small update... this all used to work, but after a migration from EL6 to
> EL7 it doesn't work anymore :-(
> 
> Why disable this feature and not document the fact it is now broken? I can't
> remember seeing anywhere that you can't mount half a replicated volume?
This is a special case where both servers were down and only one of them came back. As I mentioned earlier, at this point in time the online server's glusterd has no knowledge of whether its current state is up to date or stale. Correctness is a key requirement in any distributed system, and considering that, we can't trust this information until and unless glusterd receives a notification from one of its other peers about the current state of the configuration. This is a limitation of the current design of the existing GlusterD 1.0. Fortunately, in GlusterD 2.0 we should be able to get rid of this problem. I understand this is a bit annoying, but that is why we never recommend a 2-node setup for this case. In conclusion, we won't be able to solve this problem until GlusterD 2.0 lands. If this satisfies you, can you close this bug?
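
(As an illustration of the state described above, the surviving node's view of its peers and bricks can be checked with the commands below; with the other node down and no peer-connect event received yet, the bricks are expected not to be running. The volume name myvol is assumed from the description.)

gluster peer status
gluster volume status myvol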

> doesn't make sense to me to put in such a pointless limtation.
> 
> This means I can't perform ANY maintenance on a replicated volume without
> fear of the remaining brick in service going offline and stopping me from
> restarting my volume.
> 
> I understand the 3 brick witness thing, but you should have made tha
> toptional... after all, isn't that what the arbiter option is for?

Comment 8 Richard 2015-10-27 11:28:09 UTC
What version of GlusterD is used in GlusterFS 3.4.7? This worked fine in that release. It is any release from 3.5.x onwards that has this limitation now.

The annoying bit is that this has now broken something that _was_ working.

If this is a "special case" then how come older versions of Gluster used to work just fine?

When is GlusterD2.0 due to be released?

Comment 9 Atin Mukherjee 2015-10-28 04:00:29 UTC
(In reply to Richard from comment #8)
> What version of GlusterD is used in GlusterFS 3.4.7? this worked fine in
> that release. It is any release from 3.5.x onwards that has this limitation
> now.
> 
> The annoying bit is that this has now broken something that _was_ working.
I must say that even though it *was* working, it was working at the expense of correctness. That's why this was identified as a gap post-3.4.x and was fixed so as *not to* sacrifice correctness, even though it broke the existing behaviour.
> 
> If this is a "special case" then how come older versions of Gluster used to
> work just fine?
> 
> When is GlusterD2.0 due to be released?
GlusterD 2.0 is currently in the pre-development phase; we expect to get the first cut of it by the end of next year.

Comment 10 Richard 2015-10-28 08:40:43 UTC
Knowing that this will never work in the short term has forced me to look at other cluster file systems for my project. I initially thought I was going mad with it not working ;-)

I will keep an eye on future releases of GlusterFS to see how it matures, and may come back, as I really like the fact that it does not require a dedicated metadata server.

Thank you all for your hard work.

Rich

