Bug 765338 (GLUSTER-3606)

Summary: Feature: Create a distribute volume with one brick and add the other later
Product: [Community] GlusterFS
Component: glusterd
Assignee: Amar Tumballi <amarts>
Reporter: jaw171
Severity: low
Priority: medium
Version: mainline
CC: gluster-bugs, jdarcy, vijay, vraman
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix

Description jaw171 2011-09-21 16:52:26 EDT
Currently, when creating a replica (sub)volume, gluster requires the initial number of bricks to equal the replica count.  This makes it very difficult to migrate storage that is already in use to gluster.  The mdadm tools handle this when creating a mirror by accepting the word "missing" in place of the second half of the mirror; gluster should be able to do the same.

Take this scenario:
server1 has 35TB of storage and 15TB is used, it runs the kernel's NFS server
server2 is purchased with 35TB of storage and 0TB used

You want to move to gluster and have the two servers in a replica volume.  Right now (as I understand it) the only way to do this is to stop writes to server1, copy the data to a third server, set up the volume with server1 and server2, then move the data back into the volume.  What if you don't have a third server that is big enough?  Then you can't set up your volume at all.  This process would also take a very long time to move that much data.  Think if you had 100TB...

How this should work (modeled after mdadm):
On server2 create the volume:
gluster volume create vol_test replica 2 MISSING gfs-dev-02.cssd.pitt.edu:/bricks/lv_brick_02_004a

Then at your leisure you can use rsync or whatever to copy the data from server1's (still live) storage to the volume to get a rough copy of the data.  Later you would stop writes to server1, do a final rsync to get a crash-consistent copy of the data into the volume.  After this you would delete the data from server1 (or format) and simply add the brick to the replica set of the existing volume and force a full self-heal.  Tada.  You moved from a single NFS server to a replica Gluster volume with minimal downtime and without the need for a third server to stage the data.  It may even be possible to add a command to gluster to add a replica anytime you want and change the replica count as such:

gluster add-replica-brick {volume name} {existing replica brick} {new replica brick}

This way gluster could support going not only from a replica count of 1 to 2 but to any number, without needing to delete and re-create the volume.  That would be more difficult to implement, but to make the "create a volume with a missing brick" idea work there has to be a way to later add a brick and tell gluster "replicate with THIS existing brick", and the only clear way I can think to do that is with a special command.  It may even be possible to use this "add a replica to any brick" mechanism to convert a stripe volume to a stripe-replicate.  Just some thoughts and rambling.
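For reference, a capability close to the proposed add-replica-brick command later appeared in the GlusterFS CLI as an add-brick form that takes a new replica count.  This is a hedged sketch only: the syntax below did not exist in the 3.2-era CLI this report was filed against, so verify it against your installed version before relying on it.

```shell
# Sketch (later-CLI syntax, not available in 3.2): raise the replica
# count of an existing single-brick volume by adding a paired brick.
gluster volume add-brick vol_test replica 2 \
    gfs-dev-03b.cssd.pitt.edu:/bricks/lv_brick_03_001b

# Then force a full self-heal so existing data is copied onto the
# newly added replica brick (also a later-CLI command).
gluster volume heal vol_test full
```

These commands require a live gluster cluster, so they are shown as an operational sketch rather than a tested transcript.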

How it works now:
[root@gfs-dev-02b ~]# gluster volume create vol_test replica 2 gfs-dev-02.cssd.pitt.edu:/bricks/lv_brick_02_004a
number of bricks is not a multiple of replica count
Usage: volume create <NEW-VOLNAME> [stripe <COUNT>] [replica <COUNT>] [transport <tcp|rdma|tcp,rdma>] <NEW-BRICK> ...
Comment 1 jaw171 2011-09-22 12:59:55 EDT
User semiosis stated the third server idea isn't needed, but this is not documented anywhere.

<semiosis> gimpy2157: a 3rd server isnt necessary, you can already create a replicated volume with one brick empty and the other brick pre-loaded with data
<semiosis> gimpy2157: so in your scenario you would just stop NFS clients/service, create the gluster volume with one brick equal to the existing NFS export, the other brick on the other server to which you want data replicated, then start clients up through glusterfs.  clients can begin immediately accessing data and it will be self-healed to the new (blank) replica on demand, and you can also do a repair to replicate other data before clients would otherwise read & self-heal it
<semiosis> that is essentially the procedure to change replica count on an existing glusterfs volume
> semiosis: So any data that exists in a brick would simply be part of the volume when the volume is created?  Nothing else needs to be done?                                       
<semiosis> yeah pretty much                                                                                                                                                         
<semiosis> there's some caveats but in your scenario they're all safely avoided                                                                                                     
<semiosis> like, dont add data to a brick when glusterfsd is running, it may not notice                                                                                             
<semiosis> and, don't go modifying xattrs on your own, except to delete them                                                                                                        
> Is this procedure documented anywhere?                                                                                                                                            
<semiosis> and then there's the issue of adding files to the wrong distribute subvolume                                                                                             
<semiosis> but since you're doing a straight replicate (1x2) that isnt an issue either                                                                                              
<semiosis> but i think even that can be solved with a rebalance (though i've not actually tested that out)                                                                          
<semiosis> documented? does this conversation count?                                                                                                                                
<semiosis> ;)                                                                                                                                                                       
> I don't really trust rebalance right now (bug 765308).                                                                                                                              
<glusterbot> Bug http://goo.gl/oHJpN major, P3, ---, amar@gluster.com, ASSIGNED, Rebalance on Distributed-Replicate in fails and consumes more space then before                    
<semiosis> yeah neither do i, though in regards to existing bugs, i heard talk of a new rebalance engine coming very soon, so i've got my hopes up for that :)                      
> It does not.  Someone should write this up with caveats and such.
> Anyway, I'll test it out tomorrow.  Right now it's time to go drink.                                                                                                              
<semiosis> enjoy!                                                                                                                                                                   
<semiosis> gimpy2157: if your tests are successful, you would be in a great position to do the write up on the gluster community site                                               
<semiosis> gimpy2157: i think an article about creating a replicated glusterfs volume with existing data would be a very useful contribution
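The procedure semiosis outlines above can be sketched as a rough runbook.  Hostnames and brick paths are taken from this report; the final heal-trigger step is an assumption for the 3.2-era CLI, where a full self-heal was commonly forced by recursively stat()ing every file through the mount.

```shell
# 1. Stop the NFS service so the existing export stops changing.
service nfs stop

# 2. Create the replica volume: one brick is the old NFS export
#    (pre-loaded with data), the other an empty brick on the new server.
gluster volume create vol_test replica 2 \
    gfs-dev-01b.cssd.pitt.edu:/bricks/lv_brick_01_001a \
    gfs-dev-03b.cssd.pitt.edu:/bricks/lv_brick_03_001b
gluster volume start vol_test

# 3. Point clients at the volume; data is self-healed onto the blank
#    replica on demand as files are accessed.
mount -t glusterfs gfs-dev-01b.cssd.pitt.edu:vol_test /mnt/fuse

# 4. Optionally force replication of everything up front by touching
#    every file through the mount (3.2-era full self-heal trigger).
find /mnt/fuse -noleaf -print0 | xargs -0 stat >/dev/null
```

As with any cluster operation, these commands need a live gluster deployment; treat this as a sketch of the conversation above, not a verified transcript.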

I went ahead and tested this.

#Put data into the storage area of one server.
[root@gfs-dev-01b ~]# num=0;while [ $num -le 250 ];do dd bs=1024 count=10000 if=/dev/urandom of=/bricks/lv_brick_01_001a/foo.$num ;num=$(($num + 1));done

#Calculate the md5sums of the data.
[root@gfs-dev-01b ~]# md5sum /bricks/lv_brick_01_001a/* > /var/tmp/brick.md5
[root@gfs-dev-01b ~]# tail /var/tmp/brick.md5
54fdc23c4501df97f7306543a89a082f  /bricks/lv_brick_01_001a/foo.90
41578906b8c81972691992c9962c9717  /bricks/lv_brick_01_001a/foo.91
132ed345a4b60289f8aecc0bf9973b51  /bricks/lv_brick_01_001a/foo.92
0c02ddd6541b008d297d99dffe40244a  /bricks/lv_brick_01_001a/foo.93
090f78876568765974e31927a0a53309  /bricks/lv_brick_01_001a/foo.94
5b0f817389b6c59c8f26be2501b99a24  /bricks/lv_brick_01_001a/foo.95
1f1dd79016bd6fc83a51a04fb83ff067  /bricks/lv_brick_01_001a/foo.96
c918222324f4816730e7ea028fae8ca2  /bricks/lv_brick_01_001a/foo.97
29c1fa548457f0393a82ced3d08e01bb  /bricks/lv_brick_01_001a/foo.98
952a250eddf018b6d83c7ddf087955b5  /bricks/lv_brick_01_001a/foo.99

#Create the volume.
[root@gfs-dev-01b ~]# gluster volume create vol_test replica 2 gfs-dev-01b.cssd.pitt.edu:/bricks/lv_brick_01_001a gfs-dev-03b.cssd.pitt.edu:/bricks/lv_brick_03_001b

#Check the size of the brick of the other server.
[root@gfs-dev-03b ~]# df -hP
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_system-lv_root   20G  3.9G   15G  21% /
tmpfs                 372M     0  372M   0% /dev/shm
/dev/sda1             194M   45M  139M  25% /boot
/dev/mapper/vg_system-lv_var  9.9G  332M  9.1G   4% /var
/dev/mapper/vg_system-lv_brick_03_001b  5.0G   33M  5.0G   1% /bricks/lv_brick_03_001b

#Mount the volume:
[root@gfs-dev-01b ~]# mount -t glusterfs gfs-dev-01b.cssd.pitt.edu:vol_test /mnt/fuse

#Check the md5 sums.
[root@gfs-dev-01b ~]# sed -e 's/\/bricks\/lv_brick_01_001a/\/mnt\/fuse/g' /var/tmp/brick.md5 > /var/tmp/fuse.md5
[root@gfs-dev-01b ~]# md5sum --check /var/tmp/fuse.md5
/mnt/fuse/foo.0: OK
/mnt/fuse/foo.1: OK
/mnt/fuse/foo.10: OK
/mnt/fuse/foo.100: OK
/mnt/fuse/foo.101: OK
/mnt/fuse/foo.102: OK
/mnt/fuse/foo.103: OK
/mnt/fuse/foo.104: OK
/mnt/fuse/foo.105: OK
/mnt/fuse/foo.106: OK
/mnt/fuse/foo.107: OK
/mnt/fuse/foo.108: OK
/mnt/fuse/foo.109: OK

#Check the other brick in the replica.
[root@gfs-dev-03b ~]# df -hP
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_system-lv_root   20G  3.9G   15G  21% /
tmpfs                 372M     0  372M   0% /dev/shm
/dev/sda1             194M   45M  139M  25% /boot
/dev/mapper/vg_system-lv_var  9.9G  332M  9.1G   4% /var
/dev/mapper/vg_system-lv_brick_03_001b  5.0G  2.5G  2.6G  49% /bricks/lv_brick_03_001b

Sure seems to work this way, though of course dd from urandom isn't the sort of thing gluster is normally used for, and it's not much data.  Making replication more dynamic by being able to add and remove replica bricks at will still seems like a neat idea with some valid use cases, such as being able to cleanly pull a brick out of a volume.
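The checksum comparison used in the test above generalizes to any before/after migration check.  Here is a minimal, self-contained re-run of the idea using temporary directories in place of the brick and the FUSE mount, so it can be tried on any machine (all paths below are stand-ins, not real bricks):

```shell
# Stand-ins for the pre-loaded brick and the mounted volume.
src=$(mktemp -d)
dst=$(mktemp -d)

# Generate a few files of random data and record their checksums.
for n in 0 1 2; do
    dd bs=1024 count=10 if=/dev/urandom of="$src/foo.$n" 2>/dev/null
done
md5sum "$src"/* > /tmp/brick.md5

# "Migrate" the data, rewrite the recorded paths, and verify.
cp "$src"/* "$dst"/
sed -e "s|$src|$dst|g" /tmp/brick.md5 > /tmp/fuse.md5
md5sum --check /tmp/fuse.md5
```

This mirrors the sed path-rewrite trick from the transcript: the checksum file is computed once against the source, rewritten to point at the destination, and re-verified there.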
Comment 2 Amar Tumballi 2011-10-28 05:49:32 EDT
bug 765037 was trying to solve a similar issue.  That is now fixed in the master branch, and we are not planning to back-port it to the release-3.2 branch.  So please test this on the master branch and, if it fails, (re-)open the bug.