Bug 758636 - Can't expand volume
Summary: Can't expand volume
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Fedora
Classification: Fedora
Component: hekafs
Version: 16
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: ---
Assignee: Kaleb KEITHLEY
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: 759121
 
Reported: 2011-11-30 10:06 UTC by cr
Modified: 2012-07-16 13:35 UTC
CC List: 2 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of:
Clones: 759121
Environment:
Last Closed: 2012-07-16 13:35:36 UTC
Type: ---



Description cr 2011-11-30 10:06:50 UTC
Hi,

with GlusterFS you can expand an existing replicated volume by adding bricks in multiples of the replica count.
But how can I do that with HekaFS?
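
For example, on a plain GlusterFS replica-2 volume something like this works (server and brick names are placeholders):

gluster volume add-brick VOLNAME replica 2 server3:/bricks/b3 server4:/bricks/b4
gluster volume rebalance VOLNAME start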

Comment 1 Kaleb KEITHLEY 2011-11-30 16:27:59 UTC
Not a bug.

Send questions about HekaFS to gluster-users or ask in #gluster on freenode (irc.freenode.net)

Comment 2 cr 2011-11-30 16:41:15 UTC
I did use the gluster commands, but afterwards nothing works anymore. There are no HekaFS commands to do this, so I used the gluster expansion tools to add bricks.

Comment 3 Kaleb KEITHLEY 2011-11-30 17:02:24 UTC
Reopening as an enhancement request for expand functionality and/or better documentation about how to expand a HekaFS volume.

Comment 4 Jeff Darcy 2011-11-30 17:17:54 UTC
I think it would be reasonable to consider this a doc bug.  The underlying issue here is that there's no change notification between the GlusterFS and HekaFS management pieces.  As a result, changes to the underlying GlusterFS volume won't even be noticed by HekaFS until the HekaFS volume is restarted.  At that point we'll regenerate the multi-tenant HekaFS config and recalculate the list of daemons to start.

cr, can you please verify that restarting the HekaFS volume results in the changed brick list (or other changes) being picked up properly for you?  This should at least be documented; the even better news is that the underlying problem should go away as we integrate HekaFS functionality into GlusterFS instead of their being separate.
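
Something like this should do it (this assumes an hfs_stop_volume command is available alongside hfs_start_volume; the stop/start buttons on the Volume Management page are equivalent):

hfs_stop_volume MYVOL
hfs_start_volume MYVOL

The restart regenerates the multi-tenant config from the current GlusterFS brick list, so the new bricks should show up afterwards.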

Comment 5 cr 2011-11-30 17:31:59 UTC
Hi,

if I rebalance with GlusterFS and then try to restart the volume from the web interface, I get an error 500 message when I start the volume again.

regards Christopher.

Comment 6 Jeff Darcy 2011-11-30 17:42:46 UTC
That's even more disturbing.  I think there might be some interesting interactions between a global rebalance and behavior on the per-tenant volumes (which are built on subdirectories of the GlusterFS bricks), but nothing that should prevent the daemons from starting up.  Is there anything in /var/log/hekafs to shed more light on the failure?

Comment 7 cr 2011-11-30 18:22:14 UTC
Nothing in /var/log/hekafs about trying to start the volume via the HekaFS web interface.

OK, so in principle this workaround should work?

1. Add bricks on the Manage Servers tab
2. Add bricks with their mount points on the Manage Volumes tab
3. gluster volume add-brick VOLNAME NEW-BRICK on the terminal
4. Both gluster volume info and hfs_list_volumes look fine
5. gluster volume rebalance VOLNAME start does not work because the volume was not started via gluster volume start VOLNAME

Which workaround would you prefer?

I have 2 bricks configured as a replicated volume and added 2 new bricks. Now I want to change the volume to a distributed-replicated volume with replica count 2 without losing data. Will I lose data if I delete the volume and re-create a new one?

Comment 8 Kaleb KEITHLEY 2011-11-30 18:34:52 UTC
> Nothing in /var/log/hekafs about trying to start the volume via the HekaFS web interface.

start and stop are on the Volume Management page (tab) for Existing Volumes.

>
> OK, so in principle this workaround should work?
> 
> 1. Add bricks on the Manage Servers tab

No, you don't add bricks on this page, just nodes.

> 2. Add bricks with their mount points on the Manage Volumes tab

Yes.

> 3. gluster volume add-brick VOLNAME NEW-BRICK on the terminal
> 4. Both gluster volume info and hfs_list_volumes look fine
> 5. gluster volume rebalance VOLNAME start does not work because the volume was not started via gluster volume start VOLNAME

Generally speaking, you should not use gluster commands on HekaFS volumes.

> Will I lose data if I delete the volume and re-create a new one?

No, the files will remain, untouched, on the underlying bricks. They will still be there when you recreate the new volume.

Comment 9 Jeff Darcy 2011-11-30 18:54:01 UTC
There's no way to add bricks to an existing volume in the HekaFS interface.  You can add them via the GlusterFS CLI *if you're very careful* but, as noted previously, HekaFS won't notice the change until its own volumes are restarted.  Without rebalancing, the sequence would look like this.

(0) Stop the volume at all levels.

(1) gluster volume add-brick MYVOL replica 2 SERVER1:BRICK1 SERVER2:BRICK2

(2) hfs_start_volume MYVOL

If you want to rebalance before starting the modified volume, you could do the
following.

(1.1) gluster volume start MYVOL

(1.2) gluster volume rebalance MYVOL start

(1.3) *wait* for rebalance to complete

(1.4) gluster volume stop MYVOL

The last two steps are important, because you should definitely not have the
volume started both through "gluster volume start" and hfs_start_volume at the
same time.  In the not-too-distant future, online expansion of HekaFS volumes
will be possible (because by then they'll just be GlusterFS volumes), but doing
this requires a level of management integration that has not happened yet.
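
Putting it all together, the whole expand-and-rebalance sequence would look roughly like this (MYVOL and the brick names are placeholders):

# (0) stop the volume at all levels first (HekaFS and, if applicable, gluster)
# (1) add the new brick pair to the underlying GlusterFS volume
gluster volume add-brick MYVOL replica 2 SERVER1:BRICK1 SERVER2:BRICK2
# (1.1-1.4) optional rebalance before handing the volume back to HekaFS
gluster volume start MYVOL
gluster volume rebalance MYVOL start
# ... wait for the rebalance to complete ...
gluster volume stop MYVOL
# (2) restart through HekaFS so the changed brick list is picked up
hfs_start_volume MYVOL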

If you delete and recreate the volume, so long as you recreate it with a
command that only appends new bricks at the end, there should be no loss of
data.

original definition: gluster volume create XYZ replica 2 brick1 brick2
OK: gluster volume create XYZ replica 2 brick1 brick2 brick3 brick4
NOT OK: gluster volume create XYZ replica 2 brick1 brick4 brick3 brick2

The last example would result in brick1+brick4 being combined into one replica
pair and brick3+brick2 being combined into another, with many files present in
both pairs.  That will cause DHT, which is trying to distribute each file to
only one of its subvolumes, to get very confused and data loss could result.

Comment 10 cr 2011-11-30 22:02:40 UTC
If I want start the Volume via hfs:


Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/bottle.py", line 499, in handle
    return handler(**args)
  File "/usr/lib/python2.7/site-packages/hekafsd.py", line 119, in start_volume
    return hfs_start_volume.run_www(vol_name)
  File "/usr/lib/python2.7/site-packages/hfs_start_volume.py", line 250, in run_www
    blob = run_common(vol_name)
  File "/usr/lib/python2.7/site-packages/hfs_start_volume.py", line 243, in run_common
    url_obj = urllib2.urlopen(url)
  File "/usr/lib64/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 400, in open
    response = meth(req, response)
  File "/usr/lib64/python2.7/urllib2.py", line 513, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python2.7/urllib2.py", line 438, in error
    return self._call_chain(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 372, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 521, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 500: INTERNAL SERVER ERROR

Comment 11 Jeff Darcy 2011-11-30 22:24:43 UTC
Are you sure hekafsd is running on the new servers?  This kind of traceback usually occurs when we fail to contact one of the nodes that's hosting a brick for the volume we're starting.  This can be because hekafsd isn't running there (even though glusterd might be), firewall issues, DNS issues, etc.  Also, make sure the volume is stopped (from both GlusterFS's and HekaFS's point of view) before trying to start it, by running "ps" to look for "glusterfsd" processes.  If you're still getting a server error in that case, it's probably because of something essentially unrelated to expanding the volume, because we're not even getting to the point where the changed volume definition would matter.
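
For example, on each server that hosts a brick for the volume:

ps ax | grep hekafsd       # hekafsd must be running everywhere, not just glusterd
ps ax | grep glusterfsd    # should show no brick daemons once the volume is fully stopped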

Comment 12 cr 2011-11-30 23:07:29 UTC
Yes, everything is fine: iptables is off, and ping works for both IP addresses and DNS names.

What about the different versions?

Nodes 1+2 have:

Version    : 0.7
Release    : 16.fc16

Nodes 3+4 have:

hekafs-0.7-18.fc16.x86_64.rpm

I created a temp volume on the 2 new nodes, no problem.

Comment 13 cr 2011-12-01 08:47:40 UTC
Nothing changed. I rebuilt our cluster locally and installed the newest HekaFS and GlusterFS packages on each brick server; same error.

First I built a volume with 2 nodes as one replicated volume.
Then I removed the volume, built a new one with the same name and 4 nodes, and got this error when starting the volume:

Error 500: Internal Server Error

Sorry, the requested URL http://192.168.1.114:8080/volumes/TMP/start caused an error:

Unhandled exception
Exception:

HTTPError()
Traceback:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/bottle.py", line 499, in handle
    return handler(**args)
  File "/usr/lib/python2.7/site-packages/hekafsd.py", line 117, in start_volume
    return hfs_start_volume.run_www(vol_name)
  File "/usr/lib/python2.7/site-packages/hfs_start_volume.py", line 253, in run_www
    blob = run_common(vol_name)
  File "/usr/lib/python2.7/site-packages/hfs_start_volume.py", line 246, in run_common
    url_obj = urllib2.urlopen(url)
  File "/usr/lib64/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 400, in open
    response = meth(req, response)
  File "/usr/lib64/python2.7/urllib2.py", line 513, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python2.7/urllib2.py", line 438, in error
    return self._call_chain(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 372, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 521, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 500: INTERNAL SERVER ERROR

Comment 14 cr 2011-12-01 09:07:53 UTC
Got it working: I had to remove the tenants and recreate them.

Comment 15 Jeff Darcy 2011-12-01 13:15:10 UTC
I'm glad you found a solution.  The tenant-related 500 error seems like a separate bug, so I'll clone this one to track it.

Comment 16 cr 2011-12-02 01:17:36 UTC
After recreating the volume, all the web servers time out, and nobody knows why.

Comment 17 Kaleb KEITHLEY 2012-07-16 13:35:36 UTC
HekaFS will be merged into core Gluster.

