Bug 1440900 - [RFE] Support Volume expansion in Heketi
Summary: [RFE] Support Volume expansion in Heketi
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: heketi
Version: cns-3.5
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: CNS 3.6
Assignee: Raghavendra Talur
QA Contact: Tejas Chaphekar
URL:
Whiteboard:
Duplicates: 1446062 1477919 1477982
Depends On: 1477431
Blocks: 1445444 1477919
 
Reported: 2017-04-10 17:40 UTC by Daniel Messer
Modified: 2021-12-10 15:00 UTC
CC: 22 users

Fixed In Version: heketi-5.0.0-1.el7rhgs
Doc Type: Bug Fix
Doc Text:
Previously, new bricks that are added to volume as part of volume expansion would not have the right GID set and would lead to I/O failures. With this build, GID is set on all the new bricks.
Clone Of:
Environment:
Last Closed: 2017-10-11 07:07:22 UTC
Embargoed:


Attachments (Terms of Use)
glusterfs client log from OCP app server (570.09 KB, text/plain)
2017-04-12 20:04 UTC, Annette Clewett
no flags


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2017:2879 0 normal SHIPPED_LIVE heketi bug fix and enhancement update 2017-10-11 11:07:06 UTC

Description Daniel Messer 2017-04-10 17:40:18 UTC
Description of problem:

In a situation where a client mounting a heketi-managed volume is nearing 100% capacity utilization, an administrator might expand the volume to get more space, i.e. issue heketi-cli volume expand --volume=8081e2f112ef14d3e267a37840369469 --expand-size=30. While this completes successfully and the increased volume capacity is reflected to the client online (i.e. in df -h), the user is not able to write past the original capacity limit. Any write to a new or existing file will fail with: No space left on device.


Version-Release number of selected component (if applicable):

heketi v4.0.0-5 and/or rhgs3/rhgs-volmanager-rhel7:3.2.0-5 image.


How reproducible:

Expand a volume provisioned via heketi using heketi-cli volume expand and try to use the additional capacity from the client.

Steps to Reproduce:
1. Provision a 1 GiB volume through heketi and mount it on a client. Write some data.
2. Observe that the volume by default is 1x3 replicated.
3. Extend the capacity of the volume via heketi to 5 GiB.
4. Observe that the volume is now a 2x3 distributed-replicated volume, consisting of the original 1x3x1 GiB replica set plus a new 1x3x4 GiB replica set.
5. Verify the new capacity of the mount on the client, i.e. df -h shows 5 GiB total mount capacity vs. 1 GiB before.
6.1 Write a new file > 1 GiB.
 - or -
6.2 Write to an existing file to make it grow > 1 GiB.
(A rough command sketch of these steps follows below.)
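
For reference, the reproduction roughly corresponds to the command sequence below. This is only a sketch; the volume ID, node name, mount path, and file sizes are illustrative placeholders, not values from the affected setup.

# heketi-cli volume create --size=1 --replica=3
# mount -t glusterfs <gluster-node>:/<volname> /mnt/test
(write some data, then expand)
# heketi-cli volume expand --volume=<vol_id> --expand-size=4
# df -h /mnt/test
(the mount now shows 5 GiB total)
# dd if=/dev/zero of=/mnt/test/bigfile bs=1M count=2048 oflag=direct
(fails with "No space left on device" once 1 GiB of the mount is used)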

Actual results:

In both scenarios the writes fail once 1 GiB of the total mount capacity is used, with the error: No space left on device.

Expected results:

In both scenarios the writes complete successfully.


Additional info:

It may be expected for 6.2 to fail, since the existing file is still sitting on the first 1x3x1 GiB replica set and no rebalance happened.
It is not expected for 6.1 to fail: Gluster should be able to write a link file to the 1x3x1 GiB replica set pointing to the actual file being stored on the 1x3x4 GiB replica set.

Comment 2 Michael Adam 2017-04-11 20:31:36 UTC
This is a really serious usability issue.

Even though expansion is not currently officially supported, we should still analyze and fix or work around it.

Comment 3 Michael Adam 2017-04-11 20:35:23 UTC
I think since we are not doing rebalance, we may need to enable the NUFA (non uniform file allocation) translator. But according to the docs, this needs to be enabled before putting any data into the volume, in particular before expansion.

Comment 4 Nithya Balachandran 2017-04-12 05:35:31 UTC
Can we get access to the system or the logs? Writing to existing files on the nearly full bricks will cause the error, but writes to new files on the new bricks should not error out.

Comment 5 Daniel Messer 2017-04-12 08:21:46 UTC
@Nithya: theoretically yes - the systems are up in AWS and are currently used for writing the reference architecture. In which time zone are you? We can plan for you to access the systems when nobody else is using them. In this case, please send me your public SSH key.

Comment 6 Michael Adam 2017-04-12 16:11:26 UTC
This needs to be moved out of cns 3.5.

Comment 7 Nithya Balachandran 2017-04-12 17:37:37 UTC
I had a look at Annette's setup over BlueJeans. I could see that files continued to go to the older bricks. Some dirs were created on the new bricks, but a directory created from the mount point after the volume was expanded was created only on the older bricks.


Annette will rerun the steps after enabling client debug logs and upload the client log files for this volume. If the logs do not contain anything helpful, I shall try to get hold of a setup in Blr and see if it can be reproduced.

Comment 8 Annette Clewett 2017-04-12 20:04:21 UTC
Created attachment 1271246 [details]
glusterfs client log from OCP app server

Log was collected from OCP app nodes hosting mysql pod with pvc=pvc-5aadcf06-1fb0-11e7-a977-067ee6f6ca67 

/var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/glusterfs/pvc-5aadcf06-1fb0-11e7-a977-067ee6f6ca67

Comment 9 Annette Clewett 2017-04-12 20:35:47 UTC
After expansion in gluster pod:
Volume Name: vol_b0d968afa402845bf91b0d5ccf2f480f
Type: Distributed-Replicate
Volume ID: 08bfb2ed-7ea3-42ed-b7dc-11e15781a2e4
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: 10.20.5.234:/var/lib/heketi/mounts/vg_cfd8eda997abca7a10afb56e6bd804d9/brick_66f84264b8e3fbfefbe4972c78efc4f5/brick
Brick2: 10.20.4.40:/var/lib/heketi/mounts/vg_50ff38c6fa45ca9220521a4fd49126be/brick_2fe54494608cce76a4830fc0fffd63b6/brick
Brick3: 10.20.6.177:/var/lib/heketi/mounts/vg_108bebea4a67d6ed3f569b5162a4e20f/brick_17728915fa17db8398fdc9432032c797/brick
Brick4: 10.20.4.40:/var/lib/heketi/mounts/vg_50ff38c6fa45ca9220521a4fd49126be/brick_3f0858383090fe03c0c4ee3d6e49cd87/brick
Brick5: 10.20.6.177:/var/lib/heketi/mounts/vg_108bebea4a67d6ed3f569b5162a4e20f/brick_96ea4058eaa6520149e6b8a66949683a/brick
Brick6: 10.20.5.234:/var/lib/heketi/mounts/vg_cfd8eda997abca7a10afb56e6bd804d9/brick_20671a517431f238b4267da8f4e791ff/brick
Options Reconfigured:
diagnostics.client-log-level: DEBUG
performance.readdir-ahead: on


After expansion in mysql pod:
[ec2-user@ip-172-31-23-150 mysql-files]$ oc rsh mysql-1-sn5lb 
sh-4.2$ df -h
Filesystem                                                                                          Size  Used Avail Use% Mounted on
/dev/mapper/docker-202:2-37772588-52f48b5fc468d3930b05002b3068bfc92d6ab0b61fddb0bd5b0688ce2bb2fe79  3.0G  446M  2.6G  15% /
tmpfs                                                                                               7.7G     0  7.7G   0% /dev
tmpfs                                                                                               7.7G     0  7.7G   0% /sys/fs/cgroup
/dev/xvdc                                                                                            50G   44M   50G   1% /etc/hosts
/dev/xvda2                                                                                           15G  3.1G   12G  21% /run/secrets
shm                                                                                                  64M     0   64M   0% /dev/shm
10.20.4.40:vol_b0d968afa402845bf91b0d5ccf2f480f                                                     2.0G  976M  1.1G  48% /var/lib/mysql/data
tmpfs                                                                                               7.7G   16K  7.7G   1% /run/secrets/kubernetes.io/serviceaccount

sh-4.2$ cd /var/lib/mysql/data

sh-4.2$ mkdir test1
sh-4.2$ cd test1
sh-4.2$ dd if=/dev/zero of=bigfile3 bs=1M count=1000 oflag=direct
dd: error writing 'bigfile3': No space left on device
dd: closing output file 'bigfile3': No space left on device

In glusterfs log
[2017-04-12 19:24:55.527530] W [fuse-bridge.c:1290:fuse_err_cbk] 0-glusterfs-fuse: 146866: FLUSH() ERR => -1 (No space left on device)

Comment 10 Nithya Balachandran 2017-04-13 11:20:22 UTC
mkdir for test1 failed on the newly added bricks with EPERM. Was there anything different about the permissions on the root dirs of the old bricks?


[2017-04-12 19:24:15.414778] E [MSGID: 114031] [client-rpc-fops.c:321:client3_3_mkdir_cbk] 5-vol_b0d968afa402845bf91b0d5ccf2f480f-client-3: remote operation failed. Path: /test1 [Permission denied]
1229 [2017-04-12 19:24:15.414797] D [MSGID: 0] [client-rpc-fops.c:326:client3_3_mkdir_cbk] 5-stack-trace: stack-address: 0x7f083d2d2f50, vol_b0d968afa402845bf91b0d5ccf2f480f-client-3 returned -1 error: Permission denied [Permission denied]
[2017-04-12 19:24:15.414866] E [MSGID: 114031] [client-rpc-fops.c:321:client3_3_mkdir_cbk] 5-vol_b0d968afa402845bf91b0d5ccf2f480f-client-5: remote operation failed. Path: /test1 [Permission denied]
[2017-04-12 19:24:15.414879] D [MSGID: 0] [client-rpc-fops.c:326:client3_3_mkdir_cbk] 5-stack-trace: stack-address: 0x7f083d2d2f50, vol_b0d968afa402845bf91b0d5ccf2f480f-client-5 returned -1 error: Permission denied [Permission denied]
[2017-04-12 19:24:15.415246] E [MSGID: 114031] [client-rpc-fops.c:321:client3_3_mkdir_cbk] 5-vol_b0d968afa402845bf91b0d5ccf2f480f-client-4: remote operation failed. Path: /test1 [Permission denied]

Comment 11 Nithya Balachandran 2017-04-13 11:22:16 UTC
As the dir could not be created on the new bricks, the entire hash range was set on the dir on the old subvol, which causes any new files in that dir to be written to the old bricks.


[2017-04-12 19:24:15.419242] I [MSGID: 109036] [dht-common.c:8889:dht_log_new_layout_for_dir_selfheal] 5-vol_b0d968afa402845bf91b0d5ccf2f480f-dht: Setting layout of /test1 with [Subvol_name: vol_b0d968afa402845bf91b0d5ccf2f480f-replicate-0, Err: -1 , Start: 0 , Stop: 4294967295 , Hash: 1 ], [Subvol_name     : vol_b0d968afa402845bf91b0d5ccf2f480f-replicate-1, Err: 13 , Start: 0 , Stop: 0 , Hash: 0 ],
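
As a side note, when reproducing this, the directory layout that DHT wrote can be checked directly on the bricks with getfattr; this is just a sketch, and the pod and brick path below are placeholders. On a healthy 2x3 volume the two subvolumes would each hold part of the hash range, whereas here the old subvolume carries the full 0x00000000-0xffffffff range and the new one carries none.

# oc rsh <gluster-pod> getfattr -n trusted.glusterfs.dht -e hex /var/lib/heketi/mounts/vg_<vgid>/brick_<brickId>/brick/test1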

Comment 12 Daniel Messer 2017-04-13 11:24:31 UTC
It starts to make sense now what happened. I am not sure about the why, though. The creation of the bricks is handled by heketi; we don't have any means of influencing that. The permissions on the dirs on the old bricks are not something we touch either.
On the client side nothing changed in between expanding the volume.

Comment 13 Nithya Balachandran 2017-04-17 08:36:07 UTC
(In reply to Daniel Messer from comment #12)
> It starts to make sense now what happened. I am not sure about the why,
> though. The creation of the bricks is handled by heketi; we don't have any
> means of influencing that. The permissions on the dirs on the old bricks are
> not something we touch either.
> On the client side nothing changed in between expanding the volume.

Can you check the permissions on the root of the bricks for both old and new?

Comment 14 Annette Clewett 2017-04-17 14:57:47 UTC
Please send an update.

Comment 15 Michael Adam 2017-04-17 22:01:46 UTC
Oh, it's a permission issue?
Maybe it is the fact that expansion can not yet cope with GID security?

https://github.com/heketi/heketi/issues/558

Comment 16 Daniel Messer 2017-04-18 09:51:32 UTC
Good point. From the heketi GitHub issue I conclude that the issue has been fixed in heketi only under certain conditions. So it looks like we need more intelligence in heketi to determine the GID/UID of the existing bricks and apply them to the newly created bricks during expansion?

Comment 17 Humble Chirammal 2017-04-27 11:22:54 UTC
*** Bug 1446062 has been marked as a duplicate of this bug. ***

Comment 19 Raghavendra Talur 2017-06-13 12:38:32 UTC
https://github.com/heketi/heketi/pull/766 fixes the problem for new volumes that are created.

Comment 20 Riyas Abdulrasak 2017-08-03 14:34:30 UTC
*** Bug 1477982 has been marked as a duplicate of this bug. ***

Comment 21 Riyas Abdulrasak 2017-08-03 14:36:32 UTC
*** Bug 1477919 has been marked as a duplicate of this bug. ***

Comment 22 Dana Safford 2017-08-08 20:59:25 UTC
This week, in talking with Accenture for case 01883671, Michael Adam & Sudhir Prasad said they would decide whether a backport to CNS 3.5 is feasible and, if yes, make it happen.

Has that analysis been completed yet?

I raised the "Customer Escalation" flag.

Thanks,

Comment 23 Tejas Chaphekar 2017-08-21 04:57:17 UTC
Performed a successful volume expansion and wrote data onto the new bricks without doing a manual rebalance (a rough sketch of the verification commands follows below).

Performed the following scenarios:

1. Performed a volume expansion, created a new file and directory, and performed I/O inside existing directories.

2. Created new directories under the mount point and performed I/O.

3. Performed a volume expansion when the volume was full and performed I/O.

4. Performed a volume expansion before the volume was full and performed I/O.

The following builds were used for the verification:

heketi-client-5.0.0-7.el7rhgs.x86_64
cns-deploy-5.0.0-14.el7rhgs.x86_64
Gluster - rhgs-server-rhel7:3.3.0-11
Heketi -  rhgs-volmanager-rhel7:3.3.0-9
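
The scenarios above roughly correspond to a command sequence like the following; this is a sketch only, and the volume ID, pod names, and brick path are placeholders rather than the ones from the test setup.

# heketi-cli volume expand --volume=<vol_id> --expand-size=2
# oc rsh <app-pod>
sh-4.2$ mkdir /var/lib/mysql/data/test-expand
sh-4.2$ dd if=/dev/zero of=/var/lib/mysql/data/test-expand/bigfile bs=1M count=2000 oflag=direct
# oc rsh <gluster-pod> ls -l /var/lib/heketi/mounts/vg_<vgid>/brick_<newBrickId>/brick
(the new bricks now carry the directory with the correct GID and the writes succeed)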

Comment 26 Raghavendra Talur 2017-10-04 09:17:52 UTC
doc text looks good to me

Comment 28 errata-xmlrpc 2017-10-11 07:07:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:2879

Comment 29 Raghavendra Talur 2017-10-11 09:02:13 UTC
Is there a workaround for CNS 3.5?

Yes, here is the procedure to expand a volume:

Before expanding the volume, we need to fetch the GID of the existing bricks.
If you have already expanded, carefully pick the GID from the old bricks (they will have the right GID; all new bricks will have 0:0).

Step 1: Fetch the GID of an existing brick.

# gluster vol list

  vol_123123
  vol_456456
   …

# gluster vol info vol_123123

Get the brick list of the volume. Then, for each brick:

# oc rsh -it <gluster-pod> ls -l /var/lib/heketi/mounts/vg_<vgid>/brick_<brickId>

where <gluster-pod> is the name of the gluster pod hosting that particular brick.

Note the owner and group in the output:
drwxr-xr-x 2 root GID 6 Jun 29 13:18 brick

The GID must be the same across all bricks.



Step 2: Expand the volume

# heketi-cli volume expand --volume=<vol_id> --expand-size=<size to expand in GB>


Step 3: Apply the existing GID to the newly added bricks (the commands below must be run inside the gluster pods):

# chown -R :GID </new/brick/path>
# chmod -R 2775 </new/brick/path>


Step 4:  Perform a rebalance on the volume

# gluster volume rebalance <VOLNAME> start

Refer to https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.3/html/administration_guide/sect-rebalancing_volumes for more details on rebalance operation.
While volume expansion is in progress  and once it is finished, recheck the APP pod mount point and make sure the mount point works without issues.
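
Putting the steps above together, the workaround looks roughly like the sketch below when run end to end; the volume name/ID, pod names, brick paths, and the example GID 2001 are placeholders, not values from a real deployment.

# gluster vol info vol_123123
# oc rsh <gluster-pod> ls -ln /var/lib/heketi/mounts/vg_<vgid>/brick_<brickId>
(note the numeric GID on the old brick, e.g. 2001)
# heketi-cli volume expand --volume=<vol_id> --expand-size=30
(then, inside each gluster pod that received a new brick)
sh-4.2# chown -R :2001 /var/lib/heketi/mounts/vg_<vgid>/brick_<newBrickId>/brick
sh-4.2# chmod -R 2775 /var/lib/heketi/mounts/vg_<vgid>/brick_<newBrickId>/brick
(finally, from any gluster pod, rebalance and wait for completion)
# gluster volume rebalance vol_123123 start
# gluster volume rebalance vol_123123 status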

Comment 30 Raghavendra Talur 2017-10-11 09:04:21 UTC
What happens after an upgrade to CNS 3.6?

a. New volumes created after upgrading don't need any manual steps for volume expansion.

b. Volumes created in CNS 3.5 or before need the manual steps even if the expansion is performed using CNS 3.6. The workaround procedure is the same as given in comment #29.

Comment 31 Humble Chirammal 2017-10-11 11:28:58 UTC
(In reply to Raghavendra Talur from comment #29)
> Is there a workaround for CNS 3.5?
> 
> Yes, here is the procedure to expand a volume:
...............
> 
> Refer to
> https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.3/
> html/administration_guide/sect-rebalancing_volumes for more details on
> rebalance operation.
> While volume expansion is in progress  and once it is finished, recheck the
> APP pod mount point and make sure the mount point works without issues.

The above needs to be changed. We should not document the process of expanding a volume which is already BOUND, the reason being that there is no support for it from OpenShift yet.

Comment 32 Humble Chirammal 2017-10-11 13:31:02 UTC
(In reply to Humble Chirammal from comment #31)
> (In reply to Raghavendra Talur from comment #29)
> > Is there a workaround for CNS 3.5?
> > 
> > Yes, here is the procedure to expand a volume:
> ...............
> > 
> > Refer to
> > https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.3/
> > html/administration_guide/sect-rebalancing_volumes for more details on
> > rebalance operation.
> > While volume expansion is in progress  and once it is finished, recheck the
> > APP pod mount point and make sure the mount point works without issues.
> 
> Above need to be changed. We should not document the process of expanding a
> volume which is already BOUND. The reason being there is no support for it
> from Openshift yet.

Apart from the above, rebalance is performed automatically by the heketi shipped in the CNS 3.6 release, so we mostly need to avoid the rebalance command in the above draft, but this needs thorough testing.

