Description of problem:
In a situation where a client mounting a heketi-managed volume is nearing 100% capacity utilization, an administrator might expand the volume to gain more space, e.g. by issuing:

heketi-cli volume expand --volume=8081e2f112ef14d3e267a37840369469 --expand-size=30

While this completes successfully, and the increased volume capacity is reflected to the client online (i.e. visible in df -h), the user is not able to write past the original capacity limit. Any write to a new or existing file fails with: No space left on device.

Version-Release number of selected component (if applicable):
heketi v4.0.0-5 and/or the rhgs3/rhgs-volmanager-rhel7:3.2.0-5 image.

How reproducible:
Increase the size of a volume provisioned via heketi by issuing the expand through heketi, then try to use the additional capacity from the client.

Steps to Reproduce:
1. Provision a 1 GiB volume through heketi and mount it on a client. Write some data.
2. Observe that the volume is by default 1x3 replicated.
3. Expand the capacity of the volume via heketi to 5 GiB.
4. Observe that the volume is now a 2x3 distributed-replicated volume, consisting of the original 1x3x1GiB replica set plus a new 1x3x4GiB replica set.
5. Verify the new capacity of the mount on the client, i.e. df -h now shows 5 GiB total capacity vs. 1 GiB before.
6.1 Write a new file > 1 GiB.
- or -
6.2 Write to an existing file to make it grow past 1 GiB.

Actual results:
In both scenarios the writes fail at 1 GiB total mount capacity utilization with the error: No space left on device.

Expected results:
In both scenarios the writes complete successfully.

Additional info:
It may be expected for 6.2 to fail, since the existing file is still sitting on the first 1x3x1GiB replica set and no rebalance has happened. It is not expected for 6.1 to fail: Gluster should be able to write a link file to the 1x3x1GiB replica set pointing to the actual file being stored on the 1x3x4GiB replica set.
This is a really serious usability issue. Even though expansion is not currently officially supported, we should still analyze and fix or work around it.
I think that, since we are not doing a rebalance, we may need to enable the NUFA (non-uniform file allocation) translator. According to the docs, however, this needs to be enabled before putting any data into the volume, and in particular before expansion.
Can we get access to the system or the logs? Writing to existing files on the nearly full bricks will cause the error, but writing new files to the new bricks should not error out.
@Nithya: theoretically yes - the systems are up in AWS and are currently used for writing the reference architecture. In which time zone are you? We can plan for you to access the systems when nobody else is using them. In this case, please send me your public SSH key.
This needs to be moved out of cns 3.5.
I had a look at Annette's setup over bluejeans. I could see that files continued to go to the older bricks. Some dirs were created on the new bricks but a directory created from the mount point after the volume was expanded was created only on the older bricks. Annette will rerun the steps after enabling client debug logs and upload the client log files for this volume. If the logs do not contain anything helpful, I shall try to get hold of a setup in Blr and see if it can be reproduced.
Created attachment 1271246 [details] glusterfs client log from OCP app server Log was collected from OCP app nodes hosting mysql pod with pvc=pvc-5aadcf06-1fb0-11e7-a977-067ee6f6ca67 /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/glusterfs/pvc-5aadcf06-1fb0-11e7-a977-067ee6f6ca67
After expansion, in the gluster pod:

Volume Name: vol_b0d968afa402845bf91b0d5ccf2f480f
Type: Distributed-Replicate
Volume ID: 08bfb2ed-7ea3-42ed-b7dc-11e15781a2e4
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: 10.20.5.234:/var/lib/heketi/mounts/vg_cfd8eda997abca7a10afb56e6bd804d9/brick_66f84264b8e3fbfefbe4972c78efc4f5/brick
Brick2: 10.20.4.40:/var/lib/heketi/mounts/vg_50ff38c6fa45ca9220521a4fd49126be/brick_2fe54494608cce76a4830fc0fffd63b6/brick
Brick3: 10.20.6.177:/var/lib/heketi/mounts/vg_108bebea4a67d6ed3f569b5162a4e20f/brick_17728915fa17db8398fdc9432032c797/brick
Brick4: 10.20.4.40:/var/lib/heketi/mounts/vg_50ff38c6fa45ca9220521a4fd49126be/brick_3f0858383090fe03c0c4ee3d6e49cd87/brick
Brick5: 10.20.6.177:/var/lib/heketi/mounts/vg_108bebea4a67d6ed3f569b5162a4e20f/brick_96ea4058eaa6520149e6b8a66949683a/brick
Brick6: 10.20.5.234:/var/lib/heketi/mounts/vg_cfd8eda997abca7a10afb56e6bd804d9/brick_20671a517431f238b4267da8f4e791ff/brick
Options Reconfigured:
diagnostics.client-log-level: DEBUG
performance.readdir-ahead: on

After expansion, in the mysql pod:

[ec2-user@ip-172-31-23-150 mysql-files]$ oc rsh mysql-1-sn5lb
sh-4.2$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/docker-202:2-37772588-52f48b5fc468d3930b05002b3068bfc92d6ab0b61fddb0bd5b0688ce2bb2fe79 3.0G 446M 2.6G 15% /
tmpfs 7.7G 0 7.7G 0% /dev
tmpfs 7.7G 0 7.7G 0% /sys/fs/cgroup
/dev/xvdc 50G 44M 50G 1% /etc/hosts
/dev/xvda2 15G 3.1G 12G 21% /run/secrets
shm 64M 0 64M 0% /dev/shm
10.20.4.40:vol_b0d968afa402845bf91b0d5ccf2f480f 2.0G 976M 1.1G 48% /var/lib/mysql/data
tmpfs 7.7G 16K 7.7G 1% /run/secrets/kubernetes.io/serviceaccount
sh-4.2$ cd /var/lib/mysql/data
sh-4.2$ mkdir test1
sh-4.2$ cd test1
sh-4.2$ dd if=/dev/zero of=bigfile3 bs=1M count=1000 oflag=direct
dd: error writing 'bigfile3': No space left on device
dd: closing output file 'bigfile3': No space left on device

In the glusterfs client log:

[2017-04-12 19:24:55.527530] W [fuse-bridge.c:1290:fuse_err_cbk] 0-glusterfs-fuse: 146866: FLUSH() ERR => -1 (No space left on device)
mkdir for test1 failed on the newly added bricks with EPERM. Was there anything different about the permissions on the root dirs of the old bricks?

[2017-04-12 19:24:15.414778] E [MSGID: 114031] [client-rpc-fops.c:321:client3_3_mkdir_cbk] 5-vol_b0d968afa402845bf91b0d5ccf2f480f-client-3: remote operation failed. Path: /test1 [Permission denied]
[2017-04-12 19:24:15.414797] D [MSGID: 0] [client-rpc-fops.c:326:client3_3_mkdir_cbk] 5-stack-trace: stack-address: 0x7f083d2d2f50, vol_b0d968afa402845bf91b0d5ccf2f480f-client-3 returned -1 error: Permission denied [Permission denied]
[2017-04-12 19:24:15.414866] E [MSGID: 114031] [client-rpc-fops.c:321:client3_3_mkdir_cbk] 5-vol_b0d968afa402845bf91b0d5ccf2f480f-client-5: remote operation failed. Path: /test1 [Permission denied]
[2017-04-12 19:24:15.414879] D [MSGID: 0] [client-rpc-fops.c:326:client3_3_mkdir_cbk] 5-stack-trace: stack-address: 0x7f083d2d2f50, vol_b0d968afa402845bf91b0d5ccf2f480f-client-5 returned -1 error: Permission denied [Permission denied]
[2017-04-12 19:24:15.415246] E [MSGID: 114031] [client-rpc-fops.c:321:client3_3_mkdir_cbk] 5-vol_b0d968afa402845bf91b0d5ccf2f480f-client-4: remote operation failed. Path: /test1 [Permission denied]
As the dir could not be created on the new bricks, the entire hash range was set on the dir on the old subvol, which causes any new files in that dir to be written to the old bricks.

[2017-04-12 19:24:15.419242] I [MSGID: 109036] [dht-common.c:8889:dht_log_new_layout_for_dir_selfheal] 5-vol_b0d968afa402845bf91b0d5ccf2f480f-dht: Setting layout of /test1 with [Subvol_name: vol_b0d968afa402845bf91b0d5ccf2f480f-replicate-0, Err: -1 , Start: 0 , Stop: 4294967295 , Hash: 1 ], [Subvol_name: vol_b0d968afa402845bf91b0d5ccf2f480f-replicate-1, Err: 13 , Start: 0 , Stop: 0 , Hash: 0 ],
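That log line can be read as follows: DHT assigns each directory a hash range per subvolume and places a new file on the subvolume whose range covers the hash of the file's name. Here replicate-1 got the empty range [0, 0] (Err: 13, i.e. EACCES), so every name falls into replicate-0's full 32-bit range. The toy sketch below illustrates only the range logic; it uses cksum as a stand-in hash (gluster's real hash, Davies-Meyer, differs), and the function name is made up for illustration:

```shell
# Toy illustration of DHT-style file placement. cksum stands in for
# gluster's actual filename hash; only the range arithmetic matters here.

# pick_subvol NAME START0 STOP0 START1 STOP1
# Prints the index of the subvolume whose hash range covers NAME's hash.
pick_subvol() {
    name=$1; s0=$2; e0=$3; s1=$4; e1=$5
    h=$(printf '%s' "$name" | cksum | awk '{print $1}')   # 32-bit value
    if [ "$h" -ge "$s0" ] && [ "$h" -le "$e0" ]; then
        echo 0
    elif [ "$h" -ge "$s1" ] && [ "$h" -le "$e1" ]; then
        echo 1
    else
        echo none
    fi
}

# Healthy layout after expansion: the range is split between subvols.
pick_subvol somefile 0 2147483647 2147483648 4294967295

# Broken layout from the log: replicate-0 owns 0..4294967295 and
# replicate-1 owns nothing, so every new file maps to subvol 0.
pick_subvol somefile 0 4294967295 0 0    # always prints 0
```

This is why every new file in /test1 landed on the old (full) bricks regardless of its name.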
It starts to make sense now, what happened. I am not sure about the why, though. The creation of the bricks is handled by heketi - we don't have any means of influencing that. The permissions on the dirs on the old bricks are not something we touch either. On the client side nothing changed in between expanding the volume.
(In reply to Daniel Messer from comment #12)
> It starts to make sense no, what happened. I am not sure about the why,
> though. The creating of the bricks is handled by heketi - we don't have any
> means of influencing that. The permissions on the dirs on the old brick are
> not something we are touch either.
> On the client side nothing changed inbetween expanding the volume.

Can you check the permissions on the root of the bricks for both old and new?
Please send an update.
Oh, it's a permission issue? Maybe it is the fact that expansion cannot yet cope with GID security? https://github.com/heketi/heketi/issues/558
Good point. From the heketi GitHub issue I conclude that the issue has been fixed in heketi only under certain conditions. So it looks like we need more intelligence in heketi for determining the GID/UID of existing bricks and applying them to the bricks newly created during expansion?
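One quick way to confirm that hypothesis is to compare the owner and group on the brick root directories of the old vs. new bricks. A minimal sketch, assuming GNU stat is available inside the gluster pods; the brick paths in the comments are illustrative placeholders and brick_owner is a made-up helper name:

```shell
# Print UID:GID of a brick root so old and new bricks can be compared.
brick_owner() {
    stat -c '%u:%g' "$1"
}

# Run inside the gluster pod hosting each brick, e.g.:
# brick_owner /var/lib/heketi/mounts/vg_<vgid>/brick_<oldBrickId>/brick
# brick_owner /var/lib/heketi/mounts/vg_<vgid>/brick_<newBrickId>/brick
#
# If the heketi issue applies, the old bricks carry the volume's
# supplemental GID while bricks created on expansion show up as 0:0.
```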
*** Bug 1446062 has been marked as a duplicate of this bug. ***
https://github.com/heketi/heketi/pull/766 fixes the problem for new volumes that are created.
*** Bug 1477982 has been marked as a duplicate of this bug. ***
*** Bug 1477919 has been marked as a duplicate of this bug. ***
This week, in talking with Accenture about case 01883671, Michael Adam & Sudhir Prasad said they would decide whether a backport to CNS 3.5 is feasible and, if yes, make it happen. Has that analysis been completed yet? I have raised the "Customer Escalation" flag. Thanks,
Performed a successful volume expansion and wrote data onto the new bricks without doing a manual rebalance.

Performed the following scenarios:
1. Performed a volume expansion, created a new file and directory, and performed I/O inside existing directories.
2. Created new directories under the mount point and performed I/O.
3. Performed a volume expansion when the volume was full and performed I/O.
4. Performed a volume expansion before the volume was full and performed I/O.

The following builds were used for the verification:
heketi-client-5.0.0-7.el7rhgs.x86_64
cns-deploy-5.0.0-14.el7rhgs.x86_64
Gluster - rhgs-server-rhel7:3.3.0-11
Heketi - rhgs-volmanager-rhel7:3.3.0-9
doc text looks good to me
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:2879
Is there a workaround for CNS 3.5?

Yes, here is the procedure to expand a volume.

Before expanding the volume, we need to fetch the GID of an existing brick. If you have already expanded, carefully pick the GID from the old bricks (they will have the right GID; all new bricks will have 0:0).

Step 1: Fetch the GID of an existing brick.

# gluster vol list
vol_123123
vol_456456
...

# gluster vol info vol_123123

Get the brick list of the volume. Then, for each brick:

# oc rsh -it gluster-pod1 ls -l /var/lib/heketi/mounts/vg_<vgid>/brick_<brickId>

gluster-pod1 in the above command is the name of the gluster pod hosting that particular brick. Note the owner and group from the output:

drwxr-xr-x 2 root GID 6 Jun 29 13:18 brick

The GID must (and normally will) be the same for all bricks.

Step 2: Expand the volume.

# heketi-cli volume expand --volume=<vol_id> --expand-size=<size to expand in GiB>

Step 3: Apply the existing GID to the newly added bricks (the commands below must be performed inside the gluster pods):

# chown -R :GID <new/brick/path>
# chmod -R 2775 <new/brick/path>

Step 4: Perform a rebalance on the volume.

# gluster volume rebalance <VOLNAME> start

Refer to https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.3/html/administration_guide/sect-rebalancing_volumes for more details on the rebalance operation.

While the volume expansion is in progress, and once it has finished, recheck the app pod mount point and make sure the mount point works without issues.
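For convenience, the workaround steps above can be gathered into one sketch. The volume name, GID value, and brick paths are placeholders, fix_brick is a made-up helper name, and the heketi/gluster commands are left as comments since they must be run against a live cluster; only the chown/chmod step is shown as directly executable:

```shell
# Consolidated sketch of the CNS 3.5 expansion workaround.
# All names, IDs, and paths below are placeholders.

VOL=vol_123123
GID=2001            # GID read off an existing (old) brick in Step 1

# Step 1: read the GID from an old brick (inside its gluster pod):
#   ls -ld /var/lib/heketi/mounts/vg_<vgid>/brick_<brickId>/brick

# Step 2: expand the volume:
#   heketi-cli volume expand --volume=<vol_id> --expand-size=<GiB>

# Step 3: apply the old GID and setgid bit to each NEW brick
# (run inside the gluster pod that hosts the brick):
fix_brick() {
    chown -R ":$GID" "$1"
    chmod -R 2775 "$1"
}
#   fix_brick /var/lib/heketi/mounts/vg_<vgid>/brick_<newBrickId>/brick

# Step 4: rebalance so data spreads onto the new bricks:
#   gluster volume rebalance "$VOL" start
```

The setgid bit (the leading 2 in 2775) matters here: it makes files created under the brick root inherit the group, which is how the volume's GID security is preserved for new data.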
What happens after an upgrade to CNS 3.6?
a. New volumes created after upgrading don't need any manual steps for volume expansion.
b. Volumes that were created in CNS 3.5 or before need the manual steps even if the expansion is performed using CNS 3.6. The workaround procedure is the same as given in comment #29.
(In reply to Raghavendra Talur from comment #29) > Is there a workaround for CNS 3.5? > > Yes, here is the procedure to expand a volume: ............... > > Refer to > https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.3/ > html/administration_guide/sect-rebalancing_volumes for more details on > rebalance operation. > While volume expansion is in progress and once it is finished, recheck the > APP pod mount point and make sure the mount point works without issues. Above need to be changed. We should not document the process of expanding a volume which is already BOUND. The reason being there is no support for it from Openshift yet.
(In reply to Humble Chirammal from comment #31)
> (In reply to Raghavendra Talur from comment #29)
> > Is there a workaround for CNS 3.5?
> > 
> > Yes, here is the procedure to expand a volume:
> ...............
> > 
> > Refer to
> > https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.3/
> > html/administration_guide/sect-rebalancing_volumes for more details on
> > rebalance operation.
> > While volume expansion is in progress and once it is finished, recheck the
> > APP pod mount point and make sure the mount point works without issues.
> 
> Above need to be changed. We should not document the process of expanding a
> volume which is already BOUND. The reason being there is no support for it
> from Openshift yet.

Apart from the above, rebalance is performed automatically by the heketi shipped in the CNS 3.6 release, so we mostly need to drop the rebalance command from the draft above, but this needs thorough testing.