Description of problem:
Created OpenStack Cinder bootable volumes backed by RHS, then ran a remove-brick operation on the bricks containing the volume file. After rebooting one of the nodes in the replica pair, the file migration is left incomplete, leaving the file with mode -rw-rwSrwT.

Impact: A Nova instance booted from this volume errors out when rebooted, because the instance tries to fetch the volume file from the backend while the file is in an inconsistent state.

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.59rhs-1.el6_4.x86_64

How reproducible:
Reproduced the issue twice.

Steps to Reproduce:
1. Create a 6x2 distribute-replicate volume called cinder:
   # gluster volume create cinder replica 2 <brick1> ... <brick12>
2. Tag the volume with group virt:
   # gluster volume set cinder group virt
3. Set owner uid and gid on the volume:
   # gluster volume set cinder storage.owner-uid 165
   # gluster volume set cinder storage.owner-gid 165
4. Configure RHOS to use this cinder volume to create bootable volumes for instances.
5. Create a bootable volume of 10GB (6299fd6a-ff07-47c7-9e3d-c03def66a327).
6. Boot an instance from this volume that contains, for example, a Fedora image.
7. While the instance is up and running, locate the bricks that hold the cinder volume file 6299fd6a-ff07-47c7-9e3d-c03def66a327:
   # file: var/lib/cinder/volumes/b5e61da7fdba3f3bd0bafd1215216639/volume-6299fd6a-ff07-47c7-9e3d-c03def66a327
   trusted.glusterfs.pathinfo="(<DISTRIBUTE:cinder-dht> (<REPLICATE:cinder-replicate-5> <POSIX(/rhs/brick1/c7):rhs1-vm3:/rhs/brick1/c7/volume-6299fd6a-ff07-47c7-9e3d-c03def66a327> <POSIX(/rhs/brick1/c8):rhs1-vm4:/rhs/brick1/c8/volume-6299fd6a-ff07-47c7-9e3d-c03def66a327>))"
8. Now perform a remove-brick start operation on these bricks:
   # gluster v remove-brick cinder 10.70.37.96:/rhs/brick1/c7 10.70.37.77:/rhs/brick1/c8 start
9. Check the status and look for the file 6299fd6a-ff07-47c7-9e3d-c03def66a327 to see whether it is being migrated. Before migration, the file on nodes rhs1-vm3 and rhs1-vm4 respectively:
   /rhs/brick1/c7:
   total 756076
   -rw-rw-rw- 2 qemu qemu 10737418240 Feb 12 16:38 volume-6299fd6a-ff07-47c7-9e3d-c03def66a327
   /rhs/brick1/c8:
   total 756004
   -rw-rw-rw- 2 qemu qemu 10737418240 Feb 12 16:35 volume-6299fd6a-ff07-47c7-9e3d-c03def66a327
10. Once migration has started on the file, reboot one of the nodes in the replica pair. Rebooted rhs1-vm3, which contains /rhs/brick1/c7.
11. Once the node comes back up, check self-heal and rebalance status.
12. Self-heal completes:
    # gluster v heal cinder info healed
    Brick 10.70.37.77:/rhs/brick1/c8
    Number of entries: 2
    at                    path on brick
    -----------------------------------
    2014-02-12 06:59:02   /
    2014-02-12 11:13:00   /volume-6299fd6a-ff07-47c7-9e3d-c03def66a327
13. Check rebalance status:
    # gluster v remove-brick cinder 10.70.37.96:/rhs/brick1/c7 10.70.37.77:/rhs/brick1/c8 status
    Node          Rebalanced-files   size     scanned   failures   skipped   status      run time in secs
    ---------     ----------------   ------   -------   --------   -------   ---------   ----------------
    localhost     0                  0Bytes   6         0          0         completed   0.00
    10.70.37.96   0                  0Bytes   6         0          0         completed   1.00
    Status shows "completed".
14. But when I check the bricks, I see that the file has not been migrated and that it has the sticky bit set on it.

Output from the rebooted node rhs1-vm3:
/rhs/brick1/c11:
total 159132
-rw-rw-rw- 2 qemu qemu  2147483648 Feb 12 12:39 volume-2cda821d-f7a8-4249-a778-55beaefe0447
---------T 2 qemu qemu 10737418240 Feb 12 16:46 volume-6299fd6a-ff07-47c7-9e3d-c03def66a327
/rhs/brick1/c7:
total 851472
-rw-rwSrwT 2 qemu qemu 10737418240 Feb 12 16:46 volume-6299fd6a-ff07-47c7-9e3d-c03def66a327

On the replica pair node rhs1-vm4:
/rhs/brick1/c12:
total 159064
-rw-rw-rw- 2 qemu qemu  2147483648 Feb 12 12:39 volume-2cda821d-f7a8-4249-a778-55beaefe0447
---------T 2 qemu qemu 10737418240 Feb 12 16:55 volume-6299fd6a-ff07-47c7-9e3d-c03def66a327
/rhs/brick1/c8:
total 851360
-rw-rwSrwT 2 qemu qemu 10737418240 Feb 12 16:55 volume-6299fd6a-ff07-47c7-9e3d-c03def66a327

Actual results:
Due to this behavior, an instance booted from the file under test cannot be recovered once it is rebooted, because the file being fetched is inconsistent. If the instance is not rebooted, there is no visible effect on it.

Expected results:
An RHS node reboot should not affect the file being migrated, since the other node of the replica pair is still up.

Additional info:
Rebalance logs from the two nodes are attached.
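For reference, leftover migration markers like the ones above can be scanned for on a brick. A minimal sketch (the helper name is made up): DHT sets the sticky bit (mode 01000) both on the in-migration source file (-rw-rwSrwT) and on the link file (---------T), so a permission-based find picks up both.

```shell
#!/bin/sh
# find_migration_leftovers: list regular files under a brick directory
# that carry the sticky bit (mode 01000), which DHT sets on both the
# in-migration source (-rw-rwSrwT) and the link file (---------T).
# Helper name is made up for this sketch.
find_migration_leftovers() {
    # -perm -1000 matches any file with the sticky bit set
    find "$1" -type f -perm -1000
}
```

Usage would be, for example, `find_migration_leftovers /rhs/brick1/c7` on the affected node.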
Created attachment 862265 [details] Rebalance logs from the replica pair
Created attachment 862266 [details] Rebalance logs from the second replica node
Shilpa,

Per the additional information in comment 0, "Rebalance logs from the two nodes are attached." Do you have only 2 nodes in the Trusted Storage Pool? The reason I ask is that, from RHSS 2.1 Update 2, we have server- and client-side quorum enabled in the virt profile, which may have an impact on the volume. Also, nowhere in the bug do I find the number of hosts in the cluster (i.e. the Trusted Storage Pool). It would be helpful if you provide the number of RHSS nodes in the cluster, the volume info, and the volume status. If there were only 2 nodes in the cluster, rebooting one RHSS node would break server quorum and the volume would have gone offline.
Another observation: per step 10 in comment 0, you rebooted the node containing the FIRST brick of the replica group, and at that point that particular replica group would have become read-only.
I do not have client quorum enabled in this test case, and there are four nodes in total.

Volume info:
# gluster v i
Volume Name: cinder
Type: Distributed-Replicate
Volume ID: d3ff221b-0e7e-49b7-9019-c81dd87618d1
Status: Started
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.37.96:/rhs/brick1/c3
Brick2: 10.70.37.77:/rhs/brick1/c4
Brick3: 10.70.37.121:/rhs/brick1/c5
Brick4: 10.70.37.140:/rhs/brick1/c6
Brick5: 10.70.37.121:/rhs/brick1/c9
Brick6: 10.70.37.140:/rhs/brick1/c10
Brick7: 10.70.37.121:/rhs/brick1/c1
Brick8: 10.70.37.140:/rhs/brick1/c2
Brick9: 10.70.37.96:/rhs/brick1/c11
Brick10: 10.70.37.77:/rhs/brick1/c12
Brick11: 10.70.37.96:/rhs/brick1/c7
Brick12: 10.70.37.77:/rhs/brick1/c8
Options Reconfigured:
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
storage.owner-uid: 165
storage.owner-gid: 165
server.allow-insecure: on
@Shilpa, we need a few more observations on this bug, as follows:

- In step 14 the file is not migrated but still has the sticky bits on it, and the new source still has the link file. Can we check the file with md5sum before and after the operations performed, to ensure there is no data loss? This check depends on the file being static from Nova's perspective; if Nova writes to the file and changes its contents, the md5sums may not match.

- In the case where Nova is not able to see the file or unable to boot, the FUSE mount was still showing the file, albeit with the bits set, correct? Can we check whether Nova is able to boot the image again after we manually remove the bits from the file? That would give us a troubleshooting option if a customer hits this issue.

Code standpoint:
----------------
From the logs, cinder-rebalance.log:
--------------------
[2014-02-12 11:08:25.860790] I [dht-common.c:2646:dht_setxattr] 0-cinder-dht: fixing the layout of /
[2014-02-12 11:08:25.867823] I [dht-rebalance.c:1121:gf_defrag_migrate_data] 0-cinder-dht: migrate data called on /
[2014-02-12 11:08:25.893411] I [dht-rebalance.c:672:dht_migrate_file] 0-cinder-dht: /volume-6299fd6a-ff07-47c7-9e3d-c03def66a327: attempting to move from cinder-replicate-5 to cinder-replicate-4
[2014-02-12 11:08:44.292742] I [dht-rebalance.c:1783:gf_defrag_status_get] 0-glusterfs: Rebalance is in progress. Time taken is 19.00 secs
[2014-02-12 11:08:44.298030] I [dht-rebalance.c:1786:gf_defrag_status_get] 0-glusterfs: Files migrated: 0, size: 0, lookups: 6, failures: 0, skipped: 0
[2014-02-12 11:08:44.801218] W [socket.c:522:__socket_rwv] 0-glusterfs: readv on 127.0.0.1:24007 failed (No data available)
[2014-02-12 11:08:48.760104] W [glusterfsd.c:1099:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x37438e894d] (-->/lib64/libpthread.so.0() [0x3743c07851] (-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xcd) [0x4052fd]))) 0-: received signum (15), shutting down

cinder-rebalance2.log:
---------------------
Does not have entries for this file.

So basically the file was being moved from c7 to c11/c12, and c7 was shut down. Rebalance therefore stopped midway, leaving behind rebalance artifacts on the files (i.e. ST bits on the source and a link file on the target). From this point on, "rebalance completed" is not a valid state; it should still be "in progress". The rebalance is started again on c7, but this time it does not decide to migrate the same file (need to check the code to see why) and reports that rebalance is complete. This is something that needs to be handled.
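The md5sum comparison requested above could be sketched as follows (the helper name is made up; in practice each brick copy would be checksummed on its own node, or the copies gathered locally first):

```shell
#!/bin/sh
# compare_copies: compare md5 checksums of two copies of a volume file,
# e.g. the two replica brick copies, to check for data divergence.
# Helper name is made up for this sketch.
compare_copies() {
    a=$(md5sum "$1" | awk '{print $1}')
    b=$(md5sum "$2" | awk '{print $1}')
    if [ "$a" = "$b" ]; then
        echo "match"
    else
        echo "differ"
    fi
}
```

Note that, as mentioned above, this is only meaningful while the file is static; if Nova writes to the volume between checks, a mismatch does not by itself indicate loss.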
Tried to unset the ST bits and restart the nova instance. The image was still invalid from nova's standpoint.
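For the record, clearing the bits amounts to resetting the plain 0666 mode on the brick copy (helper name is made up; per the comment above, this alone did not make the image valid for Nova):

```shell
#!/bin/sh
# clear_migration_bits: drop the leftover setgid (S) and sticky (T)
# bits, restoring the plain -rw-rw-rw- (0666) mode the volume file
# had before migration started. Helper name is made up for this sketch.
clear_migration_bits() {
    # an explicit 4-digit octal mode clears setuid/setgid/sticky
    # on a regular file
    chmod 0666 "$1"
}
```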
While testing further, I found that if I run a rebalance again on the node containing the first brick after it is rebooted, the file does migrate to the brick marked as the destination. It still has the ST bits set, though. Interestingly, Nova picks this up as a valid image and the instance can be booted. The bricks that were added in this case are c9 and c10.

Migrated file, with ST bits still set, after re-running rebalance (gluster v rebalance <volume> start) on the rebooted node:
/rhs/brick1/c9:
total 756348
-rw-rwSrwT 2 qemu qemu 10737418240 Feb 13 13:57 volume-2c6dcac9-2e52-4515-b628-701108ee13a5
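One heuristic for telling which brick copy holds the real data, as opposed to a sparse link-file placeholder that reports the full apparent size, is to compare allocated blocks against apparent size. This is a sketch under that assumption (helper name made up); the authoritative marker for a DHT link file on a brick is the trusted.glusterfs.dht.linkto xattr, visible via getfattr.

```shell
#!/bin/sh
# classify_copy: heuristic to distinguish a real data file from a
# sparse placeholder by comparing allocated blocks to apparent size.
# Helper name is made up for this sketch; the authoritative check is
# the trusted.glusterfs.dht.linkto xattr on the brick copy.
classify_copy() {
    blocks=$(stat -c %b "$1")   # 512-byte blocks actually allocated
    size=$(stat -c %s "$1")     # apparent size in bytes
    if [ "$blocks" -eq 0 ] && [ "$size" -gt 0 ]; then
        echo "placeholder"
    else
        echo "data"
    fi
}
```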
Cloning this to 3.1. To be fixed in future.