Bug 1064321 - [RHS-RHOS] Openstack cinder volume file on RHS not properly migrated after a rebalance and self-heal process.
Summary: [RHS-RHOS] Openstack cinder volume file on RHS not properly migrated after a rebalance and self-heal process.
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: distribute
Version: 2.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Nithya Balachandran
QA Contact: storage-qa-internal@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1035040 1286126
 
Reported: 2014-02-12 12:24 UTC by shilpa
Modified: 2015-11-27 11:39 UTC (History)
5 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
Cause: A node containing a brick of a volume under rebalance is restarted.
Consequence: Rebalance reports that it has completed when the node is brought back online, but the data is not rebalanced. If this was a remove-brick rebalance, the data on the node may still not be rebalanced, so a commit at that point could cause data loss.
Workaround (if any): Re-run the rebalance if any node is brought down while a rebalance is in progress, especially if it is a rebalance following a remove-brick operation.
Result: The second rebalance ensures that the data is migrated to the right node and completes the required rebalance.
Clone Of:
: 1286126 (view as bug list)
Environment:
Last Closed: 2015-11-27 11:38:29 UTC
Embargoed:


Attachments (Terms of Use)
Rebalance logs from the replica pair (534.31 KB, text/x-log)
2014-02-12 12:27 UTC, shilpa
no flags Details
Rebalance logs from the second replica node (483.39 KB, text/x-log)
2014-02-12 12:28 UTC, shilpa
no flags Details

Description shilpa 2014-02-12 12:24:43 UTC
Description of problem:
Created OpenStack Cinder bootable volumes backed by RHS, then ran a remove-brick operation on the brick containing the volume file. After rebooting one of the nodes in the replica pair, the file migration is left incomplete, leaving -rw-rwSrwT permissions on the file.

Impact:
A Nova instance that is booted from this volume errors out when it is rebooted, because the instance tries to fetch the volume file from the backend and the file is in an inconsistent state.

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.59rhs-1.el6_4.x86_64

How reproducible: Reproduced the issue twice


Steps to Reproduce:
1. Create a 6x2 distributed-replicate volume called cinder and set the virt group option.
(i.e.) gluster volume create cinder replica 2 <brick1> ... <brick12>

2. Tag the volume with group virt
   (i.e.) gluster volume set cinder group virt

3. Set the owner uid and gid on the volume
   (i.e.) gluster volume set cinder storage.owner-uid 165
          gluster volume set cinder storage.owner-gid 165

4. Configure RHOS to use this cinder volume to create bootable volumes for instances.

5. Create a bootable volume of 10GB (6299fd6a-ff07-47c7-9e3d-c03def66a327)

6. Boot an instance from this volume (which contains, for example, a Fedora image).

7. While the instance is up and running, locate the bricks that hold the cinder volume file 6299fd6a-ff07-47c7-9e3d-c03def66a327:

# file: var/lib/cinder/volumes/b5e61da7fdba3f3bd0bafd1215216639/volume-6299fd6a-ff07-47c7-9e3d-c03def66a327
trusted.glusterfs.pathinfo="(<DISTRIBUTE:cinder-dht> (<REPLICATE:cinder-replicate-5> <POSIX(/rhs/brick1/c7):rhs1-vm3:/rhs/brick1/c7/volume-6299fd6a-ff07-47c7-9e3d-c03def66a327> <POSIX(/rhs/brick1/c8):rhs1-vm4:/rhs/brick1/c8/volume-6299fd6a-ff07-47c7-9e3d-c03def66a327>))"
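
(For reference, the pathinfo above can be queried with getfattr on the FUSE mount of the volume; the path below is the Cinder mount path seen in the output, so this assumes the command is run on the node carrying that mount:)

# getfattr -n trusted.glusterfs.pathinfo /var/lib/cinder/volumes/b5e61da7fdba3f3bd0bafd1215216639/volume-6299fd6a-ff07-47c7-9e3d-c03def66a327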


8. Now perform a remove-brick start operation on this brick:

# gluster v remove-brick cinder 10.70.37.96:/rhs/brick1/c7 10.70.37.77:/rhs/brick1/c8 start

9. Check the remove-brick status and check whether the file 6299fd6a-ff07-47c7-9e3d-c03def66a327 is being migrated:

Before migration, the file on the nodes rhs1-vm3 and rhs1-vm4 respectively:

/rhs/brick1/c7:
total 756076
-rw-rw-rw- 2 qemu qemu 10737418240 Feb 12 16:38 volume-6299fd6a-ff07-47c7-9e3d-c03def66a327

/rhs/brick1/c8:
total 756004
-rw-rw-rw- 2 qemu qemu 10737418240 Feb 12 16:35 volume-6299fd6a-ff07-47c7-9e3d-c03def66a327



10. Once migration of the file has started, reboot one of the nodes in the replica pair. In this run, rhs1-vm3 (containing /rhs/brick1/c7) was rebooted.

11. Once the node comes back up, check for self-heal and rebalance status. 

12. Self-heal completes:

# gluster v heal cinder info healed

Brick 10.70.37.77:/rhs/brick1/c8
Number of entries: 2
at                    path on brick
-----------------------------------
2014-02-12 06:59:02 /
2014-02-12 11:13:00 /volume-6299fd6a-ff07-47c7-9e3d-c03def66a327


13. Check for rebalance status: 

# gluster v remove-brick cinder 10.70.37.96:/rhs/brick1/c7 10.70.37.77:/rhs/brick1/c8 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             6             0             0            completed               0.00
                             10.70.37.96                0        0Bytes             6             0             0            completed               1.00


Status shows "completed":

14. But when I check the bricks, I see that the file has not been migrated and it still has the rebalance mode bits (-rw-rwSrwT) set on it.

Output from the node that was rebooted, rhs1-vm3:

/rhs/brick1/c11:
total 159132
-rw-rw-rw- 2 qemu qemu  2147483648 Feb 12 12:39 volume-2cda821d-f7a8-4249-a778-55beaefe0447
---------T 2 qemu qemu 10737418240 Feb 12 16:46 volume-6299fd6a-ff07-47c7-9e3d-c03def66a327

/rhs/brick1/c7:
total 851472
-rw-rwSrwT 2 qemu qemu 10737418240 Feb 12 16:46 volume-6299fd6a-ff07-47c7-9e3d-c03def66a327


On the other node of the replica pair, rhs1-vm4:

/rhs/brick1/c12:
total 159064
-rw-rw-rw- 2 qemu qemu  2147483648 Feb 12 12:39 volume-2cda821d-f7a8-4249-a778-55beaefe0447
---------T 2 qemu qemu 10737418240 Feb 12 16:55 volume-6299fd6a-ff07-47c7-9e3d-c03def66a327

/rhs/brick1/c8:
total 851360
-rw-rwSrwT 2 qemu qemu 10737418240 Feb 12 16:55 volume-6299fd6a-ff07-47c7-9e3d-c03def66a327
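
(For reference, the ---------T entries above are DHT link files; on the brick backend they can usually be identified by the trusted.glusterfs.dht.linkto xattr. A hedged check, run as root on rhs1-vm3 with the brick path taken from the listing above:)

# getfattr -n trusted.glusterfs.dht.linkto -e text /rhs/brick1/c11/volume-6299fd6a-ff07-47c7-9e3d-c03def66a327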


Actual results:

Due to this behavior, an instance that is booted from the file under test cannot be recovered once it is rebooted, because the file being fetched is inconsistent. If the instance is not rebooted, no effect is seen on the instance.

Expected results:

An RHS node reboot should not affect the file being migrated, since the other node of the replica pair is still up.


Additional info:

Rebalance logs from the two nodes are attached.

Comment 2 shilpa 2014-02-12 12:27:53 UTC
Created attachment 862265 [details]
Rebalance logs from the replica pair

Comment 3 shilpa 2014-02-12 12:28:30 UTC
Created attachment 862266 [details]
Rebalance logs from the second replica node

Comment 4 SATHEESARAN 2014-02-12 15:08:19 UTC
Shilpa, 

As per the additional information in comment 0,
"Rebalance logs from the two nodes are attached."
Do you have only 2 nodes in the Trusted Storage Pool?

The reason I am asking is that, from RHSS 2.1 Update 2, server- and client-side quorum is enabled in the virt profile, and it may have some impact on the volume.

Also, nowhere in the bug do I find information about the number of hosts in the cluster (i.e. the Trusted Storage Pool). It would be helpful if you could provide the number of RHSS nodes in the cluster, the volume info, and the volume status.

If there were only 2 nodes in that cluster, rebooting one RHSS node would break server quorum and the volume would have gone offline.
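
(For reference, the requested information would typically be gathered with something like the following, run on any node of the pool; the volume name is taken from this bug:)

# gluster peer status
# gluster volume info cinder
# gluster volume status cinder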

Comment 5 SATHEESARAN 2014-02-12 15:14:40 UTC
Another observation:
Per step 10 in comment 0, you rebooted the node containing the FIRST brick of the replica group, and at that point that particular replica group would have become READ-ONLY.

Comment 6 shilpa 2014-02-12 17:28:31 UTC
I do not have client quorum enabled in this test case and there are four nodes in total. 

Volume info:

# gluster v i
 
Volume Name: cinder
Type: Distributed-Replicate
Volume ID: d3ff221b-0e7e-49b7-9019-c81dd87618d1
Status: Started
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.37.96:/rhs/brick1/c3
Brick2: 10.70.37.77:/rhs/brick1/c4
Brick3: 10.70.37.121:/rhs/brick1/c5
Brick4: 10.70.37.140:/rhs/brick1/c6
Brick5: 10.70.37.121:/rhs/brick1/c9
Brick6: 10.70.37.140:/rhs/brick1/c10
Brick7: 10.70.37.121:/rhs/brick1/c1
Brick8: 10.70.37.140:/rhs/brick1/c2
Brick9: 10.70.37.96:/rhs/brick1/c11
Brick10: 10.70.37.77:/rhs/brick1/c12
Brick11: 10.70.37.96:/rhs/brick1/c7
Brick12: 10.70.37.77:/rhs/brick1/c8
Options Reconfigured:
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
storage.owner-uid: 165
storage.owner-gid: 165
server.allow-insecure: on

Comment 7 Shyamsundar 2014-02-13 07:05:52 UTC
@Shilpa, we need a few more observations on this bug, as follows:

- In step 14, the file is not migrated but still has the sticky bits on it, and the target still has the link file.

Can we check the above using an md5sum of the file before the operations are performed and again afterwards, to ensure there is no data loss?

The md5sum check depends on whether the file is static from Nova's perspective; if Nova writes to this file and changes its contents, the md5sums may not match.
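
(A minimal sketch of the requested check, assuming the file is read through the Cinder FUSE mount path from comment 0 and is not being written to in between:)

# md5sum /var/lib/cinder/volumes/b5e61da7fdba3f3bd0bafd1215216639/volume-6299fd6a-ff07-47c7-9e3d-c03def66a327   # before the remove-brick/rebalance
# md5sum /var/lib/cinder/volumes/b5e61da7fdba3f3bd0bafd1215216639/volume-6299fd6a-ff07-47c7-9e3d-c03def66a327   # after the operation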

- In the case where Nova is not able to see the file or is unable to boot, the FUSE mount was still showing the file, albeit with the bits set, correct?

Can we check whether Nova is able to boot the image again if we manually remove the bits from the mentioned file? That way we would know a troubleshooting option if a customer hits this issue.
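
(For the record, one way to clear those bits manually would be a chmod on each brick copy listed in step 14; this is only a sketch of the suggested troubleshooting step, not a verified recovery procedure:)

# chmod g-s,o-t /rhs/brick1/c7/volume-6299fd6a-ff07-47c7-9e3d-c03def66a327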

Code standpoint:
----------------
From the logs, 
cinder-rebalance.log
--------------------
[2014-02-12 11:08:25.860790] I [dht-common.c:2646:dht_setxattr] 0-cinder-dht: fixing the layout of /
[2014-02-12 11:08:25.867823] I [dht-rebalance.c:1121:gf_defrag_migrate_data] 0-cinder-dht: migrate data called on /
[2014-02-12 11:08:25.893411] I [dht-rebalance.c:672:dht_migrate_file] 0-cinder-dht: /volume-6299fd6a-ff07-47c7-9e3d-c03def66a327: attempting to move from cinder-replicate-5 to cinder-replicate-4
[2014-02-12 11:08:44.292742] I [dht-rebalance.c:1783:gf_defrag_status_get] 0-glusterfs: Rebalance is in progress. Time taken is 19.00 secs
[2014-02-12 11:08:44.298030] I [dht-rebalance.c:1786:gf_defrag_status_get] 0-glusterfs: Files migrated: 0, size: 0, lookups: 6, failures: 0, skipped: 0
[2014-02-12 11:08:44.801218] W [socket.c:522:__socket_rwv] 0-glusterfs: readv on 127.0.0.1:24007 failed (No data available)
[2014-02-12 11:08:48.760104] W [glusterfsd.c:1099:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x37438e894d] (-->/lib64/libpthread.so.0() [0x3743c07851] (-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xcd) [0x4052fd]))) 0-: received signum (15), shutting down
cinder-rebalance2.log
---------------------
Does not have entries for this file.

So basically the file was being moved from c7 to c11/c12, and c7 was shut down. So the rebalance stopped in between, leaving behind rebalance artifacts on the files (i.e. ST bits on the source and a link file on the target).

From this point on, "rebalance completed" is not a valid state; it should still be in progress. The rebalance is started again on c7, but this time it does not decide to migrate the same file (need to check the code to see why) and reports that the rebalance is complete. This is something that needs to be handled.

Comment 8 shilpa 2014-02-13 07:12:23 UTC
Tried to unset the ST bits and restart the Nova instance. The image was still invalid from Nova's standpoint.

Comment 9 shilpa 2014-02-13 08:31:09 UTC
While testing further, I figured out that if I run a rebalance again on the node (containing the first brick) after it is rebooted, the file does migrate to the brick that is marked as the destination. It still has the ST bits set, though. Interestingly, Nova picks this up as a valid image and the instance can be booted.

The bricks that were added in this case are c9 and c10. Migrated file, with the ST bits still set, after re-running the rebalance (gluster v rebalance <volname> start) on the rebooted node:


/rhs/brick1/c9:
total 756348
-rw-rwSrwT 2 qemu qemu 10737418240 Feb 13 13:57 volume-2c6dcac9-2e52-4515-b628-701108ee13a5
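
(For reference, the re-run described above amounts to the following, using the volume name from this bug; the second command is just to confirm that the re-run actually migrated the files:)

# gluster volume rebalance cinder start
# gluster volume rebalance cinder status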

Comment 12 Susant Kumar Palai 2015-11-27 11:38:29 UTC
Cloning this to 3.1. To be fixed in a future release.

