Bug 1054782 - [RHS-RHOS] Openstack glance image corruption after remove-brick/rebalance on the RHS nodes
Summary: [RHS-RHOS] Openstack glance image corruption after remove-brick/rebalance on ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterfs
Version: 2.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: RHGS 2.1.2
Assignee: Shyamsundar
QA Contact: shilpa
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-01-17 13:10 UTC by shilpa
Modified: 2016-09-29 17:40 UTC (History)
9 users

Fixed In Version: glusterfs-3.4.0.59rhs-1.el6rhs
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1058569 1059687 (view as bug list)
Environment:
Last Closed: 2014-02-25 08:15:00 UTC
Embargoed:


Attachments (Terms of Use)
strace output of openstack glance image-create command (1.18 MB, text/plain)
2014-01-24 07:32 UTC, shilpa
screenshot (80.46 KB, image/png)
2014-04-10 11:10 UTC, shilpa


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2014:0208 0 normal SHIPPED_LIVE Red Hat Storage 2.1 enhancement and bug fix update #2 2014-02-25 12:20:30 UTC

Description shilpa 2014-01-17 13:10:41 UTC
Description of problem: While uploading a fedora image to glance, I ran remove-brick rebalance on the volume serving openstack glance images. This has led to corruption of the image.


Version-Release number of selected component (if applicable):
glusterfs-3.4.0.57rhs-1.el6rhs.x86_64
RHOS 4.0

How reproducible:
Tried once


Steps to Reproduce:
1. Create a 6*2 distribute-replicate volume called glance-vol.
2. Tag the volume with the virt group:
   # gluster volume set glance-vol group virt

3. Set the storage.owner-uid and storage.owner-gid of glance-vol to 161
         gluster volume set glance-vol storage.owner-uid 161
         gluster volume set glance-vol storage.owner-gid 161


4. # gluster v i glance-vol
 
Volume Name: glance-vol
Type: Distributed-Replicate
Volume ID: b6fcdad4-b6d1-4dca-a3ab-0a6cfda677ee
Status: Started
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.37.168:/rhs/brick1/g1
Brick2: 10.70.37.214:/rhs/brick1/g2
Brick3: 10.70.37.168:/rhs/brick1/g5
Brick4: 10.70.37.214:/rhs/brick1/g6
Brick5: 10.70.37.181:/rhs/brick1/g7
Brick6: 10.70.37.164:/rhs/brick1/g8
Brick7: 10.70.37.168:/rhs/brick1/g9
Brick8: 10.70.37.214:/rhs/brick1/g10
Brick9: 10.70.37.181:/rhs/brick1/g11
Brick10: 10.70.37.164:/rhs/brick1/g12
Brick11: 10.70.37.181:/rhs/brick1/g3
Brick12: 10.70.37.164:/rhs/brick1/g4
Options Reconfigured:
cluster.quorum-type: auto
cluster.server-quorum-type: server
server.allow-insecure: on
storage.owner-gid: 161
storage.owner-uid: 161
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable


5. On RHOS machine, mount the RHS glance volume on /mnt/gluster/glance/images and start the glance-api service.

6. Now start uploading any OS image to glance repository using:

# glance image-create --name="f19" --is-public=true --container-format=ovf --disk-format=qcow2 --copy-from http://berrange.fedorapeople.org/images/2012-11-15/f17-x86_64-openstack-sda.qcow2


7. Start remove-brick and rebalance on the glance-vol.

# gluster v remove-brick glance-vol 10.70.37.181:/rhs/brick1/g7 10.70.37.164:/rhs/brick1/g8 start
volume remove-brick start: success
ID: ba25f77d-5679-4e43-867f-b100b677b7af


8. The image file being created is

# glance image-list 

| aaae6505-c35c-49a3-8598-4d5d72724396 | f19            | qcow2       | ovf              | 251985920 | active |


Location of image file before rebalance:

/rhs/brick1/g5/glance/images:
total 65728
-rw-r----- 2 161 161 251985920 Jan 17 14:27 aaae6505-c35c-49a3-8598-4d5d72724396

After rebalance:

/rhs/brick1/g1/glance/images:
total 246080
---------T 2 161 161         0 Jan 17 14:44 aaae6505-c35c-49a3-8598-4d5d72724396

Notice there is a T-file created.
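The ---------T entry is a DHT link-to ("T") file: a placeholder whose only permission bit is the sticky bit, which DHT uses to point lookups at the file's real location. A minimal heuristic for spotting such files on a brick is sketched below (the authoritative marker is the trusted.glusterfs.dht.linkto xattr on the brick, which this sketch does not read):

```python
import os
import stat

def is_dht_linkto(path):
    # Heuristic: a DHT link-to ("T") file has no permission bits set
    # apart from the sticky bit (mode ---------T). The definitive marker
    # is the trusted.glusterfs.dht.linkto xattr on the brick.
    st = os.lstat(path)
    return stat.S_IMODE(st.st_mode) == stat.S_ISVTX
```

Note that during migration the link-to file can temporarily carry a non-zero size (as seen in comment 2), so file size alone is not a reliable indicator.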


Actual results:

Image file is corrupt. 

Expected results:

Rebalance should have migrated the file properly; image corruption should not have occurred.


Additional info:

Found this error when I tried to boot an instance using the f19 image. In the logs on RHOS machine, /var/log/nova/compute.log

ERROR nova.compute.manager [req-6998269c-bdec-4d74-946c-cd2e57d13099 060b7e29ac6b4fcd93d7bbf338d01cdc c0fb2b915fd94096a4adbb5ba115fdea] [instance: b2fd1ee2-4a1e-4270-a188-7fb1cc003e8c] Error: [Errno 32] Corrupted image. Checksum was 28202ba1001d5c44c6ac874e15208daa expected 1f104b5667768964d5df8c4ad1d7cd27

Will upload sosreports shortly.
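The Errno 32 above comes from a streamed md5 comparison against the checksum recorded at upload time. Roughly, glance-style checksum verification looks like the following sketch (the function name and chunk size are illustrative, not glance's actual code):

```python
import hashlib

def verify_image(path, expected_md5, chunk_size=1 << 20):
    # Stream the image in chunks and compare the computed md5 with the
    # checksum recorded at upload time; a mismatch means corruption.
    digest = hashlib.md5()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5
```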

Comment 2 shilpa 2014-01-21 10:27:09 UTC
The issue is reproducible. Tested again with the same results, only this time I did not find T files left behind.

During rebalance:

/rhs/brick2/g1/glance/images:
total 106880
---------T 2 161 161 117440512 Jan 21 15:37 e89b030f-5c94-4c38-a993-ba15080c1989

/rhs/brick2/g5/glance/images:
total 130304
-rw-r-S--T 2 161 161 117440512 Jan 21 15:37 e89b030f-5c94-4c38-a993-ba15080c1989


After rebalance:

/rhs/brick2/g1/glance/images:
total 246080
-rw-r----- 2 161 161 251985920 Jan 21 15:51 efc17649-4ae5-48f9-82c8-da486ffe8bf8


/rhs/brick2/g5/glance/images:
total 0
 

RHOS compute.log error:

 IOError: [Errno 32] Corrupted image. Checksum was affad9bbe6d1de0aecc4497f3f00aaff expected 1f104b5667768964d5df8c4ad1d7cd27

Comment 3 Shyamsundar 2014-01-21 13:18:23 UTC
Tested a similar use case with BigBend and Corbett and found some discrepancy in the observed behaviour, but in both cases there is data loss: the md5sum of a large file copied while rebalance is running differs from the source. In other words, this is a data-loss scenario.

Tested:
1) Create 6 x 1 distribute volume
2) Start copy of a large file into the volume (via FUSE mount)
3) Remove the brick to which the file is being written to (should mostly be the hashed sub volume) (like step (7) provided in the #Description)
4) Monitor the progress of cp and the file migration on the bricks
  - Noticed that with BigBend the stat information states the right logical size but different physical size with md5sum of the file differing
  - On Corbett, the stat information states incorrect logical size and the md5sum is also different

Bottom line: I did not see the exact metadata as in the report, but file data consistency is not maintained.

Need to further check code and debug to see what is occurring in this case.

Comment 4 shilpa 2014-01-24 07:29:42 UTC
1. Sosreports in http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1054782/

Comment 5 shilpa 2014-01-24 07:32:57 UTC
Created attachment 854825 [details]
strace output of openstack glance image-create command

Comment 6 Shyamsundar 2014-01-27 05:54:07 UTC
Update to comment #3:

The test case used to reproduce the issue has the following characteristic: 'cp <src> <tgt>' opens the file with O_CREAT | O_EXCL, which causes dht_rebalance_inprogress_task and dht_migration_complete_check_task to open the file with those same flags once rebalance is detected in DHT on the data path.

That open fails with EEXIST because the file already exists. The failure is silent from the client's perspective (the logs record it, as below, but the client is not notified), so the in-flight writes are written to neither the old file nor the new file under rebalance, causing data corruption.

Potential fix here is to change the open modes appropriately in the function above to prevent the problem.

Additionally, since rebalance reads the stat information to decide how much of the file to migrate, the in-flight IOs are left out (the stat size does not reflect the increased file size). These in-flight IOs are therefore not replayed by rebalance; the client DHT is expected to replay them, so when the open of the new fd fails, the result is the stated problem.
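The EEXIST failure mode is easy to reproduce on any local filesystem. The following sketch (paths illustrative, not the actual DHT code) shows why reusing the client's original open flags fails on the migration target, and why dropping the creation flags, as the fix does, succeeds:

```python
import errno
import os
import tempfile

# The file already exists on the destination subvolume, as it would
# once rebalance has created it there.
path = os.path.join(tempfile.mkdtemp(), "largefile.txt")
fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o640)
os.close(fd)

# Reopening with the client's original O_CREAT | O_EXCL flags fails.
try:
    os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    raise AssertionError("expected EEXIST")
except FileExistsError as e:
    assert e.errno == errno.EEXIST

# Dropping the creation flags, as the fix does, makes the reopen succeed.
fd = os.open(path, os.O_WRONLY)
os.close(fd)
```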

Update on the bug:
Checking the logs on the OpenStack setup, however, does not show the same messages, so something else could be going wrong in this case. This needs further debugging.

Logs of interest:
[2014-01-22 05:34:51.263961] W [client-rpc-fops.c:464:client3_3_open_cbk] 0-test-client-2: remote operation failed: File exists. Path: /largefile/largefile.txt (7edacd43-3433-4ebc-aaac-b189e8362ac1)
[2014-01-22 05:34:51.263986] E [dht-helper.c:1037:dht_rebalance_inprogress_task] 0-test-dht: failed to send open the fd (0x195288c, flags=0100301) on file /largefile/largefile.txt @ test-client-2
[2014-01-22 05:34:51.267264] W [client-rpc-fops.c:464:client3_3_open_cbk] 0-test-client-2: remote operation failed: File exists. Path: /largefile/largefile.txt (7edacd43-3433-4ebc-aaac-b189e8362ac1)
[2014-01-22 05:34:51.267288] E [dht-helper.c:1037:dht_rebalance_inprogress_task] 0-test-dht: failed to send open the fd (0x195288c, flags=0100301) on file /largefile/largefile.txt @ test-client-2

Comment 7 Shyamsundar 2014-01-28 14:03:54 UTC
Further debugging on the RHOS setup shows that the image file is being created with the O_TRUNC flag. This has the same problem of truncating the file when an in-progress rebalance is detected.

The patch for O_CREAT | O_EXCL already handles O_TRUNC as well. The next step is to test the same fix on the RHOS setup to see whether the problem goes away.
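The O_TRUNC hazard is equally easy to demonstrate locally. A sketch (paths illustrative) of how reopening a partially migrated file with the client's original flags discards the bytes already copied:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "image")

with open(path, "wb") as f:
    f.write(b"\0" * 4096)  # stand-in for bytes already migrated

# Reopening with the client's unsanitized O_TRUNC flag wipes them out.
fd = os.open(path, os.O_WRONLY | os.O_TRUNC)
assert os.fstat(fd).st_size == 0  # the migrated data is gone
os.close(fd)
```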

Upstream patch submitted for the same is here, http://review.gluster.org/#/c/6830/

Comment 8 shilpa 2014-01-29 18:03:55 UTC
Applied the patch and re-tested. It did not fix the issue. Image file still disappears upon rebalance on the corresponding bricks.

# rpm -qa | grep gluster
glusterfs-fuse-3.4.0.58dhttrunc-1.el6rhs.x86_64
glusterfs-rdma-3.4.0.58dhttrunc-1.el6rhs.x86_64
glusterfs-libs-3.4.0.58dhttrunc-1.el6rhs.x86_64
glusterfs-api-3.4.0.58dhttrunc-1.el6rhs.x86_64
glusterfs-geo-replication-3.4.0.58dhttrunc-1.el6rhs.x86_64
glusterfs-api-devel-3.4.0.58dhttrunc-1.el6rhs.x86_64
glusterfs-devel-3.4.0.58dhttrunc-1.el6rhs.x86_64
glusterfs-3.4.0.58dhttrunc-1.el6rhs.x86_64
glusterfs-server-3.4.0.58dhttrunc-1.el6rhs.x86_64
glusterfs-debuginfo-3.4.0.58dhttrunc-1.el6rhs.x86_64

Before rebalance:

1.
# glance image-create --name="f17" --is-public=true --container-format=ovf --disk-format=qcow2 --copy-from http://berrange.fedorapeople.org/images/2012-11-15/f17-x86_64-openstack-sda.qcow2


# glance image-list
+--------------------------------------+--------+-------------+------------------+-----------+--------+
| ID                                   | Name   | Disk Format | Container Format | Size      | Status |
+--------------------------------------+--------+-------------+------------------+-----------+--------+
| af8eb70e-4565-4981-a6de-3296e78e0e5a | f17    | qcow2       | ovf              | 251985920 | saving |

# getfattr -d -etext -m. -n trusted.glusterfs.pathinfo /mnt/gluster/glance/images/af8eb70e-4565-4981-a6de-3296e78e0e5a
getfattr: Removing leading '/' from absolute path names
# file: mnt/gluster/glance/images/af8eb70e-4565-4981-a6de-3296e78e0e5a
trusted.glusterfs.pathinfo="(<DISTRIBUTE:glance-vol-dht> (<REPLICATE:glance-vol-replicate-0> <POSIX(/rhs/brick2/g10):rhs-vm2:/rhs/brick2/g10/glance/images/af8eb70e-4565-4981-a6de-3296e78e0e5a> <POSIX(/rhs/brick2/g9):rhs-vm1:/rhs/brick2/g9/glance/images/af8eb70e-4565-4981-a6de-3296e78e0e5a>))"
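For reference, the (host, path) pairs can be pulled out of such a pathinfo value with a small parser. This sketch assumes the <POSIX(brick):host:path> layout shown in the output above:

```python
import re

def bricks_from_pathinfo(pathinfo):
    # Return (host, path) pairs from a trusted.glusterfs.pathinfo value,
    # assuming the <POSIX(brick):host:path> entries shown above.
    return re.findall(r"<POSIX\([^)]*\):([^:>]+):([^>]+)>", pathinfo)
```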

2. 
# gluster v remove-brick glance-vol 10.70.37.180:/rhs/brick2/g9 10.70.37.150:/rhs/brick2/g10 start

After rebalance the image disappeared:

# getfattr -d -etext -m. -n trusted.glusterfs.pathinfo /mnt/gluster/glance/images/af8eb70e-4565-4981-a6de-3296e78e0e5a
getfattr: /mnt/gluster/glance/images/af8eb70e-4565-4981-a6de-3296e78e0e5a: No such file or directory

Comment 9 Shyamsundar 2014-01-30 11:05:02 UTC
Further analysis and fix:
-------------------------

The bug has now forked into two behaviours, as follows:

1) Data corruption of file under write when remove-brick(rebalance) occurs

2) File missing when remove-brick(rebalance) occurs

1) Data corruption
------------------

The root cause is that the flags used to open the file under rebalance on the new subvolume included O_TRUNC in the RHOS case, and O_CREAT | O_EXCL in the unit-test case. This caused the file under rebalance either to be truncated (hence the mismatching md5sums and sizes) or the open on the new subvolume to fail. The failure was silently ignored in DHT, so newer data was written to neither the old nor the new subvolume for the file under migration.

This is fixed by the patch mentioned in comment #7 (tested by me on the RHOS setup three times).

2) File missing
---------------

The root cause is the following scenario (triggered by the way we were reproducing the bug, but it can happen anyway):
- The application starts writing to a file
- Remove-brick (rebalance) is triggered on the subvolume where the file resides
- During the actual file migration no write IOs occur (this happens with RHOS because the source of the file being copied is on the web, so there are write stalls, observable with fop logging on writes)
- After the migration completes (about 3-5 seconds on the RHOS setup), a lookup on the file is triggered (we were triggering it with ls -l on the file to check whether it is growing in size, hence the statement that this occurred during reproduction of the bug). The file's cached subvolume then changes to the new subvolume, but the fd we hold belongs to the old subvolume
- A subsequent write by the application hits an fd/subvolume mismatch and the layers below return EINVAL (logs as below, seen in the client mount logs)

In (2) there is no data corruption: an error is returned to the application, and glance in this case decides to remove the image from its store because create_image failed.

This was also simulated with a simple bash open-file test.

(2) will be forked into a separate bug for further analysis on possible fixes.

For (1), the upstream patch is now posted here and is awaiting acceptance: https://code.engineering.redhat.com/gerrit/#/c/19107/1
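The stale-fd sequence in (2) has a rough local-filesystem analogue. In this sketch (paths illustrative), the write through the old fd lands on an unlinked inode and is simply lost, where gluster instead surfaces EINVAL to the application:

```python
import os
import tempfile

d = tempfile.mkdtemp()
old = os.path.join(d, "old-subvol-copy")
new = os.path.join(d, "new-subvol-copy")

fd = os.open(old, os.O_CREAT | os.O_WRONLY, 0o640)
os.write(fd, b"first half")  # application writes, then stalls

# "Migration": copy the bytes written so far, then drop the old copy.
with open(old, "rb") as src, open(new, "wb") as dst:
    dst.write(src.read())
os.unlink(old)

# The application resumes writing through the fd it still holds; locally
# this succeeds against the unlinked inode, so the bytes never reach the
# migrated copy (gluster returns EINVAL on the fd/subvolume mismatch).
os.write(fd, b"second half")
os.close(fd)
```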

Comment 11 shilpa 2014-01-31 11:12:22 UTC
In response to comment #9, part 2 (file missing):

I re-tested without running ls -l on the client and still hit the issue. The image file goes missing.

Sequence of events:

1. # glance image-create --name="f17img" --is-public=true --container-format=ovf --disk-format=qcow2 --copy-from http://berrange.fedorapeople.org/images/2012-11-15/f17-x86_64-openstack-sda.qcow2

2. #getfattr -d -etext -m. -n trusted.glusterfs.pathinfo /mnt/gluster/glance/images/65f87e72-280d-4838-8291-6bac34d96c1d
getfattr: Removing leading '/' from absolute path names
# file: mnt/gluster/glance/images/65f87e72-280d-4838-8291-6bac34d96c1d
trusted.glusterfs.pathinfo="(<DISTRIBUTE:glance-vol-dht> (<REPLICATE:glance-vol-replicate-0> <POSIX(/rhs/brick2/g12):rhs-vm4:/rhs/brick2/g12/glance/images/65f87e72-280d-4838-8291-6bac34d96c1d> <POSIX(/rhs/brick2/g11):rhs-vm3:/rhs/brick2/g11/glance/images/65f87e72-280d-4838-8291-6bac34d96c1d>))"

3. On RHS do a remove-brick on the bricks from above output. 

4. Do not run ls -l on the mountpoint on client machine to see the download status. 

5. Wait until the rebalance finishes and check for the file.

# getfattr -d -etext -m. -n trusted.glusterfs.pathinfo /mnt/gluster/glance/images/65f87e72-280d-4838-8291-6bac34d96c1d
getfattr: /mnt/gluster/glance/images/65f87e72-280d-4838-8291-6bac34d96c1d: No such file or directory

Comment 12 shilpa 2014-01-31 11:56:04 UTC
With the patch applied, data corruption is not seen anymore. 

Now, the image creation fails and the image is removed from the glance database. The same image can be downloaded again, which creates a brand-new file in glance, but there is a chance that the new file lands on the same brick being rebalanced, in which case glance may remove it again.

Comment 13 shilpa 2014-02-05 12:01:20 UTC
Verified on glusterfs-3.4.0.59rhs-1.el6rhs. No more data corruption seen.

Comment 14 Pavithra 2014-02-06 10:21:56 UTC
Shyam,

Can you please provide the doc text for errata?

Comment 15 Pavithra 2014-02-06 10:31:55 UTC
After a discussion with Vivek, I am removing the needinfo on Shyam.

Comment 17 errata-xmlrpc 2014-02-25 08:15:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0208.html

Comment 18 shilpa 2014-04-10 11:10:56 UTC
Created attachment 884903 [details]
screenshot

