Bug 1203739 - Self-heal of sparse image files on 3-way replica "unsparsifies" the image
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: replicate
Version: 3.6.2
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Ravishankar N
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-03-19 14:59 UTC by Matt R
Modified: 2018-09-26 10:33 UTC
CC List: 13 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2016-01-07 08:56:59 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Matt R 2015-03-19 14:59:01 UTC
Description of problem:
This is similar to bug 1190633. We have a 3-way replica that is used to store sparse oVirt disk images; the same servers also serve as the oVirt nodes.

When a machine is taken down for maintenance, the third machine's healing process "inflates" the sparse image to full size.

In our scenario, we have 3 servers, each is an oVirt node and hosts a brick for the replica.

In oVirt, one node is the SPM (Storage Master), and the other two are "normal."

We leave the SPM alone but put one of the "normal" machines into maintenance, then remove its brick from the volume.

Then we reboot the machine and add the brick back to the volume. Oddly, it's the *third* server that re-inflates the sparse images.

Version-Release number of selected component (if applicable):
glusterfs-server-3.6.2-1.el6.x86_64


How reproducible:
Always


Steps to Reproduce:
1. Set up 3-node replication with sparse images
2. Set one non-SPM oVirt node to maintenance
3. Remove the brick of the server in maintenance from the gluster volume (see the command sketch after this list)
4. Reboot
5. Add the brick back to the volume
6. The SPM and the recently rebooted server behave correctly; the third, untouched server inflates the sparse images.
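
For reference, steps 3 and 5 correspond to gluster CLI commands along these lines (volume name, hostname, and brick path are taken from the volume info below; adjust for the server being cycled):

# drop the brick of the server in maintenance, reducing the replica count
gluster volume remove-brick ovirt replica 2 server1:/gluster/ovirt/brick force

# after the reboot, re-add the brick, restoring replica 3
gluster volume add-brick ovirt replica 3 server1:/gluster/ovirt/brick force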

Actual results:
Here's a "df" of the machines after rebooting.

SPM machine (not rebooted):
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/rootvg-ovirtlv
                      500G   12G  488G   3% /gluster/ovirt

"Normal" machine, the one that was rebooted:
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/rootvg-ovirtlv
                      500G   12G  489G   3% /gluster/ovirt


The third, untouched, "normal" machine:
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/rootvg-ovirtlv
                      500G   64G  437G  13% /gluster/ovirt



Expected results:
I expect the heal not to inflate sparse images on the third machine (or any of them, actually).

This is definitely an issue, since VM disk images tend to be overallocated.

The only workaround is to move the disk images to another storage domain and then back to the original one.


Additional info:
Here's the volume info:
gluster> volume info ovirt
 
Volume Name: ovirt
Type: Replicate
Volume ID: b39ed03e-0d03-40a2-acad-8384cf0c5cb4
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: server1:/gluster/ovirt/brick
Brick2: server2:/gluster/ovirt/brick
Brick3: server3:/gluster/ovirt/brick
Options Reconfigured:
cluster.data-self-heal-algorithm: diff
server.allow-insecure: on
storage.owner-gid: 36
storage.owner-uid: 36
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off

Comment 1 Ravishankar N 2015-03-27 04:43:18 UTC
Hi Matt, bug 1190633 happens only when the data-self-heal-algorithm is set to 'full'. The issue shouldn't occur with it set to 'diff', which is what your volume info shows.
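
For reference, the setting in effect appears under "Options Reconfigured" in `gluster volume info <vol>`, and can be changed with volume set; e.g., with your volume name:

gluster volume set ovirt cluster.data-self-heal-algorithm diff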

When you say:
3. Remove brick from gluster volume of the server in maintenance
5. Add the brick back to the volume

what exactly do you mean? Unmounting and remounting the brick or reducing the replica count via remove-brick command and then adding it back again using add-brick command?

Could you also check the specific file in question on all the bricks, rather than the df of the file system that you provided?
i.e. `ls -lh /gluster/ovirt/file_name`
and `du -h /gluster/ovirt/file_name` on all 3 bricks.

Comment 2 Matt R 2015-03-30 16:07:09 UTC
Hi Ravishankar,

I haven't forgotten about this; I've been swamped. I hope to have this information to you by the end of this week.

Comment 3 Ravishankar N 2015-03-31 13:08:32 UTC
(In reply to Matt R from comment #2)
> Hi Ravishankar,
> 
> I haven't forgotten about this, I've been swamped. I hope to have this
> information to you by the end of this week.

Sure Matt.

Comment 4 Matt R 2015-03-31 18:47:54 UTC
Hi Ravishankar,

Here's more detailed information.

For points 3 & 5:
I removed the brick and reduced the replica count. Then, after rebooting, I formatted the file system and re-added the brick, increasing the replica count back to 3.

Below is a detailed list of the steps I performed to create this issue. Let me know if you need any more info.

Thanks,
Matt

Before:
Server1:
[~]{575}# df -h /gluster/ovirt/
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/rootvg-ovirtlv
                      500G   25G  475G   5% /gluster/ovirt

[~]{577}# du -h -s /gluster/ovirt/
25G	/gluster/ovirt/

Server2:
[~]{484}# df -h /gluster/ovirt/
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/rootvg-ovirtlv
                      500G   25G  475G   5% /gluster/ovirt

[~]{486}# du -hs /gluster/ovirt/
25G	/gluster/ovirt/


Server3:
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/rootvg-ovirtlv
                      500G   25G  475G   5% /gluster/ovirt

[~]{402}# du -hs /gluster/ovirt/
25G	/gluster/ovirt/


On oVirt:
Put Server1 into maintenance mode

Then:
Remove brick:
gluster> volume remove-brick ovirt replica 2 server1.umaryland.edu:/gluster/ovirt/brick force
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
volume remove-brick commit force: success

Reboot Server1

Before re-adding the brick:
[~]{588}# gluster volume status ovirt
Status of volume: ovirt
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick server3.umaryland.edu:/gluster/ovirt/brick	49154	Y	3187
Brick server2.umaryland.edu:/gluster/ovirt/brick	49154	Y	8894
NFS Server on localhost					2049	Y	3200
Self-heal Daemon on localhost				N/A	Y	3207
NFS Server on server2.umaryland.edu			2049	Y	3399
Self-heal Daemon on server2.umaryland.edu		N/A	Y	3413
NFS Server on server3.umaryland.edu		2049	Y	27589
Self-heal Daemon on server3.umaryland.edu		N/A	Y	27599
 
Task Status of Volume ovirt
------------------------------------------------------------------------------
There are no active volume tasks

Server1:
[~]{592}# mkfs.xfs -f -i size=512 /dev/rootvg/ovirtlv 
meta-data=/dev/rootvg/ovirtlv    isize=512    agcount=16, agsize=8191984 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=131071744, imaxpct=25
         =                       sunit=16     swidth=16 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=64000, version=2
         =                       sectsz=512   sunit=16 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Add back brick:
gluster> volume add-brick ovirt replica 3 server1.umaryland.edu:/gluster/ovirt/brick force
volume add-brick: success

After adding the brick:
gluster> volume status ovirt
Status of volume: ovirt
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick server3.umaryland.edu:/gluster/ovirt/brick	49154	Y	3187
Brick server2.umaryland.edu:/gluster/ovirt/brick	49154	Y	8894
Brick server1.umaryland.edu:/gluster/ovirt/brick	49155	Y	10213
NFS Server on localhost					2049	Y	8804
Self-heal Daemon on localhost				N/A	Y	8814
NFS Server on server1					2049	Y	10226
Self-heal Daemon on server1				N/A	Y	10234
NFS Server on server3.umaryland.edu		2049	Y	31960
Self-heal Daemon on server3.umaryland.edu		N/A	Y	31971
 
Task Status of Volume ovirt
------------------------------------------------------------------------------
There are no active volume tasks


After heal starts:
Server1:
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/rootvg-ovirtlv
                      500G  7.9G  492G   2% /gluster/ovirt
[~]{614}# du -hs /gluster/ovirt/
7.9G	/gluster/ovirt/
(This will continue to grow until 25G, which is the actual amount of disk space used)

Server2 (oVirt SPM):
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/rootvg-ovirtlv
                      500G   25G  475G   6% /gluster/ovirt
[~]{510}# du -hs /gluster/ovirt/
25G	/gluster/ovirt/

(This will stay the same)

Server 3:
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/rootvg-ovirtlv
                      500G   44G  457G   9% /gluster/ovirt
[~]{426}# du -hs /gluster/ovirt/
44G	/gluster/ovirt/

(This will continue to grow to the fully allocated, rather than sparse, size)


And for one particular disk image directory:
Server1:
[./e545dbec-c16c-4b25-8d8e-6bcae0f925d1]{645}# du -m .
5898	.
(Staying the same, oddly ~1GB smaller than the original)

Server2:
[./e545dbec-c16c-4b25-8d8e-6bcae0f925d1]{539}# du -m .
6874	.
(Staying the same)

Server3:
[./e545dbec-c16c-4b25-8d8e-6bcae0f925d1]{454}# du -m .
23382	.
(Growing)

Comment 5 SATHEESARAN 2015-04-07 17:40:33 UTC
I have found the same behaviour with the latest glusterfs-3.7 nightly build, without involving oVirt in the picture.

These are the steps I used to reproduce this issue:

1. Installed 3 RHEL servers with the latest glusterfs-3.7 nightly build
2. Created a gluster cluster (Trusted Storage Pool)
3. Created a replica 3 volume
4. FUSE-mounted the volume on another RHEL 6.6 server
5. Created a sparse file:
   dd if=/dev/urandom of=vm.img bs=1024 count=0 seek=24M
6. Performed fallocate on that file for 5G:
   fallocate -l5G vm.img
7. Reduced the replica count to 2:
   gluster volume remove-brick <vol-name> replica 2 <brick3>
8. Added a new brick:
   gluster volume add-brick <vol-name> replica 3 <new-brick>
9. Triggered self-heal:
   gluster volume heal <vol-name> full
10. Waited till the self-heal completed
11. Checked the file size across all the bricks:
    du -sh <file>                  (gives the actual disk usage)
    du -sh --apparent-size <file>  (gives the apparent, i.e. sparse, size)

From the above test, I noted that on one of the servers the file size had become 24G, which confirms the problem.
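
As a quick cross-check, comparing apparent size with allocated blocks on each brick also shows whether sparseness was lost (a minimal sketch, using vm.img from step 5):

stat -c 'apparent=%s bytes  allocated=%b blocks' vm.img
# a still-sparse file allocates far fewer blocks than its apparent size implies;
# an inflated one allocates roughly apparent-size/512 blocks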

Comment 6 Ben England 2016-01-04 16:39:58 UTC
Raising priority and cc'ing people who are working with hyperconverged Gluster storage. If sparse VM images are used with Gluster, this matters because a self-heal will use up space quickly. Performance will also suffer, because self-heal will have to do a lot more I/O.

Comment 7 Ravishankar N 2016-01-05 04:30:09 UTC
Hi Sas, I'm still unable to re-create the issue on release-3.7 branch with the steps mentioned in comment #5. I see that in the replaced brick, the file is still sparse. Can you try to recreate this and share the setup with me? Thanks!

Comment 8 Ravishankar N 2016-01-05 05:33:48 UTC
(In reply to Ben England from comment #6)
> So if sparse VM images are being used with Gluster, then
> this will matter because space will get used up fast by a self-heal.  Not
> only that but performance will suffer because self-heal will have to do a
> lot more I/O.
Hi Ben,
We have logic in AFR self-heal which will not write to the sink if the file is sparse *and* the checksums of the data blocks (as it loops through all the chunks of the file) are the same on the source and sink(s). There is also a possibility that the disk usage differs because of the way XFS preallocation works (see BZ 1277992). In any case, I'm going to work with Sas to see if comment #5 is repeatable and related to gluster.
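
To illustrate the idea (a minimal shell sketch only, not gluster's actual implementation; file names and chunk size are made up):

# Walk source and sink chunk by chunk; write to the sink only when the
# checksums differ. Reading a hole returns zeros, so if the sink reads
# zeros there too, the chunk is skipped and the hole is preserved.
CHUNK=131072                       # illustrative chunk size
size=$(stat -c %s source.img)
chunks=$(( (size + CHUNK - 1) / CHUNK ))
for ((i = 0; i < chunks; i++)); do
  src_sum=$(dd if=source.img bs=$CHUNK skip=$i count=1 2>/dev/null | md5sum)
  snk_sum=$(dd if=sink.img   bs=$CHUNK skip=$i count=1 2>/dev/null | md5sum)
  if [ "$src_sum" != "$snk_sum" ]; then
    dd if=source.img of=sink.img bs=$CHUNK skip=$i seek=$i count=1 conv=notrunc 2>/dev/null
  fi
done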

Comment 9 SATHEESARAN 2016-01-07 08:27:41 UTC
(In reply to Ravishankar N from comment #7)
> Hi Sas, I'm still unable to re-create the issue on release-3.7 branch with
> the steps mentioned in comment #5. I see that in the replaced brick, the
> file is still sparse. Can you try to recreate this and share the setup with
> me? Thanks!

Hi Ravi,

I have tested with glusterfs 3.7.6 on RHEL-7 (http://download.gluster.org/pub/gluster/glusterfs/3.7/3.7.6/EPEL.repo/epel-7Server/x86_64/) and I am not facing this issue.

Providing the test steps below (a rough command sketch follows at the end of this comment):

1. Created a replica 3 volume and optimized it for the virt-store use case
2. Started the volume and used it as a RHEV Data Domain (RHEV 3.5.6)
3. Killed one of the bricks of the volume
4. Created 3 AppVMs (images of size 15G, 20G, and 25G respectively) and installed RHEL 7.2 on them
5. Brought back the killed brick (by starting the volume with the force option)
6. Triggered self-heal
7. After self-heal completed, checked the image size on all the bricks

Observation: none of the image files on the 3 bricks lost its sparseness.
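
For reference, steps 3, 5, and 6 map roughly to the following (a sketch; <vol> is a placeholder, and the brick PID comes from volume status):

gluster volume status <vol>        # note the PID of the brick to kill
kill -9 <brick-pid>                # step 3: kill that brick process
gluster volume start <vol> force   # step 5: bring the brick back
gluster volume heal <vol> full     # step 6: trigger a full self-heal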

Comment 11 Ravishankar N 2016-01-07 08:56:59 UTC
Thanks Sas. I'm closing the bug. Matt, if you are still able to hit the issue in the latest bits, do feel free to raise a bug.

