Bug 762692 - (GLUSTER-960) Disk space leakage and fs corruption after VMs are vmotion'ed
Status: CLOSED CURRENTRELEASE
Product: GlusterFS
Classification: Community
Component: replicate
Version: 3.1-alpha
Platform: All Linux
Priority: urgent
Severity: high
Assigned To: Pavan Vilas Sondur
Duplicates: GLUSTER-508 GLUSTER-958 GLUSTER-959 GLUSTER-986 GLUSTER-1064 763087
Blocks: GLUSTER-1737
 
Reported: 2010-05-27 02:10 EDT by Shehjar Tikoo
Modified: 2015-12-01 11:45 EST
CC: 6 users

Doc Type: Bug Fix
Regression: RTP
Mount Type: nfs


Attachments
Volfiles and logs (6.70 KB, application/octet-stream) - 2010-05-26 23:10 EDT, Shehjar Tikoo
Linux NFS client df output (130.53 KB, image/png) - 2010-05-27 00:10 EDT, Shehjar Tikoo
vmkernel df output display (94.02 KB, image/png) - 2010-05-27 00:11 EDT, Shehjar Tikoo
tcpdump showing the network trace (814.62 KB, application/octet-stream) - 2010-05-27 00:24 EDT, Shehjar Tikoo
Scenario 1 DEBUG log (205.37 KB, application/octet-stream) - 2010-05-31 01:54 EDT, Shehjar Tikoo
Scenario1 data heal full algo TRACE log (942.75 KB, application/octet-stream) - 2010-05-31 04:30 EDT, Shehjar Tikoo
fuse-repl-corruption (772.01 KB, application/octet-stream) - 2010-06-01 05:33 EDT, Shehjar Tikoo

Description Shehjar Tikoo 2010-05-27 00:10:16 EDT
The attached PNGs show that the file system size reported over NFS differs between the vmkernel client and the Linux client, either in what the server returns or in how each client calculates it.
Comment 1 Shehjar Tikoo 2010-05-27 00:10:48 EDT
Created attachment 219 [details]
Linux NFS client df output
Comment 2 Shehjar Tikoo 2010-05-27 00:11:18 EDT
Created attachment 220 [details]
vmkernel df output display
Comment 3 Shehjar Tikoo 2010-05-27 00:13:38 EDT
(In reply to comment #1)
> The attached pngs show that there is some kind of difference in the way file
> system size is reported by nfs or calculated by the vmkernel client and the
> linux client.

The available capacity and used space values are similar across the two clients, but the free space values are off by 2G.
Comment 4 Shehjar Tikoo 2010-05-27 00:24:58 EDT
Created attachment 221 [details]
tcpdump showing the network trace

The network trace includes the FSSTAT NFS requests, which show the server returning the file system size values.
Comment 5 Shehjar Tikoo 2010-05-27 02:10:40 EDT
After setting up 4 ESX hosts to see the common Gluster storage, I was able to initiate and complete vmotion of a test VM between two ESX hosts multiple times. I also brought volume replicas down/up a few times before and concurrently with vmotion. After running overnight, the VM's guest OS started showing signs of disk space leakage. I suspected root fs corruption and rebooted the VM, discovering that the root fs was indeed corrupted. Naïve recovery with 'fsck -y /dev/sda2' did bring the fs into a consistent state, but on the subsequent reboot the VM hung before it could boot the OS (extensive corruption?).
Comment 6 Shehjar Tikoo 2010-05-28 00:12:08 EDT
#######################################################
################# DD RUN ON UBUNTU VM #################
#######################################################


#########  BEFORE RUNNING DD ##################
shehjart@shehjart-desktop:~$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              12G  2.2G  8.6G  21% /
none                  243M  180K  243M   1% /dev
none                  248M  248K  247M   1% /dev/shm
none                  248M   84K  247M   1% /var/run
none                  248M     0  248M   0% /var/lock
none                  248M     0  248M   0% /lib/init/rw
none                   12G  2.2G  8.6G  21% /var/lib/ureadahead/debugfs


#########  DD RUN ##################
shehjart@shehjart-desktop:~$ dd if=/dev/zero of=testfile bs=64k count=10000
10000+0 records in
10000+0 records out
655360000 bytes (655 MB) copied, 44.8151 s, 14.6 MB/s

#########  AFTER RUNNING DD ##################
shehjart@shehjart-desktop:~$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              12G  2.8G  7.9G  27% /
none                  243M  180K  243M   1% /dev
none                  248M  248K  247M   1% /dev/shm
none                  248M   84K  247M   1% /var/run
none                  248M     0  248M   0% /var/lock
none                  248M     0  248M   0% /lib/init/rw
none                   12G  2.8G  7.9G  27% /var/lib/ureadahead/debugfs

#########  REMOVING DD-CREATE FILE ##################
shehjart@shehjart-desktop:~$ rm testfile

#########  AFTER REMOVING FILE  ##################
shehjart@shehjart-desktop:~$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              12G  2.2G  8.6G  21% /
none                  243M  180K  243M   1% /dev
none                  248M  248K  247M   1% /dev/shm
none                  248M   84K  247M   1% /var/run
none                  248M     0  248M   0% /var/lock
none                  248M     0  248M   0% /lib/init/rw
none                   12G  2.2G  8.6G  21% /var/lib/ureadahead/debugfs
shehjart@shehjart-desktop:~$





#######################################################
################# LINUX MOUNT POINT ###################
#######################################################

###########   BEFORE RUNNING DD IN VM   ###############
root@indus:/# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              69G  9.3G   57G  15% /
udev                  1.9G  308K  1.9G   1% /dev
none                  1.9G  1.6M  1.9G   1% /dev/shm
none                  1.9G  156K  1.9G   1% /var/run
none                  1.9G     0  1.9G   0% /var/lock
none                  1.9G     0  1.9G   0% /lib/init/rw
none                   69G  9.3G   57G  15% /var/lib/ureadahead/debugfs
/dev/sda6             215G  107G   97G  53% /home
192.168.1.191:/dist    43G   36G  5.2G  88% /mnt




##########   AFTER REMOVING DD-CREATED FILE IN VM ##########
root@indus:/# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              69G  9.3G   57G  15% /
udev                  1.9G  308K  1.9G   1% /dev
none                  1.9G  1.6M  1.9G   1% /dev/shm
none                  1.9G  156K  1.9G   1% /var/run
none                  1.9G     0  1.9G   0% /var/lock
none                  1.9G     0  1.9G   0% /lib/init/rw
none                   69G  9.3G   57G  15% /var/lib/ureadahead/debugfs
/dev/sda6             215G  107G   97G  53% /home
192.168.1.191:/dist    39G   33G  3.6G  91% /mnt
root@indus:/# 


Disk space used on the mount is not returned even after the file is removed, although the VM's df output says the disk space was reclaimed.
Comment 7 Shehjar Tikoo 2010-05-31 00:14:22 EDT
More info from user:
A couple of questions about the disk space leakage:

1. When the VM was moved to the new esx, was the datastore on the 
destination ESX:

-> the same NFS server as the source ESX?
-> different NFS server, but same namespace/bricks?
-> different NFS server and different namespace/bricks?

(1) All the Vmotion sequences were between two ESXis, and both ESXis had replicas of the vdisk file. Each ESXi NFS-mounted a single instance of GlusterFS through its own NFS server (running in the local storage VM). I am not sure what you mean by "the same namespace/brick", but I can say that all the storage VMs used the same local_triple.vol and nfs_triple.vol files (the latter defining a single top-level volume), the configuration was completely symmetric, and only one GlusterFS filesystem instance was exported/mounted. So the destination ESXi accessed the same instance of GlusterFS through a different NFS server (compared to the source ESXi).

I have performed several Vmotion sequences, back and forth between the two ESXis mentioned above. Some Vmotion sequences were done while the replica on the destination ESXi was reachable, while other sequences were done with the replica on the destination ESXi down. 


2. How did you come to the conclusion that the disk space was leaking?
Did an fs operation report ENOSPC, or was it just an observation based on
the output of df/du/ls?

The disk space leak was observed in the guest OS's root filesystem in the test VM, on the virtual disk stored in the NFS datastore backed by GlusterFS. It was likely due to corruption of the filesystem's metadata (block allocation tables) on the vdisk. I performed several 'dd if=/dev/zero of=/tmp/fileX bs=1M count=1000; rm /tmp/fileX' steps, and eventually I got ENOSPC from dd, although of course the free space should not have changed.
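For illustration, the fill/remove cycle can be scripted as below (a minimal sketch; the path and iteration count are arbitrary):

for i in $(seq 1 100); do
    # dd starts failing with ENOSPC once the leak has consumed the free space
    dd if=/dev/zero of=/tmp/fileX bs=1M count=1000 || break
    rm /tmp/fileX
done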

No space leak in GlusterFS itself was observed as far as I could tell; the vdisk was just a file in GlusterFS, and although its contents were corrupted, it was there and accessible.


3. If the destination VM's datastore was a different NFS 
server+different namespace, where was the disk leakage observed? The 
source or the destination NFS?

Not sure I understood the question entirely; at the time the disk leak/corruption was observed, the test VM had been returned to the initial ESXi host (where it was created), after having been moved back and forth between the two ESXis a few times.
Comment 8 Shehjar Tikoo 2010-05-31 00:21:58 EDT
The problem is not only with vmotion'ing; it also occurs in simple self-heal
scenarios. Even with just two replicas and a dd run there are two problems.

Setup: an NFS export of a 2-replica volume, no performance translators, just
io-threads on top of posix.

Scenario 1:
1. Start a dd run on the NFS mount.
2. Bring down the replica that is not the NFS server (the NFS server is itself
one of the replicas).
3. After a few seconds, restart the downed replica.
4. dd finishes.
5. The file on the downed replica is the same size as on the NFS server
replica, but corrupted. (A shell sketch of this scenario follows after
Scenario 2.)

Scenario 2:
1. Start a dd run on the NFS mount.
2. Bring down the replica that is not the NFS server (the NFS server is itself
one of the replicas).
3. dd finishes.
4. Bring up the downed replica.
5. Do ls -lRh on the mount point.
6. Check the file on the downed replica: it has not been self-healed.
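A minimal shell sketch of Scenario 1 (the hostnames replica1/replica2, the mount point /mnt/nfs, the backend export path /data/export and the volfile path are assumptions, not taken from this report):

#!/bin/sh
# replica2 is the brick server that is NOT also the NFS server.
dd if=patternfile of=/mnt/nfs/testfile bs=64k &               # step 1: dd on the NFS mount
DD_PID=$!

ssh replica2 'killall glusterfsd'                             # step 2: bring down the other replica
sleep 10
ssh replica2 'glusterfsd -f /etc/glusterfs/local_triple.vol'  # step 3: restart it

wait $DD_PID                                                  # step 4: dd finishes

# step 5: sizes match on both backends, but checksums differ when corrupted
ssh replica1 'ls -l /data/export/testfile && md5sum /data/export/testfile'
ssh replica2 'ls -l /data/export/testfile && md5sum /data/export/testfile'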
Comment 9 Shehjar Tikoo 2010-05-31 00:30:40 EDT
(In reply to comment #8)
> Scenario 1:
> 1. Start a dd run on the NFS mount.
> 2. Bring down the replica that is not the NFS server (the NFS server is
> itself one of the replicas).
> 3. After a few seconds, restart the downed replica.
> 4. dd finishes.
> 5. The file on the downed replica is the same size as on the NFS server
> replica, but corrupted.
> 


In this scenario, the self-heal should take place as soon as the replica comes back up because there is an open fd on the file.
Comment 10 Shehjar Tikoo 2010-05-31 01:54:36 EDT
Created attachment 222 [details]
Scenario 1 DEBUG log

The log is from a run where the self-healed file was corrupted after the dd run ended. This is for scenario 1 described in comment 8.
Comment 11 Shehjar Tikoo 2010-05-31 02:28:42 EDT
For a TRACE level log where the self-healed file on downed replica is corrupt,
see:

dev:/share/tickets/960/nfs-repl-corrupted-trace.log.bz2

To find the point where the self-heal should begin (the replica comes back up
at that point), search for lines with the timestamp:

2010-05-31 14:41:01
Comment 12 Shehjar Tikoo 2010-05-31 02:45:57 EDT
FWIW, I am not able to reproduce it in TRACE mode with the default number of pipelined requests in the Linux NFS client, i.e. 16. With TRACE, I had to increase the number to 128 to get the data corruption in the self-healed file.

That points to a race condition or a wrong assumption about the behaviour of write fops.
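For reference, the pipelined request count is the sunrpc tcp_slot_table_entries sysctl, and it must be set before the mount is made. The mount command below reuses the server from the df output above:

 # echo 128 > /proc/sys/sunrpc/tcp_slot_table_entries
 # mount -t nfs 192.168.1.191:/dist /mnt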
Comment 13 Shehjar Tikoo 2010-05-31 04:30:47 EDT
Created attachment 223 [details]
Scenario1 data heal full algo TRACE log

I see the data corruption even when running with the full self-heal algo. This is the TRACE log for that run.
Comment 14 Shehjar Tikoo 2010-05-31 04:31:49 EDT
The previous attachment is a partial log. I'll attach the complete one if needed.
Comment 15 Shehjar Tikoo 2010-05-31 22:05:58 EDT
Here is another run.

This time both replicas are corrupted. The replica that was brought down has the following contents:
[root@client8 shehjart]# hexdump -x testdir1/pat9-1g
0000000    0a39    0a39    0a39    0a39    0a39    0a39    0a39    0a39
*
22910000    0000    0000    0000    0000    0000    0000    0000    0000
*
22aee000    0a39    0a39    0a39    0a39    0a39    0a39    0a39    0a39
*
237b5000    0000    0000    0000    0000    0000    0000    0000    0000
*
37384000    0a39    0a39    0a39    0a39    0a39    0a39    0a39    0a39
*
37385000


The one that stayed up throughout the run has the following contents:
root@client10:/home/shehjart# hexdump -x testdir1/pat9-1g
0000000    0a39    0a39    0a39    0a39    0a39    0a39    0a39    0a39
*
22930000    0000    0000    0000    0000    0000    0000    0000    0000
*
22aee000    0a39    0a39    0a39    0a39    0a39    0a39    0a39    0a39
*
237b5000    0000    0000    0000    0000    0000    0000    0000    0000
*
37384000    0a39    0a39    0a39    0a39    0a39    0a39    0a39    0a39
*
37385000


The log is in dev:/share/tickets/960/nfs-repl-source-sink-corrupt-trace.log.bz2
Comment 16 Shehjar Tikoo 2010-06-01 05:33:05 EDT
Contrary to the earlier comment, I can see the corruption when running with FUSE as well. It can be reproduced by simulating the load generated by an NFS client with 128 tcp_slot_table_entries, i.e. a load with a large number of batched write requests.

Over NFS, the corruption does not occur when tcp_slot_table_entries is reduced to 4.
Over FUSE, this load can be generated by loading write-behind over replicate and setting window-size to 1GB. Then, over the FUSE mount, run a dd of 1GB.
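A minimal volfile sketch of that FUSE-side stack (the volume names and the subvolumes brick1/brick2 are assumptions, not from this bug):

volume afr
  type cluster/replicate
  subvolumes brick1 brick2
end-volume

volume wb
  type performance/write-behind
  # large window to batch many writes, mimicking the NFS client with 128 slots
  option window-size 1GB
  subvolumes afr
end-volume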

DO NOT USE /DEV/ZERO AS INPUT FOR DD. GENERATE YOUR OWN PATTERN FILE.

Here is how to do so:

 $ echo "9" > pat9;
 $ for i in $(seq 1 512); do cat pat9 >> pat9-1k; done
 $ for i in $(seq 1 1024); do cat pat9-1k >> pat9-1m; done
 $ for i in $(seq 1 1024); do cat pat9-1m >>  pat9-1g; done


Use the pat9-1g file as input.
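For example (the mount path is an assumption):

 $ dd if=pat9-1g of=/mnt/glusterfs/testdir1/pat9-1g bs=64k

The reason to avoid /dev/zero: corruption that manifests as zero-filled regions, as in the hexdumps above, is indistinguishable from correct data when the input itself is all zeroes.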

Log file fuse-repl-corruption is attached.
Comment 17 Shehjar Tikoo 2010-06-01 05:33:59 EDT
Created attachment 226 [details]
fuse-repl-corruption log
Comment 18 Pavan Vilas Sondur 2010-06-03 04:43:51 EDT
Here are some of my findings:
* Initially, frames had the lock-owner field unset (nfs did not set it), which resulted in unexpected behavior from locks and led to corruption.

* After nfs started issuing unique values to frame->lk_owner, with two patches:
http://patches.gluster.com/patch/3370/
http://patches.gluster.com/patch/3369/
which unlock with a matching lk-owner and do not allow overlapping locks at all, the corruption went away. (A sketch of the lk-owner idea follows at the end of this comment.)

* Although the corruption went away, afr seems to be healing more than once:
when afr gets a write after a subvolume has come back up, it proceeds to do a flush + post-post-op self-heal and issues a full-file lock.

* Until the first one is granted, there can be many such "write" requests which decide to do the post-post-op and block on a full-file lock. They should be harmless: once they get the lock, they see that self-heal is not required and just carry on.

* However, multiple self-heals happen back to back, though not in parallel. To add to our woes, nfs, which sends a max of 128 writes (as set in our test), resends them, and the vicious self-heal cycle goes on; glusterfs never recovers and appears hung. It is actually spinning on the many self-heals.

* I added a patch which allows only the first write to trigger a self-heal. It worked and dd finished without problems. We need to look into how we clear the pending flags after the post-post-op.

* While running tests now, we saw that a server, once restarted, is getting inodelk requests. This can cause corruption, and I suspect it is the cause of the latest corruption reported above by Shehjar. In this case the self-heal lock is granted on the second subvolume while write locks (range inodelks) go to the first subvolume, resulting in I/O while self-heal is in progress.
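A minimal C sketch of the unique lk-owner idea mentioned above (the function name and counter are hypothetical, not the actual patch):

/* Give each request frame its own lock owner, so the locks translator
 * can tell lock holders apart and match each unlock to the lock that
 * was taken with the same owner. */
static uint64_t lk_owner_ctr = 1;

static void
nfs_frame_set_lk_owner (call_frame_t *frame)
{
        frame->root->lk_owner = lk_owner_ctr++;
}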
Comment 19 Pavan Vilas Sondur 2010-06-03 16:22:41 EDT
AFR has a race window when a server goes up or down; among other things, the following can race:
1. One write racing with another
2. One or more writes racing with a flush
3. A self-heal racing with any FOP
etc.

I mention these because they are relevant to the use case in this bug. In fact, it can happen with any transaction; it is not easily seen because it requires a window in which a server goes down while a write is in progress alongside competing writes, or a server goes up/down during overlapping writes (self-heal and other writes included).

The bug is in afr_lock()
int32_t afr_lock (call_frame_t *frame, xlator_t *this)
{
        afr_pid_save (frame);

        frame->root->pid = (long) frame->root;

        return afr_lock_rec (frame, this, 0);      <<<<< 0 is the starting child_index
}


static
int afr_lock_rec (call_frame_t *frame, xlator_t *this, int child_index)
{
...
...
...

                /* skip over children that are down
                   or don't have the fd open */

                while ((child_index < priv->child_count)
                       && (!local->child_up[child_index]
                           || !fd_ctx->opened_on[child_index]))

                        child_index++;              <<<<


It's a race because local->child_up[<down_server_child_index>] flips back to 1 when the downed server comes back up: operations that compute their lock set before and after that moment lock on different sets of children, so their locks need not conflict and overlapping writes/self-heal can run on the same file simultaneously.

...
...
...

This bug can cause corruption and even multiple simultaneous writes - even self-heal and writes together. I'll send a patch for this soon. Until then, set the afr options:
  option data-lock-server-count <no of subvols>
  option metadata-lock-server-count <no of subvols>
  option entry-lock-server-count <no of subvols>
Comment 20 Shehjar Tikoo 2010-06-04 01:58:54 EDT
(In reply to comment #19)
.....
.....
.....
> This bug can cause corruption and even multiple simultaneous writes - even
> self-heal and writes together. I'll send a patch for this soon. Until then,
> set the afr options:
>   option data-lock-server-count <no of subvols>
>   option metadata-lock-server-count <no of subvols>
>   option entry-lock-server-count <no of subvols>


For the record, the corruption is still there even with these options.
Comment 21 Pavan Vilas Sondur 2010-06-04 04:27:18 EDT
I've filed bug 762718 to track the afr locking problems.
Comment 22 Shehjar Tikoo 2010-06-18 03:07:47 EDT
Logs from the latest run, after applying the 3k-line patch from Pavan:
dev:/data/share/tickets/960/self-heal-corruption-logs.tar.bz2
Comment 23 Anand Avati 2010-07-04 22:52:59 EDT
PATCH: http://patches.gluster.com/patch/3420 in master (nfs: Assign incremental lk owner to frames)
Comment 24 Shehjar Tikoo 2010-07-19 08:40:31 EDT
Regression Tests described at:
http://test.gluster.com/show_bug.cgi?id=86
Comment 25 Shehjar Tikoo 2010-07-20 22:14:48 EDT
*** Bug 959 has been marked as a duplicate of this bug. ***
Comment 26 Shehjar Tikoo 2010-07-27 04:40:51 EDT
*** Bug 986 has been marked as a duplicate of this bug. ***
Comment 27 Pavan Vilas Sondur 2010-08-10 02:23:37 EDT
*** Bug 1064 has been marked as a duplicate of this bug. ***
Comment 28 Pavan Vilas Sondur 2010-08-10 02:28:04 EDT
*** Bug 508 has been marked as a duplicate of this bug. ***
Comment 29 Anand Avati 2010-08-22 22:27:44 EDT
PATCH: http://patches.gluster.com/patch/4247 in master (cluster/afr: Use 2 phase locking for transactions and self heal.)
Comment 30 Anand Avati 2010-08-22 22:27:52 EDT
PATCH: http://patches.gluster.com/patch/4248 in master (protocol/server: features/locks: Fix nonblocking entrylks, inodelks in locks and server.)
Comment 31 Vijay Bellur 2010-08-23 23:11:12 EDT
*** Bug 1355 has been marked as a duplicate of this bug. ***
Comment 32 Vijay Bellur 2010-08-23 23:18:04 EDT
*** Bug 958 has been marked as a duplicate of this bug. ***
Comment 33 Anand Avati 2010-08-26 00:09:15 EDT
PATCH: http://patches.gluster.com/patch/4299 in master (cluster/afr: Do not issue blocking locks in self-heal.)
Comment 34 Vijay Bellur 2010-09-03 01:04:21 EDT
PATCH: http://patches.gluster.com/patch/4517 in master (cluster/replicate: fix warnings during build.)
Comment 35 Shehjar Tikoo 2010-09-20 05:38:04 EDT
Re-opened because I just ran into this again while helping Raghu with a different bug.
Comment 36 Shehjar Tikoo 2010-09-20 05:38:27 EDT
Saw it on 3.1 alpha
Comment 37 Vijay Bellur 2010-09-20 08:13:28 EDT
PATCH: http://patches.gluster.com/patch/4872 in master (cluster/afr: print correct offset in debug message during diff self-heal.)
Comment 38 Vijay Bellur 2010-09-29 01:43:31 EDT
PATCH: http://patches.gluster.com/patch/5063 in master (replicate: use new style (un)locking in self heal of missing entries)
Comment 39 Amar Tumballi 2010-10-04 00:29:59 EDT
What is the status of this bug? I see a lot of patches going in for it. Because it's a blocker, can we see to it that the status is updated properly? I guess even if the bug still exists, let's take it up after 3.1.0. Let me know your opinion.
Comment 40 Shehjar Tikoo 2010-10-04 22:50:36 EDT
(In reply to comment #39)
> What is the status of this bug? I see a lot of patches going in for it.
> Because it's a blocker, can we see to it that the status is updated properly?
> I guess even if the bug still exists, let's take it up after 3.1.0. Let me
> know your opinion.

I have to test with the latest patches to confirm.
Comment 41 Shehjar Tikoo 2010-10-06 01:40:41 EDT
Still occurs on 3.1qa41. A failed replica is not getting self-healed when it recovers, whereas earlier the self-heal did get triggered but corrupted the destination. Now, even when the failed replica comes back up, the IO continues only on the other replica.
Comment 42 Shehjar Tikoo 2010-11-08 03:40:54 EST
Tested with 3.1.1qa5: it does not corrupt the self-healed file or allow parallel IO and self-heal. Thanks for playing.
