The attached PNGs show that there is some difference in the way the file system size is reported by NFS, or calculated, by the VMkernel client versus the Linux client.
Created attachment 219 [details] patch for printerdb
Created attachment 220 [details] collection of patches to a2ps
(In reply to comment #1)
> The attached pngs show that there is some kind of difference in the way file
> system size is reported by nfs or calculated by the vmkernel client and the
> linux client.

The available capacity and used space values are similar, but the free space values are off by 2G.
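For reference only (not part of the bug report): a minimal sketch of how a df-style client typically derives these numbers from the statvfs fields that an NFS FSSTAT reply gets mapped onto. The distinction between f_bfree ("free", including reserved blocks) and f_bavail ("available" to non-root) is one place where two clients could disagree by a couple of gigabytes if they pick different fields. The path /mnt is just a placeholder.

#include <stdio.h>
#include <sys/statvfs.h>

int
main (int argc, char *argv[])
{
        struct statvfs     s;
        const char        *path = (argc > 1) ? argv[1] : "/mnt";
        unsigned long long bs, total, free_blocks, avail;

        if (statvfs (path, &s) != 0) {
                perror ("statvfs");
                return 1;
        }

        /* df uses the fragment size when it is set */
        bs          = s.f_frsize ? s.f_frsize : s.f_bsize;
        total       = (unsigned long long) s.f_blocks * bs;  /* "Size"                          */
        free_blocks = (unsigned long long) s.f_bfree  * bs;  /* "free", incl. reserved blocks   */
        avail       = (unsigned long long) s.f_bavail * bs;  /* "Avail", usable by non-root     */

        printf ("size=%llu used=%llu free=%llu avail=%llu\n",
                total, total - free_blocks, free_blocks, avail);
        return 0;
}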
Created attachment 221 [details]
Patch for problem

The network trace includes the fsstat NFS requests, which show the server returning the file system size values.
After setting up 4 ESXs to see the common gluster storage, I was able to initiate and complete vmotion of a test VM between two ESX hosts multiple times. I also brought down/up volume replicas a few times before/concurrent with vmotion, etc. After running overnight, the VM's guest OS started showing signs of disk space leakage. I suspected root fs corruption and rebooted the VM to discover that the root fs is indeed corrupted. Naïve recovery with 'fsck -y /dev/sda2' did bring the fs into a consistent state, but on a subsequent reboot the VM hung before it could boot the OS (extensive corruption?).
#######################################################
################# DD RUN ON UBUNTU VM #################
#######################################################

######### BEFORE RUNNING DD ##################
shehjart@shehjart-desktop:~$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              12G  2.2G  8.6G  21% /
none                  243M  180K  243M   1% /dev
none                  248M  248K  247M   1% /dev/shm
none                  248M   84K  247M   1% /var/run
none                  248M     0  248M   0% /var/lock
none                  248M     0  248M   0% /lib/init/rw
none                   12G  2.2G  8.6G  21% /var/lib/ureadahead/debugfs

######### DD RUN ##################
shehjart@shehjart-desktop:~$ dd if=/dev/zero of=testfile bs=64k count=10000
10000+0 records in
10000+0 records out
655360000 bytes (655 MB) copied, 44.8151 s, 14.6 MB/s

######### AFTER RUNNING DD ##################
shehjart@shehjart-desktop:~$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              12G  2.8G  7.9G  27% /
none                  243M  180K  243M   1% /dev
none                  248M  248K  247M   1% /dev/shm
none                  248M   84K  247M   1% /var/run
none                  248M     0  248M   0% /var/lock
none                  248M     0  248M   0% /lib/init/rw
none                   12G  2.8G  7.9G  27% /var/lib/ureadahead/debugfs

######### REMOVING DD-CREATED FILE ##################
shehjart@shehjart-desktop:~$ rm testfile

######### AFTER REMOVING FILE ##################
shehjart@shehjart-desktop:~$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              12G  2.2G  8.6G  21% /
none                  243M  180K  243M   1% /dev
none                  248M  248K  247M   1% /dev/shm
none                  248M   84K  247M   1% /var/run
none                  248M     0  248M   0% /var/lock
none                  248M     0  248M   0% /lib/init/rw
none                   12G  2.2G  8.6G  21% /var/lib/ureadahead/debugfs
shehjart@shehjart-desktop:~$

#######################################################
################# LINUX MOUNT POINT ###################
#######################################################

########### BEFORE RUNNING DD IN VM ###############
root@indus:/# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              69G  9.3G   57G  15% /
udev                  1.9G  308K  1.9G   1% /dev
none                  1.9G  1.6M  1.9G   1% /dev/shm
none                  1.9G  156K  1.9G   1% /var/run
none                  1.9G     0  1.9G   0% /var/lock
none                  1.9G     0  1.9G   0% /lib/init/rw
none                   69G  9.3G   57G  15% /var/lib/ureadahead/debugfs
/dev/sda6             215G  107G   97G  53% /home
192.168.1.191:/dist    43G   36G  5.2G  88% /mnt

########## AFTER REMOVING DD-CREATED FILE IN VM ##########
root@indus:/# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              69G  9.3G   57G  15% /
udev                  1.9G  308K  1.9G   1% /dev
none                  1.9G  1.6M  1.9G   1% /dev/shm
none                  1.9G  156K  1.9G   1% /var/run
none                  1.9G     0  1.9G   0% /var/lock
none                  1.9G     0  1.9G   0% /lib/init/rw
none                   69G  9.3G   57G  15% /var/lib/ureadahead/debugfs
/dev/sda6             215G  107G   97G  53% /home
192.168.1.191:/dist    39G   33G  3.6G  91% /mnt
root@indus:/#

Disk space used on the mount is not returned even after the file is removed, although the VM's df output says the disk space was reclaimed.
More info from user:

Couple of questions about the disk space leakage:

1. When the VM was moved to the new ESX, was the datastore on the destination ESX:
   -> the same NFS server as the source ESX?
   -> a different NFS server, but the same namespace/bricks?
   -> a different NFS server and different namespace/bricks?

(1) All the Vmotion sequences were between two ESXis; both of the ESXis had replicas of the vdisk file. Each ESXi NFS-mounted a single instance of GlusterFS through its own NFS server (running in the local storage VM). I am not sure what you mean by "the same namespace/brick", but I can tell that all the storage VMs used the same local_triple.vol and nfs_triple.vol files (the latter defining a single top-level volume), the configuration was completely symmetric, and only one GlusterFS filesystem instance was exported/mounted. So, the destination ESX saw the same instance of GlusterFS through a different NFS server (as compared to the source ESXi).

I have performed several Vmotion sequences, back and forth between the two ESXis mentioned above. Some Vmotion sequences were done while the replica on the destination ESXi was reachable, while other sequences were done with the replica on the destination ESXi down.

2. How did you come to the conclusion that the disk space was leaking? Did a fs operation report ENOSPC, or was it just an observation based on the output of df/du/ls?

The disk space leak was observed in the guest OS's root filesystem in the test VM, on the virtual disk stored in the NFS datastore backed by GlusterFS. This was likely due to corruption of the filesystem's metadata (block allocation tables) on the vdisk. I performed several 'dd if=/dev/zero of=/tmp/fileX bs=1M count=1000; rm /tmp/fileX' steps, and eventually I got ENOSPC from dd, although of course the free space should not have changed. No space leak in GlusterFS itself was observed as far as I could tell; the vdisk was just a file in GlusterFS, and although its contents were corrupted, it was there and accessible.

3. If the destination VM's datastore was a different NFS server + different namespace, where was the disk leakage observed? The source or the destination NFS?

Not sure I understood the question entirely; at the time the disk leak/corruption was observed, the test VM had been returned to the initial ESXi host (where it was created), after having been moved back and forth between the two ESXis a few times.
The problem is not with vmotion'ing only, but also with simple self-heal scenarios. Even with just two replicas and a dd run there are two problems. Setup: an NFS export of a 2-replica volume, no perf translators, just io-threads on posix.

Scenario 1:
1. Start a dd run on the nfs mount.
2. Bring down the replica that is not the nfs server (the nfs server is also a replica).
3. After a few seconds, restart the downed replica.
4. dd finishes.
5. The file on the downed replica is the same size as on the nfs server replica but corrupted.

Scenario 2:
1. Start a dd run on the nfs mount.
2. Bring down the replica that is not the nfs server (the nfs server is also a replica).
3. dd finishes.
4. Bring up the downed replica.
5. Do ls -lRh on the mount point.
6. Check the file on the downed replica. It has not been self-healed.
(In reply to comment #8)
> Scenario 1:
> 1. Start a dd run on nfs mount.
> 2. Bring down the replica that is not the nfs server, since nfs server is also
> a replica.
> 3. After a few seconds, restart the downed replica.
> 4. dd finishes.
> 5. File on the downed replica is same size as the nfs server replica but
> corrupted.

In this scenario, the self-heal should take place as soon as the replica comes back up, because there is an open fd on the file.
Created attachment 222 [details]
patch for /etc/logrotate.d/apache

The log is of a run where the self-healed file was corrupted after a dd run ended. This is for scenario 1 described in comment 8.
For a TRACE level log where the self-healed file on the downed replica is corrupt, see: dev:/share/tickets/960/nfs-repl-corrupted-trace.log.bz2

To go to the point where the self-heal should begin (the replica comes back up at that point), search for lines with timestamp: 2010-05-31 14:41:01
FWIW, I am not able to reproduce this in TRACE mode with the default number of pipelined requests in the Linux NFS client, i.e. 16. With TRACE, I had to increase the number to 128 to get the data corruption in the self-healed file. That points to a race condition or a wrong assumption about the behaviour of write fops.
Created attachment 223 [details]
Spec file patch

I see the data corruption even when running with the full self-heal algo. This is the TRACE log for that run.
The previous attachment is a partial log. I'll attach the complete one if needed.
Here is another run. This time both the replicas are corrupted.

The replica that was downed has the following contents:

[root@client8 shehjart]# hexdump -x testdir1/pat9-1g
0000000    0a39    0a39    0a39    0a39    0a39    0a39    0a39    0a39
*
22910000    0000    0000    0000    0000    0000    0000    0000    0000
*
22aee000    0a39    0a39    0a39    0a39    0a39    0a39    0a39    0a39
*
237b5000    0000    0000    0000    0000    0000    0000    0000    0000
*
37384000    0a39    0a39    0a39    0a39    0a39    0a39    0a39    0a39
*
37385000

The one that stayed up through the run has the following contents:

root@client10:/home/shehjart# hexdump -x testdir1/pat9-1g
0000000    0a39    0a39    0a39    0a39    0a39    0a39    0a39    0a39
*
22930000    0000    0000    0000    0000    0000    0000    0000    0000
*
22aee000    0a39    0a39    0a39    0a39    0a39    0a39    0a39    0a39
*
237b5000    0000    0000    0000    0000    0000    0000    0000    0000
*
37384000    0a39    0a39    0a39    0a39    0a39    0a39    0a39    0a39
*
37385000

The log is in dev:/share/tickets/960/nfs-repl-source-sink-corrupt-trace.log.bz2
Contrary to my earlier comment, I can see the corruption when running with FUSE also. It can be reproduced by simulating the load generated by an NFS client with 128 tcp_slot_table_entries, i.e. a load where there is a large number of batched write requests. Over NFS the corruption does not occur when tcp_slot_table_entries is reduced to 4.

Over FUSE, this load can be generated by loading write-behind over replicate and setting window-size to 1GB. Then, over the FUSE mount, run a dd of 1GB.

DO NOT USE /DEV/ZERO AS INPUT FOR DD. GENERATE YOUR OWN PATTERN FILE. Here is how to do so:

$ echo "9" > pat9
$ for i in $(seq 1 512); do cat pat9 >> pat9-1k; done
$ for i in $(seq 1 1024); do cat pat9-1k >> pat9-1m; done
$ for i in $(seq 1 1024); do cat pat9-1m >> pat9-1g; done

Use the pat9-1g file as input. The log file fuse-repl-corruption is attached.
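For anyone reproducing this, here is a small helper I am adding for illustration (it is not part of the original test; the name and structure are my own). It scans a pat9-1g style file and prints the offset ranges that deviate from the repeating "9\n" pattern, i.e. the zero-filled holes visible in the hexdumps in the previous comment.

#include <stdio.h>

int
main (int argc, char *argv[])
{
        unsigned char      buf[65536];
        size_t             n, i;
        unsigned long long off = 0;
        long long          bad_start = -1;
        FILE              *f;

        if (argc < 2) {
                fprintf (stderr, "usage: %s <pattern-file>\n", argv[0]);
                return 1;
        }

        f = fopen (argv[1], "rb");
        if (!f) {
                perror ("fopen");
                return 1;
        }

        while ((n = fread (buf, 1, sizeof (buf), f)) > 0) {
                for (i = 0; i < n; i++, off++) {
                        /* even offsets must hold '9', odd offsets '\n' */
                        unsigned char want = (off % 2 == 0) ? '9' : '\n';

                        if (buf[i] != want && bad_start < 0)
                                bad_start = (long long) off;
                        else if (buf[i] == want && bad_start >= 0) {
                                printf ("corrupt range: 0x%llx - 0x%llx\n",
                                        (unsigned long long) bad_start, off);
                                bad_start = -1;
                        }
                }
        }
        if (bad_start >= 0)
                printf ("corrupt range: 0x%llx - EOF\n",
                        (unsigned long long) bad_start);

        fclose (f);
        return 0;
}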
Created attachment 226 [details] Patch for typo in mkbootdisk script
Here are some of my findings:

* Initially the frames had the lock-owner field unset (by nfs), which resulted in unexpected behavior from locks and led to corruption.

* After nfs started issuing unique values to frame->lk_owner, with two patches:
  http://patches.gluster.com/patch/3370/
  http://patches.gluster.com/patch/3369/
  which unlock with a matching lk-owner and do not allow overlapping locks at all, the corruption went away.

* Although the corruption went away, afr seems to be healing more than once. When afr gets a write after a subvolume has come back up, it proceeds to do a flush + post-post-op self-heal and issues a full file lock.

* Until the first lock is granted, there could be many such "write" requests which decide to do the post-post-op and block on a full file lock. They should be harmless: after they get the lock, they see that self-heal is not required and just carry on.

* However, multiple self-heals happen back to back, though not in parallel. To add to the woes, nfs, which sends a max of 128 writes (set in our test), resends them, so the vicious self-heal cycle goes on and glusterfs never recovers and appears hung. It is actually spinning on doing the many self-heals.

* I added a patch which allows only the first write to trigger a self-heal. It worked and dd finished without problems. Need to look into how we clear the pending flags after the post-post-op.

* While running tests now, we saw that a server, once restarted, is getting inodelk requests. This can cause corruption, which I suspect is the cause of the latest corruption reported above by Shehjar. In this case the self-heal lock is granted on the second subvolume while write locks (range inodelks) go to the first subvolume, resulting in I/O while self-heal is on.
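As a rough illustration only of what the two lk-owner patches above amount to (the struct and function names here are stand-ins, not the GlusterFS API): every outgoing frame is handed a unique, monotonically increasing lock owner so the locks translator can tell concurrent in-flight writes apart instead of treating them as one owner.

#include <stdint.h>
#include <stdio.h>
#include <pthread.h>

/* Stand-in for a call frame; only the lk_owner field matters here. */
typedef struct {
        uint64_t lk_owner;
} fake_frame_t;

static uint64_t        next_lk_owner  = 1;
static pthread_mutex_t lk_owner_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Give every frame a unique, monotonically increasing lock owner. */
static void
assign_unique_lk_owner (fake_frame_t *frame)
{
        pthread_mutex_lock (&lk_owner_mutex);
        frame->lk_owner = next_lk_owner++;
        pthread_mutex_unlock (&lk_owner_mutex);
}

int
main (void)
{
        fake_frame_t a = {0}, b = {0};

        assign_unique_lk_owner (&a);
        assign_unique_lk_owner (&b);
        printf ("lk_owner a=%llu b=%llu\n",
                (unsigned long long) a.lk_owner,
                (unsigned long long) b.lk_owner);
        return 0;
}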
AFR has a race, where the following can happen when a server goes up or down:

1. One write racing with another
2. A write (or writes) racing with a flush
3. A self-heal racing with any FOP
etc.

I mention only these as they are the ones relevant to the use-case in this bug. In fact it can happen with any transaction; it is not seen easily because it needs a window where a server goes down while a write is in progress with other competing writes, or a server goes up/down during overlapping writes (self-heal and other writes included).

The bug is in afr_lock():

int32_t
afr_lock (call_frame_t *frame, xlator_t *this)
{
        afr_pid_save (frame);

        frame->root->pid = (long) frame->root;

        return afr_lock_rec (frame, this, 0);   <<<<< 0 here is the child_index
}

static int
afr_lock_rec (call_frame_t *frame, xlator_t *this, int child_index)
{
        ...

        /* skip over children that are down or don't have the fd open */
        while ((child_index < priv->child_count)
               && (!local->child_up[child_index]
                   || !fd_ctx->opened_on[child_index]))
                child_index++;

        <<<< this is a race because local->child_up[<down_server_child_index>]
             becomes 1 when that server comes back up

        ...
}

This bug can cause corruption and even multiple simultaneous writes - even self-heal and writes together. I'll send a patch for this soon. Until then, set the afr options:

option data-lock-server-count <no of subvols>
option metadata-lock-server-count <no of subvols>
option entry-lock-server-count <no of subvols>
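For illustration only, using stand-in types and names rather than the real afr structures (and not the actual patch that followed): one way to close this particular window is to snapshot child_up[] when the transaction starts and have the lock-target selection consult only that snapshot, so a subvolume that flips up mid-transaction cannot change the set of children the transaction locks and writes.

#include <stdio.h>

#define MAX_CHILDREN 16

typedef struct {
        int          child_count;
        volatile int child_up[MAX_CHILDREN];   /* flipped by the notify path */
} fake_priv_t;

typedef struct {
        int txn_child_up[MAX_CHILDREN];        /* frozen per transaction */
} fake_local_t;

/* Copy the current view of which children are up, once, when the
 * transaction begins. */
static void
txn_snapshot_children (fake_local_t *local, fake_priv_t *priv)
{
        int i;

        for (i = 0; i < priv->child_count; i++)
                local->txn_child_up[i] = priv->child_up[i];
}

/* Lock-target selection reads only the snapshot, never the live array. */
static int
next_lock_target (fake_local_t *local, fake_priv_t *priv, int child_index)
{
        while (child_index < priv->child_count &&
               !local->txn_child_up[child_index])
                child_index++;

        return child_index;
}

int
main (void)
{
        fake_priv_t  priv  = { .child_count = 2, .child_up = { 1, 0 } };
        fake_local_t local;

        txn_snapshot_children (&local, &priv);

        /* child 1 comes back up in the middle of the transaction ... */
        priv.child_up[1] = 1;

        /* ... but the transaction still targets only child 0 */
        printf ("first lock target: %d\n", next_lock_target (&local, &priv, 0));
        printf ("next lock target:  %d (== child_count, i.e. none)\n",
                next_lock_target (&local, &priv, 1));
        return 0;
}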
(In reply to comment #19)
.....
.....
.....
> This bug can cause corruption and even multiple simultaneous writes - even self
> heal and writes together. I'll send a patch for this soon. Until then set afr
> options:
> option data-lock-server-count <no of subvols>
> option metadata-lock-server-count <no of subvols>
> option entry-lock-server-count <no of subvols>

For the record, the corruption is there even with these options.
I've filed bug 762718 to track the afr locking problems.
Logs from the latest run after applying the 3k-line patch from Pavan: dev:/data/share/tickets/960/self-heal-corruption-logs.tar.bz2
PATCH: http://patches.gluster.com/patch/3420 in master (nfs: Assign incremental lk owner to frames)
Regression Tests described at: http://test.gluster.com/show_bug.cgi?id=86
*** Bug 959 has been marked as a duplicate of this bug. ***
*** Bug 986 has been marked as a duplicate of this bug. ***
*** Bug 1064 has been marked as a duplicate of this bug. ***
*** Bug 508 has been marked as a duplicate of this bug. ***
PATCH: http://patches.gluster.com/patch/4247 in master (cluster/afr: Use 2 phase locking for transactions and self heal.)
PATCH: http://patches.gluster.com/patch/4248 in master (protocol/server: features/locks: Fix nonblocking entrylks, inodelks in locks and server.)
*** Bug 1355 has been marked as a duplicate of this bug. ***
*** Bug 958 has been marked as a duplicate of this bug. ***
PATCH: http://patches.gluster.com/patch/4299 in master (cluster/afr: Do not issue blocking locks in self-heal.)
PATCH: http://patches.gluster.com/patch/4517 in master (cluster/replicate: fix warnings during build.)
Re-opened because I just ran into this again while helping Raghu with a different bug.
Saw it on 3.1 alpha
PATCH: http://patches.gluster.com/patch/4872 in master (cluster/afr: print correct offset in debug message during diff self-heal.)
PATCH: http://patches.gluster.com/patch/5063 in master (replicate: use new style (un)locking in self heal of missing entries)
What is the status of this bug? I see a lot of patches going in for it. Because it is a blocker, can we see to it that the status is updated properly? I guess even if the bug still exists, let's take it up after 3.1.0. Let me know your opinion.
(In reply to comment #39)
> What is the status of this bug? I see lot of patches going in for the bug.
> Because its blocker, can we see to it that the status is updated properly. I
> guess even if the bug exists, lets take it after 3.1.0.. let me know your
> opinion.

Have to test with the latest patches to confirm.
Still occurs on 3.1qa41. A failed replica is not getting self-healed when it recovers, whereas earlier self-heal did get triggered but corrupted the destination. Now, even when the failed replica comes back up, the IO continues on the other replica.
Tested on 3.1.1qa5: it does not corrupt the self-healed file or allow parallel IO and self-heal. Thanks for playing.