Created attachment 742661 [details]
Gluster Server Log

Description of problem:
Currently unable to use gluster storage pools (backed by xfs filesystems) with KVM. I'm able to add a VirtIO disk without issue; however, when I try to create (or mount) an ext4 filesystem from the KVM guest, I get errors. If I use a gluster volume that is _not_ backed by xfs (e.g. ext4), everything works as expected.

I'm able to consistently reproduce on RHEL 6.4 x86_64 (2.6.32-358.6.1.el6.x86_64). Additionally, "cache mode" on the VirtIO disks is set to "none" (to allow for safe live migration).

Version-Release number of selected component (if applicable):
3.4.0alpha3

How reproducible:
100%

Steps to Reproduce:
1. Create an xfs filesystem on the gluster server (mkfs.xfs -i size=512 [LV name])
2. Mount the filesystem on the gluster server (mount [LV name] [mountpoint])
3. Create a gluster volume (gluster volume create xfs [xfs filesystem])
4. Start the gluster volume (gluster volume start xfs)
5. Define the gluster volume on the KVM host/gluster client; I'm using virt-manager.
6. Add a VirtIO disk from the gluster storage pool (format=raw, cachemode=none) to the KVM guest, again using virt-manager.
7. On the KVM guest, partition the new VirtIO disk (fdisk /dev/[new VirtIO disk])
8. On the KVM guest, create an ext4 filesystem (mkfs.ext4 /dev/[new VirtIO disk]1)
9. On the KVM guest, mount the new ext4 filesystem (mount /dev/[new VirtIO disk]1 [mountpoint])
10. Get errors on the KVM guest:

[root@rhel6-guest ~]# mount /dev/vdb1 /mnt/ext4
mount: you must specify the filesystem type
[root@rhel6-guest ~]#

Errors from the KVM host/gluster client:

[2013-05-02 11:50:58.466222] W [client-rpc-fops.c:2697:client3_3_readv_cbk] 0-xfs-client-0: remote operation failed: Invalid argument
[2013-05-02 11:50:58.466258] W [fuse-bridge.c:2049:fuse_readv_cbk] 0-glusterfs-fuse: 9272: READ => -1 (Invalid argument)
[2013-05-02 11:50:58.485255] W [client-rpc-fops.c:873:client3_3_writev_cbk] 0-xfs-client-0: remote operation failed: Invalid argument
[2013-05-02 11:50:58.485301] W [fuse-bridge.c:2127:fuse_writev_cbk] 0-glusterfs-fuse: 9273: WRITE => -1 (Invalid argument)

Errors from the gluster server:

[2013-05-02 11:50:58.462986] E [posix.c:1940:posix_readv] 0-xfs-posix: read failed on fd=0xc58228: Invalid argument
[2013-05-02 11:50:58.463024] I [server-rpc-fops.c:1489:server_readv_cbk] 0-xfs-server: 9013: READV 1 (50928256-564b-4d96-95b8-13ac2d638f10) ==> (Invalid argument)
[2013-05-02 11:50:58.482077] E [posix.c:2135:posix_writev] 0-xfs-posix: write failed: offset 32256, Invalid argument
[2013-05-02 11:50:58.482114] I [server-rpc-fops.c:1439:server_writev_cbk] 0-xfs-server: 9014: WRITEV 1 (50928256-564b-4d96-95b8-13ac2d638f10) ==> (Invalid argument)

Actual results:
Unable to mount/create ext4 filesystems on a KVM guest with xfs-backed gluster volumes.

Expected results:
Successfully create ext4 filesystems.

Additional info:
I'll upload straces, gluster server logs and gluster client logs shortly.

Related to gluster-users thread:
http://www.gluster.org/pipermail/gluster-users/2013-April/035962.html
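For reference, the full sequence I run looks roughly like this, with hypothetical names (gluster1, /dev/vg0/xfslv, /export/xfs, /dev/vdb) standing in for the placeholders above:

# on the gluster server
mkfs.xfs -i size=512 /dev/vg0/xfslv
mount /dev/vg0/xfslv /export/xfs
gluster volume create xfs gluster1:/export/xfs
gluster volume start xfs

# on the KVM guest, after attaching the VirtIO disk (format=raw, cache=none)
fdisk /dev/vdb
mkfs.ext4 /dev/vdb1
mount /dev/vdb1 /mnt/ext4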
Created attachment 742662 [details] Gluster Client (KVM Host) Log
Created attachment 742663 [details] Strace of glusterd during read/write errors
I just updated my gluster server and gluster client (KVM host) to 3.4.0beta1. Unfortunately, I'm getting the exact same errors. I'll upload an strace, gluster server log and gluster client log shortly.
Created attachment 745334 [details] Gluster Client (KVM Host) Log (3.4.0beta1)
Created attachment 745335 [details] Gluster Server Log (3.4.0beta1)
Created attachment 745336 [details] Strace of glusterd during read/write errors (3.4.0beta1)
Confirmed that I'm getting the same errors with 3.4.0beta3. I'm attaching the gluster server brick log, gluster client log and strace of glusterd PID (on server).
Created attachment 759927 [details]
Gluster Server Log (3.4.0beta3)

Gluster Server (brick) log from 3.4.0beta3.
Created attachment 759928 [details]
Gluster Client (KVM Host) Log (3.4.0beta3)

Gluster client log from KVM host (3.4.0beta3).
Created attachment 759929 [details]
Strace of glusterd during read/write errors (3.4.0beta3)

Running strace -ttTf -o [file] -p [pid] against the glusterfsd process.
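In case it helps anyone reproducing this, locating the right PID is straightforward (volume name xfs as above; the output file is just an example):

gluster volume status xfs       # lists the PID of each brick's glusterfsd
strace -ttTf -o /tmp/glusterfsd_strace.out -p [pid of glusterfsd]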
Just for some clarification: the 3.4.0beta3 strace output is from glusterfsd (not glusterd). Sorry for any confusion.
Just confirmed that I'm getting the same behavior with 3.4.0beta4. I increased the log level to "TRACE" on the one xfs brick, started an strace and duplicated the same errors. I'll upload my strace and gluster logs shortly.
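For completeness, the brick log level was raised with the standard volume option (volume name xfs as above):

gluster volume set xfs diagnostics.brick-log-level TRACE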
Created attachment 768972 [details] gluster brick log with tracing enabled (3.4.0beta4)
Created attachment 768973 [details] gluster client (KVM host) log (3.4.0beta4)
Created attachment 768974 [details] Strace of glusterfsd during read/write errors (3.4.0beta4)
I was able to reproduce the same problem on 3.4.0 (GA). This time I increased the log level to "TRACE" on both the server brick and the client. I'll upload logs (and an strace of glusterfsd) shortly.
Created attachment 773982 [details] gluster brick log with tracing enabled (3.4.0 GA)
Created attachment 773983 [details] gluster client (KVM host) log with tracing enabled (3.4.0 GA)
Created attachment 773984 [details] Strace of glusterfsd during read/write errors (3.4.0 GA)
I'm able to reproduce this on 3.4.1. I'll upload another round of straces and gluster (server+client) logs shortly.
Created attachment 812713 [details]
gluster brick log with tracing enabled (3.4.1 GA)

diagnostics.brick-log-level = DEBUG
Created attachment 812714 [details]
gluster client (KVM host) log with tracing enabled (3.4.1 GA)

diagnostics.client-log-level = DEBUG
Created attachment 812715 [details]
Strace of glusterfsd during read/write errors (3.4.1 GA)

strace -ttTf -p [pid of glusterfsd] -o /tmp/glusterfsd_server_3.4.1_strace.20131015.out
I noticed that all disks in my gluster server(s) have 4K sectors. Using an older disk with 512-byte sectors, I did _not_ get any errors on my gluster client / KVM host. I switched back to using my newer 4K drives and manually set the XFS sector size (sectsz) to 512. With the manually set sector size of 512, everything worked as expected.

I think I might be hitting some sort of qemu/libvirt bug. However, all of the bugs I found that sound similar[1][2] have already been fixed in RHEL6.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=608548
[2] https://bugzilla.redhat.com/show_bug.cgi?id=748902
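For anyone checking their own hardware, sector sizes can be compared per disk, and the workaround applied at mkfs time (device names are examples; the mkfs destroys existing data):

blockdev --getss --getpbsz /dev/sdb           # logical and physical sector size
cat /sys/block/sdb/queue/logical_block_size
cat /sys/block/sdb/queue/physical_block_size

mkfs.xfs -f -i size=512 -s size=512 /dev/vg0/xfslv    # force 512-byte XFS sectors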
I'm unsure what would cause an EINVAL when writing to an image file on an XFS brick. The only thing I know of that can cause this is reading/writing past the EOF. It is possible that the sparse file has not been completely created (similar to bug 1058663).

In the logs of the fuse client, we can observe that the write-behind translator is enabled:

13: volume xfs-write-behind
14:     type performance/write-behind
15:     subvolumes xfs-dht
16: end-volume

Because the problem from bug 1058663 only happens irregularly, I would be interested in any test results with glusterfs-3.5.1, or with write-behind disabled. Could you maybe provide that?

From attachment #812715 [details], strace of the brick:

Open test.img returns fd=14:

55581 19:30:48.112025 open("/gluster/xfs/test.img", O_RDWR|O_DIRECT|O_CLOEXEC <unfinished ...>
55539 19:30:48.112044 <... readv resumed> [{"\0\0A4\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\f\0\5\363\227\0\0\0$"..., 120}], 1) = 120 <0.000026>
55581 19:30:48.112073 <... open resumed> ) = 14 <0.000034>

pread(), fstat() and fgetxattr() work; these are the last successful calls:

55545 19:30:59.579685 pread(14, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 0) = 4096 <0.000095>
55545 19:30:59.579837 fstat(14, {st_mode=S_IFREG|0600, st_size=1073741824, ...}) = 0 <0.000012>
55545 19:30:59.579902 fgetxattr(14, "trusted.gfid", "\x1ef\xcb\x10\x10\x1fM\x12\xbd\xb0\xb6*\xd3\xdc\xe6\xd9", 16) = 16 <0.000013>

pread() starts to return EINVAL (note the huge offset?):

55581 19:30:59.581812 pread(14, 0x7f2b155be000, 4096, 1073380864) = -1 EINVAL (Invalid argument) <0.000012>
...
55545 19:30:59.587159 pread(14, 0x7f2b155be000, 4096, 1073380864) = -1 EINVAL (Invalid argument) <0.000012>

pwrite() starts to fail too (much smaller offset):

55581 19:31:04.623451 pwrite(14, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 32256 <unfinished ...>
...
55581 19:31:04.623497 <... pwrite resumed> ) = -1 EINVAL (Invalid argument) <0.000022>
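For reference, write-behind can be turned off per volume for such a test (volume name xfs as used in the reports):

gluster volume set xfs performance.write-behind off
gluster volume info xfs | grep write-behind      # verify the option took effect
gluster volume set xfs performance.write-behind on    # re-enable afterwards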
From 'man 2 write':

       EINVAL fd is attached to an object which is unsuitable for writing;
              or the file was opened with the O_DIRECT flag, and either the
              address specified in buf, the value specified in count, or
              the current file offset is not suitably aligned.

The offset could be after EOF, but O_DIRECT is used to open the file. Maybe there are some alignment issues...
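This alignment failure is easy to demonstrate outside of gluster. A minimal sketch, assuming an XFS filesystem created with a 4096-byte sector size (device and paths below are examples, not from this setup): direct I/O in units of 512 bytes is rejected, while 4096-byte units succeed.

mkfs.xfs -f -s size=4096 /dev/vg0/testlv
mount /dev/vg0/testlv /mnt/test

dd if=/dev/zero of=/mnt/test/file bs=512 count=1 oflag=direct
# expected to fail with "Invalid argument" (EINVAL), like the brick's pwrite()

dd if=/dev/zero of=/mnt/test/file bs=4096 count=1 oflag=direct
# expected to succeed

This would be consistent with qemu (cache=none) issuing 512-byte-aligned direct I/O against a brick filesystem whose sectsz is 4096.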
(In reply to Niels de Vos from comment #25)
> Because the problem from bug 1058663 only happens irregularly, I would be
> interested in any test results with glusterfs-3.5.1, or with write-behind
> disabled. Could you maybe provide that?

Unfortunately, I no longer have access to this hardware setup. Let me see if I can try to build out something similar. (It might take me a little bit.)
(In reply to Niels de Vos from comment #26)
> The offset could be after EOF, but O_DIRECT is used to open the file. Maybe
> there are some alignment issues...

And FWIW, changing some of the mkfs.xfs attributes (mkfs.xfs -f -i size=512 -s size=512) fixed my problem.

Based on this information, do you still want me to try to reproduce with 3.5.1?
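FWIW, the sector size the brick filesystem ended up with can be double-checked with xfs_info (mountpoint is an example):

xfs_info /export/xfs | grep sectsz
# with the workaround this should report sectsz=512 rather than sectsz=4096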
(In reply to JYundt from comment #28)
> (In reply to Niels de Vos from comment #26)
>
> > The offset could be after EOF, but O_DIRECT is used to open the file. Maybe
> > there are some alignment issues...
>
> And FWIW, changing some of the mkfs.xfs attributes (mkfs.xfs -f -i size=512
> -s size=512) fixed my problem.
>
> Based on this information, do you still want me to try to reproduce with
> 3.5.1?

No, that is not needed. This definitely sounds like an alignment issue.

It would be interesting to know if the errors in the KVM guest also happen when the filesystem of the brick is used directly to store the images. The bricks return the same values as the output of 'stat -f /bricks/dir/....' when a volume gets mounted (by fuse or through the api). If the blocksize differs between bricks, I would not be surprised if the result is unstable.

We would need at least the following info to understand more about this (example commands follow below):
1. does the issue happen without gluster, i.e. directly on the filesystem of the brick?
2. output of xfs_info from the brick filesystem
3. output of dumpe2fs from the ext4 filesystem on the same storage/disk

If you cannot provide the details, or have no interest in digging further, let us know and we'll close the bug. [Comment #24 and comment #28 contain a fix/workaround.]
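For reference, items 2 and 3 could be collected with something like the following (device and mountpoint names are examples):

xfs_info /bricks/xfs            # sector/block sizes of the XFS brick
dumpe2fs -h /dev/vg0/ext4lv     # superblock summary of the ext4 filesystem

For item 1, an O_DIRECT dd read/write directly on the brick filesystem (as sketched in comment #26) would show whether the EINVAL occurs without gluster in the path.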
(In reply to Niels de Vos from comment #29)
> (In reply to JYundt from comment #28)
> > (In reply to Niels de Vos from comment #26)
>
> If you cannot provide the details, or have no interest in digging further,
> let us know and we'll close the bug.

I apologize, but due to the lack of hardware, I don't think I'll be able to reproduce this in a timely fashion. Do you want to close it in the meantime?
Okay, yes, let's close this for now. Feel free to re-open this bug or file a new one when more details become available (also see comment #29).