Description of problem:
In a RHEV-RHGS hyperconverged environment, adding a disk to a VM from a glusterfs storage pool fails when glusterfs is running in posix/directIO mode.

The gluster volume is configured to run in directIO mode by adding "option o-direct on" in the /var/lib/glusterd/vols/gl_01/*.vol files. Example below:

volume gl_01-posix
    type storage/posix
    option o-direct on
    option brick-gid 36
    option brick-uid 36
    option volume-id c131155a-d40c-4d9e-b056-26c61b924c26
    option directory /bricks/b01/g
end-volume

When the option is removed and the volume is restarted, disks can be added to the VM from the glusterfs pool.

Version-Release number of selected component (if applicable):
RHEV version is RHEV 3.6
glusterfs-client-xlators-3.7.5-11.el7rhgs.x86_64
glusterfs-cli-3.7.5-11.el7rhgs.x86_64
glusterfs-libs-3.7.5-11.el7rhgs.x86_64
glusterfs-3.7.5-11.el7rhgs.x86_64
glusterfs-api-3.7.5-11.el7rhgs.x86_64
glusterfs-fuse-3.7.5-11.el7rhgs.x86_64
glusterfs-server-3.7.5-11.el7rhgs.x86_64

How reproducible:
Easily reproducible

Steps to Reproduce:
1. Create a GlusterFS storage pool in a RHEV environment.
2. Configure GlusterFS in posix/directIO mode.
3. Create a new VM or add a disk to an existing VM.

Actual results:
Adding the disk to the VM from the glusterfs storage pool fails.

Expected results:
The disk is added successfully.

Additional info:
Hi Sanjay,

In light of the recent discussion we had wrt direct-io behavior on a mail thread, I have the following question:

Assuming the 'cache=none' command line option implies that the vm image files will all be opened with the O_DIRECT flag (which means that the write buffers will already be aligned with the "sector size of the underlying block device"), the only layer in the combined client-server stack that could prevent us from achieving o-direct-like behavior because of caching would be the write-behind translator. Therefore, I am wondering whether enabling 'performance.strict-o-direct' is sufficient to achieve the behavior you expect to see with o-direct?

-Krutika
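(Not from this bug, just an illustration: a minimal C sketch of what 'cache=none' implies at the syscall level -- the image file is opened with O_DIRECT, so the write buffers, offsets and lengths have to be aligned to the underlying device's sector size. The file path and the 4096-byte alignment below are assumptions.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* hypothetical image path on a fuse-mounted gluster volume */
    int fd = open("/mnt/gl_01/images/disk01.img", O_RDWR | O_DIRECT);
    if (fd < 0)
        return 1;

    /* O_DIRECT requires sector-aligned buffers; 4096 bytes is assumed here */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0)
        return 1;
    memset(buf, 0, 4096);

    /* this write bypasses the client page cache; on the gluster stack the
     * remaining caching layer of interest is the write-behind translator */
    ssize_t n = pwrite(fd, buf, 4096, 0);

    free(buf);
    close(fd);
    return n == 4096 ? 0 : 1;
}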
I have tested with different options. The only option that enabled true directIO on the glusterfs server was the posix setting. I can verify again with performance.strict-o-direct on the recent glusterfs version (glusterfs-server-3.7.5-18.33) installed on my system, just to be sure.
Upstream patch at http://review.gluster.org/13846

Moving the state of the bug to POST.
Moving back to Assigned with comments from Vijay:

The current behavior in sharding is the following:
1. Open the base/first shard with O_DIRECT.
2. Open the subsequent shards without O_DIRECT. All write operations are converted to write + fsync operations to minimize the usage of the page cache.

With the planned patch, sharding will open non-first shards with O_DIRECT to completely eliminate any usage of the page cache.
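(Not the shard translator code itself -- just a simplified C sketch contrasting the two behaviors described above: the current write + fsync fallback for shards opened without O_DIRECT versus the planned O_DIRECT open. The shard path is hypothetical.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* current behavior for non-first shards: plain open, then write + fsync so
 * that dirty pages do not linger in the page cache */
static int write_then_fsync(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;
    int ok = (pwrite(fd, buf, len, 0) == (ssize_t)len) && (fsync(fd) == 0);
    close(fd);
    return ok ? 0 : -1;
}

/* planned behavior: open the shard with O_DIRECT (as the base shard already
 * is), bypassing the page cache; buf must then be sector-aligned */
static int write_direct(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_RDWR | O_DIRECT);
    if (fd < 0)
        return -1;
    int ok = (pwrite(fd, buf, len, 0) == (ssize_t)len);
    close(fd);
    return ok ? 0 : -1;
}

int main(void)
{
    void *buf;
    if (posix_memalign(&buf, 4096, 4096))   /* aligned so O_DIRECT works */
        return 1;
    memset(buf, 0, 4096);

    /* hypothetical shard path under the brick used in this bug report */
    int rc = write_then_fsync("/bricks/b01/g/.shard/shard.1", buf, 4096);
    if (rc == 0)
        rc = write_direct("/bricks/b01/g/.shard/shard.1", buf, 4096);

    free(buf);
    return rc == 0 ? 0 : 1;
}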
*** Bug 1322014 has been marked as a duplicate of this bug. ***
http://review.gluster.org/#/c/14191/
The solution to the VM pause issue seen in BZ https://bugzilla.redhat.com/show_bug.cgi?id=1339136 is to make sharding honor O_DIRECT. So this bug needs to be proposed for RHGS 3.1.3.
The fix for making individual shards inherit the original fd's flags involves changes to the management of anon fds (fd.c). One major consumer of anon fds apart from sharding is gluster-NFS, so once this patch lands, it would be good to verify that it doesn't break the existing functionality on NFS. Specifically, fd-based operations (reads and writes) need to be tested on NFS mounts to ensure they work fine. In this regard, it would be good to also use fd flags like O_DIRECT, O_SYNC, and O_DSYNC from the application.

-Krutika
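(A minimal sketch of the kind of application-level fd test suggested above, assuming a hypothetical NFS mount point and a 4096-byte block size: open a file with O_DIRECT and O_DSYNC, then do an aligned write and read-back through the same fd.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* hypothetical NFS mount of the gluster volume */
    const char *path = "/mnt/nfs/gl_01/odirect-test.bin";
    size_t blk = 4096;   /* assumed block size for alignment */

    int fd = open(path, O_RDWR | O_CREAT | O_DIRECT | O_DSYNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    void *wbuf, *rbuf;
    if (posix_memalign(&wbuf, blk, blk) || posix_memalign(&rbuf, blk, blk))
        return 1;
    memset(wbuf, 'x', blk);

    /* write and read back through the same fd to exercise the read and
     * write paths that gluster-NFS serves via anon fds */
    if (pwrite(fd, wbuf, blk, 0) != (ssize_t)blk) {
        perror("pwrite");
        return 1;
    }
    if (pread(fd, rbuf, blk, 0) != (ssize_t)blk) {
        perror("pread");
        return 1;
    }
    printf("read back matches write: %s\n",
           memcmp(wbuf, rbuf, blk) == 0 ? "yes" : "no");

    free(wbuf);
    free(rbuf);
    close(fd);
    return 0;
}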
All the required patches are pulled into downstream now:
> http://review.gluster.org/14271
> http://review.gluster.org/10219
> http://review.gluster.org/14215
> http://review.gluster.org/14191
> http://review.gluster.org/14639
> http://review.gluster.org/14623

Moving the state to Modified.
Tested with RHGS 3.1.3 build - glusterfs-3.7.9-10.el7rhgs - with the following tests:

1. Created a replica 3 volume.
2. Disabled remote-dio and enabled strict-o-direct on the volume.
3. Created a RHEV data domain backed by the above created volume.
4. Ran 'strace' on the brick process while a 100% write workload was running with fio.

All the shards are opened with O_DIRECT as expected:

./.30489:open("/rhgs/brick1/vmb1/.glusterfs/aa/59/aa59991a-31b5-41e6-87e5-e83e7bd4a082", O_RDWR|O_DIRECT) = 115 <0.000023>
./.30489:open("/rhgs/brick1/vmb1/.glusterfs/1e/d4/1ed4977a-3c75-459d-af88-a21d50190bd3", O_RDWR|O_DIRECT) = 115 <0.000019>
./.30489:open("/rhgs/brick1/vmb1/.glusterfs/aa/59/aa59991a-31b5-41e6-87e5-e83e7bd4a082", O_RDWR|O_DIRECT) = 114 <0.000025>
./.30489:open("/rhgs/brick1/vmb1/.glusterfs/aa/59/aa59991a-31b5-41e6-87e5-e83e7bd4a082", O_RDWR|O_DIRECT) = 114 <0.000020>
./.30489:open("/rhgs/brick1/vmb1/.glusterfs/aa/59/aa59991a-31b5-41e6-87e5-e83e7bd4a082", O_RDWR) = 114 <0.000020>
./.30489:open("/rhgs/brick1/vmb1/.glusterfs/aa/59/aa59991a-31b5-41e6-87e5-e83e7bd4a082", O_RDWR|O_DIRECT) = 114 <0.000035>
./.30489:open("/rhgs/brick1/vmb1/.glusterfs/aa/59/aa59991a-31b5-41e6-87e5-e83e7bd4a082", O_RDWR|O_DIRECT) = 115 <0.000023>
./.30489:open("/rhgs/brick1/vmb1/.glusterfs/aa/59/aa59991a-31b5-41e6-87e5-e83e7bd4a082", O_RDWR|O_DIRECT) = 115 <0.000023>
./.30489:open("/rhgs/brick1/vmb1/.glusterfs/aa/59/aa59991a-31b5-41e6-87e5-e83e7bd4a082", O_RDWR) = 114 <0.000024>
./.30489:openat(AT_FDCWD, "/rhgs/brick1/vmb1/.glusterfs/indices/xattrop", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 115 <0.000027>
./.30489:openat(AT_FDCWD, "/rhgs/brick1/vmb1/.glusterfs/indices/xattrop", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 115 <0.000020>
./.30489:open("/rhgs/brick1/vmb1/.glusterfs/aa/59/aa59991a-31b5-41e6-87e5-e83e7bd4a082", O_RDWR|O_DIRECT) = 115 <0.000027>
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1240