Description of problem:
Taking a backup image of a running VM's disk consistently causes the VM to enter a paused state. It can then be resumed with no issues. This problem started after enabling the sharding translator, which has led to drastically faster heal times.

Version-Release number of selected component (if applicable):
3.7.11-1

How reproducible:
50-100%. It's intermittent: sometimes the machines will pause, other times they won't. It does not seem to be related to disk size.

Steps to Reproduce:
1. Create and install an oVirt environment using GlusterFS as storage in a distributed replicate configuration.
2. Use the default volume options, except for enabling the sharding translator.
3. Create a Windows Server 2012 R2 VM and take a backup image using a VSS capture utility such as BackupExec, Acronis, or Windows Server Backup.

Actual results:
Machines pause seconds after the backup has started. Hosts did not go down, and bricks did not go down. The machines can be resumed immediately, which succeeds.

Expected results:
Machines should not pause.

Additional info:
Distributed replicate volume.
Number of bricks: 6
Replica count: 3

Volume options:
cluster.self-heal-window-size: 256
cluster.data-self-heal-algorithm: full
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
server.allow-insecure: on
storage.owner-gid: 36
network.ping-timeout: 10
features.shard-block-size: 512MB
features.shard: on
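For reference, the non-default options listed above would typically be applied with the `gluster volume set` CLI roughly as sketched below. The volume name `vmstore` is a placeholder, not from this report:

```shell
# Placeholder volume name; substitute your own.
VOL=vmstore

# Enable the sharding translator with the shard size used in this report.
gluster volume set $VOL features.shard on
gluster volume set $VOL features.shard-block-size 512MB

# Virt-store tuning matching the options above.
gluster volume set $VOL performance.quick-read off
gluster volume set $VOL performance.read-ahead off
gluster volume set $VOL performance.io-cache off
gluster volume set $VOL performance.stat-prefetch off
gluster volume set $VOL cluster.eager-lock enable
gluster volume set $VOL network.remote-dio enable
gluster volume set $VOL cluster.quorum-type auto
gluster volume set $VOL cluster.server-quorum-type server
gluster volume set $VOL network.ping-timeout 10
gluster volume set $VOL storage.owner-uid 36
gluster volume set $VOL storage.owner-gid 36
gluster volume set $VOL server.allow-insecure on
```

Note that changing `features.shard-block-size` only affects files created after the change; existing images keep their original shard layout.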
Can you please attach brick logs and client logs to the bug?
Created attachment 1163317 [details] Brick Log for Server.
Created attachment 1163318 [details] NFS client log from Server
I've attached the brick log from the main share point as well as the NFS client log. If you need more, such as the logs from the other bricks, let me know; I'll have to rotate them to cut them down in size. Also, the last time this occurred was May 29th at 12:55 PM.
Hi,

Thanks for the bug report, and apologies for the delay in looking into it. I went through your attachments, and I only see logs about loss of quorum. Could you try to recreate this with FUSE and then attach the FUSE mount logs?

FWIW, two other community users, Lindsay Mathieson and Kevin Lemonnier, hit VM pauses in 3.7.11 due to bugs in the replicate module and races in the interaction between the sharding and replicate modules. These have been fixed, and the fixes should be available in 3.7.12. As per http://www.gluster.org/pipermail/gluster-devel/2016-May/049677.html , 3.7.12 will be out around the 9th of June.

Let me know if that works for you. If not, I could share the patches/src tarball with the fixes applied on top of 3.7.11, and you could confirm that the patches fix the problems you're seeing.

-Krutika
Greetings,

Sorry for the delay getting back to you. If you have a strong hunch that this is resolved in 3.7.12, I'll update to that version when it comes out and try to recreate the issue there. I'm not exactly sure how to recreate it with FUSE, since it happens in a virtual environment. Would simply copying the VM image to another network source suffice?

-Nathan
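For what it's worth, a FUSE reproduction along the lines Krutika suggests could look roughly like the sketch below: mount the volume with the native GlusterFS client and copy a VM disk image through that mount to drive I/O over the same sharded read path. The server name, volume name, mount point, and image path are all placeholders:

```shell
# Placeholders: server1 is any Gluster node, vmstore is the volume name.
mount -t glusterfs server1:/vmstore /mnt/glustertest

# Copy a VM disk image through the FUSE mount to generate sustained I/O.
cp /path/to/vm-disk.img /mnt/glustertest/test-copy.img

# The FUSE client log is then typically found under /var/log/glusterfs/,
# named after the mount point, e.g. mnt-glustertest.log.
```

This only exercises the client-side path; it would not capture qemu-specific behavior such as the VSS-triggered flush that starts the backup, so a pause that depends on qemu might still only show up under a real backup run.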
Yes, it would be good to try it with 3.7.12, run the same test case, and confirm whether the issue still appears with the fixes. You can follow the gluster-users and gluster-devel MLs for the announcement of the 3.7.12 release.

-Krutika
Not sure what the cause was, but 3.7.12 resolved this. I've been running for several days without any pausing and without any other changes.