Bug 1058300 - VMs do not resume after paused state and storage connection to a gluster domain (they will also fail to be manually resumed)
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: GlusterFS
Classification: Community
Component: core
Version: mainline
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard: gluster
Depends On:
Blocks:
 
Reported: 2014-01-27 13:30 UTC by Andrew Lau
Modified: 2019-04-28 09:31 UTC
CC: 27 users

Fixed In Version: glusterfs-5.5
Clone Of:
Environment:
Last Closed: 2019-03-25 17:10:16 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
engine.log (689.12 KB, text/plain)
2014-01-27 13:30 UTC, Andrew Lau
vdsm.log (98.54 KB, text/plain)
2014-01-27 13:32 UTC, Andrew Lau
libvirt.log (60.06 KB, text/plain)
2014-02-01 01:04 UTC, Andrew Lau
qemu/vm.log (8.35 KB, text/plain)
2014-02-01 01:05 UTC, Andrew Lau
vdsm.log (86.56 KB, text/plain)
2014-02-01 01:06 UTC, Andrew Lau
ovirt-host logs (vdsm/libvirt) (134.20 KB, application/gzip)
2014-02-25 21:25 UTC, Andrew Lau
ovirt-engine logs (615.45 KB, application/gzip)
2014-02-25 21:25 UTC, Andrew Lau
gluster.tar (395.03 KB, application/x-tar)
2014-04-10 04:04 UTC, Andrew Lau
gluster brick logs (5.02 KB, application/gzip)
2014-06-17 10:10 UTC, Andrew Lau


Links
Launchpad 1450891

Description Andrew Lau 2014-01-27 13:30:50 UTC
Created attachment 856063 [details]
engine.log

Description of problem:
Running VMs go into a paused state when the storage connection is lost (as expected). However, when the connection returns, they do not resume. This sounds very similar to BZ 1036358, but I believe my nightly version is newer than the patch.

Version-Release number of selected component (if applicable):
vdsm-4.14.0-33.gitf029545.el6.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Stop the gluster volume which hosts the VM images
2. Wait for VMs to enter paused state
3. Re-start the gluster volume and wait for the SPM to be re-established
4. Wait a while to see if VMs will auto restart.
5. Try to manually resume a VM

Actual results:
No VMs resume; they need to be forced off and then powered back on.

Expected results:
VMs should resume

Additional info:
In my logs, I attempt to manually resume dev1-melb-example-net

Comment 1 Andrew Lau 2014-01-27 13:32:11 UTC
Created attachment 856064 [details]
vdsm.log

Comment 2 Andrew Lau 2014-01-27 14:10:15 UTC
Strange update...

In 3.3.2, when the VMs went into a paused state I used to be able to ungracefully power them off and start them again. This time, however, when I ungracefully powered them off through the oVirt UI they appeared to be down in the UI, but virsh list still showed them as paused.

virsh destroy vm

Then starting them again through the UI seemed to be the only way. I did not expect this to creep up, so I only have about 5 MB of logs if you want me to upload them - sorry.

Comment 3 Sergey Gotliv 2014-01-31 23:10:22 UTC
Nir, it looks very similar to BZ 1036358, but for Gluster.

Comment 4 Nir Soffer 2014-01-31 23:48:14 UTC
(In reply to Sergey Gotliv from comment #3)
> Nir, it looks very similar to BZ 1036358, but for Gluster.

Looking at the vdsm log, we don't see any VM that paused due to an EIO error, so we don't expect any VM in this log to be unpaused when storage becomes accessible again. And indeed, when storage becomes accessible again and the domain monitor correctly invokes the contEIOVms method, we can see that no VM is found paused because of an EIO error, and no VM is unpaused, as expected.

We can see a lot of libvirt "eother" errors, like this:
libvirtEventLoop::INFO::2014-01-28 00:13:17,614::vm::4507::vm.Vm::(_onAbnormalStop) vmId=`b3cde33c-f4c5-4b42-8e9d-820b80abc2fb`::abnormal vm stop device virtio-disk0 error eother

The current code unpauses only VMs that were paused because of an EIO error.

This is not related to bug 1036358, which was about unpausing VMs paused because of EIO when vdsm starts.
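
For illustration, a minimal sketch of the kind of check described above; vm.pauseCode, vm.is_paused, vm.disk_domain_ids and vm.cont() are assumed names, not the actual vdsm API:

# Hypothetical sketch only: resume VMs that paused with reason EIO once a
# storage domain they use becomes valid again. VMs paused with reason
# "eother" (as in this log) are skipped, so nothing gets resumed.
def cont_eio_vms(vm_container, valid_domain_id):
    for vm in vm_container.values():
        uses_domain = valid_domain_id in vm.disk_domain_ids  # assumed attribute
        if vm.is_paused and vm.pauseCode == 'EIO' and uses_domain:
            vm.cont()  # assumed resume call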

Comment 5 Nir Soffer 2014-01-31 23:52:30 UTC
Andrew, is the vdsm log complete, showing the time when the VMs were paused?

If not, please reproduce again and include a vdsm log showing the VMs getting paused.

Comment 6 Nir Soffer 2014-02-01 00:13:56 UTC
To understand better why the VMs were stopped abnormally, please also attach the libvirt and qemu logs. The logs are:
/var/log/libvirtd.log
/var/log/libvirt/qemu/vmname.log

Comment 7 Andrew Lau 2014-02-01 01:04:28 UTC
Created attachment 858008 [details]
libvirt.log

Attached are the three logs you requested. The VM in question in this case is: rabbitmq1.melb.example.net

Comment 8 Andrew Lau 2014-02-01 01:05:32 UTC
Created attachment 858009 [details]
qemu/vm.log

The last two "block I/O error in device 'drive-virtio-disk0': Bad file descriptor (9)" entries were from me attempting to manually resume the VM through the engine.

Comment 9 Andrew Lau 2014-02-01 01:06:02 UTC
Created attachment 858010 [details]
vdsm.log

Comment 10 Itamar Heim 2014-02-02 08:17:09 UTC
Setting target release to current version for consideration and review. Please
do not push non-RFE bugs to an undefined target release, to make sure bugs are
reviewed for relevancy, fix, closure, etc.

Comment 11 Ayal Baron 2014-02-17 21:11:01 UTC
Nir, any update on this?

Comment 12 Sergey Gotliv 2014-02-18 07:34:50 UTC
Andrew,

This is a hosted engine setup, right?

Comment 13 Andrew Lau 2014-02-18 07:39:46 UTC
Correct.

Comment 14 Sergey Gotliv 2014-02-18 07:53:17 UTC
Andrew,

The attached logs are too short and don't match; for example:

The engine log starts at 2014-01-28 00:13:04,495 and ends at 2014-01-28 00:16:13,468 (3 minutes).

The VDSM log starts at 2014-02-01 11:58:12,437 and ends at 2014-02-01 12:00:54,668 (2 minutes).

The same goes for libvirt.

So I see this:

OSError: [Errno 2] No such file or directory: '/rhev/data-center/mnt/glusterSD/172.16.1.5:_VM-DATA/ff3a0c94-11dc-446d-b093-5f68ad81520d'

And this:

Exception: path to storage domain 562f0160-7b80-42e2-b248-5754455c40fc not found in /rhev/data-center/mnt

But I can't find the root cause because I don't see the entire flow.
Can you attach a zip with the full logs?

Comment 15 Andrew Lau 2014-02-25 21:24:16 UTC
Sorry for the delay in getting back to you; testing this naturally requires some sort of downtime, so it was hard to get this in motion.

Steps: 
- I stopped all the other VMs and nulled all the logs so you could get a clean log view. 
- Waited a few minutes, then shut off the volume hosting the VM data. 
- I attempted to restart it manually both through the engine and through virsh.

The VM here is nginxtemp

I hope this helps.

Comment 16 Andrew Lau 2014-02-25 21:25:05 UTC
Created attachment 867646 [details]
ovirt-host logs (vdsm/libvirt)

Comment 17 Andrew Lau 2014-02-25 21:25:46 UTC
Created attachment 867647 [details]
ovirt-engine logs

Comment 18 Nir Soffer 2014-02-26 21:59:17 UTC
This looks like a virt issue.

1. Domain becomes valid:
Thread-5162::DEBUG::2014-02-26 07:48:20,994::domainMonitor::247::Storage.DomainMonitorThread::(_monitorDomain) Domain ff3a0c94-11dc-446d-b093-5f68ad81520d changed its status to
 Valid

2. The domain monitor emits an event
Thread-2496084::DEBUG::2014-02-26 07:48:20,996::misc::884::Event.Storage.DomainMonitor.onDomainStateChange::(_emit) Emitting event

3. The event handler looks for paused VMs
Thread-2496084::DEBUG::2014-02-26 07:48:20,997::misc::894::Event.Storage.DomainMonitor.onDomainStateChange::(_emit) Calling registered method `contEIOVms`
Thread-2496084::DEBUG::2014-02-26 07:48:20,998::misc::904::Event.Storage.DomainMonitor.onDomainStateChange::(_emit) Event emitted
Thread-2496085::INFO::2014-02-26 07:48:20,999::clientIF::126::vds::(contEIOVms) vmContainerLock acquired
Thread-2496085::INFO::2014-02-26 07:48:21,001::clientIF::133::vds::(contEIOVms) 

4. A paused VM is found and unpaused
Cont vm 66874a89-8745-4eff-b3d2-5fde7625f8dc in EIO
libvirtEventLoop::DEBUG::2014-02-26 07:48:21,057::vm::5096::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`66874a89-8745-4eff-b3d2-5fde7625f8dc`::event Resumed detail 0 opaque None
libvirtEventLoop::DEBUG::2014-02-26 07:48:21,065::vm::5096::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`66874a89-8745-4eff-b3d2-5fde7625f8dc`::event Resumed detail 0 opaque None

5. The VM stopped abnormally
libvirtEventLoop::INFO::2014-02-26 07:48:21,068::vm::4505::vm.Vm::(_onAbnormalStop) vmId=`66874a89-8745-4eff-b3d2-5fde7625f8dc`::abnormal vm stop device virtio-disk0 error eother
libvirtEventLoop::DEBUG::2014-02-26 07:48:21,068::vm::5096::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`66874a89-8745-4eff-b3d2-5fde7625f8dc`::event Suspended detail 2 opaque None
libvirtEventLoop::INFO::2014-02-26 07:48:21,069::vm::4505::vm.Vm::(_onAbnormalStop) vmId=`66874a89-8745-4eff-b3d2-5fde7625f8dc`::abnormal vm stop device virtio-disk0 error eother

Michal, can you check this?
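
As a rough, runnable illustration of the observer pattern visible in steps 1-4 (the onDomainStateChange event), with made-up names rather than the real vdsm classes:

# Stand-in for the Event.Storage.DomainMonitor.onDomainStateChange flow:
# the domain monitor emits an event and each registered handler (here a
# fake contEIOVms) is called with the domain id.
class Event:
    def __init__(self):
        self._subscribers = []

    def register(self, callback):
        self._subscribers.append(callback)

    def emit(self, *args):
        for callback in self._subscribers:
            callback(*args)

def cont_eio_vms(domain_id):
    print("contEIOVms called for domain", domain_id)  # steps 3-4 above

on_domain_state_change = Event()
on_domain_state_change.register(cont_eio_vms)  # registered once at startup
# Steps 1-2: the monitor sees the domain turn Valid and emits the event.
on_domain_state_change.emit("ff3a0c94-11dc-446d-b093-5f68ad81520d")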

Comment 19 Sandro Bonazzola 2014-03-04 09:22:58 UTC
This is an automated message.
Re-targeting all non-blocker bugs still open on 3.4.0 to 3.4.1.

Comment 20 Allon Mureinik 2014-03-06 10:53:12 UTC
(In reply to Nir Soffer from comment #18)
> 5. Vm stopped abnormally
> libvirtEventLoop::INFO::2014-02-26
> 07:48:21,068::vm::4505::vm.Vm::(_onAbnormalStop)
> vmId=`66874a89-8745-4eff-b3d2-5fde7625f8dc`::abnormal vm stop device
> virtio-disk0 error eother
> libvirtEventLoop::DEBUG::2014-02-26
> 07:48:21,068::vm::5096::vm.Vm::(_onLibvirtLifecycleEvent)
> vmId=`66874a89-8745-4eff-b3d2-5fde7625f8dc`::event Suspended detail 2 opaque
> None
> libvirtEventLoop::INFO::2014-02-26
> 07:48:21,069::vm::4505::vm.Vm::(_onAbnormalStop)
> vmId=`66874a89-8745-4eff-b3d2-5fde7625f8dc`::abnormal vm stop device
> virtio-disk0 error eother
> 
> Michal, can you check this?
Michal, can anyone from your team take a look please?

Comment 21 Michal Skrivanek 2014-03-06 11:45:05 UTC
So, in other words, the VM was paused with EIO, and after the resume it immediately got paused again with an EOTHER error.

Comment 22 Michal Skrivanek 2014-03-06 11:51:00 UTC
2014-02-25 20:48:21.034+0000: 2852: debug : qemuProcessHandleResume:776 : Transitioned guest xxx out of paused into resumed state

followed immediately by

2014-02-25 20:48:21.041+0000: 2852: debug : qemuMonitorIOProcess:354 : QEMU_MONITOR_IO_PROCESS: mon=0x7f61600c5dd0 buf={"timestamp": {"seconds": 1393361301, "microseconds": 38479}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk0", "__com.redhat_debug_info": {"message": "Bad file descriptor", "errno": 9}, "__com.redhat_reason": "eother", "operation": "write", "action": "stop"}}
{"timestamp": {"seconds": 1393361301, "microseconds": 41019}, "event": "STOP"}
{"timestamp": {"seconds": 1393361301, "microseconds": 41368}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk0", "__com.redhat_debug_info": {"message": "Bad file descriptor", "errno": 9}, "__com.redhat_reason": "eother", "operation": "write", "action": "stop"}}
 len=640
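
For reference, the BLOCK_IO_ERROR event above can be picked apart with a few lines of Python (the field names are copied from the log line itself; this is only a reading aid, not part of any component):

import json

raw = ('{"timestamp": {"seconds": 1393361301, "microseconds": 38479}, '
       '"event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk0", '
       '"__com.redhat_debug_info": {"message": "Bad file descriptor", "errno": 9}, '
       '"__com.redhat_reason": "eother", "operation": "write", "action": "stop"}}')
data = json.loads(raw)["data"]
# "action": "stop" is what pauses the guest again right after the resume;
# reason "eother" (EBADF here) is what keeps vdsm from treating it as EIO.
print(data["device"], data["operation"], data["__com.redhat_reason"],
      data["__com.redhat_debug_info"]["errno"])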

Comment 23 Michal Skrivanek 2014-03-06 11:53:34 UTC
and the only relevant thing in the QEMU logs (is it the right log, though? I'm missing the cmdline) is:
block I/O error in device 'drive-virtio-disk0': Read-only file system (30)
block I/O error in device 'drive-virtio-disk0': Read-only file system (30)
block I/O error in device 'drive-virtio-disk0': Bad file descriptor (9)
block I/O error in device 'drive-virtio-disk0': Bad file descriptor (9)
block I/O error in device 'drive-virtio-disk0': Bad file descriptor (9)
block I/O error in device 'drive-virtio-disk0': Bad file descriptor (9)
block I/O error in device 'drive-virtio-disk0': Bad file descriptor (9)
block I/O error in device 'drive-virtio-disk0': Bad file descriptor (9)

so nothing virt here...

Comment 24 Nir Soffer 2014-03-06 12:17:01 UTC
(In reply to Michal Skrivanek from comment #23)
> and anything relevant in QEMU logs (is it the right one, though? I'm missing
> the cmdline) is:
> block I/O error in device 'drive-virtio-disk0': Read-only file system (30)
> block I/O error in device 'drive-virtio-disk0': Read-only file system (30)
> block I/O error in device 'drive-virtio-disk0': Bad file descriptor (9)
> block I/O error in device 'drive-virtio-disk0': Bad file descriptor (9)
> block I/O error in device 'drive-virtio-disk0': Bad file descriptor (9)
> block I/O error in device 'drive-virtio-disk0': Bad file descriptor (9)
> block I/O error in device 'drive-virtio-disk0': Bad file descriptor (9)
> block I/O error in device 'drive-virtio-disk0': Bad file descriptor (9)
> 
> so nothing virt here...

We (storage) did our part, detecting that the storage became valid, and tried to continue the VM. Now that the VM got stopped again, I don't see how we can help in this case. Maybe the qemu folks can help?

Comment 25 Michal Skrivanek 2014-03-06 12:25:47 UTC
Maybe, I don't really know. My area starts when I get the Bad file descriptor error again after the resume.

Comment 26 Nir Soffer 2014-03-06 13:01:54 UTC
Michael, can you look at this?

A VM was paused by an EIO error, storage became available again, the paused VM was unpaused, and it failed during the unpause. See comment 18, comment 22, and comment 23.

Comment 27 Stefan Hajnoczi 2014-03-06 15:34:02 UTC
(In reply to Nir Soffer from comment #26)
> A vm was paused by EIO error, storage became available, paused vm was
> unpaused, fail during the unpause. See comment 18, comment 22, and comment
> 23.

The QEMU process is getting EROFS and EBADF when accessing the file for 'drive-virtio-disk0' on a Gluster FUSE mount:
/rhev/data-center/mnt/glusterSD/172.16.1.5:_VM-DATA/ff3a0c94-11dc-446d-b093-5f68ad81520d/images/a746f514-51eb-4926-80d5-545108438f01/ba8685ec-668a-4780-88ba-0b114e1a7e7d

QEMU just makes preadv(2)/pwritev(2)/fdatasync(2) syscalls but the error is coming from the Gluster FUSE daemon.

I suspect the Gluster logs and maybe dmesg on the KVM host will contain more information on why the Gluster FUSE daemon was unable to process I/O requests.  Perhaps the process died and the FUSE mount went away?

Adding Vijay for advice on which logs to collect from Gluster.
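
A quick way to check whether the errors come from the mount rather than from QEMU is to issue the same kind of syscalls against the image file directly; a minimal sketch (the path is the one quoted above, and the check is purely illustrative):

import errno, os

# Read from the image file on the Gluster FUSE mount with plain syscalls and
# report the errno, to tell a dead or remounted glusterfs-fuse client apart
# from a QEMU-side problem.
path = ("/rhev/data-center/mnt/glusterSD/172.16.1.5:_VM-DATA/"
        "ff3a0c94-11dc-446d-b093-5f68ad81520d/images/"
        "a746f514-51eb-4926-80d5-545108438f01/"
        "ba8685ec-668a-4780-88ba-0b114e1a7e7d")
try:
    fd = os.open(path, os.O_RDONLY)
    os.pread(fd, 4096, 0)
    os.close(fd)
    print("read OK")
except OSError as e:
    print("errno", e.errno, errno.errorcode.get(e.errno, "?"), os.strerror(e.errno))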

Comment 28 Vijay Bellur 2014-04-08 16:55:50 UTC
Can you please provide a tarball of /var/log/glusterfs on the hypervisor when this happens?

Comment 29 Andrew Lau 2014-04-10 04:03:45 UTC
Hi,

Attached are my gluster logs; my hypervisors run both vdsm and gluster-server, by the way.

I also did some extra testing on this: it seems that if the gluster storage is removed/lost and then returns within the SPM timeout, the VM will automatically resume. I believe this timeout is 5 minutes.

However, even within that 5 minutes, if the user attempts to manually resume the VM while there is no storage, the engine will attempt the resume and fail. This probably causes the "resume too early" issue. I think the engine should prevent the user from resuming the VM if the data domain is unavailable. If there is an ISO attached to the VM, the resume is not allowed because the ISO domain is inactive (as it relies on the data domain, which is inactive). It's a little weird that only the ISO domain gets checked, but the data domain doesn't.

Back to the point: after the 5 or so minute timeout, when the gluster domain returns, I see it attempt to resume the instance - my cursor (within the console) resumes for about 5 seconds and then times out again, with the engine reporting that the VM resume failed.

HTH

Comment 30 Andrew Lau 2014-04-10 04:04:44 UTC
Created attachment 884735 [details]
gluster.tar

Comment 31 Andrew Lau 2014-06-14 08:31:46 UTC
I recently tried the network.ping-timeout option:
gluster volume set vm-data network.ping-timeout 10

The qemu logs now report:

block I/O error in device 'drive-virtio-disk0': Transport endpoint is not connected (107)

Comment 32 Humble Chirammal 2014-06-16 14:39:26 UTC
>
However even within that 5 minutes, if the user attempts to manually resume the VM when there is no storage it will attempt to resume it and fail. 
>

If the storage is unavailable, attempting to resume a 'paused' VM is not expected to work anyway.

>
 I think the engine should prevent the user from resuming the VM if the datastore is unavailable.
...
if there is an ISO attached to the VM it will not allow the resume because the ISO domain is inactive (as it relies on the data domain which is inactive). That's a little weird that only the ISO domain gets checked, but the data domain doesn't..
>

Someone from the vdsm team can comment on this.

>
Back to the point, after the 5 or so minute timeout, when the gluster domain returns I'm seeing it attempt to resume the instance as my cursor (within the console) resumes for about 5 seconds and then times out again with the engine reporting the VM resume failed.
>
This error mostly indicates that glusterfs was unable to serve the I/O request. Common causes of this error are a network interruption or a brick not being available.


Client log :
------------------
[2014-04-10 03:35:58.620542] W [socket.c:514:__socket_rwv] 0-vm-sata-client-0: readv failed (No data available)
[2014-04-10 03:35:58.620709] W [socket.c:1962:__socket_proto_state_machine] 0-vm-sata-client-0: reading from socket failed. Error (No data available), peer (172.16.2.11:50153)
[2014-04-10 03:35:58.620832] I [client.c:2098:client_rpc_notify] 0-vm-sata-client-0: disconnected
[2014-04-10 03:35:59.646518] W [afr-inode-write.c:457:afr_writev] 0-vm-sata-replicate-0: failing writev due to lack of quorum
[2014-04-10 03:35:59.646588] W [fuse-bridge.c:2167:fuse_writev_cbk] 0-glusterfs-fuse: 41153: WRITE => -1 (Read-only file system)
[2014-04-10 03:36:00.237352] W [afr-inode-write.c:457:afr_writev] 0-vm-sata-replicate-0: failing writev due to lack of quorum
[2014-04-10 03:36:00.237419] W [fuse-bridge.c:2167:fuse_writev_cbk] 0-glusterfs-fuse: 41163: WRITE => -1 (Read-only file system)
[2014-04-10 03:36:00.829441] W [afr-inode-write.c:457:afr_writev] 0-vm-sata-replicate-0: failing writev due to lack of quorum
[2014-04-10 03:36:00.829524] W [fuse-bridge.c:2167:fuse_writev_cbk] 0-glusterfs-fuse: 41173: WRITE => -1 (Read-only file system)
[2014-04-10 03:36:00.829827] W [socket.c:514:__socket_rwv] 0-vm-sata-client-1: readv failed (No data available)

Brick log says:

[2014-04-10 03:35:58.614444] W [glusterfsd.c:1002:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x7fc2c4123b6d] (-->/lib64/libpthread.so.0(+0x79d1) [0x7fc2c47b69d1] (-->/usr/sbin/glusterfsd(glusterfs_sigwaiter+0xcd) [0x40533d]))) 0-: received signum (15), shutting down
[2014-04-10 03:35:58.614494] W [glusterfsd.c:1002:cleanup_and_exit] (-->/lib64/libc.so.6(+0x43bf0) [0x7fc2c407ebf0] (-->/usr/lib64/libglusterfs.so.0(synctask_wrap+0x12) [0x7fc2c4e3eaa2] (-->/usr/sbin/glusterfsd(glusterfs_handle_terminate+0x15) [0x40a5f5]))) 0-: received signum (15), shutting down
[2014-04-10 03:43:18.258953] I [glusterfsd.c:1910:main] 0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version 3.4.3 (/usr/sbin/glusterfsd -s 172.16.2.11 --volfile-id vm-sata.172.16.2.11.data1-vm-sata -p /var/lib/glusterd/vols/vm-sata/run/172.16.2.11-data1-vm-sata.pid -S /var/run/880d32267c8ea6e63cb8edaff26364f3.socket --brick-name /data1/vm-sata -l /var/log/glusterfs/bricks/data1-vm-sata.log --xlator-option *-posix.glusterd-uuid=986c80cc-3ad9-4012-81fd-a18b559245a8 --brick-port 50153 --xlator-option vm-sata-server.listen-port=50153)


It looks like the brick process was down, so the client failed on I/O.

IIUC, you have 2 servers (172.16.2.11 & 172.16.2.12) which serve the bricks.

Are you sure the storage domain came back and was available when you tried the VM 'resume'?

If yes, can you provide the logs from the other server as well and give more details on:

*) Gluster configuration of your setup.
*) Steps to reproduce this issue
*) Version of subjected packages
*) Glusterd logs from both servers, and also the qemu/vm logs at the problematic timestamp.

Comment 33 Andrew Lau 2014-06-17 10:06:23 UTC
I tried this again with the following setup:

3 hosts
gluster volume: vm-data
2 bricks (replicated) only on two hosts (172.16.2.11, 172.16.2.12)
third host (172.16.2.13) part of the gluster peer group, but not a brick.

Gluster storage volume mounted as localhost:/vm-data

VM running on third host.

I replicated a switch failure by just running a reboot; the outage was about 1 minute tops. When the network returned, the storage domain was set back to active. However, the VM 'foreman1' was left in the paused state.

Only two lines exist in 'qemu/foreman1.log'
block I/O error in device 'drive-virtio-disk0': Transport endpoint is not connected (107)
block I/O error in device 'drive-virtio-disk0': Transport endpoint is not connected (107)

Attempting to resume it manually would just result in another entry.


Gluster Version:
glusterfs-libs-3.5.0-2.el6.x86_64
glusterfs-cli-3.5.0-2.el6.x86_64
glusterfs-3.5.0-2.el6.x86_64
glusterfs-api-3.5.0-2.el6.x86_64
glusterfs-fuse-3.5.0-2.el6.x86_64
glusterfs-server-3.5.0-2.el6.x86_64
glusterfs-rdma-3.5.0-2.el6.x86_64
vdsm-gluster-4.14.9-0.el6.noarch

Brick Info:
[root@ov-hv1-2a-08-23 bricks]# gluster volume info vm-data
 
Volume Name: vm-data
Type: Replicate
Volume ID: 9c10ce6c-5d94-42a6-81be-d628e813da48
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 172.16.2.11:/data1/vm-data
Brick2: 172.16.2.12:/data1/vm-data
Options Reconfigured:
network.ping-timeout: 10
nfs.disable: yes
storage.owner-gid: 36
storage.owner-uid: 36
server.allow-insecure: on
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
auth.allow: 172.16.*.*
cluster.server-quorum-ratio: 51%

Logs attached; the failure occurred at 09:52.

Comment 34 Andrew Lau 2014-06-17 10:10:57 UTC
Created attachment 909506 [details]
gluster brick logs

Attached are the logs from the two gluster storage nodes (hv1, hv2). hv3-client contains the log from the virt host (/var/log/glusterfs/rhev-data-center-mnt-glusterSD-localhost\:_vm-data.log).

Comment 35 Humble Chirammal 2014-06-18 09:55:57 UTC
Thanks for providing the details and logs. From them, it looks like 'quorum' is playing a role here:


[2014-04-10 03:35:59.646518] W [afr-inode-write.c:457:afr_writev] 0-vm-sata-replicate-0: failing writev due to lack of quorum
[2014-04-10 03:35:59.646588] W [fuse-bridge.c:2167:fuse_writev_cbk] 0-glusterfs-fuse: 41153: WRITE => -1 (Read-only file system)
[2014-04-10 03:36:00.237352] W [afr-inode-write.c:457:afr_writev] 0-vm-sata-replicate-0: failing writev due to lack of quorum
[2014-04-10 03:36:00.237419] W [fuse-bridge.c:2167:fuse_writev_cbk] 0-glusterfs-fuse: 41163: WRITE => -1 (Read-only file system)
[2014-04-10 03:36:00.829441] W [afr-inode-write.c:457:afr_writev] 0-vm-sata-replicate-0: failing writev due to lack of quorum
[2014-04-10 03:36:00.829524] W [fuse-bridge.c:2167:fuse_writev_cbk] 0-glusterfs-fuse: 41173: WRITE => -1 (Read-only file system)
[2014-04-10 03:36:00.829827] W [socket.c:514:__socket_rwv] 0-vm-sata-client-1: readv failed (No data available)

I am involving one of the devel (who works on this  component) from glusterfs team to check further on this.

Btw, if I am correct, you have/had enabled the 'virt' group profile [1] on the volume in question at the problematic time. Can you please confirm?

[1]
https://access.redhat.com/site/documentation/en-US/Red_Hat_Storage/2.1/html-single/Quick_Start_Guide/index.html#configure_vols_cli

Comment 36 Andrew Lau 2014-06-18 10:25:07 UTC
Quorum would kick in, as the switch failure removed all network connectivity between the servers for that period of time.

I have applied the 'virt' group profile as well as the extra ones like:
storage.owner-gid: 36
storage.owner-uid: 36
server.allow-insecure: on
network.ping-timeout: 10

The network.ping-timeout setting was recommended for the virt use case as well. It appears to have changed the qemu output from the more "serious"

block I/O error in device 'drive-virtio-disk0': Bad file descriptor (9)
to
block I/O error in device 'drive-virtio-disk0': Transport endpoint is not connected (107)

Although the earlier logs (from last year) were from oVirt 3.3 and Gluster 3.4.

Comment 37 Andrew Lau 2014-07-22 12:27:21 UTC
I have tried a new setup, WITHOUT the 'recommended' param:
network.ping-timeout: 10

I broke quorum by killing gluster on one host, then restored the connection.

One out of three VMs resumed from the paused state. The VM which resumed was doing active writes at the time gluster was killed and was paused first. It was the only one which resumed.

Comment 38 Humble Chirammal 2014-08-13 07:34:58 UTC
Did some more research on this topic and here it is:

In your environment, client- and server-side quorum have been enabled.

Your configuration looks like this:


Number of Bricks: 1 x 2 = 2
Brick1: 172.16.2.11:/data1/vm-data
Brick2: 172.16.2.12:/data1/vm-data


If client quorum is enabled and the type is 'auto', then for an odd replica count quorum is met if more than 50% of the bricks are up. For an even replica count (which is the case here), it is met if more than 50% of the bricks are up; if exactly 50% of the bricks are up, then the first brick (here Brick1: 172.16.2.11:/data1/vm-data) has to be up.


For server-side quorum, as you have 3 servers, you lose quorum only if 2 nodes are down.
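
Put as code, the client-quorum rule described above looks roughly like this (illustrative only, not GlusterFS source):

# cluster.quorum-type=auto, as described above.
def client_quorum_met(up_bricks, replica_count, first_brick_up):
    if replica_count % 2 == 1:                # odd replica count
        return 2 * up_bricks > replica_count  # more than 50% must be up
    if 2 * up_bricks > replica_count:         # even count, more than 50% up
        return True
    if 2 * up_bricks == replica_count:        # exactly 50% up
        return first_brick_up                 # the first brick must be among them
    return False

# replica 2, as in this setup:
print(client_quorum_met(1, 2, first_brick_up=True))   # True  (only Brick1 up)
print(client_quorum_met(1, 2, first_brick_up=False))  # False (only Brick2 up) -> writes fail, EROFS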

>>

I broke a quorum by killing gluster on one host. After restoring the connection.

One out of three VMs now resumed from the paused state. The VM which resumed, was doing active writes at the time of the killing of gluster, was paused first. It was the only one which resumed.
>>


In this scenario, it looks like only *one* node was in trouble.
Can you please let us know which node that was? If it was "the node which serves the first brick", a read-only fs, VM pause, etc. are expected. However, I don't know why the other 2 VMs failed to resume when quorum (which also means the volume started to work) came back.

I think only logs can tell something about this.

Comment 39 Christopher Pereira 2015-02-24 19:28:49 UTC
The bug is still present in oVirt 3.5.
I have a gluster storage with no replica and client and server quorum disabled.
The VM was working fine until I stopped and restarted the gluster volume via ovirt-engine.
The VM got paused and it is impossible to resume it.
Shutting down and restarting the VM works, but it will destroy the VM.

VDSM says:

- abnormal vm stop device virtio-disk0 error eother
- desktopLock failed

Please increase severity, since VMs are paused on network errors and will fail to resume.

Comment 40 Christopher Pereira 2015-02-24 19:49:02 UTC
Please ignore my comment.
I noticed that another VM resumed successfully, so I will have to dig deeper.

Comment 41 Christopher Pereira 2015-02-24 22:24:49 UTC
Ok. One VM is unable to resume.
The reason in my case seems to be "Transport endpoint is not connected" (http://www.gluster.org/community/documentation/index.php/Transport_endpoint_is_not_connected)

To reproduce, I'm just stopping and restarting a gluster volume (with no replica).

These are the libvirtd logs:

2015-02-24 22:02:05.034+0000: 19733: debug : virDomainGetXMLDesc:4363 : dom=0x7fe6e8003760, (VM: name=Backup-Server, uuid=56c30083-9248-4145-92be-38109b3536e1), flags=0
2015-02-24 22:02:05.034+0000: 19733: debug : virDomainFree:2281 : dom=0x7fe6e8003760, (VM: name=Backup-Server, uuid=56c30083-9248-4145-92be-38109b3536e1)
2015-02-24 22:02:05.036+0000: 19731: debug : virDomainResume:2393 : dom=0x7fe6f000ea20, (VM: name=Backup-Server, uuid=56c30083-9248-4145-92be-38109b3536e1)
2015-02-24 22:02:05.036+0000: 19731: debug : qemuDomainObjBeginJobInternal:847 : Starting job: modify (async=none)
2015-02-24 22:02:05.036+0000: 19731: debug : qemuProcessStartCPUs:2922 : Using lock state '(null)'
2015-02-24 22:02:05.036+0000: 19731: debug : qemuMonitorStartCPUs:1217 : mon=0x7fe6f00d6de0
2015-02-24 22:02:05.036+0000: 19731: debug : qemuMonitorSend:911 : QEMU_MONITOR_SEND_MSG: mon=0x7fe6f00d6de0 msg={"execute":"cont","id":"libvirt-238"}
2015-02-24 22:02:05.037+0000: 19726: debug : qemuMonitorIOWrite:463 : QEMU_MONITOR_IO_WRITE: mon=0x7fe6f00d6de0 buf={"execute":"cont","id":"libvirt-238"}
2015-02-24 22:02:05.046+0000: 19726: debug : qemuMonitorIOProcess:355 : QEMU_MONITOR_IO_PROCESS: mon=0x7fe6f00d6de0 buf={"timestamp": {"seconds": 1424815325, "microseconds": 46322}, "event": "RESUME"}
2015-02-24 22:02:05.046+0000: 19726: debug : qemuMonitorEmitResume:1035 : mon=0x7fe6f00d6de0
2015-02-24 22:02:05.046+0000: 19726: debug : qemuProcessHandleResume:777 : Transitioned guest Backup-Server out of paused into resumed state
2015-02-24 22:02:05.046+0000: 19726: debug : qemuProcessHandleResume:785 : Using lock state '(null)' on resume event
2015-02-24 22:02:05.113+0000: 19726: debug : virDomainFree:2281 : dom=0x114a130, (VM: name=Backup-Server, uuid=56c30083-9248-4145-92be-38109b3536e1)
2015-02-24 22:02:05.113+0000: 19726: debug : qemuMonitorIOProcess:355 : QEMU_MONITOR_IO_PROCESS: mon=0x7fe6f00d6de0 buf={"timestamp": {"seconds": 1424815325, "microseconds": 47842}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk0", "__com.redhat_debug_info": {"message": "Transport endpoint is not connected", "errno": 107}, "__com.redhat_reason": "eother", "operation": "write", "action": "stop"}}
2015-02-24 22:02:05.114+0000: 19726: debug : qemuMonitorEmitIOError:1077 : mon=0x7fe6f00d6de0
2015-02-24 22:02:05.145+0000: 19731: debug : qemuDomainObjEndJob:992 : Stopping job: modify (async=none)
2015-02-24 22:02:05.145+0000: 19731: debug : virDomainFree:2281 : dom=0x7fe6f000ea20, (VM: name=Backup-Server, uuid=56c30083-9248-4145-92be-38109b3536e1)
2015-02-24 22:02:05.145+0000: 19726: debug : qemuProcessHandleIOError:945 : Transitioned guest Backup-Server to paused state due to IO error
2015-02-24 22:02:05.146+0000: 19726: debug : qemuProcessHandleIOError:955 : Preserving lock state '(null)'
2015-02-24 22:02:05.195+0000: 19726: debug : qemuMonitorEmitStop:1025 : mon=0x7fe6f00d6de0
2015-02-24 22:02:05.196+0000: 19726: debug : qemuMonitorEmitIOError:1077 : mon=0x7fe6f00d6de0
2015-02-24 22:02:05.196+0000: 19726: debug : qemuMonitorEmitIOError:1077 : mon=0x7fe6f00d6de0
2015-02-24 22:02:05.196+0000: 19726: debug : virDomainFree:2281 : dom=0x115b160, (VM: name=Backup-Server, uuid=56c30083-9248-4145-92be-38109b3536e1)
2015-02-24 22:02:05.196+0000: 19726: debug : virDomainFree:2281 : dom=0x115b160, (VM: name=Backup-Server, uuid=56c30083-9248-4145-92be-38109b3536e1)
2015-02-24 22:02:05.196+0000: 19726: debug : virDomainFree:2281 : dom=0x115b160, (VM: name=Backup-Server, uuid=56c30083-9248-4145-92be-38109b3536e1)
2015-02-24 22:02:05.196+0000: 19726: debug : virDomainFree:2281 : dom=0x115b160, (VM: name=Backup-Server, uuid=56c30083-9248-4145-92be-38109b3536e1)
2015-02-24 22:02:05.196+0000: 19726: debug : virDomainFree:2281 : dom=0x115b160, (VM: name=Backup-Server, uuid=56c30083-9248-4145-92be-38109b3536e1)
2015-02-24 22:02:05.197+0000: 19726: debug : qemuMonitorIOProcess:355 : QEMU_MONITOR_IO_PROCESS: mon=0x7fe6f00d6de0 buf={"timestamp": {"seconds": 1424815325, "microseconds": 50560}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk0", "__com.redhat_debug_info": {"message": "Transport endpoint is not connected", "errno": 107}, "__com.redhat_reason": "eother", "operation": "write", "action": "stop"}}
2015-02-24 22:02:05.197+0000: 19726: debug : qemuMonitorEmitIOError:1077 : mon=0x7fe6f00d6de0
2015-02-24 22:02:05.197+0000: 19726: debug : qemuMonitorEmitIOError:1077 : mon=0x7fe6f00d6de0
2015-02-24 22:02:05.197+0000: 19726: debug : virDomainFree:2281 : dom=0x114a050, (VM: name=Backup-Server, uuid=56c30083-9248-4145-92be-38109b3536e1)
2015-02-24 22:02:05.197+0000: 19726: debug : virDomainFree:2281 : dom=0x114a050, (VM: name=Backup-Server, uuid=56c30083-9248-4145-92be-38109b3536e1)
2015-02-24 22:02:05.198+0000: 19730: debug : virDomainGetInfo:4213 : dom=0x7fe6fc00cd20, (VM: name=Backup-Server, uuid=56c30083-9248-4145-92be-38109b3536e1), info=0x7fe710746ad0
2015-02-24 22:02:05.198+0000: 19730: debug : qemudGetProcessInfo:1351 : Got status for 2205/0 user=1409 sys=537 cpu=1 rss=73001
2015-02-24 22:02:05.198+0000: 19730: debug : virDomainFree:2281 : dom=0x7fe6fc00cd20, (VM: name=Backup-Server, uuid=56c30083-9248-4145-92be-38109b3536e1)
2015-02-24 22:02:05.299+0000: 19734: debug : virDomainGetXMLDesc:4363 : dom=0x7fe7040e8500, (VM: name=Backup-Server, uuid=56c30083-9248-4145-92be-38109b3536e1), flags=0
2015-02-24 22:02:05.299+0000: 19734: debug : virDomainFree:2281 : dom=0x7fe7040e8500, (VM: name=Backup-Server, uuid=56c30083-9248-4145-92be-38109b3536e1)
2015-02-24 22:02:05.300+0000: 19732: debug : virDomainGetXMLDesc:4363 : dom=0x7fe6ec0d92d0, (VM: name=Backup-Server, uuid=56c30083-9248-4145-92be-38109b3536e1), flags=0
2015-02-24 22:02:05.301+0000: 19732: debug : virDomainFree:2281 : dom=0x7fe6ec0d92d0, (VM: name=Backup-Server, uuid=56c30083-9248-4145-92be-38109b3536e1)

Comment 42 Christopher Pereira 2015-02-24 22:48:40 UTC
The other message I get is:

virDomainObjGetMetadata:16665 : metadata not found: Requested metadata element is not present

Sometimes stopping the gluster volume + restarting the gluster volume + resuming the VM works, and sometimes it doesn't.
In that case, shutting down the VM is the only way to bring it up again.

Comment 43 Christopher Pereira 2015-02-25 01:26:37 UTC
I discovered something important:

The qemu process of the paused VM is accessing a wrong image path.
lsof is returning:

qemu-kvm 1404 qemu   11u  unknown                                       /b55d0462-0dec-41a3-b0fd-6225fc5cf248/images/393a14c3-99df-4ffd-8ded-0705a22d3304/dc9d4128-3172-4259-9923-54db19347fd2 (stat: No such file or directory)

After restarting the VM, the paths are correct:

qemu-kvm 7781 qemu   11u   REG               0,34   647364608 11256586019595226674 /rhev/data-center/mnt/glusterSD/h1.imatronix.com:gluster-storage/b55d0462-0dec-41a3-b0fd-6225fc5cf248/images/393a14c3-99df-4ffd-8ded-0705a22d3304/dc9d4128-3172-4259-9923-54db19347fd2

This is the reason why the VM cannot be started again.

Comment 44 Christopher Pereira 2015-02-25 02:07:47 UTC
No. False alarm.

It seems that lsof does not always return the correct path.
lsof's FAQ has an entry "3.1.2 Why does lsof report the wrong path names?"

Additional info:

1) When stopping and resuming the gluster volume, it is also not possible to resume the VM via virsh (libvirt).
In libvirt/qemu/VM logs, I get a:
block I/O error in device 'drive-virtio-disk0': Transport endpoint is not connected (107)

2) If the VM is suspended in oVirt before stopping and resuming the gluster volume, it can be resumed via oVirt.

3) If the VM is suspended in virsh before stopping and resuming the gluster volume, it can be resumed via virsh.

Maybe the file descriptors pointing to the gluster volume must be reopened after restarting the gluster volume.
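
A minimal way to test that theory outside of QEMU (IMAGE is a placeholder path on the FUSE mount; the whole thing is only a sketch):

import errno, os

# Hold a file descriptor open across a gluster volume stop/start and compare
# it with a freshly opened one. A resumed QEMU keeps using the old descriptor,
# which is why it pauses again immediately.
IMAGE = "/rhev/data-center/mnt/glusterSD/localhost:_vm-data/test-file"

old_fd = os.open(IMAGE, os.O_RDWR | os.O_CREAT)
input("Stop and start the gluster volume now, then press Enter... ")

try:
    os.pwrite(old_fd, b"x", 0)  # old descriptor, like a resumed QEMU
    print("old fd: write OK")
except OSError as e:
    # Expected here: EBADF or ENOTCONN, matching the qemu log lines above.
    print("old fd:", errno.errorcode.get(e.errno, "?"), os.strerror(e.errno))

new_fd = os.open(IMAGE, os.O_RDWR)  # a fresh open after the restart works
os.pwrite(new_fd, b"x", 0)
print("new fd: write OK")
os.close(new_fd)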

Comment 45 Christopher Pereira 2015-02-25 02:56:13 UTC
I confirmed with people from gluster that the file descriptors pointing to the gluster-mount should be reopened in QEMU after restarting the volume.

Can someone please confirm whether the file descriptors are reopened in QEMU when libvirt resumes a VM?

I'm not sure if this also applies for network problems, but IMO we should provide a stable solution for resuming VMs even when glusterd or the volume is restarted.

Comment 46 Christopher Pereira 2015-03-26 04:12:36 UTC
Stefan from qemu-devel confirmed that "QEMU doesn't reopen file descriptors".

Thus, if all assumptions are correct, QEMU must be patched to reopen file descriptors when resuming VMs to fix this issue.

Comment 47 Christopher Pereira 2015-05-01 19:28:43 UTC
I reported the problem to QEMU here:
https://bugs.launchpad.net/qemu/+bug/1450891

Please update the BZ, as it won't be fixed in oVirt or GlusterFS, but in QEMU.

Note that this issue was triggered by this other one (fixed in Gerrit):
https://bugzilla.redhat.com/show_bug.cgi?id=1201355

Comment 48 Christopher Pereira 2015-05-04 22:07:09 UTC
It's not possible to reopen the image file in QEMU, because it could lead to data corruption.

If the glusterfs endpoint must be shut down for some reason, the VM must be suspended before the file descriptors are invalidated.

Anyway, the main reason the file descriptors got invalidated was BZ 1172905:
https://bugzilla.redhat.com/show_bug.cgi?id=1172905

This issue is just the consequence.

Comment 49 John Boero 2015-06-24 08:27:00 UTC
Very interesting. Thanks Christopher. I'm an onsite consultant doing a PoC and we're seeing the same issue here. Using 3 replicas with the default quorum ratio renders our volume read-only during a test kill of 2 bricks (by design). We're debating adjusting the quorum ratio manually to allow a single replica to be considered quorate for read/write. Storage is fairly stable RAID6 with a hot spare, and the customer is aware that a single brick in rw mode with no replication will not be considered unsafe.

Looking forward to your fix in 3.6A.

Comment 50 Christopher Pereira 2015-10-11 07:42:12 UTC
Hi John,

We dropped gluster replication after finding out that writes are synchronous (i.e. performance will be limited by the storage network bandwidth).
We are waiting for async write support in stable gluster branches.

A 1 Gbps network would restrict disk writes to about 125 MB/s (1 Gbps / 8 bits per byte).
Having multiple VMs means sharing those slow resources.
Besides, gluster doesn't support SEEK_HOLE reads, giving very bad performance with raw sparse files.
There were some patches to implement SEEK_HOLE support, but they probably won't be merged in the short term:
https://bugzilla.redhat.com/show_bug.cgi?id=1220173

Comment 51 Christopher Pereira 2015-10-14 11:03:39 UTC
Tested on 3.6-rc1 on CentOS 7.
The patches have been verified in production for some months.
This can easily be verified by checking that the glusterd service is NOT running inside the VDSM group.
The main issue was:
https://bugzilla.redhat.com/show_bug.cgi?id=1201355#c7

Comment 52 Amar Tumballi 2019-03-25 17:10:16 UTC
It looks like the latest glusterfs releases work totally fine with oVirt releases. Please reopen if these issues are seen with glusterfs-5.5 or 6.0.

