Bug 1467893

Summary: VM lease missing after host loses storage connection
Product: Red Hat Enterprise Virtualization Manager
Component: vdsm
Version: 4.1.2
Hardware: x86_64
OS: All
Status: CLOSED DUPLICATE
Severity: urgent
Priority: urgent
Reporter: Javier Coscia <jcoscia>
Assignee: Michal Skrivanek <michal.skrivanek>
QA Contact: Raz Tamir <ratamir>
Target Milestone: ovirt-4.1.7
Keywords: Reopened
oVirt Team: Virt
Type: Bug
Last Closed: 2017-10-02 06:18:06 UTC
Bug Blocks: 1417161
CC: amureini, aperotti, bazulay, berrange, chhu, cmestreg, dfediuck, dyuan, gwatson, jcoscia, jsuchane, lmen, lsurette, mavital, michal.skrivanek, mkalinin, mtessun, nsoffer, pkrempa, rbalakri, srevivo, teigland, tjelinek, tnisan, xuzhang, ycui, ykaul, ylavi

Description Javier Coscia 2017-07-05 12:37:00 UTC
Description of problem:

A VM lease in sanlock disappears after the host where the VM is running loses access to the storage domain holding the VM and the lease.


Version-Release number of selected component (if applicable):

rhevm-4.1.2.2-0.1.el7.noarch

vdsm-4.19.15-1.el7ev.x86_64
sanlock-3.4.0-1.el7.x86_64



How reproducible: 

100%

Steps to Reproduce:
1. Start an HA VM configured to have a lease on the same SD as the VM's disks (one way to set this up is sketched after these steps)
2. In my case the SD is iSCSI, so I dropped the connection with >> iscsiadm ... --logout
3. The VM switches to Paused state instead of being killed by sanlock
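
For reference, a minimal sketch of configuring the HA flag and the VM lease through the oVirt Python SDK (ovirtsdk4); the connection details are placeholders, and the VM/storage domain IDs are the ones appearing in the logs below:

    # Sketch only: set a VM lease on a storage domain via ovirtsdk4.
    # Connection parameters are placeholders.
    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    connection = sdk.Connection(
        url='https://engine.example.com/ovirt-engine/api',
        username='admin@internal',
        password='password',
        insecure=True,
    )
    vm_service = connection.system_service().vms_service().vm_service(
        '734aee59-6008-4f24-bc8f-b35b64265171')
    vm_service.update(
        types.Vm(
            high_availability=types.HighAvailability(enabled=True),
            lease=types.StorageDomainLease(
                storage_domain=types.StorageDomain(
                    id='5b69db0c-b854-42d7-b162-6975a3ff089c',
                ),
            ),
        ),
    )
    connection.close()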

Actual results:

VM lease is lost and VM switches to Paused state


Expected results:


Sanlock should notice the lease expired and kill the qemu-kvm process to let the VM be restarted elsewhere, since it is an HA VM.


Additional info:

I will attach the following logs from my test:


- vdsm.log
- sanlock.log
- libvirtd.log (debug - 1:qemu 1:locking)
- engine.log


The test started in the engine at about 2017-07-05 09:22:52

Comment 1 Javier Coscia 2017-07-05 12:38:36 UTC
Created attachment 1294569 [details]
engine, vdsm, sanlock and libvirtd logs

Comment 5 Doron Fediuck 2017-07-09 10:43:35 UTC
Hi,
I need some clarifications;
1. Which VM are you referring to with the lease? The HE VM or a different one?
2. How long did you wait before giving up waiting for restart?

Comment 6 Javier Coscia 2017-07-10 12:56:22 UTC
(In reply to Doron Fediuck from comment #5)
> Hi,
> I need some clarifications;
> 1. Which VM are you referring to with the lease? The HE VM or a different
> one?

Hi Doron, the VM was a normal VM, not an HE VM. I tested this in a non-SHE environment, although the customer has one.

> 2. How long did you wait before giving up waiting for restart?

In my case, the VM switched to the Paused state immediately after I logged out of the iSCSI connection. I cannot remember exactly how long I waited, but it was certainly more than a few seconds, and the VM did not leave the Paused state during that time.

Comment 12 Nir Soffer 2017-07-31 21:55:21 UTC
(In reply to Javier Coscia from comment #0)
> Description of problem:
> 
> A VM lease in sanlock disapear after the host where the VM is running loses
> access to the storage domain where the VM and the lease are.

This is expected - when sanlock cannot access storage for more than 80 seconds
or so, the vm lease expires, and sanlock will kill the vm and remove the lease.
This enables oVirt to start the vm on another host, even if the original host is
not accessible (e.g. lost power, networking issue).

> Steps to Reproduce:
> 1. Start a HA VM and configure it to have a lease on the same SD as the VM is
> 2. In my case I have a iSCSI SD so I dropped the connection >> iscsiadm ...
> --logout

This is not a good simulation; it is better to use iptables to drop packets to/from
this storage, or to block access using the switch/storage.
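
For example, something like this (a sketch; the portal address is a placeholder, and 3260 is the default iSCSI target port):

    iptables -A OUTPUT -d <iscsi-portal-ip> -p tcp --dport 3260 -j DROP
    iptables -A INPUT  -s <iscsi-portal-ip> -p tcp --sport 3260 -j DROP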

> 3. VM switches to Paused state instead of being killed by sanlock

This is either a sanlock or a libvirt issue. When the lease expires, sanlock should
kill the vm. Even if the vm is paused, it should be killed; otherwise oVirt cannot
guarantee anything based on the fact that the vm does not hold a lease.

I suspect that this is an issue with the helper program libvirt uses to terminate
vms after a lease has expired.

We need to reproduce this and collect libvirt debug logs.

Comment 13 Nir Soffer 2017-07-31 22:36:39 UTC
Checking the logs from attachment 1294569 [details].


vdsm.log
--------

1. Vm started with a lease

2017-07-05 09:23:52,652-0300 INFO  (vm/734aee59) [virt.vm] (vmId='734aee59-6008-4f24-bc8f-b35b64265171') <?xml version='1.0' encoding='UTF-8'?>
<domain xmlns:ovirt="http://ovirt.org/vm/tune/1.0" type="kvm">
   ...
        <lease>
            <key>734aee59-6008-4f24-bc8f-b35b64265171</key>
            <lockspace>5b69db0c-b854-42d7-b162-6975a3ff089c</lockspace>
            <target offset="3145728" path="/dev/5b69db0c-b854-42d7-b162-6975a3ff089c/xleases" />
        </lease>

2. Trouble monitoring storage domain

2017-07-05 09:29:52,659-0300 ERROR (check/loop) [storage.Monitor] Error checking path /dev/5b69db0c-b854-42d7-b162-6975a3ff089c/metadata (monitor:485)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 483, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/check.py", line 362, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
MiscFileReadException: Internal file read failure: ('/dev/5b69db0c-b854-42d7-b162-6975a3ff089c/metadata', 1, bytearray(b"/usr/bin/dd: error reading \'/dev/5b69db0c-b854-42d7-b162-6975a3ff089

3. Vm paused

2017-07-05 09:29:53,542-0300 INFO  (libvirt/events) [virt.vm] (vmId='734aee59-6008-4f24-bc8f-b35b64265171') abnormal vm stop device scsi0-0-0-0 error eio (vm:4144)
2017-07-05 09:29:53,542-0300 INFO  (libvirt/events) [virt.vm] (vmId='734aee59-6008-4f24-bc8f-b35b64265171') CPU stopped: onIOError (vm:5033)
2017-07-05 09:29:53,543-0300 INFO  (libvirt/events) [virt.vm] (vmId='734aee59-6008-4f24-bc8f-b35b64265171') CPU stopped: onSuspend (vm:5033)

There are no more libvirt event logs after this.


libvirtd.log
------------

(libvirtd uses UTC timestamps - based on the vdsm logs, these are 3 hours ahead of the vdsm log times)

1. Starting vm

2017-07-05 12:23:53.786+0000: 26756: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f740800e0a0 buf={"return": [{"frontend-open": false, "filename": "spicevmc", "label": "charchannel2"}, {"frontend-open": false, "filename": "disconnected:unix:/var/lib/libvirt/qemu/channels/734aee59-6008-4f24-bc8f-b35b64265171.org.qemu.guest_agent.0,server", "label": "charchannel1"}, {"frontend-open": false, "filename": "disconnected:unix:/var/lib/libvirt/qemu/channels/734aee59-6008-4f24-bc8f-b35b64265171.com.redhat.rhevm.vdsm,server", "label": "charchannel0"}, {"frontend-open": true, "filename": "unix:/var/lib/libvirt/qemu/domain-1-rhel73rhv41/monitor.sock,server", "label": "charmonitor"}], "id": "libvirt-3"}
 len=603

I don't see anything interesting in the libvirt log around 12:30:50, when the sanlock
lease expired.


sanlock.log
-----------

1. We have a lease 5b69db0c-b854-42d7-b162-6975a3ff089c:734aee59-6008-4f24-bc8f-b35b64265171:/dev/5b69db0c-b854-42d7-b162-6975a3ff089c/xleases:3145728

2017-07-05 09:23:54-0300 32936 [703]: s3:r9 resource 5b69db0c-b854-42d7-b162-6975a3ff089c:734aee59-6008-4f24-bc8f-b35b64265171:/dev/5b69db0c-b854-42d7-b162-6975a3ff089c/xleases:3145728 for 3,14,27610

2. Not sure what these messages say...

2017-07-05 09:24:22-0300 32964 [700]: s3 host 2 21 32961 ff3d4e62-2c49-4fab-bbfe-c083e94c8ab4.rhvh41.rei
2017-07-05 09:24:27-0300 32969 [700]: s1 host 2 14 32959 ff3d4e62-2c49-4fab-bbfe-c083e94c8ab4.rhvh41.rei

3. Trouble updating delta lease

2017-07-05 09:29:52-0300 33294 [3906]: 5b69db0c aio collect RD 0x7fbe1c0008c0:0x7fbe1c0008d0:0x7fbe3834c000 result -5:0 match res
2017-07-05 09:29:52-0300 33294 [3906]: s3 delta_renew read rv -5 offset 0 /dev/5b69db0c-b854-42d7-b162-6975a3ff089c/ids
2017-07-05 09:29:52-0300 33294 [3906]: s3 renewal error -5 delta_length 0 last_success 33273

4. Trouble accessing the lease - maybe vdsm was trying to access the lease?

2017-07-05 09:29:53-0300 33295 [2367]: read_sectors leader offset 3145728 rv -5 /dev/5b69db0c-b854-42d7-b162-6975a3ff089c/xleases
2017-07-05 09:29:53-0300 33295 [2367]: r8 paxos_release leader_read error -5
2017-07-05 09:29:53-0300 33295 [2367]: r8 release_token release leader -5

5. Trouble continues...

2017-07-05 09:30:50-0300 33353 [3906]: 5b69db0c aio collect RD 0x7fbe1c0008c0:0x7fbe1c0008d0:0x7fbe3834c000 result -5:0 match res
2017-07-05 09:30:50-0300 33353 [3906]: s3 delta_renew read rv -5 offset 0 /dev/5b69db0c-b854-42d7-b162-6975a3ff089c/ids
2017-07-05 09:30:50-0300 33353 [3906]: s3 renewal error -5 delta_length 0 last_success 33273

6. Sanlock expires the leases?

2017-07-05 09:30:50-0300 33353 [700]: s3 check_our_lease failed 80
2017-07-05 09:30:51-0300 33353 [700]: s3 all pids clear

David, according to this log, did sanlock terminate the lease owner?

Usually we see a lot of logs about sending SIGTERM, and then about
sending SIGKILL - I guess sanlock is configured to use a killpath
program, but we don't see any log about running this program or
its exit code.

Comment 14 David Teigland 2017-07-31 23:01:47 UTC
> 1. We have a lease
> 5b69db0c-b854-42d7-b162-6975a3ff089c:734aee59-6008-4f24-bc8f-b35b64265171:/
> dev/5b69db0c-b854-42d7-b162-6975a3ff089c/xleases:3145728
> 
> 2017-07-05 09:23:54-0300 32936 [703]: s3:r9 resource
> 5b69db0c-b854-42d7-b162-6975a3ff089c:734aee59-6008-4f24-bc8f-b35b64265171:/
> dev/5b69db0c-b854-42d7-b162-6975a3ff089c/xleases:3145728 for 3,14,27610
> 
> 2. Not sure what these messages say...
> 
> 2017-07-05 09:24:22-0300 32964 [700]: s3 host 2 21 32961
> ff3d4e62-2c49-4fab-bbfe-c083e94c8ab4.rhvh41.rei
> 2017-07-05 09:24:27-0300 32969 [700]: s1 host 2 14 32959
> ff3d4e62-2c49-4fab-bbfe-c083e94c8ab4.rhvh41.rei

sanlock records that it sees live lockspace/host_id/delta leases from these hosts.

> 3. Trouble updating delta lease
> 
> 2017-07-05 09:29:52-0300 33294 [3906]: 5b69db0c aio collect RD
> 0x7fbe1c0008c0:0x7fbe1c0008d0:0x7fbe3834c000 result -5:0 match res
> 2017-07-05 09:29:52-0300 33294 [3906]: s3 delta_renew read rv -5 offset 0
> /dev/5b69db0c-b854-42d7-b162-6975a3ff089c/ids
> 2017-07-05 09:29:52-0300 33294 [3906]: s3 renewal error -5 delta_length 0
> last_success 33273
> 
> 4. Trouble accessing the lease - maybe vdsm was trying to access the lease?
> 
> 2017-07-05 09:29:53-0300 33295 [2367]: read_sectors leader offset 3145728 rv
> -5 /dev/5b69db0c-b854-42d7-b162-6975a3ff089c/xleases
> 2017-07-05 09:29:53-0300 33295 [2367]: r8 paxos_release leader_read error -5
> 2017-07-05 09:29:53-0300 33295 [2367]: r8 release_token release leader -5
> 
> 5. Trouble continue...
> 
> 2017-07-05 09:30:50-0300 33353 [3906]: 5b69db0c aio collect RD
> 0x7fbe1c0008c0:0x7fbe1c0008d0:0x7fbe3834c000 result -5:0 match res
> 2017-07-05 09:30:50-0300 33353 [3906]: s3 delta_renew read rv -5 offset 0
> /dev/5b69db0c-b854-42d7-b162-6975a3ff089c/ids
> 2017-07-05 09:30:50-0300 33353 [3906]: s3 renewal error -5 delta_length 0
> last_success 33273
> 
> 6. Sanlock expire the leases?
> 
> 2017-07-05 09:30:50-0300 33353 [700]: s3 check_our_lease failed 80
> 2017-07-05 09:30:51-0300 33353 [700]: s3 all pids clear

When 'check_our_lease failed 80' appears, it means sanlock has entered recovery mode for the lockspace, which means it will begin trying to kill any processes using resource leases in the lockspace.  If all those processes either release their resource/paxos leases, or exit, then you get the 'all pids clear' message, which means that the lockspace is now unused.  The watchdog is then disabled for the lockspace so it won't fire.

> David, according to this log, did sanlock terminate the lease owner?

Yes

> Usually we see lot of logs about sending SIGTERM, and then about
> sending SIGKILL - I guess sanlock is configured to use killpath
> program, but we don't see any log about running this program or
> its exit code.

You'll normally see kill messages like this:
s20 kill 42444 sig 15 count 1

between 'check_our_lease failed' and 'all pids clear'.  If you didn't, then there were probably no resource leases held in the lockspace when it failed, so sanlock was able to immediately drop the unused lockspace.

More details about most of these messages are written here
http://people.redhat.com/teigland/sanlock-messages.txt

Comment 15 Nir Soffer 2017-07-31 23:33:26 UTC
More info from libvirt log:

1. Acquiring the lease

2017-07-05 12:23:53.800+0000: 26759: debug : virDomainLockManagerNew:146 : Adding leases
2017-07-05 12:23:53.800+0000: 26759: debug : virDomainLockManagerAddLease:57 : Add lease /dev/5b69db0c-b854-42d7-b162-6975a3ff089c/xleases
2017-07-05 12:23:53.800+0000: 26759: debug : virLockManagerAddResource:332 : lock=0x7f74080102a0 type=1 name=734aee59-6008-4f24-bc8f-b35b64265171 nparams=3 params=0x7f742c6646a0 flags=0
2017-07-05 12:23:53.800+0000: 26759: debug : virLockManagerLogParams:91 :   key=path type=string value=/dev/5b69db0c-b854-42d7-b162-6975a3ff089c/xleases
2017-07-05 12:23:53.800+0000: 26759: debug : virLockManagerLogParams:85 :   key=offset type=ulong value=3145728
2017-07-05 12:23:53.800+0000: 26759: debug : virLockManagerLogParams:91 :   key=lockspace type=string value=5b69db0c-b854-42d7-b162-6975a3ff089c
2017-07-05 12:23:53.800+0000: 26759: debug : virDomainLockManagerNew:151 : Adding disks
2017-07-05 12:23:53.800+0000: 26759: debug : virDomainLockManagerAddImage:90 : Add disk /rhev/data-center/00000001-0001-0001-0001-000000000311/5b69db0c-b854-42d7-b162-6975a3ff089c/images/7ba
addbe-c0d7-4648-a497-6f1bc8c2b1ce/4e1562e2-f8e9-47a6-8b8c-384665e6b585
2017-07-05 12:23:53.800+0000: 26759: debug : virLockManagerAddResource:332 : lock=0x7f74080102a0 type=0 name=/rhev/data-center/00000001-0001-0001-0001-000000000311/5b69db0c-b854-42d7-b162-69
75a3ff089c/images/7baaddbe-c0d7-4648-a497-6f1bc8c2b1ce/4e1562e2-f8e9-47a6-8b8c-384665e6b585 nparams=0 params=(nil) flags=0
2017-07-05 12:23:53.800+0000: 26759: debug : virLockManagerAcquire:350 : lock=0x7f74080102a0 state='<null>' flags=0 action=0 fd=(nil)

    note: action=0 means  VIR_DOMAIN_LOCK_FAILURE_DEFAULT

2017-07-05 12:23:53.800+0000: 26759: debug : virLockManagerSanlockAcquire:989 : Acquiring object 1
2017-07-05 12:23:54.086+0000: 26759: debug : virLockManagerSanlockAcquire:1027 : Acquire completed fd=-1


When using VIR_DOMAIN_LOCK_FAILURE_DEFAULT, libvirt does not use killpath:

 948         if (action != VIR_DOMAIN_LOCK_FAILURE_DEFAULT) {
 949             char uuidstr[VIR_UUID_STRING_BUFLEN];
 950             virUUIDFormat(priv->vm_uuid, uuidstr);
 951             if (virLockManagerSanlockRegisterKillscript(sock, priv->vm_uri,
 952                                                         uuidstr, action) < 0)
 953                 goto error;
 954         }

We also do not see this log:

851     VIR_DEBUG("Register sanlock killpath: %s %s", path, args);

So we are not using killpath, and sanlock is responsible for killing
a vm that lost its lease.


2. Libvirt gets an eio error from qemu

2017-07-05 12:29:53.534+0000: 26756: info : qemuMonitorJSONIOProcessLine:206 : QEMU_MONITOR_RECV_EVENT: mon=0x7f740800e0a0 event={"timestamp": {"seconds": 1499257793, "microseconds": 534090}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-scsi0-0-0-0", "nospace": false, "__com.redhat_reason": "eio", "reason": "Input/output error", "operation": "write", "action": "stop"}}


3. Pausing the vm

2017-07-05 12:29:53.534+0000: 26756: debug : qemuProcessHandleIOError:946 : Transitioned guest rhel73rhv41 to paused state due to IO error


4. Adding leases?

2017-07-05 12:29:53.535+0000: 26756: debug : virDomainLockManagerNew:146 : Adding leases
2017-07-05 12:29:53.535+0000: 26756: debug : virDomainLockManagerAddLease:57 : Add lease /dev/5b69db0c-b854-42d7-b162-6975a3ff089c/xleases
2017-07-05 12:29:53.535+0000: 26756: debug : virLockManagerAddResource:332 : lock=0x7f743f304f10 type=1 name=734aee59-6008-4f24-bc8f-b35b64265171 nparams=3 params=0x7fff80a18bc0 flags=0
2017-07-05 12:29:53.535+0000: 26756: debug : virLockManagerLogParams:91 :   key=path type=string value=/dev/5b69db0c-b854-42d7-b162-6975a3ff089c/xleases
2017-07-05 12:29:53.535+0000: 26756: debug : virLockManagerLogParams:85 :   key=offset type=ulong value=3145728
2017-07-05 12:29:53.535+0000: 26756: debug : virLockManagerLogParams:91 :   key=lockspace type=string value=5b69db0c-b854-42d7-b162-6975a3ff089c
2017-07-05 12:29:53.535+0000: 26756: debug : virDomainLockManagerNew:151 : Adding disks
2017-07-05 12:29:53.535+0000: 26756: debug : virDomainLockManagerAddImage:90 : Add disk /rhev/data-center/00000001-0001-0001-0001-000000000311/5b69db0c-b854-42d7-b162-6975a3ff089c/images/7ba
addbe-c0d7-4648-a497-6f1bc8c2b1ce/4e1562e2-f8e9-47a6-8b8c-384665e6b585
2017-07-05 12:29:53.535+0000: 26756: debug : virLockManagerAddResource:332 : lock=0x7f743f304f10 type=0 name=/rhev/data-center/00000001-0001-0001-0001-000000000311/5b69db0c-b854-42d7-b162-69
75a3ff089c/images/7baaddbe-c0d7-4648-a497-6f1bc8c2b1ce/4e1562e2-f8e9-47a6-8b8c-384665e6b585 nparams=0 params=(nil) flags=0
2017-07-05 12:29:53.535+0000: 26756: debug : virLockManagerRelease:365 : lock=0x7f743f304f10 state=0x7f74080043a0 flags=0


5. Release sanlock lease?

2017-07-05 12:29:53.541+0000: 26756: error : virLockManagerSanlockRelease:1089 : Failed to release lock: Input/output error
2017-07-05 12:29:53.541+0000: 26756: debug : virLockManagerFree:387 : lock=0x7f743f304f10
2017-07-05 12:29:53.541+0000: 26756: warning : qemuProcessHandleIOError:958 : Unable to release lease on rhel73rhv41
2017-07-05 12:29:53.541+0000: 26756: debug : qemuProcessHandleIOError:959 : Preserving lock state '5b69db0c-b854-42d7-b162-6975a3ff089c:734aee59-6008-4f24-bc8f-b35b64265171:/dev/5b69db0c-b854-42d7-b162-6975a3ff089c/xleases:3145728:3'


6. In sanlock log we see at this time:

2017-07-05 09:29:53-0300 33295 [2367]: r8 write_host_block host_id 1 flags 0 gen 0 rv -5
2017-07-05 09:29:53-0300 33295 [2367]: r8 release_token write_host_block -5
2017-07-05 09:29:53-0300 33295 [2367]: worker3 aio collect RD 0x7fbe140008c0:0x7fbe140008d0:0x7fbe1403f000 result -5:0 match res
2017-07-05 09:29:53-0300 33295 [2367]: read_sectors leader offset 3145728 rv -5 /dev/5b69db0c-b854-42d7-b162-6975a3ff089c/xleases
2017-07-05 09:29:53-0300 33295 [2367]: r8 paxos_release leader_read error -5
2017-07-05 09:29:53-0300 33295 [2367]: r8 release_token release leader -5


David, do we still have lease r8 at this point?

Based on this, it looks like libvirt paused the vm, and released the
sanlock lease. This can explain why sanlock did not kill the vm.

Comment 16 Nir Soffer 2017-07-31 23:59:21 UTC
Daniel, based on attachment 1294569 [details] and comment 15, I think what
happened is this (using local time):

09:29:53  Vm paused because of io error
09:29:53  Libvirt released the vm lease
09:30:50  sanlock failed to renew delta lease

Sanlock did not kill the vm because it did not have a lease at that time.

We expect this flow:

1. The VM pauses because of an io error
2. Libvirt does not modify the vm lease
3. Sanlock fails to renew the delta lease
4. Sanlock kills the vm
5. oVirt starts the vm on another host

This looks like a libvirt issue, can you take a look?

Comment 17 David Teigland 2017-08-01 00:15:48 UTC
I don't yet understand what libvirt is trying to do in steps 4 and 5.  I think we'll need to study libvirt code to figure that out.

> 6. In sanlock log we see at this time:
> 
> 2017-07-05 09:29:53-0300 33295 [2367]: r8 write_host_block host_id 1 flags 0
> gen 0 rv -5
> 2017-07-05 09:29:53-0300 33295 [2367]: r8 release_token write_host_block -5
> 2017-07-05 09:29:53-0300 33295 [2367]: worker3 aio collect RD
> 0x7fbe140008c0:0x7fbe140008d0:0x7fbe1403f000 result -5:0 match res
> 2017-07-05 09:29:53-0300 33295 [2367]: read_sectors leader offset 3145728 rv
> -5 /dev/5b69db0c-b854-42d7-b162-6975a3ff089c/xleases
> 2017-07-05 09:29:53-0300 33295 [2367]: r8 paxos_release leader_read error -5
> 2017-07-05 09:29:53-0300 33295 [2367]: r8 release_token release leader -5
> 
> David, do we still have lease r8 at this point?

For the purposes of the sanlock daemon on this host, and the process it was holding the lease for, the lease is effectively released.  The sanlock daemon knows that the process is no longer using the lease, and that it can safely release the resource/paxos lease on disk.  And the watchdog no longer needs to be active to protect the lockspace.  When sanlock goes to write to disk that the lease is free, the i/o errors prevent it from doing that.  (This is expected when releasing leases in response to failing i/o in a lockspace, but we try anyway.)  Failure to free the lease on disk doesn't really matter to this host.  To other hosts that might want to acquire this lease, the fact that it wasn't released cleanly on disk means they will have to wait for the full delta lease expiration before they can acquire it.
 
> Based on this, it looks like libvirt paused the vm, and released the
> sanlock lease. This can explain why sanlock did not kill the vm.

I could confirm that with the debugging output from sanlock at the time.

Comment 18 Nir Soffer 2017-08-01 15:02:28 UTC
Javier, on which libvirt version was it reproduced?

Comment 19 Javier Coscia 2017-08-01 15:59:19 UTC
Hi Nir, version is libvirt-daemon-2.0.0-10.el7_3.9.x86_64

Comment 20 Javier Coscia 2017-08-01 16:01:14 UTC
Returning needinfo on Daniel, sorry

Comment 21 Martin Tessun 2017-08-01 16:27:14 UTC
Jarda, could you please have a look and comment on the flow described in comment #15:

09:29:53  Vm paused because of io error
09:29:53  Libvirt released the vm lease
09:30:50  sanlock failed to renew delta lease

Why did libvirt release the VM lease? Is this expected?
As of comment #19 the libvirt version is libvirt-daemon-2.0.0-10.el7_3.9.x86_64

Thanks!
Martin

Comment 22 Daniel Berrangé 2017-08-01 16:52:30 UTC
Libvirt will always release the lease when the VM CPUs are paused, regardless of the reason for the pause. It records the current lease state, so if there's a request to resume the CPUs, it will pass that state back to sanlock. sanlock will verify the lease version to make sure the lease hasn't been acquired somewhere else in the meantime.

sanlock is only expected to kill the VM if the VM is running and holding the leases, which is not the case here. The VM stopped executing because QEMU saw an I/O error, and the lease was released.

It is still safe to start the VM on a new host if oVirt wants to, and it can kill off the old VM when it sees it pause due to an I/O error.

So I'm not seeing any libvirt bug here, everything is working as intended.

Comment 23 Nir Soffer 2017-08-01 17:20:02 UTC
Daniel, releasing a lease of a vm is not our intention. Can you explain why this
behaviour is needed?

Comment 24 David Teigland 2017-08-01 17:44:16 UTC
My recollection of the original design is that vm leases would be released when the vm was paused, and then reacquired before the vm was resumed, like Dan described.  If the vm was not running, there was no need to protect the vm disk.
This was many years ago now; are there some new thoughts about how the vm lease should correspond with the vm run state?

(I can't quite follow the libvirt log messages to tell whether it's doing the expected steps of releasing the lease after the vm is paused, and reacquiring it before the vm is resumed.)

Comment 25 Daniel Berrangé 2017-08-01 17:52:48 UTC
Releasing leases on pause is critical for live migration because you have multiple QEMUs running concurrently, but it is reasonable more generally, as the goal of this locking code is simply to prevent multiple concurrent writers, not to prevent multiple concurrent QEMU processes from having the file open in general.

Comment 26 Nir Soffer 2017-08-01 18:28:20 UTC
Based on comment 22 and later, I think we need to:

1. RHV: Adapt the HA VM policy to try to start paused vms on another host, maybe after
   a certain timeout.

2. Libvirt: Document how leases are handled when pausing and resuming vms; I cannot
   find any documentation about this.

Michal, the storage side is behaving as intended, so I think your team should continue
to handle this bug.

Comment 27 Yaniv Kaul 2017-08-01 19:05:58 UTC
(In reply to Nir Soffer from comment #26)
> Based on comment 22 and later, I think we need to:
> 
> 1. RHV: Adapt HA VMs policy to try to start paused vms on another host,
> maybe after
>    certain timeout.

What happens if the storage comes back? Will they resume? That's the whole point of pausing them in the first place.
Will they get their lock back?

> 
> 2. Libvirt: Document how leases are handled when pausing and resuming vms, I
> cannot
>    find any documentation about this.
> 
> Michal, storage side is behaving as intended, so I think your team should
> continue
> to handle this bug.

Comment 28 Yaniv Kaul 2017-08-01 19:07:44 UTC
(In reply to Daniel Berrange from comment #25)
> Releasing leases on pause is critical for live migration because you have
> multiple QEMU's running concurrently, but it is reasonable more generally,
> as the goal of this locking code is simply to prevent multiple concurrent
> writers, not to prevent multiple concurrent QEMU processes having the file
> open in general.

The RHV use case was (and is) to prevent split-brain between multiple QEMU processes (on different hosts!). There's no other reason we'd have multiple writers. So something went south between the RHV requirements and the implementation.

Comment 29 David Teigland 2017-08-01 19:16:19 UTC
1. libvirt makes a record of the current sanlock lease version
2. libvirt pauses vm
3. libvirt releases the sanlock lease
4. time passes
5. libvirt reacquires the sanlock lease, specifying the lease version
   saved in 2
6. if step 5 fails, libvirt kills the paused vm
7. if step 5 succeeds, libvirt resumes the vm 

If another host has acquired the lease and run the vm in step 4, then in step 5 sanlock will see that the lease version doesn't match, and will fail to re-acquire the lease.
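
To illustrate the version check in steps 5-6, here is a small self-contained sketch of the idea (a toy in-memory model, not the actual sanlock or libvirt implementation):

    # Toy model of lease-version checking; not real sanlock/libvirt code.
    class LeaseVersionMismatch(Exception):
        pass

    class Lease(object):
        def __init__(self):
            self.version = 0        # bumped every time the lease is acquired
            self.holder = None

        def acquire(self, host, expected_version=None):
            if expected_version is not None and self.version != expected_version:
                # step 5 fails -> caller must kill the paused vm (step 6)
                raise LeaseVersionMismatch()
            self.version += 1
            self.holder = host

        def release(self):
            self.holder = None

    lease = Lease()
    lease.acquire("host1")              # vm running on host1
    saved_version = lease.version       # step 1: record the lease version
    lease.release()                     # steps 2-3: vm paused, lease released
    lease.acquire("host2")              # step 4: another host runs the vm
    try:
        lease.acquire("host1", expected_version=saved_version)   # step 5
        print("resume the vm on host1")                          # step 7
    except LeaseVersionMismatch:
        print("kill the paused vm on host1")                     # step 6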

Comment 30 David Teigland 2017-08-01 19:21:33 UTC
(In reply to Yaniv Kaul from comment #28)
> (In reply to Daniel Berrange from comment #25)
> > Releasing leases on pause is critical for live migration because you have
> > multiple QEMU's running concurrently, but it is reasonable more generally,
> > as the goal of this locking code is simply to prevent multiple concurrent
> > writers, not to prevent multiple concurrent QEMU processes having the file
> > open in general.
> 
> RHV use case was (is) to prevent cases of split-brain of multiple QEMU
> processes (on different hosts!). There's no other reason we'd have multiple
> writers. So something went south between RHV requirements and the
> implementation.

What Dan is saying is that there is no real reason to hold the lease while the qemu process is running but the vm is paused.  This is especially relevant when talking about live migration since there will be two qemu processes running for a time (on source and dest), but only one with the vm running.  The lease corresponds to the qemu process with the vm running.

Comment 31 Nir Soffer 2017-08-01 19:25:04 UTC
Yaniv, I don't think there is any risk in libvirt's behaviour. We need to enhance
vm life cycle management to start paused HA VMs on another host.

Adding reference to the original design:
https://www.redhat.com/archives/libvir-list/2010-August/msg00179.html

Comment 33 Michal Skrivanek 2017-08-02 06:12:54 UTC
Considering the current HA functionality, the easiest would be to simply kill the EIO paused VM right away and report abnormal shutdown, the rest of the logic will handle it.
We can perhaps fail to kill the paused VM, then:
if the hypervisor eventually goes unresponsive it will be fenced.
if the hypervisor stays Up and the error is recovered, libvirt should fail to re-acquire the lease and destroy the VM. 
We need to be a bit careful about what state we report back to engine in case the destroy fails

Yaniv, if the above sounds reasonable I suggest to spin off a separate bug to bug 1230788 to handle just this specific behavior for HA VMs with lease.

Comment 34 Yaniv Lavi 2017-08-06 13:19:49 UTC
(In reply to Michal Skrivanek from comment #33)
> Considering the current HA functionality, the easiest would be to simply
> kill the EIO paused VM right away and report abnormal shutdown, the rest of
> the logic will handle it.
> We can perhaps fail to kill the paused VM, then:
> if the hypervisor eventually goes unresponsive it will be fenced.
> if the hypervisor stays Up and the error is recovered, libvirt should fail
> to re-acquire the lease and destroy the VM. 
> We need to be a bit careful about what state we report back to engine in
> case the destroy fails
> 
> Yaniv, if the above sounds reasonable I suggest to spin off a separate bug
> to bug 1230788 to handle just this specific behavior for HA VMs with lease.

I think this should be a libvirt behavior unrelated to the RFE. As you said, if the hypervisor stays Up and the error is recovered, libvirt/sanlock should fail to re-acquire the lease and destroy the VM. Martin, what do you think?

Comment 35 Allon Mureinik 2017-08-10 12:07:29 UTC
Pushing to 4.1.6 pending needinfo.

Comment 36 Martin Tessun 2017-08-18 08:53:22 UTC
(In reply to Yaniv Lavi (Dary) from comment #34)
> (In reply to Michal Skrivanek from comment #33)
> > Considering the current HA functionality, the easiest would be to simply
> > kill the EIO paused VM right away and report abnormal shutdown, the rest of
> > the logic will handle it.
> > We can perhaps fail to kill the paused VM, then:
> > if the hypervisor eventually goes unresponsive it will be fenced.
> > if the hypervisor stays Up and the error is recovered, libvirt should fail
> > to re-acquire the lease and destroy the VM. 
> > We need to be a bit careful about what state we report back to engine in
> > case the destroy fails
> > 
> > Yaniv, if the above sounds reasonable I suggest to spin off a separate bug
> > to bug 1230788 to handle just this specific behavior for HA VMs with lease.
> 
> I think this should be a libvirt behavior unrelated to the RFE. As you said,
> if the hypervisor stays Up and the error is recovered, libvirt/sanlock
> should fail to re-acquire the lease and destroy the VM. Martin, what do you
> think?

I discussed this with Jarda. We also think that this should (and can) be handled by libvirt.

Anyways it should not cause any corruption as re-acquiring the lease does fail, but it might lead to some confusion as there could be two VMs running (where one of them does not have access to its storage).

Comment 37 Allon Mureinik 2017-08-28 12:33:53 UTC
(In reply to Martin Tessun from comment #36)
> (In reply to Yaniv Lavi (Dary) from comment #34)
> > (In reply to Michal Skrivanek from comment #33)
> > > Considering the current HA functionality, the easiest would be to simply
> > > kill the EIO paused VM right away and report abnormal shutdown, the rest of
> > > the logic will handle it.
> > > We can perhaps fail to kill the paused VM, then:
> > > if the hypervisor eventually goes unresponsive it will be fenced.
> > > if the hypervisor stays Up and the error is recovered, libvirt should fail
> > > to re-acquire the lease and destroy the VM. 
> > > We need to be a bit careful about what state we report back to engine in
> > > case the destroy fails
> > > 
> > > Yaniv, if the above sounds reasonable I suggest to spin off a separate bug
> > > to bug 1230788 to handle just this specific behavior for HA VMs with lease.
> > 
> > I think this should be a libvirt behavior unrelated to the RFE. As you said,
> > if the hypervisor stays Up and the error is recovered, libvirt/sanlock
> > should fail to re-acquire the lease and destroy the VM. Martin, what do you
> > think?
> 
> I discussed this with Jarda. We also think that this should (and can) be
> handled by libvirt.
> 
> Anyways it should not cause any corruption as re-acquiring the lease does
> fail, but it might lead to some confusion as there could be two VMs running
> (where one of them does not have access to its storage).

Martin, can you move this to libvirt then and target it accordingly?

Comment 39 Daniel Berrangé 2017-08-30 14:39:08 UTC
(In reply to Michal Skrivanek from comment #33)
> Considering the current HA functionality, the easiest would be to simply
> kill the EIO paused VM right away and report abnormal shutdown, the rest of
> the logic will handle it.

This is not something libvirt should do. This equates an EIO on the disk volume with failure of the lease. There is no requirement that the lease be held on the same storage volume as the disk itself (and libvirt does not even know whether it is or not), so libvirt should not silently turn an EIO on the disk volume into a pseudo-lease failure.  If the mgmt app wants to do that, it can kill the guest itself when it sees the pause due to EIO, but this is not a policy to put in libvirt.

Comment 40 Peter Krempa 2017-08-31 07:32:03 UTC
I concur. The only thing libvirt can do here is allow a "kill" policy for disk I/O errors, which will not correlate with the lease, but simply kill the VM on disk error if the user wishes so. On the other hand I don't think this will be of much use to oVirt, since they also handle other disk errors internally so it would require a rather large set of knobs to express the matrix of the desired outcomes.

In general, I think that if the VM is paused due to a disk failure, libvirtd delivers an event [1] to the management application, which can be used for any higher-level policy decision it wishes, based on the state of the VM and other data.


[1] http://libvirt.org/html/libvirt-libvirt-domain.html#virConnectDomainEventIOErrorReasonCallback
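
For illustration, a management application could react to that event roughly like this (a sketch using the libvirt Python binding; the is_ha_vm() policy hook is hypothetical, and a real application would defer API calls out of the callback):

    # Sketch: react to IO error events from a management application.
    # is_ha_vm() is a hypothetical policy hook, not a libvirt API.
    import libvirt

    def is_ha_vm(dom):
        # placeholder for the management application's own HA policy lookup
        return True

    def on_io_error(conn, dom, src_path, dev_alias, action, reason, opaque):
        # 'action' reports what libvirt did (e.g. paused the guest on EIO).
        print("IO error on %s (%s): reason=%s" % (dom.name(), dev_alias, reason))
        if is_ha_vm(dom):
            # In a real application this would be scheduled outside the
            # callback; destroying the vm lets the HA logic restart it elsewhere.
            dom.destroy()

    libvirt.virEventRegisterDefaultImpl()
    conn = libvirt.open('qemu:///system')
    conn.domainEventRegisterAny(
        None, libvirt.VIR_DOMAIN_EVENT_ID_IO_ERROR_REASON, on_io_error, None)

    while True:
        libvirt.virEventRunDefaultImpl()    # simple event loop for the sketch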

Comment 41 Marina Kalinin 2017-08-31 17:12:17 UTC
I am reopening and reassigning this to RHV.
We need to find a solution here, otherwise we are missing the point of the VM lease behavior.

Though to me it sounds like libvirt should recognize that the lease is on a different volume and reconsider not releasing the lock.

Comment 43 Nir Soffer 2017-09-04 14:44:48 UTC
*** Bug 1459156 has been marked as a duplicate of this bug. ***

Comment 44 Nir Soffer 2017-09-04 14:52:21 UTC
I agree with Peter that this should be handled by the management program, not by libvirt.
The HA logic can destroy a paused HA VM after an I/O error and start the VM on another
host.

Comment 46 Nir Soffer 2017-09-14 17:08:51 UTC
Michal, this bug is about vm life cycle; I think it belongs to your team.

Comment 47 Michal Skrivanek 2017-09-19 13:55:36 UTC
I am
- moving it under virt
- nacking 4.1.z request due to Design
- proposing to handle this as part of RFE bug 1317450 (pending confirmation in that bug that it is suitable)