Bug 1650417

Summary: HA is broken for VMs having disks in NFS storage domain because of Qemu OFD locking
Product: Red Hat Enterprise Virtualization Manager
Reporter: nijin ashok <nashok>
Component: ovirt-engine
Assignee: Nobody <nobody>
Status: CLOSED ERRATA
QA Contact: Polina <pagranat>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.2.7
CC: ahadas, dprezhev, gveitmic, kmehta, michal.skrivanek, nashok, nsoffer, pagranat, rdlugyhe, smelamud, tnisan, vvasilev
Target Milestone: ovirt-4.4.1
Keywords: Rebase, TestOnly
Target Release: 4.3.0
Flags: lsvaty: testing_plan_complete-
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: rhv-4.4.0-29
Doc Type: Bug Fix
Doc Text:
Previously, if a host failed and the RHV Manager tried to start the high-availability virtual machine (HA VM) before the NFS lease expired, OFD locking caused the start to fail with the error 'Failed to get "write" lock. Is another process using the image?'. If the HA VM failed three times in a row, the Manager did not start it again, breaking the HA functionality. The current release fixes this issue: the RHV Manager continues trying to start the VM even after three failures, with the frequency of the attempts decreasing over time. Eventually, once the lock expires, the VM starts.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-08-04 13:16:18 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description nijin ashok 2018-11-16 06:26:32 UTC
Description of problem:

If a host fails (hardware failure, OS crash, network issue, etc.) and the engine tries to start the HA VM before the NFS lease expires, the start fails because of OFD locking with the error 'Failed to get "write" lock. Is another process using the image?'. If it fails 3 consecutive times, the engine will not try to start the VM again, breaking the HA functionality.

The issue was observed in the internal RH production environment and I was able to reproduce it as well.

Here, host rhvh-16 crashed:

===
2018-11-04 15:51:28,925-05 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connection timeout for host 'rhvh-16', last response arrived 16919 ms ago.
2018-11-04 15:52:33,629-05 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-64) [6db662ce] EVENT_ID: VM_WAS_SET_DOWN_DUE_TO_HOST_REBOOT_OR_MANUAL_FENCE(143), Vm puppet01 was shut down due to rhvh-16 host reboot or manual fence
===

The engine tried to start the VM on 3 hosts, and all attempts failed with the mentioned error:

===
2018-11-04 15:52:35,358-05 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-74) [5ec88a40] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM puppet01 on Host rhvh-15

2018-11-04 15:52:38,620-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-7) [] EVENT_ID: VM_DOWN_ERROR(119), VM puppet01 is down with error. Exit message: internal error: qemu unexpectedly closed the monitor: 2018-11-04T20:52:37.481659Z qemu-kvm: -drive file=/rhev/data-center/mnt//b7d4d0d9-73b5-444c-8dec-bb0db5152c85/images/d7e828ff-78ea-4dc4-a82a-61dcdf431454/a4b9cea3-0579-4361-b750-3ec04a405732,format=qcow2,if=none,id=drive-ua-d7e828ff-78ea-4dc4-a82a-61dcdf431454,serial=d7e828ff-78ea-4dc4-a82a-61dcdf431454,cache=none,werror=stop,rerror=stop,aio=threads: Failed to get "write" lock


2018-11-04 15:52:38,880-05 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-59) [] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM puppet01 on Host rhvh-13
2018-11-04 15:52:42,626-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-4) [] EVENT_ID: VM_DOWN_ERROR(119), VM puppet01 is down with error. Exit message: internal error: qemu unexpectedly closed the monitor: 2018-11-04T20:52:40.920096Z qemu-kvm: -drive file=/rhev/data-center/mnt//b7d4d0d9-73b5-444c-8dec-bb0db5152c85/images/d7e828ff-78ea-4dc4-a82a-61dcdf431454/a4b9cea3-0579-4361-b750-3ec04a405732,format=qcow2,if=none,id=drive-ua-d7e828ff-78ea-4dc4-a82a-61dcdf431454,serial=d7e828ff-78ea-4dc4-a82a-61dcdf431454,cache=none,werror=stop,rerror=stop,aio=threads: Failed to get "write" lock


2018-11-04 15:52:42,878-05 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-66) [] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM puppet01 on Host rhvh-02
2018-11-04 15:52:46,200-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-7) [] EVENT_ID: VM_DOWN_ERROR(119), VM puppet01 is down with error. Exit message: internal error: qemu unexpectedly closed the monitor: 2018-11-04T20:52:45.060839Z qemu-kvm: -drive file=/rhev/data-center/mnt/b7d4d0d9-73b5-444c-8dec-bb0db5152c85/images/d7e828ff-78ea-4dc4-a82a-61dcdf431454/a4b9cea3-0579-4361-b750-3ec04a405732,format=qcow2,if=none,id=drive-ua-d7e828ff-78ea-4dc4-a82a-61dcdf431454,serial=d7e828ff-78ea-4dc4-a82a-61dcdf431454,cache=none,werror=stop,rerror=stop,aio=threads: Failed to get "write" lock
===

Since MaxRerunVmOnVdsCount has a default value of 3, the engine will not try to start the VM again, and hence the whole HA functionality is lost. Also, as per the customer, the current value of v4-lease-seconds in his NFS storage is 30 seconds, which is low compared to a RHEL NFS server. So there is a high probability of hitting this issue with RHEL NFS storage, where the default lease time is 180 seconds. However, this all depends on how fast the engine starts the VM.
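
For illustration only, here is a minimal sketch (not vdsm or engine code) of the OFD write-lock conflict behind the qemu error above. It assumes Linux and Python 3.9+, where the fcntl module exposes the F_OFD_* constants, and uses a hypothetical scratch file in place of the qcow2 image. Because OFD locks belong to the open file description, a second open of the same file cannot take the write lock while the first one holds it - the same reason a second qemu-kvm cannot open the image while the dead host's NFSv4 lease still pins the lock.

===
import errno
import fcntl
import os
import struct

# Hypothetical scratch file standing in for the qcow2 image on the NFS mount.
IMAGE = "/tmp/fake-disk.img"
open(IMAGE, "a").close()

def ofd_write_lock(fd):
    """Try to take a non-blocking OFD write lock on the whole file."""
    # struct flock packed as in the fcntl module docs; l_pid must be 0 for OFD locks.
    lockdata = struct.pack("hhllhh", fcntl.F_WRLCK, os.SEEK_SET, 0, 0, 0, 0)
    fcntl.fcntl(fd, fcntl.F_OFD_SETLK, lockdata)

fd1 = os.open(IMAGE, os.O_RDWR)   # "first qemu": holds the write lock
fd2 = os.open(IMAGE, os.O_RDWR)   # "second qemu": a separate open file description

ofd_write_lock(fd1)
try:
    ofd_write_lock(fd2)
except OSError as e:
    if e.errno in (errno.EAGAIN, errno.EACCES):
        print('Failed to get "write" lock - another open file description holds it')
===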


Version-Release number of selected component (if applicable):

RHV 4.2.7
NFS V4.1 storage domain

How reproducible:

Partially, as this depends on when the engine marks the host as failed.

Steps to Reproduce:

1. Crash a host that is running HA VMs with disks in NFS storage.
2. Watch the HA VM start events; they will probably fail with the mentioned error.


Actual results:

HA functionality is not working because of OFD locks.

Expected results:

HA functionality should work.

Additional info:

Comment 2 Michal Skrivanek 2018-11-17 08:22:40 UTC
I thought we disabled these locks everywhere, Tal?

Comment 3 Tal Nisan 2018-11-18 15:29:21 UTC
Seems like a similar bug to bug 1624734, Nir can you confirm?

Comment 4 Nir Soffer 2018-11-18 17:49:20 UTC
(In reply to Tal Nisan from comment #3)
> Seems like a similar bug to bug 1624734, Nir can you confirm?

I don't think so.

qemu locking is disabled only on NFS 3, since NFS 3 does not release locks of
dead clients. With NFS 4, the server should release the locks if the client owning the
lock is dead. If the lock was not released, the client is probably still alive, and we
cannot start the VM on another machine.


> Steps to Reproduce:
> 1. Crash the host which is having HA VMs running with disks in NFS storage. 
> 2. Watch for the HA VMs start events and probably it will fail with the mentioned
>    error.

What do you mean by "Crash the host"?
Please share output of lslocks on the NFS server

> Since we have a default value of 3 for MaxRerunVmOnVdsCount

I don't know why we limit the number of retries. We cannot have HA with limited
number of retries.

Comment 5 Michal Skrivanek 2018-11-19 10:16:44 UTC
(In reply to Nir Soffer from comment #4)
> I don't know why we limit the number of retries. We cannot have HA with
> limited
> number of retries.

Because we can't try to run a VM indefinitely; it's a fairly expensive and resource-consuming action.

Comment 6 Ryan Barry 2018-11-21 00:25:30 UTC
Nijin, can you provide the output of lslocks when this is reproducing?

Comment 8 Nir Soffer 2018-11-22 08:00:08 UTC
According to comment 7, the issue is timing:

2018-11-21 23:37:15,000-05 - host rebooted (estimate based on engine logs)
2018-11-21 23:37:38,426-05 - engine detects communication timeout
2018-11-21 23:40:11,105-05 - engine fails to start VM (write lock failure)
2018-11-21 23:40:12,009-05 - engine fails to start VM (write lock failure)
2018-11-21 23:40:12,845-05 - engine fails to start VM (write lock failure)
2018-11-21 23:40:15,000-05 - lease expires on NFS server

NFS timeout is out of RHV control, since we don't manage the NFS server.

So what we can do is:
- Document that NFS server v4 lease timeout must be 80 seconds, same as sanlock
- Fix restart timeouts so we don't stop trying before the lease expires.

I think this can work for recovery:

- Make the first attempt 80 seconds after the last interaction with the host. In the
  case of sanlock this may succeed.
- Then retry every X seconds, doubling X on every failure. Start with X=30 and stop
  doubling when X=240. (The values should be configurable.)

This requires detecting the write lock error on the vdsm side, which can be done using
the libvirt API.
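
As an illustration, a minimal sketch of the retry schedule proposed above. The constant names are hypothetical, not actual engine options: the first attempt comes 80 seconds after the last interaction with the host, then the interval doubles from 30 seconds up to a 240-second cap.

===
# Hypothetical constants; the real engine options may use different names and values.
INITIAL_DELAY = 80   # seconds after the last interaction with the host
RETRY_START = 30     # first retry interval
RETRY_CAP = 240      # stop doubling here

def retry_schedule(attempts):
    """Yield the delay in seconds before each start attempt."""
    yield INITIAL_DELAY
    delay = RETRY_START
    for _ in range(attempts - 1):
        yield delay
        delay = min(delay * 2, RETRY_CAP)

# Example: list(retry_schedule(7)) -> [80, 30, 60, 120, 240, 240, 240]
===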

Comment 9 Denis Prezhevalsky 2018-11-22 08:16:46 UTC
Nir,
NFS "v4-lease-seconds" NetApp parameter default is 30sec. 
And that default didn't worked in this case.
What we done so far is:
- decreasing "v4-lease-seconds" from 30s to 10s
- increasing "MaxRerunVmOnVdsCount" engine parameter from 3 to 6

This wasn't tested enough yet to confirm as working workaround, thought.

Comment 10 Nir Soffer 2018-11-22 08:57:13 UTC
(In reply to Denis Prezhevalsky from comment #9)
> NFS "v4-lease-seconds" NetApp parameter default is 30sec. 
> And that default didn't worked in this case.

Maybe this is a bug in NetApp, or maybe the configuration change was not applied?

> What we done so far is:
> - decreasing "v4-lease-seconds" from 30s to 10s

This is too short, I would not change it.

Is this bug about RHEL NFS server (see comment 7) or NetApp?

Comment 11 Denis Prezhevalsky 2018-11-22 09:49:58 UTC
>> What we done so far is:
>> - decreasing "v4-lease-seconds" from 30s to 10s

> This is too short, I would not change it.

So, what would be reasonable values for "v4-lease-seconds" (NetApp) and "MaxRerunVmOnVdsCount" (Engine) ?

Comment 12 nijin ashok 2018-11-22 11:04:00 UTC
(In reply to Nir Soffer from comment #10)
> Is this bug about RHEL NFS server (see comment 7) or NetApp?

The environment reported in the case is using NetApp and I used RHEL NFS server for reproducing.

> Maybe this is a bug in NetApp, or maybe the configuration change was not applied?

For this particular incident, the VM start was attempted fairly early, but not within 30 seconds.

2018-11-04 15:51:28: Manager reported an error for the host (the host crashed)
2018-11-04 15:52:33: Manager detected that the host is in the "Kdump flow".
2018-11-04 15:52:35: Manager tried to start the HA VMs, which all failed with the write lock error.

Here the manager started the HA VM when it marked the host as Kdumping:

2018-11-04 15:52:33,071-05 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-engine-Thread-64) [794bb107] START, SetVdsStatusVDSCommand(HostName = rhvh-16, SetVdsStatusVDSCommandParameters:{hostId='23b9ce13-57e4-4dcd-9a63-b302c15c777a', status='Kdumping', nonOperationalReason='NONE', stopSpmFailureLogged='false', maintenanceReason='null'}), log id: 592a3156

The logs for this incident are available in comment #1.

This may be a bug in NetApp as well.

Comment 13 Nir Soffer 2018-11-23 00:23:31 UTC
(In reply to nijin ashok from comment #12)
> For this particular incident, the VM was started a bit early but not within
> 30 seconds.
> 
> 2018-11-04 15:51:28: Manager reported error for the host (The host got
> crashed)
> 2018-11-04 15:52:33: Manager detected that the host is in "Kdump flow".
> 2018-11-04 15:52:35: Manager try to start the HA VMs which all failed with
> write lock.

This is 67 seconds after the manager reported an error, but maybe the host lost its
connection to the engine while still being alive as far as the NFS server was concerned?

Do we have info on the time of the host "crash"? Was it a real error or a controlled
test? (In the latter case we may know the time we stopped the host.)

Can we test this NetApp server using another host or VM?

- mount some directory on this NetApp, using the same mount options as vdsm
- run qemu manually, using a file on this mount
- stop the host or the VM
- try to run qemu manually on another host, using the same file
- retry every 10 seconds until it succeeds (see the sketch below)

We will quickly see if there is an issue with v4-lease-seconds on this NetApp.
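
A minimal sketch of such a probe loop. It assumes qemu-img is installed, the NetApp export is already mounted (the path below is hypothetical), and that qemu-img info tries to take the image lock by default, so it keeps failing until the dead client's NFSv4 lease expires and the lock is released.

===
import subprocess
import sys
import time

IMAGE = "/mnt/netapp-test/disk.qcow2"   # hypothetical path on the test mount

start = time.monotonic()
while True:
    # qemu-img info normally acquires the image lock, so it fails while the
    # crashed host's lock is still considered live by the NFS server.
    result = subprocess.run(["qemu-img", "info", IMAGE],
                            capture_output=True, text=True)
    elapsed = int(time.monotonic() - start)
    if result.returncode == 0:
        print(f"lock released after ~{elapsed}s")
        sys.exit(0)
    err = result.stderr.strip().splitlines() or ["lock error"]
    print(f"{elapsed:4d}s: still locked: {err[-1]}")
    time.sleep(10)   # retry every 10 seconds, as suggested above
===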

Comment 14 Nir Soffer 2018-11-23 00:36:04 UTC
(In reply to Denis Prezhevalsky from comment #11)
> So, what would be reasonable values for "v4-lease-seconds" (NetApp) and
> "MaxRerunVmOnVdsCount" (Engine) ?

v4-lease-seconds should use a value that will make the lease expire after 80
seconds. You need to check NetApp documentation or consult with them.

Shmuel, can you answer about MaxRerunVmOnVdsCount? Do we have other configuration values
related to the recovery of HA VMs?

Comment 15 nijin ashok 2018-11-23 07:31:03 UTC
(In reply to Nir Soffer from comment #13)
> Do we have info on the time of the host "crash"? was it real error or a
> controlled
> test? (in the last case we may know the time we stopped the host).
> 

It was not a controlled test. It's hard to say when the host crashed.

> Can we test this NetApp server using another host or VM?
> 
> - mount some directory on this NetApp, using same mount options as vdsm
> - run qemu manually, using file on this mount
> - stop the host or the VM
> - try to run qemu manually on another host, using same file
> - retry every 10 seconds until it succeed
> 
> We will quickly see if there is an issue with v4-lease-seconds on this
> NetApp.

I will coordinate with Denis to see if we can test this as he is managing that environment.

Regardless, as you mentioned in comment 8, the engine should detect the "write lock" error and should not stop trying to start the VMs if we get this error as part of HA VM recovery.

The lease time can vary depending on the storage vendor.
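
As a rough illustration of the kind of detection meant here (a hypothetical helper, not actual vdsm or engine code), the exit message seen in the logs above could be classified so that a lock failure is treated as "retry later" instead of counting against the rerun budget.

===
import re

# Matches the qemu error seen in the logs above; hypothetical helper, not engine code.
_WRITE_LOCK_ERROR = re.compile(r'Failed to get "write" lock')

def is_image_lock_failure(exit_message: str) -> bool:
    """Return True if a VM start failed only because the image is still locked."""
    return bool(_WRITE_LOCK_ERROR.search(exit_message or ""))

# Example idea: failures classified this way could be kept out of the
# MaxRerunVmOnVdsCount budget so the HA retry logic attempts the VM again
# after the NFS lease expires.
===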

Comment 16 Michal Skrivanek 2018-11-23 09:40:05 UTC
For HA VMs there is additionally the RetryToRunAutoStartVmIntervalInSeconds parameter (default 30s).

But I do not see that in effect here; that could be because of the fencing - the initial run of both regular and HA VMs has immediate reruns.

Either way, IMHO the right solution is not timeouts but rather clearing out the locks on successful boot/fencing. Can't we somehow release the locks as part of our fencing?

Comment 17 Shmuel Melamud 2018-11-25 21:41:30 UTC
Michal is right: MaxRerunVmOnVdsCount sets the maximum number of RunVmCommand reruns, and these reruns are immediate. But I would expect that after RetryToRunAutoStartVmIntervalInSeconds, AutoStartVmsRunner would give another try to start the VM.

Comment 19 Ryan Barry 2019-01-21 14:54:31 UTC
Re-targeting to 4.3.1 since it is missing a patch, an acked blocker flag, or both

Comment 21 Nir Soffer 2019-02-12 07:17:30 UTC
(In reply to Michal Skrivanek from comment #16)
> Either way, IMHO the right solution is not timeouts but rather clearing out
> the locks on successful boot/fencing. Can't we somehow release the locks as
> part of our fencing?

I don't think we can release locks acquired by qemu; this may lead to data
corruption. If qemu takes cluster-level locks, it should be responsible for
breaking the locks if needed.

But I think the right way is to depend on the storage for breaking stale locks.
If the storage does not handle stale locks properly, it should be fixed on
the storage side. Until the storage is fixed, this should be handled manually.

Comment 22 Daniel Gur 2019-08-28 13:14:24 UTC
sync2jira

Comment 23 Daniel Gur 2019-08-28 13:19:26 UTC
sync2jira

Comment 25 Ryan Barry 2019-12-06 13:52:20 UTC
Please retest. The changes in https://bugzilla.redhat.com/show_bug.cgi?id=912723 are likely to resolve this, as there is no longer a limited number of retries.

Comment 28 Polina 2020-04-22 15:56:06 UTC
re-tested on http://bob-dr.lab.eng.brq.redhat.com/builds/4.4/rhv-4.4.0-31

according to https://polarion.engineering.redhat.com/polarion/#/project/RHEVM3/workitem?id=RHEVM-26910

The retry is done many times, and then, when the storage problem is fixed, the HA VM is restarted.

The engine.log reports the attempts:

2020-04-22 18:16:29,056+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-7) [43014a67] EVENT_ID: HA_VM_RESTART_FAILED(9,603), Restart of the Highly Available VM golden_env_mixed_virtio_0 failed.
2020-04-22 18:16:59,397+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [627cf1e6] EVENT_ID: HA_VM_RESTART_FAILED(9,603), Restart of the Highly Available VM golden_env_mixed_virtio_0 failed.
2020-04-22 18:17:29,759+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-78) [55bffcd2] EVENT_ID: HA_VM_RESTART_FAILED(9,603), Restart of the Highly Available VM golden_env_mixed_virtio_0 failed.
2020-04-22 18:17:59,925+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-23) [5168622a] EVENT_ID: HA_VM_RESTART_FAILED(9,603), Restart of the Highly Available VM golden_env_mixed_virtio_0 failed.
2020-04-22 18:18:30,145+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-10) [360c4b62] EVENT_ID: HA_VM_RESTART_FAILED(9,603), Restart of the Highly Available VM golden_env_mixed_virtio_0 failed.
2020-04-22 18:19:00,393+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-75) [311c0279] EVENT_ID: HA_VM_RESTART_FAILED(9,603), Restart of the Highly Available VM golden_env_mixed_virtio_0 failed.
2020-04-22 18:19:31,565+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-25) [13da0fd7] EVENT_ID: HA_VM_RESTART_FAILED(9,603), Restart of the Highly Available VM golden_env_mixed_virtio_0 failed.
2020-04-22 18:20:00,877+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-15) [278702ae] EVENT_ID: HA_VM_RESTART_FAILED(9,603), Restart of the Highly Available VM golden_env_mixed_virtio_0 failed.
2020-04-22 18:20:31,057+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-33) [1684018c] EVENT_ID: HA_VM_RESTART_FAILED(9,603), Restart of the Highly Available VM golden_env_mixed_virtio_0 failed.
2020-04-22 18:21:01,301+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-58) [3041dd5b] EVENT_ID: HA_VM_RESTART_FAILED(9,603), Restart of the Highly Available VM golden_env_mixed_virtio_0 failed.

Comment 31 Polina 2020-07-19 10:49:56 UTC
Hi Rolfe,

Could you please link here the document that must be reviewed? I can't find it at the supplied link.

thank you

Comment 37 errata-xmlrpc 2020-08-04 13:16:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3247