Description of problem:

If a host fails (hardware failure, OS crash, network issue, etc.) and the engine tries to start the HA VM before the NFS lease expires, the start fails because of qemu's OFD locking, with the error 'Failed to get "write" lock. Is another process using the image?'. If it fails 3 consecutive times, the engine will not try to start the VM again, breaking the HA functionality. The issue was observed in an internal RH production environment and I was able to reproduce it as well. Here, rhvh-16 crashed:

===
2018-11-04 15:51:28,925-05 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connection timeout for host 'rhvh-16', last response arrived 16919 ms ago.
2018-11-04 15:52:33,629-05 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-64) [6db662ce] EVENT_ID: VM_WAS_SET_DOWN_DUE_TO_HOST_REBOOT_OR_MANUAL_FENCE(143), Vm puppet01 was shut down due to rhvh-16 host reboot or manual fence
===

The engine tried to start the VM on 3 hosts, and all attempts failed with the mentioned error:

===
2018-11-04 15:52:35,358-05 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-74) [5ec88a40] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM puppet01 on Host rhvh-15
2018-11-04 15:52:38,620-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-7) [] EVENT_ID: VM_DOWN_ERROR(119), VM puppet01 is down with error.
Exit message: internal error: qemu unexpectedly closed the monitor: 2018-11-04T20:52:37.481659Z qemu-kvm: -drive file=/rhev/data-center/mnt//b7d4d0d9-73b5-444c-8dec-bb0db5152c85/images/d7e828ff-78ea-4dc4-a82a-61dcdf431454/a4b9cea3-0579-4361-b750-3ec04a405732,format=qcow2,if=none,id=drive-ua-d7e828ff-78ea-4dc4-a82a-61dcdf431454,serial=d7e828ff-78ea-4dc4-a82a-61dcdf431454,cache=none,werror=stop,rerror=stop,aio=threads: Failed to get "write" lock
2018-11-04 15:52:38,880-05 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-59) [] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM puppet01 on Host rhvh-13
2018-11-04 15:52:42,626-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-4) [] EVENT_ID: VM_DOWN_ERROR(119), VM puppet01 is down with error.
Exit message: internal error: qemu unexpectedly closed the monitor: 2018-11-04T20:52:40.920096Z qemu-kvm: -drive file=/rhev/data-center/mnt//b7d4d0d9-73b5-444c-8dec-bb0db5152c85/images/d7e828ff-78ea-4dc4-a82a-61dcdf431454/a4b9cea3-0579-4361-b750-3ec04a405732,format=qcow2,if=none,id=drive-ua-d7e828ff-78ea-4dc4-a82a-61dcdf431454,serial=d7e828ff-78ea-4dc4-a82a-61dcdf431454,cache=none,werror=stop,rerror=stop,aio=threads: Failed to get "write" lock
2018-11-04 15:52:42,878-05 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-66) [] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM puppet01 on Host rhvh-02
2018-11-04 15:52:46,200-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-7) [] EVENT_ID: VM_DOWN_ERROR(119), VM puppet01 is down with error.
Exit message: internal error: qemu unexpectedly closed the monitor: 2018-11-04T20:52:45.060839Z qemu-kvm: -drive file=/rhev/data-center/mnt/b7d4d0d9-73b5-444c-8dec-bb0db5152c85/images/d7e828ff-78ea-4dc4-a82a-61dcdf431454/a4b9cea3-0579-4361-b750-3ec04a405732,format=qcow2,if=none,id=drive-ua-d7e828ff-78ea-4dc4-a82a-61dcdf431454,serial=d7e828ff-78ea-4dc4-a82a-61dcdf431454,cache=none,werror=stop,rerror=stop,aio=threads: Failed to get "write" lock
===

Since MaxRerunVmOnVdsCount defaults to 3, the engine will not try to start the VM again, and hence the whole HA functionality is lost.

Also, per the customer, the current value of v4-lease-seconds on his NFS storage is 30 seconds, which is low compared to a RHEL NFS server. So there is a high probability of hitting this issue with RHEL NFS storage, where the default lease time is 180 seconds. However, this all depends on how fast the engine starts the VM.

Version-Release number of selected component (if applicable):
RHV 4.2.7
NFS v4.1 storage domain

How reproducible:
Partially, as this depends on when the engine marks the host as failed.

Steps to Reproduce:
1. Crash a host which has HA VMs running with disks on NFS storage.
2. Watch the HA VM start events; they will probably fail with the mentioned error.

Actual results:
HA functionality is not working because of OFD locks.

Expected results:
HA functionality should work.

Additional info:
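For context on the failure mode: qemu's image locking is built on Linux open file description (OFD) locks. OFD locks are awkward to reach from Python, so the minimal sketch below uses the closely related flock() to illustrate the same behavior: a second opener cannot take the exclusive lock until the first open file description goes away (locally, process death; on NFSv4, the server expiring the dead client's lease). All paths and names here are illustrative, not qemu's actual implementation.

```python
import fcntl
import os
import tempfile

# Stand-in for a VM disk image on the NFS mount (path is illustrative).
path = os.path.join(tempfile.mkdtemp(), "disk.qcow2")
open(path, "wb").close()

# First opener (qemu on the crashed host) holds an exclusive lock.
holder = open(path, "rb+")
fcntl.flock(holder, fcntl.LOCK_EX | fcntl.LOCK_NB)

# Second opener (qemu restarted on another host) cannot take the lock
# while the first open file description is still considered alive.
contender = open(path, "rb+")
try:
    fcntl.flock(contender, fcntl.LOCK_EX | fcntl.LOCK_NB)
    locked = True
except BlockingIOError:
    locked = False  # analogue of: Failed to get "write" lock

# Once the holder's open file description is gone, the lock is free again.
holder.close()
fcntl.flock(contender, fcntl.LOCK_EX | fcntl.LOCK_NB)
reacquired = True
print(locked, reacquired)
```

The key point for this bug: the second host's lock attempt keeps failing until the NFS server decides the dead client's lease has expired, which is entirely storage-side timing.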
I thought we disabled these locks everywhere, Tal?
Seems like a similar bug to bug 1624734, Nir can you confirm?
(In reply to Tal Nisan from comment #3)
> Seems like a similar bug to bug 1624734, Nir can you confirm?

I don't think so. qemu locking is disabled only on NFS 3, since NFS 3 does not release locks of dead clients. With NFS 4, the server should release the locks if the client owning the lock is dead. If the lock was not released, the client is probably alive, and we cannot start the VM on another machine.

> Steps to Reproduce:
> 1. Crash the host which is having HA VMs running with disks in NFS storage.
> 2. Watch for the HA VMs start events and probably it will fail with the mentioned
> error.

What do you mean by "Crash the host"? Please share the output of lslocks on the NFS server.

> Since we have a default value of 3 for MaxRerunVmOnVdsCount

I don't know why we limit the number of retries. We cannot have HA with a limited number of retries.
(In reply to Nir Soffer from comment #4)
> I don't know why we limit the number of retries. We cannot have HA with a
> limited number of retries.

Because we can't try running a VM indefinitely. It's a fairly expensive and resource-consuming action.
Nijin, can you provide the output of lslocks when this is reproducing?
According to comment 7, the issue is timing:

2018-11-21 23:37:15,000-05 - host rebooted (estimate based on engine logs)
2018-11-21 23:37:38,426-05 - engine detects communication timeout
2018-11-21 23:40:11,105-05 - engine fails to start vm (write lock failure)
2018-11-21 23:40:12,009-05 - engine fails to start vm (write lock failure)
2018-11-21 23:40:12,845-05 - engine fails to start vm (write lock failure)
2018-11-21 23:40:15,000-05 - lease expires on NFS server

The NFS timeout is out of RHV's control, since we don't manage the NFS server. So what we can do is:
- Document that the NFS v4 server lease timeout must be 80 seconds, same as sanlock.
- Fix the restart timeouts so we don't stop trying before the lease expires.

I think this can work for recovery:
- Try first 80 seconds after the last interaction with the host. In the case of sanlock this may succeed.
- Then retry every X seconds, doubling X on every failure. Start with X=30, stop doubling when X=240. (The values should be configurable.)

This requires detection of the write lock error on the vdsm side - can be done using the libvirt API.
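The proposed schedule above can be sketched as a short helper. This is only an illustration of the proposal, not engine code; the function name and parameters are made up here, and as noted, the real values should be configurable.

```python
def ha_retry_delays(initial_wait=80, base=30, cap=240, attempts=8):
    """Sketch of the proposed schedule: wait 80s after the last
    interaction with the host (sanlock may have recovered by then),
    then retry every X seconds, doubling X on every failure until
    X reaches the 240s cap."""
    delays = [initial_wait]
    x = base
    for _ in range(attempts - 1):
        delays.append(x)
        x = min(x * 2, cap)
    return delays

# First attempt at t+80s, then backoff: 30, 60, 120, 240, 240, ...
print(ha_retry_delays())
```

With these defaults the retry delays are [80, 30, 60, 120, 240, 240, 240, 240], so within roughly 8 attempts the engine has waited well past any reasonable NFS v4 lease timeout.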
Nir,

The NFS "v4-lease-seconds" NetApp parameter default is 30 sec, and that default didn't work in this case.

What we have done so far is:
- decreasing "v4-lease-seconds" from 30s to 10s
- increasing the "MaxRerunVmOnVdsCount" engine parameter from 3 to 6

This hasn't been tested enough yet to confirm it as a working workaround, though.
(In reply to Denis Prezhevalsky from comment #9)
> NFS "v4-lease-seconds" NetApp parameter default is 30sec.
> And that default didn't worked in this case.

Maybe this is a bug in NetApp, or maybe the configuration change was not applied?

> What we done so far is:
> - decreasing "v4-lease-seconds" from 30s to 10s

This is too short; I would not change it.

Is this bug about a RHEL NFS server (see comment 7) or NetApp?
>> What we done so far is:
>> - decreasing "v4-lease-seconds" from 30s to 10s
> This is too short, I would not change it.

So, what would be reasonable values for "v4-lease-seconds" (NetApp) and "MaxRerunVmOnVdsCount" (Engine)?
(In reply to Nir Soffer from comment #10)
> Is this bug about RHEL NFS server (see comment 7) or NetApp?

The environment reported in the case is using NetApp, and I used a RHEL NFS server for reproducing.

> Maybe this is a bug in NetApp, or maybe the configuration change was not applied?

For this particular incident, the VM was started a bit early, but not within 30 seconds:

2018-11-04 15:51:28: Manager reported an error for the host (the host crashed)
2018-11-04 15:52:33: Manager detected that the host is in the "Kdump flow".
2018-11-04 15:52:35: Manager tried to start the HA VMs, which all failed with the write lock error.

Here the manager started the HA VMs when it marked the host as Kdumping:

2018-11-04 15:52:33,071-05 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-engine-Thread-64) [794bb107] START, SetVdsStatusVDSCommand(HostName = rhvh-16, SetVdsStatusVDSCommandParameters:{hostId='23b9ce13-57e4-4dcd-9a63-b302c15c777a', status='Kdumping', nonOperationalReason='NONE', stopSpmFailureLogged='false', maintenanceReason='null'}), log id: 592a3156

The logs for this incident are available in comment #1. This may be a bug in NetApp as well.
(In reply to nijin ashok from comment #12)
> For this particular incident, the VM was started a bit early but not within
> 30 seconds.
>
> 2018-11-04 15:51:28: Manager reported error for the host (The host got
> crashed)
> 2018-11-04 15:52:33: Manager detected that the host is in "Kdump flow".
> 2018-11-04 15:52:35: Manager try to start the HA VMs which all failed with
> write lock.

This is 67 seconds after the manager reported an error, but maybe the host only lost its connection to the engine and was still alive as far as the NFS server is concerned. Do we have info on the time of the host "crash"? Was it a real error or a controlled test? (In the latter case we may know the time we stopped the host.)

Can we test this NetApp server using another host or VM?

- mount some directory on this NetApp, using the same mount options as vdsm
- run qemu manually, using a file on this mount
- stop the host or the VM
- try to run qemu manually on another host, using the same file
- retry every 10 seconds until it succeeds

We will quickly see if there is an issue with v4-lease-seconds on this NetApp.
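The "retry every 10 seconds" step of this manual test could be scripted roughly as below. This is a sketch only: it uses flock() as a local stand-in for actually re-running qemu, and the function name and parameters are made up for illustration. The number of attempts before success gives a rough estimate of the server's effective lease expiry time.

```python
import fcntl
import os
import tempfile
import time

def wait_for_write_lock(path, interval=10, max_tries=60):
    """Retry taking an exclusive lock on `path` every `interval` seconds,
    standing in for re-running qemu on the second host. Returns the
    number of attempts needed; the lease expiry can then be estimated
    as roughly (attempts - 1) * interval seconds."""
    for attempt in range(1, max_tries + 1):
        f = open(path, "rb+")
        try:
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
            f.close()
            return attempt
        except BlockingIOError:
            f.close()
            time.sleep(interval)
    raise TimeoutError("lock never released; check v4-lease-seconds")

# Demo against a local, uncontended file: succeeds on the first try.
demo = os.path.join(tempfile.mkdtemp(), "disk.qcow2")
open(demo, "wb").close()
print(wait_for_write_lock(demo, interval=0))
```

On a real run the path would be the image file on the NetApp mount, opened while the first host still holds qemu's lock.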
(In reply to Denis Prezhevalsky from comment #11)
> So, what would be reasonable values for "v4-lease-seconds" (NetApp) and
> "MaxRerunVmOnVdsCount" (Engine)?

v4-lease-seconds should use a value that makes the lease expire after 80 seconds. You need to check the NetApp documentation or consult with them.

Shmuel, can you answer about MaxRerunVmOnVdsCount? Do we have other configuration values related to recovery of HA VMs?
(In reply to Nir Soffer from comment #13)
> Do we have info on the time of the host "crash"? was it real error or a
> controlled
> test? (in the last case we may know the time we stopped the host).

It was not a controlled test. It's hard to say when the host crashed.

> Can we test this NetApp server using another host or VM?
>
> - mount some directory on this NetApp, using same mount options as vdsm
> - run qemu manually, using file on this mount
> - stop the host or the VM
> - try to run qemu manually on another host, using same file
> - retry every 10 seconds until it succeed
>
> We will quickly see if there is an issue with v4-lease-seconds on this
> NetApp.

I will coordinate with Denis to see if we can test this, as he is managing that environment.

Regardless, as you mentioned in comment 8, the engine should detect the write lock error and should not stop starting the VMs when we get this error as part of HA VM recovery. The lease time can vary depending on the storage vendor.
For HA VMs there's additionally the RetryToRunAutoStartVmIntervalInSeconds parameter (default 30s). But I do not see it in effect here; that could be because of the fencing - the initial run of both regular and HA VMs has immediate reruns.

Either way, IMHO the right solution is not timeouts but rather clearing out the locks on successful boot/fencing. Can't we somehow release the locks as part of our fencing?
Michal is right, MaxRerunVmOnVdsCount sets the maximum number of RunVmCommand reruns, and these reruns are immediate. But I would expect that after RetryToRunAutoStartVmIntervalInSeconds, AutoStartVmsRunner would give the VM another try.
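The interplay between the two mechanisms described above can be sketched as a toy timeline model. This is not engine code; the function is invented for illustration, with parameters named after the two config options being discussed.

```python
def restart_attempt_times(max_rerun=3, retry_interval=30, extra_retries=4):
    """Toy model of the behaviour described above: the initial
    RunVmCommand plus up to MaxRerunVmOnVdsCount immediate reruns all
    happen at ~t=0; afterwards AutoStartVmsRunner retries every
    RetryToRunAutoStartVmIntervalInSeconds (default 30s)."""
    times = [0] * (1 + max_rerun)  # initial run + immediate reruns
    times += [retry_interval * (i + 1) for i in range(extra_retries)]
    return times

print(restart_attempt_times())
```

Under this model all MaxRerunVmOnVdsCount attempts land at t=0, well inside the NFS lease window, which matches the burst of failures seen in the logs; only the later periodic retries would fall after lease expiry.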
Re-targeting to 4.3.1 since it is missing a patch, an acked blocker flag, or both
(In reply to Michal Skrivanek from comment #16)
> Either way, IMHO the right solution is not timeouts but rather clearing out
> the locks on successful boot/fencing. Can't we somehow release the locks as
> part of our fencing?

I don't think we can release locks acquired by qemu; this may lead to data corruption. If qemu takes cluster-level locks, it should be responsible for breaking the locks if needed.

But I think the right way is to depend on the storage for breaking stale locks. If the storage does not handle stale locks properly, it should be fixed on the storage side. Until the storage is fixed, this should be handled manually.
Please retest. The changes in https://bugzilla.redhat.com/show_bug.cgi?id=912723 are likely to resolve this, as there is no longer a limited number of retries.
Re-tested on http://bob-dr.lab.eng.brq.redhat.com/builds/4.4/rhv-4.4.0-31 according to https://polarion.engineering.redhat.com/polarion/#/project/RHEVM3/workitem?id=RHEVM-26910

The retry is done many times; then, when the storage problem is fixed, the HA VM is restarted. The engine.log reports the attempts:

2020-04-22 18:16:29,056+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-7) [43014a67] EVENT_ID: HA_VM_RESTART_FAILED(9,603), Restart of the Highly Available VM golden_env_mixed_virtio_0 failed.
2020-04-22 18:16:59,397+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [627cf1e6] EVENT_ID: HA_VM_RESTART_FAILED(9,603), Restart of the Highly Available VM golden_env_mixed_virtio_0 failed.
2020-04-22 18:17:29,759+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-78) [55bffcd2] EVENT_ID: HA_VM_RESTART_FAILED(9,603), Restart of the Highly Available VM golden_env_mixed_virtio_0 failed.
2020-04-22 18:17:59,925+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-23) [5168622a] EVENT_ID: HA_VM_RESTART_FAILED(9,603), Restart of the Highly Available VM golden_env_mixed_virtio_0 failed.
2020-04-22 18:18:30,145+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-10) [360c4b62] EVENT_ID: HA_VM_RESTART_FAILED(9,603), Restart of the Highly Available VM golden_env_mixed_virtio_0 failed.
2020-04-22 18:19:00,393+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-75) [311c0279] EVENT_ID: HA_VM_RESTART_FAILED(9,603), Restart of the Highly Available VM golden_env_mixed_virtio_0 failed.
2020-04-22 18:19:31,565+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-25) [13da0fd7] EVENT_ID: HA_VM_RESTART_FAILED(9,603), Restart of the Highly Available VM golden_env_mixed_virtio_0 failed.
2020-04-22 18:20:00,877+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-15) [278702ae] EVENT_ID: HA_VM_RESTART_FAILED(9,603), Restart of the Highly Available VM golden_env_mixed_virtio_0 failed.
2020-04-22 18:20:31,057+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-33) [1684018c] EVENT_ID: HA_VM_RESTART_FAILED(9,603), Restart of the Highly Available VM golden_env_mixed_virtio_0 failed.
2020-04-22 18:21:01,301+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-58) [3041dd5b] EVENT_ID: HA_VM_RESTART_FAILED(9,603), Restart of the Highly Available VM golden_env_mixed_virtio_0 failed.
Hi Rolfe, could you please link here the document that must be reviewed? I can't find it under the supplied link.

Thank you
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:3247