Bug 1650417
| Summary: | HA is broken for VMs having disks in NFS storage domain because of Qemu OFD locking | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | nijin ashok <nashok> |
| Component: | ovirt-engine | Assignee: | Nobody <nobody> |
| Status: | CLOSED ERRATA | QA Contact: | Polina <pagranat> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.2.7 | CC: | ahadas, dprezhev, gveitmic, kmehta, michal.skrivanek, nashok, nsoffer, pagranat, rdlugyhe, smelamud, tnisan, vvasilev |
| Target Milestone: | ovirt-4.4.1 | Keywords: | Rebase, TestOnly |
| Target Release: | 4.3.0 | Flags: | lsvaty: testing_plan_complete- |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | rhv-4.4.0-29 | Doc Type: | Bug Fix |
| Doc Text: | Previously, if a host failed and the RHV Manager tried to start a high-availability virtual machine (HA VM) before the NFS lease expired, OFD locking caused the start attempt to fail with the error: Failed to get "write" lock. Is another process using the image? If the HA VM failed three times in a row, the Manager stopped trying to start it, breaking the HA functionality. The current release fixes this issue: the Manager continues trying to start the VM even after three failures, with the frequency of the attempts decreasing over time, so the VM starts once the lock expires. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-08-04 13:16:18 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Virt | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
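
The failure mode in the Doc Text comes from qemu's image locking: qemu guards each disk image with locks tied to the open file description (OFD), so a second qemu process cannot take the write lock while another open file description still holds it, and on NFSv4 the locks of a dead client are only dropped once its lease expires. Below is a minimal local sketch of that contention. It uses Python's flock(), which is also scoped to the open file description, purely as a stand-in for qemu's fcntl OFD byte-range locks; the file path is arbitrary, and the sketch does not model NFS lease expiry.

```python
import fcntl

PATH = "/tmp/disk.img"  # stand-in for a VM disk image on the NFS storage domain

with open(PATH, "wb") as f:
    f.truncate(1024)

holder = open(PATH, "rb+")        # plays the qemu process on the crashed host
fcntl.flock(holder, fcntl.LOCK_EX | fcntl.LOCK_NB)

restarter = open(PATH, "rb+")     # plays the qemu started for the HA restart
try:
    fcntl.flock(restarter, fcntl.LOCK_EX | fcntl.LOCK_NB)
except BlockingIOError:
    # qemu reports this situation as:
    #   Failed to get "write" lock. Is another process using the image?
    print("write lock still held by the other open file description")
```
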
Description
nijin ashok
2018-11-16 06:26:32 UTC
I thought we disabled these locks everywhere, Tal?

Seems like a similar bug to bug 1624734, Nir can you confirm?

(In reply to Tal Nisan from comment #3)
> Seems like a similar bug to bug 1624734, Nir can you confirm?

I don't think so. qemu locking is disabled only on NFS 3, since NFS 3 does not release the locks of dead clients. With NFS 4, NFS should release the locks if the client owning the lock is dead. If the lock was not released, the client is probably live, and we cannot start the VM on another machine.

> Steps to Reproduce:
> 1. Crash the host which is having HA VMs running with disks in NFS storage.
> 2. Watch for the HA VMs start events and probably it will fail with the mentioned error.

What do you mean by "Crash the host"? Please share the output of lslocks on the NFS server.

> Since we have a default value of 3 for MaxRerunVmOnVdsCount

I don't know why we limit the number of retries. We cannot have HA with a limited number of retries.

(In reply to Nir Soffer from comment #4)
> I don't know why we limit the number of retries. We cannot have HA with a limited number of retries.

Because we can't indefinitely try running a VM. It's a fairly expensive and resource-consuming action.

Nijin, can you provide the output of lslocks when this is reproducing?

According to comment 7, the issue is timing:

2018-11-21 23:37:15,000-05 - host rebooted (estimate based on engine logs)
2018-11-21 23:37:38,426-05 - engine detects communication timeout
2018-11-21 23:40:11,105-05 - engine fails to start vm (write lock failure)
2018-11-21 23:40:12,009-05 - engine fails to start vm (write lock failure)
2018-11-21 23:40:12,845-05 - engine fails to start vm (write lock failure)
2018-11-21 23:40:15,000-05 - lease expires on NFS server

The NFS timeout is out of RHV control, since we don't manage the NFS server. So what we can do is:
- Document that the NFS server v4 lease timeout must be 80 seconds, same as sanlock.
- Fix the restart timeouts so we don't stop trying before the lease expires.

I think this can work for recovery:
- Try first 80 seconds after the last interaction with the host. In the case of sanlock this may succeed.
- Then retry every X seconds, doubling X on every failure. Start with X=30, stop doubling when X=240. (The values should be configurable.)

This requires detection of the write lock error on the vdsm side - can be done using the libvirt API.

Nir, the NFS "v4-lease-seconds" NetApp parameter default is 30 sec, and that default didn't work in this case. What we have done so far is:
- decreasing "v4-lease-seconds" from 30s to 10s
- increasing the "MaxRerunVmOnVdsCount" engine parameter from 3 to 6

This wasn't tested enough yet to confirm it as a working workaround, though.

(In reply to Denis Prezhevalsky from comment #9)
> NFS "v4-lease-seconds" NetApp parameter default is 30 sec,
> and that default didn't work in this case.

Maybe this is a bug in NetApp, or maybe the configuration change was not applied?

> What we have done so far is:
> - decreasing "v4-lease-seconds" from 30s to 10s

This is too short, I would not change it.

Is this bug about the RHEL NFS server (see comment 7) or NetApp?

>> What we have done so far is:
>> - decreasing "v4-lease-seconds" from 30s to 10s
> This is too short, I would not change it.

So, what would be reasonable values for "v4-lease-seconds" (NetApp) and "MaxRerunVmOnVdsCount" (Engine)?

(In reply to Nir Soffer from comment #10)
> Is this bug about the RHEL NFS server (see comment 7) or NetApp?

The environment reported in the case is using NetApp, and I used a RHEL NFS server for reproducing.

> Maybe this is a bug in NetApp, or maybe the configuration change was not applied?

For this particular incident, the VM was started a bit early, but not within 30 seconds.

2018-11-04 15:51:28: Manager reported an error for the host (the host crashed)
2018-11-04 15:52:33: Manager detected that the host is in the "Kdump flow".
2018-11-04 15:52:35: Manager tried to start the HA VMs, which all failed with the write lock error.

Here the manager started the HA VMs when it marked the host as Kdumping.

2018-11-04 15:52:33,071-05 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-engine-Thread-64) [794bb107] START, SetVdsStatusVDSCommand(HostName = rhvh-16, SetVdsStatusVDSCommandParameters:{hostId='23b9ce13-57e4-4dcd-9a63-b302c15c777a', status='Kdumping', nonOperationalReason='NONE', stopSpmFailureLogged='false', maintenanceReason='null'}), log id: 592a3156

The logs for this incident are available in comment #1. This may be a bug in NetApp as well.

(In reply to nijin ashok from comment #12)
> For this particular incident, the VM was started a bit early, but not within 30 seconds.
> 2018-11-04 15:51:28: Manager reported an error for the host (the host crashed)

This is 67 seconds after the manager reported an error, but maybe the host lost its connection to the engine while still being alive as far as the NFS server is concerned?

Do we have info on the time of the host "crash"? Was it a real error or a controlled test? (In the latter case we may know the time we stopped the host.)

Can we test this NetApp server using another host or VM?
- mount some directory on this NetApp, using the same mount options as vdsm
- run qemu manually, using a file on this mount
- stop the host or the VM
- try to run qemu manually on another host, using the same file
- retry every 10 seconds until it succeeds

We will quickly see if there is an issue with v4-lease-seconds on this NetApp.

(In reply to Denis Prezhevalsky from comment #11)
> So, what would be reasonable values for "v4-lease-seconds" (NetApp) and
> "MaxRerunVmOnVdsCount" (Engine)?

v4-lease-seconds should use a value that will make the lease expire after 80 seconds. You need to check the NetApp documentation or consult with them.

Shmuel, can you answer about MaxRerunVmOnVdsCount? Do we have other configuration values related to the recovery of HA VMs?

(In reply to Nir Soffer from comment #13)
> Do we have info on the time of the host "crash"? Was it a real error or a controlled test?

It was not a controlled test. It's hard to say when the host crashed.

> Can we test this NetApp server using another host or VM?
> - mount some directory on this NetApp, using the same mount options as vdsm
> - run qemu manually, using a file on this mount
> - stop the host or the VM
> - try to run qemu manually on another host, using the same file
> - retry every 10 seconds until it succeeds
> We will quickly see if there is an issue with v4-lease-seconds on this NetApp.

I will coordinate with Denis to see if we can test this, as he is managing that environment.

Regardless, as you mentioned in comment 8, the engine should detect the "write lock" error and should not stop trying to start the VMs if we get this error as part of HA VM recovery. The lease time can vary depending on the storage vendor.
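
The recovery behavior proposed in comment 8 is essentially an exponential backoff on the HA restart attempts. Below is a minimal sketch of that schedule only, not the actual engine code; try_start_vm is a hypothetical placeholder for the run-VM call, and the proposal's initial 80-second wait after the last host interaction is omitted.

```python
import time

def restart_ha_vm(try_start_vm, initial=30, cap=240):
    """Keep retrying the HA VM start, doubling the wait on each failure:
    30s, 60s, 120s, 240s, 240s, ... until the start succeeds
    (e.g. once the NFS lease of the dead client has expired)."""
    wait = initial
    while not try_start_vm():
        time.sleep(wait)
        wait = min(wait * 2, cap)
```
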
For HA VMs there is additionally the RetryToRunAutoStartVmIntervalInSeconds parameter (default 30s), but I do not see it in effect here; that could be because of the fencing - the initial run of both regular and HA VMs has immediate reruns.

Either way, IMHO the right solution is not timeouts but rather clearing out the locks on successful boot/fencing. Can't we somehow release the locks as part of our fencing?

Michal is right, MaxRerunVmOnVdsCount sets the maximum number of RunVmCommand reruns, and these reruns are immediate. But I would expect that after RetryToRunAutoStartVmIntervalInSeconds, AutoStartVmsRunner will give another try to start the VM.

Re-targeting to 4.3.1 since it is missing a patch, an acked blocker flag, or both.

(In reply to Michal Skrivanek from comment #16)
> Either way, IMHO the right solution is not timeouts but rather clearing out
> the locks on successful boot/fencing. Can't we somehow release the locks as
> part of our fencing?

I don't think we can release locks acquired by qemu. This may lead to data corruption. If qemu takes cluster-level locks, it should be responsible for breaking the locks if needed. But I think the right way is to depend on the storage for breaking stale locks. If the storage does not handle stale locks properly, it should be fixed on the storage side. Until the storage is fixed, this should be handled manually.

Please retest. The changes in https://bugzilla.redhat.com/show_bug.cgi?id=912723 are likely to resolve this, as there is no longer a limited number of retries.

Re-tested on http://bob-dr.lab.eng.brq.redhat.com/builds/4.4/rhv-4.4.0-31 according to https://polarion.engineering.redhat.com/polarion/#/project/RHEVM3/workitem?id=RHEVM-26910. The retry is done many times, and then, once the storage problem is fixed, the HA VM is restarted. The engine.log reports the attempts:

2020-04-22 18:16:29,056+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-7) [43014a67] EVENT_ID: HA_VM_RESTART_FAILED(9,603), Restart of the Highly Available VM golden_env_mixed_virtio_0 failed.
2020-04-22 18:16:59,397+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [627cf1e6] EVENT_ID: HA_VM_RESTART_FAILED(9,603), Restart of the Highly Available VM golden_env_mixed_virtio_0 failed.
2020-04-22 18:17:29,759+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-78) [55bffcd2] EVENT_ID: HA_VM_RESTART_FAILED(9,603), Restart of the Highly Available VM golden_env_mixed_virtio_0 failed.
2020-04-22 18:17:59,925+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-23) [5168622a] EVENT_ID: HA_VM_RESTART_FAILED(9,603), Restart of the Highly Available VM golden_env_mixed_virtio_0 failed.
2020-04-22 18:18:30,145+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-10) [360c4b62] EVENT_ID: HA_VM_RESTART_FAILED(9,603), Restart of the Highly Available VM golden_env_mixed_virtio_0 failed.
2020-04-22 18:19:00,393+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-75) [311c0279] EVENT_ID: HA_VM_RESTART_FAILED(9,603), Restart of the Highly Available VM golden_env_mixed_virtio_0 failed.
2020-04-22 18:19:31,565+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-25) [13da0fd7] EVENT_ID: HA_VM_RESTART_FAILED(9,603), Restart of the Highly Available VM golden_env_mixed_virtio_0 failed.
2020-04-22 18:20:00,877+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-15) [278702ae] EVENT_ID: HA_VM_RESTART_FAILED(9,603), Restart of the Highly Available VM golden_env_mixed_virtio_0 failed.
2020-04-22 18:20:31,057+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-33) [1684018c] EVENT_ID: HA_VM_RESTART_FAILED(9,603), Restart of the Highly Available VM golden_env_mixed_virtio_0 failed.
2020-04-22 18:21:01,301+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-58) [3041dd5b] EVENT_ID: HA_VM_RESTART_FAILED(9,603), Restart of the Highly Available VM golden_env_mixed_virtio_0 failed.

Hi Rolfe, could you please link here the document that must be reviewed? I can't find it under the supplied link. Thank you.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3247
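
As a verification aid for the engine.log excerpt above: the roughly 30-second gaps between the HA_VM_RESTART_FAILED events match the RetryToRunAutoStartVmIntervalInSeconds default of 30s mentioned in comment 16, confirming the restarts keep coming instead of stopping after three reruns. A small throwaway helper (not part of the product) that extracts those intervals, assuming the timestamp format shown in the excerpt:

```python
from datetime import datetime

def restart_intervals(log_lines):
    """Return the seconds between consecutive HA_VM_RESTART_FAILED events."""
    stamps = []
    for line in log_lines:
        if "HA_VM_RESTART_FAILED" not in line:
            continue
        # "2020-04-22 18:16:29,056+03 ERROR ..." -> "2020-04-22 18:16:29,056"
        date, clock = line.split(" ", 2)[:2]
        stamps.append(datetime.strptime(f"{date} {clock.split('+')[0]}",
                                        "%Y-%m-%d %H:%M:%S,%f"))
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
```
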