Description of problem:

A host went "Not Responding" during a VM migration, which led to the VM running on two hosts. The issue was observed in a customer environment and was reproducible in my test lab with the following flow:

1. Configure a very low value for "Migration bandwidth".
2. Start the migration of an HA VM.
3. Cut the connection between the hypervisor and the manager (block port 54321).
4. The VM goes into "Unknown" status and the host goes into "Not Responding".
5. Restore the connection between the hypervisor and the manager.
6. The status of the VM changes from "Unknown" to "Up". It does not get the "MigratingTo" status even though the migration is still running on the hypervisor.
7. Try migrating the VM again 3-4 times from the admin portal. These attempts fail as expected since a migration is already in progress.
8. When the migration started in step [2] completes, the VM is detected as down by the engine. Since it is an HA VM, the engine automatically starts it on another host, leading to the VM running on two different hosts.

Note: The issue is not reproducible if we skip step [7]. It looks like the issue happens because the VM was restored as "Up" instead of "MigratingTo" in the manager, which allows starting another migration even though one is already in progress.

Log analysis
=======

Migration was started for VM test_vm (fb7add47-e61a-40cf-9eb7-b9bddf2a2970) from host 10.65.177.70 to 10.65.177.129.

===
2018-06-20 08:13:55,787-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-33) [7754eaf2-3b6a-453a-ad4a-54ceb4f66ccc] EVENT_ID: VM_MIGRATION_START(62), Correlation ID: 7754eaf2-3b6a-453a-ad4a-54ceb4f66ccc, Job ID: c6a0c3ae-0a4b-4202-a252-e72740276ec5, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: Migration started (VM: test_vm, Source: 10.65.177.70, Destination: 10.65.177.129, User: admin@internal-authz).
===

The connection between vdsmd and the engine was then blocked using iptables rules, so the host went into "Not Responding" and the VM into "Unknown" status.

===
2018-06-20 08:14:35,133-04 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-39) [7dda081b] EVENT_ID: VDS_HOST_NOT_RESPONDING(9,027), Correlation ID: null, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: Host 10.65.177.70 is not responding. Host cannot be fenced automatically because power management for the host is disabled.

2018-06-20 08:15:21,220-04 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-5) [75f2c17d] EVENT_ID: VM_SET_TO_UNKNOWN_STATUS(142), Correlation ID: null, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: VM test_vm was set to the Unknown status.
===

The connection was then restored and the VM came back as "Up". However, the migration was still going on in the hypervisor.

===
2018-06-20 08:16:12,512-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler1) [2a3072ad] EVENT_ID: VDS_DETECTED(13), Correlation ID: 260fc914, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: Status of host 10.65.177.70 was set to Up.
2018-06-20 08:16:16,143-04 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler5) [64317fe5] VM 'fb7add47-e61a-40cf-9eb7-b9bddf2a2970'(test_vm) moved from 'Unknown' --> 'Up'

2018-06-20 08:16:16,160-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler5) [64317fe5] EVENT_ID: VM_STATUS_RESTORED(163), Correlation ID: null, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: VM test_vm status was restored to Up.
===

A migration was then initiated again, to a different host 10.65.177.141. It failed with the error "cannot acquire state change lock (held by remoteDispatchDomainMigratePerform3Params)".

===
2018-06-20 17:46:58,424+0530 INFO (jsonrpc/3) [vdsm.api] START migrate(params={u'incomingLimit': 2, u'src': u'10.65.177.70', u'dstqemu': u'10.65.177.141', u'autoConverge': u'true', u'tunneled': u'false', u'enableGuestEvents': True, u'dst': u'10.65.177.141:54321', u'convergenceSchedule': {u'init': [{u'params': [u'100'], u'name': u'setDowntime'}], u'stalling': [{u'action': {u'params': [u'150'], u'name': u'setDowntime'}, u'limit': 1}, {u'action': {u'params': [u'200'], u'name': u'setDowntime'}, u'limit': 2}, {u'action': {u'params': [u'300'], u'name': u'setDowntime'}, u'limit': 3}, {u'action': {u'params': [u'400'], u'name': u'setDowntime'}, u'limit': 4}, {u'action': {u'params': [u'500'], u'name': u'setDowntime'}, u'limit': 6}, {u'action': {u'params': [], u'name': u'abort'}, u'limit': -1}]}, u'vmId': u'fb7add47-e61a-40cf-9eb7-b9bddf2a2970', u'abortOnError': u'true', u'outgoingLimit': 2, u'compressed': u'false', u'maxBandwidth': 2, u'method': u'online', 'mode': 'remote'}) (api:39)

2018-06-20 17:47:29,446+0530 ERROR (migsrc/fb7add47) [virt.vm] (vmId='fb7add47-e61a-40cf-9eb7-b9bddf2a2970') Failed to migrate (migration:429)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 411, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 487, in _startUnderlyingMigration
    self._perform_with_conv_schedule(duri, muri)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 562, in _perform_with_conv_schedule
    self._perform_migration(duri, muri)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 528, in _perform_migration
    self._vm._dom.migrateToURI3(duri, params, flags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 77, in f
    raise toe
TimeoutError: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainMigratePerform3Params)
=====

A migration was then initiated again, this time to host 10.65.177.129, which already has an incoming migration for this VM.
=====
2018-06-20 17:47:52,518+0530 INFO (jsonrpc/6) [vdsm.api] START migrate(params={u'incomingLimit': 2, u'src': u'10.65.177.70', u'dstqemu': u'10.65.177.129', u'autoConverge': u'true', u'tunneled': u'false', u'enableGuestEvents': True, u'dst': u'10.65.177.129:54321', u'convergenceSchedule': {u'init': [{u'params': [u'100'], u'name': u'setDowntime'}], u'stalling': [{u'action': {u'params': [u'150'], u'name': u'setDowntime'}, u'limit': 1}, {u'action': {u'params': [u'200'], u'name': u'setDowntime'}, u'limit': 2}, {u'action': {u'params': [u'300'], u'name': u'setDowntime'}, u'limit': 3}, {u'action': {u'params': [u'400'], u'name': u'setDowntime'}, u'limit': 4}, {u'action': {u'params': [u'500'], u'name': u'setDowntime'}, u'limit': 6}, {u'action': {u'params': [], u'name': u'abort'}, u'limit': -1}]}, u'vmId': u'fb7add47-e61a-40cf-9eb7-b9bddf2a2970', u'abortOnError': u'true', u'outgoingLimit': 2, u'compressed': u'false', u'maxBandwidth': 2, u'method': u'online', 'mode': 'remote'}) (api:39)

2018-06-20 17:48:22,860+0530 ERROR (migsrc/fb7add47) [virt.vm] (vmId='fb7add47-e61a-40cf-9eb7-b9bddf2a2970') Failed to migrate (migration:429)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 411, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 487, in _startUnderlyingMigration
    self._perform_with_conv_schedule(duri, muri)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 562, in _perform_with_conv_schedule
    self._perform_migration(duri, muri)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 528, in _perform_migration
    self._vm._dom.migrateToURI3(duri, params, flags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 77, in f
    raise toe
TimeoutError: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainMigratePerform3Params)
====

The initial migration completed in the meantime, and the VM was brought down by libvirtd on the source host 10.65.177.70.

====
2018-06-20 17:49:00,883+0530 INFO (libvirt/events) [virt.vm] (vmId='fb7add47-e61a-40cf-9eb7-b9bddf2a2970') CPU stopped: onSuspend (vm:4997)
2018-06-20 17:49:04,160+0530 INFO (libvirt/events) [virt.vm] (vmId='fb7add47-e61a-40cf-9eb7-b9bddf2a2970') underlying process disconnected (vm:680)
2018-06-20 17:49:04,161+0530 INFO (libvirt/events) [virt.vm] (vmId='fb7add47-e61a-40cf-9eb7-b9bddf2a2970') Release VM resources (vm:4199)
2018-06-20 17:49:04,162+0530 INFO (libvirt/events) [virt.vm] (vmId='fb7add47-e61a-40cf-9eb7-b9bddf2a2970') Stopping connection (guestagent:430)
2018-06-20 17:49:04,761+0530 INFO (libvirt/events) [virt.vm] (vmId='fb7add47-e61a-40cf-9eb7-b9bddf2a2970') Changed state to Down: Lost connection with qemu process (code=2) (vm:1207)
2018-06-20 17:49:04,761+0530 INFO (libvirt/events) [virt.vm] (vmId='fb7add47-e61a-40cf-9eb7-b9bddf2a2970') Stopping connection (guestagent:430)
===

The manager detected this and started the VM on a different host as part of the "HA rerun treatment".
===
2018-06-20 08:19:06,286-04 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-12) [] VM 'fb7add47-e61a-40cf-9eb7-b9bddf2a2970' was reported as Down on VDS '2fd9e8b6-b985-4896-8108-75841ae910db'(10.65.177.70)

2018-06-20 08:19:07,311-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-12) [] EVENT_ID: VM_DOWN_ERROR(119), Correlation ID: null, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: VM test_vm is down with error. Exit message: Lost connection with qemu process.

2018-06-20 08:19:07,312-04 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-12) [] add VM 'fb7add47-e61a-40cf-9eb7-b9bddf2a2970'(test_vm) to HA rerun treatment

2018-06-20 08:19:07,354-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-12) [] EVENT_ID: HA_VM_FAILED(9,602), Correlation ID: null, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: Highly Available VM test_vm failed. It will be restarted automatically.

2018-06-20 08:19:33,622-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-0) [] EVENT_ID: VDS_INITIATED_RUN_VM(506), Correlation ID: 65b8a16f, Job ID: 60fb767c-c576-4457-ad27-5a3a25a06850, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: Trying to restart VM test_vm on Host 10.65.177.141

2018-06-20 08:19:33,601-04 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-0) [] VM 'fb7add47-e61a-40cf-9eb7-b9bddf2a2970'(test_vm) moved from 'PoweringUp' --> 'Up'
====

The VM ended up running on two hosts:

====
Host 10.65.177.129

virsh -r list
 Id    Name                           State
----------------------------------------------------
 27    test_vm                        running

Host 10.65.177.141

virsh -r list --all
 Id    Name                           State
----------------------------------------------------
 6     test_vm                        running
===

Version-Release number of selected component (if applicable):
rhevm-4.1.11.2-0.1.el7.noarch (this includes the fix from bug 1486543)

How reproducible:
100% if we follow the steps above.

Actual results:
The VM is running on two hosts.

Expected results:
The VM should not be running on two hosts.

Additional info:
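For reference, below is a minimal sketch of the checks that can be used to confirm the split state and to see whether the original migration job is still active on the source host before retrying a migration. This is a sketch, not part of the reproduction itself: the host IPs and the domain name test_vm are from this report, passwordless ssh from the checking machine to the hosts is assumed, and domjobinfo may need a read-write libvirt connection on some setups.

# Is a migration job still active for the domain on the source host?
ssh root@10.65.177.70 "virsh domjobinfo test_vm"

# Is the domain present/running on both suspected hosts?
for h in 10.65.177.129 10.65.177.141; do
    echo "== $h =="
    ssh root@$h "virsh -r list --all | grep test_vm"
done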
The problem is that vdsm on the source host is restarted during an ongoing migration. The code that detects an ongoing migration during recovery was only introduced in 4.2. Can you please try to reproduce the scenario again, this time with a 4.2 vdsm (the engine can still stay on 4.1)? We believe it can either fix the issue completely or help significantly.

Now to those restarts: they are frequent (196 times in roughly 12 hours) and induced explicitly (SIGTERM) at a fairly fixed interval, which indicates they are soft-fencing actions. That can be caused by the fact that other hosts can still talk to the engine and also to this blocked host, so whenever the engine invokes a fence action through another host, that host can easily ssh to the blocked host and restart vdsm. It still looks like a fencing problem, though, that the engine keeps doing this over and over again for hours. Eli, can you please take a look?
(In reply to Michal Skrivanek from comment #4)
> The problem is that vdsm on the source host is restarted during an ongoing
> migration. The code that detects an ongoing migration during recovery was
> only introduced in 4.2. Can you please try to reproduce the scenario again,
> this time with a 4.2 vdsm (the engine can still stay on 4.1)?
> We believe it can either fix the issue completely or help significantly.

Yes, it is not reproducible in 4.2. I checked with 4.2 hosts against both the 4.1 and 4.2 manager; the VM gets the correct status:

2018-06-22 02:29:37,848-04 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler2) [431488e3] VM 'fb7add47-e61a-40cf-9eb7-b9bddf2a2970'(test_vm) moved from 'Unknown' --> 'MigratingFrom'
(In reply to Michal Skrivanek from comment #4)

Is the requested info still needed, given comment 6?
Yes. I believe restarting VDSM 200 times is not desired behavior. Or did that change in 4.2 as well? The recovery improvements do not cover all the corner cases of vdsm restarts. I think it would help to understand why we restarted it so often.
(In reply to Michal Skrivanek from comment #8)
> Yes. I believe restarting VDSM 200 times is not desired behavior. Or did
> that change in 4.2 as well? The recovery improvements do not cover all the
> corner cases of vdsm restarts.
> I think it would help to understand why we restarted it so often.

Looking at the log, I see that soft fencing is always done once after the host changes status from NonOperational to NonResponding. The scenario is:

1. The host tries to connect to storage.
2. The host fails to connect to storage.
3. The host becomes NonOperational.
4. A network error on connecting to the host is thrown.
5. The host becomes NonResponding.
6. Fencing handling begins, starting with soft fencing.
Thanks. The bug description says the connection from the engine to the host was blocked; I assumed it was blocked the whole time. But I see that the connection was in fact working for a few minutes each time and only then cut, causing another fencing cycle.

How exactly was the blocking done?
(In reply to Michal Skrivanek from comment #10) > thanks. The bug description says the connection from engine to host was > blocked, I assumed all the time. But I see that indeed the connection was > always working for few minutes and only then cut, causing another fencing > cycle. > > How exactly was the blocking done? For the customer, one of the storage was not available and hence the host was going non-operational. The test I have done in my test lab is by blocking the 54321 port using iptable.
(In reply to nijin ashok from comment #11)
> (In reply to Michal Skrivanek from comment #10)
> > Thanks. The bug description says the connection from the engine to the
> > host was blocked; I assumed it was blocked the whole time. But I see that
> > the connection was in fact working for a few minutes each time and only
> > then cut, causing another fencing cycle.
> >
> > How exactly was the blocking done?
>
> For the customer, one of the storage domains was not available and hence
> the host kept going NonOperational.
>
> The test I did in my lab was done by blocking port 54321 using iptables.

Then how could the engine establish the connection periodically, after each soft fencing? I was looking at your test run: there are 26 restarts in the vdsm log and 21 "host status Up" transitions in engine.log, and after each one there is communication with the engine going on.

As for the customer behavior, I wasn't able to check it, as the engine.log is from a different period than the host logs; I only see those 198 restarts of vdsm, presumably due to soft fencing. But we need the engine log to understand the reason for the fencing.
(In reply to Michal Skrivanek from comment #12)
> Then how could the engine establish the connection periodically, after each
> soft fencing? I was looking at your test run: there are 26 restarts in the
> vdsm log and 21 "host status Up" transitions in engine.log, and after each
> one there is communication with the engine going on.

There were many manual steps done in my test: I manually removed and re-added the iptables rule to mimic the issue, and I restarted vdsmd manually. I think it would be better to check the customer logs for the ssh soft-fencing issue.

> As for the customer behavior, I wasn't able to check it, as the engine.log
> is from a different period than the host logs; I only see those 198 restarts
> of vdsm, presumably due to soft fencing. But we need the engine log to
> understand the reason for the fencing.

I can see that the uploaded tar file contains the engine log from the time of the issue. There are 196 restarts of vdsm:

xzgrep "I am the actual vdsm" customer_environment/src_host_vdsm.log.xz | wc -l
196

I can see the ssh soft-fencing actions related to these events as well:

xzgrep "Executing SSH Soft Fencing" customer_environment/engine.log.xz | wc -l
196

These happened between "2018-06-18 21:01" and "2018-06-19 14:37". The uploaded engine log covers "2018-06-18 03:39" to "2018-06-19 15:05". Please let me know if you are interested in any other logs.
(In reply to nijin ashok from comment #13)
> (In reply to Michal Skrivanek from comment #12)
> > Then how could the engine establish the connection periodically, after
> > each soft fencing? I was looking at your test run: there are 26 restarts
> > in the vdsm log and 21 "host status Up" transitions in engine.log, and
> > after each one there is communication with the engine going on.
>
> There were many manual steps done in my test: I manually removed and
> re-added the iptables rule to mimic the issue, and I restarted vdsmd
> manually. I think it would be better to check the customer logs for the
> ssh soft-fencing issue.

Right, indeed.

> > As for the customer behavior, I wasn't able to check it, as the engine.log
> > is from a different period than the host logs; I only see those 198
> > restarts of vdsm, presumably due to soft fencing. But we need the engine
> > log to understand the reason for the fencing.
>
> I can see that the uploaded tar file contains the engine log from the time
> of the issue. There are 196 restarts of vdsm:
>
> xzgrep "I am the actual vdsm" customer_environment/src_host_vdsm.log.xz | wc -l
> 196
>
> I can see the ssh soft-fencing actions related to these events as well:
>
> xzgrep "Executing SSH Soft Fencing" customer_environment/engine.log.xz | wc -l
> 196

Great, so that explains the restarts. What I'd still like to understand (though it is a bit off topic) is the reason for this in their environment. If they executed a connectivity test it shouldn't be flapping like that. I just want to rule out a fencing problem that would behave like this during a permanent connection failure.
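As a side note, a small sketch of how the restart and soft-fencing timestamps could be pulled from the same logs to check the "fairly fixed interval" observation from comment #4. This assumes the same files and log markers as the counts in comment #13, and that each log line starts with the date and time:

# vdsm restart timestamps on the source host
xzgrep "I am the actual vdsm" customer_environment/src_host_vdsm.log.xz | awk '{print $1, $2}'

# SSH soft-fencing timestamps on the engine side, for correlation
xzgrep "Executing SSH Soft Fencing" customer_environment/engine.log.xz | awk '{print $1, $2}'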
Apparently not urgent anymore; demoting.
Without further information, the only thing we have is the theory in comment #4 that this is already fixed in 4.2. Please reopen if you get any more details or if you think it is not fixed in 4.2, so we can look into it further.
sync2jira