Bug 1816327 - VM is allowed to run on two hosts concurrently if initial VM startup fails with "No JSON object could be decoded" error
Summary: VM is allowed to run on two hosts concurrently if initial VM startup fails with "No JSON object could be decoded" error
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.3.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.4.1
: 4.4.1
Assignee: Milan Zamazal
QA Contact: Qin Yuan
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-03-23 19:28 UTC by Gordon Watson
Modified: 2023-09-07 22:31 UTC
10 users

Fixed In Version: ovirt-engine-4.4.1.5 vdsm-4.40.12
Doc Type: Bug Fix
Doc Text:
Previously, if you tried to start an already-running virtual machine (VM) on the same host, VDSM failed this operation too late and the VM on the host became hidden from the RHV Manager. The current release fixes the issue: VDSM immediately rejects attempts to start a running VM on the same host.
Clone Of:
Environment:
Last Closed: 2020-08-04 13:27:53 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:
lsvaty: testing_plan_complete-




Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 4923051 0 None None None 2020-03-23 22:02:24 UTC
Red Hat Product Errata RHEA-2020:3246 0 None None None 2020-08-04 13:28:24 UTC
oVirt gerrit 108178 0 master MERGED api: Retrieve VM id from the XML on VM creation 2021-02-11 15:12:18 UTC
oVirt gerrit 109828 0 master MERGED send vmId to VDSM on create VM 2021-02-11 15:12:18 UTC

Description Gordon Watson 2020-03-23 19:28:55 UTC
Description of problem:

After a host was rebooted, a VM was started and encountered a "No JSON object could be decoded" error. The VM was not terminated, yet VDSM no longer saw the VM as being active. The engine tried to start it again on the same host, and it failed with "Virtual machine already exists". It was then started successfully on a different host. However, the 'qemu-kvm' process was still running on the original host.


Version-Release number of selected component (if applicable):

RHV 4.2.8
RHEL 7.7
    libvirt-4.5.0-23.el7_7.3.x86_64            
    qemu-kvm-rhev-2.12.0-33.el7_7.4.x86_64            
    vdsm-4.30.38-1.el7ev.x86_64                       


How reproducible:

Not reproducible.


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 6 Gordon Watson 2020-03-23 21:45:30 UTC
FYI: In comment #1, there are annotations indicating "<<<<  above". These have no meaning in the context above, and so can be ignored.

Comment 7 Michal Skrivanek 2020-03-24 13:03:28 UTC
Hard to say. "Confirm Host has been rebooted" is always suspicious. Could it be that there is a different host elsewhere which was running (or at least reporting Down) the VM at that time? Logs don't go back that far...

That guest agent issue seems unrelated, but it also raises suspicions as to what's running in the guests. Is there anything interfering with the serial channel? It shouldn't affect anything, unless it were actually giving misleading status updates (vdsm would accept a valid message coming from the guest).

Comment 8 Gordon Watson 2020-03-24 13:49:39 UTC
Michal,

Thanks for the quick response. There are only 2 hosts in this DC. I checked the logs and I don't see either VM having been started previously on the other host. In fact, the other host was also rebooted at around the same time. 

I got the guest agent logs, but they don't go back far enough.

So, in this case, regardless of the "No JSON object could be decoded" error and the reason for it, it seems that there is a discrepancy between VDSM and libvirt/qemu. VDSM presumably no longer saw the VM as being up, but the 'qemu-kvm' process was still running. Is there possibly something in the logic that is not handling this type of error correctly?

Regards, GFW.

Comment 10 Gordon Watson 2020-03-27 18:47:23 UTC
Michal et al,

I think my analysis of the above is incorrect. The "No JSON object could be decoded" error does not appear to be the cause of the problem. It just got reported from the guest agent handler, and the VM probably continued on.

What appears to have happened is that a group of about 13 VMs got started all at once via the GUI. Three of these were then started a second time, under a different correlation id, but they were not started by the customer manually. These three VMs all ended up running on two hosts at the same time. Two of them happened to encounter the "No JSON object could be decoded" error, the other did not. I'm not sure why this happened, though.

Anyway, the way these VMs started was that all of the VMs except three executed IsVmDuringInitiatingVDSCommand() under the first correlation id. Then these three executed IsVmDuringInitiatingVDSCommand() and RunVM() under the second correlation id. They then executed IsVmDuringInitiatingVDSCommand() and RunVM() under the first correlation id, and at that time they were all running on two hosts. [1]

I don't see how this second correlation id was invoked. However, it was interleaved with the first correlation id sequence. 

The only discrepancy I see on the host where the VMs that reported "No JSON object could be decoded" were running is that a 'getAllVmStats' shortly after this did not show either of these two VMs in its list of VMs [2]. That was why I had originally thought that the "No JSON object could be decoded" error had caused this. However, since one of the three VMs did not report this and still ended up on two hosts, I started to concentrate on the engine logs (thanks to Juan's interjection) and noticed the two discrete VM startup groupings under two separate correlation ids.

One thing here is that the engine is 4.2.8. Is there something that was fixed in 4.3 that might explain this?

Also, maybe we need a new bug for this, as the title and main comment are now seemingly not accurate.

Regards, GFW.




[1] correlation id '2ea51fdd-095e-4011-a016-843ddcbf2c5f' starts; '58fec24e-032a-4b5b-910f-087b8fb1ad09' then starts in the middle of it, and actually starts three of the VMs in the list of VMs for '2ea51fdd-095e-4011-a016-843ddcbf2c5f':

$ egrep '2ea51fdd-095e-4011-a016-843ddcbf2c5f|58fec24e-032a-4b5b-910f-087b8fb1ad09' engine.log-20200315 |egrep 'IsVmDuringInitiatingVDSCommand|RunVm' |grep -v FINISH| grep -v Prevalidating|grep -v EngineLock |grep -v 'failed for' |grep -v attached|awk '{print $2, $6, $9, $16}'
14:11:02,844+01 [2ea51fdd-095e-4011-a016-843ddcbf2c5f] IsVmDuringInitiatingVDSCommandParameters:{vmId='fd1f07a0-bace-42df-8153-474fb439a51b'}), 
14:11:02,914+01 [2ea51fdd-095e-4011-a016-843ddcbf2c5f] IsVmDuringInitiatingVDSCommandParameters:{vmId='402d9220-cd29-4d5b-8df1-d841d4a354ee'}), 
14:11:02,977+01 [2ea51fdd-095e-4011-a016-843ddcbf2c5f] IsVmDuringInitiatingVDSCommandParameters:{vmId='ac3ef8ed-f0ba-4f81-8767-dff87fb3bf72'}), 
14:11:02,997+01 [2ea51fdd-095e-4011-a016-843ddcbf2c5f] IsVmDuringInitiatingVDSCommandParameters:{vmId='88617e2e-efb1-41c5-9797-c8539d972f38'}), 
14:11:03,006+01 [2ea51fdd-095e-4011-a016-843ddcbf2c5f] IsVmDuringInitiatingVDSCommandParameters:{vmId='46b8a287-72d7-415f-b5cd-f5c6431fe9ea'}), 
14:11:03,018+01 [2ea51fdd-095e-4011-a016-843ddcbf2c5f] IsVmDuringInitiatingVDSCommandParameters:{vmId='b974dfb7-de27-4d12-a51c-29ce5536cfe5'}), 
14:11:03,023+01 [2ea51fdd-095e-4011-a016-843ddcbf2c5f] IsVmDuringInitiatingVDSCommandParameters:{vmId='ab47db5a-95cd-45bc-822c-c2343b3b8a79'}), 
14:11:03,032+01 [2ea51fdd-095e-4011-a016-843ddcbf2c5f] IsVmDuringInitiatingVDSCommandParameters:{vmId='1cccd04b-d7e8-4943-8b77-5e2476b8489b'}), 
14:11:03,076+01 [2ea51fdd-095e-4011-a016-843ddcbf2c5f] IsVmDuringInitiatingVDSCommandParameters:{vmId='5a9cd64c-6112-4571-b306-a62ce90dcc95'}), 
14:11:29,016+01 [58fec24e-032a-4b5b-910f-087b8fb1ad09] IsVmDuringInitiatingVDSCommandParameters:{vmId='c9ab1023-7068-42e3-8fb3-7420a529b241'}), 
14:11:29,016+01 [58fec24e-032a-4b5b-910f-087b8fb1ad09] IsVmDuringInitiatingVDSCommandParameters:{vmId='afea232f-24ae-46f2-8110-dbd82045036f'}), 
14:11:29,030+01 [58fec24e-032a-4b5b-910f-087b8fb1ad09] IsVmDuringInitiatingVDSCommandParameters:{vmId='388c000c-9110-40d3-9771-8b60a9830ccf'}), 
14:11:29,091+01 [58fec24e-032a-4b5b-910f-087b8fb1ad09] RunVmCommand 388c000c-9110-40d3-9771-8b60a9830ccf						<<< no error
14:11:36,735+01 [58fec24e-032a-4b5b-910f-087b8fb1ad09] RunVmCommand c9ab1023-7068-42e3-8fb3-7420a529b241						<<< No JSON object could be decoded
14:11:41,275+01 [58fec24e-032a-4b5b-910f-087b8fb1ad09] RunVmCommand afea232f-24ae-46f2-8110-dbd82045036f						<<< No JSON object could be decoded
14:14:35,510+01 [2ea51fdd-095e-4011-a016-843ddcbf2c5f] IsVmDuringInitiatingVDSCommandParameters:{vmId='388c000c-9110-40d3-9771-8b60a9830ccf'}), 	
14:14:35,536+01 [2ea51fdd-095e-4011-a016-843ddcbf2c5f] IsVmDuringInitiatingVDSCommandParameters:{vmId='afea232f-24ae-46f2-8110-dbd82045036f'}), 	
14:14:35,538+01 [2ea51fdd-095e-4011-a016-843ddcbf2c5f] IsVmDuringInitiatingVDSCommandParameters:{vmId='c9ab1023-7068-42e3-8fb3-7420a529b241'}), 
14:14:35,594+01 [2ea51fdd-095e-4011-a016-843ddcbf2c5f] RunVmCommand afea232f-24ae-46f2-8110-dbd82045036f						<<< Virtual machine already exists
14:14:44,359+01 [2ea51fdd-095e-4011-a016-843ddcbf2c5f] RunVmCommand ab47db5a-95cd-45bc-822c-c2343b3b8a79
14:14:52,588+01 [2ea51fdd-095e-4011-a016-843ddcbf2c5f] RunVmCommand 46b8a287-72d7-415f-b5cd-f5c6431fe9ea
14:14:58,700+01 [2ea51fdd-095e-4011-a016-843ddcbf2c5f] RunVmCommand 388c000c-9110-40d3-9771-8b60a9830ccf					 	<<< no error
14:15:04,946+01 [2ea51fdd-095e-4011-a016-843ddcbf2c5f] RunVmCommand fd1f07a0-bace-42df-8153-474fb439a51b
14:15:08,943+01 [2ea51fdd-095e-4011-a016-843ddcbf2c5f] RunVmCommand 88617e2e-efb1-41c5-9797-c8539d972f38
14:15:21,315+01 [2ea51fdd-095e-4011-a016-843ddcbf2c5f] RunVmCommand ac3ef8ed-f0ba-4f81-8767-dff87fb3bf72
14:15:35,707+01 [2ea51fdd-095e-4011-a016-843ddcbf2c5f] RunVmCommand b974dfb7-de27-4d12-a51c-29ce5536cfe5
14:15:47,523+01 [2ea51fdd-095e-4011-a016-843ddcbf2c5f] RunVmCommand 402d9220-cd29-4d5b-8df1-d841d4a354ee
14:16:01,863+01 [2ea51fdd-095e-4011-a016-843ddcbf2c5f] RunVmCommand 5a9cd64c-6112-4571-b306-a62ce90dcc95
14:16:15,485+01 [2ea51fdd-095e-4011-a016-843ddcbf2c5f] RunVmCommand c9ab1023-7068-42e3-8fb3-7420a529b241						<<< Virtual machine already exists
14:16:26,785+01 [2ea51fdd-095e-4011-a016-843ddcbf2c5f] RunVmCommand 1cccd04b-d7e8-4943-8b77-5e2476b8489b
14:16:41,769+01 [2ea51fdd-095e-4011-a016-843ddcbf2c5f] RunVmCommand ea16db9c-32cb-4f90-a5da-2417738c0166



[2] getAllVmStats;

2020-03-14 14:28:01,658+0100 INFO  (jsonrpc/7) [throttled] Current getAllVmStats: {'1cccd04b-d7e8-4943-8b77-5e2476b8489b': 'Up', '5a9cd64c-6112-4571-b306-a62ce90dcc95': 'Up', '388c000c-9110-40d3-9771-8b60a9830ccf': 'Up', 'b974dfb7-de27-4d12-a51c-29ce5536cfe5': 'Up', 'ea16db9c-32cb-4f90-a5da-2417738c0166': 'Up', 'fd1f07a0-bace-42df-8153-474fb439a51b': 'Up', 'ac3ef8ed-f0ba-4f81-8767-dff87fb3bf72': 'Up'} (throttledlog:104)

Comment 11 Milan Zamazal 2020-04-01 16:25:05 UTC
I can see in the Vdsm log that Engine had started the given VMs successfully and there had been no Destroy call before Engine tried to start an already running VM again. I can't see anything else that would indicate something very wrong happened to the given VMs on the host. That means Engine should be well aware the VMs are running and shouldn't start them again. We don't have getAllVmStats output from the meantime to prove that, but the Engine log also doesn't report anything about disappearing VMs. So it looks like a problem on the Engine side.

One weird thing is that the VM.create call was not immediately rejected by the check in clientIF.createVM when the VM was already running. That would suggest the VM was not present in vmContainer. But it's actually a bug in the check: the vmId parameter passed to clientIF.createVM is None rather than the VM id (I assume since Engine 4.2), so the check always passes. It also means that the second VM.create call can, depending on timing, overwrite the original Vm instance in vmContainer with the failed one. At that moment Vdsm loses track of the VM that is already running. Engine then calls Destroy, but the failed Vm instance is already Down, so nothing happens. The running VM just becomes completely invisible, because even the failed Vm instance gets (rightly) removed from vmContainer.

This also explains why the later getAllVmStats output doesn't show the VM.
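
To make the broken check concrete, here is a minimal Python sketch of the behavior described above. It is not the actual clientIF code; the names (vm_container, create_vm_broken, create_vm_fixed, extract_vm_id) are hypothetical, and the "fixed" variant only sketches the approach of the linked gerrit change ("api: Retrieve VM id from the XML on VM creation"):

# Minimal sketch (not the real clientIF code) of the duplicate-VM check
# described above; all names here are hypothetical stand-ins.
import xml.etree.ElementTree as ET

vm_container = {}  # vmId -> Vm object, analogous to clientIF.vmContainer

def extract_vm_id(domain_xml):
    # Take the VM id from the <uuid> element of the domain XML.
    return ET.fromstring(domain_xml).findtext('uuid')

def create_vm_broken(vm_params, vm_id=None):
    # The check relies on the vmId argument; Engine effectively passed None,
    # so 'None in vm_container' is never true and an already running VM is
    # not rejected. The second create then overwrites the original Vm
    # instance keyed by the real id.
    if vm_id in vm_container:
        raise RuntimeError('Virtual machine already exists')
    vm_container[extract_vm_id(vm_params['xml'])] = 'new Vm instance'

def create_vm_fixed(vm_params, vm_id=None):
    # Sketch of the fix: fall back to the id embedded in the domain XML,
    # so the duplicate check works even without an explicit vmId.
    vm_id = vm_id or extract_vm_id(vm_params['xml'])
    if vm_id in vm_container:
        raise RuntimeError('Virtual machine already exists')
    vm_container[vm_id] = 'new Vm instance'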

One thing I haven't figured out yet is why Vdsm periodic recovery didn't find the abandoned VMs. I can see one of them was actually found, but only two hours later, while the periodic check should be run every 60 seconds.

We have bugs in Vdsm, at least the broken check for VMs already present in vmContainer, but it doesn't explain the split brain. Engine tries to start an already running VM first and only then things go wrong on the Vdsm side. Engine could call VM.create on another host and then there would be two VM instances running without anything wrong on the host sides. Engine shouldn't call the second VM.create at all. So I think there is a bug on the Engine side. I'm not familiar with how things go in Engine and don't know what could possibly change since 4.2, Engine developers should tell.

Comment 15 Michal Skrivanek 2020-04-02 17:38:28 UTC
(In reply to Milan Zamazal from comment #11)

> We have bugs in Vdsm, at least the broken check for VMs already present in
> vmContainer, but it doesn't explain the split brain.

not directly, but it's very likely the reason.

> Engine tries to start
> an already running VM first and only then things go wrong on the Vdsm side.
> Engine could call VM.create on another host and then there would be two VM
> instances running without anything wrong on the host sides. Engine shouldn't
> call the second VM.create at all. So I think there is a bug on the Engine
> side. I'm not familiar with how things go in Engine and don't know what
> could possibly change since 4.2, Engine developers should tell.


Without a local reproducer we can't tell for sure, but losing track of monitoring can trigger pretty much any bad behavior...

Comment 17 Milan Zamazal 2020-04-02 17:49:08 UTC
(In reply to Michal Skrivanek from comment #15)
> (In reply to Milan Zamazal from comment #11)
> 
> > We have bugs in Vdsm, at least the broken check for VMs already present in
> > vmContainer, but it doesn't explain the split brain.
> 
> not directly, but it's very likely the reason.

No, the bad Vdsm behavior is a consequence of the split brain, not its cause. Liran is looking into possible Engine problems, and Gordon's observation about the correlation ids in Comment 10 seems to be a good clue so far.

Comment 20 Liran Rotenberg 2020-04-06 16:06:31 UTC
I couldn't find anything on the engine; debug logs would probably be helpful, but this isn't easy to reproduce.
As far as I saw, the user initiated the run command 3 times, each for a group of VMs. Each runner has a limit of 10 threads running in parallel.
Focusing on 2 of them, which contain the same VMs (this can happen if run is activated twice in quick succession): there is no ordering within the runners.
This is not a problem in itself, and having the same VM in two different runners isn't a problem either.

The VM takes its lock, and the run command is validated. The first time the VM takes the lock, validation should pass and the VM should start.
This is what we see in comment #10 - 58fec24e-032a-4b5b-910f-087b8fb1ad09.
When 2ea51fdd-095e-4011-a016-843ddcbf2c5f comes after it for the 3 split-brained VMs, the lock has already been freed for each VM, meaning the VMs are running.
Then we see many failures for other VMs, because their lock is still taken, which fails the command and prevents the run command for those VMs.

For the specific 3 VMs above, since the lock is free, the engine moves on to run command validation again:
14:14:35,510+01 [2ea51fdd-095e-4011-a016-843ddcbf2c5f] IsVmDuringInitiatingVDSCommandParameters:{vmId='388c000c-9110-40d3-9771-8b60a9830ccf'}), 	
14:14:35,536+01 [2ea51fdd-095e-4011-a016-843ddcbf2c5f] IsVmDuringInitiatingVDSCommandParameters:{vmId='afea232f-24ae-46f2-8110-dbd82045036f'}), 	
14:14:35,538+01 [2ea51fdd-095e-4011-a016-843ddcbf2c5f] IsVmDuringInitiatingVDSCommandParameters:{vmId='c9ab1023-7068-42e3-8fb3-7420a529b241'}), 
The above check, IsVmDuringInitiatingVDSCommand, passes too, since the VM is already running.

This is where we have a problem. These messages show that we passed the validation of the VM status: the VM status in the logs is reported as running, while this validation passing implies the VM isn't running.
There is no indication in the log of where or why the status changed. The only suspicious messages there are the warnings:
"skipping VM '{}' from this monitoring cycle - the VM data has changed since fetching the data"

This warning comes from VmsMonitoring:
if (vmManager.getVmDataChangedTime() != null && fetchTime - vmManager.getVmDataChangedTime() <= 0) 
telling us the engine data for that VM is newer than what is reported back from VDSM.

Without debug logs it will be hard to know how it went wrong, since nothing indicates a status change on the VM other than starting it and, at the final step, moving to 'Up'.
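
Schematically, what is suspected here is a classic check-then-act race on the engine's cached VM status. The Python sketch below is purely illustrative (the real Engine code is Java and structured differently); the lock, the cached status, and the two correlation ids only mimic the pattern visible in the log:

# Purely illustrative sketch of the suspected check-then-act race; the real
# Engine code is Java and is structured differently.
import threading
import time

vm_lock = threading.Lock()   # stands in for the per-VM engine lock
cached_status = 'Down'       # Engine's cached view of the VM status
starts = []                  # records every RunVm actually issued

def run_vm(correlation_id):
    global cached_status
    with vm_lock:
        # Validation: skip the start if the cached status already says Up.
        if cached_status == 'Up':
            return
        starts.append(correlation_id)      # RunVmCommand issued
    # The status only becomes Up later, via the monitoring cycle; if that
    # update is delayed or discarded ("the VM data has changed since
    # fetching the data"), the window below stays open for a second start.
    time.sleep(0.1)
    cached_status = 'Up'

t1 = threading.Thread(target=run_vm, args=('58fec24e',))
t2 = threading.Thread(target=run_vm, args=('2ea51fdd',))
t1.start(); time.sleep(0.05); t2.start()
t1.join(); t2.join()
print(starts)   # both correlation ids appear: the VM gets started twice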

Comment 34 Milan Zamazal 2020-06-22 16:40:14 UTC
We don't have sufficient information on the split brain, and the other issue, the broken running-VM check, is fixed.

Comment 35 Sandro Bonazzola 2020-07-02 11:55:49 UTC
$ git tag --contains 8e24e35593b79c7ce568a9da40f0c6e5abd9d7df
ovirt-engine-4.4.1.5

Comment 36 Sandro Bonazzola 2020-07-02 11:56:45 UTC
$ git tag --contains d7e757f00d19adaa9681d5b9c3be9f5c0bc8343d
v4.40.12
v4.40.13
v4.40.14
v4.40.15
v4.40.16
v4.40.17
v4.40.18
v4.40.19
v4.40.20
v4.40.21

Comment 39 Milan Zamazal 2020-07-08 19:23:17 UTC
Notes on verification of the issue when something attempts to start an already running VM:

- Engine should prevent that under normal circumstances, but the attempt can be made using vdsm-client on the given host.

- vdsm-client can call the VM.create API with parameters provided in a JSON file:

  vdsm-client -f params.json VM create

- The content of params.json can be something like:

  {
    "vmID": "anything",
    "vmParams": {"xml": "<domain><name>test</name><uuid>UUID-OF-AN-ALREADY-RUNNING-VM-ON-THE-HOST</uuid></domain>"}
  }

- The vdsm-client call should fail with "Virtual machine already exists" without even attempting to start the VM, and Engine should keep managing the already running VM normally.

Comment 40 Qin Yuan 2020-07-09 08:14:35 UTC
Verified with:
vdsm-4.40.22-1.el8ev.x86_64
ovirt-engine-4.4.1.7-0.3.el8ev.noarch

Steps:
1. Create a VM and run it on engine.
2. Try to start the already running VM by running `vdsm-client -f params.json VM create` on the host holding the VM.

Results:
1. When starting the VM from the engine, the engine now provides the vmId.
2. The vdsm-client call failed with "Virtual machine already exists".
3. The VM is running normally on the engine.

[root@puma42 ~]# vdsm-client Host getVMList
[
    {
        "status": "Up",
        "statusTime": "2163790564",
        "vmId": "0adf2553-8e6b-4875-b362-8d0e82b10116"
    }
]

[root@puma42 ~]# cat params.json 
{
    "vmID": "0adf2553-8e6b-4875-b362-8d0e82b10116",
    "vmParams": {"xml": "<domain><name>test</name><uuid>0adf2553-8e6b-4875-b362-8d0e82b10116</uuid></domain>"}
}

[root@puma42 ~]# vdsm-client -f params.json VM create
vdsm-client: Command VM.create with args {'vmID': '0adf2553-8e6b-4875-b362-8d0e82b10116', 'vmParams': {'xml': '<domain><name>test</name><uuid>0adf2553-8e6b-4875-b362-8d0e82b10116</uuid></domain>'}} failed:
(code=4, message=Virtual machine already exists)

vdsm log:
2020-07-09 10:48:38,739+0300 INFO  (jsonrpc/0) [api.virt] START create(vmParams={'vmId': '0adf2553-8e6b-4875-b362-8d0e82b10116', 'xml': ...}) from=::ffff:10.35.161.176,51886, flow_id=a461b209-bdeb-4252-a663-15329e46611f, vmId=0adf2553-8e6b-4875-b362-8d0e82b10116 (api:48)
...
2020-07-09 10:48:38,748+0300 INFO  (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call VM.create succeeded in 0.01 seconds (__init__:312)
...
2020-07-09 10:57:53,397+0300 INFO  (jsonrpc/6) [api.virt] START create(vmParams={'xml': '<domain><name>test</name><uuid>0adf2553-8e6b-4875-b362-8d0e82b10116</uuid></domain>'}) from=::1,46364, vmId=0adf2553-8e6b-4875-b362-8d0e82b10116 (api:48)
2020-07-09 10:57:53,397+0300 WARN  (jsonrpc/6) [vds] vm 0adf2553-8e6b-4875-b362-8d0e82b10116 already exists (API:197)
2020-07-09 10:57:53,398+0300 ERROR (jsonrpc/6) [api] FINISH create error=Virtual machine already exists (api:131)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 124, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/API.py", line 198, in create
    raise exception.VMExists()
vdsm.common.exception.VMExists: Virtual machine already exists
2020-07-09 10:57:53,431+0300 INFO  (jsonrpc/6) [api.virt] FINISH create return={'status': {'code': 4, 'message': 'Virtual machine already exists'}} from=::1,46364, vmId=0adf2553-8e6b-4875-b362-8d0e82b10116 (api:54)
2020-07-09 10:57:53,431+0300 INFO  (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call VM.create failed (error 4) in 0.03 seconds (__init__:312)

Comment 50 errata-xmlrpc 2020-08-04 13:27:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV RHEL Host (ovirt-host) 4.4), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:3246

