Bug 1640872 - VMI becomes inoperable in "Failed" state after restarting it multiple times
Summary: VMI becomes inoperable in "Failed" state after restarting it multiple times
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 1.2
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 1.3
Assignee: Roman Mohr
QA Contact: zhe peng
URL:
Whiteboard:
Depends On: 1642530
Blocks:
 
Reported: 2018-10-19 03:03 UTC by Guohua Ouyang
Modified: 2018-11-06 13:10 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-11-06 13:10:11 UTC
Target Upstream Version:
Embargoed:


Attachments
vmi_failed (32.92 KB, image/png)
2018-10-19 03:03 UTC, Guohua Ouyang
logs (4.22 KB, application/x-gzip)
2018-10-19 03:10 UTC, Guohua Ouyang

Description Guohua Ouyang 2018-10-19 03:03:27 UTC
Created attachment 1495534 [details]
vmi_failed

Description of problem:
On the web console, click the VM actions menu and restart the running VM multiple times; the VM gets into the "Failed" state. After that, stopping/starting/restarting the VM does not help.

Version-Release number of selected component (if applicable):
CNV-v1.2.0-5-1-g605855d
origin-web-console-server:1.2

How reproducible:
100%

Steps to Reproduce:
1. Go to actions of a running VM
2. Click restart multiple times.

Actual results:
The VM becomes inoperable in the Failed state.

Expected results:
The "Restart" button should be grayed out to prevent restarting a VM that is already undergoing a restart.

Additional info:

Comment 2 Guohua Ouyang 2018-10-19 03:10:13 UTC
Created attachment 1495535 [details]
logs

Comment 3 Tomas Jelinek 2018-10-22 09:27:27 UTC
@Fabian: any idea why this would happen? All the UI does is delete the VMI as a reaction to "restart". If you do it a couple of times, it is no longer possible to start the VM and it stays in Scheduling.

Comment 5 Fabian Deutsch 2018-10-22 09:47:44 UTC
Roman, could you please take a look?

Comment 6 Roman Mohr 2018-10-22 11:17:45 UTC
on it.

Comment 10 Roman Mohr 2018-10-24 11:10:14 UTC
When I investigated the environment, I saw that the two mentioned VMIs were running, as expected. There is a good chance that under some conditions it can take up to *minutes* (!) until Kubernetes completely removes deleted pods. If the pod is still hanging around, the VMI can't be started again because that may lead to inconsistencies (e.g. data corruption). Therefore the failed VMI hangs around until the pod goes away. The controllers then continue their work as usual.

Are we sure that we did not experience that scenario?

In the future it would help to also check whether the pod is already gone. If not, looking into the pod's events normally reveals why the cleanup takes so long. There are issues filed against k8s to improve this situation with pods.
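For illustration only, here is a minimal sketch of such a check using the Python kubernetes client. The `kubevirt.io/created-by` label selector, the namespace, and the VMI UID are example assumptions for this sketch, not values confirmed in this environment.

```
# Sketch: check whether the virt-launcher pod for a deleted VMI is really gone,
# and if not, print its events to see why the cleanup is taking so long.
# Assumptions: kubeconfig access, and that launcher pods carry a
# "kubevirt.io/created-by=<vmi-uid>" label (adjust the selector if needed).
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

namespace = "default"                             # example namespace
vmi_uid = "e99fc418-d799-11e8-a4cd-fa163e0953ea"  # example VMI UID

pods = core.list_namespaced_pod(
    namespace, label_selector="kubevirt.io/created-by=" + vmi_uid)
if not pods.items:
    print("launcher pod is gone")
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase, pod.metadata.deletion_timestamp)
    # The pod's events usually explain a slow cleanup (unmounts, force deletes, ...).
    events = core.list_namespaced_event(
        namespace, field_selector="involvedObject.name=" + pod.metadata.name)
    for ev in events.items:
        print("  ", ev.reason, ev.message)
```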

I can't rule out that we have an issue in kubevirt, but I can't dig deeper.

Comment 11 Roman Mohr 2018-10-24 11:33:37 UTC
OK, I have it. It seems to be related to this error:

```
E1024 04:01:30.166019       1 streamwatcher.go:109] Unable to decode an event from the watch stream: unable to decode watch event: no kind "Status" is registered for version "v1"
```

Thanks for reporting it.

Comment 12 Tomas Jelinek 2018-10-24 11:40:34 UTC
Thank you Roman for the investigation!

Moving to virtualization component.

Comment 13 Guohua Ouyang 2018-10-24 12:35:50 UTC
If more information is needed: restarting a running VM 3-5 times quickly reproduces the issue.

Comment 14 Roman Mohr 2018-10-24 15:02:31 UTC
OK, so it seems to be related to the propagationPolicy which the web UI sends when it deletes a VMI. If you only send one delete request, the garbage collector removes the "Foreground" finalizer. If we do it multiple times, for whatever reason it stays. Interestingly, in both cases the GC claims to have removed the finalizer.

If the CR is deleted via the command line, which does not set a propagationPolicy, everything works as expected.


Here is the delete request payload from the UI:

```
{"kind":"DeleteOptions","apiVersion":"v1","propagationPolicy":"Foreground"}
```


Here is the VMI directly after it was deleted:

```
apiVersion: kubevirt.io/v1alpha2
kind: VirtualMachineInstance
metadata:
  creationTimestamp: 2018-10-24T14:34:28Z
  deletionGracePeriodSeconds: 0
  deletionTimestamp: 2018-10-24T14:48:18Z
  finalizers:
  - foregroundDeleteVirtualMachine
  uid: e99fc418-d799-11e8-a4cd-fa163e0953ea
```

Here are the GC logs:

```
I1024 14:48:18.907722       1 graph_builder.go:553] add [kubevirt.io/v1alpha2/VirtualMachineInstance, namespace: default, name: fedora, uid: e99fc418-d799-11e8-a4cd-fa163e0953ea] to the attemptToDelete, because it's waiting for its dependents to be deleted
I1024 14:48:18.908062       1 garbagecollector.go:408] processing item [kubevirt.io/v1alpha2/VirtualMachineInstance, namespace: default, name: fedora, uid: e99fc418-d799-11e8-a4cd-fa163e0953ea]
I1024 14:48:18.913571       1 garbagecollector.go:530] remove DeleteDependents finalizer for item [kubevirt.io/v1alpha2/VirtualMachineInstance, namespace: default, name: fedora, uid: e99fc418-d799-11e8-a4cd-fa163e0953ea]
```

Now, after sending the delete request multiple times:

```
apiVersion: kubevirt.io/v1alpha2
kind: VirtualMachineInstance
metadata:
  creationTimestamp: 2018-10-24T14:54:41Z
  deletionGracePeriodSeconds: 0
  deletionTimestamp: 2018-10-24T14:54:58Z
  finalizers:
  - foregroundDeleteVirtualMachine
  - foregroundDeletion
  uid: bca97fd7-d79c-11e8-a4cd-fa163e0953ea
```

Here are the logs again:

```
I1024 14:54:57.368392       1 graph_builder.go:553] add [kubevirt.io/v1alpha2/VirtualMachineInstance, namespace: default, name: fedora, uid: bca97fd7-d79c-11e8-a4cd-fa163e0953ea] to the attemptToDelete, because it's waiting for its dependents to be deleted
I1024 14:54:57.368910       1 garbagecollector.go:408] processing item [kubevirt.io/v1alpha2/VirtualMachineInstance, namespace: default, name: fedora, uid: bca97fd7-d79c-11e8-a4cd-fa163e0953ea]
I1024 14:54:57.388385       1 garbagecollector.go:530] remove DeleteDependents finalizer for item [kubevirt.io/v1alpha2/VirtualMachineInstance, namespace: default, name: fedora, uid: bca97fd7-d79c-11e8-a4cd-fa163e0953ea]
```


As you can see, when the VMI is stuck, it still has the foregroundDeletion finalizer left:


```
apiVersion: kubevirt.io/v1alpha2
kind: VirtualMachineInstance
metadata:
  creationTimestamp: 2018-10-24T14:54:41Z
  deletionGracePeriodSeconds: 0
  deletionTimestamp: 2018-10-24T14:54:58Z
  finalizers:
  - foregroundDeletion
  uid: bca97fd7-d79c-11e8-a4cd-fa163e0953ea

```

The conclusion here is that the GC rightfully removes the finalizer immediately, because we don't set an owner reference on the pod. If we send the delete request again, the finalizer gets re-added, but the GC does not process the item anymore.
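As a hedged illustration of that stuck state (read-only, using the Python kubernetes client; group/version/plural are taken from the manifests above, and the namespace/name are the ones used in this reproduction):

```
# Sketch: detect a VMI that is stuck in foreground deletion, i.e. it has a
# deletionTimestamp but still carries the "foregroundDeletion" finalizer that
# the GC will no longer remove.
from kubernetes import client, config

config.load_kube_config()
custom = client.CustomObjectsApi()

vmi = custom.get_namespaced_custom_object(
    group="kubevirt.io", version="v1alpha2", namespace="default",
    plural="virtualmachineinstances", name="fedora")

meta = vmi["metadata"]
if meta.get("deletionTimestamp") and "foregroundDeletion" in meta.get("finalizers", []):
    print("VMI is stuck waiting on foregroundDeletion:", meta["finalizers"])
```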


I will file an issue against openshift. As a workaround, I suggest to only send `"propagationPolicy":"Foreground"` right now. Also on latest master the issue might be harder to reproduce, because we now set an owner-reference on the Pod. Therefore the garbage collector can't immediately remove the finalizer. However, the issue still exists, it is just harder to hit.

Comment 15 Roman Mohr 2018-10-24 15:03:47 UTC
Let me know how you want to proceed.

Comment 16 Roman Mohr 2018-10-24 15:05:09 UTC
> 
> I will file an issue against openshift. As a workaround, I suggest to only
> send `"propagationPolicy":"Foreground"` 

Obviously I meant to *not* send the propagationPolicy right now.

> right now. Also on latest master the
> issue might be harder to reproduce, because we now set an owner-reference on
> the Pod. Therefore the garbage collector can't immediately remove the
> finalizer. However, the issue still exists, it is just harder to hit.
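A minimal sketch of that corrected workaround, assuming the Python kubernetes client: delete the VMI with plain DeleteOptions (no propagationPolicy), which matches the command-line behaviour that worked as expected. The `delete_vmi` helper is hypothetical, not part of any kubevirt client library.

```
# Sketch: delete a VMI without a propagationPolicy, so the API server does
# not add a foregroundDeletion finalizer.
from kubernetes import client, config

config.load_kube_config()
custom = client.CustomObjectsApi()

def delete_vmi(name, namespace="default"):
    # Note: no propagation_policy is set here, unlike the UI payload above.
    return custom.delete_namespaced_custom_object(
        group="kubevirt.io",
        version="v1alpha2",
        namespace=namespace,
        plural="virtualmachineinstances",
        name=name,
        body=client.V1DeleteOptions(),
    )

delete_vmi("fedora")
```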

Comment 17 Roman Mohr 2018-10-24 15:32:20 UTC
Here is the created issue: https://bugzilla.redhat.com/show_bug.cgi?id=1642530

Comment 18 Guohua Ouyang 2018-10-25 09:42:08 UTC
The issue does not happen on the new UI. 

I'd like to test the origin console with the newer kubevirt, which has the owner-reference feature. If the problem is really hard to hit, we can just close the bug.

CNV 1.2 is out and we won't have a new build of the origin web console for CNV 1.3, so I can go without applying the workaround as well.

Comment 19 Fabian Deutsch 2018-11-05 10:58:59 UTC
Zhe/Guohua, so how would you like to proceed with this issue?

Comment 20 Fabian Deutsch 2018-11-05 14:33:48 UTC
Steve, Nelly, acks?

Comment 21 Fabian Deutsch 2018-11-05 14:34:24 UTC
There were code changes, which will land in 1.3 (0.9).

Comment 22 Guohua Ouyang 2018-11-06 09:01:40 UTC
I will check it on 1.3 (0.9) on the origin web console to see if the bug is reproducible.

Comment 23 Guohua Ouyang 2018-11-06 11:06:25 UTC
I tested on CNV 1.3, which should be built on kubevirt v0.9.3. The issue still occurs on the origin web console.

As I mentioned in the comment above, it does not happen on the new web UI.

I'm fine with closing the issue.

