Created attachment 1495534 [details]
vmi_failed

Description of problem:
On the web console, clicking the "Restart" VM action multiple times on a running VM puts the VM into the "Failed" state. After that, stopping, starting, or restarting the VM does not help.

Version-Release number of selected component (if applicable):
CNV-v1.2.0-5-1-g605855d
origin-web-console-server:1.2

How reproducible:
100%

Steps to Reproduce:
1. Go to the actions of a running VM.
2. Click restart multiple times.

Actual results:
The VM becomes inoperable and stays in the Failed state.

Expected results:
The "Restart" button should be grayed out to prevent restarting a VM that is already undergoing a restart.

Additional info:
Created attachment 1495535 [details]
logs
@Fabian: any idea why this would happen? All the UI does is delete the VMI as a reaction to "restart". If you do it a couple of times, it is no longer possible to start the VM and it stays in Scheduling.
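For context, a rough CLI sketch of what that amounts to. The VMI name "fedora" and namespace "default" are taken from the logs further down in this bug; note that the actual UI request also sets a propagationPolicy, which becomes relevant later:

```
# a rough CLI equivalent of the UI's "restart": delete the running VMI;
# the VirtualMachine controller is expected to create a replacement
kubectl delete virtualmachineinstance fedora -n default

# watch the replacement VMI come up (or get stuck in Scheduling)
kubectl get virtualmachineinstance fedora -n default -w
```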
Roman, could you please take a look?
on it.
When I investigated the environment, I saw that the two mentioned VMIs were running, as expected.

There is a good chance that under some conditions it can take up to *minutes* (!) until Kubernetes completely removes deleted pods. If the pod is still hanging around, the VMI can't be started again, because that could lead to inconsistencies (e.g. data corruption). Therefore the failed VMI stays around until the pod goes away; the controllers then continue their work as usual. Are we sure that we did not hit that scenario?

In the future it would help to also check whether the pod is already gone. If it is not, looking into the pod's events normally reveals why the cleanup takes so long. There are issues filed against k8s to improve the situation on the pod side.

I can't rule out that we have an issue in kubevirt, but I can't dig deeper.
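A hedged sketch of that check (the label selector is an assumption about how virt-launcher pods are labeled; adjust the namespace and pod name as needed):

```
# is the virt-launcher pod for the VMI really gone yet?
kubectl get pods -n default -l kubevirt.io=virt-launcher

# if one is still terminating, its events usually explain the slow cleanup
kubectl describe pod <virt-launcher-pod-name> -n default
```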
Ok, I have it. It seems to be related to this issue:

```
E1024 04:01:30.166019 1 streamwatcher.go:109] Unable to decode an event from the watch stream: unable to decode watch event: no kind "Status" is registered for version "v1"
```

Thanks for reporting it.
Thank you Roman for the investigation! Moving to virtualization component.
If more information is needed: restarting a running VM 3-5 times quickly reproduces the issue.
Ok, so it seems to be related to the PropagationPolicy which is sent via the web-ui when it deletes a VMI. If you only send one delete request, the garbage collector removes the "Foreground" finalizer. If we do it multiple times, for whatever reason it stays. Interestingly, in both cases the GC claims to have removed the finalizer. If the CR is deleted via the command line, which does not set a PropagationPolicy, everything works as expected.

Here is the delete request payload from the UI:

```
{"kind":"DeleteOptions","apiVersion":"v1","propagationPolicy":"Foreground"}
```

Here is the VMI directly after it got deleted:

```
apiVersion: kubevirt.io/v1alpha2
kind: VirtualMachineInstance
metadata:
  creationTimestamp: 2018-10-24T14:34:28Z
  deletionGracePeriodSeconds: 0
  deletionTimestamp: 2018-10-24T14:48:18Z
  finalizers:
  - foregroundDeleteVirtualMachine
  uid: e99fc418-d799-11e8-a4cd-fa163e0953ea
```

Here are the GC logs:

```
I1024 14:48:18.907722 1 graph_builder.go:553] add [kubevirt.io/v1alpha2/VirtualMachineInstance, namespace: default, name: fedora, uid: e99fc418-d799-11e8-a4cd-fa163e0953ea] to the attemptToDelete, because it's waiting for its dependents to be deleted
I1024 14:48:18.908062 1 garbagecollector.go:408] processing item [kubevirt.io/v1alpha2/VirtualMachineInstance, namespace: default, name: fedora, uid: e99fc418-d799-11e8-a4cd-fa163e0953ea]
I1024 14:48:18.913571 1 garbagecollector.go:530] remove DeleteDependents finalizer for item [kubevirt.io/v1alpha2/VirtualMachineInstance, namespace: default, name: fedora, uid: e99fc418-d799-11e8-a4cd-fa163e0953ea]
```

Now with the delete request sent multiple times:

```
apiVersion: kubevirt.io/v1alpha2
kind: VirtualMachineInstance
metadata:
  creationTimestamp: 2018-10-24T14:54:41Z
  deletionGracePeriodSeconds: 0
  deletionTimestamp: 2018-10-24T14:54:58Z
  finalizers:
  - foregroundDeleteVirtualMachine
  - foregroundDeletion
  uid: bca97fd7-d79c-11e8-a4cd-fa163e0953ea
```

Here are the logs again:

```
I1024 14:54:57.368392 1 graph_builder.go:553] add [kubevirt.io/v1alpha2/VirtualMachineInstance, namespace: default, name: fedora, uid: bca97fd7-d79c-11e8-a4cd-fa163e0953ea] to the attemptToDelete, because it's waiting for its dependents to be deleted
I1024 14:54:57.368910 1 garbagecollector.go:408] processing item [kubevirt.io/v1alpha2/VirtualMachineInstance, namespace: default, name: fedora, uid: bca97fd7-d79c-11e8-a4cd-fa163e0953ea]
I1024 14:54:57.388385 1 garbagecollector.go:530] remove DeleteDependents finalizer for item [kubevirt.io/v1alpha2/VirtualMachineInstance, namespace: default, name: fedora, uid: bca97fd7-d79c-11e8-a4cd-fa163e0953ea]
```

As you can see, when the VMI is stuck, it still has the finalizer left:

```
apiVersion: kubevirt.io/v1alpha2
kind: VirtualMachineInstance
metadata:
  creationTimestamp: 2018-10-24T14:54:41Z
  deletionGracePeriodSeconds: 0
  deletionTimestamp: 2018-10-24T14:54:58Z
  finalizers:
  - foregroundDeletion
  uid: bca97fd7-d79c-11e8-a4cd-fa163e0953ea
```

The conclusion here is that the GC rightfully removes the finalizer immediately, because we don't set an owner reference on the pod. If we send the delete request again, the finalizer gets re-added, but the GC does not process the item anymore.

I will file an issue against openshift. As a workaround, I suggest to only send `"propagationPolicy":"Foreground"` right now. Also on latest master the issue might be harder to reproduce, because we now set an owner-reference on the Pod. Therefore the garbage collector can't immediately remove the finalizer.
However, the issue still exists, it is just harder to hit.
Let me know how you want to proceed.
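For anyone hitting this, a minimal sketch of how to confirm the stuck state described above, i.e. a VMI with a deletionTimestamp whose only remaining finalizer is `foregroundDeletion` (name and namespace taken from the logs):

```
# a deletionTimestamp plus a leftover "foregroundDeletion" finalizer
# indicates the stuck state analyzed above
kubectl get virtualmachineinstance fedora -n default \
  -o jsonpath='{.metadata.deletionTimestamp} {.metadata.finalizers}'
```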
> I will file an issue against openshift. As a workaround, I suggest to only
> send `"propagationPolicy":"Foreground"`

Obviously I meant to *not* send the propagationPolicy right now.

> right now. Also on latest master the
> issue might be harder to reproduce, because we now set an owner-reference on
> the Pod. Therefore the garbage collector can't immediately remove the
> finalizer.

However, the issue still exists, it is just harder to hit.
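To make the suggested workaround concrete, here is a hedged sketch of the same delete against the API, but with a DeleteOptions body that omits the propagationPolicy, mirroring the command-line behaviour that works. `$APISERVER` and `$TOKEN` are placeholders:

```
curl -k -X DELETE \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"kind":"DeleteOptions","apiVersion":"v1"}' \
  "$APISERVER/apis/kubevirt.io/v1alpha2/namespaces/default/virtualmachineinstances/fedora"
```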
Here is the created issue: https://bugzilla.redhat.com/show_bug.cgi?id=1642530
The issue does not happen on the new UI. I'd like to have a test on the origin console with the newer kubevirt which has the owner-reference feature. If the problem is really hard to hit, we can just close the bug. CNV 1.2 is out and we won't have a new build of the origin web console for CNV 1.3, so we can go without applying the workaround as well.
Zhe/Guohua, how would you like to proceed with this issue?
Steve, Nelly, acks?
There were code changes which will land in 1.3 (0.9).
I will check it on 1.3 (0.9) on the origin web console to see if the bug is reproducible.
I tested on CNV 1.3, which should be built on kubevirt v0.9.3. It still has this issue on the origin web console. As I mentioned in the comment above, it is not happening on the new web UI. I'm fine with closing the issue.