Description of problem:
While creating a VM with an attached CDI PVC, `oc create` returned a `Timeout` error, but the VM objects were still created. On the same environment this is no longer reproducible after the pods were re-created.

Version-Release number of selected component (if applicable):
0.7alpha5

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Please provide the VM or template you used.

And you say that this only happened on the first try, but afterwards it worked?
(In reply to Fabian Deutsch from comment #1)
> Please provide the VM or template you used.
>
> And you say that this only happened on the first try, but afterwards it worked?

Below is the template I used:

```
apiVersion: v1
kind: Template
metadata:
  name: cirros-demo-template
  annotations:
    description: "OpenShift KubeVirt Cirros VM template"
    tags: "kubevirt,openshift,template,linux"
labels:
  kubevirt.io/os: fedora28
  miq.github.io/kubevirt-is-vm-template: "true"
objects:
- apiVersion: kubevirt.io/v1alpha2
  kind: VirtualMachine
  metadata:
    name: ${NAME}
    labels:
      kubevirt-ovm: ovm-${NAME}
  spec:
    running: false
    template:
      metadata:
        labels:
          kubevirt-ovm: ovm-${NAME}
          special: demo-key
      spec:
        domain:
          cpu:
            cores: ${{CPU_CORES}}
          resources:
            requests:
              memory: 1Gi
          devices:
            disks:
            - disk:
                bus: virtio
              name: disk0
              volumeName: root
            - disk:
                bus: virtio
              name: cloudinitdisk
              volumeName: cloudinitvolume
        volumes:
        - name: root
          persistentVolumeClaim:
            claimName: rhel75-pvc-2
        - name: cloudinitvolume
          cloudInitNoCloud:
            userData: |-
              #cloud-config
              password: "redhat"
              chpasswd: { expire: False }
              ssh_authorized_keys:
                - ${KEY}
parameters:
- name: NAME
  description: Name for the new VM
- name: CPU_CORES
  description: Amount of cores
  value: "4"
- description: ssh key
  name: KEY
  value: "ssh-rsa"
```
(In reply to Fabian Deutsch from comment #1)
> Please provide the VM or template you used.
>
> And you say that this only happened on the first try, but afterwards it worked?

Yes, this happened to me once, seemingly at random. I tried the same cluster again after a few days and it works fine now.
@Roman, do you have some insights about this issue? I think you were looking at Vatsal's environment last week
I did not look into it so far. @Vatsal, are you sure that the PVC was already properly provisioned by CDI? AFAIK there is currently no protection that prevents you from attaching a disk which is only half imported.

Further, there are some odd issues around PVC provisioning and mount retries. To find out whether this is the case, I would need the event log from k8s (always a good thing to attach):

Something like:

```
kubectl get events > events.log
```
Oh, and the mount retries are pod/kubelet related. There could also be something wrong in KubeVirt, of course, but I would need events and logs.
(In reply to Roman Mohr from comment #5)
> I did not look into it so far. @Vatsal, are you sure that the PVC was already
> properly provisioned by CDI? AFAIK there is currently no protection that
> prevents you from attaching a disk which is only half imported.
>
> Further, there are some odd issues around PVC provisioning and mount retries.
> To find out whether this is the case, I would need the event log from k8s
> (always a good thing to attach):
>
> Something like:
>
> ```
> kubectl get events > events.log
> ```

AFAIR, the PVC was provisioned and the importer pod had completed successfully. I don't see any events in kube-system, and the only event in the default namespace is:

```
LAST SEEN   FIRST SEEN   COUNT   NAME                                      KIND                   SUBOBJECT   TYPE     REASON           SOURCE                               MESSAGE
3m          7h           28      ansible-service-broker.15429c83d4ff5065   ClusterServiceBroker               Normal   FetchedCatalog   service-catalog-controller-manager   Successfully fetched catalog entries from broker.
```
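For completeness, a quick way to capture events from every namespace when attaching diagnostics; the sort flag and the file name here are just suggestions:

```
# Collect events from all namespaces, ordered by time, and save them for attaching:
kubectl get events --all-namespaces --sort-by=.lastTimestamp > events-all-namespaces.log
```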
Vatsal, I see that two out of three virt-handlers are not connected to the cluster. I see a lot of errors like:

```
level=error timestamp=2018-07-19T00:04:41.627712Z pos=vm.go:707 component=virt-handler reason="Patch https://172.30.0.1:443/api/v1/nodes/cnv-executor-vatsal-unified-demo-node1.example.com: dial tcp 172.30.0.1:443: connect: connection refused" msg="Can't patch node cnv-executor-vatsal-unified-demo-node1.example.com"
```

and

```
E0717 18:04:21.368897    3275 reflector.go:205] kubevirt.io/kubevirt/pkg/virt-handler/vm.go:224: Failed to list *v1.VirtualMachineInstance: Get https://172.30.0.1:443/apis/kubevirt.io/v1alpha2/virtualmachineinstances?labelSelector=kubevirt.io%2FnodeName+in+%28cnv-executor-vatsal-unified-demo-node1.example.com%29&limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
```

If virt-handler can't update its heartbeat for some time, the node will be marked as unschedulable. But if the pod was already created before the cluster recognized that the node is not schedulable, the pod may hang around for some time and finally time out. We need to improve in that area. I created https://github.com/kubevirt/kubevirt/issues/1077 some time ago, which should first improve virt-handler behaviour in the case of such errors and second better inform the user about what went wrong.
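A quick sketch for checking whether virt-handler's heartbeat is being lost on a node; the kubevirt.io/schedulable label and kubevirt.io/heartbeat annotation names are assumed from current KubeVirt behaviour and may differ in older releases:

```
# Show which nodes KubeVirt still considers schedulable (label maintained by virt-handler):
kubectl get nodes -L kubevirt.io/schedulable

# Show the last heartbeat timestamp virt-handler wrote on a specific node:
kubectl get node <node-name> -o jsonpath='{.metadata.annotations.kubevirt\.io/heartbeat}'
```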
A solution is discussed in https://github.com/kubevirt/kubevirt/pull/1356.
We have merged https://github.com/kubevirt/kubevirt/pull/1515 today. It allows monitoring KubeVirt components via Prometheus. See https://kubevirt.io/user-guide/#/misc/monitoring for further details. This allows detecting possible virt-handler issues via monitoring.

The whole story is now like this:

1) Like before, if there are connection issues, virt-handler can't update its heartbeat anymore.
2) After a few minutes our node controller detects that virt-handler is unresponsive and marks the node as non-schedulable. From then on, no new VMIs will be scheduled to the node.
3) Since virt-handler is unresponsive, we have no idea how the VMIs are doing. A fencing agent or an admin has to investigate:
   3.1) If only virt-handler is unresponsive, VMIs can simply be deleted via kubectl. The node controller will then move the VMIs on unresponsive virt-handler nodes to the Failed state once the corresponding pods are gone.
   3.2) If the whole node is unresponsive, cluster-specific fencing operations need to be performed and finally the VMI pods need to be force-deleted. Our node controller or a recovered virt-handler will then move the VMIs to the Failed state.

Finally, this still leaves a window of up to a few minutes where VMIs may be scheduled to a node on which virt-handler has started to be unresponsive but the heartbeat has not yet timed out. To detect faster when KubeVirt components have issues like this, monitoring via Prometheus can now be used (especially watch for rest_client_requests_total{code="<error>"}). See steps 3.1 and 3.2 on how to resolve that.
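As a minimal sketch, the error metric mentioned above can be polled via the Prometheus HTTP API; the Prometheus address below is a placeholder for whatever endpoint your monitoring stack exposes:

```
# Report KubeVirt client requests that failed without an HTTP response ("<error>" code)
# over the last 5 minutes; a sustained non-zero rate hints at connectivity problems.
curl -sG 'http://prometheus.example.com:9090/api/v1/query' \
  --data-urlencode 'query=rate(rest_client_requests_total{code="<error>"}[5m]) > 0'
```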
Vatsal, do the description and the instructions above make sense for you?
Additionally, https://github.com/kubevirt/kubevirt/pull/1544 got merged. It ensures that **new** VMIs (meaning VMIs which were never processed successfully at least once by virt-handler) which can't be started because of unresponsive virt-launcher pods go into a Failed state faster (after 5 minutes). VMIs that are already running still need to be fenced manually.
(In reply to Roman Mohr from comment #13)
> Vatsal, do the description and the instructions above make sense for you?

Roman, yes, this makes it quite clear. Additionally, it would be good to have your comment documented somewhere in the docs.
Roman, can you add the documentation part?
Yes, on it.
I think this is already done and the documentation merged. Roman, can this be moved to POST? It should be moved to ON_QA once the upstream user-guide is available downstream.
Yes, it is here: https://kubevirt.io/user-guide/#/workloads/virtual-machines/unresponsive-nodes
Vatsal, do you have a good idea of how this bug/fix can be verified? Moving to MODIFIED because the fix was previously merged in the release 0.13 stable branch.
Steps to reproduce according to https://bugzilla.redhat.com/show_bug.cgi?id=1601365#c12:

Have 2 nodes with 1 VM on each node.

Case 1:
1. Add latency on the nodes (using tc, e.g. `tc qdisc add dev eth0 root netem delay 97ms`; see the sketch after this list) so that virt-handler loses its connection.
2. Check that the VMs move to the Failed state.
3. Remove the latency and check that the VMs recover.

Case 2:
1. Shut down the nodes.
2. Check that the VMs move to the Failed state.
3. Start the nodes and check that the VMs recover.
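A small sketch of the latency injection for case 1, run on the node itself; the interface name and delay value are environment-specific assumptions:

```
# Add ~97ms of artificial latency so virt-handler starts missing its heartbeat:
tc qdisc add dev eth0 root netem delay 97ms

# Remove the latency again to let virt-handler recover:
tc qdisc del dev eth0 root netem
```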
Verified with these steps:
1) Create and run a VM.
2) `ifdown eth0` on the node.
3) `ifup eth0` on the node after 10 minutes.

After shutting down eth0 on the node, the pod went into the "Unknown" state, but the VMI stayed in "Running":

```
# oc get pod
virt-launcher-vm-fedora-4zs82   2/2   Unknown   0   36m
# oc get vmi
vm-fedora   37m   Running   10.130.0.33   cnv-executor-dshchedr2-node1.example.com
```

After enabling eth0 again, the pod was recreated and the VMI changed its IP address:

```
# oc get pod
virt-launcher-vm-fedora-p4xrz   2/2   Running   0   3m
# oc get vmi
vm-fedora   59m   Running   10.130.0.33   cnv-executor-dshchedr2-node1.example.com
```

From my point of view this is expected behavior.
There is a mistake in the previous comment. Verified with these steps:
1) Create and run a VM.
2) `ifdown eth0` on the node.
3) `ifup eth0` on the node after 10 minutes.

After shutting down eth0 on the node, the pod went into the "Unknown" state, but the VMI stayed in "Running":

```
# oc get pod
virt-launcher-vm-fedora-4zs82   2/2   Unknown   0   36m
# oc get vmi
vm-fedora   37m   Running   10.130.0.33   cnv-executor-dshchedr2-node1.example.com
```

After enabling eth0 again, the pod was recreated and the VMI changed its IP address:

```
# oc get pod
virt-launcher-vm-fedora-p4xrz   2/2   Running   0   3m
# oc get vmi
vm-fedora   3m   Running   10.130.0.45   cnv-executor-dshchedr2-node1.example.com
```

From my point of view this is expected behavior.
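For reference, a rough sketch of the verification sequence above; the interface name and 10-minute wait come from the steps, while the watch commands are just a convenience:

```
# On the node: take the interface down, wait ~10 minutes, bring it back up.
ifdown eth0
sleep 600
ifup eth0

# From a machine with cluster access: watch the pod and VMI transitions.
oc get pod -w
oc get vmi -w
```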
Some additional tests, according to comment 21:

Case 1 - increasing network latency to 1000ms:
The pod and VM remain in the Running state, and a new VM can be created successfully. But VNC and console do not work:

```
# virtctl console vm-fedora
Can't connect to websocket (503): service unavailable
# virtctl vnc vm-fedora
Can't access VMI vm-fedora: Can't connect to websocket (503): service unavailable
```

After removing the network latency, all services work well.

Case 2 - shutting down the nodes:

Step 1: Created a VM on node1:

```
# oc get pod -o wide
virt-launcher-vm-cirros-pvc-cdi-g84l7   1/1   Running   0   9m   10.129.0.43   cnv-executor-dshchedr-node1.example.com   <none>
```

Step 2: Shut down node1; node2 is active. After ~5 minutes the pod changed its status to Unknown, but the VMI still had status Running:

```
# oc get pod
virt-launcher-vm-cirros-pvc-cdi-g84l7   1/1   Unknown   0   16m
# oc get vmi
vm-cirros-pvc-cdi   16m   Running   10.129.0.43   cnv-executor-dshchedr-node1.example.com
```

Tried to create a new VM, but got an error:

```
# oc create -f vm-containerDisk.yaml
Error from server (Timeout): error when creating "vm-containerDisk.yaml": Timeout: request did not complete within allowed duration
```

Step 3: Enabled node1 after ~10 minutes. The pod was recreated and switched to another node (node2); the VMI is running and accessible:

```
# oc get pod -o wide
virt-launcher-vm-cirros-pvc-cdi-rwkkv   1/1   Running   0   5m   10.130.0.89   cnv-executor-dshchedr-node2.example.com   <none>
```

After enabling node1, a new VM can be created successfully:

```
# oc create -f vm-containerDisk.yaml
virtualmachine.kubevirt.io/vm-fedora created
```
Denys,

Regarding your comments in https://bugzilla.redhat.com/show_bug.cgi?id=1601365#c24, it appears you are confirming that things are working as expected?

* Interrupt network connectivity to a node; observe that the pods/VMIs on that node eventually time out.
* No new VMIs can be rescheduled (because no nodes are available).
* Once a node becomes available, things start working again.

Is that correct? Does Engineering need to take action?
I'm not sure about this: node1 is up, node2 is down (power or eth0, doesn't matter) - I can't create a VM:

```
# oc create -f vm.yml
Error from server (Timeout): Timeout: request did not complete within allowed duration
```

Everything else looks good to me.
David, could you verify if a server timeout is expected in this case?
> David, could you verify if a server timeout is expected in this case?

If the disconnected node contained control-plane components (like the API server or etcd), then yes, it is possible for API timeouts to occur. If the disconnected node does not contain any control-plane components, or if the control plane is replicated, then I would not expect API timeouts to occur during creation. There's not enough information in this bz to determine either of those things.

@Denys, please attach the output from the following commands:
1. kubectl get pods --all-namespaces
2. kubectl describe nodes
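A convenience sketch for capturing both outputs as files to attach; -o wide is added here only because it shows which node each pod runs on, and the file names are arbitrary:

```
kubectl get pods --all-namespaces -o wide > all-pods.txt
kubectl describe nodes > describe-nodes.txt
```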
Created attachment 1525461 [details] oc describe nodes
Created attachment 1525463 [details] all pods
Made some additional tests on my current environment:

Case 1 - node1 is up, node2 is down: can't create a VM:

```
# oc create -f vm.yml
Error from server (Timeout): Timeout: request did not complete within allowed duration
```

Case 2 - node1 is down, node2 is up: the VM is created successfully:

```
# oc create -f vm.yml
virtualmachine.kubevirt.io/vm-cirros created
```
Perfect. Here's what's going on. node2 has both KubeVirt API servers:

```
Name:    cnv-executor-dshchedr-node2.example.com
Roles:   compute
Labels:  beta.kubernetes.io/arch=amd64
. . .
Non-terminated Pods:  (16 in total)
  Namespace   Name                       CPU Requests   CPU Limits   Memory Requests   Memory Limits
  ---------   ----                       ------------   ----------   ---------------   -------------
  . . .
  kubevirt    virt-api-fd86fd5fc-llx5g   0 (0%)         0 (0%)       0 (0%)            0 (0%)
  kubevirt    virt-api-fd86fd5fc-mdqmp   0 (0%)         0 (0%)       0 (0%)            0 (0%)
```

If node2's network is disrupted, the Kubernetes API server won't be able to contact the KubeVirt API server's validation hook during creation. Basically, if no instance of the KubeVirt API server (virt-api-*) can be contacted, then no VMIs can get posted. So yes, in this circumstance the timeout is expected behavior.

However, the virt-api pods really shouldn't be scheduled on the same node like that. It reduces the availability of our components when we don't spread redundant instances across separate nodes.
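As an illustration of spreading the instances out, a pod anti-affinity rule could be added to the virt-api deployment. This is only a sketch: the kubevirt namespace and the kubevirt.io: virt-api label are assumptions based on default manifests, and on operator-managed installs such a manual patch may be reverted:

```
# Prefer scheduling virt-api replicas on different nodes (soft anti-affinity).
kubectl -n kubevirt patch deployment virt-api --type strategic -p '{
  "spec": {"template": {"spec": {"affinity": {"podAntiAffinity": {
    "preferredDuringSchedulingIgnoredDuringExecution": [{
      "weight": 100,
      "podAffinityTerm": {
        "topologyKey": "kubernetes.io/hostname",
        "labelSelector": {"matchLabels": {"kubevirt.io": "virt-api"}}
      }
    }]
  }}}}}
}'
```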
Knowing this, it makes perfect sense that your situation works when node1 is down but not when node2 is down.
So, since it is expected behavior, I'm marking this bug as Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:0417