Description of problem:
While creating a VM with an attached CDI PVC, `oc create` returned a `Timeout` error, but the VM objects were still created. On the same environment this is no longer reproducible after the pods were re-created.

Version-Release number of selected component (if applicable):
0.7alpha5

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Please provide the VM or template you used.

And you say that this only happened on the first try, but afterwards it worked?
(In reply to Fabian Deutsch from comment #1)
> Please provide the VM or template you used.
>
> And you say that this only happened on the first try, but afterwards it worked?

Below is the template I used:

```
apiVersion: v1
kind: Template
metadata:
  name: cirros-demo-template
  annotations:
    description: "OpenShift KubeVirt Cirros VM template"
    tags: "kubevirt,openshift,template,linux"
labels:
  kubevirt.io/os: fedora28
  miq.github.io/kubevirt-is-vm-template: "true"
objects:
- apiVersion: kubevirt.io/v1alpha2
  kind: VirtualMachine
  metadata:
    name: ${NAME}
    labels:
      kubevirt-ovm: ovm-${NAME}
  spec:
    running: false
    template:
      metadata:
        labels:
          kubevirt-ovm: ovm-${NAME}
          special: demo-key
      spec:
        domain:
          cpu:
            cores: ${{CPU_CORES}}
          resources:
            requests:
              memory: 1Gi
          devices:
            disks:
            - disk:
                bus: virtio
              name: disk0
              volumeName: root
            - disk:
                bus: virtio
              name: cloudinitdisk
              volumeName: cloudinitvolume
        volumes:
        - name: root
          persistentVolumeClaim:
            claimName: rhel75-pvc-2
        - name: cloudinitvolume
          cloudInitNoCloud:
            userData: |-
              #cloud-config
              password: "redhat"
              chpasswd: { expire: False }
              ssh_authorized_keys:
                - ${KEY}
parameters:
- name: NAME
  description: Name for the new VM
- name: CPU_CORES
  description: Amount of cores
  value: "4"
- description: ssh key
  name: KEY
  value: "ssh-rsa"
```
(In reply to Fabian Deutsch from comment #1)
> Please provide the VM or template you used.
>
> And you say that this only happened on the first try, but afterwards it worked?

Yes, this happened to me once, seemingly at random. I tried the same cluster again after a few days and it works fine now.
@Roman, do you have some insights about this issue? I think you were looking at Vatsal's environment last week
I did not look into it so far. @Vatsal, are you sure that the PVC was already properly provisioned by CDI? AFAIK there is currently no protection that prevents you from attaching a disk which is only half imported.

Further, there are some odd issues around PVC provisioning and mount retries. To find out whether this is the case, I would need the event log from k8s (always a good thing to attach):

Something like:

```
kubectl get events > events.log
```
Oh, and the mount retries are pod/kubelet related. There could also be something wrong in KubeVirt, of course, but I would need events and logs.
(In reply to Roman Mohr from comment #5)
> I did not look into it so far. @Vatsal, are you sure that the PVC was already
> properly provisioned by CDI? AFAIK there is currently no protection that
> prevents you from attaching a disk which is only half imported.
>
> Further, there are some odd issues around PVC provisioning and mount retries.
> To find out whether this is the case, I would need the event log from k8s
> (always a good thing to attach):
>
> Something like:
>
> ```
> kubectl get events > events.log
> ```

AFAIR, the PVC was provisioned and the importer pod had completed successfully. I don't see any events in kube-system, and the only event in the default namespace is:

```
LAST SEEN   FIRST SEEN   COUNT   NAME                                      KIND                   SUBOBJECT   TYPE     REASON           SOURCE                               MESSAGE
3m          7h           28      ansible-service-broker.15429c83d4ff5065   ClusterServiceBroker               Normal   FetchedCatalog   service-catalog-controller-manager   Successfully fetched catalog entries from broker.
```
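For completeness, a quick way to capture events from every namespace when attaching diagnostics; the sort flag and the file name here are just suggestions:

```
# Collect events from all namespaces, ordered by time, and save them for attaching:
kubectl get events --all-namespaces --sort-by=.lastTimestamp > events-all-namespaces.log
```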
Vatsal, I see that two out of three virt-handlers are not connected to the cluster. I see a lot of errors like:

```
level=error timestamp=2018-07-19T00:04:41.627712Z pos=vm.go:707 component=virt-handler reason="Patch https://172.30.0.1:443/api/v1/nodes/cnv-executor-vatsal-unified-demo-node1.example.com: dial tcp 172.30.0.1:443: connect: connection refused" msg="Can't patch node cnv-executor-vatsal-unified-demo-node1.example.com"
```

and

```
E0717 18:04:21.368897    3275 reflector.go:205] kubevirt.io/kubevirt/pkg/virt-handler/vm.go:224: Failed to list *v1.VirtualMachineInstance: Get https://172.30.0.1:443/apis/kubevirt.io/v1alpha2/virtualmachineinstances?labelSelector=kubevirt.io%2FnodeName+in+%28cnv-executor-vatsal-unified-demo-node1.example.com%29&limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
```

If virt-handler can't update its heartbeat for some time, the node will be marked as unschedulable. But if the pod was already created before the cluster recognized that the node is not schedulable, the pod may hang around for some time and finally time out. We need to improve in that area. I created https://github.com/kubevirt/kubevirt/issues/1077 some time ago, which should first improve virt-handler behaviour in the case of such errors and second better inform the user about what went wrong.
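A quick sketch for checking whether virt-handler's heartbeat is being lost on a node; the kubevirt.io/schedulable label and kubevirt.io/heartbeat annotation names are assumed from current KubeVirt behaviour and may differ in older releases:

```
# Show which nodes KubeVirt still considers schedulable (label maintained by virt-handler):
kubectl get nodes -L kubevirt.io/schedulable

# Show the last heartbeat timestamp virt-handler wrote on a specific node:
kubectl get node <node-name> -o jsonpath='{.metadata.annotations.kubevirt\.io/heartbeat}'
```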
A solution is discussed in https://github.com/kubevirt/kubevirt/pull/1356.
We have merged https://github.com/kubevirt/kubevirt/pull/1515 today. It allows monitoring KubeVirt components via Prometheus. See https://kubevirt.io/user-guide/#/misc/monitoring for further details. This allows detecting possible virt-handler issues via monitoring.

The whole story is now like this:

1) Like before, if there are connection issues, virt-handler can't update its heartbeat anymore.
2) After a few minutes our node controller detects that virt-handler is unresponsive and marks the node as non-schedulable. From then on, no new VMIs will be scheduled to the node.
3) Since virt-handler is unresponsive, we have no idea how the VMIs are doing. A fencing agent or an admin has to investigate:
   3.1) If only virt-handler is unresponsive, VMIs can simply be deleted via kubectl. The node controller will then move the VMIs on unresponsive virt-handler nodes to the Failed state once the corresponding pods are gone.
   3.2) If the whole node is unresponsive, cluster-specific fencing operations need to be performed and finally the VMI pods need to be force-deleted. Our node controller or a recovered virt-handler will then move the VMIs to the Failed state.

Finally, this still leaves a window of up to a few minutes where VMIs may be scheduled to a node on which virt-handler has started to be unresponsive but the heartbeat has not yet timed out. To detect faster when KubeVirt components have issues like this, monitoring via Prometheus can now be used (especially watch for rest_client_requests_total{code="<error>"}). See steps 3.1 and 3.2 on how to resolve that.
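As a minimal sketch, the error metric mentioned above can be polled via the Prometheus HTTP API; the Prometheus address below is a placeholder for whatever endpoint your monitoring stack exposes:

```
# Report KubeVirt client requests that failed without an HTTP response ("<error>" code)
# over the last 5 minutes; a sustained non-zero rate hints at connectivity problems.
curl -sG 'http://prometheus.example.com:9090/api/v1/query' \
  --data-urlencode 'query=rate(rest_client_requests_total{code="<error>"}[5m]) > 0'
```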
Vatsal, do the description and the instructions above make sense for you?
Additionally, https://github.com/kubevirt/kubevirt/pull/1544 got merged. It ensures that **new** VMIs (meaning VMIs which were never processed successfully at least once by virt-handler) which can't be started because of unresponsive virt-launcher pods go into a Failed state faster (after 5 minutes). VMIs that are already running still need to be fenced manually.
(In reply to Roman Mohr from comment #13)
> Vatsal, do the description and the instructions above make sense for you?

Roman, yes, this makes it quite clear. Additionally, it would be good to have your comment documented somewhere in the docs.
Roman, can you add the documentation part?
Yes, on it.
I think this is already done and the documentation merged. Roman, can this be moved to POST? It should be moved to ON_QA once the upstream user-guide is available downstream.
Yes, it is here: https://kubevirt.io/user-guide/#/workloads/virtual-machines/unresponsive-nodes
Vatsal, do you have a good idea of how this bug/fix can be verified? Moving to MODIFIED because the fix was previously merged in the release 0.13 stable branch.
Steps to reproduce according to https://bugzilla.redhat.com/show_bug.cgi?id=1601365#c12:

Have 2 nodes with 1 VM on each node.

Case 1:
1. Add latency on the nodes (using tc, e.g. `tc qdisc add dev eth0 root netem delay 97ms`; see the sketch after this list) so that virt-handler loses its connection.
2. Check that the VMs move to the Failed state.
3. Remove the latency and check that the VMs recover.

Case 2:
1. Shut down the nodes.
2. Check that the VMs move to the Failed state.
3. Start the nodes and check that the VMs recover.
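A small sketch of the latency injection for case 1, run on the node itself; the interface name and delay value are environment-specific assumptions:

```
# Add ~97ms of artificial latency so virt-handler starts missing its heartbeat:
tc qdisc add dev eth0 root netem delay 97ms

# Remove the latency again to let virt-handler recover:
tc qdisc del dev eth0 root netem
```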
Verified with these steps:
1) Create and run a VM.
2) `ifdown eth0` on the node.
3) `ifup eth0` on the node after 10 minutes.

After shutting down eth0 on the node, the pod went into the "Unknown" state, but the VMI stayed in "Running":

```
# oc get pod
virt-launcher-vm-fedora-4zs82   2/2   Unknown   0   36m
# oc get vmi
vm-fedora   37m   Running   10.130.0.33   cnv-executor-dshchedr2-node1.example.com
```

After enabling eth0 again, the pod was recreated and the VMI changed its IP address:

```
# oc get pod
virt-launcher-vm-fedora-p4xrz   2/2   Running   0   3m
# oc get vmi
vm-fedora   59m   Running   10.130.0.33   cnv-executor-dshchedr2-node1.example.com
```

From my point of view this is expected behavior.
There is a mistake in the previous comment. Verified with these steps:
1) Create and run a VM.
2) `ifdown eth0` on the node.
3) `ifup eth0` on the node after 10 minutes.

After shutting down eth0 on the node, the pod went into the "Unknown" state, but the VMI stayed in "Running":

```
# oc get pod
virt-launcher-vm-fedora-4zs82   2/2   Unknown   0   36m
# oc get vmi
vm-fedora   37m   Running   10.130.0.33   cnv-executor-dshchedr2-node1.example.com
```

After enabling eth0 again, the pod was recreated and the VMI changed its IP address:

```
# oc get pod
virt-launcher-vm-fedora-p4xrz   2/2   Running   0   3m
# oc get vmi
vm-fedora   3m   Running   10.130.0.45   cnv-executor-dshchedr2-node1.example.com
```

From my point of view this is expected behavior.
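For reference, a rough sketch of the verification sequence above; the interface name and 10-minute wait come from the steps, while the watch commands are just a convenience:

```
# On the node: take the interface down, wait ~10 minutes, bring it back up.
ifdown eth0
sleep 600
ifup eth0

# From a machine with cluster access: watch the pod and VMI transitions.
oc get pod -w
oc get vmi -w
```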
Some additional tests, according to comment 21:

Case 1 - increasing network latency to 1000ms:
The pod and VM remain in the Running state, and a new VM can be created successfully. But VNC and console do not work:

```
# virtctl console vm-fedora
Can't connect to websocket (503): service unavailable
# virtctl vnc vm-fedora
Can't access VMI vm-fedora: Can't connect to websocket (503): service unavailable
```

After removing the network latency, all services work well.

Case 2 - shutting down the nodes:

Step 1: Created a VM on node1:

```
# oc get pod -o wide
virt-launcher-vm-cirros-pvc-cdi-g84l7   1/1   Running   0   9m   10.129.0.43   cnv-executor-dshchedr-node1.example.com   <none>
```

Step 2: Shut down node1; node2 is active. After ~5 minutes the pod changed its status to Unknown, but the VMI still had status Running:

```
# oc get pod
virt-launcher-vm-cirros-pvc-cdi-g84l7   1/1   Unknown   0   16m
# oc get vmi
vm-cirros-pvc-cdi   16m   Running   10.129.0.43   cnv-executor-dshchedr-node1.example.com
```

Tried to create a new VM, but got an error:

```
# oc create -f vm-containerDisk.yaml
Error from server (Timeout): error when creating "vm-containerDisk.yaml": Timeout: request did not complete within allowed duration
```

Step 3: Enabled node1 after ~10 minutes. The pod was recreated and switched to another node (node2); the VMI is running and accessible:

```
# oc get pod -o wide
virt-launcher-vm-cirros-pvc-cdi-rwkkv   1/1   Running   0   5m   10.130.0.89   cnv-executor-dshchedr-node2.example.com   <none>
```

After enabling node1, a new VM can be created successfully:

```
# oc create -f vm-containerDisk.yaml
virtualmachine.kubevirt.io/vm-fedora created
```
Denys,

Regarding your comments in https://bugzilla.redhat.com/show_bug.cgi?id=1601365#c24, it appears you are confirming that things are working as expected?

* Interrupt network connectivity to a node; observe that the pods/VMIs on that node eventually time out.
* No new VMIs can be rescheduled (because no nodes are available).
* Once a node becomes available, things start working again.

Is that correct? Does Engineering need to take action?
I'm not sure about this: node1 is up, node2 is down (power or eth0, doesn't matter) - I can't create a VM:

```
# oc create -f vm.yml
Error from server (Timeout): Timeout: request did not complete within allowed duration
```

Everything else looks good to me.
David, could you verify if a server timeout is expected in this case?
> David, could you verify if a server timeout is expected in this case?

If the disconnected node contained control-plane components (like the API server or etcd), then yes, it is possible for API timeouts to occur. If the disconnected node does not contain any control-plane components, or if the control plane is replicated, then I would not expect API timeouts to occur during creation. There's not enough information in this bz to determine either of those things.

@Denys, please attach the output from the following commands:
1. kubectl get pods --all-namespaces
2. kubectl describe nodes
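A convenience sketch for capturing both outputs as files to attach; -o wide is added here only because it shows which node each pod runs on, and the file names are arbitrary:

```
kubectl get pods --all-namespaces -o wide > all-pods.txt
kubectl describe nodes > describe-nodes.txt
```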
Created attachment 1525461 [details] oc describe nodes
Created attachment 1525463 [details] all pods
Made some additional tests on my current environment:

Case 1 - node1 is up, node2 is down: can't create a VM:

```
# oc create -f vm.yml
Error from server (Timeout): Timeout: request did not complete within allowed duration
```

Case 2 - node1 is down, node2 is up: the VM is created successfully:

```
# oc create -f vm.yml
virtualmachine.kubevirt.io/vm-cirros created
```
Perfect. Here's what's going on. node2 has both KubeVirt API servers:

```
Name:    cnv-executor-dshchedr-node2.example.com
Roles:   compute
Labels:  beta.kubernetes.io/arch=amd64
. . .
Non-terminated Pods:  (16 in total)
  Namespace   Name                       CPU Requests   CPU Limits   Memory Requests   Memory Limits
  ---------   ----                       ------------   ----------   ---------------   -------------
  . . .
  kubevirt    virt-api-fd86fd5fc-llx5g   0 (0%)         0 (0%)       0 (0%)            0 (0%)
  kubevirt    virt-api-fd86fd5fc-mdqmp   0 (0%)         0 (0%)       0 (0%)            0 (0%)
```

If node2's network is disrupted, the Kubernetes API server won't be able to contact the KubeVirt API server's validation hook during creation. Basically, if no instance of the KubeVirt API server (virt-api-*) can be contacted, then no VMIs can get posted. So yes, in this circumstance the timeout is expected behavior.

However, the virt-api pods really shouldn't be scheduled on the same node like that. It reduces the availability of our components when we don't spread redundant instances across separate nodes.
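As an illustration of spreading the instances out, a pod anti-affinity rule could be added to the virt-api deployment. This is only a sketch: the kubevirt namespace and the kubevirt.io: virt-api label are assumptions based on default manifests, and on operator-managed installs such a manual patch may be reverted:

```
# Prefer scheduling virt-api replicas on different nodes (soft anti-affinity).
kubectl -n kubevirt patch deployment virt-api --type strategic -p '{
  "spec": {"template": {"spec": {"affinity": {"podAntiAffinity": {
    "preferredDuringSchedulingIgnoredDuringExecution": [{
      "weight": 100,
      "podAffinityTerm": {
        "topologyKey": "kubernetes.io/hostname",
        "labelSelector": {"matchLabels": {"kubevirt.io": "virt-api"}}
      }
    }]
  }}}}}
}'
```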
Knowing this, it makes perfect sense that your situation works when node1 is down but not when node2 is down.
So, since it is expected behavior, I'm marking this bug as Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:0417