Description of problem:
Create a VM following https://github.com/kubevirt/containerized-data-importer/blob/master/doc/datavolumes.md#example-vm-using-dv and start it. The virt-launcher pod is running, but the readiness probe fails:
"Readiness probe failed: cat: /tmp/healthy: No such file or directory"

Version-Release number of selected component (if applicable):
openshift v3.11.12
CNV 1.2

How reproducible:
Always

Steps to Reproduce:
1. Create a VM using https://raw.githubusercontent.com/kubevirt/containerized-data-importer/master/manifests/example/example-vm-dv.yaml
2. Start the VM after the import process is finished.
3. Check the virt-launcher pod status.

Actual results:
3.
[root@cnv-executor-vatsal-master1 ~]# oc get pod
NAME                             READY     STATUS      RESTARTS   AGE
importer-example-dv-ggmbb        0/1       Completed   0          14m
virt-launcher-example-vm-jsscr   1/1       Running     0          9m

[root@cnv-executor-vatsal-master1 ~]# oc describe pod virt-launcher-example-vm-jsscr
Name:               virt-launcher-example-vm-jsscr
Namespace:          qwang-1
Priority:           0
PriorityClassName:  <none>
Node:               cnv-executor-vatsal-node1.example.com/172.16.0.22
Start Time:         Mon, 15 Oct 2018 04:53:35 -0400
Labels:             kubevirt.io=virt-launcher
                    kubevirt.io/created-by=cd14441c-d057-11e8-8387-fa163ee68a54
                    kubevirt.io/vm=example-vm
Annotations:        kubevirt.io/domain=example-vm
                    kubevirt.io/owned-by=virt-handler
                    openshift.io/scc=privileged
Status:             Running
IP:                 10.129.0.14
Containers:
  compute:
    Container ID:  cri-o://8b2491bfa68177fe0fe6bda5a0c6a73b51ee087b487b142ce28a8511b0dfcd02
    Image:         brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/cnv12-tech-preview/virt-launcher:1.2
    Image ID:      brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/cnv12-tech-preview/virt-launcher@sha256:0cab62d55ef835bc71477460eca2ed3846ae5c56f56501211891dd39981df041
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/share/kubevirt/virt-launcher/entrypoint.sh
      --qemu-timeout 5m
      --name example-vm
      --uid cd14441c-d057-11e8-8387-fa163ee68a54
      --namespace qwang-1
      --kubevirt-share-dir /var/run/kubevirt
      --readiness-file /tmp/healthy
      --grace-period-seconds 15
      --hook-sidecars 0
    State:          Running
      Started:      Mon, 15 Oct 2018 04:53:50 -0400
    Ready:          True
    Restart Count:  0
    Limits:
      devices.kubevirt.io/kvm:  1
      devices.kubevirt.io/tun:  1
    Requests:
      devices.kubevirt.io/kvm:  1
      devices.kubevirt.io/tun:  1
      memory:                   160932Ki
    Readiness:    exec [cat /tmp/healthy] delay=2s timeout=5s period=2s #success=1 #failure=5
    Environment:  <none>
    Mounts:
      /var/run/kubevirt from virt-share-dir (rw)
      /var/run/kubevirt-private/vmi-disks/example-datavolume from example-datavolume (rw)
      /var/run/libvirt from libvirt-runtime (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-9ctmt (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  example-datavolume:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  example-dv
    ReadOnly:   false
  virt-share-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/kubevirt
    HostPathType:
  libvirt-runtime:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  default-token-9ctmt:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-9ctmt
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubevirt.io/schedulable=true
                 node-role.kubernetes.io/compute=true
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
Events:
  Type     Reason     Age              From                                            Message
  ----     ------     ----             ----                                            -------
  Normal   Scheduled  9m               default-scheduler                               Successfully assigned qwang-1/virt-launcher-example-vm-jsscr to cnv-executor-vatsal-node1.example.com
  Normal   Pulling    9m               kubelet, cnv-executor-vatsal-node1.example.com  pulling image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/cnv12-tech-preview/virt-launcher:1.2"
  Normal   Pulled     9m               kubelet, cnv-executor-vatsal-node1.example.com  Successfully pulled image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/cnv12-tech-preview/virt-launcher:1.2"
  Normal   Created    9m               kubelet,
cnv-executor-vatsal-node1.example.com  Created container
  Normal   Started    9m               kubelet, cnv-executor-vatsal-node1.example.com  Started container
  Warning  Unhealthy  9m (x5 over 9m)  kubelet, cnv-executor-vatsal-node1.example.com  Readiness probe failed: cat: /tmp/healthy: No such file or directory

[root@cnv-executor-vatsal-master1 ~]# oc rsh virt-launcher-example-vm-jsscr
sh-4.2# ls /tmp/healthy
/tmp/healthy

[root@cnv-executor-vatsal-master1 ~]# oc logs virt-launcher-example-vm-jsscr
level=info timestamp=2018-10-15T08:53:51.196801Z pos=manager.go:69 component=virt-launcher msg="Collected all requested hook sidecar sockets"
level=info timestamp=2018-10-15T08:53:51.196877Z pos=manager.go:72 component=virt-launcher msg="Sorted all collected sidecar sockets per hook point based on their priority and name: map[]"
level=info timestamp=2018-10-15T08:53:51.197575Z pos=libvirt.go:261 component=virt-launcher msg="Connecting to libvirt daemon: qemu:///system"
+ mkdir -p /var/log/kubevirt
+ touch /var/log/kubevirt/qemu-kube.log
+ chown qemu:qemu /var/log/kubevirt/qemu-kube.log
+ [[ -z '' ]]
++ ip -o -4 a
++ tr -s ' '
++ cut '-d ' -f 2
++ grep -v -e '^lo[0-9:]*$'
++ head -1
+ LIBVIRTD_DEFAULT_NETWORK_DEVICE=eth0
+ echo 'Selected "eth0" as primary interface'
+ [[ -n eth0 ]]
+ echo 'Setting libvirt default network to "eth0"'
+ mkdir -p /etc/libvirt/qemu/networks/autostart
+ cat
+ ln -s -f /etc/libvirt/qemu/networks/default.xml /etc/libvirt/qemu/networks/autostart/default.xml
+ echo 'cgroup_controllers = [ ]'
+ '[' -d /dev/hugepages ']'
+ echo 'log_outputs = "1:stderr"'
+ /usr/sbin/libvirtd
2018-10-15 08:53:51.310+0000: 48: info : libvirt version: 3.9.0, package: 14.el7_5.8 (Red Hat, Inc.
<http://bugzilla.redhat.com/bugzilla>, 2018-09-04-07:27:24, x86-019.build.eng.bos.redhat.com)
2018-10-15 08:53:51.310+0000: 48: info : hostname: example-vm
2018-10-15 08:53:51.310+0000: 48: error : virDBusGetSystemBus:109 : internal error: Unable to get DBus system bus connection: Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory
2018-10-15 08:53:51.895+0000: 48: error : virDBusGetSystemBus:109 : internal error: Unable to get DBus system bus connection: Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory
2018-10-15 08:53:51.895+0000: 48: warning : networkStateInitialize:761 : DBus not available, disabling firewalld support in bridge_network_driver: internal error: Unable to get DBus system bus connection: Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory
2018-10-15 08:53:51.913+0000: 48: error : virPCIDeviceConfigOpen:312 : Failed to open config space file '/sys/bus/pci/devices/0000:00:00.0/config': Read-only file system
2018-10-15 08:53:51.913+0000: 48: error : virPCIDeviceConfigOpen:312 : Failed to open config space file '/sys/bus/pci/devices/0000:00:01.0/config': Read-only file system
2018-10-15 08:53:51.913+0000: 48: error : virPCIDeviceConfigOpen:312 : Failed to open config space file '/sys/bus/pci/devices/0000:00:01.1/config': Read-only file system
2018-10-15 08:53:51.914+0000: 48: error : virPCIDeviceConfigOpen:312 : Failed to open config space file '/sys/bus/pci/devices/0000:00:01.2/config': Read-only file system
2018-10-15 08:53:51.915+0000: 48: error : virPCIDeviceConfigOpen:312 : Failed to open config space file '/sys/bus/pci/devices/0000:00:01.3/config': Read-only file system
2018-10-15 08:53:51.918+0000: 48: error : virPCIDeviceConfigOpen:312 : Failed to open config space file '/sys/bus/pci/devices/0000:00:02.0/config': Read-only file system
2018-10-15 08:53:51.921+0000: 48: error : virPCIDeviceConfigOpen:312 : Failed to open config space file
'/sys/bus/pci/devices/0000:00:03.0/config': Read-only file system
2018-10-15 08:53:51.921+0000: 48: error : virPCIDeviceConfigOpen:312 : Failed to open config space file '/sys/bus/pci/devices/0000:00:04.0/config': Read-only file system
2018-10-15 08:53:51.922+0000: 48: error : virPCIDeviceConfigOpen:312 : Failed to open config space file '/sys/bus/pci/devices/0000:00:05.0/config': Read-only file system
2018-10-15 08:53:51.922+0000: 48: error : virPCIDeviceConfigOpen:312 : Failed to open config space file '/sys/bus/pci/devices/0000:00:06.0/config': Read-only file system
2018-10-15 08:53:51.923+0000: 48: error : virPCIDeviceConfigOpen:312 : Failed to open config space file '/sys/bus/pci/devices/0000:00:07.0/config': Read-only file system
2018-10-15 08:53:53.390+0000: 48: error : virCommandWait:2601 : internal error: Child process (/usr/sbin/dmidecode -q -t 0,1,2,4,17) unexpected exit status 1: /dev/mem: No such file or directory
2018-10-15 08:53:53.403+0000: 48: error : virNodeSuspendSupportsTarget:336 : internal error: Cannot probe for supported suspend types
2018-10-15 08:53:53.403+0000: 48: warning : virQEMUCapsInit:1220 : Failed to get host power management capabilities
level=info timestamp=2018-10-15T08:54:01.199112Z pos=libvirt.go:276 component=virt-launcher msg="Connected to libvirt daemon"
level=info timestamp=2018-10-15T08:54:01.207272Z pos=virt-launcher.go:143 component=virt-launcher msg="Watchdog file created at /var/run/kubevirt/watchdog-files/qwang-1_example-vm"
level=info timestamp=2018-10-15T08:54:01.207640Z pos=client.go:152 component=virt-launcher msg="Registered libvirt event notify callback"
level=info timestamp=2018-10-15T08:54:01.207709Z pos=virt-launcher.go:60 component=virt-launcher msg="Marked as ready"
level=info timestamp=2018-10-15T08:54:02.108527Z pos=converter.go:811 component=virt-launcher msg="Found nameservers in /etc/resolv.conf: \ufffd\u0010\u0000\u0016"
level=info timestamp=2018-10-15T08:54:02.108563Z pos=converter.go:812
component=virt-launcher msg="Found search domains in /etc/resolv.conf: qwang-1.svc.cluster.local svc.cluster.local cluster.local openstacklocal example.com"
level=info timestamp=2018-10-15T08:54:02.109320Z pos=dhcp.go:62 component=virt-launcher msg="Starting SingleClientDHCPServer"
level=error timestamp=2018-10-15T08:54:02.117189Z pos=common.go:126 component=virt-launcher msg="updated MAC for interface: eth0 - 0a:58:0a:a5:6c:43"
level=info timestamp=2018-10-15T08:54:02.126846Z pos=manager.go:158 component=virt-launcher namespace=qwang-1 name=example-vm kind= uid=cd14441c-d057-11e8-8387-fa163ee68a54 msg="Domain defined."
level=info timestamp=2018-10-15T08:54:02.127040Z pos=client.go:136 component=virt-launcher msg="Libvirt event 0 with reason 0 received"
2018-10-15 08:54:02.132+0000: 35: error : virDBusGetSystemBus:109 : internal error: Unable to get DBus system bus connection: Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory
2018-10-15 08:54:02.156+0000: 35: warning : qemuInterfaceOpenVhostNet:687 : Unable to open vhost-net.
Opened so far 0, requested 1
2018-10-15 08:54:02.449+0000: 35: error : virCgroupDetect:714 : At least one cgroup controller is required: No such device or address
level=info timestamp=2018-10-15T08:54:02.459035Z pos=client.go:119 component=virt-launcher msg="domain status: 3:11"
level=info timestamp=2018-10-15T08:54:02.459722Z pos=virt-launcher.go:214 component=virt-launcher msg="Detected domain with UUID 8a21e546-ddf7-56a7-bebf-4b652fc17428"
level=info timestamp=2018-10-15T08:54:02.461174Z pos=monitor.go:253 component=virt-launcher msg="Monitoring loop: rate 1s start timeout 5m0s"
level=info timestamp=2018-10-15T08:54:02.463004Z pos=client.go:145 component=virt-launcher msg="processed event"
level=info timestamp=2018-10-15T08:54:02.663789Z pos=client.go:136 component=virt-launcher msg="Libvirt event 4 with reason 0 received"
level=info timestamp=2018-10-15T08:54:02.668599Z pos=client.go:119 component=virt-launcher msg="domain status: 1:1"
level=info timestamp=2018-10-15T08:54:02.669951Z pos=client.go:145 component=virt-launcher msg="processed event"
level=info timestamp=2018-10-15T08:54:02.670005Z pos=client.go:136 component=virt-launcher msg="Libvirt event 2 with reason 0 received"
level=info timestamp=2018-10-15T08:54:02.672169Z pos=manager.go:189 component=virt-launcher namespace=qwang-1 name=example-vm kind= uid=cd14441c-d057-11e8-8387-fa163ee68a54 msg="Domain started."
level=info timestamp=2018-10-15T08:54:02.672837Z pos=server.go:74 component=virt-launcher namespace=qwang-1 name=example-vm kind= uid=cd14441c-d057-11e8-8387-fa163ee68a54 msg="Synced vmi"
level=info timestamp=2018-10-15T08:54:02.673175Z pos=client.go:119 component=virt-launcher msg="domain status: 1:1"
level=info timestamp=2018-10-15T08:54:02.674097Z pos=client.go:145 component=virt-launcher msg="processed event"
level=info timestamp=2018-10-15T08:54:02.687797Z pos=server.go:74 component=virt-launcher namespace=qwang-1 name=example-vm kind= uid=cd14441c-d057-11e8-8387-fa163ee68a54 msg="Synced vmi"
level=info timestamp=2018-10-15T08:54:03.463028Z pos=monitor.go:222 component=virt-launcher msg="Found PID for 8a21e546-ddf7-56a7-bebf-4b652fc17428: 122"

Expected results:
A readiness probe determines whether a container is ready to service requests; it should pass once the VM is running.

Additional info:
It's normal for the readiness probe to fail after the VM is first started. Does it resolve after a short period of time? Can you connect to the VM console, vnc, or ssh?
(In reply to Adam Litke from comment #1)
> It's normal for the readiness probe to fail after the VM is first started.

If it's normal, can we turn the message into something less scary? E.g. catch the "No such file or directory" error for a reasonable amount of time and transform it into a notice such as "Waiting for VM to start". Also consider failing hard after a certain deadline.
(In reply to Adam Litke from comment #1)
> It's normal for the readiness probe to fail after the VM is first started.

After restarting the VM, the readiness probe still fails.

> Does it resolve after a short period of time?

No, it doesn't resolve even after 1 hour.

> Can you connect to the VM console, vnc, or ssh?

Yes, I can connect to it via ssh or console.
Can you check whether your host filesystem is full or has errors? Run df -h and check dmesg.
@qixuan, can you still reproduce this issue?
Tested on openshift v3.11.16, CNV 1.3; it can still be reproduced. Attached:
http://pastebin.test.redhat.com/662181
http://pastebin.test.redhat.com/662184
Marcin, can you please take a look if this reproduces, and why this is happening?
David, Do you have any insight on this?
> level=info timestamp=2018-10-15T08:54:01.207709Z pos=virt-launcher.go:60 component=virt-launcher msg="Marked as ready"

From the logs attached to this issue, I see that the virt-launcher pod initialized and wrote the readiness file to disk. We'd only see that log message if the file was written successfully. So, the file is being written. Furthermore, looking at the logs I see that the domain started as well.

> level=info timestamp=2018-10-15T08:54:02.459722Z pos=virt-launcher.go:214 component=virt-launcher msg="Detected domain with UUID 8a21e546-ddf7-56a7-bebf-4b652fc17428"

Even the 'kubectl describe pod' output shows that the container's ready condition is set to true.

> Conditions:
>   Type              Status
>   Initialized       True
>   Ready             True
>   ContainersReady   True
>   PodScheduled      True

All of the data provided here contradicts that "Readiness probe failed" event. We see in the logs that the file is written, and we even see the ready condition being set on the container. The qemu process even starts within the pod.

I suggest attempting to reproduce this outside of KubeVirt by defining a simple pod with the same /tmp/healthy readiness check our virt-launcher pod uses and an entrypoint command arg that does something like this:

touch /tmp/healthy && while true; do sleep 1; done

This will rule KubeVirt out of the equation.
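The reproduction suggested above can be sketched as a standalone pod manifest. This is a minimal sketch, not an exact KubeVirt artifact: the pod name, container name, and busybox image are illustrative assumptions, while the probe parameters mirror the values from the describe output in this bug (delay=2s timeout=5s period=2s #success=1 #failure=5).

```yaml
# Minimal repro pod outside of KubeVirt: same /tmp/healthy readiness check
# the virt-launcher pod uses, with an entrypoint that touches the file.
# Pod name, container name, and image are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: readiness-repro
spec:
  containers:
  - name: probe-test
    image: busybox
    command: ["/bin/sh", "-c", "touch /tmp/healthy && while true; do sleep 1; done"]
    readinessProbe:
      exec:
        command: ["cat", "/tmp/healthy"]
      initialDelaySeconds: 2
      timeoutSeconds: 5
      periodSeconds: 2
      successThreshold: 1
      failureThreshold: 5
```

If this pod shows the same "Readiness probe failed" events, the problem is in the cluster rather than in KubeVirt.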
Qixuan, Could you follow the suggestion provided to help isolate the root cause?
Seems similar to the issue I've got, which was fixed by using docker88 (downgrading to docker82 is also supposed to do the trick).
Thanks Karim, was there an open bug for this? If so, could you please link it to this one?
Qixuan, any update from your side?
I still see this issue with:

Installed Packages
Name    : docker
Arch    : x86_64
Epoch   : 2
Version : 1.13.1
Release : 90.git07f3374.el7

Events:
  Type     Reason     Age              From                                            Message
  ----     ------     ----             ----                                            -------
  Normal   Scheduled  1m               default-scheduler                               Successfully assigned cdi/virt-launcher-vm-alpine-datavolume-wlxj5 to cnv-executor-nelly1-node2.example.com
  Normal   Pulled     1m               kubelet, cnv-executor-nelly1-node2.example.com  Container image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/cnv-tech-preview/virt-launcher:v1.4.0" already present on machine
  Normal   Created    1m               kubelet, cnv-executor-nelly1-node2.example.com  Created container
  Normal   Started    1m               kubelet, cnv-executor-nelly1-node2.example.com  Started container
  Warning  Unhealthy  1m (x4 over 1m)  kubelet, cnv-executor-nelly1-node2.example.com  Readiness probe failed: cat: /var/run/kubevirt-infra/healthy: No such file or directory
Nelly, your comment just reminded me of something--possibly an important clue. We recently moved the readiness check path! In changeset 211ba95720d778d95ba9c46e7d8064f6a106640c we moved from using "/tmp/healthy" to using "/var/run/kubevirt-infra/healthy" as the location for the readiness check.

Your log looks like something I would expect--because virt-launcher doesn't create the /var/run/kubevirt-infra/healthy file until it is ready. If you continue to see "readiness probe failed" events after 5 minutes or so, that would be a problem.

Now something important to note: the changeset in question was merged into the upstream 0.11 release--which was right around the time this bug was reported. It's entirely possible that the issue was transient, e.g. using older manifests while kubevirt's master already had the file location change, or using cached images while others were pulled new.

Qixuan, could you please attempt to reproduce this bug using the latest release, just to verify that this issue is really still occurring?
Well, it doesn't take 5m, more like 0.5h.
During the time span where you observe "Readiness Check Failed" events, are you still able to schedule VMIs?
Test with: CNV 1.4

Steps:
1. Create VM (with attached yaml)
2. Start VM

Results:
VM is up and running. I can see the readiness probe failing.

See outputs:
❯❯❯ oc get vm,vmi
NAME                                              AGE       RUNNING   VOLUME
virtualmachine.kubevirt.io/vm-cirros-datavolume   3m        true

NAME                                                      AGE       PHASE     IP            NODENAME
virtualmachineinstance.kubevirt.io/vm-cirros-datavolume   3m        Running   10.129.0.30   cnv-executor-ipinto-node2.example.com

❯❯❯ oc get pv,pvc
NAME                                 CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM               STORAGECLASS   REASON    AGE
persistentvolume/local-pv-1f7a0db    6521Mi     RWO            Delete           Available                       hdd                      1d
persistentvolume/local-pv-4cdfe782   6521Mi     RWO            Delete           Available                       hdd                      1d
persistentvolume/local-pv-4db2857d   6521Mi     RWO            Delete           Available                       hdd                      1d
persistentvolume/local-pv-7cac3158   6521Mi     RWO            Delete           Available                       hdd                      1d
persistentvolume/local-pv-97fe7162   6521Mi     RWO            Delete           Available                       hdd                      6m
persistentvolume/local-pv-b5f29194   6521Mi     RWO            Delete           Bound       default/cirros-dv   hdd                      1d
persistentvolume/local-pv-e345daaf   6521Mi     RWO            Delete           Available                       hdd                      15m
persistentvolume/local-pv-f443c1b    6521Mi     RWO            Delete           Available                       hdd                      1d
persistentvolume/local-pv-f474b691   6521Mi     RWO            Delete           Available                       hdd                      1d

NAME                              STATUS    VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/cirros-dv   Bound     local-pv-b5f29194   6521Mi     RWO            hdd            4m

❯❯❯ oc describe pod virt-launcher-vm-cirros-datavolume-7qbbt
Name:               virt-launcher-vm-cirros-datavolume-7qbbt
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               cnv-executor-ipinto-node2.example.com/172.16.0.19
Start Time:         Thu, 24 Jan 2019 17:02:13 +0200
Labels:             kubevirt.io=virt-launcher
                    kubevirt.io/created-by=0882f78a-1fe9-11e9-8e1b-fa163e32b4dd
                    kubevirt.io/vm=vm-cirros-datavolume
Annotations:        k8s.v1.cni.cncf.io/networks-status=[{ "name": "openshift.1", "ips": [ "10.129.0.30" ], "default": true, "dns": {} }]
                    kubevirt.io/domain=vm-cirros-datavolume
                    openshift.io/scc=privileged
Status:             Running
IP:                 10.129.0.30
Controlled By:      VirtualMachineInstance/vm-cirros-datavolume
Containers:
  compute:
    Container ID:
cri-o://cf2dc74440607a16dafc5c4c488e57e7bd3fe4aae9994f5a643a3a5378beaeaf
    Image:         brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/cnv-tech-preview/virt-launcher:v1.4.0
    Image ID:      brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/cnv-tech-preview/virt-launcher@sha256:9eed9e635e20505f8122cd0bef710f901f1433b1dd26c6a7bb13e19246b70652
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/bin/virt-launcher
      --qemu-timeout 5m
      --name vm-cirros-datavolume
      --uid 0882f78a-1fe9-11e9-8e1b-fa163e32b4dd
      --namespace default
      --kubevirt-share-dir /var/run/kubevirt
      --ephemeral-disk-dir /var/run/kubevirt-ephemeral-disks
      --readiness-file /var/run/kubevirt-infra/healthy
      --grace-period-seconds 15
      --hook-sidecars 0
      --less-pvc-space-toleration 10
    State:          Running
      Started:      Thu, 24 Jan 2019 17:02:21 +0200
    Ready:          True
    Restart Count:  0
    Limits:
      devices.kubevirt.io/kvm:        1
      devices.kubevirt.io/tun:        1
      devices.kubevirt.io/vhost-net:  1
    Requests:
      devices.kubevirt.io/kvm:        1
      devices.kubevirt.io/tun:        1
      devices.kubevirt.io/vhost-net:  1
      memory:                         161679432
    Readiness:    exec [cat /var/run/kubevirt-infra/healthy] delay=2s timeout=5s period=2s #success=1 #failure=5
    Environment:  <none>
    Mounts:
      /var/run/kubevirt from virt-share-dir (rw)
      /var/run/kubevirt-ephemeral-disks from ephemeral-disks (rw)
      /var/run/kubevirt-infra from infra-ready-mount (rw)
      /var/run/kubevirt-private/vmi-disks/datavolumedisk1 from datavolumedisk1 (rw)
      /var/run/libvirt from libvirt-runtime (rw)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  datavolumedisk1:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  cirros-dv
    ReadOnly:   false
  infra-ready-mount:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  virt-share-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/kubevirt
    HostPathType:
  libvirt-runtime:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  ephemeral-disks:
    Type:    EmptyDir (a
temporary directory that shares a pod's lifetime)
    Medium:
QoS Class:       Burstable
Node-Selectors:  kubevirt.io/schedulable=true
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
Events:
  Type     Reason     Age              From                                            Message
  ----     ------     ----             ----                                            -------
  Normal   Scheduled  4m               default-scheduler                               Successfully assigned default/virt-launcher-vm-cirros-datavolume-7qbbt to cnv-executor-ipinto-node2.example.com
  Normal   Pulled     4m               kubelet, cnv-executor-ipinto-node2.example.com  Container image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/cnv-tech-preview/virt-launcher:v1.4.0" already present on machine
  Normal   Created    4m               kubelet, cnv-executor-ipinto-node2.example.com  Created container
  Normal   Started    4m               kubelet, cnv-executor-ipinto-node2.example.com  Started container
  Warning  Unhealthy  4m (x4 over 4m)  kubelet, cnv-executor-ipinto-node2.example.com  Readiness probe failed: cat: /var/run/kubevirt-infra/healthy: No such file or directory
Created attachment 1523182 [details]
VM with data volume
For the previous log:

Events:
  Type     Reason     Age              From                                            Message
  ----     ------     ----             ----                                            -------
  Normal   Scheduled  4m               default-scheduler                               Successfully assigned default/virt-launcher-vm-cirros-datavolume-7qbbt to cnv-executor-ipinto-node2.example.com
  Normal   Pulled     4m               kubelet, cnv-executor-ipinto-node2.example.com  Container image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/cnv-tech-preview/virt-launcher:v1.4.0" already present on machine
  Normal   Created    4m               kubelet, cnv-executor-ipinto-node2.example.com  Created container
  Normal   Started    4m               kubelet, cnv-executor-ipinto-node2.example.com  Started container
  Warning  Unhealthy  4m (x4 over 4m)  kubelet, cnv-executor-ipinto-node2.example.com  Readiness probe failed: cat: /var/run/kubevirt-infra/healthy: No such file or directory

From code:
initialDelaySeconds := 2
timeoutSeconds := 5
periodSeconds := 2
successThreshold := 1
failureThreshold := 5

It could actually be as follows: 4 minutes ago, the VMI got scheduled and the compute container started. After 2 seconds the initial probe ran and failed. Then, by 8 seconds in, the next 3 probes had run, all failing. In general this means it took about 10 seconds from container start to VM start, which could be ok.

The remaining question, Israel: did the "readiness probe failed" event count continue to increase over time, i.e. did it go from "x4" to "x5" and beyond? If it did NOT increase, then it looks like a sane story.

One mitigation could be to increase initialDelaySeconds to a higher value to delay the first (and thus subsequent) probes.
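The timeline above can be checked with a small calculation. This is a sketch that ignores kubelet scheduling jitter; the `failed_probes` helper is illustrative, not KubeVirt code.

```python
# Approximate exec-probe schedule for the virt-launcher readiness probe
# (initialDelaySeconds=2, periodSeconds=2), ignoring kubelet jitter.
# A probe fails while the readiness file does not yet exist.

def failed_probes(initial_delay, period, file_appears_at):
    """Count probes that fire before the readiness file is written."""
    t, failures = initial_delay, 0
    while t < file_appears_at:
        failures += 1
        t += period
    return failures

# If virt-launcher writes the file ~10s after container start, the probes
# at t=2, 4, 6, and 8 fail and the t=10 probe succeeds: 4 failures,
# matching the "(x4 over ...)" event count.
print(failed_probes(2, 2, 10))
```

With failureThreshold=5, four failures never trip the threshold, so the pod goes Ready without ever being marked failed.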
Good point Fabian, I saw that when trying to replicate this and was momentarily confused as well.

There's still an outstanding question that was never answered: build a trivial container that simply creates the readiness probe file and sleeps. This will help isolate whether or not the KubeVirt code itself is introducing the readiness probe failed condition.
Does this bug still apply? I tried my best, but I cannot replicate the issue on a local environment; I tried both K8s and OpenShift. The readiness probe always comes up when the VM starts properly.
My result is the same as comment 20. The event count did not increase, always "x4", and in events I saw:

16m 16m 4 virt-launcher-vmi-ephemeral-89q69.1584fcf471b9d206 Pod spec.containers{compute} Warning Unhealthy kubelet, cnv-executor-zpeng29d-node2.example.com Readiness probe failed: cat: /var/run/kubevirt-infra/healthy: No such file or directory

The count is 4. I logged in to the container and can't see that file:

$ oc rsh virt-launcher-vmi-ephemeral-89q69
sh-4.4# cat /var/run/kubevirt-infra/healthy
cat: /var/run/kubevirt-infra/healthy: No such file or directory

but it does have /tmp/healthy:

sh-4.4# ls /tmp/
healthy

Is this a d/s problem or did I miss something?
1. Good that it is not increasing; then it's the normal behavior (the pod is unhealthy until the health flag is there, which appears after N [in this case 4] probes).
2. Unsure why the path is different.

For me this bug is good to close, as the number does not continue to increase.
(In reply to Fabian Deutsch from comment #27) > 2. Unsure why the path is different. Can we have this investigated before closing? (It could just be the version we're testing, etc. but I'd rather know what's going on)
I will double-check that. I am convinced there is no bug, just some weird configuration.
Israel or Zhe, can you please verify on the new release whether this bug still applies? I am not able to replicate it on a development setup. It might be a misconfiguration.
Ok, I have deleted the whole dev cluster and started from scratch. I can see a similar issue. It seems that the kubevirt-infra volume is not mounted properly. Will investigate further.
And I found the issue. I was typing faster than the cluster was booting.

Zhe, please verify the folder is really missing. When you rsh into the pod, specify -c compute to select the right container inside the pod. The same thing happened to me just now; it is easy to forget that there are two containers inside the virt-launcher pod. Waiting for confirmation.

Besides that, I verified the same: at the beginning there are 4 failed probe calls, but then the VM starts and the probe works. The pod becomes healthy.
(In reply to Petr Kotas from comment #32)
> Besides that, I verified the same. At the begging there are 4x calls to bad
> probe, but than it start and works. Pod becomes healthy.

I think there was also a concern WRT the path of the file, which should have been /var/run/kubevirt-infra/healthy?
Yes, that is right. The path is there. Log in to the compute container and it is there.
Was David's suggestion ever tried? He suggested this:

> All of the data provided here contradicts that "Readiness probe failed" event. We see in the logs that the file is written, and we even see the ready condition being set on the container. The qemu process even starts within the pod.
> I suggest attempting to reproduce this outside of KubeVirt by defining a simple pod with the same /tmp/healthy readiness check our virt-launcher pod uses and an entrypoint command arg that does something like this: touch /tmp/healthy && while true; do sleep 1; done
> This will rule KubeVirt out of the equation.

And looking at the data I agree with David that the probes seem to actually succeed.
I tried to reproduce the readiness probe outside of KubeVirt. It works. However, I do not have the exact same setup as QE has, nor do I have access to it. Moreover, I also verified that the readiness probe comes up as expected in KubeVirt. Waiting for a reply from QE.
Israel, can you please verify that the behaviour is indeed correct? Then this bug can be closed.
I verified with the latest build:

$ oc rsh -c compute virt-launcher-vmi-ephemeral-2jtzh
sh-4.2# ls /var/run/kubevirt-infra/healthy
/var/run/kubevirt-infra/healthy
Zhe, does the error message still appear?

Warning Unhealthy 9m (x5 over 9m) kubelet, cnv-executor-vatsal-node1.example.com Readiness probe failed: cat: /tmp/healthy: No such file or directory
With the information provided by Zhe, I can confirm that the issue has been resolved with the new release. Most likely, old images were cached and caused mixed errors: the new release expected the probe at /var/run/kubevirt-infra/healthy, but the old image provided the file at /tmp/healthy.

From the comments and from my verification, the issue has been resolved. Can we close the bug?
I am closing this bug as it is resolved by the current release.

Summary:
The issue was a version mismatch. Older KubeVirt had the readiness probe file located at /tmp/healthy; newer KubeVirt has it at /var/run/kubevirt-infra/healthy. The mismatch most likely happened due to container caching when updating to the new release: controllers were still probing the old /tmp/healthy, but launcher pods already provided /var/run/kubevirt-infra/healthy.

Regarding the readiness probe failing 4 times at first: this is expected behaviour. The readiness probe only comes up after the virtual machine properly starts. The provided logs confirm that the virtual machine came up later on and the number of readiness failures did not increase.

Taking into account the latest comments and my findings, the issue has been resolved by properly upgrading to the new release.
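For reference, the probe configuration on the newer launcher pods can be sketched as follows. This is a reconstruction from the `oc describe pod` output earlier in this bug (Readiness: exec [cat /var/run/kubevirt-infra/healthy] delay=2s timeout=5s period=2s #success=1 #failure=5), not the exact generated manifest; on a mismatched (cached) image the file the command looks for simply never appears, producing the events reported here.

```yaml
# Readiness probe as configured on the newer virt-launcher pods,
# reconstructed from the describe output in this bug report.
readinessProbe:
  exec:
    command: ["cat", "/var/run/kubevirt-infra/healthy"]
  initialDelaySeconds: 2
  timeoutSeconds: 5
  periodSeconds: 2
  successThreshold: 1
  failureThreshold: 5
```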
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days