Bug 1639204 - "Readiness probe failed: cat: /tmp/healthy: No such file or directory" when using a datavolume to create a VM
Summary: "Readiness probe failed: cat: /tmp/healthy: No such file or directory" when using a datavolume to create a VM
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 1.2
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 2.0
Assignee: Petr Kotas
QA Contact: zhe peng
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-10-15 09:44 UTC by Qixuan Wang
Modified: 2023-09-14 04:40 UTC
CC: 16 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-03-07 07:30:45 UTC
Target Upstream Version:
Embargoed:


Attachments
VM with data volume (997 bytes, text/plain)
2019-01-24 15:12 UTC, Israel Pinto

Description Qixuan Wang 2018-10-15 09:44:53 UTC
Description of problem:
Create a VM following https://github.com/kubevirt/containerized-data-importer/blob/master/doc/datavolumes.md#example-vm-using-dv and start it. The virt-launcher pod is running, but the readiness probe fails with "Readiness probe failed: cat: /tmp/healthy: No such file or directory".


Version-Release number of selected component (if applicable):
openshift v3.11.12
CNV 1.2

How reproducible:
Always

Steps to Reproduce:
1. Create a VM using https://raw.githubusercontent.com/kubevirt/containerized-data-importer/master/manifests/example/example-vm-dv.yaml (a rough sketch of that kind of manifest follows the steps below).

2. Start the VM after the import process has finished.

3. Check the virt-launcher pod status.
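For reference, this is a rough sketch of the kind of manifest that example uses (a VirtualMachine with a dataVolumeTemplates entry). It is reconstructed from memory and from the names visible in the pod description below (PVC "example-dv", volume "example-datavolume"); exact field names and API versions may differ from the real example-vm-dv.yaml:

apiVersion: kubevirt.io/v1alpha2        # CNV 1.2-era API group; later releases moved on
kind: VirtualMachine
metadata:
  name: example-vm
spec:
  running: false
  dataVolumeTemplates:
  - metadata:
      name: example-dv                  # becomes the PVC claim name seen in "oc describe pod"
    spec:
      pvc:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
      source:
        http:
          url: http://example.com/disk.img    # placeholder; the real example imports an OS image over HTTP
  template:
    spec:
      domain:
        devices:
          disks:
          - name: example-datavolume
            disk:
              bus: virtio
        resources:
          requests:
            memory: 64M
      volumes:
      - name: example-datavolume        # mounted by virt-launcher under /var/run/kubevirt-private/vmi-disks/
        dataVolume:
          name: example-dv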


Actual results:
3. [root@cnv-executor-vatsal-master1 ~]# oc get pod
NAME                             READY     STATUS      RESTARTS   AGE
importer-example-dv-ggmbb        0/1       Completed   0          14m
virt-launcher-example-vm-jsscr   1/1       Running     0          9m

[root@cnv-executor-vatsal-master1 ~]# oc describe pod virt-launcher-example-vm-jsscr
Name:               virt-launcher-example-vm-jsscr
Namespace:          qwang-1
Priority:           0
PriorityClassName:  <none>
Node:               cnv-executor-vatsal-node1.example.com/172.16.0.22
Start Time:         Mon, 15 Oct 2018 04:53:35 -0400
Labels:             kubevirt.io=virt-launcher
                    kubevirt.io/created-by=cd14441c-d057-11e8-8387-fa163ee68a54
                    kubevirt.io/vm=example-vm
Annotations:        kubevirt.io/domain=example-vm
                    kubevirt.io/owned-by=virt-handler
                    openshift.io/scc=privileged
Status:             Running
IP:                 10.129.0.14
Containers:
  compute:
    Container ID:  cri-o://8b2491bfa68177fe0fe6bda5a0c6a73b51ee087b487b142ce28a8511b0dfcd02
    Image:         brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/cnv12-tech-preview/virt-launcher:1.2
    Image ID:      brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/cnv12-tech-preview/virt-launcher@sha256:0cab62d55ef835bc71477460eca2ed3846ae5c56f56501211891dd39981df041
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/share/kubevirt/virt-launcher/entrypoint.sh
      --qemu-timeout
      5m
      --name
      example-vm
      --uid
      cd14441c-d057-11e8-8387-fa163ee68a54
      --namespace
      qwang-1
      --kubevirt-share-dir
      /var/run/kubevirt
      --readiness-file
      /tmp/healthy
      --grace-period-seconds
      15
      --hook-sidecars
      0
    State:          Running
      Started:      Mon, 15 Oct 2018 04:53:50 -0400
    Ready:          True
    Restart Count:  0
    Limits:
      devices.kubevirt.io/kvm:  1
      devices.kubevirt.io/tun:  1
    Requests:
      devices.kubevirt.io/kvm:  1
      devices.kubevirt.io/tun:  1
      memory:                   160932Ki
    Readiness:                  exec [cat /tmp/healthy] delay=2s timeout=5s period=2s #success=1 #failure=5
    Environment:                <none>
    Mounts:
      /var/run/kubevirt from virt-share-dir (rw)
      /var/run/kubevirt-private/vmi-disks/example-datavolume from example-datavolume (rw)
      /var/run/libvirt from libvirt-runtime (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-9ctmt (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  example-datavolume:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  example-dv
    ReadOnly:   false
  virt-share-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/kubevirt
    HostPathType:
  libvirt-runtime:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  default-token-9ctmt:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-9ctmt
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubevirt.io/schedulable=true
                 node-role.kubernetes.io/compute=true
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
Events:
  Type     Reason     Age              From                                            Message
  ----     ------     ----             ----                                            -------
  Normal   Scheduled  9m               default-scheduler                               Successfully assigned qwang-1/virt-launcher-example-vm-jsscr to cnv-executor-vatsal-node1.example.com
  Normal   Pulling    9m               kubelet, cnv-executor-vatsal-node1.example.com  pulling image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/cnv12-tech-preview/virt-launcher:1.2"
  Normal   Pulled     9m               kubelet, cnv-executor-vatsal-node1.example.com  Successfully pulled image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/cnv12-tech-preview/virt-launcher:1.2"
  Normal   Created    9m               kubelet, cnv-executor-vatsal-node1.example.com  Created container
  Normal   Started    9m               kubelet, cnv-executor-vatsal-node1.example.com  Started container
  Warning  Unhealthy  9m (x5 over 9m)  kubelet, cnv-executor-vatsal-node1.example.com  Readiness probe failed: cat: /tmp/healthy: No such file or directory

[root@cnv-executor-vatsal-master1 ~]# oc rsh virt-launcher-example-vm-jsscr
sh-4.2# ls /tmp/healthy
/tmp/healthy


[root@cnv-executor-vatsal-master1 ~]# oc logs virt-launcher-example-vm-jsscr
level=info timestamp=2018-10-15T08:53:51.196801Z pos=manager.go:69 component=virt-launcher msg="Collected all requested hook sidecar sockets"
level=info timestamp=2018-10-15T08:53:51.196877Z pos=manager.go:72 component=virt-launcher msg="Sorted all collected sidecar sockets per hook point based on their priority and name: map[]"
level=info timestamp=2018-10-15T08:53:51.197575Z pos=libvirt.go:261 component=virt-launcher msg="Connecting to libvirt daemon: qemu:///system"
+ mkdir -p /var/log/kubevirt
+ touch /var/log/kubevirt/qemu-kube.log
+ chown qemu:qemu /var/log/kubevirt/qemu-kube.log
+ [[ -z '' ]]
++ ip -o -4 a
++ tr -s ' '
++ cut '-d ' -f 2
++ grep -v -e '^lo[0-9:]*$'
++ head -1
+ LIBVIRTD_DEFAULT_NETWORK_DEVICE=eth0
+ echo 'Selected "eth0" as primary interface'
+ [[ -n eth0 ]]
+ echo 'Setting libvirt default network to "eth0"'
+ mkdir -p /etc/libvirt/qemu/networks/autostart
+ cat
+ ln -s -f /etc/libvirt/qemu/networks/default.xml /etc/libvirt/qemu/networks/autostart/default.xml
+ echo 'cgroup_controllers = [ ]'
+ '[' -d /dev/hugepages ']'
+ echo 'log_outputs = "1:stderr"'
+ /usr/sbin/libvirtd
2018-10-15 08:53:51.310+0000: 48: info : libvirt version: 3.9.0, package: 14.el7_5.8 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2018-09-04-07:27:24, x86-019.build.eng.bos.redhat.com)
2018-10-15 08:53:51.310+0000: 48: info : hostname: example-vm
2018-10-15 08:53:51.310+0000: 48: error : virDBusGetSystemBus:109 : internal error: Unable to get DBus system bus connection: Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory
2018-10-15 08:53:51.895+0000: 48: error : virDBusGetSystemBus:109 : internal error: Unable to get DBus system bus connection: Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory
2018-10-15 08:53:51.895+0000: 48: warning : networkStateInitialize:761 : DBus not available, disabling firewalld support in bridge_network_driver: internal error: Unable to get DBus system bus connection: Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory
2018-10-15 08:53:51.913+0000: 48: error : virPCIDeviceConfigOpen:312 : Failed to open config space file '/sys/bus/pci/devices/0000:00:00.0/config': Read-only file system
2018-10-15 08:53:51.913+0000: 48: error : virPCIDeviceConfigOpen:312 : Failed to open config space file '/sys/bus/pci/devices/0000:00:01.0/config': Read-only file system
2018-10-15 08:53:51.913+0000: 48: error : virPCIDeviceConfigOpen:312 : Failed to open config space file '/sys/bus/pci/devices/0000:00:01.1/config': Read-only file system
2018-10-15 08:53:51.914+0000: 48: error : virPCIDeviceConfigOpen:312 : Failed to open config space file '/sys/bus/pci/devices/0000:00:01.2/config': Read-only file system
2018-10-15 08:53:51.915+0000: 48: error : virPCIDeviceConfigOpen:312 : Failed to open config space file '/sys/bus/pci/devices/0000:00:01.3/config': Read-only file system
2018-10-15 08:53:51.918+0000: 48: error : virPCIDeviceConfigOpen:312 : Failed to open config space file '/sys/bus/pci/devices/0000:00:02.0/config': Read-only file system
2018-10-15 08:53:51.921+0000: 48: error : virPCIDeviceConfigOpen:312 : Failed to open config space file '/sys/bus/pci/devices/0000:00:03.0/config': Read-only file system
2018-10-15 08:53:51.921+0000: 48: error : virPCIDeviceConfigOpen:312 : Failed to open config space file '/sys/bus/pci/devices/0000:00:04.0/config': Read-only file system
2018-10-15 08:53:51.922+0000: 48: error : virPCIDeviceConfigOpen:312 : Failed to open config space file '/sys/bus/pci/devices/0000:00:05.0/config': Read-only file system
2018-10-15 08:53:51.922+0000: 48: error : virPCIDeviceConfigOpen:312 : Failed to open config space file '/sys/bus/pci/devices/0000:00:06.0/config': Read-only file system
2018-10-15 08:53:51.923+0000: 48: error : virPCIDeviceConfigOpen:312 : Failed to open config space file '/sys/bus/pci/devices/0000:00:07.0/config': Read-only file system
2018-10-15 08:53:53.390+0000: 48: error : virCommandWait:2601 : internal error: Child process (/usr/sbin/dmidecode -q -t 0,1,2,4,17) unexpected exit status 1: /dev/mem: No such file or directory

2018-10-15 08:53:53.403+0000: 48: error : virNodeSuspendSupportsTarget:336 : internal error: Cannot probe for supported suspend types
2018-10-15 08:53:53.403+0000: 48: warning : virQEMUCapsInit:1220 : Failed to get host power management capabilities
level=info timestamp=2018-10-15T08:54:01.199112Z pos=libvirt.go:276 component=virt-launcher msg="Connected to libvirt daemon"
level=info timestamp=2018-10-15T08:54:01.207272Z pos=virt-launcher.go:143 component=virt-launcher msg="Watchdog file created at /var/run/kubevirt/watchdog-files/qwang-1_example-vm"
level=info timestamp=2018-10-15T08:54:01.207640Z pos=client.go:152 component=virt-launcher msg="Registered libvirt event notify callback"
level=info timestamp=2018-10-15T08:54:01.207709Z pos=virt-launcher.go:60 component=virt-launcher msg="Marked as ready"
level=info timestamp=2018-10-15T08:54:02.108527Z pos=converter.go:811 component=virt-launcher msg="Found nameservers in /etc/resolv.conf: \ufffd\u0010\u0000\u0016"
level=info timestamp=2018-10-15T08:54:02.108563Z pos=converter.go:812 component=virt-launcher msg="Found search domains in /etc/resolv.conf: qwang-1.svc.cluster.local svc.cluster.local cluster.local openstacklocal example.com"
level=info timestamp=2018-10-15T08:54:02.109320Z pos=dhcp.go:62 component=virt-launcher msg="Starting SingleClientDHCPServer"
level=error timestamp=2018-10-15T08:54:02.117189Z pos=common.go:126 component=virt-launcher msg="updated MAC for interface: eth0 - 0a:58:0a:a5:6c:43"
level=info timestamp=2018-10-15T08:54:02.126846Z pos=manager.go:158 component=virt-launcher namespace=qwang-1 name=example-vm kind= uid=cd14441c-d057-11e8-8387-fa163ee68a54 msg="Domain defined."
level=info timestamp=2018-10-15T08:54:02.127040Z pos=client.go:136 component=virt-launcher msg="Libvirt event 0 with reason 0 received"
2018-10-15 08:54:02.132+0000: 35: error : virDBusGetSystemBus:109 : internal error: Unable to get DBus system bus connection: Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory
2018-10-15 08:54:02.156+0000: 35: warning : qemuInterfaceOpenVhostNet:687 : Unable to open vhost-net. Opened so far 0, requested 1
2018-10-15 08:54:02.449+0000: 35: error : virCgroupDetect:714 : At least one cgroup controller is required: No such device or address
level=info timestamp=2018-10-15T08:54:02.459035Z pos=client.go:119 component=virt-launcher msg="domain status: 3:11"
level=info timestamp=2018-10-15T08:54:02.459722Z pos=virt-launcher.go:214 component=virt-launcher msg="Detected domain with UUID 8a21e546-ddf7-56a7-bebf-4b652fc17428"
level=info timestamp=2018-10-15T08:54:02.461174Z pos=monitor.go:253 component=virt-launcher msg="Monitoring loop: rate 1s start timeout 5m0s"
level=info timestamp=2018-10-15T08:54:02.463004Z pos=client.go:145 component=virt-launcher msg="processed event"
level=info timestamp=2018-10-15T08:54:02.663789Z pos=client.go:136 component=virt-launcher msg="Libvirt event 4 with reason 0 received"
level=info timestamp=2018-10-15T08:54:02.668599Z pos=client.go:119 component=virt-launcher msg="domain status: 1:1"
level=info timestamp=2018-10-15T08:54:02.669951Z pos=client.go:145 component=virt-launcher msg="processed event"
level=info timestamp=2018-10-15T08:54:02.670005Z pos=client.go:136 component=virt-launcher msg="Libvirt event 2 with reason 0 received"
level=info timestamp=2018-10-15T08:54:02.672169Z pos=manager.go:189 component=virt-launcher namespace=qwang-1 name=example-vm kind= uid=cd14441c-d057-11e8-8387-fa163ee68a54 msg="Domain started."
level=info timestamp=2018-10-15T08:54:02.672837Z pos=server.go:74 component=virt-launcher namespace=qwang-1 name=example-vm kind= uid=cd14441c-d057-11e8-8387-fa163ee68a54 msg="Synced vmi"
level=info timestamp=2018-10-15T08:54:02.673175Z pos=client.go:119 component=virt-launcher msg="domain status: 1:1"
level=info timestamp=2018-10-15T08:54:02.674097Z pos=client.go:145 component=virt-launcher msg="processed event"
level=info timestamp=2018-10-15T08:54:02.687797Z pos=server.go:74 component=virt-launcher namespace=qwang-1 name=example-vm kind= uid=cd14441c-d057-11e8-8387-fa163ee68a54 msg="Synced vmi"
level=info timestamp=2018-10-15T08:54:03.463028Z pos=monitor.go:222 component=virt-launcher msg="Found PID for 8a21e546-ddf7-56a7-bebf-4b652fc17428: 122"


Expected results:
A readiness probe determines whether a container is ready to service requests; once the VM has started, the probe should pass.


Additional info:

Comment 1 Adam Litke 2018-10-15 16:39:39 UTC
It's normal for the readiness probe to fail after the VM is first started.  Does it resolve after a short period of time?  Can you connect to the VM console, vnc, or ssh?

Comment 2 Federico Simoncelli 2018-10-15 20:38:23 UTC
(In reply to Adam Litke from comment #1)
> It's normal for the readiness probe to fail after the VM is first started. 

If it's normal, can we turn the message into something less scary? E.g. catch the "No such file or directory" error for a reasonable amount of time and transform it into a notice such as "Waiting for VM to start".

Also consider failing hard after a certain deadline.

Comment 3 Qixuan Wang 2018-10-16 07:16:27 UTC
(In reply to Adam Litke from comment #1)
> It's normal for the readiness probe to fail after the VM is first started. 
After restarting the VM, the readiness probe still fails.

> Does it resolve after a short period of time?  
No, it doesn't resolve after 1 hour.


> Can you connect to the VM console, vnc, or ssh?
Yes, I can connect to it via ssh or console.

Comment 4 Adam Litke 2018-10-16 11:30:22 UTC
Can you check if your host filesystem is full or has errors?  

df -h 

and check dmesg.

Comment 5 Nelly Credi 2018-10-23 12:19:21 UTC
@qixuan, can you still reproduce this issue?

Comment 6 Qixuan Wang 2018-10-25 09:13:04 UTC
Tested on OpenShift v3.11.16 with CNV 1.3; it can still be reproduced.

Attached: 
http://pastebin.test.redhat.com/662181
http://pastebin.test.redhat.com/662184

Comment 9 Fabian Deutsch 2018-11-27 13:29:57 UTC
Marcin, can you please take a look at whether this reproduces, and why it is happening?

Comment 10 sgott 2018-12-06 20:19:44 UTC
David,

Do you have any insight on this?

Comment 11 David Vossel 2018-12-06 21:21:35 UTC
> level=info timestamp=2018-10-15T08:54:01.207709Z pos=virt-launcher.go:60 component=virt-launcher msg="Marked as ready"

From the logs attached to this issue, I see that the virt-launcher pod initialized and wrote the readiness file to disk. We'd only see that log message if the file was written successfully.

So, the file is being written. 

Furthermore, looking at the logs, I see that the domain started as well.

> level=info timestamp=2018-10-15T08:54:02.459722Z pos=virt-launcher.go:214 component=virt-launcher msg="Detected domain with UUID 8a21e546-ddf7-56a7-bebf-4b652fc17428"


Even the 'kubectl describe pod' output shows that the container's ready condition is set to true. 

> Conditions:
>  Type              Status
>  Initialized       True
>  Ready             True
>  ContainersReady   True
>  PodScheduled      True


All of the data provided here contradicts that "Readiness probe failed" event. We see in the logs that the file is written, and we even see the ready condition being set on the container. The qemu process even starts within the pod.

I suggest attempting to reproduce this outside of KubeVirt by defining a simple pod with the same /tmp/healthy readiness check our virt-launcher pod uses and an entrypoint command that does something like: touch /tmp/healthy && while true; do sleep 1; done (a sketch of such a pod follows below).

This will rule KubeVirt out of the equation.
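A minimal sketch of such a reproducer pod, assuming any small image that provides a shell (the image, pod name, and values are illustrative, not taken from this bug; the probe parameters mirror the ones shown in the describe output above):

apiVersion: v1
kind: Pod
metadata:
  name: readiness-file-repro            # hypothetical name
spec:
  containers:
  - name: repro
    image: busybox                      # any image providing sh and cat
    command: ["/bin/sh", "-c", "touch /tmp/healthy && while true; do sleep 1; done"]
    readinessProbe:
      exec:
        command: ["cat", "/tmp/healthy"]
      initialDelaySeconds: 2
      timeoutSeconds: 5
      periodSeconds: 2
      successThreshold: 1
      failureThreshold: 5

If this pod reaches READY 1/1 within a few seconds and stays there, the kubelet's exec-probe mechanics on the node are fine and any persistent failure is specific to the virt-launcher pod; if it never becomes ready, the node or container runtime is suspect.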

Comment 12 sgott 2018-12-07 17:02:46 UTC
Qixuan,

Could you follow the suggestion provided to help isolate the root cause?

Comment 13 Karim Boumedhel 2018-12-07 21:13:12 UTC
Seems similar to the issue I've got, which was fixed by using docker 88 (a downgrade to docker 82 is also supposed to do the trick).

Comment 14 Nelly Credi 2018-12-09 16:07:08 UTC
Thanks Karim, was there an open bug for this?
If so, could you please link it to this one?

Comment 15 Marcin Franczyk 2019-01-17 14:32:40 UTC
Qixuan, any update from your side?

Comment 16 Nelly Credi 2019-01-22 21:07:33 UTC
I still see this issue with
Installed Packages
Name        : docker
Arch        : x86_64
Epoch       : 2
Version     : 1.13.1
Release     : 90.git07f3374.el7

Events:
  Type     Reason     Age              From                                            Message
  ----     ------     ----             ----                                            -------
  Normal   Scheduled  1m               default-scheduler                               Successfully assigned cdi/virt-launcher-vm-alpine-datavolume-wlxj5 to cnv-executor-nelly1-node2.example.com
  Normal   Pulled     1m               kubelet, cnv-executor-nelly1-node2.example.com  Container image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/cnv-tech-preview/virt-launcher:v1.4.0" already present on machine
  Normal   Created    1m               kubelet, cnv-executor-nelly1-node2.example.com  Created container
  Normal   Started    1m               kubelet, cnv-executor-nelly1-node2.example.com  Started container
  Warning  Unhealthy  1m (x4 over 1m)  kubelet, cnv-executor-nelly1-node2.example.com  Readiness probe failed: cat: /var/run/kubevirt-infra/healthy: No such file or directory

Comment 17 sgott 2019-01-22 22:55:40 UTC
Nelly, your comment just reminded me of something--possibly an important clue.

We recently moved the readiness check path! In changeset

211ba95720d778d95ba9c46e7d8064f6a106640c

we moved from using "/tmp/healthy" to using "/var/run/kubevirt-infra/healthy" as the location for the readiness check.

Your log looks like something I would expect--because virt-launcher doesn't create the /var/run/kubevirt-infra/healthy file until it is ready. If you continue to see "readiness probe failed" events after 5 minutes or so, that would be a problem.

Now something important to note: the changeset in question was merged into the upstream 0.11 release, which was right around the time this bug was reported. It's entirely possible that the issue was transient, e.g. using older manifests while kubevirt's master already had the file location change, or using cached images while others were pulled fresh.


Qixuan, could you please attempt to reproduce this bug using the latest release, just to verify that this issue is really still occurring?

Comment 18 Nelly Credi 2019-01-23 09:37:42 UTC
Well, it doesn't take 5m; it's more like half an hour.

Comment 19 sgott 2019-01-23 16:15:17 UTC
During the time span where you observe "Readiness Check Failed" events, are you still able to schedule VMIs?

Comment 20 Israel Pinto 2019-01-24 15:11:29 UTC
Test with: CNV 1.4  
Steps:
Create VM (with attached yaml)
Start VM

Results:
The VM is up and running, but I can see the readiness probe failing.

See outputs: 
 ❯❯❯ oc get vm,vmi                                                                                                                                                        
NAME                                              AGE       RUNNING   VOLUME
virtualmachine.kubevirt.io/vm-cirros-datavolume   3m        true      

NAME                                                      AGE       PHASE     IP            NODENAME
virtualmachineinstance.kubevirt.io/vm-cirros-datavolume   3m        Running   10.129.0.30   cnv-executor-ipinto-node2.example.com
 ❯❯❯ oc get pv,pvc                                                                                                                                                        
NAME                                 CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM               STORAGECLASS   REASON    AGE
persistentvolume/local-pv-1f7a0db    6521Mi     RWO            Delete           Available                       hdd                      1d
persistentvolume/local-pv-4cdfe782   6521Mi     RWO            Delete           Available                       hdd                      1d
persistentvolume/local-pv-4db2857d   6521Mi     RWO            Delete           Available                       hdd                      1d
persistentvolume/local-pv-7cac3158   6521Mi     RWO            Delete           Available                       hdd                      1d
persistentvolume/local-pv-97fe7162   6521Mi     RWO            Delete           Available                       hdd                      6m
persistentvolume/local-pv-b5f29194   6521Mi     RWO            Delete           Bound       default/cirros-dv   hdd                      1d
persistentvolume/local-pv-e345daaf   6521Mi     RWO            Delete           Available                       hdd                      15m
persistentvolume/local-pv-f443c1b    6521Mi     RWO            Delete           Available                       hdd                      1d
persistentvolume/local-pv-f474b691   6521Mi     RWO            Delete           Available                       hdd                      1d

NAME                              STATUS    VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/cirros-dv   Bound     local-pv-b5f29194   6521Mi     RWO            hdd            4m


❯❯ oc describe pod virt-launcher-vm-cirros-datavolume-7qbbt                                                                                                             (17:05)(:|✔)
Name:               virt-launcher-vm-cirros-datavolume-7qbbt
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               cnv-executor-ipinto-node2.example.com/172.16.0.19
Start Time:         Thu, 24 Jan 2019 17:02:13 +0200
Labels:             kubevirt.io=virt-launcher
                    kubevirt.io/created-by=0882f78a-1fe9-11e9-8e1b-fa163e32b4dd
                    kubevirt.io/vm=vm-cirros-datavolume
Annotations:        k8s.v1.cni.cncf.io/networks-status=[{
    "name": "openshift.1",
    "ips": [
        "10.129.0.30"
    ],
    "default": true,
    "dns": {}
}]
                kubevirt.io/domain=vm-cirros-datavolume
                openshift.io/scc=privileged
Status:         Running
IP:             10.129.0.30
Controlled By:  VirtualMachineInstance/vm-cirros-datavolume
Containers:
  compute:
    Container ID:  cri-o://cf2dc74440607a16dafc5c4c488e57e7bd3fe4aae9994f5a643a3a5378beaeaf
    Image:         brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/cnv-tech-preview/virt-launcher:v1.4.0
    Image ID:      brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/cnv-tech-preview/virt-launcher@sha256:9eed9e635e20505f8122cd0bef710f901f1433b1dd26c6a7bb13e19246b70652
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/bin/virt-launcher
      --qemu-timeout
      5m
      --name
      vm-cirros-datavolume
      --uid
      0882f78a-1fe9-11e9-8e1b-fa163e32b4dd
      --namespace
      default
      --kubevirt-share-dir
      /var/run/kubevirt
      --ephemeral-disk-dir
      /var/run/kubevirt-ephemeral-disks
      --readiness-file
      /var/run/kubevirt-infra/healthy
      --grace-period-seconds
      15
      --hook-sidecars
      0
      --less-pvc-space-toleration
      10
    State:          Running
      Started:      Thu, 24 Jan 2019 17:02:21 +0200
    Ready:          True
    Restart Count:  0
    Limits:
      devices.kubevirt.io/kvm:        1
      devices.kubevirt.io/tun:        1
      devices.kubevirt.io/vhost-net:  1
    Requests:
      devices.kubevirt.io/kvm:        1
      devices.kubevirt.io/tun:        1
      devices.kubevirt.io/vhost-net:  1
      memory:                         161679432
    Readiness:                        exec [cat /var/run/kubevirt-infra/healthy] delay=2s timeout=5s period=2s #success=1 #failure=5
    Environment:                      <none>
    Mounts:
      /var/run/kubevirt from virt-share-dir (rw)
      /var/run/kubevirt-ephemeral-disks from ephemeral-disks (rw)
      /var/run/kubevirt-infra from infra-ready-mount (rw)
      /var/run/kubevirt-private/vmi-disks/datavolumedisk1 from datavolumedisk1 (rw)
      /var/run/libvirt from libvirt-runtime (rw)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  datavolumedisk1:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  cirros-dv
    ReadOnly:   false
  infra-ready-mount:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:  
  virt-share-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/kubevirt
    HostPathType:  
  libvirt-runtime:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:  
  ephemeral-disks:
    Type:        EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:      
QoS Class:       Burstable
Node-Selectors:  kubevirt.io/schedulable=true
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
Events:
  Type     Reason     Age              From                                            Message
  ----     ------     ----             ----                                            -------
  Normal   Scheduled  4m               default-scheduler                               Successfully assigned default/virt-launcher-vm-cirros-datavolume-7qbbt to cnv-executor-ipinto-node2.example.com
  Normal   Pulled     4m               kubelet, cnv-executor-ipinto-node2.example.com  Container image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/cnv-tech-preview/virt-launcher:v1.4.0" already present on machine
  Normal   Created    4m               kubelet, cnv-executor-ipinto-node2.example.com  Created container
  Normal   Started    4m               kubelet, cnv-executor-ipinto-node2.example.com  Started container
  Warning  Unhealthy  4m (x4 over 4m)  kubelet, cnv-executor-ipinto-node2.example.com  Readiness probe failed: cat: /var/run/kubevirt-infra/healthy: No such file or directory

Comment 21 Israel Pinto 2019-01-24 15:12:09 UTC
Created attachment 1523182 [details]
VM with data volume

Comment 23 Fabian Deutsch 2019-01-25 11:26:15 UTC
For previous log:
Events:
  Type     Reason     Age              From                                            Message
  ----     ------     ----             ----                                            -------
  Normal   Scheduled  4m               default-scheduler                               Successfully assigned default/virt-launcher-vm-cirros-datavolume-7qbbt to cnv-executor-ipinto-node2.example.com
  Normal   Pulled     4m               kubelet, cnv-executor-ipinto-node2.example.com  Container image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/cnv-tech-preview/virt-launcher:v1.4.0" already present on machine
  Normal   Created    4m               kubelet, cnv-executor-ipinto-node2.example.com  Created container
  Normal   Started    4m               kubelet, cnv-executor-ipinto-node2.example.com  Started container
  Warning  Unhealthy  4m (x4 over 4m)  kubelet, cnv-executor-ipinto-node2.example.com  Readiness probe failed: cat: /var/run/kubevirt-infra/healthy: No such file or directory

From code:
        initialDelaySeconds := 2
        timeoutSeconds := 5
        periodSeconds := 2
        successThreshold := 1
        failureThreshold := 5


It could actually be as follows:
4 minutes ago the VMI got scheduled and the compute container started. Then:
After 2 seconds the initial probe ran and failed, and the next 3 probes (one every 2 seconds) ran and failed as well.
In general this means that it took on the order of 10 seconds from container start to VM start, which could be OK.

The remaining question, Israel: did the "readiness probe failed" event count continue to increase over time, i.e. did it go from "x4" to "x5" and beyond?

If it did NOT increase, then it looks like a sane story. One mitigation could be to increase initialDelaySeconds to a higher value to delay the first (and thus subsequent) probes.
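For reference, these constants correspond to the probe in the compute container spec roughly as follows (reconstructed from the "Readiness: exec [cat ...] delay=2s timeout=5s period=2s #success=1 #failure=5" line in the describe output, not copied from the KubeVirt source):

readinessProbe:
  exec:
    command: ["cat", "/var/run/kubevirt-infra/healthy"]   # /tmp/healthy in the older release
  initialDelaySeconds: 2    # first probe ~2s after container start
  timeoutSeconds: 5
  periodSeconds: 2          # then one probe every 2s
  successThreshold: 1       # one successful cat marks the container Ready
  failureThreshold: 5

So 4 failed probes cover roughly the first 8 seconds after container start, which matches the "x4" in the events if the VM takes that long to come up.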

Comment 24 sgott 2019-01-25 12:43:36 UTC
Good point Fabian, I saw that when trying to replicate this and was momentarily confused as well.

There's still an outstanding question that was never answered: Build a trivial container that simply creates the readiness probe file and sleeps. This will help isolate whether or not the KubeVirt code itself is introducing the readiness probe failed condition.

Comment 25 Petr Kotas 2019-02-05 14:47:06 UTC
Does this bug still apply? I tried my best, but I cannot replicate the issue in a local environment; I tried both K8s and OpenShift.
The readiness probe always comes up when the VM starts properly.

Comment 26 zhe peng 2019-02-20 06:37:30 UTC
My result is the same as comment 20.
The event count did not increase, always "x4", and in the events I saw:
16m         16m          4         virt-launcher-vmi-ephemeral-89q69.1584fcf471b9d206   Pod                      spec.containers{compute}               Warning   Unhealthy          kubelet, cnv-executor-zpeng29d-node2.example.com   Readiness probe failed: cat: /var/run/kubevirt-infra/healthy: No such file or directory
the count is 4.

I logged into the container and can't see that file:
$ oc rsh virt-launcher-vmi-ephemeral-89q69
sh-4.4# cat /var/run/kubevirt-infra/healthy
cat: /var/run/kubevirt-infra/healthy: No such file or directory

but it does have /tmp/healthy:
sh-4.4# ls /tmp/
healthy
Is this a d/s problem or am I missing something?

Comment 27 Fabian Deutsch 2019-02-20 08:33:09 UTC
1. Good that it is not increasing; then it's the normal behavior (the pod is unhealthy until the health flag is there, which appears after N [in this case 4] probes).
2. Unsure why the path is different.

For me this bug is good to close as the number does not continue to increase.

Comment 28 Federico Simoncelli 2019-02-20 08:48:16 UTC
(In reply to Fabian Deutsch from comment #27)
> 2. Unsure why the path is different.

Can we have this investigated before closing?
(It could just be the version we're testing, etc. but I'd rather know what's going on)

Comment 29 Petr Kotas 2019-02-21 06:51:35 UTC
I will double check that.
I am convinced there is no bug, but just some weird configuration.

Comment 30 Petr Kotas 2019-02-22 15:03:01 UTC
Israel or Zhe, can you please verify on the new release whether this bug still applies? I am not able to replicate it on a development setup.
It might be a misconfiguration.

Comment 31 Petr Kotas 2019-02-22 15:36:00 UTC
OK, I have deleted the whole dev cluster and started from scratch. I can see a similar issue. It seems that the kubevirt-infra volume is not mounted properly.
Will investigate further.

Comment 32 Petr Kotas 2019-02-22 15:39:35 UTC
And I found the issue. I was typing faster than the cluster was booting.

Zhe, please verify the folder is really missing. When you shell into the pod, specify -c compute to select the right container inside the pod.
The same thing happened to me just now. It is easy to forget that there are two containers inside the virt-launcher pod.

Waiting for confirmation.

Besides that, I verified the same: at the beginning there are 4 failed probe calls, but then it starts and works. The pod becomes healthy.

Comment 33 Federico Simoncelli 2019-02-25 13:22:13 UTC
(In reply to Petr Kotas from comment #32)
> Besides that, I verified the same. At the begging there are 4x calls to bad
> probe, but than it start and works. Pod becomes healthy.

I think there was also a concern WRT the path of the file, which should have been /var/run/kubevirt-infra/healthy?

Comment 34 Petr Kotas 2019-02-25 13:45:27 UTC
Yes, that is right. The path is there. Log in to the compute container and it is there.

Comment 35 Roman Mohr 2019-02-26 10:01:08 UTC
Was David's suggestion ever tried?

He suggested this:


> All of the data provided here contradicts that "Readiness probe failed" event. We see in the logs that the file is written, and we even ready condition being set on the container. The qemu process even starts within the pod. 

>I suggest attempting to reproduce this outside of KubeVirt by defining a simple pod with the same /tmp/healthy readiness check our virt-launcher pod uses and an entrypoint command arg that does something like this, touch /tmp/healthy && while true; do sleep 1; done

>This will rule KubeVirt out of the equation.

And looking at the data I agree with David that the probes seem to actually succeed.

Comment 36 Petr Kotas 2019-02-26 10:26:15 UTC
I tried to reproduce the readiness probe outside of KubeVirt. It works. However, I do not have the exact same setup QE has, nor do I have access to it.
Moreover, I also verified that the readiness probe comes up as expected in KubeVirt.

Waiting for reply from QE.

Comment 37 Petr Kotas 2019-03-04 08:15:28 UTC
Israel, can you please verify that the behaviour is indeed correct? Then this bug can be closed.

Comment 38 zhe peng 2019-03-04 09:10:48 UTC
I verified with the latest build:
$ oc rsh -c compute virt-launcher-vmi-ephemeral-2jtzh
sh-4.2# ls /var/run/kubevirt-infra/healthy 
/var/run/kubevirt-infra/healthy

Comment 39 Fabian Deutsch 2019-03-04 10:50:05 UTC
Zhe, does the error message still appear?

Warning  Unhealthy  9m (x5 over 9m)  kubelet, cnv-executor-vatsal-node1.example.com  Readiness probe failed: cat: /tmp/healthy: No such file or directory

Comment 40 Petr Kotas 2019-03-05 15:25:03 UTC
With the information provided by Zhe, I can confirm that the issue has been resolved in the new release.

Most likely, old images were cached and caused mixed errors: the new release expected the probe at /var/run/kubevirt-infra/healthy, but the old image provided /tmp/healthy.

From the comments and from my verification, the issue has been resolved.

Can we close the bug?

Comment 41 Petr Kotas 2019-03-07 07:30:45 UTC
I am closing this bug as it is resolved by the current release.

Summary:

The issue was a version mismatch: older KubeVirt had the readiness file located at /tmp/healthy, while newer KubeVirt has it at /var/run/kubevirt-infra/healthy.
The mismatch most likely happened because of cached container images when updating to the new release.
Controllers were still looking for the old /tmp/healthy while launcher pods already used /var/run/kubevirt-infra/healthy (see the fragment sketched below).
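In other words, the --readiness-file argument the controller passes to virt-launcher, the path in the pod's readinessProbe, and the volume backing that path all have to agree. A hedged fragment of the newer layout, trimmed and reconstructed from the describe output in comment 20 (not an exact manifest):

containers:
- name: compute
  command:
  - /usr/bin/virt-launcher
  - --readiness-file
  - /var/run/kubevirt-infra/healthy     # where virt-launcher writes the flag file once ready
  readinessProbe:
    exec:
      command: ["cat", "/var/run/kubevirt-infra/healthy"]   # what the kubelet checks
  volumeMounts:
  - name: infra-ready-mount             # emptyDir backing /var/run/kubevirt-infra
    mountPath: /var/run/kubevirt-infra

If any one of these pieces still uses the old /tmp/healthy location (for example from a cached image or an older controller), the probe keeps failing with the "No such file or directory" message seen in this bug.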

Regarding the readiness probe failing 4 times at first: this is expected behaviour. The readiness probe only comes up after the virtual machine has properly started.
The provided logs confirm that the virtual machine came up later on and that the number of readiness failures did not rise.

Taking into account the latest comments and my findings, the issue has been resolved by properly upgrading to the new release.

Comment 42 Red Hat Bugzilla 2023-09-14 04:40:13 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

