Bug 1639204
| Summary: | Readiness probe failed: "cat: /tmp/healthy: No such file or directory" when using a DataVolume to create a VM | | |
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | Qixuan Wang <qixuan.wang> |
| Component: | Virtualization | Assignee: | Petr Kotas <pkotas> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | zhe peng <zpeng> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 1.2 | CC: | alitke, cnv-qe-bugs, dvossel, fdeutsch, fsimonce, ipinto, kboumedh, mfranczy, mlibra, ncredi, pkotas, qixuan.wang, rmohr, sgordon, sgott, zpeng |
| Target Milestone: | --- | | |
| Target Release: | 2.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-03-07 07:30:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Qixuan Wang
2018-10-15 09:44:53 UTC
It's normal for the readiness probe to fail after the VM is first started. Does it resolve after a short period of time? Can you connect to the VM console, VNC, or SSH?

(In reply to Adam Litke from comment #1)
> It's normal for the readiness probe to fail after the VM is first started.

If it's normal, can we turn the message into something less scary? E.g. catch the "No such file or directory" error for a reasonable amount of time and transform it into a notice such as "Waiting for VM to start". Also consider failing hard after a certain deadline.

(In reply to Adam Litke from comment #1)
> It's normal for the readiness probe to fail after the VM is first started.

After restarting the VM, the readiness probe still fails.

> Does it resolve after a short period of time?

No, it doesn't resolve after 1 hour.

> Can you connect to the VM console, vnc, or ssh?

Yes, I can connect to it via ssh or the console.

Can you check if your host filesystem is full or has errors? Run df -h and check dmesg.

@qixuan, can you still reproduce this issue?

Tested on OpenShift v3.11.16, CNV 1.3; it can be reproduced. Attached:
http://pastebin.test.redhat.com/662181
http://pastebin.test.redhat.com/662184

Marcin, can you please take a look at whether this reproduces, and why it is happening?

David, do you have any insight on this?

> level=info timestamp=2018-10-15T08:54:01.207709Z pos=virt-launcher.go:60 component=virt-launcher msg="Marked as ready"

From the logs attached to this issue, I see that the virt-launcher pod initialized and wrote the readiness file to disk. We'd only see that log message if the file was written successfully. So the file is being written. Furthermore, looking at the logs, I see that the domain is started as well.

> level=info timestamp=2018-10-15T08:54:02.459722Z pos=virt-launcher.go:214 component=virt-launcher msg="Detected domain with UUID 8a21e546-ddf7-56a7-bebf-4b652fc17428"

Even the 'kubectl describe pod' output shows that the container's ready condition is set to true.
> Conditions:
>   Type              Status
>   Initialized       True
>   Ready             True
>   ContainersReady   True
>   PodScheduled      True

All of the data provided here contradicts that "Readiness probe failed" event. We see in the logs that the file is written, and we even see the ready condition being set on the container. The qemu process even starts within the pod.

I suggest attempting to reproduce this outside of KubeVirt by defining a simple pod with the same /tmp/healthy readiness check our virt-launcher pod uses, and an entrypoint command arg that does something like this:

touch /tmp/healthy && while true; do sleep 1; done

This will rule KubeVirt out of the equation.

Qixuan, could you follow the suggestion provided to help isolate the root cause?

This seems similar to the issue I've got, which was fixed by using docker88 (downgrading to docker82 is also supposed to do the trick).

Thanks Karim. Was there an open bug for this? If so, could you please link it to this one?

Qixuan, any update from your side?

I still see this issue with:

Installed Packages
Name    : docker
Arch    : x86_64
Epoch   : 2
Version : 1.13.1
Release : 90.git07f3374.el7

Events:
Type     Reason     Age              From                                             Message
----     ------     ----             ----                                             -------
Normal   Scheduled  1m               default-scheduler                                Successfully assigned cdi/virt-launcher-vm-alpine-datavolume-wlxj5 to cnv-executor-nelly1-node2.example.com
Normal   Pulled     1m               kubelet, cnv-executor-nelly1-node2.example.com   Container image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/cnv-tech-preview/virt-launcher:v1.4.0" already present on machine
Normal   Created    1m               kubelet, cnv-executor-nelly1-node2.example.com   Created container
Normal   Started    1m               kubelet, cnv-executor-nelly1-node2.example.com   Started container
Warning  Unhealthy  1m (x4 over 1m)  kubelet, cnv-executor-nelly1-node2.example.com   Readiness probe failed: cat: /var/run/kubevirt-infra/healthy: No such file or directory

Nelly, your comment just reminded me of something, possibly an important clue. We recently moved the readiness check path!
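The trivial reproduction pod suggested above could be sketched roughly like this. This is only a sketch: the pod and container names are made up for illustration, and the probe settings mirror the virt-launcher probe parameters quoted later in this bug (initial delay 2s, period 2s, failure threshold 5).

```yaml
# Hypothetical minimal pod to test the readiness check outside of KubeVirt.
# The name "readiness-repro" is illustrative; the image is plain busybox.
apiVersion: v1
kind: Pod
metadata:
  name: readiness-repro
spec:
  containers:
  - name: repro
    image: busybox
    # Same idea as the suggested entrypoint: create the file, then idle.
    command: ["/bin/sh", "-c", "touch /tmp/healthy && while true; do sleep 1; done"]
    readinessProbe:
      exec:
        command: ["cat", "/tmp/healthy"]
      initialDelaySeconds: 2
      periodSeconds: 2
      failureThreshold: 5
```

If this pod also emits "Readiness probe failed" events indefinitely, the problem lies in the kubelet or container runtime rather than in KubeVirt.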
In changeset 211ba95720d778d95ba9c46e7d8064f6a106640c we moved from using "/tmp/healthy" to using "/var/run/kubevirt-infra/healthy" as the location for the readiness check.

Your log looks like something I would expect, because virt-launcher doesn't create the /var/run/kubevirt-infra/healthy file until it is ready. If you continue to see "readiness probe failed" events after 5 minutes or so, that would be a problem.

Now, something important to note: the changeset in question was merged into the upstream 0.11 release, which was right around the time this bug was reported. It's entirely possible that the issue was transient, e.g. using older manifests while kubevirt's master already had the file location change, or using cached images while others were pulled new.

Qixuan, could you please attempt to reproduce this bug using the latest release, just to verify that this issue is really still occurring?

Well, it doesn't take 5 minutes; it's more like half an hour.

During the time span where you observe "Readiness Check Failed" events, are you still able to schedule VMIs?

Test with: CNV 1.4
Steps:
Create VM (with attached yaml)
Start VM
Results:
VM is up and running
I can see the Readiness probe failing
See outputs:
❯❯❯ oc get vm,vmi
NAME AGE RUNNING VOLUME
virtualmachine.kubevirt.io/vm-cirros-datavolume 3m true
NAME AGE PHASE IP NODENAME
virtualmachineinstance.kubevirt.io/vm-cirros-datavolume 3m Running 10.129.0.30 cnv-executor-ipinto-node2.example.com
❯❯❯ oc get pv,pvc
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
persistentvolume/local-pv-1f7a0db 6521Mi RWO Delete Available hdd 1d
persistentvolume/local-pv-4cdfe782 6521Mi RWO Delete Available hdd 1d
persistentvolume/local-pv-4db2857d 6521Mi RWO Delete Available hdd 1d
persistentvolume/local-pv-7cac3158 6521Mi RWO Delete Available hdd 1d
persistentvolume/local-pv-97fe7162 6521Mi RWO Delete Available hdd 6m
persistentvolume/local-pv-b5f29194 6521Mi RWO Delete Bound default/cirros-dv hdd 1d
persistentvolume/local-pv-e345daaf 6521Mi RWO Delete Available hdd 15m
persistentvolume/local-pv-f443c1b 6521Mi RWO Delete Available hdd 1d
persistentvolume/local-pv-f474b691 6521Mi RWO Delete Available hdd 1d
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
persistentvolumeclaim/cirros-dv Bound local-pv-b5f29194 6521Mi RWO hdd 4m
❯❯❯ oc describe pod virt-launcher-vm-cirros-datavolume-7qbbt
Name: virt-launcher-vm-cirros-datavolume-7qbbt
Namespace: default
Priority: 0
PriorityClassName: <none>
Node: cnv-executor-ipinto-node2.example.com/172.16.0.19
Start Time: Thu, 24 Jan 2019 17:02:13 +0200
Labels: kubevirt.io=virt-launcher
kubevirt.io/created-by=0882f78a-1fe9-11e9-8e1b-fa163e32b4dd
kubevirt.io/vm=vm-cirros-datavolume
Annotations: k8s.v1.cni.cncf.io/networks-status=[{
"name": "openshift.1",
"ips": [
"10.129.0.30"
],
"default": true,
"dns": {}
}]
kubevirt.io/domain=vm-cirros-datavolume
openshift.io/scc=privileged
Status: Running
IP: 10.129.0.30
Controlled By: VirtualMachineInstance/vm-cirros-datavolume
Containers:
compute:
Container ID: cri-o://cf2dc74440607a16dafc5c4c488e57e7bd3fe4aae9994f5a643a3a5378beaeaf
Image: brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/cnv-tech-preview/virt-launcher:v1.4.0
Image ID: brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/cnv-tech-preview/virt-launcher@sha256:9eed9e635e20505f8122cd0bef710f901f1433b1dd26c6a7bb13e19246b70652
Port: <none>
Host Port: <none>
Command:
/usr/bin/virt-launcher
--qemu-timeout
5m
--name
vm-cirros-datavolume
--uid
0882f78a-1fe9-11e9-8e1b-fa163e32b4dd
--namespace
default
--kubevirt-share-dir
/var/run/kubevirt
--ephemeral-disk-dir
/var/run/kubevirt-ephemeral-disks
--readiness-file
/var/run/kubevirt-infra/healthy
--grace-period-seconds
15
--hook-sidecars
0
--less-pvc-space-toleration
10
State: Running
Started: Thu, 24 Jan 2019 17:02:21 +0200
Ready: True
Restart Count: 0
Limits:
devices.kubevirt.io/kvm: 1
devices.kubevirt.io/tun: 1
devices.kubevirt.io/vhost-net: 1
Requests:
devices.kubevirt.io/kvm: 1
devices.kubevirt.io/tun: 1
devices.kubevirt.io/vhost-net: 1
memory: 161679432
Readiness: exec [cat /var/run/kubevirt-infra/healthy] delay=2s timeout=5s period=2s #success=1 #failure=5
Environment: <none>
Mounts:
/var/run/kubevirt from virt-share-dir (rw)
/var/run/kubevirt-ephemeral-disks from ephemeral-disks (rw)
/var/run/kubevirt-infra from infra-ready-mount (rw)
/var/run/kubevirt-private/vmi-disks/datavolumedisk1 from datavolumedisk1 (rw)
/var/run/libvirt from libvirt-runtime (rw)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
datavolumedisk1:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: cirros-dv
ReadOnly: false
infra-ready-mount:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
virt-share-dir:
Type: HostPath (bare host directory volume)
Path: /var/run/kubevirt
HostPathType:
libvirt-runtime:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
ephemeral-disks:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
QoS Class: Burstable
Node-Selectors: kubevirt.io/schedulable=true
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m default-scheduler Successfully assigned default/virt-launcher-vm-cirros-datavolume-7qbbt to cnv-executor-ipinto-node2.example.com
Normal Pulled 4m kubelet, cnv-executor-ipinto-node2.example.com Container image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/cnv-tech-preview/virt-launcher:v1.4.0" already present on machine
Normal Created 4m kubelet, cnv-executor-ipinto-node2.example.com Created container
Normal Started 4m kubelet, cnv-executor-ipinto-node2.example.com Started container
Warning Unhealthy 4m (x4 over 4m) kubelet, cnv-executor-ipinto-node2.example.com Readiness probe failed: cat: /var/run/kubevirt-infra/healthy: No such file or directory
Created attachment 1523182 [details]
VM with data volume
For previous log:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m default-scheduler Successfully assigned default/virt-launcher-vm-cirros-datavolume-7qbbt to cnv-executor-ipinto-node2.example.com
Normal Pulled 4m kubelet, cnv-executor-ipinto-node2.example.com Container image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/cnv-tech-preview/virt-launcher:v1.4.0" already present on machine
Normal Created 4m kubelet, cnv-executor-ipinto-node2.example.com Created container
Normal Started 4m kubelet, cnv-executor-ipinto-node2.example.com Started container
Warning Unhealthy 4m (x4 over 4m) kubelet, cnv-executor-ipinto-node2.example.com Readiness probe failed: cat: /var/run/kubevirt-infra/healthy: No such file or directory
From code:
initialDelaySeconds := 2
timeoutSeconds := 5
periodSeconds := 2
successThreshold := 1
failureThreshold := 5
It could actually be as follows:
4 minutes ago, the VMI got scheduled and the compute container started. Then:
After 2 seconds the initial probe ran and failed; the next 3 probes ran at 2-second intervals and also failed.
In general this means that it took about 10 seconds from container start to VM start, which could be OK.
The remaining question: Israel, did the "Readiness probe failed" event count continue to increase over time, i.e. did it go from "x4" to "x5" and beyond?
If it did NOT increase, then this looks like a sane story. One mitigation could be to increase initialDelaySeconds to a higher value to delay the first (and thus subsequent) probes.
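The "x4" count is consistent with those parameters. A quick sketch of the probe schedule (the 10-second VM-ready estimate above is an assumption, not a measured value):

```shell
# Probe schedule sketch: the first probe fires after initialDelaySeconds,
# then one probe every periodSeconds. Count how many probes fire (and fail)
# before the VM is assumed ready at t=10s.
initial_delay=2
period=2
vm_ready=10

t=$initial_delay
failures=0
while [ "$t" -lt "$vm_ready" ]; do
  failures=$((failures + 1))   # probe at t seconds fails: file not yet written
  t=$((t + period))
done
echo "failed probes before readiness: $failures"
```

Probes at t=2, 4, 6, and 8 seconds all fail, and the probe at t=10 succeeds, matching the observed "x4" event count while staying under the failure threshold of 5.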
Good point Fabian. I saw that when trying to replicate this and was momentarily confused as well.

There's still an outstanding question that was never answered: build a trivial container that simply creates the readiness probe file and sleeps. This will help isolate whether or not the KubeVirt code itself is introducing the "readiness probe failed" condition.

Does this bug still apply? I tried my best, but I cannot replicate the issue on a local environment; I tried both K8s and OpenShift. The readiness probe always comes up when the VM starts properly.

My result is the same as comment 20. The event count did not increase, always "x4", and in the events I saw:

16m 16m 4 virt-launcher-vmi-ephemeral-89q69.1584fcf471b9d206 Pod spec.containers{compute} Warning Unhealthy kubelet, cnv-executor-zpeng29d-node2.example.com Readiness probe failed: cat: /var/run/kubevirt-infra/healthy: No such file or directory

The count is 4. When I log in to the container, I can't see that file:

$ oc rsh virt-launcher-vmi-ephemeral-89q69
sh-4.4# cat /var/run/kubevirt-infra/healthy
cat: /var/run/kubevirt-infra/healthy: No such file or directory

but it does have /tmp/healthy:

sh-4.4# ls /tmp/
healthy

Is this a downstream problem, or am I missing something?

1. It is good that the count is not increasing; that is the normal behavior (the pod is unhealthy until the health flag is there, which appears after N [in this case 4] probes).
2. I am unsure why the path is different.

For me this bug is good to close, as the number does not continue to increase.

(In reply to Fabian Deutsch from comment #27)
> 2. Unsure why the path is different.

Can we have this investigated before closing? (It could just be the version we're testing, etc., but I'd rather know what's going on.)

I will double check that. I am convinced there is no bug, just some weird configuration.

Israel or Zhe, can you please verify on the new release that this bug still applies? I am not able to replicate it on a development setup. It might be a misconfiguration.
OK, I have deleted the whole dev cluster and started from scratch. I can see a similar issue. It seems that the kubevirt-infra volume mount is not mounted properly. Will investigate further.

And I found the issue. I was typing faster than the cluster was booting.

Zhe, please verify the folder is really missing. When you shell into the pod, specify -c compute to select the right container inside the pod. The same thing happened to me just now; it is easy to forget that there are two containers inside the virt-launcher pod.

Waiting for confirmation. Besides that, I verified the same: at the beginning there are 4 failed calls to the probe, but then it starts and works. The pod becomes healthy.

(In reply to Petr Kotas from comment #32)
> Besides that, I verified the same. At the beginning there are 4 failed calls to the probe,
> but then it starts and works. The pod becomes healthy.

I think there was also a concern WRT the path of the file, which should have been /var/run/kubevirt-infra/healthy?

Yes, that is right. The path is there. Log in to the compute container and it is there.

Was David's suggestion ever tried? He suggested this:

> All of the data provided here contradicts that "Readiness probe failed" event. We see in the logs that the file is written, and we even see the ready condition being set on the container. The qemu process even starts within the pod.
> I suggest attempting to reproduce this outside of KubeVirt by defining a simple pod with the same /tmp/healthy readiness check our virt-launcher pod uses and an entrypoint command arg that does something like this: touch /tmp/healthy && while true; do sleep 1; done
> This will rule KubeVirt out of the equation.

And looking at the data, I agree with David that the probes seem to actually succeed.

I tried to reproduce the readiness probe outside of KubeVirt. It works. However, I do not have the exact same setup as QE has, nor do I have access to it. Moreover, I also verified that the readiness probe comes up as expected in KubeVirt.
Waiting for a reply from QE. Israel, can you verify that the behaviour is indeed correct, please? This bug can be closed.

I verified with the latest build:

$ oc rsh -c compute virt-launcher-vmi-ephemeral-2jtzh
sh-4.2# ls /var/run/kubevirt-infra/healthy
/var/run/kubevirt-infra/healthy

Zhe, does the error message still appear?

Warning Unhealthy 9m (x5 over 9m) kubelet, cnv-executor-vatsal-node1.example.com Readiness probe failed: cat: /tmp/healthy: No such file or directory

With the information provided by Zhe, I can confirm that the issue has been resolved with the new release. Most likely, old images were cached and caused mixed errors: the new release expected the probe on the path /var/run/kubevirt-infra/healthy, but the old image provided the path /tmp/healthy.

From what I see in the comments and from my verification, the issue has been resolved. Can we close the bug?

I am closing this bug as it is resolved by the current release.

Summary: The issue was a version mismatch. Older KubeVirt had the readiness probe located at /tmp/healthy; newer KubeVirt has it located at /var/run/kubevirt-infra/healthy. The mismatch most likely happened due to the container cache when updating to the new release: controllers were still searching for the old /tmp/healthy, but launcher pods already had /var/run/kubevirt-infra/healthy.

Regarding the readiness probe failing 4 times at first: this is expected behaviour. The readiness probe comes up only after the virtual machine properly starts. The provided logs confirm that the virtual machine came up later on and the number of readiness failures did not rise.

Taking into account the latest comments and my findings, the issue has been resolved by properly upgrading to the new release.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days