Description of problem:
Guest agent info is not available via virtctl after running "systemctl restart <service_name>" (e.g. for ssh or the guest agent service itself) from the guest OS.

Version-Release number of selected component (if applicable):
$ virtctl version
Client Version: version.Info{GitVersion:"v0.34.0-rc.0-22-g156076b", GitCommit:"156076b1a9241493551578788c29b666aeca7167", GitTreeState:"clean", BuildDate:"2020-10-04T13:16:13Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{GitVersion:"v0.34.0-rc.0-6-gad89f92", GitCommit:"ad89f923b784b46fd989e95feb5409ae707cb130", GitTreeState:"clean", BuildDate:"2020-10-02T09:12:02Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

$ oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-fc.9   True        False         3d2h    Cluster version is 4.6.0-fc.9

$ oc get csv -n openshift-cnv
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v2.5.0   OpenShift Virtualization   2.5.0     kubevirt-hyperconverged-operator.v2.4.1   Succeeded

How reproducible:
100%

Steps to Reproduce:
1. Create and run a VM with the guest agent installed.
2. Run "systemctl restart sshd" from the guest OS.
3. Run "virtctl guestosinfo <vm_name>".

Actual results:
$ virtctl -n supported-os-common-templates-fedora-test-fedora-os-support guestosinfo fedora-31-1602162245-87957
{"component":"","level":"error","msg":"Cannot retrieve GuestOSInfo: an error on the server (\"Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \\\"fedora-31-1602162245-87957\\\": VMI does not have guest agent connected\") has prevented the request from succeeding","pos":"vmi.go:449","timestamp":"2020-10-08T13:30:14.896722Z"}
Error getting guestosinfo of VirtualMachine fedora-31-1602162245-87957, an error on the server ("Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \"fedora-31-1602162245-87957\": VMI does not have guest agent connected") has prevented the request from succeeding

Expected results:
$ virtctl -n supported-os-common-templates-fedora-test-fedora-os-support guestosinfo fedora-31-1602162245-87957
{
  "guestAgentVersion": "4.1.1",
  "hostname": "ibm-p8-kvm-03-guest-02",
  "os": {
    "name": "Fedora",
    "kernelRelease": "5.4.17-200.fc31.x86_64",
    "version": "31 (Cloud Edition)",
    "prettyName": "Fedora 31 (Cloud Edition)",
    "versionId": "31",
    "kernelVersion": "#1 SMP Sat Feb 1 19:00:13 UTC 2020",
    "machine": "x86_64",
    "id": "fedora"
  },
  "timezone": "UTC, 0",
  "fsInfo": {
    "disks": [
      {
        "diskName": "vda1",
        "mountPoint": "/",
        "fileSystemType": "ext4",
        "usedBytes": 1696858112,
        "totalBytes": 25220722688
      }
    ]
  }
}

Additional info:
Although virtctl returns an error, the data is still available via the API endpoint /apis/subresources.kubevirt.io/v1alpha3/namespaces/{namespace}/virtualmachineinstances/{name}/guestosinfo
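Since the additional info says the subresource endpoint still answers while virtctl fails, it can help to query that endpoint directly and compare. A minimal Python sketch, assuming "oc" is logged in to the cluster and that the v1alpha3 subresource version quoted in this report applies; guestosinfo_path and fetch_guestosinfo are hypothetical helper names, not part of virtctl or KubeVirt:

```python
import json
import subprocess
from urllib.parse import quote


def guestosinfo_path(namespace: str, name: str) -> str:
    """Build the guestosinfo subresource path quoted in this report (hypothetical helper)."""
    # Namespace and VMI names are DNS labels, but quote them defensively anyway.
    return (
        "/apis/subresources.kubevirt.io/v1alpha3"
        f"/namespaces/{quote(namespace, safe='')}"
        f"/virtualmachineinstances/{quote(name, safe='')}/guestosinfo"
    )


def fetch_guestosinfo(namespace: str, name: str) -> dict:
    """GET the endpoint through the API server with the current oc credentials."""
    out = subprocess.run(
        ["oc", "get", "--raw", guestosinfo_path(namespace, name)],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)
```

For the VM above this would be fetch_guestosinfo("supported-os-common-templates-fedora-test-fedora-os-support", "fedora-31-1602162245-87957"); "oc get --raw" simply issues a GET against the given API path.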
Targeting this to the next release. It's unclear to me why the severity was designated as urgent. Please voice your concern if you feel this truly is urgent and needs to be addressed immediately.
1. The GA data is an important tool for the end user. In addition, "Although the virtctl returns error, it is still available via api endpoint /apis/subresources.kubevirt.io/v1alpha3/namespaces/{namespace}/virtualmachineinstances/{name}/guestosinfo" may point to a more serious issue. At least let's see what the root cause is before pushing it to 2.6.
2. This is a regression: we restart the GA since BZ https://bugzilla.redhat.com/show_bug.cgi?id=1845127, whose fix is in RHEL 8.3 user space.
@Daniel Reproduced now with:

1. Image from http://download.eng.bos.redhat.com/brewroot/packages/rhel-guest-image/8.3/402/images/rhel-guest-image-8.3-402.x86_64.qcow2

apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: rhel-8-3-dv
spec:
  source:
    http:
      url: "http://cnv-qe-server.rhevdev.lab.eng.rdu2.redhat.com/files/cnv-tests/rhel-images/rhel-83.qcow2"
  pvc:
    storageClassName: hostpath-provisioner
    volumeMode: Filesystem
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 25Gi

2. VM:
oc process -n openshift rhel7-server-tiny-v0.11.3 -p PVCNAME=rhel-8-3-dv -p NAME=rhel-8-3-vm -p CLOUD_USER_PASSWORD=redhat | oc create -n default -f -

3. The VMI has only qemu-guest-agent:
[cloud-user@rhel-8-3-vm ~]$ sudo systemctl status qemu-guest-agent
● qemu-guest-agent.service - QEMU Guest Agent
   Loaded: loaded (/usr/lib/systemd/system/qemu-guest-agent.service; disabled; >
   Active: active (running) since Wed 2020-10-14 02:45:38 EDT; 5min ago
 Main PID: 807 (qemu-ga)
    Tasks: 1 (limit: 4761)
   Memory: 1.6M
   CGroup: /system.slice/qemu-guest-agent.service
           └─807 /usr/bin/qemu-ga --method=virtio-serial --path=/dev/virtio-por>

Oct 14 02:45:38 localhost.localdomain systemd[1]: Started QEMU Guest Agent.
[cloud-user@rhel-8-3-vm ~]$ sudo systemctl status virt-guest-agent
Unit virt-guest-agent.service could not be found.

4. The VMI has guest agent info:
Guest OS Info:
  Id:              rhel
  Kernel Release:  4.18.0-240.el8.x86_64
  Kernel Version:  #1 SMP Wed Sep 23 05:13:10 EDT 2020
  Name:            Red Hat Enterprise Linux
  Pretty Name:     Red Hat Enterprise Linux 8.3 (Ootpa)
  Version:         8.3
  Version Id:      8.3

5. sudo systemctl restart qemu-guest-agent

6. Wait a few minutes. The guest agent info no longer appears in the vmi describe output:
Status:
  Active Pods:
    263df3ae-0deb-49ac-8f22-2d80a7c029b6:  ruty-250-13-s7whp-worker-0-4wg84
  Conditions:
    Last Probe Time:       <nil>
    Last Transition Time:  <nil>
    Message:               cannot migrate VMI with non-shared PVCs
    Reason:                DisksNotLiveMigratable
    Status:                False
    Type:                  LiveMigratable
    Last Probe Time:       <nil>
    Last Transition Time:  2020-10-14T06:44:42Z
    Status:                True
    Type:                  Ready
    Last Probe Time:       2020-10-14T06:51:59Z
    Last Transition Time:  <nil>
    Status:                True
    Type:                  AgentConnected
  Guest OS Info:
  Interfaces:
    Interface Name:  eth0
    Ip Address:      10.128.2.53
    Ip Addresses:
      10.128.2.53
    Mac:   02:00:00:77:31:31
    Name:  default

After some more time, virtctl fails to retrieve the info:
$ virtctl userlist rhel-8-3-vm
Error listing users of VirtualMachine rhel-8-3-vm, an error on the server ("Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \"rhel-8-3-vm\": VMI does not have guest agent connected") has prevented the request from succeeding
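The describe output above shows the intermediate symptom: the AgentConnected condition is still True while Guest OS Info is already empty. A minimal Python sketch for spotting that state in the parsed output of "oc get vmi <name> -o json", assuming the VMI status exposes "conditions" and "guestOSInfo" as in this KubeVirt version; agent_status is a hypothetical helper:

```python
from typing import Any, Dict, Tuple


def agent_status(vmi: Dict[str, Any]) -> Tuple[bool, bool]:
    """Return (agent_connected, has_guest_os_info) for a parsed VMI object (hypothetical helper)."""
    status = vmi.get("status", {})
    # The AgentConnected condition is what the describe output shows as "Type: AgentConnected".
    connected = any(
        c.get("type") == "AgentConnected" and c.get("status") == "True"
        for c in status.get("conditions", [])
    )
    # A missing or empty guestOSInfo means no OS data is currently reported.
    has_info = bool(status.get("guestOSInfo"))
    return connected, has_info
```

A result of (True, False) corresponds to step 6 above; (False, False) corresponds to the later point where virtctl reports "VMI does not have guest agent connected".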
Some more info: we're losing guest agent connectivity after a while (also on Windows after pausing/unpausing the VMI?)

  Id:              mswindows
  Kernel Release:  17763
  Kernel Version:  10.0
  Name:            Microsoft Windows
  Pretty Name:     Windows Server 2019 Standard
  Version:         2019
  Version Id:      2019

Paused:
  Type:  Paused
  Guest OS Info:
  Interfaces:
    Interface Name:  Ethernet 2

VMI is unpaused:
$ virtctl userlist -n supported-os-common-templates-windows-test-windows-os-support win-19-1602661172-9470403
{
  "metadata": {},
  "items": [
    {
      "userName": "Administrator",
      "domain": "WIN-CUCKQ65DH6K",
      "loginTime": 1602686609.525345
    }
  ]
}
$ virtctl userlist -n supported-os-common-templates-windows-test-windows-os-support win-19-1602661172-9470403
Error listing users of VirtualMachine win-19-1602661172-9470403, an error on the server ("Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \"win-19-1602661172-9470403\": VMI does not have guest agent connected") has prevented the request from succeeding
What I think is happening here is that once every ~10 minutes, we receive an empty guest agent channel state through the domain notification channel. This causes us to incorrectly mark the agent as disconnected, even though the agent can still be queried and replies properly. I don't think that restarting services such as sshd has any impact on this bug.
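To make the hypothesis concrete: if each domain notification overwrites the cached agent state, one notification that arrives without channel state is enough to flip a healthy agent to disconnected. A minimal Python model of that effect, not KubeVirt's actual code; AgentState, naive_update and tolerant_update are hypothetical names:

```python
from enum import Enum
from typing import Optional


class AgentState(Enum):
    CONNECTED = "connected"
    DISCONNECTED = "disconnected"


def naive_update(cached: AgentState, reported: Optional[AgentState]) -> AgentState:
    """Overwrite the cache; a notification without state counts as 'disconnected'."""
    # 'cached' is deliberately ignored: that is the behavior being modeled.
    return reported if reported is not None else AgentState.DISCONNECTED


def tolerant_update(cached: AgentState, reported: Optional[AgentState]) -> AgentState:
    """Change the cached state only when the notification actually carries one."""
    return reported if reported is not None else cached


# A normal notification, followed by one with no channel state (hypothetical event stream).
events = [AgentState.CONNECTED, None]
naive = tolerant = AgentState.DISCONNECTED
for ev in events:
    naive = naive_update(naive, ev)
    tolerant = tolerant_update(tolerant, ev)

print(naive.value, tolerant.value)  # disconnected connected
```

The tolerant variant only illustrates the symptom; the next comment traces the empty state to the periodic resync itself, which is where the actual fix went.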
The root cause of this bug is a timed resync that we perform on our domains. During that resync loop, we get the domain state without its runtime information, so some fields, such as the guest agent's connection state, are omitted. I'm preparing a fix.
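On the libvirt side, the guest agent connection state is, as far as I know, the output-only "state" attribute on the agent channel's <target> element, which is reported in live domain XML and absent when the XML is obtained without runtime data. A minimal Python sketch of reading it, assuming the channel is named org.qemu.guest_agent.0; agent_channel_state is a hypothetical helper and the two XML snippets are illustrative, not captured from this cluster:

```python
import xml.etree.ElementTree as ET
from typing import Optional

AGENT_CHANNEL = "org.qemu.guest_agent.0"  # assumed channel name


def agent_channel_state(domain_xml: str) -> Optional[str]:
    """Return 'connected'/'disconnected', or None when the XML carries no runtime state."""
    root = ET.fromstring(domain_xml)
    for target in root.findall("./devices/channel/target"):
        if target.get("name") == AGENT_CHANNEL:
            return target.get("state")  # missing from XML without runtime data
    return None


# Illustrative XML with and without runtime data.
LIVE = """<domain><devices><channel type='unix'>
<target type='virtio' name='org.qemu.guest_agent.0' state='connected'/>
</channel></devices></domain>"""

NO_RUNTIME = """<domain><devices><channel type='unix'>
<target type='virtio' name='org.qemu.guest_agent.0'/>
</channel></devices></domain>"""

print(agent_channel_state(LIVE))        # connected
print(agent_channel_state(NO_RUNTIME))  # None
```

Code that maps None to "disconnected" produces exactly the false disconnects described above; fetching the domain XML with runtime data, as the attached fix does, avoids the None case.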
Created attachment 1723157: fix domain manager to get domain XML with runtime data
AH, a PR is available: https://github.com/kubevirt/kubevirt/pull/4395
We suspect this is a duplicate of this bug https://bugzilla.redhat.com/show_bug.cgi?id=1883875
PR that addresses this: https://github.com/kubevirt/kubevirt/pull/4628
[kbidarka@localhost migration]$ oc get csv -n openshift-cnv
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v2.6.0   OpenShift Virtualization   2.6.0     kubevirt-hyperconverged-operator.v2.5.3   Succeeded
[kbidarka@localhost migration]$ virtctl userlist vm-rhel83-nfs
Error listing users of VirtualMachine vm-rhel83-nfs, an error on the server ("Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \"vm-rhel83-nfs\": VMI does not have guest agent connected") has prevented the request from succeeding
[kbidarka@localhost migration]$ virtctl userlist vm1-rhel83-nfs
Error listing users of VirtualMachine vm1-rhel83-nfs, an error on the server ("Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \"vm1-rhel83-nfs\": VMI does not have guest agent connected") has prevented the request from succeeding

I suspect this may be seen after a migration, though I'm not sure. Also, reproducing this instantly is a challenge, as we need to wait a few hours for the issue to occur. I will update here after doing the following:
1) Create a VM, restart qemu-guest-agent, fetch userlist info, and fetch userlist info again after a few hours.
2) Create a VM, restart qemu-guest-agent, fetch userlist info, migrate the VM, fetch userlist info, and fetch userlist info again after a few hours.
3) Create a VM, do not restart qemu-guest-agent, fetch userlist info, migrate the VM, fetch userlist info, and fetch userlist info again after a few hours.
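Because the failure only shows up after hours, the repeated userlist checks in the plan above could be automated. A minimal Python sketch that polls "virtctl userlist" and reports the first check at which the "VMI does not have guest agent connected" error appears; the 10-minute interval, the 6-hour limit and the names run_userlist, watch_userlist and is_agent_disconnected are assumptions, not part of any tooling in this report:

```python
import subprocess
import time
from typing import Callable, Optional

DISCONNECTED_MSG = "VMI does not have guest agent connected"


def is_agent_disconnected(output: str) -> bool:
    """True if the command output contains the error seen in this bug."""
    return DISCONNECTED_MSG in output


def run_userlist(vm: str, namespace: str) -> str:
    """Run virtctl userlist and return stdout and stderr combined."""
    proc = subprocess.run(
        ["virtctl", "userlist", vm, "-n", namespace],
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True,
    )
    return proc.stdout


def watch_userlist(
    check: Callable[[], str],
    interval_s: float = 600.0,  # 10 minutes between checks (assumed)
    max_checks: int = 36,       # 36 x 10 min = 6 hours (assumed)
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Poll until the agent looks disconnected; return that check's index, else None."""
    for i in range(max_checks):
        if is_agent_disconnected(check()):
            return i
        sleep(interval_s)
    return None
```

For example, watch_userlist(lambda: run_userlist("vm-rhel83-nfs", "<namespace>")), where "<namespace>" is a placeholder for the VM's namespace. Passing the check and sleep functions in keeps the polling loop testable without a cluster.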
I migrated the VM and waited for 3 hours; we lost the info:
$ virtctl guestosinfo rhel8-puzzled-moth -n user-agent
{"component":"","level":"error","msg":"Cannot retrieve GuestOSInfo: an error on the server (\"Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \\\"rhel8-puzzled-moth\\\": VMI does not have guest agent connected\") has prevented the request from succeeding","pos":"vmi.go:449","timestamp":"2021-02-02T12:28:26.348784Z"}
Error getting guestosinfo of VirtualMachine rhel8-puzzled-moth, an error on the server ("Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \"rhel8-puzzled-moth\": VMI does not have guest agent connected") has prevented the request from succeeding
$ virtctl fslist rhel8-puzzled-moth -n user-agent
Error listing filesystems of VirtualMachine rhel8-puzzled-moth, an error on the server ("Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \"rhel8-puzzled-moth\": VMI does not have guest agent connected") has prevented the request from succeeding

Note: before the migration, I could get all the info with virtctl with the VM running for 24 hours.
The issue with the GA being lost after migration should be addressed in https://github.com/kubevirt/kubevirt/pull/4982
Need to update the Fixed In Version and move the bug to ON_QA, since PR https://github.com/kubevirt/kubevirt/pull/5198 was backported and merged.
Verified on HCO v2.6.1-5.
Followed steps:
1) Create and start a VM - OK
2) Check guest OS info (virtctl guestosinfo/fslist/userlist) - OK
3) Wait ~1 hour and check guest OS info again - OK
4) Migrate the VM and check guest OS info after ~5 hours - OK
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (CNV 2.6.1 Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2021:1126