+++ This bug was initially created as a clone of Bug #1886453 +++

Description of problem:
Guest agent info is not available via virtctl after running "systemctl restart <service_name>" (for example, ssh or the guest agent service itself) from the guest OS.

Version-Release number of selected component (if applicable):
$ virtctl version
Client Version: version.Info{GitVersion:"v0.34.0-rc.0-22-g156076b", GitCommit:"156076b1a9241493551578788c29b666aeca7167", GitTreeState:"clean", BuildDate:"2020-10-04T13:16:13Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{GitVersion:"v0.34.0-rc.0-6-gad89f92", GitCommit:"ad89f923b784b46fd989e95feb5409ae707cb130", GitTreeState:"clean", BuildDate:"2020-10-02T09:12:02Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

$ oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-fc.9   True        False         3d2h    Cluster version is 4.6.0-fc.9

$ oc get csv -n openshift-cnv
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v2.5.0   OpenShift Virtualization   2.5.0     kubevirt-hyperconverged-operator.v2.4.1   Succeeded

How reproducible:
100%

Steps to Reproduce:
1. Create and run a VM with the guest agent installed
2. Run "systemctl restart sshd" from the guest OS
3. Run virtctl guestosinfo <vm_name>

Actual results:
$ virtctl -n supported-os-common-templates-fedora-test-fedora-os-support guestosinfo fedora-31-1602162245-87957
{"component":"","level":"error","msg":"Cannot retrieve GuestOSInfo: an error on the server (\"Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \\\"fedora-31-1602162245-87957\\\": VMI does not have guest agent connected\") has prevented the request from succeeding","pos":"vmi.go:449","timestamp":"2020-10-08T13:30:14.896722Z"}
Error getting guestosinfo of VirtualMachine fedora-31-1602162245-87957, an error on the server ("Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \"fedora-31-1602162245-87957\": VMI does not have guest agent connected") has prevented the request from succeeding

Expected results:
$ virtctl -n supported-os-common-templates-fedora-test-fedora-os-support guestosinfo fedora-31-1602162245-87957
{
    "guestAgentVersion": "4.1.1",
    "hostname": "ibm-p8-kvm-03-guest-02",
    "os": {
        "name": "Fedora",
        "kernelRelease": "5.4.17-200.fc31.x86_64",
        "version": "31 (Cloud Edition)",
        "prettyName": "Fedora 31 (Cloud Edition)",
        "versionId": "31",
        "kernelVersion": "#1 SMP Sat Feb 1 19:00:13 UTC 2020",
        "machine": "x86_64",
        "id": "fedora"
    },
    "timezone": "UTC, 0",
    "fsInfo": {
        "disks": [
            {
                "diskName": "vda1",
                "mountPoint": "/",
                "fileSystemType": "ext4",
                "usedBytes": 1696858112,
                "totalBytes": 25220722688
            }
        ]
    }
}

Additional info:
Although virtctl returns an error, the info is still available via the API endpoint
/apis/subresources.kubevirt.io/v1alpha3/namespaces/{namespace}/virtualmachineinstances/{name}/guestosinfo

--- Additional comment from on 2020-10-08 13:39:16 UTC ---

--- Additional comment from RHEL Program Management on 2020-10-08 13:46:54 UTC ---

This request has been proposed as a blocker, but a release flag has not been requested. Please set a release flag to ? to ensure we may track this bug against the appropriate upcoming release, and reset the blocker flag to ?.

--- Additional comment from on 2020-10-09 01:33:16 UTC ---

Why is this considered so urgent? There does not appear to be risk of data loss, and the guest agent is technically optional.

--- Additional comment from on 2020-10-09 18:14:03 UTC ---

Targeting this to the next release. It's unclear to me why the severity was designated as urgent. Please voice your concern if you feel this truly is urgent and needs to be addressed immediately.

--- Additional comment from Israel Pinto on 2020-10-11 11:53:42 UTC ---

1.
The GA data is an important tool for the end user. In addition, the note "Although the virtctl returns error, it is still available via api endpoint /apis/subresources.kubevirt.io/v1alpha3/namespaces/{namespace}/virtualmachineinstances/{name}/guestosinfo" may point to a more serious issue. At least let's see what the root cause is before pushing it to 2.6.
2. This is a regression. We reboot the GA since the BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1845127
The fix is in RHEL 8.3 user space.

--- Additional comment from Ruth Netser on 2020-10-14 07:03:30 UTC ---

@Daniel
Reproduced now with:

1. Image from http://download.eng.bos.redhat.com/brewroot/packages/rhel-guest-image/8.3/402/images/rhel-guest-image-8.3-402.x86_64.qcow2

apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: rhel-8-3-dv
spec:
  source:
    http:
      url: "http://cnv-qe-server.rhevdev.lab.eng.rdu2.redhat.com/files/cnv-tests/rhel-images/rhel-83.qcow2"
  pvc:
    storageClassName: hostpath-provisioner
    volumeMode: Filesystem
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 25Gi

2. VM:
oc process -n openshift rhel7-server-tiny-v0.11.3 -p PVCNAME=rhel-8-3-dv -p NAME=rhel-8-3-vm -p CLOUD_USER_PASSWORD=redhat | oc create -n default -f -

3. VMI has only qemu-guest-agent
[cloud-user@rhel-8-3-vm ~]$ sudo systemctl status qemu-guest-agent
● qemu-guest-agent.service - QEMU Guest Agent
   Loaded: loaded (/usr/lib/systemd/system/qemu-guest-agent.service; disabled; >
   Active: active (running) since Wed 2020-10-14 02:45:38 EDT; 5min ago
 Main PID: 807 (qemu-ga)
    Tasks: 1 (limit: 4761)
   Memory: 1.6M
   CGroup: /system.slice/qemu-guest-agent.service
           └─807 /usr/bin/qemu-ga --method=virtio-serial --path=/dev/virtio-por>

Oct 14 02:45:38 localhost.localdomain systemd[1]: Started QEMU Guest Agent.

[cloud-user@rhel-8-3-vm ~]$ sudo systemctl status virt-guest-agent
Unit virt-guest-agent.service could not be found.

4.
VMI has guest agent info
  Guest OS Info:
    Id: rhel
    Kernel Release: 4.18.0-240.el8.x86_64
    Kernel Version: #1 SMP Wed Sep 23 05:13:10 EDT 2020
    Name: Red Hat Enterprise Linux
    Pretty Name: Red Hat Enterprise Linux 8.3 (Ootpa)
    Version: 8.3
    Version Id: 8.3

5. sudo systemctl restart qemu-guest-agent

6. Wait for a few minutes -- guest agent info is no longer in vmi describe
Status:
  Active Pods:
    263df3ae-0deb-49ac-8f22-2d80a7c029b6: ruty-250-13-s7whp-worker-0-4wg84
  Conditions:
    Last Probe Time: <nil>
    Last Transition Time: <nil>
    Message: cannot migrate VMI with non-shared PVCs
    Reason: DisksNotLiveMigratable
    Status: False
    Type: LiveMigratable
    Last Probe Time: <nil>
    Last Transition Time: 2020-10-14T06:44:42Z
    Status: True
    Type: Ready
    Last Probe Time: 2020-10-14T06:51:59Z
    Last Transition Time: <nil>
    Status: True
    Type: AgentConnected
  Guest OS Info:
  Interfaces:
    Interface Name: eth0
    Ip Address: 10.128.2.53
    Ip Addresses:
      10.128.2.53
    Mac: 02:00:00:77:31:31
    Name: default

-- After some more time, virtctl fails to retrieve info
$ virtctl userlist rhel-8-3-vm
Error listing users of VirtualMachine rhel-8-3-vm, an error on the server ("Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \"rhel-8-3-vm\": VMI does not have guest agent connected") has prevented the request from succeeding

--- Additional comment from Ruth Netser on 2020-10-14 07:49:41 UTC ---

Some more info - we're losing guest agent connectivity after a while (also on Windows after pause/unpause of the VMI?)
Id: mswindows
Kernel Release: 17763
Kernel Version: 10.0
Name: Microsoft Windows
Pretty Name: Windows Server 2019 Standard
Version: 2019
Version Id: 2019

Paused:
Type: Paused
Guest OS Info:
Interfaces:
  Interface Name: Ethernet 2

VMI is unpaused:
$ virtctl userlist -n supported-os-common-templates-windows-test-windows-os-support win-19-1602661172-9470403
{
    "metadata": {},
    "items": [
        {
            "userName": "Administrator",
            "domain": "WIN-CUCKQ65DH6K",
            "loginTime": 1602686609.525345
        }
    ]
}

$ virtctl userlist -n supported-os-common-templates-windows-test-windows-os-support win-19-1602661172-9470403
Error listing users of VirtualMachine win-19-1602661172-9470403, an error on the server ("Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \"win-19-1602661172-9470403\": VMI does not have guest agent connected") has prevented the request from succeeding

--- Additional comment from Daniel Belenky on 2020-10-20 11:29:55 UTC ---

What I think is happening here is that roughly once every 10 minutes we receive an empty state for the guest agent channel through the domain notification channel. This causes us to incorrectly mark the agent as disconnected, even though the agent can still be queried and replies properly.
I don't think that restarting services such as sshd has any impact on this bug.

--- Additional comment from Daniel Belenky on 2020-10-21 10:03:35 UTC ---

The root cause of this bug is a timed resync that we perform on our domains. During that resync loop, we get the domain state without its runtime information, so some fields, such as the guest agent's connection state, are omitted. I'm preparing a fix.

--- Additional comment from Daniel Belenky on 2020-10-21 10:32:38 UTC ---

--- Additional comment from Fabian Deutsch on 2020-11-18 13:21:41 UTC ---

Daniel, what's the status of this bug? Was a fix provided?
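Editorial illustration of the root cause Daniel Belenky describes above (a timed resync that returns domain state without runtime information). This is only a sketch under stated assumptions: the snapshot dictionary, the `agent_connected` key, and both function names are hypothetical and are not taken from the KubeVirt code base, which is written in Go. It contrasts treating an omitted field as "disconnected" with treating it as "unknown".

```python
from typing import Mapping


def naive_agent_state(snapshot: Mapping[str, object]) -> bool:
    """Hypothetical buggy behaviour: a missing field is read as 'disconnected'."""
    return bool(snapshot.get("agent_connected", False))


def resync_agent_state(previous: bool, snapshot: Mapping[str, object]) -> bool:
    """Hypothetical safer behaviour: a missing field keeps the last known state."""
    if "agent_connected" not in snapshot:
        # The resync snapshot carried no runtime info, so it says nothing
        # about the agent; do not overwrite what we already know.
        return previous
    return bool(snapshot["agent_connected"])


# A resync snapshot without runtime information (hypothetical shape).
partial_snapshot = {"name": "rhel-8-3-vm"}
# A snapshot that explicitly reports the agent channel as disconnected.
disconnected_snapshot = {"name": "rhel-8-3-vm", "agent_connected": False}
```

With these definitions, `naive_agent_state(partial_snapshot)` flips a connected agent to disconnected, while `resync_agent_state(True, partial_snapshot)` keeps it connected and still honours an explicit disconnect.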

--- Additional comment from Fabian Deutsch on 2020-11-18 13:22:32 UTC ---

Ah, a PR is available: https://github.com/kubevirt/kubevirt/pull/4395

--- Additional comment from Kedar Bidarkar on 2020-11-19 14:52:16 UTC ---

We suspect this is a duplicate of this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1883875

--- Additional comment from on 2020-12-07 15:05:25 UTC ---

PR that addresses this: https://github.com/kubevirt/kubevirt/pull/4628

--- Additional comment from on 2020-12-16 14:41:37 UTC ---

This is committed in the stable branch in changeset 50a3f7d558ed841d4c4251479ee40cf807117e86 but is not yet included in a downstream build.

--- Additional comment from on 2021-01-28 12:54:27 UTC ---

This was fixed in 2.6.0

--- Additional comment from Kedar Bidarkar on 2021-02-01 18:55:05 UTC ---

[kbidarka@localhost migration]$ oc get csv -n openshift-cnv
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v2.6.0   OpenShift Virtualization   2.6.0     kubevirt-hyperconverged-operator.v2.5.3   Succeeded

[kbidarka@localhost migration]$ virtctl userlist vm-rhel83-nfs
Error listing users of VirtualMachine vm-rhel83-nfs, an error on the server ("Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \"vm-rhel83-nfs\": VMI does not have guest agent connected") has prevented the request from succeeding

[kbidarka@localhost migration]$ virtctl userlist vm1-rhel83-nfs
Error listing users of VirtualMachine vm1-rhel83-nfs, an error on the server ("Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \"vm1-rhel83-nfs\": VMI does not have guest agent connected") has prevented the request from succeeding

I suspect that this may be seen after a migration, but I'm not sure.
Also, reproducing this instantly is a challenge, as we need to wait a few hours for the issue to occur.

Will update here after I do the following:
1) Create a VM, restart qemu-guest-agent, fetch userlist info, and after a few hours fetch userlist info again.
2) Create a VM, restart qemu-guest-agent, fetch userlist info, migrate the VM, fetch userlist info, and after a few hours fetch userlist info again.
3) Create a VM, do not restart qemu-guest-agent, fetch userlist info, migrate the VM, fetch userlist info, and after a few hours fetch userlist info again.

--- Additional comment from Israel Pinto on 2021-02-02 14:21:21 UTC ---

I migrated the VM and waited for 3 hours; we lost the info:
$ virtctl guestosinfo rhel8-puzzled-moth -n user-agent
{"component":"","level":"error","msg":"Cannot retrieve GuestOSInfo: an error on the server (\"Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \\\"rhel8-puzzled-moth\\\": VMI does not have guest agent connected\") has prevented the request from succeeding","pos":"vmi.go:449","timestamp":"2021-02-02T12:28:26.348784Z"}
Error getting guestosinfo of VirtualMachine rhel8-puzzled-moth, an error on the server ("Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \"rhel8-puzzled-moth\": VMI does not have guest agent connected") has prevented the request from succeeding

$ virtctl fslist rhel8-puzzled-moth -n user-agent
Error listing filesystems of VirtualMachine rhel8-puzzled-moth, an error on the server ("Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \"rhel8-puzzled-moth\": VMI does not have guest agent connected") has prevented the request from succeeding

Note: Before migration I could get all the info with virtctl, with the VM running for 24 hours.

--- Additional comment from on 2021-02-08 09:06:57 UTC ---

The issue with the GA being lost after migration should be addressed in https://github.com/kubevirt/kubevirt/pull/4982

--- Additional comment from on 2021-02-16 19:05:31 UTC ---

PR was merged. Waiting for 2.6.0 to be released before backporting this to the stable release branch.

--- Additional comment from on 2021-03-10 22:12:23 UTC ---

Lubo, can you please backport the relevant PRs to the release-0.36 branch?

--- Additional comment from on 2021-03-15 09:55:00 UTC ---

It's almost there: https://github.com/kubevirt/kubevirt/pull/5198

--- Additional comment from Shaul Garbourg on 2021-03-22 09:23:27 UTC ---

Need to update the Fixed version and move the bug to ON_QA, since the PR https://github.com/kubevirt/kubevirt/pull/5198 was backported and merged.

--- Additional comment from on 2021-03-24 20:01:34 UTC ---

Verified on hco v2.6.1-5

Followed steps:
1) create and start vm - OK
2) check guest os info (virtctl get guestosinfo/fslist/userlist) - OK
3) wait ~ 1 hour and check guest os info again - OK
4) migrate vm and check guest os info after ~ 5 hours - OK

Verified with build:
hco-bundle-registry-container-v2.5.6-65
virt-operator-container-v2.5.6-3

Steps:
Scenario 1:
1. Create and start a Fedora VM
2. Run "systemctl restart sshd" from the guest OS
3. Run virtctl guestosinfo $vm
Guest info can be retrieved.

Scenario 2:
1. Create and start a RHEL VM
2. Run "systemctl restart qemu-guest-agent"
3. Run virtctl guestosinfo $vm
Guest info can be retrieved.

Scenario 3:
1. Create and start a Windows VM
2. Pause/unpause the VM
3. Run virtctl guestosinfo $vm
Guest info can be retrieved.

Moving to verified.
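
Editorial sketch of how the pass/fail check in the scenarios above could be automated. It is hedged: the function name `guest_info_available` is hypothetical, and the only strings relied on are those quoted earlier in this report (the "VMI does not have guest agent connected" error text and the `guestAgentVersion` key in the expected `virtctl guestosinfo` output). It inspects captured output only and does not invoke virtctl itself.

```python
import json

# Error text quoted in the virtctl failures reported in this bug.
AGENT_ERROR = "VMI does not have guest agent connected"


def guest_info_available(output: str) -> bool:
    """Return True if captured `virtctl guestosinfo` output looks like guest info."""
    if AGENT_ERROR in output:
        return False
    try:
        info = json.loads(output)
    except json.JSONDecodeError:
        # Neither the known error nor valid JSON: treat as not available.
        return False
    # The expected output in this report is a JSON object with guestAgentVersion.
    return isinstance(info, dict) and "guestAgentVersion" in info
```

A test harness could capture the command output after each scenario step and assert that this function returns True.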
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Virtualization 2.5.6 Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2021:2045