+++ This bug was initially created as a clone of Bug #1886453 +++
Description of problem:
Guest agent info is not available via virtctl after running systemctl restart <service_name> (e.g. ssh or the guest agent service itself) from the guest OS
Version-Release number of selected component (if applicable):
$ virtctl version
Client Version: version.Info{GitVersion:"v0.34.0-rc.0-22-g156076b", GitCommit:"156076b1a9241493551578788c29b666aeca7167", GitTreeState:"clean", BuildDate:"2020-10-04T13:16:13Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{GitVersion:"v0.34.0-rc.0-6-gad89f92", GitCommit:"ad89f923b784b46fd989e95feb5409ae707cb130", GitTreeState:"clean", BuildDate:"2020-10-02T09:12:02Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
$ oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-fc.9   True        False         3d2h    Cluster version is 4.6.0-fc.9
$ oc get csv -n openshift-cnv
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v2.5.0   OpenShift Virtualization   2.5.0     kubevirt-hyperconverged-operator.v2.4.1   Succeeded
How reproducible:
100%
Steps to Reproduce:
1. Create and run a VM with the guest agent installed
2. Run "systemctl restart sshd" from the guest OS
3. Run virtctl guestosinfo <vm_name>
Actual results:
$ virtctl -n supported-os-common-templates-fedora-test-fedora-os-support guestosinfo fedora-31-1602162245-87957
{"component":"","level":"error","msg":"Cannot retrieve GuestOSInfo: an error on the server (\"Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \\\"fedora-31-1602162245-87957\\\": VMI does not have guest agent connected\") has prevented the request from succeeding","pos":"vmi.go:449","timestamp":"2020-10-08T13:30:14.896722Z"}
Error getting guestosinfo of VirtualMachine fedora-31-1602162245-87957, an error on the server ("Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \"fedora-31-1602162245-87957\": VMI does not have guest agent connected") has prevented the request from succeeding
Expected results:
$ virtctl -n supported-os-common-templates-fedora-test-fedora-os-support guestosinfo fedora-31-1602162245-87957
{
  "guestAgentVersion": "4.1.1",
  "hostname": "ibm-p8-kvm-03-guest-02",
  "os": {
    "name": "Fedora",
    "kernelRelease": "5.4.17-200.fc31.x86_64",
    "version": "31 (Cloud Edition)",
    "prettyName": "Fedora 31 (Cloud Edition)",
    "versionId": "31",
    "kernelVersion": "#1 SMP Sat Feb 1 19:00:13 UTC 2020",
    "machine": "x86_64",
    "id": "fedora"
  },
  "timezone": "UTC, 0",
  "fsInfo": {
    "disks": [
      {
        "diskName": "vda1",
        "mountPoint": "/",
        "fileSystemType": "ext4",
        "usedBytes": 1696858112,
        "totalBytes": 25220722688
      }
    ]
  }
}
Additional info:
Although virtctl returns an error, the guest OS info is still available via the API endpoint /apis/subresources.kubevirt.io/v1alpha3/namespaces/{namespace}/virtualmachineinstances/{name}/guestosinfo
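For anyone who wants to compare the virtctl result with the raw endpoint, here is a minimal sketch that only builds the request URL from the path quoted above. The function name guestosinfo_url and the api_server argument are made up for illustration; authentication and the HTTP call itself are left out, and the v1alpha3 version is simply the one from this report and may differ on other releases.

```python
from urllib.parse import quote

# Path template copied from the report above. The API version (v1alpha3) is the
# one quoted there; other releases may expose a different version.
GUESTOSINFO_PATH = (
    "/apis/subresources.kubevirt.io/v1alpha3"
    "/namespaces/{namespace}/virtualmachineinstances/{name}/guestosinfo"
)


def guestosinfo_url(api_server: str, namespace: str, name: str) -> str:
    """Return the full URL of the guestosinfo subresource for one VMI.

    api_server is a hypothetical base URL such as https://api.example:6443.
    """
    path = GUESTOSINFO_PATH.format(
        namespace=quote(namespace, safe=""),
        name=quote(name, safe=""),
    )
    # Avoid a double slash when the base URL ends with "/".
    return api_server.rstrip("/") + path
```

The resulting URL can then be requested with whatever authenticated client is at hand, and the response compared with the virtctl guestosinfo output.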
--- Additional comment from on 2020-10-08 13:39:16 UTC ---
--- Additional comment from RHEL Program Management on 2020-10-08 13:46:54 UTC ---
This request has been proposed as a blocker, but a release flag has not been requested. Please set a release flag to ? to ensure we may track this bug against the appropriate upcoming release, and reset the blocker flag to ?.
--- Additional comment from on 2020-10-09 01:33:16 UTC ---
Why is this considered so urgent? There does not appear to be risk of data loss, and the guest agent is technically optional.
--- Additional comment from on 2020-10-09 18:14:03 UTC ---
Targeting this to the next release. It's unclear to me why the severity was designated as urgent. Please voice your concern if you feel this truly is urgent and needs to be addressed immediately.
--- Additional comment from Israel Pinto on 2020-10-11 11:53:42 UTC ---
1. The GA data is an important tool for the end user. In addition: "Although the virtctl returns error, it is still available via api endpoint /apis/subresources.kubevirt.io/v1alpha3/namespaces/{namespace}/virtualmachineinstances/{name}/guestosinfo"
It can point to a more serious issue. At least let's see what the root cause is before pushing it to 2.6.
2. This is a regression; we restart the GA since the BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1845127
The fix is in RHEL 8.3 user space.
--- Additional comment from Ruth Netser on 2020-10-14 07:03:30 UTC ---
@Daniel
Reproduced now with:
1. image from http://download.eng.bos.redhat.com/brewroot/packages/rhel-guest-image/8.3/402/images/rhel-guest-image-8.3-402.x86_64.qcow2
apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: rhel-8-3-dv
spec:
  source:
    http:
      url: "http://cnv-qe-server.rhevdev.lab.eng.rdu2.redhat.com/files/cnv-tests/rhel-images/rhel-83.qcow2"
  pvc:
    storageClassName: hostpath-provisioner
    volumeMode: Filesystem
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 25Gi
2. VM:
oc process -n openshift rhel7-server-tiny-v0.11.3 -p PVCNAME=rhel-8-3-dv -p NAME=rhel-8-3-vm -p CLOUD_USER_PASSWORD=redhat | oc create -n default -f -
3. VMI has only qemu-guest-agent
[cloud-user@rhel-8-3-vm ~]$ sudo systemctl status qemu-guest-agent
● qemu-guest-agent.service - QEMU Guest Agent
Loaded: loaded (/usr/lib/systemd/system/qemu-guest-agent.service; disabled; >
Active: active (running) since Wed 2020-10-14 02:45:38 EDT; 5min ago
Main PID: 807 (qemu-ga)
Tasks: 1 (limit: 4761)
Memory: 1.6M
CGroup: /system.slice/qemu-guest-agent.service
└─807 /usr/bin/qemu-ga --method=virtio-serial --path=/dev/virtio-por>
Oct 14 02:45:38 localhost.localdomain systemd[1]: Started QEMU Guest Agent.
[cloud-user@rhel-8-3-vm ~]$ sudo systemctl status virt-guest-agent
Unit virt-guest-agent.service could not be found.
4. VMI has guest agent info
Guest OS Info:
Id: rhel
Kernel Release: 4.18.0-240.el8.x86_64
Kernel Version: #1 SMP Wed Sep 23 05:13:10 EDT 2020
Name: Red Hat Enterprise Linux
Pretty Name: Red Hat Enterprise Linux 8.3 (Ootpa)
Version: 8.3
Version Id: 8.3
5. sudo systemctl restart qemu-guest-agent
6. Wait for a few minutes
-- guest agent info is no longer in vmi describe
Status:
Active Pods:
263df3ae-0deb-49ac-8f22-2d80a7c029b6: ruty-250-13-s7whp-worker-0-4wg84
Conditions:
Last Probe Time: <nil>
Last Transition Time: <nil>
Message: cannot migrate VMI with non-shared PVCs
Reason: DisksNotLiveMigratable
Status: False
Type: LiveMigratable
Last Probe Time: <nil>
Last Transition Time: 2020-10-14T06:44:42Z
Status: True
Type: Ready
Last Probe Time: 2020-10-14T06:51:59Z
Last Transition Time: <nil>
Status: True
Type: AgentConnected
Guest OS Info:
Interfaces:
Interface Name: eth0
Ip Address: 10.128.2.53
Ip Addresses:
10.128.2.53
Mac: 02:00:00:77:31:31
Name: default
-- After some more time, virtctl fails to retrieve info
$ virtctl userlist rhel-8-3-vm
Error listing users of VirtualMachine rhel-8-3-vm, an error on the server ("Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \"rhel-8-3-vm\": VMI does not have guest agent connected") has prevented the request from succeeding
--- Additional comment from Ruth Netser on 2020-10-14 07:49:41 UTC ---
Some more info - we're losing guest agent connectivity after a while (also on Windows after pause/unpause of the VMI?)
Id: mswindows
Kernel Release: 17763
Kernel Version: 10.0
Name: Microsoft Windows
Pretty Name: Windows Server 2019 Standard
Version: 2019
Version Id: 2019
Paused:
Type: Paused
Guest OS Info:
Interfaces:
Interface Name: Ethernet 2
VMI is unpaused:
$ virtctl userlist -n supported-os-common-templates-windows-test-windows-os-support win-19-1602661172-9470403
{
  "metadata": {},
  "items": [
    {
      "userName": "Administrator",
      "domain": "WIN-CUCKQ65DH6K",
      "loginTime": 1602686609.525345
    }
  ]
}
$ virtctl userlist -n supported-os-common-templates-windows-test-windows-os-support win-19-1602661172-9470403
Error listing users of VirtualMachine win-19-1602661172-9470403, an error on the server ("Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \"win-19-1602661172-9470403\": VMI does not have guest agent connected") has prevented the request from succeeding
--- Additional comment from Daniel Belenky on 2020-10-20 11:29:55 UTC ---
What I think is happening here is that once every ~10 min we receive an empty state of the guest agent channel through the domain notification channel.
This causes us to incorrectly mark the agent as disconnected since the agent can still be queried and reply properly.
I don't think that restarting services such as sshd has any impact on this bug...
--- Additional comment from Daniel Belenky on 2020-10-21 10:03:35 UTC ---
The root cause of this bug is a timed resync that we perform on our domains. During that resync loop, we get the domain state without its runtime information, so some fields, such as the guest agent's connection state, are omitted. I'm preparing a fix.
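To illustrate the failure mode described in this comment, here is a small sketch. It is not KubeVirt code and not the fix from the PR; DomainSnapshot, naive_update and tolerant_update are made-up names. It only shows how a periodic snapshot that omits the agent state can flip a "connected" flag if the missing field is read as "disconnected", and one possible way of avoiding that.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DomainSnapshot:
    # None models a snapshot taken without runtime information, where the
    # guest agent channel state is simply absent (hypothetical model).
    agent_connected: Optional[bool]


def naive_update(previous: bool, snap: DomainSnapshot) -> bool:
    """Read a missing field as 'disconnected'. This mirrors the bug described."""
    return bool(snap.agent_connected)


def tolerant_update(previous: bool, snap: DomainSnapshot) -> bool:
    """Keep the last known state when the snapshot omits the field."""
    if snap.agent_connected is None:
        return previous
    return snap.agent_connected


if __name__ == "__main__":
    resync = DomainSnapshot(agent_connected=None)  # periodic resync, no runtime info
    print("naive:", naive_update(True, resync))
    print("tolerant:", tolerant_update(True, resync))
```

Another option, under the same assumptions, would be for the resync to always fetch the full state including runtime information, so that the field is never missing in the first place.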
--- Additional comment from Daniel Belenky on 2020-10-21 10:32:38 UTC ---
--- Additional comment from Fabian Deutsch on 2020-11-18 13:21:41 UTC ---
Daniel, what's the status of this bug? Was a fix provided?
--- Additional comment from Fabian Deutsch on 2020-11-18 13:22:32 UTC ---
Ah, a PR is available: https://github.com/kubevirt/kubevirt/pull/4395
--- Additional comment from Kedar Bidarkar on 2020-11-19 14:52:16 UTC ---
We suspect this is a duplicate of this bug https://bugzilla.redhat.com/show_bug.cgi?id=1883875
--- Additional comment from on 2020-12-07 15:05:25 UTC ---
PR that addresses this: https://github.com/kubevirt/kubevirt/pull/4628
--- Additional comment from on 2020-12-16 14:41:37 UTC ---
This is committed in the stable branch in this changeset
50a3f7d558ed841d4c4251479ee40cf807117e86
but is not yet included in a downstream build
--- Additional comment from on 2021-01-28 12:54:27 UTC ---
This was fixed in 2.6.0
--- Additional comment from Kedar Bidarkar on 2021-02-01 18:55:05 UTC ---
[kbidarka@localhost migration]$ oc get csv -n openshift-cnv
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v2.6.0   OpenShift Virtualization   2.6.0     kubevirt-hyperconverged-operator.v2.5.3   Succeeded
[kbidarka@localhost migration]$ virtctl userlist vm-rhel83-nfs
Error listing users of VirtualMachine vm-rhel83-nfs, an error on the server ("Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \"vm-rhel83-nfs\": VMI does not have guest agent connected") has prevented the request from succeeding
[kbidarka@localhost migration]$ virtctl userlist vm1-rhel83-nfs
Error listing users of VirtualMachine vm1-rhel83-nfs, an error on the server ("Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \"vm1-rhel83-nfs\": VMI does not have guest agent connected") has prevented the request from succeeding
I suspect that this may be seen after a migration, not sure though. Also, reproducing this quickly is a challenge, as we need to wait a few hours for the issue to occur.
Will update here after I do the following:
1) Create a VM, restart qemu-guest-agent, fetch userlist info, after a few hours fetch userlist info again.
2) Create a VM, restart qemu-guest-agent, fetch userlist info, migrate the VM, fetch userlist info, after a few hours fetch userlist info again.
3) Create a VM, do not restart qemu-guest-agent, fetch userlist info, migrate the VM, fetch userlist info, after a few hours fetch userlist info again.
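Since each scenario above means re-checking the same command over several hours, the check could be scripted. Below is a minimal sketch, assuming the command output is handed in as a string by a caller-supplied function. The names agent_disconnected and poll_agent are hypothetical; the error text is the one quoted throughout this report.

```python
import time
from typing import Callable, List

# Error text as it appears in the virtctl output quoted in this report.
AGENT_ERROR = "VMI does not have guest agent connected"


def agent_disconnected(output: str) -> bool:
    """Return True if the captured command output contains the agent error."""
    return AGENT_ERROR in output


def poll_agent(
    run: Callable[[], str],
    attempts: int,
    interval_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> List[bool]:
    """Call run() `attempts` times and record whether the agent looked connected.

    run is supplied by the caller and must return the combined stdout/stderr
    of one check, for example one invocation of `virtctl userlist <vm_name>`.
    """
    results = []
    for i in range(attempts):
        results.append(not agent_disconnected(run()))
        if i < attempts - 1:  # no need to wait after the last check
            sleep(interval_s)
    return results
```

A run function could wrap subprocess.run on the virtctl command and return stdout plus stderr; the first False in the returned list marks the check at which the agent was reported as disconnected.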
--- Additional comment from Israel Pinto on 2021-02-02 14:21:21 UTC ---
I migrated the VM and waited for 3 hours; we lost the info:
$ virtctl guestosinfo rhel8-puzzled-moth -n user-agent
{"component":"","level":"error","msg":"Cannot retrieve GuestOSInfo: an error on the server (\"Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \\\"rhel8-puzzled-moth\\\": VMI does not have guest agent connected\") has prevented the request from succeeding","pos":"vmi.go:449","timestamp":"2021-02-02T12:28:26.348784Z"}
Error getting guestosinfo of VirtualMachine rhel8-puzzled-moth, an error on the server ("Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \"rhel8-puzzled-moth\": VMI does not have guest agent connected") has prevented the request from succeeding
$ virtctl fslist rhel8-puzzled-moth -n user-agent
Error listing filesystems of VirtualMachine rhel8-puzzled-moth, an error on the server ("Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \"rhel8-puzzled-moth\": VMI does not have guest agent connected") has prevented the request from succeeding
Note: Before migration I could get all the info with virtctl, with the VM running for 24 hours.
--- Additional comment from on 2021-02-08 09:06:57 UTC ---
Issue with GA being lost after migration should be addressed in https://github.com/kubevirt/kubevirt/pull/4982
--- Additional comment from on 2021-02-16 19:05:31 UTC ---
PR was merged. Waiting for 2.6.0 to be released before backporting this to the stable release branch.
--- Additional comment from on 2021-03-10 22:12:23 UTC ---
Lubo,
Can you please backport relevant PRs to the release-0.36 branch?
--- Additional comment from on 2021-03-15 09:55:00 UTC ---
It's almost there https://github.com/kubevirt/kubevirt/pull/5198 .
--- Additional comment from Shaul Garbourg on 2021-03-22 09:23:27 UTC ---
Need to update the Fixed version and move the bug to ON_QA, since the PR https://github.com/kubevirt/kubevirt/pull/5198 was backported and merged
--- Additional comment from on 2021-03-24 20:01:34 UTC ---
Verified on hco v2.6.1-5
Followed steps:
1) create and start vm - OK
2) check guest os info (virtctl get guestosinfo/fslist/userlist) - OK
3) wait ~ 1 hour and check guest os info again - OK
4) migrate vm and check guest os info after ~5 hours - OK
verify with build
hco-bundle-registry-container-v2.5.6-65
virt-operator-container-v2.5.6-3
step:
scenario 1:
1 create and start fedora vm
2 run "systemctl restart sshd" from guest os
3 run virtctl guestosinfo $vm
can get guest info
scenario 2:
1 create and start rhel vm
2 run "systemctl restart qemu-guest-agent"
3 run virtctl guestosinfo $vm
can get guest info
scenario 3:
1 create and start windows vm
2 pause/unpause vm
3 run virtctl guestosinfo $vm
can get guest info
move to verified.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (OpenShift Virtualization 2.5.6 Images), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHEA-2021:2045