Bug 1946082 - During resync the domain state is received without runtime information, some fields (such as guest agent connection state) are omitted
Summary: During resync the domain state is received without runtime information, some ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 2.5.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 2.5.6
Assignee: lpivarc
QA Contact: zhe peng
URL:
Whiteboard:
Depends On: 1886453
Blocks: 1883875 1946081
Reported: 2021-04-04 07:04 UTC by Ruth Netser
Modified: 2021-05-19 14:56 UTC

Fixed In Version: hco-bundle-registry-container-v2.5.6-65 virt-operator-container-v2.5.6-3
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1886453
Environment:
Last Closed: 2021-05-19 14:56:28 UTC
Target Upstream Version:
Embargoed:


Links
System                  ID              Private  Priority  Status  Summary  Last Updated
Red Hat Product Errata  RHEA-2021:2045  0        None      None    None     2021-05-19 14:56:44 UTC

Description Ruth Netser 2021-04-04 07:04:47 UTC
+++ This bug was initially created as a clone of Bug #1886453 +++

Description of problem:
Guest agent info is not available via virtctl after running systemctl restart <service_name> (e.g. ssh or the guest agent service itself) from the guest OS.

Version-Release number of selected component (if applicable):
$ virtctl version
Client Version: version.Info{GitVersion:"v0.34.0-rc.0-22-g156076b", GitCommit:"156076b1a9241493551578788c29b666aeca7167", GitTreeState:"clean", BuildDate:"2020-10-04T13:16:13Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{GitVersion:"v0.34.0-rc.0-6-gad89f92", GitCommit:"ad89f923b784b46fd989e95feb5409ae707cb130", GitTreeState:"clean", BuildDate:"2020-10-02T09:12:02Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

$ oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-fc.9   True        False         3d2h    Cluster version is 4.6.0-fc.9

$ oc get csv -n openshift-cnv
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v2.5.0   OpenShift Virtualization   2.5.0     kubevirt-hyperconverged-operator.v2.4.1   Succeeded

How reproducible:
100%

Steps to Reproduce:
1. Create and run a VM with the guest agent installed
2. Run "systemctl restart sshd" from the guest OS
3. Run virtctl guestosinfo <vm_name>

Actual results:
$ virtctl -n supported-os-common-templates-fedora-test-fedora-os-support guestosinfo fedora-31-1602162245-87957
{"component":"","level":"error","msg":"Cannot retrieve GuestOSInfo: an error on the server (\"Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \\\"fedora-31-1602162245-87957\\\": VMI does not have guest agent connected\") has prevented the request from succeeding","pos":"vmi.go:449","timestamp":"2020-10-08T13:30:14.896722Z"}
Error getting guestosinfo of VirtualMachine fedora-31-1602162245-87957, an error on the server ("Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \"fedora-31-1602162245-87957\": VMI does not have guest agent connected") has prevented the request from succeeding
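For anyone scripting around this failure, a minimal sketch of how the two output lines above could be recognized. The helper name is hypothetical; the only things taken from this report are the JSON keys "level" and "msg" and the error text itself.

```python
import json

# Error text exactly as it appears in the virtctl output in this report.
AGENT_DISCONNECTED = "VMI does not have guest agent connected"


def is_agent_disconnected_error(log_line: str) -> bool:
    """Return True if a virtctl output line reports the agent as disconnected.

    Hypothetical helper: handles both the JSON log line and the plain
    "Error getting guestosinfo ..." line shown in this report.
    """
    try:
        record = json.loads(log_line)
    except json.JSONDecodeError:
        # Not JSON: fall back to a substring match on the plain-text line.
        return AGENT_DISCONNECTED in log_line
    if not isinstance(record, dict):
        return False
    return (
        record.get("level") == "error"
        and AGENT_DISCONNECTED in str(record.get("msg", ""))
    )
```

Matching on the message text is brittle if the wording changes between versions, so treat this as a triage aid only.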


Expected results:
$ virtctl -n supported-os-common-templates-fedora-test-fedora-os-support guestosinfo fedora-31-1602162245-87957
{
  "guestAgentVersion": "4.1.1",
  "hostname": "ibm-p8-kvm-03-guest-02",
  "os": {
    "name": "Fedora",
    "kernelRelease": "5.4.17-200.fc31.x86_64",
    "version": "31 (Cloud Edition)",
    "prettyName": "Fedora 31 (Cloud Edition)",
    "versionId": "31",
    "kernelVersion": "#1 SMP Sat Feb 1 19:00:13 UTC 2020",
    "machine": "x86_64",
    "id": "fedora"
  },
  "timezone": "UTC, 0",
  "fsInfo": {
    "disks": [
      {
        "diskName": "vda1",
        "mountPoint": "/",
        "fileSystemType": "ext4",
        "usedBytes": 1696858112,
        "totalBytes": 25220722688
      }
    ]
  }
}


Additional info:
Although virtctl returns an error, the information is still available via the API endpoint /apis/subresources.kubevirt.io/v1alpha3/namespaces/{namespace}/virtualmachineinstances/{name}/guestosinfo
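
To make the endpoint template concrete, here is a small sketch that fills in the namespace and VMI name. The function name is hypothetical; the path template is copied from this report, and the v1alpha3 API version is the one reported for this build and may differ elsewhere.

```python
from urllib.parse import quote

# Path template as quoted in this report (API version as reported here).
GUESTOSINFO_PATH = (
    "/apis/subresources.kubevirt.io/v1alpha3"
    "/namespaces/{namespace}/virtualmachineinstances/{name}/guestosinfo"
)


def guestosinfo_path(namespace: str, name: str) -> str:
    """Build the guestosinfo subresource path for a VMI (hypothetical helper)."""
    return GUESTOSINFO_PATH.format(
        namespace=quote(namespace, safe=""),  # percent-encode anything unusual
        name=quote(name, safe=""),
    )
```

For example, guestosinfo_path("default", "rhel-8-3-vm") yields the path for the RHEL VM used in the reproduction below.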

--- Additional comment from  on 2020-10-08 13:39:16 UTC ---



--- Additional comment from RHEL Program Management on 2020-10-08 13:46:54 UTC ---

This request has been proposed as a blocker, but a release flag has not been requested. Please set a release flag to ? to ensure we may track this bug against the appropriate upcoming release, and reset the blocker flag to ?.

--- Additional comment from  on 2020-10-09 01:33:16 UTC ---

Why is this considered so urgent? There does not appear to be risk of data loss, and the guest agent is technically optional.

--- Additional comment from  on 2020-10-09 18:14:03 UTC ---

Targeting this to the next release. It's unclear to me why the severity was designated as urgent. Please voice your concern if you feel this truly is urgent and needs to be addressed immediately.

--- Additional comment from Israel Pinto on 2020-10-11 11:53:42 UTC ---

1. The GA data is an important tool for the end user. In addition: "Although the virtctl returns error, it is still available via api endpoint /apis/subresources.kubevirt.io/v1alpha3/namespaces/{namespace}/virtualmachineinstances/{name}/guestosinfo"
This can point to a more serious issue. At least let's see what the root cause is before pushing it to 2.6.
2. This is a regression. We have been restarting the GA since BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1845127
The fix is in RHEL 8.3 user space.

--- Additional comment from Ruth Netser on 2020-10-14 07:03:30 UTC ---

@Daniel
Reproduced now with:
1. image from http://download.eng.bos.redhat.com/brewroot/packages/rhel-guest-image/8.3/402/images/rhel-guest-image-8.3-402.x86_64.qcow2
apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: rhel-8-3-dv
spec:
  source:
      http:
         url: "http://cnv-qe-server.rhevdev.lab.eng.rdu2.redhat.com/files/cnv-tests/rhel-images/rhel-83.qcow2"
  pvc:
    storageClassName: hostpath-provisioner
    volumeMode: Filesystem
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 25Gi

2. VM:
oc process -n openshift rhel7-server-tiny-v0.11.3 -p PVCNAME=rhel-8-3-dv -p NAME=rhel-8-3-vm -p CLOUD_USER_PASSWORD=redhat  | oc create -n default -f -

3. VMI has only qemu-guest-agent

[cloud-user@rhel-8-3-vm ~]$ sudo systemctl status qemu-guest-agent
● qemu-guest-agent.service - QEMU Guest Agent
   Loaded: loaded (/usr/lib/systemd/system/qemu-guest-agent.service; disabled; >
   Active: active (running) since Wed 2020-10-14 02:45:38 EDT; 5min ago
 Main PID: 807 (qemu-ga)
    Tasks: 1 (limit: 4761)
   Memory: 1.6M
   CGroup: /system.slice/qemu-guest-agent.service
           └─807 /usr/bin/qemu-ga --method=virtio-serial --path=/dev/virtio-por>

Oct 14 02:45:38 localhost.localdomain systemd[1]: Started QEMU Guest Agent.
[cloud-user@rhel-8-3-vm ~]$ sudo systemctl status virt-guest-agent
Unit virt-guest-agent.service could not be found.


4. VMI has guest agent info
  Guest OS Info:
    Id:              rhel
    Kernel Release:  4.18.0-240.el8.x86_64
    Kernel Version:  #1 SMP Wed Sep 23 05:13:10 EDT 2020
    Name:            Red Hat Enterprise Linux
    Pretty Name:     Red Hat Enterprise Linux 8.3 (Ootpa)
    Version:         8.3
    Version Id:      8.3


5. sudo systemctl restart qemu-guest-agent  

6. Wait for a few minutes

-- guest agent info is no longer in vmi describe
Status:
  Active Pods:
    263df3ae-0deb-49ac-8f22-2d80a7c029b6:  ruty-250-13-s7whp-worker-0-4wg84
  Conditions:
    Last Probe Time:       <nil>
    Last Transition Time:  <nil>
    Message:               cannot migrate VMI with non-shared PVCs
    Reason:                DisksNotLiveMigratable
    Status:                False
    Type:                  LiveMigratable
    Last Probe Time:       <nil>
    Last Transition Time:  2020-10-14T06:44:42Z
    Status:                True
    Type:                  Ready
    Last Probe Time:       2020-10-14T06:51:59Z
    Last Transition Time:  <nil>
    Status:                True
    Type:                  AgentConnected
  Guest OS Info:
  Interfaces:
    Interface Name:  eth0
    Ip Address:      10.128.2.53
    Ip Addresses:
      10.128.2.53
    Mac:             02:00:00:77:31:31
    Name:            default


-- After some more time, virtctl fails to retrieve info
$ virtctl userlist rhel-8-3-vm 
Error listing users of VirtualMachine rhel-8-3-vm, an error on the server ("Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \"rhel-8-3-vm\": VMI does not have guest agent connected") has prevented the request from succeeding
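
The describe output above shows an intermediate state: the AgentConnected condition is still True while Guest OS Info is already empty. A minimal sketch for flagging that state from a VMI status object follows. It assumes the status has been loaded as a dict (for example from JSON output of the VMI) and that the keys are "conditions" and "guestOSInfo"; the function name and the returned labels are hypothetical.

```python
def agent_state(status: dict) -> str:
    """Classify the guest agent state from a VMI status dict.

    Assumed keys: status["conditions"] is a list of {"type", "status"} dicts
    and status["guestOSInfo"] holds the guest OS fields. The labels returned
    here are made up for this sketch.
    """
    conditions = status.get("conditions") or []
    connected = any(
        c.get("type") == "AgentConnected" and c.get("status") == "True"
        for c in conditions
    )
    has_info = bool(status.get("guestOSInfo"))
    if connected and has_info:
        return "connected"
    if connected:
        # The state captured above: condition set, but no guest OS info.
        return "connected-without-info"
    return "disconnected"
```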

--- Additional comment from Ruth Netser on 2020-10-14 07:49:41 UTC ---

Some more info: we're losing guest agent connectivity after a while (also on Windows after pause/unpause of the VMI?)

    Id:              mswindows
    Kernel Release:  17763
    Kernel Version:  10.0
    Name:            Microsoft Windows
    Pretty Name:     Windows Server 2019 Standard
    Version:         2019
    Version Id:      2019


Paused:

    Type:                  Paused
  Guest OS Info:
  Interfaces:
    Interface Name:  Ethernet 2


VMI is unpaused:

$ virtctl userlist -n supported-os-common-templates-windows-test-windows-os-support win-19-1602661172-9470403
{
  "metadata": {},
  "items": [
    {
      "userName": "Administrator",
      "domain": "WIN-CUCKQ65DH6K",
      "loginTime": 1602686609.525345
    }
  ]
}

$ virtctl userlist -n supported-os-common-templates-windows-test-windows-os-support win-19-1602661172-9470403
Error listing users of VirtualMachine win-19-1602661172-9470403, an error on the server ("Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \"win-19-1602661172-9470403\": VMI does not have guest agent connected") has prevented the request from succeeding

--- Additional comment from Daniel Belenky on 2020-10-20 11:29:55 UTC ---

What I think is happening here is that about once every ~10 min we receive an empty state of the guest agent channel through the domain notification channel.
This causes us to incorrectly mark the agent as disconnected, even though the agent can still be queried and replies properly.
I don't think that restarting services such as sshd has any impact on this bug...
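
A toy model of the suspected failure mode, to make the hypothesis concrete. This is not KubeVirt code: the event shape, the "agent_channel_state" key and the function names are invented for illustration. The point is only that a handler which treats an omitted channel state as "disconnected" flips the result as soon as an event arrives without that field.

```python
def naive_agent_connected(event: dict) -> bool:
    """Derive the connection state from a single event.

    Suspected flaw being modeled: a missing "agent_channel_state" key
    (hypothetical name) is treated the same as "disconnected".
    """
    return event.get("agent_channel_state") == "connected"


def replay(events: list) -> list:
    """Return the connection state computed after each event in order."""
    return [naive_agent_connected(e) for e in events]


# Hypothetical event stream: a normal notification, then a periodic one
# that carries no channel state at all.
events = [
    {"source": "notification", "agent_channel_state": "connected"},
    {"source": "periodic"},  # channel state omitted
]
```

Replaying these two events reports the agent as connected and then as disconnected, although nothing changed in the guest.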

--- Additional comment from Daniel Belenky on 2020-10-21 10:03:35 UTC ---

The root cause for this bug is a timed resync that we're doing to our domains. During that resync loop, we're getting the domain state without its runtime information, so some of the fields, such as the guest agent's connection state, are omitted. I'm preparing a fix.
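
One way to guard against this class of problem, sketched under stated assumptions. This is not taken from the actual fix; the DomainState type, its fields and merge_resync are hypothetical. The idea is to distinguish "not reported" (None) from "reported as disconnected" (False) and keep the last known value when a resync snapshot omits the field.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class DomainState:
    """Hypothetical, simplified domain snapshot."""
    name: str
    # None means the snapshot carried no runtime information for the agent.
    agent_connected: Optional[bool] = None


def merge_resync(cached: DomainState, incoming: DomainState) -> DomainState:
    """Merge a resync snapshot into the cached state.

    If the incoming snapshot omits the agent state, keep the cached value
    instead of treating the omission as a disconnect.
    """
    if incoming.agent_connected is None:
        return replace(incoming, agent_connected=cached.agent_connected)
    return incoming
```

An alternative design is to make the resync fetch the full runtime state in the first place, which avoids carrying stale values forward; the merge above only illustrates why an omitted field must not be read as False.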

--- Additional comment from Daniel Belenky on 2020-10-21 10:32:38 UTC ---



--- Additional comment from Fabian Deutsch on 2020-11-18 13:21:41 UTC ---

Daniel, what's the status of this bug? Was a fix provided?

--- Additional comment from Fabian Deutsch on 2020-11-18 13:22:32 UTC ---

Ah, a PR is available: https://github.com/kubevirt/kubevirt/pull/4395

--- Additional comment from Kedar Bidarkar on 2020-11-19 14:52:16 UTC ---

We suspect this is a duplicate of this bug https://bugzilla.redhat.com/show_bug.cgi?id=1883875

--- Additional comment from  on 2020-12-07 15:05:25 UTC ---

PR that addresses this: https://github.com/kubevirt/kubevirt/pull/4628

--- Additional comment from  on 2020-12-16 14:41:37 UTC ---

This is committed in the stable branch in this changeset

50a3f7d558ed841d4c4251479ee40cf807117e86

but is not yet included in a downstream build

--- Additional comment from  on 2021-01-28 12:54:27 UTC ---

This was fixed in 2.6.0

--- Additional comment from Kedar Bidarkar on 2021-02-01 18:55:05 UTC ---

[kbidarka@localhost migration]$ oc get csv -n openshift-cnv 
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v2.6.0   OpenShift Virtualization   2.6.0     kubevirt-hyperconverged-operator.v2.5.3   Succeeded

[kbidarka@localhost migration]$ virtctl userlist vm-rhel83-nfs
Error listing users of VirtualMachine vm-rhel83-nfs, an error on the server ("Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \"vm-rhel83-nfs\": VMI does not have guest agent connected") has prevented the request from succeeding

[kbidarka@localhost migration]$ virtctl userlist vm1-rhel83-nfs
Error listing users of VirtualMachine vm1-rhel83-nfs, an error on the server ("Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \"vm1-rhel83-nfs\": VMI does not have guest agent connected") has prevented the request from succeeding

I suspect that this may be seen after a migration, though I'm not sure. Also, reproducing this instantly is a challenge, as we need to wait a few hours for the issue to occur.

Will update here, after I do the following:
1) Create a VM, restart qemu-guest-agent, fetch userlist info, and after a few hours fetch userlist info again.
2) Create a VM, restart qemu-guest-agent, fetch userlist info, migrate the VM, fetch userlist info, and after a few hours fetch userlist info again.
3) Create a VM, do not restart qemu-guest-agent, fetch userlist info, migrate the VM, fetch userlist info, and after a few hours fetch userlist info again.

--- Additional comment from Israel Pinto on 2021-02-02 14:21:21 UTC ---

I migrated the VM and waited for 3 hours; we lost the info:
$ virtctl guestosinfo  rhel8-puzzled-moth -n user-agent
{"component":"","level":"error","msg":"Cannot retrieve GuestOSInfo: an error on the server (\"Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \\\"rhel8-puzzled-moth\\\": VMI does not have guest agent connected\") has prevented the request from succeeding","pos":"vmi.go:449","timestamp":"2021-02-02T12:28:26.348784Z"}
Error getting guestosinfo of VirtualMachine rhel8-puzzled-moth, an error on the server ("Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \"rhel8-puzzled-moth\": VMI does not have guest agent connected") has prevented the request from succeeding

$ virtctl fslist rhel8-puzzled-moth -n user-agent
Error listing filesystems of VirtualMachine rhel8-puzzled-moth, an error on the server ("Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \"rhel8-puzzled-moth\": VMI does not have guest agent connected") has prevented the request from succeeding

Note: Before migration I could get all the info with virtctl, with the VM running for 24 hours.

--- Additional comment from  on 2021-02-08 09:06:57 UTC ---

Issue with GA being lost after migration should be addressed in https://github.com/kubevirt/kubevirt/pull/4982

--- Additional comment from  on 2021-02-16 19:05:31 UTC ---

PR was merged. Waiting for 2.6.0 to be released before backporting this to the stable release branch.

--- Additional comment from  on 2021-03-10 22:12:23 UTC ---

Lubo,

Can you please backport relevant PRs to the release-0.36 branch?

--- Additional comment from  on 2021-03-15 09:55:00 UTC ---

It's almost there: https://github.com/kubevirt/kubevirt/pull/5198

--- Additional comment from Shaul Garbourg on 2021-03-22 09:23:27 UTC ---

We need to update the Fixed In Version and move the bug to ON_QA, since the PR https://github.com/kubevirt/kubevirt/pull/5198 was backported and merged.

--- Additional comment from  on 2021-03-24 20:01:34 UTC ---

Verified on hco v2.6.1-5

Followed steps:
1) create and start vm - OK
2) check guest os info (virtctl get guestosinfo/fslist/userlist) - OK
3) wait ~ 1 hour and check guest os info again - OK
4) migrate vm and check guest os info after ~5 hours - OK
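
Since the issue only shows up after a long wait, the "check, wait, check again" steps above lend themselves to a small polling loop. A minimal sketch follows; the function and parameter names are hypothetical, and the actual check (for example, running virtctl guestosinfo for the VM) is passed in as a callable so the loop itself makes no assumption about how virtctl reports failure.

```python
import time
from typing import Callable


def guest_info_stays_available(
    check: Callable[[], bool],
    duration_s: float,
    interval_s: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll check() until duration_s has elapsed.

    Returns False on the first failed check, True if every check passed.
    sleep and clock are injectable so the loop can be exercised quickly.
    """
    deadline = clock() + duration_s
    while True:
        if not check():
            return False
        if clock() >= deadline:
            return True
        sleep(interval_s)
```

With a check that wraps the virtctl call, something like guest_info_stays_available(check, duration_s=5 * 3600, interval_s=600) would cover the ~5 hour window used in step 4.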

Comment 2 zhe peng 2021-04-23 08:46:42 UTC
verify with build 
hco-bundle-registry-container-v2.5.6-65
virt-operator-container-v2.5.6-3

step:
scenario 1:
1 create and start fedora vm 
2 run "systemctl restart sshd" from guest os
3 run virtctl guestosinfo $vm
can get guest info 

scenario 2:
1 create and start rhel vm
2 run "systemctl restart qemu-guest-agent"
3 run virtctl guestosinfo $vm
can get guest info

scenario 3:
1 create and start windows vm
2 pause/unpause vm 
3 run virtctl guestosinfo $vm
can get guest info

move to verified.

Comment 7 errata-xmlrpc 2021-05-19 14:56:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Virtualization 2.5.6 Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:2045

