Bug 1886453 - During rsync the domain state is received without runtime information, some fields (such as guest agent connection state) are omitted
Summary: During rsync the domain state is received without runtime information, some ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 2.5.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 2.6.1
Assignee: lpivarc
QA Contact: vsibirsk
URL:
Whiteboard:
Depends On:
Blocks: 1883875 1946081 1946082
Reported: 2020-10-08 13:37 UTC by vsibirsk
Modified: 2021-04-07 08:46 UTC (History)

Fixed In Version: hco-bundle-registry-container-v2.6.0-489 virt-operator-container-v2.6.0-100
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1946082 (view as bug list)
Environment:
Last Closed: 2021-04-07 08:46:03 UTC
Target Upstream Version:
Embargoed:


Attachments
fix domain manager to get domain XML with runtime data (46 bytes, patch)
2020-10-21 10:32 UTC, Daniel Belenky


Links
System ID Private Priority Status Summary Last Updated
Github kubevirt kubevirt pull 4395 0 None closed domain manager: fix timed resync 2021-02-21 13:53:17 UTC
Red Hat Product Errata RHEA-2021:1126 0 None None None 2021-04-07 08:46:36 UTC

Description vsibirsk 2020-10-08 13:37:56 UTC
Description of problem:
Guest agent info is not available via virtctl after running systemctl restart <service_name> (e.g. ssh or the guest agent service itself) from the guest OS.

Version-Release number of selected component (if applicable):
$ virtctl version
Client Version: version.Info{GitVersion:"v0.34.0-rc.0-22-g156076b", GitCommit:"156076b1a9241493551578788c29b666aeca7167", GitTreeState:"clean", BuildDate:"2020-10-04T13:16:13Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{GitVersion:"v0.34.0-rc.0-6-gad89f92", GitCommit:"ad89f923b784b46fd989e95feb5409ae707cb130", GitTreeState:"clean", BuildDate:"2020-10-02T09:12:02Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

$ oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-fc.9   True        False         3d2h    Cluster version is 4.6.0-fc.9

$ oc get csv -n openshift-cnv
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v2.5.0   OpenShift Virtualization   2.5.0     kubevirt-hyperconverged-operator.v2.4.1   Succeeded

How reproducible:
100%

Steps to Reproduce:
1. Create and run a VM with the guest agent installed
2. Run "systemctl restart sshd" from the guest OS
3. Run virtctl guestosinfo <vm_name>

Actual results:
$ virtctl -n supported-os-common-templates-fedora-test-fedora-os-support guestosinfo fedora-31-1602162245-87957
{"component":"","level":"error","msg":"Cannot retrieve GuestOSInfo: an error on the server (\"Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \\\"fedora-31-1602162245-87957\\\": VMI does not have guest agent connected\") has prevented the request from succeeding","pos":"vmi.go:449","timestamp":"2020-10-08T13:30:14.896722Z"}
Error getting guestosinfo of VirtualMachine fedora-31-1602162245-87957, an error on the server ("Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \"fedora-31-1602162245-87957\": VMI does not have guest agent connected") has prevented the request from succeeding


Expected results:
$ virtctl -n supported-os-common-templates-fedora-test-fedora-os-support guestosinfo fedora-31-1602162245-87957
{
  "guestAgentVersion": "4.1.1",
  "hostname": "ibm-p8-kvm-03-guest-02",
  "os": {
    "name": "Fedora",
    "kernelRelease": "5.4.17-200.fc31.x86_64",
    "version": "31 (Cloud Edition)",
    "prettyName": "Fedora 31 (Cloud Edition)",
    "versionId": "31",
    "kernelVersion": "#1 SMP Sat Feb 1 19:00:13 UTC 2020",
    "machine": "x86_64",
    "id": "fedora"
  },
  "timezone": "UTC, 0",
  "fsInfo": {
    "disks": [
      {
        "diskName": "vda1",
        "mountPoint": "/",
        "fileSystemType": "ext4",
        "usedBytes": 1696858112,
        "totalBytes": 25220722688
      }
    ]
  }
}


Additional info:
Although virtctl returns an error, the info is still available via the API endpoint /apis/subresources.kubevirt.io/v1alpha3/namespaces/{namespace}/virtualmachineinstances/{name}/guestosinfo
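
A minimal sketch of how the endpoint path above can be filled in for a specific VMI. The path template is copied from the report; the helper name `guestosinfo_path` is hypothetical, and the API server host and authentication are deliberately left out.

```python
from urllib.parse import quote

# Path template as quoted in the report; host and credentials are not shown.
GUESTOSINFO_PATH = (
    "/apis/subresources.kubevirt.io/v1alpha3"
    "/namespaces/{namespace}/virtualmachineinstances/{name}/guestosinfo"
)


def guestosinfo_path(namespace: str, name: str) -> str:
    """Return the guestosinfo subresource path for one VMI (hypothetical helper)."""
    return GUESTOSINFO_PATH.format(
        namespace=quote(namespace, safe=""),
        name=quote(name, safe=""),
    )


# Namespace and VMI name taken from the "Actual results" section.
print(guestosinfo_path(
    "supported-os-common-templates-fedora-test-fedora-os-support",
    "fedora-31-1602162245-87957",
))
```

Assuming a logged-in `oc` session, the resulting path could then be fetched with something like `oc get --raw <path>` to compare the raw API answer against what virtctl reports.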

Comment 4 sgott 2020-10-09 18:14:03 UTC
Targeting this to the next release. It's unclear to me why the severity was designated as urgent. Please voice your concern if you feel this truly is urgent and needs to be addressed immediately.

Comment 5 Israel Pinto 2020-10-11 11:53:42 UTC
1. The GA data is an important tool for the end user. In addition, "Although the virtctl returns error, it is still available via api endpoint /apis/subresources.kubevirt.io/v1alpha3/namespaces/{namespace}/virtualmachineinstances/{name}/guestosinfo"
may point to a more serious issue. At least let's see what the root cause is before pushing it to 2.6.
2. This is a regression. We reboot the GA since the BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1845127
The fix is in RHEL 8.3 user space.

Comment 6 Ruth Netser 2020-10-14 07:03:30 UTC
@Daniel
Reproduced now with:
1. image from http://download.eng.bos.redhat.com/brewroot/packages/rhel-guest-image/8.3/402/images/rhel-guest-image-8.3-402.x86_64.qcow2
apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: rhel-8-3-dv
spec:
  source:
      http:
         url: "http://cnv-qe-server.rhevdev.lab.eng.rdu2.redhat.com/files/cnv-tests/rhel-images/rhel-83.qcow2"
  pvc:
    storageClassName: hostpath-provisioner
    volumeMode: Filesystem
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 25Gi

2. VM:
oc process -n openshift rhel7-server-tiny-v0.11.3 -p PVCNAME=rhel-8-3-dv -p NAME=rhel-8-3-vm -p CLOUD_USER_PASSWORD=redhat  | oc create -n default -f -

3. VMI has only qemu-guest-agent

[cloud-user@rhel-8-3-vm ~]$ sudo systemctl status qemu-guest-agent
● qemu-guest-agent.service - QEMU Guest Agent
   Loaded: loaded (/usr/lib/systemd/system/qemu-guest-agent.service; disabled; >
   Active: active (running) since Wed 2020-10-14 02:45:38 EDT; 5min ago
 Main PID: 807 (qemu-ga)
    Tasks: 1 (limit: 4761)
   Memory: 1.6M
   CGroup: /system.slice/qemu-guest-agent.service
           └─807 /usr/bin/qemu-ga --method=virtio-serial --path=/dev/virtio-por>

Oct 14 02:45:38 localhost.localdomain systemd[1]: Started QEMU Guest Agent.
[cloud-user@rhel-8-3-vm ~]$ sudo systemctl status virt-guest-agent
Unit virt-guest-agent.service could not be found.


4. VMI has guest agent info
  Guest OS Info:
    Id:              rhel
    Kernel Release:  4.18.0-240.el8.x86_64
    Kernel Version:  #1 SMP Wed Sep 23 05:13:10 EDT 2020
    Name:            Red Hat Enterprise Linux
    Pretty Name:     Red Hat Enterprise Linux 8.3 (Ootpa)
    Version:         8.3
    Version Id:      8.3


5. sudo systemctl restart qemu-guest-agent  

6. Wait for a few minutes

-- guest agent info is no longer in vmi describe
Status:
  Active Pods:
    263df3ae-0deb-49ac-8f22-2d80a7c029b6:  ruty-250-13-s7whp-worker-0-4wg84
  Conditions:
    Last Probe Time:       <nil>
    Last Transition Time:  <nil>
    Message:               cannot migrate VMI with non-shared PVCs
    Reason:                DisksNotLiveMigratable
    Status:                False
    Type:                  LiveMigratable
    Last Probe Time:       <nil>
    Last Transition Time:  2020-10-14T06:44:42Z
    Status:                True
    Type:                  Ready
    Last Probe Time:       2020-10-14T06:51:59Z
    Last Transition Time:  <nil>
    Status:                True
    Type:                  AgentConnected
  Guest OS Info:
  Interfaces:
    Interface Name:  eth0
    Ip Address:      10.128.2.53
    Ip Addresses:
      10.128.2.53
    Mac:             02:00:00:77:31:31
    Name:            default


-- After some more time, virtctl fails to retrieve info
$ virtctl userlist rhel-8-3-vm 
Error listing users of VirtualMachine rhel-8-3-vm, an error on the server ("Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \"rhel-8-3-vm\": VMI does not have guest agent connected") has prevented the request from succeeding
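
The intermediate state shown in step 6 (AgentConnected still True while "Guest OS Info" is empty) can be detected mechanically. The sketch below assumes the VMI has been dumped as JSON (for example with `oc get vmi <name> -o json`) and that the describe labels map to the JSON keys `status.conditions[].type`, `status.conditions[].status` and `status.guestOSInfo`; the function name is hypothetical.

```python
def agent_connected_without_info(vmi: dict) -> bool:
    """True when the VMI reports AgentConnected but carries no guest OS info."""
    status = vmi.get("status", {})
    connected = any(
        cond.get("type") == "AgentConnected" and cond.get("status") == "True"
        for cond in status.get("conditions", [])
    )
    # An absent or empty guestOSInfo mapping counts as "no info".
    return connected and not status.get("guestOSInfo")


# Hand-written sample mirroring the describe output above (not real API output).
sample_vmi = {
    "status": {
        "conditions": [
            {"type": "LiveMigratable", "status": "False"},
            {"type": "Ready", "status": "True"},
            {"type": "AgentConnected", "status": "True"},
        ],
        "guestOSInfo": {},
    }
}

print(agent_connected_without_info(sample_vmi))
```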

Comment 7 Ruth Netser 2020-10-14 07:49:41 UTC
Some more info: we're losing guest agent connectivity after a while (also on Windows after pause/unpause of the VMI?)

    Id:              mswindows
    Kernel Release:  17763
    Kernel Version:  10.0
    Name:            Microsoft Windows
    Pretty Name:     Windows Server 2019 Standard
    Version:         2019
    Version Id:      2019


Paused:

    Type:                  Paused
  Guest OS Info:
  Interfaces:
    Interface Name:  Ethernet 2


VMI is unpaused:

$ virtctl userlist -n supported-os-common-templates-windows-test-windows-os-support win-19-1602661172-9470403
{
  "metadata": {},
  "items": [
    {
      "userName": "Administrator",
      "domain": "WIN-CUCKQ65DH6K",
      "loginTime": 1602686609.525345
    }
  ]
}

$ virtctl userlist -n supported-os-common-templates-windows-test-windows-os-support win-19-1602661172-9470403
Error listing users of VirtualMachine win-19-1602661172-9470403, an error on the server ("Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \"win-19-1602661172-9470403\": VMI does not have guest agent connected") has prevented the request from succeeding

Comment 8 Daniel Belenky 2020-10-20 11:29:55 UTC
What I think is happening here is that roughly once every ~10 min we receive an empty state of the guest agent channel through the domain notification channel.
This causes us to mark the agent as disconnected, which is incorrect since the agent can still be queried and replies properly.
I don't think that restarting services such as sshd has any impact on this bug...

Comment 9 Daniel Belenky 2020-10-21 10:03:35 UTC
The root cause of this bug is a timed resync that we're doing on our domains. During that resync loop, we're getting the domain state without its runtime information, so some of the fields, such as the guest agent's connection state, are omitted. I'm preparing a fix.
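
To illustrate the failure mode described here, the sketch below parses two hand-written domain XML fragments. It assumes the libvirt convention that the `state` attribute on the guest agent channel target is runtime data, present in the live XML and absent otherwise. This is not the code from the fix; the function names and sample XML are hypothetical, and the "keep the previous value" fallback only shows why a missing attribute should not be read as "disconnected".

```python
import xml.etree.ElementTree as ET
from typing import Optional

AGENT_CHANNEL = "org.qemu.guest_agent.0"

# Hand-written fragments for illustration only.
XML_WITH_RUNTIME = """
<domain type='kvm'>
  <devices>
    <channel type='unix'>
      <target type='virtio' name='org.qemu.guest_agent.0' state='connected'/>
    </channel>
  </devices>
</domain>
"""
XML_WITHOUT_RUNTIME = XML_WITH_RUNTIME.replace(" state='connected'", "")


def agent_channel_state(domain_xml: str) -> Optional[str]:
    """Return the agent channel's state attribute, or None if it is omitted."""
    root = ET.fromstring(domain_xml)
    for target in root.findall("./devices/channel/target"):
        if target.get("name") == AGENT_CHANNEL:
            return target.get("state")
    return None


def agent_connected(previous: bool, domain_xml: str) -> bool:
    """Update the connection flag, treating a missing state as 'unknown'."""
    state = agent_channel_state(domain_xml)
    if state is None:
        # No runtime data in this snapshot: keep the last known value
        # instead of flipping the agent to disconnected.
        return previous
    return state == "connected"


print(agent_channel_state(XML_WITH_RUNTIME))
print(agent_channel_state(XML_WITHOUT_RUNTIME))
```

A resync that naively evaluates `state == "connected"` on the second fragment would report the agent as disconnected even though nothing changed in the guest, which matches the symptom in this report.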

Comment 10 Daniel Belenky 2020-10-21 10:32:38 UTC
Created attachment 1723157 [details]
fix domain manager to get domain XML with runtime data

Comment 12 Fabian Deutsch 2020-11-18 13:22:32 UTC
Ah, a PR is available: https://github.com/kubevirt/kubevirt/pull/4395

Comment 13 Kedar Bidarkar 2020-11-19 14:52:16 UTC
We suspect this is a duplicate of this bug https://bugzilla.redhat.com/show_bug.cgi?id=1883875

Comment 14 sgott 2020-12-07 15:05:25 UTC
PR that addresses this: https://github.com/kubevirt/kubevirt/pull/4628

Comment 17 Kedar Bidarkar 2021-02-01 18:55:05 UTC
[kbidarka@localhost migration]$ oc get csv -n openshift-cnv 
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v2.6.0   OpenShift Virtualization   2.6.0     kubevirt-hyperconverged-operator.v2.5.3   Succeeded

[kbidarka@localhost migration]$ virtctl userlist vm-rhel83-nfs
Error listing users of VirtualMachine vm-rhel83-nfs, an error on the server ("Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \"vm-rhel83-nfs\": VMI does not have guest agent connected") has prevented the request from succeeding

[kbidarka@localhost migration]$ virtctl userlist vm1-rhel83-nfs
Error listing users of VirtualMachine vm1-rhel83-nfs, an error on the server ("Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \"vm1-rhel83-nfs\": VMI does not have guest agent connected") has prevented the request from succeeding

I suspect that this may be seen after a migration, not sure though. Also, reproducing this instantly is a challenge, as we need to wait a few hours for the issue to occur.

Will update here, after I do the following:
1) Create a VM, restart qemu-guest-agent, fetch userlist info, after a few hours fetch userlist info again.
2) Create a VM, restart qemu-guest-agent, fetch userlist info, migrate the VM, fetch userlist info, after a few hours fetch userlist info again.
3) Create a VM, do not restart qemu-guest-agent, fetch userlist info, migrate the VM, fetch userlist info, after a few hours fetch userlist info again.

Comment 18 Israel Pinto 2021-02-02 14:21:21 UTC
I migrated the VM and waited for 3 hours; we lost the info:
$ virtctl guestosinfo  rhel8-puzzled-moth -n user-agent
{"component":"","level":"error","msg":"Cannot retrieve GuestOSInfo: an error on the server (\"Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \\\"rhel8-puzzled-moth\\\": VMI does not have guest agent connected\") has prevented the request from succeeding","pos":"vmi.go:449","timestamp":"2021-02-02T12:28:26.348784Z"}
Error getting guestosinfo of VirtualMachine rhel8-puzzled-moth, an error on the server ("Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \"rhel8-puzzled-moth\": VMI does not have guest agent connected") has prevented the request from succeeding

$ virtctl fslist rhel8-puzzled-moth -n user-agent
Error listing filesystems of VirtualMachine rhel8-puzzled-moth, an error on the server ("Operation cannot be fulfilled on virtualmachineinstance.kubevirt.io \"rhel8-puzzled-moth\": VMI does not have guest agent connected") has prevented the request from succeeding

Note: Before migration I could get all the info with virtctl, with the VM running for 24 hours.

Comment 19 lpivarc 2021-02-08 09:06:57 UTC
Issue with GA being lost after migration should be addressed in https://github.com/kubevirt/kubevirt/pull/4982

Comment 23 Shaul Garbourg 2021-03-22 09:23:27 UTC
Need to update the Fixed In Version and move the bug to ON_QA, since the PR https://github.com/kubevirt/kubevirt/pull/5198 was backported and merged

Comment 24 vsibirsk 2021-03-24 20:01:34 UTC
Verified on hco v2.6.1-5

Followed steps:
1) create and start vm - OK
2) check guest os info (virtctl get guestosinfo/fslist/userlist) - OK
3) wait ~ 1 hour and check guest os info again - OK
4) migrate vm and check guest os info after ~5 hours - OK
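
The repeated checks in these steps could be scripted roughly as follows. This is a sketch under assumptions: it uses the `virtctl <subcommand> <vm> -n <namespace>` form seen earlier in this report, looks for the error string quoted in the earlier comments in either output stream, and the names `run_virtctl` and `check_agent` are hypothetical. Waiting and migration are left to the caller.

```python
import subprocess
from typing import Callable, Dict, List

# Error text quoted in the earlier comments of this report.
DISCONNECTED_MARKER = "VMI does not have guest agent connected"
SUBCOMMANDS = ("guestosinfo", "fslist", "userlist")


def run_virtctl(args: List[str]) -> subprocess.CompletedProcess:
    """Run virtctl with the given arguments and capture its output."""
    return subprocess.run(["virtctl", *args], capture_output=True, text=True)


def check_agent(
    vm: str,
    namespace: str,
    run: Callable[[List[str]], subprocess.CompletedProcess] = run_virtctl,
) -> Dict[str, bool]:
    """Return, per subcommand, whether guest agent data could be fetched."""
    results = {}
    for sub in SUBCOMMANDS:
        proc = run([sub, vm, "-n", namespace])
        output = (proc.stdout or "") + (proc.stderr or "")
        results[sub] = proc.returncode == 0 and DISCONNECTED_MARKER not in output
    return results
```

Calling `check_agent("rhel8-puzzled-moth", "user-agent")` right after start, again after about an hour, and again a few hours after a migration would mirror steps 2 to 4; the `run` parameter exists only so the function can be exercised without a cluster.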

Comment 29 errata-xmlrpc 2021-04-07 08:46:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (CNV 2.6.1 Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:1126

