Bug 2064160

Summary: "failed to sync guest time" log spam in destination virt-launcher pod during VM live migration
Product: Container Native Virtualization (CNV)
Component: Virtualization
Version: 4.9.3
Target Release: 4.13.3
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: nijin ashok <nashok>
Assignee: lpivarc
QA Contact: zhe peng <zpeng>
CC: acardace, iholder, kbidarka, lpivarc, sgott
Fixed In Version: hco-bundle-registry-container-v4.13.1.rhel9-79
Type: Bug
Last Closed: 2023-08-16 14:09:56 UTC

Description nijin ashok 2022-03-15 07:42:56 UTC
Description of problem:

The destination domain is in a paused state during virtual machine live migration. However, virt-launcher continuously tries to set the guest time while the VM is paused, which results in the events below being logged every second.

~~~
{"component":"virt-launcher","kind":"","level":"warning","msg":"failed to sync guest time","name":"rhel8-organisational-pony","namespace":"nijin-cnv","pos":"manager.go:261","reason":"virError(Code=55, Domain=10, Message='Requested operation is not valid: domain is not running')","timestamp":"2022-03-15T07:16:14.786101Z","uid":"31323a86-8f9d-425b-aec9-5ac514e8279d"}
{"component":"virt-launcher","kind":"","level":"warning","msg":"failed to sync guest time","name":"rhel8-organisational-pony","namespace":"nijin-cnv","pos":"manager.go:261","reason":"virError(Code=55, Domain=10, Message='Requested operation is not valid: domain is not running')","timestamp":"2022-03-15T07:16:15.785591Z","uid":"31323a86-8f9d-425b-aec9-5ac514e8279d"}
{"component":"virt-launcher","kind":"","level":"warning","msg":"failed to sync guest time","name":"rhel8-organisational-pony","namespace":"nijin-cnv","pos":"manager.go:261","reason":"virError(Code=55, Domain=10, Message='Requested operation is not valid: domain is not running')","timestamp":"2022-03-15T07:16:16.794454Z","uid":"31323a86-8f9d-425b-aec9-5ac514e8279d"}
~~~

The error is expected from libvirtd when the VM is paused:

~~~
virsh -r list |grep -i nijin_test_2
 5    nijin_test_2           paused

virsh domtime nijin_test_2 --now
error: Requested operation is not valid: domain is not running
~~~
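
For illustration only, here is a minimal Go sketch of how the spam could be avoided while the domain is paused: check the domain state before attempting the sync and treat a non-running domain as an expected, silent case rather than a warning. The type and helpers (DomainState, getState, setTime) are hypothetical stand-ins, not the actual KubeVirt or libvirt API.

~~~
package main

import (
	"fmt"
	"log"
)

// DomainState is a hypothetical stand-in for the libvirt domain state enum.
type DomainState int

const (
	DomainRunning DomainState = iota
	DomainPaused
)

// trySyncGuestTime sketches one way to avoid the log spam: only attempt the
// guest time sync when the domain is actually running, and treat the
// "domain is not running" case as expected rather than logging a warning.
// getState and setTime are hypothetical helpers standing in for libvirt calls.
func trySyncGuestTime(getState func() DomainState, setTime func() error) error {
	if getState() != DomainRunning {
		// Expected while the target domain is still paused during migration;
		// nothing to log at warning level.
		return nil
	}
	if err := setTime(); err != nil {
		return fmt.Errorf("failed to sync guest time: %w", err)
	}
	return nil
}

func main() {
	// Simulate the migration target: the domain is still paused.
	err := trySyncGuestTime(
		func() DomainState { return DomainPaused },
		func() error { return nil },
	)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("guest time sync skipped: domain is not running yet")
}
~~~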


Version-Release number of selected component (if applicable):

kubevirt-hyperconverged-operator.v4.9.3

How reproducible:

100%

Steps to Reproduce:

1. Live migrate a virtual machine. 
2. Watch for the logs in the destination virt-launcher pod.

Actual results:

"failed to sync guest time" log spam in destination virt-launcher pod during VM live migration

Expected results:

These messages spam the destination virt-launcher pod log, so I think they should not be logged while the migration is in progress.

Additional info:

Comment 2 Itamar Holder 2022-05-16 09:21:24 UTC
Hey @nashok,

The log message comes from here: https://github.com/kubevirt/kubevirt/blob/main/pkg/virt-launcher/virtwrap/manager.go#L237.

Note that, following the code, I see that the setGuestTime() function is called either when a VMI un-pauses or when a migration finishes, so I don't think the problem is that we log it during migration. Maybe it comes from the fact that this operation can fail right after migration and is therefore retried a few times?

It's also important to ask - how spammy is that message? I mean, it's common to have such "error-retry" messages in our logs a few times. Does it appear 3 times or 30 times?

Thanks,
Itamar Holder.

Comment 3 nijin ashok 2022-05-31 12:41:30 UTC
(In reply to Itamar Holder from comment #2)
> Hey @nashok,
> 
> The log message comes from here:
> https://github.com/kubevirt/kubevirt/blob/main/pkg/virt-launcher/virtwrap/
> manager.go#L237.
> 

I think this is called from https://github.com/kubevirt/kubevirt/blob/5bf9394089f6790279aad9ac2001a7a230d34ec3/pkg/virt-handler/vm.go#L681, and this is an early stage of the migration (the domain has just been detected in the target virt-launcher). The actual migration of memory starts after this, and for a big VM it is going to take time. During this time, the target virt-launcher logs this warning every second and stops after a minute (the ticker timeout is 60 seconds).
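
As an illustration, here is a minimal, self-contained Go sketch of the retry pattern described above (hypothetical names, not the actual KubeVirt code): one attempt per second, a warning on every failed attempt, and a final error after 60 seconds. While the target domain stays paused for the whole memory copy, every attempt fails, which yields roughly 60 warning lines per migration.

~~~
package main

import (
	"errors"
	"log"
	"time"
)

// syncGuestTime is a sketch of the retry loop described above: attempt the
// guest time sync once per second, give up after 60 seconds, and log a
// warning on every failed attempt.
func syncGuestTime(setTime func() error) {
	ticker := time.NewTicker(1 * time.Second)
	defer ticker.Stop()
	timeout := time.After(60 * time.Second)

	for {
		select {
		case <-timeout:
			// Final error after one minute, as seen at manager.go:267.
			log.Print("failed to sync guest time")
			return
		case <-ticker.C:
			if err := setTime(); err != nil {
				// One warning per second as long as the domain is not running.
				log.Printf("warning: failed to sync guest time: %v", err)
				continue
			}
			return
		}
	}
}

func main() {
	// Simulate a domain that stays paused for the entire migration window.
	syncGuestTime(func() error {
		return errors.New("Requested operation is not valid: domain is not running")
	})
}
~~~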

Output from target virt-launcher pod.
~~~
cluster-up/kubectl.sh logs  virt-launcher-vm-cirros-mdqn4|grep -i "failed to sync guest time" |wc -l
selecting docker as container runtime
60

cluster-up/kubectl.sh logs  virt-launcher-vm-cirros-mdqn4|grep -i "failed to sync guest time" |sed -n '1p;$p'
selecting docker as container runtime
{"component":"virt-launcher","kind":"","level":"warning","msg":"failed to sync guest time","name":"vm-cirros","namespace":"default","pos":"manager.go:295","reason":"virError(Code=55, Domain=10, Message='Requested operation is not valid: domain is not running')","timestamp":"2022-05-31T12:13:29.300151Z","uid":"6fd68580-a322-470b-a1ce-938b6f556cf1"}
{"component":"virt-launcher","kind":"","level":"error","msg":"failed to sync guest time","name":"vm-cirros","namespace":"default","pos":"manager.go:267","timestamp":"2022-05-31T12:14:28.299731Z","uid":"6fd68580-a322-470b-a1ce-938b6f556cf1"}
[root@rhel_83 kubevirt]#
~~~

> It's also important to ask - how spammy is that message? I mean, it's common
> to have such "error-retry" messages in our logs for a few times. Does it
> appear for 3 times or 30 times?
> 

It's very spammy; it is logged 60 times for a long-running migration.

It's easy to reproduce. Just reduce bandwidthPerMigration to a low value and watch the destination virt-launcher pod logs.
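
For example, assuming the setting lives under spec.liveMigrationConfig in the HyperConverged CR (as in recent CNV releases; adjust the CR name and namespace for your cluster), something like:

~~~
oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv --type=merge \
  -p '{"spec":{"liveMigrationConfig":{"bandwidthPerMigration":"1Mi"}}}'
~~~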

Comment 4 Kedar Bidarkar 2022-09-14 12:27:52 UTC
Deferring this to 4.13 due to the medium severity and limited capacity.

Comment 7 Kedar Bidarkar 2023-06-21 12:21:15 UTC
Moving this to 4.13.3, as it needs more time to investigate per the Virt devs.

Comment 8 sgott 2023-06-21 12:21:43 UTC
Lubo, any progress on this?

Comment 10 lpivarc 2023-07-31 13:46:00 UTC
This needs to be retested with the latest release.

Comment 13 zhe peng 2023-08-03 08:03:06 UTC
Verified with build CNV-v4.13.3.rhel9-160.

Steps are the same as in comment 6.

Scenario 1: default settings
$ oc get pod
NAME                            READY   STATUS      RESTARTS   AGE
virt-launcher-vm-fedora-9zvg5   1/1     Running     0          26s
virt-launcher-vm-fedora-hrcmz   0/1     Completed   0          3m9s
$ oc  logs virt-launcher-vm-fedora-9zvg5 | grep "failed to sync guest time" | wc -l
0

Scenario 2: low bandwidth (migration takes more than 20 minutes)
$ oc logs virt-launcher-vm-fedora-hh5h5 | grep "failed to sync guest time"| wc -l
2


Moving to VERIFIED.

Comment 23 errata-xmlrpc 2023-08-16 14:09:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.13.3 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:4664