2143018 – VM on heterogeneous AMD cluster may experience significant timejumps and soft locks

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Container Native Virtualization (CNV)
Classification:	Red Hat
Component:	Virtualization
Sub Component:
Version:	4.11.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	urgent
Target Milestone:	---
Target Release:	4.11.3
Assignee:	Itamar Holder
QA Contact:	Denys Shchedrivyi
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2022-11-15 19:49 UTC by Denys Shchedrivyi
Modified:	2023-02-07 15:16 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2023-02-07 15:16:28 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	CNV-22522	0	None	None	None	2022-11-15 19:59:10 UTC
Red Hat Product Errata	RHEA-2023:0621	0	None	None	None	2023-02-07 15:16:38 UTC

Description Denys Shchedrivyi 2022-11-15 19:49:47 UTC

Description of problem:
 
 Reference: Bug 2125671

 On a heterogeneous AMD cluster where each node has its own TSC frequency, the lowest frequency is set as a schedulable frequency on all nodes, as expected:
> $ for i in $(oc get node -o name);do echo $i;oc describe $i | grep tsc-freq; done
> node/cnv-qe-infra-25.cnvqe2.lab.eng.rdu2.redhat.com
>                     cpu-timer.node.kubevirt.io/tsc-frequency=1800000000                        
>                     scheduling.node.kubevirt.io/tsc-frequency-1800000000=true         
> node/cnv-qe-infra-26.cnvqe2.lab.eng.rdu2.redhat.com
>                     cpu-timer.node.kubevirt.io/tsc-frequency=2500000000
>                     scheduling.node.kubevirt.io/tsc-frequency-1800000000=true            
>                     scheduling.node.kubevirt.io/tsc-frequency-2500000000=true
> node/cnv-qe-infra-27.cnvqe2.lab.eng.rdu2.redhat.com
>                     cpu-timer.node.kubevirt.io/tsc-frequency=3000000000
>                     scheduling.node.kubevirt.io/tsc-frequency-1800000000=true       
>                     scheduling.node.kubevirt.io/tsc-frequency-3000000000=true
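
 The label layout above can be checked mechanically. The following is a minimal sketch, assuming only the two label patterns visible in the output; the helper names `parse_tsc_labels` and `lowest_common_frequency` are hypothetical and not part of KubeVirt. It parses the text printed by the loop and reports the lowest native frequency only if every node carries the scheduling label for it.

```python
import re

# Label patterns copied from the output above.
OWN_FREQ = re.compile(r"cpu-timer\.node\.kubevirt\.io/tsc-frequency=(\d+)")
SCHED_FREQ = re.compile(r"scheduling\.node\.kubevirt\.io/tsc-frequency-(\d+)=true")


def parse_tsc_labels(text):
    """Map node name -> {"own": int or None, "schedulable": set of ints}."""
    nodes = {}
    current = None
    for raw in text.splitlines():
        # Drop the quoting prefix and indentation used in the report.
        line = raw.lstrip("> ").strip()
        if line.startswith("node/"):
            current = line
            nodes[current] = {"own": None, "schedulable": set()}
            continue
        if current is None:
            continue
        own = OWN_FREQ.search(line)
        if own:
            nodes[current]["own"] = int(own.group(1))
        sched = SCHED_FREQ.search(line)
        if sched:
            nodes[current]["schedulable"].add(int(sched.group(1)))
    return nodes


def lowest_common_frequency(nodes):
    """Return the lowest native frequency if every node is labeled for it, else None."""
    owns = [n["own"] for n in nodes.values() if n["own"] is not None]
    if not owns:
        return None
    lowest = min(owns)
    if all(lowest in n["schedulable"] for n in nodes.values()):
        return lowest
    return None
```

 Feeding the output of the `oc describe` loop to `parse_tsc_labels` and then calling `lowest_common_frequency` gives the frequency that a migratable VM would be expected to request.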

 The VM requests this frequency:
> bash-4.4$ virsh dumpxml 1 | grep tsc
>    <timer name='tsc' frequency='1800000000'/>
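
 The same check can be done without `grep` by parsing the domain XML. This is a minimal sketch, assuming the XML is the text produced by `virsh dumpxml`; the function name `tsc_frequency_from_domain_xml` is hypothetical.

```python
import xml.etree.ElementTree as ET


def tsc_frequency_from_domain_xml(domain_xml):
    """Return the frequency of the <timer name='tsc'> element, or None if absent."""
    root = ET.fromstring(domain_xml)
    # Search the whole tree so the exact nesting of <timer> does not matter.
    for timer in root.iter("timer"):
        if timer.get("name") == "tsc":
            freq = timer.get("frequency")
            return int(freq) if freq is not None else None
    return None
```

 The returned value can then be compared with the lowest frequency label found on the nodes.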

 
 However, the VM's logs may show time jumps right after startup or after migration:

> Nov 15 13:22:28 rhel-tsc-10 systemd[4839]: Startup finished in 27ms.
> Nov 15 13:22:28 rhel-tsc-10 systemd[1]: Started User Manager for UID 1000.
> Nov 15 13:22:28 rhel-tsc-10 systemd[1]: Started Session 2 of user fedora.
> Nov 15 16:20:18 rhel-tsc-10 kernel: clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
> Nov 15 16:20:18 rhel-tsc-10 kernel: clocksource:                       'kvm-clock' wd_now: a007fc3247f wd_last: 5368ba9ca5 mask: ffffffffffffffff
> Nov 15 16:20:18 rhel-tsc-10 kernel: clocksource:                       'tsc' cs_now: 1200f50582d2 cs_last: 96329863ce mask: ffffffffffffffff
> Nov 15 16:20:18 rhel-tsc-10 kernel: tsc: Marking TSC unstable due to clocksource watchdog
> Nov 15 16:20:18 rhel-tsc-10 systemd[1]: Starting dnf makecache...


 The guest also switches its clocksource from tsc to kvm-clock:

> # cat /sys/devices/system/clocksource/clocksource0/current_clocksource 
> kvm-clock
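
 When checking many guests, the same sysfs file can be read programmatically. This is a minimal sketch to be run inside the guest; the path comes from the command above, while the function names `current_clocksource` and `fell_back_from_tsc` are hypothetical. The `base` parameter exists only so the directory can be overridden for testing.

```python
from pathlib import Path

# Directory taken from the command above.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def current_clocksource(base=CLOCKSOURCE_DIR):
    """Return the active clocksource name, e.g. 'tsc' or 'kvm-clock'."""
    return (Path(base) / "current_clocksource").read_text().strip()


def fell_back_from_tsc(base=CLOCKSOURCE_DIR):
    """True if 'tsc' is no longer the active clocksource."""
    return current_clocksource(base) != "tsc"
```

 A guest affected by this bug would be expected to report `kvm-clock` here, as in the output above.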


Version-Release number of selected component (if applicable):
4.11

Comment 2 Kedar Bidarkar 2022-11-16 14:00:13 UTC

As part of qe_test_coverage, we plan to add a test case that covers migration between nodes with different CPU frequencies.

Comment 5 Denys Shchedrivyi 2023-01-05 01:00:28 UTC

Verified on OCP 4.11.20 + CNV v4.11.2-21:

 Kernel on the node:

> $ uname -r
> 4.18.0-372.36.1.el8_6.x86_64


 Created 15 VMs and migrated them multiple times: all VMs remain accessible, with no time jumps and no clocksource switch to kvm-clock.

Comment 12 errata-xmlrpc 2023-02-07 15:16:28 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Virtualization 4.11.3 Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2023:0621
