Bug 2143018 - VM on heterogeneous AMD cluster may experience significant timejumps and soft locks
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 4.11.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.11.3
Assignee: Itamar Holder
QA Contact: Denys Shchedrivyi
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-11-15 19:49 UTC by Denys Shchedrivyi
Modified: 2023-02-07 15:16 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-02-07 15:16:28 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Issue Tracker | CNV-22522 | 0 | None | None | None | 2022-11-15 19:59:10 UTC
Red Hat Product Errata | RHEA-2023:0621 | 0 | None | None | None | 2023-02-07 15:16:38 UTC

Description Denys Shchedrivyi 2022-11-15 19:49:47 UTC
Description of problem:
 
 Reference: Bug 2125671

 On a heterogeneous AMD cluster where each node has its own TSC frequency, the lowest frequency is set as a scheduling label on all nodes, as expected:
> $ for i in $(oc get node -o name);do echo $i;oc describe $i | grep tsc-freq; done
> node/cnv-qe-infra-25.cnvqe2.lab.eng.rdu2.redhat.com
>                     cpu-timer.node.kubevirt.io/tsc-frequency=1800000000                        
>                     scheduling.node.kubevirt.io/tsc-frequency-1800000000=true         
> node/cnv-qe-infra-26.cnvqe2.lab.eng.rdu2.redhat.com
>                     cpu-timer.node.kubevirt.io/tsc-frequency=2500000000
>                     scheduling.node.kubevirt.io/tsc-frequency-1800000000=true            
>                     scheduling.node.kubevirt.io/tsc-frequency-2500000000=true
> node/cnv-qe-infra-27.cnvqe2.lab.eng.rdu2.redhat.com
>                     cpu-timer.node.kubevirt.io/tsc-frequency=3000000000
>                     scheduling.node.kubevirt.io/tsc-frequency-1800000000=true       
>                     scheduling.node.kubevirt.io/tsc-frequency-3000000000=true
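
 The listing above can also be checked mechanically. The following is a minimal sketch, assuming the input has the same shape as that listing (a `node/<name>` line followed by its label lines, without the `> ` quoting prefix). `common_tsc_frequency` is a hypothetical helper name, not part of `oc` or KubeVirt.

```shell
#!/usr/bin/env bash
# Sketch: read a node/label listing on stdin and print the lowest TSC
# frequency that every node carries as a scheduling label.
# `common_tsc_frequency` is a hypothetical helper, not a KubeVirt tool.
common_tsc_frequency() {
  awk '
    # Each "node/<name>" line starts a new node.
    /^node\// { nodes++; next }
    # Pick up labels such as .../tsc-frequency-1800000000=true
    match($0, /scheduling\.node\.kubevirt\.io\/tsc-frequency-[0-9]+=true/) {
      label = substr($0, RSTART, RLENGTH)
      sub(/^.*tsc-frequency-/, "", label)
      sub(/=true$/, "", label)
      count[label]++
    }
    END {
      best = ""
      # Keep only frequencies seen on every node, then take the lowest.
      for (f in count)
        if (count[f] == nodes && (best == "" || f + 0 < best + 0))
          best = f
      if (best == "") exit 1
      print best
    }
  '
}
```

 It could be fed from the same loop, for example `for i in $(oc get node -o name); do echo $i; oc describe $i | grep tsc-freq; done | common_tsc_frequency`. The function exits non-zero when no frequency is shared by all nodes.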

 And the VM requests this frequency:
> bash-4.4$ virsh dumpxml 1 | grep tsc
>    <timer name='tsc' frequency='1800000000'/>

 
 However, the VM may log time jumps right after it starts or after a migration:

> Nov 15 13:22:28 rhel-tsc-10 systemd[4839]: Startup finished in 27ms.
> Nov 15 13:22:28 rhel-tsc-10 systemd[1]: Started User Manager for UID 1000.
> Nov 15 13:22:28 rhel-tsc-10 systemd[1]: Started Session 2 of user fedora.
> Nov 15 16:20:18 rhel-tsc-10 kernel: clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
> Nov 15 16:20:18 rhel-tsc-10 kernel: clocksource:                       'kvm-clock' wd_now: a007fc3247f wd_last: 5368ba9ca5 mask: ffffffffffffffff
> Nov 15 16:20:18 rhel-tsc-10 kernel: clocksource:                       'tsc' cs_now: 1200f50582d2 cs_last: 96329863ce mask: ffffffffffffffff
> Nov 15 16:20:18 rhel-tsc-10 kernel: tsc: Marking TSC unstable due to clocksource watchdog
> Nov 15 16:20:18 rhel-tsc-10 systemd[1]: Starting dnf makecache...
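
 The hex counters in the watchdog message above can be converted to elapsed time to see how far the two clocks disagree. This is only a sketch under two assumptions that the log itself does not state: that kvm-clock counts nanoseconds, and that the guest TSC ticks at the 1800000000 Hz requested in the domain XML. `interval_ns` and `tsc_cycles_to_ns` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: turn the hex counters from a clocksource watchdog message into
# elapsed nanoseconds so the two clocks can be compared.
# Assumptions (not stated in the log): kvm-clock counts nanoseconds, and the
# guest TSC ticks at the frequency requested in the domain XML.

# Elapsed watchdog time between two kvm-clock samples, taken as nanoseconds.
interval_ns() {  # args: now_hex last_hex
  echo $(( 0x$1 - 0x$2 ))
}

# Convert a TSC cycle delta to nanoseconds at a given frequency in Hz.
# Whole seconds and the remainder are converted separately so the
# intermediate products stay within 64-bit integer arithmetic.
tsc_cycles_to_ns() {  # args: now_hex last_hex freq_hz
  local cycles=$(( 0x$1 - 0x$2 )) freq=$3
  echo $(( cycles / freq * 1000000000 + cycles % freq * 1000000000 / freq ))
}

# Values copied from the wd_now/wd_last and cs_now/cs_last fields above.
wd_ns=$(interval_ns a007fc3247f 5368ba9ca5)
cs_ns=$(tsc_cycles_to_ns 1200f50582d2 96329863ce 1800000000)
echo "kvm-clock interval: ${wd_ns} ns"
echo "tsc interval:       ${cs_ns} ns"
echo "difference:         $(( wd_ns - cs_ns )) ns"
```

 The kernel applies its own calibrated conversion factors, so the difference printed here is an approximation of the skew it measured, not the exact figure.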


 and the clocksource switches from tsc to kvm-clock:

> # cat /sys/devices/system/clocksource/clocksource0/current_clocksource 
> kvm-clock
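
 The same check can be wrapped in a small function for use across many guests. This is a sketch only: `check_clocksource` is a hypothetical helper, and its optional directory argument exists purely so the sysfs location can be substituted when testing.

```shell
#!/usr/bin/env bash
# Sketch: report whether a guest is still using the tsc clocksource.
# `check_clocksource` is a hypothetical helper name. The optional argument
# overrides the sysfs directory and is intended for testing only.
check_clocksource() {
  local dir=${1:-/sys/devices/system/clocksource/clocksource0}
  local current
  # Return 2 if the sysfs file cannot be read at all.
  current=$(cat "$dir/current_clocksource") || return 2
  if [ "$current" = "tsc" ]; then
    echo "ok: clocksource is tsc"
  else
    echo "warning: clocksource is $current, expected tsc"
    return 1
  fi
}
```

 Run inside the guest without arguments, it returns 0 while tsc is in use and 1 once the kernel has fallen back to another clocksource such as kvm-clock.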


Version-Release number of selected component (if applicable):
4.11

Comment 2 Kedar Bidarkar 2022-11-16 14:00:13 UTC
As part of qe_test_coverage, we plan to add a test case that covers migration between nodes with different CPU frequencies.

Comment 5 Denys Shchedrivyi 2023-01-05 01:00:28 UTC
Verified on OCP 4.11.20 + CNV v4.11.2-21:

 The kernel on the node:

> $ uname -r
> 4.18.0-372.36.1.el8_6.x86_64


 Created 15 VMs and migrated them multiple times: all VMs remain accessible, with no time jumps and no clocksource switch to kvm-clock.
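
 A check of this kind could be scripted per guest. The sketch below counts the watchdog messages quoted in the description; it illustrates one possible approach and is not the procedure actually used for this verification. `count_tsc_unstable` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: count kernel log lines where the watchdog marked tsc unstable.
# Reads log text on stdin and prints the number of matching lines.
# `count_tsc_unstable` is a hypothetical helper name.
count_tsc_unstable() {
  # grep -c exits non-zero when nothing matches; "|| true" keeps the
  # function's exit status at 0 so a clean log simply prints 0.
  grep -c -E "Marking clocksource 'tsc' as unstable|Marking TSC unstable" || true
}
```

 Inside a guest it could be used as `journalctl -k | count_tsc_unstable`, where a result of 0 means no such message was logged.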

Comment 12 errata-xmlrpc 2023-02-07 15:16:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Virtualization 4.11.3 Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2023:0621

