Bug 673242

Summary: Time runs too fast in a VM on processors with > 4GHZ freq
Product: Red Hat Enterprise Linux 5 Reporter: Alok Kataria <akataria>
Component: kernel Assignee: Tim Burke <tburke>
Status: CLOSED ERRATA QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high Docs Contact:
Priority: urgent    
Version: 5.7 CC: dhecht, dhoward, garrett, jasonmc, jiajyang, jmalanik, jpirko, jsavanyo, juzhang, knoel, kzhang, mjenner, plyons, prarit, qcai, qwan, sghosh, tburke
Target Milestone: rc Keywords: ZStream
Target Release: 5.7   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Previously, on VMware, time ran too fast on virtual machines with a TSC (Time Stamp Counter) frequency above 4GHz if they were using PIT/TSC based timekeeping. This was due to a calculation bug in the get_hypervisor_cycles_per_sec function. This update fixes the calculation, and timekeeping works correctly for such virtual machines.
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-07-21 09:24:07 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 690133, 690134    
Attachments:
Description Flags
Fix for 4GHZ TSC issue. none

Description Alok Kataria 2011-01-27 19:46:46 UTC
Created attachment 475672
Fix for 4GHZ TSC issue.

Description of problem:

We have seen that time in a virtualized RHEL 5.4 or later guest runs too fast when the guest runs on processors with a TSC frequency above 4GHz.

This is due to a calculation bug in get_hypervisor_cycles_per_sec. It affects only VMware VMs that use tsc_based_timekeeping.
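
The report does not spell out the calculation bug. As an illustration only, the sketch below assumes the frequency in Hz was squeezed into an unsigned 32-bit value, which cannot hold anything above 2^32 - 1 (about 4.29GHz). The function names are hypothetical and are not taken from the kernel source or the attached patch. Under that assumption, a guest that divides real TSC cycles by the truncated frequency sees time advance many times too fast.

```python
# Illustrative model only. ASSUMPTION: the bug behaved like a 32-bit
# truncation of the TSC frequency in Hz. The bug report does not confirm this.

MASK32 = 0xFFFFFFFF  # largest value an unsigned 32-bit variable can hold


def truncate_u32(hz: int) -> int:
    """Keep only the low 32 bits, as a store into a 32-bit unsigned would."""
    return hz & MASK32


def apparent_speedup(true_hz: int) -> float:
    """Ratio of guest-perceived elapsed time to real elapsed time.

    The TSC really advances true_hz cycles per second, but the guest
    converts cycles to seconds using the truncated frequency.
    """
    seen_hz = truncate_u32(true_hz)
    if seen_hz == 0:
        raise ValueError("frequency truncates to zero; ratio is undefined")
    return true_hz / seen_hz


if __name__ == "__main__":
    for hz in (3_000_000_000, 4_500_000_000):
        print(hz, truncate_u32(hz), round(apparent_speedup(hz), 2))
```

In this model a 3GHz frequency is unaffected, while 4.5GHz wraps to roughly 205MHz, so the guest clock would run more than twenty times too fast. The real symptom depends on the actual kernel arithmetic, which this sketch does not reproduce.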

The fix is trivial and is attached. Please apply it for the next update.
This bug fix is necessary for all updates of RHEL 5.4, 5.5 & 5.6.

Thanks.

Comment 2 Zachary Amsden 2011-02-02 13:53:42 UTC
The bugfix is fine, but it's a bit painful to port to all these releases since they are all on separate branches.  We can update the RHEL5 branch first, then move it to all the z-streams.  I'm actually porting fixes to RHEL5 now, so I can work this in.

Comment 3 Zachary Amsden 2011-02-03 14:43:37 UTC
requesting dev ack and pm ack for 5.7; work is already done, simple patch which must be backported.

Comment 5 Dor Laor 2011-03-01 14:55:21 UTC
You got the ack, Zach, please send the patch to rhkernel with all the relevant kernel versions.

Comment 6 Zachary Amsden 2011-03-07 18:14:25 UTC
The patch does not apply cleanly due to KVM specific changes to the code.  Fixing this is trivial, but I need to verify that this will not affect kernels run under Xen.

Comment 7 Zachary Amsden 2011-03-23 04:57:58 UTC
patches posted for all branches

Comment 11 Jarod Wilson 2011-03-28 18:38:04 UTC
Patch(es) available in kernel-2.6.18-252.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.

Comment 13 Alok Kataria 2011-03-28 22:22:38 UTC
I verified the -252 kernel and timekeeping seems to work correctly for >4GHZ tsc frequency VMs. 
Thanks for picking up the fix.

Comment 14 Jason McCormick 2011-03-30 18:27:09 UTC
Has anyone seen anything to indicate that this patch would do more than just deal with "wall clock" timekeeping?  I've been having a problem with VMware-based VMs running EL5.6 (kernels -238.1.1 and -238.5.1) that are randomly hanging on boot.  I'm unable to track this back to anything ESX-related and it all seems to be related to using TSC as a timesource.  All of the issues began with the upgrade to EL 5.6 and kernel 2.6.18-238.1.1.el5 and persist in 2.6.18-238.5.1.el5 (we skipped -238.el5 for internal timing reasons).  This has affected more than 25 hosts at this point of all different configurations, but always EL 5.6 VMs only.  AS4 is not affected and we don't have any EL6 VMs yet.  The issue is exactly the same on every affected host.  During the initial kernel start, it gets as far as:

  PCI: Setting latency timer of device 0000:00:01.0 to 64
  NET: Registered protocol family 2
  IP route cache hash table entries: 32768 (order: 5, 131072 bytes)
  TCP established hash table entries: 131072 (order: 8, 1048576 bytes)
  TCP bind hash table entries: 65536 (order: 7, 524288 bytes)
  TCP: Hash tables configured (established 131072 bind 65536)
  TCP reno registered
  Simple Boot Flag at 0x36 set to 0x80

The next line on all VMs that boot successfully is:

  Using TSC for driving interrupts

However, VMs that hang during boot never reach the "Using TSC..." line.  This leads me to believe that the problem is related to the OS electing to use TSC as the clocksource and that this is somehow an unstable combination with ESX 3.5 and EL 5.6 VMs.  However, the issue is sporadic and I can't make it occur on demand; I only know that when an EL5.6 VM fails to boot, it always fails in the same place in the same way.  Any way this is related?  If not, sorry for the noise, but we're grasping at straws and VMware hasn't been very helpful thus far.

Comment 15 Jason McCormick 2011-03-30 19:02:09 UTC
Sorry for the noise, I think my problem is related to changes in Bug 538022 that implemented a TSC timer for interrupts.

Comment 17 Zachary Amsden 2011-04-27 18:22:53 UTC
It wouldn't hurt to test the fixed kernel in a Xen VM as well, but it's not a high priority.  The backport just got a bit complex because every version of RHEL5 needed a different fix.

Comment 18 Martin Prpič 2011-06-02 13:28:38 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Previously, on VMware, time ran too fast on virtual machines with a TSC (Time Stamp Counter) frequency above 4GHz if they were using PIT/TSC based timekeeping. This was due to a calculation bug in the get_hypervisor_cycles_per_sec function. This update fixes the calculation, and timekeeping works correctly for such virtual machines.

Comment 20 errata-xmlrpc 2011-07-21 09:24:07 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html