Bug 463573 - Patches to improve timekeeping for RHEL kernels running under VMware.
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: All  OS: Linux
Priority: high  Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Rik van Riel
QA Contact: Red Hat Kernel QE team
Keywords: FutureFeature, Triaged
Duplicates: 462410 466606
Depends On:
Blocks: 474658 476075 483701 485920 512913
Reported: 2008-09-23 17:35 EDT by Alok Kataria
Modified: 2011-11-03 13:33 EDT
CC List: 32 users

See Also:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-09-02 04:35:28 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
x86: Code to detect if running on VMware. (14.24 KB, patch)
2008-09-23 17:35 EDT, Alok Kataria
no flags Details | Diff
x86_64: Add a minimal TSC based clocksource implementation for 64bit. (8.15 KB, patch)
2008-09-23 17:37 EDT, Alok Kataria
no flags Details | Diff
Fix false softlockup warnings (2.50 KB, patch)
2008-09-23 17:38 EDT, Alok Kataria
no flags Details | Diff
BACKPORT: Defines dmi_name_in_vendors (2.10 KB, patch)
2008-09-25 17:14 EDT, Alok Kataria
no flags Details | Diff
Add serial number as a field to check for in dmi_name_in_vendors (962 bytes, patch)
2008-09-25 17:15 EDT, Alok Kataria
no flags Details | Diff
Check DMI before accessing backdoor port (3.99 KB, patch)
2008-09-25 17:33 EDT, Alok Kataria
no flags Details | Diff
Add some print messages to the kernel bootup log. (1.29 KB, patch)
2008-09-25 17:33 EDT, Alok Kataria
no flags Details | Diff
add serial key for dmi check (1.07 KB, patch)
2008-10-07 21:58 EDT, Rik van Riel
no flags Details | Diff
vm tsc calibration (forward ported to latest RHEL 5) (16.61 KB, patch)
2008-10-07 21:59 EDT, Rik van Riel
no flags Details | Diff
minimal clocksource for 64 bit (forward ported to latest RHEL 5) (8.42 KB, patch)
2008-10-07 22:00 EDT, Rik van Riel
no flags Details | Diff
fix lockup warnings for 64 bit (forward ported to latest RHEL 5) (2.59 KB, patch)
2008-10-07 22:01 EDT, Rik van Riel
no flags Details | Diff
add printk msgs (1.54 KB, patch)
2008-10-07 22:01 EDT, Rik van Riel
no flags Details | Diff
vm tsc calibration, using the padding in struct cpuinfo_x86 (16.80 KB, patch)
2008-10-08 16:48 EDT, Rik van Riel
no flags Details | Diff
[BUGFIX] Do not overcompensate for divider= option (2.80 KB, patch)
2008-10-08 20:49 EDT, Alok Kataria
no flags Details | Diff
vmware tsc calibration, but more generic to accommodate other hypervisors (15.06 KB, patch)
2008-10-11 12:13 EDT, Marcelo Tosatti
no flags Details | Diff
code restructuring (4.23 KB, patch)
2008-10-11 23:06 EDT, Alok Kataria
no flags Details | Diff
KVM support for x86_hyper_vendor and preset lpj (10.11 KB, patch)
2008-10-15 16:46 EDT, Marcelo Tosatti
no flags Details | Diff
KVM support for x86_hyper_vendor and preset lpj (v2) (11.98 KB, patch)
2008-10-21 13:13 EDT, Marcelo Tosatti
no flags Details | Diff
add serial key for dmi check (1.09 KB, patch)
2008-12-04 14:14 EST, Rik van Riel
no flags Details | Diff
code to detect running on vmware and get tsc frequency from hypervisor (18.26 KB, patch)
2008-12-04 14:16 EST, Rik van Riel
no flags Details | Diff
minimal tsc clocksource for vmware (11.82 KB, patch)
2008-12-04 14:18 EST, Rik van Riel
no flags Details | Diff
fix 64 bit soft lockup warnings on vmware (2.59 KB, patch)
2008-12-04 14:19 EST, Rik van Riel
no flags Details | Diff
kvm timer calibration (22.34 KB, patch)
2008-12-04 14:19 EST, Rik van Riel
no flags Details | Diff
Backport of the fdiv_bug fix from mainline. (1.26 KB, patch)
2008-12-11 15:59 EST, Alok Kataria
no flags Details | Diff
Disable softlockup with the TSC based algorithm under VMware (2.92 KB, patch)
2009-01-15 15:20 EST, Alok Kataria
no flags Details | Diff
kvm timer calibration (v4) (22.43 KB, patch)
2009-01-16 10:33 EST, Chris Lalancette
no flags Details | Diff
Fix TSC behavior when running under VMware on AMD systems. (1.40 KB, patch)
2009-02-23 17:36 EST, Alok Kataria
no flags Details | Diff

Description Alok Kataria 2008-09-23 17:35:33 EDT
Created attachment 317536 [details]
x86: Code to detect if running on VMware.

Description of problem:

In a virtual environment, timekeeping for 64-bit RHEL kernels can be problematic, since these kernels keep time by counting timer interrupts.

The problem arises when the VM is descheduled for some portion of time.  When the VM is rescheduled, the hypervisor needs to "catch up" delivering timer interrupts so that the kernel can determine the correct time.

Until the VM is caught up, the kernel's time will be behind, causing
short term divergence of the kernel's time with wallclock time.
Additionally, under certain overcommitment conditions, it may not be
possible for the hypervisor to fully catch up.  In this case, the kernel
time can fall behind over the long term.

The solution is to change the kernel's timekeeping algorithm to keep
time based on how much time has elapsed according to a time counter
rather than by counting interrupts.  This is similar to the timeofday
algorithm used by clocksource enabled mainline kernels or the RHEL5
32bit kernel.

Along with this, there are other changes, such as asking the hypervisor for the TSC frequency.

Note that these new changes are enabled by default only after a VMware hypervisor has been detected, eliminating any effect when running on non-VMware systems (besides executing the VMware hypervisor detection code).

Patches are based on the 2.6.18-92.1.10 sources.

Patch series
   vm_tsc_calibration.patch
   minimal-clocksource-for-64bit.patch
   fix-lockup-warnings-for64bit.patch


Please have a look at these patches. 

Thanks,
Alok
Comment 1 Alok Kataria 2008-09-23 17:37:16 EDT
Created attachment 317537 [details]
x86_64: Add a minimal TSC based clocksource implementation for 64bit.
Comment 2 Alok Kataria 2008-09-23 17:38:07 EDT
Created attachment 317538 [details]
Fix false softlockup warnings
Comment 3 Rik van Riel 2008-09-24 11:07:10 EDT
Poking a random IO port to detect whether or not we are running on VMware is really not the best idea, because that same IO port address could be used by an actual device (when not running on VMware).  

Upstream discussions agreed that DMI should be used for detecting that we are running on VMware:

http://lkml.org/lkml/2008/9/8/130

http://lkml.org/lkml/2008/9/24/145

The other two patches look good to me, but it would be good to get the equivalent functionality in the upstream kernel first.

Overall, this is definitely functionality that we should have, because it will fix the time drift issue.
Comment 4 Rik van Riel 2008-09-24 11:16:26 EDT
Btw, using the PCI ID should be a safe way to detect that the kernel is running on VMware, too.  The only thing that is not safe is poking a random IO port, so that part of the patch needs to change.

Of course, after the system has detected that it is running on VMware, poking that IO port to find out more info is safe, because under VMware that IO port is reserved :)
Comment 5 Alok Kataria 2008-09-24 16:14:12 EDT
Hi Rik,

Thanks for the comments. Please see my replies inline.

(In reply to comment #3)
> Poking a random IO port to detect whether or not we are running on VMware is
> really not the best idea, because that same IO port address could be used by an
> actual device (when not running on VMware).  

You are right, but IMO this is possible only in theory. In practice, the I/O port we chose (0x5658) has no known conflicts with any standard PC device ranges, and since no new devices do programmable I/O anyway, it just doesn't matter.

If you still think we should add some checks to thwart the theoretical problem too, then maybe we can do the DMI stuff. But that may require backporting Andi's patch to add the dmi_name_in_vendors function.
If you are fine with these additional changes, let me know and I can cook up a patch.


> but it would be good to get the
> equivalent functionality in the upstream kernel first.

Which functionality are you talking about here?
If you mean the functionality to detect whether we are running on VMware, yes, I plan to push patches for this upstream, but that is going to be a lengthy process and may hit only the 2.6.28 tree, since there we are going to take a generic approach and try to unify this for all hypervisors.
IMO it won't help to wait until the functionality is in the mainline tree, since that is going to be a somewhat different and more intrusive approach.

Thanks,
Alok
Comment 6 Rik van Riel 2008-09-24 16:40:27 EDT
Backporting dmi_name_in_vendors seems perfectly safe. I have no objections to that.

With that test added, your patch series seems safe enough for a RHEL update (in my personal opinion, at least - but I'll try my best to convince the others).
Comment 7 Alok Kataria 2008-09-24 22:50:14 EDT
(In reply to comment #6)
> Backporting dmi_name_in_vendors seems perfectly safe. I have no objections to
> that.

After discussing this with some folks at my end, I understand that there are cases in which we may not return VMware-specific DMI strings. Though there are workarounds for that, I think it would all be unnecessary.

I have taken a stab at explaining to folks on LKML why this code is not a problem when run on native hardware with a device on this port.
To explain briefly: when we do an IN on this port, apart from changing the value of EAX, VMware also writes a MAGIC value into EBX, which no device should do.
Also note that we run this right at bootup in early_cpu_init, well before any device is initialized.
So, all in all, we can do away with all the added DMI complexity.

Please have a look at this discussion on LKML
http://groups.google.com/group/fa.linux.kernel/browse_thread/thread/217faed615a3b775/d0b2cf4dd60daaad#d0b2cf4dd60daaad
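
The property being described can be separated from the privileged IN instruction itself. The constants below match the widely documented VMware backdoor interface, but the helper name is mine, and the actual port access (which needs ring 0 or ioperm) is deliberately left out of this sketch:

```c
#include <stdint.h>

#define VMWARE_BACKDOOR_MAGIC 0x564D5868u  /* "VMXh", placed in EAX */
#define VMWARE_BACKDOOR_PORT  0x5658

/* The caller sets EAX = magic, ECX = command, DX = port, then executes IN.
 * A real I/O device answering a port read can only change EAX; only the
 * hypervisor intercepts the access and also writes the magic back into
 * EBX, which is what makes the check unambiguous. */
int looks_like_vmware(uint32_t ebx_after_in)
{
    return ebx_after_in == VMWARE_BACKDOOR_MAGIC;
}
```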

Let me know your views on this. 

Thanks,
Alok
Comment 8 Rik van Riel 2008-09-25 00:19:44 EDT
Your explanation makes sense.

If upstream goes with that (e.g. -mm or Ingo's x86 tree), the same should be acceptable in a RHEL update.
Comment 9 Alok Kataria 2008-09-25 17:12:15 EDT
(In reply to comment #8)
> Your explanation makes sense.
> 
> If upstream goes with that (eg. -mm or Ingo's x86 tree), the same should be
> acceptable in a RHEL update.

Hi Rik, 

If you are following the discussion on LKML, you might have noticed that there are problems with accessing the port directly. So, to keep these complications out of the discussion here, I have opted to do the DMI checks first and only then access the VMware backdoor port.

The patch series is now:

vm_tsc_calibration.patch
minimal-clocksource-for-64bit.patch
fix-lockup-warnings-for64bit.patch
dmi_definition.patch
add_serial_key_for_dmicheck.patch
backdoor_fix.patch
add_print_msgs.patch

I will attach the last 4 patches to this bug.

Thanks,
Alok
Comment 10 Alok Kataria 2008-09-25 17:14:30 EDT
Created attachment 317727 [details]
BACKPORT: Defines dmi_name_in_vendors
Comment 11 Alok Kataria 2008-09-25 17:15:19 EDT
Created attachment 317728 [details]
Add serial number as a field to check for in dmi_name_in_vendors
Comment 12 Alok Kataria 2008-09-25 17:33:11 EDT
Created attachment 317734 [details]
Check DMI before accessing backdoor port
Comment 13 Alok Kataria 2008-09-25 17:33:52 EDT
Created attachment 317735 [details]
Add some print messages to the kernel bootup log.
Comment 14 Alok Kataria 2008-09-25 17:37:36 EDT
Hi Rik, 

I have attached all the patches. 
Please have a look at them and let me know if you have any questions.

Thanks,
Alok
Comment 15 Rik van Riel 2008-09-25 18:03:58 EDT
These patches look reasonable to me.

I have requested product management to consider this for inclusion in RHEL 5.4.

After the RHEL 5.3 sprint is over, upstream discussions should have settled down and I'll check these patches for kABI impact (looks like there should be no major problem).  

Once permission to include this feature in RHEL 5.4 has been granted I will submit these patches for review internally.
Comment 16 Alok Kataria 2008-09-25 18:38:08 EDT
(In reply to comment #15)
> These patches look reasonable to me.
> 
> I have requested product management to consider this for inclusion in RHEL 5.4.

Hi Rik,

I was hoping to see these patches in RHEL 5.3.

The only code addition on a native system would be running the dmi_name_in_vendors check for the "VMware" string. After that, the rest of the code only runs in a VM.

If you see any issues with the patches, let me know so that we can work on solving them and make this ready for 5.3.

Thanks,
Alok

> 
> After the RHEL 5.3 sprint is over, upstream discussions should have settled
> down and I'll check these patches for kABI impact (looks like there should be
> no major problem).  
> 
> Once permission to include this feature in RHEL 5.4 has been granted I will
> submit these patches for review internally.
Comment 17 Rik van Riel 2008-09-25 19:48:30 EDT
I understand that you would like to see the patches in RHEL 5.3, however it is very late in the development cycle to propose new functionality.

Getting an exception like this created is not up to me, but up to Red Hat's management.
Comment 22 Rik van Riel 2008-10-07 21:58:43 EDT
Created attachment 319716 [details]
add serial key for dmi check

forward ported to current RHEL 5 tree
Comment 23 Rik van Riel 2008-10-07 21:59:18 EDT
Created attachment 319717 [details]
vm tsc calibration (forward ported to latest RHEL 5)
Comment 24 Rik van Riel 2008-10-07 22:00:01 EDT
Created attachment 319718 [details]
minimal clocksource for 64 bit (forward ported to latest RHEL 5)
Comment 25 Rik van Riel 2008-10-07 22:01:07 EDT
Created attachment 319719 [details]
fix lockup warnings for 64 bit (forward ported to latest RHEL 5)
Comment 26 Rik van Riel 2008-10-07 22:01:38 EDT
Created attachment 319720 [details]
add printk msgs
Comment 27 Rik van Riel 2008-10-07 22:03:50 EDT
It turns out that the dmi_name_in_vendors() code is already in the latest RHEL5.  I merged the backdoor fix into the "vm tsc calibration" patch.
Comment 28 Rik van Riel 2008-10-07 22:12:14 EDT
OK, it looks like there are two potential issues with the patches:
- the cpuinfo_x86 struct is changed (I will compile a test kernel RPM to see if this impacts any symbols)
- there are no Kconfig changes that allow one to enable CONFIG_DETECT_SOFTLOCKUP (easy fix)
Comment 29 Rik van Riel 2008-10-07 22:19:38 EDT
OK, CONFIG_DETECT_SOFTLOCKUP is already set by something else.

I just kicked off a build with a test kernel RPM to check for kABI impact of the cpuinfo_x86 changes.
Comment 30 Rik van Riel 2008-10-08 12:03:50 EDT
*** ERROR - ABI BREAKAGE WAS DETECTED ***

The following symbols have been changed (this will cause an ABI breakage):

cpu_data
boot_cpu_data

The boot_cpu_data is just a single cpuinfo_x86 so that is no big deal, but cpu_data is an array of struct cpuinfo_x86, which may be accessed from somewhere else.

Adding an entry to the struct cpuinfo_x86 is not an option, at least not when the size of cpuinfo_x86 changes.

We may be able to steal some space from padding in the middle of the struct, between x86_max_cores and x86_power:

        int     x86_tlbsize;    /* number of 4K pages in DTLB/ITLB combined(in pages)*/
        __u8    x86_virt_bits, x86_phys_bits;
        __u8    x86_max_cores;  /* cpuid returned max cores value */
        __u32   x86_power;      
        __u32   extended_cpuid_level;   /* Max extended CPUID function supported */

Disgusting, but I suspect that would preserve the kABI.
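
The layout trick can be checked with a standalone sketch. These are simplified stand-ins for the quoted fields (the real cpuinfo_x86 has many more members), but they show why a __u8 placed between x86_max_cores and x86_power changes neither the struct size nor any existing offset on x86:

```c
#include <stddef.h>
#include <stdint.h>

/* Before: x86_max_cores at offset 6 is followed by one byte of padding,
 * because the 32-bit x86_power must start on a 4-byte boundary. */
struct cpuinfo_before {
    int      x86_tlbsize;
    uint8_t  x86_virt_bits, x86_phys_bits;
    uint8_t  x86_max_cores;
    /* 1 byte of compiler-inserted padding here */
    uint32_t x86_power;
    uint32_t extended_cpuid_level;
};

/* After: the new field occupies the former padding byte. */
struct cpuinfo_after {
    int      x86_tlbsize;
    uint8_t  x86_virt_bits, x86_phys_bits;
    uint8_t  x86_max_cores;
    uint8_t  x86_hyper_vendor;   /* new, lives where the padding was */
    uint32_t x86_power;
    uint32_t extended_cpuid_level;
};
```

Because size and all pre-existing offsets are unchanged, modules compiled against the old layout keep working, which is what preserves the kABI.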
Comment 31 Alok Kataria 2008-10-08 12:44:55 EDT
(In reply to comment #30)
> *** ERROR - ABI BREAKAGE WAS DETECTED ***
> 
> The following symbols have been changed (this will cause an ABI breakage):
> 
> cpu_data
> boot_cpu_data
> 
> The boot_cpu_data is just a single cpuinfo_x86 so that is no big deal, but
> cpu_data is an array of struct cpuinfo_x86, which may be accessed from
> somewhere else.
> 
> Adding an entry to the struct cpuinfo_x86 is not an option, at least not when
> the size of cpuinfo_x86 changes.

Oh ok. 

> 
> We may be able to steal some space from padding in the middle of the struct,
> between x86_max_cores and x86_power:
> 
>         int     x86_tlbsize;    /* number of 4K pages in DTLB/ITLB combined(in
> pages)*/
>         __u8    x86_virt_bits, x86_phys_bits;
>         __u8    x86_max_cores;  /* cpuid returned max cores value */
>         __u32   x86_power;      
>         __u32   extended_cpuid_level;   /* Max extended CPUID function
> supported */
> 
> Disgusting, but I suspect that would preserve the kABI.

No, I don't think we should use the padding space; the cpuinfo_x86 structure (at least for i386) is accessed in head.S, and changing variables in the structure would require changes in head.S too.
We can have a global flag for x86_hyper_vendor; it doesn't necessarily need to be a per-CPU variable. Let me know which tree you are working with, and I will make these changes and send you a patch.

Thanks,
Alok
Comment 32 Rik van Riel 2008-10-08 16:48:28 EDT
Created attachment 319799 [details]
vm tsc calibration, using the padding in struct cpuinfo_x86

Alok, these patches are against kernel-2.6.18-118.el5
Comment 33 Alok Kataria 2008-10-08 18:15:26 EDT
(In reply to comment #32)
> Created an attachment (id=319799) [details]
> vm tsc calibration, using the padding in struct cpuinfo_x86
> 

I am ok with these changes to the cpuinfo_x86 structure. 

Thanks,
Alok
Comment 34 Alok Kataria 2008-10-08 20:49:56 EDT
Created attachment 319817 [details]
[BUGFIX] Do not overcompensate for divider= option

When the divider= command-line option is used, the new timekeeping algorithm has a bug.
The problem is that we assume we still take cycles_per_jiffy
TSC ticks between each interrupt, which is not the case when the
divider= option is used.

With divider=10, the actual HZ value is 100 instead
of the default 1000, so we take a timer interrupt every 10ms. The
logic where we update jiffies needs to accommodate this so that
the TSC and jiffies remain in sync.

Without this patch, we update the jiffies value too fast
when divider= is used.

This patch calculates cycles_per_tick from REAL_HZ instead of HZ.
Also, cycles_per_jiffy is now more correctly named cycles_per_tick,
since one tick no longer corresponds to one jiffy with the tick divider.

Alternatively, another solution could be to remove the loop around the do_timer call,
for (i = 0; i < tick_divider; i++),
but that would result in extra iterations of the outer loop, which this patch avoids.

Please apply this patch to your tree.
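
The arithmetic being fixed can be sketched like this (hypothetical function and illustrative numbers; the patch's own identifiers differ):

```c
#include <stdint.h>

/* With divider=N the timer fires at HZ/N, so each real interrupt spans
 * N jiffies' worth of TSC cycles.  Deriving the per-interrupt cycle
 * count from the real interrupt rate keeps the TSC and jiffies in sync. */
uint64_t cycles_per_tick(uint64_t tsc_khz, unsigned int hz, unsigned int divider)
{
    unsigned int real_hz = hz / divider;   /* e.g. 1000 / 10 = 100 */
    return tsc_khz * 1000ULL / real_hz;    /* TSC cycles between interrupts */
}
```

Computing from HZ instead of the real rate would make each interrupt's cycle budget ten times too small with divider=10, which is exactly the "jiffies updated too fast" symptom described above.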
Comment 35 Chris Wright 2008-10-09 12:39:28 EDT
This patch can be made useful for other HV environments as well, so things such as:

+    if (boot_cpu_data.x86_hyper_vendor == X86_HYPER_VENDOR_VMWARE)
+    unsigned long vm_tsc_khz = vmware_get_tsc_khz();

need to be made slightly more generic.
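
One way to generalize, sketched here with hypothetical names (the actual restructuring is Marcelo's patch, below): record which hypervisor was detected once, and route frequency queries through a per-vendor switch so a KVM (or other) backend can slot in later.

```c
/* Hypothetical vendor tags; the real patch defines its own. */
enum x86_hyper_vendor {
    X86_HYPER_VENDOR_NONE,
    X86_HYPER_VENDOR_VMWARE,
    X86_HYPER_VENDOR_KVM,
};

/* Stand-in for vmware_get_tsc_khz(); returns a made-up 2 GHz. */
unsigned long vmware_get_tsc_khz_stub(void)
{
    return 2000000UL;
}

/* Generic entry point: callers no longer test for VMware explicitly. */
unsigned long hypervisor_tsc_khz(enum x86_hyper_vendor vendor)
{
    switch (vendor) {
    case X86_HYPER_VENDOR_VMWARE:
        return vmware_get_tsc_khz_stub();
    default:
        return 0;   /* 0 means "no preset value, calibrate normally" */
    }
}
```

A vendor that cannot report its TSC frequency simply returns 0 and falls back to the existing calibration path, so adding a hypervisor never changes native behavior.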
Comment 36 Marcelo Tosatti 2008-10-11 12:13:42 EDT
Created attachment 320098 [details]
vmware tsc calibration, but more generic to accommodate other hypervisors

Following the previous suggestion. Alok, Chris, please review.
Comment 37 Chris Wright 2008-10-11 20:25:27 EDT
Yes, that's precisely what I was thinking.  Looks good to me.
Testing results?
Comment 38 Alok Kataria 2008-10-11 23:01:53 EDT
(In reply to comment #36)
> Created an attachment (id=320098) [details]
> vmware tsc calibration, but more generic to accommodate other hypervisors
> 
> Following previous suggestion. Alok, Chris, please review.

Hi Marcelo,

This looks good. Along with this, I think we should call vmware_enable_lazy_timer_emulation() only on VMware hypervisors; this doesn't affect anything now, but if and when code for other hypervisors is added later, it will be needed.
I have created a patch on top of your changes which does that.

Thanks,
Alok
Comment 39 Alok Kataria 2008-10-11 23:06:04 EDT
Created attachment 320133 [details]
code restructuring

Move the enable_lazy_timer_mode code into the paravirt_generic function, so that it is now only called when running under the VMware hypervisor.

This patch is on top of the vmware_tsc_calibration_generic.patch and minimal-clocksource-for-64bit.patch.

Also note that I have folded the add_print_msgs.patch in this new patch.

Boot tested on 32-bit and 64-bit VMs.
Comment 40 Marcelo Tosatti 2008-10-13 15:27:36 EDT
Alok,

The code restructuring patch fails to apply cleanly on top of vmware_tsc_calibration_generic.patch, there is a reject at arch/x86_64/kernel/time.c. 

Your original patches did not enable lazy timer emulation for x86_64, as you can see in the first patch attached for example (https://bugzilla.redhat.com/attachment.cgi?id=317536).

Is that intended?
Comment 41 Alok Kataria 2008-10-13 15:39:14 EDT
(In reply to comment #40)
> Alok,
> 
> The code restructuring patch fails to apply cleanly on top of
> vmware_tsc_calibration_generic.patch, there is a reject at
> arch/x86_64/kernel/time.c. 
> 
> Your original patches did not enable lazy timer emulation for x86_64, as you
> can see in the first patch attached for example
> (https://bugzilla.redhat.com/attachment.cgi?id=317536).
> 
> Is that intented?

Hi Marcelo,

The lazy timer emulation mode for 64-bit is enabled by the minimal clocksource patch.

As I mentioned in comment #39, this code restructuring patch needs to be applied on top of minimal-clocksource-for-64bit.patch.
It should apply cleanly on top of that.

Let me know how it goes.

Thanks,
Alok
Comment 43 Marcelo Tosatti 2008-10-15 16:46:03 EDT
Created attachment 320485 [details]
KVM support for x86_hyper_vendor and preset lpj

Alok,

My bad - sorry.

The attached patch provides preset lpj support for the KVM hypervisor. Skipping the PIT routing check (which is also a problem under KVM) comes for free.

Tested with both x86_64 and i386. Glommer, please ACK.
Comment 44 Glauber Costa 2008-10-16 08:42:20 EDT
From reading the code only, the patches seem fine to me as far as KVM is concerned. If Marcelo tested it, ACK.
Comment 45 Alok Kataria 2008-10-16 14:54:03 EDT
With all the traffic on this bug, I think some people may be confused regarding the current state of the patches; at least I am :-). I will make an attempt to summarize the current set of patches, so that things are crystal clear.

The following patches are required for timekeeping (note all of these are already attached on this bug)

1. Backport the dmi_name_in_vendors definition
https://bugzilla.redhat.com/attachment.cgi?id=317727
    This patch is not needed in the recent RHEL kernel tree, as the function is already implemented there.

2. Add serial number as a field to check for in dmi_name_in_vendors
https://bugzilla.redhat.com/attachment.cgi?id=317728

3. vmware tsc calibration, but more generic to accommodate other hypervisors
https://bugzilla.redhat.com/attachment.cgi?id=320098

4. x86_64: Add a minimal TSC based clocksource implementation for 64bit.  
https://bugzilla.redhat.com/attachment.cgi?id=317537

5.  code restructuring
https://bugzilla.redhat.com/attachment.cgi?id=320133

6.  fix lockup warnings for 64 bit (forward ported to latest RHEL 5)  
https://bugzilla.redhat.com/attachment.cgi?id=319719

7.  [BUGFIX] Do not overcompensate for divider= option
https://bugzilla.redhat.com/attachment.cgi?id=319817

8. Apart from this, Marcelo also posted a patch which takes advantage of some of these things for KVM too:
 KVM support for x86_hyper_vendor and preset lpj
https://bugzilla.redhat.com/attachment.cgi?id=320485

AFAIK, these patches have *no* problems related to ABI breakage or code not being generic, and they should be ready for inclusion in their current state.

Thanks,
Alok
Comment 46 Rik van Riel 2008-10-16 14:59:00 EDT
The only question remaining is, what is happening with this code upstream?

Is anybody submitting it to the x86 tree or one of the other trees?
Comment 47 Alok Kataria 2008-10-16 15:20:41 EDT
(In reply to comment #46)
> The only question remaining is, what is happening with this code upstream?
> 
> Is anybody submitting it to the x86 tree or one of the other trees?

Hi Rik, 

Great that you ask this.

As you know, the minimal clocksource is not needed for the upstream kernel, since 64-bit already has a clocksource implementation there. Otherwise, the only other patch that would make sense upstream is the 3rd patch of this series, which gets the TSC frequency from the hypervisor. On this front too, the upstream kernel's (2.6.27) TSC calibration code is far more robust than its RHEL counterpart, due to the SMI checks and so on. The "pester mingo" problem is also avoided by one of my patches in 2.6.27-rc1, which calculates the lpj value based on tsc_khz instead of calibrating it, which is error prone under virtualization.

So there is no pressing need for that patch either, at least for the 2.6.27 tree. As for the hypervisor detection code, you might be aware that our approach to generalize some stuff for all hypervisors wasn't accepted, and we couldn't reach consensus on it. So the patches that I have for upstream just detect whether we are running specifically on VMware (not generically) and change the behavior of the tsc clocksource in some ways. Though I am still running some tests on these patches and will submit them upstream, IMO they will in no way be related to what we do for RHEL kernels, since the problem is different on these two kernels.

So, to cut a long story short, the upstream patches are going to be submitted, but they will be different from these, as they address a different set of problems. Let me know if you have any more questions.

Thanks,
Alok
Comment 48 Rik van Riel 2008-10-16 15:35:35 EDT
OK, then there are no real obstacles to getting this code in a RHEL 5 update.
Comment 50 Akemi Yagi 2008-10-16 16:04:39 EDT
(In reply to comment #45)
> With all the traffic on this bug, i think some people may be confused regarding
> the current state of the patches, atleast i am :-). I will make an attempt to
> summarize the current set of patches, so that things are crystal clear.
> 
> AFAIK, these patches have *no* problems related to ABI breakage or code not
> being generic, and they should be ready for inclusion in the current state.

Thank you for the summary. This is very helpful.  Do these patches apply
verbatim on the RHEL 2.6.18 kernel?
Comment 51 Alok Kataria 2008-10-16 16:29:34 EDT
(In reply to comment #50)
> (In reply to comment #45)
> > With all the traffic on this bug, i think some people may be confused regarding
> > the current state of the patches, atleast i am :-). I will make an attempt to
> > summarize the current set of patches, so that things are crystal clear.
> > 
> > AFAIK, these patches have *no* problems related to ABI breakage or code not
> > being generic, and they should be ready for inclusion in the current state.
> 
> Thank you for the summary. This is very helpful.  Do these patches apply
> verbatim on the RHEL 2.6.18 kernel?

Yes, they should apply to this RHEL kernel: kernel-2.6.18-118.el5.
Also, I have been testing and applying these patches at my end on the 2.6.18-92.1.10 tree.

Thanks,
Alok
Comment 52 Glauber Costa 2008-10-20 08:54:50 EDT
Rik,

AFAICS, these patches do not solve the problem of timekeeping for KVM, only for VMware.

For KVM, this provides an improvement to the situation, but we're not using the KVM paraclock here, only the paravirt_lpj. So we avoid very nasty situations in which delay loops complete early when the machine is calibrated under load, but time can still drift.
Comment 56 Marcelo Tosatti 2008-10-21 13:13:33 EDT
Created attachment 321043 [details]
KVM support for x86_hyper_vendor and preset lpj (v2)

fixed KVM support for x86_hyper_vendor and preset lpj (previous patch had missing hunks)
Comment 57 Rik van Riel 2008-10-21 14:55:34 EDT
I have folded the "fix" patches into earlier patches in the series, so the kernel should work correctly at every point in the series.

I am building this kernel now in our build system and will submit the patches for internal review soon.
Comment 58 Rik van Riel 2008-10-23 13:42:34 EDT
I posted the patches for review this morning.
Comment 68 Alok Kataria 2008-11-25 19:27:39 EST
(In reply to comment #58)
> I posted the patches for review this morning.

Hi Rik,

What is the status of these patches? I wanted to know which release they are going to be merged into.

Also, when we last spoke, I think I answered all the questions you had about the implementation, and there are no more potential issues with the patchset; is that right?

Thanks,
Alok
Comment 69 Rik van Riel 2008-11-25 21:40:26 EST
Your patches are aimed for RHEL 5.4.

When the last RHEL 5.3 odds and ends are done I will start working on your patches.

Btw, am I right in assuming that I have your permission to merge code upstream, but should not mention you by name when sending the code upstream?
Comment 70 Rik van Riel 2008-11-25 22:42:54 EST
Uhmm never mind that upstream comment - I was confused with another patch series for 5.4 :)
Comment 71 Rik van Riel 2008-11-26 11:13:39 EST
Alok,

to clarify why these patches got delayed to RHEL 5.4: during the internal review of the patches, people had a number of concerns about merging code that was still in flux upstream, and it was decided that we should wait for the code to settle down upstream and then merge into RHEL code similar to what goes upstream.
Comment 72 Rik van Riel 2008-12-04 14:14:08 EST
Created attachment 325724 [details]
add serial key for dmi check
Comment 73 Rik van Riel 2008-12-04 14:16:44 EST
Created attachment 325725 [details]
code to detect running on vmware and get tsc frequency from hypervisor
Comment 74 Rik van Riel 2008-12-04 14:18:17 EST
Created attachment 325726 [details]
minimal tsc clocksource for vmware
Comment 75 Rik van Riel 2008-12-04 14:19:18 EST
Created attachment 325727 [details]
fix 64 bit soft lockup warnings on vmware
Comment 76 Rik van Riel 2008-12-04 14:19:59 EST
Created attachment 325728 [details]
kvm timer calibration
Comment 77 Rik van Riel 2008-12-04 14:21:13 EST
The patches from comment #72 through comment #76 should address the issues that were found in both internal and upstream review of the code.
Comment 80 RHEL Product and Program Management 2008-12-05 07:18:21 EST
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 83 Subhendu Ghosh 2008-12-05 13:44:46 EST
Can we update the patch in comment #74 for the kernel Documentation to include hypervisor references in both enablement and disablement options?
Comment 91 Alok Kataria 2008-12-07 22:15:09 EST
(In reply to comment #77)
> The patches from comment #72 through comment #76 should address the issues that
> were found in both internal and upstream review of the code.

Hi Rik,

I haven't really had a chance (yet) to look through the changes in detail, but before I get to that in a couple of days, I wanted to know if you have tested these new patches on any of the VMware products and verified that the changes take effect.

Apart from that, can you also please point us to a test kernel RPM which has these patches applied on the recent RHEL kernel tree, so that I can ask QA to do some verification testing at our end?

Thanks,
Alok

P.S. : I am on vacation till January and will have intermittent access to mails, so please be aware that responses might be delayed.
Comment 92 Akemi Yagi 2008-12-07 22:50:48 EST
Hi,

I have rebuilt the latest test kernel (-125) from http://people.redhat.com/dzickus after applying the patches offered in this bugzilla and made it available for CentOS users:

http://centos.toracat.org/kernel/centos5/vmware463573/

(announced in the CentOS-virt mailing list: http://lists.centos.org/pipermail/centos-virt/2008-December/000765.html )

Anyone wishing to give it a try is welcome to download the patched kernel from there.

Thank you for all the work in providing the patches.

Akemi
Comment 95 Alok Kataria 2008-12-11 15:57:12 EST
> Hi Rik,
> 
> I haven't really had a chance yet to look through the changes in detail, but
> before I get to that in a couple of days, I wanted to know if you have tested
> these new patches on any of the VMware products and verified that the changes
> are taking effect?

Hi Rik,

While testing these patches we found a bug with the 32-bit kernels.
The bug actually exists in the check_fpu code, which resulted in resetting the boot_cpu_data.x86_hyper_vendor field; thanks to Garrett for figuring this out.

Due to this bug the TSC was being marked as unstable on an AMD box.

This FPU bug was exposed when the change for the x86_hyper_vendor field was moved under the padding.

This FPU bug has been fixed in mainline; here are the commit details:

commit e0d22d03c06c4e2c194d7010bc1e4a972199f156
Author: Krzysztof Helt <krzysztof.h1@wp.pl>
Date:   Thu Jul 31 23:43:44 2008 +0200

    x86: fdiv bug detection fix
    
    The fdiv detection code writes s32 integer into
    the boot_cpu_data.fdiv_bug.
    However, the boot_cpu_data.fdiv_bug is only char (s8)
    field so the detection overwrites already set fields for
    other bugs, e.g. the f00f bug field.
    
    Use local s32 variable to receive result.

This fix should be backported to the RHEL kernel for these patches to work correctly; otherwise we cannot reliably depend on the value of x86_hyper_vendor.
I have a backport of this patch for RHEL and will upload it soon.
Comment 96 Alok Kataria 2008-12-11 15:59:34 EST
Created attachment 326672 [details]
Backport of the fdiv_bug fix from mainline.
Comment 97 Rik van Riel 2008-12-11 16:18:07 EST
Thank you for that additional patch, Alok.  I have submitted it for inclusion.
Comment 99 Tim Kramer 2008-12-19 11:25:14 EST
In our testing, we only see this message on the HOTFIX kernel, not on the standard 2.6.18-92.1.22.el5 (see below).

Is this expected?  The time does still hold tight, but the support teams will question this when they see it in production.


BUG: soft lockup - CPU#0 stuck for 10s! [swapper:0]
CPU 0:
Modules linked in: nfs lockd fscache nfs_acl netconsole autofs4 sunrpc dm_multipath lp sg floppy pcspkr parport_pc i2c_piix4 parport shpchp i2c_core ide_cd e1000 serio_raw cdrom dm_snapshot dm_zero dm_mirror dm_mod ata_piix libata mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 0, comm: swapper Not tainted 2.6.18-92.1.22.el5.0.HOTFIX.BZ463573 #1
RIP: 0010:[<ffffffff80064af8>]  [<ffffffff80064af8>] _spin_unlock_irqrestore+0x8/0x9
RSP: 0018:ffffffff80418d88  EFLAGS: 00000287
RAX: 0000000000000000 RBX: ffff81007f17a800 RCX: ffff8100557832c0
RDX: 0000000000000000 RSI: 0000000000000287 RDI: ffff81007f17a850
RBP: ffffffff80418d00 R08: 0000000000000051 R09: 0000000000000000
R10: ffff81006b3c9140 R11: 0000000000000060 R12: ffffffff8005dc8e
R13: ffff81006b3c9080 R14: ffffffff800773ce R15: ffffffff80418d00
FS:  0000000000000000(0000) GS:ffffffff803a0000(0000) knlGS:0000000000000000


CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002b05abfed030 CR3: 000000004ceb4000 CR4: 00000000000006e0

Call Trace:
 <IRQ>  [<ffffffff88075c83>] :scsi_mod:scsi_dispatch_cmd+0x27d/0x2ff
 [<ffffffff8807b09c>] :scsi_mod:scsi_request_fn+0x2c1/0x390
 [<ffffffff8005c036>] blk_run_queue+0x41/0x72
 [<ffffffff88079ee6>] :scsi_mod:scsi_next_command+0x2d/0x39
 [<ffffffff8807a042>] :scsi_mod:scsi_end_request+0xbd/0xcb
 [<ffffffff8807a19e>] :scsi_mod:scsi_io_completion+0x14e/0x324
 [<ffffffff880a642a>] :sd_mod:sd_rw_intr+0x212/0x23f
 [<ffffffff8807a433>] :scsi_mod:scsi_device_unbusy+0x67/0x81
 [<ffffffff800378ce>] blk_done_softirq+0x5f/0x6d
 [<ffffffff80011f35>] __do_softirq+0x5e/0xd6
 [<ffffffff80078610>] end_level_ioapic_vector+0x9/0x16
 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006c706>] do_softirq+0x2c/0x85
 [<ffffffff8006c58e>] do_IRQ+0xec/0xf5
 [<ffffffff8006aed0>] default_idle+0x0/0x50
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff8006aef9>] default_idle+0x29/0x50
 [<ffffffff80048bcf>] cpu_idle+0x95/0xb8
 [<ffffffff803db801>] start_kernel+0x220/0x225
 [<ffffffff803db22f>] _sinittext+0x22f/0x236
Comment 100 Dan Hecht 2008-12-19 13:44:05 EST
(In reply to comment #99)
> In our testing, we only see this message on the HOTFIX kernel, not on the
> standard 2.6.18-92.1.22.el5. (see below)
> 
> Is this expected?  The time does still hold tight but the support teams will be
> questioning this when they see it in production.

Hi Tim,

Please provide more information about your environment.  
 - Is this when running in a VM on VMware or on a physical machine?  Which version?
 - 32-bit or 64-bit?
 - What processor?
 - How many cpus (virtual and physical)?
 - What is the workload that is running when you hit this?
 - What's the frequency?

thanks,
Dan
Comment 101 Tim Kramer 2008-12-22 14:09:49 EST
These errors are only being seen in a VM running 64-bit RHEL 5. The workload is a simple dd out to a null device to run the system hard (four VMs on a physical box).  This is the same test that we used to get the clock to drift. Each VM is configured with a single CPU core.


Here is the feedback from the engineer doing the testing.  

I think it's a cause for concern that frightening looking kernel backtrace messages of several dozen lines per occurrence are spewing across the console repeatedly. Especially when they start out with a line saying "BUG: soft lockup - CPU#0 stuck ...". If the messages are innocuous, they shouldn't be appearing on the console at all.

However, the patched kernel does appear to fix the time drift issue. That's probably more important than the extraneous console messages--if they're really not a sign of a problem.

Is there a way to suppress the messages?  The delivery group that reads this in the console log will raise many flags.
Comment 102 Tim Kramer 2008-12-23 07:51:45 EST
Here's more info:

- Is this when running in a VM on VMware or on a physical machine? Which version?
VM on VMware ESX Server 3.5.0 build 84374
- 32-bit or 64-bit?
64-bit
- What processor?
Intel Xeon E5450 @ 3.00GHz
- How many cpus (virtual and physical)?
2 4-core physical CPUs; 1 virtual CPU per VM
- What is the workload that is running when you hit this?
Each of four VMs were running precisely:
while true; do dd if=/dev/zero of=/tmp/bigfile bs=4096k count=100>/dev/null 2>&1; done
- What's the frequency?
Randomly and occasionally. Over the course of a one-hour sample run from 14:15 until 15:15 on four VMs, here's when and where the messages occurred (they were also logged to /var/log/messages):
=== xchi8074vnap ====
Dec 22 14:17:56 xchi8074vnap kernel: BUG: soft lockup - CPU#0 stuck for 10s! [swapper:0]
Dec 22 14:23:12 xchi8074vnap kernel: BUG: soft lockup - CPU#0 stuck for 11s! [swapper:0]
Dec 22 14:47:18 xchi8074vnap kernel: BUG: soft lockup - CPU#0 stuck for 11s! [swapper:0]
=== xchi8075vnap ====
Dec 22 14:16:44 xchi8075vnap kernel: BUG: soft lockup - CPU#0 stuck for 10s! [dd:7005]
Dec 22 14:36:20 xchi8075vnap kernel: BUG: soft lockup - CPU#0 stuck for 10s! [kjournald:347]
Dec 22 15:03:41 xchi8075vnap kernel: BUG: soft lockup - CPU#0 stuck for 11s! [swapper:0]
=== xchi8076vnap ====
Dec 22 14:18:43 xchi8076vnap kernel: BUG: soft lockup - CPU#0 stuck for 10s! [swapper:0]
Dec 22 14:36:33 xchi8076vnap kernel: BUG: soft lockup - CPU#0 stuck for 10s! [swapper:0]
Dec 22 14:57:20 xchi8076vnap kernel: BUG: soft lockup - CPU#0 stuck for 10s! [swapper:0]
=== xchi8077vnap ====
Dec 22 14:18:08 xchi8077vnap kernel: BUG: soft lockup - CPU#0 stuck for 10s! [swapper:0]
Dec 22 14:20:51 xchi8077vnap kernel: BUG: soft lockup - CPU#0 stuck for 10s! [pdflush:6829]
Dec 22 14:29:12 xchi8077vnap kernel: BUG: soft lockup - CPU#0 stuck for 11s! [swapper:0]
Dec 22 15:04:33 xchi8077vnap kernel: BUG: soft lockup - CPU#0 stuck for 11s! [swapper:0]
Comment 105 Dave Maley 2009-01-06 17:32:15 EST
Alok/Dan (or anybody else involved in testing from vmware):

Were the messages from comment #99 seen during VMware's testing of this patchset?  Since they're not seen on a stock 2.6.18-92.1.22 kernel, it appears they result from the patches in this bug.  Specifically, it seems likely they're related to the soft lockup warnings patch from comment #75.
Comment 106 Dan Hecht 2009-01-06 18:20:26 EST
Hi Dave:

We have looked into these softlockup warnings; here is the explanation:

The patch in comment #75 is related, but really that patch attempted to suppress the softlockup warnings that would occur when running on a hypervisor under high load / IO.  However, as the testing has demonstrated, the comment #75 patch was not successful in preventing the warnings in all cases.

The softlockup warning indicates that some amount of time has gone by without having the kernel schedule the softlockup thread (i.e. it's detecting cases where the kernel is not able to reschedule).  Normally, this indicates a kernel bug.

When running on a hypervisor, however, it is possible to hit this scenario even when there is no real bug.  It can happen when the hypervisor is under high load, and does not schedule the VM often enough.

Prior to applying all the patches in this bug, the kernel kept time by counting interrupts.  That means that if the hypervisor is not scheduling the VM very often, the kernel is also not getting timer interrupts very often, and time in the VM will fall behind.  This is exactly the time drift problem we are solving with these patches.  However, it has the side effect of also causing the softlockup warning to *not* fire even when the kernel is not rescheduling (because time in the VM also won't be advancing).

With the patches, we have fixed timekeeping so that time in the VM stays correct regardless of how frequently the VM is able to run.  Thus, the kernel will now notice that time has passed even though it has not had a chance to run (and therefore not reschedule its processes), and so the softlockup warning is more likely to fire in this case.  However, the warning is a false positive.  Note that this is an unusual case and should only happen under extreme CPU load or some types of I/O load; in either case it does not indicate a real bug.  The VM is probably occasionally blocked while some virtual disk emulation has to run.

The patch in comment #75 attempted to notify the softlockup code when the VM was descheduled by the hypervisor, but I don't think it is foolproof.

Anyway, we have a few options:
1) Disable the softlockup warning completely when running on hypervisors because the warning will more often than not be a false positive.
2) Try to improve along the lines of the patch in comment #75 to make the kernel filter out cases where the VM was descheduled by the hypervisor.  I don't think it will be possible to filter out all cases though.

I believe Alok plans to look into whether #2 is possible when he returns from vacation on Monday Jan 11 (and fall back to #1 if that looks futile), so we should have more info for you next week.  If Red Hat prefers just going straight to solution #1, that would be okay with us too (it should be a really simple patch).

BTW, a workaround for now might be to increase the softlockup threshold; I believe that can be done at runtime via a /proc node, in case that somehow unblocks someone (though keep in mind these are just spurious warnings).

Dan
Comment 107 Alok Kataria 2009-01-15 15:14:47 EST
(In reply to comment #105)
> Alok/Dan (or anybody else involved in testing from vmware):
> 
> Were the messages from comment #99 seen during VMWare's testing of this
> patchset?  Since they're not seen on a stock 2.6.18-92.1.22 kernel it appears
> they result from the the patches in this bug.  Specifically it seems likely
> they're related to the soft lockup warnings patch from comment #75.

As explained by Dan earlier, these softlockups are spurious warnings. I have looked through the code and don't see an easy way to handle all of the cases where the softlockup code would need to be educated about the jump in time.

The approach that I have taken is to *disable* the softlockup watchdog if we are using the new TSC based algorithm under VMware.

I will upload the patch soon. Please note that the patch in comment #75 should be reverted before applying this one.
Comment 108 Alok Kataria 2009-01-15 15:20:10 EST
Created attachment 329135 [details]
Disable softlockup with the TSC based algorithm under VMware

This patch obsoletes the patch in comment #75 (fix 64 bit soft lockup warnings on vmware). 
Please revert that patch (#75) before applying the patch to disable softlockups.
Comment 109 Chris Lalancette 2009-01-16 10:33:14 EST
Created attachment 329213 [details]
kvm timer calibration (v4)

Internal review caught a possible memory leak in the KVM calibration, so the attached patch fixes that problem.

Chris Lalancette
Comment 110 Chris Lalancette 2009-01-16 11:40:13 EST
Just to be clear, the latest patch that VMware attached (named "Disable softlockup with the TSC based algorithm under VMware") is supposed to replace patch 4 (named "fix 64 bit soft lockup warnings on vmware"), correct?  Presuming that is the case, I've marked the "Disable softlockup..." as obsolete here in the BZ.  Can VMware please review this and make sure that this is what was intended?

Thanks,
Chris Lalancette
Comment 111 Chris Lalancette 2009-01-16 11:52:51 EST
Arg!  I totally screwed up that last comment.  Here's what I *meant* to say:

Just to be clear, the latest patch that VMware attached (named "Disable
softlockup with the TSC based algorithm under VMware") is supposed to replace
patch 4 (named "fix 64 bit soft lockup warnings on vmware"), correct? 
Presuming that is the case, I've marked the "fix 64 bit soft lockup warnings on vmware" as obsolete here in the BZ.  Can VMware please review this and make sure that this is what was intended?

Thanks,
Chris Lalancette
Comment 112 Alok Kataria 2009-01-16 12:43:09 EST
(In reply to comment #111)
> Arg!  I totally screwed up that last comment.  Here's what I *meant* to say:
> 
> Just to be clear, the latest patch that VMware attached (named "Disable
> softlockup with the TSC based algorithm under VMware") is supposed to replace
> patch 4 (named "fix 64 bit soft lockup warnings on vmware"), correct? 
> Presuming that is the case, I've marked the "fix 64 bit soft lockup warnings on
> vmware" as obsolete here in the BZ.  Can VMware please review this and make
> sure that this is what was intended?

That's right.
The "Disable softlockup" patch replaces the "fix 64 bit soft lockup warnings" patch.

Thanks,
Alok

> 
> Thanks,
> Chris Lalancette
Comment 113 Chris Lalancette 2009-01-23 05:26:05 EST
I've uploaded a test kernel that contains this fix (along with several others)
to this location:

http://people.redhat.com/clalance/virttest

Could the original reporter try out the test kernels there, and report back if
it fixes the problem?

Thanks,
Chris Lalancette
Comment 114 Alok Kataria 2009-01-26 21:06:09 EST
(In reply to comment #113)
> I've uploaded a test kernel that contains this fix (along with several others)
> to this location:
> 
> http://people.redhat.com/clalance/virttest
> 
> Could the original reporter try out the test kernels there, and report back if
> it fixes the problem?
> 


The uploaded kernels have all the fixes and work as expected. 

Thanks,
Alok
Comment 116 Chris Lalancette 2009-02-02 02:29:38 EST
Excellent, thanks for the testing!

Chris Lalancette
Comment 118 Chris Lalancette 2009-02-11 04:02:59 EST
OK, I've done a thorough look through of the patches.  The meat of it is generally fine, but there are a few places where I'd like to see us come closer to what actually went upstream.

In that vein, I've done some work to port the upstream code into the RHEL-5 tree, and then take a look at the differences between that code and what has been posted in this BZ.  The current patches I'm looking at are available at http://new-people.redhat.com/clalance/bz463573 (compile tested only at the moment), but I do have some questions and comments about why we are diverging from what upstream has, and if upstream will need some of the parts that are in here.  Please note that I'm doing this with an eye to keeping things working in RHEL-6; I don't want to have to revisit this BZ then.

1)  I dropped the first patch (add serial key to DMI check) completely in favor of the dmi_name_in_serial() patch added upstream.  That seems relatively uncontroversial.

2)  For the second patch (code to detect running on vmware and get tsc frequency from hypervisor), I've re-arranged things to look much more like what went upstream.  However, there are parts here that differ from upstream.  For instance, in arch/{i386,x86_64}/kernel/io_apic.c in the local patches, we skip doing the timer_irq_works() check if we are in a hypervisor, while (as far as I can tell) we don't skip that check in upstream.  Why is this?  Does upstream need this?  If upstream doesn't need this, why not?

3)  Also in the second patch, during time initialization in both i386 and x86_64, in the local patch we set "preset_lpj" to the value we fetch from the hypervisor.  Instead, I went with the upstream solution (namely git hash 3da757daf86e498872855f0b5e101f763ba79499), and use the "lpj_fine" variable to basically do the same.

4)  Also in the second patch, during time initialization in both i386 and x86_64, in the local patch we setup "lazy timer timer emulation".  From my (brief) reading, this VMware-specific feature basically means that timer interrupts can pile up and be merged together instead of holding time back until they are all injected.  This isn't setup in the upstream kernel, but is in this patchset.  Why is that?  Now, this one may be very different because of the constant 1000HZ RHEL-5 kernel vs. the upstream tickless kernel, but I want to make sure that we don't have to re-visit this patch for RHEL-6 and beyond.

5)  Patch 3 (minimal tsc clocksource for vmware) is an obvious one-off patch for RHEL-5 because of the old timekeeping code and because of the split nature of the i386 and x86_64 architectures.  In upstream, we don't actually need anything like this, correct?  That is, x86_64 already has the ability to use a TSC timesource now, yes?  Also, note that in this patch I've changed the kernel command-line parameter name from "[no_]timekeeping_use_tsc" to "x86_64_use_tsc".  I'm open to further suggestions for names, but I think having two options is overkill where we could instead do a "foo=[on|off|auto]".

6)  Patch 5 (Disable softlockup with the TSC based algorithm under VMware) is another one-off for RHEL-5.  Why doesn't the upstream kernel need to do this?

Chris Lalancette
Comment 119 Alok Kataria 2009-02-11 14:20:39 EST
(In reply to comment #118)
> OK, I've done a thorough look through of the patches.  The meat of it is
> generally fine, but there are a few places where I'd like to see us come closer
> to what actually went upstream.
> 
> In that vein, I've done some work to port the upstream code into the RHEL-5
> tree, and then take a look at the differences between that code and what has
> been posted in this BZ.  The current patches I'm looking at are available at
> http://new-people.redhat.com/clalance/bz463573 (compile tested only at the
> moment), but I do have some questions and comments about why we are diverging
> from what upstream has, and if upstream will need some of the parts that are
> in here.

The reason is that development for RHEL was done before the mainline
tree. Also, as you mention later, some of the changes are required
only for the RHEL tree; mainline is okay because of the clocksource
implementation.

Another point is that we wanted to keep changes to a distro kernel as
minimal as we could. Doing a generic hypervisor-level interface resulted
in a lot of code addition, but that generalization too was done later in
mainline, and we didn't want to change the status quo with RHEL patches.

Having said that, I appreciate the effort you have put into looking through each change and understanding the reasoning behind it, but I would like to point out one thing: every small change requires us to go through a whole testing cycle before we can ack a particular patch. We have already gone through this cycle multiple times with RHEL for these changes, but hopefully this is the last time for these patches.

> Please note that I'm doing this with an eye to keeping things working in
> RHEL-6; I don't want to have to revisit this BZ then.

Have you decided what kernel RHEL 6 is going to be based on? If it's
later than 2.6.29 there is no need to revisit this BZ anyway.

> 
> 1)  I dropped the first patch (add serial key to DMI check) completely in
> favor of the dmi_name_in_serial() patch added upstream.  That seems relatively
> uncontroversial.

Okay.

> 
> 2)  For the second patch (code to detect running on vmware and get tsc
> frequency from hypervisor), I've re-arranged things to look much more like
> what went upstream.  However, there are parts here that differ from upstream.
> For instance, in arch/{i386,x86_64}/kernel/io_apic.c in the local patches, we
> skip doing the timer_irq_works() check if we are in a hypervisor, while (as
> far as I can tell) we don't skip that check in upstream.  Why is this?  Does
> upstream need this?  If upstream doesn't need this, why not?

With the preset_lpj (now known as lpj_fine) changes, skipping
timer_irq_works is no longer a requirement; we can drop that change.

> 
> 3)  Also in the second patch, during time initialization in both i386 and
> x86_64, in the local patch we set "preset_lpj" to the value we fetch from the
> hypervisor.  Instead, I went with the upstream solution (namely git hash
> 3da757daf86e498872855f0b5e101f763ba79499), and use the "lpj_fine" variable to
> basically do the same.

Okay, but this patch also has whitespace removal and KERN_CONT
definition changes; you might want to separate those out into a
different patch?

> 
> 4)  Also in the second patch, during time initialization in both i386 and
> x86_64, in the local patch we setup "lazy timer timer emulation".  From my
> (brief) reading, this VMware-specific feature basically means that timer
> interrupts can pile up and be merged together instead of holding time back
> until they are all injected.  This isn't setup in the upstream kernel, but is
> in this patchset.  Why is that?  Now, this one may be very different because
> of the constant 1000HZ RHEL-5 kernel vs. the upstream tickless kernel, but I
> want to make sure that we don't have to re-visit this patch for RHEL-6 and
> beyond.

For the newer kernels we detect that the kernel is running in tickless
(aka NO_HZ) mode and we automagically enter the lazy mode.
So it's not required for mainline, but *is* required for RHEL 5; please
add that change.

> 
> 5)  Patch 3 (minimal tsc clocksource for vmware) is an obvious one-off patch
> for RHEL-5 because of the old timekeeping code and because of the split nature
> of the i386 and x86_64 architectures.  In upstream, we don't actually need
> anything like this, correct?  That is, x86_64 already has the ability to use a
> TSC timesource now, yes?

You are right, mainline has a clocksource for 64-bit and hence this is
not needed there.

> Also, note that in this patch I've changed the kernel command-line parameter
> name from "[no_]timekeeping_use_tsc" to "x86_64_use_tsc".  I'm open to further
> suggestions for names, but I think having two options is overkill where we
> could instead do a "foo=[on|off|auto]".

Makes sense.

> 
> 6)  Patch 5 (Disable softlockup with the TSC based algorithm under VMware) is
> another one-off for RHEL-5.  Why doesn't the upstream kernel need to do this?

Please note this was done only for the 64-bit change for RHEL. With the
clocksource implementation in mainline this is mostly taken care of by
sched_clock, but yes, there are corner cases where this could be a
problem in mainline too. I need to add a stolen time implementation for
the kernel and make the softlockup watchdog aware of it, but I haven't
started working on it yet. This will be done for mainline later.
So please add this change for RHEL 5.

Thanks,
Alok
Comment 120 Chris Lalancette 2009-02-12 16:26:32 EST
(In reply to comment #119)
> The reason being, that development for RHEL was done before the mainline
> tree. Also as you too mention later, some of the changes are required
> only for the RHEL tree, mainline is okay because of the clocksource
> implementation. 
> 
> Another point is that we wanted to keep changes to a distro kernel as
> minimal as we could, doing a generic hypervisor level interface resulted
> in lot of code addition, but that generalization, too was done later in
> mainline and we didn't want to change the status quo with RHEL patches. 

Yes, totally understood.  I'm fine with it having gone on in parallel with the work upstream; however, given that we have time now before the 5.4 freeze, we may as well make it like it is upstream to get it right, and make sure that future fixes are easier to backport.

> Having said that, I appreciate the effort you have put looking through each
> change and understanding the behind it, but I would like to point one thing
> that every small change requires us to go through a whole lot of testing cycle
> before we can ack a particular patch. We have already gone through this cycle
> multiple times with RHEL for these changes, but hopefully this is the last time
> for these patches.

Right, understood.  I'm not trying to make busy work; I just want to make sure we do it right.

> > Please note that I'm doing this with an eye to keeping things working in
> > RHEL-6; I don't want to have to revisit this BZ then.
> 
> Have you decided what kernel RHEL 6 is going to be based off ? If its
> later than 2.6.29 there is anyways no need to revisit this BZ.

It's still up in the air, but it will be somewhere around 2.6.29 or 2.6.30.  So we should have these fixes for RHEL-6, which is good.

> > 
> > 1)  I dropped the first patch (add serial key to DMI check) completely in
> > favor of the dmi_name_in_serial() patch added upstream.  That seems relatively
> > uncontroversial.
> 
> Okay.
> 
> > 
> > 2)  For the second patch (code to detect running on vmware and get tsc
> > frequency from hypervisor), I've re-arranged things to look much more like
> > what went upstream.  However, there are parts here that differ from upstream.
> > For instance, in arch/{i386,x86_64}/kernel/io_apic.c in the local patches, we
> > skip doing the timer_irq_works() check if we are in a hypervisor, while (as
> > far as I can tell) we don't skip that check in upstream.  Why is this?  Does
> > upstream need this?  If upstream doesn't need this, why not?
> 
> With the preset_lpj (now known as lpj_fine) changes, skipping
> timer_irq_works is no longer a requirement; we can drop that change.

Ah, OK, great.

> > 
> > 3)  Also in the second patch, during time initialization in both i386 and
> > x86_64, in the local patch we set "preset_lpj" to the value we fetch from the
> > hypervisor.  Instead, I went with the upstream solution (namely git hash
> > 3da757daf86e498872855f0b5e101f763ba79499), and use the "lpj_fine" variable to
> > basically do the same.
> 
> Okay, but this patch also has whitespace removal and KERN_CONT
> definition changes; you might want to separate those out into a
> different patch?

Yeah, that's true.  That was pretty much compile fixes, but I'll split it back out.

> > 
> > 4)  Also in the second patch, during time initialization in both i386 and
> > x86_64, in the local patch we setup "lazy timer timer emulation".  From my
> > (brief) reading, this VMware-specific feature basically means that timer
> > interrupts can pile up and be merged together instead of holding time back
> > until they are all injected.  This isn't setup in the upstream kernel, but is
> > in this patchset.  Why is that?  Now, this one may be very different because
> > of the constant 1000HZ RHEL-5 kernel vs. the upstream tickless kernel, but I
> > want to make sure that we don't have to re-visit this patch for RHEL-6 and
> > beyond.
> 
> For the newer kernels we detect that the kernel is running in tickless
> (aka NO_HZ) mode and we automagically enter the lazy mode.
> So it's not required for mainline, but *is* required for RHEL 5; please
> add that change.

Ah, OK.  Yep, that's totally fine, I'll put it back into these patches.

> > 
> > 5)  Patch 3 (minimal tsc clocksource for vmware) is an obvious one-off patch
> > for RHEL-5 because of the old timekeeping code and because of the split nature
> > of the i386 and x86_64 architectures.  In upstream, we don't actually need
> > anything like this, correct?  That is, x86_64 already has the ability to use a
> > TSC timesource now, yes?
> 
> You are right, mainline has clocksource for 64bit and hence this is not
> needed there.
> 
> > Also, note that in this patch I've changed the kernel command-line parameter
> > name from "[no_]timekeeping_use_tsc" to "x86_64_use_tsc".  I'm open to further
> > suggestions for names, but I think having two options is overkill where we
> > could instead do a "foo=[on|off|auto]".
> 
> Makes sense.
> 
> > 
> > 6)  Patch 5 (Disable softlockup with the TSC based algorithm under VMware) is
> > another one-off for RHEL-5.  Why doesn't the upstream kernel need to do this?
> 
> Please note this was done only for the 64-bit change for RHEL. With the
> clocksource implementation in mainline this is mostly taken care of by
> sched_clock, but yes, there are corner cases where this could be a
> problem in mainline too. I need to add a stolen time implementation for
> the kernel and make the softlockup watchdog aware of it, but I haven't
> started working on it yet. This will be done for mainline later.
> So please add this change for RHEL 5.

OK, good.  As long as either upstream isn't affected, or it's on a TODO list somewhere, it's a fairly simple change to make, and should only affect VMware code paths.  I'll add it back in.

Again, thanks for taking the time with this; hopefully we won't have to go through too much more to get it done.  Rik also promised to re-review the updated patches at some point, so that will be another pair of eyes on it.

Chris Lalancette
Comment 121 Alok Kataria 2009-02-12 16:54:16 EST
(In reply to comment #120)

> Again, thanks for taking the time with this; hopefully we won't have to go
> through too much more to get it done.  Rik also promised to re-review the
> updated patches at some point, so that will be another pair of eyes on it.
> 

That's great. So once you have the two remaining changes (*) checked back in and Rik has had a look at these patches, please point me to the RPM for this kernel and I will give it a shot.

Thanks,
Alok


(*) The lazy timer emulation mode change and the change disabling the softlockup watchdog under VMware.
Comment 122 RHEL Product and Program Management 2009-02-16 10:07:47 EST
Updating PM score.
Comment 123 Chris Lalancette 2009-02-18 09:06:38 EST
OK, I uploaded new patches to http://people.redhat.com/clalance/bz463573 based on our last conversation.  I might have gone a little overboard; I basically did 1->1 for the upstream patches, and then just added the RHEL-5 specific patches on top.  Take a quick look; it should look very familiar to you, since it's mostly your upstream work :).  I compile-tested this on all arches; I just need to start doing some actual testing on it.

Chris Lalancette
Comment 124 Alok Kataria 2009-02-18 13:44:42 EST
(In reply to comment #123)
> OK, I uploaded new patches to http://people.redhat.com/clalance/bz463573 based
> on our last conversation.  I might have gone a little overboard; I basically
> did 1->1 for the upstream patches, and then just added the RHEL-5 specific
> patches on top.  Take a quick look; it should look very familiar to you, since
> it's mostly your upstream work :). 

Thanks Chris. Here are my comments.

1. patch 003 - x86-Hypervisor-detection-and-get-tsc_freq-from-hype.patch

I] Patch description is slightly truncated, it needs to have this...

 The current way to check if we are on VMware is as follows:
    #  check if "hypervisor present bit" is set, if so read the 0x40000000
       cpuid leaf and check for "VMwareVMware" signature.
    #  if the above fails, check the DMI vendors name for "VMware" string
       if we find one we query the VMware hypervisor port to check if we are
       under VMware.

II]. Why is the header file with prototypes for generic functions called
   "vmware-hypervisor.h"  ? it should be called just "hypervisor.h".

2. patch 004 x86-add-a-synthetic-TSC_RELIABLE-feature-bit.patch

Though the TSC_RELIABLE feature bit is not required for RHEL, we can live with the extra bit usage, I guess.

3. patch 0012 x86-vmware-lazy-timer-emulation.patch

This lazy mode needs to be enabled for 64bit too. For 64bit this should be done *only* when we are using the TSC based timekeeping method. Please add that code.

4. patch 0013 x86-Disable-softlock-processing-with-tsc-based-time.patch

This patch is wrong. We need to disable softlockup for 64bit only, *not* for
32bit. Also note, for 64bit it needs to be disabled only if TSC based
timekeeping is used.

> I compiled tested this on all arches, I
> just need to start doing some actual testing on it.

Okay, once you have done your testing please point me to the kernel RPM too.

Thanks,
Alok
Comment 125 Alok Kataria 2009-02-18 13:51:40 EST
(In reply to comment #124)
> (In reply to comment #123)
> > OK, I uploaded new patches to http://people.redhat.com/clalance/bz463573 based
> > on our last conversation.  I might have gone a little overboard; I basically
> > did 1->1 for the upstream patches, and then just added the RHEL-5 specific
> > patches on top.  Take a quick look; it should look very familiar to you, since
> > it's mostly your upstream work :). 
> 
> Thanks Chris. Here are my comments.
> 

Missed one comment,


patch 10 x86_64-Add-a-minimal-TSC-based-clocksource-implemen.patch

You might already have thought about it, but I would still like to bring this up.  With this patch, any hypervisor which defines get_hypervisor_tsc_freq will use the TSC based timekeeping method by default; does this warrant documentation of some sort?

Thanks,
Alok
Comment 128 Chris Lalancette 2009-02-19 09:46:22 EST
(In reply to comment #124)
> (In reply to comment #123)
> > OK, I uploaded new patches to http://people.redhat.com/clalance/bz463573 based
> > on our last conversation.  I might have gone a little overboard; I basically
> > did 1->1 for the upstream patches, and then just added the RHEL-5 specific
> > patches on top.  Take a quick look; it should look very familiar to you, since
> > it's mostly your upstream work :). 
> 
> Thanks Chris. Here are my comments.
> 
> 1. patch 003 - x86-Hypervisor-detection-and-get-tsc_freq-from-hype.patch
> 
> I] Patch description is slightly truncated, it needs to have this...
> 
>  The current way to check if we are on VMware is following,
>     #  check if "hypervisor present bit" is set, if so read the 0x40000000
>        cpuid leaf and check for "VMwareVMware" signature.
>     #  if the above fails, check the DMI vendors name for "VMware" string
>        if we find one we query the VMware hypervisor port to check if we are
>        under VMware.

Oops.  That was because I cut-n-pasted the description, and "git commit" took the # lines as comments.  I'm not quite sure how to fix that; for the moment, I just changed the # to - in the commit message to fix it up.

> 
> II]. Why is the header file with prototypes for generic functions called
>    "vmware-hypervisor.h"  ? it should be called just "hypervisor.h".

This is an unfortunate side-effect of our Xen patches in RHEL-5.  I won't go into the sordid details, but essentially the "hypervisor.h" file from this upstream submission collides with the "hypervisor.h" that we are carrying for the Xen stuff.  I'm open to other names besides "vmware-hypervisor.h", but it can't just be "hypervisor.h".

> 2. patch 004 x86-add-a-synthetic-TSC_RELIABLE-feature-bit.patch
> 
> Though the TSC_RELIABLE feature bit is not required for RHEL, we can live with
> the extra bit usage, I guess.

Right.  I guess it is spurious at this point, but we will never conflict with upstream anyway, so it's not really hurting anything, I don't think.

> 
> 3. patch 0012 x86-vmware-lazy-timer-emulation.patch
> 
> This lazy mode needs to be enabled for 64bit too. For 64bit this should be done
> *only* when we are using the TSC based timekeeping method. Please add that
> code.

Well, interestingly enough, in RHEL-5, this code is compiled for both i386 and x86_64 (despite the fact that it's only in the arch/i386 subdir).  So we should be good for x86_64.

I'll fix it up so it only enables it for 64-bit when the tsc method is active.

> 
> 4. patch 0013 x86-Disable-softlock-processing-with-tsc-based-time.patch
> 
> This patch is wrong. We need to disable softlockup for 64bit only, *not* for
> 32bit. Also note, for 64bit it needs to be disabled only if TSC based
> timekeeping is used.

Ah, right.  Will fix it up.

> patch 10 x86_64-Add-a-minimal-TSC-based-clocksource-implemen.patch
> 
> You might already have thought about it, but I would still like to bring this
> up.  With this patch, any hypervisor which defines get_hypervisor_tsc_freq
> will use the TSC based timekeeping method by default; does this warrant
> documentation of some sort?

Yes, I actually did realize that at some point, and then forgot about it.  You are right, this could have unwanted effects on other hypervisors; the way I've redone the enabling of "lazy time emulation" for point 3 above should alleviate this concern.

> 
> > I compiled tested this on all arches, I
> > just need to start doing some actual testing on it.
> 
> Okay, once you have done your testing please point me to the kernel RPM too.

Yep, will do!  I will fix your comments above, and then at least do some basic testing before I hand it over to you.

Chris Lalancette
Comment 129 Marcelo Tosatti 2009-02-19 15:03:20 EST
Chris,

Please hold off on sending the patch for inclusion in the next release; we need to solve a divide error in:

(gdb) l *(0xffffffff8006f03c)
0xffffffff8006f03c is in paravirt_get_tsc_khz (include/linux/kvm_hypervisor.h:29).
24	
25	static inline unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src)
26	{
27	        u64 pv_tsc_khz = 1000000ULL << 32;
28	
29	        do_div(pv_tsc_khz, src->tsc_to_system_mul);
30	        if (src->tsc_shift < 0)
31	                pv_tsc_khz <<= -src->tsc_shift;
32	        else
33	                pv_tsc_khz >>= src->tsc_shift;

When using HPET.
Comment 130 Chris Lalancette 2009-02-19 16:25:36 EST
Marcelo,
     OK, no problem.  I'm still finishing up the rebased patches (based on comments from Alok), and then I need to do some testing, so there is some time before we send the patches for internal review.  Just send me the patches when you are ready (or attach them to the BZ).

Chris Lalancette
Comment 131 Alok Kataria 2009-02-19 16:34:56 EST
(In reply to comment #128)
> > 
> > II]. Why is the header file with prototypes for generic functions called
> >    "vmware-hypervisor.h"  ? it should be called just "hypervisor.h".
> 
> This is an unfortunate side-effect of our Xen patches in RHEL-5.  I won't go
> into the sordid details, but essentially the "hypervisor.h" file from this
> upstream submission collides with the "hypervisor.h" that we are carrying for
> the Xen stuff.  I'm open to other names besides "vmware-hypervisor.h", but it
> can't just be "hypervisor.h".
> 

"generic-hypervisor.h"  ?
Comment 132 Chris Lalancette 2009-02-19 16:39:49 EST
(In reply to comment #131)
> (In reply to comment #128)
> > > 
> > > II]. Why is the header file with prototypes for generic functions called
> > >    "vmware-hypervisor.h"  ? it should be called just "hypervisor.h".
> > 
> > This is an unfortunate side-effect of our Xen patches in RHEL-5.  I won't go
> > into the sordid details, but essentially the "hypervisor.h" file from this
> > upstream submission collides with the "hypervisor.h" that we are carrying for
> > the Xen stuff.  I'm open to other names besides "vmware-hypervisor.h", but it
> > can't just be "hypervisor.h".
> > 
> 
> "generic-hypervisor.h"  ?

Works for me; I'll rename to that.  Thanks!

Chris Lalancette
Comment 133 Marcelo Tosatti 2009-02-19 21:29:22 EST
Chris,

False alarm, with your patchset there is no issue.

Sorry for the noise.
Comment 134 Chris Lalancette 2009-02-20 13:10:06 EST
Alok,
     OK, I think I've finished up the changes you requested, and I've now done basic testing on the patches.  I haven't beat up on them yet; my testing basically consisted of booting both the x86_64 and i386 kernels on bare-metal, as a Xen FV guest, and as a KVM FV guest, and just running a few basic tests to make sure they weren't broken.  I've uploaded both the current patchset and the kernels to http://people.redhat.com/clalance/bz463573.  Note that I only uploaded the x86_64, i686, i686-PAE, and src RPMs due to space constraints; if you need more than that, let me know and I'll figure something out.  Hopefully you can get started on testing as well, and we can compare notes.  Thanks again for your help with all of this.

Chris Lalancette
Comment 135 Akemi Yagi 2009-02-20 13:22:21 EST
Chris,
I would like to test the patches. But I'm having difficulties connecting to  http://people.redhat.com/clalance/bz463573 .  This is not open to the public?
Comment 136 Chris Lalancette 2009-02-20 13:33:07 EST
It's open to anyone, as far as I know.  But you are right, I'm having a hard time connecting to it from outside Red Hat too.  I'll open up an internal request to see what is going on.

Chris Lalancette
Comment 137 Alok Kataria 2009-02-20 14:14:19 EST
(In reply to comment #134)
> Alok,
>      OK, I think I've finished up the changes you requested, and I've now done
> basic testing on the patches.  I haven't beat up on them yet; my testing
> basically consisted of booting both the x86_64 and i386 kernels on bare-metal,
> as a Xen FV guest, and as a KVM FV guest, and just running a few basic tests to
> make sure they weren't broken.  I've uploaded both the current patchset and the
> kernels to http://people.redhat.com/clalance/bz463573.  Note that I only
> uploaded the x86_64, i686, i686-PAE, and src RPMs due to space constraints; if
> you need more than that, let me know and I'll figure something out. 

This should be enough, I guess; let me check the patches and then I will start testing them.

> Hopefully  you can get started on testing as well, and we can compare notes.  

Yep, sure thing; as soon as the access problem is resolved I will get on to this.

Thanks,
Alok
Comment 138 Alok Kataria 2009-02-23 17:32:34 EST
(In reply to comment #137)
> (In reply to comment #134)
> > Alok,
> >      OK, I think I've finished up the changes you requested, and I've now done
> > basic testing on the patches.  I haven't beat up on them yet; my testing
> > basically consisted of booting both the x86_64 and i386 kernels on bare-metal,
> > as a Xen FV guest, and as a KVM FV guest, and just running a few basic tests to
> > make sure they weren't broken.  I've uploaded both the current patchset and the
> > kernels to http://people.redhat.com/clalance/bz463573.  Note that I only
> > uploaded the x86_64, i686, i686-PAE, and src RPMs due to space constraints; if
> > you need more than that, let me know and I'll figure something out. 
> 
> This should be enough i guess, let me check the patches and then i will start
> with testing it. 
> 
> > Hopefully  you can get started on testing as well, and we can compare notes. 

> 
> Yep sure thing, as soon as the access problem is resolved i will get on to
> this. 

Hi Chris, 

The access problem is still unresolved, though the links were briefly accessible on Friday evening and I could download some of the RPMs.

While reviewing these patches I noticed that the new patches are missing some of the code to skip the TSC verification and TSC unsync checks; without these changes the TSC is marked as unsynchronized, and hence unstable, on 32-bit AMD systems.

I am attaching a patch to fix this.

Also, can you please upload the header packages for these kernels?

Thanks,
Alok
Comment 139 Alok Kataria 2009-02-23 17:36:02 EST
Created attachment 332979 [details]
Fix TSC behavior when running under VMware on AMD systems.

On AMD systems the TSC is marked as unstable; with the previous patches we used to explicitly check for VMware in x86_hyper_vendor and skip these checks.  With the new patches we can use the synthetic CPU feature bits for this.

Mainline behavior:
On mainline, the unsynchronized_tsc check tests the CONSTANT_TSC bit.  Since I didn't want to change kernel behavior on a native AMD system, I am checking for the FEATURE_TSC_RELIABLE bit here.

For the verify_tsc_freq case, mainline too checks the FEATURE_TSC_RELIABLE bit.
Comment 140 Akemi Yagi 2009-02-23 19:10:58 EST
(In reply to comment #138)
> (In reply to comment #137)
> > (In reply to comment #134)

> I am attaching a patch to fix this.
> 
> Also can you please upload the header packages for these kernel.
> 
> Thanks,
> Alok

If it helps, I have rebuilt the kernels from the srpm and made the packages (including -devel and -header) available at:

http://centos.toracat.org/kernel/centos5/bz463573/

I can upload the -PAE type if needed.

Akemi
Comment 141 Alok Kataria 2009-02-23 19:16:05 EST
(In reply to comment #140)
> (In reply to comment #138)
> > (In reply to comment #137)
> > > (In reply to comment #134)
> 
> > I am attaching a patch to fix this.
> > 
> > Also can you please upload the header packages for these kernel.
> > 
> > Thanks,
> > Alok
> 
> If it helps, I have rebuilt the kernels from the srpm and made the packages
> (including -devel and -header) available at:
> 
> http://centos.toracat.org/kernel/centos5/bz463573/

Is this with the fix for AMD that I sent earlier?

Thanks,
Alok
> 
> I can upload the -PAE type if needed.
> 
> Akemi
Comment 142 Akemi Yagi 2009-02-23 19:20:41 EST
No, this is *before*.  I built them yesterday.  Shall I try building them with the latest patch?

Akemi
Comment 143 Alok Kataria 2009-02-23 19:29:30 EST
(In reply to comment #142)
> No, this is *before*.  I built them yesterday.  Shall I try building them with
> the latest patch?

No, that's fine; I was just curious.  I have the kernel compiled with this for my testing.  Only after my correctness testing is completed will I ask QA over here to do a final, complete stress test, which they can run on the RPMs from Chris.

Thanks,
Alok
Comment 144 Chris Lalancette 2009-02-24 03:05:51 EST
(In reply to comment #138)
> Hi Chris, 
> 
> The access problem is still unresolved, though the links were briefly
> accessible on Friday evening and I could download some of the RPMs.

Hm, are you sure?  IT assures me it is back up, and I can access it both from an internal machine and an external machine.  Maybe your browser is getting confused or something?  Can you just try something like:

# wget http://people.redhat.com/clalance/bz463573/kernel-2.6.18-131.el5dz_test.i686.rpm

> While reviewing these patches I noticed that the new patches are missing some
> of the code to skip TSC verification checks and TSC unsync checks; without
> these changes TSC is marked as unsynchorinized and hence unstable on 32bit AMD
> systems. 
> 
> I am attaching a patch to fix this.

OK, sure.  I will definitely take a look and fold it in.

> 
> Also can you please upload the header packages for these kernel.

Done now.

Chris Lalancette
Comment 145 Chris Lalancette 2009-02-24 06:17:56 EST
(In reply to comment #144)
> (In reply to comment #138)
> > Hi Chris, 
> > 
> > The access problem is still unresolved, though the links were briefly
> > accessible on Friday evening and I could download some of the RPMs.
> 
> Hm, are you sure?  IT assures me it is back up, and I can access it both from
> an internal machine and an external machine.  Maybe your browser is getting
> confused or something?  Can you just try something like:
> 
> # wget
> http://people.redhat.com/clalance/bz463573/kernel-2.6.18-131.el5dz_test.i686.rpm

If that still doesn't work, I've also uploaded all of the same stuff to another server:

http://illuminari.org/~clalancette/bz463573

So as not to stand in the way of progress.

Chris Lalancette
Comment 146 Alok Kataria 2009-03-02 18:08:52 EST
Hi Chris, 

The RPM that you uploaded with the TSC fix for AMD and the other patches looks in perfect shape from the TSC based timekeeping perspective, and testing didn't bring up anything suspicious either.

So I think we are pretty much done with this now.

Thanks,
Alok
Comment 147 Chris Lalancette 2009-03-05 06:03:07 EST
Alok,
     Sorry for the delay.  Great, thanks for the testing.  We'll do some additional testing on them, and get them queued up.  Thanks again!

Chris Lalancette
Comment 153 Don Zickus 2009-04-20 13:10:33 EDT
in kernel-2.6.18-140.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.
Comment 155 Chris Lalancette 2009-04-22 06:33:01 EDT
Alok,
     As you've seen in comment #153, these patches were committed to the RHEL-5.4 tree.  One note about the committed patches: we changed the kernel command-line argument (yet again) to be "clock=tsccount" or "clock=notsccount".  It was pointed out that we regularly get bug reports to the effect that "clock=tsc notsc" doesn't do what people expect.  If the user had specified something like "clock=jiffies x86_64_use_tsc=on", it would have been difficult to figure out what they intended; using the "clock" parameter neatly sidesteps that.

The default is still to "auto-detect" whether we are running under VMware, and if so, turn on the new TSC counting mode.  If you specify "clock=tsccount", it forces the kernel to use the new TSC counting mode, regardless of whether we are running under VMware or not.  If you specify "clock=notsccount", it forces the kernel to *not* use the new TSC counting mode, regardless of whether we are running under VMware or not.

Thanks for your patience and work on the patches, and let me know if you have questions or concerns.  I'll set this to NEEDINFO from you for now, but feel free to clear it if you don't have anything more to add.

Chris Lalancette
Comment 156 Alok Kataria 2009-04-22 18:14:29 EDT
(In reply to comment #155)

> The default is still to "auto-detect" whether we are running under VMware, and
> if so, turn on the new TSC counting mode.  If you specify "clock=tsccount", it
> forces the kernel to use the new TSC counting mode, regardless of whether we
> are running under VMware or not.  If you specify "clock=notsccount", it forces
> the kernel to *not* use the new TSC counting mode, regardless of whether we are
> running under VMware or not.

This new command-line parameter looks okay to me.  I took a cursory look over the rest of the code, and it looks similar to the previous version that we had.

Thanks,
Alok
Comment 157 Chris Lalancette 2009-04-23 04:12:59 EDT
OK, thanks again!

Chris Lalancette
Comment 161 Chris Ward 2009-07-03 14:09:26 EDT
~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.

Questions can be posted to this bug or your customer or partner representative.
Comment 162 Alok Kataria 2009-07-21 18:17:50 EDT
Our testing on RHEL5.4 for timekeeping didn't bring up any issues. Thanks.
Comment 163 Chris Lalancette 2009-07-27 10:24:28 EDT
(In reply to comment #162)
> Our testing on RHEL5.4 for timekeeping didn't bring up any issues. Thanks.  

Great, thanks for the additional testing and confirmation!

Chris Lalancette
Comment 166 Giles Westwood 2009-08-17 11:21:23 EDT
Is it still recommended to use "clocksource=acpi_pm divider=10" ?
Comment 167 Garrett Smith 2009-08-17 15:54:50 EDT
With these patches you should not use "clocksource=acpi_pm".  

Using "divider=10" is at your discretion: using it will lower the timer interrupt rate, which will lower the CPU overhead of idle VMs; not using it will give a finer granularity for wakeups.  Having a 1ms rather than a 10ms granularity doesn't matter for most applications, so using "divider=10" is probably the right tradeoff.  Either way, it should no longer affect timekeeping in the VM.
Comment 168 errata-xmlrpc 2009-09-02 04:35:28 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html
Comment 169 Prarit Bhargava 2011-10-17 10:13:36 EDT
*** Bug 462410 has been marked as a duplicate of this bug. ***
Comment 170 Prarit Bhargava 2011-11-03 13:33:26 EDT
*** Bug 466606 has been marked as a duplicate of this bug. ***
