Bug 538022

Summary: java.util.concurrent: long delay and intervals drift since kernel update to 164
Product: Red Hat Enterprise Linux 5
Reporter: Sarah Windler Burri <sarah.windler>
Component: kernel
Assignee: Chris Lalancette <clalance>
Status: CLOSED ERRATA
QA Contact: Virtualization Bugs <virt-bugs>
Severity: high
Priority: high
Docs Contact:
Version: 5.4
CC: adrian.fischli, akarlsso, akataria, clalance, ctatman, cward, dhecht, dhoward, dougsland, drjones, garrett, jasonmc, mjenner, moshiro, plyons, prarit, pveiga, qcai, rprice, tao
Target Milestone: rc
Keywords: Regression, ZStream
Target Release: ---
Hardware: i386
OS: Linux
Whiteboard: kernel update, time sync, java.util.concurrent, virtual servers
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
When running under VMware, timeouts in user applications could occur much later than they should have due to faulty kernel timekeeping. With this update, kernel timekeeping has been improved and timeouts occur in a more timely manner.
Story Points: ---
Clone Of:
Environment: Java on RedHat on virtual servers
Last Closed: 2011-01-13 20:55:09 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 679412    
Attachments:
Description Flags
scheduler.java
none
system call test with futex()
none
dmesg -164 with clock option
none
dmesg kernel 140
none
TSC based timer for 32bit
none
V2 TSC based timer for 32bit
none
Version 3 of the VMware TSC timer patch
none
Version 4 of the VMware TSC timer patch
none
Interdiff between version 1 and version 4 of the patch
none
TSC based timer for 32bit
none
Address some review comments.
none
RHEL5 patch
none
Test patch to fix a race in tsc based timer startup for i386.
none

Description Sarah Windler Burri 2009-11-17 10:06:10 UTC
Description of problem:
Since we updated to RHEL 5.4 (kernel-164), the execution time of the Java library method
java.util.concurrent.ScheduledExecutorService.scheduleAtFixedRate(Runnable command, long initialDelay, long period, TimeUnit unit)
drifts. This happens only on virtual servers.

How reproducible:
I started the attached scheduler.java on a virtual server with:

Linux vpp2-srv-templa.ethz.ch 2.6.18-128.7.1.el5 #1 SMP Wed Aug 19 04:00:44 EDT 2009 i686 athlon i386 GNU/Linux

java version "1.6.0_15"
Java(TM) SE Runtime Environment (build 1.6.0_15-b03)
Java HotSpot(TM) Client VM (build 14.1-b02, mixed mode)

Below you can see the output of scheduler.java:

Default TimeZone: sun.util.calendar.ZoneInfo[id="Europe/Zurich",offset=3600000,dstSavings=3600000,useDaylight=true,transitions=119,lastRule=java.util.SimpleTimeZone[id=Europe/Zurich,offset=3600000,dstSavings=3600000,useDaylight=true,startYear=0,startMode=2,startMonth=2,startDay=-1,startDayOfWeek=1,startTime=3600000,startTimeMode=2,endMode=2,endMonth=9,endDay=-1,endDayOfWeek=1,endTime=3600000,endTimeMode=2]]

Current time in milliseconds = 1257759514436 => Mon Nov 09 10:38:34 CET 2009

Object should start at: Tue Nov 10 07:00:00 CET 2009 with delay in Milliseconds: 73285387


Tue Nov 10 07:00:12 CET 2009: Object start
Tue Nov 10 07:00:13 CET 2009: Object end


Then we patched the RedHat kernel to:
Linux vpp2-srv-templa.ethz.ch 2.6.18-164.6.1.el5 #1 SMP Tue Oct 27 11:30:06 EDT 2009 i686 athlon i386 GNU/Linux

and we received the result below:
Wed Nov 11 07:34:21 CET 2009: Object start
Wed Nov 11 07:34:22 CET 2009: Object end
Thu Nov 12 07:50:31 CET 2009: Object start
Thu Nov 12 07:50:32 CET 2009: Object end
Fri Nov 13 07:46:36 CET 2009: Object start
Fri Nov 13 07:46:37 CET 2009: Object end
Sat Nov 14 07:44:24 CET 2009: Object start
Sat Nov 14 07:44:25 CET 2009: Object end
Sun Nov 15 07:44:30 CET 2009: Object start
Sun Nov 15 07:44:31 CET 2009: Object end
Mon Nov 16 07:49:01 CET 2009: Object start
Mon Nov 16 07:49:02 CET 2009: Object end

Additional info:
I have used the Java method scheduleAtFixedRate(...) for years to start Java Runnables. The library method provides a "crontab"-like service. But since the update from kernel-128 to kernel-164, long delay and long interval parameters no longer work correctly on virtual servers. Please help.
Sarah

Comment 1 Sarah Windler Burri 2009-11-17 12:27:57 UTC
Created attachment 369870 [details]
scheduler.java

Attachment scheduler.java

Comment 2 Sarah Windler Burri 2009-11-23 06:47:19 UTC
Sorry, I forgot to include information about our virtual environment:
We use VMware, version 4.
Sarah

Comment 4 Sarah Windler Burri 2010-04-08 13:20:45 UTC
Hi there 
Here is another test program, this time in C: it tests time drift on a virtual server with the system call futex(). We found that the Java concurrent library uses this call heavily for waiting. The output of the futex() test is listed below:

./test.exe 240 (240 means futex() will wait for four hours)
aktueller Zeitstempel:  08.04.2010  07:41:09   (current timestamp)
erwarteter Zeitstempel: 08.04.2010  11:41:09   (expected timestamp)
gemessener Zeitstempel: 08.04.2010  11:43:40   (measured timestamp)

Der Prozess wurde 151 Sekunden zu spaet geweckt (the process was woken 151 seconds too late)


In this test, time drifts by more than two minutes on a freshly patched Red Hat system on a 32-bit virtual server!
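For reference, here is a minimal sketch of what such a futex() timeout test can look like. It is an illustrative reconstruction under obvious assumptions, not the actual source of attachment 405307:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

static int futex_word = 0;

int main(int argc, char **argv)
{
    /* Argument is the wait time in minutes, e.g. 240 = four hours. */
    long minutes = (argc > 1) ? atol(argv[1]) : 240;
    struct timespec timeout = { .tv_sec = minutes * 60, .tv_nsec = 0 };

    time_t start = time(NULL);
    time_t expected = start + timeout.tv_sec;

    /* FUTEX_WAIT blocks while futex_word == 0; nothing ever wakes us,
       so the call should return with ETIMEDOUT once 'timeout' elapses. */
    syscall(SYS_futex, &futex_word, FUTEX_WAIT, 0, &timeout, NULL, 0);

    time_t measured = time(NULL);
    printf("expected wake-up: %s", ctime(&expected));
    printf("measured wake-up: %s", ctime(&measured));
    printf("woken %ld seconds late\n", (long)(measured - expected));
    return 0;
}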

Again: this is not a Java issue, but rather a kernel issue.
Regards
Sarah

Comment 5 Sarah Windler Burri 2010-04-08 13:22:36 UTC
Created attachment 405307 [details]
system call test with futex()

Comment 13 Douglas Schilling Landgraf 2010-04-22 18:26:41 UTC
Created attachment 408408 [details]
dmesg -164 with clock option

Comment 14 Douglas Schilling Landgraf 2010-04-22 18:30:17 UTC
Created attachment 408410 [details]
dmesg kernel 140

Comment 15 Douglas Schilling Landgraf 2010-04-22 18:35:03 UTC
Hello Chris,

> Those patches actually did multiple things, and the goal of them was actually 
> to improve timekeeping in a VMware guest.  I have to admit that I've never 
> tested them with the FUTEX_WAIT stuff, though, so something subtle may be 
> broken there.  To start with, I have two requests:
> 
> 1)  Give me the full dmesg of a problem guest.
> 2)  Try booting one of the problem guests (preferably a -164 or later kernel) 
> with "clock=notsccount" on the guest kernel command-line.  This should 
> disable the new TSC count mode that we added to help on VMware, and hopefully 
> take it 
> out of the equation (or point it out as the culprit).  After booting with 
> this option, please give me a full dmesg, and also re-run the test to see if the problem persists.

I have attached all the dmesg output. Even with the clock=notsccount option, the behaviour is the same:

./test.exe 240

aktueller Zeitstempel:  22.04.2010  06:45:50   (current timestamp)
erwarteter Zeitstempel: 22.04.2010  10:45:50   (expected timestamp)
gemessener Zeitstempel: 22.04.2010  10:57:36   (measured timestamp)


Let me know if you need any additional info.

Thanks
Douglas

Comment 21 Chris Lalancette 2010-07-01 13:40:43 UTC
Hello Alok,
     I've CC'ed you on this bug since we are having some complaints that the time-keeping VMware patches we put into RHEL-5.4 are causing problems.  In particular, the test program in Comment #5 seems to always wake up late with the VMware patches in place.  I've confirmed that this is because of the "[x86] vmware lazy timer emulation" patch; if I back just that patch out of a RHEL-5 kernel, then the example code no longer loses time.
     That being said, it seems like the behavior that is being seen is, in fact, the *point* of lazy timer evaluation.  Do you have any guidance here about what we should be doing?  Should we revert this patch and go back to non-lazy timer evaluation?  Should we add a command-line option to disable it?  Can you give us any idea of VMware recommended best practices about this problem?

Thanks,
Chris Lalancette

Comment 22 Alok Kataria 2010-07-01 20:45:29 UTC
Hi Chris,

Thanks for bringing this to my notice.

I think we understand what the underlying problem could be: the 32-bit kernel doesn't advance jiffies along with real time, unlike the 64-bit code path. This can result in poll timeouts being delayed.

I need to see how much of a coding effort it would be to fix this; let me take a detailed look and get back to you. Please note that, given the long weekend and other things, I might not get to the bottom of this until next week.

In the meantime, can you please confirm that the 64-bit kernels don't have this problem?

Thanks again,
Alok

Comment 25 Guy Streeter 2010-07-27 16:05:20 UTC
Our customer confirms this does not occur on 64-bit systems.

Comment 26 Guy Streeter 2010-08-02 19:45:58 UTC
Alok,
 Have you had a chance to look into this issue? It is affecting our mutual customers and they need a resolution.
thanks,
--Guy

Comment 27 Alok Kataria 2010-08-02 20:57:29 UTC
Hi Guy,

Sorry for the delay; I have a half-baked patch somewhere on my box. Let me find the tree and bring this to closure soon. Please give me until the end of this week to get back with a patch that solves the problem.

Let's keep it need-info'ed to me until I provide the patch.

Thanks,
Alok

Comment 28 Alok Kataria 2010-08-07 02:04:59 UTC
Created attachment 437291 [details]
TSC based timer for 32bit

Hi Guy/Chris,

As promised :), attached is a patch that fixes the reported problem.
It is pretty similar to how the 64-bit tsc_based_timer works.

Please note that I still need to run detailed tests with this patch, hence I haven't signed it off yet. However, I have verified that the uploaded futex test no longer exhibits the problem on 32-bit.

Please take a look and let me know any comments that you may have.

Thanks,
Alok

Comment 29 Alok Kataria 2010-08-07 02:08:22 UTC
Forgot to mention: in the patch above I am also setting the preset_lpj value when running on VMware, so that the lpj value is correct for all CPUs on VMware.

Comment 30 Issue Tracker 2010-08-16 20:52:43 UTC
Event posted on 08-16-2010 04:46pm EDT by dlandgra

Hello Nitin,

Here is the test kernel:
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2683069

Please remember that test packages are not supported packages.

Finally, to make the patch provided by Alok work, I have made the following
change:

+#ifndef CONFIG_XEN
+               init_tsc_timer();
+#endif

I added the #ifndef CONFIG_XEN guard to avoid breaking the kernel build for
Xen kernels, because Xen kernels don't include vmware.o and hypervisor.o
according to arch/i386/kernel/cpu/Makefile.

Cheers
Douglas


This event sent from IssueTracker by dlandgra 
 issue 378445

Comment 31 Douglas Schilling Landgraf 2010-08-16 21:01:35 UTC
Created attachment 439015 [details]
V2 TSC based timer for 32bit

Comment 33 Issue Tracker 2010-08-23 11:39:33 UTC
Event posted on 08-23-2010 04:34am EDT by nbansal

The test kernel works fine. 
 
[javavpp@id-rz-dock-1-118 scheduler]$ uname -a
Linux id-rz-dock-1-118.ethz.ch 2.6.18-164.el5.IT378445.2.TEST #1 SMP Mon Aug 16 13:40:53 EDT 2010 i686 athlon i386 GNU/Linux

[javavpp@id-rz-dock-1-118 scheduler]$ more out
Default TimeZone: sun.util.calendar.ZoneInfo[id="Europe/Zurich",offset=3600000,dstSavings=3600000,useDaylight=true,transitions=119,lastRule=java.util.SimpleTimeZone[id=Europe/Zurich,offset=3600000,dstSavings=3600000,useDaylight=true,startYear=0,startMode=2,startMonth=2,startDay=-1,startDayOfWeek=1,startTime=3600000,startTimeMode=2,endMode=2,endMonth=9,endDay=-1,endDayOfWeek=1,endTime=3600000,endTimeMode=2]]

Current time in milliseconds = 1282310296228 => Fri Aug 20 15:18:16 CEST 2010

Object should start at: Sat Aug 21 07:00:00 CEST 2010 with delay in Milliseconds: 56503334

Sat Aug 21 07:00:08 CEST 2010: Object start
Sat Aug 21 07:00:09 CEST 2010: Object end
Sun Aug 22 07:00:13 CEST 2010: Object start
Sun Aug 22 07:00:14 CEST 2010: Object end
Mon Aug 23 07:00:13 CEST 2010: Object start
Mon Aug 23 07:00:14 CEST 2010: Object end
 

Internal Status set to 'Waiting on SEG'

This event sent from IssueTracker by dlandgra 
 issue 378445

Comment 35 Chris Lalancette 2010-08-31 20:32:59 UTC
Hey Alok,
     We've found a couple of bugs in the original patch you sent us.  Here are the changes we've made:

1)  We added a #ifndef CONFIG_XEN around the init_tsc_timer() code so it compiles on Xen (this should be pretty uncontroversial).

2)  I found that in the implementation of do_timer_tsc_timekeeping(), we weren't accounting for the tick divider.  That is, in the x86_64 version of do_timer_tsc_timekeeping, we do a loop of:

		for (i = 0; i < tick_divider; i++)
			do_timer_jiffy(regs);

to take into account all of the ticks.  As far as I can tell, this is also needed in the 32-bit version.  I've added it.

3)  During subsequent testing of the x86_64 version of this code, we found that we could get softlockups because we were accounting too many ticks in one cycle.  Therefore, in the x86_64 version, we added a "touch_all_softlockup_watchdogs()" call in do_timer_tsc_timekeeping() to touch the softlockup timers as appropriate. I think we also need this for the 32-bit version, so I've added it as well.

4)  I've removed the part of the patch that sets preset_lpj.  Despite the horrible variable naming, it seems the semantics of preset_lpj are to take a command-line parameter for the lpj.  For a "detected" lpj (rather than calibrated), we usually use the lpj_fine parameter.  Additionally, we were already setting lpj_fine in the x86_64 version, so I've changed arch/i386/kernel/tsc.c:tsc_init() to call get_hypervisor_tsc_freq() like the 64-bit version, and update the tsc_khz and cpu_khz (which should trickle down into lpj_fine).  However, looking in arch/i386/kernel/cpu/vmware.c:vmware_get_tsc_khz() (which is called to do the lpj_fine adjustment), it looks like we only turn on lazy_timer_evaluation (and nosoftlokup) if timekeeping_use_tsc >= 0, which is only true for the 64-bit version now.  I know that the lazy timer evaluation sort of caused this bug, but with our new tsc counting algorithm for 32-bit, do we want to turn this on for 32-bit?
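For clarity, here is a hedged sketch of the two lpj-seeding options weighed earlier in this item. It is illustrative glue only, loosely modeled on the mainline VMware code; the surrounding function body and exact placement are assumptions, not the actual patch hunk:

void tsc_init(void)                 /* arch/i386/kernel/tsc.c, simplified sketch */
{
	unsigned long hv_khz = get_hypervisor_tsc_freq();  /* TSC freq (kHz) reported by VMware */

	if (hv_khz) {
		tsc_khz = hv_khz;
		cpu_khz = hv_khz;

		/* Option A: seed lpj_fine; only the boot CPU skips
		 * delay-loop calibration, the APs still calibrate. */
		lpj_fine = (hv_khz * 1000) / HZ;

		/* Option B (as on mainline): seed preset_lpj so every CPU,
		 * boot CPU and APs alike, skips delay-loop calibration. */
		preset_lpj = (hv_khz * 1000) / HZ;
	}

	/* ... calibration fallback and clocksource registration ... */
}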

Thanks,
Chris Lalancette

Comment 36 Chris Lalancette 2010-08-31 20:34:05 UTC
Created attachment 442273 [details]
Version 3 of the VMware TSC timer patch

Version 3 of the TSC-counting algorithm, with the changes explained in the previous comment.

Comment 37 Chris Lalancette 2010-08-31 20:59:42 UTC
Alok,
     Just an additional note: the above patch fails to build on x86_64.  It's because vmware_init_tsc_timer() doesn't exist on 64-bit, but the 64-bit code paths go through there.  I think I may fix it up to do all of the stuff that vmware_init_tsc_timer() does within tsc_init() (similar to what the 64-bit version does), but it should be a pretty minor change.  The rest of my comments still stand.

Thanks,
Chris Lalancette

Comment 38 Alok Kataria 2010-09-01 01:56:24 UTC
Hi Chris,

Thanks for working on this.

I have a request though: can you please generate a patch on top of my earlier one? Reviews are painful otherwise. It also helps to figure out what other changes you might have made; for instance, cycles_accounted_limit is now set to just 10 seconds instead of 10 minutes. I could easily have missed that if I hadn't done a diff between my version and yours.
It also makes it easier to blame someone when we know who introduced which part of the change :)

Below are my high-level comments; I will look in more detail once you post the incremental patch.

> 1)  We added a #ifndef CONFIG_XEN around the init_tsc_timer() code so it
> compiles on Xen (this should be pretty uncontroversial).

Yeah :)

> 
> 2)  I found that in the implementation of do_timer_tsc_timekeeping(), we
> weren't accounting for the tick divider.  That is, in the x86_64 version of
> do_timer_tsc_timekeeping, we do a loop of:
> 
>   for (i = 0; i < tick_divider; i++)
>    do_timer_jiffy(regs);
> 
> to take into account all of the ticks.  As far as I can tell, this is also
> needed in the 32-bit version.  I've added it.

It's not needed in the 32-bit version; do_timer_jiffy in 32-bit already has a tick_divider loop, unlike the 64-bit version.

Looking at it now, I think the 64-bit version has a bug for the
CONFIG_X86_LOCAL_APIC && !using_apic_timer case: we end up with a nested tick_divider loop. We will hit that problem only if the APIC is disabled.

> 
> 3)  During subsequent testing of the x86_64 version of this code, we found that
> we could get softlockups because we were accounting too many ticks in one
> cycle.  Therefore, in the x86_64 version, we added a
> "touch_all_softlockup_watchdogs()" call in do_timer_tsc_timekeeping() to touch
> the softlockup timers as appropriate. I think we also need this for the 32-bit
> version, so I've added it as well.

In my testing of the 64-bit change back then, I saw that touching the watchdog didn't suffice and some false softlockups were still reported; that's the reason we disabled softlockups in this mode.  Though I don't feel strongly about this, I am fine with this change.

> 
> 4)  I've removed the part of the patch that sets preset_lpj.  Despite the
> horrible variable naming, it seems the semantics of preset_lpj are to take a
> command-line parameter for the lpj.  For a "detected" lpj (rather than
> calibrated), we usually use the lpj_fine parameter.  Additionally, we were
> already setting lpj_fine in the x86_64 version, so I've changed
> arch/i386/kernel/tsc.c:tsc_init() to call get_hypervisor_tsc_freq() like the
> 64-bit version, and update the tsc_khz and cpu_khz (which should trickle down
> into lpj_fine). 

Yeah, though lpj_fine is accounted for only on the boot CPU; the APs still go through the delay-calibration dance. On mainline both VMware and KVM use preset_lpj to skip delay calibration. So it would be great if you could keep the preset_lpj part of the change; I am thinking we can get rid of lpj_fine if no one else depends on it.

>  However, looking in
> arch/i386/kernel/cpu/vmware.c:vmware_get_tsc_khz() (which is called to do the
> lpj_fine adjustment), it looks like we only turn on lazy_timer_evaluation (and
> nosoftlokup) if timekeeping_use_tsc >= 0, which is only true for the 64-bit
> version now. 

Not really; note that we are checking whether timekeeping_use_tsc is zero or greater. On 32-bit it is zero to start with, so we still try to enable the lazy timer emulation mode there too.

Thanks again for taking a look.

Comment 39 Alok Kataria 2010-09-01 02:12:01 UTC
(In reply to comment #37)
> Alok,
>      Just an additional note, the above patch fails to build on x86_64.  It's
> because vmware_init_tsc_timer() doesn't exist on 64-bit, but the 64-bit code
> paths go through there.  I think may fix it up to do all of the stuff that
> vmware_init_tsc_timer() does within tsc_init() (similar to what 64-bit version
> does), but it should be a pretty minor change. 

That won't work. The reason I had to do it later in the boot process is that I wanted to enable this mode only after the kernel is done booting all the CPUs.
The 32-bit kernel has the functions synchronize_tsc_bp/ap, which reset the TSC to zero for synchronization purposes. As a result, at some point the last_tsc_accounted value is far greater than what rdtscll returns, so we always return from the condition below in do_timer_tsc_timekeeping:

	if (unlikely(tsc < tsc_accounted))
		return;

The result is that we don't call do_timer_jiffy until the TSC is again greater than or equal to the last_tsc_accounted value, which causes kernel stalls at boot.
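For readers following along, here is a hedged sketch of the accounting loop around that check, modeled on the behavior described in comments 35 and 39. The variable names come from this discussion; the exact structure is an assumption, not the RHEL5 patch itself:

static void do_timer_tsc_timekeeping(struct pt_regs *regs)
{
	unsigned long long tsc, cycles_per_tick;

	rdtscll(tsc);

	/* A TSC that was reset (e.g. by synchronize_tsc_bp/ap) looks like
	 * it went backwards, so bail out instead of accounting garbage. */
	if (unlikely(tsc < tsc_accounted))
		return;

	/* Cap how far behind we are allowed to fall in one interrupt. */
	if (unlikely(tsc - tsc_accounted > cycles_accounted_limit))
		tsc_accounted = tsc - cycles_accounted_limit;

	cycles_per_tick = (unsigned long long)tsc_khz * 1000;
	do_div(cycles_per_tick, HZ);

	/* Account one jiffy for every full tick's worth of TSC cycles that
	 * elapsed since we last ran, instead of one jiffy per interrupt. */
	while (tsc - tsc_accounted >= cycles_per_tick) {
		do_timer_jiffy(regs);
		tsc_accounted += cycles_per_tick;
	}
}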

I could have alternatively fixed it by resetting last_tsc_accounted whenever we do a write_tsc but that seemed very hackish to me.

Let me know what you think about this.

Thanks.

Comment 40 Chris Lalancette 2010-09-01 21:40:44 UTC
(In reply to comment #38)
> Hi Chris,
> 
> I have a request though, can you please generate a patch on top of my earlier
> one ? reviews are painful otherwise. Also it helps figure what other changes
> you might have done, for instance cycles_accounted_limit is now set to just 10
> secs instead of 10mins. I could have easily missed it if I didn't do a diff
> between my version and yours. 
> Also it is easy to blame someone when we know who introduced what part of the
> change :)

Yes, definitely.  I'm sorry, that was my mistake, I will definitely include an interdiff for the next patch.

<snip>

> > 
> > 2)  I found that in the implementation of do_timer_tsc_timekeeping(), we
> > weren't accounting for the tick divider.  That is, in the x86_64 version of
> > do_timer_tsc_timekeeping, we do a loop of:
> > 
> >   for (i = 0; i < tick_divider; i++)
> >    do_timer_jiffy(regs);
> > 
> > to take into account all of the ticks.  As far as I can tell, this is also
> > needed in the 32-bit version.  I've added it.
> 
> Its not needed in the 32bit version, do_timer_jiffy in 32bit already has a
> tick_divider loop unlike the 64 bit version. 

You are totally right, I've removed this change.

> 
> Looking at it now, I think the 64bit version has a bug for the 
> CONFIG_X86_LOCAL_APIC && !using_apic_timer case, we have a nested tick_divider
> loop. We will hit that problem only if the apic is disabled.

Right again.  I made another patch to fix up this problem, but since it's a corner case, I'll just open another bug for it and we can include it separately.

> 
> > 
> > 3)  During subsequent testing of the x86_64 version of this code, we found that
> > we could get softlockups because we were accounting too many ticks in one
> > cycle.  Therefore, in the x86_64 version, we added a
> > "touch_all_softlockup_watchdogs()" call in do_timer_tsc_timekeeping() to touch
> > the softlockup timers as appropriate. I think we also need this for the 32-bit
> > version, so I've added it as well.
> 
> In my testing for the 64bit change back then, I had seen that touching the
> watchdog didn't suffice and I still saw some false softlockups reported, thats
> the reason we disabled softlockups in this mode.  Though I don't feel strongly
> about this I am fine with this change.

OK.  I think I'm going to keep this change, since it does seem to help with kvmclock.

> 
> > 
> > 4)  I've removed the part of the patch that sets preset_lpj.  Despite the
> > horrible variable naming, it seems the semantics of preset_lpj are to take a
> > command-line parameter for the lpj.  For a "detected" lpj (rather than
> > calibrated), we usually use the lpj_fine parameter.  Additionally, we were
> > already setting lpj_fine in the x86_64 version, so I've changed
> > arch/i386/kernel/tsc.c:tsc_init() to call get_hypervisor_tsc_freq() like the
> > 64-bit version, and update the tsc_khz and cpu_khz (which should trickle down
> > into lpj_fine). 
> 
> Yeah though lpj_fine is accounted for only the boot cpu, the AP's still go
> through the delay calibration dance. On mainline both VMware and KVM use
> preset_lpj to skip delay calibration. So it would be great if you can keep
> preset_lpj part of the change, I am thinking we can get rid of lpj_fine if no
> one else depends on that.

Yeah, I see what you mean.  In that case, I'm OK with using preset_lpj like you did in your initial patch, so I've reverted back to that.  I'm not going to remove lpj_fine at this point, just because it's a bit gratuitous.

> 
> >  However, looking in
> > arch/i386/kernel/cpu/vmware.c:vmware_get_tsc_khz() (which is called to do the
> > lpj_fine adjustment), it looks like we only turn on lazy_timer_evaluation (and
> > nosoftlokup) if timekeeping_use_tsc >= 0, which is only true for the 64-bit
> > version now. 
> 
> Not really, note that we are checking if timekeeping_use_tsc is either zero or
> greater than that. In 32 bit it is zero to start with. So we still try to
> enable the lazy_timer_emulation mode there too.

Yeah, you are right again.

So after reverting back to your patch plus my two minor changes above, the patch still failed to build on 64-bit.  There were a few problems, mostly revolving around vmware_init_tsc_timer() not being available on 64-bit, and many of the variables it uses also not being exported there.  However, vmware_init_tsc_timer() turned out to closely mirror the hypervisor code in arch/x86_64/kernel/time.c:time_init(), so I split this all out into init_tsc_timer() and got rid of vmware_init_tsc_timer() completely.

I'm attaching a new version of the patch, which I've now successfully compiled and booted on 32-bit and 64-bit.  Can you review this and see if the changes are acceptable to you?  I'm going to be off for a few days, so Prarit (now CC'ed on this bug) is going to be taking this over from me.

Chris Lalancette

Comment 41 Chris Lalancette 2010-09-01 21:41:35 UTC
Created attachment 442500 [details]
Version 4 of the VMware TSC timer patch

Version 4 of the VMware TSC timer patch

Comment 42 Chris Lalancette 2010-09-01 21:42:24 UTC
Created attachment 442501 [details]
Interdiff between version 1 and version 4 of the patch

Comment 43 Alok Kataria 2010-09-02 19:29:23 UTC
> > I have a request though, can you please generate a patch on top of my earlier
> > one ? reviews are painful otherwise. Also it helps figure what other changes
> > you might have done, for instance cycles_accounted_limit is now set to just 10
> > secs instead of 10mins.

I forgot to ask why you changed this. In some overcommitted situations the guest can be delayed for more than 10 seconds, and when that happens time will start lagging behind in this mode too.
I think 1 minute should be enough, though; I will change it to that.

> >  I could have easily missed it if I didn't do a diff
> > between my version and yours. 
> > Also it is easy to blame someone when we know who introduced what part of the
> > change :)
> 
> Yes, definitely.  I'm sorry, that was my mistake, I will definitely include an
> interdiff for the next patch.
> 

I meant incremental patches. 
 <snip>

Anyway, the v4 patch doesn't work for me; my kernel hung during bootup on 32-bit.

The problem is that we use the same timekeeping_use_tsc flag in the timer interrupt hook to decide which interrupt mode we require (legacy or TSC based). In 32-bit mode we delay the initialization of the TSC timer until after all CPUs are brought up (see comment 39), so between the call to tsc_init and clocksource_init the timer can misbehave. We really need a separate switch (enable_tsc_timer) for the 32-bit code.

Let me attach 2 (broken-out) patches which should address all the concerns that we have.
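A hedged sketch of the split being described here, purely to illustrate the idea; the hook and flag names follow the discussion above and the i386 mach-default layout, but the actual change is in the attached patches:

/* Timer interrupt path on i386: only take the TSC-counting branch once
 * init_tsc_timer() has run and set the dedicated enable_tsc_timer flag,
 * instead of keying off timekeeping_use_tsc as on x86_64. */
static inline void do_timer_interrupt_hook(struct pt_regs *regs)
{
	if (enable_tsc_timer)
		do_timer_tsc_timekeeping(regs);	/* account jiffies from elapsed TSC */
	else
		do_timer_jiffy(regs);		/* legacy one-jiffy-per-tick path */
}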

Comment 44 Alok Kataria 2010-09-02 19:33:33 UTC
Created attachment 442711 [details]
TSC based timer for 32bit

This is actually the first patch that I posted in comment 28; I will post an incremental patch on top of this which addresses all the concerns you brought up in your review in comment 35. Thanks

Comment 45 Alok Kataria 2010-09-02 19:35:31 UTC
Created attachment 442713 [details]
Address some review comments.

Patch on top of the one in comment #44.

With the following changes:
 - fix a compiler warning for Xen
 - touch the softlockup watchdog when processing multiple timer ticks
   (useful for KVM)
 - the maximum number of cycles accounted in one timer interrupt is now
   set to 1 minute instead of 10 minutes
 - restructure the code so that 32-bit and 64-bit can share initialization
   of the TSC-based timer

Thanks.

Comment 48 Prarit Bhargava 2010-09-03 22:03:00 UTC
Created attachment 442992 [details]
RHEL5 patch

Patch against the latest tree.  Alok, can you please review it?

P.

Comment 49 Prarit Bhargava 2010-09-03 22:05:32 UTC
Alok, please review the most recently attached patch.  This patch is a combination of your previous patches + some additional fixes due to code drift.

I know you wanted an incremental patch; however, this is the patch that will be committed to the tree, so I just wanted to get a sign-off on it.

The patch has cleared our internal build system, Brew, and is ready to be submitted after a sign-off from Alok.

Thanks,

P.

Comment 50 Alok Kataria 2010-09-03 22:57:06 UTC
LGTM, Prarit. 
Thanks to Chris and you for going through multiple iterations on this.

Comment 55 Jarod Wilson 2010-09-10 21:38:37 UTC
in kernel-2.6.18-219.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 65 errata-xmlrpc 2011-01-13 20:55:09 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html

Comment 68 Martin Prpič 2011-03-24 19:48:05 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
When running under VMware, timeouts in user applications could occur much later than they should have due to faulty kernel timekeeping. With this update, kernel timekeeping has been improved and timeouts occur in a more timely manner.

Comment 69 Alok Kataria 2011-03-24 20:40:08 UTC
Hi Martin, 

This problem is specific to 32-bit kernels, which is not mentioned in the technical notes that you propose. Assuming the technical notes are common to 32-bit and 64-bit kernels, maybe we should start with something like the following:

"When running with a 32bit kernel under VMware, timeouts in user ...."

I am fine with however you want to frame this, as long as the fact that 64-bit kernels are fine without the patch is conveyed to the customer.


Thanks.

Comment 70 Jason McCormick 2011-03-30 19:01:26 UTC
Would the patches applied to the kernel in this bug still be affecting 32-bit EL5.6 VMs at boot? We're seeing a problem at boot time after applying the -238.* releases of the EL5 kernel in our ESX 3.5 environment. VMs sporadically hang at boot just before this line in the kernel boot process: "Using TSC for driving interrupts". VMs consistently hang just before this kernel message, but we're not able to trigger the condition in a controlled manner. It happens overnight whenever a lot of VMs reboot simultaneously for package updates, so I would assume the ESX node the VMs are on is quite busy and may be exercising some of the issue described above. We did not see this problem with the immediately preceding kernel, 2.6.18-194.32.1.el5. I note this patch included the message that appears to be the hang point, inside the function void init_tsc_timer(void) in attachment 442992 [details] referenced in this bug.

Comment 71 Alok Kataria 2011-04-05 01:01:54 UTC
Created attachment 489890 [details]
Test patch to fix a race in tsc based timer startup for i386.

Hi Jason, 

This is interesting. After your comment I went back and looked at the code, and I think the race below is possible.

In init_tsc_timer, as soon as we set enable_tsc_timer and before last_tsc_accounted has been updated, we may get an interrupt, which now tries to use the TSC-based do_timer routine and can cause random behavior, since the other values (last_tsc_accounted, cycles_accounted_limit) are not yet set up correctly.
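A minimal sketch of the ordering such a fix could impose, assuming the flag and state variables named in this discussion; this is not the attached patch itself:

void init_tsc_timer(void)
{
	/* Fully initialize the state the interrupt path reads ... */
	rdtscll(last_tsc_accounted);
	cycles_accounted_limit = 60ULL * tsc_khz * 1000;	/* ~1 minute of cycles */

	/* ... make it visible before publishing the flag ... */
	wmb();

	/* ... and only then let interrupts take the TSC-based path. */
	enable_tsc_timer = 1;

	printk(KERN_INFO "Using TSC for driving interrupts\n");
}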

I tried looping with RHEL 5.6 on my ToT build of ESX, but haven't been able to hit this hang. Given that you are able to reproduce this, can you try the attached test patch on your setup and report whether it fixes the issue for you?

Comment 72 Jason McCormick 2011-04-05 01:31:58 UTC
Alok,

Thanks for the reply. I will test it out, but I can't reliably recreate this condition: only when a bunch of VMs reboot around the same time will some of them sometimes do it. Its occurrence is somewhat random, but the hang is always in the same place. I'll give it a shot though.

Just to tie some administrative stuff together, I opened a new Bugzilla about this in Bug 692966 because I filed this with Red Hat Support and the core issue at hand was getting lost in the details of this Bugzilla entry.  I don't know if you want to cross-file your patch there as well.  Since you're with VMware, I should probably also mention this is open with VMware as well as SR 11055505103.

Thanks!