Bug 467782

Summary: unstable time source
Product: Red Hat Enterprise Linux 5 Reporter: Issue Tracker <tao>
Component: kernel    Assignee: Prarit Bhargava <prarit>
Status: CLOSED ERRATA QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high Docs Contact:
Priority: high    
Version: 5.2    CC: cward, kbaxley, kzhang, mgahagan, peterm, syeghiay, tao, woodard
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-09-02 08:33:44 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 483701, 485920    
Attachments:
Description Flags
RHEL5 fix for this issue none

Description Issue Tracker 2008-10-20 21:00:38 UTC
Escalated to Bugzilla from IssueTracker

Comment 1 Issue Tracker 2008-10-20 21:00:39 UTC
Date: Tue, 14 Oct 2008 08:22:50 -0700
From: Jeff Cunningham <cu>
To: bwoodard
Cc: tdhooge
Subject: ntp problem on chaos 4.1
Message-ID: <20081014152250.GA5404.gov>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.5.16 (2007-06-09)

Hey Ben,

I'm bringing up the hera cluster, and I'm seeing ntp issues
there that haven't been seen on the other machines, and was
wondering if you could take a look.  hera is running chaos 4.1

Specifically, after every cluster boot, some small number of
nodes (3-7) prefer to sync to their local clock rather than
the proper ntp host:

# hera122 /root > ntpq -p
remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*LOCAL(0)        .LOCL.          10 l   27   64  377    0.000    0.000   0.001
ehera3          134.9.1.98       3 m  899 1024  377    0.229  -658692 10190.9
ehera2          134.9.1.98       3 m   15   64  377    0.084  -666945 2933.68
# hera122 /root >


It's a different set of nodes after each boot, and all the problem nodes
have very similar values for offset and jitter (their clocks are always
set about 11 minutes in the future).

Is this something that has been seen in Redhat land recently?

--Jeff

This event sent from IssueTracker by streeter  [Support Engineering Group]
 issue 229865

Comment 2 Issue Tracker 2008-10-20 21:00:41 UTC

In the /etc/ntp.conf file on the client, can we add iburst to the server
line? This allows ntpd to set the clock and exit in just under one minute.

The issue may be related to the check-in time interval of ntpd. This interval
is set initially to 64 seconds, but while ntpd is running this interval can
increase to a maximum of 1024 seconds (just over 17 minutes) to conserve
network bandwidth.

Here are the steps.

1) edit ntp.conf and add iburst on the server line.
2) create the file /etc/cron.d/ntp with the following line:
*/3 * * * * root /usr/sbin/ntpd -q > /dev/null &
This will tell cron to check the clock every three minutes. When ntpd is run
with the -q option and cannot contact the NTP server or has trouble setting
the clock, it will exit after two minutes.
3) Execute service crond restart on the client to activate the change. You
will see ntp entries in /var/log/messages every three minutes corresponding
to the clock updates.
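The steps above, sketched as a root shell session (the iburst edit is shown as a comment because the NTP server name is site-specific and not given here):

```shell
# 1) in /etc/ntp.conf, the server line gains "iburst", e.g.:
#      server <your-ntp-host> iburst
# 2) drop in a cron job that forces a clock check every three minutes:
cat > /etc/cron.d/ntp <<'EOF'
*/3 * * * * root /usr/sbin/ntpd -q > /dev/null &
EOF
# 3) activate the new cron entry:
service crond restart
```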

From article on RH Knowledge Base
http://kbase.redhat.com/faq/FAQ_43_4081.shtm


This event sent from IssueTracker by streeter  [Support Engineering Group]
 issue 229865

Comment 3 Issue Tracker 2008-10-20 21:00:42 UTC
From: Ritesh Mishra <rmishra>
To: Kent Baxley <kbaxley>, Ben Woodard <woodard>,
mgrondona
Subject: Fwd: ntp problem on chaos 4.1
Date: Wed, 15 Oct 2008 14:01:52 -0400 (EDT)

FYI. 

Ritesh Mishra
Consultant - Redhat, Inc
Lawrence Livermore Labs, B4525
Phone:925.422.8580
Alt:510.432.3983
email:rmishra
email:mishra4.gov


----- Forwarded Message -----
From: "Jeff Cunningham" <cu>
To: "Ritesh Mishra" <rmishra>
Sent: Wednesday, October 15, 2008 9:47:52 AM GMT -08:00 US/Canada Pacific
Subject: Re: ntp problem on chaos 4.1

Hey Ritesh,

A brief update...

I set this up yesterday, and the node was updating correctly
every 3 minutes with no problem.  Our dist ran overnight
and replaced my temporary setup, and the clock is now 10 minutes
in the future, just as before.

Also, the check-in period is the same as on all the other nodes
(1024 s) that don't have the problem.  We're currently looking
into the possibility that cpuspeed/cpufreq may be causing this.

--Jeff

On Tue, Oct 14, 2008 at 11:42:27AM -0700, Ritesh Mishra wrote:
> Hey Jeff, can we try a couple of things for troubleshooting purposes.
>
> In /etc/ntp.conf file on the client, can we add iburst on the server line.
> This allows ntpd to set the clock and exit in just under one minute.
>
> The issue may be related to check-in time interval of ntpd. This interval
> is set initially to 64 seconds, but while ntpd is running this interval can
> increase to a maximum of 1024 seconds (just over 17 minutes) to conserve
> network bandwidth.
>
> Here are the steps.
>
> 1) edit ntp.conf and add iburst on the server line.
> 2) create the file /etc/cron.d/ntp with the following line:
> */3 * * * * root /usr/sbin/ntpd -q > /dev/null &
> This will tell cron to check the clock every three minutes. When ntpd is
> run with the -q option and cannot contact the NTP server or has trouble
> setting the clock, it will exit after two minutes.
> 3) Execute service crond restart on the client to activate the change. You
> will see ntp entries in /var/log/messages every three minutes corresponding
> to the clock update.
>
> Ritesh
> Jeff Cunningham wrote:  
>> Again, output is the same for all 7 nodes:
>>
>> # hera753 /root > ntpq -c"rv 0 stratum" ehera2
>> assID=0 status=0644 leap_none, sync_ntp, 4 events,
event_peer/strat_chg,
>> stratum=3
>> # hera753 /root >
>>
>> --Jeff
>>
>> On Tue, Oct 14, 2008 at 10:17:11AM -0700, Ritesh Mishra wrote:
>>     
>>> Hey Jeff,
>>>
>>> can you run the following command.
>>>
>>> ntpq -c"rv 0 stratum" your_time_server
>>>
>>> where the your_time_server is what hera is pointing to.
>>>
>>> Ritesh
>>>
>>>
>>> Jeff Cunningham wrote:
>>>       
>>>> Hello Ritesh,
>>>>
>>>> Just to give you some more background, hera is an 864 node
>>>> cluster.
>>>>
>>>> I'm leaving the current set of bad nodes in a "bad" state
>>>> for troubleshooting reasons, and I'm noticing that the offset has
>>>> increased to 12 minutes for the current set of 7 bad nodes, all within
>>>> a 10 second variation.  The offset is not necessarily the same after
>>>> the reboots, but all the nodes in the bad state always have the same
>>>> amount of offset.
>>>>
>>>> Also, the set of bad nodes changes after each full-cluster
>>>> reboot.  Currently, it's [89,122,183,518,678,753,828], but the
>>>> last boot was [507,700,722], and before that [150,259,359,400,528]
>>>>
>>>> Here's ntpstat (all 7 nodes give this same result):
>>>>
>>>> # hera89 /root > ntpstat
>>>> synchronised to local net at stratum 11 time correct to within 10 ms
>>>> polling server every 1024 s
>>>> # hera89 /root >
>>>>
>>>> --Jeff
>>>>
>>>>
>>>> On Tue, Oct 14, 2008 at 09:27:42AM -0700, Ritesh Mishra wrote:
>>>>           
>>>>> Hey Jeff,
>>>>>
>>>>> Does the value stay the same after the reboots (11 mins in the future)
>>>>> for the nodes? What does ntpstat return on these nodes?
>>>>>
>>>>> Regards,
>>>>> Ritesh Mishra
>>>>>
>>>>> P.S. I recently joined Ben's group. Sorry for not introducing myself.
>>>>>               
>>>         
>  



This event sent from IssueTracker by streeter  [Support Engineering Group]
 issue 229865

Comment 4 Issue Tracker 2008-10-20 21:00:44 UTC
ntp and timesync issues. It seems more HW related. All clocks seem to drift
in sync. This is TLCC hardware; with the chaos 4.0 systems we aren't seeing
the problem, only with the 4.1 systems. It seems to be a problem introduced
between 5.1 and 5.2. We are seeing it on TLCC HW with chaos 4.1.
sudo cat /sys/devices/system/clocksource/clocksource0/available_clocksource
Looks like those changes are in 2.6.18-119. It looks like it only affects
Intel boxes. I'm not sure it is the same thing but I think the patch is a
good idea nevertheless.
why was something supposed to change from RHEL5.1 to 5.2?
One working theory was that the default timer sources that the machine
used changed between releases.
You have the choice of timer={tsc,pmtimer,hpet,tscsync}
This relates to RHEL4 but it gives a basic explanation:
http://kbase.redhat.com/faq/FAQ_85_6993.shtm
well it is jiffies on all our peloton machines too, so we've got some sort
of systemic problem
though dmesg says:
time.c: Using 25.000000 MHz WALL HPET GTOD HPET timer.
hpet0: 3 32-bit timers, 25000000 Hz
time.c: Using 25.000000 MHz WALL HPET GTOD HPET/TSC timer.
time.c: Detected 2311.849 MHz processor.
http://lkml.org/lkml/2005/11/4/173
http://www.nabble.com/Re:--PATCH-0-2--Improve-hpet-accuracy-p17687684.html
http://kerneltrap.org/mailarchive/linux-kernel/2008/4/11/1406044
does RHEL have adjtimex anywhere?
#include <sys/timex.h>
int adjtimex(struct timex *buf);
fedora?
http://kerneltrap.org/mailarchive/linux-kernel/2008/4/11/1406304
it's in fedora
adjtimex.x86_64 0:1.21-4.fc9
adjtimex-1.21-4.fc9.x86_64
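A back-of-the-envelope use of the adjtimex(8) utility mentioned above, trimming a clock that runs about 11 min/day fast (roughly 7638 ppm); the split between the two knobs is illustrative only, not a recommendation:

```shell
# adjtimex's --tick is in microseconds per 10 ms tick, so one tick unit
# equals 100 ppm; --frequency is in units of 2^-16 ppm.  Split the
# correction across the two:
fast_ppm=7638
tick_delta=$(( fast_ppm / 100 ))       # whole multiples of 100 ppm
freq=$(( (fast_ppm % 100) * 65536 ))   # remainder, in --frequency units
echo "adjtimex --tick $(( 10000 - tick_delta )) --frequency -$freq"
```

This only prints the command it would run; on a live node you would execute it as root and then re-check drift with ntpq -p.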

http://www.x86-64.org/pipermail/discuss/2006-April/008277.html
calibration of the HPET is done based upon the PIT and that can change
based upon clock speed.
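The clocksource check quoted earlier in this comment can be wrapped in a small guarded script (sysfs layout as on RHEL 5-era kernels; the fallback message covers kernels without the directory):

```shell
# Print which time sources the kernel could use vs. the one it picked.
cs=/sys/devices/system/clocksource/clocksource0
if [ -r "$cs/current_clocksource" ]; then
  echo "available: $(cat $cs/available_clocksource)"
  echo "in use:    $(cat $cs/current_clocksource)"
else
  echo "no clocksource sysfs entries on this kernel"
fi
```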



This event sent from IssueTracker by streeter  [Support Engineering Group]
 issue 229865

Comment 5 Issue Tracker 2008-10-20 21:00:46 UTC
Prarit's going to send us a writeup tomorrow regarding comment #18 in
BZ 461184, but I thought I'd also share this with you guys.

I told him about the ntp issue and his feedback is below:


> > Specifically, after every cluster boot, some small number of nodes
> > (3-7) prefer to sync to their local clock rather than the proper ntp
> > host:
> >
> > # hera122 /root > ntpq -p
> > remote           refid      st t when poll reach   delay   offset  jitter
> > ==============================================================================
> > *LOCAL(0)        .LOCL.          10 l   27   64  377    0.000    0.000   0.001
> > ehera3          134.9.1.98       3 m  899 1024  377    0.229  -658692 10190.9
> > ehera2          134.9.1.98       3 m   15   64  377    0.084  -666945 2933.68
> > # hera122 /root >
> >
> >
> > It's a different set of nodes after each boot, and all the problem
> > nodes have very similar values for offset and jitter (their clocks are
> > always set about 11 minutes in the future).  We're currently looking
> > into the possibility that cpuspeed/cpufreq may be causing this, but we
> > don't know for sure.
> >
> >     

I have a feeling I know what causes this.

A few months ago I was chasing a bizarre issue on AMD HW and Xen-enabled 
kernels.  When the hypervisor booted and passed off control to the dom0, 
many "Time went backwards" messages were output to the screen.

During the investigation it was noted in some cases systems were booting 
with inconsistent clock speeds across CPUs (not cores).  This was 
confusing to say the least.  We had a phone conference with AMD and 
asked them about it -- they said it shouldn't be happening, but I gave 
them evidence it was happening.

It's a royal PITA (excuse my french).  It destabilizes the system during 
boot up and can lead to all sorts of funky behavior.

I have also noticed large time jumps when rebooting a system.  I boot 
the system once, start ntpd (which then synchs up the clock) and 
reboot.  After the second reboot, sometimes another macroscopic time 
jump is needed -- I never did try to find out why that was happening... 
but since it was occurring during shutdown/restart, I figured that 
something in the BIOS was screwing things up.


- -- 
*************
Kent Baxley
Technical Account Manager
Red Hat, Inc.
kbaxley
gpg key id: E281242D
http://pgp.mit.edu


Summary edited.

This event sent from IssueTracker by streeter  [Support Engineering Group]
 issue 229865

Comment 6 Issue Tracker 2008-10-20 21:00:48 UTC
Due to the fact that this causes so much of a problem in their environment,
here is what I think we need to do:
1) Kent - I think we need to raise the priority of this issue and work
the normal formal channels of support as well as our back-channels.
2) Until we have more evidence that it is a BIOS issue, we need to
consider that a hunch and put it on the back burner to evaluate later.
3) Ritesh - Can you work on trying to figure out if there is some
acceptable workaround where we can get NTP to be more aggressive in
correcting the time on the systems so that it doesn't limit itself to
200ppm? If we can run the ntp daemon more often, or allow a bigger time
offset and force it to keep the time on the system more accurate, that
will alleviate the symptoms of the problem, which seem to be the biggest
issue.
4) I'll work with grondo on trying to get to the root cause of the
problem by looking at the patches and the calibrate-apic stuff. My working
hypothesis is that the 5.2 kernel is pushing power management on the
Barcelonas harder and that is sometimes leaving the CPUs in some lower
power state between reboots. The BIOS is supposed to power up the CPUs,
but the BIOS was only tested from a cold restart. Thus the BIOS writers
concluded that everything was as it should be when in fact the core power
state needs to be reset in case it is a warm reboot.

This hypothesis presents some testable lines of investigation that I'll
follow up on:
1) what changes in the power management have been made between 5.1 and
5.2
2) can we reset the power state and clock speed of the CPU just before
reboot and "fix" the problem -- sort of doing the BIOS's job for it.




This event sent from IssueTracker by streeter  [Support Engineering Group]
 issue 229865

Comment 7 Issue Tracker 2008-10-20 21:00:51 UTC
Tags: Forwarded 
From: Prarit Bhargava <prarit>
To: Ben Woodard <woodard>
CC: Kent Baxley <kbaxley>, Ritesh Mishra <rmishra>
Subject: Re: Bugzilla 461184 - request for information
Date: Mon, 20 Oct 2008 08:33:47 -0400
User-Agent: Thunderbird 2.0.0.17 (X11/20080914)

Ben,

Sorry I'm late in replying back to you.  I had to jump on a critical 
issue for the RHEL5.3 beta on Thursday and Friday and it took almost all 
of my time.

Ben Woodard wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Prarit Bhargava wrote:
>     
>> Just curious, Ben -- are you looking for a technical discussion of what
>> the changes were or an explanation of why the code in 461184 was backed
>> out?
>>
>
> Actually, you know, what I'm really interested in is more detail about:
> 1) How the HPET is calibrated based upon the processor speed. (I know,
> "use the source, Luke" -- I haven't had time)
>     
See arch/x86_64/kernel/apic.c: setup_APIC_timer()

We read the HPET until a change occurs and that's how we calibrate the 
HPET.  We do this relative to the TSC.

There was a change to this code between 5.1 and 5.2:  See changeset 
00da8f835d68336e504b57ab69774b951ec65fad



> 2) Where in the AMD docs it said that processors should always come up
> at the same speed (I'm assuming that this is in the BIOS and Kernel
> Writer's docs but where?)
>     
My limited knowledge of that came from a telephone conference we had 
with AMD engineering regarding a "time went backwards" issue that we 
found with Xen kernels.  After doing some measurements, it was obvious 
to me that the CPUs (not COREs) were NOT coming up at the same speed 
relative to one another and that this difference was leading to all 
sorts of problems in the early boot of Xen kernels.

This surprised AMD engineering as they were sure that the CPUs were 
supposed to be synchronized -- but I showed them live-time evidence that 
this was not the case.  They leaned the same direction you're leaning 
and said the vendor's BIOS must be at fault.

I'll take a look at the BIOS guide to see if I can find anything
specific.

> 3) What could have changed between 5.1 and 5.2 that might have led to
> the processor speed being inconsistent between boots. Could it be
> something with power management? Why is it inconsistent?
> time.c: Detected 2311.850 MHz processor.
> for nodes where the HPET is calibrated properly and the time is stable.
> time.c: Detected 2288.237 MHz processor.
>     

There hasn't been much code churn in this area -- mostly minor bug 
fixing.  There are a couple of things, however, that come to mind that 
might be causing a problem (and both are worth backing out and testing).

The first is Alan's tick divider patchset.  We've encountered a few bugs 
here and there in this code.  It *might* cause problems and is one of 
the first things I back out when I see some funkiness with the timing
code.

I even have the commit in a text file so I don't have to go find it: 
d1a92475d3fceb833556d82ff0b962e5c42f0595

(The patch probably won't come out cleanly anymore.  There were several 
patches by Chris Lalancette to fix bugs)

The second patch bff2a8b50e255b474aa8e06e1be1ad6a513284ef, is Bhavna's 
(onsite AMD partner engineer) patch for a code problem I found.

[ uh ... I realize you're offsite Ben -- do you have access to dzickus' 
git kernel tree?  The patch commits above are relative to his tree.]

I think Kent pointed me to a similar system in RDU -- let me try to see 
if I can reproduce this locally.  Kent -- think you can send me access 
details again?

> for nodes where the HPET is not calibrated properly and the system's
> time drifts forward by about 11min/day which is greater than the 200ppm
> that NTP will fix.
>
>     

Interesting -- I've seen some drift *between boots*.  Is this real 
live-time drift?  Or does the drift suddenly jump between boots of the 
system?  I've seen the latter case on some systems but haven't had any 
time to chase it down.

> I'm sort of gathering that this is a BIOS issue, lord knows we have had
> lots of problems with SuperMicro's BIOS, but until I can cite a
> definitive answer to #2 and explain why whatever changed between 5.1 and
> 5.2 was the "right" thing to do, I really can't push back to SM's BIOS
> guys. They are pretty stubborn and don't like being forced to make a new
> version of the BIOS.
>     

<heavy heavy sarcasm>

Gee?  Stubborn BIOS engineers?  Really?  I had no idea such a thing 
existed...

</heavy heavy sarcasm>

Clearly, you haven't had to deal with HP BIOS engineers. :)

Let me look at a few things this AM and I'll get back to you,

P.



This event sent from IssueTracker by streeter  [Support Engineering Group]
 issue 229865

Comment 8 Issue Tracker 2008-10-20 21:00:54 UTC
Date problem reported: 10-14-08

Problem: Unstable time source issues on LLNL cluster nodes using quad-core
CPU's and SuperMicro Motherboards.

How reproducible: Seems to happen quite frequently on nodes running
RHEL5.2.

How to reproduce: On every cluster reboot, a small percentage of nodes
(different ones every time) prefer to sync to their local clocks instead
of to the proper ntp server.  The problematic machines end up about 11
minutes in the future.

What's expected from SEG:  Over the past week, we've been doing some
back-channel work with Prarit Bhargava on this, and this may or may not be
a BIOS issue; at this time, there's no really solid proof either way.  The
customer did indicate that the same systems still running RHEL5.1 didn't
seem to have any timing issues, and some information sent to us from
Prarit indicates that there was a little bit of code churn between 5.1 and
5.2 that may (we don't know for sure, though) have some effect on system
timing.

LLNL is on an October 28th deadline, and Prarit has already made plans to
start working on this (maybe as early as today).    





Issue escalated to Support Engineering Group by: kbaxley.
Internal Status set to 'Waiting on SEG'

This event sent from IssueTracker by streeter  [Support Engineering Group]
 issue 229865

Comment 9 Issue Tracker 2008-10-20 21:00:57 UTC
From: Prarit Bhargava <prarit>
To: Kent Baxley <kbaxley>
CC: Ben Woodard <woodard>, Ritesh Mishra <rmishra>
Subject: Re: Bugzilla 461184 - request for information
Date: Mon, 20 Oct 2008 13:26:42 -0400
User-Agent: Thunderbird 2.0.0.17 (X11/20080914)



Kent Baxley wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On Mon, 20 Oct 2008 08:59:11 -0400
> Prarit Bhargava <prarit> wrote:
>
>     
>> Ugh -- I just had another high priority BZ come across my desk.  
>> Hopefully I'll be able to get to this by COB today :/.
>>       
>
> Hi, Prarit.  Thanks for all the info.  I'm going to be sending the
> current timing problems up to BZ formally today. I've received some
> word that things might start to boil in this one.
>
> The customer's got an October 28th deployment deadline, and I don't
> believe they have a workaround in place, yet.  
>
> The scoop I got on this was that whether or not this turns out to be a
> BIOS issue, Red Hat is probably going to end up getting blamed
> since there apparently were some changes between 5.1 and 5.2.
>     

I'm almost done with the other issue (also a timing issue) -- I will let 
my manager know this is next on my list.

P.



This event sent from IssueTracker by streeter  [Support Engineering Group]
 issue 229865

Comment 10 Issue Tracker 2008-10-20 21:01:00 UTC
There's no BZ open on this yet, so, can SEG get one opened up, please?

Thanks.


This event sent from IssueTracker by streeter  [Support Engineering Group]
 issue 229865

Comment 11 Issue Tracker 2008-10-20 21:01:03 UTC
Hey Jeff,

Just wanted to get an update. Do you think we could put in the fix that I
had sent you to resolve this issue?


Thanks,

Ritesh Mishra


This event sent from IssueTracker by streeter  [Support Engineering Group]
 issue 229865

Comment 13 Issue Tracker 2008-10-20 21:01:09 UTC
Date: Mon, 20 Oct 2008 11:59:52 -0700
From: Ben Woodard <woodard>
To: Prarit Bhargava <prarit>
CC: Kent Baxley <kbaxley>, Ritesh Mishra <rmishra>
Subject: Re: Bugzilla 461184 - request for information

Thanks Prarit, that is great.

Looking at the code you mentioned, there is a local patch that I think
you need to know about. These machines will not kdump properly: they
end up not getting timer interrupts on the processor that the kdump
kernel comes up on. Thus the machine hangs right after:
Memory: 119784k/147440k available (2437k kernel code, 11272k reserved,
1233k data, 184k init)
i.e. while initializing the timer.

This originally started as:
https://enterprise.redhat.com/issue-tracker/?module=issues&action=view&tid=133794
but then because the issue was getting so long and drawn out we moved
it to:
https://enterprise.redhat.com/issue-tracker/?module=issues&action=view&tid=221481
You can find the actual patch:
https://bugzilla.redhat.com/attachment.cgi?id=275741

So you don't have to read through gazillions of pages of text, I'll try
to sum up what is going on there.
The boot cpu is 0. The kdump kernel comes up on whatever cpu/core
caused the panic, and this is likely not cpu 0. Other CPUs are not
getting the timer interrupts; thus the hang trying to initialize the
timer. Neil Horman initially cooked up a patch which booted the kdump
kernel on cpu 0, which worked, but upstream didn't like that approach.
So he made a new patch that upstream was more in favor of, which was to
initialize the IOAPIC early. This also worked, and LLNL has been
carrying that patch locally for several months on both 5.1 and 5.2
based kernels. The patch can be found:

Support bought a machine just like LLNL's which reproduces the problem.
Neil Horman has yet to submit his patch upstream or to the rhkernel
mailing list. Kent should be able to get you access to the machine
with the supermicro motherboard which has the problem, if needed.

I hope it isn't something evil like the change in the order of when the
local IOAPIC is initialized to route interrupts now causing an
issue with power management initialization or something. BTW, my
working hypothesis, which I'm trying to figure out how to test
correctly, is that the 5.2 kernel is pushing power management on the
Barcelonas harder and that is sometimes leaving the CPUs
in some lower power state between warm reboots. I believe that AMD
believes that the BIOS is supposed to explicitly put the CPUs in their
highest power state, but the BIOS was only tested from a cold restart.
Thus the BIOS writers concluded that everything was as it should be
when in fact the core power state needs to be reset in case it is a
warm reboot.

My thoughts are that we might be able to test this by forcing the
power state of the cores to their highest state in the shutdown path
just before they are halted. I'm trying to figure out how to do that
right now. In all honesty, I'm not a good kernel hacker (never wanted
to do kernel work) so it takes me more time to figure things like that
out.

- -ben



This event sent from IssueTracker by streeter  [Support Engineering Group]
 issue 229865

Comment 14 Prarit Bhargava 2008-10-21 12:38:40 UTC
I spoke with peterm (my manager) and he said that we (RH) have a blanket NDA with AMD.

So, AFAWCT, you're covered.

Here is chapter and verse on what the BIOS requirements are for setting the P-states of the CPUs.  This info is taken from the AMD Family 11h Processor BKDG.

2.4.2      P-states
P-states are operational performance states characterized by a unique combination of frequency and voltage.
The processor supports dynamic P-state changes for up to two frequency domains and up to two voltage
planes: VDD0 and VDD1. Refer to section 2.4.1 [Processor Power Planes And Voltage Control] for voltage
plane definitions. Up to 8 core P-states, called P0 through P7, are supported. P0 is the highest-power,
highest-performance P-state; each ascending P-state number represents a lower-power, lower-performance
P-state. Lower performance P-states (higher numbered P-states) must have a COF that is less than or equal
to higher performance P-states (lower numbered P-states). Lower performance P-states must also have a VID
that is higher than or equal to (voltage that is lower than or equal to) higher performance P-states. The
number of defined P-states is optimized for power and performance trade-offs. At least one P-state (P0) is
enabled and specified for all processors. Out of cold reset, the voltage and frequency of the cores is
specified by MSRC001_0071[StartupPstate].

[Prarit: The last line is important here.]

2.4.2.3 P-state Transition Behavior

.
.
.

• If RESET_L asserts the processor cores are transitioned to C0 and to the power state specified by
  MSRC001_0071.
     • After a warm reset [The P-state Control Register] MSRC001_0062 and [The P-state Status Register]
       MSRC001_0063 are consistent with MSRC001_0071[CurPstate].
     • After a warm reset MSRC001_0070 may not reflect MSRC001_0071. See [The BIOS COF and VID
       Requirements After Reset] 2.4.2.9.

[Prarit: After a warm reset, a core's P-state Control and Status Registers are supposed to be consistent with the
global P-state register.  I'm not sure if these are the correct technical terms for these registers, but register 62 and 63 AFAICT are per-core, not per-cpu.]

2.4.2.9    BIOS COF and VID Requirements After Reset
Warm reset is asynchronous and can leave MSRC001_0070 in an unknown state. Since BIOS cannot always
determine whether a reset was a warm or cold reset, the following requirement must be done after all resets.
• Write MSRC001_0070[CpuFid, CpuDid, CpuVid, PstateId] with MSRC001_0071[CurCpuFid, CurCpuDid,
  CurCpuVid, CurPstate].

[Prarit: After a warm reset, the BIOS must copy the contents of register 71 into 70]

So ... it seems like the BIOS is required to set the P-state of all cores to the same value as the processors.
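A rough way to spot-check this from a running node (hypothetical sketch: it needs the msr-tools package and the msr module loaded; the register numbers are from the BKDG text above, and comparing only the low FID/DID/VID bits is my simplification of the field-by-field requirement):

```shell
# Compare the P-state control MSR (0xc0010070) against the status MSR
# (0xc0010071) on core 0; after a warm reset the BIOS is supposed to
# have copied the FID/DID/VID fields from 0071 into 0070.
if command -v rdmsr >/dev/null 2>&1 && [ -e /dev/cpu/0/msr ]; then
  ctl=$(( 0x$(rdmsr -p 0 0xc0010070) & 0xffff ))
  sts=$(( 0x$(rdmsr -p 0 0xc0010071) & 0xffff ))
  if [ "$ctl" -ne "$sts" ]; then
    echo "mismatch: BIOS did not copy MSRC001_0071 into MSRC001_0070"
  else
    echo "MSRs consistent"
  fi
else
  echo "msr-tools/msr module not available; cannot check"
fi
```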

P.

Comment 15 Prarit Bhargava 2008-10-21 12:45:46 UTC
> 
> My thoughts are that we might be able to test this by forcing the
> power state of the cores to their highest state in the shutdown path
> just before they are halted. I'm trying to figure out how to do that
> right now. In all honesty, I'm not a good kernel hacker (never wanted
> to do kernel work) so it takes me more time to figure things like that
> out.
> 

Ben, I'm not sure we need to make a kernel change (yet).  Hopefully we can try to fix this in userspace :)

The service that controls cpu frequencies is called cpuspeed.  cpuspeed loads a governor which allows one to control the cpu frequency ondemand.

The script that controls cpuspeed is in the /etc/init.d dir.

So, my thinking is that on shutdown, before the cpuspeed service is stopped, we can just (as a test) put in a few lines to push the cpu freq on all cpus to their max and then continue to shut down the cpuspeed service.

The sysfs files for cpufreq are here:

/sys/devices/system/cpu/cpuX/cpufreq

and the directory contains
affected_cpus     scaling_available_frequencies  scaling_governor
cpuinfo_max_freq  scaling_available_governors    scaling_max_freq
cpuinfo_min_freq  scaling_cur_freq               scaling_min_freq
ondemand          scaling_driver

We should be able to echo scaling_max_freq into scaling_cur_freq, and in theory that should set everyone to the max value ...
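That test could be sketched like this (one caveat, my assumption: on many kernels scaling_cur_freq is read-only, so raising scaling_min_freq to the hardware maximum is a more portable way to force the top frequency; this would sit near the top of the cpuspeed stop path):

```shell
# Pin every core to its maximum frequency before cpuspeed shuts down.
for d in /sys/devices/system/cpu/cpu[0-9]*/cpufreq; do
  [ -d "$d" ] || continue                    # cpu without cpufreq support
  max=$(cat "$d/cpuinfo_max_freq")
  echo "$max" > "$d/scaling_min_freq" 2>/dev/null || true
done
echo "cpufreq pin attempted"
```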

Ben, what do you think?

P.

Comment 17 Prarit Bhargava 2008-10-21 13:36:41 UTC
Hrmm ... it looks like we already do the step from comment #15 in the cpuspeed shutdown path if the governor is the ondemand governor (I'm assuming LLNL has the default).

P.

Comment 19 Ben Woodard 2008-10-21 14:35:13 UTC
Verified that the 11h BKDG is publicly available online, so information contained within it need not be covered by the NDA. 
http://www.google.com/search?q=AMD+BKDG+11h&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/41256.pdf

Comment 20 Prarit Bhargava 2008-10-21 14:39:41 UTC
Based on comment #19, opening up comment #14.

P.

Comment 21 Issue Tracker 2008-10-21 14:50:17 UTC
Here's the BIOS info that I've got:

BIOS Information
        Vendor: American Megatrends Inc.
        Version: 080014
        Release Date: 03/15/2008
        Address: 0xF0000
        Runtime Size: 64 kB
        ROM Size: 1024 kB
        Characteristics:
                ISA is supported
                PCI is supported
                PNP is supported
                APM is supported
                BIOS is upgradeable
                BIOS shadowing is allowed
                ESCD support is available
                Boot from CD is supported
                Selectable boot is supported
                BIOS ROM is socketed
                EDD is supported
                5.25"/1.2 MB floppy services are supported (int 13h)
                3.5"/720 KB floppy services are supported (int 13h)
                3.5"/2.88 MB floppy services are supported (int 13h)
                Print screen service is supported (int 5h)
                8042 keyboard services are supported (int 9h)
                Serial services are supported (int 14h)
                Printer services are supported (int 17h)
                CGA/mono video services are supported (int 10h)
                ACPI is supported
                USB legacy is supported
                LS-120 boot is supported
                ATAPI Zip drive boot is supported
                BIOS boot specification is supported
        BIOS Revision: 8.14



This event sent from IssueTracker by kbaxley 
 issue 229865

Comment 22 Prarit Bhargava 2008-10-21 15:00:20 UTC
Kent went into the lab and bit-twiddled the system in the BIOS.

That's a lot better  :) 

powernow-k8: Found 4 Dual-Core AMD Opteron(tm) Processor 8216 processors (8 cpu cores) (version 2.20.00)
powernow-k8:    0 : fid 0x10 (2400 MHz), vid 0xa
powernow-k8:    1 : fid 0xe (2200 MHz), vid 0xc
powernow-k8:    2 : fid 0xc (2000 MHz), vid 0xe
powernow-k8:    3 : fid 0xa (1800 MHz), vid 0x10
powernow-k8:    4 : fid 0x2 (1000 MHz), vid 0x12
powernow-k8:    0 : fid 0x10 (2400 MHz), vid 0x8
powernow-k8:    1 : fid 0xe (2200 MHz), vid 0xa
powernow-k8:    2 : fid 0xc (2000 MHz), vid 0xc
powernow-k8:    3 : fid 0xa (1800 MHz), vid 0xe
powernow-k8:    4 : fid 0x2 (1000 MHz), vid 0x12
powernow-k8:    0 : fid 0x10 (2400 MHz), vid 0x8
powernow-k8:    1 : fid 0xe (2200 MHz), vid 0xa
powernow-k8:    2 : fid 0xc (2000 MHz), vid 0xc
powernow-k8:    3 : fid 0xa (1800 MHz), vid 0xe
powernow-k8:    4 : fid 0x2 (1000 MHz), vid 0x12
powernow-k8:    0 : fid 0x10 (2400 MHz), vid 0xa
powernow-k8:    1 : fid 0xe (2200 MHz), vid 0xc
powernow-k8:    2 : fid 0xc (2000 MHz), vid 0xe
powernow-k8:    3 : fid 0xa (1800 MHz), vid 0x10
powernow-k8:    4 : fid 0x2 (1000 MHz), vid 0x12
ACPI: (supports S0 S1 S4 S5)

I've just started a quick 3-loop reboot test that (hopefully) will do the following:

Check /home/countdown and if !0, grep the dmesg log for "time", copy that data to /home/bootlog, and then reboot.

If /home/countdown is 0, stop the test.

If this short test is successful, I will run a longer test.
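The loop described above could look something like this as a shell script (the exact commands are my assumption; only the /home/countdown and /home/bootlog paths come from the description):

```shell
# Sketch of the reboot-test loop: /home/countdown holds the number of
# reboots remaining, /home/bootlog collects each boot's time-related
# dmesg lines.

run_boot_check() {
    count_file=$1
    log_file=$2
    reboot_cmd=${3:-reboot}                     # injectable for testing
    count=$(cat "$count_file" 2>/dev/null || echo 0)
    if [ "$count" -ne 0 ]; then
        # Record this boot's timekeeping messages, decrement, go around again.
        echo "=== boot, $count loops remaining ===" >> "$log_file"
        dmesg 2>/dev/null | grep -i time >> "$log_file" || true
        echo $((count - 1)) > "$count_file"
        $reboot_cmd
    else
        echo "reboot test finished; results in $log_file"
    fi
}

# Run from an init script, e.g.:
#   run_boot_check /home/countdown /home/bootlog
```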

Ben, do you know if we *must* change the cpufreq on a core?  I suspect the answer is yes, but I'd like to know if you were seeing strange behavior on essentially quiescent systems...

P.

Comment 23 Issue Tracker 2008-10-21 15:14:33 UTC
Prarit,

A while ago we noticed that on nodes where the time is stable and
therefore the HPET is calibrated properly we see:
    time.c: Detected 2311.850 MHz processor.
However on nodes where the HPET seems to be running too fast for NTP to
handle it, we see:
    time.c: Detected 2288.237 MHz processor.
This along with the information that you provided from the BKDG seems to
indicate that there really is a problem with the BIOS. Do you concur?

Some things that I'm working on:
1) getting the BIOS version and the actual CPU model.
2) Finding out if we have ever seen the problem on dual core machines or
only on our quad cores. It might be that Kent's test machine is not a good
enough representative of LLNL's machine.
3) finding out if the problem ever happens on cold boot or if it only
happens on warm boot. That might be a very useful clue.

So I'd love to find out from you if:
1) My understanding of the information already provided matches yours and
that we have enough info to already assert that it is a BIOS problem.
2) Any guesses why this changed between 5.1 and 5.2. Are we in fact
driving the P-state stuff harder in 5.2 to conserve power and reduce heat?
In all honesty, the P-state hypothesis was just a working hypothesis. To me
the hypothesis seems to be torpedoed by the fact that we do transition to
P0 in the shutdown path. Hmm...unless they aren't shutting down the
machine (it is diskless) and just asserting a break on the serial console
to get it to reboot more or less instantly...Must do some additional
checking.

I'll also start seeing if I can find a good place to stick some printk's
to see if we are in the right P-state as the kernel comes up. time.c
shortly before we print: 
   time.c: Detected 2311.850 MHz processor.
Seems like a good place to start.


This event sent from IssueTracker by woodard 
 issue 229865

Comment 24 Prarit Bhargava 2008-10-21 15:30:14 UTC
(In reply to comment #23)
> Prarit,
> 
> A while ago we noticed that on nodes where the time is stable and
> therefore the HPET is calibrated properly we see:
>     time.c: Detected 2311.850 MHz processor.
> However on nodes where the HPET seems to be running too fast for NTP to
> handle it, we see:
>     time.c: Detected 2288.237 MHz processor.
> This along with the information that you provided from the BKDG seems to
> indicate that there really is a problem with the BIOS. Do you concur?
> 

I would agree with that.  But (as bmaly might point out) we've had internal reports of similar odd behavior on other systems.  This still could be a kernel side issue.

> Some things that I'm working on:
> 1) getting the BIOS version and the actual CPU model.

dmidecode should dump that for you.
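For instance (the dmidecode string keywords below are standard options of that tool; the cpu_model helper is my own sketch):

```shell
# BIOS and CPU identification via dmidecode's string keywords (needs root).
if command -v dmidecode >/dev/null 2>&1; then
    dmidecode -s bios-vendor          # e.g. "American Megatrends Inc."
    dmidecode -s bios-version
    dmidecode -s bios-release-date
    dmidecode -s processor-version    # the actual CPU model string
fi

# Same CPU model without root, from /proc/cpuinfo (or any cpuinfo-style file):
cpu_model() {
    awk -F': *' '/^model name/ { print $2; exit }' "${1:-/proc/cpuinfo}"
}
[ -r /proc/cpuinfo ] && cpu_model
```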

> 2) Finding out if we have ever seen the problem on dual core machines or
> only on our quad cores. It might be that Kent's test machine is not a good
> enough representative of LLNL's machine.

Okay.

> 3) finding out if the problem every happens on cold boot or if it only
> happens on warm boot. That might be a very useful clue.
> 
> So I'd love to find out from you if:
> 1) My understanding of the information already provided matches yours and
> that we have enough info to already assert that it is a BIOS problem.
> 2) Any guesses why this changed between 5.1 and 5.2. Are we in fact
> driving the P-state stuff harder in 5.2 to conserve power and reduce heat?
> In all honesty, the P-state hypothesis was just a working hypothesis. To me
> the hypothesis seems to be torpedoed by the fact that we do transition to
> P0 in the shutdown path. Hmm...unless they aren't shutting down the
> machine (it is diskless) and just asserting a break on the serial console
> to get it to reboot more or less instantly...Must do some additional
> checking.

Yeah ... I'm wondering why this suddenly started happening in 5.2 as well -- see my previous comment; this could still be something in the kernel.

Do you have a system @ LLNL that reliably reproduces this issue?  IMO, we should try doing a binary search through the kernels to see if we can at least pinpoint a specific version where this behavior started.

P.

Comment 25 Ben Woodard 2008-10-21 15:57:57 UTC
> Ben, do you know if we *must* change the cpufreq on a core?  I suspect the
> answer is yes, but I'd like to know if you were seeing strange behavior on
> essentially quiescent systems...

I'll have to find out for you. These are diskless HPC compute nodes, but they haven't been put into production yet and the SAs may feel at liberty to just reboot them whenever they feel the need. For production machines, they generally try to drain the node before rebooting. This would mean that all computational jobs would be stopped and the node would be quiet before being rebooted. It is reasonably likely that, since this is a pre-production machine, it is fairly quiescent most of the time. During this pre-production period they are usually shaking out hardware and software. They might run certain MPI benchmarks to identify poorly performing IB links (not CPU intensive) or run certain computational jobs with known answers to start identifying memory errors. However, CPU load is likely to be intermittent.

Furthermore, it is the habit here at LLNL to just assert break on the serial console to reboot a node rather than shutting it down, since they are diskless. This could be bypassing the normal steps that force the CPU into P0 before rebooting. If we have been shifting to P0 on shutdown, that could have been masking the BIOS problem.

Comment 26 Ben Woodard 2008-10-21 16:21:44 UTC
> Do you have a system @ LLNL that reliably reproduces this issue?  IMO, we
> should try doing a binary search through the kernels to see if we can at least
> pinpoint a specific version where this behavior started.

That is going to be hard. The problem appears reliably on a cluster of machines. The problem is that it appears inconsistently on any individual machine. My understanding is that it is <10% of the time. I'll see what we can do but rebooting the cluster is a bit of an undertaking and doing it many times sequentially even for a binary sort and sorting through all the data to find the needed info is non-trivial.

Comment 27 Prarit Bhargava 2008-10-21 16:53:58 UTC
(In reply to comment #26)
> > Do you have a system @ LLNL that reliably reproduces this issue?  IMO, we
> > should try doing a binary search through the kernels to see if we can at least
> > pinpoint a specific version where this behavior started.
> 
> That is going to be hard. The problem appears reliably on a cluster of
> machines. The problem is that it appears inconsistently on any individual
> machine. My understanding is that it is <10% of the time. I'll see what we can
> do but rebooting the cluster is a bit of an undertaking and doing it many times
> sequentially even for a binary sort and sorting through all the data to find
> the needed info is non-trivial.

I agree shutting down the cluster is unreasonable -- is there a way for us to even get one node for 24-48 hours for testing?

P.

Comment 28 Ben Woodard 2008-10-21 17:34:52 UTC
> Is there a way for us to even get one node for 24-48 hours for testing?

yes, I believe that is doable. I'll try to replicate the problem to get a baseline of how often it reproduces on the hardware.

Comment 29 Issue Tracker 2008-10-21 22:26:56 UTC
Just talking to Trent regarding this:
1) It seems to happen between 0.125% and 1% of the time: 1-8 nodes per
reboot of an 864-node cluster. That will make reproducing it "tricky"
because we will have to reboot approximately 100-800x to make it happen
once on a given node. At approx 3 min/reboot, that is one occurrence
every 5 to 40 hrs.
2) The nodes are quiet at the time he reboots them.
3) He reboots them by effectively yanking the power, not by using the
serial break. This is a cold boot rather than a warm boot.
4) The problem always goes away with a reboot. He's never seen it recur
twice in a row.
5) Never the same nodes. #4 and #5 indicate that this is likely a purely
statistical phenomenon, not a HW issue.
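As a back-of-the-envelope check of point 1 (my arithmetic: a per-node failure probability of 0.125%-1% means roughly 100-800 reboots per occurrence on any given node, at ~3 min each):

```shell
# Expected hours of back-to-back reboots per occurrence, given a per-node
# failure probability p and a per-reboot cost in minutes.
reproduce_hours() {
    awk -v p="$1" -v mins="$2" 'BEGIN { printf "%.0f\n", (1 / p) * mins / 60 }'
}

reproduce_hours 0.01 3      # 1% failure rate, 3 min/reboot  -> 5
reproduce_hours 0.00125 3   # 0.125% failure rate            -> 40
```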

The nodes had a problem with initializing the processors that we never
really got to the bottom of. What sort of happened was that as soon as we
added a printk the problem went away, so we left the printks in. I had
forgotten about this problem. I doubt that the problem is related, but
Trent mentioned that the timing was coincidental.

He's going to look into getting me some TLCC like nodes to test with.


This event sent from IssueTracker by woodard 
 issue 229865

Comment 31 Brian Maly 2008-10-22 03:16:59 UTC
Hey, what arch is this problem found on? x86_64? Is i386 affected as well?

Comment 32 Issue Tracker 2008-10-22 13:11:47 UTC
Everything at LLNL is running x86_64, so we don't know whether or not i386
is affected by this.


This event sent from IssueTracker by kbaxley 
 issue 229865

Comment 33 Prarit Bhargava 2008-10-22 15:08:38 UTC
(In reply to comment #29)
> Just talking to Trent regarding this:
> 1) It seems to happen between .125% and 1% of the time. 1-8x per reboot on
> a 864 node cluster. That will make reproducing it "tricky" because we will
> have to reboot approximately 1-800x to make it happen once. At approx
> 3min/reboot that is between 1/5hrs and 1/40hrs.
> 2) The nodes are quiet at the time at the time he reboots them.

That could mean the CPUs are in a lower power state than P0.

> 3) He reboots them by effectively yanking the power not by using the
> serial break. This is a cold boot rather than a warm boot.
> 4) The problem always goes away with a reboot. He's never seen it reoccur
> twice in a row. 

Interesting, so this is only occurring on cold reboots.

> 5) Never the same nodes. #4 and #5 indicate that this is likely a purely
> statistical phenomena not a HW issue.

Yes, I agree.  Can you get clarification on one other thing:  You earlier mentioned that you are NOT seeing this on RHEL5.1 and only in RHEL5.2.  Is that still the case?

P.

Comment 34 Ben Woodard 2008-10-22 15:59:03 UTC
Yes, we have at least two large clusters of identical hardware which have been running the 5.1 kernel for some time, and we aren't seeing this problem on them at all.

Also, re: the i386 arch, Kent is right. We don't do i386 at all. These are 16-core machines with at least 32GB of RAM that don't do things like virtualized environments or DBs. It is pretty likely that if we were trying to use a 32-bit env, we'd have all sorts of problems with lowmem and other VM issues.

Comment 35 Issue Tracker 2008-10-22 18:10:42 UTC
New data point. Did a standard /sbin/shutdown -r last night on the whole
cluster and ended up with 5/864 nodes with the wrong HPET speed, i.e.
0.58%.

So it doesn't seem to matter if it is a cold boot or a warm boot.

This event sent from IssueTracker by woodard 
 issue 229865

Comment 36 Brian Maly 2008-10-22 21:22:07 UTC
Re: Comment #35,

Can you be more specific about the wrong HPET speed? Is the hpet_period actually different (i.e. the MHz is not the same)? If so, that's a HUGE clue to what's going on.

Comment 37 Ben Woodard 2008-10-23 21:39:37 UTC
Prarit: One more question: 

This morning you said: 
<prarit> After some extensive debugging I discovered to my dismay that these particular AMD systems (I cannot recall if they were dual or quad core) were initialized with different cpu frequencies.
<neb> <prarit> We had a concall with AMD + Xen + RH where I showed them the evidence.  The AMD engineers agreed that the systems were booting incorrectly.

What proof did you present to them?
Were you dumping the P-state or C-state shortly after boot for all the CPUs, thereby demonstrating that when the CPUs come out of the BIOS, the procs were in the wrong state?

Comment 38 Brian Maly 2008-10-23 22:03:16 UTC
Can we get an answer to Comment #36?

Comment 39 Brian Maly 2008-10-23 22:15:13 UTC
Anyway, on the other system (which was a DL580 G5) where I saw this problem, with BogoMIPS and lpj being different for each processor, I was able to resolve the issue using "hpet=disable" as a boot arg. I might ask that this be tried, based on the fact that the symptom is exactly the same. And if we are lucky and it's the same bug, I have a potential fix for it (i.e. it would be a duplicate of a bug I already have).

Comment 40 Ben Woodard 2008-10-23 22:59:51 UTC
We were thinking it was the HPET that we were using, but a closer inspection
of the message, and confirmation that booting with notsc worked around the
problem, pointed Prarit and me at the patch from
https://bugzilla.redhat.com/show_bug.cgi?id=428479 

So our comments regarding the HPET were misinformed.
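For reference, the notsc workaround is passed on the kernel line in /boot/grub/grub.conf on RHEL 5; the kernel version and root device below are illustrative, not taken from these machines:

```
title Red Hat Enterprise Linux Server (2.6.18-92.el5)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-92.el5 ro root=/dev/VolGroup00/LogVol00 notsc
        initrd /initrd-2.6.18-92.el5.img
```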

Comment 41 Issue Tracker 2008-10-24 04:13:43 UTC
Running a test all night on the nodes where I repeatedly boot a kernel that
has the 428479 patch removed. That is the only difference vs. a kernel that
regularly manifests the problem. The test should include at least 1000
reboots with this kernel from a fairly quiet state. If Prarit's idea is
correct, then we should have no cases where the clocks drift.

This event sent from IssueTracker by woodard 
 issue 229865

Comment 44 Ben Woodard 2008-10-25 01:22:46 UTC
Note that >1000 reboots without the patch from BZ428479 did not reproduce the problem. This isn't a perfect test, but it is strongly suggestive that this one patch, and the way that it enables the TSC, are implicated.

We now have two lines of related evidence: booting with notsc, and removing that patch.

For good science, I still need to test my methodology to make sure that it will reproduce the problem under the conditions that I set up. I'll get to that on Monday.

Last night I pondered another line of attack regarding this. Using the TSC isn't bad; it just isn't getting calibrated correctly all the time. Could we add something that retests it late in the boot process, maybe even initiated by a simple init script, and then recalibrates the TSC if necessary? There might be some advantage in using the TSC vs. the HPET, since some scientific apps make heavy use of GTOD.
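A rough sketch of the user-space half of that idea as an init-script fragment: compare the boot-time "Detected N MHz" figure against cpufreq's advertised P0 frequency and warn on large disagreement. The paths and the tolerance are my assumptions (the bad calibrations seen here were only ~1% off the good ones, so the threshold is a judgment call), and actually re-calibrating the TSC would still need kernel support:

```shell
# freq_within DETECTED EXPECTED [TOL]: succeed if the two MHz values agree
# to within TOL (a fraction, default 0.5%).
freq_within() {
    awk -v det="$1" -v want="$2" -v tol="${3:-0.005}" 'BEGIN {
        d = det - want
        if (d < 0) d = -d
        r = (d / want <= tol) ? 0 : 1
        exit r
    }'
}

# Compare the boot-time calibration against cpufreq's advertised P0 frequency.
check_tsc_calibration() {
    detected=$(dmesg 2>/dev/null |
        sed -n 's/.*time\.c: Detected \([0-9.]*\) MHz.*/\1/p' | head -1)
    p0_file=/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
    [ -n "$detected" ] && [ -r "$p0_file" ] || return 0   # nothing to check
    expected=$(awk '{ printf "%.3f", $1 / 1000 }' "$p0_file")   # kHz -> MHz
    freq_within "$detected" "$expected" ||
        echo "WARNING: TSC calibrated at ${detected} MHz, expected ~${expected} MHz" >&2
}
check_tsc_calibration
```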

Comment 45 Prarit Bhargava 2008-10-27 12:03:41 UTC
(In reply to comment #44)
> Note, that >1000 reboots without the patch from BZ428479 did not reproduce the
> problem. This isn't a perfect test but it is strongly suggestive that the this
> one patch and the way that it enables the TSC are strongly implicated.
> 

... I was afraid that this was going to be the problem.

> We now have two lines of related evidence. Booting with notsc and removing that
> patch.
> 
> For good science, I still need to test my methodology to make sure that it will
> reproduce the problem under the conditions that I setup. I'll get to that on
> Monday.

Okay.  I'll take a look at the offending code.

Ben, IIRC LLNL runs RHEL, but respins the kernel, correct?

> 
> Last night I pondered another line of attack regarding this. Using the TSC
> isn't bad it just isn't getting calibrated correctly all the time. Could we add
> something that retests it late in the boot process, maybe even an initiated by
> a simple init ship and then recalibrates the TSC if necessary. There might be
> some advantage in using the TSC vs. the HPET since some scientific apps make
> heavy use of GTOD.

I'll look into it.  I think upstream code was suggested to re-calibrate the TSC to make sure the calibration value was comparable to the first calibration attempt.  If they didn't match, a third attempt was made, IIRC ... I'm not sure the code made it into mainline, though ...

P.

Comment 50 Ben Woodard 2008-10-27 15:22:21 UTC
I can do that test on a slice of the cluster & reboot enough to get statistically valid data but it will take approx 24hrs to run. Can you make a kernel rpm?

Comment 51 Prarit Bhargava 2008-10-27 15:38:13 UTC
(In reply to comment #50)
> I can do that test on a slice of the cluster & reboot enough to get
> statistically valid data but it will take approx 24hrs to run. Can you make a
> kernel rpm?

Building one now.  I'll attach it when it's done.

P.

Comment 52 Issue Tracker 2008-10-27 16:13:06 UTC
More solid information that the fix that enables the TSC on the family 10h
procs is implicated.

--- Comment #38 from Trent D'Hooge <tdhooge>  2008-10-24 11:05:57 ---
2 reboots of Hera with the TSC_AMD patch removed; ntp looks good. On reboot,
here is the variation I saw for time.c:

5 nodes at 2288 MHz
855 nodes at 2311 MHz
1 node at 2335 MHz
1 node at 2416 MHz
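A distribution like that can be tallied from pdsh-style "host: line" output in one pipeline (the pdsh usage is an assumption about how the data was gathered; the tally itself is generic):

```shell
# Tally how many nodes detected each (truncated) MHz value.
# Input lines look like: "hera122: time.c: Detected 2311.850 MHz processor."
tally_mhz() {
    grep -o 'Detected [0-9.]* MHz' | awk '{ print int($2) }' | sort -n | uniq -c
}

# e.g. across the whole cluster (pdsh assumed available):
#   pdsh -a 'dmesg | grep "time.c: Detected"' | tally_mhz
```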


This event sent from IssueTracker by woodard 
 issue 229865

Comment 55 Prarit Bhargava 2008-11-04 15:03:14 UTC
Created attachment 322436 [details]
RHEL5 fix for this issue

Comment 56 RHEL Program Management 2009-01-27 20:40:09 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 57 Don Zickus 2009-02-02 19:47:33 UTC
in kernel-2.6.18-130.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 58 RHEL Program Management 2009-02-16 15:08:54 UTC
Updating PM score.

Comment 60 Chris Ward 2009-07-03 18:11:14 UTC
~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.

Questions can be posted to this bug or your customer or partner representative.

Comment 62 Ben Woodard 2009-07-28 23:10:37 UTC
Kent please open up this bug. It won't let me unrestrict it.

Comment 63 Ben Woodard 2009-07-28 23:28:13 UTC
It appears that this patch isn't quite sufficient to fix the problem completely. We have temporarily used a workaround:

23 [ben@wopri chaos]$cat amd-notsc.patch 
Index: linux+rh+chaos/arch/x86_64/kernel/time.c
===================================================================
--- linux+rh+chaos.orig/arch/x86_64/kernel/time.c       2009-04-22 13:53:52.000000000 -0700
+++ linux+rh+chaos/arch/x86_64/kernel/time.c    2009-04-22 14:01:05.000000000 -0700
@@ -1049,13 +1049,12 @@
                return 1;
 #endif
 
-       /* AMD or Intel systems with constant TSCs have synchronized clocks */
-       if (boot_cpu_has(X86_FEATURE_NONSTOP_TSC))
-               return 0;
-
        /* Most intel systems have synchronized TSCs except for
           multi node systems */
        if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) {
+               /* Intel systems with constant TSCs have synchronized clocks */
+               if (boot_cpu_has(X86_FEATURE_NONSTOP_TSC))
+                       return 0;
 #ifdef CONFIG_ACPI
                /* But TSC doesn't tick in C3 so don't use it there */
                if (acpi_fadt.length > 0 && acpi_fadt.plvl3_lat < 1000 &&

So, unfortunately, in the confusion of trying various things out, I must have inadvertently passed incorrect information back to Prarit, confirming that his calibrate_cpu patch was sufficient to solve the problem when it wasn't.

Comment 64 Zhang Kexin 2009-07-30 14:27:56 UTC
Ben, thanks for testing.
Should we move this bug back to ASSIGNED? Do we need to pull in this workaround patch?

Comment 69 Zhang Kexin 2009-07-31 09:46:29 UTC
According to comment #66, moved back to ON_QA; we just need to do the SanityOnly verification for this bug.

Comment 70 Zhang Kexin 2009-07-31 10:13:06 UTC
Patch is in; adding SanityOnly.

Comment 72 errata-xmlrpc 2009-09-02 08:33:44 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html