Bug 207917 - Kernel Review: cpu_has_tsc return value patch
Kernel Review: cpu_has_tsc return value patch
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel (Show other bugs)
3.8
i686 Linux
urgent Severity high
: ---
: ---
Assigned To: Prarit Bhargava
Brian Brock
:
Depends On:
Blocks: 202071
  Show dependency treegraph
 
Reported: 2006-09-25 08:22 EDT by Prarit Bhargava
Modified: 2007-11-16 20:14 EST (History)
10 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2006-10-09 09:08:57 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
cpu_has_tsc return value patch (980 bytes, patch)
2006-09-25 09:37 EDT, Prarit Bhargava
no flags Details | Diff
Test RPM (7.91 MB, application/x-rpm)
2006-09-25 10:38 EDT, Prarit Bhargava
no flags Details
SMP Test RPM (8.33 MB, application/x-rpm)
2006-09-27 19:11 EDT, Prarit Bhargava
no flags Details

  None (edit)
Description Prarit Bhargava 2006-09-25 08:22:53 EDT
Description of problem:

The current code base does not properly check the return value of cpu_has_tsc. 
This may result in SMP boxes failing to properly synchronize their TSCs.

Version-Release number of selected component (if applicable):  RHEL3.8

This BZ is to inform Intel of the change and to request that Intel runs the
attached kernel through it's internal test system.
Comment 2 Prarit Bhargava 2006-09-25 09:37:09 EDT
Created attachment 137052 [details]
cpu_has_tsc return value patch
Comment 3 Prarit Bhargava 2006-09-25 10:20:20 EDT
Intel colleagues,

I'm currently building a RHEL3.8 kernel RPM for you to test with.  This RPM
should be attached to this BZ by noon ET.

P.
Comment 4 Prarit Bhargava 2006-09-25 10:29:01 EDT
Some additional information:

When the patch in comment #2 was added to the kernel, boot hangs were seen after
initializing the per-cpu migration threads:

Starting migration thread for cpu 0
Starting migration thread for cpu 1

<no more output ... system hung at this point>

The system in question was an Intel prototype and /proc/cpuinfo contains

[root@dhcp83-14 tmp]# cat /proc/cpuinfo 
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Genuine Intel(R) CPU 3.00GHz
stepping        : 4
cpu MHz         : 2994.521
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
runqueue        : 0
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm lm
bogomips        : 5976.88

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Genuine Intel(R) CPU 3.00GHz
stepping        : 4
cpu MHz         : 2994.521
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
runqueue        : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm lm
bogomips        : 5976.88

I reset the BIOS settings to the default and the system no longer hangs and is
functional.  I do not see any hangs on this system (or any other Intel system).

P.
Comment 5 Prarit Bhargava 2006-09-25 10:38:32 EDT
Created attachment 137055 [details]
Test RPM

Boots on RH test system:

[root@dhcp83-14 root]# uname -a
Linux dhcp83-14.boston.redhat.com 2.4.21-47.EL_private_bz207917 #1 Mon Sep 25
09:38:15 EDT 2006 i686 i686 i386 GNU/Linux
Comment 6 Suresh Siddha 2006-09-25 14:30:10 EDT
We will test this RPM and getback in a day. 64bit kernel also has similar code 
as the proposed TSC sync change, so I expect the TSC sync change should be 
fine.

BTW, why is there a new file include/asm-i386/pci-direct.h got introduced in 
this patch? That seems to be not related to TSC sync change.
Comment 7 Prarit Bhargava 2006-09-25 17:14:05 EDT
> 
> BTW, why is there a new file include/asm-i386/pci-direct.h got introduced in 
> this patch? That seems to be not related to TSC sync change.

Oops.  My mistake.  That shouldn't be there -- but it's harmless of course ;)

P.
Comment 9 Prarit Bhargava 2006-09-27 10:11:26 EDT
Suresh, just to re-iterate:  The inclusion of pci-direct.h is harmless and
doesn't effect the cpu_has_tsc return values.  While it was included in error,
it shouldn't effect the operation of your system.

P.
Comment 10 Suresh Siddha 2006-09-27 17:55:01 EDT
hmm.. RPM posted in comment #5 is an UP RPM and hence the patch in discussion 
will not get tested with this RPM. Please post a SMP RPM.
Comment 11 Prarit Bhargava 2006-09-27 19:11:23 EDT
Created attachment 137265 [details]
SMP Test RPM

SMP Test RPM
Comment 12 Suresh Siddha 2006-09-27 21:31:31 EDT
Thanks. SMP RPM test booted fine on one of my boxes. We will do more testing 
and if there are any issues I will report it here.
Comment 13 Ronald Pacheco 2006-09-28 13:34:37 EDT
Suresh,

Thanks for the unit test.  We really need some broader test coverage and we have
a sense of urgency to get this done.  Can we encourage you to provide us with
more test results very soon?

Thanks!

Ron Pacheco
Comment 14 Bob Johnson 2006-09-28 13:37:59 EDT
Suresh, we are looking not for more testing on the same box, but broader testing
across several platforms if possible.
Comment 15 Bob Johnson 2006-09-28 13:42:21 EDT
And actually this patch is being looked at for an out of cycle errata and we
would hate to impact any currently working platforms, that's the reason for the
broader test and the urgency for the request. 
Comment 16 Prarit Bhargava 2006-09-28 14:09:27 EDT
Geoff asked me what the BIOS version and the Product Code of the box that we saw
failures on are:

BIOS: VVCLGI7J.86A.2595.2005.0421.1024

Product Code: D3C41SYBANLPBE
MM#: 868839

P.
Comment 17 Suresh Siddha 2006-09-28 18:53:55 EDT
After getting some more information about Prarits experiments(where he 
observed system hang with this patch with some unknown bios settings) through 
Geoff, I have few questions.

a) Why is this patch needed and on what platforms? Most of the Intel platforms 
doesn't require this as TSC's are synchronized and even on the NUMA kind of 
platforms, we doesn't rely on the TSC to be in sync.

b) If there is an urgency in getting this patch into errata release, we should 
have a cpu vendor check on where this is really required.

My feeling is that unless Prarit's hang gets reproduced and fully understood, 
we shouldn't apply this patch for Intel plaforms, especially if no customer on 
Intel platforms is requesting/waiting for this patch.
Comment 18 Prarit Bhargava 2006-09-29 07:12:49 EDT
> 
> a) Why is this patch needed and on what platforms? Most of the Intel platforms 
> doesn't require this as TSC's are synchronized and even on the NUMA kind of 
> platforms, we doesn't rely on the TSC to be in sync.

This patch is needed for another vendor to synchronize the TSCs.  IIRC, Intel
does this synchronization in HW so this step is not required.

> 
> b) If there is an urgency in getting this patch into errata release, we should 
> have a cpu vendor check on where this is really required.

That was an option I explored earlier, however, it is necessary to understand
why we saw TSC synchronization failures on Intel boxes.  The code in question is
identical to that of the upstream kernel and should not cause any problems.

> 
> My feeling is that unless Prarit's hang gets reproduced and fully understood, 
> we shouldn't apply this patch for Intel plaforms, especially if no customer on 
> Intel platforms is requesting/waiting for this patch.

Agreed -- I will regenerate the patch with a (CPU_VENDOR != INTEL) check prior
to synchronizing the TSCs.

Note You need to log in before you can comment on or make changes to this bug.