1. Feature Overview:
   a. Name of feature: TSC keeps running in C3+
   b. Feature Description: This feature largely consists of selectively disabling (expensive) workarounds in the existing kernel to enable improved behavior. As such, we expect the changes to be limited in scope and risk.
2. Feature Details:
   a. Architectures: 32-bit x86, 64-bit Intel EM64T
   b. Bugzilla Dependencies:
   c. Drivers or hardware dependencies: Intel Nehalem based SDVs.
   d. Upstream acceptance information: Code will be posted after spec release, ETA Nov '08, 2.6.29.
   e. External links:
   f. Severity (U,H,M,L): Medium
   g. Target Release Date:
3. Business Justification:
   a. Why is this feature needed? Power management feature.
   b. What hardware does this enable?
   c. Forecast, impact on revenue?
   d. Any configuration info?
   e. Are there other dependencies (drivers)?
4. Primary contact at Red Hat, email, phone (chat): John Villalovos, jvillalo
5. Primary contact at Partner, email, phone (chat): Gabbert, Keve A, +1 503 264 7597, keve.a.gabbert
This is now in the 2.6.29 kernel. We need to create a backport of the patch.

Gitweb: http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=40fb17152c50a69dc304dd632131c2f41281ce44
Commit: 40fb17152c50a69dc304dd632131c2f41281ce44
Parent: 7e91a122b11bb250d08ab125afd2c232c87502e1
Author: Venki Pallipadi <venkatesh.pallipadi>
AuthorDate: Mon Nov 17 16:11:37 2008 -0800
Committer: Ingo Molnar <mingo>
CommitDate: Tue Dec 16 21:02:50 2008 +0100

    x86: support always running TSC on Intel CPUs

    Impact: reward non-stop TSCs with good TSC-based clocksources, etc.

    Add support for CPUID_0x80000007_Bit8 on Intel CPUs as well. This bit means that the TSC is invariant with C/P/T states and always runs at constant frequency.

    With Intel CPUs, we have 3 classes
    * CPUs where TSC runs at constant rate and does not stop in C-states
    * CPUs where TSC runs at constant rate, but will stop in deep C-states
    * CPUs where TSC rate will vary based on P/T-states and TSC will stop in deep C-states.

    To cover these 3, one feature bit (CONSTANT_TSC) is not enough. So, add a second bit (NONSTOP_TSC). CONSTANT_TSC indicates that the TSC runs at constant frequency irrespective of P/T-states, and NONSTOP_TSC indicates that TSC does not stop in deep C-states.

    CPUID_0x80000007_Bit8 indicates both these feature bits can be set. We still have CONSTANT_TSC _set_ and NONSTOP_TSC _not_set_ on some older Intel CPUs, based on model checks. We can use TSC on such CPUs for time, as long as those CPUs do not support/enter deep C-states.

    Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi>
    Signed-off-by: Ingo Molnar <mingo>
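For readers following the backport, the relationship between the CPUID bit and the two feature bits described in the commit message can be sketched roughly as below. This is a minimal user-space illustration, not the kernel code: the `TSC_CONSTANT` / `TSC_NONSTOP` macros and the `tsc_features()` helper are hypothetical names, and the `model_says_constant` argument stands in for the older per-model check on Intel CPUs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for X86_FEATURE_CONSTANT_TSC / X86_FEATURE_NONSTOP_TSC. */
#define TSC_CONSTANT (1u << 0) /* rate does not vary with P/T-states */
#define TSC_NONSTOP  (1u << 1) /* keeps counting in deep C-states */

/*
 * cpuid_80000007_edx: EDX as returned by CPUID leaf 0x80000007.
 * model_says_constant: outcome of the older model-based check (assumed input).
 */
static unsigned int tsc_features(uint32_t cpuid_80000007_edx,
                                 bool model_says_constant)
{
    unsigned int features = 0;

    /* Older CPUs: constant rate known only from the model check. */
    if (model_says_constant)
        features |= TSC_CONSTANT;

    /* Bit 8 set: TSC is invariant, so both properties hold. */
    if (cpuid_80000007_edx & (1u << 8))
        features |= TSC_CONSTANT | TSC_NONSTOP;

    return features;
}
```

Under these assumptions the three classes map to `TSC_CONSTANT | TSC_NONSTOP`, `TSC_CONSTANT` alone, and `0`.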
Created attachment 330131 [details] Backport of the TSC always running patch
For x86_64 there is no GENERIC_TIME config option, which in effect compiles out the TSC-always-running code. We might need to ask RH to enable GENERIC_TIME so that the clocksource driver is available on x86_64 too. I don't know why the clocksource driver is not included in the x86_64 kernel in the first place.
But x86_64 processors might have had the constant_tsc feature since the first generation. That would make comment #3 invalid. We need someone to confirm it.
Luming: I am not clear on what you mean by "constant_tsc from first generation". Can you explain it a bit? NHM is the first Intel CPU to have CPUID_0x80000007_Bit8.

Regarding 64-bit: yes. In the RHEL5 timeframe, the generic timer code was only there for i386. x86_64 was using its own timer routines. Later it was integrated into the generic timer code. So, for RHEL5, you will need more changes than the above backport, specifically in the x86_64 timer code where the time source gets selected: arch/x86_64/kernel/time.c, time_init_gtod() and friends.
> NHM is the first Intel CPU to have CPUID_0x80000007_Bit8.

The following is /proc/cpuinfo on my oldest Napa SDV. constant_tsc is set in the flags field. Is it the bit you mentioned above?

[root@napa ~]# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Pentium(R) M CPU 000 @ 1.83GHz
stepping : 1
cpu MHz : 1000.000
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm tpr_shadow
bogomips : 3666.45
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
constant_tsc was used as a "software feature" bit not directly linked to any CPUID bit. For AMD it was set based on a CPUID bit, but for Intel it was set based on model and stepping. constant_tsc was mainly used to identify whether the TSC runs at a constant rate in the presence of P-states (frequency changes).

As part of this patch to mainline, I have two features, CONSTANT_TSC and NONSTOP_TSC.
CONSTANT_TSC means the TSC rate does not change with frequency.
NONSTOP_TSC means the TSC does not stop in C-states.

The CPUID bit mentioned earlier means both of these features are present. On earlier CPUs (like the Napa one) the CPUID bit will not be set, but we will still set constant_tsc based on the family.
> constant_tsc was used as a "software feature" bit not directly linked to any
> cpuid bits. It was set for AMD based on CPUID bit. But, for Intel it was set
> based on model stepping.

OK, then the earlier Intel CPUs (determined by family/model/stepping) just have a TSC that is constant regardless of P and T states, but that stops in Cx states. What is new in NHM is "tsc does not stop in Cx state".
But before NHM we could not call it a *TRUE* constant_tsc. Upstream supersedes CONSTANT_TSC with NONSTOP_TSC, and I'm not sure that is reasonable. If an external component relies on the test "boot_cpu_has(X86_FEATURE_CONSTANT_TSC)", it would very likely fail on Intel CPUs before NHM. The problem could only be fixed by explicitly asking for that test to be changed from CONSTANT_TSC to NONSTOP_TSC. Venki, what do you think about the problem?
If someone outside the kernel uses CONSTANT_TSC to mean that the TSC runs in deep C-states, they are doing so at their own risk. That assumption does not hold on the Core / Core 2 family of CPUs. Almost all kernel usages of CONSTANT_TSC are in the cpufreq and related routines. The only other place is processor_idle, which has an AMD check along with CONSTANT_TSC. So, I don't think it is a big problem.
I was trying to make the backport diverge from upstream by using only CONSTANT_TSC (not using NONSTOP_TSC), and by taking the CONSTANT_TSC feature away from all CPUs without CPUID_0x80000007_Bit8 set. The pro of this approach is that it removes all the problems I mentioned in my previous comments. The only con might be that it diverges from upstream. I will try to keep the backport in line with upstream until we see how RH responds.

By the way, I'm not sure whether this belongs on the RHEL 5 ABI check list. If it does, I would be curious how it is implemented in the ABI check.
> I was trying to divert the back port patch from upstream by only using
> CONSTANT_TSC ( don't use NONSTOP_TSC..). And take back CONSTANT_TSC feature from
> all CPUs without CPUID_0x80000007_Bit8 set..

That will have a side effect on the kernel usages of CONSTANT_TSC in the cpufreq code for Core / Core 2 based CPUs, as they will no longer have this flag and will take a different code path.
> That will have a side effect on the kernel usages of CONSTANT_TSC in cpufreq
> code for Core / Core 2 based CPUs

Good point. I need to think twice before *removing* anything, so that I don't shake what is standing on top of it. My choice would be to make a new name for the Core/Core 2 CONSTANT_TSC, but I would keep the original semantics of CONSTANT_TSC.
Let's go back to my original comment#3 and comment#4. Venki has confirmed that Core/Core 2 based CPUs have the C3-TSC-stop problem. In the RHEL 5 kernel, processor_idle would notify users that the TSC halts in C2/C3 (acpi_processor_idle(), drivers/acpi/processor_idle.c). But CONFIG_GENERIC_TIME is *NOT* defined for the x86_64 arch. So the RHEL 5 kernel would suffer from the C3-TSC-stop problem on Intel Core/Core 2 based CPUs, which would affect gettimeofday(). Without CONFIG_GENERIC_TIME defined, the x86_64 kernel would not benefit from the clocksource driver as i386 does.

The following is clock source info from my old SDV with a Conroe processor (installed with RHEL 5.3):

[root@intel-conroe clocksource0]# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
jiffies
[root@intel-conroe clocksource0]# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
jiffies
[root@intel-conroe clocksource0]# ls /sys/devices/system/clocksource/
clocksource0
[root@intel-conroe clocksource0]# cat /proc/version
Linux version 2.6.18-128.el5 (mockbuild.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Wed Dec 17 11:41:38 EST 2008
John,
Based on comment#14, would you please ask Red Hat to evaluate whether RH can enable CONFIG_GENERIC_TIME in RHEL 5.4?

Peter,
Please evaluate my request to add clocksource driver support for x86_64.

Thanks,
Luming
Adding CONFIG_GENERIC_TIME for x86-64 will be a big change and will change the timer code for x86-64 a lot. That means the x86-64 apic, tsc, and every other piece of timer-related code would have to change to use GENERIC_TIME. That doesn't look feasible.

For the 64-bit code in RHEL5 to take advantage of the always running TSC, arch/x86_64/kernel/time.c should change. That change will be for RHEL5 only and not valid for upstream. The change would be to choose the TSC as a reliable gettimeofday source at boot time on CPUs with an always running TSC.
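The boot-time choice described above could look something like the following. This is only a sketch of the idea, not the RHEL5 time_init_gtod() logic: `struct platform`, `enum time_source`, and `pick_time_source()` are hypothetical names, and the fallback order (HPET, then PM timer, then PIT) is an assumption made for illustration.

```c
#include <stdbool.h>

/* Hypothetical description of what was detected at boot. */
enum time_source { SRC_PIT, SRC_PMTIMER, SRC_HPET, SRC_TSC };

struct platform {
    bool has_hpet;
    bool has_pmtimer;
    bool tsc_constant; /* rate independent of P/T-states */
    bool tsc_nonstop;  /* keeps running in deep C-states */
    bool tsc_synced;   /* TSCs agree across CPUs (assumed to be known) */
};

/* Prefer the TSC only when nothing can stop, scale, or desynchronize it. */
static enum time_source pick_time_source(const struct platform *p)
{
    if (p->tsc_constant && p->tsc_nonstop && p->tsc_synced)
        return SRC_TSC;
    if (p->has_hpet)
        return SRC_HPET;
    if (p->has_pmtimer)
        return SRC_PMTIMER;
    return SRC_PIT;
}
```

With this shape, a Core 2 style machine (constant but stopping TSC) falls through to HPET or the PM timer, while an NHM style machine gets the TSC.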
> Adding CONFIG_GENERIC_TIME for x86-64 will be a big change..

The big change is for fixing the issues I mentioned in comment#14, which I found when browsing the code, not for taking advantage of the always running TSC. We should evaluate the impact of those issues before we make any decision.

For "always running TSC" itself on x86_64, I thought nothing needs to change. But I might be missing something. Venki, please correct me if that is NOT true. Because:
1. The default do_gettimeoffset is do_gettimeoffset_tsc.
2. There is only one clocksource: TSC.
3. There is no way to notify that the TSC stops or changes in C2/C3 or on a P-state change, which in turn would change the clocksource.

In short, the default behaviour is that the RHEL 5 kernel treats x86_64 as having a constant and nonstop TSC, which I think is a serious bug (it would affect all Intel Core/Core 2 CPUs) unless I am missing something really important.
No. The x86_64 kernel does not have any issue with Core or Core 2 CPUs. It does not have the GENERIC_TIME code, but x86_64 knows about HPET, PIT, PMTIMER and TSC and picks one of them at boot time based on the CPU. On Intel it should be using HPET or PMTIMER as the gettimeofday base.

GENERIC_TIME not being in x86_64 is not a bug. It was the same with upstream 2.6.18. GENERIC_TIME was added to x86_64 much later (.20 or so, IIRC). But the 2.6.18 kernel would work on Core/Core 2 without any bug.
Venki, then it's weird! I got the following debug info with 2.6.18-130.el5debug on an old SDV with a Conroe CPU.

[root@intel-conroe ~]# dmesg | grep time_init
time_init_gtod: notsc=0
time_init_gtod: do_gettimeoffset is using tsc
Venki, on napa, the following info from the f11 rawhide kernel ACKs comment#18:

[root@napa clocksource0]# cat available_clocksource
hpet acpi_pm jiffies tsc
[root@napa clocksource0]# cat current_clocksource
hpet
[root@napa clocksource0]# cat /proc/version
Linux version 2.6.29-0.78.rc3.git5.fc11.x86_64 (mockbuild.phx.redhat.com) (gcc version 4.3.2 20081105 (Red Hat 4.3.2-7) (GCC) ) #1 SMP Tue Feb 3 16:27:09 EST 2009
Luming,

About comment #19: does your Conroe support the deep C3 state? Using the TSC on platforms that do not support deep C-states is perfectly all right.

About comment #20: I am not sure what we gain by comparing Conroe with RHEL5 against Napa with FC11. Can you run your RHEL5 debug kernel on a napa/merom system and see whether the TSC is still getting used there? If it is, then that is a bug, as napa supports deep C-states.
Also, please follow through the code in arch/x86_64/kernel/time.c in RHEL5 and you can see when and how the code switches across different gettimeoffsets with hpet, pm and tsc.
To comment#21: the following data should indicate this is a bug. On napa:

[root@napa clocksource0]# cat /proc/acpi/processor/CPU0/power
active state: C2
max_cstate: C8
bus master activity: 00000000
states:
    C1: type[C1] promotion[C2] demotion[--] latency[000] usage[00000010] duration[00000000000000000000]
   *C2: type[C2] promotion[--] demotion[C1] latency[001] usage[00214146] duration[00000000000719382858]
[root@napa clocksource0]# cat available_clocksource
jiffies
[root@napa clocksource0]# cat current_clocksource
jiffies
[root@napa clocksource0]# cat /proc/version
Linux version 2.6.18-128.el5 (mockbuild.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Wed Dec 17 11:41:38 EST 2008
Oops, please ignore comment#22. I was fooled by the data retrieved from the clocksource sysfs interface. More investigation shows that the napa does use HPET as its clock source. The clocksource sysfs interface doesn't reflect the current setting of do_gettimeoffset. It is a trivial bug, but we'd better fix it too.

On this napa, notsc is set to 1, and in unsynchronized_tsc() I get acpi_fadt.length==244, acpi_fadt.plvl3_lat=35.

RHEL 5 uses this kind of code in unsynchronized_tsc():

    if (acpi_fadt.length > 0 && acpi_fadt.plvl3_lat < 1000)
        return 1;

Doesn't that look likely to cause problems, given that we have the alternative method _CST?
Updating PM score.
Here is a bad example. On our Tylersburg-EP SDV, I get this:

time_init_gtod: acpi_fadt.length = 244
time_init_gtod: acpi_fadt.plvl3_lat= 1001

The data above would make the following test fail:

    if (acpi_fadt.length > 0 && acpi_fadt.plvl3_lat < 1000)
        return 1;

The platform would then have deep C-states and a TSC-based gettimeofday working together, which would cause trouble if the processor doesn't support NONSTOP_TSC. This Tylersburg-EP SDV happens to have NHM installed, so basically it should work without backporting any patch. But what happens if a platform has acpi_fadt.plvl3_lat > 1000 and doesn't have NHM installed?

The cleanest way I can think of now to handle this kind of problem is to follow upstream and introduce mark_tsc_unstable() and the GENERIC_TIME code into the x86_64 RHEL 5 kernel.
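To make the concern concrete, here is a small stand-alone sketch of the FADT heuristic quoted above, fed with the two sets of values reported in this thread. `struct fadt_info`, `fadt_implies_c3()` and `tsc_unsafe_for_time()` are hypothetical names, the `c3_via_cst` input is an assumed "deep C-state advertised through _CST" flag, and the real unsynchronized_tsc() checks other things as well (SMP synchronization, for instance) that are left out here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Only the two FADT fields the quoted test looks at. */
struct fadt_info {
    uint32_t length;
    uint16_t plvl3_lat; /* advertised C3 latency */
};

/* Mirrors the quoted test: a C3 latency below 1000 is read as "C3 is usable". */
static bool fadt_implies_c3(const struct fadt_info *fadt)
{
    return fadt->length > 0 && fadt->plvl3_lat < 1000;
}

/*
 * Hypothetical refinement: also honour deep C-states found through _CST,
 * and never distrust a TSC that is known not to stop.
 */
static bool tsc_unsafe_for_time(const struct fadt_info *fadt,
                                bool c3_via_cst, bool nonstop_tsc)
{
    if (nonstop_tsc)
        return false;
    return fadt_implies_c3(fadt) || c3_via_cst;
}
```

With the napa values (244, 35) the plain FADT test marks the TSC as unsafe; with the Tylersburg-EP values (244, 1001) it does not, so only the extra `c3_via_cst` input would catch a deep C-state on a CPU without NONSTOP_TSC.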
typo in comment#24, please ignore comment#23
About comment #27: good luck introducing GENERIC_TIME into RHEL5 x86_64 :-) My take would be, as I said before, that we first need to find out whether this is a real problem on any existing system and then go find a reasonable fix.
Created attachment 333006 [details] a back port
Created attachment 333007 [details] a back port
in kernel-2.6.18-134.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.
In my test (with 2.6.18-128.1.5.el5) on a Nehalem whitebox, cpuinfo did not have the nonstop_tsc CPU flag I was expecting, but it *does* have a value in the "power management" field. Here is a slice of a diff:

-power management:
+power management: [8]

Here is a one-CPU segment from each kernel.

2.6.18-128.el5 (5.3):

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
stepping : 5
cpu MHz : 2933.511
cache size : 8192 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr popcnt lahf_lm
bogomips : 5872.68
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

2.6.18-128.1.5.el5 (5.3.z with this patch):

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
stepping : 5
cpu MHz : 2933.477
cache size : 8192 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr popcnt lahf_lm
bogomips : 5872.67
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: [8]

Is the [8] value for the "power management" field all I should be seeing? This segment of the patch would indicate I should also get the nonstop_tsc flag:

+	/*
+	 * c->x86_power is 8000_0007 edx. Bit 8 is TSC runs at constant rate
+	 * with P/T states and does not stop in deep C-states
+	 */
+	if (c->x86_power & (1 << 8)) {
+		set_bit(X86_FEATURE_CONSTANT_TSC, c->x86_capability);
+		set_bit(X86_FEATURE_NONSTOP_TSC, c->x86_capability);
+	}

I can report this issue on the Z bug too, if this is indeed an issue.
IIRC, this is NOT an issue. Upstream should behave the same.
To comment #37: I must have recalled it wrongly, so please *ignore* comment#37. Now I have seen your problem, and have figured out that upstream is much smarter than 2.6.18. Upstream generates x86_cap_flags; 2.6.18 maintains it manually. I will post a patch here a little bit later. Please test.

Thanks,
Luming
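The symptom reported above (capability bit set, but no name in the flags line) is what one would expect from a hand-maintained name table with a missing entry. The following is a rough user-space sketch of that mechanism, not the kernel's show_cpuinfo(): `format_flags()` and the bit positions used in the usage note are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Build a space-separated flags string from a capability bitmask and a
 * name table indexed by bit number. A set bit whose table entry is NULL
 * is silently skipped. nbits must not exceed 32.
 * Returns the number of characters written (excluding the terminator).
 */
static size_t format_flags(uint32_t caps, const char *const names[],
                           size_t nbits, char *out, size_t outlen)
{
    size_t used = 0;

    if (outlen == 0)
        return 0;
    out[0] = '\0';

    for (size_t i = 0; i < nbits; i++) {
        if (!(caps & (1u << i)) || names[i] == NULL)
            continue;

        int n = snprintf(out + used, outlen - used, "%s%s",
                         used ? " " : "", names[i]);
        if (n < 0 || (size_t)n >= outlen - used)
            break; /* output buffer full; stop appending */
        used += (size_t)n;
    }
    return used;
}
```

Assuming bits 0, 1 and 2 stand for constant_tsc, ida and nonstop_tsc, a table of `{ "constant_tsc", "ida", NULL }` prints only `constant_tsc` for a CPU with bits 0 and 2 set, while a table that also names bit 2 prints `constant_tsc nonstop_tsc`.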
OK, now I see the nonstop_tsc flag with my RHEL 5 debug kernel. Please try the attached patch and confirm that it fixes your problem. I will post it later.

processor : 15
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Genuine Intel(R) CPU @ 0000 @ 2.40GHz
stepping : 2
cpu MHz : 1596.000
cache size : 8192 KB
physical id : 1
siblings : 8
core id : 3
cpu cores : 4
apicid : 23
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr popcnt lahf_lm
bogomips : 4800.13
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: [8]

[root@tyler ~]# cat /proc/version
Linux version 2.6.18-130.el5.nm_ppc_notifydebug (root.intel.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #4 SMP Thu Mar 19 09:43:27 CST 2009
Created attachment 335791 [details] a fix add nonstop_tsc flag in x86_cap_flags ...
Created attachment 335795 [details] kernel rpm exposing nonstop_tsc Heh, looks like we've done work in parallel...
Moving back to POST to pickup the new fixes for this bz.
Created attachment 337194 [details] a lost part for i386
the patch at comment#44 has been tested and posted.
The brew info: http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1743217
<kzhang> could you please explain what the patch does? just in case of my leader inquire about.
<kzhang> because it's very near GA, and he may want to know whether it is worthy to do a respin
<Luming> ok, without patch, with 32-bit kernel, the tsc clock source is initialized *NOT a best* clocksource
<Luming> the patch add check for cpu cap flag NONSTOP_TSC
<Luming> then tsc clock source will be the *best* clock source
<kzhang> ok, thanks :)
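As a rough illustration of the exchange above: with a rating-based clocksource framework, the highest-rated source wins, so whether the TSC is "the best" depends on the rating it is given at init. The sketch below assumes that model. `struct clock_src`, `tsc_rating()`, `best_source()` and the numeric ratings are all illustrative assumptions, not values taken from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

struct clock_src {
    const char *name;
    int rating; /* higher is preferred */
};

/*
 * Hypothetical rating rule: a TSC that looks unreliable is demoted,
 * unless the CPU reports a nonstop TSC.
 */
static int tsc_rating(bool looks_unreliable, bool nonstop_tsc)
{
    if (nonstop_tsc)
        return 300;
    return looks_unreliable ? 50 : 300;
}

/* Return the highest-rated source, or NULL for an empty list. */
static const struct clock_src *best_source(const struct clock_src *srcs,
                                           size_t n)
{
    const struct clock_src *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if (best == NULL || srcs[i].rating > best->rating)
            best = &srcs[i];
    }
    return best;
}
```

Under these assumed ratings, a list of tsc (demoted), hpet (250) and acpi_pm (200) selects hpet, and the same list with the NONSTOP_TSC check applied selects tsc.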
Moving back to POST to pickup the new fixes for this bz. Hopefully this will be the _last_ time... ;-)
in kernel-2.6.18-138.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.
Verified on RHEL 5.4 alpha; it is fixed.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1243.html