Bug 700886
Summary: | RHEL5.6 TSC used as default clock source on multi-chassis system | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | IBM Bug Proxy <bugproxy> |
Component: | kernel | Assignee: | Prarit Bhargava <prarit> |
Status: | CLOSED ERRATA | QA Contact: | Zhang Kexin <kzhang> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 5.6 | CC: | balkov, dZhu, eguan, jfeeney, jkachuck, kzhang, peterm, rdassen, rprice, washer |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | All | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
With this update, IBM System x3850 X5 is now properly identified as a multi-chassis system by querying the system name and checking for multiple Chassis entries in the SMBIOS table. If multiple Chassis entries are found, the TSC is marked as unsynchronized. A side effect of this solution is that the kernel attempts to synchronize the TSC on every CPU during system boot, which causes a small delay and an error message to be displayed. For other multi-chassis systems, the "notsc" boot parameter can be used to disable the TSC.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2012-02-21 03:47:04 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 684940, 690969, 726799, 758797 | ||
Attachments: |
Description
IBM Bug Proxy
2011-04-29 18:20:45 UTC
------- Comment From masbock.com 2011-04-29 19:48 EDT-------
The problem was introduced with the linux-2.6-x86_64-unify-apic-mapping-code.patch. The patch makes the assumption that systems setting the FORCE_APIC_PHYSICAL_DESTINATION_MODE bit in the FADT cannot be multi-chassis. This assumption does not hold.

------- Comment From masbock.com 2011-05-06 13:31 EDT-------
This is a regression from RHEL5.5. Serious time skew problems due to this bug have been observed by a customer. Therefore I raise the severity to ship issue.

------- Comment From masbock.com 2011-05-09 13:35 EDT-------
I was wrong when I said this was a regression. In fact, in RHEL5.5 the TSC is also selected as a clock source on the x3850 M2 dual-node system. None of the multi-node tests (designed to mark the TSC as unstable) work for this system.

I've removed the Regression flag from this BZ.

Max -- from your private email you said that RHEL6 correctly chooses the HPET. I do know that the order of clocksource was changed between RHEL5 and RHEL6. It is entirely possible that is why RHEL6 works.

I'll take a closer look at the code and specifically the decisions made based on the FADT table information.

P.

(In reply to comment #6)
> I've removed the Regression flag from this BZ.
>
> Max -- from your private email you said that RHEL6 correctly chooses the HPET.
> I do know that the order of clocksource was changed between RHEL5 and RHEL6.
> It is entirely possible that is why RHEL6 works.
>
> I'll take a closer look at the code and specifically the decisions made based
> on the FADT table information.
>
> P.

I don't see anything in the timer code that accesses the FADT. Max, could you send me a dmesg output from a "good" boot and a "bad" boot? I'd like to take a look ...

P.

------- Comment From masbock.com 2011-05-16 12:43 EDT-------
(In reply to comment #11)
> (In reply to comment #6)
> > I've removed the Regression flag from this BZ.
> >
> > Max -- from your private email you said that RHEL6 correctly chooses the HPET.
> > I do know that the order of clocksource was changed between RHEL5 and RHEL6.
> > It is entirely possible that is why RHEL6 works.
> >
> > I'll take a closer look at the code and specifically the decisions made based
> > on the FADT table information.
> >
> > P.
> I don't see anything in the timer code that accesses the FADT. Max, could you
> send me a dmesg output from a "good" boot and a "bad" boot? I'd like to take a
> look ...

Hi Prarit,

the fact that RHEL6 picks the HPET on the dual-node x3850 M2 is because there is a "time warp" check for TSCs on different CPUs. This check discovers a warp between the TSC on CPU 0 (chassis 1) and CPU 16 (chassis 2) and as a consequence removes the TSC from the list of available clock sources.

RHEL5: My earlier comment that the new code in apic_is_clustered_box() broke the kernel's ability to detect the x3850 M2 multi-node system is incorrect. These multi-chassis systems were never detected as such by RHEL5 (the dmi_check_multi() function applies to older IBM systems).

The fact remains that neither RHEL5 nor RHEL6 categorizes this dual-node system as a multi-chassis box.

> Hi Prarit,
>
> the fact that RHEL6 picks the HPET on the dual-node x3850 M2 is because there
> is a "time warp" check for TSCs on different CPUs. This check discovers a warp
> between the TSC on CPU 0 (chassis 1) and CPU 16 (chassis 2) and as a
> consequence removes the TSC from the list of available clock sources.

Ah, I see. So it's purely by accident that RHEL6 does the right thing.

> RHEL5: My earlier comment that the new code in apic_is_clustered_box() broke
> the kernel's ability to detect x3850 M2 multi-node system is incorrect. These
> multi-chassis systems were never detected as such by RHEL5. (the
> dmi_check_multi() function applies to older IBM systems).
>
> The fact remains that neither RHEL5 nor RHEL6 categorizes this dual-node system
> as a multi-chassis box.

Hmm ... do you know if the system is correctly identified as multi-chassis upstream? If that's broken then we should attempt to fix this problem there and move the code back into RHEL5 and RHEL6.

I wonder if the chassis type in the SMBIOS (Type3, "Type" field which should be 0x19) is correct on your system? Can you do a 'dmidecode -t 3' and put the output in this BZ?

Thanks,

P.

------- Comment From masbock.com 2011-05-20 15:53 EDT-------
(In reply to comment #13)
> I wonder if the chassis type in the SMBIOS (Type3, "Type" field which should be
> 0x19) is correct on your system?
> Can you do a 'dmidecode -t 3' and put the output in this BZ?

    # dmidecode -t 3
    # dmidecode 2.11
    SMBIOS 2.4 present.

    Handle 0x003A, DMI type 3, 13 bytes
    Chassis Information
        Manufacturer: IBM
        Type: Main Server Chassis
        Lock: Not Present
        Version: Not Specified
        Serial Number: Not Specified
        Asset Tag:
        Boot-up State: Safe
        Power Supply State: Unknown
        Thermal State: Unknown
        Security Status: Unknown

    Handle 0x003B, DMI type 3, 13 bytes
    Chassis Information
        Manufacturer: IBM
        Type: Main Server Chassis
        Lock: Not Present
        Version: Not Specified
        Serial Number: Not Specified
        Asset Tag:
        Boot-up State: Safe
        Power Supply State: Unknown
        Thermal State: Unknown
        Security Status: Unknown

Okay, that seems correct (and what I wrote about earlier with 0x19 was actually incorrect). Each chassis has its own Chassis structure, and you have two chassis, therefore two Chassis entries in the SMBIOS structs.

I think I can code around this scenario -- can you test out a kernel patch for me?

Thanks,

P.

------- Comment From lcm.com 2011-05-23 15:23 EDT-------
(In reply to comment #13)
> > Hi Prarit,
> > the fact that RHEL6 picks the HPET on the dual-node x3850 M2 is because there
> > is a "time warp" check for TSCs on different CPUs. This check discovers a warp
> > between the TSC on CPU 0 (chassis 1) and CPU 16 (chassis 2) and as a
> > consequence removes the TSC from the list of available clock sources.
> Ah, I see. So it's purely by accident that RHEL6 does the right thing.

Not necessarily by accident. In my opinion, RHEL6 (and current mainline) are using the most accurate mechanism for determining whether the CPU TSCs can be used as a global time source: checking for TSC time warp across sockets/buses/interconnects, after calling the unsynchronized_tsc() fast path. Multi-chassis platforms typically don't have synchronized TSCs across chassis boundaries. However, it would be perfectly reasonable to assume that a platform could implement logic that would keep the TSCs synchronous even across physical nodes. So, a generic 'if multi-chassis' check may not always apply.

I think the appropriate way to fix this for RHEL5 would be to incorporate the check tsc sync code from mainline (probably too invasive?) or include multi_dmi_table[] entries for the other affected multi-node servers. The caveat with simply including multi_dmi_table[] entries is that single node and multi node servers will have the same DMI information. So, an additional change that checks the number of nodes (chassis) or number of CPUs in the platform would also be required.

------- Comment From masbock.com 2011-05-23 18:49 EDT-------
For reference, the TSC time warp check went into 2.6.19 and is described in this article: http://lwn.net/Articles/211051/

On the dual-node x3850 M2 with RHEL6 it is this code that detects a time warp between TSCs on different nodes.

>I think the appropriate way to fix this for RHEL5 would be to incorporate the
>check tsc sync code from mainline (probably too invasive?)

I spent my weekend reviewing the TSC Warp code and I agree that it is too invasive for this stage in RHEL5.

>or include
>multi_dmi_table[] entries for the other affected multi-node servers. The caveat
>with simply including multi_dmi_table[] entries is that single node and multi
>node servers will have the same DMI information.

Well ... maybe we could figure something out for your specific system. What we do know is that there are TWO (or more) SMBIOS Type 3 Chassis structures.

So maybe something like:

    if ((vendor == IBM && model == x3850) && (num_chassis() > 1))
        notsc = true;

Of course, I would use the standard dmi code in the kernel for this ...

I realize this doesn't scale well, but I don't think the TSC Warp code would get into RHEL5.

P.

------- Comment From lcm.com 2011-05-23 19:51 EDT-------
(In reply to comment #19)
> >or include
> >multi_dmi_table[] entries for the other affected multi-node servers. The caveat
> >with simply including multi_dmi_table[] entries is that single node and multi
> >node servers will have the same DMI information.
>
> Well ... maybe we could figure something out for your specific system. What we
> do know is that there are TWO (or more) SMBIOS Type 3 Chassis structures.
>
> So maybe something like:
>
> if ((vendor == IBM && model == x3850) && (num_chassis() > 1))
>     notsc = true;
>
> Of course, I would use the standard dmi code in the kernel for this ...
>
> I realize this doesn't scale well, but I don't think the TSC Warp code would
> get into RHEL5.

This works for me. Max and I can get you the appropriate DMI data for the pertinent servers. We have already had a couple of customer escalations relative to this issue, and while there's a boot option workaround, it would be terrific if things just worked as expected out of the box.

Thanks!

Created attachment 500609 [details]
RHEL5 initial patch
lcm (sorry, I didn't catch your full name) and Max,
Can you please modify this patch with your DMI entries and test? This patch will count the number of type 3 structures, and cause unsynchronized_tsc() to return 1.
Thanks,
P.
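To make the chassis-counting idea concrete, here is a minimal user-space sketch in C. It is not the attached kernel patch, which uses the kernel's own DMI helpers. The sketch walks a raw SMBIOS structure table held in memory and counts Type 3 (chassis) entries. The function name `count_chassis`, the two constants, and the buffer-based interface are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SMBIOS_TYPE_CHASSIS 3    /* "System Enclosure or Chassis" */
#define SMBIOS_TYPE_END     127  /* end-of-table marker */

/*
 * Hypothetical helper: count Type 3 structures in a raw SMBIOS
 * structure table. Each structure starts with a 4-byte header
 * (type, length, 16-bit handle); `length` covers the formatted area
 * including the header. The formatted area is followed by a string
 * set that ends with two consecutive NUL bytes.
 */
int count_chassis(const uint8_t *table, size_t len)
{
    size_t off = 0;
    int chassis = 0;

    assert(table != NULL);

    while (off + 4 <= len) {
        uint8_t type = table[off];
        uint8_t slen = table[off + 1];

        if (slen < 4 || off + slen > len)
            break;                      /* malformed entry, stop walking */
        if (type == SMBIOS_TYPE_CHASSIS)
            chassis++;
        if (type == SMBIOS_TYPE_END)
            break;

        /* Skip the formatted area, then the strings up to the double NUL. */
        off += slen;
        while (off + 1 < len && (table[off] != 0 || table[off + 1] != 0))
            off++;
        off += 2;
    }
    return chassis;
}
```

A caller that gets a result greater than 1 from a table like the two-handle dmidecode output above would treat the machine as multi-chassis; whether that alone justifies distrusting the TSC is the policy question discussed in the thread.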
------- Comment From masbock.com 2011-05-24 14:20 EDT-------
(In reply to comment #21)
> Created an attachment (id=61710) [details]
> RHEL5 initial patch
>
> ------- Comment on attachment From prarit 2011-05-24 09:28:43 EDT-------
>
> lcm (sorry, I didn't catch your full name) and Max,
>
> Can you please modify this patch with your DMI entries and test? This patch
> will count the number of type 3 structures, and cause unsynchronized_tsc() to
> return 1.

I am collecting the DMI information and will test the patch.

Created attachment 500886 [details]
Preliminary updated patch to detect IBM multi-chassis systems
------- Comment on attachment From masbock.com 2011-05-25 13:22 EDT-------
Prarit, I updated your patch with DMI information that will match some of the systems for which we need to detect multi-chassis. I have tested the patch on an IBM x3850 M2 multi-chassis system. It correctly detects the multiple chassis and selects the HPET as clock source. I will have to do more extensive testing, including the case of the single-chassis system of the same type.
------- Comment From masbock.com 2011-05-25 18:48 EDT-------
The patch has a side effect. Due to the following code:

    static void __cpuinit tsc_sync_wait(void)
    {
        /*
         * When the CPU has synchronized TSCs assume the BIOS
         * or the hardware already synced. Otherwise we could
         * mess up a possible perfect synchronization with a
         * not-quite-perfect algorithm.
         */
        if (notscsync || !cpu_has_tsc || !unsynchronized_tsc())
            return;
        sync_tsc(0);
    }

sync_tsc is now called on each CPU because unsynchronized_tsc() returns 1. This does no harm in this case, but it is unnecessary and noisy (and perhaps confusing: why sync the TSCs if we don't use them). Here is boot time dmesg output from one of the CPUs:

    Booting processor 6/32 APIC 0x11
    Initializing CPU#6
    Calibrating delay using timer specific routine.. 5863.06 BogoMIPS (lpj=2931531)
    CPU: L1 I cache: 32K, L1 D cache: 32K
    CPU: L2 cache: 4096K
    CPU 6/11 -> Node 0
    CPU: Physical Processor ID: 4
    CPU: Processor Core ID: 1
    CPU6: Thermal monitoring enabled (TM1)
    Intel(R) Xeon(R) CPU X7350 @ 2.93GHz stepping 0b
    APIC: IBM x3850 Multi Chassis detected
    CPU 6: Syncing TSC to CPU 0.
    CPU 6: synchronized TSC with CPU 0 (last diff -407 cycles, maxerr 4488 cycles)
    SMP alternatives: switching to SMP code

(In reply to comment #21)
> ------- Comment From masbock.com 2011-05-25 18:48 EDT-------
> The patch has a side effect. Due to the following code:
> static void __cpuinit tsc_sync_wait(void)
> {
>     /*
>      * When the CPU has synchronized TSCs assume the BIOS
>      * or the hardware already synced. Otherwise we could
>      * mess up a possible perfect synchronization with a
>      * not-quite-perfect algorithm.
>      */
>     if (notscsync || !cpu_has_tsc || !unsynchronized_tsc())
>         return;
>     sync_tsc(0);
> }
..

Working around that may be difficult. My vote is that we just live with it. We know we're going to reject the TSC anyway and, like you said, it is harmless and just spits out a bit of extra (ignorable) info into dmesg.

P.

Created attachment 501325 [details]
RHEL5 v2
Max, does this patch work for you? It's a bit cleaner than the first patch...
P.
------- Comment From masbock.com 2011-05-27 13:31 EDT-------
(In reply to comment #26)
> Created an attachment (id=61802) [details]
> RHEL5 v2
>
> Max, does this patch work for you? It's a bit cleaner than the first patch...

Prarit, unfortunately this patch doesn't work. unsynchronized_tsc() is called from every secondary CPU. num_chassis gets incremented every time unsynchronized_tsc() is invoked. num_chassis ends up being (NUM_CPUS * NUM_CHASSIS).

Perhaps something like this would work:

    __cpuinit int unsynchronized_tsc(void)
    {
    #ifdef CONFIG_SMP
    +   /*
    +    * RHEL5: Upstream the TSC Warp code should catch multi-chassis
    +    * systems. The code is too invasive for RHEL5. Doing this
    +    * check here is safe ...
    +    */
    +   if (dmi_check_system(multi_dmi_table)) {
    +       if (num_chassis) /* only walk once */
    +           dmi_walk(check_multi_chassis);
    +       if (num_chassis > 1)
    +           return 1;
    +   }
    +

But this only works if we are sure unsynchronized_tsc is called sequentially on all CPUs.

- Max

------- Comment From masbock.com 2011-05-27 13:46 EDT-------
(In reply to comment #27)
Correction to my previous comment: it should really be "if (!num_chassis) /* walk only once */"

> Perhaps something like this would work:
>
> __cpuinit int unsynchronized_tsc(void)
> {
> #ifdef CONFIG_SMP
> +   /*
> +    * RHEL5: Upstream the TSC Warp code should catch multi-chassis
> +    * systems. The code is too invasive for RHEL5. Doing this
> +    * check here is safe ...
> +    */
> +   if (dmi_check_system(multi_dmi_table)) {
+       if (!num_chassis) /* walk only once */ <-- was wrong before
> +           dmi_walk(check_multi_chassis);
> +       if (num_chassis > 1)
> +           return 1;
> +   }
> +

Created attachment 501377 [details]
Patch to detect IBM multi-chassis, modified version of Prarit's earlier patch
------- Comment on attachment From masbock.com 2011-05-27 15:43 EDT-------
Updated version of Prarit's last patch. I modified it so that chassis are counted only once, based on my earlier comments.
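The counting problem discussed above can be reproduced outside the kernel. The following is a small C sketch under stated assumptions: `walk_chassis()` is a hypothetical stand-in for `dmi_walk(check_multi_chassis)` that adds one to a global counter per chassis entry, and the two `unsync_*` functions model only the counting logic of `unsynchronized_tsc()`, not the real function.

```c
#include <assert.h>

/* Global chassis count, as in the patches discussed in this report. */
static int num_chassis;

/* Hypothetical stand-in for dmi_walk(check_multi_chassis). */
static void walk_chassis(int entries)
{
    assert(entries >= 0);
    num_chassis += entries;
}

/* Test helper: forget any previous walk. */
void reset_chassis_count(void)
{
    num_chassis = 0;
}

/* Walks on every call, so the count grows with each invocation. */
int unsync_unguarded(int entries)
{
    walk_chassis(entries);
    return num_chassis > 1;
}

/* Walks only while the count is still zero, so repeated calls agree. */
int unsync_guarded(int entries)
{
    if (!num_chassis)
        walk_chassis(entries);
    return num_chassis > 1;
}
```

With one chassis entry, the second call to `unsync_unguarded(1)` already reports the TSC as unsynchronized, while `unsync_guarded(1)` keeps returning 0 however often it is called. The sketch ignores concurrency; like the comment above, it assumes the calls happen one after another.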
Created attachment 501833 [details]
RHEL5 v3
Oops -- good point Max :) How 'bout this then? This way we only actually run the chassis check once.
P.
------- Comment From masbock.com 2011-06-01 19:54 EDT-------
(In reply to comment #30)
> Created an attachment (id=61826) [details]
> RHEL5 v3
>
> ------- Comment on attachment From prarit 2011-05-30 11:22:35 EDT-------
>
> Oops -- good point Max :) How 'bout this then? This way we only actually run
> the chassis check once.
>
> P.

    __cpuinit int unsynchronized_tsc(void)
    {
    #ifdef CONFIG_SMP
    +   /*
    +    * RHEL5: Upstream the TSC Warp code should catch multi-chassis
    +    * systems. The code is too invasive for RHEL5. Doing this
    +    * check here is safe ...
    +    */
    +   if (num_chassis > 1)
    +       return 1;
    +
    +   if (dmi_check_system(multi_dmi_table)) {
    +       dmi_walk(check_multi_chassis);
    +       if (num_chassis > 1)
    +           return 1;
    +   }
    +

This doesn't work either because unsynchronized_tsc is called NR_CPUS times (on every non-boot CPU and in time_init()). On a single-chassis system the first invocation sets num_chassis to 1. The second invocation gets past the num_chassis > 1 test and sets num_chassis to 2. The third invocation deems the TSC unsynchronized.

The last patch I attached does the following:

    __cpuinit int unsynchronized_tsc(void)
    {
    #ifdef CONFIG_SMP
    +   /*
    +    * RHEL5: Upstream the TSC Warp code should catch multi-chassis
    +    * systems. The code is too invasive for RHEL5. Doing this
    +    * check here is safe ...
    +    */
    +   if (dmi_check_system(multi_dmi_table)) {
    +       if (!num_chassis) /* walk dmi only once to count chassis */
    +           dmi_walk(check_multi_chassis);
    +       if (num_chassis > 1)
    +           return 1;
    +   }
    +

This seems to work. Tested on single and multi-chassis systems.

- Max

masbock,

I'm submitting https://bugzilla.redhat.com/attachment.cgi?id=501377 for internal review this AM. FYI ;)

P.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Patch(es) available in kernel-2.6.18-282.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.

Hi IBM,

1) How can I reproduce it, and how can I determine that the problem has been reproduced? What does the clock skew look like?
2) Does this problem exist on single-chassis systems?
3) Does the X3950 M2 have the same problem?

I did the following steps on an X3950 M2 (ibm-x3950m2-1.gsslab.rdu.redhat.com). Is this right?

    # uname -r
    2.6.18-275.el5
    # dmesg
    time.c: Using 266.538728 MHz WALL PIT GTOD PIT/TSC timer.

The system uses the TSC.

----------------------------------------------------------------------------------

    # uname -r
    2.6.18-300.el5
    # dmesg
    time.c: Using 266.538728 MHz WALL PIT GTOD PIT/HPET timer.
    Calibrating delay using timer specific routine.. 5330.02 BogoMIPS (lpj=2665013)
    CPU: L1 I cache: 32K, L1 D cache: 32K
    CPU: L2 cache: 3072K
    CPU: L3 cache: 16384K
    CPU 1/0 -> Node 0
    CPU: Physical Processor ID: 0
    CPU: Processor Core ID: 0
    Genuine Intel(R) CPU @ 2.66GHz stepping 01
    CPU 1: Syncing TSC to CPU 0.
    CPU 1: synchronized TSC with CPU 0 (last diff -2040 cycles, maxerr 2550 cycles)

The system uses the HPET.

Hi Washer, could you please have a look at comment #38? Thanks.

Created attachment 542450 [details]
gettimeofday on 4 cpus which locates in 4 nodes individually
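The verification above compares the boot-time `time.c: Using ...` line between kernels. As an illustration only, the C sketch below classifies such a line by whether the GTOD source names the TSC. The function name `gtod_uses_tsc` and its return convention are assumptions, and the parsing relies solely on the line format quoted in this report.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: inspect a RHEL5 boot line of the form
 *   "time.c: Using <freq> MHz WALL <src> GTOD <src> timer."
 * Returns 1 if the GTOD source mentions the TSC, 0 if it does not,
 * and -1 if the line does not have the expected shape.
 */
int gtod_uses_tsc(const char *line)
{
    const char *gtod;
    const char *end;
    const char *p;

    assert(line != NULL);

    if (strstr(line, "time.c: Using") == NULL)
        return -1;
    gtod = strstr(line, " GTOD ");
    if (gtod == NULL)
        return -1;
    gtod += strlen(" GTOD ");
    end = strstr(gtod, " timer");
    if (end == NULL)
        return -1;

    /* Search for "TSC" only inside the GTOD source token. */
    for (p = gtod; p + 3 <= end; p++) {
        if (strncmp(p, "TSC", 3) == 0)
            return 1;
    }
    return 0;
}
```

Fed the two lines quoted above, the helper distinguishes the 2.6.18-275.el5 boot (PIT/TSC) from the 2.6.18-300.el5 boot (PIT/HPET). It only reports which source the kernel chose; it does not measure skew.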
The original problem was reproduced by a simple program calling sleep and observing the actual sleep times, with one such process bound to each processor. Much like the suggestion above.

(In reply to comment #46)
> The original problem was reproduced by a simple program calling sleep and
> observing the actual sleep times. One such process bound to each processor.
> Much like the suggestion above.

Hi James, could you please upload the reproducer? We are not sure how to reproduce it exactly. Thanks!

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
With this update, IBM System x3850 X5 is now properly identified as a multi-chassis system by querying the system name and checking for multiple Chassis entries in the SMBIOS table. If multiple Chassis entries are found, the TSC is marked as unsynchronized. A side effect of this solution is that the kernel attempts to synchronize the TSC on every CPU during system boot, which causes a small delay and an error message to be displayed. For other multi-chassis systems, the "notsc" boot parameter can be used to disable the TSC.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0150.html |