Description of problem: Back in November when we started the Linux testing we encountered problems with timing running backwards - which given we use CLOCK_MONOTONIC is a concern. The initial error manifested as an assertion failure in our code that showed that values read from CLOCK_MONOTONIC indicated time had jumped backwards. Using Ingo Molnar's time-warp.c test program we discovered that time-warps were being seen in both the TOD clock and CLOCK_MONOTONIC and that it occurred for both acpi_pm and hpet time sources. The problem seemed restricted to certain machines - with Intel processors and Supermicro boards - running RHEL-RT. Analysis of the CLOCK_MONOTONIC time-warps showed an interesting thing.If we look at the before/after values of the clock when a time-warp occurs e.g.: 00013a0ff9085e8e 00013a0ef908b8d8 ^ | The problem is that the high-word of the second reading is one less than the high-word of the first reading. If you ignore that bit then the difference is the expected ~23usecs. You see the same pattern (with different values) for the results from other machines (two so far). In both cases the high-word has had one subtracted from it. TOD warps show a similar down-by-one error but in a different bit position: 000449ecf5c1cc5d 000449ecf5c0b7cf ^ | Further data was gathered and we ended up with the following summary on November 30: - The CLOCK_MONOTONIC time-warp in all cases occurs because the high 32-bits of the time value has had one subtracted from it, resulting in a difference of 4.29 seconds. - All systems that exhibited CLOCK_MONOTONIC time-warps also exhibited time-of-day time-warps. (They also exhibit TSC time-warps but that's to be expected). - All the failing machines are Supermicro systems with the same basic hardware, differing only in CPU (ie number of cores). Other hardware has not exhibited the same failures running the same OS version. - The four machines running RHEL-RT 5.1 (kernel 2.6.21) all exhibited time-warps. - The one machine running Ubuntu 7.10 (kernel 2.6.22) did not - Rebooting one of the failing machines into an earlier RHEL version (kernel 2.6.18-47.el5PAE) did not experience any CLOCK_MONOTONIC time-warps over a 24 hour period. It did have two TOD warps (2.6s and 2.8s) which are quite large. Discussions with Ingo Molnar, Thomas Gleixner and other RH engineers shed no light and it was suggested that we open a kernel bug in Bugzilla. That didn't happen then as I was trying to gather further data and strangely the time-warps stopped reproducing on the machines concerned! In mid April the problem reappeared on a machine now running 2.6.24.4-30.el5rt. We re-ran the time-warp test with the results below: new TOD-warp maximum: -12822 usecs, 00044adcb5315729 -> 00044adcb5312513 new TOD-warp maximum: -35555 usecs, 00044adcbebb1883 -> 00044adcbeba8da0 new TOD-warp maximum: -42799 usecs, 00044add22ddf841 -> 00044add22dd5112 new TOD-warp maximum: -43790 usecs, 00044add849e6613 -> 00044add849dbb05 new TOD-warp maximum: -47510 usecs, 00044ade21ea6657 -> 00044ade21e9acc1 new TOD-warp maximum: -975532 usecs, 00044ae016a6636d -> 00044ae0169780c1 new TOD-warp maximum: -996768 usecs, 00044ae50d6c1862 -> 00044ae50d5ce2c2 new TOD-warp maximum: -1002260 usecs, 00044ae8e3b50685 -> 00044ae8e3a5bb71 new CLOCK-warp maximum: -4294962827 nsecs, 00004536aee6e62b -> 00004535aee6f7a0 new CLOCK-warp maximum: -4294963385 nsecs, 000046439d2540cf -> 000046429d255016 new CLOCK-warp maximum: -4294963595 nsecs, 00004a82af6a9b76 -> 00004a81af6aa9eb | 5.61 us, TOD-warps:1465 | 5.61 us, CLOCK-warps:24 | 5.50 us, COUNTER-warps:0 Again you can see the off-by-one problem in the high order 32-bit value for the CLOCK_MONOTONIC warps. How reproducible: Inconsistently. Only reproduces on certain hardware running the RHEL RT kernel. And then the problem seem to disappear for long periods then reappear. Steps to Reproduce: Run Ingo Molnar's timewarp.c test
Could you please attach here: - A better description of the hardware were you found the failures - lspci -v - /proc/cpuinfo I have the impression the tests ran in a 32bits kernel. Is it a safe assumption?
Created attachment 306861 [details] Dump of http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh
I've attached the output of http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh which was requested by Ingo Molnar last year. The system is running the 32-bit kernel. Hardware information is as follows: These are Supermicro machines. Each of them consists of 1U Supermicro chassis [1], Supermicro motherboard [2], Intel CPU, memory, HDD, and CDROM drive all put together by a local retailer. [1] http://www.supermicro.com/products/chassis/1U/512/SC512F-260.cfm [2] http://www.supermicro.com/products/motherboard/Xeon3000/3000/PDSMi+.cfm As requested: [root@vmsqe-core2q-02 ~]# lspci -v 00:00.0 Host bridge: Intel Corporation E7230/3000/3010 Memory Controller Hub (rev c0) Subsystem: Super Micro Computer Inc Unknown device 7980 Flags: bus master, fast devsel, latency 0 Capabilities: [e0] Vendor Specific Information 00:01.0 PCI bridge: Intel Corporation E7230/3000/3010 PCI Express Root Port (rev c0) (prog-if 00 [Normal decode]) Flags: bus master, fast devsel, latency 0 Bus: primary=00, secondary=01, subordinate=01, sec-latency=0 Capabilities: [88] #0d [0000] Capabilities: [80] Power Management version 2 Capabilities: [90] Message Signalled Interrupts: 64bit- Queue=0/0 Enable- Capabilities: [a0] Express Root Port (Slot+) IRQ 0 Capabilities: [100] Virtual Channel Capabilities: [140] Unknown (5) 00:1c.0 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 1 (rev 01) (prog-if 00 [Normal decode]) Flags: bus master, fast devsel, latency 0 Bus: primary=00, secondary=09, subordinate=0a, sec-latency=0 Memory behind bridge: e8100000-e81fffff Capabilities: [40] Express Root Port (Slot+) IRQ 0 Capabilities: [80] Message Signalled Interrupts: 64bit- Queue=0/0 Enable- Capabilities: [90] #0d [0000] Capabilities: [a0] Power Management version 2 Capabilities: [100] Virtual Channel Capabilities: [180] Unknown (5) 00:1c.4 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 5 (rev 01) (prog-if 00 [Normal decode]) Flags: bus master, fast devsel, latency 0 Bus: primary=00, secondary=0d, subordinate=0d, sec-latency=0 I/O behind bridge: 00004000-00004fff Memory behind bridge: e8000000-e80fffff Capabilities: [40] Express Root Port (Slot+) IRQ 0 Capabilities: [80] Message Signalled Interrupts: 64bit- Queue=0/0 Enable- Capabilities: [90] #0d [0000] Capabilities: [a0] Power Management version 2 Capabilities: [100] Virtual Channel Capabilities: [180] Unknown (5) 00:1c.5 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 6 (rev 01) (prog-if 00 [Normal decode]) Flags: bus master, fast devsel, latency 0 Bus: primary=00, secondary=0e, subordinate=0e, sec-latency=0 I/O behind bridge: 00005000-00005fff Memory behind bridge: e8200000-e82fffff Capabilities: [40] Express Root Port (Slot+) IRQ 0 Capabilities: [80] Message Signalled Interrupts: 64bit- Queue=0/0 Enable- Capabilities: [90] #0d [0000] Capabilities: [a0] Power Management version 2 Capabilities: [100] Virtual Channel Capabilities: [180] Unknown (5) 00:1d.0 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI Controller #1 (rev 01) (prog-if 00 [UHCI]) Subsystem: Super Micro Computer Inc Unknown device 7980 Flags: bus master, medium devsel, latency 0, IRQ 18 I/O ports at 3000 [size=32] 00:1d.1 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI Controller #2 (rev 01) (prog-if 00 [UHCI]) Subsystem: Super Micro Computer Inc Unknown device 7980 Flags: bus master, medium devsel, latency 0, IRQ 19 I/O ports at 3020 [size=32] 00:1d.2 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI Controller #3 (rev 01) (prog-if 00 [UHCI]) Subsystem: Super Micro Computer Inc Unknown device 7980 Flags: bus master, medium devsel, latency 0, IRQ 20 I/O ports at 3040 [size=32] 00:1d.3 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI Controller #4 (rev 01) (prog-if 00 [UHCI]) Subsystem: Super Micro Computer Inc Unknown device 7980 Flags: bus master, medium devsel, latency 0, IRQ 16 I/O ports at 3060 [size=32] 00:1d.7 USB Controller: Intel Corporation 82801G (ICH7 Family) USB2 EHCI Controller (rev 01) (prog-if 20 [EHCI]) Subsystem: Super Micro Computer Inc Unknown device 7980 Flags: bus master, medium devsel, latency 0, IRQ 18 Memory at e8600000 (32-bit, non-prefetchable) [size=1K] Capabilities: [50] Power Management version 2 Capabilities: [58] Debug port 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev e1) (prog-if 01 [Subtractive decode]) Flags: bus master, fast devsel, latency 0 Bus: primary=00, secondary=0f, subordinate=0f, sec-latency=32 I/O behind bridge: 00006000-00006fff Memory behind bridge: e8300000-e83fffff Prefetchable memory behind bridge: 00000000e0000000-00000000e7f00000 Capabilities: [50] #0d [0000] 00:1f.0 ISA bridge: Intel Corporation 82801GB/GR (ICH7 Family) LPC Interface Bridge (rev 01) Subsystem: Super Micro Computer Inc Unknown device 7980 Flags: bus master, medium devsel, latency 0 Capabilities: [e0] Vendor Specific Information 00:1f.2 IDE interface: Intel Corporation 82801GB/GR/GH (ICH7 Family) SATA IDE Controller (rev 01) (prog-if 80 [Master]) Subsystem: Super Micro Computer Inc Unknown device 7980 Flags: bus master, 66MHz, medium devsel, latency 0, IRQ 19 I/O ports at 01f0 [size=8] I/O ports at 03f4 [size=1] I/O ports at 0170 [size=8] I/O ports at 0374 [size=1] I/O ports at 30b0 [size=16] Capabilities: [70] Power Management version 2 00:1f.3 SMBus: Intel Corporation 82801G (ICH7 Family) SMBus Controller (rev 01) Subsystem: Super Micro Computer Inc Unknown device 7980 Flags: medium devsel, IRQ 19 I/O ports at 1100 [size=32] 09:00.0 PCI bridge: Intel Corporation 6702PXH PCI Express-to-PCI Bridge A (rev 09) (prog-if 00 [Normal decode]) Flags: bus master, fast devsel, latency 0 Bus: primary=09, secondary=0a, subordinate=0a, sec-latency=64 Capabilities: [44] Express PCI/PCI-X Bridge IRQ 0 Capabilities: [5c] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable- Capabilities: [6c] Power Management version 2 Capabilities: [d8] PCI-X bridge device Capabilities: [100] Advanced Error Reporting Capabilities: [300] Power Budgeting 09:00.1 PIC: Intel Corporation 6700/6702PXH I/OxAPIC Interrupt Controller A (rev 09) (prog-if 20 [IO(X)-APIC]) Subsystem: Super Micro Computer Inc Unknown device 7980 Flags: bus master, fast devsel, latency 0 Memory at e8100000 (32-bit, non-prefetchable) [size=4K] Capabilities: [44] Express Endpoint IRQ 0 Capabilities: [6c] Power Management version 2 0d:00.0 Ethernet controller: Intel Corporation 82573E Gigabit Ethernet Controller (Copper) (rev 03) Subsystem: Super Micro Computer Inc Unknown device 108c Flags: bus master, fast devsel, latency 0, IRQ 219 Memory at e8000000 (32-bit, non-prefetchable) [size=128K] I/O ports at 4000 [size=32] Capabilities: [c8] Power Management version 2 Capabilities: [d0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable+ Capabilities: [e0] Express Endpoint IRQ 0 Capabilities: [100] Advanced Error Reporting Capabilities: [140] Device Serial Number 80-50-90-ff-ff-48-30-00 0e:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller Subsystem: Super Micro Computer Inc Unknown device 109a Flags: bus master, fast devsel, latency 0, IRQ 17 Memory at e8200000 (32-bit, non-prefetchable) [size=128K] I/O ports at 5000 [size=32] Capabilities: [c8] Power Management version 2 Capabilities: [d0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable- Capabilities: [e0] Express Endpoint IRQ 0 Capabilities: [100] Advanced Error Reporting Capabilities: [140] Device Serial Number 81-50-90-ff-ff-48-30-00 0f:00.0 VGA compatible controller: ATI Technologies Inc ES1000 (rev 02) (prog-if 00 [VGA]) Subsystem: Super Micro Computer Inc Unknown device 7980 Flags: bus master, stepping, fast Back2Back, medium devsel, latency 66, IRQ 11 Memory at e0000000 (32-bit, prefetchable) [size=128M] I/O ports at 6000 [size=256] Memory at e8300000 (32-bit, non-prefetchable) [size=64K] [virtual] Expansion ROM at e8320000 [disabled] [size=128K] Capabilities: [50] Power Management version 2 [root@vmsqe-core2q-02 ~]# cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Core(TM)2 Quad CPU @ 2.40GHz stepping : 7 cpu MHz : 2394.041 cache size : 4096 KB physical id : 0 siblings : 4 core id : 0 cpu cores : 4 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm bogomips : 4790.70 clflush size : 64 processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Core(TM)2 Quad CPU @ 2.40GHz stepping : 7 cpu MHz : 2394.041 cache size : 4096 KB physical id : 0 siblings : 4 core id : 1 cpu cores : 4 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm bogomips : 4785.49 clflush size : 64 processor : 2 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Core(TM)2 Quad CPU @ 2.40GHz stepping : 7 cpu MHz : 2394.041 cache size : 4096 KB physical id : 0 siblings : 4 core id : 2 cpu cores : 4 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm bogomips : 4788.01 clflush size : 64 processor : 3 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Core(TM)2 Quad CPU @ 2.40GHz stepping : 7 cpu MHz : 2394.041 cache size : 4096 KB physical id : 0 siblings : 4 core id : 3 cpu cores : 4 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm bogomips : 4788.02 clflush size : 64
We have tested two Supermicro boards with Intel processors but had no success on reproducing the bug. Did you have any success reproducing this bug on other similar machines?
As per the original description: - All the failing machines are Supermicro systems with the same basic hardware, differing only in CPU (ie number of cores). Other hardware has not exhibited the same failures running the same OS version. - The four machines running RHEL-RT 5.1 (kernel 2.6.21) all exhibited time-warps. - The one machine running Ubuntu 7.10 (kernel 2.6.22) did not - Rebooting one of the failing machines into an earlier RHEL version (kernel 2.6.18-47.el5PAE) did not experience any CLOCK_MONOTONIC time-warps over a 24 hour period. But even on the failing machines this bug can disappear for long periods. It's almost as if the value of the time affects whether or not the bug appears.
A few questions: - Have you been able to reproduce this issue with MRG GA kernel, 2.6.24.7-65.el5rt? - Have you been able to reproduce this issue in any other machines (except the two ones cited in the report)? - Are both failing machines quad-core? - Would that be possible running the cfs-debug-info.sh script against 2.6.24.7-65.el5rt? - Are you using NTP? Have you niticed anything different on the logs? This issue is taking long to be solved but I really haven't been able to reproduce the issue. If the answer to the questions above doesn't give us any clues, I will instrument a kernel to check on clocksources, ntp and GTOD events. Or prepare a script to run with the kernel-rt-trace kernel.
- None of the machines have been tested with the latest kernel as yet (most recent was 2.6.24.4-30.el5rt) - This was reproduced on 4 machines running RH, and failed to reproduce on a similar machine running Ubuntu - as per original description. - Failing machines have been both dual-core and quad-core - Will need to get access to machines to gather requested info, once they have been updated to latest kernel. - Need to check ntp status once I can access the machines again. No info on anything in logs.
I reproduced a time-warp with 2.6.24.7-65: vmsqe-core2q-02 > ./time-warp-test-rh-notsc 4 CPUs, running 4 parallel test-tasks. checking for time-warps via: - gettimeofday (TOD) syscall (usec resolution) - clock_gettime(CLOCK_MONOTONIC) syscall (nsec resolution) - dummy monotonic counter for testing new TOD-warp maximum: -5842 usecs, 0004515713524496 -> 0004515713522dc4 new TOD-warp maximum: -30834 usecs, 0004515715b50034 -> 0004515715b487c2 new TOD-warp maximum: -75572 usecs, 000451574c8bc581 -> 000451574c8a9e4d new TOD-warp maximum: -119428 usecs, 000451579d8d2210 -> 000451579d8b4f8c new TOD-warp maximum: -142421 usecs, 000451595902dfe9 -> 000451595900b394 new TOD-warp maximum: -868400 usecs, 0004516d93818c3f -> 0004516d93744c0f new TOD-warp maximum: -1004857 usecs, 0004517239390eba -> 000451723929b981 new CLOCK-warp maximum: -4294955983 nsecs, 000278fdfa570f39 -> 000278fcfa573b6a | 5.31 us, TOD-warps:3251 | 5.31 us, CLOCK-warps:1 | 5.23 us, COUNTER-warps:0 \ This is a quad-core machine. The warp happened somewhere between 24 and 36 hours of testing. ntpd is not running. Output from cfs-debug-info.sh will be attached.
Created attachment 311211 [details] cfs-debug-info output
Thanks David. I will walk through the data and try to find a similar quad-core machine here to reproduce the issue.
I have been running the test for 8+ days with no single error message in a Supermicro MB with an Intel processor. This is just a FYI.
Which processor exactly? We've seen issues with the Core2 systems but not the Xeon "Woodcrest". Thanks.
There has been no recent activity on this issue. Is this still a problem?
I'm running the time-warp test on one of the original machines to see if this still reproduces, but as noted in the previous comments this problem sometimes disappears for long periods of time. We do know that even with Solaris we sometimes get strange timing issues on these kinds of machines.
I haven't seen a time warp in quite some time...
This issue has not been updated in a while and is using an older, unsupported kernel. This BZ is being closed WONTFIX. If you believe this is still an issue on our most recent MRG-2.5 3.10 Realtime kernel, please file a new issue for further investigation.