Bug 203818
Description
andrew m. boardman
2006-08-23 20:33:36 UTC
I am seeing the same thing on an IBM NetVista p/n 8311 (2.4Ghz P4). I do not know if it is related, but the other thing I noticed as different between the 2.6.9-22.EL kernel that works and the 2.6.9-42.EL and 2.6.9-42.0.2.EL that "stops", is the IRQ routing done by ACPI (is this a regression caused by bugzilla 184254, perhaps?). The old kernel was assigning shared IRQ's of 11,10,9,5, while the new (42 and 42.0.2) kernel is assigning IRQ's in the high 100's low 200's. I noticed one line difference that might pertain from /var/log/messages just before the listing of IRQs assigned: Old, working -> PCI: IRQ9 SCI: Level Trigger. New, stopping -> ENABLING IO-APIC IRQs Again, I don't know if the IRQ Routing is related to the real problem of the system clock, but noticed the "difference" when investigating.... I too am seeing this behaviour on an IBM ThinkCentre running RHEL4. Is this related to Fedora Core bug 127411 and bug 152170? Specifically this is seen with both 2.6.9-5.EL and 2.6.9-42.0.2.EL kernals on an IBM ThinkCentre. cat /proc/cpuinfo gives model name : Intel(R) Pentium(R) 4 CPU 2.80GHz The issue also occurs with Fedora Core 5 kernels 2.6.15-1.2054_FC5smp and 2.6.17-1.2187_FC5. what chipset is in the affected machines? i.e. can we get an 'lspci'? also, what timesource is being used? HPET, PMTImer, TSC/PIT? Created attachment 138273 [details]
lspci -vvv output from an affected machine
Requested output attached.
I'm not sure how to answer the "what timesource is being used" question. The
problem manifests in what's reported by the kernel's timekeeping as reported by
gettimeofday(). (Via, say, running "date" repeatedly.) As to whether the
kernel's using HPET, the PM timer, some combination thereof, or something else
entirely, I'm not sure how to check; no "clock=..." boot paramater was given,
though, so it should be running with default behaviour. (The hardware clock I
reference that's actually working properly is the RTC.)
Created attachment 139344 [details]
lspci -vvv from another affected machine
This is lspci -vvv from my affected machine.
Is the status of this bug correct - as it still has NEEDINFO from amb - even though this has been provided? Is cpuspeed scaling being used? If so, does the problem still occur with it disabled.. i.e. if the CPU runs at a fixed frequency? Also, can someone attach sysreports for both a working and broken kernel? Created attachment 146867 [details]
sysreport of system WHILE the clock is running backwards
We managed to get a sysreport while the clock was running backwards.
Created attachment 146868 [details]
sysreport of same hardware same bad kernel before failure
This sysreport is taken of the same hardware and the same kernel as 11850,
but before the clock started running backwards. We presume this sysreport
will be identical to 11850.
Created attachment 146870 [details]
sysreport of same hardware but with kernel that works fine.
This is the sysreport of the same hardware, but running the older kernel
that does not exhibit the "clock runs backwards" failure. We suspect you
will find this identical to the other two sysreports excepting for the
kernel version, and that pretty much should tell you exactly what the hardware
is, the configuration, and confirm that the kernel is at fault.
Bad user interface! No biscuit! How am I suspposed to add in my comments AND say I'm providing the requested information, when the "create attachment" link below VERY CAREFULLY DESTROYS my COMMNET?? Anyway, here is your sysreport output! It would have helped to know that sysreport was a command that existed, but that we'd need to install the sysreport RPM before we saw it. We never heard of it before. Looks like your using PMtimer with the bad/broken sysreport and TSC in the good/working sysreport. Can you try booting with "clock=tsc" and see if it makes a difference? The Intel ICH5 chipset has a few known timer issues. Also, you might try disabling NTP and see if that make a difference. Created attachment 147416 [details]
Sysreport with "clock=tsc"
I've just attached a sysreport from the test system running the unhappy kernel, and with, we believe, the setting clock=tsc. I've included the sysreport itself in the hopes that you can use it to determine that we correctly set clock=tsc, because it made NO difference. i.e. We logged onto this host and generated the sysreport with the clock running backwards. I GUESS we could try disabling NTP, but I'd like to know why there is any reason to suspect that NTP is the cause, when it's the same NTP but a DIFFERENT kernel? The use of "clock=tsc" was correct... the system booted using TSC as the timesource. The reason NTP is suspect is due to some timekeeping code changes that relate to NTP. Also, if cpuspeed scaling (i.e. speedstep-centrino) is being used, try disabling this and see if it makes a difference. We dont seem to have this piece of hardware in-house, but I will look at the kernel changes between .22 and .42 and see if anything looks suspect. I may need some help testing in order to isolate which patch caused this regression. The chipset in this system is known to be problematic, but not usually as broken as in this case. FYI: We were advised by someone to try booting with ACPI and APIC turned off, to help narrow down whether or not the advanced power management functionality in the kernel is causing the variations in clock ticks. We followed these steps: 1) Reboot 2) Catch grub and prevent it from auto booting 3) Highlight the newest kernel and hit 'e' 4) Highlight the kernel line and hit 'e' 5) At the end of the line add " noapic noacpi apic=off acpi=off" 6) Hit enter, then 'b' to boot And the result was a system whose clock PROPERLY kept time again. FYI: I've taken some time to learn about the different clock algorithms mentioned (HPET, PMTImer, TSC/ PIT). Apparently PMTimer is what you're supposed to use to compensate for stuff happening with ACPI, and indeed if one runs TSC with ACPI enabled one should expect to see incorrect time. We're gonna see what happens with HPET and PIT selected as the clock source, but I think we'll see the clock still run backwards because the root cause of this problem is that the code in the PMTimer routine no longer properly does the job of properly compensating for action by ACPI. Since the older kernel was running TSC by default, perhaps the more accurate characterization of the situation is: With the newer kernel, ACPI is enabled by default and the clock algorithm is correctly set to PMTimer to deal with the effects of ACPI. Alas, PMTimer is not properly dealing, and the clock behaves the SAME as if TSC is running with no intervention to deal with the ACPI. Can you narrow down which boot arg was the one that made a difference in Comment #18. It would help to narrow down which piece of the kernel code is failing. From my reading, I learned that ACPI code is generally expected to be flaky and harder to get right than the APIC code. Figuring, "Ok, let's do a quick sanity check on the code we expect to be good, let's turn off ACPI and let the good APIC code run and make sure that's not the source of the problem." Guess what? We booted with arguments noacpi acpi=off and the clock RAN BACKWARDS! So what we expected to NOT be a source of problems IS a source. We are now testing with arguments noapci apci=off in the hope that the clock will run forwards, and that we've isolated the fault into the unexpected quadrant of APIC support. I think you will find the following VERY interesting. Yesterday we booted with noapic apic=off Today the clock is FINE! I believe this means we may have isolated the fault to interaction with the APIC code. Would you please confirm for me that experimentation with other clock algorithms is probably not appropriate now? Would you also suggest another step we can take to further help isolate the fault. The root cause is almost certainly in the APIC code. The clock code is generic and all systems would have exhibited this same problem (and this certainly would have been fixed long ago). So can we do a little APIC debugging here? First "apic=" is only used for debugging and likely isnt needed. Can you just boot with the "noapic" flag (and no other flags) and see if the system still keeps time? Second, can you just boot the system with "apic=debug" or even "apic=verbose". This should provide some usefull debugging info. Please attach any relevant output to this Bug. Third, try booting with "lapic" (and no other flags) instead and see if the local APIC works better for keeping good time. Status report: We ran for a day with just the "noapic" flag. Clock was fine. We ran for a day with just the "lapic" flag. Clock was BAD. I will start up the system with "apic=debug", and attach output that seems relevant. Answers to the following clarifying questions would be helpful: Is is sane to specify BOTH apic=debug and apic=verbose as options? Over what time period should I plan to gather log output? Are there any log files besides /var/log/messages that will be relevant? apic=verbose is apic=debug, just more detailed output. In looking at the code again here verbose is probably whats needed (instead of just basic debug level). Gather any usefull log entries from boot until the system is completly up and running. We are particularly interested in anything that pops up during the boot/init process. A copy of /var/log/dmesg is probably sufficient. Its worth setting /etc/syslog.conf to log kern.* to /var/log/messages and include that file as well (or just the portion since last boot, since the messages file can be huge). Just to make sure I was clear enough in Comment #25, just boot with "apic=verbose". I want to look and see what code path we are going down. A copy of the same apic=verbose output from a working kernel would be very helpfull as well (if possible). We can see if the code path differs between working & non working. There are a few suspect patches I can isolate and remove + build a test kernel if your willing to help test. Created attachment 149064 [details]
apic=verbose dmesg output
Created attachment 149065 [details]
messages output from restart with apic=verbose
Thanks very much for quickly answering my clarifying questions. I was expecting I'd have to run the system over night before giving you the output you required. Here you ARE! I didn't muck with syslog.conf because it looked like it was doing the right thing. What further information can I supply? Created attachment 149070 [details]
dmesg output for kernel with working clock
Created attachment 149071 [details]
messages output with working clock pickes up where previous messages leaves off.
I see our follow-ups crossed in the ether. I've added attachments for the .9-34 kernel that works. (Reminder: Its the .9-43 kernel that has the problem.) I don't know how much help comparing the working one will be because it looks like that version didn't DO any APIC clock work at all. It used tsc, remember? Anyway, let me know what else to do to help isolate this fault. Re: Comment #32, Just want to clear up some potential points of confusion... The system didnt keep time even when using TSC as the timesource on the broken kernel correct (see Comment #16)? Im assuming here that the broken kernel wont keep time no matter which timesource is used (HPET, PMTimer, TSC and PIT). Please verify assumption this is correct. Im guessing the problem is in the apic code though. Disabling apic resolved the issue. Worth noting is that when TSC is used as a timesource, its actually using PIT and TSC (and not pure TSC). PIT is used during the timer interrupt, and the TSC offset is read during a non-interrupt, etc (i.e. gettimeofday() call). TSC is used for performance reasons... its quick to read but has but less accurate, hence the need to use PIT as the base and do a quick TSC read to calculate the offset. TSC is scalable on MP systems like when you have a database with thousands of threads all doing a gettimeofday() Also, thanks for the apic=verbose outputs.. Ill take a look at them. Looked at the sysreports a bit more... So booting with "clock=tsc" on a bad kernel doesnt fix the problem right? We have established so far that using PMTimer (and apic timer) doesnt work. But what about "clock=pit"? Im just a bit confused here on what works and what doesnt (gone around in circles too many times due to the obscurity of this problem). Does a broken kernel fail to keep time regardless of which clock is used? Or does it acutally keep time with TSC? Booting with "clock=tsc" on a bad kernel does NOT fix the problem. We tried that early on. We were going to try other time bases, but were unsure if that mattered. I will run with "clock=pit" over the weekend and report results on Monday. Is "clock=hpet" the other option to specify? I am also running a 3.0ghz IBM that is having the same problem. I have only added the noapic to grub.conf on the kernel lines and that seems to have fixed it at this point. I would also add that shutdowns where "VERY" slow before. It would almost appear to hang. Even just issuing simple commands (top, ls) would lag greatly. I ran with "clock=pit" over the weekend. BAD CLOCK! No biscuit! It still thinks it's friday. So we now know that the following do NOT keep time with apic enabled: pmtimer tsc pit I will run with "clock=hpet" and report on those results tomorrow or so. Thanks for the debugging info. Im able post a test kernel tarball (with suspect patches removed from the build) Are you able and willing to build and install a kernel and do some simple testing (boot and check time, etc)? Since this is almost certainly APIC related, I wanted to determine which APIC patch is causing the issue. At least then we will know why this is broken and can come up with a fix. Created attachment 149291 [details]
test case in case its helpfull
Im attaching this script in case its helpfull in detecting the time skew faster
or more easily.
Run 'hwclock -s" to sync RTC and sysclock before running the checktime script.
RTC is used as a reference time source, hence the need to sync.
Thank you for the test script. It will be helpful in reducing the turn-around time on tests. I've been running the tests while the engineer who normally builds and installs kernels has been on vacation. Since that engineer doesn't actually report to me, I don't know when he's scheduled to return. I'm out of practice building kernels, and I don't want to mess up Andrew's infrastructure. With the caveat that there will be some delay though, YES we are INDEED willing to install, and if necessary build, kernels with bits and pieces in or out that you supply. So yes, please do supply your test kernel tarball. If you have multiple patches for us to fiddle with, please make explicit exactly what you want in/out and in what order you'd like us to test the various kernels. I just attempted to use your checktime utility. Alas, though ksh may be your default shell, and something you assume everyone has installed, it is not installed on our (usually pretty heavily tricked out) default configuration. It's inconvenient for me to track down the ksh rpm and install it. And I'm not up to speed on the differences between ksh and sh or csh. Would it be terribly inconvenient for you to provide a checktime.sh script based on sh or csh? Sorry. ksh is bash compatable. just change line 1 of script to #!/bin/bash and run. Ill pull some patches and post a kernel tarball and provide the URL in Bugzilla. Status report: I ran off yesterday with clock=hpet running but without trying your amended script. (We'll use that with the kernel stuff.) clock=hpet ALSO FAILS to keep correct time. (We expected this, and it's the last "due diligence test".) The engineer who would build kernels is back from vacation. I'll be meeting with him later today to discuss this. So as soon as you post the kernel tarball, we'll be ready to take the next step. First thing to try determine is when this regression occured. Comment #1 seems to indicate this worked in RHEL4U2 but was broken in RHEL4U4. So does the RHEL4U3 kernel work? I need to determine if it was the U3 or U4 patchset that broke this. Can you install this kernel RPM and see if the system keeps time: http://people.redhat.com/bmaly/kernel-2.6.9-34.EL.i686.rpm From there, Ill pull suspect patches out and have you build a kernel from a tarball which Ill post. We can then isolate which patch is the culprit, then figure out why it broke. We have installed the kernel RPM you've provided, and are running the checktime utility. Two things to note: 1. Under bash, numbers with a leading zero are interpreted base 8, so at all times where the leading zero is present in the output of date, we lose. The script blows out because the comparison is out of range. Andrew fixed the problem by changing all the arithmetic expressions to include "#10" to force base 10. (Gotta love those shells, everyone should write their own shell, and make it as close as possible, but NOT the same as the existing shells.) 2. The clock failure, in a failing kernel does not happen right away. Maybe the clocktest script will notice exactly when it happens, but it DOES take a while before the clock quits working. At any rate, we're running the kernel you've provided. Since it's 2.6.9.-34, like the working kernel we reported on in the sysreports we filed as attachments, we expect this one will keep time just fine. We'll know for sure tomorrow, I expect. Status update: We ran the 2.6.9-34 kernel provided. As expected it kept time perfectly well. Additional information: When running tests, expect that it will be some time before the problem manifests. It may be that at around midnight, the clock simply STOPS. Monitoring with the clocktest script will give us more exact insight. Last night we ran the 2.6.9-42.0.3 kernel with the checktime program. I will attach the output of checktime. We rebooted the system on 8 March 2007 at 14:27PM. The clock ran fine until 9 March 2007 at 04:14:50. At that point the divergence was monotonically increasing. Note that 04:15 AM is within half an hour of an uptime of 12 hours. This suggests a couple questions to me: Is there something that gets done in advance of 12 hours of uptime? Is there something that gets run out of cron at around 4:15 AM related to time keeping? Created attachment 149733 [details]
Output of checktime.sh showing startup, and increasing skew after ~12 hours
This regression first appeared in 2.6.9-34.15 kernel. I will determine which of the 7 patches that we in .15 that caused the breakage. Can you provide the /proc/interrupts with the kernel that halts and with the 2.6.9-34 one? I am curious to see if the 2.6.9-34 ended up using the LAPIC or the PIC controller. Those IBM boxes have a firmware bug where the SMI firmware when exiting screws up with the timer. One solution is to not use IO APIC, which is what the 2.6.9-34 Uni Processor kernels did, or just disable it using noapic (which disables the IO APIC init routines). I've been trying inside IBM to get them to fix this, but no luck so far. What I think we can do for RHEL4 U6 is to hard-code the 'noapic' parameter in the kernel so that when it detects these machines it sets that. For that, we need the output of 'dmidecode'. Brian, the possibility of this regression was mentioned by me when submitting the fix for RHEL4 U4 and that the IBM ThinkCentre will be affected, but that many other boxes would benefit from this. The solution that Jason agreed to was to submit code for blacklisting those ThinkCentre to do _no_ IOAPIC initialization (which is what 2.6.9-34 did). Im currently doing some testing, so I cant reboot with a known bad kernel, but heres the /proc/interrupts for now. I will update this if after booting with a known bad kernel, if it happens to differ. CPU0 0: 7368715 XT-PIC timer 1: 10 XT-PIC i8042 2: 0 XT-PIC cascade 5: 0 XT-PIC uhci_hcd 8: 1 XT-PIC rtc 9: 0 XT-PIC acpi, Intel ICH5, ehci_hcd 10: 0 XT-PIC uhci_hcd 11: 79432 XT-PIC uhci_hcd, uhci_hcd, eth0, i915@pci:0000:00:02.0 12: 92 XT-PIC i8042 14: 10801 XT-PIC ide0 NMI: 0 ERR: 0 I dont want to go blacklisting systems if possible, its hard to maintain. It may be a more generic chipset issue which would be more practical to address. So I will need to determine how the code changed when .config options were modified. I saw some #ifdef CONFIG_SMP related bits upstream that might need backporting (to remove bits of code from the UP kernel). And yeah, this is a UP kernel. I seem to remember CONFIG_X86_UP_IOAPIC and CONFIG_X86_UP_APIC also being related to some NMI code as well.. wonder if thats related. Just an update on this issue. The change in the -34.15 kernel that caused this regression was a change to config-i686. Setting CONFIG_X86_UP_IOAPIC=y breaks timekeeping on this system, unsetting this config option causes the box to run fine. I will explore why this change broke timekeeping and if there are any upstream fixes that can be backported to resolve this issue. I was able to narrow down which lines of code cause the regression. In arch/i386/kernel/time.c in function do_timer_interrupt(): #ifdef CONFIG_X86_IO_APIC if (timer_ack) { /* * Subtle, when I/O APICs are used we have to ack timer IRQ * manually to reset the IRR bit for do_slow_gettimeoffset(). * This will also deassert NMI lines for the watchdog if run * on an 82489DX-based system. */ spin_lock(&i8259A_lock); outb(0x0c, PIC_MASTER_OCW3); /* Ack the IRQ; AEOI will end it automatically. */ inb(PIC_MASTER_POLL); spin_unlock(&i8259A_lock); } #endif With this bit of code removed, the regression dissapears. It was enabled with the config file changes that went into -34.15. So the regression dissapears when timer_ack=0. timer_ack is set in arch/i386/kernel/io_apic.c in check_timer(). timer_ack=0 happens if timer_irq_works() returns 0. So seems like timer_irq_works() might be a place to fix this possibly. I would like to see if there are other changes in io_apic that might cause the timer_irq breakage, as thats a better fix if its possible. If I remember correctly, the timer_ack is set when probing the timer via IO APIC or via the legacy timer, which is determined if the board IRQs are initialized via APIC or the legacy timer, which in turn is determined by probing the ACPI tables provided by the BIOS . Or when CONFIG_X86_IO_APIC is not set, then using the MP tables to set the IRQs and also using the legacy timer instead. So you think it might be a BIOS issue or perhaps ACPI breakage (wont process the BIOS info correctly)? The affected system uses an Intel ICH5 chipset. Its known to be a rather buggy chipset, but I need to find a different system that has this chipset to test on. I think we need to figure out if this problem affects all systems with ICH5 or just this specific model of Dell. I do see ACPI parser errors on this system for what its worth. Quick factual correction: This is NOT a Dell system. It is an IBM S50 system. -wdc Not a Dell. Ok, I must have confused this with a Dell related bug. But some Dell's do use this chipset, so Ill try and round one up. Konrad, can we determine if there were any BIOS updates from IBM that might address this issue? Is this a known issue to IBM, does it affect any other models too? John, Is this related to the SMI issues you had seen sometime ago? Brian, I believe that there were no updates from IBM for those boxes as they have reached end-of-life. This issue does not affect other models (John, please correct me if I am wrong) So its my hypothesis that timer interrupts stop because the system hangs in do_timer_interrupt(). Its either because of a deadlock issue with spin_lock(&i8259A_lock), or more likely a hang in writing/reading the registers: outb(0x0c, PIC_MASTER_OCW3); inb(PIC_MASTER_POLL); I wonder what happens if a SMI occurs while doing this write/read? Strangely this code works for about 12 hours before timer interrupts stop. Im also exploring the possibility that this is NMI related just to rule that out. So I have a test running to test the deadlock theory. I will know by tomorrow if thats the issue. Its easier to test than if a write/read completed. (In reply to comment #60) > Is this related to the SMI issues you had seen sometime ago? It does look like it. The problem w/ IBM ThinkCentres was a BIOS SMI would muck up the apic interrupt frequency, so time would move very slowly. A BIOS update could fix it, but not all affected systems got BIOS updates. I believe booting w/ noapic is the workaround for systems that do not have a BIOS update available. There is also a kernel side patch (never accepted into mainline) that can be found attached to the following bugme bug: http://bugzilla.kernel.org/show_bug.cgi?id=6296 The patch has been around a few years and has not gone into mainline, so there may be issues using it more widely. I'm really not sure. Any way we could get a dmidecode output for the affected IBM NetVista system? Created attachment 154286 [details]
patch to resolve the issue on ThinkCentre S50
Putting this back in NEEDINFO state... See Comment #64, need dmi strings for other affected models. This request was evaluated by Red Hat Kernel Team for inclusion in a Red Hat Enterprise Linux maintenance release, and has moved to bugzilla status POST. I just wanted to say that Brian Maly worked really hard to understand this bug, and exceeded my expectations in his effort level and in his effectiveness at moving the issue forward. It's been a very long time since I've seen anybody do this well with a bug. THank you VERY much Brian! -Bill Cattey Linux Platform Coordinator Massachusetts Institute of Technology Cambridge MA committed in stream U6 build 55.25. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/ An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0791.html |