Description of problem: The machine crashes on boot. Either the crash is caused by TSC timer drift (which is somewhat common on AMD X2 machines), since it crashes with a message giving a number of milliseconds between the cores (the exact same number on both cores, just with opposite sign), and/or there is a locking issue. I have only received a few errors. The machine is legacy free, so I cannot log to a serial port or printer. However, I am attaching the few error messages that I get. clock=pit doesn't work; it makes it worse. maxcpus=1 is what I am using now, and it is stable other than a problem with X (another bug). Older kernels have fewer problems; the test 1 rescue disc from FC6 works ... reasonably well. I don't know what the difference is. This is with kernel 2.6.17-2517. It happens with both x86-64 and 32-bit. I am currently running 32-bit as it is FAR more stable (completely separate installs, so no, I don't have a library mismatch).
Created attachment 133699 [details] syslog of the few errors I get (nothing ever shows on screen)
maxcpus=1 was still a little unstable. However, with 2564 I was able to boot with noapic and nolapic and get stability. With nolapic I lost some functionality (powernow, for example). I am now running noapic and things appear stable. What do I need to do to help debug apic, as it works fine in Windows?
2583 still crashed without noapic. Would a dmidecode dump help?
I am not a very advanced user but I would just like to note that I have had the same/similar problems with FC5-64 and FC6test2-64. I will try FC6test2-32 and report back. This is on a Turion notebook (HP Pavilion dv6040us).
ok, mine is dv6045nr. From what I have been told all dv6000 series are the same. Mine has the webcam, yours may not. We have the same wifi chipset, video, etc. Hard drive sizes and memory may also be different, but I believe the cpu is also the same (not just series, but the actual model).
Yes, the dv6040us has a webcam. I could not care less whether the webcam works under Linux, however. I just want the machine to work basically. FC5-32 seems to be slightly more stable. Now that I can boot the darn thing, someone please let me know what system info I should post here, and I will. (And please give me the command I need to produce it, thanks.) Would you recommend trying FC6test2-32? At this point I don't even want to run pup for fear of breaking something. I was even contemplating a severe downgrade (to something like FC3) as an experiment. (Can you email me or post here what I have to do to turn off apic, if this seems to help?) Also, here is some information that may actually be useful to folks: Knoppix boots and seems to be rock-solid. It's just not a very practical solution in the long run. It was my friend's disk, I think the June 2006 release of Knoppix.
The webcam is the zrsomethingxx driver. It is currently not v4l2 (v4l was removed), but it will be soon. Boot with the option "noapic" on the command line (don't use nolapic, so that you keep speed step and other power management features). This does seem to fix 99% of my stability problems. I have issues with suspend/hibernate not resuming properly (black screen, or freeze and black screen, depending on what options I give X and such). I am running 32-bit as the 64-bit install was just too unstable. As for system info, since we have the same machine (memory and disk size being the only difference, it looks like), if they ask for dmidecode we should both give it. What else they will want, I do not know.
title Fedora Core (2.6.17-1.2586.fc6)
        root (hd1,0)
        kernel /vmlinuz-2.6.17-1.2586.fc6 ro root=/dev/VolGroup00/LogVol00 quiet rhgb <-- put noapic here.
        initrd /initrd-2.6.17-1.2586.fc6.img
This is a clip from my desktop grub.conf... therefore it doesn't have the noapic.
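For anyone who would rather script the edit than do it by hand, here is a minimal sketch in Python. It only transforms grub.conf text that you pass in; the function name add_kernel_option is made up for this illustration, and the assumption is a legacy GRUB stanza like the one above where the kernel line starts with the word "kernel".

```python
def add_kernel_option(grub_conf_text, option):
    """Append `option` to every `kernel` line that does not already carry it.

    Hypothetical helper for illustration; it never touches the real
    /boot/grub/grub.conf, it only returns the modified text.
    """
    out = []
    for line in grub_conf_text.splitlines():
        fields = line.split()
        # A kernel line looks like: kernel /vmlinuz-... ro root=... quiet rhgb
        if fields and fields[0] == "kernel" and option not in fields[1:]:
            line = line.rstrip() + " " + option
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = (
        "title Fedora Core (2.6.17-1.2586.fc6)\n"
        "\troot (hd1,0)\n"
        "\tkernel /vmlinuz-2.6.17-1.2586.fc6 ro root=/dev/VolGroup00/LogVol00 quiet rhgb\n"
        "\tinitrd /initrd-2.6.17-1.2586.fc6.img\n"
    )
    print(add_kernel_option(sample, "noapic"), end="")
```

Keep a backup of grub.conf before writing anything back; for a one-off test it is simpler to edit the kernel line from the GRUB menu at boot.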
Ok, starting in 259x kernels, I could often get into X. When I can (having some trouble with 2600), it works great provided I do not try to put the machine to sleep or switch to another console. (This may be the same bug as 201482.) If I don't start X, and I am at the console, if I cat /var/log/messages several times (sometimes even just once), the machine crashes. Is this a locking bug? If so, why does it not crash with noapic? Does it change enough of the timing or the way interrupts are handled? If this is not a locking bug, why does it seem to only hit on output to the console or switching virtual consoles? (Yes, this machine crashes on resume from suspend with or without noapic. It may be related because it seems to be at the point where it would resume the screen, i.e. network is up, disks are up, etc., and dpms work seems to lock the machine hard instead of just failing to bring up the screen, which is what normally seems to happen.) Is there a way to capture an oops without serial, parallel or screen? The netdump packages don't seem to be functioning.
Slowly able to get oopses out of the crashes. Bug 205183 may or may not be related to this one. However, it is the same machine. (One more coming shortly.)
Bug 205185 may or may not be related to this one. However, it is affecting the functionality of the same machine.
Bug 205185 is now closed. That leaves the crash on resume from suspend and the crash when I don't add noapic. There are also some bugs with Intel HDA audio which I will add shortly.
Ok, it seems that the crash happens on VC switch when noapic is not provided. Most of the time I now boot fine and rhgb starts. The system crashes a lot when rhgb hands off to gdm. It will always crash if I switch from X to a text mode VC.
Finally, I was able to get some trace backs out of my logs. I believe this is a resume from suspend or suspend problem and not a hibernate/resume from hibernate problem, as hibernate works and there were several of those in my logs without these errors.
Oct 18 12:51:12 mysystem kernel: BUG: sleeping function called from invalid context at kernel/rwsem.c:20
Oct 18 12:51:12 mysystem kernel: in_atomic():0, irqs_disabled():1
Oct 18 12:51:12 mysystem kernel: [<c04051db>] dump_trace+0x69/0x1af
Oct 18 12:51:12 mysystem kernel: [<c0405339>] show_trace_log_lvl+0x18/0x2c
Oct 18 12:51:12 mysystem kernel: [<c04058ed>] show_trace+0xf/0x11
Oct 18 12:51:12 mysystem kernel: [<c04059ea>] dump_stack+0x15/0x17
Oct 18 12:51:12 mysystem kernel: [<c0439446>] down_read+0x12/0x20
Oct 18 12:51:12 mysystem kernel: [<c0431601>] blocking_notifier_call_chain+0xe/0x29
Oct 18 12:51:12 mysystem kernel: [<c05a9798>] cpufreq_resume+0x118/0x135
Oct 18 12:51:12 mysystem kernel: [<c0551440>] __sysdev_resume+0x20/0x53
Oct 18 12:51:12 mysystem kernel: [<c0551583>] sysdev_resume+0x16/0x47
Oct 18 12:51:12 mysystem kernel: [<c0555767>] device_power_up+0x5/0xa
Oct 18 12:51:12 mysystem kernel: [<c04418fd>] suspend_enter+0x3b/0x44
Oct 18 12:51:12 mysystem kernel: [<c0441a2c>] enter_state+0x126/0x176
Oct 18 12:51:12 mysystem kernel: [<c0441b01>] state_store+0x85/0x99
Oct 18 12:51:12 mysystem kernel: [<c04a5fe6>] subsys_attr_store+0x1e/0x22
Oct 18 12:51:14 mysystem kernel: [<c04a60d9>] sysfs_write_file+0xa7/0xce
Oct 18 12:51:14 mysystem kernel: [<c046f805>] vfs_write+0xa8/0x159
Oct 18 12:51:14 mysystem kernel: [<c046fe32>] sys_write+0x41/0x67
Oct 18 12:51:14 mysystem kernel: [<c0404013>] syscall_call+0x7/0xb
Oct 18 12:51:14 mysystem kernel: DWARF2 unwinder stuck at syscall_call+0x7/0xb
Oct 18 12:51:14 mysystem kernel: Leftover inexact backtrace:
Oct 18 12:51:14 mysystem kernel: =======================
Sorry, the oops came from: kernel-2.6.18-1.2798.fc6 (i686 version I believe).
The above oops may be like the acpi_cpufreq one that was fixed within the last two months. When this one gets fixed, please don't close this bug as there are apic or locking issues that remain.
Largely, the last year of kernels has been better. I no longer need special options on boot, except to set up a vgafb. If I don't, it crashes switching between X and console. Occasionally, I still get odd crashes. I also can't hibernate or sleep (due to crashes on resume). I haven't messed with the new workarounds in rawhide yet. This bug may soon be closed.
I have a HP laptop that has the same issues. I see there hasn't been any traffic on this ticket in a month, what can I do to help?
I have an HP dualcore Turion running Rawhide and it locks up randomly unless I use "noapic noirqdebug". Nobody has the answer for this problem...
I don't have an answer. I am the original reporter.
noapic, nolapic, and noirqdebug are not a solution to why the kernel doesn't run properly on this hardware. Unfortunately, it looks like the debugging Trever has done is the only work posted in this bug. I get hard lockups (SysRQ doesn't work) w/o noapic. w/ noapic, I lose USB ports. I've tried booting w/ apic=debug ignore_loglevel vga=0x0f07 to try to see where the laptop locks up, but it locks up so solid that the kernel doesn't print an oops. I'm a little lost on what to do here. I posted to LKML and got zero response. We seem to be getting little attention here as well. AMD has some driver updates to CPU frequency scaling for the Turion processors, so I'm going to see if I can compile a custom kernel w/o any CPU frequency scaling and see if that has any effect.
(In reply to comment #21)
> I get hard lockups (SysRQ doesn't work) w/o noapic. w/ noapic, I lose USB
> ports.
Try: noapic noirqdebug
And post your hardware information (make and model of system).
Created attachment 253781 [details] dv6408nr noapic noirqdebug info 2.6.23.1-10.fc7 w/ noapic noirqdebug on hp pavilion dv6408nr turion x2 amd mcp51 chipset
noapic noirqdebug works better (doesn't lock up), but /proc/interrupts shows the error counter steadily rising. USB (the problem device) and the multimedia hotkeys trigger an increase in error interrupts. The laptop is a HP Pavilion dv6409nr. Attached is a small tarball containing the following:
kernel_debug-noapic_noirqdebug/dmesg.txt
kernel_debug-noapic_noirqdebug/lspci.txt
kernel_debug-noapic_noirqdebug/lsusb.txt
kernel_debug-noapic_noirqdebug/proc_interrupts.txt
kernel_debug-noapic_noirqdebug/dmidecode.txt
kernel_debug-noapic_noirqdebug/proc_cpuinfo.txt
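To put a number on "steadily rising", one option is to compare the ERR line between two snapshots of /proc/interrupts. The Python sketch below only parses text handed to it; the function names and the sample snapshots are invented for illustration, and it assumes the usual "ERR: <count>" line format.

```python
def parse_err_count(interrupts_text):
    """Return the ERR counter from /proc/interrupts text, or None if absent."""
    for line in interrupts_text.splitlines():
        label, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and label.strip() == "ERR" and fields:
            return int(fields[0])
    return None


def err_increase(before_text, after_text):
    """Difference in the ERR counter between two snapshots, or None."""
    before = parse_err_count(before_text)
    after = parse_err_count(after_text)
    if before is None or after is None:
        return None
    return after - before


if __name__ == "__main__":
    # Made-up snapshots, not taken from the attached tarball.
    snap1 = "           CPU0       CPU1\n  0:     123456          0    XT-PIC  timer\nERR:        417\n"
    snap2 = "           CPU0       CPU1\n  0:     125000          0    XT-PIC  timer\nERR:        452\n"
    print("ERR increase:", err_increase(snap1, snap2))
```

On a live system the two snapshots would come from reading /proc/interrupts twice, for example before and after plugging in the USB device or pressing a multimedia hotkey.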
Created attachment 254751 [details] dv6408nr acpi dsdt disassembly
Created attachment 254771 [details] dv6408nr all acpi tables (dsdt, apic, hpet, etc) binary & disassembled
Just installed Fedora 8/i386, and I have the same problems as I did under Fedora 7/i386. For whatever reason, x86_64 kernels seem to boot better on this hardware. I downloaded both Fedora 8 i386 and x86_64, and will try out the x86_64 distro later and provide the same info as above.
I have lost patience with this and I am dumping the machine, which is now over 14 months old. Good luck, everybody.
bump? or something? hello?
Looks like this may be one and the same problem. How can I apply the patch from this thread to an FC8 kernel? I'd like to test to see if this solves the problems. http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg240281.html
Changing version to 8 since bug exists there.
(In reply to comment #30)
> Looks like this may be one in the same problem. How can I apply the patch from
> this thread to a FC8 kernel? I'd like to test to see if this solves problems.
>
> http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg240281.html
I tried the patch and while it did let me use tickless, the system still locks up after a while. Only "noapic noirqdebug" seems to work.
The recent kernels seem to be better, but once in a great while I still get problems. (This is on a newer machine as the 6045nr from HP died like most others sold at that time... garbage.)
The problem is that the lapic timer is broken because of the C1E C-state. If you boot with 'nolapic_timer clocksource=hpet hpet=force' it boots pretty reliably, with infrequent lockups during boot of the kernel and infrequent lockups post boot. My lockups may be thermal related; I'm not entirely sure. The above kernel boot parameters permit usage of the APIC, which is essential to having properly functioning IRQs. The Linux kernel developers are supposedly working on an hpet-based nohz implementation, but since this particular CPU type on laptops seems to be the only one affected, nobody seems to have any interest in fixing this. So in the meantime, these laptops run full tilt, at around 56°C. Every minute or so, the fans kick in and bring it down to around 50°C. Lather, rinse, repeat. If C1E is disabled, the system runs with nohz enabled, but it still runs at full clock speed and gets rather warm.
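Since it is easy to mistype one of these options in grub.conf and then not notice, here is a small Python sketch that checks a kernel command line for them. The helper names are hypothetical; the command line would normally be read from /proc/cmdline, but the sketch takes it as a plain string.

```python
WANTED = ["nolapic_timer", "clocksource=hpet", "hpet=force"]


def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def missing_params(cmdline, wanted=WANTED):
    """Return the entries of `wanted` that are absent or set to another value."""
    have = parse_cmdline(cmdline)
    missing = []
    for token in wanted:
        key, sep, value = token.partition("=")
        if key not in have or (sep and have[key] != value):
            missing.append(token)
    return missing


if __name__ == "__main__":
    # Example string only; on a running system read /proc/cmdline instead.
    example = "ro root=/dev/VolGroup00/LogVol00 quiet rhgb nolapic_timer hpet=force"
    print("missing:", missing_params(example))
```

An empty result means all three options made it onto the command line; it says nothing about whether the kernel actually honored them.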
Changing version to '9' as part of upcoming Fedora 9 GA. More information and reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Confirmed still broken in FC9-final. Confirmed working as previously described in FC9 using FC8 kernel-2.6.24.7-92.fc8.i686 kernel
This sounds like another instance of the 'ati chipset has timer problems' bug that we've seen a few flavours of. If my guess is right, the good news is that .26rc seems to have fixed this for me, so when we either backport .26 to F9, or identify the changeset(s) for backport, we'll have this fixed.
Dave: ftp://download.fedora.redhat.com/pub/fedora/linux/development/source/SRPMS/kernel-2.6.26-0.54.rc4.git5.fc10.src.rpm Is that the kernel you're talking about?
I can no longer help with this bug as the laptop I have doesn't seem to be affected. (The old one died.)
(In reply to comment #37)
> This sounds like another instance of the 'ati chipset has timer problems' bug
> that we've seen a few flavours of. If my guess is right, the good news is that
> .26rc seems to have fixed this for me, so when we either backport .26 to F9, or
> identify the changeset(s) for backport, we'll have this fixed.
>
This is nvidia MCP51/C51 chipset. My HP TX1000 is affected too...
Trying out kernel-2.6.27-0.208.rc1.git2.fc10.src.rpm, to see if it works at all. Just had to replace the HD on the laptop that has this problem, will report back in a few days.
kernel-2.6.27-0.208.rc1.git2.fc10.src.rpm doesn't help any. Still random lockups. Is there something I can do to help figure out what is causing the lockups? The system locks up hard, sysrq doesn't help any. If a system-wide timer isn't working, how can I debug that?
(In reply to comment #42)
> kernel-2.6.27-0.208.rc1.git2.fc10.src.rpm doesn't help any. Still random
> lockups.
>
> Is there something I can do to help figure out what is causing the lockups?
> The system locks up hard, sysrq doesn't help any. If a system-wide timer isn't
> working, how can I debug that?
Try adding 'io_delay=0xed' to the kernel boot options. That notebook is not in the table for the IO delay quirk, but it should be.
(In reply to comment #43)
>
> Try adding 'io_delay=0xed' to the kernel boot options.
>
No dice.
any other ideas? Or is this something that is simply going to have to take development time on the part of the mainstream kernel developers?
(In reply to comment #45)
> any other ideas? Or is this something that is simply going to have to take
> development time on the part of the mainstream kernel developers?
Some people report that adding "nolapic_timer" to the boot options helps.
Adding nolapic_timer? https://bugzilla.redhat.com/show_bug.cgi?id=201471#c34 really? The only way this laptop is usable is to never turn it off. Otherwise it will randomly lock up on boot still, as I described above. nohz still does not work, which makes the cpu run full tilt and get quite toasty.
The logs attached to this Bug seem to indicate a failure in the TSC code and also a failure in the cpufreq code. Can we debug both problems a little more? Both issues are big problems. First, does the problem still exist if the CPUs run at a static freq (i.e. so cpufreq doesn't initiate a frequency transition at any point)? Set the governor in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor to 'performance' so that the system runs at a static CPU freq. Let the system run for a while. See if the problem still occurs. Second, does this problem still exist if 'notsc' is used as a boot arg? This should help determine if lack of TSC synchronization is partly to blame.
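For the first test, a minimal Python sketch of setting the governor on every CPU might look like the following. The function name set_governor is made up for this example; it assumes the sysfs layout quoted above, and on a real system it has to run as root.

```python
import glob
import os


def set_governor(governor, cpu_root="/sys/devices/system/cpu"):
    """Write `governor` to every cpuN/cpufreq/scaling_governor under cpu_root.

    Returns the sorted list of files that were written. The cpu_root
    argument exists so the function can be tried on a scratch directory.
    """
    pattern = os.path.join(cpu_root, "cpu[0-9]*", "cpufreq", "scaling_governor")
    written = []
    for path in sorted(glob.glob(pattern)):
        with open(path, "w") as f:
            f.write(governor + "\n")
        written.append(path)
    return written
```

Calling set_governor("performance") as root should pin both cores; an empty return list means no scaling_governor files were found, for instance when cpufreq is not loaded at all.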
Brian - 99% of the problem occurs on boot. I'm not entirely sure why, but intermittently the laptop locks up solid. After hard power cycling a random number of times, the kernel manages to get past the point where it gets stuck. Once the system is booted, it stays up. That's why I try not to turn it off. Right now I've got it "booting" (a relative term) with:
io_delay=0xed clocksource=hpet hpet=force nolapic_timer
The only options that seem to help so far are the hpet and lapic timer options.
Re: notsc-
Nov 3 22:01:14 xrap kernel: Linux version 2.6.27-0.244.rc2.git1.c1e.fc9.i686 (warewolf@xrap) (gcc version 4.3.0 20080428 (Red Hat 4.3.0-8) (GCC) ) #1 SMP Thu Aug 28 01:27:13 EDT 2008
Nov 3 22:01:14 xrap kernel: Kernel command line: ro root=/dev/encrypted/Eroot io_delay=0xed clocksource=hpet hpet=force nolapic_timer
Nov 3 22:01:14 xrap kernel: Clocksource tsc unstable (delta = -97636198 ns)
It looks like the system detects it's hosed, and decides not to use it.
[root@xrap clocksource0]# cat available_clocksource
hpet acpi_pm jiffies tsc
[root@xrap clocksource0]# cat current_clocksource
hpet
[root@xrap clocksource0]#
This laptop has no serial ports, so I have little to no chance of getting a console log, unless I hand transcribe it. Point me at a kernel, and I'll do everything I can to get to the bottom of this issue.
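To compare the size of the TSC skew across boots, the "Clocksource ... unstable" line can be pulled out of the log with a short Python sketch like this one. The regular expression is written against the exact message format shown in the log above and the function name is invented; other kernel versions may word the message differently.

```python
import re

# Matches e.g. "Clocksource tsc unstable (delta = -97636198 ns)"
UNSTABLE_RE = re.compile(r"Clocksource (\w+) unstable \(delta = (-?\d+) ns\)")


def unstable_clocksources(log_text):
    """Return a list of (clocksource, delta_ns) pairs found in the log text."""
    return [(m.group(1), int(m.group(2))) for m in UNSTABLE_RE.finditer(log_text)]


if __name__ == "__main__":
    line = "Nov 3 22:01:14 xrap kernel: Clocksource tsc unstable (delta = -97636198 ns)"
    for name, delta_ns in unstable_clocksources(line):
        print(f"{name}: {delta_ns / 1e6:.3f} ms")
```

For the line above the delta works out to roughly -97.6 ms, which is the kind of millisecond-scale skew mentioned in the original description.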
Thanks for the info on TSC. Looks like that's another thing that has to get fixed. Could you try building a kernel with cpufreq disabled (i.e. not built into the kernel) and see if that makes a difference? It would be nice to determine if this is cpufreq related. I want to try to eliminate the most obvious points of failure first. I'm suspicious of cpufreq because I saw some cpufreq related BUG output in one of the logs attached to this BZ. It might not be related though. Most helpful, though, would be stack trace output from the lockup so we can figure out where exactly things are breaking. That would save a lot of debugging time. Take a digital pic of the screen with a camera and attach it to this Bug if it's possible.
Created attachment 325634 [details] photo of laptop screen vga=0x0f07 of nocpufreq kernel (no other cmdline args) kernel-2.6.27.5-117.fc10.src.rpm kernel w/ modified config to disable cpufreq, including acpi based cpufreq. This was with the command line args vga=0x0f07, and no other arguments.
I just realised I didn't mention the above screen shot was of the laptop locking up. It does sort-of look like I snapped a photo mid-boot, heh.
So there is no stack trace really, it just hangs during boot?
Yep, that's it exactly.
You want me to try installing FC10 on this laptop? .. I'm at a loss for what to do.
Trying a newer kernel could help establish if this is fixed in a newer kernel. If this is the case the fix can easily be backported.
I updated this box last night and the latest Fedora 9 kernel (2.6.27.9-73.fc9.i686) doesn't fix it. I still have to try to boot it multiple times with nolapic nolapic_timer, and randomly one time out of ten it boots successfully. Do you know an AMD kernel dev that I could hop on IRC with or something? I get the feeling some interactive "okay try this" debugging may help. In the meantime, I'll pull down the latest Fedora 10 kernel and try that, but I'm not expecting it to work :(
Alright, update; and this may be something you can roll with. I should have let the system sit for a minute, because it took a moment for the BUG to finally occur. Fedora 10 kernel BUGs out with this (hand transcribed):
ACPI: processor limited to max C-state 1
BUG: spinlock lockup on CPU#0, swapper/0, c080766c (Not tainted)
PID: 0, comm: swapper Not tainted 2.6.26.8-159.fc10.i686.debug #1
[<c06bd3e5>] ? printk+0xf/0x12
[<c052be54>] _raw_spin_lock+0xd7/0x4b
[<c06bf73d>] _spin_lock+0x3d/0x4b
[<c04463fd>] clockevents_notify+0x13/0x4d
[<c040927d>] c1e_idle+0xe1/0x11c
[<c0f02c4d>] cpu_idle+0x101/0x134
[<c06aebbe>] rest_init+0x4e/0x50
[<c084889d>] start_kernel+0x30b/0x310
[<c0848080>] __init_begin+0x80/0x88
I'm going to try playing with acpi kernel options now.
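Because this trace was typed in by hand, a quick sanity check can catch transcription slips: each frame has the form symbol+offset/size, and the offset should not be larger than the size. The Python sketch below (function names are made up for illustration) parses frames in that format and lists the ones that fail the check; for the trace above it flags _raw_spin_lock+0xd7/0x4b, which may simply be a typo in the transcription.

```python
import re

# Frame format: [<address>] optional "?" marker, then symbol+0xOFFSET/0xSIZE
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?([\w.]+)\+0x([0-9a-f]+)/0x([0-9a-f]+)"
)


def parse_frames(trace_text):
    """Return (symbol, offset, size) tuples for every frame in the text."""
    return [
        (m.group(1), int(m.group(2), 16), int(m.group(3), 16))
        for m in FRAME_RE.finditer(trace_text)
    ]


def suspicious_frames(trace_text):
    """Symbols whose offset exceeds the function size (likely a typo)."""
    return [sym for sym, off, size in parse_frames(trace_text) if off > size]


if __name__ == "__main__":
    trace = (
        "[<c06bd3e5>] ? printk+0xf/0x12\n"
        "[<c052be54>] _raw_spin_lock+0xd7/0x4b\n"
        "[<c06bf73d>] _spin_lock+0x3d/0x4b\n"
    )
    print(suspicious_frames(trace))
```

A flagged frame does not change the overall picture here; the call chain still runs from cpu_idle through c1e_idle and clockevents_notify into the spinlock.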
The problem still exists in the fedora 11 alpha. :(
It looks like ubuntu's 2.6.27-11-generic works just fine on this laptop, and even drops (correctly) into C1 state thus solving my long lasting power problem.
After a year of trying to help, I've lost interest. Good luck. I'm switching to Ubuntu, because it -just works-. For those keeping count, that's three people of the community who have given up. Of course, I'm not counting people with @redhat.com e-mail addresses.
This message is a reminder that Fedora 9 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 9. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '9'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 9's end of life. Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 9 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Fedora 9 changed to end-of-life (EOL) status on 2009-07-10. Fedora 9 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.