Description of problem: This problem has its roots in F8 kernel 2.6.24.3-34 see bug 439107 the current kernel is 2.6.25-14.fc9.x86_64. This is an Acer Aspire 5020 with a RTL8169 NIC. When the network is started the system becomes slow and unresponsive ie mouse pointer movement is jerky and it takes several seconds from clicking on a an item before there is an action this can last for a few minutes to 20mins+, it eventually returns to normal. It is so bad that I cannot boot with the network enabled so I have network manager disabled and ifcfg ONBOOT disabled. Version-Release number of selected component (if applicable): How reproducible:Start the network (eth0) Steps to Reproduce: 1.Get an Aspire 5020 2.Install Fedora 9 3.Try to boot Linux Actual results:After logging in the system 'freezes' Expected results:After logging in I should have a usable desktop and eth0 active. Additional info: There is nothing obvious in the logs. I also have SElinux set to permissive.
what does the output of top show during this sluggishness? what about the contents of /proc/interrupts?
Top shows this This is top before starting the network top - 11:37:45 up 6 min, 2 users, load average: 0.09, 0.25, 0.15 Tasks: 124 total, 1 running, 123 sleeping, 0 stopped, 0 zombie Cpu(s): 2.4%us, 0.5%sy, 0.0%ni, 97.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 1027028k total, 543940k used, 483088k free, 12024k buffers Swap: 2031608k total, 0k used, 2031608k free, 212516k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2427 root 20 0 390m 29m 9784 S 1.0 3.0 0:06.43 Xorg 3079 root 20 0 359m 19m 10m R 0.3 2.0 0:00.97 gnome-terminal 3109 root 20 0 14840 1124 844 R 0.3 0.1 0:01.21 top 1 root 20 0 4048 840 588 S 0.0 0.1 0:01.11 init 2 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 kthreadd 3 root RT -5 0 0 0 S 0.0 0.0 0:00.00 migration/0 This is top while things a slow top - 11:41:08 up 10 min, 3 users, load average: 0.10, 0.21, 0.16 Tasks: 128 total, 2 running, 126 sleeping, 0 stopped, 0 zombie Cpu(s): 0.0%us, 0.0%sy, 0.0%ni, 99.2%id, 0.0%wa, 0.8%hi, 0.0%si, 0.0%st Mem: 1027028k total, 587036k used, 439992k free, 12608k buffers Swap: 2031608k total, 0k used, 2031608k free, 217560k cache And this top when its a settled top - 11:45:11 up 14 min, 3 users, load average: 0.26, 0.22, 0.17 Tasks: 127 total, 3 running, 124 sleeping, 0 stopped, 0 zombie Cpu(s): 10.0%us, 1.7%sy, 0.0%ni, 88.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 1027028k total, 615032k used, 411996k free, 13656k buffers Swap: 2031608k total, 0k used, 2031608k free, 237984k cached And this /proc/interrupts This is before starting the network and the other 2 are afterwards. CPU0 0: 6454 local-APIC-edge timer 1: 66 IO-APIC-edge i8042 8: 0 IO-APIC-edge rtc0 12: 132 IO-APIC-edge i8042 14: 8249 IO-APIC-edge pata_atiixp 15: 1423 IO-APIC-edge pata_atiixp 17: 508 IO-APIC-fasteoi ATI IXP, ATI IXP Modem 18: 0 IO-APIC-fasteoi radeon@pci:0000:01:00.0 19: 866 IO-APIC-fasteoi ehci_hcd:usb1, ohci_hcd:usb2, ohci_hcd:usb3 20: 13 IO-APIC-fasteoi yenta, sdhc0:slot0, sdhc0:slot1, sdhc0:slot2, tifm_7xx1 21: 514 IO-APIC-fasteoi acpi 22: 5 IO-APIC-fasteoi firewire_ohci NMI: 0 Non-maskable interrupts LOC: 41269 Local timer interrupts RES: 0 Rescheduling interrupts CAL: 0 function call interrupts TLB: 0 TLB shootdowns TRM: 0 Thermal event interrupts THR: 0 Threshold APIC interrupts SPU: 0 Spurious interrupts ERR: 2 CPU0 0: 6454 local-APIC-edge timer 1: 102 IO-APIC-edge i8042 8: 0 IO-APIC-edge rtc0 12: 1392 IO-APIC-edge i8042 14: 10566 IO-APIC-edge pata_atiixp 15: 2368 IO-APIC-edge pata_atiixp 17: 725 IO-APIC-fasteoi ATI IXP, ATI IXP Modem 18: 0 IO-APIC-fasteoi radeon@pci:0000:01:00.0 19: 2550 IO-APIC-fasteoi ehci_hcd:usb1, ohci_hcd:usb2, ohci_hcd:usb3 20: 13 IO-APIC-fasteoi yenta, sdhc0:slot0, sdhc0:slot1, sdhc0:slot2, tifm_7xx1 21: 633 IO-APIC-fasteoi acpi 22: 7 IO-APIC-fasteoi firewire_ohci 23: 1089627 IO-APIC-fasteoi eth0 NMI: 0 Non-maskable interrupts LOC: 51930 Local timer interrupts RES: 0 Rescheduling interrupts CAL: 0 function call interrupts TLB: 0 TLB shootdowns TRM: 0 Thermal event interrupts THR: 0 Threshold APIC interrupts SPU: 0 Spurious interrupts ERR: 2 CPU0 0: 6454 local-APIC-edge timer 1: 119 IO-APIC-edge i8042 8: 0 IO-APIC-edge rtc0 12: 1392 IO-APIC-edge i8042 14: 10622 IO-APIC-edge pata_atiixp 15: 2956 IO-APIC-edge pata_atiixp 17: 763 IO-APIC-fasteoi ATI IXP, ATI IXP Modem 18: 0 IO-APIC-fasteoi radeon@pci:0000:01:00.0 19: 2853 IO-APIC-fasteoi ehci_hcd:usb1, ohci_hcd:usb2, ohci_hcd:usb3 20: 13 IO-APIC-fasteoi yenta, sdhc0:slot0, sdhc0:slot1, sdhc0:slot2, tifm_7xx1 21: 699 IO-APIC-fasteoi acpi 22: 8 IO-APIC-fasteoi firewire_ohci 23: 5682583 IO-APIC-fasteoi eth0 NMI: 0 Non-maskable interrupts LOC: 53130 Local timer interrupts RES: 0 Rescheduling interrupts CAL: 0 function call interrupts TLB: 0 TLB shootdowns TRM: 0 Thermal event interrupts THR: 0 Threshold APIC interrupts SPU: 0 Spurious interrupts ERR: 2 ~
Changing version to '9' as part of upcoming Fedora 9 GA. More information and reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Ok, so when you indicate the system is slow, your cpu idle time is 99.2%, which indicates that its not really doing much of anything, so its not like the system is bogged down in a user space process anywnere As for your interrupts, the eth0 interrupt count is high, but thats to be expected, as its a gigabit NIC. I've looked at the napi poll and top half interrupt routines for this driver, and neither show any out of the ordinary situations in which you might block cpu execution. I would recommend the following three steps: 1) Enable debug on the NIC. you can do this by adding this line: debug=16 as a module option for r8169 in /etc/modprobe.conf (don't forget to re-insert the module or reboot for the change to take effect). Also make sure you're syslog daemon is configured to capture at least KERN_INFO messages. This should tell us if the driver is encountering any operational problems 2) Use ethtool to check on NIC stats periodically. Lets see if we're hitting any errors frequently on the wire. 3) If (1) and (2) aren't fruitful, lets use oprofile to figure out where in the kernel the cpu is spending most of its non-idle time. That should give us a clue where to look next. Thanks!
How do I configure syslog to record Kern Info messages? Extra info. I just tried a KDE desktop and this problem is not apparent, logged in started the network, no degradation of performance. Logged out of KDE and tried to log back into Gnome as root with eth0 still active from the KDE session and it hung at the 'This session is running as a privileged user' dialouge after clicking continue.
to log kernel info messages you just need to make sure that *.info to /var/log/messages is specified in /etc/syslog.conf As for your new data, thats good info. Its speaks to not so much a NIC problem but rather a configuration problem in some desktop component (or some configuration that the desktop relies on). I've seen things like this be related to bad hosts files or dns resoution files (since X relies on TCP to transport keyboard and mouse events). Can you send me your /etc/resolv.conf, and /etc/hosts file. In fact a sysreport, or sosreport would be perfect. Thanks!
OK sysreport e-mailed to you.
2 things immediately jump out at me: 1) you have debug messages turned on for various gnome components, which is probably slowing things down a good deal. 2) You are getting a _ton_ of soft lockup errors, as you can see from your/var/log/messages file. (2) is probably the primary source of your slowness. Can you confirm that you don't see those softlockups whe you use kde. That should confirm for us that we're on the right trail, and we can start figuring out why you're loking up there. Thanks!
I've looked through those logs (where did all that info come from) and heres what I've found. 9 May no debug or soft lock ups. 10 May we see debug messages no soft lock ups. 11 May we see both. I suspect these have been introduced by updates. All 3 days I was having the same problem. I decided to implement your suggestions in #4. I don't have a /etc/modprobe.conf !!! I do have a /etc/modprobe.d/modprobe.conf.dist but there's no R8169 in there. How do I use oprofile?
Further research into sysreport error logs. Those Debug and SoftLockup messages are ocurring during boot so yes they are present in both Gnome and KDE. The last soft lock up message was recorded may 17 at 16:13:12.
I wouldn't worry too much about the details of comment 4 at this point. It won't hurt to have the data, but given what you've provided in the sysreport, I don't think we're going to find anything there. This is almost certainly a bug in a gnome component somewhere. I just saw bz447138 come up as well. I don't see the same power management markers in your logs as in theres, but it may be you don't have debug turned on for that component. The symptoms you experience seem simmilar, and I do note that gdm is trying to set some pm settings in your logs repeatedly, and failing at it. It might be worthwhile to turn off power management on your system under gnome to see if that helps the problem. Also it would be worth booting with nohz=off just to make sure this isn't some odd behavior related to waking up from tickless mode. As for oprofile, I usually point people at this: http://www.redhat.com/magazine/012oct05/features/oprofile/ Its not tailored to your situation, but it provides a good overview of how to use the oprofile tool. I'd focus on the gnome power management issue first though.
I've uninstalled gnome power manager and also package management tool. No difference. I've done several reboots today and cut the ends of the message logs, highlighted in red where I start the network and in green when it settles down. See attacment.
Created attachment 305901 [details] log tails
I've just had a look at that bug 447138 and it certainly shares the same history as this one. I was trawling through the Ubuntu forums today and found that alot of them had machines locking up on kernel 2.6.24. Clearly something changed how things get done from 2.6.23 to 2.6.24
grr, ok. Those logs make this seem like a driver problem again. I may need to build you a debug kernel. It appears from the logs that we report link up, but we aren't ready to trulyready to transmit. It looks like netif_stop_queue may have been called on this driver (which would eventually lead to the netdev watchdog timinig out). Lots of stuff went in between 2.6.23 and 2.6.24 (not to mention 2.6.25). One of the bigger things though was the tickkless kernel support., Please try booting with nohz=off and let me know if that changes your environment at all. If not, I'll start building a debug kernel for you.
I have the same problem with the driver r8169 (mac_version=0x0c, cf. source code). My box is booting in level 3, so this is not related to x11/gnome/whatever. The update to kernel-2.6.24.5-85.fc8 freezes as soon as the network is used. Even the sequence modprobe / rmmod /modprobe on r8169 freezes. The keyboard is also frozen and magic keys don't work. I compiled various vanilla kernels and can confirm that the problem is somewhere between: 2.6.23.17 and 2.6.24 Both, the API and the r8169 driver were updated. In particular, the MSI support was added to the driver. The good news is that I succeded in porting the old driver (2.6.23.17) to the new API (2.6.24), and I have now a working network under 2.6.24. For now, I am adding piece by piece the new features in my driver to track the problem down.
Hooray, nohz=off fixed it. Added it to grub.conf so I don't forget. Thanks for your help. One last thing, any idea how to turn off those annoying gdm debug messages?
Debug messages can be turned off by editing the debug section of /etc/gdm/custom.conf As for the performance issue, my guess is you're seeing various instances of bz 204948. After doing some more digging, apparently several people are seeing various forms of what you are, and it stems from constantly entering an leaving a power saving state. This should settle out as we remove cpu wakeup events. Until then, if you run powertop, that should give you suggestions for things you can disable to keep your performance good if/when you renable tickless.