Bug 445946

Summary: After starting network system response becomes very slow
Product: [Fedora] Fedora Reporter: Tony <tony.baz>
Component: kernelAssignee: Neil Horman <nhorman>
Status: CLOSED CURRENTRELEASE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high Docs Contact:
Priority: low    
Version: 9CC: jmoskovc, nhorman, olivier.baudron
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: 2.6.25-14.fc9 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2008-05-20 10:53:14 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
log tails none

Description Tony 2008-05-10 03:18:08 UTC
Description of problem: This problem has its roots in F8 kernel 2.6.24.3-34 see
bug 439107 the current kernel is 2.6.25-14.fc9.x86_64. This is an Acer Aspire
5020 with a RTL8169 NIC. When the network is started the system becomes slow and
 unresponsive ie mouse pointer movement is jerky and it takes several seconds
from clicking on a an item before there is an action this can last for a few
minutes to 20mins+, it eventually returns to normal. It is so bad that I cannot
boot with the network enabled so I have network manager disabled and ifcfg
ONBOOT disabled.


Version-Release number of selected component (if applicable):


How reproducible:Start the network (eth0)


Steps to Reproduce:
1.Get an Aspire 5020
2.Install Fedora 9
3.Try to boot Linux
  
Actual results:After logging in the system 'freezes'


Expected results:After logging in I should have a usable desktop and eth0 active.


Additional info: There is nothing obvious in the logs. I also have SElinux set
to permissive.

Comment 1 Neil Horman 2008-05-13 11:49:45 UTC
what does the output  of top show during this sluggishness?  what about the
contents of /proc/interrupts?

Comment 2 Tony 2008-05-14 02:04:53 UTC
Top shows this

This is top before starting the network

top - 11:37:45 up 6 min,  2 users,  load average: 0.09, 0.25, 0.15
Tasks: 124 total,   1 running, 123 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.4%us,  0.5%sy,  0.0%ni, 97.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   1027028k total,   543940k used,   483088k free,    12024k buffers
Swap:  2031608k total,        0k used,  2031608k free,   212516k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND           
         
 2427 root      20   0  390m  29m 9784 S  1.0  3.0   0:06.43 Xorg              
         
 3079 root      20   0  359m  19m  10m R  0.3  2.0   0:00.97 gnome-terminal    
         
 3109 root      20   0 14840 1124  844 R  0.3  0.1   0:01.21 top               
         
    1 root      20   0  4048  840  588 S  0.0  0.1   0:01.11 init              
         
    2 root      15  -5     0    0    0 S  0.0  0.0   0:00.00 kthreadd          
         
    3 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 migration/0           

This is top while things a slow

top - 11:41:08 up 10 min,  3 users,  load average: 0.10, 0.21, 0.16
Tasks: 128 total,   2 running, 126 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni, 99.2%id,  0.0%wa,  0.8%hi,  0.0%si,  0.0%st
Mem:   1027028k total,   587036k used,   439992k free,    12608k buffers
Swap:  2031608k total,        0k used,  2031608k free,   217560k cache

And this top when its a settled

top - 11:45:11 up 14 min,  3 users,  load average: 0.26, 0.22, 0.17
Tasks: 127 total,   3 running, 124 sleeping,   0 stopped,   0 zombie
Cpu(s): 10.0%us,  1.7%sy,  0.0%ni, 88.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   1027028k total,   615032k used,   411996k free,    13656k buffers
Swap:  2031608k total,        0k used,  2031608k free,   237984k cached

And this /proc/interrupts

This is before starting the network and the other 2 are afterwards.

CPU0
  0:       6454  local-APIC-edge      timer
  1:         66   IO-APIC-edge      i8042
  8:          0   IO-APIC-edge      rtc0
 12:        132   IO-APIC-edge      i8042
 14:       8249   IO-APIC-edge      pata_atiixp
 15:       1423   IO-APIC-edge      pata_atiixp
 17:        508   IO-APIC-fasteoi   ATI IXP, ATI IXP Modem
 18:          0   IO-APIC-fasteoi   radeon@pci:0000:01:00.0
 19:        866   IO-APIC-fasteoi   ehci_hcd:usb1, ohci_hcd:usb2, ohci_hcd:usb3
 20:         13   IO-APIC-fasteoi   yenta, sdhc0:slot0, sdhc0:slot1,
sdhc0:slot2, tifm_7xx1
 21:        514   IO-APIC-fasteoi   acpi
 22:          5   IO-APIC-fasteoi   firewire_ohci
NMI:          0   Non-maskable interrupts
LOC:      41269   Local timer interrupts
RES:          0   Rescheduling interrupts
CAL:          0   function call interrupts
TLB:          0   TLB shootdowns
TRM:          0   Thermal event interrupts
THR:          0   Threshold APIC interrupts
SPU:          0   Spurious interrupts
ERR:          2


        CPU0
  0:       6454  local-APIC-edge      timer
  1:        102   IO-APIC-edge      i8042
  8:          0   IO-APIC-edge      rtc0
 12:       1392   IO-APIC-edge      i8042
 14:      10566   IO-APIC-edge      pata_atiixp
 15:       2368   IO-APIC-edge      pata_atiixp
 17:        725   IO-APIC-fasteoi   ATI IXP, ATI IXP Modem
 18:          0   IO-APIC-fasteoi   radeon@pci:0000:01:00.0
 19:       2550   IO-APIC-fasteoi   ehci_hcd:usb1, ohci_hcd:usb2, ohci_hcd:usb3
 20:         13   IO-APIC-fasteoi   yenta, sdhc0:slot0, sdhc0:slot1,
sdhc0:slot2, tifm_7xx1
 21:        633   IO-APIC-fasteoi   acpi
 22:          7   IO-APIC-fasteoi   firewire_ohci
 23:    1089627   IO-APIC-fasteoi   eth0
NMI:          0   Non-maskable interrupts
LOC:      51930   Local timer interrupts
RES:          0   Rescheduling interrupts
CAL:          0   function call interrupts
TLB:          0   TLB shootdowns
TRM:          0   Thermal event interrupts
THR:          0   Threshold APIC interrupts
SPU:          0   Spurious interrupts
ERR:          2

    CPU0
  0:       6454  local-APIC-edge      timer
  1:        119   IO-APIC-edge      i8042
  8:          0   IO-APIC-edge      rtc0
 12:       1392   IO-APIC-edge      i8042
 14:      10622   IO-APIC-edge      pata_atiixp
 15:       2956   IO-APIC-edge      pata_atiixp
 17:        763   IO-APIC-fasteoi   ATI IXP, ATI IXP Modem
 18:          0   IO-APIC-fasteoi   radeon@pci:0000:01:00.0
 19:       2853   IO-APIC-fasteoi   ehci_hcd:usb1, ohci_hcd:usb2, ohci_hcd:usb3
 20:         13   IO-APIC-fasteoi   yenta, sdhc0:slot0, sdhc0:slot1,
sdhc0:slot2, tifm_7xx1
 21:        699   IO-APIC-fasteoi   acpi
 22:          8   IO-APIC-fasteoi   firewire_ohci
 23:    5682583   IO-APIC-fasteoi   eth0
NMI:          0   Non-maskable interrupts
LOC:      53130   Local timer interrupts
RES:          0   Rescheduling interrupts
CAL:          0   function call interrupts
TLB:          0   TLB shootdowns
TRM:          0   Thermal event interrupts
THR:          0   Threshold APIC interrupts
SPU:          0   Spurious interrupts
ERR:          2
~                                                
 

Comment 3 Bug Zapper 2008-05-14 10:57:14 UTC
Changing version to '9' as part of upcoming Fedora 9 GA.
More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 4 Neil Horman 2008-05-14 11:17:06 UTC
Ok, so when you indicate the system is slow, your cpu idle time is 99.2%, which
indicates that its not really doing much of anything, so its not like the system
is bogged down in a user space process anywnere

As for your interrupts, the eth0 interrupt count is high, but thats to be
expected, as its a gigabit NIC.  I've looked at the napi poll and top half
interrupt routines for this driver, and neither show any out of the ordinary
situations in which you might block cpu execution.

I would recommend the following three steps:

1) Enable debug on the NIC.  you can do this by adding this line:
debug=16
as a module option for r8169 in /etc/modprobe.conf (don't forget to re-insert
the module or reboot for the change to take effect).  Also make sure you're
syslog daemon is configured to capture at least KERN_INFO messages.  This should
tell us if the driver is encountering any operational problems

2) Use ethtool to check on NIC stats periodically.  Lets see if we're hitting
any errors frequently on the wire.

3) If (1) and (2) aren't fruitful, lets use oprofile to figure out where in the
kernel the cpu is spending most of its non-idle time.  That should give us a
clue where to look next.

Thanks!

Comment 5 Tony 2008-05-14 20:43:01 UTC
How do I configure syslog to record Kern Info messages?

Extra info. I just tried a KDE desktop and this problem is not apparent, logged
in started the network, no degradation of performance. Logged out of KDE and
tried to log back into Gnome as root with eth0 still active from the KDE session
and it hung at the 'This session is running as a privileged user' dialouge after
clicking continue.

Comment 6 Neil Horman 2008-05-15 11:15:48 UTC
to log kernel info messages you just need to make sure that *.info to
/var/log/messages is specified in /etc/syslog.conf

As for your new data, thats good info.  Its speaks to not so much a NIC problem
but rather a configuration problem in some desktop component (or some
configuration that the desktop relies on).  I've seen things like this be
related to bad hosts files or dns resoution files (since X relies on TCP to
transport keyboard and mouse events).  Can you send me your /etc/resolv.conf,
and /etc/hosts file.  In fact a sysreport, or sosreport would be perfect.  Thanks!

Comment 7 Tony 2008-05-16 06:14:37 UTC
OK sysreport e-mailed to you.

Comment 8 Neil Horman 2008-05-16 11:23:34 UTC
2 things immediately jump out at me:

1) you have debug messages turned on for various gnome components, which is
probably slowing things down a good deal.

2) You are getting a _ton_ of soft lockup errors, as you can see from
your/var/log/messages file.  

(2) is probably the primary source of your slowness.  Can you confirm that you
don't see those softlockups whe you use kde.  That should confirm for us that
we're on the right trail, and we can start figuring out why you're loking up
there.  Thanks!

Comment 9 Tony 2008-05-17 03:15:59 UTC
I've looked through those logs (where did all that info come from) and heres
what I've found. 
9 May no debug or soft lock ups.
10 May we see debug messages no soft lock ups.
11 May we see both.
I suspect these have been introduced by updates. All 3 days I was having the
same problem.

I decided to implement your suggestions in #4.
I don't have a /etc/modprobe.conf !!! I do have a
/etc/modprobe.d/modprobe.conf.dist but there's no R8169 in there.

How do I use oprofile?

Comment 10 Tony 2008-05-18 01:09:16 UTC
Further research into sysreport error logs. Those Debug and SoftLockup messages
are ocurring during boot so yes they are present in both Gnome and KDE. The last
soft lock up message was recorded may 17 at 16:13:12.

Comment 11 Neil Horman 2008-05-18 23:37:59 UTC
I wouldn't worry too much about the details of comment 4 at this point.  It
won't hurt to have the data, but given what you've provided in the sysreport, I
don't think we're going to find anything there.  This is almost certainly a bug
in a gnome component somewhere.  I just saw bz447138 come up as well.  I don't
see the same power management markers in your logs as in theres, but it may be
you  don't have debug turned on for that component.  The symptoms you experience
seem simmilar, and I do note that gdm is trying to set some pm settings in your
logs repeatedly, and failing at it.  It might be worthwhile to turn off power
management on your system under gnome to see if that helps the problem.  Also it
would be worth booting with nohz=off just to make sure this isn't some odd
behavior related to waking up from tickless mode.

As for oprofile, I usually point people at this:
http://www.redhat.com/magazine/012oct05/features/oprofile/
Its not tailored to your situation, but it provides a good overview of how to
use the oprofile tool.  I'd focus on the gnome power management issue first though.

Comment 12 Tony 2008-05-19 05:41:10 UTC
I've uninstalled gnome power manager and also package management tool. No
difference.

I've done several reboots today and cut the ends of the message logs,
highlighted in red where I start the network and in green when it settles down.
See attacment.

Comment 13 Tony 2008-05-19 05:41:49 UTC
Created attachment 305901 [details]
log tails

Comment 14 Tony 2008-05-19 06:00:19 UTC
I've just had a look at that bug 447138 and it certainly shares the same history
as this one. I was trawling through the Ubuntu forums today and found that alot
of them had machines locking up on kernel 2.6.24. Clearly something changed how
things get done from 2.6.23 to 2.6.24

Comment 15 Neil Horman 2008-05-19 11:21:35 UTC
grr, ok.  Those logs make this seem like a driver problem again.  I may need to
build you a debug kernel.  It appears from the logs that we report link up, but
we aren't ready to trulyready to transmit.  It looks like netif_stop_queue may
have been called on this driver (which would eventually lead to the netdev
watchdog timinig out).

Lots of stuff went in between 2.6.23 and 2.6.24 (not to mention 2.6.25).  One of
the bigger things though was the tickkless kernel support.,  Please try booting
with nohz=off and let me know if that changes your environment at all.  If not,
I'll start building a debug kernel for you.

Comment 16 Olivier Baudron 2008-05-19 12:54:35 UTC
I have the same problem with the driver r8169 (mac_version=0x0c, cf. source
code). My box is booting in level 3, so this is not related to
x11/gnome/whatever. The update to kernel-2.6.24.5-85.fc8 freezes as soon as the
network is used. Even the sequence modprobe / rmmod /modprobe on r8169 freezes.
The keyboard is also frozen and magic keys don't work. I compiled various
vanilla kernels and can confirm that  the problem is somewhere between:
2.6.23.17 and 2.6.24

Both, the API and the r8169 driver were updated. In particular, the MSI support
was added to the driver. The good news is that I succeded in porting the old
driver (2.6.23.17) to the new API (2.6.24), and I have now a working network
under 2.6.24. For now, I am adding piece by piece the new features in my driver
to track the problem down.

Comment 17 Tony 2008-05-20 05:54:21 UTC
Hooray, nohz=off fixed it. Added it to grub.conf so I don't forget. Thanks for
your help.
One last thing, any idea how to turn off those annoying gdm debug messages?

Comment 18 Neil Horman 2008-05-20 10:53:14 UTC
Debug messages can be turned off by editing the debug section of
/etc/gdm/custom.conf

As for the performance issue, my guess is you're seeing various instances of bz
204948.  After doing some more digging, apparently several people are seeing
various forms of what you are, and it stems from constantly entering an leaving
a power saving state.  This should settle out as we remove  cpu wakeup events. 
Until then, if you run powertop, that should give you suggestions for things you
can disable to keep your performance good if/when you renable tickless.