Bug 242572

Summary: Realtek 8169 (Netgear GA311 etc.) based gigabit cards freeze startup
Product: [Fedora] Fedora
Component: kernel
Version: 7
Hardware: i386
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: high
Reporter: Randy <ranster>
Assignee: Andy Gospodarek <agospoda>
QA Contact: Brian Brock <bbrock>
CC: blindg0, cebbert, chris.brown, davej, dylan.semler, elh, eric, haaseg, masaki.chikama, mmcguire74, nphilipp, peterm, romieu, thomas, webmaster
Fixed In Version: 2.6.23*
Doc Type: Bug Fix
Last Closed: 2008-02-16 01:55:36 UTC
Attachments (flags: none on all):
r8169-irq-reorder.patch
NetworkManager-0.6.5-5.fc7.linktest.src.rpm
/tmp/eb2a021c4710b98081daa797d5a729ac23c240cd.patch
/tmp/99f252b097a3bd6280047ba2175b605671da4a23.patch
f6-f7.patch
startup script for realtek NIC so system doesn't freeze
configuration used to bisect
Requested info
debug helper
dmesg output of patched r8169 driver
More debug helper
dmesg output of patched r8169 driver (2)
avoid useless NAPI poll scheduling
modified patch for 2.6.22
avoid useless NAPI poll scheduling (against 2.6.22.2)

Description Randy 2007-06-04 20:45:27 UTC
Description of problem:
Realtek 8169 based gigabit cards freeze startup

Version-Release number of selected component (if applicable):
F7

How reproducible:
When I installed F7 on a Dell system equipped with a Realtek 8169 chip gigabit card as an upgrade to 
FC6, I found that the system would freeze with "starting networking" on the screen during startup. The 
pointer doesn't even move with mouse movement. Unplugging/reconnecting the ethernet cable 
unfreezes the system and allows it to start up. Also if the built in ethernet is active and in use, the 
Realtek card has to be inactivated to avoid the freeze on startup, even when no cable is connected to it. 
A cable must be plugged into the card in that case, in order to unfreeze the system. I tested both a 
Hawking and a Netgear brand card and got the same results. FC6 had no such problem on my system.

Steps to Reproduce:
1. Install Realtek chip gigabit card
2. Connect ethernet cable
3. Boot system
  
Actual results:
System freezes until cable is disconnected/reconnected

Expected results:
System starts up normally

Additional info:
Dell Optiplex GX110 (PIII/800) 512 MB memory, Hawking gigabit card, Netgear gigabit switch

Comment 1 Mike McGuire 2007-06-05 08:55:34 UTC
Same issue as me:

http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=242301

Comment 2 Chuck Ebbert 2007-06-07 17:00:31 UTC
*** Bug 242357 has been marked as a duplicate of this bug. ***

Comment 3 John W. Linville 2007-06-07 17:37:48 UTC
*** Bug 242301 has been marked as a duplicate of this bug. ***

Comment 4 Mike McGuire 2007-06-07 17:59:13 UTC
So, John can we get the priority moved to high/urgent as this issue renders my 
laptop totally unusable unless I'm sitting next to a router/switch?? 

Comment 5 Mike McGuire 2007-06-08 01:36:41 UTC
So did somebody just change this back to med priority??  I also had a comment
with a log dump of the error.

Comment 6 Mike McGuire 2007-06-08 01:38:42 UTC
Oooops, I can also tell you the .3212 fc8 kernel displays the same issues...

Comment 7 Mike McGuire 2007-06-08 02:16:45 UTC
Randy,

Can you please change the priority on this to Urgent/High??  This bug is causing
serious issues especially stemming from remote management of servers using
RTL8169 chipsets or Laptops with RTL8169 chipsets that are used to manage these
remote devices.  Here is a dump of the error.  Sorry for being pushy but it just
blew our testbed and guess who they're pointing fingers at??  Here is a log dump
of the error:

Jun  2 20:13:52 localhost kernel: r8169: eth0: link up
Jun  2 20:14:10 localhost kernel: r8169: eth0: link up
Jun  2 20:14:10 localhost kernel: BUG: soft lockup detected on CPU#0!
Jun  2 20:14:10 localhost kernel:  [<c0451f3e>] softlockup_tick+0xa5/0xb4
Jun  2 20:14:10 localhost kernel:  [<c042e930>] update_process_times+0x3b/0x5e
Jun  2 20:14:10 localhost kernel:  [<c043d2bd>] tick_sched_timer+0x78/0xbb
Jun  2 20:14:10 localhost kernel:  [<c0439df5>] hrtimer_interrupt+0x12b/0x1b6
Jun  2 20:14:10 localhost kernel:  [<c043d245>] tick_sched_timer+0x0/0xbb
Jun  2 20:14:10 localhost kernel:  [<c0408534>] timer_interrupt+0x2c/0x32
Jun  2 20:14:10 localhost kernel:  [<c04521aa>] handle_IRQ_event+0x1a/0x3f
Jun  2 20:14:10 localhost kernel:  [<c04535ea>] handle_level_irq+0x81/0xc7
Jun  2 20:14:10 localhost kernel:  [<c04072c7>] do_IRQ+0xb8/0xd1
Jun  2 20:14:10 localhost kernel:  [<c04058ff>] common_interrupt+0x23/0x28
Jun  2 20:14:10 localhost kernel:  [<c04058ff>] common_interrupt+0x23/0x28
Jun  2 20:14:10 localhost kernel:  [<c0561704>] yenta_interrupt+0x13/0xb4
Jun  2 20:14:10 localhost kernel:  [<c04521aa>] handle_IRQ_event+0x1a/0x3f
Jun  2 20:14:10 localhost kernel:  [<c04535ea>] handle_level_irq+0x81/0xc7
Jun  2 20:14:10 localhost kernel:  [<c0453569>] handle_level_irq+0x0/0xc7
Jun  2 20:14:10 localhost kernel:  [<c04072bb>] do_IRQ+0xac/0xd1
Jun  2 20:14:10 localhost kernel:  [<c04058ff>] common_interrupt+0x23/0x28
Jun  2 20:14:10 localhost kernel:  [<c042b2dc>] __do_softirq+0x54/0xba
Jun  2 20:14:10 localhost kernel:  [<c04071b7>] do_softirq+0x59/0xb1
Jun  2 20:14:10 localhost kernel:  [<c0453569>] handle_level_irq+0x0/0xc7
Jun  2 20:14:10 localhost kernel:  [<c042b194>] irq_exit+0x38/0x6b
Jun  2 20:14:10 localhost kernel:  [<c04072cc>] do_IRQ+0xbd/0xd1
Jun  2 20:14:10 localhost kernel:  [<c04058ff>] common_interrupt+0x23/0x28
Jun  2 20:14:10 localhost kernel:  [<f8b0007b>] rtl8169_init_one+0x5c7/0x9d7 [r8169]
Jun  2 20:14:10 localhost kernel:  [<c060171d>] _spin_unlock_irqrestore+0x8/0x9
Jun  2 20:14:10 localhost kernel:  [<f8aff1f7>] rtl8169_open+0x139/0x194 [r8169]
Jun  2 20:14:10 localhost kernel:  [<c05a2f8d>] dev_open+0x2b/0x62
Jun  2 20:14:10 localhost kernel:  [<c05a19e1>] dev_change_flags+0x47/0xe4
Jun  2 20:14:10 localhost kernel:  [<c05a977b>] rtnl_setlink+0x264/0x365
Jun  2 20:14:10 localhost kernel:  [<c05a9517>] rtnl_setlink+0x0/0x365
Jun  2 20:14:10 localhost kernel:  [<c05a8dad>] rtnetlink_rcv_msg+0x1c1/0x1e6
Jun  2 20:14:10 localhost kernel:  [<c05b4e19>] netlink_run_queue+0x50/0xbe
Jun  2 20:14:10 localhost kernel:  [<c05a8bec>] rtnetlink_rcv_msg+0x0/0x1e6
Jun  2 20:14:10 localhost kernel:  [<c05a8bab>] rtnetlink_rcv+0x25/0x3d
Jun  2 20:14:10 localhost kernel:  [<c05b51b6>] netlink_data_ready+0x12/0x4c
Jun  2 20:14:10 localhost kernel:  [<c05b426a>] netlink_sendskb+0x19/0x30
Jun  2 20:14:10 localhost kernel:  [<c05b5198>] netlink_sendmsg+0x277/0x283
Jun  2 20:14:10 localhost kernel:  [<c0599180>] sock_sendmsg+0xd0/0xeb
Jun  2 20:14:10 localhost kernel:  [<c0436e71>] autoremove_wake_function+0x0/0x35
Jun  2 20:14:10 localhost kernel:  [<c0436e71>] autoremove_wake_function+0x0/0x35
Jun  2 20:14:10 localhost kernel:  [<c04e7100>] copy_from_user+0x3a/0x66
Jun  2 20:14:10 localhost kernel:  [<c059932d>] sys_sendmsg+0x192/0x1f7
Jun  2 20:14:10 localhost kernel:  [<c0599e0d>] sys_recvmsg+0x1b9/0x1cd
Jun  2 20:14:10 localhost kernel:  [<c04e7350>] copy_to_user+0x3c/0x50
Jun  2 20:14:10 localhost kernel:  [<c0599c3c>] move_addr_to_user+0x50/0x68
Jun  2 20:14:13 localhost kernel:  [<c059a0d6>] sys_getsockname+0x9f/0xb0
Jun  2 20:14:13 localhost kernel:  [<c06016f4>] _spin_lock_bh+0x8/0x18
Jun  2 20:14:13 localhost kernel:  [<c059adb6>] release_sock+0x12/0x9d
Jun  2 20:14:13 localhost kernel:  [<c059a4fc>] sys_socketcall+0x240/0x261
Jun  2 20:14:13 localhost kernel:  [<c0404f70>] syscall_call+0x7/0xb
Jun  2 20:14:13 localhost kernel:  =======================
Jun  2 20:14:13 localhost kernel: r8169: eth0: link down
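
For anyone reading a trace like this one, here is a minimal sketch for pulling
out only the frames that belong to the driver module. The name `module_frames`
is made up for illustration; it simply filters kernel log text on stdin for
lines tagged with the module name in square brackets.

```shell
# module_frames MODULE
# Reads kernel log text on stdin and prints the symbol name of every
# stack frame that is tagged with [MODULE].
module_frames() {
    grep -F "[$1]" |
        sed -e 's/.*>] *//' -e 's/+.*//'   # drop the address prefix, then the +offset/size suffix
}

# Usage (file name is a placeholder for wherever the dump was saved):
# module_frames r8169 < lockup.log
```

Fed the dump above, this leaves rtl8169_init_one and rtl8169_open, the two
driver frames underneath the interrupt entries.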

Comment 8 John W. Linville 2007-06-08 13:35:14 UTC
Andy, let me know if you need help locating suitable hardware...

Comment 9 Andy Gospodarek 2007-06-08 19:23:24 UTC
What device do you have?  I've got one of these:

05:04.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit
Ethernet (rev 10)
        Subsystem: Netgear Unknown device 311a
        Flags: bus master, 66MHz, medium devsel, latency 66, IRQ 19
        I/O ports at 2000 [size=256]
        Memory at f2004800 (32-bit, non-prefetchable) [size=256]
        [virtual] Expansion ROM at 88000000 [disabled] [size=128K]
        Capabilities: [dc] Power Management version 2

05:04.0 0200: 10ec:8169 (rev 10)
        Subsystem: 1385:311a
        Flags: bus master, 66MHz, medium devsel, latency 66, IRQ 19
        I/O ports at 2000 [size=256]
        Memory at f2004800 (32-bit, non-prefetchable) [size=256]
        [virtual] Expansion ROM at 88000000 [disabled] [size=128K]
        Capabilities: [dc] Power Management version 2

and it seems to boot both with and without the cable connected.  

Can you also give me some more specific information about the Dell system you
are using?

I'm also attaching a test patch, to see if that might help out.  I'm doubtful
that it will, but it might be worth trying.







Comment 10 Andy Gospodarek 2007-06-08 19:24:23 UTC
Created attachment 156603
r8169-irq-reorder.patch

test patch -- untested as I cannot reproduce the problem on my system

Comment 11 Mike McGuire 2007-06-08 21:20:19 UTC
Here's my device, a Sager laptop.  There has never been any issue with Fedora 4,
5, or 6, only with 7.  So how might the patch be applied?  There is another
bug report, "eth0 boot fail", with the same issues and the RTL8169 mentioned.

02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit
Ethernet (rev 10)
        Subsystem: CLEVO/KAPOK Computer Unknown device 0470
        Flags: bus master, 66MHz, medium devsel, latency 128, IRQ 10
        I/O ports at a200 [size=256]
        Memory at d0008000 (32-bit, non-prefetchable) [size=256]
        [virtual] Expansion ROM at 88000000 [disabled] [size=128K]
        Capabilities: [dc] Power Management version 2


Comment 12 Mike McGuire 2007-06-09 12:46:57 UTC
OK, I just found a little bit more.  When booting Fedora 7, If I select "I" for
interactive mode, when it gets to the line:

Start Service Network   Y/N (C)ontinue   I select Yes and error happens

FATAL:   Module not found   (Next Line)

Bringing up loopback interface      (OK)

FATAL:   Module not found


But the computer boots without plugging the ethernet cable in...

Comment 13 Mike McGuire 2007-06-09 12:59:18 UTC
But again starting with the cable plugged, I go to "I"nteractive mode, and it
gets to the line:

Start NetworkManagerDispatcher and it hangs until I disconnect the cable, it
will probably get past this part if I shut off the Service for Dispatcher

Comment 14 Jonathan Jordan 2007-06-09 16:55:21 UTC
I found a driver for the 8169 chipset on the Realtek website that was released
May 23, but when I try to compile it, I get a few errors. You can find the
driver here,
http://www.realtek.com.tw/downloads/downloadsView.aspx?Langid=1&PNid=13&PFid=4&Level=5&Conn=4&DownTypeID=3&GetDown=false&Downloads=true#5,7,8,10,982

Can someone try to install it?

Comment 15 Nils Philippsen 2007-06-10 12:47:29 UTC
I'm currently installing a machine that exhibits the problem. I didn't know
that it made a difference if you (re)plug the cable, but found that after
loading "fail-safe" i.e. very conservative BIOS settings, there's a 50/50 chance
that it works instead of hanging. Once it's installed, I can try the patch above
and see if it makes a difference.

Comment 16 Nils Philippsen 2007-06-10 12:51:52 UTC
BTW, my card is in "lspci" (transcribed from the installer shell):

00:09.0 Ethernet controller: D-Link System Inc DGE-528T Gigabit Ethernet Adapter
(rev 10)

and "lspci -n":

00:09.0 0200: 1186:4300 (rev 10)

Comment 17 Jonathan Jordan 2007-06-10 20:00:58 UTC
I would like to try the patch above, but do not know how to install it. Can
someone instruct me on how to do so?

Comment 18 Andy Gospodarek 2007-06-11 14:31:52 UTC
(In reply to comment #13)
> But again starting with the cable plugged, I go to "I"nteractive mode, and it
> gets to the line:
> 
> Start NetworkManagerDispatcher and it hangs until I disconnect the cable, it
> will probably get past this part if I shut off the Service for Dispatcher

Interesting.  NetworkManager should poll to determine if there is link on the
system, so if checking for link never returns then this might explain the hang.

I'd be curious to know if the system boots OK without NetworkManager enabled,
and then what happens when calling `ethtool eth0` and `mii-tool eth0`, to see if
those calls ever return.
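
A rough way to check whether those calls return is sketched below. It assumes
the machine stays responsive enough for a userspace timer to fire, so a full
freeze like the one reported will not be caught by it. `probe` is a
hypothetical helper name, and `timeout` is the GNU coreutils utility, which may
not be present on an F7 install.

```shell
# probe SECONDS COMMAND [ARGS...]
# Runs COMMAND and reports whether it came back within SECONDS.
probe() {
    limit=$1
    shift
    status=0
    timeout "$limit" "$@" >/dev/null 2>&1 || status=$?
    if [ "$status" -eq 124 ]; then      # 124 is timeout's "timed out" exit status
        echo "hung"
    else
        echo "returned ($status)"
    fi
}

# Usage (commented out because it needs the real interface):
# probe 10 ethtool eth0
# probe 10 mii-tool eth0
```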

Comment 19 Andy Gospodarek 2007-06-11 14:34:25 UTC
(In reply to comment #15)
> I'm currently installing a machine that exhibits the problem. I didn't know
> that it made a difference if you (re)plug the cable, but found that after
> loading "fail-safe" i.e. very conservative BIOS settings, there's a 50/50 chance
> that it works instead of hanging. Once it's installed, I can try the patch above
> and see if it makes a difference.

Great info, Nils!  I still cannot reproduce this, but it seems the only
r8169-based card I have isn't the same as everyone else's.




Comment 20 Andy Gospodarek 2007-06-11 14:52:07 UTC
(In reply to comment #17)
> I would like to try the patch above, but do not know how to install it. Can
> someone instruct me on how to do so?

You will need to download and install the necessary SRPM (usually 'src' in
place of 'i386' or 'x86_64' as the arch) and build a new rpm using the rpmbuild
command.  A guide to using rpm/rpmbuild can be found here:

http://www.redhat.com/magazine/002dec04/features/betterliving-part2/

But what you basically want to do is:

1. install the rpm
2. copy the attachment from comment #10 to the file called
linux-kernel-test.patch (should be in /usr/src/redhat/SOURCES)
3. modify the kernel-2.6.spec file (in /usr/src/redhat/SPECS) and remove the '#'
from the line that says

#%define buildid .local

4. type `rpmbuild -ba kernel-2.6.spec`
5. wait for a while
6. go get your new rpms from the /usr/src/redhat/RPMS directory and install them.

I would *strongly* recommend reading up on rpms first, but those commands should
get you going.
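
The steps above can be sketched as a small script. This is only an outline
under assumptions: `run` and `build_test_kernel` are made-up helper names, the
SRPM and patch file names are placeholders you pass in, and the sed expression
assumes the spec line reads exactly as quoted in step 3. By default the
commands are only printed; set DRY_RUN=0 (as root) to really run them.

```shell
# Print the command when DRY_RUN is 1 (the default here), otherwise run it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# build_test_kernel KERNEL_SRPM PATCH_FILE
build_test_kernel() {
    srpm=$1               # the kernel source rpm you downloaded
    patch_file=$2         # the attachment from comment #10, saved locally
    top=/usr/src/redhat

    run rpm -ivh "$srpm"                                            # step 1
    run cp "$patch_file" "$top/SOURCES/linux-kernel-test.patch"     # step 2
    run sed -i 's/^#%define buildid/%define buildid/' \
        "$top/SPECS/kernel-2.6.spec"                                # step 3
    run rpmbuild -ba "$top/SPECS/kernel-2.6.spec"                   # steps 4 and 5
    run ls -R "$top/RPMS"                                           # step 6: find the new rpms
}

# Usage (file names are placeholders):
# build_test_kernel kernel-VERSION.src.rpm r8169-irq-reorder.patch
```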

Comment 21 Mike McGuire 2007-06-11 15:21:19 UTC
(In reply to comment #18)
> (In reply to comment #13)
> > But again starting with the cable plugged, I go to "I"nteractive mode, and
> > it gets to the line:
> > 
> > Start NetworkManagerDispatcher and it hangs until I disconnect the cable,
> > it will probably get past this part if I shut off the Service for Dispatcher
> 
> Interesting.  NetworkManager should poll to determine if there is link on the
> system so if checking for link never returns then this might explain the
> hang.
> 
> I'd be curious to know if the system boots ok without NetworkManager enabled
> and then what happens when calling `ethtool eth0` and `mii-tool eth0` to see
> if that call ever returns.

I can confirm that when the NetworkManager and NetworkManagerDispatcher services
are disabled, the computer boots fine whether the cable is plugged in or
disconnected.

With the services disabled, I can open a terminal window and type:
"service NetworkManager start"
It reports "OK", and then the system hangs. The hang is cleared by plugging
the cable in or disconnecting it, as the case may be....




Comment 22 Andy Gospodarek 2007-06-11 17:57:59 UTC
(In reply to comment #21)
> 
> I can confirm, that when NetworkManager and NetworkManagerDispatcher services 
> are disabled, the computer will boot fine with the cable plugged in or whether 
> the cable is disconnected it still boots.  
> 
> With the services disabled, I can open terminal window and type:
> "service NetworkManager start"
> and it will verify it with the "OK"
> and the system will hang which is cleared by plugging the cable in or 
> disconnecting the cable as the case may be....
> 

Thanks for the feedback!  

With NetworkManager disabled, can you also try to run these two commands:

# ethtool eth0
# mii-tool eth0

And let me know if the box hangs?

I'll start looking at the NetworkManager sources and see what ioctl's/sysfs
entries might be getting called.

Comment 23 Nils Philippsen 2007-06-11 19:12:11 UTC
Hmm, with the test patch networking doesn't work at all, not even with
replugging the cable.

Comment 24 Andy Gospodarek 2007-06-11 19:41:59 UTC
Well that's not good... It sounds like we might have this narrowed down to
something that NetworkManager does anyway, so it probably isn't an
initialization problem.

Comment 25 Andy Gospodarek 2007-06-11 20:57:47 UTC
Created attachment 156759
NetworkManager-0.6.5-5.fc7.linktest.src.rpm

I'm not sure this will make a difference, but I did pull 2 patches from
NetworkManager to see if they might help.  

I'd be pretty surprised if this made a difference, but it might be worth a try
to build it and try it.

Comment 26 Mike McGuire 2007-06-11 22:54:57 UTC
(In reply to comment #22)
> Thanks for the feedback!  
> 
> With NetworkManager disabled, can you also try to run these two commands:
> 
> # ethtool eth0
> # mii-tool eth0
> 
> And let me know if the box hangs?
> 
> I'll start looking at the NetworkManager sources and see what ioctl's/sysfs
> entries might be getting called.

OK, with the NM services disabled, and the cable disconnected running the
ethtool and mii-tool I get this without any hangs

(ethtool)
Settings for eth0:
	Supported ports: [ TP ]
	Supported link modes:   10baseT/Half 10baseT/Full 
	                        100baseT/Half 100baseT/Full 
	                        1000baseT/Full 
	Supports auto-negotiation: Yes
	Advertised link modes:  10baseT/Half 10baseT/Full 
	                        100baseT/Half 100baseT/Full 
	                        1000baseT/Full 
	Advertised auto-negotiation: Yes
	Speed: 100Mb/s
	Duplex: Half
	Port: Twisted Pair
	PHYAD: 0
	Transceiver: internal
	Auto-negotiation: on
	Supports Wake-on: pumbg
	Wake-on: g
	Current message level: 0x00000033 (51)
	Link detected: yes

(mii-tool)

SIOCGMIIPHY on 'eth0' failed: No such device


Now with NM services enabled with the cable connected:

(ethtool)

Settings for eth0:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Full 
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Full 
        Advertised auto-negotiation: Yes
        Speed: 100Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: pumbg
        Wake-on: g
        Current message level: 0x00000033 (51)
        Link detected: yes

(mii-tool)

eth0: negotiated 100baseTx-FD flow-control, link ok

One thing I did notice was ethtool is displaying "Link Detected: yes" when the
cable is disconnected.
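
Since ethtool reported link with the cable unplugged, it may be worth logging
just that one field while plugging and unplugging the cable. A small sketch
follows; `link_detected` is a made-up helper that reads `ethtool` output on
stdin and prints the value of the "Link detected" line.

```shell
# Reads `ethtool IFACE` output on stdin and prints the "Link detected" value.
link_detected() {
    sed -n 's/^[[:space:]]*Link detected:[[:space:]]*//p'
}

# Usage (needs the real interface, so it is commented out here):
# ethtool eth0 | link_detected
```

Running it once with the cable in and once with it out should make a stuck
"yes" easy to spot.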





Comment 27 Mike McGuire 2007-06-11 23:01:52 UTC
(In reply to comment #25)
> Created an attachment (id=156759)
> NetworkManager-0.6.5-5.fc7.linktest.src.rpm
> 
> I'm not sure this will make a difference, but I did pull 2 patches from
> NetworkManager to see if they might help.  
> 
> I'd be pretty surprised if this made a difference, but it might be worth a try
> to build it and try it.

I'm getting an error "cannot install source packages"



Comment 28 Nils Philippsen 2007-06-12 07:21:11 UTC
re comment #24: I don't think it's something in NetworkManager, as I don't use
it on the machine. In fact, I saw it first while installing the machine, both
when trying to use DHCP and static IP. Now the machine is setup for DHCP on
bootup. Sorry for not saying this in so many words earlier...

Comment 29 Mike McGuire 2007-06-12 13:06:16 UTC
I'm leaning towards something other than NetworkManager also.  I just turned 
off the NM services, changed the network device configuration to "activate on 
boot" shutdown the computer, plugged in the cable and turned the computer on.  
The computer booted fine and had network connectivity.

Tried it with the same settings BUT with the cable disconnected and it hung 
on "starting network" and wouldn't continue until I plugged the cable in.

Tried the same scenario but this time using the wireless and settings 
of "activate on boot" It hesitated on "starting network" and jumped to the 
line by line activation sequence, BUT it didn't hang the computer, it just 
said FAILED and continued on to login.

I am using static ip addressing and haven't tried with DHCP.  

Comment 30 Andy Gospodarek 2007-06-12 15:42:35 UTC
(In reply to comment #27)
> 
> I'm getting an error "cannot install source packages"
> 

Check to make sure you have the 'rpmbuild' rpm installed and that the directory
/usr/src/redhat exists.

Comment 31 Andy Gospodarek 2007-06-12 15:55:46 UTC
Thanks for the feedback Nils (comment #28) and Mike (comment #29).

Did either of you notice that this was a hard lockup?  Could you switch to
another vty with ctrl+alt+F2/F3/etc?  

Also, does anyone happen to remember what the latest working FC6 kernel might
have been on their system?  If you can verify with some reliability that

kernel-2.6.20-1.2952.fc6

worked well on your system, it will help me narrow down the differences
between that last working kernel and the current one.


Comment 32 Mike McGuire 2007-06-12 16:41:43 UTC
It worked fine with release "kernel-2.6.20-1.2952.fc6".  I was updated all
the way until Fedora 7 final.

Comment 33 Andy Gospodarek 2007-06-12 18:43:40 UTC
Well there appear to be 6 patches that directly affect r8169 that have been
pushed upstream between kernel-2.6.20-1.2952.fc6 and kernel-2.6.21-1.3194.fc7.
Those changes are:

commit 1371fa6db0bbb8e23f988a641f5ae7361bc629dd
Author: Francois Romieu <romieu.com>
Date:   Mon Apr 2 23:01:11 2007 +0200

    r8169: fix suspend/resume for down interface

    The PM hooks are no-op if the r8169 interface is down (i.e. !IFF_UP).
    However, as the chipset is enabled, the device will not work after a
    suspend/resume cycle. The patch always issue the required PCI suspend
    sequence and removes the module unload/reload workaround.

    Signed-off-by: Arnaud Patard <apatard>
    Signed-off-by: Francois Romieu <romieu.com>
    Signed-off-by: Jeff Garzik <jeff>

commit 99f252b097a3bd6280047ba2175b605671da4a23
Author: Francois Romieu <romieu.com>
Date:   Mon Apr 2 22:59:59 2007 +0200

    r8169: issue request_irq after the private data are completely initialized

    The irq handler schedules a NAPI poll request unconditionally as soon as
    the status register is not clean. It has been there - and wrong - for
    ages but a recent timing change made it apparently easier to trigger.

    Signed-off-by: Francois Romieu <romieu.com>
    Cc: Jay Cliburn <jacliburn>
    Signed-off-by: Jeff Garzik <jeff>

commit 2efa53f373ed811d4860904f5205b8a3b376e253
Author: Francois Romieu <romieu.com>
Date:   Fri Mar 9 00:00:05 2007 +0100

    r8169: fix a race between PCI probe and dev_open

    Initialize the timer with the rest of the private-struct.

    Signed-off-by: Jeff Garzik <jeff>
    Signed-off-by: Francois Romieu <romieu.com>
    Signed-off-by: Jeff Garzik <jeff>

commit 9e0db8ef4a8c8fd6f3a506259975d7f8db962421
Author: Francois Romieu <romieu.com>
Date:   Thu Mar 8 23:59:54 2007 +0100

    r8169: revert bogus BMCR reset

    Added during bf793295e1090af84972750898bf8470df5e5419

    The current code requests a reset but prohibits autoneg, 1000 Mb/s,
    100 Mb/s and full duplex. The 8168 does not like it at all.

    Signed-off-by: Francois Romieu <romieu.com>
    Signed-off-by: Jeff Garzik <jeff>

commit eb2a021c4710b98081daa797d5a729ac23c240cd
Author: Francois Romieu <romieu.com>
Date:   Thu Feb 15 23:37:21 2007 +0100

    r8169: RTNL and flush_scheduled_work deadlock

    flush_scheduled_work() in net_device->close has a slight tendency
    to deadlock with tasks on the workqueue that hold RTNL.

    rtl8169_close/down simply need the recovery tasks to not meddle
    with the hardware while the device is going down.

    Signed-off-by: Francois Romieu <romieu.com>
    Signed-off-by: Jeff Garzik <jeff>

commit dcb92f8804717b845db70939b523c5d152a2e0ea
Author: Al Viro <viro.org.uk>
Date:   Fri Feb 9 16:39:00 2007 +0000

    [PATCH] uintptr_t is unsigned long, not u32

    Signed-off-by: Al Viro <viro.org.uk>
    Signed-off-by: Linus Torvalds <torvalds>


None of them jump out to me as the obvious cause of this regression, but it
seems unlikely that the problem comes from:

commit 1371fa6db0bbb8e23f988a641f5ae7361bc629dd
Author: Francois Romieu <romieu.com>
Date:   Mon Apr 2 23:01:11 2007 +0200

    r8169: fix suspend/resume for down interface

(since we aren't doing suspend/resume in these cases)

commit 2efa53f373ed811d4860904f5205b8a3b376e253
Author: Francois Romieu <romieu.com>
Date:   Fri Mar 9 00:00:05 2007 +0100

    r8169: fix a race between PCI probe and dev_open

(this fix prevents a panic -- it should not cause a hang)

commit 9e0db8ef4a8c8fd6f3a506259975d7f8db962421
Author: Francois Romieu <romieu.com>
Date:   Thu Mar 8 23:59:54 2007 +0100

    r8169: revert bogus BMCR reset

(hardware specific call that was seen as an error previously)

commit dcb92f8804717b845db70939b523c5d152a2e0ea
Author: Al Viro <viro.org.uk>
Date:   Fri Feb 9 16:39:00 2007 +0000

    [PATCH] uintptr_t is unsigned long, not u32

(this is a simple fix that should not cause our problem)

If I ruled these out correctly, then we are left with only these 2 patches as
suspects:

commit 99f252b097a3bd6280047ba2175b605671da4a23
Author: Francois Romieu <romieu.com>
Date:   Mon Apr 2 22:59:59 2007 +0200

    r8169: issue request_irq after the private data are completely initialized

commit eb2a021c4710b98081daa797d5a729ac23c240cd
Author: Francois Romieu <romieu.com>
Date:   Thu Feb 15 23:37:21 2007 +0100

    r8169: RTNL and flush_scheduled_work deadlock

I'm leaning towards the first one being the issue since that patch reordered
when we first expect to start getting interrupts, but the second one could be an
issue as well if there manages to be some bad interaction with the linkwatch
code.  Is there any way that someone with the hardware could do a few builds and
try to back out these 2 patches (maybe one at a time) and see if it makes a
difference?
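
To make that request concrete, here is a sketch that prints one reverse-apply
command per suspect patch, so each can be tried against a pristine copy of the
F7 kernel tree. `backout_commands` and the paths in the usage line are
placeholders; `-p1`, `-R`, `--dry-run` and `-i` are GNU patch options. Drop
`--dry-run` from the printed command once it reports a clean reverse apply.

```shell
# backout_commands TREE PATCH [PATCH...]
# Prints, for each patch, the command that would reverse-apply it in TREE.
# Patch paths should be absolute, because the command changes directory first.
backout_commands() {
    tree=$1
    shift
    for p in "$@"; do
        echo "(cd $tree && patch -p1 -R --dry-run -i $p)"
    done
}

# Usage (paths are placeholders):
# backout_commands /path/to/kernel-tree /tmp/first.patch /tmp/second.patch
```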

Comment 34 Andy Gospodarek 2007-06-12 18:44:48 UTC
Created attachment 156815
/tmp/eb2a021c4710b98081daa797d5a729ac23c240cd.patch

Comment 35 Andy Gospodarek 2007-06-12 18:46:34 UTC
Created attachment 156816
/tmp/99f252b097a3bd6280047ba2175b605671da4a23.patch

Comment 36 Andy Gospodarek 2007-06-12 18:47:20 UTC
Created attachment 156817
f6-f7.patch

differences between fc6 and f7 for r8169 driver

Comment 37 Chuck Ebbert 2007-06-12 19:40:23 UTC
http://www.spinics.net/lists/netdev/msg31920.html

Comment 38 Francois Romieu 2007-06-13 00:01:46 UTC
You can forget Rolf's problem: it is a different one.

-- 
Ueimor


Comment 39 Mike McGuire 2007-06-13 00:10:36 UTC
Ok Andy I just tried the patches from Comment 35, 36, 37,  Long story short, no
change in the hang up.

1st.  Installed comment 35 patch, and rebooted, no change
2nd.  Removed comment 35 patch and installed comment 36 patch and rebooted, no
change
3rd.  Installed comment 35 and 36 patches and rebooted, no change
4th.  Removed comment 35 and 36 patches and installed comment 37 patch and
rebooted, no change.... 

Comment 40 Nils Philippsen 2007-06-13 08:00:16 UTC
Just wanted to confirm that the latest FC6 kernel works here as well.

Comment 41 Andreas Adamis 2007-06-14 00:29:24 UTC
hmmm I see lots of network manager here and in my case it's not NM that hangs 
the deal.
The system will freeze if the eth0 interface is brought up during BOOT process 
here. Regardless of whether I'm using NM or not.

The only remedy I found to be able to boot into X, was to go and untick the 
"load at system startup" check box on the network settings for eth0.

Once I do this, the interface is ignored and the system boots normally.

*Now* once I boot and I'm into X, I can bring up a terminal, type "ifup eth0" 
and everything works like there was no problem EVER.
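
The same workaround can be scripted. This sketch assumes the checkbox
corresponds to the ONBOOT key in the interface's ifcfg file (normally under
/etc/sysconfig/network-scripts/ on Fedora); `set_onboot` is a made-up helper
name, and `sed -i` is the GNU sed in-place option.

```shell
# set_onboot FILE yes|no
# Rewrites the ONBOOT= line of an ifcfg file in place.
set_onboot() {
    sed -i "s/^ONBOOT=.*/ONBOOT=$2/" "$1"
}

# Usage (as root; commented out because it edits system configuration):
# set_onboot /etc/sysconfig/network-scripts/ifcfg-eth0 no
# ...reboot, log in, then bring the interface up by hand:
# ifup eth0
```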

Note1: The cable is plugged through the whole process. No tampering with it.
Note2: I, too, verify that everything works perfectly (with this NIC) in
Fedora 6.
Note3: For the record, a cheap lame NIC will work perfectly on boot or any
other instance.

I hope it helps.


Comment 42 Gregory Haase 2007-06-14 02:24:19 UTC
I found this bug after just doing a fresh install on a machine. In reading
through, it seemed there was some confusion as to whether it would work on boot
with the cable plugged in or unplugged, or whether it could bring up the
ethernet card at boot at all.

I started testing, and I found some interesting results:

1.) Started PC with network cable unplugged. The system froze at Starting
Networking. I plugged in the network cable. The computer resumed the boot
process. ifconfig revealed that eth0 had come up and successfully grabbed an IP
address from my dhcp server.

2.) Started PC with network cable plugged in. The system froze at Starting
Networking. I removed the network cable. The computer resumed the boot process.
Logged in and issued "service network restart" and eth0 came up and successfully
grabbed an IP address from my dhcp server.

3.) Started the PC with the network cable plugged in. The system froze at
Starting Networking. I unplugged the network cable. As soon as the connectivity
light switched off, I put the cable back in as quickly as possible. The computer
resumed the boot process. ifconfig revealed that eth0 had come up and
successfully grabbed an IP address from my dhcp server.

So it seems that the card is waiting for some kind of status, and it can't get
it until there is a change in connectivity. It seems like it doesn't matter
whether the cable is already plugged in or removed - it's the act of plugging it
in or removing it. As soon as it gets going, it doesn't matter if it's plugged
in or not. Since it takes a while for the network to timeout, there is plenty of
time to plug the cable back in and get an IP address after removing it.

I've now performed the actions outlined in step 3 at least 6 times successfully.

Comment 43 David 2007-06-18 22:22:53 UTC
Any update on this bug?  I still have two machines doing this.

Comment 44 Mike McGuire 2007-06-20 14:19:53 UTC
It seems this issue has fallen by the wayside....

Comment 45 Andy Gospodarek 2007-06-20 20:32:11 UTC
Sorry this issue hasn't been worked on for the last few days.  I was out of town
and unable to look at it further until today (I JUST got back).  I also have been
unable to locate hardware that can actually reproduce this issue, so I'd rather
not just toss out any old idea to solve the problem and make everyone else test
this for me.

If a kernel with the patch from comment #36 backed out (applied with -R) still
showed the same problem, I certainly question how this could be driver related,
since the code produced would be exactly the same as what was shipped with the
latest FC6 kernel.

Has anyone tried a forced-install of the FC6 kernel on their F7 box and
indicated whether or not they see the same failure?  

Also, when the system locks up, is the console still responsive (like can you
type <ENTER> and see that the screen scrolls)?

Comment 46 David 2007-06-20 22:54:23 UTC
Hi Andy,

I have forced on the last FC6 kernel and the machine fires up perfectly, no more
pulling ethernet cables!

When it locks up on the F7 kernel it's locked up COMPLETELY, I mean even the
number lock and caps lock are frozen!

If you see my bug # 242181 which was around longer than this bug I already tried
the FC6 kernel on the 18th of this month.

Cheers,
David


Comment 47 Mike McGuire 2007-06-20 23:53:14 UTC
Hi Andy,

I concur with David: forcing the FC6 kernel seems to alleviate the issues with
lockups. The kernel I now have installed is:

kernel-2.6.20-1.2948.fc6.i686.rpm

Mike....

Comment 48 David 2007-06-21 08:12:57 UTC
Hi Mike,

I tried the latest FC6 kernel 2952 and it was fine.



Hi Andy,

There are a number of bugs with the F7 kernel.  I have a bug open (bug # 243953)
where it also won't scan with my scanner, and the last F7 kernel won't
start X on another machine as it's messing up the PCI cards badly (bug # 242391).



Comment 49 Andreas Adamis 2007-06-25 14:16:32 UTC
Hello Andy!

Since I am fairly new to linux (and FC) I don't know how to force install FC6
kernel (I tried installing FC6 altogether and it works fine) but I can tell you
that the system hangs and freezes during startup so basically there's still no
console yet.

If I boot with interactive boot and I don't bring up eth0, everything boots fine
and then when I'm logged on, I can just go to a console and "ifup eth0" and
everything will work perfectly without messing with the -already connected- cable.

Hope it helps


Comment 50 Andy Gospodarek 2007-06-25 17:46:38 UTC
Thanks for the feedback, Andreas.  I'm still trying to narrow down what specifically
might be causing this problem, so I hope to provide some F7 test kernels
sometime soon so we can work through this on F7 rather than FC6.  Look for those
soon!

Comment 51 Mike McGuire 2007-07-07 01:11:56 UTC
Any progress being made on this???  The new .3255 kernel in the
"updates-testing" repo doesn't correct the issues with the realtek 8169

Comment 52 Eric Kerby 2007-07-07 22:09:30 UTC
Created attachment 158726 [details]
startup script for realtek NIC so system doesn't freeze

I would like to add that I have been seeing this issue with F7 as well.  My
temporary solution has been to disable the "activate on boot" option for this
card.  I wrote a startup script to separately activate the Realtek 8169 NIC
after the normal network startup.  I am including this script as an attachment.


Note that if I change the "# chkconfig: 2345 75 90" line to make the boot
priority closer to 10 (which is /etc/init.d/network's default boot priority),
the freeze will still occur.  I haven't experimented enough to see just how
much I can change the boot priority, but 75 works for me.

The only problem with this approach is that named and dhcpd (this computer is a
router for my network) have to be restarted after I log in.  At least the
system doesn't freeze, though.

Comment 53 Thomas Müller 2007-07-09 10:09:01 UTC
Yesterday I upgraded from Fedora Core 6 to Fedora 7 and I'm experiencing the
same problem.
I'll append some information about my system, maybe someone will finally be able
to see a pattern.

I still have the latest FC6 kernel (kernel-2.6.20-1.2962.fc6) installed and the
nic works perfectly with it.

When I boot the F7 kernel (kernel-2.6.21-1.3228.fc7) the system hangs when the
initscripts try to bring up the interface. It's configured to use a static IP.
Boot continues when I unplug and replug the network cable.


Mainboard: Asus P4B266
NIC: Eusso UEC2300-32R


/var/log/messages:
Jul  9 10:44:45 linux kernel: r8169: eth0: link down
Jul  9 10:44:45 linux kernel: BUG: soft lockup detected on CPU#0!
Jul  9 10:44:45 linux kernel: [<c0451ea2>] softlockup_tick+0xa5/0xb4
Jul  9 10:44:45 linux kernel: [<c042e930>] update_process_times+0x3b/0x5e
Jul  9 10:44:45 linux kernel: [<c043d298>] tick_sched_timer+0x57/0x9a
Jul  9 10:44:45 linux kernel: [<c0439df5>] hrtimer_interrupt+0x12b/0x1b6
Jul  9 10:44:45 linux kernel: [<c043d241>] tick_sched_timer+0x0/0x9a
Jul  9 10:44:45 linux kernel: [<c0408534>] timer_interrupt+0x2c/0x32
Jul  9 10:44:45 linux kernel: [<c045210e>] handle_IRQ_event+0x1a/0x3f
Jul  9 10:44:45 linux kernel: [<c045354e>] handle_level_irq+0x81/0xc7
Jul  9 10:44:45 linux kernel: [<c04534cd>] handle_level_irq+0x0/0xc7
Jul  9 10:44:45 linux kernel: [<c04072bb>] do_IRQ+0xac/0xd1
Jul  9 10:44:45 linux kernel: [<c04058ff>] common_interrupt+0x23/0x28
Jul  9 10:44:45 linux kernel: [<c042b2dc>] __do_softirq+0x54/0xba
Jul  9 10:44:45 linux kernel: [<c04071b7>] do_softirq+0x59/0xb1
Jul  9 10:44:45 linux kernel: [<c04534cd>] handle_level_irq+0x0/0xc7
Jul  9 10:44:45 linux kernel: [<c042b194>] irq_exit+0x38/0x6b
Jul  9 10:44:45 linux kernel: [<c04072cc>] do_IRQ+0xbd/0xd1
Jul  9 10:44:45 linux kernel: [<c04058ff>] common_interrupt+0x23/0x28
Jul  9 10:44:45 linux kernel: [<c04200d8>] find_busiest_group+0x264/0x4c5
Jul  9 10:44:45 linux kernel: [<c0601895>] _spin_unlock_irqrestore+0x8/0x9
Jul  9 10:44:45 linux kernel: [<c042e863>] __mod_timer+0xa1/0xab
Jul  9 10:44:45 linux kernel: [<f8a4e1ec>] rtl8169_open+0x12e/0x194 [r8169]
Jul  9 10:44:45 linux kernel: [<c05a3054>] dev_open+0x2b/0x62
Jul  9 10:44:45 linux kernel: [<c05a1aa1>] dev_change_flags+0x47/0xe4
Jul  9 10:44:45 linux kernel: [<c05de45c>] devinet_ioctl+0x250/0x56a
Jul  9 10:44:45 linux kernel: [<c04e72c0>] copy_to_user+0x3c/0x50
Jul  9 10:44:45 linux kernel: [<c0598b47>] sock_ioctl+0x19f/0x1be
Jul  9 10:44:45 linux kernel: [<c05989a8>] sock_ioctl+0x0/0x1be
Jul  9 10:44:45 linux kernel: [<c047f713>] do_ioctl+0x1f/0x62
Jul  9 10:44:45 linux kernel: [<c047f99a>] vfs_ioctl+0x244/0x256
Jul  9 10:44:45 linux kernel: [<c047f9f8>] sys_ioctl+0x4c/0x64
Jul  9 10:44:45 linux kernel: [<c0404f70>] syscall_call+0x7/0xb
Jul  9 10:44:45 linux kernel: =======================
Jul  9 10:44:45 linux kernel: r8169: eth0: link up


lspci -v:
02:0a.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit
Ethernet (rev 10)
        Subsystem: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet
        Flags: bus master, 66MHz, medium devsel, latency 32, IRQ 5
        I/O ports at b800 [size=256]
        Memory at f2800000 (32-bit, non-prefetchable) [size=256]
        [virtual] Expansion ROM at f3e00000 [disabled] [size=128K]
        Capabilities: [dc] Power Management version 2

lspci -vn:
02:0a.0 0200: 10ec:8169 (rev 10)
        Subsystem: 10ec:8169
        Flags: bus master, 66MHz, medium devsel, latency 32, IRQ 5
        I/O ports at b800 [size=256]
        Memory at f2800000 (32-bit, non-prefetchable) [size=256]
        [virtual] Expansion ROM at f3e00000 [disabled] [size=128K]
        Capabilities: [dc] Power Management version 2


ethtool eth0:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: pumbg
        Wake-on: g
        Current message level: 0x00000033 (51)
        Link detected: yes


mii-tool eth0:
eth0: negotiated 100baseTx-FD flow-control, link ok


Please tell me if you need any additional information.

Comment 54 David 2007-07-11 11:42:17 UTC
When will this ever get fixed - much later rather than sooner :(

I just installed kernel 3255 from updates-testing.  It fixed my SANE and Canon
scanner (another bug), but the damn eth hangup bug is STILL there :(


Comment 55 Mike McGuire 2007-07-18 01:21:17 UTC
Has this bug been ignored??  The bug is still present in the 2.6.22 kernel that
is currently in "updates-testing"

Comment 56 Eric Kerby 2007-07-18 01:30:45 UTC
I certainly hope it hasn't been ignored.  I just had to deal with bringing my
servers back up after a power outage.  Of course, I ran into issues from this
card again.

Comment 57 Francois Romieu 2007-07-18 07:21:14 UTC
webmaster:
> When will this ever get fixed - much later rather than sooner :(
>
> I just installed kernel 3255 from updates-testing.  It fixed my SANE and Canon
> scanner (another bug), but the damn eth hangup bug is STILL there :(

Is the bug still there if you add a 'clocksource=pit' option to the
kernel boot command line?

-- 
Ueimor


Comment 58 Thomas Müller 2007-07-18 08:26:02 UTC
Yes, it still hangs with 'clocksource=pit'

Comment 59 Andy Gospodarek 2007-07-18 15:18:28 UTC
Are all of you that are noticing problems using SMP or UP systems?  I'm starting
to wonder if this could be an issue with uni-processor kernels/systems.

Comment 60 Thomas Müller 2007-07-18 15:26:32 UTC
I am using a uniprocessor system.

Comment 61 Andy Gospodarek 2007-07-18 17:40:51 UTC
Success!  I found a 32-bit UP system and managed to reproduce this with an F7
LiveCD.  The system hung when trying to start NetworkManager and when I
connected the cable on the r8169 mentioned in comment #9 it continued to boot. 

I'll do an install and see what I can figure out!

Comment 62 Andy Gospodarek 2007-07-18 19:42:09 UTC
Reproduced this on an installed system.  

Looks like the trick might be a 32-bit UP system with an smp kernel (the default
f7 kernel is smp).  A UP kernel might die too, but we don't build one of those
anymore, so I can't say if that makes a difference.

I did notice that it doesn't *always* hang while booting, but it seems to happen
most of the time.  Seeing soft-lockups like that makes me wonder if some code
was added recently that works well on true SMP systems but not on UP ones....



Comment 63 Mike McGuire 2007-07-18 22:10:22 UTC
Uniprocessor system here also......

Comment 64 David 2007-07-18 22:15:24 UTC
Yes I agree both my machines that do this are single threaded Pentium 4 2.6 and
2.8 processor machines.

Comment 65 Mike McGuire 2007-07-18 22:17:14 UTC
But with this FC6 kernel it works fine.....

2.6.20-1.2952.fc6 #1 SMP

Comment 66 David 2007-07-18 22:39:14 UTC
Agree there is something not quite right with F7 kernels as regards this.  I
also have another 2 Pentium 4 (one 2.6 and other 2.8) single threaded machines
with same exact ethernet card and these have NEVER had the bug.

It's got to be some timing issue that depends on your motherboard BIOS and PCI
address space.

I say this as I have bugs 242391 and 247913 that are to do with PCI graphics
card detection problems on one machine with F7, but never with an FC6 kernel either.

Maybe they are somehow related?

Comment 67 Thomas Müller 2007-07-19 18:17:15 UTC
No one mentioned it before, so I just want to add, that this problem is not
limited to fedora kernels.
I have the same problem with vanilla kernels.

Comment 68 Andy Gospodarek 2007-07-19 18:53:01 UTC
Good to know, Thomas.

Comment 69 Francois Romieu 2007-07-19 21:30:27 UTC
Thomas Müller (thomas):
[...]
> I have the same problem with vanilla kernels.

Thomas, can you bisect the problem or do all kernels fail?

-- 
Ueimor

Comment 70 Thomas Müller 2007-07-20 10:15:52 UTC
(In reply to comment #69)
> Thomas, can you bisect the problem or do all kernels fail?

vanilla 2.6.20 works fine, vanilla 2.6.21 hangs.
I'll try to bisect it

Comment 71 Gregory Haase 2007-07-20 12:12:03 UTC
Do we really think this is a kernel issue?  Once you're booted up, everything
runs fine. You can ifup/ifdown all you want, and 'service network restart' works
fine as well. It would seem to suggest that both drivers and kernel are fine. 
It seems more likely that we have a problem with how something in the boot
process is interacting with the kernel.

Also, has anyone done a search to see if this is affecting any other
distribution between these two kernels?

Comment 72 David 2007-07-20 12:20:41 UTC
How can it not be a kernel issue?  Force on any fc6 kernel and constantly boot
the machine over and over all day long and it works perfectly.

If I change only the kernel and nothing else, and then it works, explain how it's not
the kernel?

Comment 73 David 2007-07-20 12:22:53 UTC
Also I forgot to mention: if you have the server at a remote location, how
do you start and stop the network when it's frozen during bootup, requiring the
ethernet cable to be physically removed and reinserted?

This has happened twice so I left it on an FC6 kernel, but this is no long-term
solution - fixing the kernel is the only solution.

Comment 74 Thomas Müller 2007-07-20 12:25:36 UTC
(In reply to comment #71)
*shrug*
2.6.20 works, 2.6.21 doesn't. Exactly the same userspace tools.

Even if Fedora is the only distribution unlucky enough to trigger this, I think
it's something that has to be fixed within the kernel.

Comment 75 Andy Gospodarek 2007-07-20 18:50:04 UTC
(In reply to comment #70)
> (In reply to comment #69)
> > Thomas, can you bisect the problem or do all kernels fail?
> 
> vanilla 2.6.20 works fine, vanilla 2.6.21 hangs.
> I'll try to bisect it

I just did a 2.6.21 build from Linus' tree and it works fine for me....

Comment 76 Thomas Müller 2007-07-20 20:11:59 UTC
(In reply to comment #75)
> I just did a 2.6.21 build from Linus' tree and it works fine for me....

That's very strange... I tested 2.6.21.6, 2.6.22.1 and 2.6.22-git8 and every one
failed.

Did you use the default configuration shipped with the vanilla kernel or did you
use the fedora one?

I started bisecting and the first version git suggested (somewhere between
2.6.20 and 2.6.21) also failed.

I'm not at home for the weekend, but I will retest everything when I get home on
sunday and continue bisecting if no one has a new idea by then. :)

Comment 77 Andy Gospodarek 2007-07-20 20:40:03 UTC
I used the fedora one and then, when doing a 'make oldconfig', just held down
<ENTER>  (I know, I know, brilliant).

The stable trees failing is interesting since you can't ever get exactly those
fixes out of Linus's trees.  My last build had some flavor of 2.6.22-rc2
working, so I wonder if the change got introduced after that and fed into
2.6.21-stable sometime before 2.6.21.6 (but that seems unlikely).

I'll check out my config a little more and see what might be different.

Comment 78 CHIKAMA Masaki 2007-07-21 18:18:50 UTC
I have the same problem.
My hardware is

00:0a.3 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit
Ethernet (rev 10)
        Subsystem: Melco Inc Unknown device 0237
        Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 11
        I/O ports at 8800 [size=256]
        Memory at c5800000 (32-bit, non-prefetchable) [size=256]
        [virtual] Expansion ROM at 20000000 [disabled] [size=128K]
        Capabilities: <access denied>

It works fine with kernel-2.6.20-1.2962.fc6.
It doesn't work with 2.6.21-1.3228.fc7, 2.6.22.1-27.fc7, or 2.6.23-0.41.rc0.git14.fc8.
The rawhide kernel sometimes works but sometimes doesn't.
Besides the normal startup process, I also tested by booting into runlevel 1
and running "/etc/init.d/network start".

I tried the driver mentioned in comment #14 and it seems to work for now.
The working environment is kernel-2.6.22.1-27.fc7 + r8169-6.002.00 driver.
This driver needs some small modification to compile on kernel 2.6.22.

Comment 79 Thomas Müller 2007-07-22 16:44:39 UTC
Created attachment 159748 [details]
configuration used to bisect

Comment 80 Thomas Müller 2007-07-22 16:46:59 UTC
Comment on attachment 159748 [details]
configuration used to bisect

I'm back home, retested everything and finished bisecting the kernel.

I started with the configuration shipped with the fedora kernel
2.6.21-1.3228.fc7 but deactivated some drivers to speed things up.

According to git-bisect the first bad commit is
a304e1b82808904c561b7b149b467e338c53fcce
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a304e1b82808904c561b7b149b467e338c53fcce

Before this commit the kernel works fine and never hangs during multiple
reboots. With this commit it hangs almost every time I reboot.

The commit adds a new configuration option "Debug shared IRQ handlers"
(CONFIG_DEBUG_SHIRQ) that is set to 'y' in the Fedora configuration. If I set
this to 'n', the kernel also works fine.


Andy: Did you build the vanilla 2.6.21 with a 2.6.20 fedora configuration? If
yes, maybe CONFIG_DEBUG_SHIRQ was deactivated by default, so you didn't trigger
this


output of git-bisect:

a304e1b82808904c561b7b149b467e338c53fcce is first bad commit
commit a304e1b82808904c561b7b149b467e338c53fcce
Author: David Woodhouse <dwmw2>
Date:	Mon Feb 12 00:52:00 2007 -0800

    [PATCH] Debug shared irqs

    Drivers registering IRQ handlers with SA_SHIRQ really ought to be able to
    handle an interrupt happening before request_irq() returns.  They also
    ought to be able to handle an interrupt happening during the start of their
    call to free_irq().  Let's test that hypothesis....

    [bunk: Kconfig fixes]
    Signed-off-by: David Woodhouse <dwmw2>
    Cc: Arjan van de Ven <arjan>
    Signed-off-by: Jesper Juhl <jesper.juhl>
    Signed-off-by: Ingo Molnar <mingo>
    Signed-off-by: Adrian Bunk <bunk>
    Signed-off-by: Andrew Morton <akpm>
    Signed-off-by: Linus Torvalds <torvalds>

:040000 040000 21b0f48e3c9aa1d1880b8cfcc2509fa37ab16ff0
57bfbda5aa3dc95f31216ce99e2222c23e4adced M	kernel
:040000 040000 35b70986b0b39798d573fec0b73b9035d0e85739
1839fb518a55d408690014fcd43c6629bc2b9a50 M	lib
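
The commit message above says a shared handler must cope with an interrupt arriving before request_irq() returns. A minimal userspace sketch of that idea follows. It is a model, not kernel code: every name in it (model_request_irq, model_handler, struct model_dev, the debug_shirq flag) is hypothetical and only illustrates why the order of initialization and IRQ registration matters.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only: every name here is hypothetical, not the kernel API. */
#define MODEL_IRQF_SHARED 0x1u

typedef int (*model_handler_t)(int irq, void *dev_id);

struct model_dev {
    int ready;        /* 1 once the driver's private data is initialized */
    int calls;        /* total handler invocations */
    int early_calls;  /* invocations that arrived while ready == 0 */
};

/* Stand-in for a driver interrupt handler; it only records when it ran. */
static int model_handler(int irq, void *dev_id)
{
    struct model_dev *dev = dev_id;

    (void)irq;
    assert(dev != NULL);
    dev->calls++;
    if (!dev->ready)
        dev->early_calls++;
    return 0;
}

/*
 * Stand-in for request_irq(). With debug_shirq set, a shared handler is
 * invoked once before the call returns, as the commit message describes.
 */
static int model_request_irq(int irq, model_handler_t handler,
                             unsigned int flags, void *dev_id, int debug_shirq)
{
    if (debug_shirq && (flags & MODEL_IRQF_SHARED))
        handler(irq, dev_id);
    return 0;
}

/* Open path that requests the IRQ before initialization is complete. */
static int open_irq_first(struct model_dev *dev, int debug_shirq)
{
    dev->ready = dev->calls = dev->early_calls = 0;
    model_request_irq(5, model_handler, MODEL_IRQF_SHARED, dev, debug_shirq);
    dev->ready = 1;
    return dev->early_calls;
}

/* Open path that finishes initialization first, then requests the IRQ. */
static int open_init_first(struct model_dev *dev, int debug_shirq)
{
    dev->ready = dev->calls = dev->early_calls = 0;
    dev->ready = 1;
    model_request_irq(5, model_handler, MODEL_IRQF_SHARED, dev, debug_shirq);
    return dev->early_calls;
}
```

In this model only open_irq_first() with the debug flag set sees the handler run against uninitialized state; the other combinations report zero early calls. Whether the real r8169 open path has this ordering problem is not established by the sketch.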

Comment 81 Andy Gospodarek 2007-07-23 12:54:23 UTC
Nice find, Thomas!  That patch would certainly explain the softirq lockup
messages that appear in comment #7 where there seem to be constant IRQs.  

I'll take a look at my config when I get to work today and let you know, but I
bet I didn't add CONFIG_DEBUG_SHIRQ=y.  

I'll also take a look at this patch and see what we can do to resolve the issue.

Comment 82 Andy Gospodarek 2007-07-23 15:30:58 UTC
CONFIG_DEBUG_SHIRQ was not set in my kernel builds....

Comment 83 CHIKAMA Masaki 2007-07-23 16:37:53 UTC
(In reply to comment #80)

I also confirm that kernel-2.6.22.1-27.fc7 + CONFIG_DEBUG_SHIRQ=n works fine.
I checked 10 times both cold boot (power on) and warm boot (shutdown -r).

I also checked kernel-2.6.22.1-27.fc7 + revert

commit 99f252b097a3bd6280047ba2175b605671da4a23
Author: Francois Romieu <romieu.com>
Date:   Mon Apr 2 22:59:59 2007 +0200

    r8169: issue request_irq after the private data are completely initialized
 
, but that doesn't help. Sometimes it boots fine but sometimes it doesn't.
I would happily test any patches. My hardware is a UP system.


By the way, the patch mentioned in comment #80 seems strange.
CONFIG_DEBUG_SHIRQ=y has no effect in the function "free_irq".
Is this intentional behavior?
I think every "return;" in the "for(;;)" loop should be "break;".

Should I make a patch and send it upstream?

Comment 84 Andy Gospodarek 2007-07-23 17:34:22 UTC
(In reply to comment #83)
> 
> 
> By the way, the patch mentioned in comment #80 seems strange.
> CONFIG_DEBUG_SHIRQ=y has no effect in the function "free_irq".
> Is this intentional behavior?
> I think every "return;" in the "for(;;)" loop should be "break;".
> 
> Should I make a patch and send it upstream?

actually it does have an effect on free_irq:

@@ -403,6 +406,17 @@ void free_irq(unsigned int irq, void *dev_id)
                spin_unlock_irqrestore(&desc->lock, flags);
                return;
        }
+#ifdef CONFIG_DEBUG_SHIRQ
+       if (handler) {
+               /*
+                * It's a shared IRQ -- the driver ought to be prepared for it
+                * to happen even now it's being freed, so let's make sure....
+                * We do this after actually deregistering it, to make sure that
+                * a 'real' IRQ doesn't run in parallel with our fake
+                */
+               handler(irq, dev_id);
+       }
+#endif
 }
 EXPORT_SYMBOL(free_irq);



Comment 85 CHIKAMA Masaki 2007-07-23 17:47:08 UTC
(In reply to comment #84)

> actually it does have an effect on free_irq:
What?

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=kernel/irq/manage.c;h=203a518b6f1437d134c115ce7d43e1e04e1d5c25;hb=HEAD

void free_irq(unsigned int irq, void *dev_id)
{
        struct irq_desc *desc;
        struct irqaction **p;
        unsigned long flags;
        irqreturn_t (*handler)(int, void *) = NULL;

        WARN_ON(in_interrupt());
        if (irq >= NR_IRQS)
                return;

        desc = irq_desc + irq;
        spin_lock_irqsave(&desc->lock, flags);
        p = &desc->action;
        for (;;) {
...
                spin_unlock_irqrestore(&desc->lock, flags);
                return;
        }
#ifdef CONFIG_DEBUG_SHIRQ
        if (handler) {
                /*
                 * It's a shared IRQ -- the driver ought to be prepared for it
                 * to happen even now it's being freed, so let's make sure....
                 * We do this after actually deregistering it, to make sure that
                 * a 'real' IRQ doesn't run in parallel with our fake
                 */
                handler(irq, dev_id);
        }
#endif
}

How can I break this "for(;;)" loop?


Comment 86 Andy Gospodarek 2007-07-23 17:58:10 UTC
Yes, you are correct.  Sorry I misunderstood your first statement.  

A patch like this should probably be submitted:

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 203a518..03edf45 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -448,11 +448,11 @@ void free_irq(unsigned int irq, void *dev_id)
                        if (action->flags & IRQF_SHARED)
                                handler = action->handler;
                        kfree(action);
-                       return;
+                       break;
                }
                printk(KERN_ERR "Trying to free already-free IRQ %d\n", irq);
                spin_unlock_irqrestore(&desc->lock, flags);
-               return;
+               break;
        }
 #ifdef CONFIG_DEBUG_SHIRQ
        if (handler) {
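
To see why the choice between "return;" and "break;" matters, the loop shape can be modeled in userspace. This is only a sketch under stated assumptions: struct model_action, model_free_irq() and debug_hook_runs are hypothetical names, and the counter merely stands in for the CONFIG_DEBUG_SHIRQ block that follows the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only: these names are hypothetical, not kernel code. */
struct model_action {
    int dev_id;
    int shared;
    struct model_action *next;
};

static int debug_hook_runs;  /* how often the post-loop debug step executed */

/*
 * Unlink the action matching dev_id. use_break selects how the loop exits:
 * 0 mimics "return;", 1 mimics the proposed "break;".
 */
static void model_free_irq(struct model_action **head, int dev_id, int use_break)
{
    struct model_action **p = head;
    int shared = 0;

    assert(head != NULL);
    for (;;) {
        struct model_action *action = *p;

        if (action) {
            if (action->dev_id != dev_id) {
                p = &action->next;
                continue;
            }
            *p = action->next;       /* found: unlink it */
            shared = action->shared;
            if (use_break)
                break;
            return;
        }
        /* reached the end of the list without a match */
        if (use_break)
            break;
        return;
    }
    /* Stand-in for the CONFIG_DEBUG_SHIRQ block after the loop. */
    if (shared)
        debug_hook_runs++;
}
```

With use_break set to 0 every exit leaves the function directly, so the step after the loop never runs, which matches the observation in comment 85. With use_break set to 1 the step runs once a shared action has been removed.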


Comment 87 Thomas Müller 2007-07-23 18:07:20 UTC
(In reply to comment #82)
> CONFIG_DEBUG_SHIRQ was not set in my kernel builds....
Ok, now everything seems to make sense.

It looks like the r8169 driver does have a problem with interrupts happening
before request_irq() returns and that's exactly what the patch tries to test.

Comment 88 CHIKAMA Masaki 2007-07-23 18:18:10 UTC
(In reply to comment #86)
> Yes, you are correct.  Sorry I misunderstood your first statement.  
Thanks. I'm sure my English is bad.

> A patch like this should be probably be submitted:
Yes. But this patch introduces a totally untested code path and
it will break many drivers, as in this bug.
So I think it's dangerous to send it upstream.
(Or it may be a good time to send it, because the merge window for
2.6.23 has just closed.)

Comment 89 Thomas Müller 2007-07-23 18:23:58 UTC
(In reply to comment #88)
> Yes. But this patch introduces a totally untested code path and
> it will break many drivers, as in this bug.
> So I think it's dangerous to send it upstream.

Nevertheless, I think you should send this to LKML or directly to David
Woodhouse <dwmw2> as he seems to be the original author of this
patch. I'm pretty sure he never intended to just unconditionally skip his own
code. :)
Even if they don't want to change it now, they should know about this.

Comment 90 Andy Gospodarek 2007-07-23 18:34:48 UTC
After looking at this, it seemed there was a chance that rtl8169_interrupt might
not like getting run if there weren't any interrupts to service.  Causing a link
change is just the thing that makes an interrupt pop.  Whenever
CONFIG_DEBUG_SHIRQ=y the 'handler' (in this case rtl8169_interrupt) is always
called once before an actual interrupt is available.  It seems our system hangs
until one is ready, so I decided to add the following hack to see how
successfully the system could be booted if this call was basically ignored.

diff --git a/drivers/net/r8169.c b/drivers/net/r8169.c
index bb6896a..a864194 100644
--- a/drivers/net/r8169.c
+++ b/drivers/net/r8169.c
@@ -2723,6 +2723,9 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
        int status;
        int handled = 0;
 
+       if (!in_atomic())
+               return IRQ_RETVAL(1);
+               
        do {
                status = RTL_R16(IntrStatus);
 

It seemed to perform well, so now it seems one of 2 things needs to happen:

1) rtl8169_interrupt (and other driver interrupt handlers) need(s) to be
modified more cleanly to not block if there are no interrupts to handle.

2) the 'test' call to the handler in request_irq needs to be removed.
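
Option 1 can be illustrated with a generic userspace sketch of a shared-IRQ handler that does nothing when no event is pending. This is not the r8169 code and not a proposed fix: struct model_nic, model_interrupt() and the MODEL_IRQ_* values are hypothetical, and the assumption that acknowledging an event clears its status bit is part of the model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace model only: hypothetical names, not the r8169 driver. */
enum model_irqreturn { MODEL_IRQ_NONE = 0, MODEL_IRQ_HANDLED = 1 };

struct model_nic {
    uint16_t intr_status;     /* pending event bits (modeled register) */
    uint16_t intr_mask;       /* events this driver cares about */
    unsigned int work_scheduled;
};

/*
 * Handler that tolerates being called with nothing to do, which is what a
 * fake or foreign interrupt on a shared line looks like.
 */
static enum model_irqreturn model_interrupt(struct model_nic *nic)
{
    uint16_t status;

    assert(nic != NULL);
    status = nic->intr_status & nic->intr_mask;
    if (status == 0)
        return MODEL_IRQ_NONE;              /* not ours: touch nothing */

    nic->intr_status &= (uint16_t)~status;  /* acknowledge what we saw */
    nic->work_scheduled++;                  /* e.g. hand off to polling */
    return MODEL_IRQ_HANDLED;
}
```

The point of the early return is that a call with no pending status leaves all state unchanged, so an extra call from a debug option cannot start work that nothing will finish.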


Comment 91 CHIKAMA Masaki 2007-07-23 18:44:35 UTC
(In reply to comment #89)
I will. But it's midnight and I can't access my mail server right now.
I can do it 7 hours from now, so I'd be happy if someone kindly told
David Woodhouse about this. If no one does, I'll do it.

Comment 92 Andy Gospodarek 2007-07-23 19:07:47 UTC
Masaki, I'll talk to him about it.

Comment 93 Thomas Müller 2007-07-23 19:09:59 UTC
(In reply to comment #90)
> It seemed to perform well, so now it seems one of 2 things needs to happen:
> 
> 1) rtl8169_interrupt (and other driver interrupt handlers) need(s) to be
> modified more cleanly to not block if there are no interrupts to handle.
> 
> 2) the 'test' call to the handler in request_irq needs to be removed.
> 

I think we might want to notify LKML about this.
Problems with this patch seem to be rare, but we triggered one and maybe it's
only a matter of time until someone else has the same problem with a totally
different driver.

Until this is finally solved, Fedora could deactivate CONFIG_DEBUG_SHIRQ in its
default configuration. It shouldn't break anything and I would finally be able
to reboot without having to unplug and replug cables ;)

Comment 94 Andy Gospodarek 2007-07-23 19:23:21 UTC
I've talked with the f7 kernel maintainer and I think we can unset
CONFIG_DEBUG_SHIRQ for the next f7 build.

Comment 95 Francois Romieu 2007-07-23 22:33:20 UTC
agospoda:
[...]
> It seemed to perform well, so now it seems one of 2 things needs to happen:
> 1) rtl8169_interrupt (and other driver interrupt handlers) need(s) to be
> modified more cleanly to not block if there are no interrupts to handle.

1. So far I fail to see where rtl8169_interrupt could block. I'll try harder.
2. Without CONFIG_DEBUG_SHIRQ, IRQ handlers are run with irq enabled and are
   non-reentrant. With CONFIG_DEBUG_SHIRQ, irq are disabled and the usual
   protection against reentrant handler is not there. Imho it is anything
   but safe.

-- 
Ueimor

Comment 96 Brandon Thomas 2007-07-24 15:32:02 UTC
(In reply to comment #62)
> Reproduced this on an installed system.  
> 
> Looks like the trick might be a 32-bit UP system with an smp kernel (the default
> f7 kernel is smp).  A UP kernel might die too, but we don't build one of those
> anymore, so I can't say if that makes a difference.
> 
> I did notice that it doesn't *always* hang while booting, but it seems to happen
> most of the time.  Seeing soft-lockups like that makes me wonder if some code
> was added recently that works well on true SMP systems but not on UP ones....
> 
> 

Hi,
I've been pondering this freeze since F7 first came out and I upgraded to it.  I
had disabled hyperthreading on my laptop for some reason or another, and upon
reading this and UP systems, I decided I'd give it a shot and try re-enabling
hyperthreading.  No more lock ups!  This might be a little-too-late
response...but it's feedback nonetheless :-)

Comment 97 CHIKAMA Masaki 2007-07-24 16:08:35 UTC
(In reply to comment #92)
> Masaki, I'll talk to him about it.
Thanks Andy.

(In reply to comment #90)
> It seems our system hangs
> until one is ready, so I decided to add the following hack to see how
> successfully the system could be booted if this call was basically ignored.
I tested this patch (2.6.22.1-27.fc7 + the patch + CONFIG_DEBUG_SHIRQ=y)
and it works fine.

Comment 98 Andy Gospodarek 2007-07-24 16:44:16 UTC
This should be fixed with 2.6.22.1-33. 

Comment 99 CHIKAMA Masaki 2007-07-25 01:21:15 UTC
I noticed that Ingo has been tracking an IRQ problem with an ne2k NIC on LKML.
See the following thread.

Re: 2.6.20->2.6.21 - networking dies after random time
 http://www.ussg.iu.edu/hypermail/linux/kernel/0707.3/0057.html

I wonder whether this may give some hints for our bug.

Comment 100 CHIKAMA Masaki 2007-07-25 04:25:54 UTC
(In reply to comment #95)

> 2. Without CONFIG_DEBUG_SHIRQ, IRQ handlers are run with irq enabled and are
>    non-reentrant. With CONFIG_DEBUG_SHIRQ, irq are disabled and the usual
>    protection against reentrant handler is not there. Imho it is anything
>    but safe.

Even with CONFIG_DEBUG_SHIRQ, I think that IRQs are not disabled.
In "request_irq()",

#ifdef CONFIG_LOCKDEP
        /*
         * Lockdep wants atomic interrupt handlers:
         */
        irqflags |= IRQF_DISABLED;
#endif

but CONFIG_LOCKDEP isn't defined anywhere. 
We have only CONFIG_LOCKDEP_SUPPORT. 
I think all CONFIG_LOCKDEP in the kernel tree are wrong.

Or am I missing something?

Comment 101 CHIKAMA Masaki 2007-07-25 04:45:07 UTC
Oh, I'm wrong. Please ignore.  Sorry.

Comment 102 CHIKAMA Masaki 2007-07-25 15:17:16 UTC
How about this workaround?
This patch clears IntrStatus and discards any pending interrupt.
It comes from an LKML post.
 http://www.uwsg.iu.edu/hypermail/linux/kernel/0707.3/0635.html

+++ src/r8169.c 2007-07-25 22:08:48.000000000 +0900
@@ -1737,6 +1737,7 @@
 {
        struct rtl8169_private *tp = netdev_priv(dev);
        struct pci_dev *pdev = tp->pci_dev;
+       void __iomem *ioaddr = tp->mmio_addr;
        int retval = -ENOMEM;
 
 
@@ -1764,6 +1765,8 @@
 
        smp_mb();
 
+       RTL_W16(IntrStatus, 0xffff);
+   
        retval = request_irq(dev->irq, rtl8169_interrupt, IRQF_SHARED,
                             dev->name, dev);
        if (retval < 0)

I think rtl8169_interrupt() should not be called before rtl8169_open() is
finished. 
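
For readers unfamiliar with this kind of register write, here is a small userspace model of what writing 0xffff to a status register is meant to achieve. It assumes write-1-to-clear semantics, where each 1 bit written clears the matching pending bit; that assumption, and the names struct model_regs and model_write_intr_status(), are illustrative and not taken from the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace model only: hypothetical names, assumed write-1-to-clear register. */
struct model_regs {
    uint16_t intr_status;  /* one bit per pending event */
};

/* Each 1 bit in val clears the corresponding pending bit; 0 bits are ignored. */
static void model_write_intr_status(struct model_regs *regs, uint16_t val)
{
    assert(regs != NULL);
    regs->intr_status &= (uint16_t)~val;
}
```

Under that assumption, writing 0xffff discards every pending event at once, while writing back only the bits just read acknowledges those events and leaves later ones pending.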

Comment 103 Francois Romieu 2007-07-25 21:32:08 UTC
Masaki Chikama:
> How about this workaround ?
[...]

Useless. See rtl8169_irq_mask_and_ack() in rtl8169_init_one.

#53 contains a nice trace. The answer is there. It is not exactly fun to 
read due to the gazillion debug options, which change the code like
hell. :o/

-- 
Ueimor

Comment 104 Mike McGuire 2007-07-25 23:01:19 UTC
This should be fixed with 2.6.22.1-33.

Andy, I just checked this build in updates-testing and can confirm that it works:
10 reboots, both with the cable plugged in and with it unplugged.  Thanks to everyone
that helped on this....

Comment 105 Andy Gospodarek 2007-07-26 13:08:10 UTC
Good news, Mike.  I still plan to keep this open because I'd like to enable
CONFIG_DEBUG_SHIRQ when the r8169 problem is fixed.



Comment 106 Ville Skyttä 2007-07-27 14:47:28 UTC
2.6.22.1-33.fc7 introduces this (or a similar) problem for me, no problems
noticed with any earlier F7 kernels, I've run them all including some testing
updates.  

The lockup is not hard, I can reboot with Ctrl+Alt+Del from the state where it
gets stuck (determining IP info for eth0).  Booting with cable unplugged goes
through, but ifup eth0 doesn't work after booting either, hangs while getting IP
address from DHCP.  DHCPOFFER is found in /var/log/messages, but after that
nothing happens, nothing logged.

This is the same card as Nils reported in comment 16, i686, NM not involved,
very minimal setup.  Happens on every boot with 2.6.22.1-33.fc7, reverting to
2.6.22.1-27.fc7 fixes it.

Comment 107 Ville Skyttä 2007-07-28 13:30:30 UTC
Hm, 33.fc7 introduces the same problem also for a different, older, 8139too
driven card, so maybe what I'm seeing is something else.

Comment 108 Francois Romieu 2007-07-28 13:49:09 UTC
Ville Skyttä :
> Hm, 33.fc7 introduces the same problem also for a different, older, 8139too
> driven card, so maybe what I'm seeing is something else.

Can you trigger the bug after boot-up with the r8169 and, while the network
is locked, post the result of:

$ echo d > /proc/sysrq-trigger
$ echo t > /proc/sysrq-trigger
$ echo q > /proc/sysrq-trigger
$ echo w > /proc/sysrq-trigger

(of course it will be nicer if there are not too many processes)

-- 
Ueimor

Comment 109 Ville Skyttä 2007-07-28 14:58:39 UTC
Created attachment 160162 [details]
Requested info

Comment 110 Francois Romieu 2007-07-28 19:50:37 UTC
Created attachment 160173 [details]
debug helper

Could someone who experiences the bug on a UP system (no HT please) try
the attached patch with an UP built kernel and send the resulting dmesg ?

Thanks in advance.

-- 
Ueimor

Comment 111 Thomas Müller 2007-07-29 11:46:58 UTC
Created attachment 160185 [details]
dmesg output of patched r8169 driver

I applied the patch to a kernel freshly pulled from the official git repository.

SMP is deactivated, CONFIG_DEBUG_SHIRQ is set to 'y'. System hangs on boot.

Comment 112 Francois Romieu 2007-07-29 14:47:12 UTC
Thomas, the hang happens between the up/down change in:
[...]
r8169: eth0: status = 00000020
r8169: eth0: link up
r8169: eth0: link up
r8169: eth0: link down

Right ?

-- 
Ueimor

Comment 113 Thomas Müller 2007-07-29 15:24:50 UTC
(In reply to comment #112)
> Thomas, the hang happens between the up/down change in:

I think so, yes.
It's just after iptables got initialized and before all other network interfaces
are brought up.

Comment 114 Paul Osmialowski 2007-07-30 07:14:37 UTC
Disaster, guys! After I recently upgraded the kernel from 2.6.21-1.3228.fc7 to
2.6.22.1-33.fc7 I cannot use my 01:00.0 Ethernet controller: Realtek
Semiconductor Co., Ltd. RTL8101E PCI Express Fast Ethernet controller (rev 01).
Fortunately, I can revert to the old kernel. The effect is that it hangs totally
while trying to get an IP address from DHCP (the DHCP server log shows that it
obtained the address successfully!). The problem is probably as you've mentioned
above: it hangs during initialization.
My host is: Asus notebook F9F Core 2 Duo 1GB RAM.


Comment 115 Franky Van Liedekerke 2007-07-30 11:45:06 UTC
Hi,

I have the exact same problem as comment 106 describes: dhcp doesn't work for
the r8169 driver in 2.6.22.1-33.fc7, but it works for the older kernel
2.6.22.1-27.fc7. lspci output:

05:05.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit
Ethernet (rev 10)
        Subsystem: Fujitsu Siemens Computer GmbH Unknown device 10B0
        Flags: bus master, 66MHz, medium devsel, latency 32, IRQ 20
        I/O ports at 3000 [size=256]
        Memory at b0304800 (32-bit, non-prefetchable) [size=256]
        [virtual] Expansion ROM at 88000000 [disabled] [size=128K]
        Capabilities: [dc] Power Management version 2

(a fixed IP works just fine)

Franky

Comment 116 CHIKAMA Masaki 2007-07-30 12:25:21 UTC
(In reply to comment #110)
> Created an attachment (id=160173) [edit]
> debug helper
> 
> Could someone who experiences the bug on a UP system (no HT please) try
> the attached patch with an UP built kernel and send the resulting dmesg ?

I tested with 2.6.22.1-27.fc7 + latest r8169.c from linux-2.6.git + the patch.
It never boots with runlevel 5.
So I first booted into single user mode, then ran "rmmod r8169" and
"/etc/init.d/network start". I got a slightly different result from comment #111.

r8169: eth0: status = 00000020
r8169: eth0: link down
r8169: eth0: link up
r8169: eth0: link up


Comment 117 Thomas Müller 2007-07-30 17:10:39 UTC
(In reply to comment #116)
> I get a bit different result from comment #111.
> 
> r8169: eth0: status = 00000020
> r8169: eth0: link down
> r8169: eth0: link up
> r8169: eth0: link up
> 

Did you perhaps boot with the cable unplugged and then replug it when the
system hung?

Comment 118 Ville Skyttä 2007-07-30 20:07:58 UTC
2.6.22.1-41.fc7 from updates-testing fixes the issue for me.

Comment 119 Francois Romieu 2007-07-30 21:26:03 UTC
Created attachment 160277 [details]
More debug helper

Can someone apply the attached patch on top of the previous one and send
the updated dmesg ?

Testing with CONFIG_PRINTK_TIME enabled would be welcome.

-- 
Ueimor

Comment 120 Thomas Müller 2007-07-30 22:39:00 UTC
Created attachment 160280 [details]
dmesg output of patched r8169 driver (2)

I applied the patch to the same kernel I used last time. Config is the same,
except that CONFIG_PRINTK_TIME is activated.

System never hung during several reboots. Is this a possible effect from your
patch or should I try harder to get the system to lock up?

The important lines from dmesg are probably the following:
[   37.110992] r8169: eth0: status = 00000020
[   37.111000] r8169: eth0: link up
[   37.111003] r8169: eth0: in your hands ... (00000020)
[   37.111024] r8169: eth0: link up

Comment 121 CHIKAMA Masaki 2007-07-30 23:39:29 UTC
(In reply to comment #117)
> Did you perhaps boot with the cable unplugged and then replug it when the
> system hung?
No. 
1. Boot single user mode with the cable plugged.
2. rmmod r8169 (because the module is already loaded by udev(?)).
3. Run "/etc/init.d/network start" (module is auto-loaded and starts
to talk to DHCP server)

The likelihood of a hang is higher when I don't run "rmmod r8169".

Comment 122 Jouni Väliaho 2007-07-31 07:15:41 UTC
I found a patch that seems to fix the problem with my r8169.
http://lkml.org/lkml/2007/6/28/326
They say that the patch will be included in kernel 2.6.23.


Comment 123 Francois Romieu 2007-07-31 07:18:05 UTC
Thomas Müller:
[...]
> System never hung during several reboots. Is this a possible effect from your
> patch or should I try harder to get the system to lock up?

It is a possible/expected effect of the patch. It is not clear why the
scheduling of the NAPI call (soft-)locks the system but I should have
made it more dependent on the status register :o/

I'll polish the patch.
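
(Illustration of "more dependent on the status register": a minimal user-space sketch, not the actual patch. The bit values and every `ex_`/`EX_` name are assumptions made for the example; in particular, 0x0020 is assumed to be the link-change bit, which would match the "status = 00000020" lines in the dmesg output above.)

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative interrupt status bits; the values are assumed for this sketch. */
enum {
    EX_RX_OK        = 0x0001,
    EX_TX_OK        = 0x0004,
    EX_TX_ERR       = 0x0008,
    EX_RX_OVERFLOW  = 0x0010,
    EX_LINK_CHG     = 0x0020,
    EX_RX_FIFO_OVER = 0x0040,
};

/* Events that actually need the poll routine to run (assumed set). */
static const uint16_t ex_napi_event =
    EX_RX_OK | EX_TX_OK | EX_TX_ERR | EX_RX_OVERFLOW | EX_RX_FIFO_OVER;

/*
 * Schedule the poll only when the status word carries a poll-worthy event.
 * A pure link-change interrupt (EX_LINK_CHG alone) does not qualify.
 */
static bool ex_should_schedule_poll(uint16_t status)
{
    return (status & ex_napi_event) != 0;
}
```

With these assumed values, a status of 0x0020 would not schedule the poll, while 0x0021 (link change plus a received packet) would.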

-- 
Ueimor

Comment 124 Francois Romieu 2007-07-31 07:21:17 UTC
Jouni Valiaho:
[...]
> I found a patch that seems to fix the problem with my r8169.
> http://lkml.org/lkml/2007/6/28/326
> They say that the patch will be included in kernel 2.6.23.

2.6.23-rc1 ought to behave as described in the message from lkml.
It addresses different issues though.

-- 
Ueimor

Comment 125 Francois Romieu 2007-08-01 22:21:40 UTC
Created attachment 160474 [details]
avoid useless NAPI poll scheduling

The attached patch should be enough to fix the broken kernels.
I'll welcome people to test it.

The patch should be usable against your favorite FC kernel or
against 2.6.22 or later kernels.

-- 
Ueimor

Comment 126 Thomas Müller 2007-08-02 09:17:06 UTC
Created attachment 160507 [details]
modified patch for 2.6.22

I can confirm that 2.6.22 and the current git kernel work fine with your patch
(or some version of it) applied.

However, the patch does not cleanly apply to 2.6.22 because a variable was
renamed/moved. The attached patch should do exactly the same as yours, but can
be applied to 2.6.22.

Comment 127 Paul Osmialowski 2007-08-03 08:27:41 UTC
Regarding my previous complaint:
I've updated to kernel-2.6.22.1-41.fc7 and my RTL8101E PCI Express Fast Ethernet
controller (rev 01) works again.


Comment 128 CHIKAMA Masaki 2007-08-03 14:55:19 UTC
(In reply to comment #125)
> Created an attachment (id=160474) [edit]
> avoid useless NAPI poll scheduling

I tested with 2.6.22.1-27.fc7 + latest r8169.c from linux-2.6.git + the patch.
It works fine so far.
Is this the final patch to be merged upstream?
Thank you for resolving this problem.

Comment 129 Francois Romieu 2007-08-03 19:27:32 UTC
Masaki Chikama:
> I tested with 2.6.22.1-27.fc7 + latest r8169.c from linux-2.6.git + the patch.

(just to be sure)
Did you use the same .config as was used with the non-working kernel ?

> Is this the final patch to be merged upstream ?

If it really works, yes. :o)

I would not mind someone helping me to pinpoint the reason of the problem
with further testing though.

-- 
Ueimor




Comment 130 CHIKAMA Masaki 2007-08-04 04:13:32 UTC
(In reply to comment #129)
> Masaki Chikama:
> > I tested with 2.6.22.1-27.fc7 + latest r8169.c from linux-2.6.git + the patch.
> 
> (just to be sure)
> Did you use the same .config as was used with the non-working kernel ?

Yes. I only added CONFIG_PRINTK_TIME=y for previous test.
CONFIG_DEBUG_SHIRQ=y is still there.

Comment 131 Thomas Müller 2007-08-04 09:04:34 UTC
(In reply to comment #129)
> (just to be sure)
> Did you use the same .config as was used with the non-working kernel ?
> 

Yes.
I compiled the unpatched kernel and tried to boot it --> System hang.
Then I patched the r8169.c and recompiled without touching the config or
anything else --> System worked fine.


> I would not mind someone helping me to pinpoint the reason of the problem
> with further testing though.
Just tell me what you need. I have no problem rebooting my system, it's not an
important server or anything like that. :)

Comment 132 Micke 2007-08-07 10:37:46 UTC
I recently bought a 8169 and suffered from the startup hangs. However, since a
few kernel updates back I have not experienced the freeze. I think all the last
2-3 startups with kernel-2.6.22.1-41.fc7 have gone without a problem.

Related or not; about an hour ago the server froze (pointer, network,
everything) after being up 1 day. I unplugged the network cable and reinserted
it: system unfroze and is still running. Nothing in the log, just "r8169:
eth1: link down" and "r8169: eth1: link up".

Sorry for the noise.


Comment 133 Chuck Ebbert 2007-08-07 21:39:08 UTC
(In reply to comment #129)
> 
> If it really works, yes. :o)
> 
> I would not mind someone helping me to pinpoint the reason of the problem
> with further testing though.

Are you going to submit something for -stable after a patch is merged?


Comment 134 Francois Romieu 2007-08-07 22:07:24 UTC
Chuck Ebbert :
[...]
> Are you going to submit something for -stable after a patch is merged ?

I have not thought about it so far. Is there a strong demand for it ?

-- 
Ueimor




Comment 135 Andreas Adamis 2007-08-10 00:30:10 UTC
For us who have the problem, yes :D
I've been waiting for months to set up my file server because of this bug :P


Comment 136 Mike McGuire 2007-08-10 21:04:12 UTC
This bug is still persistent in the F8 Test 1 kernel also.....

Comment 137 Francois Romieu 2007-08-10 21:11:04 UTC
Created attachment 161084 [details]
avoid useless NAPI poll scheduling (against 2.6.22.2)

I have diffed/updated the patch against 2.6.22.2. Can the interested
parties check that it is ok before I submit it for inclusion in -stable ?

Thanks in advance.

-- 
Ueimor

Comment 138 Chuck Ebbert 2007-08-10 21:44:40 UTC
(In reply to comment #136)
> This bug is still persistent in the F8 Test 1 kernel also.....

It's fixed upstream, so will be fixed in the next F8 kernel.

Comment 139 Thomas Müller 2007-08-11 07:48:32 UTC
(In reply to comment #137)
> I have diffed/updated the patch against 2.6.22.2. Can the interested
> parties check that it is ok before I submit it for inclusion in -stable ?

It's similar to the modified patch I already tested and posted earlier in
comment #126, however one line changed:

Instead of
  RTL_W16(IntrMask, rtl8169_intr_mask & ~rtl8169_napi_event);
you now call
  RTL_W16(IntrMask, rtl8169_napi_event & ~rtl8169_napi_event);
which (if I'm not mistaken) is equal to
  RTL_W16(IntrMask, 0x00000000);

Was that intentional?
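
(A quick check of the arithmetic, as a sketch: for any value x, x & ~x is 0, so the second form masks every interrupt. The two constants are illustrative values assumed for the example, not copied from the driver, and `ex_mask_without` is a hypothetical helper.)

```c
#include <stdint.h>

/* Illustrative masks; the concrete values are assumptions for this sketch. */
#define EX_INTR_MASK  ((uint16_t)0x807f)  /* all events the driver listens to */
#define EX_NAPI_EVENT ((uint16_t)0x005d)  /* subset handled by the poll routine */

/* Returns `mask` with every bit of `napi_bits` cleared. */
static uint16_t ex_mask_without(uint16_t mask, uint16_t napi_bits)
{
    return mask & (uint16_t)~napi_bits;
}
```

Here ex_mask_without(EX_INTR_MASK, EX_NAPI_EVENT) keeps the non-NAPI bits (0x8022 with these values), whereas ex_mask_without(EX_NAPI_EVENT, EX_NAPI_EVENT) is 0 whatever the constant is, which is the difference between the two RTL_W16() calls quoted above.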

Comment 140 Francois Romieu 2007-08-11 08:05:50 UTC
Thomas Müller <thomas> :
[...]
> Was that intentional?

No.

I'll push your patch to 2.6.22-stable.

-- 
Ueimor

Comment 141 Chuck Ebbert 2007-08-13 15:50:37 UTC
Wasn't this also fixed in 2.6.22.1-41 by disabling shared IRQ debugging in the
kernel config?

Comment 142 Francois Romieu 2007-08-14 22:33:50 UTC
Chuck Ebbert :
> Wasn't this also fixed in 2.6.22.1-41 by disabling shared IRQ debugging in the
> kernel config ?

Yes. I understood that Andy would prefer a fix which does not exclude this
config option though (see comment #105).

-- 
Ueimor

Comment 143 Andy Gospodarek 2007-08-14 23:42:03 UTC
Francois, you are correct.  I'd like us to enable that config option again as
soon as we feel the r8169 driver is safe.

Comment 144 Christopher Brown 2007-09-17 13:35:55 UTC
*** Bug 245367 has been marked as a duplicate of this bug. ***

Comment 145 Andy Gospodarek 2007-10-05 21:28:38 UTC
Is this still a problem with the latest f7 kernels?  Francois' patch should be
upstream and included in these kernels:

commit 313b0305b5a1e7e0fb39383befbf79558ce68a9c
Author: Francois Romieu <romieu.com>
Date:   Thu Aug 2 00:00:48 2007 +0200

    r8169: avoid needless NAPI poll scheduling

    Theory  : though needless, it should not have hurt.
    Practice: it does not play nice with DEBUG_SHIRQ + LOCKDEP + UP
    (see https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=242572).

    The patch makes sense in itself but I should dig why it has an effect
    on #242572 (assuming that NAPI do not change in a near future).

    Signed-off-by: Francois Romieu <romieu.com>
    Cc: Edward Hsu <edward_hsu.tw>


Comment 146 Christopher Brown 2008-01-09 01:05:35 UTC
Hello,

I'm reviewing this bug as part of the kernel bug triage project, an attempt to
isolate current bugs in the Fedora kernel.

http://fedoraproject.org/wiki/KernelBugTriage

I am CC'ing myself to this bug and will try and assist you in resolving it if I can.

There hasn't been much activity on this bug for a while. Could you tell me if
you are still having problems with the latest kernel?

If the problem no longer exists then please close this bug or I'll do so in a
few days if there is no additional information lodged.

Comment 147 Christopher Brown 2008-02-16 01:55:36 UTC
Closing as per previous comment indicating this should now be resolved.